# Projeto 2 - Ciência dos Dados

## Integrantes:
* Gabriela Kimi
* Luiza Ehrenberger
* Pedro Barão
* Rafael Paolino

## Introdução:

## Objetivo: fazer uso de métodos de regressão para prever o preço de carros, usando suas características como base.

Para isso, fizemos uso de uma base de dados que leva em conta as características de um carro e o preço sugerido pelo fabricante. Alguns atributos utilizados para estimar o preço são: 

* marca
* modelo
* ano de fabricação
* tipo de motor

<a href= " https://www.kaggle.com/CooperUnion/cardataset " > Link para a base de dados "Car Features and MSRP"</a>

### Importando as bibiotecas necessárias: 

In [1024]:
%matplotlib notebook

import pandas as pd
import numpy as np
from scipy.stats import norm, probplot
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from sklearn.tree import DecisionTreeRegressor 
from sklearn.preprocessing import OneHotEncoder
from IPython.display import display
import seaborn as sns

### Função de regressão linear:

#### Y: coluna do DataFrame utilizada como variável resposta. (TARGET)
#### X: coluna(s) do DataFrame utilizada(s) como variável(is) explicativas. (FEATURES)

In [1025]:
def regress(Y,X):
    X_cp = sm.add_constant(X)
    model = sm.OLS(Y,X_cp)
    results = model.fit()
    
    return results

### DataFrame da base de dados

In [1026]:
data = pd.read_csv("data.csv")

In [1027]:
data.head(3)

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350


## Mineirando Dados e Características do Dataset

### Colunas que serão utilizadas: 

In [1028]:
data=data.drop('Popularity',axis=1)
data.columns

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
       'highway MPG', 'city mpg', 'MSRP'],
      dtype='object')

### Descrevendo as variáveis que serão utilizadas:

* Make - Marca do carro 
* Modelo - Modelo do carro 
* Year - Ano de lançamento do carro 
* Engine Fuel Type - Tipo de combustível utilizado 
* Engine HP - Potência do motor (HP)
* Engine Cylinders - Número de cilindros do motor 
* Transmission Type - Tipo de transmissão (Câmbio)
* Driven wheels - Tipo de tração
* Number of doors - Número de portas 
* Market Category - Categorias usadas no mercado 
* Vehicle Size - Classificação do tamanho do veículo 
* Vehicle Style - Classificação de estilo do veículo
* Highway MPG - Média de rendimento de gasolina na auto-estrada (milhas por galão)
* City MPG - Média de rendimento de gasolina na cidade (milhas por galão)
* MSRP - preço de varejo recomendado pelo fabricante do carro (dólares)

## Remoção de células sem valores

In [1029]:
data=data.dropna()  
data.isnull().sum()

Make                 0
Model                0
Year                 0
Engine Fuel Type     0
Engine HP            0
Engine Cylinders     0
Transmission Type    0
Driven_Wheels        0
Number of Doors      0
Market Category      0
Vehicle Size         0
Vehicle Style        0
highway MPG          0
city mpg             0
MSRP                 0
dtype: int64

### Codificação de colunas categóricas para integer e hot encode:


In [1030]:
data=data.drop(['Make','Model','Vehicle Style','Market Category'],axis=1)
data_2=pd.get_dummies(data)

## Criação de DataFrames de treinamento e de teste:

In [1031]:
train = data.sample(7200)

In [1032]:
train= data_2.sample(frac=0.9)

test=data_2.drop(train.index)

In [1033]:
train.columns

Index(['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors',
       'highway MPG', 'city mpg', 'MSRP', 'Engine Fuel Type_diesel',
       'Engine Fuel Type_electric',
       'Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)',
       'Engine Fuel Type_flex-fuel (premium unleaded required/E85)',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_premium unleaded (required)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL', 'Transmission Type_AUTOMATIC',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize'],
      dtype='object')

### Plotagem de gráficos scatter

In [1034]:
#sns.pairplot(train_2)

## Modelos de Predição

### Explicar o que caea modelo de predição faz e explicar como funciona a biblioteca escolhida

### Regressão linear MMQ:

In [1035]:
X=train.drop(['MSRP'],axis=1)
Y=train[['MSRP']]

results=regress(Y,X)
results.summary()

  x = pd.concat(x[::order], 1)


0,1,2,3
Dep. Variable:,MSRP,R-squared:,0.513
Model:,OLS,Adj. R-squared:,0.512
Method:,Least Squares,F-statistic:,347.8
Date:,"Mon, 29 Nov 2021",Prob (F-statistic):,0.0
Time:,18:15:06,Log-Likelihood:,-89031.0
No. Observations:,7276,AIC:,178100.0
Df Residuals:,7253,BIC:,178300.0
Df Model:,22,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.254e+05,1.35e+05,-0.928,0.353,-3.9e+05,1.4e+05
Year,45.8704,129.659,0.354,0.724,-208.300,300.040
Engine HP,286.2689,12.530,22.846,0.000,261.706,310.832
Engine Cylinders,1.372e+04,691.877,19.828,0.000,1.24e+04,1.51e+04
Number of Doors,-502.5436,841.118,-0.597,0.550,-2151.380,1146.293
highway MPG,-18.1230,138.289,-0.131,0.896,-289.211,252.965
city mpg,1291.6108,183.808,7.027,0.000,931.293,1651.929
Engine Fuel Type_diesel,731.3560,1.87e+04,0.039,0.969,-3.59e+04,3.74e+04
Engine Fuel Type_electric,-5.474e+04,3.64e+04,-1.502,0.133,-1.26e+05,1.67e+04

0,1,2,3
Omnibus:,13930.692,Durbin-Watson:,1.993
Prob(Omnibus):,0.0,Jarque-Bera (JB):,42869248.181
Skew:,14.641,Prob(JB):,0.0
Kurtosis:,377.897,Cond. No.,5.66e+16


### Remoção de colunas com valor p > 10%:

In [1036]:
X=X.drop(['Year','Number of Doors','highway MPG','Engine Fuel Type_diesel','Engine Fuel Type_electric','Engine Fuel Type_electric','Engine Fuel Type_electric','Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)','Transmission Type_AUTOMATIC','Engine Fuel Type_premium unleaded (required)','Engine Fuel Type_flex-fuel (premium unleaded required/E85)'],axis=1)
Y=train[['MSRP']]



## Regressão Linear

In [1037]:
results=regress(Y,X)
results.summary()

  x = pd.concat(x[::order], 1)


0,1,2,3
Dep. Variable:,MSRP,R-squared:,0.513
Model:,OLS,Adj. R-squared:,0.512
Method:,Least Squares,F-statistic:,510.0
Date:,"Mon, 29 Nov 2021",Prob (F-statistic):,0.0
Time:,18:15:06,Log-Likelihood:,-89033.0
No. Observations:,7276,AIC:,178100.0
Df Residuals:,7260,BIC:,178200.0
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-7.796e+04,3362.676,-23.183,0.000,-8.45e+04,-7.14e+04
Engine HP,287.9734,10.493,27.443,0.000,267.403,308.544
Engine Cylinders,1.367e+04,643.471,21.241,0.000,1.24e+04,1.49e+04
city mpg,1251.8917,138.438,9.043,0.000,980.513,1523.270
Engine Fuel Type_flex-fuel (unleaded/E85),-2.427e+04,2477.031,-9.799,0.000,-2.91e+04,-1.94e+04
Engine Fuel Type_premium unleaded (recommended),-1.788e+04,1920.965,-9.310,0.000,-2.17e+04,-1.41e+04
Engine Fuel Type_regular unleaded,-1.631e+04,1798.157,-9.071,0.000,-1.98e+04,-1.28e+04
Transmission Type_AUTOMATED_MANUAL,2.217e+04,2488.943,8.906,0.000,1.73e+04,2.7e+04
Transmission Type_DIRECT_DRIVE,-5.139e+04,1.72e+04,-2.983,0.003,-8.52e+04,-1.76e+04

0,1,2,3
Omnibus:,13919.664,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,42666557.464
Skew:,14.616,Prob(JB):,0.0
Kurtosis:,377.008,Cond. No.,2.62e+18


## Regressão em árvore:

In [1038]:
X=test.drop(['MSRP','Year','Number of Doors','highway MPG','Engine Fuel Type_diesel','Engine Fuel Type_electric','Engine Fuel Type_electric','Engine Fuel Type_electric','Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)','Transmission Type_AUTOMATIC','Engine Fuel Type_premium unleaded (required)','Engine Fuel Type_flex-fuel (premium unleaded required/E85)'],axis=1)
Y=test['MSRP']

In [1039]:
regressor = DecisionTreeRegressor(random_state = 0) 

regressor.fit(X, Y)

DecisionTreeRegressor(random_state=0)

## Testes:

In [1040]:
tree_pred = regressor.predict(X)

linear_y=results.predict(test.drop(['Year','Number of Doors','highway MPG','Engine Fuel Type_diesel','Engine Fuel Type_electric','Engine Fuel Type_electric','Engine Fuel Type_electric','Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)','Transmission Type_AUTOMATIC','Engine Fuel Type_premium unleaded (required)','Engine Fuel Type_flex-fuel (premium unleaded required/E85)'],axis=1))

tree_pred=pd.Series(tree_pred)

df=pd.DataFrame({'MSRP(test)':test['MSRP'],'Preço(tree)':tree_pred,'Razão(tree)':tree_pred/test['MSRP'],'Preço':linear_y,'Razão(linear)':linear_y/test['MSRP']})

In [1041]:
df=df.dropna()
df.describe()

Unnamed: 0,MSRP(test),Preço(tree),Razão(tree),Preço,Razão(linear)
count,65.0,65.0,65.0,65.0,65.0
mean,53519.969231,61542.717949,8.051828,46670320.0,-585.906974
std,87257.370808,76530.744759,19.349862,101901200.0,2480.480542
min,2000.0,2000.0,0.041537,-19091280.0,-7257.560711
25%,7600.0,25330.0,0.567502,-6743511.0,-1641.955555
50%,38680.0,38100.0,1.0,27717420.0,742.042029
75%,50200.0,63060.0,5.15214,40742900.0,850.94199
max,455500.0,418950.0,100.452117,521124000.0,1144.070293
