# Project 2 - Data Science

## Team members:
* Gabriela Kimi
* Luiza Ehrenberger
* Pedro Barão
* Rafael Paolino

## Introduction:

## Objective: use regression methods to predict car prices from the cars' characteristics.

To do this, we used a dataset that records the characteristics of each car together with the price suggested by the manufacturer (MSRP). Some of the attributes available for estimating the price are:

* make
* model
* year of manufacture
* engine type

<a href="https://www.kaggle.com/CooperUnion/cardataset">Link to the "Car Features and MSRP" dataset</a>

### Importing the required libraries:

In [106]:
%matplotlib notebook

import pandas as pd
import numpy as np
from scipy.stats import norm, probplot
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from sklearn.tree import DecisionTreeRegressor 
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from IPython.display import display
import seaborn as sns

## Linear regression function:

#### Y: DataFrame column used as the response variable (TARGET).
#### X: DataFrame column(s) used as the explanatory variable(s) (FEATURES).

In [107]:
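def regress(Y,X):
    """Fit an OLS model of Y on X (with an intercept) and return the fitted results."""
    X_cp = sm.add_constant(X)  # add the intercept column
    model = sm.OLS(Y,X_cp)
    results = model.fit()

    return results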
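
To make explicit what `regress` computes, here is a minimal sketch of the same least-squares fit written with NumPy only. The data are synthetic and the variable names are assumptions chosen for illustration; the column of ones plays the role of the constant that `sm.add_constant` adds.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3*x exactly, with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Prepend a column of ones so the model can estimate an intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ beta ~ y.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
print(intercept, slope)
```

On noise-free data the estimated intercept and slope recover the values used to generate `y`, up to floating-point error.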

### Loading the dataset into a DataFrame

In [108]:
data = pd.read_csv("data.csv")

In [109]:
data.head(3)

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |


## Data Mining and Dataset Characteristics

### Columns that will be used:

In [110]:
data=data.drop('Popularity',axis=1)
data.columns

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
       'highway MPG', 'city mpg', 'MSRP'],
      dtype='object')

## Describing the variables that will be used:

### Features (explanatory variables):
* Make - Car brand
* Model - Car model
* Year - Year the car was released
* Engine Fuel Type - Type of fuel used
* Engine HP - Engine power (HP)
* Engine Cylinders - Number of engine cylinders
* Transmission Type - Type of transmission (gearbox)
* Driven_Wheels - Type of drivetrain
* Number of Doors - Number of doors
* Market Category - Categories used in the market
* Vehicle Size - Vehicle size classification
* Vehicle Style - Vehicle style classification
* highway MPG - Average fuel economy on the highway (miles per gallon)
* city mpg - Average fuel economy in the city (miles per gallon)

### Target (variable to be predicted):
* MSRP - manufacturer's suggested retail price of the car (US dollars)

## Removing rows with missing values

In [111]:
data=data.dropna()  
data.isnull().sum()

Make                 0
Model                0
Year                 0
Engine Fuel Type     0
Engine HP            0
Engine Cylinders     0
Transmission Type    0
Driven_Wheels        0
Number of Doors      0
Market Category      0
Vehicle Size         0
Vehicle Style        0
highway MPG          0
city mpg             0
MSRP                 0
dtype: int64
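
In the cell above, `isnull().sum()` runs after `dropna()`, so it confirms that no missing values remain. Counting the missing values before dropping can also show which columns cost the most rows. The sketch below illustrates this on a small made-up DataFrame; the values and the `toy` name are assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two incomplete rows.
toy = pd.DataFrame({
    "Engine HP": [335.0, np.nan, 300.0, 210.0],
    "Number of Doors": [2.0, 4.0, np.nan, 4.0],
    "MSRP": [46135, 40650, 36350, 29450],
})

# Missing values per column, counted before any row is removed.
missing_per_column = toy.isnull().sum()

# dropna() removes every row that has at least one missing value.
rows_before = len(toy)
clean = toy.dropna()
rows_dropped = rows_before - len(clean)

print(missing_per_column)
print(rows_dropped)
```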

## Encoding the categorical columns (one-hot encoding):


In [112]:
data=data.drop(['Make','Model','Vehicle Style','Market Category'],axis=1) # Drop columns with too many distinct categories
data_2=pd.get_dummies(data)
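
As an illustration of what `pd.get_dummies` produces, the sketch below encodes a small made-up frame. It also shows the `drop_first=True` option, which omits one dummy per categorical column. That is a common way to avoid dummy columns that sum to the regression constant, and it may be relevant to the very large condition numbers reported in the regression summaries further down. The `toy` frame and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "Engine HP": [335.0, 300.0, 210.0],
    "Vehicle Size": ["Compact", "Midsize", "Large"],
})

# One dummy column per category; numeric columns pass through unchanged.
full = pd.get_dummies(toy)

# Same encoding, but the first category of each column is left out.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the full encoding, the dummy columns of one variable add up to 1 in every row, which duplicates the information carried by an intercept column.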

## Creating the training and test DataFrames:

In [113]:
train=data_2.sample(frac=0.9,random_state=0)

test=data_2.drop(train.index)
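
The split above relies on `sample` to pick 90% of the rows and on `drop(train.index)` to keep the rest. The sketch below checks, on a made-up frame, that the two parts are disjoint and together cover every row. It assumes the index labels are unique; the `toy` frame is hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical) with 20 rows and a unique default index.
toy = pd.DataFrame({"MSRP": range(100, 120)})

# Same pattern as the notebook: sample 90% for training, keep the rest for testing.
train = toy.sample(frac=0.9, random_state=0)
test = toy.drop(train.index)

# Index labels shared by both parts (expected to be none).
overlap = train.index.intersection(test.index)

print(len(train), len(test), len(overlap))
```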

In [114]:
train.columns

Index(['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors',
       'highway MPG', 'city mpg', 'MSRP', 'Engine Fuel Type_diesel',
       'Engine Fuel Type_electric',
       'Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)',
       'Engine Fuel Type_flex-fuel (premium unleaded required/E85)',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_premium unleaded (required)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL', 'Transmission Type_AUTOMATIC',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize'],
      dtype='object')

### Scatter plots

In [115]:
#sns.pairplot(train)

## Prediction Models

### Explain what each prediction model does and how the chosen library works

### Linear regression (OLS):

In [116]:
X=train.drop(['MSRP'],axis=1)
Y=train[['MSRP']]

results=regress(Y,X)
results.summary()



| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | MSRP | R-squared: | 0.54 |
| Model: | OLS | Adj. R-squared: | 0.538 |
| Method: | Least Squares | F-statistic: | 386.5 |
| Date: | Tue, 30 Nov 2021 | Prob (F-statistic): | 0.0 |
| Time: | 13:53:32 | Log-Likelihood: | -88380.0 |
| No. Observations: | 7276 | AIC: | 176800.0 |
| Df Residuals: | 7253 | BIC: | 177000.0 |
| Df Model: | 22 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2.432e+05 | 1.23e+05 | -1.979 | 0.048 | -4.84e+05 | -2271.421 |
| Year | 163.0812 | 117.902 | 1.383 | 0.167 | -68.040 | 394.202 |
| Engine HP | 270.2741 | 11.449 | 23.607 | 0.000 | 247.831 | 292.717 |
| Engine Cylinders | 1.331e+04 | 632.669 | 21.046 | 0.000 | 1.21e+04 | 1.46e+04 |
| Number of Doors | -1118.9402 | 767.863 | -1.457 | 0.145 | -2624.174 | 386.294 |
| highway MPG | -31.7046 | 126.324 | -0.251 | 0.802 | -279.337 | 215.928 |
| city mpg | 1266.3160 | 169.276 | 7.481 | 0.000 | 934.487 | 1598.145 |
| Engine Fuel Type_diesel | -1.626e+04 | 1.7e+04 | -0.955 | 0.340 | -4.96e+04 | 1.71e+04 |
| Engine Fuel Type_electric | -6.866e+04 | 3.34e+04 | -2.054 | 0.040 | -1.34e+05 | -3142.879 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 14013.228 | Durbin-Watson: | 2.012 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 55711502.785 |
| Skew: | 14.712 | Prob(JB): | 0.0 |
| Kurtosis: | 430.667 | Cond. No. | 7.57e+16 |


### Removing columns with p-value > 10%:

In [117]:
X=X.drop(['Year','Number of Doors','highway MPG','Engine Fuel Type_diesel','Transmission Type_UNKNOWN','Engine Fuel Type_electric','Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)','Transmission Type_AUTOMATIC','Engine Fuel Type_premium unleaded (required)','Engine Fuel Type_flex-fuel (premium unleaded required/E85)'],axis=1)
Y=train[['MSRP']]



## Linear regression (reduced set of columns)

In [118]:
results=regress(Y,X)
results.summary()



| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | MSRP | R-squared: | 0.539 |
| Model: | OLS | Adj. R-squared: | 0.538 |
| Method: | Least Squares | F-statistic: | 606.3 |
| Date: | Tue, 30 Nov 2021 | Prob (F-statistic): | 0.0 |
| Time: | 13:53:33 | Log-Likelihood: | -88385.0 |
| No. Observations: | 7276 | AIC: | 176800.0 |
| Df Residuals: | 7261 | BIC: | 176900.0 |
| Df Model: | 14 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -7.426e+04 | 3088.922 | -24.039 | 0.000 | -8.03e+04 | -6.82e+04 |
| Engine HP | 279.7097 | 9.557 | 29.266 | 0.000 | 260.974 | 298.445 |
| Engine Cylinders | 1.309e+04 | 588.908 | 22.222 | 0.000 | 1.19e+04 | 1.42e+04 |
| city mpg | 1264.8061 | 127.511 | 9.919 | 0.000 | 1014.847 | 1514.765 |
| Engine Fuel Type_flex-fuel (unleaded/E85) | -2.44e+04 | 2267.887 | -10.759 | 0.000 | -2.88e+04 | -2e+04 |
| Engine Fuel Type_premium unleaded (recommended) | -1.783e+04 | 1749.535 | -10.193 | 0.000 | -2.13e+04 | -1.44e+04 |
| Engine Fuel Type_regular unleaded | -1.691e+04 | 1642.268 | -10.298 | 0.000 | -2.01e+04 | -1.37e+04 |
| Transmission Type_AUTOMATED_MANUAL | 1.943e+04 | 2277.781 | 8.530 | 0.000 | 1.5e+04 | 2.39e+04 |
| Transmission Type_DIRECT_DRIVE | -5.464e+04 | 1.6e+04 | -3.425 | 0.001 | -8.59e+04 | -2.34e+04 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 13955.601 | Durbin-Watson: | 2.012 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 54418923.501 |
| Skew: | 14.579 | Prob(JB): | 0.0 |
| Kurtosis: | 425.672 | Cond. No. | 3.51e+18 |


## Decision tree regression:

In [119]:
# Features and target for the tree, both built from the test split
X=test.drop(['MSRP','Year','Number of Doors','highway MPG','Engine Fuel Type_diesel','Engine Fuel Type_electric','Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)','Transmission Type_AUTOMATIC','Engine Fuel Type_premium unleaded (required)','Engine Fuel Type_flex-fuel (premium unleaded required/E85)'],axis=1)
Y=test['MSRP']

In [120]:
regressor = DecisionTreeRegressor(random_state = 0) 

regressor.fit(X, Y)

DecisionTreeRegressor(random_state=0)
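
The cells above fit the tree on `X` and `Y` taken from `test`. A common alternative is to fit on the training split and score on the held-out split, so that the error reflects data the model has not seen. The sketch below illustrates that pattern on synthetic data; the arrays, the 270/30 split and the `rmse` helper are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): price grows with horsepower, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(100, 500, size=(300, 1))
y = 100.0 * X[:, 0] + rng.normal(scale=2000.0, size=300)

# Hold out the last 30 rows; the tree never sees them during fitting.
X_train, X_test = X[:270], X[270:]
y_train, y_test = y[:270], y[270:]

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse_train = rmse(y_train, tree.predict(X_train))
rmse_test = rmse(y_test, tree.predict(X_test))
print(rmse_train, rmse_test)
```

An unpruned tree can memorize the rows it was fitted on, so the error measured on those same rows tends to be much lower than the error on held-out rows.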

## Tests:

In [125]:
# Predict with the tree on X (built from the test split above)
tree_pred = regressor.predict(X)

# Predict with the linear model on the test split, without the columns removed from the training data
linear_y=results.predict(test.drop(['Year','Number of Doors','Transmission Type_UNKNOWN','highway MPG','Engine Fuel Type_diesel','Engine Fuel Type_electric','Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)','Transmission Type_AUTOMATIC','Engine Fuel Type_premium unleaded (required)','Engine Fuel Type_flex-fuel (premium unleaded required/E85)'],axis=1))

# Wrap the tree predictions in a Series (default 0..n-1 index)
tree_pred=pd.Series(tree_pred)

# squared=False returns the root mean squared error (RMSE) rather than the MSE
MSE_linear=mean_squared_error(test['MSRP'],linear_y,squared=False)
MSE_tree=mean_squared_error(test['MSRP'],tree_pred,squared=False)

# Actual and predicted prices side by side
df=pd.DataFrame({'MSRP(test)':test['MSRP'],'Preço(tree)':tree_pred,'Preço(linear)':linear_y})
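
When a DataFrame is built from several Series, pandas aligns them by index label and not by position. A Series created with `pd.Series(array)` gets a fresh 0..n-1 index, while `test['MSRP']` keeps the labels from the original dataset. The sketch below shows the effect on made-up values and one way to keep the rows aligned by reusing `test.index`; all names and numbers here are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy test split (hypothetical) whose index labels are not 0..n-1.
test = pd.DataFrame({"MSRP": [10.0, 20.0, 30.0]}, index=[5, 9, 12])
pred = np.array([11.0, 19.0, 31.0])

# Default index on the predictions: labels 0, 1, 2 do not match 5, 9, 12.
misaligned = pd.DataFrame({"MSRP": test["MSRP"], "pred": pd.Series(pred)})

# Reusing the test index keeps each prediction next to its own row.
aligned = pd.DataFrame({"MSRP": test["MSRP"], "pred": pd.Series(pred, index=test.index)})

print(misaligned)
print(aligned)
```

In the misaligned frame every row has a missing value in one of the two columns, so a later `dropna()` keeps only rows whose labels happen to exist in both Series.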

In [126]:
df=df.dropna()
df.describe()

| | MSRP(test) | Preço(tree) | Preço(linear) |
|---|---|---|---|
| count | 67.0 | 67.0 | 67.0 |
| mean | 56269.940299 | 46174.425373 | 50925670.0 |
| std | 92118.385769 | 56145.783253 | 109208700.0 |
| min | 2000.0 | 2000.0 | -18535880.0 |
| 25% | 2350.5 | 23927.5 | -7776722.0 |
| 50% | 33450.0 | 35332.5 | 22255010.0 |
| 75% | 60925.0 | 44251.25 | 49659140.0 |
| max | 506500.0 | 361800.0 | 593844900.0 |


In [129]:
print(MSE_tree)
print(MSE_linear)

3032.178215713612
122771319.07865794
