# Project 2 - Data Science

## Team members:
* Gabriela Kimi
* Luiza Ehrenberger
* Pedro Barão
* Rafael Paolino

## Introduction:

## Objective: use regression methods to predict car prices from their characteristics.

To do this, we used a dataset that records each car's characteristics together with the manufacturer's suggested retail price (MSRP). Some of the attributes used to estimate the price are:

* make
* model
* year of manufacture
* engine type

<a href="https://www.kaggle.com/CooperUnion/cardataset">Link to the "Car Features and MSRP" dataset</a>

### Importing the required libraries:

In [81]:
%matplotlib notebook

import pandas as pd
import numpy as np
from scipy.stats import norm, probplot
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from sklearn.tree import DecisionTreeRegressor 
from sklearn.preprocessing import OneHotEncoder
from IPython.display import display
import seaborn as sns

### Linear regression function:

#### Y: DataFrame column used as the response variable (TARGET).
#### X: DataFrame column(s) used as the explanatory variable(s) (FEATURES).

In [82]:
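def regress(Y, X):
    # Add an intercept column so the model estimates a constant term.
    X_cp = sm.add_constant(X)
    # Ordinary least squares of Y on X (plus the intercept).
    model = sm.OLS(Y, X_cp)
    results = model.fit()

    return results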

### DataFrame for the dataset

In [83]:
data = pd.read_csv("data.csv")

In [84]:
data.head(3)

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |


## Data Mining and Dataset Characteristics

### Columns that will be used:

In [85]:
data.columns

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
      dtype='object')

### Describing the variables that will be used:

### Removing rows with missing values:


In [86]:
data=data.dropna()
data.isnull().sum()

Make                 0
Model                0
Year                 0
Engine Fuel Type     0
Engine HP            0
Engine Cylinders     0
Transmission Type    0
Driven_Wheels        0
Number of Doors      0
Market Category      0
Vehicle Size         0
Vehicle Style        0
highway MPG          0
city mpg             0
Popularity           0
MSRP                 0
dtype: int64

In [87]:
data.iloc[:,0].size

8084

In [88]:
train = data.sample(7200)

In [89]:
train=train[[ 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Vehicle Size',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP']]
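
`data.sample(7200)` is called without a seed, so each run draws a different training set, and the 884 rows left out of the 8084 are never used. A minimal sketch of a reproducible draw that also keeps the remaining rows as a held-out test set follows. The `toy` frame is a hypothetical stand-in for `data`, and the seed value 0 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `data` frame.
toy = pd.DataFrame({"Engine HP": range(100), "MSRP": range(1000, 1100)})

# random_state makes the draw reproducible across runs.
toy_train = toy.sample(n=80, random_state=0)

# Rows that were not sampled form the held-out test set
# (this relies on the index labels being unique).
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
```

The same pattern would apply to `data` with `n=7200`, leaving the other rows available for evaluating the fitted models.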

### One-hot encoding of the categorical columns:


In [90]:
train_2=pd.get_dummies(train)
train_2

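| | Year | Engine HP | Engine Cylinders | Number of Doors | highway MPG | city mpg | Popularity | MSRP | Engine Fuel Type_diesel | Engine Fuel Type_electric | ... | Transmission Type_DIRECT_DRIVE | Transmission Type_MANUAL | Transmission Type_UNKNOWN | Driven_Wheels_all wheel drive | Driven_Wheels_four wheel drive | Driven_Wheels_front wheel drive | Driven_Wheels_rear wheel drive | Vehicle Size_Compact | Vehicle Size_Large | Vehicle Size_Midsize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1407 | 2015 | 270.0 | 6.0 | 4.0 | 32 | 22 | 2009 | 32350 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1531 | 2015 | 321.0 | 6.0 | 4.0 | 28 | 18 | 1624 | 47615 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 7951 | 2015 | 400.0 | 8.0 | 4.0 | 20 | 14 | 190 | 63250 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3229 | 2013 | 318.0 | 6.0 | 4.0 | 26 | 18 | 1624 | 44190 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 9627 | 2017 | 355.0 | 8.0 | 4.0 | 23 | 16 | 1385 | 40375 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3687 | 2015 | 329.0 | 6.0 | 4.0 | 29 | 20 | 617 | 62350 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 9487 | 2012 | 332.0 | 8.0 | 4.0 | 23 | 20 | 1385 | 49620 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 10438 | 2014 | 317.0 | 8.0 | 4.0 | 17 | 12 | 2009 | 37320 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1634 | 2009 | 107.0 | 4.0 | 4.0 | 34 | 27 | 1385 | 14100 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |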
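
`pd.get_dummies` emits one indicator column per category, so the indicators of each categorical variable sum to 1 in every row. Once `regress` adds a constant, those columns are perfectly collinear with the intercept (the dummy-variable trap), which may contribute to the very large condition numbers reported in the OLS summaries of this notebook. A minimal sketch of one common remedy, `drop_first=True`, on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical miniature of the training frame.
toy = pd.DataFrame({
    "Engine HP": [335.0, 300.0, 107.0, 400.0],
    "Vehicle Size": ["Compact", "Compact", "Midsize", "Large"],
})

# One indicator per category (what the notebook does).
full = pd.get_dummies(toy)

# Drop the first category of each categorical column; it becomes the baseline
# that the remaining indicators are measured against.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With this encoding, each remaining coefficient reads as a price difference relative to the dropped category (`Compact` in the toy frame).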


In [91]:
train_2.columns

Index(['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP',
       'Engine Fuel Type_diesel', 'Engine Fuel Type_electric',
       'Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)',
       'Engine Fuel Type_flex-fuel (premium unleaded required/E85)',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_premium unleaded (required)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL', 'Transmission Type_AUTOMATIC',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize'],
      dtype='object')

In [92]:
train_2.describe()

| | Year | Engine HP | Engine Cylinders | Number of Doors | highway MPG | city mpg | Popularity | MSRP | Engine Fuel Type_diesel | Engine Fuel Type_electric | ... | Transmission Type_DIRECT_DRIVE | Transmission Type_MANUAL | Transmission Type_UNKNOWN | Driven_Wheels_all wheel drive | Driven_Wheels_four wheel drive | Driven_Wheels_front wheel drive | Driven_Wheels_rear wheel drive | Vehicle Size_Compact | Vehicle Size_Large | Vehicle Size_Midsize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | ... | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 |
| mean | 2012.070417 | 274.244861 | 5.775556 | 3.418194 | 26.649306 | 19.593333 | 1496.249722 | 49964.8 | 0.019583 | 0.001528 | ... | 0.001806 | 0.208194 | 0.000417 | 0.266389 | 0.077917 | 0.360278 | 0.295417 | 0.377639 | 0.228472 | 0.393889 |
| std | 6.30856 | 114.662594 | 1.889917 | 0.895717 | 6.811937 | 7.17926 | 1411.985904 | 69615.01 | 0.138573 | 0.03906 | ... | 0.042456 | 0.406045 | 0.02041 | 0.442101 | 0.268059 | 0.480114 | 0.456262 | 0.48483 | 0.419878 | 0.488645 |
| min | 1990.0 | 55.0 | 0.0 | 2.0 | 12.0 | 7.0 | 2.0 | 2000.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2010.0 | 190.0 | 4.0 | 2.0 | 23.0 | 16.0 | 535.0 | 25888.75 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2015.0 | 264.0 | 6.0 | 4.0 | 26.0 | 18.0 | 1013.0 | 35000.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2016.0 | 320.0 | 6.0 | 4.0 | 30.0 | 22.0 | 2009.0 | 48688.75 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| max | 2017.0 | 1001.0 | 16.0 | 4.0 | 111.0 | 137.0 | 5657.0 | 2065902.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Scatter plots

In [93]:
#sns.pairplot(train_2)
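
The `sns.pairplot` call is commented out, so the notebook shows no view of how the features relate to the price. As a lightweight numeric stand-in (a different technique from scatter plots, named here plainly), the sketch below ranks features by their Pearson correlation with `MSRP`. The data are synthetic and hypothetical; on the real frame the same expression would be applied to `train_2`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (an assumption): price rises with horsepower,
# fuel economy falls with horsepower.
rng = np.random.default_rng(0)
hp = rng.uniform(100, 500, size=300)
toy = pd.DataFrame({
    "Engine HP": hp,
    "city mpg": 40 - 0.05 * hp + rng.normal(scale=2, size=300),
    "MSRP": 100 * hp + rng.normal(scale=5000, size=300),
})

# Correlation of every feature with the target, strongest positive first.
corr_with_price = toy.corr()["MSRP"].drop("MSRP").sort_values(ascending=False)
print(corr_with_price)
```

Correlation only captures linear association, so it complements rather than replaces the scatter plots.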

### Linear regression (ordinary least squares):

In [94]:
X= train_2[['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors',
       'highway MPG', 'city mpg', 'Popularity','Engine Fuel Type_diesel', 'Engine Fuel Type_electric',
       'Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)',
       'Engine Fuel Type_flex-fuel (premium unleaded required/E85)',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_premium unleaded (required)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL', 'Transmission Type_AUTOMATIC',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize']]
Y=train_2[['MSRP']]

results=regress(Y,X)
results.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | MSRP | R-squared: | 0.532 |
| Model: | OLS | Adj. R-squared: | 0.53 |
| Method: | Least Squares | F-statistic: | 354.4 |
| Date: | Thu, 25 Nov 2021 | Prob (F-statistic): | 0.0 |
| Time: | 15:17:10 | Log-Likelihood: | -87769.0 |
| No. Observations: | 7200 | AIC: | 175600.0 |
| Df Residuals: | 7176 | BIC: | 175800.0 |
| Df Model: | 23 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -3.076e+05 | 1.34e+05 | -2.293 | 0.022 | -5.71e+05 | -4.46e+04 |
| Year | 222.0066 | 129.198 | 1.718 | 0.086 | -31.259 | 475.272 |
| Engine HP | 281.3652 | 12.094 | 23.265 | 0.000 | 257.658 | 305.073 |
| Engine Cylinders | 1.347e+04 | 672.801 | 20.021 | 0.000 | 1.22e+04 | 1.48e+04 |
| Number of Doors | -658.8029 | 814.340 | -0.809 | 0.419 | -2255.150 | 937.544 |
| highway MPG | -239.6652 | 285.694 | -0.839 | 0.402 | -799.709 | 320.379 |
| city mpg | 1562.1564 | 251.262 | 6.217 | 0.000 | 1069.610 | 2054.703 |
| Popularity | -3.2292 | 0.419 | -7.709 | 0.000 | -4.050 | -2.408 |
| Engine Fuel Type_diesel | -2.071e+04 | 1.83e+04 | -1.132 | 0.258 | -5.66e+04 | 1.52e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 13862.877 | Durbin-Watson: | 2.007 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 48493083.325 |
| Skew: | 14.767 | Prob(JB): | 0.0 |
| Kurtosis: | 403.963 | Cond. No. | 5.43e+16 |


### Removing columns with p-value > 10%:

In [95]:
X= train_2[['Engine HP', 'Engine Cylinders','city mpg', 'Popularity',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize']]

Y=train_2[['MSRP']]

results=regress(Y,X)
results.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | MSRP | R-squared: | 0.531 |
| Model: | OLS | Adj. R-squared: | 0.53 |
| Method: | Least Squares | F-statistic: | 508.8 |
| Date: | Thu, 25 Nov 2021 | Prob (F-statistic): | 0.0 |
| Time: | 15:17:10 | Log-Likelihood: | -87773.0 |
| No. Observations: | 7200 | AIC: | 175600.0 |
| Df Residuals: | 7183 | BIC: | 175700.0 |
| Df Model: | 16 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -7.631e+04 | 3227.897 | -23.640 | 0.000 | -8.26e+04 | -7e+04 |
| Engine HP | 290.2244 | 10.141 | 28.619 | 0.000 | 270.345 | 310.104 |
| Engine Cylinders | 1.328e+04 | 617.270 | 21.516 | 0.000 | 1.21e+04 | 1.45e+04 |
| city mpg | 1419.4894 | 133.521 | 10.631 | 0.000 | 1157.749 | 1681.230 |
| Popularity | -3.1423 | 0.416 | -7.550 | 0.000 | -3.958 | -2.326 |
| Engine Fuel Type_flex-fuel (unleaded/E85) | -2.152e+04 | 2417.556 | -8.904 | 0.000 | -2.63e+04 | -1.68e+04 |
| Engine Fuel Type_premium unleaded (recommended) | -1.845e+04 | 1846.629 | -9.992 | 0.000 | -2.21e+04 | -1.48e+04 |
| Engine Fuel Type_regular unleaded | -1.569e+04 | 1729.515 | -9.069 | 0.000 | -1.91e+04 | -1.23e+04 |
| Transmission Type_AUTOMATED_MANUAL | 2.21e+04 | 2411.476 | 9.162 | 0.000 | 1.74e+04 | 2.68e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 13819.44 | Durbin-Watson: | 2.006 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 47594516.602 |
| Skew: | 14.665 | Prob(JB): | 0.0 |
| Kurtosis: | 400.226 | Cond. No. | 2.36e+19 |


## Decision tree regression:

In [96]:
Y=train_2[['MSRP']]

X= train_2[['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors',
       'highway MPG', 'city mpg', 'Popularity','Engine Fuel Type_diesel', 'Engine Fuel Type_electric',
       'Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)',
       'Engine Fuel Type_flex-fuel (premium unleaded required/E85)',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_premium unleaded (required)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL', 'Transmission Type_AUTOMATIC',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize']]

regressor = DecisionTreeRegressor(random_state = 0) 

regressor.fit(X, Y)

DecisionTreeRegressor(random_state=0)

In [97]:
y_pred = regressor.predict(X)

y_pred

array([30145.        , 44538.33333333, 63250.        , ...,
       37051.25      , 13810.        , 88328.33333333])
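
`y_pred` above is computed on the same rows the tree was trained on. An unpruned `DecisionTreeRegressor` can memorize its training data, so these predictions say little about accuracy on unseen cars. A minimal sketch of scoring on held-out rows, on synthetic data with hypothetical variable names, follows; the 80/20 split and the seed are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in (an assumption): one feature, noisy linear price.
rng = np.random.default_rng(0)
toy_X = rng.uniform(100, 500, size=(400, 1))
toy_y = 100 * toy_X[:, 0] + rng.normal(scale=3000, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# R^2 on the training rows versus rows the tree has never seen.
r2_train = r2_score(y_tr, tree.predict(X_tr))
r2_test = r2_score(y_te, tree.predict(X_te))

# Root mean squared error on the held-out rows, in price units.
rmse_test = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
print(r2_train, r2_test, rmse_test)
```

The gap between the training and held-out scores gives a rough measure of overfitting; limiting `max_depth` or `min_samples_leaf` is a common way to narrow it.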