# Project 2 - Data Science

## Team members:
* Gabriela Kimi
* Luiza Ehrenberger
* Pedro Barão
* Rafael Paolino

## Introduction:

### Objective: use regression methods to predict car prices from the cars' characteristics.

#### To do so, we used a dataset that records each car's characteristics and the manufacturer's suggested retail price (MSRP). Some of the attributes used to estimate the price are:

* make
* model
* year of manufacture
* engine type

<a href="https://www.kaggle.com/CooperUnion/cardataset">Link to the "Car Features and MSRP" dataset</a>

### Importing the required libraries:

In [19]:
%matplotlib notebook

import pandas as pd
import numpy as np
from scipy.stats import norm, probplot
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

from sklearn.tree import DecisionTreeRegressor 

from sklearn.preprocessing import OneHotEncoder

from mpl_toolkits.mplot3d import Axes3D

from IPython.display import display

import datetime

import seaborn as sns

### Linear regression function:

#### Y: DataFrame column used as the response variable (TARGET).
#### X: DataFrame column(s) used as the explanatory variable(s) (FEATURES).

In [20]:
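def regress(Y, X):
    """Fit an OLS model of Y on X, adding an intercept term to X."""
    X_cp = sm.add_constant(X)  # prepend a constant column for the intercept
    model = sm.OLS(Y, X_cp)
    results = model.fit()

    return results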

### Loading the dataset into a DataFrame

In [21]:
data = pd.read_csv("data.csv")

In [22]:
data.head(3)

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |


## Data Mining and Dataset Characteristics

### Columns that will be used:

In [23]:
data.columns

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
      dtype='object')

### Describing the variables that will be used:

### Removing rows with missing values before encoding the categorical columns:


In [24]:
data=data.dropna()
data.isnull().sum()

Make                 0
Model                0
Year                 0
Engine Fuel Type     0
Engine HP            0
Engine Cylinders     0
Transmission Type    0
Driven_Wheels        0
Number of Doors      0
Market Category      0
Vehicle Size         0
Vehicle Style        0
highway MPG          0
city mpg             0
Popularity           0
MSRP                 0
dtype: int64

In [25]:
data.iloc[:,0].size

8084

In [26]:
train = data.sample(7200)
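
Since `data` has 8084 rows and 7200 are sampled for training, 884 rows are left over and could serve as a test set. The sketch below shows one way to keep them, using a small made-up DataFrame (`toy`) in place of `data`; the fixed `random_state` is an added assumption that makes the sample reproducible and is not part of the original cell.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `data` DataFrame
toy = pd.DataFrame({
    "Engine HP": range(100, 110),
    "MSRP": range(20000, 30000, 1000),
})

# Fixing random_state makes the same rows come out on every run
toy_train = toy.sample(8, random_state=0)

# Every row that was not drawn for training becomes the test set
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))  # 8 2
```

With the real data, the same two lines would read `train = data.sample(7200, random_state=0)` and `test = data.drop(train.index)`.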

In [27]:
train=train[[ 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Vehicle Size',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP']]

### One-hot encoding of the categorical columns:


In [28]:
train_2=pd.get_dummies(train)
train_2

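| | Year | Engine HP | Engine Cylinders | Number of Doors | highway MPG | city mpg | Popularity | MSRP | Engine Fuel Type_diesel | Engine Fuel Type_electric | ... | Transmission Type_DIRECT_DRIVE | Transmission Type_MANUAL | Transmission Type_UNKNOWN | Driven_Wheels_all wheel drive | Driven_Wheels_four wheel drive | Driven_Wheels_front wheel drive | Driven_Wheels_rear wheel drive | Vehicle Size_Compact | Vehicle Size_Large | Vehicle Size_Midsize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7983 | 2015 | 550.0 | 10.0 | 2.0 | 22 | 13 | 3105 | 182500 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4759 | 2016 | 365.0 | 6.0 | 4.0 | 21 | 15 | 5657 | 42600 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3629 | 2017 | 295.0 | 6.0 | 4.0 | 26 | 19 | 1851 | 42490 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3100 | 2008 | 215.0 | 6.0 | 2.0 | 23 | 15 | 1013 | 39130 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2379 | 2017 | 520.0 | 8.0 | 4.0 | 21 | 14 | 1715 | 116500 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1469 | 1995 | 63.0 | 4.0 | 2.0 | 38 | 31 | 5657 | 2000 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 8178 | 2009 | 210.0 | 6.0 | 2.0 | 20 | 14 | 1851 | 21520 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4723 | 2017 | 130.0 | 4.0 | 4.0 | 36 | 29 | 2202 | 15990 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 5914 | 2014 | 98.0 | 4.0 | 4.0 | 44 | 41 | 2202 | 22190 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |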


In [29]:
train_2.columns

Index(['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP',
       'Engine Fuel Type_diesel', 'Engine Fuel Type_electric',
       'Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)',
       'Engine Fuel Type_flex-fuel (premium unleaded required/E85)',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_premium unleaded (required)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL', 'Transmission Type_AUTOMATIC',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize'],
      dtype='object')

In [30]:
train_2.describe()

| | Year | Engine HP | Engine Cylinders | Number of Doors | highway MPG | city mpg | Popularity | MSRP | Engine Fuel Type_diesel | Engine Fuel Type_electric | ... | Transmission Type_DIRECT_DRIVE | Transmission Type_MANUAL | Transmission Type_UNKNOWN | Driven_Wheels_all wheel drive | Driven_Wheels_four wheel drive | Driven_Wheels_front wheel drive | Driven_Wheels_rear wheel drive | Vehicle Size_Compact | Vehicle Size_Large | Vehicle Size_Midsize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | ... | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 | 7200.0 |
| mean | 2012.061111 | 275.030694 | 5.780972 | 3.407639 | 26.633194 | 19.545139 | 1499.456528 | 50054.16 | 0.019583 | 0.001806 | ... | 0.001944 | 0.208194 | 0.000417 | 0.268194 | 0.080833 | 0.352778 | 0.298194 | 0.377083 | 0.230833 | 0.392083 |
| std | 6.321382 | 114.875248 | 1.891324 | 0.901033 | 7.888234 | 7.272821 | 1413.721122 | 70071.71 | 0.138573 | 0.042456 | ... | 0.044056 | 0.406045 | 0.02041 | 0.44305 | 0.272598 | 0.477867 | 0.457497 | 0.48469 | 0.421395 | 0.488249 |
| min | 1990.0 | 55.0 | 0.0 | 2.0 | 12.0 | 7.0 | 2.0 | 2000.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2010.0 | 192.0 | 4.0 | 2.0 | 22.0 | 16.0 | 549.0 | 25893.75 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2015.0 | 265.0 | 6.0 | 4.0 | 26.0 | 18.0 | 1013.0 | 35060.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2016.0 | 321.0 | 6.0 | 4.0 | 30.0 | 22.0 | 2009.0 | 48840.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| max | 2017.0 | 1001.0 | 16.0 | 4.0 | 354.0 | 137.0 | 5657.0 | 2065902.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Scatter plots

In [31]:
#sns.pairplot(train_2)

### Linear regression (ordinary least squares):

In [32]:
X= train_2[['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors',
       'highway MPG', 'city mpg', 'Popularity','Engine Fuel Type_diesel', 'Engine Fuel Type_electric',
       'Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)',
       'Engine Fuel Type_flex-fuel (premium unleaded required/E85)',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_premium unleaded (required)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL', 'Transmission Type_AUTOMATIC',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize']]
Y=train_2[['MSRP']]

results=regress(Y,X)
results.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | MSRP | R-squared: | 0.528 |
| Model: | OLS | Adj. R-squared: | 0.527 |
| Method: | Least Squares | F-statistic: | 349.4 |
| Date: | Thu, 25 Nov 2021 | Prob (F-statistic): | 0.0 |
| Time: | 15:12:00 | Log-Likelihood: | -87843.0 |
| No. Observations: | 7200 | AIC: | 175700.0 |
| Df Residuals: | 7176 | BIC: | 175900.0 |
| Df Model: | 23 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2.902e+05 | 1.31e+05 | -2.223 | 0.026 | -5.46e+05 | -3.44e+04 |
| Year | 206.3793 | 125.229 | 1.648 | 0.099 | -39.106 | 451.865 |
| Engine HP | 277.7720 | 12.194 | 22.780 | 0.000 | 253.869 | 301.675 |
| Engine Cylinders | 1.371e+04 | 669.440 | 20.482 | 0.000 | 1.24e+04 | 1.5e+04 |
| Number of Doors | -1540.2834 | 815.216 | -1.889 | 0.059 | -3138.346 | 57.780 |
| highway MPG | -14.7894 | 133.567 | -0.111 | 0.912 | -276.620 | 247.042 |
| city mpg | 1358.8145 | 179.744 | 7.560 | 0.000 | 1006.463 | 1711.166 |
| Popularity | -3.1506 | 0.421 | -7.487 | 0.000 | -3.975 | -2.326 |
| Engine Fuel Type_diesel | -2.075e+04 | 1.86e+04 | -1.117 | 0.264 | -5.72e+04 | 1.57e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 13799.61 | Durbin-Watson: | 1.973 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 46679557.748 |
| Skew: | 14.625 | Prob(JB): | 0.0 |
| Kurtosis: | 396.374 | Cond. No. | 7.57e+16 |
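
The summary above reports `Df Model: 23` for 27 regressors and a condition number of 7.57e+16. Both are consistent with perfectly collinear columns: each of the four categorical variables contributes a full set of dummies, and each full set sums to the constant column added by `sm.add_constant`. The sketch below illustrates the effect on a small made-up DataFrame (`toy`) and shows `drop_first=True` as one possible remedy; it is an illustration under those assumptions, not a change made in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "Engine HP": [150.0, 300.0, 210.0, 95.0, 180.0, 260.0],
    "Vehicle Size": ["Compact", "Large", "Midsize", "Compact", "Midsize", "Large"],
})

# One dummy per level versus one dummy per level except the first
full = pd.get_dummies(toy, dtype=float)
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)


def rank_with_const(df):
    """Rank of the design matrix after prepending an intercept column."""
    design = np.column_stack([np.ones(len(df)), df.to_numpy()])
    return np.linalg.matrix_rank(design)


# Column count (including the intercept) next to the matrix rank
print(full.shape[1] + 1, rank_with_const(full))        # 5 4
print(reduced.shape[1] + 1, rank_with_const(reduced))  # 4 4
```

When the rank is lower than the number of columns, the individual coefficients of the collinear columns are not uniquely determined, so their values and p-values should be read with caution.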


### Removing columns with p-value > 10%:

In [33]:
X= train_2[['Engine HP', 'Engine Cylinders','city mpg', 'Popularity',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize']]

Y=train_2[['MSRP']]

results=regress(Y,X)
results.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | MSRP | R-squared: | 0.528 |
| Model: | OLS | Adj. R-squared: | 0.527 |
| Method: | Least Squares | F-statistic: | 501.7 |
| Date: | Thu, 25 Nov 2021 | Prob (F-statistic): | 0.0 |
| Time: | 15:12:00 | Log-Likelihood: | -87848.0 |
| No. Observations: | 7200 | AIC: | 175700.0 |
| Df Residuals: | 7183 | BIC: | 175800.0 |
| Df Model: | 16 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -7.664e+04 | 3280.884 | -23.360 | 0.000 | -8.31e+04 | -7.02e+04 |
| Engine HP | 288.4310 | 10.238 | 28.172 | 0.000 | 268.361 | 308.501 |
| Engine Cylinders | 1.346e+04 | 623.309 | 21.596 | 0.000 | 1.22e+04 | 1.47e+04 |
| city mpg | 1402.1574 | 137.712 | 10.182 | 0.000 | 1132.202 | 1672.113 |
| Popularity | -3.0495 | 0.418 | -7.292 | 0.000 | -3.869 | -2.230 |
| Engine Fuel Type_flex-fuel (unleaded/E85) | -2.103e+04 | 2438.967 | -8.621 | 0.000 | -2.58e+04 | -1.62e+04 |
| Engine Fuel Type_premium unleaded (recommended) | -1.821e+04 | 1849.231 | -9.849 | 0.000 | -2.18e+04 | -1.46e+04 |
| Engine Fuel Type_regular unleaded | -1.568e+04 | 1748.557 | -8.966 | 0.000 | -1.91e+04 | -1.22e+04 |
| Transmission Type_AUTOMATED_MANUAL | 2.254e+04 | 2451.224 | 9.197 | 0.000 | 1.77e+04 | 2.73e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 13750.181 | Durbin-Watson: | 1.974 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 45712740.008 |
| Skew: | 14.511 | Prob(JB): | 0.0 |
| Kurtosis: | 392.273 | Cond. No. | 3.24e+19 |


## Decision tree regression:

In [34]:
Y=train_2[['MSRP']]

X= train_2[['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors',
       'highway MPG', 'city mpg', 'Popularity','Engine Fuel Type_diesel', 'Engine Fuel Type_electric',
       'Engine Fuel Type_flex-fuel (premium unleaded recommended/E85)',
       'Engine Fuel Type_flex-fuel (premium unleaded required/E85)',
       'Engine Fuel Type_flex-fuel (unleaded/E85)',
       'Engine Fuel Type_premium unleaded (recommended)',
       'Engine Fuel Type_premium unleaded (required)',
       'Engine Fuel Type_regular unleaded',
       'Transmission Type_AUTOMATED_MANUAL', 'Transmission Type_AUTOMATIC',
       'Transmission Type_DIRECT_DRIVE', 'Transmission Type_MANUAL',
       'Transmission Type_UNKNOWN', 'Driven_Wheels_all wheel drive',
       'Driven_Wheels_four wheel drive', 'Driven_Wheels_front wheel drive',
       'Driven_Wheels_rear wheel drive', 'Vehicle Size_Compact',
       'Vehicle Size_Large', 'Vehicle Size_Midsize']]

regressor = DecisionTreeRegressor(random_state = 0) 

regressor.fit(X, Y)

DecisionTreeRegressor(random_state=0)

In [35]:
# The tree was fitted on all 27 feature columns, so predict needs a full row of X
y_pred = regressor.predict(X.iloc[[0]])

print("Predicted price: %d" % y_pred[0])
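
The notebook fits the tree but does not evaluate it on rows it has not seen. The sketch below shows one way to do that, using synthetic data in place of the encoded car features; the `toy` DataFrame, the 320/80 split, and the generating formula are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical) standing in for the encoded car features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Engine HP": rng.uniform(60, 500, size=400),
    "Engine Cylinders": rng.choice([4, 6, 8], size=400),
})
toy["MSRP"] = (
    150 * toy["Engine HP"]
    + 2000 * toy["Engine Cylinders"]
    + rng.normal(0, 1500, size=400)
)

# Hold out the rows that are not sampled for training
toy_train = toy.sample(320, random_state=0)
toy_test = toy.drop(toy_train.index)
features = ["Engine HP", "Engine Cylinders"]

tree = DecisionTreeRegressor(random_state=0)
tree.fit(toy_train[features], toy_train["MSRP"])

# predict expects one value per training feature, in the same column order
one_car = toy_test[features].iloc[[0]]
print("Predicted price: %d" % tree.predict(one_car)[0])

# score returns R^2; compare the training rows with the held-out rows
train_r2 = tree.score(toy_train[features], toy_train["MSRP"])
test_r2 = tree.score(toy_test[features], toy_test["MSRP"])
print(train_r2, test_r2)
```

A tree grown without a depth limit can reproduce its training rows almost exactly, so the held-out score is the more informative of the two numbers.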