# Regression and Preprocessing

We will build a **regression** model using **decision trees**. The provided dataset records **fuel consumption** for different regions, together with the tax charged in the region, the population's average income, the amount of paved roads, and the proportion of the population that holds a driver's license.
This dataset was adapted from a real one containing data from the United States.

You **must** identify the **problems** in the dataset and **fix** them using one or more of the techniques covered in class.

**Train** the model and evaluate it however you see fit, drawing on what you have learned in previous classes.

### Imports

If you need anything else, add the import here.

In [99]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer

from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer

### Loading the data

**Do not** modify this section.

In [100]:
orig_url='https://drive.google.com/file/d/1GxwiSg94tQc-Hhbnpo_rCmljTyVdQP2v/view?usp=sharing'
file_id = orig_url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
df = pd.read_csv(csv_raw)
df

Unnamed: 0,imposto,renda_media,estradas_pav,populacao_cnh,consumo
0,extr-high,3571.0,1976.0,0.525,541.0
1,extr-high,4092.0,1250.0,0.572,524.0
2,extr-high,3865.0,1586.0,0.58,561.0
3,med,4870.0,2351.0,0.529,414.0
4,high,4399.0,431.0,0.544,410.0
5,max,5342.0,1333.0,,457.0
6,high,5319.0,11868.0,0.451,344.0
7,high,5126.0,2138.0,0.553,467.0
8,high,4447.0,8577.0,0.529,464.0
9,low,4512.0,8507.0,0.552,498.0


### Preprocessing

In [101]:
df.isna().sum()

Unnamed: 0,0
imposto,0
renda_media,1
estradas_pav,1
populacao_cnh,2
consumo,1


In [102]:
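# Drop rows with a missing target, since they cannot be used for training.
df = df.dropna(subset=['consumo'])
# Keep only rows that have at least 3 non-missing values.
df = df.dropna(thresh=3)
df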

Unnamed: 0,imposto,renda_media,estradas_pav,populacao_cnh,consumo
0,extr-high,3571.0,1976.0,0.525,541.0
1,extr-high,4092.0,1250.0,0.572,524.0
2,extr-high,3865.0,1586.0,0.58,561.0
3,med,4870.0,2351.0,0.529,414.0
4,high,4399.0,431.0,0.544,410.0
5,max,5342.0,1333.0,,457.0
6,high,5319.0,11868.0,0.451,344.0
7,high,5126.0,2138.0,0.553,467.0
8,high,4447.0,8577.0,0.529,464.0
9,low,4512.0,8507.0,0.552,498.0
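
To make the two `dropna` calls above concrete, here is a minimal sketch on a made-up frame. The values are illustrative assumptions, not rows from the dataset. `subset=['consumo']` removes rows whose target is missing, and `thresh=3` keeps only rows with at least three non-missing values.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, only to illustrate dropna semantics.
toy = pd.DataFrame({
    'imposto': ['low', 'high', None, 'med'],
    'renda_media': [4000.0, np.nan, np.nan, 5000.0],
    'populacao_cnh': [0.55, 0.60, np.nan, np.nan],
    'consumo': [500.0, np.nan, 450.0, 430.0],
})

# Rows without a target value cannot be used for supervised training.
step1 = toy.dropna(subset=['consumo'])

# thresh=3 keeps rows that have at least 3 non-missing values.
step2 = step1.dropna(thresh=3)

print(step1.index.tolist())
print(step2.index.tolist())
```

Row 1 is dropped because its target is missing, and row 2 is dropped because only one of its four values is present. Row 3 survives with one gap, which is why an imputation step is still needed afterwards.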


In [103]:
df.isna().sum()

Unnamed: 0,0
imposto,0
renda_media,0
estradas_pav,0
populacao_cnh,1
consumo,0


In [104]:
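# Fill the remaining missing populacao_cnh value with the column mean.
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2D array; ravel() flattens it into a single column.
df['populacao_cnh'] = imputer.fit_transform(df[['populacao_cnh']]).ravel()
df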

Unnamed: 0,imposto,renda_media,estradas_pav,populacao_cnh,consumo
0,extr-high,3571.0,1976.0,0.525,541.0
1,extr-high,4092.0,1250.0,0.572,524.0
2,extr-high,3865.0,1586.0,0.58,561.0
3,med,4870.0,2351.0,0.529,414.0
4,high,4399.0,431.0,0.544,410.0
5,max,5342.0,1333.0,0.570319,457.0
6,high,5319.0,11868.0,0.451,344.0
7,high,5126.0,2138.0,0.553,467.0
8,high,4447.0,8577.0,0.529,464.0
9,low,4512.0,8507.0,0.552,498.0
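
The cell above computes the mean over the whole dataset, so rows that later end up in the test set also contribute to the imputed value. One possible alternative, sketched here on hypothetical values rather than on the real data, is to fit the imputer on the training rows only and reuse the learned mean for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical values, used only to illustrate the fit/transform split.
toy = pd.DataFrame({
    'populacao_cnh': [0.50, 0.52, np.nan, 0.58, 0.60, np.nan, 0.55, 0.57],
    'consumo': [540.0, 520.0, 560.0, 410.0, 400.0, 450.0, 340.0, 460.0],
})
X = toy[['populacao_cnh']]
y = toy['consumo']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy='mean')
# Learn the mean from the training rows only...
X_train_imp = imputer.fit_transform(X_train)
# ...and reuse that same mean to fill gaps in the test rows.
X_test_imp = imputer.transform(X_test)

print(imputer.statistics_)
```

With a single missing value, as in this notebook, the difference is likely small, but the pattern avoids leaking test information into preprocessing.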


In [105]:
df.isna().sum()

Unnamed: 0,0
imposto,0
renda_media,0
estradas_pav,0
populacao_cnh,0
consumo,0


In [106]:
print(df['imposto'].unique())

['extr-high' 'med' 'high' 'max' 'low' 'very-high' 'very-low' 'min'
 'extr-low']


In [107]:
imposto_map = {
    'min': 0,
    'extr-low': 1,
    'very-low': 2,
    'low': 3,
    'med': 4,
    'high': 5,
    'very-high': 6,
    'extr-high': 7,
    'max': 8
}
df['imposto'] = df['imposto'].map(imposto_map)
df

# Alternative using OrdinalEncoder with the same category order (kept commented out):
#enc = OrdinalEncoder(categories=[[
#    'min',
#    'extr-low',
#    'very-low',
#    'low',
#    'med',
#    'high',
#    'very-high',
#    'extr-high',
#    'max'
#]])
#df['imposto'] = enc.fit_transform(df[['imposto']])
#df

Unnamed: 0,imposto,renda_media,estradas_pav,populacao_cnh,consumo
0,7,3571.0,1976.0,0.525,541.0
1,7,4092.0,1250.0,0.572,524.0
2,7,3865.0,1586.0,0.58,561.0
3,4,4870.0,2351.0,0.529,414.0
4,5,4399.0,431.0,0.544,410.0
5,8,5342.0,1333.0,0.570319,457.0
6,5,5319.0,11868.0,0.451,344.0
7,5,5126.0,2138.0,0.553,467.0
8,5,4447.0,8577.0,0.529,464.0
9,3,4512.0,8507.0,0.552,498.0
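
One thing to keep in mind with `Series.map` is that labels missing from the dictionary silently become `NaN`. The sketch below rebuilds the same ordering from a list and checks for unmapped labels. The misspelled label `'Med'` is a hypothetical example, not a value found in the dataset.

```python
import pandas as pd

# Same ordering as imposto_map above, built from an ordered list.
order = ['min', 'extr-low', 'very-low', 'low', 'med',
         'high', 'very-high', 'extr-high', 'max']
imposto_map = {label: rank for rank, label in enumerate(order)}

# 'Med' is a deliberately misspelled, hypothetical label.
labels = pd.Series(['low', 'max', 'Med'])
encoded = labels.map(imposto_map)

# Labels that map() could not translate show up as NaN.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

Running a check like this right after the mapping (or calling `df.isna().sum()` again) would catch a typo in the category names before training.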


### Training

In [108]:
X = df.drop(columns=['consumo'])
y = df['consumo']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4, test_size=0.25) # random_state and test_size strongly affect the results

model = DecisionTreeRegressor(random_state=42, max_depth=7, min_samples_split=5, max_leaf_nodes=20)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
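
Because a single train/test split can depend heavily on `random_state`, one option is to average the error over several folds. The following is a minimal sketch using synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for the preprocessed features and target, and the hyperparameters are copied from the cell above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 40 rows, 4 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(40, 4))
y_demo = 500 - 200 * X_demo[:, 0] + rng.normal(scale=10, size=40)

model = DecisionTreeRegressor(
    random_state=42, max_depth=7, min_samples_split=5, max_leaf_nodes=20
)

# 5-fold cross-validation; scikit-learn reports the negated MAE.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring='neg_mean_absolute_error'
)
mae_per_fold = -scores

print('MAE per fold:', mae_per_fold)
print('Mean MAE:', mae_per_fold.mean(), '+/-', mae_per_fold.std())
```

The spread across folds gives a rough sense of how much the reported error could change with a different split.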



### Evaluation

In [109]:
# Print the computed error metrics
print('\nMean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))


Mean Absolute Error: 48.125
Mean Squared Error: 3681.775462962963
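
The MSE is in squared units of `consumo`, which makes it hard to compare with the MAE. Taking the square root gives an RMSE in the same units; for the value above that is roughly 60.7. The sketch below shows how RMSE and R² could be computed; the arrays `y_true` and `y_hat` are hypothetical placeholders for `y_test` and `y_pred`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical placeholders for y_test and y_pred.
y_true = np.array([541.0, 524.0, 414.0, 467.0])
y_hat = np.array([500.0, 530.0, 450.0, 460.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # same units as the target
r2 = metrics.r2_score(y_true, y_hat)  # 1.0 is a perfect fit

print('MAE:', mae)
print('RMSE:', rmse)
print('R2:', r2)
```

An RMSE noticeably larger than the MAE suggests that a few predictions have large errors, since squaring penalizes them more heavily.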
