# Regression and Preprocessing

*Fuel Consumption Prediction*


We will build a regression model using decision trees. The provided dataset is used to **estimate fuel consumption** for different regions, based on the tax charged in the region, the average income of the population, the amount of paved roads, and the proportion of the population that holds a driver's license.
This dataset was adapted from a real dataset with data from the United States.

**Inspect** the dataset for existing problems and **fix** them using one or more of the techniques covered in class.
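One possible way to surface such problems is to count missing values per column and to list the distinct values of any non-numeric column. The sketch below runs on a small hand-built frame: the name `toy` and its five rows are illustrative stand-ins, and only the column names are borrowed from this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative frame (an assumption): a handful of rows using this notebook's column names
toy = pd.DataFrame({
    "imposto": ["extr-high", "med", "high", "max", "low"],
    "renda_media": [3571.0, 4870.0, 4399.0, 5342.0, 4512.0],
    "populacao_cnh": [0.525, 0.529, 0.544, np.nan, 0.552],
    "consumo": [541.0, 414.0, 410.0, 457.0, 498.0],
})

# Problem 1: missing values, counted per column
missing_per_column = toy.isna().sum()

# Problem 2: non-numeric columns, which most regressors cannot consume directly
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Distinct labels of the text column, to decide on an encoding
category_counts = toy["imposto"].value_counts()

print(missing_per_column)
print(non_numeric)
print(category_counts)
```

Missing values point toward imputation (or dropping rows), while a text column points toward an encoder such as `OneHotEncoder` or `OrdinalEncoder`.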

Train the model and evaluate it however you see fit, drawing on what you have already learned in previous classes.

Train regressors using the **linear regression** and **decision tree** algorithms.

### Imports

If you need anything else, add the import.

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer

from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer

### Data loading

**Do not** modify this section.

In [2]:
orig_url='https://drive.google.com/file/d/1GxwiSg94tQc-Hhbnpo_rCmljTyVdQP2v/view?usp=sharing'
file_id = orig_url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
df = pd.read_csv(csv_raw)
df

|   | imposto | renda_media | estradas_pav | populacao_cnh | consumo |
|---|---|---|---|---|---|
| 0 | extr-high | 3571.0 | 1976.0 | 0.525 | 541.0 |
| 1 | extr-high | 4092.0 | 1250.0 | 0.572 | 524.0 |
| 2 | extr-high | 3865.0 | 1586.0 | 0.58 | 561.0 |
| 3 | med | 4870.0 | 2351.0 | 0.529 | 414.0 |
| 4 | high | 4399.0 | 431.0 | 0.544 | 410.0 |
| 5 | max | 5342.0 | 1333.0 | NaN | 457.0 |
| 6 | high | 5319.0 | 11868.0 | 0.451 | 344.0 |
| 7 | high | 5126.0 | 2138.0 | 0.553 | 467.0 |
| 8 | high | 4447.0 | 8577.0 | 0.529 | 464.0 |
| 9 | low | 4512.0 | 8507.0 | 0.552 | 498.0 |


### Preprocessing

In [16]:
# General information
print(df.info())
print(100*"-")

print("\n", df.describe())
print(100*"-")

print("\n", df.isna().sum())
print(100*"-")

# Separate the features from the target column ("consumo")
X = df.drop(columns=["consumo"])
y = df["consumo"]

# Drop the rows whose target value is missing
mask = ~y.isna()
X = X[mask]
y = y[mask]

# Preprocessing: mean-impute numeric columns, one-hot encode categorical ones
colunas_numericas = X.select_dtypes(include=["int64", "float64"]).columns
colunas_categoricas = X.select_dtypes(include=["object"]).columns

preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), colunas_numericas),
        ("cat", OneHotEncoder(handle_unknown="ignore"), colunas_categoricas),
    ]
)

X_processed = preprocessor.fit_transform(X)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   imposto        50 non-null     object 
 1   renda_media    49 non-null     float64
 2   estradas_pav   49 non-null     float64
 3   populacao_cnh  48 non-null     float64
 4   consumo        49 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
None
----------------------------------------------------------------------------------------------------

        renda_media  estradas_pav  populacao_cnh     consumo
count    49.000000     49.000000      48.000000   49.000000
mean   4227.918367   5492.142857       0.569542  576.857143
std     575.913905   3492.811833       0.055731  110.715853
min    3063.000000    431.000000       0.451000  344.000000
25%    3721.000000   2619.000000       0.529750  510.000000
50%    4296.000000   4725.000000       0.563000  571.000000
75%    457
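
The preprocessing cell fits the imputer and the encoder on all rows before splitting, so statistics from the test rows (for example the column means) leak into training. One alternative is to split first and wrap the preprocessing in a `Pipeline`, so that it is fitted on the training rows only. The sketch below is a hedged illustration on synthetic data: the frame `demo`, the target `y_demo`, and all the numbers in it are assumptions, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (assumption): same kind of columns, made-up values
rng = np.random.default_rng(0)
n = 40
demo = pd.DataFrame({
    "imposto": rng.choice(["low", "med", "high"], size=n),
    "renda_media": rng.normal(4200, 500, size=n),
    "populacao_cnh": rng.uniform(0.45, 0.70, size=n),
})
demo.loc[[3, 17], "populacao_cnh"] = np.nan  # inject two missing values
y_demo = 1000 * demo["populacao_cnh"].fillna(0.57) + rng.normal(0, 10, size=n)

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["renda_media", "populacao_cnh"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["imposto"]),
])
model = Pipeline(steps=[("prep", preprocessor), ("reg", LinearRegression())])

# Split the raw frame first; the pipeline then learns means and categories from X_tr only
X_tr, X_te, y_tr, y_te = train_test_split(demo, y_demo, test_size=0.2, random_state=42)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# The imputer's learned means come from the training rows alone
learned_means = model.named_steps["prep"].named_transformers_["num"].statistics_
print(learned_means)
```

With only about 50 rows the practical effect of the leak is likely small, but the pipeline form also makes it harder to forget to transform new data before predicting.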

### Training

In [18]:
# Linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)

# Decision tree
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
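
A `DecisionTreeRegressor` with default settings keeps splitting until its leaves are pure, so on a small dataset it can memorize the training rows. A common counter-measure is to constrain the tree. The sketch below compares an unconstrained and a constrained tree on synthetic data; the arrays `X_syn` and `y_syn` and the values `max_depth=3` and `min_samples_leaf=5` are arbitrary assumptions chosen for illustration, not tuned settings for this dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 50 rows, 3 features, noisy linear target
rng = np.random.default_rng(42)
X_syn = rng.uniform(size=(50, 3))
y_syn = 500 + 200 * X_syn[:, 0] + rng.normal(0, 30, size=50)

# Unconstrained tree: grows until every training row is fitted exactly
full_tree = DecisionTreeRegressor(random_state=42).fit(X_syn, y_syn)

# Constrained tree: at most 3 levels and at least 5 rows per leaf
small_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_syn, y_syn)

print("full tree depth:", full_tree.get_depth(), "train R2:", full_tree.score(X_syn, y_syn))
print("small tree depth:", small_tree.get_depth(), "train R2:", small_tree.score(X_syn, y_syn))
```

A perfect training score from the unconstrained tree says nothing about how it generalizes; the constrained tree trades training fit for smoother predictions.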

### Evaluation

In [21]:
def avaliar_modelo(modelo, X_train, y_train, X_test, y_test, nome):
    # Predict on both splits so training and test errors can be compared
    y_pred_train = modelo.predict(X_train)
    y_pred_test = modelo.predict(X_test)

    print(f"\nModel: {nome}")
    print("Train:")
    print("  MAE:", mean_absolute_error(y_train, y_pred_train))
    print("  RMSE:", np.sqrt(mean_squared_error(y_train, y_pred_train)))
    print("  R²:", r2_score(y_train, y_pred_train))

    print("Test:")
    print("  MAE:", mean_absolute_error(y_test, y_pred_test))
    print("  RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_test)))
    print("  R²:", r2_score(y_test, y_pred_test))


# Evaluate both models
avaliar_modelo(lr, X_train, y_train, X_test, y_test, "Linear Regression")
avaliar_modelo(dt, X_train, y_train, X_test, y_test, "Decision Tree")



Model: Linear Regression
Train:
  MAE: 41.001721748710146
  RMSE: 58.219376443042194
  R²: 0.7366986476161393
Test:
  MAE: 52.11832476407126
  RMSE: 67.37598996719699
  R²: 0.4213842571898213

Model: Decision Tree
Train:
  MAE: 0.0
  RMSE: 0.0
  R²: 1.0
Test:
  MAE: 72.6
  RMSE: 122.79983713344248
  R²: -0.922097918676845
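
The decision tree's perfect training scores next to a negative test R² suggest overfitting. Since a 20% hold-out of roughly 50 rows leaves only about ten test rows, a single split is also a noisy estimate. K-fold cross-validation is one way to get a steadier comparison. The sketch below uses synthetic data; `X_syn`, `y_syn`, the choice of 5 folds, and the helper name `cv_mae` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 50 rows, 3 features, noisy linear target
rng = np.random.default_rng(42)
X_syn = rng.uniform(size=(50, 3))
y_syn = 500 + 200 * X_syn[:, 0] + rng.normal(0, 30, size=50)

# Shuffled 5-fold split: every row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)

def cv_mae(estimator):
    """Return the per-fold mean absolute error (hypothetical helper)."""
    # scikit-learn reports the negated MAE so that higher is better; flip the sign back
    scores = cross_val_score(estimator, X_syn, y_syn, cv=cv,
                             scoring="neg_mean_absolute_error")
    return -scores

mae_lr = cv_mae(LinearRegression())
mae_dt = cv_mae(DecisionTreeRegressor(random_state=42))

print("Linear regression MAE per fold:", mae_lr, "mean:", mae_lr.mean())
print("Decision tree MAE per fold:", mae_dt, "mean:", mae_dt.mean())
```

Comparing the mean and the spread across folds gives a fairer picture than one ten-row test set, at the cost of fitting each model five times.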
