## Regression trees - exercises 02

This exercise continues the previous one, with the same dataset and the same variables; the goal now is to look for the 'best tree'.

The variables are described below:

| Variable | Description |
|-|-|
| CRIM | per-capita crime rate by town |
| ZN | proportion of residential land zoned for lots over 25,000 square feet |
| INDUS | proportion of non-retail business acres per town |
| CHAS | 1 if the tract bounds the *Charles River*; 0 otherwise |
| NOX | nitric oxide concentration (parts per 10 million) |
| RM | average number of rooms per dwelling |
| AGE | proportion of owner-occupied units built before 1940 |
| DIS | weighted distances to five Boston employment centres |
| RAD | index of accessibility to radial highways |
| TAX | full-value property-tax rate per \\$10,000 |
| PTRATIO | pupil-teacher ratio by town |
| B | $ 1000 (Bk - 0.63) ^ 2 $ where Bk is the proportion of Black residents by town |
| LSTAT | percentage of the population of lower status |
| MEDV | (response variable) median value of owner-occupied homes, in thousands of US dollars |

In [32]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn import tree

%matplotlib notebook

boston = pd.read_csv('https://raw.githubusercontent.com/RodzMoraes/curso-ebac/main/M%C3%B3dulo%2011/BostonHousing.csv')
X = boston.drop(columns=['medv']).copy()
y = boston[['medv']]

In [33]:
X.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33


In [34]:
y.head()

Unnamed: 0,medv
0,24.0
1,21.6
2,34.7
3,33.4
4,36.2


### 1. Repeat the steps from the previous exercise until you have a regression tree predicting the house value on the training set.

In [35]:
matriz_correlacao = boston.corr()
matriz_correlacao

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
crim,1.0,-0.200469,0.406583,-0.055892,0.420972,-0.219247,0.352734,-0.37967,0.625505,0.582764,0.289946,-0.385064,0.455621,-0.388305
zn,-0.200469,1.0,-0.533828,-0.042697,-0.516604,0.311991,-0.569537,0.664408,-0.311948,-0.314563,-0.391679,0.17552,-0.412995,0.360445
indus,0.406583,-0.533828,1.0,0.062938,0.763651,-0.391676,0.644779,-0.708027,0.595129,0.72076,0.383248,-0.356977,0.6038,-0.483725
chas,-0.055892,-0.042697,0.062938,1.0,0.091203,0.091251,0.086518,-0.099176,-0.007368,-0.035587,-0.121515,0.048788,-0.053929,0.17526
nox,0.420972,-0.516604,0.763651,0.091203,1.0,-0.302188,0.73147,-0.76923,0.611441,0.668023,0.188933,-0.380051,0.590879,-0.427321
rm,-0.219247,0.311991,-0.391676,0.091251,-0.302188,1.0,-0.240265,0.205246,-0.209847,-0.292048,-0.355501,0.128069,-0.613808,0.69536
age,0.352734,-0.569537,0.644779,0.086518,0.73147,-0.240265,1.0,-0.747881,0.456022,0.506456,0.261515,-0.273534,0.602339,-0.376955
dis,-0.37967,0.664408,-0.708027,-0.099176,-0.76923,0.205246,-0.747881,1.0,-0.494588,-0.534432,-0.232471,0.291512,-0.496996,0.249929
rad,0.625505,-0.311948,0.595129,-0.007368,0.611441,-0.209847,0.456022,-0.494588,1.0,0.910228,0.464741,-0.444413,0.488676,-0.381626
tax,0.582764,-0.314563,0.72076,-0.035587,0.668023,-0.292048,0.506456,-0.534432,0.910228,1.0,0.460853,-0.441808,0.543993,-0.468536


In [57]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [58]:
# Tree with a maximum depth of 8
tree_8 = DecisionTreeRegressor(max_depth=8, random_state=42)
tree_8.fit(X_train, y_train)

# Tree with a maximum depth of 2
tree_2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_2.fit(X_train, y_train)

In [59]:
# MSE of the tree with maximum depth 8
y_train_8 = tree_8.predict(X_train)
mse_train_8 = mean_squared_error(y_train, y_train_8)
y_test_8 = tree_8.predict(X_test)
mse_test_8 = mean_squared_error(y_test, y_test_8)

# MSE of the tree with maximum depth 2
y_train_2 = tree_2.predict(X_train)
mse_train_2 = mean_squared_error(y_train, y_train_2)
y_test_2 = tree_2.predict(X_test)
mse_test_2 = mean_squared_error(y_test, y_test_2)
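
For reference, `mean_squared_error` averages the squared differences between observed and predicted values. The sketch below reproduces that formula in plain Python; the helper `mse` and the numbers are hypothetical and are not taken from the Boston data.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals (hypothetical helper, not a scikit-learn function)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values, for illustration only
y_true = [24.0, 21.5, 34.5, 33.0]
y_pred = [25.0, 20.5, 34.5, 31.0]

# Residuals are -1, 1, 0 and 2, so the squared errors sum to 6 over 4 points
print(mse(y_true, y_pred))  # 1.5
```

When comparing `mse_train_8` with `mse_test_8` (and likewise for depth 2), a training MSE far below the test MSE is the usual sign that the deeper tree is overfitting.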

In [60]:
# Create the regression tree model
tree_reg = DecisionTreeRegressor()

# Fit the regression tree on the training data
tree_reg.fit(X_train, y_train)

# Predict the house value on the training set
y_train_pred = tree_reg.predict(X_train)

In [61]:
# Plot the regression tree
plt.figure(figsize=(10, 8))
tree.plot_tree(tree_reg, feature_names=list(X.columns), filled=True)
plt.show()


### 2. Compute the pruning path given by this tree's CCP alphas.

In [62]:
path = tree_reg.cost_complexity_pruning_path(X_train, y_train)
path

{'ccp_alphas': array([0.00000000e+00, 3.51971634e-16, 3.51971634e-16, 1.54798762e-05,
        1.54798762e-05, 1.54798762e-05, 1.54798762e-05, 1.54798762e-05,
        1.54798762e-05, 1.54798762e-05, 1.54798762e-05, 1.54798762e-05,
        1.54798762e-05, 1.54798762e-05, 1.54798762e-05, 1.54798762e-05,
        1.54798762e-05, 1.54798762e-05, 1.54798762e-05, 1.54798762e-05,
        1.54798762e-05, 1.54798762e-05, 2.06398349e-05, 2.06398349e-05,
        2.06398349e-05, 2.06398349e-05, 4.64396285e-05, 4.64396285e-05,
        4.64396285e-05, 6.19195046e-05, 6.19195046e-05, 6.19195046e-05,
        6.19195046e-05, 6.19195046e-05, 6.19195046e-05, 6.19195046e-05,
        6.19195046e-05, 6.19195046e-05, 6.19195046e-05, 8.25593395e-05,
        8.25593395e-05, 8.25593395e-05, 8.25593395e-05, 9.28792570e-05,
        1.24871001e-04, 1.28998968e-04, 1.28998968e-04, 1.28998968e-04,
        1.39318885e-04, 1.39318885e-04, 1.39318885e-04, 1.39318885e-04,
        1.39318885e-04, 1.39318885e-04, 1.39318885
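
Each value in `ccp_alphas` is an "effective alpha": the penalty at which collapsing some subtree into a single leaf stops being worse under the cost-complexity criterion $R_\alpha(T) = R(T) + \alpha |T|$, where $R(T)$ is the total weighted leaf impurity and $|T|$ is the number of leaves. The sketch below works through that arithmetic with hypothetical helper functions and made-up impurity values; it is not scikit-learn's implementation.

```python
def cost_complexity(total_leaf_impurity, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T| (hypothetical helper)."""
    return total_leaf_impurity + alpha * n_leaves

def effective_alpha(node_impurity, subtree_impurity, subtree_leaves):
    """Alpha at which collapsing a subtree into one leaf breaks even."""
    return (node_impurity - subtree_impurity) / (subtree_leaves - 1)

# Made-up example: a node with weighted impurity 10.0 whose subtree
# has 3 leaves with total weighted impurity 4.0
alpha_eff = effective_alpha(10.0, 4.0, 3)       # (10 - 4) / (3 - 1)

# At that alpha, keeping the subtree and collapsing it cost the same
keep = cost_complexity(4.0, 3, alpha_eff)       # 4 + 3 * 3
collapse = cost_complexity(10.0, 1, alpha_eff)  # 10 + 3 * 1
print(alpha_eff, keep, collapse)  # 3.0 13.0 13.0
```

Under this reading, the many tiny and repeated values at the start of the array correspond to splits that reduce the impurity very little, so they are the first to be pruned as alpha grows.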

### 3. For each alpha value obtained in item 2, train a tree with that alpha and store the tree in a list.

In [63]:
ccp_alphas, impurities = path.ccp_alphas, path.impurities

In [64]:
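# One fitted tree per alpha on the pruning path.
# The last alpha prunes the tree down to a single node (the root).
clfs = []

for ccp_alpha in ccp_alphas:
    clf = DecisionTreeRegressor(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)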

### 4. For each tree in the list, compute its MSE.

In [65]:
# Tree depth along the pruning path; the MSE of each tree is computed in the next cell
tree_depths = [clf.tree_.max_depth for clf in clfs]
plt.figure(figsize=(9, 6))
plt.plot(ccp_alphas[:-1], tree_depths[:-1])
plt.xlabel("Alpha")
plt.ylabel("Tree depth")

Text(0, 0.5, 'Tree depth')

### 5. Plot the MSE against alpha and choose an alpha value near the minimum of the MSE

In [66]:
train_scores = [mean_squared_error(y_train, clf.predict(X_train)) for clf in clfs]
test_scores  = [mean_squared_error(y_test, clf.predict(X_test)) for clf in clfs]

fig, ax = plt.subplots(figsize=(9, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("MSE")
ax.set_title("MSE vs. alpha for the training and test sets")
ax.plot(ccp_alphas[:-1], train_scores[:-1], marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas[:-1], test_scores[:-1], marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()

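
Reading the minimum off the plot can be cross-checked numerically. The sketch below picks the alpha with the smallest MSE from two parallel lists; the helper `best_alpha` and the values are hypothetical. In this notebook the inputs would be `ccp_alphas[:-1]` and `test_scores[:-1]`, or scores computed on `X_val`, `y_val` (the validation split created earlier but not used so far), which would keep the test set out of the model selection.

```python
def best_alpha(alphas, mse_values):
    """Return the (alpha, mse) pair with the smallest MSE (hypothetical helper)."""
    if len(alphas) != len(mse_values):
        raise ValueError("alphas and mse_values must have the same length")
    idx = min(range(len(mse_values)), key=mse_values.__getitem__)
    return alphas[idx], mse_values[idx]

# Made-up values, for illustration only
alphas = [0.0, 0.1, 0.5, 1.0, 2.0]
mse_values = [20.0, 18.5, 15.0, 12.5, 14.0]

alpha_star, mse_star = best_alpha(alphas, mse_values)
print(alpha_star, mse_star)  # 1.0 12.5
```

Because the curve is often flat near its minimum, choosing a slightly larger alpha than the exact argmin is a reasonable way to get a smaller tree at almost the same error.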

### 6. Compute the R-squared of the tree found in the item above

In [67]:
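arvore_final = DecisionTreeRegressor(random_state=0, ccp_alpha=2)
arvore_final.fit(X_train, y_train)  # fit before scoring
r2_test = arvore_final.score(X_test, y_test)  # score() returns R-squared for regressors
print(arvore_final)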

DecisionTreeRegressor(ccp_alpha=2, random_state=0)
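
R-squared compares the model's squared error with the error of always predicting the mean of the response: $R^2 = 1 - SS_{res} / SS_{tot}$. For a fitted regressor, `score(X, y)` returns this quantity, as does `sklearn.metrics.r2_score`. The sketch below shows the formula in plain Python; the helper `r_squared` and the numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only
y_true = [20.0, 22.0, 24.0, 26.0]
y_pred = [22.0, 22.0, 24.0, 25.0]

# SS_res = 4 + 0 + 0 + 1 = 5 and SS_tot = 9 + 1 + 1 + 9 = 20
print(r_squared(y_true, y_pred))  # 0.75

# Predicting the mean everywhere gives R^2 = 0
print(r_squared(y_true, [23.0] * 4))  # 0.0
```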


### 7. Visualize this tree.

In [68]:
arvore_final.fit(X_train, y_train)  # (re)fit so that this cell runs on its own
plt.figure(figsize=(10, 10))
tp = tree.plot_tree(arvore_final, feature_names=list(X.columns), filled=True)
