## Árvores de regressão - exercícios 02

Este exercício será uma continuação do anterior, mesma base, mesmas variáveis - vamos tentar buscar a 'melhor árvore'.

A descrição das variáveis está abaixo:

| Variavel | Descrição|
|-|-|
|CRIM| taxa de crimes per-cápita da cidade | 
|ZN| proporção de terrenos residenciais zoneados para lotes com mais de 25.000 pés quadrados. |
|INDUS| proporção de acres de negócios não varejistas por cidade |
|CHAS |vale 1 se o terreno faz fronteira com o *Charles River*; 0 caso contrário |
|NOX| Concentração de óxido nítrico (em partes por 10 milhões) |
|RM| número médio de quartos por habitação |
|AGE| proporção de unidades ocupadas pelo proprietário construídas antes de 1940 |
|DIS| distâncias ponderadas até cinco centros de empregos de Boston |
|RAD| índice de acessibilidade a rodovias radiais |
|TAX| taxa de imposto sobre a propriedade de valor total por \\$10,000 |
|PTRATIO| razão pupilo-professor da cidade |
|B| $ 1000 (Bk - 0,63) ^ 2 $ onde Bk é a proporção de negros por cidade |
|LSTAT| \%status inferior da população |
|MEDV| (variável resposta) Valor mediano das casas ocupadas pelo proprietário em US $ 1.000|

In [1]:
import pandas as pd

import seaborn as sns

from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
from sklearn.model_selection import train_test_split


boston = datasets.load_boston()
X = pd.DataFrame(boston.data, columns = boston.feature_names)
y = pd.DataFrame(boston.target, columns = ['MEDV'])


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [2]:
X.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [3]:
y.head()

Unnamed: 0,MEDV
0,24.0
1,21.6
2,34.7
3,33.4
4,36.2


### 1. Execute os passos do exercício anterior, até que você tenha uma árvore de regressão predizendo o valor do imóvel na base de treinamento.

In [7]:
boston_raw = pd.concat([X, y], axis=1)
boston_a = boston_raw[['RM', 'LSTAT', 'MEDV']].copy()
boston_a.head()

Unnamed: 0,RM,LSTAT,MEDV
0,6.575,4.98,24.0
1,6.421,9.14,21.6
2,7.185,4.03,34.7
3,6.998,2.94,33.4
4,7.147,5.33,36.2


In [8]:
X_a = boston_a.drop(columns = ['MEDV']).copy()
X_a.head()

Unnamed: 0,RM,LSTAT
0,6.575,4.98
1,6.421,9.14
2,7.185,4.03
3,6.998,2.94
4,7.147,5.33


In [9]:
y_a = boston_a.loc[:,'MEDV']
y_a.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_a, y_a, random_state=2360873)

In [11]:
regr_a = DecisionTreeRegressor(max_depth=8, min_samples_leaf=10)

regr_a.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=8, min_samples_leaf=10)

### 2.  Calcule o caminho indicado pelos CCP-alfas dessa árvore.

In [14]:
path = regr_1.cost_complexity_pruning_path(X_train, y_train)

path

{'ccp_alphas': array([0.00000000e+00, 2.11081794e-04, 3.29815303e-04, 6.33245383e-04,
        6.46437995e-04, 8.61917326e-04, 1.06860158e-03, 1.51978892e-03,
        1.58751099e-03, 2.58575198e-03, 2.96833773e-03, 2.96833773e-03,
        4.28627968e-03, 8.13104661e-03, 1.10883905e-02, 1.37906772e-02,
        1.43607446e-02, 1.58861038e-02, 1.98438874e-02, 1.98954186e-02,
        2.21679859e-02, 2.26567891e-02, 2.42561765e-02, 2.55648837e-02,
        2.59180074e-02, 2.61807388e-02, 2.81802990e-02, 3.00989446e-02,
        3.07554969e-02, 3.12407338e-02, 4.59234828e-02, 4.70453217e-02,
        4.94010554e-02, 5.05609115e-02, 5.35092348e-02, 5.72129607e-02,
        5.73928100e-02, 5.87858799e-02, 6.11210044e-02, 6.11534112e-02,
        6.19709344e-02, 6.22015957e-02, 6.33622314e-02, 6.79802111e-02,
        7.09278804e-02, 7.20338610e-02, 7.22427441e-02, 7.87659025e-02,
        7.94036731e-02, 8.66794195e-02, 9.55295891e-02, 1.04097684e-01,
        1.06636512e-01, 1.24116095e-01, 1.28916321

In [15]:
ccp_alphas, impurities = path.ccp_alphas, path.impurities

plt.figure(figsize = (10, 6))
plt.plot(ccp_alphas, impurities)
plt.xlabel("Alpha efetivo")
plt.ylabel("Impureza total das folhas")

NameError: name 'plt' is not defined

### 3. Paca cada valor de alpha obtido no item 2, treine uma árvore com o respectivo alfa, e guarde essa árvore em uma lista.

In [16]:
clfs = []

for ccp_alpha in ccp_alphas:
    clf = DecisionTreeRegressor(random_state = 0, ccp_alpha = ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

In [17]:
clfs

[DecisionTreeRegressor(random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.00021108179419528366, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.00032981530343007914, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.000633245382585326, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.0006464379947238671, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.0008619173262922902, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.0010686015831118847, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.0015197889182142614, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.0015875109938465032, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.0025857519788915688, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.002968337730870712, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.002968337730870712, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.0042862796833848405, random_state=0),
 DecisionTreeRegressor(ccp_alpha=0.008131046613896322, random_state=0),
 DecisionTree

In [18]:
tree_depths = [clf.tree_.max_depth for clf in clfs]

plt.figure(figsize = (10, 6))
plt.plot(ccp_alphas[ : -1], tree_depths[ : -1])
plt.xlabel('effective alpha')
plt.ylabel('Profundidade de árvore')

NameError: name 'plt' is not defined

### 4. Para cada árvore na lista, calcule o MSE da árvore.

In [20]:
train_scores = [mean_squared_error(y_train , clf.predict(X_train)) for clf in clfs]
test_scores  = [mean_squared_error(y_test  , clf.predict(X_test )) for clf in clfs]

NameError: name 'mean_squared_error' is not defined

### 5. Monte um gráfico do MSE pelo alpha, escolha um valor de alpha perto do ponto de mínimo do MSE

In [21]:
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("MSE")
ax.set_title("MSE x alpha do conjunto de dados de treino e teste")
ax.plot(ccp_alphas[:-1], train_scores[:-1], marker='o', label="treino",
        drawstyle="steps-post")
ax.plot(ccp_alphas[:-1], test_scores[:-1], marker='o', label="teste",
        drawstyle="steps-post")
ax.legend()
plt.show()


NameError: name 'plt' is not defined

### 6. Calcule o R-quadrado dessa árvore encontrada no item acima

In [22]:
regr_a = DecisionTreeRegressor(max_depth=8, min_samples_leaf=10, ccp_alpha=0.5)

regr_a.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.5, max_depth=8, min_samples_leaf=10)

In [23]:
regr_a.score(X_train, y_train)

0.804158876544052

### 7. Visualize esta árvore.

In [24]:
dot_data = tree.export_graphviz(regr_a, out_file=None, 
                                feature_names=X_a.columns,
                                filled=True)

graph = graphviz.Source(dot_data, format="png") 
graph

NameError: name 'graphviz' is not defined