# <center> Problem Set 12 - Machine Learning</center>

**Student:** Marianna de Pinho Severo <br>
**Student ID:** 374856 <br>
**Instructor:** Regis Pires

In this problem set, we practice decision trees, boosting, and unsupervised learning.

### Step 01: Import libraries

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.cluster import KMeans, DBSCAN
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
# SciPy hierarchical clustering
from scipy.cluster import hierarchy
# Standard-library and SciPy utilities
import math
import scipy
import itertools

### Step 02: Load the data
In this problem set, we use two datasets: [Wisconsin Diagnostic Breast Cancer (WDBC)](https://www.google.com/url?q=https%3A%2F%2Farchive.ics.uci.edu%2Fml%2Fmachine-learning-databases%2Fbreast-cancer-wisconsin%2Fwdbc.data&sa=D&sntz=1&usg=AFQjCNHGQiH_ahI6h2kbhF2AnFRjnXo7nQ) and a traffic dataset, available at [Traffic](https://raw.githubusercontent.com/datascienceinc/learn-data-science/master/Introduction-to-K-means-Clustering/Data/data_1024.csv).

In [2]:
cols_wdbc = ['id', 'diagnosis', 'radius_mean','texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
           'radius_se','texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se',
           'radius_worst','texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst']

wdbc = pd.read_csv('wdbc.data', header = None, names=cols_wdbc)
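
The 30 feature names follow a regular pattern: ten base measurements, each reported as a mean, a standard error (`se`), and a worst value. As a minimal sketch, the same column list could be generated instead of typed out. The names `base_features` and `cols_generated` are illustrative and do not appear in the original notebook.

```python
# Ten base measurements, in the order used by cols_wdbc above
base_features = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
                 'compactness', 'concavity', 'concave_points', 'symmetry',
                 'fractal_dimension']

# Identifier and label first, then the mean, se, and worst groups in turn
cols_generated = ['id', 'diagnosis'] + [
    f'{feature}_{suffix}'
    for suffix in ('mean', 'se', 'worst')
    for feature in base_features
]

print(len(cols_generated))
print(cols_generated[:4])
```

Generating the names this way makes it harder to misspell or misorder a column, at the cost of hiding the explicit list from the reader.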

In [3]:
cols_t = ['Driver_ID', 'Distance_Feature', 'Speeding_Feature']
transito = pd.read_csv('transito.csv', header = None, names= cols_t, sep='\t')

### Step 03: Brief look at the data

Below are the first five rows of each dataset.

In [4]:
wdbc.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [5]:
transito.head()

Unnamed: 0,Driver_ID,Distance_Feature,Speeding_Feature
0,3423311935,71.24,28.0
1,3423313212,52.53,25.0
2,3423313724,64.54,27.0
3,3423311373,55.69,22.0
4,3423310999,54.58,25.0


Next, we check whether either dataset has missing values.

In [6]:
wdbc.isna().sum()

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave_points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave_points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave_points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

In [7]:
transito.isna().sum()

Driver_ID           0
Distance_Feature    0
Speeding_Feature    0
dtype: int64

As the counts above show, neither dataset has missing values.

## Question 01)

### Step 01: Extract the values

In [8]:
dataset = wdbc.values

In [9]:
X = dataset[:, 2:]
y = dataset[:, 1]

In [10]:
X[:2]

array([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471,
        0.2419, 0.07871, 1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904,
        0.05372999999999999, 0.01587, 0.03003, 0.006193, 25.38, 17.33,
        184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189],
       [20.57, 17.77, 132.9, 1326.0, 0.08474, 0.07864, 0.0869,
        0.07017000000000001, 0.1812, 0.056670000000000005, 0.5435,
        0.7339, 3.398, 74.08, 0.005225, 0.013080000000000001, 0.0186,
        0.0134, 0.013890000000000001, 0.003532, 24.99, 23.41, 158.8,
        1956.0, 0.1238, 0.1866, 0.2416, 0.18600000000000003, 0.275,
        0.08902]], dtype=object)

In [11]:
y[:5]

array(['M', 'M', 'M', 'M', 'M'], dtype=object)
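
Both arrays above report `dtype=object` because `wdbc.values` mixes the string `diagnosis` column with the numeric ones. scikit-learn can still convert such an array when fitting, but a float feature matrix is usually easier to inspect and scale. The sketch below shows one possible way to get it, using a tiny stand-in frame (`toy`) with made-up layout matching the first rows shown earlier; `toy`, `mixed`, `X_float`, and `y_labels` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in with the same layout as wdbc: id, label, numeric features
toy = pd.DataFrame({
    'id': [842302, 842517],
    'diagnosis': ['M', 'M'],
    'radius_mean': [17.99, 20.57],
    'texture_mean': [10.38, 17.77],
})

# Mixing string and numeric columns forces a generic object array
mixed = toy.values

# Dropping the non-feature columns first allows a proper float matrix
X_float = toy.drop(columns=['id', 'diagnosis']).to_numpy(dtype=float)
y_labels = toy['diagnosis'].to_numpy()

print(mixed.dtype, X_float.dtype)
```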

### Step 02: Split into training, validation, and test sets

In [12]:
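# Hold out 40% of the data, stratified by class, leaving 60% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
# Split the held-out 40% in half: 20% validation, 20% test
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42, stratify=y_test)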
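
The two calls above produce a 60/20/20 split, with `stratify` keeping the class ratio roughly equal in every part. As a hedged sanity check, the sketch below repeats the same two-step split on scikit-learn's bundled copy of the WDBC data (`load_breast_cancer`), which is an assumption made so the example runs without `wdbc.data`. In that copy the labels are integers, with 0 meaning malignant, rather than the `'M'`/`'B'` strings used in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Bundled copy of WDBC (assumption: stands in for the local wdbc.data file)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Same two-step scheme as the notebook: 60% train, then 20% / 20%
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

split_sizes = {'train': len(y_tr), 'val': len(y_va), 'test': len(y_te)}

# Share of malignant samples (label 0 in this copy) in each part
malignant_share = {
    name: float(np.mean(part == 0))
    for name, part in [('all', y_demo), ('train', y_tr),
                       ('val', y_va), ('test', y_te)]
}

print(split_sizes)
print(malignant_share)
```

With the 569 rows of WDBC, this scheme should leave 114 samples in the validation set, so each validation accuracy is a multiple of 1/114.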

In [13]:
models = {}
scores = {}

In [14]:
# itertools.product replaces the nested loops below
# for md in [3, 5, 7]:
#     for ne in [50, 100, 200]:
#         for lr in [0.1, 0.05, 0.01]:
#             models[('gb', md, ne, lr)] = GradientBoostingClassifier(max_depth=md, n_estimators=ne, learning_rate=lr, random_state=42)
#             models[('gb', md, ne, lr)].fit(X_train,y_train)
#             scores[('gb', md, ne, lr)] = models[('gb', md, ne, lr)].score(X_val, y_val)

In [17]:
# Fit one model per (max_depth, n_estimators, learning_rate) combination
# and record its accuracy on the validation set
for md, ne, lr in itertools.product([3, 5, 7], [50, 100, 200], [0.1, 0.05, 0.01]):
    key = ('gb', md, ne, lr)
    models[key] = GradientBoostingClassifier(max_depth=md, n_estimators=ne, learning_rate=lr, random_state=42)
    models[key].fit(X_train, y_train)
    scores[key] = models[key].score(X_val, y_val)
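
The loop above identifies each configuration by a positional tuple. A possible alternative, sketched below, uses `ParameterGrid` from scikit-learn so that every result keeps its hyperparameter names. This is not the notebook's method: the grid is deliberately reduced to four combinations to run quickly, the data come from the bundled `load_breast_cancer` copy of WDBC, and a single 80/20 train/validation split is assumed for brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Bundled WDBC copy and a simple train/validation split (assumptions)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Reduced grid: 2 depths x 1 ensemble size x 2 learning rates = 4 fits
grid = ParameterGrid({
    'max_depth': [3, 5],
    'n_estimators': [50],
    'learning_rate': [0.1, 0.05],
})

results = []
for params in grid:
    clf = GradientBoostingClassifier(random_state=42, **params)
    clf.fit(X_tr, y_tr)
    # Store the named hyperparameters next to the validation accuracy
    results.append({**params, 'val_accuracy': clf.score(X_va, y_va)})

for row in results:
    print(row)
```

Passing `results` to `pd.DataFrame` would give a table with one named column per hyperparameter, which is easier to sort and filter than a tuple index.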

In [18]:
lista = list(scores.values())

In [19]:
pd.DataFrame(data= lista, index=scores.keys()).sort_values(by=0, ascending= False)

model,max_depth,n_estimators,learning_rate,validation_accuracy
gb,5,100,0.1,0.95614
gb,5,200,0.05,0.95614
gb,3,100,0.1,0.95614
gb,3,100,0.05,0.95614
gb,5,200,0.1,0.95614
gb,3,200,0.1,0.95614
gb,3,200,0.05,0.95614
gb,3,200,0.01,0.95614
gb,5,50,0.1,0.95614
gb,3,50,0.05,0.95614
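
Every configuration listed above reaches the same validation accuracy, so the score alone does not single out a model. One possible tie-break, sketched below, prefers the simplest tied configuration: the shallowest trees first, then the fewest estimators. The rule itself is an assumption, not something stated in the notebook, and `scores_demo` holds only a few rows copied from the table above.

```python
# A few (model, max_depth, n_estimators, learning_rate) -> accuracy entries
# copied from the table above; the real `scores` dict has more entries
scores_demo = {
    ('gb', 5, 100, 0.1): 0.95614,
    ('gb', 5, 200, 0.05): 0.95614,
    ('gb', 3, 100, 0.1): 0.95614,
    ('gb', 3, 50, 0.05): 0.95614,
}

best_score = max(scores_demo.values())
tied = [key for key, value in scores_demo.items() if value == best_score]

# Hypothetical tie-break: smallest max_depth, then smallest n_estimators
best_key = min(tied, key=lambda key: (key[1], key[2]))

print(best_score, len(tied), best_key)
```

In the notebook, the chosen key could then be used to look up the fitted model in `models` and score it once on the untouched `X_test`, `y_test`.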


## Complete