# Lesson 07 - KNN - Overfitting - Validation / cross-validation / k-fold

Topics covered in today's lesson:

* Overfitting

* KNN classifier

* Train - test - validation

* K-fold validation

In [126]:
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

import plotly.express as px
import plotly.graph_objects as go

from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier


## Overfitting


Assume a linear relationship


$$ \Large y(x) = 3x + 5 $$


We sample only a few data points, scattered and noisy

In [3]:
ruido = 20
n = 20
x = 10 + 5*np.random.randn(n)
y = 3*x + 5 + ruido*np.random.rand(n)

In [4]:
px.scatter(x=x, y=y, height=600)

In [11]:
pol = np.polyfit(x, y, 20)
x_pred_pol = np.linspace(x.min(), x.max(), 1000)
y_pred_pol = np.polyval(pol, x_pred_pol)


Polyfit may be poorly conditioned



In [12]:
lin = np.polyfit(x, y, 1)
x_pred_lin = np.linspace(x.min(), x.max(), 1000)
y_pred_lin = np.polyval(lin, x_pred_lin)

In [27]:
fig = go.Figure()

fig.add_scatter(x=x, y=y, mode='markers')
fig.add_scatter(x=x_pred_pol, y=y_pred_pol)
fig.add_scatter(x=x_pred_lin, y=y_pred_lin)
fig.update(layout_yaxis_range = [-10,80])

In [24]:
y_pred_lin_data = np.polyval(lin, x)
y_pred_pol_data = np.polyval(pol, x)

In [25]:
mean_squared_error(y, y_pred_lin_data)

33.43494070141903

In [26]:
mean_squared_error(y, y_pred_pol_data)

10.012411344128099

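In [28]:
ruido = 20
x_new = 10 + 5*np.random.randn(20)
# The new targets must be generated from the new inputs (x_new), not from the training x
y_new = 3*x_new + 5 + ruido*np.random.rand(20)

In [29]:
y_pred_lin_new = np.polyval(lin, x_new)
# Evaluate the polynomial at the new inputs x_new, not at the targets y_new
y_pred_pol_new = np.polyval(pol, x_new)

In [30]:
mean_squared_error(y_new, y_pred_lin_new)

350.17928076819226

In [31]:
mean_squared_error(y_new, y_pred_pol_new)

1.3880418832679086e+46

The two values recorded above come from a run in which `y_new` was built from `x` and the polynomial was evaluated at `y_new`; rerunning the corrected cells will print different numbers.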
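The comparison above (training error versus error on fresh data) can be repeated for several polynomial degrees. The following is a minimal sketch, not the lesson's own code: the fixed seed, the helper names `make_data` and `mse`, and the choice of degrees 1, 3 and 9 are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable

def make_data(n, ruido=20):
    """Sample n noisy points around y = 3x + 5, as in the cells above."""
    x = 10 + 5 * rng.standard_normal(n)
    y = 3 * x + 5 + ruido * rng.random(n)
    return x, y

def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(20)  # data used to fit the polynomials
x_test, y_test = make_data(20)    # fresh data the fit has never seen

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = mse(y_train, np.polyval(coefs, x_train))
    test_mse = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The error on fresh data is the number that shows whether the extra flexibility helped or only fitted the noise.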

## Cross-Validation

![Cross Validation](https://scikit-learn.org/stable/_images/grid_search_workflow.png)

## k-fold Cross Validation

![5 fold cross validation - sklearn image](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

In [32]:
ruido = 20
n = 100
x = 10 + 5*np.random.randn(n)
y = 3*x + 5 + ruido*np.random.rand(n)

In [33]:
k = 5
x_k, y_k = [], []
nk = int(n/k)
 
for i in range(k):
    x_k.append(x[i*nk: (i+1)*nk])
    y_k.append(y[i*nk: (i+1)*nk])
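Once the data is split into k folds, each fold takes a turn as the validation set while the other k - 1 folds are used for fitting. The sketch below rebuilds the same contiguous folds and evaluates a straight-line fit on each held-out fold. The fixed seed and the names `fold_mse`, `x_tr` and `y_tr` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
n, k = 100, 5
x = 10 + 5 * rng.standard_normal(n)
y = 3 * x + 5 + 20 * rng.random(n)

# Same slicing as the cell above: k contiguous folds of n // k points each.
nk = n // k
x_k = [x[i * nk:(i + 1) * nk] for i in range(k)]
y_k = [y[i * nk:(i + 1) * nk] for i in range(k)]

fold_mse = []
for i in range(k):
    # Fold i is held out; the other k - 1 folds form the training set.
    x_tr = np.concatenate([x_k[j] for j in range(k) if j != i])
    y_tr = np.concatenate([y_k[j] for j in range(k) if j != i])
    coefs = np.polyfit(x_tr, y_tr, 1)        # fit a straight line
    y_val_pred = np.polyval(coefs, x_k[i])   # predict the held-out fold
    fold_mse.append(float(np.mean((y_k[i] - y_val_pred) ** 2)))

print(fold_mse)
print("mean validation MSE:", np.mean(fold_mse))
```

Averaging the k validation errors gives an estimate that depends less on one lucky or unlucky split than a single train/test division does.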

In [37]:
x[4:10]

array([12.61691404, 13.2423002 ,  4.69803313, 22.58829132, 10.02049564,
        6.01406273])

In [34]:
x_k

[array([ 9.73404129,  6.91090475,  5.02345388, 18.97588588, 12.61691404,
        13.2423002 ,  4.69803313, 22.58829132, 10.02049564,  6.01406273,
        12.2077415 , 10.53140569,  7.12841173, 12.92946448,  7.07283777,
        10.86398743,  9.59515546,  6.13678726, 14.20818915, 12.78063782]),
 array([ 9.77828309,  7.24587996,  9.3552822 ,  5.52836831,  4.70107511,
         9.93942907, 11.86600639,  4.81023877,  6.78842959,  5.30886527,
         8.16684569,  3.40260096, 14.10294111, 16.74360383, 13.63227281,
         5.18882533,  9.79432211,  6.33345512,  9.74825223,  9.48880056]),
 array([ 5.23627902, 12.30361569,  4.07508227,  6.97815337,  7.49949951,
        11.14724689, 10.27519583,  1.04034204,  3.73271048,  8.43378516,
        12.47213671,  7.55768708,  6.44182773,  1.11501256, 13.14094678,
         7.63649246,  7.55192821, 12.34230263, 10.9419374 ,  6.94358593]),
 array([ 1.66106326, 15.49250168, 13.119441  , 10.75184011, 10.00295951,
         8.24108352, 10.37319484, 12.70328493

In [35]:
x

array([ 9.73404129,  6.91090475,  5.02345388, 18.97588588, 12.61691404,
       13.2423002 ,  4.69803313, 22.58829132, 10.02049564,  6.01406273,
       12.2077415 , 10.53140569,  7.12841173, 12.92946448,  7.07283777,
       10.86398743,  9.59515546,  6.13678726, 14.20818915, 12.78063782,
        9.77828309,  7.24587996,  9.3552822 ,  5.52836831,  4.70107511,
        9.93942907, 11.86600639,  4.81023877,  6.78842959,  5.30886527,
        8.16684569,  3.40260096, 14.10294111, 16.74360383, 13.63227281,
        5.18882533,  9.79432211,  6.33345512,  9.74825223,  9.48880056,
        5.23627902, 12.30361569,  4.07508227,  6.97815337,  7.49949951,
       11.14724689, 10.27519583,  1.04034204,  3.73271048,  8.43378516,
       12.47213671,  7.55768708,  6.44182773,  1.11501256, 13.14094678,
        7.63649246,  7.55192821, 12.34230263, 10.9419374 ,  6.94358593,
        1.66106326, 15.49250168, 13.119441  , 10.75184011, 10.00295951,
        8.24108352, 10.37319484, 12.70328493,  8.39075608, 10.77

sklearn - https://scikit-learn.org/stable/modules/cross_validation.html
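scikit-learn can generate the same kind of fold indices with `KFold`. The sketch below uses a tiny toy array (an assumption, chosen so the indices are easy to read) rather than the lesson's data.

```python
import numpy as np
from sklearn.model_selection import KFold

x_toy = np.arange(10)  # toy data (an assumption): 10 samples

kf = KFold(n_splits=5)  # shuffle=False by default, so folds are contiguous
splits = list(kf.split(x_toy))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
```

Passing `shuffle=True` together with a `random_state` would shuffle the samples before they are divided into folds.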

## KNN: K Nearest Neighbours

Titanic dataset

https://www.kaggle.com/c/titanic

In [106]:
df_train = pd.read_csv('train.csv')

In [100]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [101]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [102]:
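# Restrict the correlation to numeric columns; recent pandas versions raise an error on text columns otherwise
corr_matrix = df_train.corr(numeric_only=True)
corr_matrix["Survived"]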

PassengerId   -0.005007
Survived       1.000000
Pclass        -0.338481
Age           -0.077221
SibSp         -0.035322
Parch          0.081629
Fare           0.257307
Name: Survived, dtype: float64

In [103]:
# px.bar(df_train, x='Sex', y='Survived')
px.bar(df_train, x='Embarked', y='Survived')

In [107]:
# Columns left out of the model
dropped_features = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'Age']
df_train = df_train.drop(dropped_features, axis=1)


In [108]:
df_train.head()

| | Survived | Pclass | Sex | Fare | Embarked |
|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 7.25 | S |
| 1 | 1 | 1 | female | 71.2833 | C |
| 2 | 1 | 3 | female | 7.925 | S |
| 3 | 1 | 1 | female | 53.1 | S |
| 4 | 0 | 3 | male | 8.05 | S |


In [109]:
df_train.describe()

| | Survived | Pclass | Fare |
|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 32.204208 |
| std | 0.486592 | 0.836071 | 49.693429 |
| min | 0.0 | 1.0 | 0.0 |
| 25% | 0.0 | 2.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 31.0 |
| max | 1.0 | 3.0 | 512.3292 |


In [110]:
label_encoder = LabelEncoder()

# Encode the categorical columns as integers (Sex: female=0, male=1)
df_train['Sex'] = label_encoder.fit_transform(df_train['Sex'])
df_train['Embarked'] = label_encoder.fit_transform(df_train['Embarked'])

In [111]:
df_train.head()


| | Survived | Pclass | Sex | Fare | Embarked |
|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 7.25 | 2 |
| 1 | 1 | 1 | 0 | 71.2833 | 0 |
| 2 | 1 | 3 | 0 | 7.925 | 2 |
| 3 | 1 | 1 | 0 | 53.1 | 2 |
| 4 | 0 | 3 | 1 | 8.05 | 2 |
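The integer codes in the table above follow from how `LabelEncoder` works: it sorts the distinct labels and numbers them from 0. A small sketch on a hand-written list (the list and the names `encoder` and `codes` are assumptions for illustration) makes the mapping visible.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ holds the sorted labels; the position of each label is its integer code
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps integer codes back to the original labels
print(list(encoder.inverse_transform([0, 1])))
```

Note that these integers imply an ordering that categories such as the port of embarkation do not really have, which matters for a distance-based method such as KNN.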


In [114]:
train_test_split?

Signature:
train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
)
Docstring:
Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and
``next(ShuffleSplit().split(X, y))`` and application to input data
into a single call for splitting (and optionally subsampling) data in a
oneliner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
*arrays : sequence of indexables with same length / shape[0]
    Allowed inputs are lists, numpy arrays, scipy-sparse
 

In [112]:
X = df_train.iloc[:, 1:5].values
y = df_train.iloc[:, 0].values

In [115]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

In [116]:
X_train.shape, y_train.shape

((623, 4), (623,))

In [117]:
X_test.shape, y_test.shape

((268, 4), (268,))

In [120]:
classificador = KNeighborsClassifier(n_neighbors=3)
classificador.fit(X_train, y_train)
y_pred_test = classificador.predict(X_test)
# accuracy_score expects (y_true, y_pred); accuracy itself is symmetric, so the value is unchanged
print(accuracy_score(y_test, y_pred_test))

0.7686567164179104
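To see what the classifier does when it predicts, here is a minimal from-scratch sketch of the k-nearest-neighbours idea: measure the distance from the query point to every training point, keep the k closest, and take a majority vote. The function `knn_predict` and the toy data are hypothetical and written only for illustration; this is not how `KNeighborsClassifier` is implemented internally.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a label for each row of X_query by majority vote among
    its k nearest training points (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for q in X_query:
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))  # distance to every training point
        nearest = np.argsort(dists)[:k]                     # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])             # most frequent label wins
    return np.array(preds)

# Toy data (hypothetical): two well-separated groups in 2D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, [[0.5, 0.5], [5.5, 5.5]], k=3))  # prints [0 1]
```

In this sketch a tied vote goes to the smallest label, which is a simplification. Because the prediction depends on raw distances, features with large ranges (such as `Fare`) weigh more than features with small ranges.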



Hyperparameters: what is the best number of neighbours?

In [121]:
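k_result = []

# Test accuracy for each number of neighbours k = 1..14
for i in range(1, 15):
    classificador = KNeighborsClassifier(n_neighbors=i)
    classificador.fit(X_train, y_train)
    y_pred_test = classificador.predict(X_test)
    k_result.append(accuracy_score(y_test, y_pred_test))

In [124]:
# Put k on the x-axis; the default would plot the list index 0..13 instead
px.line(x=list(range(1, 15)), y=k_result)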

## K-Fold 

In [129]:
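classificador = KNeighborsClassifier(n_neighbors=5)
# 5-fold cross-validation (cv=5 is also the default).
# For repeated random 70/30 splits instead, pass
# cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=0).
scores = cross_val_score(classificador, X, y, cv=5)
print(scores)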

[0.72625698 0.71348315 0.81460674 0.79213483 0.80337079]


In [132]:
k_result = []
for i in range(1, 15):
    classificador = KNeighborsClassifier(n_neighbors=i)
    # 5-fold cross-validation on the training set only; the test set stays untouched
    scores = cross_val_score(classificador, X_train, y_train, cv=5)
    k_result.append(scores.mean())

In [134]:
# Put k on the x-axis; the default would plot the list index 0..13 instead
px.line(x=list(range(1, 15)), y=k_result)

In [137]:
classificador = KNeighborsClassifier(n_neighbors=1)
classificador.fit(X_train, y_train)
y_pred_test = classificador.predict(X_test)
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(y_test, y_pred_test))

0.7947761194029851
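The loop over `n_neighbors` followed by a final check on the test set is the workflow that `GridSearchCV` automates. The sketch below is one possible way to write it. It runs on synthetic data from `make_classification` (an assumption, so that it works without `train.csv`), and the names `X_demo`, `y_demo` and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Titanic features (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Try k = 1..14, scoring each candidate with 5-fold cross-validation
# on the training portion only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 15))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("best k:", search.best_params_["n_neighbors"])
print("mean CV accuracy:", search.best_score_)
# The held-out test set is used once, after k has been chosen.
print("test accuracy:", search.score(X_te, y_te))
```

Choosing k from cross-validation scores and touching the test set only at the end keeps the final accuracy an honest estimate, since the test data played no part in selecting the hyperparameter.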
