# Data Mining

## Lesson 4 - Classification Exercise with kNN

13/10/2017

**Dataset:** Titanic: Machine Learning from Disaster

https://www.kaggle.com/c/titanic/data

Building on the previous lesson:

1. Update the function that computes the Euclidean distance to use the scikit-learn package

2. Implement a function that selects the k nearest neighbors (k > 1)

3. Implement a function that receives the k nearest neighbors and determines the correct class

4. Convert the categorical features to numeric ones (tip: pandas or scikit-learn)

5. Assess whether the numeric features need to be normalized (tip: pandas or scikit-learn)

6. Select features based on correlation (tip: pandas)

7. Split the dataset into training (75%) / test (25%) / validation (10% of training)

8. Run the classifier for 30 values of k in steps of 4 and report every accuracy on the validation set (which k is best?) [plot the results]

9. Run the classifier with the best k found on the test set and present a precision report (tip: scikit-learn) [plot the results]

In [130]:
import numpy as np
from sklearn import datasets
from sklearn.neighbors import DistanceMetric
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

In [131]:
class KNNClassifier(object):
    def __init__(self):
        self.X_train = None
        self.y_train = None

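    # Previous version, kept for reference:
    # def euc_distance(self, a, b):
    #     return np.linalg.norm(a - b)

    def euc_distance(self, a, b):
        # pairwise() returns the 2x2 matrix of distances between a and b;
        # entry [0][-1] is the distance from a to b
        dist = DistanceMetric.get_metric('euclidean')
        ndarray = dist.pairwise([a, b])
        distance = ndarray[0][-1]
        return distance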
    
    def closest(self, row):
        distance_array = []
        for i in self.X_train:
            distance_array.append(self.euc_distance(i, row))
            
        nearest_neighbor = distance_array.index(min(distance_array))
        return self.y_train[nearest_neighbor]

    def k_closests(self, row, n_of_neighbors):
        # Distance from `row` to every training sample
        distance_array = [self.euc_distance(i, row) for i in self.X_train]
        # Indices of the k smallest distances. argsort keeps the original
        # positions; deleting items from the list would shift the indices
        # and point at the wrong labels from the second neighbor onward.
        k_nearest_array = np.argsort(distance_array)[:n_of_neighbors]
        return self.y_train[k_nearest_array]
    
    def get_neighbor_class(self, neighbors):
        return self.y_train[neighbors]
    
    def get_closest_class(self, neighbor_classes):
        counter = Counter(neighbor_classes)
        return counter.most_common(1)[0][0]
    
    def fit(self, training_data, training_labels):
        self.X_train = training_data
        self.y_train = training_labels

    def predict(self, to_classify):
        predictions = []
        for row in to_classify:
            label = self.closest(row)
            predictions.append(label)
        return predictions
    
    def predict_2(self, to_classify, n_of_neighbors):
        predictions = []
        for row in to_classify:
            nearest_classes = self.k_closests(row, n_of_neighbors)
            label = self.get_closest_class(nearest_classes)
            predictions.append(label)
        return predictions
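
The class above measures one pair of points at a time. As a point of comparison, here is a minimal vectorized sketch of the same idea (distances, k smallest, majority vote) written as a standalone function. The name `knn_predict` and the toy data are assumptions made for this illustration and are not part of the exercise.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_query, k):
    """Label each query row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for row in np.asarray(X_query, dtype=float):
        # Euclidean distance from this row to every training row at once
        distances = np.linalg.norm(X_train - row, axis=1)
        # argsort returns positions in the original array, so they index y_train directly
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of the k nearest rows
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy data (made up for illustration): two well-separated groups
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y_toy = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], k=3))  # [0, 1]
```

With an even k and two classes a tie is possible, so odd values of k are a common choice for binary problems.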

### Using the Titanic dataset

In [132]:
import pandas as pd

In [133]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Now let's remove irrelevant columns

In [134]:
df.drop(['Ticket', 'Name'], axis=1, inplace=True)

In [135]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,7.25,,S
1,2,1,1,female,38.0,1,0,71.2833,C85,C
2,3,1,3,female,26.0,0,0,7.925,,S
3,4,1,1,female,35.0,1,0,53.1,C123,S
4,5,0,3,male,35.0,0,0,8.05,,S


#### Looking at the data, some rows have no passenger age; for those rows we assume the mean passenger age

In [136]:
df["Age"] = df.Age.fillna(df.Age.mean())
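
A small sketch of what this mean imputation does, on a made-up three-row frame (the values are not from the Titanic data). The age 29.699118 that appears in a later `df.head()` output is presumably such an imputed mean.

```python
import numpy as np
import pandas as pd

# Toy frame (values made up) with one missing age
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

n_missing = int(toy["Age"].isna().sum())   # how many rows will be imputed
mean_age = toy["Age"].mean()               # mean() skips NaN by default
toy["Age"] = toy["Age"].fillna(mean_age)

print(n_missing, mean_age)      # 1 30.0
print(toy["Age"].tolist())      # [20.0, 30.0, 40.0]
```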

#### There are also rows with missing Cabin data, so we remove those rows

In [137]:
df = df.dropna(axis=0, how='any')
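
`dropna(how='any')` discards a row if any column is missing, so it can be worth checking how much data it removes; in this notebook only 202 rows remain afterwards (see the shapes further down). The sketch below shows one way to inspect this, on a made-up frame whose values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (values made up): Cabin is missing more often than Embarked
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
})

missing_per_column = toy.isna().sum()     # missing values in each column
kept = toy.dropna(axis=0, how="any")      # same call as in the notebook

print(missing_per_column)
print(len(toy), "->", len(kept))          # 4 -> 1
```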

In [138]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
1,2,1,1,female,38.000000,1,0,71.2833,C85,C
3,4,1,1,female,35.000000,1,0,53.1000,C123,S
6,7,0,1,male,54.000000,0,0,51.8625,E46,S
10,11,1,3,female,4.000000,1,1,16.7000,G6,S
11,12,1,1,female,58.000000,0,0,26.5500,C103,S
21,22,1,2,male,34.000000,0,0,13.0000,D56,S
23,24,1,1,male,28.000000,0,0,35.5000,A6,S
27,28,0,1,male,19.000000,3,2,263.0000,C23 C25 C27,S
31,32,1,1,female,29.699118,1,0,146.5208,B78,C
52,53,1,1,female,49.000000,1,0,76.7292,D33,C


### Now let's encode the categorical labels as numeric labels

In [139]:
from sklearn.preprocessing import LabelEncoder

In [140]:
le_sex = LabelEncoder()
le_cabin = LabelEncoder()
le_embarked = LabelEncoder()

In [141]:
le_sex.fit(df.Sex)
le_cabin.fit(df.Cabin)
le_embarked.fit(df.Embarked)

LabelEncoder()

In [142]:
df["Sex"] = le_sex.transform(df["Sex"])
df["Cabin"] = le_cabin.transform(df["Cabin"])
df["Embarked"] = le_embarked.transform(df["Embarked"])

In [143]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
1,2,1,1,0,38.0,1,0,71.2833,80,0
3,4,1,1,0,35.0,1,0,53.1,54,2
6,7,0,1,1,54.0,0,0,51.8625,128,2
10,11,1,3,0,4.0,1,1,16.7,144,2
11,12,1,1,0,58.0,0,0,26.55,48,2
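
`LabelEncoder` numbers the categories in sorted order, which is why `C` became 0 and `S` became 2 in the output above. For a distance-based model such as kNN, those integers imply an ordering that the categories may not really have; one-hot encoding is a common alternative. The sketch below compares the two on a made-up column and is only an illustration, not the approach used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (values made up for illustration)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# LabelEncoder assigns integers following the sorted order of the classes
le = LabelEncoder().fit(toy["Embarked"])
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)                      # {'C': 0, 'Q': 1, 'S': 2}

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")
print(one_hot.columns.tolist())     # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```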


#### Checking the correlation of the data

In [144]:
df.corr()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
PassengerId,1.0,0.111985,-0.084147,0.000877,0.028736,-0.081137,-0.064538,0.017465,-0.072897,0.031825
Survived,0.111985,1.0,-0.030513,-0.545297,-0.231887,0.138202,0.042456,0.128261,0.038628,-0.13091
Pclass,-0.084147,-0.030513,1.0,-0.060014,-0.287184,-0.086972,0.056288,-0.31174,0.494,0.170303
Sex,0.000877,-0.545297,-0.060014,1.0,0.167361,-0.152552,-0.110574,-0.137185,-0.083768,0.096805
Age,0.028736,-0.231887,-0.287184,0.167361,1.0,-0.139881,-0.246928,-0.07668,-0.125576,-0.090462
SibSp,-0.081137,0.138202,-0.086972,-0.152552,-0.139881,1.0,0.262348,0.291777,0.056745,0.002228
Parch,-0.064538,0.042456,0.056288,-0.110574,-0.246928,0.262348,1.0,0.38497,0.001291,0.061455
Fare,0.017465,0.128261,-0.31174,-0.137185,-0.07668,0.291777,0.38497,1.0,-0.262818,-0.239213
Cabin,-0.072897,0.038628,0.494,-0.083768,-0.125576,0.056745,0.001291,-0.262818,1.0,0.231418
Embarked,0.031825,-0.13091,0.170303,0.096805,-0.090462,0.002228,0.061455,-0.239213,0.231418,1.0
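
In the matrix above, `Sex` (-0.545) and `Age` (-0.232) have the largest absolute correlation with `Survived`. One possible way to turn such a matrix into a feature selection is to rank features by absolute correlation with the target and keep those above a cut-off. The sketch below does this on synthetic data; the column names, the data and the 0.3 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (made up): "signal" drives the target, "noise" does not
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=200),
    "target": (signal > 0).astype(int),
})

# Absolute correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(corr_with_target)

threshold = 0.3   # arbitrary cut-off chosen for this sketch
selected = corr_with_target[corr_with_target > threshold].index.tolist()
print(selected)
```

Correlation only captures linear relationships, so a low value does not prove a feature is useless.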


#### Normalizing the numeric features

In [145]:
from sklearn.preprocessing import MinMaxScaler

In [146]:
df = MinMaxScaler().fit_transform(df)

In [148]:
df

array([[ 0.        ,  1.        ,  0.        , ...,  0.13913574,
         0.55172414,  0.        ],
       [ 0.00225225,  1.        ,  0.        , ...,  0.1036443 ,
         0.37241379,  1.        ],
       [ 0.00563063,  0.        ,  0.        , ...,  0.10122886,
         0.88275862,  1.        ],
       ..., 
       [ 0.98873874,  1.        ,  0.        , ...,  0.16231419,
         0.47586207,  0.        ],
       [ 0.99774775,  1.        ,  0.        , ...,  0.0585561 ,
         0.2       ,  1.        ],
       [ 1.        ,  1.        ,  0.        , ...,  0.0585561 ,
         0.40689655,  0.        ]])
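
kNN relies on distances, so a column with a wide range (such as `Fare`) would otherwise dominate columns with a narrow one. `MinMaxScaler` maps each column to [0, 1] with (x - min) / (max - min); the sketch below checks that formula against the scaler on a made-up matrix. Note that the notebook scales the whole frame, including `PassengerId` and `Survived`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix (values made up): two columns on very different scales
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaled = MinMaxScaler().fit_transform(X_toy)

# Same result by hand: (x - min) / (max - min), column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))   # True
```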

#### Splitting into training and test sets

In [150]:
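# The target is Survived (column 1); the last column is Embarked, a feature.
# Every column except Survived goes into X.
X = np.delete(df, 1, axis=1)

In [152]:
X.shape

(202, 9)

In [154]:
y = df[:, 1]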

In [155]:
y.shape

(202,)

In [157]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
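
The remaining tasks (validation split, sweep over k, final report on the test set) are not carried out in this section. The sketch below shows one way they could look, under several assumptions: it uses synthetic data from `make_classification` as a stand-in for the prepared Titanic features, scikit-learn's `KNeighborsClassifier` instead of the custom class, and a sweep that starts at k = 1. It illustrates the procedure only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared features (assumption of this sketch)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# 75% train / 25% test, then 10% of the training part held out for validation
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tr, y_tr, test_size=0.10, random_state=0
)

# 30 values of k in steps of 4: 1, 5, 9, ..., 117
ks = list(range(1, 1 + 4 * 30, 4))
val_acc = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_acc.append(accuracy_score(y_val, model.predict(X_val)))

# First k that reaches the highest validation accuracy
best_k = ks[int(np.argmax(val_acc))]
print("best k:", best_k)

# Final check on the untouched test set with the chosen k
best_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(classification_report(y_te, best_model.predict(X_te)))
```

Plotting `ks` against `val_acc` (for example with matplotlib) would give the requested chart. The largest k must not exceed the number of training rows, which is worth checking on a dataset as small as the 202 rows kept here.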