# <center> Lista 09 - Aprendizado de Máquina </center>

**Aluno(a):** Marianna de Pinho Severo <br>
**Matrícula:** 374856 <br>
**Professor(a):** Regis Pires

Nesta lista utilizaremos o [dataset Iris](https://www.google.com/url?q=https%3A%2F%2Farchive.ics.uci.edu%2Fml%2Fmachine-learning-databases%2Firis%2Firis.data&sa=D&sntz=1&usg=AFQjCNFKq79DXPZbLNQzSgdmE8keMrY2ow) para estudarmos os modelos K Nearest Neighbors (KNN) e Medidas de Distância.

### Passo 01: Importar bibliotecas

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

### Passo 02: Carregar o conjunto de dados

In [2]:
flowers = pd.read_csv('dataset/iris.data', sep=',', header=None)

### Passo 03: Breve análise dos dados

In [3]:
flowers.head()

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [4]:
flowers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
0    150 non-null float64
1    150 non-null float64
2    150 non-null float64
3    150 non-null float64
4    150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


In [5]:
flowers[4].value_counts()

Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: 4, dtype: int64

### Passo 04: Transformar os labels para o formato numérico

In [6]:
le = LabelEncoder()
flowers[4] = le.fit_transform(flowers[4])

## Questão 03: Implementar medidas de distância

Nesta questão, implementaremos alguns algoritmos para calcular a distância entre dois pontos.

### Distância 01: Minkowiski

A distância Minkowiski é dada pela seguinte fórmula $$ d(x,y) = (\sum_{i=1}^{k}(|x_i - y_i|)^p)^{\frac{1}{p}} $$

In [7]:
def minkowiski_distance(X, row, p):
    X_ = abs(X - row)**p
    return (np.sum(X_, axis=1))**(1/p)

### Distância 02: Euclidian

A distância euclidiana é dada pela seguinte fórmula $$ d(x,y) = \sqrt{\sum_{i=1}^{k}(x_i - y_i)^2}$$

In [8]:
def euclidian_distance(X, row):
    return minkowiski_distance(X, row, 2)

### Distância 03: Manhattan

A distância Manhattan é dada pela seguinte fórmula $$ d(x,y) = \sum_{i=1}^{k}|x_i - y_i| $$

In [9]:
def manhattan_distance(X, row):
    return minkowiski_distance(X,row,1)

### Distância 04: Chebyshev

A distância Chebyshev é dada pela seguinte fórmula $$ d(x,y) = \max(|x - y|) $$

In [10]:
def chebyshev_distance(X, row):
    X_ = X - row
    return np.max(X_, axis=1)

## Questão 01: Implementar as classes KNN

### Passo 01: Criar classe KNNModel

In [11]:
class KNNModel_lista:
    
    def __init__(self, n_neighbors = 3, p = 2, metric='minkowski'):
        self.n_neighbors = n_neighbors
        self.p = p
        self.metric = metric

### Passo 02: Criar subclasse KNNClassifier

In [12]:
class KNNClassifier_lista(KNNModel_lista):
    
    def __init__(self, n_neighbors = 3, p = 2, metric='minkowski'):
        super().__init__(n_neighbors, p, metric)
    
    def fit(self, X, y):
        self.X = X
        self.y = y
    
    def get_idx_KNN(self, Row):
        neighbors = []

        if(self.metric == 'minkowski'):
            for line in Row:
                dist = minkowiski_distance(self.X, line,self.p)
                idx_sort = np.argsort(dist)
                neighbors.append(idx_sort[0:self.n_neighbors])
        elif(self.metric == 'chebyshev'):
            for line in Row:
                dist = chebyshev_distance(self.X, line)
                idx_sort = np.argsort(dist)
                neighbors.append(idx_sort[0:self.n_neighbors])

        return neighbors
    
    def predict(self, X_test):
        idx_kNN = self.get_idx_KNN(X_test)
        classes = []
        
        for idx in idx_kNN:
            count = np.bincount(self.y[idx])
            classes.append(np.argmax(count))

        return np.array(classes)

### Passo 03: Criar subclasse KNNRegressor

In [13]:
class KNNRegressor_lista(KNNModel_lista):
    
    def __init__(self, n_neighbors = 3, p = 2, metric='minkowski'):
        super().__init__(n_neighbors, p, metric)
    
    def fit(self, X, y):
        self.X = X
        self.y = y
    
    def get_idx_KNN(self, Row):
        neighbors = []

        if(self.metric == 'minkowski'):
            for line in Row:
                dist = minkowiski_distance(self.X, line,self.p)
                idx_sort = np.argsort(dist)
                neighbors.append(idx_sort[0:self.n_neighbors])
        elif(self.metric == 'chebyshev'):
            for line in Row:
                dist = chebyshev_distance(self.X, line)
                idx_sort = np.argsort(dist)
                neighbors.append(idx_sort[0:self.n_neighbors])

        return neighbors
    
    def predict(self, X_test):
        idx_kNN = self.get_idx_KNN(X_test)
        reg = []
        
        for idx in idx_kNN:
            reg.append(np.mean(self.y[idx]))
        
        return np.array(reg)

## Questão 02: Instanciar e avaliar modelos

### Passo 01: Separar dados de entrada e saída

In [14]:
X = flowers.values[:, :-1]
y = flowers.values[:, -1].astype(int)

### Passo 02: Criar conjunto de treino e teste

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.9, stratify = y, random_state=42)

### Passo 03: Standardizar as features

In [16]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

### Passo 04: Instanciar modelos

#### Feitos na Lista

In [17]:
model_list = {}
model_list['k1p1'] = KNNClassifier_lista(n_neighbors=1, p=1)
model_list['k1p2'] = KNNClassifier_lista(n_neighbors=1, p=2)
model_list['k1cheb'] = KNNClassifier_lista(n_neighbors=1, metric='chebyshev')

model_list['k3p1'] = KNNClassifier_lista(n_neighbors=3, p=1)
model_list['k3p2'] = KNNClassifier_lista(n_neighbors=3, p=2)
model_list['k3cheb'] = KNNClassifier_lista(n_neighbors=3, metric='chebyshev')

model_list['k5p1'] = KNNClassifier_lista(n_neighbors=5, p=1)
model_list['k5p2'] = KNNClassifier_lista(n_neighbors=5, p=2)
model_list['k5cheb'] = KNNClassifier_lista(n_neighbors=5, metric='chebyshev')

#### Do SckitLearn

In [18]:
model_sckit = {}
model_sckit['k1p1'] = KNeighborsClassifier(n_neighbors=1, p=1)
model_sckit['k1p2'] = KNeighborsClassifier(n_neighbors=1, p=2)
model_sckit['k1cheb'] = KNeighborsClassifier(n_neighbors=1, metric='chebyshev')

model_sckit['k3p1'] = KNeighborsClassifier(n_neighbors=3, p=1)
model_sckit['k3p2'] = KNeighborsClassifier(n_neighbors=3, p=2)
model_sckit['k3cheb'] = KNeighborsClassifier(n_neighbors=3, metric='chebyshev')

model_sckit['k5p1'] = KNeighborsClassifier(n_neighbors=5, p=1)
model_sckit['k5p2'] = KNeighborsClassifier(n_neighbors=5, p=2)
model_sckit['k5cheb'] = KNeighborsClassifier(n_neighbors=5, metric='chebyshev')

### Passo 05: Fazer a predição com os modelos

In [19]:
y_pred_list = {}
y_pred_sckit = {}

In [20]:
for key in model_list:
    model_list[key].fit(X_train, y_train)
    model_sckit[key].fit(X_train, y_train)
    
    y_pred_list[key] = model_list[key].predict(X_test)
    y_pred_sckit[key] = model_sckit[key].predict(X_test)

### Passo 06: Comparar resultados entre implementações

In [23]:
for key in y_pred_list:
    print("MODEL: {} | ACC: {}" .format(key,accuracy_score(y_test, y_pred_list[key])))

MODEL: k1p1 | ACC: 0.9555555555555556
MODEL: k1p2 | ACC: 0.9555555555555556
MODEL: k1cheb | ACC: 0.6074074074074074
MODEL: k3p1 | ACC: 0.9259259259259259
MODEL: k3p2 | ACC: 0.8888888888888888
MODEL: k3cheb | ACC: 0.5925925925925926
MODEL: k5p1 | ACC: 0.8962962962962963
MODEL: k5p2 | ACC: 0.8814814814814815
MODEL: k5cheb | ACC: 0.6222222222222222


In [24]:
for key in y_pred_sckit:
    print("MODEL: {} | ACC: {}" .format(key,accuracy_score(y_test, y_pred_sckit[key])))

MODEL: k1p1 | ACC: 0.9555555555555556
MODEL: k1p2 | ACC: 0.9555555555555556
MODEL: k1cheb | ACC: 0.9111111111111111
MODEL: k3p1 | ACC: 0.9259259259259259
MODEL: k3p2 | ACC: 0.8888888888888888
MODEL: k3cheb | ACC: 0.7851851851851852
MODEL: k5p1 | ACC: 0.8962962962962963
MODEL: k5p2 | ACC: 0.8814814814814815
MODEL: k5cheb | ACC: 0.7851851851851852


Para analisar os modelos, escolhemos a acurácia como métrica. É possível observar que os modelos implementados por nós e os implementados pelo sklearn apresentaram os mesmos valores de acurácia, exceto quando a distância utilizada foi a chebyshev - nesse caso, o modelo do sklearn apresentou acurácia maior. Além disso, a maior acurácia foi obtida quando utilizamos k = 1 e p = 1 ou p = 2, em ambos os modelos. Isso mostra que as implementações da distância Chebyshev apresentam diferenças.