# Introduction to KNN

**Objectives**

* Illustrate the Euclidean distance
* Demonstrate KNN for classification
* Demonstrate data preprocessing for KNN
* Demonstrate KNN for regression

**Data Set Characteristics**

* Rows: 59
* Columns: 7
* File format: txt

In [36]:
# KNN uses the Euclidean distance
# A distance cannot be negative: squaring the difference and taking the square root removes the sign
2-5
(2-5)**2
((2-5)**2)**(0.5)

3.0

In [37]:
# Example: distance between two points
a = [5, 0.75]
b = [2, 0.50]
((a[0] - b[0])**2 + (a[1] - b[1])**2)**0.5

3.010398644698074
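
The same calculation generalizes to points with any number of coordinates. The following is a minimal sketch; the helper name `euclidean_distance` is an assumption introduced for illustration and does not appear in the notebook.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

a = [5, 0.75]
b = [2, 0.50]
print(euclidean_distance(a, b))
```

The printed value should agree with the result above up to floating-point rounding.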

### KNN for classification (class labels)

In [38]:
# Importing the classifier class and pandas
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

In [39]:
# Instantiating the classifier and storing it in a variable
# Each new sample is classified according to its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
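
To make the idea of "3 nearest neighbors" concrete, here is a from-scratch sketch of the voting rule. It is not scikit-learn's implementation; the function name `knn_classify` and the toy data are assumptions made up for illustration, and ties are broken arbitrarily here.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, paired with that point's label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two well-separated groups
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(points, labels, [1.1, 1.0]))
print(knn_classify(points, labels, [5.1, 5.0]))
```

Each query lands next to one of the two groups, so all three of its neighbors carry the same label and the vote is unanimous.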

In [40]:
# We will work with the fruit data set again
data = pd.read_table('fruit_data_with_colors.txt')

In [41]:
# Features
X = data[['mass','height','width','color_score']]

# Labels
y = data['fruit_label']

In [42]:
from sklearn.model_selection import train_test_split

In [43]:
# By mathematical convention, an uppercase letter (X) denotes a matrix and a lowercase letter (y) denotes a vector
X_train, X_test, y_train, y_test = train_test_split(X,y)
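
With its default settings, `train_test_split` shuffles the rows and holds out 25% of them for testing. The sketch below imitates that behaviour in plain Python to show where the sizes come from; `simple_train_test_split` is a hypothetical helper, not the scikit-learn function.

```python
import math
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=None):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Round the test size up, so 25% of 59 rows becomes 15 rows
    n_test = math.ceil(test_fraction * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(59))  # stand-in for the 59 rows of the fruit data set
train, test = simple_train_test_split(rows, seed=0)
print(len(train), len(test))  # 44 15
```

Because the split is random, results change between runs; passing `random_state=` to scikit-learn's `train_test_split` fixes the shuffle and makes a run reproducible.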

In [44]:
knn.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=3)

In [45]:
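# Evaluating the model: score returns the mean accuracy on the test set
# train_test_split divides the data at random, so the score varies from run to run
# As a rough rule of thumb, an accuracy above 70% to 80% is good, although this depends on the problem
# The order of magnitude of each feature influences the distances KNN computes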
knn.score(X_test,y_test)

0.7333333333333333
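
The effect of feature magnitude on the distance can be seen with a few numbers. The fruits below are hypothetical, with made-up values for mass (in grams) and `color_score`; the sketch only illustrates why a feature measured in the hundreds dominates one measured between 0 and 1.

```python
import math

# Hypothetical fruits described as [mass in grams, color_score]; values are made up
fruit_a = [150.0, 0.55]
fruit_b = [160.0, 0.95]  # similar mass, very different color
fruit_c = [190.0, 0.55]  # different mass, identical color

d_ab = math.dist(fruit_a, fruit_b)
d_ac = math.dist(fruit_a, fruit_c)
print(d_ab, d_ac)

# With raw units, the 10 g gap accounts for almost all of d_ab;
# the 0.40 gap in color_score barely changes it
mass_only = abs(fruit_a[0] - fruit_b[0])
print(d_ab - mass_only)
```

On the raw values, `fruit_b` looks much closer to `fruit_a` than `fruit_c` does, even though its color differs across a large share of the 0 to 1 range. Rescaling the features puts them on comparable footing.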

### Preprocessing the data for KNN

In [46]:
# Class for rescaling the features
from sklearn.preprocessing import MinMaxScaler

In [47]:
# Instantiating the scaler
mm = MinMaxScaler()

In [48]:
# Learns each feature's minimum and maximum from the training data, then rescales the training data to the range [0, 1]
X_train = mm.fit_transform(X_train)

X_train

array([[0.1958042 , 0.72307692, 0.        , 0.47368421],
       [0.44755245, 0.8       , 0.44736842, 0.57894737],
       [0.36363636, 0.43076923, 0.57894737, 0.10526316],
       [0.31468531, 0.50769231, 0.28947368, 1.        ],
       [0.15384615, 0.67692308, 0.05263158, 0.5       ],
       [0.13986014, 0.56923077, 0.13157895, 0.44736842],
       [0.41958042, 0.87692308, 0.39473684, 0.44736842],
       [0.41258741, 0.96923077, 0.36842105, 0.39473684],
       [0.40559441, 0.50769231, 0.68421053, 0.        ],
       [0.46853147, 0.61538462, 0.52631579, 0.71052632],
       [0.43356643, 1.        , 0.39473684, 0.44736842],
       [0.28671329, 0.53846154, 0.34210526, 0.63157895],
       [0.93006993, 0.83076923, 0.84210526, 0.52631579],
       [0.23776224, 0.52307692, 0.26315789, 0.52631579],
       [0.27272727, 0.53846154, 0.34210526, 0.60526316],
       [0.23076923, 0.58461538, 0.47368421, 0.52631579],
       [0.        , 0.        , 0.        , 0.68421053],
       [0.13986014, 0.63076923,

In [49]:
# Rescales the test data using the minimum and maximum learned from the training data
X_test = mm.transform(X_test)

X_test

array([[0.01398601, 0.04615385, 0.        , 0.57894737],
       [0.34265734, 0.93846154, 0.39473684, 0.44736842],
       [0.02797203, 0.09230769, 0.05263158, 0.63157895],
       [0.48951049, 0.95384615, 0.39473684, 0.42105263],
       [0.03496503, 0.10769231, 0.10526316, 0.65789474],
       [0.22377622, 0.47692308, 0.39473684, 0.84210526],
       [0.27272727, 0.49230769, 0.36842105, 0.71052632],
       [0.3006993 , 0.47692308, 0.44736842, 0.73684211],
       [0.27972028, 0.47692308, 0.5       , 0.36842105],
       [0.33566434, 0.55384615, 0.34210526, 0.97368421],
       [0.14685315, 0.61538462, 0.02631579, 0.44736842],
       [0.22377622, 0.47692308, 0.23684211, 0.44736842],
       [0.35664336, 0.58461538, 0.34210526, 0.97368421],
       [0.36363636, 0.64615385, 0.47368421, 0.63157895],
       [0.33566434, 0.46153846, 0.42105263, 0.89473684]])
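
The two scaling steps above can be sketched in plain Python: the minimum and maximum are learned from the training rows only, and the same values are then reused for the test rows. The helper names `fit_min_max` and `apply_min_max` and the sample values are assumptions for illustration; this is not how `MinMaxScaler` is implemented internally.

```python
def fit_min_max(rows):
    """Return the per-column minimum and maximum of a list of rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Rescale each column with (x - min) / (max - min)."""
    # A constant column (max == min) is mapped to 0.0 in this sketch
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Made-up values: [mass, color_score]
train = [[100.0, 0.5], [200.0, 0.9], [150.0, 0.7]]
test = [[250.0, 0.6]]

mins, maxs = fit_min_max(train)  # learned from the training rows only
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
print(train_scaled)
print(test_scaled)
```

Since the test rows are scaled with the training minimum and maximum, a test value outside the training range falls outside [0, 1], as the mass of 250 does here.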

In [50]:
# Retraining KNN on the rescaled data
knn = KNeighborsClassifier(n_neighbors=3)

In [51]:
knn.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=3)

In [52]:
# Comparing the model's predictions for the test data against the true labels gives 100% accuracy
knn.score(X_test,y_test)

1.0

In [53]:
# To get only the predicted labels
knn.predict(X_test)

array([2, 4, 2, 4, 2, 1, 3, 1, 1, 1, 4, 3, 1, 3, 1], dtype=int64)

In [54]:
y_test

5     2
48    4
4     2
46    4
3     2
22    1
42    3
18    1
15    1
11    1
52    4
28    3
8     1
41    3
9     1
Name: fruit_label, dtype: int64
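
The accuracy reported by `score` is the fraction of predictions that match the true labels. As a quick check, the sketch below recomputes it by hand from the two outputs printed above.

```python
# Values copied from the knn.predict(X_test) and y_test outputs above
predicted = [2, 4, 2, 4, 2, 1, 3, 1, 1, 1, 4, 3, 1, 3, 1]
actual = [2, 4, 2, 4, 2, 1, 3, 1, 1, 1, 4, 3, 1, 3, 1]

# Count the positions where prediction and true label agree
matches = sum(p == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)
print(matches, len(actual), accuracy)  # 15 15 1.0
```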

### KNN for regression (targets are ordered numeric values)

In [55]:
from sklearn.neighbors import KNeighborsRegressor

In [56]:
knn = KNeighborsRegressor(n_neighbors=3)
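
For regression, the neighbors' targets are averaged instead of voted on. The following from-scratch sketch assumes uniform weights, meaning a plain mean of the k nearest targets; the name `knn_regress` and the toy data are made up for illustration and this is not scikit-learn's implementation.

```python
import math

def knn_regress(train_points, train_targets, query, k=3):
    """Predict a numeric target as the mean of the k nearest training targets."""
    # Sort the (point, target) pairs by the point's distance to the query
    by_distance = sorted(
        zip(train_points, train_targets),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_targets = [target for _, target in by_distance[:k]]
    return sum(nearest_targets) / len(nearest_targets)

# Toy data (made up): one feature, target equal to 10 times the feature
points = [[1.0], [2.0], [3.0], [4.0], [5.0]]
targets = [10.0, 20.0, 30.0, 40.0, 50.0]

print(knn_regress(points, targets, [2.1]))  # 20.0
```

The three points nearest to 2.1 are 2.0, 3.0 and 1.0, so the prediction is the mean of 20, 30 and 10.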

In [57]:
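# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older version
from sklearn.datasets import load_boston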

In [58]:
data = load_boston()

In [59]:
X, y = load_boston(return_X_y=True)

In [60]:
X.shape

(506, 13)

In [61]:
y.shape

(506,)

In [62]:
print(load_boston()['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [63]:
from sklearn.model_selection import train_test_split

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [65]:
knn.fit(X_train,y_train)

KNeighborsRegressor(n_neighbors=3)

In [66]:
knn.score(X_test,y_test)

0.5370055788989855
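
For a regressor, `score` returns the coefficient of determination (R²) rather than an accuracy. The sketch below shows how that quantity is computed; the function name `r2_score_sketch` and the sample numbers are assumptions for illustration.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean (assumed non-zero here)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print(r2_score_sketch(y_true, y_pred))
print(r2_score_sketch(y_true, [25.0] * 4))  # always predicting the mean gives 0.0
```

A value of 1.0 means perfect predictions and 0.0 means the model does no better than predicting the mean of the targets. The regression cells above fit KNN on unscaled features, so applying the same `MinMaxScaler` step used in the classification section may be worth trying.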