# Example material on data preprocessing and classification

This material covers data preprocessing and the use of a model to perform classification. The classifiers in the sklearn library generally share the same interface: `fit` to train and `predict` to classify. So although this example uses K-nearest neighbors as the classifier, the same approach can be applied to other classifiers.
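As a minimal sketch of that shared interface, the snippet below trains two different classifiers with exactly the same calls. The toy data and the choice of `DecisionTreeClassifier` as the second model are assumptions made for illustration only; they are not part of the example that follows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.05, 0.05], [1.0, 0.95]])

classifiers = {
    "knn": KNeighborsClassifier(n_neighbors=1),
    "tree": DecisionTreeClassifier(random_state=0),
}

predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_toy, y_toy)                   # same training call for every model
    predictions[name] = clf.predict(X_new)  # same prediction call for every model
    print(name, predictions[name])
```

Only the line that constructs the model changes; the rest of the workflow stays the same.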

---
Building a DataFrame from a dictionary:

In [1]:
import pandas as pd

# Each key becomes a column; 'Classe' holds the binary label
d = {
    'Atributo_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Atributo_2': [6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    'Atributo_3': [-1, -7, 18, 99, 1, 0, 2, 3, 40, .6],
    'Classe': [0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
}
df = pd.DataFrame(data=d)
df

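| | Atributo_1 | Atributo_2 | Atributo_3 | Classe |
|---|---|---|---|---|
| 0 | 1 | 6 | -1.0 | 0 |
| 1 | 2 | 7 | -7.0 | 0 |
| 2 | 3 | 8 | 18.0 | 1 |
| 3 | 4 | 9 | 99.0 | 1 |
| 4 | 5 | 10 | 1.0 | 0 |
| 5 | 6 | 11 | 0.0 | 0 |
| 6 | 7 | 12 | 2.0 | 0 |
| 7 | 8 | 13 | 3.0 | 0 |
| 8 | 9 | 14 | 40.0 | 1 |
| 9 | 10 | 15 | 0.6 | 0 |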


---
Simulating missing values:

In [2]:
import numpy as np

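# Overwrite three cells with NaN; iloc takes (row position, column position)
df.iloc[1,1] = np.nan  # row 1, Atributo_2
df.iloc[3,2] = np.nan  # row 3, Atributo_3
df.iloc[4,2] = np.nan  # row 4, Atributo_3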

df

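| | Atributo_1 | Atributo_2 | Atributo_3 | Classe |
|---|---|---|---|---|
| 0 | 1 | 6.0 | -1.0 | 0 |
| 1 | 2 | NaN | -7.0 | 0 |
| 2 | 3 | 8.0 | 18.0 | 1 |
| 3 | 4 | 9.0 | NaN | 1 |
| 4 | 5 | 10.0 | NaN | 0 |
| 5 | 6 | 11.0 | 0.0 | 0 |
| 6 | 7 | 12.0 | 2.0 | 0 |
| 7 | 8 | 13.0 | 3.0 | 0 |
| 8 | 9 | 14.0 | 40.0 | 1 |
| 9 | 10 | 15.0 | 0.6 | 0 |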


---
Counting the null values in each column:

In [3]:
df.isnull().sum()

Atributo_1    0
Atributo_2    1
Atributo_3    2
Classe        0
dtype: int64
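Beyond the per-column counts, it can help to see the fraction of missing values and which rows are affected. The sketch below rebuilds the same DataFrame so that it runs on its own; the names `missing_fraction` and `rows_with_missing` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same data as above, with the three simulated missing values already in place
df = pd.DataFrame({
    'Atributo_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Atributo_2': [6, np.nan, 8, 9, 10, 11, 12, 13, 14, 15],
    'Atributo_3': [-1, -7, 18, np.nan, np.nan, 0, 2, 3, 40, .6],
    'Classe': [0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_fraction = df.isnull().mean()
print(missing_fraction)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```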

---
Replacing null values with a constant:

In [4]:
from sklearn.impute import SimpleImputer

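# Replace every NaN with the constant 0
imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)

# fit_transform returns a NumPy array, so the column names are lost
df_clean = imp.fit_transform(df)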

df_clean

array([[ 1. ,  6. , -1. ,  0. ],
       [ 2. ,  0. , -7. ,  0. ],
       [ 3. ,  8. , 18. ,  1. ],
       [ 4. ,  9. ,  0. ,  1. ],
       [ 5. , 10. ,  0. ,  0. ],
       [ 6. , 11. ,  0. ,  0. ],
       [ 7. , 12. ,  2. ,  0. ],
       [ 8. , 13. ,  3. ,  0. ],
       [ 9. , 14. , 40. ,  1. ],
       [10. , 15. ,  0.6,  0. ]])
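Filling with 0 is only one option. As a hedged alternative, the sketch below imputes each feature with its column mean via `strategy='mean'`, restricts the imputer to the feature columns, and writes the result back into a DataFrame so the column names survive. The names `feature_cols`, `imp_mean`, and `df_mean` are assumptions introduced here, and mean imputation is shown as a possibility, not as the approach this material uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same data as above, with the three simulated missing values
df = pd.DataFrame({
    'Atributo_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Atributo_2': [6, np.nan, 8, 9, 10, 11, 12, 13, 14, 15],
    'Atributo_3': [-1, -7, 18, np.nan, np.nan, 0, 2, 3, 40, .6],
    'Classe': [0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
})

feature_cols = ['Atributo_1', 'Atributo_2', 'Atributo_3']

# Replace each NaN with the mean of the non-missing values in its column
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

df_mean = df.copy()
df_mean[feature_cols] = imp_mean.fit_transform(df[feature_cols])
print(df_mean)
```

Leaving `Classe` out of the imputer keeps the labels untouched even if they contained missing entries.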

---
Splitting the dataset into 2 sets, a training set with $60\%$ of the data and a test set with $40\%$:

In [6]:
from sklearn.model_selection import train_test_split

X = df_clean[:,:-1]   # features: every column except the last
y = df_clean[:,-1:]   # labels: last column, kept 2-D with shape (n, 1)
print("X:")
print(X)
print("y:")
print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=0)
print("X_train:")
print(X_train)
print("X_test:")
print(X_test)
print("Y_train:")
print(y_train)
print("Y_test:")
print(y_test)

X:
[[ 1.   6.  -1. ]
 [ 2.   0.  -7. ]
 [ 3.   8.  18. ]
 [ 4.   9.   0. ]
 [ 5.  10.   0. ]
 [ 6.  11.   0. ]
 [ 7.  12.   2. ]
 [ 8.  13.   3. ]
 [ 9.  14.  40. ]
 [10.  15.   0.6]]
y:
[[0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]]
X_train:
[[ 2.  0. -7.]
 [ 7. 12.  2.]
 [ 8. 13.  3.]
 [ 4.  9.  0.]
 [ 1.  6. -1.]
 [ 6. 11.  0.]]
X_test:
[[ 3.   8.  18. ]
 [ 9.  14.  40. ]
 [ 5.  10.   0. ]
 [10.  15.   0.6]]
Y_train:
[[0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]]
Y_test:
[[1.]
 [1.]
 [0.]
 [0.]]
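In the split above, the training set contains a single example of class 1 while the test set contains two. With so few samples, one possible refinement is to pass `stratify=y` so that `train_test_split` keeps the class proportions roughly equal in both sets. The sketch below uses the label vector from this example together with placeholder features (`X_demo` is hypothetical) and only checks that both classes appear on each side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features (hypothetical) and the label vector from this example
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0])

# stratify=y_demo asks for class proportions to be preserved in both subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=0, stratify=y_demo
)

print("train labels:", ys_train)
print("test labels:", ys_test)
```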


---
Training an arbitrary classifier:

In [7]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

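# reshape(-1) flattens the (n, 1) label column into a 1-D vector;
# the two prints show the flattened and the original shapes
print(y_train.reshape(-1))
print(y_train)
knn.fit(X_train, y_train.reshape(-1))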

y_pred = knn.predict(X_test)

print(y_pred)

[0. 0. 0. 1. 0. 0.]
[[0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]]
[0. 0. 1. 0.]
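To make the behaviour of a nearest-neighbour classifier with `n_neighbors=1` concrete, the sketch below recomputes the predictions by hand with NumPy: each test row receives the label of the training row at the smallest Euclidean distance. The arrays are copied from the outputs above. Note that the third test row is equidistant from two training rows with different labels; `argmin` keeps the first of them, which happens to agree with the output above. This is an illustration of the idea, not a description of how sklearn implements it internally.

```python
import numpy as np

# Training and test data copied from the split shown earlier
X_train = np.array([[2., 0., -7.], [7., 12., 2.], [8., 13., 3.],
                    [4., 9., 0.], [1., 6., -1.], [6., 11., 0.]])
y_train = np.array([0., 0., 0., 1., 0., 0.])
X_test = np.array([[3., 8., 18.], [9., 14., 40.],
                   [5., 10., 0.], [10., 15., 0.6]])

# Squared Euclidean distance between every test row and every training row
diff = X_test[:, None, :] - X_train[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)   # shape (4, 6)

# Index of the closest training row for each test row (first one on ties)
nearest = sq_dist.argmin(axis=1)

# Each test row takes the label of its nearest training row
manual_pred = y_train[nearest]
print(manual_pred)
```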


---
Computing the accuracy:

In [8]:
from sklearn.metrics import accuracy_score

print("Predict:")
print(y_pred.reshape(-1))
print("True:")
print(y_test.reshape(-1))

accuracy = accuracy_score(y_test.reshape(-1), y_pred) # (tp + tn)/ (tp + tn + fp + fn)

print("Accuracy: %.2f%%" % (accuracy*100))

Predict:
[0. 0. 1. 0.]
True:
[1. 1. 0. 0.]
Accuracy: 25.00%
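The formula in the code comment, $(tp + tn) / (tp + tn + fp + fn)$, can be checked by counting each case directly. The sketch below does this with NumPy for the predicted and true labels printed above, treating class 1 as the positive class (an assumption made here for the purpose of the count).

```python
import numpy as np

# Labels copied from the output above
y_true = np.array([1., 1., 0., 0.])
y_hat = np.array([0., 0., 1., 0.])

# Count each cell of the confusion matrix, with class 1 as "positive"
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn)
print("Accuracy: %.2f%%" % (manual_accuracy * 100))
```

Only the last test example is classified correctly, so the count gives $1/4 = 25\%$, matching `accuracy_score`.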
