<a href="https://colab.research.google.com/github/Rogerio-mack/IMT_Ciencia_de_Dados/blob/main/IMT_estimadores_scikit_learn2_solucao.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<head>
  <meta name="author" content="Rogério de Oliveira">
  <meta institution="author" content="ITM">
</head>

<img src="https://maua.br/images/selo-60-anos-maua.svg" width=300, align="right">


# Lab: Estimators and Classification with `scikit-learn`

In this lab you will use `scikit-learn` estimators for:

1. One-hot encoding
2. Normalization
3. Classification

# Data

This dataset contains cases of payment default by credit card clients in Taiwan ([UCI Dataset](https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients)).

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://github.com/Rogerio-mack/IMT_Ciencia_de_Dados/raw/main/data/default_credit_cards.csv')
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,female,university,married,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,female,university,single,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,female,university,single,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,female,university,married,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,male,university,married,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


## Data Exploration and Preparation

### Missing data

Remove any missing data. Then check the categorical attributes and, where there are invalid values, replace them with 'others'.

In [None]:
df.isnull().sum().sum()

0

In [None]:
for c in ['SEX','EDUCATION','MARRIAGE']:
  print(df[c].value_counts(),'\n')

female    18112
male      11888
Name: SEX, dtype: int64 

university         14030
graduate school    10585
high school         4917
others               468
Name: EDUCATION, dtype: int64 

single     15964
married    13659
others       377
Name: MARRIAGE, dtype: int64 



In [None]:
df.EDUCATION = df.EDUCATION.replace(['0','5','6'],['others','others','others'])
df.MARRIAGE = df.MARRIAGE.replace(['0'],['others'])
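
The `replace` calls above assume that invalid entries appear as the strings '0', '5', and '6'. A more general option, sketched below on a toy dataframe, keeps only the values found in a list of accepted categories and maps everything else to 'others'. The `valid` dictionary and the toy values are assumptions made for illustration; they are not part of the original lab.

```python
import pandas as pd

# Toy data for illustration; the codes '0' and '5' stand in for invalid entries.
toy = pd.DataFrame({
    'EDUCATION': ['university', '0', 'high school', '5'],
    'MARRIAGE': ['single', 'married', '0', 'others'],
})

# Hypothetical list of accepted categories per column (an assumption).
valid = {
    'EDUCATION': ['graduate school', 'university', 'high school', 'others'],
    'MARRIAGE': ['married', 'single', 'others'],
}

for col, allowed in valid.items():
    # Keep values found in the accepted list; replace anything else with 'others'.
    toy[col] = toy[col].where(toy[col].isin(allowed), 'others')

print(toy)
```

This approach does not require knowing the invalid codes in advance, only the categories that should be kept.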

### One-hot encoding

One-hot encode the categorical attributes, dropping the first category of each one to avoid collinearity.

Hint: use `hot_encode.get_feature_names_out(['SEX', 'EDUCATION', 'MARRIAGE'])` to name the resulting attributes.



In [None]:
from sklearn.preprocessing import OneHotEncoder

hot_encode = OneHotEncoder(handle_unknown='ignore',sparse_output=False,drop='first')
hot_encode.fit(df[['SEX','EDUCATION','MARRIAGE']])



In [None]:
hot_encode.categories_

[array(['female', 'male'], dtype=object),
 array(['graduate school', 'high school', 'others', 'university'],
       dtype=object),
 array(['married', 'others', 'single'], dtype=object)]

In [None]:
hot_encode.get_feature_names_out(['SEX', 'EDUCATION', 'MARRIAGE'])

array(['SEX_male', 'EDUCATION_high school', 'EDUCATION_others',
       'EDUCATION_university', 'MARRIAGE_others', 'MARRIAGE_single'],
      dtype=object)

In [None]:
df_hot_encode = pd.DataFrame(hot_encode.transform(df[['SEX','EDUCATION','MARRIAGE']]),columns=hot_encode.get_feature_names_out(['SEX', 'EDUCATION', 'MARRIAGE']))
df_hot_encode.head()

Unnamed: 0,SEX_male,EDUCATION_high school,EDUCATION_others,EDUCATION_university,MARRIAGE_others,MARRIAGE_single
0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,1.0
2,0.0,0.0,0.0,1.0,0.0,1.0
3,0.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0



Q1. What is the sum of the one-hot encoded values obtained?

In [None]:
df_hot_encode.sum().sum() # df_hot_encode is a dataframe containing only the one-hot encoded columns

47644.0
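
This total can be cross-checked by hand. Each row contributes a 1 for every categorical attribute whose value is not the dropped first category, so the sum equals the count of all non-dropped categories. The sketch below uses the counts from the `value_counts()` output earlier in this lab.

```python
# Category counts taken from the value_counts() output earlier in this lab.
counts = {
    'SEX': {'female': 18112, 'male': 11888},
    'EDUCATION': {'graduate school': 10585, 'high school': 4917,
                  'others': 468, 'university': 14030},
    'MARRIAGE': {'married': 13659, 'others': 377, 'single': 15964},
}

total = 0
for col, cats in counts.items():
    first = sorted(cats)[0]  # category removed by drop='first'
    total += sum(n for cat, n in cats.items() if cat != first)

print(total)  # 47644
```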

### Normalization

Suggestion: create a dataframe `df_` by concatenating the one-hot encoded values with the remaining predictor attributes of `df`. Then apply standard scaling to the data. At the end, if you wish, add the target variable column.

In [None]:
df_ = pd.concat([df_hot_encode,df.drop(columns=['ID','SEX','EDUCATION','MARRIAGE','default payment next month'])],axis=1)
df_.head()

Unnamed: 0,SEX_male,EDUCATION_high school,EDUCATION_others,EDUCATION_university,MARRIAGE_others,MARRIAGE_single,LIMIT_BAL,AGE,PAY_0,PAY_2,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,0.0,0.0,0.0,1.0,0.0,0.0,20000,24,2,2,...,689,0,0,0,0,689,0,0,0,0
1,0.0,0.0,0.0,1.0,0.0,1.0,120000,26,-1,2,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,0.0,0.0,0.0,1.0,0.0,1.0,90000,34,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,0.0,0.0,0.0,1.0,0.0,0.0,50000,37,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,1.0,0.0,0.0,1.0,0.0,0.0,50000,57,-1,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


Q2. What is the shape of the resulting data?

In [None]:
df_.shape

(30000, 26)

#### Standard Scale

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df_)

df_scaled = scaler.transform(df_)
df_scaled = pd.DataFrame(df_scaled, columns=df_.columns)

df_scaled = pd.concat([df_scaled,df[['default payment next month']]],axis=1)

df_scaled.head()

Unnamed: 0,SEX_male,EDUCATION_high school,EDUCATION_others,EDUCATION_university,MARRIAGE_others,MARRIAGE_single,LIMIT_BAL,AGE,PAY_0,PAY_2,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,-0.810161,-0.442752,-0.125886,1.0669,-0.112812,-1.066471,-1.13672,-1.24602,1.794564,1.782348,...,-0.672497,-0.663059,-0.652724,-0.341942,-0.227086,-0.296801,-0.308063,-0.314136,-0.293382,1
1,-0.810161,-0.442752,-0.125886,1.0669,-0.112812,0.937672,-0.365981,-1.029047,-0.874991,1.782348,...,-0.621636,-0.606229,-0.597966,-0.341942,-0.213588,-0.240005,-0.24423,-0.314136,-0.180878,1
2,-0.810161,-0.442752,-0.125886,1.0669,-0.112812,0.937672,-0.597202,-0.161156,0.014861,0.111736,...,-0.44973,-0.417188,-0.39163,-0.250292,-0.191887,-0.240005,-0.24423,-0.248683,-0.012122,0
3,-0.810161,-0.442752,-0.125886,1.0669,-0.112812,-1.066471,-0.905498,0.164303,0.014861,0.111736,...,-0.232373,-0.186729,-0.156579,-0.221191,-0.169361,-0.228645,-0.237846,-0.244166,-0.23713,0
4,1.234323,-0.442752,-0.125886,1.0669,-0.112812,-1.066471,-0.905498,2.334029,-0.874991,0.111736,...,-0.346997,-0.348137,-0.331482,-0.221191,1.335034,0.271165,0.266434,-0.269039,-0.255187,0


In [None]:
df_scaled.describe()

Unnamed: 0,SEX_male,EDUCATION_high school,EDUCATION_others,EDUCATION_university,MARRIAGE_others,MARRIAGE_single,LIMIT_BAL,AGE,PAY_0,PAY_2,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,...,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,-7.389644e-17,4.1685170000000006e-17,2.842171e-17,-4.642213e-17,-1.4210850000000002e-17,-1.515825e-17,-6.063298e-17,-1.856885e-16,-1.894781e-17,4.9264300000000007e-17,...,5.873820000000001e-17,-2.3684760000000003e-17,1.136868e-17,-4.736952e-18,1.373716e-17,2.0842590000000002e-17,1.800042e-17,-1.9658350000000003e-17,-1.7881990000000003e-17,0.2212
std,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,...,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,1.000017,0.415062
min,-0.8101607,-0.4427518,-0.1258857,-0.9372952,-0.1128122,-1.066471,-1.213794,-1.571479,-1.764843,-1.558876,...,-3.315048,-2.000874,-6.355247,-0.3419416,-0.2569895,-0.2968013,-0.3080626,-0.3141361,-0.2933821,0.0
25%,-0.8101607,-0.4427518,-0.1258857,-0.9372952,-0.1128122,-1.066471,-0.9054983,-0.8120745,-0.8749911,-0.7235699,...,-0.6363293,-0.63406,-0.6316338,-0.2815661,-0.2208358,-0.2746506,-0.289168,-0.2976091,-0.2867584,0.0
50%,-0.8101607,-0.4427518,-0.1258857,-0.9372952,-0.1128122,0.9376717,-0.2118326,-0.1611565,0.01486052,0.1117361,...,-0.3763451,-0.3652683,-0.3660725,-0.215153,-0.1697952,-0.1945673,-0.2123132,-0.2159561,-0.2090042,0.0
75%,1.234323,-0.4427518,-0.1258857,1.0669,-0.1128122,0.9376717,0.5589071,0.5982479,0.01486052,0.1117361,...,0.1747667,0.1624955,0.1733997,-0.03970176,-0.03998021,-0.0409323,-0.05188511,-0.05026084,-0.06837436,0.0
max,1.234323,2.258602,7.943712,1.0669,8.864288,0.9376717,6.416528,4.720729,7.133674,6.794184,...,13.18669,14.58743,15.49528,52.39921,72.84299,50.59528,39.33218,27.60363,29.4451,1.0
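
The `std` row shows 1.000017 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the effect with NumPy on a synthetic column; the random data is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=30000)  # synthetic column

# Standard scaling: subtract the mean, divide by the population std (ddof=0),
# which is the convention StandardScaler follows.
z = (x - x.mean()) / x.std(ddof=0)

sample_std = z.std(ddof=1)                 # what describe() reports
expected = np.sqrt(len(x) / (len(x) - 1))  # ratio between the two conventions
print(round(float(sample_std), 6), round(float(expected), 6))
```

With 30000 rows the ratio is sqrt(30000/29999), which rounds to 1.000017.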


Q3. What is the sum of all values in the new normalized dataframe, including the target variable?

In [None]:
df_scaled.sum().sum()

6635.999999999993
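
This value is easy to sanity-check. Every standardized feature has mean zero, so its column sums to approximately zero, and the grand total reduces to the sum of the unscaled target column, that is, the number of defaults. The sketch below uses the target mean shown in the `describe()` output; the small difference from 6636 in the result above is floating-point error.

```python
n_rows = 30000
default_rate = 0.2212  # mean of the target column in the describe() output

# Each scaled feature sums to roughly zero, so the total is the target sum.
n_defaults = round(n_rows * default_rate)
print(n_defaults)  # 6636
```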

## Model Training

Use the training and test sets with the configuration below. Then train a K-Nearest Neighbors model (`n_neighbors=5`) and a Logistic Regression model (`max_iter=1000`).

Hint: see the [Lab_Knn_Metricas_Resolvido](https://colab.research.google.com/github/Rogerio-mack/IMT_Ciencia_de_Dados/blob/main/IMT_Lab_Knn_Metricas_Resolvido.ipynb) notebook.
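
The split configuration in this lab passes `stratify=y`, which keeps the class proportions the same in the training and test sets. This matters here because only about 22% of the clients default. The sketch below illustrates the effect on synthetic labels; the data is made up and is not the lab's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 22% positives, close to the default rate in this lab.
y = np.array([0] * 780 + [1] * 220)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=1
)

# Both splits keep (approximately) the original share of positives.
print(len(y_test), y_train.mean(), y_test.mean())
```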

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

# Inputs and outputs
X = df_scaled.drop(columns='default payment next month')
y = df_scaled['default payment next month']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=1)

# Model definition
# clf = LogisticRegression(max_iter=1000)
clf = KNeighborsClassifier(n_neighbors = 5)

# Training
clf.fit(X_train,y_train)

# Prediction
y_pred = clf.predict(X_test)

# Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

cm = confusion_matrix(y_test, y_pred)
print('\nConfusion Matrix:\n')
print(cm)

accuracy = accuracy_score(y_test, y_pred)
print('\nAccuracy Score (1):\n')
print(f'{accuracy:.4f}')

print('\n', clf)

accuracy = clf.score(X_test, y_test)
print('\nAccuracy Score (2):\n')
print(f'{accuracy:.4f}')

print('\nClassification Report:\n')
print(classification_report(y_test, y_pred))



Confusion Matrix:

[[6457  552]
 [1326  665]]

Accuracy Score (1):

0.7913

 KNeighborsClassifier()

Accuracy Score (2):

0.7913

Classification Report:

              precision    recall  f1-score   support

           0       0.83      0.92      0.87      7009
           1       0.55      0.33      0.41      1991

    accuracy                           0.79      9000
   macro avg       0.69      0.63      0.64      9000
weighted avg       0.77      0.79      0.77      9000
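
The numbers in the report can be derived directly from the confusion matrix, which helps when reading it. In `scikit-learn`, rows are the true classes and columns are the predicted classes. The sketch below recomputes accuracy, precision, recall, and the false positives per class from the KNN matrix printed above.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[6457, 552],
               [1326, 665]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (columns)
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (rows)

# False positives for a class: predicted as that class but truly another.
false_positives = cm.sum(axis=0) - np.diag(cm)

print(f'accuracy:  {accuracy:.4f}')
print('precision:', precision.round(2))
print('recall:   ', recall.round(2))
print('false positives per class:', false_positives)
```

For this KNN model, 1326 cases were predicted as class 0 although they were defaults, and 552 cases were predicted as class 1 although they were not.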



Q4. Which model has the higher accuracy?

Q5. Which class (0 or 1) has more false positives when using the best model?
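
One way to answer Q4 without editing the cell by hand is to fit both models in a loop and collect their test accuracies. The sketch below does this on synthetic data from `make_classification`, used only as a stand-in so the example is self-contained; in the lab you would use the `X_train`, `X_test`, `y_train`, and `y_test` already defined. The names `X_demo`, `models`, and `scores` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with the lab's X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=1
)

models = {
    'knn': KNeighborsClassifier(n_neighbors=5),
    'logistic': LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # accuracy on the test split

print(scores)
```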

## Predicting new cases

Make predictions for the cases below.

In [None]:
df_case = pd.read_csv('https://github.com/Rogerio-mack/IMT_Ciencia_de_Dados/raw/main/data/default_credit_cards_test.csv')
df_case.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,30101,630000,female,university,single,41,-1,0,-1,-1,...,6500,6500,6500,2870,1000,6500,6500,6500,2870,0
1,32022,70000,male,university,single,30,1,-2,-2,0,...,65701,66782,36137,36894,3200,0,3000,3000,1500,0
2,32125,270000,male,university,married,53,2,2,2,2,...,219829,214795,228084,232680,3000,15000,0,16500,8870,0


## Preparing the new cases

Essentially, you must apply the same transformations that were applied to the training data, reusing the already fitted `hot_encode` and `scaler` (call `transform`, not `fit`).

### One-hot encoding
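
As an alternative to repeating each step by hand, the transformations and the model can be bundled into a single `Pipeline`, so that new cases automatically go through the same fitted steps. The sketch below is one possible arrangement, not the lab's solution: it uses a toy dataframe with made-up values, and like the lab it scales the one-hot columns together with the numeric ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data; column names mirror the lab, values are made up.
train = pd.DataFrame({
    'SEX': ['female', 'male', 'female', 'male', 'female', 'male'],
    'AGE': [24, 26, 34, 37, 57, 45],
    'LIMIT_BAL': [20000, 120000, 90000, 50000, 50000, 200000],
})
target = [1, 1, 0, 0, 0, 1]

pipe = Pipeline([
    # One-hot encode the categorical column, pass the rest through unchanged.
    ('encode', ColumnTransformer(
        [('onehot', OneHotEncoder(drop='first'), ['SEX'])],
        remainder='passthrough',
        sparse_threshold=0,  # force a dense array for the scaler
    )),
    ('scale', StandardScaler()),
    ('model', KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(train, target)

new_cases = pd.DataFrame({
    'SEX': ['male', 'female'],
    'AGE': [30, 53],
    'LIMIT_BAL': [70000, 270000],
})
preds = pipe.predict(new_cases)  # the fitted encoder and scaler are reused
print(preds)
```

Calling `pipe.predict` only transforms the new rows; nothing is refitted, which is the behavior the manual steps in this section reproduce.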

In [None]:
df_case_hot_encode = pd.DataFrame(hot_encode.transform(df_case[['SEX','EDUCATION','MARRIAGE']]),columns=hot_encode.get_feature_names_out(['SEX', 'EDUCATION', 'MARRIAGE']))
df_case_hot_encode.head()

Unnamed: 0,SEX_male,EDUCATION_high school,EDUCATION_others,EDUCATION_university,MARRIAGE_others,MARRIAGE_single
0,0.0,0.0,0.0,1.0,0.0,1.0
1,1.0,0.0,0.0,1.0,0.0,1.0
2,1.0,0.0,0.0,1.0,0.0,0.0


Q6. What is the sum of the one-hot encoded values obtained for the new cases?

In [None]:
df_case_hot_encode.sum().sum()

7.0

### Normalization

In [None]:
df_ = pd.concat([df_case_hot_encode,df_case.drop(columns=['ID','SEX','EDUCATION','MARRIAGE'])],axis=1)
df_.head()

Unnamed: 0,SEX_male,EDUCATION_high school,EDUCATION_others,EDUCATION_university,MARRIAGE_others,MARRIAGE_single,LIMIT_BAL,AGE,PAY_0,PAY_2,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,0.0,0.0,0.0,1.0,0.0,1.0,630000,41,-1,0,...,6500,6500,6500,2870,1000,6500,6500,6500,2870,0
1,1.0,0.0,0.0,1.0,0.0,1.0,70000,30,1,-2,...,65701,66782,36137,36894,3200,0,3000,3000,1500,0
2,1.0,0.0,0.0,1.0,0.0,0.0,270000,53,2,2,...,219829,214795,228084,232680,3000,15000,0,16500,8870,0


In [None]:
df_case_scaled = scaler.transform(df_)
df_case_scaled = pd.DataFrame(df_case_scaled, columns=df_.columns)

df_case_scaled.head()

Unnamed: 0,SEX_male,EDUCATION_high school,EDUCATION_others,EDUCATION_university,MARRIAGE_others,MARRIAGE_single,LIMIT_BAL,AGE,PAY_0,PAY_2,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,-0.810161,-0.442752,-0.125886,1.0669,-0.112812,0.937672,3.564792,0.598248,-0.874991,0.111736,...,-0.584199,-0.571459,-0.556144,-0.604532,-0.281566,0.025123,0.072377,0.106851,-0.126285,-0.293382
1,1.234323,-0.442752,-0.125886,1.0669,-0.112812,0.937672,-0.75135,-0.595102,0.904712,-1.558876,...,0.269478,0.36559,-0.068662,-0.03321,-0.14874,-0.25699,-0.126411,-0.116564,-0.215956,-0.293382
2,1.234323,-0.442752,-0.125886,1.0669,-0.112812,-1.066471,0.790129,1.900084,1.794564,1.782348,...,2.492001,2.666366,3.088561,3.254376,-0.160815,0.394038,-0.296801,0.745181,0.266435,-0.293382


Q7. What is the sum of all values of the new cases after one-hot encoding and normalization?

In [None]:
df_case_scaled.sum().sum()

25.375067116628163

## Model Prediction

Q8. What does the model predict for the 3 new cases?

In [None]:
clf.predict(df_case_scaled)

array([0, 0, 1])
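
Besides the predicted class, it can be useful to see how confident the model is. For a K-Nearest Neighbors classifier with uniform weights, `predict_proba` returns the fraction of the k nearest neighbors that belong to each class. The sketch below shows this on a tiny synthetic dataset; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny synthetic dataset: one feature, two classes.
X_demo = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

X_new = np.array([[0.05], [1.15]])
proba = knn.predict_proba(X_new)  # share of neighbors in each class
labels = knn.predict(X_new)

print(proba)
print(labels)
```

In the lab, the equivalent call would be `clf.predict_proba(df_case_scaled)`.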