# Exercise 2

The goal of this exercise is to test the undersampling and oversampling techniques on a different dataset.

Load the dataset csv_result-ebay_confianca_completo.csv, which I used during my postdoc to predict user trust based on personality traits extracted from text.

The class is the attribute reputation, which is either a good reputation or a bad reputation.

Use the Random Forest algorithm and run the three tests as in the previous example (no resampling, undersampling with TomekLinks, and oversampling with SMOTE). The Naïve Bayes algorithm does not perform well on this dataset, so we use Random Forest instead, an algorithm based on decision trees. It is used in the same way, and you can check the documentation at the following link: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

## Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

dataset = pd.read_csv('Bases de dados/csv_result-ebay_confianca_completo.csv')
dataset.shape

(5806, 75)

In [31]:
dataset.info()

<class 'pandas.DataFrame'>
RangeIndex: 5806 entries, 2 to 5807
Data columns (total 75 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   feedbacks                5806 non-null   int64  
 1   reviews                  5806 non-null   int64  
 2   blacklist                5806 non-null   str    
 3   mextraversion            5806 non-null   float64
 4   cextraversion            5806 non-null   float64
 5   sextraversion            5806 non-null   int64  
 6   mneuroticism             5806 non-null   float64
 7   cneuroticism             5806 non-null   float64
 8   sneuroticism             5806 non-null   int64  
 9   mconscientiousness       5806 non-null   float64
 10  cconscientiousness       5806 non-null   float64
 11  sconscientiousness       5806 non-null   int64  
 12  magreeableness           5806 non-null   float64
 13  cagreeableness           5806 non-null   float64
 14  sagreeableness           5806 non-n

In [32]:
dataset.head()

Unnamed: 0,feedbacks,reviews,blacklist,mextraversion,cextraversion,sextraversion,mneuroticism,cneuroticism,sneuroticism,mconscientiousness,...,need_practicaly,need_selfexpression,need_stability,need_structure,value_conservation,value_openess,value_hedonism,value_selfenhancement,value_selftranscendence,reputation
2,0,49,N,4.181642,0.6,1,2.777591,0.6,0,4.08546,...,0.696359,0.698786,0.756963,0.660119,0.619416,0.746372,0.640073,0.598037,0.828716,Bom
3,0,56,N,4.007042,0.6,0,2.69865,0.6,0,4.187338,...,0.7153,0.664572,0.728806,0.66074,0.588969,0.735915,0.644465,0.603042,0.809379,Bom
4,0,50,N,4.53823,0.7,1,2.298492,0.5,1,5.085833,...,0.72015,0.694678,0.669652,0.627962,0.553523,0.766618,0.65547,0.645042,0.826039,Bom
5,72,0,N,4.692854,0.3,0,2.987231,0.5,0,4.83132,...,0.739793,0.637027,0.697221,0.638587,0.675289,0.752234,0.679661,0.674438,0.813391,Bom
6,76,0,N,4.966753,0.3,0,3.04873,0.5,0,4.725294,...,0.71853,0.616852,0.692761,0.646695,0.677245,0.699785,0.648607,0.616075,0.816841,Bom


In [58]:
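# Convert the two text columns to booleans so the classifier receives numeric input
dataset['blacklist'] = dataset['blacklist'] == 'S'
# True means a good reputation ('Bom'); False means a bad one
dataset['reputation'] = dataset['reputation'] == 'Bom'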

In [35]:
dataset['reputation'].value_counts()

reputation
True     4299
False    1507
Name: count, dtype: int64
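As a point of reference for the accuracies reported later, the class counts above imply a majority-class baseline. The sketch below is plain arithmetic on those counts; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the value_counts() output above
n_good = 4299   # reputation == True
n_bad = 1507    # reputation == False
total = n_good + n_bad

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = n_good / total
# How many good-reputation rows exist for each bad-reputation row
imbalance_ratio = n_good / n_bad

print(f"total rows: {total}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"imbalance ratio (good:bad): {imbalance_ratio:.2f}")
```

An accuracy close to this baseline (about 0.74) should be read with care, since a model can reach it without learning anything about the minority class.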

In [60]:
# Select the features (first 74 columns) and the target (last column, reputation)
X = dataset.iloc[:, 0:74].values
y = dataset.iloc[:, 74]
X.shape, y.shape

((5806, 74), (5806,))

## Training set and test set

In [None]:
from sklearn.model_selection import train_test_split
# stratify=y keeps the class proportions the same in both splits
X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X, y, test_size=0.2, stratify=y)
# 80% of the dataset
X_treinamento.shape, y_treinamento.shape

((4644, 74), (4644,))

In [None]:
# 20% of the dataset
X_teste.shape, y_teste.shape

((1162, 74), (1162,))

## Classification with Random Forest

In [42]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

modelo = RandomForestClassifier()
modelo.fit(X_treinamento, y_treinamento)

previsoes = modelo.predict(X_teste)
# accuracy_score expects (y_true, y_pred); the value is the same in either order
accuracy_score(y_teste, previsoes)

0.7504302925989673
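Accuracy alone can hide how a model treats the minority class on an imbalanced dataset. The sketch below computes per-class recall by hand on made-up toy labels (not the eBay data) to show the effect; recall_per_class is a hypothetical helper written for this illustration.

```python
def recall_per_class(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    recalls = {}
    for label in sorted(set(y_true)):
        # Indices of the samples whose true class is `label`
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls[label] = hits / len(idx)
    return recalls

# Toy labels (made up): 8 good users (True) and 2 bad users (False)
y_true = [True] * 8 + [False] * 2
# A classifier that always answers True
y_pred = [True] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = recall_per_class(y_true, y_pred)

print("accuracy:", accuracy)
print("recall per class:", recalls)
```

Here the accuracy looks reasonable while the recall for the False class is zero. With the notebook's variables, scikit-learn's classification_report(y_teste, previsoes) reports the same kind of per-class figures.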

## Undersampling - TomekLinks

In [44]:
from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
X_under, y_under = tl.fit_resample(X, y)
X_under.shape, y_under.shape

((5417, 74), (5417,))

In [45]:
np.unique(y, return_counts=True)

(array([False,  True]), array([1507, 4299]))

In [46]:
np.unique(y_under, return_counts=True)

(array([False,  True]), array([1507, 3910]))
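The counts above show that only the majority class shrinks: 4299 - 3910 = 389 rows were removed. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor. The sketch below finds such pairs on made-up 2-D points and drops the majority-class member of each pair; it is a simplified illustration with hypothetical helper names, not imblearn's implementation.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links

# Toy data (made up): True is the majority class, False the minority class
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0)]
labels = [True, True, True, False, False]

links = tomek_links(points, labels)
# Remove only the majority-class member of each link
to_remove = {k for pair in links for k in pair if labels[k] is True}
kept = [k for k in range(len(points)) if k not in to_remove]

print("links:", links)
print("kept indices:", kept)
```

Only the pair sitting on the class boundary forms a link, so a single majority point is dropped and the minority class is left untouched.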

In [48]:
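# Split the undersampled data. The original cell passed X, y here, so the shapes and
# accuracy recorded below come from the unresampled data and will change on a re-run.
X_treinamento_u, X_teste_u, y_treinamento_u, y_teste_u = train_test_split(X_under, y_under, test_size=0.2, stratify=y_under)
X_treinamento_u.shape, X_teste_u.shape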

((4644, 74), (1162, 74))

In [51]:
modelo_u = RandomForestClassifier()
modelo_u.fit(X_treinamento_u, y_treinamento_u)
previsoes_u = modelo_u.predict(X_teste_u)
# accuracy_score expects (y_true, y_pred); the value is the same in either order
accuracy_score(y_teste_u, previsoes_u)

0.7469879518072289

## Oversampling - SMOTE

In [53]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_over, y_over = smote.fit_resample(X, y)
X_over.shape, y_over.shape

((8598, 74), (8598,))

In [54]:
np.unique(y, return_counts=True)

(array([False,  True]), array([1507, 4299]))

In [55]:
np.unique(y_over, return_counts=True)

(array([False,  True]), array([4299, 4299]))
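The balanced counts mean SMOTE generated 4299 - 1507 = 2792 synthetic minority rows. Each synthetic row is an interpolation between a minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step on made-up vectors; smote_like_sample is a hypothetical helper, and the neighbor search that real SMOTE performs is omitted.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

# Number of synthetic rows needed to balance the classes in this dataset
n_synthetic = 4299 - 1507

rng = random.Random(0)   # fixed seed so the sketch is repeatable
x = [1.0, 2.0]           # a minority sample (made up)
neighbor = [3.0, 6.0]    # one of its minority-class neighbors (made up)
synthetic = smote_like_sample(x, neighbor, rng)

print("synthetic rows needed:", n_synthetic)
print("synthetic point:", synthetic)
```

Because every coordinate is interpolated with the same factor, the new point always lies on the line segment between the two original samples.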

In [56]:
X_treinamento_o, X_teste_o, y_treinamento_o, y_teste_o = train_test_split(X_over, y_over, test_size=0.2, stratify=y_over)
X_treinamento_o.shape, X_teste_o.shape

((6878, 74), (1720, 74))

In [57]:
modelo_o = RandomForestClassifier()
modelo_o.fit(X_treinamento_o, y_treinamento_o)
previsoes_o = modelo_o.predict(X_teste_o)
# accuracy_score expects (y_true, y_pred); the value is the same in either order
accuracy_score(y_teste_o, previsoes_o)

0.8046511627906977
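One point to keep in mind when comparing the three accuracies: in the cells above, SMOTE runs before train_test_split, so the test set of this last model can contain synthetic rows and has a different class balance from the original data. A common alternative is to split first and resample only the training portion. The sketch below illustrates that order on made-up toy data; random_oversample is a hypothetical helper that duplicates rows as a stand-in for SMOTE, and it is not the notebook's method.

```python
import random
from collections import Counter

def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen minority rows until the classes are balanced.

    Stand-in for SMOTE: it copies existing rows instead of synthesizing new ones.
    """
    counts = Counter(labels)
    majority_size = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(majority_size - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

rng = random.Random(0)

# Toy data (made up): 12 majority rows (True) and 4 minority rows (False)
rows = [[float(i)] for i in range(16)]
labels = [True] * 12 + [False] * 4

# 1) Split first, holding out 25% of each class for testing
test_idx = set()
for label in (True, False):
    idx = [i for i, l in enumerate(labels) if l == label]
    test_idx.update(rng.sample(idx, len(idx) // 4))
train_idx = [i for i in range(len(rows)) if i not in test_idx]

train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
test_rows = [rows[i] for i in sorted(test_idx)]
test_labels = [labels[i] for i in sorted(test_idx)]

# 2) Resample the training portion only; the test rows stay untouched
train_rows_bal, train_labels_bal = random_oversample(train_rows, train_labels, rng)

print("balanced training classes:", Counter(train_labels_bal))
print("test classes (original balance):", Counter(test_labels))
```

With this order, the test set keeps the original class proportions and contains only real rows, so its accuracy is directly comparable with the model trained without resampling.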