# <center>  Assignment 11 - Machine Learning </center>

**Student:** Marianna de Pinho Severo <br>
**Student ID:** 374856 <br>
**Instructor:** Regis Pires

### Step 01: Import libraries

In [28]:
import os
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Estimators
from sklearn.linear_model import Perceptron, SGDClassifier, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier

### Step 02: Load the dataset

In [14]:
DATASET_PATH = "dataset"

In [15]:
def load_adult_data(dataset_path=DATASET_PATH):
    # Column names are passed explicitly because no header row is read from the file.
    cols = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','label']
    csv_path = os.path.join(dataset_path, 'adult.data')
    # '?' marks a missing value; skipinitialspace drops the blank after each comma.
    return pd.read_csv(csv_path, names=cols, skipinitialspace=True, na_values='?')

In [16]:
adults = load_adult_data()
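
To illustrate what `skipinitialspace=True` and `na_values='?'` do in `load_adult_data`, here is a minimal sketch that reads an in-memory sample. The three-column layout and the two rows are illustrative assumptions, not the contents of the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample, comma-plus-space separated like the loader expects.
sample = io.StringIO("39, State-gov, <=50K\n50, ?, >50K\n")

df = pd.read_csv(
    sample,
    names=['age', 'workclass', 'label'],  # no header row in the sample
    skipinitialspace=True,                # drop the blank after each comma
    na_values='?',                        # treat '?' as a missing value
)

print(df['workclass'].isna().sum())  # number of missing workclass entries
print(df['label'].tolist())
```

Without `skipinitialspace=True`, the field would be read as `' ?'` (with a leading blank) and would not match the `'?'` marker.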

### Step 03: Briefly analyze the data

In [17]:
adults.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [18]:
adults.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         30725 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        30718 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    31978 non-null object
label             32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [19]:
adults.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
label                0
dtype: int64

The features *workclass*, *occupation* and *native-country* have missing values. Since the share of missing entries is small relative to the 32,561 samples (at most about 5.7%, for *occupation*), we will fill these gaps instead of dropping the columns.
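
The share of missing values can be checked with a short calculation. The sketch below reuses the counts printed by `adults.isna().sum()` and the row total reported by `adults.info()`; the variable names are illustrative.

```python
import pandas as pd

# Counts and row total copied from the outputs above.
n_rows = 32561
missing = pd.Series({'workclass': 1836, 'occupation': 1843, 'native-country': 583})

# Percentage of rows with a missing value, per column.
missing_pct = (missing / n_rows * 100).round(2)
print(missing_pct)
```

On the loaded DataFrame itself, `adults.isna().mean() * 100` should give the same percentages for every column.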

## 1) Data preprocessing

### Step 01: Fill in missing values

In [21]:
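si = SimpleImputer(strategy='most_frequent')
# fit_transform returns a NumPy array rather than a DataFrame, so the column
# names are restored in the next cell.
adults = si.fit_transform(adults)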

In [26]:
cols = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','label']
adults = pd.DataFrame(adults, columns = cols)

In [29]:
adults.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
label             0
dtype: int64

In [30]:
# for column in ['workclass', 'occupation', 'native-country']:
#     adults[column] = adults[column].replace(np.nan, adults[column].mode()[0])

As shown above, the missing values were filled with the mode of each column.
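
As a point of comparison, the mode can also be filled in with pandas alone, touching only the columns that have gaps so that numeric columns keep their dtype. This is a minimal sketch on a hypothetical toy frame, in the spirit of the commented-out loop above, not the method the notebook actually ran.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
df = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['Private', np.nan, 'Private', 'State-gov'],
})

# Fill each incomplete column with its most frequent value (the mode).
for column in df.columns[df.isna().any()]:
    df[column] = df[column].fillna(df[column].mode()[0])

print(df['workclass'].tolist())
print(df['age'].dtype)
```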

### Step 02: Convert the label to a numeric value

In [31]:
class_map = {label:idx for idx, label in enumerate(np.unique(adults['label']))}
adults['label'] = adults['label'].map(class_map)

In [32]:
adults.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


As shown above, the values in the *label* column are now represented by {0, 1} instead of {<=50K, >50K}.
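
The mapping works because `np.unique` returns the distinct values in sorted order, so `<=50K` receives index 0 and `>50K` receives index 1. A minimal sketch on a hypothetical three-element series:

```python
import numpy as np
import pandas as pd

# Hypothetical labels in the same format as the dataset.
labels = pd.Series(['<=50K', '>50K', '<=50K'])

# np.unique returns sorted distinct values, so the mapping is deterministic.
class_map = {label: idx for idx, label in enumerate(np.unique(labels))}
encoded = labels.map(class_map)

print(encoded.tolist())
```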

### Step 03: Convert categorical values to numeric values

As the dataset shows, the features *education* and *education-num* carry the same information. To simplify processing, we keep only *education-num* and drop the *education* column, as shown below.

In [33]:
adults = adults.drop(columns='education')

Next, we convert the categorical features to numeric values. To do so, we use the *get_dummies* function, which takes a dataset with categorical columns and converts each of them to a binary (one-hot) representation: one new column per category, where 1 indicates that the sample belongs to that category.
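
A small sketch of this behavior on a hypothetical two-column frame. The `dtype=int` argument is an addition here: it forces 0/1 output, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
df = pd.DataFrame({
    'hours-per-week': [40, 13, 40],
    'sex': ['Male', 'Male', 'Female'],
})

# Only string columns are expanded; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, dtype=int)

print(encoded.columns.tolist())
print(encoded['sex_Female'].tolist())
```

Note that a numeric column stored with object dtype would also be expanded, one column per distinct value.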

In [34]:
adults = pd.get_dummies(adults)

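If we look at the *adults* DataFrame again, we see that the label is no longer in the last column. *get_dummies* places the columns it does not encode before the dummy columns, and here *label* is the only such column: even *age* was expanded into `age_17`, `age_18`, and so on, apparently because the DataFrame rebuilt after imputation holds every other column with object dtype. Since we will later need to separate features and labels, the code below finds the position of the label, wherever it is.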

In [35]:
adults.head(1)

Unnamed: 0,label,age_17,age_18,age_19,age_20,age_21,age_22,age_23,age_24,age_25,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [36]:
unique_index = pd.Index(adults.columns)

In [37]:
label_index = unique_index.get_loc(key='label')
label_index

0
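
For comparison, the same separation can be done by column name, which does not depend on where the label sits. This is a sketch on a hypothetical encoded frame, not the approach the notebook uses.

```python
import pandas as pd

# Hypothetical encoded frame with the label in the first position.
df = pd.DataFrame({
    'label': [0, 1, 0],
    'sex_Female': [0, 0, 1],
    'sex_Male': [1, 1, 0],
})

# Position-based lookup, as in the notebook.
label_index = df.columns.get_loc('label')

# Name-based split that ignores the label position.
y = df['label'].to_numpy()
X = df.drop(columns='label').to_numpy()

print(label_index, X.shape, y.shape)
```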

## 2) Applying the learning models

### Step 01: Separate input and output variables

In [38]:
dataset = adults.values

In [29]:
y = dataset[:, label_index]
X = np.delete(dataset, label_index, axis=1)

### Step 02: Split into training and test sets

In this assignment, we reserve 25% of the data for testing and 75% for training. The split is stratified, so both sets keep the class proportions of *y*.
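
The effect of stratification can be seen on a small imbalanced target. The arrays below are hypothetical; only the `train_test_split` arguments mirror the ones used in this step.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# With stratify=y, both subsets keep roughly the 20% positive rate.
print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```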

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

### Step 03: Create and instantiate the learning models

In [38]:
class AdalineGD(object):
    """ADAptive LInear NEuron classifier.
    Parameters
    ------------
    eta : float
      Learning rate (between 0.0 and 1.0)
    n_iter : int
      Passes over the training dataset.
    random_state : int
      Random number generator seed for random weight
      initialization.
    Attributes
    -----------
    w_ : 1d-array
      Weights after fitting.
    cost_ : list
      Sum-of-squares cost function value in each epoch.
    """
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """ Fit training data.
        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
          Training vectors, where n_samples is the number of samples and
          n_features is the number of features.
        y : array-like, shape = [n_samples]
          Target values.
        Returns
        -------
        self : object
        """
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            net_input = self.net_input(X)
            # Please note that the "activation" method has no effect
            # in the code since it is simply an identity function. We
            # could write `output = self.net_input(X)` directly instead.
            # The purpose of the activation is more conceptual, i.e.,  
            # in the case of logistic regression (as we will see later), 
            # we could change it to
            # a sigmoid function to implement a logistic regression classifier.
            output = self.activation(net_input)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return X

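    def predict(self, X):
        """Return class label after unit step"""
        # NOTE: this returns labels in {-1, 1}, while `label` was mapped to {0, 1},
        # so a prediction of -1 can never match a target of 0.
        return np.where(self.activation(self.net_input(X)) >= 0.0, 1, -1)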
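
Because `predict` returns labels in {-1, 1} while the targets in this notebook are in {0, 1}, a direct comparison only credits the positive class. The sketch below shows the mismatch on hypothetical arrays and one way to map the predictions before scoring; it is an illustration, not a change to the class above.

```python
import numpy as np

# Hypothetical targets in {0, 1} and predictions in {-1, 1}.
y_true = np.array([0, 0, 1, 1])
y_pred_pm = np.array([-1, -1, 1, 1])

# Direct comparison: a prediction of -1 never equals a target of 0.
acc_raw = np.mean(y_true == y_pred_pm)

# Map -1 to 0 and keep 1 as 1 before comparing.
y_pred_01 = np.where(y_pred_pm == 1, 1, 0)
acc_mapped = np.mean(y_true == y_pred_01)

print(acc_raw, acc_mapped)
```

An alternative would be to recode the targets to {-1, 1} with `np.where(y == 1, 1, -1)` before calling `fit`, which also keeps the zero threshold in `predict` consistent with the training targets.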

In [39]:
models = {}

In [40]:
models['perceptron'] = Perceptron(tol=1e-3, random_state=42, max_iter=1000, eta0=0.01)
models['adaline'] = AdalineGD(n_iter=100, eta=0.0001)
models['sgd'] = SGDClassifier(tol=1e-3, random_state=42)
models['log_reg'] = LogisticRegression(solver='lbfgs', random_state=42)
models['knn'] = KNeighborsClassifier(algorithm='kd_tree')
models['naive'] = GaussianNB()
models['svm'] = svm.SVC(gamma='scale', random_state=42)
models['tree'] = DecisionTreeClassifier()

As shown above, we created a dictionary of models, keyed by a short name, so that all of them can be evaluated in a single loop.

### Step 04: Run cross-validation to tune the hyperparameters

In [41]:
cv = StratifiedKFold(n_splits = 5, shuffle=True, random_state=42)

In [42]:
acc = {}
acc_mean = {}

In [43]:
# Do not re-run this cell
for key in models:
    # Scaling inside the pipeline refits the scaler on each training fold.
    s = StandardScaler()
    pipeline = Pipeline([('transformer', s), ('estimator', models[key])])
    acc[key] = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')
    acc_mean[key] = np.mean(acc[key])

In [44]:
for key in acc:
    print("Modelo: {} | Acc_mean: {}".format(key, acc_mean[key]))

Modelo: perceptron | Acc_mean: 0.8020472384343886
Modelo: adaline | Acc_mean: 0.04680598983057054
Modelo: sgd | Acc_mean: 0.8348489180423151
Modelo: log_reg | Acc_mean: 0.8492220724268094
Modelo: knn | Acc_mean: 0.8227272529573622
Modelo: naive | Acc_mean: 0.38689872673503134
Modelo: svm | Acc_mean: 0.8475432297003909
Modelo: tree | Acc_mean: 0.8138819014017598
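
The cross-validation loop wraps each estimator in a `Pipeline` together with a `StandardScaler`, so the scaler is fit only on the training part of each fold. The sketch below reproduces that pattern for a single estimator on synthetic data; `make_classification` is a stand-in for `X_train` and `y_train` and is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (illustrative assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipeline = Pipeline([
    ('transformer', StandardScaler()),
    ('estimator', LogisticRegression(solver='lbfgs', random_state=42)),
])

# One accuracy value per fold.
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a rough idea of how much the accuracy varies between folds.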


### Step 05: Rebuild the models with the chosen hyperparameters and test them

In [45]:
models['perceptron'] = Perceptron(tol=1e-3, random_state=42, max_iter=1000, eta0=0.01)
models['adaline'] = AdalineGD(n_iter=100, eta=0.0001)
models['sgd'] = SGDClassifier(tol=1e-3, random_state=42)
models['log_reg'] = LogisticRegression(solver='lbfgs', random_state=42)
models['knn'] = KNeighborsClassifier(algorithm='kd_tree')
models['naive'] = GaussianNB()
models['svm'] = svm.SVC(gamma='scale', random_state=42)
models['tree'] = DecisionTreeClassifier()

In [46]:
ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)
X_test_std = ss.transform(X_test)



In [47]:
y_pred_std = {}
acc_test = {}

In [48]:
for key in models:
    models[key].fit(X_train_std, y_train)
    y_pred_std[key] = models[key].predict(X_test_std)
    acc_test[key] = accuracy_score(y_test, y_pred_std[key])
    print("Modelo: {} Acc: {}".format(key, acc_test[key]))

Modelo: perceptron Acc: 0.7947426606068051
Modelo: adaline Acc: 0.04569463210907751
Modelo: sgd Acc: 0.8430168283994596
Modelo: log_reg Acc: 0.8538263112639725
Modelo: knn Acc: 0.831961675469844
Modelo: naive Acc: 0.39749416533595383
Modelo: svm Acc: 0.8511239405478442
Modelo: tree Acc: 0.8139049256848053


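As shown above, the model with the best test accuracy was Logistic Regression (about 0.854), closely followed by the SVM (about 0.851). The very low Adaline score is consistent with its `predict` method returning labels in {-1, 1} while the targets are in {0, 1}.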
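
The comparison can also be made programmatically by sorting the accuracy dictionary. The sketch below hard-codes the test accuracies printed above so that it runs on its own; in the notebook, the existing `acc_test` dictionary could be used directly.

```python
# Test accuracies copied from the output above.
acc_test = {
    'perceptron': 0.7947426606068051,
    'adaline': 0.04569463210907751,
    'sgd': 0.8430168283994596,
    'log_reg': 0.8538263112639725,
    'knn': 0.831961675469844,
    'naive': 0.39749416533595383,
    'svm': 0.8511239405478442,
    'tree': 0.8139049256848053,
}

# Sort models from best to worst test accuracy.
ranking = sorted(acc_test.items(), key=lambda item: item[1], reverse=True)
best_model, best_acc = ranking[0]

for name, value in ranking:
    print("{:<10} {:.4f}".format(name, value))
```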