# 1. Introdução

Neste trabalho foi estudado o conjunto de dados disponível em http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients, contendo informações de clientes bancários em Taiwan. O *dataset* possui 23 variáveis explicativas e uma única variável de resposta binária, sendo portanto um problema de classificação binária. Para mais detalhes sobre as métricas confira no link do *dataset*.

A metodologia aplicada para avaliar os classificadores foi realizar 100 (cem) rodadas independentes de separação do conjunto de dados entre treino/teste, construir os classificadores no conjunto de treino e examinar seu desempenho no conjunto de teste. A razão entre treino/teste escolhida foi de 80%/20%.

O código abaixo foi utilizado para carregar o conjunto de dados no sistema computacional e averiguar o número de amostras e atributos.

In [1]:
# carregando dataset
import pandas as pd

dataset = pd.read_excel("./default of credit card clients.xls")
dataset.columns = dataset.iloc[0].values.tolist() # renomeando colunas
dataset = dataset.drop([0]) # deletando linha de ID das variáveis, colunas já renomeadas

temp = dataset.columns.values.tolist()
features = [e for e in temp if e not in ('default payment next month')] # retirando variável de resposta
features = dataset[features]

targets = dataset['default payment next month'].copy()
print('Número de amostras: {}.\nNúmero de atributos: {}.'.format(targets.shape[0],features.shape[1]))

Número de amostras: 30000.
Número de atributos: 24.


# 2. K vizinhos mais próximos (k-NN)

Para a avaliação do classificador k-vizinhos mais próximos foram utilizadas as funções da biblioteca *scikit-learn*. Uma busca em grade de dois hiper-parâmetros também foi realizada:
 - Tipo de mudança de escala dos dados: se *min-max*, que deixa os atributos entre 0 e 1, ou a *std*, conhecida como normalização estatística, deixando os atributos com média zero e variância unitária;
 - O valor de '$k$', ou quantidade de vizinhos mais próximos. A busca foi realizada entre os valores de 1 a 10, inclusivo.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
import numpy as np

In [3]:
# Função para mudar a escala dos dados
def scale_feat(X_train, X_test, scaleType='min-max'):
    if scaleType=='min-max' or scaleType=='std':
        X_tr_norm = np.copy(X_train) # fazendo cópia para deixar original disponível
        X_ts_norm = np.copy(X_test)
        scaler = MinMaxScaler() if scaleType=='min-max' else StandardScaler()
        scaler.fit(X_tr_norm)
        X_tr_norm = scaler.transform(X_tr_norm)
        X_ts_norm = scaler.transform(X_ts_norm)
        return (X_tr_norm, X_ts_norm)
    else:
        raise ValueError("Tipo de escala não definida. Use 'min-max' ou 'std'.")

In [4]:
def eval_knn(features, targets, rodadas, scales, ks):
    # Rodadas de treino/teste
    nn_data = np.zeros((len(ks)*len(scales), 7)) # matriz que guardará resultados numéricos
    for scale in scales:
        for k in ks:
            especificidade = 0
            sensibilidade  = 0
            acc = [0]*rodadas
            for i in range(rodadas):
                # divisão treino/teste 
                X_train, X_test, y_train, y_test = train_test_split(features, 
                                                                    targets,
                                                                    test_size=0.2)
                # escalonar os dados
                X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scale)

                # construindo classificador
                k_nn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
                k_nn.fit(X_tr_norm, y_train)

                # calculando as métricas de avaliação
                cm = confusion_matrix(y_test, k_nn.predict(X_ts_norm))
                total=sum(sum(cm))

                acc[i] = (cm[0,0]+cm[1,1])/total
                especificidade += cm[0,0]/(cm[0,0]+cm[0,1])
                sensibilidade  += cm[1,1]/(cm[1,1]+cm[1,0])

            especificidade/=rodadas # Valores médios
            sensibilidade /=rodadas

            index = k-1 if scale=='min-max' else k+9 # indíce da linha dos dados
            #index = 0 if scale=='min-max' else 1 # indíce da linha dos dados
            nn_data[index,:] = np.matrix([np.mean(acc), np.median(acc), min(acc), max(acc), 
                                          np.std(acc), sensibilidade, especificidade])
    return nn_data

In [5]:
%%time
features_values = features.astype(np.float64).values
targets_values  = targets.values.tolist()

# Hiper-parâmetros
rodadas = 100
scales = ['min-max', 'std']   # tipos de escalonamentos possíveis
ks = [i for i in range(1,11)]

knn_data = eval_knn(features_values, targets_values, rodadas, scales, ks)

cabecalho = ['Média', 'Mediana', 'Mínimo', 'Máximo', 'Desv. Padrão', 'Sensib. média', 'Especif. média']
index = ['{}-NN [\'{}\']'.format(k, scale) for scale in scales for k in ks]
df_knn = pd.DataFrame(knn_data, columns=cabecalho, index=index)

CPU times: user 6h 7min 42s, sys: 8.74 s, total: 6h 7min 50s
Wall time: 55min 34s


In [6]:
df_knn

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
1-NN ['min-max'],0.72929,0.728917,0.7195,0.741,0.004393,0.379249,0.828502
2-NN ['min-max'],0.784093,0.784333,0.773667,0.793833,0.004683,0.192652,0.951833
3-NN ['min-max'],0.773865,0.773667,0.7605,0.783833,0.00434,0.34229,0.896516
4-NN ['min-max'],0.794213,0.793667,0.782,0.806667,0.004859,0.231983,0.953766
5-NN ['min-max'],0.793453,0.793333,0.7835,0.804167,0.004606,0.326738,0.92592
6-NN ['min-max'],0.799348,0.7995,0.789167,0.809833,0.004608,0.250038,0.95583
7-NN ['min-max'],0.799683,0.799667,0.789,0.810167,0.003779,0.316503,0.937024
8-NN ['min-max'],0.802428,0.802417,0.7925,0.812833,0.004107,0.256828,0.95732
9-NN ['min-max'],0.802495,0.802833,0.793,0.812167,0.00458,0.305032,0.943852
10-NN ['min-max'],0.80287,0.8035,0.7915,0.815,0.004937,0.255645,0.958019


Podemos ver que o tipo de normalização pouca influência teve na performance do modelo, sendo o hiper-parâmetro $k$ o que apresentou correlação com o desempenho, mas estagnando o aumento da taxa de acerto média quando $k=6$. O melhor resultado foi:

In [7]:
df_knn.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose()

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
10-NN ['std'],0.806322,0.806417,0.7975,0.818,0.004066,0.287636,0.954376


# 3. Distância mínima ao centróide (DMC)

De hiper-parâmetro do classificador distância mínima ao centroide só foi testado o tipo de escalonamento dos dados.

In [8]:
from numpy.linalg import norm

def eval_DMC(features, targets, rodadas, scales):
    nCases = len(scales) # número de combinações dos hiperparâmetros
    DMC_data = np.zeros((nCases,7))
    count = 0
    for scale in scales:
        acc = [0]*rodadas
        especificidade = 0
        sensibilidade  = 0
        for i in range(rodadas):
            # divisão treino/teste 
            X_train, X_test, y_train, y_test = train_test_split(features, 
                                                                targets,
                                                                test_size=.20)
            # escalonar os dados
            X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scale)

            # Cálculo dos centróides
            c0 = np.mean(X_tr_norm[y_train==0], axis=0)
            c1 = np.mean(X_tr_norm[y_train==1], axis=0) 
                        
            # Predição do conjunto de teste
            y_pred = [1 if norm(u-c1) < norm(u-c0) else 0 for u in X_ts_norm]

            # métricas de avaliação
            cm = confusion_matrix(y_test, y_pred)
            total=sum(sum(cm))

            acc[i] = (cm[0,0]+cm[1,1])/total
            especificidade += cm[0,0]/(cm[0,0]+cm[0,1])
            sensibilidade  += cm[1,1]/(cm[1,1]+cm[1,0])

        especificidade/=rodadas # Valores médios
        sensibilidade /=rodadas

        DMC_data[count,:] = np.matrix([np.mean(acc), np.median(acc), min(acc), max(acc), 
                                         np.std(acc), sensibilidade, especificidade])
        count+=1
    
    return DMC_data

In [9]:
%%time
DMC_data = eval_DMC(features=features.astype(np.float64).values,
                    targets=targets.astype(np.int).values,
                    rodadas=100, scales=scales)
index = ['DMC [\'{}\']'.format(scale) for scale in scales]
DMC_df = pd.DataFrame(DMC_data, columns=cabecalho, index=[index])

CPU times: user 14.8 s, sys: 180 ms, total: 15 s
Wall time: 15 s


In [10]:
DMC_df

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
DMC ['min-max'],0.657158,0.656583,0.641667,0.671833,0.006458,0.600882,0.673152
DMC ['std'],0.658925,0.659417,0.6435,0.676333,0.006153,0.633633,0.666121


O tipo de normalização teve pouco efeito na média de acertos, sendo a escolha entre o melhor dos dois quase arbritária.

In [11]:
DMC_df.sort_values('Desv. Padrão', ascending=True).iloc[0,:].to_frame().transpose()

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
DMC ['std'],0.658925,0.659417,0.6435,0.676333,0.006153,0.633633,0.666121


# 4. Classificador quadrático gaussiano (CQG)

Para assegurar a invertibilidade das matrizes de covariância de cada classe foram realizadas 1.000 (mil) rodadas independentes de separação dos dados em treino/teste e calculados os postos dessas matrizes. O código abaixo aplicou tal procedimento:

In [12]:
%%time
from numpy.linalg import matrix_rank

rodadas = 1000
for i in range(rodadas):
    # divisão treino/teste 
    X_train, X_test, y_train, y_test = train_test_split(features.astype(np.float64).values, 
                                                        targets.astype(np.int).values,
                                                        test_size=.20)
    # Cálculo das médias
    m1 = np.mean(X_train[y_train==1], axis=0)
    m0 = np.mean(X_train[y_train==0], axis=0)

    # Cálculo das matrizes de covariância
    C1 = np.zeros((X_train.shape[1],X_train.shape[1]))
    C0 = np.zeros((X_train.shape[1],X_train.shape[1]))

    for j in range(len(y_train)):
        if y_train[j]: # indivíduo 1
            C1 += np.matmul(np.expand_dims((X_train[j,:]-m1), axis=1),
                                np.expand_dims((X_train[j,:]-m1), axis=0))
        else: # indivíduo 0
            C0 += np.matmul(np.expand_dims((X_train[j,:]-m0), axis=1),
                                np.expand_dims((X_train[j,:]-m0), axis=0))

    # dividindo pelo número de elementos de cada classe
    C1/=sum(y_train)
    C0/=(len(y_train) - sum(y_train))
    
    if any([matrix_rank(C1) < X_train.shape[1],
            matrix_rank(C0) < X_train.shape[1]]):
        print("Rodada {}".format(i))
        print("Nº de atributos: {}".format(X_train.shape[1]))
        print("Posto(C1) = {}\nPosto(C0) = {}".format(
            matrix_rank(C1), matrix_rank(C0)))
        break
    
    if i==rodadas-1:
        print("As matrizes C0 e C1 foram invertíveis em {} rodadas.".format(rodadas))

As matrizes C0 e C1 foram invertíveis em 1000 rodadas.
CPU times: user 4min 34s, sys: 88 ms, total: 4min 34s
Wall time: 4min 34s


Como podemos ver as matrizes se apresentaram invertíveis em todas as rodadas, o que abre espaço para a aplicação do CQG original. A título de curiosidade também foram implementados 3 variantes utilizadas quando há problemas de invertibilidade das matrizes de covariância:

- Variante 1: extrair apenas a diagonal principal das matrizes;
- Variante 2: matriz de covariância agregada (*pooled*);
- Variante 3: regularização de Friedman, nesse caso foi realizada um busca em grade para determinar o hiper-parâmetro $\lambda$.

A etapa de normalização dos dados foi retirada pois tal procedimento já é realizado pelas matrizes de covariância. O código abaixo construiu o CQG original, suas 3 variantes e calculou suas respectivas performances:

In [13]:
from numpy.linalg import inv
from numpy import dot, matmul

def eval_CQG(features, targets, rodadas, lambdas):
    nCases = 3+len(lambdas) # número de casos estudados: original + 3 variantes
    acc = np.zeros((nCases,rodadas))
    especificidade = [0]*nCases
    sensibilidade  = [0]*nCases
    CQG_data = np.zeros((nCases, len(cabecalho)))
    for i in range(rodadas):
        # divisão treino/teste 
        X_train, X_test, y_train, y_test = train_test_split(features, targets,
                                                            test_size=.20)
        # Cálculo das médias
        m1 = np.mean(X_train[y_train==1], axis=0)
        m0 = np.mean(X_train[y_train==0], axis=0)

        # Cálculo das matrizes de covariância
        C1 = np.zeros((X_train.shape[1],X_train.shape[1]))
        C0 = np.zeros((X_train.shape[1],X_train.shape[1]))

        for j in range(len(y_train)):
            if y_train[j]: # indivíduo 1
                C1 += np.matmul(np.expand_dims((X_train[j,:]-m1), axis=1),
                                np.expand_dims((X_train[j,:]-m1), axis=0))
            else: # indivíduo 0
                C0 += np.matmul(np.expand_dims((X_train[j,:]-m0), axis=1),
                                np.expand_dims((X_train[j,:]-m0), axis=0))
        # dividindo pelo número de elementos de cada classe
        C1/=sum(y_train)
        C0/=(len(y_train) - sum(y_train))

        # Predição do conjunto de teste
        y_pred = np.zeros((nCases,len(y_test)))

        
        # Variante 0: (CQG original)
        C1_v0_inv = inv(C1)
        C0_v0_inv = inv(C0)
        y_pred[0] = [
            1 if dot(matmul(
                (X_test[j]-m1),C1_v0_inv),(X_test[j]-m1)) < dot(matmul(
                (X_test[j]-m0),C0_v0_inv),(X_test[j]-m0)) else 
            0 for j in range(len(y_test))]
        
        
        # Variante 1:
        C1_v1_inv = inv(np.diag(np.diag(C1)))
        C0_v1_inv = inv(np.diag(np.diag(C0)))

        y_pred[1] = [
            1 if dot(matmul(
                (X_test[j]-m1),C1_v1_inv),(X_test[j]-m1)) < dot(matmul(
                (X_test[j]-m0),C0_v1_inv),(X_test[j]-m0)) else 
            0 for j in range(len(y_test))]


        # Variante 2:
        C_pool = (sum(y_train)*C1 + (len(y_train)-sum(y_train))*C0)/len(y_train)
        C_pool_inv = inv(C_pool)

        y_pred[2] = [
            1 if dot(matmul(
                (X_test[j]-m1),C_pool_inv),(X_test[j]-m1)) < dot(matmul(
                (X_test[j]-m0),C_pool_inv),(X_test[j]-m0)) else 
            0 for j in range(len(y_test))]

        
        # Variante 3:
        for j in range(len(lambdas)):
            # calcular matriz regularizada
            C1_v3_inv = inv(((1-lambdas[j])*sum(y_train)*C1                + lambdas[j]*len(y_train)*C_pool) / 
                            ((1-lambdas[j])*sum(y_train)                   + lambdas[j]*len(y_train)))
            C0_v3_inv = inv(((1-lambdas[j])*(len(y_train)-sum(y_train))*C0 + lambdas[j]*len(y_train)*C_pool) / 
                            ((1-lambdas[j])*(len(y_train)-sum(y_train))    + lambdas[j]*len(y_train)) 
                               )

            # predição no conjunto de teste
            y_pred[j+3] = [
                1 if dot(matmul(
                    (X_test[j]-m1),C1_v3_inv),(X_test[j]-m1)) < dot(matmul(
                    (X_test[j]-m0),C0_v3_inv),(X_test[j]-m0)) else 
                0 for j in range(len(y_test))]

        # Avaliação da performance
        for j in range(nCases):
            cm = confusion_matrix(y_test, y_pred[j])
            total=sum(sum(cm))

            acc[j][i] = (cm[0,0]+cm[1,1])/total
            especificidade[j] += cm[0,0]/(cm[0,0]+cm[0,1])
            sensibilidade[j]  += cm[1,1]/(cm[1,1]+cm[1,0])

    # consolidando estatísticas
    for j in range(nCases): 
        especificidade[j]/=rodadas # Valores médios
        sensibilidade[j] /=rodadas

        CQG_data[j] = np.matrix([np.mean(acc[j]), np.median(acc[j]), min(acc[j]), max(acc[j]),
                                 np.std(acc[j]), sensibilidade[j], especificidade[j]])    
    
    return CQG_data

In [14]:
%%time
# Hiper-parâmetros:
rodadas = 100
lambdas = np.linspace(0,1,num=10) # possíveis valores de lambda

CQG_data = eval_CQG(features=features.astype(np.float64).values,
                    targets=targets.astype(np.int).values,
                    rodadas=rodadas, lambdas=lambdas)        
index = ["CQG original", "CQG variante 1 (diagonal)", "CQG variante 2 (pooled)"]
index.extend(["CQG variante 3 [$\lambda$ = {0:.3f}]".format(lambda_) for lambda_ in lambdas])
CQG_df = pd.DataFrame(CQG_data, columns=cabecalho, index=[index])

CPU times: user 8min 18s, sys: 12min 55s, total: 21min 13s
Wall time: 3min 20s


In [15]:
CQG_df

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
CQG original,0.648508,0.6495,0.632,0.661333,0.006665,0.709628,0.631215
CQG variante 1 (diagonal),0.571373,0.57225,0.542667,0.592333,0.010401,0.77024,0.515138
CQG variante 2 (pooled),0.723868,0.723833,0.705667,0.7375,0.006427,0.597547,0.759583
CQG variante 3 [$\lambda$ = 0.000],0.648508,0.6495,0.632,0.661333,0.006665,0.709628,0.631215
CQG variante 3 [$\lambda$ = 0.111],0.648705,0.649583,0.631833,0.664167,0.006707,0.714283,0.63016
CQG variante 3 [$\lambda$ = 0.222],0.65932,0.6595,0.644167,0.674167,0.006573,0.705395,0.646293
CQG variante 3 [$\lambda$ = 0.333],0.667695,0.667667,0.651167,0.685333,0.00662,0.697148,0.65937
CQG variante 3 [$\lambda$ = 0.444],0.6714,0.6715,0.655333,0.688833,0.006559,0.691866,0.665616
CQG variante 3 [$\lambda$ = 0.556],0.675703,0.676167,0.656167,0.692167,0.006344,0.688323,0.67214
CQG variante 3 [$\lambda$ = 0.667],0.68205,0.6825,0.663667,0.696333,0.006295,0.683025,0.681778


Com a tabela acima fica claro que quando $\lambda=0$ a variante 3 se torna o CQG original e quando $\lambda=1$ ela é equivalente à variante 2, com a matriz de covariância agregada (*pooled*). O melhor desempenho ocorreu na variante 2 (ou na variante 3 com $\lambda=1$):

In [16]:
CQG_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose()

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
CQG variante 2 (pooled),0.723868,0.723833,0.705667,0.7375,0.006427,0.597547,0.759583


# 5. Regressão linear

Uma regressão linear simples através de mínimos quadrados foi realizada e, antes de classificar o conjunto de teste, a seguinte função foi adicionada à saída do modelo:

$$
\phi(u) = \left\{
        \begin{array}{ll}
            1 & \quad u \geq 0,5 \\
            0 & \quad u < 0,5
        \end{array}
    \right.
$$

In [17]:
from sklearn import linear_model

def eval_reg_linear(features, targets, rodadas, scales):
    reg_data = np.zeros((len(scales), 7)) 
    for j in range(len(scales)):
        acc = [0]*rodadas
        especificidade = 0
        sensibilidade  = 0
        for i in range(rodadas):
            # divisão treino/teste 
            X_train, X_test, y_train, y_test = train_test_split(features, 
                                                                targets,
                                                                test_size=.20)
            # escalonar os dados
            X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scales[j])

            # construindo o classificador
            reg = linear_model.LinearRegression(n_jobs=-1)
            reg.fit(X_tr_norm, y_train)

            # calculando as métricas de avaliação
            cm = confusion_matrix(y_test, reg.predict(X_ts_norm)>=.5)
            total=sum(sum(cm))

            acc[i] = (cm[0,0]+cm[1,1])/total
            especificidade += cm[0,0]/(cm[0,0]+cm[0,1])
            sensibilidade  += cm[1,1]/(cm[1,1]+cm[1,0])

        especificidade/=rodadas # Valores médios
        sensibilidade /=rodadas

        reg_data[j,:] = np.matrix([np.mean(acc), np.median(acc), min(acc), max(acc),
                                   np.std(acc), sensibilidade, especificidade])
    return reg_data

In [18]:
%%time

reg_data = eval_reg_linear(features_values, targets_values, 100, scales)

index = ['Reg. Linear [\'{}\']'.format(scale) for scale in scales]
reg_df = pd.DataFrame(reg_data, columns=cabecalho, index=index)

CPU times: user 46.4 s, sys: 1min 16s, total: 2min 3s
Wall time: 15.8 s


In [19]:
reg_df

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. Linear ['min-max'],0.8003,0.800583,0.787,0.809833,0.004855,0.14767,0.984541
Reg. Linear ['std'],0.799587,0.799833,0.788167,0.810833,0.004443,0.148245,0.984285


A diferença entres os dois classificadores é tão pequena que se torna arbritária a escolha entre os dois. A versão com escalonamento *min-max* foi adotado por apresentar na tabela o valor de taxa média de acertos levemente maior.

In [20]:
reg_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose()

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. Linear ['min-max'],0.8003,0.800583,0.787,0.809833,0.004855,0.14767,0.984541


# 6. Regressão logística

Para a regressão logística também foi realizada uma busca em grade pelos hiper-parâmetros:
- Tipo de normalização dos dados, *'min-max'* ou *'std'*;
- Tipo da norma da função custo, $l_1$ ou $l_2$;
- Constante $C$, que é o inverso da força da regularização aplicada à função custo.

In [21]:
from sklearn.linear_model import LogisticRegression

def eval_reg_log(features, targets, rodadas, scales, penalties, Cs):
    nCases = len(scales)*len(penalties)*len(Cs) # número de combinações dos hiperparâmetros
    log_data = np.zeros((nCases,len(cabecalho)))
    index = ['string']*nCases # lista de índices a serem salvos

    count = 0
    for scale in scales:
        for penalty in penalties:
            for C in Cs:       
                acc = [0]*rodadas
                especificidade = 0
                sensibilidade  = 0
                for i in range(rodadas):
                    # divisão treino/teste 
                    X_train, X_test, y_train, y_test = train_test_split(features, 
                                                                        targets,
                                                                        test_size=.20)
                    # escalonar os dados
                    X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, scaleType=scale)

                    solver = 'lbfgs' if scale=='l2' else 'liblinear'
                    clf = LogisticRegression(penalty=penalty, C=C, solver=solver
                                             ,max_iter=int(1e4)
                                            )
                    clf.fit(X_tr_norm, y_train)

                    cm = confusion_matrix(y_test, clf.predict(X_ts_norm)>.5)
                    total=sum(sum(cm))

                    acc[i] = (cm[0,0]+cm[1,1])/total
                    especificidade += cm[0,0]/(cm[0,0]+cm[0,1])
                    sensibilidade  += cm[1,1]/(cm[1,1]+cm[1,0])

                especificidade/=rodadas # Valores médios
                sensibilidade /=rodadas

                log_data[count,:] = np.matrix([np.mean(acc), np.median(acc), min(acc), max(acc), 
                                               np.std(acc), sensibilidade, especificidade])

                count+=1
                
    return log_data

In [22]:
%%time
# Hiper-parâmetros
rodadas = 100
scales = ['min-max', 'std']
penalties = ['l1', 'l2']
Cs = [10**(i) for i in range(-3,5)]

log_data = eval_reg_log(features_values, targets_values, rodadas, scales, penalties, Cs)

index = ['Reg. log. [{} / {} / {}]'
         .format(scale, penalty, C) for scale in scales for penalty in penalties for C in Cs  ]

log_df = pd.DataFrame(log_data, columns=cabecalho, index=index)

CPU times: user 45min 3s, sys: 26min 26s, total: 1h 11min 29s
Wall time: 34min 54s


In [23]:
log_df

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. log. [min-max / l1 / 0.001],0.778772,0.77875,0.763,0.789,0.00455,0.0,1.0
Reg. log. [min-max / l1 / 0.01],0.784095,0.784667,0.768167,0.795167,0.005387,0.045182,0.99469
Reg. log. [min-max / l1 / 0.1],0.808447,0.808417,0.797333,0.8195,0.004649,0.218197,0.975774
Reg. log. [min-max / l1 / 1],0.810658,0.8115,0.7965,0.826333,0.005622,0.234639,0.974046
Reg. log. [min-max / l1 / 10],0.809798,0.8105,0.7975,0.820833,0.004204,0.237202,0.972856
Reg. log. [min-max / l1 / 100],0.810315,0.810083,0.800333,0.821167,0.004428,0.23813,0.972797
Reg. log. [min-max / l1 / 1000],0.810317,0.8105,0.801,0.819833,0.004341,0.238032,0.972386
Reg. log. [min-max / l1 / 10000],0.810102,0.81,0.798333,0.821,0.004559,0.23717,0.972658
Reg. log. [min-max / l2 / 0.001],0.77914,0.779083,0.7665,0.794333,0.00535,0.0,1.0
Reg. log. [min-max / l2 / 0.01],0.783405,0.783417,0.772833,0.793,0.00413,0.036044,0.99583


In [24]:
log_df.sort_values('Média', ascending=False).head()

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. log. [std / l1 / 10],0.810978,0.810667,0.794667,0.824833,0.00503,0.238886,0.972515
Reg. log. [std / l1 / 1],0.810683,0.81125,0.800667,0.821167,0.004546,0.238276,0.97316
Reg. log. [min-max / l1 / 1],0.810658,0.8115,0.7965,0.826333,0.005622,0.234639,0.974046
Reg. log. [std / l1 / 10000],0.810425,0.81025,0.800167,0.819333,0.004223,0.239068,0.972931
Reg. log. [std / l2 / 0.1],0.810377,0.810333,0.801667,0.818833,0.003793,0.237235,0.972968


Apesar da pequena diferença entre os primeiros colocados, em relação à média de acertos, o primeiro colocado, destacado na tabela abaixo, foi escolhido como ponto de referência para a comparação com os outros classificadores.

In [25]:
log_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose()

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. log. [std / l1 / 10],0.810978,0.810667,0.794667,0.824833,0.00503,0.238886,0.972515


# 7. Quantização vetorial

Para reduzir o volume de dados o sequinte algoritmo foi utilizado:
1. Separação dos dados em cada classe;
2. Após escolhido $K$ foi aplicado o algoritmo K-médias a cada classe 10 vezes. O melhor resultado segundo a soma das distância quadrática (SSD) foi salvo como a melhor clusterização;
3. Os $K$ protótipos de cada classe substituiram os dados originais e os classificadores já estudados foram reaplicados.

Utilizando a biblioteca *sklearn* foi possível com poucas linhas de código realizar a redução do volume dos dados:

In [26]:
%%time
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

X = features.astype(np.float64).values
y = targets.astype(np.int).values

# Hiper-parâmetros
K = 1000       # número de protótipos para cada classe
rodadas = 10   # número de rodadas com diferentes inicializações

X1 = X[y==1].copy()
X0 = X[y==0].copy()
kmeans_X0 = KMeans(n_clusters=K, n_init=rodadas, init='random', n_jobs=-1).fit(X0)
kmeans_X1 = KMeans(n_clusters=K, n_init=rodadas, init='random', n_jobs=-1).fit(X1)

CPU times: user 4min 9s, sys: 5min 2s, total: 9min 11s
Wall time: 3min 9s


In [27]:
print("SSD no conjunto 0: {}".format(kmeans_X0.inertia_))
print("SSD no conjunto 1: {}".format(kmeans_X1.inertia_))

SSD no conjunto 0: 37906971756658.35
SSD no conjunto 1: 4510890617279.083


In [28]:
X_new = np.concatenate((kmeans_X0.cluster_centers_, kmeans_X1.cluster_centers_), axis=0)
y_new = np.concatenate((np.zeros(K), np.ones(K)))

## 7.1 k-NN com os dados reduzidos

Aplicando o k-NN aos dados reduzidos obtemos a tabela:

In [29]:
%%time
# Hiper-parâmetros
rodadas = 100
scales = ['min-max', 'std']   # tipos de escalonamentos possíveis
ks = [i for i in range(1,11)] # valores possíveis de k para k-NN

nn_data_red = eval_knn(features=X_new,
                       targets=y_new,
                       rodadas=rodadas, scales=scales, ks=ks)

idx_label      = ['{}-NN-red [\'min-max\']'.format(k) for k in ks]
idx_label.extend(['{}-NN-red [\'std\']'.format(k) for k in ks])
df_knn_red = pd.DataFrame(nn_data_red, columns=cabecalho, index=[idx_label])

CPU times: user 2min 47s, sys: 2.73 s, total: 2min 50s
Wall time: 4min 3s


In [30]:
display(df_knn)
display(df_knn_red)

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
1-NN ['min-max'],0.72929,0.728917,0.7195,0.741,0.004393,0.379249,0.828502
2-NN ['min-max'],0.784093,0.784333,0.773667,0.793833,0.004683,0.192652,0.951833
3-NN ['min-max'],0.773865,0.773667,0.7605,0.783833,0.00434,0.34229,0.896516
4-NN ['min-max'],0.794213,0.793667,0.782,0.806667,0.004859,0.231983,0.953766
5-NN ['min-max'],0.793453,0.793333,0.7835,0.804167,0.004606,0.326738,0.92592
6-NN ['min-max'],0.799348,0.7995,0.789167,0.809833,0.004608,0.250038,0.95583
7-NN ['min-max'],0.799683,0.799667,0.789,0.810167,0.003779,0.316503,0.937024
8-NN ['min-max'],0.802428,0.802417,0.7925,0.812833,0.004107,0.256828,0.95732
9-NN ['min-max'],0.802495,0.802833,0.793,0.812167,0.00458,0.305032,0.943852
10-NN ['min-max'],0.80287,0.8035,0.7915,0.815,0.004937,0.255645,0.958019


Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
1-NN-red ['min-max'],0.76585,0.7675,0.71,0.8075,0.019158,0.640626,0.892236
2-NN-red ['min-max'],0.73245,0.7325,0.675,0.805,0.022954,0.48989,0.974683
3-NN-red ['min-max'],0.782625,0.7825,0.73,0.8275,0.021896,0.626215,0.941919
4-NN-red ['min-max'],0.7536,0.7525,0.695,0.7975,0.020804,0.53437,0.973267
5-NN-red ['min-max'],0.7769,0.775,0.725,0.845,0.023633,0.598862,0.954075
6-NN-red ['min-max'],0.748825,0.74875,0.7,0.81,0.021309,0.52553,0.97447
7-NN-red ['min-max'],0.775075,0.7725,0.7325,0.825,0.021022,0.585232,0.964257
8-NN-red ['min-max'],0.752725,0.75375,0.695,0.8175,0.024164,0.528859,0.977192
9-NN-red ['min-max'],0.76825,0.7675,0.7175,0.8075,0.018952,0.569205,0.969886
10-NN-red ['min-max'],0.7523,0.75375,0.695,0.8025,0.021902,0.526362,0.977053


Os resultados do k-NN com os dados reduzidos foram menos conclusivos. Tanto o tipo de normalização quanto a variável $k$ não tiveram uma correlação explícita com a taxa média de acertos.  Embora os resultados tenham sido muito próximos o classificador com a taxa média de acertos mais alta foi escolhido para comparação com os outros modelos.

In [31]:
display(df_knn_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose())

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
3-NN-red ['min-max'],0.782625,0.7825,0.73,0.8275,0.021896,0.626215,0.941919


## 7.2 DMC com os dados reduzidos

O mesmo classificador foi aplicado aos dados reduzidos:

In [32]:
%%time
DMC_data_red = eval_DMC(features=X_new,
                        targets=y_new,
                        rodadas=100, scales=scales)
index = ['DMC-red [\'{}\']'.format(scale) for scale in scales]
DMC_df_red = pd.DataFrame(DMC_data_red, columns=cabecalho, index=[index])

CPU times: user 1.36 s, sys: 0 ns, total: 1.36 s
Wall time: 1.37 s


In [33]:
display(DMC_df_red)

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
DMC-red ['min-max'],0.667375,0.67125,0.6075,0.71,0.022258,0.667225,0.667271
DMC-red ['std'],0.6443,0.6425,0.5775,0.6975,0.022341,0.64977,0.638748


Novamente com resultados muito próximos não é fácil determinar qual dos dois classificadores foi o melhor. Para título de comparação com os outros modelos foi pego o com maior taxa média de acertos:

In [34]:
display(DMC_df_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose())

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
DMC-red ['min-max'],0.667375,0.67125,0.6075,0.71,0.022258,0.667225,0.667271


## 7.3 CQG com os dados reduzidos

O CQG original com suas 3 variantes foram aplicados ao conjunto de dados reduzidos:

In [35]:
%%time
lambdas = np.linspace(0,1,num=10) # possíveis valores de lambda

CQG_data_red = eval_CQG(features=X_new,
                        targets=y_new,
                        rodadas=100, lambdas=lambdas)        
index = ["CQG-red original", "CQG-red variante 1 (diagonal)", "CQG-red variante 2 (pooled)"]
index.extend(["CQG-red variante 3 [$\lambda$ = {0:.3f}]".format(lambda_) for lambda_ in lambdas])
CQG_df_red = pd.DataFrame(CQG_data_red, columns=cabecalho, index=[index])

CPU times: user 49.7 s, sys: 1min 25s, total: 2min 15s
Wall time: 18.9 s


In [36]:
display(CQG_df)
display(CQG_df_red)

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
CQG original,0.648508,0.6495,0.632,0.661333,0.006665,0.709628,0.631215
CQG variante 1 (diagonal),0.571373,0.57225,0.542667,0.592333,0.010401,0.77024,0.515138
CQG variante 2 (pooled),0.723868,0.723833,0.705667,0.7375,0.006427,0.597547,0.759583
CQG variante 3 [$\lambda$ = 0.000],0.648508,0.6495,0.632,0.661333,0.006665,0.709628,0.631215
CQG variante 3 [$\lambda$ = 0.111],0.648705,0.649583,0.631833,0.664167,0.006707,0.714283,0.63016
CQG variante 3 [$\lambda$ = 0.222],0.65932,0.6595,0.644167,0.674167,0.006573,0.705395,0.646293
CQG variante 3 [$\lambda$ = 0.333],0.667695,0.667667,0.651167,0.685333,0.00662,0.697148,0.65937
CQG variante 3 [$\lambda$ = 0.444],0.6714,0.6715,0.655333,0.688833,0.006559,0.691866,0.665616
CQG variante 3 [$\lambda$ = 0.556],0.675703,0.676167,0.656167,0.692167,0.006344,0.688323,0.67214
CQG variante 3 [$\lambda$ = 0.667],0.68205,0.6825,0.663667,0.696333,0.006295,0.683025,0.681778


Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
CQG-red original,0.589875,0.5875,0.5275,0.6425,0.023239,0.968565,0.208788
CQG-red variante 1 (diagonal),0.641175,0.64125,0.58,0.695,0.022548,0.917234,0.363506
CQG-red variante 2 (pooled),0.79425,0.7925,0.755,0.8375,0.017134,0.701307,0.887575
CQG-red variante 3 [$\lambda$ = 0.000],0.589875,0.5875,0.5275,0.6425,0.023239,0.968565,0.208788
CQG-red variante 3 [$\lambda$ = 0.111],0.613525,0.6125,0.5425,0.6675,0.023444,0.966496,0.2582
CQG-red variante 3 [$\lambda$ = 0.222],0.649825,0.6525,0.58,0.695,0.022889,0.961208,0.336295
CQG-red variante 3 [$\lambda$ = 0.333],0.694025,0.695,0.6075,0.7425,0.021971,0.951509,0.434676
CQG-red variante 3 [$\lambda$ = 0.444],0.733725,0.735,0.6675,0.78,0.020744,0.94024,0.525683
CQG-red variante 3 [$\lambda$ = 0.556],0.759225,0.7575,0.7125,0.8125,0.01885,0.911324,0.605968
CQG-red variante 3 [$\lambda$ = 0.667],0.78155,0.78125,0.71,0.83,0.019519,0.873805,0.688309


Tirando o CQG original, todas as variantes tiveram seu desempenho **melhorado** com a redução dos dados, o que pode ser justificado em parte pelo novo equilíbrio entre o número de amostras de cada classe. Dessa vez o melhor desempenho não foi para a variante 2 mas para a variante 3 com o valor de $\lambda$ próximo a 1:

In [37]:
display(CQG_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose())
display(CQG_df_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose())

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
CQG variante 2 (pooled),0.723868,0.723833,0.705667,0.7375,0.006427,0.597547,0.759583


Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
CQG-red variante 3 [$\lambda$ = 0.889],0.805,0.8025,0.745,0.855,0.017375,0.781228,0.828569


## 7.4 Reg. Linear com os dados reduzidos

In [38]:
%%time

reg_data_red = eval_reg_linear(X_new, y_new, 100, scales)

index = ['Reg. Linear-red [\'{}\']'.format(scale) for scale in scales]
reg_df_red = pd.DataFrame(reg_data_red, columns=cabecalho, index=index)

CPU times: user 4.34 s, sys: 7.73 s, total: 12.1 s
Wall time: 1.7 s


In [39]:
display(reg_df)
display(reg_df_red)

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. Linear ['min-max'],0.8003,0.800583,0.787,0.809833,0.004855,0.14767,0.984541
Reg. Linear ['std'],0.799587,0.799833,0.788167,0.810833,0.004443,0.148245,0.984285


Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. Linear-red ['min-max'],0.790975,0.79,0.745,0.84,0.016522,0.692487,0.890427
Reg. Linear-red ['std'],0.789625,0.7875,0.7525,0.84,0.017289,0.697777,0.882423


A diferença entres os dois classificadores é tão pequena que se torna arbritária a escolha entre os dois. A versão com escalonamento *min-max* foi adotado por apresentar na tabela o valor de taxa média de acertos levemente maior.

In [40]:
reg_df_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose()

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. Linear-red ['min-max'],0.790975,0.79,0.745,0.84,0.016522,0.692487,0.890427


## 7.5 Reg. Logística com os dados reduzidos

In [41]:
%%time
# Hiper-parâmetros
rodadas = 100
scales = ['min-max', 'std']
penalties = ['l1', 'l2']
Cs = [10**(i) for i in range(-3,5)]

log_data_red = eval_reg_log(X_new, y_new, rodadas, scales, penalties, Cs)

index = ['Reg. log.-red [{} / {} / {}]'
         .format(scale, penalty, C) for scale in scales for penalty in penalties for C in Cs  ]

log_df_red = pd.DataFrame(log_data_red, columns=cabecalho, index=index)

CPU times: user 5min 15s, sys: 7min 20s, total: 12min 35s
Wall time: 2min 29s


In [42]:
display(log_df)

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. log. [min-max / l1 / 0.001],0.778772,0.77875,0.763,0.789,0.00455,0.0,1.0
Reg. log. [min-max / l1 / 0.01],0.784095,0.784667,0.768167,0.795167,0.005387,0.045182,0.99469
Reg. log. [min-max / l1 / 0.1],0.808447,0.808417,0.797333,0.8195,0.004649,0.218197,0.975774
Reg. log. [min-max / l1 / 1],0.810658,0.8115,0.7965,0.826333,0.005622,0.234639,0.974046
Reg. log. [min-max / l1 / 10],0.809798,0.8105,0.7975,0.820833,0.004204,0.237202,0.972856
Reg. log. [min-max / l1 / 100],0.810315,0.810083,0.800333,0.821167,0.004428,0.23813,0.972797
Reg. log. [min-max / l1 / 1000],0.810317,0.8105,0.801,0.819833,0.004341,0.238032,0.972386
Reg. log. [min-max / l1 / 10000],0.810102,0.81,0.798333,0.821,0.004559,0.23717,0.972658
Reg. log. [min-max / l2 / 0.001],0.77914,0.779083,0.7665,0.794333,0.00535,0.0,1.0
Reg. log. [min-max / l2 / 0.01],0.783405,0.783417,0.772833,0.793,0.00413,0.036044,0.99583


In [43]:
display(log_df_red)

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. log.-red [min-max / l1 / 0.001],0.50115,0.4975,0.4425,0.5625,0.024301,0.0,1.0
Reg. log.-red [min-max / l1 / 0.01],0.498625,0.49625,0.445,0.5525,0.023673,0.0,1.0
Reg. log.-red [min-max / l1 / 0.1],0.722175,0.72,0.66,0.7775,0.020684,0.693482,0.751986
Reg. log.-red [min-max / l1 / 1],0.7878,0.78625,0.73,0.8325,0.017618,0.722587,0.853468
Reg. log.-red [min-max / l1 / 10],0.787375,0.7875,0.745,0.84,0.018393,0.719397,0.855423
Reg. log.-red [min-max / l1 / 100],0.78545,0.7825,0.7575,0.825,0.014749,0.720309,0.850125
Reg. log.-red [min-max / l1 / 1000],0.787375,0.79,0.7325,0.825,0.019658,0.726113,0.849254
Reg. log.-red [min-max / l1 / 10000],0.78785,0.79,0.7275,0.8275,0.019095,0.722605,0.853475
Reg. log.-red [min-max / l2 / 0.001],0.574525,0.585,0.4225,0.66,0.048677,0.854424,0.302786
Reg. log.-red [min-max / l2 / 0.01],0.645525,0.645,0.5575,0.72,0.031329,0.709756,0.584375


In [44]:
log_df_red.sort_values('Média', ascending=False).head()

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. log.-red [std / l1 / 1000],0.791575,0.7925,0.7475,0.83,0.016312,0.724071,0.860296
Reg. log.-red [std / l1 / 1],0.7904,0.79,0.7225,0.83,0.019335,0.725297,0.856479
Reg. log.-red [min-max / l2 / 10000],0.790275,0.79,0.745,0.8325,0.019193,0.723281,0.856966
Reg. log.-red [std / l2 / 1000],0.790075,0.79,0.73,0.8275,0.017785,0.727489,0.853874
Reg. log.-red [std / l2 / 10],0.789925,0.79,0.75,0.8325,0.017743,0.723952,0.855789


Apesar da pequena diferença entre os primeiros colocados, em relação à média de acertos, o primeiro colocado, destacado na tabela abaixo, foi escolhido como ponto de referência para a comparação com os outros classificadores.

In [45]:
display(log_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose())
display(log_df_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose())

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. log. [std / l1 / 10],0.810978,0.810667,0.794667,0.824833,0.00503,0.238886,0.972515


Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
Reg. log.-red [std / l1 / 1000],0.791575,0.7925,0.7475,0.83,0.016312,0.724071,0.860296


# 8. Discussão dos resultados

Para sintetizar todas as simulações realizadas nesse trabalho a tabela abaixo foi gerada:

In [46]:
resultado = pd.concat([
    df_knn.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
    DMC_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
    CQG_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
    reg_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
    log_df.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
        
    df_knn_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
    DMC_df_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
    CQG_df_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
    reg_df_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose(),
    log_df_red.sort_values('Média', ascending=False).iloc[0,:].to_frame().transpose()
])
resultado

Unnamed: 0,Média,Mediana,Mínimo,Máximo,Desv. Padrão,Sensib. média,Especif. média
10-NN ['std'],0.806322,0.806417,0.7975,0.818,0.004066,0.287636,0.954376
"(DMC ['std'],)",0.658925,0.659417,0.6435,0.676333,0.006153,0.633633,0.666121
"(CQG variante 2 (pooled),)",0.723868,0.723833,0.705667,0.7375,0.006427,0.597547,0.759583
Reg. Linear ['min-max'],0.8003,0.800583,0.787,0.809833,0.004855,0.14767,0.984541
Reg. log. [std / l1 / 10],0.810978,0.810667,0.794667,0.824833,0.00503,0.238886,0.972515
"(3-NN-red ['min-max'],)",0.782625,0.7825,0.73,0.8275,0.021896,0.626215,0.941919
"(DMC-red ['min-max'],)",0.667375,0.67125,0.6075,0.71,0.022258,0.667225,0.667271
"(CQG-red variante 3 [$\lambda$ = 0.889],)",0.805,0.8025,0.745,0.855,0.017375,0.781228,0.828569
Reg. Linear-red ['min-max'],0.790975,0.79,0.745,0.84,0.016522,0.692487,0.890427
Reg. log.-red [std / l1 / 1000],0.791575,0.7925,0.7475,0.83,0.016312,0.724071,0.860296


De forma geral a queda no desempenho foi pequena. O tempo de processamento foi drasticamente reduzido.
O DMC fez foi melhorar o desempenho.

Os classificadores nos dados reduzidos apresentaram desvio padrão maior.

In [47]:
def print_df(df, nome, descricao):
    print(
        "\\begin{table}[h!]\n"
        "    \captionsetup{width=16cm}%ATENÇÃO: Ajuste a largura do título\n"
        "    \Caption{\label{tab:"+nome+"} "+descricao+"}\n"
        "    \\begin{adjustbox}{width=1\\textwidth}\n"
        "    \small\n"
        +df.to_latex()+
        "    \end{adjustbox}\n"
        "    \Fonte{O autor.}\n"
        "\end{table}")