# Avaliação dos modelos de regressão criados

Em termos gerais um regressor é um modelo de _machine learning_ que prevê uma saída, ou resultado, numérico.

## _Root-mean-square error_

Essa é a forma mais simples de medir a performance de um modelo de regressão. Essa técnica avalia a diferença de cada valor predito para seu valor real, e calcula a média de forma que se torna imune ao fato que valores preditos podem ser tanto maiores quanto menores que seu valor real.

Uma vantagem dessa técnica é que o resultado é na mesma unidade que os próprios valores, mas também se torna uma desvantagem no sentido de que o resultado depende da escala do problema. Se o valor predito ou o real forem numéros muito altos o _RMSE_ vai ser correspondentemente alto. Sendo assim, isso pode se tornar um problema quando se deseja comparar modelos de projetos diferentes.

$$ RSME = \frac{1}{\sqrt{n}} \sqrt{\sum{[y_i - f(x_i)]^2}} $$
<center>onde, $x_i$ e $y_i$ são os inézimos <i>targets</i> e características e $f(x)$ representa a aplicação do medelo no vetor de características, que retorna o valor predito.</center>


## $R^2$

$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$


$$ SS_{res} = \sum{(y_i - yhat_i)^2} $$
<center>onde, $ (y_i - yhat_i) $ é a distância do ponto real na função para o ponto da predição.</center>

$$ SS_{tot} = \sum{(y_i - y_{avg})^2} $$
<center>onde, $y_{avg}$ é a linha da média de todos os pontos.</center>

Sendo assim, o método de avaliação $R^2$ compara quão bom é o modelo gerado, baseado na linha média dos valores. Quanto $R^2$ está mais perto de $1$ melhor é o modelo, quanto mais longe e perto de $-1$, pior o modelo.

$R^2$ tem um grande problema quando tratamos de funções polinomiais, pois o método $R^2$ ou não descresce ou não muda, dependendo da covariância das variáveis. Dessa forma, a melhor opção para problemas polinomiais é a utilização do $R^2$ ajustado, definido pela fórmula.

$$ Adj R^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} $$
<center>onde, $n = tamanho\ da\ amostra$ e $p = número\ de\ regressores$  </center>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.externals import joblib

%matplotlib inline
%autosave 240

Autosaving every 240 seconds


In [2]:
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

In [3]:
import os
import re
files = [f for f in os.listdir('BlogFeedback/') if re.search(r'blogData_test-.*\.csv$', f)]
test_datasets = []

for f in files:
    new_test_dataset = pd.read_csv('BlogFeedback/' + f, header=None)
    test_datasets.append(new_test_dataset)

print('Foram importados %d arquivos de teste.' % len(test_datasets))

Foram importados 60 arquivos de teste.


### Defina aqui todas as funções de pré processamento, caso necessite

In [4]:
def SVR_all_columns_preprocess(testdataset):
    print('pre processing', testdataset.shape)
    return testdataset

### Carregando modelos
Nesse passo criamos um dicionário com todos os modelos carregados, com suas respectivas funções de pre-processamento.

In [5]:
'''
    Input here your models with columns value (obrigatory)
    and pre process function (optional). The latter you have
    to define beforehand if using lambda functions and it must
    expect a DataFrame of pandas as input.
    In models_pathnames list you have to input the path to your
    model, in the same sequence that your dictionary is 
'''
models = {
    
    'Random Forest':{
        'columns':[i for i in range(280)] #all columns
    },
     'LinearSVR':{
        'columns': [i for i in range(0,64)] + [i for i in range(263,280)]
        #'columns' : [i for i in range(0,280)]
    },
    'Decision Tree':{
        'columns':[0, 1, 3, 4, 5, 6, 8, 9, 10,
                   11, 13, 14, 15, 16, 18,
                   19, 20, 21, 22, 23, 24,
                   25, 26, 28, 29, 31, 33,
                   36, 38, 40, 41, 43, 44,
                   46, 47, 48, 50, 51, 52,
                   53, 54, 55, 56, 57, 58,
                   59, 60, 61, 63, 66, 67,
                   68, 78, 100, 113 ,119,
                   121, 138, 142, 150, 157, 158,
                   169, 190, 201, 209, 212, 225, 227, 231, 232, 245,
                   247, 269, 270]
    },
      'Random Forest Variance Threshold':{
        'columns':[0,1,2,3,4,5,6,8,9,10,11,13,14,15,
                   16,17,18,19,20,21,22,23,24,25,26,
                   28,29,31,33,36,38,40,41,43,44,46,
                   47,48,50,51,52,53,54,55,56,57,58,
                   59,60,61,66,68,78,100,121,138,142,
                   150,157,209,212,225,231,245,247,
                   276,278,279]
    }
}


models_pathnames = [
                    'modelos/280coluns_random_forest.pkl',
                    'modelos/linsvr.sav',
                    'modelos/decisiontree.sav',
                    'modelos/randomforest_and_variancethreshold.sav'
                   ]
for index, model in enumerate(models):
    try:
        models[model]['model'] = joblib.load(models_pathnames[index])
    except Exception as e:
        print('Error in', e)

In [6]:
# Testing if models where loaded correctly
for model in models:
    print(model)
    if 'preprocess' in models[model]:
        models[model]['preprocess'](test_datasets[0])
    else:
        print('No pre-processing')
    print(models[model]['model'],'\n')

Random Forest Variance Threshold
No pre-processing
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False) 

Random Forest
No pre-processing
LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=0, tol=0.0001, verbose=0) 

LinearSVR
No pre-processing
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best') 

Decision Tr

### Testando modelos carregados

In [7]:
from time import time

#### Métodos de avaliação
##### _Hits_@10
For each day of the test data we consider 10 blog pages that were predicted
to have to largest number of feedbacks. We count how many out of these pages
are among the 10 pages that received the largest number of feedbacks in the
reality. We call this evaluation measure Hits@10 and we average Hits@10 for
all the days of the test data.
##### _AUC_@10
For the AUC, i.e., area under the receiver-operator curve, see (Tan et al.,
2006), we considered as positive the 10 blog pages receiving the highest number
of feedbacks in the reality. Then, we ranked the pages according to their
predicted number of feedbacks and calculated AUC. We call this evaluation
measure AUC@10.



In [8]:
from sklearn.metrics import r2_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler

def hits_10_score(test_target, pred_target):
    if not isinstance(test_target, pd.DataFrame):
        try:
            test_target = pd.DataFrame(test_target.copy())
            # test_target = sorted(test_target, reverse=True)
        except:
            raise Exception('Type {} of test_target not known'.format(type(test_target)))
    
    if isinstance(pred_target, np.ndarray):
        pred_target = pd.DataFrame(pred_target.copy())
    elif isinstance(pred_target, pd.DataFrame):
        pred_target = pred_target.copy()
    else:
        raise Exception('Type {} of pred_target not known'.format(type(pred_target)))
        
    test_target = test_target.sort_values(by=280, axis=0, ascending=False, kind='heapsort')
    pred_target = pred_target.sort_values(by=0, axis=0, ascending=False, kind='heapsort')
    
    top_10_test = test_target.index.values[0:10]
    pred_ind = pred_target.index.values
    hits = 0
    for i in range(10):
        hits += 1 if pred_ind[i] in top_10_test else 0
        
    return hits

def auc_10_score(test_target, pred_target):
    if not isinstance(test_target, pd.DataFrame):
        test_target = pd.DataFrame(test_target.copy())
        
    if not isinstance(pred_target, pd.DataFrame):
        pred_target = pd.DataFrame(pred_target.copy())
    
    # Getting only the top 10 blog pages index
    top10_target_ind = test_target.sort_values(by=280, axis=0, ascending=False, kind='heapsort')
    top10_target_ind = top10_target_ind.index.values
    top10_target_ind = top10_target_ind[0:10]
    
    top10_pred_ind = pred_target.sort_values(by=0, axis=0, ascending=False, kind='heapsort')
    top10_pred_ind = top10_pred_ind.index.values
    top10_pred_ind = top10_pred_ind[0:10]
    
    # Scaling the prediction to range (0, 1)
    mms = MinMaxScaler()
    pred_target = mms.fit_transform(pred_target)
    
    # Transforming test_target in a binary array "Received many feedback or not"
    zeros = np.zeros(test_target.shape)
    test_target = zeros
    
    # Inputing ones to top 10 blog pages
    for i in range(10):
        test_target[top10_target_ind[i]] = 1
        pred_target[top10_pred_ind[i]] = 1
    
    score = roc_auc_score(test_target, pred_target)
    
    return score

In [9]:
'''
{
    'model_name': {
        'hits_10' = [1, 2, ...],
        'auc_10' = [.1, .3, ...],
        'r_sqr' = [.3, .4, ...],
        't_test' = [10, 40, ...]
    }
}
'''
results = {}
for model in models:
    
    # Creating Lists
    results[model] = {}
    results[model]['hits_10'] = []
    results[model]['auc_10'] = []
    results[model]['r_sqr'] = []
    results[model]['t_test'] = []
    print(bcolors.HEADER +
          model
         + bcolors.ENDC)

    t_all = time()
    for i, test_dataset in enumerate(test_datasets):
        print(bcolors.HEADER +
              '\t Executing {} of {}...'.format(i, len(test_datasets))
             + bcolors.ENDC)
        # Pre process data, if necessary
        dtest = None
        if 'preprocess' in models[model]:
            dtest = models[model]['preprocess'](test_dataset)
        else:
            dtest = test_dataset
        
        # Selecting columns to model and separating target
        y_test = dtest.loc[:, [280]]
        columns = models[model]['columns']
        dtest = dtest.loc[:, columns]
        
        t_test = time()

        # Predict values
        try:
            y_pred = models[model]['model'].predict(dtest)
        except Exception as e:
            print(bcolors.FAIL + 
                  'Error in predict in model {} in dataset {} with error \'{}\''
                      .format(model, files[i], e) + 
                  bcolors.ENDC)
            continue

        # Evaluate model
        try:
            r_sqr = r2_score(y_test, y_pred)
        except Exception as e:
            r_sqr = np.NaN
            print(bcolors.FAIL + 
                  'Error in r^2 in file {}, with message \'{}\''
                      .format(files[i], e) + 
                  bcolors.ENDC)
            
        try:
            hits_10 = hits_10_score(y_test, y_pred)
        except Exception as e:
            hits_10 = np.NaN
            print(bcolors.FAIL + 
                  'Error in hits_10 in file {}, with message \'{}\''
                      .format(files[i], e) + 
                  bcolors.ENDC)
        
        try:
            auc_10 = auc_10_score(y_test, y_pred)
        except Exception as e:
            auc_10 = np.NaN
            print(bcolors.FAIL + 
                  'Error in auc_10 in file {}, with message \'{}\''
                      .format(files[i], e) + 
                  bcolors.ENDC)
        
        t_test = time() - t_test
        results[model]['hits_10'].append(hits_10)
        results[model]['auc_10'].append(auc_10)
        results[model]['r_sqr'].append(r_sqr)
        results[model]['t_test'].append(t_test)
        
    print(bcolors.OKBLUE + 
          'Execution time for {}: {}'
              .format(model, time() - t_all) + 
          bcolors.ENDC)
        

import json
with open('results.json', 'w') as json_file:
    json.dump(results, json_file)
    print(bcolors.OKGREEN + 
           'Saved results to \'results.json\'' 
           + bcolors.ENDC)

[95mRandom Forest Variance Threshold[0m
[95m	 Executing 0 of 60...[0m
[91mError in predict in model Random Forest Variance Threshold in dataset blogData_test-2012.02.28.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 280 and input n_features is 68 '[0m
[95m	 Executing 1 of 60...[0m
[91mError in predict in model Random Forest Variance Threshold in dataset blogData_test-2012.02.19.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 280 and input n_features is 68 '[0m
[95m	 Executing 2 of 60...[0m
[91mError in predict in model Random Forest Variance Threshold in dataset blogData_test-2012.02.20.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 280 and input n_features is 68 '[0m
[95m	 Executing 3 of 60...[0m
[91mError in predict in model Random Forest Variance Threshold in dataset blogData_test-2012.02.21.00_00.csv with error 'Number of

[91mError in predict in model Random Forest in dataset blogData_test-2012.02.11.00_00.csv with error 'shapes (128,280) and (81,) not aligned: 280 (dim 1) != 81 (dim 0)'[0m
[95m	 Executing 16 of 60...[0m
[91mError in predict in model Random Forest in dataset blogData_test-2012.03.14.00_00.csv with error 'shapes (120,280) and (81,) not aligned: 280 (dim 1) != 81 (dim 0)'[0m
[95m	 Executing 17 of 60...[0m
[91mError in predict in model Random Forest in dataset blogData_test-2012.03.11.00_00.csv with error 'shapes (108,280) and (81,) not aligned: 280 (dim 1) != 81 (dim 0)'[0m
[95m	 Executing 18 of 60...[0m
[91mError in predict in model Random Forest in dataset blogData_test-2012.02.16.00_00.csv with error 'shapes (143,280) and (81,) not aligned: 280 (dim 1) != 81 (dim 0)'[0m
[95m	 Executing 19 of 60...[0m
[91mError in predict in model Random Forest in dataset blogData_test-2012.02.04.00_00.csv with error 'shapes (103,280) and (81,) not aligned: 280 (dim 1) != 81 (dim 0)'[0

[95mLinearSVR[0m
[95m	 Executing 0 of 60...[0m
[91mError in predict in model LinearSVR in dataset blogData_test-2012.02.28.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 72 and input n_features is 81 '[0m
[95m	 Executing 1 of 60...[0m
[91mError in predict in model LinearSVR in dataset blogData_test-2012.02.19.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 72 and input n_features is 81 '[0m
[95m	 Executing 2 of 60...[0m
[91mError in predict in model LinearSVR in dataset blogData_test-2012.02.20.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 72 and input n_features is 81 '[0m
[95m	 Executing 3 of 60...[0m
[91mError in predict in model LinearSVR in dataset blogData_test-2012.02.21.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 72 and input n_features is 81 '[0m
[95m	 Executing 

[91mError in predict in model Decision Tree in dataset blogData_test-2012.03.16.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 68 and input n_features is 75 '[0m
[95m	 Executing 15 of 60...[0m
[91mError in predict in model Decision Tree in dataset blogData_test-2012.02.11.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 68 and input n_features is 75 '[0m
[95m	 Executing 16 of 60...[0m
[91mError in predict in model Decision Tree in dataset blogData_test-2012.03.14.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 68 and input n_features is 75 '[0m
[95m	 Executing 17 of 60...[0m
[91mError in predict in model Decision Tree in dataset blogData_test-2012.03.11.00_00.csv with error 'Number of features of the model must match the input. Model n_features is 68 and input n_features is 75 '[0m
[95m	 Executing 18 of 60...[0m
[91mError in pr

In [10]:
for model in results:
    print(bcolors.HEADER +
          model
         + bcolors.ENDC)
    for key in results[model]:
        result = np.array(results[model][key])
        print(bcolors.OKBLUE +
              '\t{}: {} +- {}'.format(key, np.mean(result), result.var())
             + bcolors.ENDC)

[95mRandom Forest Variance Threshold[0m
[94m	auc_10: nan +- nan[0m
[94m	hits_10: nan +- nan[0m
[94m	t_test: nan +- nan[0m
[94m	r_sqr: nan +- nan[0m
[95mRandom Forest[0m
[94m	auc_10: nan +- nan[0m
[94m	hits_10: nan +- nan[0m
[94m	t_test: nan +- nan[0m
[94m	r_sqr: nan +- nan[0m
[95mLinearSVR[0m
[94m	auc_10: nan +- nan[0m
[94m	hits_10: nan +- nan[0m
[94m	t_test: nan +- nan[0m
[94m	r_sqr: nan +- nan[0m
[95mDecision Tree[0m
[94m	auc_10: nan +- nan[0m
[94m	hits_10: nan +- nan[0m
[94m	t_test: nan +- nan[0m
[94m	r_sqr: nan +- nan[0m


  out=out, **kwargs)
  ret = ret.dtype.type(ret / rcount)
  if __name__ == '__main__':
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
  ret = ret.dtype.type(ret / rcount)
