#Lab 6

###Question 1

Compare the accuracy of the decision tree that uses information gain with the one that uses the
Gini index to select the feature at each decision node, on the wine dataset. Run the comparison
with 6 rounds of stratified cross-validation with 5 folds. Apart from the feature-selection
criterion, use the default values for the other tree hyperparameters. State whether the
difference between the two trees' results is significant, using Student's t-test.

In [1]:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from scipy import stats
import numpy as np

wine = datasets.load_wine()
X_wine = wine.data
y_wine = wine.target

# information gain: criterion='entropy'
dt = DecisionTreeClassifier(criterion='entropy', random_state=0)

# 5 folds x 6 repeats = 30 accuracy scores
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=6, random_state=0)

scores1 = cross_val_score(dt,X_wine,y_wine,
                         scoring='accuracy', cv = rkf)

mean = scores1.mean()
std = scores1.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores1)))


print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n" %
       (inf, sup))




Mean Accuracy: 0.92 Standard Deviation: 0.04
Accuracy Confidence Interval (95%): (0.91, 0.94)
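
The interval above is a normal approximation: mean ± z · std / sqrt(n), where n = 30 is the number of scores (5 folds × 6 repeats). A minimal sketch of the same arithmetic using only the standard library follows. The function name `normal_confidence_interval` is hypothetical, and the inputs are the rounded values printed above, so the endpoints may differ from the notebook output in the last digit.

```python
from math import sqrt
from statistics import NormalDist


def normal_confidence_interval(mean, std, n, confidence=0.95):
    """Two-sided normal-approximation interval for a mean of n scores."""
    # z is the standard normal quantile that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * std / sqrt(n)
    return mean - half_width, mean + half_width


# Rounded values taken from the printed output; n = 5 folds x 6 repeats
low, high = normal_confidence_interval(mean=0.92, std=0.04, n=30)
print(f"({low:.3f}, {high:.3f})")
```

With only 30 scores, a Student's t quantile would give a slightly wider interval than the normal quantile used here.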



In [2]:
# Gini index: the default criterion
dt = DecisionTreeClassifier(random_state=0)

# same seed as before, so both trees are scored on identical splits
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=6, random_state=0)

scores2 = cross_val_score(dt,X_wine,y_wine,
                         scoring='accuracy', cv = rkf)

mean = scores2.mean()
std = scores2.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores2)))


print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print ("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n" %
       (inf, sup))


Mean Accuracy: 0.90 Standard Deviation: 0.04
Accuracy Confidence Interval (95%): (0.89, 0.92)



In [3]:
from scipy import stats

# Paired t-test: scores1 and scores2 come from the same 30 splits (same random_state)
stats.ttest_rel(scores1, scores2)

TtestResult(statistic=1.9777257052299988, pvalue=0.05753221092912097, df=29)
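
The reported p-value (about 0.058) is above 0.05, so at the 5% level the difference between the two criteria is not statistically significant. As a sketch of what the paired test computes, the snippet below rebuilds the t statistic from the per-fold differences and compares the reported statistic with the two-sided 5% critical value for 29 degrees of freedom. The helper name `paired_t_statistic` is hypothetical, and the critical value 2.045 is an assumption taken from a standard t table.

```python
from math import sqrt
from statistics import fmean, stdev


def paired_t_statistic(a, b):
    """t = mean(d) / (sd(d) / sqrt(n)) for the paired differences d = a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    # stdev() is the sample standard deviation (n - 1 in the denominator)
    return fmean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Assumption: two-sided 5% critical value for 29 degrees of freedom (t table)
T_CRITICAL_DF29 = 2.045

# Statistic copied from the TtestResult printed above
reported_statistic = 1.9777257052299988
significant = abs(reported_statistic) > T_CRITICAL_DF29
print("significant at 5%:", significant)
```

Scores from repeated cross-validation share training data and are not independent, so the p-value should be read as approximate.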

###Question 2

Determine which value of the ccp_alpha hyperparameter (pruning factor) achieves the best mean
accuracy in a grid search with 10-fold cross-validation on the wine dataset. Vary the
hyperparameter in steps of 0.1 over the interval from 0.1 to 0.7.

In [4]:
from sklearn.model_selection import GridSearchCV

dt = DecisionTreeClassifier( random_state = 0)

parameters = {'ccp_alpha' : [x/10 for x in range(1,8)]}

gs = GridSearchCV(estimator=dt, param_grid = parameters,
                  scoring='accuracy', cv = 10)

gs = gs.fit(X_wine,y_wine)

print("Best Parameter Values: ", gs.best_params_)
print("Best Mean Accuracy: %0.2f" % gs.best_score_)

Best Parameter Values:  {'ccp_alpha': 0.1}
Best Mean Accuracy: 0.80
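
The grid in this cell is built by dividing integers by 10 rather than by repeatedly adding 0.1. A small sketch of why that choice matters: repeated addition accumulates floating-point error, so some grid values would no longer be the exact decimals shown in `best_params_`. The variable names are illustrative only.

```python
# Grid as built in the cell above: each value is a single division
grid = [x / 10 for x in range(1, 8)]

# Alternative: accumulate 0.1 seven times (rounding error builds up)
accumulated = []
value = 0.0
for _ in range(7):
    value += 0.1
    accumulated.append(value)

print(grid)
print(accumulated)
print("identical:", grid == accumulated)
```

The best value found, 0.1, sits on the lower edge of the grid, so values below 0.1 were never tried in this search.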


###Question 3

Compare the macro-F1 performance of the Naive Bayes classifier with that of the Decision Tree
classifier (default hyperparameters) and of the stratified random classifier, using 10-fold
cross-validation on the breast dataset.

In [5]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate

breast = datasets.load_breast_cancer()
breast_X = breast.data
breast_y = breast.target

gNB = GaussianNB()

scores3 = cross_validate(gNB,breast_X,breast_y,
                         scoring='f1_macro', cv = 10)

scores3_f1 = scores3['test_score']
mean = scores3_f1.mean()
std = scores3_f1.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores3_f1)))
print("\nMean Macro F1: %0.2f Standard Deviation: %0.2f" % (mean, std))
print("Macro F1 Confidence Interval (95%%): (%0.2f, %0.2f)\n" %
      (inf, sup))


Mean Macro F1: 0.93 Standard Deviation: 0.03
Macro F1 Confidence Interval (95%): (0.91, 0.95)



In [6]:
dt = DecisionTreeClassifier( random_state = 0)

scores4 = cross_validate(dt,breast_X,breast_y,
                         scoring='f1_macro', cv = 10)

scores4_f1 = scores4['test_score']
mean = scores4_f1.mean()
std = scores4_f1.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores4_f1)))
print("\nMean Macro F1: %0.2f Standard Deviation: %0.2f" % (mean, std))
print("Macro F1 Confidence Interval (95%%): (%0.2f, %0.2f)\n" %
      (inf, sup))


Mean Macro F1: 0.91 Standard Deviation: 0.04
Macro F1 Confidence Interval (95%): (0.89, 0.93)



In [7]:
from sklearn.dummy import DummyClassifier


aS = DummyClassifier(strategy='stratified',random_state = 0)

scores5 = cross_validate(aS,breast_X,breast_y,
                         scoring='f1_macro', cv = 10)

scores5_f1 = scores5['test_score']
mean = scores5_f1.mean()
std = scores5_f1.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores5_f1)))
print("\nMean Macro F1: %0.2f Standard Deviation: %0.2f" % (mean, std))
print("Macro F1 Confidence Interval (95%%): (%0.2f, %0.2f)\n" %
      (inf, sup))


Mean Macro F1: 0.56 Standard Deviation: 0.03
Macro F1 Confidence Interval (95%): (0.55, 0.58)
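
The `f1_macro` score used in this question is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of their size. A minimal from-scratch sketch follows. The function name `macro_f1` and the toy labels are illustrative assumptions, not part of the lab.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels (hypothetical): one sample of class 0 is misclassified as class 1
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
print(f"{score:.2f}")
```

In this toy case each class has F1 = 4/5, so the macro average is also 0.8.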



###Question 4

Obtain the mean accuracy, the standard deviation, and the 95% confidence interval of the
Multilayer Perceptron classifier using 10-fold cross-validation on the wine dataset, both
standardized and not standardized. Then manually set the initial learning rate of the best
classifier to 0.1, 0.01, and 0.0001 and observe the result.

In [8]:
from sklearn.neural_network import MLPClassifier

# Not standardized

mlp = MLPClassifier(random_state = 0)

scores = cross_val_score(mlp,X_wine,y_wine,
                         scoring='accuracy', cv = 10)

mean = scores.mean()
std = scores.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores)))

print("\n\n NOT STANDARDIZED")
print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n\n" %
      (inf, sup))



 NOT STANDARDIZED

Mean Accuracy: 0.90 Standard Deviation: 0.06
Accuracy Confidence Interval (95%): (0.87, 0.94)






In [9]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Standardized: the scaler sits inside the pipeline, so it is fitted on the training folds only

scaler = StandardScaler()

mlp = MLPClassifier(random_state=0)

pipeline = Pipeline([('transformer', scaler), ('estimator', mlp)])

scores = cross_val_score(pipeline,X_wine,y_wine,
                         scoring='accuracy', cv = 10)

mean = scores.mean()
std = scores.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores)))

print("\n\n STANDARDIZED")
print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n\n" %
      (inf, sup))



 STANDARDIZED

Mean Accuracy: 0.98 Standard Deviation: 0.03
Accuracy Confidence Interval (95%): (0.96, 0.99)
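
Standardization rescales each feature to zero mean and unit variance using statistics estimated from the training data only, which is what placing `StandardScaler` inside the `Pipeline` achieves within each fold. A minimal single-feature sketch of that fit/transform split follows. The helper names and toy numbers are hypothetical, and the population standard deviation is used on the assumption that it matches `StandardScaler`'s default behavior.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Estimate mean and population standard deviation from training data."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Apply z = (x - mean) / std with previously fitted statistics."""
    return [(x - mean) / std for x in column]


# Toy single-feature data (hypothetical)
train = [1.0, 2.0, 3.0, 4.0]
test = [5.0]

mean, std = fit_scaler(train)           # fitted on the training fold only
train_z = transform(train, mean, std)   # zero mean, unit variance
test_z = transform(test, mean, std)     # reuses training statistics

print(train_z)
print(test_z)
```

Fitting the scaler on the full dataset before cross-validation would leak test-fold statistics into training; the pipeline avoids that.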






In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Standardized, learning_rate_init = 0.1

scaler = StandardScaler()

mlp = MLPClassifier(random_state=0, learning_rate_init=0.1)

pipeline = Pipeline([('transformer', scaler), ('estimator', mlp)])

scores = cross_val_score(pipeline,X_wine,y_wine,
                         scoring='accuracy', cv = 10)

mean = scores.mean()
std = scores.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores)))

print("\n\n STANDARDIZED - 0.1")
print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n\n" %
      (inf, sup))



 STANDARDIZED - 0.1

Mean Accuracy: 0.98 Standard Deviation: 0.03
Accuracy Confidence Interval (95%): (0.96, 0.99)




In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Standardized, learning_rate_init = 0.01

scaler = StandardScaler()

mlp = MLPClassifier(random_state=0, learning_rate_init=0.01)

pipeline = Pipeline([('transformer', scaler), ('estimator', mlp)])

scores = cross_val_score(pipeline,X_wine,y_wine,
                         scoring='accuracy', cv = 10)

mean = scores.mean()
std = scores.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores)))

print("\n\n STANDARDIZED - 0.01")
print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n\n" %
      (inf, sup))



 STANDARDIZED - 0.01

Mean Accuracy: 0.98 Standard Deviation: 0.03
Accuracy Confidence Interval (95%): (0.96, 0.99)




In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Standardized, learning_rate_init = 0.0001

scaler = StandardScaler()

# Raising max_iter can remove the convergence warnings: with such a small
# learning rate the optimizer needs more iterations to converge
mlp = MLPClassifier(random_state=0, learning_rate_init=0.0001, max_iter=1500)

pipeline = Pipeline([('transformer', scaler), ('estimator', mlp)])

scores = cross_val_score(pipeline,X_wine,y_wine,
                         scoring='accuracy', cv = 10)

mean = scores.mean()
std = scores.std()
inf, sup = stats.norm.interval(0.95, loc=mean,
                               scale=std/np.sqrt(len(scores)))

print("\n\n STANDARDIZED - 0.0001")
print("\nMean Accuracy: %0.2f Standard Deviation: %0.2f" % (mean, std))
print("Accuracy Confidence Interval (95%%): (%0.2f, %0.2f)\n\n" %
      (inf, sup))



 STANDARDIZED - 0.0001

Mean Accuracy: 0.98 Standard Deviation: 0.03
Accuracy Confidence Interval (95%): (0.97, 1.00)
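
The need for a larger `max_iter` with the smallest learning rate can be illustrated with a toy problem. The sketch below runs plain gradient descent on f(w) = (w - 3)^2 for 200 steps, a budget chosen on the assumption that it matches `MLPClassifier`'s default `max_iter`, and reports how far each learning rate ends from the optimum. This is only an analogy: `MLPClassifier` uses the Adam optimizer by default rather than plain gradient descent, and the function and its parameters are hypothetical.

```python
def remaining_error(learning_rate, n_iter=200, start=0.0, target=3.0):
    """Distance to the minimum of f(w) = (w - target)**2 after n_iter steps."""
    w = start
    for _ in range(n_iter):
        gradient = 2.0 * (w - target)   # derivative of (w - target)**2
        w -= learning_rate * gradient
    return abs(w - target)


# Same three learning rates as in the cells above
errors = {lr: remaining_error(lr) for lr in (0.1, 0.01, 0.0001)}
for lr, err in errors.items():
    print(f"learning rate {lr}: distance to optimum after 200 steps = {err:.4f}")
```

In this toy setting the smallest step size has barely moved after 200 iterations, which is consistent with the convergence warnings that motivated `max_iter=1500`.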


