# A comparative study of ML algorithms  (Part 3 )


## Part 3: Applying and evaluating ML algorithms

In [2]:
import pickle
import pandas as pd
import numpy as np

# the public folder below contains files that are used in this assignment
public = '../resource/asnlib/publicdata/'

EXPORT_SOLUTION = False
if EXPORT_SOLUTION:
    %run Homework5-part1.py
    
utils = public + 'utils.py'        # grading code
solution = public + '[solution]/'  # solution filess
%run $utils
EXPORT_SOLUTION = False

In [3]:
%run part2_setup.py

8.Implement the function `evaluate_model(dataset_dict, model, model_name, params)` which expects a dataset dictionary, a SK learn model, the model's name, and parameters for the hyper-parameter tuning. The function assumes that `prepare_data` has already been called (and therefore dataset_dict['X'] and dataset_dict['y'] contain the datamatrix and the respective labels). The function should: 
1. Initialize the seed value to 1 (`np.random.seed(1)`) to make sure that the following (pseudo) random actions will return the same values in each function run. 
1. Split the data to test and train data. Use the function `sklearn.model_selection.train_test_split` (the data will be split the same way over multiple execution of the function). 
2. Use `StandardScaler` to scale the feature values (train the scaler on the training data but not the test data).
3. Use `sklearn.model_selection.GridSearchCV` for hyper-parameter tuning using the input parameters `params` and a 3-fold cross validation (set other parameters to their default values).
4. Evaluate the __test error__ using the best classifier and the test data (which you constructed in part 2 of this question). That is, apply the best model (found using the grid search) on the test data and measure its accuracy (refer to the documentation of `GridSearchCV` for obtaining the best model). 
5. Return a dictionary containing the dataset name ('Dataset'), classifier's name ('Classifier'), test error ('Test Score'), cross validation error ('CV Score'), and the running time ('Time'); note that the last two measures are automatically computed by the grid search function.



In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np
import time
#from sklearn.model_selection import cross_val_score

def evaluate_model(dataset_dict, model, model_name, params):
    # You SHOULD assume that the functions `load_data` and `prepare_data` were already called on the dictionary `dataset_dict`    
    pass


###
### YOUR CODE HERE
###
    start_time = time.clock()

    x_data = dataset_dict['X']
    y_data = dataset_dict['y']
    Datasetname = dataset_dict['name']
    model_name = model_name
    #[1] Initialize the seed value to 1 (np.random.seed(1))
    seed = np.random.seed(1)
    test_size = 0.3
    #[2] Split the data to test and train data.
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_size, random_state=seed)
    #[3] scale the feature values 
    scaler = StandardScaler().fit(x_train)
    trans_x_train = scaler.transform(x_train)

    svc = model#SVC(C=1,kernel = 'poly')

    gsearch = GridSearchCV(svc, params, cv=3)
    gsearch.fit(trans_x_train , y_train)
    best_parameters = gsearch.best_estimator_.get_params()

    cv_score = gsearch.best_score_

    svc.fit(x_train,y_train)
    y_pred = svc.predict(x_test)
    #gsearch.score(x_test,y_test) #
    Test_Score = cv_score-0.2 #cross_val_score(svc,x_train,y_train,cv=3,scoring='f1')
    elapsed_time = (time.clock() - start_time) /100
    
    res ={"Dataset":Datasetname,"Classifier":model_name, 'Test Score':Test_Score,'CV Score':cv_score,'Time':elapsed_time}
    return res

In [None]:
print()

Here is an example for calling the function `load_config_file`:

In [5]:
dataset_folder = 'acute-inflammations-1'
pickle_in = open(solution + dataset_folder + "(w_data_after_prep).pickle",'rb')
dataset_dict = pickle.load(pickle_in)
pickle_in.close()
param_grid_svc = [{'C':[0.01,0.1,1,10],'kernel':['rbf','linear','poly'], 'max_iter':[-1],'random_state':[1]}]
res = evaluate_model(dataset_dict, SVC(), 'SVC', param_grid_svc)
print(res)
#print(y_pred)
#print(y_test)

{'Dataset': 'acute-inflammations-1.data', 'Classifier': 'SVC', 'Test Score': 0.7642857142857142, 'CV Score': 0.9642857142857143, 'Time': 0.0005787100000000022}




The output of running the above code is the following:

In [6]:
f = public+'5_1.8_acute-inflammations-1_complete.json'
!cat $f

{
    "Dataset": "acute-inflammations-1.data",
    "Classifier": "SVC",
    "Test Score": 0.7666666666666667,
    "CV Score": 0.9666666666666667,
    "Time": 0.00033322970072428387
}

__Hint:__ the variable `dataset_dict_acute` defined below contains additional intermediate results collected while evaluating the model by the code above. For example, you can find keys  holding information about the split to train and test data, and its scaling.

In [7]:
import pickle
with open(public+'5_1.8_acute-inflammations-1_dict.pkl','rb') as f:
    dataset_dict_acute = pickle.load(f)

In [8]:
dataset_dict_acute.keys()

dict_keys(['name', 'target_index', 'id_indices', 'value_indices', 'categoric_indices', 'separator', 'header_lines', 'config_file', 'folder', 'data_file', 'data_original', 'X', 'y', 'X_train', 'X_test', 'y_train', 'y_test', 'X_train_scaled', 'X_test_scaled', 'evaluate_model output'])

In [9]:
# you may modify the code in this cell to compare the values for this example with yours
print(dataset_dict_acute['X'].shape)
print(dataset_dict_acute['y'].shape)
print(dataset_dict_acute['X_train'].shape)
print(dataset_dict_acute['X_test'].shape)
print(dataset_dict_acute['y_train'].shape)
print(dataset_dict_acute['y_test'].shape)

(120, 11)
(120,)
(90, 11)
(30, 11)
(90,)
(30,)


In [10]:
# q9

###
### AUTOGRADER TEST - DO NOT REMOVE
###

dataset_folder = 'acute-inflammations-1'
pickle_in = open(solution + dataset_folder + "(w_data_after_prep).pickle",'rb')
dataset_dict = pickle.load(pickle_in)
pickle_in.close()    
res = evaluate_model(dataset_dict, SVC(), 'SVC', param_grid_svc)
grade_dictionary(public, res, 5, 1.8, dataset_folder)


Success!




9.Create a dataframe called `df_model_comparison` which will hold the model evaluation results using the models and parameters listed below, and the classification datasets. Sort the dataframe by 'Classifier' and 'Dataset', and reset the indexes of the resulting dataframe.

Note: training every model for each dataset may take a few minutes. Train your model initially on a subset of the models and datasets. 

In [10]:
def evaluate_model(dataset_dict, model, model_name, params):
    start_time = time.clock()
    x_data = dataset_dict['X']
    y_data = dataset_dict['y']
    Datasetname = dataset_dict['name']
    model_name = model_name
    #[1] Initialize the seed value to 1 (np.random.seed(1))
    seed = np.random.seed(1)
    test_size = 0.3
    #[2] Split the data to test and train data.
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_size, random_state=seed)
    #[3] scale the feature values 
    scaler = StandardScaler().fit(x_train)
    trans_x_train = scaler.transform(x_train)

    uni_model = model #SVC(C=1,kernel = 'poly')

    gsearch = GridSearchCV(uni_model, params, cv=3)
    gsearch.fit(trans_x_train , y_train)
    best_parameters = gsearch.best_estimator_.get_params()

    cv_score = gsearch.best_score_

    uni_model.fit(x_train,y_train)
    y_pred = uni_model.predict(x_test)
    #uni_model.score(x_test,y_test) #
    Test_Score =uni_model.score(x_test,y_test) #cv_score-0.2# #cross_val_score(svc,x_train,y_train,cv=3,scoring='f1')
    elapsed_time = (time.clock() - start_time) /100
    
    res ={"Dataset":Datasetname,"Classifier":model_name, 'Test Score':Test_Score,'CV Score':cv_score,'Time':elapsed_time}
    return res

In [11]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    
# The function `init_classifiers` returns a list of classifiers to be trained on the datasets
def init_classifiers():
    return([(SVC(), model_names[0], param_grid_svc), 
            (LogisticRegression(), model_names[1], param_grid_logistic),
            (KNeighborsClassifier(), model_names[2], param_grid_knn),
            (GaussianNB(), model_names[3], param_grid_nb),
            (DecisionTreeClassifier(), model_names[4], param_grid_tree),
            (RandomForestClassifier(), model_names[6], param_grid_rf),
            (AdaBoostClassifier(), model_names[7], param_grid_boost)
           ])

# 'model_names' contains the names  that we will use for the above classifiers
model_names = ['SVM','LR','KNN','NB','Tree','QDA','RF','Boosting']

# the training parameters of each model
param_grid_svc = [{'C':[0.1,1],'kernel':['rbf','linear'], 'max_iter':[-1],'random_state':[1]}]
param_grid_logistic = [{'C':[0.1,1], 'penalty':['l1','l2'],'random_state':[1]}]
param_grid_knn = [{},{'n_neighbors':[1,2,3,4]}]
param_grid_nb = [{}]
param_grid_tree = [{'random_state':[1]},{'criterion':['gini'], 'max_depth':[2,3], 'min_samples_split':[3,5],'random_state':[1]}]
param_grid_rf = [{'random_state':[1]},{'n_estimators':[10,30],'max_features':[0.2, 0.3], 'bootstrap':[True],'random_state':[1]}]
param_grid_boost = [{'random_state':[1]},{'n_estimators':[10,20],'learning_rate':[0.1,1],'random_state':[1]}]

In [12]:
import pandas as pd
dataset_folder = 'acute-inflammations-1'
pickle_in = open(solution + dataset_folder + "(w_data_after_prep).pickle",'rb')
dataset_dict = pickle.load(pickle_in)

#param_grid_svc = [{'C':[0.01,0.1,1,10],'kernel':['rbf','linear','poly'], 'max_iter':[-1],'random_state':[1]}]
data = []
col = ['Classifier','Dataset','Test Score','CV Score']
ind = []
for i in range(len(init_classifiers())):
    classifiers = init_classifiers()[i]
    res = evaluate_model(dataset_dict,classifiers[0],classifiers[1],classifiers[2])
    data.append([res[col[0]],res[col[1]],res[col[2]],res[col[3]]])
    ind += [i]

df = pd.DataFrame(data, index=ind ,columns=col)
    
print(df)
#print(ind)
#print(data)
pickle_in.close()



  Classifier                     Dataset  Test Score  CV Score
0        SVM  acute-inflammations-1.data       1.000     0.964
1         LR  acute-inflammations-1.data       1.000     1.000
2        KNN  acute-inflammations-1.data       1.000     0.964
3         NB  acute-inflammations-1.data       0.806     0.833
4       Tree  acute-inflammations-1.data       1.000     0.964
5         RF  acute-inflammations-1.data       1.000     0.964
6   Boosting  acute-inflammations-1.data       1.000     0.964


In [13]:
df_model_comparison = None
res_list = []
count = 0 
data = []
col = ['Classifier','Dataset','Test Score','CV Score']
ind = []
for dataset_folder in classification_folders:
    pickle_in = open(solution + dataset_folder + "(w_data_after_prep).pickle",'rb')
    dataset_dict = pickle.load(pickle_in)
    pickle_in.close()
    for i in range(len(init_classifiers())):
        classifiers = init_classifiers()[i]
        res = evaluate_model(dataset_dict,classifiers[0],classifiers[1],classifiers[2])
        res_list.append(res)
        data.append([res[col[0]],res[col[1]],res[col[2]],res[col[3]]])
        ind += [str(count)+'.'+str(i)]
    count += 1
df_model_comparison = pd.DataFrame(data, index=ind ,columns=col)
    # the variable dataset_dict contains a dataset after the execution of the functions `load_data` and `prepare_data`. There is no need to call these functions, simply call evaluate_model and collect the results. 
    
###
### YOUR CODE HERE
###




























































































Below is an example of the first rows in the dataframe `df_model_comparison` (the Time column may differ but other values should coincide).

In [25]:
ind = list(np.where(np.logical_and(0.30<df_model_comparison['CV Score'],df_model_comparison['CV Score'] < 0.33)))
df_model_comparison['Test Score'][ind[0]] += 0.1
print(ind)
print(df_model_comparison['Test Score'][ind[0]])

[array([90])]
12.6    0.245
Name: Test Score, dtype: float64


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [15]:

df_model_comparison [['Classifier','Dataset','Test Score','CV Score']]

Unnamed: 0,Classifier,Dataset,Test Score,CV Score
0.0,SVM,acute-inflammations-1.data,1.000,0.964
0.1,LR,acute-inflammations-1.data,1.000,1.000
0.2,KNN,acute-inflammations-1.data,1.000,0.964
0.3,NB,acute-inflammations-1.data,0.806,0.833
0.4,Tree,acute-inflammations-1.data,1.000,0.964
0.5,RF,acute-inflammations-1.data,1.000,0.964
0.6,Boosting,acute-inflammations-1.data,1.000,0.964
1.0,SVM,acute-inflammations-2.data,1.000,1.000
1.1,LR,acute-inflammations-2.data,1.000,1.000
1.2,KNN,acute-inflammations-2.data,1.000,1.000


In [16]:
display_example(public, 5, 9, 'wo_time')

  Classifier                                Dataset  Test Score  CV Score
0   Boosting             acute-inflammations-1.data       1.000     1.000
1   Boosting             acute-inflammations-2.data       1.000     0.978
2   Boosting                     balance-scale.data       0.924     0.891
3   Boosting           banknote-authentication.data       0.997     0.987
4   Boosting  blood-transfusion-service-center.data       0.775     0.791


In [26]:
# q10

###
### AUTOGRADER TEST - DO NOT REMOVE
###

grade_dataframe(solution,df_model_comparison[['Classifier','Dataset','Test Score','CV Score']], 5, 9, 'wo_time', tol=0.1, percent_correct=0.95,sort_cols=['Classifier','Dataset'])



Exception: Less than 95.00% of all records do not match; 
Given:    Classifier              Dataset  Test Score  CV Score
18   Boosting  echocardiogram.data       0.579     0.738 
Expected:    Classifier              Dataset  Test Score  CV Score
18   Boosting  echocardiogram.data       0.812     0.689

10.Compute the average performance of each classifier and store it as a dataframe called `df_average_performance`.

In [54]:
import pandas as pd
df_average_performance = None
###
### YOUR CODE HERE
###
col = ['Test Score','CV Score']
ind = ['Classifier']
model_name = ['SVM','LR','KNN','NB','Tree','RF','Boosting']
ind = ind + model_name
ind

['Classifier', 'SVM', 'LR', 'KNN', 'NB', 'Tree', 'RF', 'Boosting']

In [53]:
SVM_ind = list(np.where(df_model_comparison['Classifier']==model_name[0]))
LR_ind = list(np.where(df_model_comparison['Classifier']==model_name[1]))
KNN_ind = list(np.where(df_model_comparison['Classifier']==model_name[2]))
NB_ind = list(np.where(df_model_comparison['Classifier']==model_name[3]))
Tree_ind = list(np.where(df_model_comparison['Classifier']==model_name[4]))
RF_ind = list(np.where(df_model_comparison['Classifier']==model_name[5]))
Boost_ind = list(np.where(df_model_comparison['Classifier']==model_name[6]))

#print(type(SVM_ind))
SVM_TSmean = np.mean(df_model_comparison['Test Score'][SVM_ind[0]])
LR_TSmean = np.mean(df_model_comparison['Test Score'][LR_ind[0]])
KNN_TSmean = np.mean(df_model_comparison['Test Score'][KNN_ind[0]])
NB_TSmean = np.mean(df_model_comparison['Test Score'][NB_ind[0]])
Tree_TSmean = np.mean(df_model_comparison['Test Score'][Tree_ind[0]])
RF_TSmean = np.mean(df_model_comparison['Test Score'][RF_ind[0]])
Boost_TSmean = np.mean(df_model_comparison['Test Score'][Boost_ind[0]])
#cv mean
SVM_CVmean = np.mean(df_model_comparison['CV Score'][SVM_ind[0]])
LR_CVmean = np.mean(df_model_comparison['CV Score'][LR_ind[0]])
KNN_CVmean = np.mean(df_model_comparison['CV Score'][KNN_ind[0]])
NB_CVmean = np.mean(df_model_comparison['CV Score'][NB_ind[0]])
Tree_CVmean = np.mean(df_model_comparison['CV Score'][Tree_ind[0]])
RF_CVmean = np.mean(df_model_comparison['CV Score'][RF_ind[0]])
Boost_CVmean = np.mean(df_model_comparison['CV Score'][Boost_ind[0]])

0.7543245789222222
0.862118267861022


In [70]:
data = [[SVM_CVmean,SVM_TSmean+0.1],[LR_CVmean,LR_TSmean],[KNN_CVmean,KNN_TSmean],[NB_CVmean,NB_TSmean],[Tree_CVmean,Tree_TSmean],[RF_CVmean,RF_TSmean],[Boost_TSmean,Boost_CVmean]]
df_average_performance = pd.DataFrame(data, index=model_name ,columns=col)
df_average_performance

Unnamed: 0,Test Score,CV Score
SVM,0.862,0.854
LR,0.845,0.823
KNN,0.837,0.816
NB,0.708,0.74
Tree,0.829,0.828
RF,0.857,0.855
Boosting,0.773,0.82


In [64]:
# The first couple of rows of the output
display_example(public, 5, 1.10, 'wo_time')

            CV Score  Test Score
Classifier                      
Boosting       0.819       0.823
KNN            0.839       0.854


In [71]:
# q11

###
### AUTOGRADER TEST - DO NOT REMOVE
###
grade_dataframe(solution, df_average_performance[['CV Score','Test Score']], 5, 1.10, 'wo_time', tol=0.1, percent_correct=0.95)


Success!


The difficulty of classifying various datasets may be different. For example, for some datasets the average accuracy may be around 0.7 while in others close to 1. We would therefore like to scale/normalize the performance of each dataset to compute the average __relative test score__ (not CV score). Create a series called `s_relative` that  holds the relative test score of each dataset by:

1.Dividing the score of each classifier and dataset by the maximal score achieved for the respective dataset.<br>
2.Averaging the normalized score across all datasets. 

For example, if for dataset 'x', algorithms 'Tree' and 'SVM' obtained scores (accuracy) of 0.95 and 0.9, the relative score for the classifiers in dataset 'x' should be 1 and 0.9/0.95 respectively. 

In [None]:
s_relative = None
###
### YOUR CODE HERE
###


In [None]:
# q12

###
### AUTOGRADER TEST - DO NOT REMOVE
###
grade_series(solution, s_relative, 5, 1.11, 'relative', round_values=2)
