### Objective
Give the user a single function for running and evaluating several models on the same dataset.

### In-Scope
Build a highly parameterized function that cross-validates three classification models and three regression models.

### Future-Scope
+ Build a GUI on top of the function
+ Add more scoring metrics for multi-class classification
+ Evaluate on the test set
+ Add more edge cases to the unit tests for the function
+ Add more models
+ Add charts

### Phase A : Import required libraries

In [1]:
#Import required libraries
import pandas as pd
import numpy as np

In [2]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

### Phase B : Import required datasets

#### Part I : Transform dataset as per requirement

In [3]:
#Read iris pickle.
df = pd.read_pickle('../data/raw/iris.pickle')

In [4]:
#The class column holds text labels, so label-encode it.
le = LabelEncoder()
df['class'] = le.fit_transform(df['class'])
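Label encoding replaces the class names with integers, so it can help to keep the mapping at hand for reading results later. A minimal sketch, assuming illustrative class names (the actual labels stored in `iris.pickle` are not shown in this notebook, and `demo_labels`, `demo_le` and `mapping` are hypothetical names):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the text 'class' column.
demo_labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']

demo_le = LabelEncoder()
encoded = demo_le.fit_transform(demo_labels)

# classes_ stores the original labels in sorted order; the position of a
# label in classes_ is the integer code assigned to it.
mapping = {str(label): code for code, label in enumerate(demo_le.classes_)}

# inverse_transform recovers the text labels from the integer codes.
decoded = list(demo_le.inverse_transform(encoded))
```

The same `inverse_transform` call on the fitted `le` would turn predicted integers back into class names.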

#### Part II : Make train and test datasets for simulation

In [5]:
#Specify target and independent variables.
X = df.copy().drop(['class'], axis=1)
y = df['class']

In [6]:
#Train test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [7]:
#Save copies of the dataframes as CSVs for testing with the GUI later on.
X_train.to_csv('../data/raw/X_train.csv')
X_test.to_csv('../data/raw/X_test.csv')
y_train.to_csv('../data/raw/y_train.csv')
y_test.to_csv('../data/raw/y_test.csv')
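`to_csv` writes the row index as the first column, so reading these files back needs `index_col=0` to recover the original frames. A small round-trip sketch, assuming toy data and a temporary directory rather than the `../data/raw/` files (`X_demo`, `y_demo` and the paths are hypothetical):

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for X_train and y_train, with a shuffled index like the one
# train_test_split produces.
X_demo = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]}, index=[7, 3])
y_demo = pd.Series([0, 1], index=[7, 3], name='class')

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, 'X_train.csv')
    y_path = os.path.join(tmp, 'y_train.csv')

    X_demo.to_csv(x_path)
    # header=True is passed explicitly so the Series name is written as the
    # column header regardless of the pandas version's default.
    y_demo.to_csv(y_path, header=True)

    # index_col=0 restores the saved index instead of creating a new one.
    X_back = pd.read_csv(x_path, index_col=0)
    y_back = pd.read_csv(y_path, index_col=0)['class']
```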

### Phase C : Make user defined function

In [8]:
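#Function to evaluate different models with repeated k-fold cross-validation.
def model_automator(x_train, x_test, y_train, y_test, task, kfold=3, nruns=5):
    
    #Imports live inside the function because it will later be packaged as a GUI
    #and this function will be the only source code. x_test and y_test are accepted
    #but not used yet (evaluation on the test set is future scope).
    import warnings
    from collections import OrderedDict
    from time import gmtime, strftime
    import numpy as np
    import pandas as pd
    from IPython.display import display
    from sklearn.model_selection import KFold, cross_val_score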
    
    #Warning from scipy LAPACK to be ignored as it does not affect results.
    warnings.filterwarnings(action='ignore', module='scipy', message='^internal gelsd')
    
    #Lists to record model related metrics to be concatenated into a dataframe later on.
    record_scorer = []
    iter_scorer = []
    model_name = []
    model_accuracy = []
    model_accuracy_std = []
    
    #For classification tasks, need classification models as imports. 
    #Also for multiclass problems set the scoring metric as accuracy.
    if task == 'class':
        
        from sklearn.linear_model import LogisticRegression
        from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
        
        #Currently testing on three models. More models will be added in future versions.
        estimators = [('log', LogisticRegression()), 
                      ('rfc', RandomForestClassifier()), 
                      ('gbm', GradientBoostingClassifier())]
        
        #Check if it is a multiclass classification problem or not.
        if len(np.unique(y_train))>2:
            scoring = ['accuracy']
            
        else:
            scoring = ['accuracy', 'precision', 'recall']

    #For regression tasks, need regression models as imports. 
    elif task == 'reg':
        
        from sklearn.linear_model import LinearRegression
        from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
        
        #Currently testing on three models. More models will be added in future versions.
        estimators = [('lin', LinearRegression()), 
                      ('rfc', RandomForestRegressor()), 
                      ('gbm', GradientBoostingRegressor())]
        scoring = ['explained_variance', 'r2']
    
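    #Validation check for wrong option selected. Raise instead of printing,
    #otherwise the loop below fails later with an undefined 'scoring'.
    else:
        raise ValueError("task must be 'class' or 'reg', got %r" % (task,))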
    
    #Start the process and record the time started.
    print('Process started at %s\n' % (strftime('%Y-%m-%d %H:%M:%S', gmtime())))
    
    #Iterate through scoring metrics.
    for scorer in scoring:
        
        #Iterate through the number of runs. Default is 5.
        for run in range(nruns):
            
            print('Running iteration %s with %s as scoring metric' % ((run + 1), scorer))
            
            #Iterate through the different models and get the cross-validation score.
            for name, estimator in estimators:
                
                cv_results = cross_val_score(estimator, x_train, y_train, cv=kfold, scoring=scorer)
                
                #Append all results in list form which will be made into a dataframe at the end.
                iter_scorer.append((run + 1))
                record_scorer.append(scorer)
                model_name.append(name)
                model_accuracy.append(cv_results.mean())
                model_accuracy_std.append(cv_results.std())
                
        print()
            
            
    #Process ends here. Record the time. 
    print('\nProcess ended at ', strftime('%Y-%m-%d %H:%M:%S', gmtime()))
    
    #Use an ordered dictionary built from a list of pairs so the dataframe keeps
    #the exact column order declared here (a plain dict literal does not guarantee it
    #on older Python versions).
    results = pd.DataFrame(OrderedDict([('Iteration', iter_scorer), 
                                        ('Scoring Metric', record_scorer), 
                                        ('Model', model_name), 
                                        ('Model Accuracy', model_accuracy), 
                                        ('Model Accuracy Std', model_accuracy_std)]))
    
    #Pivot to view results in a more aesthetic form
    results_pivot = results.pivot_table(index=['Iteration', 'Scoring Metric'], columns=['Model'])
    
    #Display the results
    print('\nFinal results : ')
    display(results_pivot)
    
    #Return the pivot
    return results_pivot
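The function accepts `x_test` and `y_test` but only cross-validates on the training data. One possible way to add hold-out evaluation is sketched below. This is an assumption about how that step might look, not part of the function above: the helper name `evaluate_on_holdout` is hypothetical, and scikit-learn's bundled iris data stands in for the pickle used in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_on_holdout(estimators, x_train, x_test, y_train, y_test):
    """Fit each (name, estimator) pair on the training split and return a
    dict mapping the name to its accuracy on the held-out split."""
    scores = {}
    for name, estimator in estimators:
        estimator.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, estimator.predict(x_test))
    return scores


# Stand-in data: the iris dataset shipped with scikit-learn.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

holdout = evaluate_on_holdout(
    [('log', LogisticRegression(max_iter=1000)),
     ('rfc', RandomForestClassifier(n_estimators=50, random_state=0))],
    X_tr, X_te, y_tr, y_te)
```

Keeping this step separate from cross-validation means the test split is touched only once, after model selection.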

### Phase D : Testing

#### Part I : Test Classification

In [9]:
results = model_automator(x_train=X_train, y_train=y_train, x_test=X_test, y_test=y_test, task='class')

Process started at 2018-03-14 07:27:05

Running iteration 1 with accuracy as scoring metric
Running iteration 2 with accuracy as scoring metric
Running iteration 3 with accuracy as scoring metric
Running iteration 4 with accuracy as scoring metric
Running iteration 5 with accuracy as scoring metric


Process ended at  2018-03-14 07:27:08

Final results : 


| Iteration | Scoring Metric | Model Accuracy (gbm) | Model Accuracy (log) | Model Accuracy (rfc) | Model Accuracy Std (gbm) | Model Accuracy Std (log) | Model Accuracy Std (rfc) |
|---|---|---|---|---|---|---|---|
| 1 | accuracy | 0.900667 | 0.940548 | 0.930132 | 0.026198 | 0.049512 | 0.037813 |
| 2 | accuracy | 0.900667 | 0.940548 | 0.949179 | 0.026198 | 0.049512 | 0.051383 |
| 3 | accuracy | 0.900667 | 0.940548 | 0.93934 | 0.026198 | 0.049512 | 0.025482 |
| 4 | accuracy | 0.900667 | 0.940548 | 0.909614 | 0.026198 | 0.049512 | 0.04328 |
| 5 | accuracy | 0.900667 | 0.940548 | 0.939655 | 0.026198 | 0.049512 | 0.043054 |
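In the table above, the `log` and `gbm` scores are identical in every iteration and only `rfc` varies. The likely reason is that passing an integer `cv` to `cross_val_score` builds the same unshuffled folds on each call, so repeated runs differ only for models with internal randomness. A minimal sketch of one way to make the runs vary, assuming a shuffled `StratifiedKFold` seeded by the run number and scikit-learn's bundled iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Integer cv: the folds are the same on every call, so a deterministic
# model returns the same mean score each time.
fixed_scores = [cross_val_score(model, X_demo, y_demo, cv=3).mean()
                for _ in range(3)]

# Shuffled folds with a different seed per run: each run sees a different
# partition of the data, so the scores can differ between runs.
shuffled_scores = []
for run in range(3):
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=run)
    shuffled_scores.append(cross_val_score(model, X_demo, y_demo, cv=cv).mean())
```

Whether shuffling per run is desirable depends on the goal: it measures sensitivity to the split, while a fixed split isolates the model's own randomness.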


#### Part II : Test Regression

In [10]:
results = model_automator(x_train=X_train, y_train=y_train, x_test=X_test, y_test=y_test, task='reg')

Process started at 2018-03-14 07:27:08

Running iteration 1 with explained_variance as scoring metric
Running iteration 2 with explained_variance as scoring metric
Running iteration 3 with explained_variance as scoring metric
Running iteration 4 with explained_variance as scoring metric
Running iteration 5 with explained_variance as scoring metric

Running iteration 1 with r2 as scoring metric
Running iteration 2 with r2 as scoring metric
Running iteration 3 with r2 as scoring metric
Running iteration 4 with r2 as scoring metric
Running iteration 5 with r2 as scoring metric


Process ended at  2018-03-14 07:27:10

Final results : 


| Iteration | Scoring Metric | Model Accuracy (gbm) | Model Accuracy (lin) | Model Accuracy (rfc) | Model Accuracy Std (gbm) | Model Accuracy Std (lin) | Model Accuracy Std (rfc) |
|---|---|---|---|---|---|---|---|
| 1 | explained_variance | 0.880785 | 0.914513 | 0.914236 | 0.07011 | 0.020563 | 0.065638 |
| 1 | r2 | 0.876556 | 0.910574 | 0.921667 | 0.067763 | 0.023892 | 0.064458 |
| 2 | explained_variance | 0.878063 | 0.914513 | 0.902404 | 0.06796 | 0.020563 | 0.074154 |
| 2 | r2 | 0.875476 | 0.910574 | 0.906705 | 0.069252 | 0.023892 | 0.066677 |
| 3 | explained_variance | 0.880917 | 0.914513 | 0.910029 | 0.070135 | 0.020563 | 0.074608 |
| 3 | r2 | 0.879071 | 0.910574 | 0.894494 | 0.070402 | 0.023892 | 0.056934 |
| 4 | explained_variance | 0.878589 | 0.914513 | 0.913363 | 0.068183 | 0.020563 | 0.061009 |
| 4 | r2 | 0.877093 | 0.910574 | 0.915906 | 0.068016 | 0.023892 | 0.058606 |
| 5 | explained_variance | 0.879564 | 0.914513 | 0.911754 | 0.069772 | 0.020563 | 0.060235 |
| 5 | r2 | 0.874448 | 0.910574 | 0.911204 | 0.066797 | 0.023892 | 0.064312 |
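The pivot returned by `model_automator` has one row per iteration and metric, which can be collapsed into a per-metric average for a quicker comparison. A short sketch, assuming a small long-form frame rebuilt by hand from the `r2` rows of iterations 1 and 2 above (`long_form`, `pivot` and `summary` are hypothetical names):

```python
import pandas as pd

# A few rows copied from the regression output, in the long form the
# function builds before pivoting.
long_form = pd.DataFrame({
    'Iteration': [1, 1, 2, 2],
    'Scoring Metric': ['r2', 'r2', 'r2', 'r2'],
    'Model': ['lin', 'rfc', 'lin', 'rfc'],
    'Model Accuracy': [0.910574, 0.921667, 0.910574, 0.906705],
    'Model Accuracy Std': [0.023892, 0.064458, 0.023892, 0.066677],
})

# Same pivot call as in the function: two-level columns of (value, model).
pivot = long_form.pivot_table(index=['Iteration', 'Scoring Metric'],
                              columns=['Model'])

# Select the mean-score block, then average across iterations per metric.
summary = pivot['Model Accuracy'].groupby(level='Scoring Metric').mean()
```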
