# Initial Modeling Results

### This notebook outlines each of our initial modeling processes and results. 

We will reference cells in other notebooks and show some final results.

As a group, each participant has run various models on a specific dataset:
- Stuart: Full dataset with all of the features
- Martin: Naive dataset with the initial features
- Kevin: Limited dataset thought to have the most potential based on f1 training

# Stuart
## Used all of the features

### Current Models in development

1) K-Means, to determine the features with the greatest influence, but none of the features stood out.  [Stuart-K_Means.jpynb](stuart/Stuart-K_Means.jpynb)

2) Logistic Regression.  There is a problem with the cross-validation that is preventing an accurate f1-score. [Stuart-Logistic_Regression.ipynb](Stuart/Stuart-Logistic_Regression.ipynb)

3) Random Forest.  Again, there is a problem with the cross-validation that is preventing an accurate f1-score. [Stuart-Random_Forrest.ipynb](Stuart/Stuart-Random_Forrest.ipynb)

4) Ensemble Learning. The model still needs to be completed, and the training and test data still need to be split. [Stuart-Ensamble_Learning.ipynb](Stuart/Stuart-Ensamble_Learning.ipynb)

### Next steps
1) Find and correct the cross-validation problem

2) Complete the models
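
The notebooks do not say what the cross-validation problem is, so the following is only a baseline sketch to compare against, not a diagnosis. It assumes a binary target and uses synthetic data from `make_classification` as a stand-in for the real dataset; the pipeline, fold settings, and variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (hypothetical, not the project dataset).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.78, 0.22], random_state=2019
)

# Scaling lives inside the pipeline, so it is refit on each training fold
# and never sees the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified, shuffled folds keep the class ratio roughly constant per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)

f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("f1 per fold:", np.round(f1_scores, 3))
print("mean f1:", f1_scores.mean())
```

If the f1 scores from this kind of setup look reasonable on the real data, the same `cv` object can be reused for the Random Forest model so that the two are scored on identical folds.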

# Kevin

Kevin decided to experiment with xgboost models. Much of the code is found in the [Copy_of_XGBoost_tuning_checkpoint.ipynb](Kevin/Copy_of_XGBoost_tuning_checkpoint.ipynb) notebook. The process is as follows:

1.) Used the `Final_trimmed_sequential_data` dataset to build models on. We split up who would use which dataset; Kevin chose this one.

2.) Set up the training and testing data.

3.) Created a function, `fit_model`, that takes a predefined xgboost model with predefined hyperparameters and uses the `xgb.cv` function to tune `n_estimators` to an appropriate value. It prints the accuracy, AUC score, and f1 score on the training data.

4.) Set up an initial model and tuned `n_estimators` with `fit_model`. Saved that model as `xgboost1` to the `Data` folder using `joblib` (via `sklearn.externals`).

5.) Started tuning the hyperparameters with scikit-learn's `GridSearchCV`, optimizing for the f1 score. Repeated this process for many of the hyperparameters, rerunning `fit_model` in between to retune `n_estimators` for the newly selected values.

6.) At the end, saved two models, `xgboost3` and `xgboost4`, to the `Data` folder to use for validation purposes.
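
The grid search and save steps above can be sketched roughly as follows. This is not Kevin's code: it swaps in scikit-learn's `GradientBoostingClassifier` for xgboost so that it runs without the xgboost package, uses synthetic data, and the grid values and the `tuned_model.dat` file name are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical, not the project dataset).
X, y = make_classification(
    n_samples=600, n_features=12, weights=[0.78, 0.22], random_state=2019
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2019
)

# Stand-in estimator: scikit-learn gradient boosting instead of xgboost.
base_model = GradientBoostingClassifier(n_estimators=50, random_state=2019)

# Small illustrative grid; these values are assumptions, not the ones searched.
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}

# Pick the combination with the best mean cross-validated f1 score.
search = GridSearchCV(base_model, param_grid, scoring="f1", cv=3)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("best cross-validated f1:", search.best_score_)

# Persist the refit best model and load it back for later validation.
model_path = os.path.join(tempfile.mkdtemp(), "tuned_model.dat")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```

Tuning one or two hyperparameters per search, as described in step 5, keeps each grid small; the trade-off is that interactions between hyperparameters tuned in different rounds are not explored.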



## Validation results

Here are the 5-fold cross-validation results for the last two xgboost models, alongside the initial `xgboost1` model.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import copy
from Modules import *
sns.set()
%matplotlib inline


In [2]:
# Read in the data. The file is formatted with an extra leading column,
# so keep only the columns we need.
data = pd.read_csv('Kevin/Data/Final_trimmed_sequential_data.csv')
Kevin_df = data.iloc[:, 1:]

Kevin_df.head()

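| | AGE | Y | SEX_Female | SEX_Male | EDUCATION_Graduate School | EDUCATION_Other | EDUCATION_University | MARRIAGE_Married | MARRIAGE_Non-married | PERCENT_OF_LIMIT_BAL1 | ... | PAY_4_Other | PAY_4_more_than_two_month_late | PAY_4_on_time | PAY_4_one_month_late | PAY_5_Other | PAY_5_more_than_two_month_late | PAY_5_on_time | PAY_6_Other | PAY_6_more_than_two_month_late | PAY_6_on_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.19565 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 26 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0.02235 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 34 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0.308011 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 37 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.8998 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 57 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0.13234 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |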


In [3]:
# Replicate the training/testing split that the models were built on,
# then find 5-fold cross-validated scores for the 3 models.
# Note: cross_val_score clones each estimator and refits it on the folds of test_data.
from sklearn.model_selection import KFold, cross_val_score, train_test_split

Train_data, test_data = train_test_split(Kevin_df, test_size = 0.2, random_state = 2019)

target = 'Y'
predictors = [x for x in Train_data.columns if x not in [target]]

#see scores for the xgboost models

# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
for name in ['xgboost1.dat', 'xgboost3.dat', 'xgboost4.dat']:
    loaded_model = joblib.load('Kevin/Models/'+str(name))

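    # random_state only applies when shuffle=True (recent scikit-learn versions
    # raise an error otherwise), so use the default consecutive, unshuffled folds
    kfold = KFold(n_splits=5)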

    f1_results = cross_val_score(loaded_model, test_data[predictors], test_data[target], cv=kfold, scoring = 'f1')
    auc_results = cross_val_score(loaded_model, test_data[predictors], test_data[target], cv = kfold, scoring = 'roc_auc')
    print(f"f1 5-fold cross-validation results {name}: "+ str(np.mean(f1_results)))
    print(f"auc 5-fold cross-validation results {name}: "+ str(np.mean(auc_results)))
    print()



f1 5-fold cross-validation results xgboost1.dat: 0.5333656992458421
auc 5-fold cross-validation results xgboost1.dat: 0.7648618123728325

f1 5-fold cross-validation results xgboost3.dat: 0.5348757072417618
auc 5-fold cross-validation results xgboost3.dat: 0.7637131204818518

f1 5-fold cross-validation results xgboost4.dat: 0.5426961945341764
auc 5-fold cross-validation results xgboost4.dat: 0.7547574449881267
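
`cross_val_score` clones the estimator it is given and refits it on each fold, so the numbers above describe each model's hyperparameters rather than the fitted state that was saved to disk. As a possible complement, an already fitted model can be scored directly on the held-out split. The sketch below assumes synthetic data and a scikit-learn `GradientBoostingClassifier` as a stand-in for the saved xgboost models; none of the names come from the project code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the project dataset).
X, y = make_classification(
    n_samples=600, n_features=12, weights=[0.78, 0.22], random_state=2019
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2019
)

# Stand-in for a model that was fitted earlier and loaded from disk.
fitted_model = GradientBoostingClassifier(random_state=2019).fit(X_train, y_train)

# f1 needs hard class labels; AUC needs scores for the positive class.
y_pred = fitted_model.predict(X_test)
y_score = fitted_model.predict_proba(X_test)[:, 1]

test_f1 = f1_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, y_score)
print("held-out f1:", test_f1)
print("held-out auc:", test_auc)
```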



## Where Kevin will go from here:

Kevin may try out other modeling techniques that are probably out of his league, such as a neural network architecture. He can explore these with all of the data or with just the sequential data in this dataset.

He will also have to see which feature selection methods to use.