# Steps for all of us

Choose a dataset that you want to use. 

You may take whatever steps you think are necessary to build the best classifier.

Take the data you chose and do whatever massaging you think is necessary: standardizing, scaling, feature engineering/transforming, feature selection, etc.

Build a classifier however you see fit. You may want to build one and tweak the parameters manually, or use a grid search to try every combination from a set of candidate parameter values.
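As a rough illustration of the grid-search option, here is a minimal sketch using scikit-learn's `GridSearchCV`. The synthetic data from `make_classification` and the candidate values for `C` are assumptions made for the example, not part of this project; substitute your own features, target, and grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (assumption: replace with your own X and y).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2],
                           random_state=2019)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2019)

# Hypothetical grid: each candidate value is cross-validated on the training data only.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="f1", cv=5)
search.fit(X_train, y_train)

# Best candidate and its mean cross-validated f1 score.
print(search.best_params_, round(search.best_score_, 3))
```

The search only sees the training split, so the test split stays available for a final check of the chosen parameters.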

Remember: the same model trained on the massaged data may perform better than one trained on the untouched data. It may be more convenient to choose a standard massaging pipeline and tune a model to that data.

After you massage your data, follow these steps:

If you want to balance your target (which you should), follow along these lines:

### build your features data and target data

- X = df.drop(columns = "whatever your target name is")
- y = df["whatever your target name is"]

### split the data into training and testing

- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2019)

### create the oversampled data to train on 

- oversampler = SMOTE(random_state = 2019)
- X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)

### Put the oversampled data back into a dataframe

- X_train_oversampled = pd.DataFrame(X_train_oversampled, columns = X_train.columns)
- y_train_oversampled = pd.Series(y_train_oversampled)
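To make concrete what balancing does to the class counts, here is a small sketch that uses plain random oversampling through `sklearn.utils.resample`. This is a deliberately simpler stand-in, not SMOTE: it duplicates existing minority rows, whereas SMOTE synthesizes new ones. The toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical training data with a 9:1 class imbalance.
train = pd.DataFrame({"x": np.arange(100), "y": [0] * 90 + [1] * 10})

majority = train[train["y"] == 0]
minority = train[train["y"] == 1]

# Random oversampling: draw minority rows with replacement until the classes match.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=2019)
balanced = pd.concat([majority, minority_upsampled])

print(train["y"].value_counts().to_dict())     # class counts before balancing
print(balanced["y"].value_counts().to_dict())  # class counts after balancing
```

As in the steps above, only the training split is resampled; the test split keeps its natural class proportions so that scores reflect real-world data.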

### Build your classifier here. As an example:

- xgb_clf = xgb.XGBClassifier(max_depth=5, n_estimators=100, colsample_bytree=0.3, learning_rate=0.1, n_jobs=-1)

 
### Fit to the oversampled data; this will train the classifier on the oversampled data

- xgb_clf.fit(X_train_oversampled, y_train_oversampled)

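### Use 5-fold cross validation to see how well the classifier you built is doing on test data. 
Some points: you have to substitute your classifier name in the cross_val_score function. cross_val_score refits a fresh copy of the classifier on each fold of the data you pass in, so it scores the model configuration rather than the fit from the previous step. KFold only accepts random_state together with shuffle=True.

- kfold = KFold(n_splits=5, shuffle=True, random_state=2019)
- results = cross_val_score(xgb_clf, X_test, y_test, cv=kfold, scoring = 'f1')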
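One thing worth knowing about the cross-validation step: `cross_val_score` clones the estimator and refits it on every fold, so it never uses a fit made earlier. The sketch below contrasts that with scoring an already fitted classifier on the held-out split. The synthetic data and the choice of `LogisticRegression` are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption: replace with your own X and y).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                           random_state=2019)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2019)

clf = LogisticRegression(max_iter=1000)

# cross_val_score clones clf and refits the clone on each fold;
# clf itself is left unfitted by this call.
kfold = KFold(n_splits=5, shuffle=True, random_state=2019)
cv_scores = cross_val_score(clf, X_train, y_train, cv=kfold, scoring="f1")

# Scoring one specific fitted classifier needs an explicit fit and predict.
clf.fit(X_train, y_train)
holdout_f1 = f1_score(y_test, clf.predict(X_test))

print(f"CV f1 on training data: {cv_scores.mean():.3f}")
print(f"f1 on held-out test data: {holdout_f1:.3f}")
```

The two numbers answer different questions: the cross-validated mean estimates how the configuration generalizes, while the held-out score describes the one model you actually fitted.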


## It may be best to keep all of the models you build; keep a log of them to track their scores and to record how you prepared your data. 
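One lightweight way to keep such a log is to append a row per experiment to a CSV file. The following is only a sketch: the file name `model_log.csv`, the column names, and the `log_model` helper are all hypothetical choices, and the example row reuses the logistic regression score reported later in this notebook.

```python
import csv
import json
from datetime import datetime
from pathlib import Path

LOG_PATH = Path("model_log.csv")  # hypothetical file name
FIELDS = ["timestamp", "model", "params", "preprocessing", "mean_f1"]


def log_model(model_name, params, preprocessing, mean_f1, path=LOG_PATH):
    """Append one experiment to a CSV log, writing the header on first use."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "model": model_name,
            "params": json.dumps(params, sort_keys=True),  # dict stored as JSON text
            "preprocessing": preprocessing,
            "mean_f1": round(mean_f1, 4),
        })


if __name__ == "__main__":
    # Example entry: the SMOTE + logistic regression run on the training data.
    log_model("LogisticRegression", {}, "SMOTE(random_state=2019)",
              0.4576601559702452)
```

Because every call appends a row, the file doubles as a history of what was tried and in which order.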



In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import copy
from Modules import *
sns.set()
%matplotlib inline
import imblearn


### read in the full sequential data

In [2]:
df = pd.read_csv('Data/Sequential_data_trimmed.csv')
y = df['Y']

df.head().T

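| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| AGE | 24.0 | 26.0 | 34.0 | 37.0 | 57.0 |
| PAY_1 | 2.0 | -1.0 | 0.0 | 0.0 | -1.0 |
| PAY_2 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| PAY_3 | -1.0 | 0.0 | 0.0 | 0.0 | -1.0 |
| PAY_4 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| PAY_5 | -2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| PAY_6 | -2.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| Y | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| SEX_Female | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| SEX_Male | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |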
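Before oversampling, it can help to check how imbalanced the target actually is. The sketch below shows the idea with `value_counts(normalize=True)`; the ten-row series is a hypothetical stand-in for `df['Y']`, not the real data.

```python
import pandas as pd

# Hypothetical target column standing in for df['Y'].
y_demo = pd.Series([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], name="Y")

# Share of each class; a heavily skewed split is what motivates oversampling.
balance = y_demo.value_counts(normalize=True)
print(balance)
```

On the real data the same call would be `df['Y'].value_counts(normalize=True)`.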


In [3]:
#make pipeline

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import classification_report
from imblearn.pipeline import Pipeline 
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from imblearn.over_sampling import SMOTE

#split into training and testing
Train_data, test_data = train_test_split(df, test_size = 0.2, random_state = 2019)

target = 'Y'
predictors = [x for x in df.columns if x not in [target] ]

clf = LogisticRegression()

oversampler = SMOTE(random_state = 2019)
pipeline = Pipeline([('smote', oversampler),('clf', clf)])

#cross validate results
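#random_state is dropped: KFold only uses it when shuffle=True, and recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=5)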
results = cross_val_score(pipeline, df[predictors], df[target], cv=kfold, scoring = 'f1')
print(f"5-fold cross-validation results: {np.mean(results)}")



5-fold cross-validation results: 0.4631289668146429


In [4]:
#testing the cross_val inputs; using only the training data
from sklearn.metrics import f1_score
clf2 = LogisticRegression()

pipeline2 = Pipeline([('smote', oversampler), ('clf', clf2)])

results2 = cross_val_score(pipeline2, Train_data[predictors], Train_data[target], cv=kfold, scoring='f1')

print(f"5-fold cross-validation results: {np.mean(results2)}")


5-fold cross-validation results: 0.4576601559702452


In [5]:
#testing cross_val on only the testing data
clf3 = LogisticRegression()
pipeline3 = Pipeline([('smote', oversampler), ('clf', clf3)])

results3 = cross_val_score(pipeline3, test_data[predictors], test_data[target], cv=kfold, scoring='f1')

print(f"5-fold cross-validation results: {np.mean(results3)}")

5-fold cross-validation results: 0.4620556630758485


In [6]:
#testing a separate fit on the training data, then predicting on the test data

clf4 = LogisticRegression()

pipeline4 = Pipeline([('smote', oversampler), ('clf', clf4)])
pipeline4.fit(Train_data[predictors], Train_data[target])

y_pred = pipeline4.predict(test_data[predictors])

print(f"f1 score: {f1_score(test_data[target], y_pred)}")

f1 score: 0.472495657209033
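A single f1 number hides how the errors are distributed across the two classes. `classification_report` is imported in the cells above but never called; the sketch below shows how it and `confusion_matrix` could be applied to held-out predictions. The synthetic data and plain `LogisticRegression` are assumptions made for a self-contained example; in this notebook the inputs would be `test_data[target]` and `y_pred`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with your own split).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                           random_state=2019)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2019)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and f1 on the held-out data.
print(classification_report(y_test, y_pred, digits=3))
```

Low recall on the positive class with high overall accuracy is the typical signature of an imbalanced target.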


In [8]:
#baseline: the same logistic regression without the SMOTE step
results5 = cross_val_score(clf, df[predictors], df[target], cv=kfold, scoring='f1')

print(f"5-fold cross-validation results: {np.mean(results5)}")



5-fold cross-validation results: 0.1602779685061767




In [11]:
#make pipeline with randomforest

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import classification_report
from imblearn.pipeline import Pipeline 
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

#split into training and testing
Train_data, test_data = train_test_split(df, test_size = 0.2, random_state = 2019)

target = 'Y'
predictors = [x for x in Train_data.columns if x not in [target] ]

clfRF = RandomForestClassifier()

oversampler = SMOTE(random_state = 2019)
pipeline = Pipeline([('smote', oversampler),('clf', clfRF)])

#cross validate results
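#random_state is dropped: KFold only uses it when shuffle=True, and recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=5)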
results = cross_val_score(pipeline, Train_data[predictors], Train_data[target], cv=kfold, scoring = 'f1')
print(f"5-fold cross-validation results: {np.mean(results)}")

5-fold cross-validation results: 0.45792790390017346


In [13]:
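#compare against a Gaussian naive Bayes model

import joblib  #sklearn.externals.joblib was removed from scikit-learn; joblib is a standalone package
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
pipeline5 =  Pipeline([('smote', oversampler),('clf', model)])
#random_state is dropped: KFold only uses it when shuffle=True, and recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=5)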
results = cross_val_score(pipeline5, Train_data[predictors], Train_data[target], cv=kfold, scoring = 'f1')
print(f"5-fold cross-validation results: {np.mean(results)}")


5-fold cross-validation results: 0.3662705580881832
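The cells above repeat the same cross-validation code for each model. One possible way to compare candidates side by side is to loop over a dictionary of estimators, as sketched here. The synthetic data, the dictionary keys, and the `n_estimators=50` setting are assumptions, and the SMOTE step is left out so that the sketch depends only on scikit-learn; in this notebook each estimator would be wrapped in the imblearn `Pipeline` as above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: replace with your training features and target).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2],
                           random_state=2019)

# Hypothetical candidate models, keyed by a label used in the printout.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=2019),
    "gaussian_nb": GaussianNB(),
}

# The same folds are reused for every model so the scores are comparable.
kfold = KFold(n_splits=5, shuffle=True, random_state=2019)
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=kfold, scoring="f1").mean()

# Print models from best to worst mean f1.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Keeping the folds and the scoring metric fixed across models means any difference in the printed scores comes from the models themselves.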
