# ML Test
Below are the instructions provided for this project:

Objective:   Train a classification model to make predictions on the testing data set, using the data in the “Sequence model data.zip” file. 

Note: 
1.	This is a sequence classification task, where the order of each feature matters. You could train a model without considering order as a baseline model, but must train a model addressing sequence because in real work, sequence analysis is part of the project.
1.	Each row represents a training/testing sample containing a sequence, where the first element is “PPD_197” and last element is “PPD_0”.
1.	All the sequences have been padded, which is the reason why lots of zeros show up in “PPD_0”. 
1.	All the values in each entry are categorical variables. Imagine every value in the entry as an index of a word in natural language. 
1.	Y-variable is the first column called “LABEL”, in testing data, there is no label

What we expect,
1.	A good prediction result, which we will compare with the hold out y variable in the testing data set
1.	The process of how the prediction is made, including 
  1. create new features based on existing variables could be needed. 
  1. feature analysis 
  1. feature selection 
  1. model comparison
  1. hyper-parameter tuning
1.	If you could use library such as Keras, Tensorflow to train a deep neural network (DNN) classifier, that will be a very good plus, even if neural networks might not be the best performed model. 
You could use any tools available to you for this task. Ultimately, we will assess your work based on two criteria. 
  1. predictive accuracy on the test set using the PR-AUC metric, 
  1. model structure you finally applied, for example, we will consider how advance the model is, or if you could create additional meaningful features from the data we gave to you.
1. You should return to us the following:
  1. A 23,910 x 1 csv or txt file containing one prediction per line for each row in the test dataset.
  1. A brief report describing the techniques you used to obtain the predictions, that at least should include the following parts: 
    1. why do you choose the model you use? 
    1. your estimates of predictive performance on the test data set, 
    1. some words telling us your understanding about the model you use. 
    1. The code for building the model, or the saved model such as pickle file.  


## Plan
Below is the plan that I intend to fulfill

1. Split data into training and testing sets
1. Explore basic summary stats
1. Create a data set of all key predictors: 2, 3, and 4 word combinations
  1. Create dictionary of dictionaries to do it (function)
  1. Remove infrequent combinations
  1. Create pandas dataframe with all the good predictors (function)
1. Perform feature selection to reduce to the key features
  1. Get it down to 100 features
1. Run 3 models and try to tune some parameters: 
  1. SVM Linear
  1. SVM
  1. Decision Tree
1. Validate on the testing sets and test the one with the best outcome to ensure that it's not overfit
1. Run the prediction on the lock-box test set with the top model


# Libraries
Set up the libraries needed for this project

In [1]:
import pandas as pd  
import numpy as np
import pickle
import re
import time
from sklearn.model_selection import train_test_split
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
from inspect import signature  # sklearn.utils.fixes.signature is unavailable in recent scikit-learn releases
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import classification_report
from sklearn import tree

# Functions
Create functions that will be used throughout the data process

## create_phrases

In [2]:
def create_phrases(df):
    """Concatenate each row's sequence into one hyphen-delimited phrase

    Args:
        df(df): dataframe with a 'label' column and n sequence columns (string values)
        
    Kwargs:
        None

    Returns:
        Dataframe: 'label' and 'comb' (concatenated sequence with padding zeros removed)
    """   
    # Start from a copy of the label column so the input dataframe is not modified
    df_comb = df[['label']].copy()

    # Create list of all columns to concatenate (everything except the label)
    cols = df.columns.values.tolist()
    cols.remove('label')

    # Concatenate words into one phrase per row, then strip the zero padding
    df_comb['comb'] = df.loc[:, cols].apply(lambda x: '-'.join(x), axis=1)
    df_comb['comb'] = df_comb['comb'].replace({'-0': ''}, regex=True)
    df_comb['comb'] = df_comb['comb'].replace({'^[0]+-': ''}, regex=True)
    
    return df_comb
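
The effect of the two padding-removal patterns can be checked on a few hand-made strings. The sketch below is only an illustration: `strip_padding` is a hypothetical helper that applies the same two regular expressions with `re.sub`, and the example sequences are made up.

```python
import re

def strip_padding(phrase):
    """Apply the same two regex replacements used on the 'comb' column."""
    phrase = re.sub(r"-0", "", phrase)      # drop every '-0' occurrence
    phrase = re.sub(r"^[0]+-", "", phrase)  # drop a leading run of zeros plus its hyphen
    return phrase

# Made-up sequences: trailing padding, leading padding, an interior zero, and a '10' token
examples = ["5-12-7-0-0-0", "0-0-5-12", "5-0-12", "5-10-7"]
stripped = [strip_padding(p) for p in examples]
print(stripped)  # ['5-12-7', '5-12', '5-12', '5-10-7']
```

Note that an interior zero is removed as well, which makes its neighbours adjacent. Whether interior zeros occur in the real data is not stated in the source.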

## populate_phrase_dict

In [3]:
def populate_phrase_dict(df):
    """Create a dictionary of phrases per observation and record
       how many observations contain each phrase.

    Args:
        df(df): dataframe with a 'comb' column of hyphen-delimited sequences
        
    Kwargs:
        None

    Returns:
        tuple: (phrase_d, phrase_cnt_d), where phrase_d maps each row index to
               its phrases and phrase_cnt_d maps each phrase to its row count
    """   
    phrase_d = {} 
    phrase_cnt_d = {}

    for index, row in df.iterrows():
        #Identify all unique 2, 3, and 4 word combinations per row
        r2 = re.findall(r"\d+-\d+",row['comb'])
        r3 = re.findall(r"\d+-\d+-\d+",row['comb'])
        r4 = re.findall(r"\d+-\d+-\d+-\d+",row['comb'])
        r = r2+r3+r4
        r = list(set(r))

        #Create a new key for the row id
        phrase_d[index] = {}

        #Populate dictionaries for unique phrases and phrase counts by row index
        for j in r:
            phrase_d[index][j] = 1
            if j in phrase_cnt_d:
                phrase_cnt_d[j] += 1
            else:
                phrase_cnt_d[j] = 1
                
    return phrase_d, phrase_cnt_d
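
One property of `re.findall` is worth keeping in mind here: it returns non-overlapping matches, so after a match is consumed the scan resumes after it. The sketch below contrasts that behaviour with a sliding window over the tokens. Both helpers (`regex_ngrams`, `sliding_ngrams`) are hypothetical names introduced for illustration; the sliding-window variant is an alternative, not what the notebook ran, and switching to it would change the feature counts reported later.

```python
import re

def regex_ngrams(phrase, n):
    """n-word phrases found by the same kind of non-overlapping pattern as above."""
    pattern = r"\d+" + r"-\d+" * (n - 1)
    return re.findall(pattern, phrase)

def sliding_ngrams(phrase, n):
    """Every contiguous n-word phrase, including overlapping ones."""
    tokens = phrase.split("-")
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase = "1-67-44-20-6"            # made-up sequence
print(regex_ngrams(phrase, 2))     # ['1-67', '44-20']
print(sliding_ngrams(phrase, 2))   # ['1-67', '67-44', '44-20', '20-6']
```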


## populate_tidy_data

In [4]:
def populate_tidy_data(phrase_d, phrase_l, df_raw):
    """Populate final tidy data set with phrases from list phrase_l

    Args:
        phrase_d(dict): dictionary of phrases per row index
        phrase_l(list): list of phrases to include
        df_raw(df): dataframe with 'label'
        
    Kwargs:
        None

    Returns:
        Dataframe: 'label' plus one 0/1 indicator column per phrase in phrase_l
    """   
    phrase_df = pd.DataFrame(columns=phrase_l, dtype=bool)
    
    #Add label to the table
    train_tidy = pd.concat([df_raw[['label']],phrase_df], axis=1)
    train_tidy = train_tidy.fillna(0)

    # Identify columns used in the process
    columns = list(train_tidy[phrase_l].columns)

    # Populate train_tidy with features from phrase_d
    for index, row in train_tidy.iterrows():
        for col in columns:
            if col in phrase_d[index]:
                train_tidy.at[index,col] = 1
                
    return train_tidy
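
The nested loop above visits every row and every candidate phrase, which is slow for tens of thousands of rows and about a thousand columns. A possible alternative, sketched under the assumption that `phrase_d` has the shape returned by `populate_phrase_dict`, builds one small dict per row and lets pandas assemble the indicator matrix. `build_indicator_frame` is a hypothetical name, and this is not the code the notebook ran.

```python
import pandas as pd

def build_indicator_frame(phrase_d, phrase_l, index):
    """Build a 0/1 indicator frame with one column per phrase in phrase_l."""
    keep = set(phrase_l)
    # One dict per row, holding only the phrases that are in the allowed list
    records = [{p: 1 for p in phrase_d[idx] if p in keep} for idx in index]
    frame = pd.DataFrame(records, index=index)
    # Add phrases that never occurred, fix the column order, and fill gaps with 0
    return frame.reindex(columns=phrase_l).fillna(0).astype(int)

# Made-up example: three rows, three allowed phrases, one phrase outside the list
toy_phrase_d = {10: {"1-2": 1, "2-3": 1}, 11: {}, 12: {"2-3": 1, "9-9": 1}}
toy_frame = build_indicator_frame(toy_phrase_d, ["1-2", "2-3", "4-5"], [10, 11, 12])
print(toy_frame)
```

The label column can then be attached with `pd.concat`, as in the original function.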

# Upload data
Upload the raw data and create training and testing data sets

**Output**
* train_df: Training data set
* test_df: Testing data set

In [5]:
# Load the raw training data; read every column as a string so values stay categorical
df = pd.read_csv('in/train.csv', dtype=str)
df.columns = df.columns.str.lower()

# Split into training and testing data
train_df, test_df = train_test_split(df, test_size=0.2)

#Drop df to save memory
del df
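
The split above is neither seeded nor stratified, so the row assignment changes between runs and the class ratio can drift slightly between the two sets. A minimal sketch of a reproducible, stratified split follows. The toy arrays and the seed value are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 1,000 rows with 20% positives (made-up numbers)
labels = np.array([1] * 200 + [0] * 800)
features = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.2,
    stratify=labels,   # keep the class ratio the same in both splits
    random_state=42,   # fixed seed so the split can be reproduced
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

With a dataframe, the same call would take `stratify=df['label']`.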

In [6]:
#Confirm the training and testing is split 80:20
len(train_df)
len(test_df)

76635

19159

In [7]:
train_df.to_pickle("out/train_df.pkl")
test_df.to_pickle("out/test_df.pkl")

# Tidy Training Data
Create a tidy dataset where each column is a 2, 3, or 4 word phrase that appears in at least a minimum number of rows

**Output**
* train_tidy: Tidy data set with `label` and ~1000 features to analyze
* phrase_list: list of variables to include in the analysis

## Create phrases
Identify the 2, 3, and 4 word phrases in each row

In [8]:
# Concatenate all words into a single string
df_comb = create_phrases(train_df)

In [9]:
# Identify all 2, 3, and 4 word phrases
(phrase_dict, phrase_cnt_dict) = populate_phrase_dict(df_comb)

## Set Thresholds
Determine a frequency cutoff for phrases so that the resulting data frame has a reasonable number of features.  The output of this step is the list of phrases (`phrase_list`) considered in the analysis.

* All phrases that appeared in more than 350 rows were included, which yields 1,183 features and keeps the later calculations feasible.
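
A quick way to choose such a cutoff is to count how many phrases survive at several candidate values. The sketch below uses made-up counts in place of `phrase_cnt_dict`, and `phrases_above` is a hypothetical helper. It uses a strict `>` comparison, as the filter in the notebook does.

```python
from collections import Counter

# Made-up row counts per phrase, standing in for phrase_cnt_dict
toy_counts = Counter({"1-44": 900, "67-44": 720, "44-20": 410,
                      "20-6": 360, "2-6": 350, "6-6": 120})

def phrases_above(counts, min_rows):
    """Return the phrases that occur in strictly more than min_rows rows."""
    return [phrase for phrase, cnt in counts.items() if cnt > min_rows]

survivors = {cutoff: len(phrases_above(toy_counts, cutoff)) for cutoff in (100, 350, 400)}
print(survivors)  # {100: 6, 350: 4, 400: 3}
```

The phrase with exactly 350 rows is excluded at the 350 cutoff because the comparison is strict.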

In [10]:
# Create dataframe with the number of observations containing each unique phrase
phrase_cnt_df = pd.DataFrame.from_dict(phrase_cnt_dict, orient='index')
phrase_cnt_df.columns = ['cnt']
phrase_cnt_df.cnt = pd.to_numeric(phrase_cnt_df.cnt)

In [11]:
# Requiring a phrase to appear in more than 350 rows cuts the list down to roughly 1000 predictors,
# which is reasonable given the computational resources for this exercise
phrase_list = phrase_cnt_df[phrase_cnt_df['cnt']>350].index.tolist()
len(phrase_list)

1183

## Populate Tidy Data
Populate train_tidy with the following:
* label from the training data
* 1 if the phrase occurred in that observation, otherwise 0
* Limited to the ~1000 most frequent phrases

In [12]:
train_tidy=populate_tidy_data(phrase_dict, phrase_list, df_comb)

In [13]:
# Pickle data for easy reuse later
train_tidy.to_pickle("out/train_tidy.pkl")

In [14]:
# Reload train_tidy from pickle file
train_tidy = pd.read_pickle("out/train_tidy.pkl")

# Feature Selection
Score each phrase feature with a chi-squared test and keep a reduced set of 50 features for the slower models.

**Output**
* X_train(df): Training features
* y_train(df): Training Outcome
* X_features(list): List of final features to include in the model

In [15]:
#Create training df and outcome vector
X = train_tidy[phrase_list]
y_train = pd.to_numeric(train_tidy.label)

#Score every feature with a chi-squared test (only the scores are used below, so the value of k does not matter here)
kBestFeatures = SelectKBest(score_func=chi2, k=10)
fit = kBestFeatures.fit(X,y_train)
X_train = X[X.columns]

#Review the 10 highest-scoring features
score_df = pd.DataFrame(fit.scores_)
feature_df = pd.DataFrame(X.columns)
feature_score_df = pd.concat([feature_df,score_df],axis=1)
feature_score_df.columns = ['Feature','Score'] 
print(feature_score_df.nlargest(10,'Score')) 

      Feature       Score
1043    67-44  856.314669
87       1-44  709.474551
88       1-67  630.473944
306     44-20  619.897800
134      20-6  554.226243
336       2-6  514.573747
1        44-6  478.889323
92        6-6  405.441630
1159  1-67-44  352.361258
996   1-44-20  345.612105


In [16]:
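# Note: this keeps the first 50 columns of phrase_list (row index < 50), not the 50 highest chi-squared scores
X_train = X[feature_score_df[feature_score_df.index < 50].Feature.tolist()]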

# Confirm shape of X_train and y_train are correct
X_train.shape
y_train.shape

(76635, 50)

(76635,)
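
If the intent is to keep the highest-scoring features rather than the first ones in column order, `SelectKBest.get_support()` returns the mask of selected columns directly. The sketch below uses a small hand-built matrix so the chi-squared ranking is easy to verify. The feature names and data are made up, and applying this to the real data would change the model results reported later.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

y_toy = np.array([0] * 50 + [1] * 50)
names = ["perfect", "alternating", "constant", "partial"]
X_toy = np.column_stack([
    y_toy,                           # identical to the label
    np.tile([0, 1], 50),             # unrelated to the label
    np.ones(100, dtype=int),         # present in every row
    np.array([0] * 25 + [1] * 75),   # present in half of class 0 and all of class 1
])

selector = SelectKBest(score_func=chi2, k=2).fit(X_toy, y_toy)
top_features = [n for n, keep in zip(names, selector.get_support()) if keep]
print(top_features)  # ['perfect', 'partial']
```

With a dataframe, the equivalent selection would be `X.columns[selector.get_support()]`.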

# Tidy Testing Data
Create the testing data set by applying the same process used to create the training data

In [17]:
# Concatenate all words into a single string
test_comb_df = create_phrases(test_df)

In [18]:
# Identify all 2, 3, and 4 word phrases
(test_phrase_dict, test_phrase_cnt_dict) = populate_phrase_dict(test_comb_df)

In [19]:
#Create final testing tidy data set
test_tidy= populate_tidy_data(test_phrase_dict, phrase_list, test_comb_df)

In [20]:
#Create final data sets based on the variables we need
X_test = test_tidy[phrase_list]
y_test = pd.to_numeric(test_tidy['label'])
X_test.shape
y_test.shape

(19159, 1183)

(19159,)

# Pickle Data
Pickle the training and testing data sets so they can be reloaded later, then delete and reload them to free as much memory as possible before the model runs

In [21]:
#Pickle Data
X.to_pickle("out/X.pkl")
X_train.to_pickle("out/X_train.pkl")
y_train.to_pickle("out/y_train.pkl")
X_test.to_pickle("out/X_test.pkl")
y_test.to_pickle("out/y_test.pkl")

In [22]:
del X_train, y_train, X_test, y_test

In [23]:
#Bring back in the data
X_train = pd.read_pickle("out/X_train.pkl")
y_train = pd.read_pickle("out/y_train.pkl")
X_test  = pd.read_pickle("out/X_test.pkl")
y_test  = pd.read_pickle("out/y_test.pkl")

# Run Models
In the following code several models were trained.

## SVM Linear
A linear support vector machine was trained on all 1,183 features (not just those kept by feature selection). The PR-AUC was 0.48 on the training data and 0.45 on the held-out testing data, so the model does not appear to be overfit.  The parameter `class_weight` was set to `balanced` because the classes are imbalanced (about 20% positives in the held-out split), and reweighting the minority class should give the best chance of improving the PR-AUC metric when the model is applied to a real problem.
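
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces those weights with scikit-learn's `compute_class_weight` on a made-up label vector with an 80:20 split, which is close to but not the same as the real data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_toy)
# 100 / (2 * 80) = 0.625 for class 0 and 100 / (2 * 20) = 2.5 for class 1
print(dict(zip([0, 1], weights)))
```

Errors on the minority class are therefore penalized four times as heavily as errors on the majority class in this toy case.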

### Train
Train the Support Vector machine

In [24]:
# Linear Support Vector Machine Model
svmlinear_model = LinearSVC(class_weight='balanced')
svmlinear_model.fit(X, y_train)

LinearSVC(C=1.0, class_weight='balanced', dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [25]:
#Calculate PR-AUC
y_score = svmlinear_model.decision_function(X)
average_precision = average_precision_score(y_train, y_score)
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

Average precision-recall score: 0.48


### Test
Test the Support Vector machine

In [26]:
#Calculate PR-AUC
y_score = svmlinear_model.decision_function(X_test)
average_precision = average_precision_score(y_test, y_score)
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

Average precision-recall score: 0.45


In [27]:
# Calculate other standard fit statistics
y_pred = svmlinear_model.predict(X_test)
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.90      0.73      0.81     15332
          1       0.39      0.68      0.49      3827

avg / total       0.80      0.72      0.74     19159



## SVM
A support vector machine with the default RBF kernel was trained on the 50 features kept by feature selection.  Although the model does not appear overfit (PR-AUC of 0.27 on the training data and 0.26 on the testing data), its PR-AUC is much lower than that of the linear SVM.  The model took 17 minutes to train, so tuning its parameters or adding more variables was not feasible with the available computational resources.
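
One way to make tuning affordable is to search on a small subsample and score candidates by average precision. The sketch below does this on synthetic data with `GridSearchCV`. The data, the grid values, and the subsample size are all assumptions, and nothing like this was run in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a small subsample of the training data
X_small, y_small = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

search = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    scoring="average_precision",   # same metric family as the PR-AUC reported above
    cv=3,
)
search.fit(X_small, y_small)
print(search.best_params_, round(search.best_score_, 3))
```

Parameters found on a subsample are only a starting point and would need to be confirmed on the full training set.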

### Train

In [28]:
# Support Vector Machine model (RBF kernel by default)
svm_model = SVC(class_weight='balanced')
svm_model.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [29]:
#Calculate PR-AUC
y_score = svm_model.decision_function(X_train)
average_precision = average_precision_score(y_train, y_score)
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

Average precision-recall score: 0.27


### Test

In [30]:
#Calculate PR-AUC
y_score = svm_model.decision_function(X_test[X_train.columns])
average_precision = average_precision_score(y_test, y_score)
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

Average precision-recall score: 0.26


## Decision Tree
Decision trees have a propensity for overfitting: the PR-AUC is 0.45 on the training data but only 0.24 on the testing data, so this model appears to be very overfit. Cutting down the predictors with better feature selection may help.
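
Besides feature selection, constraining the tree itself is a common way to narrow the gap between training and testing scores. The sketch below compares an unconstrained tree with a depth-limited one on synthetic data. The data and the parameter values (`max_depth=5`, `min_samples_leaf=50`) are assumptions chosen for illustration, not tuned settings for this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real features
X_syn, y_syn = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                   weights=[0.8, 0.2], random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2,
                                      stratify=y_syn, random_state=0)

settings = {"unconstrained": {},
            "constrained": {"max_depth": 5, "min_samples_leaf": 50}}
results = {}
for name, params in settings.items():
    model = DecisionTreeClassifier(random_state=0, **params).fit(X_a, y_a)
    train_ap = average_precision_score(y_a, model.predict_proba(X_a)[:, 1])
    test_ap = average_precision_score(y_b, model.predict_proba(X_b)[:, 1])
    results[name] = (train_ap, test_ap)
    print(name, round(train_ap, 3), round(test_ap, 3))
```

Comparing the two pairs of scores shows how much of the training performance survives on unseen rows under each setting.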

### Train

In [31]:
# Train Decision Tree model
clf_model = tree.DecisionTreeClassifier()
clf_model = clf_model.fit(X_train, y_train)

In [32]:
# Calculate PR-AUC for training data
y_score = clf_model.predict_proba(X_train)
average_precision = average_precision_score(y_train, y_score[:,1])
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

Average precision-recall score: 0.45


### Test

In [33]:
#Calculate PR-AUC for testing data
y_score = clf_model.predict_proba(X_test[X_train.columns])
average_precision = average_precision_score(y_test, y_score[:,1])
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

Average precision-recall score: 0.24


# Final Model

## Import Data
Import the final testing data

In [34]:
# Load the final (unlabeled) test data; read every column as a string so values stay categorical
final_df = pd.read_csv('in/test.csv', dtype=str)
final_df.columns = final_df.columns.str.lower()
final_df['label'] = 1 # need to add column to make functions work

## Tidy Final Data
Create the final tidy data set by applying the same process used to create the training data

In [35]:
# Concatenate all words into a single string
final_comb_df = create_phrases(final_df)
final_comb_df.shape

(23909, 2)

In [36]:
# Identify all 2, 3, and 4 word phrases
(final_phrase_dict, final_phrase_cnt_dict) = populate_phrase_dict(final_comb_df)
len(final_phrase_dict.keys())
len(final_phrase_cnt_dict.keys())

23909

502525

In [37]:
#Create final tidy data set
final_tidy= populate_tidy_data(final_phrase_dict, phrase_list, final_comb_df)

In [38]:
#Drop Fake added label so the prediction will run
final_tidy.drop('label',axis=1, inplace=True)
final_tidy.shape

(23909, 1183)

## Predict Outcome
Make the final prediction of the outcome and export to a CSV

In [39]:
#Predict labels for the final test data with the linear SVM
y_predict = pd.DataFrame(svmlinear_model.predict(final_tidy))
y_predict.columns = ['predict']

In [40]:
#Export final predictions on testing set to CSV
y_predict.to_csv("out/test_prediction.csv", header=False, index=False)
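
The brief scores submissions by PR-AUC, a ranking metric, while the export above contains hard 0/1 predictions. The sketch below, with made-up labels and margins, shows how average precision differs when it is computed from continuous scores and from thresholded labels. Whether the graders expect scores or labels is not stated in the brief, so treat this as an assumption to confirm.

```python
from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 1]
margins = [-1.2, 0.3, -0.1, 0.9]              # made-up decision_function outputs
hard_labels = [int(m > 0) for m in margins]   # what predict() would return

ap_scores = average_precision_score(y_true, margins)
ap_labels = average_precision_score(y_true, hard_labels)
print(round(ap_scores, 3), round(ap_labels, 3))  # 0.833 0.5
```

If scores are acceptable, `svmlinear_model.decision_function(final_tidy)` would provide them for the final test rows.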