# ML Test
Objective: Train a classification model to make predictions on the testing data set, using the data in the “Sequence model data.zip” file.

Note: 
1.	This is a sequence classification task, where the order of the features matters. You may train a model that ignores order as a baseline, but you must also train a model that addresses sequence, because in real work sequence analysis is part of the project.
1.	Each row represents a training/testing sample containing a sequence, where the first element is “PPD_197” and the last element is “PPD_0”.
1.	All the sequences have been padded, which is why many zeros show up in “PPD_0”.
1.	Every value in each entry is a categorical variable. Imagine each value as the index of a word in natural language.
1.	The y-variable is the first column, called “LABEL”. The testing data has no label.

What we expect:
1.	A good prediction result, which we will compare with the held-out y variable in the testing data set.
1.	The process of how the prediction is made, including 
  1. creating new features from existing variables, where needed
  1. feature analysis 
  1. feature selection 
  1. model comparison
  1. hyper-parameter tuning
1.	If you can use a library such as Keras or TensorFlow to train a deep neural network (DNN) classifier, that will be a strong plus, even if a neural network is not the best-performing model. 
You may use any tools available to you for this task. Ultimately, we will assess your work on two criteria: 
  1. predictive accuracy on the test set using the PR-AUC metric
  1. the model structure you finally applied; for example, we will consider how advanced the model is, and whether you created additional meaningful features from the data we gave you.
1. You should return to us the following:
  1. A 23,910 x 1 csv or txt file containing one prediction per line for each row in the test dataset.
  1. A brief report describing the techniques you used to obtain the predictions, which should include at least the following parts: 
    1. Why did you choose the model you used? 
    1. Your estimates of predictive performance on the test data set. 
    1. A few words on your understanding of the model you used. 
  1. The code for building the model, or the saved model such as a pickle file.  


## Plan
Below is the plan I intend to follow:

1. Split data into training and testing sets
1. Explore basic summary stats
1. Create a data set of key predictors from 1-, 2- and 3-word combinations
  1. Create a dictionary of dictionaries to do it (function)
  1. Remove infrequent combinations
  1. Create a pandas dataframe with all the good predictors (function)
1. Perform feature selection to reduce to the key features
  1. Get it down to 100 features
1. Run 3 models and try to tune some parameters: 
  1. Logistic regression
  1. Random forest
  1. Support vector machine
1. Validate on the testing set and check the model with the best outcome to ensure that it is not overfit
1. Run the prediction on the lock-box testing set
  1. Create a few summary stats on the prediction to ensure that no mistakes were made
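
The model comparison step in the plan could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the phrase features, 3-fold cross-validation, and untuned settings (`max_iter`, `n_estimators`, and the names `X_demo`, `y_demo`, `candidates` are all hypothetical choices, not taken from the notebook). The scoring string `'average_precision'` is scikit-learn's summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in for the real feature matrix (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.8, 0.2], random_state=0
)

# The three model families named in the plan, with illustrative settings
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'linear_svm': LinearSVC(max_iter=5000),
}

# Mean cross-validated average precision per model
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=3,
                             scoring='average_precision')
    results[name] = scores.mean()

# Print the models from best to worst
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print('{0}: {1:0.2f}'.format(name, score))
```

Comparing on cross-validated scores rather than on the training score keeps the ranking from rewarding a model that simply memorises the training rows.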


# Libraries
Set up the libraries needed for this project

In [34]:
import pandas as pd  
import numpy as np
import pickle
import re
import time
from sklearn.model_selection import train_test_split
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
from inspect import signature  # replaces sklearn.utils.fixes.signature, which newer scikit-learn releases no longer provide
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


# Functions
Create functions that will be used throughout the data process

## create_phrases

In [10]:
def create_phrases(df):
    """Join each row's features into a single '-'-separated phrase string

    Args:
        df (DataFrame): dataframe with 'id', 'label' and n feature columns,
            all stored as strings

    Returns:
        DataFrame: 'id', 'label' and 'comb' (the joined features with the
            padding zeros removed)
    """
    # Start from a copy of id and label so the input frame is not modified
    df_comb = df[['id', 'label']].copy()

    # List all feature columns to concatenate
    cols = df.columns.values.tolist()
    cols.remove('label')
    cols.remove('id')

    # Join the words into one phrase per row, then strip the padding zeros.
    # Assigning the result back avoids relying on in-place replace on a column.
    comb = df.loc[:, cols].apply(lambda x: '-'.join(x), axis=1)
    comb = comb.replace({'-0': ''}, regex=True)       # zeros that follow another word
    comb = comb.replace({'^[0]+-': ''}, regex=True)   # zeros at the start of the string
    df_comb['comb'] = comb

    return df_comb
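
The two regular expressions above behave as intended when every word is a plain integer index, but they work on characters rather than whole tokens. The sketch below contrasts them with a token-based filter on a few toy rows. Both helper names are hypothetical, and the row containing `'05'` is an assumed edge case that may never occur in the real data.

```python
import re

def strip_padding_tokens(tokens):
    """Drop padding zeros token by token, then join with '-'."""
    return '-'.join(t for t in tokens if t != '0')

def strip_padding_regex(tokens):
    """Mirror the two regex replacements used in create_phrases."""
    comb = '-'.join(tokens)
    comb = re.sub(r'-0', '', comb)       # zeros that follow another word
    comb = re.sub(r'^[0]+-', '', comb)   # zeros at the start of the string
    return comb

rows = [
    ['12', '7', '0', '0'],   # trailing padding
    ['0', '12', '7'],        # leading padding
    ['3', '05'],             # hypothetical token with a leading zero
    ['0', '0'],              # sequence that is all padding
]
for row in rows:
    print(row, strip_padding_regex(row), strip_padding_tokens(row))
```

The two approaches agree on the first two rows and differ on the last two: the regex version turns `3-05` into `35` and leaves a lone `0` for an all-padding row, while the token version keeps `3-05` and returns an empty string.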

## populate_phrase_dict

In [11]:
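def populate_phrase_dict(df):
    """Create a dictionary of phrases per id and a count of rows per phrase

    Args:
        df (DataFrame): dataframe with 'id', 'label' and 'comb' columns

    Returns:
        tuple: (phrase_d, phrase_cnt_d), where phrase_d maps each id to a
            dictionary of the phrases found in that row and phrase_cnt_d
            maps each phrase to the number of rows that contain it
    """   
    phrase_d = {} 
    phrase_cnt_d = {}

    for index, row in df.iterrows():
        #Identify unique 2 and 3 word combinations per row.
        #Note: re.findall returns non-overlapping matches, so not every
        #adjacent pair or triple in the sequence is captured.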
        r2 = re.findall(r"\d+-\d+",row['comb'])
        r3 = re.findall(r"\d+-\d+-\d+",row['comb'])
        r = r2+r3
        r = list(set(r))

        #Create a new key for the row id
        phrase_d[row['id']] = {}

        #Populate dictionaries for unique phrases and phrase counts by id
        for j in r:
            phrase_d[row['id']][j] = 1
            if j in phrase_cnt_d:
                phrase_cnt_d[j] += 1
            else:
                phrase_cnt_d[j] = 1
                
    return phrase_d, phrase_cnt_d
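
One property of the regex approach is worth keeping in mind: `re.findall` returns non-overlapping matches, so for the string `1-2-3-4` the pattern `\d+-\d+` yields `1-2` and `3-4` but not `2-3`. A minimal sketch of a sliding-window alternative is shown below. `extract_ngrams` is a hypothetical helper, not part of this notebook, and switching to it would change the phrase counts and scores reported later.

```python
import re

def extract_ngrams(comb, sizes=(2, 3)):
    """Return the set of all contiguous n-grams in a '-'-joined sequence."""
    tokens = [t for t in comb.split('-') if t]
    ngrams = set()
    for n in sizes:
        # Slide a window of n tokens across the sequence
        for i in range(len(tokens) - n + 1):
            ngrams.add('-'.join(tokens[i:i + n]))
    return ngrams

seq = "1-2-3-4"
window_pairs = sorted(p for p in extract_ngrams(seq) if p.count('-') == 1)
regex_pairs = sorted(set(re.findall(r"\d+-\d+", seq)))
print(window_pairs)  # ['1-2', '2-3', '3-4']
print(regex_pairs)   # ['1-2', '3-4']
```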


## populate_tidy_data

In [12]:
def populate_tidy_data(phrase_d, phrase_l, df_raw):
    """Populate the final tidy data set with the selected features

    Args:
        phrase_d (dict): dictionary of phrases per id
        phrase_l (list): list of phrases to include as columns
        df_raw (DataFrame): dataframe with 'id' and 'label'

    Returns:
        DataFrame: 'id', 'label' and one 0/1 indicator column per phrase
    """   
    phrase_df = pd.DataFrame(columns=phrase_l, dtype=bool)
    
    #Add id and label to the table
    train_tidy = pd.concat([df_raw[['id','label']],phrase_df], axis=1)
    train_tidy = train_tidy.fillna(0)

    # Identify columns used in the process
    columns = list(train_tidy[phrase_l].columns)

    # Populate train_tidy with features from phrase_d
    for index, row in train_tidy.iterrows():
        for col in columns:
            if col in phrase_d[row['id']]:
                train_tidy.at[index,col] = 1
                
    return train_tidy
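
The nested loop above writes one cell at a time with `.at`, which can be slow for tens of thousands of rows and around a thousand columns. A possible alternative, sketched under the assumption that only the indicator columns are needed, builds all rows as plain lists first and creates the DataFrame once. `build_indicator_frame` is a hypothetical helper and is not what the notebook ran.

```python
import pandas as pd

def build_indicator_frame(phrase_d, phrase_l, ids):
    """Build a 0/1 DataFrame with one row per id and one column per phrase."""
    rows = []
    for row_id in ids:
        found = phrase_d.get(row_id, {})   # phrases present for this id
        rows.append([1 if phrase in found else 0 for phrase in phrase_l])
    return pd.DataFrame(rows, index=ids, columns=phrase_l)

# Toy input in the same shape as phrase_d: {id: {phrase: 1}}
toy_phrase_d = {1: {'1-2': 1, '2-3': 1}, 2: {}, 3: {'2-3': 1}}
toy_phrases = ['1-2', '2-3', '9-9']
indicators = build_indicator_frame(toy_phrase_d, toy_phrases, [1, 2, 3])
print(indicators)
```

The `id` and `label` columns could then be joined back on the index if they are needed alongside the features.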

# Upload data
Load the raw data and create training and testing data sets

**Output**
* train_df: Training data set
* test_df: Testing data set

In [13]:
#Load the raw training data as strings, lower-case the column names and add a sequential id
df = pd.read_csv('in/train.csv', dtype=str)
df.columns = df.columns.str.lower()
df['id'] = range(1, len(df) + 1)

# Split into training and testing data
train_df, test_df = train_test_split(df, test_size=0.2)

#Drop df to save memory
del df

In [14]:
#Confirm the training and testing is split 80:20
len(train_df)
len(test_df)

76635

19159

In [15]:
train_df.to_pickle("out/train_df.pkl")
test_df.to_pickle("out/test_df.pkl")
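
The split above does not set a seed, so rerunning the notebook produces a different partition and slightly different scores. A minimal sketch of a reproducible, label-stratified split on a toy frame is shown below. The `random_state=42` value and the use of `stratify` are assumptions for illustration, not what was run here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 20% positive rate (labels as strings, like the real data)
toy = pd.DataFrame({'id': range(1, 101),
                    'label': ['1'] * 20 + ['0'] * 80})

# Same seed and stratification column give the same partition every time
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42,
                                   stratify=toy['label'])
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42,
                                   stratify=toy['label'])

print(len(train_a), len(test_a))
print(test_a['id'].tolist() == test_b['id'].tolist())
print((test_a['label'] == '1').mean())
```

Stratifying keeps the share of positive labels the same in both parts, which matters when the metric is PR-AUC and the positive class is the minority.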

# Tidy Training Data
Create a tidy dataset where each column is a 2 or 3 word phrase that appears in more than a threshold number of rows

**Output**
* train_tidy: Tidy data set with `id`, `label`, and ~1000 features to analyze
* phrase_list: list of variables to include in the analysis

## Create phrases
Identify the 2 and 3 word phrases that occur in each row of the training data

In [16]:
# Concatenate all words into a single string
df_comb = create_phrases(train_df)

In [17]:
# Identify all 2 and 3 word phrases
(phrase_dict, phrase_cnt_dict) = populate_phrase_dict(df_comb)

## Set Thresholds
Choose a frequency cut-off for phrases so that the resulting data frame has a reasonable number of features.  The output of this step is the list of phrases considered in the analysis.

* Phrases that appeared in more than 400 rows were kept, which gives around 1000 features and is feasible to compute later.

In [18]:
# Create dataframe with counts of unique phrases per observation
phrase_cnt_df = pd.DataFrame.from_dict(phrase_cnt_dict, orient='index')
phrase_cnt_df.columns = ['cnt']
phrase_cnt_df.cnt = pd.to_numeric(phrase_cnt_df.cnt)

In [20]:
# Keeping only phrases found in more than 400 rows cuts the list down to around 1000 predictors, which is
# reasonable given the computational resources for this exercise
phrase_list = phrase_cnt_df[phrase_cnt_df['cnt']>400].index.tolist()
len(phrase_list)

1007

## Populate Tidy Data
Populate train_tidy with the following:
* id and label from train_df
* True if the feature existed for that observation
* Limited to the ~1000 most popular phrases

In [21]:
train_tidy=populate_tidy_data(phrase_dict, phrase_list, df_comb)

# Feature Selection
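Rank the features with a chi-squared test, with the aim of keeping the top 200.  This [article](https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e) was used as a reference.  Note that the cell below only scores the features; `X_train` still contains all of them.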

**Output**
* X_train: Training features
* y_train: Training Outcome

In [22]:
#Create training df and outcome vector
X = train_tidy[phrase_list]
y_train = pd.to_numeric(train_tidy.label)

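#Fit SelectKBest with a chi-squared score to rank the features
#(k only matters for transform(), which is not called here)
kBestFeatures = SelectKBest(score_func=chi2, k=10)
fit = kBestFeatures.fit(X,y_train)
#Keep every column for now; no features are dropped at this step
X_train = X[X.columns]

#Review the 10 highest-scoring features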
score_df = pd.DataFrame(fit.scores_)
feature_df = pd.DataFrame(X.columns)
feature_score_df = pd.concat([feature_df,score_df],axis=1)
feature_score_df.columns = ['Feature','Score'] 
print(feature_score_df.nlargest(10,'Score')) 

     Feature       Score
539    67-44  869.735421
5       1-44  735.998174
459     1-67  679.176229
588    44-20  658.626661
8       20-6  568.355100
234      2-6  500.345558
815     44-6  498.547558
317      6-6  390.222095
927  1-67-44  357.276328
113      6-2  348.617116
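
Fitting `SelectKBest` only computes the scores; the columns are reduced when the fitted selector is applied. A minimal sketch of that second step on toy data follows, using `get_support()` to keep the column names. The data, `k=1` and all variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 500
y_toy = rng.integers(0, 2, size=n)

# One binary feature that agrees with the label about 90% of the time,
# plus two binary features that are pure noise
flip = rng.random(n) < 0.1
X_toy = pd.DataFrame({
    'signal': np.where(flip, 1 - y_toy, y_toy),
    'noise_a': rng.integers(0, 2, size=n),
    'noise_b': rng.integers(0, 2, size=n),
})

# Fit the selector, then use the boolean mask to subset the columns
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)
selected_cols = X_toy.columns[selector.get_support()]
X_selected = X_toy[selected_cols]

print(list(selected_cols))
print(X_selected.shape)
```

The same `selected_cols` can then be used to subset the testing features, so that both sets share exactly the same columns.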


# Tidy Testing Data
Create the testing data set using the same process that created the training data

In [23]:
# Concatenate all words into a single string
test_comb_df = create_phrases(test_df)

In [24]:
# Identify all 2 and 3 word phrases
(test_phrase_dict, test_phrase_cnt_dict) = populate_phrase_dict(test_comb_df)

In [25]:
#Create final testing tidy data set
test_tidy= populate_tidy_data(test_phrase_dict, phrase_list, test_comb_df)

In [26]:
#Create final data sets based on the variables we need
X_test = test_tidy[X_train.columns]
y_test = pd.to_numeric(test_tidy['label'])

# Pickle Data
Pickle the training and testing data sets so they can be reloaded later. Deleting them afterwards frees as much memory as possible for the model runs.

In [28]:
#Pickle Data
X_train.to_pickle("out/X_train.pkl")
y_train.to_pickle("out/y_train.pkl")
X_test.to_pickle("out/X_test.pkl")
y_test.to_pickle("out/y_test.pkl")

In [29]:
del X_train, y_train, X_test, y_test

In [37]:
#Bring back in the data
X_train = pd.read_pickle("out/X_train.pkl")
y_train = pd.read_pickle("out/y_train.pkl")
X_test  = pd.read_pickle("out/X_test.pkl")
y_test  = pd.read_pickle("out/y_test.pkl")

# Run Models

## SVM Linear
Create a support vector machine model

### Train
Train the Support Vector machine

In [38]:
# Linear Support Vector Machine Model
svm_model = LinearSVC()
svm_model.fit(X_train, y_train)
y_score = svm_model.decision_function(X_train)

#Calculate PR-AUC
average_precision = average_precision_score(y_train, y_score)
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

Average precision-recall score: 0.48


### Test
Test the Support Vector machine

In [39]:
y_score = svm_model.decision_function(X_test)

#Calculate PR-AUC
average_precision = average_precision_score(y_test, y_score)
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

Average precision-recall score: 0.44
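
To make the metric concrete, the sketch below computes average precision by hand on five toy scores: rank the rows by score, and average the precision measured at each positive. This hand-rolled version assumes there are no tied scores, and the toy labels and scores are made up for illustration. The share of positives is printed alongside because it is roughly what a ranking with no skill would achieve, which makes it a useful reference point for any reported score.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels, assuming no tied scores."""
    # Indices of the rows, ordered from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank   # precision at this positive
    return total / n_pos

y_toy = [1, 0, 1, 0, 0]
scores_toy = [0.9, 0.8, 0.7, 0.3, 0.1]

ap = average_precision(y_toy, scores_toy)
baseline = sum(y_toy) / len(y_toy)   # share of positives
print('Average precision: {0:0.2f}'.format(ap))
print('Positive rate:     {0:0.2f}'.format(baseline))
```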


## Random Forest

### Train

### Test

## Model 3

### Train

### Test

# Final Model

## Tidy Final Data
Create the final (lock-box) data set using the same process that created the training data

In [None]:
# Concatenate all words into a single string
# (final_df, the unlabeled lock-box data, is not loaded anywhere above and must be read in first)
final_comb_df = create_phrases(final_df)
final_comb_df.shape

In [None]:
# Identify all 2 and 3 word phrases
(final_phrase_dict, final_phrase_cnt_dict) = populate_phrase_dict(final_comb_df)
len(final_phrase_dict.keys())
len(final_phrase_cnt_dict.keys())

In [None]:
#Create final tidy data set
final_tidy= populate_tidy_data(final_phrase_dict, phrase_list, final_comb_df)
final_tidy.shape

In [None]:
#Create final data sets based on the variables we need
X_final = final_tidy[X_train.columns]

# Other Stuff
Other interesting code that was pushed to the back for reference later

In [None]:


# y_score was last computed on X_test, so compare it against y_test
precision, recall, _ = precision_recall_curve(y_test, y_score)

# In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
step_kwargs = ({'step': 'post'}
               if 'step' in signature(plt.fill_between).parameters
               else {})
plt.step(recall, precision, color='b', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)

plt.xlabel('Recall', color='w')
plt.ylabel('Precision', color='w')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(
          average_precision))
plt.show()

In [None]:
# Frequency
# stats_df = df_melt \
# .groupby('value') \
# ['value'] \
# .agg('count') \
# .pipe(pd.DataFrame) \
# .rename(columns = {'value': 'frequency'})
# 
# stats_df = stats_df.sort_values('frequency', ascending=False)
# 
# # PDF
# stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# 
# # CDF
# stats_df['cdf'] = stats_df['pdf'].cumsum()
# stats_df = stats_df.reset_index()
# stats_df.head()