# ML Test
Objective: Train a classification model to make predictions on the testing data set, using the data in the “Sequence model data.zip” file.

Note: 
1.	This is a sequence classification task, where the order of the features matters. You may train a model that ignores order as a baseline, but you must also train a model that addresses the sequence, because sequence analysis is part of the real project work.
1.	Each row represents a training/testing sample containing a sequence, where the first element is “PPD_197” and the last element is “PPD_0”.
1.	All the sequences have been padded, which is why many zeros show up in “PPD_0”.
1.	Every value in each entry is a categorical variable. Imagine every value as the index of a word in natural language.
1.	The y-variable is the first column, called “LABEL”. The testing data has no label.

What we expect:
1.	A good prediction result, which we will compare with the held-out y-variable in the testing data set.
1.	The process of how the prediction is made, including
  1. model comparison
  1. hyper-parameter tuning
  1. feature analysis
  1. feature selection
  1. creating new features from existing variables, which may be needed.
1.	If you can use a library such as Keras or TensorFlow to train a deep neural network (DNN) classifier, that will be a strong plus, even if the neural network is not the best-performing model.
1.	You may use any tools available to you for this task. Ultimately, we will assess your work on two criteria:
  1. predictive accuracy on the test set using the PR-AUC metric,
  1. the model structure you finally applied; for example, we will consider how advanced the model is, or whether you created additional meaningful features from the data we gave you.
1. You should return to us the following:
  1. A 23,910 x 1 csv or txt file containing one prediction per line for each row in the test dataset.
  1. A brief report describing the techniques you used to obtain the predictions, which should include at least the following parts:
    1. why you chose the model you used,
    1. your estimates of predictive performance on the test data set,
    1. a few words on your understanding of the model you used.
  1. The code for building the model, or the saved model such as a pickle file.


## Plan
Below is the plan I intend to follow.

1. Split data into training and testing sets
1. Explore basic summary stats
1. Create a data set of key predictors built from 1-, 2- and 3-word combinations
  1. Create a dictionary of dictionaries to hold them (function)
  1. Remove infrequent combinations
  1. Create a pandas dataframe with all the good predictors (function)
1. Perform feature selection to reduce to the key features
  1. Get it down to 100 features
1. Run 3 models and try to tune some parameters: 
  1. Logistic regression
  1. Random forest
  1. Support vector machine
1. Validate on the testing sets and test the one with the best outcome to ensure that it's not overfit
1. Run the prediction on the lock-box testing set
  1. Create a few summary stats on the prediction to ensure that mistakes weren't made


# Libraries
Set up the libraries needed for this project

In [1]:
import pandas as pd  
import numpy as np
import pickle
import re
from sklearn.model_selection import train_test_split
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Upload data
Upload the raw data and create training and testing data sets

In [120]:
# Read the raw training data as strings (every value is categorical), lower-case the column names and add a row id
df = pd.read_csv('in/train.csv', dtype=str)
df.columns = df.columns.str.lower()
df['id'] = range(1, len(df) + 1)

# Split into training and testing data
train_df, test_df = train_test_split(df, test_size=0.2)

#Drop df to save memory
del df
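
The split above is unseeded and unstratified. Since the exploration notes put the labelled share at roughly 20%, a seeded, stratified split may give more repeatable estimates. The following is a minimal sketch on toy data, not the split used in this notebook; `demo_df`, the toy values and the seed `42` are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 rows, 20% positive labels (illustrative only)
demo_df = pd.DataFrame({
    "label": ["1"] * 20 + ["0"] * 80,
    "ppd_0": ["0"] * 100,
})

# stratify keeps the label ratio the same in both parts;
# random_state makes the split repeatable across runs
demo_train, demo_test = train_test_split(
    demo_df, test_size=0.2, stratify=demo_df["label"], random_state=42
)

# Share of positive labels in each part
train_rate = (demo_train["label"] == "1").mean()
test_rate = (demo_test["label"] == "1").mean()
```

With stratification, both parts keep the same share of positive labels as the full toy frame.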

In [118]:
len(train_df)
len(test_df)

76635

19159

# Explore data
The first stage will be exploring the data.  Since each variable `PPD

There are around 100,000 rows and approximately 20% of rows are the label. Since this is a text

In [6]:
# Frequency
# stats_df = df_melt \
# .groupby('value') \
# ['value'] \
# .agg('count') \
# .pipe(pd.DataFrame) \
# .rename(columns = {'value': 'frequency'})
# 
# stats_df = stats_df.sort_values('frequency', ascending=False)
# 
# # PDF
# stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# 
# # CDF
# stats_df['cdf'] = stats_df['pdf'].cumsum()
# stats_df = stats_df.reset_index()
# stats_df.head()

| Unnamed: 0 | value | frequency | pdf | cdf |
|---|---|---|---|---|
| 0 | 2 | 292490 | 0.044866 | 0.044866 |
| 1 | 1 | 281438 | 0.043171 | 0.088038 |
| 2 | 3 | 247951 | 0.038034 | 0.126072 |
| 3 | 4 | 179368 | 0.027514 | 0.153586 |
| 4 | 6 | 170400 | 0.026138 | 0.179724 |
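
The commented-out cell refers to a `df_melt` frame that is not defined anywhere in this notebook. One way it could have been built is by melting the wide data into long format and counting values. The sketch below runs on a toy frame; the column names, the toy values and the choice to drop the `0` padding before counting are all assumptions rather than the original method.

```python
import pandas as pd

# Toy wide frame shaped like the raw data (values are illustrative only)
demo_df = pd.DataFrame({
    "id": [1, 2, 3],
    "label": ["0", "1", "0"],
    "ppd_1": ["2", "2", "1"],
    "ppd_0": ["2", "0", "0"],
})

# Long format: one row per (id, position, value)
df_melt = demo_df.melt(id_vars=["id", "label"], var_name="position", value_name="value")

# Assumption: drop the padding value, then count how often each value occurs
stats_df = (
    df_melt[df_melt["value"] != "0"]
    .groupby("value")
    .size()
    .rename("frequency")
    .sort_values(ascending=False)
    .reset_index()
)

# Empirical probability and cumulative probability of each value
stats_df["pdf"] = stats_df["frequency"] / stats_df["frequency"].sum()
stats_df["cdf"] = stats_df["pdf"].cumsum()
```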


# Tidy Data
Create a tidy data set in which each column is a 1-, 2- or 3-word phrase that occurs in more than a threshold number of rows

## Create phrases
Identify the 1-, 2- and 3-word phrases (contiguous runs of values) in each row

In [121]:
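# Start from the id and label columns
df_comb = train_df[['id', 'label']].copy()

# The sequence columns are everything except label and id
cols = [c for c in train_df.columns if c not in ('label', 'id')]

# Join each row's values into one hyphen-separated string
df_comb['comb'] = train_df.loc[:, cols].apply(lambda x: '-'.join(x), axis=1)

# Strip the zero padding. Assign the result back rather than calling
# replace(inplace=True) on a column, which may not update df_comb
df_comb['comb'] = df_comb['comb'].replace({'-0': ''}, regex=True)
df_comb['comb'] = df_comb['comb'].replace({'^[0]+-': ''}, regex=True)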

# Debug only: keep the first few rows while getting the code right.  Will fix later
#df_comb = df_comb.loc[df_comb.id < 10,['id','comb']]
#len(df_comb)
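
To check what the two regex replacements in this cell do, they can be applied to a few toy strings with plain `re.sub`. The helper names and example strings below are hypothetical. Note that the `-0` pattern removes every zero token after the first position, including zeros in the middle of a sequence, and that an all-zero row is left as a single `0`. The token-based variant is an alternative that drops every `0` token, not the notebook's method.

```python
import re

def strip_padding(seq):
    """Apply the same two regex replacements as the notebook cell."""
    seq = re.sub(r"-0", "", seq)      # drops every '-0', i.e. zero tokens after the first token
    seq = re.sub(r"^[0]+-", "", seq)  # drops a leading zero token
    return seq

def strip_padding_tokens(seq):
    """Alternative: split into tokens and drop every token equal to '0'."""
    return "-".join(tok for tok in seq.split("-") if tok != "0")

examples = ["5-3-0-0", "0-0-5-3", "5-0-3", "0-0-0"]
cleaned = [strip_padding(s) for s in examples]
cleaned_tokens = [strip_padding_tokens(s) for s in examples]
# cleaned        -> ['5-3', '5-3', '5-3', '0']
# cleaned_tokens -> ['5-3', '5-3', '5-3', '']
```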

In [161]:
phrase_dict = {} 
phrase_cnt_dict = {}

for index, row in df_comb.iterrows():
    
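    # Identify the unique 1-, 2- and 3-word combinations in the row.
    # Note: re.findall returns non-overlapping matches, so for "1-2-3-4"
    # r2 holds "1-2" and "3-4" but not "2-3" (r3 is limited the same way).
    r1 = re.findall(r"\d+",row['comb'])
    r2 = re.findall(r"\d+-\d+",row['comb'])
    r3 = re.findall(r"\d+-\d+-\d+",row['comb'])
    r = r1+r2+r3
    r = list(set(r))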
    
    #Create a new key for the row id
    phrase_dict[row['id']] = {}
    
    #Populate dictionaries for unique phrases and phrase counts by id
    for j in r:
        
        phrase_dict[row['id']][j] = 1
        
        if (j) in phrase_cnt_dict:
            phrase_cnt_dict[j] += 1
        else:
            phrase_cnt_dict[j] = 1
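
Because `re.findall` returns non-overlapping matches, the 2- and 3-word patterns skip phrases that straddle two matches. If every contiguous phrase is wanted, a sliding window over the split tokens is one option. This is a sketch of that alternative, not the code that produced the counts in this notebook; `ngrams` and `row_phrases` are hypothetical helper names.

```python
import re

def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined with '-'."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def row_phrases(comb, max_n=3):
    """Return the set of unique 1..max_n word phrases in a hyphen-joined sequence."""
    tokens = [t for t in comb.split("-") if t]  # ignore empty tokens from an empty string
    phrases = set()
    for n in range(1, max_n + 1):
        phrases.update(ngrams(tokens, n))
    return phrases

seq = "1-2-3-4"
regex_bigrams = re.findall(r"\d+-\d+", seq)   # non-overlapping matches only
window_bigrams = ngrams(seq.split("-"), 2)    # every adjacent pair
all_phrases = row_phrases(seq)
# regex_bigrams  -> ['1-2', '3-4']
# window_bigrams -> ['1-2', '2-3', '3-4']
```

Switching to the sliding window would change the phrase counts, so any frequency threshold would need to be re-tuned afterwards.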


## Set Thresholds
Determine a frequency cut-off for phrases so that the resulting data frame has a reasonable number of features.  The output of this step is an empty dataframe whose columns are the phrases we are going to consider in the analysis

In [162]:
phrase_cnt_df = pd.DataFrame.from_dict(phrase_cnt_dict, orient='index')
phrase_cnt_df.columns = ['cnt']
phrase_cnt_df.cnt = pd.to_numeric(phrase_cnt_df.cnt)

In [226]:
# Keeping only phrases that occur in more than 250 rows cuts the list down to around 2000 predictors,
# which is reasonable given the computational resources for this exercise
phrase_list = phrase_cnt_df[phrase_cnt_df['cnt']>250].index.tolist()
phrase_df = pd.DataFrame(columns=phrase_list, dtype=int)
len(phrase_list)

2033

## Populate Data
Populate train_tidy with the following:
* id and label from train_df
* 1 if the phrase occurred in that row's entry in phrase_dict, otherwise 0

In [227]:
#Add id and label to the table
train_tidy = pd.concat([train_df[['id','label']],phrase_df], axis=1)

In [None]:
# Loop through every phrase column and populate it with 1 or 0:
# 1 if the phrase is in that row's phrase_dict entry, 0 if it is not
row_ids = train_tidy['id'].tolist()
for phrase in phrase_list:
    train_tidy[phrase] = [1 if phrase in phrase_dict[row_id] else 0 for row_id in row_ids]
      

# Feature Selection
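
The plan calls for reducing the phrase indicators to about 100 features but does not name a method. One common option for binary indicator features is a univariate chi-squared filter. The sketch below uses scikit-learn's `SelectKBest` on synthetic data; the choice of chi-squared, the toy matrix and the small `k` are assumptions made only to keep the example quick.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n_rows, n_features, k = 500, 30, 5  # the plan targets k=100 on the real data

# Synthetic binary labels and binary indicator features
y = rng.integers(0, 2, size=n_rows)
X = rng.integers(0, 2, size=(n_rows, n_features))
X[:, 0] = y  # make feature 0 perfectly informative so the selection is checkable

# Score every feature against the label and keep the k highest-scoring ones
selector = SelectKBest(score_func=chi2, k=k)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # column positions of the kept features
```

On the real data, `kept` could be mapped back to phrase names, and the selector should be fitted on the training split only so the validation split stays untouched.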

# Run Models
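
The plan lists logistic regression, random forest and a support vector machine, with some parameter tuning. A minimal sketch of how the three could be tuned side by side with `GridSearchCV` follows, scored by average precision as a stand-in for PR-AUC. The synthetic data, the parameter grids and the fold count are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in with roughly 20% positives
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.8, 0.2], random_state=0
)

# Each candidate pairs an estimator with a small, illustrative parameter grid
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50], "max_depth": [3, None]},
    ),
    "svm": (SVC(), {"C": [0.1, 1.0]}),
}

# Cross-validated grid search per model; keep the best parameters and score
results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="average_precision", cv=3)
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
```

Comparing the `best_score_` values gives a first ranking of the models before the final check on the held-out split.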

## Model 1

## Model 2

## Model 3

# Validate
Validate the model on the testing data set
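
The task statement says the work is assessed by PR-AUC on the test set, so validation could report the same metric. The sketch below computes it two ways with scikit-learn on toy labels and scores: `average_precision_score`, and the trapezoidal area under `precision_recall_curve`. The toy arrays are illustrative, and which of the two the assessors use is not stated in the source.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy true labels and predicted scores (illustrative only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Step-wise summary of the precision-recall curve
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the same curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc_trapezoid = auc(recall, precision)

# A classifier with no skill scores about the positive rate, a useful baseline
baseline = y_true.mean()
```

The two summaries can differ slightly, so it helps to report which one was used next to each number.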

## Model 1

## Model 2

## Model 3

# Final Model