# Binary Logistic Regression Model Training via sklearn

## Tech Spec
* Google Cloud Compute Engine
* n1-standard-4 (4 vCPUs, 15 GB memory)
* Debian GNU/Linux 9

## Performance
Hyperparameter tuning over the parameter grid ----- took ----- minutes.

Training over ------ samples took ----- minutes.

## Model Training

In [1]:
import numpy as np
import pandas as pd
import time

from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, confusion_matrix,
                             f1_score, log_loss, make_scorer,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, train_test_split
# Pipeline is needed by the later cells that build pipe_lr
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

### Load Data

In [2]:
df = pd.read_pickle('../data/preprocessed_training_data.pkl')
df = df[:1000000]

### Brief Data Exploration

In [3]:
print(df.head())

                         id           timestamp                campaignId  \
0  5c36658fb58fad351175f0b6 2019-01-09 21:20:15  59687f0d896a6b0e5ce6ea15   
1  5c38d5ab1c16172870186b5a 2019-01-11 17:43:07  59687f0d896a6b0e5ce6ea15   
2  5c38815de8f4e50e256e4f9c 2019-01-11 11:43:25  59687f0d896a6b0e5ce6ea15   
3  5c409ace532d5806d2c6a5e6 2019-01-17 15:10:06  59687f0d896a6b0e5ce6ea15   
4  5c3904b92d798c41e7f3088a 2019-01-11 21:03:53  59687f0d896a6b0e5ce6ea15   

  platform softwareVersion sourceGameId country  startCount  viewCount  \
0      ios          11.4.1      1373094      US          25         24   
1      ios            12.1      2739989      US          10          9   
2      ios          12.1.2      1373094      US          27         26   
3      ios          12.1.2      1217749      US          15         14   
4      ios          12.0.1      1373094      US          20         18   

   clickCount  installCount           lastStart  startCount1d  startCount7d  \
0           0

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 18 columns):
id                    1000000 non-null object
timestamp             1000000 non-null datetime64[ns]
campaignId            1000000 non-null object
platform              1000000 non-null object
softwareVersion       1000000 non-null object
sourceGameId          1000000 non-null object
country               1000000 non-null object
startCount            1000000 non-null int64
viewCount             1000000 non-null int64
clickCount            1000000 non-null int64
installCount          1000000 non-null int64
lastStart             921178 non-null datetime64[ns]
startCount1d          1000000 non-null int64
startCount7d          1000000 non-null int64
connectionType        1000000 non-null object
deviceType            1000000 non-null object
install               1000000 non-null int64
timeSinceLastStart    1000000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(7), ob

The full training dataset has over 3.5 million records; only the first 1,000,000 are loaded here. Each record has 18 columns: 17 features plus the install target.

In [5]:
df['install'].value_counts()

0    986453
1     13547
Name: install, dtype: int64

We find that the class distribution of install to no-install is extremely imbalanced: about 1:73 in this subsample (13,547 installs against 986,453 non-installs).
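
The imbalance can be checked directly from the counts printed above. The snippet below is a small sketch that hard-codes those two counts rather than reading them from df.

```python
# Class counts copied from the value_counts() output above.
n_negative = 986453  # install == 0
n_positive = 13547   # install == 1

# Number of non-installs for every install, and the share of installs overall.
ratio = n_negative / n_positive
positive_rate = n_positive / (n_negative + n_positive)

print("non-installs per install: {:.1f}".format(ratio))
print("install rate: {:.2%}".format(positive_rate))
```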

In [6]:
numerical_columns = ['startCount', 'viewCount', 'installCount', 'startCount1d', 'startCount7d', 'timeSinceLastStart']
categorical_columns = ['campaignId', 'sourceGameId', 'country']

In [7]:
for feat in categorical_columns:
    print(feat)
    print("==========")
    print(df[feat].value_counts())
    print("        ")

campaignId
5c385d02ee4549000d8b9ddd    36861
5af41f3346d16a019f9d327d    24132
5c333b4d1d94abf8a325e55a    23943
5bd2ccefc9c2110ad461c1b3    23413
5bf54928eb052c1002550102    22749
                            ...  
5bf4b62716fbbb711b8bae88        1
5bc6e0236e2e5aaef4543d92        1
5c01890762973d602111c88a        1
5be04767edcdaf10b88228e8        1
5a9be6ea3f98c037e2583b14        1
Name: campaignId, Length: 2329, dtype: int64
        
sourceGameId
1711292    23611
1483109    14111
19790      11529
36615      11202
2633648    11092
           ...  
2606273        1
2922520        1
1171276        1
1060077        1
1414506        1
Name: sourceGameId, Length: 22926, dtype: int64
        
country
US    131873
IN     86927
RU     80060
TR     48275
ID     48132
       ...  
WS         2
PM         2
NF         1
PW         1
CF         1
Name: country, Length: 217, dtype: int64
        


This shows that the campaignId and sourceGameId features have very high cardinality (2,329 and 22,926 distinct values, respectively), while country has 217.
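
High cardinality matters because one-hot encoding creates one output column per distinct value. The sketch below uses a toy frame with made-up category values (an assumption for illustration only) to show the resulting width, the sparse output, and how handle_unknown='ignore' treats a category that was not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the categorical columns (hypothetical values).
toy = pd.DataFrame({
    'campaignId': ['a', 'b', 'a', 'c'],
    'country': ['US', 'IN', 'US', 'RU'],
})

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy)  # SciPy sparse matrix by default

# One output column per distinct value: 3 campaigns + 3 countries.
n_columns = encoded.shape[1]

# A campaign never seen during fit is encoded as all zeros for that feature.
unseen = pd.DataFrame({'campaignId': ['zzz'], 'country': ['US']})
unseen_row = encoder.transform(unseen).toarray()[0]

# Width the same encoding would have for the cardinalities printed above.
full_width = 2329 + 22926 + 217

print(n_columns, unseen_row, full_width)
```

Because the encoded matrix is sparse, tens of thousands of columns remain manageable in memory, which is one reason a sparse-friendly scaler configuration and linear model fit well here.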

### Data Preprocessing

In [8]:
numerical_pipeline = make_pipeline(RobustScaler(with_centering=False))

In [9]:
categorical_pipeline = make_pipeline(OneHotEncoder(handle_unknown='ignore'))

In [10]:
preprocessor = ColumnTransformer(
    [('numerical_preprocessing', numerical_pipeline, numerical_columns), 
     ('categorical_preprocessing', categorical_pipeline, categorical_columns)], 
    remainder='drop')

### Dataset Training/Test Split

In [11]:
X = df[numerical_columns + categorical_columns]
y = df['install']

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
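
The stratify=y argument keeps the install rate the same in the training and test sets, which matters when positives are this rare. A minimal sketch on synthetic labels (the 5% positive rate and array names are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 950 negatives and 50 positives (5% positive rate).
y_toy = np.array([0] * 950 + [1] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

# Both splits keep the original class proportions.
print("train install rate: {:.3f}".format(y_tr.mean()))
print("test install rate:  {:.3f}".format(y_te.mean()))
```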

## Performance Metrics

Classifier performance is measured with AUROC, log-loss, and prediction bias, together with precision and recall.

In [12]:
def log_loss_score(clf, x, y):
    return log_loss(y, clf.predict_proba(x))

def auroc_score(clf, x, y):
    return roc_auc_score(y, clf.predict_proba(x)[:, 1])
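
For reference, the remaining metrics can be written out by hand. The sketch below computes precision, recall, and prediction bias on a four-sample toy example; prediction_bias is a hypothetical helper name, defined here as the mean prediction minus the mean observed label, which matches the expression used at the end of this notebook.

```python
import numpy as np

def prediction_bias(y_true, y_pred):
    """Mean prediction minus mean observed label (0 means unbiased on average)."""
    return float(np.mean(y_pred) - np.mean(y_true))

# Toy labels and hard predictions (made-up values).
y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0, 1, 0, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted install, was install
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted install, was not
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed install

precision = tp / (tp + fp)  # share of predicted installs that are real
recall = tp / (tp + fn)     # share of real installs that are found
bias = prediction_bias(y_true, y_pred)

print(precision, recall, bias)
```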

## Grid Search

In [13]:
pipeline = make_pipeline(preprocessor, 
                         RandomUnderSampler(random_state=0), 
                         LogisticRegression(random_state=0))
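
The RandomUnderSampler step balances the classes by discarding majority-class rows before the classifier is fitted. The numpy sketch below illustrates the idea only; it is not imblearn's implementation, random_undersample is a hypothetical name, and it assumes binary labels with class 1 as the minority.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep all positives and an equally sized random sample of negatives."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Draw as many negatives as there are positives, without replacement.
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.sort(np.concatenate([pos_idx, keep_neg]))
    return X[keep], y[keep]

# Toy data: 8 negatives and 2 positives.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(X_bal.shape, y_bal)
```

Placing the sampler inside the imblearn pipeline means it is applied only when fitting, so each cross-validation fold is still scored on data with the original class distribution.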

In [16]:
param_range = [0.01, 0.1, 1.0, 10.0]
param_grid = [{'logisticregression__C': param_range, 
               'logisticregression__penalty': ['l1'],
               'logisticregression__solver': ['saga']}, 
              {'logisticregression__C': param_range, 
               'logisticregression__penalty': ['l2'],
               'logisticregression__solver': ['lbfgs']}]

In [None]:
t_0 = time.time()
gs = GridSearchCV(estimator=pipeline,
                  param_grid=param_grid,
                  scoring='roc_auc',
                  cv=5)
gs.fit(X_train, y_train)
print('{} minutes'.format((time.time() - t_0) / 60.0))
print(gs.best_score_)
print(gs.best_params_)



Using stratified cross-validation and undersampling the non-install class to balance the dataset, we find that C=0.1 with penalty='l1' gives the best log-loss: 0.63 +/- 0.00 on the test set and 0.62 +/- 0.00 on the training set. The corresponding AUROC is 0.72 +/- 0.01 on the test set and 0.73 +/- 0.01 on the training set.
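
Train and test scores with their spread for several metrics can be read from cv_results_ when GridSearchCV is given a dictionary of scorers and return_train_score=True. The following is a sketch on synthetic data, not the run that produced the numbers above; the data, the reduced grid, and the names gs_toy and summary are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced data standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0)

gs_toy = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000, random_state=0),
    param_grid={'C': [0.01, 0.1, 1.0, 10.0]},
    scoring={'auroc': 'roc_auc', 'log_loss': 'neg_log_loss'},
    refit='auroc',  # metric used to choose best_params_
    cv=5,           # an integer cv is stratified for classifiers
    return_train_score=True)
gs_toy.fit(X_toy, y_toy)

results = gs_toy.cv_results_
best = gs_toy.best_index_
summary = {}
for split in ('train', 'test'):
    summary[split] = {
        'auroc': (results['mean_{}_auroc'.format(split)][best],
                  results['std_{}_auroc'.format(split)][best]),
        # neg_log_loss is the negated loss, so flip the sign of the mean.
        'log_loss': (-results['mean_{}_log_loss'.format(split)][best],
                     results['std_{}_log_loss'.format(split)][best]),
    }
    print(split, summary[split])
```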

## Learning Curve For Optimal Classifier

In [None]:
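# imbalanced_cross_validation_score is a helper that is not defined in this notebook excerpt.
pipe_lr = Pipeline([('scl', RobustScaler(with_centering=False)),
                    ('clf', LogisticRegression(C=0.1, penalty='l1', solver='liblinear', random_state=0))])
# The argument 3 is presumably the number of cross-validation folds.
imbalanced_cross_validation_score(pipe_lr, X_train, y_train, 3, log_loss_score)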

I ran the preceding code using various sample sizes and compared the results.
* Sample size: 100,000, CV log-loss score - train: 0.62, test: 0.64
* Sample size: 200,000, CV log-loss score - train: 0.61, test: 0.63
* Sample size: 400,000, CV log-loss score - train: 0.61, test: 0.62
* Sample size: 800,000, CV log-loss score - train: 0.62, test: 0.62
* Sample size: 1,600,000, CV log-loss score - train: 0.63, test: 0.64

Based on these results, the test log-loss varies by at most 0.02 across sample sizes, so training on a smaller subsample of the data appears to be a viable way to get around memory errors.
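
A comparison like this can also be produced in one call with sklearn's learning_curve, which refits the model on growing subsets of the training folds. This is an alternative sketch on synthetic data, not the helper used above; it omits undersampling and uses the default l2 penalty, so its numbers are not comparable to the list.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data standing in for the real training set.
X_toy, y_toy = make_classification(n_samples=2000, n_features=10, random_state=0)

model = make_pipeline(RobustScaler(with_centering=False),
                      LogisticRegression(C=0.1, max_iter=1000, random_state=0))

# Fractions of the available training fold, doubling at each step.
sizes, train_scores, test_scores = learning_curve(
    model, X_toy, y_toy,
    train_sizes=[0.125, 0.25, 0.5, 1.0],
    cv=3,
    scoring='neg_log_loss')

# Scores are negated losses; flip the sign and average over the folds.
train_loss = -train_scores.mean(axis=1)
test_loss = -test_scores.mean(axis=1)

for n, tr, te in zip(sizes, train_loss, test_loss):
    print("Sample size: {}, CV log-loss - train: {:.2f}, test: {:.2f}".format(n, tr, te))
```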

## Other Performance Metrics

In [None]:
# undersample_fit is a helper that is not defined in this notebook excerpt.
x, y = undersample_fit(X_train, y_train)
pipe_lr = Pipeline([('scl', RobustScaler(with_centering=False)),
                    ('clf', LogisticRegression(C=0.1, penalty='l1', solver='liblinear', random_state=0))])
pipe_lr.fit(x, y)
y_pred = pipe_lr.predict(X_test)
print("Precision: {}%".format(int(100 * precision_score(y_test, y_pred))))
print("Recall: {}%".format(int(100 * recall_score(y_test, y_pred))))
# Log-loss is not a percentage, so report it as a plain number.
print("Log-loss: {:.2f}".format(log_loss_score(pipe_lr, X_test, y_test)))
print("AUROC: {}%".format(int(100 * roc_auc_score(y_test, pipe_lr.predict_proba(X_test)[:, 1]))))
# confusion_matrix expects (y_true, y_pred); swapping them exchanges FP and FN.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("True Negatives: {}, False Positives: {}, False Negatives: {}, True Positives: {}".format(tn, fp, fn, tp))
# Prediction bias: predicted install rate minus observed install rate.
print("Prediction bias: {}".format(sum(y_pred) / len(y_pred) - sum(y_test) / len(y_test)))

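We find that the model has a very low precision of 4%, but this may be justified by the recall of 72%. Low precision is expected here, since the model is fitted on a class-balanced undersample but evaluated on a test set in which installs are rare.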
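
Precision and recall also depend on the decision threshold; predict uses 0.5 for this model. The sketch below uses made-up labels and probabilities (assumptions for illustration, not outputs of the model above) to show how a higher threshold trades recall for precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted install probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.55, 0.6, 0.4, 0.7, 0.65, 0.35, 0.9])

results = {}
for threshold in (0.5, 0.8):
    # Predict an install whenever the probability reaches the threshold.
    y_hat = (y_prob >= threshold).astype(int)
    results[threshold] = (precision_score(y_true, y_hat),
                          recall_score(y_true, y_hat))
    print("threshold {}: precision {:.2f}, recall {:.2f}".format(
        threshold, *results[threshold]))
```

Whether the higher precision is worth the lost recall depends on the relative cost of a missed install and a wasted impression, which this notebook does not specify.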