# Binary Logistic Regression Model Training via sklearn

## Tech Spec
* Google Cloud Compute Engine
* n1-standard-4 (4 vCPUs, 15 GB memory)
* Debian GNU/Linux 9

## Performance
Hyperparameter tuning over the parameter grid ----- took ----- minutes.

Training over ------ samples took ----- minutes.

## Model Training

In [1]:
import numpy as np
import pandas as pd
import time

from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, confusion_matrix,
                             f1_score, log_loss, make_scorer,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder, RobustScaler

### Load Data

In [16]:
df = pd.read_pickle('../data/preprocessed_training_data.pkl')
df = df[:400000]

### Brief Data Exploration

In [17]:
print(df.head())

                         id           timestamp                campaignId  \
0  5c36658fb58fad351175f0b6 2019-01-09 21:20:15  59687f0d896a6b0e5ce6ea15   
1  5c38d5ab1c16172870186b5a 2019-01-11 17:43:07  59687f0d896a6b0e5ce6ea15   
2  5c38815de8f4e50e256e4f9c 2019-01-11 11:43:25  59687f0d896a6b0e5ce6ea15   
3  5c409ace532d5806d2c6a5e6 2019-01-17 15:10:06  59687f0d896a6b0e5ce6ea15   
4  5c3904b92d798c41e7f3088a 2019-01-11 21:03:53  59687f0d896a6b0e5ce6ea15   

  platform softwareVersion sourceGameId country  startCount  viewCount  \
0      ios          11.4.1      1373094      US          25         24   
1      ios            12.1      2739989      US          10          9   
2      ios          12.1.2      1373094      US          27         26   
3      ios          12.1.2      1217749      US          15         14   
4      ios          12.0.1      1373094      US          20         18   

   clickCount  installCount           lastStart  startCount1d  startCount7d  \
0           0

In [18]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 18 columns):
id                    400000 non-null object
timestamp             400000 non-null datetime64[ns]
campaignId            400000 non-null object
platform              400000 non-null object
softwareVersion       400000 non-null object
sourceGameId          400000 non-null object
country               400000 non-null object
startCount            400000 non-null int64
viewCount             400000 non-null int64
clickCount            400000 non-null int64
installCount          400000 non-null int64
lastStart             369470 non-null datetime64[ns]
startCount1d          400000 non-null int64
startCount7d          400000 non-null int64
connectionType        400000 non-null object
deviceType            400000 non-null object
install               400000 non-null int64
timeSinceLastStart    400000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(7), object(8)
memory usa

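The full training dataset has over 3.5 million records; the sample loaded here contains the first 400,000 of them, with 18 columns (17 candidate features plus the `install` target).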

In [19]:
df['install'].value_counts()

0    393427
1      6573
Name: install, dtype: int64

The classes are extremely imbalanced: this 400,000-row sample contains 6,573 installs against 393,427 non-installs, a ratio of roughly 1:60. The full dataset is more skewed still, at about 1:82.
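As a quick check, the ratio can be recomputed from the counts printed above. This is a minimal sketch that hard-codes those two counts instead of reloading the data; the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above (400,000-row sample).
counts = pd.Series({0: 393427, 1: 6573}, name="install")

negatives_per_positive = counts[0] / counts[1]
positive_rate = counts[1] / counts.sum()

print("installs : non-installs = 1:{:.0f}".format(negatives_per_positive))
print("install rate = {:.2%}".format(positive_rate))
```

For this sample the sketch gives about 60 non-installs per install, an install rate of roughly 1.6%.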

In [20]:
numerical_columns = ['startCount', 'viewCount', 'installCount', 'startCount1d', 'startCount7d', 'timeSinceLastStart']
categorical_columns = ['campaignId', 'sourceGameId', 'country']

In [21]:
for feat in categorical_columns:
    print(feat)
    print("==========")
    print(df[feat].value_counts())
    print("        ")

campaignId
5bd2ccefc9c2110ad461c1b3    23413
5bf54928eb052c1002550102    22749
5be19ebfea7afe2df87a44cd    15438
5b6d659b9225dc002ec90dff    14075
5af41f681455c215c8a5a559     9976
                            ...  
5a9be6ea3f98c037e2583b14        1
5bc6e0236e2e5aaef4543d92        1
5bf4b62716fbbb711b8bae88        1
5c01890762973d602111c88a        1
5ba0a0c06998b01f00b12e9e        1
Name: campaignId, Length: 1009, dtype: int64
        
sourceGameId
1711292    8096
2762289    5331
2633648    4679
111890     4474
1483109    4148
           ... 
2784214       1
1155685       1
1323843       1
2803389       1
2610183       1
Name: sourceGameId, Length: 16767, dtype: int64
        
country
US    57533
RU    40440
IN    34698
ID    27315
DE    18102
      ...  
MH        1
WF        1
GQ        1
CF        1
KI        1
Name: country, Length: 210, dtype: int64
        


This shows that the cardinality of the campaignId and sourceGameId features is very high: 1,009 and 16,767 distinct values respectively, compared with 210 for country.
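To see why cardinality matters for the one-hot encoding used later, the sketch below counts distinct values per column and compares the total with the width of the encoded matrix. The toy frame and its values are made up for illustration; only `nunique` and `OneHotEncoder` come from the libraries already in use.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real frame; the values are invented.
toy = pd.DataFrame({
    "campaignId": ["a", "b", "c", "a", "d"],
    "country": ["US", "RU", "US", "IN", "US"],
})

cardinality = toy.nunique()  # distinct values per column
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(toy)

print(cardinality)
print("one-hot width:", encoded.shape[1])  # one column per distinct value
```

Applied to the counts listed above, the same arithmetic gives 1,009 + 16,767 + 210 = 17,986 one-hot columns for this sample, which is why the encoder's sparse output is helpful here.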

### Data Preprocessing

In [33]:
numerical_pipeline = make_pipeline(RobustScaler(with_centering=True))

In [34]:
categorical_pipeline = make_pipeline(OneHotEncoder(handle_unknown='ignore'))

In [35]:
preprocessor = ColumnTransformer(
    [('numerical_preprocessing', numerical_pipeline, numerical_columns), 
     ('categorical_preprocessing', categorical_pipeline, categorical_columns)], 
    remainder='drop')
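The behaviour of this preprocessor on unseen data can be illustrated on a toy frame. The sketch below assumes a single numerical and a single categorical column with invented values; it shows that `RobustScaler` centres on the median and that `handle_unknown='ignore'` encodes an unseen category as an all-zero block rather than raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Invented data: the outlier (100) barely affects the median and IQR.
train = pd.DataFrame({"startCount": [1, 2, 3, 4, 100],
                      "country": ["US", "RU", "US", "IN", "US"]})
# A new row whose country never appeared during fitting.
new = pd.DataFrame({"startCount": [3], "country": ["FR"]})

pre = ColumnTransformer(
    [("num", RobustScaler(with_centering=True), ["startCount"]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"])],
    remainder="drop")
pre.fit(train)

row = pre.transform(new)
# The output may be sparse or dense depending on its density.
row = row.toarray() if hasattr(row, "toarray") else np.asarray(row)
print(row)  # scaled median is 0, and the unseen country yields only zeros
```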

### Dataset Training/Test Split

In [36]:
X = df[numerical_columns + categorical_columns]
y = df['install']

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
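The effect of `stratify=y` can be checked on synthetic labels. In this sketch (the arrays are invented and much smaller than the real data), both splits keep the same share of positives as the full label vector.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 20 positives out of 1,000 (2%).
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

Without stratification, a rare class could by chance be over- or under-represented in the test set, which would make precision and recall estimates noisier.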

## Performance Metrics

In addition to AUROC, log-loss, and prediction bias, we measure classifier performance with precision and recall.

In [37]:
def log_loss_score(clf, x, y):
    return log_loss(y, clf.predict_proba(x))

def auroc_score(clf, x, y):
    return roc_auc_score(y, clf.predict_proba(x)[:, 1])
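Prediction bias is computed inline later in the notebook as the mean predicted label minus the mean observed label. A small helper along these lines could sit next to the two scorers above; the function name and the toy labels are hypothetical.

```python
import numpy as np

def prediction_bias(y_true, y_pred):
    """Mean predicted label minus mean observed label (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return y_pred.mean() - y_true.mean()

# Toy labels: one true positive in four, but three predicted positives.
y_true_toy = [0, 0, 0, 1]
y_pred_toy = [0, 1, 1, 1]
bias_toy = prediction_bias(y_true_toy, y_pred_toy)
print(bias_toy)
```

A positive value means the classifier predicts the positive class more often than it occurs; zero means the predicted and observed rates match.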

## Grid Search

In [38]:
pipeline = make_pipeline(preprocessor, 
                         RandomUnderSampler(random_state=0), 
                         LogisticRegression(random_state=0))
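To make the undersampling step concrete, the sketch below balances a toy label vector with plain NumPy. It is a hand-written illustration of the idea, not imblearn's implementation; the helper name and data are hypothetical.

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Hypothetical helper: keep every row of the smallest class and an
    equal-sized random subset of each larger class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
            for c in classes]
    idx = np.sort(np.concatenate(kept))
    return X[idx], y[idx]

# Toy data: 95 negatives and 5 positives.
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.arange(100).reshape(-1, 1)

X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(np.bincount(y_bal))  # both classes now have the same count
```

Within an imblearn pipeline the sampler is applied only while fitting, so predictions on the test set are still made on the original, imbalanced distribution.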

In [39]:
param_range = [0.01, 0.1, 1.0, 10.0]
param_grid = [{'logisticregression__C': param_range, 
               'logisticregression__penalty': ['l1'],
               'logisticregression__solver': ['saga']}, 
              {'logisticregression__C': param_range, 
               'logisticregression__penalty': ['l2'],
               'logisticregression__solver': ['lbfgs']}]
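The cost of the search can be estimated before running it by counting the candidates in the grid. This sketch repeats the grid defined above and uses scikit-learn's `ParameterGrid`; the `cv=3` value is taken from the grid search cell that follows.

```python
from sklearn.model_selection import ParameterGrid

param_range = [0.01, 0.1, 1.0, 10.0]
param_grid = [{'logisticregression__C': param_range,
               'logisticregression__penalty': ['l1'],
               'logisticregression__solver': ['saga']},
              {'logisticregression__C': param_range,
               'logisticregression__penalty': ['l2'],
               'logisticregression__solver': ['lbfgs']}]

n_candidates = len(ParameterGrid(param_grid))  # 4 values of C per penalty
n_fits = n_candidates * 3                      # one fit per candidate per fold
print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the whole training set.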

In [None]:
t_0 = time.time()
gs = GridSearchCV(estimator=pipeline,
                  param_grid=param_grid,
                  scoring='roc_auc',
                  cv=3)
gs.fit(X_train, y_train)
print('{} minutes'.format((time.time() - t_0) / 60.0))
print(gs.best_score_)
print(gs.best_params_)

On one million rows of data, the grid search took 11 minutes. The best configuration was the l2 penalty with C=0.1, which reached a cross-validated AUROC of 0.73.
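Beyond `best_score_` and `best_params_`, the per-candidate scores are available in `cv_results_`, which helps judge how sensitive the AUROC is to C. The sketch below runs a small search on synthetic data from `make_classification` (an assumption made to keep the example self-contained, and without the undersampling step) and tabulates the results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced data standing in for the real features.
X_toy, y_toy = make_classification(n_samples=300, n_features=5,
                                   weights=[0.9, 0.1], random_state=0)

gs_toy = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="roc_auc", cv=3)
gs_toy.fit(X_toy, y_toy)

# One row per candidate, best candidate first.
results = (pd.DataFrame(gs_toy.cv_results_)
           [["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(results)
```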

## Optimal Classifier Training Time

In [14]:
t_0 = time.time()
pipeline = make_pipeline(preprocessor, 
                         RandomUnderSampler(random_state=0), 
                         LogisticRegression(C=1.0, penalty='l2', max_iter=2000, random_state=0))
pipeline.fit(X_train, y_train)
print('{} minutes'.format((time.time() - t_0) / 60.0))

0.8826190749804179 minutes


## Other Performance Metrics

In [15]:
y_pred = pipeline.predict(X_test)
print("Precision: {}%".format(int(100 * precision_score(y_test, y_pred))))
print("Recall: {}%".format(int(100 * recall_score(y_test, y_pred))))
print("Log-loss: {}%".format(int(100 * log_loss_score(pipeline, X_test, y_test))))
print("AUROC: {}%".format(int(100 * roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]))))
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("True Negatives: {}, False Positives: {}, False Negatives: {}, True Positives: {}".format(tn, fp, fn, tp))
print("Prediction bias: {}".format(sum(y_pred) / len(y_pred) - sum(y_test) / len(y_test)))

Precision: 2%
Recall: 67%
Log-loss: 61%
AUROC: 72%
True Negatives: 489232, False Positives: 249607, False Negatives: 2879, True Positives: 6070
Prediction bias: 0.3299437808576762


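On one million data points, the model has a very low precision of 2%, which may be justified by its recall of 67%: it recovers about two thirds of the installs at the cost of many false alarms.
The large positive prediction bias (0.33) is consistent with training on undersampled, class-balanced data, which pushes the classifier to predict installs far more often than they occur.
The same result holds when using the full 3.7 million rows.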
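The reported figures can be reproduced by hand from the four confusion-matrix counts. The sketch below treats the 249,607 non-installs predicted as installs as false positives and the 2,879 missed installs as false negatives, which is the assignment consistent with the reported precision, recall, and prediction bias.

```python
# Counts taken from the output above.
tp, fp, fn, tn = 6070, 249607, 2879, 489232
total = tp + fp + fn + tn

precision = tp / (tp + fp)  # share of predicted installs that are real
recall = tp / (tp + fn)     # share of real installs that are found
# Predicted positive rate minus observed positive rate.
prediction_bias = (tp + fp) / total - (tp + fn) / total

print("Precision: {}%".format(int(100 * precision)))
print("Recall: {}%".format(int(100 * recall)))
print("Prediction bias: {:.4f}".format(prediction_bias))
```

In these counts roughly 41 of every 42 predicted installs are false alarms, so whether the recall justifies the precision depends on how costly a false alarm is compared with a missed install.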