# Algorithm search

Now it's time to select the most appropriate algorithm for the problem. A good principle is to start with the simplest model and move to more complex ones only if the results are not satisfactory.

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.utils import shuffle
import numpy as np
import pandas as pd
import random
from pprint import pprint
from time import time

In [2]:
from scipy.sparse import load_npz, hstack

min_count = 5

emails = pd.read_pickle('./data/emails.pkl')
subjects_BoW = load_npz('./data/subjects_BoW.npz')
contents_BoW = load_npz('./data/contents_BoW.npz')
FromUsers = load_npz('./data/FromUsers.npz')
ToUsers = load_npz('./data/ToUsers.npz')
FromDomains = load_npz('./data/FromDomains.npz')
ToDomains = load_npz('./data/ToDomains.npz')
years = load_npz('./data/years.npz')
days = load_npz('./data/days.npz')
hours = load_npz('./data/hours.npz')

# Keep only the columns whose total count exceeds min_count
subjects_BoW = subjects_BoW[:,subjects_BoW.sum(0).A[0] > min_count]
contents_BoW = contents_BoW[:,contents_BoW.sum(0).A[0] > min_count]
FromUsers = FromUsers[:,FromUsers.sum(0).A[0] > min_count]
ToUsers = ToUsers[:,ToUsers.sum(0).A[0] > min_count]
FromDomains = FromDomains[:,FromDomains.sum(0).A[0] > min_count]
ToDomains = ToDomains[:,ToDomains.sum(0).A[0] > min_count]

# Stack all the feature blocks side by side into one sparse matrix
processed_data = hstack([subjects_BoW, contents_BoW, FromUsers, ToUsers, FromDomains, ToDomains, years, days, hours], format='csr', dtype=float)
del subjects_BoW, contents_BoW, FromUsers, ToUsers, FromDomains, ToDomains

processed_data.shape

(6362, 18861)
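The column filter used above can be hard to read at first sight, so here is a minimal sketch of the same idea on a toy matrix. The matrix values and the name `toy_min_count` are made up for illustration; `np.asarray(...).ravel()` is used as an equivalent of the `.A[0]` shortcut.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy bag-of-words matrix: 3 documents x 4 terms (values are invented).
toy = csr_matrix(np.array([
    [3, 0, 1, 0],
    [4, 1, 0, 0],
    [2, 0, 0, 6],
]))

toy_min_count = 5  # hypothetical threshold for this toy example

# Sum every column, flatten the 1 x n_terms result, and build a boolean mask.
column_totals = np.asarray(toy.sum(axis=0)).ravel()
keep = column_totals > toy_min_count

# Boolean indexing on the column axis keeps only the frequent terms.
filtered = toy[:, keep]
print(column_totals)
print(filtered.shape)
```

Note that the comparison is strict: a term seen exactly `toy_min_count` times is dropped.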

Shuffle the data, then hold out the last 30% as the test set.

In [3]:
X, y = processed_data, emails['label'][:6362].values
del processed_data
del emails

indexes = list(range(X.shape[0]))
random.seed(1)
random.shuffle(indexes)

X, y = X[indexes], y[indexes]
cutoff = int(X.shape[0]*0.7)

X_train_valid, y_train_valid = X[:cutoff], y[:cutoff]
X_test, y_test = X[cutoff:], y[cutoff:]

del X, y
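As an aside, the manual shuffle and cut above could also be written with scikit-learn's `train_test_split`. The sketch below is an alternative, not what this notebook ran: it would produce a different partition, and the stand-in data (`X_demo`, `y_demo`) is randomly generated. The `stratify` argument is an addition that keeps the class ratio the same on both sides of the split.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.model_selection import train_test_split

# Stand-in data: the shape and the class ratio are invented for illustration.
rng = np.random.RandomState(1)
X_demo = sparse_random(1000, 50, density=0.05, format='csr', random_state=rng)
y_demo = (rng.rand(1000) < 0.25).astype(int)

# 70% for training/validation, 30% for testing, same class ratio in both parts.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

print(X_tv.shape, X_te.shape)
print(y_tv.mean(), y_te.mean())
```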

## Logistic Regression

Logistic regression was selected as the first learning algorithm for its low complexity and its good score.

In [4]:
# Initialisation
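# liblinear handles both the l1 and l2 penalties searched below
# (it was the default solver before scikit-learn 0.22)
LR = LogisticRegression(solver='liblinear')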

parameters = {
        'penalty': ('l2', 'l1'),
        'class_weight':('balanced', None),
        'tol':(1e-2, ),
        'C': (0.1, 0.5, 1.0),
        'fit_intercept': (True, False),
        'max_iter':(300,)
    }

# Note: `iid` was removed in scikit-learn 0.24; drop that argument on recent versions
grid_search = GridSearchCV(LR, parameters, verbose=0, iid=True, cv=4, n_jobs=-1, return_train_score=False, scoring='f1')

print("Performing grid search...")
print("parameters:")
pprint(parameters)
t0 = time()
grid_search.fit(X_train_valid, y_train_valid)
print("done in %0.3fs" % (time() - t0))
print()
best_LR = grid_search.best_estimator_
# Mean cross-validated f1-score of the best parameter combination
score_valid = grid_search.best_score_
print("Best score for validation set : %0.3f" % score_valid)

Performing grid search...
parameters:
{'C': (0.1, 0.5, 1.0),
 'class_weight': ('balanced', None),
 'fit_intercept': (True, False),
 'max_iter': (300,),
 'penalty': ('l2', 'l1'),
 'tol': (0.01,)}
done in 72.835s

Best score for validation set : 0.897
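The best score alone does not say which parameter combination won. A fitted `GridSearchCV` exposes `best_params_` and `cv_results_` for that. The sketch below shows how they can be inspected; it runs on synthetic data from `make_classification` with a reduced grid, so the names `X_demo`, `y_demo`, and `demo_search` and all the values it prints are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (not the email dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.75, 0.25], random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    {'penalty': ('l2', 'l1'), 'C': (0.1, 1.0)},
    cv=4, scoring='f1')
demo_search.fit(X_demo, y_demo)

# Winning combination, then every combination ranked by mean validation score.
print(demo_search.best_params_)
results = pd.DataFrame(demo_search.cv_results_)
columns = ['param_penalty', 'param_C', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```

Looking at the full table is useful to see whether the best combination is clearly ahead or whether several settings are within noise of each other.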


## SVM

In [5]:
# Initialisation
SVM = LinearSVC()

parameters = {
        'class_weight':('balanced', None),
        'tol':(1e-2, ),
        'C': (0.1, 0.5, 1.0),
        'fit_intercept': (True, False),
        'max_iter':(200,100)
    }

# Note: `iid` was removed in scikit-learn 0.24; drop that argument on recent versions
grid_search = GridSearchCV(SVM, parameters, verbose=0, iid=True, cv=4, n_jobs=-1, return_train_score=False, scoring='f1')

print("Performing grid search...")
print("parameters:")
pprint(parameters)
t0 = time()
grid_search.fit(X_train_valid, y_train_valid)
print("done in %0.3fs" % (time() - t0))
print()
best_SVM = grid_search.best_estimator_
# Mean cross-validated f1-score of the best parameter combination
score_valid = grid_search.best_score_
print("Best score for validation set : %0.3f" % score_valid)

Performing grid search...
parameters:
{'C': (0.1, 0.5, 1.0),
 'class_weight': ('balanced', None),
 'fit_intercept': (True, False),
 'max_iter': (200, 100),
 'tol': (0.01,)}
done in 16.219s

Best score for validation set : 0.892


## Results

A good start for scoring a classifier is to look at its precision, its recall, and its f1-score, which combines the two. Accuracy alone would be misleading here: the classes are a bit imbalanced (about 77% of the test emails belong to class 0), so a model could reach a high accuracy just by always predicting the most represented class.

The f1-score is the harmonic mean of precision and recall. With `scoring='f1'`, it is computed on the positive class (label 1, the minority class here), so a model cannot score well by neglecting that class.

In [6]:
predictions = best_LR.predict(X_test)
print('--------------Best Logistic Regression--------------')
print(classification_report(y_test, predictions))
print('\n\n')

print('----------------------Best SVM----------------------')
predictions = best_SVM.predict(X_test)
print(classification_report(y_test, predictions))

--------------Best Logistic Regression--------------
             precision    recall  f1-score   support

          0       0.98      0.96      0.97      1476
          1       0.88      0.92      0.90       433

avg / total       0.95      0.95      0.95      1909




----------------------Best SVM----------------------
             precision    recall  f1-score   support

          0       0.97      0.96      0.97      1476
          1       0.88      0.90      0.89       433

avg / total       0.95      0.95      0.95      1909
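The numbers in the reports above can be checked by hand. The short sketch below recomputes the f1-score of class 1 for the logistic regression from its rounded precision and recall, and the accuracy a trivial "always predict class 0" model would get on this test set. The helper name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall of class 1 in the logistic regression report.
lr_f1_class_1 = f1_from(0.88, 0.92)
print(round(lr_f1_class_1, 2))

# Supports from the report: 1476 test emails in class 0, 433 in class 1.
majority_accuracy = 1476 / (1476 + 433)
print(round(majority_accuracy, 3))
```

The trivial model would be right about 77% of the time while never finding a single class 1 email, so its recall on class 1 would be zero. This is why accuracy is not the metric used here.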



In this case, logistic regression seems perfectly adequate for the problem. We could add some boosting to gain the last bit of precision, but it would require more computation at prediction time, and therefore cost more overall. Plus, logistic regression is easy to parallelize if need be.
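To make the cost argument concrete, here is a minimal sketch of what a fitted logistic regression does at prediction time: one sparse dot product per email, followed by a threshold. The data (`X_demo`, `y_demo`) is randomly generated for illustration and is not the email dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Random stand-in data: sparse count features and a label tied to feature 0.
rng = np.random.RandomState(0)
X_demo = csr_matrix(rng.poisson(0.3, size=(200, 30)).astype(float))
y_demo = (X_demo[:, 0].toarray().ravel() + rng.rand(200) > 0.8).astype(int)

model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# Manual prediction: weighted sum of the features plus the intercept,
# then class 1 whenever the score is positive.
scores = X_demo @ model.coef_.ravel() + model.intercept_[0]
manual_predictions = (scores > 0).astype(int)

print(np.array_equal(manual_predictions, model.predict(X_demo)))
```

Each row is scored independently of the others, so a large batch can be cut into chunks and sent to separate workers without any coordination between them.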