In [59]:
import pandas as pd
import numpy as np

## Precision or Recall?

It is important to determine what measure of accuracy we want to use to validate our model. In our case I am going to choose to use precision.

Why?

Well, lets look at what precision measures. We would be looking at the number of observations that have been correctly subscribed against all of the predicted subscribed values <b>(total Y / total Y obs)</b>, i.e. of all the observations that were predicted as Y, which ones were correct?

If we were looking at recall, we would look at <b>total Y correct / total Y</b>, meaning we would also consider the observations that were predicted as Y but were actually N.

We don't particularly care if we incorrectly classify a Y because if we predict them to be a N and they turn out to be a Y, then happy days, but what we really care about are the clients that were predicted to be a Y but turned out to be a N because this will affect future planning for our revenue figures.

## Working with Standardized data

In [60]:
X_train = pd.read_csv('data/train/x_train_stand.csv').values
y_train = pd.read_csv('data/train/y_train_stand.csv').values.flatten()

X_test = pd.read_csv('data/test/x_test_stand.csv').values
y_test = pd.read_csv('data/test/y_test_stand.csv').values.flatten()

### Logistic Regression

In [61]:
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=130)
lr_clf.fit(X_train, y_train)

y_pred_lr = lr_clf.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

pd.DataFrame(
    confusion_matrix(y_test, y_pred_lr),
    columns=pd.MultiIndex.from_product([['Prediction'], ['Negative', 'Positive']]),
    index=pd.MultiIndex.from_product([['Actual'], ['Negative', 'Positive']])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Prediction,Prediction
Unnamed: 0_level_1,Unnamed: 1_level_1,Negative,Positive
Actual,Negative,7991,2
Actual,Positive,1050,0


In [62]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
print('{} records were correctly predicted negative (true negative)'.format(tn))
print('{} records were correctly predicted positive (true positives)'.format(tp))
print('{} records were incorrectly predicted negative (false negative)'.format(fn))
print('{} records were incorrectly predicted positive (false positive)'.format(fp))
print('-----------------------------')
print('Accuracy: {:.2f}%'.format(accuracy_score(y_test, y_pred_lr) * 100))
print('Precision: {:.2f}%'.format(precision_score(y_test, y_pred_lr) * 100))

7991 records were correctly predicted negative (true negative)
0 records were correctly predicted positive (true positives)
1050 records were incorrectly predicted negative (false negative)
2 records were incorrectly predicted positive (false positive)
-----------------------------
Accuracy: 88.37%
Precision: 0.00%


### K-NN

In [63]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski', p=2)
knn_clf.fit(X_train, y_train)

y_pred_knn = knn_clf.predict(X_test)

pd.DataFrame(
    confusion_matrix(y_test, y_pred_knn),
    columns=pd.MultiIndex.from_product([['Prediction'], ['Negative', 'Positive']]),
    index=pd.MultiIndex.from_product([['Actual'], ['Negative', 'Positive']])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Prediction,Prediction
Unnamed: 0_level_1,Unnamed: 1_level_1,Negative,Positive
Actual,Negative,7789,204
Actual,Positive,923,127


In [64]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel()
print('{} records were correctly predicted negative (true negative)'.format(tn))
print('{} records were correctly predicted positive (true positives)'.format(tp))
print('{} records were incorrectly predicted negative (false negative)'.format(fn))
print('{} records were incorrectly predicted positive (false positive)'.format(fp))
print('-----------------------------')
print('Accuracy: {:.2f}%'.format(accuracy_score(y_test, y_pred_knn) * 100))
print('Precision: {:.2f}%'.format(precision_score(y_test, y_pred_knn) * 100))

7789 records were correctly predicted negative (true negative)
127 records were correctly predicted positive (true positives)
923 records were incorrectly predicted negative (false negative)
204 records were incorrectly predicted positive (false positive)
-----------------------------
Accuracy: 87.54%
Precision: 38.37%


### SVM

In [65]:
from sklearn.svm import SVC

svm_clf = SVC(kernel='linear', random_state=0)
svm_clf.fit(X_train, y_train)

y_pred_svm = svm_clf.predict(X_test)

pd.DataFrame(
    confusion_matrix(y_test, y_pred_svm),
    columns=pd.MultiIndex.from_product([['Prediction'], ['Negative', 'Positive']]),
    index=pd.MultiIndex.from_product([['Actual'], ['Negative', 'Positive']])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Prediction,Prediction
Unnamed: 0_level_1,Unnamed: 1_level_1,Negative,Positive
Actual,Negative,7993,0
Actual,Positive,1050,0


In [66]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_svm).ravel()
print('{} records were correctly predicted negative (true negative)'.format(tn))
print('{} records were correctly predicted positive (true positives)'.format(tp))
print('{} records were incorrectly predicted negative (false negative)'.format(fn))
print('{} records were incorrectly predicted positive (false positive)'.format(fp))
print('-----------------------------')
print('Accuracy: {:.2f}%'.format(accuracy_score(y_test, y_pred_svm) * 100))
print('Precision: {:.2f}%'.format(precision_score(y_test, y_pred_svm, zero_division=0) * 100))

7993 records were correctly predicted negative (true negative)
0 records were correctly predicted positive (true positives)
1050 records were incorrectly predicted negative (false negative)
0 records were incorrectly predicted positive (false positive)
-----------------------------
Accuracy: 88.39%
Precision: 0.00%


## Kernel SVM

In [67]:
from sklearn.svm import SVC

ksvm_clf = SVC(kernel='rbf', random_state=0)
ksvm_clf.fit(X_train, y_train)

y_pred_ksvm = ksvm_clf.predict(X_test)

pd.DataFrame(
    confusion_matrix(y_test, y_pred_ksvm),
    columns=pd.MultiIndex.from_product([['Prediction'], ['Negative', 'Positive']]),
    index=pd.MultiIndex.from_product([['Actual'], ['Negative', 'Positive']])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Prediction,Prediction
Unnamed: 0_level_1,Unnamed: 1_level_1,Negative,Positive
Actual,Negative,7993,0
Actual,Positive,1050,0


In [68]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_ksvm).ravel()
print('{} records were correctly predicted negative (true negative)'.format(tn))
print('{} records were correctly predicted positive (true positives)'.format(tp))
print('{} records were incorrectly predicted negative (false negative)'.format(fn))
print('{} records were incorrectly predicted positive (false positive)'.format(fp))
print('-----------------------------')
print('Accuracy: {:.2f}%'.format(accuracy_score(y_test, y_pred_ksvm) * 100))
print('Precision: {:.2f}%'.format(precision_score(y_test, y_pred_ksvm, zero_division=0) * 100))

7993 records were correctly predicted negative (true negative)
0 records were correctly predicted positive (true positives)
1050 records were incorrectly predicted negative (false negative)
0 records were incorrectly predicted positive (false positive)
-----------------------------
Accuracy: 88.39%
Precision: 0.00%


## Naive Bayes

In [69]:
from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)

y_pred_gnb = gnb_clf.predict(X_test)

pd.DataFrame(
    confusion_matrix(y_test, y_pred_gnb),
    columns=pd.MultiIndex.from_product([['Prediction'], ['Negative', 'Positive']]),
    index=pd.MultiIndex.from_product([['Actual'], ['Negative', 'Positive']])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Prediction,Prediction
Unnamed: 0_level_1,Unnamed: 1_level_1,Negative,Positive
Actual,Negative,7146,847
Actual,Positive,820,230


In [70]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_gnb).ravel()
print('{} records were correctly predicted negative (true negative)'.format(tn))
print('{} records were correctly predicted positive (true positives)'.format(tp))
print('{} records were incorrectly predicted negative (false negative)'.format(fn))
print('{} records were incorrectly predicted positive (false positive)'.format(fp))
print('-----------------------------')
print('Accuracy: {:.2f}%'.format(accuracy_score(y_test, y_pred_gnb) * 100))
print('Precision: {:.2f}%'.format(precision_score(y_test, y_pred_gnb) * 100))

7146 records were correctly predicted negative (true negative)
230 records were correctly predicted positive (true positives)
820 records were incorrectly predicted negative (false negative)
847 records were incorrectly predicted positive (false positive)
-----------------------------
Accuracy: 81.57%
Precision: 21.36%


## Decision Tree

In [71]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
dt_clf.fit(X_train, y_train)

y_pred_dt = dt_clf.predict(X_test)

pd.DataFrame(
    confusion_matrix(y_test, y_pred_dt),
    columns=pd.MultiIndex.from_product([['Prediction'], ['Negative', 'Positive']]),
    index=pd.MultiIndex.from_product([['Actual'], ['Negative', 'Positive']])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Prediction,Prediction
Unnamed: 0_level_1,Unnamed: 1_level_1,Negative,Positive
Actual,Negative,7249,744
Actual,Positive,699,351


In [72]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_dt).ravel()
print('{} records were correctly predicted negative (true negative)'.format(tn))
print('{} records were correctly predicted positive (true positives)'.format(tp))
print('{} records were incorrectly predicted negative (false negative)'.format(fn))
print('{} records were incorrectly predicted positive (false positive)'.format(fp))
print('-----------------------------')
print('Accuracy: {:.2f}%'.format(accuracy_score(y_test, y_pred_dt) * 100))
print('Precision: {:.2f}%'.format(precision_score(y_test, y_pred_dt) * 100))

7249 records were correctly predicted negative (true negative)
351 records were correctly predicted positive (true positives)
699 records were incorrectly predicted negative (false negative)
744 records were incorrectly predicted positive (false positive)
-----------------------------
Accuracy: 84.04%
Precision: 32.05%


## Random Forest

In [73]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
rf_clf.fit(X_train, y_train)

y_pred_rf = dt_clf.predict(X_test)

pd.DataFrame(
    confusion_matrix(y_test, y_pred_rf),
    columns=pd.MultiIndex.from_product([['Prediction'], ['Negative', 'Positive']]),
    index=pd.MultiIndex.from_product([['Actual'], ['Negative', 'Positive']])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Prediction,Prediction
Unnamed: 0_level_1,Unnamed: 1_level_1,Negative,Positive
Actual,Negative,7249,744
Actual,Positive,699,351


In [74]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rf).ravel()
print('{} records were correctly predicted negative (true negative)'.format(tn))
print('{} records were correctly predicted positive (true positives)'.format(tp))
print('{} records were incorrectly predicted negative (false negative)'.format(fn))
print('{} records were incorrectly predicted positive (false positive)'.format(fp))
print('-----------------------------')
print('Accuracy: {:.2f}%'.format(accuracy_score(y_test, y_pred_rf) * 100))
print('Precision: {:.2f}%'.format(precision_score(y_test, y_pred_rf) * 100))

7249 records were correctly predicted negative (true negative)
351 records were correctly predicted positive (true positives)
699 records were incorrectly predicted negative (false negative)
744 records were incorrectly predicted positive (false positive)
-----------------------------
Accuracy: 84.04%
Precision: 32.05%


The results have not differed much between the two methods of scaling. The best performing models remain to be:
- Decision Tree
- Random Forest