# Problem Formulation

Given a pattern string as input, we want to know whether it contains a dark pattern. We use a balanced dataset containing all the instances in the Princeton dataset, which are all dark patterns, and the instances in the 'normie.csv' file, which are labeled as NOT dark patterns. Hence we have a balanced dataset consisting of pattern strings with and without dark patterns.

We then use this labeled dataset to build and train supervised machine learning models, and select the most suitable ones for our project.

----


In [1]:
import pandas as pd 
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV

# TfidfVectorizer tokenizes a collection of text documents, builds a vocabulary of known words, and encodes new documents as TF-IDF vectors using that vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

# Bernoulli Naive Bayes (similar to MultinomialNB) is suitable for discrete data. The difference is that MultinomialNB works with occurrence counts, while BernoulliNB is designed for binary/boolean features. For text classification, this means word occurrence vectors (rather than word count vectors) may be more suitable for training and using this classifier.
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Evaluation metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score

# joblib is a set of tools to provide lightweight pipelining in Python. It provides utilities for saving and loading Python objects that make use of NumPy data structures, efficiently.
import joblib

import matplotlib.pyplot as plt
# import seaborn as sns

## Data Exploration

---
Import the merged dataset, and explore the dataset.

In [2]:
data = pd.read_csv('enriched_data.csv')

In [3]:
data.head(5)

Unnamed: 0,Pattern String,classification
0,Ends in 07:42:09,0
1,Ends in 07:37:10,0
2,Ends in 02:27:10,0
3,Ends in 04:17:10,0
4,Ends in 01:57:10,0


---
`check the dataset information`

There are 7952 non-null pattern strings in the dataset.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7952 entries, 0 to 7951
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Pattern String  7952 non-null   object
 1   classification  7952 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 124.4+ KB


In [5]:
# check the distribution of the target value --- classification.

print('Distribution of the tags:\n{}'.format(data['classification'].value_counts()))

Distribution of the tags:
1    6897
0    1055
Name: classification, dtype: int64


In [6]:
# Change the labels into strings (assign the result back instead of using inplace on a column)

data['classification'] = data['classification'].replace({0: 'Dark', 1: 'Not_Dark'})

print(data.head(5))

print('\nDistribution of the tags:\n{}'.format(data['classification'].value_counts()))

     Pattern String classification
0  Ends in 07:42:09           Dark
1  Ends in 07:37:10           Dark
2  Ends in 02:27:10           Dark
3  Ends in 04:17:10           Dark
4  Ends in 01:57:10           Dark

Distribution of the tags:
Not_Dark    6897
Dark        1055
Name: classification, dtype: int64


In [7]:
# Remove duplicate pattern strings before training the models to reduce overfitting.

data = data.drop_duplicates(subset="Pattern String")

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7952 entries, 0 to 7951
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Pattern String  7952 non-null   object
 1   classification  7952 non-null   object
dtypes: object(2)
memory usage: 186.4+ KB
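
The entry count is the same before and after `drop_duplicates`, so it can help to count duplicates explicitly rather than infer it from `info()`. The following is a minimal sketch on a hypothetical toy frame (the strings are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame; the strings are illustrative only.
toy = pd.DataFrame({
    "Pattern String": ["Only 2 left", "Only 2 left", "Free shipping"],
    "classification": ["Not_Dark", "Not_Dark", "Not_Dark"],
})

# Rows whose pattern string already appeared earlier in the frame.
n_duplicates = int(toy.duplicated(subset="Pattern String").sum())

# Keep the first occurrence of each pattern string.
deduplicated = toy.drop_duplicates(subset="Pattern String").reset_index(drop=True)

print(n_duplicates)       # 1
print(len(deduplicated))  # 2
```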


In [8]:
print(data.head(5))

print('\nDistribution of the tags:\n{}'.format(data['classification'].value_counts()))

     Pattern String classification
0  Ends in 07:42:09           Dark
1  Ends in 07:37:10           Dark
2  Ends in 02:27:10           Dark
3  Ends in 04:17:10           Dark
4  Ends in 01:57:10           Dark

Distribution of the tags:
Not_Dark    6897
Dark        1055
Name: classification, dtype: int64


In [10]:
# ---Change all the labels into Not Dark

data['classification'] = 'Not_Dark'

print('\nDistribution of the tags:\n{}'.format(data['classification'].value_counts()))


Distribution of the tags:
Not_Dark    7952
Name: classification, dtype: int64


In [11]:
# --- Get the Confirmshaming dark patterns (DP) from the Princeton dataset

df = pd.read_csv('dark_patterns.csv')

df

Unnamed: 0,Pattern String,Comment,Pattern Category,Pattern Type,Where in website?,Deceptive?,Website Page
0,Collin P. from Grandview Missouri just bought ...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://alaindupetit.com/collections/all-suits...
1,"Faith in Glendale, United States purchased a C...",Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bonescoffee.com/products/strawberry-ch...
2,Sharmeen Atif From Karachi just bought Stylish...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://brandsego.com/collections/under-rs-99/...
3,9 people are viewing this.,Product detail,Social Proof,Activity Notification,Product Page,No,https://brightechshop.com/products/ambience-so...
4,5338 people viewed this in the last hour,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bumpboxes.com/
...,...,...,...,...,...,...,...
1813,$132.90 $99.00,Website adds free items to show discount,Misdirection,Visual Interference,Cart Page,No,https://www.planetofthevapes.com/products/plan...
1814,This offer is only VALID if you add to cart now!,Popup asking you to buy more,Misdirection,Visual Interference,Product Page,No,https://www.rockymountainoils.com/single-essen...
1815,,Deterministic draw. Always give you the prize ...,Misdirection,Visual Interference,Product Page,Yes,https://www.sammydress.com/
1816,,Shows you prices in the popup based on your cu...,Misdirection,Visual Interference,Product Page,No,https://www.shoedazzle.com/products/FEELIN-A-L...


In [15]:
cs_df = df.loc[df['Pattern Type'] == 'Confirmshaming']

cs_df

Unnamed: 0,Pattern String,Comment,Pattern Category,Pattern Type,Where in website?,Deceptive?,Website Page
313,No thanks! I don't like deals,Popup,Misdirection,Confirmshaming,Product Page,No,https://koala.com/products/koala-mattress
314,"No, I'll rather pay full price.",Popup,Misdirection,Confirmshaming,Product Page,No,https://biofinest.com/en/home/445-barley-grass...
315,I don't like discounts,Popup,Misdirection,Confirmshaming,Product Page,No,https://uk.scitecnutrition.com/products/vita-g...
316,"No, thanks. I don't like great deals.",Popup,Misdirection,Confirmshaming,Product Page,No,https://bonescoffee.com/products/strawberry-ch...
317,"No Thanks, I rather pay full price",Popup,Misdirection,Confirmshaming,Product Page,No,https://bumpboxes.com/
...,...,...,...,...,...,...,...
477,I don't want to save money,Wheel popup,Misdirection,Confirmshaming,Product Page,No,https://www.vitalityextracts.com/collections/d...
478,"No thanks, I don't like savings",Popup,Misdirection,Confirmshaming,Product Page,No,https://www.wwbw.com/Selmer-Paris-Series-II-Mo...
479,"NO THANKS, I'D RATHER PAY FULL PRICE",Popup,Misdirection,Confirmshaming,Product Page,No,https://www.yandy.com/Varsity-Vixen-Lingerie-C...
480,"No, I don't feel lucky",Side wheel pop-up,Misdirection,Confirmshaming,Product Page,No,https://www.zoolaa.com/collections/cups-and-fi...


In [16]:
# Work on an explicit copy so that the assignment does not trigger a SettingWithCopyWarning.
cs_df = cs_df.copy()
cs_df['classification'] = 'Dark'

cs_df


Unnamed: 0,Pattern String,Comment,Pattern Category,Pattern Type,Where in website?,Deceptive?,Website Page,classification
313,No thanks! I don't like deals,Popup,Misdirection,Confirmshaming,Product Page,No,https://koala.com/products/koala-mattress,Dark
314,"No, I'll rather pay full price.",Popup,Misdirection,Confirmshaming,Product Page,No,https://biofinest.com/en/home/445-barley-grass...,Dark
315,I don't like discounts,Popup,Misdirection,Confirmshaming,Product Page,No,https://uk.scitecnutrition.com/products/vita-g...,Dark
316,"No, thanks. I don't like great deals.",Popup,Misdirection,Confirmshaming,Product Page,No,https://bonescoffee.com/products/strawberry-ch...,Dark
317,"No Thanks, I rather pay full price",Popup,Misdirection,Confirmshaming,Product Page,No,https://bumpboxes.com/,Dark
...,...,...,...,...,...,...,...,...
477,I don't want to save money,Wheel popup,Misdirection,Confirmshaming,Product Page,No,https://www.vitalityextracts.com/collections/d...,Dark
478,"No thanks, I don't like savings",Popup,Misdirection,Confirmshaming,Product Page,No,https://www.wwbw.com/Selmer-Paris-Series-II-Mo...,Dark
479,"NO THANKS, I'D RATHER PAY FULL PRICE",Popup,Misdirection,Confirmshaming,Product Page,No,https://www.yandy.com/Varsity-Vixen-Lingerie-C...,Dark
480,"No, I don't feel lucky",Side wheel pop-up,Misdirection,Confirmshaming,Product Page,No,https://www.zoolaa.com/collections/cups-and-fi...,Dark


In [17]:
cs_df = cs_df[['Pattern String', 'classification']]

cs_df

Unnamed: 0,Pattern String,classification
313,No thanks! I don't like deals,Dark
314,"No, I'll rather pay full price.",Dark
315,I don't like discounts,Dark
316,"No, thanks. I don't like great deals.",Dark
317,"No Thanks, I rather pay full price",Dark
...,...,...
477,I don't want to save money,Dark
478,"No thanks, I don't like savings",Dark
479,"NO THANKS, I'D RATHER PAY FULL PRICE",Dark
480,"No, I don't feel lucky",Dark


In [18]:
# ---- Merge the two datasets

merged_data = pd.concat([data, cs_df])

merged_data

Unnamed: 0,Pattern String,classification
0,Ends in 07:42:09,Not_Dark
1,Ends in 07:37:10,Not_Dark
2,Ends in 02:27:10,Not_Dark
3,Ends in 04:17:10,Not_Dark
4,Ends in 01:57:10,Not_Dark
...,...,...
477,I don't want to save money,Dark
478,"No thanks, I don't like savings",Dark
479,"NO THANKS, I'D RATHER PAY FULL PRICE",Dark
480,"No, I don't feel lucky",Dark


---
## Data Preparation

In [19]:
# Split the dataset into training and test sets with an 80%/20% ratio (train/test).

string_train, string_test, dark_train, dark_test = train_test_split(
    merged_data['Pattern String'], merged_data["classification"], train_size = .8)
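
With few `Dark` instances, a purely random split can shift the class ratio between the two sets and changes on every run. The following is a minimal sketch, assuming a hypothetical toy frame, of a split that is stratified on the label and seeded for reproducibility. The `stratify` and `random_state` arguments are not used in the cell above; they are shown here as an option.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 majority rows and 10 minority rows.
toy = pd.DataFrame({
    "Pattern String": [f"string {i}" for i in range(100)],
    "classification": ["Not_Dark"] * 90 + ["Dark"] * 10,
})

# stratify keeps the 90/10 class ratio in both splits; random_state fixes the shuffle.
s_train, s_test, l_train, l_test = train_test_split(
    toy["Pattern String"], toy["classification"],
    train_size=0.8, stratify=toy["classification"], random_state=42)

train_dark = int((l_train == "Dark").sum())
test_dark = int((l_test == "Dark").sum())
print(train_dark, test_dark)  # 8 2
```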

---
`Encode the target values into integers` --- 'classification'

In [20]:
encoder = LabelEncoder()
encoder.fit(dark_train)
y_train = encoder.transform(dark_train)
y_test = encoder.transform(dark_test)

In [21]:
# Check the mapping of the encoding results (0 represents 'Dark', 1 represents 'Not_Dark')

integer_mapping = {label: encoding for encoding, label in enumerate(encoder.classes_)}
print(integer_mapping)

{'Dark': 0, 'Not_Dark': 1}
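
`LabelEncoder` assigns integers in the sorted order of the class names, which is why `Dark` maps to 0 here. A minimal sketch with toy labels, also showing how to map predictions back to names with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

# The order in which labels are seen does not matter; classes_ is sorted.
enc = LabelEncoder().fit(["Not_Dark", "Dark", "Not_Dark"])

codes = enc.transform(["Dark", "Not_Dark"])
labels = enc.inverse_transform([0, 1])

print(list(enc.classes_))  # ['Dark', 'Not_Dark']
print(list(codes))         # [0, 1]
print(list(labels))        # ['Dark', 'Not_Dark']
```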


In [22]:
# Check the frequency distribution of the training pattern classification with pattern classification names.

(unique, counts) = np.unique(dark_train, return_counts=True)
frequencies = np.asarray((unique, counts)).T

print(frequencies)

[['Dark' 135]
 ['Not_Dark' 6361]]


In [23]:
# Check the frequency distribution of the encoded training pattern classification with encoded integers.

(unique, counts) = np.unique(y_train, return_counts=True)
frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[   0  135]
 [   1 6361]]


In [24]:
# Check the frequency distribution of the encoded testing pattern classification with encoded integers.

(unique, counts) = np.unique(y_test, return_counts=True)
frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[   0   34]
 [   1 1591]]
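
These counts give a useful reference point: a model that always predicts the majority class is already correct on 1591 of the 1625 test instances. The sketch below reproduces that baseline with `DummyClassifier`, using only the label counts printed above. The feature matrix is a placeholder, since this classifier ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Label counts copied from the outputs above (0 = Dark, 1 = Not_Dark).
y_train_counts = np.array([0] * 135 + [1] * 6361)
y_test_counts = np.array([0] * 34 + [1] * 1591)

# DummyClassifier ignores the features, so a placeholder column is enough.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(y_train_counts), 1)), y_train_counts)
baseline_pred = baseline.predict(np.zeros((len(y_test_counts), 1)))

baseline_accuracy = accuracy_score(y_test_counts, baseline_pred)
print(round(baseline_accuracy, 4))  # 0.9791
```

Any accuracy close to 0.979 should therefore be read together with the per-class precision and the confusion matrix.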


---
`Encode the textual features into numeric vectors`

In [25]:
# Compute TF-IDF vectors of the pattern strings to encode them (the vocabulary is fitted on the training set only).

tv = TfidfVectorizer()
tv.fit(string_train)

x_train = tv.transform(string_train)
x_test = tv.transform(string_test)

x_train.shape, x_test.shape

((6496, 5614), (1625, 5614))
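
Both matrices have the same number of columns because the vocabulary is learned from the training strings only; words that appear only in the test set are ignored. A minimal sketch on toy strings (the test sentence is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["Only 2 left in stock", "No thanks, I don't like savings"]
test_texts = ["No thanks, I like unicorns"]  # "unicorns" never appears in training

vec = TfidfVectorizer()
vec.fit(train_texts)              # learn vocabulary and IDF weights from training data
x_tr = vec.transform(train_texts)
x_te = vec.transform(test_texts)  # reuse the training vocabulary

# The default tokenizer lowercases and drops single-character tokens ("2", "I", "t").
print(x_tr.shape, x_te.shape)         # (2, 9) (1, 9)
print("unicorns" in vec.vocabulary_)  # False
```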

In [26]:
# save the fitted TfidfVectorizer to disk

joblib.dump(tv, 'presence_TfidfVectorizer.joblib')

['presence_TfidfVectorizer.joblib']
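
The saved vectorizer has to be reloaded later so that new strings are encoded with the same vocabulary. A minimal round-trip sketch, assuming a toy vectorizer and a hypothetical file name inside a temporary directory:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer().fit(["Only 2 left in stock", "I don't like discounts"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectorizer.joblib")  # hypothetical file name
    joblib.dump(vec, path)
    restored = joblib.load(path)

# The restored object carries the same fitted vocabulary.
same_vocabulary = restored.vocabulary_ == vec.vocabulary_
print(same_vocabulary)  # True
```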

---
# Rough Idea about the effect of different classifiers
---

In [27]:
# Six models are tested:
# -- Logistic Regression
# -- Linear Support Vector Machine
# -- Random Forest
# -- Multinomial Naive Bayes
# -- Bernoulli Naive Bayes
# -- KNN

classifiers = [LogisticRegression(), LinearSVC(), RandomForestClassifier(), MultinomialNB(), BernoulliNB(), KNeighborsClassifier()]

In [28]:
# Calculate the accuracy, the precision for the 'Dark' class (label 0), and the confusion matrix of each classifier using default settings.

acc = []
pre = []
cm = []

for clf in classifiers:
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc.append(metrics.accuracy_score(y_test, y_pred))
    pre.append(metrics.precision_score(y_test, y_pred, pos_label=0))
    cm.append(metrics.confusion_matrix(y_test, y_pred))

  _warn_prf(average, modifier, msg_start, len(result))


In [29]:
# List the accuracy, precision, and confusion matrix of each classifier.

for i in range(len(classifiers)):
    print("{} accuracy: {:.3f}".format(classifiers[i],acc[i]))
    print("{} precision: {:.3f}".format(classifiers[i],pre[i]))
    print("Confusion Matrix: {}".format(cm[i]))

LogisticRegression() accuracy: 0.995
LogisticRegression() precision: 0.964
Confusion Matrix: [[  27    7]
 [   1 1590]]
LinearSVC() accuracy: 0.998
LinearSVC() precision: 0.970
Confusion Matrix: [[  32    2]
 [   1 1590]]
RandomForestClassifier() accuracy: 0.998
RandomForestClassifier() precision: 0.970
Confusion Matrix: [[  32    2]
 [   1 1590]]
MultinomialNB() accuracy: 0.993
MultinomialNB() precision: 0.960
Confusion Matrix: [[  24   10]
 [   1 1590]]
BernoulliNB() accuracy: 0.979
BernoulliNB() precision: 0.000
Confusion Matrix: [[   0   34]
 [   0 1591]]
KNeighborsClassifier() accuracy: 0.994
KNeighborsClassifier() precision: 0.931
Confusion Matrix: [[  27    7]
 [   2 1589]]
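
The reported precision can be checked by hand from a confusion matrix, and the same matrix also yields the recall for the `Dark` class, which the loop above does not print. The sketch below uses the LogisticRegression matrix from the output above; rows are true labels and columns are predicted labels, with index 0 standing for `Dark`.

```python
import numpy as np

# Confusion matrix of LogisticRegression copied from the output above.
cm_lr = np.array([[27, 7],
                  [1, 1590]])

tp = cm_lr[0, 0]  # Dark predicted as Dark
fn = cm_lr[0, 1]  # Dark predicted as Not_Dark
fp = cm_lr[1, 0]  # Not_Dark predicted as Dark

precision_dark = tp / (tp + fp)
recall_dark = tp / (tp + fn)
f1_dark = 2 * precision_dark * recall_dark / (precision_dark + recall_dark)

print(round(precision_dark, 3))  # 0.964
print(round(recall_dark, 3))     # 0.794
print(round(f1_dark, 3))         # 0.871
```

So although precision is 0.964, 7 of the 34 `Dark` test strings are missed by this model.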


---
# Bernoulli Naive Bayes Classifier


---
### `Use default setting of classifier hyperparameters`

In [30]:
clf_bnb = BernoulliNB().fit(x_train, y_train)

y_pred = clf_bnb.predict(x_test)

In [31]:
clf_bnb.get_params()

{'alpha': 1.0, 'binarize': 0.0, 'class_prior': None, 'fit_prior': True}

---
`use the default setting of hyperparameters of the Bernoulli Naive Bayes classifier`

In [32]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test,y_pred, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))

Accuracy: 0.9790769230769231
Precision: 0.0
Confusion Matrix:
 [[   0   34]
 [   0 1591]]


  _warn_prf(average, modifier, msg_start, len(result))


In [33]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[   1, 1625]])
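
Every test instance is predicted as class 1, so nothing is predicted as `Dark` and the precision for `Dark` is 0/0. scikit-learn reports 0.0 in that case and emits a warning. The sketch below rebuilds that situation from the label counts and uses the `zero_division` argument of `precision_score` to make the fallback value explicit:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Test labels rebuilt from the counts above (0 = Dark, 1 = Not_Dark).
y_true = np.array([0] * 34 + [1] * 1591)
y_all_not_dark = np.ones_like(y_true)  # a model that never predicts Dark

# zero_division=0 returns 0.0 without a warning when no Dark prediction exists.
precision_dark = precision_score(y_true, y_all_not_dark, pos_label=0, zero_division=0)
recall_dark = recall_score(y_true, y_all_not_dark, pos_label=0)

print(precision_dark, recall_dark)  # 0.0 0.0
```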

---
### `Parameter Tuning of BernoulliNB classifier`
`Define the combination of parameters to be considered`

In [34]:
param_grid = {'alpha':[0,1], 
              'fit_prior':[True, False]}

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [35]:
gs = GridSearchCV(clf_bnb,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)
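
Without a `scoring` argument, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers. With few `Dark` instances, ranking by a metric for the `Dark` class may be more informative. The following is a sketch of that alternative, not the configuration used in this notebook; the toy corpus, the F1 scorer, and the alpha values are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus: 10 "Dark" strings (label 0) and 40 "Not_Dark" strings (label 1).
dark = [f"No thanks, I don't like deal {i}" for i in range(10)]
not_dark = [f"Product {i} ships in two days" for i in range(40)]
texts = dark + not_dark
labels = [0] * 10 + [1] * 40

features = TfidfVectorizer().fit_transform(texts)

# Score each candidate by F1 on the Dark class instead of overall accuracy.
dark_f1 = make_scorer(f1_score, pos_label=0)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    BernoulliNB(),
    {"alpha": [0.1, 1.0], "fit_prior": [True, False]},
    scoring=dark_f1,
    cv=folds,
)
search.fit(features, labels)

print(search.best_params_)
print(search.best_score_)
```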

In [36]:
best_bnb = gs.fit(x_train,y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:    2.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    2.1s finished


In [37]:
scores_df = pd.DataFrame(best_bnb.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score', 'mean_test_score', 'param_alpha', 'param_fit_prior']]

Unnamed: 0,rank_test_score,mean_test_score,param_alpha,param_fit_prior
0,1,0.98707,0,True
1,2,0.978756,1,True
2,3,0.978602,1,False
3,4,0.97306,0,False


In [38]:
best_bnb.best_params_

{'alpha': 0, 'fit_prior': True}

In [39]:
y_pred_best = best_bnb.predict(x_test)

(unique, counts) = np.unique(y_pred_best, return_counts=True)
frequencies = np.asarray((unique, counts)).T
print(frequencies)

[[   0   30]
 [   1 1595]]


In [40]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_best))
print("Precision:", metrics.precision_score(y_test,y_pred_best, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred_best))

Accuracy: 0.9963076923076923
Precision: 0.9666666666666667
Confusion Matrix:
 [[  29    5]
 [   1 1590]]


---
`Save the best BernoulliNB model for future use`

In [41]:
# save the model to local disk

joblib.dump(best_bnb, 'bnb_presence_classifier.joblib')

['bnb_presence_classifier.joblib']

---
# Random Forest Classifier


---
### `Use default setting of classifier hyperparameters`

In [73]:
clf_rf = RandomForestClassifier().fit(x_train, y_train)

y_pred = clf_rf.predict(x_test)

In [74]:
clf_rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

---
`use the default setting of hyperparameters of the Random Forest classifier.`

In [75]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test,y_pred, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))

Accuracy: 0.9981538461538462
Precision: 0.9696969696969697
Confusion Matrix:
 [[  32    2]
 [   1 1590]]


In [76]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[   0,   33],
       [   1, 1592]])

---
### `Parameter Tuning of Random Forest classifier`
`Define the combination of parameters to be considered`

In [77]:
param_grid = {'bootstrap':[True,False], 
              'criterion':['gini','entropy'],
              'max_depth':[10,20,30,40,50, None],
              'min_samples_leaf':[1,2,4],
              'min_samples_split':[2,5,10],
              'n_estimators':[100,200,300]}
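
The size of a grid is the product of the number of values per parameter, and each candidate is fitted once per fold. A short sketch using `ParameterGrid` to count the candidates for the grid above before launching the search:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'bootstrap': [True, False],
              'criterion': ['gini', 'entropy'],
              'max_depth': [10, 20, 30, 40, 50, None],
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10],
              'n_estimators': [100, 200, 300]}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 6 * 3 * 3 * 3
n_fits = n_candidates * 5                      # 5-fold cross-validation

print(n_candidates, n_fits)  # 648 3240
```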

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [78]:
gs = GridSearchCV(clf_rf,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [79]:
best_rf = gs.fit(x_train,y_train)

Fitting 5 folds for each of 648 candidates, totalling 3240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   32.7s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed:  7.7min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed: 12.1min
[Parallel(n_jobs=-1)]: Done 2426 tasks      | elapsed: 20.1min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 27.7min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 28.9min finished


In [80]:
scores_df = pd.DataFrame(best_rf.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score', 'mean_test_score', 'param_bootstrap', 'param_criterion','param_max_depth','param_min_samples_leaf','param_min_samples_split','param_n_estimators']]

Unnamed: 0,rank_test_score,mean_test_score,param_bootstrap,param_criterion,param_max_depth,param_min_samples_leaf,param_min_samples_split,param_n_estimators
0,1,0.999384,False,gini,50,1,10,200
1,1,0.999384,False,entropy,,2,2,100
2,3,0.999230,False,entropy,,2,5,100
3,4,0.999230,True,gini,,1,2,100
4,4,0.999230,False,gini,50,1,2,300
...,...,...,...,...,...,...,...,...
643,644,0.979526,True,entropy,10,4,10,200
644,645,0.979372,False,entropy,10,4,10,300
645,645,0.979372,False,gini,10,4,10,100
646,647,0.979218,True,gini,10,4,5,300


In [81]:
best_rf.best_params_

{'bootstrap': False,
 'criterion': 'gini',
 'max_depth': 50,
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'n_estimators': 200}

In [82]:
y_pred_best = best_rf.predict(x_test)

(unique, counts) = np.unique(y_pred_best, return_counts=True)
frequencies = np.asarray((unique, counts)).T
print(frequencies)

[[   0   32]
 [   1 1593]]


In [83]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_best))
print("Precision:", metrics.precision_score(y_test,y_pred_best, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred_best))

Accuracy: 0.9987692307692307
Precision: 1.0
Confusion Matrix:
 [[  32    2]
 [   0 1591]]


---
`Save the best Random Forest model for future use`

In [84]:
# save the model to local disk

joblib.dump(best_rf, 'rf_presence_classifier.joblib')

['rf_presence_classifier.joblib']

---
# SVM Classifier


---
### `Use default setting of classifier hyperparameters`

In [49]:
clf_svm = LinearSVC().fit(x_train,y_train)

y_pred = clf_svm.predict(x_test)

In [50]:
clf_svm.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': True,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'loss': 'squared_hinge',
 'max_iter': 1000,
 'multi_class': 'ovr',
 'penalty': 'l2',
 'random_state': None,
 'tol': 0.0001,
 'verbose': 0}

---
`use the default setting of hyperparameters of the SVM classifier.`

In [51]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test,y_pred, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))

Accuracy: 0.9981538461538462
Precision: 0.9696969696969697
Confusion Matrix:
 [[  32    2]
 [   1 1590]]


In [52]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[   0,   33],
       [   1, 1592]])

---
### `Parameter Tuning of SVM classifier`
`Define the combination of parameters to be considered`

In [53]:
param_grid = {'C':[0.1,1,10,100],
              'penalty':['l1','l2']}

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [54]:
gs = GridSearchCV(clf_svm,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [55]:
best_svm = gs.fit(x_train,y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    1.8s finished


In [56]:
scores_df = pd.DataFrame(best_svm.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score', 'mean_test_score', 'param_penalty', 'param_C']]

Unnamed: 0,rank_test_score,mean_test_score,param_penalty,param_C
0,1,0.998307,l2,1.0
1,2,0.997999,l2,0.1
2,3,0.997999,l2,100.0
3,4,0.997845,l2,10.0
4,5,,l1,0.1
5,6,,l1,1.0
6,7,,l1,10.0
7,8,,l1,100.0
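
The empty scores for `penalty='l1'` are consistent with those fits failing: `GridSearchCV` records a failed fit as NaN by default, and `LinearSVC` does not support the L1 penalty with the squared hinge loss when `dual=True`, which is the setting shown in `get_params()` above. The sketch below, on random toy data, shows that the combination works once `dual=False` is set. It is an illustration of the constraint, not a rerun of the search.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: 40 samples, 5 features, label depends on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(40, 5)
y = (X[:, 0] > 0.5).astype(int)

# penalty='l1' with the default squared hinge loss needs the primal problem.
l1_svm = LinearSVC(penalty="l1", dual=False, C=1.0).fit(X, y)

# The same penalty with dual=True is rejected at fit time.
try:
    LinearSVC(penalty="l1", dual=True).fit(X, y)
    dual_l1_failed = False
except ValueError:
    dual_l1_failed = True

print(l1_svm.coef_.shape)  # (1, 5)
print(dual_l1_failed)      # True
```

Adding `'dual': [False]` to the grid would be one way to get scores for the L1 candidates.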


In [57]:
best_svm.best_params_

{'C': 1, 'penalty': 'l2'}

In [58]:
y_pred_best = best_svm.predict(x_test)

(unique, counts) = np.unique(y_pred_best, return_counts=True)
frequencies = np.asarray((unique, counts)).T
print(frequencies)

[[   0   33]
 [   1 1592]]


In [59]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_best))
print("Precision:", metrics.precision_score(y_test,y_pred_best, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred_best))

Accuracy: 0.9981538461538462
Precision: 0.9696969696969697
Confusion Matrix:
 [[  32    2]
 [   1 1590]]


---
`Save the best SVM model for future use`

In [60]:
# save the model to local disk

joblib.dump(best_svm, 'svm_presence_classifier.joblib')

['svm_presence_classifier.joblib']

---
# Logistic Regression Classifier


---
### `Use default setting of classifier hyperparameters`

In [61]:
clf_lr = LogisticRegression().fit(x_train, y_train)

y_pred = clf_lr.predict(x_test)

In [62]:
clf_lr.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

---
`use the default setting of hyperparameters of the Logistic Regression classifier.`

In [63]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test,y_pred, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))

Accuracy: 0.9950769230769231
Precision: 0.9642857142857143
Confusion Matrix:
 [[  27    7]
 [   1 1590]]


In [64]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[   0,   28],
       [   1, 1597]])

---
### `Parameter Tuning of Logistic Regression classifier`
`Define the combination of parameters to be considered`

In [65]:
param_grid = {'penalty':['l1','l2'], 
              'solver':['lbfgs','newton-cg','sag']}

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [66]:
gs = GridSearchCV(clf_lr,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [67]:
best_lr = gs.fit(x_train,y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    0.1s finished


In [68]:
scores_df = pd.DataFrame(best_lr.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score', 'mean_test_score', 'param_penalty', 'param_solver']]

Unnamed: 0,rank_test_score,mean_test_score,param_penalty,param_solver
0,1,0.994766,l2,lbfgs
1,1,0.994766,l2,newton-cg
2,1,0.994766,l2,sag
3,4,,l1,lbfgs
4,5,,l1,newton-cg
5,6,,l1,sag
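
The `l1` rows have no score because the `lbfgs`, `newton-cg`, and `sag` solvers do not support the L1 penalty, so those fits fail and are recorded as NaN. In scikit-learn, L1-penalized logistic regression is available with the `liblinear` and `saga` solvers. The sketch below shows, under that assumption and on random toy data, a grid written as a list of dictionaries so that each penalty is only paired with compatible solvers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

# Pair each penalty only with solvers that support it.
compatible_grid = [
    {"penalty": ["l2"], "solver": ["lbfgs", "newton-cg", "sag"]},
    {"penalty": ["l1"], "solver": ["liblinear", "saga"]},
]
n_candidates = len(ParameterGrid(compatible_grid))
print(n_candidates)  # 5

# Toy data: 40 samples, 5 features, label depends on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(40, 5)
y = (X[:, 0] > 0.5).astype(int)

# An L1-penalized model fits when a compatible solver is chosen.
l1_lr = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print(l1_lr.coef_.shape)  # (1, 5)
```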


In [69]:
best_lr.best_params_

{'penalty': 'l2', 'solver': 'lbfgs'}

In [70]:
y_pred_best = best_lr.predict(x_test)

(unique, counts) = np.unique(y_pred_best, return_counts=True)
frequencies = np.asarray((unique, counts)).T
print(frequencies)

[[   0   28]
 [   1 1597]]


In [71]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_best))
print("Precision:", metrics.precision_score(y_test,y_pred_best, pos_label=0))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred_best))

Accuracy: 0.9950769230769231
Precision: 0.9642857142857143
Confusion Matrix:
 [[  27    7]
 [   1 1590]]


---
`Save the best Logistic Regression model for future use`

In [72]:
# save the model to local disk

joblib.dump(best_lr, 'lr_presence_classifier.joblib')

['lr_presence_classifier.joblib']
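
To classify a new pattern string, the saved vectorizer and a saved classifier have to be applied in the same order as during training. One alternative, shown here as a sketch and not as what this notebook does, is to bundle both steps in a scikit-learn `Pipeline` so that raw strings can be passed directly. The six training strings are a toy sample and the query string is only an example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Toy training sample; far too small for a real model.
texts = [
    "No thanks, I don't like savings",
    "No, I'd rather pay full price",
    "I don't want to save money",
    "Free shipping on all orders",
    "Add to cart",
    "Customer reviews",
]
labels = ["Dark", "Dark", "Dark", "Not_Dark", "Not_Dark", "Not_Dark"]

encoder = LabelEncoder().fit(labels)

# Vectorizer and classifier are fitted and applied as one object.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, encoder.transform(labels))

# Predict on a raw string and map the integer back to its label name.
pred = encoder.inverse_transform(model.predict(["No thanks, I don't like deals"]))
print(pred[0])
```

A single pipeline object can be saved with `joblib.dump` in the same way as the individual models above, which avoids a mismatch between vectorizer and classifier files.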