# Problem Formulation

Given a pattern string as input, we want to know whether it is a dark pattern. We build a labeled dataset from all the instances in the Princeton dataset, which are all dark patterns, and the instances in the 'normie.csv' file, which are labeled as NOT dark patterns. The merged dataset therefore contains pattern strings both with and without dark patterns. The two classes are not perfectly balanced: the class distribution computed below shows 1059 instances labeled as dark and 1879 labeled as not dark.

We then use this labeled dataset to build and train supervised machine learning models, and select the most suitable ones for our project.

----


In [1]:
import pandas as pd 
import numpy as np

from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV

# CountVectorizer tokenizes a collection of text documents, builds a vocabulary of known words, and encodes new documents using that vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
# TfidfTransformer takes the word counts produced by CountVectorizer, computes the Inverse Document Frequency (IDF) values, and then computes the tf-idf scores.
from sklearn.feature_extraction.text import TfidfTransformer

# Bernoulli Naive Bayes, like MultinomialNB, is suitable for discrete data. The difference is that MultinomialNB works with occurrence counts, while BernoulliNB is designed for binary/boolean features. For text classification, word occurrence vectors (rather than word count vectors) may therefore be more suitable for training and using this classifier.
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

# Evaluation metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score

# joblib provides lightweight pipelining in Python, including utilities for efficiently saving and loading Python objects that use NumPy data structures.
import joblib

import matplotlib.pyplot as plt
# import seaborn as sns

## Data Exploration

---
Import the merged dataset and explore it.

In [2]:
data = pd.read_csv('final_presence.csv')

In [3]:
data.head(5)

| | Pattern String | classification |
|---|---|---|
| 0 | FREE SHIPPING ON ORDERS OVER $100! | 1 |
| 1 | Starting at $25/mo with Affirm. Learn more | 1 |
| 2 | Please Note: We highly recommend using this up... | 1 |
| 3 | Share Facebook | 1 |
| 4 | Share on Twitter | 1 |


---
`check the dataset information`

The dataset contains 2938 pattern strings, and none of them is null.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Pattern String  2938 non-null   object
 1   classification  2938 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 46.0+ KB


In [5]:
# check the distribution of the target value --- classification.

data['classification'].value_counts()

1    1879
0    1059
Name: classification, dtype: int64
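
The counts above are easier to interpret as proportions, since any accuracy reported later should be read against the share of the larger class. A minimal sketch that uses only the two counts printed above (the variable names are illustrative and do not come from the notebook):

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 1879, 0: 1059}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(class_counts.values()) / total

for label, share in proportions.items():
    print(f"class {label}: {share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Roughly 64% of the instances fall in the larger class, so a classifier has to exceed about 0.64 accuracy to do better than a constant guess.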

In [6]:
# Change the label into strings

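# Assign the result back instead of calling replace(..., inplace=True) on the selected column, which recent pandas versions warn about.
data['classification'] = data['classification'].replace({0: 'Dark', 1: 'Not_Dark'})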

data.head(5),data['classification'].value_counts()

(                                      Pattern String classification
 0                 FREE SHIPPING ON ORDERS OVER $100!       Not_Dark
 1         Starting at $25/mo with Affirm. Learn more       Not_Dark
 2  Please Note: We highly recommend using this up...       Not_Dark
 3                                     Share Facebook       Not_Dark
 4                                   Share on Twitter       Not_Dark,
 Not_Dark    1879
 Dark        1059
 Name: classification, dtype: int64)

In [7]:
# Remove the instances with a NULL 'Pattern String' or 'classification'; these two columns are the input and the target of our model.

data = data[pd.notnull(data["Pattern String"])]
data = data[pd.notnull(data["classification"])]

data.head(5)

| | Pattern String | classification |
|---|---|---|
| 0 | FREE SHIPPING ON ORDERS OVER $100! | Not_Dark |
| 1 | Starting at $25/mo with Affirm. Learn more | Not_Dark |
| 2 | Please Note: We highly recommend using this up... | Not_Dark |
| 3 | Share Facebook | Not_Dark |
| 4 | Share on Twitter | Not_Dark |


In [8]:
# check the information of the dataset.

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2938 entries, 0 to 2937
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Pattern String  2938 non-null   object
 1   classification  2938 non-null   object
dtypes: object(2)
memory usage: 68.9+ KB


In [9]:
# check the final distribution of the classification after removing the rows with NULL values.

data['classification'].value_counts()

Not_Dark    1879
Dark        1059
Name: classification, dtype: int64

---
## Data Preparation

In [10]:
Y = data['classification']
X = data['Pattern String']

---
`Encode the target values into integers` --- 'classification'

In [11]:
encoder = LabelEncoder()
encoder.fit(Y)
y = encoder.transform(Y)
y.shape

(2938,)

In [12]:
# check the mapping of the encoding results (0 represents 'Dark', 1 represents 'Not_Dark')

list(encoder.classes_)

['Dark', 'Not_Dark']
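
To make the mapping concrete, here is a small sketch of how `LabelEncoder` assigns integers: it sorts the distinct labels and numbers them from 0, and `inverse_transform` turns predictions back into names. The four labels below are toy values chosen for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, not taken from the dataset.
labels = ["Not_Dark", "Dark", "Not_Dark", "Dark"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted label names; the position is the integer code.
print(list(encoder.classes_))  # ['Dark', 'Not_Dark']
print(list(encoded))           # [1, 0, 1, 0]

# Map integer predictions back to readable names.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```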

In [13]:
# Check the frequency distribution of the training pattern classification with pattern classification names.

(unique, counts) = np.unique(Y, return_counts=True)
frequencies = np.asarray((unique, counts)).T

print(frequencies)

[['Dark' 1059]
 ['Not_Dark' 1879]]


In [14]:
# Check the frequency distribution of the encoded training pattern classification with encoded integers.

(unique, counts) = np.unique(y, return_counts=True)
frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[   0 1059]
 [   1 1879]]


---
`Encode the textual features into numeric vectors`

In [15]:
# First get the word count vector of the pattern string to encode the pattern string.

cv = CountVectorizer()
string_train_counts = cv.fit_transform(X)

# Then use the tf-idf score to transform the encoded word count pattern string vectors.

tfidf_tf = TfidfTransformer()
x = tfidf_tf.fit_transform(string_train_counts)
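
For intuition about what the two steps compute, the sketch below reproduces the weighting by hand in plain Python. It assumes scikit-learn's documented defaults for `TfidfTransformer` (smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector). The corpus is made up, and tokens are split on whitespace, which is simpler than `CountVectorizer`'s default tokenizer.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "only two left in stock",
    "share on twitter",
    "only today free shipping",
]
tokenized = [doc.lower().split() for doc in docs]


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


vectors = tfidf_vectors(tokenized)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

Within the first document, "only" appears in two of the three documents and therefore receives a lower weight than "stock", which appears in one.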

In [16]:
# save the CountVectorizer to disk

joblib.dump(cv, 'presence_CountVectorizer.joblib')

['presence_CountVectorizer.joblib']
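
Note that new strings have to pass through both the fitted `CountVectorizer` and the fitted `TfidfTransformer` before a classifier can score them. One way to keep the pieces together is to bundle them in a scikit-learn `Pipeline` and persist that single object. The sketch below is an alternative to the notebook's approach, not what the notebook does; the texts, labels, and the file name `presence_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set; the labels are assigned for illustration only.
texts = [
    "Only 2 left in stock",
    "Hurry, offer ends in 5 minutes",
    "Share on Twitter",
    "Free shipping on orders over $100",
]
labels = ["Dark", "Dark", "Not_Dark", "Not_Dark"]

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", BernoulliNB()),
])
pipeline.fit(texts, labels)

# Save and reload the whole pipeline as one file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "presence_pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored object accepts raw strings directly.
print(restored.predict(["Only 3 left in stock"]))
```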

---
# Rough Comparison of Different Classifiers
---

In [17]:
# Seven models are tested:
# -- Logistic Regression
# -- Linear Support Vector Machine (LinearSVC)
# -- Support Vector Machine with the default kernel (SVC)
# -- Random Forest
# -- Multinomial Naive Bayes
# -- Bernoulli Naive Bayes
# -- KNN

classifiers = [LogisticRegression(), LinearSVC(), SVC(), RandomForestClassifier(), MultinomialNB(), BernoulliNB(), KNeighborsClassifier()]

In [18]:
# Calculate the accuracies of different classifiers using default settings.

acc = []
cm = []

for clf in classifiers:
    y_pred = cross_val_predict(clf, x, y, cv=5, n_jobs = -1)
    acc.append(metrics.accuracy_score(y, y_pred))
    cm.append(metrics.confusion_matrix(y, y_pred))

In [19]:
# List the accuracies of different classifiers.

for i in range(len(classifiers)):
    print(f"{classifiers[i]} accuracy: {acc[i]}")
    # print(f"Confusion Matrix: {cm[i]}")

LogisticRegression() accuracy: 0.8628318584070797
LinearSVC() accuracy: 0.9118447923757659
SVC() accuracy: 0.8975493533015657
RandomForestClassifier() accuracy: 0.9370319945541185
MultinomialNB() accuracy: 0.9016337644656228
BernoulliNB() accuracy: 0.9125255275697753
KNeighborsClassifier() accuracy: 0.7985023825731791
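
One caveat about the numbers above: the vectorizer and the tf-idf transformer were fitted on the full dataset before cross-validation, so each validation fold has already contributed to the vocabulary and IDF values. A possible variant, sketched below on made-up data, puts the text encoding inside a pipeline so that it is refitted on the training part of every fold. The texts and labels are hypothetical, and this is an alternative evaluation setup rather than the one used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up examples; label 0 stands for 'Dark' and 1 for 'Not_Dark'.
texts = [
    "Only 2 left in stock", "Hurry, only 3 left", "Offer ends in 5 minutes",
    "Sale ends soon, hurry", "Only a few left in stock", "Hurry, offer ends today",
    "Share on Twitter", "Share on Facebook", "Free shipping on orders",
    "Learn more about shipping", "Share this product", "Free returns on orders",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

pipeline = make_pipeline(CountVectorizer(), TfidfTransformer(), BernoulliNB())

# The whole pipeline, including the vocabulary, is refitted for every fold.
scores = cross_val_score(pipeline, texts, labels, cv=3)
print(scores, scores.mean())
```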


---
# Bernoulli Naive Bayes Classifier


---
### `Use default setting of classifier hyperparameters`

In [30]:
clf_bnb = BernoulliNB()

In [31]:
y_pred = cross_val_predict(clf_bnb, x, y, cv=5, n_jobs = -1)

In [32]:
clf_bnb.get_params()

{'alpha': 1.0, 'binarize': 0.0, 'class_prior': None, 'fit_prior': True}

---
`With the default hyperparameters, the Bernoulli Naive Bayes classifier reaches an accuracy of about 0.913.`

In [33]:
print("Accuracy:", metrics.accuracy_score(y, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y, y_pred))

Accuracy: 0.9125255275697753
Confusion Matrix:
 [[ 900  159]
 [  98 1781]]
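
Accuracy alone hides how the errors are split between the two classes. The sketch below derives precision, recall, and F1 for the 'Dark' class directly from the confusion matrix printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes.
# Class 0 is 'Dark' and class 1 is 'Not_Dark' (see the LabelEncoder mapping).
cm = [[900, 159],
      [98, 1781]]

true_dark, missed_dark = cm[0]       # true Dark predicted as Dark / as Not_Dark
false_dark, true_not_dark = cm[1]    # true Not_Dark predicted as Dark / as Not_Dark

total = sum(sum(row) for row in cm)
accuracy = (true_dark + true_not_dark) / total
precision_dark = true_dark / (true_dark + false_dark)
recall_dark = true_dark / (true_dark + missed_dark)
f1_dark = 2 * precision_dark * recall_dark / (precision_dark + recall_dark)

print(f"accuracy:       {accuracy:.4f}")
print(f"precision Dark: {precision_dark:.4f}")
print(f"recall Dark:    {recall_dark:.4f}")
print(f"F1 Dark:        {f1_dark:.4f}")
```

Recall on the 'Dark' class (about 0.85) is lower than the overall accuracy (about 0.91): 159 of the 1059 dark patterns are classified as 'Not_Dark'.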


In [34]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[   0,  998],
       [   1, 1940]])

---
### `Parameter Tuning of BernoulliNB classifier`
`Define the combination of parameters to be considered`

In [42]:
param_grid = {'alpha':[0,1], 
              'fit_prior':[True, False]}

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [43]:
gs = GridSearchCV(clf_bnb,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [44]:
best_bnb = gs.fit(x,y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    0.0s finished


In [45]:
scores_df = pd.DataFrame(best_bnb.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df [['rank_test_score', 'mean_test_score', 'param_alpha', 'param_fit_prior']]

| | rank_test_score | mean_test_score | param_alpha | param_fit_prior |
|---|---|---|---|---|
| 0 | 1 | 0.916985 | 1 | False |
| 1 | 2 | 0.912563 | 1 | True |
| 2 | 3 | 0.867981 | 0 | True |
| 3 | 4 | 0.851644 | 0 | False |


In [46]:
best_bnb.best_params_

{'alpha': 1, 'fit_prior': False}

---
`Save the best BernoulliNB model for future use`

In [47]:
# save the model to local disk

joblib.dump(best_bnb, 'bnb_presence_classifier.joblib')

['bnb_presence_classifier.joblib']
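
The object saved here is the fitted `GridSearchCV` itself. With the default `refit=True`, it refits the best parameter combination on all the data and exposes that model as `best_estimator_`, and calling `predict` on the search object delegates to it. A small sketch on random binary toy features (the data and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Random binary toy features; the label depends on the first two columns.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(60, 8))
y_demo = X_demo[:, 0] | X_demo[:, 1]

search = GridSearchCV(
    BernoulliNB(),
    {"alpha": [0.5, 1.0], "fit_prior": [True, False]},
    cv=5,
)
search.fit(X_demo, y_demo)

# The refitted model with the best parameters.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Either object can be persisted with `joblib.dump`; saving only `best_estimator_` gives a smaller file when the cross-validation results are not needed later.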

---
# Random Forest Classifier


---
### `Use default setting of classifier hyperparameters`

In [48]:
clf_rf = RandomForestClassifier()

In [49]:
y_pred = cross_val_predict(clf_rf, x, y, cv=5, n_jobs = -1)

In [50]:
clf_rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

---
`With the default hyperparameters, the Random Forest classifier reaches an accuracy of about 0.935.`

In [51]:
print("Accuracy:", metrics.accuracy_score(y, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y, y_pred))

Accuracy: 0.9346494213750851
Confusion Matrix:
 [[ 906  153]
 [  39 1840]]


In [52]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[   0,  945],
       [   1, 1993]])
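
The accuracy in this section (0.9346) differs slightly from the Random Forest value in the earlier comparison (0.9370). This is consistent with `random_state` being `None` in the parameters listed above, which makes each fit draw different random trees. A minimal sketch on synthetic data, showing that fixing `random_state` makes repeated runs identical (the dataset and the seed value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic data standing in for the tf-idf features.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)


def run(seed):
    """Cross-validated predictions of a forest built with a fixed seed."""
    clf = RandomForestClassifier(n_estimators=25, random_state=seed)
    return cross_val_predict(clf, X_demo, y_demo, cv=5)


first = run(42)
second = run(42)
print("identical predictions:", np.array_equal(first, second))
```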

---
### `Parameter Tuning of Random Forest classifier`
`Define the combination of parameters to be considered`

In [53]:
param_grid = {'bootstrap':[True,False], 
              'criterion':['gini','entropy'],
              'max_depth':[10,20,30,40,50,60,70,80,90,100, None],
              'min_samples_leaf':[1,2,4],
              'min_samples_split':[2,5,10],
              'n_estimators':[100,200,300,400,500,600]}
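
Before launching an exhaustive search it helps to count how many models it will train: the number of candidates is the product of the number of values per parameter, and each candidate is fitted once per fold. A quick check for the grid above:

```python
import math

param_grid = {
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [100, 200, 300, 400, 500, 600],
}
n_folds = 5

# Product of the number of options for every parameter.
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 2376 11880
```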

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [54]:
gs = GridSearchCV(clf_rf,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [55]:
best_rf = gs.fit(x,y)

Fitting 5 folds for each of 2376 candidates, totalling 11880 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   31.0s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed: 12.4min
[Parallel(n_jobs=-1)]: Done 2426 tasks      | elapsed: 18.6min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 25.2min
[Parallel(n_jobs=-1)]: Done 4026 tasks      | elapsed: 30.5min
[Parallel(n_jobs=-1)]: Done 4976 tasks      | elapsed: 39.0min
[Parallel(n_jobs=-1)]: Done 6026 tasks      | elapsed: 48.9min
[Parallel(n_jobs=-1)]: Done 7176 tasks      | elapsed: 57.4min
[Parallel(n_jobs=-1)]: Done 8426 tasks      | elapsed: 73.0min
[Parallel(n_jobs=-1)]: Done 9776 tasks      | elapsed: 84.9min
[Parallel(n_jobs=-1)]: Done 11226 tasks      
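
The log above shows the exhaustive search running for well over an hour. If that cost is a concern, one commonly used alternative is `RandomizedSearchCV`, which evaluates only a fixed number of sampled combinations. The sketch below is not what this notebook ran; it uses synthetic data and a deliberately small parameter space so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the tf-idf features.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Reduced, illustrative parameter space (not the grid used in the notebook).
param_distributions = {
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 20, 30, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [10, 20, 30],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # evaluate 5 sampled combinations instead of all of them
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The trade-off is that a sampled search may miss the best combination in the grid, so the budget `n_iter` has to be chosen with that in mind.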

In [56]:
scores_df = pd.DataFrame(best_rf.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df [['rank_test_score', 'mean_test_score', 'param_bootstrap', 'param_criterion','param_max_depth','param_min_samples_leaf','param_min_samples_split','param_n_estimators']]

| | rank_test_score | mean_test_score | param_bootstrap | param_criterion | param_max_depth | param_min_samples_leaf | param_min_samples_split | param_n_estimators |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.947954 | False | entropy | 80 | 2 | 5 | 300 |
| 1 | 2 | 0.946933 | False | gini | 90 | 2 | 2 | 200 |
| 2 | 3 | 0.946593 | False | entropy | 90 | 2 | 5 | 500 |
| 3 | 3 | 0.946593 | False | gini | 70 | 2 | 2 | 600 |
| 4 | 5 | 0.946593 | True | gini | 100 | 2 | 5 | 100 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2371 | 2372 | 0.756340 | True | entropy | 10 | 4 | 5 | 100 |
| 2372 | 2373 | 0.754301 | True | entropy | 10 | 2 | 10 | 300 |
| 2373 | 2374 | 0.753277 | False | entropy | 10 | 2 | 2 | 200 |
| 2374 | 2375 | 0.753268 | True | gini | 10 | 4 | 10 | 100 |


In [57]:
best_rf.best_params_

{'bootstrap': False,
 'criterion': 'entropy',
 'max_depth': 80,
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 300}

---
`Save the best Random Forest model for future use`

In [58]:
# save the model to local disk

joblib.dump(best_rf, 'rf_presence_classifier.joblib')

['rf_presence_classifier.joblib']

---
# KNN Classifier


---
### `Use default setting of classifier hyperparameters`

In [59]:
clf_knn = KNeighborsClassifier()

In [60]:
y_pred = cross_val_predict(clf_knn, x, y, cv=5, n_jobs = -1)

In [61]:
clf_knn.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

---
`With the default hyperparameters, the KNN classifier reaches an accuracy of about 0.799.`

In [62]:
print("Accuracy:", metrics.accuracy_score(y, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y, y_pred))

Accuracy: 0.7985023825731791
Confusion Matrix:
 [[ 483  576]
 [  16 1863]]


In [63]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[   0,  499],
       [   1, 2439]])

---
### `Parameter Tuning of KNN classifier`
`Define the combination of parameters to be considered`

In [64]:
param_grid = {'n_neighbors':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
              'metric':['manhattan','euclidean','minkowski'],
              'p':[1,2,3,4,5],
              'weights':['uniform','distance']}
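
The `metric` and `p` entries of this grid overlap: `p` is the power parameter of the Minkowski distance, and the Manhattan and Euclidean distances are the special cases `p = 1` and `p = 2`. Several candidates therefore describe the same distance, which is one reason identical scores can appear in the results. A plain-Python sketch of the relationship (the helper `minkowski` is written here for illustration and is not a scikit-learn function):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a = (0.0, 0.0)
b = (3.0, 4.0)

manhattan = minkowski(a, b, 1)   # |3| + |4|
euclidean = minkowski(a, b, 2)   # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```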

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [65]:
gs = GridSearchCV(clf_knn,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [66]:
best_knn = gs.fit(x,y)

Fitting 5 folds for each of 450 candidates, totalling 2250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 328 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done 828 tasks      | elapsed:    9.2s
[Parallel(n_jobs=-1)]: Done 1528 tasks      | elapsed:   14.1s
[Parallel(n_jobs=-1)]: Done 2250 out of 2250 | elapsed:   16.9s finished


In [67]:
scores_df = pd.DataFrame(best_knn.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df [['rank_test_score', 'mean_test_score', 'param_n_neighbors', 'param_metric','param_p','param_weights']]

| | rank_test_score | mean_test_score | param_n_neighbors | param_metric | param_p | param_weights |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.837385 | 2 | manhattan | 5 | uniform |
| 1 | 1 | 0.837385 | 2 | minkowski | 1 | uniform |
| 2 | 1 | 0.837385 | 2 | manhattan | 4 | uniform |
| 3 | 1 | 0.837385 | 2 | manhattan | 3 | uniform |
| 4 | 1 | 0.837385 | 2 | manhattan | 1 | uniform |
| ... | ... | ... | ... | ... | ... | ... |
| 445 | 446 | NaN | 6 | minkowski | 3 | uniform |
| 446 | 447 | NaN | 5 | minkowski | 5 | distance |
| 447 | 448 | NaN | 5 | minkowski | 5 | uniform |
| 448 | 449 | NaN | 7 | minkowski | 4 | distance |


In [68]:
best_knn.best_params_

{'metric': 'manhattan', 'n_neighbors': 2, 'p': 1, 'weights': 'uniform'}

---
`Save the best KNN model for future use`

In [69]:
# save the model to local disk

joblib.dump(best_knn, 'knn_presence_classifier.joblib')

['knn_presence_classifier.joblib']

---
# SVM Classifier


---
### `Use default setting of classifier hyperparameters`

In [20]:
clf_svm = LinearSVC()

In [21]:
y_pred = cross_val_predict(clf_svm, x, y, cv=5, n_jobs = -1)

In [22]:
clf_svm.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': True,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'loss': 'squared_hinge',
 'max_iter': 1000,
 'multi_class': 'ovr',
 'penalty': 'l2',
 'random_state': None,
 'tol': 0.0001,
 'verbose': 0}

---
`With the default hyperparameters, the linear SVM classifier (LinearSVC) reaches an accuracy of about 0.912.`

In [23]:
print("Accuracy:", metrics.accuracy_score(y, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y, y_pred))

Accuracy: 0.9118447923757659
Confusion Matrix:
 [[ 839  220]
 [  39 1840]]


In [24]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[   0,  878],
       [   1, 2060]])

---
### `Parameter Tuning of SVM classifier`
`Define the combination of parameters to be considered`

In [25]:
param_grid = {'C':[0.1,1,10,100],
              'penalty':['l1','l2']}

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [26]:
gs = GridSearchCV(clf_svm,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [27]:
best_svm = gs.fit(x,y)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    0.6s finished


In [28]:
scores_df = pd.DataFrame(best_svm.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df [['rank_test_score', 'mean_test_score', 'param_penalty', 'param_C']]

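| | rank_test_score | mean_test_score | param_penalty | param_C |
|---|---|---|---|---|
| 0 | 1 | 0.911898 | l2 | 1.0 |
| 1 | 2 | 0.908153 | l2 | 10.0 |
| 2 | 3 | 0.901685 | l2 | 100.0 |
| 3 | 4 | 0.870731 | l2 | 0.1 |
| 4 | 5 | NaN | l1 | 0.1 |
| 5 | 6 | NaN | l1 | 1.0 |
| 6 | 7 | NaN | l1 | 10.0 |
| 7 | 8 | NaN | l1 | 100.0 |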


In [29]:
best_svm.best_params_

{'C': 1, 'penalty': 'l2'}
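
All `l1` candidates in the results have no score. This is consistent with a known restriction of `LinearSVC`: `penalty='l1'` with the `squared_hinge` loss is not supported when `dual=True`, which is the setting shown in the parameters above, so those fits fail. A possible way to include `l1` in the search, sketched on synthetic data, is to pass `param_grid` as a list of dictionaries and pair `l1` with `dual=False`. The data and `max_iter` value are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic data standing in for the tf-idf features.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Each dictionary is expanded separately, so 'dual' is only forced for 'l1'.
param_grid = [
    {"penalty": ["l2"], "C": [0.1, 1, 10, 100]},
    {"penalty": ["l1"], "dual": [False], "C": [0.1, 1, 10, 100]},
]

search = GridSearchCV(LinearSVC(max_iter=5000), param_grid, cv=5)
search.fit(X_demo, y_demo)

# Every candidate now has a score, including the 'l1' ones.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 4))
```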

---
`Save the best SVM model for future use`

In [30]:
# save the model to local disk

joblib.dump(best_svm, 'svm_presence_classifier.joblib')

['svm_presence_classifier.joblib']