# Problem Formulation

Given a pattern string as input, we want to know whether it contains a dark pattern. We build a labeled dataset from two sources: the instances in the Princeton dataset, all of which are dark patterns, and the instances in the 'normie.csv' file that are labeled as NOT dark patterns. The result is a roughly balanced dataset of pattern strings with and without dark patterns.

We then use this labeled dataset to build and train supervised machine learning models, and select the most suitable ones for our project.

----


In [73]:
import pandas as pd 
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV

# CountVectorizer tokenizes a collection of text documents, builds a vocabulary of known words, and encodes new documents using that vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
# TfidfTransformer takes the word counts produced by CountVectorizer, computes the Inverse Document Frequency (IDF) values, and from them the tf-idf scores.
from sklearn.feature_extraction.text import TfidfTransformer

# Bernoulli Naive Bayes, like MultinomialNB, is suitable for discrete data. MultinomialNB works with occurrence counts, whereas BernoulliNB is designed for binary/boolean features, so for text classification word occurrence vectors (rather than word count vectors) may be better suited to training and using this classifier.
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Evaluation metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# joblib is a set of tools to provide lightweight pipelining in Python. It provides utilities for saving and loading Python objects that make use of NumPy data structures, efficiently.
import joblib

import matplotlib.pyplot as plt
# import seaborn as sns

## Data Exploration

---
Import the Princeton dataset (for instances with dark patterns), the normie.csv dataset (for instances without dark patterns), and explore the datasets.

In [2]:
normie = pd.read_csv('normie.csv')
princeton = pd.read_csv('dark_patterns.csv')

---
### Normie Dataset Exploration

In [3]:
normie.head()

Unnamed: 0,Pattern String,classification
0,FREE SHIPPING ON ORDERS OVER $100!,0.0
1,SOME EXCLUSIONS APPLY - LEARN MORE,0.0
2,HAVE A QUESTION? - CONTACT US,0.0
3,WELCOME TO 034MOTORSPORT!,0.0
4,SHOP AUDISHOP VOLKSWAGENPERFORMANCE SOFTWARE03...,0.0


---
`check the normie dataset information`

The normie dataset has 2700 rows in total. Of these, 2427 have a non-null pattern string and 2699 have a non-null classification.

In [4]:
normie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700 entries, 0 to 2699
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pattern String  2427 non-null   object 
 1   classification  2699 non-null   float64
dtypes: float64(1), object(1)
memory usage: 42.3+ KB


In [5]:
# check the distribution of the target value --- classification.

normie['classification'].value_counts()

0.0    2094
1.0     605
Name: classification, dtype: int64

In [6]:
# Remove the rows with a NULL 'Pattern String' or 'classification'; these two columns are the input and the target of our model.

normie = normie[pd.notnull(normie["Pattern String"])]
normie = normie[pd.notnull(normie["classification"])]
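
The two boolean-mask filters above can also be written as a single `dropna` call. A minimal sketch on a made-up frame (the toy rows are invented for illustration and are not taken from normie.csv):

```python
import pandas as pd

# Toy frame standing in for normie.csv; the rows are invented for illustration.
toy = pd.DataFrame({
    "Pattern String": ["FREE SHIPPING", None, "CONTACT US", "SALE ENDS SOON"],
    "classification": [0.0, 0.0, None, 1.0],
})

# Two-step filtering, as in the cell above.
two_step = toy[pd.notnull(toy["Pattern String"])]
two_step = two_step[pd.notnull(two_step["classification"])]

# Equivalent single call: drop rows with a null in either column.
one_step = toy.dropna(subset=["Pattern String", "classification"])

print(one_step)
```

Both forms keep the original row labels, which is why the cleaned frame later reports an index running "0 to 2699" with fewer entries.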

In [7]:
normie.head()

Unnamed: 0,Pattern String,classification
0,FREE SHIPPING ON ORDERS OVER $100!,0.0
1,SOME EXCLUSIONS APPLY - LEARN MORE,0.0
2,HAVE A QUESTION? - CONTACT US,0.0
3,WELCOME TO 034MOTORSPORT!,0.0
4,SHOP AUDISHOP VOLKSWAGENPERFORMANCE SOFTWARE03...,0.0


In [8]:
# check the information of the normie dataset.

normie.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2426 entries, 0 to 2699
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pattern String  2426 non-null   object 
 1   classification  2426 non-null   float64
dtypes: float64(1), object(1)
memory usage: 56.9+ KB


In [9]:
# check the final distribution of the classification after removing the rows with NULL values.

normie['classification'].value_counts()

0.0    1894
1.0     532
Name: classification, dtype: int64

---
We only need the rows that are NOT dark patterns, i.e. the instances where 'classification' == 0.

In [10]:
normie = normie[normie["classification"] == 0]

In [11]:
normie.head()

Unnamed: 0,Pattern String,classification
0,FREE SHIPPING ON ORDERS OVER $100!,0.0
1,SOME EXCLUSIONS APPLY - LEARN MORE,0.0
2,HAVE A QUESTION? - CONTACT US,0.0
3,WELCOME TO 034MOTORSPORT!,0.0
4,SHOP AUDISHOP VOLKSWAGENPERFORMANCE SOFTWARE03...,0.0


In [12]:
# There are 1894 instances in the dataset that are NOT dark patterns.

normie.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1894 entries, 0 to 2699
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pattern String  1894 non-null   object 
 1   classification  1894 non-null   float64
dtypes: float64(1), object(1)
memory usage: 44.4+ KB


In [13]:
normie["classification"] = "Not Dark"

In [14]:
normie.head()

Unnamed: 0,Pattern String,classification
0,FREE SHIPPING ON ORDERS OVER $100!,Not Dark
1,SOME EXCLUSIONS APPLY - LEARN MORE,Not Dark
2,HAVE A QUESTION? - CONTACT US,Not Dark
3,WELCOME TO 034MOTORSPORT!,Not Dark
4,SHOP AUDISHOP VOLKSWAGENPERFORMANCE SOFTWARE03...,Not Dark


In [17]:
# Remove duplicate pattern strings before training the model, to reduce overfitting.

normie = normie.drop_duplicates(subset="Pattern String")

normie.head()

Unnamed: 0,Pattern String,classification
0,FREE SHIPPING ON ORDERS OVER $100!,Not Dark
1,SOME EXCLUSIONS APPLY - LEARN MORE,Not Dark
2,HAVE A QUESTION? - CONTACT US,Not Dark
3,WELCOME TO 034MOTORSPORT!,Not Dark
4,SHOP AUDISHOP VOLKSWAGENPERFORMANCE SOFTWARE03...,Not Dark


In [18]:
# After removing the duplicates, 1468 instances of NOT dark patterns are left in the dataset.

normie.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1468 entries, 0 to 2699
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Pattern String  1468 non-null   object
 1   classification  1468 non-null   object
dtypes: object(2)
memory usage: 34.4+ KB


---
### Princeton Dataset Exploration

In [19]:
princeton.head()

Unnamed: 0,Pattern String,Comment,Pattern Category,Pattern Type,Where in website?,Deceptive?,Website Page
0,Collin P. from Grandview Missouri just bought ...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://alaindupetit.com/collections/all-suits...
1,"Faith in Glendale, United States purchased a C...",Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bonescoffee.com/products/strawberry-ch...
2,Sharmeen Atif From Karachi just bought Stylish...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://brandsego.com/collections/under-rs-99/...
3,9 people are viewing this.,Product detail,Social Proof,Activity Notification,Product Page,No,https://brightechshop.com/products/ambience-so...
4,5338 people viewed this in the last hour,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bumpboxes.com/


In [20]:
princeton.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1818 entries, 0 to 1817
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Pattern String     1512 non-null   object
 1   Comment            1798 non-null   object
 2   Pattern Category   1818 non-null   object
 3   Pattern Type       1818 non-null   object
 4   Where in website?  1818 non-null   object
 5   Deceptive?         1818 non-null   object
 6   Website Page       1818 non-null   object
dtypes: object(7)
memory usage: 99.5+ KB


In [23]:
# remove the rows where there are NULL values in 'Pattern String' or 'Pattern Category' columns.

princeton = princeton[pd.notnull(princeton["Pattern String"])]
princeton = princeton[pd.notnull(princeton["Pattern Category"])]

In [24]:
princeton.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1512 entries, 0 to 1817
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Pattern String     1512 non-null   object
 1   Comment            1494 non-null   object
 2   Pattern Category   1512 non-null   object
 3   Pattern Type       1512 non-null   object
 4   Where in website?  1512 non-null   object
 5   Deceptive?         1512 non-null   object
 6   Website Page       1512 non-null   object
dtypes: object(7)
memory usage: 94.5+ KB


In [25]:
princeton.head()

Unnamed: 0,Pattern String,Comment,Pattern Category,Pattern Type,Where in website?,Deceptive?,Website Page
0,Collin P. from Grandview Missouri just bought ...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://alaindupetit.com/collections/all-suits...
1,"Faith in Glendale, United States purchased a C...",Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bonescoffee.com/products/strawberry-ch...
2,Sharmeen Atif From Karachi just bought Stylish...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://brandsego.com/collections/under-rs-99/...
3,9 people are viewing this.,Product detail,Social Proof,Activity Notification,Product Page,No,https://brightechshop.com/products/ambience-so...
4,5338 people viewed this in the last hour,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bumpboxes.com/


In [26]:
# Create a column named 'classification' and set every value to 'Dark', to match the normie dataset.

princeton["classification"] = "Dark"

In [27]:
princeton.head()

Unnamed: 0,Pattern String,Comment,Pattern Category,Pattern Type,Where in website?,Deceptive?,Website Page,classification
0,Collin P. from Grandview Missouri just bought ...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://alaindupetit.com/collections/all-suits...,Dark
1,"Faith in Glendale, United States purchased a C...",Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bonescoffee.com/products/strawberry-ch...,Dark
2,Sharmeen Atif From Karachi just bought Stylish...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://brandsego.com/collections/under-rs-99/...,Dark
3,9 people are viewing this.,Product detail,Social Proof,Activity Notification,Product Page,No,https://brightechshop.com/products/ambience-so...,Dark
4,5338 people viewed this in the last hour,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bumpboxes.com/,Dark


In [28]:
# Remove duplicate pattern strings before training the model, to reduce overfitting.

princeton = princeton.drop_duplicates(subset="Pattern String")

princeton.head()

Unnamed: 0,Pattern String,Comment,Pattern Category,Pattern Type,Where in website?,Deceptive?,Website Page,classification
0,Collin P. from Grandview Missouri just bought ...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://alaindupetit.com/collections/all-suits...,Dark
1,"Faith in Glendale, United States purchased a C...",Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bonescoffee.com/products/strawberry-ch...,Dark
2,Sharmeen Atif From Karachi just bought Stylish...,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://brandsego.com/collections/under-rs-99/...,Dark
3,9 people are viewing this.,Product detail,Social Proof,Activity Notification,Product Page,No,https://brightechshop.com/products/ambience-so...,Dark
4,5338 people viewed this in the last hour,Periodic popup,Social Proof,Activity Notification,Product Page,No,https://bumpboxes.com/,Dark


In [29]:
# After removing the duplicate 'Pattern String', there are 1178 instances of Dark Pattern String.

princeton.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1178 entries, 0 to 1817
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Pattern String     1178 non-null   object
 1   Comment            1162 non-null   object
 2   Pattern Category   1178 non-null   object
 3   Pattern Type       1178 non-null   object
 4   Where in website?  1178 non-null   object
 5   Deceptive?         1178 non-null   object
 6   Website Page       1178 non-null   object
 7   classification     1178 non-null   object
dtypes: object(8)
memory usage: 82.8+ KB


In [30]:
# Subset the princeton dataset for joining with normie dataset.

cols = ["Pattern String", "classification"]
princeton = princeton[cols]

princeton.head()

Unnamed: 0,Pattern String,classification
0,Collin P. from Grandview Missouri just bought ...,Dark
1,"Faith in Glendale, United States purchased a C...",Dark
2,Sharmeen Atif From Karachi just bought Stylish...,Dark
3,9 people are viewing this.,Dark
4,5338 people viewed this in the last hour,Dark


---
### Combining the two datasets

In [31]:
df = pd.concat([normie, princeton])
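
Because each source frame keeps its own row labels, the concatenated frame can contain repeated index values. The sketch below shows this on two tiny made-up frames and how `ignore_index=True` yields a fresh index instead. The example strings, including "Only 2 left!", are invented for illustration.

```python
import pandas as pd

# Two toy frames standing in for normie and princeton (rows are invented).
a = pd.DataFrame({"Pattern String": ["FREE SHIPPING", "CONTACT US"],
                  "classification": ["Not Dark", "Not Dark"]})
b = pd.DataFrame({"Pattern String": ["9 people are viewing this.", "Only 2 left!"],
                  "classification": ["Dark", "Dark"]})

kept = pd.concat([a, b])                      # index labels: 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # index labels: 0, 1, 2, 3

print(kept.index.is_unique)
print(fresh.index.is_unique)
```

Repeated labels do not affect the train/test split used later, but they matter for any label-based lookup such as `df.loc[0]`.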

In [32]:
df

Unnamed: 0,Pattern String,classification
0,FREE SHIPPING ON ORDERS OVER $100!,Not Dark
1,SOME EXCLUSIONS APPLY - LEARN MORE,Not Dark
2,HAVE A QUESTION? - CONTACT US,Not Dark
3,WELCOME TO 034MOTORSPORT!,Not Dark
4,SHOP AUDISHOP VOLKSWAGENPERFORMANCE SOFTWARE03...,Not Dark
...,...,...
1809,Competitor Price: $172.00,Dark
1810,TWO FREE PILLOWS AND 30% OFF WITH PROMO CODE,Dark
1813,$132.90 $99.00,Dark
1814,This offer is only VALID if you add to cart now!,Dark


In [33]:
# check the information of the combined dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2646 entries, 0 to 1817
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Pattern String  2646 non-null   object
 1   classification  2646 non-null   object
dtypes: object(2)
memory usage: 62.0+ KB


In [34]:
# check the final distribution of the classification of the combined dataset.
# There are 1468 instances of Not dark pattern, 1178 instances of dark patterns.

df['classification'].value_counts()

Not Dark    1468
Dark        1178
Name: classification, dtype: int64

---
## Data Preparation

---
`Split the dataset into training and testing dataset.`

In [45]:
# split the dataset into train and test dataset as a ratio of 60%/40% (train/test).

string_train, string_test, dark_train, dark_test = train_test_split(
    df['Pattern String'], df["classification"], train_size = .6)
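
The split above is random and unseeded, so the class counts in each part change from run to run. A hedged sketch of a repeatable, class-preserving variant follows. It uses a toy frame, and `random_state=0` is an arbitrary seed; neither `stratify` nor the seed is used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df: 10 'Dark' and 10 'Not Dark' rows, invented for illustration.
toy = pd.DataFrame({
    "Pattern String": [f"text {i}" for i in range(20)],
    "classification": ["Dark"] * 10 + ["Not Dark"] * 10,
})

s_train, s_test, d_train, d_test = train_test_split(
    toy["Pattern String"], toy["classification"],
    train_size=0.6,
    stratify=toy["classification"],  # keep the class ratio in both parts
    random_state=0,                  # arbitrary seed, only for repeatability
)

print(d_train.value_counts())
print(d_test.value_counts())
```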

---
`Encode the target values into integers` --- 'classification'

In [47]:
encoder = LabelEncoder()
encoder.fit(dark_train)
y_train = encoder.transform(dark_train)
y_test = encoder.transform(dark_test)

In [48]:
# Check the mapping of the encoding: 0 represents 'Dark', 1 represents 'Not Dark'.

list(encoder.classes_)

['Dark', 'Not Dark']
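
The mapping follows from `LabelEncoder` sorting the class names, and `inverse_transform` converts integer predictions back to names. A small standalone sketch:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Not Dark", "Dark", "Dark", "Not Dark"])

# Classes are stored in sorted order, so 'Dark' -> 0 and 'Not Dark' -> 1.
print(list(encoder.classes_))

encoded = encoder.transform(["Dark", "Not Dark"])
decoded = encoder.inverse_transform([1, 0, 1])
print(encoded, decoded)
```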

In [51]:
# Check the frequency distribution of the training pattern classification with pattern classification names.

(unique, counts) = np.unique(dark_train, return_counts=True)
frequencies = np.asarray((unique, counts)).T

print(frequencies)

[['Dark' 690]
 ['Not Dark' 897]]


In [52]:
# Check the frequency distribution of the encoded training pattern classification with encoded integers.

(unique, counts) = np.unique(y_train, return_counts=True)
frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[  0 690]
 [  1 897]]


In [53]:
# Check the frequency distribution of the encoded testing pattern classification with encoded integers.

(unique, counts) = np.unique(y_test, return_counts=True)
frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[  0 488]
 [  1 571]]


---
`Encode the textual features into numeric vectors`

In [54]:
# First get the word count vector of the pattern string to encode the pattern string.

cv = CountVectorizer()
string_train_counts = cv.fit_transform(string_train)

# Then use the tf-idf score to transform the encoded word count pattern string vectors.

tfidf_tf = TfidfTransformer()
X_train = tfidf_tf.fit_transform(string_train_counts)

In [55]:
# save the CountVectorizer to disk

joblib.dump(cv, 'presence_CountVectorizer.joblib')

['presence_CountVectorizer.joblib']
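
Only the `CountVectorizer` is saved here; reproducing the tf-idf features on new text would also require the fitted `TfidfTransformer`. One alternative, not used in this notebook, is `TfidfVectorizer`, which with default settings combines both steps in a single object. The sketch below checks the equivalence on three made-up strings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up example strings, for illustration only.
docs = ["free shipping on orders",
        "only 2 left in stock",
        "9 people are viewing this"]

# Two-step encoding, as in the cell above.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step encoding with default settings.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```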

---
# Rough idea of how different classifiers perform
---

In [60]:
# Five models are tested:
# -- Logistic Regression
# -- Linear Support Vector Machine
# -- Random Forest
# -- Multinomial Naive Bayes
# -- Bernoulli Naive Bayes

classifiers = [LogisticRegression(),LinearSVC(), RandomForestClassifier(), MultinomialNB(), BernoulliNB()]

In [61]:
# Calculate the accuracies of different classifiers using default settings.

acc = []
cm = []

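for clf in classifiers:
    clf.fit(X_train, y_train)
    # Note: X_train holds tf-idf features, while the test strings below are
    # transformed with cv only (raw counts). The recorded accuracies were
    # obtained with this setup; passing the test counts through
    # tfidf_tf.transform() as well would make the two representations match.
    y_pred = clf.predict(cv.transform(string_test))
    acc.append(metrics.accuracy_score(y_test, y_pred))
    cm.append(metrics.confusion_matrix(y_test, y_pred))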

In [62]:
# List the accuracies of different classifiers.

for i in range(len(classifiers)):
    print(f"{classifiers[i]} accuracy: {acc[i]}")
    # print(f"Confusion Matrix: {cm[i]}")

LogisticRegression() accuracy: 0.9254013220018886
LinearSVC() accuracy: 0.927289896128423
RandomForestClassifier() accuracy: 0.943342776203966
MultinomialNB() accuracy: 0.9036827195467422
BernoulliNB() accuracy: 0.9461756373937678
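
In the loop above, the classifiers are fit on tf-idf features but predict on raw counts from `cv.transform`. One way to guarantee that training and test text go through identical steps is a scikit-learn `Pipeline`. This is a sketch on made-up strings, not the setup that produced the accuracies listed above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data invented for illustration: 0 = dark, 1 = not dark.
train_text = ["only 2 left in stock", "hurry offer ends soon",
              "contact us", "free shipping on orders"]
train_label = [0, 0, 1, 1]

# The pipeline applies counts -> tf-idf -> classifier in both fit and predict.
pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])
pipe.fit(train_text, train_label)

# Raw strings go in; the same two encoding steps run before the classifier.
pred = pipe.predict(["only 3 left in stock"])
print(pred)
```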


---
# Bernoulli Naive Bayes Classifier


---
### `Use default setting of classifier hyperparameters`

In [63]:
clf_bnb = BernoulliNB().fit(X_train, y_train)

In [65]:
clf_bnb.get_params()

{'alpha': 1.0, 'binarize': 0.0, 'class_prior': None, 'fit_prior': True}

In [67]:
y_pred = clf_bnb.predict(cv.transform(string_test))

---
`With the default hyperparameters, the Bernoulli Naive Bayes classifier reaches an accuracy of 0.946.`

In [71]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))

Accuracy: 0.9461756373937678
Confusion Matrix:
 [[448  40]
 [ 17 554]]
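
The confusion matrix carries more than the accuracy. Rows are true labels and columns are predictions, with index 0 = 'Dark' and 1 = 'Not Dark' (the encoder order). The sketch below recomputes the accuracy from the matrix printed above and derives precision and recall for the 'Dark' class.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[448, 40],
               [17, 554]])

accuracy = np.trace(cm) / cm.sum()          # (448 + 554) / 1059
dark_recall = cm[0, 0] / cm[0].sum()        # 448 / 488: dark strings that were caught
dark_precision = cm[0, 0] / cm[:, 0].sum()  # 448 / 465: 'Dark' predictions that were right

print(f"accuracy={accuracy:.4f} "
      f"recall={dark_recall:.4f} precision={dark_precision:.4f}")
```

So 40 of the 488 dark strings in the test set were missed, while 17 of the 571 non-dark strings were flagged as dark.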


In [69]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[  0, 465],
       [  1, 594]])

---
### `Parameter Tuning of BernoulliNB classifier`
`Define the combination of parameters to be considered`

In [74]:
param_grid = {'alpha':[0,1], 
              'fit_prior':[True, False]}

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [75]:
gs = GridSearchCV(clf_bnb,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [76]:
best_bnb = gs.fit(X_train,y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:    2.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    2.1s finished


In [77]:
scores_df = pd.DataFrame(best_bnb.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score', 'mean_test_score', 'param_alpha', 'param_fit_prior']]

Unnamed: 0,rank_test_score,mean_test_score,param_alpha,param_fit_prior
0,1,0.947702,1,False
1,2,0.944553,1,True
2,3,0.937631,0,True
3,4,0.935109,0,False


In [78]:
best_bnb.best_params_

{'alpha': 1, 'fit_prior': False}
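
The grid search above runs on `X_train`, which was vectorized once on the full training set, so each cross-validation fold reuses a vocabulary learned from all training rows. A variant that refits the text encoding inside every fold wraps the steps in a `Pipeline` and searches over it. The sketch below uses a made-up corpus, and the alpha values 0.5 and 1.0 are illustrative rather than the grid used above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration: 0 = dark, 1 = not dark.
texts = [
    "only 2 left in stock", "hurry offer ends soon",
    "5 people bought this today", "sale ends in 10 minutes",
    "contact us", "free shipping on orders",
    "welcome to our store", "read our return policy",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", BernoulliNB()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__alpha": [0.5, 1.0], "clf__fit_prior": [True, False]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```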

In [79]:
y_pred_best = best_bnb.predict(cv.transform(string_test))

In [80]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_best))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred_best))

Accuracy: 0.9480642115203022
Confusion Matrix:
 [[451  37]
 [ 18 553]]


In [82]:
(unique, counts) = np.unique(y_pred_best, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[  0, 469],
       [  1, 590]])

---
`Save the best BernoulliNB model for future use`

In [84]:
# save the model to local disk

joblib.dump(best_bnb, 'bnb_presence_classifier.joblib')

['bnb_presence_classifier.joblib']
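
For later use, the saved vectorizer and the saved classifier have to be loaded together, since the classifier expects the exact feature columns the vectorizer produces. The round trip is sketched below on toy data, with temporary file names that are placeholders rather than the files written above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy data invented for illustration: 0 = dark, 1 = not dark.
texts = ["only 2 left in stock", "hurry offer ends soon",
         "contact us", "free shipping on orders"]
labels = [0, 0, 1, 1]

cv = CountVectorizer()
clf = BernoulliNB().fit(cv.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder file names, used only inside this temporary directory.
    cv_path = os.path.join(tmp, "vectorizer.joblib")
    clf_path = os.path.join(tmp, "classifier.joblib")
    joblib.dump(cv, cv_path)
    joblib.dump(clf, clf_path)
    loaded_cv = joblib.load(cv_path)
    loaded_clf = joblib.load(clf_path)

# The reloaded pair should give the same prediction as the in-memory pair.
new_text = ["hurry only 1 left"]
before = clf.predict(cv.transform(new_text))
after = loaded_clf.predict(loaded_cv.transform(new_text))
print(before, after)
```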

---
# Random Forest Classifier


---
### `Use default setting of classifier hyperparameters`

In [85]:
clf_rf = RandomForestClassifier().fit(X_train, y_train)

In [86]:
clf_rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [87]:
y_pred = clf_rf.predict(cv.transform(string_test))

---
`With the default hyperparameters, the Random Forest classifier reaches an accuracy of 0.938.`

In [88]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))

Accuracy: 0.9376770538243626
Confusion Matrix:
 [[450  38]
 [ 28 543]]


In [89]:
(unique, counts) = np.unique(y_pred, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[  0, 478],
       [  1, 581]])

---
### `Parameter Tuning of Random Forest classifier`
`Define the combination of parameters to be considered`

In [90]:
param_grid = {'bootstrap':[True,False], 
              'criterion':['gini','entropy'],
              'max_depth':[10,20,30,40,50,60,70,80,90,100, None],
              'min_samples_leaf':[1,2,4],
              'min_samples_split':[2,5,10],
              'n_estimators':[100,200,300,400,500,600]}
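
The size of an exhaustive search is the product of the list lengths in the grid, multiplied by the number of folds. The sketch below counts the combinations for this grid with `ParameterGrid` before committing to the run.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {'bootstrap': [True, False],
              'criterion': ['gini', 'entropy'],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10],
              'n_estimators': [100, 200, 300, 400, 500, 600]}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 11 * 3 * 3 * 6
n_fits = n_candidates * 5                      # 5-fold cross-validation

print(n_candidates, n_fits)
```

These are the 2376 candidates and 11880 fits that scikit-learn reports when the search runs.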

`Run the Grid Search`

Use cross-validation on the training dataset to find the optimal model.

In [91]:
gs = GridSearchCV(clf_rf,param_grid,cv=5, 
                      verbose = 1, n_jobs = -1)

In [92]:
best_rf = gs.fit(X_train,y_train)

Fitting 5 folds for each of 2376 candidates, totalling 11880 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   23.9s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done 2426 tasks      | elapsed: 10.2min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 13.5min
[Parallel(n_jobs=-1)]: Done 4026 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-1)]: Done 4976 tasks      | elapsed: 21.0min
[Parallel(n_jobs=-1)]: Done 6026 tasks      | elapsed: 25.7min
[Parallel(n_jobs=-1)]: Done 7176 tasks      | elapsed: 30.6min
[Parallel(n_jobs=-1)]: Done 8426 tasks      | elapsed: 38.5min
[Parallel(n_jobs=-1)]: Done 9776 tasks      | elapsed: 44.8min
[Parallel(n_jobs=-1)]: Done 11226 tasks      

In [93]:
scores_df = pd.DataFrame(best_rf.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score', 'mean_test_score', 'param_bootstrap', 'param_criterion','param_max_depth','param_min_samples_leaf','param_min_samples_split','param_n_estimators']]

Unnamed: 0,rank_test_score,mean_test_score,param_bootstrap,param_criterion,param_max_depth,param_min_samples_leaf,param_min_samples_split,param_n_estimators
0,1,0.967246,False,gini,80,1,10,500
1,1,0.967246,False,entropy,90,1,5,200
2,3,0.966617,False,entropy,70,1,10,400
3,4,0.966615,False,gini,100,1,5,100
4,5,0.965986,False,gini,70,1,10,300
...,...,...,...,...,...,...,...,...
2371,2372,0.888479,True,entropy,10,4,10,200
2372,2373,0.888469,True,entropy,10,4,2,300
2373,2374,0.888461,True,entropy,10,4,5,200
2374,2375,0.885943,True,gini,10,4,2,300


In [94]:
best_rf.best_params_

{'bootstrap': False,
 'criterion': 'gini',
 'max_depth': 80,
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'n_estimators': 500}

In [95]:
y_pred_best = best_rf.predict(cv.transform(string_test))

In [96]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_best))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred_best))

Accuracy: 0.9405099150141643
Confusion Matrix:
 [[460  28]
 [ 35 536]]


In [97]:
(unique, counts) = np.unique(y_pred_best, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[  0, 495],
       [  1, 564]])

---
`Save the best Random Forest model for future use`

In [98]:
# save the model to local disk

joblib.dump(best_rf, 'rf_presence_classifier.joblib')

['rf_presence_classifier.joblib']

---
# Without encoding the target values into integers

In [40]:
# split the dataset into train and test dataset as a ratio of 60%/40% (train/test).

X_train, X_test, y_train, y_test = train_test_split(
    df['Pattern String'], df["classification"], train_size = .6)

In [41]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [42]:
clf = BernoulliNB().fit(X_train_tfidf, y_train)

In [43]:
y_pred = clf.predict(count_vect.transform(X_test))

In [44]:
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

Accuracy:  0.9527856468366384
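
Here too the classifier is fit on tf-idf features and predicts on raw counts. For `BernoulliNB` with the default `binarize=0.0` this makes no difference, because every value above zero is mapped to 1 and counts and tf-idf scores are nonzero in the same positions. The sketch below checks this on made-up strings. The argument covers `BernoulliNB` only and does not carry over to the other classifiers compared earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Toy data invented for illustration, with string labels as in this section.
texts = ["only 2 left in stock", "hurry offer ends soon",
         "contact us", "free shipping on orders"]
labels = ["Dark", "Dark", "Not Dark", "Not Dark"]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(texts)
tfidf = TfidfTransformer().fit_transform(counts)

# Fit one model per representation; both are binarized internally.
clf_tfidf = BernoulliNB().fit(tfidf, labels)
clf_counts = BernoulliNB().fit(counts, labels)

new_counts = count_vect.transform(["only 1 left hurry"])
same = np.allclose(clf_tfidf.predict_proba(new_counts),
                   clf_counts.predict_proba(new_counts))
print(same)
```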
