<h1><strong><u>Final Model Selection (Assignment 5)</u></strong></h1>

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

<h2><u>Data Loading</u></h2>

In [2]:
df = pd.DataFrame(columns=["title", "target"])

In [3]:
# Formatting text files
titles = []
targets = []
file_path_dict = {'clickbait': './clickbait_data.txt', 'non clickbait': './non_clickbait_data.txt'}
for key, value in file_path_dict.items():
    with open(value, 'r') as file:
        for line_number, line in enumerate(file):
            line = line.strip()
            if line != "":
                titles.append(line)
                targets.append(key)
data_dict = {"title": titles, "target": targets}
df = pd.DataFrame(data_dict)


In [4]:
data = df.sample(frac=1, random_state=0).reset_index(drop=True)
display(data)

Unnamed: 0,title,target
0,UK guinea pig farm to close after owner's fami...,non clickbait
1,18 Sweet Pumpkin Treats You Won't Believe Are ...,clickbait
2,"A Guy Just Did The Most Epic ""Cha Cha Slide"" D...",clickbait
3,Premium gas discounted for a few hours,non clickbait
4,Sanctions on US products introduced by Brazil,non clickbait
...,...,...
31995,"Men, Stephen King Has A Really Important Messa...",clickbait
31996,Greek government faces censure motion by oppos...,non clickbait
31997,15 Holiday Cocktails That Are Basically Dessert,clickbait
31998,This Corgi And Baby Are Best Friends And It's ...,clickbait


In [5]:
X = data["title"]
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
display(X_train.shape) 
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(25600,)

(6400,)

(25600,)

(6400,)

In [6]:
def custom_tokenizer(text):
        from nltk.tokenize import word_tokenize
        from nltk.corpus import stopwords
        from nltk.stem import WordNetLemmatizer
        import re
        en_stopwords = stopwords.words('english')
        lemmatizer = WordNetLemmatizer()
        new_text = text.lower() #lowercase

        new_text = re.sub(r"([^\w\s])", "", new_text) #remove punctuation

        for word in new_text.split(): #remove stopwords
            if word in en_stopwords:
                new_text = new_text.replace(word, "")
        
        new_text = word_tokenize(new_text) #tokenize

        new_text = [lemmatizer.lemmatize(token) for token in new_text] #lemmatize
        return new_text

<h2><strong><u>Main Model Selection</u></strong></h2>

In [7]:
pipeline = Pipeline(
    [
        ("vect", TfidfVectorizer(tokenizer=custom_tokenizer, token_pattern=None)),
        ("clf", None)
    ]
)
param_grid = {
    "clf": [KNeighborsClassifier(n_neighbors=23), BernoulliNB(alpha=1), MLPClassifier(hidden_layer_sizes=(14,), alpha=1)]
}
model = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=3)
model.fit(X_train, y_train)
model.best_params_

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 3/5] END ..........clf=BernoulliNB(alpha=1);, score=0.960 total time=   4.8s
[CV 1/5] END ..........clf=BernoulliNB(alpha=1);, score=0.955 total time=   5.2s
[CV 2/5] END ..........clf=BernoulliNB(alpha=1);, score=0.957 total time=   5.3s
[CV 4/5] END ..........clf=BernoulliNB(alpha=1);, score=0.958 total time=   5.6s
[CV 5/5] END ..........clf=BernoulliNB(alpha=1);, score=0.958 total time=   5.6s
[CV 5/5] END clf=KNeighborsClassifier(n_neighbors=23);, score=0.929 total time=  11.4s
[CV 2/5] END clf=KNeighborsClassifier(n_neighbors=23);, score=0.927 total time=  11.4s
[CV 4/5] END clf=KNeighborsClassifier(n_neighbors=23);, score=0.923 total time=  11.5s
[CV 1/5] END clf=KNeighborsClassifier(n_neighbors=23);, score=0.924 total time=  11.5s
[CV 3/5] END clf=KNeighborsClassifier(n_neighbors=23);, score=0.925 total time=  11.5s
[CV 4/5] END clf=MLPClassifier(alpha=1, hidden_layer_sizes=(14,));, score=0.942 total time=  28.2s
[

{'clf': BernoulliNB(alpha=1)}

<h2><u>Results</u></h2>

<ol>
<li>What data representation did you use?</li>
<p>The representation that I used for the data was the TF-IDF Vectorizer representation.</p></br>
<li>What metric did you select to rank the models?</li>
<p>The metric that I chose to rank the models was accuracy.</p></br>
<li>How did each model score on the selected metric for both the training data and the testing data?</li>
<p>Based on 'accuracy' the models scored on average:</p>
<ol>
<li><u>K-Nearest Neighbors</u></li>
<ul>
<li>Training - 93%</li>
<li>Testing - 92%</li>
</ul>
<li><u>Naive Bayes</u></li>
<ul>
<li>Training - 96%</li>
<li>Testing - 95%</li>
</ul>
<li><u>Multi-Layered Perceptron</u></li>
<ul>
<li>Training - 94%</li>
<li>Testing - 93%</li>
</ul>
</ol>
</br>
<li>What hyperparameter values gave the optimal results in the cross validation?</li>
<p>The hyperparameter values that gave the optimal results in the cross validation were the BernoulliNB classifier with a regularization coefficient of 1.</p></br>
<li>Describe a way in which the classifier could be used as a plugin for a web browser.</li>
<p>A way in which the classifier can be used as a plugin for a web browser is by first converting it into a format that can be ran in a browser such as an ONNX file. Next, a folder needs to be created to store the necessary files needed to build the browser extension. After the files have been created for the browser, the model inference needs to be implemented in the browser. Finally, the extension can be loaded and tested.</p></br>
</ol>