# Clickbait Dataset

Naive Bayes vs. Rocchio Classifier

**Text classification**

**Q1 (A)**<b> Click bait dataset </b>

This dataset contains headlines from news sites such as ‘WikiNews’, ‘New York Times’, ‘The Guardian’, ‘The Hindu’, ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’ and ‘ViralStories’. It has two columns: the first contains the headline and the second a numerical clickbait label, where 1 marks a clickbait headline and 0 a non-clickbait headline. The dataset contains 32,000 rows in total, of which 50% are clickbait and 50% are non-clickbait.

In [1]:
# Importing necessary libraries and packages
!pip install spacy
!python -m spacy download en_core_web_sm

import numpy as np 
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
import string as s
import re
import spacy


from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 3.5 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


**A dataset to classify news headlines into clickbait or non-clickbait**

Kaggle link for the dataset: https://www.kaggle.com/datasets/amananandrai/clickbait-dataset?resource=download


In [3]:
# Reading the dataset from local path

clickBait_data= pd.read_csv('clickbait_data.csv')
clickBait_data.head()

Unnamed: 0,headline,clickbait
0,Should I Get Bings,1
1,Which TV Female Friend Group Do You Belong In,1
2,"The New ""Star Wars: The Force Awakens"" Trailer...",1
3,"This Vine Of New York On ""Celebrity Big Brothe...",1
4,A Couple Did A Stunning Photo Shoot With Their...,1
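
The shape and class balance stated in the dataset description can be checked right after loading. A minimal sketch on a hypothetical four-row stand-in frame (`toy` and its headlines are invented for illustration); on the real data the same calls would be applied to `clickBait_data`:

```python
import pandas as pd

# Hypothetical stand-in with the same column names as clickbait_data.csv
toy = pd.DataFrame({
    "headline": [
        "You Won't Believe These 10 Cat Photos",
        "Which Pizza Topping Are You",
        "Parliament Passes Annual Budget Bill",
        "Central Bank Raises Interest Rates",
    ],
    "clickbait": [1, 1, 0, 0],
})

# Share of each label; a balanced dataset gives 0.5 for both classes
balance = toy["clickbait"].value_counts(normalize=True).sort_index()

print(toy.shape)
print(balance)
```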


**Text Preprocessing**

In [4]:
##Preprocessing

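# Remove rows whose 'headline' is purely numeric (int or float)
# regex=False treats '.' as a literal dot; .copy() avoids chained-assignment warnings later
data = clickBait_data[~clickBait_data['headline'].astype(str).str.replace('.', '', 1, regex=False).str.isnumeric()].copy()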

# Convert all text into lower case
df_text = data['headline'].str.lower()

# Remove email ids from the text
df_text = df_text.replace({r'<?([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+>?(\s\([A-Za-z ]*\))?': ''}, regex=True)

# Remove hyperlinks from the text
df_text = df_text.replace({r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+': ''}, regex=True)

# Remove html tags
df_text = df_text.replace({r'<.*?>': ''}, regex=True)

# Replace every non-alphabetic character with a space
df_text = df_text.replace({r'[^A-Za-z]': ' '}, regex=True)
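
To see what these cleaning steps do to a single headline, the same sequence can be applied with plain `re`. This is a minimal sketch, not the notebook's code: the helper `clean_headline` is hypothetical, the URL pattern is simplified, the email step is omitted, and the final whitespace collapse is an addition for readability.

```python
import re

def clean_headline(text: str) -> str:
    """Apply lower-casing, URL removal, tag removal and letter filtering to one string."""
    text = text.lower()
    text = re.sub(r"http[s]?://\S+", "", text)  # simplified URL pattern (assumption)
    text = re.sub(r"<.*?>", "", text)           # strip HTML tags
    text = re.sub(r"[^a-z]", " ", text)         # every non-letter becomes a space
    return " ".join(text.split())               # collapse runs of spaces (addition)

sample = 'The New "Star Wars" Trailer <b>Is Here</b> https://example.com/x 10/10'
cleaned = clean_headline(sample)
print(cleaned)
```

Digits and punctuation disappear entirely, so a headline such as "10 Things..." loses its leading number after this step.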




In [11]:
# Remove stop words from the text

# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)


# Applying the function on the entire dataset

df_text = df_text.apply(remove_stopwords)  # remove stop words
data['description after removal of stopwords']=df_text

In [12]:
# Load the English language model
nlp = spacy.load('en_core_web_sm')

#Remove Stop Words, Tokenize and Lemmatize the text
def lemmatize_text(text):
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop]
    clean_data = ' '.join(lemmatized_tokens)
    return clean_data

# Applying the lemmatization to the entire dataset
data['preprocessed headline'] = data['description after removal of stopwords'].apply(lambda x: lemmatize_text(x))

In [13]:
print(data['preprocessed headline'].head())

0                                                 bing
1                        tv female friend group belong
2             new star war force awakens trailer chill
3    vine new york celebrity big brother fucking pe...
4    couple stunning photo shoot baby learn inopera...
Name: preprocessed headline, dtype: object


In [14]:
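# Splitting the dataset into train set and test set
# Note: the split uses the raw 'headline' column, so the 'preprocessed headline'
# column built above does not feed into the models below.
x = data.headline
y = data.clickbait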
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, random_state=2)

In [15]:
print(test_x[0:5])

26171    Sixteen Christian converts arrested in Iran; f...
16224    Hiring of Isiah Thomas Angers Some F.I.U. Facu...
27534       Fußball-Bundesliga 2007–08: Matchday 1 roundup
27304       Taco Bell mascot Gidget dies from stroke at 15
24836    China Takes Heavy Criticism Over Software Dire...
Name: headline, dtype: object
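
The split above is not stratified, which is why the test-set supports in the classification reports further down are 3,925 and 4,075 rather than 4,000 each. If an exact 50/50 test set is wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch on hypothetical toy arrays (`X_toy` and `y_toy` are illustrative, not notebook variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(-1, 1)  # hypothetical single-feature inputs
y_toy = np.array([0, 1] * 20)         # perfectly balanced labels

# stratify=y_toy keeps the class proportions identical in train and test
_, _, _, test_y_strat = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2, stratify=y_toy
)

test_counts = np.bincount(test_y_strat)
print(test_counts)
```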


**Vectorization of Data**

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
tfidf = TfidfVectorizer()
train_vec= tfidf.fit_transform(train_x)
test_vec= tfidf.transform(test_x)

In [18]:
print("Number of features extracted:")
print(len(tfidf.get_feature_names_out()))
print("\nThe first 100 features extracted from TF-IDF:\n")
print(tfidf.get_feature_names_out()[:100])

Number of features extracted:
20022

The first 100 features extracted from TF-IDF:

['00' '000' '00s' '04' '05' '08' '08m' '09' '10' '100' '1000' '10000th'
 '1000blackgirls' '100k' '100m' '100th' '100ºf' '101' '101st' '102' '103'
 '104' '105' '106' '108' '109' '109th' '10th' '11' '110' '111' '112' '113'
 '114' '115' '116' '118th' '11k' '11n' '11th' '12' '120' '121' '1215'
 '122' '123' '125' '126' '127' '128' '12th' '13' '130' '132' '134' '137'
 '139' '13th' '14' '140' '147' '149' '14th' '15' '150' '152' '1525' '153'
 '154' '155' '159' '15bn' '16' '160' '162' '163' '165' '168' '16th' '17'
 '170' '1700' '17000' '172' '174' '175' '177' '17th' '18' '180' '18000'
 '1800s' '188' '18th' '19' '191' '1912' '1915' '1917' '1918']
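
The numeric entries at the start of the vocabulary come from `TfidfVectorizer`'s default token pattern, which keeps any run of two or more word characters, digits included. A minimal sketch on two made-up headlines; the letters-only `token_pattern` is an illustrative assumption, not something the notebook uses:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

raw_toy = ["10 Things Only 90s Kids Know", "Taco Bell mascot dies at 15"]

# Default pattern r"(?u)\b\w\w+\b": digits count as word characters
vec_default = TfidfVectorizer().fit(raw_toy)

# Letters-only alternative (assumption for illustration)
vec_alpha = TfidfVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{2,}\b").fit(raw_toy)

vocab_default = sorted(vec_default.vocabulary_)
vocab_alpha = sorted(vec_alpha.vocabulary_)
print(vocab_default)
print(vocab_alpha)
```

Fitting on an already cleaned text column would have a similar effect, since the cleaning step replaces every non-letter with a space.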


## Naive Bayes Classification Model

In [19]:
from sklearn.naive_bayes import MultinomialNB

In [20]:
NB_MN=MultinomialNB()

In [21]:
# Fit Naive Bayes Classification Model
NB_MN.fit(train_vec,train_y)

#Predict
y_pred_NB= NB_MN.predict(test_vec)


**Evaluation of Naive Bayes Model**

In [22]:
from sklearn.metrics import f1_score,accuracy_score

print("F1 score of the model")
print(f1_score(test_y,y_pred_NB))
print("\nAccuracy of the model")
print(accuracy_score(test_y,y_pred_NB))
print("\nAccuracy of the model in percentage")
print(accuracy_score(test_y,y_pred_NB)*100,"%")

F1 score of the model
0.9731819280019328

Accuracy of the model
0.97225

Accuracy of the model in percentage
97.225 %


**Precision, Recall, F1-Score**

In [23]:
# Evaluate classifier
from sklearn.metrics import classification_report

print(classification_report(test_y, y_pred_NB))

              precision    recall  f1-score   support

           0       0.99      0.96      0.97      3925
           1       0.96      0.99      0.97      4075

    accuracy                           0.97      8000
   macro avg       0.97      0.97      0.97      8000
weighted avg       0.97      0.97      0.97      8000



The classification report shows that the classifier performs well on both classes: precision, recall, and F1-score are all at least 0.96 for class 0 (non-clickbait) and class 1 (clickbait). The overall accuracy of 0.97 means that roughly 97 of every 100 test headlines are labeled correctly.
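
A confusion matrix gives the raw counts behind such a report. A minimal sketch on hypothetical toy labels (`y_true_toy` and `y_pred_toy` are invented); in the notebook the same call would take `test_y` and `y_pred_NB`:

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # share of predicted clickbait that is clickbait
recall_1 = tp / (tp + fn)     # share of true clickbait that was found
print(cm)
print(precision_1, recall_1)
```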

## Rocchio Classification Model

In [24]:
from sklearn.neighbors import NearestCentroid

In [25]:
# Fit the Rocchio classifier model
rocchio_clf = NearestCentroid()
rocchio_clf.fit(train_vec, train_y)

In [26]:
#Predict using Rocchio classifier
y_pred_rocchio = rocchio_clf.predict(test_vec)

**Evaluation of Rocchio Classifier**

In [27]:
from sklearn.metrics import f1_score,accuracy_score
print("F1 score of the model")
print(f1_score(test_y,y_pred_rocchio))
print("\nAccuracy of the model")
print(accuracy_score(test_y,y_pred_rocchio))
print("\nAccuracy of the model in percentage")
print(accuracy_score(test_y,y_pred_rocchio)*100,"%")

F1 score of the model
0.8855694851960523

Accuracy of the model
0.89275

Accuracy of the model in percentage
89.275 %


**Precision, Recall, F1-Score**

In [28]:
# Evaluate classifier
print(classification_report(test_y, y_pred_rocchio))

              precision    recall  f1-score   support

           0       0.84      0.97      0.90      3925
           1       0.97      0.81      0.89      4075

    accuracy                           0.89      8000
   macro avg       0.90      0.89      0.89      8000
weighted avg       0.90      0.89      0.89      8000



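Class 0 (non-clickbait) has the higher recall (0.97 vs. 0.81), while class 1 (clickbait) has the higher precision (0.97 vs. 0.84). In other words, the Rocchio classifier rarely flags a genuine news headline as clickbait, but it misses about 19% of the clickbait headlines. The F1-scores are close (0.90 for class 0, 0.89 for class 1), so the two classes differ mainly in their precision/recall trade-off.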

*Looking at the classification reports for both models, Naive Bayes has a higher accuracy than Rocchio (0.97 vs. 0.89).*

**QB 1(a)**<b> Compare the performance of both classifiers in terms of accuracy, precision, recall, and F1-score </b>

**F1 Score**: The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. Here the Naive Bayes model achieved a higher F1 score (0.973) than the Rocchio algorithm (0.886). For binary labels, `f1_score` reports the score of the positive class by default, so this means Naive Bayes is better at identifying clickbait headlines (class 1).
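
The relation between precision, recall and F1 can be checked by hand. A small sketch using the rounded class 1 values from the two classification reports above (the helper `f1_from` is illustrative); because the inputs are rounded to two decimals, the results only approximate the exact scores printed by `f1_score`:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_nb = f1_from(0.96, 0.99)       # Naive Bayes, class 1
f1_rocchio = f1_from(0.97, 0.81)  # Rocchio, class 1

print(round(f1_nb, 3), round(f1_rocchio, 3))
```

The harmonic mean is pulled towards the smaller of the two inputs, which is why Rocchio's recall of 0.81 drags its F1 well below its precision of 0.97.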

**Accuracy**: Accuracy is the proportion of correctly classified instances out of all instances. The Naive Bayes model achieved an accuracy of 97.225%, while the Rocchio algorithm achieved 89.275%, so Naive Bayes classifies about 8 percentage points more of the test headlines correctly.

**Performance Comparison**: Naive Bayes outperforms Rocchio on both F1 score and accuracy, and on macro-averaged precision and recall (0.97 for both vs. 0.90 and 0.89). Per class, the picture is slightly more mixed: Rocchio is marginally ahead on class 0 recall (0.97 vs. 0.96) and class 1 precision (0.97 vs. 0.96), but it pays for this with much lower class 0 precision (0.84 vs. 0.99) and class 1 recall (0.81 vs. 0.99).

**Consideration of Trade-offs**: While the Naive Bayes model performs better overall, it's important to consider the specific trade-offs between precision, recall, and computational complexity when choosing between models. The Rocchio algorithm may have lower performance metrics but could be computationally less expensive or easier to interpret, depending on the specific requirements of the application.
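
A single train/test split gives one estimate per model; cross-validation gives a spread. A hedged sketch of how both classifiers could be compared with `cross_val_score` inside a pipeline, run here on a tiny made-up corpus (`docs_toy`, `labels_toy` and the model names are illustrative, and the scores on such a small sample say nothing about the real dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

docs_toy = [
    "You won't believe what this cat did next",
    "21 things only coffee lovers will understand",
    "Which pizza topping are you",
    "This dog video will make your day",
    "17 photos that will make you laugh",
    "Can you guess the movie from one emoji",
    "Parliament passes annual budget bill",
    "Earthquake strikes northern coastal region",
    "Central bank raises interest rates",
    "Court rules on election dispute",
    "Scientists publish new climate report",
    "Airline announces merger with rival carrier",
]
labels_toy = [1] * 6 + [0] * 6

# Putting the vectorizer in the pipeline refits it on each training fold,
# so no vocabulary or IDF statistics leak from the held-out fold.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "rocchio": make_pipeline(TfidfVectorizer(), NearestCentroid()),
}

cv_scores = {
    name: cross_val_score(model, docs_toy, labels_toy, cv=3, scoring="f1")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(name, scores.mean())
```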

**QB 2(b)**<b> Discuss the differences between Naive Bayes and Rocchio classifiers in terms of underlying assumptions, training process, and performance on different types of datasets </b>

Naive Bayes and Rocchio classifiers differ in terms of their underlying assumptions, training processes, and performance on different types of datasets.

**Underlying Assumptions**:
- **Naive Bayes**:
	<br> Assumption: Naive Bayes classifiers are based on Bayes' theorem and assume that features are conditionally independent given the class label. This is a strong and often unrealistic assumption, especially in cases where features are correlated.



- **Rocchio**:
    <br>Assumption: Rocchio classifiers are based on the vector space model and assume that documents belonging to the same class should be close to each other in the feature space. It doesn't assume independence among features.


<b> Training Process </b>

- **Naive Bayes**: 
<br> Estimates the probability of each feature value occurring given a class label using the training data. It then uses Bayes' theorem to calculate the probability of a class label given a new data point. The model can be incrementally updated with new training data.




- **Rocchio**: 
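<br> Calculates a centroid vector for each class by averaging the feature vectors of the training data points belonging to that class. New data points are assigned to the class of the nearest centroid: the classic Rocchio formulation measures closeness with cosine similarity, whereas scikit-learn's `NearestCentroid`, used above, defaults to Euclidean distance. Because a centroid is a mean, Rocchio can be sensitive to outliers. The prototype vectors can be updated iteratively as new labelled data arrives.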
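
The two training procedures described above can be sketched from scratch with NumPy. This is a toy illustration on a hypothetical 4-document, 3-term count matrix, with Laplace smoothing for Naive Bayes and cosine similarity for Rocchio; it is not how scikit-learn implements either model, and all names are illustrative.

```python
import numpy as np

# Hypothetical term-count matrix: rows are documents, columns are vocabulary terms
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 2, 1],
                     [0, 3, 0]], dtype=float)
y_docs = np.array([1, 1, 0, 0])
classes = np.unique(y_docs)  # array([0, 1])

# --- Multinomial Naive Bayes: class priors and smoothed term likelihoods ---
log_prior = np.log(np.array([(y_docs == c).mean() for c in classes]))
term_counts = np.array([X_counts[y_docs == c].sum(axis=0) for c in classes]) + 1.0  # Laplace smoothing
log_likelihood = np.log(term_counts / term_counts.sum(axis=1, keepdims=True))

def predict_nb(x):
    # log P(c) + sum_t x_t * log P(t | c), maximised over classes
    return classes[np.argmax(log_prior + log_likelihood @ x)]

# --- Rocchio: one centroid per class, nearest by cosine similarity ---
centroids = np.array([X_counts[y_docs == c].mean(axis=0) for c in classes])

def predict_rocchio(x):
    sims = centroids @ x / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(x))
    return classes[np.argmax(sims)]

new_doc = np.array([2.0, 0.0, 1.0])
pred_nb = predict_nb(new_doc)
pred_rocchio = predict_rocchio(new_doc)
print(pred_nb, pred_rocchio)
```

Naive Bayes stores per-class term probabilities, while Rocchio stores a single mean vector per class; both need only one pass over the training data.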

<b> Performance on Different Types of Datasets: </b>

**Naive Bayes**:
- Generally performs well with categorical features and datasets where the independence assumption is somewhat reasonable.
- Copes well with high-dimensional sparse text features, as the 20,022-term vocabulary above shows, but can struggle when features are strongly correlated or interact in complex ways.


**Rocchio**:
- Often performs well with text data represented using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
- May not handle datasets with non-convex or multimodal class distributions well (where a class can't be neatly represented by a single centroid).
- Sensitive to outliers and noise in the training data.
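
The single-centroid limitation can be made concrete with a small synthetic example. In this sketch (all data hypothetical), class 0 forms two clusters on either side of class 1, so both class centroids land near the middle; a 1-nearest-neighbour classifier is included only as a contrast that does not rely on one prototype per class:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# One feature: class 0 sits at both extremes, class 1 in the middle
X_multi = np.array([[-3.0], [-2.8], [3.0], [3.4], [-0.5], [0.0], [0.5], [0.4]])
y_multi = np.array([0, 0, 0, 0, 1, 1, 1, 1])

centroid_clf = NearestCentroid().fit(X_multi, y_multi)
knn_clf = KNeighborsClassifier(n_neighbors=1).fit(X_multi, y_multi)

# Training accuracy is enough to expose the problem here
acc_centroid = centroid_clf.score(X_multi, y_multi)
acc_knn = knn_clf.score(X_multi, y_multi)

print(centroid_clf.centroids_.ravel())
print(acc_centroid, acc_knn)
```

Because the two centroids almost coincide, the nearest-centroid rule cannot separate the classes even on its own training data, although the classes do not overlap at all.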