# Clickbait detector using Naive Bayes Classifier

This kernel classifies news headlines as clickbait or non-clickbait.

The clickbaits are labelled as **1** and non-clickbaits as **0**.
The headlines are collected from different news sites.

The dataset consists of 32000 headlines, of which 50% are clickbait and 50% are non-clickbait.

I have used the *Multinomial Naive Bayes* algorithm to classify the headlines in this dataset.
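
To make the idea behind Multinomial Naive Bayes concrete, here is a minimal pure-Python sketch. It is not the scikit-learn implementation used later in this kernel: it works on raw word counts (the kernel feeds TF-IDF weights instead), and the toy headlines, `train_multinomial_nb` and `predict` are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit word-count Naive Bayes with Laplace (add-alpha) smoothing."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_docs[label] / len(docs))
        log_like = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict(model, doc):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {
        label: log_prior + sum(log_like[w] for w in doc if w in log_like)
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

# Made-up toy headlines (not taken from the dataset), already tokenized.
toy_docs = [
    ["things", "only", "cat", "owners", "understand"],
    ["you", "will", "not", "believe", "these", "things"],
    ["senate", "passes", "budget", "bill"],
    ["storm", "closes", "airport"],
]
toy_labels = [1, 1, 0, 0]

toy_model = train_multinomial_nb(toy_docs, toy_labels)
print(predict(toy_model, ["things", "you", "understand"]))  # 1 (clickbait)
print(predict(toy_model, ["senate", "budget", "storm"]))    # 0 (non-clickbait)
```

Each class score is the log prior plus the summed log likelihoods of the words in the headline, and the class with the larger score wins.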

# Importing different tools and libraries

The main libraries used are *NumPy*, *Pandas*, *NLTK* (Natural Language Toolkit) and *Scikit-learn*.

In [1]:
import numpy as np 
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
import string as s
import re

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



# Loading the Dataset

In [4]:
cb_data= pd.read_csv('clickbait_data_sample.csv')
cb_data.head()

Unnamed: 0,headline,clickbait
0,Should I Get Bings?,1
1,Which TV Female Friend Group Do You Belong In,1
2,"The New ""Star Wars: The Force Awakens"" Trailer...",1
3,"This Vine Of New York On ""Celebrity Big Brothe...",1
4,A Couple Did A Stunning Photo Shoot With Their...,1


# Splitting into Train and Test sets

The dataset is split into training and testing sets: 75% of the rows are used for training and 25% for testing.

In [5]:
x=cb_data.headline
y=cb_data.clickbait
train_x,test_x,train_y,test_y=train_test_split(x,y,test_size=0.25,random_state=2)

# Analyzing Train and Test Data

In [6]:
print("No. of elements in training set")
print(train_x.size)
print("No. of elements in testing set")
print(test_x.size)

No. of elements in training set
324
No. of elements in testing set
108
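
The sizes printed above add up to 432 rows, so this run appears to have used the sample file (`clickbait_data_sample.csv`) rather than the full 32000 headlines. The small sketch below reproduces the arithmetic of a 75/25 split, assuming the test share is rounded up and the remainder goes to training; `split_sizes` is a helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test); the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

print(split_sizes(432))    # (324, 108)
print(split_sizes(32000))  # (24000, 8000)
```

The second call gives the sizes one would expect on the full dataset.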


In [7]:
train_x.head(10)

29     This Country Singer Makes Music On His Game Bo...
394            Seniors Help Michigan St. Maintain Course
358                                 Schedule, Day by Day
414                     Caged children well fed, behaved
298                   Panda gives birth at San Diego Zoo
230                  Alan Turing given posthumous pardon
425                   The Charm and Silliness of Round 1
285    Thousands evacuated after chemical truck overt...
65     29 Books Every '90s Kid Will Immediately Recog...
175    23 Important Life Lessons "The Hunger Games" H...
Name: headline, dtype: object

In [8]:
train_y.head(10)

29     1
394    0
358    0
414    0
298    0
230    0
425    0
285    0
65     1
175    1
Name: clickbait, dtype: int64

In [9]:
test_x.head()

20                 What New Thing Should You Try In 2016
386      Obama Seeks Action Against Credit Card Industry
178    17 Things Only Women Who Aren't That Into Make...
198    You'll Be In Tears After Hearing Josh Groban A...
89      42 Of The Most Romantic Lines From YA Literature
Name: headline, dtype: object

In [10]:
test_y.head()

20     1
386    0
178    1
198    1
89     1
Name: clickbait, dtype: int64

# Tokenization of Data

The data is tokenized, i.e. split into tokens, which are the smallest meaningful units. Here each headline is split on whitespace into words.

In [11]:
def tokenization(text):
    lst=text.split()
    return lst
train_x=train_x.apply(tokenization)
test_x=test_x.apply(tokenization)
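
As a quick illustration of what whitespace tokenization does (and does not do), here is the first headline of the sample run through `str.split()`. Punctuation stays attached to the word at this stage.

```python
headline = "Should I Get Bings?"   # first row shown in the dataset preview
tokens = headline.split()
print(tokens)  # ['Should', 'I', 'Get', 'Bings?']

# split() with no argument also collapses repeated whitespace
spaced = "Panda  gives\tbirth ".split()
print(spaced)  # ['Panda', 'gives', 'birth']
```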

# Converting to lowercase

The data is converted to lowercase so that the same word in different cases, like 'NLP', 'nlp' or 'Nlp', is treated as a single token.

In [12]:
def lowercasing(lst):
    new_lst=[]
    for i in lst:
        i=i.lower()
        new_lst.append(i)
    return new_lst
train_x=train_x.apply(lowercasing)
test_x=test_x.apply(lowercasing)  

# Removing punctuation

Punctuation marks are removed to reduce the number of distinct tokens. They are treated as irrelevant here because they are assumed to add little information for this task.

In [13]:
def remove_punctuations(lst):
    new_lst=[]
    for i in lst:
        for j in s.punctuation:
            i=i.replace(j,'')
        new_lst.append(i)
    return new_lst
train_x=train_x.apply(remove_punctuations)
test_x=test_x.apply(remove_punctuations)  
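
The nested loop above can also be written with a translation table. This is only a sketch of an equivalent approach, not the code used for the results in this kernel; `PUNCT_TABLE` and `strip_punctuation` are names chosen for the example.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Remove ASCII punctuation from each token (empty tokens are kept)."""
    return [tok.translate(PUNCT_TABLE) for tok in tokens]

cleaned = strip_punctuation(["bings?", "aren't", "'90s", "--"])
print(cleaned)  # ['bings', 'arent', '90s', '']
```

Like the original function, this keeps tokens that become empty, and it only covers the ASCII characters in `string.punctuation`, so typographic quotes would survive.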

# Removing Numbers

In [14]:
def remove_numbers(lst):
    nodig_lst=[]
    new_lst=[]
    for i in lst:
        for j in s.digits:    
            i=i.replace(j,'')
        nodig_lst.append(i)
    for i in nodig_lst:
        if i!='':
            new_lst.append(i)
    return new_lst
train_x=train_x.apply(remove_numbers)
test_x=test_x.apply(remove_numbers)
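
A short sketch of the same step on a list-style headline shows its effect: a leading count such as "29" disappears completely and "90s" shrinks to "s". The helper names below are chosen for this example only.

```python
import string

DIGIT_TABLE = str.maketrans("", "", string.digits)

def strip_digits(tokens):
    """Delete digits from each token, then drop tokens that became empty."""
    stripped = (tok.translate(DIGIT_TABLE) for tok in tokens)
    return [tok for tok in stripped if tok]

no_digits = strip_digits(["29", "books", "every", "90s", "kid"])
print(no_digits)  # ['books', 'every', 's', 'kid']
```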

# Removing Stopwords

In [15]:
print("All stopwords of English language ")
", ".join(stopwords.words('english'))

All stopwords of English language 


"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [16]:
stop=set(stopwords.words('english'))  # build the lookup set once, not per headline
def remove_stopwords(lst):
    return [i for i in lst if i not in stop]

train_x=train_x.apply(remove_stopwords)
test_x=test_x.apply(remove_stopwords)
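
One detail worth seeing in code: because punctuation is removed before stopwords, contractions such as "you'll" become "youll", which is no longer in the stopword list. The sketch below uses a tiny made-up stop set (`TOY_STOPWORDS`) instead of NLTK's list so that it runs on its own.

```python
import string

# Tiny illustrative stop list; the kernel uses NLTK's full English list.
TOY_STOPWORDS = {"you'll", "be", "in", "after", "aren't", "that", "into"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def filter_stopwords(tokens, stop=TOY_STOPWORDS):
    return [tok for tok in tokens if tok not in stop]

tokens = ["you'll", "be", "in", "tears"]

# Order used in this kernel: strip punctuation first, then filter
stripped_first = filter_stopwords([t.translate(PUNCT_TABLE) for t in tokens])
print(stripped_first)  # ['youll', 'tears']

# Alternative order: filter first, then strip punctuation
filtered_first = [t.translate(PUNCT_TABLE) for t in filter_stopwords(tokens)]
print(filtered_first)  # ['tears']
```

Tokens like "youll" and "arent" therefore remain in the preprocessed headlines.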

# Removing extra spaces

In [17]:
def remove_spaces(lst):
    new_lst=[]
    for i in lst:
        i=i.strip()
        new_lst.append(i)
    return new_lst
train_x=train_x.apply(remove_spaces)
test_x=test_x.apply(remove_spaces)

# Analyzing data after preprocessing

The first few headlines after preprocessing, i.e. after removing punctuation, numbers, stopwords and extra spaces, look like this.

In [18]:
train_x.head()

29     [country, singer, makes, music, game, boy, spa...
394      [seniors, help, michigan, st, maintain, course]
358                                 [schedule, day, day]
414                [caged, children, well, fed, behaved]
298               [panda, gives, birth, san, diego, zoo]
Name: headline, dtype: object

In [19]:
test_x.head()

20                                     [new, thing, try]
386       [obama, seeks, action, credit, card, industry]
178           [things, women, arent, makeup, understand]
198    [youll, tears, hearing, josh, groban, kelly, c...
89                     [romantic, lines, ya, literature]
Name: headline, dtype: object

# Lemmatization

Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. It involves the morphological analysis of words.

In lemmatization we find the root word or base form of a word rather than just clipping some characters from the end, e.g. *is, are, am* are all converted to their base form *be*.
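
The difference between looking up a base form and clipping characters can be sketched with a toy example. The mini-dictionary `TOY_LEMMAS` and both helper functions are hypothetical and only illustrate the idea; a real lemmatizer relies on a much larger lexical resource.

```python
# Hypothetical mini-dictionary; a real lemmatizer uses a far larger resource.
TOY_LEMMAS = {"is": "be", "are": "be", "am": "be", "children": "child", "gives": "give"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)

def crude_clip(word):
    """Clip a trailing 's' only, to contrast with dictionary lookup."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

for w in ["is", "are", "children", "gives"]:
    print(w, "->", toy_lemmatize(w), "| clipped:", crude_clip(w))
```

Clipping handles "gives" but leaves "is", "are" and "children" unchanged, while the lookup maps all four to a dictionary form.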

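Here lemmatization is done with NLTK's `WordNetLemmatizer`. Note that `lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so verb forms such as *is* are only reduced to *be* when it is called with `pos='v'`.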

In [23]:
nltk.download('wordnet')
lemmatizer=nltk.stem.WordNetLemmatizer()
def lemmatization(lst):
    # lemmatize() defaults to pos='n', so each token is treated as a noun
    new_lst=[]
    for i in lst:
        i=lemmatizer.lemmatize(i)
        new_lst.append(i)
    return new_lst
train_x=train_x.apply(lemmatization)
test_x=test_x.apply(lemmatization)

[nltk_data] Downloading package wordnet to /Users/vw9yis9/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [24]:
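# Join tokens back into one string per headline; rows that are already strings are left alone so the cell can be re-run safely
join_tokens=lambda x: ' '.join(x) if isinstance(x,list) else x
train_x=train_x.apply(join_tokens)
test_x=test_x.apply(join_tokens)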

In [25]:
freq_dist={}
for i in train_x.head(20):
    x=i.split()
    for j in x:
        if j not in freq_dist.keys():
            freq_dist[j]=1
        else:
            freq_dist[j]+=1
freq_dist

{'c': 28,
 'o': 37,
 'u': 25,
 'n': 57,
 't': 43,
 'r': 49,
 'y': 18,
 's': 34,
 'i': 55,
 'g': 20,
 'e': 89,
 'm': 24,
 'a': 54,
 'k': 8,
 'b': 12,
 'p': 23,
 'h': 20,
 'l': 33,
 'd': 31,
 'w': 8,
 'f': 7,
 'v': 10,
 'z': 2,
 'j': 1}

Note that the keys above are single characters rather than words. This is what happens when the joining cell is applied to headlines that are already strings: the join then iterates over characters and puts a space between them. With one string of space-separated words per headline, this cell counts whole words instead.

# TF-IDF (Term Frequency-Inverse Document Frequency)

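This method converts each headline into a numeric feature vector: a term gets a higher weight the more often it occurs in a headline and a lower weight the more headlines it appears in.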
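
To show how such weights are computed, here is a minimal pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is my understanding of the `TfidfVectorizer` defaults; `tfidf_rows` is a helper written for this example and skips the vectorizer's own tokenization.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Three preprocessed headlines taken from the outputs shown earlier
docs = [
    ["schedule", "day", "day"],
    ["panda", "gives", "birth", "san", "diego", "zoo"],
    ["new", "thing", "try"],
]
rows = tfidf_rows(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first headline "day" occurs twice, so its weight is twice that of "schedule" before and after normalisation.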

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
train_1=tfidf.fit_transform(train_x)
test_1=tfidf.transform(test_x)

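ValueError: empty vocabulary; perhaps the documents only contain stop words

The vocabulary is empty in this run because the headlines had been reduced to single characters, as the frequency output above shows, and the default `token_pattern` of `TfidfVectorizer` only keeps tokens with at least two word characters. The outputs in the remaining cells appear to come from a separate run on the full dataset, where vectorization succeeded.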

In [22]:
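feature_names=tfidf.get_feature_names_out()  # replaces get_feature_names(), which newer scikit-learn releases no longer provide
print("Number of features extracted")
print(len(feature_names))
print()
print("The 100 features extracted from TF-IDF ")
print(feature_names[:100].tolist())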


Number of features extracted
18248

The 100 features extracted from TF-IDF 
['aa', 'aaevpc', 'aaron', 'ab', 'abandon', 'abandoned', 'abandoning', 'abba', 'abbas', 'abbey', 'abbott', 'abby', 'abc', 'abdallahi', 'abdelbaset', 'abdicates', 'abduct', 'abducted', 'abduction', 'abductor', 'abdul', 'abdullah', 'abdulmutallab', 'abel', 'abercrombie', 'abhishek', 'abhisit', 'abide', 'ability', 'abin', 'abitibibowater', 'abject', 'abkhazia', 'ablaze', 'able', 'aboard', 'abolish', 'abombs', 'aborigine', 'aborted', 'abortion', 'abortionrights', 'aboulhosn', 'abound', 'abraham', 'abramoff', 'abrams', 'abroad', 'abrogates', 'abse', 'absence', 'absentee', 'absolute', 'absolutely', 'absolutereturn', 'absorbs', 'abstention', 'abstinence', 'absurd', 'absurdity', 'absurdly', 'abu', 'abuela', 'abuelita', 'abundant', 'abuse', 'abused', 'abuserelated', 'abusing', 'abusive', 'ac', 'academia', 'academic', 'academy', 'acapella', 'acc', 'accelerates', 'accelerating', 'accelerator', 'accent', 'accenture', 'accep

In [23]:
print("Shape of train set",train_1.shape)
print("Shape of test set",test_1.shape)

Shape of train set (24000, 18248)
Shape of test set (8000, 18248)


In [24]:
train_arr=train_1.toarray()
test_arr=test_1.toarray()

# Define Naive Bayes Classifier

In [25]:
NB_MN=MultinomialNB()


# Training the model

In [26]:
NB_MN.fit(train_arr,train_y)
pred=NB_MN.predict(test_arr)
print('first 20 actual labels: ',test_y.tolist()[:20])
print('first 20 predicted labels: ',pred.tolist()[:20])

first 20 actual labels:  [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1]
first 20 predicted labels:  [0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1]


# Evaluation of Result

The accuracy and F1 score on the test set are printed to evaluate the classifier.

In [27]:
from sklearn.metrics import f1_score,accuracy_score
print("F1 score of the model")
print(f1_score(test_y,pred))
print("Accuracy of the model")
print(accuracy_score(test_y,pred))
print("Accuracy of the model in percentage")
print(accuracy_score(test_y,pred)*100,"%")

F1 score of the model
0.9632343959936486
Accuracy of the model
0.962375
Accuracy of the model in percentage
96.2375 %


In [28]:
from sklearn.metrics import confusion_matrix
print("Confusion Matrix")
print(confusion_matrix(test_y,pred))

from sklearn.metrics import classification_report
print("Classification Report")
print(classification_report(test_y,pred))


Confusion Matrix
[[3756  169]
 [ 132 3943]]
Classification Report
              precision    recall  f1-score   support

           0       0.97      0.96      0.96      3925
           1       0.96      0.97      0.96      4075

    accuracy                           0.96      8000
   macro avg       0.96      0.96      0.96      8000
weighted avg       0.96      0.96      0.96      8000
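
The reported scores can be checked by hand from the confusion matrix printed above. The sketch below assumes the scikit-learn layout, where rows are actual classes and columns are predicted classes, with class 1 (clickbait) as the positive class.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted
tn, fp = 3756, 169
fn, tp = 132, 3943

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f}")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy and F1 values agree with the numbers printed by `accuracy_score` and `f1_score`.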



## A second model for TF-IDF with different n-grams and fixed feature size
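
With `ngram_range=(1,3)` the vectorizer builds features from single words, word pairs and word triples. The sketch below shows which n-grams that produces for one tokenized headline; `word_ngrams` is a helper written for this illustration, not part of scikit-learn. As far as I know, `max_features=6500` then keeps only the 6500 most frequent of these terms across the corpus.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams(["panda", "gives", "birth", "san"])
print(grams)
```

For four tokens this gives four unigrams, three bigrams and two trigrams.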

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(ngram_range=(1,3),max_features=6500)
train_2=tfidf.fit_transform(train_x)
test_2=tfidf.transform(test_x)

NB_MN.fit(train_2.toarray(),train_y)
pred2=NB_MN.predict(test_2.toarray())

print("Accuracy of the model in percentage")
print(accuracy_score(test_y,pred2)*100,"%")

from sklearn.metrics import confusion_matrix
print("Confusion Matrix")
print(confusion_matrix(test_y,pred2))

from sklearn.metrics import classification_report
print("Classification Report")
print(classification_report(test_y,pred2))

Accuracy of the model in percentage
95.7125 %
Confusion Matrix
[[3774  151]
 [ 192 3883]]
Classification Report
              precision    recall  f1-score   support

           0       0.95      0.96      0.96      3925
           1       0.96      0.95      0.96      4075

    accuracy                           0.96      8000
   macro avg       0.96      0.96      0.96      8000
weighted avg       0.96      0.96      0.96      8000

