# fake-and-real-news-dataset

###  Introduction

> - Fake news is misinformation or disinformation that spreads by word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.
> - Fake news spreads faster than real news and creates problems and fear among groups and in society. Here we address this problem with classical NLP techniques and classify whether a given message or text is real or fake.
> - We use a bag of n-grams to represent the text and apply different classification algorithms. scikit-learn's `CountVectorizer` has a built-in implementation of bag of words and bag of n-grams.
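
To make the bag of n-grams idea concrete, here is a minimal library-free sketch. It is not what `CountVectorizer` does internally: the whitespace tokenizer and the helper names `ngrams` and `bag_of_ngrams` are assumptions made for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, ngram_range=(1, 1)):
    """Count every n-gram of the text for n in the inclusive ngram_range."""
    tokens = text.lower().split()  # simplified tokenizer (assumption)
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        counts.update(ngrams(tokens, n))
    return counts

bag = bag_of_ngrams("fake news spreads faster than real news", ngram_range=(1, 2))
print(bag["news"])       # how often the unigram "news" occurs
print(bag["fake news"])  # how often the bigram "fake news" occurs
```

Word order inside each n-gram is kept, but the order of the n-grams themselves is discarded, which is why the representation is called a "bag".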



#### About Data: Fake News Detection

Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

>- The data comes as two files, `Fake.csv` and `True.csv`, each with the columns `title`, `text`, `subject` and `date`.

>- The text (title plus article body) is the statement or message regarding a particular event/situation.

>- The label, added below as the `is_true` column, tells whether the given text is fake (0) or real (1).

>- As there are only 2 classes, this is a binary classification problem.

In [20]:
# Importing libraries
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report 
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
import spacy

In [3]:
df1=pd.read_csv('Fake.csv')
df2=pd.read_csv('True.csv')

In [4]:
df1.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [5]:
# Label the rows of each dataframe: 1 for true news (df2), 0 for fake news (df1)
df1['is_true']=0
df2['is_true']=1

In [6]:
df=pd.concat([df1,df2[:21417]],axis=0)
df.head()

Unnamed: 0,title,text,subject,date,is_true
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [7]:
df_shuffeld=df.sample(frac=1).reset_index(drop=True)
df_shuffeld.head()

Unnamed: 0,title,text,subject,date,is_true
0,HOLY MOLY! Rebel Media Uncovers ILLEGAL USA-Ca...,You ll never guess which mainstream TV network...,politics,"Apr 14, 2017",0
1,N.J. Democrats divided on renewing 'Bridgegate...,(This March 30 story was corrected to note Pr...,politicsNews,"March 30, 2017",1
2,Arab states urge U.S. to abandon Jerusalem mov...,CAIRO (Reuters) - Arab foreign ministers on Su...,worldnews,"December 10, 2017",1
3,Pakistani court issues arrest warrant for ex-P...,ISLAMABAD (Reuters) - A Pakistani court issued...,worldnews,"October 26, 2017",1
4,BREAKING: Sources Reveal Trump HIMSELF Met Wi...,In what could be the most damning information ...,News,"March 7, 2017",0


In [8]:
df_shuffeld.shape

(44898, 5)

In [9]:
df_shuffeld.is_true.value_counts()

is_true
0    23481
1    21417
Name: count, dtype: int64

#### We want exactly the same number of samples in each class, so we drop the surplus fake rows

In [10]:
label_rows=df_shuffeld[df_shuffeld['is_true']== 0]
remove_rows=label_rows.sample(2064,random_state=50)
df_shuffeld=df_shuffeld.drop(remove_rows.index)

In [11]:
df_shuffeld.is_true.value_counts()

is_true
0    21417
1    21417
Name: count, dtype: int64
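
The 2064 rows removed above are the surplus of the larger class (23481 - 21417). The same random undersampling idea can be sketched without pandas. The helper name `undersample` and the toy labels are assumptions for illustration, not part of the notebook.

```python
import random
from collections import Counter

def undersample(labels, seed=50):
    """Return sorted indices that keep every class at the minority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # randomly keep `target` rows of each class, dropping the surplus
        kept.extend(rng.sample(indices, target))
    return sorted(kept)

labels = [0] * 7 + [1] * 5   # toy labels: 7 fake, 5 real
kept = undersample(labels)
print(Counter(labels[i] for i in kept))
```

Computing the target size from the data avoids hard-coding the number of rows to drop, so the step keeps working if the input files change.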

In [12]:
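# join title and body with a space so the last title word does not merge with the first body word
df_shuffeld['all']=df_shuffeld['title']+' '+df_shuffeld['text']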
df_shuffeld


Unnamed: 0,title,text,subject,date,is_true,all
0,HOLY MOLY! Rebel Media Uncovers ILLEGAL USA-Ca...,You ll never guess which mainstream TV network...,politics,"Apr 14, 2017",0,HOLY MOLY! Rebel Media Uncovers ILLEGAL USA-Ca...
1,N.J. Democrats divided on renewing 'Bridgegate...,(This March 30 story was corrected to note Pr...,politicsNews,"March 30, 2017",1,N.J. Democrats divided on renewing 'Bridgegate...
2,Arab states urge U.S. to abandon Jerusalem mov...,CAIRO (Reuters) - Arab foreign ministers on Su...,worldnews,"December 10, 2017",1,Arab states urge U.S. to abandon Jerusalem mov...
3,Pakistani court issues arrest warrant for ex-P...,ISLAMABAD (Reuters) - A Pakistani court issued...,worldnews,"October 26, 2017",1,Pakistani court issues arrest warrant for ex-P...
4,BREAKING: Sources Reveal Trump HIMSELF Met Wi...,In what could be the most damning information ...,News,"March 7, 2017",0,BREAKING: Sources Reveal Trump HIMSELF Met Wi...
...,...,...,...,...,...,...
44893,Spain plans new elections in Catalonia to end ...,MADRID (Reuters) - The Spanish government has ...,worldnews,"October 20, 2017",1,Spain plans new elections in Catalonia to end ...
44894,Lawsuit says North Carolina bathroom law still...,"WINSTON-SALEM, N.C. (Reuters) - Transgender pe...",politicsNews,"July 21, 2017",1,Lawsuit says North Carolina bathroom law still...
44895,RAND PAUL: SOMEBODY WAS SPYING On Trump Campai...,FOX News Neil Cavuto asked Senator Rand Paul ...,politics,"Mar 22, 2017",0,RAND PAUL: SOMEBODY WAS SPYING On Trump Campai...
44896,WH Official: We Will Keep Saying ‘Fake News’ ...,"Sebastian Gorka, deputy assistant to Donald Tr...",News,"February 8, 2017",0,WH Official: We Will Keep Saying ‘Fake News’ ...


### Modelling without Pre-processing Text data

In [13]:
X_train,X_test,y_train,y_test=train_test_split(df_shuffeld['all'],df_shuffeld['is_true'],test_size=0.2,random_state=2022)

In [14]:
X_train.head()

29664    Lebanon's president rejects terrorism suggesti...
36351    WATCH: GOP REP DAVE BRAT TURNED TABLES On #Fak...
17920    WHY THIS BLUE-COLLAR DEMOCRAT STRONGHOLD Count...
7294     Myanmar's Suu Kyi to visit China amid Western ...
26376    Brazil's Temer says pension bill 2 to 3 dozen ...
Name: all, dtype: object

In [15]:
len(y_train)

34267
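
The length above follows from the 80/20 split of the balanced dataset. A quick arithmetic check, assuming the test share is rounded up (which matches the 8567 test rows reported in the classification reports of this notebook):

```python
import math

n_samples = 21417 * 2          # balanced dataset: 42834 rows
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # assumption: test share rounded up
n_train = n_samples - n_test

print(n_train, n_test)
```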

**1st Attempt**
> - We use the scikit-learn `Pipeline` class to create a classification pipeline.
> - We build three pipelines with different `ngram_range` values: unigrams only `(1,1)`, unigrams and bigrams `(1,2)`, and unigrams through trigrams `(1,3)`. All three use KNN as the model.
> - We use KNN with `n_neighbors=10` and `'euclidean'` as the distance metric.
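
As a rough picture of what the KNN step does with count vectors, here is a minimal library-free sketch. It uses `k=3` and a made-up toy corpus rather than the notebook's data, and the helper names `euclidean` and `knn_predict` are assumptions for illustration. It does not reproduce scikit-learn's implementation.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two sparse count vectors stored as Counters."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training vectors."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: euclidean(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy corpus (made up for illustration): 0 = fake, 1 = real
docs = [
    ("shocking secret they hide", 0),
    ("you will not believe this shocking video", 0),
    ("shocking truth exposed", 0),
    ("senate passes budget bill", 1),
    ("court issues ruling on budget", 1),
    ("minister announces budget plan", 1),
]
vectors = [Counter(text.split()) for text, _ in docs]
labels = [label for _, label in docs]

pred_fake = knn_predict(vectors, labels, Counter("shocking secret video".split()))
pred_real = knn_predict(vectors, labels, Counter("senate budget ruling".split()))
print(pred_fake, pred_real)
```

KNN stores the training vectors and defers all distance computations to prediction time, so `fit` is cheap and `predict` is the expensive step on a large vocabulary.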

In [17]:
# Unigrams only
clf1 = Pipeline([
    ('my_counterize', CountVectorizer(ngram_range=(1, 1))),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])
# Unigrams and bigrams
clf2 = Pipeline([
    ('my_counterize', CountVectorizer(ngram_range=(1, 2))),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])
# Unigrams, bigrams and trigrams
clf3 = Pipeline([
    ('my_counterize', CountVectorizer(ngram_range=(1, 3))),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

In [18]:
clf1.fit(X_train,y_train)


In [None]:
clf2.fit(X_train,y_train)
clf3.fit(X_train,y_train)

In [20]:
y_pred1=clf1.predict(X_test)
y_pred2=clf2.predict(X_test)
y_pred3=clf3.predict(X_test)


In [21]:
print(classification_report(y_test,y_pred2))

              precision    recall  f1-score   support

           0       0.78      0.85      0.81      4387
           1       0.83      0.76      0.79      4180

    accuracy                           0.80      8567
   macro avg       0.81      0.80      0.80      8567
weighted avg       0.80      0.80      0.80      8567
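
The f1-score column in this report is the harmonic mean of precision and recall. A small check with the rounded values printed above (because the inputs are already rounded, such a check can in general be off by 0.01; the helper name `f1_from` is ours):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed for each class in the report
f1_class0 = round(f1_from(0.78, 0.85), 2)
f1_class1 = round(f1_from(0.83, 0.76), 2)
print(f1_class0, f1_class1)
```

The macro average is the unweighted mean of the per-class scores.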



In [22]:
print(classification_report(y_test,y_pred1))

              precision    recall  f1-score   support

           0       0.86      0.89      0.87      4387
           1       0.88      0.85      0.86      4180

    accuracy                           0.87      8567
   macro avg       0.87      0.87      0.87      8567
weighted avg       0.87      0.87      0.87      8567



In [23]:
print(classification_report(y_test,y_pred3))

              precision    recall  f1-score   support

           0       0.73      0.79      0.76      4387
           1       0.76      0.69      0.72      4180

    accuracy                           0.74      8567
   macro avg       0.74      0.74      0.74      8567
weighted avg       0.74      0.74      0.74      8567



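The unigram model is the most effective in this case: its accuracy is 0.87, against 0.80 for unigrams plus bigrams and 0.74 for unigrams through trigrams.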
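
One possible contributor, which this notebook does not verify, is that the number of features grows quickly with the n-gram range, so the count vectors get sparser. The sketch below only illustrates that growth on a made-up sentence. The helper name `distinct_ngrams` is an assumption.

```python
def distinct_ngrams(tokens, ngram_range):
    """Collect the distinct n-grams for every n in the inclusive range."""
    low, high = ngram_range
    return {
        tuple(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    }

tokens = "the court said the court would rule on the case".split()
sizes = {r: len(distinct_ngrams(tokens, r)) for r in [(1, 1), (1, 2), (1, 3)]}
for ngram_range, size in sizes.items():
    print(ngram_range, size)
```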

**2nd Attempt**
> - The same setup as the first attempt, but with the KNN distance metric changed to `'cosine'`.
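
To see how the two metrics differ on count vectors, here is a small library-free sketch. Cosine distance is taken as 1 minus cosine similarity, and the example text is made up. A text and the same text repeated three times are far apart in Euclidean terms but have a cosine distance of (numerically) zero, because cosine ignores vector length.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Euclidean distance between two sparse count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def cosine_distance(a, b):
    """1 - cosine similarity between two non-empty sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

short_doc = Counter("markets fall on rate fears".split())
long_doc = Counter(("markets fall on rate fears " * 3).split())  # same text, three times

print(euclidean_distance(short_doc, long_doc))
print(cosine_distance(short_doc, long_doc))
```

Since the articles in this dataset vary a lot in length, a length-insensitive metric is a reasonable thing to try.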

In [16]:
# Unigrams only
clf1 = Pipeline([
    ('my_counterize', CountVectorizer(ngram_range=(1, 1))),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='cosine'))
])
# Unigrams and bigrams
clf2 = Pipeline([
    ('my_counterize', CountVectorizer(ngram_range=(1, 2))),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='cosine'))
])
# Unigrams, bigrams and trigrams
clf3 = Pipeline([
    ('my_counterize', CountVectorizer(ngram_range=(1, 3))),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='cosine'))
])

In [25]:
clf1.fit(X_train,y_train)
clf2.fit(X_train,y_train)
clf3.fit(X_train,y_train)

In [26]:
y_pred1=clf1.predict(X_test)
y_pred2=clf2.predict(X_test)
y_pred3=clf3.predict(X_test)

In [27]:
print(classification_report(y_test,y_pred1))

              precision    recall  f1-score   support

           0       0.85      0.92      0.88      4387
           1       0.91      0.83      0.87      4180

    accuracy                           0.88      8567
   macro avg       0.88      0.87      0.88      8567
weighted avg       0.88      0.88      0.88      8567



In [28]:
print(classification_report(y_test,y_pred2))

              precision    recall  f1-score   support

           0       0.73      0.95      0.83      4387
           1       0.93      0.63      0.75      4180

    accuracy                           0.80      8567
   macro avg       0.83      0.79      0.79      8567
weighted avg       0.83      0.80      0.79      8567



In [29]:
print(classification_report(y_test,y_pred3))

              precision    recall  f1-score   support

           0       0.65      0.97      0.78      4387
           1       0.94      0.45      0.61      4180

    accuracy                           0.72      8567
   macro avg       0.80      0.71      0.69      8567
weighted avg       0.79      0.72      0.70      8567



**3rd Attempt**
>- As before, we use a scikit-learn pipeline to classify the data, this time with `CountVectorizer` restricted to trigrams only. The classifier is **RandomForest**. Let's see what the classification report gives.

In [17]:
#1. create a pipeline object
pipe_forest=Pipeline([
    ('count',CountVectorizer(ngram_range=(3,3))),
    ('random',RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe_forest.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred=pipe_forest.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      4322
           1       0.97      0.97      0.97      4245

    accuracy                           0.97      8567
   macro avg       0.97      0.97      0.97      8567
weighted avg       0.97      0.97      0.97      8567



**4th attempt**
>- A pipeline that classifies the data using `CountVectorizer` with trigrams only (`ngram_range=(3,3)`, as in the code below) and Multinomial Naive Bayes as the classifier, with an alpha value of 0.75.
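
The `alpha` parameter is the additive smoothing term: each feature's likelihood for a class is estimated as (count + alpha) / (total count + alpha * vocabulary size), so features never seen in a class still get a non-zero probability. A minimal sketch with made-up counts (the helper name `smoothed_likelihoods` and the three features are assumptions for illustration):

```python
from collections import Counter

def smoothed_likelihoods(class_counts, vocabulary, alpha=0.75):
    """P(feature | class) with additive smoothing over the given vocabulary."""
    total = sum(class_counts[f] for f in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {f: (class_counts[f] + alpha) / denom for f in vocabulary}

vocabulary = ["you will not", "reuters the government", "sources said that"]
fake_counts = Counter({"you will not": 3})   # toy counts for the "fake" class
probs = smoothed_likelihoods(fake_counts, vocabulary)

for feature, p in probs.items():
    print(f"{feature!r}: {p:.3f}")
```

With `alpha=0` the two unseen features would get probability zero, and a single unseen n-gram in a test document would then rule out the class entirely.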

In [19]:
pipe_naive=Pipeline([
    ('count',CountVectorizer(ngram_range=(3,3))),
    ('naive',MultinomialNB(alpha=0.75))
])


#2. fit with X_train and y_train
pipe_naive.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred=pipe_naive.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.97      0.98      4322
           1       0.97      0.99      0.98      4245

    accuracy                           0.98      8567
   macro avg       0.98      0.98      0.98      8567
weighted avg       0.98      0.98      0.98      8567



***Conclusion:*** After trying several classification algorithms, we see a large improvement in the classification scores as we go from **KNeighborsClassifier** (accuracy between 0.72 and 0.88) to **RandomForestClassifier** (0.97) and **MultinomialNB** (0.98). Training was slow, however, because we did not preprocess the text. Next, we look at whether preprocessing the text can make the models more precise in less time.
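
The notebook does not record training times, so the remark about speed is qualitative. One simple way to measure it would be a small timing wrapper like the hypothetical `timed` helper below (our own name, not a library function). The workload shown is a stand-in.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in this notebook one could time e.g. timed(clf1.fit, X_train, y_train)
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total}, took {seconds:.4f}s")
```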

### Modelling with Pre-processing Text data

#### Remove stop words, punctuations and apply lemmatization

In [22]:
# Another look at the data
df_shuffeld.head()

Unnamed: 0,title,text,subject,date,is_true,all
0,HOLY MOLY! Rebel Media Uncovers ILLEGAL USA-Ca...,You ll never guess which mainstream TV network...,politics,"Apr 14, 2017",0,HOLY MOLY! Rebel Media Uncovers ILLEGAL USA-Ca...
1,N.J. Democrats divided on renewing 'Bridgegate...,(This March 30 story was corrected to note Pr...,politicsNews,"March 30, 2017",1,N.J. Democrats divided on renewing 'Bridgegate...
2,Arab states urge U.S. to abandon Jerusalem mov...,CAIRO (Reuters) - Arab foreign ministers on Su...,worldnews,"December 10, 2017",1,Arab states urge U.S. to abandon Jerusalem mov...
3,Pakistani court issues arrest warrant for ex-P...,ISLAMABAD (Reuters) - A Pakistani court issued...,worldnews,"October 26, 2017",1,Pakistani court issues arrest warrant for ex-P...
4,BREAKING: Sources Reveal Trump HIMSELF Met Wi...,In what could be the most damning information ...,News,"March 7, 2017",0,BREAKING: Sources Reveal Trump HIMSELF Met Wi...


In [23]:

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 

def preprocess(text):
    # remove stop words, punctuation and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 
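
For readers without spaCy installed, the filtering part of this step can be approximated with the standard library. This is a deliberately simplified stand-in, not an equivalent: it uses a tiny hand-written stop-word list (an assumption, spaCy's list is much larger) and does no lemmatization at all.

```python
import re

# Tiny illustrative stop-word list (assumption); spaCy's own list is much larger
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "on", "in", "that"}

def simple_preprocess(text):
    """Drop punctuation and stop words; unlike spaCy, this does not lemmatize."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)

cleaned = simple_preprocess("The court issued a warrant, on Friday, to the ex-PM.")
print(cleaned)
```

Either way the text gets shorter, which shrinks the vocabulary that `CountVectorizer` has to build.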

In [24]:

# create a new column "preprocessed_text" using the function above to get the clean data
# this will take some time, please be patient
df_shuffeld['preprocessed_text']=df_shuffeld['all'].apply(preprocess)
df_shuffeld.head()

Unnamed: 0,title,text,subject,date,is_true,all,preprocessed_text
0,HOLY MOLY! Rebel Media Uncovers ILLEGAL USA-Ca...,You ll never guess which mainstream TV network...,politics,"Apr 14, 2017",0,HOLY MOLY! Rebel Media Uncovers ILLEGAL USA-Ca...,HOLY MOLY rebel medium Uncovers ILLEGAL USA Ca...
1,N.J. Democrats divided on renewing 'Bridgegate...,(This March 30 story was corrected to note Pr...,politicsNews,"March 30, 2017",1,N.J. Democrats divided on renewing 'Bridgegate...,N.J. Democrats divide renew Bridgegate probe C...
2,Arab states urge U.S. to abandon Jerusalem mov...,CAIRO (Reuters) - Arab foreign ministers on Su...,worldnews,"December 10, 2017",1,Arab states urge U.S. to abandon Jerusalem mov...,arab state urge U.S. abandon Jerusalem stateme...
3,Pakistani court issues arrest warrant for ex-P...,ISLAMABAD (Reuters) - A Pakistani court issued...,worldnews,"October 26, 2017",1,Pakistani court issues arrest warrant for ex-P...,pakistani court issue arrest warrant ex PM Sha...
4,BREAKING: Sources Reveal Trump HIMSELF Met Wi...,In what could be the most damning information ...,News,"March 7, 2017",0,BREAKING: Sources Reveal Trump HIMSELF Met Wi...,BREAKING source reveal trump Met russian Amb...


In [25]:
# Train-test split: 20% test size, random_state=2022, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    df_shuffeld['preprocessed_text'],
    df_shuffeld['is_true'],
    stratify=df_shuffeld.is_true,
    test_size=0.2,
    random_state=2022
)

In [26]:
X_train.shape

(34267,)
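
Stratified sampling draws the test share separately within each class, so train and test keep the same class proportions. That is why the reports that follow show supports of 4283 and 4284 rather than the uneven 4387 and 4180 of the earlier unstratified split. A library-free sketch of the idea on toy labels (the helper name `stratified_split` and its simple rounding are assumptions, not scikit-learn's exact allocation):

```python
import random

def stratified_split(labels, test_size=0.2, seed=2022):
    """Return (train_idx, test_idx), sampling the test share within each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(test_size * len(indices))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [0] * 50 + [1] * 50          # toy balanced labels
train_idx, test_idx = stratified_split(labels)
n_fake = sum(labels[i] == 0 for i in test_idx)
n_real = sum(labels[i] == 1 for i in test_idx)
print(n_fake, n_real)
```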

**Let's check the scores with our best models till now**


**1st Attempt: RandomForestClassifier**
>-  CountVectorizer with only trigrams + RandomForest as the classifier

In [27]:
#1. create a pipeline object
pipe_forest=Pipeline([
    ('count',CountVectorizer(ngram_range=(3,3))),
    ('random',RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe_forest.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred=pipe_forest.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.90      0.97      0.93      4283
           1       0.97      0.89      0.93      4284

    accuracy                           0.93      8567
   macro avg       0.93      0.93      0.93      8567
weighted avg       0.93      0.93      0.93      8567



**2nd Attempt: RandomForestClassifier**
>-  CountVectorizer with unigrams, bigrams and trigrams + RandomForest as the classifier

In [28]:
#1. create a pipeline object
pipe_forest=Pipeline([
    ('count',CountVectorizer(ngram_range=(1,3))),
    ('random',RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe_forest.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred=pipe_forest.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.98      4283
           1       0.98      0.99      0.98      4284

    accuracy                           0.98      8567
   macro avg       0.98      0.98      0.98      8567
weighted avg       0.98      0.98      0.98      8567



**3rd Attempt: MultinomialNB**
>-  CountVectorizer with only trigrams + Naive Bayes as the classifier

In [29]:
pipe_naive=Pipeline([
    ('count',CountVectorizer(ngram_range=(3,3))),
    ('naive',MultinomialNB(alpha=0.75))
])


#2. fit with X_train and y_train
pipe_naive.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred=pipe_naive.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.95      0.97      4283
           1       0.96      0.99      0.97      4284

    accuracy                           0.97      8567
   macro avg       0.97      0.97      0.97      8567
weighted avg       0.97      0.97      0.97      8567



## Final observations

>- In general, the scores differ between these classification models because some of them handle the high-dimensional numeric vectors produced by the **bag of words** technique better than others. In addition, text preprocessing with **spaCy** made training our models faster, but the scores of the trigram models dropped slightly (RandomForest from 0.97 to 0.93 accuracy, MultinomialNB from 0.98 to 0.97), possibly because removing stop words and lemmatizing strips away part of the meaning carried by the original wording.

>- To conclude, **bag of words** is one of many effective techniques that make analysing and processing text much easier. **Machine learning is like a trial-and-error scientific method: we keep trying the algorithms available to us and select the one that gives good results and satisfies requirements such as latency and interpretability.**