
# Problem Statement :
## Fake News Classification with The Help Of Natural Language Processing Technique.
Fake news detection is a hot topic in the field of natural language processing. We consume news through several mediums throughout the day in our daily routine, but sometimes it becomes difficult to decide which one is fake and which one is authentic. Our job is to create a model which predicts whether a given news is real or fake.

## Dataset Description

id: unique id for a news article

title: the title of a news article

author: author of the news article

text: the text of the article; could be incomplete

label: a label that marks the article as potentially unreliable

  1: unreliable
  
  0: reliable
 
 ### Data Collection
- Dataset Source - https://zenodo.org/record/4561253/files/WELFake_Dataset.csv?download=1
- The data consists of 5 column and 20800 rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer,TfidfTransformer,CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

from tqdm import tqdm # printing the status bar
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore")
pd.pandas.set_option("display.max_columns", None)

### 1. Reading Data

In [2]:
df = pd.read_csv("data/News_dataset.csv")

print("Shape: ", df.shape)
df.head()

Shape:  (20800, 5)


Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1



### 2. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check various categories present in the different categorical column

### 2.1 Check Missing values

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [4]:
#df.isna().sum()
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

<b>Observation:</b>  There are missing values in the given dataset
- 558 titles features are missing
- 1957 author details are missing
- 39 text features are missing 

### 2.2 Check duplicate values

In [5]:
df.duplicated().sum()

0

In [6]:
df.columns

Index(['id', 'title', 'author', 'text', 'label'], dtype='object')

In [7]:
print("Duplicate values Features wise:")
#print("Title: ", df["title"].duplicated().sum())
#print("Author: ", df["author"].duplicated().sum())
print("text: ", df["text"].duplicated().sum())

Duplicate values Features wise:
text:  413


In [8]:
df[df["text"].duplicated()]

Unnamed: 0,id,title,author,text,label
169,169,Mohamad Khweis: Another “Virginia Man” (Palest...,James Fulford,,1
295,295,A Connecticut Reader Reports Record Voter Regi...,VDARE.com Reader,,1
470,470,BULLETIN: There ARE Righteous Jews For Trump!;...,admin,,1
480,480,Watch: Muslim ‘Palestinians’ Declare “We follo...,admin,jewsnews © 2015 | JEWSNEWS | It's not news...u...,1
573,573,Le top des recherches Google passe en top des ...,,,1
...,...,...,...,...,...
20728,20728,Trump warns of World War III if Clinton is ele...,,Email Donald Trump warned in an interview Tues...,1
20749,20749,Realities Faced by Black Canadians are a Natio...,Anonymous,"Tweet Widget by Robyn Maynard \nCanada, includ...",1
20750,20750,Why Did Four Googles Kill This White?,Andrew Anglin,Migrant Crisis Disclaimer \nWe here at the Dai...,1
20754,20754,No More American Thanksgivings,Glen Ford,Thanksgiving by Glen Ford \n“The core ideologi...,1


In [9]:
## check the percentage of null values present in each feature
feature_na =[feature for feature in df.columns if df[feature].isnull().sum() > 0]

for feature in feature_na:
    print(feature, np.round(df[feature].isnull().mean(),4)*100, " % missing values")

title 2.68  % missing values
author 9.41  % missing values
text 0.19  % missing values


In [10]:
df_text_na = df[df["text"].isnull()]
# Deleteting the rows where text features is null
df.drop(df[df["text"].isnull()].index, inplace = True, axis=0)
df.drop(df[df["id"].isnull()].index, inplace = True, axis=0)
#Update the other features null values with "missing"
df["title"].fillna("missing", inplace=True)
df["author"].fillna("missing", inplace=True)
df.reset_index(inplace=True)
#df.head()

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20761 entries, 0 to 20760
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   20761 non-null  int64 
 1   id      20761 non-null  int64 
 2   title   20761 non-null  object
 3   author  20761 non-null  object
 4   text    20761 non-null  object
 5   label   20761 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 973.3+ KB


In [12]:
#Sorting data according to text in ascending order
sorted_data=df.sort_values('text', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [13]:
sorted_data[sorted_data["text"].duplicated()]

Unnamed: 0,index,id,title,author,text,label
19596,19635,19635,"14 Days to Do 14 Things, If You Want to Surviv...",Dave Hodges,\n\n \nUPDATE: HILLARY CLINTON IS AGAIN UNDER ...,1
16433,16467,16467,Russian Report Warns: American Revolution Has ...,The European Union Times,\nA sobering new Security Council ( SC ) analy...,1
19811,19850,19850,Sick Hillary Needed a Doctor in the Oval Offic...,Brenda Walker,,1
6207,6220,6220,"Radio Derb: Peak White Guilt, PC Now To The LE...",John Derbyshire,,1
20203,20242,20242,Radio Derb Transcript For October 21 Up: The M...,John Derbyshire,,1
...,...,...,...,...,...,...
518,519,519,US NATO To Attack Putin Military Drills in Rus...,Pakalert,source Add To The Conversation Using Facebook ...,1
10322,10343,10343,World war 3 Update & Death of Petrodollar,Pakalert,source Add To The Conversation Using Facebook ...,1
6877,6891,6891,Germany Forming EU Super Army Preparing For Wo...,Pakalert,source Add To The Conversation Using Facebook ...,1
5041,5052,5052,"zentak – World War 3 is coming Russia,USA,Chin...",Pakalert,source Add To The Conversation Using Facebook ...,1


In [14]:
#How many positive and negative reviews are present in our dataset?
df["label"].value_counts()

0    10387
1    10374
Name: label, dtype: int64

# 3.  Text Preprocessing.

Now that we have finished deduplication our data requires some preprocessing before we go on further with analysis and making the prediction model.

Hence in the Preprocessing phase we do the following in the order below:-

1. Begin by removing the html tags
2. Remove any punctuations or limited set of special characters like , or . or # etc.
3. Check if the word is made up of english letters and is not alpha-numeric
4. Check to see if the length of the word is greater than 2 (as it was researched that there is no adjective in 2-letters)
5. Convert the word to lowercase
6. Remove Stopwords
7. Finally Snowball Stemming the word (it was obsereved to be better than Porter Stemming)<br>

After which we collect the words used to describe positive and negative reviews

In [15]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [16]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [17]:
def preprocess(corpus):
    preprocessed = []
    for sentance in tqdm(corpus):
        #sentance = re.sub(r"http+", "", sentance)
        sentance = re.sub(r"http\S+", "", sentance)
        sentance = BeautifulSoup(sentance, 'lxml').get_text()
        sentance = decontracted(sentance)
        sentance = re.sub("\S*\d\S*", "", sentance).strip()
        sentance = re.sub('[^A-Za-z]+', ' ', sentance)    
        sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
        preprocessed.append(sentance.strip())

    return preprocessed

In [18]:
import nltk
lm = WordNetLemmatizer()
corpus = []
for i in range (len(df)):
    review = re.sub('^a-zA-Z0-9',' ', df['title'][i])
    review = review.lower()
    review = review.split()
    review = [lm.lemmatize(x) for x in review if x not in stopwords]
    review = " ".join(review)
    corpus.append(review)

In [19]:
corpus[0]


'house dem aide: didn’t even see comey’s letter jason chaffetz tweeted'

In [20]:

preprocessed_text = preprocess(df['text'].values)
preprocessed_title = preprocess(df['title'].values)
preprocessed_author = preprocess(df['author'].values)

#print(preprocessed_text[1500])
#print(preprocessed_title[1500])

100%|██████████| 20761/20761 [00:22<00:00, 918.15it/s] 
100%|██████████| 20761/20761 [00:05<00:00, 4061.35it/s]
100%|██████████| 20761/20761 [00:04<00:00, 5149.37it/s]


In [21]:
df["text"] = preprocessed_text
#df["title"] = preprocessed_title
#df["author"] = preprocessed_author

df.head()

Unnamed: 0,index,id,title,author,text,label
0,0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,house dem aide even see comey letter jason cha...,1
1,1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,ever get feeling life circles roundabout rathe...,0
2,2,2,Why the Truth Might Get You Fired,Consortiumnews.com,truth might get fired october tension intellig...,1
3,3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,videos civilians killed single us airstrike id...,1
4,4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,print iranian woman sentenced six years prison...,1


## 4.  Featurization

### BAG OF WORDS

In [22]:
##BoW
count_vect = CountVectorizer() #in scikit-learn
count_vect.fit(preprocessed_text)
print("some feature names ", count_vect.get_feature_names()[:10])
print('='*50)

final_counts = count_vect.transform(preprocessed_text)
print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

some feature names  ['aa', 'aaa', 'aaaaah', 'aaaaggg', 'aaaahhh', 'aaah', 'aaahhh', 'aaajiao', 'aaany', 'aaas']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (20761, 145197)
the number of unique words  145197


### Bi-Grams and n-Grams

In [23]:
#bi-gram, tri-gram and n-gram

#removing stop words like "not" should be avoided before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# you can choose these numebrs min_df=10, max_features=5000, of your choice
count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_text)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of out text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (20761, 5000)
the number of unique words including both unigrams and bigrams  5000


### TF-IDF

In [24]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_text)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

final_tf_idf = tf_idf_vect.transform(preprocessed_text)
print("the type of count vectorizer ",type(final_tf_idf))
print("the shape of out text TFIDF vectorizer ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])

some sample features(unique words in the corpus) ['aa', 'aa superluminal', 'aaa', 'aaron', 'aaron klein', 'aaron rodgers', 'aaronkleinshow', 'aaronkleinshow follow', 'aarp', 'ab']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer  (20761, 105469)
the number of unique words including both unigrams and bigrams  105469


### Word2Vec

In [25]:
# Train your own Word2Vec model using your own text corpus
i=0
list_of_sentance=[]
for sentance in preprocessed_text:
    list_of_sentance.append(sentance.split())

In [26]:
# Using Google News Word2Vectors

# in this project we are using a pretrained model by google
# its 3.3G file, once you load this into your memory 
# it occupies ~9Gb, so please do this step only if you have >12G of ram
# we will provide a pickle file wich contains a dict , 
# and it contains all our courpus words as keys and  model[word] as values
# To use this code-snippet, download "GoogleNews-vectors-negative300.bin" 
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.


# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
# you can comment this whole cell
# or change these varible according to your need

is_your_ram_gt_16g=False
want_to_use_google_w2v = False
want_to_train_w2v = True

if want_to_train_w2v:
    # min_count = 5 considers only words that occured atleast 5 times
    w2v_model=Word2Vec(list_of_sentance,min_count=5)
    print(w2v_model.wv.most_similar('great'))
    print('='*50)
    print(w2v_model.wv.most_similar('worst'))
    
elif want_to_use_google_w2v and is_your_ram_gt_16g:
    if os.path.isfile('GoogleNews-vectors-negative300.bin'):
        w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
        print(w2v_model.wv.most_similar('great'))
        print(w2v_model.wv.most_similar('worst'))
    else:
        print("you don't have gogole's word2vec file, keep want_to_train_w2v = True, to train your own w2v ")

[('incredible', 0.6327919363975525), ('good', 0.6126022934913635), ('amazing', 0.5777631402015686), ('proud', 0.5656858086585999), ('wonderful', 0.5285385251045227), ('fantastic', 0.5174841284751892), ('perfect', 0.4875513017177582), ('love', 0.4818818271160126), ('greatest', 0.4774805009365082), ('greatness', 0.4755479693412781)]
[('biggest', 0.5728533267974854), ('greatest', 0.5631301999092102), ('horrible', 0.5441107153892517), ('nightmare', 0.5412285327911377), ('dangerous', 0.5383963584899902), ('darkest', 0.5321401357650757), ('terrible', 0.5290238857269287), ('hardest', 0.5124144554138184), ('strongest', 0.5079717040061951), ('best', 0.5046460628509521)]


In [27]:
w2v_words = list(w2v_model.wv.index_to_key)

print("number of words that occured minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

number of words that occured minimum 5 times  53120
sample words  ['said', 'not', 'mr', 'trump', 'one', 'would', 'people', 'new', 'clinton', 'no', 'like', 'also', 'president', 'time', 'state', 'us', 'could', 'many', 'even', 'years', 'states', 'two', 'first', 'government', 'american', 'world', 'last', 'obama', 'united', 'news', 'hillary', 'year', 'get', 'may', 'campaign', 'country', 'election', 'ms', 'going', 'make', 'way', 'u', 'house', 'made', 'white', 'back', 'know', 'much', 'media', 'think']


### Converting text into vectors using wAvg W2V, TFIDF-W2V

#### Avg W2v

In [28]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance): # for each review/sentence
    sent_vec = np.zeros(100) # as word vectors are of zero length 100, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

100%|██████████| 20761/20761 [34:55<00:00,  9.91it/s]  

20761
100





#### TFIDF weighted W2v

In [29]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
model = TfidfVectorizer()
model.fit(preprocessed_text)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(model.get_feature_names(), list(model.idf_)))

In [30]:
# TF-IDF weighted Word2Vec
tfidf_feat = model.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sentance): # for each review/sentence 
    sent_vec = np.zeros(100) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1

  3%|▎         | 533/20761 [16:40<10:33:01,  1.88s/it]


KeyboardInterrupt: 