# About News Classification Project Specific Model - Part B

Problem Statement: The main aim of this project is to analyze fake and real news and build a model that can distinguish real news from fake news, using Natural Language Processing techniques and machine learning models.

The dataset contains 5 features and 44,921 records. All features are categorical. The data contains only US-based news.

The dataset was cleaned, converted to lowercase, and lemmatized to avoid irregularities in our model.

This file includes all data cleaning, data visualization, and predictive modeling, with supporting visualizations where required.

Since we use NLP techniques to categorize the text, we tried both count vectorizer and TF-IDF representations and compared the scores for each. We also performed K-fold analysis for each model to get a more reliable accuracy estimate. After rigorous analysis, we decided to follow the TF-IDF approach and concluded that the Random Forest model is the best fit for our fake news classification.
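The comparison described above can be sketched as follows. This is a minimal illustration, not the Part A code: the toy `texts` and `labels`, the helper name `mean_cv_accuracy`, the fold count, and the tree count are all assumptions made for the example. Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is refit inside each fold, so held-out text does not influence the features.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); the real inputs would be df["text"] and df["class"].
texts = [
    "senate passes budget bill after long debate",
    "officials confirm trade talks resume monday",
    "minister announces new infrastructure plan",
    "court rules on election funding case",
    "shocking secret they do not want you to know",
    "you will not believe what this celebrity did",
    "miracle cure doctors hate finally exposed",
    "unbelievable hoax goes viral again",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Fixed seeds so the toy comparison is repeatable.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)


def mean_cv_accuracy(vectorizer):
    """Mean K-fold accuracy of a Random Forest on the given text representation."""
    pipeline = make_pipeline(
        vectorizer, RandomForestClassifier(n_estimators=50, random_state=0)
    )
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
    return scores.mean()


results = {
    "count": mean_cv_accuracy(CountVectorizer()),
    "tfidf": mean_cv_accuracy(TfidfVectorizer()),
}
print(results)
```

On a corpus this small the scores carry no meaning; the point is only the shape of the comparison.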


## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
import re
import string

### Loading the fake and real datasets

In [2]:
df_fake = pd.read_csv("Fake.csv")
df_true = pd.read_csv("True.csv")

In [3]:
df_fake.head(5)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [4]:
df_true.head(5)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


>Inserting a column called "class" into the fake and real news datasets to categorize fake and true news.

In [5]:
df_fake["class"] = 0
df_true["class"] = 1

>Removing the last 10 rows from both datasets for manual testing

In [6]:
df_fake.shape, df_true.shape

((23481, 5), (21417, 5))

In [7]:
# Keep the last 10 rows of each dataset aside, then drop them from the main frames
df_fake_manual_testing = df_fake.tail(10)
for i in range(23480, 23470, -1):
    df_fake.drop([i], axis=0, inplace=True)

df_true_manual_testing = df_true.tail(10)
for i in range(21416, 21406, -1):
    df_true.drop([i], axis=0, inplace=True)
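The same hold-out step can be written without hard-coded row indices. The sketch below is one possible variant, not the notebook's code: the helper name `split_off_tail` and the toy frame are assumptions. Taking explicit copies also avoids the `SettingWithCopyWarning` that pandas can emit when a column is later assigned on a slice.

```python
import pandas as pd


def split_off_tail(df, n=10):
    """Return (remaining, holdout), where holdout is the last n rows of df."""
    if n <= 0:
        raise ValueError("n must be positive")
    holdout = df.tail(n).copy()      # independent copy, safe to modify later
    remaining = df.iloc[:-n].copy()  # everything except the last n rows
    return remaining, holdout


# Toy stand-in for df_fake / df_true (hypothetical data).
toy = pd.DataFrame({"text": [f"article {i}" for i in range(25)], "class": 0})
remaining, holdout = split_off_tail(toy, n=10)
print(remaining.shape, holdout.shape)
```

Because the slice is positional, the helper works for any frame length and does not depend on the index labels.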

In [8]:
df_fake.shape, df_true.shape

((23471, 5), (21407, 5))

>Merging the manual testing dataframes into a single dataset and saving it to a CSV file

In [9]:
df_fake_manual_testing["class"] = 0
df_true_manual_testing["class"] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_fake_manual_testing["class"] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_true_manual_testing["class"] = 1


In [10]:
df_fake_manual_testing.head(10)

Unnamed: 0,title,text,subject,date,class
23471,Seven Iranians freed in the prisoner swap have...,"21st Century Wire says This week, the historic...",Middle-east,"January 20, 2016",0
23472,#Hashtag Hell & The Fake Left,By Dady Chery and Gilbert MercierAll writers ...,Middle-east,"January 19, 2016",0
23473,Astroturfing: Journalist Reveals Brainwashing ...,Vic Bishop Waking TimesOur reality is carefull...,Middle-east,"January 19, 2016",0
23474,The New American Century: An Era of Fraud,Paul Craig RobertsIn the last years of the 20t...,Middle-east,"January 19, 2016",0
23475,Hillary Clinton: ‘Israel First’ (and no peace ...,Robert Fantina CounterpunchAlthough the United...,Middle-east,"January 18, 2016",0
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0
23480,10 U.S. Navy Sailors Held by Iranian Military ...,21st Century Wire says As 21WIRE predicted in ...,Middle-east,"January 12, 2016",0


In [11]:
df_true_manual_testing.head(10)

Unnamed: 0,title,text,subject,date,class
21407,"Mata Pires, owner of embattled Brazil builder ...","SAO PAULO (Reuters) - Cesar Mata Pires, the ow...",worldnews,"August 22, 2017",1
21408,"U.S., North Korea clash at U.N. forum over nuc...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21409,"U.S., North Korea clash at U.N. arms forum on ...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21410,Headless torso could belong to submarine journ...,COPENHAGEN (Reuters) - Danish police said on T...,worldnews,"August 22, 2017",1
21411,North Korea shipments to Syria chemical arms a...,UNITED NATIONS (Reuters) - Two North Korean sh...,worldnews,"August 21, 2017",1
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",1
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",1
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",1
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",1
21416,Indonesia to buy $1.14 billion worth of Russia...,JAKARTA (Reuters) - Indonesia will buy 11 Sukh...,worldnews,"August 22, 2017",1


In [12]:
# Concatenating the selected rows into a single dataframe for manual testing
df_manual_testing = pd.concat([df_fake_manual_testing, df_true_manual_testing], axis=0)

# Saving manual testing dataframe to a CSV file
df_manual_testing.to_csv("manual_testing.csv")
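The saved file can later be used to spot-check a trained model. The sketch below shows one way this might look, using only the standard library. The helper `score_manual_rows`, the in-memory toy CSV, and the rule-based `toy_predict` are all hypothetical stand-ins. A real run would open `manual_testing.csv` and pass a function that cleans and vectorizes the text and returns the model's 0/1 prediction.

```python
import csv
import io


def score_manual_rows(csv_file, predict_fn):
    """Return (correct, total) for rows that have 'text' and 'class' columns."""
    correct = total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if predict_fn(row["text"]) == int(row["class"]):
            correct += 1
    return correct, total


# Toy stand-in with the same column layout as the saved file (hypothetical rows).
toy_csv = io.StringIO(
    ",title,text,subject,date,class\n"
    '0,A,"Senate votes on budget, officials say",politicsNews,"August 22, 2017",1\n'
    '1,B,"You will not believe this",News,"January 12, 2016",0\n'
)


def toy_predict(text):
    """Placeholder rule standing in for the trained model."""
    return 0 if "believe" in text.lower() else 1


result = score_manual_rows(toy_csv, toy_predict)
print(result)
```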

> Merging the main fake and true dataframes

In [13]:
df_marge = pd.concat([df_fake, df_true], axis =0 )
df_marge.head(10)

Unnamed: 0,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017",0
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017",0
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017",0
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017",0
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017",0


In [14]:
df_marge.columns

Index(['title', 'text', 'subject', 'date', 'class'], dtype='object')

## The "title", "subject" and "date" columns are not required for detecting fake news, so drop them.

In [15]:
df = df_marge.drop(["title", "subject", "date"], axis=1)

In [16]:
df.isnull().sum()

text     0
class    0
dtype: int64

## Randomly shuffling the dataframe 

In [17]:
df = df.sample(frac=1)

In [18]:
df.head()

Unnamed: 0,text,class
18908,KAMPALA (Reuters) - Fighting erupted in Uganda...,1
16037,"Meanwhile, back at CNN Russia Russia Russia!Th...",0
8354,LAS VEGAS (Reuters) - Outside political money ...,1
22943,21st Century Wire says Here s an epic discussi...,0
2449,(Note: Strong language in paragraph 3) By Ste...,1


In [19]:
# Renumber rows 0..n-1 after shuffling and discard the old index
df = df.reset_index(drop=True)

In [20]:
df.columns

Index(['text', 'class'], dtype='object')

In [21]:
df.head()

Unnamed: 0,text,class
0,KAMPALA (Reuters) - Fighting erupted in Uganda...,1
1,"Meanwhile, back at CNN Russia Russia Russia!Th...",0
2,LAS VEGAS (Reuters) - Outside political money ...,1
3,21st Century Wire says Here s an epic discussi...,0
4,(Note: Strong language in paragraph 3) By Ste...,1


> Creating a function to convert the text to lowercase and remove extra spaces, special characters, URLs, and links.

In [22]:
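def wordopt(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                # drop bracketed segments
    # Strip URLs and HTML tags first, while their punctuation is still intact;
    # if \W were replaced first, these two patterns could never match.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'\W', ' ', text)                    # remaining non-word characters become spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removes leftover underscores
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)               # drop tokens that contain digits
    return text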

In [23]:
df["text"] = df["text"].apply(wordopt)
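The order of substitutions matters in a cleaner like this one. If non-word characters are replaced before URLs are stripped, the URL pattern no longer finds `://` or `.` to match, and fragments of the link stay in the text as ordinary words. The sketch below demonstrates the difference on one sample string. The function names and the sample are assumptions for illustration only.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")


def strip_urls_first(text):
    """Remove URLs while their punctuation is intact, then drop non-word characters."""
    text = URL_RE.sub(" ", text.lower())
    text = re.sub(r"\W", " ", text)
    return " ".join(text.split())  # collapse repeated spaces


def strip_nonword_first(text):
    """Drop non-word characters first; the URL pattern can then no longer match."""
    text = re.sub(r"\W", " ", text.lower())
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


sample = "Read more at https://example.com/story now!"
print(strip_urls_first(sample))     # read more at now
print(strip_nonword_first(sample))  # read more at https example com story now
```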

>Defining the independent variable x (the text) and the dependent variable y (the class)

In [24]:
x = df["text"]
y = df["class"]

> Splitting the dataset into training set and testing set. 

In [25]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
print(y_train)

12045    1
10646    1
36773    0
35362    1
9330     1
        ..
3622     0
22919    0
26600    0
37855    0
38969    1
Name: class, Length: 33658, dtype: int64
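The split above draws a new random partition on every run because no seed is set, so the scores reported later will vary slightly between runs. A hedged sketch of a reproducible, class-balanced variant follows. The toy data, the helper name `reproducible_split`, and the seed value 42 are assumptions. `random_state` and `stratify` are standard `train_test_split` parameters.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical); in the notebook these would be x and y.
toy_x = [f"article {i}" for i in range(20)]
toy_y = [0] * 10 + [1] * 10


def reproducible_split(x, y, seed=42):
    """75/25 split with a fixed seed and class proportions preserved in both parts."""
    return train_test_split(x, y, test_size=0.25, random_state=seed, stratify=y)


x_tr, x_te, y_tr, y_te = reproducible_split(toy_x, toy_y)
print(len(x_tr), len(x_te))
```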


> Convert text to vectors

In [27]:
vectorization = TfidfVectorizer(max_features=3000)
xv_train = vectorization.fit_transform(x_train)
print(xv_train)  # sparse output: (row, column) weight; avoids a large dense .toarray() conversion
xv_test = vectorization.transform(x_test)
print(xv_test)

  (0, 994)	0.05571876088111574
  (0, 2920)	0.04206368735753837
  (0, 704)	0.05337380406044902
  (0, 492)	0.054907132698729794
  (0, 515)	0.04327057624992968
  (0, 2579)	0.1062779965176983
  (0, 87)	0.07855413556036382
  (0, 1377)	0.035754839242018714
  (0, 1860)	0.04641272558052136
  (0, 239)	0.03942989616698343
  (0, 98)	0.04993736427361893
  (0, 229)	0.039564201195579816
  (0, 383)	0.05195248575268426
  (0, 111)	0.019402471109563873
  (0, 933)	0.03820224952954742
  (0, 2573)	0.05173050502255499
  (0, 2939)	0.017833599881279652
  (0, 1545)	0.03924173079296619
  (0, 1394)	0.04882195599673138
  (0, 353)	0.048251995544880764
  (0, 2082)	0.04700441675170992
  (0, 2631)	0.04542257354875964
  (0, 2367)	0.0398970623400146
  (0, 1244)	0.01574709518241532
  (0, 1403)	0.041396939214970466
  :	:
  (33657, 259)	0.04783168156695782
  (33657, 1871)	0.026749290239421526
  (33657, 1210)	0.08241615331154463
  (33657, 2926)	0.025968445834087747
  (33657, 1091)	0.04122375301710529
  (33657, 1689)	0.0438
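Each printed entry `(row, column)  value` is the TF-IDF weight of one vocabulary term in one document. To make the numbers less opaque, the sketch below recomputes such weights by hand with the smoothed formula that `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is a simplification: tokens are split on whitespace, the function name `tfidf_vectors` and the toy documents are assumptions, and the `max_features` cap is not modeled.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors


docs = ["trump budget vote", "budget vote delayed", "celebrity hoax"]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "trump" receives a higher weight than "budget" because it appears in fewer documents.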

## Random Forest Classifier

In [28]:
# As per the analysis done in Part A, use the Random Forest model on TF-IDF features

In [29]:
RFC = RandomForestClassifier(random_state=0)
model = RFC.fit(xv_train, y_train)

In [30]:
pred_rfc = model.predict(xv_test)

In [31]:
RFC.score(xv_test, y_test)

0.996524064171123
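A single test-set accuracy can be given an approximate confidence interval. The sketch below uses the normal approximation for a proportion, which is one common choice. It is not necessarily how the interval quoted elsewhere in this notebook was obtained, since that derivation is not shown here. The function name is an assumption, and the inputs are the score above and the test-set size of 11220.

```python
import math


def accuracy_confidence_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_samples."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    margin = z * std_err  # z = 1.96 corresponds to a 95% interval
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


low, high = accuracy_confidence_interval(0.996524064171123, 11220)
print(f"95% CI: {low:.4f} to {high:.4f}")
```

The approximation is rough when accuracy is very close to 1, so the result should be read as indicative only.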

In [32]:
print(classification_report(y_test, pred_rfc))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5968
           1       1.00      1.00      1.00      5252

    accuracy                           1.00     11220
   macro avg       1.00      1.00      1.00     11220
weighted avg       1.00      1.00      1.00     11220
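The report rounds to two decimals, so every entry shows 1.00 even though a few test articles are misclassified. For reference, the per-class figures are defined from counts of true positives, false positives, and false negatives. The sketch below computes them by hand on toy labels. The function name and the label lists are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 1 = real news, 0 = fake news.
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(toy_true, toy_pred, positive=1)
print(precision, recall, f1)
```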



# Model Deployment 

### News

In [33]:
import gradio as gr

In [34]:
def output_label(n):
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Real News"
    
def predict_news(news):
    testing_news = {"text":[news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(wordopt) 
    new_x_test = new_def_test["text"]
    new_xv_test = vectorization.transform(new_x_test)
    pred_RFC = model.predict(new_xv_test)

    return output_label(pred_RFC[0])

iface = gr.Interface(fn=predict_news, inputs=['text'], outputs=['text'], 
                     theme='grass', title='-NEWS CLASSIFIER-', height=100,
                     article='**Note:** The above news classification is 99.3% - 99.5% accurate at the 95% C.I.')
iface.launch()

IMPORTANT: You are using gradio version 2.4.1, however version 2.4.2 is available, please upgrade.
--------
Running on local URL:  http://127.0.0.1:7860/
To create a public link, set `share=True` in `launch()`.


(<Flask 'gradio.networking'>, 'http://127.0.0.1:7860/', None)

# Result

We created a news classifier using Natural Language Processing (TF-IDF) and machine learning (Random Forest) techniques, with 99.5% accuracy at a 95% confidence interval, and built a web app GUI for it with Gradio.
