
About the Dataset:

title: the title of a news article
news_url: URL of the news article
source_domain: web domain where the article was published
tweet_num: number of tweets recorded for the article
real: a label that marks whether the news article is real or fake:
    1: real news
    0: fake news

Importing the Dependencies

In [2]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Pre-processing

In [8]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/FakeNewsNet[1].csv')

In [10]:
news_dataset.shape

(23196, 5)

In [12]:
# print the first 5 rows of the dataframe
news_dataset.head()

,title,news_url,source_domain,tweet_num,real
0,Kandi Burruss Explodes Over Rape Accusation on...,http://toofab.com/2017/05/08/real-housewives-a...,toofab.com,42,1
1,People's Choice Awards 2018: The best red carp...,https://www.today.com/style/see-people-s-choic...,www.today.com,0,1
2,Sophia Bush Sends Sweet Birthday Message to 'O...,https://www.etonline.com/news/220806_sophia_bu...,www.etonline.com,63,1
3,Colombian singer Maluma sparks rumours of inap...,https://www.dailymail.co.uk/news/article-33655...,www.dailymail.co.uk,20,1
4,Gossip Girl 10 Years Later: How Upper East Sid...,https://www.zerchoo.com/entertainment/gossip-g...,www.zerchoo.com,38,1


In [14]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

title,0
news_url,330
source_domain,330
tweet_num,0
real,0


In [16]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [22]:
# merging the news URL and the news title into a single text column
news_dataset['content'] = news_dataset['news_url']+' '+news_dataset['title']

In [24]:
print(news_dataset['content'])

0        http://toofab.com/2017/05/08/real-housewives-a...
1        https://www.today.com/style/see-people-s-choic...
2        https://www.etonline.com/news/220806_sophia_bu...
3        https://www.dailymail.co.uk/news/article-33655...
4        https://www.zerchoo.com/entertainment/gossip-g...
                               ...                        
23191    https://www.express.co.uk/news/royal/807049/pi...
23192    hollywoodlife.com/2018/03/13/zayn-malik-gigi-h...
23193    http://www.justjared.com/2018/01/17/jessica-ch...
23194    www.intouchweekly.com/posts/tristan-thompson-f...
23195    https://www.billboard.com/articles/news/bbma/8...
Name: content, Length: 23196, dtype: object


In [28]:
# separating the data & label
X = news_dataset.drop(columns='real')
Y = news_dataset['real']

In [30]:
print(X)
print(Y)

                                                   title  \
0      Kandi Burruss Explodes Over Rape Accusation on...   
1      People's Choice Awards 2018: The best red carp...   
2      Sophia Bush Sends Sweet Birthday Message to 'O...   
3      Colombian singer Maluma sparks rumours of inap...   
4      Gossip Girl 10 Years Later: How Upper East Sid...   
...                                                  ...   
23191  Pippa Middleton wedding: In case you missed it...   
23192  Zayn Malik & Gigi Hadid’s Shocking Split: Why ...   
23193  Jessica Chastain Recalls the Moment Her Mother...   
23194  Tristan Thompson Feels "Dumped" After Khloé Ka...   
23195  Kelly Clarkson Performs a Medley of Kendrick L...   

                                                news_url  \
0      http://toofab.com/2017/05/08/real-housewives-a...   
1      https://www.today.com/style/see-people-s-choic...   
2      https://www.etonline.com/news/220806_sophia_bu...   
3      https://www.dailymail.co.uk/news

Stemming:

Stemming is the process of reducing a word to its root form.

example: acting, acted, acts --> act
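
As a quick check of what the Porter stemmer returns, the sketch below stems a few inflected forms. It assumes only that nltk is installed; the word list is illustrative and not taken from the dataset.

```python
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Illustrative words (not from the dataset): inflected forms of "act".
words = ['acting', 'acted', 'acts']
stems = [port_stem.stem(word) for word in words]
print(stems)
```

The stemmer works on suffixes only, so the result is a root form that is not always a dictionary word (for example, 'housew' and 'peopl' in the output further down).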

In [32]:
port_stem = PorterStemmer()

In [34]:
# build the stop-word set once; a set lookup is much faster than rebuilding the list for every word
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower().split()
    # drop stop words and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    return ' '.join(stemmed_content)

In [35]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [37]:
print(news_dataset['content'])

0        http toofab com real housew atlanta kandi burr...
1        http www today com style see peopl choic award...
2        http www etonlin com news sophia bush send swe...
3        http www dailymail co uk news articl colombian...
4        http www zerchoo com entertain gossip girl yea...
                               ...                        
23191    http www express co uk news royal pippa middle...
23192    hollywoodlif com zayn malik gigi hadid split g...
23193    http www justjar com jessica chastain recal mo...
23194    www intouchweekli com post tristan thompson fe...
23195    http www billboard com articl news bbma kelli ...
Name: content, Length: 23196, dtype: object


In [39]:
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['real'].values

In [41]:
print(X)

['http toofab com real housew atlanta kandi burruss rape phaedra park porsha william kandi burruss explod rape accus real housew atlanta reunion video'
 'http www today com style see peopl choic award red carpet look peopl choic award best red carpet look'
 'http www etonlin com news sophia bush send sweet birthday messag one tree hill co star hilari burton breyton eva sophia bush send sweet birthday messag one tree hill co star hilari burton breyton eva'
 ...
 'http www justjar com jessica chastain recal moment mother boyfriend slap kick genit jessica chastain recal moment mother boyfriend slap kick genit'
 'www intouchweekli com post tristan thompson feel dump khloe kardashian tristan thompson feel dump khlo kardashian refus let move la home exclus'
 'http www billboard com articl news bbma kelli clarkson medley bbma kelli clarkson perform medley kendrick lamar humbl hit billboard music award']


In [43]:
print(Y)

[1 1 1 ... 1 0 1]


In [45]:
Y.shape

(23196,)
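
This export does not show the cell that converts the text in X into numbers, but the TfidfVectorizer import and the sparse float64 matrix printed next suggest a TF-IDF step. A minimal sketch of such a step, assuming default TfidfVectorizer settings (the original parameters are not shown) and a tiny stand-in corpus in place of the real content column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for news_dataset['content'].values: three already-stemmed strings.
corpus = [
    'http toofab com real housew atlanta reunion video',
    'http www today com style peopl choic award red carpet look',
    'http www billboard com articl news kelli clarkson perform medley',
]

# Learn the vocabulary and IDF weights, then map each document to a TF-IDF row.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)

print(X_tfidf.shape)   # (number of documents, vocabulary size)
print(type(X_tfidf))   # a SciPy sparse matrix
```

In the notebook, the equivalent would presumably be fitting the vectorizer on X and replacing X with the transformed matrix before the train/test split.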

In [51]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 316924 stored elements and shape (23196, 17314)>
  Coords	Values
  (0, 175)	0.14434726829473674
  (0, 896)	0.3389455172701989
  (0, 2218)	0.4346705393896895
  (0, 3144)	0.028363875555072016
  (0, 5230)	0.19914481139678233
  (0, 7204)	0.27967207824608226
  (0, 7240)	0.03029315079920102
  (0, 8136)	0.4346705393896895
  (0, 11301)	0.16947275863509945
  (0, 11540)	0.2113369959982094
  (0, 11846)	0.21414819505876584
  (0, 12356)	0.33456306448701023
  (0, 12418)	0.2501629274621897
  (0, 12736)	0.1394415242605404
  (0, 15510)	0.1652746695506423
  (0, 16345)	0.1037611672890413
  (0, 16828)	0.12458641416073078
  (1, 985)	0.33310329340670436
  (1, 1532)	0.18067383197494744
  (1, 2474)	0.4135533362491038
  (1, 2851)	0.4642049540481571
  (1, 3144)	0.04487540286284784
  (1, 7240)	0.04792777148735079
  (1, 9039)	0.36163482558348325
  (1, 11465)	0.25473967851260193
  :	:
  (23194, 8547)	0.17169083170481492
  (23194, 8810)	0.211463587763149

Splitting the dataset into training & test data

In [53]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
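
To illustrate what stratify does in the split above, the sketch below splits a small synthetic label array (made up for this example, 75% ones) and checks that both parts keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 samples, 75 labelled 1 and 25 labelled 0.
X_demo = np.arange(100).reshape(-1, 1)
Y_demo = np.array([1] * 75 + [0] * 25)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, stratify=Y_demo, random_state=2
)

# With stratification, the share of class 1 is preserved in both parts.
print(Y_tr.mean(), Y_te.mean())
```

Without stratify, the class share in the test part would vary from one random split to the next, which matters when one class is much rarer than the other.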

Training the Model: Logistic Regression

In [55]:
model = LogisticRegression()

In [57]:
model.fit(X_train, Y_train)

Evaluation

accuracy score

In [59]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [61]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9483724940719983


In [63]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [65]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9269396551724138
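
Accuracy is the fraction of predictions that match the true labels. A small sketch with made-up labels (not the notebook's data) shows that accuracy_score agrees with a manual computation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])

manual_accuracy = np.mean(y_true == y_pred)        # matches / total
sklearn_accuracy = accuracy_score(y_true, y_pred)  # arguments are (y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```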


Making a Predictive System

In [76]:
X_new = X_test[1]

prediction = model.predict(X_new)
print(prediction)

# the `real` column is 1 for real news and 0 for fake news
if prediction[0] == 1:
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Fake


In [75]:
print(Y_test[1])

0
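
The cells above predict on a test row that is already vectorized. For a brand-new headline, the raw text would first need the same preprocessing and then the already-fitted vectorizer. The following is a sketch under stated assumptions: the texts and labels are invented for illustration, default settings are used, and the stemming step is omitted for brevity (the toy strings are written as if already stemmed).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration; labels mimic the `real` column (0 or 1).
train_texts = [
    'award show red carpet look',
    'singer perform medley music award',
    'actress send birthday messag co star',
    'shock split secret rumour exclus',
    'secret feud rumour shock exclus',
    'star dump shock rumour sourc',
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(train_texts)
model = LogisticRegression()
model.fit(X_toy, train_labels)

# New text goes through the already-fitted vectorizer (transform, not fit_transform)
# so that it maps onto the same feature columns the model was trained on.
new_text = ['shock rumour exclus split']
X_new = vectorizer.transform(new_text)
predicted_label = model.predict(X_new)[0]
print(predicted_label)
```

Calling fit_transform on the new text instead would build a different vocabulary, and the resulting columns would no longer line up with the trained model.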
