
Machine Learning Project:
Fake News Prediction with Python

Importing the dependencies

In [35]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [36]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [37]:
# printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-Processing

In [38]:
# loading the dataset into a pandas DataFrame
news_dataset = pd.read_csv('/content/FakeNewsNet.csv')

In [39]:
news_dataset.shape

(23196, 5)

In [41]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0,title,news_url,source_domain,tweet_num,real
0,Kandi Burruss Explodes Over Rape Accusation on...,http://toofab.com/2017/05/08/real-housewives-a...,toofab.com,42,1
1,People's Choice Awards 2018: The best red carp...,https://www.today.com/style/see-people-s-choic...,www.today.com,0,1
2,Sophia Bush Sends Sweet Birthday Message to 'O...,https://www.etonline.com/news/220806_sophia_bu...,www.etonline.com,63,1
3,Colombian singer Maluma sparks rumours of inap...,https://www.dailymail.co.uk/news/article-33655...,www.dailymail.co.uk,20,1
4,Gossip Girl 10 Years Later: How Upper East Sid...,https://www.zerchoo.com/entertainment/gossip-g...,www.zerchoo.com,38,1


In [42]:
# counting the number of missing values in each column
news_dataset.isnull().sum()

In [43]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [44]:
# merging the source_domain and news title
news_dataset['content'] = news_dataset['source_domain']+ ' ' + news_dataset['title']

In [45]:
print(news_dataset['content'])

0        toofab.com Kandi Burruss Explodes Over Rape Ac...
1        www.today.com People's Choice Awards 2018: The...
2        www.etonline.com Sophia Bush Sends Sweet Birth...
3        www.dailymail.co.uk Colombian singer Maluma sp...
4        www.zerchoo.com Gossip Girl 10 Years Later: Ho...
                               ...                        
23191    www.express.co.uk Pippa Middleton wedding: In ...
23192    hollywoodlife.com Zayn Malik & Gigi Hadid’s Sh...
23193    www.justjared.com Jessica Chastain Recalls the...
23194    www.intouchweekly.com Tristan Thompson Feels "...
23195    www.billboard.com Kelly Clarkson Performs a Me...
Name: content, Length: 23196, dtype: object


In [46]:
# separating the data & label
X = news_dataset.drop('real', axis = 1)
Y = news_dataset['real']

In [47]:
print(X)
print(Y)

                                                   title  \
0      Kandi Burruss Explodes Over Rape Accusation on...   
1      People's Choice Awards 2018: The best red carp...   
2      Sophia Bush Sends Sweet Birthday Message to 'O...   
3      Colombian singer Maluma sparks rumours of inap...   
4      Gossip Girl 10 Years Later: How Upper East Sid...   
...                                                  ...   
23191  Pippa Middleton wedding: In case you missed it...   
23192  Zayn Malik & Gigi Hadid’s Shocking Split: Why ...   
23193  Jessica Chastain Recalls the Moment Her Mother...   
23194  Tristan Thompson Feels "Dumped" After Khloé Ka...   
23195  Kelly Clarkson Performs a Medley of Kendrick L...   

                                                news_url  \
0      http://toofab.com/2017/05/08/real-housewives-a...   
1      https://www.today.com/style/see-people-s-choic...   
2      https://www.etonline.com/news/220806_sophia_bu...   
3      https://www.dailymail.co.uk/news

Stemming: the process of reducing a word to its stem by stripping suffixes, so that inflected forms of a word map to the same token
Example: acting, acted, acts --> act
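
As a small illustration of the stemmer used in this notebook, the sketch below runs NLTK's PorterStemmer on a few hand-picked words. The word list is an assumption chosen for illustration and is not taken from the dataset. The Porter algorithm is rule-based, so it merges inflected forms of a word but does not necessarily merge every related word.

```python
from nltk.stem.porter import PorterStemmer

# Illustrative word list (an assumption, not taken from the dataset).
words = ['acting', 'acted', 'acts', 'actor', 'actress']

port_stem = PorterStemmer()

# Map each word to the stem produced by the Porter rules.
stems = {word: port_stem.stem(word) for word in words}

for word, stem in stems.items():
    print(f'{word} --> {stem}')
```

The verb forms collapse to a shared stem, while nouns such as actor keep their own stem under these rules.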

In [48]:
port_stem = PorterStemmer()

In [49]:
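# build the stopword set once; a set lookup avoids re-reading the list for every word
english_stopwords = set(stopwords.words('english'))

def stemming(content):
  # keep letters only, lowercase, and split into words
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # drop stopwords and stem the remaining words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content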

In [50]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [51]:
print(news_dataset['content'])

0        toofab com kandi burruss explod rape accus rea...
1        www today com peopl choic award best red carpe...
2        www etonlin com sophia bush send sweet birthda...
3        www dailymail co uk colombian singer maluma sp...
4        www zerchoo com gossip girl year later upper e...
                               ...                        
23191    www express co uk pippa middleton wed case mis...
23192    hollywoodlif com zayn malik gigi hadid shock s...
23193    www justjar com jessica chastain recal moment ...
23194    www intouchweekli com tristan thompson feel du...
23195    www billboard com kelli clarkson perform medle...
Name: content, Length: 23196, dtype: object


In [52]:
# separating the data & label
X = news_dataset['content'].values
Y = news_dataset['real'].values

In [53]:
print(X)


['toofab com kandi burruss explod rape accus real housew atlanta reunion video'
 'www today com peopl choic award best red carpet look'
 'www etonlin com sophia bush send sweet birthday messag one tree hill co star hilari burton breyton eva'
 ...
 'www justjar com jessica chastain recal moment mother boyfriend slap kick genit'
 'www intouchweekli com tristan thompson feel dump khlo kardashian refus let move la home exclus'
 'www billboard com kelli clarkson perform medley kendrick lamar humbl hit billboard music award']


In [54]:
print(Y)

[1 1 1 ... 1 0 1]


In [55]:
Y.shape

(23196,)

In [56]:
# converting the textual data to numerical TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)


In [57]:
print(X)

  (0, 13681)	0.21108934405813048
  (0, 12980)	0.29467349125048065
  (0, 10569)	0.2543303205934995
  (0, 10276)	0.23128301802698972
  (0, 10223)	0.30215839846863407
  (0, 6647)	0.3874939993343693
  (0, 5891)	0.2612462026749161
  (0, 4193)	0.3550616497873391
  (0, 2460)	0.05082186658865081
  (0, 1707)	0.3874939993343693
  (0, 651)	0.3073756565758506
  (0, 59)	0.2597097788538541
  (1, 14274)	0.10189043319535993
  (1, 12946)	0.36817044163275126
  (1, 10347)	0.37702288594702094
  (1, 9476)	0.23697414575567807
  (1, 7432)	0.33284454859410584
  (1, 2460)	0.08159478582438673
  (1, 2219)	0.42251489448694257
  (1, 1916)	0.3875096512688879
  (1, 1101)	0.33844294892026705
  (1, 729)	0.3083771474680274
  (2, 14274)	0.05440697837598742
  (2, 13118)	0.3043944712032482
  (2, 12381)	0.2330583052370701
  :	:
  (23194, 10393)	0.31860580642240854
  (23194, 8384)	0.28221547915154177
  (23194, 7230)	0.3087048153381826
  (23194, 6994)	0.26381957644897924
  (23194, 6797)	0.2878505190105094
  (23194, 6661)	0.1
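
The printed output above is a sparse matrix: each entry `(row, column)  value` gives a document index, a vocabulary index, and the TF-IDF weight of that term in that document. The sketch below uses a hypothetical three-document corpus (not the real dataset) to show how column indices can be mapped back to terms through the vectorizer's `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the stemmed 'content' column.
corpus = [
    'www today com award red carpet',
    'www billboard com music award',
    'toofab com reunion video',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary terms

# Invert the term -> column mapping so columns can be read as terms.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}

# Collect the non-zero weights of the first document as {term: weight}.
first_row = tfidf[0]
first_doc_weights = {
    index_to_term[col]: float(weight)
    for col, weight in zip(first_row.indices, first_row.data)
}

print(tfidf.shape)
print(first_doc_weights)
```

With the default settings each row is L2-normalized, so the squared weights of a document sum to 1.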

Splitting the dataset to training & test data

In [58]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 2)
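
In the cells above, the vectorizer is fitted on all rows before the split, so the vocabulary and IDF statistics are computed with the test rows included. One common alternative, sketched here on a small made-up corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the vectorizer only ever sees the training rows. The texts, labels, and variable names below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; in the notebook this would be the stemmed 'content' column.
texts = [
    'today com award red carpet look',
    'billboard com music award medley',
    'etonline com birthday messag star',
    'justjared com recal moment mother',
    'gossipsite com shock split secret',
    'rumorblog com secret wed shock',
    'clickbait com star secret feud',
    'tabloid com shock feud rumor',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # assumed encoding: 1 = real, 0 = fake

# Split the raw text first, keeping the class ratio in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts_train, y_train)

# At prediction time the fitted vectorizer is applied to the unseen texts.
test_predictions = pipeline.predict(texts_test)
learned_vocab = set(pipeline.named_steps['tfidfvectorizer'].vocabulary_)
print(test_predictions)
```

Terms that occur only in the test texts are simply ignored at prediction time, which mirrors how the model would treat genuinely new headlines.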

Training the model: Logistic Regression

In [59]:
model = LogisticRegression()

In [62]:
model.fit(X_train, Y_train)

Accuracy score

In [63]:
# Accuracy score on the training data (accuracy_score expects the true labels first)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [65]:
print('Accuracy score of the training data: ', training_data_accuracy)

Accuracy score of the training data:  0.8831105841776244


In [66]:
# Accuracy score on the testing data (accuracy_score expects the true labels first)
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [67]:
print('Accuracy score of the testing data: ', testing_data_accuracy)

Accuracy score of the testing data:  0.8530172413793103
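
Accuracy is a single number and does not show which class the errors fall in. A confusion matrix is one way to see that breakdown. The sketch below uses hypothetical labels, not the notebook's predictions, and assumes the encoding 1 = real, 0 = fake.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (1 = real, 0 = fake).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(accuracy)
print(matrix)
```

In the notebook the same calls would take `Y_test` and `X_test_prediction` as arguments.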


Making a predictive system

In [72]:
X_new = X_test[112]
prediction = model.predict(X_new)
print(prediction)
# the label column is named 'real', so 1 means real and 0 means fake
if (prediction[0]==1):
  print('The news is Real')
else:
  print('The news is Fake')

[1]
The news is Real


In [73]:
print(Y_test[112])

1
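
To classify a headline that is not already in the vectorized test set, the raw text has to go through the same cleaning and the same fitted vectorizer before it reaches the model. The sketch below shows one possible shape for such a helper. The function names `clean` and `predict_headline`, the tiny training set, and the example domain are all hypothetical, and the label mapping assumes 1 = real based on the column name `real`.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

port_stem = PorterStemmer()

def clean(text):
    # Same cleaning idea as the notebook's stemming(); stopword removal is left
    # out here so the sketch runs without downloading NLTK data.
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(port_stem.stem(word) for word in words)

def predict_headline(source_domain, title, vectorizer, model):
    # Hypothetical helper: build 'content' the same way as the training data.
    content = clean(source_domain + ' ' + title)
    features = vectorizer.transform([content])  # reuse the fitted vocabulary
    label = model.predict(features)[0]
    return 'Real' if label == 1 else 'Fake'  # assumes 1 = real, 0 = fake

# Tiny made-up training set, only to make the sketch executable.
train_texts = [clean(text) for text in [
    'www.today.com Awards 2018: the best red carpet looks',
    'www.billboard.com Singer performs a medley at the music awards',
    'rumorblog.example Shocking secret split revealed',
    'tabloid.example Secret feud shocks fans',
]]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer().fit(train_texts)
model = LogisticRegression().fit(vectorizer.transform(train_texts), train_labels)

result = predict_headline(
    'www.example.com', 'Singer performs medley at music awards', vectorizer, model
)
print(result)
```

In the notebook, the already fitted `vectorizer` and `model` objects would be passed in instead of the toy ones trained here.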
