<a href="https://colab.research.google.com/github/Masud690/Fake_news_detection/blob/main/news_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: a label that marks whether the news article is real or fake:
    1: Fake news
    0: Real news


In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [4]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/health_news_dataset.csv')
news_dataset.head()

,id,title,author,text,label
0,1,New Study Links Diet to Heart Health,Debra Phillips,Voice attorney state information. Up race skin...,0
1,2,"Vaccines Cause More Harm Than Good, New Study ...",Anthony Norton,Pick through person. Prevent tree black right ...,1
2,3,Big Pharma Hiding Natural Cure for Cancer,Rebecca Morton,Ask through support maintain might to. Soon de...,1
3,4,Mediterranean Diet Shown to Improve Longevity,Karina Harris,Task war effort moment door focus cut. Trade n...,0
4,5,Miracle Fruit That Instantly Cures Diabetes Di...,Joseph Cooper,Professional produce still early major leave w...,1


In [5]:
news_dataset.shape

(1000, 5)

In [6]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

id        0
title     0
author    0
text      0
label     0


In [7]:
# merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [8]:
print(news_dataset['content'])

0      Debra Phillips New Study Links Diet to Heart H...
1      Anthony Norton Vaccines Cause More Harm Than G...
2      Rebecca Morton Big Pharma Hiding Natural Cure ...
3      Karina Harris Mediterranean Diet Shown to Impr...
4      Joseph Cooper Miracle Fruit That Instantly Cur...
                             ...                        
995    Nicholas White Vaccines Cause More Harm Than G...
996    Jessica Fowler New Study Links Diet to Heart H...
997    Brianna Matthews New Study Links Diet to Heart...
998    Emily Perry Vaccines Cause More Harm Than Good...
999    Kevin Harrison Ancient Herbal Tea Found to Rev...
Name: content, Length: 1000, dtype: object


In [9]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [10]:
print(X)
print(Y)

       id                                              title  \
0       1               New Study Links Diet to Heart Health   
1       2  Vaccines Cause More Harm Than Good, New Study ...   
2       3          Big Pharma Hiding Natural Cure for Cancer   
3       4      Mediterranean Diet Shown to Improve Longevity   
4       5  Miracle Fruit That Instantly Cures Diabetes Di...   
..    ...                                                ...   
995   996  Vaccines Cause More Harm Than Good, New Study ...   
996   997               New Study Links Diet to Heart Health   
997   998               New Study Links Diet to Heart Health   
998   999  Vaccines Cause More Harm Than Good, New Study ...   
999  1000          Ancient Herbal Tea Found to Reverse Aging   

               author                                               text  \
0      Debra Phillips  Voice attorney state information. Up race skin...   
1      Anthony Norton  Pick through person. Prevent tree black right ...   
2  

Stemming:

Stemming is the process of reducing a word to its root form by stripping suffixes.

example: acting, acted, acts --> act

In [11]:
port_stem = PorterStemmer()

In [12]:
# build the stopword set once; calling stopwords.words() for every word is slow
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # drop stopwords and stem the remaining words
    stemmed_words = [port_stem.stem(word) for word in words if word not in english_stopwords]
    return ' '.join(stemmed_words)
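
To see what the steps before the Porter stemmer do on their own, here is a minimal sketch of the cleaning stage only. It assumes a tiny illustrative stopword set (`DEMO_STOPWORDS`, not NLTK's full English list) and a hypothetical helper name `clean_tokens`; the stemming call is deliberately left out so the block runs without NLTK.

```python
import re

# Assumption: a tiny illustrative stopword set, not NLTK's full English list.
DEMO_STOPWORDS = {"to", "than", "for", "that", "more"}

def clean_tokens(text, stop_words=DEMO_STOPWORDS):
    """Replicate the pre-stemming steps: strip non-letters, lowercase, split, filter."""
    # Digits and punctuation become spaces, so "Good," turns into "Good ".
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in stop_words]

print(clean_tokens("Debra Phillips New Study Links Diet to Heart Health"))
```

In the notebook, each surviving token is then passed through `port_stem.stem`, which is how "study" becomes "studi" and "links" becomes "link" in the output of the next cells.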

In [13]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [14]:
print(news_dataset['content'])

0         debra phillip new studi link diet heart health
1      anthoni norton vaccin caus harm good new studi...
2       rebecca morton big pharma hide natur cure cancer
3      karina harri mediterranean diet shown improv l...
4      joseph cooper miracl fruit instantli cure diab...
                             ...                        
995    nichola white vaccin caus harm good new studi ...
996      jessica fowler new studi link diet heart health
997     brianna matthew new studi link diet heart health
998    emili perri vaccin caus harm good new studi claim
999    kevin harrison ancient herbal tea found revers...
Name: content, Length: 1000, dtype: object


In [15]:
# separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [16]:
print(X)

['debra phillip new studi link diet heart health'
 'anthoni norton vaccin caus harm good new studi claim'
 'rebecca morton big pharma hide natur cure cancer'
 'karina harri mediterranean diet shown improv longev'
 'joseph cooper miracl fruit instantli cure diabet discov'
 'joshua boyd new studi link diet heart health'
 'adam kraus exercis proven lower risk type diabet'
 'austin henri ancient herbal tea found revers age'
 'alicia scott ancient herbal tea found revers age'
 'kelli turner ancient herbal tea found revers age'
 'victor carlson new studi link diet heart health'
 'aaron price vaccin caus harm good new studi claim'
 'heather marsh vaccin caus harm good new studi claim'
 'carmen white miracl fruit instantli cure diabet discov'
 'timothi brown ancient herbal tea found revers age'
 'dawn sander recommend updat vaccin guidelin'
 'lauri rodriguez research identifi gene link obes'
 'seth hernandez miracl fruit instantli cure diabet discov'
 'jason colon exercis proven lower risk typ

In [17]:
print(Y)

[0 1 1 0 1 0 0 1 1 1 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1
 0 0 0 1 0 1 0 1 1 1 0 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0
 1 1 0 0 1 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1
 0 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0
 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 1
 0 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0
 1 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 1 1 0 1 0 0 0 1 0
 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 1 0 0 1 1 1 0 1 0 1 1 0 1 1
 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 1
 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 1 1
 0 1 0 0 1 1 0 1 0 1 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 1 0
 1 0 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 1
 1 1 1 0 0 1 0 1 0 0 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1 1 0 0 0 0 1
 1 0 1 1 0 0 0 1 0 1 1 1 

In [18]:
Y.shape

(1000,)

In [19]:
# converting the textual data to numerical data (TF-IDF features)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)
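
To make the numbers in the sparse matrix less opaque, the following is a rough pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The helper name `tfidf_rows` is hypothetical, and tokenising with `split()` is a simplification of scikit-learn's own token pattern, so treat this as an illustration rather than a drop-in replacement.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """TF-IDF with smoothed IDF and L2-normalised rows, one dict per document."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Three stemmed strings taken from the printed X above.
docs = [
    "debra phillip new studi link diet heart health",
    "joshua boyd new studi link diet heart health",
    "anthoni norton vaccin caus harm good new studi claim",
]
rows = tfidf_rows(docs)
print(rows[0]["debra"] > rows[0]["studi"])
```

Terms that occur in few documents, such as author names, receive larger weights than terms shared by many titles, which matches the pattern of a few large values per row in the matrix printed in the next cell.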

In [20]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 7722 stored elements and shape (1000, 828)>
  Coords	Values
  (0, 193)	0.5954709431332976
  (0, 208)	0.21952325291904246
  (0, 314)	0.2805277853546525
  (0, 315)	0.2805277853546525
  (0, 454)	0.22781750811443036
  (0, 558)	0.22566902563829092
  (0, 606)	0.5348629593935922
  (0, 728)	0.22566902563829092
  (1, 36)	0.45762324835270335
  (1, 123)	0.28130137825424334
  (1, 141)	0.28130137825424334
  (1, 279)	0.28130137825424334
  (1, 300)	0.28130137825424334
  (1, 558)	0.21775500072043957
  (1, 568)	0.5745882727336531
  (1, 728)	0.21775500072043957
  (1, 771)	0.22152412305573493
  (2, 67)	0.2675775752120806
  (2, 106)	0.2675775752120806
  (2, 170)	0.21053269179584844
  (2, 329)	0.2675775752120806
  (2, 542)	0.5897127598120162
  (2, 554)	0.2675775752120806
  (2, 603)	0.2675775752120806
  (2, 622)	0.4999260473492118
  :	:
  (997, 90)	0.6254379511258942
  (997, 208)	0.2305707358409978
  (997, 314)	0.29464531448484627
  (997, 315)	0.

**Splitting the dataset to training & test data**

In [21]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
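
The `stratify=Y` argument keeps the share of fake and real articles roughly the same in both splits. The sketch below illustrates the idea in plain Python under simplifying assumptions: `stratified_split` is a hypothetical helper, it is not scikit-learn's algorithm, and it will not reproduce the indices that `random_state=2` selects.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_indices, test_indices), sampling test_size of each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Made-up labels: 60 real (0) and 40 fake (1) articles.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the test indices are drawn class by class, the 60/40 ratio of the made-up labels carries over to both parts of the split.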

**Training the Model: Logistic Regression**

In [22]:
model = LogisticRegression()

In [23]:
model.fit(X_train, Y_train)
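
Once fitted, a binary logistic regression model scores a sample as a weighted sum of its features plus a bias, squashes the score with the sigmoid function, and predicts class 1 when the probability exceeds 0.5. The sketch below shows that decision rule only. The weights are invented for illustration and are not the coefficients learned in this notebook, and both function names are hypothetical.

```python
import math

def predict_proba_fake(weights, bias, features):
    """Probability of class 1 (fake) for one sparse feature vector {index: value}."""
    score = bias + sum(weights.get(i, 0.0) * value for i, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

def predict_label(weights, bias, features, threshold=0.5):
    """Return 1 (fake) if the probability exceeds the threshold, else 0 (real)."""
    return 1 if predict_proba_fake(weights, bias, features) > threshold else 0

# Hypothetical weights: positive values push towards "fake", negative towards "real".
demo_weights = {0: 2.0, 1: -1.5}
print(predict_label(demo_weights, bias=0.0, features={0: 0.6}))
print(predict_label(demo_weights, bias=0.0, features={1: 0.6}))
```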

Evaluation

Accuracy score

In [24]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [25]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  1.0


In [26]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [27]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  1.0
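
A score of 1.0 on both the training and the test data is unusual and worth a sanity check. The rows printed earlier suggest that a small number of titles recur with different authors. If every title always carries the same label, the model only has to memorise titles, and the test score says little about unseen headlines. The sketch below is one possible way to check this, using stemmed titles and labels copied from the first rows printed above; `titles_with_conflicting_labels` is a hypothetical helper.

```python
from collections import defaultdict

def titles_with_conflicting_labels(rows):
    """Return the titles that appear with more than one label."""
    labels_by_title = defaultdict(set)
    for title, label in rows:
        labels_by_title[title].add(label)
    return {title for title, labels in labels_by_title.items() if len(labels) > 1}

# Stemmed titles (author tokens removed) and labels from rows 0, 1, 2, 5, 7, 8 and 11.
sample_rows = [
    ("new studi link diet heart health", 0),
    ("vaccin caus harm good new studi claim", 1),
    ("big pharma hide natur cure cancer", 1),
    ("new studi link diet heart health", 0),
    ("ancient herbal tea found revers age", 1),
    ("ancient herbal tea found revers age", 1),
    ("vaccin caus harm good new studi claim", 1),
]
print(titles_with_conflicting_labels(sample_rows))
```

An empty result on the full dataset would indicate that the title alone determines the label, in which case the perfect accuracy reflects repeated titles rather than a model that generalises.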


Making a Predictive System

In [38]:
X_new = X_test[88]

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('The news is Real')
else:
    print('The news is Fake')

[0]
The news is Real
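
The cell above classifies a row that is already vectorised. To classify a new, raw headline, the same steps used for training have to be repeated in order: merge author and title, clean and stem the text, transform it with the already fitted vectorizer, then predict. A minimal sketch, assuming a hypothetical helper `predict_headline` that receives the notebook's objects as arguments:

```python
def predict_headline(author, title, preprocess, vectorizer, model):
    """Classify a raw headline by repeating the training-time steps in order."""
    # Same merge and cleaning/stemming as applied to the training data.
    content = preprocess(author + " " + title)
    # Reuse the fitted vocabulary; the vectorizer must not be refitted here.
    features = vectorizer.transform([content])
    label = model.predict(features)[0]
    return "Fake" if label == 1 else "Real"
```

In this notebook the call would look like `predict_headline(author, title, stemming, vectorizer, model)`, with `author` and `title` supplied as plain strings.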


In [36]:
# true label of test sample 11; note that the prediction above was made for sample 88
print(Y_test[11])

1
