
About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: marks whether the news article is real or fake:
           1: Fake news
           0: Real news
   (the file loaded in this notebook stores these as the strings 'Fake' and 'Real')





Importing the Dependencies

In [None]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Pre-processing

In [None]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/archive.zip')
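
`pd.read_csv` can open a zip archive directly because it infers the compression from the `.zip` extension; this only works when the archive holds a single data file. A minimal sketch, using a hypothetical two-row file written to a temporary directory (the file names and rows are made up for illustration):

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical CSV content, used only for this illustration.
csv_text = "id,title,label\n1,Breaking News 1,Fake\n2,Breaking News 2,Real\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "archive.zip")

    # Write a zip archive that contains exactly one CSV file.
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("news.csv", csv_text)

    # compression="infer" (the default) selects zip from the file extension.
    demo_df = pd.read_csv(zip_path)

print(demo_df.shape)
print(demo_df["label"].tolist())
```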

In [None]:
news_dataset.shape

(4000, 24)

In [None]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0,id,title,author,text,state,date_published,source,category,sentiment_score,word_count,...,num_shares,num_comments,political_bias,fact_check_rating,is_satirical,trust_score,source_reputation,clickbait_score,plagiarism_score,label
0,1,Breaking News 1,Jane Smith,This is the content of article 1. It contains ...,Tennessee,30-11-2021,The Onion,Entertainment,-0.22,1302,...,47305,450,Center,FALSE,1,76,6,0.84,53.35,Fake
1,2,Breaking News 2,Emily Davis,This is the content of article 2. It contains ...,Wisconsin,02-09-2021,The Guardian,Technology,0.92,322,...,39804,530,Left,Mixed,1,1,5,0.85,28.28,Fake
2,3,Breaking News 3,John Doe,This is the content of article 3. It contains ...,Missouri,13-04-2021,New York Times,Sports,0.25,228,...,45860,763,Center,Mixed,0,57,1,0.72,0.38,Fake
3,4,Breaking News 4,Alex Johnson,This is the content of article 4. It contains ...,North Carolina,08-03-2020,CNN,Sports,0.94,155,...,34222,945,Center,TRUE,1,18,10,0.92,32.2,Fake
4,5,Breaking News 5,Emily Davis,This is the content of article 5. It contains ...,California,23-03-2022,Daily Mail,Technology,-0.01,962,...,35934,433,Right,Mixed,0,95,6,0.66,77.7,Real


In [None]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0,0
id,0
title,0
author,0
text,0
state,0
date_published,0
source,0
category,0
sentiment_score,0
word_count,0


In [None]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [None]:
# merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [None]:
print(news_dataset['content'])

0            Jane Smith Breaking News 1
1           Emily Davis Breaking News 2
2              John Doe Breaking News 3
3          Alex Johnson Breaking News 4
4           Emily Davis Breaking News 5
                     ...               
3995        John Doe Breaking News 3996
3996    Alex Johnson Breaking News 3997
3997    Alex Johnson Breaking News 3998
3998        John Doe Breaking News 3999
3999        John Doe Breaking News 4000
Name: content, Length: 4000, dtype: object


In [None]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [None]:
print(X)
print(Y)

        id               title        author  \
0        1     Breaking News 1    Jane Smith   
1        2     Breaking News 2   Emily Davis   
2        3     Breaking News 3      John Doe   
3        4     Breaking News 4  Alex Johnson   
4        5     Breaking News 5   Emily Davis   
...    ...                 ...           ...   
3995  3996  Breaking News 3996      John Doe   
3996  3997  Breaking News 3997  Alex Johnson   
3997  3998  Breaking News 3998  Alex Johnson   
3998  3999  Breaking News 3999      John Doe   
3999  4000  Breaking News 4000      John Doe   

                                                   text           state  \
0     This is the content of article 1. It contains ...       Tennessee   
1     This is the content of article 2. It contains ...       Wisconsin   
2     This is the content of article 3. It contains ...        Missouri   
3     This is the content of article 4. It contains ...  North Carolina   
4     This is the content of article 5. It conta

Stemming:

Stemming is the process of reducing a word to its root form (its stem).

Example:
acting, acted, acts --> act
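
As a rough illustration of how the Porter stemmer behaves, the sketch below stems a few words, including tokens that appear in this dataset. Porter stemming is rule-based suffix stripping, so related words are not guaranteed to collapse to one stem, and some stems (for example `emili`) are not dictionary words. The word list is chosen only for this example.

```python
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Words chosen for illustration; the last three occur in the dataset's author/title text.
words = ["acting", "actor", "actress", "breaking", "emily", "davis"]

# Map each word to its Porter stem.
stems = {word: port_stem.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```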

In [None]:
port_stem = PorterStemmer()

In [None]:
# build the stopword set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower().split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    return ' '.join(stemmed_content)

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

0         jane smith break news
1         emili davi break news
2           john doe break news
3       alex johnson break news
4         emili davi break news
                 ...           
3995        john doe break news
3996    alex johnson break news
3997    alex johnson break news
3998        john doe break news
3999        john doe break news
Name: content, Length: 4000, dtype: object


In [None]:
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [None]:
print(X)

['jane smith break news' 'emili davi break news' 'john doe break news' ...
 'alex johnson break news' 'john doe break news' 'john doe break news']


In [None]:
print(Y)

['Fake' 'Fake' 'Fake' ... 'Fake' 'Real' 'Real']


In [None]:
Y.shape

(4000,)

In [None]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)

In [None]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 16000 stored elements and shape (4000, 12)>
  Coords	Values
  (0, 1)	0.2537535642970132
  (0, 6)	0.6600069155748003
  (0, 10)	0.2537535642970132
  (0, 11)	0.6600069155748003
  (1, 1)	0.2513177760238013
  (1, 3)	0.6609382538894616
  (1, 5)	0.6609382538894616
  (1, 10)	0.2513177760238013
  (2, 1)	0.2513177760238013
  (2, 4)	0.6609382538894616
  (2, 7)	0.6609382538894616
  (2, 10)	0.2513177760238013
  (3, 0)	0.6598854317699118
  (3, 1)	0.2540693152229076
  (3, 8)	0.6598854317699118
  (3, 10)	0.2540693152229076
  (4, 1)	0.2513177760238013
  (4, 3)	0.6609382538894616
  (4, 5)	0.6609382538894616
  (4, 10)	0.2513177760238013
  (5, 1)	0.2513177760238013
  (5, 4)	0.6609382538894616
  (5, 7)	0.6609382538894616
  (5, 10)	0.2513177760238013
  (6, 1)	0.2513177760238013
  :	:
  (3993, 10)	0.2540693152229076
  (3994, 1)	0.2537535642970132
  (3994, 6)	0.6600069155748003
  (3994, 10)	0.2537535642970132
  (3994, 11)	0.6600069155748003
  (3995

Splitting the dataset into training & test data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
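
The `stratify=Y` argument asks `train_test_split` to keep the class proportions of `Y` in both splits. A small sketch with made-up labels (60 `Fake`, 40 `Real`; the data is hypothetical) shows the effect:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 60 labelled 'Fake' and 40 labelled 'Real'.
features = np.arange(100).reshape(-1, 1)
labels = np.array(["Fake"] * 60 + ["Real"] * 40)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# Count how many samples of each class landed in each split.
train_counts = Counter(l_train)
test_counts = Counter(l_test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both splits keep the 60/40 ratio of the full label array, so neither class is over-represented in the test set by chance.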

Training the Model: Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)
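
Note that the vectorizer above is fitted on the full dataset before the split, so IDF statistics from the test rows influence the training features. One common way to avoid this, sketched below under the assumption that the raw text is split first, is to wrap both steps in a scikit-learn `Pipeline` so that the vectorizer is fitted on the training portion only. The texts and labels here are hypothetical toy data, not the notebook's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only.
texts = [
    "aliens endorse candidate", "council approves budget",
    "miracle cure found", "school opens new library",
    "celebrity clone spotted", "city repairs bridge",
    "moon made of cheese", "team wins local match",
]
targets = ["Fake", "Real"] * 4

text_train, text_test, y_train, y_test = train_test_split(
    texts, targets, test_size=0.25, stratify=targets, random_state=2
)

# The vectorizer is fitted inside the pipeline, on training text only.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])
clf.fit(text_train, y_train)

test_predictions = clf.predict(text_test)
print(test_predictions)
```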

In [None]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier()
DT.fit(X_train, Y_train)

To understand the structure of `news_dataset`, we can use the `.info()` method to see the column names, their non-null counts, and data types. Then, `.head()` will show the first few rows of the DataFrame.

In [None]:
print(news_dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4000 non-null   int64  
 1   title              4000 non-null   object 
 2   author             4000 non-null   object 
 3   text               4000 non-null   object 
 4   state              4000 non-null   object 
 5   date_published     4000 non-null   object 
 6   source             4000 non-null   object 
 7   category           4000 non-null   object 
 8   sentiment_score    4000 non-null   float64
 9   word_count         4000 non-null   int64  
 10  char_count         4000 non-null   int64  
 11  has_images         4000 non-null   int64  
 12  has_videos         4000 non-null   int64  
 13  readability_score  4000 non-null   float64
 14  num_shares         4000 non-null   int64  
 15  num_comments       4000 non-null   int64  
 16  political_bias     4000 

In [None]:
display(news_dataset.head())

Unnamed: 0,id,title,author,text,state,date_published,source,category,sentiment_score,word_count,...,num_comments,political_bias,fact_check_rating,is_satirical,trust_score,source_reputation,clickbait_score,plagiarism_score,label,content
0,1,Breaking News 1,Jane Smith,This is the content of article 1. It contains ...,Tennessee,30-11-2021,The Onion,Entertainment,-0.22,1302,...,450,Center,FALSE,1,76,6,0.84,53.35,Fake,jane smith break news
1,2,Breaking News 2,Emily Davis,This is the content of article 2. It contains ...,Wisconsin,02-09-2021,The Guardian,Technology,0.92,322,...,530,Left,Mixed,1,1,5,0.85,28.28,Fake,emili davi break news
2,3,Breaking News 3,John Doe,This is the content of article 3. It contains ...,Missouri,13-04-2021,New York Times,Sports,0.25,228,...,763,Center,Mixed,0,57,1,0.72,0.38,Fake,john doe break news
3,4,Breaking News 4,Alex Johnson,This is the content of article 4. It contains ...,North Carolina,08-03-2020,CNN,Sports,0.94,155,...,945,Center,TRUE,1,18,10,0.92,32.2,Fake,alex johnson break news
4,5,Breaking News 5,Emily Davis,This is the content of article 5. It contains ...,California,23-03-2022,Daily Mail,Technology,-0.01,962,...,433,Right,Mixed,0,95,6,0.66,77.7,Real,emili davi break news


In [None]:
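from sklearn.metrics import classification_report

# predict on the test set first so the report does not depend on a later cell
X_test_prediction = model.predict(X_test)
print(classification_report(Y_test, X_test_prediction))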

              precision    recall  f1-score   support

        Fake       0.53      0.65      0.58       405
        Real       0.54      0.42      0.47       395

    accuracy                           0.54       800
   macro avg       0.54      0.53      0.53       800
weighted avg       0.54      0.54      0.53       800
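
The F1 column in the report is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall values printed above; small differences from scikit-learn's own figures are possible in general because it works from unrounded counts.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the classification report above.
f1_fake = f1_from(0.53, 0.65)
f1_real = f1_from(0.54, 0.42)

print(f"Fake: {f1_fake:.2f}")
print(f"Real: {f1_real:.2f}")
```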



Evaluation

Accuracy score

In [None]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.511875


In [None]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.535
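
An accuracy of 0.535 is easier to interpret next to a majority-class baseline, that is, the score obtained by always predicting the most frequent label. Using the test-set support from the classification report (405 `Fake`, 395 `Real`), a short calculation gives:

```python
# Class counts in the test set, taken from the classification report's support column.
support = {"Fake": 405, "Real": 395}

total = sum(support.values())
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

reported_test_accuracy = 0.535  # value printed above

print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"margin over baseline: {reported_test_accuracy - baseline_accuracy:.3f}")
```

The model sits only about three percentage points above this baseline. That is consistent with the TF-IDF matrix above having just 12 columns, which suggests the author-name and title tokens carry little signal about the label.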


Making a Predictive System

In [None]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

# labels in this dataset are the strings 'Fake' and 'Real', not 1 and 0
if prediction[0] == 'Real':
  print('The news is Real')
else:
  print('The news is Fake')

['Fake']
The news is Fake


In [None]:
print(Y_test[3])

Real
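
Because the labels are strings, `model.classes_` tells which column of `predict_proba` belongs to which class, and the probabilities indicate how confident a single prediction is. A minimal sketch on hypothetical toy data (not the notebook's dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature toy data: small values are 'Real', large values are 'Fake'.
toy_X = np.array([[0.0], [0.2], [0.4], [1.6], [1.8], [2.0]])
toy_y = np.array(["Real", "Real", "Real", "Fake", "Fake", "Fake"])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)

sample = np.array([[1.9]])
predicted_label = toy_model.predict(sample)[0]
probabilities = toy_model.predict_proba(sample)[0]

# classes_ is sorted, so the columns follow the order ['Fake', 'Real'].
for label, prob in zip(toy_model.classes_, probabilities):
    print(f"P({label}) = {prob:.2f}")
print("Predicted:", predicted_label)
```

Inspecting the probability for a misclassified row, such as the one above, can show whether the model was confidently wrong or close to a coin flip.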
