About the Dataset
1. id : unique identifier for each news article
2. title : title of the news article, as given by the author
3. author : the person who wrote the article
4. text : the news text; it may be fake or real, and may be empty
5. label : 0 = fake, 1 = real

In [4]:
# this is a binary classification problem: each news item is
# either FAKE or REAL

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score # model.score can also be used to evaluate
import re # regular expressions, used to search and clean text

# stopwords are common words that add little value to the text analysis
from nltk.corpus import stopwords  # NLTK = Natural Language Toolkit; a corpus is a body of text
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

In [26]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
#here are all the stopwords
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Pre-Processing

In [10]:
#loading our dataset
dataset = pd.read_csv("/content/fake_news_data_dataset.csv")
dataset.head(3)

Unnamed: 0,id,title,author,text,label
0,1,NASA Confirms Water on Moon,John A. Smith,NASA releases new research confirming presence...,1
1,2,Aliens Built Egyptian Pyramids,MysteryReporter12,Newly decoded hieroglyphs suggest extraterrest...,0
2,3,Global Warming Accelerates,Maria K. Lee,Climate scientists warn the pace of warming is...,1


In [12]:
# number of rows and columns
dataset.shape

(201, 5)

In [13]:
# counting number of missing values
dataset.isnull().sum()

Unnamed: 0,0
id,0
title,0
author,0
text,0
label,0


In [14]:
# merge the author and the news title into one field; this may lead to a better accuracy score
dataset["content"] = dataset["author"] + ' ' + dataset["title"]
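
String concatenation in pandas propagates missing values: if either column is NaN, the merged value is NaN too. This dataset has no missing values, so the merge is safe here. A minimal sketch of a defensive version, using a small made-up frame (`toy` is hypothetical data, not the notebook's dataset):

```python
import pandas as pd

# Hypothetical toy frame with one missing author (not the notebook's data)
toy = pd.DataFrame({
    "author": ["John A. Smith", None],
    "title": ["NASA Confirms Water on Moon", "Aliens Built Egyptian Pyramids"],
})

# Plain concatenation: a missing author makes the whole merged value missing
naive = toy["author"] + " " + toy["title"]

# Filling missing text with an empty string first keeps the title
safe = (toy["author"].fillna("") + " " + toy["title"].fillna("")).str.strip()

print(naive.isnull().sum())
print(safe.tolist())
```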

In [15]:
# separating the data and labels
X = dataset.drop(columns='label')
Y = dataset['label']

Stemming : the process of reducing a word to its root form

example : acting, acted, acts ---> act
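
A quick way to see what the stemmer does is to run it on a few words. The word list below is illustrative. The Porter algorithm strips suffixes by rule, so some related words may keep distinct stems rather than collapsing to a single root.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; inspect the printed stems rather than assuming them
words = ["acting", "acted", "acts", "actor", "actress"]
stems = {word: stemmer.stem(word) for word in words}

print(stems)
```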

In [16]:
port_stem = PorterStemmer()

In [17]:
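# build the stopword set once; set lookups avoid reloading the list for every word
stop_words = set(stopwords.words('english'))

def stemming(content):
  # replace every non-letter character with a space (A-Z, not A-z)
  stemmed_content = re.sub('[^a-zA-Z]',' ',content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # stem each word, skipping stopwords
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content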
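
A common slip in this kind of cleaning pattern is writing the range `A-z` instead of `A-Z`. In ASCII, `A-z` spans codes 65 to 122, which also covers `[`, `\`, `]`, `^`, `_` and the backtick, so those characters survive the substitution. A small check on a made-up string (`sample` is illustrative only):

```python
import re

sample = "Breaking_News: [UPDATE] 5 facts^"

# Range A-z also matches [ \ ] ^ _ ` so these characters are kept
loose = re.sub('[^a-zA-z]', ' ', sample)

# Ranges a-z and A-Z match letters only
strict = re.sub('[^a-zA-Z]', ' ', sample)

print(repr(loose))
print(repr(strict))
```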

In [18]:
dataset['content'] = dataset['content'].apply(stemming)

In [27]:
#separating the data and label
X = dataset['content']
Y = dataset['label'].values

In [28]:
#converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
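
To make the conversion concrete, here is a minimal sketch of `TfidfVectorizer` on three made-up, already stemmed strings (`corpus` and `toy_vectorizer` are hypothetical names). Each row is one document, each column one vocabulary term, and with the default settings every row is scaled to unit length. A term that appears in several documents gets a lower weight than a term unique to one document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only
corpus = [
    "nasa confirm water moon",
    "alien built pyramid",
    "nasa warn global warm",
]

toy_vectorizer = TfidfVectorizer()
tfidf = toy_vectorizer.fit_transform(corpus)  # sparse matrix: documents x terms

print(tfidf.shape)
print(sorted(toy_vectorizer.vocabulary_))

# Default L2 normalization: each row has Euclidean length 1
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(row_norms)
```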

Splitting the dataset into training and test sets

In [29]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1379 stored elements and shape (201, 809)>
  Coords	Values
  (0, 134)	0.38205865418384616
  (0, 362)	0.45656163043145714
  (0, 458)	0.33426612020907337
  (0, 473)	0.42359353824435153
  (0, 669)	0.45656163043145714
  (0, 791)	0.38205865418384616
  (1, 25)	0.3089681982729695
  (1, 97)	0.43761422960035706
  (1, 209)	0.4992421679689901
  (1, 467)	0.4992421679689901
  (1, 558)	0.46319213502658524
  (2, 2)	0.48911209407132916
  (2, 293)	0.27636840151202213
  (2, 390)	0.45379354881395967
  (2, 419)	0.48911209407132916
  (2, 790)	0.48911209407132916
  (3, 17)	0.43825249376995207
  (3, 153)	0.4066064953925089
  (3, 270)	0.30519923152343864
  (3, 315)	0.43825249376995207
  (3, 321)	0.4066064953925089
  (3, 796)	0.43825249376995207
  (4, 113)	0.4022568272653227
  (4, 155)	0.3732100583731507
  (4, 212)	0.4022568272653227
  :	:
  (197, 610)	0.37896562629474095
  (197, 728)	0.2639118306527331
  (198, 141)	0.40885055073470505
  (198, 142)	

In [30]:
print(Y)

[1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]


In [31]:
X.shape  # not displayed: a notebook cell shows only its last expression
Y.shape

(201,)

In [32]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=40)
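
In the cells above, the vectorizer is fitted on all rows before the split, so the TF-IDF vocabulary and document frequencies also reflect the test rows. A common alternative, sketched here on made-up strings and labels (`texts`, `labels` and the `_raw` names are hypothetical), splits the raw text first and fits the vectorizer on the training portion only. The `stratify` argument keeps the class ratio the same in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up examples for illustration; 1 = real, 0 = fake
texts = [
    "nasa confirm water moon", "alien built pyramid",
    "scientist warn global warm", "celebr clone secret lab",
    "court rule new tax law", "miracl pill cure everyth",
    "city open new librari", "moon land film studio",
    "bank rais interest rate", "lizard peopl run govern",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, keeping the class balance in both parts
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=40
)

# Fit on the training text only, then reuse the fitted vocabulary for the test text
vec = TfidfVectorizer()
X_train_vec = vec.fit_transform(X_train_raw)
X_test_vec = vec.transform(X_test_raw)

print(X_train_vec.shape, X_test_vec.shape)
```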

Training our Model using Logistic Regression

In [33]:
model = LogisticRegression()

In [34]:
model.fit(X_train,Y_train)

Evaluating the accuracy score and making predictions

In [35]:
X_train_prediction = model.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [36]:
#accuracy score of training data
print(training_data_accuracy)

1.0
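
Accuracy on the training rows alone says little about generalization, because the model has already seen those labels. A minimal sketch of scoring both portions, on a small made-up corpus (all data and names here are hypothetical, and the printed numbers are not the notebook's results):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up examples for illustration; 1 = real, 0 = fake
texts = [
    "nasa confirm water moon", "alien built pyramid",
    "scientist warn global warm", "celebr clone secret lab",
    "court rule new tax law", "miracl pill cure everyth",
    "city open new librari", "moon land film studio",
    "bank rais interest rate", "lizard peopl run govern",
    "school extend summer term", "secret alien base found",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=40
)

# Vectorizer and classifier chained, so the vectorizer is fitted on training text only
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, pipe.predict(X_tr))
test_acc = accuracy_score(y_te, pipe.predict(X_te))

print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```

A large gap between the two scores is a sign of overfitting.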


Making a predictive system

In [37]:
X_new = X_test[0]  # first row of the test set
prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
  print(" the news is fake ")
else:
  print(" the news is real ")

[0]
 the news is fake 
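
The cell above reuses a test row that is already vectorized. For a new article, the text has to go through the fitted vectorizer before `predict` is called. A minimal sketch follows, in which `predict_label` is a hypothetical helper and the tiny training set is made up for illustration. In the notebook the `stemming` step would be applied first; it is omitted here to keep the sketch independent of NLTK. The label convention (0 = fake, 1 = real) follows the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data for illustration; 1 = real, 0 = fake
train_texts = [
    "john smith nasa confirms water on moon",
    "mysteryreporter12 aliens built egyptian pyramids",
    "maria lee global warming accelerates",
    "mysteryreporter12 secret moon base discovered",
]
train_labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(train_texts), train_labels)

def predict_label(author, title):
    """Hypothetical helper: classify one article from its author and title."""
    content = f"{author} {title}"
    features = vec.transform([content])  # transform only; the vectorizer is not refitted
    label = int(clf.predict(features)[0])
    prob_real = float(clf.predict_proba(features)[0, 1])  # column 1 = class label 1
    return ("real" if label == 1 else "fake"), prob_real

result = predict_label("MysteryReporter12", "Aliens Built Secret Moon Base")
print(result)
```

Returning the probability alongside the label shows how confident the model is, which a bare 0 or 1 hides.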
