<a href="https://colab.research.google.com/github/Anujakhatri/Machine-learning/blob/main/Fake_News_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About the Dataset
# Id: unique id for a news article
# Title: the title of a news article
# Author: author of the news article
# Text: the text of the article, could be incomplete
# Label: marks the article as Fake news or Real news

# Importing the Dependencies

In [2]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Pre-processing

In [5]:
news_dataset=pd.read_csv('/content/news_articles.csv')

In [6]:
news_dataset.shape

(2096, 12)

In [7]:
# print the first five rows of the DataFrame
news_dataset.head()

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
0,Barracuda Brigade,2016-10-26T21:41:00.000+03:00,muslims busted they stole millions in govt ben...,print they should pay all the back all the mon...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,muslims busted stole millions govt benefits,print pay back money plus interest entire fami...,1.0
1,reasoning with facts,2016-10-29T08:47:11.259+03:00,re why did attorney general loretta lynch plea...,why did attorney general loretta lynch plead t...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,attorney general loretta lynch plead fifth,attorney general loretta lynch plead fifth bar...,1.0
2,Barracuda Brigade,2016-10-31T01:41:49.479+02:00,breaking weiner cooperating with fbi on hillar...,red state \nfox news sunday reported this mor...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,breaking weiner cooperating fbi hillary email ...,red state fox news sunday reported morning ant...,1.0
3,Fed Up,2016-11-01T05:22:00.000+02:00,pin drop speech by father of daughter kidnappe...,email kayla mueller was a prisoner and torture...,english,100percentfedup.com,http://100percentfedup.com/wp-content/uploads/...,bias,Real,pin drop speech father daughter kidnapped kill...,email kayla mueller prisoner tortured isis cha...,1.0
4,Fed Up,2016-11-01T21:56:00.000+02:00,fantastic trumps point plan to reform healthc...,email healthcare reform to make america great ...,english,100percentfedup.com,http://100percentfedup.com/wp-content/uploads/...,bias,Real,fantastic trumps point plan reform healthcare ...,email healthcare reform make america great sin...,1.0


In [8]:
# counting the missing values in each column of the dataset
news_dataset.isnull().sum()

Unnamed: 0,0
author,0
published,0
title,0
text,46
language,1
site_url,1
main_img_url,1
type,1
label,1
title_without_stopwords,2


In [9]:
news_dataset=news_dataset.fillna('')
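
The missing-value counts above show one row without a `label`. Filling it with an empty string keeps that row in the data, so the classifier later sees `''` as a third class next to `Real` and `Fake`. A minimal sketch of one alternative, assuming it is acceptable to drop unlabeled rows; the tiny DataFrame and the name `demo` are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the dataset: row 2 has no title, row 3 has no label.
demo = pd.DataFrame({
    "author": ["Fed Up", "-NO AUTHOR-", "Eddy Lavine", "Fed Up"],
    "title": ["pin drop speech", "russia unveils missile", None, "haunted house"],
    "label": ["Real", "Fake", "Fake", None],
})

# Drop rows whose target is missing, then fill the remaining text gaps.
demo = demo.dropna(subset=["label"]).fillna("")

print(demo["label"].unique())  # only the real class names remain
print(demo.shape)              # one row fewer than before
```

With this order of operations, empty strings only ever appear in the text columns, never in the target.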

In [10]:
news_dataset['content']=news_dataset['author']+' '+news_dataset['title']

In [11]:
print(news_dataset['content'])

0       Barracuda Brigade muslims busted they stole mi...
1       reasoning with facts re why did attorney gener...
2       Barracuda Brigade breaking weiner cooperating ...
3       Fed Up pin drop speech by father of daughter k...
4       Fed Up fantastic trumps  point plan to reform ...
                              ...                        
2091    -NO AUTHOR- teens walk free after gangrape con...
2092    -NO AUTHOR- school named for munichmassacre ma...
2093            -NO AUTHOR- russia unveils satan  missile
2094    -NO AUTHOR- check out hillarythemed haunted house
2095    Eddy Lavine cannabis aficionados develop thca ...
Name: content, Length: 2096, dtype: object


In [12]:
# separating the data and label
X=news_dataset.drop(columns='label')
Y=news_dataset['label']

In [13]:
print(X)
print(Y)

                    author  ...                                            content
0        Barracuda Brigade  ...  Barracuda Brigade muslims busted they stole mi...
1     reasoning with facts  ...  reasoning with facts re why did attorney gener...
2        Barracuda Brigade  ...  Barracuda Brigade breaking weiner cooperating ...
3                   Fed Up  ...  Fed Up pin drop speech by father of daughter k...
4                   Fed Up  ...  Fed Up fantastic trumps  point plan to reform ...
...                    ...  ...                                                ...
2091           -NO AUTHOR-  ...  -NO AUTHOR- teens walk free after gangrape con...
2092           -NO AUTHOR-  ...  -NO AUTHOR- school named for munichmassacre ma...
2093           -NO AUTHOR-  ...          -NO AUTHOR- russia unveils satan  missile
2094           -NO AUTHOR-  ...  -NO AUTHOR- check out hillarythemed haunted house
2095           Eddy Lavine  ...  Eddy Lavine cannabis aficionados develop thca ...

[20

# Stemming:

# Stemming is the process of reducing a word to its root form, for example "benefits" to "benefit"

In [14]:
port_stem=PorterStemmer()

In [15]:
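# build the stopword set once; calling stopwords.words() for every word is slow
english_stopwords=set(stopwords.words('english'))

def stemming(content):
  # keep letters only, lowercase, and split into words
  stemmed_content=re.sub('[^a-zA-Z]',' ',content)
  stemmed_content=stemmed_content.lower()
  stemmed_content=stemmed_content.split()
  # drop stopwords and stem the remaining words
  stemmed_content=[port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
  stemmed_content=' '.join(stemmed_content)
  return stemmed_content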
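
The cleaning steps that run before stemming explain why rows with the author `-NO AUTHOR-` end up starting with just `author`: the hyphens are replaced by spaces and `no` is an English stopword. A standard-library sketch of those steps only (no stemming), using a tiny hand-written stopword set rather than NLTK's full list; `clean_text` and `DEMO_STOPWORDS` are hypothetical names:

```python
import re

# Tiny stopword set for illustration only; the notebook uses
# NLTK's full English list instead.
DEMO_STOPWORDS = {"no", "the", "out", "a", "of", "to"}

def clean_text(content, stopword_set=DEMO_STOPWORDS):
    """Letters only, lowercase, split on whitespace, drop stopwords (no stemming)."""
    letters_only = re.sub("[^a-zA-Z]", " ", content)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in stopword_set)

print(clean_text("-NO AUTHOR- russia unveils satan  missile"))
# prints: author russia unveils satan missile
```

The Porter stemmer then shortens the surviving words, which is how this title becomes `author russia unveil satan missil` in the notebook output.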

In [16]:
news_dataset['content']= news_dataset['content'].apply(stemming)

In [17]:
print(news_dataset['content'])

0       barracuda brigad muslim bust stole million gov...
1       reason fact attorney gener loretta lynch plead...
2       barracuda brigad break weiner cooper fbi hilla...
3       fed pin drop speech father daughter kidnap kil...
4       fed fantast trump point plan reform healthcar ...
                              ...                        
2091                author teen walk free gangrap convict
2092          author school name munichmassacr mastermind
2093                    author russia unveil satan missil
2094                  author check hillarythem haunt hous
2095    eddi lavin cannabi aficionado develop thca cry...
Name: content, Length: 2096, dtype: object


In [18]:
# separating the data and label

X=news_dataset['content'].values
Y=news_dataset['label'].values

In [19]:
print(X)

['barracuda brigad muslim bust stole million govt benefit'
 'reason fact attorney gener loretta lynch plead fifth'
 'barracuda brigad break weiner cooper fbi hillari email investig' ...
 'author russia unveil satan missil' 'author check hillarythem haunt hous'
 'eddi lavin cannabi aficionado develop thca crystallin strongest hash world thc']


In [20]:
print(Y)

['Real' 'Real' 'Real' ... 'Fake' 'Fake' '']


In [21]:
Y.shape

(2096,)

In [22]:
# converting the textual data to numerical data
vectorizer= TfidfVectorizer()
vectorizer.fit(X)

X=vectorizer.transform(X)
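
To make the TF-IDF step more concrete, here is a small sketch on three of the stemmed titles printed above. Each row of the result is one document, each column one vocabulary term, and with the default settings every row is scaled to unit L2 length. The names `docs`, `demo_vectorizer`, and `demo_matrix` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "barracuda brigad muslim bust stole million govt benefit",
    "author russia unveil satan missil",
    "author check hillarythem haunt hous",
]

demo_vectorizer = TfidfVectorizer()
demo_matrix = demo_vectorizer.fit_transform(docs)  # fit and transform in one call

print(demo_matrix.shape)                    # (number of documents, vocabulary size)
print(sorted(demo_vectorizer.vocabulary_))  # the learned terms

# Default norm='l2': every row has Euclidean length 1.
row_norms = np.sqrt(demo_matrix.multiply(demo_matrix).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

Because `author` occurs in two of the three documents, it receives a smaller weight than the terms that are unique to one document.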

In [23]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 18425 stored elements and shape (2096, 5213)>
  Coords	Values
  (0, 385)	0.360186474515665
  (0, 440)	0.360186474515665
  (0, 589)	0.360186474515665
  (0, 639)	0.3697642072739326
  (0, 1918)	0.38148640479571866
  (0, 2934)	0.2756393343563781
  (0, 3040)	0.30866146215821777
  (0, 4409)	0.39659894583441546
  (1, 293)	0.35712952110668233
  (1, 1577)	0.35001440748294765
  (1, 1650)	0.40227711098282576
  (1, 1851)	0.31651949704469357
  (1, 2707)	0.35712952110668233
  (1, 2742)	0.3284095229839025
  (1, 3451)	0.40227711098282576
  (1, 3728)	0.3011905875442771
  (2, 385)	0.39151117947183195
  (2, 572)	0.31732120654919904
  (2, 589)	0.39151117947183195
  (2, 966)	0.4542427144371104
  (2, 1409)	0.2554614549889334
  (2, 1623)	0.2545896715839206
  (2, 2097)	0.1838024493567392
  (2, 2313)	0.2861131762188201
  (2, 5035)	0.3750843306435056
  :	:
  (2092, 2827)	0.5312937133594603
  (2092, 3030)	0.5312937133594603
  (2092, 3055)	0.5042141500

Splitting the dataset into training and test data

In [29]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
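
Note that the vectorizer above was fitted on all 2096 rows before this split, so the IDF weights already reflect the test titles. One way to avoid that, sketched here on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, which fits TF-IDF on the training portion only. The `stratify` argument is an addition that keeps the class ratio the same in both parts. Texts, labels, and variable names below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up stemmed titles and labels, for illustration only.
texts = [
    "senat pass budget bill", "court uphold elect rule",
    "mayor open new bridg", "report show job growth",
    "alien endors candid", "miracl pill cure everyth",
    "secret moon base leak", "celebr clone spot mall",
]
labels = ["Real"] * 4 + ["Fake"] * 4

# Split the raw text first so the test rows stay unseen during fitting.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(text_train, y_train)  # TF-IDF vocabulary comes from training text only

predictions = pipeline.predict(text_test)
print(list(zip(text_test, predictions)))
```

The same `pipeline` object can later be handed raw text directly, since it applies the fitted vectorizer before the classifier.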

# Logistic Regression Model

In [30]:
model= LogisticRegression()


In [32]:
model.fit(X_train, Y_train)

# Evaluation

Accuracy score

In [33]:
X_train_prediction=model.predict(X_train)
training_data_accuracy=accuracy_score(Y_train, X_train_prediction)

In [34]:
print('Accuracy of the training data: ', training_data_accuracy)

Accuracy of the training data:  0.9767303102625299


In [35]:
X_test_prediction=model.predict(X_test)
test_data_accuracy=accuracy_score(Y_test, X_test_prediction)

In [36]:
print('Accuracy of the test data: ', test_data_accuracy)

Accuracy of the test data:  0.8333333333333334
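
The gap between training accuracy (about 0.98) and test accuracy (about 0.83) suggests some overfitting, and accuracy alone does not show which class the errors fall on. A confusion matrix is one common way to look closer. The sketch below uses made-up labels, not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions, for illustration only.
y_true = ["Real", "Real", "Fake", "Fake", "Fake", "Real"]
y_pred = ["Real", "Fake", "Fake", "Fake", "Real", "Real"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["Fake", "Real"])
print(cm)

# scikit-learn's convention is (y_true, y_pred).
print(accuracy_score(y_true, y_pred))
```

Here the first row counts true `Fake` items (two predicted `Fake`, one predicted `Real`) and the second row counts true `Real` items.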


Making a Prediction

In [37]:
# predicting the label of the first sample in the test set
X_new= X_test[0]

prediction= model.predict(X_new)
print(prediction)

['Real']


In [43]:
# note: this prints the true label of test sample 67, not sample 0 predicted above
print(Y_test[67])

Fake


In [39]:
X_new= X_test[54]

prediction= model.predict(X_new)
print(prediction)

['Fake']


In [41]:
# note: this prints the true label of test sample 43, not sample 54 predicted above
print(Y_test[43])

Real
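
The label checks in this section look up different positions than the predictions they follow (sample 0 versus 67, and 54 versus 43), so they cannot confirm whether those predictions were right. A small helper that reads the prediction and the true label from the same index avoids that. This is a sketch on made-up data; `compare_at` is a hypothetical name, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def compare_at(model, features, true_labels, index):
    """Return (predicted, actual) for the same row of a feature matrix."""
    # Slicing with index:index+1 keeps the row two-dimensional, as predict() expects.
    predicted = model.predict(features[index:index + 1])[0]
    return predicted, true_labels[index]

# Made-up training data, for illustration only.
demo_texts = ["senat pass budget bill", "alien endors candid",
              "court uphold elect rule", "miracl pill cure everyth"]
demo_labels = ["Real", "Fake", "Real", "Fake"]

demo_vectorizer = TfidfVectorizer()
demo_features = demo_vectorizer.fit_transform(demo_texts)
demo_model = LogisticRegression().fit(demo_features, demo_labels)

predicted, actual = compare_at(demo_model, demo_features, demo_labels, 1)
print("predicted:", predicted, "| actual:", actual)
```

In the notebook the equivalent call would pass `model`, `X_test`, and `Y_test` with a single index.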
