<a href="https://colab.research.google.com/github/Shalu-Yadav0811/Fake-News-Prediction/blob/main/Fake_News_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the Dependencies

In [106]:
import numpy as np
import pandas as pd
import re # re -> Regular Expression (For searching words in text or paragraph)
from nltk.corpus import stopwords  # nltk -> Natural Language Tool Kit, stopwords -> those words that doesn't add much value to a paragraph (i.e where, what, when, etc.)
from nltk.stem.porter import PorterStemmer # PorterStemmer -> Gives the root word for the particular word
from sklearn.feature_extraction.text import TfidfVectorizer # TfidfVectorizer -> Convert the text into feature Vectors
from sklearn.model_selection import train_test_split # Use to split our data into training and test data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [107]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [108]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Preprocessing

In [109]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/train.csv')

In [110]:
news_dataset.shape # number of rows and columns in dataset

(72134, 4)

In [111]:
news_dataset.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [112]:
# Counting the number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
title,558
text,39
label,0


In [113]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [114]:
news_dataset.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
title,0
text,0
label,0


In [115]:
# merging the news title and text
news_dataset['content'] = news_dataset['title'] + ' ' + news_dataset['text']

In [116]:
print(news_dataset['content'])

0        LAW ENFORCEMENT ON HIGH ALERT Following Threat...
1           Did they post their votes for Hillary already?
2        UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...
3        Bobby Jindal, raised Hindu, uses story of Chri...
4        SATAN 2: Russia unvelis an image of its terrif...
                               ...                        
72129    Russians steal research on Trump in hack of U....
72130     WATCH: Giuliani Demands That Democrats Apolog...
72131    Migrants Refuse To Leave Train At Refugee Camp...
72132    Trump tussle gives unpopular Mexican leader mu...
72133    Goldman Sachs Endorses Hillary Clinton For Pre...
Name: content, Length: 72134, dtype: object


In [117]:
news_dataset['content'].head()

Unnamed: 0,content
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...
1,Did they post their votes for Hillary already?
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...
3,"Bobby Jindal, raised Hindu, uses story of Chri..."
4,SATAN 2: Russia unvelis an image of its terrif...


In [118]:
# Separating the data & label
X = news_dataset.drop(columns='label', axis=1)
Y = news_dataset['label']

In [119]:
print(X)

       Unnamed: 0                                              title  \
0               0  LAW ENFORCEMENT ON HIGH ALERT Following Threat...   
1               1                                                      
2               2  UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...   
3               3  Bobby Jindal, raised Hindu, uses story of Chri...   
4               4  SATAN 2: Russia unvelis an image of its terrif...   
...           ...                                                ...   
72129       72129  Russians steal research on Trump in hack of U....   
72130       72130   WATCH: Giuliani Demands That Democrats Apolog...   
72131       72131  Migrants Refuse To Leave Train At Refugee Camp...   
72132       72132  Trump tussle gives unpopular Mexican leader mu...   
72133       72133  Goldman Sachs Endorses Hillary Clinton For Pre...   

                                                    text  \
0      No comment is expected from Barack Obama Membe...   
1         Did t

In [120]:
print(Y)

0        1
1        1
2        1
3        0
4        1
        ..
72129    0
72130    1
72131    0
72132    0
72133    1
Name: label, Length: 72134, dtype: int64


Stemming:

Stemming is the process of reducing a word to its Root word

emaple :
 actor, actress, acting --> act (Root word)

In [121]:
port_stem = PorterStemmer()

In [122]:
def stemming(title):
  stemmed_content = re.sub('[^a-zA-Z]',' ', title) # Regular expression for searching a word from text/paragraph, sub --> substitutes several values, ^ --> Exclusion, a-zA-Z --> alphabetes in lower and upper cases. (It removes everything that is not Alphabet)
  stemmed_content = stemmed_content.lower() # Converting all in lower case bcz may it can create some problems in processing
  stemmed_content = stemmed_content.split() # Splitting it to respective lists
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')] # stemming these words and removing the stopwords
  stemmed_content = ''.join(stemmed_content)
  return stemmed_content

In [123]:
news_dataset['title'] = news_dataset['title'].apply(stemming)

In [124]:
print(news_dataset['title'])

0        lawenforchighalertfollowthreatcopwhiteblackliv...
1                                                         
2        unbelievobamaattorneygenersaycharlottrioterpea...
3        bobbijindalraishinduusestorichristianconverswo...
4        satanrussiaunvimagterrifinewsupernukwesternwor...
                               ...                        
72129          russianstealresearchtrumphackudemocratparti
72130    watchgiulianidemanddemocratapologtrumpracistbi...
72131               migrantrefusleavtrainrefugecamphungari
72132    trumptusslgiveunpopularmexicanleadermuchneedsh...
72133                goldmansachendorshillariclintonpresid
Name: title, Length: 72134, dtype: object


In [125]:
# Separating the data and label
X = news_dataset['title'].values
Y = news_dataset['label'].values

In [126]:
print(X)

['lawenforchighalertfollowthreatcopwhiteblacklivesmattfyfterroristvideo'
 ''
 'unbelievobamaattorneygenersaycharlottrioterpeacprotesthomestatenorthcarolinavideo'
 ... 'migrantrefusleavtrainrefugecamphungari'
 'trumptusslgiveunpopularmexicanleadermuchneedshotarm'
 'goldmansachendorshillariclintonpresid']


In [127]:
print(Y)

[1 1 1 ... 0 0 1]


In [128]:
Y.shape

(72134,)

In [129]:
X.shape

(72134,)

In [130]:
# Converting the textual daya to numerical data
vectorizer = TfidfVectorizer() # Tf --> Term frequency (Basically counts the number of times a particular word is repeating in the documnet/text/paragraph and Assigns particular numberical value to those words), idf -->  Inverse Documnet Frequency (word that doesn't have meaning in it or are not significant and it reduces it importance value), TfidfVectorizer --> Convert the text into feature Vectors
vectorizer.fit(X)

X = vectorizer.transform(X)

In [131]:
print(X)

  (0, 26906)	1.0
  (2, 55387)	1.0
  (3, 3871)	1.0
  (4, 41950)	1.0
  (5, 47408)	1.0
  (6, 13379)	1.0
  (7, 22286)	1.0
  (8, 44882)	1.0
  (9, 26826)	1.0
  (10, 19791)	1.0
  (11, 29313)	1.0
  (12, 42283)	1.0
  (13, 58691)	1.0
  (14, 7062)	1.0
  (15, 3488)	1.0
  (16, 5745)	1.0
  (17, 55823)	1.0
  (18, 28631)	1.0
  (19, 42559)	1.0
  (20, 780)	1.0
  (21, 21542)	1.0
  (22, 42458)	1.0
  (23, 55566)	1.0
  (24, 14889)	1.0
  (25, 4171)	1.0
  :	:
  (72109, 36778)	1.0
  (72110, 21966)	1.0
  (72111, 3930)	1.0
  (72112, 55615)	1.0
  (72113, 18465)	1.0
  (72114, 3623)	1.0
  (72115, 38492)	1.0
  (72116, 42174)	1.0
  (72117, 46765)	1.0
  (72118, 27270)	1.0
  (72119, 3561)	1.0
  (72120, 33081)	1.0
  (72121, 17320)	1.0
  (72122, 45934)	1.0
  (72123, 10938)	1.0
  (72124, 55657)	1.0
  (72125, 61135)	1.0
  (72126, 12133)	1.0
  (72127, 60538)	1.0
  (72128, 25531)	1.0
  (72129, 41273)	1.0
  (72130, 58634)	1.0
  (72131, 30383)	1.0
  (72132, 52644)	1.0
  (72133, 19475)	1.0


Splitting dataset to training and test data

In [132]:
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [133]:
print(X_train)

  (0, 9230)	1.0
  (1, 18526)	1.0
  (2, 3694)	1.0
  (3, 22442)	1.0
  (4, 23569)	1.0
  (5, 39386)	1.0
  (7, 19212)	1.0
  (8, 27446)	1.0
  (9, 39068)	1.0
  (10, 60072)	1.0
  (11, 53190)	1.0
  (12, 14504)	1.0
  (13, 57593)	1.0
  (14, 8803)	1.0
  (15, 33646)	1.0
  (16, 18033)	1.0
  (17, 22194)	1.0
  (18, 3675)	1.0
  (19, 12149)	1.0
  (20, 21698)	1.0
  (21, 47280)	1.0
  (22, 29957)	1.0
  (23, 37900)	1.0
  (24, 6456)	1.0
  (25, 7179)	1.0
  :	:
  (57680, 53122)	1.0
  (57681, 20441)	1.0
  (57682, 6597)	1.0
  (57684, 60262)	1.0
  (57685, 50325)	1.0
  (57686, 59956)	1.0
  (57687, 58754)	1.0
  (57688, 33142)	1.0
  (57689, 1639)	1.0
  (57690, 18025)	1.0
  (57691, 16100)	1.0
  (57692, 918)	1.0
  (57693, 1879)	1.0
  (57694, 10068)	1.0
  (57696, 21220)	1.0
  (57697, 47672)	1.0
  (57698, 25601)	1.0
  (57699, 47449)	1.0
  (57700, 15091)	1.0
  (57701, 26206)	1.0
  (57702, 4045)	1.0
  (57703, 39141)	1.0
  (57704, 2)	1.0
  (57705, 4809)	1.0
  (57706, 35710)	1.0


In [134]:
print(Y_train)

[1 1 1 ... 0 1 1]


Training the Model: Logistic Regression Model

In [135]:
model = LogisticRegression()

In [136]:
model.fit(X_train, Y_train)

Evaluation

accuracy score

In [137]:
print(type(accuracy_score))

<class 'function'>


In [138]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [139]:
print('Accuracy Score of the Training data :',training_data_accuracy)

Accuracy Score of the Training data : 0.9898452527423016


Making a predictive System

In [140]:
X_new = X_test[0]

predictions = model.predict(X_new)
print(predictions)

if (predictions[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Real


In [141]:
print(Y_test[0])

1


In [142]:
print(Y_test[3])

0
