<a href="https://colab.research.google.com/github/Aryan9012004/Machine_Learning/blob/main/P2_(Fake_News_Prediction).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![picture](https://drive.google.com/uc?export=view&id=1nf7r1i-1S4S1Q96TtI8WMaTZHllet6MW)

In [None]:
import pandas as pd
import numpy as np
import re  # Regular expressions, used here to strip non-letter characters from text
from nltk.corpus import stopwords  # NLTK's list of common words (e.g. "the", "is") that carry little meaning
from nltk.stem.porter import PorterStemmer  # Reduces words to their stems (e.g. "running" -> "run")
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text into numerical TF-IDF features
from sklearn.model_selection import train_test_split  # Splits the data into training and test sets
from sklearn.linear_model import LogisticRegression  # Models the probability of a binary outcome (0 or 1)
from sklearn.metrics import accuracy_score  # Computes the fraction of predictions that match the true labels

In [None]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data PreProcessing

In [None]:
news_dataset = pd.read_csv("/content/train.csv")

In [None]:
news_dataset.shape

(20800, 5)

In [None]:
news_dataset.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [None]:
#Used for counting null values in the dataset
news_dataset.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [None]:
# Replace null values with a single space (no imputation, since the dataset is large)
news_dataset = news_dataset.fillna(" ")

In [None]:
news_dataset.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [None]:
#Using title and author column to predict whether the news is real or fake
news_dataset['content'] = news_dataset['author']+" "+news_dataset['title']

In [None]:
print(news_dataset['content'])

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [None]:
# Separating the features and the label
X = news_dataset.drop(columns="label")
Y = news_dataset['label']

In [None]:
print(X)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

In [None]:
print(Y)

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64


**Stemming**

Stemming is the process of reducing a word to its root form, for example "running" to "run".

In [None]:
port_stem = PorterStemmer()

In [None]:
english_stopwords = set(stopwords.words("english"))  # Build the stop-word set once instead of on every word

def stemming(content):
  # Replace every character that is not a letter with a space
  stemmed_content = re.sub('[^a-zA-Z]', " ", content)
  # Lowercase so that differently capitalised forms count as the same word
  stemmed_content = stemmed_content.lower()
  # Split the text into a list of words
  stemmed_content = stemmed_content.split()
  # Drop stop words and reduce each remaining word to its stem
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
  # Join the stemmed words back into a single space-separated string
  stemmed_content = " ".join(stemmed_content)
  return stemmed_content
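The cleaning steps that run before the stemmer can be tried in isolation. The sketch below uses only the standard library; `clean_tokens` and `SAMPLE_STOPWORDS` are hypothetical names introduced for illustration, the stop-word set is a tiny stand-in for NLTK's full English list, and the stemming step itself is deliberately left out.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# NLTK's full English list instead.
SAMPLE_STOPWORDS = {"why", "the", "you"}


def clean_tokens(text, stop_words=SAMPLE_STOPWORDS):
    """Keep letters only, lowercase, split into words, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in stop_words]


tokens = clean_tokens("Why the Truth Might Get You Fired, 2016!")
print(tokens)  # ['truth', 'might', 'get', 'fired']
```

Passing each remaining token through `port_stem.stem` and joining the results with spaces would complete what `stemming` does.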

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object


In [None]:
# Separating the data and the label
X = news_dataset['content'].values  # .values returns the column as a NumPy array

In [None]:
Y = news_dataset['label'].values

In [None]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [None]:
print(Y)

[1 0 1 ... 0 1 1]


Converting the textual data to numerical data

In [None]:
vectorizer = TfidfVectorizer() # Weights each word by its frequency in a document, discounted by how many documents contain it
vectorizer.fit(X)
X = vectorizer.transform(X)
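To make the TF-IDF weights less opaque, here is a minimal hand-rolled sketch of the calculation. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which are scikit-learn's documented defaults. The function name `tfidf_rows` is hypothetical, and splitting on whitespace is a simplification: `TfidfVectorizer` by default also drops single-character tokens.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised rows)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


docs = ["truth might get fire", "civilian kill airstrik", "truth civilian"]
rows = tfidf_rows(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "truth" also occurs in another document, so it receives a lower weight than "might", which occurs only once in the whole collection.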

In [None]:
print(X)

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420

Train_Test_Split

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2) # stratify=Y keeps the same proportion of each label in the training and test sets
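The idea behind a stratified split can be sketched in plain Python: shuffle the row indices of each class separately and take the same fraction from every class for the test set. This is an illustration of the concept only, not scikit-learn's implementation, and it will not reproduce the exact rows that `random_state=2` selects. The function name `stratified_split_indices` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, seed=2):
    """Split row indices so that each class contributes the same test fraction."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [0] * 60 + [1] * 40  # toy data: 60% label 0, 40% label 1
train_idx, test_idx = stratified_split_indices(labels)
print(len(train_idx), len(test_idx))  # 80 20
```

Both resulting sets keep the 60/40 class balance of the toy labels, which is the property `stratify=Y` provides.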

In [None]:
print(X.shape,X_train.shape,X_test.shape)

(20800, 17128) (16640, 17128) (4160, 17128)


Training the Model: Logistic Regression


In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train , Y_train)
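Once fitted, a binary logistic regression model scores a sample as a weighted sum of its features plus a bias, passes that score through the sigmoid function to obtain a probability, and predicts label 1 when the probability exceeds 0.5. The sketch below illustrates this with made-up numbers; `predict_label`, the term names, and the weights are all hypothetical and are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability of label 1) for one sample of {term: value} features."""
    score = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    probability = sigmoid(score)
    return (1 if probability > threshold else 0), probability


# Hypothetical weights for two made-up terms; a real model learns one weight per feature.
example_weights = {"term_a": 1.5, "term_b": -2.0}
label, probability = predict_label({"term_a": 0.8}, example_weights, bias=0.0)
print(label, round(probability, 3))
```

A positive weight pushes the prediction towards label 1 and a negative weight towards label 0.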

Evaluation

Accuracy Score

In [None]:
# Accuracy score on the training dataset
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction) # accuracy_score expects (y_true, y_pred)

In [None]:
print("Accuracy score of training data ",training_data_accuracy)

Accuracy score of training data  0.9865985576923076


In [None]:
# Accuracy score on the test dataset
X_test_prediction = model.predict(X_test)
X_test_accuracy = accuracy_score(Y_test, X_test_prediction) # accuracy_score expects (y_true, y_pred)

In [None]:
print("Accuracy score for test data ",X_test_accuracy)

Accuracy score for test data  0.9790865384615385
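Accuracy is simply the fraction of predictions that match the true labels. The small sketch below (the function name `accuracy` is hypothetical) computes it by hand and checks that the test score printed above is consistent with 4073 correct predictions out of the 4160 test rows.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75 (3 of 4 correct)

# 4073 correct out of 4160 test rows gives the test accuracy reported above.
reported_test_accuracy = 4073 / 4160
print(reported_test_accuracy)
```

Because accuracy only counts matches, it does not distinguish between real news flagged as fake and fake news passed as real.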


Making a Predictive System

In [None]:
X_new = X_test[0] # First sample from the test set
prediction = model.predict(X_new)
print(prediction)
if prediction[0] == 0: # Label 0 = real news, label 1 = fake news
  print("Real News")
else:
  print("Fake News")

[1]
Fake News
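To classify a headline that is not already in the vectorized test set, the same preprocessing has to be applied before calling the model. The helper below is a hypothetical sketch of that flow. It takes the preprocessing function, the fitted vectorizer, and the trained model as arguments, so in this notebook it would presumably be called with `stemming`, `vectorizer`, and `model`.

```python
def classify_headline(author, title, preprocess, vectorizer, model):
    """Return "Real News" or "Fake News" for a raw author and title.

    preprocess: function mapping raw text to cleaned text (e.g. the notebook's stemming)
    vectorizer: fitted object with a transform(list_of_texts) method
    model:      trained classifier with a predict(features) method
    """
    content = preprocess(author + " " + title)  # same "author title" layout used for training
    features = vectorizer.transform([content])  # transform expects a list of documents
    label = model.predict(features)[0]
    return "Real News" if label == 0 else "Fake News"


# Intended use inside the notebook (requires the fitted objects, so it is not run here):
# classify_headline("Some Author", "Some headline", stemming, vectorizer, model)
```

Note that only `transform` is called on the vectorizer; refitting it on new text would change the feature columns the model was trained on.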


In [None]:
print(Y_test[0])

1
