Data Description 

id: unique id for a news article

title: the title of a news article

author: author of the news article

text: the text of the article; could be incomplete

label: a label that marks the article as potentially unreliable

      1: unreliable
      0: reliable

Importing needed libraries

In [4]:
import numpy as np 
import pandas as pd 
#importing regular expressions, needed for searching and cleaning text
import re
#importing the NLTK stopword list to remove uninformative words from our data
from nltk.corpus import stopwords
#importing the Porter stemmer, needed to reduce each word to its root
from nltk.stem.porter import PorterStemmer
#importing TfidfVectorizer to convert text into numeric feature vectors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score

In [5]:
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
#Knowing more about stopwords 
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
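As a rough illustration of how this list is used later in the notebook, the sketch below filters a tokenised sentence against a stopword collection. The `sample_stopwords` set is a hand-picked subset of the words printed above, chosen only for this example; the real pipeline uses the full NLTK list. Keeping the words in a `set` rather than a list makes each membership test constant time on average.

```python
# Hand-picked subset of the NLTK English stopwords printed above (illustrative only).
sample_stopwords = {"we", "the", "after", "it", "was"}

sentence = "We saw the letter after it was published"

# Lowercase, split on whitespace, and keep only the words that are not stopwords.
tokens = sentence.lower().split()
kept = [word for word in tokens if word not in sample_stopwords]

print(kept)
```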

Data preprocessing 

In [11]:
#loading the data into a pandas DataFrame
news_data = pd.read_csv('/content/train.csv') 



In [12]:
news_data.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [14]:
#counting the number of missing values in the data 
news_data.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [16]:
#replacing the null values with empty string
news_data=news_data.fillna('')


In [18]:
#merging the author name and the news title into a single column
news_data['content'] = news_data['author'] +' '+ news_data['title'] 
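The earlier `fillna('')` step matters for this concatenation: in pandas, adding a string to a missing value gives a missing value, so any article without an author or title would end up with no `content` at all. The sketch below shows the difference on a hypothetical two-row frame (the names and headlines are made up for the example).

```python
import pandas as pd

# Hypothetical two-row frame; the second row has no author.
df = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["First headline", "Second headline"],
})

# Without filling, the missing author propagates into the combined column.
raw = df["author"] + " " + df["title"]

# After filling with empty strings, every row yields a usable string.
filled = df.fillna("")
content = filled["author"] + " " + filled["title"]

print(raw.isnull().sum())   # number of rows lost to the missing value
print(content.tolist())
```

Note that a row with no author keeps a leading space; the later `split()` in the stemming function discards it.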


In [19]:
news_data.head()

Unnamed: 0,id,title,author,text,label,content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,Consortiumnews.com Why the Truth Might Get You...
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,Jessica Purkiss 15 Civilians Killed In Single ...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,Howard Portnoy Iranian woman jailed for fictio...


In [20]:
#separating the features and the labels
X = news_data.drop(columns='label')
Y = news_data['label']

In [24]:
#stemming
port_stem = PorterStemmer()
#building the stopword set once so that each lookup is fast
stop_words = set(stopwords.words('english'))
#creating a stemming function to be called every time we need it
def stemming(content):
  #replacing every non-letter character (punctuation marks, digits, ...) with a space
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  #converting to lowercase letters
  stemmed_content = stemmed_content.lower()
  #splitting into a list in which every element is a word
  stemmed_content = stemmed_content.split()
  #stemming every word in the list except stopwords, which are dropped
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  #joining the stemmed words back into a single string
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content
    

In [26]:
#testing our stemming function 
stemming("Darrell Lucus House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It")

'darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
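To see what the stemmer does in isolation, the short sketch below stems a few words from the test sentence one at a time. It uses only `PorterStemmer`, which needs no corpus download. The stems are truncated roots rather than dictionary words, which is fine here because they only serve as feature keys.

```python
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Words taken from the test sentence above, already lowercased.
words = ["darrell", "house", "aide", "tweeted"]
stems = [port_stem.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```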

In [27]:
#applying stemming function to content column 
news_data['content']= news_data['content'].apply(stemming)

In [28]:
#extracting the processed text and the labels as NumPy arrays
X=news_data['content'].values
Y=news_data['label'].values

In [29]:
#converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
#transforming every document in X into its TF-IDF feature vector
X = vectorizer.transform(X)
print(X)

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420
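Each printed entry is `(row, column)  weight`: the row is the article, the column is a term's index in the learned vocabulary, and the weight is its TF-IDF score. The sketch below reproduces this on a tiny, hypothetical corpus of already-stemmed strings (not taken from the dataset) so the structure is easier to inspect. It assumes the default `TfidfVectorizer` settings, under which a term that appears in fewer documents gets a larger weight and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-stemmed documents (not taken from the dataset).
docs = [
    "darrel lucu hous dem aid",
    "hous dem vote",
    "jason chaffetz tweet",
]

toy_vectorizer = TfidfVectorizer()
tfidf = toy_vectorizer.fit_transform(docs)   # sparse matrix: one row per document

# Column index assigned to each term in the learned vocabulary.
vocab = toy_vectorizer.vocabulary_

row0 = tfidf[0].toarray().ravel()
print(tfidf.shape)                                  # (documents, vocabulary size)
print(row0[vocab["darrel"]], row0[vocab["hous"]])   # rare term vs. shared term
print(np.linalg.norm(row0))                         # row length after normalisation
```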

Splitting data into training and test sets


In [30]:
X_train,X_test,Y_train,Y_test= train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)
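The `stratify=Y` argument asks `train_test_split` to keep the proportion of each label the same in both splits. The sketch below checks this on hypothetical, deliberately imbalanced labels; the array sizes and the placeholder features are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 reliable (0) and 20 unreliable (1).
y = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)   # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y, test_size=0.2, stratify=y, random_state=2
)

# Both splits keep the 20% share of class 1 found in the full label array.
print(y_tr.mean(), y_te.mean())
```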

Training the Model

In [31]:
#training the model 
model = LogisticRegression()
model.fit(X_train,Y_train)


LogisticRegression()
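One caveat with the cells above: the vectorizer was fitted on all articles before the split, so its vocabulary and IDF weights have already seen the test documents. A common variation, sketched below, is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`, so that both are fitted on the training split only. This is not what the notebook does; the mini-corpus and its labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus of already-cleaned text; 1 = unreliable, 0 = reliable.
texts = [
    "shock secret expos", "miracl cure shock", "secret miracl reveal", "shock cure expos",
    "senat pass budget", "court hear budget case", "senat vote budget", "court pass rule",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test documents stay unseen.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# The pipeline fits the vectorizer and the classifier on the training split only.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipe.fit(text_train, y_train)

print(pipe.predict(text_test))
```

Words that occur only in the test split are simply ignored at prediction time, which mirrors what happens with genuinely new articles.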

Model Evaluation 

In [32]:
#accuracy score on the training set
X_train_prediction = model.predict(X_train)
training_set_accuracy = accuracy_score(Y_train, X_train_prediction)
print("Accuracy score of the training set = ", training_set_accuracy)

Accuracy score of the training set =  0.9865985576923076


In [33]:
#accuracy score on the test set
X_test_prediction = model.predict(X_test)
test_set_accuracy = accuracy_score(Y_test, X_test_prediction)
print("Accuracy score of the test set = ", test_set_accuracy)

Accuracy score of the test set =  0.9790865384615385
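Accuracy is the fraction of articles whose predicted label matches the true label, so the two scores above say that roughly 98.7% of training articles and 97.9% of test articles were classified correctly. The sketch below computes the same quantity by hand on hypothetical labels and compares it with `accuracy_score`, whose documented argument order is true labels first, predictions second.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for five articles.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Accuracy is the fraction of positions where prediction and label agree.
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)   # 4 correct out of 5 in both cases
```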


Making a Predictive System 

In [34]:
#example using the second row of the test set
X_new = X_test[1]
prediction = model.predict(X_new)
#label 0 means reliable, label 1 means unreliable
if prediction[0] == 0:
  print("the news is real")
else:
  print("the news is fake")

the news is real
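The cell above predicts on a row that is already vectorized. To classify a brand-new headline, the raw text has to go through the same cleaning and then through the already-fitted vectorizer's `transform` (not `fit`), so that it is mapped onto the training vocabulary. The sketch below shows that flow on hypothetical toy data; `clean`, `predict_label`, `toy_vectorizer` and `toy_model` are illustrative names, and stopword removal and stemming are left out to keep it short.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def clean(text):
    # Keep letters only and lowercase; stopword removal and stemming omitted for brevity.
    return re.sub("[^a-zA-Z]", " ", text).lower()

# Hypothetical toy training data; 1 = unreliable, 0 = reliable.
train_texts = ["Shocking secret exposed!", "Miracle cure, shocking",
               "Senate passes budget", "Court hears budget case"]
train_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform([clean(t) for t in train_texts]), train_labels)

def predict_label(raw_text):
    # transform (not fit) maps the new text onto the vocabulary learned in training
    features = toy_vectorizer.transform([clean(raw_text)])
    return int(toy_model.predict(features)[0])

print(predict_label("Shocking miracle"))
print(predict_label("Senate budget"))
```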
