#### ----------------------------------------------------------------------------------------------------------------------
### Description of the project.

##### The project is fake news detection.

##### The dataset provides the title, author, and text of each news item, and we have to detect whether the news is fake or not. The models in this notebook use the author and title.

##### We use natural language processing (NLP) techniques: the author and title are converted into a numerical format, which is then used for prediction.

##### We then apply different classification algorithms and choose the one with the higher accuracy.

#### ------------------------------------------------------------------------------------------------------------------------

In [1]:
#Here we import all the libraries needed for this project.

#Linear Algebra.
import numpy as np

#Data processing(read csv file)
import pandas as pd

#Data Visualization
import matplotlib.pyplot as plt

#Statistical Visualization.
import seaborn as sns

#Regular expression.
import re

#nltk Stopwords.
from nltk.corpus import stopwords

#Punctuation.
from string import punctuation

#Data Splitting library. 
from sklearn.model_selection import train_test_split

#Logistic regression library.
from sklearn.linear_model import LogisticRegression

#Model accuracy.
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

#Model saving library.
import pickle

#warning handling.
import warnings

#ignore warnings.
warnings.filterwarnings("ignore")

In [2]:
#Now here we read the dataset.

df=pd.read_csv(r"C:\sudhanshu_projects\project-task-training-course\fake-news-prediction\fake_news.csv")

In [3]:
#Now here we check the dataset.

df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [4]:
#Here we remove the id column, because it is not useful for prediction.

df.drop("id",axis=1,inplace=True)

In [5]:
#Now here we check the dataset again.

df.head()

Unnamed: 0,title,author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [6]:
#Here we check the shape of dataset.

df.shape

#The shape of dataset is (20800,4).

(20800, 4)

In [7]:
#Now here we check the text value.

df["text"][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) \nWith apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide, it looks like we also know who the second-worst person is as well. It turns out that when Comey sent his now-infamous letter announcing that the FBI was looking into emails that may be related to Hillary Clinton’s email server, the ranking Democrats on the relevant committees didn’t hear about it from Comey. They found out via a tweet from one of the Republican committee chairmen. \nAs we now know, Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence, Judiciary, and Oversight committees that his agency was reviewing emai

In [8]:
#Now here we check the title of first row of dataset.

df["title"][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [9]:
#Now here we check the author of first row of dataset.

df["author"][0]

'Darrell Lucus'

In [10]:
#Here we check is there any null value in the dataset.

df.isnull().sum()

title      558
author    1957
text        39
label        0
dtype: int64

In [11]:
#Here we fill the null values with a single space, so the text columns can be combined later.

df=df.fillna(" ")

In [12]:
#Here we again check the number of null values in the dataset.

df.isnull().sum()

title     0
author    0
text      0
label     0
dtype: int64
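
Filling the nulls matters because the author and title columns are concatenated later. The following is a minimal sketch on a hypothetical two-row frame (the `toy` data is made up for illustration, not taken from the dataset): without filling, a missing author makes the whole combined value missing.

```python
import pandas as pd

# Hypothetical two-row frame; the second row has no author.
toy = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Some headline", "Another headline"],
})

# Without filling, the missing author makes the whole combined value missing.
combined_raw = toy["author"] + " " + toy["title"]

# After filling with a single space, every row yields a usable string.
filled = toy.fillna(" ")
combined_filled = filled["author"] + " " + filled["title"]

print(combined_raw.isna().tolist())
print(combined_filled.tolist())
```

The filled version keeps the title of the second row, padded with leading spaces that the later `split()` step discards.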

In [13]:
#Now here we combine the author and title columns into one new column, i.e. news.

df["news"]=df["author"]+" "+df["title"]

In [14]:
#Now here we check the dataset.

df.head()

Unnamed: 0,title,author,text,label,news
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,Consortiumnews.com Why the Truth Might Get You...
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,Jessica Purkiss 15 Civilians Killed In Single ...
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,Howard Portnoy Iranian woman jailed for fictio...


In [15]:
#Now here we get the stopwords.

stopword=stopwords.words("english")

len(stopword)

179

In [16]:
#Now here we create the porterstemmer .
from nltk.stem.porter import PorterStemmer

ps=PorterStemmer()

In [17]:
#Here we check is there any null value present in the dataset.

df.isnull().sum()

title     0
author    0
text      0
label     0
news      0
dtype: int64

In [18]:
def preprocess_text(text):
    # Replace every character that is not an ASCII letter (punctuation, digits, symbols) with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Split into words
    text = text.split()
    # Remove stopwords and apply stemming
    text = [ps.stem(word) for word in text if word not in stopword]
    # Join the words back into a single string
    text = ' '.join(text)
    return text

# Apply the preprocessing function to the 'news' column
df['news'] = df['news'].apply(preprocess_text)
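
To see what the cleaning steps do to a single title, here is a small standard-library sketch. It assumes a hand-picked toy stopword set (`TOY_STOPWORDS`) in place of the NLTK list and skips stemming, so it only illustrates the regex, lowercase, split, and stopword steps; `clean_tokens` is a hypothetical helper, not part of the notebook.

```python
import re

# Hypothetical stand-in for the NLTK stopword list, kept tiny so the sketch runs without downloads.
TOY_STOPWORDS = {"we", "it", "the", "a", "t", "s", "didn"}

def clean_tokens(text):
    # Replace everything except ASCII letters with a space, then lowercase and split.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return [w for w in letters_only.lower().split() if w not in TOY_STOPWORDS]

sample = "House Dem Aide: We Didn’t Even See Comey’s Letter"
tokens = clean_tokens(sample)
print(tokens)
```

Note that the curly apostrophe is not a letter, so "Didn’t" is split into "didn" and "t" before the stopword filter runs.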

In [19]:
len(punctuation)

32

In [20]:
#Now here we decide the independent and dependent feature.

x=df["news"] #Independent feature.

y=df["label"] #Dependent feature.

In [21]:
#Now here we check the x and y.

print(x.head())

print(y.head())

0    darrel lucu hous dem aid even see comey letter...
1    daniel j flynn flynn hillari clinton big woman...
2               consortiumnew com truth might get fire
3    jessica purkiss civilian kill singl us airstri...
4    howard portnoy iranian woman jail fiction unpu...
Name: news, dtype: object
0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64


In [22]:
#Here we convert the news from text to a numerical matrix using TF-IDF (a sparse term-weighting scheme, not a word embedding).

#Here we import the tfidf vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

#Here we create the tfidf vectorizer model object.
tfid=TfidfVectorizer()

#Now here we fit and transform the x.
x=tfid.fit_transform(x)
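
The weighting behind this step can be sketched in plain Python. This is an illustration only, assuming scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation); tokenisation is simplified to a whitespace split, and `tfidf_rows` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # Scale each row to unit length; guard against an empty document.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["comey letter hous", "comey email", "iranian woman jail"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(v, 3) for v in rows[0]])
```

In the first toy document, "letter" gets a larger weight than "comey" because "comey" also appears in the second document.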

In [23]:
#Now here we check the shape of x.

x.shape

#After vectorization, its shape becomes (20800,17128).

(20800, 17128)

In [24]:
#Here we split the data.

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
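
Note that the vectorizer above was fitted on all rows before the split, so the test documents contributed to the vocabulary and IDF weights. One possible alternative, shown here as a sketch on a hypothetical toy corpus rather than as the method used in this notebook, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training rows only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df["news"] and df["label"].
texts = ["shock secret cure reveal", "senat vote budget bill"] * 10
labels = [1, 0] * 10

# Split the raw text first, so the test rows never influence the vocabulary or IDF weights.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training text only, then the classifier.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(text_train, y_train)

print(pipe.score(text_test, y_test))
```

The accuracies reported in this notebook come from the original fit-then-split order, so they would need to be re-measured if this approach were adopted.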

In [25]:
#Now here we check the shape of x-train/test and y-train/test.

print(f"The shape of x_train is: {x_train.shape}")

print(f"The shape of x_test is: {x_test.shape}")

print(f"The shape of y_train is: {y_train.shape}")

print(f"The shape of y_test is: {y_test.shape}")

The shape of x_train is: (16640, 17128)
The shape of x_test is: (4160, 17128)
The shape of y_train is: (16640,)
The shape of y_test is: (4160,)


In [26]:
#Here we apply the logistic regression model.

lor=LogisticRegression()

lor.fit(x_train,y_train)

In [27]:
#Now here we find the train and test accuracy.

y_pred_train=lor.predict(x_train)

y_pred_test=lor.predict(x_test)

train_ac=accuracy_score(y_train,y_pred_train)

test_ac=accuracy_score(y_test,y_pred_test)

print(f"The train accuracy is: {train_ac}")

print(f"The test accuracy is: {test_ac}")

The train accuracy is: 0.9874399038461539
The test accuracy is: 0.9783653846153846
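
Accuracy alone does not show which kind of mistake the model makes. The imports at the top already include `confusion_matrix` and `classification_report`; the sketch below shows how they could be used, with hypothetical labels and predictions in place of `y_test` and `y_pred_test`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions; in the notebook these would be y_test and y_pred_test.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes (order: 0, then 1).
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Label 1 is treated as fake, matching the prediction cell at the end of the notebook.
print(classification_report(y_true, y_hat, target_names=["real", "fake"]))
```

Here `fp` counts real news flagged as fake and `fn` counts fake news that slipped through as real.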


#### ------------------------------------------------------------------------------------------------------------------------------------
### Conclusion of the logistic regression (lor) model:

##### The train accuracy is 98.74% and the test accuracy is 97.84%.

#### --------------------------------------------------------------------------------------------------------------------------------------

In [28]:
#Now here we apply decision tree classifier.

from sklearn.tree import DecisionTreeClassifier

dtc=DecisionTreeClassifier()

dtc.fit(x_train,y_train)

In [29]:
#Now here we find the train and test accuracy.

y_pred_train1=dtc.predict(x_train)

y_pred_test1=dtc.predict(x_test)

train_ac1=accuracy_score(y_train,y_pred_train1)

test_ac1=accuracy_score(y_test,y_pred_test1)

print(f"The train accuracy of dtc model is: {train_ac1}")

print(f"The test accuracy of dtc model is: {test_ac1}")

The train accuracy of dtc model is: 1.0
The test accuracy of dtc model is: 0.9923076923076923


#### ----------------------------------------------------------------------------------------------------------------------------------
### Conclusion of dtc model is:

##### The train accuracy is 100%.

##### And the test accuracy is 99.23%.

#### ------------------------------------------------------------------------------------------------------------------------------------

In [30]:
#Now we save the vectorizer and dtc model (uncomment the lines below to write the files).

#pickle.dump(tfid,open(r"C:\sudhanshu_projects\project-task-training-course\fake-news-prediction\fake_news_vectorizer.pkl","wb"))

#pickle.dump(dtc,open(r"C:\sudhanshu_projects\project-task-training-course\fake-news-prediction\fake_news_detection.pkl","wb"))
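
A save and load round trip can be sketched with context managers, which close the file handles automatically. The `artifact` dictionary and the file name below are placeholders; in the notebook the pickled objects would be the fitted `tfid` and `dtc`.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; in the notebook this would be the fitted `tfid` or `dtc`.
artifact = {"vocabulary_size": 17128, "note": "placeholder for a fitted object"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "artifact.pkl"  # hypothetical file name
    # The with-blocks close the file handles even if pickling fails.
    with open(path, "wb") as fh:
        pickle.dump(artifact, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == artifact)
```

Pickle files should only be loaded from a trusted source, because unpickling can execute arbitrary code.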

In [33]:
#Now here we load the saved vectorizer and model.

with open(r"C:\sudhanshu_projects\project-task-training-course\fake-news-prediction\fake_news_vectorizer.pkl","rb") as f:
    vectorizer=pickle.load(f)

with open(r"C:\sudhanshu_projects\project-task-training-course\fake-news-prediction\fake_news_detection.pkl","rb") as f:
    model=pickle.load(f)

In [40]:
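#Now here we predict whether the news given by the user is fake.
#Note: the input is passed to the vectorizer as-is, so it should already be in the preprocessed (lowercase, stemmed) form used for training, as in the example below.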

user_input=input("enter the news author and title for test: ")

user_input=vectorizer.transform([user_input])

prediction=model.predict(user_input)[0]

if prediction==1:
    print("The given news is fake.")
else:
    print("The given news is real")

enter the news author and title for test:  daniel j flynn flynn hillari clinton big woman campu breitbart


The given news is real
