# Fake News Detector

## Overview
This notebook demonstrates the process of building a machine learning model to classify news articles as real or fake. Multiple algorithms are explored and evaluated for their effectiveness in detecting fake news.

## Introduction

### Objective
The objective of this project is to build a model that can distinguish between real and fake news articles using various machine learning techniques.

### Dataset Description
The dataset comes in two CSV files, `True.csv` (real articles) and `Fake.csv` (fake articles). Each row has `title`, `text`, `subject`, and `date` columns. The notebook adds a `label` column (1 for real, 0 for fake) and then keeps only the titles, which are cleaned before model training.

## Data Loading and Preprocessing

### Loading the Dataset
The dataset is loaded into the notebook using pandas.

In [8]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Data Preprocessing
Preprocessing cleans each title by replacing non-alphabetic characters with spaces, lowercasing, removing English stop words, and applying Porter stemming to reduce words to their stems. The cleaned titles are then converted into numerical features with `TfidfVectorizer`.
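
The vectorization step later in the notebook uses `TfidfVectorizer`. The following is a rough, hand-rolled sketch of the weighting that class applies under scikit-learn's default settings (smoothed IDF, then L2 normalization). It assumes whitespace-separated, already-cleaned titles; the function name `tfidf_vectors` and the toy titles are illustrative and not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Illustrates scikit-learn's default formula:
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Toy, already-stemmed titles (hypothetical examples).
titles = [
    "trump offer help pakistan",
    "trump sign tax bill",
    "senat pass tax bill",
]
vectors = tfidf_vectors(titles)
print(vectors[0])
```

In the first toy title, `pakistan` appears in only one document and so receives a larger weight than `trump`, which appears in two.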

In [13]:
# printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [15]:
true = pd.read_csv('True.csv')

In [17]:
fake = pd.read_csv('Fake.csv')

In [19]:
true.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


In [21]:
fake.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


## Model Selection

### Model Descriptions
Multiple machine learning models are trained and compared in this notebook. These include:
- **Logistic Regression**: A linear model for binary classification.
- **Decision Tree**: A model that splits data based on feature values.
- **Gradient Boosting**: An ensemble model that builds trees sequentially to reduce error.
- **Random Forest**: An ensemble model that builds multiple trees and averages their predictions.
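
To make the first description concrete, here is a minimal sketch of how a fitted logistic regression turns a feature vector into a class label. The weights, bias, and feature values are made-up numbers for illustration, not values learned in this notebook; the label convention (1 for real, 0 for fake) matches the one used in the cells below.

```python
import math

def predict_proba(weights, bias, features):
    """Probability of class 1 under a logistic model: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    """Return 1 (real) if the probability reaches the threshold, else 0 (fake)."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical weights for three TF-IDF features.
weights = [1.8, -2.4, 0.3]
bias = -0.1
title_features = [0.6, 0.1, 0.4]

p = predict_proba(weights, bias, title_features)
print(f"P(real) = {p:.3f}, label = {predict_label(weights, bias, title_features)}")
```

The tree-based models listed above make their decisions differently (by thresholding individual features), so this sketch applies to the linear model only.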

In [25]:
true["label"] = 1

In [27]:
fake["label"] = 0

In [29]:
news = pd.concat([fake,true],axis = 0)

In [31]:
news.head()

| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0 |


In [33]:
news.isnull().sum()

title      0
text       0
subject    0
date       0
label      0
dtype: int64

In [35]:
news = news.drop(['text','subject','date'],axis=1)

In [37]:
news = news.sample(frac=1)    #data reshuffling...

In [39]:
news.reset_index(inplace=True)

In [41]:
news.head()

| | index | title | label |
|---|---|---|---|
| 0 | 444 | White House says Tillerson still in charge at ... | 1 |
| 1 | 10816 | Foreclosure crisis snarls Clinton, Sanders' ef... | 1 |
| 2 | 11423 | Peru's president pardons ex-leader Fujimori, c... | 1 |
| 3 | 19256 | Magnitude 6.2 quake hits southeast of Oaxaca, ... | 1 |
| 4 | 8858 | U.S. House Republicans battle each other on gu... | 1 |


In [43]:
news = news.drop(['index'],axis=1)

In [45]:
news.head()

| | title | label |
|---|---|---|
| 0 | White House says Tillerson still in charge at ... | 1 |
| 1 | Foreclosure crisis snarls Clinton, Sanders' ef... | 1 |
| 2 | Peru's president pardons ex-leader Fujimori, c... | 1 |
| 3 | Magnitude 6.2 quake hits southeast of Oaxaca, ... | 1 |
| 4 | U.S. House Republicans battle each other on gu... | 1 |


In [47]:
news['title']

0        White House says Tillerson still in charge at ...
1        Foreclosure crisis snarls Clinton, Sanders' ef...
2        Peru's president pardons ex-leader Fujimori, c...
3        Magnitude 6.2 quake hits southeast of Oaxaca, ...
4        U.S. House Republicans battle each other on gu...
                               ...                        
44893    Mattis says North Korean ICBM not yet a 'capab...
44894     Republicans In Congress Want To Give Donald T...
44895    Trump offers to help Pakistan, calls PM Sharif...
44896    U.N. rights team warns Mexico of 'crisis' in j...
44897    ‘One for the Ages’ Full Video and Transcript o...
Name: title, Length: 44898, dtype: object

In [49]:
x = news['title']
y = news['label']

In [51]:
x

0        White House says Tillerson still in charge at ...
1        Foreclosure crisis snarls Clinton, Sanders' ef...
2        Peru's president pardons ex-leader Fujimori, c...
3        Magnitude 6.2 quake hits southeast of Oaxaca, ...
4        U.S. House Republicans battle each other on gu...
                               ...                        
44893    Mattis says North Korean ICBM not yet a 'capab...
44894     Republicans In Congress Want To Give Donald T...
44895    Trump offers to help Pakistan, calls PM Sharif...
44896    U.N. rights team warns Mexico of 'crisis' in j...
44897    ‘One for the Ages’ Full Video and Transcript o...
Name: title, Length: 44898, dtype: object

In [53]:
y

0        1
1        1
2        1
3        1
4        1
        ..
44893    1
44894    0
44895    1
44896    1
44897    0
Name: label, Length: 44898, dtype: int64

In [55]:
port_stem = PorterStemmer()

In [57]:
# Build the stop-word set once; a set lookup avoids rescanning the list for every word
stop_words = set(stopwords.words('english'))

def stemming(content):
    # Keep letters only, lowercase, and split into tokens
    tokens = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # Drop stop words and stem the remaining tokens
    stemmed_tokens = [port_stem.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(stemmed_tokens)

In [59]:
news['title'] = news['title'].apply(stemming)

In [60]:
print(news['title'])

0        white hous say tillerson still charg state depart
1        foreclosur crisi snarl clinton sander effort r...
2        peru presid pardon ex leader fujimori cite health
3            magnitud quak hit southeast oaxaca mexico usg
4                      u hous republican battl gun control
                               ...                        
44893      matti say north korean icbm yet capabl threat u
44894    republican congress want give donald trump unr...
44895    trump offer help pakistan call pm sharif terri...
44896    u n right team warn mexico crisi journalist sa...
44897    one age full video transcript trump incred un ...
Name: title, Length: 44898, dtype: object


In [63]:
#separating the data and label
x = news['title'].values
y = news['label'].values

In [65]:
x

array(['white hous say tillerson still charg state depart',
       'foreclosur crisi snarl clinton sander effort reach nevada voter',
       'peru presid pardon ex leader fujimori cite health', ...,
       'trump offer help pakistan call pm sharif terrif guy islamabad',
       'u n right team warn mexico crisi journalist safeti',
       'one age full video transcript trump incred un speech video'],
      dtype=object)

In [67]:
y

array([1, 1, 1, ..., 1, 1, 0], dtype=int64)

In [69]:
x.shape

(44898,)

In [71]:
y.shape

(44898,)

In [73]:
from sklearn.model_selection import train_test_split

In [75]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,stratify=y,random_state=2)
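
The `stratify=y` argument keeps the real/fake ratio the same in the training and test splits. The following hand-rolled sketch illustrates that idea only; it is not scikit-learn's implementation, and the helper name `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=2):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 60 fake (0) and 40 real (1).
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)

test_real_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_real_share)
```

With these toy labels, 40% of the full set is real, and the test split keeps that same 40% share.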

In [77]:
x_train.shape

(31428,)

In [79]:
x_test.shape

(13470,)

In [40]:
# from sklearn.feature_extraction.text import TfidfVectorizer

In [81]:
vectorization = TfidfVectorizer()

In [83]:
xv_train = vectorization.fit_transform(x_train)

In [85]:
xv_test = vectorization.transform(x_test)

In [87]:
xv_train

<31428x12009 sparse matrix of type '<class 'numpy.float64'>'
	with 288422 stored elements in Compressed Sparse Row format>

In [89]:
xv_test

<13470x12009 sparse matrix of type '<class 'numpy.float64'>'
	with 121945 stored elements in Compressed Sparse Row format>

## Evaluation

### Evaluation Metrics
The models are evaluated using the following metrics:
- **Accuracy**: The proportion of correct predictions.
- **Precision**: The proportion of positive predictions that are actually positive.
- **Recall**: The proportion of actual positives that are predicted as positive.
- **F1-Score**: The harmonic mean of precision and recall.
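
As a small worked example of these four definitions, the sketch below computes them by hand from true and predicted labels. The function name `binary_metrics` and the toy labels are illustrative assumptions; the notebook itself relies on `classification_report` for the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 1 = real, 0 = fake (hypothetical predictions).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here three of the four real items are found (recall 0.75) and three of the four positive predictions are correct (precision 0.75), while eight of ten predictions overall are right (accuracy 0.8).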

In [93]:
from sklearn.linear_model import LogisticRegression

In [95]:
lr = LogisticRegression()

In [97]:
lr.fit(xv_train,y_train)

In [99]:
pred_lr = lr.predict(xv_test)

In [101]:
lr.score(xv_test,y_test)

0.9455085374907202

In [103]:
from sklearn.metrics import classification_report


print(classification_report(y_test, pred_lr))


              precision    recall  f1-score   support

           0       0.95      0.94      0.95      7045
           1       0.94      0.95      0.94      6425

    accuracy                           0.95     13470
   macro avg       0.95      0.95      0.95     13470
weighted avg       0.95      0.95      0.95     13470



In [105]:
from sklearn.tree import DecisionTreeClassifier

In [107]:
dtc = DecisionTreeClassifier()

In [109]:
dtc.fit(xv_train, y_train)

In [113]:
pred_dtc = dtc.predict(xv_test)

In [115]:
dtc.score(xv_test, y_test)

0.8982925018559762

In [117]:
print(classification_report(y_test, pred_dtc))

              precision    recall  f1-score   support

           0       0.91      0.90      0.90      7045
           1       0.89      0.90      0.89      6425

    accuracy                           0.90     13470
   macro avg       0.90      0.90      0.90     13470
weighted avg       0.90      0.90      0.90     13470



In [119]:
from sklearn.ensemble import RandomForestClassifier

In [121]:
rf = RandomForestClassifier()

In [123]:
rf.fit(xv_train, y_train)

In [127]:
# Predict with the random forest itself so the report matches rf.score
pred_rf = rf.predict(xv_test)

In [129]:
rf.score(xv_test, y_test)

0.9408314773570898

In [131]:
print(classification_report(y_test, pred_rf))



In [133]:
from sklearn.ensemble import GradientBoostingClassifier

In [135]:
gb = GradientBoostingClassifier()

In [137]:
gb.fit(xv_train, y_train)

In [139]:
# Predict with the gradient boosting model itself so the report matches gb.score
pred_gb = gb.predict(xv_test)

In [141]:
gb.score(xv_test, y_test)

0.842316258351893

In [143]:
print(classification_report(y_test, pred_gb))



In [145]:
def output_label(n):
    if n==0:
        return "It is a fake news!!"
    elif n==1:
        return "It is a genuine news!!"

In [177]:
def manual_testing(news_text):
    # Wrap the input in a one-row DataFrame and apply the same cleaning used for training
    new_def_test = pd.DataFrame({"title": [news_text]})
    new_def_test["title"] = new_def_test["title"].apply(stemming)

    # Reuse the already-fitted vectorizer; it must not be refit on new input
    new_xv_test = vectorization.transform(new_def_test["title"])

    # Local names that do not shadow the test-set predictions computed earlier
    label_lr = lr.predict(new_xv_test)[0]
    label_dt = dtc.predict(new_xv_test)[0]
    label_gb = gb.predict(new_xv_test)[0]
    label_rf = rf.predict(new_xv_test)[0]

    print("\n\nLR Prediction: {} \nGB Prediction: {} \nDTC Prediction: {} \nRF Prediction: {}".format(
        output_label(label_lr), output_label(label_gb), output_label(label_dt), output_label(label_rf)))

In [185]:
news_article = str(input())
manual_testing(news_article)

 CLAIM: A law enforcement sniper assigned to former President Donald Trump’s rally Saturday in Butler, Penn., says the head of the Secret Service ordered him not to shoot the suspect accused of attempting to assassinate Trump.  AP’S ASSESSMENT: False. Snipers killed the suspected shooter moments after he opened fire on the former president, bloodying Trump’s ear, killing one rally attendee and injuring two. The Secret Service and the Butler Police Department say they have no agents, officers or employees with the name of the person claiming to be the sharpshooter.




LR Prediction: It is a fake news!! 
GB Prediction: It is a fake news!! 
DTC Prediction: It is a fake news!! 
RF Prediction: It is a fake news!!


### Model Comparison
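The models are compared on the held-out test set (30% of the data), using the accuracy returned by `score` together with the per-class precision, recall, and F1-score from `classification_report`. On this split, Logistic Regression scored highest and Gradient Boosting lowest.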
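
The accuracies returned by the `score` calls in this notebook can be collected and ranked in a few lines. The values below are copied from the recorded outputs; the tree-based models were created without a fixed `random_state`, so a re-run would likely give slightly different numbers.

```python
# Test accuracies as recorded by the `score` calls in this notebook.
accuracies = {
    "Logistic Regression": 0.9455085374907202,
    "Decision Tree": 0.8982925018559762,
    "Random Forest": 0.9408314773570898,
    "Gradient Boosting": 0.842316258351893,
}

# Sort from best to worst accuracy.
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<20} {acc:.4f}")

best_model = ranked[0][0]
```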

## Prediction

### Manual Testing Function
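A function is provided to test the models manually on new input. Given a piece of text, it applies the same cleaning and TF-IDF transformation used for training and prints the prediction of each model. Because the models were trained on titles only, headline-style input is closest to the training data.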
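
One possible variation, sketched here under stated assumptions, returns the predictions as a dictionary instead of printing them, which makes the result easier to reuse or check. The name `predict_all` and the two stand-in classes are hypothetical; in the notebook, `preprocess` would be `stemming`, `vectorizer` would be the fitted `vectorization` object, and `models` would hold `lr`, `dtc`, `gb`, and `rf`.

```python
def predict_all(title, preprocess, vectorizer, models):
    """Return {model_name: 0 or 1} for a single title.

    `preprocess` is a callable applied to the raw text, `vectorizer` must
    offer `transform`, and each model must offer `predict`.
    """
    features = vectorizer.transform([preprocess(title)])
    return {name: int(model.predict(features)[0]) for name, model in models.items()}


# Hypothetical stand-ins so the sketch runs without trained models.
class KeywordVectorizer:
    def transform(self, texts):
        # One feature per text: 1 if the keyword is present, else 0.
        return [[int("shock" in text)] for text in texts]


class ThresholdModel:
    def predict(self, rows):
        # Label 0 (fake) when the keyword feature fires, else 1 (real).
        return [0 if row[0] else 1 for row in rows]


result = predict_all(
    "SHOCKING claim spreads online",
    preprocess=str.lower,
    vectorizer=KeywordVectorizer(),
    models={"LR": ThresholdModel()},
)
print(result)
```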

## Conclusion

### Summary
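In this notebook, we explored four machine learning models for fake news detection, trained on article titles only. Logistic Regression performed best, with a test accuracy of about 94.6%, followed closely by Random Forest at about 94.1%. Decision Tree reached about 89.8% and Gradient Boosting about 84.2%.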