# Fake News Detector

## Overview
This notebook builds a machine learning classifier that labels news articles as real or fake based on their headlines. Four algorithms (Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting) are trained on TF-IDF features and compared on a held-out test set.

## Introduction

### Objective
The objective of this project is to build a model that can distinguish between real and fake news articles using various machine learning techniques.

### Dataset Description
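The dataset consists of two CSV files, True.csv (real articles) and Fake.csv (fake articles), each with title, text, subject, and date columns. After loading, a label column is added (1 for real, 0 for fake) and the two files are combined into a single table of 44,898 rows. Only the title column is used as model input; text, subject, and date are dropped before training.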

## Data Loading and Preprocessing

### Loading the Dataset
The required packages are installed first, and the two CSV files are then read into the notebook with pandas.

In [30]:
!pip install numpy
!pip install pandas
!pip install scikit-learn
!pip install nltk




In [31]:

import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [32]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SOHAM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Data Preprocessing
Preprocessing cleans each headline by replacing non-letter characters with spaces, lowercasing, removing English stop words, and applying Porter stemming to reduce words to a common root. The cleaned text is then converted into numerical features with TF-IDF vectorization (scikit-learn's TfidfVectorizer).
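As a rough illustration of the cleaning step described above, the sketch below keeps letters only, lowercases the text, and drops stop words. It uses only the standard library: `STOP_WORDS` is a small hypothetical set standing in for NLTK's English list, `clean_title` is an illustrative name that does not appear in the notebook, and stemming is deliberately left out.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English stop-word list.
STOP_WORDS = frozenset({"a", "an", "the", "is", "to", "of", "on", "in", "and", "for", "with"})


def clean_title(title, stop_words=STOP_WORDS):
    """Replace non-letters with spaces, lowercase, and remove stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", title)
    tokens = letters_only.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)


cleaned = clean_title("Pence says working with allies to put pressure on North Korea")
print(cleaned)
```

In the notebook itself, each remaining token is additionally passed through PorterStemmer, which is why the processed headlines contain forms such as "penc" and "pressur".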

In [33]:
# printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [34]:
true = pd.read_csv('True.csv')

In [35]:
fake = pd.read_csv('Fake.csv')

In [36]:
true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [37]:
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


## Model Selection

### Model Descriptions
Multiple machine learning models are trained and compared in this notebook. These include:
- **Logistic Regression**: A linear model for binary classification.
- **Decision Tree**: A model that splits data based on feature values.
- **Gradient Boosting**: An ensemble model that builds trees sequentially to reduce error.
- **Random Forest**: An ensemble model that builds multiple trees and averages their predictions.
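To make the first bullet more concrete, here is a minimal sketch of how a linear model turns feature values into a binary decision. The weights and bias are hypothetical values chosen for illustration, not coefficients from the fitted model; the label convention (1 for real, 0 for fake) follows the notebook.

```python
import math

# Hypothetical weights for a few stemmed tokens. A fitted LogisticRegression
# learns one weight per TF-IDF feature; these numbers are made up.
WEIGHTS = {"reuter": 2.0, "factbox": 1.5, "video": -2.5, "watch": -1.8}
BIAS = 0.0


def probability_real(features):
    """Sigmoid of the weighted sum of feature values (label 1 = real)."""
    score = BIAS + sum(WEIGHTS.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def predict(features, threshold=0.5):
    """Return 1 (real) if the probability reaches the threshold, else 0 (fake)."""
    return 1 if probability_real(features) >= threshold else 0


print(predict({"factbox": 0.7, "trump": 0.7}))  # positive score
print(predict({"watch": 0.6, "video": 0.8}))    # negative score
```

The tree-based models in the list work differently: they split on thresholds of individual feature values rather than summing weighted features.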

In [38]:
true["label"] = 1

In [39]:
fake["label"] = 0

In [40]:
news = pd.concat([fake,true],axis = 0)

In [41]:
news.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [42]:
news.isnull().sum()

title      0
text       0
subject    0
date       0
label      0
dtype: int64

In [43]:
news = news.drop(['text','subject','date'],axis=1)

In [44]:
news = news.sample(frac=1)    #data reshuffling...

In [45]:
news.reset_index(inplace=True)

In [46]:
news.head()

Unnamed: 0,index,title,label
0,16403,COLIN POWELL Picked On The Wrong Guy: GENERAL ...,0
1,4189,Pence says working with allies to put pressure...,1
2,2122,"CNN Panel ERUPTS, Gets UGLY After Guest Defen...",0
3,1748,Factbox: Trump on Twitter (September 13) - Tax...,1
4,10678,Ex-California lawmaker Yee sentenced to five y...,1


In [47]:
news = news.drop(['index'],axis=1)

In [48]:
news.head()

Unnamed: 0,title,label
0,COLIN POWELL Picked On The Wrong Guy: GENERAL ...,0
1,Pence says working with allies to put pressure...,1
2,"CNN Panel ERUPTS, Gets UGLY After Guest Defen...",0
3,Factbox: Trump on Twitter (September 13) - Tax...,1
4,Ex-California lawmaker Yee sentenced to five y...,1


In [49]:
news['title']

0        COLIN POWELL Picked On The Wrong Guy: GENERAL ...
1        Pence says working with allies to put pressure...
2         CNN Panel ERUPTS, Gets UGLY After Guest Defen...
3        Factbox: Trump on Twitter (September 13) - Tax...
4        Ex-California lawmaker Yee sentenced to five y...
                               ...                        
44893    Trump Obamacare move seen harming Americans, b...
44894    Bundy Case Ruled a Mistrial – Will Federal Cas...
44895    Mnangagwa told Mugabe he will be safe in Zimba...
44896     Republicans Are Secretly TERRIFIED That Trump...
44897    WATCH: EBONY MAGAZINE EDITOR Destroy Hillary W...
Name: title, Length: 44898, dtype: object

In [50]:
x = news['title']
y = news['label']

In [51]:
x

0        COLIN POWELL Picked On The Wrong Guy: GENERAL ...
1        Pence says working with allies to put pressure...
2         CNN Panel ERUPTS, Gets UGLY After Guest Defen...
3        Factbox: Trump on Twitter (September 13) - Tax...
4        Ex-California lawmaker Yee sentenced to five y...
                               ...                        
44893    Trump Obamacare move seen harming Americans, b...
44894    Bundy Case Ruled a Mistrial – Will Federal Cas...
44895    Mnangagwa told Mugabe he will be safe in Zimba...
44896     Republicans Are Secretly TERRIFIED That Trump...
44897    WATCH: EBONY MAGAZINE EDITOR Destroy Hillary W...
Name: title, Length: 44898, dtype: object

In [52]:
y

0        0
1        1
2        0
3        1
4        1
        ..
44893    1
44894    0
44895    1
44896    0
44897    0
Name: label, Length: 44898, dtype: int64

In [53]:
port_stem = PorterStemmer()

In [54]:
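# Build the stop-word set once; calling stopwords.words() for every token is very slow.
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # Keep letters only, lowercase, and split into tokens.
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # Drop stop words and reduce the remaining tokens to their Porter stems.
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content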

In [55]:
news['title'] = news['title'].apply(stemming)

In [56]:
print(news['title'])

0        colin powel pick wrong guy gener flynn rip shr...
1               penc say work alli put pressur north korea
2        cnn panel erupt get ugli guest defend trump mu...
3        factbox trump twitter septemb tax reform flori...
4        ex california lawmak yee sentenc five year prison
                               ...                        
44893    trump obamacar move seen harm american biparti...
44894      bundi case rule mistrial feder case soon crumbl
44895        mnangagwa told mugab safe zimbabw state media
44896    republican secretli terrifi trump alreadi hand...
44897    watch eboni magazin editor destroy hillari one...
Name: title, Length: 44898, dtype: object


In [57]:
#separating the data and label
x = news['title'].values
y = news['label'].values

In [58]:
x

array(['colin powel pick wrong guy gener flynn rip shred nasti comment leak email video',
       'penc say work alli put pressur north korea',
       'cnn panel erupt get ugli guest defend trump muslim ban video',
       ..., 'mnangagwa told mugab safe zimbabw state media',
       'republican secretli terrifi trump alreadi hand congress democrat',
       'watch eboni magazin editor destroy hillari one embarrass question'],
      dtype=object)

In [59]:
y

array([0, 1, 0, ..., 1, 0, 0])

In [60]:
x.shape

(44898,)

In [61]:
y.shape

(44898,)

In [62]:
from sklearn.model_selection import train_test_split

In [63]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,stratify=y,random_state=2)

In [64]:
x_train.shape

(31428,)

In [65]:
x_test.shape

(13470,)
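The two shapes above can be checked with a little arithmetic. The sketch assumes that scikit-learn rounds the test-set size up when `test_size` is a fraction and gives the remaining rows to the training set, which is consistent with the shapes shown; the per-class test counts are taken from the support column of the classification reports in this notebook.

```python
import math

n_samples = 44898  # total number of headlines
test_size = 0.3    # fraction passed to train_test_split

# Assumption: the fractional test size is rounded up and the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Support values reported for the test set: label 0 (fake) and label 1 (real).
fake_test, real_test = 7045, 6425
print(fake_test + real_test == n_test)
```

Because the split uses stratify=y, the share of fake and real headlines in the test set should closely match the share in the full dataset.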

In [66]:
# from sklearn.feature_extraction.text import TfidfVectorizer

In [67]:
vectorization = TfidfVectorizer()

In [68]:
xv_train = vectorization.fit_transform(x_train)

In [69]:
xv_test = vectorization.transform(x_test)

In [70]:
xv_train

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 288201 stored elements and shape (31428, 11975)>

In [71]:
xv_test

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 122073 stored elements and shape (13470, 11975)>
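To show what the TF-IDF transformation computes, here is a small pure-Python sketch. It assumes the weighting that scikit-learn documents for TfidfVectorizer defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document). The `tfidf` function and the three sample headlines are illustrative, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); each row maps term -> L2-normalized TF-IDF weight."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return sorted(df), rows


# Illustrative stemmed headlines, not rows from the dataset.
titles = ["trump tax reform", "trump twitter video", "senat vote tax bill"]
vocab, rows = tfidf(titles)
print(vocab)
print({term: round(weight, 3) for term, weight in rows[0].items()})
```

Terms that occur in fewer documents receive a larger weight, so "reform" outweighs "trump" in the first headline. The real vectorizer stores the result as a sparse matrix with one column per vocabulary term, which is why xv_train and xv_test share the same 11,975 columns.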

## Evaluation

### Evaluation Metrics
The models are evaluated using the following metrics:
- **Accuracy**: The proportion of correct predictions.
- **Precision**: The proportion of positive predictions that are actually positive.
- **Recall**: The proportion of actual positives that are predicted as positive.
- **F1-Score**: The harmonic mean of precision and recall.
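The four metrics above can be computed directly from true and predicted labels. The sketch below is a minimal standard-library version for the binary case; `binary_metrics` and the toy label lists are illustrative and not part of the notebook, which relies on scikit-learn's classification_report.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = real, 0 = fake.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

classification_report prints precision, recall, and F1 once per class, treating each label in turn as the positive class, which is why the reports in this notebook contain separate rows for 0 and 1.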

In [72]:
from sklearn.linear_model import LogisticRegression

In [73]:
lr = LogisticRegression()

In [74]:
lr.fit(xv_train,y_train)

In [75]:
pred_lr = lr.predict(xv_test)

In [76]:
lr.score(xv_test,y_test)

0.9429101707498144

In [77]:
from sklearn.metrics import classification_report


print(classification_report(y_test, pred_lr))


              precision    recall  f1-score   support

           0       0.96      0.93      0.94      7045
           1       0.93      0.95      0.94      6425

    accuracy                           0.94     13470
   macro avg       0.94      0.94      0.94     13470
weighted avg       0.94      0.94      0.94     13470



In [78]:
from sklearn.tree import DecisionTreeClassifier

In [79]:
dtc = DecisionTreeClassifier()

In [80]:
dtc.fit(xv_train, y_train)

In [81]:
pred_dtc = dtc.predict(xv_test)

In [82]:
dtc.score(xv_test, y_test)

0.9005939123979213

In [83]:
print(classification_report(y_test, pred_dtc))

              precision    recall  f1-score   support

           0       0.91      0.90      0.90      7045
           1       0.90      0.90      0.90      6425

    accuracy                           0.90     13470
   macro avg       0.90      0.90      0.90     13470
weighted avg       0.90      0.90      0.90     13470



In [84]:
from sklearn.ensemble import RandomForestClassifier

In [85]:
rf = RandomForestClassifier()

In [86]:
rf.fit(xv_train, y_train)

In [87]:
# Predict with the fitted random forest.
pred_rf = rf.predict(xv_test)

In [88]:
rf.score(xv_test, y_test)

0.9383815887156645

In [89]:
print(classification_report(y_test, pred_rf))



In [90]:
from sklearn.ensemble import GradientBoostingClassifier

In [91]:
gb = GradientBoostingClassifier()

In [92]:
gb.fit(xv_train, y_train)

In [93]:
# Predict with the fitted gradient boosting model.
pred_gb = gb.predict(xv_test)

In [94]:
gb.score(xv_test, y_test)

0.8410541945063104

In [95]:
print(classification_report(y_test, pred_gb))



In [101]:
def output_label(n):
    if n==0:
        return "It is a fake news!!"
    elif n==1:
        return "It is a genuine news!!"

In [102]:
def manual_testing(news):
    # Wrap the input in a one-row DataFrame and apply the training-time preprocessing.
    testing_news = {"title": [news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["title"] = new_def_test["title"].apply(stemming)
    new_x_test = new_def_test["title"]

    # Reuse the fitted TF-IDF vectorizer; it must not be refit on new input.
    new_xv_test = vectorization.transform(new_x_test)

    pred_lr = lr.predict(new_xv_test)
    pred_dt = dtc.predict(new_xv_test)
    pred_gb = gb.predict(new_xv_test)
    pred_rf = rf.predict(new_xv_test)

    # Print one line per model, using the predictions made for this input.
    print("\n\nLR Prediction: {} \nGB Prediction: {} \nDTC Prediction: {} \nRF Prediction: {}".format(
        output_label(pred_lr[0]), output_label(pred_gb[0]), output_label(pred_dt[0]), output_label(pred_rf[0])))

In [104]:
news_article = str(input())
manual_testing(news_article)

 Russia's invasion of Ukraine on February 24, 2022, was followed by an information war — replete with a large-scale disinformation campaign, targeted propaganda and conspiracy theories, especially on social media. Beyond that, NewsGuard, a US journalism and technology outfit that has been fighting disinformation for years, identified 311 websites publishing pro-Russian disinformation to justify Moscow's war of aggression against its neighbor.  So it is no wonder that DW's fact-checking team spent most of its energy in 2022 dealing with false claims surrounding the war in Ukraine. But our team also got to the bottom of other odd stories on topics related to health, sports and the environment. Here are 10 of the most blatant and unusual.




LR Prediction: It is a fake news!! 
GB Prediction: It is a fake news!! 
DTC Prediction: It is a fake news!! 
RF Prediction: It is a genuine news!!


### Model Comparison
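The models are compared on the held-out test set (30% of the data, 13,470 headlines) using each model's score method and classification_report. Logistic Regression and Random Forest score highest and are close to each other, the Decision Tree is somewhat lower, and Gradient Boosting with default settings is lowest.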
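For a quick side-by-side view, the test accuracies returned by the score calls earlier in this notebook can be ranked with a few lines of Python. The dictionary below simply copies those four outputs; the variable names are illustrative.

```python
# Test accuracies copied from the score() outputs in this notebook.
accuracy = {
    "Logistic Regression": 0.9429101707498144,
    "Decision Tree": 0.9005939123979213,
    "Random Forest": 0.9383815887156645,
    "Gradient Boosting": 0.8410541945063104,
}

# Sort from best to worst accuracy.
ranked = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:<20} {score:.3f}")
```

Accuracy alone can hide differences between classes, so the per-class precision and recall in the classification reports are worth reading alongside this ranking.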

## Prediction

### Manual Testing Function
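A function, manual_testing, is provided to try the models on new input by hand. It applies the same stemming and TF-IDF transformation used during training and prints the prediction of each of the four models. Because the models were trained on headlines only, predictions for full article text, as in the example run above, should be interpreted with caution.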

## Conclusion

### Summary
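In this notebook, we explored four machine learning models for fake news detection using TF-IDF features of article headlines. Logistic Regression performed best with a test accuracy of 94.3%, narrowly ahead of Random Forest (93.8%); the Decision Tree reached 90.1% and Gradient Boosting 84.1%.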