<a href="https://colab.research.google.com/github/KalyanMarella/Fake_News_Prediction_Project/blob/main/Fake_News_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Description Of Project
In today's digital age, the rapid spread of misinformation and fake news has become a significant concern. This project tackles the problem by training a text classifier that distinguishes authentic news articles from fabricated ones.


Dataset: collected from Kaggle \
https://www.kaggle.com/datasets/hassanamin/textdb3



#### Importing Libraries


In [None]:
import pandas as pd
import numpy as np
import spacy
nlp=spacy.load("en_core_web_sm")
import sklearn
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import classification_report,confusion_matrix

#### Reading Data

In [None]:
data=pd.read_csv("/content/fake_or_real_news.csv")

In [None]:
## Snippet of the Data
data.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


- The "Unnamed: 0" column appears to be a leftover row identifier and carries no signal for classification, so it is dropped.

In [None]:
data.drop('Unnamed: 0',axis=1,inplace=True)

In [None]:
data.head()

| | title | text | label |
|---|---|---|---|
| 0 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


- The model is trained on the "text" column because a "title" alone may not contain enough information to tell whether an article is "FAKE" or "REAL".

In [None]:
train_data=data[["text","label"]].copy()  # copy so a column can be added later without a SettingWithCopyWarning

In [None]:
train_data.head()

| | text | label |
|---|---|---|
| 0 | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | It's primary day in New York and front-runners... | REAL |


In [None]:
train_data.shape

(6335, 2)

- The dataset contains 6335 news articles, each labelled "REAL" or "FAKE".

In [None]:
data["text"][0] ## Each Individual Article



- Machine learning models need numerical input in order to assign each document to one of "N" classes.

- A simple first approach for converting text to numbers is TF-IDF: each token's frequency within a document (term frequency) is scaled by how rare the token is across all documents (inverse document frequency).
- Tokens that are distinctive for a document receive higher weights, while tokens that appear in most documents are dampened.
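
The weighting described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it follows the smoothed formula and L2 normalization that scikit-learn documents for `TfidfVectorizer` defaults, but the function name `tfidf`, the whitespace tokenizer, and the three example sentences are assumptions made for the example. No stop words are removed here, so the effect on a common word such as "the" stays visible.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative sketch)."""
    # Simplified tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        # L2-normalize so every document vector has length 1.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical mini-corpus, not taken from the dataset.
docs = ["the senate passed a bill", "the shocking secret", "the bill failed"]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s}{weight:.3f}")
```

In the first sentence, "senate" occurs in one document, "bill" in two, and "the" in all three, so their weights decrease in that order.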

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer=TfidfVectorizer(stop_words="english")

In [None]:
vector=vectorizer.fit_transform(train_data["text"])

In [None]:
vector.shape

(6335, 67351)

- Each of the 6335 articles becomes one row, and the 67351 columns correspond to the distinct tokens in the vocabulary learned from the text.

In [None]:
row=vector[0].toarray().ravel()  # densify only the first article, not the whole matrix
row[row!=0]  # keep the non-zero TF-IDF weights

array([0.02107768, 0.02727679, 0.03131292, 0.02090197, 0.05633448,
       0.06669515, 0.03031228, 0.02647206, 0.077979  , 0.02109143,
       0.02000559, 0.04723657, 0.01528162, 0.01896169, 0.02636593,
       0.02511382, 0.04005669, 0.01980484, 0.02355404, 0.01410968,
       0.02473388, 0.04056545, 0.02430447, 0.01975032, 0.0205038 ,
       0.03584678, 0.01859634, 0.04449653, 0.03355887, 0.09580673,
       0.02629626, 0.04996484, 0.02302592, 0.02478661, 0.0149153 ,
       0.03114695, 0.08301616, 0.01785948, 0.02676525, 0.03355887,
       0.03193558, 0.0387292 , 0.04585959, 0.02483985, 0.02083581,
       0.02846429, 0.01466748, 0.06979064, 0.05810632, 0.02176292,
       0.02744399, 0.03408415, 0.04595708, 0.03905172, 0.01837946,
       0.02503022, 0.03082897, 0.02861889, 0.01849989, 0.06831114,
       0.01355846, 0.03253009, 0.04405848, 0.01691804, 0.02065461,
       0.02218109, 0.03368568, 0.01895227, 0.01994911, 0.02340692,
       0.02418416, 0.17781446, 0.04749802, 0.08811696, 0.02592

- The article at index 0 of train_data is now a vector of TF-IDF weights; only its non-zero entries are shown.
- Every other article is transformed in the same way.


- Encoding the labels: "FAKE" as 1 and "REAL" as 0
- Splitting the data into training (70%) and test (30%) sets, stratified by label

In [None]:
train_data["new_label"]=train_data["label"].map({"REAL":0,"FAKE":1})

In [None]:
train_data[:3]

| | text | label | new_label |
|---|---|---|---|
| 0 | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | 1 |
| 1 | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | 1 |
| 2 | U.S. Secretary of State John F. Kerry said Mon... | REAL | 0 |


In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(
    train_data["text"],train_data["new_label"],
    test_size=0.3,stratify=train_data["new_label"])

In [None]:
vectorizer=TfidfVectorizer(stop_words="english")

In [None]:
vector_train_x=vectorizer.fit_transform(x_train)
vector_test_x=vectorizer.transform(x_test)

## Passive Aggressive Classifier
- Passive Aggressive algorithms are online learning algorithms: they process one example at a time. The algorithm stays passive when an example is classified correctly with a sufficient margin, and turns aggressive on a misclassification, updating the weights just enough to correct the loss on that example while changing the weight vector as little as possible. Because it keeps updating whenever it sees an error, it is not guaranteed to settle on a fixed solution the way many batch algorithms do.
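
The passive and aggressive behaviour can be illustrated with a single update step. The sketch below is a plain-Python rendering of the PA-I rule with hinge loss, which is what the scikit-learn documentation describes for the default `loss="hinge"`. It is not the library's code. The function name `pa_update`, the two-feature toy vectors, and the use of labels in {-1, +1} (instead of the 0/1 encoding used in this notebook) are assumptions for the example.

```python
def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss on this example
    if loss == 0.0:
        return list(w)  # passive: margin is already large enough
    sq_norm = sum(xi * xi for xi in x)
    # Aggressive: smallest step that fixes the loss, capped by C.
    tau = min(C, loss / sq_norm)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]                              # start with zero weights
w_after = pa_update(w, x=[1.0, 2.0], y=+1)  # aggressive update
w_same = pa_update(w_after, x=[2.0, 4.0], y=+1)  # margin is 2, so no change
print(w_after, w_same)
```

After the first call the example sits on the margin boundary; the second example is already classified with a margin above 1, so the weights are left untouched.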

In [None]:
model=PassiveAggressiveClassifier(max_iter=100)
model.fit(vector_train_x,y_train)

In [None]:
matrix=confusion_matrix(y_test,model.predict(vector_test_x))

In [None]:
matrix

array([[887,  65],
       [ 46, 903]])

In [None]:
print(classification_report(y_test,model.predict(vector_test_x)))

              precision    recall  f1-score   support

           0       0.95      0.93      0.94       952
           1       0.93      0.95      0.94       949

    accuracy                           0.94      1901
   macro avg       0.94      0.94      0.94      1901
weighted avg       0.94      0.94      0.94      1901



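- The model correctly identified 887 real articles (true negatives) and 903 fake articles (true positives); it mislabelled 65 real articles as fake and 46 fake articles as real.
- Precision, recall and F1-score are all between 0.93 and 0.95 for both classes, with an overall accuracy of 0.94.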
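
As a cross-check, the figures in the report for class 1 ("FAKE") can be recomputed by hand from the confusion matrix printed above. The snippet assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
matrix = [[887, 65],
          [46, 903]]
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)   # of the articles predicted FAKE, how many were FAKE
recall = tp / (tp + fn)      # of the FAKE articles, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.93 recall=0.95 f1=0.94 accuracy=0.94
```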

## Creating Pipeline

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
clf=Pipeline([
    ("Vectorizer",TfidfVectorizer(stop_words="english")),
    ("PassiveAggressiveClassifier",PassiveAggressiveClassifier(max_iter=100))
])

- The pipeline takes raw text as input: the vectorizer converts the text to numerical features and the classifier then learns from those features, so a single fitted object is ready to predict on new text.
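
To make the chaining explicit, here is a rough sketch of what a two-step pipeline does on `fit` and `predict`. All three classes (`BagOfWords`, `NearestCentroid`, `ToyPipeline`) and the four example sentences are hypothetical stand-ins written for this illustration; they are far simpler than `TfidfVectorizer`, `PassiveAggressiveClassifier` and `sklearn.pipeline.Pipeline`, and only mirror the order in which the steps are called.

```python
from collections import Counter

class BagOfWords:
    """Toy transformer: learns a vocabulary and emits word-count vectors."""
    def fit_transform(self, texts):
        self.vocab = sorted({w for t in texts for w in t.lower().split()})
        return self.transform(texts)

    def transform(self, texts):
        rows = []
        for t in texts:
            counts = Counter(t.lower().split())
            rows.append([counts[w] for w in self.vocab])  # unseen words are ignored
        return rows

class NearestCentroid:
    """Toy classifier: predicts the class whose mean vector is closest."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            members = [row for row, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(members) for col in zip(*members)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return [min(self.centroids, key=lambda lab: dist(row, self.centroids[lab]))
                for row in X]

class ToyPipeline:
    """Chains a transformer and an estimator in the same order a pipeline would."""
    def __init__(self, transformer, estimator):
        self.transformer, self.estimator = transformer, estimator

    def fit(self, texts, y):
        features = self.transformer.fit_transform(texts)  # step 1: learn vocab, transform
        self.estimator.fit(features, y)                   # step 2: train on the features
        return self

    def predict(self, texts):
        # Reuse the fitted vocabulary; never refit on new data.
        return self.estimator.predict(self.transformer.transform(texts))

texts = ["shocking secret exposed", "senate passes budget bill",
         "secret cure shocking doctors", "bill signed after senate vote"]
labels = [1, 0, 1, 0]  # 1 = FAKE, 0 = REAL, as in this notebook
toy = ToyPipeline(BagOfWords(), NearestCentroid()).fit(texts, labels)
preds = toy.predict(["shocking secret cure", "senate vote on bill"])
print(preds)
```

The point of the design is that the vocabulary is learned only inside `fit`, so test or new text is always transformed with the training vocabulary.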

In [None]:
clf.fit(x_train,y_train)

In [None]:
confusion_matrix(y_test,clf.predict(x_test))

array([[888,  64],
       [ 49, 900]])

In [None]:
print(classification_report(y_test,clf.predict(x_test)))

              precision    recall  f1-score   support

           0       0.95      0.93      0.94       952
           1       0.93      0.95      0.94       949

    accuracy                           0.94      1901
   macro avg       0.94      0.94      0.94      1901
weighted avg       0.94      0.94      0.94      1901

