<h1>Unmasking the Truth: Harnessing Machine Learning to Detect Fake News</h1>
<h2>This machine learning project, written in Python, combines TfidfVectorizer for feature extraction with PassiveAggressiveClassifier for classification to detect fake news with about 93.6% accuracy on a held-out test set.</h2>
<div align="center"> <img src="OIP.jpg" width="500" /></div>

In [1]:
#importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
#importing the dataset
df = pd.read_csv('news.csv')

<h1>Some information about the dataset:</h1>
<h3>The dataset contains 6335 rows of data.</h3>

In [3]:
df

,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [4]:
#check the dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  6335 non-null   int64 
 1   title       6335 non-null   object
 2   text        6335 non-null   object
 3   label       6335 non-null   object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


In [5]:
#check for null values
df.isnull().any()

Unnamed: 0    False
title         False
text          False
label         False
dtype: bool

In [6]:
#check for duplicates
df.duplicated().any()

False

<h3>This dataset contains 3171 "real" and 3164 "fake" news articles:</h3>

In [7]:
df.label.value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

<h1>Data Cleaning: </h1>

In [8]:
#dropping the 'Unnamed: 0' column, which is not used as a feature
df = df.drop(['Unnamed: 0'], axis=1)

<h1>Train The Model </h1>

<h3>TfidfVectorizer is used to convert the text data into a matrix of TF-IDF features, and PassiveAggressiveClassifier is used to classify the news as real or fake based on the TF-IDF features. </h3>
<h3>TfidfVectorizer is a text feature extraction tool that converts a collection of raw documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. It tokenizes the documents, learns the vocabulary and the inverse document frequency weights, and computes the TF-IDF values in a single step; the fitted vectorizer can then encode new documents with the same vocabulary. A term's weight is high when it occurs often in a document but in few documents overall, which lets the classifier focus on the words that best distinguish one article from another.</h3>
<h3>PassiveAggressiveClassifier is a linear classifier that is well suited to large-scale learning. It is an online learning algorithm: it stays passive when an example is classified correctly with a sufficient margin, and it updates its weights aggressively, by just enough to correct the error, when an example is misclassified or falls inside the margin. It is a good choice for text classification tasks because it can handle high-dimensional sparse data and can learn quickly.</h3>
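<h3>The passive/aggressive behaviour can be sketched in a few lines of NumPy. This is a simplified illustration of the PA-I update rule for labels in {-1, +1}, not scikit-learn's actual implementation; the function pa_update, the step-size cap C and the example vectors are assumptions made for this sketch.</h3>

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One PA-I step for a nonzero feature vector x and a label y in {-1, +1}."""
    # Hinge loss: zero when the example is on the correct side with margin >= 1
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w  # passive: the weights are left unchanged
    # Aggressive: move just far enough to fix the margin, capped at C
    tau = min(C, loss / np.dot(x, x))
    return w + tau * y * x

w0 = np.zeros(3)
x = np.array([1.0, 0.0, 2.0])

# Starting from zero weights the margin is violated, so the weights move
w1 = pa_update(w0, x, +1)
print(w1)

# The same example now satisfies the margin, so the step is (numerically) zero
w2 = pa_update(w1, x, +1)
print(w2)

# The same features with the opposite label are misclassified: a large update
w3 = pa_update(w1, x, -1)
print(w3)
```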

In [9]:
#Splitting the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

In [10]:
#Initializing a TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

In [11]:
#Fit and transform the training set
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

In [12]:
#Transforming the testing set
tfidf_test = tfidf_vectorizer.transform(X_test)

In [13]:
#Initializing a PassiveAggressive Classifier
pac = PassiveAggressiveClassifier(max_iter=50)

In [14]:
#Fitting the model
pac.fit(tfidf_train, y_train)

In [15]:
#Predicting on the testing set
y_pred = pac.predict(tfidf_test)

<h1>Evaluate the model:</h1>

<h3>The accuracy score and confusion matrix are used to evaluate the performance of the model.</h3>

In [16]:
# Calculate the accuracy score and confusion matrix
score = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {score}')
print(f'Confusion matrix: {cm}')

Accuracy: 0.936069455406472
Confusion matrix: [[587  41]
 [ 40 599]]


<h3>The accuracy of 0.936 means that the model correctly classified approximately 93.6% of the 1267 test articles as either real or fake.</h3>

<h3>The confusion matrix provides further insight into the performance of the model. It is a table that shows the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In scikit-learn's output, rows are the actual labels and columns are the predicted labels, in sorted label order (FAKE, then REAL). Treating FAKE as the positive class, the matrix reads as follows:</h3>

<h3>True Positives (TP): 587 - This represents the number of articles that were actually fake and were correctly classified as fake by the model.</h3>

<h3>True Negatives (TN): 599 - This represents the number of articles that were actually real and were correctly classified as real by the model.</h3>

<h3>False Positives (FP): 40 - This represents the number of articles that were actually real but were incorrectly classified as fake by the model.</h3>

<h3>False Negatives (FN): 41 - This represents the number of articles that were actually fake but were incorrectly classified as real by the model.</h3>
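<h3>The four counts can also be read directly from the matrix and turned into precision, recall and F1. The sketch below hard-codes the matrix printed above and assumes scikit-learn's convention of true labels on rows and predicted labels on columns, with labels in sorted order (FAKE, then REAL); treating FAKE as the positive class is a choice made for this illustration.</h3>

```python
import numpy as np

# Confusion matrix reported by the notebook
# rows = actual label, columns = predicted label, order = [FAKE, REAL]
cm = np.array([[587, 41],
               [40, 599]])

# With FAKE as the positive class (an assumption of this sketch):
tp, fn = cm[0]  # actual FAKE: predicted FAKE, predicted REAL
fp, tn = cm[1]  # actual REAL: predicted FAKE, predicted REAL

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted-fake articles that are fake
recall = tp / (tp + fn)     # share of fake articles that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f'TP={tp} FN={fn} FP={fp} TN={tn}')
print(f'Accuracy:  {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall:    {recall:.4f}')
print(f'F1:        {f1:.4f}')
```

<h3>Because the two classes are almost balanced and the two kinds of error are almost equally frequent, precision, recall and accuracy all land close to 0.93 to 0.94.</h3>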