# Introduction
This data science project aims to distinguish real news from fake news by building a system that automatically classifies news articles, using TF-IDF text features and a Passive Aggressive classifier. The project also aims to promote media literacy, empowering users to critically evaluate news sources and make informed decisions based on reliable information.

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Reading and understanding the data

In [None]:
df=pd.read_csv("/kaggle/input/fake-news-detection/data.csv")

In [None]:
df

In [None]:
df.shape

In [None]:
df.isnull().any()

In [None]:
df.Label.value_counts()


The data is a DataFrame with 4009 rows and 4 columns: "URLs", "Headline", "Body", and "Label". The "URLs" column contains web addresses, "Headline" contains the news headlines, "Body" contains the main text of each article, and "Label" indicates whether the article is real (1) or fake (0).

There are no missing values in the "URLs", "Headline", or "Label" columns, but the "Body" column contains some missing values.

The label distribution shows 2137 fake articles (0) and 1872 real articles (1).

# Removing missing values

In [None]:
df.dropna(subset=['Body'], inplace=True)
df.shape

In [None]:
df.isnull().any()

21 rows were dropped, leaving 3988 rows with no missing values in the dataset.
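
The check-and-drop step above can be sketched on a small stand-in table. The four-row DataFrame below is hypothetical (it only borrows the dataset's column names), and `isnull().sum()` is used here as an alternative to `isnull().any()` because it reports how many values are missing per column rather than only whether any are.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset; not the real articles.
toy = pd.DataFrame({
    "URLs": ["a.com/1", "b.com/2", "c.com/3", "d.com/4"],
    "Headline": ["h1", "h2", "h3", "h4"],
    "Body": ["text one", None, "text three", None],
    "Label": [1, 0, 1, 0],
})

# Count missing values per column instead of only flagging their presence.
missing_per_column = toy.isnull().sum()

# Drop rows whose Body is missing and record how many rows were removed.
rows_before = len(toy)
cleaned = toy.dropna(subset=["Body"])
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f"Rows dropped: {rows_dropped}")
```

Comparing the row count before and after the drop is a quick way to confirm that the number of removed rows matches the number of missing "Body" values.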

# Splitting the dataset into training and testing sets

In [None]:
# Hold out 20% of the articles for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(df['Body'], df.Label, test_size=0.2, random_state=7)

# Initialize a TfidfVectorizer, fit and transform it on the training set, and transform the test set

In [None]:
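# Ignore English stop words and terms that appear in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
# Learn the vocabulary and IDF weights from the training set only
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
# Reuse the learned vocabulary and weights for the test set
tfidf_test = tfidf_vectorizer.transform(x_test)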
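
To illustrate why the vectorizer is fitted on the training set and only applied to the test set, here is a minimal sketch on two tiny corpora. The sentences and the names `toy_train` and `toy_test` are hypothetical, and `max_df` is left at its default because a 0.7 threshold is not meaningful with only two documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora; the real project uses the article bodies.
toy_train = ["markets rally after earnings report", "aliens endorse local candidate"]
toy_test = ["candidate comments on earnings", "unseen words only"]

vectorizer = TfidfVectorizer(stop_words="english")
train_matrix = vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(toy_test)        # reuses them; nothing is refitted

# Mapping from each training term to its column index.
vocabulary = vectorizer.vocabulary_

print(f"Train shape: {train_matrix.shape}")
print(f"Test shape:  {test_matrix.shape}")
print(f"'earnings' in vocabulary: {'earnings' in vocabulary}")
print(f"'unseen' in vocabulary:   {'unseen' in vocabulary}")
```

Both matrices have the same number of columns, one per training term. Words that appear only in the test documents are ignored, so no information from the test set leaks into the features.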

# Initialize a PassiveAggressiveClassifier

In [None]:
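# Train the classifier on the TF-IDF features (at most 50 passes over the training data)
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train, y_train)

# Predict on the held-out test set and report accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100, 2)}%')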

# Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
sns.heatmap(cm, annot=True, fmt="d", cmap="RdPu")
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

With this model, we have 404 true positives, 386 true negatives, 3 false positives, and 5 false negatives. That is 790 correct predictions out of 798 test articles, an accuracy of about 99.0%. In a binary classification problem like this one (distinguishing fake from real news), a high number of true positives and true negatives together with a low number of false positives and false negatives is generally a good sign: the model classifies both fake and real news correctly in nearly all test cases.
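
The four counts above can be turned into the usual summary metrics. The sketch below uses only the numbers reported in the text; precision, recall, and F1 are computed for whichever class is treated as positive, which the text does not specify.

```python
# Counts as reported in the text above.
tp, tn, fp, fn = 404, 386, 3, 5

total = tp + tn + fp + fn

# Share of all test articles that were classified correctly.
accuracy = (tp + tn) / total
# Of the articles predicted positive, the share that really are positive.
precision = tp / (tp + fp)
# Of the articles that really are positive, the share the model found.
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Test articles: {total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

Reporting precision and recall alongside accuracy shows whether the few errors are concentrated in one class, which a single accuracy figure can hide.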

# Conclusion

The project aimed to detect fake news by building a machine learning model trained on news articles labeled as fake or real. Using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization and the PassiveAggressiveClassifier algorithm, the model classified the held-out test articles with a high degree of accuracy (790 of 798 correct). The confusion matrix showed high numbers of true positives and true negatives and low numbers of false positives and false negatives. This suggests that the model differentiates well between fake and real news on this dataset, making it a promising approach for detecting fake news.