**Name: Pranjal Godse | Batch: 6**

# Problem 3 - Mini NLP Application
## Option A: Fake News Detection System

Objective:
- Classify Fake vs Real News
- Use TF-IDF + Logistic Regression / Naive Bayes
- Show Accuracy, Confusion Matrix
- Display Top Important Words

In [1]:
import pandas as pd
import numpy as np
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

## Load Dataset
The dataset is downloaded from Kaggle with `kagglehub`.
It ships as two CSV files, `Fake.csv` and `True.csv`, each with a `text` column; the `label` column (Fake/Real) is added after loading.

In [5]:
!pip install kagglehub



In [6]:
import kagglehub

path = kagglehub.dataset_download("clmentbisaillon/fake-and-real-news-dataset")

print("Dataset path:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/clmentbisaillon/fake-and-real-news-dataset?dataset_version_number=1...
100%|██████████| 41.0M/41.0M [00:01<00:00, 40.0MB/s]
Extracting files...
Dataset path: /root/.cache/kagglehub/datasets/clmentbisaillon/fake-and-real-news-dataset/versions/1


In [7]:
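import os

# Load the two CSV files shipped with the Kaggle dataset.
fake = pd.read_csv(os.path.join(path, "Fake.csv"))
true = pd.read_csv(os.path.join(path, "True.csv"))

# Attach the class label to each source file.
fake["label"] = "Fake"
true["label"] = "Real"

# Stack both frames; ignore_index avoids duplicate row indices.
df = pd.concat([fake, true], ignore_index=True)
df = df[["text", "label"]]

df.head()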

| | text | label |
|---|---|---|
| 0 | Donald Trump just couldn t wish all Americans ... | Fake |
| 1 | House Intelligence Committee Chairman Devin Nu... | Fake |
| 2 | On Friday, it was revealed that former Milwauk... | Fake |
| 3 | On Christmas day, Donald Trump announced that ... | Fake |
| 4 | Pope Francis used his annual Christmas Day mes... | Fake |


## Text Cleaning

In [8]:
def clean_text(text):
    text = str(text).lower()
    text = ''.join([char for char in text if char not in string.punctuation])
    return text

df['cleaned_text'] = df['text'].apply(clean_text)
df.head()

| | text | label | cleaned_text |
|---|---|---|---|
| 0 | Donald Trump just couldn t wish all Americans ... | Fake | donald trump just couldn t wish all americans ... |
| 1 | House Intelligence Committee Chairman Devin Nu... | Fake | house intelligence committee chairman devin nu... |
| 2 | On Friday, it was revealed that former Milwauk... | Fake | on friday it was revealed that former milwauke... |
| 3 | On Christmas day, Donald Trump announced that ... | Fake | on christmas day donald trump announced that h... |
| 4 | Pope Francis used his annual Christmas Day mes... | Fake | pope francis used his annual christmas day mes... |
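
The character-by-character join in `clean_text` works, but it can be slow on tens of thousands of articles. A minimal sketch of an alternative using `str.translate`, which should give the same result for the same input; `clean_text_fast` is a hypothetical name and not part of the original notebook:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text_fast(text):
    """Lowercase the text and drop ASCII punctuation (hypothetical helper)."""
    return str(text).lower().translate(_PUNCT_TABLE)

print(clean_text_fast("On Friday, it was revealed!"))
```

It could be swapped in with `df['text'].apply(clean_text_fast)` if cleaning time becomes noticeable.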


## Train Test Split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_text'], df['label'], test_size=0.2, random_state=42)

## TF-IDF Vectorization

In [10]:
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
X_train_tfidf.shape

(35918, 5000)
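
In the shape above, each row is one training article and each column is one of the 5000 vocabulary terms; the entries are TF-IDF weights. The sketch below recomputes such weights by hand on a toy corpus, assuming scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 row normalization). `toy_docs` and `tfidf_rows` are illustrative names, and the code splits on whitespace and ignores stop-word removal and `max_features`.

```python
import math
from collections import Counter

# Tiny illustrative corpus (not taken from the dataset).
toy_docs = [
    "trump said on friday",
    "reuters said on monday",
    "image via getty",
]

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, assumed to match TfidfVectorizer(smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = tfidf_rows(toy_docs)
print(len(rows), len(vocab))  # matrix shape: documents x terms
```

Terms that occur in several documents (here `said` and `on`) end up with a lower weight than terms unique to one document, which is the behavior the classifier relies on.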

## Logistic Regression Model

In [11]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)
y_pred_lr = lr_model.predict(X_test_tfidf)

print('Accuracy:', accuracy_score(y_test, y_pred_lr))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_lr))
print('Classification Report:\n', classification_report(y_test, y_pred_lr))

Accuracy: 0.9875278396436525
Confusion Matrix:
 [[4674   59]
 [  53 4194]]
Classification Report:
               precision    recall  f1-score   support

        Fake       0.99      0.99      0.99      4733
        Real       0.99      0.99      0.99      4247

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980
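
The reported numbers can be re-derived from the confusion matrix alone, which is a useful sanity check. A minimal sketch, assuming scikit-learn's layout (rows are true labels, columns are predictions, classes in the order Fake, Real); `per_class_metrics` is a hypothetical helper:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[4674, 59],
      [53, 4194]]
labels = ["Fake", "Real"]

def per_class_metrics(cm, labels):
    """Compute accuracy plus precision/recall per class from a square matrix."""
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(len(cm))) / total
    report = {}
    for i, name in enumerate(labels):
        predicted_as_i = sum(row[i] for row in cm)  # column sum
        actually_i = sum(cm[i])                     # row sum (support)
        report[name] = {
            "precision": cm[i][i] / predicted_as_i,
            "recall": cm[i][i] / actually_i,
            "support": actually_i,
        }
    return accuracy, report

accuracy, report = per_class_metrics(cm, labels)
print(round(accuracy, 4))
for name, stats in report.items():
    print(name, round(stats["precision"], 2), round(stats["recall"], 2), stats["support"])
```

The row sums reproduce the `support` column (4733 Fake, 4247 Real), and the diagonal over the total reproduces the printed accuracy.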



## Naive Bayes Model

In [12]:
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
y_pred_nb = nb_model.predict(X_test_tfidf)

print('Accuracy:', accuracy_score(y_test, y_pred_nb))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_nb))
print('Classification Report:\n', classification_report(y_test, y_pred_nb))

Accuracy: 0.9266146993318486
Confusion Matrix:
 [[4443  290]
 [ 369 3878]]
Classification Report:
               precision    recall  f1-score   support

        Fake       0.92      0.94      0.93      4733
        Real       0.93      0.91      0.92      4247

    accuracy                           0.93      8980
   macro avg       0.93      0.93      0.93      8980
weighted avg       0.93      0.93      0.93      8980
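
Both models were evaluated on the same 8980 test articles, so their confusion matrices can be compared directly. The sketch below counts the off-diagonal entries (misclassifications) for each model using only the matrices printed in the two outputs; the variable names are illustrative.

```python
# Confusion matrices copied from the two model outputs.
matrices = {
    "Logistic Regression": [[4674, 59], [53, 4194]],
    "Naive Bayes": [[4443, 290], [369, 3878]],
}

def error_count(cm):
    """Number of test articles off the diagonal, i.e. misclassified."""
    size = len(cm)
    return sum(cm[i][j] for i in range(size) for j in range(size) if i != j)

errors = {name: error_count(cm) for name, cm in matrices.items()}
for name, cm in matrices.items():
    total = sum(sum(row) for row in cm)
    print(f"{name}: {errors[name]} errors out of {total} ({errors[name] / total:.2%})")
```

On this split Logistic Regression makes 112 errors against 659 for Naive Bayes.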



## Top Important Words (Logistic Regression)

In [13]:
feature_names = tfidf.get_feature_names_out()
# For a binary model, coef_[0] scores lr_model.classes_[1], which is 'Real'
# here because classes are sorted alphabetically (['Fake', 'Real']).
coefficients = lr_model.coef_[0]

# Most negative weights push toward 'Fake'; most positive push toward 'Real'.
top_fake = np.argsort(coefficients)[:10]
top_real = np.argsort(coefficients)[-10:]

print('Top Fake News Words:')
print([feature_names[i] for i in top_fake])

print('\nTop Real News Words:')
print([feature_names[i] for i in top_real])

Top Fake News Words:
['just', 'image', 'gop', 'mr', 'hillary', 'america', 'like', 'images', 'american', 'rep']

Top Real News Words:
['nov', 'republican', 'monday', 'friday', 'thursday', 'tuesday', 'wednesday', 'washington', 'said', 'reuters']
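
The word lists above include `reuters`, `washington`, and weekday names, which hints that the classifier may be keying on wire-service datelines rather than on content. A hedged sketch of one way to test that, assuming some articles open with a dateline of the form `CITY (Reuters) - ` (an assumption about the data, not something shown in this notebook); `strip_dateline` is a hypothetical helper that would run before `clean_text`:

```python
import re

# Matches a leading dateline such as "WASHINGTON (Reuters) - ".
# The pattern is an assumption about how such articles begin.
_DATELINE = re.compile(r"^\s*[A-Z][A-Za-z .,/'-]{0,60}\(Reuters\)\s*-\s*")

def strip_dateline(text):
    """Remove a leading 'CITY (Reuters) - ' prefix if present."""
    return _DATELINE.sub("", str(text), count=1)

print(strip_dateline("WASHINGTON (Reuters) - The head of a committee said on Friday"))
print(strip_dateline("Donald Trump just couldn t wish all Americans"))
```

Retraining after this step and inspecting the top words again would indicate how much of the accuracy depends on the dateline.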
