# Fake News Detection using Logistic Regression Model

### Objective:
Build a machine learning system that detects whether a news article is **Real or Fake**
using Natural Language Processing (NLP) techniques and a **Logistic Regression** model.

### Workflow Overview:
1. Import required libraries and dependencies
2. Load the news dataset
3. Preprocess the text:
   - Combine title and text
   - Clean text using regular expressions
   - Remove stopwords
   - Apply stemming
4. Convert the cleaned text into numerical features using **TF-IDF Vectorization**
5. Split data into training and testing sets (80/20 split, stratified)
6. Compare multiple models:
   - Logistic Regression (chosen as the final model)
   - Multinomial Naive Bayes
   - Random Forest
7. Train the final model
8. Evaluate model performance:
   - Accuracy, Precision, Recall, F1-score
   - Confusion Matrix (with visualization)
9. Enable custom input prediction system
10. Save the trained model and vectorizer using Pickle

### Problem Type:
- **Binary Classification**
  - `1` → Real News
  - `0` → Fake News

### Dataset:
- File used: `WELFake_Dataset.csv`
- Columns used: `title`, `text`, `label`

---


# Step - 1
### Importing Dependencies

In [None]:
import numpy as np
import pandas as pd

import zipfile
# Used to extract and read files directly from ZIP archives

import re
# re (regular expression) library is useful for searching, replacing, or cleaning specific patterns in text.

from nltk.corpus import stopwords
# stopwords are common words (like a, the, is) that are usually removed from text data because they don’t add much meaning.

from nltk.stem.porter import PorterStemmer
# PorterStemmer helps reduce words to their base or root form. e.g., “playing”, “played” → “play”.

from sklearn.feature_extraction.text import TfidfVectorizer
# Converts text data into numerical format by calculating importance of words (TF-IDF technique).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import matplotlib.pyplot as plt
import seaborn as sns

import pickle  # For saving the trained model and vectorizer

from tqdm.notebook import tqdm  # Progress bar for pandas apply

# Step - 2
### Downloading Stopwords

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
# printing the stopwords
print(stopwords.words('english'))

# Step - 3
### Data collection

In [None]:
news_data = pd.read_csv("/content/WELFake_Dataset.csv")

In [None]:
# Print first few rows of the data
news_data.head()

In [None]:
# print the dimensions of our dataset (rows , columns)
news_data.shape

In [None]:
# checking for missing values in dataset
news_data.isnull().sum()

In [None]:
# Drop rows with missing title or text
news_data = news_data.dropna(subset=['title', 'text'])

In [None]:
# Drop the 'Unnamed: 0' column
news_data = news_data.drop(columns=['Unnamed: 0'])

In [None]:
# Reset index after dropping rows
news_data = news_data.reset_index(drop=True)

In [None]:
# Count the number of real (1) and fake (0) news articles in the dataset
# This helps check for class balance before training
print(news_data['label'].value_counts())

In [None]:
# Combine the title and text columns into a single text feature
news_data['content'] = news_data['title'] + " " + news_data['text']

In [None]:
news_data.head()

# Step - 4
### Stemming
Stemming is the process of reducing a word to its root word.

- e.g., "enjoyed", "enjoyable", "enjoying" → "enjoy"

In [None]:
# Load stopwords once
stop_words = set(stopwords.words('english'))

In [None]:
# load an instance of Porter Stemmer in a variable
port_stem = PorterStemmer()

This function cleans the text, removes stopwords, and applies stemming to reduce words to their root form.

In [None]:
# create a function for stemming
def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # Removes everything except letters (gets rid of numbers, punctuation, etc.)
    # ^ inside the brackets negates the set, so [^a-zA-Z] matches any character that is not a letter; re.sub replaces each of them in 'content' with a space

    stemmed_content = stemmed_content.lower()
    # Converts all text to lowercase

    stemmed_content = stemmed_content.split()
    # Splits the sentence into individual words

    stemmed_content = [port_stem.stem(word)
                       for word in stemmed_content
                       if word not in stop_words]
    # Removes common stopwords and stems each word to its root form

    stemmed_content = " ".join(stemmed_content)
    # Joins the cleaned words back into a single string

    return stemmed_content
    # Returns the final preprocessed text
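
To see what the regex, lowercasing, and stopword steps do on their own, here is a minimal sketch that leaves out stemming. The stopword set `demo_stop_words` and the helper name `clean_without_stemming` are assumptions made for illustration; the notebook itself uses NLTK's full English list and the `stemming()` function above.

```python
import re

# Tiny hand-picked stopword set for illustration only (an assumption);
# the notebook uses NLTK's full English stopword list.
demo_stop_words = {"the", "is", "a", "in", "on", "of", "and"}

def clean_without_stemming(text):
    # Replace every non-letter character with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # Lowercase the text and split it on whitespace
    words = letters_only.lower().split()
    # Drop stopwords and join the remaining words back together
    return " ".join(w for w in words if w not in demo_stop_words)

sample = "The Senate voted 52-48 on the bill in 2017!"
cleaned = clean_without_stemming(sample)
print(cleaned)  # senate voted bill
```

Digits and punctuation disappear because they are replaced with spaces before the split, so "52-48" and "2017!" leave no tokens behind.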

- `tqdm` is a Python library that shows a progress bar for loops — useful for long operations like text preprocessing.

- The `tqdm.notebook` version is designed for Jupyter/Colab notebooks and renders the progress bar as a notebook widget.

- `tqdm.pandas()` integrates `tqdm` with pandas, so you can use `.progress_apply()` on DataFrame columns.

- `news_data['content'].progress_apply(stemming)` applies your custom `stemming()` function to every row in the `content` column, while showing live progress.

- This helps you track the progress of the stemming operation and estimate how long it will take to finish.

In [None]:
tqdm.pandas()  # activate tqdm with pandas

news_data['content'] = news_data['content'].progress_apply(stemming)

In [None]:
# printing the content column
print(news_data['content'])

# Step - 5
### TF-IDF Vectorization

**TF-IDF (Term Frequency-Inverse Document Frequency)** converts the raw text into numerical feature vectors, giving more importance to rare but meaningful words in the corpus.

- Initialize TF-IDF Vectorizer to convert text into numerical features
- Removes English stopwords (common words that add little meaning)
- Limits features to the 10,000 most frequent words/n-grams across the corpus
- Considers unigrams and bigrams (single words and pairs of words)

In [None]:
vectorizer = TfidfVectorizer(
    stop_words='english',  # Removes common noise words
    max_features=10000,     # Caps the vocabulary size & keeps the model lightweight
    ngram_range=(1, 2)     # Captures both single words AND pairs of words (context!)
)

In [None]:
# Fit the vectorizer on the text and transform it into TF-IDF feature matrix
X_text = vectorizer.fit_transform(news_data['content'].values)

# Step - 6
### Feature and Target split

In [None]:
# Features
X = X_text

In [None]:
# Target
Y = news_data['label'].values

In [None]:
print(X)

In [None]:
print(Y)

# Step - 7
### Train Test Split

- Splitting data into training and testing sets
- 80% training data, 20% testing data
- Stratify to keep label distribution consistent in both sets
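
What stratification does can be sketched in plain Python: the split is performed separately inside each class, so both sets keep the original label proportions. The function `stratified_split` is a hypothetical helper written for this illustration and is not how `train_test_split` is implemented internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut off the test share
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return sorted(train_idx), sorted(test_idx)

# Toy labels (an assumption): 70% class 1, 30% class 0
labels = [1] * 70 + [0] * 30
train_idx, test_idx = stratified_split(labels)

test_share_of_ones = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share_of_ones)  # 80 20 0.7
```

The test set keeps the 70/30 ratio of the full data, which is the property `stratify=Y` asks for.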

In [None]:
X_train , X_test , Y_train , Y_test = train_test_split(
    X , Y , test_size= 0.2 , random_state=2 , stratify= Y)

# Step - 8
### Model Comparison

In [None]:
# Create a dictionary of models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42)
}

# Iterate through the dictionary to get predictions from each model
for name, model in models.items():
    model.fit(X_train, Y_train)   # Fit the model
    preds = model.predict(X_test)

    # Print the Performance of each model
    print(f"\n{name} Performance ")
    print("Accuracy:", accuracy_score(Y_test, preds))
    print("Classification Report:\n", classification_report(Y_test, preds))
    print("-------------------------------------------------------")

- Random Forest achieved the highest accuracy (~95.6%) with balanced precision, recall, and F1-score. However, it overfit our dataset, which could reduce its ability to generalize to unseen data.

- Logistic Regression performed nearly as well (~94.6%) and showed better generalization and stability, so we selected it as the final model to avoid overfitting.

- Naive Bayes showed lower accuracy (~86.2%) and was less suitable for this task.
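
Overfitting is usually judged by the gap between training accuracy and test accuracy. The sketch below shows one way to measure that gap for any object that exposes `fit` and `predict`. `MemorizingClassifier` is a hypothetical toy model that deliberately overfits, and `train_test_gap` is a helper name introduced here; neither is part of the notebook, and the data is made up.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

class MemorizingClassifier:
    """Hypothetical toy model: memorizes training rows, predicts 0 otherwise."""

    def fit(self, X, y):
        self.table = {tuple(row): label for row, label in zip(X, y)}
        return self

    def predict(self, X):
        return [self.table.get(tuple(row), 0) for row in X]

def train_test_gap(model, X_train, y_train, X_test, y_test):
    # Fit once, then score on both the training and the held-out data
    model.fit(X_train, y_train)
    train_acc = accuracy(y_train, model.predict(X_train))
    test_acc = accuracy(y_test, model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc

# Made-up data: the test rows never appear in the training set
X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]
y_train = [1, 0, 1, 0]
X_test = [[0, 2], [2, 0], [2, 2], [0, 3]]
y_test = [1, 0, 1, 1]

train_acc, test_acc, gap = train_test_gap(
    MemorizingClassifier(), X_train, y_train, X_test, y_test)
print(train_acc, test_acc, gap)  # 1.0 0.25 0.75
```

A large positive gap, as here, is the typical signature of overfitting; a small gap suggests the model generalizes about as well as it fits.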



# Step - 9
### Model Training
Initialize and train the Logistic Regression model

In [None]:
# Use the same settings as in the Step 8 comparison
model = LogisticRegression(max_iter=1000)

In [None]:
# Fit the model on the training data
model.fit(X_train, Y_train)

# Step - 10
### Model Evaluation


In [None]:
# Predict on training data
X_train_pred = model.predict(X_train)

In [None]:
# Evaluation Metrics
training_data_accuracy = accuracy_score( Y_train , X_train_pred)
report = classification_report(Y_train, X_train_pred)
conf_matrix = confusion_matrix(Y_train, X_train_pred)

In [None]:
print(f"Training Data Accuracy is : {training_data_accuracy}")
print(f"Classification Report is :\n{report}")
print(f"Confusion Matrix is :\n{conf_matrix}")

In [None]:
# Predict on test data
X_test_pred = model.predict(X_test)

In [None]:
# Evaluation Metrics
test_data_accuracy = accuracy_score( Y_test , X_test_pred)
report_test = classification_report(Y_test, X_test_pred)
conf_matrix_test = confusion_matrix(Y_test, X_test_pred)

In [None]:
print(f"Test Data Accuracy is : {test_data_accuracy}")
print(f"Classification Report is :\n{report_test}")
print(f"Confusion Matrix is :\n{conf_matrix_test}")
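
The figures in the classification report can be derived directly from the confusion matrix. The sketch below uses hypothetical counts (not results from this notebook) laid out the way `confusion_matrix` returns them for labels `[0, 1]`: rows are actual classes and columns are predicted classes.

```python
# Hypothetical counts for illustration, in scikit-learn's layout:
# [[TN, FP],
#  [FN, TP]]  with 0 = Fake and 1 = Real
conf = [[50, 10],
        [5, 35]]

tn, fp = conf[0]
fn, tp = conf[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of articles predicted Real, how many were Real
recall = tp / (tp + fn)      # of articles that were Real, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.850 precision=0.778 recall=0.875 f1=0.824
```

These values describe class `1` (Real); swapping the roles of the two classes gives the corresponding row for class `0` (Fake).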

# Step - 11
### Visualization - Confusion Matrix

In [None]:
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix_test , annot=True, fmt='d', cmap='Blues',
            xticklabels=['Fake', 'Real'], yticklabels=['Fake', 'Real'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Step - 12
### Making a Custom Input System

In [None]:
def predict_custom_news(news_text, vectorizer, model):
    # Preprocess the input using the same stemming function
    stemmed_text = stemming(news_text)
    vectorized_text = vectorizer.transform([stemmed_text])
    prediction = model.predict(vectorized_text)
    return "Real" if prediction[0] == 1 else "Fake"

### Example usage:

In [None]:
sample_news = "Breaking: President gives major update on national policy."
result = predict_custom_news(sample_news, vectorizer, model)
print(f"\nPrediction for custom news input: {result}")

# Step - 13
### Save the model and vectorizer using pickle


In [None]:
with open("logistic_regression_model.pkl", "wb") as model_file:
    pickle.dump(model, model_file)

with open("tfidf_vectorizer.pkl", "wb") as vec_file:
    pickle.dump(vectorizer, vec_file)
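
To reuse the saved artifacts later, they would be read back with `pickle.load`. The following round-trip sketch uses a plain dictionary as a stand-in for the fitted model and a temporary directory, both assumptions made so that it runs without the notebook's state.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (an assumption so the sketch runs on its own)
stand_in_model = {"kind": "logistic_regression", "coef": [0.4, -1.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save the object in binary mode, as in the cell above
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load it back; the restored object equals the original
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

With the real files, the same `pickle.load` pattern would be applied to `logistic_regression_model.pkl` and `tfidf_vectorizer.pkl`, and the loaded objects passed to `predict_custom_news`. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.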