# Fake News Detection using Logistic Regression
### Objective:
Build a machine learning system that detects whether a news article is Real or Fake using Natural Language Processing (NLP) techniques and Logistic Regression.

### Workflow Overview:
- Import required libraries
- Load and combine real and fake news datasets
- Preprocess the text (cleaning + stemming)
- Convert text into numerical features using TF-IDF
- Train a Logistic Regression classifier
- Evaluate model performance
- Predict new input samples

### Problem Type:
Binary Classification
  - 1 = Real
  - 0 = Fake

### Dataset:
- Files: True.csv.zip, Fake.csv.zip
- Columns: title, text, subject, date (a label column is added when the files are loaded)
- Labels:
  - 1 → Real
  - 0 → Fake

# Step - 1
### Importing Dependencies

In [143]:
import numpy as np
import pandas as pd

import zipfile
# Used to extract and read files directly from ZIP archives
import re
# re (regular expression) library is useful for searching, replacing, or cleaning specific patterns in text.

from nltk.corpus import stopwords
# stopwords are common words (like a, the, is) that are usually removed from text data because they don’t add much meaning.

from nltk.stem.porter import PorterStemmer
# PorterStemmer helps reduce words to their base or root form. e.g., “playing”, “played” → “play”.

from sklearn.feature_extraction.text import TfidfVectorizer
# Converts text data into numerical format by calculating importance of words (TF-IDF technique).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step - 2
### Downloading Stopwords

In [102]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [103]:
# printing the stopwords
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

# Step - 3
### Data collection

In [104]:
# Define the file path to the ZIP file (uploaded in Google Colab)
true_zip_path = "/content/True.csv.zip"
fake_zip_path = "/content/Fake.csv.zip"

In [105]:
# Open the ZIP file using Python's zipfile module
with zipfile.ZipFile(true_zip_path, 'r') as z:

    # Open the CSV file inside the ZIP
    with z.open("True.csv") as true_file:

        # Read the CSV into a DataFrame using pandas
        real = pd.read_csv(true_file)

        # Add a new column 'label' to indicate these are real news articles
        # We'll use 1 for real news
        real['label'] = 1

In [106]:
# Read Fake.csv from Fake.csv.zip
with zipfile.ZipFile(fake_zip_path, 'r') as z:
    with z.open("Fake.csv") as fake_file:
        fake = pd.read_csv(fake_file)
        fake['label'] = 0  # Fake news
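
The two loading cells above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch under that assumption: `load_labeled_csv` is a hypothetical name that does not appear in the notebook, and the demo builds a tiny in-memory archive rather than reading the real files.

```python
import io
import zipfile

import pandas as pd


def load_labeled_csv(zip_source, member_name, label):
    """Read one CSV member from a ZIP archive and tag every row with `label`.

    `load_labeled_csv` is a hypothetical helper name, not part of the notebook.
    """
    with zipfile.ZipFile(zip_source, "r") as archive:
        with archive.open(member_name) as csv_file:
            frame = pd.read_csv(csv_file)
    frame["label"] = label
    return frame


# Demo on a tiny in-memory archive (a stand-in for /content/True.csv.zip).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("True.csv", "title,text\nA headline,Some body text\n")
buffer.seek(0)

demo = load_labeled_csv(buffer, "True.csv", label=1)
print(demo)
```

With the real files, the same helper would presumably be called once per archive, with `label=1` for `True.csv` and `label=0` for `Fake.csv`.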

In [107]:
# Combine the real and fake DataFrames into one
# We use ignore_index=True to reset the index automatically
news_data = pd.concat([real, fake], ignore_index=True)

# Shuffle the rows so that real and fake articles are mixed randomly
# This helps prevent any order bias during model training
# frac=1 means we’re returning 100% of the data, just shuffled
# random_state=42 ensures reproducibility (same shuffle every time)
news_data = news_data.sample(frac=1, random_state=42).reset_index(drop=True)

In [108]:
# Print first few rows of the data
news_data.head()

|   | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | BREAKING: GOP Chairman Grassley Has Had Enoug... | Donald Trump s White House is in chaos, and th... | News | July 21, 2017 | 0 |
| 1 | Failed GOP Candidates Remembered In Hilarious... | Now that Donald Trump is the presumptive GOP n... | News | May 7, 2016 | 0 |
| 2 | Mike Pence’s New DC Neighbors Are HILARIOUSLY... | Mike Pence is a huge homophobe. He supports ex... | News | December 3, 2016 | 0 |
| 3 | California AG pledges to defend birth control ... | SAN FRANCISCO (Reuters) - California Attorney ... | politicsNews | October 6, 2017 | 1 |
| 4 | AZ RANCHERS Living On US-Mexico Border Destroy... | Twisted reasoning is all that comes from Pelos... | politics | Apr 25, 2017 | 0 |


In [109]:
# print the dimensions of our dataset (rows , columns)
news_data.shape

(44898, 5)

In [110]:
# Count the number of real (1) and fake (0) news articles in the dataset
# This helps check for class balance before training
print(news_data['label'].value_counts())

label
0    23481
1    21417
Name: count, dtype: int64
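
To put the class balance in numbers, the counts printed above can be turned into proportions. This small sketch also computes the accuracy of a trivial model that always predicts the larger class, which is a useful reference point for the accuracies reported later. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 23481, 1: 21417}  # 0 = Fake, 1 = Real

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"label {label}: {share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```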


In [111]:
# checking for missing values in dataset
news_data.isnull().sum()

| column | missing values |
|---|---|
| title | 0 |
| text | 0 |
| subject | 0 |
| date | 0 |
| label | 0 |


In [112]:
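# Combine multiple columns into a single text feature.
# Caution: the subject counts printed later in this step (politicsNews + worldnews = 21417,
# the number of real articles; the other six subjects = 23481, the number of fake articles)
# indicate that each subject occurs in only one class, so including 'subject' here
# likely leaks the label into the features.
news_data['content'] = news_data['title'] + " " + news_data['text'] + " " + news_data['subject']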

In [113]:
news_data.head()

|   | title | text | subject | date | label | content |
|---|---|---|---|---|---|---|
| 0 | BREAKING: GOP Chairman Grassley Has Had Enoug... | Donald Trump s White House is in chaos, and th... | News | July 21, 2017 | 0 | BREAKING: GOP Chairman Grassley Has Had Enoug... |
| 1 | Failed GOP Candidates Remembered In Hilarious... | Now that Donald Trump is the presumptive GOP n... | News | May 7, 2016 | 0 | Failed GOP Candidates Remembered In Hilarious... |
| 2 | Mike Pence’s New DC Neighbors Are HILARIOUSLY... | Mike Pence is a huge homophobe. He supports ex... | News | December 3, 2016 | 0 | Mike Pence’s New DC Neighbors Are HILARIOUSLY... |
| 3 | California AG pledges to defend birth control ... | SAN FRANCISCO (Reuters) - California Attorney ... | politicsNews | October 6, 2017 | 1 | California AG pledges to defend birth control ... |
| 4 | AZ RANCHERS Living On US-Mexico Border Destroy... | Twisted reasoning is all that comes from Pelos... | politics | Apr 25, 2017 | 0 | AZ RANCHERS Living On US-Mexico Border Destroy... |


In [114]:
# Count how many news articles belong to each subject/category
news_data.value_counts('subject')

| subject | count |
|---|---|
| politicsNews | 11272 |
| worldnews | 10145 |
| News | 9050 |
| politics | 6841 |
| left-news | 4459 |
| Government News | 1570 |
| US_News | 783 |
| Middle-east | 778 |
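
These subject counts line up exactly with the class counts: politicsNews + worldnews = 11272 + 10145 = 21417 (the number of real articles), and the remaining six subjects sum to 23481 (the number of fake articles). That suggests each subject appears in only one class, in which case a feature built from `subject` would reveal the label directly. One way to check is a cross-tabulation of subject against label. The sketch below runs on a toy DataFrame with illustrative rows; in the notebook the same calls would be applied to `news_data`.

```python
import pandas as pd

# Toy stand-in for news_data; the subject/label pairs here are illustrative.
toy = pd.DataFrame({
    "subject": ["politicsNews", "worldnews", "News", "politics", "News"],
    "label":   [1, 1, 0, 0, 0],
})

# Rows: subject, columns: label, cells: article counts.
table = pd.crosstab(toy["subject"], toy["label"])
print(table)

# Number of distinct labels seen for each subject.
labels_per_subject = toy.groupby("subject")["label"].nunique()

# True when every subject appears under exactly one label,
# i.e. the subject alone determines the label.
subject_reveals_label = bool((labels_per_subject == 1).all())
print("subject alone determines label:", subject_reveals_label)
```

If the check comes back true on the real data, leaving `subject` out of the text feature would give a more honest estimate of how well the model reads the articles themselves.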


# Step - 4
### Stemming

Stemming reduces a word to its root form (its stem). The stem is not always a dictionary word.
- e.g. "enjoyed", "enjoyable", "enjoying" → "enjoy"
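
A quick way to see this behaviour is to run the stemmer on a few words directly. The sketch below uses NLTK's `PorterStemmer` with its default settings; the word lists are illustrative, and the second group is taken from headline words in this dataset.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same word collapse to one stem.
family = ["enjoyed", "enjoyable", "enjoying"]
family_stems = [stemmer.stem(word) for word in family]
print(dict(zip(family, family_stems)))

# Stems are not guaranteed to be dictionary words.
for word in ["candidates", "remembered", "hilarious"]:
    print(word, "->", stemmer.stem(word))
```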

In [115]:
# Load stopwords once
stop_words = set(stopwords.words('english'))

In [116]:
# load an instance of Porter Stemmer in a variable
port_stem = PorterStemmer()

The `stemming` function below cleans the text, removes stopwords, and stems each remaining word to its root form.


In [117]:
# Clean, tokenize, remove stopwords from, and stem one document
def stemming(content):
    # Replace every character that is not a letter with a space.
    # Inside [...], a leading ^ negates the set, so [^a-zA-Z] matches
    # any non-letter (digits, punctuation, etc.).
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)

    # Convert all text to lowercase
    stemmed_content = stemmed_content.lower()

    # Split the text into individual words
    stemmed_content = stemmed_content.split()

    # Drop stopwords and stem each remaining word
    stemmed_content = [port_stem.stem(word)
                       for word in stemmed_content
                       if word not in stop_words]

    # Join the processed words back into a single string
    stemmed_content = " ".join(stemmed_content)

    # Return the final preprocessed text
    return stemmed_content
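
To make the cleaning steps concrete, here is a trace of the first stages on one made-up headline. It is only a sketch: the stopword set is a tiny illustrative subset (the notebook uses NLTK's full English list), and the stemming step is left out so the example runs with the standard library alone.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
demo_stop_words = {"the", "is", "in", "a", "and", "of"}

raw = "BREAKING: The Senate votes 52-48 in a late session!"

letters_only = re.sub(r"[^a-zA-Z]", " ", raw)  # non-letters become spaces
lowered = letters_only.lower()
tokens = lowered.split()                        # split() also collapses repeated spaces
kept = [t for t in tokens if t not in demo_stop_words]

print(tokens)
print(kept)
```

Note that digits such as the vote count disappear entirely, which is a deliberate simplification of this preprocessing and may discard useful signal.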


- `tqdm` is a Python library that shows a progress bar for loops — useful for long operations like text preprocessing.

- `tqdm.notebook` version is specifically designed for Jupyter/Colab notebooks with nice formatting.

- `tqdm.pandas()` integrates `tqdm` with pandas, so you can use `.progress_apply()` on DataFrame columns.

- `news_data['content'].progress_apply(stemming)` applies your custom `stemming()` function to every row in the `content` column, while showing live progress.

- This helps you track the progress of the stemming operation and estimate how long it will take to finish.



In [118]:
from tqdm.notebook import tqdm

tqdm.pandas()  # activate tqdm with pandas

news_data['content'] = news_data['content'].progress_apply(stemming)


In [119]:
# printing the content column
print(news_data['content'])

0        break gop chairman grassley enough demand trum...
1        fail gop candid rememb hilari mock eulog video...
2        mike penc new dc neighbor hilari troll homopho...
3        california ag pledg defend birth control insur...
4        az rancher live us mexico border destroy nanci...
                               ...                        
44893    nigeria say u agre delay million fighter plane...
44894    boiler room fatal illus tune altern current ra...
44895    atheist sue governor texa display capitol grou...
44896    republican tax plan would deal financi hit u u...
44897    u n refuge commission say australia must stop ...
Name: content, Length: 44898, dtype: object


# Step - 5
### TF-IDF Vectorization



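**TF-IDF (Term Frequency-Inverse Document Frequency)** converts raw text into numerical feature vectors. A term's weight grows with how often it appears in a document and shrinks with how many documents in the corpus contain it, so rare but distinctive words count for more than ubiquitous ones.

The vectorizer below is configured to:
- Remove English stopwords (`stop_words='english'`); NLTK's stopwords were already removed in Step 4, so this is a second pass using scikit-learn's own list
- Keep at most 5000 features (`max_features=5000`), selected by term frequency across the corpus
- Consider unigrams and bigrams (`ngram_range=(1,2)`), i.e. single words and pairs of adjacent words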
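
To show where the weights come from, the sketch below computes TF-IDF by hand for a three-document toy corpus. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is what scikit-learn documents as the default for `TfidfVectorizer`. The corpus and helper name are illustrative, and the vocabulary limit and bigrams are left out for brevity.

```python
import math
from collections import Counter

docs = [
    "senat vote tax bill",
    "senat vote budget",
    "celebr shock video",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return {term: weight} for one document, L2-normalised."""
    counts = Counter(tokens)
    raw = {
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf_vector(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.3f}")
```

In the first document, "tax" and "bill" occur in one document each while "senat" and "vote" occur in two, so the rarer terms end up with the larger weights.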



In [120]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1,2))

In [121]:
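# Fit the vectorizer on the text and transform it into a TF-IDF feature matrix.
# Note: this fits on the full dataset, before the train/test split in Step 7,
# so the vocabulary and IDF weights also reflect the test documents.
X_text = vectorizer.fit_transform(news_data['content'].values)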

# Step - 6
### Feature and Target split

In [122]:
# Features
X = X_text

In [123]:
# Target
Y = news_data['label'].values

In [124]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6187557 stored elements and shape (44898, 5000)>
  Coords	Values
  (0, 507)	0.0485466479877629
  (0, 1794)	0.10188911568466509
  (0, 655)	0.05186727330013525
  (0, 1819)	0.43158932367733177
  (0, 1106)	0.15036687605128288
  (0, 4575)	0.23205761765630598
  (0, 2276)	0.20427127659121785
  (0, 4430)	0.06835972553802824
  (0, 1243)	0.10410408874692799
  (0, 4888)	0.09984822141101216
  (0, 1989)	0.09131757893957723
  (0, 665)	0.06743181882420565
  (0, 4560)	0.035272432587098806
  (0, 975)	0.05190607908229499
  (0, 3727)	0.044254254155975406
  (0, 3343)	0.09135126841786088
  (0, 2770)	0.07078423115355227
  (0, 1988)	0.04899170180891784
  (0, 3512)	0.09729971476537777
  (0, 37)	0.061120220083829
  (0, 4326)	0.06350151792166332
  (0, 1527)	0.05818819565078035
  (0, 2858)	0.04642367595039506
  (0, 1957)	0.08134114458238977
  (0, 2005)	0.04255248357009092
  :	:
  (44897, 3907)	0.10931322240385702
  (44897, 3546)	0.05664374257278219
  

In [125]:
print(Y)

[0 0 0 ... 0 1 1]


# Step - 7
### Train Test Split

- Splitting data into training and testing sets
- 80% training data, 20% testing data
- Stratify to keep label distribution consistent in both sets

In [126]:
X_train , X_test , Y_train , Y_test = train_test_split(
    X , Y , test_size= 0.2 , random_state=2 , stratify= Y)
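
An alternative ordering is to split the raw text first and fit TF-IDF on the training portion only, so that no information from the test documents reaches the vectorizer. The sketch below shows one way to do that with a scikit-learn pipeline. It is not the notebook's method: the toy corpus, labels, and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for news_data['content'] and news_data['label'].
texts = [
    "senat pass budget bill", "minist meet trade deleg",
    "court rule elect case", "govern announc tax plan",
    "shock video celebr scandal", "unbeliev secret reveal watch",
    "hilari meme destroy politician", "viral post claim hoax",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=2, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
pipeline.fit(train_texts, y_train)

# At prediction time the already-fitted vocabulary is reused.
test_predictions = pipeline.predict(test_texts)
print(list(zip(test_texts, test_predictions)))
```

Because of `stratify=labels`, the two held-out documents contain one example of each class, mirroring the stratified split used in this step.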

# Step - 8
### Model Training
Initialize and train the Logistic Regression model


In [127]:
model = LogisticRegression()

In [133]:
# Fit the model on the training data
model.fit(X_train, Y_train)

# Step - 9
### Model Evaluation

In [134]:
# Predict on training data
X_train_pred = model.predict(X_train)

In [135]:
# Calculate accuracy on training data
training_data_accuracy = accuracy_score( Y_train , X_train_pred)

In [136]:
print(f"Training Data Accuracy is : {training_data_accuracy}")

Training Data Accuracy is : 0.9949607439166992


In [137]:
# Predict on test data
X_test_pred = model.predict(X_test)

In [138]:
# Calculate accuracy on test data
test_data_accuracy = accuracy_score( Y_test , X_test_pred)

In [139]:
print(f"Test Data Accuracy is : {test_data_accuracy}")

Test Data Accuracy is : 0.9920935412026726
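
Accuracy is a single number and does not show which kind of mistake the model makes. A confusion matrix and per-class precision and recall give that breakdown. The sketch below uses hypothetical labels and predictions, not results from this notebook; with the notebook's variables, the equivalent call would presumably be `confusion_matrix(Y_test, X_test_pred)`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels and predictions, not results from this notebook.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes (order: 0 = Fake, 1 = Real).
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(matrix)

print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["Fake", "Real"]))
```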


# Step - 10
### Making a Predictive System

In [142]:
# Select the first example from the test set features
# (or replace it with any other row of the TF-IDF matrix)
X_new = X_test[0]

# Predict the label (0: Fake, 1: Real) for this example
prediction = model.predict(X_new)

# Print the raw prediction output
print(prediction)

# Interpret and print the human-readable result
if prediction[0] == 0:
    print("News is Fake")
else:
    print("News is Real")

[1]
News is Real
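
The cell above predicts on a row that is already vectorized. To classify a brand-new article, the raw text has to go through the same preprocessing and the same fitted vectorizer before reaching the model. The sketch below illustrates that flow under stated assumptions: `predict_article` and `simple_clean` are hypothetical helpers, `simple_clean` is a simplified stand-in for the notebook's `stemming()`, and the model is trained on four toy sentences so the example runs on its own.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def simple_clean(text):
    """Stand-in for the notebook's stemming(): letters only, lowercased."""
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).lower().split())


def predict_article(raw_text, preprocess, vectorizer, model):
    """Hypothetical helper: classify one raw article as 'Real' or 'Fake'."""
    cleaned = preprocess(raw_text)
    # transform(), not fit_transform(): reuse the vocabulary learned in training.
    features = vectorizer.transform([cleaned])
    return "Real" if model.predict(features)[0] == 1 else "Fake"


# Toy training data so the sketch runs on its own.
train_texts = [
    "Senate passes budget bill", "Minister meets trade delegation",
    "SHOCKING video you won't believe", "Secret hoax revealed, watch now",
]
train_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(
    toy_vectorizer.fit_transform([simple_clean(t) for t in train_texts]),
    train_labels,
)

result = predict_article(
    "Senate committee debates the budget", simple_clean, toy_vectorizer, toy_model
)
print("News is", result)
```

In the notebook, the corresponding call would pass `stemming`, `vectorizer`, and `model`, and the input text would need to be built the same way as the `content` column.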
