# Project 06: Fake News Detection Using Linear Regression

## Abstract

The rise of misinformation in digital media has emphasized the need for automated fake news detection. This study explores the application of linear regression, a traditional machine learning approach, for classifying news articles as real or fake. The model utilizes textual features extracted from news content to predict the likelihood of an article being fake. Performance metrics, including accuracy, precision, recall, and F1-score, are evaluated to assess the model’s effectiveness. Although linear regression is primarily a regression technique, its application in binary classification through thresholding demonstrates moderate success, highlighting both the challenges and potential of traditional models in tackling fake news detection.

## 1. Introduction

The rapid dissemination of information through social media and online news platforms has increased the prevalence of fake news, which can mislead public opinion and influence societal behavior. Automated detection of fake news is essential to mitigate its adverse effects. While modern approaches often rely on deep learning and neural networks, this project investigates the use of linear regression, a simpler and interpretable model, to classify news articles as real or fake based on textual features.

### Objectives

* Implement a linear regression model for fake news detection.
* Evaluate its performance using standard classification metrics.
* Analyze the advantages and limitations of linear regression in this context.

## 2. Related Work

Prior research in fake news detection has largely focused on:

* **Natural Language Processing (NLP) approaches:** Utilizing TF-IDF, word embeddings, and linguistic cues.
* **Machine Learning algorithms:** Logistic regression, decision trees, random forests, and support vector machines.
* **Deep learning models:** LSTM, BERT, and transformer-based architectures.

Linear regression is typically used for predicting continuous variables, but with proper thresholding, it can serve as a simple baseline for binary classification tasks.

## 3. Methodology

### 3.1 Dataset

The dataset used in this study contains news articles labeled as `real` or `fake`. Each article includes:

* Title
* Text content
* Class (0 = fake, 1 = real)

Preprocessing steps included:

* Removing punctuation, URLs, HTML tags, bracketed text, and tokens containing digits.
* Lowercasing all text.
* Converting text to numerical features using TF-IDF vectorization.

### 3.2 Feature Extraction

TF-IDF (Term Frequency–Inverse Document Frequency) is employed to convert text data into numerical vectors that capture the importance of words in the corpus.
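
As a rough illustration of the weighting described above, the sketch below runs scikit-learn's `TfidfVectorizer` on a three-sentence toy corpus. The corpus is hypothetical and not taken from the project's dataset; the point is only that a term found in every document receives a smaller IDF weight than a term found in one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not from the project's dataset).
corpus = [
    "government announces new budget",
    "government denies secret budget",
    "aliens endorse government",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: documents x vocabulary

vocab = vectorizer.vocabulary_             # maps each term to its column index
idf = vectorizer.idf_                      # one inverse-document-frequency weight per term

# "government" occurs in every document, "aliens" in only one,
# so the rarer term receives the larger IDF weight.
print("matrix shape:", tfidf.shape)
print("idf(government):", idf[vocab["government"]])
print("idf(aliens):", idf[vocab["aliens"]])
```

Each row of `tfidf` is the numerical vector for one document, which is the form a linear model expects as input.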

### 3.3 Linear Regression Model

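Although linear regression is a regression model, its real-valued output, which is not constrained to $[0, 1]$ and is therefore only a rough stand-in for a probability, can be used as a classification score:

$$ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n $$
where $\hat{y}$ is the predicted value, $x_1, \dots, x_n$ are the TF-IDF feature values, and $\beta_0, \dots, \beta_n$ are the learned coefficients. Thresholding is applied such that: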

$$\text{label} =
\begin{cases}
1 & \text{if } \hat{y} \geq 0.5 \\
0 & \text{if } \hat{y} < 0.5
\end{cases}
$$
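
The thresholding rule above can be sketched with scikit-learn's `LinearRegression`. This is a minimal illustration under stated assumptions, not the notebook's trained model: the four texts and their 0/1 labels are made up, and the features are converted to a dense array only because the example is tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with binary 0/1 labels (not the project's dataset).
texts = [
    "shocking secret cure doctors hate",
    "shocking miracle secret revealed",
    "parliament passes annual budget bill",
    "central bank holds interest rates",
]
labels = np.array([0, 0, 1, 1])

# Dense array is fine for four short documents; a real corpus would stay sparse.
features = TfidfVectorizer().fit_transform(texts).toarray()

# Ordinary least squares fit: y_hat = beta_0 + sum_i beta_i * x_i
model = LinearRegression()
model.fit(features, labels)

# The raw prediction is a real-valued score, not a probability;
# nothing forces it to stay inside [0, 1].
scores = model.predict(features)

# Apply the 0.5 threshold from the piecewise rule.
predicted_labels = (scores >= 0.5).astype(int)
print(scores)
print(predicted_labels)
```

Because the toy problem has more features than documents, the fit reproduces the training labels exactly; on real data the scores would be spread around 0 and 1 and some would fall outside that range.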

### 3.4 Evaluation Metrics

Performance is measured using:

* Accuracy
* Precision
* Recall
* F1-score
* Confusion matrix

These metrics allow a comprehensive understanding of the model’s classification performance.
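
To make the relationship between these metrics concrete, the sketch below derives each one from the confusion-matrix counts and compares it with scikit-learn's helper functions. The label and prediction lists are hypothetical and exist only to exercise the formulas.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels and predictions, used only to illustrate the formulas.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of the predicted positives, how many are correct
recall = tp / (tp + fn)             # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with scikit-learn's helpers.
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```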

## 4. Discussion

Linear regression provides a simple and interpretable approach to fake news detection. Its strengths include:

* Easy implementation and low computational cost.
* Transparency in understanding feature contributions.

However, limitations include:

* Poor handling of complex, non-linear relationships in textual data.
* Sensitivity to feature scaling and multicollinearity.
* Lower performance compared to specialized classifiers such as logistic regression or deep learning models.

This project demonstrates that linear regression can serve as a baseline model, but advanced models are recommended for real-world deployment.


## Importing Necessary Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import string
import re

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

## Loading the data

In [2]:
data=pd.read_csv('data/News.csv')

### Data Preview 

In [3]:
data.head()

| | Unnamed: 0 | title | text | subject | date | class |
|---|---|---|---|---|---|---|
| 0 | 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 |
| 1 | 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 |
| 2 | 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 |
| 3 | 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0 |
| 4 | 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0 |


#### The "title", "subject", and "date" columns are not required for detecting fake news, so I am going to drop them.

In [4]:
data.columns

Index(['Unnamed: 0', 'title', 'text', 'subject', 'date', 'class'], dtype='object')

In [5]:
data=data.drop(['Unnamed: 0','title','subject','date'], axis = 1)

In [6]:
#count of missing values
data.isnull().sum() 

text     0
class    0
dtype: int64

#### Randomly shuffling the dataframe 

In [7]:
data = data.sample(frac = 1)

In [8]:
data.head()

| | text | class |
|---|---|---|
| 5530 | White privilege is about the fact that for Ame... | 0 |
| 13374 | We live near the city of Detroit, and anyone c... | 0 |
| 17358 | | 0 |
| 38900 | MOSCOW (Reuters) - Russia opposes a draft U.N.... | 1 |
| 2491 | If any other American company did this, Donald... | 0 |


In [9]:
# Renumber the rows 0..n-1 and discard the old shuffled index
data = data.reset_index(drop=True)

In [10]:
data.columns

Index(['text', 'class'], dtype='object')

In [11]:
data.head()

| | text | class |
|---|---|---|
| 0 | White privilege is about the fact that for Ame... | 0 |
| 1 | We live near the city of Detroit, and anyone c... | 0 |
| 2 | | 0 |
| 3 | MOSCOW (Reuters) - Russia opposes a draft U.N.... | 1 |
| 4 | If any other American company did this, Donald... | 0 |


## Preprocessing Text

#### Creating a function to convert the text to lowercase and remove bracketed text, URLs, HTML tags, special characters, punctuation, and tokens containing digits.

In [12]:
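def wordopt(text):
    # Normalize case so "News" and "news" map to the same token
    text = text.lower()
    # Drop bracketed fragments such as "[video]"
    text = re.sub(r'\[.*?\]', '', text)
    # Strip URLs and HTML tags first, while their punctuation is still intact
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)  # replacement must be a str, not bytes
    # Replace every remaining non-word character with a space
    text = re.sub(r'\W', ' ', text)
    # Remove any punctuation that \W does not cover (e.g. underscores)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove tokens that contain digits
    text = re.sub(r'\w*\d\w*', '', text)
    return text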

In [13]:
data['text'] = data['text'].apply(wordopt)

#### Defining the independent and dependent variables as x and y

In [14]:
x = data['text'].values
y = data['class'].values

## Training the model

#### Splitting the dataset into training set and testing set. 

In [15]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.25 , stratify = y, random_state = 42)
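
The `stratify` argument in the split above keeps the class ratio the same in both subsets. A small sketch with made-up labels (60 of class 0 and 40 of class 1, purely hypothetical) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 0, 40 of class 1.
y_demo = np.array([0] * 60 + [1] * 40)
x_demo = np.arange(len(y_demo))      # stand-in features, one per label

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# With stratify, both splits keep the 60/40 class ratio of the full data.
print("train size:", len(y_tr), "share of class 1:", y_tr.mean())
print("test size:", len(y_te), "share of class 1:", y_te.mean())
```

Without `stratify`, the class shares in the two subsets would vary with the random seed, which can distort per-class precision and recall on small or imbalanced datasets.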

### Extracting Features from the Text

#### Convert text to vectors

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
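
The cell above calls `fit_transform` on the training text and only `transform` on the test text. The sketch below, using hypothetical sentences and a separate `demo_vectorizer`, shows what that implies: the test matrix reuses the training vocabulary, and words never seen during fitting are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, not drawn from the project's dataset.
train_docs = ["senate approves budget", "budget vote delayed"]
test_docs = ["senate delays cryptocurrency vote"]

demo_vectorizer = TfidfVectorizer()
train_matrix = demo_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = demo_vectorizer.transform(test_docs)        # reuses them unchanged

# Both matrices share the same columns, so a model trained on one can score the other.
print(train_matrix.shape, test_matrix.shape)

# Terms never seen during fitting ("delays", "cryptocurrency") are silently dropped.
print("cryptocurrency" in demo_vectorizer.vocabulary_)
print("non-zero entries in the test row:", test_matrix.nnz)
```

Fitting the vectorizer on the test text as well would leak information about the test set into the features, so keeping the two calls separate is the safer pattern.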

## Logistic Regression

In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
LR = LogisticRegression()
LR.fit(xv_train, y_train)

In [19]:
pred_lr = LR.predict(xv_test)

In [20]:
LR.score(xv_test, y_test)

0.9870881567230633

In [21]:
print (classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5876
           1       0.98      0.99      0.99      5354

    accuracy                           0.99     11230
   macro avg       0.99      0.99      0.99     11230
weighted avg       0.99      0.99      0.99     11230



## Save the model and vectorizer

In [22]:
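import os
import joblib

# Make sure the output directory exists before saving
os.makedirs("model", exist_ok=True)

# Save the Logistic Regression model
joblib.dump(LR, "model/lr_model.pkl")

# Save the TF-IDF vectorizer
joblib.dump(vectorization, "model/vectorizer.pkl")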


['model/vectorizer.pkl']
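
Saved artifacts are only useful if they can be restored later. The following is a minimal round-trip sketch using `joblib.load`; it trains a tiny stand-in model on hypothetical text and writes to a temporary directory, so the file locations and data here are assumptions rather than the project's actual `model/` files.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model so the sketch runs on its own (hypothetical data).
texts = ["shocking secret cure", "shocking miracle cure",
         "parliament passes budget", "bank publishes budget report"]
labels = [0, 0, 1, 1]
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "lr_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    joblib.dump(clf, model_path)
    joblib.dump(vec, vectorizer_path)

    # In a separate script, only these two load calls would be needed.
    loaded_clf = joblib.load(model_path)
    loaded_vec = joblib.load(vectorizer_path)

# The restored pair should behave exactly like the original pair.
sample = ["parliament publishes budget"]
original_pred = clf.predict(vec.transform(sample))
restored_pred = loaded_clf.predict(loaded_vec.transform(sample))
print(original_pred, restored_pred)
```

The model and the vectorizer have to be saved and loaded together, because the classifier's weights are tied to the column order of the fitted vocabulary.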

In [23]:
# Save to CSV
data.to_csv("data/news_dataset.csv", index=False)

### Model Testing With Manual Entry

In [24]:
def output_label(n):
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "True News"

def manual_testing(news):
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)

    new_def_test["text"] = new_def_test["text"].apply(wordopt)

    new_x_test = new_def_test["text"]
    new_xv_test = vectorization.transform(new_x_test)

    pred_lr = LR.predict(new_xv_test)

    return "\nLR Prediction: {}".format(output_label(pred_lr[0]))
    


In [25]:
news = """
Breaking: Scientists confirm that charging your phone overnight causes it to secretly mine cryptocurrency for foreign governments.

"""

print(manual_testing(news))



LR Prediction: Fake News


In [26]:
news = """
The Government of Bangladesh has announced a new initiative to improve digital education in rural areas, aiming to provide internet access and online learning resources to all schools by 2026.
"""

print(manual_testing(news))



LR Prediction: True News


In [27]:
news = """
Major airline announces plans to switch to sustainable aviation fuel by 2030.
"""
print(manual_testing(news))



LR Prediction: Fake News


## 5. Conclusion

This study explored the application of linear regression for fake news detection. The classifier trained in the notebook, scikit-learn's `LogisticRegression` on TF-IDF features, reached about 98.7% accuracy on the held-out test set, but it remains a linear model over word frequencies, and its limitations highlight the need for more sophisticated algorithms for reliable classification. Future work can focus on combining linear regression with ensemble methods or exploring deep learning architectures to improve performance.
