# Fake News Detection Using Support Vector Machines (SVM)
### Advanced Machine Learning (Project 1)
**Authors:** Otari Samadashvili and Nana Jaoshvili  
**Date:** 03.12.2025.

---

## Introduction

### Project Objective

This project aims to distinguish **Real News** from **Fake News** in an era of overwhelming and chaotic media coverage. The analysis applies **Support Vector Machine (SVM) methods** to classify news articles based on their linguistic and semantic patterns. The objective is to compare the performance of multiple SVM options, including linear, polynomial, and RBF kernels, and identify the most effective one for reliable fake-news detection. 


The working hypothesis is that the linear kernel may perform similarly to more complex kernels while requiring fewer computational resources. Comparing these options is both practical and theoretically informative, as it reveals how model complexity interacts with real‑world fake‑news classification.



### Context & Motivation

**The central problem** addressed in this project is the growing difficulty of verifying online information in real time. Traditional fact‑checking cannot keep up with the velocity at which misinformation spreads, and fake news often imitates the tone, structure, and style of legitimate journalism, making it difficult to distinguish truth from fabrication.

This challenge underlines the importance of automated detection tools. Misinformation can influence democratic processes, disrupt financial markets, and undermine public access to credible information. Automated systems that flag suspicious content serve as a crucial first filter, directing human attention toward items that require deeper review, rather than replacing human judgment entirely.

Machine learning provides a practical way to address this challenge by learning patterns in text that are too subtle or complex for manual screening. Among available methods, SVMs remain widely used for text classification because of their robustness, solid theoretical foundations, and ability to handle high‑dimensional feature spaces derived from word frequencies and semantic representations.


**The motivation** behind this project is to evaluate how different SVM configurations perform when applied to the classification of real and fake news articles. By comparing linear, polynomial, and radial basis function (RBF) kernels within a consistent analytical pipeline, the study seeks to understand how kernel choice influences detection accuracy. Beyond identifying the best-performing model, the project highlights the broader importance of computational efficiency in promoting media literacy and strengthening resilience against misinformation. 

This notebook documents the end-to-end workflow, from data preparation and feature extraction to model evaluation, providing a simple framework for fake-news detection.

**Motivation for the Chosen Method**:

Support Vector Machines (SVMs) are used as the primary method of analysis in this project. Although deep learning models such as BERT currently dominate natural language processing, SVMs offer several practical and methodological advantages that make them a strong fit for fake-news classification. 

SVMs are a natural choice for this setting because news articles, when vectorized with TF–IDF, yield high-dimensional sparse feature spaces where linear decision boundaries often achieve strong separation. By comparing linear and RBF-kernel SVMs on TF–IDF features, and by training models both before and after aggressive de-leakage, the project investigates whether the fake vs real boundary is effectively linear and how much reported “near-perfect” accuracy is driven by spurious artifacts rather than genuine linguistic understanding.

- **Effective in High-Dimensional Spaces**: Text data represented through TF-IDF or similar methods produces thousands of features. SVMs are designed to work well in such spaces by maximizing the margin between classes, which helps them avoid the typical issues associated with high dimensionality.

- **Computational Efficiency**: Deep learning models require significant computational resources and large labelled datasets. In contrast, linear SVMs train quickly and provide fast inference, making them suitable for systems that need timely predictions or operate under limited computing power.

- **Robustness and Generalisation**: SVMs focus on identifying the optimal separating hyperplane rather than fitting individual data points. This makes them less prone to overfitting than models like decision trees, especially in noisy textual environments.

- **Margin Maximisation Principle**: SVMs do more than find a boundary: they find the best possible boundary by maximising the margin between the real and fake news classes. This contributes directly to stable and reliable classification performance.

These properties make SVMs a strong methodological choice for this study, balancing theoretical robustness with practical efficiency in real-world text classification tasks.
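
For reference, the margin-maximisation idea in the bullets above corresponds to the standard soft-margin SVM objective. The notation here (weights $w$, bias $b$, slack variables $\xi_i$, penalty $C$, labels $y_i \in \{-1, +1\}$ for real and fake) is introduced only for illustration and is not used elsewhere in this notebook:

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
$$

The margin width is $2/\lVert w\rVert$, so minimising $\lVert w\rVert^2$ widens the margin, while $C$ sets how heavily articles that fall inside the margin or on the wrong side are penalised.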



## Literature Review
**["A benchmark study of machine learning models for online fake news detection"](https://www.sciencedirect.com/science/article/pii/S266682702100013X)**  
<span style="color:gray; font-style:italic">Khan J.Y., Khondaker T.I., et al., Machine Learning with Applications, 2021
</span>  
- Empirical studies show that false content can spread faster and further than truthful information on social platforms, increasing the societal risks of misinformation.
- Highlights that duplicated content, shared templates, and dataset‑specific markers (e.g. source names) can cause severe data leakage and overoptimistic model performance.
- Claims that SVMs with TF–IDF features often outperform Naive Bayes, KNN, and basic tree‑based models on binary fake‑vs‑real tasks, frequently reaching 90–99% accuracy on curated datasets.

**["Survey of fake news detection using machine intelligence approach"](https://www.sciencedirect.com/science/article/abs/pii/S0169023X22001094)**  
<span style="color:gray; font-style:italic">Pal A., Pranav, Pradhan M., Data & Knowledge Engineering, 2023
</span>  
- Shows content‑based text classification as a core component of fake‑news detection frameworks, typically complemented by user, network, or temporal features to improve robustness.


**["Fake and Real News Dataset"](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset/data)**  
<span style="color:gray; font-style:italic">Clément Bisaillon, Kaggle, 2023
</span>  
- Provides a widely used benchmark dataset with separate files of fake and true articles containing title, text, subject, and date fields; prior analyses note that many true articles originate from Reuters with stereotyped lead phrases, creating strong lexical cues that must be removed to avoid source‑driven leakage.


**["Advanced machine learning techniques for fake news detection"](https://magnascientiapub.com/journals/msarr/sites/default/files/MSARR-2024-0198.pdf)**  
<span style="color:gray; font-style:italic">Jahan I., Hasan N., Islam S.N., et al., Magna Scientia Advanced Research and Reviews, 2024
</span>  
- Finds that simple BoW and TF–IDF representations combined with SVMs can match or outperform more complex embedding‑based pipelines for fake‑news classification when carefully tuned.
- Shows that while RBF kernels can yield small gains in some scenarios, linear SVMs often achieve similar performance with substantially lower computational cost when trained on TF–IDF word and n‑gram features.

```python
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
```


## Dataset Description
The analysis uses the **["Fake and Real News Dataset"](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset/data)** created by Clément Bisaillon on Kaggle, which is a widely used benchmark for binary fake‑vs‑real news classification. The dataset is provided in two separate CSV files, True.csv and Fake.csv, making the label structure explicit and straightforward for supervised learning.

**Articles included**:
- Total articles - **44,898**
- Real News - **21,417 (48% Real)**
- Fake News - **23,481 (52% Fake)**

The dataset is nearly balanced: 52% Fake, 48% Real.

**Each file contains four columns**:
- `title` - headline of the news article
- `text` - full body text of the article
- `subject` - coarse topical category (e.g., politics, world news)
- `date` - publication date as a string

The classes are close to balanced (approximately 52% fake and 48% real), which reduces the need for complex resampling techniques and allows standard train–test splits and evaluation metrics (accuracy, precision, recall, F1) to be interpreted without heavy class‑imbalance corrections.

From a modelling perspective, the presence of both title and full text enables experimentation with different input granularities (e.g., title‑only vs. full‑text models), while the subject field can be used for exploratory analysis of topic distribution or for stratified splitting if desired. The date column supports temporal checks, such as ensuring that training and test sets are not trivially linked by duplicated or near‑duplicate articles appearing at similar times.
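
As a minimal sketch of how the two files could be combined into one labelled table: the helper name `combine_news`, the `label` column, and the encoding (1 = fake, 0 = real) are assumptions made here for illustration, not something the dataset prescribes.

```python
import pandas as pd


def combine_news(true_df: pd.DataFrame, fake_df: pd.DataFrame) -> pd.DataFrame:
    """Stack the two frames and add a binary label (assumed: 1 = fake, 0 = real)."""
    true_df = true_df.assign(label=0)
    fake_df = fake_df.assign(label=1)
    return pd.concat([true_df, fake_df], ignore_index=True)


# Typical use, assuming the Kaggle files sit next to the notebook:
# news = combine_news(pd.read_csv("True.csv"), pd.read_csv("Fake.csv"))
# print(news["label"].value_counts(normalize=True))
```

Keeping the label as a single column makes it straightforward to stratify the train–test split on it later.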


### Data Exploration & Preprocessing

Initial exploration confirmed that the dataset focuses mainly on political and world news. The distribution of subjects is skewed toward politics‑related content, which supports treating the task largely as political/world news classification rather than general news. Basic checks showed no missing values in textual fields, but revealed duplicated articles and strong lexical patterns tied to specific sources such as Reuters, indicating potential data leakage.

To prepare the data for robust text classification, the project applies a sequence of cleaning and feature‑engineering steps to ensure that the SVM learns generalizable linguistic patterns rather than exploiting superficial artifacts in the dataset.

##### **Data Cleaning Steps**:
1) **Mapping subject categories into Politics and WorldNews**
    - **Problem**: The subject field contained several labels (`politicsNews`, `worldnews`, `Government News`, `Middle-east`, etc.) that are semantically overlapping and unevenly represented.
    - **Solution**: Subject values were mapped into two broader groups, `Politics` and `WorldNews`, collapsing related labels into a simpler, binary topical variable. This aggregation reduces sparsity and noise in the subject feature, simplifies stratified sampling or analysis by topic, and aligns with prior descriptions of datasets where most content is political or world news.

2) **Removing source and location strings**
    - **Problem**: Many real articles begin with the location or the source of the news (e.g. “WASHINGTON (Reuters)”), which serves as a near‑perfect shortcut for the label and creates data leakage.
    - **Solution**: Location–source phrases and similar markers were stripped from the article text before vectorization. Without this removal, an SVM can achieve artificially high accuracy by detecting the presence of “Reuters” rather than learning stylistic or semantic differences between fake and real news, significantly inflating performance estimates.

3) **Removing duplicate articles**
    - **Problem**: Exploration showed a non‑trivial number of duplicate articles, which would effectively leak information between training and test splits and allow models to memorize specific articles.
    - **Solution**: Duplicates were removed, and the number of duplicates per class was recorded to document the extent of repetition. Doing so prevents train–test contamination, ensures that evaluation reflects generalization to unseen content, and avoids giving the model multiple identical copies of the same article.

4) **Adding an emotion classifier**
    - **Problem**: TF–IDF features may miss global affective patterns, even though prior work suggests that fake news often exhibits distinct emotional profiles compared to real reporting.
    - **Solution**: An external emotion classification model was applied to each article to derive an emotion vector (scores for anger, fear, joy, trust, disgust, etc.), added as an extra feature. Incorporating emotion information allows the SVM to exploit differences in emotional tone and intensity between fake and real news, potentially improving performance and interpretability relative to text‑only baselines.

5) **Vocabulary difference inspection**
    - **Problem**: Frequency analysis of tokens by class revealed words that appear almost exclusively in either true or fake articles, some of which were source‑specific or dataset‑specific.
    - **Solution**: Word frequency counters were computed separately for true and fake texts, and the sets of exclusive or highly skewed terms were inspected. Inspecting vocabulary differences helps distinguish meaningful stylistic signals (e.g. sensational language) from leakage‑driven ones, guiding the cleaning process.
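
Steps 1 to 3 above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the project's exact code: the `SUBJECT_GROUPS` assignment, the `SOURCE_PREFIX` pattern, the `label` column, and the choice to detect duplicates on the `text` column are all hypothetical.

```python
import re

import pandas as pd

# Assumed grouping of the labels named in step 1; other labels are left unchanged.
SUBJECT_GROUPS = {
    "politicsNews": "Politics",
    "Government News": "Politics",
    "worldnews": "WorldNews",
    "Middle-east": "WorldNews",
}

# Assumed pattern for leading markers such as "WASHINGTON (Reuters) - ".
SOURCE_PREFIX = re.compile(
    r"^\s*(?:[A-Z][\w .,/'-]{0,60}?\s)?\(Reuters\)\s*[-\u2013\u2014]?\s*"
)


def clean_news(df: pd.DataFrame):
    """Return (cleaned frame, number of dropped duplicates per class)."""
    out = df.copy()
    # Step 1: collapse subject labels into broader groups.
    out["subject_group"] = out["subject"].map(SUBJECT_GROUPS).fillna(out["subject"])
    # Step 2: strip the location/source prefix from the start of each article.
    out["text"] = out["text"].apply(lambda t: SOURCE_PREFIX.sub("", t))
    # Step 3: record duplicates per class, then keep only the first copy.
    is_dup = out.duplicated(subset=["text"], keep="first")
    dup_counts = out.loc[is_dup].groupby("label").size()
    return out.loc[~is_dup].reset_index(drop=True), dup_counts
```

Stripping the prefix before de-duplication matters in this sketch: two copies of the same article that differ only in their source marker are then recognised as duplicates.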


## Training SVM Models

#### Model A: Linear SVM
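Assumes that Fake vs Real is (approximately) linearly separable in the high-dimensional TF–IDF space.

Why linear? TF–IDF vectorization yields high-dimensional sparse features, where linear decision boundaries often achieve strong separation.

**Pros**:
- Trains quickly and provides fast inference
- Weights of the separating hyperplane can be inspected term by term

**Cons**:
- Cannot capture non-linear relationships between features
- Relies on the linear-separability assumption holding for this data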
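
A minimal sketch of Model A, assuming a TF–IDF plus `LinearSVC` pipeline. The helper name `build_linear_svm`, the n‑gram range, and the toy corpus are illustrative assumptions, not the project's tuned configuration or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_linear_svm(C: float = 1.0):
    """TF-IDF (unigrams + bigrams) feeding a linear SVM; settings are illustrative."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LinearSVC(C=C, random_state=0),
    )


# Toy corpus invented for illustration (1 = fake, 0 = real); not project data.
toy_texts = [
    "shocking secret exposed",
    "shocking hoax revealed",
    "shocking miracle cure",
    "senate passes budget",
    "minister signs trade agreement",
    "court reviews election law",
]
toy_labels = [1, 1, 1, 0, 0, 0]

linear_model = build_linear_svm().fit(toy_texts, toy_labels)
print(linear_model.predict(["shocking hoax exposed", "senate reviews trade law"]))
```

Wrapping the vectorizer and classifier in one pipeline means the TF–IDF vocabulary is learned only from the training split whenever the pipeline is fitted, which avoids one more route for leakage.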


#### Model B: Polynomial


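**Pros**:
- Can model interactions between terms that a linear boundary cannot represent
- The kernel degree gives explicit control over model complexity

**Cons**:
- More hyperparameters to tune (degree, coefficient term, kernel scale)
- Kernel SVM training is slower than a linear SVM on a dataset of this size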

#### Model C: RBF Kernel SVM
Captures possible non-linear relationships.
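
Models B and C differ from Model A only in the kernel, so both can be sketched with one helper. The name `build_kernel_svm`, the default `degree=2`, `gamma="scale"`, and the toy corpus are assumptions for illustration rather than the settings used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_kernel_svm(kernel: str, C: float = 1.0, degree: int = 2):
    """TF-IDF feeding a kernel SVM; `degree` is only used when kernel="poly"."""
    return make_pipeline(
        TfidfVectorizer(),
        SVC(kernel=kernel, C=C, degree=degree, gamma="scale"),
    )


# Toy corpus invented for illustration (1 = fake, 0 = real); not project data.
toy_texts = [
    "shocking secret exposed",
    "shocking hoax revealed",
    "shocking miracle cure",
    "senate passes budget",
    "minister signs trade agreement",
    "court reviews election law",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Fit one pipeline per kernel so they can be compared on identical inputs.
kernel_models = {
    name: build_kernel_svm(name).fit(toy_texts, toy_labels)
    for name in ("poly", "rbf")
}
```

In practice `C`, `degree`, and `gamma` would be chosen by cross-validation (for example with `GridSearchCV`, which is already imported above) rather than fixed by hand.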

## Model Evaluation

We compare:
- Accuracy
- Confusion Matrix
- ROC-AUC
- Errors and misclassifications
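
The four items above can be computed with one small helper. This is a sketch: the function name `evaluate` and the example values are made up for illustration and are not results from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score


def evaluate(y_true, y_pred, scores):
    """Collect accuracy, confusion matrix, ROC-AUC, and misclassified indices.

    `scores` are continuous decision values (e.g. from `decision_function`),
    which ROC-AUC needs instead of hard 0/1 predictions.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
        "roc_auc": roc_auc_score(y_true, scores),
        "misclassified_idx": np.flatnonzero(y_true != y_pred),
    }


# Hand-made example values, not results from the dataset.
report = evaluate(
    y_true=[0, 0, 1, 1],
    y_pred=[0, 1, 1, 1],
    scores=[-1.2, 0.3, 0.8, 1.5],
)
print(report["accuracy"])  # 0.75
```

The returned indices can be used to pull the misclassified articles back out of the test set for manual inspection.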



table for each model (computational time, inference time, metrics)
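
One possible way to assemble such a table is sketched below. The helper name `benchmark` and the column names are assumptions, and only accuracy is included to keep the example short.

```python
import time

import pandas as pd
from sklearn.metrics import accuracy_score


def benchmark(models: dict, X_train, y_train, X_test, y_test) -> pd.DataFrame:
    """Fit each model once and record wall-clock times plus test accuracy."""
    rows = []
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        train_time = time.perf_counter() - start

        start = time.perf_counter()
        y_pred = model.predict(X_test)
        inference_time = time.perf_counter() - start

        rows.append({
            "model": name,
            "train_time_s": train_time,
            "inference_time_s": inference_time,
            "accuracy": accuracy_score(y_test, y_pred),
        })
    return pd.DataFrame(rows).set_index("model")


# Typical use, with pipelines for the three models and an existing split:
# table = benchmark({"linear": model_a, "poly": model_b, "rbf": model_c},
#                   X_train, y_train, X_test, y_test)
```

Single-run wall-clock timings are noisy, so repeating each fit a few times and averaging would give a fairer comparison.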


### Feature Analysis

Linear SVM allows interpretation by examining the weights of the separating hyperplane.

Useful for identifying:
- sensationalist terms predicting "Fake"
- neutral political terms predicting "Real"




## Conclusion




