# Fake News Detection with Graph Neural Networks
## Project Overview

This project explores the use of **Graph Neural Networks (GNNs)** for detecting fake news articles. Traditional text-classification models treat each article independently, but misinformation often follows shared patterns, topics, and stylistic cues that relate articles to one another. To capture these relationships, this work constructs a **similarity-based graph** where each node represents a news article and edges represent semantic similarity derived from sentence embeddings.

### Key Components of the Project

1. **Text Preprocessing & Embedding Generation**  
   - Real and fake articles from the *FakeNewsNet (PolitiFact subset)* are cleaned and merged into a standardized dataset.  
   - Each article is encoded into a dense semantic vector using **Sentence Transformers (MiniLM)**.

2. **Graph Construction**  
   - Pairwise cosine similarity is computed between article embeddings.  
   - Article pairs whose similarity exceeds a chosen threshold are connected, forming a graph whose edges capture contextual and semantic relationships.

3. **PyTorch Geometric (PyG) Representation**  
   - The similarity graph is converted into a PyG `Data` object containing:  
     - `x` → node features (embeddings)  
     - `edge_index` → graph connections  
     - `y` → real/fake labels  
   - This unified representation is used to train GNN models.

4. **Baseline Models**  
   - Machine learning classifiers (Logistic Regression, SVM, Random Forest, Gradient Boosting) are trained directly on the MiniLM embeddings.  
   - These serve as reference points to evaluate how much relational information helps.

5. **GNN Modeling**  
   - Three GNN architectures are trained and evaluated:  
     - **GCN** (Graph Convolutional Network)  
     - **GraphSAGE** (Sampling-based aggregation)  
     - **GAT** (Graph Attention Network)  
   - Each model is trained with five random seeds (42 to 46), so results can be reported as a spread across runs rather than a single number.

6. **Embedding Visualization**  
   - The hidden node embeddings learned by GNNs are projected into 2D using **t-SNE**.  
   - These visualizations reveal how effectively each model separates real and fake news in latent space.

7. **Performance Comparison**  
   - Metrics (accuracy, F1-score), training curves, and t-SNE plots are aggregated.  
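   - In the runs recorded below, all three GNNs reach a **higher mean test accuracy across seeds than the reported baselines**, and **GAT scores highest on average** (about 0.873, versus 0.867 for GraphSAGE and 0.855 for GCN). The gaps between the GNNs are small relative to the seed-to-seed spread, so attributing GAT's lead to its attention mechanism remains a hypothesis.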
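
To make the message-passing idea behind the GNN models concrete, here is a minimal NumPy sketch of a single GCN propagation step in the standard normalized-adjacency form, ReLU(D^-1/2 (A + I) D^-1/2 X W). It is an illustration only: the project trains its models with PyTorch Geometric, and `gcn_layer`, the toy adjacency matrix, and the identity weights are assumptions made for this example.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degree, self-loop included
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: scale row i and column j by 1/sqrt(deg)
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ features @ weight, 0.0)

# Toy graph: articles 0 and 1 are linked, article 2 is isolated.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
weight = np.eye(2)  # identity weights so the neighbor mixing is easy to read

hidden = gcn_layer(adj, features, weight)
print(hidden)
```

With these toy inputs, the two linked articles end up with the same averaged representation, while the isolated article keeps its own features. That is the sense in which edges let a node borrow evidence from similar articles.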



## What This Notebook Covers

This notebook walks through the end-to-end workflow of the Fake News GNN project:

- Loading and understanding the dataset  
- Building or loading the similarity graph  
- Converting the graph into PyTorch Geometric format  
- Training baseline machine learning models  
- Training GNN models (GCN, GraphSAGE, GAT)  
- Running t-SNE to visualize node embeddings  
- Comparing all models using consistent metrics  

Overall, this notebook serves both as a **reproducible experiment** and a **complete explanation of the Fake News GNN pipeline**.


## 1. Environment Setup

In [2]:
!pip install -r ../requirements.txt

Collecting torch-geometric (from -r ../requirements.txt (line 2))
  Using cached torch_geometric-2.7.0-py3-none-any.whl.metadata (63 kB)
Collecting aiohttp (from torch-geometric->-r ../requirements.txt (line 2))
  Downloading aiohttp-3.13.2-cp312-cp312-macosx_10_13_x86_64.whl.metadata (8.1 kB)
Collecting xxhash (from torch-geometric->-r ../requirements.txt (line 2))
  Downloading xxhash-3.6.0-cp312-cp312-macosx_10_13_x86_64.whl.metadata (13 kB)
Collecting aiohappyeyeballs>=2.5.0 (from aiohttp->torch-geometric->-r ../requirements.txt (line 2))
  Using cached aiohappyeyeballs-2.6.1-py3-none-any.whl.metadata (5.9 kB)
Collecting aiosignal>=1.4.0 (from aiohttp->torch-geometric->-r ../requirements.txt (line 2))
  Using cached aiosignal-1.4.0-py3-none-any.whl.metadata (3.7 kB)
Collecting attrs>=17.3.0 (from aiohttp->torch-geometric->-r ../requirements.txt (line 2))
  Using cached attrs-25.4.0-py3-none-any.whl.metadata (10 kB)
Collecting frozenlist>=1.1.1 (from aiohttp->torch-geometric->-r ../

## 2. Run Data Preprocessing

In [7]:
!python ../src/preprocess.py

Saved data/processed/posts.csv
Total posts: 1056


## 3. Build Graph

In [10]:
# Optional: skip this cell if sentence-transformers is already installed in the active environment.
%pip install sentence-transformers

Collecting sentence-transformers
  Using cached sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Using cached sentence_transformers-5.1.2-py3-none-any.whl (488 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-5.1.2
Note: you may need to restart the kernel to use updated packages.


In [14]:
%pip install --force-reinstall "numpy<2"

Collecting numpy<2
  Downloading numpy-1.26.4-cp312-cp312-macosx_10_9_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp312-cp312-macosx_10_9_x86_64.whl (20.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.3/20.3 MB 1.9 MB/s 0:00:10
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.3.5
    Uninstalling numpy-2.3.5:
      Successfully uninstalled numpy-2.3.5
Successfully installed numpy-1.26.4
Note: you may need to restart the kernel to use updated packages.


In [36]:
!python ../src/build_graph.py

Loading posts...
Preparing content column...
Updated posts.csv saved with new 'content' column.
Loading embedding model (MiniLM)...
Generating embeddings...
Batches: 100%|██████████████████████████████████| 33/33 [00:16<00:00,  1.95it/s]
Computing cosine similarity matrix...
Building similarity graph...
Saving graph to data/processed/graph.pkl

Graph saved successfully!
Total nodes: 1056
Total edges: 634
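
The graph-building step logged above (cosine similarity matrix, then thresholding) can be sketched as follows. This is a minimal NumPy illustration, not the contents of `build_graph.py`: the helper name `build_similarity_edges`, the toy 2-D vectors, and the threshold of 0.8 are all assumptions, and the real script may differ in how it thresholds or stores edges.

```python
import numpy as np

def build_similarity_edges(embeddings, threshold):
    """Return undirected edges (i, j), i < j, whose cosine similarity exceeds threshold."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)      # guard against zero vectors
    sim = unit @ unit.T                                  # pairwise cosine similarity
    rows, cols = np.triu_indices(len(embeddings), k=1)   # each pair once, no self-pairs
    keep = sim[rows, cols] > threshold
    return [(int(i), int(j)) for i, j in zip(rows[keep], cols[keep])]

# Toy 2-D "embeddings"; the real node features are 384-dimensional MiniLM vectors.
toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0]])
edges = build_similarity_edges(toy, threshold=0.8)  # 0.8 is a hypothetical value
print(edges)
```

Only the upper triangle of the similarity matrix is scanned, so each article pair is counted once. The threshold controls graph density: a higher value yields fewer, more reliable edges and leaves more articles isolated.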


## 4. Train Baseline Models

In [37]:
!python ../src/train_baselines.py

LogisticRegression: Accuracy=0.7925, F1=0.7273
SVM: Accuracy=0.8302, F1=0.7970
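
For readers comparing the two columns above, the sketch below shows how accuracy and a binary F1 score are computed from predictions. It is written in plain Python for illustration. Whether `train_baselines.py` reports binary, macro, or weighted F1 is not stated in the notebook, so the binary form and the label encoding (1 = fake) are assumptions here, as are the example labels.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and binary F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Hypothetical labels and predictions (1 = fake, 0 = real is an assumed encoding).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"Accuracy={acc:.4f}, F1={f1:.4f}")
```

Because F1 ignores true negatives, it can sit well below accuracy when the positive class is the harder one to recover, which is one possible reading of the Logistic Regression row.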


## 5. Train GNN Models

In [45]:
!python ../src/train_gnn.py --model gcn


--- Training GCN ---
Seed 42 — Val: 0.8671, Test: 0.8113, Time: 0.93s
Seed 43 — Val: 0.8861, Test: 0.8491, Time: 0.88s
Seed 44 — Val: 0.8544, Test: 0.8742, Time: 0.83s
Seed 45 — Val: 0.8861, Test: 0.8428, Time: 0.86s
Seed 46 — Val: 0.8608, Test: 0.8994, Time: 0.81s

--- Training GraphSAGE ---
Seed 42 — Val: 0.8734, Test: 0.8491, Time: 1.19s
Seed 43 — Val: 0.8924, Test: 0.8553, Time: 1.16s
Seed 44 — Val: 0.8481, Test: 0.8931, Time: 1.21s
Seed 45 — Val: 0.8797, Test: 0.8365, Time: 1.22s
Seed 46 — Val: 0.8671, Test: 0.8994, Time: 1.22s

--- Training GAT ---
Seed 42 — Val: 0.8797, Test: 0.8365, Time: 1.92s
Seed 43 — Val: 0.8861, Test: 0.8742, Time: 1.99s
Seed 44 — Val: 0.8671, Test: 0.8868, Time: 2.02s
Seed 45 — Val: 0.8987, Test: 0.8616, Time: 1.96s
Seed 46 — Val: 0.8608, Test: 0.9057, Time: 1.98s

 Training complete. Results and plots saved to results/


In [46]:
!python ../src/train_gnn.py --model sage


--- Training GCN ---
Seed 42 — Val: 0.8671, Test: 0.8113, Time: 0.83s
Seed 43 — Val: 0.8861, Test: 0.8491, Time: 0.79s
Seed 44 — Val: 0.8544, Test: 0.8742, Time: 0.83s
Seed 45 — Val: 0.8861, Test: 0.8428, Time: 0.86s
Seed 46 — Val: 0.8608, Test: 0.8994, Time: 0.82s

--- Training GraphSAGE ---
Seed 42 — Val: 0.8734, Test: 0.8491, Time: 1.25s
Seed 43 — Val: 0.8924, Test: 0.8553, Time: 1.28s
Seed 44 — Val: 0.8481, Test: 0.8931, Time: 1.20s
Seed 45 — Val: 0.8797, Test: 0.8365, Time: 1.19s
Seed 46 — Val: 0.8671, Test: 0.8994, Time: 1.22s

--- Training GAT ---
Seed 42 — Val: 0.8797, Test: 0.8365, Time: 1.95s
Seed 43 — Val: 0.8861, Test: 0.8742, Time: 1.96s
Seed 44 — Val: 0.8671, Test: 0.8868, Time: 1.93s
Seed 45 — Val: 0.8987, Test: 0.8616, Time: 1.94s
Seed 46 — Val: 0.8608, Test: 0.9057, Time: 1.97s

 Training complete. Results and plots saved to results/


In [47]:
!python ../src/train_gnn.py --model gat


--- Training GCN ---
Seed 42 — Val: 0.8671, Test: 0.8113, Time: 0.89s
Seed 43 — Val: 0.8861, Test: 0.8491, Time: 0.85s
Seed 44 — Val: 0.8544, Test: 0.8742, Time: 0.87s
Seed 45 — Val: 0.8861, Test: 0.8428, Time: 0.86s
Seed 46 — Val: 0.8608, Test: 0.8994, Time: 0.89s

--- Training GraphSAGE ---
Seed 42 — Val: 0.8734, Test: 0.8491, Time: 1.34s
Seed 43 — Val: 0.8924, Test: 0.8553, Time: 1.23s
Seed 44 — Val: 0.8481, Test: 0.8931, Time: 1.25s
Seed 45 — Val: 0.8797, Test: 0.8365, Time: 1.26s
Seed 46 — Val: 0.8671, Test: 0.8994, Time: 1.36s

--- Training GAT ---
Seed 42 — Val: 0.8797, Test: 0.8365, Time: 1.91s
Seed 43 — Val: 0.8861, Test: 0.8742, Time: 1.99s
Seed 44 — Val: 0.8671, Test: 0.8868, Time: 1.96s
Seed 45 — Val: 0.8987, Test: 0.8616, Time: 1.94s
Seed 46 — Val: 0.8608, Test: 0.9057, Time: 1.97s

 Training complete. Results and plots saved to results/
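
Since each model is trained with five seeds, it helps to summarize the per-seed test accuracies as a mean and standard deviation before comparing architectures. The sketch below does this with the standard library, using the test values copied from the log above; the dictionary layout is an assumption made for this example and is not how `train_gnn.py` stores its results.

```python
from statistics import mean, stdev

# Test accuracies copied from the training log above (seeds 42 to 46).
test_acc = {
    "GCN":       [0.8113, 0.8491, 0.8742, 0.8428, 0.8994],
    "GraphSAGE": [0.8491, 0.8553, 0.8931, 0.8365, 0.8994],
    "GAT":       [0.8365, 0.8742, 0.8868, 0.8616, 0.9057],
}

# Sample standard deviation (n - 1 denominator) across the five seeds.
summary = {name: (mean(vals), stdev(vals)) for name, vals in test_acc.items()}
for name, (mu, sigma) in summary.items():
    print(f"{name:<10} mean={mu:.4f}  std={sigma:.4f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two architectures is larger than the variation caused by the seed alone.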


## 6. Visualize Embeddings

In [52]:
# For GCN
!python ../src/tsne_embeddings.py --model GCN

=== Running t-SNE for model: GCN ===
Loaded graph with 1056 nodes, 1268 edges.
Node feature dim = 384, num_classes = 2
Using checkpoint: checkpoints/GCN_seed42.pt
Detected hidden_dim = 16 from 'conv1.lin.weight'
Running t-SNE...
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

Saved t-SNE plot to: results/GCN_tsne.png
=== Done ===
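
The graph-building step reports 634 edges, while the t-SNE script reports 1268, exactly twice as many. A likely explanation, stated here as an assumption rather than something the logs confirm, is that each undirected edge is stored in both directions when the graph is converted to a PyG-style `edge_index` of shape `[2, num_edges]`. The NumPy sketch below illustrates that convention; `to_bidirectional_edge_index` is a hypothetical helper, not a function from the project or from PyTorch Geometric.

```python
import numpy as np

def to_bidirectional_edge_index(undirected_edges):
    """Stack each undirected edge (u, v) as both u->v and v->u in a 2 x 2E array."""
    edges = np.asarray(undirected_edges, dtype=np.int64).reshape(-1, 2)
    forward = edges.T             # shape (2, E): sources on row 0, targets on row 1
    backward = edges[:, ::-1].T   # the same edges with direction flipped
    return np.concatenate([forward, backward], axis=1)

undirected = [(0, 1), (1, 2), (0, 3)]
edge_index = to_bidirectional_edge_index(undirected)
print(edge_index.shape)
```

Under this convention, an undirected graph with E edges yields an `edge_index` with 2E columns, which would be consistent with 634 becoming 1268.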


In [53]:
# For GraphSAGE
!python ../src/tsne_embeddings.py --model GraphSAGE

=== Running t-SNE for model: GraphSAGE ===
Loaded graph with 1056 nodes, 1268 edges.
Node feature dim = 384, num_classes = 2
Using checkpoint: checkpoints/GraphSAGE_seed42.pt
Detected hidden_dim = 16 from 'conv1.lin_l.weight'
Running t-SNE...

Saved t-SNE plot to: results/GraphSAGE_tsne.png
=== Done ===


In [54]:
# For GAT
!python ../src/tsne_embeddings.py --model GAT

=== Running t-SNE for model: GAT ===
Loaded graph with 1056 nodes, 1268 edges.
Node feature dim = 384, num_classes = 2
Using checkpoint: checkpoints/GAT_seed42.pt
Detected GAT hidden_dim = 16 from 'conv1.att_src'
Running t-SNE...

Saved t-SNE plot to: results/GAT_tsne.png
=== Done ===


## 7. Load and Compare Results

In [57]:
import glob
import json
import os  # needed for os.path.basename below

import pandas as pd

# Load baselines
with open('results/baselines.json') as f:
    baselines = json.load(f)

# Load all GNN report json files
gnn_files = glob.glob("results/*_report.json")

gnn_reports = []
for path in gnn_files:
    with open(path) as f:
        report = json.load(f)
        report["model"] = os.path.basename(path).replace("_report.json", "")
        gnn_reports.append(report)

baselines_df = pd.DataFrame(baselines)
gnn_df = pd.DataFrame(gnn_reports)

baselines_df, gnn_df

(          LogisticRegression     SVM
 accuracy              0.7925  0.8302
 f1_score              0.7273  0.7970,
                                                     0  \
 0   {'precision': 0.9029126213592233, 'recall': 0....   
 1   {'precision': 0.9090909090909091, 'recall': 0....   
 2   {'precision': 0.9117647058823529, 'recall': 0....   
 3   {'precision': 0.8165137614678899, 'recall': 0....   
 4   {'precision': 0.865979381443299, 'recall': 0.8...   
 5   {'precision': 0.7981651376146789, 'recall': 0....   
 6   {'precision': 0.8556701030927835, 'recall': 0....   
 7   {'precision': 0.85, 'recall': 0.89473684210526...   
 8   {'precision': 0.8476190476190476, 'recall': 0....   
 9   {'precision': 0.9150943396226415, 'recall': 0....   
 10  {'precision': 0.9065420560747663, 'recall': 0....   
 11  {'precision': 0.9142857142857143, 'recall': 0....   
 12  {'precision': 0.9186046511627907, 'recall': 0....   
 13  {'precision': 0.9294117647058824, 'recall': 0....   
 14  {'precisio