# Data Preprocessing Notebook

This notebook shows how to download and preprocess the datasets used for friend recommendation with graph neural networks (GNNs).

## Datasets:
1. **SNAP Facebook Social Circles**: Small-scale ego-network dataset
2. **OGB Link Prediction** (`ogbl-collab`): Large-scale collaboration network (optional)
3. **Synthetic Dataset**: For quick testing

## Steps:
1. Download datasets
2. Preprocess graphs (node features, edge information)
3. Compute heuristics (common neighbors, Jaccard, Adamic-Adar, preferential attachment)
4. Prepare link prediction splits (train/val/test)
5. Save processed data
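
Step 4 can be sketched in plain Python as follows. This is a minimal illustration, not the implementation of `prepare_link_prediction_data`: the function name `split_edges` and the layout of the returned dictionary are hypothetical, and the defaults simply mirror the `train_ratio`, `val_ratio`, `neg_ratio`, and `seed` values used later in this notebook. It assumes negatives are node pairs that are not edges anywhere in the graph.

```python
import random


def split_edges(edges, num_nodes, train_ratio=0.7, val_ratio=0.15,
                neg_ratio=1.0, seed=42):
    """Split undirected edges into train/val/test and sample negatives (hypothetical helper)."""
    rng = random.Random(seed)

    # Canonicalise each undirected edge as (min, max), dropping self-loops and duplicates.
    pos = sorted({(min(u, v), max(u, v)) for u, v in edges if u != v})
    rng.shuffle(pos)

    n_train = int(len(pos) * train_ratio)
    n_val = int(len(pos) * val_ratio)
    splits = {
        "train_pos": pos[:n_train],
        "val_pos": pos[n_train:n_train + n_val],
        "test_pos": pos[n_train + n_val:],  # whatever remains goes to test
    }

    # Sample negatives: node pairs that are not edges and were not sampled before.
    existing = set(pos)
    used = set()
    for name in ("train", "val", "test"):
        target = int(len(splits[f"{name}_pos"]) * neg_ratio)
        negatives = []
        while len(negatives) < target:
            u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
            pair = (min(u, v), max(u, v))
            if u != v and pair not in existing and pair not in used:
                used.add(pair)
                negatives.append(pair)
        splits[f"{name}_neg"] = negatives
    return splits


# Toy example: a ring of 20 nodes (20 undirected edges).
ring = [(i, (i + 1) % 20) for i in range(20)]
splits = split_edges(ring, num_nodes=20)
print({name: len(pairs) for name, pairs in splits.items()})
```

Rejection sampling like this is only practical for sparse graphs, where most random node pairs are non-edges.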


In [None]:
import sys
import os
# Add the parent of the current working directory to the import path so `src` can be imported
sys.path.append(os.path.dirname(os.getcwd()))

import torch
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from src.data.facebook_loader import FacebookDatasetLoader, create_synthetic_dataset
from src.data.ogb_loader import OGBDatasetLoader
from src.data.preprocessing import prepare_link_prediction_data
from src.data.heuristics import compute_heuristics

print("Imports successful!")


## 1. SNAP Facebook Dataset

Let's download and process the SNAP Facebook dataset. It is small and made up of ego networks, which makes it a good starting point.


In [None]:
# Initialize Facebook loader
facebook_loader = FacebookDatasetLoader(data_dir="data/raw/facebook")

# Download dataset (this may take a few minutes)
# Uncomment to download:
# facebook_loader.download()


In [None]:
# Process Facebook dataset
# Combine multiple ego networks (using first 10 for faster processing)
print("Processing Facebook dataset...")
data = facebook_loader.process_and_save(max_egos=10)

# Dividing by 2 assumes edge_index stores each undirected edge in both directions
print(f"Processed graph: {data.num_nodes} nodes, {data.edge_index.size(1) // 2} edges")
print(f"Node features: {data.x.shape}")
print(f"Feature dimension: {data.x.size(1)}")


In [None]:
# Prepare link prediction data
# Splits edges into train/val/test (70% / 15% / remainder) and generates negative samples
# neg_ratio=1.0 requests as many negatives as positives
print("Preparing link prediction data...")
link_data = prepare_link_prediction_data(data, train_ratio=0.7, val_ratio=0.15, neg_ratio=1.0, seed=42)

print(f"Train edges: {link_data['train_edges'].size(1)}")
print(f"Val edges: {link_data['val_edges'].size(1)}")
print(f"Test edges: {link_data['test_edges'].size(1)}")
print(f"Train positive: {link_data['train_pos'].size(1)}, negative: {link_data['train_neg'].size(1)}")
print(f"Val positive: {link_data['val_pos'].size(1)}, negative: {link_data['val_neg'].size(1)}")
print(f"Test positive: {link_data['test_pos'].size(1)}, negative: {link_data['test_neg'].size(1)}")


In [None]:
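# Save link prediction data (create the output directory if it does not exist yet)
os.makedirs("data/processed", exist_ok=True)
torch.save(link_data, "data/processed/facebook_link_data.pt")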
print("Saved link prediction data to data/processed/facebook_link_data.pt")


## 2. Compute Heuristics

Compute baseline heuristics for comparison with GNN models.
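
For reference, the four heuristics requested below have simple closed forms. The sketch here computes them for a single node pair in plain Python. It is an illustration only: `build_adjacency` and `pair_heuristics` are hypothetical helpers, not the functions in `src.data.heuristics`, and the project's `compute_heuristics` may batch or normalise scores differently.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} (hypothetical helper)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def pair_heuristics(adj, u, v):
    """Score the candidate link (u, v), u != v, with four classic neighborhood heuristics."""
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    return {
        # Number of shared neighbors
        "common_neighbors": len(common),
        # Shared neighbors as a fraction of all neighbors of either node
        "jaccard": len(common) / len(union) if union else 0.0,
        # Shared neighbors weighted by 1 / log(degree); a common neighbor of two
        # distinct nodes has degree >= 2, so the logarithm is always positive
        "adamic_adar": sum(1.0 / math.log(len(adj[w])) for w in common),
        # Product of the two degrees
        "preferential_attachment": len(nu) * len(nv),
    }


# Toy graph: nodes 0 and 3 are not connected but share neighbors 1 and 2.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
adj = build_adjacency(edges)
scores = pair_heuristics(adj, 0, 3)
print(scores)
```

A higher score on any of these heuristics is read as a higher likelihood that the two nodes are or will become connected, which is why they serve as baselines for the GNN models.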


In [None]:
# Compute heuristics
# Note: This can be slow for large graphs
print("Computing heuristics...")
heuristics = compute_heuristics(
    data,
    methods=['common_neighbors', 'jaccard', 'adamic_adar', 'preferential_attachment'],
)

# Save heuristics
torch.save(heuristics, "data/processed/facebook_heuristics.pt")
print("Saved heuristics to data/processed/facebook_heuristics.pt")
print(f"Computed {len(heuristics)} heuristics: {list(heuristics.keys())}")


## 3. Synthetic Dataset (for quick testing)

Create a small synthetic dataset for quick experimentation.
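
As a rough picture of what such a dataset contains, the sketch below builds a random undirected graph with Gaussian node features in plain Python. It is a stand-in under stated assumptions, not the code behind `create_synthetic_dataset`: the helper name `make_synthetic_graph`, the dictionary layout, and the choice to store every edge in both directions (which is what makes `edge_index.size(1) // 2` the undirected edge count) are all assumptions.

```python
import random


def make_synthetic_graph(num_nodes=100, num_edges=200, feature_dim=16, seed=0):
    """Sample a random undirected graph with Gaussian node features (hypothetical helper)."""
    max_edges = num_nodes * (num_nodes - 1) // 2
    if num_edges > max_edges:
        raise ValueError(f"at most {max_edges} undirected edges fit on {num_nodes} nodes")

    rng = random.Random(seed)

    # Draw distinct undirected edges without self-loops.
    undirected = set()
    while len(undirected) < num_edges:
        u, v = rng.sample(range(num_nodes), 2)  # two different nodes
        undirected.add((min(u, v), max(u, v)))

    # Store each edge in both directions as [sources, targets].
    pairs = sorted(undirected)
    sources = [u for u, v in pairs] + [v for u, v in pairs]
    targets = [v for u, v in pairs] + [u for u, v in pairs]

    # One feature vector per node, drawn from a standard normal distribution.
    features = [[rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
                for _ in range(num_nodes)]

    return {"num_nodes": num_nodes, "edge_index": [sources, targets], "x": features}


graph = make_synthetic_graph()
num_undirected = len(graph["edge_index"][0]) // 2
print(f"{graph['num_nodes']} nodes, {num_undirected} undirected edges")
```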


In [None]:
# Create synthetic dataset
print("Creating synthetic dataset...")
synthetic_data = create_synthetic_dataset(num_nodes=100, num_edges=200, feature_dim=16)

print(f"Synthetic graph: {synthetic_data.num_nodes} nodes, {synthetic_data.edge_index.size(1) // 2} edges")
print(f"Node features: {synthetic_data.x.shape}")

# Save synthetic data
torch.save(synthetic_data, "data/processed/synthetic.pt")

# Prepare link prediction data
synthetic_link_data = prepare_link_prediction_data(synthetic_data, seed=42)
torch.save(synthetic_link_data, "data/processed/synthetic_link_data.pt")

print("Saved synthetic dataset and link prediction data")


## 4. OGB Dataset (Optional - Large Dataset)

For large-scale experiments, we can use the OGB `ogbl-collab` dataset. It is much larger than the Facebook dataset, so downloading and processing it may take longer.


In [None]:
# Load OGB dataset (ogbl-collab)
# Uncomment to download and process:
# ogb_loader = OGBDatasetLoader("ogbl-collab", root="data/raw/ogb")
# data_ogb, splits_ogb = ogb_loader.load()
# print(f"OGB Dataset: {data_ogb.num_nodes} nodes, {data_ogb.edge_index.size(1) // 2} edges")
# print(f"Train edges: {splits_ogb['train']['edge'].shape[0]}")
# print(f"Val edges: {splits_ogb['valid']['edge'].shape[0]}")
# print(f"Test edges: {splits_ogb['test']['edge'].shape[0]}")

# Save OGB data
# torch.save(data_ogb, "data/processed/ogbl_collab.pt")
# torch.save(splits_ogb, "data/processed/ogbl_collab_splits.pt")
# print("Saved OGB dataset")


## Summary

- **Facebook dataset**: Processed and saved to `data/processed/facebook_combined.pt`
- **Link prediction data**: Saved to `data/processed/facebook_link_data.pt`
- **Heuristics**: Computed and saved to `data/processed/facebook_heuristics.pt`
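- **Synthetic dataset**: Created for quick testing and saved to `data/processed/synthetic.pt`, with link prediction data in `data/processed/synthetic_link_data.pt`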

Next steps:
1. Run baseline models (see `baselines.ipynb`)
2. Train GNN models (see `training_graphsage_gat.ipynb` and `training_seal.ipynb`)
3. Evaluate models (see `evaluation_and_ablation.ipynb`)
