In [None]:
!git clone https://github.com/icedarold/through-pages.git
%cd through-pages


# SOTA Recommendation Pipeline: From Raw Data to Submission

This notebook runs the full pipeline:
1. **Preprocessing**: Clean interactions and items.
2. **Content Encoding (Phase 2)**: Multilingual embeddings.
3. **Multi-Interest (Phase 1)**: Train a Transformer to extract 6 interest vectors.
4. **Feature Engineering (Phase 3)**: Generate rich features for the 200 candidates per user.
5. **Reranking & Submission (Phase 4)**: Select the top 20.

In [None]:
# 1. Install dependencies
!pip install -q sentence-transformers pyarrow fastparquet tqdm lightgbm

In [None]:
# 2. Set up directories (paths updated for the Kaggle environment)
import os
OUT_DIR = '/kaggle/working/through-pages/experiments/data_v1'
os.makedirs(OUT_DIR, exist_ok=True)
os.makedirs('/kaggle/working/through-pages/experiments/models_v1', exist_ok=True)

print("Using Kaggle Input datasets: /kaggle/input/though-pages")

### Step 3: Global Preprocessing

In [None]:
!python3 src/preprocess.py
!python3 src/data/items.py
!python3 src/data/sequences.py


### Step 4: Phase 2 - Item Content Embeddings
Generates a 768-dimensional content embedding for every book.

In [None]:
# Encodes 134k items (takes about 30 minutes on a P100 GPU)
!python3 src/models/item_encoder.py --batch_size 128

### Step 5: Phase 1 - Multi-Interest Learning
Trains the user encoder to represent multi-faceted interests as 6 interest vectors.

In [None]:
!python3 src/data/enrich_items.py
!python3 src/train.py --epochs 10 --batch-size 256
!python3 src/inference_user.py
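
One common way to use several interest vectors for retrieval is to score each item by its best match across the interests and keep the highest-scoring items. The sketch below illustrates that idea with NumPy on random data. It is not taken from `src/inference_user.py`: the function name `top_candidates`, the max-over-interests scoring rule, and the assumption that interest vectors share the 768-D item space are all illustrative.

```python
import numpy as np

def top_candidates(user_interests, item_embeddings, k=200):
    """Return indices and scores of the k best items for one user.

    user_interests:  (n_interests, dim) array, e.g. 6 x 768 (assumed shape).
    item_embeddings: (n_items, dim) array.
    """
    # Similarity of every interest vector to every item: (n_interests, n_items).
    sims = user_interests @ item_embeddings.T
    # An item's score is its best match across the interest vectors.
    scores = sims.max(axis=0)
    k = min(k, scores.shape[0])
    # argpartition isolates the top k without a full sort;
    # the k survivors are then sorted by descending score.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

# Toy data standing in for real embeddings.
rng = np.random.default_rng(0)
interests = rng.normal(size=(6, 768))
items = rng.normal(size=(1000, 768))

idx, sc = top_candidates(interests, items, k=200)
print(idx.shape, sc[:3])
```

Taking the maximum rather than the mean lets a single strong interest surface an item even when the user's other interests are unrelated to it.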

### Step 6: Phase 3 - SOTA Feature Factory
Extracts series gaps, author affinities, and interaction statistics for the 200 candidates per user.

In [None]:
!python3 src/data/feature_factory.py --exp-dir $OUT_DIR --mode infer
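
As an illustration of what an author-affinity feature could look like, the sketch below computes the share of a user's past interactions that went to each candidate's author. It is a minimal example under stated assumptions, not the logic in `src/data/feature_factory.py`: the column names (`user_id`, `item_id`, `author_id`) and the definition of affinity as an interaction share are hypothetical.

```python
import pandas as pd

# Toy data; all column names are hypothetical.
interactions = pd.DataFrame({
    "user_id":   [1, 1, 1, 2],
    "author_id": ["a", "a", "b", "b"],
})
candidates = pd.DataFrame({
    "user_id":   [1, 1, 2],
    "item_id":   [10, 11, 12],
    "author_id": ["a", "c", "b"],
})

# Count past interactions per (user, author) pair.
counts = (
    interactions.groupby(["user_id", "author_id"])
    .size()
    .rename("author_count")
    .reset_index()
)
# Normalise by the user's total so the feature is a share in [0, 1].
counts["author_affinity"] = (
    counts["author_count"]
    / counts.groupby("user_id")["author_count"].transform("sum")
)

# A left join keeps every candidate; authors the user never read get 0.
features = candidates.merge(
    counts[["user_id", "author_id", "author_affinity"]],
    on=["user_id", "author_id"],
    how="left",
).fillna({"author_affinity": 0.0})

print(features)
```

The left join matters here: an inner join would silently drop candidates by unseen authors, and the reranker would then see fewer than 200 rows for some users.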

### Step 7: Phase 4 - Reranking & Submission
Reranks the candidates and selects the final top 20.

In [None]:
!python3 src/submit.py --features $OUT_DIR/features_infer.parquet --output submission.csv

# Sanity check: reload the submission and inspect it
import pandas as pd
sub = pd.read_csv('submission.csv')
print("Submission generated successfully!")
print(sub.head())
print(f"Total rows: {len(sub)}")
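
The final selection step can be pictured as keeping the 20 highest-scoring candidates for each user. The sketch below shows that with pandas on random scores. It is a sketch under assumptions, not the contents of `src/submit.py`: the function `select_top_k`, the column names (`user_id`, `item_id`, `score`), and the `rank` column are hypothetical, and the scores would come from the reranking model in practice.

```python
import numpy as np
import pandas as pd

def select_top_k(scored, k=20):
    """Keep the k highest-scoring rows per user and add a 1-based rank."""
    # Sort so that each user's rows appear in descending score order.
    ranked = scored.sort_values(["user_id", "score"], ascending=[True, False])
    # head(k) within each group keeps the first k rows of that order.
    top = ranked.groupby("user_id", sort=False).head(k).copy()
    top["rank"] = top.groupby("user_id").cumcount() + 1
    return top.reset_index(drop=True)

# Toy input: 2 users with 200 scored candidates each.
rng = np.random.default_rng(0)
scored = pd.DataFrame({
    "user_id": np.repeat([1, 2], 200),
    "item_id": np.tile(np.arange(200), 2),
    "score": rng.random(400),
})

top20 = select_top_k(scored)
print(top20.groupby("user_id").size())
```

A similar per-user count on the real `submission.csv` is a cheap way to confirm that no user ended up with more than 20 rows.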