# MovieLens Adapter Guide

This notebook walks through how `MovieLensAdapter` loads ML-100K data,
engineers features, splits by leave-one-out, and produces `TabularDataset`s.

## 1. Building the Adapter

The adapter takes a `DataConfig` (here `config.data`). Calling `.build()` runs the full pipeline:
load → merge → split → fit encoders → transform.

In [7]:
import importlib

import deepfm.data.movielens as mv
importlib.reload(mv)

from deepfm.config import ExperimentConfig
from deepfm.data.movielens import MovieLensAdapter

config = ExperimentConfig()
adapter = MovieLensAdapter(config.data)
schema, train_ds, val_ds, test_ds = adapter.build()

print(f"Train: {len(train_ds):,} samples")
print(f"Val:   {len(val_ds):,} samples")
print(f"Test:  {len(test_ds):,} samples")

Train: 98,114 samples
Val:   943 samples
Test:  943 samples


## 2. Understanding the Schema

The schema describes every feature — its type, vocabulary size, embedding dim, and group.
All downstream modules (embedding layer, models) are built from this schema.

In [8]:
import pandas as pd

rows = []
for name, field in schema.fields.items():
    rows.append({
        "name": name,
        "type": field.feature_type.value,
        "vocab_size": field.vocabulary_size,
        "embed_dim": field.embedding_dim,
        "group": field.group,
        "max_length": field.max_length if field.feature_type.value == "sequence" else "-",
    })

pd.DataFrame(rows)

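|   | name | type | vocab_size | embed_dim | group | max_length |
|---|------|------|-----------:|----------:|-------|------------|
| 0 | user_id | sparse | 944 | 16 | user | - |
| 1 | movie_id | sparse | 1679 | 16 | item | - |
| 2 | gender | sparse | 3 | 4 | user | - |
| 3 | age | sparse | 8 | 4 | user | - |
| 4 | occupation | sparse | 22 | 8 | user | - |
| 5 | zip_prefix | sparse | 383 | 8 | user | - |
| 6 | genres | sequence | 20 | 8 | item | 6 |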


In [None]:
print(f"Total fields:         {schema.num_fields}")
print(f"Sparse fields:        {len(schema.sparse_fields)}")
print(f"Sequence fields:      {len(schema.sequence_fields)}")
print(f"Dense fields:         {len(schema.dense_fields)}")
print(f"Total embedding dim:  {schema.total_embedding_dim}")
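
To make the role of the schema concrete, here is a minimal sketch of how an embedding layer might size its tables from the fields listed in the table above. `FieldSpec` and `embedding_table_shapes` are hypothetical names used only for illustration, not part of the `deepfm` package. The sketch also assumes that the total embedding dimension is the sum of the per-field dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical stand-in for one schema field (not the deepfm class)."""
    name: str
    vocabulary_size: int
    embedding_dim: int


# Values copied from the schema table above.
FIELDS = [
    FieldSpec("user_id", 944, 16),
    FieldSpec("movie_id", 1679, 16),
    FieldSpec("gender", 3, 4),
    FieldSpec("age", 8, 4),
    FieldSpec("occupation", 22, 8),
    FieldSpec("zip_prefix", 383, 8),
    FieldSpec("genres", 20, 8),
]


def embedding_table_shapes(fields):
    """Return {field name: (rows, cols)} for one embedding table per field."""
    return {f.name: (f.vocabulary_size, f.embedding_dim) for f in fields}


shapes = embedding_table_shapes(FIELDS)
# Assumption: the concatenated embedding width is the sum of per-field dims.
total_dim = sum(cols for _, cols in shapes.values())
print(shapes["user_id"], total_dim)  # (944, 16) 64
```

Each table needs `vocabulary_size` rows because index 0 is reserved, so the largest valid index is `vocabulary_size - 1`.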

## 3. Inspecting Samples

Each sample from the dataset is a `(feature_dict, label)` tuple.
Integer features become `torch.long` tensors; the label is a `torch.float32` tensor.

In [9]:
features, label = train_ds[0]

print(f"Label: {label.item()}\n")
print("Features:")
for name, tensor in features.items():
    print(f"  {name:15s}  shape={str(tensor.shape):10s}  dtype={tensor.dtype}  value={tensor}")

Label: 1.0

Features:
  user_id          shape=torch.Size([])  dtype=torch.int64  value=1
  movie_id         shape=torch.Size([])  dtype=torch.int64  value=168
  gender           shape=torch.Size([])  dtype=torch.int64  value=2
  age              shape=torch.Size([])  dtype=torch.int64  value=2
  occupation       shape=torch.Size([])  dtype=torch.int64  value=20
  zip_prefix       shape=torch.Size([])  dtype=torch.int64  value=300
  genres           shape=torch.Size([6])  dtype=torch.int64  value=tensor([5, 0, 0, 0, 0, 0])


## 4. How Encoders Work

Encoders are fitted on the **training split only**. Unknown values in val/test map to index 0 (OOV).
Let's peek at a few encoders.

In [10]:
# Gender encoder
gender_enc = adapter._encoders["gender"]
print("Gender mapping (OOV=0):")
print(f"  {gender_enc._mapping}")
print(f"  vocab_size = {gender_enc.vocabulary_size}")
print()

# Age encoder (bucketed)
age_enc = adapter._encoders["age"]
print("Age bucket mapping:")
print(f"  {age_enc._mapping}")
print(f"  vocab_size = {age_enc.vocabulary_size}")
print()

# Genres encoder
genre_enc = adapter._encoders["genres"]
print(f"Genres vocab_size = {genre_enc.vocabulary_size}")
print(f"Genres mapping (first 5): {dict(list(genre_enc._mapping.items())[:5])}")

Gender mapping (OOV=0):
  {'F': 1, 'M': 2}
  vocab_size = 3

Age bucket mapping:
  {1: 1, 18: 2, 25: 3, 35: 4, 45: 5, 50: 6, 56: 7}
  vocab_size = 8

Genres vocab_size = 20
Genres mapping (first 5): {'Action': 1, 'Adventure': 2, 'Animation': 3, "Children's": 4, 'Comedy': 5}
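
The behaviour described in this section (fit on the training split only, index 0 reserved for unseen values) can be sketched in plain Python. `SimpleLabelEncoder` and `encode_padded` are hypothetical and are not the adapter's actual classes. Sorted vocabulary order and right-padding with 0 are assumptions inferred from the outputs above.

```python
class SimpleLabelEncoder:
    """Hypothetical label encoder: index 0 is reserved for unseen (OOV) values."""

    def fit(self, values):
        # Sort the distinct training values so the mapping is deterministic.
        self.mapping = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
        return self

    @property
    def vocabulary_size(self):
        return len(self.mapping) + 1  # +1 for the OOV slot

    def transform(self, values):
        # Anything not seen during fit falls back to index 0.
        return [self.mapping.get(v, 0) for v in values]


def encode_padded(tokens, encoder, max_length):
    """Encode a variable-length token list, truncate, then right-pad with 0."""
    ids = encoder.transform(tokens)[:max_length]
    return ids + [0] * (max_length - len(ids))


gender = SimpleLabelEncoder().fit(["M", "F", "M", "F"])
print(gender.mapping, gender.vocabulary_size)  # {'F': 1, 'M': 2} 3
print(gender.transform(["F", "X"]))            # [1, 0]  ("X" was never seen)

genres = SimpleLabelEncoder().fit(["Action", "Adventure", "Comedy"])
print(encode_padded(["Comedy"], genres, max_length=6))  # [3, 0, 0, 0, 0, 0]
```

In this sketch index 0 serves as both padding and OOV for sequence features, so a downstream model would likely need to mask it.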


## 5. Leave-One-Out Split Explained

For each user with >= 3 interactions (ordered by timestamp):
- **Last** interaction → test
- **Second-to-last** → validation
- **All remaining** → training

Users with < 3 interactions go entirely to training.

In [11]:
import numpy as np

# Val and test should each have exactly 1 sample per eligible user
# (943 users in ML-100K all have >= 3 interactions)
print(f"Val samples:  {len(val_ds)} (one per eligible user)")
print(f"Test samples: {len(test_ds)} (one per eligible user)")

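# Verify no overlap: for one user, the held-out movies must not appear in training
uid = train_ds.features["user_id"][0]  # encoded user_id of the first training row
train_movies = set(train_ds.features["movie_id"][
    train_ds.features["user_id"] == uid
].tolist())
# Look up the same user's held-out rows instead of assuming row 0 belongs to them
val_movie = val_ds.features["movie_id"][val_ds.features["user_id"] == uid][0]
test_movie = test_ds.features["movie_id"][test_ds.features["user_id"] == uid][0]
assert int(val_movie) not in train_movies
assert int(test_movie) not in train_movies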

print(f"\nUser 0 has {len(train_movies)} training movies")
print(f"  Val movie:  {val_movie}")
print(f"  Test movie: {test_movie}")

Val samples:  943 (one per eligible user)
Test samples: 943 (one per eligible user)

User 0 has 270 training movies
  Val movie:  74
  Test movie: 102
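
The split rule described at the start of this section can be sketched in plain Python. `leave_one_out_split` below is a hypothetical helper operating on `(user, item, timestamp)` tuples. It illustrates the rule only and is not the adapter's `_leave_one_out_split()` implementation, which works on DataFrames.

```python
from collections import defaultdict


def leave_one_out_split(interactions, min_interactions=3):
    """Split (user, item, timestamp) tuples into train/val/test lists."""
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)

    train, val, test = [], [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[2])  # oldest first
        if len(rows) < min_interactions:
            train.extend(rows)  # too few interactions: keep everything in train
            continue
        train.extend(rows[:-2])
        val.append(rows[-2])   # second-to-last interaction
        test.append(rows[-1])  # most recent interaction
    return train, val, test


toy = [
    ("u1", "a", 3), ("u1", "b", 1), ("u1", "c", 2), ("u1", "d", 4),
    ("u2", "a", 1), ("u2", "b", 2),  # only two interactions
]
train, val, test = leave_one_out_split(toy)
print(val, test)   # [('u1', 'a', 3)] [('u1', 'd', 4)]
print(len(train))  # 4
```

Because every eligible user contributes exactly one row to each held-out split, the validation and test sizes equal the number of eligible users.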


## 6. Using with DataLoader

The `TabularDataset` works directly with PyTorch's `DataLoader`.

In [12]:
from torch.utils.data import DataLoader

loader = DataLoader(train_ds, batch_size=8, shuffle=True)
batch_features, batch_labels = next(iter(loader))

print("Batch shapes:")
for name, tensor in batch_features.items():
    print(f"  {name:15s}  {tensor.shape}")
print(f"  {'labels':15s}  {batch_labels.shape}")

Batch shapes:
  user_id          torch.Size([8])
  movie_id         torch.Size([8])
  gender           torch.Size([8])
  age              torch.Size([8])
  occupation       torch.Size([8])
  zip_prefix       torch.Size([8])
  genres           torch.Size([8, 6])
  labels           torch.Size([8])


## 7. Label Distribution

In [13]:
train_labels = train_ds.labels
pos = (train_labels == 1).sum()
neg = (train_labels == 0).sum()
total = len(train_labels)

print("Training label distribution:")
print(f"  Positive (rating >= {config.data.label_threshold}): {pos:,} ({pos/total:.1%})")
print(f"  Negative (rating <  {config.data.label_threshold}): {neg:,} ({neg/total:.1%})")
print(f"  Total: {total:,}")

Training label distribution:
  Positive (rating >= 4.0): 54,396 (55.4%)
  Negative (rating <  4.0): 43,718 (44.6%)
  Total: 98,114
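
The labels counted above come from thresholding explicit ratings. Here is a minimal sketch, assuming a rating at or above `label_threshold` counts as positive, as the printed legend suggests. `binarize` is a hypothetical helper, not a function from the adapter.

```python
def binarize(ratings, threshold=4.0):
    """Map explicit ratings to 1.0 (rating >= threshold) or 0.0 (otherwise)."""
    return [1.0 if r >= threshold else 0.0 for r in ratings]


labels = binarize([5, 4, 3, 1, 4])
positive_share = sum(labels) / len(labels)
print(labels, f"{positive_share:.1%}")  # [1.0, 1.0, 0.0, 0.0, 1.0] 60.0%
```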


## 8. User-Item Interaction Map

The adapter tracks which items each user has interacted with.
This will be used by negative sampling (step 2.3) to avoid sampling items the user has already seen.

In [14]:
user_items = adapter.user_items

interaction_counts = [len(items) for items in user_items.values()]
print(f"Total users tracked: {len(user_items)}")
print("Interactions per user:")
print(f"  Min:    {min(interaction_counts)}")
print(f"  Max:    {max(interaction_counts)}")
print(f"  Mean:   {np.mean(interaction_counts):.1f}")
print(f"  Median: {np.median(interaction_counts):.1f}")

Total users tracked: 943
Interactions per user:
  Min:    20
  Max:    737
  Mean:   106.0
  Median: 65.0
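
To illustrate why this map matters, the sketch below draws negatives for a user while excluding everything that user has already interacted with. `sample_negatives` is a hypothetical helper, and the sketch assumes `user_items` maps each user to a set of item ids. The project's actual sampler (step 2.3) may work differently.

```python
import random


def sample_negatives(user, user_items, all_items, k, rng):
    """Draw up to k distinct items the user has not interacted with."""
    seen = user_items.get(user, set())
    candidates = sorted(set(all_items) - seen)  # sorted for reproducibility
    return rng.sample(candidates, min(k, len(candidates)))


# Toy data standing in for adapter.user_items and the item catalogue.
user_items = {1: {10, 11, 12}, 2: {10}}
all_items = range(10, 20)
rng = random.Random(0)  # seeded so repeated runs draw the same negatives

negatives = sample_negatives(1, user_items, all_items, k=4, rng=rng)
print(negatives)
```

Building the full candidate list is fine for a catalogue of ML-100K's size. For much larger catalogues, rejection sampling against the `seen` set would avoid materialising it.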


## Summary

The `TabularDataset` is intentionally simple: it just converts NumPy arrays to tensors.
All feature engineering (encoding, splitting, and eventually negative sampling) happens upstream in the adapter.

```
Raw Data (u.data, u.user, u.item)
    |
    v
MovieLensAdapter  -->  fits LabelEncoder / MultiHotEncoder on train split
    |
    v
TabularDataset(features_dict, labels)  <-- one per split (train/val/test)
    |
    v
DataLoader(dataset, batch_size=4096, shuffle=True)
    |
    v
Model.forward(batch_features)  -->  logits
```

Inside `MovieLensAdapter`:

```
MovieLensAdapter(DataConfig)
    │
    ├── _load_and_merge()     Read u.data + u.user + u.item → merged DataFrame
    ├── _leave_one_out_split()  Per-user temporal split → train/val/test DataFrames
    ├── _fit_encoders()       Fit LabelEncoder (×6) + MultiHotEncoder (genres) on train
    ├── _build_schema()       Create DatasetSchema with vocab sizes from encoders
    └── _transform()          Apply encoders → TabularDataset
          │
          ▼
    (DatasetSchema, train_ds, val_ds, test_ds)
```

Next step: **Negative sampling** (step 2.3) adds synthetic negative examples to each split.