# FeatureEmbedding Guide

This notebook explores how `FeatureEmbedding` converts raw feature indices from the MovieLens dataset
into the three tensor views consumed by all CTR models.

## 1. Load MovieLens Data & Schema

In [1]:
from deepfm.config import ExperimentConfig
from deepfm.data.movielens import MovieLensAdapter

config = ExperimentConfig()
adapter = MovieLensAdapter(config.data)
schema, train_ds, val_ds, test_ds = adapter.build()

print(f"Schema has {schema.num_fields} fields, total_embedding_dim = {schema.total_embedding_dim}")

Schema has 7 fields, total_embedding_dim = 64


## 2. Examine Per-Field Embedding Dimensions

Each field has its own `embedding_dim` based on cardinality.
The FM/CIN/Attention components need a **common dimension**, so each field
gets a projection layer when its `embedding_dim != fm_embed_dim`.

In [3]:
import pandas as pd

fm_embed_dim = config.feature.fm_embed_dim
print(f"fm_embed_dim = {fm_embed_dim} (common dimension for FM/CIN/Attention)\n")

rows = []
for name, field in schema.fields.items():
    rows.append({
        "field": name,
        "type": field.feature_type.value,
        "vocab_size": field.vocabulary_size,
        "embed_dim": field.embedding_dim,
        "needs_projection": field.embedding_dim != fm_embed_dim,
    })

pd.DataFrame(rows)

fm_embed_dim = 16 (common dimension for FM/CIN/Attention)



| | field | type | vocab_size | embed_dim | needs_projection |
|---|---|---|---|---|---|
| 0 | user_id | sparse | 944 | 16 | False |
| 1 | movie_id | sparse | 1679 | 16 | False |
| 2 | gender | sparse | 3 | 4 | True |
| 3 | age | sparse | 8 | 4 | True |
| 4 | occupation | sparse | 22 | 8 | True |
| 5 | zip_prefix | sparse | 383 | 8 | True |
| 6 | genres | sequence | 20 | 8 | True |


## 3. Create the FeatureEmbedding Module

In [4]:
from deepfm.models.layers.embedding import FeatureEmbedding

emb = FeatureEmbedding(schema, fm_embed_dim=fm_embed_dim)
print(emb)

FeatureEmbedding(
  (second_order_embeddings): ModuleDict(
    (user_id): Embedding(944, 16, padding_idx=0)
    (movie_id): Embedding(1679, 16, padding_idx=0)
    (gender): Embedding(3, 4, padding_idx=0)
    (age): Embedding(8, 4, padding_idx=0)
    (occupation): Embedding(22, 8, padding_idx=0)
    (zip_prefix): Embedding(383, 8, padding_idx=0)
    (genres): EmbeddingBag(20, 8, mode='mean', padding_idx=0)
  )
  (first_order_embeddings): ModuleDict(
    (user_id): Embedding(944, 1, padding_idx=0)
    (movie_id): Embedding(1679, 1, padding_idx=0)
    (gender): Embedding(3, 1, padding_idx=0)
    (age): Embedding(8, 1, padding_idx=0)
    (occupation): Embedding(22, 1, padding_idx=0)
    (zip_prefix): Embedding(383, 1, padding_idx=0)
    (genres): EmbeddingBag(20, 1, mode='mean', padding_idx=0)
  )
  (projections): ModuleDict(
    (gender): Linear(in_features=4, out_features=16, bias=False)
    (age): Linear(in_features=4, out_features=16, bias=False)
    (occupation): Linear(in_features=8,

## 4. Forward Pass on a Real Batch

Let's grab a batch from the training DataLoader and pass it through the embedding layer.

In [5]:
import torch
from torch.utils.data import DataLoader

loader = DataLoader(train_ds, batch_size=4, shuffle=True)
batch_features, batch_labels = next(iter(loader))

print("Input batch:")
for name, tensor in batch_features.items():
    print(f"  {name:15s}  shape={str(tensor.shape):15s}  dtype={tensor.dtype}")
print(f"  {'labels':15s}  shape={str(batch_labels.shape):15s}")

Input batch:
  user_id          shape=torch.Size([4])  dtype=torch.int64
  movie_id         shape=torch.Size([4])  dtype=torch.int64
  gender           shape=torch.Size([4])  dtype=torch.int64
  age              shape=torch.Size([4])  dtype=torch.int64
  occupation       shape=torch.Size([4])  dtype=torch.int64
  zip_prefix       shape=torch.Size([4])  dtype=torch.int64
  genres           shape=torch.Size([4, 6])  dtype=torch.int64
  labels           shape=torch.Size([4])


In [6]:
emb.eval()
with torch.no_grad():
    first_order, field_embeddings, flat_embeddings = emb(batch_features)

print("Output tensors:")
print(f"  first_order:      {first_order.shape}      — linear term (summed across fields)")
print(f"  field_embeddings: {field_embeddings.shape}  — projected to common fm_dim, one per field")
print(f"  flat_embeddings:  {flat_embeddings.shape}   — raw concat of all field embeddings")

Output tensors:
  first_order:      torch.Size([4, 1])      — linear term (summed across fields)
  field_embeddings: torch.Size([4, 7, 16])  — projected to common fm_dim, one per field
  flat_embeddings:  torch.Size([4, 64])   — raw concat of all field embeddings


## 5. Understanding the Three Outputs

### 5a. First Order — `(B, 1)`
Each feature has a first-order embedding of dim 1 (i.e., a scalar weight per feature value).
These are summed across all fields to produce the linear term `<w, x>` in FM.

In [7]:
print("First order values (one scalar per sample):")
print(first_order)
print("\nThink of this as: w_user_id + w_movie_id + w_gender + ... for each sample (the model adds its bias afterwards)")

First order values (one scalar per sample):
tensor([[-1.2366],
        [-0.7426],
        [-1.1634],
        [-0.7224]])

Think of this as: w_user_id + w_movie_id + w_gender + ... for each sample (the model adds its bias afterwards)
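
The linear term can be reproduced in isolation. The sketch below is a toy stand-in, not the `FeatureEmbedding` source: it assumes two hypothetical fields with made-up vocabulary sizes, gives each a one-dimensional `nn.Embedding`, and sums the looked-up scalars across fields.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical vocabulary sizes, chosen only for illustration.
vocab_sizes = {"user_id": 10, "gender": 3}

# One scalar weight per feature value; row 0 is the padding/OOV row.
first_order_tables = nn.ModuleDict({
    name: nn.Embedding(size, 1, padding_idx=0)
    for name, size in vocab_sizes.items()
})

batch = {
    "user_id": torch.tensor([4, 7, 0]),   # last sample has an OOV user
    "gender": torch.tensor([1, 2, 1]),
}

# Look up each field's scalar (B, 1), then sum across fields -> (B, 1).
per_field = [first_order_tables[name](batch[name]) for name in vocab_sizes]
first_order_toy = torch.stack(per_field, dim=0).sum(dim=0)

print(first_order_toy.shape)  # torch.Size([3, 1])
```

For the last sample the OOV user contributes nothing, so its linear term is just the `gender` weight.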


### 5b. Field Embeddings — `(B, num_fields, fm_embed_dim)`

Each field's embedding is **projected** to the common `fm_embed_dim` so that
FM can compute pairwise dot products between fields.
Fields with `embedding_dim == fm_embed_dim` (like user_id, movie_id at dim=16) skip the projection.

In [8]:
print(f"field_embeddings shape: {field_embeddings.shape}")
print(f"  B={field_embeddings.shape[0]}, F={field_embeddings.shape[1]}, D={field_embeddings.shape[2]}")
print()

field_names = list(schema.fields.keys())
print("Per-field projected embeddings for sample 0:")
for i, name in enumerate(field_names):
    vec = field_embeddings[0, i]
    print(f"  {name:15s}  [{vec[:4].tolist()} ...]  (dim={vec.shape[0]})")

field_embeddings shape: torch.Size([4, 7, 16])
  B=4, F=7, D=16

Per-field projected embeddings for sample 0:
  user_id          [[-0.025335164740681648, 0.0015823764260858297, 0.05761401727795601, -0.024023396894335747] ...]  (dim=16)
  movie_id         [[0.004626244306564331, -0.05903911590576172, -0.012329183518886566, 0.04952222481369972] ...]  (dim=16)
  gender           [[-0.14277571439743042, -0.6031725406646729, -0.06658773869276047, 0.04329908639192581] ...]  (dim=16)
  age              [[-0.05736514925956726, -0.5844314098358154, -0.49363431334495544, 0.2917306125164032] ...]  (dim=16)
  occupation       [[0.3453277349472046, 0.037104543298482895, 0.2612850069999695, -0.014424063265323639] ...]  (dim=16)
  zip_prefix       [[-0.018386617302894592, 0.04026094079017639, 0.10642841458320618, -0.05326128751039505] ...]  (dim=16)
  genres           [[-0.4363022446632385, 0.31358152627944946, -0.14919301867485046, 0.44798552989959717] ...]  (dim=16)
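
Pairwise dot products are only defined when every field vector has the same length, which is why the common dimension matters. As a minimal sketch, using a random tensor as a stand-in for `field_embeddings`, the explicit sum over field pairs can be checked against the usual FM shortcut `0.5 * (square_of_sum - sum_of_squares)`:

```python
import torch

torch.manual_seed(0)
B, F, D = 4, 7, 16                      # same shape as field_embeddings
v = torch.randn(B, F, D)                # random stand-in, not real embeddings

# Explicit version: sum of <v_i, v_j> over all field pairs i < j.
explicit = torch.zeros(B)
for i in range(F):
    for j in range(i + 1, F):
        explicit += (v[:, i] * v[:, j]).sum(dim=-1)

# FM identity: 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), summed over the D axis.
square_of_sum = v.sum(dim=1).pow(2)     # (B, D)
sum_of_squares = v.pow(2).sum(dim=1)    # (B, D)
fast = 0.5 * (square_of_sum - sum_of_squares).sum(dim=-1)   # (B,)

print(torch.allclose(explicit, fast, atol=1e-4))
```

The shortcut costs O(F·D) per sample instead of O(F²·D), which is the reason FM implementations prefer it.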


### 5c. Flat Embeddings — `(B, total_dim)`

The **raw** (non-projected) embeddings concatenated across all fields.
This is the input to the DNN component. The total dim equals `sum(f.embedding_dim for f in schema.fields.values())`, which is 64 here.

In [9]:
print(f"flat_embeddings shape: {flat_embeddings.shape}")
print(f"  total_dim = {flat_embeddings.shape[1]}")
print()

# Show how total_dim is composed
offset = 0
print("Composition:")
for name, field in schema.fields.items():
    end = offset + field.embedding_dim
    print(f"  {name:15s}  dim={field.embedding_dim:3d}  positions [{offset}:{end})")
    offset = end
print(f"  {'TOTAL':15s}  dim={offset}")

flat_embeddings shape: torch.Size([4, 64])
  total_dim = 64

Composition:
  user_id          dim= 16  positions [0:16)
  movie_id         dim= 16  positions [16:32)
  gender           dim=  4  positions [32:36)
  age              dim=  4  positions [36:40)
  occupation       dim=  8  positions [40:48)
  zip_prefix       dim=  8  positions [48:56)
  genres           dim=  8  positions [56:64)
  TOTAL            dim=64
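
The same layout can be rebuilt by hand. This sketch assumes the per-field widths from the composition listing and uses random tensors in place of real embeddings. It concatenates them and then recovers one field from its `[start, end)` slice:

```python
import torch

torch.manual_seed(0)
B = 4
# Per-field raw embedding widths, copied from the composition listing.
dims = {"user_id": 16, "movie_id": 16, "gender": 4, "age": 4,
        "occupation": 8, "zip_prefix": 8, "genres": 8}

# Random stand-ins for each field's raw (unprojected) embedding.
raw = {name: torch.randn(B, d) for name, d in dims.items()}

# Concatenate along the feature axis -> (B, 64).
flat = torch.cat([raw[name] for name in dims], dim=1)

# Record each field's [start, end) slice so it can be recovered later.
offsets, start = {}, 0
for name, d in dims.items():
    offsets[name] = (start, start + d)
    start += d

lo, hi = offsets["occupation"]
recovered = flat[:, lo:hi]
print(flat.shape, offsets["occupation"], torch.equal(recovered, raw["occupation"]))
```

Keeping such an offset map around is handy when debugging which DNN inputs belong to which field.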


## 6. OOV / Padding Behavior

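Index 0 is reserved for unknown/OOV values and doubles as the padding index. Because every
embedding table uses `padding_idx=0`, row 0 is a fixed zero vector, so these inputs contribute
zero to both first-order and second-order embeddings.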

In [10]:
# Create a batch with index 0 (OOV) for every field
oov_batch = {name: torch.zeros(1, dtype=torch.long) for name in field_names}
# Sequence field needs 2D
oov_batch["genres"] = torch.zeros(1, 6, dtype=torch.long)

with torch.no_grad():
    fo_oov, fe_oov, fl_oov = emb(oov_batch)

print("All-OOV input:")
print(f"  first_order sum:      {fo_oov.abs().sum().item():.6f}  (should be ~0)")
print(f"  field_embeddings sum: {fe_oov.abs().sum().item():.6f}  (should be ~0)")
print(f"  flat_embeddings sum:  {fl_oov.abs().sum().item():.6f}  (should be ~0)")

All-OOV input:
  first_order sum:      0.000000  (should be ~0)
  field_embeddings sum: 0.000000  (should be ~0)
  flat_embeddings sum:  0.000000  (should be ~0)
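
The effect of `padding_idx` can be seen on standalone PyTorch layers. The tables below are small hypothetical ones, not the notebook's. The first half shows that index 0 returns a zero vector and receives a zero gradient. The second half shows that an `EmbeddingBag` in `mean` mode leaves padded positions out of the average.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sparse field: index 0 maps to an all-zero row that never receives gradient.
table = nn.Embedding(5, 4, padding_idx=0)
out = table(torch.tensor([0, 2]))
out.sum().backward()
print(out[0])                 # zeros for the padding index
print(table.weight.grad[0])   # zero gradient for the padding row

# Sequence field: padded positions are excluded from the mean.
bag = nn.EmbeddingBag(6, 4, mode="mean", padding_idx=0)
padded = torch.tensor([[3, 5, 0, 0]])          # two real ids, two pads
with torch.no_grad():
    pooled = bag(padded)
    expected = (bag.weight[3] + bag.weight[5]) / 2
print(torch.allclose(pooled[0], expected))
```

This is why a short genre list padded out to length 6 is averaged only over its real entries.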


## 7. Projection Layers

Fields with `embedding_dim != fm_embed_dim` get a `nn.Linear` projection.
Let's see which fields have projections and their parameter counts.

In [11]:
print(f"Projection layers (field_dim -> {fm_embed_dim}):")
for name, proj in emb.projections.items():
    field_dim = schema.fields[name].embedding_dim
    params = sum(p.numel() for p in proj.parameters())
    print(f"  {name:15s}  {field_dim} -> {fm_embed_dim}  ({params} params)")

if not emb.projections:
    print("  (none — all fields already match fm_embed_dim)")

Projection layers (field_dim -> 16):
  gender           4 -> 16  (64 params)
  age              4 -> 16  (64 params)
  occupation       8 -> 16  (128 params)
  zip_prefix       8 -> 16  (128 params)
  genres           8 -> 16  (128 params)
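
The parameter counts follow directly from the layer shape. As a small sketch on a standalone `nn.Linear` (not the module's own layer), a bias-free 4 -> 16 projection holds 4 * 16 weights and maps an all-zero input to an all-zero output, which is consistent with the zero sums in the OOV check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fm_dim = 16

# A 4-dim field (like gender or age) lifted to the common dimension.
proj = nn.Linear(4, fm_dim, bias=False)
n_params = sum(p.numel() for p in proj.parameters())
print(n_params)  # 4 * 16 = 64

raw = torch.randn(3, 4)          # (B, 4) stand-in for raw embeddings
print(proj(raw).shape)           # torch.Size([3, 16])

# Without a bias, an all-zero (padding) embedding stays all-zero.
zero_out = proj(torch.zeros(1, 4))
print(zero_out.abs().sum().item())
```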


## 8. Parameter Count Summary

In [12]:
total_params = sum(p.numel() for p in emb.parameters())
trainable_params = sum(p.numel() for p in emb.parameters() if p.requires_grad)

print(f"Total parameters:     {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print()

# Breakdown by component
for component_name, module_dict in [
    ("second_order_embeddings", emb.second_order_embeddings),
    ("first_order_embeddings", emb.first_order_embeddings),
    ("projections", emb.projections),
]:
    params = sum(p.numel() for p in module_dict.parameters())
    print(f"  {component_name:30s}  {params:>8,} params")

Total parameters:     48,983
Trainable parameters: 48,983

  second_order_embeddings           45,412 params
  first_order_embeddings             3,059 params
  projections                          512 params
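
These totals can be cross-checked by hand from the per-field table in section 2. The sketch assumes every projection is a bias-free `Linear`, which matches the 64 and 128 parameter counts reported for the projection layers.

```python
# (vocab_size, embed_dim) per field, copied from the table in section 2.
fields = {
    "user_id": (944, 16), "movie_id": (1679, 16), "gender": (3, 4),
    "age": (8, 4), "occupation": (22, 8), "zip_prefix": (383, 8),
    "genres": (20, 8),
}
fm_embed_dim = 16

second_order = sum(v * d for v, d in fields.values())
first_order = sum(v for v, _ in fields.values())          # one scalar per value
projections = sum(d * fm_embed_dim for _, d in fields.values()
                  if d != fm_embed_dim)                   # bias-free Linear layers

print(second_order, first_order, projections)   # 45412 3059 512
print(second_order + first_order + projections) # 48983
```

The two ID tables (`user_id` and `movie_id`) account for 41,968 of the 45,412 second-order parameters, so embedding width for high-cardinality fields dominates model size.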


## 9. How Models Consume These Outputs

```
FeatureEmbedding(batch)
    |
    ├── first_order (B, 1)  ──────────────────────> FM linear term: y_linear = bias + first_order
    │
    ├── field_embeddings (B, F, fm_dim)  ─────────> FM 2nd order: 0.5*(square_of_sum - sum_of_squares)
    │                                     ├──────> CIN: vector-wise interactions (xDeepFM)
    │                                     └──────> Attention: field self-attention (AttentionDeepFM)
    │
    └── flat_embeddings (B, total_dim)  ──────────> DNN: MLP(flat_embeddings) → logit
```

Key insight: **FM and DNN share the same embedding vectors** (no separate wide features).
The only difference is that FM gets the projected view while DNN gets the raw concatenated view.
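
To make the diagram concrete, here is a minimal sketch of a DeepFM-style head that combines the three views into one logit. It is an illustration under assumptions, not the repository's model code: the inputs are random stand-ins, and the MLP sizes and the `bias` parameter are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, F, D, total_dim = 4, 7, 16, 64

# Random stand-ins for the three FeatureEmbedding outputs.
first_order = torch.randn(B, 1)
field_embeddings = torch.randn(B, F, D)
flat_embeddings = torch.randn(B, total_dim)

bias = nn.Parameter(torch.zeros(1))      # global bias of the linear term
mlp = nn.Sequential(                     # hypothetical DNN tower
    nn.Linear(total_dim, 32), nn.ReLU(), nn.Linear(32, 1)
)

# FM linear term.
y_linear = bias + first_order                              # (B, 1)

# FM second-order term via the square-of-sum shortcut.
square_of_sum = field_embeddings.sum(dim=1).pow(2)         # (B, D)
sum_of_squares = field_embeddings.pow(2).sum(dim=1)        # (B, D)
fm_second = 0.5 * (square_of_sum - sum_of_squares).sum(dim=1, keepdim=True)

# DNN term on the raw concatenated view.
y_dnn = mlp(flat_embeddings)                               # (B, 1)

logit = y_linear + fm_second + y_dnn
prob = torch.sigmoid(logit)
print(logit.shape, prob.shape)
```

Variants such as xDeepFM or an attention model would add or swap a branch that reads `field_embeddings`, while the linear and DNN branches stay the same.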