# Similarity methods: hands-on comparison (Steam 200k)

This notebook compares **similarity functions** commonly used in recommender systems and shows **how the choice changes recommendations**.

The repository’s current ItemCF implementation effectively uses **cosine similarity** (L2-normalize item vectors and take dot products). Here we’ll reproduce that and compare against a few other options.

## What you will learn

- What each similarity measures (intuition)
- How to compute it on the same interaction matrix
- How top-`N` recommendations differ
- Practical guidance: when each similarity is a better fit


## Setup

This notebook expects:

- `dataset/steam-200k.csv` present (already in this repo)
- dependencies installed from `requirements.txt`

It also auto-configures imports by adding the repo’s `src/` directory to `sys.path` so `import steamrec...` works whether you start Jupyter from the repo root or from inside `notebooks/`.

Example (terminal):

```bash
jupyter notebook
```

In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize


def _find_src_path(start: Path) -> Path | None:
    # Walk upwards to find `<repo>/src/steamrec`.
    for p in [start] + list(start.parents):
        candidate = p / "src" / "steamrec"
        if candidate.exists() and candidate.is_dir():
            return candidate.parent
    return None


src_path = _find_src_path(Path.cwd().resolve())
if src_path is None:
    raise RuntimeError(
        "Could not locate `src/steamrec`. Run Jupyter from within the repository, or ensure the repo is present on disk."
    )

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

from steamrec.data import (
    build_implicit_interactions,
    build_user_item_matrix,
    index_dataset,
    load_steam_200k_csv,
)

In [2]:
DATASET_PATH = Path("../dataset/steam-200k.csv") if Path("../dataset/steam-200k.csv").exists() else Path("dataset/steam-200k.csv")
raw = load_steam_200k_csv(DATASET_PATH)
interactions = build_implicit_interactions(raw)
ds = index_dataset(interactions)
user_item_full = build_user_item_matrix(ds)

user_item_full.shape, user_item_full.nnz


((12393, 5155), 128804)

## Why we sample

Several of the methods below materialize dense item-item matrices, which is slow and memory-hungry at full size. For a **hands-on** comparison, we’ll take a reproducible sample of users and items so we can compute similarities explicitly and iterate quickly.

The goal is not to maximize offline metrics here; it’s to see **how each similarity behaves**.
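
To put rough numbers on the cost, the sketch below estimates the memory for one dense `float32` item-item matrix at the full item count (5155) and at the sampled count (2000) reported in this notebook. The helper name `dense_sim_megabytes` is hypothetical, and the estimate ignores temporaries created during computation, so treat it as a back-of-the-envelope figure only.

```python
def dense_sim_megabytes(n_items: int, bytes_per_value: int = 4) -> float:
    # An item-item matrix has n_items * n_items entries; float32 uses 4 bytes each.
    return n_items * n_items * bytes_per_value / 1e6


for label, n_items in [("full", 5155), ("sampled", 2000)]:
    print(f"{label:>8}: {dense_sim_megabytes(n_items):.1f} MB per dense similarity matrix")
```

Since the notebook builds several such matrices side by side (plus dense distance matrices), the smaller sample keeps each re-run quick.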

In [3]:
rng = np.random.default_rng(42)

n_users = min(4000, user_item_full.shape[0])
n_items = min(2000, user_item_full.shape[1])

user_idx = rng.choice(user_item_full.shape[0], size=n_users, replace=False)
item_idx = rng.choice(user_item_full.shape[1], size=n_items, replace=False)

user_idx.sort()
item_idx.sort()

user_item = user_item_full[user_idx][:, item_idx].tocsr()
game_titles = ds.game_titles[item_idx]

user_item.shape, user_item.nnz


((4000, 2000), 15402)

## Similarity methods we’ll compare

We’ll compare common choices for interaction vectors (users or items):

- **Cosine similarity**: compares *direction* (pattern) and ignores absolute magnitude.
- **Dot product (raw)**: rewards magnitude; tends to favor very popular / high-activity entities.
- **Jaccard similarity (binary)**: compares *set overlap* (co-occurrence), ignoring weights.
- **Pearson correlation (centered)**: compares deviations from each entity’s mean; classic for explicit ratings.
- **Euclidean distance → similarity**: distance in vector space; we’ll convert distance to similarity as `1 / (1 + d)`.
- **Manhattan (L1) distance → similarity**: like Euclidean but with absolute differences; also converted via `1 / (1 + d)`.

All of them answer “are these two vectors similar?”, but they emphasize different properties.
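
Before applying these to the real matrix, it can help to see all six on a pair of tiny vectors. The following is a minimal sketch with made-up data; the vectors `a` and `b` and the helper names are assumptions for illustration and are not part of `steamrec`.

```python
import numpy as np

# Two hypothetical item vectors over five users (made-up hours played).
a = np.array([2.0, 0.0, 1.0, 0.0, 3.0])
b = np.array([4.0, 0.0, 0.0, 1.0, 6.0])


def cosine(x: np.ndarray, y: np.ndarray) -> float:
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def jaccard(x: np.ndarray, y: np.ndarray) -> float:
    # Presence/absence only: weights are discarded.
    xs, ys = x > 0, y > 0
    return float((xs & ys).sum() / (xs | ys).sum())


def pearson(x: np.ndarray, y: np.ndarray) -> float:
    # Cosine of the mean-centered vectors.
    return cosine(x - x.mean(), y - y.mean())


def distance_sim(x: np.ndarray, y: np.ndarray, order: int) -> float:
    # order=2 is Euclidean, order=1 is Manhattan; mapped to (0, 1] via 1 / (1 + d).
    return float(1.0 / (1.0 + np.linalg.norm(x - y, ord=order)))


results = {
    "dot": float(a @ b),
    "cosine": cosine(a, b),
    "jaccard": jaccard(a, b),
    "pearson": pearson(a, b),
    "euclidean": distance_sim(a, b, 2),
    "manhattan": distance_sim(a, b, 1),
}
for name, value in results.items():
    print(f"{name:>10}: {value:.3f}")

# Scaling one vector changes the dot product and the distances, but not cosine.
print(np.isclose(cosine(a, 10 * b), cosine(a, b)))
```

The values live on very different scales (the dot product is unbounded, while the others are bounded), which is worth keeping in mind when reading raw scores later in the notebook.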

In [None]:
def cosine_sim_matrix_from_user_item(user_item: sparse.csr_matrix) -> sparse.csr_matrix:
    # Item vectors live in item-user space
    item_user = user_item.T.tocsr()
    item_user_norm = normalize(item_user, norm="l2", axis=1, copy=True)
    return item_user_norm @ item_user_norm.T


def dot_sim_matrix_from_user_item(user_item: sparse.csr_matrix) -> sparse.csr_matrix:
    item_user = user_item.T.tocsr()
    return item_user @ item_user.T


def jaccard_sim_matrix_from_user_item(user_item: sparse.csr_matrix) -> np.ndarray:
    # Binary: presence/absence only
    X = user_item.copy().tocsr()
    X.data[:] = 1.0
    item_user = X.T.tocsr()

    # Intersection counts = A @ A^T for binary
    inter = (item_user @ item_user.T).toarray().astype(np.float32)

    # Each item's set size
    sizes = np.asarray(item_user.getnnz(axis=1), dtype=np.float32)
    union = sizes[:, None] + sizes[None, :] - inter

    out = np.zeros_like(inter, dtype=np.float32)
    mask = union > 0
    out[mask] = inter[mask] / union[mask]
    return out


def pearson_sim_matrix_from_user_item(user_item: sparse.csr_matrix) -> np.ndarray:
    # Pearson is most meaningful for explicit ratings with user/item means.
    # Here we apply it to item vectors in item-user space with mean-centering per item.
    item_user = user_item.T.tocsr().astype(np.float32)

    # Convert to dense for this sampled problem (kept small on purpose).
    A = item_user.toarray()
    means = A.mean(axis=1, keepdims=True)
    A = A - means

    denom = np.linalg.norm(A, axis=1, keepdims=True)
    denom = denom @ denom.T
    num = A @ A.T

    out = np.zeros_like(num, dtype=np.float32)
    mask = denom > 0
    out[mask] = (num[mask] / denom[mask]).astype(np.float32)
    return out


def _distance_to_similarity(D: np.ndarray) -> np.ndarray:
    # Convert distances to a bounded similarity in (0, 1].
    return 1.0 / (1.0 + D.astype(np.float32))


def euclidean_sim_matrix_from_user_item(user_item: sparse.csr_matrix) -> np.ndarray:
    item_user = user_item.T.tocsr()
    # Dense distances for sampled matrix
    D = pairwise_distances(item_user, metric="euclidean", n_jobs=-1)
    S = _distance_to_similarity(D)
    np.fill_diagonal(S, 1.0)
    return S.astype(np.float32)


def manhattan_sim_matrix_from_user_item(user_item: sparse.csr_matrix) -> np.ndarray:
    item_user = user_item.T.tocsr()
    D = pairwise_distances(item_user, metric="manhattan", n_jobs=-1)
    S = _distance_to_similarity(D)
    np.fill_diagonal(S, 1.0)
    return S.astype(np.float32)

## Pick a user to inspect

We’ll choose a user from the sample with at least a few owned/played games so recommendations are non-trivial.

In [5]:
user_nnz = np.asarray(user_item.getnnz(axis=1)).ravel()
candidate_users = np.where(user_nnz >= 5)[0]

if len(candidate_users) == 0:
    raise RuntimeError("No sampled users with >= 5 interactions. Increase n_users or change sampling seed.")

u_local = int(candidate_users[0])
u_global = int(user_idx[u_local])
owned_local = user_item.getrow(u_local).indices
owned_titles = [str(game_titles[i]) for i in owned_local[:15]]

u_local, u_global, len(owned_local), owned_titles


(1,
 5,
 6,
 ['Day of Defeat',
  'Deathmatch Classic',
  'Half-Life 2',
  'Half-Life 2 Deathmatch',
  'Half-Life 2 Lost Coast',
  'Ricochet'])

## A unified way to produce ItemCF recommendations

Given an **item-item similarity matrix** `S` and a user profile vector `p` (the user’s interactions over items), a simple scoring rule is:

- `score = S @ p`

Then we exclude items the user already owns.

This mirrors what the repository’s current implementation does for cosine similarity (implemented efficiently via sparse matrix multiplications).
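
As a worked example of the scoring rule, here is a small sketch on a hypothetical 4-item catalogue. The matrix `S` and profile `p` are made-up values chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical symmetric item-item similarity matrix for 4 items.
S = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.6],
    [0.0, 0.2, 0.6, 1.0],
])

# The user owns items 0 and 3 (binary profile).
p = np.array([1.0, 0.0, 0.0, 1.0])

# Each item's score is the sum of its similarities to the owned items.
scores = S @ p

# Owned items must not be recommended again.
scores[p > 0] = -np.inf

# Rank the remaining items from highest to lowest score.
top = np.argsort(-scores)[:2]
print(top, scores[top])
```

Item 1 scores `0.8 + 0.2` and item 2 scores `0.1 + 0.6`, so item 1 ranks first. With a weighted profile (for example, hours played), heavily played items would contribute proportionally more to each sum.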

In [6]:
def recommend_from_sim(
    sim,
    user_profile: sparse.csr_matrix,
    owned_idx: set[int],
    titles: np.ndarray,
    n: int = 10,
) -> pd.DataFrame:
    p = user_profile.T

    if sparse.issparse(sim):
        scores = sim @ p
        scores = np.asarray(scores.todense()).ravel()
    else:
        scores = np.asarray(sim @ np.asarray(p.todense()).ravel()).ravel()

    if owned_idx:
        scores[list(owned_idx)] = -np.inf

    top = np.argpartition(-scores, kth=min(n, len(scores) - 1))[:n]
    top = top[np.argsort(-scores[top])]

    out = pd.DataFrame(
        {
            "item_local_idx": top,
            "title": [str(titles[i]) for i in top],
            "score": [float(scores[i]) for i in top],
        }
    )
    return out


In [None]:
profile = user_item.getrow(u_local).tocsr()
owned = set(profile.indices.tolist())

S_cos = cosine_sim_matrix_from_user_item(user_item)
S_dot = dot_sim_matrix_from_user_item(user_item)
S_jac = jaccard_sim_matrix_from_user_item(user_item)
S_prs = pearson_sim_matrix_from_user_item(user_item)
S_euc = euclidean_sim_matrix_from_user_item(user_item)
S_man = manhattan_sim_matrix_from_user_item(user_item)

recs_cos = recommend_from_sim(S_cos, profile, owned, game_titles, n=10)
recs_dot = recommend_from_sim(S_dot, profile, owned, game_titles, n=10)
recs_jac = recommend_from_sim(S_jac, profile, owned, game_titles, n=10)
recs_prs = recommend_from_sim(S_prs, profile, owned, game_titles, n=10)
recs_euc = recommend_from_sim(S_euc, profile, owned, game_titles, n=10)
recs_man = recommend_from_sim(S_man, profile, owned, game_titles, n=10)

recs_cos

Unnamed: 0,item_local_idx,title,score
0,379,Counter-Strike Condition Zero Deleted Scenes,2.077641
1,460,Day of Defeat Source,2.068729
2,1251,Portal,2.047225
3,378,Counter-Strike Condition Zero,1.87491
4,814,Half-Life Source,1.830522
5,962,Left 4 Dead 2,1.425269
6,466,Dead Space,1.171942
7,1635,Synergy,1.132746
8,1060,Metro 2033,1.129617
9,202,BioShock,1.105094


## Compare the top-10 lists

You should see that:

- **Dot product** tends to over-recommend globally frequent items (popularity bias).
- **Cosine** divides out each item vector’s length, which dampens popularity, and focuses on co-occurrence *patterns*.
- **Jaccard** is stricter: it only cares about overlap, not playtime/purchase weights.
- **Pearson** can be unstable with sparse implicit data; it shines more with explicit ratings (like 1–5 stars).
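
The popularity effect is easy to reproduce in isolation. The sketch below uses three made-up binary item vectors over ten users: a target item, a niche item that mostly shares the target's owners, and a popular item that everyone owns. All names and values are illustrative assumptions.

```python
import numpy as np

n_users = 10

target = np.zeros(n_users)
target[[0, 1, 2]] = 1.0   # owned by users 0, 1, 2

niche = np.zeros(n_users)
niche[[0, 1]] = 1.0       # owned by two of the target's three owners

popular = np.ones(n_users)  # owned by every user


def cosine(x: np.ndarray, y: np.ndarray) -> float:
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


dot_niche, dot_popular = float(target @ niche), float(target @ popular)
cos_niche, cos_popular = cosine(target, niche), cosine(target, popular)

print(f"dot:    niche={dot_niche:.0f}  popular={dot_popular:.0f}")
print(f"cosine: niche={cos_niche:.3f}  popular={cos_popular:.3f}")
```

The raw dot product ranks the popular item higher simply because it overlaps with everyone, while cosine penalizes its long vector and prefers the niche item.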

In [None]:
display(recs_cos.assign(method="cosine"))
display(recs_dot.assign(method="dot"))
display(recs_jac.assign(method="jaccard"))
display(recs_prs.assign(method="pearson"))
display(recs_euc.assign(method="euclidean (1/(1+d))"))
display(recs_man.assign(method="manhattan (1/(1+d))"))

Unnamed: 0,item_local_idx,title,score,method
0,379,Counter-Strike Condition Zero Deleted Scenes,2.077641,cosine
1,460,Day of Defeat Source,2.068729,cosine
2,1251,Portal,2.047225,cosine
3,378,Counter-Strike Condition Zero,1.87491,cosine
4,814,Half-Life Source,1.830522,cosine
5,962,Left 4 Dead 2,1.425269,cosine
6,466,Dead Space,1.171942,cosine
7,1635,Synergy,1.132746,cosine
8,1060,Metro 2033,1.129617,cosine
9,202,BioShock,1.105094,cosine


Unnamed: 0,item_local_idx,title,score,method
0,962,Left 4 Dead 2,1962.260376,dot
1,1697,The Elder Scrolls V Skyrim,1779.354614,dot
2,1251,Portal,1461.820068,dot
3,1496,Sid Meier's Civilization V,1238.69751,dot
4,241,Borderlands 2,1198.697266,dot
5,378,Counter-Strike Condition Zero,926.41272,dot
6,460,Day of Defeat Source,917.680786,dot
7,379,Counter-Strike Condition Zero Deleted Scenes,648.263123,dot
8,1060,Metro 2033,610.655518,dot
9,279,Call of Duty Modern Warfare 2,587.408447,dot


Unnamed: 0,item_local_idx,title,score,method
0,378,Counter-Strike Condition Zero,1.648488,jaccard
1,379,Counter-Strike Condition Zero Deleted Scenes,1.648488,jaccard
2,460,Day of Defeat Source,1.568453,jaccard
3,1251,Portal,1.449733,jaccard
4,814,Half-Life Source,1.224432,jaccard
5,962,Left 4 Dead 2,0.944384,jaccard
6,1697,The Elder Scrolls V Skyrim,0.670004,jaccard
7,241,Borderlands 2,0.668861,jaccard
8,202,BioShock,0.638448,jaccard
9,1060,Metro 2033,0.636975,jaccard


Unnamed: 0,item_local_idx,title,score,method
0,460,Day of Defeat Source,1.932249,pearson
1,379,Counter-Strike Condition Zero Deleted Scenes,1.876303,pearson
2,1251,Portal,1.862453,pearson
3,814,Half-Life Source,1.739009,pearson
4,378,Counter-Strike Condition Zero,1.695499,pearson
5,962,Left 4 Dead 2,1.149742,pearson
6,1635,Synergy,1.086213,pearson
7,466,Dead Space,1.069852,pearson
8,243,Borderlands DLC Claptraps New Robot Revolution,0.976642,pearson
9,202,BioShock,0.96874,pearson


## How different are these recommendation sets?

A quick sanity check is the pairwise Jaccard overlap (`|A ∩ B| / |A ∪ B|`) between the top-`k` title sets; 1.0 means two methods returned the same `k` titles.

In [None]:
def topk_set(df: pd.DataFrame, k: int = 10) -> set[str]:
    return set(df.head(k)["title"].tolist())


def overlap(a: pd.DataFrame, b: pd.DataFrame, k: int = 10) -> float:
    A = topk_set(a, k)
    B = topk_set(b, k)
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)


methods = {
    "cosine": recs_cos,
    "dot": recs_dot,
    "jaccard": recs_jac,
    "pearson": recs_prs,
    "euclidean": recs_euc,
    "manhattan": recs_man,
}

rows = []
for m1, d1 in methods.items():
    for m2, d2 in methods.items():
        rows.append(
            {
                "m1": m1,
                "m2": m2,
                "jaccard_overlap_top10": overlap(d1, d2, k=10),
            }
        )

pd.DataFrame(rows).pivot(index="m1", columns="m2", values="jaccard_overlap_top10").round(3)

m1 \ m2,cosine,dot,jaccard,pearson
cosine,1.0,0.429,0.667,0.818
dot,0.429,1.0,0.667,0.333
jaccard,0.667,0.667,1.0,0.538
pearson,0.818,0.333,0.538,1.0


## Zoom in: similarity for one owned item

Let’s choose one of the user’s owned items and see its nearest neighbors under each similarity.

In [10]:
# Sets are unordered, so this picks an arbitrary (not necessarily the first-listed) owned item.
target_item = int(next(iter(owned)))
target_title = str(game_titles[target_item])
target_title


'Half-Life 2'

In [None]:
def top_neighbors(sim_row, titles: np.ndarray, exclude_idx: int, k: int = 10) -> pd.DataFrame:
    # Work on a copy so masking does not modify the similarity matrix itself.
    sim_row = np.asarray(sim_row).ravel().copy()
    # An item is trivially most similar to itself, so drop it from the ranking.
    sim_row[exclude_idx] = -np.inf
    top = np.argpartition(-sim_row, kth=min(k, len(sim_row) - 1))[:k]
    top = top[np.argsort(-sim_row[top])]
    return pd.DataFrame(
        {
            "neighbor_local_idx": top,
            "title": [str(titles[i]) for i in top],
            "similarity": [float(sim_row[i]) for i in top],
        }
    )


cos_row = S_cos.getrow(target_item).toarray()
dot_row = S_dot.getrow(target_item).toarray()
jac_row = S_jac[target_item]
prs_row = S_prs[target_item]
euc_row = S_euc[target_item]
man_row = S_man[target_item]

display(pd.DataFrame({"target": [target_title]}))
display(top_neighbors(cos_row, game_titles, target_item, k=10).assign(method="cosine"))
display(top_neighbors(dot_row, game_titles, target_item, k=10).assign(method="dot"))
display(top_neighbors(jac_row, game_titles, target_item, k=10).assign(method="jaccard"))
display(top_neighbors(prs_row, game_titles, target_item, k=10).assign(method="pearson"))
display(top_neighbors(euc_row, game_titles, target_item, k=10).assign(method="euclidean (1/(1+d))"))
display(top_neighbors(man_row, game_titles, target_item, k=10).assign(method="manhattan (1/(1+d))"))

Unnamed: 0,target
0,Half-Life 2


Unnamed: 0,neighbor_local_idx,title,similarity,method
0,813,Half-Life 2 Lost Coast,0.678666,cosine
1,1251,Portal,0.593128,cosine
2,812,Half-Life 2 Deathmatch,0.464008,cosine
3,814,Half-Life Source,0.419941,cosine
4,1060,Metro 2033,0.337444,cosine
5,962,Left 4 Dead 2,0.333241,cosine
6,466,Dead Space,0.313083,cosine
7,202,BioShock,0.305736,cosine
8,1759,The Witcher 2 Assassins of Kings Enhanced Edition,0.303108,cosine
9,687,Fallout New Vegas Honest Hearts,0.294738,cosine


Unnamed: 0,neighbor_local_idx,title,similarity,method
0,1697,The Elder Scrolls V Skyrim,726.014709,dot
1,962,Left 4 Dead 2,722.882324,dot
2,1251,Portal,634.229858,dot
3,1496,Sid Meier's Civilization V,544.233826,dot
4,241,Borderlands 2,491.693085,dot
5,813,Half-Life 2 Lost Coast,473.601562,dot
6,812,Half-Life 2 Deathmatch,364.989532,dot
7,1759,The Witcher 2 Assassins of Kings Enhanced Edition,283.640381,dot
8,1060,Metro 2033,266.89563,dot
9,71,Age of Empires II HD Edition,236.442474,dot


Unnamed: 0,neighbor_local_idx,title,similarity,method
0,813,Half-Life 2 Lost Coast,0.629747,jaccard
1,812,Half-Life 2 Deathmatch,0.464174,jaccard
2,1251,Portal,0.416667,jaccard
3,814,Half-Life Source,0.300971,jaccard
4,962,Left 4 Dead 2,0.221176,jaccard
5,202,BioShock,0.182573,jaccard
6,1060,Metro 2033,0.179775,jaccard
7,1697,The Elder Scrolls V Skyrim,0.173228,jaccard
8,241,Borderlands 2,0.170418,jaccard
9,459,Day of Defeat,0.168254,jaccard


Unnamed: 0,neighbor_local_idx,title,similarity,method
0,813,Half-Life 2 Lost Coast,0.663343,pearson
1,1251,Portal,0.577217,pearson
2,812,Half-Life 2 Deathmatch,0.439291,pearson
3,814,Half-Life Source,0.409475,pearson
4,1060,Metro 2033,0.319026,pearson
5,466,Dead Space,0.301186,pearson
6,962,Left 4 Dead 2,0.300368,pearson
7,202,BioShock,0.289545,pearson
8,1759,The Witcher 2 Assassins of Kings Enhanced Edition,0.286265,pearson
9,687,Fallout New Vegas Honest Hearts,0.276141,pearson


## When is each similarity better? (practical guidance)

Below is a decision guide. This is *contextual*: the best similarity depends on your signal (implicit vs explicit), sparsity, and what bias you can tolerate.

| Similarity | Best for | Strengths | Weaknesses / gotchas |
|---|---|---|---|
| **Cosine** | **Implicit feedback** (views/clicks/playtime), bag-of-words, embeddings | Scale-invariant; reduces popularity bias; works well with sparse vectors | Still correlates with popularity via co-occurrence; needs a normalization step |
| **Dot product** | When magnitude *should* matter (e.g., confidence-weighted signals) | Very simple; fast | Strong popularity/activity bias; tends to recommend head items |
| **Jaccard (binary)** | Pure co-occurrence / sets (bought/not bought), very noisy weights | Ignores weight noise; interpretable | Throws away intensity (hours); can be too strict with sparse overlap |
| **Pearson** | **Explicit ratings** (1–5 stars) | Removes mean bias (“lenient rater” vs “strict rater”) | Unstable on sparse implicit data; requires co-rated overlap and centering choices |
| **Euclidean → similarity** | When you want a *distance* notion on numeric features | Simple geometric interpretation | Sensitive to scale and magnitude; distance is not a similarity (we convert with `1/(1+d)`); can behave poorly on very sparse high-dimensional data |
| **Manhattan → similarity** | Robust alternative distance (L1) on numeric features | Less sensitive to outliers than Euclidean in some settings | Same caveats as Euclidean; still scale-sensitive; conversion choice matters |

### Rule of thumb

- If you have **implicit interactions** like this repo: start with **cosine** (what the repository already does).
- If weights are unreliable or you only trust *presence*: try **Jaccard**.
- If you have **explicit ratings**: Pearson (user-user) or adjusted cosine can be strong baselines.
- If you intentionally want popularity to dominate (e.g., trending): dot product (or just a popularity model) will do that.

If you want more methods that fit this domain well, good next candidates are:

- **Tanimoto / extended Jaccard** for weighted vectors
- **BM25-weighted cosine** (useful for binary implicit events such as “user viewed item”)
- **Adjusted cosine** / mean-centered cosine
- **Cosine on learned embeddings** (matrix factorization or neural embeddings)
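
As one illustration of the first candidate, here is a minimal sketch of Tanimoto (extended Jaccard) similarity, using the common definition `a·b / (‖a‖² + ‖b‖² − a·b)`. The function name `tanimoto_sim_matrix` is hypothetical and not part of `steamrec`; unlike the helpers earlier in this notebook, it takes an item-user matrix directly, and the toy data is made up.

```python
import numpy as np
from scipy import sparse


def tanimoto_sim_matrix(item_user: sparse.csr_matrix) -> np.ndarray:
    # Pairwise dot products between item vectors.
    dots = (item_user @ item_user.T).toarray().astype(np.float64)
    # Squared L2 norm of each item vector.
    sq_norms = np.asarray(item_user.multiply(item_user).sum(axis=1)).ravel().astype(np.float64)
    denom = sq_norms[:, None] + sq_norms[None, :] - dots
    out = np.zeros_like(dots)
    # Leave pairs of all-zero vectors at similarity 0.
    np.divide(dots, denom, out=out, where=denom > 0)
    return out


# Made-up item-user matrix: items 0 and 1 are binary, item 2 is item 0 with doubled weights.
toy = sparse.csr_matrix(np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 0.0, 2.0, 0.0],
]))
T = tanimoto_sim_matrix(toy)
print(T.round(3))
```

For binary vectors this reduces to ordinary Jaccard (items 0 and 1 share one of three distinct users). Unlike cosine, it is not scale-invariant: item 2 points in the same direction as item 0 but scores below 1 because its weights differ.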

## Optional exercises

Try changing:

- `purchase_weight`, `play_weight`, `min_play_hours` in `build_implicit_interactions`
- sample sizes `n_users`, `n_items`
- the selected user (`u_local`)

Then re-run and see which similarity is more stable for your chosen scenario.