# Memory-Based Collaborative Filtering with the MovieLens 100K Dataset

* Dataset: ML-100K
* CF-type: Memory
* Implicit/Explicit: Explicit
* User-User/Item-Item: Item-item


## Step 1: Install code dependencies

In [3]:
# Only necessary if you're not running the `.venv` environment kernel
# !pip install numpy pandas scipy scikit-learn

## Step 2: Import dependencies 

In [4]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

## Step 3: Load ml-100k dataset

We start by loading the raw ratings data from `u.data`. It's these ratings that eventually power the collaborative filtering.


In [None]:
ratings_columns = ["user_id", "item_id", "rating", "timestamp"]
df = pd.read_csv("../data/ml-100k/u.data", sep="\t", names=ratings_columns)
df.head()

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


To make our work a little more legible, let's bring in `u.item`, which contains movie metadata. We'll eventually merge these two datasets.

In [None]:
# Encoding specified to avoid decoding errors due to special characters
movies_columns = ["item_id", "title"] + [str(i) for i in range(22)]
movies_df = pd.read_csv(
    "../data/ml-100k/u.item",
    sep="|",
    names=movies_columns,
    usecols=[0, 1],
    encoding="latin-1",
)
movies_df.head()

,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


Merge the ratings with the movie titles on `item_id`.

In [None]:
df = df.merge(movies_df, on="item_id")
df.head()

,user_id,item_id,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,186,302,3,891717742,L.A. Confidential (1997)
2,22,377,1,878887116,Heavyweights (1994)
3,244,51,2,880606923,Legends of the Fall (1994)
4,166,346,1,886397596,Jackie Brown (1997)
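
A merge like this can silently add or drop rows if the join key is duplicated or missing on one side. The sketch below uses small made-up tables (`toy_ratings` and `toy_movies` are illustrative names, not part of the notebook) to show one way to guard against that with the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the ratings and movie tables (values are illustrative).
toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "item_id": [10, 20, 10], "rating": [4, 5, 3]}
)
toy_movies = pd.DataFrame({"item_id": [10, 20], "title": ["Movie A", "Movie B"]})

# validate="many_to_one" raises pandas.errors.MergeError if item_id is
# duplicated on the right-hand side, which would otherwise multiply rows.
toy_merged = toy_ratings.merge(toy_movies, on="item_id", validate="many_to_one")

# When every rating finds exactly one movie, the row count is unchanged.
rows_preserved = len(toy_merged) == len(toy_ratings)
print(rows_preserved)
```

The same check could be applied to the real merge by comparing `len(df)` before and after.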


## Step 4: Create a user-item interaction matrix

In [25]:
interaction_matrix = df.pivot_table(
    index="user_id", columns="title", values="rating", fill_value=0
)
interaction_matrix.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,0.0,0.0,2.0,5.0,0.0,0.0,3.0,4.0,0.0,0.0,...,0.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,2.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0
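
Two behaviors of `pivot_table` are worth keeping in mind here. Its default `aggfunc` is the mean, so if one user rated two different `item_id` values that share a title, the ratings are averaged into a single cell. And `fill_value=0` makes an unrated movie indistinguishable from a rating of 0. The toy data below is made up purely to illustrate both effects.

```python
import pandas as pd

# Made-up ratings: items 100 and 101 share a title, item 102 does not.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "item_id": [100, 101, 100, 102],
        "title": [
            "Same Title (1997)",
            "Same Title (1997)",
            "Same Title (1997)",
            "Other (1995)",
        ],
        "rating": [5, 2, 4, 3],
    }
)

toy_matrix = toy.pivot_table(
    index="user_id", columns="title", values="rating", fill_value=0
)

# User 1's two ratings for "Same Title (1997)" are averaged into one cell,
# and the movie user 1 never rated is filled with 0.
print(toy_matrix)
```

Pivoting on `item_id` instead of `title` would keep such entries separate, at the cost of less readable column labels.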


## Step 5: Convert to sparse matrix

We convert the interaction matrix to a compressed sparse row (CSR) matrix, which stores only the nonzero entries. Computations can then skip the zeros (the unrated movies) without losing any information.

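Converting our data into a sparse matrix has a few benefits:
1. Lower memory usage: only the nonzero ratings are stored.
2. Faster computation: operations such as dot products skip the zeros.
3. Compatibility with ML libraries: scikit-learn functions such as `cosine_similarity` accept sparse input.
4. A clearer representation of user-item interactions: the stored entries are exactly the observed ratings.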

_Note how the original interaction matrix and the sparse matrix have the same shape `(rows [users] x cols [movies])`. The sparse matrix is simply a computational abstraction._

In [None]:
sparse_matrix = csr_matrix(interaction_matrix.values)

print(f"Original interaction matrix shape: {interaction_matrix.shape}")
print(f"Sparse matrix shape: {sparse_matrix.shape}")

Original interaction matrix shape: (943, 1664)
Sparse matrix shape: (943, 1664)
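
To make the memory benefit concrete, the sketch below builds a random stand-in matrix with the same shape and compares the dense and CSR footprints. The roughly 6% fill rate is an assumption chosen for illustration, and the values are random rather than real ratings.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Random stand-in with the notebook's shape; about 6% of cells are filled
# (an assumed fill rate, not measured from the real data).
rng = np.random.default_rng(0)
dense = np.zeros((943, 1664))
mask = rng.random(dense.shape) < 0.06
dense[mask] = rng.integers(1, 6, size=mask.sum())  # ratings 1 through 5

sparse = csr_matrix(dense)

# Fraction of cells that actually hold a rating.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])

# CSR stores three arrays: the values, their column indices, and row pointers.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"density: {density:.3f}")
print(f"dense: {dense_bytes:,} bytes, CSR: {sparse_bytes:,} bytes")
```

The same three lines of arithmetic can be run on `interaction_matrix.values` and `sparse_matrix` to measure the real dataset.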


## Step 7: Compute item-item cosine similarity

We transpose the sparse matrix because `cosine_similarity(X)` computes similarity between the rows of `X`. So, if we want item-item similarity, the rows of `X` need to represent items.

In practice, `sparse_matrix.T` converts `sparse_matrix` in the following way:
```
[num_users x num_movies] → [num_movies x num_users]
```

Now each row represents a movie, and its values are the ratings given by all users.

So `cosine_similarity(sparse_matrix.T)` computes the similarity between movies based on user rating patterns, which is exactly what we want for item-item collaborative filtering.

In [29]:
item_similarity = cosine_similarity(sparse_matrix.T)
item_similarity_df = pd.DataFrame(
    item_similarity,
    index=interaction_matrix.columns,
    columns=interaction_matrix.columns,
)

item_similarity_df

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
'Til There Was You (1997),1.000000,0.000000,0.024561,0.099561,0.185236,0.159265,0.000000,0.052203,0.000000,0.033326,...,0.000000,0.000000,0.000000,0.027774,0.118840,0.142315,0.029070,0.000000,0.110208,0.000000
1-900 (1994),0.000000,1.000000,0.014139,0.009294,0.007354,0.004702,0.010055,0.067038,0.000000,0.000000,...,0.152499,0.015484,0.000000,0.069284,0.018243,0.023408,0.006694,0.079640,0.042295,0.000000
101 Dalmatians (1996),0.024561,0.014139,1.000000,0.167006,0.061105,0.143878,0.203781,0.225803,0.027642,0.092337,...,0.000000,0.021965,0.030905,0.274877,0.204267,0.101199,0.056976,0.172155,0.045714,0.000000
12 Angry Men (1957),0.099561,0.009294,0.167006,1.000000,0.056822,0.167235,0.304078,0.422506,0.072682,0.394854,...,0.060946,0.016502,0.000000,0.403270,0.259436,0.145519,0.105226,0.038901,0.060101,0.081261
187 (1997),0.185236,0.007354,0.061105,0.056822,1.000000,0.132327,0.042928,0.065060,0.043133,0.027300,...,0.000000,0.141997,0.000000,0.068257,0.067786,0.091293,0.099490,0.025184,0.142667,0.096449
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Young Guns II (1990),0.142315,0.023408,0.101199,0.145519,0.091293,0.225578,0.109645,0.163646,0.000000,0.074020,...,0.153493,0.000000,0.000000,0.256593,0.621260,1.000000,0.051655,0.000000,0.099333,0.000000
"Young Poisoner's Handbook, The (1995)",0.029070,0.006694,0.056976,0.105226,0.099490,0.196235,0.120116,0.155034,0.039261,0.063504,...,0.000000,0.013371,0.000000,0.133463,0.097146,0.051655,1.000000,0.022923,0.000000,0.175581
Zeus and Roxanne (1997),0.000000,0.079640,0.172155,0.038901,0.025184,0.064405,0.017218,0.026388,0.000000,0.021899,...,0.000000,0.053025,0.000000,0.054753,0.000000,0.000000,0.022923,1.000000,0.000000,0.000000
unknown,0.110208,0.042295,0.045714,0.060101,0.142667,0.042755,0.064008,0.096697,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.101775,0.058064,0.099333,0.000000,0.000000,1.000000,0.000000
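
As a sanity check on what the transpose does, the following sketch compares `cosine_similarity` on a tiny made-up matrix against the textbook formula, dot product divided by the product of the norms. The matrix values are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# 3 users x 2 movies: rows are users, columns are movies.
toy = csr_matrix(np.array([[5.0, 4.0], [0.0, 0.0], [3.0, 0.0]]))

# After the transpose each row is a movie, so the result is movie x movie.
sim = cosine_similarity(toy.T)

# Manual cosine similarity between the two movie columns.
a = toy.toarray()[:, 0]
b = toy.toarray()[:, 1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(sim.shape)
print(np.isclose(sim[0, 1], manual))
```

Without the transpose, the same call would return a 3 x 3 user-user matrix instead.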


## Step 8: User recommendation function

In [None]:
def recommend_items(user_id, top_n=5):
    # Grab all movies rated by the user
    user_ratings = interaction_matrix.loc[user_id]
    # Keep only rated movies: interaction_matrix uses 0 as the fill value for missing ratings
    rated_items = user_ratings[user_ratings > 0].index.tolist()

    # For each movie the user rated, add its similarity to every movie into a running total
    scores = pd.Series(dtype=float)
    for item in rated_items:
        similar_items = item_similarity_df[item]
        scores = scores.add(similar_items, fill_value=0)

    # Remove already rated items
    scores = scores.drop(rated_items, errors="ignore")
    return scores.sort_values(ascending=False).head(top_n)
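
The accumulation loop in `recommend_items` amounts to summing, for each candidate movie, its similarity to every movie the user rated. The sketch below checks that on a made-up 3 x 3 similarity table (`toy_sim` and `rated` are illustrative names) and shows a vectorized form that should give the same totals.

```python
import pandas as pd

# Made-up symmetric similarity table for three movies.
titles = ["A", "B", "C"]
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]],
    index=titles,
    columns=titles,
)
rated = ["A", "B"]  # movies the hypothetical user has rated

# Loop form, mirroring recommend_items.
loop_scores = pd.Series(dtype=float)
for item in rated:
    loop_scores = loop_scores.add(toy_sim[item], fill_value=0)

# Vectorized form: select the rated columns and sum across them.
vector_scores = toy_sim[rated].sum(axis=1)

# Dropping the rated movies leaves only the candidates to recommend.
print(vector_scores.drop(rated))
```

Note that neither form uses the rating values themselves: a movie the user rated 1 contributes as much as one rated 5.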

## Demo

In [40]:
user = 1
user_ratings = interaction_matrix.loc[user]
print(f"Top rated movies for User {user}:")
print(user_ratings.sort_values(ascending=False)[:10])
print("\n\n\n")

recommendations = recommend_items(user_id=user, top_n=5)
print(f"Top 5 recommended movies for User {user}:")
print(recommendations)

Top rated movies for User 1:
title
Eat Drink Man Woman (1994)                5.0
Monty Python and the Holy Grail (1974)    5.0
Bound (1996)                              5.0
Pillow Book, The (1995)                   5.0
Godfather, The (1972)                     5.0
Jean de Florette (1986)                   5.0
Blade Runner (1982)                       5.0
Postino, Il (1994)                        5.0
Toy Story (1995)                          5.0
Big Night (1996)                          5.0
Name: 1, dtype: float64




Top 5 recommended movies for User 1:
title
E.T. the Extra-Terrestrial (1982)    105.062429
Speed (1994)                          99.508648
Batman (1989)                         98.866387
True Lies (1994)                      98.659959
Stand by Me (1986)                    97.365360
dtype: float64
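
The recommendation scores are far above 1 because each one is a sum of cosine similarities over every movie the user rated, so they are useful for ranking but are not predicted ratings. One common variant, which is not what this notebook implements, weights each similarity by the user's rating and divides by the total similarity. The sketch below shows that variant on made-up data; `predict_ratings`, `toy_sim`, and `toy_user_ratings` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up symmetric similarity table for four movies.
titles = ["A", "B", "C", "D"]
toy_sim = pd.DataFrame(
    [
        [1.0, 0.2, 0.9, 0.1],
        [0.2, 1.0, 0.1, 0.8],
        [0.9, 0.1, 1.0, 0.3],
        [0.1, 0.8, 0.3, 1.0],
    ],
    index=titles,
    columns=titles,
)
# 0 means "not rated", matching the notebook's fill value.
toy_user_ratings = pd.Series({"A": 5.0, "B": 1.0, "C": 0.0, "D": 0.0})


def predict_ratings(user_ratings, similarity_df):
    """Similarity-weighted average of the user's ratings for each unrated item."""
    rated = user_ratings[user_ratings > 0]
    # Rows: every item; columns: only the items this user rated.
    sims = similarity_df.loc[:, rated.index]
    weighted = sims.mul(rated, axis=1).sum(axis=1)
    total = sims.sum(axis=1)
    # Items with no similarity to anything rated get NaN instead of dividing by 0.
    predicted = weighted / total.where(total > 0)
    return predicted.drop(rated.index).sort_values(ascending=False)


toy_predictions = predict_ratings(toy_user_ratings, toy_sim)
print(toy_predictions)
```

With this normalization the outputs stay on the 1 to 5 rating scale, and a movie similar to something the user disliked is pulled down rather than up.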
