Logs
- [2024/04/30]   
  You do not need to restart this notebook when updating the scratch library

In [3]:
import csv
import re
from collections import defaultdict
from typing import NamedTuple, List

import matplotlib.pyplot as plt
import numpy as np
import tqdm

from scratch.linear_algebra import LinearAlgebra as la
from scratch.working_with_data import DimReduction
from scratch.deep_learning import DeepLearning as dl

In [None]:
plt.rcParams.update(plt.rcParamsDefault)
plt.rcParams.update({
  'font.size': 16,
  'grid.alpha': 0.25})

In [None]:
%load_ext autoreload
%autoreload 2 

## Matrix Factorization

In [None]:
# Path relative to the current directory; modify it if your files are elsewhere.
path_to_file = "./datasets/ml-100k/"
MOVIES = path_to_file + "u.item"   # pipe-delimited: movie_id|title|...
RATINGS = path_to_file + "u.data"  # tab-delimited: user_id, movie_id, rating, timestamp

We define a `Rating` class to make the data easier to handle

In [None]:
class Rating(NamedTuple):
  user_id: str
  movie_id: str 
  rating: float

> The `movie_id` and `user_id` are actually integers, but they're not    
> consecutive, which means if we worked with them as integers we'd end up with  
> a lot of wasted dimensions (unless we renumbered everything). So to keep it    
> simpler we'll just treat them as strings.
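
For comparison, here is a minimal sketch of the renumbering that the note alludes to. The IDs are made up, and the names `raw_ids` and `index_of` are illustrative only (they are not used elsewhere in this notebook):

```python
# Hypothetical, non-consecutive integer IDs (illustrative only)
raw_ids = [3, 17, 17, 250, 3]

# Map each distinct ID to a consecutive index, in order of first appearance
index_of = {}
for raw_id in raw_ids:
  index_of.setdefault(raw_id, len(index_of))

renumbered = [index_of[raw_id] for raw_id in raw_ids]
print(index_of)     # {3: 0, 17: 1, 250: 2}
print(renumbered)   # [0, 1, 1, 2, 0]
```

Using the string IDs directly as dictionary keys avoids maintaining this extra mapping.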

Let's read in the data and explore it

In [None]:
# We specify this encoding to avoid a UnicodeDecodeError.
# See: https://stackoverflow.com/a/53136168/1076346 -- Encoding issues while reading/importing CSV file in Python3 Pandas
with open(MOVIES, encoding="iso-8859-1") as fp:
  reader = csv.reader(fp, delimiter="|")
  movies = {movie_id: title for movie_id, title, *_ in reader}


In [None]:
# Create a list of [Rating]
with open(RATINGS, encoding="iso-8859-1") as fp:
  reader = csv.reader(fp, delimiter="\t")
  ratings = [Rating(user_id, movie_id, float(rating))
              for user_id, movie_id, rating, _ in reader]

# 1682 movies rated by 943 users
assert len(movies) == 1682
assert len({rating.user_id for rating in ratings}) == 943

An exploratory data analysis:
- The average ratings for _Star Wars_ movies 

In [None]:
# Data structure for accumulating ratings by movie_id
star_wars_ratings = {movie_id: [] for movie_id, title in movies.items()
                      if re.search("Star Wars|Empire Strikes|Jedi", title)}

# Iterate over ratings, accumulating the Star Wars ones
for rating in ratings:
  if rating.movie_id in star_wars_ratings:
    star_wars_ratings[rating.movie_id].append(rating.rating)

# Compute the average rating for each movie
avg_ratings = [(sum(title_ratings) / len(title_ratings), movie_id)
                for movie_id, title_ratings in star_wars_ratings.items()]

# And then print them in order
for avg_rating, movie_id in sorted(avg_ratings, reverse=True):
  print(f"{avg_rating:.2f} {movies[movie_id]}")

4.36 Star Wars (1977)
4.20 Empire Strikes Back, The (1980)
4.01 Return of the Jedi (1983)


Let's come back to the datasets `movies` and `ratings`. We want to come up  
with a model that predicts the ratings. First, we split the ratings data into  
train, validation, and test sets

In [None]:
seed = 24_04_30
rng = np.random.default_rng(seed)

# Re-read the ratings so that re-running this cell does not shuffle an
# already-shuffled list in place
# Create a list of [Rating]
with open(RATINGS, encoding="iso-8859-1") as fp:
  reader = csv.reader(fp, delimiter="\t")
  ratings = [Rating(user_id, movie_id, float(rating))
              for user_id, movie_id, rating, _ in reader]

rng.shuffle(ratings)

split1 = int(len(ratings) * 0.7)
split2 = int(len(ratings) * 0.85)

train = ratings[:split1]              # 70% of the data
validation = ratings[split1:split2]   # 15% of the data
test = ratings[split2:]               # 15% of the data
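
The split logic can be checked in isolation on a small stand-in list. This is a minimal sketch: `split_three_ways` is a hypothetical helper (not part of the scratch library), and plain integers stand in for `Rating` objects:

```python
import numpy as np

def split_three_ways(data, frac_train=0.7, frac_val=0.15, seed=0):
  """Shuffle a copy of `data` and cut it into train/validation/test lists."""
  rng = np.random.default_rng(seed)
  shuffled = list(data)          # copy, so the caller's sequence is not reordered
  rng.shuffle(shuffled)
  cut1 = int(len(shuffled) * frac_train)
  cut2 = int(len(shuffled) * (frac_train + frac_val))
  return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]

toy_train, toy_val, toy_test = split_three_ways(range(100))

# Every item lands in exactly one of the three parts
assert sorted(toy_train + toy_val + toy_test) == list(range(100))
print(len(toy_train), len(toy_val), len(toy_test))
```

Shuffling a copy, as done here, is one way to sidestep the repeated in-place shuffling that the cell above avoids by re-reading the file.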


It is always good to have a simple baseline model and to make sure that the  
model we build does better than it.

In [None]:
avg_rating = sum(rating.rating for rating in train) / len(train)
baseline_error = sum((rating.rating - avg_rating) ** 2
                      for rating in test) / len(test)

# This is what we hope to do better than
assert 1.26 < baseline_error < 1.27
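
The mean is a natural baseline because, among all constant predictions, it minimizes the squared error on the data it was computed from. A small check on made-up ratings (the numbers are illustrative, not taken from MovieLens):

```python
toy_ratings = [1.0, 3.0, 4.0, 4.0, 5.0]   # made-up ratings

def mse_of_constant(guess, values):
  """Mean squared error of predicting `guess` for every value."""
  return sum((v - guess) ** 2 for v in values) / len(values)

mean_rating = sum(toy_ratings) / len(toy_ratings)
candidates = [3.0, mean_rating, 4.0]
errors = {guess: mse_of_constant(guess, toy_ratings) for guess in candidates}

# The mean gives the smallest error of the three constant guesses
assert min(errors, key=errors.get) == mean_rating
print(mean_rating, errors)
```

In the cell above the mean comes from `train` while the error is measured on `test`, so the guarantee is only approximate there.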

Creating embeddings for users and movies

In [None]:
seed = 24_04_30
rng = np.random.default_rng(seed)
EMBEDDING_DIM = 2

# Find unique ids
user_ids = {rating.user_id for rating in ratings}
movie_ids = {rating.movie_id for rating in ratings}

# Then create a random vector per id
user_vectors = {user_id: dl.random_tensor(EMBEDDING_DIM, rng=rng) 
                for user_id in user_ids}
movie_vectors = {movie_id: dl.random_tensor(EMBEDDING_DIM, rng=rng) 
                  for movie_id in movie_ids}

[Review: dictionaries]   
Retrieving a mutable value (such as a list or a NumPy array) from a dictionary    
and assigning it to a new variable does not copy it. Both names refer to the    
same object, so modifying it through the new variable also changes the    
value stored in the dictionary

In [4]:
print("A list as a value of a dictionary")
a_dict = {"k1": 2, "k2": [5, 6]}
print(f"a_dict: {a_dict}")

b_list = a_dict["k2"]   # a second name for the same list; no copy is made!
b_list[0] = -2

print(f"a_dict: {a_dict}")

print("\nA numpy array as a value of a dictionary")
a_dict_with_numpy = {"k1": 2, "k2": np.array([5, 6])}
print(f"a_dict_with_numpy: {a_dict_with_numpy}")

b_numpy = a_dict_with_numpy["k2"]
b_numpy[0] = -2
print(f"a_dict_with_numpy: {a_dict_with_numpy}")

A list as a value of a dictionary
a_dict: {'k1': 2, 'k2': [5, 6]}
a_dict: {'k1': 2, 'k2': [-2, 6]}

A numpy array as a value of a dictionary
a_dict_with_numpy: {'k1': 2, 'k2': array([5, 6])}
a_dict_with_numpy: {'k1': 2, 'k2': array([-2,  6])}
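
If an independent copy is wanted instead, it has to be made explicitly. A short sketch using `list.copy()` and `ndarray.copy()`:

```python
import numpy as np

a_dict = {"k2": [5, 6], "k3": np.array([5, 6])}

b_list = a_dict["k2"].copy()    # new list object, detached from the dict
b_array = a_dict["k3"].copy()   # new array with its own data buffer
b_list[0] = -2
b_array[0] = -2

print(a_dict["k2"], a_dict["k3"])   # the originals are unchanged
print(a_dict["k2"] is b_list)       # False: two different objects
```

The training loop below deliberately does not copy: it relies on the shared reference so that updating a vector updates the dictionary entry.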


Now we write a training loop for our model

In [None]:
def loop(dataset: List[Rating], learning_rate: float = None) -> None:
  with tqdm.tqdm(dataset) as t:
    loss = 0.0
    for i, rating in enumerate(t):
      movie_vector = movie_vectors[rating.movie_id]  # a reference to the stored vector, not a copy
      user_vector = user_vectors[rating.user_id]     # a reference to the stored vector, not a copy
      predicted = la.dot(user_vector, movie_vector)
      error = predicted - rating.rating
      loss += error ** 2 

      if learning_rate is not None:
        #   predicted = m_0 * u_0 + ... + m_k * u_k
        # So each u_j enters output with coefficient m_j
        # and each m_j enters output with coefficient u_j
        user_gradient = [error * m_j for m_j in movie_vector]
        movie_gradient = [error * u_j for u_j in user_vector]

        # Take gradient steps
        # -- [Note]: updating user_vector and movie_vector will change
        #            user_vectors and movie_vectors
        for j in range(EMBEDDING_DIM):
          user_vector[j] -= learning_rate * user_gradient[j]   
          movie_vector[j] -= learning_rate * movie_gradient[j]
      
      t.set_description(f"avg loss: {loss / (i + 1)}")
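
To make the update concrete, here is one gradient step on a single made-up rating, written with plain lists so that it does not depend on the scratch library. The squared error is `(u . m - r) ** 2`, and its derivative with respect to `u_j` is `2 * (u . m - r) * m_j`. The loop above drops the constant factor 2, which only rescales the learning rate. The vectors, rating, and learning rate below are illustrative assumptions:

```python
user_vector = [0.5, -0.2]     # made-up embedding
movie_vector = [1.0, 2.0]     # made-up embedding
true_rating = 4.0
learning_rate = 0.05

def dot(v, w):
  return sum(v_i * w_i for v_i, w_i in zip(v, w))

error_before = dot(user_vector, movie_vector) - true_rating

# Both gradients are computed from the pre-update vectors, as in the loop above
user_gradient = [error_before * m_j for m_j in movie_vector]
movie_gradient = [error_before * u_j for u_j in user_vector]
for j in range(len(user_vector)):
  user_vector[j] -= learning_rate * user_gradient[j]
  movie_vector[j] -= learning_rate * movie_gradient[j]

error_after = dot(user_vector, movie_vector) - true_rating
print(error_before ** 2, error_after ** 2)   # the squared error shrinks
```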

In [None]:
N_epochs = 20    # takes about 12 minutes and a lot of resources
learning_rate = 0.05
for epoch in range(N_epochs):
  learning_rate *= 0.9
  print(epoch, learning_rate)
  loop(train, learning_rate=learning_rate)
  loop(validation)

loop(test)

0 0.045000000000000005


avg loss: 6.114683603097698: 100%|██████████| 70000/70000 [00:27<00:00, 2529.02it/s] 
avg loss: 1.2843711989112063: 100%|██████████| 15000/15000 [00:05<00:00, 2586.17it/s]


1 0.04050000000000001


avg loss: 1.1314823025058312: 100%|██████████| 70000/70000 [00:27<00:00, 2525.47it/s]
avg loss: 1.0864504499259353: 100%|██████████| 15000/15000 [00:05<00:00, 2546.90it/s]


2 0.03645000000000001


avg loss: 1.0234978405687394: 100%|██████████| 70000/70000 [00:27<00:00, 2547.25it/s]
avg loss: 1.0390623703356796: 100%|██████████| 15000/15000 [00:05<00:00, 2605.12it/s]


3 0.03280500000000001


avg loss: 0.9812651379818479: 100%|██████████| 70000/70000 [00:27<00:00, 2552.21it/s]
avg loss: 1.0120019862066691: 100%|██████████| 15000/15000 [00:05<00:00, 2575.92it/s]


4 0.02952450000000001


avg loss: 0.9529818585920801: 100%|██████████| 70000/70000 [00:27<00:00, 2554.18it/s]
avg loss: 0.9920729800925361: 100%|██████████| 15000/15000 [00:05<00:00, 2610.79it/s]


5 0.02657205000000001


avg loss: 0.9304333465645488: 100%|██████████| 70000/70000 [00:27<00:00, 2550.56it/s]
avg loss: 0.9759284893406273: 100%|██████████| 15000/15000 [00:06<00:00, 2476.46it/s]


6 0.02391484500000001


avg loss: 0.9113048797855346: 100%|██████████| 70000/70000 [00:28<00:00, 2474.94it/s]
avg loss: 0.9624217771363592: 100%|██████████| 15000/15000 [00:05<00:00, 2557.13it/s]


7 0.021523360500000012


avg loss: 0.8947573917030615: 100%|██████████| 70000/70000 [00:27<00:00, 2546.69it/s]
avg loss: 0.9510325926720634: 100%|██████████| 15000/15000 [00:05<00:00, 2594.01it/s]


8 0.01937102445000001


avg loss: 0.8803588449760007: 100%|██████████| 70000/70000 [00:27<00:00, 2563.54it/s]
avg loss: 0.9414253683962088: 100%|██████████| 15000/15000 [00:05<00:00, 2526.70it/s]


9 0.01743392200500001


avg loss: 0.8677913233198074: 100%|██████████| 70000/70000 [00:29<00:00, 2374.94it/s]
avg loss: 0.9333298474113576: 100%|██████████| 15000/15000 [00:05<00:00, 2508.28it/s]


10 0.015690529804500006


avg loss: 0.8567849026342591: 100%|██████████| 70000/70000 [00:27<00:00, 2535.68it/s]
avg loss: 0.9265141135003028: 100%|██████████| 15000/15000 [00:05<00:00, 2535.78it/s]


11 0.014121476824050006


avg loss: 0.8471076704714128: 100%|██████████| 70000/70000 [00:28<00:00, 2458.37it/s]
avg loss: 0.9207784539236817: 100%|██████████| 15000/15000 [00:07<00:00, 2058.72it/s]


12 0.012709329141645007


avg loss: 0.8385635141800183: 100%|██████████| 70000/70000 [00:28<00:00, 2441.93it/s]
avg loss: 0.9159520617023131: 100%|██████████| 15000/15000 [00:05<00:00, 2560.67it/s]


13 0.011438396227480507


avg loss: 0.8309886155551292: 100%|██████████| 70000/70000 [00:27<00:00, 2543.00it/s]
avg loss: 0.9118896538698553: 100%|██████████| 15000/15000 [00:05<00:00, 2567.07it/s]


14 0.010294556604732457


avg loss: 0.8242468213549047: 100%|██████████| 70000/70000 [00:27<00:00, 2511.77it/s]
avg loss: 0.9084681200329222: 100%|██████████| 15000/15000 [00:05<00:00, 2562.30it/s]


15 0.00926510094425921


avg loss: 0.8182249721231604: 100%|██████████| 70000/70000 [00:27<00:00, 2533.53it/s]
avg loss: 0.9055835090232455: 100%|██████████| 15000/15000 [00:06<00:00, 2461.82it/s]


16 0.00833859084983329


avg loss: 0.8128287614251115: 100%|██████████| 70000/70000 [00:28<00:00, 2478.52it/s]
avg loss: 0.9031484346128095: 100%|██████████| 15000/15000 [00:05<00:00, 2560.80it/s]


17 0.007504731764849962


avg loss: 0.8079792798211634: 100%|██████████| 70000/70000 [00:28<00:00, 2455.48it/s]
avg loss: 0.9010898223692646: 100%|██████████| 15000/15000 [00:05<00:00, 2579.41it/s]


18 0.006754258588364966


avg loss: 0.8036102026445161: 100%|██████████| 70000/70000 [00:31<00:00, 2190.86it/s]
avg loss: 0.8993468794712797: 100%|██████████| 15000/15000 [00:10<00:00, 1462.05it/s]


19 0.00607883272952847


avg loss: 0.7996655168510675: 100%|██████████| 70000/70000 [00:47<00:00, 1460.24it/s]
avg loss: 0.8978692104365604: 100%|██████████| 15000/15000 [00:10<00:00, 1377.83it/s]
avg loss: 0.8992023894713554: 100%|██████████| 15000/15000 [00:05<00:00, 2731.03it/s]
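
The learning rates printed alongside each epoch follow a geometric decay: the rate is multiplied by 0.9 before each pass, so epoch `k` trains with `0.05 * 0.9 ** (k + 1)`. A quick check of this closed form against the update rule in the training cell:

```python
initial_rate = 0.05
decay = 0.9
n_epochs = 20

# Rates produced by repeated multiplication, as in the training cell
rate = initial_rate
looped = []
for _ in range(n_epochs):
  rate *= decay
  looped.append(rate)

# Closed form for the same schedule
closed_form = [initial_rate * decay ** (k + 1) for k in range(n_epochs)]

assert all(abs(a - b) < 1e-12 for a, b in zip(looped, closed_form))
print(f"{looped[0]:.4f} {looped[-1]:.6f}")   # 0.0450 0.006079
```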


We use principal component analysis to inspect the learned vectors

In [None]:
original_vectors = list(movie_vectors.values())
components = DimReduction.pca(original_vectors, 2)

dv: 4454.975: 100%|██████████| 100/100 [00:00<00:00, 206.05it/s]
dv: 986.594: 100%|██████████| 100/100 [00:00<00:00, 214.80it/s]
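
As an independent illustration of what PCA extracts, the principal directions can also be obtained from a singular value decomposition. This is a sketch with NumPy on made-up 2-D points, not the `DimReduction.pca` routine used above, and it centers the data explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up points stretched along the direction (1, 1), plus a little noise
t = rng.normal(size=200)
points = np.column_stack([t, t]) + 0.05 * rng.normal(size=(200, 2))

centered = points - points.mean(axis=0)
# Rows of vt are unit-length principal directions, strongest first
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
first_component = vt[0]

# Projecting onto the components gives the transformed coordinates
projected = centered @ vt.T
print(first_component, singular_values)
```

The first component should point (up to sign) along roughly `(0.707, 0.707)`, the direction along which the toy points vary most.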


Let's transform our vectors to represent the principal components and join in   
the movie IDs and average ratings

In [None]:
ratings_by_movie = defaultdict(list)
for rating in ratings:
  ratings_by_movie[rating.movie_id].append(rating.rating)

vectors = [
  (movie_id, sum(ratings_by_movie[movie_id]) / len(ratings_by_movie[movie_id]), 
  movies[movie_id], 
  vector)
  # Assumption: `transform` is exposed by DimReduction alongside `pca`
  for movie_id, vector in zip(movie_vectors.keys(),
                              DimReduction.transform(original_vectors, components))
]