# Homework: Collaborative Filtering

This notebook is a graded homework that you will turn in. You must complete Problems 1 and 2 and then *either* Problem 3 *or* Problem 4. If you submit a notebook with both Problems 3 and 4 completed, I will grade only Problem 3.

You may:

* Ask questions of and work with other students in this course. You must each write up your own solutions.
* *Special exception: for Problems 1+2, you may work with other students and copy each others' code.* Those problems are copied verbatim from the in-class exercises.
* Ask questions of the instructor. Your instructor is generous with hints and guidance.
* Use the [Python Documentation](https://docs.python.org/3.11/), especially the [tutorial section](https://docs.python.org/3/tutorial/index.html).
* Use the PyTorch and fast.ai documentation, as well as any other relevant documentation.
* Use the [W3Schools](https://www.w3schools.com/python/) and [Python for Everybody](https://www.py4e.com/) tutorials/reference guides.

You *may not*:
* Discuss this with any other students, faculty, your friends, your family, &c. before it is submitted.
* Use Stack Overflow, Google, ChatGPT or other unspecified resources.


## Problems in this Notebook

1. Collaborative Filtering: New Dataset
2. Nearest Neighbor Distance
3. Improve the Model
4. PCA and UMAP Embeddings

In [1]:
# Load some libraries my dudes
from fastai.collab import *
from fastai.tabular.all import *

## 1. Another Dataset

Using a new dataset, fit the best recommender system you can, using the techniques from class. Here are some recommendations (haha) for datasets hosted on kaggle and around the web. Pick one you feel you know enough about, to make the rest of the problems more tractable.

* [Goodreads book ratings](https://www.kaggle.com/datasets/zygmunt/goodbooks-10k)
* [Anime ratings from MyAnimeList](https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020)
* [Board game recommendations from BoardGameGeek](https://www.kaggle.com/datasets/nfedorov/top-2000-board-games-ratings)
* [Steam video game interaction](https://www.kaggle.com/datasets/tamber/steam-video-games)
* [Amazon Music Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html)

Hints to get things running more smoothly:
1. [Rename your columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) to be `"user"`, `"title"`, and `"rating"`. Also, when you create a learner with `collab_learner()`, make sure to set `use_nn=False`. The code for Problem 2 assumes those, and you'll have to either rename your columns here or edit the code below.
2. Enable GPU on your notebook! Some recent update to PyTorch causes a bunch of warnings to pop up.
3. Scale up your batch size as large as you can and still fit into GPU RAM.
4. Scale down your dataset if necessary, using the [sample method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).
5. Make sure your epochs run relatively quickly! We don't have all day, people.

Load in your data and fit a model. You should be able to get better than about 15% accuracy (rmse of $\lesssim 0.8$ if ratings are out of 5, or $\lesssim 1.5$ if out of 10).
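For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings, so it is expressed in the same units as the ratings themselves. Here is a minimal pure-Python sketch; the `rmse_score` helper and the toy ratings are made up for illustration and are not part of fastai.

```python
import math

def rmse_score(predictions, targets):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions for four ratings on a 1-5 scale
predicted = [4.2, 3.1, 4.8, 2.5]
actual = [5, 3, 4, 2]
print(f"RMSE: {rmse_score(predicted, actual):.3f}")
```

In the fastai training tables later in this notebook, the `_rmse` column is the square root of the `valid_loss` column, because the loss function there is the mean squared error.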

In [2]:
%env KAGGLE_USERNAME=sophia4827
%env KAGGLE_KEY=ac334c3351a2af38b3f6b0ce9d9922a6

!kaggle datasets download -d zygmunt/goodbooks-10k
!unzip goodbooks-10k

env: KAGGLE_USERNAME=sophia4827
env: KAGGLE_KEY=ac334c3351a2af38b3f6b0ce9d9922a6
Dataset URL: https://www.kaggle.com/datasets/zygmunt/goodbooks-10k
License(s): CC-BY-SA-4.0
Downloading goodbooks-10k.zip to /content
 86% 10.0M/11.6M [00:01<00:00, 11.4MB/s]
100% 11.6M/11.6M [00:01<00:00, 7.21MB/s]
Archive:  goodbooks-10k.zip
  inflating: book_tags.csv           
  inflating: books.csv               
  inflating: ratings.csv             
  inflating: sample_book.xml         
  inflating: tags.csv                
  inflating: to_read.csv             


In [4]:
import pandas as pd

In [5]:
ratings = pd.read_csv("ratings.csv")
books = pd.read_csv("books.csv")

books = books[["id", "original_title"]]

In [6]:
# Note: this renames the *book* ID column to "user"; the reader ID stays in "user_id"
ratings.rename(columns = {'book_id':'user'}, inplace = True)

In [7]:
books.rename(columns = {'id':'user'}, inplace = True)

In [8]:
books.rename(columns = {'original_title':'title'}, inplace = True)

In [9]:
ratings = pd.merge(ratings, books, on="user")
ratings

Unnamed: 0,user,user_id,rating,title
0,1,314,5,The Hunger Games
1,1,439,3,The Hunger Games
2,1,588,5,The Hunger Games
3,1,1169,4,The Hunger Games
4,1,1185,4,The Hunger Games
...,...,...,...,...
981751,10000,48386,5,The First World War
981752,10000,49007,4,The First World War
981753,10000,49383,5,The First World War
981754,10000,50124,5,The First World War
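Note that in the cells above it is the `book_id` column that ends up named `user`, so a model built on this frame treats each book as a "user" and never sees `user_id`. Here is a sketch of an alternative that keeps readers as users. It assumes `ratings.csv` has `user_id`, `book_id`, and `rating` columns and `books.csv` has `id` and `original_title`, as the output above suggests. The two small frames are made-up stand-ins for the real files, and the names `ratings_raw`, `books_raw`, `books_small`, and `ratings_fixed` are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for ratings.csv and books.csv (same column names)
ratings_raw = pd.DataFrame({
    "book_id": [1, 1, 2, 2],
    "user_id": [314, 439, 314, 588],
    "rating":  [5, 3, 4, 2],
})
books_raw = pd.DataFrame({
    "id": [1, 2],
    "original_title": ["The Hunger Games", "Harry Potter and the Philosopher's Stone"],
})

# Give the book table the same key name as the ratings table, and the expected title name
books_small = books_raw.rename(columns={"id": "book_id", "original_title": "title"})

# The reader (not the book) becomes the "user" column; the book is identified by its title
ratings_fixed = (
    ratings_raw.rename(columns={"user_id": "user"})
    .merge(books_small, on="book_id")
    [["user", "title", "rating"]]
)
print(ratings_fixed)
```

With the columns in this order, `CollabDataLoaders.from_df(ratings_fixed, item_name='title')` should pick up `user` as the user column, since it defaults to the first column of the frame.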


In [10]:
# use_nn is an argument of collab_learner(), not of the DataLoaders, so it is left out here
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=10000)
dls.show_batch()

Unnamed: 0,user,title,rating
0,3366,Echo Park,4
1,7778,論語 [Lún Yǔ],3
2,4597,The Souls of Black Folk,3
3,8388,"Elven Star (The Death Gate Cycle, #2)",2
4,8983,The Secret Between Us,3
5,5499,An Ice Cold Grave,4
6,8035,Black Sun Rising,5
7,9128,Retribution,4
8,6815,#na#,5
9,7078,#na#,3


In [11]:
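class RecommenderNN(Module):
    """Embeds users and books, then feeds the concatenated vectors through a small MLP."""
    def __init__(self, user_sz, item_sz, y_range=(0.5, 5.5), n_act=100):
        # user_sz and item_sz are (number of classes, embedding width) pairs
        self.user_embedding = Embedding(*user_sz)
        self.book_embedding = Embedding(*item_sz)

        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))

        # Predictions are squashed into [min, max] by the sigmoid in forward()
        self.min, self.max = y_range

    def forward(self, x):
        users = self.user_embedding(x[:,0])   # column 0 holds the "user" IDs
        books = self.book_embedding(x[:,1])   # column 1 holds the title IDs
        # Concatenate the two embeddings side by side for each row
        embeddings = torch.cat([users, books], dim=1)

        raw_rating = self.layers(embeddings)

        return torch.sigmoid(raw_rating)*(self.max-self.min) + self.min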

In [12]:
embs = get_emb_sz(dls)

model = RecommenderNN(*embs, [0.5,5.5])
learn = Learner(dls, model, loss_func=MSELossFlat(), metrics=rmse)

learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,_rmse,time
0,1.094552,0.946939,0.973108,00:04
1,0.925749,0.915133,0.956626,00:01
2,0.89376,0.910588,0.954247,00:01
3,0.883093,0.908515,0.953161,00:01
4,0.88013,0.90755,0.952654,00:01


## 2. Cosine Similarity

We can use the embeddings for each title to determine which items are closest to each other (alternatively, which other user has the most similar taste). Imagine each embedding as a vector in space. Two items are similar to each other if the angle between their vectors, $\theta$, is small. This is usually reported as $\cos\theta$, because it's fast to calculate. It's also useful for humans: $\cos 0^\circ=1$ and $\cos 90^\circ=0$, so it will always be higher for similar vectors. (In general, for high-dimensional spaces, two random directions are usually close to orthogonal to each other.)
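As a small worked example, the cosine similarity of two vectors $a$ and $b$ is $\cos\theta = \frac{a\cdot b}{\lVert a\rVert\,\lVert b\rVert}$. The sketch below computes it in plain Python. The three "embeddings" are made-up 3-dimensional vectors, and `cosine_similarity` is an illustrative helper, not the PyTorch module used later.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration
book_a = [1.0, 2.0, 0.0]
book_b = [2.0, 4.0, 0.0]   # same direction as book_a, twice the length
book_c = [0.0, 0.0, 3.0]   # perpendicular to book_a

print(cosine_similarity(book_a, book_b))  # close to 1: same direction
print(cosine_similarity(book_a, book_c))  # 0: orthogonal directions
```

Note that the length of a vector does not matter here, only its direction.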

Below is code [adapted from the textbook](https://github.com/fastai/fastbook/blob/master/08_collab.ipynb) which will find some similar titles based on one submitted. You almost certainly will need to modify your code based on the exact format of your dataset.

(a) First, get the code below to work. It's based on when I did the homework, and I was looking at board games.

In [13]:
# Getting a list of some of the unique titles in the dataset
ratings["title"].unique()[1:10]

array(["Harry Potter and the Philosopher's Stone", 'Twilight',
       'To Kill a Mockingbird', 'The Great Gatsby',
       'The Fault in Our Stars', 'The Hobbit or There and Back Again',
       'The Catcher in the Rye', 'Angels & Demons ',
       'Pride and Prejudice'], dtype=object)

In [14]:
# This is the item's name that I'm looking up.
# I did this assignment with board games
itemname = "The Fault in Our Stars"

In [15]:
weights = learn.model.book_embedding.weight                  # Grab the embeddings for the items
idx = torch.tensor(dls.classes['title'].o2i[itemname]) # Determine where this item is
idx # This will be 0 if the item you looked up doesn't exist
    # (or you picked the 0th entry, but you should know that)

tensor(6479)

In [16]:
# Calculate the cosine similarity between every item and the chosen one
similarities = nn.CosineSimilarity()(weights, weights[idx,None])

# Sort so the most similar items come first, and keep the top 10
top10 = similarities.argsort(descending=True)[0:10]
for i in top10:
    print(dls.classes['title'][i])

The Fault in Our Stars
The Further Adventures of Sherlock Holmes (Classic Crime)
The Blood Mirror
Transmetropolitan Vol. 4: The New Scum
The Crippled God
Kitchen no Ohimesama
UnDivided
The Sweet Gum Tree
Saga, Volume Five
Peace Is Every Step: The Path of Mindfulness in Everyday Life


(b) Produce a few recommendation lists based on items in the dataset. Comment on the results, especially relative to the model accuracy you got in Problem 1.

Do you see any trends in the predictions?

In [None]:
# UnDivided and Kitchen no Ohimesama are plausible matches for The Fault in Our
# Stars because both are aimed at young adult readers. The rest of the list is
# geared more towards adults, although young adults or older teens could read
# those books too. The model from Problem 1 reached an RMSE of about 0.95, which
# means its predictions are typically off by about one star (it is not a 95%
# accuracy), and that is worse than the 0.8 target. With a model this weak, it
# is not surprising that only a couple of the neighbors look clearly related.

(c) If we're using the dot product bias model, do you expect two similar songs of different popularity to have a higher or lower cosine similarity than two very popular songs which are otherwise quite different?

Explain.

In [None]:
# If we're using the dot product bias model, I expect two similar songs of different
# popularity to have a higher cosine similarity than two very popular songs which are
# otherwise quite different, because similar songs share several features, which gives
# a smaller θ between their vectors and therefore a higher cosine similarity.
# On the other hand, two very popular songs may have some components that overlap
# because of traits that appeal to a wide audience. If they are otherwise different,
# then this should lead to a lower cosine similarity, because all the other
# differing features of the songs would give them a larger θ.
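To make the role of the bias term concrete, here is a toy sketch of a dot-product-plus-bias prediction. All vectors and bias values are invented for illustration; in a trained model they would come from the learned embedding and bias weights. The helper names `cosine_similarity` and `predict` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def predict(user_vec, item_vec, user_bias, item_bias):
    """Dot product of the latent factors plus the two bias terms."""
    return sum(u * v for u, v in zip(user_vec, item_vec)) + user_bias + item_bias

user = [0.5, -0.2]
# Two songs with the same latent direction but different popularity (bias)
niche_song, niche_bias = [1.0, 0.5], -0.3
hit_song, hit_bias = [1.0, 0.5], 0.9
# A popular song with a very different latent direction
other_hit, other_hit_bias = [-0.5, 1.0], 0.9

print(cosine_similarity(niche_song, hit_song))   # similar songs, different popularity
print(cosine_similarity(hit_song, other_hit))    # equally popular, different songs
# The popularity gap shows up in the prediction through the bias, not the embedding
print(predict(user, hit_song, 0.0, hit_bias) - predict(user, niche_song, 0.0, niche_bias))
```

In this toy setup the popularity difference lives entirely in the bias values, so it has no effect on the cosine similarity between the embedding vectors.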

## 3. Modifying the Model

To add inputs to the model beyond the user, title, and rating, we're going to have to take a few steps:

1. Change the input data
2. Change the DataLoaders
3. Modify the model to take extra inputs

Let me lead you through those.

(a) Modify the DataFrame that contains your dataset to add at least one continuous variable related to the *item* or *title* that you're making predictions for.

In [18]:
new_books = pd.read_csv("books.csv")
new_books = new_books[["id","original_title", "original_publication_year"]]

In [19]:
new_books.rename(columns = {'id':'user'}, inplace = True)

In [20]:
new_books.rename(columns = {'original_title':'title'}, inplace = True)

In [21]:
new_ratings = pd.merge(ratings, new_books, on="user")
new_ratings

Unnamed: 0,user,user_id,rating,title_x,title_y,original_publication_year
0,1,314,5,The Hunger Games,The Hunger Games,2008.0
1,1,439,3,The Hunger Games,The Hunger Games,2008.0
2,1,588,5,The Hunger Games,The Hunger Games,2008.0
3,1,1169,4,The Hunger Games,The Hunger Games,2008.0
4,1,1185,4,The Hunger Games,The Hunger Games,2008.0
...,...,...,...,...,...,...
981751,10000,48386,5,The First World War,The First World War,1998.0
981752,10000,49007,4,The First World War,The First World War,1998.0
981753,10000,49383,5,The First World War,The First World War,1998.0
981754,10000,50124,5,The First World War,The First World War,1998.0


In [22]:
del new_ratings["title_x"]

In [23]:
new_ratings.rename(columns = {'title_y':'title'}, inplace = True)

In [24]:
new_ratings

Unnamed: 0,user,user_id,rating,title,original_publication_year
0,1,314,5,The Hunger Games,2008.0
1,1,439,3,The Hunger Games,2008.0
2,1,588,5,The Hunger Games,2008.0
3,1,1169,4,The Hunger Games,2008.0
4,1,1185,4,The Hunger Games,2008.0
...,...,...,...,...,...
981751,10000,48386,5,The First World War,1998.0
981752,10000,49007,4,The First World War,1998.0
981753,10000,49383,5,The First World War,1998.0
981754,10000,50124,5,The First World War,1998.0


(b) Use a `TabularDataLoaders`, rather than a `CollabDataLoaders`, to load in the data. Make sure your titles and usernames are categorical!

In [25]:
dataset = TabularPandas(new_ratings,
            procs=[Categorify,FillMissing,Normalize],
            cat_names = ['title', 'user'],
            cont_names = ['original_publication_year'],
            y_names='rating',
            splits=RandomSplitter()(range_of(new_ratings)))
dls = dataset.dataloaders(bs=10000)


Once you've successfully loaded them in, use your TabularDataLoaders' `one_batch()` method to see what format your data will be passed to your model. You should see a tuple of three tensors: one with the categorical data (one column of ID numbers for the user, one of ID numbers for the title, in order based on how you loaded the data), one with the continuous variables, and one with the target values (the ratings).

In [26]:
dls.one_batch()

(tensor([[7570, 3033,    1],
         [1829, 2288,    1],
         [3617,  957,    1],
         ...,
         [6797,    1,    1],
         [2593, 2098,    1],
         [ 785, 4951,    1]]),
 tensor([[-0.0170],
         [ 0.1252],
         [ 0.1834],
         ...,
         [ 0.1705],
         [-0.1075],
         [ 0.1511]]),
 tensor([[5],
         [4],
         [2],
         ...,
         [4],
         [4],
         [4]], dtype=torch.int8))

(c) Now it's time to modify the model itself. Here's our neural network model from class. You'll need to modify two things:
* In the `__init__()` function, you need to be able to take in additional continuous variables, that is, however many you're using in parts a and b, above.
* In the forward function, you will be passed *three* arguments. In addition to `self`, there will be an argument for the categorical IDs (in the model below, those are just called `x`), but now there will be an additional argument with the numeric values. Make sure the function takes those in as well, and then actually passes them into the neural network you made.

In [27]:
class RecommenderNN(Module):
    def __init__(self, user_sz, item_sz, num_continuous, y_range=(0.5, 5.5), n_act=(200, 100)):

        self.user_embedding = Embedding(*user_sz)
        self.book_embedding = Embedding(*item_sz)

        # The first linear layer takes both embeddings plus one input per
        # continuous variable, so its width grows by num_continuous
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1]+num_continuous, n_act[0]),
            nn.ReLU(),
            nn.Linear(n_act[0], n_act[1]),
            nn.ReLU(),
            nn.Linear(n_act[1], 1))

        self.min, self.max = y_range

    def forward(self, x, c):
        # x holds the categorical IDs, in the order given by cat_names;
        # c holds the normalized continuous variables
        users = self.user_embedding(x[:,0])
        books = self.book_embedding(x[:,1])
        embeddings = torch.cat([users, books, c], dim=1)

        raw_rating = self.layers(embeddings)

        return torch.sigmoid(raw_rating)*(self.max-self.min) + self.min

In [28]:
# Here's some starter code for feeding everything into the model.
# The variables names may or may not match what you've got.

embs = get_emb_sz(dls)

# Use the computed sizes for the first two categorical columns instead of hard-coding them
model = RecommenderNN(embs[0], embs[1], 1, [0.5,5.5])
learn2 = Learner(dls, model, loss_func=MSELossFlat(), metrics=rmse)


In [29]:
embs

[(9275, 267), (10001, 278), (3, 3)]

(d) Train the model. Do you get an improvement in prediction quality?

In [30]:
learn2.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,_rmse,time
0,1.054596,0.917376,0.957798,00:02
1,0.918318,0.913376,0.955707,00:02
2,0.890852,0.906383,0.952042,00:02
3,0.884143,0.902671,0.95009,00:02
4,0.881499,0.901824,0.949644,00:01


(e) We did this with a neural network. Could we have done it with the dot product model instead? Why or why not?

In [None]:
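# It could be done, but it would take more work than with the neural network.
# The dot product model only multiplies a user vector by an item vector (plus
# biases), so there is no input layer to concatenate the extra continuous
# variable onto; it would need its own separately learned term.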

## 4. PCA and UMAP: More Embedding Interpretation

Here are two (hopefully) useful articles to read. Seriously, read them. My explanations below are insufficient and more for me to learn how to write about them than actually useful.
* [Understanding PCA](https://setosa.io/ev/principal-component-analysis/)
* [Understanding UMAP](https://pair-code.github.io/understanding-umap/)

**Principal Component Analysis (PCA)** takes a matrix (like the weights in our model) and finds the best-fit line going through it (which will end up being a combination of each of the dimensions in the data). This is the principal component. The line won't be a perfect fit; so after removing it, another line is found, which is the 2nd principal component. This can be repeated $N$ times, where $N$ is the number of dimensions the data exist in (in our case, the number of latent factors in the embedding, not the number of unique items). In practice, usually just the first two components are used, so they can be graphed.
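The idea can be sketched in a few lines of NumPy. This version uses the singular value decomposition of the mean-centred matrix, which is one common way to compute principal components; it is an illustration under that assumption, not necessarily what fastai's `pca` method does internally. The `pca_project` helper and the random `toy_weights` matrix are made up for the example.

```python
import numpy as np

def pca_project(matrix, n_components=2):
    """Project the rows of `matrix` onto its first principal components."""
    centered = matrix - matrix.mean(axis=0)   # PCA works on mean-centred data
    # Rows of vt are the principal directions, ordered from most to least variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Made-up stand-in for an embedding matrix: 6 items, 4 latent factors
rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(6, 4))

coords = pca_project(toy_weights, n_components=2)
print(coords.shape)  # one (x, y) pair per item, ready to scatter-plot
```

Each row of `coords` is one item placed on the first two principal components, and the first column always carries at least as much variance as the second.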

**Uniform Manifold Approximation and Projection (UMAP)** tries to maintain clusters and distances between clusters. It operates not on the weights directly, like PCA does, but finds all pairwise distances between entries, which forms a matrix of distances between items like this:

| |item 1|item 2|item 3|
|:-:|:-:|:-:|:-:|
|item 1|0|1.2|4.2|
|item 2|1.2|0|2.0|
|item 3|4.2|2.0|0|

It then attempts to find a two-dimensional projection that best maintains that distance matrix.
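A distance matrix of this kind can be built with a short loop. The sketch below uses Euclidean distance on made-up 2-dimensional points; the choice of metric and the `pairwise_distances` helper are assumptions for illustration, and the UMAP library does this kind of bookkeeping internally.

```python
import math

def pairwise_distances(vectors):
    """Euclidean distance between every pair of vectors, as a nested list."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Made-up 2-dimensional points standing in for item embeddings
items = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
for row in pairwise_distances(items):
    print(row)
```

The result is symmetric with zeros on the diagonal, since the distance from item 1 to item 3 is the same as from item 3 to item 1 and every item is at distance zero from itself.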

___

(a) Below is code I've more or less copied from the textbook, with some reworking. I've also added comments.

What the code does is cut the dataset down to the 1000 items with the most ratings and perform PCA on them. It then actually visualizes the positions of the items with the 50 most ratings, along with labels.

Make the code work so you get your own PCA visualization.

In [None]:
# Get a list of the 1000 most-rated items in the dataset
# This means that these items should have the best-established embeddings
most_rated = ratings.groupby('title')['rating'].count()
most_rated = most_rated.sort_values(ascending=False).index.values[:1000]
most_rated[0:10]

array(['The Gift', ' ', 'Twilight', 'Perfect', 'Selected Poems',
       'Heartless', 'The Awakening', 'Gone', 'Defiance', 'Twisted'],
      dtype=object)

In [None]:
# Get the indices for the most-rated items
top_idxs = tensor([learn.dls.classes['title'].o2i[mr] for mr in most_rated])
# Extract the weights for those items (RecommenderNN stores them in book_embedding)
weights = learn.model.book_embedding.weight[top_idxs].cpu().detach()
# Perform PCA to get the 3 most informative dimensions
rating_pca = weights.pca(3)

In [None]:
fac0,fac1,fac2 = rating_pca.t()
top50 = list(range(50))

# Extract the arbitrary X and Y axes using factor 0 and 1 from the PCA
X = fac0[top50]
Y = fac1[top50]
plt.figure(figsize=(8,8))
plt.scatter(X, Y)
for i, x, y in zip(most_rated[top50], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

(b) Describe the *regions* in your plot, in the context of your dataset. You should be talking about the left/right, top/bottom, or topleft/bottomright &c. distinctions, not talking about specific clusterings. Point out a few examples to illustrate this.

If you check the textbook, it mentions that PCA on movie data seems to find a split between movies which are popular/unpopular amongst viewers (and made a lot of money) on one axis, and ones which are critically acclaimed or panned on another.

(c) Install and use the [UMAP](https://umap-learn.readthedocs.io/en/latest/) library, and repeat part a of this question. Luckily, UMAP uses essentially the same syntax as Scikit-Learn, which is what was used for PCA.

(d) UMAP preserves groupings and distances between groups, more or less. Repeat part b of this question, describing the clusters you find.