# Problem 2

## Data Import
First, download the MovieLens (small) [dataset](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip) and load its `movies.csv` and `ratings.csv` files as `pandas.DataFrame` objects.

In [1]:
import pandas as pd

path = "Misc_files/movielens_data/ml-latest-small/"

# load movies and ratings DataFrames
movies = pd.read_csv(path+"movies.csv", header=0)
ratings = pd.read_csv(path+"ratings.csv", header=0)

We can then use the `head()` method to see the raw format of these `DataFrame` objects.

In [2]:
n_movies = len(movies)

print(f"Number of Unique Movies: {n_movies}")
movies.head()

Number of Unique Movies: 9742


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [3]:
# count rows of the ratings DataFrame (not the movies DataFrame)
n_ratings = len(ratings)
n_users = ratings.userId.nunique()
n_rated_movies = ratings.movieId.nunique()

print(f"Number of Ratings: {n_ratings}\nNumber of Users: {n_users}\nNumber of Unique Rated Movies: {n_rated_movies}")
ratings.head()

Number of Ratings: 100836
Number of Users: 610
Number of Unique Rated Movies: 9724


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


Upon inspection of the raw data we note that of the 9,742 movies in the `movies` DataFrame, only 9,724 movies have been rated.

## Preprocessing
We will be implementing the steps found in this [Princeton](https://www.cs.princeton.edu) _Movie Embeddings_ [problem](https://www.cs.princeton.edu/courses/archive/fall16/cos402/ex/MovieEmbedding.pdf).

### Co-occurrence Matrix $X$
To obtain the co-occurrence counts $X_{i,j}$ (the number of users who like both movie $i$ and movie $j$), we must first binary-encode (`0` or `1`) each `"rating"` in the `ratings` DataFrame. Let us encode whether a review counts as liking a movie as follows

$$ \text{Liked}(\text{Rating}) =
    \begin{cases}
        1 & \text{if Rating}\geq 4\\
        0 & \text{otherwise}
    \end{cases}$$

and store these values in a new `"liked"` column. We can subsequently drop the unnecessary `rating` and `timestamp` columns after this process.

In [4]:
import numpy as np

# create liked column
ratings["liked"] = np.where(ratings["rating"] >= 4, 1, 0)

# drop columns
ratings.drop(["rating", "timestamp"], axis=1, inplace=True)

We next create the `movie_ratings` DataFrame by joining the `movies` and `ratings` DataFrames. Setting the `merge` parameter `how="left"` ensures that all 9,742 original movies are retained after the join, including those without any ratings.

In [5]:
# left join on movieId
movie_ratings = pd.merge(movies, ratings, how="left", on="movieId").reset_index()
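To illustrate why the left join matters, here is a minimal sketch on toy data. The `toy_movies` and `toy_ratings` frames are hypothetical and only mimic the column names used above; they are not part of the MovieLens data.

```python
import pandas as pd

# toy data (hypothetical): movie 3 has no ratings at all
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame(
    {"userId": [10, 10, 11], "movieId": [1, 2, 1], "liked": [1, 0, 1]}
)

# a left join keeps every row of the left frame, matched or not
left = pd.merge(toy_movies, toy_ratings, how="left", on="movieId")

# an inner join keeps only movies that appear in both frames
inner = pd.merge(toy_movies, toy_ratings, how="inner", on="movieId")

print(left["movieId"].nunique())   # unrated movie 3 is still present
print(inner["movieId"].nunique())  # unrated movie 3 has been dropped
print(left[left["userId"].isna()])  # the unrated movie gets NaN in the ratings columns
```

Because the unmatched movie receives `NaN` in `userId`, pandas upcasts that column to floating point, which is why user ids can later show up as `1.0`, `2.0`, and so on.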

A user-likes interaction matrix can then be constructed using the `pivot_table` method, with one row per unique user (`n_users` rows) and one column per unique movie (`n_movies` columns) from the original data. This results in a sparse matrix whose rows summarize each user's liked movies.

In [6]:
# pivot table on userId
user_likes = movie_ratings.pivot_table(values="liked", index="userId", columns="movieId", dropna=False, fill_value=0)

user_likes

| userId / movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 607.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 608.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 609.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


The co-occurrence matrix $X$ can now be constructed as the matrix product of the transpose of `user_likes` with `user_likes` itself. Element $X_{i,j}$ is the number of users who like both movie $i$ and movie $j$; the diagonal, which would otherwise count the likes of each individual movie, is set to zero.

In [7]:
# convert to numpy ndarray for dot product computation
user_likes_array = user_likes.to_numpy()

# create X
X_co_occurrence = np.dot(user_likes_array.T, user_likes_array)

# fill diagonals of X with zeros
np.fill_diagonal(X_co_occurrence, 0)

# display as DataFrame for clarity
X_display = pd.DataFrame(X_co_occurrence, index=movies.movieId, columns = movies.movieId)

X_display

| movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 21 | 11 | 0 | 7 | 27 | 7 | 1 | 4 | 19 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 21 | 0 | 5 | 0 | 4 | 8 | 6 | 0 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 11 | 5 | 0 | 0 | 4 | 4 | 5 | 1 | 2 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 7 | 4 | 4 | 0 | 0 | 3 | 4 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193581 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 193583 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 193585 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 193587 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
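The construction of $X$ can be traced by hand on a tiny example. The sketch below uses a hypothetical 3-user, 3-movie `likes` matrix (not taken from the dataset) and the same transpose-product and `np.fill_diagonal` steps as the cell above.

```python
import numpy as np

# toy user-likes matrix (hypothetical): rows are users, columns are movies
likes = np.array([
    [1, 1, 0],  # user 0 likes movies 0 and 1
    [1, 1, 1],  # user 1 likes all three movies
    [0, 1, 1],  # user 2 likes movies 1 and 2
])

# entry (i, j) counts the users whose rows have a 1 in both column i and column j
X_toy = likes.T @ likes

# before zeroing, the diagonal holds the number of likes per movie
like_counts = np.diag(X_toy).copy()

# zero the diagonal, as done for the full matrix
np.fill_diagonal(X_toy, 0)

print(like_counts)  # [2 3 2]
print(X_toy)
# [[0 2 1]
#  [2 0 2]
#  [1 2 0]]
```

For instance, movies 0 and 1 are both liked by users 0 and 1, so the corresponding entry is 2, while movies 0 and 2 share only user 1.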


## Model Architecture

With preprocessing completed, we convert the co-occurrence matrix $X$ into a `torch` tensor and construct the cost function $c$.

In [154]:
import torch

# seed torch's random number generator for reproducibility
torch.manual_seed(576)

# set device
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")

X = torch.tensor(X_co_occurrence, dtype = torch.float32)

We implement the below cost function:
$$c(v_1,\ldots, v_M)=\sum_{i=1}^M\sum_{j=1}^M 1_{[i\neq j]}(v_i^T v_j-X_{i,j})^2$$
by creating a subclass of the `torch` `nn.Module` class.

In [161]:
import torch.nn as nn

class CostFunction(nn.Module):
    """Squared reconstruction error of X over all off-diagonal entries."""

    def forward(self, v, X):
        # mask that zeroes the diagonal terms (i == j), built on v's device
        off_diagonal = 1 - torch.eye(X.shape[0], device=v.device)

        # columns of v are the movie vectors, so (v^T v)[i, j] = v_i^T v_j
        costs = off_diagonal * (torch.mm(v.t(), v) - X) ** 2

        # sum of costs
        return costs.sum()

In [160]:
# initialize random parameters of v
torch.manual_seed(576)

v = torch.randn(n_movies, n_movies)
v

tensor([[-0.6716, -1.0309,  3.0239,  ..., -0.9355,  0.6747, -0.2185],
        [ 0.3729, -0.4028, -0.3033,  ...,  0.6581,  0.2533, -0.4573],
        [ 1.1745,  2.0087, -0.1851,  ..., -0.6043,  0.3173,  0.4497],
        ...,
        [ 0.5641,  0.8033,  1.3207,  ...,  0.5451, -0.4254,  1.9810],
        [-0.7843,  0.4902, -0.5627,  ...,  0.3236,  0.2838, -0.8483],
        [ 0.3050, -0.1802, -0.6253,  ..., -1.6900,  1.0148,  1.1621]])

In [159]:
import torch.optim as optim

# trainable copy of v on the selected device; X must live on the same device
v_copy = v.clone().to(device).requires_grad_()
X = X.to(device)

# cost function (it has no parameters of its own, so nothing needs moving)
cost_function = CostFunction()

# Adam optimizes v_copy directly
optimizer = optim.Adam([v_copy], lr=0.5)
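As a sanity check on the cost function $c$ defined above, the masked matrix expression can be compared against the double sum on a small example. This is a framework-free NumPy sketch; the sizes `M` and `d` and the random `V` and `X_toy` are arbitrary illustrative choices, with column `i` of `V` playing the role of $v_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 4, 2                      # 4 "movies", 2-dimensional vectors (arbitrary)
V = rng.normal(size=(d, M))      # column i is the vector v_i
X_toy = rng.integers(0, 5, size=(M, M)).astype(float)

# literal double sum from the formula, skipping the i == j terms
loop_cost = 0.0
for i in range(M):
    for j in range(M):
        if i != j:
            loop_cost += (V[:, i] @ V[:, j] - X_toy[i, j]) ** 2

# vectorized form: (1 - I) masks out the diagonal of the squared residuals
mask = 1 - np.eye(M)
matrix_cost = np.sum(mask * (V.T @ V - X_toy) ** 2)

print(np.isclose(loop_cost, matrix_cost))
```

The two values agree because multiplying by `1 - np.eye(M)` plays the role of the indicator $1_{[i\neq j]}$.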

In [116]:
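def train(cost_function, v, X, optimizer, n_epochs=100):
    # cost recorded at every epoch
    costs = []

    for epoch in range(n_epochs):
        # forward pass (calling the module invokes its forward method)
        cost = cost_function(v, X)

        # backward pass
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

        # store a plain float so the autograd graph is not kept alive
        costs.append(cost.item())

        if epoch % 5 == 0:
            print(f"Epoch: {epoch}\tCost: {cost.item(): .5f}")

    # return only after all epochs have run
    return costs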

In [111]:
history = train(cost_function, v_copy, X, optimizer)

Epoch: 0	Cost:  924582936576.00000
Epoch: 5	Cost:  124892168192.00000


KeyboardInterrupt: 
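Since the full-size run is slow, it can help to first check the optimization on a toy problem. The sketch below is not the Adam-based `torch` loop above: it swaps in plain gradient descent in NumPy with a hand-derived gradient, and it assumes a small embedding dimension `d`, a hypothetical random `likes` matrix, and an arbitrary step size. For symmetric $X$ the gradient of $c$ with respect to the matrix $V$ whose columns are the $v_i$ is $4VE$, where $E$ is the off-diagonal residual matrix.

```python
import numpy as np

def cost_and_grad(V, X):
    """Cost c and its gradient for column embeddings V of shape (d, M)."""
    mask = 1 - np.eye(X.shape[0])
    E = mask * (V.T @ V - X)   # off-diagonal residuals
    cost = np.sum(E ** 2)
    grad = 4 * V @ E           # valid because X (and hence E) is symmetric
    return cost, grad

rng = np.random.default_rng(0)

# toy co-occurrence matrix from hypothetical likes: 12 users x 5 movies
likes = rng.integers(0, 2, size=(12, 5))
X_toy = (likes.T @ likes).astype(float)
np.fill_diagonal(X_toy, 0)

d = 3                          # embedding dimension, much smaller than needed for exact fit
V = rng.normal(size=(d, 5))

history = []
for step in range(2000):
    cost, grad = cost_and_grad(V, X_toy)
    history.append(cost)
    V -= 1e-4 * grad           # plain gradient descent step

print(history[0], history[-1])
```

If the recorded cost does not fall on a problem this small, the cost function or the update step is the likely culprit rather than the size of the data.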

In [126]:
import tensorflow as tf

In [149]:
v_tf = tf.Variable(tf.random.normal((n_movies,n_movies)), dtype=tf.float32)

X_tf = tf.constant(X_co_occurrence, dtype=tf.float32)

In [150]:
class CostFunction(tf.Module):
    def __init__(self):
        # initialize the tf.Module base class
        super().__init__()

    def __call__(self, v, X):
        # mask that zeroes the diagonal terms (i == j)
        eye_matrix = tf.eye(n_movies)

        # sum of squared off-diagonal residuals of v^T v against X
        costs = tf.reduce_sum((1 - eye_matrix) * (tf.matmul(tf.transpose(v), v) - X)**2)

        return costs

In [151]:
def train(cost_function, v, X, optimizer, n_epochs=100):
    costs = []

    for epoch in range(n_epochs):
        # Forward pass
        with tf.GradientTape() as tape:
            cost = cost_function(v, X)

        # Backward pass
        gradients = tape.gradient(cost, v)
        optimizer.apply_gradients([(gradients, v)])

        costs.append(cost)

        if epoch % 5 == 0:
            print(f"Epoch: {epoch}\tCost: {cost.numpy():.5f}")

    return costs

In [153]:
cost_function = CostFunction()

optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=0.01)

# pass the TensorFlow variable and constant, not the torch tensors v and X
history = train(cost_function, v_tf, X_tf, optimizer)