## Recommender systems 

Recommender systems are algorithms widely used by companies in e-commerce, streaming platforms, and other sectors. Their primary goal is to predict user preferences and suggest items or content that match those preferences.

Recommender systems are designed to enhance user experience by providing personalized recommendations. They leverage data from users’ past interactions, such as:

* Purchase History: In e-commerce, systems suggest products based on what users have previously bought or browsed.
* Ratings: On platforms like movie or music services, recommendations are often based on the ratings users give to items.
* Behavioral Data: This includes browsing history, search queries, and time spent on certain content.

#### Methods Used
* Collaborative Filtering: This method makes predictions based on the behavior and preferences of similar users. For example, if User A and User B have similar tastes, and User A likes a new product, the system may recommend that product to User B.

    Collaborative filtering can be further classified as:
    * User-based CF
    * Item-based CF
    * Matrix factorization

* Content-Based Filtering: Recommendations are based on the attributes of items and users' past preferences. For instance, if a user frequently watches action movies, the system might suggest new action films.

* Hybrid Approaches: These combine collaborative and content-based filtering to improve recommendation accuracy and mitigate the shortcomings of individual methods.
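
As a rough illustration of the content-based idea described above, the sketch below builds a user profile from the attribute vectors of watched items and ranks the remaining items by cosine similarity to that profile. The item names, the three genre features, and the `watched` list are all hypothetical toy data, not part of any real catalog.

```python
from math import sqrt

# Hypothetical item attributes: [action, comedy, drama]
items = {
    "action_1": [1.0, 0.0, 0.0],
    "action_2": [1.0, 0.0, 0.2],
    "comedy_1": [0.0, 1.0, 0.0],
    "drama_1": [0.0, 0.1, 1.0],
}
watched = ["action_1"]  # hypothetical viewing history of one user


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# User profile: the average attribute vector of the watched items.
profile = [sum(col) / len(watched) for col in zip(*(items[m] for m in watched))]

# Score every unseen item against the profile and pick the best match.
scores = {name: cosine(profile, vec) for name, vec in items.items() if name not in watched}
best = max(scores, key=scores.get)
print(best, scores)
```

A user who has only watched an action title gets the other action title ranked first, because its attribute vector points in almost the same direction as the profile.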

#### Applications
* E-commerce: Suggests products similar to those previously viewed or purchased.
* Streaming Platforms: Recommends movies, shows, or music based on viewing or listening history.
* Social Media: Curates posts or friend suggestions based on interaction patterns.

Recommender systems play a crucial role in personalizing user experiences, increasing engagement, and driving sales by predicting and catering to individual preferences.

### Collaborative Filtering

Collaborative Filtering (CF) is a popular recommendation algorithm that predicts user preferences based on the behavior and preferences of similar users. The core idea is that if users share similar tastes or behaviors, then items liked by one user can be recommended to another similar user.

#### Types of collaborative filtering

* User-Based Collaborative Filtering (User CF): This approach identifies users with similar tastes to the target user. It assumes that if User A and User B have a high overlap in their preferences, they will likely share future preferences as well.

We can formalize this approach with a user-item matrix $R$, where $R_{ij}$ is the rating given by user $i$ to item $j$. The similarity between users $u$ and $v$ is typically computed with cosine similarity or the Pearson correlation. The Pearson form is:


$Sim(u, v) = \frac{\sum_{i \in I_{uv}} (R_{ui} - \bar{R_u})(R_{vi} - \bar{R_v})}{\sqrt{\sum_{i \in I_{uv}} (R_{ui} - \bar{R_u})^2} \cdot \sqrt{\sum_{i \in I_{uv}} (R_{vi} - \bar{R_v})^2}}$


where $\bar{R_u}$ and $\bar{R_v}$ are the average ratings of users $u$ and $v$, respectively, and $I_{uv}$ is the set of items rated by both users.
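
The user-user similarity above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the `ratings` dictionary is hypothetical toy data, and each user's mean is taken over all items that user rated (the formula does not say whether the mean should be restricted to co-rated items).

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Missing items are unrated.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 3.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def user_mean(user):
    """Average of all ratings given by `user`."""
    values = ratings[user].values()
    return sum(values) / len(values)


def pearson_user_sim(u, v):
    """Pearson-style similarity over the items rated by both u and v."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    mu_u, mu_v = user_mean(u), user_mean(v)
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den_u = sqrt(sum((ratings[u][i] - mu_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings[v][i] - mu_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # similarity is undefined when a user has no variance
    return num / (den_u * den_v)


print(pearson_user_sim("alice", "bob"))    # similar tastes: positive
print(pearson_user_sim("alice", "carol"))  # opposite tastes: negative
```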

* Item-Based Collaborative Filtering (Item CF): This approach identifies items similar to those the target user has rated or liked. It assumes that users who liked one item are likely to like similar items. If User A likes Item X and Item X is similar to Item Y, the system will recommend Item Y to User A.

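$Sim(i, j) = \frac{\sum_{u \in U_{ij}} (R_{ui} - \bar{R_u})(R_{uj} - \bar{R_u})}{\sqrt{\sum_{u \in U_{ij}} (R_{ui} - \bar{R_u})^2} \cdot \sqrt{\sum_{u \in U_{ij}} (R_{uj} - \bar{R_u})^2}}$

where $U_{ij}$ is the set of users who rated both items $i$ and $j$. Because each rating is centered on the user's mean $\bar{R_u}$, this form is known as adjusted cosine similarity.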
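
The item-item similarity can be sketched the same way. The example below is a minimal illustration with hypothetical toy ratings; each rating is centered on the mean of all ratings given by that user before the two item columns are compared.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}.
ratings = {
    "u1": {"X": 5.0, "Y": 4.0, "Z": 1.0},
    "u2": {"X": 4.0, "Y": 5.0, "Z": 2.0},
    "u3": {"X": 1.0, "Y": 2.0, "Z": 5.0},
}


def item_sim(i, j):
    """Similarity between items i and j over users who rated both."""
    users = [u for u, r in ratings.items() if i in r and j in r]
    num = den_i = den_j = 0.0
    for u in users:
        mean_u = sum(ratings[u].values()) / len(ratings[u])
        d_i = ratings[u][i] - mean_u  # centered rating of item i
        d_j = ratings[u][j] - mean_u  # centered rating of item j
        num += d_i * d_j
        den_i += d_i ** 2
        den_j += d_j ** 2
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / (sqrt(den_i) * sqrt(den_j))


print(item_sim("X", "Y"))  # rated alike by the same users: positive
print(item_sim("X", "Z"))  # rated in opposite ways: negative
```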

* Matrix Factorization: Matrix factorization techniques decompose the user-item matrix $R$ into two lower-dimensional matrices, typically referred to as $U$ (user features) and $V$ (item features). The product of these matrices approximates the original matrix $R$. This approach captures latent features of users and items, which helps in making predictions for missing entries in $R$.

The goal is to find matrices $U$ (user matrix) and $V$ (item matrix) such that their product approximates the original matrix $R$:

$R \approx U \cdot V^T$

Here, $U$ is a matrix of size $m \times k$ (where $m$ is the number of users and $k$ is the number of latent features), and $V$ is a matrix of size $n \times k$ (where $n$ is the number of items). $k$ is typically much smaller than $m$ or $n$.

The optimization problem can be formulated as:

$\min_{U, V} \sum_{(i, j) \in K} (R_{ij} - U_i \cdot V_j^T)^2 + \lambda (\| U \|^2 + \| V \|^2)$

where $K$ is the set of observed ratings, and $\lambda$ is a regularization parameter to prevent overfitting.
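
One common way to minimize this objective is stochastic gradient descent over the observed entries. The sketch below is only an illustration: the 4x3 matrix, the convention that 0 marks an unobserved rating, and the hyperparameters `k`, `lam`, and `lr` are all assumptions chosen for a toy example. The regularization is applied per observed rating, a frequent simplification of the exact gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rating matrix; 0 marks an unobserved entry (assumption).
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
    [0.0, 1.0, 4.0],
])
observed = [(i, j) for i in range(R.shape[0]) for j in range(R.shape[1]) if R[i, j] > 0]

k, lam, lr = 2, 0.02, 0.01  # latent features, regularization, learning rate
U = 0.1 * rng.standard_normal((R.shape[0], k))  # user factors, m x k
V = 0.1 * rng.standard_normal((R.shape[1], k))  # item factors, n x k


def objective(U, V):
    """Squared error on observed entries plus the L2 penalty."""
    err = sum((R[i, j] - U[i] @ V[j]) ** 2 for i, j in observed)
    return err + lam * (np.sum(U ** 2) + np.sum(V ** 2))


initial = objective(U, V)
for epoch in range(2000):
    for i, j in observed:
        e = R[i, j] - U[i] @ V[j]  # prediction error for this rating
        u_old = U[i].copy()
        U[i] += lr * (e * V[j] - lam * U[i])
        V[j] += lr * (e * u_old - lam * V[j])
final = objective(U, V)

R_hat = U @ V.T  # dense prediction, including the unobserved entries
print(initial, final)
print(np.round(R_hat, 2))
```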

Example: In Singular Value Decomposition (SVD), matrix factorization is performed as follows:

$R = U \Sigma V^T$

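where $\Sigma$ is a diagonal matrix of singular values, and $U$ and $V$ contain the left and right singular vectors. Keeping only the $k$ largest singular values yields a rank-$k$ approximation of $R$. Classical SVD requires a complete matrix, so missing ratings must be filled in before it can be applied.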
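
A small NumPy sketch of the SVD example, assuming a fully observed toy matrix (the values are hypothetical). It checks that the full decomposition reconstructs $R$ and that truncating to the top $k$ singular values gives a low-rank approximation.

```python
import numpy as np

# Hypothetical, fully observed rating matrix (4 users x 3 items).
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Thin SVD: U is 4x3, s holds the singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_full = U @ np.diag(s) @ Vt

# Rank-k approximation from the k largest singular values.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
error = np.linalg.norm(R - R_k)  # Frobenius norm of the residual

print(np.round(s, 3))
print(np.round(R_k, 2))
print(error)
```

The residual norm equals the square root of the sum of the discarded squared singular values, which is why dropping small singular values loses little information.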

Matrix factorization methods are powerful because they can discover hidden patterns in data and are often more effective in capturing complex relationships compared to simpler user-based or item-based methods.



## Neural Collaborative Filtering 

Neural Collaborative Filtering (NCF) is an advanced recommendation algorithm that uses neural networks to model user-item interactions. Unlike traditional collaborative filtering methods, which rely on linear models, NCF uses deep learning techniques to capture complex patterns in the data. Here is a detailed explanation with the mathematical formulation:

### Basic Concepts

**User-Item Matrix**: Let $R$ be the user-item interaction matrix, where $R_{ij}$ represents the interaction (e.g., rating, click, purchase) between user $i$ and item $j$.

**Latent Vectors**:
   - $\mathbf{p}_i$: Latent vector for user $i$ (user embedding).
   - $\mathbf{q}_j$: Latent vector for item $j$ (item embedding).


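   In matrix factorization, which **Generalized Matrix Factorization (GMF)** reduces to when its output weights are all ones and its activation is the identity, the interaction between user $i$ and item $j$ is modeled as: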
   $$
   \hat{y}_{ij} = \mathbf{p}_i^T \mathbf{q}_j
   $$
   where $\mathbf{p}_i$ and $\mathbf{q}_j$ are learned via optimization.


NCF extends matrix factorization by replacing the dot product with a neural network that can model non-linear interactions between users and items. The key components are:

**Embedding Layers**:
   Users and items are mapped to latent vectors (tensors) using embedding layers:
   $$
   \mathbf{p}_i = \text{Embedding}_U(i)
   $$
   $$
   \mathbf{q}_j = \text{Embedding}_I(j)
   $$

**Concatenation Layer**:
   The user and item embeddings are concatenated to form a joint representation:
   $$
   \mathbf{z}_{ij} = [\mathbf{p}_i; \mathbf{q}_j]
   $$

**Neural Network Layers**:
   The concatenated vector $\mathbf{z}_{ij}$ is fed into a multi-layer perceptron (MLP) to model the interaction:
   $$
   \mathbf{h}_1 = f_1(\mathbf{W}_1 \mathbf{z}_{ij} + \mathbf{b}_1)
   $$
   $$
   \mathbf{h}_2 = f_2(\mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2)
   $$
   $$
   \vdots
   $$
   $$
   \mathbf{h}_L = f_L(\mathbf{W}_L \mathbf{h}_{L-1} + \mathbf{b}_L)
   $$
   where $\mathbf{W}_l$ and $\mathbf{b}_l$ are the weights and biases of the $l$-th layer, and $f_l$ is the activation function (e.g., ReLU).

**Prediction Layer**:
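   The output of the final MLP layer is passed through a prediction layer to produce the predicted interaction:
   $$
   \hat{y}_{ij} = \sigma(\mathbf{w}^T \mathbf{h}_L + b)
   $$
   where $\mathbf{w}$ and $b$ are the weights and bias of the prediction layer, which project $\mathbf{h}_L$ to a scalar, and $\sigma$ is an activation function, typically a sigmoid for binary interactions.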
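
The components above (embeddings, concatenation, MLP, prediction layer) can be put together in a short PyTorch sketch. The class name `NCF`, the embedding size, and the hidden layer widths are illustrative assumptions, not values prescribed by the text. The final linear layer projects the last hidden vector to a scalar before the sigmoid.

```python
import torch
import torch.nn as nn


class NCF(nn.Module):
    """Minimal NCF-style model: embeddings -> concat -> MLP -> sigmoid."""

    def __init__(self, n_users, n_items, emb_dim=8, hidden=(16, 8)):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)  # p_i
        self.item_emb = nn.Embedding(n_items, emb_dim)  # q_j
        layers, in_dim = [], 2 * emb_dim
        for out_dim in hidden:  # h_l = ReLU(W_l h_{l-1} + b_l)
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        self.mlp = nn.Sequential(*layers)
        self.out = nn.Linear(in_dim, 1)  # projects h_L to a scalar

    def forward(self, users, items):
        z = torch.cat([self.user_emb(users), self.item_emb(items)], dim=1)
        h = self.mlp(z)
        return torch.sigmoid(self.out(h)).squeeze(1)


torch.manual_seed(0)
model = NCF(n_users=10, n_items=20)
users = torch.tensor([0, 3, 9])   # zero-based user indices
items = torch.tensor([5, 0, 19])  # zero-based item indices
y_hat = model(users, items)
print(y_hat)
```

Each output is a value between 0 and 1 that can be read as the predicted probability of an interaction for that user-item pair.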

### Loss Function

The model is trained using a loss function that measures the discrepancy between the predicted interactions $\hat{y}_{ij}$ and the actual interactions $y_{ij}$. For binary interactions, a common choice is the binary cross-entropy loss:
$$
\mathcal{L} = - \sum_{(i,j) \in K} \left( y_{ij} \log(\hat{y}_{ij}) + (1 - y_{ij}) \log(1 - \hat{y}_{ij}) \right)
$$
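where $K$ is the set of training pairs. With implicit feedback, where only positive interactions are observed, $K$ must also include sampled unobserved pairs labeled $y_{ij} = 0$; otherwise every target equals 1.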
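
As a quick numerical check of the loss above, the sketch below evaluates the binary cross-entropy sum by hand and compares it with PyTorch's built-in function. The labels and predictions are hypothetical values chosen for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical labels and predicted probabilities for four user-item pairs.
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
y_hat = torch.tensor([0.9, 0.2, 0.6, 0.4])

# Direct transcription of the formula (summed over the pairs).
manual = -(y_true * torch.log(y_hat) + (1 - y_true) * torch.log(1 - y_hat)).sum()

# Built-in equivalent; reduction="sum" matches the sum in the formula.
builtin = F.binary_cross_entropy(y_hat, y_true, reduction="sum")

print(manual.item(), builtin.item())
```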

### Regularization

To prevent overfitting, a regularization term is added to the loss function, for example an $L_2$ penalty on the weights and biases of the neural network:
$$
\mathcal{L}_{\text{reg}} = \lambda \left( \|\mathbf{W}\|^2 + \|\mathbf{b}\|^2 \right)
$$
where $\lambda$ is the regularization parameter.
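
A minimal sketch of adding this penalty to a data loss in PyTorch. The single linear layer, the random inputs, and the value of `lam` are assumptions made only to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 1)  # stand-in for the network's weights W and biases b
x = torch.randn(8, 4)    # hypothetical inputs
y = torch.randn(8, 1)    # hypothetical targets
lam = 0.01               # regularization parameter (assumed value)

data_loss = nn.functional.mse_loss(layer(x), y)
# Squared L2 norm of every parameter tensor (weights and biases).
l2_penalty = sum((p ** 2).sum() for p in layer.parameters())
loss = data_loss + lam * l2_penalty
loss.backward()  # gradients now include the regularization term

print(data_loss.item(), l2_penalty.item(), loss.item())
```

In practice the `weight_decay` argument of PyTorch optimizers is a common alternative that applies a similar penalty without changing the loss expression.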



## Coding RecSys

In what follows, I will develop a model for movie recommendation based on a dataset from MovieLens. This dataset contains approximately 33,000,000 ratings and 2,000,000 tag applications applied to 86,000 movies by 330,975 users. It includes tag genome data with 14 million relevance scores across 1,100 tags. The dataset was last updated in September 2018.


In [2]:

import numpy as np
import pandas as pd
from sklearn import model_selection, preprocessing
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from collections import defaultdict
from sklearn.metrics import mean_squared_error

In [3]:
main_path = "ml-latest/ratings.csv"

In [4]:
df = pd.read_csv(main_path)

In [5]:
df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,1225734739
1,1,110,4.0,1225865086
2,1,158,4.0,1225733503
3,1,260,4.5,1225735204
4,1,356,5.0,1225735119
5,1,381,3.5,1225734105
6,1,596,4.0,1225733524
7,1,1036,5.0,1225735626
8,1,1049,3.0,1225734079
9,1,1066,4.0,1225736961


In [6]:
print(f"Unique users: {df.userId.nunique()}, Unique movies: {df.movieId.nunique()}")

Unique users: 330975, Unique movies: 83239


In [7]:
# The following class implements the Dataset module from PyTorch to create a custom tensor dataset for movie recommendations.
# The dataset consists of user IDs, movie IDs, and ratings, where each entry represents a user-movie pair with a rating.


class MovieDataset(Dataset):
    def __init__(self,users,movies,ratings) -> None:
        super().__init__()
        self.users = users
        self.movies = movies
        self.ratings = ratings
        
    def __len__(self):
        return len(self.users)
        
    def __getitem__(self,idx):
        users = self.users[idx]
        movies = self.movies[idx]
        ratings = self.ratings[idx]
        
        users_tensor = torch.tensor(users, dtype=torch.long)
        movies_tensor = torch.tensor(movies, dtype=torch.long)
        # Ratings are half-star floats (e.g. 4.5); torch.long would truncate them.
        ratings_tensor = torch.tensor(ratings, dtype=torch.float32)

        return users_tensor,movies_tensor,ratings_tensor
    
    

In [8]:
# This class, derived from nn.Module, implements a neural network model for a recommendation system.
# The model is designed to embed user and movie IDs into tensors and then predict the rating for a given user-movie pair.
# The key components of the model are:
# - Embedding Layers: Maps user and movie IDs to latent vectors (dense tensors), capturing latent features of users and items.
# - Concatenation Layer: Joins the user and movie embeddings to form a combined representation of user-item interactions.
# - Output Layer: A single linear layer that maps the concatenated embeddings to the predicted rating. This is a simplified
#   version of the NCF architecture explained earlier (no hidden MLP layers and no sigmoid).


class RecSysMoviesLens(nn.Module):
    def __init__(self, n_users, n_movies, n_embeddings=32) -> None:
        super().__init__()
        # One learnable vector of size n_embeddings per user and per movie
        self.user_embedding = nn.Embedding(n_users, n_embeddings)
        self.movie_embedding = nn.Embedding(n_movies, n_embeddings)
        self.out = nn.Linear(n_embeddings * 2, 1)

    def forward(self, users, movies):
        user_vectors = self.user_embedding(users)
        movie_vectors = self.movie_embedding(movies)
        # Concatenate along the feature dimension: (batch, 2 * n_embeddings)
        x = torch.cat([user_vectors, movie_vectors], dim=1)
        x = self.out(x)
        return x
        
    

In [9]:
df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,1225734739
1,1,110,4.0,1225865086
2,1,158,4.0,1225733503
3,1,260,4.5,1225735204
4,1,356,5.0,1225735119


Notice that the user and movie IDs start from 1, and that the movie IDs are not contiguous (1, 110, 158, ...). This causes problems with embedding layers in PyTorch, which, like most programming languages and libraries, uses zero-based indexing: the first element of a tensor is accessed with index 0. An embedding layer (`torch.nn.Embedding`) with $n$ rows only accepts indices from 0 to $n-1$, where $n$ is the number of unique users or movies. Raw IDs that start from 1 or skip values can exceed $n-1$ and cause an index error, so the IDs must first be remapped to the range $0, \dots, n-1$.
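
The indexing issue can be reproduced on a tiny example. The sketch below is an illustration with four hypothetical raw IDs and a hand-built `id_to_index` mapping: the raw IDs fall outside the embedding table, while the remapped indices work.

```python
import torch
import torch.nn as nn

raw_ids = [1, 110, 158, 260]  # non-contiguous raw IDs (toy sample)

# Map each distinct raw ID to a contiguous index 0..n-1.
id_to_index = {raw: idx for idx, raw in enumerate(sorted(set(raw_ids)))}
emb = nn.Embedding(num_embeddings=len(id_to_index), embedding_dim=3)

# Remapped indices are valid lookups.
indices = torch.tensor([id_to_index[r] for r in raw_ids])
vectors = emb(indices)
print(vectors.shape)

# Raw IDs exceed the table size (4 rows) and are rejected.
try:
    emb(torch.tensor(raw_ids))
    raised = False
except (IndexError, RuntimeError) as exc:
    raised = True
    print("lookup with raw IDs failed:", exc)
```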

In [10]:
lbl_user = preprocessing.LabelEncoder()
lbl_movies = preprocessing.LabelEncoder()

df["userId"]=lbl_user.fit_transform(df["userId"])
df["movieId"]=lbl_movies.fit_transform(df["movieId"])
df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,0,0,4.0,1225734739
1,0,108,4.0,1225865086
2,0,156,4.0,1225733503
3,0,257,4.5,1225735204
4,0,351,5.0,1225735119
5,0,376,3.5,1225734105
6,0,588,4.0,1225733524
7,0,1013,5.0,1225735626
8,0,1025,3.0,1225734079
9,0,1041,4.0,1225736961


In [11]:
# Train/test split, easily done with scikit-learn

df_train, df_test  = model_selection.train_test_split(df,test_size=0.2,random_state=123)

In [12]:
# Create MovieDataset instances for the train and test sets
train_dataset = MovieDataset(
    users=df_train.userId.values,
    movies=df_train.movieId.values,
    ratings=df_train.rating.values,    
)

test_dataset = MovieDataset(
    users=df_test.userId.values,
    movies=df_test.movieId.values,
    ratings=df_test.rating.values,    
)


In [13]:
# Create the DataLoaders for the train and test sets

bs = 4
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=bs,
    shuffle=True)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=bs,
    shuffle=True)

In [14]:
# Create the instance of our model, RecSysMoviesLens

model = RecSysMoviesLens(
    n_users=len(lbl_user.classes_),
    n_movies=len(lbl_movies.classes_) 
    )

optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()

In [None]:
N_epoch = 1

model.train()
for epoch_i in range(N_epoch):
    for users, movies, ratings in train_loader:
        optimizer.zero_grad()
        y_pred = model(users,movies)
        y_true = ratings.unsqueeze(dim=1).to(torch.float32)
        loss = criterion(y_pred,y_true)
        loss.backward()
        optimizer.step()
    

In [17]:
# Move the model to the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Re-create the optimizer and the loss function
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()

# training
N_epoch = 1
model.train()
for epoch_i in range(N_epoch):
    for users, movies, ratings in train_loader:
        users, movies, ratings = users.to(device), movies.to(device), ratings.to(device)
        optimizer.zero_grad()
        y_pred = model(users, movies)
        y_true = ratings.unsqueeze(dim=1).to(torch.float32)
        loss = criterion(y_pred, y_true)
        loss.backward()
        optimizer.step()

KeyboardInterrupt: 