<center>
<img src="https://upload.wikimedia.org/wikipedia/fr/thumb/1/1d/Logo_T%C3%A9l%C3%A9com_SudParis.svg/1014px-Logo_T%C3%A9l%C3%A9com_SudParis.svg.png" width="10%" />
</center>

<center> <h2> NET 4103/7431 Complex Network </h2> </center>

<center> <h3> Vincent Gauthier (vincent.gauthier@telecom-sudparis.eu) </h3> </center>

### Note
Before starting the exercises, make sure everything works as expected. First, restart the kernel **(in the menu bar, select Kernel $\rightarrow$ Restart)**.

Make sure you fill in the cells wherever «YOUR CODE HERE» appears.

In every cell you have answered, delete the line «raise NotImplementedError()», and enter your last name and first name below:

In [None]:
NOM = "XXX"
PRENOM = "XXX"

---

<h1 align="center">Lab #5: Building a Recommender System (RecSys) With Neural Matrix Factorization</h1> 
<br />
<br />
<br />
<img src="../../images/network.png" style="display:block;margin-left:auto;margin-right:auto;width:80%;"></img>

In [None]:
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import numpy as np

import torch

%matplotlib inline

# Notebook styling
from IPython.display import HTML

def css_styling():
    # Load the course stylesheet and inject it into the notebook
    with open("../../styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()

In [None]:
import networkx as nx
from packaging import version
import sys 
import torch

print("Python version:", sys.version)
print("networkx version:", nx.__version__)
print("torch version:", torch.__version__)

# networkx must be at least version 3.0 and torch at least version 2.0
assert version.parse(nx.__version__) >= version.parse("3.0")
assert version.parse(torch.__version__) >= version.parse("2.0")
# Python must be at least version 3.9
assert sys.version_info >= (3, 9)

# If working in colab mount the drive filesystem 
if 'google.colab' in str(get_ipython()):
    print('Working in colab')
    
    from google.colab import drive
    drive.mount('/content/drive')
else:
    print("working locally")

In [None]:
import requests
import zipfile
from pathlib import Path
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import Dataset, DataLoader, random_split
from torch_geometric.data import Data

import wandb
import tqdm

### Download and Parse the MovieLens Dataset

In [None]:
class MovieLenSmall(Dataset):
    def __init__(self, root="./data"):
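        # Set up the path to the data folder
        self.root_path = Path(root)
        self.url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
        # Initialize the encoders before process(), which fills them in parse_ratings();
        # they stay empty when the processed file is loaded from an earlier run
        self.enc_movie = dict()
        self.enc_user = dict()
        self.download()
        self.process()
        # The file stores a torch_geometric Data object rather than plain tensors,
        # so full unpickling is requested explicitly (recent torch releases default to weights_only=True)
        self.data = torch.load(self.processed_path / "movielen.pt", weights_only=False)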
        
    @property
    def raw_path(self):
        return self.root_path / "raw"

    @property
    def processed_path(self):
        return self.root_path / "processed"

    def download(self):
        # If the raw data folder doesn't exist, create it, then download and unzip the archive
        if not self.raw_path.is_dir():
            print(f"Did not find {self.raw_path} directory, creating one...")
            self.raw_path.mkdir(parents=True, exist_ok=True)
    
            # Download 
            with open(self.raw_path / "ml-latest-small.zip", "wb") as f:
                request = requests.get(self.url)
                print("Downloading...")
                f.write(request.content)
    
            # Unzip 
            with zipfile.ZipFile(self.raw_path /  "ml-latest-small.zip", "r") as zip_ref:
                print("Unzipping...") 
                zip_ref.extractall(self.raw_path)

    def process(self):
        if not self.processed_path.is_dir():
            data = Data()
            # If the processed folder doesn't exist, build the dataset and save it
            print(f"Did not find {self.processed_path} directory, creating one...")
            self.processed_path.mkdir(parents=True, exist_ok=True)
            file = self.raw_path / "ml-latest-small" / "ratings.csv"
            rating = pd.read_csv(file)
            file = self.raw_path / "ml-latest-small" / "movies.csv"
            movie = pd.read_csv(file)
            df, num_user, num_movie, num_sample = self.parse_ratings(rating, movie)
            data.num_user = num_user
            data.num_movie = num_movie
            data.num_sample = num_sample
            data.user = torch.from_numpy(df.userId.values).long()
            data.movie = torch.from_numpy(df.movieId.values).long()
            data.rating = torch.from_numpy(df.rating.values).float()
            data.title = df.title.values
            data.genres = df.genres.values
            
            torch.save(data, self.processed_path / "movielen.pt")

    def parse_ratings(self, rating, movie):
        # Merge the ratings with the movie metadata (title, genres)
        df = pd.merge(rating, movie, on="movieId")
        # The timestamp is not used
        df = df.drop("timestamp", axis=1)
        # Normalize the ratings to [0, 1] with a min-max scaler
        min_rating, max_rating = df["rating"].min(), df["rating"].max()
        df["rating"] = (df.rating - min_rating) / (max_rating - min_rating)
        # Keep the normalized (non-binarized) ratings
        df["rating_rel"] = df["rating"]

        # Binarize: 1 (recommend) if the normalized rating is at least 0.5,
        # 0 (do not recommend) otherwise
        df["rating"] = (df["rating"] >= 0.5).astype(float)
    
        # Encode userId and movieId
        self.enc_movie = {movieId:idx for idx, movieId in enumerate(pd.unique(df.movieId))}
        df["movieId"] = [self.enc_movie[movieId] for movieId in df.movieId]
        self.enc_user = {userId:idx for idx, userId in enumerate(pd.unique(df.userId))}
        df["userId"] = [self.enc_user[userId] for userId in df.userId]
        
        return df, len(self.enc_user), len(self.enc_movie), len(df)
    
    def __len__(self):
        return self.data.num_sample

    @property
    def num_movie(self):
        return self.data.num_movie

    @property
    def num_user(self):
        return self.data.num_user

    def __getitem__(self, idx):
        return {
            "user": self.data.user[idx], 
            "movie": self.data.movie[idx], 
            "rating": self.data.rating[idx], 
            "title": self.data.title[idx],
            "genres": self.data.genres[idx]
        }

## Generate the dataset

In [None]:
movielen_dataset = MovieLenSmall()

## Question 1: Plot the distribution of the number of recommendations per movie

### What can you conclude about the distribution?
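
As a rough illustration of what "distribution of the number of ratings per movie" means, the sketch below works on a small made-up list of movie ids rather than on the MovieLens data. It counts how many ratings each movie received, then how many movies share each count. The names `toy_movie_ids`, `ratings_per_movie`, `count_distribution` and `pdf` are hypothetical and only serve this example.

```python
from collections import Counter

# Made-up movie ids, one entry per rating (assumption: not real MovieLens data)
toy_movie_ids = [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]

# Number of ratings received by each movie
ratings_per_movie = Counter(toy_movie_ids)

# Number of movies that received exactly n ratings
count_distribution = Counter(ratings_per_movie.values())

# Normalize by the number of movies to get an empirical probability per count
num_movies = len(ratings_per_movie)
pdf = {n: c / num_movies for n, c in sorted(count_distribution.items())}
print(pdf)
```

On real data the same quantity is usually plotted on log-log axes, because a few movies tend to collect most of the ratings while most movies collect very few.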

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

ax.loglog(edges[:-1], counts, 'o')
ax.set_xlabel("# of reviews")
ax.set_ylabel("PDF");
plt.tight_layout()
plt.show()

## What is the distribution of the movie ratings?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## The model: The Generalized Matrix Factorization model

We look for user and movie embeddings such that the element-wise product of a user embedding and a movie embedding, passed through an affine layer, predicts the binarized rating.
- Fill in the `forward` method.
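
One common way to write the Generalized Matrix Factorization prediction is given below. The symbols are notation introduced here for illustration, not taken from the notebook: $\mathbf{p}_u$ is the embedding of user $u$, $\mathbf{q}_v$ the embedding of movie $v$, $\odot$ the element-wise product, $\mathbf{h}$ and $b$ the weight and bias of the affine layer, and $\sigma$ the sigmoid.

$$
\hat{r}_{uv} = \sigma\left(\mathbf{h}^\top (\mathbf{p}_u \odot \mathbf{q}_v) + b\right)
$$

Under this reading, the quantity inside $\sigma$ is the logit. A loss such as `nn.BCEWithLogitsLoss` applies the sigmoid itself, so a model trained with it would return the logit and leave the sigmoid to the loss or to the evaluation code.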

In [None]:
# Question: add a bias term to the user and movie embeddings.
# See Lecture 3 Part B, "Matrix Factorization with PyTorch": https://www.youtube.com/watch?v=LJX5hdw-zUI&ab_channel=YannetInterian

class MF(nn.Module):
    """The Generalized Matrix Factorization model."""
    def __init__(self, num_user, num_movie, emb_size):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_user, emb_size)
        self.movie_emb = nn.Embedding(num_movie, emb_size)
        self.affine_tranform = nn.Linear(in_features=emb_size, out_features=1)
        self.reset_params()
        

    def reset_params(self):
        self.user_emb.weight.data.uniform_(0.5, 1.0)
        self.movie_emb.weight.data.uniform_(0.5, 1.0)
    
    def forward(self, u, v):
        # YOUR CODE HERE
        raise NotImplementedError()
        return out

## Train and Validation Functions
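
The training loop in this section uses `nn.BCEWithLogitsLoss`, which applies a sigmoid to the raw model output (the logit) and then computes the binary cross-entropy against a 0/1 target. The pure-Python sketch below reproduces that computation on made-up logits and targets to show how the loss behaves. `bce_with_logits` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits (illustrative helper)."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# Made-up logits and binary targets (1 = recommend, 0 = do not recommend)
confident_and_right = bce_with_logits([3.0, -3.0], [1.0, 0.0])
confident_and_wrong = bce_with_logits([3.0, -3.0], [0.0, 1.0])
uninformative = bce_with_logits([0.0, 0.0], [1.0, 0.0])

print(confident_and_right, confident_and_wrong, uninformative)
```

A logit of 0 corresponds to a probability of 0.5, so the uninformative case costs $\ln 2$ per sample. Confident correct predictions cost less than that, and confident wrong ones cost much more.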

In [None]:
# Check for a GPU
def check_device():
    if torch.backends.mps.is_available():
        # macOS GPU (Apple silicon)
        return torch.device("mps")
    elif torch.cuda.is_available():
        # CUDA device
        return torch.device("cuda")
    else:
        return torch.device("cpu")

In [None]:
def train(model, data, lr, wd, epochs, batch_size, device, log_idx=100, best_model_path='best-model-parameters.pt'):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    # Question: why did we choose the BCE loss?
    criterion = nn.BCEWithLogitsLoss()
    batchs = DataLoader(data, batch_size=batch_size, shuffle=True)
    # Start from infinity so the first logged loss always counts as the best so far
    best_loss = float("inf")
    
    model.train()
    print("Start training...")
    pbar = tqdm.tqdm(range(epochs))
    for epoch in pbar:
        running_loss = 0.0
        for idx, batch in enumerate(batchs):
            R_hat = model(batch["user"].to(device), batch["movie"].to(device)).reshape(-1)
            loss = criterion(R_hat, batch["rating"].to(device))

            # Compute the RMSE here 
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # accumulate loss and log
            running_loss += loss.item()
            if idx % log_idx == log_idx - 1:
                if (running_loss / log_idx) < best_loss:
                    best_loss = (running_loss / log_idx)
                    # Save best model 
                    save_model(model, best_model_path) 
                message = f"[{epoch}:{idx + 1}] {running_loss / log_idx:.3f}"
                wandb.log({"train/loss": running_loss / log_idx})
                pbar.set_description(message)
                pbar.refresh() # to show immediately the update
                running_loss = 0.0

def save_model(model, best_model_path):
    # Save only the state_dict, as recommended in the PyTorch documentation
    root = Path("./saved_models")
    root.mkdir(parents=True, exist_ok=True)
    torch.save(model.state_dict(), root / best_model_path) 

In [None]:
def validation(model, best_model_path, validation_data, device):
    criterion = nn.BCEWithLogitsLoss()
    # Load the best weights saved during training
    path = Path("./saved_models")
    model.load_state_dict(torch.load(path / best_model_path, map_location=device))
    model.to(device)
    # Evaluate on every validation sample, without tracking gradients
    model.eval()
    data = validation_data[:]
    with torch.no_grad():
        R_hat = model(data["user"].to(device), data["movie"].to(device)).reshape(-1)
        loss = criterion(R_hat, data["rating"].to(device))
    # BCEWithLogitsLoss already averages over the samples, so the loss is logged as is
    print(f'Validation loss: {loss.item()}')
    wandb.log({"validation/loss": loss.item()})

### Main Function

In [None]:
import wandb

dataset = MovieLenSmall()
device = check_device()
print(f'device: {device}')

train_len = int(0.8*len(dataset))
valid_len = len(dataset) - train_len
train_data = torch.utils.data.Subset(dataset, range(train_len))
val_data = torch.utils.data.Subset(dataset, range(train_len, len(dataset)))

#train_data, val_data = random_split(dataset, [train_len, valid_len])

epochs = 45
lr = 5e-4
wd = 1e-4
batch_size = 10
emb_size = 100
best_model_path = "MF-parameters.pt"
wandb.login()
run = wandb.init(
    # Set the project where this run will be logged
    project="Matrix Factorization",
    # Track hyperparameters and run metadata
    config={
        "learning_rate": lr,
        "epochs": epochs,
        "batch_size":batch_size,
        "embedding_size": emb_size,
        "weight_decay": wd,
    },
)

model = MF(dataset.num_user, dataset.num_movie, emb_size).to(device)
train(model, train_data, lr, wd, epochs, batch_size=batch_size, device=device, log_idx=1000, best_model_path=best_model_path)
validation(model, best_model_path, val_data, device)

#### [Example of a training loss curve](https://wandb.ai/vgauthier/Matrix%20Factorization/reports/train-loss-24-07-22-20-49-22---Vmlldzo4NzY4MDM0)

## Question: Compute Recall@k and Precision@k

* [Building Recommender System with PyTorch using Collaborative Filtering](https://www.youtube.com/watch?v=Wj-nkk7dFS8&ab_channel=AIAlchemy)

* https://surprise.readthedocs.io/en/latest/FAQ.html#how-to-compute-precision-k-and-recall-k
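
For intuition, here is a toy example of both metrics for a single user, using made-up (predicted score, true label) pairs. It follows the convention of the Surprise FAQ linked above: an item counts as recommended when it is in the top k and its predicted score reaches the threshold, and as relevant when its true label reaches the threshold. `toy_ratings` and the values of `k` and `threshold` are hypothetical.

```python
# Made-up (predicted score, true label) pairs for one user (assumption: not model output)
toy_ratings = [(0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.4, 1.0), (0.3, 1.0), (0.2, 0.0)]
k, threshold = 3, 0.5

# Rank items by predicted score, highest first, and keep the top k
top_k = sorted(toy_ratings, key=lambda x: x[0], reverse=True)[:k]

# Relevant items among all of the user's items
n_rel = sum(true_r >= threshold for _, true_r in toy_ratings)
# Recommended items among the top k
n_rec_k = sum(est >= threshold for est, _ in top_k)
# Items in the top k that are both recommended and relevant
n_rel_and_rec_k = sum(est >= threshold and true_r >= threshold for est, true_r in top_k)

precision_at_k = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall_at_k = n_rel_and_rec_k / n_rel if n_rel else 0
print(precision_at_k, recall_at_k)
```

Here 2 of the 3 recommended items are relevant, so Precision@3 is 2/3, while only 2 of the 4 relevant items are recovered, so Recall@3 is 0.5.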

In [None]:
from collections import defaultdict

best_model_path = "saved_models/MF-parameters.pt"
model.load_state_dict(torch.load(best_model_path))
model.to(device)

batch_size = 10
val_loader = DataLoader(val_data, batch_size=batch_size)
user_est_true = defaultdict(list)

with torch.no_grad():
    for idx, batch in enumerate(val_loader):
        users = batch["user"]
        movies = batch["movie"]
        ratings = batch["rating"]
        pred = model(batch["user"].to(device), batch["movie"].to(device)).reshape(-1).sigmoid()
        
        for i in range(len(users)):
            user_id = users[i].item()
            movie_id = movies[i].item()
            pred_rating = pred[i].item()
            true_rating = ratings[i].item()
            user_est_true[user_id].append((pred_rating, true_rating))

In [None]:
threshold = 0.5
k = 10
good_user = dict()
for uid, user_ratings in user_est_true.items():
    n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
    if n_rel >= k:
        good_user[uid] = user_ratings

In [None]:
def precision_recall_at_k(predictions, k=10, threshold=0.75):
    """Return precision and recall at k metrics for each user"""

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in predictions.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )
        #print(uid, n_rel_and_rec_k)
        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    return precisions, recalls

In [None]:
precisions, recalls = precision_recall_at_k(good_user, threshold=0.8, k=10)

# Precision and recall can then be averaged over all users
print(sum(prec for prec in precisions.values()) / len(precisions))
print(sum(rec for rec in recalls.values()) / len(recalls))

## Question: Add parameters to the model such as:

- a parameter for the mean rating of each user (user bias)
- a parameter for the mean rating of each item (item bias)
- a parameter for the global mean rating
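
In classical biased matrix factorization, these three terms are added to the dot product of the embeddings. The notation below is introduced here for illustration: $\mu$ is the global mean rating, $b_u$ the bias of user $u$, $b_i$ the bias of item $i$, and $\mathbf{p}_u$, $\mathbf{q}_i$ the user and item embeddings.

$$
\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^\top \mathbf{q}_i
$$

In a PyTorch model, one possible choice (an assumption, not the only option) is to store $b_u$ and $b_i$ as `nn.Embedding` layers with an output size of 1 and $\mu$ as a single learnable scalar.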

## Question: Use an optimizer to fine-tune the model hyperparameters
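
As a starting point, the sketch below shows plain random search, one of the simplest hyperparameter optimization strategies. The search ranges are illustrative assumptions, and `validation_loss` is a hypothetical stand-in: a real version would train the model with the sampled configuration and return its loss on the validation set. Dedicated tools such as Weights & Biases sweeps or Optuna automate the same loop with smarter sampling.

```python
import random

def sample_config(rng):
    """Draw one hyperparameter configuration (ranges are illustrative assumptions)."""
    return {
        "lr": 10 ** rng.uniform(-4, -2),   # log-uniform learning rate
        "wd": 10 ** rng.uniform(-5, -3),   # log-uniform weight decay
        "emb_size": rng.choice([16, 32, 64, 100]),
    }

def validation_loss(config):
    """Hypothetical stand-in for: train the model, then return its validation loss."""
    return abs(config["lr"] - 1e-3) + abs(config["wd"] - 1e-4) + 1.0 / config["emb_size"]

rng = random.Random(0)  # fixed seed so the search is reproducible
trials = [sample_config(rng) for _ in range(20)]

# Keep the configuration with the lowest (stand-in) validation loss
best = min(trials, key=validation_loss)
print(best, validation_loss(best))
```

Sampling the learning rate and weight decay on a log scale is a common choice because their useful values span several orders of magnitude.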

## References 

1. [Recommender Systems: Generalized Matrix Factorization from Scratch](https://www.youtube.com/watch?v=gZgftF5hZOs&ab_channel=DavidOniani)
2. [The Generalized Matrix Factorization model](https://github.com/oniani/ai/blob/main/model/dl/gmf.py)
3. [CNAM course (in French)](https://cedric.cnam.fr/vertigo/Cours/RCP216/coursSimilariteRecommandation.html#systemes-de-recommandation)
4. [Google Notes on Matrix Factorization](https://developers.google.com/machine-learning/recommendation/collaborative/matrix?hl=fr)
5. [Mining Massive Dataset Stanford University (webpage)](https://web.stanford.edu/class/cs246/)
6. [Mining Massive Dataset Stanford University (youtube)](https://www.youtube.com/watch?v=xoA5v9AO7S0&list=PLLssT5z_DsK9JDLcT8T62VtzwyW9LNepV&ab_channel=ArtificialIntelligence-AllinOne)
7. [Recommendation Systems — A walk through](https://chaitanyabelhekar.medium.com/recommendation-systems-a-walk-trough-33587fecc195)