# Disentangled Multimodal Representation Learning for Recommendation (DMRL)

Ideas:
- tune the hyperparameters
- parallelize the prediction step
- add the subject column together with the text
- clean the content data
- use the image in addition to the text
- rename the columns so their names are more descriptive for BERT
- use a simpler algorithm only for new users, e.g. kNN for content-based recommendation
- sort the new items that ended up last by popularity
- understand the `exclude_unknowns=True` parameter of RatioSplit and whether there is any way for DMRL to generate predictions for new items and users
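
The popularity idea from the list above could look roughly like the sketch below. It assumes frames with the same `UserId`, `ItemId`, and `Rating` columns used later in this notebook, with `-1` marking items the model could not score. `rank_with_popularity_fallback` and the toy data are hypothetical, for illustration only.

```python
import pandas as pd


def rank_with_popularity_fallback(predictions, train_ratings, unknown_score=-1):
    """Sort each user's items by model score; order unscored items by popularity.

    Hypothetical helper, not part of Cornac. `predictions` needs the columns
    UserId, ItemId, Rating, where Rating == unknown_score means "not scored".
    """
    # Popularity = number of training ratings per item (0 if never rated).
    popularity = train_ratings.groupby("ItemId").size().rename("Popularity")
    ranked = predictions.join(popularity, on="ItemId")
    ranked["Popularity"] = ranked["Popularity"].fillna(0)
    # Scored items come first; unscored items go last, most popular first.
    ranked["IsUnknown"] = ranked["Rating"] == unknown_score
    ranked = ranked.sort_values(
        ["UserId", "IsUnknown", "Rating", "Popularity"],
        ascending=[True, True, False, False],
    )
    return ranked.drop(columns=["Popularity", "IsUnknown"]).reset_index(drop=True)


# Toy data (made up): item "c" has 3 training ratings, "b" has 1, "d" has none.
toy_train = pd.DataFrame({
    "UserId": ["x", "x", "y", "y", "z", "z"],
    "ItemId": ["a", "a", "b", "c", "c", "c"],
    "Rating": [8.0] * 6,
})
toy_predictions = pd.DataFrame({
    "UserId": ["u1"] * 4,
    "ItemId": ["a", "b", "c", "d"],
    "Rating": [0.9, -1, -1, 0.2],
})
ranked_toy = rank_with_popularity_fallback(toy_predictions, toy_train)
# ItemId order for u1: a, d, c, b
```

One caveat of this sketch: it relies on `-1` never being a genuine model score, which is the same convention the prediction loop in this notebook uses.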

Questions:


## Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import torch
import cornac
import numpy as np
import pandas as pd
from cornac.metrics import NDCG
from cornac.eval_methods import RatioSplit
from cornac.data import TextModality
from cornac.models.dmrl.recom_dmrl import DMRL

from utils import load_data, preprocessing_content_data


# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
np.set_printoptions(threshold=np.inf)



## Load and process data

In [3]:
ratings, content, targets = load_data()

In [4]:
ratings["TimestampDate"] = ratings['Timestamp'].dt.date
ratings.loc[ratings.Rating == 0, "Rating"] = 0.01

### Clean the data

- DONE - Year: remove the dash in values such as '2013–'
- DONE - Rated: map the various spellings of NA to NA. This is a special case
- DONE - map the various spellings of NA to None in all columns
- DONE - Language: several rows contain oddities such as 'None, English', 'None, French', 'English, None'...
- DONE - Ratings: create one column per dictionary key, and find out which keys exist

-----------------------------------------------
Variables that would not need to be processed with BERT:
- Metascore
- imdbRating
- Type
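
A minimal sketch of how these three columns could be handled without a text encoder. It assumes they arrive as strings with `'N/A'` for missing values; the toy frame below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Made-up rows mimicking the assumed string format of the content data.
toy_content = pd.DataFrame({
    "ItemId": ["i1", "i2", "i3"],
    "Metascore": ["74", "N/A", "51"],
    "imdbRating": ["7.9", "6.1", "N/A"],
    "Type": ["movie", "series", "movie"],
})

# Numeric columns: parse to float; unparseable values such as 'N/A' become NaN.
numeric_features = toy_content[["Metascore", "imdbRating"]].apply(
    pd.to_numeric, errors="coerce"
)

# Categorical column: one indicator column per category.
type_dummies = pd.get_dummies(toy_content["Type"], prefix="Type")

features = pd.concat([toy_content[["ItemId"]], numeric_features, type_dummies], axis=1)
```

Whether DMRL can consume such side features directly is not something this notebook establishes; the sketch only shows the preprocessing.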

In [5]:
content_auxiliar = content.drop(columns=["Poster", "Website", "Response", "Episode", "seriesID"]).copy()

In [6]:
content_auxiliar['Year'] = content_auxiliar['Year'].str.replace('–', '')

In [7]:
nan = content.totalSeasons.unique()[0]
dict_transform_to_na = {
    "Rated":['N/A', 'Not Rated', 'Unrated', 'UNRATED', 'NOT RATED'],
    "all": [nan, 'N/A', 'None', np.nan],
}

for na_value in dict_transform_to_na["all"]:
    content_auxiliar = content_auxiliar.replace(na_value, None)

for na_value in dict_transform_to_na["Rated"]:
    content_auxiliar['Rated'] = content_auxiliar['Rated'].replace(na_value, None)

In [8]:
content_auxiliar['Language'] = content_auxiliar['Language'].str.replace('None, ', '')
content_auxiliar['Language'] = content_auxiliar['Language'].str.replace(', None', '')

In [9]:
# Understanding the possible values of the Ratings column
# The content_auxiliar.Ratings column stores a list that can hold between 0 and 3 dictionaries. Each dictionary has the keys 'Source' and 'Value'.
num_ratings_per_item = []
unique_keys = []
rating_sources = []
rating_values = []

for rating_list in content_auxiliar.Ratings:
    num_ratings_per_item.append(len(rating_list))
    for rating_dict in rating_list:
        for key in rating_dict:
            unique_keys.append(key)
        rating_sources.append(rating_dict['Source'])
        rating_values.append(rating_dict['Value'])

set(rating_sources)

{'Internet Movie Database', 'Metacritic', 'Rotten Tomatoes'}
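
For reference, a small sketch of that list-of-dictionaries layout and of one way to spread it into one column per source. The example values are invented to mirror the 'Source'/'Value' structure, and `ratings_to_columns` is a hypothetical helper.

```python
import pandas as pd

# Invented entries with 0 to 3 dictionaries per item, as described above.
toy_ratings_column = pd.Series([
    [{"Source": "Internet Movie Database", "Value": "7.9/10"},
     {"Source": "Metacritic", "Value": "74/100"}],
    [],
    [{"Source": "Rotten Tomatoes", "Value": "85%"}],
])

SOURCES = ["Internet Movie Database", "Metacritic", "Rotten Tomatoes"]


def ratings_to_columns(rating_list):
    """Map one item's rating list to {source: value}, with None when absent."""
    by_source = {entry["Source"]: entry["Value"] for entry in rating_list}
    return {source: by_source.get(source) for source in SOURCES}


expanded = pd.DataFrame(
    [ratings_to_columns(rating_list) for rating_list in toy_ratings_column],
    index=toy_ratings_column.index,
)
```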

In [10]:
InternetMovieDatabase_list = []
Metacritic_list = []
RottenTomatoes_list = []
for rating_list in content_auxiliar.Ratings:
    InternetMovieDatabase_list.append(None)
    Metacritic_list.append(None)
    RottenTomatoes_list.append(None)
    for rating_dict in rating_list:
        if rating_dict['Source'] == 'Internet Movie Database':
            InternetMovieDatabase_list[-1] = rating_dict['Value']
        elif rating_dict['Source'] == 'Metacritic':
            Metacritic_list[-1] = rating_dict['Value']
        elif rating_dict['Source'] == 'Rotten Tomatoes':
            RottenTomatoes_list[-1] = rating_dict['Value']

In [11]:
content_auxiliar['Internet Movie Database'] = InternetMovieDatabase_list
content_auxiliar['Metacritic'] = Metacritic_list
content_auxiliar['Rotten Tomatoes'] = RottenTomatoes_list


In [12]:
content_auxiliar.drop(columns=['Ratings'], inplace=True)

### Prepend the column name to each value in the dataframe

In [13]:
content_columns = content_auxiliar.columns.to_list()
content_columns.pop(0)

'ItemId'

In [14]:
for column in content_columns:
    content_auxiliar[column] = content_auxiliar[column].apply(lambda x: f"{column}: {x}; " if x is not None else f"{column}: unknown value; ")

In [15]:
content_processed = content_auxiliar[['ItemId']].copy()
content_processed["text"] = content_auxiliar[content_columns].astype(str).fillna('').agg(' '.join, axis=1)

## Basic analysis

In [16]:
ratings.head()

Unnamed: 0,UserId,ItemId,Timestamp,Rating,TimestampDate
0,c4ca4238a0,91766eac45,2013-10-05 22:00:50,8.0,2013-10-05
1,c81e728d9d,5c739554f7,2013-08-17 16:26:38,9.0,2013-08-17
2,c81e728d9d,48f6d7ce7c,2013-08-17 13:28:27,8.0,2013-08-17
3,c81e728d9d,e9318d627a,2013-06-15 15:38:09,1.0,2013-06-15
4,a87ff679a2,17e6357973,2014-01-31 23:27:59,8.0,2014-01-31


In [17]:
# Number of unique users and items
ratings.UserId.nunique(), ratings.ItemId.nunique()

(51671, 29674)

In [18]:
# how many items each user purchased per purchase (same timestamp)
ratings.groupby(["UserId", 'Timestamp'])["ItemId"].nunique().value_counts()

1     659392
2         54
3         14
6          3
7          2
11         2
4          2
22         1
28         1
20         1
8          1
38         1
Name: ItemId, dtype: int64

In [19]:
# how many items each user purchased per day
ratings.groupby(["UserId", 'TimestampDate'])["ItemId"].nunique().value_counts()

1      420843
2       60533
3       14547
4        4755
5        2065
        ...  
60          1
363         1
145         1
189         1
82          1
Name: ItemId, Length: 88, dtype: int64

In [20]:
# how many times each user purchased items
ratings.groupby("UserId")['Timestamp'].nunique().value_counts()

1      23092
2       6193
3       3341
4       2229
5       1646
       ...  
471        1
427        1
321        1
429        1
392        1
Name: Timestamp, Length: 440, dtype: int64

In [21]:
# how many times each user purchased items per day
ratings.groupby("UserId")['TimestampDate'].nunique().value_counts()

1      25113
2       6048
3       3158
4       2187
5       1579
       ...  
420        1
198        1
602        1
224        1
332        1
Name: TimestampDate, Length: 341, dtype: int64

In [22]:
content.isna().sum()

ItemId              0
Title               0
Year                0
Rated               0
Released            0
Runtime             0
Genre               0
Director            0
Writer              0
Actors              0
Plot                0
Language            0
Country             0
Awards              0
Poster              0
Ratings             0
Metascore           0
imdbRating          0
imdbVotes           0
Type                0
DVD                24
BoxOffice          24
Production         24
Website            24
Response            0
totalSeasons    37989
Season          38011
Episode         38011
seriesID        38011
dtype: int64

In [23]:
ratings.Rating.unique()

array([ 8.  ,  9.  ,  1.  ,  7.  ,  6.  , 10.  ,  5.  ,  4.  ,  2.  ,
        3.  ,  0.01])

## Train model

In [24]:
item_text_modality = TextModality(
    corpus=content_processed.text.to_list(),
    ids=content_processed.ItemId.to_list(),
)

In [25]:
ratio_split = RatioSplit(
    data=ratings[['UserId', 'ItemId', 'Rating']].values.tolist(),
    test_size=0.2,
    exclude_unknowns=True,
    verbose=True,
    seed=12012001,
    rating_threshold=0.5,
    item_text=item_text_modality,
)

rating_threshold = 0.5
exclude_unknowns = True
---
Training data:
Number of users = 46750
Number of items = 27045
Number of ratings = 527776
Max rating = 10.0
Min rating = 0.0
Global mean = 7.3
---
Test data:
Number of users = 46750
Number of items = 27045
Number of ratings = 124031
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 46750
Total items = 27045


In [26]:
# Instantiate DMRL recommender
dmrl_recommender = DMRL(
    batch_size=1024,
    epochs=20,
    log_metrics=False,
    learning_rate=0.1,
    num_factors=2,
    decay_r=0.0001,
    decay_c=0.001,
    num_neg=4,
    embedding_dim=128,
)

In [None]:
# Put everything together into an experiment and run it
cornac.Experiment(
    eval_method=ratio_split, models=[dmrl_recommender], metrics=[NDCG(k=20)]
).run()


[DMRL] Training started!
Pre-encoding the entire corpus. This might take a while.
Using device cpu for training
  batch 5 loss: 714.3922241210937
  batch 10 loss: 737.5060668945313
  batch 15 loss: 790.3735595703125
  batch 20 loss: 864.58916015625
  batch 25 loss: 877.6736572265625
  batch 30 loss: 810.6030151367188
  batch 35 loss: 764.0460815429688
  batch 40 loss: 727.0631713867188
  batch 45 loss: 714.0484008789062
  batch 50 loss: 705.6056518554688
  batch 55 loss: 701.7337646484375
  batch 60 loss: 690.3317138671875
  batch 65 loss: 677.1714477539062
  batch 70 loss: 680.3330810546875
  batch 75 loss: 676.5327392578125
  batch 80 loss: 671.8184814453125
  batch 85 loss: 650.7092407226562
  batch 90 loss: 649.2645263671875
  batch 95 loss: 651.9765380859375
  batch 100 loss: 637.3532470703125
  batch 105 loss: 628.8649047851562
  batch 110 loss: 623.105859375
  batch 115 loss: 614.1168579101562
  batch 120 loss: 605.641064453125
  batch 125 loss: 621.0272583007812
  batch 130 lo

Ranking: 100%|██████████| 19975/19975 [23:05<00:00, 14.41it/s]



TEST:
...
     | NDCG@-1 | Train (s) |  Test (s)
---- + ------- + --------- + ---------
DMRL |  0.2459 |  979.5150 | 1385.9265



In [28]:
# how can this algorithm be used to handle new items?
# how can this algorithm be used to handle new users?


# do I have the content for all items, including items that appear only in the test set?
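
As far as the Cornac documentation describes it, `exclude_unknowns=True` makes the evaluation ignore test ratings whose user or item never appears in the training set, which matches the "Number of unknown users = 0" line in the split summary. The sketch below imitates that filtering in plain pandas on made-up data. It is not Cornac's implementation, and `split_known_unknown` is a hypothetical helper.

```python
import pandas as pd


def split_known_unknown(train, test):
    """Separate test rows the model can score from cold-start rows."""
    known_users = set(train["UserId"])
    known_items = set(train["ItemId"])
    is_known = test["UserId"].isin(known_users) & test["ItemId"].isin(known_items)
    return test[is_known], test[~is_known]


# Toy data (made up): user "u3" and item "i9" never occur in training.
toy_train = pd.DataFrame({"UserId": ["u1", "u2"], "ItemId": ["i1", "i2"]})
toy_test = pd.DataFrame({
    "UserId": ["u1", "u3", "u2"],
    "ItemId": ["i2", "i1", "i9"],
})
evaluated, cold_start = split_known_unknown(toy_train, toy_test)
```

Under this reading, the rows in `cold_start` are exactly the cases that need a separate strategy, such as a popularity or content-based fallback.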

In [29]:
target_prediction = targets.copy()
target_prediction["Rating"] = -1

user_id_list = targets.UserId.unique()
for user_id in user_id_list:
    # Get the train dataframe index of the user to predict
    user_index = ratio_split.train_set.uid_map.get(user_id)

    if user_index is None:
        print(f"User {user_id} is not in the train set")
        continue

    # Filter by the items to predict
    items_to_predict = targets.loc[targets.UserId == user_id, "ItemId"].to_list()

    # Get the train dataframe index of the items to predict
    items_to_predict_index = np.array([ratio_split.train_set.iid_map.get(item_id) for item_id in items_to_predict])

    items_to_predict_tensor = torch.tensor([idx for idx in items_to_predict_index if idx is not None])

    # Get the position of items that are not in the train set
    none_indices = [i for i, x in enumerate(items_to_predict_index) if x is None]

    # Get the prediction for the items
    line_rating = dmrl_recommender.score(user_index=user_index, item_indices=items_to_predict_tensor)

    # Insert -1 in the position of items that are not in the train set
    for index_to_insert in none_indices:
        line_rating = np.insert(line_rating, index_to_insert, -1)

    # Insert the prediction in the target_prediction dataframe
    target_prediction.loc[targets.UserId == user_id, "Rating"] = line_rating

User 02b780c583 is not in the train set
User 03fbe3c1e8 is not in the train set
User 0503bf6097 is not in the train set
User 0958560bd4 is not in the train set
User 0b17c3df40 is not in the train set
User 10c3aef33f is not in the train set
User 119653373a is not in the train set
User 139ccdbf5a is not in the train set
User 1671fa4df3 is not in the train set
User 16757f4d90 is not in the train set
User 187454feb4 is not in the train set
User 1af811e7b9 is not in the train set
User 1bbf59e4a4 is not in the train set
User 1ef563bb8b is not in the train set
User 1f6a565547 is not in the train set
User 211a7a84d3 is not in the train set
User 243be2818a is not in the train set
User 24a13b72fc is not in the train set
User 27649dc46e is not in the train set
User 2a3500f75b is not in the train set
User 2ae6906d5d is not in the train set
User 2cfdb14e05 is not in the train set
User 2ed4d0c23c is not in the train set
User 2ef13886ae is not in the train set
User 302c6ebc7d is not in the train set
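
The `np.insert` loop in the cell above could also be written with a boolean mask, which avoids copying the array once per missing item and also covers users for whom none of the target items are in the train set. A sketch, where `fill_scores` is a hypothetical helper and the score array stands in for the output of `dmrl_recommender.score`:

```python
import numpy as np


def fill_scores(item_indices, known_scores, unknown_score=-1.0):
    """Place model scores at known positions and unknown_score elsewhere.

    item_indices: list of train-set indices, with None for unseen items.
    known_scores: scores for the non-None entries, in the same order.
    """
    is_known = np.array([idx is not None for idx in item_indices], dtype=bool)
    scores = np.full(len(item_indices), unknown_score, dtype=float)
    scores[is_known] = known_scores
    return scores


# Toy example: the second and fourth items are not in the train set.
toy_scores = fill_scores([3, None, 7, None], np.array([0.5, 0.2]))
```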


In [30]:
target_prediction = target_prediction.sort_values(["UserId", "Rating"], ascending=[True, False])

In [31]:
target_prediction.to_csv("submissao_3,5_DMRL_versao_2,5.csv", index=False)

In [32]:
target_prediction = target_prediction.drop(columns="Rating")

In [33]:
target_prediction.to_csv("submissao_3,5_DMRL_versao_2,5_sem_rating.csv", index=False)