# Disentangled Multimodal Representation Learning for Recommendation (DMRL)

Ideas:
- DONE: Use the column in the text
- DONE: Clean the content data
- DONE (lucas): sort the new items that ended up last by popularity
- DONE: Add the column name together with the text, since it indicates the subject

- tune the hyperparameters
- use the image in addition to the text
- use a simpler algorithm just for new users, e.g. kNN for content-based recommendation
- Understand the exclude_unknowns=True parameter of RatioSplit and whether DMRL has any way to generate predictions for new items and users

- parallelize the prediction

Questions:


## Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import torch
import cornac
import numpy as np
import pandas as pd
from cornac.metrics import NDCG
from cornac.eval_methods import RatioSplit
from cornac.data import TextModality
from cornac.models.dmrl.recom_dmrl import DMRL, ImageModality
from tqdm import tqdm
from utils import load_data, preprocessing_content_data
import requests
from PIL import Image
from io import BytesIO
from joblib import Parallel, delayed


# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
np.set_printoptions(threshold=np.inf)



## Load and process data

### Load data

In [3]:
ratings, content, targets = load_data()

In [4]:
ratings["TimestampDate"] = ratings['Timestamp'].dt.date
ratings.loc[ratings.Rating == 0, "Rating"] = 0.01

### Clean the data

- DONE - Year: delete the dash in text such as '2013–'
- DONE - Rated: map the various spellings of NA to NA. This is a special case
- DONE - map the various spellings of NA to None in all columns
- DONE - Language: several rows have oddities such as 'None, English', 'None, French', 'English, None'...
- DONE - Ratings: create one column per dictionary key, and find out which keys exist

-----------------------------------------------
Variables that would not need to be processed with a BERT model:
- Metascore
- imdbRating
- Type

In [5]:
content_auxiliar = content.drop(columns=["Poster", "Website", "Response", "Episode", "seriesID"]).copy()

In [6]:
content_auxiliar['Year'] = content_auxiliar['Year'].str.replace('–', '')

In [7]:
nan = content.totalSeasons.unique()[0]
dict_transform_to_na = {
    "Rated":['N/A', 'Not Rated', 'Unrated', 'UNRATED', 'NOT RATED'],
    "all": [nan, 'N/A', 'None', np.nan],
}

for na_value in dict_transform_to_na["all"]:
    content_auxiliar = content_auxiliar.replace(na_value, None)

for na_value in dict_transform_to_na["Rated"]:
    content_auxiliar['Rated'] = content_auxiliar['Rated'].replace(na_value, None)

In [8]:
content_auxiliar['Language'] = content_auxiliar['Language'].str.replace('None, ', '')
content_auxiliar['Language'] = content_auxiliar['Language'].str.replace(', None', '')
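
The cleaning rules above (several spellings of a missing value, plus stray `None` entries inside comma-separated `Language` strings) can be sketched on plain Python values. This is a minimal illustration rather than the notebook's own code: `NA_TOKENS`, `normalize_missing`, and `clean_language` are hypothetical names, and the token set is assumed from the values handled in the cells above.

```python
import math

# Spellings treated as "missing"; assumed from the values handled above.
NA_TOKENS = {"N/A", "None"}


def normalize_missing(value):
    """Return None for any missing-value spelling, else the value unchanged."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str) and value.strip() in NA_TOKENS:
        return None
    return value


def clean_language(value):
    """Drop 'None' entries from a comma-separated language string."""
    value = normalize_missing(value)
    if value is None:
        return None
    kept = [part.strip() for part in value.split(",") if part.strip() != "None"]
    return ", ".join(kept) if kept else None


print(clean_language("None, English"))
print(clean_language("English, None"))
print(normalize_missing(float("nan")))
```

Splitting on commas, rather than replacing the substrings `'None, '` and `', None'`, also covers a `None` that sits in the middle of a longer list.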

In [9]:
# Understanding the possible values of the Ratings column
# content_auxiliar.Ratings holds a list of 0 to 3 dictionaries. Each dictionary has the keys 'Source' and 'Value'.
num_ratings_per_item = []
unique_keys = []
rating_sources = []
rating_values = []

for rating_list in content_auxiliar.Ratings:
    num_ratings_per_item.append(len(rating_list))
    for rating_dict in rating_list:
        for key in rating_dict:
            unique_keys.append(key)
        rating_sources.append(rating_dict['Source'])
        rating_values.append(rating_dict['Value'])

set(rating_sources)

{'Internet Movie Database', 'Metacritic', 'Rotten Tomatoes'}

In [10]:
InternetMovieDatabase_list = []
Metacritic_list = []
RottenTomatoes_list = []
for rating_list in content_auxiliar.Ratings:
    InternetMovieDatabase_list.append(None)
    Metacritic_list.append(None)
    RottenTomatoes_list.append(None)
    for rating_dict in rating_list:
        if rating_dict['Source'] == 'Internet Movie Database':
            InternetMovieDatabase_list[-1] = rating_dict['Value']
        elif rating_dict['Source'] == 'Metacritic':
            Metacritic_list[-1] = rating_dict['Value']
        elif rating_dict['Source'] == 'Rotten Tomatoes':
            RottenTomatoes_list[-1] = rating_dict['Value']

In [11]:
content_auxiliar['Internet Movie Database'] = InternetMovieDatabase_list
content_auxiliar['Metacritic'] = Metacritic_list
content_auxiliar['Rotten Tomatoes'] = RottenTomatoes_list


In [12]:
content_auxiliar.drop(columns=['Ratings'], inplace=True)
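
The three parallel lists built above amount to turning each list of `{'Source', 'Value'}` dictionaries into one value per source. A compact sketch of the same idea follows; `SOURCES` and `split_ratings` are hypothetical names, and the example values are made up for illustration.

```python
# Sources seen in the Ratings column (taken from the set printed above).
SOURCES = ("Internet Movie Database", "Metacritic", "Rotten Tomatoes")


def split_ratings(rating_list):
    """Map a list of {'Source', 'Value'} dicts to one value per known source."""
    row = {source: None for source in SOURCES}
    for rating in rating_list:
        if rating["Source"] in row:
            row[rating["Source"]] = rating["Value"]
    return row


# Illustrative input only; these are not values from the dataset.
example = [
    {"Source": "Internet Movie Database", "Value": "7.1/10"},
    {"Source": "Metacritic", "Value": "64/100"},
]
print(split_ratings(example))
```

Sources missing from an item stay `None`, which matches the default the notebook's loop appends before looking at each dictionary.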

### Prepend the column name to each value in the dataframe

In [13]:
content_columns = content_auxiliar.columns.to_list()
content_columns.pop(0)

'ItemId'

In [14]:
for column in content_columns:
    content_auxiliar[column] = content_auxiliar[column].apply(lambda x: f"{column}: {x}; " if x is not None else f"{column}: unknown value; ")

In [15]:
content_processed = content_auxiliar[['ItemId']].copy()
content_processed["text"] = content_auxiliar[content_columns].astype(str).fillna('').agg(' '.join, axis=1)
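
To make the resulting `text` field concrete, the sketch below builds the same kind of string for a single row held in a plain dictionary. `row_to_text` is a hypothetical helper and the row values are invented; the fragment format (`"column: value; "`, with `unknown value` for missing entries, joined by a space) follows the two cells above.

```python
def row_to_text(row, skip=("ItemId",)):
    """Join 'column: value; ' fragments, marking missing values explicitly."""
    parts = []
    for column, value in row.items():
        if column in skip:
            continue
        shown = "unknown value" if value is None else value
        parts.append(f"{column}: {shown}; ")
    return " ".join(parts)


# Invented row, for illustration only.
row = {"ItemId": "abc123", "Title": "Example Movie", "Year": "2013", "Rated": None}
print(repr(row_to_text(row)))
```

Because every fragment already ends in `"; "` and the fragments are joined with another space, consecutive fields end up separated by two spaces.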

### Fetching the image data

In [16]:
folder_path = 'poster_images'

# List all files in the folder
file_list = os.listdir(folder_path)
item_id_saved = [file.split('=')[1].split('.')[0] for file in file_list]

In [17]:
content_to_get_image = content.loc[content.Poster != 'N/A', ['ItemId', 'Poster']].copy()
content_to_get_image = content_to_get_image.loc[~content_to_get_image.ItemId.isin(item_id_saved), :]
content_to_get_image = content_to_get_image.reset_index(drop=True)

In [None]:

def download_image(index):
    image_url = content_to_get_image.loc[index, "Poster"]
    item_id = content_to_get_image.loc[index, "ItemId"]

    # Make the HTTP request to download the image
    # try:
    response = requests.get(image_url)
    # except:
        # return None

    # Check whether the request succeeded
    if response.status_code == 200:
        # Open the image
        img = Image.open(BytesIO(response.content))
        # Save the image
        img.save(f"{folder_path}/item_id={item_id}.jpg")
        return None
    else:
        return (item_id, image_url)

# List of indices to process
indices = content_to_get_image.index.tolist()

# Use all available CPUs
result_list = Parallel(n_jobs=-1, verbose=100)(delayed(download_image)(index) for index in indices)

# Keep only the results for images that failed to download
nao_conseguiu_baixar = [result for result in result_list if result is not None]
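
The download cell returns `None` on success and an `(item_id, url)` pair on failure, so the failures are the non-`None` results. The sketch below shows the same collect-the-failures pattern with exceptions handled as well. It runs offline: `download_all` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP call.

```python
def download_all(jobs, fetch):
    """Run fetch(url) for each (item_id, url) pair and collect the failures.

    `fetch` is any callable that returns the image bytes or raises on error.
    """
    saved, failed = {}, []
    for item_id, url in jobs:
        try:
            saved[item_id] = fetch(url)
        except Exception as error:  # network errors, bad status codes, ...
            failed.append((item_id, url, repr(error)))
    return saved, failed


def fake_fetch(url):
    """Stand-in for an HTTP request so the sketch runs offline."""
    if "missing" in url:
        raise IOError("404")
    return b"image-bytes"


jobs = [("a1", "https://example.com/ok.jpg"), ("b2", "https://example.com/missing.jpg")]
saved, failed = download_all(jobs, fake_fetch)
print(sorted(saved), [item_id for item_id, _, _ in failed])
```

In the notebook, a real `fetch` could wrap `requests.get(url, timeout=...)` followed by `response.raise_for_status()`, so that a timeout or an HTTP error lands in `failed` instead of aborting the whole batch.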

## Train model

In [16]:
import cornac
from cornac.data import TextModality, ImageModality
from cornac.datasets import amazon_clothing
from cornac.eval_methods import RatioSplit

In [20]:
image_features, image_item_ids = amazon_clothing.load_visual_feature()

In [21]:
image_modality = ImageModality(features=image_features, ids=image_item_ids)

In [None]:
image_features.shape

In [None]:
image_matrices2

In [None]:
type(image_features)

In [None]:
image_features[0].shape

In [17]:
from PIL import Image
import numpy as np
import os

folder_path = 'poster_images'
image_files = os.listdir(folder_path)
images_item_ids = [file.split('=')[1].split('.')[0] for file in image_files]

# Load one image from disk and flatten it into a 1-D array
def process_image(image_file):
    image_path = os.path.join(folder_path, image_file)
    image = Image.open(image_path)
    image_matrix = np.array(image)
    return image_matrix.reshape(-1)

# Use Parallel to process images in parallel
image_matrices = Parallel(n_jobs=-1, verbose=100)(delayed(process_image)(image_file) for image_file in image_files)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.19791217901000982s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:    0.5s

In [20]:
# All images must have the same size, so we append zeros to the end of the smaller images

# maior_imagem = -1
# for image in image_matrices:
#     if image.shape[0] > maior_imagem:
#         maior_imagem = image.shape[0]

# for index, image in enumerate(image_matrices):
#     new_image = np.insert(image, len(image), np.zeros(maior_imagem - len(image)))
#     del image
#     image_matrices[index] = new_image

# def add_zeros_to_image(image):
#     new_image = np.insert(image, len(image), np.zeros(maior_imagem - len(image)))
#     del image
#     return new_image

# image_matrices = Parallel(n_jobs=-1, verbose=100)(delayed(add_zeros_to_image)(image) for image in image_matrices)

In [18]:
# All images must have the same size, so we truncate the larger images to the length of the smallest one
menor_imagem = np.inf  # np.Inf was removed in NumPy 2.0
for image in image_matrices:
    if image.shape[0] < menor_imagem:
        menor_imagem = image.shape[0]


In [19]:
def diminui_a_imagem(image):
    return image[:menor_imagem]

image_matrices2 = Parallel(n_jobs=-1, verbose=100)(delayed(diminui_a_imagem)(image) for image in image_matrices)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0056383609771728516s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:    0.0s

In [24]:
# Convert the list of image matrices to a numpy array
image_matrices2 = np.array(image_matrices2)

# print(f"Loaded {len(image_matrices)} images into a numpy array with shape {image_matrices.shape}")
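
The two ways of equalizing the flattened image lengths tried above (zero-padding to the longest vector, and truncating to the shortest) can be compared on toy data. This is a small sketch with invented arrays, not the poster data; the variable names are hypothetical.

```python
import numpy as np

# Three flattened "images" of different lengths (toy data).
flat_images = [np.arange(6), np.arange(4), np.arange(5)]

# Strategy 1: truncate every vector to the shortest length.
shortest = min(len(image) for image in flat_images)
truncated = np.stack([image[:shortest] for image in flat_images])

# Strategy 2: zero-pad every vector to the longest length.
longest = max(len(image) for image in flat_images)
padded = np.stack([np.pad(image, (0, longest - len(image))) for image in flat_images])

print(truncated.shape)
print(padded.shape)
```

Either way, position `i` of the resulting vectors does not refer to the same pixel location across posters of different resolutions. Resizing every image to one fixed size before flattening (for example with PIL's `Image.resize`) is a possible alternative that avoids this, at the cost of some interpolation.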

In [25]:
image_modality = ImageModality(features=image_matrices2, ids=images_item_ids)

In [26]:
item_text_modality = TextModality(
    corpus=content_processed.text.to_list(),
    ids=content_processed.ItemId.to_list(),
)

In [27]:
ratio_split = RatioSplit(
    data=ratings[['UserId', 'ItemId', 'Rating']].values.tolist(),
    test_size=0.2,
    exclude_unknowns=True,
    verbose=True,
    seed=12012001,
    rating_threshold=0.5,
    item_text=item_text_modality,
)

rating_threshold = 0.5
exclude_unknowns = True
---
Training data:
Number of users = 46750
Number of items = 27045
Number of ratings = 527776
Max rating = 10.0
Min rating = 0.0
Global mean = 7.3
---
Test data:
Number of users = 46750
Number of items = 27045
Number of ratings = 124031
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 46750
Total items = 27045
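
As Cornac documents it, `exclude_unknowns=True` makes the evaluation ignore test users and items that never appear in the training data, which is consistent with the zero unknown counts reported above. The idea can be sketched in plain Python as follows. This is not Cornac's implementation; `filter_known` and the toy triples are hypothetical.

```python
def filter_known(train, test):
    """Keep only test triples whose user and item both occur in train."""
    known_users = {user for user, _, _ in train}
    known_items = {item for _, item, _ in train}
    return [(u, i, r) for u, i, r in test if u in known_users and i in known_items]


# Toy (user, item, rating) triples.
train = [("u1", "i1", 8.0), ("u2", "i2", 6.0)]
test = [("u1", "i2", 7.0), ("u3", "i1", 9.0), ("u2", "i9", 5.0)]
print(filter_known(train, test))
```

Only `("u1", "i2", 7.0)` survives: `u3` is an unseen user and `i9` an unseen item. The same limitation shows up at prediction time, where users and items outside the training index maps cannot be scored by the model.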


In [28]:
# Instantiate DMRL recommender
# dmrl_recommender = DMRL(
#     batch_size=4096,
#     epochs=20,
#     log_metrics=False,
#     learning_rate=0.01,
#     num_factors=2,
#     decay_r=0.5,
#     decay_c=0.01,
#     num_neg=3,
#     embedding_dim=100,
# )

In [30]:
dmrl_recommender = DMRL(
    batch_size=1024,
    epochs=20,
    log_metrics=False,
    learning_rate=0.1,
    num_factors=2,
    decay_r=0.0001,
    decay_c=0.001,
    num_neg=4,
    embedding_dim=128,
)

In [31]:
# Put everything together into an experiment and run it
cornac.Experiment(
    eval_method=ratio_split, models=[dmrl_recommender], metrics=[NDCG()]
).run()


[DMRL] Training started!
Pre-encoding the entire corpus. This might take a while.
Using device cuda for training
  batch 5 loss: 715.0111083984375
  batch 10 loss: 737.4337524414062
  batch 15 loss: 770.6747680664063
  batch 20 loss: 788.4083251953125
  batch 25 loss: 762.1679565429688
  batch 30 loss: 716.5503540039062
  batch 35 loss: 691.4635986328125
  batch 40 loss: 682.841064453125
  batch 45 loss: 681.993505859375
  batch 50 loss: 677.434375
  batch 55 loss: 678.61220703125
  batch 60 loss: 667.6045043945312
  batch 65 loss: 671.3977416992187
  batch 70 loss: 665.6492431640625
  batch 75 loss: 666.575146484375
  batch 80 loss: 646.55546875
  batch 85 loss: 645.0768920898438
  batch 90 loss: 627.4946655273437
  batch 95 loss: 644.9012817382812
  batch 100 loss: 628.7125854492188
  batch 105 loss: 627.4274047851562
  batch 110 loss: 607.3368286132812
  batch 115 loss: 620.1595947265625
  batch 120 loss: 615.0681762695312
  batch 125 loss: 599.0192504882813
  batch 130 loss: 606.5

Ranking: 100%|██████████| 19975/19975 [05:44<00:00, 58.05it/s]


TEST:
...
     | NDCG@-1 | Train (s) | Test (s)
---- + ------- + --------- + --------
DMRL |  0.2411 |  198.5040 | 344.1339
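
The table reports `NDCG@-1`, where a cutoff of -1 in Cornac means the full ranked list is scored. For reference, a binary-relevance NDCG can be sketched as below, assuming an item counts as relevant when its test rating reaches `rating_threshold`. `ndcg` here is a hypothetical helper and not Cornac's implementation.

```python
import math


def ndcg(ranked_items, relevant_items, k=-1):
    """Binary-relevance NDCG; k=-1 means the whole ranked list is scored."""
    if k > 0:
        ranked_items = ranked_items[:k]
    # Each relevant item contributes 1 / log2(rank + 1), with ranks starting at 1.
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked_items)
        if item in relevant_items
    )
    # Ideal ordering: all relevant items placed at the top of the list.
    ideal_hits = min(len(relevant_items), len(ranked_items))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg(["a", "b", "c", "d"], {"a", "c"}))
```

In this toy ranking the relevant items sit at ranks 1 and 3, so the score is (1 + 1/2) divided by the ideal (1 + 1/log2(3)), a little below 1.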






In [32]:
target_prediction = targets.copy()
target_prediction["Rating"] = -1.0  # float placeholder, so predicted scores keep the column dtype

user_id_list = targets.UserId.unique()
for user_id in user_id_list:
    # Get the train dataframe index of the user to predict
    user_index = ratio_split.train_set.uid_map.get(user_id)

    if user_index is None:
        print(f"User {user_id} is not in the train set")
        continue

    # Filter by items to predict
    items_to_predict = targets.loc[targets.UserId == user_id, "ItemId"].to_list()

    # Get the train dataframe index of the items to predict
    items_to_predict_index = np.array([ratio_split.train_set.iid_map.get(item_id) for item_id in items_to_predict])

    items_to_predict_tensor = torch.tensor([idx for idx in items_to_predict_index if idx is not None])

    # Get the position of items that are not in the train set
    none_indices = [i for i, x in enumerate(items_to_predict_index) if x is None]

    # Get the prediction for the items
    line_rating = dmrl_recommender.score(user_index=user_index, item_indices=items_to_predict_tensor)

    # Insert -1 in the position of items that are not in the train set
    for index_to_insert in none_indices:
        line_rating = np.insert(line_rating, index_to_insert, -1)

    # Insert the prediction in the target_prediction dataframe
    target_prediction.loc[targets.UserId == user_id, "Rating"] = line_rating

User 02b780c583 is not in the train set
User 03fbe3c1e8 is not in the train set
User 0503bf6097 is not in the train set
User 0958560bd4 is not in the train set
User 0b17c3df40 is not in the train set
User 10c3aef33f is not in the train set
User 119653373a is not in the train set
User 139ccdbf5a is not in the train set
User 1671fa4df3 is not in the train set
User 16757f4d90 is not in the train set
User 187454feb4 is not in the train set
User 1af811e7b9 is not in the train set
User 1bbf59e4a4 is not in the train set
User 1ef563bb8b is not in the train set
User 1f6a565547 is not in the train set
User 211a7a84d3 is not in the train set
User 243be2818a is not in the train set
User 24a13b72fc is not in the train set
User 27649dc46e is not in the train set
User 2a3500f75b is not in the train set
User 2ae6906d5d is not in the train set
User 2cfdb14e05 is not in the train set
User 2ed4d0c23c is not in the train set
User 2ef13886ae is not in the train set
User 302c6ebc7d is not in the train set
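
Users missing from the training set keep the placeholder score for all of their items, so their lists end up in arbitrary order. One simple fallback, in line with the popularity idea in the notes at the top, is to rank such a user's candidate items by how often each item was rated in training. The sketch below uses invented data; `popularity_scores` and `rank_for_cold_user` are hypothetical helpers.

```python
from collections import Counter


def popularity_scores(train_ratings):
    """Count how many training interactions each item has."""
    return Counter(item for _, item, _ in train_ratings)


def rank_for_cold_user(candidate_items, popularity):
    """Order candidates from most to least popular; unseen items count as 0."""
    return sorted(candidate_items, key=lambda item: popularity.get(item, 0), reverse=True)


# Toy (user, item, rating) triples.
train_ratings = [
    ("u1", "i1", 8),
    ("u2", "i1", 7),
    ("u2", "i2", 9),
    ("u3", "i3", 5),
    ("u3", "i1", 6),
]
popularity = popularity_scores(train_ratings)
print(rank_for_cold_user(["i3", "i9", "i1", "i2"], popularity))
```

Python's sort is stable, so items with equal popularity keep their original relative order, and items never seen in training fall to the end.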


In [33]:
target_prediction = target_prediction.sort_values(["UserId", "Rating"], ascending=[True, False])

In [34]:
target_prediction.to_csv("submissao_4_DMRL_versao_3.csv", index=False)

In [35]:
target_prediction = target_prediction.drop(columns="Rating")

In [36]:
target_prediction.to_csv("submissao_4_DMRL_versao_3_sem_rating.csv", index=False)