# DeepFM

이번 실습에서는 DeepFM 모델을 이해하고 구현해보겠습니다.  

DeepFM 모델은 Factorization machines와 neural network를 합친 모델로, Wide & Deep model과 유사하지만, feature engineering이 필요하지 않다는 특징을 가지고 있습니다.  
<br/>
사용자가 영화에 대해 Rating한 데이터, 영화의 장르 데이터를 이용하여 Train/Test data를 생성한 다음, Train data로 학습한 모델을 Test data에 대해 평가해봅니다.   
사용한 데이터는 Implicit feedback data로, 사용자가 시청한 영화(Positive instances)는 rating = 1로 기록됩니다. 따라서 시청하지 않은 영화에 대해 각 유저별로 Negative instances sampling을 진행합니다.   
<br/>
**구현에 앞서, DeepFM 논문을 꼭 읽어보시길 권장합니다.**

* 참고  
    - DeepFM: A Factorization-Machine based Neural Network for CTR Prediction (https://arxiv.org/pdf/1703.04247.pdf)  
    - Wide & Deep Learning for Recommender Systems (https://arxiv.org/pdf/1606.07792.pdf)
    - Factorization Machines (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5694074)
    - https://d2l.ai/chapter_recommender-systems/deepfm.html

# Modules

In [1]:
import csv
import numpy as np
import pandas as pd
from collections import Counter
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Data preprocessing
0. Dataset 다운로드  
<br/>
1. Rating df 생성  
rating 데이터(train_ratings.csv)를 불러와 [user, item, rating]의 컬럼으로 구성된 데이터 프레임을 생성합니다.   
<br/>
2. Genre df 생성   
genre 정보가 담긴 데이터(genres.tsv)를 불러와 genre이름을 id로 변경하고, [item, genre]의 컬럼으로 구성된 데이터 프레임을 생성합니다.    
<br/>
3. Negative instances 생성   
rating 데이터는 implicit feedback data(rating :0/1)로, positive instances로 구성되어 있습니다. 따라서 rating이 없는 item중 negative instances를 뽑아서 데이터에 추가하게 됩니다.   
<br/>
4. Join dfs   
rating df와 genre df를 join하여 [user, item, rating, genre]의 컬럼으로 구성된 데이터 프레임을 생성합니다.   
<br/>
5. zero-based index로 mapping   
Embedding을 위해서 user,item,genre를 zero-based index로 mapping합니다.
    - user : 0-31359
    - item : 0-6806
    - genre : 0-17  
<br/>
6. feature matrix X, label tensor y 생성   
[user, item, genre] 3개의 field로 구성된 feature matrix를 생성합니다.   
<br/>
7. data loader 생성

## 데이터 다운로드
이곳에 대회 사이트(AI Stages)에 있는 data의 URL을 입력해주세요. 
- 데이터 URL은 변경될 수 있습니다.
- 예) `!wget https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000176/data/data.tar.gz`

In [2]:
# 0. Dataset 다운로드
!wget https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000179/data/data.tar.gz
!tar -xf data.tar.gz

--2022-03-24 12:15:44--  https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000179/data/data.tar.gz
Resolving aistages-prod-server-public.s3.amazonaws.com (aistages-prod-server-public.s3.amazonaws.com)... 52.218.244.106
Connecting to aistages-prod-server-public.s3.amazonaws.com (aistages-prod-server-public.s3.amazonaws.com)|52.218.244.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33425907 (32M) [binary/octet-stream]
Saving to: ‘data.tar.gz.2’


2022-03-24 12:15:48 (8.89 MB/s) - ‘data.tar.gz.2’ saved [33425907/33425907]

tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.dropbox.att

In [2]:
# 1. Rating df 생성
rating_data = "./data/train/train_ratings.csv"

raw_rating_df = pd.read_csv(rating_data)
raw_rating_df
raw_rating_df['rating'] = 1.0 # implicit feedback

users = set(raw_rating_df.loc[:, 'user'])
items = set(raw_rating_df.loc[:, 'item'])
# times = list(set(raw_rating_df.loc[:, 'time']))

# #2. Time df 생성
# raw_time_df = raw_rating_df[["user","item","time"]]
raw_rating_df.drop(['time'],axis=1,inplace=True)

#2. Genre df 생성
genre_data = "./data/train/genres.tsv"

raw_genre_df = pd.read_csv(genre_data, sep='\t')
raw_genre_df = raw_genre_df.drop_duplicates(subset=['item']) #item별 하나의 장르만 남도록 drop
genre_dict = {genre:i for i, genre in enumerate(set(raw_genre_df['genre']))}
raw_genre_df['genre']  = raw_genre_df['genre'].map(lambda x : genre_dict[x]) #genre id로 변경

#2. year df 생성
year_data = "./data/train/years.tsv"

raw_year_df = pd.read_csv(year_data, sep='\t')
year_dict = {year:i for i, year in enumerate(set(raw_year_df['year']))}
raw_year_df['year']  = raw_year_df['year'].map(lambda x : year_dict[x]) #year id로 변경

#2. director df 생성
director_data = "./data/train/directors.tsv"

raw_director_df = pd.read_csv(director_data, sep='\t')
raw_director_df = raw_director_df.drop_duplicates(subset=['item']) #item별 하나의 감독만 남도록 drop
director_dict = {director:i for i, director in enumerate(set(raw_director_df['director']))}
raw_director_df['director']  = raw_director_df['director'].map(lambda x : director_dict[x]) #director id로 변경


#2. writer df 생성
writer_data = "./data/train/writers.tsv"

raw_writer_df = pd.read_csv(writer_data, sep='\t')
raw_writer_df = raw_writer_df.drop_duplicates(subset=['item']) #item별 하나의 각본가만 남도록 drop
writer_dict = {writer:i for i, writer in enumerate(set(raw_writer_df['writer']))}
raw_writer_df['writer']  = raw_writer_df['writer'].map(lambda x : writer_dict[x]) #writer id로 변경

In [3]:
# 3. Negative instance 생성
print("Create Nagetive instances")
# num_negative = 50
user_group_dfs = list(raw_rating_df.groupby('user')['item'])
user_group_dfs

Create Nagetive instances


[(11,
  0       4643
  1        170
  2        531
  3        616
  4       2140
         ...  
  371    48738
  372     6291
  373    46578
  374     7153
  375     4226
  Name: item, Length: 376, dtype: int64),
 (14,
  376    8961
  377    1396
  378     471
  379    2105
  380    1042
         ... 
  551    1282
  552     252
  553    2161
  554    1271
  555     468
  Name: item, Length: 180, dtype: int64),
 (18,
  556     1952
  557     1283
  558     3507
  559     4280
  560    51084
         ...  
  628     8254
  629    63062
  630      186
  631    60482
  632    71033
  Name: item, Length: 77, dtype: int64),
 (25,
  633      261
  634       22
  635     2161
  636     3255
  637      372
         ...  
  719      337
  720     1732
  721     4027
  722     2692
  723    52319
  Name: item, Length: 91, dtype: int64),
 (31,
  724      260
  725     1196
  726     1210
  727     7153
  728     4993
         ...  
  873    53464
  874    58025
  875    56775
  876     7317
  877

In [4]:

first_row = True
user_neg_dfs = pd.DataFrame()

for u, u_items in tqdm(user_group_dfs):
    u_items = set(u_items)
    num_negative = len(u_items)
    i_user_neg_item = np.random.choice(list(items - u_items), num_negative, replace=False)
    
    i_user_neg_df = pd.DataFrame({'user': [u]*num_negative, 'item': i_user_neg_item, 'rating': [0]*num_negative})
    if first_row == True:
        user_neg_dfs = i_user_neg_df
        first_row = False
    else:
        user_neg_dfs = pd.concat([user_neg_dfs, i_user_neg_df], axis = 0, sort=False)
        
# user_neg_dfs['time'] = raw_rating_df['time']
joined_rating_df = pd.concat([raw_rating_df, user_neg_dfs], axis = 0, sort=False)
joined_rating_df = pd.merge(joined_rating_df, raw_genre_df, left_on='item', right_on='item', how='inner')
joined_rating_df = pd.merge(joined_rating_df, raw_year_df, left_on='item', right_on='item', how='inner')
joined_rating_df = pd.merge(joined_rating_df, raw_director_df, left_on='item', right_on='item', how='inner')
joined_rating_df = pd.merge(joined_rating_df, raw_writer_df, left_on='item', right_on='item', how='inner')
# 4. Join dfs
# print("Joined rating df")
# print(joined_rating_df)

100%|██████████| 31360/31360 [20:55<00:00, 24.97it/s]


In [5]:
joined_rating_df

Unnamed: 0,user,item,rating,genre,year,director,writer
0,11,4643,1.0,3,79,904,257
1,189,4643,1.0,3,79,904,257
2,294,4643,1.0,3,79,904,257
3,383,4643,1.0,3,79,904,257
4,421,4643,1.0,3,79,904,257
...,...,...,...,...,...,...,...
8381436,136426,102880,0.0,3,91,731,1186
8381437,137584,102880,0.0,3,91,731,1186
8381438,137932,102880,0.0,3,91,731,1186
8381439,138164,102880,0.0,3,91,731,1186


In [6]:

# 5. user, item을 zero-based index로 mapping
users = list(set(joined_rating_df.loc[:,'user']))
users.sort()
items =  list(set((joined_rating_df.loc[:, 'item'])))
items.sort()
genres =  list(set((joined_rating_df.loc[:, 'genre'])))
genres.sort()
years =  list(set((joined_rating_df.loc[:, 'year'])))
years.sort()
directors =  list(set((joined_rating_df.loc[:, 'director'])))
directors.sort()
writers =  list(set((joined_rating_df.loc[:, 'writer'])))
writers.sort()

if len(users)-1 != max(users):
    users_dict = {users[i]: i for i in range(len(users))}
    joined_rating_df['user']  = joined_rating_df['user'].map(lambda x : users_dict[x])
    users = list(set(joined_rating_df.loc[:,'user']))
    
if len(items)-1 != max(items):
    items_dict = {items[i]: i for i in range(len(items))}
    joined_rating_df['item']  = joined_rating_df['item'].map(lambda x : items_dict[x])
    items =  list(set((joined_rating_df.loc[:, 'item'])))

if len(genres)-1 != max(genres):
    genres_dict = {genres[i]: i for i in range(len(genres))}
    joined_rating_df['genre']  = joined_rating_df['genre'].map(lambda x : genres_dict[x])
    genres =  list(set((joined_rating_df.loc[:, 'genre'])))

if len(years)-1 != max(years):
    years_dict = {years[i]: i for i in range(len(years))}
    joined_rating_df['year']  = joined_rating_df['year'].map(lambda x : years_dict[x])
    years =  list(set((joined_rating_df.loc[:, 'year'])))

if len(directors)-1 != max(directors):
    directors_dict = {directors[i]: i for i in range(len(directors))}
    joined_rating_df['director']  = joined_rating_df['director'].map(lambda x : directors_dict[x])
    directors =  list(set((joined_rating_df.loc[:, 'director'])))
    
if len(writers)-1 != max(writers):
    writers_dict = {writers[i]: i for i in range(len(writers))}
    joined_rating_df['writer']  = joined_rating_df['writer'].map(lambda x : writers_dict[x])
    writers =  list(set((joined_rating_df.loc[:, 'writer'])))
    
joined_rating_df = joined_rating_df.sort_values(by=['user'])
joined_rating_df.reset_index(drop=True, inplace=True)

data = joined_rating_df
print("Data")
print(data)

n_data = len(data)
n_user = len(users)
n_item = len(items)
n_genre = len(genres)
n_year = len(years)
n_director = len(directors)
n_writers = len(writers)

# print("# of data : {}\n# of users : {}\n# of items : {}\n# of genres : {}".format(n_data, n_user, n_item, n_genre))

Data
          user  item  rating  genre  year  director  writer
0            0  2033     1.0      3    79       878     230
1            0   261     0.0      3    71       409     606
2            0  2679     0.0      9    63       115     157
3            0  3231     0.0     13    63      1065     405
4            0   125     0.0      5    73      1215    1808
...        ...   ...     ...    ...   ...       ...     ...
8381436  31359   574     0.0      3    59       462     678
8381437  31359  3027     0.0      8    10      1130    1397
8381438  31359   112     0.0      5    73       994    1624
8381439  31359  1932     1.0      5    79       132     958
8381440  31359  1388     1.0      5    77       973    1230

[8381441 rows x 7 columns]


In [7]:
#6. feature matrix X, label tensor y 생성
user_col = torch.tensor(data.loc[:,'user'])
item_col = torch.tensor(data.loc[:,'item'])
genre_col = torch.tensor(data.loc[:,'genre'])
year_col = torch.tensor(data.loc[:,'year'])
director_col = torch.tensor(data.loc[:,'director'])
writer_col = torch.tensor(data.loc[:,'writer'])

offsets = [0, n_user, n_user+n_item, n_user+n_item, n_user+n_item, n_user+n_item]
for col, offset in zip([user_col, item_col, genre_col, year_col, director_col, writer_col], offsets):
    col += offset

X = torch.cat([user_col.unsqueeze(1), item_col.unsqueeze(1), genre_col.unsqueeze(1), year_col.unsqueeze(1), director_col.unsqueeze(1), writer_col.unsqueeze(1)], dim=1)
y = torch.tensor(list(data.loc[:,'rating']))


#7. data loader 생성
class RatingDataset(Dataset):
    def __init__(self, input_tensor, target_tensor):
        self.input_tensor = input_tensor.long()
        self.target_tensor = target_tensor.long()

    def __getitem__(self, index):
        return self.input_tensor[index], self.target_tensor[index]

    def __len__(self):
        return self.target_tensor.size(0)


dataset = RatingDataset(X, y)
train_ratio = 0.9

train_size = int(train_ratio * len(data))
test_size = len(data) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=512, shuffle=False)

   # Model architecture (DeepFM)
   DeepFM 모델은 1) FM component와  2) Deep component가 병렬적으로 결합되어 있습니다. 구조는 다음과 같습니다.
<img src='https://drive.google.com/uc?id=1vwcxUJQTIsg5QH9CuH5PcUEfExhToUHR'>  
각 구조는 다음과 같습니다.  
   **1. FM component**  
       FM component는 우리가 아는 2-way Factorization machines(degree=2)입니다. FM은 variables 간의 interaction을 다음과 같이 모델링 합니다.   
     **<center> equation (1) </center>**
   $$\hat{y}(x):=w_0 + \sum_{i=1}^{n}w_ix_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n}<\mathbf{v}_i,\mathbf{v}_j>x_ix_j$$   
   이때, 세번째 interaction term을 전개하여 다음과 같이 쓸 수 있습니다.(논문 참고)  
   구현 코드는 전개된 식을 바탕으로 합니다.   
     **<center> equation (2)> </center>**
   $$\sum_{i=1}^{n}\sum_{j=i+1}^{n}<\mathbf{v}_i,\mathbf{v}_j>x_ix_j = \frac{1}{2}\sum_{f=1}^{k}((\sum_{i=1}^{n}v_{i,f}x_i)^2-\sum_{i=1}^{n}v_{i,f}^2x_i^2)$$   
           
   **2. Deep component**  
       Deep component는 MLP Layers로 구성되어 있습니다.   
       구현 코드는 Input dimension이 30-20-10인 3 layer MLP 구조입니다.
  
   

# DeepFM

In [8]:
class DeepFM(nn.Module):
    def __init__(self, input_dims, embedding_dim, mlp_dims, drop_rate=0.1):
        super(DeepFM, self).__init__()
        total_input_dim = int(sum(input_dims)) # n_user + n_movie + n_genre

        # Fm component의 constant bias term과 1차 bias term
        self.bias = nn.Parameter(torch.zeros((1,)))
        self.fc = nn.Embedding(total_input_dim, 1)
        
        self.embedding = nn.Embedding(total_input_dim, embedding_dim) 
        self.embedding_dim = len(input_dims) * embedding_dim

        mlp_layers = []
        for i, dim in enumerate(mlp_dims):
            if i==0:
                mlp_layers.append(nn.Linear(self.embedding_dim, dim))
            else:
                mlp_layers.append(nn.Linear(mlp_dims[i-1], dim)) #TODO 1 : linear layer를 넣어주세요.
            mlp_layers.append(nn.ReLU(True))
            mlp_layers.append(nn.Dropout(drop_rate))
        mlp_layers.append(nn.Linear(mlp_dims[-1], 1))
        self.mlp_layers = nn.Sequential(*mlp_layers)

    def fm(self, x):
        # x : (batch_size, total_num_input)
        embed_x = self.embedding(x)

        fm_y = self.bias + torch.sum(self.fc(x), dim=1)
        square_of_sum = torch.sum(embed_x, dim=1) ** 2         #TODO 2 : torch.sum을 이용하여 square_of_sum을 작성해주세요(hint : equation (2))
        sum_of_square = torch.sum(embed_x ** 2, dim=1)         #TODO 3 : torch.sum을 이용하여 sum_of_square을 작성해주세요(hint : equation (2))
        fm_y += 0.5 * torch.sum(square_of_sum - sum_of_square, dim=1, keepdim=True)
        return fm_y
    
    def mlp(self, x):
        embed_x = self.embedding(x)
        
        inputs = embed_x.view(-1, self.embedding_dim)
        mlp_y = self.mlp_layers(inputs)
        return mlp_y

    def forward(self, x):
        #fm component
        fm_y = self.fm(x).squeeze(1)
        
        #deep component
        mlp_y = self.mlp(x).squeeze(1)
        
        y = torch.sigmoid(fm_y + mlp_y)
        return y


# Training

In [9]:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

In [10]:
device = torch.device('cuda')
input_dims = [n_user, n_item, n_genre, n_year, n_director, n_writers]
embedding_dim = 10
model = DeepFM(input_dims, embedding_dim, mlp_dims=[30, 20, 10]).to(device)
bce_loss = nn.MSELoss() # Binary Cross Entropy loss
lr, num_epochs = 0.01, 10
optimizer = optim.Adam(model.parameters(), lr=lr)

for e in tqdm(range(num_epochs)) :
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        model.train()
        optimizer.zero_grad()
        output = model(x)
        loss = bce_loss(output, y.float())
        loss.backward()
        optimizer.step()
        

100%|██████████| 10/10 [19:05<00:00, 114.54s/it]


# Evaluation
평가는 모델이 postive instance에 대해 0.5이상, negative instance에 대해 0.5미만의 값을 예측한 Accuracy를 측정하여 진행됩니다.

In [11]:
correct_result_sum = 0
for x, y in test_loader:
    x, y = x.to(device), y.to(device)
    model.eval()
    output = model(x)
    result = torch.round(output)
    correct_result_sum += (result == y).sum().float()

acc = correct_result_sum/len(test_dataset)*100
print("Final Acc : {:.2f}%".format(acc.item()))

Final Acc : 83.12%


# Inference

In [12]:
first_row = True
infr_df = raw_rating_df
infr_df = pd.merge(infr_df, raw_genre_df, left_on='item', right_on='item', how='inner')
infr_df = pd.merge(infr_df, raw_year_df, left_on='item', right_on='item', how='inner')
infr_df = pd.merge(infr_df, raw_director_df, left_on='item', right_on='item', how='inner')
infr_df = pd.merge(infr_df, raw_writer_df, left_on='item', right_on='item', how='inner')

In [13]:
# 5. user, item을 zero-based index로 mapping
users = list(set(infr_df.loc[:,'user']))
users.sort()
items =  list(set((infr_df.loc[:, 'item'])))
items.sort()
genres =  list(set((infr_df.loc[:, 'genre'])))
genres.sort()
years =  list(set((infr_df.loc[:, 'year'])))
years.sort()
directors =  list(set((infr_df.loc[:, 'director'])))
directors.sort()
writers =  list(set((infr_df.loc[:, 'writer'])))
writers.sort()
if len(users)-1 != max(users):
    users_dict = {users[i]: i for i in range(len(users))}
    infr_df['user']  = infr_df['user'].map(lambda x : users_dict[x])
    users = list(set(infr_df.loc[:,'user']))
    
if len(items)-1 != max(items):
    items_dict = {items[i]: i for i in range(len(items))}
    infr_df['item']  = infr_df['item'].map(lambda x : items_dict[x])
    items =  list(set((infr_df.loc[:, 'item'])))

if len(genres)-1 != max(genres):
    genres_dict = {genres[i]: i for i in range(len(genres))}
    infr_df['genre']  = infr_df['genre'].map(lambda x : genres_dict[x])
    genres =  list(set((infr_df.loc[:, 'genre'])))

if len(years)-1 != max(years):
    years_dict = {years[i]: i for i in range(len(years))}
    infr_df['year']  = infr_df['year'].map(lambda x : years_dict[x])
    years =  list(set((infr_df.loc[:, 'year'])))

if len(directors)-1 != max(directors):
    directors_dict = {directors[i]: i for i in range(len(directors))}
    infr_df['director']  = infr_df['director'].map(lambda x : directors_dict[x])
    directors =  list(set((infr_df.loc[:, 'director'])))
    
if len(writers)-1 != max(writers):
    writers_dict = {writers[i]: i for i in range(len(writers))}
    infr_df['writer']  = infr_df['writer'].map(lambda x : writers_dict[x])
    writers =  list(set((infr_df.loc[:, 'writer'])))

In [21]:
user_group_dfs = infr_df.groupby('user')[['item','genre','year','director','writer']]
items = set(raw_rating_df["item"])
ans = None
for u, x in tqdm(user_group_dfs): #(u_items, u_genres, u_years, u_directors, u_wirters)
    first = True
    tmp_df = user_group_dfs[user_group_dfs[user_group_dfs["users"]!=u]]
    user_col = torch.tensor(tmp_df.loc[:,'user'])
    item_col = torch.tensor(tmp_df.loc[:,'item'])
    genre_col = torch.tensor(tmp_df.loc[:,'genre'])
    year_col = torch.tensor(tmp_df.loc[:,'year'])
    director_col = torch.tensor(tmp_df.loc[:,'director'])
    writer_col = torch.tensor(tmp_df.loc[:,'writer'])
    X = torch.cat([user_col.unsqueeze(1), item_col.unsqueeze(1), genre_col.unsqueeze(1), year_col.unsqueeze(1), director_col.unsqueeze(1), writer_col.unsqueeze(1)], dim=1)
    y = torch.tensor(list(infr_df.loc[:,'rating']))
    dataset = RatingDataset(X, y)
    sbm_loader =  DataLoader(dataset, batch_size=1024, shuffle=False)
    
    for x, _ in sbm_loader:
        x = x.to(device)
        model.eval()
        output = model(x).to(device)
        tmp = pd.dataFrame({"user":x[:,0], "item": x[:,1], "score":output})
        if first:
            first = False
            user_items = tmp
        else:
            user_items = pd.concat([user_items, tmp], axis = 0, sort=False)
    user_items.sort_values(by="score")
    user_items = user_items[0:9]
    if not ans:
        ans = user_items
    else:
        ans = pd.concat([user_items, tmp], axis = 0, sort=False)




  0%|          | 0/31360 [00:00<?, ?it/s]


IndexError: Column(s) ['item', 'genre', 'year', 'director', 'writer'] already selected

In [25]:
# from utils import generate_submission_file

# dataset = RatingDataset(X, y)
# sbm_loader =  DataLoader(dataset, batch_size=1024, shuffle=True)

# tmp = X[:,0]
# ans = torch.Tensor().to(device)
# for x, _ in sbm_loader:
#     x = x.to(device)
#     model.eval()
#     output = model(x).to(device)
#     for idx in range(len(x)):
#         if int(x[idx][1]) in user_items.get(int(x[idx][0]),set()):
#             output[idx] = 0
#     ans = torch.cat([ans,output], dim=0)

# # generate_submission_file("/opt/ml/input/code/output/train_ratings.csv",pred_list)

In [26]:
# a = {x:[] for x in range(31360)}
# for i in range(len(X)):
#     a[int(tmp[i])].append((X[i][1],ans[i]))

In [27]:
# for i in a.keys():
#     a[i].sort(key = lambda x: -x[1])
#     a[i] = a[i][:10]

In [None]:
pd.DataFrame(ans, columns=["user", "item"]).to_csv(
    "output/submission.csv", index=False
)

In [29]:
# generate_submission_file("/opt/ml/input/code/data/train/train_ratings.csv",preds)