# DeepFM

In this exercise we study the DeepFM model and implement it.  

DeepFM combines factorization machines with a neural network. It resembles the Wide & Deep model, but it does not require feature engineering.  
<br/>
We build train/test data from users' movie rating data and the movies' genre data, train the model on the train data, and evaluate it on the test data.   
The data is implicit feedback: every movie a user watched (a positive instance) is recorded with rating = 1. We therefore sample negative instances for each user from the movies that user has not watched.   
<br/>
**Reading the DeepFM paper before starting the implementation is strongly recommended.**

* References  
    - DeepFM: A Factorization-Machine based Neural Network for CTR Prediction (https://arxiv.org/pdf/1703.04247.pdf)  
    - Wide & Deep Learning for Recommender Systems (https://arxiv.org/pdf/1606.07792.pdf)
    - Factorization Machines (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5694074)
    - https://d2l.ai/chapter_recommender-systems/deepfm.html

### Modules

In [1]:
import csv
import os
import pickle
import numpy as np
import pandas as pd
from collections import Counter
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

### Data preprocessing
0. Download the dataset  
<br/>
1. Create the rating df  
Load the rating data (train_ratings.csv) and build a DataFrame with the columns [user, item, rating].   
<br/>
2. Create the genre df   
Load the genre data (genres.tsv), replace each genre name with an id, and build a DataFrame with the columns [item, genre].    
<br/>
3. Create negative instances   
The rating data is implicit feedback (rating: 0/1) and contains only positive instances. We therefore sample negative instances from the items each user has no rating for and add them to the data.   
<br/>
4. Join dfs   
Join the rating df and the genre df into a DataFrame with the columns [user, item, rating, genre].   
<br/>
5. Map to zero-based indices   
For the embedding lookup, user, item, and genre are mapped to zero-based indices.
    - user : 0-31359
    - item : 0-6806
    - genre : 0-17  
<br/>
6. Create the feature matrix X and the label tensor y   
Build a feature matrix with the three fields [user, item, genre].   
<br/>
7. Create the data loaders
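
Steps 5 and 6 can be illustrated on toy data. The sketch below assumes hypothetical field sizes (3 users, 4 items, 2 genres, not the real dataset sizes) and shows how adding a per-field offset lets the three fields share a single embedding table without index collisions. The names `field_sizes`, `rows`, and `X_toy` are illustrative only.

```python
import numpy as np

# Hypothetical field sizes: 3 users, 4 items, 2 genres.
field_sizes = [3, 4, 2]

# Each offset is the total size of the preceding fields: [0, 3, 7].
offsets = np.concatenate(([0], np.cumsum(field_sizes)[:-1]))

# Each row is one zero-based (user, item, genre) triple.
rows = np.array([[0, 0, 0],
                 [2, 3, 1],
                 [1, 2, 0]])

# Broadcasting adds each field's offset to its column.
X_toy = rows + offsets

print(offsets.tolist())  # [0, 3, 7]
print(X_toy.tolist())    # [[0, 3, 7], [2, 6, 8], [1, 5, 7]]
```

After the shift every index is below `sum(field_sizes)`, which is why a single embedding table of that size can serve all three fields.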

#### Downloading the data
Enter the URL of the data from the competition site (AI Stages) here. 
- The data URL may change.
- e.g. `!wget https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000176/data/data.tar.gz`

In [None]:
# # 0. Download the dataset
# !wget <competition data URL>
# !tar -xf data.tar.gz

In [2]:
# 1. Create the rating df
rating_data = "/opt/ml/input/data/train/train_ratings.csv"

raw_rating_df = pd.read_csv(rating_data)
raw_rating_df['rating'] = 1.0  # implicit feedback: every observed interaction is a positive
raw_rating_df.drop(['time'], axis=1, inplace=True)
print("Raw rating df")
print(raw_rating_df)

users = set(raw_rating_df.loc[:, 'user'])
items = set(raw_rating_df.loc[:, 'item'])

# 2. Create the genre df
genre_data = "/opt/ml/input/data/train/genres.tsv"

raw_genre_df = pd.read_csv(genre_data, sep='\t')
raw_genre_df = raw_genre_df.drop_duplicates(subset=['item'])  # keep only the first listed genre per item
# print(raw_genre_df)

# Note: the iteration order of a set of strings can differ between Python sessions,
# so the genre ids assigned here are not guaranteed to be the same on every run.
genre_dict = {genre: i for i, genre in enumerate(set(raw_genre_df['genre']))}
raw_genre_df['genre'] = raw_genre_df['genre'].map(lambda x: genre_dict[x])  # replace genre name with id
print("Raw genre df - changed to id")
print(raw_genre_df)

Raw rating df
           user   item  rating
0            11   4643     1.0
1            11    170     1.0
2            11    531     1.0
3            11    616     1.0
4            11   2140     1.0
...         ...    ...     ...
5154466  138493  44022     1.0
5154467  138493   4958     1.0
5154468  138493  68319     1.0
5154469  138493  40819     1.0
5154470  138493  27311     1.0

[5154471 rows x 3 columns]
Raw genre df - changed to id
         item  genre
0         318      9
2        2571     16
5        2959     16
9         296     13
13        356     13
...       ...    ...
15925   73106     13
15926  109850     16
15929    8605     16
15931    3689     13
15932    8130      0

[6807 rows x 2 columns]


In [3]:
# 3. Create negative instances
print("Create Negative instances")
num_negative = 50
user_group_dfs = list(raw_rating_df.groupby('user')['item'])
user_neg_dfs = []

for u, u_items in tqdm(user_group_dfs):
    u_items = set(u_items)
    # sample num_negative items from those this user has not interacted with
    i_user_neg_item = np.random.choice(list(items - u_items), num_negative, replace=False)

    # negative df with rating 0
    i_user_neg_df = pd.DataFrame({'user': [u]*num_negative, 'item': i_user_neg_item, 'rating': [0]*num_negative})
    user_neg_dfs.append(i_user_neg_df)

# concatenate once at the end; pd.concat inside the loop copies the growing frame on every iteration
user_neg_dfs = pd.concat(user_neg_dfs, axis=0, sort=False)

raw_rating_df = pd.concat([raw_rating_df, user_neg_dfs], axis=0, sort=False)  # concat(positive, negative)

# 4. Join dfs
joined_rating_df = pd.merge(raw_rating_df, raw_genre_df, left_on='item', right_on='item', how='inner') 
# print("Joined rating df")
# print(joined_rating_df)

# 5. Map user and item to zero-based indices
users = list(set(joined_rating_df.loc[:,'user']))  # len = 31360
users.sort()
items =  list(set(joined_rating_df.loc[:, 'item']))  # len = 6807
items.sort()
genres =  list(set(joined_rating_df.loc[:, 'genre']))  # 0 ~ 17
genres.sort()

if len(users)-1 != max(users):  # ids are not already 0..n-1, so remap them
    users_dict = {users[i]: i for i in range(len(users))}
    joined_rating_df['user']  = joined_rating_df['user'].map(lambda x : users_dict[x])
    users = list(set(joined_rating_df.loc[:,'user']))
    
if len(items)-1 != max(items):
    items_dict = {items[i]: i for i in range(len(items))}
    joined_rating_df['item']  = joined_rating_df['item'].map(lambda x : items_dict[x])
    items =  list(set((joined_rating_df.loc[:, 'item'])))

joined_rating_df = joined_rating_df.sort_values(by=['user'])
joined_rating_df.reset_index(drop=True, inplace=True)

data = joined_rating_df
print("Data")
print(data)

n_data = len(data)
n_user = len(users)
n_item = len(items)
n_genre = len(genres)

print("# of data : {}\n# of users : {}\n# of items : {}\n# of genres : {}".format(n_data, n_user, n_item, n_genre))

Create Negative instances


100%|██████████| 31360/31360 [05:36<00:00, 93.23it/s] 


Data
          user  item  rating  genre
0            0  2505     1.0     16
1            0   112     1.0     16
2            0  2344     1.0      2
3            0  1462     1.0     16
4            0   159     1.0     13
...        ...   ...     ...    ...
6722466  31359   190     1.0     16
6722467  31359  1774     1.0     14
6722468  31359  1719     1.0     10
6722469  31359  3326     1.0     16
6722470  31359  3373     1.0     13

[6722471 rows x 4 columns]
# of data : 6722471
# of users : 31360
# of items : 6807
# of genres : 18
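
The negative sampling in step 3 can be illustrated on toy data. This is a minimal sketch under assumed inputs: the 10-item catalogue, the `seen` dictionary, and the seeded generator are all hypothetical and only serve to make the idea checkable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

all_items = np.arange(10)                      # hypothetical catalogue of 10 items
seen = {"u1": {0, 1, 2}, "u2": {3, 4, 5, 6}}   # hypothetical positive interactions
num_negative = 3

negatives = {}
for user, user_items in seen.items():
    # Candidates are the items this user has not interacted with.
    candidates = np.setdiff1d(all_items, list(user_items))
    negatives[user] = rng.choice(candidates, size=num_negative, replace=False)

for user, neg_items in negatives.items():
    # No sampled negative may overlap with the user's positives.
    assert seen[user].isdisjoint(neg_items.tolist())
    print(user, sorted(neg_items.tolist()))
```

Sampling with `replace=False` guarantees that each user receives `num_negative` distinct negatives, which requires at least that many unseen items per user.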


In [4]:
# 6. Create the feature matrix X and the label tensor y
user_col = torch.tensor(data.loc[:,'user'])
item_col = torch.tensor(data.loc[:,'item'])
genre_col = torch.tensor(data.loc[:,'genre'])

offsets = [0, n_user, n_user+n_item]  # [0, 31360, 38167]
for col, offset in zip([user_col, item_col, genre_col], offsets): # [u, i, g] + [0, n_user, n_user + n_item]
    col += offset

X = torch.cat([user_col.unsqueeze(1), item_col.unsqueeze(1), genre_col.unsqueeze(1)], dim=1)
y = torch.tensor(list(data.loc[:,'rating']))


# 7. Create the data loaders
class RatingDataset(Dataset):
    def __init__(self, input_tensor, target_tensor):
        self.input_tensor = input_tensor.long()
        self.target_tensor = target_tensor.long()

    def __getitem__(self, index):
        return self.input_tensor[index], self.target_tensor[index]

    def __len__(self):
        return self.target_tensor.size(0)


dataset = RatingDataset(X, y)
train_ratio = 0.9

train_size = int(train_ratio * len(data))
test_size = len(data) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=512, shuffle=False)
all_data_loader = DataLoader(dataset, batch_size=1024, shuffle=True)

### Model architecture (DeepFM)
DeepFM combines 1) an FM component and 2) a deep component in parallel. The overall structure is shown below.
<img src='https://drive.google.com/uc?id=1vwcxUJQTIsg5QH9CuH5PcUEfExhToUHR'>  
Each component is described below.  
**1. FM component**  
The FM component is the familiar 2-way factorization machine (degree = 2). FM models the interactions between variables as follows.   
**<center> equation (1) </center>**
$$\hat{y}(x):=w_0 + \sum_{i=1}^{n}w_ix_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle\mathbf{v}_i,\mathbf{v}_j\rangle x_ix_j$$   
The third term, the pairwise interaction term, can be expanded as follows (see the paper).  
The implementation is based on the expanded form.   
**<center> equation (2) </center>**
$$\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle\mathbf{v}_i,\mathbf{v}_j\rangle x_ix_j = \frac{1}{2}\sum_{f=1}^{k}\left(\left(\sum_{i=1}^{n}v_{i,f}x_i\right)^2-\sum_{i=1}^{n}v_{i,f}^2x_i^2\right)$$   
           
**2. Deep component**  
The deep component consists of MLP layers.   
The implementation uses a 3-layer MLP with hidden dimensions 30-20-10.
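
Equation (2) can be checked numerically. The sketch below uses small random values (the sizes `n = 5`, `k = 3` and the seed are arbitrary assumptions) and compares the explicit sum over all pairs with the square-of-sum minus sum-of-squares form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3                      # hypothetical: 5 features, 3 latent factors
V = rng.normal(size=(n, k))      # row i is the latent vector v_i
x = rng.normal(size=n)           # feature values x_i

# Left-hand side of equation (2): explicit sum over all pairs i < j.
pairwise = sum(V[i] @ V[j] * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))

# Right-hand side of equation (2): square-of-sum minus sum-of-squares.
xv = V * x[:, None]              # shape (n, k), entries v_{i,f} * x_i
factorized = 0.5 * np.sum(xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))

assert np.isclose(pairwise, factorized)
```

In this notebook each field contributes exactly one active index with value 1, so the sum over i reduces to a sum over the looked-up embedding vectors of the three fields.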
  
   

### DeepFM

In [89]:
class DeepFM(nn.Module):
    def __init__(self, input_dims, embedding_dim, mlp_dims, drop_rate=0.1):
        super(DeepFM, self).__init__()
        total_input_dim = int(sum(input_dims)) # n_user + n_movie + n_genre

        # global bias term and first-order weights of the FM component
        self.bias = nn.Parameter(torch.zeros((1,)))
        self.fc = nn.Embedding(total_input_dim, 1)
        
        self.embedding = nn.Embedding(total_input_dim, embedding_dim) 
        self.embedding_dim = len(input_dims) * embedding_dim

        mlp_layers = []
        for i, dim in enumerate(mlp_dims):
            if i==0:
                mlp_layers.append(nn.Linear(self.embedding_dim, dim))
            else:
                mlp_layers.append(nn.Linear(mlp_dims[i-1], dim))  # TODO 1: add the linear layer
            mlp_layers.append(nn.ReLU(True))
            mlp_layers.append(nn.Dropout(drop_rate))
        mlp_layers.append(nn.Linear(mlp_dims[-1], 1))
        self.mlp_layers = nn.Sequential(*mlp_layers)

    def fm(self, x):
        # x : (batch_size, total_num_input)
        embed_x = self.embedding(x)

        fm_y = self.bias + torch.sum(self.fc(x), dim=1)
        square_of_sum = torch.sum(embed_x, dim=1) ** 2         # TODO 2: compute square_of_sum with torch.sum (hint: equation (2))
        sum_of_square = torch.sum(embed_x ** 2, dim=1)         # TODO 3: compute sum_of_square with torch.sum (hint: equation (2))
        fm_y += 0.5 * torch.sum(square_of_sum - sum_of_square, dim=1, keepdim=True)
        return fm_y
    
    def mlp(self, x):
        embed_x = self.embedding(x)
        
        inputs = embed_x.view(-1, self.embedding_dim)
        mlp_y = self.mlp_layers(inputs)
        return mlp_y

    def forward(self, x):
        # FM component
        fm_y = self.fm(x).squeeze(1)

        # deep component
        mlp_y = self.mlp(x).squeeze(1)

        logit = fm_y + mlp_y
        # return the probability (for BCELoss) and the raw logit (used as a ranking score at inference)
        return torch.sigmoid(logit), logit


### Training

In [81]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
input_dims = [n_user, n_item, n_genre]
embedding_dim = 10
model = DeepFM(input_dims, embedding_dim, mlp_dims=[30, 20, 10]).to(device)
bce_loss = nn.BCELoss() # Binary Cross Entropy loss
lr, num_epochs = 0.01, 10
optimizer = optim.Adam(model.parameters(), lr=lr)

model.train()
for e in tqdm(range(num_epochs)):
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        output, _ = model(x)  # forward returns (probability, logit); BCELoss needs the probability
        loss = bce_loss(output, y.float())
        loss.backward()
        optimizer.step()
        

100%|██████████| 10/10 [10:39<00:00, 64.00s/it]
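
The training loop applies `nn.BCELoss` to the sigmoid output. Since `forward` also returns the raw logit, one alternative (an assumption, not what this notebook does) is to feed the logit to `nn.BCEWithLogitsLoss`, which fuses the sigmoid into the loss. The sketch below uses random stand-in tensors to show that both formulations give the same value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8)                      # stand-in for fm_y + mlp_y
targets = torch.randint(0, 2, (8,)).float()  # stand-in for the 0/1 labels

loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Both formulations compute the same quantity for moderate logits.
assert torch.allclose(loss_from_probs, loss_from_logits, atol=1e-6)
```

The logit-based version is generally the more numerically stable of the two when logits become large in magnitude.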


### Evaluation
Evaluation measures accuracy: a prediction counts as correct when the model outputs 0.5 or higher for a positive instance and below 0.5 for a negative instance.
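
The accuracy rule can be sketched on toy values. The probabilities and labels below are hypothetical; the sketch thresholds explicitly at 0.5 rather than rounding.

```python
import torch

probs = torch.tensor([0.9, 0.2, 0.6, 0.4])   # hypothetical model outputs
labels = torch.tensor([1, 0, 0, 1])          # hypothetical ground truth

preds = (probs >= 0.5).long()                # 1 if the score is at least 0.5
accuracy = (preds == labels).float().mean().item() * 100

print(f"{accuracy:.2f}%")  # 50.00%
```

`torch.round` rounds halves to the nearest even integer, so an output of exactly 0.5 would round to 0; an explicit `>= 0.5` comparison matches the stated rule in that edge case.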

In [50]:
correct_result_sum = 0
model.eval()
with torch.no_grad():  # no gradients are needed for evaluation
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        output, _ = model(x)  # use the probability, not the logit
        result = torch.round(output)
        correct_result_sum += (result == y).sum().float()

acc = correct_result_sum/len(test_dataset)*100
print("Final Acc : {:.2f}%".format(acc.item()))

Final Acc : 90.51%


# MovieRec(deepFM)

### data
: run the DeepFM section above up to (but not including) training first

- user_all_neg_dfs

In [15]:
set_items = set(raw_rating_df.loc[:, 'item'])
user_all_neg_dfs = pd.DataFrame()

# all_neg_frames = []
# for u, u_items in tqdm(user_group_dfs):
#     u_items = set(u_items)
#     user_all_neg_item = list(set_items - u_items)
#     # every item this user has not interacted with
#     all_neg_frames.append(pd.DataFrame({'user': [u]*len(user_all_neg_item), 'item': user_all_neg_item, 'rating': [0]*len(user_all_neg_item)}))
# user_all_neg_dfs = pd.concat(all_neg_frames, axis=0, sort=False)

In [6]:
with open('/opt/ml/input/code/experiment/deep_fm/user_all_neg_dfs', 'rb') as f:
    user_all_neg_dfs = pickle.load(f)

In [7]:
user_all_neg_dfs

        user   item  rating
0         11      2       0
1         11      3       0
2         11  32770       0
3         11      5       0
4         11      6       0
...      ...    ...     ...
6488  138493  32743       0
6489  138493  49130       0
6490  138493  65514       0
6491  138493  49132       0


In [8]:
joined_all_df = pd.merge(user_all_neg_dfs, raw_genre_df, on='item', how='left') 
joined_all_df

             user   item  rating  genre
0              11      2       0      2
1              11      3       0     13
2              11  32770       0     14
3              11      5       0     13
4              11      6       0     16
...           ...    ...     ...    ...
208313044  138493  32743       0     14
208313045  138493  49130       0     13
208313046  138493  65514       0     16
208313047  138493  49132       0      0


In [9]:
# map to zero-based indices
joined_all_df['user']  = joined_all_df['user'].map(lambda x : users_dict[x])
joined_all_df['item']  = joined_all_df['item'].map(lambda x : items_dict[x])

inference_data = joined_all_df.sort_values(by=['user'])
inference_data.reset_index(drop=True, inplace=True)

In [24]:
inference_data.head()

   user  item  rating  genre
0     0     1       0      2
1     0  3580       0     14
2     0  6341       0      0
3     0  3578       0      0
4     0  6340       0     16


- dataset, dataloader

In [118]:
# create the feature matrix X and the label tensor y
print('col processing started.')

user_col = torch.tensor(inference_data.loc[:,'user'])
print('user done.')
item_col = torch.tensor(inference_data.loc[:,'item'])
print('item done.')
genre_col = torch.tensor(inference_data.loc[:,'genre'])
print('genre done.')

offsets = [0, n_user, n_user+n_item]  # [0, 31360, 38167]
for col, offset in zip([user_col, item_col, genre_col], offsets): # [u, i, g] + [0, n_user, n_user + n_item]
    col += offset

print('col processing done.')

col processing started.
user done.
item done.
genre done.
col processing done.


In [119]:
# # save the columns
# with open('user_col', 'wb') as f:
#     pickle.dump(user_col, f)
# with open('item_col', 'wb') as f:
#     pickle.dump(item_col, f)
# with open('genre_col', 'wb') as f:
#     pickle.dump(genre_col, f)

In [None]:
# # load the columns
# with open('user_col', 'rb') as f:
#     user_col = pickle.load(f)
# with open('item_col', 'rb') as f:
#     item_col = pickle.load(f)
# with open('genre_col', 'rb') as f:
#     genre_col = pickle.load(f)

In [148]:
X.size()

torch.Size([208313049, 3])

In [149]:
# create the dataset and data loader

X = torch.cat([user_col.unsqueeze(1), item_col.unsqueeze(1), genre_col.unsqueeze(1)], dim=1)
y = torch.tensor(list(inference_data.loc[:,'rating']))  # not used at inference

class RatingDataset(Dataset):
    def __init__(self, input_tensor, target_tensor):
        self.input_tensor = input_tensor.long()
        self.target_tensor = target_tensor.long()

    def __getitem__(self, index):
        return self.input_tensor[index], self.target_tensor[index]

    def __len__(self):
        return self.target_tensor.size(0)


inference_dataset = RatingDataset(X, y)
inference_loader = DataLoader(inference_dataset, batch_size=1024, shuffle=True)

In [150]:
print(inference_dataset.__len__())
print(next(iter(inference_loader))[0][0])

208313049
tensor([14155, 36063, 38181])


### model

In [151]:
# load the trained model

embedding_dim = 200
model = DeepFM(input_dims, embedding_dim, mlp_dims=[30, 20, 10]).to(device)
MODEL_PATH = '/opt/ml/input/code/experiment/deep_fm'
model.load_state_dict(torch.load(os.path.join(
    MODEL_PATH, "deepFM_neg200_emb200_iter150_statedict.pt"), map_location=device))

<All keys matched successfully>

### inference

In [152]:
# build the user x item matrix; already-viewed items get a large negative value to push them down the ranking
base_path = '/opt/ml/input/data/train'
train_df_path = os.path.join(base_path, 'train_ratings.csv')

train_rating_df = pd.read_csv(train_df_path)
train_rating_df['viewed'] = -100
user_item_matrix = train_rating_df.pivot_table('viewed', 'user', 'item').fillna(0)

In [28]:
dict_genre = dict(map(reversed, genre_dict.items()))
dict_items = dict(map(reversed, items_dict.items()))
dict_users = dict(map(reversed, users_dict.items()))
offsets = [0, n_user, n_user+n_item]
offsets

[0, 31360, 38167]

In [153]:
model.eval()
with torch.no_grad():  # no gradients are needed for inference
    for x, y in tqdm(inference_loader):
        in_x = x.to(device)
        output = model(in_x)[1]  # raw logits, used as ranking scores

        x = x.numpy() - offsets  # undo the field offsets to recover zero-based ids
        for u, i, r in zip(x[:,0], x[:,1], output.cpu().numpy()):
            user_item_matrix.loc[dict_users[u], dict_items[i]] = r



100%|██████████| 203431/203431 [3:46:35<00:00, 14.96it/s]   
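
Writing one cell at a time through `.loc` dominates the run time of the loop above. One possible speed-up, sketched here under assumed toy sizes and not taken from the notebook, is to keep the scores in a NumPy array indexed by zero-based (user, item) and write each batch with a single vectorized assignment. The names `scores`, `x_batch`, and `logit_batch` are hypothetical.

```python
import numpy as np

n_user, n_item = 3, 4                        # hypothetical sizes
offsets = np.array([0, n_user, n_user + n_item])

# Score matrix indexed by zero-based (user, item).
scores = np.zeros((n_user, n_item), dtype=np.float32)

# One hypothetical batch: offset-encoded (user, item, genre) rows and their logits.
x_batch = np.array([[0, 3, 7],
                    [2, 6, 8],
                    [1, 5, 7]])
logit_batch = np.array([1.5, -0.5, 2.0], dtype=np.float32)

idx = x_batch - offsets                      # undo the offsets -> zero-based ids
scores[idx[:, 0], idx[:, 1]] = logit_batch   # one vectorized write per batch

print(scores.tolist())
# [[1.5, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, -0.5]]
```

The array can be wrapped in a DataFrame with the original user and item ids as labels once all batches have been written.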


In [154]:
user_item_matrix

user \ item,1,2,3,4,5,6,7,8,9,10,...,116823,117176,117533,117881,118696,118700,118900,118997,119141,119145
11,-10.000000,40.601505,-11.658003,-27.578480,-25.750988,-21.337523,-18.243971,-2.616234,-20.730715,7.762153,...,-9.068460,-2.651220,-73.754646,-46.502365,-43.363914,-57.010468,-47.834541,-30.942148,-43.980843,-67.059944
14,-10.000000,42.148415,16.749626,-43.226532,1.620664,-59.123260,-10.000000,0.076384,-64.318184,-26.155884,...,-29.260498,-23.000345,-45.468727,-20.847450,-32.722061,-33.012386,-49.442001,-37.416229,-45.282631,-32.778305
18,-1.058390,-8.197233,-32.631416,-36.670246,-9.226320,-3.475581,-18.639887,-31.579742,-50.839161,-27.639931,...,-48.574467,-34.974583,-38.801346,-79.863403,-51.789825,-68.397926,-68.804176,-30.839195,-20.833889,-5.941283
25,5.192780,9.503859,-41.190498,-56.882580,-17.871708,-28.228313,-51.531868,-27.211288,-64.658012,-8.899348,...,-33.044182,-37.781715,-53.712975,-70.890709,-67.059402,-48.454308,-60.522987,-64.079552,-51.041481,-80.363205
31,11.142850,25.987328,-10.146541,-60.935051,-58.274197,-45.199253,-10.542144,18.738348,-63.158222,-32.891861,...,27.982119,2.371132,0.629409,1.221288,-10.000000,-14.536253,-3.639129,-21.257038,-44.141182,-61.556820
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138473,-10.000000,-29.455151,-32.541748,-18.731014,-32.304680,-1.264413,-36.214905,4.400986,-65.417198,8.413362,...,-42.543068,-37.799767,-51.490273,-47.038895,-14.714188,-72.351883,-46.207298,-33.905022,-51.312157,-34.061798
138475,-7.823964,-52.363029,-55.383034,-43.389751,-29.977068,40.168907,-30.070002,-44.305305,-33.032265,-53.718460,...,-37.472752,-42.623627,-2.481892,-42.418869,-39.978989,-13.596056,-3.402564,-8.388869,-31.727377,-55.409771
138486,-10.000000,42.991596,-7.564743,-41.877403,2.732503,-55.845436,-67.826523,-19.769258,-28.848188,-6.565908,...,-58.314251,-76.745026,-54.405945,-41.762730,-64.996307,-38.359131,-48.440685,-53.726391,-52.124962,-120.126526
138492,4.323761,-27.006874,-5.570723,-66.044342,-22.344395,13.629883,-10.805572,-23.719601,-4.011243,-23.407898,...,-44.454468,-60.721096,-39.419559,-79.754959,-78.661133,-54.406853,-80.171516,-33.898727,-19.331497,-82.182014


### result

In [155]:
# column positions of the 10 highest-scoring items per user (unordered within the top 10)
top_idx = np.argpartition(user_item_matrix.values, -10, axis=1)[:, -10:]
final_users, final_items = list(), list()
item_columns = user_item_matrix.columns
for idx in range(top_idx.shape[0]):
    final_users.extend([user_item_matrix.index[idx]] * 10)
    for i in top_idx[idx]:
        final_items.append(item_columns[i])

submission_df = pd.DataFrame(list(zip(final_users, final_items)), columns=['user','item'])
submission_df.to_csv("/opt/ml/input/code/experiment/deep_fm/neg200_emb200_iter150.csv", index=False)
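
The top-k selection can be illustrated on a toy score matrix. The values below are hypothetical, with -100 marking already-viewed items as in the inference step. `np.argpartition` returns the k best indices in arbitrary order, so the sketch adds an optional sort to obtain a ranked list.

```python
import numpy as np

# Hypothetical score matrix: 2 users x 5 items; -100 marks already-viewed items.
scores = np.array([[0.3, -100.0, 2.1, 1.7, -0.4],
                   [-100.0, 0.9, -1.2, 3.3, 0.1]])
k = 2

# argpartition places the indices of the k largest scores in the last k columns (unordered).
top_k = np.argpartition(scores, -k, axis=1)[:, -k:]

# Optional: order those k indices by descending score to get a ranked list.
order = np.argsort(-np.take_along_axis(scores, top_k, axis=1), axis=1)
ranked = np.take_along_axis(top_k, order, axis=1)

print(ranked.tolist())  # [[2, 3], [3, 1]]
```

If the submission format does not depend on the order within the top k, the sort can be skipped and `top_k` used directly.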