# Section 6: Project #1
## Deep Learning-Based Recommender System

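Advances in big-data analytics have produced methods for estimating each person's activities and preferences individually.  
<br>
Recommender systems account for a large share of such personalized services, so I chose them as the topic of this project.  
In particular, I wanted to implement the same functionality with several models and see how those models differ.  
I also wanted to check whether applying a deep learning model would improve performance.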

### 1) Data Description
#### MovieLens Dataset
```
It contains 20000263 ratings and 465564 tag applications across 27278 movies. 
These data were created by 138493 users between January 09, 1995 and March 31, 2015. 
This dataset was generated on October 17, 2016.
``` 
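
For a sense of scale, the counts quoted above imply a very sparse user-item matrix. The arithmetic below is a small sketch that uses only those three numbers; the variable names are illustrative and not part of the dataset.

```python
# Counts quoted in the MovieLens description above.
n_ratings = 20_000_263
n_users = 138_493
n_movies = 27_278

# Fraction of the user x movie grid that actually holds a rating.
density = n_ratings / (n_users * n_movies)
avg_ratings_per_user = n_ratings / n_users

print(f"density: {density:.4%}")
print(f"average ratings per user: {avg_ratings_per_user:.1f}")
```

Well under 1% of the possible user-movie pairs carry a rating, which is why the models below work with sparse representations or latent factors rather than the raw matrix.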

In [2]:
import pandas as pd
import numpy as np

dataset = pd.read_csv('dataset/Movielens_Dataset/rating.csv')
print(dataset.shape)
dataset.head()

(20000263, 4)


,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


### 2) Preprocessing

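2-1) Data preprocessing

In [3]:
# Drop the timestamp column (copy so the in-place calls below do not act on a view)
dataset = dataset[['userId', 'movieId', 'rating']].copy()
# Remove duplicate rows and missing values
dataset.drop_duplicates(inplace=True)
dataset.dropna(inplace=True)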

In [9]:
dataset.groupby('userId')['rating'].agg(['count', 'mean'])\
        .sort_values(by='count', ascending=False).T

userId,118205,8405,82418,121535,125794,74142,34576,131904,83090,59477,...,12608,75769,75755,3555,24457,89305,110463,96990,134747,6526
count,9254.0,7515.0,5646.0,5520.0,5491.0,5447.0,5356.0,5330.0,5169.0,4988.0,...,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,3.279069,3.208317,3.516915,2.793116,3.762976,1.577474,3.011669,3.248874,2.404914,2.455092,...,2.7,3.4,3.875,3.85,2.9,3.75,3.75,3.825,2.975,3.275


In [10]:
dataset.groupby('movieId')['rating'].agg(['count', 'mean'])\
        .sort_values(by='count', ascending=False).T

movieId,296,356,318,593,480,260,110,589,2571,527,...,110794,110798,110800,110802,110805,110807,110811,78984,110818,131262
count,67310.0,66172.0,63366.0,63299.0,59715.0,54502.0,53769.0,52244.0,51334.0,50054.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
mean,4.174231,4.029,4.44699,4.177057,3.664741,4.190672,4.042534,3.931954,4.187186,4.310175,...,3.5,4.5,0.5,0.5,2.5,4.0,1.5,2.0,1.0,4.0


### 3) ML modeling

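There are several ways to build a recommender system.  
- Content-based filtering, which predicts preference from attributes of each movie such as director, lead actors, genre, and release date
- Collaborative filtering, which predicts preference from users who have similar tastes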
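
To make the collaborative filtering idea concrete, here is a minimal user-based sketch in plain Python. It is not the method used later in this project: the toy ratings, the cosine similarity over co-rated movies, and the function names are all assumptions chosen for illustration.

```python
from math import sqrt

# Toy ratings: user -> {movie: rating}. All values are made up.
ratings = {
    "alice": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob":   {"A": 4.5, "B": 4.0, "D": 5.0},
    "carol": {"A": 1.0, "C": 5.0, "D": 2.0},
}

def cosine(u, v):
    """Cosine similarity over the movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[movie]
        den += abs(sim)
    return num / den if den else None

print(predict("alice", "D"))
```

Alice's predicted rating for movie D leans toward Bob's rating because their co-rated movies agree more closely than Alice's and Carol's do.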

Because this project focuses on the keyword 'personalization', I implemented collaborative filtering.
<br><br>
The models actually tried are:
- Baseline: ALS (Alternating Least Squares) model
- SVD (Singular Value Decomposition) model
- AutoEncoder model based on a deep learning architecture

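All three models are structurally related to matrix factorization.  
Strictly speaking, the ALS baseline used below is Surprise's `BaselineOnly`, which fits only user and item bias terms with ALS rather than latent factors.  
Because the models work in similar ways, I expected them to be easy to compare.
<br><br>
The matrix factorization process can be summarized as:  
1) Extract latent features through dimensionality reduction  
2) Predict the original matrix from the latent features  
3) Optimize by minimizing the difference between the predicted and actual values
<br><br>
In this project, the ALS and SVD models were implemented with the Python Surprise library.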
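
The three factorization steps listed above can be sketched in plain Python. This is a toy illustration trained with stochastic gradient descent, not the Surprise implementation; the matrix, the number of latent factors `k`, the learning rate, and the regularization strength are all assumed values.

```python
import random

random.seed(0)

# Toy user x item matrix; 0 marks a missing rating. Values are made up.
R = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
]
n_users, n_items, k = len(R), len(R[0]), 2

# Step 1: low-dimensional latent factors for users (P) and items (Q).
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    # Step 2: reconstruct a rating as the dot product of latent factors.
    return sum(P[u][f] * Q[i][f] for f in range(k))

def rmse():
    errs = [(R[u][i] - predict(u, i)) ** 2
            for u in range(n_users) for i in range(n_items) if R[u][i] > 0]
    return (sum(errs) / len(errs)) ** 0.5

rmse_before = rmse()
lr, reg = 0.01, 0.02
for epoch in range(2000):
    # Step 3: SGD on the squared error of the observed entries only.
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i] == 0:
                continue
            err = R[u][i] - predict(u, i)
            for f in range(k):
                p, q = P[u][f], Q[i][f]
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
rmse_after = rmse()
print(rmse_before, rmse_after)
```

After training, `predict(u, i)` also returns a value for the entries that were 0 in `R`, which is what a recommender uses to rank unseen items.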

In [6]:
import surprise
from surprise import Reader, Dataset, BaselineOnly, SVD, accuracy
from surprise.model_selection import train_test_split, KFold

# MovieLens 20M ratings run from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(dataset[['userId', 'movieId', 'rating']], reader)
trainset, validset = train_test_split(data, test_size=.3)

In [25]:
# Baseline estimates fitted with Alternating Least Squares (ALS)

bsl_options = {'method':'als', 'n_epochs':15}

als_algo = BaselineOnly(bsl_options=bsl_options)
als_algo.fit(trainset)
als_predict = als_algo.test(validset)

als_score = accuracy.rmse(als_predict)
als_fcp = accuracy.fcp(als_predict)

Estimating biases using als...
RMSE: 0.8587
FCP:  0.7095
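
The baseline model predicts a rating as the global mean plus a user bias and an item bias. The sketch below shows one way such biases can be fitted by alternating between the two sets of parameters. It is a simplified illustration and not Surprise's code; the toy triples, the regularization values `reg_u` and `reg_i`, and the function names are assumptions.

```python
from collections import defaultdict

# Toy (user, item, rating) triples; values are made up.
triples = [
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m3", 2.0),
    ("u3", "m2", 1.0), ("u3", "m3", 1.5),
]

def fit_baselines(triples, n_epochs=15, reg_u=15.0, reg_i=10.0):
    """Alternately solve for item and user biases with the other set held fixed."""
    mu = sum(r for _, _, r in triples) / len(triples)
    b_u, b_i = defaultdict(float), defaultdict(float)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in triples:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, b_u, b_i

mu, b_u, b_i = fit_baselines(triples)

def predict(u, i):
    # Unknown users or items fall back to a zero bias.
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)

print(round(predict("u1", "m3"), 3))
```

Because no user-item interaction term is involved, every user sees the same ranking of movies up to a constant shift, which is one reason this model serves as a baseline here.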


In [26]:
# Singular Value Decomposition

algo = SVD()
algo.fit(trainset)
svd_predict = algo.test(validset)

svd_score = accuracy.rmse(svd_predict)
svd_fcp = accuracy.fcp(svd_predict)

RMSE: 0.7929
FCP:  0.7509
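
The two numbers printed for each model are RMSE and FCP (fraction of concordant pairs). The sketch below computes both on toy data. The FCP here is a simplified variant that pools pairs across users and skips ties in the true rating, so it should not be read as Surprise's exact definition; all data values are made up.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# (user, true_rating, predicted_rating) triples; values are made up.
preds = [
    ("u1", 5.0, 4.6), ("u1", 3.0, 3.9), ("u1", 1.0, 2.0),
    ("u2", 4.0, 3.1), ("u2", 2.0, 3.4),
]

def rmse(preds):
    return sqrt(sum((r - est) ** 2 for _, r, est in preds) / len(preds))

def fcp(preds):
    """Share of same-user item pairs whose predicted order matches the true order."""
    by_user = defaultdict(list)
    for u, r, est in preds:
        by_user[u].append((r, est))
    concordant = discordant = 0
    for rows in by_user.values():
        for (r1, e1), (r2, e2) in combinations(rows, 2):
            if r1 == r2:
                continue  # ties in the true rating carry no ordering
            if (r1 - r2) * (e1 - e2) > 0:
                concordant += 1
            else:
                discordant += 1
    return concordant / (concordant + discordant)

print(rmse(preds), fcp(preds))
```

RMSE rewards predictions that are numerically close to the true rating, while FCP only rewards getting the order of a user's items right, so the two can rank models differently.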


### 4) DL Modeling

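Implementing the matrix factorization structure of collaborative filtering requires extracting latent features.
<br>
Here an AutoEncoder model is applied to build the recommender system.

The functions below are adapted from the 'DeepRecommender' model published by NVIDIA.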

In [12]:
## NVIDIA Deep Recommender (AutoEncoder)
# Copyright (c) 2017 NVIDIA Corporation
## reco_encoder.model

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.init as weight_init
from torch.autograd import Variable

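def activation(input, kind):
    # Only the non-linearities used in this notebook are handled;
    # 'none' returns the input unchanged so the last decoder layer can stay linear.
    if kind == 'selu':
        return F.selu(input)
    if kind == 'none':
        return input
    raise ValueError('Unknown non-linearity type: {}'.format(kind))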

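def MSEloss(inputs, targets, size_average=True):
    # Entries equal to 0 are treated as "not rated" and masked out of the error.
    mask = targets != 0
    num_ratings = torch.sum(mask.float())
    # NOTE: with size_average=True the masked squared error is divided by the
    # total number of entries (rated or not) and the second return value is the
    # constant 1.0; with size_average=False the summed error is returned
    # together with the number of rated entries.
    criterion = nn.MSELoss(reduction='sum' if not size_average else 'mean')
    return criterion(inputs * mask.float(), targets),\
           Variable(torch.Tensor([1.0])) if size_average else num_ratings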

class AutoEncoder(nn.Module):
    def __init__(self, layer_sizes, nl_type='selu', is_constrained=True, dp_drop_prob=0.0, last_layer_activations=True):
        super(AutoEncoder, self).__init__()
        self._dp_drop_prob = dp_drop_prob
        self._last_layer_activations = last_layer_activations
        if dp_drop_prob>0:
            self.drop = nn.Dropout(dp_drop_prob)
        self._last = len(layer_sizes) - 2
        self._nl_type = nl_type
        self.encode_w = nn.ParameterList(
            [nn.Parameter(torch.rand(layer_sizes[i+1], layer_sizes[i])) for i in range(len(layer_sizes)-1)])
        for ind, w in enumerate(self.encode_w):
            weight_init.xavier_uniform_(w)
        self.encode_b = nn.ParameterList(
        [nn.Parameter(torch.zeros(layer_sizes[i+1])) for i in range(len(layer_sizes)-1)])
        reversed_enc_layers = list(reversed(layer_sizes))
        self.is_constrained = is_constrained
        
        if not is_constrained:
            # Untied decoder: one (out_features, in_features) matrix per decoder layer
            self.decode_w = nn.ParameterList(
            [nn.Parameter(torch.rand(reversed_enc_layers[i+1], reversed_enc_layers[i])) for i in range(len(reversed_enc_layers)-1)])
            for ind, w in enumerate(self.decode_w):
                weight_init.xavier_uniform_(w)
        self.decode_b = nn.ParameterList(
        [nn.Parameter(torch.zeros(reversed_enc_layers[i+1])) for i in range(len(reversed_enc_layers)-1)])
        print('Encoder pass:')
        for ind, w in enumerate(self.encode_w):
            print(w.data.size())
            print(self.encode_b[ind].size())
    
    def encode(self, x):
        for ind, w in enumerate(self.encode_w):
            x = activation(input=F.linear(input=x, weight=w, bias=self.encode_b[ind]), kind=self._nl_type)
        if self._dp_drop_prob > 0:
            x = self.drop(x)
        return x
    
    def decode(self, z):
        if self.is_constrained:
            for ind, w in enumerate(list(reversed(self.encode_w))):
                z = activation(input=F.linear(input=z, weight=w.transpose(0, 1), bias=self.decode_b[ind]),
                              kind=self._nl_type if ind!=self._last or self._last_layer_activations else 'none')
        else:
            for ind, w in enumerate(self.decode_w):
                z = activation(input=F.linear(input=z, weight=w, bias=self.decode_b[ind]),
                              kind=self._nl_type if ind!=self._last or self._last_layer_activations else 'none')
        return z
    
    def forward(self, x):
        return self.decode(self.encode(x))
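
The loss above masks out unrated entries (stored as 0) so that the network is only penalized on ratings that exist. The dependency-free sketch below shows the idea on a single toy vector and contrasts the two ways of averaging: PyTorch's `reduction='mean'` divides by the total number of elements, whereas a per-rating error divides by the number of rated entries. The vectors and function name are made up for illustration.

```python
# One user's dense rating vector; 0 means "not rated". Values are made up.
targets = [5.0, 0.0, 3.0, 0.0, 0.0, 4.0]
outputs = [4.5, 2.0, 3.5, 1.0, 3.0, 4.0]

def masked_squared_error(outputs, targets):
    """Sum of squared errors over rated entries, plus the number of rated entries."""
    sq_sum, n_rated = 0.0, 0
    for out, tgt in zip(outputs, targets):
        if tgt != 0:
            sq_sum += (out - tgt) ** 2
            n_rated += 1
    return sq_sum, n_rated

sq_sum, n_rated = masked_squared_error(outputs, targets)
per_rating_mse = sq_sum / n_rated        # averaged over rated entries only
per_entry_mse = sq_sum / len(targets)    # averaged over every entry, rated or not
print(per_rating_mse, per_entry_mse)
```

On a real rating vector with tens of thousands of mostly empty entries, the per-entry average is far smaller than the per-rating average, so the two should not be compared directly.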

In [13]:
## Data Layer Classes
## reco_encoder.data.input_layer

from os import listdir, path
from random import shuffle
from pathlib import Path

class UserItemRecDataProvider:
    def __init__(self, params, user_id_map=None, item_id_map=None):
        self._params = params
        self._data_dir = self.params['data_dir']
        self._extension = ".txt" if 'extension' not in self.params else self.params['extension']
        self._i_id = 0 if 'itemIdInd' not in self.params else self.params['itemIdInd']
        self._u_id = 1 if 'userIdInd' not in self.params else self.params['userIdInd']
        self._r_id = 2 if 'ratingInd' not in self.params else self.params['ratingInd']
        self._major = 'items' if 'major' not in self.params else self.params['major']
        if not (self._major == 'items' or self._major=='users'):
            raise ValueError('Major must be "users" or "items", but got {}'.format(self._major))
        self._major_ind = self._i_id if self._major == 'items' else self._u_id
        self._minor_ind = self._u_id if self._major == 'items' else self._i_id
        self._delimiter = '\t' if 'delimiter' not in self.params else self.params['delimiter']
        if user_id_map is None or item_id_map is None:
            self._build_maps()
        else:
            self._user_id_map = user_id_map
            self._item_id_map = item_id_map
        major_map = self._item_id_map if self._major=='items' else self._user_id_map
        minor_map = self._user_id_map if self._major=='items' else self._item_id_map
        self._vector_dim = len(minor_map)
        src_files = [path.join(self._data_dir, f)
                    for f in listdir(self._data_dir)
                    if path.isfile(path.join(self._data_dir, f)) and f.endswith(self._extension)]
        # NOTE: the batch size is fixed here; params['batch_size'] is not read
        self._batch_size = 32
        
        self.data = dict()
        for source_file in src_files:
            with open(source_file, 'r') as src:
                for line in src.readlines():
                    parts = line.strip().split(self._delimiter)
                    if len(parts)<3:
                        raise ValueError
                    key = major_map[int(parts[self._major_ind])]
                    value = minor_map[int(parts[self._minor_ind])]
                    rating = float(parts[self._r_id])
                    if key not in self.data:
                        self.data[key] = []
                    self.data[key].append((value, rating))
                    
    def _build_maps(self):
        self._user_id_map=dict()
        self._item_id_map=dict()
        src_files=[path.join(self._data_dir, f)
                  for f in listdir(self._data_dir)
                  if path.isfile(path.join(self._data_dir, f)) and f.endswith(self._extension)]
        
        u_id = 0
        i_id = 0
        for source_file in src_files:
            with open(source_file, 'r') as src:
                for line in src.readlines():
                    parts = line.strip().split(self._delimiter)
                    if len(parts)<3:
                        raise ValueError
                        
                    u_id_orig = int(parts[self._u_id])
                    if u_id_orig not in self._user_id_map:
                        self._user_id_map[u_id_orig] = u_id
                        u_id += 1
                        
                    i_id_orig = int(parts[self._i_id])
                    if i_id_orig not in self._item_id_map:
                        self._item_id_map[i_id_orig] = i_id
                        i_id += 1
                        
    def iterate_one_epoch(self):
        data = self.data
        keys = list(data.keys())
        shuffle(keys)
        s_ind = 0
        e_ind = self._batch_size
        while e_ind < len(keys):
            local_ind=0
            inds1=[]
            inds2=[]
            vals=[]
            for ind in range(s_ind, e_ind):
                inds2 += [v[0] for v in data[keys[ind]]]
                inds1 += [local_ind]*len([v[0] for v in data[keys[ind]]])
                vals += [v[1] for v in data[keys[ind]]]
                local_ind += 1
            i_torch = torch.LongTensor([inds1, inds2])
            v_torch = torch.FloatTensor(vals)
            mini_batch = torch.sparse.FloatTensor(i_torch, v_torch, torch.Size([self._batch_size, self._vector_dim]))
            s_ind += self._batch_size
            e_ind += self._batch_size
            yield mini_batch
            
    def iterate_one_epoch_eval(self, for_inf=False):
        keys = list(self.data.keys())
        s_ind = 0
        while s_ind < len(keys):
            inds1 = [0] * len([v[0] for v in self.data[keys[s_ind]]])
            inds2 = [v[0] for v in self.data[keys[s_ind]]]
            vals = [v[1] for v in self.data[keys[s_ind]]]
            src_inds1 = [0] * len([v[0] for v in self.src_data[keys[s_ind]]])
            src_inds2 = [v[0] for v in self.src_data[keys[s_ind]]]
            src_vals = [v[1] for v in self.src_data[keys[s_ind]]]
            
            i_torch = torch.LongTensor([inds1, inds2])
            v_torch = torch.FloatTensor(vals)
            src_i_torch = torch.LongTensor([src_inds1, src_inds2])
            src_v_torch = torch.FloatTensor(src_vals)
            mini_batch = (torch.sparse.FloatTensor(i_torch, v_torch, torch.Size([1, self._vector_dim])),
                          torch.sparse.FloatTensor(src_i_torch, src_v_torch, torch.Size([1, self._vector_dim])))
            s_ind += 1
            if not for_inf:
                yield mini_batch
            else:
                yield mini_batch, keys[s_ind - 1]
    @property
    def vector_dim(self):
        return self._vector_dim
    @property
    def userIdMap(self):
        return self._user_id_map
    @property
    def itemIdMap(self):
        return self._item_id_map
    @property
    def params(self):
        return self._params
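
The data provider does two things: it maps the original, non-contiguous ids to contiguous indices, and it turns a group of users into the row indices, column indices, and values of a sparse mini-batch. The sketch below reproduces that logic in plain Python on toy rows; the helper name `to_coo` and the data are hypothetical.

```python
# Toy (userId, movieId, rating) rows with sparse original ids; values are made up.
rows = [(10, 296, 5.0), (10, 593, 4.0), (42, 296, 3.5), (77, 1193, 2.0)]

user_id_map, item_id_map = {}, {}
for user, item, _ in rows:
    # Assign the next free contiguous index the first time an id is seen.
    user_id_map.setdefault(user, len(user_id_map))
    item_id_map.setdefault(item, len(item_id_map))

# Group ratings by user ("major") with item ("minor") indices as columns.
data = {}
for user, item, rating in rows:
    data.setdefault(user_id_map[user], []).append((item_id_map[item], rating))

def to_coo(batch_keys):
    """Row indices, column indices and values for one mini-batch."""
    inds1, inds2, vals = [], [], []
    for local_row, key in enumerate(batch_keys):
        for col, rating in data[key]:
            inds1.append(local_row)
            inds2.append(col)
            vals.append(rating)
    return inds1, inds2, vals

print(to_coo([0, 1]))
```

The three lists correspond to `inds1`, `inds2`, and `vals` in `iterate_one_epoch`, which are then wrapped in a sparse tensor of shape `[batch_size, vector_dim]`.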

In [14]:
# logger object
import tensorflow as tf
import numpy as np

class Logger(object):
    """Thin wrapper around the TensorFlow 2 summary writer."""
    def __init__(self, log_dir):
        self.writer = tf.summary.create_file_writer(log_dir)
    def scalar_summary(self, tag, value, step):
        with self.writer.as_default(step=step):
            tf.summary.scalar(name=tag, data=value)
    def image_summary(self, tag, images, step):
        # tf.summary.image expects a [batch, height, width, channels] tensor
        # holding uint8 values or floats in [0, 1].
        with self.writer.as_default(step=step):
            for i, img in enumerate(images):
                img = np.asarray(img)
                if img.ndim == 2:
                    img = img[..., np.newaxis]   # add a channel axis for grayscale
                tf.summary.image(name='%s/%d' % (tag, i), data=img[np.newaxis, ...])
    def histo_summary(self, tag, values, step, bins=1000):
        with self.writer.as_default(step=step):
            tf.summary.histogram(name=tag, data=values, buckets=bins)
        self.writer.flush()

In [15]:
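from math import sqrt

def do_eval(encoder, evaluation_data_layer):
    encoder.eval()
    denom = 0.0
    total_epoch_loss = 0.0
    # No gradients are needed during evaluation
    with torch.no_grad():
        for i, (eval_mb, src) in enumerate(evaluation_data_layer.iterate_one_epoch_eval()):
            inputs = Variable(src.cuda().to_dense())
            targets = Variable(eval_mb.cuda().to_dense())
            outputs = encoder(inputs)
            # NOTE: MSEloss defaults to size_average=True, so `loss` is averaged over
            # every entry of the vector and `num_ratings` is the constant 1.0; the
            # value returned below is therefore not a per-rating RMSE.
            loss, num_ratings = MSEloss(outputs, targets)
            total_epoch_loss += loss.item()
            denom += num_ratings.item()
    return sqrt(total_epoch_loss/ denom)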

def log_var_and_grad_summaries(logger, layers, global_step, prefix, log_histograms=False):
    for ind, w in enumerate(layers):
        w_var = w.data.cpu().numpy()
        logger.scalar_summary('Variable/FrobNorm/{}_{}'.format(prefix, ind), np.linalg.norm(w_var), global_step)
        if log_histograms:
            logger.histo_summary(tag='Variables/{}_{}'.format(prefix, ind), values=w.data.cpu().numpy(), step=global_step)
        w_grad = w.grad.data.cpu().numpy()
        logger.scalar_summary("Gradients/FrobNorm/{}_{}".format(prefix, ind), np.linalg.norm(w_grad), global_step)
        if log_histograms:
            logger.histo_summary(tag='Gradients/{}_{}'.format(prefix, ind), values=w.grad.data.cpu().numpy(), step=global_step)

In [16]:
# modified to use DataFrame object as datasource
#_____________________________________________________
class UserItemRecDataProvider_from_Df(UserItemRecDataProvider):
    def __init__(self, params, user_id_map=None, item_id_map=None):
        self._params = params
        self.DataFrame = params['dataframe']
        
        self._i_id = 1 if 'itemIdInd' not in self.params else self.params['itemIdInd']
        self._u_id = 0 if 'userIdInd' not in self.params else self.params['userIdInd']
        self._r_id = 2 if 'ratingInd' not in self.params else self.params['ratingInd']
        self._major = 'users' if 'major' not in self.params else self.params['major']
        if not (self._major == 'items' or self._major=='users'):
            raise ValueError('Major must be "users" or "items", but got {}'.format(self._major))
        self._major_ind = self._i_id if self._major == 'items' else self._u_id
        self._minor_ind = self._u_id if self._major == 'items' else self._i_id
        if user_id_map is None or item_id_map is None:
            self._build_maps()
        else:
            self._user_id_map = user_id_map
            self._item_id_map = item_id_map
        major_map = self._item_id_map if self._major=='items' else self._user_id_map
        minor_map = self._user_id_map if self._major=='items' else self._item_id_map
        self._vector_dim = len(minor_map)
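        # NOTE: the batch size is fixed here; params['batch_size'] is not read
        self._batch_size = 32
        
        # Pick the key/value columns according to the major dimension
        major_col, minor_col = ('movieId', 'userId') if self._major == 'items' else ('userId', 'movieId')
        self.data = dict()
        for idx, row in self.DataFrame.iterrows():
            key = major_map[int(row[major_col])]
            value = minor_map[int(row[minor_col])]
            rating = float(row['rating'])
            if key not in self.data:
                self.data[key] = []
            self.data[key].append((value, rating))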
                    
    def _build_maps(self):
        self._user_id_map=dict()
        self._item_id_map=dict()
       
        u_id = 0
        i_id = 0
        for idx, row in self.DataFrame.iterrows():
            u_id_orig = int(row['userId'])
            if u_id_orig not in self._user_id_map:
                self._user_id_map[u_id_orig] = u_id
                u_id += 1
            i_id_orig = int(row['movieId'])
            if i_id_orig not in self._item_id_map:
                self._item_id_map[i_id_orig] = i_id
                i_id+=1

In [17]:
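# Train/validation split
from sklearn.model_selection import train_test_split
train, valid = train_test_split(dataset, test_size=.3, random_state=42)
# Network and training configuration
params=dict()
# NOTE: the training provider is built from the full `dataset`, not `train`,
# so the validation ratings are also seen during training.
params['dataframe'] = dataset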
params['batch_size'] = 64
params['major'] = 'users'
params['itemIdInd'] = 1
params['userIdInd'] = 0

hidden_layers='512, 512, 1024'   # [512 - 512 - 1024(latent space) - 512 - 512]
drop_prob=0.65                   # dropout rate
num_epochs=15
non_linearity_type='selu'        # activation function = 'SELU'
gpu_ids=1
skip_last_layer_nl=True
constrained=True
summary_frequency = 2000
save_every = 3
logger = Logger('movielens_logs')
logdir = 'movielens_logs'

data_layer = UserItemRecDataProvider_from_Df(params=params)
print('data loaded')

data loaded


In [18]:
import copy
import time
import torch.optim as optim
from math import sqrt

eval_params = copy.deepcopy(params)
eval_params['dataframe'] = valid
eval_data_layer = UserItemRecDataProvider_from_Df(params = eval_params,
                                         user_id_map=data_layer.userIdMap,
                                         item_id_map=data_layer.itemIdMap)
eval_data_layer.src_data = data_layer.data

rencoder = AutoEncoder(layer_sizes=[data_layer.vector_dim] + [int(l) for l in hidden_layers.split(',')],
                       nl_type = non_linearity_type,
                       is_constrained=constrained,
                       dp_drop_prob=drop_prob,
                       last_layer_activations=not skip_last_layer_nl)

rencoder = rencoder.cuda()
optimizer = optim.Adam(rencoder.parameters(),
                       lr=0.0001,
                       weight_decay=0.0)

os.makedirs(logdir, exist_ok=True)
model_checkpoint = logdir + '/model'
path_to_model = Path(model_checkpoint)
if path_to_model.is_file():
    rencoder.load_state_dict(torch.load(model_checkpoint))

Encoder pass:
torch.Size([512, 26744])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([1024, 512])
torch.Size([1024])
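
The shapes printed under "Encoder pass" follow directly from the input dimension (26,744 movies) and the `hidden_layers` string. The small sketch below reproduces them without PyTorch; the helper name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(vector_dim, hidden_layers):
    """Weight shapes (out_features, in_features) for each encoder layer."""
    layer_sizes = [vector_dim] + [int(s) for s in hidden_layers.split(',')]
    return [(layer_sizes[i + 1], layer_sizes[i]) for i in range(len(layer_sizes) - 1)]

shapes = encoder_shapes(26744, '512, 512, 1024')
# With tied weights the decoder reuses the transposed encoder weights in reverse order.
decoder_shapes = [(n_in, n_out) for (n_out, n_in) in reversed(shapes)]
print(shapes)
print(decoder_shapes)
```

With `constrained=True` the decoder adds only bias vectors, so nearly all parameters sit in the first encoder matrix of shape 512 x 26744.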


In [19]:
loss_list=[]
t_loss=0.0
t_loss_denom=0.0
global_step=0

for epoch in range(1, num_epochs+1):
    print(f'---------------epoch {epoch} of {num_epochs}-------------')
    print(global_step)
    print(t_loss)
    e_start_time = time.time()
    rencoder.train()
    total_epoch_loss = 0.0
    denom = 0.0
    
    for i, mb in enumerate(data_layer.iterate_one_epoch()):
        inputs=Variable(mb.cuda().to_dense())
        optimizer.zero_grad()
        outputs = rencoder(inputs)
        loss, num_ratings = MSEloss(outputs, inputs)
        loss.backward()
        optimizer.step()
        global_step += 1
        t_loss += loss.item()
        t_loss_denom += 1
    
        if i%summary_frequency == 0:
            logger.scalar_summary("Training_RMSE", sqrt(t_loss/t_loss_denom), global_step)
            print('Training_RMSE', sqrt(t_loss/t_loss_denom))
            t_loss=0
            t_loss_denom=0
            log_var_and_grad_summaries(logger, rencoder.encode_w, global_step, "Encode_W")
            log_var_and_grad_summaries(logger, rencoder.encode_b, global_step, "Encode_b")
            if not rencoder.is_constrained:
                log_var_and_grad_summaries(logger, rencoder.decode_w, global_step, "Decode_W")
            log_var_and_grad_summaries(logger, rencoder.decode_b, global_step, "Decode_b")
        total_epoch_loss += loss.item()
        denom += 1
    loss_list.append(loss.item())
    e_end_time = time.time()
    logger.scalar_summary("Training_RMSE_per_epoch", sqrt(total_epoch_loss/denom), epoch)
    logger.scalar_summary("Epoch_time", e_end_time - e_start_time, epoch)
    print('Time consumed : ', e_end_time - e_start_time)
    # Epochs are numbered from 1, so the final epoch is num_epochs
    if epoch%save_every==0 or epoch==num_epochs:
        eval_loss = do_eval(rencoder, eval_data_layer)
        print('eval_loss  : ', eval_loss)
        logger.scalar_summary("EVALUATION_RMSE", eval_loss, epoch)
        print("Saving model to {}".format(model_checkpoint+".epoch_"+str(epoch)))
        torch.save(rencoder.state_dict(), model_checkpoint+".epoch_"+str(epoch))
        
    torch.save(rencoder.state_dict(), model_checkpoint + '.last')   
    dummy_input = Variable(torch.randn(params['batch_size'], data_layer.vector_dim).type(torch.float))
    torch.onnx.export(rencoder.float(), dummy_input.cuda(),
                       model_checkpoint + '.onnx')
    print('ONNX model saved to {}'.format(model_checkpoint + '.onnx'))

---------------epoch 1 of 15-------------
0
0.0
Training_RMSE 0.2809507049071048
Training_RMSE 0.07574949932885153
Training_RMSE 0.06920665659886573
Time consumed :  67.42151618003845
ONNX model saved to movielens_logs/model.onnx
---------------epoch 2 of 15-------------
4327
1.4578574448823929
Training_RMSE 0.06683862974943665
Training_RMSE 0.06733752525466596
Training_RMSE 0.06548449864813309
Time consumed :  66.31130766868591
ONNX model saved to movielens_logs/model.onnx
---------------epoch 3 of 15-------------
8654
1.3917606316972524
Training_RMSE 0.06536167993805636
Training_RMSE 0.06448217626801031
Training_RMSE 0.06519381431950387
Time consumed :  67.03990530967712
eval_loss  :  0.03407816663722414
Saving model to movielens_logs/model.epoch_3
ONNX model saved to movielens_logs/model.onnx
---------------epoch 4 of 15-------------
12981
1.3342143872287124
Training_RMSE 0.06398366636398493
Training_RMSE 0.0634752669527735
Training_RMSE 0.06362363893071686
Time consumed :  67.37563
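
The step counter printed at the start of epoch 2 (4327) is consistent with the 138,493 users in the dataset being served in batches of 32, with the final partial batch dropped. The sketch below mirrors the `while e_ind < len(keys)` loop of `iterate_one_epoch`; the function name is hypothetical.

```python
def batches_per_epoch(n_keys, batch_size):
    """Count the mini-batches yielded by a loop of the form `while e_ind < n_keys`."""
    count, e_ind = 0, batch_size
    while e_ind < n_keys:
        count += 1
        e_ind += batch_size
    return count

steps = batches_per_epoch(n_keys=138493, batch_size=32)
# Summaries are written whenever the batch index is a multiple of 2000.
summaries = len(range(0, steps, 2000))
print(steps, summaries)
```

The second number matches the three `Training_RMSE` lines logged per epoch (at batch indices 0, 2000, and 4000).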

In [21]:
#predict for movielens

save_path = 'movielens_logs/model.epoch_15'
prediction_path = 'out.txt'
params = dict()
params['batch_size'] = 1
params['dataframe'] = dataset
params['major'] = 'users'
params['itemIdInd'] = 1
params['userIdInd'] = 0

data_layer = UserItemRecDataProvider_from_Df(params=params)

eval_params = copy.deepcopy(params)
eval_params['batch_size'] = 1
eval_params['dataframe'] = valid
eval_data_layer = UserItemRecDataProvider_from_Df(params=eval_params,
                                          user_id_map=data_layer.userIdMap,
                                          item_id_map=data_layer.itemIdMap)
rencoder = AutoEncoder(layer_sizes=[data_layer.vector_dim] + \
                       [int(l) for l in hidden_layers.split(',')],
                      nl_type=non_linearity_type,
                      is_constrained = constrained,
                      dp_drop_prob=drop_prob,
                      last_layer_activations=not skip_last_layer_nl)
path_to_model = Path(save_path)
if path_to_model.is_file():
    rencoder.load_state_dict(torch.load(save_path))

rencoder.eval()
rencoder.cuda()

inv_userIdMap = {v: k for k, v in data_layer.userIdMap.items()}
inv_itemIdMap = {v: k for k, v in data_layer.itemIdMap.items()}

eval_data_layer.src_data = data_layer.data
with open(prediction_path, 'w') as outf:
    for i, ((out, src), majorInd) in enumerate(eval_data_layer.iterate_one_epoch_eval(for_inf=True)):
        inputs = Variable(src.cuda().to_dense())
        targets_np = out.to_dense().numpy()[0, :]
        outputs = rencoder(inputs).cpu().data.numpy()[0, :]
        non_zeros = targets_np.nonzero()[0].tolist()
        major_key = inv_userIdMap[majorInd]
        for ind in non_zeros:
            outf.write('{}\t{}\t{}\t{}\n'.format(major_key, inv_itemIdMap[ind], outputs[ind], targets_np[ind]))
        if i%10000 == 0:
            print('Doing : {}'.format(i))

Encoder pass:
torch.Size([512, 26744])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([1024, 512])
torch.Size([1024])
Doing : 0
Doing : 10000
Doing : 20000
Doing : 30000
Doing : 40000
Doing : 50000
Doing : 60000
Doing : 70000
Doing : 80000
Doing : 90000
Doing : 100000
Doing : 110000
Doing : 120000
Doing : 130000


In [22]:
# computing model prediction score
# for movielens

with open(prediction_path, 'r') as inpt:
    n = 0
    sq_err_sum = 0.0
    for line in inpt:
        parts = line.split('\t')
        # NOTE: predictions are rounded to the nearest integer before scoring,
        # whereas the ALS and SVD scores above use unrounded estimates.
        prediction = round(float(parts[2]))
        rating = float(parts[3])
        sq_err_sum += (prediction - rating)*(prediction - rating)
        n += 1
    renc_score = sqrt(sq_err_sum/n)
    print("RMSE : {}".format(renc_score))

RMSE : 0.8139762648022553


### 5) Evaluation
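RMSE was chosen as the comparison metric:  
Root Mean Squared Error (RMSE) averages the squared errors of all predictions with equal weight  
<br>-> it indicates how close the predictions are to the actual values (lower is better!)

Other metrics such as Precision@k and Recall@k are also used to evaluate recommender systems.  
From the 'personalization' point of view, I think these metrics can reflect performance better.  
<br>
Unfortunately, time constraints made it difficult to implement the other metrics in this project.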
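
For reference, Precision@k and Recall@k for a single user could be computed along the following lines. This is a sketch that was not run on the project data; the relevance threshold of 3.5, the value of `k`, and the toy pairs are assumptions.

```python
def precision_recall_at_k(user_preds, k=10, threshold=3.5):
    """user_preds: list of (estimated, true) ratings for one user."""
    ranked = sorted(user_preds, key=lambda x: x[0], reverse=True)
    n_relevant = sum(true >= threshold for _, true in ranked)
    top_k = ranked[:k]
    n_recommended = sum(est >= threshold for est, _ in top_k)
    n_hit = sum(est >= threshold and true >= threshold for est, true in top_k)
    precision = n_hit / n_recommended if n_recommended else 0.0
    recall = n_hit / n_relevant if n_relevant else 0.0
    return precision, recall

# Toy (estimated, true) pairs for a single user; values are made up.
example = [(4.8, 5.0), (4.1, 3.0), (3.9, 4.0), (3.2, 4.5), (2.0, 1.0)]
print(precision_recall_at_k(example, k=3))
```

Averaging these per-user values over all users gives a ranking-oriented score that complements RMSE.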

In [27]:
# RMSE of each model
print('ALS model   :', als_score)
print('SVD model   :', svd_score)
print('AutoEncoder :', renc_score)

ALS model   : 0.8587301661401836
SVD model   : 0.7929245866825421
AutoEncoder : 0.8139762648022553


In [28]:
als_output = pd.DataFrame(als_predict, 
                          columns=['userId', 'movieId', 'rating', 'rating_pred', 'state'])
als_output = als_output[['userId', 'movieId', 'rating_pred', 'rating']]

In [29]:
svd_output = pd.DataFrame(svd_predict, 
                          columns=['userId', 'movieId', 'rating', 'rating_pred', 'state'])
svd_output = svd_output[['userId', 'movieId', 'rating_pred', 'rating']]

In [38]:
renc_predict = pd.read_csv(prediction_path, delimiter='\t', header=None,
                          names=['userId', 'movieId', 'rating_pred', 'rating'])
print(renc_predict.shape)
renc_predict.head()

(6000079, 4)


,userId,movieId,rating_pred,rating
0,122270,296,4.670368,5.0
1,122270,593,3.832587,4.0
2,122270,919,3.761301,4.0
3,122270,1193,4.010227,5.0
4,122270,1261,3.41652,5.0


In [34]:
# User-item preference table
pd.pivot_table(svd_output.sort_values(by='userId')[:500],
               values='rating_pred', index='userId', columns='movieId').fillna('')

userId \ movieId,1,2,6,10,15,17,21,32,34,39,...,61465,62265,64497,64614,66171,66297,67365,68263,69945,70227
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,3.626617,,,,3.560899,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,2.763842,3.602076,,,,,...,,,,,,,,,,
8,,,,,,,3.947162,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,


In [35]:
pd.pivot_table(renc_predict.sort_values(by='userId')[:500],
               values='rating_pred', index='userId', columns='movieId').fillna('')

userId \ movieId,1,2,3,6,10,11,19,21,24,25,...,60516,60832,64508,65514,66297,67197,67799,67867,68319,70305
1,,,,,,,,,,,...,,,,,,,,,,
2,,,4.29282,,,,,,,,...,,,,,,,,,,
3,3.881709,,,,,,,,4.238307,,...,,,,,,,,,,
4,,,,,,,2.506626,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,3.533792,,...,,,,,,,,,,
8,,,4.000562,,,,,3.768695,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,3.847795,,,,3.822665,...,,,,,,,,,,
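
A preference table like the ones above becomes a recommendation list once the highest predicted ratings are selected for each user. The sketch below shows one simple way to do that in plain Python; the rows and the function name `top_n` are made up for illustration.

```python
from collections import defaultdict
import heapq

# (userId, movieId, rating_pred) rows; values are made up.
rows = [(1, 10, 3.6), (1, 32, 4.2), (1, 47, 2.9), (2, 10, 4.8), (2, 50, 3.1)]

def top_n(rows, n=2):
    """Return the n movies with the highest predicted rating for each user."""
    by_user = defaultdict(list)
    for user, movie, score in rows:
        by_user[user].append((score, movie))
    return {user: [movie for _, movie in heapq.nlargest(n, scored)]
            for user, scored in by_user.items()}

print(top_n(rows))
```

In practice the candidates would be movies the user has not rated yet, scored by the trained model.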


### 6) Conclusion
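Measured by RMSE, there is no large gap between the SVD model (0.793) and the AutoEncoder model (0.814).  
If anything, the AutoEncoder, which I expected to score better, came out slightly worse.  
I think more careful tuning could improve its performance.

Recommender systems account for a large share of personalized services and are used across many sectors, so I think improving their performance can directly affect many services.  
<br>As deep learning itself keeps advancing and becoming more sophisticated, I expect that applying it to recommender systems can yield more accurate models.  
I think this project is meaningful in that a deep learning architecture was applied directly and confirmed to work.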