# Project: MovieLens Movie SBR

We will build a session-based recommendation (SBR) system on the MovieLens 1M dataset.

```bash
$ wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
```

In [1]:
import datetime as dt
from pathlib import Path
import os

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
data_path = Path(os.getenv('HOME')+'/aiffel/yoochoose-data/ml-1m') 
train_path = data_path / 'ratings.dat'

def load_data(data_path: Path, nrows=None):
    data = pd.read_csv(data_path, sep='::', header=None, usecols=[0, 1, 2, 3], dtype={0: np.int32, 1: np.int32, 2: np.int32}, nrows=nrows)
    data.columns = ['UserId', 'ItemId', 'Rating', 'Time']
    return data

data = load_data(train_path, None)
data = data.sort_values(['UserId'], inplace=False)  # Sort by UserId only; rows are ordered by Time within each user in a later step. inplace=False returns a new DataFrame.
data = data.reset_index(drop=True)
data

Unnamed: 0,UserId,ItemId,Rating,Time
0,1,1193,5,978300760
1,1,745,3,978824268
2,1,2294,4,978824291
3,1,3186,4,978300019
4,1,1566,4,978824330
...,...,...,...,...
1000204,6040,1594,3,964828599
1000205,6040,1587,1,956716374
1000206,6040,3182,5,984195682
1000207,6040,300,2,956704716


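- The biggest difference from the previous exercise is that the data has a UserID field instead of a SessionID. This dataset does not contain a SessionID that clearly marks a single session, so this time UserID has to play the role of SessionID.

- Rating information is included. The previous exercise had no such field, so it could simply be ignored and dropped. However, whether the user liked the movie they just watched may be related to whether they pick a similar movie next. We also need to decide how to handle low-rated data.

- The Time field holds a Unix timestamp: the number of seconds elapsed since 1 January 1970 (UTC).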
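
As a quick check on the Time field, one of the timestamps from the table above can be converted with the standard library. This is only a minimal sketch; the variable names `ts` and `as_utc` are illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone

# 978300019 is the Time value of one of user 1's ratings in the table above.
ts = 978300019

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC.
as_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
print(as_utc.isoformat())  # 2000-12-31T22:00:19+00:00
```

The result agrees with the `Time_new` column that the notebook later derives with `pd.to_datetime(..., unit='s')`.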

Based on this information, we will carry out a project that follows a process similar to today's exercise.

## Step 1. Data Preprocessing

We examine the dataset assembled above carefully, covering per-field basic analysis, session length, session time, and cleaning.
In particular, this dataset is organized by UserID rather than by session, so take care in deciding how to interpret it in terms of sessions.

Next, we order the rows by Time within each UserId.

In [3]:
data['UserId'].nunique()

6040

In [4]:
userid_length = data.groupby('UserId').size()
userid_length

UserId
1        53
2       129
3        51
4        21
5       198
       ... 
6036    888
6037    202
6038     20
6039    123
6040    341
Length: 6040, dtype: int64

In [5]:
userid_length[1]

53

In [6]:
len(userid_length)

6040

Using temp_df, we build a DataFrame in which the rows are ordered by userid and then by time.

In [7]:
temp_df = data[0:0+userid_length[1]]
temp_df = temp_df.sort_values(['Time'], inplace= False)
temp_df = temp_df.reset_index(drop=True)
temp_df

Unnamed: 0,UserId,ItemId,Rating,Time
0,1,3186,4,978300019
1,1,1270,5,978300055
2,1,1022,5,978300055
3,1,1721,4,978300055
4,1,2340,3,978300103
5,1,1836,5,978300172
6,1,3408,4,978300275
7,1,2804,5,978300719
8,1,1207,4,978300719
9,1,1193,5,978300760


In [8]:
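ii = 0
chunks = [data[0:1]]  # seed row, dropped again in the next cell
for i in range(len(userid_length)):
    # Slice out one user's rows (UserIds run from 1 to 6040) and order them by time.
    temp_df = data[ii:ii+userid_length[i+1]]
    chunks.append(temp_df.sort_values(['Time'], inplace=False))
    ii += userid_length[i+1]
# DataFrame.append was removed in pandas 2.0; concatenate once instead.
new_df = pd.concat(chunks).reset_index(drop=True)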

In [9]:
new_df = new_df[1:1000210]
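
The per-user loop above can probably be replaced by a single two-key sort. The sketch below illustrates the idea on a small toy frame; the values are made up and `toy` / `sorted_toy` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy ratings log (made-up values) with users and timestamps out of order.
toy = pd.DataFrame({
    'UserId': [2, 1, 1, 2, 1],
    'ItemId': [10, 20, 30, 40, 50],
    'Time':   [300, 200, 100, 100, 150],
})

# Sort by user first, then by time within each user, and renumber the rows.
sorted_toy = toy.sort_values(['UserId', 'Time']).reset_index(drop=True)
print(sorted_toy)
```

Applied to `data`, the same call would avoid both the seed row and the slicing step, assuming no other cell depends on the intermediate `temp_df`.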

In [10]:
new_df

Unnamed: 0,UserId,ItemId,Rating,Time
1,1,3186,4,978300019
2,1,1270,5,978300055
3,1,1022,5,978300055
4,1,1721,4,978300055
5,1,2340,3,978300103
...,...,...,...,...
1000205,6040,2917,4,997454429
1000206,6040,1784,3,997454464
1000207,6040,1921,4,997454464
1000208,6040,161,3,997454486


In [11]:
new_df['Time_new'] = pd.to_datetime(new_df['Time'], unit='s')

In [12]:
new_df

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new
1,1,3186,4,978300019,2000-12-31 22:00:19
2,1,1270,5,978300055,2000-12-31 22:00:55
3,1,1022,5,978300055,2000-12-31 22:00:55
4,1,1721,4,978300055,2000-12-31 22:00:55
5,1,2340,3,978300103,2000-12-31 22:01:43
...,...,...,...,...,...
1000205,6040,2917,4,997454429,2001-08-10 14:40:29
1000206,6040,1784,3,997454464,2001-08-10 14:41:04
1000207,6040,1921,4,997454464,2001-08-10 14:41:04
1000208,6040,161,3,997454486,2001-08-10 14:41:26


Ratings below 3 are removed.

In [13]:
new_df = new_df[new_df['Rating']>=3]
new_df = new_df.reset_index(drop=True)

In [14]:
new_df

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new
0,1,3186,4,978300019,2000-12-31 22:00:19
1,1,1270,5,978300055,2000-12-31 22:00:55
2,1,1022,5,978300055,2000-12-31 22:00:55
3,1,1721,4,978300055,2000-12-31 22:00:55
4,1,2340,3,978300103,2000-12-31 22:01:43
...,...,...,...,...,...
836473,6040,2917,4,997454429,2001-08-10 14:40:29
836474,6040,1784,3,997454464,2001-08-10 14:41:04
836475,6040,1921,4,997454464,2001-08-10 14:41:04
836476,6040,161,3,997454486,2001-08-10 14:41:26


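A single user may make different choices at different times, so items that the same user rated at different times are assigned different SessionIds. Here, a new session starts whenever two consecutive ratings by the same user are more than 1,800 seconds (30 minutes) apart.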

In [15]:
new_df['SessionId'] = 0

In [16]:
new_df

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId
0,1,3186,4,978300019,2000-12-31 22:00:19,0
1,1,1270,5,978300055,2000-12-31 22:00:55,0
2,1,1022,5,978300055,2000-12-31 22:00:55,0
3,1,1721,4,978300055,2000-12-31 22:00:55,0
4,1,2340,3,978300103,2000-12-31 22:01:43,0
...,...,...,...,...,...,...
836473,6040,2917,4,997454429,2001-08-10 14:40:29,0
836474,6040,1784,3,997454464,2001-08-10 14:41:04,0
836475,6040,1921,4,997454464,2001-08-10 14:41:04,0
836476,6040,161,3,997454486,2001-08-10 14:41:26,0


In [17]:
len(new_df)

836478

In [19]:
sessionid = 1

new_df.loc[0, 'SessionId'] = 1  # use .loc instead of chained assignment

In [20]:
new_df

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId
0,1,3186,4,978300019,2000-12-31 22:00:19,1
1,1,1270,5,978300055,2000-12-31 22:00:55,0
2,1,1022,5,978300055,2000-12-31 22:00:55,0
3,1,1721,4,978300055,2000-12-31 22:00:55,0
4,1,2340,3,978300103,2000-12-31 22:01:43,0
...,...,...,...,...,...,...
836473,6040,2917,4,997454429,2001-08-10 14:40:29,0
836474,6040,1784,3,997454464,2001-08-10 14:41:04,0
836475,6040,1921,4,997454464,2001-08-10 14:41:04,0
836476,6040,161,3,997454486,2001-08-10 14:41:26,0


In [21]:
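# Walk through consecutive rows: a new session starts when the user changes
# or when more than 30 minutes (1800 s) separate two ratings by the same user.
user_ids = new_df['UserId'].to_numpy()
times = new_df['Time'].to_numpy()
session_ids = new_df['SessionId'].to_numpy().copy()
for i in range(len(new_df) - 1):
    same_user = user_ids[i] == user_ids[i + 1]
    if not (same_user and times[i + 1] - times[i] <= 1800):
        sessionid += 1
    session_ids[i + 1] = sessionid
new_df['SessionId'] = session_ids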
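
The same session rule can also be written without a Python loop. The following is a minimal sketch on a toy frame with made-up values, assuming the rows are already sorted by UserId and Time; `GAP`, `new_user`, and `long_gap` are illustrative names.

```python
import pandas as pd

# Toy log (made-up values), already sorted by UserId and Time.
toy = pd.DataFrame({
    'UserId': [1, 1, 1, 2, 2],
    'Time':   [0, 600, 5000, 5100, 5200],
})

GAP = 1800  # seconds; the same 30-minute threshold as the loop above

# A new session starts when the user changes or the time gap exceeds GAP.
new_user = toy['UserId'] != toy['UserId'].shift()
long_gap = toy['Time'].diff() > GAP

# Each True marks a session boundary; the running count is the session id.
toy['SessionId'] = (new_user | long_gap).cumsum()
print(toy['SessionId'].tolist())  # [1, 1, 2, 3, 3]
```

The first row always counts as a boundary because it has no predecessor, so session ids start at 1 as in the loop version.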

In [22]:
new_df

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId
0,1,3186,4,978300019,2000-12-31 22:00:19,1
1,1,1270,5,978300055,2000-12-31 22:00:55,1
2,1,1022,5,978300055,2000-12-31 22:00:55,1
3,1,1721,4,978300055,2000-12-31 22:00:55,1
4,1,2340,3,978300103,2000-12-31 22:01:43,1
...,...,...,...,...,...,...
836473,6040,2917,4,997454429,2001-08-10 14:40:29,24142
836474,6040,1784,3,997454464,2001-08-10 14:41:04,24142
836475,6040,1921,4,997454464,2001-08-10 14:41:04,24142
836476,6040,161,3,997454486,2001-08-10 14:41:26,24142


In [23]:
print(new_df['UserId'].nunique())
print(new_df['ItemId'].nunique())
print(new_df['SessionId'].nunique())

6039
3628
24143


In [24]:
sessionid_length = new_df.groupby('SessionId').size()
sessionid_length

SessionId
1         40
2         13
3        116
4         46
5         19
        ... 
24139      7
24140     13
24141      1
24142     21
24143      1
Length: 24143, dtype: int64

In [25]:
sessionid_length.median(), sessionid_length.mean()

(6.0, 34.64681274075301)

In [26]:
sessionid_length.min(), sessionid_length.max()

(1, 960)

In [27]:
sessionid_length.quantile(0.999)

609.8580000000038

In [28]:
long_session = sessionid_length[sessionid_length==960].index[0]
new_df[new_df['SessionId']==long_session]

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId
531367,3841,1221,5,965995059,2000-08-11 11:57:39,15729
531368,3841,969,5,965995059,2000-08-11 11:57:39,15729
531369,3841,858,5,965995059,2000-08-11 11:57:39,15729
531370,3841,1480,3,965995059,2000-08-11 11:57:39,15729
531371,3841,2019,5,965995059,2000-08-11 11:57:39,15729
...,...,...,...,...,...,...
532322,3841,1928,4,966003791,2000-08-11 14:23:11,15729
532323,3841,2016,3,966003791,2000-08-11 14:23:11,15729
532324,3841,432,3,966003791,2000-08-11 14:23:11,15729
532325,3841,2478,3,966003791,2000-08-11 14:23:11,15729


In [29]:
length_count = sessionid_length.groupby(sessionid_length).size()
length_percent_cumsum = length_count.cumsum() / length_count.sum()  # cumulative share of sessions by session length
length_percent_cumsum_999 = length_percent_cumsum[length_percent_cumsum < 0.999]

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
plt.bar(x=length_percent_cumsum_999.index,
        height=length_percent_cumsum_999, color='red')
plt.xticks(length_percent_cumsum_999.index)
plt.yticks(np.arange(0, 1.01, 0.05))
plt.title('Cumsum Percentage Until 0.999', size=20)
plt.show()

In [30]:
oldest, latest = new_df['Time_new'].min(), new_df['Time_new'].max()
print(oldest) 
print(latest)

2000-04-25 23:05:32
2003-02-28 17:49:50


The data covers about three years in total. We will build the train, validation, and test sets by time period.

In [31]:
three_years_ago = latest - dt.timedelta(1095)     
new_df_1 = new_df[new_df['Time_new'] > three_years_ago] 
new_df_1

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId
0,1,3186,4,978300019,2000-12-31 22:00:19,1
1,1,1270,5,978300055,2000-12-31 22:00:55,1
2,1,1022,5,978300055,2000-12-31 22:00:55,1
3,1,1721,4,978300055,2000-12-31 22:00:55,1
4,1,2340,3,978300103,2000-12-31 22:01:43,1
...,...,...,...,...,...,...
836473,6040,2917,4,997454429,2001-08-10 14:40:29,24142
836474,6040,1784,3,997454464,2001-08-10 14:41:04,24142
836475,6040,1921,4,997454464,2001-08-10 14:41:04,24142
836476,6040,161,3,997454486,2001-08-10 14:41:26,24142


In [32]:
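# Removing short sessions and then removing unpopular items can leave sessions of length 1 again,
# so the filter is repeated in a loop until the data stops changing. (Only short sessions are removed here.)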
def cleanse_recursive(data: pd.DataFrame, shortest) -> pd.DataFrame:
    while True:
        before_len = len(data)
        data = cleanse_short_session(data, shortest)
        after_len = len(data)
        if before_len == after_len:
            break
    return data


def cleanse_short_session(data: pd.DataFrame, shortest):
    session_len = data.groupby('SessionId').size()
    session_use = session_len[session_len >= shortest].index
    data = data[data['SessionId'].isin(session_use)]
    return data
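
The comments above mention removing unpopular items, but the notebook only defines the short-session filter. The sketch below shows how the two filters might be alternated; `cleanse_unpopular_item`, `cleanse_both`, and the `least_click` parameter are hypothetical names introduced here for illustration, and the toy data is made up.

```python
import pandas as pd


def cleanse_short_session(data: pd.DataFrame, shortest: int) -> pd.DataFrame:
    # Keep sessions that contain at least `shortest` rows.
    session_len = data.groupby('SessionId').size()
    keep = session_len[session_len >= shortest].index
    return data[data['SessionId'].isin(keep)]


def cleanse_unpopular_item(data: pd.DataFrame, least_click: int) -> pd.DataFrame:
    # Hypothetical helper: keep items that appear at least `least_click` times.
    item_count = data.groupby('ItemId').size()
    keep = item_count[item_count >= least_click].index
    return data[data['ItemId'].isin(keep)]


def cleanse_both(data: pd.DataFrame, shortest: int, least_click: int) -> pd.DataFrame:
    # Alternate the two filters until the row count stops changing.
    while True:
        before = len(data)
        data = cleanse_short_session(data, shortest)
        data = cleanse_unpopular_item(data, least_click)
        if len(data) == before:
            return data


toy = pd.DataFrame({
    'SessionId': [1, 1, 2, 2, 3, 3],
    'ItemId':    [7, 8, 7, 8, 7, 9],
})
cleaned = cleanse_both(toy, shortest=2, least_click=2)
print(cleaned)
```

In the toy data, dropping the rare item 9 leaves session 3 with a single row, which the next pass removes. This is the situation in which the loop needs more than one iteration.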

In [33]:
new_df_1 = cleanse_recursive(new_df_1, shortest=2)

new_df_1

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId
0,1,3186,4,978300019,2000-12-31 22:00:19,1
1,1,1270,5,978300055,2000-12-31 22:00:55,1
2,1,1022,5,978300055,2000-12-31 22:00:55,1
3,1,1721,4,978300055,2000-12-31 22:00:55,1
4,1,2340,3,978300103,2000-12-31 22:01:43,1
...,...,...,...,...,...,...
836472,6040,232,5,997454398,2001-08-10 14:39:58,24142
836473,6040,2917,4,997454429,2001-08-10 14:40:29,24142
836474,6040,1784,3,997454464,2001-08-10 14:41:04,24142
836475,6040,1921,4,997454464,2001-08-10 14:41:04,24142


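## Step 2. Building Mini-Batches

Following the tutorial code, we build the dataset and mini-batches. Using the Session-Parallel Mini-Batch approach, the batches are arranged so that training slows down as little as possible.
Note that, depending on how sessions were defined in Step 1, Session-Parallel Mini-Batches may not be strictly necessary.

The dataset covers roughly three years of data.  
Counting back from the most recent timestamp, ratings older than 720 days form train_set, ratings from the last 90 days form test_set, and everything in between forms validation_set.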

In [34]:
two_years_ago = latest - dt.timedelta(720)
three_months_ago = latest - dt.timedelta(90)
train_set = new_df_1[new_df_1['Time_new'] < two_years_ago]
validation_set = new_df_1[(new_df_1['Time_new'] > two_years_ago)&
                         (new_df_1['Time_new'] < three_months_ago)]
test_set = new_df_1[new_df_1['Time_new'] > three_months_ago]

In [35]:
train_set.shape

(779144, 6)

In [36]:
validation_set.shape

(47247, 6)

In [37]:
test_set.shape

(3405, 6)

In [38]:
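# Items that are missing from the train set can appear in the validation/test periods, so indices are built from the train data.
id2idx = {item_id : index for index, item_id in enumerate(train_set['ItemId'].unique())}

def indexing(df, id2idx):
    df = df.copy()  # work on a copy so the filtered slice is not modified in place
    df['item_idx'] = df['ItemId'].map(lambda x: id2idx.get(x, -1))  # items not in id2idx are marked as unknown (-1)
    return df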

train_set = indexing(train_set, id2idx)
validation_set = indexing(validation_set, id2idx)
test_set = indexing(test_set, id2idx)
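
Rows whose item was not seen in training receive `item_idx = -1`. If such an index reaches a one-hot encoder that relies on NumPy-style indexing, -1 would select the last class rather than an unknown one, so it may be safer to drop these rows from the validation and test sets first. The sketch below is one possible approach on a toy frame with made-up values; `toy_val`, `known`, and `filtered` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy validation frame (made-up values); item_idx == -1 marks items
# that did not appear in the training set.
toy_val = pd.DataFrame({
    'SessionId': [1, 1, 1, 2, 2],
    'item_idx':  [0, -1, 3, 5, -1],
})

# Drop unknown items first.
known = toy_val[toy_val['item_idx'] >= 0]

# Then drop sessions left with fewer than 2 rows, since a session needs
# at least one (input, target) pair.
session_len = known.groupby('SessionId').size()
keep = session_len[session_len >= 2].index
filtered = known[known['SessionId'].isin(keep)]
print(filtered)
```

Session 2 disappears entirely here because only one known item remains in it.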

In [39]:
train_set

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId,item_idx
0,1,3186,4,978300019,2000-12-31 22:00:19,1,0
1,1,1270,5,978300055,2000-12-31 22:00:55,1,1
2,1,1022,5,978300055,2000-12-31 22:00:55,1,2
3,1,1721,4,978300055,2000-12-31 22:00:55,1,3
4,1,2340,3,978300103,2000-12-31 22:01:43,1,4
...,...,...,...,...,...,...,...
836450,6040,535,4,964828734,2000-07-28 23:58:54,24140,2224
836451,6040,1273,4,964828734,2000-07-28 23:58:54,24140,2350
836452,6040,3751,4,964828782,2000-07-28 23:59:42,24140,458
836453,6040,1077,5,964828799,2000-07-28 23:59:59,24140,898


In [41]:
validation_set

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId,item_idx
2183,19,318,4,994556598,2001-07-08 01:43:18,43,60
2184,19,1234,5,994556636,2001-07-08 01:43:56,43,505
2198,20,1694,3,1009669071,2001-12-29 23:37:51,45,1797
2199,20,1468,3,1009669071,2001-12-29 23:37:51,45,2542
2200,20,2858,4,1009669071,2001-12-29 23:37:51,45,61
...,...,...,...,...,...,...,...
836472,6040,232,5,997454398,2001-08-10 14:39:58,24142,977
836473,6040,2917,4,997454429,2001-08-10 14:40:29,24142,1099
836474,6040,1784,3,997454464,2001-08-10 14:41:04,24142,85
836475,6040,1921,4,997454464,2001-08-10 14:41:04,24142,315


In [42]:
test_set

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId,item_idx
4566,36,1701,4,1040544350,2002-12-22 08:05:50,88,440
4567,36,2269,5,1040544350,2002-12-22 08:05:50,88,1238
4568,36,2694,3,1040544494,2002-12-22 08:08:14,88,792
4569,36,3786,4,1040544521,2002-12-22 08:08:41,88,205
4570,36,2369,4,1040544564,2002-12-22 08:09:24,88,851
...,...,...,...,...,...,...,...
823653,5950,3893,3,1046369569,2003-02-27 18:12:49,23862,1804
823654,5950,3948,4,1046369637,2003-02-27 18:13:57,23862,461
823655,5950,3578,4,1046369670,2003-02-27 18:14:30,23862,93
823656,5950,3793,3,1046369710,2003-02-27 18:15:10,23862,373


In [43]:
save_path = data_path / 'processed2'
save_path.mkdir(parents=True, exist_ok = True)

train_set.to_pickle(save_path / 'train.pkl')
validation_set.to_pickle(save_path / 'valid.pkl')
test_set.to_pickle(save_path / 'test.pkl')

In [44]:
class SessionDataset:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, data):
        self.df = data
        self.click_offsets = self.get_click_offsets()
        self.session_idx = np.arange(self.df['SessionId'].nunique())  # indexing to SessionId

    def get_click_offsets(self):
        """
        Return the indexes of the first click of each session IDs,
        """
        offsets = np.zeros(self.df['SessionId'].nunique() + 1, dtype=np.int32)
        offsets[1:] = self.df.groupby('SessionId').size().cumsum()
        return offsets
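
To make `click_offsets` concrete, here is a small worked example using NumPy only. The session sizes are made up, and the sketch assumes, as the class does, that each session's rows are stored contiguously and in SessionId order.

```python
import numpy as np

# Made-up session sizes: three sessions with 3, 2 and 4 rows.
session_sizes = np.array([3, 2, 4])

# offsets[i] is the row where session i starts;
# offsets[i + 1] is one past its last row.
offsets = np.zeros(len(session_sizes) + 1, dtype=np.int32)
offsets[1:] = session_sizes.cumsum()
print(offsets)  # [0 3 5 9]

# Rows of the second session (index 1) are offsets[1]..offsets[2] - 1.
second_session_rows = list(range(offsets[1], offsets[2]))
print(second_session_rows)  # [3, 4]
```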

In [45]:
tr_dataset = SessionDataset(train_set)
tr_dataset.df.head(10)

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId,item_idx
0,1,3186,4,978300019,2000-12-31 22:00:19,1,0
1,1,1270,5,978300055,2000-12-31 22:00:55,1,1
2,1,1022,5,978300055,2000-12-31 22:00:55,1,2
3,1,1721,4,978300055,2000-12-31 22:00:55,1,3
4,1,2340,3,978300103,2000-12-31 22:01:43,1,4
5,1,1836,5,978300172,2000-12-31 22:02:52,1,5
6,1,3408,4,978300275,2000-12-31 22:04:35,1,6
7,1,2804,5,978300719,2000-12-31 22:11:59,1,7
8,1,1207,4,978300719,2000-12-31 22:11:59,1,8
9,1,1193,5,978300760,2000-12-31 22:12:40,1,9


In [46]:
class SessionDataLoader:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, dataset: SessionDataset, batch_size=50):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        """ Returns the iterator for producing session-parallel training mini-batches.
        Yields:
            input (B,):  Item indices that will be encoded as one-hot vectors later.
            target (B,): a Variable that stores the target item indices
            masks: Numpy array indicating the positions of the sessions to be terminated
        """

        start, end, mask, last_session, finished = self.initialize()  # See the initialize method below.
        """
        start : Index Where Session Start
        end : Index Where Session End
        mask : indicator for the sessions to be terminated
        """

        while not finished:
            min_len = (end - start).min() - 1  # Shortest Length Among Sessions
            for i in range(min_len):
                # Build inputs & targets
                inp = self.dataset.df['item_idx'].values[start + i]
                target = self.dataset.df['item_idx'].values[start + i + 1]
                yield inp, target, mask

            start, end, mask, last_session, finished = self.update_status(start, end, min_len, last_session, finished)

    def initialize(self):
        first_iters = np.arange(self.batch_size)    # Session indices used for the first batch.
        last_session = self.batch_size - 1    # Index of the last session currently in use.
        start = self.dataset.click_offsets[self.dataset.session_idx[first_iters]]       # Row positions where each session starts.
        end = self.dataset.click_offsets[self.dataset.session_idx[first_iters] + 1]  # Row positions just past the end of each session.
        mask = np.array([])   # Batch slots whose session has finished are recorded here.
        finished = False         # Whether all of the data has been consumed.
        return start, end, mask, last_session, finished

    def update_status(self, start: np.ndarray, end: np.ndarray, min_len: int, last_session: int, finished: bool):  
        # Update the state so the next batch can be produced.
        
        start += min_len   # __iter__ advanced min_len steps, so move start forward by min_len.
        mask = np.arange(self.batch_size)[(end - start) == 1]  
        # end is where the next session begins, so a gap of one from start means the session is finished. Record it in mask.

        for i, idx in enumerate(mask, start=1):  # Start one new session for each slot recorded in mask.
            new_session = last_session + i  
            if new_session > self.dataset.session_idx[-1]:  # If the new index exceeds the last session index, all data has been consumed.
                finished = True
                break
            # update the next starting/ending point
            start[idx] = self.dataset.click_offsets[self.dataset.session_idx[new_session]]     # Replace the finished session with the start of the new one.
            end[idx] = self.dataset.click_offsets[self.dataset.session_idx[new_session] + 1]

        last_session += len(mask)  # Remember the index of the last session in use.
        return start, end, mask, last_session, finished

In [47]:
tr_data_loader = SessionDataLoader(tr_dataset, batch_size=4)
tr_dataset.df.head(15)

Unnamed: 0,UserId,ItemId,Rating,Time,Time_new,SessionId,item_idx
0,1,3186,4,978300019,2000-12-31 22:00:19,1,0
1,1,1270,5,978300055,2000-12-31 22:00:55,1,1
2,1,1022,5,978300055,2000-12-31 22:00:55,1,2
3,1,1721,4,978300055,2000-12-31 22:00:55,1,3
4,1,2340,3,978300103,2000-12-31 22:01:43,1,4
5,1,1836,5,978300172,2000-12-31 22:02:52,1,5
6,1,3408,4,978300275,2000-12-31 22:04:35,1,6
7,1,2804,5,978300719,2000-12-31 22:11:59,1,7
8,1,1207,4,978300719,2000-12-31 22:11:59,1,8
9,1,1193,5,978300760,2000-12-31 22:12:40,1,9


In [48]:
iter_ex = iter(tr_data_loader)

In [49]:
inputs, labels, mask =  next(iter_ex)
print(f'Model Input Item Idx are : {inputs}')
print(f'Label Item Idx are : {"":5} {labels}')
print(f'Previous Masked Input Idx are {mask}')

Model Input Item Idx are : [ 0 40 53 65]
Label Item Idx are :       [ 1 41 54 61]
Previous Masked Input Idx are []


## Step 3. 모델 구성

In [50]:
def mrr_k(pred, truth: int, k: int):
    indexing = np.where(pred[:k] == truth)[0]
    if len(indexing) > 0:
        return 1 / (indexing[0] + 1)
    else:
        return 0


def recall_k(pred, truth: int, k: int) -> int:
    answer = truth in pred[:k]
    return int(answer)
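
A small worked example may help in reading the two metrics. The sketch restates them in compact form and applies them to a made-up ranking; `ranked` is an illustrative name and the item indices are arbitrary.

```python
import numpy as np


def recall_k(pred, truth, k):
    # 1 if the true item is among the top-k predictions, else 0.
    return int(truth in pred[:k])


def mrr_k(pred, truth, k):
    # Reciprocal of the 1-based rank of the true item within the top k, else 0.
    hits = np.where(pred[:k] == truth)[0]
    return 1 / (hits[0] + 1) if len(hits) > 0 else 0


# Made-up ranking of item indices, best first.
ranked = np.array([12, 7, 3, 25, 9])

# Item 3 sits at rank 3, inside the top 3: recall is 1, MRR is 1/3.
print(recall_k(ranked, 3, k=3), mrr_k(ranked, 3, k=3))

# Item 25 sits at rank 4, outside the top 3: both metrics are 0.
print(recall_k(ranked, 25, k=3), mrr_k(ranked, 25, k=3))
```

Recall@k only records whether the target appears in the top k, while MRR@k also rewards placing it near the top.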

In [51]:
import numpy as np
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
from tensorflow.keras.layers import Input, Dense, Dropout, GRU
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tqdm import tqdm

1 Physical GPUs, 1 Logical GPUs


In [52]:
def create_model(args):
    inputs = Input(batch_shape=(args.batch_size, 1, args.num_items))
    gru, _ = GRU(args.hsz, stateful=True, return_state=True, name='GRU')(inputs)
    dropout = Dropout(args.drop_rate)(gru)
    predictions = Dense(args.num_items, activation='softmax')(dropout)
    model = Model(inputs=inputs, outputs=[predictions])
    model.compile(loss=categorical_crossentropy, optimizer=Adam(args.lr), metrics=['accuracy'])
    model.summary()
    return model

In [53]:
class Args:
    def __init__(self, tr, val, test, batch_size, hsz, drop_rate, lr, epochs, k):
        self.tr = tr
        self.val = val
        self.test = test
        self.num_items = tr['ItemId'].nunique()
        self.num_sessions = tr['SessionId'].nunique()
        self.batch_size = batch_size
        self.hsz = hsz
        self.drop_rate = drop_rate
        self.lr = lr
        self.epochs = epochs
        self.k = k

args = Args(train_set, validation_set, test_set, batch_size=56, hsz=50, drop_rate=0.1, lr=0.001, epochs=3, k=20)

In [54]:
model = create_model(args)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(56, 1, 3596)]           0         
_________________________________________________________________
GRU (GRU)                    [(56, 50), (56, 50)]      547200    
_________________________________________________________________
dropout (Dropout)            (56, 50)                  0         
_________________________________________________________________
dense (Dense)                (56, 3596)                183396    
Total params: 730,596
Trainable params: 730,596
Non-trainable params: 0
_________________________________________________________________
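
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that assumes the Keras GRU default `reset_after=True`, which gives each of the three gates two bias vectors; the variable names are illustrative.

```python
# Values taken from the summary above.
num_items = 3596   # input and output dimension
hidden = 50        # GRU units (hsz)

# GRU: three gates, each with an input kernel (num_items x hidden),
# a recurrent kernel (hidden x hidden), and two bias vectors.
gru_params = 3 * (hidden * (num_items + hidden) + 2 * hidden)

# Dense softmax layer: weight matrix plus bias.
dense_params = hidden * num_items + num_items

print(gru_params, dense_params, gru_params + dense_params)
# 547200 183396 730596
```

Most of the parameters sit in the GRU input kernel, which grows with the size of the one-hot item vector.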


## Step 4. 모델 학습

Try varying the hyperparameters and validating the results.

In [55]:
# Train on the train set while validating on the validation set.
def train_model(model, args):
    train_dataset = SessionDataset(args.tr)
    train_loader = SessionDataLoader(train_dataset, batch_size=args.batch_size)

    for epoch in range(1, args.epochs + 1):
        total_step = len(args.tr) - args.tr['SessionId'].nunique()
        tr_loader = tqdm(train_loader, total=total_step // args.batch_size, desc='Train', mininterval=1)
        for feat, target, mask in tr_loader:
            reset_hidden_states(model, mask)  # Reset the hidden state of finished sessions; see the function below.

            input_ohe = to_categorical(feat, num_classes=args.num_items)
            input_ohe = np.expand_dims(input_ohe, axis=1)
            target_ohe = to_categorical(target, num_classes=args.num_items)

            result = model.train_on_batch(input_ohe, target_ohe)
            tr_loader.set_postfix(train_loss=result[0], accuracy = result[1])

        val_recall, val_mrr = get_metrics(args.val, model, args, args.k)  # Evaluate on the validation set.

        print(f"\t - Recall@{args.k} epoch {epoch}: {val_recall:3f}")
        print(f"\t - MRR@{args.k}    epoch {epoch}: {val_mrr:3f}\n")


def reset_hidden_states(model, mask):
    gru_layer = model.get_layer(name='GRU')  # Fetch the GRU layer from the model.
    hidden_states = gru_layer.states[0].numpy()  # Current hidden states of the GRU layer.
    for elt in mask:  # For each masked index, i.e. each slot whose session has ended,
        hidden_states[elt, :] = 0  # reset the hidden state to zero.
    gru_layer.reset_states(states=hidden_states)


def get_metrics(data, model, args, k: int):  # Evaluates the validation and test sets.
                                             # Nearly the same as training, plus the lines that compute MRR and recall.
    dataset = SessionDataset(data)
    loader = SessionDataLoader(dataset, batch_size=args.batch_size)
    recall_list, mrr_list = [], []

    total_step = len(data) - data['SessionId'].nunique()
    for inputs, label, mask in tqdm(loader, total=total_step // args.batch_size, desc='Evaluation', mininterval=1):
        reset_hidden_states(model, mask)
        input_ohe = to_categorical(inputs, num_classes=args.num_items)
        input_ohe = np.expand_dims(input_ohe, axis=1)

        pred = model.predict(input_ohe, batch_size=args.batch_size)
        pred_arg = tf.argsort(pred, direction='DESCENDING')  # Sort item indices by softmax score, highest first.

        length = len(inputs)
        recall_list.extend([recall_k(pred_arg[i], label[i], k) for i in range(length)])
        mrr_list.extend([mrr_k(pred_arg[i], label[i], k) for i in range(length)])

    recall, mrr = np.mean(recall_list), np.mean(mrr_list)
    return recall, mrr

In [56]:
train_model(model, args)

Train:  99%|█████████▉| 13595/13674 [01:20<00:00, 169.92it/s, accuracy=0.0357, train_loss=5.53]
Evaluation:  95%|█████████▌| 738/776 [01:39<00:05,  7.41it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0179, train_loss=5.52]

	 - Recall@20 epoch 1: 0.152971
	 - MRR@20    epoch 1: 0.035323



Train:  99%|█████████▉| 13595/13674 [01:18<00:00, 172.79it/s, accuracy=0.0179, train_loss=5.35]
Evaluation:  95%|█████████▌| 738/776 [01:39<00:05,  7.45it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0, train_loss=5.19]     

	 - Recall@20 epoch 2: 0.179128
	 - MRR@20    epoch 2: 0.041026



Train:  99%|█████████▉| 13595/13674 [01:20<00:00, 169.23it/s, accuracy=0.0536, train_loss=5.21]
Evaluation:  95%|█████████▌| 738/776 [01:37<00:05,  7.55it/s]

	 - Recall@20 epoch 3: 0.185613
	 - MRR@20    epoch 3: 0.043292






## Step 5. 모델 테스트

Using the test set prepared earlier, check Recall and MRR.

In [57]:
def test_model(model, args, test):
    test_recall, test_mrr = get_metrics(test, model, args, args.k)  # evaluate at the same k that is printed below
    print(f"\t - Recall@{args.k}: {test_recall:3f}")
    print(f"\t - MRR@{args.k}: {test_mrr:3f}\n")

test_model(model, args, test_set)

Evaluation:  56%|█████▋    | 31/55 [00:04<00:03,  7.37it/s]

	 - Recall@20: 0.159562
	 - MRR@20: 0.042906






# Changing the Hyperparameters

Let's train for more epochs!

### Changing drop_rate to 0.5

In [61]:
args_1 = Args(train_set, validation_set, test_set, batch_size=56, hsz=50, drop_rate=0.5, lr=0.001, epochs=7, k=20)

In [62]:
model_1 = create_model(args_1)

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(56, 1, 3596)]           0         
_________________________________________________________________
GRU (GRU)                    [(56, 50), (56, 50)]      547200    
_________________________________________________________________
dropout_2 (Dropout)          (56, 50)                  0         
_________________________________________________________________
dense_2 (Dense)              (56, 3596)                183396    
Total params: 730,596
Trainable params: 730,596
Non-trainable params: 0
_________________________________________________________________


In [63]:
train_model(model_1, args_1)

Train:  99%|█████████▉| 13595/13674 [01:21<00:00, 167.70it/s, accuracy=0, train_loss=5.84]     
Evaluation:  95%|█████████▌| 738/776 [01:39<00:05,  7.40it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0179, train_loss=5.74]

	 - Recall@20 epoch 1: 0.144212
	 - MRR@20    epoch 1: 0.032898



Train:  99%|█████████▉| 13595/13674 [01:18<00:00, 173.64it/s, accuracy=0.0357, train_loss=5.57]
Evaluation:  95%|█████████▌| 738/776 [01:38<00:05,  7.48it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0357, train_loss=5.51]

	 - Recall@20 epoch 2: 0.168118
	 - MRR@20    epoch 2: 0.039383



Train:  99%|█████████▉| 13595/13674 [01:18<00:00, 173.32it/s, accuracy=0.0179, train_loss=5.48]
Evaluation:  95%|█████████▌| 738/776 [01:38<00:05,  7.53it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0179, train_loss=5.42]

	 - Recall@20 epoch 3: 0.177216
	 - MRR@20    epoch 3: 0.041390



Train:  99%|█████████▉| 13595/13674 [01:18<00:00, 173.01it/s, accuracy=0.0357, train_loss=5.46]
Evaluation:  95%|█████████▌| 738/776 [01:38<00:05,  7.52it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0893, train_loss=5.17]

	 - Recall@20 epoch 4: 0.181693
	 - MRR@20    epoch 4: 0.042653



Train:  99%|█████████▉| 13595/13674 [01:17<00:00, 174.33it/s, accuracy=0.0179, train_loss=5.47]
Evaluation:  95%|█████████▌| 738/776 [01:37<00:05,  7.60it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0357, train_loss=5.34]

	 - Recall@20 epoch 5: 0.183338
	 - MRR@20    epoch 5: 0.042713



Train:  99%|█████████▉| 13595/13674 [01:17<00:00, 175.04it/s, accuracy=0.0179, train_loss=5.5] 
Evaluation:  95%|█████████▌| 738/776 [01:36<00:04,  7.63it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0179, train_loss=5.35]

	 - Recall@20 epoch 6: 0.182999
	 - MRR@20    epoch 6: 0.043308



Train:  99%|█████████▉| 13595/13674 [01:17<00:00, 175.35it/s, accuracy=0, train_loss=5.6]      
Evaluation:  95%|█████████▌| 738/776 [01:36<00:04,  7.65it/s]

	 - Recall@20 epoch 7: 0.185709
	 - MRR@20    epoch 7: 0.043533






In [64]:
test_model(model_1, args_1, test_set)

Evaluation:  56%|█████▋    | 31/55 [00:04<00:03,  7.64it/s]

	 - Recall@20: 0.164171
	 - MRR@20: 0.042154






### Changing drop_rate to 0.3

In [67]:
args_2 = Args(train_set, validation_set, test_set, batch_size=56, hsz=50, drop_rate=0.3, lr=0.001, epochs=7, k=20)

In [68]:
model_2 = create_model(args_2)

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(56, 1, 3596)]           0         
_________________________________________________________________
GRU (GRU)                    [(56, 50), (56, 50)]      547200    
_________________________________________________________________
dropout_3 (Dropout)          (56, 50)                  0         
_________________________________________________________________
dense_3 (Dense)              (56, 3596)                183396    
Total params: 730,596
Trainable params: 730,596
Non-trainable params: 0
_________________________________________________________________


In [69]:
train_model(model_2, args_2)

Train:  99%|█████████▉| 13595/13674 [01:19<00:00, 170.88it/s, accuracy=0.0179, train_loss=5.61]
Evaluation:  95%|█████████▌| 738/776 [01:41<00:05,  7.26it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0179, train_loss=5.73]

	 - Recall@20 epoch 1: 0.150019
	 - MRR@20    epoch 1: 0.034317



Train:  99%|█████████▉| 13595/13674 [01:16<00:00, 177.00it/s, accuracy=0.0179, train_loss=5.69]
Evaluation:  95%|█████████▌| 738/776 [01:39<00:05,  7.44it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0179, train_loss=5.24]

	 - Recall@20 epoch 2: 0.175595
	 - MRR@20    epoch 2: 0.040777



Train:  99%|█████████▉| 13595/13674 [01:18<00:00, 174.22it/s, accuracy=0.0714, train_loss=5.62]
Evaluation:  95%|█████████▌| 738/776 [01:36<00:04,  7.64it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0, train_loss=5.58]     

	 - Recall@20 epoch 3: 0.184185
	 - MRR@20    epoch 3: 0.043097



Train:  99%|█████████▉| 13595/13674 [01:18<00:00, 172.29it/s, accuracy=0.0893, train_loss=5.45]
Evaluation:  95%|█████████▌| 738/776 [01:37<00:05,  7.58it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0536, train_loss=5.04]

	 - Recall@20 epoch 4: 0.188928
	 - MRR@20    epoch 4: 0.044466



Train:  99%|█████████▉| 13595/13674 [01:19<00:00, 170.02it/s, accuracy=0.0357, train_loss=5.42]
Evaluation:  95%|█████████▌| 738/776 [01:33<00:04,  7.88it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0357, train_loss=5.01]

	 - Recall@20 epoch 5: 0.189654
	 - MRR@20    epoch 5: 0.045108



Train:  99%|█████████▉| 13595/13674 [01:17<00:00, 176.28it/s, accuracy=0.0536, train_loss=5.26]
Evaluation:  95%|█████████▌| 738/776 [01:33<00:04,  7.91it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0536, train_loss=5]   

	 - Recall@20 epoch 6: 0.190476
	 - MRR@20    epoch 6: 0.046116



Train:  99%|█████████▉| 13595/13674 [01:16<00:00, 178.28it/s, accuracy=0.0714, train_loss=5.38]
Evaluation:  95%|█████████▌| 738/776 [01:40<00:05,  7.38it/s]

	 - Recall@20 epoch 7: 0.192291
	 - MRR@20    epoch 7: 0.046860






In [70]:
test_model(model_2, args_2, test_set)

Evaluation:  56%|█████▋    | 31/55 [00:03<00:03,  7.89it/s]

	 - Recall@20: 0.170507
	 - MRR@20: 0.046753






### Changing k to 40

In [71]:
args_3 = Args(train_set, validation_set, test_set, batch_size=56, hsz=50, drop_rate=0.1, lr=0.001, epochs=7, k=40)

In [72]:
model_3 = create_model(args_3)

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         [(56, 1, 3596)]           0         
_________________________________________________________________
GRU (GRU)                    [(56, 50), (56, 50)]      547200    
_________________________________________________________________
dropout_4 (Dropout)          (56, 50)                  0         
_________________________________________________________________
dense_4 (Dense)              (56, 3596)                183396    
Total params: 730,596
Trainable params: 730,596
Non-trainable params: 0
_________________________________________________________________


In [73]:
train_model(model_3, args_3)

Train:  99%|█████████▉| 13595/13674 [01:17<00:00, 174.31it/s, accuracy=0.0357, train_loss=5.63]
Evaluation:  95%|█████████▌| 738/776 [02:44<00:08,  4.49it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0179, train_loss=5.31]

	 - Recall@40 epoch 1: 0.237684
	 - MRR@40    epoch 1: 0.038065



Train:  99%|█████████▉| 13595/13674 [01:17<00:00, 175.74it/s, accuracy=0.0357, train_loss=5.38]
Evaluation:  95%|█████████▌| 738/776 [02:40<00:08,  4.61it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0357, train_loss=5.04]

	 - Recall@40 epoch 2: 0.274439
	 - MRR@40    epoch 2: 0.044288



Train:  99%|█████████▉| 13595/13674 [01:18<00:00, 173.70it/s, accuracy=0.0714, train_loss=5.3] 
Evaluation:  95%|█████████▌| 738/776 [02:43<00:08,  4.52it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0357, train_loss=5.44]

	 - Recall@40 epoch 3: 0.284311
	 - MRR@40    epoch 3: 0.047414



Train:  99%|█████████▉| 13595/13674 [01:17<00:00, 174.97it/s, accuracy=0.0893, train_loss=5.26]
Evaluation:  95%|█████████▌| 738/776 [02:36<00:08,  4.73it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0714, train_loss=4.93]

	 - Recall@40 epoch 4: 0.290820
	 - MRR@40    epoch 4: 0.049141



Train:  99%|█████████▉| 13595/13674 [01:19<00:00, 170.42it/s, accuracy=0.0536, train_loss=5.27]
Evaluation:  95%|█████████▌| 738/776 [02:44<00:08,  4.48it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0357, train_loss=4.98]

	 - Recall@40 epoch 5: 0.291570
	 - MRR@40    epoch 5: 0.049891



Train:  99%|█████████▉| 13595/13674 [01:24<00:00, 161.80it/s, accuracy=0.0893, train_loss=5.24]
Evaluation:  95%|█████████▌| 738/776 [02:48<00:08,  4.39it/s]
Train:   0%|          | 0/13674 [00:00<?, ?it/s, accuracy=0.0714, train_loss=4.92]

	 - Recall@40 epoch 6: 0.290868
	 - MRR@40    epoch 6: 0.050747



Train:  99%|█████████▉| 13595/13674 [01:21<00:00, 167.14it/s, accuracy=0.0536, train_loss=5.18]
Evaluation:  95%|█████████▌| 738/776 [02:41<00:08,  4.56it/s]

	 - Recall@40 epoch 7: 0.289949
	 - MRR@40    epoch 7: 0.050652






In [74]:
test_model(model_3, args, test_set)  # note: this passes args (k=20), so the metrics below are reported at k=20, not 40

Evaluation:  56%|█████▋    | 31/55 [00:04<00:03,  7.55it/s]

	 - Recall@20: 0.160138
	 - MRR@20: 0.044069






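# Results and Review

We built a recommendation system using the MovieLens data.
Due to time constraints each model was trained for at most 7 epochs, but the training loss was still falling, so more epochs would probably improve model performance.  
Early on, I generated a separate session_id from user_id and time, but in hindsight that may not have been necessary.  
Given the chance, I would like to slim down other parts of the data and train again.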