This Code is originally published from The Chinese University of Hong Kong STAT 3009 Class. This is for review only

# Overview of Recommender Systems

## Examples (RS) in Kaggle

- [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation/data?select=Data+Dictionary.xlsx): `merchant_id` and `card_id`.

- [WSDM - KKBox's Music Recommendation Challenge](https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data): `user` and `music`.

- [Event Recommendation Engine Challenge](https://www.kaggle.com/c/event-recommendation-engine-challenge/overview/evaluation): `user` and `event`.


## Load Netflix dataset

- Dowload [Netflix Prize Data](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data). (For illustration, we only take the first subset.)

- Dataset is pre-processed by [`pre-process.py`](https://github.com/statmlben/CUHK-STAT3009/tree/main/dataset)

- Load data into Python

- Re-orginize the data structure as a standard form

- For testing set, we remove the real ratings.



In [37]:
import numpy as np
import pandas as pd

## Upload Netflix dataset in CUHK-STAT3009 Github repo

train_url = "https://raw.githubusercontent.com/statmlben/CUHK-STAT3009/main/dataset/train.csv"
test_url = "https://raw.githubusercontent.com/statmlben/CUHK-STAT3009/main/dataset/test.csv"

dtrain = pd.read_csv(train_url)
dtest = pd.read_csv(test_url)

In [38]:
dtrain.sample(5)

Unnamed: 0,movie_id,user_id,rating,date
17234,1161,1942,5,2003-11-12
34560,1133,1374,4,2005-04-30
19469,510,1347,5,2005-12-19
25139,21,1302,3,2004-07-23
12748,3019,1013,4,2004-02-18


In [39]:
dtest.sample(5)

Unnamed: 0,movie_id,user_id,rating,date
34070,242,879,1,2003-02-14
33963,2636,698,5,2003-05-01
46038,887,1709,5,2005-09-09
6851,1456,1372,2,2003-08-19
33346,3044,1161,4,2004-05-28


### Pre-process the data as a `np.array`

In [40]:
dtest.values

array([[2956, 1574, 4, '2005-06-14'],
       [791, 1670, 4, '2005-02-26'],
       [1547, 837, 5, '2002-03-15'],
       ...,
       [653, 1828, 5, '2005-12-22'],
       [2195, 566, 3, '2005-05-08'],
       [3081, 1046, 4, '2005-01-08']], dtype=object)

In [41]:
np.array(dtrain['rating'].values, dtype='float')

array([4., 4., 4., ..., 4., 2., 3.])

In [42]:
## save (user_id, item_id) and rating separately
train_rating = dtrain['rating'].values

# 기존에 np array가 정수형으로 되어 있는데 데이터 타입을 유리수 형태로 변경 
train_rating = np.array(train_rating, dtype='float')

# train_pair에 train 데이터로 사용할 user_id와 movie_id를 기입 
train_pair = dtrain[['user_id', 'movie_id']].values

# test에도 동일하게 적용 
test_rating = dtest['rating'].values
test_rating = np.array(train_rating, dtype='float')
test_pair = dtest[['user_id', 'movie_id']].values

## we want to predict `test_rating` based on `train_pair`, `train_rating`, `test_pair`

In [43]:
len(train_pair)

51161

In [44]:
len(train_rating)

51161

In [45]:
## find the number of users/items

# user
# starts with 0, so add up 1
n_user = max(max(train_pair[:,0]), max(test_pair[:,0])) + 1

# item
n_item = max(max(train_pair[:,1]), max(test_pair[:,1])) + 1

## if the user_id is not the form of {0,1,2,3,4, ...}
## You should use set operators

In [46]:
len(train_pair[:,0]), len(test_pair[:,0])

(51161, 51161)

In [47]:
max(train_pair[:,0]), max(test_pair[:,0])

(1999, 1999)

In [48]:
max(train_pair[:,1]), max(test_pair[:,1])

(3567, 3567)

In [49]:
print('num of users: %d' %n_user)
print('num of items: %d' %n_item)

num of users: 2000
num of items: 3568


## Evaluation

- Define a function to compute `rmse` for the predicted rating

- Test your function

In [50]:
## define RMSE function
## professor answer
def rmse(pred_rating, test_rating) :
    return np.sqrt(np.mean((pred_rating - test_rating)**2))

In [51]:
demo = test_rating.copy()
demo = 0.0 * demo
print(demo)

res_tmp = np.sqrt(np.mean((demo - test_rating)**2))
print(res_tmp)

[0. 0. 0. ... 0. 0. 0.]
3.781031894239759


In [53]:
## define RMSE function
## generated by ChatGPT
def calculate_rmse(predictions, targets):
    """
    RMSE 계산하는 함수

    :param predictions: 예측값 배열
    :param targets: 실제값 배열
    :return: RMSE
    """
    # 예측값과 실제값 배열 길이가 같은지 확인
    if len(predictions) != len(targets):
        raise ValueError("예측값과 실제값 배열의 길이가 같아야 합니다.")

    # RMSE 계산
    mse = np.mean((predictions - targets) ** 2)
    rmse = np.sqrt(mse)

    return rmse


In [55]:
## Test `rmse` function
rmse(pred_rating=test_rating, test_rating=test_rating)

0.0

## Implement Baseline methods

- Inout: training set.

- Output: return predicted ratings for (user id, item id) user-item pairs in testing set.

- Goal: make prediction for testing set

### Global mean

$$
\bar{r} = \frac{1}{|\Omega|} \sum_{(u,i) \in \Omega} r_{ui}, \quad \hat{r}_{ui} = \bar{r}
$$

In [56]:
## create a potential prediction for `test_rating`

## test_rating의 숫자만큼의 1을 생성 
pred_rating = np.ones(len(test_pair))

## Compute global mean based on `train_rating`
pred_rating = pred_rating * np.mean(train_rating)
print('glb ave prediction is %s'%pred_rating)

glb ave prediction is [3.62115674 3.62115674 3.62115674 ... 3.62115674 3.62115674 3.62115674]


In [57]:
rmse(pred_rating, test_rating)
print('rmse for glb Ave: %.4f' %rmse(pred_rating, test_rating))

rmse for glb Ave: 1.0879


### user average

$$
		\bar{r}_{u} = \frac{1}{|\mathcal{I}_u|} \sum_{i \in \mathcal{I}_u} r_{ui}, \text{ for } u=1, \cdots, n; \quad \hat{r}_{ui} = \bar{r}_u
$$

- Loop for all users
  - Find all records for this user in both training and testing sets.
  - Compute the average ratings for this user in the training set.
  - Predict the ratings for this users in the testing set.

In [58]:
## (InClass Practice) user average
pred_rating = np.zeros(len(test_pair))
glb_mean = np.mean(train_rating)

for u in range(n_user):
  # S1: find the records in `test_pair` that `user_id` = u
  # S2: compute the user average in the training set
    # S2.1: find all indices in `train_pair` that `user_id` = u
    # S2.2: put the indices into `train_rating` then take average
  # S3: put user average into all records in S1

  # user column에서 user 식별 번호가 u에 해당하는 모든 경우의 인덱스를 변수에 지정(train과 test에 모두 적용)  
  ind_test = np.where(test_pair[:,0] == u)[0]
  ind_train = np.where(train_pair[:,0] == u)[0]

  # if ind_test is none

  # test 데이터는 평가를 위한 데이터셋이므로 glb_mean을 지정하지 않는다. 
  if len(ind_test)==0:
    continue
    
  # train 데이터가 없는 경우 global mean을 변수로 지정해 준다. 
  if len(ind_train)==0:
    user_mean_tmp = glb_mean
  else:
    user_mean_tmp = train_rating[ind_train].mean()
  
  # 해당 인덱스에 평균 지정 
  pred_rating[ind_test] = user_mean_tmp

In [59]:
rmse(pred_rating, test_rating)

1.1917209444380417

In [60]:
## 내가 이해한 내용 
## Predicted value는 같은 user_id를 기준으로 그룹화 해 평균을 낸 숫자를 predicted value라고 정의 

### item average

$$
		\bar{r}_{i} = \frac{1}{|\mathcal{U}_i|} \sum_{u \in \mathcal{U}_i} r_{ui}, \text{ for } i=1, \cdots, m; \quad \hat{r}_{ui} = \bar{r}_i,
$$

In [61]:
## (InClass Practice) item average
pred_rating = np.zeros(len(test_pair))
glb_mean = np.mean(train_rating)

for i in range(n_item):

  # item column에서 item 식별 번호가 i에 해당하는 모든 경우의 인덱스를 변수에 지정(train과 test에 모두 적용)  
  ind_test = np.where(test_pair[:,1] == i)[0]
  ind_train = np.where(train_pair[:,1] == i)[0]

  # if ind_test is none
  if len(ind_test)==0:
    continue
  # less than 3 can't represents the data
  if len(ind_train)<3:
    item_mean = glb_mean
  else:
    item_mean_tmp = train_rating[ind_train].mean()
  pred_rating[ind_test] = item_mean_tmp

In [62]:
## 내가 이해한 내용
## 이번에는 같은 item끼리 묶어서 rating 중 같은 item의 평균을 도출해 냄. 
## 참고로 item이 3개 이하인 경우 그 item의 rating을 대변하기 어렵다고 생각해서 global mean을 그대로 적용. 

## Package Python functions


- *Input*: 'train_rating', 'test_pair'

- *Return*: Predicted ratings.

In [63]:
def glb_mean(train_rating, test_pair):
    pred = train_rating.mean() * np.ones(len(test_pair))
    return pred

In [64]:
def user_mean(train_pair, train_rating, test_pair):
    n_user = max(train_pair[:,0].max(), test_pair[:,0].max())+1
    pred = np.zeros(len(test_pair))
    glb_mean_value = train_rating.mean()
    for u in range(n_user):
        # find the index for both train and test for user_id = u
        ind_test = np.where(test_pair[:,0] == u)[0]
        ind_train = np.where(train_pair[:,0] == u)[0]
        if len(ind_test) == 0:
            continue
        if len(ind_train) < 3:
            pred[ind_test] = glb_mean_value
        else:
            # predict as user average
            pred[ind_test] = train_rating[ind_train].mean()
    return pred

In [65]:
def item_mean(train_pair, train_rating, test_pair):
    n_item = max(train_pair[:,1].max(), test_pair[:,1].max())+1
    pred = np.zeros(len(test_pair))
    glb_mean_value = train_rating.mean()
    for i in range(n_item):
        # find the index for both train and test for item_id = i
        ind_test = np.where(test_pair[:,1] == i)[0]
        ind_train = np.where(train_pair[:,1] == i)[0]
        if len(ind_test) == 0:
            continue
        if len(ind_train) < 3:
            pred[ind_test] = glb_mean_value
        else:
            # predict as user average
            pred[ind_test] = train_rating[ind_train].mean()
    return pred

In [66]:
pred_rating = user_mean(train_pair, train_rating, test_pair)
rmse_user_ave = rmse(pred_rating, test_rating)
print('rmse for user_ave: %.4f' %rmse_user_ave)

rmse for user_ave: 1.1870


## Sequential models; user-item mean

- We can predict the rating by the `user_mean`, then fit the residual by `item_mean`

$$\hat{r}_{ui} = \mu_u + \mu_i$$

where
		$$\mu_u = \frac{1}{|\mathcal{I}_u|} \sum_{i \in \mathcal{I}_u} r_{ui}, \quad \mu_i = \frac{1}{|\mathcal{U}_i|} \sum_{u \in \mathcal{U}_i} (r_{ui} - \mu_u)$$

In [70]:
## compute user-mean + item_mean on res

# train_pair : 학습 데이터 / train_rating : validation data / test_pair : 예측하고자하는 데이터 
pred_rating = user_mean(train_pair, train_rating, test_pair)

# 유저 데이터 train_pair로 fit
pred_rating_train = user_mean(train_pair, train_rating, train_pair)

# rating에 대한 residual 구하기 
res_rating= train_rating - pred_rating_train

# rating residual을 가지고 item_mean 구해서 더하기 |
pred_rating = pred_rating + item_mean(train_pair, res_rating, test_pair)


In [73]:
pred_rating

array([4.25566734, 2.89384634, 3.32151874, ..., 4.26759153, 1.23110284,
       3.23510628])

In [None]:
rmse(pred_rating, test_rating)

1.2438011589430351

In [None]:
## 반대로 item-mean을 먼저 구해 잔차를 구한후 user_mean을 도출하는 방식 

## compute item-mean + user_mean on res
pred_rating = item_mean(train_pair, train_rating, test_pair)
rmse(pred_rating, test_rating)

pred_rating_train = item_mean(train_pair, train_rating, train_pair)
res_rating = train_rating - pred_rating_train

pred_rating = pred_rating + user_mean(train_pair, res_rating, test_pair)
rmse(pred_rating, test_rating)

1.2411032329726488

## To-do list

- **STAT**
  - [ ] Background of RS  
  - [ ] The data types in RS
  - [ ] Evaluation metrics
  - [ ] Statistical models for baseline methods

- **Code**

  - [ ] Load data to Python `pd.read_csv`
  - [ ] implementation of baseline methods
  - [ ] define Python functions