<a href="https://colab.research.google.com/github/CP2J/cp2j/blob/ACJ-9-MF-SGD-/Recsys_MF_SGD(normal%20init).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from google.colab import files
from collections import Counter
from sklearn.model_selection import train_test_split
from scipy import sparse

In [2]:
from google.colab import drive
drive.mount('/content/drive')
rating = pd.read_csv('/content/drive/MyDrive/ml-100k/u.data', sep='\t', header=None, names=['user_id', 'item_id', 'rating', 'timestamp'])

Mounted at /content/drive


 # MF-SGD (Matrix Factorization - Stochastic Gradient Descent)

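 This uses the same matrix factorization as SVD, but learns the factors with stochastic gradient descent so that the error decreases.  
SVD factorization cannot handle null values, so missing entries were filled with the mean, the mode, and so on; here the factors start from random values and are trained to reduce the error.  
As a result, the sparser the data, the better this tends to perform compared with the earlier models that relied on imputed values.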

In [3]:
# files.upload();
# rating = pd.read_csv('ratings.csv')

In [4]:
rating

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156


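Training needs an embedding layer; whether df.pivot can cover this still has to be tested.


Run train_test_split

The train data ids are then converted to contiguous values (some ids will be missing after the split; see the notes in the svd function)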


In [5]:
rating = rating.drop(columns = ['timestamp'])

In [6]:
from sklearn.model_selection import train_test_split
rating_train, rating_test = train_test_split(rating, test_size = 0.2)

rating_train = rating_train.reset_index()[['user_id', 'item_id', 'rating']]
rating_test = rating_test.reset_index()[['user_id', 'item_id', 'rating']]

In [7]:
def encode_column(column):
  # Encode a column as contiguous ids
  # The unique values of the column are the keys
  keys = column.unique()
  # enumerate yields (index, value) pairs
  key_to_id = {key:idx for idx, key in enumerate(keys)}
  return key_to_id, np.array([key_to_id[x] for x in column]), len(keys)

In [8]:
def encode_df(rating):
  # Re-index the rating data with contiguous user and item ids
  # Replace the df's id columns with the results of encode_column and return it
  item_ids, rating['item_id'], num_item = encode_column(rating['item_id'])
  user_ids, rating['user_id'], num_user = encode_column(rating['user_id'])
  return rating, num_user, num_item, user_ids, item_ids

In [9]:
rating_df, num_user, num_item, user_ids, item_ids = encode_df(rating_train)
print("Number of Users : ", num_user)
print("Number of Items : ", num_item)
rating_df.head()

Number of Users :  943
Number of Items :  1647


Unnamed: 0,user_id,item_id,rating
0,0,0,4
1,1,1,5
2,2,2,4
3,3,3,5
4,4,4,1


User and Item embeddings

In [10]:
def create_embeddings(n, K):
  # Create a random NumPy matrix of shape (n, K)
  # n = number of items/users
  # K = number of latent factors in each embedding
  return 5* np.random.random((n, K)) / K

In [11]:
def create_sparse_matrix(df, rows, columns, column_name = 'rating'):
  # Build the sparse utility matrix with scipy
  return sparse.csc_matrix((df[column_name].values, (df['user_id'].values, df['item_id'].values)),shape = (rows, columns))
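
A small check of what this construction produces, assuming a made-up three-row frame (`toy`, not part of the notebook): cells without a rating are not stored at all and read back as 0 once densified, which is why the difference between `Y` and the prediction matrix is nonzero only at rated positions.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy ratings: 2 users, 3 items, 3 observed ratings.
toy = pd.DataFrame({'user_id': [0, 0, 1], 'item_id': [0, 2, 1], 'rating': [4, 3, 5]})

# Same (values, (rows, cols)) construction as create_sparse_matrix.
Y_toy = sparse.csc_matrix(
    (toy['rating'].values, (toy['user_id'].values, toy['item_id'].values)),
    shape=(2, 3),
)

print(Y_toy.nnz)        # number of stored entries
print(Y_toy.toarray())  # unobserved positions show up as 0
```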

In [12]:
rating_df, num_user, num_item, user_ids, item_ids = encode_df(rating_train)
Y = create_sparse_matrix(rating_df, num_user, num_item)

In [13]:
Y.todense()

matrix([[4, 0, 3, ..., 0, 0, 0],
        [0, 5, 0, ..., 0, 0, 0],
        [0, 0, 4, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

Prediction function

In [14]:
def predict(df, emb_user, emb_item):
  # Return predictions without computing the full matrix product (U * V^T)
  # u_i . v_j is obtained as the sum of the elementwise product of the matching embedding rows
  # so the full U * V^T matrix never has to be built
  df['prediction'] = np.sum(np.multiply(emb_item[df['item_id']], emb_user[df['user_id']]), axis = 1)
  return df
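
The equivalence that `predict` relies on can be checked on small random embeddings. This is only a sketch; the names `emb_user_toy`, `emb_item_toy`, `users` and `items` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_user_toy = rng.random((4, 3))   # 4 users, K = 3
emb_item_toy = rng.random((5, 3))   # 5 items, K = 3

# Hypothetical (user, item) pairs we want predictions for.
users = np.array([0, 2, 3])
items = np.array([1, 4, 0])

# Row-wise dot products, as in predict(): no full matrix is built.
pairwise = np.sum(emb_user_toy[users] * emb_item_toy[items], axis=1)

# Reference: build the full U @ V^T and read off the same cells.
full = emb_user_toy @ emb_item_toy.T
reference = full[users, items]

print(np.allclose(pairwise, reference))
```

The pairwise form only touches one row per requested pair, so its cost grows with the number of ratings rather than with users times items.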

Cost function

In [15]:
lmbda = 0.0002

In [16]:
def cost(df, emb_user, emb_item) :
  Y = create_sparse_matrix(df, emb_user.shape[0], emb_item.shape[0])
  predicted = create_sparse_matrix(predict(df, emb_user, emb_item), emb_user.shape[0], emb_item.shape[0], 'prediction')
  return np.sqrt(np.sum((Y-predicted).power(2))/df.shape[0])

Gradient Descent

In [17]:
def gradient(df, emb_user, emb_item):
  # Compute the gradients to apply to the embeddings
  Y = create_sparse_matrix(df, emb_user.shape[0], emb_item.shape[0])
  predicted = create_sparse_matrix(predict(df, emb_user, emb_item), emb_user.shape[0], emb_item.shape[0], 'prediction')
  delta = (Y-predicted)
  grad_user = (-2/df.shape[0])*(delta*emb_item) + 2*lmbda*emb_user
  grad_item = (-2/df.shape[0])*(delta.T*emb_user) + 2*lmbda*emb_item
  return grad_user, grad_item
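
The expressions in `gradient` match the gradient of a mean squared error over the rated cells plus an L2 penalty `lmbda * (||U||^2 + ||V||^2)`. The notebook never writes that loss down, so treating it as the objective is an assumption. A finite-difference check on dense toy data (all sizes and names below are hypothetical) is one way to confirm the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K, lam = 4, 5, 2, 0.01   # toy sizes, hypothetical

U = rng.random((n_users, K))
V = rng.random((n_items, K))
Y = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
mask = rng.random((n_users, n_items)) < 0.5   # True where a rating is observed
mask[0, 0] = True                             # guarantee at least one rating
N = mask.sum()

def loss(U, V):
    # Assumed objective: MSE on observed cells plus L2 penalty on both embeddings.
    resid = (Y - U @ V.T) * mask
    return (resid ** 2).sum() / N + lam * ((U ** 2).sum() + (V ** 2).sum())

def analytic_grads(U, V):
    # Same form as gradient(): delta is zero at unobserved cells.
    delta = (Y - U @ V.T) * mask
    grad_U = (-2 / N) * (delta @ V) + 2 * lam * U
    grad_V = (-2 / N) * (delta.T @ U) + 2 * lam * V
    return grad_U, grad_V

def numeric_grad(f, X, eps=1e-6):
    # Central finite differences, one entry at a time.
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

grad_U, grad_V = analytic_grads(U, V)
num_U = numeric_grad(lambda X: loss(X, V), U)
num_V = numeric_grad(lambda X: loss(U, X), V)
print(np.abs(grad_U - num_U).max(), np.abs(grad_V - num_V).max())
```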

In [18]:
def gradient_descent(df, emb_user, emb_item, iterations = 2000, learning_rate=0.01, df_val = None):
  # Full-batch gradient descent with momentum:
  # v_* holds an exponential moving average of the gradients (beta = 0.9)
  beta = 0.9
  grad_user, grad_item = gradient(df, emb_user, emb_item)
  v_user = grad_user
  v_item = grad_item
  for i in range(iterations):
    grad_user, grad_item = gradient(df, emb_user, emb_item)
    v_user = beta*v_user + (1-beta)*grad_user
    v_item = beta*v_item + (1-beta)*grad_item
    emb_user = emb_user - learning_rate*v_user
    emb_item = emb_item - learning_rate*v_item
    if (i+1) % 50 == 0:
      print('\niteration', i+1, ":")
      print("train rmse : ", cost(df, emb_user, emb_item))
      if df_val is not None:
        print('validation rmse : ', cost(df_val, emb_user, emb_item))
  return emb_user, emb_item
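
Each iteration of the loop above computes the gradient over the whole training frame and steps along an exponential moving average of the gradients. The same update rule can be seen in isolation on a one-dimensional quadratic; the function and the step size here are made up for the illustration.

```python
beta, learning_rate = 0.9, 0.1   # hypothetical values for the toy problem

def grad(x):
    # Gradient of f(x) = (x - 3)^2, which is minimized at x = 3.
    return 2.0 * (x - 3.0)

x = 0.0
v = grad(x)                         # initialise the average with the first gradient
for _ in range(500):
    g = grad(x)
    v = beta * v + (1 - beta) * g   # exponential moving average of gradients
    x = x - learning_rate * v       # step along the smoothed gradient

print(x)
```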

In [19]:
emb_user = create_embeddings(num_user, 3)
emb_item = create_embeddings(num_item, 3)
emb_user, emb_item = gradient_descent(rating_df, emb_user, emb_item, iterations = 3000, learning_rate = 0.02)#, df_val = rating_test)


iteration 50 :
train rmse :  2.0981336457531103

iteration 100 :
train rmse :  2.0819945861241322

iteration 150 :
train rmse :  2.066065042043993

iteration 200 :
train rmse :  2.0503448910399444

iteration 250 :
train rmse :  2.034833927804751

iteration 300 :
train rmse :  2.019531890331309

iteration 350 :
train rmse :  2.0044384549967913

iteration 400 :
train rmse :  1.989553232149667

iteration 450 :
train rmse :  1.9748757623332982

iteration 500 :
train rmse :  1.9604055131168487

iteration 550 :
train rmse :  1.946141876503615

iteration 600 :
train rmse :  1.93208416688717

iteration 650 :
train rmse :  1.9182316195259226

iteration 700 :
train rmse :  1.9045833895069666

iteration 750 :
train rmse :  1.8911385511703602

iteration 800 :
train rmse :  1.877896097965259

iteration 850 :
train rmse :  1.8648549427096763

iteration 900 :
train rmse :  1.852013918225997

iteration 950 :
train rmse :  1.839371778324801

iteration 1000 :
train rmse :  1.8269271991100025

iteration

In [20]:
def encode_new_data(val_df, user_ids, item_ids):
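  # Keep only rows whose user and item both appear in the training mappings
  val_df_chosen = val_df['item_id'].isin(item_ids.keys())&val_df['user_id'].isin(user_ids.keys())
  # .copy() so the assignments below do not write into a slice of the original frame
  val_df = val_df[val_df_chosen].copy()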
  val_df['user_id'] = np.array([user_ids[x] for x in val_df['user_id']])
  val_df['item_id'] = np.array([item_ids[x] for x in val_df['item_id']])
  return val_df
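
A possible variant of the same step, sketched with `Series.map`: ids that are missing from the training mappings become NaN and are dropped. `encode_with_map` and the toy mappings are hypothetical names, not part of the notebook.

```python
import pandas as pd

def encode_with_map(val_df, user_ids, item_ids):
    # Hypothetical variant of encode_new_data: map raw ids to training indices;
    # ids missing from the training mappings become NaN and are dropped.
    out = val_df.copy()
    out['user_id'] = out['user_id'].map(user_ids)
    out['item_id'] = out['item_id'].map(item_ids)
    out = out.dropna(subset=['user_id', 'item_id'])
    return out.astype({'user_id': int, 'item_id': int})

# Toy mappings and validation rows (made up for illustration).
user_ids_toy = {196: 0, 186: 1}
item_ids_toy = {242: 0, 302: 1}
val_toy = pd.DataFrame({
    'user_id': [196, 186, 999],   # 999 was never seen in training
    'item_id': [302, 242, 242],
    'rating':  [3, 4, 5],
})

encoded = encode_with_map(val_toy, user_ids_toy, item_ids_toy)
print(encoded)
```

Because the function works on a copy, the caller's frame keeps its raw ids.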

In [21]:
print('before encoding :', rating_test.shape)
rating_test = encode_new_data(rating_test, user_ids, item_ids)
print('after encoding :', rating_test.shape)

before encoding : (20000, 3)
after encoding : (19952, 3)




In [22]:
train_rmse = cost(rating_df, emb_user, emb_item)
val_rmse = cost(rating_test, emb_user, emb_item)
print(train_rmse, val_rmse)

1.4691658252093198 1.8008156198021856




In [23]:
rating_test

Unnamed: 0,user_id,item_id,rating,prediction
0,407,1,4,1.502822
1,280,288,5,2.815957
2,707,93,5,2.077619
3,218,265,3,2.906563
4,939,476,5,3.008769
...,...,...,...,...
19995,72,866,4,3.404354
19996,303,568,4,3.092393
19997,92,38,3,4.252924
19998,393,627,4,2.651378


In [24]:
emb_user = create_embeddings(num_user, 3)
emb_item = create_embeddings(num_item, 3)
emb_user, emb_item = gradient_descent(rating_df, emb_user, emb_item, iterations = 5000, learning_rate = 0.2, df_val = rating_test)


iteration 50 :
train rmse :  1.9579469675651966
validation rmse :  2.0256940675786863

iteration 100 :
train rmse :  1.8184862079060893
validation rmse :  1.9525507590943154

iteration 150 :
train rmse :  1.6998226915487011
validation rmse :  1.8893283039017243

iteration 200 :
train rmse :  1.6002310875164185
validation rmse :  1.8352171648641116

iteration 250 :
train rmse :  1.517389797844767
validation rmse :  1.789140515872427

iteration 300 :
train rmse :  1.4487499438368199
validation rmse :  1.7499423285666882

iteration 350 :
train rmse :  1.391843989982188
validation rmse :  1.716519238092268

iteration 400 :
train rmse :  1.3444740917276758
validation rmse :  1.687891829937443

iteration 450 :
train rmse :  1.304788034766048
validation rmse :  1.6632294927169482

iteration 500 :
train rmse :  1.2712793489918865
validation rmse :  1.6418471088832929

iteration 550 :
train rmse :  1.2427483573623495
validation rmse :  1.623188253494727

iteration 600 :
train rmse :  1.2182498

 # 0410 addition - following a different reference

 https://big-dream-world.tistory.com/69

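1. Create the factor matrices P and Q with random values
2. Multiply P by the transpose of Q to get the prediction matrix, then compute its difference from the actual R matrix (only at the ratings that actually exist in R) 
3. Update P and Q in the direction that reduces the difference
4. Repeat until the product approximates R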
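
The four steps can be sketched on a tiny matrix. This version updates all rows at once with a mask instead of visiting one rating at a time, and the matrix `R` and the hyperparameters are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix; np.nan marks unrated cells (values are made up).
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 5.0]])
observed = ~np.isnan(R)
R_filled = np.where(observed, R, 0.0)   # placeholder zeros, masked out below

K, lr, lam = 2, 0.02, 0.01              # hypothetical hyperparameters

# Step 1: random P and Q.
P = rng.normal(scale=0.1, size=(R.shape[0], K))
Q = rng.normal(scale=0.1, size=(R.shape[1], K))

def observed_rmse(P, Q):
    # Step 2: prediction P @ Q.T compared with R on rated cells only.
    err = (R_filled - P @ Q.T)[observed]
    return np.sqrt(np.mean(err ** 2))

rmse_before = observed_rmse(P, Q)
for _ in range(1000):                     # Step 4: repeat
    E = (R_filled - P @ Q.T) * observed   # errors, zero where unrated
    # Step 3: move P and Q in the direction that reduces the error
    # (tuple assignment so both updates use the old P and Q).
    P, Q = P + lr * (E @ Q - lam * P), Q + lr * (E.T @ P - lam * Q)
rmse_after = observed_rmse(P, Q)

print(rmse_before, rmse_after)
```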


**Differences from the code above**  
1. No train_test_split: the goal is to predict ratings for unseen movies from the training values
2. Simplified code
3. To be revised later

In [25]:
rating.head()

Unnamed: 0,user_id,item_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [26]:
rating_df = rating.pivot(index = 'user_id', columns = 'item_id', values = 'rating')
real_mat = rating_df.to_numpy()
num_user, num_item = real_mat.shape

In [27]:
real_mat

array([[ 5.,  3.,  4., ..., nan, nan, nan],
       [ 4., nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [ 5., nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan,  5., nan, ..., nan, nan, nan]])

In [28]:
print(num_user, num_item)

943 1682


In [47]:
# Create P and Q
K = 5
P = np.random.normal(size=(num_user, K))
Q = np.random.normal(size=(num_item, K))
print(P.shape, Q.shape)

(943, 5) (1682, 5)


In [48]:
P

array([[ 0.7245597 , -0.17150729,  1.37897079, -0.57885157,  0.71274853],
       [ 0.3429683 , -0.52461838,  0.84040773,  0.45433431,  0.85646375],
       [-0.17097968,  1.35593958,  0.22210508,  0.39740103,  1.66070481],
       ...,
       [ 0.18743347,  0.47139466, -1.22702874,  1.15932438,  1.34757507],
       [ 1.95998917, -0.12734599, -0.46967133, -1.28144442,  0.27286509],
       [-1.51475265,  1.76561731,  1.5217673 ,  0.47284183,  1.20171089]])

In [49]:
P.max()

3.562224926128023

In [50]:
Q

array([[-0.41329918,  0.69361674, -0.48171511, -0.39338978, -0.55187523],
       [-1.51055141, -1.36575973,  0.52120164, -1.7197342 , -0.08637016],
       [ 2.26347044, -0.44428243, -1.52256167, -0.43193459, -0.80922582],
       ...,
       [-0.71402496,  0.1578051 , -1.36845653, -0.53815587,  0.74624167],
       [-0.03222637,  0.13144685,  0.79017909,  0.00411883, -0.25427219],
       [ 1.41135234,  0.64541051,  0.36789059, -1.94162847,  0.7914448 ]])

In [51]:
Q.max()

3.516657815914739

In [52]:
non_zeros = [(i, j, real_mat[i, j]) for i in range(num_user) for j in range(num_item) if real_mat[i, j] > 0]
# Store (user position, item position, rating) tuples for every rated cell in a list
non_zeros

[(0, 0, 5.0),
 (0, 1, 3.0),
 (0, 2, 4.0),
 (0, 3, 3.0),
 (0, 4, 3.0),
 (0, 5, 5.0),
 (0, 6, 4.0),
 (0, 7, 1.0),
 (0, 8, 5.0),
 (0, 9, 3.0),
 (0, 10, 2.0),
 (0, 11, 5.0),
 (0, 12, 5.0),
 (0, 13, 5.0),
 (0, 14, 5.0),
 (0, 15, 5.0),
 (0, 16, 3.0),
 (0, 17, 4.0),
 (0, 18, 5.0),
 (0, 19, 4.0),
 (0, 20, 1.0),
 (0, 21, 4.0),
 (0, 22, 4.0),
 (0, 23, 3.0),
 (0, 24, 4.0),
 (0, 25, 3.0),
 (0, 26, 2.0),
 (0, 27, 4.0),
 (0, 28, 1.0),
 (0, 29, 3.0),
 (0, 30, 3.0),
 (0, 31, 5.0),
 (0, 32, 4.0),
 (0, 33, 2.0),
 (0, 34, 1.0),
 (0, 35, 2.0),
 (0, 36, 2.0),
 (0, 37, 3.0),
 (0, 38, 4.0),
 (0, 39, 3.0),
 (0, 40, 2.0),
 (0, 41, 5.0),
 (0, 42, 4.0),
 (0, 43, 5.0),
 (0, 44, 5.0),
 (0, 45, 4.0),
 (0, 46, 4.0),
 (0, 47, 5.0),
 (0, 48, 3.0),
 (0, 49, 5.0),
 (0, 50, 4.0),
 (0, 51, 4.0),
 (0, 52, 3.0),
 (0, 53, 3.0),
 (0, 54, 5.0),
 (0, 55, 4.0),
 (0, 56, 5.0),
 (0, 57, 4.0),
 (0, 58, 5.0),
 (0, 59, 5.0),
 (0, 60, 4.0),
 (0, 61, 3.0),
 (0, 62, 2.0),
 (0, 63, 5.0),
 (0, 64, 4.0),
 (0, 65, 4.0),
 (0, 66, 3.0),
 (0, 
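
The list comprehension in this cell visits every cell of `real_mat` in Python. A vectorized sketch of the same extraction is shown below on a made-up stand-in (`real_mat_toy`); it selects non-NaN cells, which is equivalent to the `> 0` test as long as all ratings are positive.

```python
import numpy as np

# Toy stand-in for real_mat: np.nan marks missing ratings (values made up).
real_mat_toy = np.array([[5.0, np.nan, 4.0],
                         [np.nan, 3.0, np.nan]])

# Row/column indices of every non-NaN cell, in row-major order.
rows, cols = np.nonzero(~np.isnan(real_mat_toy))
non_zeros_toy = [(int(i), int(j), float(real_mat_toy[i, j])) for i, j in zip(rows, cols)]

print(non_zeros_toy)   # [(0, 0, 5.0), (0, 2, 4.0), (1, 1, 3.0)]
```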

In [53]:
from sklearn.metrics import mean_squared_error

def get_rmse(real_mat, P, Q, non_zeros):
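  # real_mat = actual user-item rating matrix
  # P, Q = latent factor matrices for users and items; their product gives the prediction matrix
  # non_zeros = entries of real_mat that are not null (this could probably be derived from real_mat inside the function, but storing it once is more efficient for repeated calls)
  pred_mat = np.dot(P, Q.T)
  # Take the index positions of the non-null values of the actual R matrix and compute the RMSE between the actual and predicted matrices there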
  user_non_zero_index = [non_zero[0] for non_zero in non_zeros]
  item_non_zero_index = [non_zero[1] for non_zero in non_zeros]
  real_mat_non_zero = real_mat[user_non_zero_index, item_non_zero_index]
  pred_mat_non_zero = pred_mat[user_non_zero_index, item_non_zero_index]

  mse = mean_squared_error(real_mat_non_zero, pred_mat_non_zero)
  rmse = np.sqrt(mse)

  return rmse

In [54]:
get_rmse(real_mat, P,Q, non_zeros)

4.322673748594236

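**Update rule for the vectors (Pu, Qi) of the latent matrices (P, Q), with the cost function omitted**  
  
* Rui = value of the actual matrix at (u, i)  
* R^ui = value of the predicted matrix at (u, i)  
* Eui = difference between the actual and predicted values at (u, i) 

  * $e_{ui}=r_{ui}-p_uq_i^T$
* Gradient
  * $\frac {\partial L}{\partial p_u} = \frac {\partial(r_{ui}-p_uq_i^T)^2}{\partial p_u} + \frac {\partial\lambda||p_u||^2_2}{\partial p_u} = -2(r_{ui}-p_uq_i^T)q_i+2\lambda p_u = -2(e_{ui}q_i-\lambda p_u)$
    * Taking the partial derivative of the loss with respect to $p_u$ removes every term that does not contain $p_u$ (such as the regularizer on $q_i$), leaving $-2(e_{ui}q_i-\lambda p_u)$.
  * Update $p_u, q_i$ against the gradient (the constant factor 2 is absorbed into $\eta$)
    * $p_u \leftarrow p_u +\eta \cdot(e_{ui}q_i - \lambda p_u)$
    * $q_i \leftarrow q_i +\eta \cdot(e_{ui}p_u - \lambda q_i)$
  
* Pu(new) = Pu + learning rate * (Eui * Qi - lambda (L2 regularization coefficient) * Pu)  
* Qi(new) = Qi + learning rate * (Eui * Pu - lambda (L2 regularization coefficient) * Qi)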
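
A single application of the update rule on one made-up rating shows the error for that cell shrinking. The vectors and the rating are invented for the illustration; the learning rate and lambda use the same values as cell In [55].

```python
import numpy as np

eta, lam = 0.01, 0.01              # learning rate and L2 coefficient
r_ui = 4.0                         # made-up observed rating
p_u = np.array([0.5, -0.2, 0.1])   # made-up latent vectors, K = 3
q_i = np.array([0.3, 0.8, -0.5])

e_before = r_ui - p_u @ q_i

# Both updates use the old p_u and q_i, as the formulas are written.
p_new = p_u + eta * (e_before * q_i - lam * p_u)
q_new = q_i + eta * (e_before * p_u - lam * q_i)

e_after = r_ui - p_new @ q_new
print(e_before, e_after)
```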


In [55]:
iteration = 200
learning_rate = 0.01
lmbda = 0.01

In [56]:
for iter in range(iteration):
  for i, j, Rij in non_zeros:
    # Error: difference between the actual and the predicted value
    Eij = Rij - np.dot(P[i, :], Q[j, :].T)
    # Vector update: SGD update rule with regularization
    # (note that the Q update below uses the already-updated P[i, :])
    P[i, :] = P[i, :] + learning_rate *(Eij* Q[j, :] - lmbda* P[i, :])
    Q[j, :] = Q[j, :] + learning_rate *(Eij* P[i, :] - lmbda* Q[j, :])
  rmse = get_rmse(real_mat, P, Q, non_zeros)
  if (iter+1) % 10 == 0:
    print('iteration num : ', iter+1, " rmse : ", rmse)

iteration num :  10  rmse :  0.9304000224406472
iteration num :  20  rmse :  0.8973367593290096
iteration num :  30  rmse :  0.8742098451235227
iteration num :  40  rmse :  0.8588092266280087
iteration num :  50  rmse :  0.8482139777007578
iteration num :  60  rmse :  0.8404999970164897
iteration num :  70  rmse :  0.8345785877012425
iteration num :  80  rmse :  0.8299031066508002
iteration num :  90  rmse :  0.8261782702293161
iteration num :  100  rmse :  0.8232022435305887
iteration num :  110  rmse :  0.8208128155091534
iteration num :  120  rmse :  0.8188766302747367
iteration num :  130  rmse :  0.8172878486769516
iteration num :  140  rmse :  0.8159663912323314
iteration num :  150  rmse :  0.8148539357478874
iteration num :  160  rmse :  0.813908480896653
iteration num :  170  rmse :  0.8130991924266986
iteration num :  180  rmse :  0.8124024989379751
iteration num :  190  rmse :  0.8117995736650115
iteration num :  200  rmse :  0.8112749180016177


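What follows is the recommender system.  
Instead of validating on a train/test split, it lowers the error as far as possible and then recommends the movies a user has not seen, in descending order of predicted rating.  
Wouldn't this run into overfitting, though?  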
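
The ranking step described here could look like the sketch below, assuming NaN marks unseen items as in `real_mat`. `recommend_unseen` and the toy arrays are hypothetical. The sketch does not address the overfitting question; holding out some ratings, as the first half of the notebook does, would still be needed to check that.

```python
import numpy as np

def recommend_unseen(real_mat, P, Q, user_idx, top_n=5):
    # Hypothetical helper: rank the items this user has not rated
    # by predicted rating, highest first.
    pred = P[user_idx] @ Q.T                     # predicted rating for every item
    unseen = np.isnan(real_mat[user_idx])        # True where the user has no rating
    candidates = np.flatnonzero(unseen)
    order = np.argsort(pred[candidates])[::-1]   # descending by prediction
    top = candidates[order][:top_n]
    return [(int(j), float(pred[j])) for j in top]

# Toy data (made up): 2 users, 4 items, K = 2.
real_mat_toy = np.array([[5.0, np.nan, np.nan, 1.0],
                         [np.nan, 4.0, np.nan, np.nan]])
P_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
Q_toy = np.array([[5.0, 1.0],
                  [2.0, 4.0],
                  [3.0, 0.5],
                  [1.0, 2.0]])

print(recommend_unseen(real_mat_toy, P_toy, Q_toy, user_idx=0, top_n=2))
# [(2, 3.0), (1, 2.0)]
```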
