<a href="https://colab.research.google.com/github/HwangHanJae/recommender_system/blob/main/inflearn_recsys/Matrix_Factorization_CF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How Matrix Factorization (MF) Works

$$R \approx P \times Q^{T} = \hat{R}$$

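MF decomposes the matrix $R$ into the matrices $P$ and $Q^{T}$, where $M$ is the number of users, $N$ the number of items, and $K$ the number of latent factors.
- $R$ : Rating Matrix of size ($M \times N$)
- $P$ : User Latent Matrix of size ($M \times K$)
- $Q$ : Item Latent Matrix of size ($N \times K$)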
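
As a quick illustration of the shapes involved, the following minimal sketch builds random $P$ and $Q$ and multiplies them. The sizes and the random values are made up for the example and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): 4 users, 5 items, 2 latent factors
M, N, K = 4, 5, 2

P = rng.normal(size=(M, K))   # user latent matrix, M x K
Q = rng.normal(size=(N, K))   # item latent matrix, N x K

R_hat = P @ Q.T               # predicted rating matrix, M x N
print(R_hat.shape)

# A single predicted rating is the dot product of one user row and one item row
r_hat_12 = P[1] @ Q[2]
print(np.isclose(R_hat[1, 2], r_hat_12))
```

Each entry of $\hat{R}$ depends only on one row of $P$ and one row of $Q$, which is what lets SGD update the factors one observed rating at a time.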


## MF Algorithm Using SGD (Stochastic Gradient Descent)

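1. Choose the number of latent factors $K$
2. Initialize the $P$ and $Q$ matrices
3. Compute the predicted ratings $\hat{r}$
4. Compute the error between the actual $R$ and $\hat{R}$, then update $P$ and $Q$
5. Check whether the target error has been reached

Steps 3 to 5 are repeated. The classes in this notebook stop after a fixed number of iterations rather than at a target error.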
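
The five steps can be sketched on a tiny matrix as follows. This is a simplified illustration without bias terms, not the notebook's class: the toy ratings, the hyperparameter values, and the tolerance `tol` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix (made-up data); 0 marks a missing rating
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5],
              [0, 1, 4]], dtype=float)

# Step 1: choose K (all values here are illustrative assumptions)
K, alpha, beta, epochs, tol = 2, 0.01, 0.02, 500, 0.05

# Step 2: initialize P and Q with small random values
P = rng.normal(scale=1.0 / K, size=(R.shape[0], K))
Q = rng.normal(scale=1.0 / K, size=(R.shape[1], K))

rows, cols = R.nonzero()      # positions of the observed ratings

def rmse(R, P, Q):
    """RMSE over the observed (non-zero) entries only."""
    r, c = R.nonzero()
    pred = np.sum(P[r] * Q[c], axis=1)
    return np.sqrt(np.mean((R[r, c] - pred) ** 2))

rmse_before = rmse(R, P, Q)
for epoch in range(epochs):
    for n in rng.permutation(len(rows)):      # visit ratings in random order
        i, j = rows[n], cols[n]
        e = R[i, j] - P[i] @ Q[j]             # Step 3 and 4: predict, compute the error
        p_old = P[i].copy()                   # keep the pre-update user vector
        P[i] += alpha * (e * Q[j] - beta * P[i])
        Q[j] += alpha * (e * p_old - beta * Q[j])
    if rmse(R, P, Q) < tol:                   # Step 5: stop once the target error is reached
        break
rmse_after = rmse(R, P, Q)
print(f"RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

Unlike the classes in this notebook, the sketch updates `Q[j]` with the user vector from before the update (`p_old`); both variants are used in practice.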



# Reading the Data

Read the MovieLens user information (u.user)

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Set the base path
base = '/content/drive/MyDrive/RecoSys/Data'

# Set the path to the u.user file
u_user_path = os.path.join(base, 'u.user')

# Define the column names
u_cols = ['user_id','age','sex','occupation','zip_code']

# Read the data
users = pd.read_csv(u_user_path, sep='|', names = u_cols, encoding='latin-1')
# Use user_id as the index of the users DataFrame
users = users.set_index('user_id')

# Show the first 5 rows
users.head()

user_id,age,sex,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


Read the MovieLens movie information (u.item)

In [None]:
# Set the path to the u.item file
u_item_path = os.path.join(base, 'u.item')

# Define the column names
i_cols = ['movie_id','title','release date','video release date','IMDB URL','unknown','Action',
          'Adventure','Animation','Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama',
          'Fantasy','Film-Noir','Horror','Musical', 'Mystery','Romance','Sci-Fi','Thriller','War','Western']

# Read the data
movies = pd.read_csv(u_item_path, sep='|',names =i_cols, encoding='latin-1')
# Use movie_id as the index of the movies DataFrame
movies = movies.set_index('movie_id')

# Show the first 5 rows
movies.head()

movie_id,title,release date,video release date,IMDB URL,unknown,Action,Adventure,Animation,Children's,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


Read the MovieLens rating information (u.data)

In [None]:
# Set the path to the u.data file
u_data_path = os.path.join(base, 'u.data')

# Define the column names
r_cols = ['user_id', 'movie_id','rating','timestamp']

# Read the data
ratings = pd.read_csv(u_data_path, sep='\t',names = r_cols, encoding='latin-1')

# Show the first 5 rows
ratings.head()

,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [None]:

# Drop the timestamp column
ratings = ratings[['user_id','movie_id','rating']].astype(int)

In [None]:
ratings.head()

,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [None]:
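class MF():
  """Matrix factorization with user and item biases, trained by SGD."""
  def __init__(self, ratings, hyper_params):
    self.R = np.array(ratings)                      # user x item rating matrix; 0 means "not rated"
    self.num_users, self.num_items = np.shape(self.R)
    self.K = hyper_params['K']                      # number of latent factors
    self.alpha = hyper_params['alpha']              # learning rate
    self.beta = hyper_params['beta']                # regularization coefficient
    self.iterations = hyper_params['iterations']    # number of SGD passes over the data
    self.verbose = hyper_params['verbose']          # print progress every 10 iterations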

  def rmse(self):
    # RMSE over the observed (non-zero) ratings only
    xs, ys = self.R.nonzero()
    self.predictions = []
    self.errors = []

    for x, y in zip(xs, ys):
      prediction = self.get_prediction(x,y)
      self.predictions.append(prediction)
      self.errors.append(self.R[x, y] - prediction)
    self.predictions = np.array(self.predictions)
    self.errors = np.array(self.errors)
    return np.sqrt(np.mean(self.errors ** 2))

  def train(self):
    # Initialize the latent matrices with small random values
    self.P = np.random.normal(scale = 1. /self.K,  size=(self.num_users, self.K))
    self.Q = np.random.normal(scale = 1. / self.K, size=(self.num_items, self.K))

    # User bias, item bias, and the global mean of the observed ratings
    self.b_u = np.zeros(self.num_users)
    self.b_d = np.zeros(self.num_items)
    self.b =  np.mean(self.R[self.R.nonzero()])

    # One (user index, item index, rating) sample per observed rating
    rows, columns = self.R.nonzero()
    self.samples = [(i, j, self.R[i, j]) for i, j in zip(rows, columns)]

    training_process = []
    for i in range(self.iterations):
      np.random.shuffle(self.samples)
      self.sgd()
      rmse = self.rmse()
      training_process.append((i+1, rmse))
      if self.verbose and (i+1) % 10 == 0:
        print(f"Iteration : {i+1} ; train RMSE : {rmse}")
    return training_process

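  def get_prediction(self, i,j):
    # Predicted rating = global mean + user bias + item bias + latent dot product
    prediction = self.b + self.b_u[i] + self.b_d[j] + self.P[i, :].dot(self.Q[j, :].T)
    return prediction

  def sgd(self):
    # One pass over all observed ratings, updating after each sample
    for i, j , r in self.samples:
      prediction = self.get_prediction(i, j)
      e = (r - prediction)
      # Bias updates with L2 regularization
      self.b_u[i] += self.alpha * (e - (self.beta * self.b_u[i]))
      self.b_d[j] += self.alpha * (e - (self.beta * self.b_d[j]))

      # Latent factor updates; note that Q is updated with the already-updated row of P
      self.P[i,:] += self.alpha * ((e * self.Q[j, :]) - (self.beta * self.P[i, :]))
      self.Q[j,:] += self.alpha * ((e * self.P[i, :]) - (self.beta * self.Q[j, :]))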
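
For reference, the prediction and the per-sample updates performed by `get_prediction()` and `sgd()` can be written out as below. The notation is introduced here for illustration: $p_i$ and $q_j$ are row $i$ of $P$ and row $j$ of $Q$, $b$ is the global mean, $b^{u}_{i}$ and $b^{d}_{j}$ are the user and item biases (`b_u`, `b_d`), $\alpha$ is the learning rate, and $\beta$ is the regularization coefficient. The updates are applied in the order shown.

$$\hat{r}_{ij} = b + b^{u}_{i} + b^{d}_{j} + p_{i} \cdot q_{j}, \qquad e_{ij} = r_{ij} - \hat{r}_{ij}$$

$$b^{u}_{i} \leftarrow b^{u}_{i} + \alpha \left( e_{ij} - \beta \, b^{u}_{i} \right)$$

$$b^{d}_{j} \leftarrow b^{d}_{j} + \alpha \left( e_{ij} - \beta \, b^{d}_{j} \right)$$

$$p_{i} \leftarrow p_{i} + \alpha \left( e_{ij} \, q_{j} - \beta \, p_{i} \right)$$

$$q_{j} \leftarrow q_{j} + \alpha \left( e_{ij} \, p_{i} - \beta \, q_{j} \right)$$

These are gradient steps on the regularized squared error $\tfrac{1}{2} e_{ij}^{2} + \tfrac{\beta}{2} \left( \lVert p_{i} \rVert^{2} + \lVert q_{j} \rVert^{2} + (b^{u}_{i})^{2} + (b^{d}_{j})^{2} \right)$ for a single observed rating.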


In [None]:
R_temp = ratings.pivot(index='user_id',
                       columns = 'movie_id',
                       values = 'rating').fillna(0)
hyper_params = {
    'K' : 30,
    "alpha" : 0.001,
    "beta" : 0.02,
    "iterations" : 100,
    "verbose" : True
}

mf = MF(R_temp, hyper_params)

train_process = mf.train()

Iteration : 10 ; train RMSE : 0.9585295726568529
Iteration : 20 ; train RMSE : 0.9373941559328781
Iteration : 30 ; train RMSE : 0.9281288419995876
Iteration : 40 ; train RMSE : 0.9226206277144339
Iteration : 50 ; train RMSE : 0.9185648536455099
Iteration : 60 ; train RMSE : 0.9148158051685995
Iteration : 70 ; train RMSE : 0.9105088316428255
Iteration : 80 ; train RMSE : 0.9046984189171865
Iteration : 90 ; train RMSE : 0.8964882145026934
Iteration : 100 ; train RMSE : 0.8854790833093253
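
The `pivot(...).fillna(0)` call above turns the long list of ratings into a user × item matrix in which 0 stands for "not rated". A minimal sketch on made-up data (the toy ids and ratings are assumptions) shows the effect, and how `nonzero()` then recovers the observed positions:

```python
import numpy as np
import pandas as pd

# Made-up ratings in the same long format as the `ratings` DataFrame
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating":   [5, 3, 4, 1],
})

# Rows become users, columns become movies, missing pairs become 0
R_toy = toy.pivot(index="user_id", columns="movie_id", values="rating").fillna(0)
print(R_toy)

# nonzero() returns positional row and column indices, not the original ids
rows, cols = np.array(R_toy).nonzero()
print(list(zip(rows, cols)))
```

Because the positions returned by `nonzero()` are 0-based matrix indices, a separate mapping is needed whenever predictions have to be looked up by the original `user_id` and `movie_id`.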


## Train / Test Split

In [None]:
from sklearn.utils import shuffle
TRAIN_SIZE = 0.75

ratings = shuffle(ratings, random_state=2021)
cutoff = int(TRAIN_SIZE * len(ratings))
rating_train = ratings.iloc[:cutoff]
rating_test = ratings.iloc[cutoff:]
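
The cell above shuffles the ratings and cuts them at 75%. As an alternative sketch, scikit-learn's `train_test_split` produces a split of the same proportions in one call. The toy DataFrame is made up, and the resulting partition will not be identical to the shuffle-and-slice version even with the same `random_state`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up ratings in the same column layout as `ratings`
toy = pd.DataFrame({
    "user_id":  range(1, 9),
    "movie_id": range(101, 109),
    "rating":   [5, 3, 4, 1, 2, 5, 4, 3],
})

# 75% of the rows go to the training set, the rest to the test set
train, test = train_test_split(toy, train_size=0.75, random_state=2021)
print(len(train), len(test))
```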

In [None]:
class NEW_CF():
  def __init__(self, ratings, hyper_params):
    # ratings is a DataFrame whose row labels are user ids and column labels are item ids
    self.R = np.array(ratings)
    self.num_users, self.num_items = np.shape(self.R)
    self.K = hyper_params['K']
    self.alpha = hyper_params['alpha']
    self.beta = hyper_params['beta']
    self.iterations = hyper_params['iterations']
    self.verbose = hyper_params['verbose']

    # Map item ids (column labels) to column positions in self.R, and back
    item_id_index = []
    index_item_id = []
    for i, one_id in enumerate(ratings):
      item_id_index.append([one_id, i])
      index_item_id.append([i, one_id])

    self.item_id_index = dict(item_id_index)
    self.index_item_id = dict(index_item_id)

    # Map user ids (row labels) to row positions in self.R, and back
    user_id_index = []
    index_user_id = []
    for i, one_id in enumerate(ratings.T):
      user_id_index.append([one_id, i])
      index_user_id.append([i, one_id])

    self.user_id_index = dict(user_id_index)
    self.index_user_id = dict(index_user_id)

  def rmse(self):
    # Train RMSE over the ratings still present (non-zero) in self.R
    xs, ys = self.R.nonzero()
    self.predictions = []
    self.errors = []

    for x, y in zip(xs, ys):
      prediction = self.get_prediction(x,y)
      self.predictions.append(prediction)
      self.errors.append(self.R[x, y] - prediction)
    self.predictions = np.array(self.predictions)
    self.errors = np.array(self.errors)
    return np.sqrt(np.mean(self.errors ** 2))

  def sgd(self):
    for i, j , r in self.samples:
      prediction = self.get_prediction(i, j)
      e = (r - prediction)
      self.b_u[i] += self.alpha * (e - (self.beta * self.b_u[i]))
      self.b_d[j] += self.alpha * (e - (self.beta * self.b_d[j]))

      self.P[i,:] += self.alpha * ((e * self.Q[j, :]) - (self.beta * self.P[i, :]))
      self.Q[j,:] += self.alpha * ((e * self.P[i, :]) - (self.beta * self.Q[j, :]))

  def get_prediction(self, i,j):
    prediction = self.b + self.b_u[i] + self.b_d[j] + self.P[i, :].dot(self.Q[j, :].T)
    return prediction 

  def set_test(self, ratings_test):
    # Store the test ratings as (user index, item index, rating)
    # and zero them out in self.R so they are not used for training
    test_set = []
    for i in range(len(ratings_test)):
      x = self.user_id_index[ratings_test.iloc[i, 0]]
      y = self.item_id_index[ratings_test.iloc[i, 1]]
      z = ratings_test.iloc[i, 2]
      test_set.append([x,y,z])
      self.R[x, y] = 0

    self.test_set = test_set
    return test_set
  
  def test_rmse(self):
    error = 0
    for one_set in self.test_set:
      predicted = self.get_prediction(one_set[0], one_set[1])
      error += pow(one_set[2] - predicted, 2)
    return np.sqrt(error /len(self.test_set))

  def test(self):
    # Train with SGD and record both train and test RMSE after every iteration
    self.P = np.random.normal(scale=1./self.K, size=(self.num_users, self.K))
    self.Q = np.random.normal(scale = 1./self.K, size=(self.num_items, self.K))

    self.b_u = np.zeros(self.num_users)
    self.b_d = np.zeros(self.num_items)
    self.b = np.mean(self.R[self.R.nonzero()])

    # Only the training ratings remain non-zero after set_test()
    rows, columns = self.R.nonzero()
    self.samples = [(i, j, self.R[i, j]) for i, j in zip(rows, columns)]
    training_process = []

    for i in range(self.iterations):
      np.random.shuffle(self.samples)
      self.sgd()
      rmse1 = self.rmse()
      rmse2 = self.test_rmse()
      training_process.append((i+1, rmse1, rmse2))
      if self.verbose and (i+1) % 10 == 0:
        print(f"Iteration : {i+1} ; Train RMSE : {rmse1} ; Test RMSE  : {rmse2}")

    return training_process

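  def get_one_prediction(self, user_id, item_id):
    # Predict for one (user_id, item_id) pair given as original ids
    return self.get_prediction(self.user_id_index[user_id],
                               self.item_id_index[item_id])

  def full_prediction(self):
    # Predicted ratings for every user and item at once, via broadcasting
    return self.b + self.b_u[:,np.newaxis] + self.b_d[np.newaxis,:] + self.P.dot(self.Q.T)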
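
The id-to-index dictionaries built in `__init__` rely on how pandas iterates: iterating over a DataFrame yields its column labels, and iterating over its transpose yields its row labels. A minimal sketch with a made-up matrix (the ids 7, 9, 12 and 101, 205 are assumptions) shows the resulting mappings:

```python
import pandas as pd

# Made-up user x item matrix with non-contiguous ids
R_toy = pd.DataFrame([[5, 0], [0, 3], [4, 1]],
                     index=[7, 9, 12],        # user ids
                     columns=[101, 205])      # movie ids

# Iterating a DataFrame yields column labels: item id -> column position
item_id_index = {item_id: i for i, item_id in enumerate(R_toy)}

# Iterating the transpose yields the original row labels: user id -> row position
user_id_index = {user_id: i for i, user_id in enumerate(R_toy.T)}

print(item_id_index)
print(user_id_index)

# Look up a rating by original ids through the positional mapping
value = R_toy.to_numpy()[user_id_index[12], item_id_index[101]]
print(value)
```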

In [None]:
R_temp = ratings.pivot(index='user_id',
                       columns = 'movie_id',
                       values = 'rating').fillna(0)

hyper_params = {
    'K' : 30,
    "alpha" : 0.001,
    "beta" : 0.02,
    "iterations" : 100,
    "verbose" : True
}

mf = NEW_CF(R_temp, hyper_params)
test_set = mf.set_test(rating_test)
result = mf.test()

Iteration : 10 ; Train RMSE : 0.9683370224543192 ; Test RMSE  : 0.9783823974281639
Iteration : 20 ; Train RMSE : 0.9434892790848877 ; Test RMSE  : 0.9578234464818044
Iteration : 30 ; Train RMSE : 0.9322516117199622 ; Test RMSE  : 0.9495861747375062
Iteration : 40 ; Train RMSE : 0.9254943586901253 ; Test RMSE  : 0.9451661760583482
Iteration : 50 ; Train RMSE : 0.9206937600158613 ; Test RMSE  : 0.942475865359023
Iteration : 60 ; Train RMSE : 0.9167404486589302 ; Test RMSE  : 0.9406222425531152
Iteration : 70 ; Train RMSE : 0.9129484886418737 ; Test RMSE  : 0.9391938828602607
Iteration : 80 ; Train RMSE : 0.9087432651670111 ; Test RMSE  : 0.9378188776034411
Iteration : 90 ; Train RMSE : 0.9035476288679675 ; Test RMSE  : 0.9362400105748945
Iteration : 100 ; Train RMSE : 0.8967297034217662 ; Test RMSE  : 0.9342354892234626


In [None]:
mf.full_prediction()

array([[3.9507114 , 3.36524309, 3.27891312, ..., 3.43707408, 3.56005385,
        3.55484594],
       [4.00472493, 3.42747691, 3.29771542, ..., 3.49419414, 3.5920743 ,
        3.594565  ],
       [3.18892962, 2.58091174, 2.49201975, ..., 2.66703829, 2.80270115,
        2.78570396],
       ...,
       [4.14805998, 3.58118258, 3.44393917, ..., 3.6351286 , 3.75528467,
        3.75956866],
       [4.26855941, 3.71538816, 3.55476903, ..., 3.75238867, 3.85880536,
        3.84871709],
       [3.67949871, 3.14267939, 2.9928145 , ..., 3.16902619, 3.31193423,
        3.30197051]])
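
The expression in `full_prediction()` uses NumPy broadcasting to add the user biases down the rows and the item biases across the columns. The sketch below, with made-up parameter values, checks that one entry of the broadcast result equals the per-pair formula from `get_prediction()`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up model parameters for 3 users, 4 items, 2 latent factors
num_users, num_items, K = 3, 4, 2
b = 3.5                                   # global mean
b_u = rng.normal(size=num_users)          # user biases
b_d = rng.normal(size=num_items)          # item biases
P = rng.normal(size=(num_users, K))
Q = rng.normal(size=(num_items, K))

# Same expression as full_prediction(): (3, 1) + (1, 4) + (3, 4) broadcasts to (3, 4)
full = b + b_u[:, np.newaxis] + b_d[np.newaxis, :] + P.dot(Q.T)

# Same expression as get_prediction() for user index 1 and item index 2
single = b + b_u[1] + b_d[2] + P[1, :].dot(Q[2, :].T)

print(full.shape)
print(np.isclose(full[1, 2], single))
```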

## Finding the Optimal MF Parameters

In [None]:
R_temp = ratings.pivot(index='user_id',
                       columns = 'movie_id',
                       values = 'rating').fillna(0)
index = []
results = []
for K in range(50, 261, 10):
  hyper_params = {
      'K' : K,
      "alpha" : 0.001,
      "beta" : 0.02,
      "iterations" : 100,
      "verbose" : True
  }

  mf = NEW_CF(R_temp, hyper_params)
  test_set = mf.set_test(rating_test)
  result = mf.test()
  index.append(K)
  results.append(result)

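The run is repeated for each K from 50 to 260 in steps of 10, and the per-iteration test RMSE stored in `results` is compared to find the best combination of K and iterations.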
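
One way to read the best setting out of `index` and `results` is sketched below. The numbers are made up purely to show the data layout (one list of `(iteration, train RMSE, test RMSE)` tuples per K); they are not results from this notebook.

```python
# Made-up stand-ins for the `index` and `results` lists built in the loop above
index = [50, 60]
results = [
    [(1, 1.00, 1.05), (2, 0.95, 0.99), (3, 0.90, 0.97)],   # K = 50
    [(1, 0.99, 1.03), (2, 0.93, 0.96), (3, 0.88, 0.98)],   # K = 60
]

# For each K, keep the iteration with the lowest test RMSE
summary = []
for K, result in zip(index, results):
    best_iter, _, best_test = min(result, key=lambda t: t[2])
    summary.append((K, best_iter, best_test))

# Then pick the K whose best test RMSE is lowest overall
best = min(summary, key=lambda t: t[2])
print(best)   # (60, 2, 0.96)
```

Selecting on test RMSE rather than train RMSE matters here, since the train RMSE keeps falling with more iterations while the test RMSE can start to rise.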