## [0] Lab Overview

What this lab covers
- Latent Factor Model
  - MF (Matrix Factorization)
  - ALS (Alternating Least Squares)
- Supervised Learning
  - Naive Bayes
  - Gradient Boosting Decision Tree
    - XGBoost
    - LightGBM
    - CatBoost

The RecSys basics competition course uses the Book Crossing data for every lab, mission, and competition. The data comes from [Kaggle Book-Crossing](https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset) and has been restructured for the course. It is released under the CC0: Public Domain license.

In [2]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix, linalg
from xgboost import XGBRegressor, XGBClassifier
from lightgbm import LGBMRegressor, LGBMClassifier

# !pip install catboost
from catboost import CatBoostRegressor, CatBoostClassifier, Pool
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

import re
import warnings

warnings.filterwarnings(action='ignore')

# [1] Loading the Data
This lab uses the book data. The files were sampled for the Lecture 3 lab and can be downloaded with the code below.

In [None]:
!wget --load-cookies ~/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies ~/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1jdLFf4JyfWo1406LJ2f8no67YPf15X3f' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1jdLFf4JyfWo1406LJ2f8no67YPf15X3f" -O users.csv && rm -rf ~/cookies.txt
!wget --load-cookies ~/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies ~/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1-EZ2fFCA5RNoqlyM69NeN-L4Y6qTLnQN' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1-EZ2fFCA5RNoqlyM69NeN-L4Y6qTLnQN" -O books.csv && rm -rf ~/cookies.txt
!wget --load-cookies ~/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies ~/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1-I3YKtaJb5IPvOOFqkonQj5ikJQHfoUC' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1-I3YKtaJb5IPvOOFqkonQj5ikJQHfoUC" -O ratings.csv && rm -rf ~/cookies.txt

In [4]:
books = pd.read_csv('./books.csv')
users = pd.read_csv('./users.csv')
ratings = pd.read_csv('./ratings.csv')

In this dataset, isbn serves as the item ID and user_id serves as the user ID.

In [5]:
seed=42

The Latent Factor models use the ratings data [user, item, rating] as is.

For Supervised Learning, we build a dataset with the books and users information added.

- One-hot encode the country, state, and city columns of the users data.

In [6]:
users.head()

Unnamed: 0,user_id,age,location_country,location_state,location_city
0,197659,49,usa,pennsylvania,indiana
1,48911,57,usa,louisiana,neworleans
2,70666,18,usa,rhodeisland,warwick
3,75819,34,usa,newyork,westfalls
4,78973,29,portugal,lisboa,amadora


In [23]:
# get_dummies
users_df = pd.get_dummies(
    users,
    columns=['location_country', 'location_state', 'location_city']
)

users_df.shape # users.shape (247, 5) -> (247, 318)
# Why 318 columns?
# print(users['location_country'].nunique()) =  21
# print(users['location_state'].nunique()) = 83
# print(users['location_city'].nunique()) = 212
# Every distinct value becomes an indicator column and the 3 source columns are dropped:
# 5 - 3 + 21 + 83 + 212 = 318

# Inspect a single one-hot encoded row.
ser = users_df.iloc[0, :]
ser[ser != 0]
# user_id                        197659
# age                                49
# location_country_usa                1
# location_state_pennsylvania         1
# location_city_indiana               1

users_df.sample(5)

Unnamed: 0,user_id,age,location_country_argentina,location_country_australia,location_country_austria,location_country_canada,location_country_england,location_country_faraway,location_country_france,location_country_germany,...,location_city_wangen,location_city_warwick,location_city_webster,location_city_westchester,location_city_westfalls,location_city_westpalmbeach,location_city_whiteplains,location_city_wichita,location_city_yulee,location_city_zaragoza
107,28177,26,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
224,35921,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40,220688,27,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
198,126814,32,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
121,83671,52,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
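The column arithmetic in the comments above can be checked on a toy frame. The sketch below uses a made-up `toy_users` table (hypothetical data, not the course files) and confirms that the encoded width equals the untouched columns plus the number of distinct values in each encoded column.

```python
import pandas as pd

# Hypothetical toy data standing in for users.csv
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [20, 31, 45, 28],
    "location_country": ["usa", "usa", "portugal", "canada"],
    "location_city": ["indiana", "warwick", "amadora", "toronto"],
})
encoded_cols = ["location_country", "location_city"]

toy_users_df = pd.get_dummies(toy_users, columns=encoded_cols)

# Expected width: untouched columns + one indicator column per distinct value
expected_width = (toy_users.shape[1] - len(encoded_cols)) + sum(
    toy_users[col].nunique() for col in encoded_cols
)
print(toy_users_df.shape[1], expected_width)
```

Here two columns stay untouched and the encoded columns have 3 and 4 distinct values, so both numbers come out as 9.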


- One-hot encode the category, publisher, and language columns of the books data.

In [39]:
print(books.columns)

cat_ser = books['category']

# Categories that appear only once.
cat_ser.value_counts()[cat_ser.value_counts() <= 1][:10]

Index(['isbn', 'book_title', 'book_author', 'publisher', 'language',
       'category'],
      dtype='object')


['CD-ROMs']                                   1
['Motion picture producers and directors']    1
['Parapsychology']                            1
['Drawing, French']                           1
['Runes']                                     1
['Motion picture plays']                      1
['Creative thinking']                         1
['Horse-racing']                              1
['Watercolor painting']                       1
["Children's literature"]                     1
Name: category, dtype: int64

In [40]:
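# For brevity, keep only the first category

# expand=True splits the result into separate columns.
# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default.
books_cat = (pd.concat([
  books, 
  books['category'].str.replace(r'[^0-9a-zA-Z:,]+', '', regex=True).str.split(',', expand=True)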
  ], axis=1)
  .drop(['category',1,2,3], axis=1)
  .rename(columns={0:'category'}))

# books_cat.head(2)
books.head(2)

books_df = pd.get_dummies(
  books_cat, 
  columns=['category', 'publisher', 'language']
).drop(['book_title', 'book_author'], axis=1)

books_df.head()

Unnamed: 0,isbn,category_11030fictioninEnglish1900194560030texts,category_9,category_AGrowandLearnLibrary,category_AIDSDisease,category_ANIMAUXSAUVAGES,category_Abortion,category_Abusedchildren,category_Abusedwives,category_Abusedwomen,...,language_da,language_de,language_en,language_es,language_fr,language_it,language_la,language_nl,language_pt,language_ru
0,374157065,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,440234743,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,452264464,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,609804618,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,345402871,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


Join the processed book and user data onto ratings.

In [None]:
data = ratings.merge(books_df, on='isbn', how='inner').merge(users_df, on='user_id', how='inner')

# DataFrame that keeps the categorical variables as they are (no one-hot encoding), for use with CatBoost
data_cat = ratings.merge(books_cat.drop(['book_title', 'book_author'], axis=1), on='isbn', how='inner').merge(users, on='user_id', how='inner')

Split the data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.drop(['user_id', 'isbn', 'rating'], axis=1), data['rating'], 
                                                    test_size=0.2, shuffle=True, random_state=seed)

X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(data_cat.drop(['user_id', 'isbn', 'rating'], axis=1), data_cat['rating'], test_size=0.2, shuffle=True, random_state=seed)

In [None]:
R = np.array(ratings.pivot_table('rating', 'user_id', 'isbn').fillna(0))
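To make the shape of `R` concrete, here is a minimal sketch on made-up ratings (the `toy_ratings` frame is hypothetical). It shows how `pivot_table` lays users out as rows and items as columns, how to keep the row/column labels for mapping back, and how sparse the result is. It assumes, as the cell above does, that 0 marks a missing rating.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with the same [user_id, isbn, rating] layout as ratings.csv
toy_ratings = pd.DataFrame({
    "user_id": [10, 10, 20, 30],
    "isbn": ["A", "B", "B", "C"],
    "rating": [8, 5, 9, 7],
})

pivot = toy_ratings.pivot_table(values="rating", index="user_id", columns="isbn")
toy_R = pivot.fillna(0).to_numpy()

# Row i of toy_R belongs to pivot.index[i]; column j belongs to pivot.columns[j]
user_ids = pivot.index.to_numpy()
isbns = pivot.columns.to_numpy()

sparsity = 1.0 - np.count_nonzero(toy_R) / toy_R.size
print(toy_R)
print(f"sparsity = {sparsity:.2f}")
```

Keeping `pivot.index` and `pivot.columns` around is what lets a predicted entry of the completed matrix be traced back to a concrete `user_id` and `isbn`.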

# [2] Latent Factor Model

## SVD

### Dense matrix form

In [None]:
def truncated_svd(mat: np.ndarray, k=10) -> np.ndarray:
  u, s, vh = np.linalg.svd(mat)
  truncated_u = u[:,:k]
  truncated_s = s[:k]
  truncated_vh = vh[:k, :]

  return np.dot(truncated_u, np.dot(np.diag(truncated_s), truncated_vh))#.round().astype(int)

In [None]:
truncated_svd(R)

array([[-1.64283952e-02,  1.47110115e-02,  3.47940458e-03, ...,
        -3.46471355e-18,  4.79888935e-02,  5.88137313e-02],
       [ 3.85992965e-02,  9.46183043e-04,  7.45738720e-04, ...,
        -1.08925618e-20, -3.18798281e-03, -7.22424152e-04],
       [-1.36884399e-16,  6.32495156e-17, -4.53884000e-17, ...,
        -3.31133696e-32, -2.89624872e-17,  3.14756755e-16],
       ...,
       [-1.10138344e-02,  2.92482476e-02,  5.94874982e-03, ...,
        -1.63192331e-18, -5.81179138e-03,  2.05548740e-02],
       [-2.55535480e-01,  1.30884665e-01, -3.62617942e-04, ...,
         3.21568268e-18, -1.86423563e-01, -2.04495466e-02],
       [ 1.26702819e-03,  1.84125756e-03,  3.08195398e-04, ...,
        -2.48286211e-20,  3.98501467e-03,  4.99958441e-03]])

### CSR matrix form

In [None]:
def truncated_svd_sparse(mat: np.ndarray, k: int=100) -> np.ndarray:
  u, s, vh = linalg.svds(csr_matrix(mat), k)
  return u @ np.diag(s) @ vh

In [None]:
truncated_svd_sparse(R, k=10)

array([[-1.64283952e-02,  1.47110115e-02,  3.47940458e-03, ...,
        -1.35642396e-17,  4.79888935e-02,  5.88137313e-02],
       [ 3.85992965e-02,  9.46183043e-04,  7.45738720e-04, ...,
        -2.58132512e-18, -3.18798281e-03, -7.22424152e-04],
       [-5.84973667e-16,  1.51132828e-16,  3.28729279e-17, ...,
         5.33570631e-30, -1.66725400e-15, -6.00220534e-15],
       ...,
       [-1.10138344e-02,  2.92482476e-02,  5.94874982e-03, ...,
         1.69371755e-17, -5.81179138e-03,  2.05548740e-02],
       [-2.55535480e-01,  1.30884665e-01, -3.62617942e-04, ...,
         1.17285906e-16, -1.86423563e-01, -2.04495466e-02],
       [ 1.26702819e-03,  1.84125756e-03,  3.08195398e-04, ...,
        -2.31838376e-18,  3.98501467e-03,  4.99958441e-03]])
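The two outputs above agree on their large entries, which is expected: both routes compute the same rank-k approximation. The sketch below checks this on a small random matrix (hypothetical data, not `R`). `scipy.sparse.linalg.svds` computes only k singular triplets and requires k to be smaller than the smaller matrix dimension.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
toy = rng.random((8, 6))   # hypothetical dense stand-in for R
k = 3

# Dense route: full SVD, then keep the k largest singular triplets
u, s, vh = np.linalg.svd(toy, full_matrices=False)
dense_approx = u[:, :k] @ np.diag(s[:k]) @ vh[:k, :]

# Sparse route: svds computes only k triplets (requires k < min(toy.shape))
us, ss, vhs = svds(csr_matrix(toy), k=k)
sparse_approx = us @ np.diag(ss) @ vhs

print(np.allclose(dense_approx, sparse_approx, atol=1e-6))
```

The order in which `svds` returns the singular values does not matter here, because the product `u @ diag(s) @ vh` is the same for any consistent ordering of the triplets.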

## MF

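MF factorizes the rating table R into two latent factor matrices, P and Q, learns them from the observed ratings, and predicts user-item preference as the product of the two latent matrices.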
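Reading the update rule off the `gradient_descent` and `gradient` methods of the class defined in this section, one SGD step for an observed rating $r_{ij}$ can be written as follows. The symbols are notation chosen here, not taken from the course material: $\eta$ is `lr`, $\lambda$ is `regularization`, $b$ is the global mean of the observed ratings, and $b^{u}_i$, $b^{v}_j$ are the user and item biases.

$$
\begin{aligned}
\hat r_{ij} &= b + b^{u}_i + b^{v}_j + p_i^{\top} q_j, \qquad e_{ij} = r_{ij} - \hat r_{ij} \\
b^{u}_i &\leftarrow b^{u}_i + \eta\,(e_{ij} - \lambda\, b^{u}_i) \\
b^{v}_j &\leftarrow b^{v}_j + \eta\,(e_{ij} - \lambda\, b^{v}_j) \\
p_i &\leftarrow p_i + \eta\,(e_{ij}\, q_j - \lambda\, p_i) \\
q_j &\leftarrow q_j + \eta\,(e_{ij}\, p_i - \lambda\, q_j)
\end{aligned}
$$

Both latent updates use the values of $p_i$ and $q_j$ from before the step. These are gradient steps on the squared error $e_{ij}^2$ with an L2 penalty $\lambda(\lVert p_i\rVert^2 + \lVert q_j\rVert^2 + (b^{u}_i)^2 + (b^{v}_j)^2)$, with the constant factor 2 absorbed into $\eta$.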

---
<img src="https://drive.google.com/uc?export=view&id=1KB0pUxES-GK5PUmAIMlSjmv8cqfceZLQ" width="800">


### Dense matrix form

In [None]:
class MatrixFactorization:
    def __init__(self, R: np.ndarray, k: int, lr: float, regularization: float, epochs: int, verbose: bool =False) -> None:
        """
        :param R: rating matrix
        :param k: latent parameter
        :param lr: learning rate
        :param regularization: regularization term for update
        :param epochs: training epochs
        :param verbose: print status
        """

        self._R = R
        self._n_users, self._n_items = R.shape
        self._k = k
        self._lr = lr
        self._regularization = regularization
        self._epochs = epochs
        self._verbose = verbose


    def fit(self) -> None:

        # latent features
        self._P = np.random.normal(size=(self._n_users, self._k))
        self._Q = np.random.normal(size=(self._n_items, self._k))

        # biases
        self._bu = np.zeros(self._n_users)
        self._bi = np.zeros(self._n_items)
        self._b = np.mean(self._R[np.where(self._R != 0)])

        # train while epochs
        self._training_process = []
        for epoch in range(self._epochs):

            # train only on indices whose rating is nonzero
            for i in range(self._n_users):
                for j in range(self._n_items):
                    if self._R[i, j] > 0:
                        self.gradient_descent(i, j, self._R[i, j])
            cost = self.cost()
            self._training_process.append((epoch, cost))

            # print status
            if self._verbose == True and ((epoch + 1) % 10 == 0):
                print("Iteration: %d ; cost = %.4f" % (epoch + 1, cost))


    def cost(self) -> float:
        """
        compute root mean square error
        :return: rmse cost
        """

        # xi, yi: index arrays such that R[xi, yi] is nonzero.
        xi, yi = self._R.nonzero()
        cost = 0
        for x, y in zip(xi, yi):
            cost += pow(self._R[x, y] - self.predict(x, y), 2)
        return np.sqrt(cost / len(xi))


    def gradient(self, error: float, i: int, j: int) -> tuple:
        """
        gradient of latent feature for GD

        :param error: rating - prediction error
        :param i: user index
        :param j: item index
        :return: gradient of latent feature tuple
        """

        dp = (error * self._Q[j, :]) - (self._regularization * self._P[i, :])
        dq = (error * self._P[i, :]) - (self._regularization * self._Q[j, :])
        return dp, dq


    def gradient_descent(self, i: int, j: int, rating: int) -> None:
        """
        gradient descent function

        :param i: user index of matrix
        :param j: item index of matrix
        :param rating: rating of (i,j)
        """

        # get error
        prediction = self.predict(i, j)
        error = rating - prediction

        # update biases
        self._bu[i] += self._lr * (error - self._regularization * self._bu[i])
        self._bi[j] += self._lr * (error - self._regularization * self._bi[j])

        # update latent feature
        dp, dq = self.gradient(error, i, j)
        self._P[i, :] += self._lr * dp
        self._Q[j, :] += self._lr * dq


    def predict(self, i: int, j: int) -> float:
        """
        get predicted rating: user_i, item_j
        :return: prediction of r_ij
        """
        return self._b + self._bu[i] + self._bi[j] + self._P[i, :].dot(self._Q[j, :].T)


    def complete_matrix(self) -> np.ndarray:
        """
        compute complete matrix PXQ + P.bias + Q.bias + global bias

        - adding _bu[:, np.newaxis] to PXQ adds user i's bias to every entry of row i
        - adding _bi[np.newaxis, :] adds item j's bias to every entry of column j
        - adding _b adds the global bias to every element

        - newaxis: inserts an axis so that the 1-D bias vectors broadcast row-wise / column-wise against the 2-D matrix.

        :return: complete matrix R^
        """
        return self._b + self._bu[:, np.newaxis] + self._bi[np.newaxis, :] + self._P.dot(self._Q.T)

In [None]:
mf = MatrixFactorization(R, k=3, lr=0.01, regularization=0.01, epochs=10, verbose=True)
mf.fit()
mf.complete_matrix()

Iteration: 10 ; cost = 2.1938


array([[ 2.24763975,  6.50443004,  3.17932721, ...,  6.36061056,
         3.50218606,  3.14936372],
       [10.50300803,  9.01721716, 10.33439981, ...,  9.43577893,
         9.83294057, 10.75727496],
       [ 6.4134851 ,  6.20132055,  7.30487491, ...,  5.16559364,
         6.34699165,  5.8968536 ],
       ...,
       [ 4.50436179,  6.61980514,  6.06887039, ...,  5.64808008,
         5.39246033,  4.00131457],
       [ 5.05028549,  4.94542632,  5.46858607, ...,  5.03556709,
         4.97567952,  5.10188016],
       [ 9.61787864,  6.05378364, 10.00548607, ...,  3.29802447,
         8.1130084 ,  8.55239389]])

### CSR matrix form

In [None]:
class MatrixFactorization_sparse:
    def __init__(self, R: np.ndarray, k: int, lr: float, regularization: float, epochs: int, verbose: bool =False) -> None:
        """
        :param R: rating matrix
        :param k: latent parameter
        :param lr: learning rate
        :param regularization: regularization term for update
        :param epochs: training epochs
        :param verbose: print status
        """

        self._R = csr_matrix(R)
        self._ind, self._col = self._R.nonzero()
        self._n_users, self._n_items = R.shape
        self._k = k
        self._lr = lr
        self._regularization = regularization
        self._epochs = epochs
        self._verbose = verbose


    def fit(self) -> None:

        # latent features
        self._P = np.random.normal(size=(self._n_users, self._k))
        self._Q = np.random.normal(size=(self._n_items, self._k))

        # biases
        self._bu = np.zeros(self._n_users)
        self._bi = np.zeros(self._n_items)
        self._b = np.mean(self._R[self._R.nonzero()])

        # train while epochs
        self._training_process = []
        for epoch in range(self._epochs):

            # train only on indices whose rating is nonzero
            for i in range(len(self._ind)):
              self.gradient_descent(self._ind[i], self._col[i], self._R[self._ind[i], self._col[i]])
            cost = self.cost()
            self._training_process.append((epoch, cost))

            # print status
            if self._verbose == True and ((epoch + 1) % 1 == 0):
                print("Iteration: %d ; cost = %.4f" % (epoch + 1, cost))


    def cost(self) -> float:
        """
        compute root mean square error
        :return: rmse cost
        """
        cost = 0
        for x, y in zip(self._ind, self._col):
            cost += pow(self._R[x, y] - self.predict(x, y), 2)
        return np.sqrt(cost / len(self._ind))


    def gradient(self, error: float, i: int, j: int) -> tuple:
        """
        gradient of latent feature for GD

        :param error: rating - prediction error
        :param i: user index
        :param j: item index
        :return: gradient of latent feature tuple
        """

        dp = (error * self._Q[j, :]) - (self._regularization * self._P[i, :])
        dq = (error * self._P[i, :]) - (self._regularization * self._Q[j, :])
        return dp, dq


    def gradient_descent(self, i: int, j: int, rating: int) -> None:
        """
        gradient descent function

        :param i: user index of matrix
        :param j: item index of matrix
        :param rating: rating of (i,j)
        """

        # get error
        prediction = self.predict(i, j)
        error = rating - prediction

        # update biases
        self._bu[i] += self._lr * (error - self._regularization * self._bu[i])
        self._bi[j] += self._lr * (error - self._regularization * self._bi[j])

        # update latent feature
        dp, dq = self.gradient(error, i, j)
        self._P[i, :] += self._lr * dp
        self._Q[j, :] += self._lr * dq


    def predict(self, i: int, j: int) -> float:
        """
        get predicted rating: user_i, item_j
        :return: prediction of r_ij
        """
        return self._b + self._bu[i] + self._bi[j] + (csr_matrix(self._P[i, :]).dot(csr_matrix(self._Q[j, :].T).T)).toarray().reshape(-1)[0]


    def complete_matrix(self) -> np.ndarray:
        """
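        compute complete matrix PXQ + P.bias + Q.bias + global bias

        - adding _bu[:, np.newaxis] to PXQ adds user i's bias to every entry of row i
        - adding _bi[np.newaxis, :] adds item j's bias to every entry of column j
        - adding _b adds the global bias to every element

        - newaxis: inserts an axis so that the 1-D bias vectors broadcast row-wise / column-wise against the 2-D matrix.

        :return: complete matrix R^
        """
        # Keep the full (n_users, n_items) product; picking a single element here
        # would reduce the latent term to one scalar shared by every entry.
        return self._b + self._bu[:, np.newaxis] + self._bi[np.newaxis, :] + (csr_matrix(self._P).dot(csr_matrix(self._Q.T))).toarray()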


In [None]:
mf = MatrixFactorization_sparse(R, k=3, lr=0.01, regularization=0.01, epochs=10, verbose=True)
mf.fit()
mf.complete_matrix()

Iteration: 1 ; cost = 2.7472
Iteration: 2 ; cost = 2.6198
Iteration: 3 ; cost = 2.5516
Iteration: 4 ; cost = 2.5006
Iteration: 5 ; cost = 2.4552
Iteration: 6 ; cost = 2.4114
Iteration: 7 ; cost = 2.3674
Iteration: 8 ; cost = 2.3226
Iteration: 9 ; cost = 2.2766
Iteration: 10 ; cost = 2.2292


array([[ 8.58072755,  8.37458295,  8.69488867, ...,  8.47269015,
         8.39836927,  8.65734614],
       [10.37375781, 10.16761321, 10.48791893, ..., 10.26572041,
        10.19139953, 10.4503764 ],
       [ 7.80602495,  7.59988035,  7.92018607, ...,  7.69798755,
         7.62366667,  7.88264354],
       ...,
       [ 8.27634147,  8.07019687,  8.39050259, ...,  8.16830407,
         8.09398319,  8.35296007],
       [ 6.7986166 ,  6.592472  ,  6.91277772, ...,  6.69057919,
         6.61625832,  6.87523519],
       [ 7.50477271,  7.29862811,  7.61893383, ...,  7.39673531,
         7.32241443,  7.5813913 ]])

## ALS

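Like MF, ALS factorizes the rating table R into two latent factor matrices, P and Q, and predicts user-item preference as the product of the two latent matrices.
- The difference from MF is in how the latent matrices are computed: ALS alternately holds one of P and Q fixed and solves for the other by least squares.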
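With Q held fixed, the update for one user is a ridge regression with a closed-form solution, $(Q^\top Q + \lambda I)\,p_i = Q^\top r_i$; the item update is the same with the roles of P and Q swapped. The sketch below runs one such half-step on made-up data (the ratings row `r_i` is hypothetical) and checks that the gradient of the regularized least-squares objective vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_items, reg = 3, 6, 0.01
items = rng.normal(size=(n_items, k))                   # Q, held fixed for this half-step
r_i = rng.integers(1, 11, size=n_items).astype(float)   # hypothetical ratings row

# Closed-form ridge solution: (Q^T Q + reg * I) p_i = Q^T r_i
p_i = np.linalg.solve(items.T @ items + reg * np.eye(k), items.T @ r_i)

# At the minimiser, the gradient of ||r_i - Q p||^2 + reg * ||p||^2 is zero
gradient = 2 * (items.T @ (items @ p_i - r_i) + reg * p_i)
print(np.allclose(gradient, 0.0, atol=1e-8))
```

Because each half-step is solved exactly, ALS needs no learning rate, which is why the ALS classes in this section take no `lr` argument.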

---
<img src="https://drive.google.com/uc?export=view&id=1KB0pUxES-GK5PUmAIMlSjmv8cqfceZLQ" width="800">


### Dense matrix form

In [None]:
class AlternatingLeastSquares:
    def __init__(self, R: np.ndarray, k: int, regularization: float, epochs: int, verbose: bool =False) -> None:
        """
        :param R: rating matrix
        :param k: latent parameter
        :param regularization: regularization term for update
        :param epochs: training epochs
        :param verbose: print status
        """
        self._R = R
        self._n_users, self._n_items = R.shape
        self._k = k
        self._regularization = regularization
        self._epochs = epochs
        self._verbose = verbose


    def fit(self) -> None:
        # init latent features
        self._users = np.random.normal(size=(self._n_users, self._k))
        self._items = np.random.normal(size=(self._n_items, self._k))

        # train while epochs
        self._training_process = []
        self._user_error = 0; self._item_error = 0; 
        for epoch in range(self._epochs):
            for i, Ri in enumerate(self._R):
                self._users[i] = self.user_latent(Ri)

            for j, Rj in enumerate(self._R.T):
                self._items[j] = self.item_latent(Rj)

            cost = self.cost()
            self._training_process.append((epoch, cost))

            # print status
            if self._verbose == True and ((epoch + 1) % 1 == 0):
                print("Iteration: %d ; cost = %.4f" % (epoch + 1, cost))


    def cost(self) -> float:
        """
        compute root mean square error
        :return: rmse cost
        """
        xi, yi = self._R.nonzero()
        cost = 0
        for x, y in zip(xi, yi):
            cost += pow(self._R[x, y] - self.predict(x, y), 2)
        return np.sqrt(cost/len(xi))


    def user_latent(self, Ri: np.ndarray) -> np.ndarray:
        """
        :param Ri: Rating of user index i
        :return: convergence value of user latent of i index
        """

        du = np.linalg.solve(np.dot(self._items.T, self._items) + 
                             self._regularization * np.eye(self._k),
                             np.dot(self._items.T, Ri.T)).T
        return du

    def item_latent(self, Rj: np.ndarray) -> np.ndarray:
        """
        :param Rj: Rating of item index j
        :return: convergence value of item latent of j index
        """

        di = np.linalg.solve(np.dot(self._users.T, self._users) + 
                             self._regularization * np.eye(self._k),
                             np.dot(self._users.T, Rj))
        return di


    def predict(self, i: int, j: int) -> float:
        """
        get predicted rating: user_i, item_j
        :return: prediction of r_ij
        """
        return self._users[i, :].dot(self._items[j, :].T)


    def complete_matrix(self) -> np.ndarray:
        """
        :return: complete matrix R^
        """
        return self._users.dot(self._items.T)

In [None]:
als = AlternatingLeastSquares(R, k=3, regularization=0.01, epochs=10, verbose=True)
als.fit()
als.complete_matrix()

Iteration: 1 ; cost = 7.3325
Iteration: 2 ; cost = 6.9303
Iteration: 3 ; cost = 6.8407
Iteration: 4 ; cost = 6.8433
Iteration: 5 ; cost = 6.8514
Iteration: 6 ; cost = 6.8587
Iteration: 7 ; cost = 6.8664
Iteration: 8 ; cost = 6.8751
Iteration: 9 ; cost = 6.8842
Iteration: 10 ; cost = 6.8927


array([[ 8.21955752e-04,  1.37443682e-04,  1.58859396e-04, ...,
        -1.65185528e-23,  2.85284522e-04,  5.14093997e-04],
       [ 8.82633776e-05,  1.60319186e-05,  1.74148452e-05, ...,
        -1.73977300e-24,  3.18983282e-05,  5.68589361e-05],
       [ 1.02870360e-16,  1.41515969e-17,  1.94753545e-17, ...,
        -2.11530578e-36,  3.41190270e-17,  6.18192169e-17],
       ...,
       [ 1.20703994e-03,  1.79756020e-04,  2.33636468e-04, ...,
        -2.43572840e-23,  4.18096565e-04,  7.47324288e-04],
       [ 8.59362087e-04,  3.71473768e-04,  1.51692797e-04, ...,
        -1.70489000e-23,  2.72209378e-04,  5.81364778e-04],
       [ 7.63397495e-05,  1.00890831e-05,  1.52057108e-05, ...,
        -1.51556900e-24,  2.77137369e-05,  4.81427799e-05]])
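The `user_latent` and `item_latent` updates above regress on the entire row or column of `R`, so entries stored as 0 are treated as real ratings of 0. That is consistent with the near-zero predictions and with the RMSE on observed ratings staying near 6.9. A common variant restricts each least-squares problem to the observed entries. The helper `user_latent_observed` below is a hypothetical sketch of that variant on made-up data, not part of the class above.

```python
import numpy as np

def user_latent_observed(Ri, items, reg):
    """Ridge solution for one user using only that user's observed items."""
    observed = Ri != 0                      # assumes 0 marks a missing rating
    Q_obs = items[observed]                 # (n_observed, k)
    k = items.shape[1]
    return np.linalg.solve(Q_obs.T @ Q_obs + reg * np.eye(k),
                           Q_obs.T @ Ri[observed])

rng = np.random.default_rng(0)
items = rng.normal(size=(8, 2))
true_p = np.array([1.5, -0.5])
Ri = np.zeros(8)
Ri[:5] = items[:5] @ true_p                 # only the first five items are rated

p_hat = user_latent_observed(Ri, items, reg=1e-9)
print(np.round(p_hat, 4))
```

With noiseless toy ratings and a tiny `reg`, the recovered vector matches `true_p` closely; solving against the full row including the three zeros would pull it away from that value.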

### CSR matrix form

In [None]:
class AlternatingLeastSquares_sparse:
    def __init__(self, R: np.ndarray, k: int, regularization: float, epochs: int, verbose: bool =False) -> None:
        """
        :param R: rating matrix
        :param k: latent parameter
        :param regularization: regularization term for update
        :param epochs: training epochs
        :param verbose: print status
        """
        self._R = csr_matrix(R)
        self._ind, self._col = self._R.nonzero()
        self._n_users, self._n_items = R.shape
        self._k = k
        self._regularization = regularization
        self._epochs = epochs
        self._verbose = verbose


    def fit(self) -> None:
        # init latent features
        self._users = np.random.normal(size=(self._n_users, self._k))
        self._items = np.random.normal(size=(self._n_items, self._k))

        # train while epochs
        self._training_process = []
        self._user_error = 0; self._item_error = 0; 
        for epoch in range(self._epochs):
            for i, Ri in enumerate(self._R):
                self._users[i] = self.user_latent(Ri)

            for j, Rj in enumerate(self._R.T):
                self._items[j] = self.item_latent(Rj)

            cost = self.cost()
            self._training_process.append((epoch, cost))

            # print status
            if self._verbose == True and ((epoch + 1) % 1 == 0):
                print("Iteration: %d ; cost = %.4f" % (epoch + 1, cost))


    def cost(self) -> float:
        """
        compute root mean square error
        :return: rmse cost
        """
        cost = 0
        for x, y in zip(self._ind, self._col):
            cost += pow(self._R[x, y] - self.predict(x, y), 2)
        return np.sqrt(cost / len(self._ind))


    def user_latent(self, Ri: csr_matrix) -> np.ndarray:
        """
        :param Ri: ratings row of user i (1 x n_items sparse matrix)
        :return: least-squares solution for the latent vector of user i
        """
        du = linalg.spsolve((self._items.T @ (self._items)) + 
                            self._regularization * np.eye(self._k),
                            self._items.T @ (Ri.T)
                            ).T
        return du

    def item_latent(self, Rj: csr_matrix) -> np.ndarray:
        """
        :param Rj: ratings of item j (1 x n_users sparse matrix)
        :return: least-squares solution for the latent vector of item j
        """

        di = linalg.spsolve((self._users.T @ self._users) + 
                            self._regularization * np.eye(self._k),
                            self._users.T @ (Rj.T)
                            ).T
        return di


    def predict(self, i: int, j: int) -> float:
        """
        get predicted rating: user_i, item_j
        :return: prediction of r_ij
        """
        return self._users[i, :].dot(self._items[j, :].T)


    def complete_matrix(self) -> np.ndarray:
        """
        :return: complete matrix R^
        """
        return self._users.dot(self._items.T)

In [None]:
als = AlternatingLeastSquares_sparse(R, k=3, regularization=0.01, epochs=10, verbose=True)
als.fit()
als.complete_matrix()

Iteration: 1 ; cost = 7.3105
Iteration: 2 ; cost = 6.8870
Iteration: 3 ; cost = 6.8619
Iteration: 4 ; cost = 6.8692
Iteration: 5 ; cost = 6.8800
Iteration: 6 ; cost = 6.8902
Iteration: 7 ; cost = 6.8985
Iteration: 8 ; cost = 6.9049
Iteration: 9 ; cost = 6.9097
Iteration: 10 ; cost = 6.9132


array([[ 2.22899642e-04,  6.36814844e-05,  5.23916345e-05, ...,
        -2.20215406e-25,  1.05048277e-04,  1.94455156e-04],
       [ 3.94372322e-05,  1.06265787e-05,  9.30913231e-06, ...,
        -3.75543805e-26,  1.86593951e-05,  3.43690058e-05],
       [-5.83506934e-18, -1.17743122e-18, -1.21232438e-18, ...,
         6.33204185e-39, -2.18713331e-18, -4.48959920e-18],
       ...,
       [ 4.51411100e-04,  8.12539868e-05,  1.04439513e-04, ...,
        -3.81094768e-25,  2.03143984e-04,  3.77191969e-04],
       [ 4.22662041e-04,  3.46901886e-04,  8.74693999e-05, ...,
        -8.96506716e-25,  1.80132845e-04,  3.87647629e-04],
       [ 3.98442643e-05,  6.39490984e-06,  9.31765061e-06, ...,
        -3.14855360e-26,  1.81869557e-05,  3.34050155e-05]])

# [3] Supervised Learning

## Naive Bayes

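Naive Bayes applies Bayes' rule under the assumption that the features X are conditionally independent of one another given the class.
- For the details, see the lecture or this [reference](https://towardsdatascience.com/introduction-to-na%C3%AFve-bayes-classifier-fa59e3e24aaf).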

In [None]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB

- Most of the features used here are binary, so we use BernoulliNB, which is known to work well on binary data.
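As a minimal illustration on made-up data (the arrays `X_toy` and `y_toy` are hypothetical), the sketch below fits `BernoulliNB` on binary indicator features and inspects the class order used by `predict_proba`. Note that `BernoulliNB` binarizes its inputs at the `binarize` threshold (0.0 by default), so a non-binary column such as `age` is reduced to "greater than 0 or not".

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features (e.g. one-hot indicators) and rating-like labels
X_toy = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [0, 1, 0],
                  [0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]])
y_toy = np.array([10, 10, 5, 5, 10, 5])

toy_nb = BernoulliNB()          # default alpha=1.0 (Laplace smoothing)
toy_nb.fit(X_toy, y_toy)

proba = toy_nb.predict_proba(X_toy)
print(toy_nb.classes_)          # column order of predict_proba
print(toy_nb.predict([[1, 0, 0]]))
```

Each row of `proba` sums to 1, and its columns follow `toy_nb.classes_`, which is how the `proba` output of the real model should be read as well.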

In [None]:
# nb = GaussianNB() 
nb = BernoulliNB() 
nb.fit(X_train, y_train)
print('most probable :', nb.predict(X_test))
print('proba :', nb.predict_proba(X_test))

most probable : [10  9 10 ...  8 10 10]
proba : [[8.07173556e-03 4.88255355e-35 1.89842040e-24 ... 1.33434066e-02
  8.57159267e-03 9.65943022e-01]
 [1.29038571e-01 1.08600205e-33 3.39917567e-22 ... 2.87643150e-01
  3.71120528e-01 1.14236089e-01]
 [2.00001831e-01 2.83629201e-31 2.35323067e-21 ... 3.34871427e-02
  4.46865717e-02 6.99158527e-01]
 ...
 [2.22942192e-02 6.30143148e-34 1.88638792e-24 ... 9.74328836e-01
  6.02390484e-04 3.84895540e-04]
 [2.35831072e-01 2.59926120e-36 9.85744696e-26 ... 7.47444023e-03
  4.21912651e-04 7.44009326e-01]
 [2.57528981e-01 5.82461987e-34 1.45100013e-23 ... 2.24708186e-01
  1.28493593e-01 2.87365745e-01]]


In [None]:
# The two calls return the same result
print('accuracy : ', accuracy_score(y_test, nb.predict(X_test)))
print('nb score : ', nb.score(X_test, y_test))

accuracy :  0.41067700566656723
nb score :  0.41067700566656723


In [None]:
def rmse(real, predict):
  return np.sqrt(np.mean((real-predict) ** 2))

In [None]:
def mae(real, predict):
  return np.mean(np.abs(real-predict))

In [None]:
print('RMSE : ', rmse(y_test, nb.predict(X_test)))
print('MAE : ', mae(y_test, nb.predict(X_test)))

RMSE :  3.5162271351168686
MAE :  2.1526990754548168
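Accuracy only counts exact hits, while RMSE and MAE measure how far off a predicted rating is, and RMSE penalizes large misses more heavily than MAE. The sketch below uses made-up predictions (`pred_a` and `pred_b` are hypothetical) to show that the metrics can rank two models differently.

```python
import numpy as np

def rmse(real, predict):
    return np.sqrt(np.mean((real - predict) ** 2))

def mae(real, predict):
    return np.mean(np.abs(real - predict))

real = np.array([10, 8, 7, 1])
pred_a = np.array([9, 7, 6, 2])    # never exact, always off by one
pred_b = np.array([10, 8, 7, 5])   # three exact hits, one large miss

for name, pred in [("a", pred_a), ("b", pred_b)]:
    accuracy = np.mean(real == pred)
    print(name, accuracy, rmse(real, pred), mae(real, pred))
```

`pred_a` has accuracy 0 but RMSE 1, whereas `pred_b` has accuracy 0.75 but RMSE 2; both have MAE 1. When ratings are predicted as classes, it is worth reporting all three.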


In [None]:
pd.crosstab(y_test, nb.predict(X_test))

rating (rows) / predicted (columns),1,5,6,7,8,9,10
1,83,41,0,41,103,49,129
2,1,1,0,0,3,3,0
3,3,2,1,1,5,5,11
4,3,13,0,1,5,6,6
5,29,381,0,14,55,18,82
6,13,20,0,31,54,10,34
7,32,27,0,62,125,34,68
8,35,29,0,38,200,84,145
9,40,18,0,27,117,162,124
10,65,28,0,16,72,59,489


## GBDT

GBDT builds many weak tree models in sequence, each one reducing the residual left by the previous ones.
- XGBoost, LightGBM, and CatBoost are the most widely used implementations.
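The idea of "fitting the residual" can be sketched in a few lines. The code below is plain gradient boosting for squared error on made-up regression data, built from scikit-learn's `DecisionTreeRegressor`; it is an illustration of the principle only and not how XGBoost, LightGBM, or CatBoost are implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # hypothetical regression target

learning_rate, n_trees = 0.3, 20
prediction = np.full_like(y, y.mean())   # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    residual = y - prediction                        # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                            # weak learner fits what is still unexplained
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each shallow tree is a poor model on its own, but the training error shrinks as the trees accumulate; the learning rate controls how much of each tree's correction is applied.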

### XGBoost

In [None]:
xgb_cl = XGBClassifier()

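- This lab treats rating prediction as classification, and XGBClassifier expects class labels that start at 0, so we use LabelEncoder.  
- We train and predict with the LabelEncoder-encoded y values, then use inverse_transform to map predictions back to the original ratings before measuring accuracy.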
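A minimal sketch of the round trip, using made-up ratings with gaps (`y_toy` is hypothetical): `LabelEncoder` sorts the distinct labels and maps them to contiguous ids starting at 0, and `inverse_transform` undoes the mapping.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_toy = np.array([10, 8, 1, 5, 8, 10])   # hypothetical ratings with gaps

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_toy)

print(encoder.classes_)                   # sorted distinct ratings
print(y_encoded)                          # contiguous ids starting at 0
print(encoder.inverse_transform(y_encoded))
```

One caveat: `transform` raises a `ValueError` for a label that was not seen during `fit`, so a rating value that occurs only in the test split would need separate handling.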

In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(y_train)

y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

In [None]:
# Fit
xgb_cl.fit(X_train, y_train_encoded)

In [None]:
# Predict
pred = xgb_cl.predict(X_test)
y_pred = label_encoder.inverse_transform(pred)

In [None]:
# Score
accuracy_score(y_test, y_pred)

0.3313450641216821

In [None]:
print('RMSE : ', rmse(y_test, y_pred))
print('MAE : ', mae(y_test, y_pred))

RMSE :  3.7714690144150085
MAE :  2.502535043244855


### LightGBM

In [None]:
lgbm_cl = LGBMClassifier()

- Because category values are used directly as column names, some names contained characters that LGBMClassifier does not accept. The step below strips them out.

In [None]:
X_train_for_lgbm = X_train.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
X_test_for_lgbm = X_test.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
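The effect of the renaming can be seen on a toy frame. The column names below are made up for illustration, and `sanitize` is a hypothetical helper wrapping the same regular expression. The sketch also checks for a side effect worth guarding against: two different names can collapse to the same string once characters are removed.

```python
import re
import pandas as pd

# Hypothetical one-hot column names containing characters outside [A-Za-z0-9_]
toy = pd.DataFrame([[1, 0, 1, 30]],
                   columns=["category_Drawing,French", "category_Sci:Fi",
                            "publisher_O'Reilly", "age"])

def sanitize(name: str) -> str:
    return re.sub(r"[^A-Za-z0-9_]+", "", name)

toy_clean = toy.rename(columns=sanitize)
print(list(toy_clean.columns))

# Two different names can collapse to the same string, so check for duplicates
has_collision = toy_clean.columns.duplicated().any()
print(has_collision)
```

If `has_collision` were true on the real data, the duplicated columns would need to be merged or given distinct names before fitting.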

In [None]:
# Fit
lgbm_cl.fit(X_train_for_lgbm, y_train)

In [None]:
# Predict
y_pred = lgbm_cl.predict(X_test_for_lgbm)

In [None]:
# Score
accuracy_score(y_test, y_pred)

0.443483447658813

In [None]:
print('RMSE : ', rmse(y_test, y_pred))
print('MAE : ', mae(y_test, y_pred))

RMSE :  3.5432229494725496
MAE :  2.1225767968983


### CatBoost

In [None]:
catboost_cl = CatBoostClassifier()

In [None]:
# Fit
catboost_cl.fit(X_train, y_train)

In [None]:
# Predict
y_pred = catboost_cl.predict(X_test)

In [None]:
# Score
accuracy_score(y_test, y_pred)

0.44437816880405606

In [None]:
print('RMSE : ', rmse(y_test, y_pred.squeeze(1)))
print('MAE : ', mae(y_test, y_pred.squeeze(1)))

RMSE :  3.528842430527749
MAE :  2.0990158067402325


In [None]:
# Use the categorical variables as they are, without one-hot encoding

# Build the list of columns to declare as cat_features
cat_list = [x for x in X_train_cat.columns.tolist() if x not in ['age', 'year_of_publication']]
                                
catboost_cl = CatBoostClassifier(cat_features=cat_list)

# Fit
catboost_cl.fit(X_train_cat, y_train_cat)

# Predict
y_pred = catboost_cl.predict(X_test_cat)

# Score
accuracy_score(y_test_cat, y_pred)

In [None]:
print('RMSE : ', rmse(y_test_cat, y_pred.squeeze(1)))
print('MAE : ', mae(y_test_cat, y_pred.squeeze(1)))

RMSE :  3.4575091065089913
MAE :  2.0605427974947808


### parameter tuning

XGBoost  
Parameters can be adjusted by consulting the documentation at https://xgboost.readthedocs.io/en/stable/parameter.html


In [None]:
# early_stopping_rounds is set on the estimator (xgboost >= 1.6); older versions take it in fit()
xgb_cl = XGBClassifier(n_estimators=100, learning_rate=0.15, max_depth=5, num_parallel_tree=3,
                       early_stopping_rounds=5)

# Fit
xgb_cl.fit(X_train, y_train_encoded,
             eval_set=[(X_test, y_test_encoded)], verbose=True)

# Predict
pred = xgb_cl.predict(X_test)
y_pred = label_encoder.inverse_transform(pred)

# Score (compare on the original rating scale, since y_pred was inverse-transformed)
accuracy_score(y_test, y_pred)

LightGBM  
Parameters can be adjusted by consulting the documentation at https://lightgbm.readthedocs.io/en/latest/Parameters.html


In [None]:
lgbm_cl = LGBMClassifier(
                        nthread=4,
                        n_estimators=1000,
                        learning_rate=0.02,
                        num_leaves=34,
                        colsample_bytree=0.94,
                        subsample=0.87,
                        max_depth=8,
                        reg_alpha=0.04,
                        reg_lambda=0.07,
                        min_split_gain=0.02,
                        min_child_weight=32,
                        silent=-1,
                        verbose=-1
                        )

# Fit
lgbm_cl.fit(X_train_for_lgbm, y_train)

# Predict
y_pred = lgbm_cl.predict(X_test_for_lgbm)

# Score
accuracy_score(y_test, y_pred)



0.40620339994035193

CatBoost  
Parameters can be adjusted by consulting the documentation at https://catboost.ai/en/docs/references/training-parameters/


In [None]:
# Build the list of columns to declare as cat_features
cat_list = [x for x in X_train_cat.columns.tolist() if x not in ['age', 'year_of_publication']]

catboost_cl = CatBoostClassifier(
                                loss_function='MultiClass',
                                eval_metric='MultiClass',
                                verbose=200,
                                early_stopping_rounds=200,
                                cat_features=cat_list,
                                random_seed=101
                                )

# Fit
catboost_cl.fit(X_train_cat, y_train_cat)

# Predict
y_pred = catboost_cl.predict(X_test_cat)

# Score
accuracy_score(y_test_cat, y_pred)

Learning rate set to 0.090329
0:	learn: 2.1702451	total: 478ms	remaining: 7m 57s
200:	learn: 1.4117550	total: 1m 21s	remaining: 5m 22s
400:	learn: 1.3036762	total: 2m 42s	remaining: 4m 2s
600:	learn: 1.2036472	total: 4m 3s	remaining: 2m 41s
800:	learn: 1.1116926	total: 5m 25s	remaining: 1m 20s
999:	learn: 1.0213578	total: 6m 45s	remaining: 0us


0.43722039964211157

The parameters can also be declared up front in a dict and passed in for training. (This applies to every model, not just CatBoost.)

In [None]:
params = {'loss_function':'MultiClass', # objective function
          'eval_metric':'MultiClass', # metric
          'verbose': 200, # output to stdout info about training process every 200 iterations
          'early_stopping_rounds': 200,
          'cat_features': cat_list,
          'random_seed': 101
        }
catboost_cl_params = CatBoostClassifier(**params)

# Fit
catboost_cl_params.fit(X_train_cat, y_train_cat)

# Predict
y_pred = catboost_cl_params.predict(X_test_cat)

# Score
accuracy_score(y_test_cat, y_pred)

Learning rate set to 0.090329
0:	learn: 2.1702451	total: 368ms	remaining: 6m 7s
200:	learn: 1.4117550	total: 1m 20s	remaining: 5m 19s
400:	learn: 1.3036762	total: 2m 41s	remaining: 4m 1s
600:	learn: 1.2036472	total: 4m 2s	remaining: 2m 40s
800:	learn: 1.1116926	total: 5m 23s	remaining: 1m 20s
999:	learn: 1.0213578	total: 6m 43s	remaining: 0us


0.43722039964211157

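<font color='red'><b>**WARNING**</b></font> : **The intellectual property rights to this educational content belong to the NAVER Connect Foundation. Leaking this content outside or modifying it, through any channel, is strictly prohibited.** It may be used only for non-profit educational and research activities, and such use requires the foundation's permission. Violations may incur liability under the applicable laws.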


