# Recommendation System

1. Recommendation systems
2. Similarity functions
3. Content-Based Filtering
4. TF-IDF

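### Recommendation system

: A system that infers a user's tastes and recommends new items

### Recommendation algorithms

: Variants built on top of a few basic algorithms

### Similarity

: A measure of how alike two objects (users or items) are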

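### Types of similarity functions

1. Euclidean similarity: defined as the reciprocal of the Euclidean distance
2. Cosine similarity
3. Pearson similarity
4. Jaccard similarity: the ratio of the intersection to the union of the items the users interacted with

→ Content-Based Filtering: recommend items similar to the ones the user already liked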

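### TF-IDF definition

- Term Frequency - Inverse Document Frequency
- A method for extracting feature vectors from text-based content
- Uses occurrence counts to measure the importance of a keyword (w) within a specific document (d)
- Keywords that appear in many documents are given low importance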

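### TF-IDF formula

TF-IDF(w, d) = TF(w, d) * IDF(w)

### TF-IDF components

TF: how often a specific keyword appears in a specific document

TF(w, d) = number of occurrences of keyword w in document d

DF: how many documents in the whole corpus contain a specific keyword

→ A large DF means the keyword is common across documents.

IDF: the logarithm of the inverse document frequency ratio

$$
IDF(w)=\log\left(\frac{\text{total number of documents}}{\text{number of documents containing keyword } w \text{ (DF)}}\right)
$$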

STEP 1. Compute TF

STEP 2. Compute IDF

STEP 3. Multiply TF by IDF

The resulting TF-IDF vectors are then used for Content-Based Filtering.
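
The three steps can be sketched in plain Python. This is a minimal illustration of the unsmoothed formula from this note, not the exact computation scikit-learn performs: `TfidfVectorizer`, used later in this note, applies IDF smoothing and L2 normalization by default, so its numbers differ. The toy `docs` corpus and the helper name `tf_idf` are assumptions made up for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a list of keywords.
docs = [
    ["space", "alien", "war"],
    ["space", "pirate", "ship"],
    ["pirate", "treasure", "ship", "ship"],
]

def tf_idf(docs):
    n_docs = len(docs)
    # STEP 1: TF(w, d) = number of times keyword w occurs in document d.
    tf = [Counter(doc) for doc in docs]
    # DF(w) = number of documents that contain keyword w.
    df = Counter(w for doc in docs for w in set(doc))
    # STEP 2: IDF(w) = log(N / DF(w)).
    idf = {w: math.log(n_docs / df[w]) for w in df}
    # STEP 3: TF-IDF(w, d) = TF(w, d) * IDF(w).
    return [{w: count * idf[w] for w, count in counts.items()} for counts in tf]

scores = tf_idf(docs)
for i, doc_scores in enumerate(scores):
    print(i, {w: round(s, 3) for w, s in doc_scores.items()})
```

A keyword that occurs in only one document ("alien") gets the largest IDF, while keywords shared by two of the three documents ("space", "ship") are down-weighted.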

### Why TF-IDF is needed

To reflect the importance of each word, not just its raw count

## Collaborative Filtering

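1. Collaborative filtering
2. Memory-based
3. KNN

### Collaborative Filtering

Recommends items using the past item **interaction** data of many users

### Interactions

- Movie ratings
- Product reviews
- Video watch time
- Click counts

### Memory-based collaborative filtering

Stores many users' past item interactions **in memory** and uses the full data set each time a recommendation is needed

### Types of memory-based filtering

1. User-based
    1. Find users with similar item preferences, then recommend new items that those similar users liked
2. Item-based
    1. Find items that users rate similarly, then recommend items similar to the ones the user already liked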

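### KNN collaborative filtering

- Recommends items based on the K most similar neighbors
- Predicts each user's preference for each item
1. KNN Basic: predict user u's preference for item i
    1. Compute the similarity between users.
    2. Among the users who rated item i, find the k users most similar to user u.
    3. Take the similarity-weighted average of the ratings those k users gave to item i.
    4. Recommend the items with the highest predicted preference to the user.
2. KNN with Means
    - Adds the similarity-weighted average of rating deviations to the mean rating
    - Takes the average preference of each user or item into account.
    1. Compute the similarity between items.
    2. Find the k items most similar to item i.
    3. Compute the mean rating of item i.
    4. Take the similarity-weighted average of the deviations (rating minus mean) of the user's ratings for those k items, and add it to the mean from step 3.
    5. Recommend the items with the highest predicted preference to the user.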
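
The item-based KNN with Means steps can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not a reference implementation: a 0 in `R` is taken to mean "not rated", cosine similarity is used between item columns, and the names `cosine_sim`, `knn_with_means_item`, and the toy matrix `R` are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def knn_with_means_item(R, user, item, k=2):
    """Predict R[user, item]; a 0 in R means 'not rated' (assumption)."""
    n_items = R.shape[1]
    rated = R > 0
    # Mean rating of each item over the users who rated it.
    counts = rated.sum(axis=0)
    item_means = R.sum(axis=0) / np.maximum(counts, 1)
    # 1. Similarity between the target item and every item.
    sims = np.array([cosine_sim(R[:, item], R[:, j]) for j in range(n_items)])
    # 2. The k most similar items that this user has rated (target excluded).
    candidates = [j for j in range(n_items) if j != item and rated[user, j]]
    neighbors = sorted(candidates, key=lambda j: sims[j], reverse=True)[:k]
    weight = sum(sims[j] for j in neighbors)
    if weight <= 0:
        # No usable neighbor: fall back to the item's mean rating (step 3).
        return float(item_means[item])
    # 4. Item mean plus the similarity-weighted average of rating deviations.
    deviation = sum(sims[j] * (R[user, j] - item_means[j]) for j in neighbors)
    return float(item_means[item] + deviation / weight)

# Toy rating matrix (hypothetical): rows are users, columns are items.
R = np.array([
    [5., 4., 0., 1.],
    [4., 5., 4., 0.],
    [1., 0., 2., 5.],
    [0., 1., 1., 4.],
])
pred = knn_with_means_item(R, user=0, item=2, k=2)
print(round(pred, 3))
```

Because user 0 rated the two nearest items above their means, the prediction lands above the mean rating of item 2.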

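1. Model-based collaborative filtering
    1. Train a recommendation model on many users' past item interactions, then recommend with the trained model
2. Matrix Factorization
    1. Initialize P and Q with random values
    2. Compute R_h (the predicted rating matrix)
    3. Compute the error between R and R_h
    4. Update P and Q with gradient descent
    5. Repeat steps 2-4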
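
Steps 2-4 can be written out for a single observed rating $r_{ui}$. This is a worked form under assumptions: a squared-error loss with no regularization, and the symbols $p_u$ (row $u$ of P), $q_i$ (row $i$ of Q), $e_{ui}$ (the error), and $\eta$ (the learning rate) are notation introduced here.

$$
\hat{r}_{ui} = p_u \cdot q_i, \qquad e_{ui} = r_{ui} - \hat{r}_{ui}
$$

$$
\frac{\partial e_{ui}^2}{\partial p_u} = -2 e_{ui} q_i, \qquad \frac{\partial e_{ui}^2}{\partial q_i} = -2 e_{ui} p_u
$$

$$
p_u \leftarrow p_u + 2 \eta e_{ui} q_i, \qquad q_i \leftarrow q_i + 2 \eta e_{ui} p_u
$$

Both updates use the values of $p_u$ and $q_i$ from before the step, which is why the gradients are computed first and applied afterwards.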

# Hands-on examples

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2021)

# Sample Data
data = [
    [1., None, 1., None],
    [None, 1., 1., None],
    [1., None, 1., 1.],
]

df = pd.DataFrame(
    data=data,
    index=["userA", "userB", "userC"],
    columns=["itemA", "itemB", "itemC", "itemD"],
)
df

,itemA,itemB,itemC,itemD
userA,1.0,,1.0,
userB,,1.0,1.0,
userC,1.0,,1.0,1.0


In [None]:
# Fill missing values with 0
df = df.fillna(0)
df

,itemA,itemB,itemC,itemD
userA,1.0,0.0,1.0,0.0
userB,0.0,1.0,1.0,0.0
userC,1.0,0.0,1.0,1.0


## Computing Euclidean similarity
$$
\text{Euclidean Similarity} = \frac{1}{\sqrt{\sum_{i=1}^{n}(x_{i} - y_{i})^2} + 1e-5}
$$


In [None]:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(
    X=df.loc[["userA"]],
    Y=df.loc[["userB"]],
)

array([[1.41421356]])

In [None]:
euclidean_distances(df)

array([[0.        , 1.41421356, 1.        ],
       [1.41421356, 0.        , 1.73205081],
       [1.        , 1.73205081, 0.        ]])

In [None]:
distance = euclidean_distances(df)
similarity = 1 / (distance + 1e-5)
similarity

array([[1.00000000e+05, 7.07101781e-01, 9.99990000e-01],
       [7.07101781e-01, 1.00000000e+05, 5.77346936e-01],
       [9.99990000e-01, 5.77346936e-01, 1.00000000e+05]])

## Computing cosine similarity
$$
\text{Cosine Similarity} = \frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i=1}^{n} x_{i}^2} \sqrt{\sum_{i=1}^{n} y_{i}^2}}
$$


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(
    X=df.loc[["userA"]],
    Y=df.loc[["userB"]],
)

array([[0.5]])

In [None]:
cosine_similarity(df)

array([[1.        , 0.5       , 0.81649658],
       [0.5       , 1.        , 0.40824829],
       [0.81649658, 0.40824829, 1.        ]])

## Computing Pearson similarity
$$
\text{Pearson Correlation Coefficient} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{i} - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_{i} - \bar{y})^2}}
$$


In [None]:
data = [
    [4., 5., 4., 3.],
    [3., 4., 3., 2.],
    [4., 4., 5., 3.],
]

df = pd.DataFrame(
    data=data,
    index=["userA", "userB", "userC"],
    columns=["itemA", "itemB", "itemC", "itemD"],
)
df

,itemA,itemB,itemC,itemD
userA,4.0,5.0,4.0,3.0
userB,3.0,4.0,3.0,2.0
userC,4.0,4.0,5.0,3.0
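
The cells above define the sample ratings but stop before computing the coefficient. A minimal sketch of that step, assuming NumPy's `np.corrcoef` as the tool (it is not used in the original notebook); the array `ratings` simply repeats the table above.

```python
import numpy as np

# Same ratings as the table above: one row per user, one column per item.
ratings = np.array([
    [4., 5., 4., 3.],  # userA
    [3., 4., 3., 2.],  # userB
    [4., 4., 5., 3.],  # userC
])

# np.corrcoef treats each row as one variable, so this is user-to-user.
pearson = np.corrcoef(ratings)
print(pearson.round(3))
```

userA and userB correlate at 1.0 because userB's ratings are userA's shifted down by one point, which Pearson ignores; each of them correlates at 0.5 with userC. With the DataFrame above, `df.T.corr()` should give the same matrix.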


## Jaccard similarity
$$
\text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}
$$


In [None]:
data = [
    [4., 0., 4., 3., 0.],
    [3., 4., 0., 2., 0.],
    [0., 0., 4., 5., 3.],
]

df = pd.DataFrame(
    data=data,
    index=["userA", "userB", "userC"],
    columns=["itemA", "itemB", "itemC", "itemD", "itemE"],
)
df

,itemA,itemB,itemC,itemD,itemE
userA,4.0,0.0,4.0,3.0,0.0
userB,3.0,4.0,0.0,2.0,0.0
userC,0.0,0.0,4.0,5.0,3.0


In [None]:
from sklearn.metrics import jaccard_score
# jaccard_score expects binary labels and ignores rating magnitude, so convert the ratings to 0/1
df[df > 0] = 1
df

,itemA,itemB,itemC,itemD,itemE
userA,1.0,0.0,1.0,1.0,0.0
userB,1.0,1.0,0.0,1.0,0.0
userC,0.0,0.0,1.0,1.0,1.0


In [None]:
# jaccard_score takes the two users' vectors as separate arguments
jaccard_score(
    df.loc["userB"],
    df.loc["userC"]
)

0.2
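
The same number can be checked by hand from the set definition. A minimal sketch in plain Python; the `interactions` dictionary restates the non-zero columns of the table above, and the helper name `jaccard` is hypothetical.

```python
from itertools import combinations

# Items each user interacted with (the non-zero columns of the table above).
interactions = {
    "userA": {"itemA", "itemC", "itemD"},
    "userB": {"itemA", "itemB", "itemD"},
    "userC": {"itemC", "itemD", "itemE"},
}

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

pairwise = {
    (u, v): jaccard(interactions[u], interactions[v])
    for u, v in combinations(interactions, 2)
}
print(pairwise)
```

userB and userC share only itemD out of five distinct items, so their similarity is 1/5 = 0.2, matching the `jaccard_score` result above.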

# Movie metadata and TF-IDF practice

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2021)

df = pd.read_csv("/content/drive/MyDrive/Wanted/tmdb_5000_movies.csv")
df = df[["title", "overview"]]
df = df.iloc[:1000]
df.head()

,title,overview
0,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,Spectre,A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,Following the death of District Attorney Harve...
4,John Carter,"John Carter is a war-weary, former military ca..."


In [3]:
# Number of missing values
df["overview"].isna().sum()

0

## Computing TF-IDF

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
df["overview"].values[:2]

transformer = TfidfVectorizer(stop_words = 'english')
tfidf_matrix = transformer.fit_transform(df["overview"].values[:2])
tfidf_matrix.toarray()

array([[0.25819889, 0.25819889, 0.        , 0.        , 0.        ,
        0.25819889, 0.25819889, 0.        , 0.        , 0.25819889,
        0.        , 0.        , 0.        , 0.25819889, 0.        ,
        0.        , 0.        , 0.25819889, 0.25819889, 0.25819889,
        0.25819889, 0.25819889, 0.25819889, 0.25819889, 0.        ,
        0.        , 0.25819889, 0.        , 0.25819889],
       [0.        , 0.        , 0.26726124, 0.26726124, 0.26726124,
        0.        , 0.        , 0.26726124, 0.26726124, 0.        ,
        0.26726124, 0.26726124, 0.26726124, 0.        , 0.26726124,
        0.26726124, 0.26726124, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.26726124,
        0.26726124, 0.        , 0.26726124, 0.        ]])

In [7]:
transformer.get_feature_names_out()[:10]

array(['22nd', 'alien', 'barbossa', 'believed', 'captain', 'century',
       'civilization', 'come', 'dead', 'dispatched'], dtype=object)

In [8]:
pd.DataFrame(tfidf_matrix.toarray(), columns=transformer.get_feature_names_out()).T.head(10)

,0,1
22nd,0.258199,0.0
alien,0.258199,0.0
barbossa,0.0,0.267261
believed,0.0,0.267261
captain,0.0,0.267261
century,0.258199,0.0
civilization,0.258199,0.0
come,0.0,0.267261
dead,0.0,0.267261
dispatched,0.258199,0.0


In [9]:
# Create the vectorizer
transformer = TfidfVectorizer(stop_words="english")
# Fit on all overviews and transform them into a TF-IDF matrix
tfidf_matrix = transformer.fit_transform(df['overview'])
tfidf_matrix.toarray()

array([[0.      , 0.      , 0.      , ..., 0.      , 0.      , 0.      ],
       [0.      , 0.      , 0.      , ..., 0.      , 0.      , 0.      ],
       [0.      , 0.      , 0.      , ..., 0.      , 0.      , 0.      ],
       ...,
       [0.      , 0.      , 0.      , ..., 0.      , 0.      , 0.      ],
       [0.      , 0.228629, 0.      , ..., 0.      , 0.      , 0.      ],
       [0.      , 0.      , 0.      , ..., 0.      , 0.      , 0.      ]])

In [10]:
# Inspect the keywords (feature names)
transformer.get_feature_names_out()[-5:]

array(['zoo', 'zooey', 'zookeeper', 'zorro', 'æon'], dtype=object)

In [11]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(tfidf_matrix)
similarity

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.01847736],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.01253884],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.01847736, 0.01253884, 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [12]:
# Recommend movies similar to the movie at index 998
idx = 998
print(df.loc[idx, 'title'])

similarity_one_idx = similarity[idx]

The Big Bounce


In [13]:
order_idx = similarity_one_idx.argsort()[::-1]
order_idx[:100]

array([998, 425,  37, 635, 141, 666, 336, 540, 263, 133, 482, 543, 673,
       474, 810, 430, 258, 432, 472, 750, 598,  71, 989, 466, 145, 825,
       388, 221, 956, 663, 721,  47, 797, 907, 339, 103, 612, 157, 358,
       303, 436, 559,  17,  27, 752, 143, 321,  50, 153, 682, 587, 741,
       644, 904, 977, 553, 837, 652, 620, 911, 847, 622, 929, 192, 130,
       398, 610, 949, 589, 583, 434, 916, 941, 770, 960,  97,  18,  39,
       113, 382,  83, 958, 470, 525, 945, 659, 134, 595,  86, 519, 313,
       433, 868, 220,  41, 511, 886,  46, 144, 198])

In [14]:
# Index 998 itself ranks first (similarity 1.0), so take 6 entries: the movie itself plus the 5 most similar movies
top5 = order_idx[:6]
top5

array([998, 425,  37, 635, 141, 666])

In [15]:
df.loc[top5, 'title']

998                The Big Bounce
425           Mission: Impossible
37     Oz: The Great and Powerful
635                     Apollo 13
141               Mars Needs Moms
666               I, Frankenstein
Name: title, dtype: object
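
The lookup above can be wrapped in a small helper that drops the query movie itself instead of relying on it ranking first. This is a sketch under assumptions: the function name `recommend_similar` and the toy matrix are hypothetical, and any square similarity matrix (such as `similarity` above) is expected as input.

```python
import numpy as np

def recommend_similar(similarity, idx, top_n=5):
    """Return the indices of the top_n rows most similar to row idx, excluding idx."""
    order = np.argsort(similarity[idx])[::-1]  # most similar first
    order = order[order != idx]                # drop the query item itself
    return order[:top_n]

# Tiny hypothetical similarity matrix for four items.
toy_similarity = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
])
print(recommend_similar(toy_similarity, idx=0, top_n=2))
```

With the notebook's variables, `df.loc[recommend_similar(similarity, 998), 'title']` would list the five most similar titles without the query movie.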

# KNN collaborative filtering practice

In [12]:
import pandas as pd
ratings = pd.read_csv("/content/drive/MyDrive/Wanted/ratings_small.csv")
ratings = ratings[["userId", "movieId", "rating"]]
ratings.head()

,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [13]:
ratings.isnull().sum()

userId     0
movieId    0
rating     0
dtype: int64

In [14]:
ratings.shape

(100004, 3)

In [16]:
movie = pd.read_csv("/content/drive/MyDrive/Wanted/movie.csv", low_memory=False)
movie = movie[["movieId", "title"]]
movie.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [17]:
movie.isnull().sum()

movieId    0
title      0
dtype: int64

In [18]:
movie.shape

(27278, 2)

In [57]:
movies = movie.set_index("movieId")
movies.head()

movieId,title
1,Toy Story (1995)
2,Jumanji (1995)
3,Grumpier Old Men (1995)
4,Waiting to Exhale (1995)
5,Father of the Bride Part II (1995)


In [19]:
user_movies_matrix = ratings.pivot(
    index="userId",
    columns="movieId",
    values="rating",
)
user_movies_matrix.iloc[-5:, -5:]

movieId,161944,162376,162542,162672,163949
userId,,,,,
667,,,,,
668,,,,,
669,,,,,
670,,,,,
671,,,,,


In [20]:
# Replace missing values with 0
user_movie_matrix = user_movies_matrix.fillna(0)
user_movie_matrix.shape

(671, 9066)

In [35]:
# Use KNN Basic with k = 5 to predict how user 124 would rate movie 648
k = 5
user_i = 124
movie_id = 648

# Compute the similarity between users
from sklearn.metrics.pairwise import cosine_similarity
user_similarity = cosine_similarity(user_movie_matrix)
user_similarity.shape

(671, 671)

In [36]:
user_similarity[:10, :10]

array([[1.        , 0.        , 0.        , 0.07448245, 0.01681799,
        0.        , 0.08388416, 0.        , 0.01284289, 0.        ],
       [0.        , 1.        , 0.12429498, 0.11882103, 0.10364614,
        0.        , 0.21298521, 0.11319045, 0.11333307, 0.04321284],
       [0.        , 0.12429498, 1.        , 0.08163991, 0.15153112,
        0.06069128, 0.15471414, 0.24978072, 0.13447489, 0.1146725 ],
       [0.07448245, 0.11882103, 0.08163991, 1.        , 0.13064868,
        0.07964833, 0.31974534, 0.19101336, 0.03041726, 0.13718558],
       [0.01681799, 0.10364614, 0.15153112, 0.13064868, 1.        ,
        0.06379575, 0.0958878 , 0.16571211, 0.08661604, 0.03237017],
       [0.        , 0.        , 0.06069128, 0.07964833, 0.06379575,
        1.        , 0.        , 0.12850206, 0.02174493, 0.04526415],
       [0.08388416, 0.21298521, 0.15471414, 0.31974534, 0.0958878 ,
        0.        , 1.        , 0.14957182, 0.05972764, 0.18649318],
       [0.        , 0.11319045, 0.2497807

In [37]:
user_similarity = pd.DataFrame(
    data=user_similarity,
    index=user_movie_matrix.index,
    columns=user_movie_matrix.index,
)

user_similarity.head(5)

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
userId,,,,,,,,,,,,,,,,,,,,,
1,1.0,0.0,0.0,0.074482,0.016818,0.0,0.083884,0.0,0.012843,0.0,...,0.0,0.0,0.014474,0.043719,0.0,0.0,0.0,0.062917,0.0,0.017466
2,0.0,1.0,0.124295,0.118821,0.103646,0.0,0.212985,0.11319,0.113333,0.043213,...,0.477306,0.063202,0.077745,0.164162,0.466281,0.425462,0.084646,0.02414,0.170595,0.113175
3,0.0,0.124295,1.0,0.08164,0.151531,0.060691,0.154714,0.249781,0.134475,0.114672,...,0.161205,0.064198,0.176134,0.158357,0.177098,0.124562,0.124911,0.080984,0.136606,0.170193
4,0.074482,0.118821,0.08164,1.0,0.130649,0.079648,0.319745,0.191013,0.030417,0.137186,...,0.114319,0.047228,0.136579,0.25403,0.121905,0.088735,0.068483,0.104309,0.054512,0.211609
5,0.016818,0.103646,0.151531,0.130649,1.0,0.063796,0.095888,0.165712,0.086616,0.03237,...,0.191029,0.021142,0.146173,0.224245,0.139721,0.058252,0.042926,0.038358,0.062642,0.225086


In [38]:
# Find the k users most similar to user u; start from u's similarity to every user
user_i_similarity = user_similarity.loc[user_i]
user_i_similarity

userId
1      0.000000
2      0.129669
3      0.224600
4      0.147568
5      0.159521
         ...   
667    0.065720
668    0.074023
669    0.049342
670    0.201474
671    0.330381
Name: 124, Length: 671, dtype: float64

In [39]:
# Sort in descending order
user_i_similarity = user_i_similarity.sort_values(ascending=False)
user_i_similarity

userId
124    1.000000
458    0.455216
379    0.433607
355    0.432242
282    0.423280
         ...   
640    0.000000
642    0.000000
341    0.000000
76     0.000000
1      0.000000
Name: 124, Length: 671, dtype: float64

In [41]:
# Extract the similarities and IDs of the top-k users
# The user themself ranks first and is excluded

top_k_similarity = user_i_similarity[1: k + 1]
top_k_similar_user_ids = top_k_similarity.index
top_k_similar_user_ids

Int64Index([458, 379, 355, 282, 271], dtype='int64', name='userId')

In [42]:
top_k_similarity

userId
458    0.455216
379    0.433607
355    0.432242
282    0.423280
271    0.409402
Name: 124, dtype: float64

In [43]:
movie_id

648

In [44]:
user_movie_matrix

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,,,,,,,,,,,,,,,,,,,,,
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# Weighted-average the ratings that the k similar users gave to item i, using similarity as the weight
top_k_similar_ratings = user_movie_matrix.loc[top_k_similar_user_ids, movie_id]
print(movie_id)
print(top_k_similar_ratings)

648
userId
458    4.5
379    0.0
355    3.5
282    4.0
271    0.0
Name: 648, dtype: float64


In [46]:
# Zero out the weight of neighbors who did not rate the movie
top_k_weight = (top_k_similar_ratings > 0) * top_k_similarity
top_k_weight

userId
458    0.455216
379    0.000000
355    0.432242
282    0.423280
271    0.000000
dtype: float64

In [47]:
# Divide the sum of similarity-weighted ratings by the sum of the similarities
weighted_ratings = (top_k_similar_ratings * top_k_similarity).sum()
weight = top_k_weight.sum()
weight

1.31073817898241

In [48]:
if weight > 0:
    prediction_rating = weighted_ratings / weight
else:
    prediction_rating = 0
prediction_rating

4.0088  # rounded: (4.5*0.455216 + 3.5*0.432242 + 4.0*0.423280) / 1.310738

In [51]:
user_movie_matrix.columns

Int64Index([     1,      2,      3,      4,      5,      6,      7,      8,
                 9,     10,
            ...
            161084, 161155, 161594, 161830, 161918, 161944, 162376, 162542,
            162672, 163949],
           dtype='int64', name='movieId', length=9066)

In [53]:
# Recommend the items with the highest predicted preference to the user
# Predict a score for every movie and recommend the movies with the highest predicted ratings

# Compute the predicted preferences
prediction_dict = {}

for movie_id in user_movie_matrix.columns:
    # Skip movies the user has already rated
    if user_movie_matrix.loc[user_i, movie_id] > 0:
        continue
    top_k_similar_ratings = user_movie_matrix.loc[top_k_similar_user_ids, movie_id]

    top_k_weighted_ratings = top_k_similar_ratings * top_k_similarity
    top_k_weight = (top_k_similar_ratings > 0) * top_k_similarity

    weighted_ratings = top_k_weighted_ratings.sum()
    weight = top_k_weight.sum()

    if weight > 0:
        prediction_rating = weighted_ratings / weight
    else:
        prediction_rating = 0

    # Store the predicted rating for each movie ID
    prediction_dict[movie_id] = prediction_rating

# Sort in descending order
prediction = pd.Series(prediction_dict).sort_values(ascending=False)

# Take the top items
recommend = prediction[:10].index
recommend

Int64Index([1258, 924, 3861, 524, 3916, 260, 968, 1653, 2115, 3949], dtype='int64')

In [58]:
movies.loc[recommend]

,title
1258,"Shining, The (1980)"
924,2001: A Space Odyssey (1968)
3861,"Replacements, The (2000)"
524,Rudy (1993)
3916,Remember the Titans (2000)
260,Star Wars: Episode IV - A New Hope (1977)
968,Night of the Living Dead (1968)
1653,Gattaca (1997)
2115,Indiana Jones and the Temple of Doom (1984)
3949,Requiem for a Dream (2000)
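
The single-movie prediction can also be packaged as a function. This sketch differs from the cells above in one respect: the notebook picks the k nearest users first and then ignores those without a rating, whereas this variant picks the neighbors among the users who rated the item, as step 2 of KNN Basic describes. The function name `predict_knn_basic`, the positional (0-based) indexing, and the toy matrices are assumptions for illustration.

```python
import numpy as np

def predict_knn_basic(R, S, user, item, k=5):
    """R: user x item ratings (0 = unrated); S: user x user similarity."""
    # Candidate neighbors: users other than `user` who rated `item`.
    raters = np.flatnonzero(R[:, item] > 0)
    raters = raters[raters != user]
    if raters.size == 0:
        return 0.0
    # Keep the k raters most similar to `user`.
    top = raters[np.argsort(S[user, raters])[::-1][:k]]
    weights = S[user, top]
    if weights.sum() <= 0:
        return 0.0
    # Similarity-weighted average of the neighbors' ratings.
    return float(weights @ R[top, item] / weights.sum())

# Toy data (hypothetical): 4 users x 3 items and a hand-written similarity matrix.
R = np.array([
    [5., 0., 3.],
    [4., 2., 0.],
    [1., 5., 4.],
    [0., 4., 5.],
])
S = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
pred = predict_knn_basic(R, S, user=0, item=1, k=2)
print(round(pred, 3))
```

For user 0 and item 1, the two nearest raters are users 1 and 2, so the prediction is (0.9 × 2 + 0.2 × 5) / 1.1.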


# Matrix Factorization practice

In [61]:
import pandas as pd
import numpy as np
np.random.seed(2021)

ratings = pd.read_csv("/content/drive/MyDrive/Wanted/ratings_small.csv")
ratings = ratings[['userId', 'movieId', 'rating']]
ratings.head()

,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [63]:
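movies = pd.read_csv("/content/drive/MyDrive/Wanted/movie.csv")
# Keep only the title, indexed by movieId, so that movies.loc[...] can look up titles by movie ID
movies = movies[["movieId", "title"]].set_index("movieId")
# links = pd.read_csv("links")

movies.head()

movieId,title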
1,Toy Story (1995)
2,Jumanji (1995)
3,Grumpier Old Men (1995)
4,Waiting to Exhale (1995)
5,Father of the Bride Part II (1995)


In [66]:
user_movie_matrix = ratings.pivot(
    index="userId",
    columns="movieId",
    values="rating",
)
user_movie_matrix = user_movie_matrix.fillna(0)

# Matrix Factorization
# Ground-truth rating matrix R
R = user_movie_matrix.values
n_user = R.shape[0]
n_item = R.shape[1]

# Number of latent factors
K = 10

# Initialize P and Q with random values
P = np.random.normal(size = (n_user, K))
Q = np.random.normal(size = (n_item, K))
print(P)
print('')
print(Q)

[[-0.8628148   1.26014792  0.75181999 ...  0.20625798 -0.07394878
   0.25567585]
 [ 1.86247556 -1.02041415  0.92101015 ...  1.3333329  -0.92944303
   1.07725698]
 [-0.30616704  1.77757089  0.28191595 ...  0.19601641  1.41544502
  -0.4093678 ]
 ...
 [-0.14994807 -1.67644508 -0.76044559 ...  0.8620069   0.25274095
  -0.19300867]
 [ 0.08113144 -1.20398428 -0.172246   ...  1.78485929 -0.03735318
  -0.23170517]
 [-0.62966629 -1.17179469 -0.04452852 ... -0.51086987 -0.75285258
   0.1469807 ]]

[[ 0.52472498 -0.54215591  0.09096537 ...  0.40225378 -0.9975197
  -1.12128565]
 [ 0.17235418 -0.64850391 -1.13307562 ... -0.70828962  1.50422767
  -0.02895267]
 [-0.33528878 -1.31283463 -0.33947709 ... -0.67389321  0.35014768
   1.78720248]
 ...
 [ 0.35918857 -0.6626184  -0.37398741 ...  1.28356336 -0.76616038
   0.90302269]
 [ 0.37394715 -1.07087477  0.70600411 ...  0.09376487  1.82244032
  -0.67586397]
 [ 1.3336289  -0.41372545 -0.67809159 ... -1.47565814  0.16883098
  -1.01941128]]


# Learning the latent factor matrices with gradient descent

In [69]:
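# Row and column positions in R (0-based), not userId/movieId labels
user_id = 670
item_id = 0

# Compute R_h (the predicted rating)
pred = P[user_id, :].dot(Q[item_id, :].T)
print('Predicted value:', pred)

# Compute the error between R and R_h
real = R[user_id, item_id]
print('Actual value:', real)

error = real - pred
print('Error:', error)

Predicted value: 1.6850677464940305
Actual value: 5.0
Error: 3.3149322535059698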
오차 값 3.3149322535059698


In [70]:
# Update P and Q with gradient descent
learning_rate = 0.01
dp = 2 * error * Q[item_id, :]
dq = 2 * error * P[user_id, :]
P[user_id, :] += learning_rate * dp
Q[item_id, :] += learning_rate * dq
P[user_id]

array([-0.59487773, -1.20773889, -0.03849763,  0.39176852, -1.6039837 ,
        1.58768382,  0.45126681, -0.48420099, -0.81898678,  0.07264098])

In [71]:
pred = P[user_id, :].dot(Q[item_id, :].T)
error = real - pred
error

2.177313040230893

In [72]:
# Repeat the update step
epochs = 10
real = R[user_id, item_id]

for epoch in range(epochs):
    pred = P[user_id, :].dot(Q[item_id, :].T)
    error = real - pred

    dp = 2 * error * Q[item_id, :]
    dq = 2 * error * P[user_id, :]

    P[user_id, :] += learning_rate * dp
    Q[item_id, :] += learning_rate * dq

    print(f"Epoch{epoch}: {round(error, 3)}")

Epoch0: 2.177
Epoch1: 1.407
Epoch2: 0.895
Epoch3: 0.562
Epoch4: 0.35
Epoch5: 0.217
Epoch6: 0.134
Epoch7: 0.082
Epoch8: 0.051
Epoch9: 0.031


In [73]:
# Update P and Q using the full data set
K = 10

P = np.random.normal(size=(n_user, K))
Q = np.random.normal(size=(n_item, K))

epochs = 5
learning_rate = 0.01

for epoch in range(1, epochs + 1):
    total_error = 0
    iteration = 0

    # Loop over all users
    for user_id in range(n_user):
        # Loop over all items
        for item_id in range(n_item):

            real = R[user_id, item_id]

            # Skip entries that were not rated
            if real == 0:
                continue

            # Update P and Q
            pred = P[user_id, :].dot(Q[item_id, :].T)
            error = real - pred

            dp = 2 * error * Q[item_id, :]
            dq = 2 * error * P[user_id, :]

            P[user_id, :] += learning_rate * dp
            Q[item_id, :] += learning_rate * dq

            total_error += (error ** 2)
            iteration += 1

    print(f"Epoch {epoch}: {round(np.sqrt(total_error / iteration), 5)}")

Epoch 1: 3.00383
Epoch 2: 1.67262
Epoch 3: 1.31783
Epoch 4: 1.15994
Epoch 5: 1.06419
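
The nested loop scans every cell of R only to skip the zeros. A possible variant, sketched under assumptions, visits just the observed entries via `np.nonzero`, which returns them in the same row-major order; the update rule is unchanged. The function name `train_mf`, its default arguments, and the toy matrix are hypothetical, and R is assumed to contain at least one rating.

```python
import numpy as np

def train_mf(R, n_factors=10, epochs=5, learning_rate=0.01, seed=2021):
    rng = np.random.default_rng(seed)
    n_user, n_item = R.shape
    P = rng.normal(size=(n_user, n_factors))
    Q = rng.normal(size=(n_item, n_factors))
    # Visit only the observed (non-zero) entries instead of scanning all cells.
    users, items = np.nonzero(R)
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for u, i in zip(users, items):
            error = R[u, i] - P[u] @ Q[i]
            # Compute both gradients before updating either matrix.
            dp = 2 * error * Q[i]
            dq = 2 * error * P[u]
            P[u] += learning_rate * dp
            Q[i] += learning_rate * dq
            squared_error += error ** 2
        # RMSE over the observed ratings for this epoch.
        history.append(float(np.sqrt(squared_error / len(users))))
    return P, Q, history

# Toy rating matrix (hypothetical); 0 means "not rated".
toy_R = np.array([
    [5., 3., 0., 1.],
    [4., 0., 0., 1.],
    [1., 1., 0., 5.],
    [0., 1., 5., 4.],
])
P, Q, history = train_mf(toy_R, n_factors=2, epochs=50)
print(round(history[0], 3), round(history[-1], 3))
```

On the ratings matrix above this would be called as `train_mf(R, n_factors=10, epochs=5)`; the random initialization differs from the notebook's `np.random.normal`, so the per-epoch errors will not match exactly.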


In [74]:
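# Movie recommendation
user_id = 124
# P is indexed by row position, so this takes the latent factors in row 124 (0-based)
# and multiplies them by every movie's latent factors to predict ratings
prediction = P[[user_id], :].dot(Q.T)[0]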
prediction = pd.Series(
    data=prediction,
    index=user_movie_matrix.columns,
).sort_values(ascending=False)
prediction

movieId
94672     7.348390
77359     7.324331
4479      7.186576
100304    6.977154
27875     6.706461
            ...   
136654   -5.439048
123947   -5.636994
31724    -5.709157
132462   -5.995795
133281   -6.223498
Length: 9066, dtype: float64

In [75]:
# Keep only movies that this user has not rated yet
# (.iloc selects the same row position that was used for P above)
prediction = prediction[user_movie_matrix.iloc[user_id] == 0]
# Top 10
recommend = prediction[:10].index

movies.loc[recommend]

movieId,title
94672,Across the Line: The Exodus of Charlie Wright ...
77359,Red Riding: 1983 (2009)
4479,Bird (1988)
100304,"Liability, The (2012)"
27875,Redemption: The Stan Tookie Williams Story (2004)
55566,Tyler Perry's Why Did I Get Married? (2007)
113220,"Dog, The (2013)"
88380,Dylan Dog: Dead of Night (2010)
73860,Nowhere Boy (2009)
95752,Terminal USA (1993)
