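# Rubric
1. The CSR matrix was built correctly.
    - It was created with the exact size implied by the number of users and items.
2. The MF model was trained properly and produced plausible recommendations.
    - The dot products of user and item vectors take meaningful values.
3. Finding similar movies and recommending movies to a user both ran correctly.
    - The user preferences, item-to-item similarities, and contributions predicted by the MF model were measured and interpreted.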

## Loading the data

In [1]:
import numpy as np
import scipy
import implicit
import pandas as pd

In [2]:
import os
rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'ratings', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,ratings,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [3]:
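# Keep only ratings of 3 or higher.
# .copy() avoids a SettingWithCopyWarning when the column is renamed in place later.
ratings = ratings[ratings['ratings']>=3].copy()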
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [4]:
# Rename the ratings column to counts.
ratings.rename(columns={'ratings':'counts'}, inplace=True)

In [5]:
ratings['counts']
ratings

Unnamed: 0,user_id,movie_id,counts,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000203,6040,1090,3,956715518
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [6]:
ratings = ratings[['user_id' ,'movie_id', 'counts']]
ratings

Unnamed: 0,user_id,movie_id,counts
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
...,...,...,...
1000203,6040,1090,3
1000205,6040,1094,5
1000206,6040,562,5
1000207,6040,1096,4


In [7]:
# Load the movie metadata so that titles can be looked up.
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


## Exploring the data

In [8]:
# Number of users
ratings['user_id'].nunique()

6039

In [9]:
# Number of movies
ratings['movie_id'].nunique()

3628

In [10]:
# Most popular movies (by number of ratings)
movie_data = pd.merge(ratings, movies)
movie_count = movie_data.groupby('title')['counts'].count()
movie_count.sort_values(ascending=False).head(30)

title
American Beauty (1999)                                   3211
Star Wars: Episode IV - A New Hope (1977)                2910
Star Wars: Episode V - The Empire Strikes Back (1980)    2885
Star Wars: Episode VI - Return of the Jedi (1983)        2716
Saving Private Ryan (1998)                               2561
Terminator 2: Judgment Day (1991)                        2509
Silence of the Lambs, The (1991)                         2498
Raiders of the Lost Ark (1981)                           2473
Back to the Future (1985)                                2460
Matrix, The (1999)                                       2434
Jurassic Park (1993)                                     2413
Sixth Sense, The (1999)                                  2385
Fargo (1996)                                             2371
Braveheart (1995)                                        2314
Men in Black (1997)                                      2297
Schindler's List (1993)                                  2257
Pr

In [13]:
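# Set up the new user's initial preferences
my_favorite = ['Star Wars: Episode IV - A New Hope (1977)' , 'Back to the Future (1985)' ,'Terminator 2: Judgment Day (1991)' ,'Men in Black (1997)' ,'Sixth Sense, The (1999)']
favorite_movie_id = movies[movies['title'].isin(my_favorite)]
# The last existing user_id is 6040, so the new user gets 6041.
my_movie = pd.DataFrame({'user_id': [6041]*5, 'movie_id': favorite_movie_id['movie_id'], 'counts':[5]*5})

# Append only once. Because of this guard, re-running the cell after editing my_favorite
# does not update the ratings already stored for user 6041; that is presumably why the
# movie_ids in the output below differ from favorite_movie_id.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
if not (ratings['user_id'] == 6041).any():
    ratings = pd.concat([ratings, my_movie])
ratings.tail(10)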

Unnamed: 0,user_id,movie_id,counts
1000203,6040,1090,3
1000205,6040,1094,5
1000206,6040,562,5
1000207,6040,1096,4
1000208,6040,1097,4
476,6041,480,5
847,6041,858,5
1250,6041,1270,5
1539,6041,1580,5
2502,6041,2571,5


In [14]:
favorite_movie_id

Unnamed: 0,movie_id,title,genre
257,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi
585,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller
1250,1270,Back to the Future (1985),Comedy|Sci-Fi
1539,1580,Men in Black (1997),Action|Adventure|Comedy|Sci-Fi
2693,2762,"Sixth Sense, The (1999)",Thriller


In [15]:
movie_data = pd.merge(ratings, movies, on='movie_id')
movie_data

Unnamed: 0,user_id,movie_id,counts,title,genre
0,1,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...
836478,5851,3607,5,One Little Indian (1973),Comedy|Drama|Western
836479,5854,3026,4,Slaughterhouse (1987),Horror
836480,5854,690,3,"Promise, The (Versprechen, Das) (1994)",Romance
836481,5938,2909,4,"Five Wives, Three Secretaries and Me (1998)",Documentary


## CSR Matrix

In [16]:
ratings

Unnamed: 0,user_id,movie_id,counts
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
...,...,...,...
476,6041,480,5
847,6041,858,5
1250,6041,1270,5
1539,6041,1580,5


In [17]:
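from scipy.sparse import csr_matrix

num_user = ratings['user_id'].nunique()
num_movie = ratings['movie_id'].nunique()

# user_id and movie_id are used directly as row/column indices, so the resulting shape is
# (max user_id + 1, max movie_id + 1) rather than (num_user, num_movie).
csr_data = csr_matrix((ratings['counts'], (ratings.user_id, ratings.movie_id)))
csr_data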

<6042x3953 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Row format>
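
The matrix above is 6042x3953 because the raw IDs serve as indices, whereas the data contained only 6039 distinct users and 3628 distinct movies before the new user was added. The following is a minimal sketch, on hypothetical toy data, of one way to get a matrix whose shape matches the distinct counts exactly: map each ID to a contiguous 0-based index with `pd.factorize`. The names `toy`, `user_idx`, `movie_idx`, and `toy_csr` are illustrative and not part of the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings with sparse, non-contiguous IDs.
toy = pd.DataFrame({
    'user_id':  [1, 1, 7, 7, 42],
    'movie_id': [10, 500, 10, 3000, 500],
    'counts':   [5, 3, 4, 5, 3],
})

# pd.factorize gives each distinct ID a 0-based code in order of first appearance
# and returns the array of distinct IDs for mapping back.
user_idx, user_ids = pd.factorize(toy['user_id'])
movie_idx, movie_ids = pd.factorize(toy['movie_id'])

num_user, num_movie = len(user_ids), len(movie_ids)
toy_csr = csr_matrix(
    (toy['counts'].to_numpy(), (user_idx, movie_idx)),
    shape=(num_user, num_movie),
)

print(toy_csr.shape)      # shape now equals (distinct users, distinct movies)
print(movie_ids[2])       # original movie_id stored in column 2
```

If this mapping were adopted, every later lookup (for example `user_factors[6041]` or `item_factors[480]`) would have to use the mapped index instead of the raw ID.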

## Training the MF model

In [18]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# These settings are recommended by the implicit library; they are unrelated to the lesson content.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

In [19]:
# Declare the implicit AlternatingLeastSquares model
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=15, dtype=np.float32)

In [20]:
# The ALS model used here takes an (item x user) matrix as input, so we transpose the data.
csr_data_transpose = csr_data.T
csr_data_transpose

<3953x6042 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Column format>

In [21]:
# Train the model
als_model.fit(csr_data_transpose)

  0%|          | 0/15 [00:00<?, ?it/s]
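
To give an intuition for what alternating least squares does, here is a minimal NumPy sketch of plain, unweighted ALS on a small hypothetical matrix. It is not the confidence-weighted algorithm that `implicit` runs; the helper `als_step` and the toy matrix `R` are assumptions made for illustration. Each step fixes one side (users or items) and solves a ridge-regression problem for the other.

```python
import numpy as np

def als_step(R, fixed, reg):
    """Solve min ||R - X @ fixed.T||^2 + reg * ||X||^2 for X, with `fixed` held constant."""
    k = fixed.shape[1]
    A = fixed.T @ fixed + reg * np.eye(k)          # (k x k) regularized Gram matrix
    return np.linalg.solve(A, fixed.T @ R.T).T     # one row of factors per row of R

rng = np.random.default_rng(0)

# Hypothetical 4 users x 4 items matrix with two visible taste groups.
R = np.array([[5, 4, 0, 0],
              [4, 5, 0, 1],
              [0, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

k, reg = 2, 0.01
U = rng.normal(scale=0.1, size=(R.shape[0], k))    # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))    # item factors

errors = []
for _ in range(15):
    U = als_step(R, V, reg)       # update users with items fixed
    V = als_step(R.T, U, reg)     # update items with users fixed
    errors.append(float(np.linalg.norm(R - U @ V.T)))

print(errors[0], errors[-1])      # reconstruction error after the first and last iteration
```

The `factors`, `regularization`, and `iterations` arguments passed to `AlternatingLeastSquares` above play roles analogous to `k`, `reg`, and the loop count in this sketch.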

In [22]:
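# user_factors / item_factors are indexed by the raw user_id / movie_id used to build csr_data.
# Note: index 480 is one of the movies user 6041 rated (see ratings.tail above), but it is not
# Star Wars: Episode IV, whose movie_id is 260 according to favorite_movie_id.
my_vector, Star_vector = als_model.user_factors[6041], als_model.item_factors[480]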

In [23]:
my_vector

array([-0.5668729 , -0.26596668, -0.27660382,  1.1421404 , -0.6238201 ,
        0.7446186 , -0.552115  ,  0.28068736,  0.14669171,  0.9547543 ,
       -0.1571627 ,  0.46182168, -1.1386071 ,  0.12754671, -0.3464443 ,
        0.2370755 , -0.8451545 , -0.5511128 ,  0.9002868 , -0.85408974,
        0.62913376,  0.11740011, -0.5185248 ,  0.22772552,  0.39829817,
        0.24891564, -0.09849531, -0.5824059 , -0.39052543, -1.391638  ,
        0.04577921, -0.58849585,  0.46753663,  1.2107991 ,  0.35357526,
        0.0021524 , -0.88416356, -0.28420645,  0.18893832, -0.10590921,
        0.14612079,  0.3846305 , -0.6910506 , -0.5684306 ,  0.41699693,
        0.20090479,  0.48203957,  0.08943246, -0.04074716, -0.43543997,
       -0.11716601,  0.2748332 ,  0.361624  , -0.17512707,  0.13059445,
        0.7179673 , -0.05023082, -0.51431924,  0.12921457, -0.948869  ,
        0.55093265, -0.47920746, -0.4145841 ,  0.7978785 , -0.16338061,
       -0.37135825, -0.2879083 , -0.25072774,  0.92677116, -0.26

In [24]:
Star_vector

array([-4.9486957e-03,  9.5384717e-03, -2.9945018e-02,  3.8520224e-02,
        2.7991582e-03,  1.0952486e-02, -7.5015835e-03,  1.7063096e-02,
        3.3514269e-02,  3.0385425e-02,  8.9191459e-03,  6.6439738e-03,
       -3.4399580e-02,  3.4031246e-02, -5.1321806e-03,  2.2953624e-02,
       -2.0303145e-02,  6.5742144e-03,  2.4142561e-02, -5.9828483e-03,
        1.7286777e-02,  1.7038967e-02,  6.9917943e-03,  1.2312734e-02,
        1.5934784e-02,  3.0653495e-03, -3.8548824e-03,  1.5074592e-02,
       -2.2018127e-02, -4.1924935e-02, -1.1022119e-02, -3.1987075e-02,
        1.5299962e-03,  3.7868939e-02,  1.9815190e-02, -1.3590363e-02,
       -3.0097041e-02,  1.3906572e-02,  6.9184485e-03, -1.1711261e-02,
        1.5646558e-02,  1.2289886e-02, -3.5466549e-03, -1.1923822e-02,
        3.0847508e-02,  5.0306246e-02,  9.8784510e-03, -1.0977868e-03,
        2.7092285e-02,  2.8985411e-02,  1.1447594e-02,  2.9324722e-03,
        1.1461173e-02, -1.4165704e-02,  2.0972528e-02,  1.7944995e-02,
      

In [25]:
np.dot(my_vector, Star_vector)

0.7257263

In [26]:
toystory_vector = als_model.item_factors[1]
np.dot(my_vector, toystory_vector)

0.092919044
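
The two dot products above score one item at a time. As a minimal sketch with hypothetical 2-dimensional factors (the arrays below are made up, not taken from the trained model), the same idea extends to scoring every item for a user with a single matrix-vector product and ranking the results:

```python
import numpy as np

# Hypothetical factors: 2 users and 3 items in a 2-dimensional latent space.
user_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
item_factors = np.array([[0.9, 0.1],
                         [0.2, 0.8],
                         [0.5, 0.5]])

# Preference scores of user 0 for all items at once.
scores = item_factors @ user_factors[0]

# Each entry equals the single-item dot product used in the cells above.
assert np.isclose(scores[1], np.dot(user_factors[0], item_factors[1]))

# Item indices sorted from highest to lowest predicted preference.
ranking = np.argsort(-scores)
print(scores, ranking)
```

A higher score means the model predicts a stronger preference, which is why the rated movie scores far above Toy Story in the cells above.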

## Finding similar movies

In [27]:
favorite_movie = 'Star Wars: Episode IV - A New Hope (1977)'
# Look up the movie_id by title, then ask the model for the 15 most similar items.
movie_id = movies[movies['title']==favorite_movie]['movie_id']
similar_movie = als_model.similar_items(movie_id.values[0], N=15)
similar_movie

[(260, 0.9999999),
 (1196, 0.8737084),
 (1210, 0.73525),
 (1198, 0.7068786),
 (1214, 0.45851356),
 (2628, 0.45318183),
 (1270, 0.44791123),
 (1097, 0.44083995),
 (1240, 0.4351123),
 (1197, 0.41813955),
 (2571, 0.4049188),
 (1291, 0.39002144),
 (2887, 0.37051708),
 (1843, 0.36754793),
 (1387, 0.3648739)]
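
The scores above behave like cosine similarities between item vectors: the query movie matches itself with a score of about 1. The following is a minimal NumPy sketch of that computation on hypothetical vectors; `cosine_similar_items` and `toy_items` are illustrative names, and this is an assumption about how such scores can be computed rather than a copy of the library's code.

```python
import numpy as np

def cosine_similar_items(item_factors, item_idx, n=2):
    """Return the n items with the highest cosine similarity to item_idx."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1e-10                      # guard against all-zero vectors
    scores = item_factors @ item_factors[item_idx] / (norms * norms[item_idx])
    top = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in top]

# Hypothetical item vectors: item 1 points almost the same way as item 0,
# item 2 is orthogonal to it, and item 3 is an all-zero (never rated) item.
toy_items = np.array([[1.0, 0.0],
                      [2.0, 0.1],
                      [0.0, 1.0],
                      [0.0, 0.0]])

result = cosine_similar_items(toy_items, 0, n=2)
print(result)
```

Because cosine similarity ignores vector length, item 1 ranks right behind item 0 even though its vector is twice as long.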

In [28]:
movies[movies['movie_id'].isin([s[0] for s in similar_movie])]

Unnamed: 0,movie_id,title,genre
257,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi
1081,1097,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
1178,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War
1179,1197,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance
1180,1198,Raiders of the Lost Ark (1981),Action|Adventure
1192,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War
1196,1214,Alien (1979),Action|Horror|Sci-Fi|Thriller
1220,1240,"Terminator, The (1984)",Action|Sci-Fi|Thriller
1250,1270,Back to the Future (1985),Comedy|Sci-Fi
1271,1291,Indiana Jones and the Last Crusade (1989),Action|Adventure


## Favorite Movie

In [29]:
user = 6041
movie_recommended = als_model.recommend(user, csr_data, N=20, filter_already_liked_items=True)
movie_recommended

[(589, 0.63394773),
 (1221, 0.51722723),
 (2916, 0.4129526),
 (110, 0.39596915),
 (260, 0.365498),
 (1210, 0.3244884),
 (1196, 0.32323045),
 (1573, 0.3201874),
 (1240, 0.29441845),
 (2023, 0.28975818),
 (3793, 0.2719371),
 (2028, 0.26997566),
 (1527, 0.2653489),
 (780, 0.25805435),
 (3175, 0.25128326),
 (1544, 0.2278373),
 (2628, 0.22755808),
 (1965, 0.22680652),
 (1198, 0.22500317),
 (2529, 0.22149979)]

In [30]:
movies[movies['movie_id'].isin([m[0] for m in movie_recommended])]

Unnamed: 0,movie_id,title,genre
108,110,Braveheart (1995),Action|Drama|War
257,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi
585,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller
770,780,Independence Day (ID4) (1996),Action|Sci-Fi|War
1178,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War
1180,1198,Raiders of the Lost Ark (1981),Action|Adventure
1192,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War
1203,1221,"Godfather: Part II, The (1974)",Action|Crime|Drama
1220,1240,"Terminator, The (1984)",Action|Sci-Fi|Thriller
1491,1527,"Fifth Element, The (1997)",Action|Sci-Fi


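I built a recommender system using an MF model.
First, I chose five movies as my initial preferences.
The preference score computed for `Star_vector` (item index 480) was 0.7257263, while the score for Toy Story (item index 1) was much lower at 0.092919044.

Based on this model, I then retrieved the movies most similar to Star Wars.
Star Wars is tagged Action|Adventure|Fantasy|Sci-Fi, and the similar movies returned mostly share these genres (for example, Episodes V and VI, Raiders of the Lost Ark, and Alien).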
