## Required Modules

<br>

- matplotlib >= 3.1.1
- mplcursors >= 0.5.1
    - To install from within a notebook, run: **!pip install mplcursors**
- numpy >= 1.21.5
- pandas >= 0.25.1
- scikit-learn (imported as sklearn) >= 0.21.3
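As a quick way to compare an environment against the list above, the following sketch prints the installed version of each package using only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names are assumed to be the ones published on PyPI (sklearn is distributed as scikit-learn).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (an assumption of this sketch).
REQUIRED = ["matplotlib", "mplcursors", "numpy", "pandas", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

The sketch only reports versions; comparing them against the minimums listed above is left to the reader.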

## Import Modules

In [1]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mplcursors # Used to create a cursor-interactive plot with "%matplotlib notebook"
from sklearn.decomposition import NMF # Used to train Non-negative Matrix Factorization
from sklearn.utils.extmath import randomized_svd # Used to compute the Singular Value Decomposition
from sklearn.manifold import TSNE # Used to fit a t-SNE manifold embedding

plt.style.use('ggplot') # You can also use a different style

# For static plots only, use this option instead
# %matplotlib inline

# For interactive plots
# With this option, the plot stays at the position where it was first drawn
%matplotlib notebook

warnings.filterwarnings('ignore')

## Data Loading

<br>

> ### File description : MovieLens-100K
- movies.csv : list of movies (9742 movies)
- ratings.csv : list of ratings given by users (610 users, 100,836 ratings)

In [2]:
# Use a name other than `dir` so the Python built-in is not shadowed
data_dir = 'C:/Users/JinnyeongHeo/Desktop/2022-1/추천시스템/RS_Assignment1 (1)/RS_Assignment1/MovieLens100K/'
df_ratings = pd.read_csv(data_dir + 'ratings.csv', usecols=['userId', 'movieId', 'rating'])
df_movies = pd.read_csv(data_dir + 'movies.csv', usecols=['movieId', 'title', 'genres']) # for title-matching

In [3]:
df_movies = pd.merge(df_movies, df_ratings, on='movieId')
df_movies = df_movies.drop(['userId', 'rating'],axis=1)
df_movies = df_movies.set_index(keys=['movieId'])
df_movies = df_movies.loc[~df_movies.index.duplicated(keep='first')]
df_movies = df_movies.sort_index(ascending=True)
df_movies = np.array(df_movies)
df_movies = pd.DataFrame(df_movies)
print(df_movies)

                                              0  \
0                              Toy Story (1995)   
1                                Jumanji (1995)   
2                       Grumpier Old Men (1995)   
3                      Waiting to Exhale (1995)   
4            Father of the Bride Part II (1995)   
...                                         ...   
9719  Black Butler: Book of the Atlantic (2017)   
9720               No Game No Life: Zero (2017)   
9721                               Flint (2017)   
9722        Bungo Stray Dogs: Dead Apple (2018)   
9723        Andrew Dice Clay: Dice Rules (1991)   

                                                1  
0     Adventure|Animation|Children|Comedy|Fantasy  
1                      Adventure|Children|Fantasy  
2                                  Comedy|Romance  
3                            Comedy|Drama|Romance  
4                                          Comedy  
...                                           ...  
9719              Actio

## Simple EDA (Exploratory Data Analysis)

<br>

- Before starting, let's perform a simple analysis of the given dataset

In [4]:
# Count the unique users and unique movies
n_users = len(df_ratings['userId'].unique())
n_movies = len(df_ratings['movieId'].unique())

n_users, n_movies # Confirms that 610 users rated 9724 movies

(610, 9724)

In [5]:
len(df_movies) # len(df_movies) - n_movies movies have no ratings (0 here, since df_movies was merged with the ratings above)

9724
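The counts above also show how sparse the rating data is. A minimal sketch, assuming each (user, movie) pair appears at most once among the 100,836 ratings:

```python
# Counts reported in the file description and the cells above.
n_users, n_movies, n_ratings = 610, 9724, 100_836

n_cells = n_users * n_movies      # size of the full movie-by-user grid
density = n_ratings / n_cells     # fraction of cells that hold a rating

print(f"cells: {n_cells:,}")
print(f"density: {density:.2%}, sparsity: {1 - density:.2%}")
# density: 1.70%, sparsity: 98.30%
```

Under that assumption, roughly 98% of the grid is empty.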

## Generate Utility Matrix $A$

<br>

> ### Problem 1 (2 points)
1. Generate a utility matrix $A$ by using **df_ratings** (store rating values with unique 'movieId' and 'userId')
2. Within a matrix $A$, replace NaN values (unknown ratings) with 0 (zero value)
3. Convert the utility matrix $A$ to **numpy array**
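The three steps can be sketched on a toy table first. The data below is hypothetical and only mimics the long format of ratings.csv; rows are movies and columns are users.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# Step 1: pivot to a movie-by-user table (unknown pairs become NaN).
# Step 2: replace NaN with 0.
# Step 3: convert to a numpy array.
toy_A = (
    toy_ratings
    .pivot_table(values="rating", index="movieId", columns="userId")
    .fillna(0)
    .to_numpy()
)

print(toy_A.shape)  # (3, 3)
print(toy_A)
```

Note that after step 2 a 0 stands for "unknown", not for a rating of zero, which matters later when errors are computed.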

In [6]:
"""
# The utility matrix has shape (n_movies, n_users)
# That is, each row of the utility matrix is a movie and each column is a user
"""
# your code here

A = df_ratings.pivot_table(
    values = 'rating',
    index = 'movieId',
    columns = 'userId'
).fillna(0).to_numpy()

# Check that the utility matrix was created with the correct shape
print(A.shape)
print(A)
print(type(A))

(9724, 610)
[[4.  0.  0.  ... 2.5 3.  5. ]
 [0.  0.  0.  ... 2.  0.  0. ]
 [4.  0.  0.  ... 2.  0.  0. ]
 ...
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]]
<class 'numpy.ndarray'>


## Training SVD Model

<br>

> ### Problem 2 (2 points)
1. Decompose utility matrix $A$ into three matrices $U$, $\Sigma$, and $V^T$ by training an SVD model (you can use the randomized_svd() function provided by scikit-learn)
    - Refer to: https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.randomized_svd.html

2. After SVD training is complete, take the dot product of $U$, $\Sigma$, and $V^T$ to obtain the matrix $A_{approx_{svd}}$ that approximates matrix $A$
    - Note that **$\Sigma$ should be a diagonal matrix**, not a vector
    - Before computing $A_{approx_{svd}}$, you will need to transform the $\Sigma$ returned by randomized_svd() into a diagonal matrix
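The vector-to-diagonal step can be illustrated on a small matrix. This sketch uses NumPy's exact `np.linalg.svd` in place of randomized_svd() so that it runs without scikit-learn; both return the singular values as a 1-D vector. The matrix `M` and the names `U_toy`, `VT_toy`, and `Sigma_k` are hypothetical.

```python
import numpy as np

# Hypothetical 4x3 matrix standing in for the utility matrix.
M = np.array([
    [4.0, 0.0, 5.0],
    [3.5, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 4.0],
])

# Thin SVD: s is a 1-D vector of singular values, sorted in descending order.
U_toy, s, VT_toy = np.linalg.svd(M, full_matrices=False)

k = 2                                  # number of factors to keep
Sigma_k = np.diag(s[:k])               # (k,) vector -> (k, k) diagonal matrix
M_approx = U_toy[:, :k] @ Sigma_k @ VT_toy[:k, :]

print(s.shape, Sigma_k.shape, M_approx.shape)  # (3,) (2, 2) (4, 3)
```

Multiplying by the raw vector instead of `np.diag(...)` would broadcast rather than perform the matrix product, which is why the conversion is needed.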

In [7]:
"""
# Train an SVD model with scikit-learn's randomized_svd() to obtain U, Sigma, and VT
# k is a hyperparameter that sets the number of features/factors in the decomposed matrices
# randomized_svd() accepts several other arguments (see the reference link above for details)
# The decomposed matrices U, Sigma, and VT have shapes (n_movies, k), (k,), and (k, n_users)
# Taking the dot product of the decomposed matrices yields an approximation with the same (n_movies, n_users) shape as the original utility matrix
"""

# your code here

k = 100
U, Sigma, VT = randomized_svd(A, n_components = k)

# Check that the decomposed matrices have the correct shapes
print(U.shape, Sigma.shape, VT.shape)

A_approx_svd = np.dot(np.dot(U, np.diag(Sigma)),VT)

# Check that the approximation matrix has the correct shape
print(A_approx_svd.shape)

(9724, 100) (100,) (100, 610)
(9724, 610)


## Training NMF Model

<br>

> ### Problem 3 (2 points)
1. Decompose utility matrix $A$ into two matrices $W$ and $H$ by training NMF model (using NMF()).
    - Refer to: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF

2. After training NMF, perform dot-product of $W$ and $H$ to obtain the matrix $A_{approx_{nmf}}$ that approximates matrix $A$
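To show what the factorization produces, here is a minimal NumPy sketch of NMF using multiplicative updates. It is not scikit-learn's NMF() implementation, only an illustration of the same idea: find non-negative $W$ and $H$ whose product approximates the input. The function name `nmf_multiplicative`, the toy matrix `X`, and all parameter defaults are assumptions of this sketch.

```python
import numpy as np


def nmf_multiplicative(X, k, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative X (m x n) into W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Element-wise multiplicative updates keep W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical 4x3 non-negative matrix standing in for the utility matrix.
X = np.array([
    [4.0, 0.0, 5.0],
    [3.5, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 4.0],
])

W_toy, H_toy = nmf_multiplicative(X, k=2)
X_approx = W_toy @ H_toy

print(W_toy.shape, H_toy.shape, X_approx.shape)  # (4, 2) (2, 3) (4, 3)
print("smallest entries:", W_toy.min(), H_toy.min())
```

Unlike SVD, the factors never contain negative values, so the approximated ratings are never negative either.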

In [8]:
"""
# Train an NMF model with scikit-learn's NMF() to obtain W and H
# As with SVD, k is a hyperparameter that sets the number of features/factors in the decomposed matrices
# NMF() accepts several other arguments (see the reference link above for details)
# W and H are available after calling .fit(data); their shapes are (n_movies, k) and (k, n_users)
# Taking the dot product of the decomposed matrices yields an approximation with the same (n_movies, n_users) shape as the original utility matrix
"""

# your code here

k = 100
model_nmf = NMF(n_components = k, init = 'random', random_state = 42, max_iter = 100, l1_ratio = 0.2)
model_nmf.fit(A)
W = model_nmf.transform(A)
print(type(W))
H = model_nmf.components_

# Check that the decomposed matrices have the correct shapes
print(W.shape, H.shape)

A_approx_nmf = np.dot(W, H)

# Check that the approximation matrix has the correct shape
print(A_approx_nmf.shape)

<class 'numpy.ndarray'>
(9724, 100) (100, 610)
(9724, 610)


## Compute loss by implementing a custom function

> ### Problem 4 (2 points)
1. Implement **compute_error(actual, prediction)** function that takes matrices 'actual' and 'prediction' as input parameters.
    1. Ignore zero values in the actual matrix
    2. Compute SSE and RMSE
    3. Return SSE and RMSE

<br>

2. Implement **compute_error_all(actual, prediction)** function that takes matrices 'actual' and 'prediction' as input parameters.
    1. Do not ignore zero values in the actual matrix (compute all values)
    2. Compute SSE and RMSE
    3. Return SSE and RMSE
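The difference between the two variants is easiest to see on a tiny example. In both cases SSE is the sum of squared differences and RMSE is $\sqrt{SSE / n}$; only the set of entries counted in $n$ changes. The matrices below are hypothetical.

```python
import numpy as np

actual = np.array([[5.0, 0.0],
                   [0.0, 3.0]])
pred = np.array([[4.0, 1.0],
                 [2.0, 3.0]])

diff = actual - pred          # [[1, -1], [-2, 0]]
mask = actual != 0            # True only where a rating is known

# Variant 1: known ratings only (n = number of non-zero entries = 2).
sse_known = np.sum(diff[mask] ** 2)             # 1 + 0 = 1
rmse_known = np.sqrt(sse_known / mask.sum())    # sqrt(1 / 2)

# Variant 2: every entry (n = total number of entries = 4).
sse_all = np.sum(diff ** 2)                     # 1 + 1 + 4 + 0 = 6
rmse_all = np.sqrt(sse_all / actual.size)       # sqrt(6 / 4)

print(f"known only: SSE = {sse_known}, RMSE = {rmse_known:.4f}")  # RMSE = 0.7071
print(f"all values: SSE = {sse_all}, RMSE = {rmse_all:.4f}")      # RMSE = 1.2247
```

The second variant penalizes every non-zero prediction for an unrated movie, which is usually not what a recommender should be judged on.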

In [9]:
"""
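# Compute the error between the actual utility matrix A and the SVD approximation A_approx_svd
# Compute the error between the actual utility matrix A and the NMF approximation A_approx_nmf
# Implement compute_error(actual, prediction) and compute_error_all(actual, prediction) to compute the error values
# Implement the functions using the formulas covered in the lecture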
"""

# your code here

def compute_error(actual, prediction):
    # Exclude entries that are 0 in the `actual` matrix from the error computation.
    prediction = prediction[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    
    sse = np.sum((actual - prediction) ** 2)
    rmse = np.sqrt((sse / len(actual)))
    
    return sse, rmse

def compute_error_all(actual, prediction):
    # Include the entries that are 0 in the `actual` matrix when computing the error.
    sse = np.sum((actual - prediction) ** 2)
    # Average over every entry of the matrix; len(actual) would count only the rows.
    rmse = np.sqrt(sse / actual.size)
    
    return sse, rmse

In [10]:
sse, rmse = compute_error(A, A_approx_svd)
print(f"SVD Error(ignoring zero values): SSE = {sse:.2f}, RMSE = {rmse:.4f}")
sse, rmse = compute_error(A, A_approx_nmf)
print(f"NMF Error(ignoring zero values): SSE = {sse:.2f}, RMSE = {rmse:.4f}")

print('\n')

sse, rmse = compute_error_all(A, A_approx_svd)
print(f"SVD Error(including all zero values): SSE = {sse:.2f}, RMSE = {rmse:.4f}")
sse, rmse = compute_error_all(A, A_approx_nmf)
print(f"NMF Error(including all zero values): SSE = {sse:.2f}, RMSE = {rmse:.4f}")

SVD Error(ignoring zero values): SSE = 252565.85, RMSE = 1.5826
NMF Error(ignoring zero values): SSE = 335944.53, RMSE = 1.8253


SVD Error(including all zero values): SSE = 380170.22, RMSE = 0.2532
NMF Error(including all zero values): SSE = 428826.82, RMSE = 0.2689


## Predict missing (unknown) values in utility matrix $A$ for a specific user


> ### Problem 5 (2 points)

- Define a makePredictions(actual, pred, user) function that builds dataframes for comparing actual and predicted ratings
- makePredictions() uses user (the user number, i.e. the user index) to extract the rated (seen) movies and the non-rated (unseen) movies from actual and pred
- Then concat the two extracted dataframes with the df_movies defined earlier
- Column mismatches may occur, so first concat the dataframes with all columns, then drop the unneeded 'movieId' column
- Next, define the list of movies the user has actually seen as rated_movies, and sort it by rating in descending order
- Likewise, define the list of movies without a rating (the movies whose ratings are to be predicted) as unrated_movies, and sort it by the dataframe index in ascending order

- .head() and .tail() show the start and end of a dataframe, but there is no built-in function for viewing the middle
- Therefore, define a function findMiddle(dataframe) that uses indexing to show the middle of a dataframe
- findMiddle() returns the middle 10 rows of the dataframe
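The splitting, sorting, and "middle rows" steps described above can be sketched on a toy dataframe. The data is hypothetical, and `middle_rows` is an illustrative stand-in for findMiddle() with an adjustable row count, not the required solution.

```python
import pandas as pd

# Hypothetical ratings and predictions for one user (0 means "not rated").
toy = pd.DataFrame({
    "rating":     [5.0, 0.0, 3.0, 0.0, 0.0, 4.0],
    "prediction": [4.2, 1.1, 2.9, 0.3, 2.5, 3.8],
    "title":      ["A", "B", "C", "D", "E", "F"],
})

# Seen movies, highest actual rating first.
rated = toy[toy["rating"] > 0].sort_values(by="rating", ascending=False)
# Unseen movies, ordered by dataframe index.
unrated = toy[toy["rating"] == 0].sort_index(ascending=True)


def middle_rows(df, n=10):
    """Return up to n rows centred on the middle of df."""
    start = max(len(df) // 2 - n // 2, 0)
    return df.iloc[start:start + n]


print(rated["title"].tolist())                   # ['A', 'F', 'C']
print(unrated["title"].tolist())                 # ['B', 'D', 'E']
print(middle_rows(toy, n=2)["title"].tolist())   # ['C', 'D']
```

Using integer division for the slice bounds keeps them valid for `.iloc` whether the dataframe has an even or odd number of rows.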

In [11]:
# your code here

def makePredictions(actual, pred, user):
    # Build the intermediate dataframe from which rated_movies and unrated_movies are derived (about 4-5 lines).
    user -= 1  # convert the 1-based user number to a 0-based column index
    rated_items_df = pd.DataFrame(actual).iloc[:,user]
    user_pred_df = pd.DataFrame(pred).iloc[:, user]
    total_df = pd.concat([rated_items_df, user_pred_df, df_movies], axis=1)
    total_df.columns = ['rating', 'prediction', 'title', 'genre']
#     total_df = total_df.drop('movieId', axis = 1)
    
    # Movies the user has actually seen (movies with a rating), highest rating first
    rated_movies = total_df.where(total_df['rating'] > 0)
    rated_movies = rated_movies.dropna()
    rated_movies = rated_movies.sort_values(by = 'rating', ascending = False)
    
    
    # Movies to predict (movies without a rating), highest predicted rating first
    unrated_movies = total_df[total_df['rating'] == 0]
    unrated_movies = unrated_movies.sort_values(by='prediction',ascending = False)
    
    
    return rated_movies, unrated_movies


def findMiddle(dataframe):
    # Return the middle part of the dataframe.
    # The result is the 10 rows around the midpoint (as a dataframe).
    # Integer division keeps the slice bounds as ints for both even and odd lengths.
    mid = len(dataframe) // 2
    start = max(mid - 5, 0)
    return dataframe.iloc[start : start + 10]
    
    

In [12]:
"""
# Pick one arbitrary user and pass the SVD and NMF approximation matrices to the functions written above
"""

# your code here

# Choose an arbitrary user (an integer)
userNumber = 10

prediction_with_rated_svd, prediction_with_unrated_svd = makePredictions(A, A_approx_svd, userNumber) # Compare the actual utility matrix with the SVD approximation
prediction_with_rated_nmf, prediction_with_unrated_nmf = makePredictions(A, A_approx_nmf, userNumber) # Compare the actual utility matrix with the NMF approximation

In [13]:
prediction_with_rated_svd.head(10)

Unnamed: 0,rating,prediction,title,genre
8988,5.0,1.297754,The Intern (2015),Comedy
5321,5.0,1.115703,First Daughter (2004),Comedy|Romance
7937,5.0,1.318179,Skyfall (2012),Action|Adventure|Thriller|IMAX
7750,5.0,1.742503,"Dark Knight Rises, The (2012)",Action|Adventure|Crime|IMAX
4939,5.0,1.145617,Troy (2004),Action|Adventure|Drama|War
7449,5.0,2.78399,"King's Speech, The (2010)",Drama
5217,5.0,2.494372,"Notebook, The (2004)",Drama|Romance
7354,5.0,1.192193,Despicable Me (2010),Animation|Children|Comedy|Crime
7139,5.0,0.601279,"Education, An (2009)",Drama|Romance
5901,5.0,1.213946,Batman Begins (2005),Action|Crime|IMAX


In [14]:
prediction_with_rated_nmf.head(10)

Unnamed: 0,rating,prediction,title,genre
8988,5.0,2.118948,The Intern (2015),Comedy
5321,5.0,1.086151,First Daughter (2004),Comedy|Romance
7937,5.0,1.133251,Skyfall (2012),Action|Adventure|Thriller|IMAX
7750,5.0,1.525531,"Dark Knight Rises, The (2012)",Action|Adventure|Crime|IMAX
4939,5.0,0.649875,Troy (2004),Action|Adventure|Drama|War
7449,5.0,3.135962,"King's Speech, The (2010)",Drama
5217,5.0,2.297666,"Notebook, The (2004)",Drama|Romance
7354,5.0,0.695306,Despicable Me (2010),Animation|Children|Comedy|Crime
7139,5.0,0.686772,"Education, An (2009)",Drama|Romance
5901,5.0,1.096237,Batman Begins (2005),Action|Crime|IMAX


In [15]:
prediction_with_unrated_svd.head(10)

Unnamed: 0,rating,prediction,title,genre
7338,0.0,1.914445,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX
3609,0.0,1.765128,Ocean's Eleven (2001),Crime|Thriller
6868,0.0,1.616503,Slumdog Millionaire (2008),Crime|Drama|Romance
3785,0.0,1.595055,My Big Fat Greek Wedding (2002),Comedy|Romance
3849,0.0,1.564519,"Bourne Identity, The (2002)",Action|Mystery|Thriller
7058,0.0,1.55333,(500) Days of Summer (2009),Comedy|Drama|Romance
6225,0.0,1.53567,Little Miss Sunshine (2006),Adventure|Comedy|Drama
7026,0.0,1.520929,"Hangover, The (2009)",Comedy|Crime
5955,0.0,1.491053,"40-Year-Old Virgin, The (2005)",Comedy|Romance
4421,0.0,1.483672,Pirates of the Caribbean: The Curse of the Bla...,Action|Adventure|Comedy|Fantasy


In [16]:
prediction_with_unrated_nmf.head(10)

Unnamed: 0,rating,prediction,title,genre
7058,0.0,2.700044,(500) Days of Summer (2009),Comedy|Drama|Romance
6203,0.0,2.415563,"Devil Wears Prada, The (2006)",Comedy|Drama
7181,0.0,2.269998,"Blind Side, The (2009)",Drama
7970,0.0,2.259506,"Perks of Being a Wallflower, The (2012)",Drama|Romance
7026,0.0,2.152597,"Hangover, The (2009)",Comedy|Crime
7587,0.0,2.150168,Bridesmaids (2011),Comedy
8005,0.0,2.02845,Silver Linings Playbook (2012),Comedy|Drama
6926,0.0,2.016952,Marley & Me (2008),Comedy|Drama
4421,0.0,1.9537,Pirates of the Caribbean: The Curse of the Bla...,Action|Adventure|Comedy|Fantasy
7355,0.0,1.9463,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX


In [17]:
prediction_with_unrated_svd.tail(10)

Unnamed: 0,rating,prediction,title,genre
46,0.0,-0.825656,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
594,0.0,-0.835599,Twister (1996),Action|Adventure|Romance|Thriller
619,0.0,-0.844744,"Cable Guy, The (1996)",Comedy|Thriller
1082,0.0,-0.953032,Scream (1996),Comedy|Horror|Mystery|Thriller
1733,0.0,-0.958937,American History X (1998),Crime|Drama
2096,0.0,-0.972496,Airplane! (1980),Comedy
197,0.0,-0.972957,Dumb & Dumber (Dumb and Dumber) (1994),Adventure|Comedy
33,0.0,-0.9868,Dead Man Walking (1995),Crime|Drama
31,0.0,-1.04963,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
461,0.0,-1.069198,Schindler's List (1993),Drama|War


In [18]:
prediction_with_unrated_nmf.tail(10)

Unnamed: 0,rating,prediction,title,genre
5039,0.0,0.0,"Duck, You Sucker (1971)",Action|Western
5038,0.0,0.0,"Last Man on Earth, The (Ultimo uomo della Terr...",Drama|Horror|Sci-Fi
5037,0.0,0.0,Pat Garrett and Billy the Kid (1973),Western
1071,0.0,0.0,Citizen Ruth (1996),Comedy|Drama
5035,0.0,0.0,Highway 61 (1991),Comedy
5033,0.0,0.0,The Plague of the Zombies (1966),Horror
5031,0.0,0.0,Tremors 3: Back to Perfection (2001),Comedy|Horror|Sci-Fi
5029,0.0,0.0,"Legend, The (Legend of Fong Sai-Yuk, The) (Fon...",Action|Comedy
5028,0.0,0.0,Lammbock (2001),Comedy
4831,0.0,0.0,Pumping Iron (1977),Documentary


In [19]:
findMiddle(prediction_with_unrated_svd)

Unnamed: 0,rating,prediction,title,genre
5694,0.0,-0.001073,Saints and Soldiers (2003),Action|Adventure|Drama|War
6067,0.0,-0.001073,Mozart and the Whale (2005),Comedy|Drama|Romance
3879,0.0,-0.001073,Like Mike (2002),Children|Comedy|Fantasy
6428,0.0,-0.001073,Pride (2007),Drama
5983,0.0,-0.001073,"Man, The (2005)",Action|Comedy|Crime
7490,0.0,-0.001086,Amer (2009),Drama|Horror
2136,0.0,-0.001178,"Minus Man, The (1999)",Drama|Mystery
891,0.0,-0.001222,"Thin Blue Line, The (1988)",Documentary
194,0.0,-0.001232,Destiny Turns on the Radio (1995),Comedy
854,0.0,-0.001268,Drop Dead Fred (1991),Comedy|Fantasy


In [20]:
findMiddle(prediction_with_unrated_nmf)

Unnamed: 0,rating,prediction,title,genre
9557,0.0,0.0,Obsession (1965),Comedy
9565,0.0,0.0,Gena the Crocodile (1969),Animation|Children
9558,0.0,0.0,"Oh, Hello: On Broadway (2017)",Comedy
9559,0.0,0.0,Good Time (2017),Crime|Drama
770,0.0,0.0,Old Yeller (1957),Children|Drama
771,0.0,0.0,"Parent Trap, The (1961)",Children|Comedy|Romance
6610,0.0,0.0,Hitman (2007),Action|Crime|Thriller
9561,0.0,0.0,The House (2017),Comedy
6652,0.0,0.0,First Sunday (2008),Comedy|Crime
9529,0.0,0.0,Final Flight of the Osiris (2003),Action|Animation|Sci-Fi
