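### Goal: a movie recommendation system
### Use cases:
1. New users have no rating history, so they are first asked to pick a few movies they like; the system then finds movies whose rating patterns are similar to those picks.
2. For existing users, take the movies they rated positively in the past (positive-feedback samples) and find movies whose rating patterns are similar to them.

Example: once a new user has picked a movie A they like, use the other users' ratings to look for a movie that people who like A also like, and that people who dislike A also dislike.
In other words, the ratings given by other users can be treated as the feature values of a movie.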
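### Algorithm
1. From the historical rating data, build an `m*n` co-occurrence matrix with users (m in total) as rows and items (n in total) as columns.
2. Compute the similarity between every pair of column vectors of the co-occurrence matrix to obtain an `n*n` item similarity matrix.
3. Collect the list of positive-feedback items from the user's behavior history; a new user must first enter preferred items, which serve as the positive-feedback samples.
4. Using the similarity matrix, find the Top K items most similar to the positive-feedback items in the target user's history; these form the set of similar items.
5. Sort the items in that set by similarity to produce the final recommendation list.

**note: if an item is similar to several positive-feedback items in the user's history, its similarity should be the sum of those individual similarities, indicating that the item is very close to the user's set of positive-feedback items**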
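
The steps above can be sketched on a toy matrix. This is only an illustration under stated assumptions: the ratings are made up, missing ratings are encoded as 0, plain cosine similarity stands in for the adjusted cosine used later in this notebook, and the Top K selection is applied once to the summed scores rather than per liked item. The names `item_similarity` and `recommend` are hypothetical helpers.

```python
import numpy as np

# Toy user x item rating matrix (m = 4 users as rows, n = 4 items as columns).
# 0 means "not rated"; all values are made up for illustration.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def item_similarity(R):
    """Step 2: cosine similarity between every pair of item columns (n x n)."""
    norms = np.linalg.norm(R, axis=0)
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # an item should not recommend itself
    return sim

def recommend(sim, liked_items, k=2):
    """Steps 3-5: add up the similarities to every liked item, then rank."""
    scores = sim[liked_items].sum(axis=0)  # the note above: similarities are summed
    scores[liked_items] = -np.inf          # never recommend an already liked item
    return np.argsort(-scores)[:k]

sim = item_similarity(R)
top = recommend(sim, liked_items=[0], k=2)
print(top)
```

Users who like item 0 in this toy data also like item 1, so item 1 is ranked first.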
### Model (ItemCF architecture):
Similarity is computed with `Adjusted Cosine Similarity`

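### Drawbacks:
1. Popular items have a strong head effect and tend to be similar to a large number of items, whereas tail items have sparse feature vectors, are rarely similar to other items, and are therefore rarely recommended.
2. Collaborative filtering uses only user-item interaction data; it cannot effectively incorporate user features, item features, or context features such as the user's age and gender, the product description and category, or the current time.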
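
A small sketch of the first drawback, assuming a made-up rating matrix in which item 0 is popular and items 2 and 3 are tail items. It only counts how many users rated both items of each pair; a pair with no shared raters has no usable similarity.

```python
import numpy as np

# Toy rating matrix (users x items); NaN = not rated. Values are made up.
R = np.array([
    [5.0, 4.0,    np.nan, np.nan],
    [4.0, 5.0,    np.nan, np.nan],
    [3.0, np.nan, 2.0,    np.nan],
    [2.0, 1.0,    np.nan, np.nan],
    [1.0, np.nan, np.nan, 5.0],
])

rated = (~np.isnan(R)).astype(int)   # 1 where a rating exists
co_counts = rated.T @ rated          # number of users who rated both items
np.fill_diagonal(co_counts, 0)

# For each item: with how many other items does it share at least one rater?
neighbours = (co_counts > 0).sum(axis=0)
print(co_counts)
print(neighbours)
```

The popular item shares raters with every other item, while each tail item overlaps only with the popular one, and then through a single user.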

In [1]:
import pandas as pd
from utils import grid_df_display, check_nan_by_row
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.notebook import tqdm

import scipy.spatial.distance as ssd
from scipy.sparse import issparse
import numpy as np

In [2]:
titles = pd.read_csv('./Data/movies.csv')
ratings = pd.read_csv('./Data/ratings.csv')

In [3]:
grid_df_display(list_df = [titles, ratings], 
                list_df_name = ['Movie Attributes', 'Rating'],
                list_number_of_data = [10, 10], row = 1, col = 2, fill = 'col')

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
5,1,1088,4.0,1147868495
6,1,1175,3.5,1147868826
7,1,1217,3.5,1147878326
8,1,1237,5.0,1147868839
9,1,1250,4.0,1147868414


In [4]:
titles.info()
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB


In [5]:
check_nan_by_row(titles, titles.columns)

There is no NaN in dataframe.


In [6]:
check_nan_by_row(ratings, ratings.columns)

There is no NaN in dataframe.


In [7]:
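# Extract the four-digit release year that directly precedes a closing parenthesis, e.g. "(1995)"
titles['year'] = titles['title'].str.extract(r'(\d{4})(?=\))').astype(float)
# Drop titles without a year (NaN is the only value for which year != year)
titles = titles.drop(titles.query("year!=year").index).reset_index(drop=True)
# Keep movies released in 2014 or later, then shuffle the rows reproducibly
titles = titles.query("year>=2014")
titles = titles.sample(frac=1, replace=False, random_state=2021).reset_index(drop=True)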
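
To see what the year extraction does, here is a minimal sketch on three made-up rows. It assumes `expand=False` so that `str.extract` returns a Series, and uses `dropna` as an equivalent way of removing rows without a year.

```python
import pandas as pd

sample = pd.DataFrame({'title': ['2001: A Space Odyssey (1968)',
                                 'Arrival (2016)',
                                 'No Year Listed']})

# Raw string: \d and \) reach the regex engine unchanged.
# The lookahead (?=\)) only accepts four digits that are followed by ')',
# so the leading "2001" in the first title is skipped.
sample['year'] = sample['title'].str.extract(r'(\d{4})(?=\))', expand=False).astype(float)

sample = sample.dropna(subset=['year'])   # rows without a year are removed
recent = sample.query('year >= 2014').reset_index(drop=True)
print(recent)
```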

In [8]:
titles

Unnamed: 0,movieId,title,genres,year
0,166369,Sour Grapes (2016),Documentary,2016.0
1,193253,Active Measures (2018),Documentary,2018.0
2,196805,Close Range Love (2014),Drama|Romance,2014.0
3,192757,As Seen by the Rest (2014),Drama|Romance|Thriller,2014.0
4,176343,A Movie Life (2017),Drama,2017.0
...,...,...,...,...
12804,151319,Notias (2016),Comedy|Drama,2016.0
12805,133945,Deep Web (2015),Documentary,2015.0
12806,175915,Między nami dobrze jest (2014),(no genres listed),2014.0
12807,202273,The Challenger Disaster (2019),Drama,2019.0


## Similarity
### Pearson Correlation Coefficient
${\displaystyle r={\frac {\sum \limits _{i=1}^{n}(X_{i}-{\overline {X}})(Y_{i}-{\overline {Y}})}{{\sqrt {\sum \limits _{i=1}^{n}(X_{i}-{\overline {X}})^{2}}}{\sqrt {\sum \limits _{i=1}^{n}(Y_{i}-{\overline {Y}})^{2}}}}}}$
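
A direct translation of the formula into NumPy, as a sketch on made-up vectors; `pearson` is a hypothetical helper name.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance of the centered vectors over the product of their norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc @ yc) / (np.sqrt(xc @ xc) * np.sqrt(yc @ yc))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.5]
r = pearson(x, y)
print(r, np.corrcoef(x, y)[0, 1])  # the two values should agree
```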

### Cosine Similarity
${\text{similarity}}=\cos(\theta )={A\cdot B \over \|A\|\|B\|}={\frac  {\sum \limits _{{i=1}}^{{n}}{A_{i}\times B_{i}}}{{\sqrt  {\sum \limits _{{i=1}}^{{n}}{(A_{i})^{2}}}}\times {\sqrt  {\sum \limits _{{i=1}}^{{n}}{(B_{i})^{2}}}}}}$
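
The same formula in NumPy, as a sketch on made-up vectors; `cosine` is a hypothetical helper name. The second call shows that plain cosine ignores the rating level: a vector of all 5s and a vector of all 1s point in the same direction.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

orthogonal = cosine([1.0, 0.0], [0.0, 1.0])           # no shared direction
same_direction = cosine([5.0, 5.0, 5.0], [1.0, 1.0, 1.0])  # very different levels
print(orthogonal, same_direction)
```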

### Adjusted Cosine Similarity
${\displaystyle sim(i,j) = {\frac {\sum \limits _{u \in U}(R_{u,i}-{\overline {R}_u})(R_{u,j}-{\overline {R}_u})}{{\sqrt {\sum \limits _{u \in U}(R_{u,i}-{\overline {R}_u})^{2}}}{\sqrt {\sum \limits _{u \in U}(R_{u,j}-{\overline {R}_u})^{2}}}}}}$

#### Because the rating level is an important factor, plain `Cosine Similarity` is not used; `Adjusted Cosine Similarity` is used instead
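
A minimal sketch of the difference, assuming a small dense matrix with made-up ratings and no missing values (the notebook's own function additionally handles NaN). `cosine_matrix` and `adjusted_cosine_matrix` are hypothetical helper names.

```python
import numpy as np

# Rows are users, columns are items 0, 1, 2. Values are made up.
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [2.0, 1.0, 5.0],
])

def cosine_matrix(R):
    """Plain cosine similarity between item columns."""
    norms = np.linalg.norm(R, axis=0)
    return (R.T @ R) / np.outer(norms, norms)

def adjusted_cosine_matrix(R):
    """Cosine similarity between item columns after subtracting each user's mean rating."""
    centered = R - R.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=0)
    return (centered.T @ centered) / np.outer(norms, norms)

plain = cosine_matrix(R)
adjusted = adjusted_cosine_matrix(R)
print(plain[0, 2], adjusted[0, 2])
```

Items 0 and 2 are rated in opposite ways by every user, yet their plain cosine is positive because all ratings are positive; after centering on each user's mean the similarity becomes negative.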

In [25]:
def adjusted_cosine(X, Y, N):
    """
    Compute the adjusted cosine similarity between the first row of X and
    the first row of Y after centering both with N. N holds the mean
    rating of each user, aligned with the features (columns) of X and Y.
    Positions where either centered vector is NaN are dropped before the
    similarity is computed, so only users who rated both items count.
    Once the data is centered (mean 0), the result is the cosine of the
    angle between the two vectors of centered preference values.
    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples_1, n_features]
    Y : {array-like, sparse matrix}, shape = [n_samples_2, n_features]
    N: {array-like, sparse matrix}, shape = [n_samples_3, n_features]
    Returns
    -------
    distances : {array, sparse matrix}, shape = [n_samples_1, n_samples_2]
    Examples
    --------
    >>> from crab.metrics.pairwise import adjusted_cosine
    >>> X = [[1.0, 5.0, 4.0]]
    >>> Y = [[2.0, 5.0, 5.0]]
    >>> N = [[3.0, 3.5, 4.0]]
    >>> # distance between rows of X
    >>> adjusted_cosine(X, X, N)
    array([[ 1.]])
    >>> adjusted_cosine(X, Y, N)
    array([[ 0.82462113]])
    """   

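    # Convert every input to a float array so that plain lists work too
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    N = np.asarray(N, dtype=float)

    # `a != b != c` is a chained comparison: it never compares a with c and
    # stops at the first equal pair, so check all three widths together
    if not (X.shape[1] == Y.shape[1] == N.shape[1]):
        raise ValueError("Incompatible dimension for X, Y and N matrices")

    X = X - N
    Y = Y - N
    # Find the positions that are NaN in X or Y and remove them, keeping only
    # users who rated both items; only the first row of X and Y is inspected
    no_need_index_li = list(set(np.argwhere(np.isnan(X[0])).flatten().tolist() + np.argwhere(np.isnan(Y[0])).flatten().tolist()))
    X = np.array([np.delete(X, no_need_index_li)])
    Y = np.array([np.delete(Y, no_need_index_li)])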
    
    
    XY = 1 - ssd.cdist(X, Y, 'cosine')

    return XY

In [26]:
# Check function result
X = [[1.0, 5.0, 4.0]]
Y = [[2.0, 5.0, 5.0]]
N = [[3.0, 3.5, 4.0]]
adjusted_cosine(X, Y, N)

array([[0.82462113]])

In [27]:
# Per-user mean rating, used later to center the ratings for the adjusted cosine
rating_mean = ratings.groupby('userId', as_index=False, sort=False)['rating'].mean().rename(columns={'rating': 'rating_mean'})
rating_mean = rating_mean[['userId','rating_mean']]
adjusted_ratings = pd.merge(ratings, rating_mean, on='userId', how='left', sort=False)
# Left join keeps every 2014+ title, including titles that received no rating
data = titles.merge(adjusted_ratings, on='movieId', how='left').reset_index(drop=True)
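
For reference, the per-user mean and the centered rating can also be obtained in one step with `groupby(...).transform`. This is a sketch on made-up rows, not the notebook's own code path; the column name `rating_centered` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 2],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [5.0, 3.0, 2.0, 1.0, 3.0],
})

# Mean rating per user, broadcast back onto every row of that user
toy['rating_mean'] = toy.groupby('userId')['rating'].transform('mean')
# Centered rating: how far each rating is from that user's own average
toy['rating_centered'] = toy['rating'] - toy['rating_mean']
print(toy)
```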

In [28]:
movies = pd.crosstab(data['userId'], data['title'], values=data['rating'], aggfunc='sum')
movies = movies.dropna(axis='columns', thresh=2)
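
A toy run of the same two calls, with made-up data, to show what the matrix looks like and what `thresh=2` removes: a column survives only if at least two users rated that title.

```python
import pandas as pd

toy = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'title':  ['A', 'B', 'A', 'C', 'A'],
    'rating': [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Users as rows, titles as columns; user/title pairs without a rating become NaN
matrix = pd.crosstab(toy['userId'], toy['title'], values=toy['rating'], aggfunc='sum')

# Keep only the columns (titles) with at least 2 non-NaN ratings
kept = matrix.dropna(axis='columns', thresh=2)
print(matrix)
print(kept)
```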

In [29]:
usr_rating_mean = adjusted_ratings[['userId', 'rating_mean']]
del adjusted_ratings
usr_rating_mean = usr_rating_mean.drop_duplicates(subset = ['userId', 'rating_mean']).sort_values('userId')
usr_rating_mean = usr_rating_mean[usr_rating_mean['userId'].isin(movies.index)].reset_index(drop=True)
usr_rating_mean = usr_rating_mean.set_index('userId')

In [30]:
display(movies.head(), usr_rating_mean.head())

userId,#Captured (2017),#Female Pleasure (2018),#FollowMe (2019),#Horror (2015),#SCREAMERS (2016),#Stuck (2014),#realityhigh (2017),'71 (2014),'Tis the Season for Love (2015),(Dis)Honesty: The Truth About Lies (2015),...,Τα Δάκρυα του Βουνού (2019),Бабушка лёгкого поведения (2017),Газгольдер: Фильм (2014),Каникулы президента (2018),Крепость: щитом и мечом (2015),Мафия (2016),Напарник (2017),Он вам не Димон (2017),Я худею (2018),أهواك (2015)
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
14.0,,,,,,,,,,,...,,,,,,,,,,
17.0,,,,,,,,,,,...,,,,,,,,,,
19.0,,,,,,,,,,,...,,,,,,,,,,


userId,rating_mean
3,3.697409
4,3.378099
14,4.596774
17,3.655172
19,3.423581


In [31]:
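# Both tables must list the same users in the same order
if not (movies.index == usr_rating_mean.index).all():
    raise ValueError("movies and usr_rating_mean are indexed by different users")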

### Similarity of the user's positive-feedback items to all other items

In [32]:
# My positive-feedback movies
user_input = ['12 Gifts of Christmas (2015)', 'Interstellar (2014)', 'Arrival (2016)']
# Feedback score for each of those movies
score = [7, 9, 8]

In [34]:
rating_mean_series = usr_rating_mean['rating_mean']
# Per-user mean ratings are the same for every pair, so build the list once
N = [rating_mean_series.tolist()]
all_li = []
for target_col_name in tqdm(user_input, position=1):
    # Rating vector of the liked movie; constant across the inner loop
    X = [movies[target_col_name].tolist()]
    li = []
    for col_name in tqdm(movies.columns, position=0):
        Y = [movies[col_name].tolist()]
        if np.isnan(np.nansum(X[0])) or len(X[0])<2:
            li.append(0)
        else:
            li.append( adjusted_cosine(X, Y, N)[0][0] )
    all_li.append(li)
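
The pairwise loop above can in principle be replaced by a few matrix products. The following is a sketch, not the notebook's method: it assumes the ratings fit in a dense NumPy array with NaN for missing values, that the per-user means are supplied separately, and that pairs without any shared rater should yield NaN. `adjusted_cosine_to_targets` is a hypothetical helper name and the data is made up.

```python
import numpy as np

def adjusted_cosine_to_targets(R, user_mean, target_cols):
    """Adjusted cosine of each target item against all items, over co-rating users only.

    R: users x items array with NaN for missing ratings.
    user_mean: one mean rating per user.
    Returns an array of shape (len(target_cols), n_items).
    """
    mask = ~np.isnan(R)
    centered = np.where(mask, R - user_mean[:, None], 0.0)  # missing -> 0
    M = mask.astype(float)
    T = centered[:, target_cols]    # centered ratings of the target items
    TM = M[:, target_cols]          # who rated the target items
    num = T.T @ centered                      # sum over users who rated both
    norm_t = np.sqrt((T ** 2).T @ M)          # target norm over users who rated the other item
    norm_o = np.sqrt(TM.T @ (centered ** 2))  # other item's norm over users who rated the target
    with np.errstate(divide='ignore', invalid='ignore'):
        return num / (norm_t * norm_o)        # 0/0 (no shared rater) becomes NaN

R = np.array([
    [5.0,    3.0,    np.nan, 1.0],
    [4.0,    np.nan, 2.0,    1.0],
    [1.0,    1.0,    5.0,    np.nan],
    [np.nan, 2.0,    4.0,    5.0],
])
user_mean = np.nanmean(R, axis=1)
sims = adjusted_cosine_to_targets(R, user_mean, target_cols=[0, 3])
print(sims)
```

In the notebook the user means come from all of a user's ratings, not only the 2014+ titles, which is why the sketch takes `user_mean` as an argument instead of deriving it from `R`.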


In [59]:
# Score-weighted sum of the similarities to each liked movie; undefined (NaN) similarities count as 0
similarity_for_person = sum(np.nan_to_num(np.array(li)*s) for li, s in zip(all_li, score))
final_data = {'title': movies.columns.tolist(), 'adjusted_cosine_similarity': similarity_for_person}
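
A small worked example of this aggregation with hypothetical similarity values (the weights are the feedback scores entered earlier). It shows how undefined similarities are treated and what the largest possible combined value is.

```python
import numpy as np

# Hypothetical similarities of 4 candidate movies (columns) to 3 liked movies (rows).
# NaN means no user rated both movies, so the similarity is undefined.
sims = np.array([
    [1.0, 0.5,    np.nan, -0.2],
    [1.0, np.nan, 0.3,     0.1],
    [1.0, 0.2,    np.nan,  0.4],
])
weights = np.array([7, 9, 8])  # feedback score of each liked movie

# Treat undefined similarities as 0, then take the score-weighted sum per candidate
combined = np.nan_to_num(sims * weights[:, None]).sum(axis=0)

# Every similarity is at most 1, so the sum of the scores is an upper bound
upper_bound = weights.sum()
print(combined, upper_bound)
```

With scores 7, 9 and 8 the combined value cannot exceed 24, and it reaches 24 only when a candidate has similarity 1 to all three liked movies.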

In [60]:
final_df = pd.DataFrame(final_data)

In [61]:
final_df = final_df.sort_values('adjusted_cosine_similarity', ascending=False)

In [62]:
data_atr = data.groupby(['movieId', 'title', 'genres', 'year']).agg({'userId': 'nunique', 'rating': 'mean'}).\
rename(columns={'userId': 'rating_count', 'rating': 'average_renting'}).reset_index()

In [63]:
data_atr

Unnamed: 0,movieId,title,genres,year,rating_count,average_renting
0,107516,Punk's Dead: SLC Punk! 2 (2014),Comedy,2014.0,14,2.392857
1,107769,Paranormal Activity: The Marked Ones (2014),Horror|Thriller,2014.0,168,2.761905
2,107916,Yves Saint Laurent (2014),Drama|Romance,2014.0,70,3.178571
3,108156,Ride Along (2014),Action|Comedy,2014.0,531,3.181733
4,108188,Jack Ryan: Shadow Recruit (2014),Action|Drama|Thriller|IMAX,2014.0,1003,3.227318
...,...,...,...,...,...,...
12804,209143,The Painting (2019),Animation|Documentary,2019.0,1,1.000000
12805,209145,Liberté (2019),Drama,2019.0,1,1.000000
12806,209151,Mao Zedong 1949 (2019),(no genres listed),2019.0,1,3.500000
12807,209157,We (2018),Drama,2018.0,1,1.500000


In [64]:
final_df = data_atr.merge(final_df, on='title', how='left').sort_values('adjusted_cosine_similarity', ascending=False).reset_index(drop=True)

### Recommendation list

In [67]:
user_input

['12 Gifts of Christmas (2015)', 'Interstellar (2014)', 'Arrival (2016)']

In [65]:
final_df

Unnamed: 0,movieId,title,genres,year,rating_count,average_renting,adjusted_cosine_similarity
0,166195,The Submarine Kid (2016),Drama,2016.0,4,4.375000,24.000000
1,122573,Expecting Amish (2014),(no genres listed),2014.0,5,2.900000,18.330584
2,165489,Ethel & Ernest (2016),(no genres listed),2016.0,23,3.630435,17.177301
3,179997,Myosis (2014),Animation|Drama|Sci-Fi,2014.0,2,3.000000,17.000000
4,203655,Queen of Spades: Through the Looking Glass (2019),Horror|Thriller,2019.0,2,3.750000,17.000000
...,...,...,...,...,...,...,...
12804,209143,The Painting (2019),Animation|Documentary,2019.0,1,1.000000,
12805,209145,Liberté (2019),Drama,2019.0,1,1.000000,
12806,209151,Mao Zedong 1949 (2019),(no genres listed),2019.0,1,3.500000,
12807,209157,We (2018),Drama,2018.0,1,1.500000,
