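### Goal: movie recommendation
### Use case:
A new customer has no rating history, so we first ask them to pick a few movies they like, then look for movies whose rating patterns resemble those of the picks.
Example: once a new customer has picked a favorite movie A, we use other customers' ratings to check whether people who like A also tend to like some other movie, and whether people who dislike A also tend to dislike it.
   In effect, other customers' ratings can be treated as the feature values of each movie.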
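### Algorithm
1. From the historical rating data, build an `m*n` co-occurrence matrix with users (m in total) as rows and items (n in total) as columns.
2. Compute the similarity between every pair of column vectors of the co-occurrence matrix to build an `n*n` item similarity matrix.
3. Obtain the list of positive-feedback items from the user's behavior history; a new user must first fill in preferred items to serve as positive-feedback samples.
4. Using the similarity matrix, find the Top K most similar items for each positive-feedback item in the target user's history, and collect them into a set of similar items.
5. Rank the items in that set by similarity to produce the final recommendation list.
**Note: if an item is similar to several positive-feedback items in the user's history, its similarity should be the sum of those similarities, which indicates that the item is very similar to the user's set of positive-feedback items.**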
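The five steps and the note can be sketched on a toy matrix. This is a minimal illustration, not the notebook's implementation: the ratings, the helper names `item_similarity` and `recommend`, the use of cosine similarity, and the choice to treat unrated cells as 0 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Step 1: toy user-item matrix (rows: users, columns: items).
# Assumption for this sketch: 0 means "not rated".
ratings = pd.DataFrame(
    [[5, 5, 4, 0],
     [4, 0, 5, 1],
     [0, 4, 4, 0],
     [1, 0, 0, 5]],
    index=["u1", "u2", "u3", "u4"],
    columns=["A", "B", "C", "D"],
    dtype=float,
)

def item_similarity(matrix: pd.DataFrame) -> pd.DataFrame:
    """Step 2: cosine similarity between every pair of item (column) vectors."""
    values = matrix.to_numpy()
    norms = np.linalg.norm(values, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    normalized = values / norms
    sim = normalized.T @ normalized
    return pd.DataFrame(sim, index=matrix.columns, columns=matrix.columns)

def recommend(sim: pd.DataFrame, liked: list, top_k: int = 2) -> pd.Series:
    """Steps 3-5: take the top-K neighbours of each liked item,
    sum the similarities per candidate (the note), then rank."""
    scores = pd.Series(0.0, index=sim.index)
    for item in liked:
        neighbours = sim[item].drop(labels=liked).nlargest(top_k)
        scores = scores.add(neighbours, fill_value=0.0)
    return scores.drop(labels=liked).sort_values(ascending=False)

sim = item_similarity(ratings)
ranking = recommend(sim, liked=["A", "B"], top_k=2)
print(ranking)
```

Item `C` is similar to both liked items, so its two similarities add up and it ranks ahead of `D`, which is only weakly similar to `A`.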
### Model (ItemCF architecture):
<omitted>

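### Drawbacks:
1. Popular items have a strong head effect and tend to be similar to a large number of items, while tail items have sparse feature vectors, rarely show similarity to other items, and are therefore rarely recommended.
2. Collaborative filtering uses only user-item interaction data, so it cannot effectively incorporate user features, item features, or context features such as the user's age and gender, the product description and category, or the current time.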

In [1]:
import pandas as pd
from utils import grid_df_display, check_nan_by_row
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
titles = pd.read_csv('./movies.csv')
data = pd.read_csv('./ratings.csv')

In [3]:
grid_df_display(list_df = [titles, data], 
                list_df_name = ['Movie Attributes', 'Rating'],
                list_number_of_data = [10, 10], row = 1, col = 2, fill = 'col')

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller

,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
5,1,1088,4.0,1147868495
6,1,1175,3.5,1147868826
7,1,1217,3.5,1147878326
8,1,1237,5.0,1147868839
9,1,1250,4.0,1147868414


In [4]:
titles.info()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB


In [5]:
check_nan_by_row(titles, titles.columns)

There is no NaN in dataframe.


In [6]:
check_nan_by_row(data, data.columns)

There is no NaN in dataframe.


In [7]:
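# Extract the four-digit release year that directly precedes a closing parenthesis.
titles['year'] = titles['title'].str.extract(r'(\d{4})(?=\))').astype(float)
# Drop titles without a year (NaN is the only value for which year != year).
titles = titles.drop(titles.query("year!=year").index).reset_index(drop=True)
# Keep only movies released in 2014 or later, then shuffle reproducibly.
titles = titles.query("year>=2014")
titles = titles.sample(frac=1, replace=False, random_state=2021).reset_index(drop=True)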
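The year pattern used in this cell can be checked on a few hypothetical titles. The sample titles below are made up for illustration; the sketch shows why the lookahead `(?=\))` matters when a title itself contains a four-digit number.

```python
import pandas as pd

# Hypothetical titles used only for illustration.
sample = pd.DataFrame({"title": [
    "Toy Story (1995)",
    "Blade Runner 2049 (2017)",
    "2001: A Space Odyssey (1968)",
    "Untitled Project",
]})

# (\d{4}) captures four digits; (?=\)) requires a ")" right after them
# without consuming it, so "2049" and "2001" inside the titles are skipped.
sample["year"] = (
    sample["title"].str.extract(r"(\d{4})(?=\))", expand=False).astype(float)
)
print(sample)
```

A title with no parenthesized year yields `NaN`, which is what the following drop step removes.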

In [8]:
titles

,movieId,title,genres,year
0,166369,Sour Grapes (2016),Documentary,2016.0
1,193253,Active Measures (2018),Documentary,2018.0
2,196805,Close Range Love (2014),Drama|Romance,2014.0
3,192757,As Seen by the Rest (2014),Drama|Romance|Thriller,2014.0
4,176343,A Movie Life (2017),Drama,2017.0
...,...,...,...,...
12804,151319,Notias (2016),Comedy|Drama,2016.0
12805,133945,Deep Web (2015),Documentary,2015.0
12806,175915,Między nami dobrze jest (2014),(no genres listed),2014.0
12807,202273,The Challenger Disaster (2019),Drama,2019.0


In [9]:
data = titles.merge(data, on='movieId', how='left').reset_index(drop=True)
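A small sketch of what `how='left'` does here, using hypothetical miniature tables: movies without any rating are kept with `NaN` in the rating columns, which is presumably why `userId` is displayed as a float in the merged output. Whether an inner join would suit the analysis better is a design choice this sketch does not settle.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
toy_titles = pd.DataFrame(
    {"movieId": [1, 2], "title": ["Rated (2015)", "Unrated (2016)"]}
)
toy_ratings = pd.DataFrame(
    {"userId": [10, 11], "movieId": [1, 1], "rating": [4.0, 3.5]}
)

left = toy_titles.merge(toy_ratings, on="movieId", how="left")
inner = toy_titles.merge(toy_ratings, on="movieId", how="inner")

print(left)         # "Unrated (2016)" is kept with NaN userId and rating
print(left.dtypes)  # userId becomes float64 because int64 cannot hold NaN
print(len(inner))   # only rows that have a rating remain
```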

In [10]:
data

,movieId,title,genres,year,userId,rating,timestamp
0,166369,Sour Grapes (2016),Documentary,2016.0,1401.0,2.5,1.544378e+09
1,166369,Sour Grapes (2016),Documentary,2016.0,3959.0,4.0,1.507721e+09
2,166369,Sour Grapes (2016),Documentary,2016.0,13350.0,4.5,1.496468e+09
3,166369,Sour Grapes (2016),Documentary,2016.0,14343.0,5.0,1.542663e+09
4,166369,Sour Grapes (2016),Documentary,2016.0,19997.0,5.0,1.550683e+09
...,...,...,...,...,...,...,...
1182965,133945,Deep Web (2015),Documentary,2015.0,162235.0,4.0,1.507151e+09
1182966,175915,Między nami dobrze jest (2014),(no genres listed),2014.0,88556.0,4.0,1.500919e+09
1182967,202273,The Challenger Disaster (2019),Drama,2019.0,42078.0,1.0,1.558301e+09
1182968,192257,Destiny: The Tale of Kamakura (2017),Drama|Mystery,2017.0,53837.0,1.5,1.533731e+09


In [11]:
reviews = data.groupby('title')['rating'].agg(['count','mean']).reset_index().round(1)

In [12]:
reviews

,title,count,mean
0,#Captured (2017),2,3.8
1,#Female Pleasure (2018),3,3.7
2,#FollowMe (2019),6,2.1
3,#Horror (2015),23,2.1
4,#Lucky Number (2015),1,3.0
...,...,...,...
12752,Напарник (2017),5,1.7
12753,Он вам не Димон (2017),40,3.7
12754,Я худею (2018),33,3.3
12755,أهواك (2015),2,2.0


In [13]:
movies = pd.crosstab(data['userId'], data['title'], values=data['rating'], aggfunc='sum')

In [14]:
movies

title,#Captured (2017),#Female Pleasure (2018),#FollowMe (2019),#Horror (2015),#Lucky Number (2015),#SCREAMERS (2016),#SquadGoals (2018),#Stuck (2014),#realityhigh (2017),$50K and a Call Girl: A Love Story (2014),...,Залётчики (2014),Из Уфы с любовью! (2017),Каникулы президента (2018),Крепость: щитом и мечом (2015),Мафия (2016),Напарник (2017),Он вам не Димон (2017),Я худею (2018),أهواك (2015),काशी - In Search of Ganga (2018)
userId,,,,,,,,,,,,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
14.0,,,,,,,,,,,...,,,,,,,,,,
17.0,,,,,,,,,,,...,,,,,,,,,,
19.0,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162526.0,,,,,,,,,,,...,,,,,,,,,,
162532.0,,,,,,,,,,,...,,,,,,,,,,
162534.0,,,,,,,,,,,...,,,,,,,,,,
162536.0,,,,,,,,,,,...,,,,,,,,,,


In [15]:
movies = movies.dropna(axis='columns', thresh=2)
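For reference, `thresh=2` keeps a column only if it holds at least two non-missing values, so movies rated by fewer than two users are dropped. A toy frame with made-up column names shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical columns standing in for movies with 0, 1, and 2 ratings.
toy = pd.DataFrame({
    "no_ratings": [np.nan, np.nan, np.nan],
    "one_rating": [4.0, np.nan, np.nan],
    "two_ratings": [4.0, 3.0, np.nan],
})

# Keep only columns with at least 2 non-NaN entries.
kept = toy.dropna(axis="columns", thresh=2)
print(list(kept.columns))
```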

In [16]:
movies

title,#Captured (2017),#Female Pleasure (2018),#FollowMe (2019),#Horror (2015),#SCREAMERS (2016),#Stuck (2014),#realityhigh (2017),'71 (2014),'Tis the Season for Love (2015),(Dis)Honesty: The Truth About Lies (2015),...,Τα Δάκρυα του Βουνού (2019),Бабушка лёгкого поведения (2017),Газгольдер: Фильм (2014),Каникулы президента (2018),Крепость: щитом и мечом (2015),Мафия (2016),Напарник (2017),Он вам не Димон (2017),Я худею (2018),أهواك (2015)
userId,,,,,,,,,,,,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
14.0,,,,,,,,,,,...,,,,,,,,,,
17.0,,,,,,,,,,,...,,,,,,,,,,
19.0,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162526.0,,,,,,,,,,,...,,,,,,,,,,
162532.0,,,,,,,,,,,...,,,,,,,,,,
162534.0,,,,,,,,,,,...,,,,,,,,,,
162536.0,,,,,,,,,,,...,,,,,,,,,,


### If I love the movies below, I want to find which movies are similar by using rating scores from other people.

In [17]:
# Movies I like (positive-feedback items)
userInput = ['12 Gifts of Christmas (2015)', 'Interstellar (2014)', 'Arrival (2016)']
# Feedback scores, used as weights
score = [7, 9, 8]

### Because how high or low a rating is matters, cosine similarity is not used; Pearson correlation is used instead
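One way to see the difference between the two measures: Pearson correlation equals cosine similarity computed after subtracting each vector's mean. The sketch below uses two made-up rating vectors with opposite patterns; the helper names `cosine` and `pearson` are defined here for illustration only.

```python
import numpy as np

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two raw rating vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation, written as cosine similarity of centered vectors."""
    return cosine(x - x.mean(), y - y.mean())

# Hypothetical ratings of two movies by the same four users.
a = np.array([5.0, 4.0, 2.0, 1.0])
b = np.array([1.0, 2.0, 4.0, 5.0])  # opposite pattern

print(cosine(a, b))   # positive, because all raw ratings are positive
print(pearson(a, b))  # negative, because the patterns are opposite
```

On raw ratings, which are all positive, cosine similarity stays positive even when users disagree completely, whereas the centered measure reports the disagreement as a negative value.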

In [18]:
# method: pearson, kendall, spearman
similarity = movies.corrwith(movies[userInput[0]], method = 'pearson')*score[0] +\
             movies.corrwith(movies[userInput[1]], method = 'pearson')*score[1] +\
             movies.corrwith(movies[userInput[2]], method = 'pearson')*score[2]



### Similarity of other items to the user's set of positive-feedback items

In [19]:
# Keep only titles whose combined score is defined (not NaN)
similarity.dropna()

title
'Tis the Season for Love (2015)      20.562253
12 Gifts of Christmas (2015)         24.000000
22 Jump Street (2014)                -3.571883
3 Days to Kill (2014)                10.342726
A Christmas Detour (2015)            11.204179
                                       ...    
Winter's Tale (2014)                 13.532086
X-Men: Apocalypse (2016)             10.388494
X-Men: Days of Future Past (2014)     9.270154
Zapped (2014)                         0.526740
Zootopia (2016)                      10.744530
Length: 174, dtype: float64
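Only a small number of titles have a defined score, likely because `+` on pandas Series propagates `NaN`: a title gets a score only if its correlation with every one of the selected movies is defined. The sketch below uses hypothetical correlation values and two of the weights to show this, next to a more lenient alternative based on `Series.add(..., fill_value=0)`. The alternative is an assumption for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical correlations of four titles with two liked movies.
corr_a = pd.Series({"W": 0.9, "X": 0.5, "Y": np.nan, "Z": np.nan})
corr_b = pd.Series({"W": 0.2, "X": np.nan, "Y": 0.7, "Z": np.nan})

# Same pattern as the notebook cell: NaN if either correlation is missing.
strict = corr_a * 7 + corr_b * 9

# Lenient variant: a missing correlation counts as 0,
# but a title missing from both sides still stays NaN.
lenient = (corr_a * 7).add(corr_b * 9, fill_value=0)

print(strict.dropna())
print(lenient.dropna())
```

The strict form favors titles with evidence against every selected movie; the lenient form covers more titles at the cost of mixing scores built from different numbers of terms.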

In [20]:
correlatedMovies = pd.DataFrame(similarity, columns = ['correlation'])
correlatedMovies = reviews.merge(correlatedMovies, how='left', on = 'title')
correlatedMovies = correlatedMovies.merge(titles, how='left', on = 'title')

In [21]:
correlatedMovies.dropna(subset=['correlation'])

,title,count,mean,correlation,movieId,genres,year
12,'Tis the Season for Love (2015),11,2.8,20.562253,149380,Romance,2015.0
57,12 Gifts of Christmas (2015),9,3.0,24.000000,149392,Children,2015.0
118,22 Jump Street (2014),3796,3.5,-3.571883,112138,Action|Comedy|Crime,2014.0
135,3 Days to Kill (2014),379,3.1,10.342726,109569,Action|Crime|Drama,2014.0
260,A Christmas Detour (2015),19,2.9,11.204179,149368,Romance,2015.0
...,...,...,...,...,...,...,...
12539,Winter's Tale (2014),221,3.1,13.532086,109191,Drama|Fantasy|Mystery,2014.0
12631,X-Men: Apocalypse (2016),2938,3.3,10.388494,122924,Action|Adventure|Fantasy|Sci-Fi,2016.0
12632,X-Men: Days of Future Past (2014),9056,3.7,9.270154,111362,Action|Adventure|Sci-Fi,2014.0
12718,Zapped (2014),28,2.9,0.526740,158412,Children|Comedy|Fantasy|Romance,2014.0


In [22]:
output = correlatedMovies.query("mean > 3.5 & count >= 300 & title not in @userInput").sort_values(['correlation', 'mean'], ascending = False)

In [23]:
output

,title,count,mean,correlation,movieId,genres,year
11125,The Revenant (2015),7083,3.8,11.999776,139385,Adventure|Drama,2015.0
10176,The Edge of Seventeen (2016),975,3.7,11.719585,161922,Comedy,2016.0
3990,Fury (2014),4211,3.7,11.687702,115210,Action|Drama|War,2014.0
7998,Predestination (2014),3837,3.8,11.583636,114935,Action|Mystery|Sci-Fi|Thriller,2014.0
4352,Hacksaw Ridge (2016),2602,3.9,11.552436,163645,Drama|War,2016.0
...,...,...,...,...,...,...,...
12253,War Dogs (2016),1379,3.6,,161131,Comedy,2016.0
12259,War for the Planet of the Apes (2017),1373,3.6,,173145,Action|Adventure|Drama|Sci-Fi,2017.0
12493,Wild (2014),1016,3.6,,118900,Drama,2014.0
12589,Wonder Woman (2017),4099,3.6,,143355,Action|Adventure|Fantasy,2017.0


In [24]:
# Drop helper columns and give the remaining ones display-friendly names
output = output.drop(columns=['movieId', 'year'])
output.rename(columns={"title": "Movies Suggestions based on " + ', '.join(userInput),
                       "count": "Number of Ratings",
                       "genres": "Genres",
                       "mean": "Ratings",
                       "correlation": "Correlation"}).head(25)

,"Movies Suggestions based on 12 Gifts of Christmas (2015), Interstellar (2014), Arrival (2016)",Number of Ratings,Ratings,Correlation,Genres
11125,The Revenant (2015),7083,3.8,11.999776,Adventure|Drama
10176,The Edge of Seventeen (2016),975,3.7,11.719585,Comedy
3990,Fury (2014),4211,3.7,11.687702,Action|Drama|War
7998,Predestination (2014),3837,3.8,11.583636,Action|Mystery|Sci-Fi|Thriller
4352,Hacksaw Ridge (2016),2602,3.9,11.552436,Drama|War
5774,La La Land (2016),3771,3.7,11.48836,Comedy|Drama|Romance
7230,Nightcrawler (2014),7343,3.9,11.374859,Crime|Drama|Thriller
10415,The Hateful Eight (2015),6148,3.9,11.239996,Western
6386,Mad Max: Fury Road (2015),13479,3.9,11.100505,Action|Adventure|Sci-Fi|Thriller
5195,It Follows (2014),2277,3.6,10.970168,Horror
