## Content-Based Recommendation System - Kaggle IMDB Movie Dataset
In this notebook, we continue using the Kaggle IMDB Movie Dataset to build a recommendation system for a user's viewing needs, based on content similarity between titles.

Data link: https://www.kaggle.com/ashirwadsangwan/imdb-dataset

We only need two tables from this dataset:
- title.basics.tsv.gz: general information about each title
- title.ratings.tsv.gz: rating information for each title

Note that we work only with tvSeries data and recommend TV series to the user based on that user's taste in TV series.

First, we import the required libraries:

In [1]:
import gzip
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
%matplotlib inline

## 1. Table title.basics.tsv.gz - contains the following fields:
- tconst (string) - unique identifier of each title
- titleType (string) - the type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc.)
- primaryTitle (string) - the official title, i.e. the one used in promotional material at the time of release
- originalTitle (string) - the original title, in the original language
- isAdult (boolean) - 0: non-adult title, 1: adult title
- startYear (YYYY) - release year; for a tvSeries, the year the series started airing
- endYear (YYYY) - for a tvSeries, the year the series ended; null for all other title types
- runtimeMinutes - running time, in minutes
- genres (string array) - genres of the title (at most 3 per title)

In [2]:
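# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
TitleBasics = pd.read_csv(r"C:/Users/User/Desktop/Kaggle_movie_dataset/title.basics.tsv.gz", sep="\t", low_memory=False,
                          na_values=["\\N", "nan"], compression='gzip', on_bad_lines='skip')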
TitleBasics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894.0,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892.0,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892.0,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892.0,,,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893.0,,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
6326540,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2010.0,,,"Action,Drama,Family"
6326541,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010.0,,,"Action,Drama,Family"
6326542,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010.0,,,"Action,Drama,Family"
6326543,tt9916856,short,The Wind,The Wind,0,2015.0,,27,Short


We keep only the tvSeries titles to build the recommendation table for that type, drop a few unneeded columns, and drop the rows whose genres value is null:

In [3]:
tvShow=TitleBasics[TitleBasics['titleType']=='tvSeries'].drop(columns=['titleType','originalTitle',
                                                                     'isAdult','runtimeMinutes'])
tvShow=tvShow[tvShow['genres'].notnull()]
tvShow

Unnamed: 0,tconst,primaryTitle,startYear,endYear,genres
38453,tt0039120,Americana,1947.0,1949.0,"Family,Game-Show"
38454,tt0039121,Birthday Party,1947.0,1949.0,Family
38455,tt0039122,The Borden Show,1947.0,,"Comedy,Music"
38456,tt0039123,Kraft Theatre,1947.0,1958.0,Drama
38458,tt0039125,Public Prosecutor,1947.0,1951.0,"Crime,Drama,Mystery"
...,...,...,...,...,...
6326231,tt9916206,Nojor,2019.0,,Fantasy
6326236,tt9916216,Kalyanam Mudhal Kadhal Varai,2014.0,2017.0,Romance
6326237,tt9916218,Lost in Food,2016.0,2017.0,Talk-Show
6326317,tt9916380,Meie aasta Aafrikas,2019.0,,"Adventure,Comedy,Family"


To recommend based on content similarity between shows, we rely on how similar their genres (the genres column) are.  

First we split each show's genres and create dummy variables, one column per genre. A show gets 1 in the column of every genre it belongs to and 0 otherwise:

In [4]:
# Split each row of the genres column into a list whose elements are the individual genres:

tvShow['genres'] = tvShow.genres.str.split(',')
tvShow

Unnamed: 0,tconst,primaryTitle,startYear,endYear,genres
38453,tt0039120,Americana,1947.0,1949.0,"[Family, Game-Show]"
38454,tt0039121,Birthday Party,1947.0,1949.0,[Family]
38455,tt0039122,The Borden Show,1947.0,,"[Comedy, Music]"
38456,tt0039123,Kraft Theatre,1947.0,1958.0,[Drama]
38458,tt0039125,Public Prosecutor,1947.0,1951.0,"[Crime, Drama, Mystery]"
...,...,...,...,...,...
6326231,tt9916206,Nojor,2019.0,,[Fantasy]
6326236,tt9916216,Kalyanam Mudhal Kadhal Varai,2014.0,2017.0,[Romance]
6326237,tt9916218,Lost in Food,2016.0,2017.0,[Talk-Show]
6326317,tt9916380,Meie aasta Aafrikas,2019.0,,"[Adventure, Comedy, Family]"


In [5]:
# Encode the genres column into one column per genre: the value is 1 if the show belongs to that genre,
# and 0 if it does not:

tvShow1=tvShow.copy()
for index,row in tvShow1.iterrows():
    for genre in row['genres']:
        tvShow1.at[index,genre]=1
tvShow1=tvShow1.fillna(0)
tvShow1

Unnamed: 0,tconst,primaryTitle,startYear,endYear,genres,Family,Game-Show,Comedy,Music,Drama,...,Sci-Fi,Romance,Thriller,Documentary,War,Animation,News,Biography,Short,Adult
38453,tt0039120,Americana,1947.0,1949.0,"[Family, Game-Show]",1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38454,tt0039121,Birthday Party,1947.0,1949.0,[Family],1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38455,tt0039122,The Borden Show,1947.0,0.0,"[Comedy, Music]",0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38456,tt0039123,Kraft Theatre,1947.0,1958.0,[Drama],0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38458,tt0039125,Public Prosecutor,1947.0,1951.0,"[Crime, Drama, Mystery]",0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6326231,tt9916206,Nojor,2019.0,0.0,[Fantasy],0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6326236,tt9916216,Kalyanam Mudhal Kadhal Varai,2014.0,2017.0,[Romance],0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6326237,tt9916218,Lost in Food,2016.0,2017.0,[Talk-Show],0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6326317,tt9916380,Meie aasta Aafrikas,2019.0,0.0,"[Adventure, Comedy, Family]",1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
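The row-by-row loop above works, but it is slow on a table this size, and the blanket `fillna(0)` also turns missing `endYear` values into 0.0 (visible for "The Borden Show" in the output). A minimal sketch of a vectorized alternative, assuming the genres have already been split into lists; the toy `shows` table and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for tvShow after the genres column was split into lists (made-up values)
shows = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "endYear": [1949.0, None, 2017.0],
    "genres": [["Family", "Game-Show"], ["Comedy", "Music"], ["Comedy"]],
})

# Join each list into a '|'-separated string, then expand it into 0/1 indicator columns
genre_dummies = shows["genres"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns; non-genre columns such as endYear keep their NaN
encoded = pd.concat([shows, genre_dummies], axis=1)
print(encoded)
```

Note that `get_dummies` returns the genre columns in alphabetical order and as integers, so the column order differs from the loop-based table.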


## 2. Table title.ratings.tsv.gz - contains rating information for each title:
- tconst (string) - unique identifier of each title
- averageRating - weighted average of all individual user ratings
- numVotes - number of votes the title has received

In [6]:
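# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
TitleRatings = pd.read_csv(r"C:/Users/User/Desktop/Kaggle_movie_dataset/title.ratings.tsv.gz", sep="\t", low_memory=False,
                          na_values=["\\N", "nan"], compression='gzip', on_bad_lines='skip')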
TitleRatings

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1550
1,tt0000002,6.1,186
2,tt0000003,6.5,1207
3,tt0000004,6.2,113
4,tt0000005,6.1,1934
...,...,...,...
993816,tt9916576,5.9,7
993817,tt9916578,9.1,11
993818,tt9916720,5.1,41
993819,tt9916766,6.7,11


## 3. User profile
We add a user profile containing the user's ratings for 5 tvSeries they have watched, and build the recommendation system from those 5 shows.  
(I'll use my own taste in shows here):

In [7]:
userInput = [{'title':'Game of Thrones', 'rating':10},
             {'title':'The Walking Dead', 'rating':8},
             {'title':'Breaking Bad', 'rating':9.5},
             {'title':'Sherlock', 'rating':9.5},
             {'title':'Kingdom', 'rating':9}] 

In [8]:
inputMovie=pd.DataFrame(userInput)
inputMovie

Unnamed: 0,title,rating
0,Game of Thrones,10.0
1,The Walking Dead,8.0
2,Breaking Bad,9.5
3,Sherlock,9.5
4,Kingdom,9.0


The table above lists my 5 favorite tvSeries and the rating I personally gave each one.

We merge inputMovie with the tvShow1 table above to bring in the full details of the 5 shows the user entered:

In [9]:
inputMovie=pd.merge(inputMovie,tvShow1,left_on='title',right_on='primaryTitle')
inputMovie

Unnamed: 0,title,rating,tconst,primaryTitle,startYear,endYear,genres,Family,Game-Show,Comedy,...,Sci-Fi,Romance,Thriller,Documentary,War,Animation,News,Biography,Short,Adult
0,Game of Thrones,10.0,tt0944947,Game of Thrones,2011.0,2019.0,"[Action, Adventure, Drama]",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,The Walking Dead,8.0,tt1520211,The Walking Dead,2010.0,0.0,"[Drama, Horror, Thriller]",0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Breaking Bad,9.5,tt0903747,Breaking Bad,2008.0,2013.0,"[Crime, Drama, Thriller]",0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Sherlock,9.5,tt10691922,Sherlock,2019.0,0.0,[Crime],0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Sherlock,9.5,tt1475582,Sherlock,2010.0,0.0,"[Crime, Drama, Mystery]",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Kingdom,9.0,tt0841961,Kingdom,2007.0,2009.0,"[Comedy, Drama]",0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Kingdom,9.0,tt2404499,Kingdom,2012.0,0.0,"[Action, Animation, Drama]",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,Kingdom,9.0,tt3673794,Kingdom,2014.0,2017.0,[Drama],0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Kingdom,9.0,tt6611916,Kingdom,2019.0,0.0,"[Action, Drama, Thriller]",0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The table above shows that several shows share the same title. Using the release year and the genres, we can remove the ones the user did not mean. The rows to drop are 3, 5, 6, and 7.

In [10]:
inputMovie1=inputMovie.drop(index=[3,5,6,7])
inputMovie1.reset_index(drop=True,inplace=True)
inputMovie1

Unnamed: 0,title,rating,tconst,primaryTitle,startYear,endYear,genres,Family,Game-Show,Comedy,...,Sci-Fi,Romance,Thriller,Documentary,War,Animation,News,Biography,Short,Adult
0,Game of Thrones,10.0,tt0944947,Game of Thrones,2011.0,2019.0,"[Action, Adventure, Drama]",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,The Walking Dead,8.0,tt1520211,The Walking Dead,2010.0,0.0,"[Drama, Horror, Thriller]",0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Breaking Bad,9.5,tt0903747,Breaking Bad,2008.0,2013.0,"[Crime, Drama, Thriller]",0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Sherlock,9.5,tt1475582,Sherlock,2010.0,0.0,"[Crime, Drama, Mystery]",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Kingdom,9.0,tt6611916,Kingdom,2019.0,0.0,"[Action, Drama, Thriller]",0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
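Dropping rows by position works for a one-off notebook, but the positions change whenever the dataset is refreshed. A minimal sketch of one alternative, assuming the user's shows are identified by `tconst` instead of by title; the `catalog` and `user_input` names are hypothetical, and the IDs are the ones shown in the output above:

```python
import pandas as pd

# Small stand-in for the tvSeries table, with two shows sharing the title "Sherlock"
catalog = pd.DataFrame({
    "tconst": ["tt1475582", "tt10691922", "tt0903747"],
    "primaryTitle": ["Sherlock", "Sherlock", "Breaking Bad"],
    "startYear": [2010.0, 2019.0, 2008.0],
})

# The user profile keyed by the unique ID rather than by the ambiguous title
user_input = pd.DataFrame({
    "tconst": ["tt1475582", "tt0903747"],
    "rating": [9.5, 9.5],
})

# validate="one_to_one" makes pandas raise if a key matches more than one row
merged = user_input.merge(catalog, on="tconst", how="left", validate="one_to_one")
print(merged)
```

Merging on the ID returns exactly one row per input show, so no manual cleanup step is needed.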


This gives inputMovie1, a table with all the details of the 5 shows the user entered.  
We drop all the leading columns and keep only the genre columns (starting from Family), so that we can compute the user's score for each genre:

In [11]:
userGenre=inputMovie1.drop(columns=['title','rating','tconst','primaryTitle','startYear','endYear','genres'])
userGenre

Unnamed: 0,Family,Game-Show,Comedy,Music,Drama,Crime,Mystery,Fantasy,Talk-Show,Reality-TV,...,Sci-Fi,Romance,Thriller,Documentary,War,Animation,News,Biography,Short,Adult
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
inputMovie1['rating']

0    10.0
1     8.0
2     9.5
3     9.5
4     9.0
Name: rating, dtype: float64

We compute the user's total score for each genre by multiplying each row of userGenre by the corresponding rating in inputMovie1['rating'] and summing the results per genre.  
This can be done by transposing userGenre and using dot() to multiply and sum in one step:

In [13]:
userProfile = userGenre.transpose().dot(inputMovie1['rating'])
userProfile

Family          0.0
Game-Show       0.0
Comedy          0.0
Music           0.0
Drama          46.0
Crime          19.0
Mystery         9.5
Fantasy         0.0
Talk-Show       0.0
Reality-TV      0.0
Musical         0.0
History         0.0
Sport           0.0
Action         19.0
Adventure      10.0
Western         0.0
Horror          8.0
Sci-Fi          0.0
Romance         0.0
Thriller       26.5
Documentary     0.0
War             0.0
Animation       0.0
News            0.0
Biography       0.0
Short           0.0
Adult           0.0
dtype: float64
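To make the weighted sum concrete, the sketch below rebuilds the same profile from only the five shows' genres and ratings listed above. The variable names are illustrative, and the genre matrix is built by hand here rather than taken from tvShow1:

```python
import pandas as pd

# Ratings and genres of the five input shows, copied from the tables above
ratings = pd.Series(
    [10, 8, 9.5, 9.5, 9],
    index=["Game of Thrones", "The Walking Dead", "Breaking Bad", "Sherlock", "Kingdom"],
)
genres = {
    "Game of Thrones": ["Action", "Adventure", "Drama"],
    "The Walking Dead": ["Drama", "Horror", "Thriller"],
    "Breaking Bad": ["Crime", "Drama", "Thriller"],
    "Sherlock": ["Crime", "Drama", "Mystery"],
    "Kingdom": ["Action", "Drama", "Thriller"],
}

# 0/1 matrix: one row per show, one column per genre that occurs at least once
all_genres = sorted({g for gs in genres.values() for g in gs})
user_genre = pd.DataFrame(
    [[1.0 if g in genres[title] else 0.0 for g in all_genres] for title in ratings.index],
    index=ratings.index,
    columns=all_genres,
)

# Each genre's weight is the sum of the ratings of the shows carrying that genre
user_profile = user_genre.T.dot(ratings)
print(user_profile)
```

For example, Drama appears in all five shows, so its weight is 10 + 8 + 9.5 + 9.5 + 9 = 46, while Thriller appears in three shows and gets 8 + 9.5 + 9 = 26.5.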

Next, we compute the user's score for every show in the full list. The highest-scoring shows are the closest in genre to the user's favorites, and these are the natural candidates to recommend.

In [14]:
genreTable=tvShow1.set_index('tconst')
genreTable=genreTable.drop(columns=['primaryTitle','startYear','endYear','genres'])
genreTable.head()

tconst,Family,Game-Show,Comedy,Music,Drama,Crime,Mystery,Fantasy,Talk-Show,Reality-TV,...,Sci-Fi,Romance,Thriller,Documentary,War,Animation,News,Biography,Short,Adult
tt0039120,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
tt0039121,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
tt0039122,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
tt0039123,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
tt0039125,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
recommendationTable = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable=recommendationTable.sort_values(ascending=False)
recommendationTable.head()

tconst
tt2431386    0.663043
tt1828246    0.663043
tt1590961    0.663043
tt7131488    0.663043
tt5826200    0.663043
dtype: float64
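The score is the share of the user's total genre weight that a show's genres cover. A small worked sketch using the non-zero profile weights from above; the three candidate shows and the `one_hot` helper are hypothetical:

```python
import pandas as pd

# Non-zero weights of the user profile computed above, plus Comedy as a zero-weight genre
user_profile = pd.Series({
    "Drama": 46.0, "Crime": 19.0, "Mystery": 9.5, "Action": 19.0,
    "Adventure": 10.0, "Horror": 8.0, "Thriller": 26.5, "Comedy": 0.0,
})

def one_hot(genres):
    """Return a 0/1 row aligned with the profile's genre order."""
    return [1.0 if g in genres else 0.0 for g in user_profile.index]

# Three hypothetical candidate shows
genre_table = pd.DataFrame(
    [one_hot(["Crime", "Drama", "Thriller"]), one_hot(["Drama"]), one_hot(["Comedy"])],
    index=["show_a", "show_b", "show_c"],
    columns=user_profile.index,
)

# Weighted genre overlap, normalized by the total weight so scores fall in [0, 1]
scores = (genre_table * user_profile).sum(axis=1) / user_profile.sum()
print(scores.round(6))
```

A Crime, Drama, Thriller show scores (19 + 46 + 26.5) / 138 ≈ 0.663043, which matches the top score in the output above. Because a show has at most three genres, many shows tie at that maximum, which is why genre overlap alone cannot rank them.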

However, recommending shows purely on the genre-similarity score is not enough.  
The highest-scoring shows may not be well rated (low IMDB averageRating) or may be obscure (very low numVotes).

So, to improve the recommendations, we take the 1000 shows with the highest recommendation score and, among those, pick the top 20 that have more than 30000 votes, ranked by IMDB averageRating.

These are the 20 shows the system recommends to the user based on their 5 favorite shows.

In [16]:
topmovies=TitleBasics.loc[TitleBasics['tconst'].isin(recommendationTable.head(1000).keys())]
topmovies=pd.merge(topmovies,TitleRatings,on='tconst')
topmovies[topmovies['numVotes']>30000].sort_values(by='averageRating',ascending=False).head(20)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
245,tt0903747,tvSeries,Breaking Bad,Breaking Bad,0,2008.0,2013.0,49,"Crime,Drama,Thriller",9.5,1280780
248,tt0944947,tvSeries,Game of Thrones,Game of Thrones,0,2011.0,2019.0,57,"Action,Adventure,Drama",9.4,1606675
169,tt0306414,tvSeries,The Wire,The Wire,0,2002.0,2008.0,59,"Crime,Drama,Thriller",9.3,255545
431,tt2802850,tvSeries,Fargo,Fargo,0,2014.0,,53,"Crime,Drama,Thriller",8.9,286115
588,tt6077448,tvSeries,Sacred Games,Sacred Games,0,2018.0,,50,"Action,Crime,Drama",8.8,62567
112,tt0118421,tvSeries,Oz,Oz,0,1997.0,2003.0,55,"Crime,Drama,Thriller",8.7,83340
163,tt0286486,tvSeries,The Shield,The Shield,0,2002.0,2008.0,47,"Crime,Drama,Thriller",8.7,67325
197,tt0407362,tvSeries,Battlestar Galactica,Battlestar Galactica,0,2004.0,2009.0,44,"Action,Adventure,Drama",8.7,144413
550,tt5290382,tvSeries,Mindhunter,Mindhunter,0,2017.0,,60,"Crime,Drama,Thriller",8.6,164826
316,tt1489428,tvSeries,Justified,Justified,0,2010.0,2015.0,44,"Action,Crime,Drama",8.6,79149


The system looks reasonably effective: in practice it surfaces other shows that are well liked, of high quality, and widely popular, such as "The Wire", "Fargo", "Vikings", "Daredevil", and so on.