# Recommendation System

References
- https://developers.google.com/machine-learning/recommendation
- http://surpriselib.com/

Recommendation systems in common use come in two main types:

## 1. Home page recommendations

Think of the home page of the Google Play Store or the Apple App Store.

<img src="https://developers.google.com/machine-learning/recommendation/images/PlayStore.svg" width="300">

## 2. Related item recommendations

When you open an application's detail page, you also see related apps, that is, apps similar to the one you are viewing.

The two types of recommendation system are built on different principles:

- Content-based filtering
- Collaborative filtering



# Content-based Filtering
Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.

<img src="https://developers.google.com/machine-learning/recommendation/images/Matrix1.svg" width="500">

**Question**

Based on the figure above, which publisher's app should we recommend to the user shown?

a. TimeWastr

b. Science R Us

c. Healthcare
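
To make the idea concrete, here is a minimal sketch of content-based scoring. The app names, feature names, and binary feature values are hypothetical and are not taken from the figure; the score is simply the dot product between the user's feature vector and each app's feature vector.

```python
import numpy as np

# Hypothetical binary features describing each app.
feature_names = ['puzzle', 'fitness', 'news', 'photo']
apps = {
    'App 1': np.array([1, 0, 0, 1]),
    'App 2': np.array([0, 1, 0, 0]),
    'App 3': np.array([0, 0, 1, 0]),
}

# Hypothetical user profile built from the features of apps the user liked.
user_profile = np.array([1, 0, 0, 1])

# Score every app by how many liked features it shares with the user.
scores = {name: int(features @ user_profile) for name, features in apps.items()}
best_app = max(scores, key=scores.get)
print(scores, best_app)
```

The app whose features overlap most with the user's profile gets the highest score, which is the whole recommendation rule in this simplified setting.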

# Collaborative Filtering

Collaborative filtering uses similarities between queries and items simultaneously to provide recommendations.

In [1]:
import pandas as pd
import numpy as np

In [2]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 17.3 MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1619433 sha256=44034b32fd1cadfcd4e01b99bdaddb4fd1a7af206a7081752ff69e515c500ff7
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


# The 100k MovieLens Data

In [3]:
from surprise import Dataset
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [4]:
!ls /root/.surprise_data/ml-100k/ml-100k/

allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base	u.genre  u.occupation
mku.sh	   u1.test  u3.base  u4.test  ua.base  ub.test	u.info	 u.user
README	   u2.base  u3.test  u5.base  ua.test  u.data	u.item


In [5]:
!cat /root/.surprise_data/ml-100k/ml-100k/README

SUMMARY & USAGE LICENSE

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
 
This data set consists of:
	* 100,000 ratings (1-5) from 943 users on 1682 movies. 
	* Each user has rated at least 20 movies. 
        * Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set.  The data set may be used for any research
purposes under th

## DETAILED DESCRIPTIONS OF DATA FILES

Here are brief descriptions of the data.

```
ml-data.tar.gz   -- Compressed tar file.  To rebuild the u data files do this:
                gunzip ml-data.tar.gz
                tar xvf ml-data.tar
                mku.sh

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
	          user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.
```
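
The README states that timestamps are Unix seconds since 1/1/1970 UTC. As a small sketch of what that means in practice, the snippet below converts one timestamp value from `u.data` (881250949) into a readable UTC date using only the standard library; the variable names are illustrative.

```python
from datetime import datetime, timezone

# One raw timestamp value taken from u.data (Unix seconds, UTC).
raw_timestamp = 881250949

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC.
rated_at = datetime.fromtimestamp(raw_timestamp, tz=timezone.utc)
print(rated_at.isoformat())
```

The resulting date falls inside the collection period quoted in the README (September 1997 through April 1998).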

## Movies Data

In [None]:
# Raw Text File
!cat /root/.surprise_data/ml-100k/ml-100k/u.item

In [7]:
# Check File Encoding
!pip install chardet
!chardetect /root/.surprise_data/ml-100k/ml-100k/u.item

/root/.surprise_data/ml-100k/ml-100k/u.item: ISO-8859-1 with confidence 0.73


In [8]:
movie_columns = ['movie_id', 'movie_title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('/root/.surprise_data/ml-100k/ml-100k/u.item', sep='|', encoding='ISO-8859-1', names=movie_columns)
movies.head()

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


In [9]:
genre_columns = ['unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
def get_genre_text(row):
    genre = []
    for col in genre_columns:
        if row[col] == 1:
            genre.append(col)
    return ', '.join(genre)

# Get genre of Toy Story
get_genre_text(movies.iloc[0])

"Animation, Children's, Comedy"

In [41]:
movies['Genre'] = movies.apply(get_genre_text, axis=1)
movies.head()

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,Genre
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,"Animation, Children's, Comedy"
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,"Action, Adventure, Thriller"
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,Thriller
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,"Action, Comedy, Drama"
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,"Crime, Drama, Thriller"


## User Data

In [None]:
!cat /root/.surprise_data/ml-100k/ml-100k/u.user

In [11]:
user_columns = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_csv('/root/.surprise_data/ml-100k/ml-100k/u.user', sep='|', encoding='ISO-8859-1', names=user_columns)
users.head(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


## Rating Data

In [None]:
!cat /root/.surprise_data/ml-100k/ml-100k/u.data

In [13]:
rating_columns = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('/root/.surprise_data/ml-100k/ml-100k/u.data', sep='\t', encoding='ISO-8859-1', names=rating_columns)
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [14]:
len(ratings)

100000
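
For a sense of how sparse this data is, the sketch below divides the number of ratings by the number of possible user-movie pairs, using the counts quoted in the README (943 users, 1682 movies, 100,000 ratings). It describes the original dataset only, before any extra ratings are added.

```python
# Counts quoted in the MovieLens 100k README.
n_users, n_movies, n_ratings = 943, 1682, 100_000

# Fraction of the user-movie matrix that actually holds a rating.
density = n_ratings / (n_users * n_movies)
print(f'{density:.2%} of the user-movie matrix is filled in')
```

Only a small fraction of the matrix is observed, which is why the model has to predict the missing entries rather than look them up.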

# Explore The Data

In [15]:
import matplotlib.pyplot as plt
import seaborn as sns

In [16]:
# np.object was removed from recent NumPy releases; the built-in `object` works everywhere.
users.describe(include=[object])

Unnamed: 0,gender,occupation,zip_code
count,943,943,943
unique,2,21,795
top,M,student,55414
freq,670,196,9


In [None]:
fig = plt.figure(figsize=(18,6))
ax = fig.subplots(nrows=1, ncols=2)

# Left
ax[0].set_title('User\'s Occupations')
users.groupby('occupation').count()['user_id'].plot(kind='bar', ax=ax[0])

# Right
ax[1].set_title('User\'s Age')
ax[1].set_xlabel('Age')
users['age'].plot(kind='hist', bins=30, ax=ax[1])

In [None]:
fig = plt.figure(figsize=(18,6))
ax = fig.subplots(nrows=1, ncols=2)

user_ratings = ratings.groupby('user_id').agg({'rating': ['count', 'mean']})

# Left
ax[0].set_xlabel('Number of Ratings per User')
sns.histplot(user_ratings['rating']['count'], ax=ax[0], binwidth=50, binrange=(0, 800))

# Right
ax[1].set_xlabel('Mean Rating')
sns.histplot(user_ratings['rating']['mean'], ax=ax[1], binwidth=0.5, binrange=(1, 5))

# Helper Functions

In [19]:
def get_movie_id_from_title(title):
    m = movies[movies['movie_title'].str.contains(title, regex=False, case=False)]

    if len(m) > 1:
        print(f'Found {len(m)} movies, Please use fullname')
        print(m[['movie_id', 'movie_title']])
        return None
    
    if len(m) == 0:
        print('Not found')
        return None

    return m['movie_id'].values[0]

get_movie_id_from_title('Star Wars (1977)')

50

In [20]:
get_movie_id_from_title('star')

Found 20 movies, Please use fullname
      movie_id                                     movie_title
49          50                                Star Wars (1977)
61          62                                 Stargate (1994)
123        124                                Lone Star (1996)
145        146                         Unhook the Stars (1996)
221        222                 Star Trek: First Contact (1996)
226        227   Star Trek VI: The Undiscovered Country (1991)
227        228             Star Trek: The Wrath of Khan (1982)
228        229      Star Trek III: The Search for Spock (1984)
229        230            Star Trek IV: The Voyage Home (1986)
270        271                        Starship Troopers (1997)
379        380                   Star Trek: Generations (1994)
448        449            Star Trek: The Motion Picture (1979)
449        450          Star Trek V: The Final Frontier (1989)
453        454                  Bastard Out of Carolina (1996)
1060      1061    

# Add Your Movie Ratings

In [21]:
# Search for a movie by typing part of its title into the `search` variable
search = 'terminator'
movies[movies['movie_title'].str.contains(search, case=False)].head(10)

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
95,96,Terminator 2: Judgment Day (1991),01-Jan-1991,,http://us.imdb.com/M/title-exact?Terminator%20...,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
194,195,"Terminator, The (1984)",01-Jan-1984,,"http://us.imdb.com/M/title-exact?Terminator,%2...",0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0


In [22]:
# Add your rating here
my_ratings_dict = [
    { 'movie_id': 50, 'rating': 5 },
    { 'movie_id': 64, 'rating': 4 },
    { 'movie_id': 271, 'rating': 4 },
    { 'movie_id': 95, 'rating': 5 },
    { 'movie_id': 71, 'rating': 5 },
    { 'movie_id': 222, 'rating': 2 },
    { 'movie_id': 228, 'rating': 2 },
    { 'movie_id': 229, 'rating': 3 },
    { 'movie_id': 449, 'rating': 2 },
    { 'movie_id': 450, 'rating': 1 },
]

my_ratings = pd.DataFrame(my_ratings_dict)
my_ratings['user_id'] = 0
my_ratings

Unnamed: 0,movie_id,rating,user_id
0,50,5,0
1,64,4,0
2,271,4,0
3,95,5,0
4,71,5,0
5,222,2,0
6,228,2,0
7,229,3,0
8,449,2,0
9,450,1,0


In [23]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
ratings = pd.concat([ratings, my_ratings], ignore_index=True)
ratings.tail(15)

Unnamed: 0,user_id,movie_id,rating,timestamp
99995,880,476,3,880175444.0
99996,716,204,5,879795543.0
99997,276,1090,1,874795795.0
99998,13,225,2,882399156.0
99999,12,203,3,879959583.0
100000,0,50,5,
100001,0,64,4,
100002,0,271,4,
100003,0,95,5,
100004,0,71,5,


# Matrix Factorization SVD Model

https://developers.google.com/machine-learning/recommendation/collaborative/basics

## 1D Embedding

Suppose we assign each movie a number between -1 and 1 describing whether it suits children or adults,  
where -1 means the movie is aimed at children  
and 1 means it is aimed at adults.

![](https://developers.google.com/machine-learning/recommendation/images/1D.svg)

The figure below shows each user's viewing history.

![](https://developers.google.com/machine-learning/recommendation/images/1Dmatrix.svg)

As you can see, **embedding the data** this way (representing each movie with a single number) explains the viewing behavior of users 3 and 4, but not that of users 1 and 2.

## 2D Embedding

Because a 1D embedding cannot explain every user's viewing behavior,  
we add a second dimension to get a 2D embedding:
- the x-axis is suitability for children versus adults
- the y-axis is blockbuster versus arthouse

![](https://developers.google.com/machine-learning/recommendation/images/2D.svg)

![](https://developers.google.com/machine-learning/recommendation/images/2Dmatrix.svg)

In this example we hand-picked the features (children vs. adult, arthouse vs. blockbuster).
In practice, the embeddings can be learned from users' viewing data.
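
As a rough illustration of how embeddings can be learned from feedback alone, the sketch below factorizes a tiny, made-up rating matrix with plain stochastic gradient descent. The matrix, the learning rate, and the regularization strength are all assumptions chosen for the demo; this is not Surprise's actual implementation.

```python
import numpy as np

# Toy feedback matrix (rows = users, columns = movies); 0 marks "not observed".
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)
observed = R > 0

rng = np.random.default_rng(0)
k = 2  # embedding dimension, as in the 2D example
P = rng.normal(scale=0.1, size=(R.shape[0], k))  # user embeddings
Q = rng.normal(scale=0.1, size=(R.shape[1], k))  # movie embeddings

def mse(P, Q):
    """Mean squared error on the observed entries only."""
    err = (R - P @ Q.T)[observed]
    return float(np.mean(err ** 2))

initial_error = mse(P, Q)
lr, reg = 0.05, 0.02  # assumed learning rate and L2 regularization
for epoch in range(500):
    for u, i in zip(*np.nonzero(observed)):
        e = R[u, i] - P[u] @ Q[i]          # prediction error for this rating
        pu = P[u].copy()                   # keep the old value for Q's update
        P[u] += lr * (e * Q[i] - reg * P[u])
        Q[i] += lr * (e * pu - reg * Q[i])
final_error = mse(P, Q)
print(initial_error, final_error)
```

Nobody told the model what the two dimensions mean; they emerge only from which users rated which movies highly.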

In [24]:
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

In [25]:
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

In [26]:
data

<surprise.dataset.DatasetAutoFolds at 0x7f5f1cc31810>

In [27]:
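# Use the SVD matrix-factorization algorithm with 30 latent factors.
# biased=False drops the bias terms, so a predicted rating is just the
# dot product of the user and item factor vectors.
algo = SVD(n_factors=30, biased=False)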

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9427  0.9458  0.9371  0.9352  0.9484  0.9419  0.0050  
MAE (testset)     0.7410  0.7404  0.7360  0.7356  0.7472  0.7400  0.0042  
Fit time          2.44    2.46    2.45    2.46    2.45    2.45    0.01    
Test time         0.24    0.15    0.14    0.23    0.15    0.18    0.04    


{'fit_time': (2.4371514320373535,
  2.4550704956054688,
  2.4513494968414307,
  2.4570772647857666,
  2.451483726501465),
 'test_mae': array([0.74101818, 0.74043435, 0.73598065, 0.73560675, 0.74717587]),
 'test_rmse': array([0.94274973, 0.94579346, 0.93714138, 0.93523317, 0.94841696]),
 'test_time': (0.241851806640625,
  0.15354514122009277,
  0.13637042045593262,
  0.22777915000915527,
  0.146958589553833)}
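
The two error measures reported above can be reproduced by hand. The sketch below computes MAE and RMSE for a handful of made-up (actual, predicted) rating pairs; the numbers are illustrative and unrelated to the cross-validation results.

```python
import math

# Hypothetical true ratings and model predictions for four test pairs.
actual = [4, 4, 1, 5]
predicted = [3.5, 4.0, 2.0, 5.0]

errors = [p - a for p, a in zip(predicted, actual)]

# MAE: average absolute error.
mae = sum(abs(e) for e in errors) / len(errors)
# RMSE: square root of the average squared error (penalizes large misses more).
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(mae, rmse)
```

Because RMSE squares each error before averaging, it is never smaller than MAE, which matches the pattern in the table above.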

In [28]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f5f30637590>

In [29]:
# Item (movie) factor matrix: one 30-dimensional vector per movie
algo.qi.shape

(1682, 30)

In [30]:
# User factor matrix: the 943 MovieLens users plus our own user_id 0
algo.pu.shape

(944, 30)

# User Factors & Item Factors

![](https://developers.google.com/machine-learning/recommendation/images/Matrixfactor.svg)

In [35]:
def get_user_embedding(user_id):
    inner_user_id = algo.trainset.to_inner_uid(user_id)
    return algo.pu[inner_user_id]

def get_item_embedding(item_id):
    inner_item_id = algo.trainset.to_inner_iid(item_id)
    return algo.qi[inner_item_id]
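
The helpers above translate raw ids into inner ids because the rows of `algo.pu` and `algo.qi` are indexed by inner id. As far as I understand Surprise's behavior, inner ids are assigned in the order in which raw ids first appear in the training data. The pure-Python sketch below mimics that idea; `build_id_map` is a hypothetical helper, and the last pair in the list is made up to show a repeated id.

```python
# (user_id, movie_id) pairs; the first five follow ratings.head() in this
# notebook, the last one is a made-up repeat.
raw_pairs = [(196, 242), (186, 302), (22, 377), (244, 51), (166, 346), (196, 51)]

def build_id_map(raw_ids):
    """Assign consecutive inner ids in order of first appearance."""
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)
    return mapping

user_map = build_id_map(u for u, _ in raw_pairs)
item_map = build_id_map(i for _, i in raw_pairs)
print(user_map)
print(item_map)
```

The practical consequence is that row `n` of a factor matrix generally does not belong to the user or movie whose raw id is `n`, so always go through `to_inner_uid` / `to_inner_iid`.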

In [36]:
get_item_embedding(item_id=1)

array([ 0.19696627,  0.0109829 ,  0.37872045, -0.21626305, -0.34625946,
       -0.10528369, -0.23143802, -0.22980518,  0.59683505,  0.40796117,
        0.07226169,  0.40533163,  0.61906784,  0.57550329, -0.03775312,
        0.44217309, -0.4129559 ,  0.19694524,  0.08678585, -0.42059529,
       -0.02460167, -0.55868001, -0.40998258, -0.81614693,  0.75383652,
       -0.43501086, -0.43230567, -0.5780743 , -0.04986305,  0.4909705 ])

In [37]:
get_user_embedding(user_id=0)

array([-0.02628447,  0.02881049,  0.42462501, -0.24359625,  0.16601058,
       -0.10075082, -0.12036345, -0.40516011,  0.42516332,  0.44438193,
        0.01968893,  0.53445152,  0.35882798,  0.69557579, -0.17576337,
        0.29891974, -0.24240175, -0.14174654, -0.14754893, -0.25854379,
       -0.28099556, -0.38286802, -0.40611118, -0.68507865,  0.2571413 ,
        0.0601776 , -0.34382254, -0.31562702, -0.01955096,  0.44041402])
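
With `biased=False`, Surprise's documentation describes the predicted rating as the dot product of the item and user factor vectors, and `predict` clips the estimate to the rating scale by default. The sketch below mimics that arithmetic with made-up three-dimensional vectors rather than the 30-dimensional ones above; `predict_rating` is a hypothetical helper.

```python
import numpy as np

# Made-up factor vectors standing in for rows of algo.pu and algo.qi.
user_vector = np.array([0.9, -0.4, 1.2])
item_vector = np.array([1.5, 0.3, 1.8])
strong_item_vector = np.array([3.0, 0.0, 3.0])

def predict_rating(user_vec, item_vec, low=1, high=5):
    """Dot-product score, clipped to the rating scale."""
    raw_score = float(user_vec @ item_vec)
    return min(high, max(low, raw_score))

in_range = predict_rating(user_vector, item_vector)
clipped = predict_rating(user_vector, strong_item_vector)
print(in_range, clipped)
```

The first score lands inside the 1 to 5 scale and is returned unchanged; the second exceeds 5 and is clipped.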

# Get similar Movies
https://developers.google.com/machine-learning/recommendation/overview/candidate-generation
![](https://developers.google.com/machine-learning/recommendation/images/2D.svg)

## Similarity Measures

Recommendation systems most commonly measure the similarity between vectors in one of three ways:
1. Cosine
2. Dot product
3. Euclidean distance

### Cosine Similarity
Measures the cosine of the angle between the two vectors: s(q, x) = cos(q, x)

### Dot Product Similarity
Similarity is measured by taking the dot product of the two vectors. The dot product can be computed in two ways:

<img src="https://betterexplained.com/wp-content/webp-express/webp-images/uploads/dotproduct/dot_product_components.png.webp" width="400">

<img src="https://betterexplained.com/wp-content/webp-express/webp-images/uploads/dotproduct/dot_product_rotation.png.webp" width="400">
<br/>
<img src="http://media5.datahacker.rs/2020/04/Picture27-1024x386.jpg" width="400">
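
Assuming the two ways shown are the component-wise form and the geometric form, the sketch below checks that both give the same number for a pair of made-up 2D vectors, using only the standard library.

```python
import math

a = (3.0, 4.0)
b = (4.0, 3.0)

# Way 1: multiply matching components and add them up.
component_form = a[0] * b[0] + a[1] * b[1]

# Way 2: |a| * |b| * cos(theta), where theta is the angle between a and b.
norm_a = math.hypot(*a)
norm_b = math.hypot(*b)
theta = math.atan2(a[1], a[0]) - math.atan2(b[1], b[0])
geometric_form = norm_a * norm_b * math.cos(theta)

print(component_form, geometric_form)
```

The geometric form makes it clear why the dot product grows with vector length, while cosine similarity (the same quantity divided by both norms) does not.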

### Euclidean Distance

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Euclidean_distance_2d.svg/600px-Euclidean_distance_2d.svg.png" width="400">

### Let's see some examples

<img src="https://developers.google.com/machine-learning/recommendation/images/Similarity.svg" width="400">

**Cosine**

Query: Item C > Item B > Item A

**Dot Product**

Query: Item A > Item B > Item C

**Euclidean distance**

Query: Item B > Item C > Item A
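
The three measures can disagree, as the rankings above show. The sketch below uses made-up 2D coordinates (not read off the figure) that were chosen so each measure picks a different favorite: item A has a large norm, item B is closest to the query, and item C points in exactly the same direction as the query.

```python
import numpy as np

query = np.array([2.0, 1.0])
items = {
    'A': np.array([5.0, 6.0]),   # large norm
    'B': np.array([1.5, 1.7]),   # physically close to the query
    'C': np.array([0.4, 0.2]),   # same direction as the query, small norm
}

def cosine(q, x):
    return q @ x / (np.linalg.norm(q) * np.linalg.norm(x))

def dot(q, x):
    return q @ x

def euclidean(q, x):
    return np.linalg.norm(q - x)

def rank(score, higher_is_better):
    """Return item names ordered from most to least similar."""
    return sorted(items, key=lambda name: score(query, items[name]),
                  reverse=higher_is_better)

cosine_ranking = rank(cosine, higher_is_better=True)
dot_ranking = rank(dot, higher_is_better=True)
euclidean_ranking = rank(euclidean, higher_is_better=False)  # smaller distance wins
print(cosine_ranking, dot_ranking, euclidean_ranking)
```

Note that Euclidean distance is a distance, not a similarity, so the ranking is by smallest value first.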

In [38]:
def compute_similarity(query_embedding, item_embeddings, method='dot'):
    a = query_embedding
    b = item_embeddings
    if method == 'cosine':
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
    scores = a.dot(b.T)
    return scores

In [39]:
def get_similar_movies(movie_title, method='dot'):
    movie_id = get_movie_id_from_title(movie_title)
    query_embedding = get_item_embedding(movie_id)
    item_embeddings = np.array([get_item_embedding(mid) for mid in movies['movie_id']])
    scores = compute_similarity(query_embedding, item_embeddings, method)

    return pd.DataFrame({
        'movie_id': movies['movie_id'],
        'movie_title': movies['movie_title'],
        'similarity_score': scores,
        'genre': movies['Genre']
    }).sort_values('similarity_score', ascending=False)

In [42]:
get_similar_movies('Star Wars (1977)', 'dot').head(10)

Unnamed: 0,movie_id,movie_title,similarity_score,genre
49,50,Star Wars (1977),6.578573,"Action, Adventure, Romance, Sci-Fi, War"
171,172,"Empire Strikes Back, The (1980)",6.402333,"Action, Adventure, Drama, Romance, Sci-Fi, War"
173,174,Raiders of the Lost Ark (1981),5.976358,"Action, Adventure"
180,181,Return of the Jedi (1983),5.868114,"Action, Adventure, Romance, Sci-Fi, War"
172,173,"Princess Bride, The (1987)",5.502729,"Action, Adventure, Comedy, Romance"
317,318,Schindler's List (1993),5.491239,"Drama, War"
168,169,"Wrong Trousers, The (1993)",5.311921,"Animation, Comedy"
63,64,"Shawshank Redemption, The (1994)",5.230188,Drama
407,408,"Close Shave, A (1995)",5.214071,"Animation, Comedy, Thriller"
482,483,Casablanca (1942),5.206997,"Drama, Romance, War"


In [43]:
get_similar_movies('Star Wars (1977)', 'cosine').head(10)

Unnamed: 0,movie_id,movie_title,similarity_score,genre
49,50,Star Wars (1977),1.0,"Action, Adventure, Romance, Sci-Fi, War"
171,172,"Empire Strikes Back, The (1980)",0.970198,"Action, Adventure, Drama, Romance, Sci-Fi, War"
180,181,Return of the Jedi (1983),0.964626,"Action, Adventure, Romance, Sci-Fi, War"
173,174,Raiders of the Lost Ark (1981),0.906883,"Action, Adventure"
1266,1267,Clockers (1995),0.89708,Drama
1115,1116,"Mark of Zorro, The (1940)",0.893989,Adventure
735,736,Shadowlands (1993),0.889367,"Drama, Romance"
574,575,City Slickers II: The Legend of Curly's Gold (...,0.882022,"Comedy, Western"
1215,1216,Kissed (1996),0.879085,Romance
18,19,Antonia's Line (1995),0.875765,Drama


**Exercise**

Compute the similarity between the following four movies:

- Lion King, The (1994)
- Aladdin (1992)
- Sleepless in Seattle (1993)
- Alien (1979)

Steps
1. Find the id of each movie
2. Get the embedding of each movie
3. Compute the cosine similarity between Lion King, The (1994) and the other movies
4. Compute the dot product similarity between Lion King, The (1994) and the other movies

*Hint: use the functions `get_item_embedding()` and `compute_similarity()`.*

In [None]:
# Get movie ids

# Get item embeddings

In [48]:
# Compute similarity - dot product

array([4.23919835, 4.23164351, 3.68265   ])

In [47]:
# Compute similarity - cosine

array([0.8962017 , 0.87035964, 0.71998854])

# Get Recommended Movies
![](https://developers.google.com/machine-learning/recommendation/images/2Dmatrix.svg)

In [49]:
my_ratings

Unnamed: 0,movie_id,rating,user_id
0,50,5,0
1,64,4,0
2,271,4,0
3,95,5,0
4,71,5,0
5,222,2,0
6,228,2,0
7,229,3,0
8,449,2,0
9,450,1,0


In [50]:
user_id = 0
movie_id = get_movie_id_from_title('Terminator 2: Judgment Day (1991)')
algo.predict(user_id, movie_id)

Prediction(uid=0, iid=96, r_ui=None, est=3.4586346130328214, details={'was_impossible': False})

In [51]:
def get_recommended_movies(user_id):
    pred_ratings = []
    for _, movie in movies.iterrows():
        pred_ratings.append({
            'movie_id': movie['movie_id'],
            'movie_title': movie['movie_title'],
            'pred_rating': algo.predict(user_id, movie['movie_id']).est,
            'Genre': movie['Genre']
        })
    
    return pd.DataFrame(pred_ratings).sort_values('pred_rating', ascending=False)


recommendations = get_recommended_movies(0)
recommendations.head(20)

Unnamed: 0,movie_id,movie_title,pred_rating,Genre
49,50,Star Wars (1977),4.121604,"Action, Adventure, Romance, Sci-Fi, War"
317,318,Schindler's List (1993),4.081233,"Drama, War"
63,64,"Shawshank Redemption, The (1994)",4.066635,Drama
173,174,Raiders of the Lost Ark (1981),4.025617,"Action, Adventure"
171,172,"Empire Strikes Back, The (1980)",3.99391,"Action, Adventure, Drama, Romance, Sci-Fi, War"
407,408,"Close Shave, A (1995)",3.968281,"Animation, Comedy, Thriller"
482,483,Casablanca (1942),3.909327,"Drama, Romance, War"
312,313,Titanic (1997),3.899157,"Action, Drama, Romance"
168,169,"Wrong Trousers, The (1993)",3.879623,"Animation, Comedy"
426,427,To Kill a Mockingbird (1962),3.870068,Drama


**Exercise**

Find users whose movie tastes are similar to those of user_id = 10, then look at their demographics to see how they resemble or differ from user_id = 10.

*Hint: the approach is similar to finding similar movies.*

In [None]:
# Find user 10's rating count
len(ratings[ratings['user_id']==10])

184

In [None]:
def get_similar_users(user_id, method='dot'):
    # Code here
    raise NotImplementedError('Exercise: implement get_similar_users')

In [None]:
get_similar_users(10, method='dot').head(10)

In [None]:
get_similar_users(10, method='cosine').head(10)

In [None]:
# Try finding users similar to ourselves (user_id = 0)
get_similar_users(0, method='cosine').head(10)

# Visualizing The Movie Factor

In [52]:
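from sklearn.manifold import TSNE
# Newer scikit-learn releases (1.5+) rename `n_iter` to `max_iter`.
tsne = TSNE(n_components=2, n_iter=500, verbose=3, random_state=1)
# Rows of algo.qi follow Surprise's inner item ids, not movie_id order, so
# collect the embeddings in the same order as `movies` to keep titles aligned.
ordered_item_factors = np.array([get_item_embedding(mid) for mid in movies['movie_id']])
movie_embeddings = tsne.fit_transform(ordered_item_factors)
projection = pd.DataFrame(columns=['x', 'y'], data=movie_embeddings)
projection['title'] = movies['movie_title']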

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1682 samples in 0.005s...
[t-SNE] Computed neighbors for 1682 samples in 0.213s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1682
[t-SNE] Computed conditional probabilities for sample 1682 / 1682
[t-SNE] Mean sigma: 0.249687
[t-SNE] Computed conditional probabilities in 0.110s
[t-SNE] Iteration 50: error = 75.7462540, gradient norm = 0.2046199 (50 iterations in 0.903s)
[t-SNE] Iteration 100: error = 75.1849670, gradient norm = 0.1882657 (50 iterations in 0.824s)
[t-SNE] Iteration 150: error = 75.3772354, gradient norm = 0.1808060 (50 iterations in 0.826s)
[t-SNE] Iteration 200: error = 74.7231750, gradient norm = 0.2045180 (50 iterations in 0.813s)
[t-SNE] Iteration 250: error = 75.6822357, gradient norm = 0.1838126 (50 iterations in 0.780s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 75.682236
[t-SNE] Iteration 300: error = 2.3055892, gradient norm = 0.0014667 (50 iterations in 0.637s)

In [53]:
projection.head()

Unnamed: 0,x,y,title
0,-12.161079,27.092005,Toy Story (1995)
1,-9.973627,20.56694,GoldenEye (1995)
2,5.927454,-16.654619,Four Rooms (1995)
3,-14.481998,3.750635,Get Shorty (1995)
4,-22.48847,6.335414,Copycat (1995)


In [54]:
import plotly.express as px
fig = px.scatter(projection, x='x', y='y', hover_name='title')
fig.show()