# Data, Vector, Similarity — Action!

*As promised, this is the second book in the recommendation series. Here I will use the knowledge and insights gathered from the first EDA book to recommend similar movies to you.*

*The system works like this: you enter the name of a movie, and it brings back the top 5 most relevant movies that match the characteristics of the movie you entered.*

Let's go!

In [1]:
# Normal imports
import pandas as pd
import numpy as np

# Might need
import matplotlib.pyplot as plt

In [18]:
PATH = "./DataSource/"

ratings = pd.read_csv(PATH + "ratings.csv", usecols=range(3))
movies = pd.read_csv(PATH + "movies.csv")
gnome = pd.read_csv(PATH + "genome-scores.csv")
gnome_lookup = pd.read_csv(PATH + "genome-tags.csv")

#### Making the data from different tables ready to merge

In [5]:
# Mean rating of each movie
rating_mean = ratings.groupby("movieId")["rating"].mean()

In [7]:
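# Pivot the genome scores to one row per movie and one column per tag, ready to merge
# (keyword arguments are required by recent pandas versions)
gnome_pivoted = gnome.pivot(index="movieId", columns="tagId", values="relevance")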
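To see what the pivot does, here is a minimal sketch on a made-up frame. The `toy` values are invented for illustration and are not taken from `genome-scores.csv`; the point is only that each (movieId, tagId) pair in the long table becomes one cell of a movie-by-tag matrix.

```python
import pandas as pd

# Toy long-format data shaped like genome-scores.csv (values are made up)
toy = pd.DataFrame({
    "movieId":   [1, 1, 2, 2],
    "tagId":     [1, 2, 1, 2],
    "relevance": [0.10, 0.90, 0.40, 0.20],
})

# One row per movie, one column per tag
toy_pivoted = toy.pivot(index="movieId", columns="tagId", values="relevance")
print(toy_pivoted)
```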

In [17]:
# Use "movieId" as the index so the tables can be merged on it
movies.set_index("movieId", inplace=True)

In [21]:
# One-hot (0/1) encoding of the genres
genre_encoded = movies.genres.str.get_dummies("|")
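A small sketch of what this encoding produces, using three made-up genre strings (the `toy_genres` series is hypothetical, not from `movies.csv`): every distinct genre becomes a 0/1 column.

```python
import pandas as pd

# Made-up genre strings in the same "A|B" format as movies.csv
toy_genres = pd.Series(
    ["Comedy|Horror", "Action", "Action|Comedy"],
    index=pd.Index([10, 11, 12], name="movieId"),
)

# Split on "|" and create one indicator column per genre
encoded = toy_genres.str.get_dummies("|")
print(encoded)
```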

Now, the resulting data should have 62423 rows in total, one per movie. So keep that number in mind.

In [24]:
merged = pd.merge(genre_encoded, gnome_pivoted, left_index=True, right_index=True)

In [30]:
merged.shape[0]

13816

↑ This shows us that the `gnome` (genome) data doesn't cover all movies. Hence, we can't use it directly right now.

**SOLUTION**: After making the first version of this recommendation system, I will impute the missing values using KNN. For now I am leaving the genome data out to keep the system simple.
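Why the row count dropped can be sketched on two tiny made-up tables (`left` and `right` are hypothetical stand-ins for the genre and genome frames): `pd.merge` defaults to an inner join, which keeps only the movies present on both sides, while `how="left"` keeps every movie and leaves `NaN` where data is missing.

```python
import pandas as pd

# Three movies with genres, but only two of them have tag scores
left = pd.DataFrame({"Action": [1, 0, 1]}, index=pd.Index([1, 2, 3], name="movieId"))
right = pd.DataFrame({"tag_1": [0.2, 0.7]}, index=pd.Index([1, 3], name="movieId"))

# Default how="inner": movie 2 disappears
inner = pd.merge(left, right, left_index=True, right_index=True)

# how="left": movie 2 stays, with NaN in the tag column
kept = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(len(inner), len(kept))
print(kept["tag_1"].isna().sum())
```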

In [33]:
rating_mean.index.nunique()

59047

Again, the ratings table doesn't cover all movies: `3376` of them have no rating at all. So I think we should just use the `genre` columns as the matcher for now, and come back later to impute those values.

In [8]:
from sklearn.neighbors import DistanceMetric

In [9]:
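class get_k_nearest_movies:
    """Brute-force lookup of the k rows closest to a query vector."""

    def __init__(self, k=5, distance_metric="hamming"):
        self.distance_metric = DistanceMetric.get_metric(distance_metric)
        self.k = k

    def fit(self, X):
        # Nothing is learned here; we only keep the data to compare against
        self.X = X

    def predict(self, vector):
        # pairwise() expects 2-D input, so turn a single vector into one row
        if vector.ndim == 1:
            vector = vector[np.newaxis, :]

        # Distance from the query to every stored row
        distances = self.distance_metric.pairwise(vector, self.X)[0]
        # Row positions (not movieIds) of the k smallest distances
        top_k_matches = distances.argsort()[:self.k]
        return top_k_matches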
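What the class computes can be sketched with plain NumPy. The `hamming` function below is a hand-written stand-in, assuming the `"hamming"` metric means the fraction of positions where two vectors differ; the `query` and `catalog` arrays are made up.

```python
import numpy as np

def hamming(a, b):
    """Fraction of positions where the two vectors differ."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a != b)

query = np.array([0, 1, 1, 0])      # e.g. a Comedy|Horror movie
catalog = np.array([
    [0, 1, 1, 0],                   # identical genres
    [0, 1, 0, 0],                   # one genre missing
    [1, 0, 0, 1],                   # nothing in common
])

# Distance from the query to every catalog row, then the 2 closest rows
distances = np.array([hamming(query, row) for row in catalog])
nearest = distances.argsort()[:2]
print(distances, nearest)
```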

In [126]:
recommendor = get_k_nearest_movies(distance_metric="hamming")

In [127]:
recommendor.fit(genre_encoded)

In [129]:
movies.iloc[recommendor.predict(genre_encoded.loc[12].values)]

movieId,title,genres
152447,Blood Surf (2000),Comedy|Horror
134511,Flesh Eating Mothers (1988),Comedy|Horror
195999,R.L. Stine's Monsterville: The Cabinet of Soul...,Comedy|Horror
171799,Range 15 (2016),Comedy|Horror
141918,Dude Bro Party Massacre III (2015),Comedy|Horror


Woah!<br>
I know it is simple, but it works. Now, let's fill the gaps and then try the distances again.

So:
1. We will fill in the missing ratings
2. We will fill in the missing genome values

For that we are going to use the KNN regressor.
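The idea behind KNN imputation can be sketched by hand on toy data. This is only an illustration under stated assumptions: the `features` and `rating` values are made up, the distance is the fraction of mismatching genre flags, and the missing rating is filled with the plain mean of the `k` nearest rated movies.

```python
import numpy as np
import pandas as pd

# Made-up genre flags; movie 5 has no rating yet
features = pd.DataFrame(
    {"Action": [1, 1, 0, 0, 1], "Comedy": [0, 0, 1, 1, 0], "Drama": [0, 1, 0, 1, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="movieId"),
)
rating = pd.Series([4.0, 3.0, 2.0, 1.0, np.nan], index=features.index, name="rating")

known = rating.notna()
k = 2

filled = rating.copy()
for movie_id in rating.index[~known]:
    # Hamming distance from the unrated movie to every rated movie
    dist = (features[known] != features.loc[movie_id]).mean(axis=1)
    # The k closest rated movies lend their average rating
    neighbours = dist.sort_values(kind="stable").index[:k]
    filled.loc[movie_id] = rating.loc[neighbours].mean()

print(filled)
```

Here movie 5 shares all its flags with movie 1 and two of three with movie 2, so it receives the mean of their ratings.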

In [133]:
impute1 = pd.merge(genre_encoded, rating_mean, 
                 left_index=True, right_index=True,
                 how="left")

In [144]:
X = impute1.drop("rating", axis=1)
y = impute1["rating"]

filter_ = y.isna()

x_train = X[~filter_]
y_train = y[~filter_]

x_filler = X[filter_]

In [145]:
from sklearn.neighbors import KNeighborsRegressor

In [147]:
model = KNeighborsRegressor(n_neighbors=5, metric="hamming")

In [148]:
model.fit(x_train, y_train)

KNeighborsRegressor(metric='hamming')

In [149]:
y_predicted = model.predict(x_filler)

In [150]:
y_predicted

array([3.41308579, 3.53792173, 3.06969697, ..., 2.67036906, 3.87      ,
       3.64893064])

Now, we could have done the same thing with our own model as well! Like ↓

In [151]:
model = get_k_nearest_movies(distance_metric="hamming")

In [153]:
model.fit(x_train)

In [157]:
preds = []
for to_fill in x_filler.values:
    indices = model.predict(to_fill)
    predicted = y_train.iloc[indices].mean()
    preds.append(predicted)

In [163]:
preds[:5]

[3.497752817626889,
 3.502865447259012,
 2.466666666666667,
 3.2833333333333328,
 3.2679688779688782]

Close? Nah!

But fine, we now have the ratings imputed!
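One plausible reason the two sets of predictions are not closer (an assumption, not something checked in this notebook): many movies share exactly the same genre vector, so far more than `k` candidates can sit at the same distance, and which of them get picked depends on how ties are broken. A made-up example with four exact matches and `k = 2`:

```python
import numpy as np

# Made-up distances: four candidates are exact genre matches (distance 0)
distances = np.array([0.0, 0.0, 0.0, 0.0, 0.5])
ratings = np.array([1.0, 2.0, 4.0, 5.0, 3.0])
k = 2

# All rows tied at the minimum distance are equally valid neighbours
tied = np.flatnonzero(distances == distances.min())

# Two different tie-breaking choices give very different averages
pick_first = ratings[tied[:k]].mean()
pick_last = ratings[tied[-k:]].mean()
print(len(tied), pick_first, pick_last)
```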

In [182]:
impute1

movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,...,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,rating
1,0,0,1,1,1,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.893708
2,0,0,1,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.251527
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,3.142028
4,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,0,0,2.853547
5,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.058434
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209157,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1.500000
209159,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,3.000000
209163,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,4.500000
209169,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.000000


In [185]:
imput_index = impute1.index[impute1.rating.isna()]

In [190]:
impute1.loc[imput_index, "rating"] = y_predicted

In [246]:
impute1.isna().sum()

(no genres listed)    0
Action                0
Adventure             0
Animation             0
Children              0
Comedy                0
Crime                 0
Documentary           0
Drama                 0
Fantasy               0
Film-Noir             0
Horror                0
IMAX                  0
Musical               0
Mystery               0
Romance               0
Sci-Fi                0
Thriller              0
War                   0
Western               0
rating                0
dtype: int64

DONE!

Now, let's do the same with the genome data!

In [192]:
impute2 = pd.merge(genre_encoded, gnome_pivoted, 
                 left_index=True, right_index=True,
                 how="left")

In [215]:
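# The first 20 columns are the genre flags; the rest are the genome tag scores
X = impute2.iloc[:, :20]
y = impute2.iloc[:, 20:]
# Rows with any missing tag score are the ones to fill
filter_ = impute2.isna().any(axis=1)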

x_train = X[~filter_]
y_train = y[~filter_]

x_fill = X[filter_]

In [224]:
model = KNeighborsRegressor()

In [225]:
model.fit(x_train, y_train)

KNeighborsRegressor()

In [226]:
y2_predicted = model.predict(x_fill)

In [227]:
y2_predicted

array([[0.05175, 0.05725, 0.0293 , ..., 0.0117 , 0.0993 , 0.0188 ],
       [0.02365, 0.024  , 0.07605, ..., 0.01305, 0.0987 , 0.02485],
       [0.03895, 0.0416 , 0.0359 , ..., 0.01455, 0.08345, 0.01895],
       ...,
       [0.05455, 0.06185, 0.063  , ..., 0.01815, 0.09385, 0.02115],
       [0.03565, 0.0388 , 0.0897 , ..., 0.02015, 0.0878 , 0.02325],
       [0.04215, 0.038  , 0.0748 , ..., 0.01045, 0.18785, 0.0318 ]])

In [231]:
indices = impute2[filter_].index

In [250]:
columns = impute2.iloc[:, 20:].columns

In [252]:
impute2.loc[indices, columns] = y2_predicted

In [255]:
impute2.isna().sum()

(no genres listed)    0
Action                0
Adventure             0
Animation             0
Children              0
                     ..
1124                  0
1125                  0
1126                  0
1127                  0
1128                  0
Length: 1148, dtype: int64

DONE!!!


So, now we can make use of both of them!

In [257]:
impute1["rating"]

movieId
1         3.893708
2         3.251527
3         3.142028
4         2.853547
5         3.058434
            ...   
209157    1.500000
209159    3.000000
209163    4.500000
209169    3.000000
209171    3.000000
Name: rating, Length: 62423, dtype: float64

In [259]:
impute2.loc[:, columns]

movieId,1,2,3,4,5,6,7,8,9,10,...,1119,1120,1121,1122,1123,1124,1125,1126,1127,1128
1,0.02875,0.02375,0.06250,0.07575,0.14075,0.14675,0.06350,0.20375,0.20200,0.03075,...,0.04050,0.01425,0.03050,0.03500,0.14125,0.05775,0.03900,0.02975,0.08475,0.02200
2,0.04125,0.04050,0.06275,0.08275,0.09100,0.06125,0.06925,0.09600,0.07650,0.05250,...,0.05250,0.01575,0.01250,0.02000,0.12225,0.03275,0.02100,0.01100,0.10525,0.01975
3,0.04675,0.05550,0.02925,0.08700,0.04750,0.04775,0.04600,0.14275,0.02850,0.03875,...,0.06275,0.01950,0.02225,0.02300,0.12200,0.03475,0.01700,0.01800,0.09100,0.01775
4,0.03425,0.03800,0.04050,0.03100,0.06500,0.03575,0.02900,0.08650,0.03200,0.03150,...,0.05325,0.02800,0.01675,0.03875,0.18200,0.07050,0.01625,0.01425,0.08850,0.01500
5,0.04300,0.05325,0.03800,0.04100,0.05400,0.06725,0.02775,0.07650,0.02150,0.02975,...,0.05350,0.02050,0.01425,0.02550,0.19225,0.02675,0.01625,0.01300,0.08700,0.01600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209157,0.03895,0.04160,0.03590,0.07690,0.12345,0.16170,0.10855,0.26160,0.28265,0.02495,...,0.29715,0.02025,0.01560,0.06480,0.31965,0.19270,0.02150,0.01455,0.08345,0.01895
209159,0.03080,0.03760,0.09445,0.08420,0.09740,0.07520,0.15565,0.39740,0.09100,0.07650,...,0.18285,0.02070,0.01725,0.06860,0.17900,0.10085,0.03610,0.01605,0.09840,0.03270
209163,0.05455,0.06185,0.06300,0.09040,0.12635,0.13055,0.11170,0.17865,0.06660,0.05465,...,0.08480,0.02900,0.03015,0.06020,0.24930,0.08125,0.03205,0.01815,0.09385,0.02115
209169,0.03565,0.03880,0.08970,0.12780,0.16115,0.18345,0.13395,0.25740,0.13385,0.11600,...,0.08660,0.03395,0.02510,0.07400,0.26200,0.15355,0.04330,0.02015,0.08780,0.02325


In [260]:
major_data = pd.merge(impute2, impute1["rating"], left_index=True, right_index=True)

In [261]:
major_data

movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,...,1120,1121,1122,1123,1124,1125,1126,1127,1128,rating
1,0,0,1,1,1,1,0,0,0,1,...,0.01425,0.03050,0.03500,0.14125,0.05775,0.03900,0.02975,0.08475,0.02200,3.893708
2,0,0,1,0,1,0,0,0,0,1,...,0.01575,0.01250,0.02000,0.12225,0.03275,0.02100,0.01100,0.10525,0.01975,3.251527
3,0,0,0,0,0,1,0,0,0,0,...,0.01950,0.02225,0.02300,0.12200,0.03475,0.01700,0.01800,0.09100,0.01775,3.142028
4,0,0,0,0,0,1,0,0,1,0,...,0.02800,0.01675,0.03875,0.18200,0.07050,0.01625,0.01425,0.08850,0.01500,2.853547
5,0,0,0,0,0,1,0,0,0,0,...,0.02050,0.01425,0.02550,0.19225,0.02675,0.01625,0.01300,0.08700,0.01600,3.058434
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209157,0,0,0,0,0,0,0,0,1,0,...,0.02025,0.01560,0.06480,0.31965,0.19270,0.02150,0.01455,0.08345,0.01895,1.500000
209159,0,0,0,0,0,0,0,1,0,0,...,0.02070,0.01725,0.06860,0.17900,0.10085,0.03610,0.01605,0.09840,0.03270,3.000000
209163,0,0,0,0,0,1,0,0,1,0,...,0.02900,0.03015,0.06020,0.24930,0.08125,0.03205,0.01815,0.09385,0.02115,4.500000
209169,1,0,0,0,0,0,0,0,0,0,...,0.03395,0.02510,0.07400,0.26200,0.15355,0.04330,0.02015,0.08780,0.02325,3.000000


Perfect! <br>
Saving it, so that it is easy to access later.

In [262]:
major_data.to_csv("ready_data.csv")

In [263]:
file = pd.HDFStore("ready_data_hdf5")
# NAME is "table"

In [264]:
file["table"] = major_data

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis0] [items->None]

  file["table"] = major_data
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_items] [items->None]

  file["table"] = major_data
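The warning mentions `mixed-integer` labels on `axis0`, which suggests the column names are the cause: the genre and `rating` columns are strings while the tag columns are integers. A minimal sketch of one possible workaround, on a made-up frame (`toy` is hypothetical), is to cast all column labels to strings before saving. This is an assumption about what the warning is pointing at, not something tested against the real file.

```python
import pandas as pd

# Made-up frame with the same mix of string and integer column labels
toy = pd.DataFrame({"Action": [1, 0], 1: [0.03, 0.04], 2: [0.02, 0.05], "rating": [3.9, 3.2]})
print(toy.columns.inferred_type)

# Make every column label a string, so the labels have a single type
toy.columns = toy.columns.astype(str)
print(list(toy.columns))
```

The trade-off is that tag columns then have to be looked up as `"1"`, `"2"`, and so on, rather than as integers.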


### Loading the saved file

In [2]:
file = pd.HDFStore("ready_data_hdf5")
df = file["table"]

In [10]:
recommendor = get_k_nearest_movies(distance_metric="hamming")

In [11]:
recommendor.fit(df)

In [52]:
# The movieId of Inception
sample = 79132

In [53]:
recommendations = recommendor.predict(df.loc[sample].values)

In [55]:
# predict() returns row positions, so look the movies up positionally
movies.iloc[recommendations]

,movieId,title,genres
14937,79132,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX
16184,85414,Source Code (2011),Action|Drama|Mystery|Sci-Fi|Thriller
32928,142032,The Transporter Refuelled (2015),Action|Crime|Thriller
12536,60684,Watchmen (2009),Action|Drama|Mystery|Sci-Fi|Thriller|IMAX
2645,2737,Assassination (1987),Action|Drama|Thriller
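One thing worth keeping in mind about these results: the Hamming distance only asks whether two values are exactly equal, so on continuous columns such as the tag scores a tiny difference counts the same as a huge one. The sketch below uses made-up vectors and compares it with the Euclidean distance, which is named here only as a possible alternative, not as what this notebook uses.

```python
import numpy as np

# Two genre flags followed by two made-up tag scores
a = np.array([1, 0, 0.30, 0.80])
b = np.array([1, 0, 0.31, 0.79])    # almost the same scores as a
c = np.array([1, 0, 0.95, 0.05])    # very different scores from a

# Hamming: fraction of positions that are not exactly equal
hamming_ab = np.mean(a != b)
hamming_ac = np.mean(a != c)

# Euclidean: sensitive to how far apart the scores are
euclid_ab = np.linalg.norm(a - b)
euclid_ac = np.linalg.norm(a - c)

print(hamming_ab, hamming_ac)
print(euclid_ab, euclid_ac)
```

Under Hamming both `b` and `c` look equally far from `a`, while the Euclidean distance separates them.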


# See!
Now we have a working model which returns the top 5 matching movies. Now it just needs to be deployed. Haah!
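Since the goal was to enter a movie name rather than an ID, here is a rough sketch of how a title lookup could wrap the same idea. `recommend_by_title` is a hypothetical helper, the toy tables are made up, and it assumes `movies` and `features` share the same `movieId` index in the same order. Unlike the run above, it drops the queried movie from its own results, which is a design choice rather than part of the original model.

```python
import numpy as np
import pandas as pd

def recommend_by_title(title, movies, features, k=3):
    """Hypothetical helper: find a movie by exact title and return the k
    other movies whose feature vectors have the smallest Hamming distance."""
    matches = movies.index[movies["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"No movie titled {title!r}")
    query_id = matches[0]
    query = features.loc[query_id].to_numpy()
    # Fraction of mismatching features between the query and every movie
    distances = (features.to_numpy() != query).mean(axis=1)
    order = np.argsort(distances, kind="stable")
    # Remove the queried movie itself, then keep the k closest
    order = order[features.index[order] != query_id][:k]
    return movies.iloc[order]

# Made-up catalogue for illustration
toy_movies = pd.DataFrame(
    {"title": ["A (2000)", "B (2001)", "C (2002)", "D (2003)"]},
    index=pd.Index([1, 2, 3, 4], name="movieId"),
)
toy_features = pd.DataFrame(
    {"Action": [1, 1, 0, 1], "Comedy": [0, 0, 1, 1], "Horror": [0, 1, 1, 1]},
    index=toy_movies.index,
)

result = recommend_by_title("A (2000)", toy_movies, toy_features, k=2)
print(result)
```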