## Demo - Movie Recommender

### Basic recommender

The full MovieLens dataset contains over 26 million ratings and 750,000 tag applications from 270,000 users on 45,000 movies. That is a bit too much for our exploratory examples, so we use a reduced dataset (you can find it on Leho).

In [4]:
import pandas as pd
metadata = pd.read_csv('Movies/movies_metadata.csv', low_memory=False)
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


We will start with a weighted rating as a metric. 
\begin{equation}WR = \left({{v} \over {v} + {m}} \cdot R\right) + \left({{ m} \over { v} + { m}} \cdot C\right) \end{equation}

where:

*   $v$ (```vote_count```) is the number of movie votes
*   $m$ is the minimum votes required to be listed
*   $R$ (```vote_average```) is the average rating of the movie
*   $C$ is the mean vote across the whole report

The value of ```m``` is a hyperparameter we can choose. We will set the cutoff ```m``` to the 90th percentile of ```vote_count```. In other words, for a movie to be featured in the charts, it must have at least as many votes as 90% of the movies. 
The value for $C$ is the mean of the ```vote_average``` column and can be found with the ```mean()``` function. 
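
To see how the formula behaves, here is a minimal sketch with made-up numbers (the values of `m` and `C` below are illustrative assumptions, not taken from the dataset). A movie with few votes is pulled towards the global mean $C$, while a movie with many votes keeps a score close to its own average $R$.

```python
def weighted_rating(v, R, m, C):
    """Blend the movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m, C = 100, 6.0  # illustrative threshold and global mean (assumed values)

# Same average rating, very different number of votes
few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)       # dominated by C
many_votes = weighted_rating(v=10_000, R=9.0, m=m, C=C)  # dominated by R

print(round(few_votes, 3), round(many_votes, 3))
```

Both toy movies have the same average of 9.0, yet the one with only 10 votes ends up close to 6 because there is little evidence behind its rating.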



In [5]:
# calculate the mean of the vote_average column
C = metadata['vote_average'].mean()
print("The average rating of a movie is: {0:5f}".format(C))

The average rating of a movie is: 5.618207


The average rating of a movie is around 5.62 on a scale of 10. 

Next, we calculate the vote threshold ```m```.


In [6]:
# calculate the threshold m
m = metadata['vote_count'].quantile(0.90)
print("We will filter out movies having >=", round(m), "vote counts. ")

We will filter out movies having >= 160 vote counts. 


We now create a new DataFrame where all records have a `vote_count >= m`

In [7]:
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
q_movies.shape

(4555, 24)

In [8]:
metadata.shape

(45466, 24)
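
In the cells above, 4555 of the 45466 movies remain, roughly 10%, which is what a 90th-percentile cutoff should give. A minimal sketch of the same filter on a toy series (the vote counts are made up for illustration):

```python
import pandas as pd

votes = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # toy vote counts (made up)

# 90th percentile; pandas interpolates linearly between data points by default
threshold = votes.quantile(0.90)

# keep only the entries at or above the threshold
kept = votes[votes >= threshold]

print(threshold, list(kept))
```

Only the single largest value of the ten survives, i.e. the top 10% of the toy series.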

In the next step we create a `weighted_rating()` function and a new feature `score` with the value of the function

In [9]:
def weighted_rating(x, m=m, C=C):
  v = x['vote_count']
  R = x['vote_average']
  return (v/(v+m)*R) + (m/(m+v)*C)

In [10]:
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

And finally, let's sort the DataFrame in descending order on the `score` feature.

In [11]:
q_movies = q_movies.sort_values('score', ascending=False)
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


### Content-Based filtering

In our content-based filtering, we will use the cosine similarity between the plot descriptions in the `overview` feature. 

In [12]:
metadata_small = metadata.copy().loc[metadata['vote_count'] >= 5] 
#we need a smaller dataset due to computational limits

In [13]:
metadata_small['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

This is basically an NLP problem, so we will do some basic text manipulation. The theoretical background will be taught in the DL course. 

Words carry semantic meaning, and word vectors are a numerical representation of the words in a document. We will start by computing the TF-IDF (Term Frequency-Inverse Document Frequency) vectors for each document. 
This gives a matrix where each column represents a word in the overview vocabulary and each row represents a movie. 

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english') # we remove common stopwords such as 'the', 'and', ...

In [15]:
# replace the NaN with an empty string
metadata_small['overview'] = metadata_small['overview'].fillna('')

In [16]:
# construct the tfidf matrix
tfidf_matrix = tfidf.fit_transform(metadata_small['overview'])
tfidf_matrix.shape

(30898, 59767)

The vocabulary contains 59767 distinct tokens (words, but also numbers) across the 30898 movie overviews.

In [17]:
tfidf_tokens = tfidf.get_feature_names_out()[3000:3020]

In [18]:
tfidf_tokens
tfidf_matrix.dtype

dtype('float64')

With the TF-IDF matrix we can now compute a similarity score. In this example we use the cosine similarity, which is independent of vector magnitude and relatively easy and fast to calculate. 
We won't use sklearn's `cosine_similarity()` here but prefer `linear_kernel()`: `TfidfVectorizer` L2-normalises each row by default, so the plain dot product already equals the cosine similarity and is faster to compute.
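
A minimal sketch of why the two functions can be swapped for TF-IDF input, using three made-up overviews (the texts are illustrative assumptions). With the default `norm='l2'`, every row of the TF-IDF matrix has unit length, so the dot product and the cosine similarity coincide.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# three toy overviews (made up for illustration)
docs = [
    "toys come alive when nobody is watching",
    "two children find a magical board game",
    "old toys and a forgotten board game",
]

# rows = documents, columns = vocabulary; rows are L2-normalised by default
toy_matrix = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(toy_matrix, toy_matrix)      # plain dot products
cos = cosine_similarity(toy_matrix, toy_matrix)  # normalises, then dot products

print(toy_matrix.shape[0], np.allclose(dot, cos))
```

The diagonal of either matrix is 1, since every document is identical to itself.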

In [19]:
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [20]:
cosine_sim.shape

(30898, 30898)

We now have a matrix of shape 30898 x 30898 that holds the similarity score of each movie overview with every other movie overview. Each movie corresponds to a 1x30898 row vector in which each column is the similarity score with one movie. 

In [21]:
cosine_sim[1]

array([0.01511297, 1.        , 0.04702084, ..., 0.        , 0.        ,
       0.        ])

We now define a function that takes a movie title as input and outputs a list of the $n$ most similar movies. To do that, we need a reverse mapping from movie title to index.

In [22]:
indices = pd.Series(metadata_small.index, index=metadata_small['title']).drop_duplicates()

In [23]:
indices[:20]

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
Heat                               5
Sabrina                            6
Tom and Huck                       7
Sudden Death                       8
GoldenEye                          9
The American President            10
Dracula: Dead and Loving It       11
Balto                             12
Nixon                             13
Cutthroat Island                  14
Casino                            15
Sense and Sensibility             16
Four Rooms                        17
Ace Ventura: When Nature Calls    18
Money Train                       19
dtype: int64

In [24]:
# Function taking in a movie title and outputting similar movies
def get_recommendations(title, nOR, cosine_sim=cosine_sim):
    # get the index of the movie
    ix = indices[title]
    # get the pairwise sim score
    sim_scores = list(enumerate(cosine_sim[ix]))
    # sort the movies based on the sim score
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # keep the nOR most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:(nOR+1)]
    # get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # return the top n most similar movies
    return metadata_small['title'].iloc[movie_indices]
    

In [25]:
get_recommendations('The Dark Knight Rises', 10)

12177                 August Rush
42602    Everybody Loves Somebody
7530       Through a Glass Darkly
7768               Robot Carnival
3857              What's Cooking?
44048                 Simon's Cat
38567             The Family Fang
12226                 The Savages
33472                  Kokowaah 2
24656             Two-Faced Woman
Name: title, dtype: object

In [26]:
get_recommendations('The Godfather', 2)

23507                         Cinderella
28965    Cinderella III: A Twist in Time
Name: title, dtype: object

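Our system runs and does what it was asked to do, but the quality of the recommendations is doubtful ... Part of the reason is likely technical: `metadata_small` is a filtered copy that keeps the row labels of `metadata`, so `indices[title]` returns a label, whereas `cosine_sim` is indexed by position. Calling `metadata_small.reset_index(drop=True)` before building `indices` would align the two.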
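
When results look this unrelated, it is worth checking whether titles map to row positions in the similarity matrix. A minimal sketch on made-up toy data (the titles, vote counts and similarity values are hypothetical), assuming the index is reset after filtering so that labels and positions coincide:

```python
import numpy as np
import pandas as pd

# toy data (hypothetical): five movies, two of which get filtered out
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "vote_count": [10, 1, 20, 2, 30],
})
small = toy.loc[toy["vote_count"] >= 5]  # keeps row labels 0, 2, 4

# similarity matrix is positional: row k belongs to the k-th row of `small`
sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.1],
                [0.9, 0.1, 1.0]])

# mapping built from the filtered frame: "E" -> label 4, out of range for sim
by_label = pd.Series(small.index, index=small["title"])

# mapping built after resetting the index: "E" -> position 2
small = small.reset_index(drop=True)
by_position = pd.Series(small.index, index=small["title"])

def most_similar(title, n=1):
    ix = by_position[title]
    order = np.argsort(sim[ix])[::-1]                 # most similar first
    order = [i for i in order if i != ix][:n]         # drop the movie itself
    return list(small["title"].iloc[order])

print(by_label["E"], by_position["E"], most_similar("E"))
```

With the label-based mapping, the lookup for "E" would point outside the 3 x 3 matrix; with the position-based mapping it returns the row that actually belongs to "E".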

We will integrate some richer metadata to capture finer details: we will base our recommender system on the top 3 actors, the director and the genres. These are not yet loaded in our dataset, so we'll do this first. 

In [27]:
credits = pd.read_csv('Movies/credits.csv')
keywords = pd.read_csv('Movies/keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

# convert IDS to int so we can merge
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

In [28]:
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


We need to perform some manipulations on the "stringified" lists and convert them to a more suitable format
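
A minimal sketch of what "stringified" means here, using a shortened version of the `genres` value shown for Toy Story above. `ast.literal_eval` safely parses a string containing a Python literal into the corresponding object.

```python
from ast import literal_eval

# the CSV stores the list of dicts as one long string
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw)                   # now a real list of dicts
names = [genre['name'] for genre in parsed]  # extract the genre names

print(type(raw).__name__, type(parsed).__name__, names)
```

Unlike `eval`, `literal_eval` only accepts literals (strings, numbers, lists, dicts, ...), so it does not execute arbitrary code found in the file.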

In [29]:
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

In [30]:
# some additional functions
import numpy as np

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# get top 3 elements or entire list
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # check if more than 3 elements exist, if yes --> return top 3
        if len(names) > 3:
            names = names[:3]
        return names
    return []

In [31]:
# define director, cast and genre keyword features
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

In [32]:
# again, we need to take a smaller dataset to do the computations later down the road
metadata_small = metadata.copy().loc[metadata['vote_count'] >= 10]

In [33]:
metadata_small[['title', 'cast', 'director', 'genres']].head(4)

Unnamed: 0,title,cast,director,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[Romance, Comedy]"
3,Waiting to Exhale,"[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker,"[Comedy, Drama, Romance]"


We need to remove the spaces between words so that our vectorizer doesn't count the "Angela" of "Angela Bassett" and "Angela Bennet" as the same token: the names become `angelabassett` and `angelabennet`.
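
A minimal sketch of the problem in plain Python. The helper names `tokens` and `collapse` are hypothetical and only mimic, in simplified form, what a word-based vectorizer would see before and after the clean-up.

```python
def tokens(names):
    """Split every name on whitespace, as a word-level tokenizer would."""
    return {tok for name in names for tok in name.lower().split()}

def collapse(names):
    """Lower-case each name and remove its spaces, so one name is one token."""
    return {name.lower().replace(" ", "") for name in names}

cast_a = ["Angela Bassett"]
cast_b = ["Angela Bennet"]

shared_raw = tokens(cast_a) & tokens(cast_b)        # the first name collides
shared_clean = collapse(cast_a) & collapse(cast_b)  # no overlap any more

print(shared_raw, shared_clean)
```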

In [34]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [35]:
# Apply clean_data function to your features.
features = ['cast', 'director', 'genres']

for feature in features:
    metadata_small[feature] = metadata_small[feature].apply(clean_data)

We now create a "metadata soup", basically a string containing all metadata we want to feed our vectorizer. 
The `create_soup` function joins all required columns by a space

In [36]:
# final preprocessing step :-)
def create_soup(x):
    return ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [37]:
metadata_small['soup'] = metadata_small.apply(create_soup, axis=1)

In [38]:
metadata_small[['soup']].head(4)

Unnamed: 0,soup
0,tomhanks timallen donrickles johnlasseter anim...
1,robinwilliams jonathanhyde kirstendunst joejoh...
2,waltermatthau jacklemmon ann-margret howarddeu...
3,whitneyhouston angelabassett lorettadevine for...


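The next steps are the same as in our previous (plot-based) recommender, with one big difference: we will use `CountVectorizer()` instead of `TfidfVectorizer()`. A common rationale is that TF-IDF would down-weight an actor or director simply for appearing in many movies, which is not what we want here. The technical difference will be covered in the DL - NLP class. 
Note that `get_recommendations` still reads the global `indices` built in In [22] from the earlier `metadata_small`; for the title lookups to match `cosine_sim2`, that mapping should be rebuilt from the new `metadata_small` after resetting its index.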

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata_small['soup'])

In [40]:
count_matrix.shape

(23374, 34662)

In [41]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [42]:
get_recommendations('Batman Begins', 4, cosine_sim2)

41176       The Accountant
1606         The Rainmaker
12371    Cassandra's Dream
3127       The Odessa File
Name: title, dtype: object

## Collaborative filtering - User based

We will use a different version of the MovieLens dataset with fewer rows and fewer features, because the computation would otherwise ... run out of memory (again, the life of an AI professional ;-) ).

In [43]:
import os
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

stats = pd.read_csv('ml-100k/u.info',header=None)

In [44]:
print(list(stats[0]))

['943 users', '1682 items', '100000 ratings']


In [45]:
features = ['user id', 'movie id', 'rating', 'timestamp']
movielens = pd.read_csv('ml-100k/u.data', sep='\t', header=None, names=features)
movielens.head()

Unnamed: 0,user id,movie id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [46]:
len(movielens), max(movielens['movie id']), min(movielens['movie id'])

(100000, 1682, 1)

In [47]:
d = 'movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western'
features2 = d.split(' | ')
features2

['movie id',
 'movie title',
 'release date',
 'video release date',
 'IMDb URL',
 'unknown',
 'Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

In [48]:
items_movielens = pd.read_csv('ml-100k/u.item', sep='|', header=None, names=features2, encoding='latin-1')
items_movielens

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [49]:
movies = items_movielens[['movie id', 'movie title']]
movies.head()

Unnamed: 0,movie id,movie title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [50]:
len(items_movielens.groupby(by=features2[1:])),len(items_movielens)

(1664, 1682)

In [51]:
len(items_movielens)

1682

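There is a difference: `u.item` contains 18 more rows than there are distinct movies, i.e. some movies are listed more than once under different `movie id` values.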

In [52]:
# merging the datasets
merged = pd.merge(movielens, movies, how='inner', on='movie id')
merged.head()

Unnamed: 0,user id,movie id,rating,timestamp,movie title
0,196,242,3,881250949,Kolya (1996)
1,186,302,3,891717742,L.A. Confidential (1997)
2,22,377,1,878887116,Heavyweights (1994)
3,244,51,2,880606923,Legends of the Fall (1994)
4,166,346,1,886397596,Jackie Brown (1997)


Since some movies are listed under more than one movie id, a user can have multiple ratings for the same title, for example:

In [53]:
merged[(merged['movie title'] == 'Chasing Amy (1997)') & (merged['user id'] == 894)]

Unnamed: 0,user id,movie id,rating,timestamp,movie title
62716,894,246,4,882404137,Chasing Amy (1997)
90596,894,268,3,879896041,Chasing Amy (1997)


In [54]:
refined = merged.groupby(by=['user id', 'movie title'], as_index=False).agg({"rating": "mean"})
refined.head()

Unnamed: 0,user id,movie title,rating
0,1,101 Dalmatians (1996),2.0
1,1,12 Angry Men (1957),5.0
2,1,"20,000 Leagues Under the Sea (1954)",3.0
3,1,2001: A Space Odyssey (1968),4.0
4,1,"Abyss, The (1989)",3.0
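
A small sketch of what this aggregation does for a user who rated two copies of the same title, using the two rows shown for user 894 above:

```python
import pandas as pd

# the two "Chasing Amy (1997)" ratings of user 894, under two movie ids
dupes = pd.DataFrame({
    "user id": [894, 894],
    "movie id": [246, 268],
    "rating": [4, 3],
    "movie title": ["Chasing Amy (1997)", "Chasing Amy (1997)"],
})

# one row per (user, title), with the mean of the ratings
averaged = dupes.groupby(by=["user id", "movie title"], as_index=False).agg({"rating": "mean"})

print(averaged)
```

The two rows collapse into a single one with a rating of 3.5, which is why the `rating` column of `refined` holds floats.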


### Training a KNN model for user-based collaborative recommender system

User input is `user id` 

In [55]:
# first, create a pivot table: the user-movie matrix
user_to_movie_df = refined.pivot(
    index='user id', 
    columns='movie title',
    values='rating').fillna(0)

user_to_movie_df.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,2.0,5.0,0.0,0.0,3.0,4.0,0.0,0.0,...,0.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,2.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0


In [56]:
# transform matrix to scipy sparse matrix
user_to_movie_sparse_df = csr_matrix(user_to_movie_df.values)
user_to_movie_sparse_df

<943x1664 sparse matrix of type '<class 'numpy.float64'>'
	with 99693 stored elements in Compressed Sparse Row format>

In [57]:
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(user_to_movie_sparse_df)
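
With `metric='cosine'`, the distance between two users is one minus the cosine similarity of their rating vectors. A minimal NumPy sketch on made-up rating vectors (the helper `cosine_distance` and the toy users are illustrative assumptions):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity of two rating vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

u1 = [5, 0, 3]
u2 = [10, 0, 6]  # same taste as u1, just on a larger scale
u3 = [0, 4, 0]   # rated only a movie that u1 did not rate

print(cosine_distance(u1, u2), cosine_distance(u1, u3))
```

Users whose ratings point in the same direction have a distance close to 0, and users with no rated movie in common have a distance of 1.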

In [58]:
## function to find top n similar users of the given input user 
def get_similar_users(user, n = 5):
  ## input to this function is the user and number of top similar users you want.

  knn_input = np.asarray([user_to_movie_df.values[user-1]])  #.reshape(1,-1)
  # knn_input = user_to_movie_df.iloc[0,:].values.reshape(1,-1)
  distances, indices = knn_model.kneighbors(knn_input, n_neighbors=n+1)
  
  print("Top",n,"users who are very much similar to the User-",user, "are: ")
  print(" ")
  for i in range(1,len(distances[0])):
    print(i,". User:", indices[0][i]+1, "separated by distance of",distances[0][i])
  return indices.flatten()[1:] + 1, distances.flatten()[1:]

Specify user id and number of similar users we want to consider here

In [59]:
from pprint import pprint
user_id = 778
print(" Few of movies seen by the User:")
pprint(list(refined[refined['user id'] == user_id]['movie title'])[:10])
similar_user_list, distance_list = get_similar_users(user_id,5)

 Few of movies seen by the User:
['Amityville Horror, The (1979)',
 'Angels in the Outfield (1994)',
 'Apocalypse Now (1979)',
 'Apollo 13 (1995)',
 'Austin Powers: International Man of Mystery (1997)',
 'Babe (1995)',
 'Back to the Future (1985)',
 'Blues Brothers, The (1980)',
 'Chasing Amy (1997)',
 'Clerks (1994)']
Top 5 users who are very much similar to the User- 778 are: 
 
1 . User: 124 separated by distance of 0.4586649429539592
2 . User: 933 separated by distance of 0.5581959868865324
3 . User: 56 separated by distance of 0.5858413112292744
4 . User: 738 separated by distance of 0.5916272517988691
5 . User: 653 separated by distance of 0.5991479757406326


Now we need to pick the top movies to recommend. This could be done by averaging the existing ratings given by the similar users and picking the top 10 or 15 movies. 
It is perhaps more effective, however, to weight each similar user's ratings by their distance from the input user, so that closer users count more. 

In [60]:
similar_user_list, distance_list

(array([124, 933,  56, 738, 653]),
 array([0.45866494, 0.55819599, 0.58584131, 0.59162725, 0.59914798]))

In [61]:
weightage_list = distance_list/np.sum(distance_list)
weightage_list

array([0.16419139, 0.19982119, 0.20971757, 0.2117888 , 0.21448105])
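
Note that `distance_list / np.sum(distance_list)` gives the largest weight to the most distant neighbour (0.214 for user 653 versus 0.164 for user 124). A sketch of one possible alternative, assuming we want weights that grow with similarity instead: convert each cosine distance to a similarity first and normalise that.

```python
import numpy as np

# distances of the five similar users, copied from the output above
distances = np.array([0.45866494, 0.55819599, 0.58584131, 0.59162725, 0.59914798])

distance_weights = distances / distances.sum()      # grows with distance
similarity = 1.0 - distances                        # cosine similarity
similarity_weights = similarity / similarity.sum()  # grows with similarity

print(distance_weights.round(3))
print(similarity_weights.round(3))
```

Other choices such as inverse-distance weights (`1 / distances`) would also work; which one performs best is an empirical question.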

In [62]:
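# caution: similar_user_list holds 1-based user ids, whereas .values[...] selects
# by 0-based position (i.e. the row of user id + 1);
# user_to_movie_df.loc[similar_user_list].values would select by user id instead
mov_ratings_sim_users = user_to_movie_df.values[similar_user_list]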
mov_ratings_sim_users

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 2., ..., 0., 0., 0.],
       [0., 0., 3., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [63]:
movies_list = user_to_movie_df.columns
movies_list

Index([''Til There Was You (1997)', '1-900 (1994)', '101 Dalmatians (1996)',
       '12 Angry Men (1957)', '187 (1997)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '3 Ninjas: High Noon At Mega Mountain (1998)', '39 Steps, The (1935)',
       ...
       'Yankee Zulu (1994)', 'Year of the Horse (1997)', 'You So Crazy (1994)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Poisoner's Handbook, The (1995)',
       'Zeus and Roxanne (1997)', 'unknown',
       'Á köldum klaka (Cold Fever) (1994)'],
      dtype='object', name='movie title', length=1664)

In [64]:
print("Weights shape:", len(weightage_list))
print("mov_ratings_sim_users shape:", mov_ratings_sim_users.shape)
print("Number of movies:", len(movies_list))

Weights shape: 5
mov_ratings_sim_users shape: (5, 1664)
Number of movies: 1664


In [65]:
# Broadcast the weight matrix to similar user rating matrix
weightage_list = weightage_list[:, np.newaxis] + np.zeros(len(movies_list))
weightage_list.shape

(5, 1664)

In [66]:
new_rating_matrix = weightage_list * mov_ratings_sim_users
mean_rating_list = new_rating_matrix.sum(axis=0)
mean_rating_list

array([0.        , 0.        , 1.02879509, ..., 0.        , 0.        ,
       0.        ])
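
The explicit expansion with `np.zeros` is not strictly needed, because NumPy broadcasts a column of weights across the rating matrix on its own. A minimal sketch with toy numbers (the weights and ratings are made up) showing three equivalent ways to get the weighted score per movie:

```python
import numpy as np

weights = np.array([0.5, 0.3, 0.2])   # one weight per similar user (toy values)
ratings = np.array([[5.0, 0.0, 3.0],
                    [4.0, 2.0, 0.0],
                    [0.0, 1.0, 5.0]])  # rows: users, columns: movies

# 1) explicit expansion of the weights to the full matrix shape
explicit = (weights[:, np.newaxis] + np.zeros(ratings.shape[1])) * ratings

# 2) let NumPy broadcast the (3, 1) column against the (3, 3) matrix
broadcast = weights[:, np.newaxis] * ratings

# 3) weighted sum per movie in one step, as a vector-matrix product
scores = weights @ ratings

print(scores)
```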

In [67]:
from pprint import pprint
def recommend_movies(n):
    n = min(len(mean_rating_list), n)
    pprint(list(movies_list[np.argsort(mean_rating_list)[::-1][:n]]))

In [68]:
print("Movies recommended based on similar users are: ")
recommend_movies(7)

Movies recommended based on similar users are: 
['Star Wars (1977)',
 'Terminator, The (1984)',
 'Fugitive, The (1993)',
 "Schindler's List (1993)",
 'Forrest Gump (1994)',
 'Princess Bride, The (1987)',
 'Empire Strikes Back, The (1980)']


The above list has obvious drawbacks: some of these movies have already been seen by the user, so the recommendations could be made more accurate by filtering those out. 

Try it yourself! :-)