# MOVIE RECOMMENDATION PROJECT

## 1.0 Introduction

In this project, we'll build an interactive movie recommendation system that allows you to type in a movie name and immediately get ten recommendations for other movies you might want to watch.

## 2.0 Reading in Our Movie Data in Pandas

In [3]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd


We will be downloading data using the link below:

https://files.grouplens.org/datasets/movielens/ml-25m.zip

This is a zip archive containing several files. After unzipping it, we will use only two of them:

- 'movies.csv' and 'ratings.csv'

Let's start with 'movies.csv'.
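If you prefer to unpack only the two files we need, the following is a minimal sketch using Python's standard zipfile module. The helper name extract_members and the example paths are hypothetical; adjust them to wherever you saved the archive.

```python
import zipfile
from pathlib import Path


def extract_members(zip_path, wanted, dest):
    """Extract only the archive members whose file name is in `wanted`."""
    dest = Path(dest)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Compare on the bare file name so nested folders do not matter.
            if Path(name).name in wanted:
                archive.extract(name, dest)
                extracted.append(dest / name)
    return extracted


# Hypothetical usage, assuming the archive was saved as Downloads/ml-25m.zip:
# extract_members("Downloads/ml-25m.zip", {"movies.csv", "ratings.csv"}, "Downloads/ml-25m")
```

The function keeps the folder structure stored inside the archive, so the extracted files end up in a subfolder of the destination.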

In [5]:
#convert file from csv to dataframe
movies = pd.read_csv('Downloads/ml-25m/ml-25m/movies.csv')
movies    

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


## 3.0 Cleaning Movie Titles Using Regex

The first thing we'll do is build our search engine. To do that, we should first clean the movie titles, because extra characters such as parentheses make searching harder. We can do this with a regular expression that keeps only the letters a-z and A-Z, the digits 0-9, and spaces, and removes everything else.

In [35]:
import re

def clean_title(title):
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title

In [36]:
movies["clean_title"] = movies["title"].apply(clean_title)

In [37]:
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001
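To see what the regular expression does to individual titles, here is a small standalone sketch. It repeats the clean_title function so it runs on its own. The first three titles come from the dataset; "Amélie (2001)" is a made-up string that shows a side effect: accented letters fall outside a-z and A-Z, so they are removed too.

```python
import re


def clean_title(title):
    # Remove every character that is not a letter, a digit, or a space.
    return re.sub("[^a-zA-Z0-9 ]", "", title)


examples = ["Toy Story (1995)", "Avengers, The (2012)", "Ant-Man (2015)", "Amélie (2001)"]
cleaned = {raw: clean_title(raw) for raw in examples}

for raw, result in cleaned.items():
    print(f"{raw!r} -> {result!r}")
# 'Toy Story (1995)' -> 'Toy Story 1995'
# 'Avengers, The (2012)' -> 'Avengers The 2012'
# 'Ant-Man (2015)' -> 'AntMan 2015'
# 'Amélie (2001)' -> 'Amlie 2001'
```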


## 4.0 Creating a TF-IDF Matrix

Next, we'll use TfidfVectorizer to create a TF-IDF matrix.

While creating the matrix, we'll pass in a parameter called ngram_range when we initialize the class. Instead of looking only at individual words in the title, the vectorizer will also look at n-grams. With ngram_range=(1,2), that means single words plus groups of two consecutive words, so instead of just looking at toy, story, and 1995, it also looks at toy story together and story 1995 together. This makes our search more accurate.

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Initialize the TfidfVectorizer, and set the ngram_range parameter to (1,2)
vectorizer = TfidfVectorizer(ngram_range=(1,2))

#Create the TF-IDF matrix.
tfidf = vectorizer.fit_transform(movies["clean_title"])
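To make the effect of ngram_range=(1,2) concrete, here is a rough plain-Python sketch of the terms extracted from one title. The helper word_ngrams is hypothetical and only approximates TfidfVectorizer: it lowercases and splits on whitespace, whereas the default scikit-learn tokenizer also drops single-character tokens.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for every n in the inclusive range."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n consecutive tokens across the title.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("Toy Story 1995"))
# ['toy', 'story', '1995', 'toy story', 'story 1995']
```

Each of these terms becomes a column in the matrix, so a query for "toy story" matches on the two-word term as well as on the individual words.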

## 5.0 Creating a Search Function

Next, we'll compute the similarity between our search term and all the titles in our data. To do this, we'll use cosine_similarity, which is available in scikit-learn, so we don't need to implement it ourselves.

We'll then write a function called search, which takes in a search term; in this case, the term is a title we want to search. The function will then do the following:

    a. Clean the title
    b. Convert the title into a set of numbers
    c. Use cosine_similarity to find the similarity between our search term and all the titles in our data
    d. Return the five most similar titles to our search term

In [39]:


from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    #Clean the title
    title = clean_title(title)
    
    #Convert the title into a set of numbers
    query_vec = vectorizer.transform([title])
    
    #Use cosine_similarity to find the similarity between our search term and all the titles in our data
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    
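    #Find the five most similar titles (argpartition returns them in no particular order)
    indices = np.argpartition(similarity, -5)[-5:]

    #Sort those five so that the most similar title comes first
    indices = indices[np.argsort(similarity[indices])[::-1]]
    results = movies.iloc[indices]

    return results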
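For intuition about what cosine_similarity measures, here is a minimal plain-Python sketch. As a simplifying assumption it uses raw word counts instead of TF-IDF weights, and the three titles are a toy corpus. The notebook itself relies on scikit-learn for the real computation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = Counter("toy story".split())
titles = ["toy story 1995", "jumanji 1995", "toy story 2 1999"]
scores = [cosine(query, Counter(t.split())) for t in titles]

# Index of the title with the highest similarity to the query.
best = max(range(len(titles)), key=scores.__getitem__)
print(titles[best], round(scores[best], 3))
# toy story 1995 0.816
```

A title that shares no words with the query scores 0, and a title with extra words scores lower than a closer match, which is why the shorter "toy story 1995" ranks above "toy story 2 1999".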

## 6.0 Building an Interactive Search Box in Jupyter

Now that we have created our search function, we're going to build an interactive Jupyter Notebook widget wherein we can type in the name of a movie and see the search results.

We need to import ipywidgets. Widgets are small, interactive elements we can embed in notebooks. They let us enter input and then act on it. We also need to import display from IPython.display; display is a function that renders output in the notebook.

In [12]:
# pip install ipywidgets
#jupyter labextension install @jupyter-widgets/jupyterlab-manager

In [41]:
import ipywidgets as widgets
from IPython.display import display

#Create an input widget
movie_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)

#Create an output widget
movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            display(search(title))

movie_input.observe(on_type, names='value')


display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()

## 7.0 Reading in Movie Ratings Data

We've finished the first half of the project. The second half is more exciting because we'll build the actual recommendation system. We need to find movies similar to a movie we liked. If we liked a specific movie, we can search for it and get recommendations. The ratings.csv file will help us do this.

In the ratings.csv file, we have the columns userId, movieId, rating, and timestamp. Each row records how one user rated one movie. We'll create a function to find all the users who also liked the movie that we typed in. For example, if we type Hulk, we want to find all users who also liked Hulk. Then we want to see the other movies they liked because those will probably be good recommendations for us.

In [42]:
ratings = pd.read_csv("Downloads/ml-25m/ml-25m/ratings.csv")


In [43]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434


In [44]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

## 8.0 Finding Users Who Liked the Same Movie

Now we'll find the users who liked the same movie we liked. Then, we need to find the other movies that they liked. Once we have done that, we'll establish a threshold for recommendations. For example, we could say that at least 10% of users like us need to like the movie for inclusion in our recommendations.

In [56]:
movie_id = 89745

#def find_similar_movies(movie_id):
movie = movies[movies["movieId"] == movie_id]

In [57]:
#Find the users who liked the same movie we liked (rating above 4)
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

In [58]:
#Find the other movies that they liked
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]

In [59]:
#Convert the counts to the fraction of similar users who liked each movie
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

#Keep only the movies that more than 10% of similar users liked
similar_user_recs = similar_user_recs[similar_user_recs > .10]
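The same logic can be traced by hand on a tiny example. The sketch below uses plain Python and a made-up list of (userId, movieId, rating) tuples instead of pandas, so it runs without the dataset. The movie labels "A" to "D" are placeholders.

```python
from collections import Counter

# Toy ratings: (userId, movieId, rating). Entirely made up for illustration.
ratings = [
    (1, "A", 5.0), (1, "B", 4.5), (1, "C", 3.0),
    (2, "A", 4.5), (2, "B", 5.0),
    (3, "A", 5.0), (3, "D", 4.5),
    (4, "B", 5.0), (4, "D", 5.0),
    (5, "A", 3.0), (5, "C", 5.0),
]
target = "A"

# Users who rated the target movie above 4.
similar_users = {u for u, m, r in ratings if m == target and r > 4}

# Count every movie those users rated above 4 (this includes the target itself).
liked_counts = Counter(m for u, m, r in ratings if u in similar_users and r > 4)

# Turn counts into the share of similar users who liked each movie.
similar_share = {m: c / len(similar_users) for m, c in liked_counts.items()}

# Keep only movies liked by more than 10% of similar users.
kept = {m: s for m, s in similar_share.items() if s > 0.10}
print(sorted(kept.items()))
```

Here users 1, 2, and 3 are the similar users. All of them liked "A", two of the three liked "B", and one liked "D". The target movie always has a share of 1.0, which is why it appears at the top of its own recommendations later on.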

## 9.0 Determining How Much Users Like Movies

Now, we're going to find how many of the users in our dataset like these movies. We need to find movies that are specific to our niche. For example, if someone likes the Avengers, you want to find other movies they like that are similar to the Avengers. You don't just want all of the movies they like because they probably like many movies that don't have anything to do with the Avengers.



In [65]:
#Find all users who rated a movie highly that is in our set of recommended movies
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]


In [66]:
#Find what percentage of all users recommend each of these movies.
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

all_user_recs

318       0.346395
296       0.288146
2571      0.247010
356       0.238136
593       0.228665
            ...   
86332     0.010142
91630     0.009324
122900    0.008573
122926    0.008070
106072    0.005289
Name: movieId, Length: 193, dtype: float64
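As a rough illustration of this baseline step, the following plain-Python sketch computes the share of all users who liked each candidate movie. The ratings and the candidate set are made-up toy values, not taken from the dataset. Note that, as in the notebook code, the denominator is the number of distinct users who liked at least one candidate movie.

```python
from collections import Counter

# Toy ratings: (userId, movieId, rating). Entirely made up for illustration.
ratings = [
    (1, "A", 5.0), (1, "B", 4.5), (1, "C", 3.0),
    (2, "A", 4.5), (2, "B", 5.0),
    (3, "A", 5.0), (3, "D", 4.5),
    (4, "B", 5.0), (4, "D", 5.0),
    (5, "A", 3.0), (5, "C", 5.0),
]
# Hypothetical set of movies that passed the 10% threshold.
candidates = {"A", "B", "D"}

# Every (user, movie) pair where any user rated a candidate movie above 4.
liked = [(u, m) for u, m, r in ratings if m in candidates and r > 4]

# Distinct users who liked at least one candidate movie.
n_users = len({u for u, _ in liked})

# Share of those users who liked each candidate movie.
all_share = {m: c / n_users for m, c in Counter(m for _, m in liked).items()}
print(sorted(all_share.items()))
# [('A', 0.75), ('B', 0.75), ('D', 0.5)]
```

User 5 does not count toward the denominator because they liked none of the candidate movies.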

## 10.0 Creating a Recommendation Score

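Now that we have both percentages, we can compare them. A movie that similar users like much more often than users in general is a good recommendation.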

In [67]:
#Use pd.concat to concatenate similar user recommendations and all user recommendations.
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", "all"]


In [68]:
rec_percentages

Unnamed: 0,similar,all
1,0.236083,0.126250
32,0.103877,0.101516
47,0.203115,0.146232
50,0.211067,0.202959
110,0.182240,0.162835
...,...,...
134853,0.198641,0.036444
152081,0.133532,0.020652
164179,0.128728,0.029124
166528,0.124751,0.014411


In [72]:
#Create a score by dividing similar user recommendations by all user recommendations.
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]


In [73]:
#Sort the values to show highest values first.
rec_percentages = rec_percentages.sort_values("score", ascending=False)


In [74]:
rec_percentages

Unnamed: 0,similar,all,score
89745,1.000000,0.040459,24.716368
106072,0.103711,0.005289,19.610199
122892,0.241054,0.012367,19.491770
102125,0.216534,0.012119,17.867419
88140,0.215043,0.012052,17.843074
...,...,...,...
296,0.288933,0.288146,1.002730
593,0.222830,0.228665,0.974483
527,0.199967,0.217833,0.917984
1193,0.100895,0.120244,0.839081
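Written out as a formula, the score for a movie $m$ is the ratio of the two shares in the table, where $\text{similar}(m)$ is the share of similar users who rated $m$ above 4 and $\text{all}(m)$ is the share of all users (among those who liked any candidate movie) who rated $m$ above 4:

$$
\text{score}(m) = \frac{\text{similar}(m)}{\text{all}(m)}
$$

Using two rows from the table above:

$$
\text{score}(89745) = \frac{1.000000}{0.040459} \approx 24.72,
\qquad
\text{score}(296) = \frac{0.288933}{0.288146} \approx 1.003
$$

A score near 1 means similar users like the movie about as often as everyone else, so it says little about our taste. A score well above 1 means the movie is disproportionately popular among users like us. The displayed shares are rounded, so recomputing a score from them can differ slightly in the last digits.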


In [75]:
#Take the top 10 recommendations and merge them with movies data.
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
17067,1.0,0.040459,24.716368,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,Avengers The 2012
20513,0.103711,0.005289,19.610199,106072,Thor: The Dark World (2013),Action|Adventure|Fantasy|IMAX,Thor The Dark World 2013
25058,0.241054,0.012367,19.49177,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi,Avengers Age of Ultron 2015
19678,0.216534,0.012119,17.867419,102125,Iron Man 3 (2013),Action|Sci-Fi|Thriller|IMAX,Iron Man 3 2013
16725,0.215043,0.012052,17.843074,88140,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War,Captain America The First Avenger 2011
16312,0.175447,0.010142,17.299824,86332,Thor (2011),Action|Adventure|Drama|Fantasy|IMAX,Thor 2011
21348,0.287608,0.016737,17.183667,110102,Captain America: The Winter Soldier (2014),Action|Adventure|Sci-Fi|IMAX,Captain America The Winter Soldier 2014
25071,0.214049,0.012856,16.649399,122920,Captain America: Civil War (2016),Action|Sci-Fi|Thriller,Captain America Civil War 2016
25061,0.136017,0.008573,15.865628,122900,Ant-Man (2015),Action|Adventure|Sci-Fi,AntMan 2015
14628,0.242876,0.015517,15.651921,77561,Iron Man 2 (2010),Action|Adventure|Sci-Fi|Thriller|IMAX,Iron Man 2 2010


## 11.0 Building a Recommendation Function

Now we need to put all of these steps into a function. It should return the following columns for our top 10 movie recommendations:

    score
    title
    genres

In [76]:
def find_similar_movies(movie_id):
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

## 12.0 Create an Interactive Recommendation Widget

Now we can build the widget that will do this automatically so we can type in a movie title and get recommendations.

In [77]:
import ipywidgets as widgets
from IPython.display import display

movie_name_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()

Congratulations! You finished the project! There's still a lot more you can do to extend the project if you want. Here are some ideas:

    Try using the genres to add a second input box to filter the recommendation by a specific genre.
    Try to use the genres to improve the actual recommendation engine.
    If you can find a dataset with metadata like tags for these movies, use it to improve the recommendation.