# CONTENT-BASED RECOMMENDATION SYSTEM 
# (MOVIE RECOMMENDATION)

### Business Requirement
This analysis is about a user we'll call Sam. Sam is a subscriber to the company's online streaming service, and I have been tasked with recommending movies he might like based on his streaming history and ratings.

### About The Dataset
This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 22884377 ratings and 586994 tag applications across 34208 movies. These data were created by 247753 users between January 09, 1995 and January 29, 2016. This dataset was generated on January 29, 2016. 

Users were selected at random for inclusion. All selected users had rated at least 1 movie. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files: `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`, but the focus will be strictly on two of them, `movies.csv` and `ratings.csv`. The dataset was downloaded from IBM Cloud Object Storage.

In [8]:
#importing the dataset
import urllib.request
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%205/data/moviedataset.zip"
filename = "Moviedata.zip"
urllib.request.urlretrieve(url, filename)

('Moviedata.zip', <http.client.HTTPMessage at 0x28dddee8a90>)
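
The download above leaves a zip archive on disk, while the later cells read CSV files from an extracted folder. A minimal sketch of the extraction step follows, assuming the archive unpacks into an `ml-latest` folder (as the paths used later suggest). The helper name `extract_archive` and the destination folder `Moviedata` are illustrative choices, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())


# Only run the extraction if the archive downloaded earlier is present.
if Path("Moviedata.zip").exists():
    members = extract_archive("Moviedata.zip", "Moviedata")
    print(members[:5])
```

After this step the CSV files can be read with a relative path instead of a machine-specific absolute one.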

### Pre-processing

In [28]:
# import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
%matplotlib inline

In [73]:
#read the movie file into its DataFrame
movie_df = pd.read_csv(r'C:\Users\Hello\Moviedata\ml-latest\movies.csv')

movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [74]:
#Using regular expressions to find a year stored between parentheses
#specify the parentheses so we don't conflict with movies that have years in their titles
movie_df['year'] = movie_df['title'].str.extract(r'(\(\d\d\d\d\))', expand = False)
#Removing the parentheses
movie_df['year'] = movie_df['year'].str.extract(r'(\d\d\d\d)', expand = False)
#Removing the years from the 'title' column (regex=True so the pattern is not treated as a literal string)
movie_df['title'] = movie_df['title'].str.replace(r'(\(\d\d\d\d\))', '', regex = True)
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
movie_df['title'] = movie_df['title'].str.strip()

movie_df.head()


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
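
To see why the parentheses matter in the pattern, here is a small self-contained sketch on three made-up rows (a toy Series, not `movie_df`). Two of the titles contain a four-digit number that is not the release year, and anchoring the pattern on the parentheses keeps those numbers in the title.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",
    "Blade Runner 2049 (2017)",
])

# Capture only four digits that are wrapped in parentheses.
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the parenthesised year and any whitespace left around the title.
clean_titles = titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(pd.DataFrame({"title": clean_titles, "year": years}))
```

A plain `\d{4}` pattern would have picked up `2001` and `2049` as years for the last two rows.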


In [75]:
#Every genre is separated by a | so we simply have to call the split function on |
movie_df['genres'] = movie_df.genres.str.split('|')
movie_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


Keeping genres as a list isn't convenient for a content-based recommender, so we will use one-hot encoding to convert each list into a vector in which every column corresponds to one possible genre. Each genre column holds 1 if the movie has that genre and 0 if it doesn't. We store the encoded table in a separate dataframe, leaving `movie_df` unchanged, since the genre columns won't be important for our first recommendation system.

In [76]:
#Copying the movie dataframe into a new one since we won't need to use the genre information in our first case.
moviecopy_df = movie_df.copy()
#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in movie_df.iterrows():
    for genre in row['genres']:
        moviecopy_df.at[index, genre] = 1
# fill NaN values with 0        
moviecopy_df = moviecopy_df.fillna(0)
moviecopy_df.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
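
The row-by-row loop above can be slow on tens of thousands of movies. As a hedged alternative, pandas provides `Series.str.get_dummies`, which builds the same kind of 0/1 table directly from pipe-separated genre strings. The sketch below uses a three-row toy Series rather than `movie_df`. Note that the columns come out in alphabetical order and hold integers rather than floats.

```python
import pandas as pd

# Genre strings in the same pipe-separated form as the raw movies.csv column.
genres = pd.Series([
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy",
])

# One column per distinct genre, 1 where the movie has it and 0 otherwise.
one_hot = genres.str.get_dummies(sep="|")
print(one_hot)
```

If the genres have already been split into lists, joining them back with `str.join('|')` before calling `get_dummies` would be one way to reuse the same call.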


In [79]:
# read the rating file into its DataFrame
rating_df = pd.read_csv(r'C:\Users\Hello\Moviedata\ml-latest\ratings.csv')
rating_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


In [80]:
#drop the timestamp column as it's unnecessary
rating_df = rating_df.drop(columns='timestamp')
rating_df.head()


Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0
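
With more than 22 million rows, the ratings file can take a noticeable amount of memory. One possible way to reduce it, sketched here on an in-memory sample rather than the real file, is to skip the timestamp column at read time with `usecols` and to request narrower dtypes. The specific dtypes are an assumption about what the ids and ratings need, not something the dataset documentation prescribes.

```python
import io

import pandas as pd

# A tiny stand-in for ratings.csv with the same column layout.
csv_text = (
    "userId,movieId,rating,timestamp\n"
    "1,169,2.5,1204927694\n"
    "1,2471,3.0,1204927438\n"
    "2,2571,3.5,1436165433\n"
)

ratings = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["userId", "movieId", "rating"],  # timestamp is never loaded
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)
print(ratings.dtypes)
```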


### Content-Based Recommendation System
This technique attempts to figure out which aspects of an item a user likes most, and then recommends items that share those aspects. In this case, we're going to try to figure out the input's favorite genres from the movies and ratings given.

Let's begin by creating an input user to recommend movies to:

In [81]:
inputId = [{'title': 'Breakfast Club, The', 'rating': 2},
          {'title': 'Toy Story', 'rating': 5},
          {'title': 'Jumanji', 'rating': 4.5},
          {'title': 'Pulp Fiction', 'rating': 3},
          {'title': 'Akira', 'rating': 3}]
inputMovies = pd.DataFrame(inputId)
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",2.0
1,Toy Story,5.0
2,Jumanji,4.5
3,Pulp Fiction,3.0
4,Akira,3.0


With the input complete, let's extract the input movies' IDs from the movies dataframe and add them to it.

We can achieve this by first filtering out the rows that contain the input movies' titles and then merging this subset with the input dataframe. We also drop unnecessary columns from the input to save memory.

In [82]:
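#Rows of movie_df whose title matches one of the input titles
inputRows = movie_df[movie_df['title'].isin(inputMovies['title'].tolist())]

#Merging on the shared 'title' column attaches each movieId to its rating
inputMovies = pd.merge(inputRows, inputMovies, on = 'title')
inputMovies = inputMovies.drop(columns = ['genres', 'year'])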
inputMovies


Unnamed: 0,movieId,title,rating
0,1,Toy Story,5.0
1,2,Jumanji,4.5
2,296,Pulp Fiction,3.0
3,1274,Akira,3.0
4,1968,"Breakfast Club, The",2.0
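
Matching on title alone silently drops any input title that is misspelled or formatted differently from the catalogue (for example `The Breakfast Club` instead of `Breakfast Club, The`). A small sketch of how such misses could be detected with the `indicator` option of `pd.merge`; the two toy frames and the title `Unknown Film` are made up for illustration.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2, 10],
    "title": ["Toy Story", "Jumanji", "Heat"],
})
user_input = pd.DataFrame({
    "title": ["Jumanji", "Toy Story", "Unknown Film"],
    "rating": [4.5, 5.0, 3.0],
})

# Keep every input row and record whether a catalogue match was found.
merged = pd.merge(movies, user_input, on="title", how="right", indicator=True)

# Titles present in the input but missing from the catalogue.
unmatched = merged.loc[merged["_merge"] == "right_only", "title"].tolist()
print(unmatched)
```

Titles that are not unique in the catalogue would also produce extra rows in the merge, so checking the row count against the input length is a cheap additional safeguard.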


We're going to start by learning the input's preferences, so let's get the subset of movies that the input has watched from the Dataframe containing genres defined with binary values.

In [83]:
#Selecting the rows for the movies the input has rated
userMovies = moviecopy_df[moviecopy_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
293,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1246,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1885,1968,"Breakfast Club, The","[Comedy, Drama]",1985,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We'll only need the actual genre table, so let's clean this up a bit by resetting the index and dropping the movieId, title, genres and year columns.

In [84]:
userMovies = userMovies.reset_index(drop = True)

userGenre = userMovies.drop(columns = ['movieId', 'title', 'genres', 'year'])
userGenre


Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [85]:

inputMovies['rating']

0    5.0
1    4.5
2    3.0
3    3.0
4    2.0
Name: rating, dtype: float64

We're going to turn each genre into a weight. We can do this by multiplying the input's ratings into the input's genre table and then summing the resulting table by column, which is the dot product of the transposed genre table with the rating vector.

In [86]:
genreWeight = userGenre.transpose().dot(inputMovies['rating'])
genreWeight

Adventure             12.5
Animation              8.0
Children               9.5
Comedy                10.0
Fantasy                9.5
Romance                0.0
Drama                  5.0
Action                 3.0
Crime                  3.0
Thriller               3.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 3.0
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
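
The weights above can be reproduced by hand from the five rated movies. The sketch below rebuilds a reduced genre table (only the ten genres that appear in the input movies, using the genre lists and ratings shown earlier) and applies the same transpose-and-dot step. The snake_case variable names are illustrative and differ from the notebook's.

```python
import pandas as pd

genres = ["Adventure", "Animation", "Children", "Comedy", "Fantasy",
          "Drama", "Action", "Crime", "Thriller", "Sci-Fi"]

user_genre = pd.DataFrame(
    [
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # Toy Story
        [1, 0, 1, 0, 1, 0, 0, 0, 0, 0],  # Jumanji
        [0, 0, 0, 1, 0, 1, 0, 1, 1, 0],  # Pulp Fiction
        [1, 1, 0, 0, 0, 0, 1, 0, 0, 1],  # Akira
        [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],  # Breakfast Club, The
    ],
    columns=genres,
    dtype=float,
)
ratings = pd.Series([5.0, 4.5, 3.0, 3.0, 2.0])

# Each genre's weight is the sum of the ratings of the movies that have it.
genre_weight = user_genre.T.dot(ratings)
print(genre_weight)
```

For example, Adventure appears in Toy Story, Jumanji and Akira, so its weight is 5 + 4.5 + 3 = 12.5. The step relies on the rows of the genre table and the ratings being in the same movie order.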

Using this, we can recommend movies that satisfy the user's preferences.

Let's start by extracting the genre table from the original dataframe, setting the index as the movieId and dropping unnecessary columns:

In [87]:
genreTable = moviecopy_df.set_index(moviecopy_df['movieId'])

genreTable = genreTable.drop(columns = ['movieId', 'title', 'genres', 'year'])
genreTable


movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151697,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
151701,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
151703,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
151709,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With the input profile and the complete list of movies and their genres in hand, we're going to compute a weighted score for every movie: the sum of the weights of the movie's genres divided by the sum of all weights. We then recommend the top twenty movies that most satisfy the profile.

In [89]:
recommendationTable =((genreTable*genreWeight).sum(axis=1))/(genreWeight.sum())
recommendationTable.head()

movieId
1    0.744361
2    0.473684
3    0.150376
4    0.225564
5    0.150376
dtype: float64
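
As a worked check of the scores above, the sketch below recomputes them for the first three movies from the genre weights printed earlier. Only the genres needed for these three movies are included, with the omitted non-zero weights entered individually so that the total still sums to 66.5. The variable names are illustrative.

```python
import pandas as pd

# Genre weights copied from the profile computed earlier.
genre_weight = pd.Series({
    "Adventure": 12.5, "Animation": 8.0, "Children": 9.5, "Comedy": 10.0,
    "Fantasy": 9.5, "Romance": 0.0, "Drama": 5.0, "Action": 3.0,
    "Crime": 3.0, "Thriller": 3.0, "Sci-Fi": 3.0,
})

# One-hot genre rows for movieId 1, 2 and 3.
genre_table = pd.DataFrame(
    0.0,
    index=pd.Index([1, 2, 3], name="movieId"),
    columns=genre_weight.index,
)
genre_table.loc[1, ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]] = 1.0
genre_table.loc[2, ["Adventure", "Children", "Fantasy"]] = 1.0
genre_table.loc[3, ["Comedy", "Romance"]] = 1.0

# Score = sum of the weights of the movie's genres / sum of all weights.
scores = genre_table.dot(genre_weight) / genre_weight.sum()
print(scores.round(6))
```

Toy Story, for instance, scores (12.5 + 8 + 9.5 + 10 + 9.5) / 66.5, which is about 0.744361. Because the score is a sum over genres, movies tagged with many of the favoured genres rank highest.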

In [90]:
# sort the recommended table in descending order
recommendationTable = recommendationTable.sort_values(ascending = False)
recommendationTable.head()

movieId
26093     0.819549
51632     0.789474
108932    0.789474
673       0.789474
32031     0.789474
dtype: float64

In [91]:
#The final recommendation table (rows appear in movie_df order, not ranked by score)
movie_df.loc[movie_df['movieId'].isin(recommendationTable.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
664,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996
2902,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988
3664,3754,"Adventures of Rocky and Bullwinkle, The","[Adventure, Animation, Children, Comedy, Fantasy]",2000
3923,4016,"Emperor's New Groove, The","[Adventure, Animation, Children, Comedy, Fantasy]",2000
4212,4306,Shrek,"[Adventure, Animation, Children, Comedy, Fanta...",2001
8605,26093,"Wonderful World of the Brothers Grimm, The","[Adventure, Animation, Children, Comedy, Drama...",1962
8783,26340,"Twelve Tasks of Asterix, The (Les douze travau...","[Action, Adventure, Animation, Children, Comed...",1976
9825,32031,Robots,"[Adventure, Animation, Children, Comedy, Fanta...",2005
11716,51632,Atlantis: Milo's Return,"[Action, Adventure, Animation, Children, Comed...",2003
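
Filtering with `isin` returns the recommended movies in catalogue order and can include titles the user has already rated, such as Toy Story here. A hedged sketch of one way to keep the ranking and skip seen movies, using a three-row toy catalogue and made-up scores; the names `seen`, `candidates` and `ranked` are hypothetical.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A", "B", "C"],
})
scores = pd.Series({3: 0.9, 1: 0.7, 2: 0.1}, name="score")
scores.index.name = "movieId"

# Movies the user has already rated (assumption: we do not want to recommend them again).
seen = [1]
candidates = scores.drop(index=seen, errors="ignore")

# Take the best-scoring candidates, then attach titles while keeping score order.
top = candidates.sort_values(ascending=False).head(2).reset_index()
ranked = top.merge(movies, on="movieId", how="left")
print(ranked)
```

A left merge from the sorted scores preserves their order, so the first row is always the strongest recommendation.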
