# Content-Based Movie Recommendations (Item Profiles & User Profiles)

![alt text](amazon_prime.png "Movie Recommendations (source https://www.amazon.com)")

In this notebook, we will have a look at the [MovieLens](https://grouplens.org/datasets/movielens/) dataset, which is a popular dataset for building and benchmarking recommender systems. We work with the 1M version of the dataset, which contains 1,000,209 ratings of about 3,900 movies, made by 6,040 users in the year 2000. **Your task is to recommend "similar" movies for a given query movie, which is a simple yet often used type of recommender system.** Note that this notebook is based on the following GitHub repository: [https://github.com/khanhnamle1994/movielens](https://github.com/khanhnamle1994/movielens). Below, we will also make use of the well-known [NumPy](https://numpy.org/) and [pandas](https://pandas.pydata.org/) packages. The former provides functions for, e.g., matrix/array operations. The latter provides data structures and functions for reading and processing tabular data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# parse the movies.csv file
movies = pd.read_csv('movies.csv', 
                     sep='\t', 
                     encoding='latin-1', 
                     usecols=['movie_id', 'title', 'genres'])

# print the first 10 rows
print("The movies dataframe:")
print(movies.head(10))

The movies dataframe:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller


The movies dataframe has a column *genres*, which we will use to define our features. However, **we need a vector-like representation for each movie** and not a string representation (e.g. "Animation|Children's|Comedy"). Below, we define a list of genres. Each movie will then be represented as a binary vector, where an element is 1 if the movie belongs to the corresponding genre and 0 otherwise. The function *extract_genre* takes care of extracting this information per row. Note that the list contains "Children" while the data uses "Children's"; since *extract_genre* tests whether the genre name occurs anywhere in the genre string, both are matched.

In [3]:
GENRES = [
    "Action",
    "Adventure",
    "Animation",
    "Children",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western"
]

print("Number of genres: {}".format(len(GENRES)))

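def extract_genre(genres, gen):
    """
    Check whether a genre occurs in the genre string of a movie.

    Parameters
    ----------
    genres: str
        The string containing the genres per movie.
        (e.g. Animation|Children's|Comedy)
    gen: str
        The genre that we test (if it is contained 
        in the string above)

    Returns
    -------
    int
        1 if the genre is contained in the string, 0 otherwise.
    """
    # substring test (no exact match on the split list), so that
    # "Children" in GENRES also matches "Children's" in the data
    return 1 if gen in genres else 0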


Number of genres: 18


In [4]:
# work on a copy so that the original movies dataframe stays unchanged
# (DataFrame.copy() makes a deep copy by default)
converted_movies = movies.copy()

# for each genre, we generate a new column (indicator if genre is present in the list of genres)
for gen in GENRES:
    converted_movies[gen] = converted_movies['genres'].apply(extract_genre, args=(gen,))

print(converted_movies[['title'] + GENRES].head(10))

                                title  Action  Adventure  Animation  Children  \
0                    Toy Story (1995)       0          0          1         1   
1                      Jumanji (1995)       0          1          0         1   
2             Grumpier Old Men (1995)       0          0          0         0   
3            Waiting to Exhale (1995)       0          0          0         0   
4  Father of the Bride Part II (1995)       0          0          0         0   
5                         Heat (1995)       1          0          0         0   
6                      Sabrina (1995)       0          0          0         0   
7                 Tom and Huck (1995)       0          1          0         1   
8                 Sudden Death (1995)       1          0          0         0   
9                    GoldenEye (1995)       1          1          0         0   

   Comedy  Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  Musical  \
0       1      0            0  
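As a side note, pandas can also build such indicator columns directly from a delimited string column. The following is only a minimal sketch on a made-up three-row dataframe (the name `toy` is hypothetical and not part of the notebook). Unlike the loop above, `str.get_dummies` derives the column names from the data, so the column is called "Children's" rather than "Children".

```python
import pandas as pd

# tiny stand-in for the movies dataframe (illustration only)
toy = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Father of the Bride Part II (1995)"],
    "genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy", "Comedy"],
})

# one 0/1 indicator column per genre that occurs in the data
dummies = toy["genres"].str.get_dummies(sep="|")
print(dummies)
```

The loop-based version in the notebook has the advantage that the set and order of the genre columns is fixed by `GENRES`, independent of which genres happen to occur in the data.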

### (1) Item Profiles 
Next, we define a function that computes the cosine similarity for two input vectors.

In [5]:
def get_cosine_similarity(x, y):
    """
    Compute the cosine similarity <x, y> / (||x|| * ||y||) 
    of two vectors x and y of the same length.
    """
    numerator = np.dot(x, y)
    denominator = np.linalg.norm(x) * np.linalg.norm(y)

    # sanity check: x and y must be non-zero vectors
    if denominator > 0:
        sim = numerator / denominator
    else:
        raise ValueError("The cosine similarity is not defined for vectors containing only zeros!")

    return sim
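
To make the computation concrete, here is a small worked example for two of the movies shown above. It is a sketch that assumes a reduced genre order (Adventure, Animation, Children, Comedy, Fantasy) for readability; the notebook itself uses all 18 genres, which gives the same result because the remaining entries are 0 for both movies.

```python
import numpy as np

# reduced genre order for this example only:
# Adventure, Animation, Children, Comedy, Fantasy
toy_story = np.array([0, 1, 1, 1, 0])  # Animation|Children's|Comedy
jumanji = np.array([1, 0, 1, 0, 1])    # Adventure|Children's|Fantasy

# the dot product counts the shared genres (here: only Children)
dot = np.dot(toy_story, jumanji)

# each movie has three genres, so each norm is sqrt(3)
norms = np.linalg.norm(toy_story) * np.linalg.norm(jumanji)

similarity = dot / norms
print(similarity)
```

The two movies share one of their three genres each, so the similarity is roughly one third. Two movies with identical genre sets get a similarity of 1, and two movies without any common genre get 0.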

In [6]:
# let us define two pandas Series; the indices Series allows us to
# look up, for each title, the corresponding row number
titles = movies['title']
indices = pd.Series(movies.index, index=movies['title'])
print(titles.head(10))
print(indices.head(10))

0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
5                           Heat (1995)
6                        Sabrina (1995)
7                   Tom and Huck (1995)
8                   Sudden Death (1995)
9                      GoldenEye (1995)
Name: title, dtype: object
title
Toy Story (1995)                      0
Jumanji (1995)                        1
Grumpier Old Men (1995)               2
Waiting to Exhale (1995)              3
Father of the Bride Part II (1995)    4
Heat (1995)                           5
Sabrina (1995)                        6
Tom and Huck (1995)                   7
Sudden Death (1995)                   8
GoldenEye (1995)                      9
dtype: int64


In [7]:
def get_recommendations(movie_title, n=10):
    
    # get the index/row number of the movie with title "movie_title"
    idx = indices[movie_title]
    
    # this is our "query movie" for which we have a representation
    # based on the genres defined above
    query_item = converted_movies.iloc[idx][GENRES]
    query_item = query_item.to_numpy()

    # compute cosine similarities between query movie and all
    # movies in the catalog (except for the query movie)
    similarities = []
    for i in range(len(converted_movies['genres'])):
        
        # skip the query item
        if i != idx:
            
            # get the i-th item
            other_item = converted_movies.iloc[i][GENRES]
            other_item = other_item.to_numpy()

            # compute cosine similarity between both items
            sim = get_cosine_similarity(query_item, other_item)
            
            # store result in list
            similarities.append((i, sim))
    
    # sort pairs w.r.t. second entry (cosine similarities) in 
    # descending order (reverse=True)
    sorted_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)
    
    # take the top n elements
    sorted_similarities = sorted_similarities[:n]
    
    # get the corresponding movie indices
    movie_indices = [pair[0] for pair in sorted_similarities]
    
    # return the list of titles
    return titles.iloc[movie_indices]

In [8]:
get_recommendations('Toy Story (1995)', n=10)

1050            Aladdin and the King of Thieves (1996)
2072                          American Tail, An (1986)
2073        American Tail: Fievel Goes West, An (1991)
2285                         Rugrats Movie, The (1998)
2286                              Bug's Life, A (1998)
3045                                Toy Story 2 (1999)
3542                             Saludos Amigos (1943)
3682                                Chicken Run (2000)
3685    Adventures of Rocky and Bullwinkle, The (2000)
236                              Goofy Movie, A (1995)
Name: title, dtype: object
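
The loop in *get_recommendations* calls the similarity function once per movie, which is easy to follow but slow for larger catalogs. The same ranking can be sketched with a single matrix-vector product. The helper name `top_n_similar` and the small matrix below are assumptions made for illustration; in the notebook, the matrix would correspond to `converted_movies[GENRES].to_numpy()`.

```python
import numpy as np

def top_n_similar(item_matrix, query_idx, n=10):
    """Hypothetical helper: row indices of the n items most similar to row query_idx."""
    X = np.asarray(item_matrix, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    if np.any(norms == 0):
        raise ValueError("Cosine similarity is undefined for all-zero rows.")

    # cosine similarity between the query row and every row at once
    sims = (X @ X[query_idx]) / (norms * norms[query_idx])

    # exclude the query item itself from the ranking
    sims[query_idx] = -np.inf

    # stable sort in descending order, so ties keep their catalog order
    order = np.argsort(-sims, kind="stable")
    return order[:n]

# toy catalog, columns: Action, Adventure, Animation, Children, Comedy
toy_items = np.array([
    [0, 0, 1, 1, 1],  # 0: animated children's comedy
    [0, 1, 0, 1, 0],  # 1: children's adventure
    [1, 0, 0, 0, 0],  # 2: action
    [0, 0, 1, 1, 1],  # 3: same genres as item 0
])
top = top_n_similar(toy_items, query_idx=0, n=2)
print(top)
```

In this toy catalog, item 3 is ranked first because it has exactly the same genres as the query, followed by item 1, which shares one genre.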

### (2) User Profiles
Next, we will have a look at user data and associated ratings. Below, we load the data files and merge all the dataframes (database-like join). 

In [9]:
users = pd.read_csv('users.csv', 
                    sep='\t', 
                    encoding='latin-1', 
                    usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

ratings = pd.read_csv('ratings.csv', 
                      sep='\t', 
                      encoding='latin-1', 
                      usecols=['user_id', 'movie_id', 'rating'])

In [10]:
# merge all three dataframes into a single dataframe
# (via a database-like join)
dataset = pd.merge(converted_movies, ratings, on="movie_id")
dataset = pd.merge(dataset, users, on="user_id")
print(dataset.head(10))

   movie_id                                      title  \
0         1                           Toy Story (1995)   
1        48                          Pocahontas (1995)   
2       150                           Apollo 13 (1995)   
3       260  Star Wars: Episode IV - A New Hope (1977)   
4       527                    Schindler's List (1993)   
5       531                  Secret Garden, The (1993)   
6       588                             Aladdin (1992)   
7       594     Snow White and the Seven Dwarfs (1937)   
8       595                Beauty and the Beast (1991)   
9       608                               Fargo (1996)   

                                 genres  Action  Adventure  Animation  \
0           Animation|Children's|Comedy       0          0          1   
1  Animation|Children's|Musical|Romance       0          0          1   
2                                 Drama       0          0          0   
3       Action|Adventure|Fantasy|Sci-Fi       1          1          0
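
By default, `pd.merge` performs an inner join, so only rows with a match in both dataframes are kept. The following sketch uses three tiny made-up dataframes (all names prefixed with `toy_` are hypothetical) to show the effect: a movie without any rating does not appear in the merged result.

```python
import pandas as pd

toy_movies = pd.DataFrame({"movie_id": [1, 2, 3],
                           "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({"user_id": [10, 10, 11],
                            "movie_id": [1, 2, 1],
                            "rating": [5, 3, 4]})
toy_users = pd.DataFrame({"user_id": [10, 11],
                          "gender": ["M", "F"]})

# inner join on movie_id: movie 3 has no rating and is dropped
toy_dataset = pd.merge(toy_movies, toy_ratings, on="movie_id")

# inner join on user_id: adds the user attributes to every rating
toy_dataset = pd.merge(toy_dataset, toy_users, on="user_id")
print(toy_dataset)
```

The result has one row per rating, with the movie and user columns repeated for each of them.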

In [11]:
# get data for a specific user
user_id = 10
user_data = dataset[dataset['user_id'] == user_id]
print("Data for the user with id {}:".format(user_id))
print(user_data.head(10))

Data for the user with id 10:
     movie_id                         title  \
369         1              Toy Story (1995)   
370         2                Jumanji (1995)   
371         7                Sabrina (1995)   
372        24                 Powder (1995)   
373        32         Twelve Monkeys (1995)   
374        48             Pocahontas (1995)   
375        62     Mr. Holland's Opus (1995)   
376       104          Happy Gilmore (1996)   
377       110             Braveheart (1995)   
378       116  Anne Frank Remembered (1995)   

                                   genres  Action  Adventure  Animation  \
369           Animation|Children's|Comedy       0          0          1   
370          Adventure|Children's|Fantasy       0          1          0   
371                        Comedy|Romance       0          0          0   
372                          Drama|Sci-Fi       0          0          0   
373                          Drama|Sci-Fi       0          0          0   
37

In [12]:
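# for the given user, extract all ratings and item profiles
# (note: this rebinds the name `ratings`, which so far referred to
# the ratings dataframe; from here on it holds the user's ratings)

ratings = []
items = []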

for index, row in user_data.iterrows():
    
    rating = row['rating']
    item = row[GENRES].to_numpy()
    
    ratings.append(rating)
    items.append(item)
    
# convert lists to numpy arrays
ratings = np.array(ratings)
items = np.array(items)
print(items.shape)

(401, 18)


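Next, we generate a user profile as a rating-weighted sum of the item profiles of all movies the user has rated. We do not divide by the number of ratings: the cosine similarity used below ignores the length of a vector, so a weighted average would lead to the same ranking.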

In [13]:
user_profile = None

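# option 1: use the raw ratings as weights
new_ratings = ratings

# option 2: use mean-centered ratings as weights, so that movies rated
# below the user's average contribute negatively to the profile
#new_ratings = ratings - ratings.mean()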

for i in range(len(items)):
    
    if user_profile is None:
        user_profile = new_ratings[i] * items[i]
    else:
        user_profile += new_ratings[i] * items[i]

print("User profile:\n{}".format(user_profile))

User profile:
[317 288 142 263 757 54 8 485 164 14 76 164 29 293 297 114 92 31]
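
The weighted sum above can also be written as a single vector-matrix product. The sketch below uses made-up ratings and a two-genre item matrix (the `toy_` names are hypothetical) and shows both weighting options from the cell above.

```python
import numpy as np

# three rated movies, two genres (illustration only)
toy_ratings = np.array([5, 1, 3])
toy_items = np.array([
    [1, 0],  # movie in genre 0
    [0, 1],  # movie in genre 1
    [1, 1],  # movie in both genres
])

# option 1: raw ratings as weights
profile_raw = toy_ratings @ toy_items

# option 2: mean-centered ratings as weights
profile_centered = (toy_ratings - toy_ratings.mean()) @ toy_items

print(profile_raw)
print(profile_centered)
```

With raw ratings, both genres get a positive weight (8 and 4). With mean-centered ratings, the genre of the poorly rated movie gets a negative weight (2 and -2), which expresses a dislike rather than just a weaker preference.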


In [14]:
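def get_user_recommendations(user_prof, n=10):
    # rank all movies in the catalog by their cosine similarity to the
    # user profile; movies the user has already rated are not filtered out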
    similarities = []

    for i in range(len(converted_movies['genres'])):

        # get the i-th item
        item = converted_movies.iloc[i][GENRES]
        item = item.to_numpy()

        # compute cosine similarity between item and user profile
        sim = get_cosine_similarity(item, user_prof)

        # store result in list
        similarities.append((i, sim))

    # sort pairs w.r.t. second entry (similarities)
    sorted_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)

    # take the top n elements
    sorted_similarities = sorted_similarities[:n]

    # get the corresponding movie indices
    movie_indices = [pair[0] for pair in sorted_similarities]
    
    # return the list of titles
    return titles.iloc[movie_indices]    

In [15]:
get_user_recommendations(user_profile)

20                        Get Shorty (1995)
386     Faster Pussycat! Kill! Kill! (1965)
1847                      Buffalo 66 (1998)
3087                Bicentennial Man (1999)
10           American President, The (1995)
193          Something to Talk About (1995)
221                 Don Juan DeMarco (1995)
347                 Corrina, Corrina (1994)
355              I Like It Like That (1994)
492             What Happened Was... (1994)
Name: title, dtype: object
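
Since *get_user_recommendations* ranks the complete catalog, it can return movies the user has already rated. One possible extension, sketched below under the assumption that the row indices of the rated movies are available, is to mask those rows before ranking. The helper name `recommend_unseen` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np

def recommend_unseen(item_matrix, user_prof, seen_idx, n=10):
    """Hypothetical helper: top-n row indices by cosine similarity, skipping seen rows."""
    X = np.asarray(item_matrix, dtype=float)
    p = np.asarray(user_prof, dtype=float)
    seen = np.asarray(list(seen_idx), dtype=int)

    denom = np.linalg.norm(X, axis=1) * np.linalg.norm(p)
    if np.any(denom == 0):
        raise ValueError("Cosine similarity is undefined for all-zero vectors.")

    # cosine similarity between the user profile and every item
    sims = (X @ p) / denom

    # rows the user has already rated must never be recommended
    sims[seen] = -np.inf

    # stable descending sort; masked rows end up at the very end
    order = np.argsort(-sims, kind="stable")
    n_available = len(sims) - len(np.unique(seen))
    return order[:min(n, n_available)]

toy_items = np.array([[1, 0], [0, 1], [1, 1], [1, 0]])
toy_profile = np.array([8, 4])
recs = recommend_unseen(toy_items, toy_profile, seen_idx=[0], n=2)
print(recs)
```

Here item 0 is skipped because it was already rated; item 2 is ranked first since it covers both genres of the profile, followed by item 3.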