# Content-based Recommender Systems

First we need to load the data. We will use the MovieLens dataset (the `ml-latest-small` release).

We will use two files:
* ratings: it includes the ratings that each user has given to each of the movies they have watched.
* movies: it includes the movie ID, the title (with the release year) and the genres of each movie.

In [1]:
import pandas as pd
ratings = pd.read_csv('ml-latest-small/ratings.csv')

# Remove timestamp
ratings_df = ratings[['movieId', 'userId', 'rating']]
ratings_df.head()

Unnamed: 0,movieId,userId,rating
0,1,1,4.0
1,3,1,4.0
2,6,1,4.0
3,47,1,5.0
4,50,1,5.0


In [2]:
movies = pd.read_csv('ml-latest-small/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Genres similarity

First of all, let's one-hot encode the genres so that each genre gets its own binary column.

In [3]:
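# One binary column per genre; the genres string is split on '|'
genre_df = movies['genres'].str.get_dummies(sep='|')
# Drop the first dummy column, which in this dataset should be the
# '(no genres listed)' placeholder rather than a real genre
genre_df = genre_df.drop(columns=[genre_df.columns[0]])
# Index rows by movie ID instead of row position
genre_df.index = movies.movieId.values
genre_df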

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
193583,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
193585,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
193587,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Once the genre matrix is ready, we can compute the similarity between every pair of movies. We will use the cosine similarity of their genre vectors.

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(genre_df)
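
As a sanity check on what cosine similarity means for binary genre vectors, here is a minimal hand-rolled sketch. The helper `cosine` and the reduced six-genre column set are illustrative assumptions, not part of the notebook; the genre lists come from the `movies` table above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Columns: Adventure, Animation, Children, Comedy, Fantasy, Romance
toy_story = [1, 1, 1, 1, 1, 0]
jumanji = [1, 0, 1, 0, 1, 0]
grumpier_old_men = [0, 0, 0, 1, 0, 1]

# 3 shared genres / sqrt(5 genres * 3 genres)
sim_ts_jumanji = cosine(toy_story, jumanji)
# 1 shared genre / sqrt(5 genres * 2 genres)
sim_ts_grumpier = cosine(toy_story, grumpier_old_men)

print(round(sim_ts_jumanji, 6), round(sim_ts_grumpier, 6))
```

For 0/1 vectors the score is simply the number of shared genres divided by the geometric mean of the two genre counts, which is why movies with identical genre sets score 1 and movies with no genre in common score 0.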

We convert it to a DataFrame and use the movie IDs as both the index and the column headers.

In [5]:
# Convert to DataFrame and add movies IDs to index and header
similarity_df = pd.DataFrame(similarity_matrix, columns=movies.movieId.values, index=movies.movieId.values)
similarity_df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.000000,0.774597,0.316228,0.258199,0.447214,0.000000,0.316228,0.632456,0.000000,0.258199,...,0.447214,0.316228,0.316228,0.447214,0.0,0.670820,0.774597,0.00000,0.316228,0.447214
2,0.774597,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.816497,0.000000,0.333333,...,0.000000,0.000000,0.000000,0.000000,0.0,0.288675,0.333333,0.00000,0.000000,0.000000
3,0.316228,0.000000,1.000000,0.816497,0.707107,0.000000,1.000000,0.000000,0.000000,0.000000,...,0.353553,0.000000,0.500000,0.000000,0.0,0.353553,0.408248,0.00000,0.000000,0.707107
4,0.258199,0.000000,0.816497,1.000000,0.577350,0.000000,0.816497,0.000000,0.000000,0.000000,...,0.288675,0.408248,0.816497,0.000000,0.0,0.288675,0.333333,0.57735,0.000000,0.577350
5,0.447214,0.000000,0.707107,0.577350,1.000000,0.000000,0.707107,0.000000,0.000000,0.000000,...,0.500000,0.000000,0.707107,0.000000,0.0,0.500000,0.577350,0.00000,0.000000,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.670820,0.288675,0.353553,0.288675,0.500000,0.288675,0.353553,0.000000,0.500000,0.288675,...,0.750000,0.353553,0.353553,0.500000,0.0,1.000000,0.866025,0.00000,0.707107,0.500000
193583,0.774597,0.333333,0.408248,0.333333,0.577350,0.000000,0.408248,0.000000,0.000000,0.000000,...,0.577350,0.408248,0.408248,0.577350,0.0,0.866025,1.000000,0.00000,0.408248,0.577350
193585,0.000000,0.000000,0.000000,0.577350,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.707107,0.707107,0.000000,0.0,0.000000,0.000000,1.00000,0.000000,0.000000
193587,0.316228,0.000000,0.000000,0.000000,0.000000,0.408248,0.000000,0.000000,0.707107,0.408248,...,0.707107,0.500000,0.000000,0.707107,0.0,0.707107,0.408248,0.00000,1.000000,0.000000


## Release year similarity

First, we need to extract the release year from the title.

In [6]:
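movies_year = movies.copy()
# Split 'Toy Story (1995)' into Title='Toy Story' and Year='1995'
movies_year[['Title', 'Year']] = movies_year['title'].str.extract(r'(?P<Title>.*?)\s*\((?P<Year>\d{4})\)')
# Drop movies whose title does not contain a four-digit year in parentheses
movies_year = movies_year.dropna()
movies_year['Year'] = movies_year['Year'].astype(int)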

In [7]:
movies_year = movies_year.drop(columns=['genres'])
movies_year

Unnamed: 0,movieId,title,Title,Year
0,1,Toy Story (1995),Toy Story,1995
1,2,Jumanji (1995),Jumanji,1995
2,3,Grumpier Old Men (1995),Grumpier Old Men,1995
3,4,Waiting to Exhale (1995),Waiting to Exhale,1995
4,5,Father of the Bride Part II (1995),Father of the Bride Part II,1995
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Black Butler: Book of the Atlantic,2017
9738,193583,No Game No Life: Zero (2017),No Game No Life: Zero,2017
9739,193585,Flint (2017),Flint,2017
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Bungo Stray Dogs: Dead Apple,2018


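Now we define a function that computes how similar two movies are based on their release years. We use an exponential decay of the absolute year difference, exp(-|year1 - year2| / 10): movies released in the same year get a similarity of 1, and the value shrinks toward 0 as the gap grows.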

In [8]:
import math

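def compute_year_similarity(df, movie1, movie2):
    """Return exp(-|year1 - year2| / 10) for two row labels of df."""
    # Use the DataFrame passed in rather than the global movies_year
    diff = abs(df.loc[movie1, 'Year'] - df.loc[movie2, 'Year'])
    sim = math.exp(-diff / 10.0)
    return sim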
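
To get a feel for how quickly this similarity decays, the sketch below evaluates the same formula directly on a few pairs of years. The standalone helper `year_similarity` and its `scale` parameter are hypothetical names introduced here for illustration; only the formula itself comes from the notebook.

```python
import math

def year_similarity(year1, year2, scale=10.0):
    """Exponential decay of the absolute year gap (scale=10 matches the notebook)."""
    return math.exp(-abs(year1 - year2) / scale)

for y1, y2 in [(1995, 1995), (1995, 2005), (1995, 2017)]:
    print(y1, y2, round(year_similarity(y1, y2), 6))
# 1995 1995 1.0
# 1995 2005 0.367879
# 1995 2017 0.110803
```

A gap of 10 years already cuts the similarity to about 0.37, so a smaller divisor would make the recommender stricter about release year and a larger one more lenient.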

Time to compute the similarity. We create a matrix full of zeros and fill it in with the similarity values. This will take some time, because the loop visits every pair of movies. To speed things up, we take advantage of the matrix being symmetric: we compute only the upper half and mirror each value into the lower half.

In [9]:
import numpy as np
similarity_matrix_year = np.zeros((len(movies_year.index), len(movies_year.index)))

In [10]:
from IPython.display import clear_output
val = 0
print('0%')

# similarity_matrix_year was allocated (all zeros) in the previous cell

for i, movie1 in enumerate(movies_year.index):
    # A movie is always fully similar to itself
    similarity_matrix_year[i, i] = 1.0

    for j, movie2 in enumerate(movies_year.index):
        # Compute only the upper triangle and mirror it into the lower one
        if i < j:
            sim = compute_year_similarity(movies_year, movie1, movie2)
            similarity_matrix_year[i, j] = sim
            similarity_matrix_year[j, i] = sim

    # Track progress once per row instead of once per pair
    new_val = int(i / movies_year.shape[0] * 100)
    if val != new_val:
        val = new_val
        clear_output(wait=True)
        print(f'{val}%')

clear_output(wait=True)
print('100%')

99%
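
If the double loop is too slow, the same matrix can probably be built in one step with NumPy broadcasting. This is a sketch on a toy array, not the notebook's code: `years` stands in for `movies_year['Year'].to_numpy()`, and `similarity_matrix_year_vec` is a hypothetical name.

```python
import numpy as np

# Toy stand-in for movies_year['Year'].to_numpy()
years = np.array([1995, 1995, 2017, 2018])

# A column vector minus a row vector broadcasts to all pairwise differences
diff = np.abs(years[:, None] - years[None, :])

# Same formula as compute_year_similarity, applied element-wise
similarity_matrix_year_vec = np.exp(-diff / 10.0)

print(similarity_matrix_year_vec.round(6))
```

The result is symmetric with ones on the diagonal by construction, so no mirroring or special-casing is needed. It uses the same amount of memory as the matrix allocated above, since both are dense n-by-n arrays.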


Finally we convert it into a dataframe and we assign the movie IDs to the index and header.

In [11]:
similarity_year_df = pd.DataFrame(similarity_matrix_year, columns=movies_year.movieId.values, index=movies_year.movieId.values)
similarity_year_df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,0.223130,0.165299,0.149569,0.135335,0.135335,0.110803,0.110803,0.110803,0.100259,0.670320
2,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,0.223130,0.165299,0.149569,0.135335,0.135335,0.110803,0.110803,0.110803,0.100259,0.670320
3,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,0.223130,0.165299,0.149569,0.135335,0.135335,0.110803,0.110803,0.110803,0.100259,0.670320
4,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,0.223130,0.165299,0.149569,0.135335,0.135335,0.110803,0.110803,0.110803,0.100259,0.670320
5,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,0.223130,0.165299,0.149569,0.135335,0.135335,0.110803,0.110803,0.110803,0.100259,0.670320
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,...,0.496585,0.670320,0.740818,0.818731,0.818731,1.000000,1.000000,1.000000,0.904837,0.074274
193583,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,...,0.496585,0.670320,0.740818,0.818731,0.818731,1.000000,1.000000,1.000000,0.904837,0.074274
193585,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,0.110803,...,0.496585,0.670320,0.740818,0.818731,0.818731,1.000000,1.000000,1.000000,0.904837,0.074274
193587,0.100259,0.100259,0.100259,0.100259,0.100259,0.100259,0.100259,0.100259,0.100259,0.100259,...,0.449329,0.606531,0.670320,0.740818,0.740818,0.904837,0.904837,0.904837,1.000000,0.067206


## Global similarity matrix

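There are multiple approaches you can use here. For simplicity, we will just multiply both similarity matrices element-wise, so if two movies differ a lot in either genres or release year, their global similarity will be low. Movies dropped for lacking a release year are absent from the year matrix, so their entries come out as NaN after the multiplication; we replace those with 0.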

In [30]:
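# Element-wise product; pandas aligns both frames on their movie ID labels
global_similarity_df = similarity_df * similarity_year_df
# Movies missing from the year matrix yield NaN; treat them as not similar
global_similarity_df.fillna(0.0, inplace=True)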
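
The multiplication works even though the two matrices do not have the same shape, because pandas aligns DataFrames on their index and column labels before applying the operator. The toy frames below (`genre_sim`, `year_sim`, and their values) are made up purely to illustrate that behaviour.

```python
import pandas as pd

# Genre similarity for movies 1, 2 and 3
genre_sim = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.0],
     [0.2, 0.0, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)

# Year similarity only for movies 1 and 2 (movie 3 has no known year)
year_sim = pd.DataFrame(
    [[1.0, 0.8],
     [0.8, 1.0]],
    index=[1, 2], columns=[1, 2],
)

# Labels are aligned first; anything involving movie 3 becomes NaN, then 0
combined = (genre_sim * year_sim).fillna(0.0)
print(combined)
```

One consequence worth keeping in mind is that a movie without a year ends up with similarity 0 to everything, including itself, so it can never be recommended with this scheme.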

## Get top n recommendations

In [31]:
# Lookup table from movie ID to title
movieId_to_title = movies.set_index('movieId')['title'].to_dict()

def get_top_n_recommendations(user_id, similarity_matrix=global_similarity_df, n=10, movieId_to_title=movieId_to_title):

    # Filter the ratings for the target user
    user_ratings = ratings_df[ratings_df['userId'] == user_id]

    # Merge the user's ratings with the similarity matrix
    merged = user_ratings.merge(similarity_matrix, left_on='movieId', right_index=True)

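    # Score each candidate movie: scale every similarity row by the user's rating of
    # that movie, then average over all rated movies. Columns 0-2 of `merged` are
    # movieId, userId and rating, so the similarity columns start at position 3.
    weighted_averages = merged.iloc[:, 3:].multiply(merged['rating'], axis=0).mean(axis=0)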

    # Sort the weighted averages
    sorted_weighted_averages = weighted_averages.sort_values(ascending=False)
    sorted_weighted_averages = sorted_weighted_averages.to_frame(name='rating')
    sorted_weighted_averages['title'] = sorted_weighted_averages.index.map(movieId_to_title)
    sorted_weighted_averages = sorted_weighted_averages.reset_index(names=['movieId'])

    # Get the top-N recommendations
    top_n_rec = sorted_weighted_averages.head(n)
    top_n_rec.index = top_n_rec.index + 1
    print(f'Top recommended movies for user {user_id}: {top_n_rec.title.values}')

    return top_n_rec
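
The function above scores every movie in the similarity matrix, including those the user has already rated. If you would rather recommend only unseen movies, one option is to filter the scores before sorting. The helper `drop_already_rated` and the toy data below are hypothetical and only sketch the idea.

```python
import pandas as pd

def drop_already_rated(scores, user_ratings):
    """Keep only scores whose movieId index is not among the user's rated movies."""
    already_rated = scores.index.isin(user_ratings['movieId'])
    return scores[~already_rated]

# Toy scores indexed by movieId, and a user who has already rated movie 20
scores = pd.Series({10: 1.2, 20: 1.1, 30: 0.9})
user_ratings = pd.DataFrame({'movieId': [20], 'userId': [55], 'rating': [4.0]})

filtered = drop_already_rated(scores, user_ratings)
print(filtered)
```

Inside `get_top_n_recommendations`, a call like this could be applied to `weighted_averages` just before the sorting step.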

### Test recommendations

In [36]:
user_id = 55
ratings_df[ratings_df.userId == user_id].merge(movies, left_on='movieId', right_on='movieId')

Unnamed: 0,movieId,userId,rating,title,genres
0,186,55,0.5,Nine Months (1995),Comedy|Romance
1,673,55,0.5,Space Jam (1996),Adventure|Animation|Children|Comedy|Fantasy|Sc...
2,1275,55,4.0,Highlander (1986),Action|Adventure|Fantasy
3,1293,55,4.0,Gandhi (1982),Drama
4,1357,55,0.5,Shine (1996),Drama|Romance
5,1947,55,0.5,West Side Story (1961),Drama|Musical|Romance
6,2005,55,2.0,"Goonies, The (1985)",Action|Adventure|Children|Comedy|Fantasy
7,2100,55,0.5,Splash (1984),Comedy|Fantasy|Romance
8,2278,55,3.0,Ronin (1998),Action|Crime|Thriller
9,2393,55,0.5,Star Trek: Insurrection (1998),Action|Drama|Romance|Sci-Fi


In [37]:
get_top_n_recommendations(user_id)

Top recommended movies for user 55: ['Wasabi (2001)' 'Boondock Saints, The (2000)' 'Double Jeopardy (1999)'
 'Fight Club (1999)' 'Corruptor, The (1999)' 'Spy Game (2001)'
 'Beautiful Creatures (2000)' 'Pusher II: With Blood on My Hands (2004)'
 'Collateral (2004)' 'Man Apart, A (2003)']


Unnamed: 0,movieId,rating,title
1,5628,1.197009,Wasabi (2001)
2,3275,1.194527,"Boondock Saints, The (2000)"
3,2881,1.172352,Double Jeopardy (1999)
4,2959,1.172352,Fight Club (1999)
5,2540,1.172352,"Corruptor, The (1999)"
6,4901,1.165458,Spy Game (2001)
7,4242,1.150894,Beautiful Creatures (2000)
8,34811,1.147654,Pusher II: With Blood on My Hands (2004)
9,8798,1.147654,Collateral (2004)
10,6280,1.142138,"Man Apart, A (2003)"
