# Movie Recommendation Systems

There are three main types of recommendation systems:
- **Collaborative Filtering**: Makes predictions based on user behavior patterns and similarities, assuming users with similar past preferences will have similar future preferences. It works by finding either similar users or similar items to generate recommendations.
- **Content-Based Filtering**: Recommends items by analyzing item features and matching them to a user's historical preferences. For example, if you liked action movies, it suggests other action films based on genre, director, or similar attributes.
- **Hybrid Filtering**: Combines collaborative and content-based approaches to leverage the strengths of both techniques while mitigating their individual weaknesses. These systems use various strategies like weighted combinations or switching between methods.
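
To make the "similar users" idea behind collaborative filtering more concrete, here is a minimal sketch in plain Python. The users, movies and ratings are made up for illustration, and `user_similarity` and `recommend` are hypothetical helper names, not part of this notebook or of any library.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating}. Not taken from the dataset.
ratings = {
    "ann": {"Alien": 5, "Heat": 4, "Up": 1},
    "bob": {"Alien": 5, "Heat": 5, "Up": 1, "Ran": 5},
    "cat": {"Alien": 1, "Heat": 2, "Up": 5, "Coco": 5},
}

def user_similarity(a, b):
    """Cosine similarity over the movies both users rated."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in shared)
    norm_a = sqrt(sum(ratings[a][m] ** 2 for m in shared))
    norm_b = sqrt(sum(ratings[b][m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Suggest unseen movies rated by the most similar other user."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: user_similarity(user, u))
    unseen = set(ratings[nearest]) - set(ratings[user])
    return nearest, sorted(unseen)

print(recommend("ann"))
```

Here "ann" rates the shared movies almost exactly like "bob", so she is offered the one movie he rated that she has not seen. A real system would weight several neighbours and take the rating values into account.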

Beyond the main types, some other approaches include:
- **Demographic-Based Filtering**: Groups users by demographic attributes (age, location, gender) and makes recommendations based on what similar demographic groups prefer. This creates recommendations without needing extensive user rating history.
- **Popularity-Based Filtering**: Recommends items based on overall popularity metrics like average ratings or view counts. While simple to implement, these provide identical recommendations to all users regardless of individual preferences.

# Dataset Overview

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import spacy
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
import warnings

warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('/content/movies_metadata.csv') # google colab version
#df = pd.read_csv('data/movies_metadata.csv') # local version

The dataset was sourced from [Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) and originally collected from TMDB and GroupLens.

In [3]:
df.shape

(45466, 24)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [5]:
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [6]:
df.describe() # numerical features

Unnamed: 0,revenue,runtime,vote_average,vote_count
count,45460.0,45203.0,45460.0,45460.0
mean,11209350.0,94.128199,5.618207,109.897338
std,64332250.0,38.40781,1.924216,491.310374
min,0.0,0.0,0.0,0.0
25%,0.0,85.0,5.0,3.0
50%,0.0,95.0,6.0,10.0
75%,0.0,107.0,6.8,34.0
max,2787965000.0,1256.0,10.0,14075.0


In [7]:
df.describe(include='object') # categorical features

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,spoken_languages,status,tagline,title,video
count,45466,4494,45466,45466,7782,45466,45449,45455,45466,44512,45461.0,45080,45463,45463,45379,45460,45379,20412,45460,45460
unique,5,1698,1226,4069,7673,45436,45417,92,43373,44307,44176.0,45024,22708,2393,17336,1931,6,20283,42277,2
top,False,"{'id': 415931, 'name': 'The Bowery Boys', 'pos...",0,"[{'id': 18, 'name': 'Drama'}]",http://www.georgecarlin.com,141971,tt1180333,en,Hamlet,No overview found.,0.0,/5D7UBSEgdyONE6Lql6xS7s6OLcW.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",2008-01-01,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Based on a true story.,Cinderella,False
freq,45454,29,36573,5000,12,3,3,32269,8,133,34.0,5,11875,17851,136,22395,45014,7,11,45367


In [8]:
df.nunique()

Unnamed: 0,0
adult,5
belongs_to_collection,1698
budget,1226
genres,4069
homepage,7673
id,45436
imdb_id,45417
original_language,92
original_title,43373
overview,44307


In [9]:
df.isna().sum()

Unnamed: 0,0
adult,0
belongs_to_collection,40972
budget,0
genres,0
homepage,37684
id,0
imdb_id,17
original_language,11
original_title,0
overview,954


# Popularity-Based Recommender - a simple approach

**Popularity-Based Filtering** aims to give generalized recommendations based on movie popularity (as the name suggests), and it can also be narrowed down to a single genre. Its limitation is that the recommendations are not personalized: every user gets the same list, regardless of individual preferences.

First, we'll create a new feature - **Weighted Rating**, based on the IMDb formula, which is defined as follows:

$$
WR = \left( \frac{v}{v + m} \right) \cdot R + \left( \frac{m}{v + m} \right) \cdot C
$$

where,
- $WR$ is the weighted rating
- $R$ is the average rating for the movie
- $C$ is the mean vote across all movies
- $v$ is the number of votes for the movie
- $m$ is the minimum number of votes a movie needs in order to be considered
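
A quick worked example may help to see what the formula does. The numbers below are purely illustrative (they are not the dataset's actual $C$ or $m$), and `weighted_rating_example` is just a throwaway name for this sketch.

```python
def weighted_rating_example(R, v, m, C):
    """WR = v/(v+m) * R + m/(v+m) * C, as in the formula above."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only: global mean C = 5.0, threshold m = 400
few_votes = weighted_rating_example(R=8.0, v=100, m=400, C=5.0)
many_votes = weighted_rating_example(R=8.0, v=10_000, m=400, C=5.0)

print(round(few_votes, 3))   # pulled strongly toward C
print(round(many_votes, 3))  # stays close to R
```

Both movies have the same average rating of 8.0, but the one with only 100 votes is shrunk toward the global mean, while the one with 10,000 votes keeps almost all of its rating.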

And some cleaning...

In [10]:
df['genres'] = df['genres'].apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [11]:
null_rows_votes = df[df[['vote_average', 'vote_count']].isnull().any(axis=1)]
null_rows_votes[['original_title', 'vote_average', 'vote_count']]

Unnamed: 0,original_title,vote_average,vote_count
19729,Midnight Man,,
19730,"[{'iso_639_1': 'en', 'name': 'English'}]",,
29502,マルドゥック・スクランブル 排気,,
29503,"[{'iso_639_1': 'ja', 'name': '日本語'}]",,
35586,Avalanche Sharks,,
35587,"[{'iso_639_1': 'en', 'name': 'English'}]",,


In [12]:
df = df.dropna(subset=['vote_average', 'vote_count'])

In [13]:
df['popularity'].isnull().sum()

np.int64(0)

In [14]:
df['popularity'] = pd.to_numeric(df['popularity'], errors='coerce')

In [15]:
null_rows_date = df[df['release_date'].isnull()].sort_values('vote_count', ascending=False)
null_rows_date[['original_title', 'popularity', 'release_date', 'vote_count']].head()

Unnamed: 0,original_title,popularity,release_date,vote_count
37461,And Then There Were None,5.238281,,91.0
43523,Cosmos,0.282584,,41.0
19322,Endeavour,1.233673,,19.0
44798,Salad Fingers,0.141367,,4.0
33357,Independence Day 3,0.642294,,4.0


In [16]:
df = df.dropna(subset=['release_date'])

In [17]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['year'] = df['release_date'].dt.year

Now, $m$ - the minimum number of votes - needs to be set. We define $m$ as the 95th percentile of the **vote_count**, meaning 95% of movies have votes less than or equal to $m$.
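
As a side note, here is roughly what a percentile threshold does on a small, skewed sample. This is a plain-Python sketch using linear interpolation between the nearest ranks (the default interpolation of pandas' `Series.quantile`). The vote counts are made up, and `quantile_linear` is a hypothetical helper.

```python
import math

def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Hypothetical, heavily skewed vote counts (most titles have very few votes)
vote_counts = [0, 1, 2, 3, 3, 5, 8, 10, 12, 20, 30,
               34, 60, 90, 150, 400, 900, 2500, 6000, 9000, 14000]

threshold = quantile_linear(vote_counts, 0.95)
qualifying = [v for v in vote_counts if v >= threshold]

print(threshold, qualifying)
```

Only the few titles at the very top of the distribution pass the cut, which is exactly the effect we want from $m$.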

In [18]:
m = df['vote_count'].quantile(0.95)
print(f'Minimum number of votes: {m}')

Minimum number of votes: 434.0


In [19]:
C = df['vote_average'].mean()

def weighted_rating(x, C = C, m = m):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m)) * R + (m / (m + v)) * C

In [20]:
popular_movies = df[df['vote_count'] >= m].copy() # work on a copy so the new column is not written to a slice of df
popular_movies['weighted_rating'] = popular_movies.apply(weighted_rating, axis=1)

popular_movies = popular_movies.sort_values('weighted_rating', ascending=False)
popular_movies[['original_title', 'year', 'genres', 'vote_count', 'vote_average', 'popularity', 'weighted_rating']].head(10)

Unnamed: 0,original_title,year,genres,vote_count,vote_average,popularity,weighted_rating
314,The Shawshank Redemption,1994,"[Drama, Crime]",8358.0,8.5,51.645403,8.358035
834,The Godfather,1972,"[Drama, Crime]",6024.0,8.5,41.109264,8.306728
12481,The Dark Knight,2008,"[Drama, Action, Crime, Thriller]",12269.0,8.3,123.167259,8.208576
2843,Fight Club,1999,[Drama],9678.0,8.3,63.869599,8.185151
292,Pulp Fiction,1994,"[Thriller, Crime]",8670.0,8.3,140.950236,8.172435
351,Forrest Gump,1994,"[Comedy, Drama, Romance]",8147.0,8.2,48.307194,8.069718
522,Schindler's List,1993,"[Drama, History, War]",4436.0,8.3,41.725123,8.061529
23673,Whiplash,2014,[Drama],4376.0,8.3,64.29999,8.058554
5481,千と千尋の神隠し,2001,"[Fantasy, Adventure, Animation, Family]",3968.0,8.3,41.048867,8.036176
1154,The Empire Strikes Back,1980,"[Adventure, Action, Science Fiction]",5998.0,8.2,19.470959,8.026189


Now, let's also include genre selection.

In [21]:
def recommend_popular_movies_by_genre(genre, n = 10):
    genre_popular_movies = popular_movies[popular_movies['genres'].apply(lambda x: genre in x)]
    recommended_movies = genre_popular_movies.sort_values('weighted_rating', ascending=False).head(n)
    return recommended_movies[['original_title', 'year', 'vote_count', 'vote_average', 'popularity', 'weighted_rating']]

In [22]:
# example usage
print("Top 10 Popular Action Movies:")
display(recommend_popular_movies_by_genre('Action'))

print("\nTop 10 Popular Comedy Movies:")
display(recommend_popular_movies_by_genre('Comedy'))

Top 10 Popular Action Movies:


Unnamed: 0,original_title,year,vote_count,vote_average,popularity,weighted_rating
12481,The Dark Knight,2008,12269.0,8.3,123.167259,8.208576
1154,The Empire Strikes Back,1980,5998.0,8.2,19.470959,8.026189
15480,Inception,2010,14075.0,8.1,29.108149,8.025939
7000,The Lord of the Rings: The Return of the King,2003,8226.0,8.1,29.324358,7.975918
256,Star Wars,1977,6778.0,8.1,42.149697,7.951005
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892.0,8.0,32.070725,7.889432
5814,The Lord of the Rings: The Two Towers,2002,7641.0,8.0,29.423537,7.872303
23753,Guardians of the Galaxy,2014,10014.0,7.9,53.291601,7.80546
2458,The Matrix,1999,9079.0,7.9,33.366332,7.796168
13605,Inglourious Basterds,2009,6598.0,7.9,16.89564,7.759534



Top 10 Popular Comedy Movies:


Unnamed: 0,original_title,year,vote_count,vote_average,popularity,weighted_rating
351,Forrest Gump,1994,8147.0,8.2,48.307194,8.069718
2211,La vita è bella,1997,3643.0,8.3,39.39497,8.015145
18465,Intouchables,2011,5410.0,8.2,16.086919,8.008701
1225,Back to the Future,1985,6239.0,8.0,25.778509,7.845474
22841,The Grand Budapest Hotel,2014,4644.0,8.0,14.442048,7.796937
22131,The Wolf of Wall Street,2013,6768.0,7.9,16.382422,7.76285
30315,Inside Out,2015,6737.0,7.9,23.985587,7.762257
10309,Dilwale Dulhania Le Jayenge,1995,661.0,9.1,34.457024,7.722325
40882,La La Land,2016,4745.0,7.9,19.681686,7.709277
13724,Up,2009,7048.0,7.8,19.330884,7.673783


If you ask me, Intouchables is probably my personal favorite 😁

And that is pretty much it for this trivial **Popularity-Based Recommender**.

# Content-Based Recommender

As mentioned before, the major flaw of the previous recommender is that it provides the same recommendations to all users. So, if someone liked a recently watched movie and would like to watch, for example, a movie with the same theme, with the same actor, or the same director, we need to create a new, more advanced recommender. And this is where **Content-Based Filtering** comes into action.

First, some cleaning... again.

In [23]:
null_rows_overview = df[df['overview'].isnull()].sort_values('vote_count', ascending=False)
null_rows_overview[['original_title', 'overview', 'tagline', 'popularity', 'vote_count']].head(10)

Unnamed: 0,original_title,overview,tagline,popularity,vote_count
31600,Cado dalle nubi,,,6.40477,299.0
28549,Chiedimi se sono felice,,,4.288138,273.0
31547,La Banda Dei Babbi Natale,,,3.670624,185.0
31583,Tu la conosci Claudia?,,,5.038131,183.0
34178,"Il ricco, il povero e il maggiordomo",,,4.213349,178.0
31559,"Bianca come il latte, rossa come il sangue",,,5.665259,153.0
44886,Notte prima degli esami - Oggi,,,4.34552,137.0
25364,Il ciclone,,,4.815947,129.0
35490,Si accettano miracoli,,,4.833239,128.0
31569,Un boss in salotto,,,4.40496,111.0


In [24]:
null_rows_overview.shape[0]

941

Films without an **overview** are of no use to us, and as you can see there are not many of them, and they are mostly less popular non-English titles. Hence, we will simply remove the rows that lack this feature.

In [25]:
df.dropna(subset=['overview'], inplace=True)

In [26]:
for i in range(5):
    print(f"Overview: {df.loc[i, 'overview']}")
    print(f"Tagline: {df.loc[i, 'tagline']}\n")

Overview: Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.
Tagline: nan

Overview: When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.
Tagline: Roll the dice and unleash the excitement!

Overview: A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the local

In [27]:
df['tagline'].isna().sum()

np.int64(24045)

The **tagline** column contains short promotional phrases or catchlines used in marketing, which are often less informative than the **overview**. However, since they may still add some descriptive value or style, we can use them to enrich the textual data we'll be working with. Nonetheless, if a movie does not have a **tagline**, it's not an issue, and can just be filled with an empty string.

In [28]:
df['tagline'] = df['tagline'].fillna('') # assign back instead of calling inplace on a column selection

In [29]:
df['description'] = df['overview'] + ' ' + df['tagline']

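Unfortunately, I have to reduce the dataset due to limited computing power: the pairwise similarity matrix grows quadratically with the number of movies, so a dense matrix for all 44,435 remaining titles would have roughly 2 billion entries. Therefore, I will set a minimum number of votes, defined as the 80th percentile of the **vote_count**. However, if you have sufficient computing power, feel free to use the entire dataset.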

In [30]:
min_votes = df['vote_count'].quantile(0.8)
print(f'Minimum number of votes: {min_votes}; Current shape: {df.shape}')

Minimum number of votes: 52.0; Current shape: (44435, 26)


In [31]:
df = df[df['vote_count'] >= min_votes]
print(f'New shape: {df.shape}')

New shape: (8894, 26)


In [32]:
df = df.reset_index(drop=True)

**TF-IDF (Term Frequency–Inverse Document Frequency) Vectorizer** turns text into numerical feature vectors by measuring how important each n-gram (single words or sequences of words) is in a document relative to the entire collection of texts.
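
To see what happens under the hood, here is a stripped-down sketch of the computation in plain Python. It assumes scikit-learn's documented defaults (smoothed idf, $\ln\frac{1 + n}{1 + df} + 1$, followed by L2 normalization), uses unigrams only and skips stop-word removal. The three mini-descriptions are made up.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, not taken from the dataset
docs = [
    "space crew explores distant planet",
    "space station crew fights alien",
    "chef opens restaurant in paris",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Term counts times smoothed idf, then scaled to unit length."""
    counts = Counter(tokens)
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]

# "space" occurs in two documents, "planet" in one, so "planet" weighs more
print(vectors[0]["space"] < vectors[0]["planet"])
```

Terms that appear in fewer documents receive a higher weight, which is why rare, distinctive words end up dominating the comparison between two descriptions.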

In [33]:
tfidf_vectorizer = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['description'])

In [34]:
tfidf_matrix.shape

(8894, 32922)

$$
\text{cosine similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}
= \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \cdot \sqrt{\sum_{i=1}^n B_i^2}}
$$

**Cosine similarity** computes the dot product of two vectors divided by the product of their magnitudes. It measures the cosine of the angle between the vectors, capturing how similar their directions are regardless of their magnitudes. In general the result ranges from -1 to 1, where 1 indicates identical orientation, 0 indicates orthogonality (no similarity) and -1 indicates opposite orientation. Since TF-IDF weights are never negative, the scores in our case fall between 0 and 1.

$$
\text{linear kernel} = \mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^n A_i B_i
$$

**Linear kernel** computes the dot product between two vectors, which measures their alignment in the vector space.

Since `TfidfVectorizer` L2-normalizes every document vector by default (`norm='l2'`, so each magnitude is 1), the division by the magnitudes in cosine similarity becomes redundant. On these vectors the linear kernel returns exactly the cosine similarity, and skipping the normalization step makes it faster. Either way, the score captures content similarity by measuring how many descriptive terms two movies share.
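
The equivalence is easy to check by hand. The sketch below uses two arbitrary example vectors (not real TF-IDF rows) and small helper functions written just for this illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Arbitrary non-negative example vectors (TF-IDF weights are never negative)
a = l2_normalize([3.0, 0.0, 4.0])
b = l2_normalize([1.0, 2.0, 2.0])

# for unit-length vectors, the plain dot product and the cosine coincide
print(round(dot(a, b), 6), round(cosine(a, b), 6))
```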

In [35]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

This computes the **cosine similarity** between all pairs of movies based on their **TF-IDF** feature vectors.

The result is a square matrix where each entry $[i][j]$ represents the similarity between movie $i$ and movie $j$.

In [36]:
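# create a Series that maps each movie title to its DataFrame index
# (for duplicated titles keep the first occurrence, so a lookup always returns a single index)
indices = pd.Series(df.index, index=df['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]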

In [37]:
def get_movie_recommendations(title, cosine_sim=cosine_sim, df=df, indices=indices, n=10):
    if title not in indices:
        raise ValueError(f"Title '{title}' not found in the dataset.")

    # get the index of the movie that matches the title
    idx = indices[title]

    # get similarity scores for all movies with this movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # sort movies based on similarity scores (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # get scores of the n most similar movies (excluding itself)
    sim_scores = sim_scores[1:n+1]

    # get movie indices
    movie_indices = [i[0] for i in sim_scores]

    # return the top n most similar movies
    return df['original_title'].iloc[movie_indices]

In [38]:
# example usage
print(get_movie_recommendations('The Dark Knight'))

5898                      The Dark Knight Rises
78                               Batman Forever
704                              Batman Returns
5844                           Batman: Year One
317                                      Batman
6336    Batman: The Dark Knight Returns, Part 2
5415                 Batman: Under the Red Hood
8659                      The Lego Batman Movie
7291                          Batman vs Dracula
3653         Batman Beyond: Return of the Joker
Name: original_title, dtype: object


In [39]:
print(get_movie_recommendations('Interstellar'))

4037                               ほしのこえ
805                              Gattaca
1888                       Space Cowboys
6881                          キャプテンハーロック
166                             Stargate
8118                            400 Days
2737                 Final Destination 2
3258    Frank Herbert's Children of Dune
5550                     All Good Things
815                    Starship Troopers
Name: original_title, dtype: object


Well, we've got ourselves a content-based recommender. However, again, it has some limitations. The recommendations are based solely on the description of the film, but what about all the other factors that make people like it? For instance, some might love a specific actor, some might follow a director's work, or maybe they just enjoy a particular genre. Let's try a different approach. In order to do so, we will need two new datasets - **credits** and **keywords**.

In [40]:
credits = pd.read_csv('/content/credits.csv') # google colab version
#credits = pd.read_csv('data/credits.csv') # local version

keywords = pd.read_csv('/content/keywords.csv') # google colab version
#keywords = pd.read_csv('data/keywords.csv') # local version

In [41]:
print(f'Credits shape: {credits.shape}')
print(f'Keywords shape: {keywords.shape}')

Credits shape: (45476, 3)
Keywords shape: (46419, 2)


In [42]:
credits.info()
print()
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB

These two need some preprocessing as well...

In [43]:
# convert to numeric
df['id'] = pd.to_numeric(df['id'], errors='coerce')

# drop rows where the conversion failed
df.dropna(subset=['id'], inplace=True)

# convert to integer type
df['id'] = df['id'].astype(int)

In [44]:
print(f"Number of duplicated IDs in credits: {credits['id'].duplicated().sum()}")
print(f"Number of duplicated IDs in keywords: {keywords['id'].duplicated().sum()}")

Number of duplicated IDs in credits: 44
Number of duplicated IDs in keywords: 987


In [45]:
credits = credits.drop_duplicates(subset='id')
keywords = keywords.drop_duplicates(subset='id')

In [46]:
# merge df with credits and keywords
# use LEFT JOIN to preserve all movies in df
df = df.merge(credits, on='id', how='left').merge(keywords, on='id', how='left')

In [47]:
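# parse the stringified lists; a movie without a match in the left join has NaN here, so fall back to []
df['cast'] = df['cast'].apply(lambda x: literal_eval(x) if isinstance(x, str) else [])
df['crew'] = df['crew'].apply(lambda x: literal_eval(x) if isinstance(x, str) else [])
df['keywords'] = df['keywords'].apply(lambda x: literal_eval(x) if isinstance(x, str) else [])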

In [48]:
cast_list = df[df['original_title'] == 'Inception'].iloc[0]['cast']
cast_list[0]

{'cast_id': 1,
 'character': 'Dom Cobb',
 'credit_id': '52fe4534c3a368484e04de03',
 'gender': 2,
 'id': 6193,
 'name': 'Leonardo DiCaprio',
 'order': 0,
 'profile_path': '/jToSMocaCaS5YnuOJVqQ7S7pr4Q.jpg'}

In [49]:
crew_list = df[df['original_title'] == 'Inception'].iloc[0]['crew']
next((member for member in crew_list if member['job'] == 'Director'), None)

{'credit_id': '52fe4534c3a368484e04de4b',
 'department': 'Directing',
 'gender': 2,
 'id': 525,
 'job': 'Director',
 'name': 'Christopher Nolan',
 'profile_path': '/7OGmfDF4VHLLgbjxuEwTj3ga0uQ.jpg'}

In [50]:
def get_director(x):
    if not isinstance(x['crew'], list) or not x['crew']:
        return None
    director = next((member for member in x['crew'] if member.get('job') == 'Director'), None)
    return director.get('name') if director else None

def get_actors(x, n=3):
    if not isinstance(x['cast'], list) or not x['cast']:
        return []
    return [actor.get('name') for actor in x['cast'][:n]
            if isinstance(actor, dict) and actor.get('name')]

From the **crew** feature, we'll only use the director, as it seems to be the most important information for a potential viewer. From the **cast**, we'll take the first 3 actors.

In [51]:
df['director'] = df.apply(get_director, axis=1)
df['actors'] = df.apply(get_actors, axis=1)

In [52]:
df['director'].isna().sum()

np.int64(9)

In [53]:
df[df['director'].isna()][['original_title', 'director', 'vote_count']].sort_values(by='vote_count', ascending=False).head()

Unnamed: 0,original_title,director,vote_count
6122,Doragon Bōru Zetto: Fukkatsu no Fyūjon!! Gokū ...,,118.0
6131,ドラゴンボールZ 龍拳爆発!!悟空がやらねば誰がやる,,102.0
6099,Doragon bôru Z 5: Tobikkiri no saikyô tai saikyô,,97.0
7674,Barbie in A Mermaid Tale,,89.0
8814,Mythbusters Holiday Special,,88.0


In [54]:
df = df.dropna(subset=['director'])

In [55]:
df['actors'].apply(lambda x: isinstance(x, list) and len(x) == 0).sum()

np.int64(25)

In [56]:
df[df['actors'].apply(lambda x: isinstance(x, list) and len(x) == 0)][['original_title', 'actors', 'vote_count']].sort_values(by='vote_count', ascending=False).head()

Unnamed: 0,original_title,actors,vote_count
8424,Piper,[],487.0
5876,Lifted,[],232.0
4666,Zeitgeist,[],173.0
1811,Baraka,[],156.0
4030,Luxo Jr.,[],148.0


In [57]:
df = df[~df['actors'].apply(lambda x: isinstance(x, list) and len(x) == 0)]

Final processing for the **actors** and **director** features involves removing the spaces inside each name and converting it to lowercase. This turns every name into a single token, so the vectorizer cannot match two different people who merely share a first or last name.
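
A tiny illustration of the problem, using two real actors who share a first name. `shared_tokens` and `squash` are throwaway helpers for this sketch.

```python
def shared_tokens(a, b):
    """Tokens two strings have in common after lowercasing and splitting on spaces."""
    return set(a.lower().split()) & set(b.lower().split())

def squash(name):
    """Same normalization as in the notebook: drop spaces, lowercase."""
    return name.replace(" ", "").lower()

# Two different people who share a first name
a, b = "Johnny Depp", "Johnny Galecki"

print(shared_tokens(a, b))                  # the first name matches
print(shared_tokens(squash(a), squash(b)))  # no overlap once squashed
```

Without the normalization, a word-level vectorizer would see a common term "johnny" and treat the two movies as slightly more similar than they should be.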

In [58]:
print(df[df['original_title'] == 'Inception']['actors'].values[0])

['Leonardo DiCaprio', 'Joseph Gordon-Levitt', 'Ellen Page']


In [59]:
df['actors'] = df['actors'].apply(lambda x: [name.replace(" ", "").lower() for name in x] if isinstance(x, list) else [])
df['director'] = df['director'].apply(lambda x: x.replace(" ", "").lower() if isinstance(x, str) else x)

In [60]:
print(df[df['original_title'] == 'Inception']['actors'].values[0])

['leonardodicaprio', 'josephgordon-levitt', 'ellenpage']


In [61]:
df['keywords'].iloc[0]

[{'id': 931, 'name': 'jealousy'},
 {'id': 4290, 'name': 'toy'},
 {'id': 5202, 'name': 'boy'},
 {'id': 6054, 'name': 'friendship'},
 {'id': 9713, 'name': 'friends'},
 {'id': 9823, 'name': 'rivalry'},
 {'id': 165503, 'name': 'boy next door'},
 {'id': 170722, 'name': 'new toy'},
 {'id': 187065, 'name': 'toy comes to life'}]

In [62]:
df['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [63]:
df['keywords'].iloc[0]

['jealousy',
 'toy',
 'boy',
 'friendship',
 'friends',
 'rivalry',
 'boy next door',
 'new toy',
 'toy comes to life']

In [64]:
df['keywords'].apply(lambda x: isinstance(x, list) and len(x) == 0).sum()

np.int64(608)

In [65]:
df[df['keywords'].apply(lambda x: isinstance(x, list) and len(x) == 0)][['original_title', 'keywords', 'vote_count']].sort_values(by='vote_count', ascending=False).head()

Unnamed: 0,original_title,keywords,vote_count
6367,Identity Thief,[],1667.0
4777,The Mummy: Tomb of the Dragon Emperor,[],1418.0
6963,Begin Again,[],1285.0
6583,Grown Ups 2,[],1180.0
4874,Transporter 3,[],1110.0


In this case, we won't do anything with rows containing empty lists. We'll just leave them as they are.

In [66]:
# use a new name so the `keywords` DataFrame loaded earlier is not overwritten
all_keywords = df['keywords'].explode()

keywords_counts = all_keywords.value_counts()

In [67]:
valid_keywords = set(keywords_counts[keywords_counts > 1].index)

In [68]:
df['keywords'] = df['keywords'].apply(lambda x: [word for word in x if word in valid_keywords])

Keywords that occur only once are of no use to us. Hence, we get rid of them.
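
The same filtering idea on a toy example, written with `collections.Counter` instead of pandas. The keyword lists are made up.

```python
from collections import Counter

# Hypothetical keyword lists for three movies
movie_keywords = [
    ["robot", "space", "time travel"],
    ["space", "alien"],
    ["robot", "heist"],
]

# count every keyword across all movies, then keep those seen more than once
counts = Counter(kw for kws in movie_keywords for kw in kws)
valid = {kw for kw, c in counts.items() if c > 1}
filtered = [[kw for kw in kws if kw in valid] for kws in movie_keywords]

print(filtered)
```

A keyword that belongs to a single movie can never contribute to the similarity between two movies, so dropping it only shrinks the vocabulary.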

Now, it's time to convert the keywords using **spaCy's lemmatizer**. It reduces words to their base dictionary forms - for example, turning "running" to "run", "flies" to "fly", and so on.

In [69]:
nlp = spacy.load("en_core_web_sm")

In [70]:
for token in nlp("running flies happiness Cities"):
    print(f"{token.text} -> {token.lemma_}")

running -> run
flies -> fly
happiness -> happiness
Cities -> city


Now, there's a tricky part with our keyword preprocessing. We cannot simply lemmatize every word in our keyword lists without considering context. Let's examine the most frequent keywords in our dataset:

In [71]:
keywords_counts.head()

Unnamed: 0_level_0,count
keywords,Unnamed: 1_level_1
woman director,541
murder,429
duringcreditsstinger,374
based on novel,353
independent film,342


If we blindly lemmatize every word, we'll lose important semantic meaning. For example:

In [72]:
def process_keywords_test(x):
    all_lemmas = []
    for keyword in x:
        doc = nlp(keyword.lower())
        lemmas = [token.lemma_ for token in doc
                  if not token.is_stop and not token.is_punct and token.is_alpha]
        all_lemmas.extend(lemmas)
    return all_lemmas

In [73]:
sample_keywords = ["Women directors", "based on Novels", "duringcreditsstinger", "mUrders", "independent Films"]
print(process_keywords_test(sample_keywords))

['woman', 'director', 'base', 'novel', 'duringcreditsstinger', 'murder', 'independent', 'film']


You can tell this isn't ideal. What kind of information do words like 'director', 'base' or 'film' really give on their own? That's why we need to approach it carefully.

In [74]:
def process_keywords(x):
    all_processed = []
    for keyword in x:
        doc = nlp(keyword.lower())
        processed_tokens = []
        for token in doc:
            if token.is_alpha:
                # lemmatize nouns, keep adjectives as-is
                if token.pos_ in ['NOUN', 'PROPN']:
                    processed_tokens.append(token.lemma_)
                elif token.pos_ == 'ADJ':
                    processed_tokens.append(token.text)
                # skip other parts (prepositions, etc.)

        if processed_tokens:
            # concatenate the tokens without spaces so each keyword stays a single term
            all_processed.append(''.join(processed_tokens))

    return all_processed

In [75]:
print(process_keywords(sample_keywords))

['womandirector', 'novel', 'duringcreditsstinger', 'murder', 'independentfilm']


What we've done here is use spaCy's part-of-speech (POS) tagging to selectively lemmatize only nouns and proper nouns (to handle plurals and variants) while preserving adjectives that carry semantic meaning. Other parts of speech were ignored. Additionally, we removed the spaces between the words, so that each multi-word keyword is treated as one token instead of being split into separate words later on.

In [76]:
df['keywords'] = df['keywords'].apply(process_keywords)

Now it's time to combine the features into a single text field.

To give more weight to the **director** and the **main actor**, we'll simply repeat their names multiple times (**director**: weight=3, **1st actor**: weight=2).
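
To get a feeling for how much repetition changes the score, here is a small sketch that compares two made-up feature strings with and without the director weight. `cosine_counts` is a hypothetical helper that works on raw word counts, and the token names are placeholders.

```python
import math
from collections import Counter

def cosine_counts(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm

# Hypothetical soups: same director, otherwise different content
plain_a = "directorx actora drama prison"
plain_b = "directorx actorb thriller heist"

# Same soups with the director token repeated three times
weighted_a = "directorx directorx directorx actora drama prison"
weighted_b = "directorx directorx directorx actorb thriller heist"

print(round(cosine_counts(plain_a, plain_b), 3))
print(round(cosine_counts(weighted_a, weighted_b), 3))
```

In this toy case the shared director alone lifts the similarity from 0.25 to 0.75, so the weights are worth tuning if the results lean too heavily on a single person.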

In [77]:
# unweighted baseline, kept for reference:
# def create_soup(x):
#     director = x['director']
#     actors = ' '.join(x['actors']) if x['actors'] else ''
#     genres = ' '.join(x['genres']) if x['genres'] else ''
#     keywords = ' '.join(x['keywords']) if x['keywords'] else ''
#     return ' '.join([director, actors, genres, keywords]).strip()
def create_soup(x):
    director = ' '.join([x['director']] * 3)  # director weight 3

    actors = ''
    if x['actors']:
        actors = ' '.join([x['actors'][0]] * 2)  # first actor weight 2
        if len(x['actors']) > 1:
            actors += ' ' + ' '.join(x['actors'][1:])  # rest actors weight 1

    genres = ' '.join(x['genres']) if x['genres'] else ''
    keywords = ' '.join(x['keywords']) if x['keywords'] else ''

    return ' '.join([director, actors, genres, keywords]).strip()

In [78]:
df['soup'] = df.apply(create_soup, axis=1)

In [79]:
df = df.reset_index(drop=True)

**Count Vectorizer** converts text into numerical feature vectors by counting how often each n-gram appears in a document. Unlike **TF-IDF**, it doesn't adjust for term importance across documents - every term is treated equally based on frequency alone.

In [80]:
count_vectorizer = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)
count_matrix = count_vectorizer.fit_transform(df['soup'])

Unlike **TF-IDF**, which produces normalized vectors, **CountVectorizer** generates raw count vectors, so longer documents naturally have larger counts. That is why we use cosine similarity this time: it normalizes for vector magnitude, so the score reflects content overlap rather than document length. This keeps the comparison fair between movies whose feature strings differ in length.
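
A small sketch of that length invariance, using made-up count vectors (all names and numbers here are illustrative assumptions): scaling a vector changes the dot product but leaves the cosine similarity at 1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

short_doc = np.array([[1, 1, 0, 2]])
long_doc = short_doc * 3              # same proportions, three times the counts
other_doc = np.array([[0, 2, 3, 0]])  # different content

# The raw dot product grows with document length
dot_short_long = (short_doc @ long_doc.T).item()

# Cosine similarity ignores magnitude and only compares direction
cos_short_long = cosine_similarity(short_doc, long_doc)[0, 0]
cos_short_other = cosine_similarity(short_doc, other_doc)[0, 0]

print(dot_short_long)
print(round(cos_short_long, 3), round(cos_short_other, 3))
```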

In [81]:
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [82]:
# create a Series that maps each movie title to its DataFrame index
indices2 = pd.Series(df.index, index=df['original_title'])

In [83]:
print(get_movie_recommendations(title = 'The Dark Knight', cosine_sim=cosine_sim2, df=df, indices=indices2, n=10))

5889         The Dark Knight Rises
3866                 Batman Begins
4245                  The Prestige
8818                       Dunkirk
1299                     Following
7332                     Doodlebug
2515                      Insomnia
5408    Batman: Under the Red Hood
5399                     Inception
2063                       Memento
Name: original_title, dtype: object


In [84]:
print(get_movie_recommendations(title = 'Interstellar', cosine_sim=cosine_sim2, df=df, indices=indices2, n=10))

5399                Inception
7332                Doodlebug
8818                  Dunkirk
1299                Following
4245             The Prestige
2063                  Memento
2515                 Insomnia
5889    The Dark Knight Rises
4668          The Dark Knight
2522           Silent Running
Name: original_title, dtype: object


This way, we created a decent movie recommender based on movie **keywords, director, actors and genres**.

However, we could still enhance it slightly by adding a **weighted rating** alongside the **similarity score**. This hybrid approach helps us avoid recommending films that are not well-rated by audiences.

In [85]:
def get_movie_recommendations_with_rating(title, cosine_sim=cosine_sim2, df=df,
                                          indices=indices2, n=10, similarity_weight=0.7, rating_weight=0.3):
    if title not in indices:
        raise ValueError(f"Title '{title}' not found in the dataset.")

    # get the index of the movie that matches the title
    idx = indices[title]

    # calculate parameters
    m = df['vote_count'].quantile(0.85) # 847 votes
    C = df['vote_average'].mean()

    # get similarity scores for all movies with this movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # calculate hybrid scores on-the-fly
    hybrid_scores = []
    for i, similarity_score in sim_scores:
        # exclude the movie itself
        if i == idx:
            continue

        # only consider movies with sufficient votes
        if df.iloc[i]['vote_count'] < m:
            continue

        # calculate weighted rating on-the-fly without storing it
        v = df.iloc[i]['vote_count']
        R = df.iloc[i]['vote_average']
        wr = (v / (v + m)) * R + (m / (m + v)) * C

        # normalize to 0-1 scale
        normalized_rating = wr / 10.0

        # combine similarity and rating
        hybrid_score = (similarity_weight * similarity_score +
                       rating_weight * normalized_rating)

        hybrid_scores.append((i, hybrid_score))

    # sort movies based on hybrid scores (descending)
    hybrid_scores = sorted(hybrid_scores, key=lambda x: x[1], reverse=True)

    # get movie indices
    movie_indices = [i[0] for i in hybrid_scores[:n]]

    # return the top n recommended movies
    return df['original_title'].iloc[movie_indices]
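
For reference, the score computed inside the loop can be written compactly. The symbols $w_s$ and $w_r$ are shorthand introduced here for `similarity_weight` and `rating_weight`, and $q$ denotes the query movie; the rest follows the variable names in the code.

```latex
\[
WR_i = \frac{v_i}{v_i + m}\, R_i + \frac{m}{v_i + m}\, C,
\qquad
\mathrm{score}_i = w_s \cdot \mathrm{sim}(q, i) + w_r \cdot \frac{WR_i}{10}
\]
```

Here $v_i$ is the vote count of movie $i$, $R_i$ its average vote, $m$ the vote-count threshold and $C$ the mean vote across the dataset. Dividing $WR_i$ by 10 puts it on the same 0-1 scale as the similarity.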

In [86]:
print(get_movie_recommendations_with_rating(title = 'The Dark Knight', cosine_sim=cosine_sim2, df=df, indices=indices2, n=10, similarity_weight=0.5, rating_weight=0.5))

5889    The Dark Knight Rises
3866            Batman Begins
4245             The Prestige
5399                Inception
8818                  Dunkirk
2063                  Memento
6828             Interstellar
2515                 Insomnia
1753          American Psycho
403             The Godfather
Name: original_title, dtype: object


Different **similarity and rating weights** can be explored to **fine-tune** the recommendation results.
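
As a rough illustration of how the weights shift the ranking, the sketch below scores two made-up candidates under two weight settings. `hybrid_scores` is a hypothetical vectorized helper, not the function used in this notebook, and every number is invented for the example.

```python
import numpy as np

def hybrid_scores(similarity, vote_average, vote_count, m, C,
                  similarity_weight=0.7, rating_weight=0.3):
    """Hypothetical vectorized variant of the hybrid score (inputs are 1-D arrays)."""
    similarity = np.asarray(similarity, dtype=float)
    R = np.asarray(vote_average, dtype=float)
    v = np.asarray(vote_count, dtype=float)
    wr = (v / (v + m)) * R + (m / (v + m)) * C  # weighted rating
    return similarity_weight * similarity + rating_weight * wr / 10.0

# Candidate 0 is slightly more similar; candidate 1 is much better rated
sim = np.array([0.60, 0.55])
avg = np.array([5.0, 8.5])
cnt = np.array([1000, 1000])

top_by_weight = {}
for w_sim in (0.9, 0.5):
    scores = hybrid_scores(sim, avg, cnt, m=1000, C=6.0,
                           similarity_weight=w_sim, rating_weight=1 - w_sim)
    top_by_weight[w_sim] = int(scores.argmax())

print(top_by_weight)  # {0.9: 0, 0.5: 1}
```

With a high similarity weight the closer match wins; with equal weights the better-rated film overtakes it.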

# Collaborative Filtering

**Collaborative filtering** is a recommendation technique that predicts user preferences by analyzing patterns in user behavior and ratings. It operates on the principle that users with similar preferences in the past will have similar preferences in the future. This approach doesn't require knowledge about item features, instead relying entirely on user-item interaction data to make recommendations.

In [87]:
ratings = pd.read_csv('/content/ratings_small.csv')
#ratings = pd.read_csv('data/ratings_small.csv')

Collaborative filtering needs a second dataset that contains user ratings, which is what we loaded above as `ratings`.

In [88]:
ratings = ratings[ratings['movieId'].isin(df['id'])]
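
One thing worth verifying at this step: as far as I know, in The Movies Dataset `ratings_small.csv` identifies films by their MovieLens `movieId`, while `movies_metadata.csv` stores TMDB ids in `id`, and `links_small.csv` (columns `movieId`, `imdbId`, `tmdbId`) maps one to the other. If that holds for the files used here, the two id columns should be aligned before filtering. The sketch below shows the idea on toy frames; the frame names and the row with a missing `tmdbId` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins; with the real files, links_toy would be read from links_small.csv
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 999],          # MovieLens ids (999 is a made-up id)
    "rating": [4.0, 3.5, 5.0],
})
links_toy = pd.DataFrame({
    "movieId": [1, 2, 999],
    "tmdbId": [862.0, 8844.0, None],  # missing mapping for the made-up id
})
movies_toy = pd.DataFrame({
    "id": [862, 8844],               # TMDB ids
    "original_title": ["Toy Story", "Jumanji"],
})

# Drop unmapped rows, then swap the MovieLens id for the TMDB id
links_clean = links_toy.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})
ratings_tmdb = (
    ratings_toy.merge(links_clean[["movieId", "tmdbId"]], on="movieId", how="inner")
    .drop(columns="movieId")
    .rename(columns={"tmdbId": "movieId"})
)

# Now the filter compares ids from the same id space
ratings_tmdb = ratings_tmdb[ratings_tmdb["movieId"].isin(movies_toy["id"])]
print(ratings_tmdb)
```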

In [89]:
print("Ratings dataset shape:", ratings.shape)
print(f"Number of unique users: {ratings['userId'].nunique()}")
print(f"Number of unique movies: {ratings['movieId'].nunique()}")
print(f"Rating range: {ratings['rating'].min()} to {ratings['rating'].max()}")
print(f"Average rating: {ratings['rating'].mean():.2f}")
print("\nInfo:")
ratings.info()
print("\nFirst few rows:")
print(ratings.head())
print("\nStatistics:")
print(ratings.describe())

Ratings dataset shape: (30388, 4)
Number of unique users: 671
Number of unique movies: 1429
Rating range: 0.5 to 5.0
Average rating: 3.59

Info:
<class 'pandas.core.frame.DataFrame'>
Index: 30388 entries, 10 to 99996
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   userId     30388 non-null  int64  
 1   movieId    30388 non-null  int64  
 2   rating     30388 non-null  float64
 3   timestamp  30388 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 1.2 MB

First few rows:
    userId  movieId  rating   timestamp
10       1     1371     2.5  1260759135
13       1     2105     4.0  1260759139
16       1     2294     2.0  1260759108
21       2       17     5.0   835355681
26       2       62     3.0   835355749

Statistics:
             userId        movieId        rating     timestamp
count  30388.000000   30388.000000  30388.000000  3.038800e+04
mean     345.785968    1983.489766      3.589163  1.063304e+09
st

We need to create a **user-item matrix** where each row corresponds to a user, each column to a movie (by id), and each value is a rating. Missing ratings are filled with 0, which here means "no rating" rather than a score of zero. This structure lets us measure similarities between users or items and make predictions based on patterns in the rating behavior.

In [90]:
user_item_matrix = ratings.pivot_table(
    index='userId',
    columns='movieId',
    values='rating',
    fill_value=0
)
total_cells = user_item_matrix.shape[0] * user_item_matrix.shape[1]
zero_cells = (user_item_matrix == 0).sum().sum()
sparsity = (zero_cells / total_cells) * 100

print(f"User-item matrix shape: {user_item_matrix.shape}")
print(f"Sparsity: {sparsity:.2f}%")

User-item matrix shape: (671, 1429)
Sparsity: 96.83%


The **sparsity** of the user-item matrix is the percentage of cells with missing ratings (represented as zeros). Such high sparsity indicates that most users have rated only a small fraction of the available films, which is expected given the large number of films in the dataset and the fact that most people watch only a limited selection.
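
The same figure can be cross-checked from the counts printed earlier, assuming each user rated a given movie at most once:

```python
n_users, n_movies = 671, 1429   # shape of the user-item matrix
n_ratings = 30388               # rows in the filtered ratings table

density = n_ratings / (n_users * n_movies)
sparsity_pct = (1 - density) * 100

print(f"{sparsity_pct:.2f}%")   # 96.83%
```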

**User-based collaborative filtering** identifies users with similar rating patterns and recommends items that similar users have enjoyed. The algorithm calculates similarity between users using metrics like cosine similarity or Pearson correlation, then predicts ratings based on what similar users rated for unrated items. This approach works on the assumption that "users who agreed in the past will agree in the future".

In [91]:
# convert to sparse matrix for efficiency
sparse_matrix = csr_matrix(user_item_matrix.values)

# compute cosine similarity
user_similarity = cosine_similarity(sparse_matrix)
user_similarity = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)

In [92]:
def user_based_predict(user_id, movie_id, n=50):
    if user_id not in user_item_matrix.index:
        return ratings['rating'].mean()  # global average of observed ratings

    # user average over rated movies only (zeros mark missing ratings)
    user_row = user_item_matrix.loc[user_id]
    user_mean = user_row[user_row > 0].mean()

    if movie_id not in user_item_matrix.columns:
        return user_mean

    # get users who rated this movie
    movie_ratings = user_item_matrix[movie_id]
    rated_users = movie_ratings[movie_ratings > 0].index

    if len(rated_users) == 0:
        return user_mean

    # get similarity scores for the target user with users who rated the movie
    user_similarities = user_similarity.loc[user_id, rated_users]

    # get top-n most similar users
    top_n_users = user_similarities.nlargest(n).index

    # calculate similarity-weighted average of the neighbours' ratings
    numerator = 0
    denominator = 0

    for similar_user in top_n_users:
        if similar_user != user_id:
            similarity = user_similarities[similar_user]
            rating = movie_ratings[similar_user]

            numerator += similarity * rating
            denominator += abs(similarity)

    if denominator == 0:
        return user_mean

    return numerator / denominator
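
In formula form, the prediction above is a similarity-weighted average. The notation is introduced here for convenience: $N_n(u, i)$ stands for the (up to) $n$ users most similar to $u$ among those who rated movie $i$, with $u$ itself left out.

```latex
\[
\hat{r}_{u,i} =
\frac{\sum_{v \in N_n(u,i)} \mathrm{sim}(u, v)\, r_{v,i}}
     {\sum_{v \in N_n(u,i)} \left| \mathrm{sim}(u, v) \right|}
\]
```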

In [93]:
# example
user_id = 1
movie_id = 105
predicted_rating = user_based_predict(user_id=user_id, movie_id=movie_id)
title = df.loc[df['id'] == movie_id, 'original_title'].values
title_str = title[0] if len(title) > 0 else '[not found]'
print(f"Predicted rating for User {user_id}, movie title: {title_str}: {predicted_rating:.2f}")

Predicted rating for User 1, movie title: Back to the Future: 3.23


**Item-based collaborative filtering** focuses on relationships between items rather than users, recommending items similar to those a user has previously rated highly. It calculates similarity between items based on how users have rated them, then predicts ratings for new items based on the user's ratings of similar items. This method tends to be more stable over time since item relationships change less frequently than user preferences.

In [94]:
# transpose matrix to get items as rows
item_matrix = user_item_matrix.T
# convert to sparse matrix
sparse_matrix = csr_matrix(item_matrix.values)

# compute cosine similarity
item_similarity = cosine_similarity(sparse_matrix)
item_similarity = pd.DataFrame(
    item_similarity,
    index=item_matrix.index,
    columns=item_matrix.index
)

In [95]:
def item_based_predict(user_id, movie_id, n=50):
    if user_id not in user_item_matrix.index:
        return ratings['rating'].mean()  # global average of observed ratings

    # get movies rated by this user (zeros mark missing ratings)
    user_ratings = user_item_matrix.loc[user_id]
    rated_movies = user_ratings[user_ratings > 0].index
    user_mean = user_ratings[rated_movies].mean()  # user average over rated movies

    if movie_id not in user_item_matrix.columns:
        return user_mean

    if len(rated_movies) == 0:
        movie_ratings = user_item_matrix[movie_id]
        return movie_ratings[movie_ratings > 0].mean()  # movie average

    # get similarity scores for the target movie with movies rated by user
    movie_similarities = item_similarity.loc[movie_id, rated_movies]

    # get top-n most similar movies
    top_n_movies = movie_similarities.nlargest(n).index

    # calculate similarity-weighted average of the user's own ratings
    numerator = 0
    denominator = 0

    for similar_movie in top_n_movies:
        if similar_movie != movie_id:
            similarity = movie_similarities[similar_movie]
            rating = user_ratings[similar_movie]

            numerator += similarity * rating
            denominator += abs(similarity)

    if denominator == 0:
        return user_mean

    return numerator / denominator
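
The same idea on a tiny, made-up rating matrix may be easier to follow. This is a vectorized sketch under simplifying assumptions (no top-n cut-off, all names and numbers invented), not a replacement for the function above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings: rows are users, columns are movies, 0 means "not rated"
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 4.0, 0.0],
    [1.0, 0.0, 2.0, 5.0],
    [0.0, 4.0, 5.0, 2.0],
])

item_sim = cosine_similarity(R.T)     # movie-by-movie similarity

user, target = 0, 2                   # predict user 0's rating for movie 2
rated = np.flatnonzero(R[user] > 0)   # movies this user has already rated
weights = item_sim[target, rated]     # similarity of the target to each of them

# Similarity-weighted average of the user's own ratings
prediction = weights @ R[user, rated] / np.abs(weights).sum()
print(round(float(prediction), 2))
```

Because the weights are non-negative here, the prediction always lies between the user's lowest and highest rating.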

In [96]:
# example
user_id = 1
movie_id = 105
predicted_rating = item_based_predict(user_id=user_id, movie_id=movie_id)
title = df.loc[df['id'] == movie_id, 'original_title'].values
title_str = title[0] if len(title) > 0 else '[not found]'
print(f"Predicted rating for User {user_id}, movie title: {title_str}: {predicted_rating:.2f}")

Predicted rating for User 1, movie title: Back to the Future: 2.77


**Singular Value Decomposition (SVD)** is a matrix factorization technique that decomposes the user-item rating matrix into lower-dimensional matrices representing **latent factors**. It discovers hidden patterns in user preferences and item characteristics, such as genre preferences or quality factors, that aren't explicitly stated.

In [97]:
# number of latent factors to extract
# higher values retain more information but increase complexity
n_components = 200

# replace 0s with nan for proper handling
# (avoid treating missing ratings as actual zeros)
matrix_for_svd = user_item_matrix.replace(0, np.nan)

# mean-center each user's ratings; missing entries become 0 (i.e. the user's mean)
user_means = matrix_for_svd.mean(axis=1)
matrix_filled = matrix_for_svd.sub(user_means, axis=0).fillna(0)

# apply svd
svd_model = TruncatedSVD(n_components=n_components, random_state=42)
user_factors = svd_model.fit_transform(matrix_filled)
item_factors = svd_model.components_.T
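
One way to sanity-check a choice of `n_components` is to look at the cumulative explained variance. The sketch below does this on a synthetic matrix with three hidden factors. The data, the 95% threshold and the variable names are all assumptions for illustration; only `TruncatedSVD` and its `explained_variance_ratio_` attribute come from scikit-learn.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)

# Synthetic "mean-centered" matrix: 3 hidden factors plus a little noise
toy = (rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))
       + 0.05 * rng.normal(size=(60, 40)))

svd = TruncatedSVD(n_components=10, random_state=42)
svd.fit(toy)

# Cumulative share of variance captured by the first k components
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest k whose cumulative share reaches 95% (threshold chosen arbitrarily)
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k, cumulative[:4].round(3))
```

On the real matrix the curve will rise far more slowly, so the same check gives a feel for how much information 200 components retain.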

In [98]:
def svd_predict(user_id, movie_id):
    if user_id not in user_item_matrix.index:
        return ratings['rating'].mean()  # global average of observed ratings

    if movie_id not in user_item_matrix.columns:
        return user_means[user_id]  # user average of observed ratings

    user_idx = user_item_matrix.index.get_loc(user_id)
    movie_idx = user_item_matrix.columns.get_loc(movie_id)

    # predict rating
    predicted_rating = np.dot(user_factors[user_idx], item_factors[movie_idx])
    predicted_rating += user_means[user_id]

    # clip to valid rating range (assuming 0.5-5.0)
    return np.clip(predicted_rating, 0.5, 5.0)
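
Written out, the prediction adds the user's mean back to the dot product of the two factor vectors. The symbols are introduced here: $p_u$ is the row of `user_factors` for user $u$, $q_i$ the row of `item_factors` for movie $i$, and $\mu_u$ the user's mean rating.

```latex
\[
\hat{r}_{u,i} = \operatorname{clip}\!\left( \mu_u + p_u^{\top} q_i,\; 0.5,\; 5.0 \right)
\]
```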

In [99]:
# example
user_id = 1
movie_id = 105
predicted_rating = svd_predict(user_id=user_id, movie_id=movie_id)
title = df.loc[df['id'] == movie_id, 'original_title'].values
title_str = title[0] if len(title) > 0 else '[not found]'
print(f"Predicted rating for User {user_id}, movie title: {title_str}: {predicted_rating:.2f}")

Predicted rating for User 1, movie title: Back to the Future: 2.84


Now let's make a function that generates personalized movie recommendations for a given user by predicting ratings for unrated movies using one of our three collaborative filtering methods: **user-based**, **item-based** or **SVD**.

In [100]:
def get_user_recommendations(user_id, df=df, n_recommendations=10, method='svd'):
    if user_id not in user_item_matrix.index:
        print(f"User {user_id} not found in the dataset")
        return None

    # get movies not yet rated by the user
    user_ratings = user_item_matrix.loc[user_id]
    unrated_movies = user_ratings[user_ratings == 0].index

    # predict ratings for unrated movies
    predictions = []
    for movie_id in unrated_movies:
        if method == 'user_based':
            pred_rating = user_based_predict(user_id, movie_id)
        elif method == 'item_based':
            pred_rating = item_based_predict(user_id, movie_id)
        elif method == 'svd':
            pred_rating = svd_predict(user_id, movie_id)
        else:
            raise ValueError(f"Unknown method '{method}'.")

        predictions.append((movie_id, pred_rating))

    # sort by predicted rating and get top n
    predictions.sort(key=lambda x: x[1], reverse=True)
    top_predictions = predictions[:n_recommendations]

    # get movie details
    recommendations = []
    for movie_id, pred_rating in top_predictions:
        movie_info = df[df['id'] == movie_id]
        if not movie_info.empty:
            recommendations.append({
                'movie_id': movie_id,
                'title': movie_info['title'].iloc[0],
                'predicted_rating': round(pred_rating, 2)
            })

    return pd.DataFrame(recommendations)

To test the recommendations, let's first look at some top-rated movies by an example user.

In [101]:
merged = ratings.merge(df[['id', 'title']], left_on='movieId', right_on='id', how='left')

user_5_ratings = merged[merged['userId'] == 5].sort_values(by='rating', ascending=False)

user_5_ratings[['userId', 'movieId', 'title', 'rating']].head(10)

| | userId | movieId | title | rating |
|---|---|---|---|---|
| 158 | 5 | 597 | Titanic | 5.0 |
| 149 | 5 | 277 | Underworld | 4.5 |
| 154 | 5 | 500 | Reservoir Dogs | 4.5 |
| 168 | 5 | 1923 | Twin Peaks: Fire Walk with Me | 4.5 |
| 179 | 5 | 4995 | Boogie Nights | 4.5 |
| 164 | 5 | 1485 | Get Carter | 4.5 |
| 145 | 5 | 104 | Run Lola Run | 4.0 |
| 147 | 5 | 150 | 48 Hrs. | 4.0 |
| 146 | 5 | 141 | Donnie Darko | 4.0 |
| 153 | 5 | 440 | Aliens vs Predator: Requiem | 4.0 |


Now, let's check the recommendations for this user.

**Item-Based Recommendations**

In [102]:
get_user_recommendations(user_id=5, method='item_based')

| | movie_id | title | predicted_rating |
|---|---|---|---|
| 0 | 1878 | Fear and Loathing in Las Vegas | 4.5 |
| 1 | 201 | Star Trek: Nemesis | 4.25 |
| 2 | 243 | High Fidelity | 4.25 |
| 3 | 343 | Harold and Maude | 4.25 |
| 4 | 96 | Beverly Hills Cop II | 4.23 |
| 5 | 4347 | Atonement | 4.22 |
| 6 | 6077 | 13 Tzameti | 4.15 |
| 7 | 6644 | El Dorado | 4.15 |
| 8 | 568 | Apollo 13 | 4.14 |
| 9 | 756 | Fantasia | 4.12 |


**User-Based Recommendations**

In [103]:
get_user_recommendations(user_id=5, method='user_based')

| | movie_id | title | predicted_rating |
|---|---|---|---|
| 0 | 178 | Blown Away | 5.0 |
| 1 | 1859 | Ninotchka | 5.0 |
| 2 | 3112 | The Night of the Hunter | 5.0 |
| 3 | 183 | The Wizard | 5.0 |
| 4 | 301 | Rio Bravo | 5.0 |
| 5 | 309 | The Celebration | 5.0 |
| 6 | 702 | A Streetcar Named Desire | 5.0 |
| 7 | 759 | Gentlemen Prefer Blondes | 5.0 |
| 8 | 764 | The Evil Dead | 5.0 |
| 9 | 845 | Strangers on a Train | 5.0 |


**SVD Recommendations**

In [104]:
get_user_recommendations(user_id=5, method='svd')

| | movie_id | title | predicted_rating |
|---|---|---|---|
| 0 | 1641 | Forces of Nature | 4.0 |
| 1 | 671 | Harry Potter and the Philosopher's Stone | 3.99 |
| 2 | 1909 | Don Juan DeMarco | 3.97 |
| 3 | 4104 | Benny & Joon | 3.96 |
| 4 | 19 | Metropolis | 3.96 |
| 5 | 455 | Bend It Like Beckham | 3.96 |
| 6 | 8873 | Wayne's World 2 | 3.95 |
| 7 | 3101 | I Love You to Death | 3.95 |
| 8 | 1253 | Breaking and Entering | 3.94 |
| 9 | 674 | Harry Potter and the Goblet of Fire | 3.93 |


# Hybrid Recommender

We have successfully built both **content-based and collaborative filtering recommender systems**.

But what if we combined them?

This is actually a common practice, known as **hybrid recommendation**, which aims to leverage the strengths of both approaches to improve overall performance.

In [105]:
def hybrid_recommendation(user_id, title, n_recommendations=10, n_similar=50,
                          cosine_sim=cosine_sim2, indices=indices2):
    # check if title exists in the dataset
    if title not in indices:
        raise ValueError(f"Title '{title}' not found in the dataset.")

    # check if user exists in the system
    if user_id not in user_item_matrix.index:
        print(f"User {user_id} not found in the dataset")
        return None

    # get the index of the movie that matches the title
    idx = indices[title]

    # get similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))

    # sort by similarity and get top n_similar movies (excluding the input movie itself)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:n_similar+1]  # skip the first one (itself)

    # get movie indices
    movie_indices = [i[0] for i in sim_scores]

    # get movie details for similar movies
    similar_movies = df.iloc[movie_indices][['id', 'title']].copy()

    # predict ratings using SVD for each similar movie
    predicted_ratings = []
    for movie_id in similar_movies['id']:
        pred_rating = svd_predict(user_id, movie_id)
        predicted_ratings.append(pred_rating)

    similar_movies['predicted_rating'] = predicted_ratings

    # sort by predicted rating and return top n_recommendations
    recommendations = similar_movies.sort_values('predicted_rating', ascending=False)

    return recommendations.head(n_recommendations)[['title', 'predicted_rating']]
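
Stripped of the pandas details, the function is a two-stage pipeline: shortlist by content similarity, then re-rank the shortlist by predicted rating. The sketch below shows that pattern on made-up numbers; `rerank_shortlist` and both dictionaries are hypothetical and exist only for this illustration.

```python
def rerank_shortlist(similarities, predict_rating, n_similar=3, n_recommendations=2):
    """Hypothetical helper: shortlist by similarity, then re-rank by predicted rating."""
    # Stage 1: keep the n_similar candidates closest to the query movie
    shortlist = sorted(similarities, key=similarities.get, reverse=True)[:n_similar]
    # Stage 2: order the shortlist by the user's predicted rating
    ranked = sorted(shortlist, key=predict_rating, reverse=True)
    return ranked[:n_recommendations]

# Toy inputs (made-up numbers, not taken from the dataset)
toy_similarities = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1}
toy_predictions = {"A": 3.1, "B": 4.2, "C": 3.8, "D": 5.0}

result = rerank_shortlist(toy_similarities, toy_predictions.get)
print(result)  # ['B', 'C']
```

Movie `D` has the highest predicted rating but never reaches the second stage, because it is not similar enough to the query. That is the design trade-off: `n_similar` controls how far personalization is allowed to stray from the selected film.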

In [106]:
user_id = 1
movie_title = 'The Dark Knight'
hybrid_recommendation(
    user_id=user_id,
    title=movie_title
)

| | title | predicted_rating |
|---|---|---|
| 1753 | American Psycho | 2.865165 |
| 4245 | The Prestige | 2.848843 |
| 704 | Batman Returns | 2.845028 |
| 2515 | Insomnia | 2.842477 |
| 78 | Batman Forever | 2.840983 |
| 5300 | Harry Brown | 2.834365 |
| 765 | Batman & Robin | 2.832951 |
| 4032 | Yamakasi | 2.83233 |
| 2063 | Memento | 2.832279 |
| 3866 | Batman Begins | 2.806474 |


In [107]:
user_id = 100
movie_title = 'The Dark Knight'
hybrid_recommendation(
    user_id=user_id,
    title=movie_title
)

| | title | predicted_rating |
|---|---|---|
| 3866 | Batman Begins | 3.420983 |
| 765 | Batman & Robin | 3.387593 |
| 4245 | The Prestige | 3.385371 |
| 704 | Batman Returns | 3.361013 |
| 2063 | Memento | 3.358161 |
| 4032 | Yamakasi | 3.358068 |
| 78 | Batman Forever | 3.357132 |
| 5300 | Harry Brown | 3.354106 |
| 2515 | Insomnia | 3.352274 |
| 1753 | American Psycho | 3.316587 |


Although the input film is the same, we can see that the recommendations differ due to the influence of collaborative filtering. As a result, the suggestions are not only closely related to the selected film, but also well personalized to the user.

# Conclusion

We successfully implemented multiple recommendation approaches: **popularity-based** for cold-start users, **content-based** for item similarity, **collaborative filtering** (user-based, item-based, SVD) for personalized recommendations, and **hybrid systems** that combine these methods. Each approach has distinct strengths - popularity handles new users, content-based provides explainable recommendations, collaborative captures user preferences, and hybrid systems balance diversity with personalization by leveraging multiple strategies. The choice between these methods depends on data availability, user base characteristics, and specific business requirements.

I'd also like to mention an excellent notebook that was very helpful in my learning: [Kaggle](https://www.kaggle.com/code/rounakbanik/movie-recommender-systems)