# MRS

Here I'll build a content-based movie recommender, continuing from the previous notebook (skill_showcase.ipynb).

data source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

This notebook will consist of two parts:

1. Preparing data for the recommender

2. Content-based recommender itself

# 1. Preparing Data

In [1]:
import pandas as pd
from ast import literal_eval

In [2]:
df = pd.read_csv('data/movies_metadata.csv')
# Transpose for easier exploration of this dataset with many cols
df.head(3).transpose()



Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

## Filtering out bad/barely known movies

In [4]:
# Calculate the count of films that have vote_average < 4 or vote_count < 40
len(df[(df['vote_average'] < 4) | (df['vote_count'] < 40)])

35109

As you can see, most of the dataset (35,109 of 45,466 movies) is poorly rated or barely known. Therefore, to speed up calculations, let's remove these movies.

In [5]:
df = df[(df['vote_average'] >= 4) & (df['vote_count'] >= 40)]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10351 entries, 0 to 45441
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  10351 non-null  object 
 1   belongs_to_collection  2135 non-null   object 
 2   budget                 10351 non-null  object 
 3   genres                 10351 non-null  object 
 4   homepage               3001 non-null   object 
 5   id                     10351 non-null  object 
 6   imdb_id                10350 non-null  object 
 7   original_language      10351 non-null  object 
 8   original_title         10351 non-null  object 
 9   overview               10299 non-null  object 
 10  popularity             10351 non-null  object 
 11  poster_path            10351 non-null  object 
 12  production_companies   10351 non-null  object 
 13  production_countries   10351 non-null  object 
 14  release_date           10349 non-null  object 
 15  revenue

## Dropping unneeded columns

In [6]:
df["adult"].value_counts()

adult
False    10351
Name: count, dtype: int64

In [7]:
df["video"].value_counts()

video
False    10348
True         3
Name: count, dtype: int64

In [8]:
df["status"].value_counts()

status
Released           10335
Post Production        6
Rumored                4
In Production          4
Planned                1
Name: count, dtype: int64

The columns 'adult', 'status' and 'video' have predominantly one value, so let's remove them. Also, let's remove 'poster_path', 'homepage' (too many null values), 'imdb_id', 'spoken_languages', 'overview' and 'tagline'.

Apart from this, let's drop the columns that aren't of much use for the recommender: 'original_title', 'original_language', 'production_companies' and 'production_countries'.
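
The "predominantly one value" check can also be done for all columns at once by computing the share of each column's most frequent value. Below is a minimal sketch on a toy frame; `dominant_share` and the 0.85 threshold are hypothetical choices for illustration, not part of the original notebook.

```python
import pandas as pd

def dominant_share(series):
    """Return the fraction of rows taken by the most frequent value."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Toy data standing in for the real DataFrame
toy = pd.DataFrame({
    "adult": ["False"] * 10,
    "status": ["Released"] * 9 + ["Rumored"],
    "title": [f"Movie {i}" for i in range(10)],
})

shares = {col: dominant_share(toy[col]) for col in toy.columns}
# Columns where a single value covers at least 85% of the rows
near_constant = [col for col, share in shares.items() if share >= 0.85]
print(near_constant)
```

Such columns carry almost no information for distinguishing movies, which is why dropping them is reasonable.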

In [9]:
df = df.drop(
    [
        "adult",
        "status",
        "video",
        "poster_path",
        "original_title",
        "homepage",
        "imdb_id",
        "spoken_languages",
        "overview",
        "tagline",
        "original_language",
        "production_companies",
        "production_countries"
    ],
    axis=1,
)
df.head(3).transpose()

Unnamed: 0,0,1,2
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
id,862,8844,15602
popularity,21.946943,17.015539,11.7129
release_date,1995-10-30,1995-12-15,1995-12-22
revenue,373554033.0,262797249.0,0.0
runtime,81.0,104.0,101.0
title,Toy Story,Jumanji,Grumpier Old Men
vote_average,7.7,6.9,6.5


Now let's have a look at dtypes

## Converting dtypes to more appropriate ones

In [10]:
df.dtypes

belongs_to_collection     object
budget                    object
genres                    object
id                        object
popularity                object
release_date              object
revenue                  float64
runtime                  float64
title                     object
vote_average             float64
vote_count               float64
dtype: object

First of all, let's handle 'release_date' column

In [11]:
# Convert 'release_date' column to datetime type
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
# Count the number of rows with bad date values
bad_date_count = df['release_date'].isnull().sum()
print(f"Number of rows with bad date values: {bad_date_count}")

Number of rows with bad date values: 2


Since 2 rows out of 10,351 is negligible, we can safely remove them

In [12]:
# Remove rows with null or NaT values
df = df.dropna(subset=['release_date'])
bad_date_count = df['release_date'].isnull().sum()
print(f"Number of rows with bad date values: {bad_date_count}")

Number of rows with bad date values: 0


The column 'budget' contains non-numerical values like '/ff9qCepilowshEtG2GYWwzt2bs4.jpg'. Let's remove them

In [13]:
# Clean 'budget' column to remove non-numeric characters
df["budget"] = df["budget"].str.replace(r"\D", "", regex=True)
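
One caveat: stripping non-digit characters from a stray value such as the poster path above still leaves its digits behind, so if such a row were present it would end up with a small bogus budget instead of a missing one. A hedged alternative, shown here on a toy Series rather than the real column, is `pd.to_numeric` with `errors="coerce"`, which turns anything non-numeric into NaN.

```python
import pandas as pd

# Toy values: two valid budgets and one misplaced poster path
raw = pd.Series(["30000000", "/ff9qCepilowshEtG2GYWwzt2bs4.jpg", "0"])

# Approach used above: delete every non-digit character
stripped = raw.str.replace(r"\D", "", regex=True)

# Alternative: values that are not numbers become NaN
coerced = pd.to_numeric(raw, errors="coerce")

print(stripped.tolist())
print(coerced.tolist())
```

The NaN values can then be filled with 0 or dropped, depending on how missing budgets should be treated.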

I don't like that columns with whole numbers like 'runtime' or 'vote_count' have dtype set to float. Let's change that

In [14]:
# Specify columns and their new data types
dict_columns_to_convert = {
    "budget": "int64",
    "revenue": "int64",
    "runtime": "int",
    "vote_count": "int",
    "popularity": "float",
    "id": "int"
}
# Fill NaN values with 0 so that the integer casts below don't fail
cols_to_fill = list(dict_columns_to_convert.keys())
df[cols_to_fill] = df[cols_to_fill].fillna(0)
# Convert columns to the specified types
df = df.astype(dict_columns_to_convert)
# Check the data types of the DataFrame
print(df.dtypes)

belongs_to_collection            object
budget                            int64
genres                           object
id                                int32
popularity                      float64
release_date             datetime64[ns]
revenue                           int64
runtime                           int32
title                            object
vote_average                    float64
vote_count                        int32
dtype: object


## Handling of 'budget', 'revenue', and 'popularity' columns

The 'budget' and 'revenue' columns hold very large values, while the 'popularity' column has too many digits after the decimal point. Let's express the first two in millions and round all three to 2 decimal places

In [15]:
# Divide 'budget' and 'revenue' columns by million and round to 2 decimal places
df['budget'] = (df['budget'] / 1000000).round(2)
df['revenue'] = (df['revenue'] / 1000000).round(2)

# Round 'popularity' column to 2 decimal places
df['popularity'] = df['popularity'].round(2)
df.head(3)

Unnamed: 0,belongs_to_collection,budget,genres,id,popularity,release_date,revenue,runtime,title,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,21.95,1995-10-30,373.55,81,Toy Story,7.7,5415
1,,65.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,17.02,1995-12-15,262.8,104,Jumanji,6.9,2413
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,11.71,1995-12-22,0.0,101,Grumpier Old Men,6.5,92


## Working with 'belongs_to_collection' column

In [16]:
def extract_franchise_name(x):
    try:
        # Use literal_eval to safely evaluate the string as a Python dictionary
        # Extract the 'name' value from the dictionary
        return literal_eval(x)["name"]
    except (ValueError, TypeError):
        return None

# Apply the extract_franchise_name function to each value in the 'belongs_to_collection' column
df["franchise"] = df["belongs_to_collection"].apply(extract_franchise_name).str.strip()
# Remove the word 'Collection' (or 'collection') from the end of each franchise name
df["franchise"] = df["franchise"].str.replace(r"[Cc]ollection$", "", regex=True)
# Remove leading and trailing whitespace
df["franchise"] = df["franchise"].str.strip()
df = df.drop(["belongs_to_collection"], axis=1)

In [17]:
df["franchise"].value_counts().head()

franchise
James Bond               26
Dragon Ball Z (Movie)    15
Pokémon                  12
Friday the 13th          12
Fantozzi                 10
Name: count, dtype: int64

## Working with 'genres' column

In [18]:
# Convert the stringified list of dictionaries into an actual list.
# literal_eval parses the single-quoted representation directly, so no quote replacement is needed
df["genres"] = df["genres"].apply(
    lambda x: literal_eval(x) if isinstance(x, str) else []
)
# Extract the names of genres into a list and sort them alphabetically
df["genres"] = df["genres"].apply(
    lambda x: sorted([genre["name"] for genre in x]) if isinstance(x, list) else []
)
# Display the DataFrame with the extracted genre names
df[["title", "genres"]].head(3)

Unnamed: 0,title,genres
0,Toy Story,"[Animation, Comedy, Family]"
1,Jumanji,"[Adventure, Family, Fantasy]"
2,Grumpier Old Men,"[Comedy, Romance]"


In [19]:
# Flatten the list of genre names
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
# Get the unique genre names
unique_genre_names = set(flat_genre_names)
# Print the unique genre names
print(f"There are {len(unique_genre_names)} unique genres.")
print(unique_genre_names)

There are 20 unique genres.
{'Crime', 'Adventure', 'Mystery', 'Science Fiction', 'Foreign', 'Action', 'Animation', 'Documentary', 'History', 'TV Movie', 'Music', 'Drama', 'Romance', 'Western', 'Fantasy', 'Horror', 'War', 'Thriller', 'Comedy', 'Family'}


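In the raw dataset, the 'genres' column has faulty data like 'Carousel Productions' or 'Vision View Entertainment', which sound like production companies, not genres. None of them show up in the output above, so the earlier filtering has apparently removed those rows, but to be safe let's keep only an explicit list of valid genres. Note that this list also leaves out 'Foreign', 'Music' and 'TV Movie'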

In [20]:
# Define the list of valid genre names
valid_genres = {
    'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
    'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Mystery',
    'Romance', 'Science Fiction', 'Thriller', 'War', 'Western'
}
# Filter the 'genres' column to include only the valid genres
df["genres"] = df["genres"].apply(lambda x: [genre for genre in x if genre in valid_genres])

Now let's check again

In [21]:
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
unique_genre_names = set(flat_genre_names)
print(f"There are {len(unique_genre_names)} unique genres.")
print(unique_genre_names)

There are 17 unique genres.
{'War', 'Science Fiction', 'Crime', 'Animation', 'Mystery', 'Thriller', 'Adventure', 'Drama', 'Horror', 'Comedy', 'Documentary', 'Romance', 'Western', 'History', 'Fantasy', 'Family', 'Action'}


In [22]:
df["genres"].value_counts().head(7)

genres
[Comedy]                    808
[Drama]                     785
[Comedy, Drama]             423
[Drama, Romance]            398
[Comedy, Drama, Romance]    355
[Comedy, Romance]           298
[Horror, Thriller]          262
Name: count, dtype: int64

One movie can belong to many genres and one genre can be applied to many movies. It's a many-to-many relationship. Ideally, this kind of relationship is broken into two 1:M relationships connected through an intermediate (junction) table. However, because

- it's a project to show my knowledge mainly of writing SQL queries
- I'm applying to a junior data analyst position, and, at that role, you're not supposed to design databases
- preparation part is already too long
- maximum string length for genres is known (80 symbols for the movie with the title 'Yu-Gi-Oh')

I'll keep things simple and connect genre names by comma.
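
For reference, the normalized layout described above could be sketched roughly like this. It is only an illustration on toy data: the table names `genre_table` and `movie_genre` and the surrogate `genre_id` key are assumptions, and the notebook itself does not build these tables.

```python
import pandas as pd

# Toy movies table with genres still stored as lists
movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story", "Grumpier Old Men"],
    "genres": [["Animation", "Comedy", "Family"], ["Comedy", "Romance"]],
})

# One row per (movie, genre) pair
pairs = (
    movies[["movie_id", "genres"]]
    .explode("genres")
    .rename(columns={"genres": "genre"})
)

# Genre lookup table with a surrogate key
genre_table = pd.DataFrame({"genre": sorted(pairs["genre"].unique())})
genre_table.insert(0, "genre_id", range(1, len(genre_table) + 1))

# Junction table linking movies to genres
movie_genre = pairs.merge(genre_table, on="genre")[["movie_id", "genre_id"]]
print(movie_genre)
```

Each movie then appears once in the movies table, each genre once in `genre_table`, and `movie_genre` holds the links between them.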

In [23]:
# Convert the list of genres into a string with comma as a delimiter
df["genres"] = df["genres"].apply(lambda x: ", ".join(x) if x else None)

In [24]:
df["genres"].value_counts().head(7)

genres
Comedy                    808
Drama                     785
Comedy, Drama             423
Drama, Romance            398
Comedy, Drama, Romance    355
Comedy, Romance           298
Horror, Thriller          262
Name: count, dtype: int64

In [25]:
df.head().transpose()

Unnamed: 0,0,1,2,4,5
budget,30.0,65.0,0.0,0.0,60.0
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance",Comedy,"Action, Crime, Drama, Thriller"
id,862,8844,15602,11862,949
popularity,21.95,17.02,11.71,8.39,17.92
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00,1995-02-10 00:00:00,1995-12-15 00:00:00
revenue,373.55,262.8,0.0,76.58,187.44
runtime,81,104,101,106,170
title,Toy Story,Jumanji,Grumpier Old Men,Father of the Bride Part II,Heat
vote_average,7.7,6.9,6.5,5.7,7.7
vote_count,5415,2413,92,173,1886


Time to rearrange columns a little bit because I'm not happy with the order of columns

## Changing column order

In [26]:
new_cols_order = [
    "id",
    "title",
    "franchise",
    "release_date",
    "runtime",
    "genres",
    "budget",
    "revenue",
    "popularity",
    "vote_average",
    "vote_count"
]
df = df[new_cols_order]
df.head(3).transpose()

Unnamed: 0,0,1,2
id,862,8844,15602
title,Toy Story,Jumanji,Grumpier Old Men
franchise,Toy Story,,Grumpy Old Men
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00
runtime,81,104,101
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance"
budget,30.0,65.0,0.0
revenue,373.55,262.8,0.0
popularity,21.95,17.02,11.71
vote_average,7.7,6.9,6.5


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10349 entries, 0 to 45441
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            10349 non-null  int32         
 1   title         10349 non-null  object        
 2   franchise     2135 non-null   object        
 3   release_date  10349 non-null  datetime64[ns]
 4   runtime       10349 non-null  int32         
 5   genres        10332 non-null  object        
 6   budget        10349 non-null  float64       
 7   revenue       10349 non-null  float64       
 8   popularity    10349 non-null  float64       
 9   vote_average  10349 non-null  float64       
 10  vote_count    10349 non-null  int32         
dtypes: datetime64[ns](1), float64(4), int32(3), object(3)
memory usage: 848.9+ KB


## Adding data from other two datasets

In [28]:
credits = pd.read_csv('data/credits.csv')
keywords = pd.read_csv('data/keywords.csv')

In [29]:
df = df.merge(credits, on='id')
df = df.merge(keywords, on='id')
df.head(3)

Unnamed: 0,id,title,franchise,release_date,runtime,genres,budget,revenue,popularity,vote_average,vote_count,cast,crew,keywords
0,862,Toy Story,Toy Story,1995-10-30,81,"Animation, Comedy, Family",30.0,373.55,21.95,7.7,5415,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,,1995-12-15,104,"Adventure, Family, Fantasy",65.0,262.8,17.02,6.9,2413,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,Grumpy Old Men,1995-12-22,101,"Comedy, Romance",0.0,0.0,11.71,6.5,92,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10513 entries, 0 to 10512
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            10513 non-null  int32         
 1   title         10513 non-null  object        
 2   franchise     2157 non-null   object        
 3   release_date  10513 non-null  datetime64[ns]
 4   runtime       10513 non-null  int32         
 5   genres        10495 non-null  object        
 6   budget        10513 non-null  float64       
 7   revenue       10513 non-null  float64       
 8   popularity    10513 non-null  float64       
 9   vote_average  10513 non-null  float64       
 10  vote_count    10513 non-null  int32         
 11  cast          10513 non-null  object        
 12  crew          10513 non-null  object        
 13  keywords      10513 non-null  object        
dtypes: datetime64[ns](1), float64(4), int32(3), object(6)
memory usage: 1.0+ MB
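
Note that the row count grew from 10,349 to 10,513 after the merges. An inner merge can only add rows when a key repeats on the right-hand side, so some `id` values presumably occur more than once in credits.csv or keywords.csv (this is inferred from the counts, not checked above). A minimal sketch, on toy frames, of how one might detect and drop such duplicates before merging:

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
credits_toy = pd.DataFrame({"id": [1, 1, 2], "cast": ["x", "x", "y"]})

# A repeated key on the right side multiplies the matching left rows
naive = movies.merge(credits_toy, on="id")

# Count repeated keys, then keep the first row per id
n_dupes = int(credits_toy["id"].duplicated().sum())
deduped = credits_toy.drop_duplicates(subset="id")

# validate="one_to_one" raises an error if keys are not unique on both sides
clean = movies.merge(deduped, on="id", validate="one_to_one")
print(len(naive), n_dupes, len(clean))
```

Whether to deduplicate here is a judgment call: doing so would change the row counts shown in the rest of the notebook.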


## Extracting director

In [31]:
def extract_director(crew_list):
    for crew_member in crew_list:
        if crew_member["job"] == "Director":
            return crew_member["name"]
    return None

# For the extract_director function to work, convert the string representations to actual lists of dictionaries
df["crew"] = df["crew"].apply(literal_eval)
# Extract the director's name for each movie
df["director"] = df["crew"].apply(extract_director)

In [32]:
df["director"].value_counts().head(10)

director
Woody Allen         47
Alfred Hitchcock    36
Clint Eastwood      33
Steven Spielberg    31
Martin Scorsese     25
Ron Howard          24
Ridley Scott        23
Joel Schumacher     20
Brian De Palma      20
Tim Burton          20
Name: count, dtype: int64

In [33]:
df[df["director"] == "Martin Scorsese"][["title", "release_date", "runtime", "genres", "director"]].head(10)

Unnamed: 0,title,release_date,runtime,genres,director
14,Casino,1995-11-22,178,"Crime, Drama",Martin Scorsese
70,Taxi Driver,1976-02-07,114,"Crime, Drama",Martin Scorsese
233,The Age of Innocence,1993-09-17,139,"Drama, Romance",Martin Scorsese
631,GoodFellas,1990-09-12,145,"Crime, Drama",Martin Scorsese
646,Raging Bull,1980-11-14,129,Drama,Martin Scorsese
734,Cape Fear,1991-11-15,128,"Crime, Thriller",Martin Scorsese
906,Kundun,1997-12-25,134,Drama,Martin Scorsese
1060,The Last Temptation of Christ,1988-08-12,164,Drama,Martin Scorsese
1354,The Color of Money,1986-10-07,119,Drama,Martin Scorsese
1629,Bringing Out the Dead,1999-10-22,121,Drama,Martin Scorsese


## Extracting top actors

In [34]:
def extract_actors(cast_list):
    top_actors = []
    for actor in cast_list[:3]:  # Select the top 3 actors
        top_actors.append(actor["name"])
    return ", ".join(top_actors)

# Convert the string representations to actual dictionaries
df["cast"] = df["cast"].apply(literal_eval)
# Extract the top 3 actor names for each movie
df["top_actors"] = df["cast"].apply(extract_actors)

In [35]:
df[df["title"] == "The Empire Strikes Back"][["title", "release_date", "runtime", "genres", "director", "top_actors"]]

Unnamed: 0,title,release_date,runtime,genres,director,top_actors
615,The Empire Strikes Back,1980-05-17,124,"Action, Adventure, Science Fiction",Irvin Kershner,"Mark Hamill, Harrison Ford, Carrie Fisher"


## Extracting keywords

In [36]:
from collections import Counter
import pandas as pd

# Convert the string representations to actual dictionaries
df["keywords"] = df["keywords"].apply(literal_eval)
# Flatten the list of dictionaries in the 'keywords' column
keywords = [keyword["name"] for sublist in df["keywords"] for keyword in sublist]
# Count the frequencies of each keyword
keyword_counts = Counter(keywords)
# Sort the keywords based on their frequencies in descending order
sorted_keywords = sorted(keyword_counts.items(), key=lambda x: x[1], reverse=True)
# Remove keywords that rarely occur
sorted_keywords_filtered = [(keyword, count) for keyword, count in sorted_keywords if count > 9]
# Create a set of keywords that appear in sorted_keywords_filtered
filtered_keywords_set = set([keyword for keyword, _ in sorted_keywords_filtered])
# Print the sorted and filtered keywords
for keyword, count in sorted_keywords_filtered[:10]:
    print(f"{keyword}: {count}")

woman director: 655
murder: 480
independent film: 434
based on novel: 387
duringcreditsstinger: 385
violence: 340
revenge: 259
sex: 252
police: 232
suspense: 230


In [37]:
# Print the number of unique keywords
num_unique_keywords_before = len(keyword_counts)
print(f"Number of unique keywords before filtering: {num_unique_keywords_before}")
num_unique_keywords_after = len(sorted_keywords_filtered)
print(f"Number of unique keywords after filtering: {num_unique_keywords_after}")

Number of unique keywords before filtering: 13953
Number of unique keywords after filtering: 1553


In [39]:
def filter_keywords(keywords_list):
    filtered_keywords = []
    for keyword in keywords_list:
        if keyword["name"] in filtered_keywords_set and len(filtered_keywords) < 5:
            filtered_keywords.append(keyword["name"])
            if len(filtered_keywords) == 5:
                break
    return ", ".join(filtered_keywords)

# Add a new column 'filtered_keywords' to the DataFrame
df['filtered_keywords'] = df['keywords'].apply(filter_keywords)

In [40]:
df[["keywords", "filtered_keywords"]]

Unnamed: 0,keywords,filtered_keywords
0,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","jealousy, toy, boy, friendship, friends"
1,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","disappearance, based on children's book"
2,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","fishing, best friend, duringcreditsstinger"
3,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","baby, midlife crisis, confidence, aging, daughter"
4,"[{'id': 642, 'name': 'robbery'}, {'id': 703, '...","robbery, detective, bank, obsession, chase"
...,...,...
10508,[],
10509,[],
10510,"[{'id': 2652, 'name': 'nazis'}, {'id': 3098, '...","nazis, castle, time travel"
10511,"[{'id': 9673, 'name': 'love'}, {'id': 13130, '...","love, teenager, lgbt, short"


## Final steps of data preparation with Pandas

In [41]:
# Drop columns 'id', 'cast', 'crew', and 'keywords'
df = df.drop(columns=["id", "cast", "crew", "keywords"])
# Rename 'filtered_keywords' column to 'keywords'
df = df.rename(columns={"filtered_keywords": "keywords"})
# Add the new 'id' column as the first column
df.insert(0, "id", range(1, 1 + len(df)))
df.head().transpose()

Unnamed: 0,0,1,2,3,4
id,1,2,3,4,5
title,Toy Story,Jumanji,Grumpier Old Men,Father of the Bride Part II,Heat
franchise,Toy Story,,Grumpy Old Men,Father of the Bride,
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00,1995-02-10 00:00:00,1995-12-15 00:00:00
runtime,81,104,101,106,170
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance",Comedy,"Action, Crime, Drama, Thriller"
budget,30.0,65.0,0.0,0.0,60.0
revenue,373.55,262.8,0.0,76.58,187.44
popularity,21.95,17.02,11.71,8.39,17.92
vote_average,7.7,6.9,6.5,5.7,7.7


Columns explanation:
- id - row id
- title - official title of the movie
- franchise - name of the collection (franchise) the movie belongs to, if any
- release_date - theatrical release date of the movie
- runtime - movie duration/runtime in minutes
- genres - genres associated with the movie, separated by a comma
- budget - movie budget in millions of dollars
- revenue - total movie revenue in millions of dollars
- popularity - popularity score assigned by TMDB
- vote_average - average movie rating
- vote_count - number of votes by users, counted by TMDB
- director - name of the movie director
- top_actors - names of top 3 actors in the movie
- keywords - up to 5 keywords associated with the movie

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10513 entries, 0 to 10512
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            10513 non-null  int64         
 1   title         10513 non-null  object        
 2   franchise     2157 non-null   object        
 3   release_date  10513 non-null  datetime64[ns]
 4   runtime       10513 non-null  int32         
 5   genres        10495 non-null  object        
 6   budget        10513 non-null  float64       
 7   revenue       10513 non-null  float64       
 8   popularity    10513 non-null  float64       
 9   vote_average  10513 non-null  float64       
 10  vote_count    10513 non-null  int32         
 11  director      10496 non-null  object        
 12  top_actors    10513 non-null  object        
 13  keywords      10513 non-null  object        
dtypes: datetime64[ns](1), float64(4), int32(2), int64(1), object(6)
memory usage: 1.0+ MB


Let's save the cleaned up dataset, which we'll use in the next chapters

In [43]:
# df.to_csv("data/recommender_data_v2.csv", index=False)

# 2. Content-based recommender

In [7]:
import pandas as pd

pd.set_option('display.max_colwidth', None)

In [8]:
df = pd.read_csv("data/recommender_data_v2.csv", index_col=0)
df.head(3)

Unnamed: 0_level_0,title,franchise,release_date,runtime,genres,budget,revenue,popularity,vote_average,vote_count,director,top_actors,keywords
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,Toy Story,Toy Story,1995-10-30,81,"Animation, Comedy, Family",30.0,373.55,21.95,7.7,5415,John Lasseter,"Tom Hanks, Tim Allen, Don Rickles","jealousy, toy, boy, friendship, friends"
2,Jumanji,,1995-12-15,104,"Adventure, Family, Fantasy",65.0,262.8,17.02,6.9,2413,Joe Johnston,"Robin Williams, Jonathan Hyde, Kirsten Dunst","disappearance, based on children's book"
3,Grumpier Old Men,Grumpy Old Men,1995-12-22,101,"Comedy, Romance",0.0,0.0,11.71,6.5,92,Howard Deutch,"Walter Matthau, Jack Lemmon, Ann-Margret","fishing, best friend, duringcreditsstinger"


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10513 entries, 1 to 10513
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         10513 non-null  object 
 1   franchise     2157 non-null   object 
 2   release_date  10513 non-null  object 
 3   runtime       10513 non-null  int64  
 4   genres        10495 non-null  object 
 5   budget        10513 non-null  float64
 6   revenue       10513 non-null  float64
 7   popularity    10513 non-null  float64
 8   vote_average  10513 non-null  float64
 9   vote_count    10513 non-null  int64  
 10  director      10496 non-null  object 
 11  top_actors    10475 non-null  object 
 12  keywords      9174 non-null   object 
dtypes: float64(4), int64(2), object(7)
memory usage: 1.1+ MB


## Handling movie title

The idea behind this recommender is that you enter the title of a movie that you liked, and then you get a list of similar movies. However, first of all, we need to check whether a movie with the typed title exists in the database at all. Also, spelling problems may arise.

To handle all these issues, the 'fuzzywuzzy' library comes to help

In [4]:
all_titles = df['title'].tolist()
all_titles[:10]

['Toy Story',
 'Jumanji',
 'Grumpier Old Men',
 'Father of the Bride Part II',
 'Heat',
 'Sabrina',
 'Tom and Huck',
 'Sudden Death',
 'GoldenEye',
 'The American President']

In [5]:
from fuzzywuzzy import process

def find_top_movies(title):
    # Return the 7 closest titles together with their match scores
    all_titles = df["title"].tolist()
    matches = process.extract(title, all_titles, limit=7)
    return matches

# Example usage
title = "ty stry"
found_movies = find_top_movies(title)
found_movies

[('Toy Story', 88),
 ('Toy Story 2', 78),
 ('Tokyo Story', 78),
 ('Toy Story 3', 78),
 ('The Straight Story', 77),
 ('The Greatest Story Ever Told', 77),
 ('Troy', 77)]

In [9]:
from fuzzywuzzy import process

def find_movie(title):
    all_titles = df["title"].tolist()
    closest_match = process.extractOne(title, all_titles)
    matched_title = closest_match[0]
    return df[df["title"] == matched_title]

find_movie("stare was")

Unnamed: 0_level_0,title,franchise,release_date,runtime,genres,budget,revenue,popularity,vote_average,vote_count,director,top_actors,keywords
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
153,Star Wars,Star Wars,1977-05-25,121,"Action, Adventure, Science Fiction",11.0,775.4,42.15,8.1,6778,George Lucas,"Mark Hamill, Harrison Ford, Carrie Fisher","android, rescue mission, rebellion, planet, space opera"
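
Note that `find_movie` always returns the closest title, however poor the match, so it cannot tell the user that a movie is missing from the database. One way to handle this is a similarity cutoff. The sketch below uses the standard library's `difflib` as a dependency-free stand-in for fuzzywuzzy (whose `process.extractOne` has a `score_cutoff` argument for a similar purpose); `find_title` and the 0.6 cutoff are hypothetical choices.

```python
from difflib import get_close_matches

def find_title(query, titles, cutoff=0.6):
    """Return the closest title, or None if nothing is similar enough."""
    # Compare in lowercase, but map back to the original spelling
    by_lower = {t.lower(): t for t in titles}
    hits = get_close_matches(query.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None

titles = ["Toy Story", "Jumanji", "Heat"]
print(find_title("ty stry", titles))  # a misspelled but recognizable title
print(find_title("qqqq", titles))     # nothing similar in the list
```

Returning `None` lets the caller show a "movie not found" message instead of recommending based on an unrelated title.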


## Preprocessing

In [14]:
# Repeat 'franchise' and 'director' twice to give these columns more weight
df["combined"] = (df["franchise"].fillna("") + "; " +
                  df["director"].fillna("") + "; " +
                  df["director"].fillna("") + "; " +
                  df["franchise"].fillna("") + "; " +
                  df["top_actors"].fillna("") + "; " +
                  df["genres"].fillna("") + "; " +
                  df["keywords"].fillna(""))
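
To see why repetition adds weight, here is a small sketch. It assumes the combined text is later turned into word-count vectors and compared with cosine similarity, which is an assumption on my part since that step is not shown in this section. The toy strings and the `cosine` helper are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Two movies that share only the director ("mann")
sim_once = cosine("mann pacino crime", "mann cruise thriller")
# Same movies, director repeated twice in each string
sim_twice = cosine("mann mann pacino crime", "mann mann cruise thriller")

print(round(sim_once, 3), round(sim_twice, 3))
```

With the director counted twice, the shared term contributes more to the dot product, so two movies by the same director score as more similar.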

In [15]:
df[["title", "release_date", "combined"]].head(10)

Unnamed: 0_level_0,title,release_date,combined
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Toy Story,1995-10-30,"Toy Story; John Lasseter; John Lasseter; Toy Story; Tom Hanks, Tim Allen, Don Rickles; Animation, Comedy, Family; jealousy, toy, boy, friendship, friends"
2,Jumanji,1995-12-15,"; Joe Johnston; Joe Johnston; ; Robin Williams, Jonathan Hyde, Kirsten Dunst; Adventure, Family, Fantasy; disappearance, based on children's book"
3,Grumpier Old Men,1995-12-22,"Grumpy Old Men; Howard Deutch; Howard Deutch; Grumpy Old Men; Walter Matthau, Jack Lemmon, Ann-Margret; Comedy, Romance; fishing, best friend, duringcreditsstinger"
4,Father of the Bride Part II,1995-02-10,"Father of the Bride; Charles Shyer; Charles Shyer; Father of the Bride; Steve Martin, Diane Keaton, Martin Short; Comedy; baby, midlife crisis, confidence, aging, daughter"
5,Heat,1995-12-15,"; Michael Mann; Michael Mann; ; Al Pacino, Robert De Niro, Val Kilmer; Action, Crime, Drama, Thriller; robbery, detective, bank, obsession, chase"
6,Sabrina,1995-12-15,"; Sydney Pollack; Sydney Pollack; ; Harrison Ford, Julia Ormond, Greg Kinnear; Comedy, Romance; paris, brother brother relationship, millionaire"
7,Tom and Huck,1995-12-22,"; Peter Hewitt; Peter Hewitt; ; Jonathan Taylor Thomas, Brad Renfro, Rachael Leigh Cook; Action, Adventure, Drama, Family;"
8,Sudden Death,1995-12-22,"; Peter Hyams; Peter Hyams; ; Jean-Claude Van Damme, Powers Boothe, Dorian Harewood; Action, Adventure, Thriller; terrorist, hostage, explosive"
9,GoldenEye,1995-11-16,"James Bond; Martin Campbell; Martin Campbell; James Bond; Pierce Brosnan, Sean Bean, Izabella Scorupco; Action, Adventure, Thriller; cuba, falsely accused, secret identity, computer virus, secret intelligence service"
10,The American President,1995-11-17,"; Rob Reiner; Rob Reiner; ; Michael Douglas, Annette Bening, Michael J. Fox; Comedy, Drama, Romance; white house, usa president, new love, widower"


In [16]:
import re
import nltk

def preprocess(text):
    """
    Summary:
        Preprocesses the input text by removing non-alphanumeric characters,
        converting to lowercase, tokenizing, and filtering out stopwords.
    Parameters:
        text (str): Input text to be preprocessed.
    Returns:
        str: Preprocessed text
    """
    # Handle NaN
    if not isinstance(text, str):
        return ""
    # Remove non-alphanumeric characters, convert to lowercase, and
    # strip leading/trailing whitespace. The flags are passed by keyword
    # because the fourth positional argument of re.sub is count, not flags.
    text = re.sub(r"[^0-9a-zA-Z\s]", "", text, flags=re.I | re.A).lower().strip()
    # Tokenizer from NLTK that splits on whitespace and punctuation
    wpt = nltk.WordPunctTokenizer()
    # Get the list of English stopwords from NLTK (a set makes lookups fast)
    stop_words = set(nltk.corpus.stopwords.words("english"))
    # Tokenize and filter out stopwords to create a new list of tokens
    tokens = wpt.tokenize(text)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Join the filtered tokens back into a single text string
    return " ".join(filtered_tokens)

df["preproc"] = df["combined"].apply(preprocess)
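A side note on the `re.sub` call in `preprocess`: the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag placed in the fourth position is read as `count`. The small sketch below uses a made-up sample string (not a row from the dataset) to illustrate the difference. `re.I` has the integer value 2, so in the count position it would cap the number of replacements at two.

```python
import re

pattern = r"[^0-9a-zA-Z\s]"
sample = "Jean-Pierre Melville; Crime, Drama"  # toy input, assumed for illustration

# re.I has the integer value 2, so in the count position it limits
# re.sub to the first two replacements.
as_count = re.sub(pattern, "", sample, count=int(re.I))

# Passed by keyword, the flags apply and every match is replaced.
as_flags = re.sub(pattern, "", sample, flags=re.I | re.A)

print(as_count)  # JeanPierre Melville Crime, Drama
print(as_flags)  # JeanPierre Melville Crime Drama
```

For short strings such as the `combined` column the two variants usually give the same result, which is why the issue is easy to miss.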

In [20]:
df["preproc"].value_counts().head(15)

preproc
pokmon kunihiko yuyama kunihiko yuyama pokmon veronica taylor rachael lillis maddie blaustein adventure animation family fantasy science fiction sequel pokmon                                       8
jonathan frakes jonathan frakes jesse bradford paula garcs robin thomas adventure family science fiction thriller time airplane youth                                                                8
jeanpierre melville jeanpierre melville alain delon franois prier nathalie delon crime drama thriller paris bar jazz hearing garage                                                                  8
ruben stlund ruben stlund lisa loven kongsli johannes bah kuhnke clara wettergren comedy drama female nudity dark comedy family vacation avalanche running away                                      8
ari folman ari folman robin wright harvey keitel jon hamm animation drama science fiction animation                                                                                                 

In [21]:
df["preproc"].head(10)

id
1                                                                        toy story john lasseter john lasseter toy story tom hanks tim allen rickles animation comedy family jealousy toy boy friendship friends
2                                                                               joe johnston joe johnston robin williams jonathan hyde kirsten dunst adventure family fantasy disappearance based childrens book
3                                                        grumpy old men howard deutch howard deutch grumpy old men walter matthau jack lemmon annmargret comedy romance fishing best friend duringcreditsstinger
4                                                              father bride charles shyer charles shyer father bride steve martin diane keaton martin short comedy baby midlife crisis confidence aging daughter
5                                                                               michael mann michael mann al pacino robert de niro val kilmer action crime drama 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def get_matrix(df):
    # Repeat director and franchise twice to give these columns more weight
    df["combined"] = (df["franchise"].fillna("") + "; " +
                    df["director"].fillna("") + "; " +
                    df["director"].fillna("") + "; " +
                    df["franchise"].fillna("") + "; " +
                    df["top_actors"].fillna("") + "; " +
                    df["genres"].fillna("") + "; " +
                    df["keywords"].fillna(""))
    df["preproc"] = df["combined"].apply(preprocess)
    cv = CountVectorizer()
    cv_matrix = cv.fit_transform(df["preproc"])
    return cv_matrix
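To see why repeating a column gives it more weight, here is a minimal sketch on toy data. It uses `collections.Counter` as a simplified stand-in for `CountVectorizer` (plain whitespace tokens, no other processing), and the helper names `build_text` and `cosine` are hypothetical, introduced only for this illustration.

```python
from collections import Counter
from math import sqrt

def build_text(director, genres, repeat):
    """Concatenate the director `repeat` times, followed by the genres."""
    return " ".join([director] * repeat + [genres])

def cosine(text_a, text_b):
    """Cosine similarity between the bag-of-words counts of two strings."""
    ca, cb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Toy movies: b shares only the director with a, c shares only the genres.
a = ("lucas", "action scifi")
b = ("lucas", "drama comedy")
c = ("abrams", "action scifi")

for repeat in (1, 2):
    ta, tb, tc = (build_text(d, g, repeat) for d, g in (a, b, c))
    print(repeat, round(cosine(ta, tb), 3), round(cosine(ta, tc), 3))
# 1 0.333 0.667
# 2 0.667 0.333
```

With a single mention the shared genres dominate; once the director is counted twice, the movie by the same director becomes the closer match.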


## Recommender

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv_matrix = cv.fit_transform(df["preproc"])

In [24]:
cv_matrix

<10513x16934 sparse matrix of type '<class 'numpy.int64'>'
	with 156466 stored elements in Compressed Sparse Row format>

In [24]:
cv_matrix

<15800x29604 sparse matrix of type '<class 'numpy.int64'>'
	with 304761 stored elements in Compressed Sparse Row format>

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(cv_matrix, cv_matrix)

In [26]:
cosine_sim

array([[1.        , 0.03798686, 0.03126527, ..., 0.0362977 , 0.13055824,
        0.04351941],
       [0.03798686, 1.        , 0.        , ..., 0.        , 0.05455447,
        0.        ],
       [0.03126527, 0.        , 1.        , ..., 0.03745029, 0.08980265,
        0.04490133],
       ...,
       [0.0362977 , 0.        , 0.03745029, ..., 1.        , 0.0521286 ,
        0.0521286 ],
       [0.13055824, 0.05455447, 0.08980265, ..., 0.0521286 , 1.        ,
        0.0625    ],
       [0.04351941, 0.        , 0.04490133, ..., 0.0521286 , 0.0625    ,
        1.        ]])

In [32]:
find_movie("stare was ")

id,title,franchise,release_date,runtime,genres,budget,revenue,popularity,vote_average,vote_count,director,top_actors,keywords,combined,preproc
153,Star Wars,Star Wars,1977-05-25,121,"Action, Adventure, Science Fiction",11.0,775.4,42.15,8.1,6778,George Lucas,"Mark Hamill, Harrison Ford, Carrie Fisher","android, rescue mission, rebellion, planet, space opera","Star Wars; George Lucas; George Lucas; Star Wars; Mark Hamill, Harrison Ford, Carrie Fisher; Action, Adventure, Science Fiction; android, rescue mission, rebellion, planet, space opera",star wars george lucas george lucas star wars mark hamill harrison ford carrie fisher action adventure science fiction android rescue mission rebellion planet space opera


In [33]:
# The data was prepared for an SQL database whose ids start at 1, while
# positional indexing in pandas starts at 0, so subtract 1 to get the row position.
idx = 153 - 1
n_recommendations = 7
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores = sim_scores[1:(n_recommendations+1)]
similar_movies = [i[0] for i in sim_scores]
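Subtracting 1 from the id works only while the ids are contiguous and start at 1. As a hedged alternative, the row position can be looked up from the index with `Index.get_loc`. The sketch below uses a small made-up frame (`movies`, with a gap in the ids) rather than the real `df`, and assumes the `id` index is unique.

```python
import pandas as pd

# Toy frame with a gap in the ids (3 and 4 are missing); not the real dataset.
movies = pd.DataFrame(
    {"title": ["Toy Story", "Jumanji", "Heat"]},
    index=pd.Index([1, 2, 5], name="id"),
)

movie_id = 5
pos = movies.index.get_loc(movie_id)  # row position of id 5 in the frame

print(pos)                        # 2
print(movies["title"].iloc[pos])  # Heat
# movie_id - 1 would be 4 here, which is out of bounds for .iloc
```

The resulting position can then be used the same way as `idx`, both for `df.iloc` and for selecting a row of the similarity matrix.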

In [34]:
df.iloc[idx]

title                                                                                                                                                                                          Star Wars
franchise                                                                                                                                                                                      Star Wars
release_date                                                                                                                                                                                  1977-05-25
runtime                                                                                                                                                                                              121
genres                                                                                                                                                                Action, Adventure, Science Fic

In [35]:
sim_scores

[(1439, 0.699205898780101),
 (2804, 0.6770032003863301),
 (4416, 0.6770032003863301),
 (615, 0.6060606060606062),
 (628, 0.5296408977028497),
 (8508, 0.4545454545454546),
 (10153, 0.4000473456828314)]
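The scores above come from sorting one row of the similarity matrix and dropping the first entry, on the assumption that the first entry is the movie itself. The `value_counts` output earlier shows that some `preproc` strings occur several times, so a duplicate row also scores 1.0 and may sort ahead of the query movie. A minimal sketch that excludes the query position explicitly instead (the helper name `top_similar` is hypothetical):

```python
def top_similar(sim_row, idx, n=7):
    """Return the n highest-scoring (position, score) pairs, excluding idx itself."""
    scores = [(i, s) for i, s in enumerate(sim_row) if i != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Toy similarity row for the movie at position 2; position 0 is an exact duplicate.
row = [1.0, 0.2, 1.0, 0.5]

by_slice = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)[1:3]
by_filter = top_similar(row, idx=2, n=2)

print(by_slice)   # [(2, 1.0), (3, 0.5)]  -> recommends the movie itself
print(by_filter)  # [(0, 1.0), (3, 0.5)]
```

With the real matrix this would be called as `top_similar(cosine_sim[idx], idx, n_recommendations)`.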

In [73]:
sim_scores

[(3889, 0.6776439986371079),
 (4590, 0.6382847385042256),
 (10221, 0.5984437489312764),
 (10920, 0.5556623828915214),
 (11939, 0.5172935265326568),
 (814, 0.34694433324435536),
 (8447, 0.2974059387397313)]

In [68]:
sim_scores

[(7309, 0.3806934938134405),
 (597, 0.3651483716701107),
 (5351, 0.3651483716701107),
 (606, 0.3496029493900505),
 (2472, 0.34503277967117707),
 (1200, 0.29211869733608864),
 (15346, 0.271746488194703)]

In [36]:
print(f"Because you watched {df['title'].iloc[idx]}, you might like:")
df[["title", "release_date", "genres", "popularity", "vote_average", "vote_count", "director", "top_actors", "keywords"]].iloc[similar_movies]

Because you watched Star Wars, you might like:


id,title,release_date,genres,popularity,vote_average,vote_count,director,top_actors,keywords
1440,Star Wars: Episode I - The Phantom Menace,1999-05-19,"Action, Adventure, Science Fiction",15.65,6.4,4526,George Lucas,"Liam Neeson, Ewan McGregor, Natalie Portman","prophecy, queen, space opera"
2805,Star Wars: Episode II - Attack of the Clones,2002-05-15,"Action, Adventure, Science Fiction",14.07,6.4,4074,George Lucas,"Ewan McGregor, Natalie Portman, Hayden Christensen","investigation, army, wedding, violence, space opera"
4417,Star Wars: Episode III - Revenge of the Sith,2005-05-17,"Action, Adventure, Science Fiction",13.17,7.1,4200,George Lucas,"Ewan McGregor, Natalie Portman, Hayden Christensen","showdown, vision, dream sequence, space opera"
616,The Empire Strikes Back,1980-05-17,"Action, Adventure, Science Fiction",19.47,8.2,5998,Irvin Kershner,"Mark Hamill, Harrison Ford, Carrie Fisher","rebel, android, spaceship, good vs evil, rebellion"
629,Return of the Jedi,1983-05-23,"Action, Adventure, Science Fiction",14.59,7.9,4763,Richard Marquand,"Mark Hamill, Harrison Ford, Carrie Fisher","rebel, brother sister relationship, emperor, matter of life and death, spaceship"
8509,Star Wars: The Force Awakens,2015-12-15,"Action, Adventure, Fantasy, Science Fiction",31.63,7.5,7993,J.J. Abrams,"Daisy Ridley, John Boyega, Adam Driver","android, spaceship, imax, space opera, 3d"
10154,Rogue One: A Star Wars Story,2016-12-14,"Action, Adventure, Science Fiction",36.57,7.4,5111,Gareth Edwards,"Felicity Jones, Diego Luna, Ben Mendelsohn","rebel, space travel, war, prequel, spaceship"


In [74]:
print(f"Because you watched {df['title'].iloc[idx]}:")
df[["title", "release_date", "genres", "popularity", "vote_average", "vote_count", "director", "top_actors", "keywords"]].iloc[similar_movies]

Because you watched The Lord of the Rings: The Fellowship of the Ring:


id,title,release_date,genres,popularity,vote_average,vote_count,director,top_actors,keywords
3890,The Lord of the Rings: The Two Towers,2002-12-18,"Action, Adventure, Fantasy",29.42,8.0,7641,Peter Jackson,"Elijah Wood, Ian McKellen, Viggo Mortensen, Liv Tyler, Orlando Bloom","elves, based on novel, explosive, cave, army, mission, attack, wizard, sword and sorcery"
4591,The Lord of the Rings: The Return of the King,2003-12-01,"Action, Adventure, Fantasy",29.32,8.1,8226,Peter Jackson,"Elijah Wood, Ian McKellen, Viggo Mortensen, Liv Tyler, Orlando Bloom","elves, based on novel, suspicion, bravery, war, honor, brutality, violence, ghost, sword and sorcery"
10222,The Hobbit: An Unexpected Journey,2012-11-26,"Action, Adventure, Fantasy",23.25,7.0,8427,Peter Jackson,"Ian McKellen, Martin Freeman, Richard Armitage, Andy Serkis, Cate Blanchett","riddle, elves, dwarves, mountain, wizard, journey, tunnel"
10921,The Hobbit: The Desolation of Smaug,2013-12-11,"Adventure, Fantasy",20.64,7.6,4633,Peter Jackson,"Martin Freeman, Ian McKellen, Richard Armitage, Ken Stott, Graham McTavish","elves, dwarves, dragon, wizard, sword and sorcery"
11940,The Hobbit: The Battle of the Five Armies,2014-12-10,"Action, Adventure, Fantasy",31.72,7.1,4884,Peter Jackson,"Martin Freeman, Ian McKellen, Richard Armitage, Ken Stott, Graham McTavish","corruption, elves, dwarves, dragon, battle, unlikely friendship, sword and sorcery"
815,Bad Taste,1987-12-01,"Action, Comedy, Horror, Science Fiction",7.41,6.4,196,Peter Jackson,"Terry Potter, Pete O'Herne, Craig Smith, Mike Minett, Peter Jackson","new zealand, gore, cult favorite, chainsaw, axe murder"
8448,The Lovely Bones,2009-12-26,"Drama, Fantasy",12.74,6.6,1101,Peter Jackson,"Rachel Weisz, Mark Wahlberg, Susan Sarandon, Saoirse Ronan, Stanley Tucci","rape, 1970s, evidence, tree, afterlife, loss of daughter, serial killer, corpse, pedophile, teenage love, grieving, based on young adult novel"
