# MRS

Here I'll build a content-based movie recommender, continuing from the previous notebook (skill_showcase.ipynb).

data source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

This notebook will consist of two parts:

1. Preparing data for the recommender

2. Content-based recommender itself

# 1. Preparing Data

In [1]:
import pandas as pd
from ast import literal_eval

In [2]:
df = pd.read_csv('data/movies_metadata.csv')
# Transpose for easier exploration of this dataset with many cols
df.head(3).transpose()


Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

## Filtering out bad/barely known movies

In [4]:
# Calculate the count of films that have vote_average < 4 or vote_count < 20
len(df[(df['vote_average'] < 4) | (df['vote_count'] < 20)])

29943

As you can see, about two thirds of the movies in this dataset (29,943 of 45,466) are poorly rated or barely known. To speed up calculations, let's remove them

In [5]:
df = df[(df['vote_average'] >= 4) & (df['vote_count'] >= 20)]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15517 entries, 0 to 45460
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  15517 non-null  object 
 1   belongs_to_collection  2757 non-null   object 
 2   budget                 15517 non-null  object 
 3   genres                 15517 non-null  object 
 4   homepage               3876 non-null   object 
 5   id                     15517 non-null  object 
 6   imdb_id                15515 non-null  object 
 7   original_language      15516 non-null  object 
 8   original_title         15517 non-null  object 
 9   overview               15415 non-null  object 
 10  popularity             15517 non-null  object 
 11  poster_path            15517 non-null  object 
 12  production_companies   15517 non-null  object 
 13  production_countries   15517 non-null  object 
 14  release_date           15515 non-null  object 
 15  revenue

## Dropping unneeded columns

In [6]:
df["adult"].value_counts()

adult
False    15517
Name: count, dtype: int64

In [7]:
df["video"].value_counts()

video
False    15506
True        11
Name: count, dtype: int64

In [8]:
df["status"].value_counts()

status
Released           15475
Post Production       17
Rumored               13
In Production          8
Planned                3
Name: count, dtype: int64

The columns 'adult', 'status' and 'video' contain predominantly one value, so let's remove them. Let's also remove 'poster_path', 'homepage' (too many null values), 'imdb_id', 'spoken_languages', 'overview' and 'tagline'

In addition, let's drop the remaining columns that are of little use to the recommender: 'original_title', 'belongs_to_collection', 'original_language', 'production_companies' and 'production_countries'

In [9]:
df = df.drop(
    [
        "adult",
        "status",
        "video",
        "poster_path",
        "original_title",
        "homepage",
        "imdb_id",
        "spoken_languages",
        "overview",
        "tagline",
        "belongs_to_collection",
        "original_language",
        "production_companies",
        "production_countries"
    ],
    axis=1,
)
df.head(3).transpose()

Unnamed: 0,0,1,2
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
id,862,8844,15602
popularity,21.946943,17.015539,11.7129
release_date,1995-10-30,1995-12-15,1995-12-22
revenue,373554033.0,262797249.0,0.0
runtime,81.0,104.0,101.0
title,Toy Story,Jumanji,Grumpier Old Men
vote_average,7.7,6.9,6.5
vote_count,5415.0,2413.0,92.0


Now let's have a look at dtypes

## Converting dtypes to more appropriate ones

In [10]:
df.dtypes

budget           object
genres           object
id               object
popularity       object
release_date     object
revenue         float64
runtime         float64
title            object
vote_average    float64
vote_count      float64
dtype: object

First of all, let's handle the 'release_date' column

In [11]:
# Convert 'release_date' column to datetime type
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
# Count the number of rows with bad date values
bad_date_count = df['release_date'].isnull().sum()
print(f"Number of rows with bad date values: {bad_date_count}")

Number of rows with bad date values: 2


Since 2 rows out of roughly 15,500 are negligible, we can safely remove them

In [12]:
# Remove rows with null or NaT values
df = df.dropna(subset=['release_date'])
bad_date_count = df['release_date'].isnull().sum()
print(f"Number of rows with bad date values: {bad_date_count}")

Number of rows with bad date values: 0


The 'budget' column contains non-numerical values like '/ff9qCepilowshEtG2GYWwzt2bs4.jpg'. Let's strip all non-digit characters from it

In [13]:
# Clean 'budget' column to remove non-numeric characters
df["budget"] = df["budget"].str.replace(r"\D", "", regex=True)
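One caveat with the regex approach is that a malformed value such as a poster path still contains a few digits, so stripping non-digits turns it into a small but meaningless number. A possible alternative, sketched here on a toy Series with hypothetical values (not the notebook's actual data), is to coerce with `pd.to_numeric`, which marks unparseable entries as NaN instead:

```python
import pandas as pd

# Toy stand-in for the 'budget' column (hypothetical values, not the real data)
budget = pd.Series(["30000000", "0", "/ff9qCepilowshEtG2GYWwzt2bs4.jpg", None])

# Regex stripping keeps whatever digits are embedded in the malformed value
stripped = budget.str.replace(r"\D", "", regex=True)

# Coercion turns every entry that is not a valid number into NaN
coerced = pd.to_numeric(budget, errors="coerce")

print(stripped.tolist())
print(coerced.tolist())
```

With coercion, the bad rows can then be filled with 0 or dropped explicitly, which keeps the decision visible in the code.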

I don't like that columns with whole numbers like 'runtime' or 'vote_count' have dtype set to float. Let's change that

In [14]:
# Specify columns and their new data types
dict_columns_to_convert = {
    "budget": "int64",
    "revenue": "int64",
    "runtime": "int",
    "vote_count": "int",
    "popularity": "float",
    "id": "int"
}
# 'budget' was already stripped of non-digit characters in the previous cell
# Fill NaN values with 0
cols_to_fill = list(dict_columns_to_convert.keys())
df[cols_to_fill] = df[cols_to_fill].fillna(0)
# Convert columns to the dtypes specified above
df = df.astype(dict_columns_to_convert)
# Check the data types of the DataFrame
print(df.dtypes)

budget                   int64
genres                  object
id                       int32
popularity             float64
release_date    datetime64[ns]
revenue                  int64
runtime                  int32
title                   object
vote_average           float64
vote_count               int32
dtype: object


## Handling of 'budget', 'revenue', and 'popularity' columns

The values in the 'budget' and 'revenue' columns are very large, while the 'popularity' column has too many digits after the decimal point. Let's express the first two in millions and round all three to 2 decimal places

In [15]:
# Divide 'budget' and 'revenue' columns by million and round to 2 decimal places
df['budget'] = (df['budget'] / 1000000).round(2)
df['revenue'] = (df['revenue'] / 1000000).round(2)

# Round 'popularity' column to 2 decimal places
df['popularity'] = df['popularity'].round(2)
df.head(3)

Unnamed: 0,budget,genres,id,popularity,release_date,revenue,runtime,title,vote_average,vote_count
0,30.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,21.95,1995-10-30,373.55,81,Toy Story,7.7,5415
1,65.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,17.02,1995-12-15,262.8,104,Jumanji,6.9,2413
2,0.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,11.71,1995-12-22,0.0,101,Grumpier Old Men,6.5,92


## Working with 'genres' column

In [16]:
# Convert the stringified list of dictionaries into real Python objects
# (the strings are Python literals, so literal_eval can parse them directly)
df["genres"] = df["genres"].apply(
    lambda x: literal_eval(x) if isinstance(x, str) else []
)
# Extract the names of genres into a list and sort them alphabetically
df["genres"] = df["genres"].apply(
    lambda x: sorted([genre["name"] for genre in x]) if isinstance(x, list) else []
)
# Display the DataFrame with the extracted genre names
df[["title", "genres"]].head(3)

Unnamed: 0,title,genres
0,Toy Story,"[Animation, Comedy, Family]"
1,Jumanji,"[Adventure, Family, Fantasy]"
2,Grumpier Old Men,"[Comedy, Romance]"


In [17]:
# Flatten the list of genre names
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
# Get the unique genre names
unique_genre_names = set(flat_genre_names)
# Print the unique genre names
print(f"There are {len(unique_genre_names)} unique genres.")
print(unique_genre_names)

There are 20 unique genres.
{'Family', 'Music', 'Science Fiction', 'TV Movie', 'Adventure', 'Comedy', 'Foreign', 'Fantasy', 'Crime', 'Horror', 'Mystery', 'Animation', 'Thriller', 'Documentary', 'Action', 'Romance', 'Drama', 'History', 'Western', 'War'}


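The raw 'genres' column can contain faulty values like 'Carousel Productions' or 'Vision View Entertainment', which sound like production companies, not genres. None of them appear in the output above, but to be safe let's keep only a fixed set of valid genres. Note that the set below also excludes 'Foreign', 'Music' and 'TV Movie'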

In [18]:
# Define the set of valid genre names
valid_genres = {
    'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
    'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Mystery',
    'Romance', 'Science Fiction', 'Thriller', 'War', 'Western'
}
# Filter the 'genres' column to keep only the valid genres
df["genres"] = df["genres"].apply(lambda x: [genre for genre in x if genre in valid_genres])

Now let's check again

In [19]:
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
unique_genre_names = set(flat_genre_names)
print(f"There are {len(unique_genre_names)} unique genres.")
print(unique_genre_names)

There are 17 unique genres.
{'Mystery', 'Family', 'Animation', 'Adventure', 'Thriller', 'Fantasy', 'Action', 'Crime', 'Romance', 'Drama', 'Horror', 'Documentary', 'History', 'Comedy', 'Science Fiction', 'Western', 'War'}


In [20]:
df["genres"].value_counts().head(7)

genres
[Drama]                     1360
[Comedy]                    1224
[Comedy, Drama]              697
[Drama, Romance]             658
[Comedy, Drama, Romance]     539
[Comedy, Romance]            442
[Documentary]                398
Name: count, dtype: int64

One movie can belong to many genres, and one genre can apply to many movies: it's a many-to-many relationship. Ideally, such a relationship is split into two one-to-many relationships connected through an intermediate (junction) table. However, because

- this project is mainly meant to show my knowledge of writing SQL queries
- I'm applying for a junior data analyst position, a role that usually doesn't involve designing databases
- the preparation part is already too long
- the maximum string length for genres is known (80 characters, for the movie titled 'Yu-Gi-Oh')

I'll keep things simple and join the genre names with commas.
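For comparison, the normalized design described above could look roughly like the following sketch. It is not used in this notebook; the toy rows and the names `movie_genres`, `genre_table`, `movie_id` and `genre_id` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy movies table with list-valued genres (hypothetical rows)
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Toy Story", "Grumpier Old Men"],
    "genres": [["Animation", "Comedy", "Family"], ["Comedy", "Romance"]],
})

# One row per (movie, genre) pair
pairs = movies[["id", "genres"]].explode("genres").rename(
    columns={"id": "movie_id", "genres": "genre_name"}
)

# Genre lookup table with a surrogate key
genre_table = (
    pairs[["genre_name"]]
    .drop_duplicates()
    .sort_values("genre_name")
    .reset_index(drop=True)
)
genre_table.insert(0, "genre_id", range(1, len(genre_table) + 1))

# Junction table: holds only the two foreign keys
movie_genres = pairs.merge(genre_table, on="genre_name")[["movie_id", "genre_id"]]

print(genre_table)
print(movie_genres)
```

The comma-joined string is simpler to export and read, while the junction table makes per-genre filtering and counting straightforward.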

In [21]:
# Convert the list of genres into a string with comma as a delimiter
df["genres"] = df["genres"].apply(lambda x: ", ".join(x) if x else None)

In [22]:
df["genres"].value_counts().head(7)

genres
Drama                     1360
Comedy                    1224
Comedy, Drama              697
Drama, Romance             658
Comedy, Drama, Romance     539
Comedy, Romance            442
Documentary                398
Name: count, dtype: int64

In [23]:
df.head().transpose()

Unnamed: 0,0,1,2,3,4
budget,30.0,65.0,0.0,16.0,0.0
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance","Comedy, Drama, Romance",Comedy
id,862,8844,15602,31357,11862
popularity,21.95,17.02,11.71,3.86,8.39
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00,1995-12-22 00:00:00,1995-02-10 00:00:00
revenue,373.55,262.8,0.0,81.45,76.58
runtime,81,104,101,127,106
title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
vote_average,7.7,6.9,6.5,6.1,5.7
vote_count,5415,2413,92,34,173


Time to rearrange the columns into a more readable order

## Changing column order

In [24]:
new_cols_order = [
    "id",
    "title",
    "release_date",
    "runtime",
    "genres",
    "budget",
    "revenue",
    "popularity",
    "vote_average",
    "vote_count"
]
df = df[new_cols_order]
df.head(3).transpose()

Unnamed: 0,0,1,2
id,862,8844,15602
title,Toy Story,Jumanji,Grumpier Old Men
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00
runtime,81,104,101
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance"
budget,30.0,65.0,0.0
revenue,373.55,262.8,0.0
popularity,21.95,17.02,11.71
vote_average,7.7,6.9,6.5
vote_count,5415,2413,92


## Adding data from the other two datasets

In [27]:
credits = pd.read_csv('data/credits.csv')
keywords = pd.read_csv('data/keywords.csv')

In [32]:
df = df.merge(credits, on='id')
df = df.merge(keywords, on='id')
df.head()

Unnamed: 0,id,title,release_date,runtime,genres,budget,revenue,popularity,vote_average,vote_count,cast,crew,keywords
0,862,Toy Story,1995-10-30,81,"Animation, Comedy, Family",30.0,373.55,21.95,7.7,5415,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,1995-12-15,104,"Adventure, Family, Fantasy",65.0,262.8,17.02,6.9,2413,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,1995-12-22,101,"Comedy, Romance",0.0,0.0,11.71,6.5,92,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,1995-12-22,127,"Comedy, Drama, Romance",16.0,81.45,3.86,6.1,34,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,1995-02-10,106,Comedy,0.0,76.58,8.39,5.7,173,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15800 entries, 0 to 15799
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            15800 non-null  int32         
 1   title         15800 non-null  object        
 2   release_date  15800 non-null  datetime64[ns]
 3   runtime       15800 non-null  int32         
 4   genres        15747 non-null  object        
 5   budget        15800 non-null  float64       
 6   revenue       15800 non-null  float64       
 7   popularity    15800 non-null  float64       
 8   vote_average  15800 non-null  float64       
 9   vote_count    15800 non-null  int32         
 10  cast          15800 non-null  object        
 11  crew          15800 non-null  object        
 12  keywords      15800 non-null  object        
dtypes: datetime64[ns](1), float64(4), int32(3), object(5)
memory usage: 1.4+ MB
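The merged frame has 15800 rows, although only 15515 remained after the date cleanup (15517 minus the 2 rows with bad dates). This suggests that some ids occur more than once in the credits or keywords files, so the merge multiplies those rows. A minimal sketch of one way to guard against this, using toy frames with hypothetical values:

```python
import pandas as pd

# Toy frames: id 8844 appears twice in the credits table (hypothetical data)
movies = pd.DataFrame({"id": [862, 8844], "title": ["Toy Story", "Jumanji"]})
credits = pd.DataFrame({"id": [862, 8844, 8844], "cast": ["cast A", "cast B", "cast B"]})

# A plain merge repeats the movie once per matching credits row
merged_raw = movies.merge(credits, on="id")

# Dropping duplicate ids first keeps one row per movie;
# validate="one_to_one" raises an error if duplicate keys remain on either side
credits_unique = credits.drop_duplicates(subset="id")
merged_clean = movies.merge(credits_unique, on="id", validate="one_to_one")

print(len(merged_raw), len(merged_clean))
```

Whether keeping the first occurrence is appropriate depends on how the duplicate rows differ, which would need to be checked on the real files.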


## Extracting director

In [34]:
def extract_director(crew_list):
    for crew_member in crew_list:
        if crew_member["job"] == "Director":
            return crew_member["name"]
    return None

# For extract_director to work, convert the string representations to actual lists of dictionaries
df["crew"] = df["crew"].apply(literal_eval)
# Extract the director's name for each movie
df["director"] = df["crew"].apply(extract_director)

In [36]:
df["director"].value_counts().head(10)

director
Woody Allen          48
Alfred Hitchcock     39
Clint Eastwood       35
Martin Scorsese      32
Steven Soderbergh    31
Steven Spielberg     31
Ron Howard           25
John Huston          24
Werner Herzog        24
John Ford            24
Name: count, dtype: int64

In [37]:
df[df["director"] == "Martin Scorsese"][["title", "release_date", "runtime", "genres", "director"]].head(10)

Unnamed: 0,title,release_date,runtime,genres,director
15,Casino,1995-11-22,178,"Crime, Drama",Martin Scorsese
81,Taxi Driver,1976-02-07,114,"Crime, Drama",Martin Scorsese
288,The Age of Innocence,1993-09-17,139,"Drama, Romance",Martin Scorsese
774,GoodFellas,1990-09-12,145,"Crime, Drama",Martin Scorsese
789,Raging Bull,1980-11-14,129,Drama,Martin Scorsese
884,Cape Fear,1991-11-15,128,"Crime, Thriller",Martin Scorsese
1102,Kundun,1997-12-25,134,Drama,Martin Scorsese
1282,The Last Temptation of Christ,1988-08-12,164,Drama,Martin Scorsese
1627,The Color of Money,1986-10-07,119,Drama,Martin Scorsese
1968,Bringing Out the Dead,1999-10-22,121,Drama,Martin Scorsese


## Extracting top actors

In [38]:
def extract_actors(cast_list):
    top_actors = []
    for actor in cast_list[:5]:  # Select the top 5 actors
        top_actors.append(actor["name"])
    return ", ".join(top_actors)

# Convert the string representations to actual dictionaries
df["cast"] = df["cast"].apply(literal_eval)
# Extract the top 5 actor names for each movie
df["top_actors"] = df["cast"].apply(extract_actors)

In [40]:
df[df["title"] == "The Empire Strikes Back"][["title", "release_date", "runtime", "genres", "director", "top_actors"]]

Unnamed: 0,title,release_date,runtime,genres,director,top_actors
758,The Empire Strikes Back,1980-05-17,124,"Action, Adventure, Science Fiction",Irvin Kershner,"Mark Hamill, Harrison Ford, Carrie Fisher, Bil..."


## Extracting keywords

In [42]:
from collections import Counter

# Convert the string representations to actual dictionaries
df["keywords"] = df["keywords"].apply(literal_eval)
# Flatten the list of dictionaries in the 'keywords' column
keywords = [keyword["name"] for sublist in df["keywords"] for keyword in sublist]
# Count the frequencies of each keyword
keyword_counts = Counter(keywords)
# Sort the keywords based on their frequencies in descending order
sorted_keywords = sorted(keyword_counts.items(), key=lambda x: x[1], reverse=True)
# Keep only keywords that occur at least 10 times
sorted_keywords_filtered = [(keyword, count) for keyword, count in sorted_keywords if count > 9]
# Create a set of the kept keywords for fast membership tests
filtered_keywords_set = {keyword for keyword, _ in sorted_keywords_filtered}
# Print the sorted and filtered keywords
for keyword, count in sorted_keywords_filtered[:10]:
    print(f"{keyword}: {count}")

woman director: 978
independent film: 808
murder: 682
based on novel: 480
violence: 446
duringcreditsstinger: 409
revenge: 371
sex: 364
suspense: 345
love: 327


In [43]:
# Print the number of unique keywords
num_unique_keywords_before = len(keyword_counts)
print(f"Number of unique keywords before filtering: {num_unique_keywords_before}")
num_unique_keywords_after = len(sorted_keywords_filtered)
print(f"Number of unique keywords after filtering: {num_unique_keywords_after}")

Number of unique keywords before filtering: 15676
Number of unique keywords after filtering: 1963


In [44]:
def filter_keywords(keywords_list):
    filtered_keywords = []
    for keyword in keywords_list:
        if keyword["name"] in filtered_keywords_set:
            filtered_keywords.append(keyword["name"])
    return ", ".join(filtered_keywords)

# Add a new column 'filtered_keywords' to the DataFrame
df['filtered_keywords'] = df['keywords'].apply(filter_keywords)

In [45]:
df[["keywords", "filtered_keywords"]]

Unnamed: 0,keywords,filtered_keywords
0,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","jealousy, toy, boy, friendship, friends, rival..."
1,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","disappearance, based on children's book"
2,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","fishing, best friend, duringcreditsstinger"
3,"[{'id': 818, 'name': 'based on novel'}, {'id':...","based on novel, interracial relationship, sing..."
4,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","baby, midlife crisis, confidence, aging, daugh..."
...,...,...
15795,"[{'id': 9673, 'name': 'love'}, {'id': 13130, '...","love, teenager, lgbt, short"
15796,"[{'id': 171803, 'name': 'military school'}]",
15797,"[{'id': 10124, 'name': 'laboratory'}, {'id': 1...","laboratory, mad scientist, silent film, short"
15798,[],


## Final steps of data preparation with Pandas

In [46]:
# Drop columns 'id', 'cast', 'crew', and 'keywords'
df = df.drop(columns=["id", "cast", "crew", "keywords"])
# Rename 'filtered_keywords' column to 'keywords'
df = df.rename(columns={"filtered_keywords": "keywords"})
# Add the new 'id' column as the first column
df.insert(0, "id", range(1, 1 + len(df)))
df.head().transpose()

Unnamed: 0,0,1,2,3,4
id,1,2,3,4,5
title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00,1995-12-22 00:00:00,1995-02-10 00:00:00
runtime,81,104,101,127,106
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance","Comedy, Drama, Romance",Comedy
budget,30.0,65.0,0.0,16.0,0.0
revenue,373.55,262.8,0.0,81.45,76.58
popularity,21.95,17.02,11.71,3.86,8.39
vote_average,7.7,6.9,6.5,6.1,5.7
vote_count,5415,2413,92,34,173


Columns explanation:
- id - row id
- title - official title of the movie
- release_date - theatrical release date of the movie
- runtime - movie duration/runtime in minutes
- genres - genres associated with the movie, separated by a comma
- budget - movie budget in millions of dollars
- revenue - total movie revenue in millions of dollars
- popularity - popularity score assigned by TMDB
- vote_average - average movie rating
- vote_count - number of votes by users, counted by TMDB
- director - name of the movie director
- top_actors - names of top 5 actors in the movie
- keywords - keywords associated with the movie

In [47]:
df.dtypes

id                       int64
title                   object
release_date    datetime64[ns]
runtime                  int32
genres                  object
budget                 float64
revenue                float64
popularity             float64
vote_average           float64
vote_count               int32
director                object
top_actors              object
keywords                object
dtype: object

Let's save the cleaned-up dataset, which we'll use in the second part of this notebook

In [1]:
# df.to_csv("data/recommender_data.csv", index=False)

# 2. Content-based recommender

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/recommender_data.csv", index_col=0)
df.head(3)

id,title,release_date,runtime,genres,budget,revenue,popularity,vote_average,vote_count,director,top_actors,keywords
1,Toy Story,1995-10-30,81,"Animation, Comedy, Family",30.0,373.55,21.95,7.7,5415,John Lasseter,"Tom Hanks, Tim Allen, Don Rickles, Jim Varney,...","jealousy, toy, boy, friendship, friends, rival..."
2,Jumanji,1995-12-15,104,"Adventure, Family, Fantasy",65.0,262.8,17.02,6.9,2413,Joe Johnston,"Robin Williams, Jonathan Hyde, Kirsten Dunst, ...","disappearance, based on children's book"
3,Grumpier Old Men,1995-12-22,101,"Comedy, Romance",0.0,0.0,11.71,6.5,92,Howard Deutch,"Walter Matthau, Jack Lemmon, Ann-Margret, Soph...","fishing, best friend, duringcreditsstinger"


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15800 entries, 1 to 15800
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         15800 non-null  object 
 1   release_date  15800 non-null  object 
 2   runtime       15800 non-null  int64  
 3   genres        15747 non-null  object 
 4   budget        15800 non-null  float64
 5   revenue       15800 non-null  float64
 6   popularity    15800 non-null  float64
 7   vote_average  15800 non-null  float64
 8   vote_count    15800 non-null  int64  
 9   director      15768 non-null  object 
 10  top_actors    15707 non-null  object 
 11  keywords      13129 non-null  object 
dtypes: float64(4), int64(2), object(6)
memory usage: 1.6+ MB


## Handling movie title

The idea behind this recommender is that you enter the title of a movie you liked and get back a list of similar movies. First, however, we need to check whether a movie with the typed title exists in the database at all. Spelling mistakes may also get in the way.

The 'fuzzywuzzy' library helps with both issues by finding the titles closest to the typed one

In [4]:
all_titles = df['title'].tolist()
all_titles[:10]

['Toy Story',
 'Jumanji',
 'Grumpier Old Men',
 'Waiting to Exhale',
 'Father of the Bride Part II',
 'Heat',
 'Sabrina',
 'Tom and Huck',
 'Sudden Death',
 'GoldenEye']

In [5]:
from fuzzywuzzy import process

def find_top_movies(title):
    # Return the 7 closest titles together with their similarity scores
    all_titles = df["title"].tolist()
    return process.extract(title, all_titles, limit=7)

# Example usage
title = "ty stry"
found_movies = find_top_movies(title)
found_movies

[('Toy Story', 88),
 ('Toy Story 2', 78),
 ('Tokyo Story', 78),
 ('Toy Story 3', 78),
 ('The Buddy Holly Story', 77),
 ('The Straight Story', 77),
 ('The Greatest Story Ever Told', 77)]

In [6]:
from fuzzywuzzy import process

def find_movie(title):
    all_titles = df["title"].tolist()
    closest_match = process.extractOne(title, all_titles)
    matched_title = closest_match[0]
    return df[df["title"] == matched_title]

find_movie("stare was")

id,title,release_date,runtime,genres,budget,revenue,popularity,vote_average,vote_count,director,top_actors,keywords
188,Star Wars,1977-05-25,121,"Action, Adventure, Science Fiction",11.0,775.4,42.15,8.1,6778,George Lucas,"Mark Hamill, Harrison Ford, Carrie Fisher, Pet...","android, rescue mission, rebellion, planet, sp..."
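Note that taking the single closest match always returns some movie, even for input that resembles nothing in the catalogue. One way to handle the "does this title exist at all" check is a minimum similarity threshold. The sketch below uses `difflib` from the standard library as a stand-in for fuzzywuzzy so that it runs without extra dependencies; its ratios are on a 0 to 1 scale and are not comparable to fuzzywuzzy scores. The function name `best_match`, the threshold of 0.6 and the small title list are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical small catalogue standing in for df["title"]
titles = ["Toy Story", "Star Wars", "Jumanji"]

def best_match(query, candidates, min_ratio=0.6):
    """Return the closest title, or None if nothing is similar enough."""
    scored = [
        (SequenceMatcher(None, query.lower(), candidate.lower()).ratio(), candidate)
        for candidate in candidates
    ]
    ratio, title = max(scored)
    return title if ratio >= min_ratio else None

print(best_match("stare was", titles))
print(best_match("zzzz", titles))
```

The same idea should carry over to fuzzywuzzy by comparing the score returned alongside the matched title against a cutoff chosen by experiment.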


## Preprocessing

In [7]:
df["combined"] = (df["director"].fillna("") + "; " +
                  df["top_actors"].fillna("") + "; " +
                  df["genres"].fillna("") + "; " +
                  df["keywords"].fillna(""))

In [8]:
pd.set_option('display.max_colwidth', None)

In [9]:
df[["title", "release_date", "combined"]].head(10)

id,title,release_date,combined
1,Toy Story,1995-10-30,"John Lasseter; Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn; Animation, Comedy, Family; jealousy, toy, boy, friendship, friends, rivalry, toy comes to life"
2,Jumanji,1995-12-15,"Joe Johnston; Robin Williams, Jonathan Hyde, Kirsten Dunst, Bradley Pierce, Bonnie Hunt; Adventure, Family, Fantasy; disappearance, based on children's book"
3,Grumpier Old Men,1995-12-22,"Howard Deutch; Walter Matthau, Jack Lemmon, Ann-Margret, Sophia Loren, Daryl Hannah; Comedy, Romance; fishing, best friend, duringcreditsstinger"
4,Waiting to Exhale,1995-12-22,"Forest Whitaker; Whitney Houston, Angela Bassett, Loretta Devine, Lela Rochon, Gregory Hines; Comedy, Drama, Romance; based on novel, interracial relationship, single mother, divorce"
5,Father of the Bride Part II,1995-02-10,"Charles Shyer; Steve Martin, Diane Keaton, Martin Short, Kimberly Williams-Paisley, George Newbern; Comedy; baby, midlife crisis, confidence, aging, daughter, mother daughter relationship, pregnancy"
6,Heat,1995-12-15,"Michael Mann; Al Pacino, Robert De Niro, Val Kilmer, Jon Voight, Tom Sizemore; Action, Crime, Drama, Thriller; robbery, detective, bank, obsession, chase, shooting, thief, honor, murder, suspense, heist, betrayal, money, gang, cult film, ex-con, neo-noir"
7,Sabrina,1995-12-15,"Sydney Pollack; Harrison Ford, Julia Ormond, Greg Kinnear, Angie Dickinson, Nancy Marchand; Comedy, Romance; paris, brother brother relationship, chauffeur, millionaire"
8,Tom and Huck,1995-12-22,"Peter Hewitt; Jonathan Taylor Thomas, Brad Renfro, Rachael Leigh Cook, Michael McShane, Amy Wright; Action, Adventure, Drama, Family;"
9,Sudden Death,1995-12-22,"Peter Hyams; Jean-Claude Van Damme, Powers Boothe, Dorian Harewood, Raymond J. Barry, Ross Malinger; Action, Adventure, Thriller; terrorist, hostage, explosive"
10,GoldenEye,1995-11-16,"Martin Campbell; Pierce Brosnan, Sean Bean, Izabella Scorupco, Famke Janssen, Joe Don Baker; Action, Adventure, Thriller; cuba, falsely accused, secret identity, computer virus, secret intelligence service, kgb, satellite"


In [10]:
import re
import nltk

# Requires the NLTK stopword corpus: run nltk.download("stopwords") once
STOP_WORDS = set(nltk.corpus.stopwords.words("english"))
TOKENIZER = nltk.WordPunctTokenizer()

def preprocess(text):
    """
    Summary:
        Preprocesses the input text by removing non-alphanumeric characters,
        converting to lowercase, tokenizing, and filtering out stopwords.
    Parameters:
        text (str): Input text to be preprocessed.
    Returns:
        str: Preprocessed text
    """
    # Handle NaN
    if not isinstance(text, str):
        return ""
    # Remove non-alphanumeric characters, convert to lowercase, and
    # strip leading/trailing whitespace. Flags must be passed by keyword:
    # the fourth positional argument of re.sub is 'count', not 'flags'
    text = re.sub(r"[^0-9a-zA-Z\s]", "", text, flags=re.A).lower().strip()
    # Tokenize with NLTK's WordPunctTokenizer and drop English stopwords
    tokens = TOKENIZER.tokenize(text)
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]
    # Join the remaining tokens back into a single string
    return " ".join(filtered_tokens)

df["preproc"] = df["combined"].apply(preprocess)

In [11]:
df["preproc"].value_counts().head(15)

preproc
jeanpierre melville alain delon franois prier nathalie delon cathy rosier catherine jourdan crime drama thriller paris bar jazz hearing garage hitman jazz club police canary treason danger stakeout french noir film noir little dialogue    8
steven soderbergh debbie doebereiner omar cowan dustin james ashley phyllis workman crime drama mystery murder independent film                                                                                                                8
jonathan frakes jesse bradford paula garcs robin thomas french stewart michael biehn adventure family science fiction thriller time airplane youth                                                                                             8
george clooney sam rockwell drew barrymore julia roberts rutger hauer brad pitt comedy crime drama romance thriller biography silencer intrigue                                                                                                8
jeanjacques annaud mark stro

In [12]:
df["preproc"].head(10)

id
1                                                                                  john lasseter tom hanks tim allen rickles jim varney wallace shawn animation comedy family jealousy toy boy friendship friends rivalry toy comes life
2                                                                                         joe johnston robin williams jonathan hyde kirsten dunst bradley pierce bonnie hunt adventure family fantasy disappearance based childrens book
3                                                                                                  howard deutch walter matthau jack lemmon annmargret sophia loren daryl hannah comedy romance fishing best friend duringcreditsstinger
4                                                                forest whitaker whitney houston angela bassett loretta devine lela rochon gregory hines comedy drama romance based novel interracial relationship single mother divorce
5                                               charles shyer ste

## Recommender

In [14]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv_matrix = cv.fit_transform(df["preproc"])

In [15]:
cv_matrix

<15800x29604 sparse matrix of type '<class 'numpy.int64'>'
	with 304761 stored elements in Compressed Sparse Row format>

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(cv_matrix, cv_matrix)

In [17]:
cosine_sim

array([[1.        , 0.04588315, 0.04850713, ..., 0.04714045, 0.        ,
        0.05163978],
       [0.04588315, 1.        , 0.        , ..., 0.05407381, 0.06917145,
        0.        ],
       [0.04850713, 0.        , 1.        , ..., 0.0571662 , 0.        ,
        0.06262243],
       ...,
       [0.04714045, 0.05407381, 0.0571662 , ..., 1.        , 0.63960215,
        0.        ],
       [0.        , 0.06917145, 0.        , ..., 0.63960215, 1.        ,
        0.07784989],
       [0.05163978, 0.        , 0.06262243, ..., 0.        , 0.07784989,
        1.        ]])
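As a rough illustration of what each entry of this matrix means, the sketch below computes cosine similarity between word-count vectors by hand, using only the standard library. The helper name `cosine` and the three toy strings are assumptions for this example. It splits on whitespace, whereas `CountVectorizer` applies its own tokenization rules, so treat it as an approximation of the idea and not as a reimplementation of scikit-learn.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    # Dot product over the words the two documents share
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    # An empty document has no direction, so define its similarity as 0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = "action adventure space opera"
doc_b = "action comedy space"
doc_c = "romance drama"

print(round(cosine(doc_a, doc_b), 4))  # two shared words out of 4 and 3
print(cosine(doc_a, doc_c))            # no shared words
print(cosine(doc_a, doc_a))            # a document compared with itself
```

The score depends only on shared words and vector lengths, which matches what the matrix shows: ones on the diagonal, zeros for pairs with no words in common, and the same value at positions `[i, j]` and `[j, i]`.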

In [18]:
find_movie("stare was")

id,title,release_date,runtime,genres,budget,revenue,popularity,vote_average,vote_count,director,top_actors,keywords,combined,preproc
188,Star Wars,1977-05-25,121,"Action, Adventure, Science Fiction",11.0,775.4,42.15,8.1,6778,George Lucas,"Mark Hamill, Harrison Ford, Carrie Fisher, Peter Cushing, Alec Guinness","android, rescue mission, rebellion, planet, space opera","George Lucas; Mark Hamill, Harrison Ford, Carrie Fisher, Peter Cushing, Alec Guinness; Action, Adventure, Science Fiction; android, rescue mission, rebellion, planet, space opera",george lucas mark hamill harrison ford carrie fisher peter cushing alec guinness action adventure science fiction android rescue mission rebellion planet space opera


In [27]:
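# cosine_sim rows are positional: idx is a row position in df, not a movie id.
# For Star Wars (id 188) use idx = df.index.get_loc(188), assuming unique ids.
idx = 188
n_recommendations = 5
# Pair each row position with its similarity score, then sort descending
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
# Skip the first entry (the movie itself) and keep the top n
sim_scores = sim_scores[1:(n_recommendations+1)]
similar_movies = [i[0] for i in sim_scores]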
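The lookup above can be wrapped in a small helper that starts from a movie id instead of a row position. The following is a sketch under stated assumptions: the name `recommend_by_id` and its parameters are hypothetical, `ids` is taken to be the sequence of index labels in row order (for example `df.index`), and `sim` is any square similarity matrix that supports `sim[row]`. It excludes the query movie by position instead of assuming it sorts first, which may matter when several rows have identical `preproc` text and therefore tie at a similarity of 1.0.

```python
def recommend_by_id(movie_id, ids, sim, n=5):
    """Return the ids of the n movies most similar to movie_id.

    ids: index labels in the same order as the rows of sim (assumed unique).
    sim: square similarity matrix, e.g. a list of lists or a NumPy array.
    """
    ids = list(ids)
    # Translate the id label into a row position of the similarity matrix
    pos = ids.index(movie_id)
    # Keep every other row together with its similarity to the query
    scores = [(i, s) for i, s in enumerate(sim[pos]) if i != pos]
    # Highest similarity first; ties keep their original row order
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [ids[i] for i, _ in scores[:n]]


# Toy data: four movies with made-up ids and similarities
toy_ids = [10, 20, 30, 40]
toy_sim = [
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]
print(recommend_by_id(10, toy_ids, toy_sim, n=2))
```

With the notebook's objects, a call would look like `recommend_by_id(188, df.index, cosine_sim)`, and the returned ids could then be passed to `df.loc[...]` to display the matching rows.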

In [34]:
df.iloc[189]

title                                                                                                                                                                                                                                                                                                              A Little Princess
release_date                                                                                                                                                                                                                                                                                                              1995-05-10
runtime                                                                                                                                                                                                                                                                                                                           97
genres                   

In [31]:
print(f"Because you watched {df['title'].iloc[idx]}:")
df[["title", "release_date", "genres", "popularity", "vote_average", "vote_count", "director", "top_actors", "keywords"]].iloc[similar_movies]

Because you watched Little Women:


id,title,release_date,genres,popularity,vote_average,vote_count,director,top_actors,keywords
10802,Enough Said,2013-09-18,"Comedy, Drama, Romance",6.78,6.6,351,Nicole Holofcener,"Julia Louis-Dreyfus, Catherine Keener, James Gandolfini, Toni Collette, Ben Falcone","thanksgiving, party, romance, mother daughter relationship, dating, relationship, divorce, woman director"
4330,My Life Without Me,2003-03-07,"Drama, Romance",10.31,7.2,78,Isabel Coixet,"Sarah Polley, Amanda Plummer, Scott Speedman, Leonor Watling, Deborah Harry","farewell, dying and death, daughter, secret love, mother daughter relationship, woman director"
3361,Italian for Beginners,2000-12-07,"Comedy, Drama, Romance",5.4,6.5,33,Lone Scherfig,"Peter Gantzler, Sara Indrio Jensen, Ann Eleonora Jørgensen, Anders W. Berthelsen, Anette Støvelbæk","venice, hotel, copenhagen, waitress, depression, italian, hairdresser, daughter, friendship, priest, mother daughter relationship, church, woman director"
1064,Eve's Bayou,1997-09-07,Drama,1.46,6.3,29,Kasi Lemmons,"Jurnee Smollett, Meagan Good, Lynn Whitfield, Samuel L. Jackson, Debbi Morgan","sister sister relationship, superstition, independent film, curse, mother daughter relationship, father daughter relationship, woman director"
1790,Drop Dead Gorgeous,1999-07-23,Comedy,6.4,6.5,88,Michael Patrick Jann,"Kirsten Dunst, Ellen Barkin, Denise Richards, Amy Adams, Kirstie Alley","mother role, mother daughter relationship, pretty woman"
