# MRS

In this notebook I build a content-based movie recommender, continuing from the previous notebook (skill_showcase.ipynb).

Data source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

This notebook will consist of two parts:

1. Preparing data for the recommender

2. Content-based recommender itself

# 1. Preparing Data

In [1]:
import pandas as pd
from ast import literal_eval

In [2]:
df = pd.read_csv('data/movies_metadata.csv')
# Transpose for easier exploration of this dataset with many cols
df.head(3).transpose()



Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

## Filtering out bad/barely known movies

In [4]:
# Calculate the count of films that have vote_average < 4 or vote_count < 40
len(df[(df['vote_average'] < 4) | (df['vote_count'] < 40)])

35109

As the count shows, 35,109 of the 45,466 movies are either poorly rated or barely known. To speed up calculations, let's remove them.
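
A side note on the numbers: 45,466 - 35,109 = 10,357, while the filter in the next cell keeps 10,351 rows. One plausible explanation (an assumption, not checked against the data here) is missing values: a comparison with NaN is False in both directions, so such rows are neither counted as low-rated nor kept. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; the last row has a missing rating
toy = pd.DataFrame({
    "vote_average": [7.5, 3.0, np.nan],
    "vote_count": [100, 500, 50],
})

low = (toy["vote_average"] < 4) | (toy["vote_count"] < 40)
keep = (toy["vote_average"] >= 4) & (toy["vote_count"] >= 40)

# The NaN row satisfies neither mask, so the two counts do not add up to len(toy)
print(low.sum(), keep.sum(), len(toy))
```

If the two masks should be exact complements, `keep = ~low` is one way to express that, at the cost of keeping rows with missing votes.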

In [5]:
df = df[(df['vote_average'] >= 4) & (df['vote_count'] >= 40)]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10351 entries, 0 to 45441
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  10351 non-null  object 
 1   belongs_to_collection  2135 non-null   object 
 2   budget                 10351 non-null  object 
 3   genres                 10351 non-null  object 
 4   homepage               3001 non-null   object 
 5   id                     10351 non-null  object 
 6   imdb_id                10350 non-null  object 
 7   original_language      10351 non-null  object 
 8   original_title         10351 non-null  object 
 9   overview               10299 non-null  object 
 10  popularity             10351 non-null  object 
 11  poster_path            10351 non-null  object 
 12  production_companies   10351 non-null  object 
 13  production_countries   10351 non-null  object 
 14  release_date           10349 non-null  object 
 15  revenue

## Dropping unneeded columns

In [6]:
df["video"].value_counts()

video
False    10348
True         3
Name: count, dtype: int64

In [7]:
df["status"].value_counts()

status
Released           10335
Post Production        6
Rumored                4
In Production          4
Planned                1
Name: count, dtype: int64

The columns 'adult', 'status' and 'video' have predominantly one value, so let's remove them. Also, let's remove 'poster_path', 'homepage' (too many null values), 'imdb_id', 'spoken_languages', 'overview' and 'tagline'.

Apart from these, let's drop the remaining columns that are of little use to the recommender.

In [8]:
df = df.drop(
    [
        "adult",
        "status",
        "video",
        "poster_path",
        "original_title",
        "homepage",
        "imdb_id",
        "spoken_languages",
        "overview",
        "tagline",
        "original_language",
        "production_companies",
        "production_countries",
        "runtime",
        "popularity",
        "budget",
        "revenue",
    ],
    axis=1,
)
df.head(3).transpose()

Unnamed: 0,0,1,2
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
id,862,8844,15602
release_date,1995-10-30,1995-12-15,1995-12-22
title,Toy Story,Jumanji,Grumpier Old Men
vote_average,7.7,6.9,6.5
vote_count,5415.0,2413.0,92.0


Now let's have a look at dtypes

## Converting dtypes to more appropriate ones

In [9]:
df.dtypes

belongs_to_collection     object
genres                    object
id                        object
release_date              object
title                     object
vote_average             float64
vote_count               float64
dtype: object

First of all, let's handle 'release_date' column

In [10]:
# Convert 'release_date' column to datetime type
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
# Count the number of rows with bad date values
bad_date_count = df['release_date'].isnull().sum()
print(f"Number of rows with bad date values: {bad_date_count}")

Number of rows with bad date values: 2


Since 2 rows out of more than 10,000 is negligible, we can safely remove them.

In [11]:
# Remove rows with null or NaT values
df = df.dropna(subset=['release_date'])
bad_date_count = df['release_date'].isnull().sum()
print(f"Number of rows with bad date values: {bad_date_count}")

Number of rows with bad date values: 0
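
The two steps above (coerce, then drop) can be illustrated on a tiny made-up series. This is only a sketch: the sample strings are hypothetical and not taken from the dataset.

```python
import pandas as pd

# One valid date, one unparseable string, one missing value
raw = pd.Series(["1995-10-30", "not a date", None])

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")
bad_count = int(parsed.isna().sum())

# Dropping NaT leaves only the rows with a usable date
clean = parsed.dropna()
print(bad_count, len(clean))
```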


I don't like that columns with whole numbers like 'vote_count' have dtype set to float. Let's change that

In [12]:
# Specify columns and their new data types
dict_columns_to_convert = {
    "vote_count": "int",
    "id": "int"
}
# Fill NaN values with 0
cols_to_fill = list(dict_columns_to_convert.keys())
df[cols_to_fill] = df[cols_to_fill].fillna(0)
# Convert columns to integer type
df = df.astype(dict_columns_to_convert)
# Check the data types of the DataFrame
print(df.dtypes)

belongs_to_collection            object
genres                           object
id                                int32
release_date             datetime64[ns]
title                            object
vote_average                    float64
vote_count                        int32
dtype: object
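
The `fillna(0)` step is needed because a plain integer column cannot hold NaN, but it also turns "unknown" into "zero". As an alternative, pandas offers the nullable `Int64` dtype. The sketch below uses made-up values and is not what the notebook does:

```python
import numpy as np
import pandas as pd

counts = pd.Series([5415.0, np.nan, 92.0])

# Approach used in the cell above: the missing value becomes 0
filled = counts.fillna(0).astype("int64")

# Nullable integer dtype: the missing value stays missing (<NA>)
nullable = counts.astype("Int64")

print(filled.tolist())
print(int(nullable.isna().sum()))
```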


## Working with 'genres' column

In [13]:
# Parse the stringified list of dictionaries into actual Python objects.
# The strings are Python literals (single-quoted), so literal_eval reads them as-is
df["genres"] = df["genres"].apply(
    lambda x: literal_eval(x) if isinstance(x, str) else []
)
# Extract the names of genres into a list and sort them alphabetically
df["genres"] = df["genres"].apply(
    lambda x: sorted([genre["name"] for genre in x]) if isinstance(x, list) else []
)
# Display the DataFrame with the extracted genre names
df[["title", "genres"]].head(3)

Unnamed: 0,title,genres
0,Toy Story,"[Animation, Comedy, Family]"
1,Jumanji,"[Adventure, Family, Fantasy]"
2,Grumpier Old Men,"[Comedy, Romance]"
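
To make the parsing step concrete, here is a small sketch on a shortened sample string in the same shape as the 'genres' column. It also shows why `json.loads` is not a drop-in choice here: the strings use single quotes, which is valid Python literal syntax but not valid JSON.

```python
import json
from ast import literal_eval

# Shortened sample in the same shape as the 'genres' column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval safely evaluates the string as a Python list of dicts
genres = sorted(item["name"] for item in literal_eval(raw))
print(genres)

# The same string is rejected by the JSON parser because of the single quotes
try:
    json.loads(raw)
    is_json = True
except json.JSONDecodeError:
    is_json = False
print(is_json)
```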


In [14]:
# Flatten the list of genre names
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
# Get the unique genre names
unique_genre_names = set(flat_genre_names)
# Print the unique genre names
print(f"There are {len(unique_genre_names)} unique genres.")
print(unique_genre_names)

There are 20 unique genres.
{'Music', 'Western', 'Romance', 'Family', 'Thriller', 'Foreign', 'TV Movie', 'Action', 'Crime', 'War', 'Animation', 'Drama', 'Horror', 'Mystery', 'Fantasy', 'Adventure', 'Comedy', 'Science Fiction', 'Documentary', 'History'}


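The unfiltered 'genres' column contains faulty values like 'Carousel Productions' or 'Vision View Entertainment', which sound like production companies, not genres. They no longer show up in the output above, presumably because the affected rows were already removed by the vote filter, but let's still restrict the column to a fixed list of genres. Note that this list also drops 'Foreign', 'Music' and 'TV Movie'.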

In [15]:
# Define the list of valid genre names
valid_genres = {
    'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
    'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Mystery',
    'Romance', 'Science Fiction', 'Thriller', 'War', 'Western'
}
# Filter the 'genres' column to include only the valid genres
df["genres"] = df["genres"].apply(lambda x: [genre for genre in x if genre in valid_genres])

Now let's check again

In [16]:
flat_genre_names = [genre for sublist in df["genres"] for genre in sublist]
unique_genre_names = set(flat_genre_names)
print(f"There are {len(unique_genre_names)} unique genres.")
print(unique_genre_names)

There are 17 unique genres.
{'Mystery', 'Fantasy', 'Thriller', 'Western', 'Adventure', 'Romance', 'Comedy', 'Family', 'Action', 'Science Fiction', 'Animation', 'Crime', 'Documentary', 'Drama', 'Horror', 'War', 'History'}


In [17]:
df["genres"].value_counts().head(7)

genres
[Comedy]                    808
[Drama]                     785
[Comedy, Drama]             423
[Drama, Romance]            398
[Comedy, Drama, Romance]    355
[Comedy, Romance]           298
[Horror, Thriller]          262
Name: count, dtype: int64

In [18]:
# Convert the list of genres into a string with comma as a delimiter
df["genres"] = df["genres"].apply(lambda x: ", ".join(x) if x else None)

In [19]:
df["genres"].value_counts().head(7)

genres
Comedy                    808
Drama                     785
Comedy, Drama             423
Drama, Romance            398
Comedy, Drama, Romance    355
Comedy, Romance           298
Horror, Thriller          262
Name: count, dtype: int64

In [20]:
df.head().transpose()

Unnamed: 0,0,1,2,4,5
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect...","{'id': 96871, 'name': 'Father of the Bride Col...",
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance",Comedy,"Action, Crime, Drama, Thriller"
id,862,8844,15602,11862,949
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00,1995-02-10 00:00:00,1995-12-15 00:00:00
title,Toy Story,Jumanji,Grumpier Old Men,Father of the Bride Part II,Heat
vote_average,7.7,6.9,6.5,5.7,7.7
vote_count,5415,2413,92,173,1886


## Working with 'belongs_to_collection' column

In [21]:
def extract_franchise_name(x):
    try:
        # Use literal_eval to safely evaluate the string as a Python dictionary
        # Extract the 'name' value from the dictionary
        return literal_eval(x)["name"]
    except (ValueError, TypeError):
        return None

# Apply the extract_franchise_name function to each value in the 'belongs_to_collection' column
df["franchise"] = df["belongs_to_collection"].apply(extract_franchise_name).str.strip()
# Remove the word 'Collection' (capitalized or lowercase) from the end of each franchise name
df["franchise"] = df["franchise"].str.replace(r"[Cc]ollection$", "", regex=True)
# Strip leading and trailing whitespace
df["franchise"] = df["franchise"].str.strip()
df = df.drop(["belongs_to_collection"], axis=1)

In [22]:
df["franchise"].value_counts().head()

franchise
James Bond               26
Dragon Ball Z (Movie)    15
Pokémon                  12
Friday the 13th          12
Fantozzi                 10
Name: count, dtype: int64
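
The pattern `[Cc]ollection$` matches only 'Collection' and 'collection'. If a fully case-insensitive match were wanted, one option is the `flags` argument of `Series.str.replace`. A sketch with mostly hypothetical names:

```python
import re

import pandas as pd

# 'Toy Story Collection' appears in the data; the other two names are made up
names = pd.Series([
    "Toy Story Collection",
    "Some Saga COLLECTION",
    "The Collection Chronicles",
])

# Remove a trailing 'collection' in any casing, together with the space before it
cleaned = names.str.replace(r"\s*collection$", "", flags=re.IGNORECASE, regex=True)
print(cleaned.tolist())
```

The `$` anchor keeps the word intact when it appears in the middle of a name.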

Time to rearrange the columns a little, because I'm not happy with their order.

## Changing column order

In [23]:
new_cols_order = [
    "id",
    "title",
    "franchise",
    "release_date",
    "genres",
    "vote_average",
    "vote_count"
]
df = df[new_cols_order]
df.head(3).transpose()

Unnamed: 0,0,1,2
id,862,8844,15602
title,Toy Story,Jumanji,Grumpier Old Men
franchise,Toy Story,,Grumpy Old Men
release_date,1995-10-30 00:00:00,1995-12-15 00:00:00,1995-12-22 00:00:00
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance"
vote_average,7.7,6.9,6.5
vote_count,5415,2413,92


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10349 entries, 0 to 45441
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            10349 non-null  int32         
 1   title         10349 non-null  object        
 2   franchise     2135 non-null   object        
 3   release_date  10349 non-null  datetime64[ns]
 4   genres        10332 non-null  object        
 5   vote_average  10349 non-null  float64       
 6   vote_count    10349 non-null  int32         
dtypes: datetime64[ns](1), float64(1), int32(2), object(3)
memory usage: 566.0+ KB


## Adding data from other two datasets

In [23]:
credits = pd.read_csv('data/credits.csv')
keywords = pd.read_csv('data/keywords.csv')

In [24]:
df = df.merge(credits, on='id')
df = df.merge(keywords, on='id')
df.head(3)

Unnamed: 0,id,title,franchise,release_date,genres,vote_average,vote_count,cast,crew,keywords
0,862,Toy Story,Toy Story,1995-10-30,"Animation, Comedy, Family",7.7,5415,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,,1995-12-15,"Adventure, Family, Fantasy",6.9,2413,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,Grumpy Old Men,1995-12-22,"Comedy, Romance",6.5,92,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
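
One thing to watch with `merge`: if an id occurs more than once in the right-hand table, the matching movie row is repeated. That would be consistent with the index later running up to 10512 although only 10,349 movies went into the merge, and with the `drop_duplicates` call in the final preparation step. The sketch below uses toy frames (all values hypothetical) and the `validate` parameter of `merge` to surface the problem early:

```python
import pandas as pd

movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
# Hypothetical lookup table in which id 1 appears twice
extra = pd.DataFrame({"id": [1, 1, 2], "keywords": ["x", "x", "y"]})

# A plain merge silently repeats the row for id 1
merged = movies.merge(extra, on="id")
print(len(merged))

# validate= makes pandas raise instead of duplicating rows
try:
    movies.merge(extra, on="id", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# De-duplicating the key first keeps one row per movie
deduped = movies.merge(extra.drop_duplicates(subset="id"), on="id", validate="one_to_one")
print(raised, len(deduped))
```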


## Extracting director

In [25]:
def extract_director(crew_list):
    for crew_member in crew_list:
        if crew_member["job"] == "Director":
            return crew_member["name"]
    return None

# For extract_director to work, convert the string representations to actual lists of dictionaries
df["crew"] = df["crew"].apply(literal_eval)
# Extract the director's name for each movie
df["director"] = df["crew"].apply(extract_director)

In [26]:
df["director"].value_counts().head()

director
Woody Allen         47
Alfred Hitchcock    36
Clint Eastwood      33
Steven Spielberg    31
Martin Scorsese     25
Name: count, dtype: int64

In [28]:
df[df["director"] == "Martin Scorsese"][["title", "release_date", "genres", "director"]].head(10)

Unnamed: 0,title,release_date,genres,director
14,Casino,1995-11-22,"Crime, Drama",Martin Scorsese
70,Taxi Driver,1976-02-07,"Crime, Drama",Martin Scorsese
233,The Age of Innocence,1993-09-17,"Drama, Romance",Martin Scorsese
631,GoodFellas,1990-09-12,"Crime, Drama",Martin Scorsese
646,Raging Bull,1980-11-14,Drama,Martin Scorsese
734,Cape Fear,1991-11-15,"Crime, Thriller",Martin Scorsese
906,Kundun,1997-12-25,Drama,Martin Scorsese
1060,The Last Temptation of Christ,1988-08-12,Drama,Martin Scorsese
1354,The Color of Money,1986-10-07,Drama,Martin Scorsese
1629,Bringing Out the Dead,1999-10-22,Drama,Martin Scorsese


## Extracting top actors

In [29]:
def extract_actors(cast_list):
    top_actors = []
    for actor in cast_list[:3]:  # Select the top 3 actors
        top_actors.append(actor["name"])
    return ", ".join(top_actors)

# Convert the string representations to actual lists of dictionaries
df["cast"] = df["cast"].apply(literal_eval)
# Extract the top 3 actor names for each movie
df["top_actors"] = df["cast"].apply(extract_actors)

In [31]:
df[df["title"] == "The Empire Strikes Back"][["title", "release_date", "genres", "director", "top_actors"]]

Unnamed: 0,title,release_date,genres,director,top_actors
615,The Empire Strikes Back,1980-05-17,"Action, Adventure, Science Fiction",Irvin Kershner,"Mark Hamill, Harrison Ford, Carrie Fisher"


## Extracting keywords

In [32]:
from collections import Counter
import pandas as pd

# Convert the string representations to actual lists of dictionaries
df["keywords"] = df["keywords"].apply(literal_eval)
# Flatten the list of dictionaries in the 'keywords' column
keywords = [keyword["name"] for sublist in df["keywords"] for keyword in sublist]
# Count the frequencies of each keyword
keyword_counts = Counter(keywords)
# Sort the keywords based on their frequencies in descending order
sorted_keywords = sorted(keyword_counts.items(), key=lambda x: x[1], reverse=True)
# Remove keywords that rarely occur
sorted_keywords_filtered = [(keyword, count) for keyword, count in sorted_keywords if count > 9]
# Create a set of keywords that appear in sorted_keywords_filtered
filtered_keywords_set = set([keyword for keyword, _ in sorted_keywords_filtered])
# Print the sorted and filtered keywords
for keyword, count in sorted_keywords_filtered[:10]:
    print(f"{keyword}: {count}")

woman director: 655
murder: 480
independent film: 434
based on novel: 387
duringcreditsstinger: 385
violence: 340
revenge: 259
sex: 252
police: 232
suspense: 230


In [33]:
# Print the number of unique keywords
num_unique_keywords_before = len(keyword_counts)
print(f"Number of unique keywords before filtering: {num_unique_keywords_before}")
num_unique_keywords_after = len(sorted_keywords_filtered)
print(f"Number of unique keywords after filtering: {num_unique_keywords_after}")

Number of unique keywords before filtering: 13953
Number of unique keywords after filtering: 1553
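
The frequency filter can be seen in miniature below. This is a sketch on hypothetical keyword lists; the threshold is lowered from the notebook's `count > 9` so that the toy data produces a visible result.

```python
from collections import Counter

# Hypothetical keyword lists for four movies
movie_keywords = [
    ["murder", "revenge"],
    ["murder", "police"],
    ["murder", "revenge", "heist"],
    ["police"],
]

# Count how often each keyword occurs across all movies
counts = Counter(kw for kws in movie_keywords for kw in kws)

# Keep only keywords that occur at least min_count times
min_count = 2
frequent = {kw for kw, n in counts.items() if n >= min_count}

# Per movie, keep at most 5 frequent keywords in their original order
per_movie = [[kw for kw in kws if kw in frequent][:5] for kws in movie_keywords]

print(counts.most_common(1))
print(sorted(frequent))
print(per_movie[2])
```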


In [34]:
def filter_keywords(keywords_list):
    # Keep at most 5 keywords that passed the frequency filter, in original order
    filtered_keywords = []
    for keyword in keywords_list:
        if keyword["name"] in filtered_keywords_set:
            filtered_keywords.append(keyword["name"])
            if len(filtered_keywords) == 5:
                break
    return ", ".join(filtered_keywords)

# Add a new column 'filtered_keywords' to the DataFrame
df['filtered_keywords'] = df['keywords'].apply(filter_keywords)

In [35]:
df[["keywords", "filtered_keywords"]]

Unnamed: 0,keywords,filtered_keywords
0,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","jealousy, toy, boy, friendship, friends"
1,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","disappearance, based on children's book"
2,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","fishing, best friend, duringcreditsstinger"
3,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","baby, midlife crisis, confidence, aging, daughter"
4,"[{'id': 642, 'name': 'robbery'}, {'id': 703, '...","robbery, detective, bank, obsession, chase"
...,...,...
10508,[],
10509,[],
10510,"[{'id': 2652, 'name': 'nazis'}, {'id': 3098, '...","nazis, castle, time travel"
10511,"[{'id': 9673, 'name': 'love'}, {'id': 13130, '...","love, teenager, lgbt, short"


## Final steps of data preparation with Pandas

In [36]:
# Drop columns 'id', 'cast', 'crew', and 'keywords'
df = df.drop(
    columns=[
        "id",
        "cast",
        "crew",
        "keywords"
    ]
)
# Keep only the first four characters, i.e. the year
df["release_date"] = df["release_date"].astype(str).str[:4]
# Rename 'filtered_keywords' to 'keywords' and 'release_date' to 'release_year'
df = df.rename(columns={"filtered_keywords": "keywords", "release_date": "release_year"})
# Drop duplicate rows
df = df.drop_duplicates()
# Add the new 'id' column as the first column
df.insert(0, "id", range(1, 1 + len(df)))
df.head().transpose()

Unnamed: 0,0,1,2,3,4
id,1,2,3,4,5
title,Toy Story,Jumanji,Grumpier Old Men,Father of the Bride Part II,Heat
franchise,Toy Story,,Grumpy Old Men,Father of the Bride,
release_year,1995,1995,1995,1995,1995
genres,"Animation, Comedy, Family","Adventure, Family, Fantasy","Comedy, Romance",Comedy,"Action, Crime, Drama, Thriller"
vote_average,7.7,6.9,6.5,5.7,7.7
vote_count,5415,2413,92,173,1886
director,John Lasseter,Joe Johnston,Howard Deutch,Charles Shyer,Michael Mann
top_actors,"Tom Hanks, Tim Allen, Don Rickles","Robin Williams, Jonathan Hyde, Kirsten Dunst","Walter Matthau, Jack Lemmon, Ann-Margret","Steve Martin, Diane Keaton, Martin Short","Al Pacino, Robert De Niro, Val Kilmer"
keywords,"jealousy, toy, boy, friendship, friends","disappearance, based on children's book","fishing, best friend, duringcreditsstinger","baby, midlife crisis, confidence, aging, daughter","robbery, detective, bank, obsession, chase"


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10342 entries, 0 to 10512
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            10342 non-null  int64  
 1   title         10342 non-null  object 
 2   franchise     2134 non-null   object 
 3   release_year  10342 non-null  object 
 4   genres        10325 non-null  object 
 5   vote_average  10342 non-null  float64
 6   vote_count    10342 non-null  int32  
 7   director      10325 non-null  object 
 8   top_actors    10342 non-null  object 
 9   keywords      10342 non-null  object 
dtypes: float64(1), int32(1), int64(1), object(7)
memory usage: 848.4+ KB


Columns explanation:
- id - row id
- title - official title of the movie
- franchise - name of the collection the movie belongs to, if any
- release_year - year of the theatrical release of the movie
- genres - genres associated with the movie, separated by a comma
- vote_average - average movie rating
- vote_count - number of votes by users, counted by TMDB
- director - name of the movie director
- top_actors - names of the top 3 actors in the movie
- keywords - up to 5 keywords associated with the movie

Let's save the cleaned-up dataset, which we'll use in the next part.

In [40]:
# df.to_csv("data/recommender_data.csv", index=False)

# 2. Content-based recommender

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', None)

df = pd.read_csv("data/recommender_data.csv", index_col=0)
df.head(3)

id,title,franchise,release_year,genres,vote_average,vote_count,director,top_actors,keywords
1,Toy Story,Toy Story,1995,"Animation, Comedy, Family",7.7,5415,John Lasseter,"Tom Hanks, Tim Allen, Don Rickles","jealousy, toy, boy, friendship, friends"
2,Jumanji,,1995,"Adventure, Family, Fantasy",6.9,2413,Joe Johnston,"Robin Williams, Jonathan Hyde, Kirsten Dunst","disappearance, based on children's book"
3,Grumpier Old Men,Grumpy Old Men,1995,"Comedy, Romance",6.5,92,Howard Deutch,"Walter Matthau, Jack Lemmon, Ann-Margret","fishing, best friend, duringcreditsstinger"


## Handling movie title

The idea behind this recommender is that you enter the title of a movie you liked and get a list of similar movies. First, however, we need to check whether a movie with the typed title exists in the database at all. Spelling mistakes may also occur.

To handle these issues, the 'fuzzywuzzy' library helps.

In [2]:
from fuzzywuzzy import process

# def find_top_movies(title):
#     all_titles = df["title"].tolist()
#     matches = process.extract(title, all_titles, limit=7)
#     return matches
#     matched_titles = [match[0] for match in matches]
#     return df[df['title'].isin(matched_titles)]

# print(find_top_movies("ty stry"))

def find_movie(input_title):
    """
    Summary:
        Finds a movie in the DataFrame that closely matches the input title. Handles some spelling mistakes
    Parameters:
        input_title (str): The input movie title.
    Returns:
        DataFrame: A DataFrame row containing information about the matched movie.
    """
    all_titles = df["title"].tolist()
    # extractOne returns the best match as a (title, score) tuple
    closest_match = process.extractOne(input_title, all_titles)
    matched_title = closest_match[0]
    return df[df["title"] == matched_title]

find_movie("stare was")

id,title,franchise,release_year,genres,vote_average,vote_count,director,top_actors,keywords
153,Star Wars,Star Wars,1977,"Action, Adventure, Science Fiction",8.1,6778,George Lucas,"Mark Hamill, Harrison Ford, Carrie Fisher","android, rescue mission, rebellion, planet, space opera"
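
Note that `find_movie` always returns the best match, even when the input resembles nothing in the database. The sketch below shows one way to reject poor matches with a similarity cutoff. It deliberately swaps in the standard library's `difflib` instead of fuzzywuzzy so that it runs without extra packages; `closest_title` and the short title list are hypothetical helpers for illustration.

```python
from difflib import get_close_matches

titles = ["Toy Story", "Jumanji", "Star Wars", "Heat"]

def closest_title(query, titles, cutoff=0.6):
    """Return the best-matching title, or None if nothing is similar enough."""
    # Compare in lowercase, but map back to the original spelling
    lowered = {t.lower(): t for t in titles}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(closest_title("stare was", titles))
print(closest_title("qqqqzzzz", titles))
```

fuzzywuzzy's `process.extractOne` accepts a `score_cutoff` argument that serves a similar purpose.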


## Getting cosine similarity matrix

In [3]:
import re

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def preprocess(text):
    """
    Summary:
        Preprocesses the input text by removing non-alphanumeric characters,
        converting to lowercase, tokenizing, and filtering out stopwords.
    Parameters:
        text (str): Input text to be preprocessed.
    Returns:
        str: Preprocessed text
    """
    # Handle NaN
    if not isinstance(text, str):
        return ""
    # Remove non-alphanumeric characters, convert to lowercase, and
    # strip leading/trailing whitespace. Flags must be passed by keyword:
    # the fourth positional argument of re.sub is 'count', not 'flags'
    text = re.sub(r"[^0-9a-zA-Z\s]", "", text, flags=re.I | re.A).lower().strip()
    # Tokenize the text using WordPunctTokenizer from NLTK
    wpt = nltk.WordPunctTokenizer()
    # Get the list of stopwords in English from NLTK
    stop_words = nltk.corpus.stopwords.words("english")
    # Tokenize and filter out stopwords to create a new list of tokens
    tokens = wpt.tokenize(text)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Join the filtered tokens back into a single text string
    return " ".join(filtered_tokens)

def calculate_cosine_sim(df=df):
    """
    Summary:
        Calculate the cosine similarity matrix for a DataFrame containing movie features.
    Parameters:
        df (DataFrame): DataFrame containing movie data
    Returns:
        ndarray: Cosine similarity matrix for the movie features.
    """
    df = df[["franchise", "director", "top_actors", "genres", "keywords"]].copy()
    # Repeat the director to give this column more weight
    df["combined"] = (df["franchise"].fillna("") + "; " +
                    df["director"].fillna("") + "; " +
                    df["director"].fillna("") + "; " +
                    df["top_actors"].fillna("") + "; " +
                    df["genres"].fillna("") + "; " +
                    df["keywords"].fillna(""))
    df["preproc"] = df["combined"].apply(preprocess)
    cv = CountVectorizer()
    cv_matrix = cv.fit_transform(df["preproc"])
    cosine_sim = cosine_similarity(cv_matrix, cv_matrix)
    return cosine_sim

cosine_sim = calculate_cosine_sim()
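
To see why repeating the director gives that column more weight, here is a hand-rolled sketch of cosine similarity on word counts. It uses plain Python rather than `CountVectorizer` and `cosine_similarity`, and the token 'directorx' and the two toy movies are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(text_a, text_b):
    """Cosine similarity between two whitespace-tokenized bags of words."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two hypothetical movies that share only the director token
once = cosine("directorx action thriller", "directorx drama mystery")
twice = cosine("directorx directorx action thriller", "directorx directorx drama mystery")
print(round(once, 3), round(twice, 3))
```

With the director counted once, the shared token contributes 1 to the dot product; counted twice, it contributes 4, so the similarity rises from 1/3 to 2/3.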

## Recommender

In [4]:
def recommender(input_title, numb_of_recommendations=3):
    """
    Summary:
        Recommends similar movies based on the input movie title.
    Parameters:
        input_title (str): The title of the input movie. Some degree of spelling mistakes is allowed.
        numb_of_recommendations (int): Number of recommended movies to return. Default is 3.
    Returns:
        DataFrame: A DataFrame containing information of the recommended movies.
    """
    movie = find_movie(input_title)
    title = movie["title"].values[0]
    release_year = movie["release_year"].values[0]
    title_year = f"'{title}' from {release_year}"
    print(f"For the input '{input_title}', the closest match is {title_year}.")
    print(
        f"""
{title_year} has the following characteristics used in the recommender system:
    franchise: {movie["franchise"].values[0]}.
    director: {movie["director"].values[0]}.
    top actors: {movie["top_actors"].values[0]}.
    genres: {movie["genres"].values[0]}.
    keywords: {movie["keywords"].values[0]}.
          """
    )
    # The 'id' index starts from 1 (it was prepared for an SQL database),
    # while positions in the similarity matrix start from 0. Thus, subtract 1
    idx = movie.index[0] - 1
    # Get the similarity scores of all movies with the movie above
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Select the top 'numb_of_recommendations' similar movies (excluding the input movie itself)
    sim_scores = sim_scores[1 : (numb_of_recommendations + 1)]
    # Extract the indices of the top similar movies
    similar_movies = [i[0] for i in sim_scores]
    print(f"Similar movies to {title_year}:")
    return df.iloc[similar_movies]

recommender("felloship ring")

For the input 'felloship ring', the closest match is 'The Lord of the Rings: The Fellowship of the Ring' from 2001.

'The Lord of the Rings: The Fellowship of the Ring' from 2001 has the following characteristics used in the recommender system:
    franchise: The Lord of the Rings.
    director: Peter Jackson.
    top actors: Elijah Wood, Ian McKellen, Cate Blanchett.
    genres: Action, Adventure, Fantasy.
    keywords: elves, dwarves, based on novel, mountain, fireworks.
          
Similar movies to 'The Lord of the Rings: The Fellowship of the Ring' from 2001:


id,title,franchise,release_year,genres,vote_average,vote_count,director,top_actors,keywords
3017,The Lord of the Rings: The Two Towers,The Lord of the Rings,2002,"Action, Adventure, Fantasy",8.0,7641,Peter Jackson,"Elijah Wood, Ian McKellen, Viggo Mortensen","elves, based on novel, explosive, cave, army"
3487,The Lord of the Rings: The Return of the King,The Lord of the Rings,2003,"Action, Adventure, Fantasy",8.1,8226,Peter Jackson,"Elijah Wood, Ian McKellen, Viggo Mortensen","elves, based on novel, suspicion, bravery, war"
7174,The Hobbit: An Unexpected Journey,The Hobbit,2012,"Action, Adventure, Fantasy",7.0,8427,Peter Jackson,"Ian McKellen, Martin Freeman, Richard Armitage","riddle, elves, dwarves, mountain, wizard"
