# Assignment 4: Unsupervised Learning—Clustering and Recommendations
## Group 105
- Natasa Bolic (300241734)
- Brent Palmer (300193610)
## Imports

In [1]:
# imports
import pandas as pd
import numpy as np
import ast
import Levenshtein
import math
from scipy.spatial import distance

## Introduction

## Dataset Description

**Url:** https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset <br>
**Name:** The Movies Dataset <br>
**Author:** The dataset was uploaded by Rounak Banik. The data was collected from the TMDB Open API and the Official GroupLens website. <br>
**Purpose:** The original purpose of the dataset was to learn about the history of cinema through EDA and to build various types of recommender systems for movies. Some suggested uses for the dataset are building models to predict movie revenue or success, and building content-based and collaborative-filtering-based recommendation systems. <br>
**Shape:** The data we are using is separated into two files.
- `movies_metadata.csv`
    - There are 45466 rows and 24 columns. (45466, 24)
- `ratings_small.csv`
    - There are 100004 rows and 4 columns. (100004, 4)

**Features:**
- `movies_metadata.csv`
    - `adult` (categorical): Indicates (True/False) if the movie is rated as "Adults Only", meaning that the content is not suitable for minors.
    - `belongs_to_collection` (categorical): The movie series/collection that the movie belongs to, stored as a string representation of a dictionary that describes the series with keys 'id', 'name', 'poster_path' and 'backdrop_path'.
    - `budget` (numerical): The budget of the movie (in USD).
    - `genres` (categorical): The genres of the movie, stored as a string representation of a list where each item is a dictionary representing a genre with keys 'id' and 'name'.
    - `homepage` (categorical): The link to the official homepage of the movie.
    - `id` (categorical): The movie's ID.
    - `imdb_id` (categorical): The IMDB ID for the movie.
    - `original_language` (categorical): The original language of the movie, abbreviated to the two-letter format (i.e. the ISO 639-1 code).
    - `original_title` (categorical): The title of the movie in its original language.
    - `overview` (categorical): A short description of the movie.
    - `popularity` (numerical): The score assigned by TMDB to quantify the movie's popularity.
    - `poster_path` (categorical): The path segment of the URL to the poster image of the movie.
    - `production_companies` (categorical): The production companies that were involved with the production of the movie, stored as a string representation of a list where each item is a dictionary representing a company with keys 'id' and 'name'.
    - `production_countries` (categorical): The countries where the movie was made, stored as a string representation of a list where each item is a dictionary representing a country with keys 'iso_3166_1' (2-letter format) and 'name'.
    - `release_date` (categorical): The release date of the movie in the format YYYY-MM-DD.
    - `revenue` (numerical): The total revenue generated by the movie (in USD). 
    - `runtime` (numerical): The length of the movie (in minutes).
    - `spoken_languages` (categorical): The languages spoken in the movie, stored as a string representation of a list where each item is a dictionary representing a language with keys 'iso_639_1' (2-letter format) and 'name'.
    - `status` (categorical): The status of the movie, which is either released, rumored, planned, post production, or in production.
    - `tagline` (categorical): The movie's tagline.
    - `title` (categorical): The title of the movie in English.
    - `video` (categorical): Indicates (True/False) if the movie has an official video associated with it in the TMDB API.
    - `vote_average` (numerical): The average rating given to the movie by users.
    - `vote_count` (numerical): The number of users that rated the movie.
- `ratings_small.csv`
    - `userId` (categorical): The ID of the user that provided the rating.
    - `movieId` (categorical): The ID of the movie that was rated.
    - `rating` (numerical): The rating given to the movie by the user.
    - `timestamp` (categorical): The timestamp at which the rating was provided.

## Loading Data and Basic Exploration

### Loading Movies Metadata Dataset

In [2]:
# Read in the movies metadata dataset from a public repository
url = "https://raw.githubusercontent.com/BrentMRPalmer/CSI4142-A4/refs/heads/main/movies_metadata.csv"
metadata_df = pd.read_csv(url, low_memory=False)
metadata_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [4]:
metadata_df.shape

(45466, 24)

### Loading Ratings Small Dataset

In [5]:
# Read in the ratings small dataset from a public repository
url = "https://raw.githubusercontent.com/BrentMRPalmer/CSI4142-A4/refs/heads/main/ratings_small.csv"
ratings_df = pd.read_csv(url, low_memory=False)
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [6]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [7]:
ratings_df.shape

(100004, 4)

## Data Preparation

To prepare the data, we use our validation techniques from Assignment 2 to clean the data and our EDA techniques from Assignment 1 to visualize the data.

There are two datasets to prepare, the first holding the metadata in `metadata_df`, and the second representing the user ratings in `ratings_df`. We perform the preparation on one dataset at a time.

### Cleaning and EDA of Metadata

We begin by cleaning and performing EDA on the metadata dataset.

#### Cleaning the Data

##### Validity Check 1: Exact Duplicates

We will first check for exact duplicates in the dataset, verifying that there are no rows that are identical over all columns.

**References:** <br>
Duplicated: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html <br>
Drop duplicates: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
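
To make the effect of `keep=False` concrete, the following is a minimal sketch on a small, made-up DataFrame (the column names and values are illustrative assumptions, not rows from the dataset). It shows that `keep=False` flags every member of a duplicate group, whereas `drop_duplicates()` removes only the later occurrences.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are identical
toy_df = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "year": [1995, 1996, 1995, 1997],
})

# Default keep="first": only the later occurrences are flagged
flagged_default = toy_df.duplicated()

# keep=False: every member of a duplicate group is flagged
flagged_all = toy_df.duplicated(keep=False)

# drop_duplicates() keeps the first occurrence of each group
deduplicated = toy_df.drop_duplicates()

print(flagged_default.sum(), flagged_all.sum(), len(deduplicated))
```

This is why the count reported with `keep=False` in this notebook (33) is larger than the number of rows actually removed (45466 - 45449 = 17).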

In [8]:
# Exact duplicates check

# Apply the .duplicated method to the DataFrame to create a Series, with exact duplicates set to True
# keep=False will mark all duplicates as True (including the first and last occurrences)
duplicates = metadata_df.duplicated(keep=False)

# Print the number of rows that are exact duplicates
print(f"Number of duplicate rows: {duplicates.sum()}\n")

Number of duplicate rows: 33



Since there are 33 duplicate rows, let us further investigate the actual rows to determine how to handle them.

In [9]:
# Display the first 3 rows that are exact duplicates
print("Examples of three duplicate rows:")
metadata_df.loc[duplicates].head(3)

Examples of three duplicate rows:


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
676,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,105045,tt0111613,de,Das Versprechen,"East-Berlin, 1961, shortly after the erection ...",...,1995-02-16,0.0,115.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,"A love, a hope, a wall.",The Promise,False,5.0,1.0
1465,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,105045,tt0111613,de,Das Versprechen,"East-Berlin, 1961, shortly after the erection ...",...,1995-02-16,0.0,115.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,"A love, a hope, a wall.",The Promise,False,5.0,1.0
7345,False,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,5511,tt0062229,fr,Le Samouraï,Hitman Jef Costello is a perfectionist who alw...,...,1967-10-25,39481.0,105.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,There is no solitude greater than that of the ...,Le Samouraï,False,7.9,187.0


As expected, the rows are exact duplicates. The movie `Das Versprechen` appears twice in the results above. We can safely remove the duplicates. We will store the results in a new DataFrame called `cleaned_metadata_df`. All subsequent cleaning will be done to the new DataFrame.

In [10]:
# Create a copy of the metadata DataFrame
cleaned_metadata_df = metadata_df.copy()

# Drop duplicates (retains first instance of a duplicated row)
cleaned_metadata_df = cleaned_metadata_df.drop_duplicates()

# verify no duplicates remain
duplicates = cleaned_metadata_df.duplicated(keep=False)

# Print the number of rows that are exact duplicates
print(f"Number of duplicate rows: {duplicates.sum()}\n")

Number of duplicate rows: 0



Since there are no remaining duplicate rows, the DataFrame has been successfully cleaned.

##### Genre Format

The dataset stores the genres of each movie as a string representation of a list of dictionaries, where each dictionary has the keys 'id' and 'name'. We convert each string into a list that contains only the genre names.

**References:** <br>
Literal eval: https://docs.python.org/3/library/ast.html#ast.literal_eval <br>
Capitalize: https://www.w3schools.com/python/ref_string_capitalize.asp
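
As a small illustration of the conversion, the sketch below parses one string in the style of the `genres` column. The input string is a hypothetical example written for this illustration. Note that `ast.literal_eval` only evaluates Python literals, and that `str.capitalize()` lowercases everything after the first character.

```python
import ast

# Hypothetical raw value in the style of the `genres` column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 878, 'name': 'Science Fiction'}]"

# Parse the string into a list of dictionaries
parsed = ast.literal_eval(raw)

# Keep only the names; capitalize() uppercases the first letter
# and lowercases the rest of the string
names = [genre["name"].capitalize() for genre in parsed]

print(names)  # ['Animation', 'Science fiction']
```

Multi-word genre names therefore end up with only their first word capitalized.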

In [11]:
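def extract_genre(genres):
    # Convert the string representation into a list of dictionaries
    genres = ast.literal_eval(genres)
    # Keep only each genre's name; capitalize() uppercases the first
    # letter and lowercases the rest, so a separate lower() is not needed
    return [genre["name"].capitalize() for genre in genres]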

# Apply the function to the genres attribute, updating the genres to the processed list 
cleaned_metadata_df["genres"] = cleaned_metadata_df["genres"].apply(extract_genre)

# Check the cleaned results
print("Examples of the cleaned genres feature: ")
cleaned_metadata_df["genres"]

Examples of the cleaned genres feature: 


0         [Animation, Comedy, Family]
1        [Adventure, Fantasy, Family]
2                   [Romance, Comedy]
3            [Comedy, Drama, Romance]
4                            [Comedy]
                     ...             
45461                 [Drama, Family]
45462                         [Drama]
45463       [Action, Drama, Thriller]
45464                              []
45465                              []
Name: genres, Length: 45449, dtype: object

##### Production Company Format

In [12]:
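def handle_false_company(companies):
    # Some rows store the string "False" instead of a list of companies;
    # treat those rows as having no production companies
    if companies == "False":
        return []
    return companies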

cleaned_metadata_df["production_companies"] = cleaned_metadata_df["production_companies"].apply(handle_false_company)

In [13]:
def extract_companies(companies):
    # Handle missing companies
    if companies == [] or pd.isna(companies): 
        return []
        
    # Convert from string to list
    companies = ast.literal_eval(companies)
    
    for i in range(len(companies)):
        companies[i] = companies[i]["name"].lower().capitalize()
    return companies

# Apply the function to the production_companies attribute, updating the production companies to the processed list
cleaned_metadata_df["production_companies"] = cleaned_metadata_df["production_companies"].apply(extract_companies)

# Check the cleaned results
print("Examples of the cleaned production companies feature: ")
cleaned_metadata_df["production_companies"]

Examples of the cleaned production companies feature: 


0                                [Pixar animation studios]
1        [Tristar pictures, Teitler film, Interscope co...
2                           [Warner bros., Lancaster gate]
3                 [Twentieth century fox film corporation]
4             [Sandollar productions, Touchstone pictures]
                               ...                        
45461                                                   []
45462                                        [Sine olivia]
45463                            [American world pictures]
45464                                          [Yermoliev]
45465                                                   []
Name: production_companies, Length: 45449, dtype: object

##### Validity Check 2: Data Type Check

In [14]:
# Data type check

# Evaluates whether a single value is stored as an instance of the desired data type
def type_filter_method1(value, test_datatype):
    if pd.isna(value):
        return False
    return isinstance(value, test_datatype)

# Evaluates whether a single value can be converted to the desired data type
def type_filter_method2(value, test_datatype):
    if pd.isna(value):
        return False
    if test_datatype == int:
        try:
            value = float(value)
            return value % 1 == 0
        except Exception as e:
            return False
    else:
        try:
            value = test_datatype(value)
            return True
        except Exception as e:
            return False

# Create a dictionary that maps each attribute to its correct datatype
data_type_dict = {
    "budget": float
}

# Apply the function to every feature, setting rows whose value is not stored as the correct datatype to True
for feature in data_type_dict.keys():
    invalid_datatype = cleaned_metadata_df[feature].apply(
        lambda attribute: not type_filter_method1(attribute, data_type_dict[feature])
    )
    # Print the number of rows with a value that is not stored as the correct datatype for the designated attribute
    print(f"Number of rows where the {feature} value is not stored as the correct datatype ({data_type_dict[feature]}): {invalid_datatype.sum()}")

Number of rows where the budget value is not stored as the correct datatype (<class 'float'>): 45449


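Every one of the 45449 rows fails this strict check because `budget` is stored as strings (dtype `object`), so no value is an instance of `float`. We therefore repeat the check with the second method, which tests whether each value can be converted to a float.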

In [15]:
# Apply the function to every feature, setting rows whose value is not stored as the correct datatype to True
for feature in data_type_dict.keys():
    invalid_datatype = cleaned_metadata_df[feature].apply(
        lambda attribute: not type_filter_method2(attribute, data_type_dict[feature])
    )
    # Print the number of rows with a value that is not stored as the correct datatype for the designated attribute
    print(f"Number of rows where the {feature} value is not stored as the correct datatype ({data_type_dict[feature]}): {invalid_datatype.sum()}")

Number of rows where the budget value is not stored as the correct datatype (<class 'float'>): 3


In [16]:
invalid_datatype = cleaned_metadata_df[feature].apply(
    lambda attribute: not type_filter_method2(attribute, data_type_dict[feature])
)

# Save the invalid rows
invalid_datatype_df = cleaned_metadata_df.loc[invalid_datatype]

# Print the number of rows where the test attribute value contains an incorrect datatype
print(f"Number of rows where the budget value's data type is not float: {invalid_datatype.sum()}\n")

# Display the first 3 rows where the test attribute value contains an incorrect datatype
print(f"Examples of three rows where the budget value's data type is not float:")
invalid_datatype_df.head(5)

Number of rows where the budget value's data type is not float: 3

Examples of three rows where the budget value's data type is not float:


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[Carousel productions, Vision view entertainme...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1,,,,,,,,,
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[Aniplex, Gohands, Brosta tv, Mardock scramble...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,12,,,,,,,,,
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[Odyssey media, Pulser productions, Rogue stat...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,22,,,,,,,,,


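These three rows are malformed: their values are shifted into the wrong columns, so `budget` contains a poster image path instead of a number. We use `pd.to_numeric` with `errors="coerce"` to convert the column to numbers, which turns unparseable values into NaN, and then fill those values with 0, which is the value already used in the dataset for movies without a budget figure.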

**References:** <br>
To Numeric: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html <br>
Fillna: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
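
The coercion step can be sketched on a tiny Series. The values below are illustrative assumptions (the last one imitates a poster path that ended up in the `budget` column).

```python
import pandas as pd

# Hypothetical budget values stored as strings; the last one is not a number
budgets = pd.Series(["30000000", "0", "/ff9qCepilowshEtG2GYWwzt2bs4.jpg"])

# errors="coerce" converts values that cannot be parsed into NaN
as_numbers = pd.to_numeric(budgets, errors="coerce")

# Replace the NaN values with 0
cleaned = as_numbers.fillna(0)

print(cleaned.tolist())  # [30000000.0, 0.0, 0.0]
```

After this step the Series has a numeric dtype, so later arithmetic on the column works without further conversion.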

In [17]:
cleaned_metadata_df["budget"] = pd.to_numeric(cleaned_metadata_df["budget"], errors="coerce").fillna(0)

invalid_datatype = cleaned_metadata_df[feature].apply(
    lambda attribute: not type_filter_method2(attribute, data_type_dict[feature])
)

# Save the invalid rows
invalid_datatype_df = cleaned_metadata_df.loc[invalid_datatype]

# Print the number of rows where the test attribute value contains an incorrect datatype
print(f"Number of rows where the budget value's data type is not float: {invalid_datatype.sum()}\n")

# Display the first 3 rows where the test attribute value contains an incorrect datatype
print(f"Examples of three rows where the budget value's data type is not float:")
invalid_datatype_df.head(5)

Number of rows where the budget value's data type is not float: 0

Examples of three rows where the budget value's data type is not float:


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count


#### Exploratory Data Analysis (EDA)

### Cleaning and EDA of Ratings

We now perform cleaning and EDA on the ratings dataset.

#### Cleaning the Data

Note that we will reuse the functions defined in the metadata cleaning section.

##### Validity Check 1: Exact Duplicates

We will first check for exact duplicates in the dataset, verifying that there are no rows that are identical over all columns.

In [18]:
# Exact duplicates check

# Apply the .duplicated method to the DataFrame to create a Series, with exact duplicates set to True
# keep=False will mark all duplicates as True (including the first and last occurrences)
duplicates = ratings_df.duplicated(keep=False)

# Print the number of rows that are exact duplicates
print(f"Number of duplicate rows: {duplicates.sum()}\n")

Number of duplicate rows: 0



Since there are no duplicate rows, no cleaning is required.

#### Exploratory Data Analysis (EDA)

## Studies

### Study 1 — Similarity Measures

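This study applies five measures to different features of the metadata dataset: Jaccard distance on genres, edit distance on titles, Euclidean distance on revenue, Manhattan distance on budget, and the Sorensen-Dice index on production companies. For each measure, we pick a reference movie, compute the measure against every other movie, and list the ten closest matches.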

#### Similarity Measure 1 — Jaccard Distance on Genres

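The Jaccard distance compares two sets: it is one minus the size of their intersection divided by the size of their union. Here, each movie's list of genres is treated as a set, so a distance of 0 means that the genre sets are identical and a distance of 1 means that they share no genres.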

**References:** <br>
Similarity Overview: https://medium.com/@jodancker/a-brief-introduction-to-distance-measures-ac89cbd2298 <br>
Jaccard: https://stackoverflow.com/questions/46975929/how-can-i-calculate-the-jaccard-similarity-of-two-lists-containing-strings-in-py <br>
Sets: https://docs.python.org/3/tutorial/datastructures.html <br>
To List: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html <br>
Select columns: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html <br>
Select specific rows: https://stackoverflow.com/questions/46380075/pandas-select-n-middle-rows <br>
Sort by feature: https://realpython.com/pandas-sort-python/
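
As a worked example, the sketch below computes the Jaccard distance between the genres of Se7en and Heat as they appear in this notebook's output. The function is a standalone illustration that follows the notebook's convention of returning 1 when both lists are empty. The two movies share 2 genres out of 5 distinct genres, so the distance is 1 - 2/5 = 0.6.

```python
def jaccard_distance(list_a, list_b):
    # Compare the lists as sets, so order and repeats are ignored
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    # Convention used in this notebook: two empty lists get distance 1
    if not union:
        return 1
    return 1 - len(set_a & set_b) / len(union)

se7en = ["Crime", "Mystery", "Thriller"]
heat = ["Action", "Crime", "Drama", "Thriller"]

# Shared genres: Crime, Thriller (2); distinct genres overall: 5
distance_to_heat = jaccard_distance(se7en, heat)
print(round(distance_to_heat, 2))  # 0.6
```

The result matches the value reported for Heat in the output table of the cell that follows.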

In [19]:
# Similarity measure 1 — Jaccard distance on Genres

# Select the reference movie
reference_movie = "Se7en"

def jaccard_distance(list1, list2):
    # Convert the lists into sets
    set1 = set(list1)
    set2 = set(list2)

    # Find the size of the intersection
    intersection_size = len(set1 & set2)

    # Find the size of the union
    union_size = len(set1 | set2)

    # Handle the case where both lists are empty (treated as maximally distant)
    if union_size == 0:
        return 1

    # Compute and return the Jaccard distance (1 - Jaccard similarity)
    return 1 - (intersection_size/union_size)

# Store the genres for the reference movie
reference_genres = cleaned_metadata_df[cleaned_metadata_df["original_title"]==reference_movie]["genres"].to_list()[0]

# Apply the function to the genres attribute, computing the Jaccard distance between the genres of the reference movie with every other movie
new_feature_name = "jaccard_genre_distance_to_" + reference_movie
cleaned_metadata_df[new_feature_name] = cleaned_metadata_df["genres"].apply(
    lambda genres: jaccard_distance(reference_genres, genres)
)

# Print the genres of the reference movie
print(f"The genres of {reference_movie}: {reference_genres}\n")

# Print the five example rows
print(f"Examples of five rows with Jaccard distance to {reference_movie} based on the movie genres:")
cleaned_metadata_df[["original_title", "genres", new_feature_name]].head(10).tail(5)

The genres of Se7en: ['Crime', 'Mystery', 'Thriller']

Examples of five rows with Jaccard distance to Se7en based on the movie genres:


Unnamed: 0,original_title,genres,jaccard_genre_distance_to_Se7en
5,Heat,"[Action, Crime, Drama, Thriller]",0.6
6,Sabrina,"[Comedy, Romance]",1.0
7,Tom and Huck,"[Action, Adventure, Drama, Family]",1.0
8,Sudden Death,"[Action, Adventure, Thriller]",0.8
9,GoldenEye,"[Adventure, Action, Thriller]",0.8


[desc]

In [20]:
# Request: Show me the top 10 movies of the same genre as the reference movie
cleaned_metadata_df.loc[cleaned_metadata_df["original_title"] != reference_movie, ["original_title", "genres", new_feature_name]].sort_values(new_feature_name).head(10)

Unnamed: 0,original_title,genres,jaccard_genre_distance_to_Se7en
5254,Insomnia,"[Crime, Mystery, Thriller]",0.0
40337,A ciascuno il suo,"[Crime, Mystery, Thriller]",0.0
12332,Cleaner,"[Crime, Thriller, Mystery]",0.0
12638,The Oxford Murders,"[Crime, Mystery, Thriller]",0.0
4051,The Mirror Crack'd,"[Crime, Thriller, Mystery]",0.0
41711,"Kiss Me, Kill Me","[Mystery, Crime, Thriller]",0.0
14349,23 Paces to Baker Street,"[Crime, Mystery, Thriller]",0.0
30066,बदलापुर,"[Crime, Mystery, Thriller]",0.0
17366,The Carey Treatment,"[Crime, Mystery, Thriller]",0.0
22954,Kvinden i buret,"[Thriller, Mystery, Crime]",0.0


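All ten results have a Jaccard distance of 0, meaning that their genre sets are identical to Se7en's (Crime, Mystery, Thriller). The order in which the genres are listed does not matter because the lists are compared as sets.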

#### Similarity Measure 2 — Edit Distance on Title

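The edit (Levenshtein) distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. We apply it to `original_title` to find movies whose titles are spelled similarly to a reference title.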

**References:** <br>
Levenshtein Distance: https://www.geeksforgeeks.org/introduction-to-python-levenshtein-module/ <br>
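
To show what the edit distance computes, here is a minimal pure-Python sketch of the standard dynamic-programming approach. It is an illustration only and is not the implementation used by the `Levenshtein` library; the function name `edit_distance` is a hypothetical helper.

```python
def edit_distance(source, target):
    # previous[j] is the distance between the processed prefix of
    # source and the first j characters of target
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,         # delete a character from source
                current[j - 1] + 1,      # insert a character into source
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

The comparison is case-sensitive, so titles that differ only in capitalization still have a non-zero distance.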

In [21]:
# Similarity Measure 2 — Edit Distance on Title

# Store the title used to compute the edit distance
reference_title = "Back to the Future"

# Apply the Levenshtein.distance function from the Levenshtein library to compute the edit distance between the reference title and the title of every other movie
new_feature_name = "edit_distance_title_to" + reference_title
cleaned_metadata_df[new_feature_name] = cleaned_metadata_df["original_title"].apply(
    lambda title: Levenshtein.distance(reference_title, title)
)

# Print the reference title
print(f"The reference title: {reference_title}\n")

# Print five example rows
print(f"Examples of five rows with edit distance to {reference_title} based on the movie title:")
cleaned_metadata_df[["original_title", new_feature_name]].head(5)

The reference title: Back to the Future

Examples of five rows with edit distance to Back to the Future based on the movie title:


Unnamed: 0,original_title,edit_distance_title_toBack to the Future
0,Toy Story,14
1,Jumanji,18
2,Grumpier Old Men,17
3,Waiting to Exhale,13
4,Father of the Bride Part II,19


[desc]

In [22]:
# Request: Show me the top 10 movies with a similar title to the reference movie
cleaned_metadata_df.loc[cleaned_metadata_df["original_title"] != reference_title, ["original_title", new_feature_name]].sort_values(new_feature_name).head(11).tail(10)

Unnamed: 0,original_title,edit_distance_title_toBack to the Future
21860,Back in the Saddle,7
23537,Maps to the Stars,7
30979,Back to the Jurassic,7
39554,Back in the Day,8
43818,Back to You and Me,8
1902,Back to the Future Part II,8
24299,Crimes of the Future,8
21531,The Lost Future,8
23550,Back in the Day,8
37449,Back To The Sea,8


[desc]

#### Similarity Measure 3 — Euclidean Distance on Revenue

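The Euclidean distance is the straight-line distance between two points. Because we compare a single feature, `revenue`, it reduces to the absolute difference between the two revenues.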

**References:** <br>
Euclidean Distance: https://www.w3schools.com/python/ref_math_dist.asp <br>
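
The sketch below checks this on two revenue figures taken from this notebook's output (Pulp Fiction and Casino): with one-element points, `math.dist` gives the same value as a plain absolute difference.

```python
import math

# Revenues as they appear in the notebook output
pulp_fiction_revenue = 213928762.0
casino_revenue = 116112375.0

# Euclidean distance between two one-dimensional points
euclidean = math.dist([pulp_fiction_revenue], [casino_revenue])

# In one dimension this is just the absolute difference
absolute_difference = abs(pulp_fiction_revenue - casino_revenue)

print(euclidean, absolute_difference)
```

For a whole column, subtracting the reference value and taking the absolute value would therefore give the same ranking.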

In [23]:
# Similarity Measure 3 — Euclidean Distance on Revenue

# Select the reference movie
reference_movie = "Pulp Fiction"

# Store the revenue for the reference movie
reference_revenue =  float(cleaned_metadata_df[cleaned_metadata_df["original_title"]==reference_movie]["revenue"].iloc[0])

# Apply the math.dist() function from the math library to compute the Euclidean distance between the revenue of the reference movie and the revenue of every other movie
new_feature_name = "euclidean_distance_revenue_to_" + reference_movie
cleaned_metadata_df[new_feature_name] = cleaned_metadata_df["revenue"].apply(
    lambda revenue: math.dist([reference_revenue], [revenue])
)

# Print the revenue of the reference movie
print(f"The revenue of {reference_movie}: {reference_revenue}\n")

# Print five example rows
print(f"Examples of five rows with Euclidean distance to {reference_movie} based on the movie revenue:")
cleaned_metadata_df[["original_title", "revenue", new_feature_name]].head(20).tail(5)

The revenue of Pulp Fiction: 213928762.0

Examples of five rows with Euclidean distance to Pulp Fiction based on the movie revenue:


Unnamed: 0,original_title,revenue,euclidean_distance_revenue_to_Pulp Fiction
15,Casino,116112375.0,97816387.0
16,Sense and Sensibility,135000000.0,78928762.0
17,Four Rooms,4300000.0,209628762.0
18,Ace Ventura: When Nature Calls,212385533.0,1543229.0
19,Money Train,35431113.0,178497649.0


[desc]

In [24]:
# Request: Show me the top 10 movies with a similar revenue to the reference movie
cleaned_metadata_df.loc[cleaned_metadata_df["original_title"] != reference_movie, ["original_title", "revenue", new_feature_name]].sort_values(new_feature_name).head(10)

Unnamed: 0,original_title,revenue,euclidean_distance_revenue_to_Pulp Fiction
1056,Dirty Dancing,213954274.0,25512.0
221,Disclosure,214015089.0,86327.0
5284,The Bourne Identity,214034224.0,105462.0
13847,Public Enemies,214104620.0,175858.0
3871,卧虎藏龙,213525736.0,403026.0
16818,Just Go with It,214918407.0,989645.0
25503,Into the Woods,212902372.0,1026390.0
13241,Bedtime Stories,212874442.0,1054320.0
5672,8 Mile,215000000.0,1071238.0
15483,The Sorcerer's Apprentice,215283742.0,1354980.0


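The ten closest movies all have revenues within about $1.36 million of Pulp Fiction's revenue of roughly $213.9 million. Dirty Dancing is the closest, differing by only $25,512.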

#### Similarity Measure 4 — Manhattan Distance on Budget

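The Manhattan (city block) distance is the sum of the absolute differences between corresponding coordinates. Because we compare a single feature, `budget`, it is simply the absolute difference between the two budgets.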

**References:** <br>
Manhattan Distance: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cityblock.html <br>
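
The following sketch spells out the Manhattan distance in plain Python. It is an illustrative equivalent, not SciPy's implementation, and `manhattan_distance` is a hypothetical helper name. The budgets are taken from this notebook's output; the two-dimensional points are made up to show where the Manhattan and Euclidean distances differ.

```python
import math

def manhattan_distance(point_a, point_b):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(point_a, point_b))

# One feature: budgets of Interstellar and Casino from the notebook output
interstellar_budget = 165000000.0
casino_budget = 52000000.0
one_feature = manhattan_distance([interstellar_budget], [casino_budget])

# Two features (made-up points): the two measures now disagree
point_p, point_q = (0, 0), (3, 4)
manhattan = manhattan_distance(point_p, point_q)  # 3 + 4
euclidean = math.dist(point_p, point_q)           # sqrt(3**2 + 4**2)

print(one_feature, manhattan, euclidean)  # 113000000.0 7 5.0
```

With a single feature, the choice between the two measures does not change the ranking; the distinction only matters once several features are combined.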

In [25]:
# Similarity Measure 4 — Manhattan Distance on Budget

# Select the reference movie
reference_movie = "Interstellar"

# Store the budget for the reference movie
reference_budget =  float(cleaned_metadata_df[cleaned_metadata_df["original_title"]==reference_movie]["budget"].iloc[0])

# Apply the distance.cityblock() function from the scipy library to compute the Manhattan distance between the budget of the reference movie and the budget of every other movie
new_feature_name = "manhattan_distance_budget_to_" + reference_movie
cleaned_metadata_df[new_feature_name] = cleaned_metadata_df["budget"].apply(
    lambda budget: distance.cityblock([reference_budget], [budget])
)

# Print the budget of the reference movie
print(f"The budget of {reference_movie}: {reference_budget}\n")

# Print five example rows
print(f"Examples of five rows with Manhattan distance to {reference_movie} based on the movie budget:")
cleaned_metadata_df[["original_title", "budget", new_feature_name]].head(20).tail(5)

The budget of Interstellar: 165000000.0

Examples of five rows with Manhattan distance to Interstellar based on the movie budget:


Unnamed: 0,original_title,budget,manhattan_distance_budget_to_Interstellar
15,Casino,52000000.0,113000000.0
16,Sense and Sensibility,16500000.0,148500000.0
17,Four Rooms,4000000.0,161000000.0
18,Ace Ventura: When Nature Calls,30000000.0,135000000.0
19,Money Train,60000000.0,105000000.0


[desc]

In [26]:
# Request: Show me the top 10 movies with a similar budget to the reference movie
cleaned_metadata_df.loc[cleaned_metadata_df["original_title"] != reference_movie, ["original_title", "budget", new_feature_name]].sort_values(new_feature_name).head(10)

Unnamed: 0,original_title,budget,manhattan_distance_budget_to_Interstellar
19726,Wreck-It Ralph,165000000.0,0.0
26568,Doctor Strange,165000000.0,0.0
15372,Shrek Forever After,165000000.0,0.0
14984,How to Train Your Dragon,165000000.0,0.0
30556,Independence Day: Resurgence,165000000.0,0.0
24455,Big Hero 6,165000000.0,0.0
8238,The Polar Express,165000000.0,0.0
16492,Cowboys & Aliens,163000000.0,2000000.0
18092,Hugo,170000000.0,5000000.0
2711,The 13th Warrior,160000000.0,5000000.0


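Seven movies have exactly the same budget as Interstellar ($165 million), giving a distance of 0, so their relative order in the list is not meaningful. The remaining three differ by at most $5 million.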

#### Similarity Measure 5 — Sorensen-Dice Index on Production Companies

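The Sorensen-Dice index measures the overlap between two sets as twice the size of their intersection divided by the sum of their sizes. It ranges from 0 (no production companies in common) to 1 (identical sets of production companies).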

**References:** <br>
Sorensen-Dice Index: https://medium.com/@jodancker/a-brief-introduction-to-distance-measures-ac89cbd2298
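
As a worked example, the sketch below computes the index for Big Hero 6 and Tom and Huck using the production companies shown in this notebook's output. The function is a standalone illustration that follows the notebook's convention of returning 1 when both lists are empty. The movies share one company, and their lists have 2 and 1 entries, so the index is 2 * 1 / (2 + 1), approximately 0.667.

```python
def dice_index(list_a, list_b):
    # Compare the lists as sets
    set_a, set_b = set(list_a), set(list_b)
    # Convention used in this notebook: two empty lists get index 1
    if not set_a and not set_b:
        return 1
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

big_hero_6 = ["Walt disney pictures", "Walt disney animation studios"]
tom_and_huck = ["Walt disney pictures"]

score = dice_index(big_hero_6, tom_and_huck)
print(round(score, 6))  # 0.666667
```

The result matches the value reported for Tom and Huck in the output table of the cell that follows.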

In [36]:
# Similarity Measure 5 — Sorensen-Dice Index on Production Companies

# Select the reference movie
reference_movie = "Big Hero 6"

def sorensen_dice_index(list1, list2):
    # Convert the lists into sets
    set1 = set(list1)
    set2 = set(list2)

    # Handle empty case
    if len(set1 | set2) == 0:
        return 1

    # Find the size of the intersection
    intersection_size = len(set1 & set2)

    # Find the size of set 1
    set1_size = len(set1)

    # Find the size of set 2
    set2_size = len(set2)

    return (2 * intersection_size) / (set1_size + set2_size)

# Store the production companies for the reference movie
reference_production_companies = cleaned_metadata_df[cleaned_metadata_df["original_title"]==reference_movie]["production_companies"].to_list()[0]

# Apply the function to the production companies attribute, computing the Sorensen Dice Index between the production companies of the reference movie with every other movie
new_feature_name = "sorensen_dice_index_with_" + reference_movie
cleaned_metadata_df[new_feature_name] = cleaned_metadata_df["production_companies"].apply(
    lambda production_companies: sorensen_dice_index(reference_production_companies, production_companies)
)

# Print the production companies of the reference movie
print(f"The production companies of {reference_movie}: {reference_production_companies}\n")

# Print the five example rows
print(f"Examples of five rows with sorensen dice index to {reference_movie} based on the movie production companies:")
cleaned_metadata_df[["original_title", "production_companies", new_feature_name]].head(10).tail(5)

The production companies of Big Hero 6: ['Walt disney pictures', 'Walt disney animation studios']

Examples of five rows with sorensen dice index to Big Hero 6 based on the movie production companies:


Unnamed: 0,original_title,production_companies,sorensen_dice_index_with_Big Hero 6
5,Heat,"[Regency enterprises, Forward pass, Warner bros.]",0.0
6,Sabrina,"[Paramount pictures, Scott rudin productions, ...",0.0
7,Tom and Huck,[Walt disney pictures],0.666667
8,Sudden Death,"[Universal pictures, Imperial entertainment, S...",0.0
9,GoldenEye,"[United artists, Eon productions]",0.0


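Unlike the previous four measures, which are distances, the Sorensen-Dice index is a similarity: higher values indicate more overlap. To find the most similar movies, we therefore sort in descending order instead of ascending order.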

In [38]:
# Request: Show me the top 10 movies with the same production companies as the reference movie
cleaned_metadata_df.loc[cleaned_metadata_df["original_title"] != reference_movie, ["original_title", "production_companies", new_feature_name]].sort_values(new_feature_name, ascending=False).head(10)

Unnamed: 0,original_title,production_companies,sorensen_dice_index_with_Big Hero 6
11727,Meet the Robinsons,"[Walt disney pictures, Walt disney animation s...",1.0
19901,Paperman,"[Walt disney pictures, Walt disney animation s...",1.0
22110,Frozen,"[Walt disney pictures, Walt disney animation s...",1.0
21419,Planes,"[Walt disney pictures, Walt disney animation s...",1.0
14496,The Princess and the Frog,"[Walt disney pictures, Walt disney animation s...",1.0
17469,Winnie the Pooh,"[Walt disney pictures, Walt disney animation s...",1.0
28665,Frozen Fever,"[Walt disney pictures, Walt disney animation s...",1.0
40457,How to Hook Up Your Home Theater,"[Walt disney pictures, Walt disney animation s...",1.0
40458,Tick Tock Tale,"[Walt disney pictures, Walt disney animation s...",1.0
41457,Moana,"[Walt disney pictures, Walt disney animation s...",1.0


[desc]

### Study 2 — Clustering Algorithms

### Study 3 — Content-Based Recommendation System

### Study 4 — Collaborative Filtering Recommendation System

## Conclusion

## References