# Assignment 4: Unsupervised Learning—Clustering and Recommendations
## Group 105
- Natasa Bolic (300241734)
- Brent Palmer (300193610)
## Imports

In [1]:
# imports
import pandas as pd
import ast
import Levenshtein

## Introduction

## Dataset Description

**Url:** https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset <br>
**Name:** The Movies Dataset <br>
**Author:** The dataset was uploaded by Rounak Banik. The data was collected from the TMDB Open API and the Official GroupLens website. <br>
**Purpose:** The original purpose of the dataset was to learn about the history of cinema through EDA and to build various types of recommender systems for movies. Some suggested uses for the dataset are building models to predict movie revenue or success, and building content-based and collaborative-filtering-based recommendation systems. <br>
**Shape:** The data we are using is separated into two files.
- `movies_metadata.csv`
    - There are 45466 rows and 24 columns. (45466, 24)
- `ratings_small.csv`
    - There are 100004 rows and 4 columns. (100004, 4)

**Features:**
- `movies_metadata.csv`
    - `adult` (categorical): Indicates (True/False) if the movie is rated as "Adults Only", meaning that the content is not suitable for minors.
    - `belongs_to_collection` (categorical): The movie series/collection that the movie belongs to, stored as a string representation of a dictionary that describes the series with keys 'id', 'name', 'poster_path' and 'backdrop_path'.
    - `budget` (numerical): The budget of the movie (in USD).
    - `genres` (categorical): The genres of the movie, stored as a string representation of a list where each item is a dictionary representing a genre with keys 'id' and 'name'.
    - `homepage` (categorical): The link to the official homepage of the movie.
    - `id` (categorical): The movie's ID.
    - `imdb_id` (categorical): The IMDB ID for the movie.
    - `original_language` (categorical): The original language of the movie, abbreviated to the two-letter format (i.e. the ISO 639-1 code).
    - `original_title` (categorical): The title of the movie in its original language.
    - `overview` (categorical): A short description of the movie.
    - `popularity` (numerical): The score assigned by TMDB to quantify the movie's popularity.
    - `poster_path` (categorical): The path segment of the URL to the poster image of the movie.
    - `production_companies` (categorical): The production companies that were involved with the production of the movie, stored as a string representation of a list where each item is a dictionary representing a company with keys 'id' and 'name'.
    - `production_countries` (categorical): The countries where the movie was made, stored as a string representation of a list where each item is a dictionary representing a country with keys 'iso_3166_1' (2-letter format) and 'name'.
    - `release_date` (categorical): The release date of the movie in the format YYYY-MM-DD.
    - `revenue` (numerical): The total revenue generated by the movie (in USD). 
    - `runtime` (numerical): The length of the movie (in minutes).
    - `spoken_languages` (categorical): The languages spoken in the movie, stored as a string representation of a list where each item is a dictionary representing a language with keys 'iso_639_1' (2-letter format) and 'name'.
    - `status` (categorical): The status of the movie, which is either released, rumored, planned, post production, or in production.
    - `tagline` (categorical): The movie's tagline.
    - `title` (categorical): The title of the movie in English.
    - `video` (categorical): Indicates (True/False) if the movie has an official video associated with it in the TMDB API.
    - `vote_average` (numerical): The average rating given to the movie by users.
    - `vote_count` (numerical): The number of users that rated the movie.
- `ratings_small.csv`
    - `userId` (categorical): The ID of the user that provided the rating.
    - `movieId` (categorical): The ID of the movie that was rated.
    - `rating` (numerical): The rating given to the movie by the user.
    - `timestamp` (categorical): The timestamp at which the rating was provided.

## Loading Data and Basic Exploration

### Loading Movies Metadata Dataset

In [2]:
# Read in the movies metadata dataset from a public repository
url = "https://raw.githubusercontent.com/BrentMRPalmer/CSI4142-A4/refs/heads/main/movies_metadata.csv"
metadata_df = pd.read_csv(url, low_memory=False)
metadata_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [4]:
metadata_df.shape

(45466, 24)
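
The `info()` output above reports `budget` and `popularity` as `object`, even though the feature list describes them as numerical. The following is a minimal sketch of one way such columns could be coerced, shown on a small made-up frame rather than the real data. The sample values, the name `sample_df`, and the choice of `errors="coerce"` are assumptions for illustration and are not part of the cleaning performed in this notebook.

```python
import pandas as pd

# Hypothetical sample: numeric values stored as strings, with one malformed entry
sample_df = pd.DataFrame({
    "budget": ["30000000", "65000000", "not_a_number"],
    "release_date": ["1995-10-30", "1995-12-15", None],
})

# Convert to numbers; anything that cannot be parsed becomes NaN
sample_df["budget"] = pd.to_numeric(sample_df["budget"], errors="coerce")

# Convert to datetimes; missing or unparseable values become NaT
sample_df["release_date"] = pd.to_datetime(sample_df["release_date"], errors="coerce")

print(sample_df.dtypes)
```

Coercing rather than raising keeps malformed rows visible as missing values, so they can be counted and handled in a later validity check.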

### Loading Ratings Small Dataset

In [5]:
# Read in the ratings small dataset from a public repository
url = "https://raw.githubusercontent.com/BrentMRPalmer/CSI4142-A4/refs/heads/main/ratings_small.csv"
ratings_df = pd.read_csv(url, low_memory=False)
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [6]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [7]:
ratings_df.shape

(100004, 4)
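
The `timestamp` values shown above are large integers. A minimal sketch of turning them into readable datetimes follows, assuming they are seconds since 1970-01-01 UTC (the usual convention for GroupLens rating files, but an assumption here). The frame `sample_ratings` and the column name `rated_at` are hypothetical; the values are taken from the first two rows shown above.

```python
import pandas as pd

# Sample values taken from the first two rows of the ratings output
sample_ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [31, 1029],
    "rating": [2.5, 3.0],
    "timestamp": [1260759144, 1260759179],
})

# Assumption: timestamps are seconds since 1970-01-01 UTC
sample_ratings["rated_at"] = pd.to_datetime(sample_ratings["timestamp"], unit="s")

print(sample_ratings[["timestamp", "rated_at"]])
```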

## Data Preparation

To prepare the data, we use our validation techniques from Assignment 2 to clean the data and our EDA techniques from Assignment 1 to visualize the data.

There are two datasets to prepare, the first holding the metadata in `metadata_df`, and the second representing the user ratings in `ratings_df`. We perform the preparation on one dataset at a time.

### Cleaning and EDA of Metadata

We begin by cleaning and performing EDA on the metadata dataset.

#### Cleaning the Data

##### Validity Check 1: Exact Duplicates

We will first check for exact duplicates in the dataset, verifying that there are no rows that are identical over all columns.

**References:** <br>
Duplicated: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html <br>
Drop duplicates: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

In [8]:
# Exact duplicates check

# Apply the .duplicated method to the DataFrame to create a Series, with exact duplicates set to True
# keep=False will mark all duplicates as True (including the first and last occurrences)
duplicates = metadata_df.duplicated(keep=False)

# Print the number of rows that are exact duplicates
print(f"Number of duplicate rows: {duplicates.sum()}\n")

Number of duplicate rows: 33



Since there are 33 duplicate rows, let us further investigate the actual rows to determine how to handle them.

In [9]:
# Display the first 3 rows that are exact duplicates
print("Examples of three duplicate rows:")
metadata_df.loc[duplicates].head(3)

Examples of three duplicate rows:


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
676,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,105045,tt0111613,de,Das Versprechen,"East-Berlin, 1961, shortly after the erection ...",...,1995-02-16,0.0,115.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,"A love, a hope, a wall.",The Promise,False,5.0,1.0
1465,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,105045,tt0111613,de,Das Versprechen,"East-Berlin, 1961, shortly after the erection ...",...,1995-02-16,0.0,115.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,"A love, a hope, a wall.",The Promise,False,5.0,1.0
7345,False,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,5511,tt0062229,fr,Le Samouraï,Hitman Jef Costello is a perfectionist who alw...,...,1967-10-25,39481.0,105.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,There is no solitude greater than that of the ...,Le Samouraï,False,7.9,187.0


As expected, the rows are exact duplicates. The movie `Das Versprechen` appears twice in the results above. We can safely remove the duplicates. We will store the results in a new DataFrame called `cleaned_metadata_df`. All subsequent cleaning will be done to the new DataFrame.

In [10]:
# Create a copy of the metadata DataFrame
cleaned_metadata_df = metadata_df.copy()

# Drop duplicates (retains first instance of a duplicated row)
cleaned_metadata_df = cleaned_metadata_df.drop_duplicates()

# Verify that no duplicates remain
duplicates = cleaned_metadata_df.duplicated(keep=False)

# Print the number of rows that are exact duplicates
print(f"Number of duplicate rows: {duplicates.sum()}\n")

Number of duplicate rows: 0



Since there are no remaining duplicate rows, the DataFrame has been successfully cleaned.
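
Note that the count of 33 includes every copy of a duplicated row, because `keep=False` flags the first occurrence as well as the repeats. The number of rows actually removed by `drop_duplicates()` is therefore smaller. The toy frame below (hypothetical data, not drawn from the dataset) sketches the difference.

```python
import pandas as pd

# Hypothetical toy frame: row "A" appears twice, row "B" once
toy_df = pd.DataFrame({"title": ["A", "A", "B"], "year": [1995, 1995, 1967]})

# keep=False flags every copy of a duplicated row
flagged = toy_df.duplicated(keep=False).sum()

# The default keep="first" flags only the later copies
extra_copies = toy_df.duplicated().sum()

# drop_duplicates() removes only those later copies
deduped = toy_df.drop_duplicates()

print(flagged, extra_copies, len(deduped))  # 2 1 2
```

This would be consistent with the cleaned DataFrame having 45449 rows later in the notebook: 17 rows fewer than the original 45466, not 33.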

##### Genre Format

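The dataset stores genres as a string representation of a list of dictionaries, each with keys 'id' and 'name'. We parse each string and keep only the genre names, so that every movie has a plain list of capitalized genre names.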

**References:** <br>
Literal eval: https://docs.python.org/3/library/ast.html#ast.literal_eval <br>
Capitalize: https://www.w3schools.com/python/ref_string_capitalize.asp

In [11]:
def extract_genre(genres):
    # Convert from string to a list of dictionaries
    genres = ast.literal_eval(genres)
    # Keep only the genre names; capitalize() also lowercases the remaining characters
    return [genre["name"].capitalize() for genre in genres]

# Apply the function to the genres attribute, updating the genres to the processed list 
cleaned_metadata_df["genres"] = cleaned_metadata_df["genres"].apply(extract_genre)

# Check the cleaned results
print("Examples of the cleaned genres feature: ")
cleaned_metadata_df["genres"]

Examples of the cleaned genres feature: 


0         [Animation, Comedy, Family]
1        [Adventure, Fantasy, Family]
2                   [Romance, Comedy]
3            [Comedy, Drama, Romance]
4                            [Comedy]
                     ...             
45461                 [Drama, Family]
45462                         [Drama]
45463       [Action, Drama, Thriller]
45464                              []
45465                              []
Name: genres, Length: 45449, dtype: object
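
To make the parsing step concrete, here is a minimal sketch on a single value. The string `raw_genres` is an illustrative value written in the same shape as the `genres` column; it is an assumption, not copied from a specific row.

```python
import ast

# Illustrative raw value: a string, not yet a Python list
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval parses Python literals without executing arbitrary code
parsed = ast.literal_eval(raw_genres)

# Keep only the genre names
names = [genre["name"] for genre in parsed]

print(names)  # ['Animation', 'Comedy']

# A movie with no genres is stored as "[]" and parses to an empty list
print(ast.literal_eval("[]"))  # []
```

`ast.literal_eval` is used rather than `json.loads` because the stored strings use single quotes, which are valid Python literals but not valid JSON.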

#### Exploratory Data Analysis (EDA)

### Cleaning and EDA of Ratings

We now perform cleaning and EDA on the ratings dataset.

#### Cleaning the Data

Note that we will reuse the functions defined in the metadata cleaning section.

##### Validity Check 1: Exact Duplicates

We will first check for exact duplicates in the dataset, verifying that there are no rows that are identical over all columns.

In [12]:
# Exact duplicates check

# Apply the .duplicated method to the DataFrame to create a Series, with exact duplicates set to True
# keep=False will mark all duplicates as True (including the first and last occurrences)
duplicates = ratings_df.duplicated(keep=False)

# Print the number of rows that are exact duplicates
print(f"Number of duplicate rows: {duplicates.sum()}\n")

Number of duplicate rows: 0



Since there are no duplicate rows, no cleaning is required.

#### Exploratory Data Analysis (EDA)

## Studies

### Study 1 — Similarity Measures

[desc]

#### Similarity Measure 1 — Jaccard distance on Genres

[desc]

**References:** <br>
Similarity Overview: https://medium.com/@jodancker/a-brief-introduction-to-distance-measures-ac89cbd2298 <br>
Jaccard: https://stackoverflow.com/questions/46975929/how-can-i-calculate-the-jaccard-similarity-of-two-lists-containing-strings-in-py <br>
Sets: https://docs.python.org/3/tutorial/datastructures.html <br>
To List: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html <br>
Select columns: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html <br>
Select specific rows: https://stackoverflow.com/questions/46380075/pandas-select-n-middle-rows <br>
Sort by feature: https://realpython.com/pandas-sort-python/

In [13]:
# Similarity measure 1 — Jaccard distance on Genres

# Select the reference movie
reference_movie = "Se7en"

def jaccard_distance(list1, list2):
    # Convert the lists into sets
    set1 = set(list1)
    set2 = set(list2)

    # Find the size of the intersection
    intersection_size = len(set1 & set2)

    # Find the size of the union
    union_size = len(set1 | set2)

    # Handle the case where both lists are empty (avoids division by zero);
    # such pairs are treated as maximally distant
    if union_size == 0:
        return 1

    # Compute and return the Jaccard distance (1 - Jaccard similarity)
    return 1 - (intersection_size/union_size)

# Store the genres for the reference movie
reference_genres = cleaned_metadata_df[cleaned_metadata_df["original_title"]==reference_movie]["genres"].to_list()[0]

# Apply the function to the genres attribute, computing the Jaccard distance between the genres of the reference movie and those of every other movie
new_feature_name = "jaccard_genre_distance_to_" + reference_movie
cleaned_metadata_df[new_feature_name] = cleaned_metadata_df["genres"].apply(
    lambda genres: jaccard_distance(reference_genres, genres)
)

# Print the genres of the reference movie
print(f"The genres of {reference_movie}: {reference_genres}\n")

# Print five example rows (the sixth through tenth rows)
print(f"Examples of five rows with Jaccard distance to {reference_movie} based on the movie genres:")
cleaned_metadata_df[["original_title", "genres", new_feature_name]].head(10).tail(5)

The genres of Se7en: ['Crime', 'Mystery', 'Thriller']

Examples of five rows with Jaccard distance to Se7en based on the movie genres:


Unnamed: 0,original_title,genres,jaccard_genre_distance_to_Se7en
5,Heat,"[Action, Crime, Drama, Thriller]",0.6
6,Sabrina,"[Comedy, Romance]",1.0
7,Tom and Huck,"[Action, Adventure, Drama, Family]",1.0
8,Sudden Death,"[Action, Adventure, Thriller]",0.8
9,GoldenEye,"[Adventure, Action, Thriller]",0.8
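
The values in the table above can be checked by hand. The sketch below recomputes two of them with a standalone helper; the name `jaccard_distance_sketch` is hypothetical, and the empty-input convention is an assumption chosen to mirror the notebook's behaviour.

```python
def jaccard_distance_sketch(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two iterables of labels."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty inputs count as maximally distant
        return 1.0
    return 1 - len(set_a & set_b) / len(union)

se7en = ["Crime", "Mystery", "Thriller"]
heat = ["Action", "Crime", "Drama", "Thriller"]
sabrina = ["Comedy", "Romance"]

# Heat shares 2 of the 5 distinct genres with Se7en: 1 - 2/5
print(jaccard_distance_sketch(se7en, heat))

# Sabrina shares none of the 5 distinct genres: 1 - 0/5
print(jaccard_distance_sketch(se7en, sabrina))
```

A distance of 0 means the two genre sets are identical regardless of order, and a distance of 1 means they share no genre.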


[desc]

In [14]:
# Request: Show me movies of the same genre as the reference movie
cleaned_metadata_df[["original_title", "genres", new_feature_name]].sort_values(new_feature_name).head(10)

Unnamed: 0,original_title,genres,jaccard_genre_distance_to_Se7en
1054,Dial M for Murder,"[Crime, Mystery, Thriller]",0.0
13325,The Alphabet Killer,"[Crime, Mystery, Thriller]",0.0
12332,Cleaner,"[Crime, Thriller, Mystery]",0.0
5254,Insomnia,"[Crime, Mystery, Thriller]",0.0
4051,The Mirror Crack'd,"[Crime, Thriller, Mystery]",0.0
41711,"Kiss Me, Kill Me","[Mystery, Crime, Thriller]",0.0
14349,23 Paces to Baker Street,"[Crime, Mystery, Thriller]",0.0
12638,The Oxford Murders,"[Crime, Mystery, Thriller]",0.0
30066,बदलापुर,"[Crime, Mystery, Thriller]",0.0
40337,A ciascuno il suo,"[Crime, Mystery, Thriller]",0.0


[desc]

#### Similarity Measure 2 — Edit Distance on Title

[desc]

In [15]:
# Similarity Measure 2 — Edit Distance on Title

# Store the title used to compute the edit distance
reference_title = "Back to the Future"

# Apply the Levenshtein.distance function from the Levenshtein package to compute the edit distance between "Back to the Future" and the title of every other movie 
cleaned_metadata_df["edit_distance_title_to_bttf"] = cleaned_metadata_df["original_title"].apply(
    lambda title: Levenshtein.distance(reference_title, title)
)

# Print the reference title
print(f"The reference title: {reference_title}\n")

# Print five example rows
print(f"Examples of five rows with edit distance to {reference_title} based on the movie title:")
cleaned_metadata_df[["original_title", "edit_distance_title_to_bttf"]].head(5)

The reference title: Back to the Future

Examples of five rows with edit distance to Back to the Future based on the movie title:


Unnamed: 0,original_title,edit_distance_title_to_bttf
0,Toy Story,14
1,Jumanji,18
2,Grumpier Old Men,17
3,Waiting to Exhale,13
4,Father of the Bride Part II,19
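
For readers unfamiliar with what the edit distance measures, the sketch below implements the standard Levenshtein distance with unit costs in plain Python. It is an illustration of the technique only, not the implementation used by the `Levenshtein` package, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance with unit costs."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("Back to the Future", "Back to the Future"))  # 0
```

The comparison is character by character, so it is case-sensitive and counts differences in spacing and punctuation. Titles of very different lengths are therefore far apart even when they share words.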


### Study 2 — Clustering Algorithms

### Study 3 — Content-Based Recommendation System

### Study 4 — Collaborative Filtering Recommendation System

## Conclusion

## References