# CSI 4142 - Introduction to Data Science
# Assignment 4: Unsupervised Learning, Clustering and Recommendations.

Shacha Parker (300235525)\
Callum Frodsham (300199446)\
Group 79

### Setup Instructions To Reproduce this Notebook:
1. (Optional) Create a Python virtual environment in the project directory to hold all of the required packages:  
``` 
python -m venv .venv
```
To activate the virtual environment: 
```
.venv/Scripts/activate.ps1 # on windows
source .venv/bin/activate # on mac/linux
```
2. Install all of the required packages (run in cmd/shell of choice):
```
pip install jupyter
pip install ipykernel
pip install pandas
pip install numpy
pip install seaborn
pip install scikit-learn
pip install Levenshtein
```
3. VSCode: Ensure you have the correct Python kernel selected!
<br> 
If you are using a virtual environment, select the Python interpreter from that environment; otherwise the notebook will not find the installed packages. If everything is installed globally, just make sure the kernel you select is the Python installation that has the packages.

<h1>Dataset: </h1>
Author: Rounak Banik
<br>
Purpose: This dataset provides insight into a large amount of movie data: 45,000 movies released on or before July 2017, along with 26 million ratings from 270,000 users of the GroupLens website. 
<br>
Shape: This dataset is composed of 24 columns, 45466 rows.
<br><br>
Link: <a href="https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset"> The Movies Dataset</a>
<br>

Note: The "homepage", "poster_path" and "video" features will be omitted, as they serve no purpose in this notebook (we don't have access to any of the files they reference).
<h3>Dataset Feature List: </h3>
movies_metadata.csv:
<ol>
    <li>adult:
    <br>
    Feature Type: Categorical
    <br>
    Description: Indicates if the movie is X-rated or not.
    </li>
    <li>belongs_to_collection:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified dictionary that indicates which collection of films the movie belongs to. Empty if no collection.
    </li>
    <li>budget:
    <br>
    Feature Type: Numerical
    <br>
    Description: The budget of the film in dollars (USD). 0 if budget is unknown.
    </li>
    <li>genres:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified list of dictionaries that includes the film's genre(s).
    </li>
    <li>original_language:
    <br>
    Feature Type: Categorical
    <br>
    Description: The film's language of origin.
    </li>
    <li>original_title:
    <br>
    Feature Type: Categorical
    <br>
    Description: The original title of the movie on release.
    </li>
    <li>overview:
    <br>
    Feature Type: Categorical
    <br>
    Description: A brief description of the movie.
    </li>
    <li>popularity:
    <br>
    Feature Type: Numerical
    <br>
    Description: The popularity score as assigned by TMDB.
    </li>
    <li>production_companies:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified list of production companies involved in creating the movie.
    </li>
    <li>production_countries:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified list of countries where the film was shot.
    </li>
    <li>release_date:
    <br>
    Feature Type: Numerical
    <br>
    Description: The release date of the movie.
    </li>
    <li>revenue:
    <br>
    Feature Type: Numerical
    <br>
    Description: Total revenue of the film in dollars.
    </li>
    <li>runtime:
    <br>
    Feature Type: Numerical
    <br>
    Description: The runtime of the film in minutes.
    </li>
    <li>spoken_languages:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified list of dictionaries of the languages spoken in the film.
    </li>
    <li>status:
    <br>
    Feature Type: Categorical
    <br>
    Description: The release status of the film, with categories: 'Released', 'Rumored', 'Post Production', 'In Production', 'Planned', 'Canceled'
    </li>
    <li>tagline:
    <br>
    Feature Type: Categorical
    <br>
    Description: The tagline of the movie.
    </li>
    <li>title:
    <br>
    Feature Type: Categorical
    <br>
    Description: The title of the movie.
    </li>
    <li>vote_average:
    <br>
    Feature Type: Numerical
    <br>
    Description: The average rating of the movie.
    </li>
    <li>vote_count:
    <br>
    Feature Type: Numerical
    <br>
    Description: The number of votes by users as counted by TMDB.
    </li>
</ol>

In [32]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
import Levenshtein as le
import heapq
import ast

# load the dataset
dataset = pd.read_csv("movies_metadata.csv")

# drop the unused columns mentioned above:
dataset.drop(columns=['homepage', 'poster_path', 'video'], inplace=True)
print(dataset.columns)

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'id', 'imdb_id',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'vote_average', 'vote_count'],
      dtype='object')




## Data Cleaning:

In [33]:
# get the general info of the dataset
print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   id                     45466 non-null  object 
 5   imdb_id                45449 non-null  object 
 6   original_language      45455 non-null  object 
 7   original_title         45466 non-null  object 
 8   overview               44512 non-null  object 
 9   popularity             45461 non-null  object 
 10  production_companies   45463 non-null  object 
 11  production_countries   45463 non-null  object 
 12  release_date           45379 non-null  object 
 13  revenue                45460 non-null  float64
 14  runtime                45203 non-null  float64
 15  sp

In [34]:
# check which columns have missing values:
missing_values = dataset.isna().sum()
print(missing_values)


adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
vote_average                 6
vote_count                   6
dtype: int64


<h5>Original language data imputation: </h5>
The original language is missing for 11 rows. Since there are so few, we impute them manually using information from Rotten Tomatoes.
Rotten Tomatoes does not have an easily accessible API, and using an API for only 11 data points would not be worth the effort, so we look up each value by hand.

In [35]:
# boolean mask of the rows with a missing original language
missing_language = dataset['original_language'].isna()

# print the titles of those rows so each language can be looked up
print(dataset[missing_language]['title'])  

missing_language_input_vals = [ "en",
                                "en",
                                "en",
                                "en",
                                "cs",
                                "en",
                                "zxx", # silent film ISO code
                                "en",
                                "en",
                                "en",
                                "zxx" # also a silent film.
                                ]
# get the indices of the missing values (same order as the titles printed above).
missing_language_indices = list(dataset[missing_language].index)

# fill in the values:
for i, row_num in enumerate(missing_language_indices):
    dataset.at[row_num, 'original_language'] = missing_language_input_vals[i]

print(f"New NaN Value count for the original language feature: {dataset['original_language'].isna().sum()}")

19574       Shadowing the Third Man
21602                Unfinished Sky
22832               13 Fighting Men
32141                     Lambchops
37407                 Prince Bayaya
41047                Song of Lahore
41872    Annabelle Serpentine Dance
44057         Lettre d'une inconnue
44410                          Yarn
44576                        WiNWiN
44655    The Surrender of Tournavos
Name: title, dtype: object
New NaN Value count for the original language feature: 0


We are going to fill the 6 null title rows with their "original_title" counterpart. 

In [36]:
# boolean mask of the rows with a missing title
missing_titles = dataset['title'].isna()

# let's see which original titles are valid: 
dataset[missing_titles]['original_title']


19729                                Midnight Man
19730    [{'iso_639_1': 'en', 'name': 'English'}]
29502                            マルドゥック・スクランブル 排気
29503        [{'iso_639_1': 'ja', 'name': '日本語'}]
35586                            Avalanche Sharks
35587    [{'iso_639_1': 'en', 'name': 'English'}]
Name: original_title, dtype: object

Three of these 'original_title' values are clearly invalid: they contain a stringified language list rather than a title. These also happen to be the only three values that fail the format check on the 'original_title' feature. Thus, we will only update the 'title' feature for rows 19729, 29502, and 35586, and remove the other three rows (19730, 29503, and 35587).

In [37]:
# remove the three rows whose original title is invalid.
remove_titles = [35587, 29503, 19730]
dataset.drop(index=remove_titles, inplace=True)

# recompute the mask now that those rows are gone
missing_titles = dataset['title'].isna()

# update the other 3 missing title values using the original title.
dataset.loc[missing_titles, 'title'] = dataset.loc[missing_titles, 'original_title'] 

# show the fixed titles!
dataset[missing_titles]['title']

19729        Midnight Man
29502    マルドゥック・スクランブル 排気
35586    Avalanche Sharks
Name: title, dtype: object

Row Removal Rationale: <br>
Since the dataset has roughly 45,000 rows, we can remove a small number of rows without affecting the quality of the data. 
We will remove every row that has a NULL value in any of these features: popularity, production_countries, production_companies, release_date, status, vote_average, vote_count, runtime, and imdb_id.

Runtime has the most missing values of these features (263 in the missing-value count above). Those rows will be removed. Rows with a missing overview will be kept: dropping them would remove approximately 1/45th of the dataset, and a movie does not always have a succinct overview. 

In [38]:
# remove them all in one fell swoop:
dataset.dropna(subset=['popularity', 'production_countries', 'production_companies', 'release_date', 'status', 'vote_average', 'vote_count','runtime', 'imdb_id'], inplace=True)
dataset.isna().sum()


adult                        0
belongs_to_collection    40565
budget                       0
genres                       0
id                           0
imdb_id                      0
original_language            0
original_title               0
overview                   683
popularity                   0
production_companies         0
production_countries         0
release_date                 0
revenue                      0
runtime                      0
spoken_languages             0
status                       0
tagline                  24659
title                        0
vote_average                 0
vote_count                   0
dtype: int64

In [None]:
# budget was loaded as object (strings); convert it to a numeric dtype.
# note: errors='coerce' turns any unparseable value into NaN.
dataset['budget'] = pd.to_numeric(dataset['budget'], errors='coerce')

# count the rows where revenue or budget is 0
zero_budget_count = (dataset['budget'] == 0).sum()
zero_revenue_count = (dataset['revenue'] == 0).sum()
print(zero_revenue_count)
print(zero_budget_count)

# ensure IDs are all numeric as well
dataset['id'] = pd.to_numeric(dataset['id'], errors='coerce')
dataset.info()

37642
36170
<class 'pandas.core.frame.DataFrame'>
Index: 45042 entries, 0 to 45465
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45042 non-null  object 
 1   belongs_to_collection  4477 non-null   object 
 2   budget                 45042 non-null  int64  
 3   genres                 45042 non-null  object 
 4   id                     45042 non-null  int64  
 5   imdb_id                45042 non-null  object 
 6   original_language      45042 non-null  object 
 7   original_title         45042 non-null  object 
 8   overview               44359 non-null  object 
 9   popularity             45042 non-null  object 
 10  production_companies   45042 non-null  object 
 11  production_countries   45042 non-null  object 
 12  release_date           45042 non-null  object 
 13  revenue                45042 non-null  float64
 14  runtime                45042 non-null  float64


## EDA:

## Study 1: Similarity Measures 
Attribute subsets chosen: title, budget, popularity, vote_count, genre


### First Subset: title
Similarity measure: Levenshtein Distance

In [40]:
chosen_title = 'Star Wars'
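# keep the 10 closest titles in a heap; distances are negated because heapq is a min-heap,
# so popping always discards the largest distance seen so far: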
pq = []
for idx, row in dataset.iterrows():
    title = row['title']
    if title == chosen_title:
        continue
    word_distance = le.distance(chosen_title, title)
    if len(pq) < 10:
        heapq.heappush(pq, (-word_distance, [title, idx]))
    else:
        heapq.heappushpop(pq, (-word_distance, [title, idx]))

# Switch the values from negative to positive:
for i in range(0, len(pq)):
    pq[i] = (-pq[i][0], pq[i][1])
# sort them from least to greatest
pq.sort(key=lambda x: x[0])

print(f"Top 10 titles similar to: {chosen_title}")
title_indices = []
for idx, item in enumerate(pq):
    title_indices.append(item[1][1])
    print(f'{idx +1}. title: {item[1][0]}, distance:{item[0]}')
dataset.loc[title_indices, ['title','runtime', 'release_date', 'popularity', 'genres']]

Top 10 titles similar to: Star Wars
1. title: Star Maps, distance:2
2. title: Beer Wars, distance:3
3. title: Style Wars, distance:3
4. title: Flag Wars, distance:3
5. title: Road Wars, distance:3
6. title: Stars & Bars, distance:4
7. title: Strays, distance:4
8. title: Summer Wars, distance:4
9. title: Word Wars, distance:4
10. title: Triad Wars, distance:4


Unnamed: 0,title,runtime,release_date,popularity,genres
1540,Star Maps,86.0,1997-07-23,0.725694,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
14298,Beer Wars,89.0,2009-04-16,0.876713,"[{'id': 99, 'name': 'Documentary'}]"
9179,Style Wars,70.0,1983-01-01,0.870267,"[{'id': 10402, 'name': 'Music'}, {'id': 99, 'n..."
35876,Flag Wars,86.0,2003-05-01,0.102116,"[{'id': 99, 'name': 'Documentary'}]"
31843,Road Wars,90.0,2015-05-05,1.24917,"[{'id': 878, 'name': 'Science Fiction'}, {'id'..."
2135,Stars & Bars,94.0,1988-03-18,0.40317,"[{'id': 35, 'name': 'Comedy'}]"
37860,Strays,105.0,1997-01-18,3.013494,"[{'id': 18, 'name': 'Drama'}]"
16774,Summer Wars,114.0,2009-08-01,12.653798,"[{'id': 16, 'name': 'Animation'}]"
9503,Word Wars,80.0,2004-05-28,0.549944,"[{'id': 99, 'name': 'Documentary'}]"
33712,Triad Wars,112.0,2008-02-28,1.764311,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
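
For reference, the distance reported above counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one title into another. The notebook uses the `Levenshtein` package for this; the function below is only an illustrative pure-Python sketch of the same measure (the name `levenshtein_distance` is ours, not part of that package).

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    # previous_row[j] = distance between the processed prefix of a and b[:j]
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current_row = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current_row.append(min(
                previous_row[j] + 1,                      # delete char_a
                current_row[j - 1] + 1,                   # insert char_b
                previous_row[j - 1] + substitution_cost,  # substitute (or match)
            ))
        previous_row = current_row
    return previous_row[-1]


print(levenshtein_distance("Star Wars", "Star Maps"))  # 2, as in the ranking above
print(levenshtein_distance("Star Wars", "Beer Wars"))  # 3, as in the ranking above
```

Because the measure only looks at characters, titles such as "Beer Wars" rank as close to "Star Wars" even though the films share no genre.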


### Second Subset: budget
Similarity measure: Manhattan Distance

In [41]:
# chosen movie is Interview with the Vampire, and the budget can be seen below:
chosen_title = 'Interview with the Vampire'
chosen_movie_budget = (dataset.loc[dataset['title'] == chosen_title, 'budget']).item()
# going to use a priority queue to get the top 10:
pq = []

for idx, row in dataset.iterrows():
    cur_movie_budget = row['budget']
    # skip the chosen movie; note that this also skips any other movie with an identical budget.
    if cur_movie_budget == chosen_movie_budget:
        continue
    budget_distance = abs(chosen_movie_budget - cur_movie_budget)
    # heapq is a min-heap, so the distance is negated to make it "kick out" the largest distance.
    if len(pq) < 10:
        heapq.heappush(pq, (-budget_distance, [row['title'], cur_movie_budget, idx]))
    else:
        heapq.heappushpop(pq, (-budget_distance, [row['title'], cur_movie_budget, idx]))

# reverse the negative value
for i in range(0, len(pq)):
    pq[i] = (-pq[i][0], pq[i][1])
pq.sort(key = lambda x: x[0])
budget_indices = []
# get the indices and print out the values:
for idx, item in enumerate(pq):
    budget_indices.append(item[1][-1])
    print(f'{idx +1}. title: {item[1][0]}, budget: {item[1][1]}, budget distance: {item[0]}')
dataset.loc[budget_indices, ['title','runtime', 'release_date', 'popularity', 'genres']]

1. title: The Mermaid, budget: 60720000, budget distance: 720000
2. title: Perfect Stranger, budget: 60795000, budget distance: 795000
3. title: 2 Guns, budget: 61000000, budget distance: 1000000
4. title: Astérix and Obélix: God Save Britannia, budget: 61000000, budget distance: 1000000
5. title: Gone Girl, budget: 61000000, budget distance: 1000000
6. title: Maze Runner: The Scorch Trials, budget: 61000000, budget distance: 1000000
7. title: Ice Age, budget: 59000000, budget distance: 1000000
8. title: Les Misérables, budget: 61000000, budget distance: 1000000
9. title: Shooter, budget: 61000000, budget distance: 1000000
10. title: American Sniper, budget: 58800000, budget distance: 1200000


Unnamed: 0,title,runtime,release_date,popularity,genres
37460,The Mermaid,93.0,2016-02-08,5.296052,"[{'id': 35, 'name': 'Comedy'}, {'id': 878, 'na..."
11745,Perfect Stranger,109.0,2007-04-12,9.925255,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name..."
21337,2 Guns,109.0,2013-08-02,13.336512,"[{'id': 28, 'name': 'Action'}, {'id': 35, 'nam..."
19641,Astérix and Obélix: God Save Britannia,110.0,2012-10-17,9.722726,"[{'id': 10751, 'name': 'Family'}, {'id': 12, '..."
23675,Gone Girl,145.0,2014-10-01,154.801009,"[{'id': 9648, 'name': 'Mystery'}, {'id': 53, '..."
25206,Maze Runner: The Scorch Trials,132.0,2015-09-09,41.225769,"[{'id': 28, 'name': 'Action'}]"
5084,Ice Age,81.0,2002-03-10,17.328902,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
20062,Les Misérables,157.0,2012-12-18,13.52408,"[{'id': 18, 'name': 'Drama'}, {'id': 10402, 'n..."
11692,Shooter,124.0,2007-03-22,14.246918,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam..."
24241,American Sniper,133.0,2014-12-11,19.228561,"[{'id': 10752, 'name': 'War'}, {'id': 28, 'nam..."


### Third Subset: popularity
Similarity measure: Euclidean
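
The notebook has no code for this subset yet, so the following is only a minimal sketch of how it could look. It assumes titles are unique in the sample and that popularity values arrive as strings (the column is loaded with dtype `object`), so they are converted first. The sample rows reuse titles and popularity values from the result tables above; the function name `closest_by_popularity` is hypothetical.

```python
import heapq
import math

# (title, popularity) pairs taken from the result tables above.
# Popularity is kept as strings to mimic the object dtype of the loaded column.
sample = [
    ("Star Maps", "0.725694"),
    ("Beer Wars", "0.876713"),
    ("Summer Wars", "12.653798"),
    ("Gone Girl", "154.801009"),
    ("Ice Age", "17.328902"),
]


def closest_by_popularity(rows, chosen_title, k=10):
    """Return the k (distance, title) pairs closest to chosen_title by popularity."""
    # assumption: titles are unique, so a dict lookup is safe
    values = {title: float(popularity) for title, popularity in rows}
    chosen_value = values[chosen_title]
    # with a single feature, Euclidean distance sqrt((x - y)^2) equals |x - y|
    distances = [
        (math.sqrt((value - chosen_value) ** 2), title)
        for title, value in values.items()
        if title != chosen_title
    ]
    # nsmallest keeps the k smallest distances, sorted in ascending order
    return heapq.nsmallest(k, distances)


nearest = closest_by_popularity(sample, "Summer Wars", k=3)
for rank, (distance, title) in enumerate(nearest, start=1):
    print(f"{rank}. title: {title}, popularity distance: {distance:.6f}")
```

On the full dataset, the same idea would apply after converting `dataset['popularity']` with `pd.to_numeric`, as was done for `budget`.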

### Fourth Subset: vote_count 
Similarity measure: Hamming

Steps: Find the largest integer, and then use its bit length as the fixed bit length for all comparisons.
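
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's final implementation: the vote counts are made-up sample values, and `hamming_distance` is a hypothetical helper name.

```python
def hamming_distance(a: int, b: int, bit_length: int) -> int:
    """Count the positions where a and b differ when written as bit_length-bit strings."""
    if a < 0 or b < 0:
        raise ValueError("vote counts are expected to be non-negative")
    if max(a, b).bit_length() > bit_length:
        raise ValueError("bit_length is too small for the given values")
    # zero-pad both numbers to the same width so the bit positions line up
    bits_a = format(a, f"0{bit_length}b")
    bits_b = format(b, f"0{bit_length}b")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(bits_a, bits_b))


# hypothetical vote counts, for illustration only
vote_counts = [5415, 2413, 92, 34, 173]

# step 1: the largest integer determines the bit length used for every comparison
bit_length = max(vote_counts).bit_length()

# step 2: rank the other counts by Hamming distance to the chosen one
chosen_count = 92
ranking = sorted(
    (hamming_distance(chosen_count, count, bit_length), count)
    for count in vote_counts
    if count != chosen_count
)
for distance, count in ranking:
    print(f"{format(count, f'0{bit_length}b')} ({count}): hamming distance {distance}")
```

Note that Hamming distance compares bit patterns rather than magnitudes, so two vote counts that are numerically close can still be far apart under this measure.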

### Fifth Subset: genre
Similarity measure: Jaccard

In [None]:
# create function that converts the genre format into a list of genres.
def convert_genre_str_to_list(genre_list_str:str):
    dict_list = ast.literal_eval(genre_list_str)
    genre_list = []
    for item in dict_list:
        genre_list.append(item['name'])
    return genre_list

# scikit-learn expects the genres to be encoded first, so we write our own set-based function instead.
def jaccard_similarity(genres_a, genres_b):
    set_genres_a = set(genres_a)
    set_genres_b = set(genres_b)
    intersection_of_sets = set_genres_a & set_genres_b
    union_of_sets = set_genres_a | set_genres_b

    if len(union_of_sets) == 0:
        return 0
    return len(intersection_of_sets) / len(union_of_sets)     
chosen_title = 'Astérix and Obélix: God Save Britannia'
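
The cell above stops before ranking any movies. One way it could continue is sketched below, on a small hypothetical sample so the block runs on its own. The movie names and genre strings are invented (they only follow the `genres` column format), and `genre_names` is a compact stand-in for `convert_genre_str_to_list`. Since Jaccard is a similarity, the highest scores are kept, so no negation is needed.

```python
import ast
import heapq


def genre_names(genre_list_str: str) -> set:
    """Parse a stringified list of genre dictionaries into a set of genre names."""
    return {item["name"] for item in ast.literal_eval(genre_list_str)}


def jaccard_similarity(genres_a: set, genres_b: set) -> float:
    """Size of the intersection divided by the size of the union (0.0 if both are empty)."""
    union_of_sets = genres_a | genres_b
    if not union_of_sets:
        return 0.0
    return len(genres_a & genres_b) / len(union_of_sets)


# hypothetical rows in the same format as the 'genres' column
sample = {
    "Movie A": "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "Movie B": "[{'id': 35, 'name': 'Comedy'}]",
    "Movie C": "[{'id': 99, 'name': 'Documentary'}]",
    "Movie D": "[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}, {'id': 28, 'name': 'Action'}]",
    "Movie E": "[]",
}

chosen = "Movie A"
chosen_genres = genre_names(sample[chosen])

# score every other movie against the chosen one
scores = [
    (jaccard_similarity(chosen_genres, genre_names(genre_str)), title)
    for title, genre_str in sample.items()
    if title != chosen
]

# keep the three most similar movies, highest similarity first
top = heapq.nlargest(3, scores)
for rank, (similarity, title) in enumerate(top, start=1):
    print(f"{rank}. title: {title}, jaccard similarity: {similarity:.3f}")
```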




## Study 2: Clustering

## References:
<ul>
<li>
<a href="https://www.analyticsvidhya.com/blog/2024/02/ways-to-convert-string-to-a-list-in-python/">Parsing StringList using ast</a>
</li>
</ul>