# CSI 4142 - Introduction to Data Science
# Assignment 4: Unsupervised Learning, Clustering and Recommendations.

Shacha Parker (300235525)\
Callum Frodsham (300199446)\
Group 79

### Setup Instructions To Reproduce this Notebook:
1. (Optional) Create a Python virtual environment in the project directory to hold all of the required packages:  
``` 
python -m venv .venv
```
To enter the virtual environment: 
```
.venv/Scripts/activate.ps1 # on windows
source .venv/bin/activate # on mac/linux
```
2. Download all of the required packages (run in cmd/shell of choice):
```
pip install jupyter
pip install ipykernel
pip install pandas
pip install numpy
pip install seaborn
```
3. VSCode: Ensure you have the correct python kernel selected!
<br> 
If you are using a virtual environment, make sure to select the Python interpreter for that virtual environment; otherwise the notebook will not run. If everything is installed globally, just make sure the kernel for that global interpreter is selected.
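
One way to confirm that the notebook is running on the intended interpreter is to print the interpreter path from inside a cell. The sketch below uses only the standard library and assumes the environment was created with `python -m venv` as in step 1; the helper name `in_virtualenv` is our own and not part of any tool.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix points at the base installation; otherwise they match.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the printed path does not point into `.venv`, the wrong kernel is probably selected.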

<h1>Dataset: </h1>
Author: Rounak Banik
<br>
Purpose: This dataset provides insight into a large amount of movie data, comprising 45,000 movies released on or before July 2017 and 26 million accompanying ratings from 270,000 users of the GroupLens website. 
<br>
Shape: This dataset is composed of 24 columns, 45466 rows.
<br><br>
Link: <a href="https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset"> The Movies Dataset</a>
<br>

Note: The "homepage", "id", "imdb_id", "poster_path" and "video" features will be omitted as they serve no purpose in this notebook.
<h3>Dataset Feature List: </h3>
movies_metadata.csv:
<ol>
    <li>adult:
    <br>
    Feature Type: Categorical
    <br>
    Description: Indicates if the movie is X-rated or not.
    </li>
    <li>belongs_to_collection:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified dictionary that indicates which collection of films the movie belongs to. Empty if no collection.
    </li>
    <li>budget:
    <br>
    Feature Type: Numerical
    <br>
    Description: The budget of the film in dollars (USD). 0 if budget is unknown.
    </li>
    <li>genres:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified list of dictionaries that contain the film's genre(s).
    </li>
    <li>original_language:
    <br>
    Feature Type: Categorical
    <br>
    Description: The film's language of origin.
    </li>
    <li>original_title:
    <br>
    Feature Type: Categorical
    <br>
    Description: The original title of the movie on release.
    </li>
    <li>overview:
    <br>
    Feature Type: Categorical
    <br>
    Description: A brief description of the movie.
    </li>
    <li>popularity:
    <br>
    Feature Type: Numerical
    <br>
    Description: The popularity score as assigned by TMDB.
    </li>
    <li>production_companies:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified list of production companies involved in creating the movie.
    </li>
    <li>production_countries:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified list of the countries in which the film was produced.
    </li>
    <li>release_date:
    <br>
    Feature Type: Numerical
    <br>
    Description: The release date of the movie.
    </li>
    <li>revenue:
    <br>
    Feature Type: Numerical
    <br>
    Description: Total revenue of the film in dollars.
    </li>
    <li>runtime:
    <br>
    Feature Type: Numerical
    <br>
    Description: The runtime of the film in minutes.
    </li>
    <li>spoken_languages:
    <br>
    Feature Type: Categorical
    <br>
    Description: Stringified list of dictionaries of the languages spoken in the film.
    </li>
    <li>status:
    <br>
    Feature Type: Categorical
    <br>
    Description: The release status of the film, with categories: 'Released', 'Rumored', 'Post Production', 'In Production', 'Planned', 'Canceled'
    </li>
    <li>tagline:
    <br>
    Feature Type: Categorical
    <br>
    Description: The tagline of the movie.
    </li>
    <li>title:
    <br>
    Feature Type: Categorical
    <br>
    Description: The title of the movie.
    </li>
    <li>vote_average:
    <br>
    Feature Type: Numerical
    <br>
    Description: The average rating of the movie.
    </li>
    <li>vote_count:
    <br>
    Feature Type: Numerical
    <br>
    Description: The number of votes by users as counted by TMDB.
    </li>
</ol>
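
Several of the features above (for example genres and spoken_languages) are stored as stringified Python literals rather than as real lists. Below is a minimal sketch of how such a value could be parsed. It assumes the entries look like the made-up `sample` string and that each dictionary carries a `"name"` key; the helper `parse_names` is hypothetical and not part of the notebook.

```python
import ast

def parse_names(raw):
    """Parse a stringified list of dicts and return the 'name' of each entry."""
    # Missing (NaN) or empty values yield an empty list.
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        # literal_eval only evaluates Python literals, never arbitrary code.
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Malformed string: treat it as having no entries.
        return []
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed
            if isinstance(item, dict) and "name" in item]

# Invented sample value in the assumed format.
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_names(sample))
```

Under those assumptions, the helper could be applied to a whole column with something like `dataset['genres'].apply(parse_names)`.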

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

# load the dataset
dataset = pd.read_csv("movies_metadata.csv")

# drop the unused columns mentioned above:
dataset.drop(columns=['homepage', 'id', 'imdb_id', 'poster_path', 'video'], inplace=True)
print(dataset.columns)

## Data Cleaning:

In [None]:
# get the general info of the dataset
print(dataset.info())

In [None]:
# check which columns have missing values:
missing_values = dataset.isna().sum()
print(missing_values)


<h5>Original language data imputation: </h5>
The original language is missing for 11 data points; however, we can impute these manually using data from Rotten Tomatoes.
Since Rotten Tomatoes does not have an easily accessible API, and building an API integration for only 11 data points would be excessive, we look up the value for each point by hand.

In [None]:
# boolean mask of the rows missing an original language
missing_language = dataset['original_language'].isna()

# print the titles of those rows so each language can be looked up by hand
print(dataset[missing_language]['title'])  

missing_language_input_vals = [ "en",
                                "en",
                                "en",
                                "en",
                                "cs",
                                "en",
                                "zxx", # silent film ISO code
                                "en",
                                "en",
                                "en",
                                "zxx" # also a silent film.
                                ]
# get the indices of the missing values.
missing_language_indices = list(dataset[missing_language].index)

# fill in the values:
for i, row_num in enumerate(missing_language_indices):
    dataset.at[row_num, 'original_language'] = missing_language_input_vals[i]

print(f"New NaN Value count for the original language feature: {dataset['original_language'].isna().sum()}")
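
The loop above relies on the hand-entered list having the same order and length as the missing rows. Below is a sketch of the same positional fill on a toy DataFrame (the titles and values are invented), with a length check added so that a mismatch fails loudly instead of silently misassigning languages.

```python
import pandas as pd

# Toy data standing in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "original_language": ["en", None, "fr", None],
})

mask = toy["original_language"].isna()
manual_values = ["cs", "zxx"]  # one entry per missing row, in row order

# Guard against the list and the missing rows getting out of sync.
if len(manual_values) != mask.sum():
    raise ValueError("number of manual values does not match number of missing rows")

# Boolean-mask assignment fills the missing rows in row order.
toy.loc[mask, "original_language"] = manual_values

print(toy)
```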

We will try to fill the 6 rows that have a null title with their "original_title" counterpart. 

In [None]:
# boolean mask of the rows missing a title
missing_titles = dataset['title'].isna()

# let's see which original titles are valid: 
dataset[missing_titles]['original_title']


Some of the original title values are clearly not valid and fail the format check (these also happen to be the only 3 values that fail the format check of the 'original_title' feature). Thus, we will only update the 'title' feature for rows 19729, 29502, and 35586, and remove the other 3 rows (19730, 29503, and 35587).

In [None]:
# remove the 3 rows whose original_title is not valid.
remove_titles = [35587, 29503,19730]
dataset.drop(index=remove_titles, inplace=True)

# get new missing_titles
missing_titles = dataset['title'].isna()
dataset[missing_titles]['original_title']

# update the other 3 missing title values using the original title.
dataset.loc[missing_titles, 'title'] = dataset.loc[missing_titles, 'original_title'] 

# show the fixed titles!
dataset[missing_titles]['title']

Row Removal Rationale: <br>
Since the dataset has roughly 45,000 rows, we can remove a small number of rows without affecting the quality of the data. 
We will remove every row that has a NULL value in any of these features: popularity, production_countries, production_companies, release_date, status, vote_average, vote_count, and runtime.

Runtime has a comparatively large number of missing values (242). These rows will be removed. Rows with a missing overview will not be removed, because they make up approximately 1/45th of the dataset and some movies simply don't have a succinct overview. Thus, all rows with a missing overview are kept. 

In [None]:
# remove them all in one fell swoop:
dataset.dropna(subset=['popularity', 'production_countries', 'production_companies', 'release_date', 'status', 'vote_average', 'vote_count','runtime'], inplace=True)

dataset.isna().sum()
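
Before dropping rows, it can help to quantify how many the removal costs. Below is a small sketch on invented data: the column names match the notebook, but the values and the `required` list are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "runtime": [90.0, np.nan, 120.0, 101.0, np.nan],
    "status": ["Released", "Released", None, "Released", "Released"],
    "overview": ["text", None, "text", None, "text"],
})

required = ["runtime", "status"]  # overview is deliberately not required

rows_before = len(toy)
# Drop a row if ANY of the required columns is missing.
cleaned = toy.dropna(subset=required)
rows_removed = rows_before - len(cleaned)

print(f"Removed {rows_removed} of {rows_before} rows ({rows_removed / rows_before:.0%})")
# Missing overviews survive because overview is not in `required`.
print(cleaned.isna().sum())
```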

## EDA:

## Study 1: Similarity Measures

## Study 2: Clustering