# Recommendation Systems

## Introduction
We will create a movie recommendation system based on The Movies Dataset available [here](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset).

### Outline:
1. Data Preprocessing
2. Content-based Filtering
3. Hybrid Recommendation System (Content-based Filtering + Collaborative Filtering)

### Importing Libraries

In [28]:
import ast
import json
import os
import pandas as pd

### Loading The Movies Dataset

In [29]:
df_metadata = pd.read_csv("data/movies_metadata.csv", low_memory=False)
df_keywords = pd.read_csv("data/keywords.csv")

## Data Preprocessing

### Movies Metadata

`movies_metadata.csv` contains information on roughly 45,000 movies featured in the Full MovieLens dataset.

In [30]:
df_metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,30/10/1995,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,15/12/1995,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,22/12/1995,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,22/12/1995,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,10/2/1995,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [31]:
df_metadata.shape

(45466, 24)

Check if there are missing values.

In [32]:
df_metadata.isnull().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

Keep only the columns that are needed to build our recommendation system.

In [33]:
df_metadata = df_metadata[
    [
        "id",
        "title",
        "genres",
        "original_language",
        "overview",
        "tagline",
        "production_countries",
        "release_date",
        "status",
        "vote_average",
        "vote_count",
        "runtime",
    ]
]

Number of movies with the same title and release date.

In [34]:
df_metadata[["title", "release_date"]].duplicated().sum()

32
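
With its default `keep="first"`, `duplicated()` flags every repeat after the first occurrence of a key, so the count above is the number of surplus rows rather than the number of distinct duplicated titles. A minimal pure-Python sketch of that counting rule, using made-up rows (the titles and dates are hypothetical, not taken from the dataset):

```python
# Hypothetical rows; only the (title, release_date) pair matters here.
rows = [
    ("Movie A", "01/01/2000"),
    ("Movie A", "01/01/2000"),  # repeat of the first row
    ("Movie A", "05/06/2010"),  # same title, different date: not a duplicate
    ("Movie B", "01/01/2000"),
    ("Movie A", "01/01/2000"),  # second repeat of the first row
]

seen = set()
duplicate_flags = []
for key in rows:
    # A row is flagged only if an identical key appeared earlier.
    duplicate_flags.append(key in seen)
    seen.add(key)

print(duplicate_flags)
print(sum(duplicate_flags))
```

Here the first occurrence of each pair is never flagged, which is also the row that `drop_duplicates` keeps by default.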

Number of movies with no overviews.

In [35]:
df_metadata[df_metadata.overview.isnull()].shape[0]

954

Number of movies whose status is not `Released` (this also counts rows with a missing status).

In [36]:
df_metadata[df_metadata.status != "Released"].shape[0]

452

Genres and production countries are stored as strings that represent lists of dictionaries.

In [37]:
df_metadata["genres"][0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [38]:
df_metadata["production_countries"][0]

"[{'iso_3166_1': 'US', 'name': 'United States of America'}]"

**TODO:**
We will remove movies that:
- share the same title and release date with another movie.
- have no overview.
- have not yet been released.  

We also need to extract the `name` values from the stringified lists in `genres` and `production_countries`.  

**Example:**
>>> extract_names("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]")  
    'Animation, Comedy, Family'

Extract the movie feature names from the data

In [39]:
def extract_names(data: str) -> str:
    """
    Extract the names from the data.

    :param data: A string representing a list of objects. Each object should have a 'name' key.
    :return: A string containing the names of the features, separated by commas and spaces.
    """
    if not data:
        # If the input is empty, return an empty string
        return ""

    try:
        # Safely evaluate the string into a Python list of dictionaries
        python_obj = ast.literal_eval(data)
        # Extract the names and join them into a single string separated by ", "
        return ", ".join(item["name"] for item in python_obj)
    except (TypeError, ValueError, SyntaxError):
        # Input that cannot be parsed into a list of dictionaries yields an empty string
        return ""
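
The stored strings use Python-style single quotes, so they are not valid JSON. The following is a minimal sketch of why the parsing step relies on `ast.literal_eval`; the input is a shortened version of the `genres` value shown earlier.

```python
import ast
import json

raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# json.loads rejects the string because JSON requires double-quoted keys and strings.
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval safely evaluates the Python literal into a list of dicts.
records = ast.literal_eval(raw)
names = ", ".join(record["name"] for record in records)

print(parsed_as_json)
print(names)
```

Unlike `eval`, `ast.literal_eval` only accepts literals (strings, numbers, lists, dicts and similar), which makes it a reasonable choice for data read from a file.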

Clean the movies dataset

In [40]:
def clean_movies_data_set(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean the movies dataset by removing duplicates, null values and non-released movies,
    and extracting the genre names and production countries from the data.

    :param df: The movie dataset to be cleaned.
    :return: A cleaned pandas DataFrame.
    """
    print(f"The number of movies in the original data set is: {df.shape[0]}")

    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Remove duplicate movies that share the same title and release date
    df.drop_duplicates(subset=["title", "release_date"], inplace=True)

    # Remove movies that have no overview or have not yet been released
    index_drop = df[(df.overview.isnull()) | (df.status != "Released")].index
    df.drop(index_drop, inplace=True)

    # Mark empty production_countries lists ("[]") as missing
    df.loc[df.production_countries == "[]", "production_countries"] = pd.NA

    # Replace all null values with an empty string
    df.fillna("", inplace=True)

    # Extracts the genre names and production countries from the data
    df["genres"] = df["genres"].apply(extract_names)
    df["production_countries"] = df["production_countries"].apply(extract_names)

    print(f"The number of movies in the cleaned data set is: {df.shape[0]}")

    return df

In [41]:
df_metadata = clean_movies_data_set(df_metadata)

The number of movies in the original data set is: 45466
The number of movies in the cleaned data set is: 44065


### Keywords

`keywords.csv` contains the plot keywords for our MovieLens movies.

In [42]:
df_keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


#### Data Preprocessing

Drop duplicates

In [43]:
df_keywords = df_keywords.drop_duplicates()

Extract keywords

In [44]:
df_keywords["keywords"] = df_keywords["keywords"].apply(extract_names)

In [45]:
df_keywords.head()

Unnamed: 0,id,keywords
0,862,"jealousy, toy, boy, friendship, friends, rival..."
1,8844,"board game, disappearance, based on children's..."
2,15602,"fishing, best friend, duringcreditsstinger, ol..."
3,31357,"based on novel, interracial relationship, sing..."
4,11862,"baby, midlife crisis, confidence, aging, daugh..."


### Merge the two DataFrames

In [46]:
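# Cast id to int so both id columns share a type before merging
df_metadata["id"] = df_metadata["id"].astype(int)
# Inner join (the default): keeps only ids present in both DataFrames
df_merged = pd.merge(df_keywords, df_metadata, on="id")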

### Create a new column `soup`
- `soup` is a single string that combines all the relevant text features to be fed into the model.
- It concatenates: genres, original language, overview, tagline, keywords, production countries

In [47]:
def create_soup(movie: pd.Series) -> str:
    """
    Concatenate several movie features into a single string to create a soup of text.

    :param movie: A movie containing features to concatenate.
    :return: A string containing the concatenated movie features.
    """
    return (
        movie.genres
        + " "
        + movie.original_language
        + " "
        + movie.overview
        + " "
        + movie.tagline
        + " "
        + movie.keywords
        + " "
        + movie.production_countries
    ).lower()

In [48]:
df_merged["soup"] = df_merged.apply(create_soup, axis=1)
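
To make the shape of a `soup` string concrete, here is a small pure-Python sketch that mirrors the concatenation on a single record. The helper `build_soup` and the record are hypothetical; the overview and keywords are shortened placeholders rather than real dataset values.

```python
def build_soup(movie: dict) -> str:
    """Join the text features with single spaces and lower-case the result."""
    fields = [
        "genres",
        "original_language",
        "overview",
        "tagline",
        "keywords",
        "production_countries",
    ]
    return " ".join(movie[field] for field in fields).lower()


# Hypothetical record: the overview and keywords are shortened placeholders.
example_movie = {
    "genres": "Animation, Comedy, Family",
    "original_language": "en",
    "overview": "Placeholder overview text.",
    "tagline": "",  # missing taglines were replaced with an empty string
    "keywords": "jealousy, toy, boy",
    "production_countries": "United States of America",
}

soup = build_soup(example_movie)
print(soup)
```

An empty field such as the tagline simply leaves two adjacent spaces in the result, which is usually harmless for whitespace-based tokenizers.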

### IMDb's weighted rating  
A movie with an average rating of 9 based on only 2 votes cannot be considered better than a movie with a lower average rating of 8 based on 1,000 votes. We will therefore use IMDb's weighted rating to determine the quality of a movie.

$\text{Weighted Rating} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$  

where,
- v is the number of votes for the movie (vote_count)
- R is the average rating of the movie (vote_average)
- C is the mean vote across the whole dataset (the mean of vote_average)
- m is the minimum number of votes required to be listed in the chart (here, the 90th percentile of vote_count)

In [49]:
mean_vote_average_C = df_merged["vote_average"].mean()
mean_vote_average_C

5.644187999273783

In [50]:
min_vote_counts_m = df_merged["vote_count"].quantile(0.9)
min_vote_counts_m

167.0
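
With `C` and `m` computed, the formula can be checked by hand on the two hypothetical movies from the motivating example (a 9-rated movie with 2 votes and an 8-rated movie with 1,000 votes). This is only an illustrative sketch: the function name `weighted_rating` is an assumption, and `C` is rounded to three decimals.

```python
def weighted_rating(v: float, R: float, m: float, C: float) -> float:
    """IMDb-style weighted rating: a vote-count-weighted blend of R and C."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Values reported by the cells above (C is rounded here).
C = 5.644
m = 167.0

# The two hypothetical movies from the motivating example.
few_votes = weighted_rating(v=2, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=1000, R=8.0, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2))
```

With only 2 votes the score is pulled almost all the way to `C` (about 5.7), while 1,000 votes keep it close to the movie's own average (about 7.7), so the better-supported movie ranks higher.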

Keep only the movies that qualify for the chart, i.e. those with at least `m` votes

In [51]:
df_qualified = df_merged.loc[df_merged["vote_count"] >= min_vote_counts_m]
df_qualified.shape

(4418, 14)

In [52]:
def calculate_weighted_rating(
    movie: pd.Series, min_vote_counts_m: float, mean_vote_average_C: float
) -> float:
    """
    Calculate the weighted rating for a movie based on its vote count, vote average,
    and the minimum vote counts and mean vote average across the dataset.

    :param movie: A DataFrame row representing a movie.
    :param min_vote_counts_m: The minimum votes required to be listed in the chart.
    :param mean_vote_average_C: The mean vote average across the whole dataset.
    :return: The weighted rating for the movie, rounded to one decimal place.
    """
    vote_count = movie["vote_count"]
    vote_average = movie["vote_average"]

    weighted_rating = (vote_count / (vote_count + min_vote_counts_m) * vote_average) + (
        min_vote_counts_m / (min_vote_counts_m + vote_count) * mean_vote_average_C
    )
    
    weighted_rating = round(weighted_rating, 1)
    return weighted_rating

In [53]:
df_merged["weighted_rating"] = df_merged.apply(
    lambda movie: calculate_weighted_rating(
        movie, min_vote_counts_m, mean_vote_average_C
    ),
    axis=1,
)

### Export DataFrames to CSV

In [54]:
folder_name = "preprocessed_data"

# Create the output folder if it does not already exist
os.makedirs(folder_name, exist_ok=True)

csv_file_path = os.path.join(folder_name, "merged_metadata_keywords.csv")
df_merged.to_csv(csv_file_path, index=False)

csv_file_path = os.path.join(folder_name, "qualified_movies.csv")
df_qualified.to_csv(csv_file_path, index=False)