# IDEATE

## Overview

A movie recommendation system is a software tool that proposes films to users based on their preferences, past viewing habits, and other relevant information. It assists users in finding new films that correspond with their interests, resulting in a more tailored and captivating entertainment experience.

## Problem Statement

This notebook aims to build a **Movie Recommendation System** using a content-based filtering approach on a dataset scraped from IMDb.

Throughout the notebook, we will explore methods for vectorizing movie features, developing similarity metrics to gauge movie similarity, and constructing a recommendation mechanism. By the end, the goal is to have a fully operational movie recommendation system that offers personalized movie suggestions, thereby enhancing the overall movie-viewing experience.

# EXPLORE

## About dataset

The datasets were scraped from the official **IMDb website**. They provide detailed metadata, including:
- Title
- Genres
- Rating
- Description
- Release Year
- Director
- Main Cast
- URL

## Preparation

**Note**: we will be using **Python** for this study.

To begin, let's load the necessary Python packages and libraries and define some helper functions for the project. We'll then import the data from **scrap_datasets/**.

### Import libraries and packages

In [1]:
import pandas as pd
import numpy as np
import ast
import nltk
import pickle
import glob
import os
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Porter stemmer, used to reduce inflected words in the tags feature to a common stem
ps = PorterStemmer()

def concat_csv_files(folder_path, pattern="*.csv"):
    """
    Concatenate all CSV files in a specified folder into a single DataFrame.
    
    Parameters:
        folder_path (str): Path to the folder containing CSV files.
        pattern (str): Pattern to match file names (default is '*.csv').
    
    Returns:
        pd.DataFrame: A single DataFrame containing all rows from matched CSV files.
    """
    csv_files = glob.glob(os.path.join(folder_path, pattern))  # Find all matching files
    df_list = [pd.read_csv(file) for file in csv_files]         # Read each CSV file
    combined_df = pd.concat(df_list, ignore_index=True)         # Concatenate into one DataFrame
    return combined_df

def to_snake_case_columns(df):
    """
    Convert all column names in a DataFrame to snake_case format.
    
    Parameters:
        df (pd.DataFrame): Input DataFrame with original column names.
    
    Returns:
        pd.DataFrame: DataFrame with column names converted to snake_case.
    """
    df.columns = [
        re.sub(r'\W+', '_', col.strip().lower())  # Trim, lowercase, replace each run of non-word characters with "_"
        for col in df.columns
    ]
    return df

def drop_incomplete_rows(df):
    """
    Drop rows where any cell is missing or invalid (NaN, 'NAN', 'None', or empty string).
    
    Parameters:
        df (pd.DataFrame): Input DataFrame that may contain missing/incomplete data.
    
    Returns:
        pd.DataFrame: Cleaned DataFrame with incomplete rows removed.
    """
    df_cleaned = df.replace(['NAN', 'None', ''], np.nan)  # Standardize invalid entries as NaN
    df_cleaned = df_cleaned.dropna(how='any')             # Drop any row with at least one NaN
    return df_cleaned

def split_column_values(df, columns):
    """
    Split comma-separated string values into lists for specified columns.
    
    Parameters:
        df (pd.DataFrame): Input DataFrame containing string values.
        columns (list): List of column names to be split into lists.
    
    Returns:
        pd.DataFrame: DataFrame with specified columns containing list of values.
    """
    for col in columns:
        df[col] = df[col].apply(
            lambda x: [v.strip() for v in x.split(',')] if pd.notnull(x) else []
        )  # Split by comma and trim whitespace
    return df
    
def stem(text):
    """
    Applies stemming to each word in the input text using the PorterStemmer.

    Parameters:
        text (str): A string of space-separated words.

    Returns:
        str: A string with each word stemmed (reduced to its root form).
    """
    y = []
    for i in text.split():          # Split the text into individual words
        y.append(ps.stem(i))        # Apply stemming to each word and collect the result
    return " ".join(y)              # Join the stemmed words back into a single string
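
To illustrate the column-name normalization used in `to_snake_case_columns`, the sketch below applies the same regular expression to a few sample headers. The header strings and the standalone helper `to_snake_case` are assumptions made for illustration; they are not taken from the scraped files.

```python
import re

def to_snake_case(name):
    # Same expression as in to_snake_case_columns: trim, lowercase,
    # then replace each run of non-word characters with one underscore.
    return re.sub(r'\W+', '_', name.strip().lower())

# Hypothetical headers, chosen only to show the behaviour
sample_headers = ["Release Year", " Main Cast ", "Url"]
normalized = [to_snake_case(h) for h in sample_headers]
print(normalized)  # ['release_year', 'main_cast', 'url']
```

Note that a header ending in punctuation, such as `"Rating (%)"`, would come out with a trailing underscore (`"rating_"`), because the trailing run of non-word characters is replaced rather than removed.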

### Import Datasets

In [3]:
# All CSV files are in the 'scrap_datasets/' directory
dataset = concat_csv_files("scrap_datasets/")

In [4]:
dataset.reset_index(drop=True, inplace=True)

In [5]:
# Convert column names to snake_case
dataset = to_snake_case_columns(dataset)

In [6]:
# Create "movie_id" column
dataset.loc[:, "movie_id"] = dataset.index

In [7]:
# Check first row
dataset.head(1)

Unnamed: 0,title,genre,rating,description,release_year,director,main_cast,url,movie_id
0,Game of Thrones,"Action, Adventure, Drama",9.2,Nine noble families fight for control over the...,2011.0,,"Emilia Clarke, Peter Dinklage, Kit Harington",https://www.imdb.com/title/tt0944947/,0


**Since we will implement content-based filtering, we select only the features that are relevant to the content of the movies.**

In [8]:
# Keep only the columns that are relevant to the content of the movie
dataset = dataset[["movie_id", "title", "genre", "description", "director", "main_cast"]]

In [9]:
# Check dataset info
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9671 entries, 0 to 9670
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   movie_id     9671 non-null   int64 
 1   title        9671 non-null   object
 2   genre        9666 non-null   object
 3   description  9555 non-null   object
 4   director     9257 non-null   object
 5   main_cast    9363 non-null   object
dtypes: int64(1), object(5)
memory usage: 453.5+ KB


### Data Cleaning

In [10]:
# Drop any movie that contains a null value
dataset = drop_incomplete_rows(dataset)

In [11]:
# Apply split_column_values to specific columns (convert comma-separated strings to lists)
dataset = split_column_values(dataset, ["genre", "director", "main_cast"])

In [12]:
# Lowercase "description", strip punctuation, and split it into a list of words
dataset["description"] = dataset["description"].apply(lambda x: re.sub(r'[^\w\s]', '', x.lower()).split())
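
As a quick check of what this step produces, the sketch below runs the same expression on one sentence. The sentence is a made-up example modelled on the truncated output shown in this notebook, and `tokenize` is a hypothetical helper name.

```python
import re

def tokenize(description):
    # Lowercase, drop every character that is neither a word character
    # nor whitespace, then split on whitespace.
    return re.sub(r'[^\w\s]', '', description.lower()).split()

# Sample sentence assumed for illustration
tokens = tokenize("As the Clone Wars nears its end, Obi-Wan Kenobi pursues a new threat.")
print(tokens)
# ['as', 'the', 'clone', 'wars', 'nears', 'its', 'end', 'obiwan',
#  'kenobi', 'pursues', 'a', 'new', 'threat']
```

Because punctuation is deleted rather than replaced with a space, hyphenated words such as "Obi-Wan" collapse into a single token ("obiwan").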

In [13]:
dataset.head(1)

Unnamed: 0,movie_id,title,genre,description,director,main_cast
1,1,Star Wars: Episode III - Revenge of the Sith,"[Action, Adventure, Fantasy]","[as, the, clone, wars, nears, its, end, obiwan...",[George Lucas],"[Hayden Christensen, Natalie Portman, Ewan McG..."


In [14]:
# Remove spaces within each value so that every name or genre becomes a single token
dataset["main_cast"] = dataset["main_cast"].apply(lambda x: [i.strip().replace(" ", "") for i in x])
dataset["director"] = dataset["director"].apply(lambda x: [i.strip().replace(" ", "") for i in x])
dataset["genre"] = dataset["genre"].apply(lambda x: [i.replace(" ", "") for i in x])

In [15]:
dataset.head(1)

Unnamed: 0,movie_id,title,genre,description,director,main_cast
1,1,Star Wars: Episode III - Revenge of the Sith,"[Action, Adventure, Fantasy]","[as, the, clone, wars, nears, its, end, obiwan...",[GeorgeLucas],"[HaydenChristensen, NataliePortman, EwanMcGregor]"
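
A likely motivation for removing the spaces (an assumption about intent, since the notebook does not state it) is that a whitespace-based vectorizer would otherwise split "George Lucas" into "george" and "lucas", so two unrelated people sharing a first name would look partly similar. The sketch below compares the shared tokens with and without merging; `name_tokens` is a hypothetical helper and the pairing of names is chosen only for illustration.

```python
def name_tokens(names, merge_spaces):
    # Build the set of whitespace-separated tokens a vectorizer would see.
    tokens = set()
    for name in names:
        text = name.replace(" ", "") if merge_spaces else name
        tokens.update(text.lower().split())
    return tokens

movie_a = ["George Lucas"]
movie_b = ["George Miller"]

shared_raw = name_tokens(movie_a, False) & name_tokens(movie_b, False)
shared_merged = name_tokens(movie_a, True) & name_tokens(movie_b, True)
print(shared_raw)     # {'george'}
print(shared_merged)  # set()
```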


In [28]:
# Create new column "tags"; repeating a list gives its tokens a higher weight
def create_tags(row):
    return row['description'] + row['genre'] *  2 + row['main_cast'] * 3 + row['director'] * 2

dataset["tags"] = dataset.apply(create_tags, axis=1)
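
The weighting in `create_tags` relies on Python list repetition: multiplying a list by `n` repeats its elements `n` times, so those tokens are counted `n` times once the tags are vectorized. A minimal sketch, using a hypothetical row shaped like one record after cleaning:

```python
from collections import Counter

# Hypothetical row, not taken from the dataset
row = {
    "description": ["a", "jedi", "falls"],
    "genre": ["Action"],
    "main_cast": ["EwanMcGregor"],
    "director": ["GeorgeLucas"],
}

# Same expression as in create_tags
tags = row["description"] + row["genre"] * 2 + row["main_cast"] * 3 + row["director"] * 2
counts = Counter(tags)
print(counts["Action"], counts["EwanMcGregor"], counts["GeorgeLucas"], counts["jedi"])  # 2 3 2 1
```

With these weights, a shared cast member contributes more to the similarity score than a shared genre or a shared description word.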

In [29]:
# Create new dataset (copy to avoid pandas SettingWithCopyWarning on later assignments)
new_dataset = dataset[["movie_id", "title", "tags"]].copy()

In [30]:
new_dataset.loc[:, "tags"] = new_dataset["tags"].apply(lambda x: " ".join(x).lower())

In [31]:
new_dataset.head(1)

Unnamed: 0,movie_id,title,tags
1,1,Star Wars: Episode III - Revenge of the Sith,as the clone wars nears its end obiwan kenobi ...


In [32]:
# Save new dataset
new_dataset.reset_index(drop=True, inplace=True)
new_dataset.to_csv("./scrap_datasets/clean_datasets/cleaned_imdb_scrapped_dataset.csv", index=False)

# DEVELOP

In [33]:
# Apply stemming to the "tags" column
new_dataset.loc[:, "tags"] = new_dataset["tags"].apply(stem)

# Vectorization: represent each movie as a vector of token counts
cv = CountVectorizer(max_features=5000, stop_words="english")

# Vectorize the "tags" column of the dataset
vector = cv.fit_transform(new_dataset["tags"]).toarray()

# Compute the cosine similarity between every pair of movie vectors
similar = cosine_similarity(vector)
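
To make the similarity score concrete, here is a plain-Python sketch of cosine similarity on token counts. It is not how scikit-learn computes it internally, and it skips the stop-word removal and the `max_features` cap used above; the tag strings and the helper `cosine` are made up for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    # a and b are Counter objects mapping token -> count
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tag strings
doc_a = Counter("ogre swamp princess comedy comedy".split())
doc_b = Counter("ogre princess kingdom comedy comedy".split())
doc_c = Counter("space war drama drama".split())

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(round(sim_ab, 3), sim_ac)  # 0.857 0.0
```

Here `doc_a` and `doc_b` share three tokens (with "comedy" counted twice), giving a dot product of 6 and norms of √7 each, so the score is 6/7. `doc_a` and `doc_c` share no tokens, so the score is 0.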

In [34]:
def recommend(movie, top_n=5):
    """
    Recommend top N movies similar to the given movie title.
    
    Parameters:
        movie (str): The title of the movie to base recommendations on.
        top_n (int): Number of recommendations to return.
    
    Returns:
        list | str: A list of recommended movie titles, or a message string if the title is not found.
    """
    movie = movie.strip().lower()
    match = new_dataset[new_dataset["title"].str.lower().str.strip() == movie]
    
    if match.empty:
        return f"Movie '{movie}' not found in the dataset."
    
    movie_index = match.index[0]
    distances = similar[movie_index]
    movie_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:top_n+1]
    
    recommendations = [new_dataset.iloc[i[0]].title for i in movie_list]
    return recommendations
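
The `[1:top_n+1]` slice in `recommend` assumes the queried movie itself is ranked first. That usually holds because a vector's similarity to itself is 1, but another entry with identical tags could tie with it. The sketch below shows one possible variation that excludes the query by position instead; the titles and scores are hypothetical.

```python
# Hypothetical similarity row for the movie at position 2
titles = ["A", "B", "C", "D", "E"]
row = [0.10, 0.75, 1.00, 0.40, 0.90]
query_index = 2
top_n = 2

# Rank all positions by similarity, highest first
ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)

# Filter the query explicitly rather than slicing off the first entry
picks = [titles[i] for i, _ in ranked if i != query_index][:top_n]
print(picks)  # ['E', 'B']
```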

In [40]:
recommend("Shrek")

['Shrek 2',
 'Shrek Forever After',
 'Shrek the Third',
 'The Chipmunk Adventure',
 'The Pirates Who Don&apos;t Do Anything: A VeggieTales Movie']

In [41]:
with open("./output/movies.pkl", "wb") as f:
    pickle.dump(new_dataset.to_dict(), f)
with open("./output/similar.pkl", "wb") as f:
    pickle.dump(similar, f)
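
For completeness, here is a minimal sketch of how a saved artifact could be read back later, for example from a separate script. It writes to a temporary directory and uses a small stand-in dictionary instead of the real `new_dataset.to_dict()` output, so the path and the data are assumptions.

```python
import os
import pickle
import tempfile

# Stand-in for new_dataset.to_dict(); not the real data
movies = {"title": {0: "Shrek", 1: "Shrek 2"}}

with tempfile.TemporaryDirectory() as out_dir:
    movies_path = os.path.join(out_dir, "movies.pkl")

    # Write the object, closing the file when done
    with open(movies_path, "wb") as f:
        pickle.dump(movies, f)

    # Read it back
    with open(movies_path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == movies)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.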