# IDEATE

## Overview

A movie recommendation system is a software tool that proposes films to users based on their preferences, past viewing habits, and other relevant information. It assists users in finding new films that correspond with their interests, resulting in a more tailored and captivating entertainment experience.

## Problem Statement

This notebook aims to build a **Movie Recommendation System** using a content-based filtering approach on the TMDB dataset.

Throughout the notebook, we will explore methods for vectorizing movie features, developing similarity metrics to gauge movie similarity, and constructing a recommendation mechanism. By the end, the goal is to have a fully operational movie recommendation system that offers personalized movie suggestions, thereby enhancing the overall movie-viewing experience.

# EXPLORE

## About dataset

**TMDB (The Movie Database)** is a popular, community-driven online database for movies, TV shows, and the people behind them. It provides detailed metadata including:

- Movie and TV titles

- Overviews and synopses

- Genres and keywords

- Cast and crew details

- Posters and trailers

- Ratings and release dates

**TMDB** is widely used in movie recommendation systems and media applications because of its comprehensive and regularly updated content. It offers a free API that developers and researchers can use to access its data for projects such as recommendation engines, cataloging tools, and entertainment apps.

## Preparation

**Notice**: this study uses **Python**.

To begin, we load the necessary Python packages and libraries and define a few helper functions for the project. We then import the data from **/datasets**.

### Import libraries and packages

In [1]:
import pandas as pd
import numpy as np
import ast
import nltk
import pickle

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Porter stemmer: reduces inflected words in the tags feature to a common stem
ps = PorterStemmer()

def extract_names(obj, field_type):
    """
    Extracts specific names based on the field type.
    
    Parameters:
        obj (str): The JSON-formatted string to parse.
        field_type (str): One of ['crew', 'cast', 'default'].
    
    Returns:
        list: A list of extracted names.
    """
    result = []
    try:
        data = ast.literal_eval(obj)
        if field_type == 'crew':
            for i in data:
                if i.get("job") == "Director":
                    result.append(i.get("name"))
                    break
        elif field_type == 'cast':
            for i in data[:3]:  # top 3 cast members
                result.append(i.get("name"))
        else:
            for i in data:
                result.append(i.get("name"))
    except Exception:  # malformed input: keep whatever was collected so far
        pass
    return result

def stem(text):
    """
    Applies stemming to each word in the input text using the PorterStemmer.

    Parameters:
        text (str): A string of space-separated words.

    Returns:
        str: A string with each word stemmed (reduced to its root form).
    """
    y = []
    for i in text.split():          # Split the text into individual words
        y.append(ps.stem(i))        # Apply stemming to each word and collect the result
    return " ".join(y)              # Join the stemmed words back into a single string
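To see what `stem` does to the tags, here is a minimal, self-contained sketch. It re-declares the helper so it can run on its own, and the sample string is made up for illustration rather than taken from the dataset.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem(text):
    """Stem every whitespace-separated word in `text`."""
    return " ".join(ps.stem(word) for word in text.split())

# Illustrative input (not from the dataset): three inflections of one verb
sample = "loved loving loves"
stemmed = stem(sample)
print(stemmed)
```

All three inflections collapse onto the same stem, so a later word-count step treats them as one term instead of three.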

In [3]:
import pandas as pd
import glob
import os

def concat_csv_files(folder_path, pattern="*.csv"):
    """
    Concatenate all CSV files in a folder into a single DataFrame.
    
    Parameters:
        folder_path (str): Path to the folder containing CSV files.
        pattern (str): Glob pattern to match files (default is '*.csv').
    
    Returns:
        pd.DataFrame: Combined DataFrame.
    """
    csv_files = glob.glob(os.path.join(folder_path, pattern))
    df_list = [pd.read_csv(file) for file in csv_files]
    combined_df = pd.concat(df_list, ignore_index=True)
    return combined_df
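The helper above can be tried out without any real data. The sketch below writes two small, made-up CSV files to a temporary folder and concatenates them. It is a variant under two assumptions that are not in the original: files are sorted so the row order is deterministic, and an explicit error is raised when nothing matches (otherwise `pd.concat` fails on an empty list).

```python
import glob
import os
import tempfile

import pandas as pd

def concat_csv_files(folder_path, pattern="*.csv"):
    # Sorting is an added assumption: glob does not guarantee any order
    csv_files = sorted(glob.glob(os.path.join(folder_path, pattern)))
    if not csv_files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder_path!r}")
    return pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical scrape results with the same columns
    pd.DataFrame({"Title": ["A", "B"]}).to_csv(os.path.join(tmp, "part1.csv"), index=False)
    pd.DataFrame({"Title": ["C"]}).to_csv(os.path.join(tmp, "part2.csv"), index=False)
    combined = concat_csv_files(tmp)

print(combined)
```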

In [5]:
# Assuming all your CSV files are in the 'scrap_datasets/' directory
df = concat_csv_files("scrap_datasets/")
df.head()

Unnamed: 0,Title,Genre,Rating,Description,Release Year,Director,Main Cast,Url
0,Game of Thrones,"Action, Adventure, Drama",9.2,Nine noble families fight for control over the...,2011.0,,"Emilia Clarke, Peter Dinklage, Kit Harington",https://www.imdb.com/title/tt0944947/
1,Star Wars: Episode III - Revenge of the Sith,"Action, Adventure, Fantasy",7.6,"As the Clone Wars nears its end, Obi-Wan Kenob...",2005.0,George Lucas,"Hayden Christensen, Natalie Portman, Ewan McGr...",https://www.imdb.com/title/tt0121766/
2,The Four Seasons,"Comedy, Drama",6.8,Witty character study of three couples who vac...,1981.0,Alan Alda,"Alan Alda, Carol Burnett, Len Cariou",https://www.imdb.com/title/tt0082405/
3,Gladiator,"Action, Adventure, Drama",8.5,A former Roman General sets out to exact venge...,2000.0,Ridley Scott,"Russell Crowe, Joaquin Phoenix, Connie Nielsen",https://www.imdb.com/title/tt0172495/
4,Pride & Prejudice,"Drama, Romance",7.8,When Elizabeth Bennet meets the handsome Mr. D...,2005.0,Joe Wright,"Keira Knightley, Matthew Macfadyen, Brenda Ble...",https://www.imdb.com/title/tt0414387/


In [6]:
len(df)

8500

### Import datasets

In [26]:
movies = pd.read_csv("./datasets/tmdb_5000_movies.csv")
credits = pd.read_csv("./datasets/tmdb_5000_credits.csv")

#### Merge datasets

In [27]:
# Merge dataset on "title"
dataset = movies.merge(credits, on="title")
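Merging on `title` implicitly assumes that titles are unique. If two different films share a title, every copy on the left pairs with every copy on the right, which inflates the row count. The sketch below shows the effect on small, made-up frames and compares it with a merge on the numeric ID. Whether this applies to the TMDB files can be checked by comparing `len(dataset)` with `len(movies)`.

```python
import pandas as pd

# Hypothetical frames: two different films share the title "Batman"
movies_demo = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
credits_demo = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})

# Merge on title: each "Batman" row matches both "Batman" credit rows
by_title = movies_demo.merge(credits_demo, on="title")

# Merge on the ID columns: one match per film (title becomes title_x / title_y)
by_id = movies_demo.merge(credits_demo, left_on="id", right_on="movie_id")

print(len(by_title))  # 5
print(len(by_id))     # 3
```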

In [28]:
# Check first row
dataset.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [29]:
# Check columns
dataset.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

Since we will implement content-based filtering, we select only the features that are relevant to the content of the movies.

In [30]:
# Keep only the columns that are relevant to the content of the movie
dataset = dataset[["movie_id","title","overview","genres","keywords","cast","crew"]]
dataset.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [31]:
# Check dataset info
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4809 non-null   int64 
 1   title     4809 non-null   object
 2   overview  4806 non-null   object
 3   genres    4809 non-null   object
 4   keywords  4809 non-null   object
 5   cast      4809 non-null   object
 6   crew      4809 non-null   object
dtypes: int64(1), object(6)
memory usage: 263.1+ KB


In [32]:
dataset.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [33]:
# Drop rows with null values and reset the index so that row labels match row positions
dataset = dataset.dropna().reset_index(drop=True)
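One detail worth keeping in mind when dropping rows: `dropna` keeps the original index labels, so after a drop the labels no longer line up with row positions. That matters later whenever an index label is used to address a row of a NumPy array built from the frame. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical frame with one missing overview
df_demo = pd.DataFrame({"title": ["A", "B", "C"], "overview": ["x", None, "z"]})

dropped = df_demo.dropna()
print(list(dropped.index))  # [0, 2] -> label 2 now sits at position 1

clean = df_demo.dropna().reset_index(drop=True)
print(list(clean.index))    # [0, 1] -> labels and positions agree again
```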

In [34]:
# Extract genres as a list
dataset["genres"] = dataset["genres"].apply(lambda x: extract_names(x, 'default'))

# Extract keywords as array
dataset["keywords"] = dataset["keywords"].apply(lambda x: extract_names(x, 'default'))

# Extract top 3 cast's name as array
dataset["cast"] = dataset["cast"].apply(lambda x: extract_names(x, 'cast'))

# Extract director's name
dataset["crew"] = dataset["crew"].apply(lambda x: extract_names(x, 'crew'))

In [35]:
dataset.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
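The extraction step turns each JSON-like string into a plain list of names. The sketch below reproduces the idea on a hand-written sample in the same shape as the `genres` column. It also shows one limitation that is easy to miss: `ast.literal_eval` parses Python literals, so a JSON `null` makes it raise `ValueError`, while `json.loads` accepts it. Using `json.loads` here is an alternative for illustration, not what the notebook does.

```python
import ast
import json

# Hand-written sample shaped like a `genres` cell
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

names_ast = [item["name"] for item in ast.literal_eval(raw)]
names_json = [item["name"] for item in json.loads(raw)]
print(names_ast)  # ['Action', 'Adventure']

# A JSON null is not a Python literal, so literal_eval rejects it
raw_with_null = '[{"id": 1, "name": null}]'
try:
    ast.literal_eval(raw_with_null)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

parsed = json.loads(raw_with_null)  # json.loads maps null to None
print(literal_eval_ok, parsed)
```

With the `try`/`except` in `extract_names`, such a row would silently come back as an empty list, so it can be worth counting empty results after the extraction.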


In [36]:
dataset["overview"] = dataset["overview"].apply(lambda x: x.split())

In [37]:
# Remove spaces inside multi-word names so each one becomes a single token (e.g. "Sam Worthington" -> "SamWorthington")
dataset["cast"] = dataset["cast"].apply(lambda x:[i.replace(" ","") for i in x])
dataset["crew"] = dataset["crew"].apply(lambda x:[i.replace(" ","") for i in x])
dataset["keywords"] = dataset["keywords"].apply(lambda x:[i.replace(" ","") for i in x])
dataset["genres"] = dataset["genres"].apply(lambda x:[i.replace(" ","") for i in x])

In [38]:
dataset.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
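Why remove the spaces? A word-level vectorizer splits on whitespace, so two different people who share a first name would otherwise share a token and look partly similar. The sketch below illustrates this with `CountVectorizer` on two names chosen purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two different people sharing a first name (illustrative only)
with_spaces = ["sam worthington", "sam mendes"]
without_spaces = ["samworthington", "sammendes"]

vocab_with = sorted(CountVectorizer().fit(with_spaces).vocabulary_)
vocab_without = sorted(CountVectorizer().fit(without_spaces).vocabulary_)

print(vocab_with)     # ['mendes', 'sam', 'worthington'] -> 'sam' is shared
print(vocab_without)  # ['sammendes', 'samworthington'] -> no shared token
```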


In [39]:
dataset["tags"] = dataset["overview"] + dataset["genres"] + dataset["keywords"] + dataset["cast"] + dataset["crew"]

In [40]:
# Keep only the columns needed for the recommender (copy to avoid SettingWithCopyWarning)
new_dataset = dataset[["movie_id", "title", "tags"]].copy()

In [41]:
new_dataset.loc[:, "tags"] = new_dataset["tags"].apply(lambda x:" ".join(x))
new_dataset.loc[:, "tags"] = new_dataset["tags"].apply(lambda x:x.lower())

In [42]:
new_dataset.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."


In [43]:
# Save new dataset
new_dataset.to_csv("./datasets/new_dataset.csv", index=False)

# DEVELOP

In [46]:
# Apply stemming to the "tags" column
new_dataset.loc[:, "tags"] = new_dataset["tags"].apply(stem)

In [47]:
# Vectorization: represent each movie as a word-count vector (at most 5,000 terms, English stop words removed)
cv = CountVectorizer(max_features=5000, stop_words="english")

In [50]:
# Vectorize the "tags" column of the dataset
vector = cv.fit_transform(new_dataset["tags"]).toarray()
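To make the output of `fit_transform` concrete, here is a small sketch on three made-up tag strings with the same vectorizer settings. Each row is one document and each column counts one vocabulary term. The variable names `docs`, `cv_demo`, `terms`, and `matrix` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical tag strings
docs = ["action space hero", "space war hero", "romance drama"]

cv_demo = CountVectorizer(max_features=5000, stop_words="english")
matrix = cv_demo.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index
terms = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)

print(terms)   # ['action', 'drama', 'hero', 'romance', 'space', 'war']
print(matrix)
```

Note that `.toarray()` produces a dense matrix. For a few thousand movies and 5,000 terms that is manageable, but `cosine_similarity` also accepts the sparse matrix directly if memory becomes a concern.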

In [51]:
# Compute the pairwise cosine similarity between all movie vectors
similar = cosine_similarity(vector)
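Cosine similarity is the dot product of two vectors divided by the product of their lengths. For non-negative count vectors it ranges from 0 (no shared terms) to 1 (same direction). A minimal NumPy sketch on toy count vectors, written out by hand for illustration:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy count vectors over six terms (made up for illustration)
a = np.array([1, 0, 1, 0, 1, 0])
b = np.array([0, 0, 1, 0, 1, 1])  # shares two terms with a
c = np.array([0, 1, 0, 1, 0, 0])  # shares no terms with a

print(cosine(a, b))  # 2 / 3, about 0.667
print(cosine(a, c))  # 0.0
```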

In [56]:
# Recommend function: prints the 5 movies most similar to the given title
def recommend(movie):
    movie_index = new_dataset[new_dataset["title"]==movie].index[0]
    distances = similar[movie_index]
    movie_list = sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]
    
    for i in movie_list: 
        print(new_dataset.iloc[i[0]].title)
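For reuse outside the notebook it can be convenient to return the titles instead of printing them, and to handle titles that are not in the dataset. The sketch below is one possible variant, not the notebook's function. `recommend_titles` and its parameters are hypothetical names, and it looks up the movie by row position so that it does not depend on the frame's index labels.

```python
import numpy as np
import pandas as pd

def recommend_titles(movie, frame, sim, top_n=5):
    """Return up to `top_n` titles most similar to `movie` (empty list if unknown)."""
    matches = np.flatnonzero(frame["title"].to_numpy() == movie)
    if matches.size == 0:
        return []
    pos = matches[0]                               # row position, not index label
    order = np.argsort(-sim[pos], kind="stable")   # most similar first
    order = order[order != pos][:top_n]            # drop the movie itself
    return frame["title"].iloc[order].tolist()

# Toy data with a non-contiguous index, as left behind by a row drop
toy = pd.DataFrame({"title": ["A", "B", "C"]}, index=[0, 2, 5])
toy_sim = np.array([[1.0, 0.9, 0.1],
                    [0.9, 1.0, 0.3],
                    [0.1, 0.3, 1.0]])

print(recommend_titles("C", toy, toy_sim, top_n=2))  # ['B', 'A']
print(recommend_titles("Missing", toy, toy_sim))     # []
```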

In [64]:
recommend("Captain America: Civil War")

Captain America: The First Avenger
Iron Man 3
Captain America: The Winter Soldier
Avengers: Age of Ultron
The Avengers


In [65]:
recommend("Jurassic World")

Jurassic Park
The Lost World: Jurassic Park
Walking With Dinosaurs
Terminator Genisys
Jurassic Park III


In [60]:
recommend("Superman Returns")

Superman II
Superman III
Superman IV: The Quest for Peace
Superman
The Wolverine


In [68]:
with open("./output/movies.pkl", "wb") as f:
    pickle.dump(new_dataset.to_dict(), f)
with open("./output/similar.pkl", "wb") as f:
    pickle.dump(similar, f)
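The saved files can be loaded back in another script or app. The sketch below shows the round trip on small stand-in objects written to a temporary folder. The file names mirror the ones above, while the data is made up. Note that `open(..., "wb")` does not create missing folders, so the `./output` directory has to exist before the save step runs.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Stand-ins for new_dataset and the similarity matrix
frame = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"]})
sim = np.eye(2)

with tempfile.TemporaryDirectory() as tmp:
    movies_path = os.path.join(tmp, "movies.pkl")
    similar_path = os.path.join(tmp, "similar.pkl")

    # Save: the frame is stored as a plain dict, the matrix as a NumPy array
    with open(movies_path, "wb") as f:
        pickle.dump(frame.to_dict(), f)
    with open(similar_path, "wb") as f:
        pickle.dump(sim, f)

    # Load: rebuild the DataFrame from the dict
    with open(movies_path, "rb") as f:
        restored = pd.DataFrame(pickle.load(f))
    with open(similar_path, "rb") as f:
        restored_sim = pickle.load(f)

print(restored)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.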