# Movie Recommendation System: Project Overview

The goal of this project is to develop a movie recommendation system that delivers personalized movie suggestions based on movie content and attributes. By analyzing movie descriptions and keywords, this system helps users discover films that align with their tastes, enhancing their movie-watching experience.

This project specifically implements a Content-Based Filtering recommendation system using the TMDb Movies Dataset (2023).

Content-Based Filtering focuses on the characteristics of the movies themselves, rather than relying on user ratings or collaborative patterns.
It analyzes textual data such as movie overviews and keywords to measure how similar two movies are.
Using TF-IDF and bag-of-words vectorization with cosine similarity, the system works on a 25,000-movie subset of the dataset and shows how a content-based approach can produce meaningful recommendations.

https://www.kaggle.com/code/moridata/recommendation-system-movie-recommendation/notebook

# Types of Recommendation Systems:

Recommendation systems can be broadly categorized into Non-Personalized and Personalized approaches, each with its own advantages. They can be implemented with various techniques, depending on the data available and the level of personalization required. The main personalized techniques are:

**Content-Based Filtering:** 
    Recommends movies similar to those a user has liked in the past, based on movie features (genre, actors, director, plot keywords).
    
**Collaborative Filtering:** 
    Recommends movies that users with similar tastes have liked, without explicitly considering movie features.
    
**Hybrid Approaches:** 
    Combines content-based and collaborative filtering for improved accuracy.

# Import Libraries 

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from ast import literal_eval # For safely converting stringified lists

# Load Data

In [2]:
df = pd.read_csv('/kaggle/input/tmdb-movies-dataset-2023-930k-movies/TMDB_movie_dataset_v11.csv')

# 1. Exploratory Data Analysis (EDA)

In [3]:
df.head(2)

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."


In [4]:
df.shape

(1150581, 24)

In [5]:
df.columns

Index(['id', 'title', 'vote_average', 'vote_count', 'status', 'release_date',
       'revenue', 'runtime', 'adult', 'backdrop_path', 'budget', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'tagline', 'genres',
       'production_companies', 'production_countries', 'spoken_languages',
       'keywords'],
      dtype='object')

**The TMDb Movies Dataset 2023 contains extensive information about over 1,000,000 movies. Key features include:**

**Movie Details:**

    id: Unique identifier for each movie.
    title: The name of the movie.
    genres: Comma-separated movie genres (e.g., Action, Drama, Comedy).
    release_date: The date when the movie was released.
    overview: A short summary of the movie plot.

**Ratings and Popularity:**

    vote_average: Average rating given by users.
    vote_count: The number of ratings the movie has received.
    popularity: TMDb popularity score.

**Additional Information:**

    budget: The movie's production budget.
    revenue: The total revenue generated by the movie.
    runtime: The duration of the movie.
    keywords: Comma-separated plot keywords.

# 2. Data Preprocessing

### Missing Values

In [6]:
df.isnull().sum()

id                            0
title                        13
vote_average                  0
vote_count                    0
status                        0
release_date             193011
revenue                       0
runtime                       0
adult                         0
backdrop_path            846461
budget                        0
homepage                1028998
imdb_id                  544445
original_language             0
original_title               13
overview                 235526
popularity                    0
poster_path              366090
tagline                  989281
genres                   463525
production_companies     632871
production_countries     512652
spoken_languages         492880
keywords                 840518
dtype: int64

### Duplicates

In [7]:
# Count the number of duplicate rows
df.duplicated().sum()

368

In [8]:
# Find duplicate rows and display them
df[df.duplicated()][:3]

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
234137,1220680,Mr. Kobayashi,10.0,1,Released,,0,0,False,,...,Mr. Kobayashi,,0.63,/h4Xyf3gJgCWVTMlz1yuSqPLSVZ3.jpg,,"Action, Drama",,,,"samurai, katana sword"
234701,1226029,Dima Koval: Stand-up from Vegas,10.0,1,Released,2023-12-05,0,0,False,,...,Dima Koval: Stand-up from Vegas,,1.4,/gMREsQHG7Na3Tmhdg4OH51LPHt.jpg,,Comedy,Standup-view,Russia,Russian,"stand-up comedy, stand-up russia"
235628,1214170,Frienemies,10.0,1,Released,2013-07-15,0,11,False,/kssDqwfU82S2h1omqkuvCr9CJ8p.jpg,...,Frienemies,"Four friends Mike, Shane, Diego and Matt plan ...",0.6,/vRjOyfbNFb85JN8oeCMmBSu1oXx.jpg,Friendships come and go.,"Comedy, Action, Thriller",,,,


In [9]:
# Remove duplicate rows, keeping the first occurrence
df.drop_duplicates(inplace=True)

# Recommendation Systems

## Content-Based Filtering:

Content-based filtering recommends items similar to those a user has liked in the past, based on the items' inherent characteristics. For movies, this means suggesting titles similar to one the user already enjoyed by comparing features such as genres, plot summaries, and keywords.

To implement this, I first preprocessed the movie data. The "genres" and "keywords" columns are comma-separated strings, so I parsed them into lists and normalized each entry (lowercased, spaces removed). Genres are compared as sets using Jaccard similarity. The "overview" (plot summary) text is converted into TF-IDF (Term Frequency-Inverse Document Frequency) vectors, which weight words by how distinctive they are for each movie's description, and the keywords are converted into word-count vectors. Cosine similarity is then computed between movies for the overview vectors and for the keyword vectors.

Finally, the three scores are combined with a weighted sum into a single similarity score. Given a movie a user likes, the system ranks all other movies by this combined score and recommends the top matches. The recommendations therefore depend only on the content of the movies, not on ratings or preferences from other users.
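To make the TF-IDF and cosine similarity steps concrete, here is a minimal standard-library sketch on three invented overviews. It assumes plain whitespace tokenization, no stopword removal, and the smoothed IDF formula described in the scikit-learn documentation; the names `tfidf_vector` and `cosine` are illustrative and are not part of the notebook.

```python
import math
from collections import Counter

# Toy overviews (invented for illustration, not taken from the dataset)
docs = [
    "a thief enters dreams to steal secrets",
    "explorers travel through a wormhole in space",
    "a thief plans a heist to steal diamonds",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


def cosine(vec_a, vec_b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())


vectors = [tfidf_vector(tokens) for tokens in tokenized]
sim_heist = cosine(vectors[0], vectors[2])  # two thief stories
sim_space = cosine(vectors[0], vectors[1])  # thief story vs. space story
print(f"thief/heist: {sim_heist:.3f}, thief/space: {sim_space:.3f}")
```

The two thief stories share several terms and score higher than the thief/space pair, which only shares the word "a". In the notebook, `TfidfVectorizer(stop_words='english')` would additionally drop such common words.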

In [10]:
df['genres'].head()

0             Action, Science Fiction, Adventure
1              Adventure, Drama, Science Fiction
2                 Drama, Action, Crime, Thriller
3    Action, Adventure, Fantasy, Science Fiction
4             Science Fiction, Action, Adventure
Name: genres, dtype: object

In [11]:
df['overview'].head()

0    Cobb, a skilled thief who commits corporate es...
1    The adventures of a group of explorers who mak...
2    Batman raises the stakes in his war on crime. ...
3    In the 22nd century, a paraplegic Marine is di...
4    When an unexpected enemy emerges and threatens...
Name: overview, dtype: object

In [12]:
df['keywords'].head()

0    rescue, mission, dream, airplane, paris, franc...
1    rescue, future, spacecraft, race against time,...
2    joker, sadism, chaos, secret identity, crime f...
3    future, society, culture clash, space travel, ...
4    new york city, superhero, shield, based on com...
Name: keywords, dtype: object

### Download necessary NLTK data

The code ensures that all required resources from the NLTK library are available for text processing. This is often necessary when working with stopwords or lemmatization. The resources are downloaded only if they are not already available.

In [13]:
# Download necessary NLTK data (only need to do this once)
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    WordNetLemmatizer().lemmatize('test')
except LookupError:
    nltk.download('wordnet')

# Download WordNet if not already available
try:
    nltk.data.find('corpora/wordnet.zip')
except LookupError:
    nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Preprocessing Setup

The code sets up key components for text preprocessing, such as defining stopwords and initializing a lemmatizer.

In [14]:
# Preprocessing functions
stop_words = set(stopwords.words('english'))  # Define stop words for text processing
lemmatizer = WordNetLemmatizer()  # Initialize lemmatizer for reducing words to their dictionary form (lemma)

### Text Preprocessing Functions

In [None]:
def remove_stopwords(text):
    """Remove stopwords from a given text."""
    if isinstance(text, str):  # Check if the input is a string
        words = [word for word in text.split() if word.lower() not in stop_words]
        return " ".join(words)  # Join words back into a string
    return ""

def lemmatize_text(text):
    """Lemmatize the words in a given text."""
    if isinstance(text, str):  # Check if the input is a string
        words = [lemmatizer.lemmatize(word) for word in text.split()]
        return " ".join(words)  # Join lemmatized words back into a string
    return ""
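One detail worth noting about the helpers above: they split on whitespace, so punctuation stays attached to words (for example `crime.`), and such tokens will not match a stopword list or a lemmatizer entry. The sketch below shows one possible alternative using a regular expression. The function name `remove_stopwords_regex` and the tiny `demo_stop_words` set are hypothetical and stand in for NLTK's English list.

```python
import re

# Tiny hand-picked stopword list for illustration only; the notebook uses
# NLTK's English list instead.
demo_stop_words = {"the", "in", "his", "on", "a"}


def remove_stopwords_regex(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    if not isinstance(text, str):
        return ""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in demo_stop_words)


sample = "Batman raises the stakes in his war on crime."
whitespace_tokens = sample.split()
cleaned = remove_stopwords_regex(sample)
print(whitespace_tokens[-1])  # crime.  (trailing period still attached)
print(cleaned)                # batman raises stakes war crime
```

The `[a-z]+` pattern also drops digits and splits contractions, so it is a simplification rather than a drop-in replacement.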

### What is Jaccard Similarity?

The Jaccard similarity is a statistical measure used to determine how similar two sets are. It compares the number of shared elements between the sets to the total number of unique elements.

Formula:

Jaccard Similarity = Size of Intersection / Size of Union

In [16]:
def jaccard_similarity(set1, set2):
    """Calculate Jaccard similarity between two sets."""
    intersection = len(set1.intersection(set2))  # Size of intersection
    union = len(set1.union(set2))  # Size of union
    return intersection / union if union != 0 else 0  # Avoid division by zero 
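As a quick check of the formula, the sketch below applies it to the genres of the first two rows shown earlier (Inception and Interstellar), written in the lowercased, space-free form the notebook produces later. The function is restated so the example runs on its own.

```python
def jaccard_similarity(set1, set2):
    """Jaccard similarity: |intersection| / |union|, or 0 for two empty sets."""
    union = len(set1 | set2)
    return len(set1 & set2) / union if union else 0


# Cleaned genre sets for the first two rows shown in df['genres'].head()
inception = {"action", "sciencefiction", "adventure"}
interstellar = {"adventure", "drama", "sciencefiction"}

score = jaccard_similarity(inception, interstellar)
print(score)  # 2 shared genres / 4 distinct genres = 0.5
```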

The safe_literal_eval function converts the raw column values into Python objects. If the value is a string, it first tries to parse it with literal_eval, which handles stringified lists, tuples, or dictionaries. If that fails, as it does for comma-separated values such as `Action, Science Fiction, Adventure` in this dataset, it splits the string on commas into a list. If the value is None or NaN, it returns an empty list. Any other input is returned as-is. The function is applied to the keywords and genres columns of the DataFrame.

In [17]:
from ast import literal_eval
import numpy as np

def safe_literal_eval(value):
    """Safely evaluate a string to its literal representation."""
    if isinstance(value, str):  # Check if it's a string
        try:
            # Try to evaluate the string as a literal (e.g., list, tuple, dict)
            return literal_eval(value)
        except (ValueError, SyntaxError):
            # If evaluation fails, return the string as a list by splitting on commas
            return value.split(',') if value else []
    elif isinstance(value, float) and np.isnan(value):  # Handle NaN values
        return []  # Return an empty list for NaN values
    elif value is None:  # Handle None explicitly
        return []  # Return an empty list for None
    else:
        return value  # Return the value as-is if it's already in the correct format

# Apply the function to the specified columns
features = ['keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(safe_literal_eval)
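One edge case to be aware of: `literal_eval` succeeds on any string that looks like a Python literal, so a field that happens to contain only digits comes back as an `int` rather than a list. Since the columns shown above are plain comma-separated strings, a split-based parser may be enough. The sketch below is one such variant; `split_csv_field` is a hypothetical helper, not part of the notebook.

```python
import math
from ast import literal_eval


def split_csv_field(value):
    """Hypothetical helper: always return a list of stripped strings."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


# literal_eval parses a purely numeric field as an int, not a list
print(literal_eval("2001"))                        # 2001
print(split_csv_field("2001"))                     # ['2001']
print(split_csv_field("Action, Science Fiction"))  # ['Action', 'Science Fiction']
print(split_csv_field(float("nan")))               # []
```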

In [18]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        return ''
        

# Apply clean_data function to your features.
features = ['keywords', 'genres']

for feature in features:
    df[feature] = df[feature].apply(clean_data)

The next cell limits the data to the first 25,000 rows (to stay within available RAM) and fills missing values in the overview and keywords columns with empty strings. It then applies TF-IDF vectorization to the overviews and CountVectorizer to the keywords (after joining each keyword list into a single string). Cosine similarity is computed for both the overview and keyword matrices to measure similarity between movies. These matrices let the recommender compare movies by their plot descriptions, by their keywords, or by both.

In [19]:
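df = df[:25000].copy()  # Work on an explicit copy so the column assignments below do not modify a slice of the original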

# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Assuming you already have 'df' as your DataFrame
# Handle missing values in overview and keywords columns
df['overview'] = df['overview'].fillna('')  # Replace NaN in overview
df['keywords'] = df['keywords'].fillna('')  # Replace NaN in keywords

# Feature Extraction
# Create TF-IDF matrix for movie overviews
tfidf_overview = TfidfVectorizer(stop_words='english')
tfidf_overview_matrix = tfidf_overview.fit_transform(df['overview'])

# Convert keywords into strings (if they are lists) and create a count (bag-of-words) matrix for keywords
# Ensure keywords are properly joined into a string for vectorization
keywords_ = [' '.join(keywords) if isinstance(keywords, list) else keywords for keywords in df['keywords']]
vector = CountVectorizer(stop_words='english')
vector_keywords_matrix = vector.fit_transform(keywords_)

# Extract genres and titles as lists for similarity computation
movie_genres = df['genres'].tolist()
movie_titles = df['title'].tolist()

# Calculate cosine similarity for both overview and keywords
cosine_sim_overview = cosine_similarity(tfidf_overview_matrix)
cosine_sim_keywords = cosine_similarity(vector_keywords_matrix)
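The 25,000-row limit follows from the size of the pairwise similarity matrices. A rough back-of-the-envelope estimate, assuming each matrix is stored densely with 8-byte floats, is sketched below; `dense_matrix_gb` is an illustrative helper, not part of the notebook.

```python
def dense_matrix_gb(n_rows, bytes_per_value=8):
    """Size in GB (10**9 bytes) of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 1e9


subset_gb = dense_matrix_gb(25_000)    # subset used in this notebook
full_gb = dense_matrix_gb(1_150_581)   # row count reported by df.shape
print(f"25,000 movies: {subset_gb:.1f} GB per matrix")
print(f"1,150,581 movies: {full_gb:,.0f} GB per matrix")
```

Under these assumptions each 25,000 x 25,000 matrix takes about 5 GB, so the overview and keyword matrices together need roughly 10 GB, while the full dataset would require on the order of 10 TB per matrix.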

The combined_similarity function calculates the similarity between two movies from three factors: genres (using Jaccard similarity), overviews (using cosine similarity), and keywords (also using cosine similarity). Each factor's contribution is weighted by a parameter (genre_weight, overview_weight, and keyword_weight), and the default weights sum to 1. The get_recommendations function finds the index of the movie title in the dataset and computes its similarity to all other movies using combined_similarity. It then sorts the movies by their similarity scores and prints the top N recommendations. For invalid input (a non-string or a title not in the dataset) it returns an error message instead. Because the scores depend only on movie attributes, the recommendations are specific to the query movie rather than to an individual user's history.

In [20]:
def combined_similarity(movie1_index, movie2_index, genre_weight=0.2, overview_weight=0.5, keyword_weight=0.3):
    """
    Combine similarity scores from genres, overview, and keywords with adjustable weights.
    """
    genre_sim = jaccard_similarity(set(movie_genres[movie1_index]), set(movie_genres[movie2_index]))  # Jaccard similarity for genres
    overview_sim = cosine_sim_overview[movie1_index, movie2_index]  # Cosine similarity for overviews
    keyword_sim = cosine_sim_keywords[movie1_index, movie2_index]  # Cosine similarity for keywords
    return (genre_weight * genre_sim) + (overview_weight * overview_sim) + (keyword_weight * keyword_sim)  # Weighted sum of similarities
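A worked example of the weighted sum may help. The three input similarities below are invented for illustration and are not computed from the dataset; `weighted_score` is a standalone stand-in that mirrors the arithmetic of combined_similarity with the same default weights.

```python
def weighted_score(genre_sim, overview_sim, keyword_sim,
                   genre_weight=0.2, overview_weight=0.5, keyword_weight=0.3):
    """Weighted sum of three similarity scores, mirroring combined_similarity."""
    return (genre_weight * genre_sim
            + overview_weight * overview_sim
            + keyword_weight * keyword_sim)


# Illustrative similarity values (invented, not computed from the dataset)
score = weighted_score(genre_sim=0.5, overview_sim=0.4, keyword_sim=0.3)
print(round(score, 2))  # 0.2*0.5 + 0.5*0.4 + 0.3*0.3 = 0.39
```

Because the default weights sum to 1 and each input lies between 0 and 1, the combined score also stays between 0 and 1.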

In [None]:
def get_recommendations(movie_title, top_n=10):
    """
    Print the top N recommendations for a given movie title based on combined similarity scores.
    Returns an error message string for invalid input, otherwise None.
    """
    # Check if the input is a string
    if not isinstance(movie_title, str):
        return "Error: Movie title must be a string."

    try:
        movie_index = movie_titles.index(movie_title)  # Find index of the given movie
    except ValueError:
        return f"Error: '{movie_title}' not found in the dataset."  # Handle case when the movie is not in the dataset
    
    similarities = [(i, combined_similarity(movie_index, i)) for i in range(len(movie_titles)) if i != movie_index]  # Calculate similarity for all other movies
    similarities.sort(key=lambda x: x[1], reverse=True)  # Sort movies by similarity score in descending order
    
    print(f"\nRecommendations for '{movie_title}':")
    for i, (movie, sim) in enumerate(similarities[:top_n], 1):
        print(f"{i}. Movie: {movie_titles[movie]}, Similarity: {sim}")
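If the caller needs the results as data rather than printed text, one possible variant returns the top matches as a list and uses `heapq.nlargest` so the full score list does not have to be sorted. This is a sketch on invented scores; `top_n_similar` is a hypothetical name and not part of the notebook.

```python
import heapq


def top_n_similar(scores, query_index, top_n=3):
    """Return [(index, score), ...] for the top_n highest scores, excluding the query."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_index)
    return heapq.nlargest(top_n, candidates, key=lambda pair: pair[1])


# Invented similarity scores of five movies to the movie at index 0
scores = [1.0, 0.44, 0.12, 0.33, 0.25]
result = top_n_similar(scores, query_index=0)
print(result)  # [(1, 0.44), (3, 0.33), (4, 0.25)]
```

Returning a list keeps presentation separate from ranking, so the same function could feed a table, an API response, or a print loop.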

In [22]:
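# Example Usage
# get_recommendations prints the list itself and returns None on success,
# so the print() call below outputs "None" after the recommendations.
recommendations = get_recommendations('The Dark Knight Rises')  # Get recommendations for 'The Dark Knight Rises'
print(recommendations)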

recommendations = get_recommendations('Non Existent Movie')  # Test with a movie not in the dataset
print(recommendations)

recommendations = get_recommendations(12345)  # Test with a non-string input
print(recommendations)


Recommendations for 'The Dark Knight Rises':
1. Movie: The Dark Knight, Similarity: 0.44367078129061277
2. Movie: Batman Begins, Similarity: 0.3312319791073811
3. Movie: Batman: The Long Halloween, Part One, Similarity: 0.31948801598275717
4. Movie: Batman: The Long Halloween, Part Two, Similarity: 0.2988531543167315
5. Movie: Batman vs. Two-Face, Similarity: 0.2820051833757711
6. Movie: The Siege, Similarity: 0.25485826749439033
7. Movie: The Batman, Similarity: 0.25241304745002513
8. Movie: Class of 1984, Similarity: 0.24245820935684362
9. Movie: The Negotiator, Similarity: 0.23901814723701992
10. Movie: Double Impact, Similarity: 0.23857391239249676
None
Error: 'Non Existent Movie' not found in the dataset.
Error: Movie title must be a string.


# Thank you 