## Introduction to Recommendation Systems

Recommendation systems are tools used to predict and suggest items to users based on their preferences and behaviors. These systems are widely used in various domains, including e-commerce, streaming platforms, social media, and more, to enhance user experience and engagement.

### Movie Recommender Systems

Movie recommender systems are a specific type of recommendation system that suggests movies to users based on their preferences, viewing history, and other relevant factors. These systems help users discover new movies they might enjoy, leading to increased user satisfaction and retention.

### Types of Movie Recommender Systems

There are several types of movie recommender systems, each using different algorithms and approaches to generate recommendations:

1. **Content-Based Filtering**: This approach recommends movies similar to those a user has liked in the past. It analyzes the content/features of movies (e.g., genre, actors, plot keywords) and suggests movies with similar characteristics.

2. **Collaborative Filtering**:
   - **User-Based Collaborative Filtering**: This method recommends movies based on the preferences of similar users. It identifies users with similar movie preferences and recommends movies liked by those users but not yet seen by the target user.
   - **Item-Based Collaborative Filtering**: Instead of comparing users, this method compares movies directly. It recommends movies similar to those already liked by the user, based on the ratings or interactions of other users.

3. **Hybrid Recommender Systems**: These systems combine multiple recommendation techniques to provide more accurate and diverse recommendations. For example, a hybrid system might combine content-based filtering with collaborative filtering to leverage the strengths of both approaches.

4. **Matrix Factorization Methods**: These advanced techniques aim to model user-item interactions by decomposing the user-item interaction matrix into lower-dimensional matrices. Methods like Singular Value Decomposition (SVD) and Alternating Least Squares (ALS) are commonly used for matrix factorization in recommender systems.

5. **Deep Learning-Based Recommender Systems**: With the advent of deep learning, neural network-based approaches have gained popularity in recommendation systems. Models like Neural Collaborative Filtering (NCF) and Deep Matrix Factorization (DMF) leverage deep learning architectures to capture complex user-item interactions and provide personalized recommendations.

Each type of movie recommender system has its advantages and limitations, and the choice of algorithm depends on factors such as available data, scalability, and the specific requirements of the application.
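To make the item-based collaborative filtering idea from the list above concrete, here is a minimal sketch on a made-up ratings matrix. The ratings, the `item_sim` name, and the similarity-weighted scoring rule are illustrative assumptions; this notebook itself builds a content-based recommender, not a collaborative one.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies, 0 means "not rated".
ratings = np.array([
    [5, 0, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
    [5, 4, 0, 0],
], dtype=float)

# Item-based view: compare movie columns with cosine similarity.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Score every movie for user 0 as a similarity-weighted sum of that user's ratings.
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf  # never recommend a movie the user already rated

best_movie = int(np.argmax(scores))
print("Recommended movie column for user 0:", best_movie)
```

User 0 rated movie 0 highly, and movie 1 is rated similarly to movie 0 by the other users, so movie 1 gets the highest score among the unrated movies.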


### Why Content-Based Recommender Systems?
- **Personalization**: Recommends items (movies) based on the characteristics of items and user preferences.
- **Independence**: Doesn't rely on other users' behavior.
- **Transparency**: Users can understand why a movie was recommended.


# Feature Engineering in Movie Recommendation System

## 1. Removing Unnecessary Columns

- Removed irrelevant columns from the dataset to streamline the data for analysis.
- Columns such as 'homepage', 'tagline', 'budget', 'revenue', etc., were removed as they are not required for movie recommendations.

## 2. Combining DataFrames

- Merged two DataFrames containing movie metadata and credits information.
- Utilized common identifiers such as movie IDs ('id') to merge the DataFrames efficiently.

## 3. Creating New Columns

- Introduced a new column named 'director' to hold each movie's director.
- Extracted the director's name from the 'crew' column (the first crew entry whose job is "Director") and stored it in the 'director' column.


# Data Cleaning for Text Data

## 1. Removing Empty Space

- Trimmed leading and trailing white spaces from the text data.
- Ensured consistency in text formatting and improved readability.

## 2. Removing Special Characters

- Eliminated special characters from the text data.
- Special characters such as punctuation marks, symbols, etc., were removed to focus on meaningful content.

## 3. Converting Text to Lowercase

- Converted all text data to lowercase.
- Standardized text formatting to ensure uniformity and facilitate analysis and modeling.

## 4. Removing Stop Words

- Eliminated common stop words from the text data.
- Stop words such as 'the', 'a', 'an', etc., were removed to focus on relevant content.

## 5. Replacing NaN Values

- Replaced missing values (NaN) in the text data with empty strings.
- Ensured consistency in data structure and facilitated further processing.
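The five cleaning steps listed above can be sketched on a few made-up strings. This is an illustration under assumptions: the sample titles and the three-word stop list are invented here, and the notebook itself delegates stop-word removal to `TfidfVectorizer(stop_words='english')` rather than doing it by hand.

```python
import pandas as pd

# Hypothetical raw text values, including a missing entry.
raw = pd.Series(["  The Dark Knight!  ", None, "Wall-E (2008)"])

cleaned = (
    raw.fillna("")                                    # replace NaN with an empty string
       .str.strip()                                   # trim leading/trailing whitespace
       .str.lower()                                   # convert to lowercase
       .str.replace(r"[^a-z0-9\s]", "", regex=True)   # drop special characters
)

# Tiny illustrative stop list; real stop lists are much longer.
stop_words = {"the", "a", "an"}
cleaned = cleaned.apply(
    lambda s: " ".join(w for w in s.split() if w not in stop_words)
)

print(cleaned.tolist())
```

Filling missing values first matters: the `.str` methods leave NaN untouched, and a plain Python function applied afterwards would fail on a non-string value.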

## TF-IDF for Text Vectorization

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents or corpus. It consists of two components:

1. **Term Frequency (TF)**: Measures how often a term occurs in a document.

2. **Inverse Document Frequency (IDF)**: Measures the rarity of a term across the corpus.

The TF-IDF score for a term in a document is the product of these two components, so a term scores highly when it occurs often in that document but is rare in the rest of the corpus.

Reasons for choosing TF-IDF:

- **Term Importance**: TF-IDF captures the importance of terms by considering both frequency of occurrence and rarity across the corpus.
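- **Down-Weighting of Common Terms**: TF-IDF does not reduce the number of dimensions (every vocabulary term remains a column), but it assigns lower weights to terms that appear in many documents and higher weights to rare terms, so ubiquitous terms contribute little to similarity scores.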
- **Noise Reduction**: Stop words and common terms have low TF-IDF scores, reducing noise in document representation.
- **Flexibility**: Customizable parameters allow adaptation to different text mining tasks.
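A small sketch shows the weighting in action on three made-up one-line "overviews". The documents and variable names are assumptions for illustration. With scikit-learn's default settings, the inverse document frequency is smoothed as `ln((1 + n) / (1 + df)) + 1` and each row is then L2-normalized, so the exact numbers differ from a textbook `tf * log(n / df)`, but the ordering of weights is the point here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "space" appears in every document, "alien" in only one.
docs = [
    "space marine alien planet",
    "space pirate ship",
    "space detective city crime",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary terms
vocab = vectorizer.vocabulary_            # maps each term to its column index

row0 = matrix[0].toarray().ravel()
common_weight = row0[vocab["space"]]      # term shared by all documents
rare_weight = row0[vocab["alien"]]        # term unique to the first document

print(matrix.shape)
print("space:", common_weight, "alien:", rare_weight)
```

Both terms occur once in the first document, so the only difference in their weights comes from the IDF component.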


## Cosine Similarity in Recommendation Systems

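**Definition**: Cosine similarity is the cosine of the angle between two non-zero vectors: their dot product divided by the product of their magnitudes. For non-negative vectors such as TF-IDF rows, it ranges from 0 (no terms in common) to 1 (same direction).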

**Why Use Cosine Similarity?**

- **Scale-Invariance**: It is unaffected by vector magnitude, so a long overview and a short one with the same mix of terms are treated as similar.
- **Efficiency**: It reduces to a dot product of normalized vectors, which is cheap to compute for the sparse, high-dimensional vectors that TF-IDF produces.
- **Robustness**: Because it depends on the angle between vectors rather than their absolute values, it is less sensitive to differences in document length than distance measures such as Euclidean distance.

**Role in Recommendation Systems**

Cosine similarity is used to measure the similarity between items or users based on their feature vectors. It enables personalized recommendations by identifying items that are most similar to those previously liked by the user. In content-based recommendation systems, cosine similarity compares item feature vectors to make relevant recommendations.
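The definition and the scale-invariance property can be checked with a few lines of NumPy. The helper name `cosine` and the three toy vectors are assumptions made for this sketch.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same direction, ten times the magnitude
other_doc = [0, 0, 3]    # shares no non-zero dimension with short_doc

same_direction = cosine(short_doc, long_doc)
orthogonal = cosine(short_doc, other_doc)

print(same_direction, orthogonal)
```

The first pair scores (up to floating-point rounding) 1 despite the tenfold difference in magnitude, while the pair with no shared dimensions scores 0.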


In [2]:
import pandas as pd  # Import pandas library for data manipulation and analysis

import numpy as np  # Import numpy library for numerical computing

from sklearn.feature_extraction.text import TfidfVectorizer  # Import TfidfVectorizer for converting text data to TF-IDF features

from sklearn.metrics.pairwise import linear_kernel  # Import linear_kernel for computing similarity between vectors

from ast import literal_eval  # Import literal_eval for safely evaluating string representations of Python data structures

import re  # Import re for handling regular expressions

import pickle  # Import pickle for saving Python objects (here, the TF-IDF matrix) to disk

In [3]:
df1 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')
# Read the CSV file 'tmdb_5000_credits.csv' located in the '../input/tmdb-movie-metadata/' directory into a pandas DataFrame.
# Assign the DataFrame to the variable 'df1'.

df2 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')
# Read the CSV file 'tmdb_5000_movies.csv' located in the '../input/tmdb-movie-metadata/' directory into a pandas DataFrame.
# Assign the DataFrame to the variable 'df2'.

In [4]:
print(df1.head(2))
print(df2.head(2))

   movie_id                        ...                                                                       crew
0     19995                        ...                          [{"credit_id": "52fe48009251416c750aca23", "de...
1       285                        ...                          [{"credit_id": "52fe4232c3a36847f800b579", "de...

[2 rows x 4 columns]
      budget    ...     vote_count
0  237000000    ...          11800
1  300000000    ...           4500

[2 rows x 20 columns]


In [23]:
df1.columns = ['id', 'title', 'cast', 'crew']
# Rename the columns of DataFrame df1 to ['id', 'title', 'cast', 'crew'].
# In particular, 'movie_id' becomes 'id' so that it matches the key column in df2.

df2 = df2.merge(df1, on='id')
# Merge DataFrame df2 with DataFrame df1 on the shared 'id' column (an inner join by default).
# Both DataFrames contain a 'title' column, so pandas keeps them as 'title_x' and 'title_y';
# the rest of the notebook uses 'original_title' instead.
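The effect of merging two frames that share a non-key column can be seen on toy data. The two small DataFrames below are invented for illustration; `validate="one_to_one"` is an optional pandas argument, shown here as one possible safeguard rather than something the notebook uses.

```python
import pandas as pd

# Hypothetical stand-ins for the credits and movies tables.
credits = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"]})
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "original_title": ["A", "B"]})

# Merge on the key; validate raises if either side has a repeated id.
merged = movies.merge(credits, on="id", validate="one_to_one")

print(list(merged.columns))
```

Because `title` exists on both sides and is not part of the key, pandas disambiguates it with the default `_x` and `_y` suffixes.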

In [51]:
def get_director(x):
    # Define a function named 'get_director' that takes a parameter 'x'.

    for i in x:
        # Iterate through each element 'i' in the input 'x'.

        if i["job"] == "Director":
            # Check if the value associated with the key 'job' in 'i' is equal to "Director".

            return i["name"]
            # If the condition is true, return the value associated with the key 'name' in 'i'.
    return np.nan
    # If the loop completes without finding a director, return 'np.nan' (numpy's representation of Not a Number).

In [25]:
features = ["cast", "crew", "keywords", "genres"]
# Define a list named 'features' containing the names of columns to be processed.

for feature in features:
    # Iterate through each feature in the list 'features'.

    df2[feature] = df2[feature].apply(literal_eval)
    # Apply the literal_eval function to each element in the column specified by 'feature' in DataFrame df2.
    # This function safely evaluates the string representation of Python data structures (e.g., lists of dictionaries).
    # It converts the string representation of lists of dictionaries to actual lists of dictionaries.

In [28]:
df2["director"] = df2["crew"].apply(get_director)
# Create a new column named 'director' in DataFrame df2.
# Apply the 'get_director' function to each element in the 'crew' column of df2.
# The 'get_director' function extracts the name of the director from the crew list for each movie.
# The extracted director names are then stored in the newly created 'director' column.

In [29]:
def get_list(x):
    # Define a function named 'get_list' that takes a parameter 'x'.

    if isinstance(x, list):
        # Check if 'x' is a list.

        names = [i["name"] for i in x]
        # If 'x' is a list, create a list comprehension to extract the 'name' attribute from each dictionary 'i' in 'x'.
        # Store the extracted names in the list 'names'.

        if len(names) > 3:
            # If the length of 'names' is greater than 3 (contains more than 3 elements),

            names = names[:3]
            # Keep only the first 3 elements in the list 'names'.

        return names
        # Return the list 'names'.

    return []
    # If 'x' is not a list (e.g., it is a missing value or not in the expected format), return an empty list.

In [30]:
features = ["cast", "keywords", "genres"]
# Define a list named 'features' containing the names of columns to be processed.

for feature in features:
    # Iterate through each feature in the list 'features'.

    df2[feature] = df2[feature].apply(get_list)
    # Apply the 'get_list' function to each element in the column specified by 'feature' in DataFrame df2.
    # This function extracts a list of names from each element (which is assumed to be a list of dictionaries),
    # and then updates the column with the extracted list of names.


In [31]:
df2 = df2[["original_title", "director", "overview", "genres", "cast"]]
# Select specific columns from DataFrame df2 using a list of column names.
# Create a new DataFrame containing only the columns "original_title", "director", "overview", "genres", and "cast".

In [32]:
df2.head(2)

| | original_title | director | overview | genres | cast |
|---|---|---|---|---|---|
| 0 | Avatar | James Cameron | In the 22nd century, a paraplegic Marine is di... | [Action, Adventure, Fantasy] | [Sam Worthington, Zoe Saldana, Sigourney Weaver] |
| 1 | Pirates of the Caribbean: At World's End | Gore Verbinski | Captain Barbossa, long believed to be dead, ha... | [Adventure, Fantasy, Action] | [Johnny Depp, Orlando Bloom, Keira Knightley] |


In [34]:
df2['cast'] = df2['cast'].str.join(' ')
# Convert each element in the 'cast' column of DataFrame df2 to a string representation.
# Join the elements of each list in the 'cast' column with a space (' ') separator.
# Update the 'cast' column in df2 with the joined strings.

df2['genres'] = df2['genres'].str.join(' ')
# Convert each element in the 'genres' column of DataFrame df2 to a string representation.
# Join the elements of each list in the 'genres' column with a space (' ') separator.
# Update the 'genres' column in df2 with the joined strings.


In [35]:
df2['text'] = df2['overview'] + df2['cast'] + df2['genres'] + df2['director']
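# Concatenate the 'overview', 'cast', 'genres', and 'director' columns of DataFrame df2 into a single column named 'text'.
# Note: no separator is inserted between the columns, so the last word of one field is fused with the first word of the next,
# and a NaN in any of the four columns makes the whole 'text' value NaN (it is replaced by '' in a later cell).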
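One possible alternative is to fill missing values per column and join the fields with spaces. The sketch below compares both approaches on two made-up rows; the `toy` DataFrame and the `naive`/`safe` names are assumptions. Adopting it in the notebook would change the vocabulary, so the matrix shape and recommendations shown further down correspond to the original concatenation, not to this variant.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one has a missing overview and director.
toy = pd.DataFrame({
    "overview": ["A marine is sent to Pandora", np.nan],
    "cast": ["Sam Worthington", "Johnny Depp"],
    "genres": ["Action", "Adventure"],
    "director": ["James Cameron", np.nan],
})
cols = ["overview", "cast", "genres", "director"]

# Plain "+" concatenation: no separators, and NaN propagates to the result.
naive = toy["overview"] + toy["cast"] + toy["genres"] + toy["director"]

# Fill missing fields first, then join each row's fields with single spaces.
safe = (
    toy[cols]
    .fillna("")
    .apply(lambda row: " ".join(row), axis=1)
    .str.strip()
)

print(naive.tolist())
print(safe.tolist())
```

In the first row the naive version produces fused tokens such as `PandoraSam`, and in the second row it loses the cast and genre entirely because of the missing fields.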

In [36]:
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove every character that is not a letter, digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    return text

In [41]:
df2['text'] = df2['text'].fillna('')
# Fill any missing values (NaN) in the 'text' column of DataFrame df2 with an empty string ('').
# This ensures that all values in the 'text' column are non-null.

In [43]:
df2['text'] = df2['text'].apply(preprocess_text)
# Apply the 'preprocess_text' function to each element in the 'text' column of DataFrame df2.
# This function preprocesses each text by converting it to lowercase and removing special characters, as specified.

In [44]:
# Define a TF-IDF Vectorizer Object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')
# Initialize a TF-IDF Vectorizer object named 'tfidf'.
# Set the parameter 'stop_words' to 'english' to remove common English stop words during tokenization.

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df2['text'])
# Fit and transform the 'text' column of DataFrame df2 using the TF-IDF Vectorizer object 'tfidf'.
# This process converts the text data into a TF-IDF matrix representation.

# Display the shape of tfidf_matrix
tfidf_matrix.shape
# The shape is (number of documents, number of unique terms):
# one row per movie and one column per vocabulary term that remains after stop-word removal.


(4803, 30178)

In [18]:
# Define the file path where you want to save the TF-IDF matrix
file_path = 'tfidf_matrix.pkl'

# Save the TF-IDF matrix to a file using pickle
with open(file_path, 'wb') as f:
    pickle.dump(tfidf_matrix, f)
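For completeness, here is a hedged sketch of the round trip: saving a sparse matrix with pickle and loading it back. It uses a tiny made-up matrix and a temporary directory instead of the notebook's real `tfidf_matrix`. Two caveats apply: pickle files should only be loaded from trusted sources, and the fitted vectorizer would have to be saved as well if new text needs to be transformed later.

```python
import os
import pickle
import tempfile

from scipy import sparse

# Hypothetical stand-in for the TF-IDF matrix.
toy_matrix = sparse.csr_matrix([[0.0, 0.6, 0.8], [1.0, 0.0, 0.0]])

path = os.path.join(tempfile.mkdtemp(), "tfidf_matrix.pkl")

# Save the matrix to disk.
with open(path, "wb") as f:
    pickle.dump(toy_matrix, f)

# Load it back in a later session.
with open(path, "rb") as f:
    restored = pickle.load(f)

# Two sparse matrices are equal when no entry differs.
same = (restored != toy_matrix).nnz == 0
print(restored.shape, same)
```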

In [45]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# linear_kernel computes the dot product between every pair of rows in the TF-IDF matrix 'tfidf_matrix'.
# TfidfVectorizer L2-normalizes each row by default, so these dot products equal the cosine similarities.
# Each element (i, j) of 'cosine_sim' is the cosine similarity score between document i and document j.
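The equivalence between `linear_kernel` and cosine similarity on TF-IDF output can be checked on a toy corpus. The documents below are invented; the check relies on the default `norm='l2'` setting of `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus.
docs = [
    "batman gotham crime",
    "batman gotham joker",
    "space alien planet",
]
X = TfidfVectorizer().fit_transform(docs)

# Each row already has unit length because of the default L2 normalization.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()

dot = linear_kernel(X, X)      # plain dot products
cos = cosine_similarity(X, X)  # normalizes first, then takes dot products

print(row_norms)
print(np.allclose(dot, cos))
```

Since the rows are already normalized, `linear_kernel` skips a redundant normalization step and returns the same matrix.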

In [47]:
# Construct a reverse map from movie titles to row positions
indices = pd.Series(df2.index, index=df2['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]
# Create a pandas Series named 'indices' where the index is set to the 'original_title' column of DataFrame df2.
# The values of the Series are the corresponding row positions in df2,
# so the Series maps each movie title to its position in the DataFrame.
# Keep only the first entry for each title: Series.drop_duplicates() would compare the values (the row positions,
# which are already unique) rather than the titles, so a repeated title would otherwise map to several rows.

In [54]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (normally the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df2['original_title'].iloc[movie_indices]
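A slightly more defensive variant of the lookup is sketched below on a made-up 4x4 similarity matrix. The names `recommend`, `title_to_idx`, and `sim`, as well as the toy values, are assumptions. It returns an empty list for unknown titles and removes the query movie by its position instead of assuming it sorts first.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a symmetric similarity matrix.
titles = pd.Series(["Batman", "Batman Returns", "Heat", "Batman Begins"])
sim = np.array([
    [1.0, 0.8, 0.1, 0.6],
    [0.8, 1.0, 0.2, 0.5],
    [0.1, 0.2, 1.0, 0.3],
    [0.6, 0.5, 0.3, 1.0],
])
title_to_idx = pd.Series(titles.index, index=titles)

def recommend(title, k=2):
    """Return up to k titles most similar to `title`, or [] if it is unknown."""
    if title not in title_to_idx.index:
        return []
    idx = title_to_idx[title]
    order = np.argsort(-sim[idx])        # positions sorted by descending similarity
    order = order[order != idx][:k]      # drop the query movie, keep the top k
    return titles.iloc[order].tolist()

print(recommend("Batman"))
print(recommend("Not In The Dataset"))
```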

In [55]:
# The get_recommendations function will return the titles of the top 10 movies that are most similar to "The Dark Knight Rises" based on cosine similarity scores.
get_recommendations('The Dark Knight Rises')

428                              Batman Returns
65                              The Dark Knight
299                              Batman Forever
1359                                     Batman
119                               Batman Begins
2507                                  Slow Burn
3854    Batman: The Dark Knight Returns, Part 2
210                              Batman & Robin
1398                                  Max Payne
9            Batman v Superman: Dawn of Justice
Name: original_title, dtype: object

In [23]:
get_recommendations('The Avengers')

7                   Avengers: Age of Ultron
1294                               Serenity
3144                                Plastic
85      Captain America: The Winter Soldier
26               Captain America: Civil War
1715                                Timecop
2136             Team America: World Police
1286                            Snowpiercer
588         Wall Street: Money Never Sleeps
1161                     The Social Network
Name: original_title, dtype: object

## Conclusion

In this project, we implemented a small-scale content-based movie recommender. Using TF-IDF vectorization and cosine similarity, it recommends the movies whose overview, cast, genre, and director text is most similar to that of a given movie.

Looking ahead, more advanced techniques such as deep learning-based models could be explored to improve recommendation quality and to tailor results to individual users' preferences. Exposing the recommender as a web service would also make it available to a wider audience in real time.

**Improvements:**
- Enhance scalability and performance for larger datasets and growing user bases.
- Optimize algorithms and data processing pipelines for faster recommendation generation.
- Integrate real-time data streams for dynamic content updates to ensure recommendation relevance.
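As one possible direction for the scalability point above: a dense 4803 x 4803 matrix of 64-bit floats takes about 185 MB, and the cost grows quadratically with the number of movies. A hedged alternative is to compute similarities for a single movie on demand, which only needs one row at a time. The corpus and the `top_k_similar` helper below are assumptions made for this sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-corpus standing in for the movie texts.
docs = [
    "batman gotham crime",
    "batman gotham joker",
    "space alien planet",
    "space alien ship",
]
X = TfidfVectorizer().fit_transform(docs)

def top_k_similar(row, k):
    """Positions of the k rows most similar to `row`, without building the full matrix."""
    scores = linear_kernel(X[[row]], X).ravel()   # one row of similarities
    scores[row] = -1.0                            # exclude the query itself
    top = np.argpartition(-scores, k - 1)[:k]     # unordered top k
    return top[np.argsort(-scores[top])]          # ordered by descending score

print(top_k_similar(0, 1))
print(top_k_similar(2, 1))
```

This trades a one-off precomputation for a small amount of work per request; whether that is the better choice depends on catalogue size and query volume.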

