# Designing a Basic Recommender System using TMDB Dataset

## Introduction

In this notebook, we will explore the design and implementation of a basic recommender system using the TMDB (The Movie Database) dataset. Recommender systems are widely used in various domains to suggest items to users based on their preferences and behavior. There are several types of recommender systems, including Demographic Filtering, Collaborative Filtering, and Content-Based Filtering.

## Types of Recommender Systems

### 1. Demographic Filtering

Demographic filtering recommends items based on demographic information such as age, gender, occupation, etc., without considering the user's preferences or behavior.

### 2. Collaborative Filtering

Collaborative filtering recommends items to users based on the preferences of other users. It can be user-based or item-based, where recommendations are made either by finding similar users or similar items.

### 3. Content-Based Filtering

Content-based filtering recommends items to users based on the features or characteristics of the items themselves. It analyzes the content of items and recommends similar items based on their attributes.

## Steps in Designing the Recommender System

### Step 1: Recommender System Based on Overview

In this step, we will design a content-based recommender system that recommends movies based on their overview or summary.

### Step 2: Recommender System Based on Actors and Directors

Next, we will enhance the content-based recommender system by incorporating information about actors and directors. This will allow us to recommend movies based on the involvement of specific actors or directors.

### Step 3: Recommender System Based on Actors, Directors, and Overview

Building upon the previous step, we will create a more comprehensive recommender system that considers both the overview of the movies and the involvement of actors and directors.

### Step 4: Recommender System Based on Actors, Directors, Overview, and Genres

Finally, we will further refine our recommender system by including information about movie genres. This will enable us to recommend movies based on a combination of actors, directors, overview, and genres.

## Conclusion

This notebook demonstrates the design and implementation of a basic recommender system using content-based filtering. By comparing movie attributes such as overview, actors, directors, and genres, it recommends movies that are similar to a given title. It does not model individual user preferences, so every user who queries the same title receives the same recommendations.


In [1]:
import json
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 70)

In [3]:
df = pd.read_csv('data.csv')

In [4]:
df.head()

Unnamed: 0,budget,genres,homepage,id,plot_keywords,language,original_title,overview,popularity,production_companies,production_countries,release_date,gross,duration,spoken_languages,status,tagline,movie_title,vote_average,num_voted_users,title_year,country,director_name,actor_1_name,actor_2_name,actor_3_name
0,237000000,Action|Adventure|Fantasy|Science Fiction,http://www.avatarmovie.com/,19995,culture clash|future|space war|space colony|so...,English,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{'name': 'Ingenious Film Partners', 'id': 289...","[{'iso_3166_1': 'US', 'name': 'United States o...",2009-12-10,2787965087,162.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,2009.0,United States of America,James Cameron,Sam Worthington,Zoe Saldana,Sigourney Weaver
1,300000000,Adventure|Fantasy|Action,http://disney.go.com/disneypictures/pirates/,285,ocean|drug abuse|exotic island|east india trad...,English,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{'name': 'Walt Disney Pictures', 'id': 2}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",2007-05-19,961000000,169.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,2007.0,United States of America,Gore Verbinski,Johnny Depp,Orlando Bloom,Keira Knightley
2,245000000,Action|Adventure|Crime,http://www.sonypictures.com/movies/spectre/,206647,spy|based on novel|secret agent|sequel|mi6|bri...,Français,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2015-10-26,880674609,148.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,A Plan No One Escapes,Spectre,6.3,4466,2015.0,United Kingdom,Sam Mendes,Daniel Craig,Christoph Waltz,Léa Seydoux
3,250000000,Action|Crime|Drama|Thriller,http://www.thedarkknightrises.com/,49026,dc comics|crime fighter|terrorist|secret ident...,English,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{'name': 'Legendary Pictures', 'id': 923}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-07-16,1084939099,165.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,2012.0,United States of America,Christopher Nolan,Christian Bale,Michael Caine,Gary Oldman
4,260000000,Action|Adventure|Science Fiction,http://movies.disney.com/john-carter,49529,based on novel|mars|medallion|space travel|pri...,English,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{'name': 'Walt Disney Pictures', 'id': 2}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-03-07,284139100,132.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,2012.0,United States of America,Andrew Stanton,Taylor Kitsch,Lynn Collins,Samantha Morton


## Recommender system based on overview

1. **Data Preprocessing:**
   - Extract the 'overview' column from the DataFrame `df`.
   - Import necessary libraries such as `re`, `stopwords`, `WordNetLemmatizer`, and `TfidfVectorizer`.

2. **Text Preprocessing:**
   - Define a function `preprocess_text()` to preprocess each document (movie overview).
     - Convert text to lowercase.
     - Remove punctuation.
     - Tokenize the text.
     - Lemmatize tokens.
     - Remove stop words.
     - Join tokens back into processed text.

3. **TF-IDF Vectorization:**
   - Apply `preprocess_text()` to each document, resulting in a list of preprocessed documents.
   - Initialize a `TfidfVectorizer` object.
   - Fit the vectorizer to the preprocessed data and transform documents into a TF-IDF feature matrix.

4. **Cosine Similarity Calculation:**
   - Use `cosine_similarity()` to compute the cosine similarity matrix based on the TF-IDF feature matrix.

5. **Recommender Function:**
   - Define `recommender_function()` to generate movie recommendations based on a given movie title and the cosine similarity matrix.
   - Take movie title and cosine similarity matrix as input.
   - Retrieve index of given movie title from `indices`.
   - If movie title is valid:
     - Calculate similarity scores between given movie and all others.
     - Create a DataFrame containing movie titles and similarity scores.
     - Sort DataFrame in descending order of similarity scores.
     - Return top N recommendations (excluding given movie).
   - If movie title is not valid, print a message indicating it.

6. **Example Usage:**
   - Demonstrate usage of `recommender_function()` by passing the movie title "The Avengers" and cosine similarity matrix (`cosine_matrix_1`). This should return top recommended movies similar to "The Avengers".


### WordNetLemmatizer

- **Purpose**: Word lemmatization reduces words to their base or dictionary form (lemma), ensuring that different inflected forms of a word are treated as the same word.
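- **Example**: With a verb part-of-speech tag (`pos='v'`), lemmatization converts "running" and "ran" to their base form "run". Without a tag, `WordNetLemmatizer.lemmatize()` treats every token as a noun, so it mainly normalizes plurals (e.g., "friends" to "friend").
- **Importance**: It helps in standardizing the vocabulary, improving the effectiveness of downstream tasks such as vectorization and similarity calculation.
- **Application**: In this code, `WordNetLemmatizer` from NLTK is used to lemmatize tokens extracted from movie overviews. Because `lemmatize()` is called without a part-of-speech tag, verb forms are mostly left unchanged.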

### English Stopwords

- **Purpose**: Stopwords are common words (e.g., "the", "is", "and") that are often removed during text preprocessing because they carry little meaning or significance.
- **Example**: In the phrase "The quick brown fox jumps over the lazy dog", stopwords like "the" and "over" contribute little to the meaning.
- **Importance**: Removing stopwords reduces noise and dimensionality in the data, focusing on content-carrying words.
- **Application**: In this code, the union of NLTK's English stopwords list and scikit-learn's built-in English stopwords (`ENGLISH_STOP_WORDS`) is used to filter out common words during text preprocessing, so that the TF-IDF features are dominated by content-carrying words.
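
One detail worth checking is the order of lemmatization and stop-word removal. The sketch below uses a tiny stand-in lemmatizer (the hypothetical `toy_lemmatize`, which only strips a trailing "s") and a three-word stop list, both invented for illustration, to show how a stop word such as "has" can slip through when it is lemmatized first. NLTK's default noun lemmatization is known to behave similarly for this word, mapping "has" to "ha".

```python
# Hypothetical stand-ins for the real stop list and lemmatizer
STOP_WORDS = {"the", "has", "a"}

def toy_lemmatize(token):
    """Strip one trailing 's' (a crude stand-in for noun lemmatization)."""
    return token[:-1] if token.endswith("s") and len(token) > 2 else token

def lemmatize_then_filter(tokens):
    lemmas = [toy_lemmatize(t) for t in tokens]
    return [t for t in lemmas if t not in STOP_WORDS]

def filter_then_lemmatize(tokens):
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [toy_lemmatize(t) for t in kept]

tokens = "the hero has friends".split()
print(lemmatize_then_filter(tokens))  # ['hero', 'ha', 'friend']
print(filter_then_lemmatize(tokens))  # ['hero', 'friend']
```

`preprocess_text()` in this notebook lemmatizes before filtering, so filtering first (or filtering both before and after lemmatization) may be worth trying.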

### TF-IDF Vectorization

After preprocessing, TF-IDF vectorization is performed using `TfidfVectorizer` from scikit-learn. TF-IDF reflects the importance of words in documents relative to a collection of documents. Here's how TF-IDF vectorization works:

- **Term Frequency (TF)**: Measures the frequency of a term (word) within a document. Words that appear more frequently are assigned higher weights.
- **Inverse Document Frequency (IDF)**: Measures the rarity of a term across documents. Rare terms are assigned higher weights to emphasize their importance.

TF-IDF is calculated as the product of TF and IDF. The resulting TF-IDF matrix represents each document as a vector in a high-dimensional space, where each dimension corresponds to a unique term.
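
As a rough illustration of the two factors above, the following plain-Python sketch computes TF-IDF weights for a toy corpus. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default. The corpus and the helper name `tfidf_rows` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already preprocessed)
corpus = [
    "hero save world",
    "hero fight villain",
    "family move new town",
]

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed to match TfidfVectorizer's default smooth_idf=True)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

rows = tfidf_rows(corpus)
for doc, row in zip(corpus, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

In the first document, "hero" also occurs in the second document, so it receives a lower weight than "save" or "world", which occur only once in the corpus.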

In [5]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [6]:
#Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

In [7]:
df['overview']

0       In the 22nd century, a paraplegic Marine is di...
1       Captain Barbossa, long believed to be dead, ha...
2       A cryptic message from Bond’s past sends him o...
3       Following the death of District Attorney Harve...
4       John Carter is a war-weary, former military ca...
                              ...                        
4798    El Mariachi just wants to play his guitar and ...
4799    A newlywed couple's honeymoon is upended by th...
4800    "Signed, Sealed, Delivered" introduces a dedic...
4801    When ambitious New York attorney Sam is sent t...
4802    Ever since the second grade when he first saw ...
Name: overview, Length: 4803, dtype: object

In [8]:
documents = df['overview']

In [9]:
# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Additional stop words
custom_stopwords = set(stopwords.words('english')).union(ENGLISH_STOP_WORDS)

# Function to preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    tokens = text.split()
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # Remove stop words
    tokens = [token for token in tokens if token not in custom_stopwords]
    # Join tokens back into text
    text = ' '.join(tokens)
    return text

# Preprocess each document
preprocessed_documents = [preprocess_text(doc) for doc in documents]

In [10]:
# Create a TfidfVectorizer with default settings (unigrams only).
# To include bigrams and trigrams as well, use the commented-out line instead:
# vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the preprocessed data and transform the documents into a feature matrix
X_overview = vectorizer.fit_transform(preprocessed_documents)

# Print the shape of the TF-IDF feature matrix (documents x unique terms)
print(X_overview.shape)

(4803, 20710)


In [11]:
import numpy as np

# Get the feature names (tokens) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the TF-IDF matrix
tfidf_matrix = X_overview.toarray()

# Calculate the mean TF-IDF score for each token across all documents
mean_tfidf_scores = np.mean(tfidf_matrix, axis=0)

# Create a dictionary mapping each token to its mean TF-IDF score
token_tfidf_scores = dict(zip(feature_names, mean_tfidf_scores))

# Sort the tokens by their mean TF-IDF scores
sorted_tokens = sorted(token_tfidf_scores.items(), key=lambda x: x[1], reverse=True)

# Print the sorted tokens with their mean TF-IDF scores
for token, score in sorted_tokens:
    print(f"Token: {token}, Mean TF-IDF Score: {score}")


Token: life, Mean TF-IDF Score: 0.01748990896777487
Token: ha, Mean TF-IDF Score: 0.014669639746362187
Token: new, Mean TF-IDF Score: 0.01287853073278687
Token: young, Mean TF-IDF Score: 0.012741207030753734
Token: world, Mean TF-IDF Score: 0.01192207458443284
Token: man, Mean TF-IDF Score: 0.01158844233610407
Token: family, Mean TF-IDF Score: 0.01116081238553672
Token: friend, Mean TF-IDF Score: 0.011125883369209192
Token: story, Mean TF-IDF Score: 0.010659006584769546
Token: woman, Mean TF-IDF Score: 0.00970951591847458
Token: love, Mean TF-IDF Score: 0.009559379880257193
Token: year, Mean TF-IDF Score: 0.009239336180276071
Token: father, Mean TF-IDF Score: 0.007850683752819066
Token: time, Mean TF-IDF Score: 0.007258388035013897
Token: set, Mean TF-IDF Score: 0.0072322833999250045
Token: film, Mean TF-IDF Score: 0.007216938599616272
Token: girl, Mean TF-IDF Score: 0.006983361009132039
Token: make, Mean TF-IDF Score: 0.006928825610577879
Token: school, Mean TF-IDF Score: 0.0069092185

In [12]:
# Compute the cosine similarity matrix from the TF-IDF matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_matrix_1 = cosine_similarity(X_overview, X_overview)

In [13]:
cosine_matrix_1.shape

(4803, 4803)

In [14]:
# Construct a reverse map of indices and movie titles using a dictionary
indices = dict(zip(df['original_title'], df.index))

In [15]:
indices

{'Avatar': 0,
 "Pirates of the Caribbean: At World's End": 1,
 'Spectre': 2,
 'The Dark Knight Rises': 3,
 'John Carter': 4,
 'Spider-Man 3': 5,
 'Tangled': 6,
 'Avengers: Age of Ultron': 7,
 'Harry Potter and the Half-Blood Prince': 8,
 'Batman v Superman: Dawn of Justice': 9,
 'Superman Returns': 10,
 'Quantum of Solace': 11,
 "Pirates of the Caribbean: Dead Man's Chest": 12,
 'The Lone Ranger': 13,
 'Man of Steel': 14,
 'The Chronicles of Narnia: Prince Caspian': 15,
 'The Avengers': 16,
 'Pirates of the Caribbean: On Stranger Tides': 17,
 'Men in Black 3': 18,
 'The Hobbit: The Battle of the Five Armies': 19,
 'The Amazing Spider-Man': 20,
 'Robin Hood': 21,
 'The Hobbit: The Desolation of Smaug': 22,
 'The Golden Compass': 23,
 'King Kong': 24,
 'Titanic': 25,
 'Captain America: Civil War': 26,
 'Battleship': 27,
 'Jurassic World': 28,
 'Skyfall': 29,
 'Spider-Man 2': 30,
 'Iron Man 3': 31,
 'Alice in Wonderland': 32,
 'X-Men: The Last Stand': 33,
 'Monsters University': 34,
 'Tra

In [16]:
def recommender_function(title, cosine_matrix, top_N=5):
    idx = indices.get(title)

    # Compare against None explicitly: index 0 ('Avatar') is valid but falsy
    if idx is None:
        print('not valid movie title')
        return None

    # Copy the title column so adding 'score' does not write to a slice of df
    score_df = df[['original_title']].copy()
    score_df['score'] = cosine_matrix[idx]

    # Drop the query movie itself, then sort by similarity in descending order
    score_df_sorted = score_df.drop(index=idx).sort_values(by='score', ascending=False)

    return score_df_sorted.head(top_N)


In [17]:
recommender_function('The Avengers', cosine_matrix_1)

Unnamed: 0,original_title,score
7,Avengers: Age of Ultron,0.148779
3144,Plastic,0.111274
1715,Timecop,0.11032
4124,This Thing of Ours,0.107406
3311,Thank You for Smoking,0.104367


## Recommender system based on actors and director

This code segment implements a recommender system based on actors and directors. It extracts relevant information from the DataFrame such as movie titles, director names, and actor names. Then, it preprocesses the data by replacing missing values with empty strings, converting strings to lowercase, and removing spaces. After preprocessing, it combines the director name and actor names into a single column called 'actors_and_director'.

### Data Preprocessing

- **Handling Missing Values**: NaN values in the director and actor columns are replaced with empty strings.
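- **String Manipulation**: Director and actor names are converted to lowercase and spaces are removed, so that each full name becomes a single token (e.g., `samworthington`) and two different people who share a first or last name are not treated as a match.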
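
To see why removing spaces matters, the sketch below tokenizes names with the regular expression `\b\w\w+\b`, which is assumed here to match `CountVectorizer`'s default token pattern. The names are taken from the dataset preview; the helper `tokens` is hypothetical.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # assumed CountVectorizer default

def tokens(text):
    """Lowercase the text and return the set of word tokens."""
    return set(TOKEN_PATTERN.findall(text.lower()))

# With spaces kept, two different people share the token 'sam'
with_spaces = tokens("Sam Worthington") & tokens("Sam Mendes")

# With spaces removed, each full name is a single, distinct token
without_spaces = tokens("SamWorthington") & tokens("SamMendes")

print(with_spaces)     # {'sam'}
print(without_spaces)  # set()
```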

### Feature Engineering

- **Combining Features**: The director name and actor names are concatenated into a single feature column 'actors_and_director', which represents the collective involvement of directors and actors in each movie.

### Vectorization

- **CountVectorizer**: The CountVectorizer from scikit-learn is utilized to convert the text data into a numerical format. It creates a bag-of-words representation, where each document is represented by a vector indicating the count of each word.

### Cosine Similarity Calculation

- **Cosine Similarity Matrix**: Cosine similarity is computed between each pair of documents based on their vectorized representations. This matrix quantifies the similarity between movies in terms of their actors and directors.

### Recommender Function

- **Input**: The function 'recommender_function' takes a movie title and cosine similarity matrix as input.
- **Similarity Scores Calculation**: It computes similarity scores between the given movie and all other movies based on the cosine similarity matrix.
- **Sorting and Recommendations**: The function sorts the movies based on similarity scores in descending order and returns the top N similar movies as recommendations.

### Example Usage

An example usage of the 'recommender_function' is provided, recommending similar movies to "The Avengers" based on the cosine similarity matrix 'cosine_matrix_2'. The function returns a DataFrame containing the top recommended movies along with their similarity scores.

This approach leverages the collective influence of directors and actors to recommend similar movies, offering a different perspective compared to the overview-based recommender system.


In [18]:
df.columns

Index(['budget', 'genres', 'homepage', 'id', 'plot_keywords', 'language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'gross', 'duration',
       'spoken_languages', 'status', 'tagline', 'movie_title', 'vote_average',
       'num_voted_users', 'title_year', 'country', 'director_name',
       'actor_1_name', 'actor_2_name', 'actor_3_name'],
      dtype='object')

In [19]:
# Copy the selected columns so later assignments do not write to a slice of df
df2 = df[['original_title', 'director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']].copy()

In [20]:
df2.head()

Unnamed: 0,original_title,director_name,actor_1_name,actor_2_name,actor_3_name
0,Avatar,James Cameron,Sam Worthington,Zoe Saldana,Sigourney Weaver
1,Pirates of the Caribbean: At World's End,Gore Verbinski,Johnny Depp,Orlando Bloom,Keira Knightley
2,Spectre,Sam Mendes,Daniel Craig,Christoph Waltz,Léa Seydoux
3,The Dark Knight Rises,Christopher Nolan,Christian Bale,Michael Caine,Gary Oldman
4,John Carter,Andrew Stanton,Taylor Kitsch,Lynn Collins,Samantha Morton


In [21]:
#Replace NaN with an empty string
df2['director_name'] = df2['director_name'].fillna('')
df2['actor_1_name'] = df2['actor_1_name'].fillna('')
df2['actor_2_name'] = df2['actor_2_name'].fillna('')
df2['actor_3_name'] = df2['actor_3_name'].fillna('')

In [22]:
# Lowercase the strings and remove spaces
df2['director_name'] = df2['director_name'].str.lower().str.replace(' ', '')
df2['actor_1_name'] = df2['actor_1_name'].str.lower().str.replace(' ', '')
df2['actor_2_name'] = df2['actor_2_name'].str.lower().str.replace(' ', '')
df2['actor_3_name'] = df2['actor_3_name'].str.lower().str.replace(' ', '')

# Create a new column 'actors_and_director' by combining director and actors
df2['actors_and_director'] = df2['director_name'] + ' ' + df2['actor_1_name'] + ' ' + df2['actor_2_name'] + ' ' + df2['actor_3_name']

# Print the DataFrame to see the new column
df2.head()

Unnamed: 0,original_title,director_name,actor_1_name,actor_2_name,actor_3_name,actors_and_director
0,Avatar,jamescameron,samworthington,zoesaldana,sigourneyweaver,jamescameron samworthington zoesaldana sigourn...
1,Pirates of the Caribbean: At World's End,goreverbinski,johnnydepp,orlandobloom,keiraknightley,goreverbinski johnnydepp orlandobloom keirakni...
2,Spectre,sammendes,danielcraig,christophwaltz,léaseydoux,sammendes danielcraig christophwaltz léaseydoux
3,The Dark Knight Rises,christophernolan,christianbale,michaelcaine,garyoldman,christophernolan christianbale michaelcaine ga...
4,John Carter,andrewstanton,taylorkitsch,lynncollins,samanthamorton,andrewstanton taylorkitsch lynncollins samanth...


In [23]:
documents = df2['actors_and_director']

In [24]:
from sklearn.feature_extraction.text import CountVectorizer


In [25]:
# Create a CountVectorizer with default settings (unigrams only)
vectorizer = CountVectorizer()

# Fit the vectorizer to the preprocessed data and transform the documents into a feature matrix
X_actors = vectorizer.fit_transform(documents)

# Print the feature matrix
print(X_actors.toarray())

X_actors.shape

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


(4803, 8091)

In [26]:
cosine_matrix_2 = cosine_similarity(X_actors, X_actors)

In [27]:
recommender_function('The Avengers', cosine_matrix_2)

Unnamed: 0,original_title,score
7,Avengers: Age of Ultron,0.75
421,Zodiac,0.5
26,Captain America: Civil War,0.5
3748,The Kids Are All Right,0.25
79,Iron Man 2,0.25
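
The scores above are multiples of 0.25, which is what cosine similarity produces when every movie is represented by four distinct name tokens, each with a count of 1. A small plain-Python sketch, using placeholder tokens rather than real cast lists, illustrates the arithmetic:

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two whitespace-tokenized count vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Placeholder tokens: one director (d*) and three actors (a*) per movie
query = "d1 a1 a2 a3"
print(cosine(query, "d1 a1 a2 a9"))  # three shared names -> 0.75
print(cosine(query, "d2 a1 a2 a8"))  # two shared names   -> 0.5
print(cosine(query, "d3 a1 a7 a8"))  # one shared name    -> 0.25
```

Under this representation a shared director counts exactly as much as a shared actor.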


## Recommender system based on actors, directors, and overview

This code combines information from movie overviews, directors, and actors to create a comprehensive recommender system. It utilizes TF-IDF features extracted from movie overviews and bag-of-words representations of actors and directors, then calculates cosine similarity between movies based on these combined features.

### Feature Combination

- **TF-IDF Features**: TF-IDF features extracted from movie overviews represent the content and themes of each movie.
- **Bag-of-Words Features**: Bag-of-words representations of actors and directors capture the involvement of key personnel in each movie.

### Cosine Similarity Calculation

- **Combined Feature Matrix**: TF-IDF features and bag-of-words representations are concatenated into a single feature matrix.
- **Cosine Similarity Matrix**: Cosine similarity is computed between each pair of movies based on their combined features. This matrix reflects the overall similarity between movies, considering both content and personnel.
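
One consequence of concatenating the two blocks is that they do not contribute equally. Assuming the TF-IDF rows are L2-normalized (length 1, the `TfidfVectorizer` default) and each movie has four distinct name tokens (count-vector length 2), the names account for most of the combined vector's length. The following sketch, with made-up similarity values and a hypothetical helper `combined_cosine`, works through the arithmetic:

```python
def combined_cosine(overview_cos, shared_names, names_per_movie=4):
    """Cosine similarity of [tfidf | counts] rows for two movies.

    Assumes unit-length TF-IDF rows and `names_per_movie` distinct
    name tokens (each with count 1) for both movies.
    """
    # Dot product of the concatenated rows is the sum of the block dot products
    dot = overview_cos * 1.0 + shared_names
    # Squared length of each concatenated row: 1 (TF-IDF) + number of names
    norm_sq = 1.0 + names_per_movie
    return dot / norm_sq

# Made-up inputs: similar plots but no shared people, and the reverse
plot_match = combined_cosine(overview_cos=0.60, shared_names=0)
cast_match = combined_cosine(overview_cos=0.05, shared_names=3)
print(round(plot_match, 2))  # 0.12
print(round(cast_match, 2))  # 0.61
```

Here a movie sharing three names outranks one with a far more similar plot. Scaling one block before concatenation is one possible way to rebalance the two, though the notebook does not do this.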

### Recommender Function

- **Input**: The function 'recommender_function' takes a movie title and cosine similarity matrix as input.
- **Similarity Scores Calculation**: It computes similarity scores between the given movie and all other movies based on the cosine similarity matrix.
- **Sorting and Recommendations**: The function sorts the movies based on similarity scores in descending order and returns the top N similar movies as recommendations.

### Example Usage

An example usage of the 'recommender_function' is provided, recommending similar movies to "The Dark Knight Rises" based on the cosine similarity matrix 'cosine_matrix_3'. The function returns a DataFrame containing the top recommended movies along with their similarity scores.

By integrating information from movie overviews, directors, and actors, this recommender system takes both content and personnel into account when ranking similar movies.


In [28]:
from scipy.sparse import hstack

# Stack the two sparse matrices column-wise without converting them to dense arrays
X_total = hstack([X_overview, X_actors], format='csr')

In [29]:
X_actors.shape

(4803, 8091)

In [30]:
X_overview.shape

(4803, 20710)

In [31]:
X_total.shape

(4803, 28801)

In [32]:
cosine_matrix_3 = cosine_similarity(X_total, X_total)

In [33]:
recommender_function('The Dark Knight Rises', cosine_matrix_3)

Unnamed: 0,original_title,score
119,Batman Begins,0.627891
1196,The Prestige,0.603451
65,The Dark Knight,0.460855
1181,JFK,0.2273
1246,Quest for Camelot,0.213722


## Recommender system based on actors, directors, overview and genres

This code extends the recommender system to incorporate information about movie genres in addition to actors, directors, and overviews. It preprocesses the genre data, combines it with existing features, and computes cosine similarity between movies based on the expanded feature set.

### Genre Data Preprocessing

- **Handling Missing Values**: NaN values in the genre column are replaced with empty strings.
- **Rewriting Genres**: The `rewrite_genres` function splits the `|`-separated genre string, keeps at most the first three genres, and joins them with spaces. Multi-word genres such as "Science Fiction" therefore become two separate tokens ("science" and "fiction") once the text is vectorized.
- **Lowercasing**: Genre strings are converted to lowercase to ensure consistency.
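
If multi-word genres should stay intact as single tokens, one possible variant (not what the notebook does) replaces inner spaces with underscores before joining. The function name `genres_to_tokens` and the `max_genres` parameter are hypothetical.

```python
def genres_to_tokens(genres, max_genres=3):
    """Turn 'Action|Science Fiction' into 'action science_fiction'."""
    parts = [g.strip() for g in genres.split('|') if g.strip()]
    # Keep multi-word genres as a single token by replacing spaces
    return ' '.join(g.lower().replace(' ', '_') for g in parts[:max_genres])

print(genres_to_tokens('Action|Adventure|Science Fiction'))  # action adventure science_fiction
print(genres_to_tokens('Drama'))                             # drama
print(repr(genres_to_tokens('')))                            # ''
```

Because the underscore is a word character in regular expressions, a token such as `science_fiction` should survive word-based tokenization as one unit.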

### Feature Engineering

- **Combined Features**: The combined feature 'actors_and_director_genres' is created by concatenating information about actors, directors, and genres. This provides a comprehensive representation of each movie's personnel and thematic elements.

### Vectorization

- **CountVectorizer**: The CountVectorizer from scikit-learn is used to convert the textual data into numerical features. It creates a bag-of-words representation, capturing the presence of words (including actors, directors, and genres) in each movie.

### Cosine Similarity Calculation

- **Combined Feature Matrix**: TF-IDF features from overviews and bag-of-words representations from actors, directors, and genres are concatenated into a single feature matrix.
- **Cosine Similarity Matrix**: Cosine similarity is computed between each pair of movies based on their combined features. This matrix reflects the overall similarity between movies, considering content, personnel, and thematic elements.

### Example Usage

Example usages of the 'recommender_function' are provided, recommending similar movies to both "The Dark Knight Rises" and "The Godfather" based on the cosine similarity matrix 'cosine_matrix_4'. The function returns DataFrames containing the top recommended movies along with their similarity scores.

By incorporating genre information, this enhanced recommender system offers more refined and tailored recommendations, considering not only the personnel involved but also the thematic elements and genre preferences.


In [34]:
df['genres'] = df['genres'].fillna('')

In [35]:
def rewrite_genres(genres):
    """Return at most the first three '|'-separated genres as a space-separated string."""
    genres_list = genres.split('|')

    if len(genres_list) > 2:
        # Note the trailing space; it has no effect once the text is tokenized
        return " ".join(genres_list[:3]) + " "

    # One or two genres (''.split('|') yields [''], so an empty input returns '')
    return " ".join(genres_list)

In [36]:
df2['genres'] = df['genres'].apply(rewrite_genres)

In [37]:
df2['genres'] = df2['genres'].apply(str.lower)

In [38]:
df2.head()

Unnamed: 0,original_title,director_name,actor_1_name,actor_2_name,actor_3_name,actors_and_director,genres
0,Avatar,jamescameron,samworthington,zoesaldana,sigourneyweaver,jamescameron samworthington zoesaldana sigourn...,action adventure fantasy
1,Pirates of the Caribbean: At World's End,goreverbinski,johnnydepp,orlandobloom,keiraknightley,goreverbinski johnnydepp orlandobloom keirakni...,adventure fantasy action
2,Spectre,sammendes,danielcraig,christophwaltz,léaseydoux,sammendes danielcraig christophwaltz léaseydoux,action adventure crime
3,The Dark Knight Rises,christophernolan,christianbale,michaelcaine,garyoldman,christophernolan christianbale michaelcaine ga...,action crime drama
4,John Carter,andrewstanton,taylorkitsch,lynncollins,samanthamorton,andrewstanton taylorkitsch lynncollins samanth...,action adventure science fiction


In [39]:
df2['actors_and_director_genres'] = df2['actors_and_director'] + ' ' + df2['genres']
df2['actors_and_director_genres'][0]

'jamescameron samworthington zoesaldana sigourneyweaver action adventure fantasy '

In [40]:
documents = df2['actors_and_director_genres']

In [41]:
# Create a CountVectorizer with default settings (unigrams only)
vectorizer = CountVectorizer()

# Fit the vectorizer to the preprocessed data and transform the documents into a feature matrix
X_actors_genres = vectorizer.fit_transform(documents)

# Print the feature matrix
print(X_actors_genres.toarray())

X_actors_genres.shape

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


(4803, 8113)

In [42]:
from scipy.sparse import hstack

# Stack the two sparse matrices column-wise without converting them to dense arrays
X_total = hstack([X_overview, X_actors_genres], format='csr')

In [43]:
X_total.shape

(4803, 28823)

In [44]:
cosine_matrix_4 = cosine_similarity(X_total, X_total)

In [45]:
recommender_function('The Dark Knight Rises', cosine_matrix_4)

Unnamed: 0,original_title,score
119,Batman Begins,0.767432
65,The Dark Knight,0.663034
4638,Amidst the Devil's Wings,0.53033
1196,The Prestige,0.502157
3073,Romeo Is Bleeding,0.5


In [46]:
recommender_function('The Godfather', cosine_matrix_4)

Unnamed: 0,original_title,score
2731,The Godfather: Part II,0.632163
867,The Godfather: Part III,0.554499
1525,Apocalypse Now,0.428571
3012,The Outsiders,0.428571
2649,The Son of No One,0.403901


In [47]:
"""
suggestion type:

1: based on overview
2: based on actors and director
3: based on actors, director, and overview
4: based on actors, director, overview, and genres
"""
title = 'The Godfather'
suggestion_type = 4

# Map each suggestion type to its cosine similarity matrix explicitly
# instead of looking the variable up by name in globals()
cosine_matrices = {
    1: cosine_matrix_1,
    2: cosine_matrix_2,
    3: cosine_matrix_3,
    4: cosine_matrix_4,
}

recommender_function(title, cosine_matrices[int(suggestion_type)])

Unnamed: 0,original_title,score
2731,The Godfather: Part II,0.632163
867,The Godfather: Part III,0.554499
1525,Apocalypse Now,0.428571
3012,The Outsiders,0.428571
2649,The Son of No One,0.403901
