(i) Feature Extraction with TF-IDF

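Use a text field for each movie, such as its description, genres, or keywords.

Convert the text into a numerical representation using TF-IDF (Term Frequency–Inverse Document Frequency).

TF-IDF gives more weight to words that are distinctive for a movie and less weight to words that appear in many descriptions, such as "movie". Very common function words like "the" are removed entirely by stop_words='english'.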
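
To make the weighting concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences are hypothetical and not taken from the TMDB data). It assumes scikit-learn's default TfidfVectorizer settings apart from stop_words, and compares the learned IDF weight of a word that appears in every document with one that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, for illustration only
docs = [
    "the movie about a space pilot",
    "the movie about a jewel heist",
    "the movie about a haunted house",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_   # maps each term to its column index
idf = vectorizer.idf_            # one IDF weight per column

# "movie" occurs in every document, so it gets the lowest IDF weight;
# "pilot" occurs in only one document, so it is weighted more heavily.
print("idf(movie):", idf[vocab["movie"]])
print("idf(pilot):", idf[vocab["pilot"]])

# "the" is an English stop word, so it never enters the vocabulary
print("'the' in vocabulary:", "the" in vocab)
```

The same two effects apply to the real overviews: stop words are dropped before weighting, and words shared by many overviews contribute little to the similarity scores.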

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load dataset (TMDB 5000 movies dataset from Kaggle)
df = pd.read_csv("tmdb_5000_movies.csv")

# Use 'overview' (movie description) for features
df['overview'] = df['overview'].fillna('')  

# Create TF-IDF matrix
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['overview'])

print("TF-IDF Matrix Shape:", tfidf_matrix.shape)


TF-IDF Matrix Shape: (4803, 20978)


(ii) Similarity Calculation (Cosine Similarity)

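Compute a similarity score for every pair of movies.

Cosine similarity measures the angle between two vectors rather than their length, which suits text data because long and short descriptions remain comparable. For TF-IDF vectors, scores range from 0 (no shared terms) to 1 (same direction).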
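
As a small illustration of the definition, the sketch below computes the cosine by hand with NumPy and checks it against scikit-learn. The helper name cosine and the three vectors are made up for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero component with a

print("a vs b:", cosine(a, b))  # parallel vectors
print("a vs c:", cosine(a, c))  # orthogonal vectors

# scikit-learn computes the same quantity for every pair of rows at once
pairwise = cosine_similarity(np.vstack([a, b, c]))
print(pairwise.shape)
```

Because the score depends only on direction, doubling the length of a vector (b compared with a) leaves the similarity unchanged.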

In [2]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


(iii) Build the Recommender Function

Input: Movie title

Output: Top 5 most similar movies

In [3]:
# Map each title to its row index, keeping the first row for duplicated titles
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def recommend_movies(title, cosine_sim=cosine_sim):
    # Get index of the movie that matches the title
    idx = indices[title]
    
    # Get similarity scores for that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort movies by similarity score (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get top 5 similar movies (skip the first one because it’s the same movie)
    sim_scores = sim_scores[1:6]
    
    # Get movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    return df['title'].iloc[movie_indices]
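
The function above raises a KeyError when the title is not in the dataset. One possible way to guard against that is sketched below on a made-up four-movie DataFrame, so it runs without the TMDB file. The names toy, toy_sim, toy_indices, and recommend_or_empty are hypothetical and not part of the notebook; the sketch also drops the query movie by index instead of assuming it is ranked first.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the TMDB file
toy = pd.DataFrame({
    "title": ["Space One", "Space Two", "Heist Night", "Ghost House"],
    "overview": [
        "astronaut crew explores distant planet",
        "astronaut crew returns from distant planet",
        "thieves plan a casino robbery",
        "family moves into a haunted mansion",
    ],
})

toy_sim = cosine_similarity(
    TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
)
toy_indices = pd.Series(toy.index, index=toy["title"])

def recommend_or_empty(title, n=2):
    """Return up to n similar titles, or an empty Series for unknown titles."""
    if title not in toy_indices.index:
        return pd.Series(dtype=object, name="title")
    idx = toy_indices[title]
    # Rank all rows by similarity, then drop the query row explicitly
    order = toy_sim[idx].argsort()[::-1]
    order = [i for i in order if i != idx][:n]
    return toy["title"].iloc[order]

print(recommend_or_empty("Space One", n=1))
print(recommend_or_empty("Not A Real Title"))
```

Returning an empty Series keeps the return type consistent; raising a clearer error message would be an equally reasonable choice.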


(iv) Test with Example Movies

In [5]:
print("Recommendations for 'The Dark Knight':")
print(recommend_movies("The Dark Knight"))

print("\nRecommendations for 'Inception':")
print(recommend_movies("Inception"))

print("\nRecommendations for 'The Avengers':")
print(recommend_movies("The Avengers"))


Recommendations for 'The Dark Knight':
3                         The Dark Knight Rises
428                              Batman Returns
3854    Batman: The Dark Knight Returns, Part 2
299                              Batman Forever
1359                                     Batman
Name: title, dtype: object

Recommendations for 'Inception':
2897                                Cypher
134     Mission: Impossible - Rogue Nation
1930                            Stone Cold
914                   Central Intelligence
1683                       Pitch Perfect 2
Name: title, dtype: object

Recommendations for 'The Avengers':
7       Avengers: Age of Ultron
3144                    Plastic
1715                    Timecop
4124         This Thing of Ours
3311      Thank You for Smoking
Name: title, dtype: object
