# 🎬 IMDb Top‑1000 Content‑Based Recommender
This notebook builds a simple content‑based movie recommendation engine using **Overview** (plot summary) and **Genre** from the IMDb Top‑1000 Movies & TV dataset.  
**Steps**:
1. Load & inspect data  
2. Clean and preprocess features  
3. Vectorize overviews with TF‑IDF  
4. One‑hot encode genres  
5. Combine similarities and recommend movies  

> **Dataset**: download the CSV from Kaggle and place it next to this notebook as `imdb_top_1000.csv`.  
> **Libraries required**: `pandas`, `numpy`, `scikit-learn`, `nltk`. Install with `pip install pandas numpy scikit-learn nltk` if needed.
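
As a rough illustration of step 5, the sketch below blends a text similarity and a genre similarity with a weight `alpha`. It uses plain Python and made-up toy vectors. The helper names `cosine` and `blended_similarity` are illustrative assumptions and not part of the notebook, which relies on scikit-learn for the same arithmetic.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def blended_similarity(text_u, text_v, genre_u, genre_v, alpha=0.8):
    """Weighted mix of text and genre cosine similarity (alpha weights the text part)."""
    return alpha * cosine(text_u, text_v) + (1 - alpha) * cosine(genre_u, genre_v)

# Hypothetical toy data: the two text vectors share no terms,
# and the genre vectors are one-hot over [Crime, Drama].
text_a = [1.0, 0.0]
text_b = [0.0, 1.0]
drama_only = [0, 1]
crime_drama = [1, 1]

score = blended_similarity(text_a, text_b, drama_only, crime_drama)
print(round(score, 8))  # 0.14142136
```

With `alpha=0.8`, a pair whose overviews share no terms can still score up to 0.2 from genre overlap alone.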

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer
import re, nltk, string, warnings
warnings.filterwarnings('ignore')

# Make sure NLTK stopwords are downloaded once
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

In [2]:
# ↳ CHANGE the path if your file lives elsewhere
df = pd.read_csv('imdb_top_1000.csv')
print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
df.head()

Loaded 1000 rows and 16 columns


,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [3]:
def clean_overview(text):
    if pd.isna(text):
        return ""
    # lowercase
    text = text.lower()
    # remove html tags & punctuation
    text = re.sub('<.*?>', ' ', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # tokenize & remove stop words
    tokens = [w for w in text.split() if w not in STOPWORDS]
    return ' '.join(tokens)

df['clean_overview'] = df['Overview'].apply(clean_overview)
df['clean_overview'].head()

0    two imprisoned men bond number years finding s...
1    organized crime dynastys aging patriarch trans...
2    menace known joker wreaks havoc chaos people g...
3    early life career vito corleone 1920s new york...
4    jury holdout attempts prevent miscarriage just...
Name: clean_overview, dtype: object

In [4]:
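# Raw string avoids the invalid "\s" escape; the pattern splits on a comma plus optional whitespace
df['genre_list'] = df['Genre'].str.split(r',\s*')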
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(df['genre_list'])
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=df.index)
print(f"One‑hot genre matrix shape: {genre_df.shape}")
#print(genre_df.head())

One‑hot genre matrix shape: (1000, 21)


In [5]:
tfidf = TfidfVectorizer(max_features=5000)
overview_tfidf = tfidf.fit_transform(df['clean_overview'])
overview_tfidf

<1000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 13487 stored elements in Compressed Sparse Row format>

In [6]:
# Show the TF-IDF vector for the first movie (as a dense array)
print("TF-IDF vector for the first movie:")
print(overview_tfidf[0].toarray()[0])

# Or, print the top 10 nonzero terms for the first movie
feature_names = tfidf.get_feature_names_out()
row = overview_tfidf[0].toarray()[0]
top_indices = row.argsort()[-10:][::-1]
print("\nTop 10 TF-IDF terms for the first movie:")
for idx in top_indices:
    if row[idx] > 0:
        print(f"{feature_names[idx]}: {row[idx]:.3f}")


TF-IDF vector for the first movie:
[0. 0. 0. ... 0. 0. 0.]

Top 10 TF-IDF terms for the first movie:
decency: 0.352
number: 0.333
acts: 0.333
solace: 0.333
common: 0.319
finding: 0.299
redemption: 0.279
imprisoned: 0.279
bond: 0.274
men: 0.231


In [7]:
# Cosine similarity on overviews. cosine_similarity returns a dense (n, n) array
# by default, which is manageable for 1,000 movies.
cosine_sim = cosine_similarity(overview_tfidf)

# Genre similarity: cosine similarity between the one-hot genre vectors
genre_sim = cosine_similarity(genre_matrix)

# Combine similarities (tune alpha) – here 80% text, 20% genre
alpha = 0.8
combined_sim = alpha * cosine_sim + (1 - alpha) * genre_sim
combined_sim[:5, :5]

array([[1.        , 0.14142136, 0.11547005, 0.14142136, 0.14142136],
       [0.14142136, 1.        , 0.16329932, 0.26779516, 0.2       ],
       [0.11547005, 0.16329932, 1.        , 0.16329932, 0.16329932],
       [0.14142136, 0.26779516, 0.16329932, 1.        , 0.2       ],
       [0.14142136, 0.2       , 0.16329932, 0.2       , 1.        ]])

In [8]:
print(df.columns.tolist())


['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate', 'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director', 'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross', 'clean_overview', 'genre_list']


### 🔍 Quick Demo
Run the next cell to see 10 recommendations for **"The Dark Knight"**, or pass any other title from the dataset to `recommend`.
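
The `recommend` function in the next cell ranks one row of the similarity matrix and returns the top `k` entries other than the query movie. One way to do that ranking step is sketched here in plain Python; `top_k_similar` and the toy row are hypothetical names and values chosen for illustration. The sketch removes the query by position rather than by rank, which matters when another item ties with it.

```python
def top_k_similar(scores, query_idx, k=10):
    """Return positions of the k highest scores, skipping the query item itself."""
    # sorted() is stable, so tied scores keep their original (lower index first) order
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [j for j in ranked if j != query_idx][:k]

# Hypothetical similarity row for item 2; item 0 ties with the item itself
row = [1.0, 0.3, 1.0, 0.7, 0.1]
print(top_k_similar(row, query_idx=2, k=3))  # [0, 3, 1]
```

Simply dropping the first ranked entry would return `[2, 3, 1]` for this row, so the query item would recommend itself.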

In [9]:
def recommend(title, k=10):
    if title not in df['Series_Title'].values:
        raise ValueError(f"'{title}' not found in dataset.")
    idx = df.index[df['Series_Title'] == title][0]
    sim_scores = list(enumerate(combined_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Exclude the query movie by position; with ties it is not guaranteed to sort first
    top_indices = [i for i, _ in sim_scores if i != idx][:k]
    recs = df.iloc[top_indices][['Series_Title', 'Genre', 'IMDB_Rating']]
    return recs.reset_index(drop=True)

# Example usage
recommend('The Dark Knight')


,Series_Title,Genre,IMDB_Rating
0,Kill Bill: Vol. 1,"Action, Crime, Drama",8.1
1,Batman Begins,"Action, Adventure",8.2
2,Joker,"Crime, Drama, Thriller",8.5
3,Dip huet seung hung,"Action, Crime, Drama",7.8
4,The Fugitive,"Action, Crime, Drama",7.8
5,Lucky Number Slevin,"Action, Crime, Drama",7.7
6,Léon,"Action, Crime, Drama",8.5
7,Vikram Vedha,"Action, Crime, Drama",8.4
8,Haider,"Action, Crime, Drama",8.1
9,A Wednesday,"Action, Crime, Drama",8.1


### Evaluation
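The evaluation cell relies on two simple quantities: Jaccard similarity between genre sets and Shannon entropy of the genre counts across a recommendation list. The following is a minimal stand-alone sketch of both, assuming genres arrive as `', '`-separated strings as in the dataset; the function names are illustrative, not the notebook's own.

```python
import math
from collections import Counter

def jaccard(genres_a, genres_b):
    """Jaccard similarity between two ', '-separated genre strings."""
    a, b = set(genres_a.split(', ')), set(genres_b.split(', '))
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def genre_entropy(genre_strings):
    """Shannon entropy (in bits) of genre frequencies across a list of genre strings."""
    counts = Counter(g for s in genre_strings for g in s.split(', '))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two genres shared out of four distinct ones
print(jaccard('Action, Crime, Drama', 'Crime, Drama, Thriller'))  # 0.5
# Two genres, each appearing equally often, give exactly one bit
print(genre_entropy(['Crime, Drama', 'Crime, Drama']))  # 1.0
```

Higher mean Jaccard means the recommendations stay closer to the input movie's genres, while higher entropy means the genre mix across the list is more varied.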


In [10]:
import pandas as pd
import numpy as np
from collections import Counter

def calculate_genre_similarity(movie1_genres, movie2_genres):
    """Calculate Jaccard similarity between two movies' genres."""
    genres1 = set(movie1_genres.split(', '))
    genres2 = set(movie2_genres.split(', '))
    intersection = len(genres1.intersection(genres2))
    union = len(genres1.union(genres2))
    return intersection / union if union > 0 else 0

def evaluate_genre_consistency(df, recommendations, input_movie):
    """Evaluate how well the recommendations maintain genre consistency."""
    input_genres = df[df['Series_Title'] == input_movie]['Genre'].iloc[0]
    genre_similarities = []
    
    for _, rec in recommendations.iterrows():
        similarity = calculate_genre_similarity(input_genres, rec['Genre'])
        genre_similarities.append(similarity)
    
    return {
        'mean_genre_similarity': np.mean(genre_similarities),
        'min_genre_similarity': np.min(genre_similarities),
        'max_genre_similarity': np.max(genre_similarities)
    }

def evaluate_rating_distribution(df, recommendations, input_movie):
    """Evaluate the rating distribution of recommendations."""
    input_rating = df[df['Series_Title'] == input_movie]['IMDB_Rating'].iloc[0]
    rec_ratings = recommendations['IMDB_Rating'].values
    
    return {
        'input_movie_rating': input_rating,
        'mean_rec_rating': np.mean(rec_ratings),
        'std_rec_rating': np.std(rec_ratings),
        'min_rating': np.min(rec_ratings),
        'max_rating': np.max(rec_ratings),
        'rating_diff': abs(input_rating - np.mean(rec_ratings))
    }

def evaluate_diversity(recommendations):
    """Evaluate diversity in recommendations based on genres."""
    all_genres = []
    for genres in recommendations['Genre']:
        all_genres.extend(genres.split(', '))
    
    genre_counts = Counter(all_genres)
    unique_genres = len(genre_counts)
    genre_entropy = -sum((count/len(all_genres)) * np.log2(count/len(all_genres)) 
                        for count in genre_counts.values())
    
    return {
        'unique_genres': unique_genres,
        'genre_entropy': genre_entropy,
        'genre_distribution': dict(genre_counts)
    }

def comprehensive_evaluation(df, recommender_func, movie_title, k=10):
    """Run comprehensive evaluation for a given movie."""
    try:
        recommendations = recommender_func(movie_title, k)
        
        # Get all evaluation metrics
        genre_metrics = evaluate_genre_consistency(df, recommendations, movie_title)
        rating_metrics = evaluate_rating_distribution(df, recommendations, movie_title)
        diversity_metrics = evaluate_diversity(recommendations)
        
        # Combine all metrics
        evaluation_results = {
            'movie_title': movie_title,
            'num_recommendations': len(recommendations),
            **genre_metrics,
            **rating_metrics,
            **diversity_metrics
        }
        
        return evaluation_results, recommendations
        
    except Exception as e:
        return {'error': str(e)}, None

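def evaluate_multiple_movies(df, recommender_func, num_samples=10, k=10, random_state=None):
    """Evaluate recommender system on multiple random movies (pass random_state to reproduce a run)."""
    # Cap the sample size at the number of available titles so sample() cannot fail
    n = min(num_samples, len(df))
    sample_movies = df['Series_Title'].sample(n=n, random_state=random_state).tolist()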
    all_results = []
    
    for movie in sample_movies:
        results, _ = comprehensive_evaluation(df, recommender_func, movie, k)
        all_results.append(results)
    
    # Calculate aggregate statistics
    aggregate_metrics = pd.DataFrame(all_results).mean(numeric_only=True).to_dict()
    
    return {
        'individual_results': all_results,
        'aggregate_metrics': aggregate_metrics
    } 




### Save model
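Before writing anything to disk, an in-memory round trip is a cheap way to confirm that a dictionary of components survives pickling. The sketch below uses a hypothetical stand-in dictionary with plain lists instead of the real vectorizer, binarizer, and similarity matrix.

```python
import pickle

# Hypothetical stand-in for the real components (plain lists instead of arrays)
components = {
    'combined_similarity': [[1.0, 0.2], [0.2, 1.0]],
    'titles': ['Movie A', 'Movie B'],
}

# Serialize to bytes and back without touching the filesystem
restored = pickle.loads(pickle.dumps(components))

print(restored == components)  # True: same contents
print(restored is components)  # False: a separate copy
```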

In [11]:
import pickle

# Create a dictionary with all the components needed for recommendations
model_components = {
    'tfidf_vectorizer': tfidf,
    'genre_binarizer': mlb,
    'combined_similarity': combined_sim,
    'movies_data': df[['Series_Title', 'Genre', 'IMDB_Rating', 'clean_overview', 'genre_list']]
}

# Save the model components
with open('movie_recommender_model.pkl', 'wb') as f:
    pickle.dump(model_components, f)

print("Model saved successfully as 'movie_recommender_model.pkl'")


Model saved successfully as 'movie_recommender_model.pkl'


### Check that the saved model loads correctly

In [14]:
# First make sure we have the loaded model function
if 'recommend_from_loaded_model' not in globals():
    # Load the saved model
    with open('movie_recommender_model.pkl', 'rb') as f:
        loaded_model = pickle.load(f)
    
    # Define the recommendation function using the loaded model
    def recommend_from_loaded_model(title, k=10):
        movies_df = loaded_model['movies_data']
        combined_sim = loaded_model['combined_similarity']
        
        if title not in movies_df['Series_Title'].values:
            raise ValueError(f"'{title}' not found in dataset.")
        
        idx = movies_df.index[movies_df['Series_Title'] == title][0]
        sim_scores = list(enumerate(combined_sim[idx]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        # Exclude the query movie by position; with ties it is not guaranteed to sort first
        top_indices = [i for i, _ in sim_scores if i != idx][:k]
        
        recs = movies_df.iloc[top_indices][['Series_Title', 'Genre', 'IMDB_Rating']]
        return recs.reset_index(drop=True)

# Now compare the results
print("Original model recommendations:")
original_recommendations = recommend('The Dark Knight')
print(original_recommendations)

print("\nLoaded model recommendations:")
loaded_recommendations = recommend_from_loaded_model('The Dark Knight')
print(loaded_recommendations)

# Check if the recommendations are exactly the same
are_identical = original_recommendations.equals(loaded_recommendations)
print(f"\nAre the recommendations identical? {are_identical}")

if not are_identical:
    print("\nDifferences in recommendations:")
    # Compare the movie titles
    original_titles = set(original_recommendations['Series_Title'])
    loaded_titles = set(loaded_recommendations['Series_Title'])
    
    print("\nMovies only in original recommendations:")
    print(original_titles - loaded_titles)
    
    print("\nMovies only in loaded recommendations:")
    print(loaded_titles - original_titles)


Original model recommendations:
          Series_Title                   Genre  IMDB_Rating
0    Kill Bill: Vol. 1    Action, Crime, Drama          8.1
1        Batman Begins       Action, Adventure          8.2
2                Joker  Crime, Drama, Thriller          8.5
3  Dip huet seung hung    Action, Crime, Drama          7.8
4         The Fugitive    Action, Crime, Drama          7.8
5  Lucky Number Slevin    Action, Crime, Drama          7.7
6                 Léon    Action, Crime, Drama          8.5
7         Vikram Vedha    Action, Crime, Drama          8.4
8               Haider    Action, Crime, Drama          8.1
9          A Wednesday    Action, Crime, Drama          8.1

Loaded model recommendations:
          Series_Title                   Genre  IMDB_Rating
0    Kill Bill: Vol. 1    Action, Crime, Drama          8.1
1        Batman Begins       Action, Adventure          8.2
2                Joker  Crime, Drama, Thriller          8.5
3  Dip huet seung hung    Action, Cri

### Import model and test it

In [None]:
# Test the loaded model
import pickle

# Load the saved model
with open('movie_recommender_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Create a recommendation function using the loaded model
def recommend_from_loaded_model(title, k=10):
    movies_df = loaded_model['movies_data']
    combined_sim = loaded_model['combined_similarity']
    
    if title not in movies_df['Series_Title'].values:
        raise ValueError(f"'{title}' not found in dataset.")
    
    idx = movies_df.index[movies_df['Series_Title'] == title][0]
    sim_scores = list(enumerate(combined_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Exclude the query movie by position; with ties it is not guaranteed to sort first
    top_indices = [i for i, _ in sim_scores if i != idx][:k]
    
    recs = movies_df.iloc[top_indices][['Series_Title', 'Genre', 'IMDB_Rating']]
    return recs.reset_index(drop=True)

# Test with a movie (let's use "The Dark Knight" as an example)
print("Testing recommendations using the loaded model:")
print("\nRecommendations for 'The Dark Knight':")
recommendations = recommend_from_loaded_model('The Dark Knight')
print(recommendations)


In [12]:
# --- Run a comprehensive evaluation on a random sample of movies ---
num_samples = 1000  # 1000 covers every title in the dataset; lower it for a faster run
k = 10  # Number of recommendations per movie

results = evaluate_multiple_movies(df, recommend, num_samples=num_samples, k=k)

print("=== Aggregate Evaluation Metrics ===")
for key, value in results['aggregate_metrics'].items():
    print(f"{key}: {value}")

print("\n=== Individual Movie Evaluation Results ===")
for res in results['individual_results']:
    print(res)


=== Aggregate Evaluation Metrics ===
num_recommendations: 10.0
mean_genre_similarity: 0.7377533333333334
min_genre_similarity: 0.40848333333333336
max_genre_similarity: 0.9656666666666667
input_movie_rating: 7.9494
mean_rec_rating: 7.98477
std_rec_rating: 0.2625621493941213
min_rating: 7.639199999999999
max_rating: 8.479700000000001
rating_diff: 0.23320999999999997
unique_genres: 5.349
genre_entropy: 1.8859751678530532

=== Individual Movie Evaluation Results ===
{'movie_title': 'A Night at the Opera', 'num_recommendations': 10, 'mean_genre_similarity': 0.42833333333333323, 'min_genre_similarity': 0.2, 'max_genre_similarity': 1.0, 'input_movie_rating': 7.9, 'mean_rec_rating': 7.860000000000001, 'std_rec_rating': 0.2059126028197399, 'min_rating': 7.6, 'max_rating': 8.2, 'rating_diff': 0.03999999999999915, 'unique_genres': 5, 'genre_entropy': 2.1312824845975875, 'genre_distribution': {'Comedy': 8, 'Music': 5, 'Musical': 2, 'Drama': 5, 'Romance': 2}}
{'movie_title': 'Sherlock Jr.', 'num_r