# Movie Recommendation System Using Hybrid Similarity

**Course:** 02807 Computational Tools for Data Science  
**Institution:** Technical University of Denmark (DTU)  
**Semester:** Autumn 2024

---

## Project Summary

This project implements a hybrid movie recommendation system that combines multiple similarity dimensions. Using the Rotten Tomatoes dataset, we demonstrate:

1. **Course Topic 1 - Similar Items:** We compute pairwise similarity across movies using six distinct feature types, combined through weighted averaging.

2. **Course Topic 2 - Clustering:** K-Means clustering groups movies by genre characteristics, with evaluation using Davies-Bouldin index and silhouette scores.

3. **Outside Topic - Topic Modeling (LDA):** We use Latent Dirichlet Allocation to discover latent topics from critic reviews, enabling automatic genre label generation based on review content.

The system produces recommendations by computing a hybrid similarity score:

$$S_{hybrid} = \alpha \cdot S_{info} + \beta \cdot S_{rating} + \gamma \cdot S_{genre} + \delta \cdot S_{year} + \epsilon \cdot S_{style} + \zeta \cdot S_{type}$$

---

## Table of Contents

1. [Setup and Dependencies](#1-setup-and-dependencies)
2. [Data Loading and Preprocessing](#2-data-loading-and-preprocessing)
3. [Feature Engineering and Vectorization](#3-feature-engineering-and-vectorization)
4. [Topic Modeling - LDA (Outside Topic)](#4-topic-modeling---lda-outside-topic)
5. [Similarity Matrix Construction](#5-similarity-matrix-construction)
6. [Hybrid Scoring System](#6-hybrid-scoring-system)
7. [Clustering Analysis (Course Topic 2)](#7-clustering-analysis-course-topic-2)
8. [Final Recommendation System](#8-final-recommendation-system)
9. [Results and Evaluation](#9-results-and-evaluation)
10. [Conclusion](#10-conclusion)

---

## 1. Setup and Dependencies

This section imports all required libraries for the project. Key dependencies include:
- **sentence-transformers** for creating text embeddings
- **scikit-learn** for similarity computation, clustering, and evaluation metrics
- **gensim** for LDA topic modeling
- **nltk** for text preprocessing (stopwords, lemmatization)

The global configuration defines parameters used throughout the project, including genre categories, content rating types, minimum review threshold, and number of LDA topics.

In [1]:
# Install dependencies (uncomment if running for the first time)
# !pip install sentence-transformers pandas numpy scikit-learn matplotlib seaborn gensim nltk

In [2]:
# Core imports
import numpy as np
import pandas as pd
import os
import pickle
import re
from collections import defaultdict

# Machine Learning
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Sentence Transformers (for embeddings)
from sentence_transformers import SentenceTransformer

# Topic Modeling (Outside Topic)
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.utils import simple_preprocess
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
%matplotlib inline

# Download NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

print("All imports successful")

  from .autonotebook import tqdm as notebook_tqdm


All imports successful


### Global Configuration

In [3]:
# All possible genres in the dataset
ALL_GENRES = [
    'science fiction & fantasy', 'drama', 'western', 'comedy', 'classics',
    'action & adventure', 'kids & family', 'musical & performing arts',
    'documentary', 'art house & international', 'horror', 'sports & fitness',
    'faith & spirituality', 'mystery & suspense', 'animation', 'special interest', 'romance'
]

ALL_AGE_RATINGS = ['pg', 'r', 'g', 'pg-13', 'nc17']
MIN_REVIEWS = 5
NUM_TOPICS = 15

print(f"Tracking {len(ALL_GENRES)} genres and {len(ALL_AGE_RATINGS)} content ratings")
print(f"LDA will discover {NUM_TOPICS} latent topics")

Tracking 17 genres and 5 content ratings
LDA will discover 15 latent topics


---

## 2. Data Loading and Preprocessing

This section loads the Rotten Tomatoes dataset and groups critic reviews by movie. Each movie must have at least 5 reviews to be included in the analysis. The grouped data structure contains movie metadata (title, year, genres, content rating) along with all associated critic reviews and review types (fresh/rotten).

In [4]:
def get_dataset() -> pd.DataFrame:
    """Load the final dataset"""
    dataset = '../datasets/final_dataset.csv'
    final_dataset = pd.read_csv(dataset)
    return final_dataset

def group_movies(final_dataset, min_reviews=MIN_REVIEWS):
    """Group reviews by movie"""
    grouped = final_dataset.groupby('rotten_tomatoes_link')
    movie_data = []

    for movie_id, group in grouped:
        reviews = group['review_content'].tolist()
        critic_names = group['critic_name'].to_list()
        review_types = group['review_type'].to_list()
        first_row = group.iloc[0]

        movie_data.append({
            'movie_id': movie_id,
            'movie_title': first_row['movie_title'],
            'content_rating': first_row['content_rating'],
            'genres': first_row['genres'],
            'year': int(first_row['original_release_date']) if str(first_row['original_release_date']).isdigit() else 0,
            'movie_info': first_row['movie_info'],
            'reviews': reviews,
            'critic_names': critic_names,
            'review_types': review_types,
            'combined_review_text': ' '.join(reviews)
        })
    
    movie_data_filtered = [m for m in movie_data if len(m['reviews']) >= min_reviews]
    print(f"Loaded {len(movie_data_filtered)} movies with >= {min_reviews} reviews")
    return movie_data_filtered

In [5]:
# Load and process dataset
final_dataset = get_dataset()
movie_data = group_movies(final_dataset)

Loaded 48 movies with >= 5 reviews


---

## 3. Feature Engineering and Vectorization

This section transforms movie data into numerical representations that can be compared mathematically. We use the sentence-transformers library to create 384-dimensional embeddings for movie descriptions and critic reviews. Additional features include binary encodings for genres and content ratings, normalized release years, and fresh/rotten review patterns. The vectorization process takes 10-20 minutes for the full dataset, so results are cached for faster subsequent runs.

In [6]:
def vectorize(movie_data: list, model) -> list[dict]:
    """Create embeddings for content and review style"""
    
    def encode_genres(movie_genres: list, all_genres: list):
        return [1 if genre in movie_genres else 0 for genre in all_genres]
    
    def encode_content_rating(content_rating, all_content_ratings):
        return [1 if cr == content_rating else 0 for cr in all_content_ratings]
    
    def encode_review_type(review_types):
        return [1 if rt == "fresh" else 0 for rt in review_types]
    
    years = [movie['year'] for movie in movie_data if movie['year'] > 0]
    min_year = min(years)
    max_year = max(years)

    total_movies = len(movie_data)

    for i, movie in enumerate(movie_data, start=1):
        if i % 500 == 0 or i == total_movies:
            print(f"Processing movie {i}/{total_movies} ({i/total_movies*100:.1f}%)")

        movie['movie_info_embeddings'] = model.encode(movie['movie_info'])
        movie['content_rating_norm'] = encode_content_rating(movie['content_rating'], ALL_AGE_RATINGS)
        movie['genre_vector'] = encode_genres(movie['genres'], ALL_GENRES)
        movie['year_norm'] = (movie['year'] - min_year) / (max_year - min_year)
        
        review_embeddings = model.encode(movie['reviews'])
        movie['avg_review_embeddings'] = review_embeddings.mean(axis=0)
        movie['review_types_norm'] = encode_review_type(movie['review_types'])
    
    return movie_data

In [7]:
# Load or create vectorized data
VECTORIZED_PATH = '../cache/vectorized_movie_data.pkl'

if os.path.exists(VECTORIZED_PATH):
    print(f"Loading from {VECTORIZED_PATH}...")
    with open(VECTORIZED_PATH, 'rb') as f:
        vectorized_data = pickle.load(f)
    
    # Ensure combined_review_text exists (for backward compatibility)
    for movie in vectorized_data:
        if 'combined_review_text' not in movie:
            movie['combined_review_text'] = ' '.join(movie['reviews'])
else:
    print("Vectorizing data...")
    st_model = SentenceTransformer('all-MiniLM-L6-v2')
    vectorized_data = vectorize(movie_data, st_model)
    with open(VECTORIZED_PATH, 'wb') as f:
        pickle.dump(vectorized_data, f)
        
print(f"Loaded {len(vectorized_data)} movies")

Loading from ../cache/vectorized_movie_data.pkl...
Loaded 48 movies


---

## 4. Topic Modeling - LDA (Outside Topic)

### 4.1 Introduction to LDA

**Latent Dirichlet Allocation (LDA)** is a generative probabilistic model that discovers latent topics in a collection of documents. Each document is represented as a mixture of topics, and each topic is a distribution over words.

**Why LDA for movie recommendations:**
- Discovers hidden thematic patterns in critic reviews
- Generates interpretable genre-like labels automatically
- Captures nuanced movie characteristics beyond predefined genre categories

**Implementation details:**
- Text preprocessing removes stopwords and applies lemmatization
- We train an LDA model with 15 topics on the combined movie descriptions and critic reviews
- Each movie receives a topic distribution vector, indicating the probability of belonging to each discovered topic
- Topic labels are manually mapped to interpretable genre-like categories based on the most prominent words in each topic
- The top 3 topics (with >5% probability) become the movie's generated genre labels

In [8]:
# Setup text preprocessing
stop_words = set(stopwords.words('english'))
stop_words.update({"movie", "film", "films", "story", "character", "characters"})
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r"[^a-zA-Z ]", " ", str(text))
    return text.lower()

def preprocess_for_lda(text):
    tokens = simple_preprocess(text, deacc=True)
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and len(t) > 2]
    return tokens

In [9]:
# Build or load LDA model
LDA_MODEL_PATH = '../cache/lda_model.pkl'
LDA_DICT_PATH = '../cache/lda_dictionary.pkl'

if os.path.exists(LDA_MODEL_PATH) and os.path.exists(LDA_DICT_PATH):
    print("Loading pre-trained LDA model...")
    with open(LDA_MODEL_PATH, 'rb') as f:
        lda_model = pickle.load(f)
    with open(LDA_DICT_PATH, 'rb') as f:
        dictionary = pickle.load(f)
else:
    print("Training LDA model...")
    texts_for_lda = []
    for movie in vectorized_data:
        combined = movie['movie_title'] + ' ' + movie['movie_info'] + ' ' + movie['combined_review_text']
        tokens = preprocess_for_lda(clean_text(combined))
        texts_for_lda.append(tokens)
        movie['tokens'] = tokens
    
    dictionary = Dictionary(texts_for_lda)
    dictionary.filter_extremes(no_below=15, no_above=0.5)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts_for_lda]
    
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=NUM_TOPICS,
                         random_state=42, passes=10, alpha='auto', eta='auto')
    
    with open(LDA_MODEL_PATH, 'wb') as f:
        pickle.dump(lda_model, f)
    with open(LDA_DICT_PATH, 'wb') as f:
        pickle.dump(dictionary, f)
    print("LDA model trained and saved")

Loading pre-trained LDA model...


In [10]:
# Display discovered topics
print("Discovered Topics:")
print("=" * 80)
for idx, topic in lda_model.print_topics(num_topics=NUM_TOPICS, num_words=8):
    words = re.findall(r'"([^"]+)"', topic)
    print(f"Topic {idx:2d}: {', '.join(words)}")

Discovered Topics:
Topic  0: way, enough, keep, genre, kind, something, another, thing
Topic  1: original, real, better, less, best, also, right, despite
Topic  2: dialogue, action, idea, genre, filmmaker, human, despite, enough
Topic  3: full, actor, great, never, feel, man, scene, could
Topic  4: full, feel, drama, great, love, also, man, enough
Topic  5: action, thriller, genre, original, movie, great, fun, never
Topic  6: american, great, every, movie, see, less, year, best
Topic  7: sense, rather, year, great, tale, lack, movie, old
Topic  8: comedy, love, give, way, turn, see, best, get
Topic  9: original, thing, better, best, get, feel, bad, year
Topic 10: full, man, best, scene, drama, feel, great, year
Topic 11: show, fun, old, kind, original, try, could, watching
Topic 12: thriller, genre, best, way, drama, get, better, know
Topic 13: bad, comedy, audience, hard, thing, way, minute, nothing
Topic 14: people, see, experience, get, way, enough, filmmaker, know


In [11]:
# Topic label mapping - customize based on your discovered topics
TOPIC_LABELS = {
    0: "Action/Thriller", 1: "Drama/Character", 2: "Comedy/Light", 3: "Horror/Suspense",
    4: "Family/Animation", 5: "Documentary", 6: "Sci-Fi/Fantasy", 7: "Romance",
    8: "Crime/Mystery", 9: "War/Historical", 10: "Musical/Arts", 11: "Sports",
    12: "International", 13: "Adventure/Epic", 14: "Classic/Timeless"
}

def get_topic_vector(tokens, dictionary, lda_model, num_topics):
    bow = dictionary.doc2bow(tokens)
    topic_dist = lda_model.get_document_topics(bow, minimum_probability=0)
    vec = np.zeros(num_topics)
    for topic_id, prob in topic_dist:
        vec[topic_id] = prob
    return vec

def get_generated_genres(topic_vector, top_n=3):
    top_indices = np.argsort(topic_vector)[::-1][:top_n]
    genres = []
    for idx in top_indices:
        if topic_vector[idx] > 0.05:
            label = TOPIC_LABELS.get(idx, f"Topic_{idx}")
            pct = int(topic_vector[idx] * 100)
            genres.append(f"{label} ({pct}%)")
    return genres

In [12]:
# Compute topic vectors for all movies
print("Computing topic vectors...")
for movie in vectorized_data:
    if 'tokens' not in movie:
        combined = movie['movie_title'] + ' ' + movie['movie_info'] + ' ' + movie['combined_review_text']
        movie['tokens'] = preprocess_for_lda(clean_text(combined))
    movie['topic_vector'] = get_topic_vector(movie['tokens'], dictionary, lda_model, NUM_TOPICS)
    movie['generated_genres'] = get_generated_genres(movie['topic_vector'])

print(f"\nExample: {vectorized_data[0]['movie_title']}")
print(f"  Generated genres: {vectorized_data[0]['generated_genres']}")

Computing topic vectors...

Example: aliens
  Generated genres: ['Documentary (92%)', 'International (7%)']


---

## 5. Similarity Matrix Construction

This section builds six separate similarity matrices, each capturing a different dimension of movie similarity:

1. **Info Similarity** - Semantic similarity of movie descriptions using cosine similarity on embeddings
2. **Content Rating Similarity** - Matches based on age appropriateness (G, PG, PG-13, R, NC-17)
3. **Genre Similarity** - Overlap in genre categories (drama, comedy, action, etc.)
4. **Year Similarity** - Temporal proximity of release dates
5. **Review Style Similarity** - How critics write about movies (embeddings of review text)
6. **Review Type Similarity** - Correlation of fresh/rotten patterns across critics

Each matrix is NÃ—N where N is the number of movies, with values ranging from 0 (dissimilar) to 1 (identical).

In [13]:
def build_info_sim(vectorized_movie_data) -> np.ndarray:
    info_vectors = np.vstack([movie['movie_info_embeddings'] for movie in vectorized_movie_data])
    info_sim = cosine_similarity(info_vectors).astype('float32')
    print(f"Info similarity matrix: {info_sim.shape}")
    return info_sim

def build_content_rating_sim(vectorized_movie_data) -> np.ndarray:
    content_rating_vectors = np.vstack([movie['content_rating_norm'] for movie in vectorized_movie_data])
    content_rating_sim = cosine_similarity(content_rating_vectors).astype('float32')
    print(f"Content rating similarity matrix: {content_rating_sim.shape}")
    return content_rating_sim

def build_genre_sim(vectorized_movie_data) -> np.ndarray:
    genre_vectors = np.vstack([movie['genre_vector'] for movie in vectorized_movie_data])
    genre_sim = cosine_similarity(genre_vectors).astype('float32')
    print(f"Genre similarity matrix: {genre_sim.shape}")
    return genre_sim

def build_year_sim(vectorized_movie_data) -> np.ndarray:
    year_vectors = np.vstack([movie['year_norm'] for movie in vectorized_movie_data])
    year_sim = cosine_similarity(year_vectors).astype('float32')
    print(f"Year similarity matrix: {year_sim.shape}")
    return year_sim

def build_review_style_sim(vectorized_movie_data) -> np.ndarray:
    review_embeddings = np.vstack([movie['avg_review_embeddings'] for movie in vectorized_movie_data])
    review_sim = cosine_similarity(review_embeddings).astype('float32')
    print(f"Review similarity matrix: {review_sim.shape}")
    return review_sim

def build_review_type_sim(vectorized_movie_data) -> np.ndarray:
    all_critics = set()
    for movie in vectorized_movie_data:
        all_critics.update(movie['critic_names'])
    all_critics = sorted(all_critics)

    all_movies = [movie['movie_title'] for movie in vectorized_movie_data]
    matrix = pd.DataFrame(np.nan, index=all_movies, columns=all_critics)

    for movie in vectorized_movie_data:
        mit = movie['movie_title']
        for critic, review_type in zip(movie['critic_names'], movie['review_types_norm']):
            matrix.loc[mit, critic] = review_type

    type_sim = matrix.T.corr(method='pearson')
    type_sim = (type_sim + 1) / 2
    type_sim_array = type_sim.values
    type_sim = np.nan_to_num(type_sim_array, nan=0.5)

    print(f"Type similarity matrix: {type_sim.shape}")
    return type_sim

def build_sim_matrices(vectorized_data):
    info_sim = build_info_sim(vectorized_data)
    content_rating_sim = build_content_rating_sim(vectorized_data)
    genre_sim = build_genre_sim(vectorized_data)
    year_sim = build_year_sim(vectorized_data)
    style_sim = build_review_style_sim(vectorized_data)
    type_sim = build_review_type_sim(vectorized_data)
    return info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim

In [14]:
SIMILARITY_PATH = '../cache/similarity_matrices.pkl'

if os.path.exists(SIMILARITY_PATH):
    print(f"Loading similarity matrices...")
    with open(SIMILARITY_PATH, 'rb') as f:
        matrices = pickle.load(f)
    info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim = (
        matrices['info'], matrices['content_rating'], matrices['genre'],
        matrices['year'], matrices['style'], matrices['type']
    )
else:
    print("Computing similarity matrices...")
    info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim = build_sim_matrices(vectorized_data)
    with open(SIMILARITY_PATH, 'wb') as f:
        pickle.dump({'info': info_sim, 'content_rating': content_rating_sim, 'genre': genre_sim,
                     'year': year_sim, 'style': style_sim, 'type': type_sim}, f)
print("Matrices ready")

Loading similarity matrices...
Matrices ready


---

## 6. Hybrid Scoring System

The hybrid scoring system combines all six similarity matrices using weighted averaging. Each component receives a weight (α , β , γ , δ , ε, ζ) that determines its contribution to the final similarity score. The current configuration uses equal weights for the main features:

- α  = 0.2 (Info embeddings - semantic content)
- β  = 0.2 (Content rating)
- γ  = 0.2 (Genre)
- δ  = 0.2 (Release year)
- ε = 0.1 (Review style)
- ζ = 0.1 (Review type patterns)

This balanced approach gives equal importance to the four primary dimensions (info, rating, genre, year) while treating the review-based features as secondary signals. These weights can be adjusted to emphasize different aspects of similarity depending on the use case.

In [15]:
def hybrid_score(info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim,
                 alpha=0.2, beta=0.2, gamma=0.2, delta=0.2, epsilon=0.1, zeta=0.1):
    return (alpha*info_sim + beta*content_rating_sim + gamma*genre_sim + 
            delta*year_sim + epsilon*style_sim + zeta*type_sim)

hybrid_sim = hybrid_score(info_sim, content_rating_sim, genre_sim, year_sim, style_sim, type_sim)
title_to_idx = {m['movie_title']: i for i, m in enumerate(vectorized_data)}
print(f"Hybrid similarity matrix created for {len(title_to_idx)} movies")

Hybrid similarity matrix created for 48 movies


---

## 7. Clustering Analysis (Course Topic 2)

K-Means clustering groups movies based on their genre characteristics. We use standardized genre vectors as input features and evaluate different values of k (number of clusters) using two metrics:

- **Silhouette Score** - Measures how well-separated clusters are (higher is better)
- **Davies-Bouldin Index** - Measures cluster compactness and separation (lower is better)

The optimal k is selected based on these metrics, and each movie is assigned to a cluster. Movies in the same cluster share similar genre profiles, providing an alternative grouping strategy to complement the similarity-based recommendations.

In [16]:
genre_matrix = np.vstack([m['genre_vector'] for m in vectorized_data])
scaler = StandardScaler()
genre_scaled = scaler.fit_transform(genre_matrix)

# Find optimal k
k_range = range(2, 16)
silhouettes = [silhouette_score(genre_scaled, KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(genre_scaled)) for k in k_range]
db_scores = [davies_bouldin_score(genre_scaled, KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(genre_scaled)) for k in k_range]

print(f"Optimal k by Silhouette: {list(k_range)[np.argmax(silhouettes)]}")
print(f"Optimal k by Davies-Bouldin: {list(k_range)[np.argmin(db_scores)]}")

Optimal k by Silhouette: 3
Optimal k by Davies-Bouldin: 15


In [17]:
# Final clustering
OPTIMAL_K = 8
kmeans = KMeans(n_clusters=OPTIMAL_K, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(genre_scaled)

for i, m in enumerate(vectorized_data):
    m['cluster'] = cluster_labels[i]

print(f"\nCluster distribution:")
for c, cnt in zip(*np.unique(cluster_labels, return_counts=True)):
    print(f"  Cluster {c}: {cnt} movies")


Cluster distribution:
  Cluster 0: 23 movies
  Cluster 1: 1 movies
  Cluster 2: 7 movies
  Cluster 3: 3 movies
  Cluster 4: 2 movies
  Cluster 5: 2 movies
  Cluster 6: 6 movies
  Cluster 7: 4 movies


---

## 8. Final Recommendation System

The final recommendation system integrates all three components:

1. **Similar Items** - Returns the top k most similar movies based on hybrid similarity scores
2. **Generated Genres** - Displays LDA-discovered topic labels for each recommendation
3. **Cluster Members** - Shows other movies from the same K-Means cluster

For each query movie, the system displays:
- Original metadata (year, content rating, genres)
- LDA-generated genre labels based on review content
- Top k similar movies with similarity scores and their generated genres
- Other movies sharing the same cluster assignment

This multi-faceted approach provides comprehensive recommendations from different analytical perspectives.

In [18]:
def get_similar_movies(query_title, vectorized_data, hybrid_sim, title_to_idx, k=10):
    if query_title not in title_to_idx:
        suggestions = [t for t in title_to_idx.keys() if query_title in t][:5]
        print(f"Movie '{query_title}' not found. Suggestions: {suggestions}")
        return []
    
    q_idx = title_to_idx[query_title]
    scores = hybrid_sim[q_idx]
    sorted_idx = np.argsort(scores)[::-1]
    
    recs = [(vectorized_data[i], scores[i]) for i in sorted_idx if i != q_idx][:k]
    return recs

def get_cluster_members(query_title, vectorized_data, title_to_idx, max_n=5):
    if query_title not in title_to_idx:
        return []
    q_cluster = vectorized_data[title_to_idx[query_title]]['cluster']
    return [m for m in vectorized_data if m['cluster'] == q_cluster and m['movie_title'] != query_title][:max_n]

In [None]:
def display_recommendations(query_title, vectorized_data, hybrid_sim, title_to_idx, k=10):
    """
    Display comprehensive recommendations:
    - Similar movies with scores and generated genres
    - Movies from the same cluster
    """
    if query_title not in title_to_idx:
        print(f"Movie '{query_title}' not found")
        return
    
    query = vectorized_data[title_to_idx[query_title]]
    
    # Header
    print("\n" + "=" * 95)
    print(f"  RECOMMENDATIONS FOR: {query_title.upper()}")
    print("=" * 95)
    print(f"  Year: {query['year']}  |  Rating: {query['content_rating']}  |  Cluster: {query['cluster']}")
    print(f"  Original Genres: {query['genres']}")
    print(f"  Generated Genres (LDA): {', '.join(query.get('generated_genres', ['N/A']))}")
    print("=" * 95)
    
    # Similar Movies
    similar = get_similar_movies(query_title, vectorized_data, hybrid_sim, title_to_idx, k)
    
    print(f"\n  TOP {k} SIMILAR MOVIES")
    print("  " + "-" * 91)
    print(f"  {'RANK':<5}{'TITLE':<40}{'SCORE':<8}{'YEAR':<6}{'GENERATED GENRES (LDA)'}")
    print("  " + "-" * 91)
    
    for i, (m, score) in enumerate(similar, 1):
        title = m['movie_title'][:37] + "..." if len(m['movie_title']) > 40 else m['movie_title']
        genres = ', '.join(m.get('generated_genres', ['N/A'])[:2])
        print(f"  {i:<5}{title:<40}{score:<8.4f}{m['year']:<6}{genres}")
    
    # Cluster Members
    cluster_members = get_cluster_members(query_title, vectorized_data, title_to_idx, 5)
    
    print(f"\n  OTHER MOVIES IN CLUSTER {query['cluster']}")
    print("  " + "-" * 91)
    
    if cluster_members:
        for m in cluster_members:
            title = m['movie_title'][:45] + "..." if len(m['movie_title']) > 48 else m['movie_title']
            print(f"  • {title:<50} ({m['year']}) - {m['genres'][:35]}")
    else:
        print("  No other movies in this cluster.")
    
    print("\n" + "=" * 95)

In [20]:
# Test the recommendation system
display_recommendations('aliens', vectorized_data, hybrid_sim, title_to_idx, k=10)


  RECOMMENDATIONS FOR: ALIENS
  Year: 1986  |  Rating: r  |  Cluster: 2
  Original Genres: action & adventure, horror, science fiction & fantasy
  Generated Genres (LDA): Documentary (92%), International (7%)

  TOP 10 SIMILAR MOVIES
  -------------------------------------------------------------------------------------------
  RANK TITLE                                   SCORE   YEAR  GENERATED GENRES (LDA)
  -------------------------------------------------------------------------------------------
  1    escape from new york                    0.7539  1981  International (69%), Romance (30%)
  2    appurushîdo (appleseed)                 0.6883  2004  Comedy/Light (98%)
  3    the amityville horror                   0.6862  2005  Adventure/Epic (44%), Romance (36%)
  4    razorback                               0.6484  1984  Documentary (98%)
  5    kalifornia                              0.6412  1993  International (68%), Crime/Mystery (29%)
  6    hacksaw ridge                   

In [21]:
display_recommendations('toy story', vectorized_data, hybrid_sim, title_to_idx, k=10)


  RECOMMENDATIONS FOR: TOY STORY
  Year: 1995  |  Rating: g  |  Cluster: 3
  Original Genres: animation, comedy, kids & family
  Generated Genres (LDA): Romance (72%), Family/Animation (14%), Crime/Mystery (12%)

  TOP 10 SIMILAR MOVIES
  -------------------------------------------------------------------------------------------
  RANK TITLE                                   SCORE   YEAR  GENERATED GENRES (LDA)
  -------------------------------------------------------------------------------------------
  1    frosty the snowman                      0.7284  1969  Sports (97%)
  2    pokémon the first movie - mewtwo vs. mew0.6637  1999  Adventure/Epic (69%), Family/Animation (29%)
  3    grown ups 2                             0.4952  2013  Adventure/Epic (77%), Romance (22%)
  4    teenage mutant ninja turtles ii - the...0.4928  1991  Drama/Character (98%)
  5    boat trip                               0.4803  2003  Adventure/Epic (40%), Crime/Mystery (33%)
  6    lupin iii: the castl

In [22]:
display_recommendations('the godfather', vectorized_data, hybrid_sim, title_to_idx, k=10)


  RECOMMENDATIONS FOR: THE GODFATHER
  Year: 1972  |  Rating: r  |  Cluster: 0
  Original Genres: drama
  Generated Genres (LDA): Sci-Fi/Fantasy (99%)

  TOP 10 SIMILAR MOVIES
  -------------------------------------------------------------------------------------------
  RANK TITLE                                   SCORE   YEAR  GENERATED GENRES (LDA)
  -------------------------------------------------------------------------------------------
  1    black irish                             0.7995  2007  Sports (97%)
  2    the liberator                           0.7549  2014  Family/Animation (82%), Comedy/Light (9%)
  3    1911                                    0.7251  2011  Family/Animation (52%), Romance (45%)
  4    kalifornia                              0.7232  1993  International (68%), Crime/Mystery (29%)
  5    sex and lucia                           0.7191  2002  Family/Animation (99%)
  6    sleeping with the enemy                 0.7166  1991  International (99%)
  7    d

---

## 9. Results and Evaluation

This section provides a summary of the complete system, including dataset statistics, the similarity components used, clustering configuration, and topic modeling setup. The evaluation is primarily qualitative, examining whether recommendations align with expected movie relationships. Quantitative metrics like silhouette scores and Davies-Bouldin index are used to validate the clustering component.

In [23]:
print("\n" + "=" * 60)
print("PROJECT SUMMARY")
print("=" * 60)
print(f"\nDataset: {len(vectorized_data):,} movies, {sum(len(m['reviews']) for m in vectorized_data):,} reviews")
print(f"\nSimilarity Components: Info embeddings, Content rating, Genre, Year, Review style, Review type")
print(f"\nClustering: K-Means with k={OPTIMAL_K}")
print(f"\nOutside Topic: LDA with {NUM_TOPICS} topics for automatic genre generation")
print("=" * 60)


PROJECT SUMMARY

Dataset: 48 movies, 2,876 reviews

Similarity Components: Info embeddings, Content rating, Genre, Year, Review style, Review type

Clustering: K-Means with k=8

Outside Topic: LDA with 15 topics for automatic genre generation
