In [None]:
!pip install gensim
!pip install nltk

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast
from sklearn.preprocessing import StandardScaler
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt_tab')
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
import hdbscan
import math
from joblib import load
import warnings
import gensim
from sklearn.metrics.pairwise import euclidean_distances
import os

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Movie Clustering Models: A Simple Performance Report

We explored three different clustering models – K-Means, HDBSCAN, and Gaussian Mixture Model (GMM) – to group movies based on their features. We tested each model's ability to classify three distinct test movies and looked at their statistical profiles.

### Quantitative Model Comparison (Evaluation Metrics)

To objectively assess the models, we used the following metrics (higher is better for Silhouette and Calinski-Harabasz; lower is better for Davies-Bouldin):

| Model   | N_Clusters | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|:--------|:----------:|:----------:|:--------------:|:-----------------:|
| HDBSCAN |     8      |   0.5223   |     0.4314     |      16750.66     |
| KMeans  |     5      |   0.5006   |     0.6687     |      37058.03     |
| GMM     |     5      |   0.4328   |     0.7897     |      27865.44     |

*   **Note on N_Clusters:** For K-Means and GMM, the number of clusters (N_Clusters = 5) was determined using the Elbow Method graph, which suggested 5 as an optimal number of clusters. HDBSCAN, on the other hand, automatically determined 8 clusters based on its density-based approach.
*   **HDBSCAN** showed the best Silhouette score (indicating well-separated clusters) and Davies-Bouldin score (indicating good separation between clusters and compactness within them).
*   **K-Means** had the highest Calinski-Harabasz score (indicating dense, well-separated clusters), but slightly lower Silhouette compared to HDBSCAN.
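*   **GMM** scored lowest on Silhouette (0.4328) and worst on Davies-Bouldin (0.7897); its Calinski-Harabasz score (27865.44) fell between those of K-Means and HDBSCAN.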
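
To make the Silhouette column more concrete, here is a minimal from-scratch sketch of the coefficient on a toy dataset. The notebook itself imports `silhouette_score` from scikit-learn; the function name `silhouette` and the toy points below are illustrative assumptions, not part of the original analysis.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient for a list of points and integer labels."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            # Singleton cluster or only one cluster present: score set to 0 here.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two tight pairs far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(toy_points, [0, 0, 1, 1])  # labels follow the real grouping
bad = silhouette(toy_points, [0, 1, 0, 1])   # labels cut across the grouping
print(round(good, 3), round(bad, 3))
```

A labelling that follows the natural grouping scores close to 1, while one that cuts across it goes negative, which is why a higher Silhouette is read as better separation.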

### 1. K-Means: The Balanced & Intuitive Choice

*   **How it Works (Simply):** K-Means tries to find a fixed number of 'center points' for its clusters and then puts each movie into the group whose center it's closest to. It creates pretty straightforward, distinct groups.
*   **Performance on Test Movies:**
    *   **'Problematic Low-Quality Commercial Film':** Predicted as **Cluster 0: 'Profit-Driven Comedy/Action Mix'**. This was a good fit, recognizing its commercial nature. The numerical profile for Cluster 0 shows a mean vote average of -0.14 (Z-score), profit ratio of 0.02 (Z-score), and popularity of -0.20 (Z-score), aligning with a commercial film that might have average to slightly below-average quality but aims for profit.
    *   **'Extreme Horror Outlier':** Predicted as **Cluster 1: 'Recent Low-Performing Dramas'**. This was a reasonable placement for a low-budget, low-rated film, suggesting it's generally struggling in performance. Cluster 1 has a low mean vote average of -1.20 (Z-score) and low popularity of -0.40 (Z-score).
    *   **'Epic Space Opera Blockbuster':** Predicted as **Cluster 0: 'Profit-Driven Comedy/Action Mix'** (with Cluster 3: 'High-Quality Blockbuster Hits' as a very close second). This shows K-Means identified its commercial appeal, even if it didn't land squarely in the 'blockbuster' group, indicating its strong connection to profit-driven movies.
*   **Overall:** K-Means gave the most *understandable and reasonable* classifications that often aligned with what a human would expect. It's a good all-rounder for clear groupings.
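
The "closest center" rule, and the idea of a "very close second" cluster, can be sketched as follows. This is an illustration under stated assumptions: the 2-D centroids and the helper `rank_centroids` are hypothetical, whereas the notebook works with centroids in the UMAP-reduced feature space.

```python
import math

def rank_centroids(point, centroids):
    """Return (cluster_id, distance) pairs sorted from nearest to farthest."""
    distances = [(cid, math.dist(point, c)) for cid, c in enumerate(centroids)]
    return sorted(distances, key=lambda pair: pair[1])

# Hypothetical centroids for three clusters.
toy_centroids = [(0.0, 0.0), (5.0, 0.0), (0.0, 8.0)]
ranking = rank_centroids((1.0, 0.0), toy_centroids)

(best_id, best_dist), (second_id, second_dist) = ranking[0], ranking[1]
# Relative gap between the runner-up and the winner, in percent.
gap_pct = (second_dist - best_dist) / best_dist * 100
print(best_id, second_id, gap_pct)
```

A small gap means the movie sits near the boundary between two clusters, as with the space opera above; a large gap means the assignment is unambiguous.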

### 2. HDBSCAN: The Outlier Specialist with Surprises

*   **How it Works (Simply):** HDBSCAN looks for dense 'neighborhoods' of movies to form clusters of different shapes and sizes. It's smart enough to leave movies that don't fit anywhere as 'noise' or 'outliers'.
*   **Performance on Test Movies:**
    *   **'Problematic Low-Quality Commercial Film':** Predicted as **Cluster 1: 'Extremely Poor Quality Outliers'**. This classification was very harsh and didn't capture its commercial aspect well. Cluster 1 has a mean vote average of 0.23 (Z-score), profit ratio of 0.05 (Z-score), and popularity of -0.30 (Z-score), which is surprisingly not 'extremely poor' by Z-score but might represent sparse, unusual films.
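    *   **'Extreme Horror Outlier':** Predicted as **Cluster 0**, reported under the name **'Noise/Unassigned Outliers'**. We read this as a fitting result for a unique or difficult-to-categorize film, with one caveat: HDBSCAN's own noise label is -1, and the naming cell describes Cluster 0 as 'Modern Low-Popularity Dramas', so the reported name may not reflect the cluster's actual content. Cluster 0 has a mean vote average of -0.29 (Z-score) and popularity of -0.43 (Z-score).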
    *   **'Epic Space Opera Blockbuster':** Predicted as **Cluster 1: 'Extremely Poor Quality Outliers'**. This was a significant misclassification, as a blockbuster would intuitively not belong in an 'extremely poor quality' group, even if the Z-scores might show unique patterns.
*   **Overall:** HDBSCAN is great for spotting truly odd movies (outliers) and flexible clusters. However, its classifications for our high-profile test films were often counter-intuitive and didn't align with common understanding of movie quality or success.
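
The notion of a point that "doesn't fit anywhere" can be illustrated with a much simpler density test. This is not HDBSCAN: it is a DBSCAN-style neighbor count with a fixed radius, used here only as a stand-in, and the function name, radius, and threshold are assumptions chosen for the toy data.

```python
import math

def flag_sparse_points(points, radius, min_neighbors):
    """Return True for each point with fewer than min_neighbors others within radius."""
    flags = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        flags.append(neighbors < min_neighbors)
    return flags

# Hypothetical data: three points in a dense neighborhood and one far away.
toy_movies = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (9.0, 9.0)]
noise_flags = flag_sparse_points(toy_movies, radius=1.0, min_neighbors=2)
print(noise_flags)
```

HDBSCAN replaces the fixed radius with a hierarchy over varying densities, and the `hdbscan` library reports unassigned points with the label -1.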

### 3. GMM (Gaussian Mixture Model): The Flexible Probabilistic Grouper

*   **How it Works (Simply):** GMM imagines that each cluster is a blurry, 'bell-shaped' cloud. It figures out where these clouds should be and how big they are to best cover all the movies. A movie can belong a little bit to several clouds, but mostly to one.
*   **Performance on Test Movies:**
    *   **'Problematic Low-Quality Commercial Film':** Predicted as **Cluster 0: 'High-Profit Classic Cinema'**. This was a mismatch; the film is neither 'classic' nor necessarily 'high-profit' in a broad sense. Cluster 0 has a mean vote average of 0.31 (Z-score), profit ratio of 0.11 (Z-score), but a very low release year of -1.40 (Z-score).
    *   **'Extreme Horror Outlier':** Predicted as **Cluster 4: 'Critically Panned/Profitable Niche'**. This was a more nuanced and reasonable categorization, recognizing its low quality but potential for niche profit. Cluster 4 has a mean vote average of -0.17 (Z-score) and popularity of -0.53 (Z-score).
    *   **'Epic Space Opera Blockbuster':** Predicted as **Cluster 0: 'High-Profit Classic Cinema'**. Another mismatch for a modern blockbuster, struggling to place it accurately.
*   **Overall:** GMM showed some flexibility and nuance but also produced classifications that were often not what we would expect, especially when dealing with the 'Classic Cinema' profile for modern movies.
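
The "belongs a little bit to several clouds" idea corresponds to the posterior membership probabilities of a mixture model. Below is a minimal one-dimensional sketch; the component weights, means, and standard deviations are made-up values, and the helper names are hypothetical. In scikit-learn the analogous quantity comes from `GaussianMixture.predict_proba`.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """components: list of (weight, mean, std). Returns probabilities summing to 1."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Two hypothetical equally weighted "clouds" centered at -1 and 2.
toy_components = [(0.5, -1.0, 1.0), (0.5, 2.0, 1.0)]
probs = responsibilities(0.0, toy_components)
print([round(p, 3) for p in probs])
```

The hard cluster label is simply the component with the largest probability, so a movie near the overlap of two clouds can flip clusters with small changes in its features.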

### Conclusion

After considering the quantitative evaluation metrics, **K-Means generally offered the most balanced performance and produced the most intuitively reasonable classifications** for new movies during our testing. While HDBSCAN showed excellent statistical performance in some areas, its qualitative predictions for certain test cases were less aligned with human understanding.

It's important to note that a model's clustering doesn't always have to perfectly match human intuition to be valid or useful. Models might find groupings based on complex patterns that are not immediately obvious to us. However, the `predict_and_explain_cluster` function was crucial here. It allowed us to go beyond just the numerical scores and *qualitatively evaluate* how well each model's clusters actually *made sense* when applied to new, real-world examples. This hands-on testing helped confirm that K-Means, despite not having the absolute best score on every single metric, was the most practical choice for generating understandable and actionable movie categories in this context.

In [None]:
# Read the report above to understand which model is most suitable.

In [None]:
# --- Cluster Naming and Description ---

# Cluster 0: Profit-Driven Comedy/Action Mix
# Numerical Profile: Average quality, slightly positive profit, older movies on average.
# Characteristics: A large, commercially focused group. Defined by a slightly positive profit margin (despite moderate popularity) and a broad mix of popular genres like Comedy, Thriller, and Action. These are solid, mainstream films that consistently make money.
# Example: 5 Days of War, 12 Rounds, The Shadow
Kmeans_CLUSTER_NAME_0 = "Profit-Driven Comedy/Action Mix"


# Cluster 1: Recent Low-Performing Dramas
# Numerical Profile: Sub-average quality, negative profit, recent. Very low popularity.
# Characteristics: These are recent films, mainly Dramas and Romances, that struggle to gain traction (very low popularity) and are generally losing money (-0.016 profit ratio). They often represent a large chunk of newly released, quickly forgotten titles.
# Example: Snabba Cash, Peaceful Warrior, Deuces Wild
Kmeans_CLUSTER_NAME_1 = "Recent Low-Performing Dramas"


# Cluster 2: Newest Sub-Average (Underperformers)
# Numerical Profile: Lower quality, negative profit, newest films on average.
# Characteristics: Represents the very newest films in the dataset (highest mean release year). They are defined by poor performance, slightly lower quality than Cluster 1, and low visibility.
# Example: Malevolence, United Passions, Little Nicky
Kmeans_CLUSTER_NAME_2 = "Newest Sub-Average (Underperformers)"


# Cluster 3: High-Quality Blockbuster Hits
# Numerical Profile: Highest quality (0.501), highest popularity (0.718), relatively recent.
# Characteristics: The "Mega-Hit" category. These films combine high critical acclaim with massive public interest, often dominated by large-scale Action, Adventure, and Thriller genres. This is the goal for major studio tentpoles.
# Example: Woodstock, The Punisher, Predator, 3:10 to Yuma
Kmeans_CLUSTER_NAME_3 = "High-Quality Blockbuster Hits"


# Cluster 4: Critically Panned/Profitable Niche
# Numerical Profile: Extremely low quality (-1.169), but surprisingly positive profit (0.027), very recent.
# Characteristics: This cluster is defined by the sharp contradiction between high negative critical perception (worst quality score) and a positive profit ratio. These are often low-budget, niche films (like some horror or direct-to-video titles) that achieve a small profit despite negative reviews.
# Example: Funny Ha Ha, Rust, Polisse
Kmeans_CLUSTER_NAME_4 = "Critically Panned/Profitable Niche"

In [None]:
# --- HDBSCAN Cluster Naming and Description ---
# outlier group
# Cluster -1: "Noise/Unassigned Outliers (HDBSCAN)"
# Numerical Profile: vote_average (-0.416), popularity (-0.485), release_year (-0.585) - Generally below average and older.
# Description: These are points that the clustering algorithm couldn't reliably assign to any main group. They often include extremely unique or corrupted data entries.
HDBSCAN_CLUSTER_NAME_MINUS_ONE = "Noise/Unassigned Outliers (HDBSCAN)"

# Cluster 0: Modern Low-Popularity Dramas
# Numerical Profile: Low quality and popularity, but average profit, relatively recent.
# Characteristics: These are modern films, mostly dramas and comedies, that perform poorly in terms of popularity but maintain an average critical rating. They are generally forgettable films not capturing wide public attention.
# Example: United Passions, Cinco de Mayo: La Batalla
HDBSCAN_CLUSTER_NAME_0 = "Modern Low-Popularity Dramas"


# Cluster 1: Extremely Poor Quality Outliers
# Numerical Profile: Extremely low quality (-1.637), very low popularity, older.
# Characteristics: This cluster consists of some of the worst-rated films in the dataset, often cheap horror or generic independent films that barely register a score.
# Example: Death Calls, Vessel, Dude Where's My Dog?
HDBSCAN_CLUSTER_NAME_1 = "Extremely Poor Quality Outliers"


# Cluster 2: The Commercial Mainstream (Above Average)
# Numerical Profile: Above average quality (0.185), high popularity (0.244), slightly positive profit.
# Characteristics: This is the largest, most successful, and most diverse group. It represents popular blockbuster and major studio releases that are well-received (above average) and widely consumed across all major genres (Action, Drama, Comedy, Thriller). This is the 'Hit' category.
# Example: Mad Max Beyond Thunderdome, Valkyrie, Sanctum
HDBSCAN_CLUSTER_NAME_2 = "The Commercial Mainstream (Above Average)"


# Cluster 3: High-Quality, Medium-Sized Hits
# Numerical Profile: Highest quality (0.254), highest profit ratio (0.045), low popularity, but relatively recent.
# Characteristics: These are critically acclaimed films (highest mean vote average) that managed to generate good relative profit, often being niche dramas or independent comedies with strong word-of-mouth rather than mass market blockbusters.
# Example: Κυνόδοντας (Dogtooth - arthouse), Eden Lake, Glory Road
HDBSCAN_CLUSTER_NAME_3 = "High-Quality, Medium-Sized Hits"

# Cluster 4: Classic (Older) Cinema Mix
# Numerical Profile: Very old (-2.086 Z-Score), low popularity, average quality.
# Characteristics: Defined purely by age. These are older, non-recent films (predominantly pre-2000s) that mostly belong to the Drama or Romance genres. Their current low popularity is expected due to their age.
# Example: Солярис (Solaris), The Boys from Brazil
HDBSCAN_CLUSTER_NAME_4 = "Classic (Older) Cinema Mix"


# Cluster 5: New Releases, Sub-Average Drama Focus
# Numerical Profile: Below average quality (-0.183), low popularity, very recent (0.268).
# Characteristics: This group focuses on recent releases (high Z-Score for release year) that generally fail to impress critically or popularly, often filling the market with generic dramas and romantic comedies.
# Example: Without Men, The Christmas Candle, Be Kind Rewind
HDBSCAN_CLUSTER_NAME_5 = "New Releases, Sub-Average Drama Focus"

# Cluster 6: Lowest Rated, Unpopular Duds
# Numerical Profile: The absolute lowest quality (-4.736), lowest popularity, older.
# Characteristics: These are highly niche or poorly made films that have extremely low critic/user scores, making them outliers on the low end of the quality scale.
# Example: Her Cry: La Llorona Investigation, Mutant World
HDBSCAN_CLUSTER_NAME_6 = "Lowest Rated, Unpopular Duds"


# Cluster 7: Empty/Unpopulated Cluster
# Numerical Profile: N/A
# Description: This cluster center was created by K-Means but contained no assigned data points, suggesting it represents a theoretical region of the feature space that was not represented in the final reduced dataset passed to K-Means.
HDBSCAN_CLUSTER_NAME_7 = "Empty/Unpopulated Cluster"

In [None]:
# --- GMM Cluster Naming and Description ---

# Cluster 0: High-Profit Classic Cinema
# Numerical Profile: High quality, highest profit ratio (0.107), but extremely old (-1.379). Low current popularity is expected due to age.
# Characteristics: This cluster separates older, classic films (like '42nd Street' or 'The Nun's Story') that were highly successful and critically acclaimed in their time, resulting in a high historical profit ratio. The model isolated them based on the low release year Z-score.
# Example: 42nd Street, The Nun's Story, The Pirate
GMM_CLUSTER_NAME_0 = "High-Profit Classic Cinema"


# Cluster 1: Recent Low-Performing Dramas
# Numerical Profile: Sub-average quality, negative profit, recent. Very low popularity.
# Characteristics: Similar to the K-Means result. These are recent titles, mostly Dramas and Romances, that struggle commercially (negative profit ratio) and critically (sub-average rating). They are quickly forgotten releases.
# Example: The Man in the Iron Mask, Trash, Bathory
GMM_CLUSTER_NAME_1 = "Recent Low-Performing Dramas"


# Cluster 2: Newest Sub-Average (Underperformers)
# Numerical Profile: Lower quality, negative profit, newest films on average.
# Characteristics: Represents the newest batch of released films that performed poorly on average. Defined by lower mean quality and a moderate lack of popularity, struggling to perform in the market.
# Example: Malevolence, United Passions, Little Nicky
GMM_CLUSTER_NAME_2 = "Newest Sub-Average (Underperformers)"


# Cluster 3: The Commercial High-Popularity Mix
# Numerical Profile: Above average quality, highest current popularity (0.332), recent.
# Characteristics: This is the large "Mainstream Hit" category. Defined primarily by high popularity and a strong presence across all major commercial genres (Drama, Action, Comedy, Thriller). These are the films that dominate the public conversation.
# Example: Zombieland, Mission: Impossible II, Il buono, il brutto, il cattivo
GMM_CLUSTER_NAME_3 = "The Commercial High-Popularity Mix"


# Cluster 4: Critically Panned/Profitable Niche
# Numerical Profile: Extremely low quality (-1.180), yet positive profit (0.027), recent.
# Characteristics: A highly contradictory cluster. Films here receive terrible reviews (worst quality score) but achieve a positive profit margin. This points to low-budget, niche, or genre-specific films (like specific horror or low-budget comedies) that are cost-effective despite poor critical reception.
# Example: Funny Ha Ha, Rust, Because I Said So
GMM_CLUSTER_NAME_4 = "Critically Panned/Profitable Niche"

In [None]:
# Run the three cells above so that the clusters of all three models are named.

In [None]:
# Suppress warnings for cleaner output
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# --- 1. DEFINE CLUSTER NAMES (from previous analysis) ---

# HDBSCAN Cluster Names
HDBSCAN_CLUSTER_NAMES_DICT = {
    -1: HDBSCAN_CLUSTER_NAME_MINUS_ONE,
    0: HDBSCAN_CLUSTER_NAME_0,
    1: HDBSCAN_CLUSTER_NAME_1,
    2: HDBSCAN_CLUSTER_NAME_2,
    3: HDBSCAN_CLUSTER_NAME_3,
    4: HDBSCAN_CLUSTER_NAME_4,
    5: HDBSCAN_CLUSTER_NAME_5,
    6: HDBSCAN_CLUSTER_NAME_6,
    7: HDBSCAN_CLUSTER_NAME_7
}

# K-Means Cluster Names
KMEANS_CLUSTER_NAMES_DICT = {
    0: Kmeans_CLUSTER_NAME_0,
    1: Kmeans_CLUSTER_NAME_1,
    2: Kmeans_CLUSTER_NAME_2,
    3: Kmeans_CLUSTER_NAME_3,
    4: Kmeans_CLUSTER_NAME_4
}

# GMM Cluster Names
GMM_CLUSTER_NAMES_DICT = {
    0: GMM_CLUSTER_NAME_0,
    1: GMM_CLUSTER_NAME_1,
    2: GMM_CLUSTER_NAME_2,
    3: GMM_CLUSTER_NAME_3,
    4: GMM_CLUSTER_NAME_4
}

# --- 2. LOAD SAVED MODELS AND TOOLS ---

try:
    # --- Common Models ---
    UMAP_MODEL_CLUST = load('umap_model_clust.joblib')
    DOC2VEC_MODEL = load('doc2vec_model.joblib')
    SCALER = load('scaler.joblib')
    SBERT_MODEL = SentenceTransformer('all-MiniLM-L6-v2')

    # --- HDBSCAN Specific Models and Profiles ---
    HDBSCAN_MODEL = load('hdbscan_model.joblib')
    HDBSCAN_NUMERICAL_PROFILE = load('hdbscan_numerical_profile.joblib')
    HDBSCAN_CATEGORICAL_PROFILES = load('hdbscan_categorical_profiles_dict.joblib')
    HDBSCAN_CENTROIDS_ARRAY = load('hdbscan_centroids_array.joblib')
    HDBSCAN_LABEL_INDEX_MAP = load('hdbscan_label_index_map.joblib')

    # --- K-Means Specific Models and Profiles ---
    KMEANS_MODEL = load('kmeans_model.joblib')
    KMEANS_NUMERICAL_PROFILE = load('kmeans_numerical_profile.joblib')
    KMEANS_CATEGORICAL_PROFILES = load('kmeans_categorical_profiles_dict.joblib')
    KMEANS_CENTROIDS_ARRAY = KMEANS_MODEL.cluster_centers_ # K-Means centroids are directly from the model

    # --- GMM Specific Models and Profiles ---
    # GMM has no explicit centroids in the way K-Means does; the means of its
    # Gaussian components serve as cluster centers for distance calculations.
    GMM_MODEL = load('gmm_model.joblib')
    GMM_CENTROIDS_ARRAY = GMM_MODEL.means_
    GMM_NUMERICAL_PROFILE = load('gmm_numerical_profile.joblib')
    GMM_CATEGORICAL_PROFILES = load('gmm_categorical_profiles_dict.joblib')

    print("✅ SUCCESS: All models, profiles, and centroids loaded successfully.")

except FileNotFoundError as e:
    print(f"❌ ERROR: Failed to load model file: {e}. Please ensure all .joblib files are in the correct locations (e.g., /content/).")
    raise
except Exception as e:
    print(f"❌ An unexpected error occurred during model loading: {e}")
    raise


✅ SUCCESS: All models, profiles, and centroids loaded successfully.


In [None]:
# Suppress warnings for cleaner output
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# --- 1. DEFINE CLUSTER NAMES (from previous analysis) ---

# HDBSCAN Cluster Names (these are now global dictionaries for use in the function)
# These dictionaries are assumed to be defined in previous cells or loaded as global variables.
# HDBSCAN_CLUSTER_NAMES_DICT = { ... }
# KMEANS_CLUSTER_NAMES_DICT = { ... }
# GMM_CLUSTER_NAMES_DICT = { ... }

# --- 2. LOAD SAVED MODELS AND TOOLS ---
# This section remains unchanged as all models are already loaded.
# It's assumed that the global variables like UMAP_MODEL_CLUST, DOC2VEC_MODEL, SCALER, SBERT_MODEL,
# HDBSCAN_MODEL, HDBSCAN_NUMERICAL_PROFILE, HDBSCAN_CATEGORICAL_PROFILES, HDBSCAN_CENTROIDS_ARRAY, HDBSCAN_LABEL_INDEX_MAP,
# KMEANS_MODEL, KMEANS_NUMERICAL_PROFILE, KMEANS_CATEGORICAL_PROFILES, KMEANS_CENTROIDS_ARRAY,
# GMM_MODEL, GMM_CENTROIDS_ARRAY, GMM_NUMERICAL_PROFILE, GMM_CATEGORICAL_PROFILES
# are all loaded and available because the model-loading cell above ran successfully.

# --- HELPER FUNCTIONS FOR PREPROCESSING ---
def clean_name(name):
    """Cleans a single name string."""
    if not isinstance(name, str):
        return ""
    return name.lower().replace(" ", "")

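import nltk
from nltk.tokenize import word_tokenize

# Make sure the NLTK tokenizer data is available. nltk.data.find raises
# LookupError when a resource is missing, so catch only that exception.
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        nltk.download(resource)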


# --- THE INTEGRATED PREDICTION FUNCTION ---

def predict_and_explain_cluster(budget: float, revenue: float, vote_average: float, vote_count: float,
                                popularity: float, release_year: int, runtime: float,
                                overview: str, genres: list, keywords: list,
                                cast_names: list, director_name: str,
                                model_type: str = 'HDBSCAN') -> dict:
    """
    Predicts the cluster ID and name for a new movie using simpler inputs and
    finds the second closest cluster, providing an analytical explanation.

    Args:
        budget (float): The movie's production budget (Raw).
        revenue (float): The movie's gross revenue (Raw).
        vote_average (float): The movie's average user rating (Raw).
        vote_count (float): The movie's vote count (Raw).
        popularity (float): The movie's popularity score (Raw).
        release_year (int): The movie's release year (Raw).
        runtime (float): The movie's runtime in minutes (Raw).
        overview (str): The movie's plot summary (for SBERT embedding).
        genres (list): A list of genre names (e.g., ['Action', 'Adventure']).
        keywords (list): A list of keyword names (e.g., ['spy', 'conspiracy']).
        cast_names (list): A list of prominent cast member names (e.g., ['Tom Hanks', 'Meryl Streep']).
        director_name (str): The director's name (e.g., 'Steven Spielberg').
        model_type (str): The type of clustering model to use ('HDBSCAN', 'KMeans', or 'GMM').

    Returns:
        dict: A dictionary containing the predicted cluster ID and name,
              the second closest cluster ID and name, and an explanation
              of their similarity based on distance.
    """

    # --- PREPROCESSING NEW MOVIE DATA ---

    # 1. Feature Cleaning (Categorical)
    cleaned_genres = [clean_name(name) for name in genres]
    cleaned_keywords = [clean_name(name) for name in keywords]
    cleaned_cast_names = [clean_name(name) for name in cast_names]
    cleaned_director_name = [clean_name(director_name)] if director_name else []

    all_features_list = cleaned_genres + cleaned_keywords + cleaned_cast_names + cleaned_director_name
    all_features_str = ' '.join(all_features_list)

    # 2. Numerical Feature Engineering and Log Transformation
    profit_ratio = (revenue - budget) / (budget + 1e-6) if budget != 0 else 0.0
    profit_ratio_transformed = np.sign(profit_ratio) * np.log1p(np.abs(profit_ratio))

    raw_numerical_data = {
        'vote_average': vote_average,
        'vote_count': vote_count,
        'popularity': popularity,
        'budget_log': np.log1p(budget),
        'revenue_log': np.log1p(revenue),
        'runtime': runtime,
        'profit_ratio': profit_ratio_transformed,
        'release_year': release_year
    }
    raw_numerical_df = pd.DataFrame([raw_numerical_data])
    scaled_numerical_vector = SCALER.transform(raw_numerical_df)

    # 3. EMBEDDING GENERATION
    overview_text = overview if overview else 'no movie description available'
    sbert_vector = SBERT_MODEL.encode([overview_text], convert_to_numpy=True)[0]

    tokenized_features = word_tokenize(all_features_str) if all_features_str else []
    doc2vec_vector = DOC2VEC_MODEL.infer_vector(tokenized_features, epochs=20)

    # --- 4. VECTOR COMBINATION AND UMAP REDUCTION ---
    combined_vector = np.concatenate([
        scaled_numerical_vector,
        doc2vec_vector.reshape(1, -1),
        sbert_vector.reshape(1, -1)
    ], axis=1)
    final_50d_vector = UMAP_MODEL_CLUST.transform(combined_vector)

    # --- 5. FINAL PREDICTION AND ANALYTICAL INSIGHTS (Model-Agnostic) ---

    current_numerical_profile = None
    current_categorical_profiles = None
    current_centroids_array = None
    current_cluster_names_dict = None
    label_map_for_centroids = None # Only relevant for HDBSCAN's centroid mapping

    if model_type == 'HDBSCAN':
        current_numerical_profile = HDBSCAN_NUMERICAL_PROFILE
        current_categorical_profiles = HDBSCAN_CATEGORICAL_PROFILES
        current_centroids_array = HDBSCAN_CENTROIDS_ARRAY
        current_cluster_names_dict = HDBSCAN_CLUSTER_NAMES_DICT
        label_map_for_centroids = HDBSCAN_LABEL_INDEX_MAP
        # Prediction for HDBSCAN is based on distance to centroids
        distances_to_centroids = euclidean_distances(final_50d_vector, current_centroids_array)[0]
        distance_map_raw = sorted([(dist, i) for i, dist in enumerate(distances_to_centroids)], key=lambda x: x[0])
        # Map the index back to the original HDBSCAN label
        predicted_id_raw = label_map_for_centroids[distance_map_raw[0][1]]
        predicted_id = predicted_id_raw
        best_match_distance = distance_map_raw[0][0]

        if len(distance_map_raw) > 1:
            second_id_raw = label_map_for_centroids[distance_map_raw[1][1]]
            second_id = second_id_raw
            second_distance = distance_map_raw[1][0]
        else:
            second_id = predicted_id
            second_distance = best_match_distance


    elif model_type == 'KMeans':
        current_numerical_profile = KMEANS_NUMERICAL_PROFILE
        current_categorical_profiles = KMEANS_CATEGORICAL_PROFILES
        current_centroids_array = KMEANS_CENTROIDS_ARRAY # From KMEANS_MODEL.cluster_centers_
        current_cluster_names_dict = KMEANS_CLUSTER_NAMES_DICT

        predicted_id = KMEANS_MODEL.predict(final_50d_vector)[0]
        distances_to_centroids = euclidean_distances(final_50d_vector, current_centroids_array)[0]

        # Sort distances to find best and second best match
        distance_map_raw = sorted([(dist, i) for i, dist in enumerate(distances_to_centroids)], key=lambda x: x[0])

        best_match_distance = distances_to_centroids[predicted_id] # Distance to the actual predicted cluster

        # Find the second closest cluster that is NOT the predicted_id
        second_id = -1
        second_distance = np.inf
        for dist, idx in distance_map_raw:
            if idx != predicted_id:
                second_id = idx
                second_distance = dist
                break

        if second_id == -1: # Fallback if only one cluster or can't find distinct second
            second_id = predicted_id
            second_distance = best_match_distance


    elif model_type == 'GMM':
        current_numerical_profile = GMM_NUMERICAL_PROFILE
        current_categorical_profiles = GMM_CATEGORICAL_PROFILES
        current_centroids_array = GMM_CENTROIDS_ARRAY # From GMM_MODEL.means_
        current_cluster_names_dict = GMM_CLUSTER_NAMES_DICT

        predicted_id = GMM_MODEL.predict(final_50d_vector)[0]
        distances_to_centroids = euclidean_distances(final_50d_vector, current_centroids_array)[0]

        # Sort distances to find best and second best match
        distance_map_raw = sorted([(dist, i) for i, dist in enumerate(distances_to_centroids)], key=lambda x: x[0])

        best_match_distance = distances_to_centroids[predicted_id] # Distance to the actual predicted cluster

        # Find the second closest cluster that is NOT the predicted_id
        second_id = -1
        second_distance = np.inf
        for dist, idx in distance_map_raw:
            if idx != predicted_id:
                second_id = idx
                second_distance = dist
                break

        if second_id == -1: # Fallback if only one cluster or can't find distinct second
            second_id = predicted_id
            second_distance = best_match_distance


    else:
        raise ValueError(f"Unknown model_type: {model_type}. Must be 'HDBSCAN', 'KMeans', or 'GMM'.")


    predicted_name = current_cluster_names_dict.get(predicted_id, "Unknown Cluster")
    second_name = current_cluster_names_dict.get(second_id, "Unknown Cluster")


    # Prepare Analytical Interpretation with Cluster Profiles
    explanation_parts = []

    # Predicted Cluster Explanation
    explanation_parts.append(f"The predicted cluster (using {model_type}) is Cluster {int(predicted_id)}: '{predicted_name}'.")
    # The numerical profile's label column might be 'best_cluster_labels' (for KMeans/HDBSCAN) or 'current_cluster_labels' (for GMM).
    # Let's standardize this for easier access within the function.
    label_col_name = 'best_cluster_labels' if model_type != 'GMM' else 'current_cluster_labels'

    if predicted_id in current_numerical_profile[label_col_name].values: # Check if profile exists for this ID
        pred_num_profile = current_numerical_profile[current_numerical_profile[label_col_name] == predicted_id].iloc[0]
        pred_cat_profile = current_categorical_profiles.get(predicted_id, "No specific categorical profile available.")
        explanation_parts.append(f"""  Characteristics: This cluster typically has a mean vote average of {pred_num_profile['vote_average']:.2f} (Z-score),
           profit ratio of {pred_num_profile['profit_ratio']:.2f} (Z-score),
           and popularity of {pred_num_profile['popularity']:.2f} (Z-score).""")
        if isinstance(pred_cat_profile, pd.Series):
            explanation_parts.append(f"  Top genres/features: {', '.join(pred_cat_profile.index.map(str).tolist())}.")
        else:
            explanation_parts.append(f"  Top genres/features: {pred_cat_profile}")
    else:
        explanation_parts.append("  This cluster represents noise points or was an empty cluster in the training data, so a detailed profile is not available.")


    # Second Closest Cluster Explanation
    explanation_parts.append(f"\nThe second closest cluster (using {model_type}) is Cluster {int(second_id)}: '{second_name}'.")
    if second_id in current_numerical_profile[label_col_name].values: # Check if profile exists for this ID
        sec_num_profile = current_numerical_profile[current_numerical_profile[label_col_name] == second_id].iloc[0]
        sec_cat_profile = current_categorical_profiles.get(second_id, "No specific categorical profile available.")
        explanation_parts.append(f"""  Characteristics: This cluster typically has a mean vote average of {sec_num_profile['vote_average']:.2f} (Z-score),
           profit ratio of {sec_num_profile['profit_ratio']:.2f} (Z-score),
           and popularity of {sec_num_profile['popularity']:.2f} (Z-score).""")
        if isinstance(sec_cat_profile, pd.Series):
            explanation_parts.append(f"  Top genres/features: {', '.join(sec_cat_profile.index.map(str).tolist())}.")
        else:
            explanation_parts.append(f"  Top genres/features: {sec_cat_profile}")
    else:
        explanation_parts.append("  This cluster represents noise points or was an empty cluster in the training data, so a detailed profile is not available.")


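    if best_match_distance > 1e-6:
        distance_difference_percentage = ((second_distance - best_match_distance) / best_match_distance) * 100
        distance_ratio = second_distance / best_match_distance
    elif second_distance > 1e-6:
        # The movie sits on the predicted center, so the second cluster is infinitely farther in relative terms.
        distance_difference_percentage = np.inf
        distance_ratio = np.inf
    else:
        # Both distances are ~0 (e.g. the single-cluster fallback above): treat them as equal.
        distance_difference_percentage = 0.0
        distance_ratio = 1.0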

    if distance_ratio < 1.1:
        similarity_level = "Very Strong (The movie is located very close to the boundary between these two clusters)"
    elif distance_ratio < 1.25:
        similarity_level = "Strong (Indicates a clear overlap in profile features between the predicted and second closest cluster)"
    elif distance_ratio < 1.5:
        similarity_level = "Moderate (Some shared characteristics with the second closest cluster)"
    else:
        similarity_level = "Weak (The movie is distinctly closer to the predicted cluster)"

    explanation_parts.append(f"\nThis indicates a {similarity_level} similarity, as the distance to the alternative center {int(second_id)} is {distance_difference_percentage:.2f}% greater than the distance to the predicted center {int(predicted_id)}.")

    explanation = '\n'.join(explanation_parts)

    return {
        'status': 'Prediction Successful',
        'predicted_id': int(predicted_id),
        'predicted_name': predicted_name,
        'second_closest_id': int(second_id),
        'second_closest_name': second_name,
        'distance_to_predicted': float(best_match_distance),
        'distance_to_second_closest': float(second_distance),
        'similarity_explanation': explanation
    }

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
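
The boundary check at the end of `predict_and_explain_cluster` reduces to a ratio of two distances. As a minimal sketch, the helper below reproduces the same thresholds (1.1, 1.25, and 1.5) in isolation, so they can be exercised without loading any model. The name `similarity_level` is hypothetical and not part of the notebook, and treating a zero best-match distance as an infinite ratio is an assumption of this sketch.

```python
import math


def similarity_level(best_distance: float, second_distance: float) -> str:
    """Map the second/best distance ratio to a coarse similarity label.

    Hypothetical helper; the thresholds are copied from the notebook function.
    """
    if best_distance > 1e-6:
        ratio = second_distance / best_distance
    else:
        # Assumption: a movie sitting on the predicted center is treated as
        # infinitely closer to it than to any other cluster.
        ratio = math.inf

    if ratio < 1.1:
        return "Very Strong"
    if ratio < 1.25:
        return "Strong"
    if ratio < 1.5:
        return "Moderate"
    return "Weak"


# Made-up distance pairs, one per similarity band.
for best, second in [(2.0, 2.1), (2.0, 2.4), (2.0, 2.8), (2.0, 4.0)]:
    print(best, second, similarity_level(best, second))
```

A lower ratio means the movie lies nearer to the boundary between the two clusters, which is why the labels run from "Very Strong" at the low end to "Weak" at the high end.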


In [None]:
InputDataHDBSCAN = {
    'budget': 40000000.0,
    'revenue': 60000000.0,
    'vote_average': 5.5,
    'popularity': 150.0,
    'release_year': 2025,
    'vote_count': 4000.0,
    'runtime': 95.0,
    'overview': "A predictable action film where a retired special agent must return for one last mission to save his family from a generic villain.",
    'genres': ["Action", "Thriller"],
    'keywords': ["sequel", "generic", "bad guy"],
    'cast_names': ["Scott Adkins", "A supporting actor"],
    'director_name': "A journeyman director"
}
# --- Input data if you want to use the HDBSCAN Model ---

In [None]:
print("\n--- Running HDBSCAN Prediction")
try:
    analysis_result_hdbscan_test1 = predict_and_explain_cluster(
        budget=InputDataHDBSCAN['budget'],
        revenue=InputDataHDBSCAN['revenue'],
        vote_average=InputDataHDBSCAN['vote_average'],
        vote_count=InputDataHDBSCAN['vote_count'],
        popularity=InputDataHDBSCAN['popularity'],
        release_year=InputDataHDBSCAN['release_year'],
        runtime=InputDataHDBSCAN['runtime'],
        overview=InputDataHDBSCAN['overview'],
        genres=InputDataHDBSCAN['genres'],
        keywords=InputDataHDBSCAN['keywords'],
        cast_names=InputDataHDBSCAN['cast_names'],
        director_name=InputDataHDBSCAN['director_name'],
        model_type='HDBSCAN' # Explicitly specify HDBSCAN model type
    )

    print("\n--- 'Problematic Low-Quality Commercial Film' HDBSCAN ANALYSIS REPORT ---")
    print(f"Status: {analysis_result_hdbscan_test1.get('status', 'Unknown Error')}")

    if analysis_result_hdbscan_test1.get('status') == 'Prediction Successful':
        print(f"🥇 Predicted Cluster (Best Match): {analysis_result_hdbscan_test1['predicted_id']} - {analysis_result_hdbscan_test1['predicted_name']}")
        print(f"🥈 Second Closest Cluster: {analysis_result_hdbscan_test1['second_closest_id']} - {analysis_result_hdbscan_test1['second_closest_name']}")
        print("-" * 50)
        print("Analytical Interpretation:")
        print(f"{analysis_result_hdbscan_test1['similarity_explanation']}")
        print("-" * 50)
    else:
        print(f"Details: {analysis_result_hdbscan_test1.get('details', 'N/A')}")

except Exception as e:
    print(f"\nAn unexpected error occurred during the execution: {e}")


--- Running HDBSCAN Prediction for 'Problematic Low-Quality Commercial Film' ---

--- 'Problematic Low-Quality Commercial Film' HDBSCAN ANALYSIS REPORT ---
Status: Prediction Successful
🥇 Predicted Cluster (Best Match): 1 - Extremely Poor Quality Outliers
🥈 Second Closest Cluster: 5 - New Releases, Sub-Average Drama Focus
--------------------------------------------------
Analytical Interpretation:
The predicted cluster (using HDBSCAN) is Cluster 1: 'Extremely Poor Quality Outliers'.
  Characteristics: This cluster typically has a mean vote average of 0.23 (Z-score),
           profit ratio of 0.05 (Z-score),
           and popularity of -0.30 (Z-score).
  Top genres/features: drama, comedy, romance, thriller, action.

The second closest cluster (using HDBSCAN) is Cluster 5: 'New Releases, Sub-Average Drama Focus'.
  Characteristics: This cluster typically has a mean vote average of -4.70 (Z-score),
           profit ratio of -0.02 (Z-score),
           and popularity of -0.67 (Z-scor
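
The profile values in this report are z-scores, so they express how far a cluster mean lies from the dataset mean in units of standard deviation rather than as raw ratings. As a small illustration with made-up ratings (not the notebook's data), the sketch below standardizes a list using the population standard deviation, which is the convention scikit-learn's `StandardScaler` follows. That the notebook's profiles were produced this way is an assumption.

```python
from statistics import fmean, pstdev

# Made-up vote averages for illustration only.
ratings = [4.0, 5.0, 6.0, 7.0, 8.0]

mean = fmean(ratings)
std = pstdev(ratings)  # population standard deviation (ddof=0)

# z = (x - mean) / std for every rating.
z_scores = [(r - mean) / std for r in ratings]

for rating, z in zip(ratings, z_scores):
    print(f"rating {rating:.1f} -> z-score {z:+.2f}")
```

Read this way, a reported value such as -4.70 means the cluster mean sits 4.7 standard deviations below the dataset mean.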

In [None]:
InputDataGMM = {
    'budget': 40000000.0,
    'revenue': 60000000.0,
    'vote_average': 5.5,
    'popularity': 150.0,
    'release_year': 2025,
    'vote_count': 4000.0,
    'runtime': 95.0,
    'overview': "A predictable action film where a retired special agent must return for one last mission to save his family from a generic villain.",
    'genres': ["Action", "Thriller"],
    'keywords': ["sequel", "generic", "bad guy"],
    'cast_names': ["Scott Adkins", "A supporting actor"],
    'director_name': "A journeyman director"
}
# --- Input data if you want to use the GMM Model ---

In [None]:
print("\n--- Running GMM Prediction")
try:
    analysis_result_gmm_test2 = predict_and_explain_cluster(
        budget=InputDataGMM['budget'],
        revenue=InputDataGMM['revenue'],
        vote_average=InputDataGMM['vote_average'],
        vote_count=InputDataGMM['vote_count'],
        popularity=InputDataGMM['popularity'],
        release_year=InputDataGMM['release_year'],
        runtime=InputDataGMM['runtime'],
        overview=InputDataGMM['overview'],
        genres=InputDataGMM['genres'],
        keywords=InputDataGMM['keywords'],
        cast_names=InputDataGMM['cast_names'],
        director_name=InputDataGMM['director_name'],
        model_type='GMM' # Specify GMM model type
    )

    print("\n--- 'Extreme Horror Outlier' GMM ANALYSIS REPORT ---")
    print(f"Status: {analysis_result_gmm_test2.get('status', 'Unknown Error')}")

    if analysis_result_gmm_test2.get('status') == 'Prediction Successful':
        print(f"🥇 Predicted Cluster (Best Match): {analysis_result_gmm_test2['predicted_id']} - {analysis_result_gmm_test2['predicted_name']}")
        print(f"🥈 Second Closest Cluster: {analysis_result_gmm_test2['second_closest_id']} - {analysis_result_gmm_test2['second_closest_name']}")
        print("-" * 50)
        print("Analytical Interpretation:")
        print(f"{analysis_result_gmm_test2['similarity_explanation']}")
        print("-" * 50)
    else:
        print(f"Details: {analysis_result_gmm_test2.get('details', 'N/A')}")

except Exception as e:
    print(f"\nAn unexpected error occurred during the execution: {e}")


--- Running GMM Prediction for 'Extreme Horror Outlier' ---

--- 'Extreme Horror Outlier' GMM ANALYSIS REPORT ---
Status: Prediction Successful
🥇 Predicted Cluster (Best Match): 0 - High-Profit Classic Cinema
🥈 Second Closest Cluster: 3 - The Commercial High-Popularity Mix
--------------------------------------------------
Analytical Interpretation:
The predicted cluster (using GMM) is Cluster 0: 'High-Profit Classic Cinema'.
  Characteristics: This cluster typically has a mean vote average of 0.31 (Z-score),
           profit ratio of 0.11 (Z-score),
           and popularity of -0.20 (Z-score).
  Top genres/features: drama, comedy, thriller, action, adventure.

The second closest cluster (using GMM) is Cluster 3: 'The Commercial High-Popularity Mix'.
  Characteristics: This cluster typically has a mean vote average of -0.29 (Z-score),
           profit ratio of -0.02 (Z-score),
           and popularity of -0.43 (Z-score).
  Top genres/features: drama, comedy, thriller, romance, act

In [None]:
InputDataKMeans = {
    'budget': 40000000.0,
    'revenue': 60000000.0,
    'vote_average': 5.5,
    'popularity': 150.0,
    'release_year': 2025,
    'vote_count': 4000.0,
    'runtime': 95.0,
    'overview': "A predictable action film where a retired special agent must return for one last mission to save his family from a generic villain.",
    'genres': ["Action", "Thriller"],
    'keywords': ["sequel", "generic", "bad guy"],
    'cast_names': ["Scott Adkins", "A supporting actor"],
    'director_name': "A journeyman director"
}
# Input data for the K-Means model (the recommended model: its predictions are the closest to human intuition and it performs best overall).
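
`InputDataHDBSCAN`, `InputDataGMM`, and `InputDataKMeans` hold identical values, and their keys match the keyword arguments of `predict_and_explain_cluster`, so a single dictionary could in principle be unpacked with `**` and fed to all three models in a loop. A minimal sketch of that pattern follows; `fake_predict` is a hypothetical placeholder that only validates the model name, and the shortened `shared_input` is for illustration.

```python
MODEL_TYPES = ("HDBSCAN", "GMM", "KMeans")

# Shortened copy of the notebook's input, for illustration only.
shared_input = {
    "budget": 40000000.0,
    "revenue": 60000000.0,
    "vote_average": 5.5,
    "genres": ["Action", "Thriller"],
}


def fake_predict(model_type, **movie_fields):
    """Hypothetical placeholder: validates the model name and echoes it back."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"Unknown model_type: {model_type}")
    return {"status": "Prediction Successful", "model_type": model_type}


results = {}
for model_type in MODEL_TYPES:
    try:
        # ** expands each key/value pair into a keyword argument.
        results[model_type] = fake_predict(model_type=model_type, **shared_input)
    except ValueError as err:
        # Keep going so one failing model does not hide the others.
        results[model_type] = {"status": "Failed", "details": str(err)}

for model_type, outcome in results.items():
    print(f"{model_type}: {outcome['status']}")
```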

In [None]:
print("\n--- Running K-Means Prediction")
try:
    analysis_result_kmeans_test1 = predict_and_explain_cluster(
        budget=InputDataKMeans['budget'],
        revenue=InputDataKMeans['revenue'],
        vote_average=InputDataKMeans['vote_average'],
        vote_count=InputDataKMeans['vote_count'],
        popularity=InputDataKMeans['popularity'],
        release_year=InputDataKMeans['release_year'],
        runtime=InputDataKMeans['runtime'],
        overview=InputDataKMeans['overview'],
        genres=InputDataKMeans['genres'],
        keywords=InputDataKMeans['keywords'],
        cast_names=InputDataKMeans['cast_names'],
        director_name=InputDataKMeans['director_name'],
        model_type='KMeans' # Specify KMeans model type
    )

    print("\n--- 'Problematic Low-Quality Commercial Film' K-Means ANALYSIS REPORT ---")
    print(f"Status: {analysis_result_kmeans_test1.get('status', 'Unknown Error')}")

    if analysis_result_kmeans_test1.get('status') == 'Prediction Successful':
        print(f"🥇 Predicted Cluster (Best Match): {analysis_result_kmeans_test1['predicted_id']} - {analysis_result_kmeans_test1['predicted_name']}")
        print(f"🥈 Second Closest Cluster: {analysis_result_kmeans_test1['second_closest_id']} - {analysis_result_kmeans_test1['second_closest_name']}")
        print("-" * 50)
        print("Analytical Interpretation:")
        print(f"{analysis_result_kmeans_test1['similarity_explanation']}")
        print("-" * 50)
    else:
        print(f"Details: {analysis_result_kmeans_test1.get('details', 'N/A')}")

except Exception as e:
    print(f"\nAn unexpected error occurred during the execution: {e}")


--- Running K-Means Prediction

--- 'Problematic Low-Quality Commercial Film' K-Means ANALYSIS REPORT ---
Status: Prediction Successful
🥇 Predicted Cluster (Best Match): 0 - Profit-Driven Comedy/Action Mix
🥈 Second Closest Cluster: 3 - High-Quality Blockbuster Hits
--------------------------------------------------
Analytical Interpretation:
The predicted cluster (using KMeans) is Cluster 0: 'Profit-Driven Comedy/Action Mix'.
  Characteristics: This cluster typically has a mean vote average of -0.14 (Z-score),
           profit ratio of 0.02 (Z-score),
           and popularity of -0.20 (Z-score).
  Top genres/features: comedy, drama, thriller, action, romance.

The second closest cluster (using KMeans) is Cluster 3: 'High-Quality Blockbuster Hits'.
  Characteristics: This cluster typically has a mean vote average of -0.29 (Z-score),
           profit ratio of -0.02 (Z-score),
           and popularity of -0.43 (Z-score).
  Top genres/features: drama, comedy, thriller, romance, action
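
The "predicted" and "second closest" clusters in the report come from ranking distances to cluster centers. As a minimal sketch of that ranking, assuming Euclidean distance and three made-up 2-D centroids (the notebook's real feature space is not shown here), the helper below returns the two nearest cluster ids. `two_nearest` is a hypothetical name.

```python
import math


def two_nearest(point, centroids):
    """Return [(cluster_id, distance), ...] for the two closest centroids."""
    ranked = sorted(
        ((cid, math.dist(point, center)) for cid, center in centroids.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:2]


# Made-up centroids and movie vector for illustration only.
centroids = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (10.0, 0.0)}
movie_vector = (1.0, 1.0)

(best_id, best_dist), (second_id, second_dist) = two_nearest(movie_vector, centroids)
print(f"predicted: {best_id} ({best_dist:.3f}), second: {second_id} ({second_dist:.3f})")
print(f"distance ratio: {second_dist / best_dist:.2f}")
```

The printed ratio is the same quantity the notebook function compares against its similarity thresholds.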