# Evaluation of KeyBERTMetadata

This notebook evaluates and compares different keyword extraction models applied to movie reviews. The focus is on understanding how well each model can extract meaningful keywords that align with the ground truth.

We aim to assess the performance of two models:
- **Base** KeyBERT model
- **Metadata-enhanced** version that integrates additional information: KeyBERTMetadata

The evaluation is performed on a set of reviews for a selected movie, where each model predicts a ranked list of top-5 keywords per review.

The notebook uses:
- A **ground truth dataset** of annotated keywords per movie, retrieved from IMDb
- **Model outputs**: lists of predicted keywords with associated confidence scores for each review
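
For orientation, the sketch below shows the shape a single review's predictions might take. The column names `Movie_ID`, `keywords_base`, and `keywords_metadata` are the ones used later in this notebook; the ID, keywords, and scores are made-up placeholders (truncated to three entries for brevity), not real model output.

```python
# Hypothetical single review row; ID, keywords and scores are placeholders.
review_row = {
    "Movie_ID": "tt0000000",
    "keywords_base": [("class divide", 0.62), ("basement", 0.55), ("family", 0.48)],
    "keywords_metadata": [("social satire", 0.71), ("family", 0.50), ("rain", 0.33)],
}

# Each model column holds a ranked list of (keyword, confidence score) tuples.
for model in ("base", "metadata"):
    ranked = review_row[f"keywords_{model}"]
    keywords_only = [kw for kw, _ in ranked]
    print(model, keywords_only)
```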

### Evaluation Metrics

Two types of evaluation are conducted:

**1. Basic (Unweighted) Metrics**
- **Precision**, **Recall**, and **F1-score** based on approximate binary matching
- Each review is evaluated independently, and metrics are averaged

**2. Score-Aware Metrics**
- **Weighted Precision/Recall/F1**: matches are weighted by the model’s confidence scores
- **nDCG@5**: evaluates ranking quality of the predicted top-5 keywords

Together, these evaluations provide both a general and a fine-grained view of model performance.


## Setup: Installing and Importing Required Libraries

In [49]:
import subprocess
import sys

# Required packages, mapped from pip name to import name
# (the two differ for scikit-learn, which is imported as "sklearn")
required_packages = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
}

def install_package(package, import_name):
    """Install a package with pip if its module cannot be imported."""
    try:
        __import__(import_name)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package, import_name in required_packages.items():
    install_package(package, import_name)


pandas is already installed.
numpy is already installed.
Installing scikit-learn...
Defaulting to user installation because normal site-packages is not writeable


In [50]:
# Import pandas for structured data manipulation using DataFrames (tables)
import pandas as pd

# Import scikit-learn metrics to evaluate prediction performance
# - precision_score: proportion of correctly predicted positive samples
# - recall_score: proportion of actual positive samples correctly identified
# - f1_score: harmonic mean of precision and recall, balances precision/recall trade-off
from sklearn.metrics import precision_score, recall_score, f1_score

# Import the 're' module for text preprocessing using regular expressions
# (e.g., removing punctuation or matching patterns)
import re

# Import numpy for efficient numerical computations, array operations, and statistics
import numpy as np

# Import math for mathematical functions like logarithms used in metrics such as nDCG
import math

# Import os to handle file system operations (e.g., listing files in directories)
import os

## Select a Movie and Load its Ground Truth Keywords

In this step, we load the keyword extraction results for a specific movie and retrieve the corresponding ground truth keywords. The goal is to use these annotated keywords for evaluation and comparison with automatically extracted ones.


In [51]:
# Set the name of the movie to be evaluated
movie_name = "Parasite"

# Load the extracted keywords for the selected movie from a pickle file
# The file path is dynamically built using the movie name
selected_film = pd.read_pickle(f"../Dataset/Extracted_Keywords/kw_{movie_name}.pkl")

# Retrieve the Movie_ID of the selected film
# Assumes that the file contains a DataFrame with at least one row
selected_film_id = selected_film["Movie_ID"].iloc[0]

# Load the full dataset containing the ground truth keywords
# for all movies in the evaluation set
keywords = pd.read_pickle("../Dataset/keywords_ground_truth.pkl")

# Filter the ground truth dataset to extract only the keywords for the selected movie
kw_ground_truth = keywords[keywords["Movie_ID"] == selected_film_id]

## Keyword Matching and Evaluation Functions (Basic – Unweighted)

This block defines the baseline utility functions used to evaluate predicted keywords against the ground truth. These functions do **not** take into account keyword confidence scores or ranking—they perform **binary, unweighted evaluation**.

Specifically, this implementation includes:

- **Normalization**: keywords are converted to lowercase, stripped of punctuation, and cleaned of extra whitespace to ensure consistent matching.

- **Approximate Matching**: a relaxed rule that considers two keywords as matching if they are identical or if one is a substring of the other (e.g., *"social satire"* ≈ *"satire"*).

- **Evaluation**: standard metrics — **precision**, **recall**, and **F1-score** — are calculated based on the number of approximate matches between predicted and ground truth keywords.

This provides a basic but interpretable way to assess keyword extraction quality without considering the ranking or confidence scores assigned by the model.
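
To make these three steps concrete, here is a small self-contained sketch on toy data. The helper names (`normalize`, `approx_match`) and the keywords are illustrative assumptions; the sketch follows the rules described above (lowercasing, punctuation removal, substring containment) and is not meant as a replacement for the notebook's own functions.

```python
import re

def normalize(kw):
    """Lowercase, drop non-alphanumeric characters, trim whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", kw.lower()).strip()

def approx_match(kw, gt_keywords):
    """True if kw equals, contains, or is contained in any ground truth keyword."""
    return any(kw in gt or gt in kw for gt in gt_keywords)

# Toy inputs (hypothetical, for illustration only)
predicted = ["Social Satire!", "Basement", "cinematography", "Art"]
ground_truth = ["satire", "class differences", "party"]

pred_norm = [normalize(k) for k in predicted]
gt_norm = [normalize(k) for k in ground_truth]

# One boolean per predicted keyword
matches = [approx_match(k, gt_norm) for k in pred_norm]

precision = sum(matches) / len(pred_norm)
recall = sum(matches) / len(gt_norm)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(matches, precision, round(recall, 4), round(f1, 4))
```

Containment is checked at the character level, so `art` counts as a match for `party` in this toy example; hits of this kind can inflate the scores.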


In [52]:
# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()

# Approximate matching function:
# Returns True if the predicted keyword matches the ground truth exactly
# or if either keyword contains the other as a substring
def is_approx_match(kw, gt_keywords):
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False

# Evaluation function for a single prediction instance:
# - Normalizes both predicted and ground truth keywords
# - Computes how many predicted keywords approximately match the ground truth
# - Calculates precision, recall, and F1-score
def evaluate_keywords(pred_keywords, gt_keywords):
    pred_keywords = [normalize_kw(k) for k in pred_keywords]
    gt_keywords = [normalize_kw(k) for k in gt_keywords]
    
    # Count how many predicted keywords match approximately with any ground truth keyword
    match_count = sum([is_approx_match(k, gt_keywords) for k in pred_keywords])
    
    # Precision: fraction of predicted keywords that match the ground truth
    precision = match_count / len(pred_keywords) if pred_keywords else 0

    # Recall: matched predictions divided by the number of ground truth keywords
    # (approximate: several predictions may match the same ground truth keyword)
    recall = match_count / len(gt_keywords) if gt_keywords else 0

    # F1-score: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    
    return precision, recall, f1


### Evaluate and Compare Models on Keyword Extraction (Basic – Unweighted)

This section evaluates two keyword extraction models — **base** and **metadata-enhanced** — using the ground truth.

For each **review**, basic precision, recall, and F1-score are computed based on binary keyword matching. These metrics are then **averaged across all reviews** to provide an overall performance comparison between the models.

In [53]:
# Define the models to be evaluated
models_to_evaluate = ["base", "metadata"]

# Create a dictionary to store evaluation results (precision, recall, F1) for each model
results = {model: [] for model in models_to_evaluate}

# Extract the list of ground truth keywords for the selected movie (same for all reviews)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Iterate over each review in the selected film's predictions
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        
        # Column name with the predicted keywords for this model
        pred_col = f"keywords_{model}"
        
        # Proceed only if the column exists and contains a list of predicted keywords
        if pred_col in row and isinstance(row[pred_col], list):

            # Extract only the keyword strings from (keyword, score) tuples
            predicted_keywords = [kw for kw, _ in row[pred_col] if isinstance(kw, str)]
            
            # Evaluate the prediction using precision, recall, and F1-score
            precision, recall, f1 = evaluate_keywords(predicted_keywords, ground_truth_keywords)
            
            # Store the result for this specific review
            results[model].append({
                "precision": precision,
                "recall": recall,
                "f1": f1
            })

# Aggregate the results to compute average metrics for each model
summary = {}
for model in models_to_evaluate:
    precisions = [r["precision"] for r in results[model]]
    recalls = [r["recall"] for r in results[model]]
    f1s = [r["f1"] for r in results[model]]
    
    # Calculate the average of each metric and round to 4 decimal places
    summary[model] = {
        "avg_precision": round(np.mean(precisions), 4),
        "avg_recall": round(np.mean(recalls), 4),
        "avg_f1": round(np.mean(f1s), 4)
    }

In [54]:
# Convert the unweighted summary dictionary to a pandas DataFrame
summary_df = pd.DataFrame(summary).T  # Transpose so models are rows

# Rename columns for better readability
summary_df.columns = [
    "Precision",
    "Recall",
    "F1-score",
]

# Display the summary table nicely
summary_df.style.format(precision=4).set_caption("Basic (Unweighted) Evaluation Summary")

| Model | Precision | Recall | F1-score |
|---|---|---|---|
| base | 0.16 | 0.0025 | 0.005 |
| metadata | 0.12 | 0.0019 | 0.0037 |


## Score-Aware Evaluation: Weighted Metrics and nDCG@k

In this extended evaluation, we incorporate the **confidence scores** assigned by the model to each predicted keyword. This enables us to evaluate not only the correctness of predictions, but also how confidently and how well they are ranked.

### Score-Aware Metrics

- **Weighted Precision**: Measures how much of the model’s total confidence is assigned to correct keywords. A confident but incorrect prediction hurts the score more.
- **Weighted Recall**: Reflects how much of the ground truth is recovered, weighted by the confidence given to each correct prediction.
- **Weighted F1-score**: The harmonic mean of weighted precision and recall. It balances the trade-off between being accurate and being comprehensive.
- **nDCG@k (Normalized Discounted Cumulative Gain)**: A ranking metric that rewards placing relevant keywords at the top of the prediction list.
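
As a rough illustration of the weighted metrics, the sketch below scores three hypothetical predictions against three hypothetical ground truth keywords. Exact string matching is used here for brevity (the notebook itself uses approximate matching), and recall divides the matched score mass by the number of ground truth keywords, following the definition in this notebook's `evaluate_keywords_weighted`.

```python
# Hypothetical (keyword, confidence score) predictions and ground truth.
predicted = [("scam", 0.5), ("family", 0.25), ("fraud", 0.25)]
ground_truth = {"fraud", "poverty", "scam"}

# Total confidence mass, and the part of it assigned to correct keywords
total_score = sum(score for _, score in predicted)
match_score = sum(score for kw, score in predicted if kw in ground_truth)

weighted_precision = match_score / total_score       # 0.75 / 1.0
weighted_recall = match_score / len(ground_truth)    # 0.75 / 3
weighted_f1 = (
    2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
    if weighted_precision + weighted_recall
    else 0.0
)

print(weighted_precision, weighted_recall, weighted_f1)
```

Because recall is divided by the full number of ground truth keywords while the matched scores sum to at most the total confidence, weighted recall stays small whenever the ground truth list is long.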

### nDCG@k Definition

$nDCG@k = \frac{DCG@k}{IDCG@k}, \quad 
DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}$

Where:
- $rel_i = 1$ if the predicted keyword at rank $i$ is in the ground truth, 0 otherwise  
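- $IDCG@k$ is the best possible DCG, obtained when all relevant keywords are ranked at the top. In this notebook's implementation it is computed by re-sorting the relevance labels of the predicted top-$k$ list, so it reflects the best ordering of the keywords the model actually returned.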

#### Example (k = 5)

Predicted: `["scam", "family", "poverty", "cinematography", "fraud"]`  
Ground truth: `["fraud", "poverty", "scam"]`

Relevant keywords are found at ranks 1, 3, and 5:

$DCG = \frac{1}{\log_2(1+1)} + \frac{1}{\log_2(3+1)} + \frac{1}{\log_2(5+1)} = 1 + 0.5 + 0.3869 = 1.8869$

$IDCG = \frac{1}{\log_2(1+1)} + \frac{1}{\log_2(2+1)} + \frac{1}{\log_2(3+1)} = 1 + 0.6309 + 0.5 = 2.1309$

$nDCG@5 = \frac{1.8869}{2.1309} \approx 0.885$
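
The worked example can be checked numerically with a few lines of Python. This is a minimal sketch that uses exact string matching and a hypothetical helper named `dcg`; it mirrors the formula above rather than the notebook's full implementation.

```python
import math

predicted = ["scam", "family", "poverty", "cinematography", "fraud"]
ground_truth = {"fraud", "poverty", "scam"}

def dcg(relevance):
    """Discounted cumulative gain: rank i (1-based) is discounted by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))

# Binary relevance per rank, and the same labels in their ideal order
relevance = [1 if kw in ground_truth else 0 for kw in predicted]
ideal = sorted(relevance, reverse=True)

ndcg_at_5 = dcg(relevance) / dcg(ideal)
print(round(dcg(relevance), 4), round(dcg(ideal), 4), round(ndcg_at_5, 3))
```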

#### Interpretation

- **nDCG@5 = 1** → perfect ranking (all correct keywords are at the top)  
- **nDCG@5 = 0** → none of the ground truth keywords are in the top-*k*

These metrics provide a more nuanced and realistic evaluation by combining prediction accuracy with ranking quality.


In [55]:
# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()


# Approximate matching function:
# Returns True if the predicted keyword matches any ground truth keyword
# using a relaxed comparison: exact match or substring containment
def is_approx_match(kw, gt_keywords):
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False


# Weighted evaluation function:
# Calculates precision, recall, and F1-score using the confidence scores of predicted keywords
# - High-confidence correct predictions contribute more
# - Precision is score-weighted; recall divides by total ground truth
def evaluate_keywords_weighted(predicted_kw_score, gt_keywords):
    """
    Evaluate predicted keywords with confidence scores using weighted precision, recall, and F1.
    
    Parameters:
        predicted_kw_score (list of (str, float)): predicted keywords with associated confidence scores
        gt_keywords (list of str): ground truth keywords (annotated)
    
    Returns:
        (precision, recall, f1): all metrics computed using score-weighted matching
    """
    # Normalize both predicted and ground truth keywords
    gt_keywords = [normalize_kw(k) for k in gt_keywords]
    pred_keywords = [(normalize_kw(kw), score) for kw, score in predicted_kw_score if isinstance(kw, str)]
    
    total_score = sum(score for _, score in pred_keywords)
    if total_score == 0:
        return 0, 0, 0

    # Compute total score of predicted keywords that approximately match the ground truth
    match_score = sum(score for kw, score in pred_keywords if is_approx_match(kw, gt_keywords))
    
    # Weighted precision and recall
    precision = match_score / total_score
    recall = match_score / len(gt_keywords) if gt_keywords else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0

    return precision, recall, f1


# Ranking-based evaluation function:
# Computes the normalized Discounted Cumulative Gain (nDCG@k) for predicted keywords
# Gives more credit when correct keywords appear earlier in the ranking
def compute_ndcg(predicted_kw_score, gt_keywords, k=5):
    """
    Compute nDCG@k between predicted keywords (with scores) and the ground truth keywords.

    Parameters:
        predicted_kw_score (list of (str, float)): predicted keywords with associated scores
        gt_keywords (list of str): ground truth keywords
        k (int): number of top keywords to consider
    
    Returns:
        float: nDCG score (between 0 and 1)
    """
    # Normalize the ground truth and the top-k predicted keyword strings
    # (predictions are assumed to be sorted by descending score)
    gt_keywords = [normalize_kw(gt) for gt in gt_keywords]
    pred_keywords = [normalize_kw(kw) for kw, _ in predicted_kw_score[:k]]

    # Assign binary relevance: 1 if keyword matches ground truth, 0 otherwise
    relevance = [1 if is_approx_match(kw, gt_keywords) else 0 for kw in pred_keywords]

    # Compute DCG (Discounted Cumulative Gain)
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

    # Compute IDCG (Ideal DCG): the best possible ordering
    ideal_relevance = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal_relevance))

    # Return normalized DCG
    return dcg / idcg if idcg != 0 else 0


### Evaluate and Compare Models on Keyword Extraction (Weighted)

In this section, we apply the score-aware evaluation metrics to each review for both models:

- **Weighted Precision, Recall, F1**: accounts for the confidence scores of each predicted keyword.
- **nDCG@5**: evaluates the ranking quality of the top-5 keywords based on their alignment with the ground truth.

Each review is evaluated individually, and the metrics are then averaged across all reviews to summarize model performance.

In [56]:
# Models to evaluate
models_to_evaluate = ["base", "metadata"]

# Initialize results dictionary for weighted metrics and nDCG
weighted_results = {model: [] for model in models_to_evaluate}

# Ground truth keywords (same for all reviews in the selected film)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Loop through each review
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"
        
        # Proceed only if predictions exist and are in the expected list format
        if pred_col in row and isinstance(row[pred_col], list):
            predicted_kw_score = row[pred_col]  # list of (kw, score)

            # Compute weighted metrics
            w_precision, w_recall, w_f1 = evaluate_keywords_weighted(predicted_kw_score, ground_truth_keywords)

            # Compute nDCG@5
            ndcg = compute_ndcg(predicted_kw_score, ground_truth_keywords, k=5)

            # Save results
            weighted_results[model].append({
                "weighted_precision": w_precision,
                "weighted_recall": w_recall,
                "weighted_f1": w_f1,
                "ndcg@5": ndcg
            })

# Compute average metrics across all reviews
weighted_summary = {}
for model in models_to_evaluate:
    metrics = weighted_results[model]
    weighted_summary[model] = {
        "avg_weighted_precision": round(np.mean([m["weighted_precision"] for m in metrics]), 4),
        "avg_weighted_recall": round(np.mean([m["weighted_recall"] for m in metrics]), 4),
        "avg_weighted_f1": round(np.mean([m["weighted_f1"] for m in metrics]), 4),
        "avg_ndcg@5": round(np.mean([m["ndcg@5"] for m in metrics]), 4)
    }


In [57]:
# Convert the weighted summary dictionary to a pandas DataFrame
summary_df = pd.DataFrame(weighted_summary).T  # Transpose so models are rows

# Rename columns for better readability
summary_df.columns = [
    "Weighted Precision",
    "Weighted Recall",
    "Weighted F1-score",
    "nDCG@5"
]

# Display the summary table
summary_df.style.format(precision=4).set_caption("Score-Aware Evaluation Summary")

| Model | Weighted Precision | Weighted Recall | Weighted F1-score | nDCG@5 |
|---|---|---|---|---|
| base | 0.1564 | 0.0012 | 0.0023 | 0.3003 |
| metadata | 0.1186 | 0.0011 | 0.0021 | 0.2003 |


## Evaluation Across All Movies

This section automatically evaluates all `.pkl` files in the `Extracted_Keywords` directory, where each file corresponds to a movie and contains predicted keywords from different models.

For each movie:
- The associated ground truth keywords are loaded.
- The predicted keywords from both models (**base** and **metadata**) are evaluated.
- The following metrics are computed for each review:
  - **Unweighted**: Precision, Recall, F1-score
  - **Score-aware**: Weighted Precision, Weighted Recall, Weighted F1, and nDCG@5

Finally, average metrics are computed per movie and model, and compiled into a final summary table.
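
The per-movie, per-model averaging step can also be expressed as a pandas `groupby`. The sketch below uses a small made-up table of per-review metrics (the movie labels and values are assumptions for illustration) rather than the notebook's actual data.

```python
import pandas as pd

# Hypothetical per-review metrics: one row per (review, model) pair.
per_review = pd.DataFrame({
    "Movie": ["A", "A", "A", "A", "B", "B"],
    "Model": ["base", "base", "metadata", "metadata", "base", "metadata"],
    "precision": [0.2, 0.4, 0.0, 0.2, 0.6, 0.4],
    "ndcg@5": [0.5, 1.0, 0.0, 0.5, 1.0, 0.5],
})

# Average every metric column within each (movie, model) group
summary = (
    per_review
    .groupby(["Movie", "Model"], as_index=False)
    .mean(numeric_only=True)
    .round(4)
)
print(summary)
```

Grouping this way yields one row per movie and model, which is the same layout as the final summary table.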


In [58]:
# Paths
keywords_dir = "../Dataset/Extracted_Keywords/"
ground_truth_path = "../Dataset/keywords_ground_truth.pkl"

# Load the ground truth once for all movies
keywords_ground_truth = pd.read_pickle(ground_truth_path)

# List of models to evaluate
models_to_evaluate = ["base", "metadata"]

# Store results across all movies
all_results = []

# Iterate over all keyword prediction files
for file in os.listdir(keywords_dir):
    if file.endswith(".pkl") and file.startswith("kw_"):
        movie_name = file.replace("kw_", "").replace(".pkl", "")
        file_path = os.path.join(keywords_dir, file)

        try:
            # Load predicted keywords for the current movie
            selected_film = pd.read_pickle(file_path)
            selected_film_id = selected_film["Movie_ID"].iloc[0]

            # Retrieve ground truth keywords for the selected movie
            kw_ground_truth = keywords_ground_truth[keywords_ground_truth["Movie_ID"] == selected_film_id]
            gt_keywords = kw_ground_truth["Keyword"].tolist()

            # Initialize metrics for each model
            results = {model: [] for model in models_to_evaluate}

            # Evaluate each review in the dataset
            for _, row in selected_film.iterrows():
                for model in models_to_evaluate:
                    pred_col = f"keywords_{model}"

                    # Proceed only if predictions exist and are in the expected list format
                    if pred_col in row and isinstance(row[pred_col], list):
                        predicted_kw_score = row[pred_col]

                        # Extract only keyword strings for unweighted evaluation
                        pred_kw_only = [kw for kw, _ in predicted_kw_score if isinstance(kw, str)]

                        # Compute basic (unweighted) metrics
                        precision, recall, f1 = evaluate_keywords(pred_kw_only, gt_keywords)

                        # Compute score-weighted metrics
                        w_precision, w_recall, w_f1 = evaluate_keywords_weighted(predicted_kw_score, gt_keywords)

                        # Compute ranking-based metric (nDCG@5)
                        ndcg = compute_ndcg(predicted_kw_score, gt_keywords, k=5)

                        # Store all metrics for this review
                        results[model].append({
                            "precision": precision,
                            "recall": recall,
                            "f1": f1,
                            "w_precision": w_precision,
                            "w_recall": w_recall,
                            "w_f1": w_f1,
                            "ndcg@5": ndcg
                        })

            # Compute average metrics per model for the current movie
            for model in models_to_evaluate:
                if results[model]:
                    metrics_df = pd.DataFrame(results[model])
                    all_results.append({
                        "Movie": movie_name,
                        "Model": model,
                        "Avg_Precision": round(metrics_df["precision"].mean(), 4),
                        "Avg_Recall": round(metrics_df["recall"].mean(), 4),
                        "Avg_F1": round(metrics_df["f1"].mean(), 4),
                        "Avg_Weighted_Precision": round(metrics_df["w_precision"].mean(), 4),
                        "Avg_Weighted_Recall": round(metrics_df["w_recall"].mean(), 4),
                        "Avg_Weighted_F1": round(metrics_df["w_f1"].mean(), 4),
                        "Avg_nDCG@5": round(metrics_df["ndcg@5"].mean(), 4)
                    })

        except Exception as e:
            print(f"Error processing {file}: {e}")

# Create the final summary DataFrame
final_df = pd.DataFrame(all_results)

# Display table with clean index and sorted values
final_df_sorted = final_df.sort_values(by=["Movie", "Model"]).reset_index(drop=True)
final_df_sorted.style.format(precision=4).set_caption("Full Evaluation Summary per Movie and Model")


| | Movie | Model | Avg_Precision | Avg_Recall | Avg_F1 | Avg_Weighted_Precision | Avg_Weighted_Recall | Avg_Weighted_F1 | Avg_nDCG@5 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LaLaLand | base | 0.08 | 0.0014 | 0.0027 | 0.0798 | 0.0008 | 0.0015 | 0.1387 |
| 1 | LaLaLand | metadata | 0.08 | 0.0014 | 0.0027 | 0.0827 | 0.0009 | 0.0018 | 0.2 |
| 2 | Parasite | base | 0.16 | 0.0025 | 0.005 | 0.1564 | 0.0012 | 0.0023 | 0.3003 |
| 3 | Parasite | metadata | 0.12 | 0.0019 | 0.0037 | 0.1186 | 0.0011 | 0.0021 | 0.2003 |
