# Evaluation of KeyBERTMetadata

This notebook evaluates and compares different keyword extraction models applied to movie reviews. The focus is on understanding how well each model can extract meaningful keywords that align with the ground truth.

We aim to assess the performance of two models:
- **Base** KeyBERT model
- **Metadata-enhanced** version that integrates additional information: KeyBERTMetadata

The evaluation is performed on a set of reviews for a selected movie, where each model predicts a ranked list of top-5 keywords per review.

The notebook uses:
- A **ground truth dataset** of annotated keywords per movie retrieved from IMDB
- **Model outputs**: lists of predicted keywords with associated confidence scores for each review

### Evaluation Metrics

Three types of evaluation are conducted:

**1. Basic (Unweighted) Metrics**

- **Precision**, **Recall**, and **F1-score** based on approximate binary matching
- Each review is evaluated independently, and metrics are averaged

**2. Score-Aware Metrics**

- **Weighted Precision/Recall/F1**: matches are weighted by the model’s confidence scores
- **nDCG@5**: evaluates ranking quality of the predicted top-5 keywords

**3. Semantic Evaluation (Embedding-Based)**

- Instead of using strict text matching, this evaluation uses **sentence-transformer embeddings** to compute **cosine similarity** between predicted and ground truth keywords.
- A predicted keyword is considered correct if its similarity to any ground truth keyword exceeds a given threshold (e.g., 0.75).
- We then compute **semantic precision, recall, and F1-score** based on these soft matches.

### Why Not BERTScore?

Although **BERTScore** is a powerful metric for evaluating textual similarity, it is designed for **long-form text comparisons** (e.g., full sentences or summaries). In our case:

- Each review contains only **5 short keywords**, making token-level matching less informative.

- BERTScore requires **pairwise comparisons** with equal-sized candidate and reference sets, which does not align with our top-*k* ranking setup.

- It is also computationally expensive and not optimized for many short sequences.

Therefore, we adopt a **lightweight and more interpretable approach** using sentence embeddings and cosine similarity, tailored specifically to the **semantic similarity of keyword-level predictions**.


## Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

# List of required packages
required_packages = {
    "pandas", "numpy", "scikit-learn", "tqdm", "transformers", "torch"
}

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


  from .autonotebook import tqdm as notebook_tqdm


transformers is already installed.
tqdm is already installed.
pandas is already installed.
torch is already installed.
numpy is already installed.
Installing scikit-learn...


In [2]:
# Standard Library
import os      # File system operations (e.g., listing files)
import re      # Regular expressions for text processing
import math    # Mathematical functions (e.g., logarithms for nDCG calculation)

# Third-Party Libraries
import pandas as pd                  # Data manipulation with DataFrames
import numpy as np                   # Numerical computations and array operations
from tqdm import tqdm                # Progress bars for loops

# Evaluation metrics from scikit-learn
from sklearn.metrics import precision_score, recall_score, f1_score

# Transformers and PyTorch for embeddings and models
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


## Load Available Movies from Dataset

This section lists all the available movies stored as `.pkl` files inside the review dataset directory.

- It defines the root path (`../Dataset/Reviews_By_Movie`) where all review files are saved.
- It automatically detects and lists all movie filenames (removing the `.pkl` extension).

In [3]:
# Define root directory
root_dir = "../Dataset/Reviews_By_Movie"

# List all available movies
available_movies = sorted([f[:-4] for f in os.listdir(root_dir) if f.endswith(".pkl")])
print("Available movies:", available_movies)

Available movies: ['GoodBadUgly', 'HarryPotter', 'IndianaJones', 'LaLaLand', 'Oppenheimer', 'Parasite', 'SW_Episode1', 'SW_Episode2', 'SW_Episode3', 'SW_Episode4', 'SW_Episode5', 'SW_Episode6', 'SW_Episode7', 'SW_Episode8', 'SW_Episode9']


## Select a Movie and Load its Ground Truth Keywords

In this step, we load the keyword extraction results for a specific movie and retrieve the corresponding ground truth keywords. The goal is to use these annotated keywords for evaluation and comparison with automatically extracted ones.

In [4]:
# Set the name of the movie to be evaluated
movie_name = "SW_Episode6"

# Load the extracted keywords for the selected movie from a pickle file
# The file path is dynamically built using the movie name
selected_film = pd.read_pickle(f"../Dataset/Extracted_Keywords/kw_{movie_name}.pkl")

# Retrieve the Movie_ID of the selected film
# Assumes that the file contains a DataFrame with at least one row
selected_film_id = selected_film["Movie_ID"].iloc[0]

# Load the full dataset containing the ground truth keywords
# for all movies in the evaluation set
keywords = pd.read_pickle("../Dataset/keywords_ground_truth.pkl")

# Filter the ground truth dataset to extract only the keywords for the selected movie
kw_ground_truth = keywords[keywords["Movie_ID"] == selected_film_id]

## Keyword Matching and Evaluation Functions (Basic – Unweighted)

This block defines the baseline utility functions used to evaluate predicted keywords against the ground truth. These functions do **not** take into account keyword confidence scores or ranking—they perform **binary, unweighted evaluation**.

Specifically, this implementation includes:

- **Normalization**: keywords are converted to lowercase, stripped of punctuation, and cleaned of extra whitespace to ensure consistent matching.

- **Approximate Matching**: a relaxed rule that considers two keywords as matching if they are identical or if one is a substring of the other (e.g., *"social satire"* ≈ *"satire"*).

- **Evaluation**: standard metrics — **precision**, **recall**, and **F1-score** — are calculated based on the number of approximate matches between predicted and ground truth keywords.

This provides a basic but interpretable way to assess keyword extraction quality without considering the ranking or confidence scores assigned by the model.


In [5]:
# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()

# Approximate matching function:
# Returns True if the predicted keyword matches the ground truth exactly
# or if either keyword contains the other as a substring
def is_approx_match(kw, gt_keywords):
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False

# Evaluation function for a single prediction instance:
# - Normalizes both predicted and ground truth keywords
# - Computes how many predicted keywords approximately match the ground truth
# - Calculates precision, recall, and F1-score
def evaluate_keywords(pred_keywords, gt_keywords):
    pred_keywords = [normalize_kw(k) for k in pred_keywords]
    gt_keywords = [normalize_kw(k) for k in gt_keywords]
    
    # Count how many predicted keywords match approximately with any ground truth keyword
    match_count = sum([is_approx_match(k, gt_keywords) for k in pred_keywords])
    
    # Precision: percentage of predicted keywords that are correct
    precision = match_count / len(pred_keywords) if pred_keywords else 0

    # Recall: percentage of ground truth keywords that were correctly predicted
    recall = match_count / len(gt_keywords) if gt_keywords else 0

    # F1-score: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    
    return precision, recall, f1


### Evaluate and Compare Models on Keyword Extraction (Basic – Unweighted)

This section evaluates two keyword extraction models — **base** and **metadata-enhanced** — using the ground truth.

For each **review**, basic precision, recall, and F1-score are computed based on binary keyword matching. These metrics are then **averaged across all reviews** to provide an overall performance comparison between the models.

In [6]:
# Define the models to be evaluated
models_to_evaluate = ["base", "metadata"]

# Create a dictionary to store evaluation results (precision, recall, F1) for each model
results = {model: [] for model in models_to_evaluate}

# Extract the list of ground truth keywords for the selected movie (same for all reviews)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Iterate over each review in the selected film's predictions
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        
        # Column name with the predicted keywords for this model
        pred_col = f"keywords_{model}"
        
        # Proceed only if the column exists and contains a list of predicted keywords
        if pred_col in row and isinstance(row[pred_col], list):

            # Extract only the keyword strings from (keyword, score) tuples
            predicted_keywords = [kw for kw, _ in row[pred_col] if isinstance(kw, str)]
            
            # Evaluate the prediction using precision, recall, and F1-score
            precision, recall, f1 = evaluate_keywords(predicted_keywords, ground_truth_keywords)
            
            # Store the result for this specific review
            results[model].append({
                "precision": precision,
                "recall": recall,
                "f1": f1
            })

# Aggregate the results to compute average metrics for each model
summary = {}
for model in models_to_evaluate:
    precisions = [r["precision"] for r in results[model]]
    recalls = [r["recall"] for r in results[model]]
    f1s = [r["f1"] for r in results[model]]
    
    # Calculate the average of each metric and round to 4 decimal places
    summary[model] = {
        "avg_precision": round(np.mean(precisions), 4),
        "avg_recall": round(np.mean(recalls), 4),
        "avg_f1": round(np.mean(f1s), 4)
    }

# Convert the weighted summary dictionary to a pandas DataFrame
summary_df = pd.DataFrame(summary).T  # Transpose so models are rows

# Rename columns for better readability
summary_df.columns = [
    "Precision",
    "Recall",
    "F1-score",
]

# Display the summary table nicely
summary_df.style.format(precision=4).set_caption("Score-Aware Evaluation Summary")

Unnamed: 0,Precision,Recall,F1-score
base,0.5488,0.0101,0.0198
metadata,0.4671,0.0086,0.0169


## Score-Aware Evaluation: Weighted Metrics and nDCG@k with Graded Relevance

This extended evaluation considers the **confidence scores** assigned by the model to each predicted keyword, allowing us to measure not only whether the predictions are correct but also how confidently and effectively they are ranked.

#### Score-Aware Metrics

- **Weighted Precision**: Reflects the proportion of the model’s total confidence assigned to correct keywords. High confidence in incorrect predictions lowers this score.

- **Weighted Recall**: Measures how much of the ground truth is recovered, weighted by the confidence of correct predictions.

- **Weighted F1-score**: The harmonic mean of weighted precision and recall, balancing accuracy with coverage.

- **nDCG@k (Normalized Discounted Cumulative Gain)**: A ranking metric that rewards placing relevant keywords near the top of the prediction list. It uses **graded relevance**, which accounts for the importance of ground truth keywords based on their position.

#### How nDCG@k with Graded Relevance is Computed

1. **Assign graded relevance to ground truth keywords** based on their position $pos_{GT}$ (starting from 0). The relevance for a ground truth keyword at position $pos_{GT}$ is:

   $$
   rel_{GT} = \frac{1}{\log_2(pos_{GT} + 2)}
   $$

   Higher ranked keywords (lower $pos_{GT}$) have higher relevance scores.

2. **For each predicted keyword at position $i$ (starting from 0)**, find the best matching ground truth keyword (using approximate matching). Assign the relevance of the predicted keyword $rel_i$ as the graded relevance of its matched ground truth keyword:

   $$
   rel_i = \begin{cases}
   \frac{1}{\log_2(pos_{GT} + 2)} & \text{if predicted keyword matches GT keyword at } pos_{GT} \\
   0 & \text{otherwise}
   \end{cases}
   $$

3. **Compute Discounted Cumulative Gain (DCG) for the predicted keywords up to rank $k$**:

   $$
   DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
   $$

   This discounts the relevance by the predicted keyword’s position, rewarding relevant keywords ranked higher.

4. **Compute Ideal DCG (IDCG)** as the maximum possible DCG using the top-$k$ ground truth keywords ranked by their graded relevance:

   $$
   IDCG@k = \sum_{i=1}^{k} \frac{rel^{*}_i}{\log_2(i + 1)}
   $$

   where $rel^{*}_i$ are the graded relevance scores of the top-$k$ ground truth keywords sorted by importance.

5. **Calculate normalized DCG (nDCG)** by dividing DCG by IDCG:

   $$
   nDCG@k = \frac{DCG@k}{IDCG@k}
   $$

### Example ($k=5$)

Ground truth keywords ranked by importance:  
`["fraud", "poverty", "scam"]`

Their graded relevance:  
- "fraud" at position 0 → $rel_{GT} = \frac{1}{\log_2(0 + 2)} = 1.0$  
- "poverty" at position 1 → $rel_{GT} = \frac{1}{\log_2(1 + 2)} = 0.63$  
- "scam" at position 2 → $rel_{GT} = \frac{1}{\log_2(2 + 2)} = 0.5$

**Predicted keywords in order:**  
`["scam", "family", "poverty", "cinematography", "fraud"]`

Matching relevances assigned to predicted keywords:  
- "scam" matches GT at pos 2 → $rel_0 = 0.5$  
- "family" no match → $rel_1 = 0$  
- "poverty" matches GT at pos 1 → $rel_2 = 0.63$  
- "cinematography" no match → $rel_3 = 0$  
- "fraud" matches GT at pos 0 → $rel_4 = 1.0$

Compute DCG:

$$
DCG = \frac{0.5}{\log_2(1 + 1)} + \frac{0}{\log_2(2 + 1)} + \frac{0.63}{\log_2(3 + 1)} + \frac{0}{\log_2(4 + 1)} + \frac{1.0}{\log_2(5 + 1)} \approx 0.5 + 0 + 0.315 + 0 + 0.387 = 1.202
$$

Compute IDCG (ideal predicted order: "fraud", "poverty", "scam"):

$$
IDCG = \frac{1.0}{\log_2(1 + 1)} + \frac{0.63}{\log_2(2 + 1)} + \frac{0.5}{\log_2(3 + 1)} = 1.0 + 0.397 + 0.25 = 1.647
$$

Then,

$$
nDCG@5 = \frac{1.202}{1.647} \approx 0.73
$$

**Change predicted order to:**  
`["fraud", "poverty", "scam", "family", "cinematography"]`

Relevances for predicted keywords:

- "fraud" matches GT at pos 0 → $rel_0 = 1.0$  
- "poverty" matches GT at pos 1 → $rel_1 = 0.63$  
- "scam" matches GT at pos 2 → $rel_2 = 0.5$  
- "family" no match → $rel_3 = 0$  
- "cinematography" no match → $rel_4 = 0$

Compute DCG:

$$
DCG = \frac{1.0}{\log_2(1 + 1)} + \frac{0.63}{\log_2(2 + 1)} + \frac{0.5}{\log_2(3 + 1)} + 0 + 0 = 1.0 + 0.397 + 0.25 = 1.647
$$

IDCG is the same as before.

$$
nDCG@5 = \frac{1.647}{1.647} = 1.0
$$

Changing the order of predicted keywords **does affect** the nDCG score: placing highly relevant keywords earlier leads to higher nDCG, reflecting better ranking quality.

- When relevant keywords appear early in the predicted list, the score increases due to less discounting.
- Conversely, when relevant keywords are placed lower, the score decreases because of higher discounting.
- This metric thus rewards **both correct prediction and the quality of their ranking**.


In [7]:
# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()


# Approximate matching function:
# Returns True if the predicted keyword matches any ground truth keyword
# using a relaxed comparison: exact match or substring containment
def is_approx_match(kw, gt_keywords):
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False

# Weighted evaluation function:
# Calculates precision, recall, and F1-score using the confidence scores of predicted keywords
# - High-confidence correct predictions contribute more
# - Precision is score-weighted; recall divides by total ground truth
def evaluate_keywords_weighted(predicted_kw_score, gt_keywords):
    """
    Evaluate predicted keywords with confidence scores using weighted precision, recall, and F1.
    
    Parameters:
        predicted_kw_score (list of (str, float)): predicted keywords with associated confidence scores
        gt_keywords (list of str): ground truth keywords (annotated)
    
    Returns:
        (precision, recall, f1): all metrics computed using score-weighted matching
    """
    # Normalize both predicted and ground truth keywords
    gt_keywords = [normalize_kw(k) for k in gt_keywords]
    pred_keywords = [(normalize_kw(kw), score) for kw, score in predicted_kw_score if isinstance(kw, str)]
    
    total_score = sum(score for _, score in pred_keywords)
    if total_score == 0:
        return 0, 0, 0

    # Compute total score of predicted keywords that approximately match the ground truth
    match_score = sum(score for kw, score in pred_keywords if is_approx_match(kw, gt_keywords))
    
    # Weighted precision and recall
    precision = match_score / total_score
    recall = match_score / len(gt_keywords) if gt_keywords else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0

    return precision, recall, f1

# Ranking-based evaluation function:
# Computes the normalized Discounted Cumulative Gain (nDCG@k) for predicted keywords
# Gives more credit when correct keywords appear earlier in the ranking
def compute_ndcg(predicted_kw_score, gt_keywords, k=5):
    """
    Compute nDCG@k between predicted keywords (with scores) and ground truth keywords,
    using graded relevance based on ground truth ranking and approximate matching.

    Parameters:
        predicted_kw_score (list of (str, float)): predicted keywords with confidence scores
        gt_keywords (list of str): ground truth keywords ordered by importance
        k (int): top-k keywords to consider

    Returns:
        float: normalized DCG score
    """
    # Normalize ground truth and predicted keywords
    gt_keywords_norm = [normalize_kw(k) for k in gt_keywords]
    pred_keywords_norm = [normalize_kw(kw) for kw, _ in predicted_kw_score[:k]]

    relevance = []
    for pk in pred_keywords_norm:
        # Find ranks of all GT keywords matching predicted keyword approx.
        match_ranks = [i for i, gk in enumerate(gt_keywords_norm) if is_approx_match(pk, [gk])]
        if match_ranks:
            # Assign relevance inversely proportional to rank (log discount)
            best_rank = min(match_ranks)
            rel = 1 / math.log2(best_rank + 2)  # +2 since ranks start at 0
        else:
            rel = 0
        relevance.append(rel)

    # Compute DCG with graded relevance
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

    # Compute ideal DCG (IDCG) assuming best possible ordering (top k GT keywords)
    ideal_relevance = [1 / math.log2(i + 2) for i in range(min(k, len(gt_keywords_norm)))]
    idcg = sum(ideal_relevance)

    return dcg / idcg if idcg > 0 else 0.0


### Evaluate and Compare Models on Keyword Extraction (Weighted)

In this section, we apply the score-aware evaluation metrics to each review for both models:

- **Weighted Precision, Recall, F1**: accounts for the confidence scores of each predicted keyword.
- **nDCG@5**: evaluates the ranking quality of the top-5 keywords based on their alignment with the ground truth.

Each review is evaluated individually, and the metrics are then averaged across all reviews to summarize model performance.

In [8]:
# Models to evaluate
models_to_evaluate = ["base", "metadata"]

# Initialize results dictionary for weighted metrics and nDCG
weighted_results = {model: [] for model in models_to_evaluate}

# Ground truth keywords (same for all reviews in the selected film)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Loop through each review
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"
        
        # Skip if no prediction or wrong format
        if pred_col in row and isinstance(row[pred_col], list):
            predicted_kw_score = row[pred_col]  # list of (kw, score)

            # Compute weighted metrics
            w_precision, w_recall, w_f1 = evaluate_keywords_weighted(predicted_kw_score, ground_truth_keywords)

            # Compute nDCG@5
            ndcg = compute_ndcg(predicted_kw_score, ground_truth_keywords, k=5)

            # Save results
            weighted_results[model].append({
                "weighted_precision": w_precision,
                "weighted_recall": w_recall,
                "weighted_f1": w_f1,
                "ndcg@5": ndcg
            })

# Compute average metrics across all reviews
weighted_summary = {}
for model in models_to_evaluate:
    metrics = weighted_results[model]
    weighted_summary[model] = {
        "avg_weighted_precision": round(np.mean([m["weighted_precision"] for m in metrics]), 4),
        "avg_weighted_recall": round(np.mean([m["weighted_recall"] for m in metrics]), 4),
        "avg_weighted_f1": round(np.mean([m["weighted_f1"] for m in metrics]), 4),
        "avg_ndcg@5": round(np.mean([m["ndcg@5"] for m in metrics]), 4)
    }

# Convert the weighted summary dictionary to a pandas DataFrame
summary_df = pd.DataFrame(weighted_summary).T  # Transpose so models are rows

# Rename columns for better readability
summary_df.columns = [
    "Weighted Precision",
    "Weighted Recall",
    "Weighted F1-score",
    "nDCG@5"
]

# Display the summary table
summary_df.style.format(precision=4).set_caption("Score-Aware Evaluation Summary")


Unnamed: 0,Weighted Precision,Weighted Recall,Weighted F1-score,nDCG@5
base,0.5539,0.0052,0.0102,0.2272
metadata,0.4702,0.0051,0.01,0.18


## Semantic Evaluation (Base vs Metadata)

In this section, we evaluate and compare the **Base** and **Metadata-enhanced** keyword extraction models using a **semantic similarity approach** based on contextual embeddings.

Traditional evaluation metrics typically check for exact or approximate string matches between predicted and ground truth keywords. However, this can miss semantically related terms that are lexically different but convey the same meaning — for example, *"scam"* and *"fraud"*.

To overcome this limitation, we leverage **sentence embeddings** generated by a pre-trained transformer model (such as a sentence-transformer). Each keyword — both predicted and ground truth — is converted into a dense vector representation that captures its semantic content.

How the semantic evaluation works in detail:

1. **Embedding keywords**:  
    Both predicted keywords and ground truth keywords are embedded into high-dimensional vectors using the same model. The ground truth keywords are embedded **once beforehand** to avoid redundant computations. These embeddings are normalized to ensure cosine similarity is a valid similarity measure.

2. **Computing similarity scores**:  
   We calculate the **cosine similarity** between every predicted keyword embedding and every ground truth keyword embedding, resulting in a similarity matrix.

3. **Determining matches using a threshold**:  
   A predicted keyword is considered a **semantic match** if its cosine similarity with at least one ground truth keyword exceeds a set threshold (e.g., 0.75). This threshold balances between strictness and flexibility in matching semantic content.

4. **Calculating semantic precision**:  
   This is the fraction of predicted keywords that have a semantic match in the ground truth. It reflects how many of the model’s predictions are meaningful and relevant.

5. **Calculating semantic recall**:  
   This is the fraction of ground truth keywords that are captured by semantically similar predicted keywords. It indicates how well the model covers the essential concepts of the ground truth.

6. **Calculating semantic F1-score**:  
   The harmonic mean of semantic precision and recall, providing a single measure that balances both aspects.

By evaluating keyword extraction in this semantic space, the metric is more robust to lexical variation and better reflects the true relevance of the predictions. This provides a deeper understanding of how well each model captures the **meaning** behind the ground truth keywords, beyond surface-level text matches.

In [9]:
# Load a sentence embedding model from the SentenceTransformers family
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model to generate contextual embeddings
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).to(device)

def embed_keywords(keywords, device="cuda"):
    """
    Compute sentence embeddings for a list of keyword strings.

    Parameters:
    ----------
    keywords : List[str]
        A list of keyword strings to encode.
    device : str
        Device to run the model on ('cuda' or 'cpu').

    Returns:
    -------
    torch.Tensor
        Normalized embeddings tensor of shape (num_keywords, embedding_dim).
    """
    # Return empty tensor if input list is empty
    if not keywords:
        return torch.empty(0, encoder.config.hidden_size).to(device)

    # Tokenize and prepare inputs for the model
    inputs = tokenizer(keywords, padding=True, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        # Forward pass through the encoder to get hidden states
        outputs = encoder(**inputs)

        # Use mean pooling on the last hidden state to get fixed-size embeddings
        embeddings = outputs.last_hidden_state.mean(dim=1)

        # Normalize embeddings to unit length for cosine similarity computations
        embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings


def compute_semantic_metrics(pred_keywords, gt_embeddings, threshold=0.75, device="cuda"):
    """
    Compute semantic precision, recall, and F1 score between predicted keywords and
    ground truth embeddings based on cosine similarity.

    Parameters:
    ----------
    pred_keywords : List[str]
        List of predicted keywords for a single review.
    gt_embeddings : torch.Tensor
        Pre-computed normalized embeddings of ground truth keywords.
    threshold : float
        Cosine similarity threshold above which a predicted keyword is considered
        semantically matching a ground truth keyword.
    device : str
        Device to run computations on.

    Returns:
    -------
    precision : float
        Fraction of predicted keywords that match any ground truth keyword semantically.
    recall : float
        Fraction of ground truth keywords that are matched by any predicted keyword.
    f1 : float
        Harmonic mean of precision and recall.
    """
    # Handle empty predictions or empty ground truth embeddings edge cases
    if len(pred_keywords) == 0 or gt_embeddings.shape[0] == 0:
        return 0.0, 0.0, 0.0

    # Compute embeddings for the predicted keywords only
    pred_emb = embed_keywords(pred_keywords, device=device)

    # Compute cosine similarity matrix between predicted and ground truth embeddings
    # Shape: (num_predicted_keywords, num_ground_truth_keywords)
    sims = torch.matmul(pred_emb, gt_embeddings.T)

    # A predicted keyword is counted as a match if it has cosine similarity above
    # the threshold with at least one ground truth keyword
    pred_matches = (sims > threshold).any(dim=1).float().sum().item()

    # Similarly, a ground truth keyword is matched if any predicted keyword exceeds threshold
    gt_matches = (sims > threshold).any(dim=0).float().sum().item()

    # Calculate precision: matched predictions / total predictions
    precision = pred_matches / len(pred_keywords)

    # Calculate recall: matched ground truths / total ground truth keywords
    recall = gt_matches / gt_embeddings.shape[0]

    # Compute harmonic mean for F1 score, handling zero division
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return precision, recall, f1

### Semantic Evaluation of Base and Metadata Models Using Sentence Embeddings

In this step, we evaluate the semantic similarity between the predicted keywords of two models — **Base** and **Metadata-enhanced** — and the ground truth keywords using **sentence embeddings**.

Unlike previous evaluations based on exact or approximate matching, this method leverages contextual embeddings from a pre-trained transformer to measure how semantically close the predicted keywords are to the reference keywords.

For each review:
- We extract only the **text of the predicted keywords**, ignoring their confidence scores.
- We compute **semantic precision, recall, and F1** based on cosine similarity between embeddings.
- We then average these metrics across all reviews for each model, providing an overall **semantic performance assessment**.

In [10]:
# Precompute embeddings for the ground truth keywords once per selected movie
# This avoids redundant computation when comparing against multiple predicted keywords
gt_keywords = kw_ground_truth["Keyword"].tolist()
gt_emb = embed_keywords(gt_keywords, device=device)

# Define the models to evaluate
models_to_evaluate = ["base", "metadata"]

# List to collect semantic evaluation results for each model
semantic_scores = []

# Loop over each model to evaluate semantic metrics separately
for model in models_to_evaluate:
    # Lists to accumulate precision, recall, and F1 scores for each review
    all_precisions = []
    all_recalls = []
    all_f1s = []

    # Iterate over each review (row) in the selected movie's predictions, with a progress bar
    for _, row in tqdm(selected_film.iterrows(), total=len(selected_film), desc=f"Semantic metrics - {model}"):
        pred_col = f"keywords_{model}"  # Column name for predicted keywords of the current model

        # Check if the predicted keywords column exists and contains a list
        if pred_col in row and isinstance(row[pred_col], list):
            # Extract only the keyword strings (ignore confidence scores)
            pred_kw = [kw for kw, _ in row[pred_col] if isinstance(kw, str)]

            # Compute semantic precision, recall, and F1 between predicted keywords and precomputed GT embeddings
            precision, recall, f1 = compute_semantic_metrics(pred_kw, gt_emb, device=device)

            # Append the scores for aggregation later
            all_precisions.append(precision)
            all_recalls.append(recall)
            all_f1s.append(f1)

    # After processing all reviews, calculate average semantic scores for the current model
    if all_f1s:
        semantic_scores.append({
            "Model": model,
            "Semantic_Precision": round(sum(all_precisions) / len(all_precisions), 4),
            "Semantic_Recall": round(sum(all_recalls) / len(all_recalls), 4),
            "Semantic_F1": round(sum(all_f1s) / len(all_f1s), 4)
        })

# Convert the list of dictionaries into a pandas DataFrame and set 'Model' as the index
summary_df = pd.DataFrame(semantic_scores).set_index("Model")

# Display the semantic evaluation summary as a nicely formatted table with 4 decimal places
summary_df.style.format(precision=4).set_caption("Semantic-Aware Evaluation Summary")


Semantic metrics - base: 100%|██████████| 1016/1016 [00:15<00:00, 63.70it/s]
Semantic metrics - metadata: 100%|██████████| 1016/1016 [00:16<00:00, 62.06it/s]


Unnamed: 0_level_0,Semantic_Precision,Semantic_Recall,Semantic_F1
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
base,0.3537,0.0103,0.0199
metadata,0.2748,0.0078,0.0151


## Evaluation Across All Movies

This section automatically processes all `.pkl` files in the `Extracted_Keywords` directory, where each file corresponds to a single movie and contains predicted keywords generated by different models.

For each movie:
- The corresponding ground truth keywords are loaded.
- Predicted keywords from both models — **Base** and **Metadata-enhanced** — are evaluated.
- For each review, the following metrics are computed:

  - **Unweighted Metrics**: Precision, Recall, and F1-score based on approximate matching.

  - **Score-aware Metrics**: Weighted Precision, Weighted Recall, Weighted F1, and nDCG@5 to evaluate prediction confidence and ranking quality.
  
  - **Semantic Metrics**: Semantic Precision, Semantic Recall, and Semantic F1 computed using cosine similarity between sentence embeddings.

Finally, the metrics are averaged per movie and per model, and compiled into a comprehensive summary table for comparison.

In [11]:
# Paths
keywords_dir = "../Dataset/Extracted_Keywords/"
ground_truth_path = "../Dataset/keywords_ground_truth.pkl"

# Load the ground truth once for all movies
keywords_ground_truth = pd.read_pickle(ground_truth_path)

# List of models to evaluate
models_to_evaluate = ["base", "metadata"]

# Store results across all movies
all_results = []

# Iterate over all keyword prediction files
for file in os.listdir(keywords_dir):
    if file.endswith(".pkl") and file.startswith("kw_"):
        movie_name = file.replace("kw_", "").replace(".pkl", "")
        file_path = os.path.join(keywords_dir, file)

        try:
            # Load predicted keywords for the current movie
            selected_film = pd.read_pickle(file_path)
            selected_film_id = selected_film["Movie_ID"].iloc[0]

            # Retrieve ground truth keywords for the selected movie
            kw_ground_truth = keywords_ground_truth[keywords_ground_truth["Movie_ID"] == selected_film_id]
            gt_keywords = kw_ground_truth["Keyword"].tolist()

            # Precompute GT embeddings once per movie for efficiency
            gt_embeddings = embed_keywords(gt_keywords, device=device)

            # Initialize metrics for each model
            results = {model: [] for model in models_to_evaluate}

            # Evaluate each review in the dataset
            for _, row in selected_film.iterrows():
                for model in models_to_evaluate:
                    pred_col = f"keywords_{model}"

                    # Skip if no predictions or wrong format
                    if pred_col in row and isinstance(row[pred_col], list):
                        predicted_kw_score = row[pred_col]

                        # Extract only keyword strings for unweighted evaluation
                        pred_kw_only = [kw for kw, _ in predicted_kw_score if isinstance(kw, str)]

                        # Compute basic (unweighted) metrics
                        precision, recall, f1 = evaluate_keywords(pred_kw_only, gt_keywords)

                        # Compute score-weighted metrics
                        w_precision, w_recall, w_f1 = evaluate_keywords_weighted(predicted_kw_score, gt_keywords)

                        # Compute ranking-based metric (nDCG@5)
                        ndcg = compute_ndcg(predicted_kw_score, gt_keywords, k=5)

                        # Compute semantic metrics using embeddings
                        semantic_precision, semantic_recall, semantic_f1 = compute_semantic_metrics(
                            pred_kw_only, gt_embeddings, threshold=0.5, device=device
                        )

                        # Store all metrics for this review
                        results[model].append({
                            "precision": precision,
                            "recall": recall,
                            "f1": f1,
                            "w_precision": w_precision,
                            "w_recall": w_recall,
                            "w_f1": w_f1,
                            "ndcg@5": ndcg,
                            "semantic_precision": semantic_precision,
                            "semantic_recall": semantic_recall,
                            "semantic_f1": semantic_f1
                        })

            # Compute average metrics per model for the current movie
            for model in models_to_evaluate:
                if results[model]:
                    metrics_df = pd.DataFrame(results[model])
                    all_results.append({
                        "Movie": movie_name,
                        "Model": model,
                        "Avg_Precision": round(metrics_df["precision"].mean(), 4),
                        "Avg_Recall": round(metrics_df["recall"].mean(), 4),
                        "Avg_F1": round(metrics_df["f1"].mean(), 4),
                        "Avg_Weighted_Precision": round(metrics_df["w_precision"].mean(), 4),
                        "Avg_Weighted_Recall": round(metrics_df["w_recall"].mean(), 4),
                        "Avg_Weighted_F1": round(metrics_df["w_f1"].mean(), 4),
                        "Avg_nDCG@5": round(metrics_df["ndcg@5"].mean(), 4),
                        "Avg_Semantic_Precision": round(metrics_df["semantic_precision"].mean(), 4),
                        "Avg_Semantic_Recall": round(metrics_df["semantic_recall"].mean(), 4),
                        "Avg_Semantic_F1": round(metrics_df["semantic_f1"].mean(), 4),
                    })

        except Exception as e:
            print(f"Error processing {file}: {e}")

# Create the final summary DataFrame
final_df = pd.DataFrame(all_results)

# Display table with clean index and sorted values
final_df_sorted = final_df.sort_values(by=["Movie", "Model"]).reset_index(drop=True)
final_df_sorted.style.format(precision=4).set_caption("Full Evaluation Summary per Movie and Model")

Unnamed: 0,Movie,Model,Avg_Precision,Avg_Recall,Avg_F1,Avg_Weighted_Precision,Avg_Weighted_Recall,Avg_Weighted_F1,Avg_nDCG@5,Avg_Semantic_Precision,Avg_Semantic_Recall,Avg_Semantic_F1
0,SW_Episode6,base,0.5488,0.0101,0.0198,0.5539,0.0052,0.0102,0.2272,0.9116,0.1918,0.2882
1,SW_Episode6,metadata,0.4671,0.0086,0.0169,0.4702,0.0051,0.01,0.18,0.8823,0.1614,0.2449
