# Evaluation of KeyBERTMetadata

This notebook evaluates and compares two different models:

- **Base**: the standard KeyBERT model
- **Metadata-enhanced**: an extended version, called *KeyBERTMetadata*, which integrates contextual information from the review metadata.

Each model predicts a ranked list of top-5 keywords for each review. However, all evaluations in this notebook are **performed globally**, by aggregating predictions across all reviews of the same movie.

The notebook uses:
- A **ground truth dataset** containing annotated keywords per movie, derived from IMDB.
- **Model predictions**: for each movie, a set of predicted keywords with associated confidence scores is extracted across all reviews.

We conduct three types of global evaluation:

#### **1. Basic (Unweighted) Metrics**

- **Precision**, **Recall**, and **F1-score** are computed based on exact (normalized) string matching between predicted and ground truth keywords.
- Predictions are aggregated across all reviews, and metrics are calculated on the global set of unique keywords per model.

#### **2. Score-Aware Metrics**

- **Weighted Precision, Recall, and F1-score**: keywords are matched against ground truth, weighting each match by the model’s confidence score.
- **nDCG@5** (Normalized Discounted Cumulative Gain): evaluates how well the model ranks relevant keywords within its top-5 predictions.

#### **3. Semantic Evaluation (Embedding-Based)**

- Predicted and ground truth keywords are encoded using **sentence-transformer embeddings**.
- **Cosine similarity** is used to identify soft matches between keywords.
- A predicted keyword is considered correct if its similarity with any ground truth keyword exceeds a threshold (e.g., 0.75).
- Based on these matches, we compute **semantic precision, recall, and F1-score**.

**Why Not Use BERTScore?**

Although **BERTScore** is a powerful metric for evaluating textual similarity, it is designed for **long-form text comparisons** (e.g., sentences or summaries). It is **not appropriate for evaluating keyword-level predictions**, for several reasons:

- Each model generates **short keyword lists**, where token-level similarity is not meaningful.
- BERTScore expects **equal-length** candidate and reference sequences, which is incompatible with the top-*k* keyword setting.
- It is computationally intensive and less interpretable for sparse keyword evaluation.

Instead, we adopt a **faster and more interpretable approach** using sentence embeddings and cosine similarity, tailored specifically to **keyword-level semantic evaluation**.

## Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

# List of required packages
required_packages = {
    "pandas", "numpy", "tqdm", "transformers", "torch"
}

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


numpy is already installed.
tqdm is already installed.
torch is already installed.


  from .autonotebook import tqdm as notebook_tqdm


transformers is already installed.
pandas is already installed.


In [2]:
# Standard Library
import os      # File system operations (e.g., listing files)
import re      # Regular expressions for text processing
import math    # Mathematical functions (e.g., logarithms for nDCG calculation)

# Third-Party Libraries
import pandas as pd                  # Data manipulation with DataFrames
import numpy as np                   # Numerical computations and array operations
from tqdm import tqdm                # Progress bars for loops

# Transformers and PyTorch for embeddings and models
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


## Load Available Movies from Dataset

This section lists all the available movies stored as `.pkl` files inside the review dataset directory.

- It defines the root path (`../Dataset/Reviews_By_Movie`) where all review files are saved.
- It automatically detects and lists all movie filenames (removing the `.pkl` extension).

In [3]:
# Define root directory
root_dir = "../Dataset/Reviews_By_Movie"

# List all available movies
available_movies = sorted([f[:-4] for f in os.listdir(root_dir) if f.endswith(".pkl")])
print("Available movies:", available_movies)

Available movies: ['GoodBadUgly', 'HarryPotter', 'IndianaJones', 'LaLaLand', 'Oppenheimer', 'Parasite', 'SW_Episode1', 'SW_Episode2', 'SW_Episode3', 'SW_Episode4', 'SW_Episode5', 'SW_Episode6', 'SW_Episode7', 'SW_Episode8', 'SW_Episode9']


## Select a Movie and Load its Ground Truth Keywords

In this step, we load the keyword extraction results for a specific movie and retrieve the corresponding ground truth keywords. The goal is to use these annotated keywords for evaluation and comparison with automatically extracted ones.

In [4]:
# Set the name of the movie to be evaluated
movie_name = "SW_Episode6"

# Load the extracted keywords for the selected movie from a pickle file
# The file path is dynamically built using the movie name
selected_film = pd.read_pickle(f"../Dataset/Extracted_Keywords/kw_{movie_name}.pkl")

# Retrieve the Movie_ID of the selected film
# Assumes that the file contains a DataFrame with at least one row
selected_film_id = selected_film["Movie_ID"].iloc[0]

# Load the full dataset containing the ground truth keywords
# for all movies in the evaluation set
keywords = pd.read_pickle("../Dataset/keywords_ground_truth.pkl")

# Filter the ground truth dataset to extract only the keywords for the selected movie
kw_ground_truth = keywords[keywords["Movie_ID"] == selected_film_id]

## Keyword Matching and Evaluation Functions (Basic – Unweighted)

This block defines the core utility functions used to evaluate predicted keywords against the ground truth. These functions perform a **binary, unweighted evaluation**, ignoring confidence scores and ranking information.

The evaluation pipeline includes the following steps:

- **Normalization**: all keywords are lowercased, stripped of punctuation, and cleaned of extra whitespace to ensure consistent text matching.

- **Approximate Matching**: a relaxed rule considers two keywords as a match if:
  - They are exactly equal (after normalization), or
  - One is a substring of the other (e.g., *"social satire"* is considered a match with *"satire"*).

- **Global Evaluation**: for each model, all keywords predicted across the reviews of a given movie are aggregated, and then compared to the global set of ground truth keywords for that movie.

- **Metrics**: we compute **Precision**, **Recall**, and **F1-score** based on the number of approximate matches between the predicted and ground truth keywords.


In [5]:
def normalize_kw(kw):
    """
    Normalize a keyword string by:
    - Converting to lowercase
    - Removing punctuation and non-alphanumeric characters (except spaces)
    - Stripping leading and trailing whitespace

    Args:
        kw (str): The keyword string to normalize.

    Returns:
        str: The normalized keyword.
    """
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumeric characters and whitespace
    return kw.strip()


def is_approx_match(kw, gt_keywords):
    """
    Check if a predicted keyword approximately matches any ground truth keyword.

    A match is considered approximate if:
    - The predicted keyword is exactly equal to a ground truth keyword
    - OR the predicted keyword is a substring of a ground truth keyword
    - OR a ground truth keyword is a substring of the predicted one

    Args:
        kw (str): The normalized predicted keyword.
        gt_keywords (List[str]): A list of normalized ground truth keywords.

    Returns:
        bool: True if an approximate match is found, False otherwise.
    """
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False



def evaluate_keywords(all_pred_keywords, all_gt_keywords):
    """
    Evaluate global precision, recall, and F1-score across a dataset using approximate matching.

    This function compares predicted keywords to ground truth keywords for each review.
    Matching is performed using approximate string comparison, and each ground truth keyword
    can be matched only once to ensure fairness. The metrics are aggregated globally,
    not per-review.

    Args:
        all_pred_keywords (List[List[str]]): 
            A list where each element is a list of predicted keywords for a single review.
        all_gt_keywords (List[List[str]]): 
            A list where each element is a list of ground truth keywords for the corresponding review.

    Returns:
        Tuple[float, float, float]: Global precision, recall, and F1-score based on approximate matching.
    """
    global_match_count = 0     # Total number of matched keywords across all reviews
    global_pred_count = 0      # Total number of predicted keywords
    global_gt_count = 0        # Total number of ground truth keywords

    # Iterate through each review's predictions and ground truths
    for pred_keywords, gt_keywords in zip(all_pred_keywords, all_gt_keywords):
        # Normalize and sort keywords to ensure consistent behavior
        pred_keywords = sorted([normalize_kw(k) for k in pred_keywords])
        gt_keywords = sorted([normalize_kw(k) for k in gt_keywords])

        global_pred_count += len(pred_keywords)
        global_gt_count += len(gt_keywords)

        matched_gts = set()  # Track which ground truth keywords have already been matched

        for pred in pred_keywords:
            for gt in gt_keywords:
                if gt not in matched_gts and is_approx_match(pred, [gt]):
                    global_match_count += 1
                    matched_gts.add(gt)  # Avoid matching the same GT keyword multiple times
                    break

    # Compute global metrics
    precision = global_match_count / global_pred_count if global_pred_count else 0
    recall = global_match_count / global_gt_count if global_gt_count else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0

    return precision, recall, f1



### Evaluate and Compare Models on Keyword Extraction (Basic – Unweighted)

This section evaluates two keyword extraction models — **base** and **metadata-enhanced** — against the ground truth annotations.

For each model, we collect all predicted keywords across all reviews in the selected movie and compare them to the ground truth keywords using **binary approximate matching**.

The evaluation computes **global precision, recall, and F1-score**, considering the entire set of predictions and ground truth keywords as a whole.

In [6]:
# Define the models to be evaluated
models_to_evaluate = ["base", "metadata"]

# Extract the list of ground truth keywords for the selected movie
ground_truth_keywords = [normalize_kw(kw) for kw in kw_ground_truth["Keyword"].tolist()]

# Dictionary to store all predicted keywords per model (across all reviews)
all_predictions = {model: [] for model in models_to_evaluate}

# Iterate over each review in the selected film's predictions
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"

        if pred_col in row and isinstance(row[pred_col], list):
            predicted_keywords = [
                normalize_kw(kw) for kw, _ in row[pred_col] if isinstance(kw, str)
            ]
            
            # Remove duplicates per review
            seen = set()
            unique_kw = [kw for kw in predicted_keywords if kw not in seen and not seen.add(kw)]

            all_predictions[model].append(unique_kw)

# Evaluate each model globally
summary = {}
for model in models_to_evaluate:
    precision, recall, f1 = evaluate_keywords(
        all_predictions[model],  # List of lists
        ground_truth_keywords
    )

    summary[model] = {
        "Precision": round(precision, 4),
        "Recall": round(recall, 4),
        "F1-score": round(f1, 4)
    }

# Convert and display
summary_df = pd.DataFrame(summary).T
summary_df.columns = ["Precision", "Recall", "F1-score"]
summary_df.style.format(precision=4).set_caption("Global Evaluation Summary")


Unnamed: 0,Precision,Recall,F1-score
base,0.9199,0.3577,0.5151
metadata,0.9206,0.358,0.5155


## Score-Aware Evaluation: Weighted Metrics and nDCG@k with Graded Relevance

This extended evaluation considers the **confidence scores** assigned by the model to each predicted keyword, allowing us to measure not only whether the predictions are correct but also how confidently and effectively they are ranked.

#### **Score-Aware Metrics**

- **Weighted Precision**: Reflects the proportion of the model’s total confidence assigned to correct keywords. High confidence in incorrect predictions lowers this score.
- **Weighted Recall**: Measures how much of the ground truth is recovered, weighted by the confidence of correct predictions.
- **Weighted F1-score**: The harmonic mean of weighted precision and recall, balancing accuracy with coverage.
- **nDCG@k (Normalized Discounted Cumulative Gain)**: A ranking metric that rewards placing relevant keywords near the top of the prediction list. It uses **graded relevance**, which accounts for the importance of ground truth keywords based on their position.

#### **How nDCG@k with Graded Relevance is Computed**

1. **Assign graded relevance to ground truth keywords** based on their position $pos_{GT}$ (starting from 0):

$$
rel_{GT} = \frac{1}{\log_2(pos_{GT} + 2)}
$$

2. **Assign relevance to each predicted keyword at position $i$** (starting from 0), using approximate matching:

$$
rel_i = \begin{cases}
\frac{1}{\log_2(pos_{GT} + 2)} & \text{if predicted keyword matches GT keyword at } pos_{GT} \\
0 & \text{otherwise}
\end{cases}
$$

3. **Compute DCG@k** (Discounted Cumulative Gain):

$$
DCG@k = \sum_{i=0}^{k-1} \frac{rel_i}{\log_2(i + 2)}
$$

4. **Compute IDCG@k** (Ideal DCG using the best ranking):

$$
IDCG@k = \sum_{i=0}^{k-1} \frac{rel^*_i}{\log_2(i + 2)}
$$

5. **Compute normalized nDCG**:

$$
nDCG@k = \frac{DCG@k}{IDCG@k}
$$

#### **Example ($k=5$)**

**Ground truth keywords (ranked):**  
`["fraud", "poverty", "scam"]`

**Their graded relevance (using $rel_{GT} = 1/\log_2(pos_{GT}+2)$):**

- fraud (position 0): $1 / \log_2(0+2) = 1.0$
- poverty (position 1): $1 / \log_2(1+2) \approx 0.6309$
- scam (position 2): $1 / \log_2(2+2) = 0.5$

#### **First predicted list**:
`["scam", "family", "poverty", "cinematography", "fraud"]`

**Matches and assigned relevances:**

| Predicted keyword | Match         | Relevance |
|-------------------|---------------|-----------|
| scam              | yes (pos 2)   | 0.5       |
| family            | no            | 0         |
| poverty           | yes (pos 1)   | 0.6309    |
| cinematography    | no            | 0         |
| fraud             | yes (pos 0)   | 1.0       |

**Compute DCG:**

$$
DCG = \frac{0.5}{\log_2(0 + 2)} + \frac{0}{\log_2(1 + 2)} + \frac{0.6309}{\log_2(2 + 2)} + \frac{0}{\log_2(3 + 2)} + \frac{1.0}{\log_2(4 + 2)} \\
= \frac{0.5}{1.0} + 0 + \frac{0.6309}{2.0} + 0 + \frac{1.0}{2.58496} \approx 0.5 + 0 + 0.31545 + 0 + 0.38685 = \mathbf{1.2023}
$$

**Compute IDCG:**

Best possible ranking: `["fraud", "poverty", "scam"]`  
Relevance list: $[1.0, 0.6309, 0.5]$

$$
IDCG = \frac{1.0}{\log_2(0 + 2)} + \frac{0.6309}{\log_2(1 + 2)} + \frac{0.5}{\log_2(2 + 2)} \\
= \frac{1.0}{1.0} + \frac{0.6309}{1.58496} + \frac{0.5}{2.0} \approx 1.0 + 0.3979 + 0.25 = \mathbf{1.6479}
$$

**nDCG@5:**

$$
nDCG@5 = \frac{1.2023}{1.6479} \approx \mathbf{0.7294}
$$


#### **Second predicted list**:
`["fraud", "poverty", "scam", "family", "cinematography"]`

**All matches in top-3, correct order:**

| Predicted keyword | Match         | Relevance |
|-------------------|---------------|-----------|
| fraud             | yes (pos 0)   | 1.0       |
| poverty           | yes (pos 1)   | 0.6309    |
| scam              | yes (pos 2)   | 0.5       |
| family            | no            | 0         |
| cinematography    | no            | 0         |

**Compute DCG:**

$$
DCG = \frac{1.0}{\log_2(0 + 2)} + \frac{0.6309}{\log_2(1 + 2)} + \frac{0.5}{\log_2(2 + 2)} + 0 + 0 \\
= 1.0 + 0.3979 + 0.25 = \mathbf{1.6479}
$$

**nDCG@5:**

$$
nDCG@5 = \frac{1.6479}{1.6479} = \mathbf{1.0}
$$


#### **Interpretation**

- When relevant keywords appear early in the predicted list, the score increases due to less discounting.
- When relevant keywords are ranked lower, the score decreases due to higher discounting.
- **nDCG@k rewards both correct predictions and their correct ranking**, making it suitable for evaluating keyword extractors that produce ranked lists.

In [13]:
def normalize_kw(kw):
    """
    Normalize a keyword string by:
    - Converting to lowercase
    - Removing punctuation and non-alphanumeric characters (except spaces)
    - Stripping leading and trailing whitespace

    Args:
        kw (str): The keyword string to normalize.

    Returns:
        str: The normalized keyword.
    """
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumeric characters and whitespace
    return kw.strip()


def is_approx_match(kw, gt_keywords):
    """
    Check if a predicted keyword approximately matches any ground truth keyword.

    A match is considered approximate if:
    - The predicted keyword is exactly equal to a ground truth keyword
    - OR the predicted keyword is a substring of a ground truth keyword
    - OR a ground truth keyword is a substring of the predicted one

    Args:
        kw (str): The normalized predicted keyword.
        gt_keywords (List[str]): A list of normalized ground truth keywords.

    Returns:
        bool: True if an approximate match is found, False otherwise.
    """
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False

def evaluate_keywords_weighted(all_predicted_kw_score, all_gt_keywords):
    """
    Evaluate global weighted precision, recall, and F1-score across multiple reviews.

    This function accounts for confidence scores assigned to predicted keywords.
    Matching is performed using approximate matching. Each keyword score contributes
    to the precision and recall based on whether it matches a ground truth keyword.

    Args:
        all_predicted_kw_score (List[List[Tuple[str, float]]]): 
            A list of predicted keyword-score pairs per review.
        all_gt_keywords (List[List[str]]): 
            A list of ground truth keyword lists per review.

    Returns:
        Tuple[float, float, float]: Weighted precision, recall, and F1-score.
    """
    total_score = 0.0         # Sum of all predicted keyword scores
    match_score = 0.0         # Sum of scores of correctly predicted keywords
    total_gt = 0              # Total number of ground truth keywords across all reviews

    for pred_kw_score, gt_kw in zip(all_predicted_kw_score, all_gt_keywords):
        # Normalize keywords
        gt_kw = [normalize_kw(k) for k in gt_kw]
        pred_kw_score = [
            (normalize_kw(kw), score) for kw, score in pred_kw_score if isinstance(kw, str)
        ]

        total_score += sum(score for _, score in pred_kw_score)
        total_gt += len(gt_kw)

        matched_gts = set()  # Track ground truth keywords already matched

        for kw, score in pred_kw_score:
            for gt in gt_kw:
                if gt not in matched_gts and is_approx_match(kw, [gt]):
                    match_score += score
                    matched_gts.add(gt)
                    break

    precision = match_score / total_score if total_score > 0 else 0
    recall = match_score / total_gt if total_gt > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1


def compute_global_ndcg(all_predicted_kw_score, all_gt_keywords, k=5):
    """
    Compute global average nDCG@k (Normalized Discounted Cumulative Gain) over multiple reviews.

    The relevance of each predicted keyword is based on the position of its best matching
    ground truth keyword. Matching is done via approximate matching. The ideal DCG assumes
    the best possible ranking of ground truth keywords.

    Args:
        all_predicted_kw_score (List[List[Tuple[str, float]]]): 
            A list of predicted keyword-score pairs per review (ranked list).
        all_gt_keywords (List[List[str]]): 
            A list of ground truth keyword lists per review.
        k (int): The number of top predicted keywords to evaluate.

    Returns:
        float: The average nDCG@k across all reviews.
    """
    total_ndcg = 0.0
    count = 0

    for pred_kw_score, gt_kw in zip(all_predicted_kw_score, all_gt_keywords):
        # Normalize predicted and ground truth keywords
        gt_keywords_norm = [normalize_kw(k) for k in gt_kw]
        pred_keywords_norm = [normalize_kw(kw) for kw, _ in pred_kw_score[:k]]

        relevance = []  # Relevance scores assigned to predicted keywords

        for pk in pred_keywords_norm:
            # Find the best (earliest) match position in the GT list
            match_ranks = [
                i for i, gk in enumerate(gt_keywords_norm) if is_approx_match(pk, [gk])
            ]
            if match_ranks:
                best_rank = min(match_ranks)
                rel = 1 / math.log2(best_rank + 2)  # Graded relevance
            else:
                rel = 0
            relevance.append(rel)

        # Compute DCG for predicted keywords
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

        # Compute IDCG based on ideal ordering of GT keywords
        ideal_relevance = [1 / math.log2(i + 2) for i in range(min(k, len(gt_keywords_norm)))]
        idcg = sum(ideal_relevance)

        if idcg > 0:
            total_ndcg += dcg / idcg
            count += 1

    return total_ndcg / count if count > 0 else 0.0

### Evaluate and Compare Models on Keyword Extraction (Weighted)

In this section, we evaluate the overall performance of each model using **score-aware metrics** computed **globally across all reviews**:

- **Weighted Precision, Recall, and F1-score**: These metrics incorporate the **confidence scores** assigned to each predicted keyword, reflecting how much of the model’s confidence is placed on correct predictions.
- **nDCG@5 (Normalized Discounted Cumulative Gain)**: Assesses the overall **ranking quality** of the top-5 predicted keywords, rewarding correct keywords that are ranked higher.

This global evaluation provides a holistic view of each model’s effectiveness in ranking and selecting relevant keywords across the entire dataset.

In [14]:
# Models to evaluate
models_to_evaluate = ["base", "metadata"]

# Ground truth keywords (same for all reviews in the selected film)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Prepare data structures to hold predictions for each model
all_predicted_kw_score = {model: [] for model in models_to_evaluate}

# Collect predictions and GT for each review
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"

        # Skip if no prediction or wrong format
        if pred_col in row and isinstance(row[pred_col], list):
            predicted_kw_score = [(kw, score) for kw, score in row[pred_col] if isinstance(kw, str)]
            # Remove duplicates per review
            seen = set()
            unique_pred = [(kw, score) for kw, score in predicted_kw_score if kw not in seen and not seen.add(kw)]
            all_predicted_kw_score[model].append(unique_pred)


# Dictionary to store global evaluation results
weighted_summary = {}

# Evaluate each model globally
for model in models_to_evaluate:
    preds = all_predicted_kw_score[model]

    # Global weighted metrics
    w_precision, w_recall, w_f1 = evaluate_keywords_weighted(preds, ground_truth_keywords)

    # Global nDCG@5
    ndcg = compute_global_ndcg(preds, ground_truth_keywords, k=5)

    # Store results
    weighted_summary[model] = {
        "weighted_precision": round(w_precision, 4),
        "weighted_recall": round(w_recall, 4),
        "weighted_f1": round(w_f1, 4),
        "ndcg@5": round(ndcg, 4)
    }

# Convert summary to DataFrame
summary_df = pd.DataFrame(weighted_summary).T  # Models as rows

# Rename columns
summary_df.columns = [
    "Weighted Precision",
    "Weighted Recall",
    "Weighted F1-score",
    "nDCG@5"
]

# Display final table
summary_df.style.format(precision=4).set_caption("Global Score-Aware Evaluation Summary")


Unnamed: 0,Weighted Precision,Weighted Recall,Weighted F1-score,nDCG@5
base,0.9155,0.1789,0.2993,0.6973
metadata,0.9181,0.2085,0.3398,0.7169


## Semantic Evaluation (Base vs Metadata)

In this section, we evaluate and compare the **Base** and **Metadata-enhanced** keyword extraction models using a **semantic similarity approach** based on contextual embeddings.

Traditional evaluation metrics rely on exact or approximate string matching between predicted and ground truth keywords. However, this approach may miss semantically related terms that differ lexically but convey the same meaning — such as *"scam"* and *"fraud"*.

To address this limitation, we adopt a **global semantic evaluation**, where all predicted and ground truth keywords across the dataset are compared using **dense sentence embeddings** generated by a pre-trained transformer (e.g., Sentence-BERT).

#### **Semantic Evaluation Procedure**

1. **Embedding Keywords Globally**  
   All predicted and ground truth keywords across all reviews are embedded into high-dimensional vectors using the same transformer model. Ground truth keywords are embedded **once**, and all vectors are normalized to allow cosine similarity comparisons.

2. **Computing Similarity Matrix**  
   For each model, we compute a cosine similarity matrix between **all predicted keywords** and **all ground truth keywords**.

3. **Matching Threshold**  
   A predicted keyword is considered a **semantic match** if its cosine similarity with at least one ground truth keyword exceeds a fixed threshold (e.g., **0.75**). This allows for flexible yet meaningful semantic alignment.

4. **Global Semantic Precision**  
   The proportion of predicted keywords that have at least one semantic match in the ground truth. This reflects how many of the model's predictions are semantically relevant.

5. **Global Semantic Recall**  
   The proportion of ground truth keywords that are captured by semantically similar predictions. This indicates how well the model covers the key concepts.

6. **Global Semantic F1-score**  
   The harmonic mean of semantic precision and recall, summarizing both relevance and coverage into a single score.

This evaluation:

- Is **more robust** than string-based metrics.
- **Captures meaning**, not just surface forms.
- Helps evaluate models that paraphrase or generalize beyond exact matches.

This evaluation complements previous metrics and provides a more **realistic estimate** of how well the models capture the essence of user-annotated keywords in a global and context-aware manner.


In [26]:
# Load a sentence embedding model from the SentenceTransformers family
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model to generate contextual embeddings
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).to(device)

# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()

def embed_keywords(keywords, device="cuda"):
    """
    Compute sentence embeddings for a list of keyword strings.

    Parameters:
    ----------
    keywords : List[str]
        A list of keyword strings to encode.
    device : str
        Device to run the model on ('cuda' or 'cpu').

    Returns:
    -------
    torch.Tensor
        Normalized embeddings tensor of shape (num_keywords, embedding_dim).
    """
    # Return empty tensor if input list is empty
    if not keywords:
        return torch.empty(0, encoder.config.hidden_size).to(device)

    # Tokenize and prepare inputs for the model
    inputs = tokenizer(keywords, padding=True, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        # Forward pass through the encoder to get hidden states
        outputs = encoder(**inputs)

        # Use mean pooling on the last hidden state to get fixed-size embeddings
        embeddings = outputs.last_hidden_state.mean(dim=1)

        # Normalize embeddings to unit length for cosine similarity computations
        embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings

def evaluate_semantic_keywords_global(all_pred_keywords, gt_keywords, threshold=0.75, device="cuda"):
    """
    Compute global semantic precision, recall, and F1 score between all predicted keywords
    and ground truth keywords using cosine similarity over embeddings.

    Parameters:
    ----------
    all_pred_keywords : List[List[str]]
        List of predicted keywords for each review.
    gt_keywords : List[str]
        Global list of ground truth keywords for the movie.
    threshold : float
        Cosine similarity threshold for considering a match.
    device : str
        Device to run the model on ('cuda' or 'cpu').

    Returns:
    -------
    precision : float
    recall : float
    f1 : float
    """
    # Early return if either set is empty
    if len(all_pred_keywords) == 0 or len(gt_keywords) == 0:
        return 0.0, 0.0, 0.0

    # Compute embeddings
    pred_emb = embed_keywords(all_pred_keywords, device=device)
    gt_emb = embed_keywords(gt_keywords, device=device)

    # Compute similarity matrix
    sims = torch.matmul(pred_emb, gt_emb.T)

    # Match counting based on threshold
    pred_matches = (sims > threshold).any(dim=1).float().sum().item()
    gt_matches = (sims > threshold).any(dim=0).float().sum().item()

    precision = pred_matches / len(all_pred_keywords)
    recall = gt_matches / len(gt_keywords)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return precision, recall, f1


### Semantic Evaluation of Base and Metadata Models Using Sentence Embeddings

In this step, we evaluate the **semantic similarity** between the predicted keywords of two models — **Base** and **Metadata-enhanced** — and the ground truth keywords using **sentence embeddings**.

Unlike exact or approximate string matching, this method leverages **contextual embeddings** from a pre-trained transformer to assess how semantically close the predicted keywords are to the reference keywords.

The evaluation procedure is as follows:

- We extract only the **text** of the predicted keywords for each model, discarding their confidence scores.
- We embed all **predicted** and **ground truth** keywords using the same sentence transformer model.
- Embeddings are **normalized** to ensure cosine similarity is a valid similarity measure.
- For each predicted keyword, we compute the **cosine similarity** with all ground truth keywords.
- A predicted keyword is considered a **semantic match** if its similarity with any ground truth keyword exceeds a fixed threshold (e.g., **0.75**).

Once all matches are determined across all reviews of the selected movie, we compute:

- **Semantic Precision**: Fraction of all predicted keywords (global) that have a semantic match.
- **Semantic Recall**: Fraction of all ground truth keywords that are matched by at least one semantically similar predicted keyword.
- **Semantic F1-score**: Harmonic mean of semantic precision and recall.

This global semantic evaluation better reflects the models’ ability to capture **meaningful and relevant keywords**, even when the wording differs from the ground truth.


In [27]:
# Precompute embeddings for the ground truth keywords once per selected movie
gt_keywords = kw_ground_truth["Keyword"].tolist()

# Define the models to evaluate
models_to_evaluate = ["base", "metadata"]

# Dictionary to collect all predicted keywords per model (without duplicates)
all_predictions = {model: set() for model in models_to_evaluate}

# Collect predicted keywords across all reviews (as a set for uniqueness)
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"

        if pred_col in row and isinstance(row[pred_col], list):
            # Extract keyword strings and normalize
            pred_kw = [normalize_kw(kw) for kw, _ in row[pred_col] if isinstance(kw, str)]
            all_predictions[model].update(pred_kw)  # Add to set (no duplicates)

# Compute semantic evaluation globally for each model
semantic_scores = []
for model in models_to_evaluate:
    pred_kw = list(all_predictions[model])  # Convert back to list
    precision, recall, f1 = evaluate_semantic_keywords_global(pred_kw, gt_keywords, device=device)

    semantic_scores.append({
        "Model": model,
        "Semantic_Precision": round(precision, 4),
        "Semantic_Recall": round(recall, 4),
        "Semantic_F1": round(f1, 4)
    })

# Convert to DataFrame and format
summary_df = pd.DataFrame(semantic_scores).set_index("Model")
summary_df.style.format(precision=4).set_caption("Global Semantic-Aware Evaluation Summary")


Unnamed: 0_level_0,Semantic_Precision,Semantic_Recall,Semantic_F1
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
base,0.6736,0.489,0.5666
metadata,0.6434,0.5074,0.5673


## Evaluation Across All Movies

This section automatically processes all `.pkl` files in the `Extracted_Keywords` directory, where each file corresponds to a single movie and contains predicted keywords generated by different models.

For each movie:
- The corresponding **ground truth keywords** are loaded.
- Predicted keywords from both models — **Base** and **Metadata-enhanced** — are evaluated.
- The evaluation is performed globally across all reviews in the movie, without computing metrics per review.

For each model, the following **global metrics** are computed:

- **Unweighted Metrics**:  
  Precision, Recall, and F1-score using approximate string matching.

- **Score-aware Metrics**:  
  - **Weighted Precision**: proportion of total confidence assigned to correct predictions.  
  - **Weighted Recall**: coverage of ground truth weighted by prediction confidence.  
  - **Weighted F1-score**: harmonic mean of weighted precision and recall.  
  - **nDCG@5**: evaluates the quality of keyword ranking using graded relevance and position-based discounting.

- **Semantic Metrics**:  
  Semantic Precision, Recall, and F1-score using **cosine similarity** between **sentence embeddings** of predicted and ground truth keywords.

All metrics are computed globally for each movie and then compiled into a summary table to compare the overall performance of the two models.


In [28]:
# Settings
keywords_dir = "../Dataset/Extracted_Keywords/"
ground_truth_path = "../Dataset/keywords_ground_truth.pkl"

device = "cuda" if torch.cuda.is_available() else "cpu"
models_to_evaluate = ["base", "metadata"]

# Load ground truth
keywords_ground_truth = pd.read_pickle(ground_truth_path)

# Store all results
all_results = []

# Iterate over movie keyword predictions
for file in os.listdir(keywords_dir):
    if file.endswith(".pkl") and file.startswith("kw_"):
        movie_name = file.replace("kw_", "").replace(".pkl", "")
        file_path = os.path.join(keywords_dir, file)

        try:
            selected_film = pd.read_pickle(file_path)
            selected_film_id = selected_film["Movie_ID"].iloc[0]

            # Ground truth per quel film
            kw_ground_truth = keywords_ground_truth[keywords_ground_truth["Movie_ID"] == selected_film_id]
            gt_keywords = kw_ground_truth["Keyword"].tolist()

            # Init: predicted keyword lists (per review, no duplicates)
            all_predicted_kw_score = {model: [] for model in models_to_evaluate}

            for _, row in selected_film.iterrows():
                for model in models_to_evaluate:
                    pred_col = f"keywords_{model}"
                    if pred_col in row and isinstance(row[pred_col], list):
                        predicted_kw_score = [(normalize_kw(kw), score) for kw, score in row[pred_col] if isinstance(kw, str)]
                        seen = set()
                        unique_pred = [(kw, score) for kw, score in predicted_kw_score if kw not in seen and not seen.add(kw)]
                        all_predicted_kw_score[model].append(unique_pred)

            for model in models_to_evaluate:
                # Flatten keywords (global unique list)
                flat_kw = [kw for review in all_predicted_kw_score[model] for kw, _ in review]
                unique_pred_keywords = list(set(flat_kw))

                # Max score for each keyword
                kw_score_max = {}
                for review in all_predicted_kw_score[model]:
                    for kw, score in review:
                        if kw not in kw_score_max or score > kw_score_max[kw]:
                            kw_score_max[kw] = score
                sorted_kw_score = sorted(kw_score_max.items(), key=lambda x: x[1], reverse=True)

                # Score-aware format
                pred_kw_score_list = all_predicted_kw_score[model]
                keyword_only_list = [[kw for kw, _ in review] for review in pred_kw_score_list]

                # Compute all metrics
                p, r, f1 = evaluate_keywords(keyword_only_list, gt_keywords)
                wp, wr, wf = evaluate_keywords_weighted(pred_kw_score_list, gt_keywords)
                ndcg = compute_global_ndcg(pred_kw_score_list, gt_keywords, k=5)
                sp, sr, sf1 = evaluate_semantic_keywords_global(unique_pred_keywords, gt_keywords, device=device)

                all_results.append({
                    "Movie": movie_name,
                    "Model": model,
                    "Precision": round(p, 4),
                    "Recall": round(r, 4),
                    "F1-score": round(f1, 4),
                    "Weighted Precision": round(wp, 4),
                    "Weighted Recall": round(wr, 4),
                    "Weighted F1-score": round(wf, 4),
                    "nDCG@5": round(ndcg, 4),
                    "Semantic Precision": round(sp, 4),
                    "Semantic Recall": round(sr, 4),
                    "Semantic F1-score": round(sf1, 4),
                })

        except Exception as e:
            print(f"Error processing {file}: {e}")

# Final summary
final_df = pd.DataFrame(all_results)
final_df_sorted = final_df.sort_values(by=["Movie", "Model"]).reset_index(drop=True)
final_df_sorted.style.format(precision=4).set_caption("Global Evaluation Summary per Movie and Model")

Unnamed: 0,Movie,Model,Precision,Recall,F1-score,Weighted Precision,Weighted Recall,Weighted F1-score,nDCG@5,Semantic Precision,Semantic Recall,Semantic F1-score
0,SW_Episode6,base,0.8662,0.3369,0.4851,0.9155,0.1789,0.2993,0.6973,0.6736,0.489,0.5666
1,SW_Episode6,metadata,0.889,0.3457,0.4978,0.9181,0.2085,0.3398,0.7169,0.6434,0.5074,0.5673
