# Evaluation of keyBERTSentimentReranker

This notebook evaluates the **Reranker** model, which reorders the keywords predicted by the base KeyBERT model to improve their ranking.

Since the reranker **does not alter the keyword set** but only changes their order, traditional set-based metrics such as **Precision**, **Recall**, and **F1-score** will be **identical to those of the Base model**. Therefore, comparing these metrics is not meaningful in this case.

Instead, we perform a **global evaluation** using metrics that are sensitive to the **confidence scores** and **ranking** of the predicted keywords:

- **Weighted Precision, Recall, and F1-score**:  
  These metrics incorporate the confidence assigned to each predicted keyword. A high score assigned to an incorrect keyword negatively impacts performance, while correctly ranked and confident predictions improve the result. This provides a more nuanced view than simple binary matching.

- **nDCG@5 with Graded Relevance**:  
  This ranking-based metric measures how effectively the most important ground truth keywords are placed near the top of the predicted list. The evaluation uses **graded relevance**, which prioritizes early and meaningful matches. It rewards rankings that better align with the ground truth keyword importance.

All metrics are computed **globally across all reviews**, aggregating correct matches, confidence scores, and ranking positions. This allows for a robust assessment of the reranker's overall impact on keyword quality, beyond what is visible from individual reviews.


## Setup: Installing and Importing Required Libraries

In [19]:
import subprocess
import sys

# List of required packages
required_packages = {
    "pandas", "numpy", "tqdm"
}

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)

tqdm is already installed.
numpy is already installed.
pandas is already installed.


In [20]:
# Standard Library
import os      # File system operations (e.g., listing files)
import re      # Regular expressions for text processing
import math    # Mathematical functions (e.g., logarithms for nDCG calculation)

# Third-Party Libraries
import pandas as pd                  # Data manipulation with DataFrames
import numpy as np                   # Numerical computations and array operations
from tqdm import tqdm                # Progress bars for loops

## Load Available Movies from Dataset

This section lists all the available movies stored as `.pkl` files inside the review dataset directory.

- It defines the root path (`../Dataset/Reviews_By_Movie`) where all review files are saved.
- It automatically detects and lists all movie filenames (removing the `.pkl` extension).

In [21]:
# Define root directory
root_dir = "../Dataset/Reviews_By_Movie"

# List all available movies
available_movies = sorted([f[:-4] for f in os.listdir(root_dir) if f.endswith(".pkl")])
print("Available movies:", available_movies)

Available movies: ['GoodBadUgly', 'HarryPotter', 'IndianaJones', 'LaLaLand', 'Oppenheimer', 'Parasite', 'SW_Episode1', 'SW_Episode2', 'SW_Episode3', 'SW_Episode4', 'SW_Episode5', 'SW_Episode6', 'SW_Episode7', 'SW_Episode8', 'SW_Episode9']


## Select a Movie and Load its Ground Truth Keywords

In this step, we load the keyword extraction results for a specific movie and retrieve the corresponding ground truth keywords. The goal is to use these annotated keywords for evaluation and comparison with automatically extracted ones.

In [22]:
# Set the name of the movie to be evaluated
movie_name = "SW_Episode6"

# Load the extracted keywords for the selected movie from a pickle file
# The file path is dynamically built using the movie name
selected_film = pd.read_pickle(f"../Dataset/Extracted_Keywords/kw_{movie_name}.pkl")

# Retrieve the Movie_ID of the selected film
# Assumes that the file contains a DataFrame with at least one row
selected_film_id = selected_film["Movie_ID"].iloc[0]

# Load the full dataset containing the ground truth keywords
# for all movies in the evaluation set
keywords = pd.read_pickle("../Dataset/keywords_ground_truth.pkl")

# Filter the ground truth dataset to extract only the keywords for the selected movie
kw_ground_truth = keywords[keywords["Movie_ID"] == selected_film_id]

## Score-Aware Evaluation: Weighted Metrics and nDCG@k with Graded Relevance

This extended evaluation considers the **confidence scores** assigned by the model to each predicted keyword, allowing us to measure not only whether the predictions are correct but also how confidently and effectively they are ranked.

#### **Score-Aware Metrics**

- **Weighted Precision**: Reflects the proportion of the model’s total confidence assigned to correct keywords. High confidence in incorrect predictions lowers this score.
- **Weighted Recall**: Measures how much of the ground truth is recovered, weighted by the confidence of correct predictions.
- **Weighted F1-score**: The harmonic mean of weighted precision and recall, balancing accuracy with coverage.
- **nDCG@k (Normalized Discounted Cumulative Gain)**: A ranking metric that rewards placing relevant keywords near the top of the prediction list. It uses **graded relevance**, which accounts for the importance of ground truth keywords based on their position.

#### **How nDCG@k with Graded Relevance is Computed**

1. **Assign graded relevance to ground truth keywords** based on their position $pos_{GT}$ (starting from 0):

$$
rel_{GT} = \frac{1}{\log_2(pos_{GT} + 2)}
$$

2. **Assign relevance to each predicted keyword at position $i$** (starting from 0), using approximate matching:

$$
rel_i = \begin{cases}
\frac{1}{\log_2(pos_{GT} + 2)} & \text{if predicted keyword matches GT keyword at } pos_{GT} \\
0 & \text{otherwise}
\end{cases}
$$

3. **Compute DCG@k** (Discounted Cumulative Gain):

$$
DCG@k = \sum_{i=0}^{k-1} \frac{rel_i}{\log_2(i + 2)}
$$

4. **Compute IDCG@k** (Ideal DCG using the best ranking):

$$
IDCG@k = \sum_{i=0}^{k-1} \frac{rel^*_i}{\log_2(i + 2)}
$$

5. **Compute normalized nDCG**:

$$
nDCG@k = \frac{DCG@k}{IDCG@k}
$$

#### **Example ($k=5$)**

**Ground truth keywords (ranked):**  
`["fraud", "poverty", "scam"]`

**Their graded relevance (using $rel_{GT} = 1/\log_2(pos_{GT}+2)$):**

- fraud (position 0): $1 / \log_2(0+2) = 1.0$
- poverty (position 1): $1 / \log_2(1+2) \approx 0.6309$
- scam (position 2): $1 / \log_2(2+2) = 0.5$

#### **First predicted list**:
`["scam", "family", "poverty", "cinematography", "fraud"]`

**Matches and assigned relevances:**

| Predicted keyword | Match         | Relevance |
|-------------------|---------------|-----------|
| scam              | yes (pos 2)   | 0.5       |
| family            | no            | 0         |
| poverty           | yes (pos 1)   | 0.6309    |
| cinematography    | no            | 0         |
| fraud             | yes (pos 0)   | 1.0       |

**Compute DCG:**

$$
DCG = \frac{0.5}{\log_2(0 + 2)} + \frac{0}{\log_2(1 + 2)} + \frac{0.6309}{\log_2(2 + 2)} + \frac{0}{\log_2(3 + 2)} + \frac{1.0}{\log_2(4 + 2)} \\
= \frac{0.5}{1.0} + 0 + \frac{0.6309}{2.0} + 0 + \frac{1.0}{2.58496} \approx 0.5 + 0 + 0.31545 + 0 + 0.38685 = \mathbf{1.2023}
$$

**Compute IDCG:**

Best possible ranking: `["fraud", "poverty", "scam"]`  
Relevance list: $[1.0, 0.6309, 0.5]$

$$
IDCG = \frac{1.0}{\log_2(0 + 2)} + \frac{0.6309}{\log_2(1 + 2)} + \frac{0.5}{\log_2(2 + 2)} \\
= \frac{1.0}{1.0} + \frac{0.6309}{1.58496} + \frac{0.5}{2.0} \approx 1.0 + 0.3979 + 0.25 = \mathbf{1.6479}
$$

**nDCG@5:**

$$
nDCG@5 = \frac{1.2023}{1.6479} \approx \mathbf{0.7294}
$$


#### **Second predicted list**:
`["fraud", "poverty", "scam", "family", "cinematography"]`

**All matches in top-3, correct order:**

| Predicted keyword | Match         | Relevance |
|-------------------|---------------|-----------|
| fraud             | yes (pos 0)   | 1.0       |
| poverty           | yes (pos 1)   | 0.6309    |
| scam              | yes (pos 2)   | 0.5       |
| family            | no            | 0         |
| cinematography    | no            | 0         |

**Compute DCG:**

$$
DCG = \frac{1.0}{\log_2(0 + 2)} + \frac{0.6309}{\log_2(1 + 2)} + \frac{0.5}{\log_2(2 + 2)} + 0 + 0 \\
= 1.0 + 0.3979 + 0.25 = \mathbf{1.6479}
$$

**nDCG@5:**

$$
nDCG@5 = \frac{1.6479}{1.6479} = \mathbf{1.0}
$$


#### **Interpretation**

- When relevant keywords appear early in the predicted list, the score increases due to less discounting.
- When relevant keywords are ranked lower, the score decreases due to higher discounting.
- **nDCG@k rewards both correct predictions and their correct ranking**, making it suitable for evaluating keyword extractors that produce ranked lists.

In [23]:
def normalize_kw(kw):
    """
    Normalize a keyword string by:
    - Converting to lowercase
    - Removing punctuation and non-alphanumeric characters (except spaces)
    - Stripping leading and trailing whitespace

    Args:
        kw (str): The keyword string to normalize.

    Returns:
        str: The normalized keyword.
    """
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumeric characters and whitespace
    return kw.strip()


def is_approx_match(kw, gt_keywords):
    """
    Check if a predicted keyword approximately matches any ground truth keyword.

    A match is considered approximate if:
    - The predicted keyword is exactly equal to a ground truth keyword
    - OR the predicted keyword is a substring of a ground truth keyword
    - OR a ground truth keyword is a substring of the predicted one

    Args:
        kw (str): The normalized predicted keyword.
        gt_keywords (List[str]): A list of normalized ground truth keywords.

    Returns:
        bool: True if an approximate match is found, False otherwise.
    """
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False


def evaluate_keywords_weighted(all_predicted_kw_score, all_gt_keywords):
    """
    Evaluate global weighted precision, recall, and F1-score across multiple reviews.

    This function accounts for confidence scores assigned to predicted keywords.
    Matching is performed using approximate matching. Each keyword score contributes
    to the precision and recall based on whether it matches a ground truth keyword.

    Args:
        all_predicted_kw_score (List[List[Tuple[str, float]]]): 
            A list of predicted keyword-score pairs per review.
        all_gt_keywords (List[List[str]]): 
            A list of ground truth keyword lists per review.

    Returns:
        Tuple[float, float, float]: Weighted precision, recall, and F1-score.
    """
    total_score = 0.0         # Sum of all predicted keyword scores
    match_score = 0.0         # Sum of scores of correctly predicted keywords
    total_gt = 0              # Total number of ground truth keywords across all reviews

    for pred_kw_score, gt_kw in zip(all_predicted_kw_score, all_gt_keywords):
        # Normalize keywords
        gt_kw = [normalize_kw(k) for k in gt_kw]
        pred_kw_score = [
            (normalize_kw(kw), score) for kw, score in pred_kw_score if isinstance(kw, str)
        ]

        total_score += sum(score for _, score in pred_kw_score)
        total_gt += len(gt_kw)

        matched_gts = set()  # Track ground truth keywords already matched

        for kw, score in pred_kw_score:
            for gt in gt_kw:
                if gt not in matched_gts and is_approx_match(kw, [gt]):
                    match_score += score
                    matched_gts.add(gt)
                    break

    precision = match_score / total_score if total_score > 0 else 0
    recall = match_score / total_gt if total_gt > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1


def compute_global_ndcg(all_predicted_kw_score, all_gt_keywords, k=5):
    """
    Compute global average nDCG@k (Normalized Discounted Cumulative Gain) over multiple reviews.

    The relevance of each predicted keyword is based on the position of its best matching
    ground truth keyword. Matching is done via approximate matching. The ideal DCG assumes
    the best possible ranking of ground truth keywords.

    Args:
        all_predicted_kw_score (List[List[Tuple[str, float]]]): 
            A list of predicted keyword-score pairs per review (ranked list).
        all_gt_keywords (List[List[str]]): 
            A list of ground truth keyword lists per review.
        k (int): The number of top predicted keywords to evaluate.

    Returns:
        float: The average nDCG@k across all reviews.
    """
    total_ndcg = 0.0
    count = 0

    for pred_kw_score, gt_kw in zip(all_predicted_kw_score, all_gt_keywords):
        # Normalize predicted and ground truth keywords
        gt_keywords_norm = [normalize_kw(k) for k in gt_kw]
        pred_keywords_norm = [normalize_kw(kw) for kw, _ in pred_kw_score[:k]]

        relevance = []  # Relevance scores assigned to predicted keywords

        for pk in pred_keywords_norm:
            # Find the best (earliest) match position in the GT list
            match_ranks = [
                i for i, gk in enumerate(gt_keywords_norm) if is_approx_match(pk, [gk])
            ]
            if match_ranks:
                best_rank = min(match_ranks)
                rel = 1 / math.log2(best_rank + 2)  # Graded relevance
            else:
                rel = 0
            relevance.append(rel)

        # Compute DCG for predicted keywords
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

        # Compute IDCG based on ideal ordering of GT keywords
        ideal_relevance = [1 / math.log2(i + 2) for i in range(min(k, len(gt_keywords_norm)))]
        idcg = sum(ideal_relevance)

        if idcg > 0:
            total_ndcg += dcg / idcg
            count += 1

    return total_ndcg / count if count > 0 else 0.0

### Evaluate and Compare Models on Keyword Extraction (Weighted)

In this section, we evaluate the overall performance of each model using **score-aware metrics** computed **globally across all reviews**:

- **Weighted Precision, Recall, and F1-score**: These metrics incorporate the **confidence scores** assigned to each predicted keyword, reflecting how much of the model’s confidence is placed on correct predictions.
- **nDCG@5 (Normalized Discounted Cumulative Gain)**: Assesses the overall **ranking quality** of the top-5 predicted keywords, rewarding correct keywords that are ranked higher.

This global evaluation provides a holistic view of each model’s effectiveness in ranking and selecting relevant keywords across the entire dataset.

In [24]:
# Models to evaluate
models_to_evaluate = ["base", "reranker"]

# Ground truth keywords (same for all reviews in the selected film)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Prepare data structures to hold predictions for each model
all_predicted_kw_score = {model: [] for model in models_to_evaluate}

# Collect predictions and GT for each review
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"

        # Skip if no prediction or wrong format
        if pred_col in row and isinstance(row[pred_col], list):
            predicted_kw_score = [(kw, score) for kw, score in row[pred_col] if isinstance(kw, str)]
            # Remove duplicates per review
            seen = set()
            unique_pred = [(kw, score) for kw, score in predicted_kw_score if kw not in seen and not seen.add(kw)]
            all_predicted_kw_score[model].append(unique_pred)


# Dictionary to store global evaluation results
weighted_summary = {}

# Evaluate each model globally
for model in models_to_evaluate:
    preds = all_predicted_kw_score[model]

    # Global weighted metrics
    w_precision, w_recall, w_f1 = evaluate_keywords_weighted(preds, ground_truth_keywords)

    # Global nDCG@5
    ndcg = compute_global_ndcg(preds, ground_truth_keywords, k=5)

    # Store results
    weighted_summary[model] = {
        "weighted_precision": round(w_precision, 4),
        "weighted_recall": round(w_recall, 4),
        "weighted_f1": round(w_f1, 4),
        "ndcg@5": round(ndcg, 4)
    }

# Convert summary to DataFrame
summary_df = pd.DataFrame(weighted_summary).T  # Models as rows

# Rename columns
summary_df.columns = [
    "Weighted Precision",
    "Weighted Recall",
    "Weighted F1-score",
    "nDCG@5"
]

# Display final table
summary_df.style.format(precision=4).set_caption("Global Score-Aware Evaluation Summary")


Unnamed: 0,Weighted Precision,Weighted Recall,Weighted F1-score,nDCG@5
base,0.9155,0.1789,0.2993,0.6973
reranker,0.919,0.1642,0.2786,0.7023


## Evaluation Across All Movies

This section processes all `.pkl` files in the `Extracted_Keywords` directory, where each file corresponds to a single movie and contains the predicted keywords generated by different models.

For **each movie**, the evaluation proceeds as follows:

1. **Ground Truth Loading**  
   - The reference (ground truth) keywords for all reviews in the movie are loaded.

2. **Model Predictions**  
   - Predicted keywords from two models are collected and evaluated:
     - **Base** model  
     - **Reranker** model

3. **Global Metrics Computation**  
   Rather than evaluating each review individually, metrics are computed **globally**, by aggregating all predictions and ground truths across the entire movie.

   **Score-aware Weighted Metrics**  
   These metrics incorporate the confidence scores assigned to the predicted keywords:
   - **Weighted Precision**: Proportion of the model’s total confidence assigned to correct predictions.
   - **Weighted Recall**: Fraction of the ground truth captured by the confidence-weighted predictions.
   - **Weighted F1-score**: Harmonic mean of weighted precision and recall.

   **Ranking Metric**
   Measures the effectiveness of keyword ordering:
   - **nDCG@5** (Normalized Discounted Cumulative Gain at rank 5):  
     Reflects how well the model ranks the most important ground truth keywords near the top of the prediction list.
  
The results are compiled into a **summary table** to compare the overall ranking and confidence-aware performance of the **Base** and **Reranker** models across the dataset.


In [25]:
# Settings
keywords_dir = "../Dataset/Extracted_Keywords/"
ground_truth_path = "../Dataset/keywords_ground_truth.pkl"

models_to_evaluate = ["base", "reranker"]

# Load ground truth
keywords_ground_truth = pd.read_pickle(ground_truth_path)

# Store all results
all_results = []

# Iterate over movie keyword predictions
for file in os.listdir(keywords_dir):
    if file.endswith(".pkl") and file.startswith("kw_"):
        movie_name = file.replace("kw_", "").replace(".pkl", "")
        file_path = os.path.join(keywords_dir, file)

        try:
            selected_film = pd.read_pickle(file_path)
            selected_film_id = selected_film["Movie_ID"].iloc[0]

            # Ground truth for the film
            kw_ground_truth = keywords_ground_truth[keywords_ground_truth["Movie_ID"] == selected_film_id]
            gt_keywords = kw_ground_truth["Keyword"].tolist()

            # Init: predicted keyword lists (per review, no duplicates)
            all_predicted_kw_score = {model: [] for model in models_to_evaluate}

            for _, row in selected_film.iterrows():
                for model in models_to_evaluate:
                    pred_col = f"keywords_{model}"
                    if pred_col in row and isinstance(row[pred_col], list):
                        predicted_kw_score = [(normalize_kw(kw), score) for kw, score in row[pred_col] if isinstance(kw, str)]
                        seen = set()
                        unique_pred = [(kw, score) for kw, score in predicted_kw_score if kw not in seen and not seen.add(kw)]
                        all_predicted_kw_score[model].append(unique_pred)

            for model in models_to_evaluate:
                # Flatten keywords (global unique list)
                flat_kw = [kw for review in all_predicted_kw_score[model] for kw, _ in review]
                unique_pred_keywords = list(set(flat_kw))

                # Max score for each keyword
                kw_score_max = {}
                for review in all_predicted_kw_score[model]:
                    for kw, score in review:
                        if kw not in kw_score_max or score > kw_score_max[kw]:
                            kw_score_max[kw] = score
                sorted_kw_score = sorted(kw_score_max.items(), key=lambda x: x[1], reverse=True)

                # Score-aware format
                pred_kw_score_list = all_predicted_kw_score[model]
                keyword_only_list = [[kw for kw, _ in review] for review in pred_kw_score_list]

                # Compute all metrics
                wp, wr, wf = evaluate_keywords_weighted(pred_kw_score_list, gt_keywords)
                ndcg = compute_global_ndcg(pred_kw_score_list, gt_keywords, k=5)

                all_results.append({
                    "Movie": movie_name,
                    "Model": model,
                    "Weighted Precision": round(wp, 4),
                    "Weighted Recall": round(wr, 4),
                    "Weighted F1-score": round(wf, 4),
                    "nDCG@5": round(ndcg, 4)
                })

        except Exception as e:
            print(f"Error processing {file}: {e}")

# Final summary
final_df = pd.DataFrame(all_results)
final_df_sorted = final_df.sort_values(by=["Movie", "Model"]).reset_index(drop=True)
final_df_sorted.style.format(precision=4).set_caption("Global Evaluation Summary per Movie and Model")

Unnamed: 0,Movie,Model,Weighted Precision,Weighted Recall,Weighted F1-score,nDCG@5
0,SW_Episode6,base,0.9155,0.1789,0.2993,0.6973
1,SW_Episode6,reranker,0.919,0.1642,0.2786,0.7023
