# Evaluation of keyBERTSentimentReranker

This notebook evaluates the **reranker** model, which reorders the keywords predicted by the base KeyBERT model to improve their ranking.

Since the reranker **does not change the keyword set** but only their order, traditional metrics like **precision, recall, and F1-score** will be **identical to those of the base model**. Thus, comparing these metrics between base and reranker is not informative.

Instead, we focus on three score-aware metrics that capture ranking quality and confidence weighting:

- **Weighted Precision, Recall, and F1-score**:  
  These metrics incorporate the confidence scores assigned by the model to each predicted keyword. They measure not only correctness but also how confidently the model ranks the relevant keywords, providing a finer-grained evaluation than unweighted scores.

- **nDCG@5 with Graded Relevance**:  
  This metric evaluates how well the reranker places the most important ground truth keywords near the top of the predicted list. It rewards rankings that better align with the ground truth keyword importance, revealing improvements in ranking quality.

Together, these metrics provide a comprehensive assessment of the **ranking effectiveness, confidence weighting, and diversity** introduced by the reranker, beyond what simple set-based metrics can capture.


## Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

# List of required packages
required_packages = {
    "pandas", "numpy", "scikit-learn", "tqdm"
}

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)

tqdm is already installed.
pandas is already installed.
numpy is already installed.
Installing scikit-learn...


In [2]:
# Standard Library
import os      # File system operations (e.g., listing files)
import re      # Regular expressions for text processing
import math    # Mathematical functions (e.g., logarithms for nDCG calculation)

# Third-Party Libraries
import pandas as pd                  # Data manipulation with DataFrames
import numpy as np                   # Numerical computations and array operations
from tqdm import tqdm                # Progress bars for loops

# Evaluation metrics from scikit-learn
from sklearn.metrics import precision_score, recall_score, f1_score

## Load Available Movies from Dataset

This section lists all the available movies stored as `.pkl` files inside the review dataset directory.

- It defines the root path (`../Dataset/Reviews_By_Movie`) where all review files are saved.
- It automatically detects and lists all movie filenames (removing the `.pkl` extension).

In [3]:
# Define root directory
root_dir = "../Dataset/Reviews_By_Movie"

# List all available movies
available_movies = sorted([f[:-4] for f in os.listdir(root_dir) if f.endswith(".pkl")])
print("Available movies:", available_movies)

Available movies: ['GoodBadUgly', 'HarryPotter', 'IndianaJones', 'LaLaLand', 'Oppenheimer', 'Parasite', 'SW_Episode1', 'SW_Episode2', 'SW_Episode3', 'SW_Episode4', 'SW_Episode5', 'SW_Episode6', 'SW_Episode7', 'SW_Episode8', 'SW_Episode9']


## Select a Movie and Load its Ground Truth Keywords

In this step, we load the keyword extraction results for a specific movie and retrieve the corresponding ground truth keywords. The goal is to use these annotated keywords for evaluation and comparison with automatically extracted ones.

In [4]:
# Set the name of the movie to be evaluated
movie_name = "SW_Episode6"

# Load the extracted keywords for the selected movie from a pickle file
# The file path is dynamically built using the movie name
selected_film = pd.read_pickle(f"../Dataset/Extracted_Keywords/kw_{movie_name}.pkl")

# Retrieve the Movie_ID of the selected film
# Assumes that the file contains a DataFrame with at least one row
selected_film_id = selected_film["Movie_ID"].iloc[0]

# Load the full dataset containing the ground truth keywords
# for all movies in the evaluation set
keywords = pd.read_pickle("../Dataset/keywords_ground_truth.pkl")

# Filter the ground truth dataset to extract only the keywords for the selected movie
kw_ground_truth = keywords[keywords["Movie_ID"] == selected_film_id]

## Score-Aware Evaluation: Weighted Metrics and nDCG@k with Graded Relevance

This extended evaluation considers the **confidence scores** assigned by the model to each predicted keyword, allowing us to measure not only whether the predictions are correct but also how confidently and effectively they are ranked.

#### Score-Aware Metrics

- **Weighted Precision**: Reflects the proportion of the model’s total confidence assigned to correct keywords. High confidence in incorrect predictions lowers this score.

- **Weighted Recall**: Measures how much of the ground truth is recovered, weighted by the confidence of correct predictions.

- **Weighted F1-score**: The harmonic mean of weighted precision and recall, balancing accuracy with coverage.

- **nDCG@k (Normalized Discounted Cumulative Gain)**: A ranking metric that rewards placing relevant keywords near the top of the prediction list. It uses **graded relevance**, which accounts for the importance of ground truth keywords based on their position.

#### How nDCG@k with Graded Relevance is Computed

1. **Assign graded relevance to ground truth keywords** based on their position $pos_{GT}$ (starting from 0). The relevance for a ground truth keyword at position $pos_{GT}$ is:

   $$
   rel_{GT} = \frac{1}{\log_2(pos_{GT} + 2)}
   $$

   Higher ranked keywords (lower $pos_{GT}$) have higher relevance scores.

2. **For each predicted keyword at position $i$ (starting from 0)**, find the best matching ground truth keyword (using approximate matching). Assign the relevance of the predicted keyword $rel_i$ as the graded relevance of its matched ground truth keyword:

   $$
   rel_i = \begin{cases}
   \frac{1}{\log_2(pos_{GT} + 2)} & \text{if predicted keyword matches GT keyword at } pos_{GT} \\
   0 & \text{otherwise}
   \end{cases}
   $$

3. **Compute Discounted Cumulative Gain (DCG) for the predicted keywords up to rank $k$**:

   $$
   DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
   $$

   This discounts the relevance by the predicted keyword’s position, rewarding relevant keywords ranked higher.

4. **Compute Ideal DCG (IDCG)** as the maximum possible DCG using the top-$k$ ground truth keywords ranked by their graded relevance:

   $$
   IDCG@k = \sum_{i=1}^{k} \frac{rel^{*}_i}{\log_2(i + 1)}
   $$

   where $rel^{*}_i$ are the graded relevance scores of the top-$k$ ground truth keywords sorted by importance.

5. **Calculate normalized DCG (nDCG)** by dividing DCG by IDCG:

   $$
   nDCG@k = \frac{DCG@k}{IDCG@k}
   $$

### Example ($k=5$)

Ground truth keywords ranked by importance:  
`["fraud", "poverty", "scam"]`

Their graded relevance:  
- "fraud" at position 0 → $rel_{GT} = \frac{1}{\log_2(0 + 2)} = 1.0$  
- "poverty" at position 1 → $rel_{GT} = \frac{1}{\log_2(1 + 2)} = 0.63$  
- "scam" at position 2 → $rel_{GT} = \frac{1}{\log_2(2 + 2)} = 0.5$

**Predicted keywords in order:**  
`["scam", "family", "poverty", "cinematography", "fraud"]`

Matching relevances assigned to predicted keywords:  
- "scam" matches GT at pos 2 → $rel_0 = 0.5$  
- "family" no match → $rel_1 = 0$  
- "poverty" matches GT at pos 1 → $rel_2 = 0.63$  
- "cinematography" no match → $rel_3 = 0$  
- "fraud" matches GT at pos 0 → $rel_4 = 1.0$

Compute DCG:

$$
DCG = \frac{0.5}{\log_2(1 + 1)} + \frac{0}{\log_2(2 + 1)} + \frac{0.63}{\log_2(3 + 1)} + \frac{0}{\log_2(4 + 1)} + \frac{1.0}{\log_2(5 + 1)} \approx 0.5 + 0 + 0.315 + 0 + 0.387 = 1.202
$$

Compute IDCG (ideal predicted order: "fraud", "poverty", "scam"):

$$
IDCG = \frac{1.0}{\log_2(1 + 1)} + \frac{0.63}{\log_2(2 + 1)} + \frac{0.5}{\log_2(3 + 1)} = 1.0 + 0.397 + 0.25 = 1.647
$$

Then,

$$
nDCG@5 = \frac{1.202}{1.647} \approx 0.73
$$

**Change predicted order to:**  
`["fraud", "poverty", "scam", "family", "cinematography"]`

Relevances for predicted keywords:

- "fraud" matches GT at pos 0 → $rel_0 = 1.0$  
- "poverty" matches GT at pos 1 → $rel_1 = 0.63$  
- "scam" matches GT at pos 2 → $rel_2 = 0.5$  
- "family" no match → $rel_3 = 0$  
- "cinematography" no match → $rel_4 = 0$

Compute DCG:

$$
DCG = \frac{1.0}{\log_2(1 + 1)} + \frac{0.63}{\log_2(2 + 1)} + \frac{0.5}{\log_2(3 + 1)} + 0 + 0 = 1.0 + 0.397 + 0.25 = 1.647
$$

IDCG is the same as before.

$$
nDCG@5 = \frac{1.647}{1.647} = 1.0
$$

Changing the order of predicted keywords **does affect** the nDCG score: placing highly relevant keywords earlier leads to higher nDCG, reflecting better ranking quality.

- When relevant keywords appear early in the predicted list, the score increases due to less discounting.
- Conversely, when relevant keywords are placed lower, the score decreases because of higher discounting.
- This metric thus rewards **both correct prediction and the quality of their ranking**.

In [5]:
# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()


# Approximate matching function:
# Returns True if the predicted keyword matches any ground truth keyword
# using a relaxed comparison: exact match or substring containment
def is_approx_match(kw, gt_keywords):
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False

# Weighted evaluation function:
# Calculates precision, recall, and F1-score using the confidence scores of predicted keywords
# - High-confidence correct predictions contribute more
# - Precision is score-weighted; recall divides by total ground truth
def evaluate_keywords_weighted(predicted_kw_score, gt_keywords):
    """
    Evaluate predicted keywords with confidence scores using weighted precision, recall, and F1.
    
    Parameters:
        predicted_kw_score (list of (str, float)): predicted keywords with associated confidence scores
        gt_keywords (list of str): ground truth keywords (annotated)
    
    Returns:
        (precision, recall, f1): all metrics computed using score-weighted matching
    """
    # Normalize both predicted and ground truth keywords
    gt_keywords = [normalize_kw(k) for k in gt_keywords]
    pred_keywords = [(normalize_kw(kw), score) for kw, score in predicted_kw_score if isinstance(kw, str)]
    
    total_score = sum(score for _, score in pred_keywords)
    if total_score == 0:
        return 0, 0, 0

    # Compute total score of predicted keywords that approximately match the ground truth
    match_score = sum(score for kw, score in pred_keywords if is_approx_match(kw, gt_keywords))
    
    # Weighted precision and recall
    precision = match_score / total_score
    recall = match_score / len(gt_keywords) if gt_keywords else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0

    return precision, recall, f1

# Ranking-based evaluation function:
# Computes the normalized Discounted Cumulative Gain (nDCG@k) for predicted keywords
# Gives more credit when correct keywords appear earlier in the ranking
def compute_ndcg(predicted_kw_score, gt_keywords, k=5):
    """
    Compute nDCG@k between predicted keywords (with scores) and ground truth keywords,
    using graded relevance based on ground truth ranking and approximate matching.

    Parameters:
        predicted_kw_score (list of (str, float)): predicted keywords with confidence scores
        gt_keywords (list of str): ground truth keywords ordered by importance
        k (int): top-k keywords to consider

    Returns:
        float: normalized DCG score
    """
    # Normalize ground truth and predicted keywords
    gt_keywords_norm = [normalize_kw(k) for k in gt_keywords]
    pred_keywords_norm = [normalize_kw(kw) for kw, _ in predicted_kw_score[:k]]

    relevance = []
    for pk in pred_keywords_norm:
        # Find ranks of all GT keywords matching predicted keyword approx.
        match_ranks = [i for i, gk in enumerate(gt_keywords_norm) if is_approx_match(pk, [gk])]
        if match_ranks:
            # Assign relevance inversely proportional to rank (log discount)
            best_rank = min(match_ranks)
            rel = 1 / math.log2(best_rank + 2)  # +2 since ranks start at 0
        else:
            rel = 0
        relevance.append(rel)

    # Compute DCG with graded relevance
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

    # Compute ideal DCG (IDCG) assuming best possible ordering (top k GT keywords)
    ideal_relevance = [1 / math.log2(i + 2) for i in range(min(k, len(gt_keywords_norm)))]
    idcg = sum(ideal_relevance)

    return dcg / idcg if idcg > 0 else 0.0


### Evaluate and Compare Models on Keyword Extraction (Weighted)

In this section, we apply the score-aware evaluation metrics to each review for both models:

- **Weighted Precision, Recall, F1**: accounts for the confidence scores of each predicted keyword.
- **nDCG@5**: evaluates the ranking quality of the top-5 keywords based on their alignment with the ground truth.

Each review is evaluated individually, and the metrics are then averaged across all reviews to summarize model performance.

In [6]:
# Models to evaluate
models_to_evaluate = ["base", "reranker"]

# Initialize results dictionary for weighted metrics and nDCG
weighted_results = {model: [] for model in models_to_evaluate}

# Ground truth keywords (same for all reviews in the selected film)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Loop through each review
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"
        
        # Skip if no prediction or wrong format
        if pred_col in row and isinstance(row[pred_col], list):
            predicted_kw_score = row[pred_col]  # list of (kw, score)

            # Compute weighted metrics
            w_precision, w_recall, w_f1 = evaluate_keywords_weighted(predicted_kw_score, ground_truth_keywords)

            # Compute nDCG@5
            ndcg = compute_ndcg(predicted_kw_score, ground_truth_keywords, k=5)

            # Save results
            weighted_results[model].append({
                "weighted_precision": w_precision,
                "weighted_recall": w_recall,
                "weighted_f1": w_f1,
                "ndcg@5": ndcg
            })

# Compute average metrics across all reviews
weighted_summary = {}
for model in models_to_evaluate:
    metrics = weighted_results[model]
    weighted_summary[model] = {
        "avg_weighted_precision": round(np.mean([m["weighted_precision"] for m in metrics]), 4),
        "avg_weighted_recall": round(np.mean([m["weighted_recall"] for m in metrics]), 4),
        "avg_weighted_f1": round(np.mean([m["weighted_f1"] for m in metrics]), 4),
        "avg_ndcg@5": round(np.mean([m["ndcg@5"] for m in metrics]), 4)
    }

# Convert the weighted summary dictionary to a pandas DataFrame
summary_df = pd.DataFrame(weighted_summary).T  # Transpose so models are rows

# Rename columns for better readability
summary_df.columns = [
    "Weighted Precision",
    "Weighted Recall",
    "Weighted F1-score",
    "nDCG@5"
]

# Display the summary table
summary_df.style.format(precision=4).set_caption("Score-Aware Evaluation Summary")


Unnamed: 0,Weighted Precision,Weighted Recall,Weighted F1-score,nDCG@5
base,0.5539,0.0052,0.0102,0.2272
reranker,0.5438,0.0046,0.0092,0.2024


## Evaluation Across All Movies

This section processes all `.pkl` files in the `Extracted_Keywords` directory, where each file corresponds to a single movie and contains the predicted keywords generated by different models.

For **each movie**:

1. **Ground Truth Loading**  
   - The reference (ground truth) keywords for each review are loaded.

2. **Model Predictions**  
   - Predicted keywords from two models are evaluated:
     - **Base** model  
     - **Reranker** model

3. **Metrics Computation**  
   For each review, two groups of metrics are calculated:

   **Score-aware Weighted Metrics**
   These metrics account for the confidence of each prediction:
   - **Weighted Precision**
   - **Weighted Recall**
   - **Weighted F1 Score**

   **Ranking Metric**
   Evaluates the ordering of predicted keywords:
   - **nDCG@5** (Normalized Discounted Cumulative Gain at rank 5): Rewards relevant keywords that appear higher in the prediction list.

Metrics are **averaged per movie** and **per model**.

All results are compiled into a **summary table** to compare the performance of the **Base** and **Reranker** models across the dataset.


In [7]:
# Paths
keywords_dir = "../Dataset/Extracted_Keywords/"
ground_truth_path = "../Dataset/keywords_ground_truth.pkl"

# Load the ground truth once for all movies
keywords_ground_truth = pd.read_pickle(ground_truth_path)

# List of models to evaluate
models_to_evaluate = ["base", "reranker"]

# Store results across all movies
all_results = []

# Iterate over all keyword prediction files
for file in os.listdir(keywords_dir):
    if file.endswith(".pkl") and file.startswith("kw_"):
        movie_name = file.replace("kw_", "").replace(".pkl", "")
        file_path = os.path.join(keywords_dir, file)

        try:
            # Load predicted keywords for the current movie
            selected_film = pd.read_pickle(file_path)
            selected_film_id = selected_film["Movie_ID"].iloc[0]

            # Retrieve ground truth keywords for the selected movie
            kw_ground_truth = keywords_ground_truth[keywords_ground_truth["Movie_ID"] == selected_film_id]
            gt_keywords = kw_ground_truth["Keyword"].tolist()

            # Initialize metrics for each model
            results = {model: [] for model in models_to_evaluate}

            # Evaluate each review in the dataset
            for _, row in selected_film.iterrows():
                for model in models_to_evaluate:
                    pred_col = f"keywords_{model}"

                    # Skip if no predictions or wrong format
                    if pred_col in row and isinstance(row[pred_col], list):
                        predicted_kw_score = row[pred_col]

                        # Extract only keyword strings for unweighted evaluation
                        pred_kw_only = [kw for kw, _ in predicted_kw_score if isinstance(kw, str)]

                        # Compute score-weighted metrics
                        w_precision, w_recall, w_f1 = evaluate_keywords_weighted(predicted_kw_score, gt_keywords)

                        # Compute ranking-based metric (nDCG@5)
                        ndcg = compute_ndcg(predicted_kw_score, gt_keywords, k=5)

                        # Store all metrics for this review
                        results[model].append({
                            "w_precision": w_precision,
                            "w_recall": w_recall,
                            "w_f1": w_f1,
                            "ndcg@5": ndcg,
                        })

            # Compute average metrics per model for the current movie
            for model in models_to_evaluate:
                if results[model]:
                    metrics_df = pd.DataFrame(results[model])
                    all_results.append({
                        "Movie": movie_name,
                        "Model": model,
                        "Avg_Weighted_Precision": round(metrics_df["w_precision"].mean(), 4),
                        "Avg_Weighted_Recall": round(metrics_df["w_recall"].mean(), 4),
                        "Avg_Weighted_F1": round(metrics_df["w_f1"].mean(), 4),
                        "Avg_nDCG@5": round(metrics_df["ndcg@5"].mean(), 4),
                    })

        except Exception as e:
            print(f"Error processing {file}: {e}")

# Create the final summary DataFrame
final_df = pd.DataFrame(all_results)

# Display table with clean index and sorted values
final_df_sorted = final_df.sort_values(by=["Movie", "Model"]).reset_index(drop=True)
final_df_sorted.style.format(precision=4).set_caption("Full Evaluation Summary per Movie and Model")

Unnamed: 0,Movie,Model,Avg_Weighted_Precision,Avg_Weighted_Recall,Avg_Weighted_F1,Avg_nDCG@5
0,SW_Episode6,base,0.5539,0.0052,0.0102,0.2272
1,SW_Episode6,reranker,0.5438,0.0046,0.0092,0.2024
