# Evaluation of KeyBERTSentimentAware

This notebook evaluates and compares different keyword extraction models applied to movie reviews, with a specific focus on assessing how well each model captures not only **semantic relevance** but also **sentiment alignment** with the content of the reviews.

We assess the performance of two models:

- **Base** KeyBERT model  
- **Sentiment-aware** extension (KeyBERTSentimentAware), which integrates a custom sentiment classifier to adjust keyword relevance scores based on predicted sentiment

This notebook adopts a **global evaluation approach**: all predicted and ground truth keywords across the entire dataset are aggregated **before** computing each metric. This gives a dataset-level view of each model's performance that is less sensitive to variance between individual reviews.
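
To make the difference between global aggregation and per-review averaging concrete, here is a minimal sketch. The per-review counts are made up for illustration and do not come from the dataset.

```python
# Toy per-review counts: (number of correct predictions, number of predictions).
# These numbers are hypothetical and only illustrate the two aggregation styles.
reviews = [(1, 5), (1, 1)]

# Global precision: sum the counts over all reviews first, then divide once.
total_correct = sum(correct for correct, _ in reviews)
total_predicted = sum(n_pred for _, n_pred in reviews)
global_precision = total_correct / total_predicted

# Per-review precision: divide inside each review, then average the ratios.
per_review_precision = sum(correct / n_pred for correct, n_pred in reviews) / len(reviews)

print(f"global: {global_precision:.3f}, per-review average: {per_review_precision:.3f}")
```

With global aggregation, a review with a single lucky prediction cannot pull the score up as strongly as it does in the per-review average.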

We use a set of annotated ground truth keywords per movie (from IMDb), and the top-5 predicted keywords (with scores) per review for both models.

#### **Evaluation Layers**

#### 1. **Basic (Unweighted) Metrics**

- **Precision**, **Recall**, and **F1-score** computed globally via approximate binary matching.
- A predicted keyword is correct if it approximately matches any of the movie’s ground truth keywords.
- All keywords from all reviews are flattened and compared in aggregate.

#### 2. **Score-Aware Metrics**

- **Weighted Precision**, **Recall**, and **F1-score**:
  - Each predicted keyword is weighted by its score.
  - Correct predictions contribute proportionally to their confidence.
- **nDCG@5 (Normalized Discounted Cumulative Gain)**:
  - Evaluates whether correct keywords are ranked near the top globally.
  - Relevance is discounted by position, rewarding better keyword orderings.
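
As an illustration of the position discount, the following is a minimal sketch of nDCG@k over a ranked list of relevance labels. The function name `ndcg_at_k` and the choice to build the ideal ranking by re-sorting the same labels are assumptions made for this example, not necessarily how the notebook computes the metric.

```python
import math

def ndcg_at_k(relevances, k=5):
    """nDCG@k for relevance labels listed in predicted rank order (1 = correct, 0 = wrong)."""
    top_k = relevances[:k]
    # Discounted gain of the predicted ordering: position i (0-based) is divided by log2(i + 2).
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(top_k))
    # Ideal ordering (assumption): the same labels sorted with correct keywords first.
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# One correct keyword ranked first versus ranked second.
print(ndcg_at_k([1, 0, 0, 0, 0]))
print(ndcg_at_k([0, 1, 0, 0, 0]))
```

The second call scores lower than the first because the only correct keyword sits one position further down the ranking.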

#### 3. **Semantic Evaluation (Embedding-Based)**

- All predicted and ground truth keywords are embedded using a **sentence-transformer** model.
- **Cosine similarity** is used to detect approximate **semantic matches**.
- A predicted keyword is correct if its similarity with any ground truth keyword exceeds a given threshold (**0.65** in this notebook).
- **Semantic Precision**, **Recall**, and **F1-score** are computed globally based on these soft matches.

#### 4. **Sentiment Appropriateness Score (SAS)**

This novel metric evaluates **how well the sentiment of predicted keywords aligns with the sentiment of the review**.

Two global variants are computed:

- **SAS_from_keywords**:  
  - Computes the average sentiment of predicted keywords (via VADER or custom classifier).  
  - Compares it to the average sentiment of the ground truth keywords for the movie.

- **SAS_from_text**:  
  - Compares the average sentiment of the predicted keywords to the sentiment of the **full review text**.

SAS values lie in [0, 1], where values closer to 1 indicate higher emotional coherence.

#### **Why Sentiment-Aware Evaluation Matters**

The **base** KeyBERT model selects keywords based only on semantic relevance, while the **sentiment-aware** version ranks keywords based on a combination of **semantic and emotional cues**.

Traditional evaluations may overlook whether the extracted keywords convey the **emotional tone** of the review.  
By introducing **Sentiment Appropriateness Scores**, we quantify this alignment explicitly and verify whether integrating sentiment enhances both **relevance** and **emotional fidelity**.

This global, multi-dimensional evaluation provides a robust framework for comparing keyword extraction systems beyond surface-level matching.


## Setup: Installing and Importing Required Libraries

In [48]:
import subprocess
import sys

# List of required packages
required_packages = {
    "pandas", "numpy", "tqdm", "transformers", "torch", "vaderSentiment"
}

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


numpy is already installed.
pandas is already installed.
torch is already installed.
transformers is already installed.
tqdm is already installed.
vaderSentiment is already installed.


In [49]:
# Standard Library
import os      # File system operations (e.g., listing files)
import re      # Regular expressions for text processing
import math    # Mathematical functions (e.g., log2)

# Third-Party Libraries
import pandas as pd                  # Data manipulation with DataFrames
import numpy as np                   # Numerical computations and array operations
from tqdm import tqdm                # Progress bars for loops

# Transformers and PyTorch for embeddings and models
from transformers import AutoTokenizer, AutoModel # type:ignore
import torch
import torch.nn.functional as F

# Sentiment Analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


## Load Available Movies from Dataset

This section lists all the available movies stored as `.pkl` files inside the review dataset directory.

- It defines the root path (`../Dataset/Reviews_By_Movie`) where all review files are saved.
- It automatically detects and lists all movie filenames (removing the `.pkl` extension).

In [50]:
# Define root directory
root_dir = "../Dataset/Reviews_By_Movie"

# List all available movies
available_movies = sorted([f[:-4] for f in os.listdir(root_dir) if f.endswith(".pkl")])
print("Available movies:", available_movies)

Available movies: ['GoodBadUgly', 'HarryPotter', 'IndianaJones', 'LaLaLand', 'Oppenheimer', 'Parasite', 'SW_Episode1', 'SW_Episode2', 'SW_Episode3', 'SW_Episode4', 'SW_Episode5', 'SW_Episode6', 'SW_Episode7', 'SW_Episode8', 'SW_Episode9']


## Select a Movie and Load its Ground Truth Keywords

In this step, we load the keyword extraction results for a specific movie and retrieve the corresponding ground truth keywords. The goal is to use these annotated keywords for evaluation and comparison with automatically extracted ones.

In [51]:
# Set the name of the movie to be evaluated
movie_name = "IndianaJones"

# Load the extracted keywords for the selected movie from a pickle file
# The file path is dynamically built using the movie name
selected_film = pd.read_pickle(f"../Dataset/Extracted_Keywords/kw_{movie_name}.pkl")

# Retrieve the Movie_ID of the selected film
# Assumes that the file contains a DataFrame with at least one row
selected_film_id = selected_film["Movie_ID"].iloc[0]

# Load the full dataset containing the ground truth keywords
# for all movies in the evaluation set
keywords = pd.read_pickle("../Dataset/keywords_ground_truth.pkl")

# Filter the ground truth dataset to extract only the keywords for the selected movie
kw_ground_truth = keywords[keywords["Movie_ID"] == selected_film_id]

## Keyword Matching and Evaluation Functions (Basic – Unweighted)

This block defines the core utility functions used to evaluate predicted keywords against the ground truth. These functions perform a **binary, unweighted evaluation**, ignoring confidence scores and ranking information.

The evaluation pipeline includes the following steps:

- **Normalization**: all keywords are lowercased, stripped of punctuation, and cleaned of extra whitespace to ensure consistent text matching.

- **Approximate Matching**: a relaxed rule considers two keywords as a match if:
  - They are exactly equal (after normalization), or
  - One is a substring of the other (e.g., *"social satire"* is considered a match with *"satire"*).

- **Global Evaluation**: for each model, the keywords predicted for every review of a given movie are compared to that movie's ground truth keywords, and the match counts are summed across all reviews before the metrics are computed.

- **Metrics**: we compute **Precision**, **Recall**, and **F1-score** based on the number of approximate matches between the predicted and ground truth keywords.
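
The matching rule described above can be sketched on a few toy keywords. The helper names `normalize` and `approx_match` and the example keywords are illustrative only. The last case shows that the substring rule is permissive: a very short string can match by accident.

```python
import re

def normalize(kw):
    """Lowercase, drop everything except letters, digits and whitespace, then trim."""
    return re.sub(r"[^a-z0-9\s]", "", kw.lower()).strip()

def approx_match(pred, gt):
    """True if the strings are equal or one is contained in the other."""
    return pred == gt or pred in gt or gt in pred

# Hypothetical ground truth and predictions, for illustration only.
gt_keywords = [normalize(k) for k in ["Social Satire", "Class Conflict"]]
predictions = ["Satire!", "class", "comedy", "at"]

matches = {
    pred: any(approx_match(normalize(pred), gt) for gt in gt_keywords)
    for pred in predictions
}
print(matches)
```

Here `"at"` counts as a match only because it happens to occur inside `"satire"`, so scores based on this rule are best read as an upper bound on exact agreement.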

In [52]:
def normalize_kw(kw):
    """
    Normalize a keyword string by:
    - Converting to lowercase
    - Removing punctuation and non-alphanumeric characters (except spaces)
    - Stripping leading and trailing whitespace

    Args:
        kw (str): The keyword string to normalize.

    Returns:
        str: The normalized keyword.
    """
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumeric characters and whitespace
    return kw.strip()


def is_approx_match(kw, gt_keywords):
    """
    Check if a predicted keyword approximately matches any ground truth keyword.

    A match is considered approximate if:
    - The predicted keyword is exactly equal to a ground truth keyword
    - OR the predicted keyword is a substring of a ground truth keyword
    - OR a ground truth keyword is a substring of the predicted one

    Args:
        kw (str): The normalized predicted keyword.
        gt_keywords (List[str]): A list of normalized ground truth keywords.

    Returns:
        bool: True if an approximate match is found, False otherwise.
    """
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False


def evaluate_keywords(all_pred_keywords, all_gt_keywords):
    """
    Evaluate global precision, recall, and F1-score across a dataset using approximate matching.

    This function compares predicted keywords to ground truth keywords for each review.
    Matching is performed using approximate string comparison, and each ground truth keyword
    can be matched only once to ensure fairness. The metrics are aggregated globally,
    not per-review.

    Args:
        all_pred_keywords (List[List[str]]): 
            A list where each element is a list of predicted keywords for a single review.
        all_gt_keywords (List[List[str]]): 
            A list where each element is a list of ground truth keywords for the corresponding review.

    Returns:
        Tuple[float, float, float]: Global precision, recall, and F1-score based on approximate matching.
    """
    global_match_count = 0     # Total number of matched keywords across all reviews
    global_pred_count = 0      # Total number of predicted keywords
    global_gt_count = 0        # Total number of ground truth keywords

    # Iterate through each review's predictions and ground truths
    for pred_keywords, gt_keywords in zip(all_pred_keywords, all_gt_keywords):
        # Normalize and sort keywords to ensure consistent behavior
        pred_keywords = sorted([normalize_kw(k) for k in pred_keywords])
        gt_keywords = sorted([normalize_kw(k) for k in gt_keywords])

        global_pred_count += len(pred_keywords)
        global_gt_count += len(gt_keywords)

        matched_gts = set()  # Track which ground truth keywords have already been matched

        for pred in pred_keywords:
            for gt in gt_keywords:
                if gt not in matched_gts and is_approx_match(pred, [gt]):
                    global_match_count += 1
                    matched_gts.add(gt)  # Avoid matching the same GT keyword multiple times
                    break

    # Compute global metrics
    precision = global_match_count / global_pred_count if global_pred_count else 0
    recall = global_match_count / global_gt_count if global_gt_count else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0

    return precision, recall, f1

### Evaluate and Compare Models on Keyword Extraction (Basic – Unweighted)

This section evaluates two keyword extraction models — **base** and **sentiment-enhanced** — against the ground truth annotations.

For each model, we collect all predicted keywords across all reviews in the selected movie and compare them to the ground truth keywords using **binary approximate matching**.

The evaluation computes **global precision, recall, and F1-score**, considering the entire set of predictions and ground truth keywords as a whole.

In [53]:
# Define the models to be evaluated
models_to_evaluate = ["base", "sentiment"]

# Extract the list of ground truth keywords for the selected movie
ground_truth_keywords = [normalize_kw(kw) for kw in kw_ground_truth["Keyword"].tolist()]

# Dictionary to store all predicted keywords per model (across all reviews)
all_predictions = {model: [] for model in models_to_evaluate}

# Iterate over each review in the selected film's predictions
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"

        if pred_col in row and isinstance(row[pred_col], list):
            predicted_keywords = [
                normalize_kw(kw) for kw, _ in row[pred_col] if isinstance(kw, str)
            ]
            
            # Remove duplicates per review
            seen = set()
            unique_kw = [kw for kw in predicted_keywords if kw not in seen and not seen.add(kw)]

            all_predictions[model].append(unique_kw)

# Evaluate each model globally
summary = {}
for model in models_to_evaluate:
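    # evaluate_keywords expects one ground truth list per review, so the
    # movie-level list is repeated once for every review
    n_reviews = len(all_predictions[model])
    precision, recall, f1 = evaluate_keywords(
        all_predictions[model],               # List of lists (one per review)
        [ground_truth_keywords] * n_reviews   # Same ground truth for every review
    )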

    summary[model] = {
        "Precision": round(precision, 4),
        "Recall": round(recall, 4),
        "F1-score": round(f1, 4)
    }

# Convert and display
summary_df = pd.DataFrame(summary).T
summary_df.columns = ["Precision", "Recall", "F1-score"]
summary_df.style.format(precision=4).set_caption("Global Evaluation Summary")


| Model | Precision | Recall | F1-score |
|---|---|---|---|
| base | 0.9125 | 0.3685 | 0.525 |
| sentiment | 0.8805 | 0.3391 | 0.4896 |


## Score-Aware Evaluation: Weighted Metrics

This extended evaluation considers the **confidence scores** assigned by the model to each predicted keyword, allowing us to measure not only whether the predictions are correct but also how confidently and effectively they are ranked.

#### Score-Aware Metrics

- **Weighted Precision**: Reflects the proportion of the model’s total confidence assigned to correct keywords. High confidence in incorrect predictions lowers this score.

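- **Weighted Recall**: Measures how much of the ground truth is recovered. Each ground truth keyword carries a rank-based weight of 1 / log2(rank + 1), with rank starting at 1, so matching higher-ranked ground truth keywords contributes more.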

- **Weighted F1-score**: The harmonic mean of weighted precision and recall, balancing accuracy with coverage.
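
A small worked example may help here. It assumes, as in the notebook's code, that ground truth keywords are weighted by rank with 1 / log2(rank + 1). The keywords and scores are hypothetical, and exact matching is used instead of approximate matching to keep the sketch short.

```python
import math

# Hypothetical predictions for one review: (keyword, confidence score).
predicted = [("whip", 0.6), ("boring", 0.4)]
# Hypothetical ground truth, listed from most to least important.
ground_truth = ["whip", "archaeologist"]

# Rank-based weights: the first keyword gets 1 / log2(2) = 1.0, the second 1 / log2(3).
gt_weights = {kw: 1 / math.log2(i + 2) for i, kw in enumerate(ground_truth)}

# Exact matching for brevity (the notebook uses approximate matching).
matched = [(kw, score) for kw, score in predicted if kw in gt_weights]

# Share of the model's total confidence that went to correct keywords.
weighted_precision = sum(score for _, score in matched) / sum(score for _, score in predicted)
# Share of the total ground truth weight that was recovered.
weighted_recall = sum(gt_weights[kw] for kw, _ in matched) / sum(gt_weights.values())
weighted_f1 = (
    2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
)

print(round(weighted_precision, 4), round(weighted_recall, 4), round(weighted_f1, 4))
```

Only `"whip"` matches, so precision equals its share of the predicted confidence (0.6), while recall equals the share of ground truth weight held by the top-ranked keyword.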

In [54]:
def normalize_kw(kw):
    """
    Normalize a keyword string by:
    - Converting to lowercase
    - Removing punctuation and non-alphanumeric characters (except spaces)
    - Stripping leading and trailing whitespace

    Args:
        kw (str): The keyword string to normalize.

    Returns:
        str: The normalized keyword.
    """
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumeric characters and whitespace
    return kw.strip()


def is_approx_match(kw, gt_keywords):
    """
    Check if a predicted keyword approximately matches any ground truth keyword.

    A match is considered approximate if:
    - The predicted keyword is exactly equal to a ground truth keyword
    - OR the predicted keyword is a substring of a ground truth keyword
    - OR a ground truth keyword is a substring of the predicted one

    Args:
        kw (str): The normalized predicted keyword.
        gt_keywords (List[str]): A list of normalized ground truth keywords.

    Returns:
        bool: True if an approximate match is found, False otherwise.
    """
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False


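def score_gt_keywords_from_rank(gt_keywords):
    """
    Assign a rank-based weight to each ground truth keyword.

    The position in the list is treated as the rank: the keyword at index i
    receives weight 1 / log2(i + 2), so the first keyword gets 1.0 and later
    keywords get progressively smaller weights.

    Args:
        gt_keywords (List[str]): Ground truth keywords in rank order.

    Returns:
        List[Tuple[str, float]]: (keyword, weight) pairs in the same order.
    """
    return [
        (kw, 1 / math.log2(i + 2)) for i, kw in enumerate(gt_keywords)
    ]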

def evaluate_keywords_weighted(all_predicted_kw_score, all_gt_keywords):
    """
    Evaluate global weighted precision, recall, and F1-score across multiple reviews.
    Matching is performed using approximate matching. Each predicted keyword contributes
    to the precision proportionally to its confidence score. Each ground truth keyword contributes
    to recall proportionally to its rank-based score.

    Args:
        all_predicted_kw_score (List[List[Tuple[str, float]]]): 
            A list of predicted keyword-score pairs per review.
        all_gt_keywords (List[List[str]]): 
            A list of ground truth keyword lists per review.

    Returns:
        Tuple[float, float, float]: Weighted precision, recall, and F1-score.
    """
    total_pred_score = 0.0  # Sum of all predicted keyword scores
    matched_pred_score = 0.0  # Sum of scores of correctly predicted keywords
    total_gt_score = 0.0  # Sum of all ground truth scores
    matched_gt_score = 0.0  # Sum of scores of matched ground truth keywords

    for pred_kw_score, gt_kw in zip(all_predicted_kw_score, all_gt_keywords):
        # Normalize and score ground truth keywords
        gt_kw_scored = score_gt_keywords_from_rank([normalize_kw(k) for k in gt_kw])
        pred_kw_score = [
            (normalize_kw(kw), score) for kw, score in pred_kw_score if isinstance(kw, str)
        ]

        total_pred_score += sum(score for _, score in pred_kw_score)
        total_gt_score += sum(score for _, score in gt_kw_scored)

        matched_gts = set()

        for kw, score in pred_kw_score:
            for gt_kw, gt_score in gt_kw_scored:
                if gt_kw not in matched_gts and is_approx_match(kw, [gt_kw]):
                    matched_pred_score += score
                    matched_gt_score += gt_score
                    matched_gts.add(gt_kw)
                    break

    precision = matched_pred_score / total_pred_score if total_pred_score > 0 else 0
    recall = matched_gt_score / total_gt_score if total_gt_score > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1

### Evaluate and Compare Models on Keyword Extraction (Weighted)

In this section, we evaluate the overall performance of each model using **score-aware metrics** computed **globally across all reviews**:

- **Weighted Precision, Recall, and F1-score**: These metrics incorporate the **confidence scores** assigned to each predicted keyword, reflecting how much of the model’s confidence is placed on correct predictions.

This global evaluation provides a holistic view of each model’s effectiveness in ranking and selecting relevant keywords across the entire dataset.

In [55]:
# Models to evaluate
models_to_evaluate = ["base", "sentiment"]

# Ground truth keywords (same for all reviews in the selected film)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Prepare data structures to hold predictions for each model
all_predicted_kw_score = {model: [] for model in models_to_evaluate}

# Collect predictions and GT for each review
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"

        # Skip if no prediction or wrong format
        if pred_col in row and isinstance(row[pred_col], list):
            predicted_kw_score = [(kw, score) for kw, score in row[pred_col] if isinstance(kw, str)]
            
            # Remove duplicates per review
            seen = set()
            unique_pred = [(kw, score) for kw, score in predicted_kw_score if kw not in seen and not seen.add(kw)]
            all_predicted_kw_score[model].append(unique_pred)

# Dictionary to store global evaluation results
weighted_summary = {}

# Evaluate each model globally
for model in models_to_evaluate:
    preds = all_predicted_kw_score[model]

    # Global weighted metrics
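    # evaluate_keywords_weighted expects one ground truth list per review,
    # so the movie-level list is repeated once for every review
    w_precision, w_recall, w_f1 = evaluate_keywords_weighted(
        preds, [ground_truth_keywords] * len(preds)
    )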

    # Store results
    weighted_summary[model] = {
        "weighted_precision": round(w_precision, 4),
        "weighted_recall": round(w_recall, 4),
        "weighted_f1": round(w_f1, 4),
    }

# Convert summary to DataFrame
summary_df = pd.DataFrame(weighted_summary).T  # Models as rows

# Rename columns
summary_df.columns = [
    "Weighted Precision",
    "Weighted Recall",
    "Weighted F1-score",
]

# Display final table
summary_df.style.format(precision=4).set_caption("Global Score-Aware Evaluation Summary")


| Model | Weighted Precision | Weighted Recall | Weighted F1-score |
|---|---|---|---|
| base | 0.9176 | 0.4902 | 0.639 |
| sentiment | 0.8955 | 0.4489 | 0.5981 |


## Semantic Evaluation (Base vs Sentiment)

In this section, we evaluate and compare the **Base** and **Sentiment-enhanced** keyword extraction models using a **semantic similarity approach** based on contextual embeddings.

Traditional evaluation metrics rely on exact or approximate string matching between predicted and ground truth keywords. However, this approach may miss semantically related terms that differ lexically but convey the same meaning — such as *"scam"* and *"fraud"*.

To address this limitation, we adopt a **global semantic evaluation**, where all predicted and ground truth keywords across the dataset are compared using **dense sentence embeddings** generated by a pre-trained transformer (e.g., Sentence-BERT).

#### **Semantic Evaluation Procedure**

1. **Embedding Keywords Globally**  
   All predicted and ground truth keywords across all reviews are embedded into high-dimensional vectors using the same transformer model. Ground truth keywords are embedded **once**, and all vectors are normalized to allow cosine similarity comparisons.

2. **Computing Similarity Matrix**  
   For each model, we compute a cosine similarity matrix between **all predicted keywords** and **all ground truth keywords**.

3. **Matching Threshold**  
   A predicted keyword is considered a **semantic match** if its cosine similarity with at least one ground truth keyword exceeds a fixed threshold (e.g., **0.65**). This allows for flexible yet meaningful semantic alignment.

4. **Global Semantic Precision**  
   The proportion of predicted keywords that have at least one semantic match in the ground truth. This reflects how many of the model's predictions are semantically relevant.

5. **Global Semantic Recall**  
   The proportion of ground truth keywords that are captured by semantically similar predictions. This indicates how well the model covers the key concepts.

6. **Global Semantic F1-score**  
   The harmonic mean of semantic precision and recall, summarizing both relevance and coverage into a single score.

This evaluation:

- Is **more tolerant of lexical variation** than string-based metrics.
- **Captures meaning**, not just surface forms.
- Helps evaluate models that paraphrase or generalize beyond exact matches.

This evaluation complements previous metrics and provides a more **realistic estimate** of how well the models capture the essence of user-annotated keywords in a global and context-aware manner.
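
The thresholded matching procedure above can be sketched without a transformer by using hand-picked 2D vectors in place of real embeddings. All keywords and vectors below are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy stand-ins for keyword embeddings (not produced by any model).
pred_vecs = {"heist": [1.0, 0.0], "romance": [0.0, 1.0], "con artist": [0.6, 0.8]}
gt_vecs = {"robbery": [1.0, 0.0], "fraud": [0.8, 0.6]}
threshold = 0.65

# Similarity between every predicted keyword and every ground truth keyword.
sims = {
    (p, g): cosine(pv, gv)
    for p, pv in pred_vecs.items()
    for g, gv in gt_vecs.items()
}

# A prediction is matched if any ground truth keyword is above the threshold,
# and a ground truth keyword is matched if any prediction is above it.
matched_preds = {p for (p, g), s in sims.items() if s > threshold}
matched_gts = {g for (p, g), s in sims.items() if s > threshold}

precision = len(matched_preds) / len(pred_vecs)
recall = len(matched_gts) / len(gt_vecs)
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Note that one prediction can cover several ground truth keywords under this scheme, so semantic recall is not a one-to-one matching.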

In [56]:
# Load a sentence embedding model from the SentenceTransformers family
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model to generate contextual embeddings
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).to(device)

# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()

def embed_keywords(keywords, device="cuda"):
    """
    Compute sentence embeddings for a list of keyword strings.

    Parameters:
    ----------
    keywords : List[str]
        A list of keyword strings to encode.
    device : str
        Device to run the model on ('cuda' or 'cpu').

    Returns:
    -------
    torch.Tensor
        Normalized embeddings tensor of shape (num_keywords, embedding_dim).
    """
    # Return empty tensor if input list is empty
    if not keywords:
        return torch.empty(0, encoder.config.hidden_size).to(device)

    # Tokenize and prepare inputs for the model
    inputs = tokenizer(keywords, padding=True, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        # Forward pass through the encoder to get hidden states
        outputs = encoder(**inputs)

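        # Mean-pool the last hidden state over real tokens only: the attention
        # mask zeroes out padding positions so they do not dilute shorter keywords
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        summed = (outputs.last_hidden_state * mask).sum(dim=1)
        embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)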

        # Normalize embeddings to unit length for cosine similarity computations
        embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings

def evaluate_semantic_keywords(all_pred_keywords, gt_keywords, threshold=0.65, device="cuda"):
    """
    Compute global semantic precision, recall, and F1 score between all predicted keywords
    and ground truth keywords using cosine similarity over embeddings.

    Parameters:
    ----------
    all_pred_keywords : List[str]
        Flat list of predicted keywords aggregated across all reviews.
    gt_keywords : List[str]
        Global list of ground truth keywords for the movie.
    threshold : float
        Cosine similarity threshold for considering a match.
    device : str
        Device to run the model on ('cuda' or 'cpu').

    Returns:
    -------
    precision : float
    recall : float
    f1 : float
    """
    # Early return if either set is empty
    if len(all_pred_keywords) == 0 or len(gt_keywords) == 0:
        return 0.0, 0.0, 0.0

    # Compute embeddings
    pred_emb = embed_keywords(all_pred_keywords, device=device)
    gt_emb = embed_keywords(gt_keywords, device=device)

    # Compute similarity matrix
    sims = torch.matmul(pred_emb, gt_emb.T)

    # Match counting based on threshold
    pred_matches = (sims > threshold).any(dim=1).float().sum().item()
    gt_matches = (sims > threshold).any(dim=0).float().sum().item()

    precision = pred_matches / len(all_pred_keywords)
    recall = gt_matches / len(gt_keywords)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return precision, recall, f1

### Semantic Evaluation of Base and Sentiment Models Using Sentence Embeddings

In this step, we evaluate the **semantic similarity** between the predicted keywords of two models — **Base** and **Sentiment-enhanced** — and the ground truth keywords using **sentence embeddings**.

Unlike exact or approximate string matching, this method leverages **contextual embeddings** from a pre-trained transformer to assess how semantically close the predicted keywords are to the reference keywords.

The evaluation procedure is as follows:

- We extract only the **text** of the predicted keywords for each model, discarding their confidence scores.
- We embed all **predicted** and **ground truth** keywords using the same sentence transformer model.
- Embeddings are **normalized** to ensure cosine similarity is a valid similarity measure.
- For each predicted keyword, we compute the **cosine similarity** with all ground truth keywords.
- A predicted keyword is considered a **semantic match** if its similarity with any ground truth keyword exceeds a fixed threshold (**0.65** in the cell below).

Once all matches are determined across all reviews of the selected movie, we compute:

- **Semantic Precision**: Fraction of all predicted keywords (global) that have a semantic match.
- **Semantic Recall**: Fraction of all ground truth keywords that are matched by at least one semantically similar predicted keyword.
- **Semantic F1-score**: Harmonic mean of semantic precision and recall.

This global semantic evaluation better reflects the models’ ability to capture **meaningful and relevant keywords**, even when the wording differs from the ground truth.


In [57]:
# Ground truth keywords for the selected movie (embedded inside evaluate_semantic_keywords)
gt_keywords = kw_ground_truth["Keyword"].tolist()

# Define the models to evaluate
models_to_evaluate = ["base", "sentiment"]

# Dictionary to collect all predicted keywords per model (without duplicates)
all_predictions = {model: set() for model in models_to_evaluate}

# Collect predicted keywords across all reviews (as a set for uniqueness)
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"

        if pred_col in row and isinstance(row[pred_col], list):
            # Extract keyword strings and normalize
            pred_kw = [normalize_kw(kw) for kw, _ in row[pred_col] if isinstance(kw, str)]
            all_predictions[model].update(pred_kw)  # Add to set (no duplicates)

# Compute semantic evaluation globally for each model
semantic_scores = []
for model in models_to_evaluate:
    pred_kw = list(all_predictions[model])  # Convert back to list
    precision, recall, f1 = evaluate_semantic_keywords(pred_kw, gt_keywords, device=device, threshold=0.65)

    semantic_scores.append({
        "Model": model,
        "Semantic_Precision": round(precision, 4),
        "Semantic_Recall": round(recall, 4),
        "Semantic_F1": round(f1, 4)
    })

# Convert to DataFrame and format
summary_df = pd.DataFrame(semantic_scores).set_index("Model")
summary_df.style.format(precision=4).set_caption("Global Semantic-Aware Evaluation Summary")


| Model | Semantic_Precision | Semantic_Recall | Semantic_F1 |
|---|---|---|---|
| base | 0.9628 | 0.7978 | 0.8726 |
| sentiment | 0.9616 | 0.8162 | 0.883 |


## Sentiment Appropriateness Score (SAS)

To further evaluate the quality of predicted keywords from a sentiment-aware perspective, we introduce the **Sentiment Appropriateness Score (SAS)**. This metric assesses how well the **overall sentiment of the predicted keywords** aligns with the **sentiment of the ground truth keywords** or the **sentiment of the full review text**, providing a complementary dimension to traditional keyword evaluation.

Unlike standard metrics like **Precision**, **Recall**, or **F1**, which measure lexical or semantic correctness, SAS explicitly measures **emotional alignment**.

#### **Two Global Evaluation Schemes**

1. **SAS from Ground Truth Keywords**

   - The sentiment of the reference is approximated using all ground truth keywords for the movie.
   - Each ground truth keyword is analyzed using **VADER** sentiment analyzer.
   - The sentiment score is calculated as a weighted combination:
     - `pos` → 1.0
     - `neu` → 0.5
     - `neg` → 0.0
   - The average sentiment across all ground truth keywords forms the **reference sentiment**.
   - This is compared to the **global average sentiment** of all predicted keywords (across reviews).

2. **SAS from Review Text**

   - The sentiment of the review set is approximated by aggregating the sentiments of the full review texts.
   - Each review is processed with VADER using the same weighted scoring.
   - The resulting average sentiment is compared to the sentiment of all predicted keywords.

#### **Formula**

$$
\text{SAS} = 1 - \left| \text{Sentiment}_{\text{predicted}} - \text{Sentiment}_{\text{reference}} \right|
$$

SAS is a value in **\[0, 1\]**, where values closer to **1** indicate stronger emotional coherence between the predicted keywords and the reference source (either ground truth keywords or full review text).
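
The weighted combination and the SAS formula can be traced with a short sketch. The score dictionaries below are hypothetical stand-ins for VADER-style `pos`/`neu`/`neg` outputs, and the helper names `unit_sentiment` and `sas` are introduced only for this example.

```python
def unit_sentiment(scores):
    """Map pos/neu/neg proportions to a single value in [0, 1]."""
    return 1.0 * scores["pos"] + 0.5 * scores["neu"] + 0.0 * scores["neg"]

def sas(sentiment_predicted, sentiment_reference):
    """Sentiment Appropriateness Score: 1 minus the absolute sentiment gap."""
    return 1 - abs(sentiment_predicted - sentiment_reference)

# Hypothetical sentiment proportions for two predicted keywords and one reference text.
predicted_scores = [
    {"pos": 0.5, "neu": 0.5, "neg": 0.0},
    {"pos": 0.0, "neu": 1.0, "neg": 0.0},
]
reference_scores = [{"pos": 0.0, "neu": 0.5, "neg": 0.5}]

# Average the per-item sentiment values on each side, then compare the averages.
sentiment_pred = sum(unit_sentiment(s) for s in predicted_scores) / len(predicted_scores)
sentiment_ref = sum(unit_sentiment(s) for s in reference_scores) / len(reference_scores)

print(sentiment_pred, sentiment_ref, sas(sentiment_pred, sentiment_ref))
```

In this toy case the predicted keywords are mildly positive (0.625) while the reference leans negative (0.25), so the gap of 0.375 gives an SAS of 0.625.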

#### **Why VADER?**

We choose **VADER (Valence Aware Dictionary and sEntiment Reasoner)** for its suitability in this context:

- It was designed for **short, informal text** (originally social media posts), which makes it a reasonable fit for tags and keywords.
- It works on **single words and short phrases**, which is the form our predicted keywords take.
- It provides interpretable **proportional sentiment scores** (`pos`, `neu`, `neg`) that sum to 1.
- It is lightweight and rule-based, so it scales easily to large global evaluations.

Although transformer-based sentiment models may offer more nuanced analysis on long text, VADER offers a practical **balance of performance, interpretability, and scalability** for evaluating the emotional tone of predicted keywords in our keyword extraction task.


In [58]:
def compute_sas_from_keywords(all_predicted_keywords, ground_truth_keywords=None, analyzer=None, sentiment_gt=None):
    """
    Computes global SAS by comparing the average sentiment of all predicted keywords (across reviews)
    to the sentiment of the ground truth keywords (either computed or pre-given).

    Args:
        all_predicted_keywords (list of list of dict): list of predicted keyword dicts per review (each dict has 'sentiment_score' ∈ [0,1])
        ground_truth_keywords (list of str): optional, used if sentiment_gt is not given
        analyzer (SentimentIntensityAnalyzer): optional, used if sentiment_gt is not given
        sentiment_gt (float): optional precomputed global ground truth sentiment ∈ [0,1]

    Returns:
        float: SAS ∈ [0,1] — higher means better emotional alignment with ground truth
    """
    # Compute GT sentiment if not precomputed
    if sentiment_gt is None:
        if not ground_truth_keywords or analyzer is None:
            return None
        gt_scores = [
            analyzer.polarity_scores(kw)
            for kw in ground_truth_keywords
        ]
        sentiments_gt = [
            1.0 * s["pos"] + 0.5 * s["neu"] + 0.0 * s["neg"]
            for s in gt_scores
        ]
        sentiment_gt = sum(sentiments_gt) / len(sentiments_gt) if sentiments_gt else 0.0

    # Collect all predicted sentiments across reviews
    all_sentiments_pred = []
    for review_preds in all_predicted_keywords:
        all_sentiments_pred.extend(
            [kw['sentiment_score'] for kw in review_preds]
        )

    # Compute global average of predicted sentiment
    if not all_sentiments_pred:
        return None
    sentiment_pred = sum(all_sentiments_pred) / len(all_sentiments_pred)

    # Global SAS
    return 1 - abs(sentiment_pred - sentiment_gt)


def compute_sas_from_text(all_predicted_keywords, review_texts=None, analyzer=None, sentiment_text=None):
    """
    Computes global SAS by comparing the average sentiment of all predicted keywords
    to the sentiment of the full movie text corpus (or average of review texts).

    Args:
        all_predicted_keywords (list of list of dict): list of predicted keyword dicts per review (each dict has 'sentiment_score' ∈ [0,1])
        review_texts (list of str): optional, used if sentiment_text is not given
        analyzer (SentimentIntensityAnalyzer): optional, used if sentiment_text is not given
        sentiment_text (float): optional precomputed global sentiment of review text ∈ [0,1]

    Returns:
        float: SAS ∈ [0,1] — higher means better emotional alignment with full review sentiment
    """
    # Compute sentiment from full text if not precomputed
    if sentiment_text is None:
        if not review_texts or analyzer is None:
            return None
        text_scores = [
            analyzer.polarity_scores(text)
            for text in review_texts if isinstance(text, str)
        ]
        sentiments_text = [
            1.0 * s["pos"] + 0.5 * s["neu"] + 0.0 * s["neg"]
            for s in text_scores
        ]
        sentiment_text = sum(sentiments_text) / len(sentiments_text) if sentiments_text else 0.0

    # Collect all predicted keyword sentiments across reviews
    all_sentiments_pred = []
    for review_preds in all_predicted_keywords:
        all_sentiments_pred.extend(
            [kw['sentiment_score'] for kw in review_preds]
        )

    if not all_sentiments_pred:
        return None
    sentiment_pred = sum(all_sentiments_pred) / len(all_sentiments_pred)

    # Global SAS
    return 1 - abs(sentiment_pred - sentiment_text)

### Sentiment Appropriateness Evaluation Procedure

In this section, we evaluate the sentiment alignment of predicted keywords for each model using the **Sentiment Appropriateness Score (SAS)**.

Unlike traditional metrics, SAS assesses how well the **overall sentiment of the predicted keywords** reflects the sentiment of the movie’s content, based on two global references:

- **SAS from Ground Truth Keywords**  
  Compares the average sentiment of all predicted keywords to that of the ground truth keywords using **VADER**.

- **SAS from Review Texts**  
  Compares the predicted sentiment to the overall sentiment of all preprocessed reviews.

Both reference sentiments are computed **globally per movie**, and each model's predictions are evaluated accordingly.  
SAS is defined as:

$$
\text{SAS} = 1 - \left| \text{sentiment}_{\text{pred}} - \text{sentiment}_{\text{ref}} \right|
$$

Higher scores indicate better emotional alignment between the predicted keywords and the movie's actual sentiment.

The results are summarized in a table for **global comparison across models**.

In [59]:
# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Define models to evaluate
models_to_evaluate = ["base", "sentiment"]

# Precompute global ground truth sentiment (from all keywords in the movie)
gt_keywords = kw_ground_truth["Keyword"].tolist()
sentiments_gt = []
for kw in gt_keywords:
    scores = analyzer.polarity_scores(kw)
    sentiment = 1.0 * scores["pos"] + 0.5 * scores["neu"] + 0.0 * scores["neg"]
    sentiments_gt.append(sentiment)
sentiment_gt = sum(sentiments_gt) / len(sentiments_gt) if sentiments_gt else None

# Precompute global review sentiment (from all review texts)
all_review_texts = selected_film["Preprocessed_Review"].dropna().tolist()
sentiments_text = []
for text in all_review_texts:
    scores = analyzer.polarity_scores(text)
    sentiment = 1.0 * scores["pos"] + 0.5 * scores["neu"] + 0.0 * scores["neg"]
    sentiments_text.append(sentiment)
sentiment_text = sum(sentiments_text) / len(sentiments_text) if sentiments_text else None

# Collect SAS results
sas_scores = []

for model in models_to_evaluate:
    all_predicted_keywords = []

    for _, row in selected_film.iterrows():
        pred_col = f"keywords_{model}"

        if pred_col in row and isinstance(row[pred_col], list):
            for kw, score in row[pred_col]:
                if isinstance(kw, str) and isinstance(score, (float, int)):
                    all_predicted_keywords.append({
                        "keyword": kw,
                        "sentiment_score": float(score)
                    })

    if not all_predicted_keywords:
        continue

    # Compute SAS from GT keywords
    sas_kw = compute_sas_from_keywords([all_predicted_keywords], sentiment_gt=sentiment_gt)

    # Compute SAS from full review text
    sas_text = compute_sas_from_text([all_predicted_keywords], sentiment_text=sentiment_text)

    result = {"Model": model}
    if sas_kw is not None:
        result["SAS from keywords"] = round(sas_kw, 4)
    if sas_text is not None:
        result["SAS from text"] = round(sas_text, 4)

    sas_scores.append(result)

# Create and display summary DataFrame
sas_df = pd.DataFrame(sas_scores).set_index("Model")
sas_df.style.format(precision=4).set_caption("Sentiment Appropriateness Score (SAS) - Global Evaluation")


| Model | SAS from keywords | SAS from text |
|---|---|---|
| base | 0.9675 | 0.9206 |
| sentiment | 0.9015 | 0.8545 |


## Evaluation Across All Movies

This section automatically processes all `.pkl` files in the `Extracted_Keywords` directory, where each file corresponds to a single movie and contains predicted keywords generated by different models.

For **each movie**:

- The corresponding **ground truth keywords** are loaded.
- Predicted keywords from both models — **Base** and **Sentiment-aware** — are aggregated across all reviews.
- The following **global evaluation metrics** are computed:

#### Unweighted Metrics
- **Precision**, **Recall**, and **F1-score**  
  Based on approximate string matching between predicted and ground truth keywords, without considering prediction scores.

#### Score-aware Metrics
- **Weighted Precision**, **Weighted Recall**, **Weighted F1-score**  
  Evaluate prediction correctness while incorporating confidence scores.
  
- **nDCG@5**  
  Measures the ranking quality of the top 5 predicted keywords, rewarding correct keywords ranked higher. The loop in this section does not compute it, so it has no column in the summary table.

#### Semantic Metrics
- **Semantic Precision**, **Semantic Recall**, **Semantic F1-score**  
  Computed using cosine similarity between **sentence embeddings** of predicted and reference keywords.

#### Sentiment Alignment Metrics
- **SAS from Keywords**:  
  Measures how well the average sentiment of predicted keywords aligns with the sentiment of the ground truth keywords (via VADER).
  
- **SAS from text**:  
  Measures alignment with the sentiment of the full review texts.

All metrics are computed **globally per movie**, not per review, and the results are compiled into a **comprehensive summary table** for comparison across models.
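To make the "globally per movie, not per review" distinction concrete, here is a minimal sketch with made-up scores. It contrasts a global average (flatten first, then average) with an average of per-review averages; the two differ whenever reviews contribute different numbers of keywords. The variable names and values are illustrative assumptions, not data from the notebook.

```python
# Hypothetical keyword scores for two reviews of one movie
reviews = [
    [0.9, 0.8, 0.7, 0.6],  # a review with four predicted keywords
    [0.2],                 # a review with a single predicted keyword
]

# Global aggregation: flatten all keywords, then average once
flat_scores = [score for review in reviews for score in review]
global_mean = sum(flat_scores) / len(flat_scores)

# Per-review aggregation: average each review, then average those means
per_review_means = [sum(review) / len(review) for review in reviews]
mean_of_means = sum(per_review_means) / len(per_review_means)

print(round(global_mean, 4))    # 0.64
print(round(mean_of_means, 4))  # 0.475
```

In the global version every keyword carries equal weight, so a review with few keywords has less influence than in the per-review version.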


In [None]:
# Paths
keywords_dir = "../Dataset/Extracted_Keywords/"
ground_truth_path = "../Dataset/keywords_ground_truth.pkl"

# Load ground truth keywords
keywords_ground_truth = pd.read_pickle(ground_truth_path)

# Models to evaluate
models_to_evaluate = ["base", "sentiment"]

# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Container to store results for all movies and models
global_results = []

# Iterate over all predicted keyword files (one file per movie)
for file in os.listdir(keywords_dir):
    if file.endswith(".pkl") and file.startswith("kw_"):
        movie_name = file.replace("kw_", "").replace(".pkl", "")
        file_path = os.path.join(keywords_dir, file)

        try:
            # Load predicted keywords DataFrame and movie ID
            selected_film = pd.read_pickle(file_path)
            selected_film_id = selected_film["Movie_ID"].iloc[0]

            # Get ground truth keywords for this movie
            kw_ground_truth = keywords_ground_truth[
                keywords_ground_truth["Movie_ID"] == selected_film_id
            ]
            gt_keywords = kw_ground_truth["Keyword"].tolist()

            # Compute average sentiment of ground truth keywords
            gt_sentiments = []
            for kw in gt_keywords:
                scores = analyzer.polarity_scores(kw)
                sentiment = 1.0 * scores["pos"] + 0.5 * scores["neu"]
                gt_sentiments.append(sentiment)
            sentiment_gt = sum(gt_sentiments) / len(gt_sentiments) if gt_sentiments else None

            # Compute average sentiment from all review texts
            review_texts = selected_film["Preprocessed_Review"].dropna().tolist()
            review_sentiments = []
            for text in review_texts:
                scores = analyzer.polarity_scores(text)
                sentiment = 1.0 * scores["pos"] + 0.5 * scores["neu"]
                review_sentiments.append(sentiment)
            sentiment_text = sum(review_sentiments) / len(review_sentiments) if review_sentiments else None

            # Evaluate each model globally
            for model in models_to_evaluate:
                pred_col = f"keywords_{model}"

                # Lists of lists for evaluation functions:
                # - pred_kw_per_review: list of lists of keywords (strings) for evaluate_keywords()
                # - pred_kwscore_per_review: list of lists of (keyword, score) tuples for evaluate_keywords_weighted()
                pred_kw_per_review = []
                pred_kwscore_per_review = []
                
                # Flat list of dicts for SAS functions
                flat_keyword_list = []

                # Iterate over reviews to build these lists
                for _, row in selected_film.iterrows():
                    if pred_col in row and isinstance(row[pred_col], list):

                        # Extract keywords only for evaluate_keywords()
                        pred_kw_only = [kw for kw, _ in row[pred_col] if isinstance(kw, str)]

                        # Extract (keyword, score) tuples for evaluate_keywords_weighted()
                        pred_kw_score = [
                            (kw, score) for kw, score in row[pred_col]
                            if isinstance(kw, str) and isinstance(score, (float, int))
                        ]

                        # Append per review keyword lists if not empty
                        if pred_kw_only:
                            pred_kw_per_review.append(pred_kw_only)
                        if pred_kw_score:
                            pred_kwscore_per_review.append(pred_kw_score)

                        # Append flat dicts for SAS calculations
                        # (the model's keyword score is stored under 'sentiment_score')
                        for kw, score in pred_kw_score:
                            flat_keyword_list.append({
                                "keyword": kw,
                                "sentiment_score": float(score)
                            })

                # Compute classic precision, recall, F1
                precision, recall, f1 = evaluate_keywords(pred_kw_per_review, gt_keywords)

                # Compute weighted precision, recall, F1
                w_precision, w_recall, w_f1 = evaluate_keywords_weighted(pred_kwscore_per_review, gt_keywords)

                # Compute semantic precision, recall, F1
                flat_kw_list = [kw for review in pred_kw_per_review for kw in review]
                flat_kw_list = list(set(flat_kw_list))

                semantic_precision, semantic_recall, semantic_f1 = evaluate_semantic_keywords(
                    flat_kw_list, gt_keywords, device=device, threshold=0.65
                )

                # Compute SAS (Sentiment Appropriateness Score)
                sas_kw = compute_sas_from_keywords([flat_keyword_list], sentiment_gt=sentiment_gt)
                sas_txt = compute_sas_from_text([flat_keyword_list], sentiment_text=sentiment_text)

                # Store results for this movie-model pair
                global_results.append({
                    "Movie": movie_name,
                    "Model": model,
                    "Precision": precision,
                    "Recall": recall,
                    "F1-score": f1,
                    "Weighted Precision": w_precision,
                    "Weighted Recall": w_recall,
                    "Weighted F1-score": w_f1,
                    "Semantic Precision": semantic_precision,
                    "Semantic Recall": semantic_recall,
                    "Semantic F1-score": semantic_f1,
                    "SAS from Keywords": sas_kw,
                    "SAS from text": sas_txt
                })

        except Exception as e:
            print(f"Error processing {file}: {e}")

# Create final DataFrame and sort results
final_df = pd.DataFrame(global_results)
final_df = final_df.sort_values(by=["Movie", "Model"]).reset_index(drop=True)
final_df.style.format(precision=4).set_caption("Global Evaluation Summary per Movie and Model")


|   | Movie | Model | Precision | Recall | F1-score | Weighted Precision | Weighted Recall | Weighted F1-score | Semantic Precision | Semantic Recall | Semantic F1-score | SAS from Keywords | SAS from text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GoodBadUgly | base | 0.9341 | 0.3612 | 0.521 | 0.9352 | 0.4857 | 0.6394 | 0.8691 | 0.7888 | 0.827 | 0.9868 | 0.9245 |
| 1 | GoodBadUgly | sentiment | 0.9074 | 0.3391 | 0.4937 | 0.9104 | 0.4481 | 0.6006 | 0.7477 | 0.8043 | 0.775 | 0.9773 | 0.8886 |
| 2 | HarryPotter | base | 0.9194 | 0.3394 | 0.4957 | 0.9202 | 0.4552 | 0.6091 | 0.9016 | 0.882 | 0.8917 | 0.9738 | 0.9087 |
| 3 | HarryPotter | sentiment | 0.9023 | 0.3251 | 0.4779 | 0.9043 | 0.4348 | 0.5873 | 0.8966 | 0.8976 | 0.8971 | 0.9449 | 0.8798 |
| 4 | IndianaJones | base | 0.9125 | 0.3678 | 0.5243 | 0.9176 | 0.4902 | 0.639 | 0.9628 | 0.7978 | 0.8726 | 0.9675 | 0.9206 |
| 5 | IndianaJones | sentiment | 0.8805 | 0.3385 | 0.489 | 0.8955 | 0.4489 | 0.5981 | 0.9616 | 0.8162 | 0.883 | 0.9015 | 0.8545 |
| 6 | LaLaLand | base | 0.8917 | 0.3193 | 0.4702 | 0.9063 | 0.4422 | 0.5943 | 0.8537 | 0.7759 | 0.8129 | 0.9438 | 0.9046 |
| 7 | LaLaLand | sentiment | 0.8928 | 0.3126 | 0.4631 | 0.8962 | 0.4212 | 0.5731 | 0.7331 | 0.7897 | 0.7604 | 0.9185 | 0.8793 |
| 8 | Oppenheimer | base | 0.9384 | 0.3207 | 0.4781 | 0.947 | 0.4611 | 0.6202 | 0.9905 | 0.8556 | 0.9181 | 0.9926 | 0.9481 |
| 9 | Oppenheimer | sentiment | 0.9247 | 0.309 | 0.4632 | 0.9219 | 0.4328 | 0.5891 | 0.8451 | 0.9428 | 0.8913 | 0.9686 | 0.9092 |
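For a cross-movie comparison, the per-movie rows can be averaged per model. The sketch below does this in plain Python for one column only, using the "SAS from Keywords" values copied from the summary table above; the dictionary layout and variable names are illustrative assumptions.

```python
from statistics import mean

# "SAS from Keywords" values copied from the summary table, in movie order:
# GoodBadUgly, HarryPotter, IndianaJones, LaLaLand, Oppenheimer
sas_from_keywords = {
    "base": [0.9868, 0.9738, 0.9675, 0.9438, 0.9926],
    "sentiment": [0.9773, 0.9449, 0.9015, 0.9185, 0.9686],
}

# Average over the five movies for each model
mean_sas = {model: mean(values) for model, values in sas_from_keywords.items()}

for model, value in mean_sas.items():
    print(f"{model}: {value:.4f}")
```

With pandas, the same aggregation over every numeric column could be written as `final_df.groupby("Model").mean(numeric_only=True)`.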


In [None]:
# Save the summary DataFrame to CSV
final_df.to_csv("base_vs_sentiment_evaluation.csv", index=False)

In [82]:
import os
import pandas as pd
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer
from scipy.stats import spearmanr
#import nltk
#nltk.download('vader_lexicon')  # Run once, outside this script

def compute_sentiment_rank_correlation(all_predicted_keywords):
    """Mean absolute Spearman correlation, per review, between keyword position and sentiment intensity."""
    correlations = []
    for review in all_predicted_keywords:
        if len(review) < 2:
            continue
        # Sentiment intensity = distance of the score from the neutral point 0.5
        sentiment_intensity = [abs(kw['sentiment_score'] - 0.5) for kw in review]
        rank = list(reversed(range(len(review))))  # first keyword gets the highest value, last gets 0
        corr, _ = spearmanr(rank, sentiment_intensity)
        # spearmanr returns NaN (not None) when one of the inputs is constant
        if not np.isnan(corr):
            correlations.append(abs(corr))  # absolute value: strength of the relationship
    return np.mean(correlations) if correlations else None

# Paths
keywords_dir = "../Dataset/Extracted_Keywords/"
ground_truth_path = "../Dataset/keywords_ground_truth.pkl"

# Load ground truth keywords
keywords_ground_truth = pd.read_pickle(ground_truth_path)

# Models to evaluate
models_to_evaluate = ["base", "sentiment"]

# Initialize the sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

global_results = []

for file in os.listdir(keywords_dir):
    if file.endswith(".pkl") and file.startswith("kw_"):
        movie_name = file.replace("kw_", "").replace(".pkl", "")
        file_path = os.path.join(keywords_dir, file)

        try:
            selected_film = pd.read_pickle(file_path)
            selected_film_id = selected_film["Movie_ID"].iloc[0]

            # Ground truth keywords for this movie
            kw_ground_truth = keywords_ground_truth[
                keywords_ground_truth["Movie_ID"] == selected_film_id
            ]
            gt_keywords = kw_ground_truth["Keyword"].tolist()

            # Compute average sentiment of ground truth keywords
            gt_sentiments = []
            for kw in gt_keywords:
                scores = analyzer.polarity_scores(kw)
                sentiment = 1.0 * scores["pos"] + 0.5 * scores["neu"]
                gt_sentiments.append(sentiment)
            sentiment_gt = sum(gt_sentiments) / len(gt_sentiments) if gt_sentiments else None

            # Compute average sentiment from review texts
            review_texts = selected_film["Preprocessed_Review"].dropna().tolist()
            review_sentiments = []
            for text in review_texts:
                scores = analyzer.polarity_scores(text)
                sentiment = 1.0 * scores["pos"] + 0.5 * scores["neu"]
                review_sentiments.append(sentiment)
            sentiment_text = sum(review_sentiments) / len(review_sentiments) if review_sentiments else None

            for model in models_to_evaluate:
                pred_col = f"keywords_{model}"

                all_predicted_keywords_per_review = []

                for _, row in selected_film.iterrows():
                    if pred_col in row and isinstance(row[pred_col], list):
                        review_kw = []
                        for kw, score in row[pred_col]:
                            if isinstance(kw, str) and isinstance(score, (float, int)):
                                review_kw.append({
                                    "keyword": kw,
                                    "sentiment_score": float(score)
                                })
                        if review_kw:
                            all_predicted_keywords_per_review.append(review_kw)

                if not all_predicted_keywords_per_review:
                    continue

                rank_corr = compute_sentiment_rank_correlation(all_predicted_keywords_per_review)

                global_results.append({
                    "Movie": movie_name,
                    "Model": model,
                    "Rank Correlation": rank_corr
                })

        except Exception as e:
            print(f"Error processing {file}: {e}")

final_df = pd.DataFrame(global_results)
final_df = final_df.sort_values(by=["Movie", "Model"]).reset_index(drop=True)

final_df.style.format(precision=4).set_caption("Rank Correlation per Movie and Model")


|   | Movie | Model | Rank Correlation |
|---|---|---|---|
| 0 | GoodBadUgly | base | 0.7354 |
| 1 | GoodBadUgly | sentiment | 0.8408 |
| 2 | HarryPotter | base | 0.8135 |
| 3 | HarryPotter | sentiment | 0.8556 |
| 4 | IndianaJones | base | 0.664 |
| 5 | IndianaJones | sentiment | 0.8842 |
| 6 | LaLaLand | base | 0.7608 |
| 7 | LaLaLand | sentiment | 0.8363 |
| 8 | Oppenheimer | base | 0.7989 |
| 9 | Oppenheimer | sentiment | 0.8526 |
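
The rank correlation reported above is, for each review, the absolute Spearman correlation between a keyword's position in the predicted list and its sentiment intensity `|score - 0.5|`, averaged over reviews. The sketch below reproduces that per-review quantity in plain Python on made-up scores. `spearman_no_ties` is a hypothetical helper using the textbook formula for data without ties; it is not the `scipy.stats.spearmanr` call used in the notebook, and the score lists are illustrative only.

```python
def spearman_no_ties(x, y):
    """Spearman's rho for equal-length sequences without tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for position, index in enumerate(order):
            result[index] = position + 1  # smallest value gets rank 1
        return result

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def rank_intensity_correlation(scores):
    """Absolute correlation between list position and distance from neutral (0.5)."""
    intensity = [abs(s - 0.5) for s in scores]
    position = list(reversed(range(len(scores))))  # first keyword gets the highest value
    return abs(spearman_no_ties(position, intensity))


# Hypothetical review 1: all scores above 0.5, sorted in descending order
all_above_neutral = rank_intensity_correlation([0.95, 0.85, 0.70, 0.65, 0.55])

# Hypothetical review 2: scores on both sides of 0.5
mixed = rank_intensity_correlation([0.9, 0.6, 0.45, 0.3, 0.2])

print(round(all_above_neutral, 4))  # 1.0
print(round(mixed, 4))              # 0.1
```

When every score in a descending list lies on the same side of 0.5, intensity is monotone in position and the correlation is 1; lower values appear only when scores fall on both sides of the neutral point.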
