# Evaluation of KeyBERTSentimentAware

This notebook evaluates and compares different keyword extraction models applied to movie reviews, with a specific focus on assessing how well each model captures not only semantic relevance but also **sentiment alignment** with the content of the reviews.

We aim to assess the performance of two models:

- **Base** KeyBERT model

- **Sentiment-aware** version that integrates a custom sentiment classifier to enrich the keyword scoring process (KeyBERTSentimentAware)

The evaluation is performed on a set of reviews for each movie, where both models predict a ranked list of top-5 keywords per review. Each keyword is associated with a confidence score; in the sentiment-aware model, this score incorporates both relevance and predicted sentiment intensity.

The notebook uses:

- A **ground truth dataset** of annotated keywords per movie retrieved from IMDB

- **Model outputs**: top-5 predicted keywords per review, each with a relevance (and possibly sentiment-based) score

Four complementary evaluation layers are used to provide a comprehensive comparison:

**1. Basic (Unweighted) Metrics**

- **Precision**, **Recall**, and **F1-score** based on approximate binary matching

- A keyword is considered correct if it approximately matches any of the ground truth keywords for the movie

- Each review is evaluated independently and the scores are averaged

**2. Score-Aware Metrics**

- **Weighted Precision/Recall/F1**: correct predictions are weighted based on the model’s score (semantic or sentiment-aware)

- **nDCG@5**: evaluates how well the top-ranked predictions align with the expected set of keywords, rewarding correct keywords appearing earlier

**3. Semantic Evaluation (Embedding-Based)**

- Uses **sentence-transformer embeddings** and **cosine similarity** to detect approximate semantic matches between predicted and ground truth keywords

- A predicted keyword is considered correct if it reaches a similarity threshold (e.g., 0.75) with any ground truth keyword

- Semantic Precision, Recall, and F1-score are then computed based on these soft matches

**4. Sentiment Appropriateness Score (SAS)**

- Evaluates how well the **average sentiment of the predicted keywords** aligns with the **sentiment of the review**

- Two variants are computed:

  - `SAS_from_keywords`: compares the sentiment of predicted keywords to the average sentiment of the ground truth keywords (estimated using VADER)

  - `SAS_from_text`: compares the sentiment of predicted keywords to the sentiment of the full preprocessed review text

- SAS values closer to 1 indicate better emotional alignment between the predictions and the reference

The sentiment-aware model is designed not just to select relevant keywords, but also to reflect the emotional tone of the review in its predictions. While the base KeyBERT model selects keywords based solely on semantic similarity, the sentiment-aware version explicitly integrates sentiment prediction into the keyword ranking.

By introducing **SAS**, we quantify this difference and validate whether the added sentiment signal results in predictions that are not only relevant but also emotionally coherent with the content.

This sentiment-aware evaluation helps bridge the gap between **semantic correctness** and **emotional alignment**, offering a more holistic understanding of keyword quality in user-generated content.


## Setup: Installing and Importing Required Libraries

In [49]:
import subprocess
import sys

# List of required packages
required_packages = {
    "pandas", "numpy", "scikit-learn", "tqdm", "transformers", "torch", "vaderSentiment"
}

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


Installing scikit-learn...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
vaderSentiment is already installed.
numpy is already installed.
tqdm is already installed.
torch is already installed.
transformers is already installed.
pandas is already installed.


In [50]:
# Standard Library
import os      # File system operations (e.g., listing files)
import re      # Regular expressions for text processing
import math    # Mathematical functions (e.g., logarithms for nDCG calculation)

# Third-Party Libraries
import pandas as pd                  # Data manipulation with DataFrames
import numpy as np                   # Numerical computations and array operations
from tqdm import tqdm                # Progress bars for loops

# Evaluation metrics from scikit-learn
from sklearn.metrics import precision_score, recall_score, f1_score

# Transformers and PyTorch for embeddings and models
from transformers import AutoTokenizer, AutoModel # type:ignore
import torch
import torch.nn.functional as F

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


## Load Available Movies from Dataset

This section lists all the available movies stored as `.pkl` files inside the review dataset directory.

- It defines the root path (`../Dataset/Reviews_By_Movie`) where all review files are saved.
- It automatically detects and lists all movie filenames (removing the `.pkl` extension).

In [51]:
# Define root directory
root_dir = "../Dataset/Reviews_By_Movie"

# List all available movies
available_movies = sorted([f[:-4] for f in os.listdir(root_dir) if f.endswith(".pkl")])
print("Available movies:", available_movies)

Available movies: ['GoodBadUgly', 'HarryPotter', 'IndianaJones', 'LaLaLand', 'Oppenheimer', 'Parasite', 'SW_Episode1', 'SW_Episode2', 'SW_Episode3', 'SW_Episode4', 'SW_Episode5', 'SW_Episode6', 'SW_Episode7', 'SW_Episode8', 'SW_Episode9']


## Select a Movie and Load its Ground Truth Keywords

In this step, we load the keyword extraction results for a specific movie and retrieve the corresponding ground truth keywords. The goal is to use these annotated keywords for evaluation and comparison with automatically extracted ones.

In [52]:
# Set the name of the movie to be evaluated
movie_name = "Parasite"

# Load the extracted keywords for the selected movie from a pickle file
# The file path is dynamically built using the movie name
selected_film = pd.read_pickle(f"../Dataset/Extracted_Keywords/kw_{movie_name}.pkl")

# Retrieve the Movie_ID of the selected film
# Assumes that the file contains a DataFrame with at least one row
selected_film_id = selected_film["Movie_ID"].iloc[0]

# Load the full dataset containing the ground truth keywords
# for all movies in the evaluation set
keywords = pd.read_pickle("../Dataset/keywords_ground_truth.pkl")

# Filter the ground truth dataset to extract only the keywords for the selected movie
kw_ground_truth = keywords[keywords["Movie_ID"] == selected_film_id]

## Keyword Matching and Evaluation Functions (Basic – Unweighted)

This block defines the baseline utility functions used to evaluate predicted keywords against the ground truth. These functions do **not** take into account keyword confidence scores or ranking—they perform **binary, unweighted evaluation**.

Specifically, this implementation includes:

- **Normalization**: keywords are converted to lowercase, stripped of punctuation, and cleaned of extra whitespace to ensure consistent matching.

- **Approximate Matching**: a relaxed rule that considers two keywords as matching if they are identical or if one is a substring of the other (e.g., *"social satire"* ≈ *"satire"*).

- **Evaluation**: standard metrics — **precision**, **recall**, and **F1-score** — are calculated based on the number of approximate matches between predicted and ground truth keywords.

This provides a basic but interpretable way to assess keyword extraction quality without considering the ranking or confidence scores assigned by the model.


In [53]:
# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()

# Approximate matching function:
# Returns True if the predicted keyword matches the ground truth exactly
# or if either keyword contains the other as a substring
def is_approx_match(kw, gt_keywords):
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False

# Evaluation function for a single prediction instance:
# - Normalizes both predicted and ground truth keywords
# - Computes how many predicted keywords approximately match the ground truth
# - Calculates precision, recall, and F1-score
def evaluate_keywords(pred_keywords, gt_keywords):
    pred_keywords = [normalize_kw(k) for k in pred_keywords]
    gt_keywords = [normalize_kw(k) for k in gt_keywords]
    
    # Count how many predicted keywords match approximately with any ground truth keyword
    match_count = sum([is_approx_match(k, gt_keywords) for k in pred_keywords])
    
    # Precision: percentage of predicted keywords that are correct
    precision = match_count / len(pred_keywords) if pred_keywords else 0

    # Recall: percentage of ground truth keywords that were correctly predicted
    recall = match_count / len(gt_keywords) if gt_keywords else 0

    # F1-score: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    
    return precision, recall, f1


### Evaluate and Compare Models on Keyword Extraction (Basic – Unweighted)

This section evaluates two keyword extraction models — **base** and **sentiment-enhanced** — using the ground truth.

For each **review**, basic precision, recall, and F1-score are computed based on binary keyword matching. These metrics are then **averaged across all reviews** to provide an overall performance comparison between the models.

In [54]:
# Define the models to be evaluated
models_to_evaluate = ["base", "sentiment"]

# Create a dictionary to store evaluation results (precision, recall, F1) for each model
results = {model: [] for model in models_to_evaluate}

# Extract the list of ground truth keywords for the selected movie (same for all reviews)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Iterate over each review in the selected film's predictions
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        
        # Column name with the predicted keywords for this model
        pred_col = f"keywords_{model}"
        
        # Proceed only if the column exists and contains a list of predicted keywords
        if pred_col in row and isinstance(row[pred_col], list):

            # Extract only the keyword strings from (keyword, score) tuples
            predicted_keywords = [kw for kw, _ in row[pred_col] if isinstance(kw, str)]
            
            # Evaluate the prediction using precision, recall, and F1-score
            precision, recall, f1 = evaluate_keywords(predicted_keywords, ground_truth_keywords)
            
            # Store the result for this specific review
            results[model].append({
                "precision": precision,
                "recall": recall,
                "f1": f1
            })

# Aggregate the results to compute average metrics for each model
summary = {}
for model in models_to_evaluate:
    precisions = [r["precision"] for r in results[model]]
    recalls = [r["recall"] for r in results[model]]
    f1s = [r["f1"] for r in results[model]]
    
    # Calculate the average of each metric and round to 4 decimal places
    summary[model] = {
        "avg_precision": round(np.mean(precisions), 4),
        "avg_recall": round(np.mean(recalls), 4),
        "avg_f1": round(np.mean(f1s), 4)
    }

# Convert the weighted summary dictionary to a pandas DataFrame
summary_df = pd.DataFrame(summary).T  # Transpose so models are rows

# Rename columns for better readability
summary_df.columns = [
    "Precision",
    "Recall",
    "F1-score",
]

# Display the summary table nicely
summary_df.style.format(precision=4).set_caption("Score-Aware Evaluation Summary")

Unnamed: 0,Precision,Recall,F1-score
base,0.16,0.0025,0.005
sentiment,0.28,0.0044,0.0087


## Score-Aware Evaluation: Weighted Metrics

This extended evaluation considers the **confidence scores** assigned by the model to each predicted keyword, allowing us to measure not only whether the predictions are correct but also how confidently and effectively they are ranked.

#### Score-Aware Metrics

- **Weighted Precision**: Reflects the proportion of the model’s total confidence assigned to correct keywords. High confidence in incorrect predictions lowers this score.

- **Weighted Recall**: Measures how much of the ground truth is recovered, weighted by the confidence of correct predictions.

- **Weighted F1-score**: The harmonic mean of weighted precision and recall, balancing accuracy with coverage.

In [55]:
# Simple normalization function for keywords:
# - Converts to lowercase
# - Removes punctuation
# - Strips leading/trailing spaces
def normalize_kw(kw):
    kw = kw.lower()
    kw = re.sub(r"[^a-zA-Z0-9\s]", "", kw)  # Keep only alphanumerics and whitespace
    return kw.strip()


# Approximate matching function:
# Returns True if the predicted keyword matches any ground truth keyword
# using a relaxed comparison: exact match or substring containment
def is_approx_match(kw, gt_keywords):
    for gt in gt_keywords:
        if kw == gt or kw in gt or gt in kw:
            return True
    return False

# Weighted evaluation function:
# Calculates precision, recall, and F1-score using the confidence scores of predicted keywords
# - High-confidence correct predictions contribute more
# - Precision is score-weighted; recall divides by total ground truth
def evaluate_keywords_weighted(predicted_kw_score, gt_keywords):
    """
    Evaluate predicted keywords with confidence scores using weighted precision, recall, and F1.
    
    Parameters:
        predicted_kw_score (list of (str, float)): predicted keywords with associated confidence scores
        gt_keywords (list of str): ground truth keywords (annotated)
    
    Returns:
        (precision, recall, f1): all metrics computed using score-weighted matching
    """
    # Normalize both predicted and ground truth keywords
    gt_keywords = [normalize_kw(k) for k in gt_keywords]
    pred_keywords = [(normalize_kw(kw), score) for kw, score in predicted_kw_score if isinstance(kw, str)]
    
    total_score = sum(score for _, score in pred_keywords)
    if total_score == 0:
        return 0, 0, 0

    # Compute total score of predicted keywords that approximately match the ground truth
    match_score = sum(score for kw, score in pred_keywords if is_approx_match(kw, gt_keywords))
    
    # Weighted precision and recall
    precision = match_score / total_score
    recall = match_score / len(gt_keywords) if gt_keywords else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0

    return precision, recall, f1

### Evaluate and Compare Models on Keyword Extraction (Weighted)

In this section, we apply the score-aware evaluation metrics to each review for both models:

- **Weighted Precision, Recall, F1**: accounts for the confidence scores of each predicted keyword.
- **nDCG@5**: evaluates the ranking quality of the top-5 keywords based on their alignment with the ground truth.

Each review is evaluated individually, and the metrics are then averaged across all reviews to summarize model performance.

In [56]:
# Models to evaluate
models_to_evaluate = ["base", "sentiment"]

# Initialize results dictionary for weighted metrics and nDCG
weighted_results = {model: [] for model in models_to_evaluate}

# Ground truth keywords (same for all reviews in the selected film)
ground_truth_keywords = kw_ground_truth["Keyword"].tolist()

# Loop through each review
for _, row in selected_film.iterrows():
    for model in models_to_evaluate:
        pred_col = f"keywords_{model}"
        
        # Skip if no prediction or wrong format
        if pred_col in row and isinstance(row[pred_col], list):
            predicted_kw_score = row[pred_col]  # list of (kw, score)

            # Compute weighted metrics
            w_precision, w_recall, w_f1 = evaluate_keywords_weighted(predicted_kw_score, ground_truth_keywords)

            # Save results
            weighted_results[model].append({
                "weighted_precision": w_precision,
                "weighted_recall": w_recall,
                "weighted_f1": w_f1,
            })

# Compute average metrics across all reviews
weighted_summary = {}
for model in models_to_evaluate:
    metrics = weighted_results[model]
    weighted_summary[model] = {
        "avg_weighted_precision": round(np.mean([m["weighted_precision"] for m in metrics]), 4),
        "avg_weighted_recall": round(np.mean([m["weighted_recall"] for m in metrics]), 4),
        "avg_weighted_f1": round(np.mean([m["weighted_f1"] for m in metrics]), 4),
    }

# Convert the weighted summary dictionary to a pandas DataFrame
summary_df = pd.DataFrame(weighted_summary).T  # Transpose so models are rows

# Rename columns for better readability
summary_df.columns = [
    "Weighted Precision",
    "Weighted Recall",
    "Weighted F1-score",
]

# Display the summary table
summary_df.style.format(precision=4).set_caption("Score-Aware Evaluation Summary")


Unnamed: 0,Weighted Precision,Weighted Recall,Weighted F1-score
base,0.1564,0.0012,0.0023
sentiment,0.28,0.0015,0.0031


## Semantic Evaluation (Base vs Sentiment)

In this section, we evaluate and compare the **Base** and **Sentiment-enhanced** keyword extraction models using a **semantic similarity approach** based on contextual embeddings.

Traditional evaluation metrics typically check for exact or approximate string matches between predicted and ground truth keywords. However, this can miss semantically related terms that are lexically different but convey the same meaning — for example, *"scam"* and *"fraud"*.

To overcome this limitation, we leverage **sentence embeddings** generated by a pre-trained transformer model (such as a sentence-transformer). Each keyword — both predicted and ground truth — is converted into a dense vector representation that captures its semantic content.

How the semantic evaluation works in detail:

1. **Embedding keywords**:  
    Both predicted keywords and ground truth keywords are embedded into high-dimensional vectors using the same model. The ground truth keywords are embedded **once beforehand** to avoid redundant computations. These embeddings are normalized to ensure cosine similarity is a valid similarity measure.

2. **Computing similarity scores**:  
   We calculate the **cosine similarity** between every predicted keyword embedding and every ground truth keyword embedding, resulting in a similarity matrix.

3. **Determining matches using a threshold**:  
   A predicted keyword is considered a **semantic match** if its cosine similarity with at least one ground truth keyword exceeds a set threshold (e.g., 0.75). This threshold balances between strictness and flexibility in matching semantic content.

4. **Calculating semantic precision**:  
   This is the fraction of predicted keywords that have a semantic match in the ground truth. It reflects how many of the model’s predictions are meaningful and relevant.

5. **Calculating semantic recall**:  
   This is the fraction of ground truth keywords that are captured by semantically similar predicted keywords. It indicates how well the model covers the essential concepts of the ground truth.

6. **Calculating semantic F1-score**:  
   The harmonic mean of semantic precision and recall, providing a single measure that balances both aspects.

By evaluating keyword extraction in this semantic space, the metric is more robust to lexical variation and better reflects the true relevance of the predictions. This provides a deeper understanding of how well each model captures the **meaning** behind the ground truth keywords, beyond surface-level text matches.

In [57]:
# Load a sentence embedding model from the SentenceTransformers family
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model to generate contextual embeddings
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).to(device)

def embed_keywords(keywords, device="cuda"):
    """
    Compute sentence embeddings for a list of keyword strings.

    Parameters:
    ----------
    keywords : List[str]
        A list of keyword strings to encode.
    device : str
        Device to run the model on ('cuda' or 'cpu').

    Returns:
    -------
    torch.Tensor
        Normalized embeddings tensor of shape (num_keywords, embedding_dim).
    """
    # Return empty tensor if input list is empty
    if not keywords:
        return torch.empty(0, encoder.config.hidden_size).to(device)

    # Tokenize and prepare inputs for the model
    inputs = tokenizer(keywords, padding=True, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        # Forward pass through the encoder to get hidden states
        outputs = encoder(**inputs)

        # Use mean pooling on the last hidden state to get fixed-size embeddings
        embeddings = outputs.last_hidden_state.mean(dim=1)

        # Normalize embeddings to unit length for cosine similarity computations
        embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings


def compute_semantic_metrics(pred_keywords, gt_embeddings, threshold=0.75, device="cuda"):
    """
    Compute semantic precision, recall, and F1 score between predicted keywords and
    ground truth embeddings based on cosine similarity.

    Parameters:
    ----------
    pred_keywords : List[str]
        List of predicted keywords for a single review.
    gt_embeddings : torch.Tensor
        Pre-computed normalized embeddings of ground truth keywords.
    threshold : float
        Cosine similarity threshold above which a predicted keyword is considered
        semantically matching a ground truth keyword.
    device : str
        Device to run computations on.

    Returns:
    -------
    precision : float
        Fraction of predicted keywords that match any ground truth keyword semantically.
    recall : float
        Fraction of ground truth keywords that are matched by any predicted keyword.
    f1 : float
        Harmonic mean of precision and recall.
    """
    # Handle empty predictions or empty ground truth embeddings edge cases
    if len(pred_keywords) == 0 or gt_embeddings.shape[0] == 0:
        return 0.0, 0.0, 0.0

    # Compute embeddings for the predicted keywords only
    pred_emb = embed_keywords(pred_keywords, device=device)

    # Compute cosine similarity matrix between predicted and ground truth embeddings
    # Shape: (num_predicted_keywords, num_ground_truth_keywords)
    sims = torch.matmul(pred_emb, gt_embeddings.T)

    # A predicted keyword is counted as a match if it has cosine similarity above
    # the threshold with at least one ground truth keyword
    pred_matches = (sims > threshold).any(dim=1).float().sum().item()

    # Similarly, a ground truth keyword is matched if any predicted keyword exceeds threshold
    gt_matches = (sims > threshold).any(dim=0).float().sum().item()

    # Calculate precision: matched predictions / total predictions
    precision = pred_matches / len(pred_keywords)

    # Calculate recall: matched ground truths / total ground truth keywords
    recall = gt_matches / gt_embeddings.shape[0]

    # Compute harmonic mean for F1 score, handling zero division
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return precision, recall, f1

### Semantic Evaluation of Base and Sentiment Models Using Sentence Embeddings

In this step, we evaluate the semantic similarity between the predicted keywords of two models — **Base** and **Sentiment-enhanced** — and the ground truth keywords using **sentence embeddings**.

Unlike previous evaluations based on exact or approximate matching, this method leverages contextual embeddings from a pre-trained transformer to measure how semantically close the predicted keywords are to the reference keywords.

For each review:
- We extract only the **text of the predicted keywords**, ignoring their confidence scores.
- We compute **semantic precision, recall, and F1** based on cosine similarity between embeddings.
- We then average these metrics across all reviews for each model, providing an overall **semantic performance assessment**.

In [58]:
# Precompute embeddings for the ground truth keywords once per selected movie
# This avoids redundant computation when comparing against multiple predicted keywords
gt_keywords = kw_ground_truth["Keyword"].tolist()
gt_emb = embed_keywords(gt_keywords, device=device)

# Define the models to evaluate
models_to_evaluate = ["base", "sentiment"]

# List to collect semantic evaluation results for each model
semantic_scores = []

# Loop over each model to evaluate semantic metrics separately
for model in models_to_evaluate:
    # Lists to accumulate precision, recall, and F1 scores for each review
    all_precisions = []
    all_recalls = []
    all_f1s = []

    # Iterate over each review (row) in the selected movie's predictions, with a progress bar
    for _, row in tqdm(selected_film.iterrows(), total=len(selected_film), desc=f"Semantic metrics - {model}"):
        pred_col = f"keywords_{model}"  # Column name for predicted keywords of the current model

        # Check if the predicted keywords column exists and contains a list
        if pred_col in row and isinstance(row[pred_col], list):
            # Extract only the keyword strings (ignore confidence scores)
            pred_kw = [kw for kw, _ in row[pred_col] if isinstance(kw, str)]

            # Compute semantic precision, recall, and F1 between predicted keywords and precomputed GT embeddings
            precision, recall, f1 = compute_semantic_metrics(pred_kw, gt_emb, device=device)

            # Append the scores for aggregation later
            all_precisions.append(precision)
            all_recalls.append(recall)
            all_f1s.append(f1)

    # After processing all reviews, calculate average semantic scores for the current model
    if all_f1s:
        semantic_scores.append({
            "Model": model,
            "Semantic_Precision": round(sum(all_precisions) / len(all_precisions), 4),
            "Semantic_Recall": round(sum(all_recalls) / len(all_recalls), 4),
            "Semantic_F1": round(sum(all_f1s) / len(all_f1s), 4)
        })

# Convert the list of dictionaries into a pandas DataFrame and set 'Model' as the index
summary_df = pd.DataFrame(semantic_scores).set_index("Model")

# Display the semantic evaluation summary as a nicely formatted table with 4 decimal places
summary_df.style.format(precision=4).set_caption("Semantic-Aware Evaluation Summary")


Semantic metrics - base: 100%|██████████| 5/5 [00:00<00:00, 63.71it/s]
Semantic metrics - sentiment: 100%|██████████| 5/5 [00:00<00:00, 60.17it/s]


Unnamed: 0_level_0,Semantic_Precision,Semantic_Recall,Semantic_F1
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
base,0.08,0.0006,0.0013
sentiment,0.04,0.0013,0.0025


## Sentiment Appropriateness Score (SAS)

To further evaluate the quality of the predicted keywords from a sentiment-aware perspective, we introduce the **Sentiment Appropriateness Score (SAS)**. This metric measures how well the **overall sentiment of the predicted keywords aligns with the sentiment of the ground truth** or the original review text.

Traditional evaluation metrics such as precision, recall, and F1-score focus on **syntactic correctness**, but they do not assess whether the predicted keywords **reflect the emotional tone** of the review. SAS addresses this by incorporating sentiment into the evaluation process.

We compute SAS in two ways:

1. **SAS from Ground Truth Keywords**  
   The sentiment of the review is approximated using the set of ground truth keywords. Each keyword is analyzed using **VADER**, and its sentiment is mapped using the following weights:  
   - `pos` → 1.0  
   - `neu` → 0.5  
   - `neg` → 0.0  
   The final sentiment is the average of these values. This reference sentiment is compared to the average sentiment of the predicted keywords.

2. **SAS from Review Text**  
   If the full review text is available, we compute its sentiment using the same weighted VADER scheme and compare it to the sentiment of the predicted keywords.

In both cases, SAS is calculated as:

$$
\text{SAS} = 1 - |\text{sentiment}_{\text{predicted}} - \text{sentiment}_{\text{reference}}|
$$

Values closer to **1** indicate that the predicted keywords are **emotionally aligned** with the ground truth or the review.

We use **VADER (Valence Aware Dictionary and sEntiment Reasoner)** because:

- It is optimized for **short, informal text** like keywords or tags.  
- It works well for **single words and short phrases**, which are the output of our models.  
- It provides class-level probabilities (`pos`, `neu`, `neg`) that are directly usable for **interpretable sentiment scoring**.  
- It is lightweight and efficient, making it suitable for **large-scale evaluation**.

While transformer-based sentiment models may be more powerful for longer text, **VADER offers the best trade-off** between accuracy, interpretability, and scalability for our keyword-level sentiment matching task.

In [59]:
# Compute SAS based on ground truth keywords
def compute_sas_from_keywords(predicted_keywords, ground_truth_keywords=None, analyzer=None, sentiment_gt=None):
    """
    Computes SAS by comparing the average sentiment of predicted keywords to the sentiment of the ground truth keywords.

    You can pass:
    - Either the ground_truth_keywords and analyzer (to compute sentiment on the fly)
    - Or directly a precomputed sentiment_gt value

    Sentiment is computed as: 1.0 * pos + 0.5 * neu + 0.0 * neg

    Args:
        predicted_keywords (list of dict): each dict has 'sentiment_score' ∈ [0,1]
        ground_truth_keywords (list of str): optional, used if sentiment_gt not given
        analyzer (SentimentIntensityAnalyzer): optional, used if sentiment_gt not given
        sentiment_gt (float): optional, precomputed GT sentiment ∈ [0,1]

    Returns:
        float: SAS ∈ [0,1] — higher is better alignment with GT sentiment
    """
    # Compute GT sentiment only if not precomputed
    if sentiment_gt is None:
        if not ground_truth_keywords or analyzer is None:
            return None
        sentiments_gt = []
        for kw in ground_truth_keywords:
            scores = analyzer.polarity_scores(kw)
            sentiment = 1.0 * scores['pos'] + 0.5 * scores['neu'] + 0.0 * scores['neg']
            sentiments_gt.append(sentiment)
        sentiment_gt = sum(sentiments_gt) / len(sentiments_gt)

    # Compute average sentiment of predicted keywords
    sentiments_pred = [kw['sentiment_score'] for kw in predicted_keywords]
    sentiment_pred = sum(sentiments_pred) / len(sentiments_pred)

    # SAS = 1 - absolute difference between predicted and GT sentiment
    return 1 - abs(sentiment_pred - sentiment_gt)


# Compute SAS based on review text
def compute_sas_from_text(predicted_keywords, review_text=None, analyzer=None, sentiment_text=None):
    """
    Computes SAS by comparing the average sentiment of predicted keywords to the sentiment of the full review.

    You can pass:
    - Either the review_text and analyzer (to compute sentiment on the fly)
    - Or directly a precomputed sentiment_text value

    Sentiment is computed as: 1.0 * pos + 0.5 * neu + 0.0 * neg

    Args:
        predicted_keywords (list of dict): each dict has 'sentiment_score' ∈ [0,1]
        review_text (str): optional, used if sentiment_text not given
        analyzer (SentimentIntensityAnalyzer): optional, used if sentiment_text not given
        sentiment_text (float): optional, precomputed review sentiment ∈ [0,1]

    Returns:
        float: SAS ∈ [0,1] — higher is better alignment with review sentiment
    """
    # Compute review sentiment only if not precomputed
    if sentiment_text is None:
        if not review_text or analyzer is None:
            return None
        scores = analyzer.polarity_scores(review_text)
        sentiment_text = 1.0 * scores['pos'] + 0.5 * scores['neu'] + 0.0 * scores['neg']

    # Compute average sentiment of predicted keywords
    sentiments_pred = [kw['sentiment_score'] for kw in predicted_keywords]
    sentiment_pred = sum(sentiments_pred) / len(sentiments_pred)

    # SAS = 1 - absolute difference between predicted and review sentiment
    return 1 - abs(sentiment_pred - sentiment_text)

### Sentiment Appropriateness Evaluation Procedure

This section evaluates the **Sentiment Appropriateness Score (SAS)** for each model (`base` and `sentiment`) using precomputed sentiment references to improve efficiency.

Steps:

- **Ground truth sentiment** is computed **once per movie** by averaging the sentiment of all annotated keywords, using VADER’s class probabilities mapped as follows:  
  - Positive → 1.0  
  - Neutral → 0.5  
  - Negative → 0.0

- For each review:
  - Predicted keywords are extracted and formatted as a list of dictionaries with their sentiment scores.
  - **SAS from keywords (`SAS_from_keywords`)** is computed by comparing the average sentiment of predicted keywords to the **precomputed ground truth sentiment**.
  - **SAS from text (`SAS_from_text`)** is computed by comparing the predicted sentiment to that of the **full preprocessed review**, also using the VADER class probabilities.

- SAS is calculated as:  
  `SAS = 1 - |predicted_sentiment - reference_sentiment|`

- The final output is a table reporting the average `SAS_from_keywords` and `SAS_from_text` for each model.

In [60]:
# Initialize VADER once
analyzer = SentimentIntensityAnalyzer()

# Define the models to evaluate
models_to_evaluate = ["base", "sentiment"]

# Collect SAS results for each model
sas_scores = []

# Precompute GT sentiment (once per movie)
gt_keywords = kw_ground_truth["Keyword"].tolist()
sentiments_gt = []
for kw in gt_keywords:
    scores = analyzer.polarity_scores(kw)
    sentiment = 1.0 * scores["pos"] + 0.5 * scores["neu"] + 0.0 * scores["neg"]
    sentiments_gt.append(sentiment)
sentiment_gt = sum(sentiments_gt) / len(sentiments_gt) if sentiments_gt else None

# Loop over each model
for model in models_to_evaluate:
    all_sas_keyword = []
    all_sas_text = []

    # Iterate over each review
    for _, row in tqdm(selected_film.iterrows(), total=len(selected_film), desc=f"SAS Evaluation - {model}"):
        pred_col = f"keywords_{model}"

        if pred_col in row and isinstance(row[pred_col], list):
            predicted_keywords = []
            for kw, score in row[pred_col]:
                if isinstance(kw, str) and isinstance(score, (float, int)):
                    predicted_keywords.append({"keyword": kw, "sentiment_score": float(score)})

            if not predicted_keywords:
                continue

            # SAS A: use precomputed GT sentiment
            sas_kw = compute_sas_from_keywords(predicted_keywords, sentiment_gt=sentiment_gt)
            if sas_kw is not None:
                all_sas_keyword.append(sas_kw)

            # SAS B: precompute sentiment of full review (once per row)
            if "Preprocessed_Review" in row and isinstance(row["Preprocessed_Review"], str):
                review_text = row["Preprocessed_Review"]
                scores = analyzer.polarity_scores(review_text)
                sentiment_text = 1.0 * scores["pos"] + 0.5 * scores["neu"] + 0.0 * scores["neg"]

                sas_text = compute_sas_from_text(predicted_keywords, sentiment_text=sentiment_text)
                if sas_text is not None:
                    all_sas_text.append(sas_text)

    # Average and store results
    if all_sas_keyword or all_sas_text:
        result = {"Model": model}
        if all_sas_keyword:
            result["SAS_from_keywords"] = round(sum(all_sas_keyword) / len(all_sas_keyword), 4)
        if all_sas_text:
            result["SAS_from_text"] = round(sum(all_sas_text) / len(all_sas_text), 4)
        sas_scores.append(result)

# Create DataFrame and display results
sas_df = pd.DataFrame(sas_scores).set_index("Model")
sas_df.style.format(precision=4).set_caption("Sentiment Appropriateness Score (SAS) Summary")


SAS Evaluation - base: 100%|██████████| 5/5 [00:00<00:00, 66.37it/s]
SAS Evaluation - sentiment: 100%|██████████| 5/5 [00:00<00:00, 70.32it/s]


Unnamed: 0_level_0,SAS_from_keywords,SAS_from_text
Model,Unnamed: 1_level_1,Unnamed: 2_level_1
base,0.9558,0.9065
sentiment,0.9059,0.8075


## Evaluation Across All Movies

This section automatically processes all `.pkl` files in the `Extracted_Keywords` directory, where each file corresponds to a single movie and contains predicted keywords generated by different models.

For each movie:
- The corresponding ground truth keywords are loaded.
- Predicted keywords from both models — **Base** and **Sentiment-aware** — are evaluated.
- For each review, the following metrics are computed:

  - **Unweighted Metrics**: Precision, Recall, and F1-score based on approximate matching.

  - **Score-aware Metrics**: Weighted Precision, Weighted Recall, Weighted F1, and nDCG@5 to evaluate prediction confidence and ranking quality.

  - **Semantic Metrics**: Semantic Precision, Semantic Recall, and Semantic F1 computed using cosine similarity between sentence embeddings.

  - **Sentiment Alignment Metrics**:  
    - `SAS_from_keywords`: alignment between the average sentiment of predicted keywords and that of the ground truth keywords.  
    - `SAS_from_text`: alignment between the average sentiment of predicted keywords and that of the full review text.

All metrics are averaged per movie and per model, and compiled into a comprehensive summary table for comparison.

In [62]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Paths
keywords_dir = "../Dataset/Extracted_Keywords/"
ground_truth_path = "../Dataset/keywords_ground_truth.pkl"

# Load the ground truth once
keywords_ground_truth = pd.read_pickle(ground_truth_path)
models_to_evaluate = ["base", "sentiment"]

# Initialize analyzer
analyzer = SentimentIntensityAnalyzer()

# Store results across all movies
all_results = []

# Iterate over all .pkl keyword prediction files
for file in os.listdir(keywords_dir):
    if file.endswith(".pkl") and file.startswith("kw_"):
        movie_name = file.replace("kw_", "").replace(".pkl", "")
        file_path = os.path.join(keywords_dir, file)

        try:
            # Load predicted keywords and Movie_ID
            selected_film = pd.read_pickle(file_path)
            selected_film_id = selected_film["Movie_ID"].iloc[0]

            # Ground truth keywords
            kw_ground_truth = keywords_ground_truth[keywords_ground_truth["Movie_ID"] == selected_film_id]
            gt_keywords = kw_ground_truth["Keyword"].tolist()
            gt_embeddings = embed_keywords(gt_keywords, device=device)

            # --- Precompute GT sentiment ---
            sentiments_gt = []
            for kw in gt_keywords:
                scores = analyzer.polarity_scores(kw)
                sentiment = 1.0 * scores["pos"] + 0.5 * scores["neu"] + 0.0 * scores["neg"]
                sentiments_gt.append(sentiment)
            sentiment_gt = sum(sentiments_gt) / len(sentiments_gt) if sentiments_gt else None

            # Init model-specific results
            results = {model: [] for model in models_to_evaluate}

            # Evaluate each review
            for _, row in selected_film.iterrows():
                # Precompute review text sentiment if available
                sentiment_text = None
                if "Preprocessed_Review" in row and isinstance(row["Preprocessed_Review"], str):
                    scores = analyzer.polarity_scores(row["Preprocessed_Review"])
                    sentiment_text = 1.0 * scores["pos"] + 0.5 * scores["neu"] + 0.0 * scores["neg"]

                for model in models_to_evaluate:
                    pred_col = f"keywords_{model}"
                    if pred_col not in row or not isinstance(row[pred_col], list):
                        continue

                    predicted_kw_score = row[pred_col]
                    pred_kw_only = [kw for kw, _ in predicted_kw_score if isinstance(kw, str)]
                    predicted_kw_structured = [
                        {"keyword": kw, "sentiment_score": float(score)}
                        for kw, score in predicted_kw_score
                        if isinstance(kw, str) and isinstance(score, (float, int))
                    ]

                    if not pred_kw_only or not predicted_kw_structured:
                        continue

                    # Compute classic metrics
                    precision, recall, f1 = evaluate_keywords(pred_kw_only, gt_keywords)
                    w_precision, w_recall, w_f1 = evaluate_keywords_weighted(predicted_kw_score, gt_keywords)

                    # Compute semantic metrics
                    semantic_precision, semantic_recall, semantic_f1 = compute_semantic_metrics(
                        pred_kw_only, gt_embeddings, threshold=0.5, device=device
                    )

                    # SAS metrics using precomputed values
                    sas_kw = compute_sas_from_keywords(predicted_kw_structured, sentiment_gt=sentiment_gt)
                    sas_txt = compute_sas_from_text(predicted_kw_structured, sentiment_text=sentiment_text) if sentiment_text is not None else None

                    # Store metrics
                    results[model].append({
                        "precision": precision,
                        "recall": recall,
                        "f1": f1,
                        "w_precision": w_precision,
                        "w_recall": w_recall,
                        "w_f1": w_f1,
                        "semantic_precision": semantic_precision,
                        "semantic_recall": semantic_recall,
                        "semantic_f1": semantic_f1,
                        "sas_keywords": sas_kw,
                        "sas_text": sas_txt
                    })

            # Average results per model
            for model in models_to_evaluate:
                if results[model]:
                    metrics_df = pd.DataFrame(results[model])
                    avg_metrics = {
                        "Movie": movie_name,
                        "Model": model,
                        "Avg_Precision": round(metrics_df["precision"].mean(), 4),
                        "Avg_Recall": round(metrics_df["recall"].mean(), 4),
                        "Avg_F1": round(metrics_df["f1"].mean(), 4),
                        "Avg_Weighted_Precision": round(metrics_df["w_precision"].mean(), 4),
                        "Avg_Weighted_Recall": round(metrics_df["w_recall"].mean(), 4),
                        "Avg_Weighted_F1": round(metrics_df["w_f1"].mean(), 4),
                        "Avg_Semantic_Precision": round(metrics_df["semantic_precision"].mean(), 4),
                        "Avg_Semantic_Recall": round(metrics_df["semantic_recall"].mean(), 4),
                        "Avg_Semantic_F1": round(metrics_df["semantic_f1"].mean(), 4),
                    }

                    if "sas_keywords" in metrics_df:
                        avg_metrics["Avg_SAS_from_keywords"] = round(metrics_df["sas_keywords"].mean(), 4)
                    if "sas_text" in metrics_df and metrics_df["sas_text"].notna().any():
                        avg_metrics["Avg_SAS_from_text"] = round(metrics_df["sas_text"].dropna().mean(), 4)

                    all_results.append(avg_metrics)

        except Exception as e:
            print(f"Error processing {file}: {e}")

# Create summary DataFrame
final_df = pd.DataFrame(all_results)
final_df_sorted = final_df.sort_values(by=["Movie", "Model"]).reset_index(drop=True)

# Display final summary
final_df_sorted.style.format(precision=4).set_caption("Full Evaluation Summary per Movie and Model")


Unnamed: 0,Movie,Model,Avg_Precision,Avg_Recall,Avg_F1,Avg_Weighted_Precision,Avg_Weighted_Recall,Avg_Weighted_F1,Avg_Semantic_Precision,Avg_Semantic_Recall,Avg_Semantic_F1,Avg_SAS_from_keywords,Avg_SAS_from_text
0,LaLaLand,base,0.08,0.0014,0.0027,0.0798,0.0008,0.0015,0.52,0.0524,0.0872,0.9302,0.914
1,LaLaLand,sentiment,0.2,0.0034,0.0068,0.1987,0.0016,0.0032,0.6,0.1786,0.2539,0.8672,0.8574
2,Parasite,base,0.16,0.0025,0.005,0.1564,0.0012,0.0023,0.44,0.0297,0.0538,0.9558,0.9065
3,Parasite,sentiment,0.28,0.0044,0.0087,0.28,0.0015,0.0031,0.48,0.1829,0.2225,0.9059,0.8075
