# KeyBERT with Sentiment-aware Embedding Fusion

This notebook introduces a **sentiment-aware extension** of the KeyBERT keyword extraction model, which integrates sentiment information directly into the candidate selection and ranking process. Unlike simple post-hoc reranking approaches, this method incorporates sentiment consistency during both candidate filtering and final keyword scoring.

### Theoretical Approach

Traditional KeyBERT extracts candidate keywords purely based on semantic similarity between the document embedding and candidate embeddings. This extension enhances the process by considering the **emotional coherence** between candidates and the document, operationalized as continuous sentiment polarity scores.

Given:
-  $\text{sim}_{sem}$ : cosine similarity between document and candidate embeddings,
-  $s_{doc} \in [0,1]$: continuous sentiment polarity score of the document,
-  $s_{cand} \in [0,1]$: continuous sentiment polarity score of a candidate keyword,

we define the **sentiment alignment score** as:

$$
\text{align}(s_{doc}, s_{cand}) = 1 - |s_{doc} - s_{cand}|
$$

which equals 1 for perfect polarity match and decreases linearly to 0 for maximal polarity difference.

The overall combined score used to filter and rank candidates is:

$$
\text{score}_{final} = w_{sentiment} \times \text{align}(s_{doc}, s_{cand}) + (1 - w_{sentiment}) \times \text{sim}_{sem}
$$

where $$ w_{sentiment} \in [0,1] $$ is a tunable weight balancing sentiment alignment and semantic similarity.

### Key Features

- **Integrated sentiment filtering**: Sentiment is incorporated early to filter out candidates that are sentimentally incongruent with the document, not only at reranking stage.
- **Continuous sentiment modeling**: Uses probability-weighted sentiment polarity scores from a pretrained transformer classifier, enabling nuanced sentiment comparisons.
- **Flexible weighting parameter**: The parameter \( w_{sentiment} \) allows task-specific tuning of the relative importance of sentiment versus semantic relevance.
- **Candidate generation enhancement**: The candidate pool is initially large and filtered by combined semantic and sentiment scores, improving quality and relevance.

### Advantages Over Post-hoc Reranking

- Unlike reranking approaches that adjust keyword order **after** candidate generation, this method filters candidates **before** ranking, reducing noise and irrelevant candidates early.
- Sentiment influences the candidate pool itself, resulting in more coherent and contextually appropriate keyword extraction.
- The approach remains compatible with any KeyBERT-compatible embedding model and sentiment classification backend.

### Intended Applications

This sentiment-aware KeyBERT extension is especially suited for sentiment-rich domains such as:

- Product and service reviews
- Social media opinion mining
- Customer feedback analysis
- Any text where emotional tone is critical to understanding key themes

It enables the extraction of keywords that are both topically relevant and emotionally aligned, enhancing interpretability and downstream analysis.


### Setup: Installing and Importing Required Libraries

In [3]:
import subprocess
import sys

# List of required packages
required_packages = [
    "numpy", "torch", "scikit-learn", "keybert", "transformers"
    
]

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)

numpy is already installed.
torch is already installed.
Installing scikit-learn...
Defaulting to user installation because normal site-packages is not writeable
keybert is already installed.
transformers is already installed.



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip[0m


In [5]:
import numpy as np  # Fundamental package for numerical computing in Python
from typing import Tuple  # Used for type hinting tuples in function signatures

import torch  # Core PyTorch library for tensor computations
import torch.nn.functional as F  # Functional interface for activation functions, etc.

from sklearn.feature_extraction.text import CountVectorizer  # Extract text n-gram candidates
from sklearn.metrics.pairwise import cosine_similarity  # Compute cosine similarity between embeddings

from keybert import KeyBERT as KB  # KeyBERT keyword extraction base class

from transformers import (
    AutoTokenizer,  # Tokenizer for preparing input text for transformer models
    AutoModelForSequenceClassification  # Transformer model for classification tasks
)

# Classes Definition

## SentimentModel Class: Transformer-based Sentiment Probability Predictor

The `SentimentModel` class is a wrapper around a pretrained HuggingFace transformer model designed for sentiment classification. It provides a convenient interface to obtain **probability distributions over sentiment classes** for batches of input texts.

### Purpose and Functionality

- **Model loading:**  
  Upon initialization, the class loads both the tokenizer and the sequence classification model specified by the `model_name`.  
  By default, it uses `"nlptown/bert-base-multilingual-uncased-sentiment"`, a multilingual BERT model fine-tuned for 5-class sentiment classification (1 to 5 stars).

- **Device management:**  
  The model and tokenizer are moved to the specified device (`cpu` or `cuda`).  
  Input validation ensures that `cuda` is only used if a compatible GPU is available.

- **Batch sentiment prediction:**  
  The core method `predict_proba` takes a list of texts and:  
  1. Tokenizes and encodes them into the format expected by the transformer.  
  2. Performs a forward pass through the model without computing gradients (efficient inference).  
  3. Applies a softmax to the output logits to obtain a probability distribution over the sentiment classes for each text.  
  4. Returns a NumPy array of shape `(batch_size, num_classes)` containing the class probabilities.

### Advantages

- Allows seamless integration of sentiment analysis into larger NLP pipelines.
- Outputs probabilistic sentiment scores, enabling nuanced, continuous sentiment representations rather than hard labels.
- Supports batch processing for efficiency.

### Example Output

**Text:**  
_I absolutely loved this movie! It was fantastic._  
Sentiment probabilities (1 to 5 stars): [0.01 0.02 0.05 0.12 0.80]


**Text:**  
_The plot was boring and predictable._  
Sentiment probabilities (1 to 5 stars): [0.70 0.20 0.07 0.02 0.01]


**Text:**  
_The movie was okay, nothing special but not bad either._  
Sentiment probabilities (1 to 5 stars): [0.05 0.10 0.65 0.15 0.05]


In [None]:
class SentimentModel:
    """
    Wrapper class for a HuggingFace transformer sentiment classification model.

    This class loads a pretrained sentiment classification model and tokenizer,
    and provides a method to compute the probability distribution over sentiment classes
    for a batch of input texts.

    Parameters:
    -----------
    model_name : str, optional
        The identifier of the pretrained sentiment model on HuggingFace Hub.
        Default is "nlptown/bert-base-multilingual-uncased-sentiment",
        a 5-class sentiment classifier (1 to 5 stars).

    device : str, optional
        The device to run the model on. Typical values: "cpu" or "cuda".
    """

    def __init__(
            self, 
            model_name="nlptown/bert-base-multilingual-uncased-sentiment", 
            device="cpu"):
        
        # Set the device for model computation
        if device not in ["cpu", "cuda"]:
            raise ValueError("Device must be 'cpu' or 'cuda'.")
        
        if device == "cuda" and not torch.cuda.is_available():
            raise ValueError("CUDA is not available. Please use 'cpu' instead.")
        
        self.device = device

        # Load the tokenizer associated with the pretrained model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Load the pretrained sequence classification model on the given device
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

    def predict_proba(self, texts):
        """
        Compute the probability distribution over sentiment classes for input texts.

        Parameters:
        -----------
        texts : list of str
            List of input texts for which to compute sentiment probabilities.

        Returns:
        --------
        numpy.ndarray
            Array of shape (len(texts), num_classes) where each row corresponds
            to the probability distribution over sentiment classes for that text.
        """
        # Tokenize and encode the input texts, handling padding and truncation for batching
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(self.device)
        
        with torch.no_grad():
            # Perform forward pass without gradient computation for efficiency
            outputs = self.model(**inputs)
            logits = outputs.logits  # raw model outputs before softmax
            
            # Convert logits to probabilities using softmax along class dimension
            probs = F.softmax(logits, dim=1).cpu().numpy()
        return probs


## KeyBERTSentimentAware Class: Sentiment-Integrated Keyword Extraction

This class extends the base KeyBERT model by integrating sentiment analysis directly into the keyword extraction pipeline. It enhances the traditional semantic-only approach by incorporating continuous sentiment polarity scores for both the entire document and each candidate keyword.

### Overview

- **Candidate Extraction:**  
  Uses `CountVectorizer` to extract a broad pool of candidate keywords (n-grams) from the document text.  
  **Note:** This initial candidate generation is purely statistical and **does not incorporate sentiment information**.

- **Sentiment Analysis:**  
  Leverages a pretrained transformer sentiment classification model to compute **continuous sentiment polarity scores** ranging from 0 (very negative) to 1 (very positive) for both the document and each candidate.

- **Combined Scoring and Filtering:**  
  Calculates a weighted score combining:
  - Semantic similarity (cosine similarity between embeddings).
  - Sentiment alignment (1 minus the absolute difference between candidate and document sentiment).

  Candidates with combined scores below a threshold are **filtered out before final ranking**, effectively integrating sentiment as a filter immediately after candidate extraction.

  The weighting is controlled by `weight_sentiment`:
  - `weight_sentiment=1.0` means keywords are ranked purely by sentiment alignment.
  - `weight_sentiment=0.0` means keywords are ranked purely by semantic similarity.

### Candidate Selection in KeyBERT vs Sentiment-Aware Extension

In the original KeyBERT model, the candidate keywords are extracted purely based on **statistical properties** of the text. Specifically, KeyBERT uses a tool like `CountVectorizer` to identify n-grams (contiguous sequences of words) that appear frequently or are relevant according to basic frequency statistics. This means:

- The **candidate pool is generated without any semantic or sentiment understanding**.
- All candidates are treated equally in this phase, regardless of their emotional tone or contextual relevance beyond raw occurrence patterns.

This purely **statistical candidate extraction** can lead to a large number of candidates that are relevant but may not align emotionally with the overall document sentiment. For example, in a strongly negative review, KeyBERT might still generate positive-sounding candidates simply because those phrases appear often, potentially misrepresenting the sentiment conveyed.

To address this limitation, our sentiment-aware extension introduces a **joint filtering mechanism** that combines both semantic relevance and sentiment alignment **immediately after the initial statistical candidate extraction**:

1. We first extract a large pool of candidates statistically using `CountVectorizer` to ensure broad coverage.

2. We compute **continuous sentiment polarity scores** for both the entire document and each candidate keyword using a pretrained transformer sentiment model.

3. We calculate a **combined score** for each candidate that balances:
   - Semantic similarity to the document (embedding cosine similarity).
   - Sentiment alignment with the document's overall polarity (inverted absolute difference between sentiment scores).

4. Candidates whose combined score falls below a threshold are **filtered out early**, significantly reducing the pool to those that are both topically and emotionally relevant.

This approach allows the model to **avoid candidates that are semantically plausible but sentimentally inconsistent**, leading to more meaningful and context-aware keyword extraction.

### Summary

| Step                        | KeyBERT Base                        | Sentiment-Aware Extension          |
|-----------------------------|-----------------------------------|-----------------------------------|
| Candidate generation         | Purely statistical (n-gram counts)| Statistical, followed by sentiment-semantic filtering (no sentiment during extraction but sentiment used immediately after to filter) |
| Candidate ranking            | Semantic similarity only           | Semantic + sentiment combined     |
| Sentiment consideration     | None                              | Integral part of candidate filtering|

By incorporating sentiment as an early filtering step (post-statistical extraction), our extension improves the **precision and emotional coherence** of extracted keywords, especially in domains where sentiment plays a crucial role.

### Parameters

- `model`: The base semantic embedding model (usually a SentenceTransformer).
- `sentiment_model_name`: Identifier of the pretrained sentiment model (default is a 5-class multilingual sentiment classifier).
- `weight_sentiment`: Balances importance between sentiment alignment and semantic similarity.
- `candidate_pool_size`: Number of candidates initially extracted.
- `device`: Compute device, `"cpu"` or `"cuda"`.

### Usage

The class allows flexible, context-aware keyword extraction that respects both topical relevance and emotional tone, ideal for analyzing opinion-rich texts such as reviews or social media posts.

---


In [None]:
class KeyBERTSentimentAware(KB):
    """
    Extension of KeyBERT to integrate sentiment analysis in keyword extraction.

    This class overrides and extends parts of KeyBERT's pipeline to:
    - Extract a larger candidate pool using CountVectorizer.
    - Calculate sentiment polarity scores for the document and candidates,
      using a pretrained sentiment classification model with continuous outputs.
    - Combine semantic similarity and sentiment alignment scores via a weighting factor weight_sentiment.
    - Filter candidate keywords based on this combined score before final ranking.

    Parameters:
    -----------
    model : SentenceTransformer
        Semantic embedding model used by KeyBERT.

    sentiment_model_name : str, optional (default: "nlptown/bert-base-multilingual-uncased-sentiment")
        Identifier of pretrained sentiment model on HuggingFace Hub.

    weight_sentiment : float, optional (default: 0.7)
        Weight to balance sentiment alignment vs semantic similarity.
        weight_sentiment=1.0 means only sentiment alignment is considered.
        weight_sentiment=0.0 means only semantic similarity is considered.

    candidate_pool_size : int, optional (default: 100)
        Maximum number of initial candidate keywords to extract.

    device : str, optional (default: "cpu")
        Device to run embedding and sentiment models on ("cpu" or "cuda").
    """

    def __init__(
        self,
        model,
        sentiment_model_name: str = "nlptown/bert-base-multilingual-uncased-sentiment",
        weight_sentiment: float = 0.7,
        candidate_pool_size: int = 100,
        device: str = "cpu",
    ):
        # Call the superclass (KeyBERT) constructor to initialize base functionality
        super().__init__(model)
        
        # Validate and set parameters
        self.weight_sentiment = weight_sentiment
        self.candidate_pool_size = candidate_pool_size
        self.device = device

        # Store the semantic embedding model (typically SentenceTransformer)
        self.embedder = model

        # Initialize the sentiment model wrapper to obtain probabilities for sentiment classes
        self.sentiment_model = SentimentModel(sentiment_model_name, device=device)

        # Sentiment classes ordered from most negative to most positive
        self.labels_ordered = ['1 star', '2 stars', '3 stars', '4 stars', '5 stars']

        # Map sentiment labels to continuous numeric values in [0, 1]
        self.label_to_score = {
            label: i / (len(self.labels_ordered) - 1)
            for i, label in enumerate(self.labels_ordered)
        }

    def _get_doc_polarity_continuous(self, doc: str) -> float:
        """
        Compute the document's continuous sentiment polarity score as the weighted sum of
        predicted class probabilities multiplied by their numeric mappings.

        This method overrides and replaces any default sentiment handling in the base class.

        Parameters:
        -----------
        doc : str
            The document text.

        Returns:
        --------
        float
            Continuous sentiment polarity score between 0 (very negative) and 1 (very positive).
        """
        # Get probability distribution over sentiment classes for the document
        probs = self.sentiment_model.predict_proba([doc])[0]

        # Compute continuous polarity as weighted average of class scores
        polarity = sum(
            p * self.label_to_score[label]
            for p, label in zip(probs, self.labels_ordered)
        )
        return polarity

    def _get_candidate_polarities(self, candidates) -> np.ndarray:
        """
        Compute continuous sentiment polarity scores for each candidate keyword.

        This method extends candidate scoring with sentiment, overriding base candidate processing.

        Parameters:
        -----------
        candidates : iterable of str
            List of candidate keywords.

        Returns:
        --------
        np.ndarray
            Array of polarity scores for each candidate keyword.
        """
        candidates = list(candidates)  # ensure correct input format for tokenizer
        
        # Batch predict probabilities for all candidates
        probs_list = self.sentiment_model.predict_proba(candidates)
        
        polarities = []
        for probs in probs_list:
            # Weighted average as continuous polarity score
            polarity = sum(
                p * self.label_to_score[label]
                for p, label in zip(probs, self.labels_ordered)
            )
            polarities.append(polarity)
        return np.array(polarities)

    def _select_candidates(
            self, 
            doc: str, 
            ngram_range: Tuple[int, int] = (1, 3), 
            threshold: float = 0.4
    ):
        """
        Extract initial candidates with CountVectorizer and filter them based on combined
        semantic similarity and sentiment alignment scores.

        This method replaces the default candidate generation and filtering steps of KeyBERT,
        incorporating sentiment filtering before final keyword ranking.

        Parameters:
        -----------
        doc : str
            Document text.

        ngram_range : tuple of int
            N-gram size range for candidate extraction.

        threshold : float
            Minimum combined score for candidate retention.

        Returns:
        --------
        list of str
            Filtered list of candidate keywords.
        """
        # Extract candidates with CountVectorizer (statistical n-grams)
        vectorizer = CountVectorizer(
            ngram_range=ngram_range,
            stop_words='english',
            max_features=self.candidate_pool_size
        )
        candidates = vectorizer.fit([doc]).get_feature_names_out()

        # Compute semantic embeddings for doc and candidates
        doc_emb = self.model.embed([doc])
        cand_emb = self.model.embed(candidates)

        # Compute continuous sentiment polarity scores
        doc_pol = self._get_doc_polarity_continuous(doc)
        cand_pols = self._get_candidate_polarities(candidates)

        # Calculate cosine semantic similarity scores
        sim_scores = cosine_similarity(doc_emb, cand_emb)[0]

        # Calculate sentiment alignment scores
        sentiment_scores = 1 - np.abs(cand_pols - doc_pol)

        # Combine semantic and sentiment scores with alpha weighting
        combined_scores = self.weight_sentiment * sentiment_scores + (1 - self.weight_sentiment) * sim_scores

        # Filter candidates that meet threshold on combined score
        filtered_candidates = [c for c, s in zip(candidates, combined_scores) if s >= threshold]

        return filtered_candidates

    def extract_keywords(
        self,
        doc: str,
        top_n: int = 5,
        candidate_threshold: float = 0.4,
        keyphrase_ngram_range: Tuple[int, int] = (1, 3),
    ):
        """
        Extract top keywords from a document by combining semantic similarity and sentiment alignment.

        This method overrides the `extract_keywords` method from KeyBERT base class,
        adding sentiment-aware candidate filtering and scoring.

        Parameters:
        -----------
        doc : str
            Input document text.

        top_n : int
            Number of keywords to return.

        candidate_threshold : float
            Threshold score to filter candidate keywords.

        keyphrase_ngram_range : tuple of int
            N-gram range for candidate keyword extraction.

        Returns:
        --------
        list of tuples
            List of (keyword, score) tuples sorted by descending combined score.
        """

        # Select candidates filtered by combined semantic+sentiment scoring
        candidates = self._select_candidates(
            doc,
            ngram_range=keyphrase_ngram_range,
            threshold=candidate_threshold
        )
        if not candidates:
            print("No candidates passed the sentiment-semantic filter.")
            return []

        # Compute semantic embeddings for document and filtered candidates
        doc_emb = self.model.embed([doc])
        cand_emb = self.model.embed(candidates)

        # Compute continuous sentiment polarity for the document
        doc_pol = self._get_doc_polarity_continuous(doc)

        print(f"Document polarity: {doc_pol:.3f}")

        # Compute sentiment polarities for candidates
        cand_pols = self._get_candidate_polarities(candidates)

        # Calculate semantic similarity and sentiment alignment scores
        sim_scores = cosine_similarity(doc_emb, cand_emb)[0]
        sentiment_scores = 1 - np.abs(cand_pols - doc_pol)

        # Final combined score with weighting factor alpha
        final_scores = self.weight_sentiment * sentiment_scores + (1 - self.weight_sentiment) * sim_scores

        # Select top_n keywords sorted by combined score descending
        top_indices = np.argsort(final_scores)[-top_n:][::-1]

        return [(candidates[i], final_scores[i]) for i in top_indices]


# Tests

### Test 1