# KeyBERT with Sentiment-aware Embedding Fusion

This notebook implements a **sentiment-aware embedding-level extension** of KeyBERT, where sentiment information is directly injected into the document and candidate embeddings **before** similarity is computed.

*The goal is to guide keyword selection by embedding not just semantic content, but also the **emotional tone** of the text, so that emotionally coherent topics naturally emerge.*

---

## Approach

Unlike post-hoc reranking strategies, this method modifies the **core embedding computation** used by KeyBERT. Specifically, we create a combined representation:

- `E_sem`: the original semantic embedding of the text (e.g., using `MiniLM`)
- `s`: the sentiment vector (either raw probabilities or a non-linear transformation)
- `E_final`: the sentiment-aware embedding, combining `E_sem` and `s` through either linear or non-linear fusion.

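We support two configurable options. Not every pairing is valid: the `"add"` and `"nonlinear"` formulas require the `"nonlinear"` sentiment vector, whose dimension matches `E_sem`.

1. **Sentiment Vector Type**
   - `"linear"`: raw sentiment probabilities `[p_neg, p_neu, p_pos]`
   - `"nonlinear"`: transformed via an MLP projection `s_proj` (randomly initialized in this notebook; no trained weights are loaded)

2. **Combination Formula**
   - `"concat"`: `E_final = [E_sem ; β × s]` (β scales the raw probabilities only)
   - `"add"`: `E_final = E_sem + s_proj`
   - `"nonlinear"`: `E_final = E_sem + s_proj + E_sem * s_proj`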
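The combination formulas above can be sketched with NumPy on toy vectors. This is a minimal illustration rather than the notebook's implementation: the 4-dimensional `e_sem`, the stand-in `s_proj`, and the helper name `fuse` are assumptions chosen for readability.

```python
import numpy as np


def fuse(e_sem, s, mode, beta=1.0):
    """Toy version of the three fusion formulas (hypothetical helper).

    e_sem : semantic embedding, shape (dim,)
    s     : sentiment vector; shape (3,) for "concat" with raw probabilities,
            shape (dim,) for "add" and "nonlinear"
    """
    if mode == "concat":
        # [E_sem ; beta * s]: the output is longer than the semantic embedding
        return np.concatenate([e_sem, beta * s])
    if mode == "add":
        # Element-wise addition; beta=1.0 gives the plain sum E_sem + s_proj
        return e_sem + beta * s
    if mode == "nonlinear":
        # E_sem + s_proj + E_sem * s_proj (element-wise product)
        return e_sem + s + e_sem * s
    raise ValueError(f"unknown mode: {mode}")


e_sem = np.array([0.2, -0.1, 0.4, 0.0])   # toy semantic embedding (dim = 4)
probs = np.array([0.1, 0.2, 0.7])         # toy [p_neg, p_neu, p_pos]
s_proj = np.array([0.5, -0.5, 0.0, 1.0])  # toy projected sentiment (dim = 4)

concat_out = fuse(e_sem, probs, "concat", beta=0.5)
add_out = fuse(e_sem, s_proj, "add")
nonlinear_out = fuse(e_sem, s_proj, "nonlinear")

print(concat_out.shape, add_out.shape, nonlinear_out.shape)
```

Only `"concat"` changes the output width, which is why it is the one formula that works with the 3-dimensional raw probabilities.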

---

## Characteristics

- **Modular**: easily switch between sentiment vector types and combination formulas.
- **Prior-aware**: modifies the document and candidate embeddings before similarity is computed, so sentiment influences ranking directly rather than through post-hoc reranking.
- **Extensible**: can plug into any KeyBERT-based pipeline or BERTopic-style clustering.
- **Evaluation-ready**: supports metric-based validation for ablation studies.

---

This implementation is meant to **systematically evaluate the impact of injecting sentiment into the embedding space**, to discover whether richer emotional context improves keyword extraction.


### Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

import importlib.util

# Required packages: pip distribution name -> importable module name
required_packages = {
    "keybert": "keybert",
    "sentence-transformers": "sentence_transformers",  # pip name differs from module name
    "transformers": "transformers",
    "torch": "torch",
    "numpy": "numpy",
}

def install_package(package, module_name):
    """Installs a package using pip if its module cannot be found."""
    if importlib.util.find_spec(module_name) is not None:
        print(f"{package} is already installed.")
    else:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package, module_name in required_packages.items():
    install_package(package, module_name)



keybert is already installed.
Installing sentence-transformers...
Defaulting to user installation because normal site-packages is not writeable
transformers is already installed.
torch is already installed.
numpy is already installed.


# Class Definition

In [2]:
import torch
import numpy as np

# PyTorch neural network module — used to define the MLP that projects sentiment vectors
import torch.nn as nn

# SentenceTransformer is used to generate dense semantic embeddings for full documents or keywords
from sentence_transformers import SentenceTransformer

# HuggingFace Transformers: 
# - AutoTokenizer tokenizes input text for the sentiment model
# - AutoModelForSequenceClassification runs the sentiment classification model (e.g., RoBERTa)
from transformers import AutoTokenizer, AutoModelForSequenceClassification


## SentimentEmbedder: Fusing Semantics and Sentiment for Keyword Extraction

This cell defines the `SentimentEmbedder` class, a **modular and extensible embedding model** that enhances the standard KeyBERT pipeline by incorporating sentiment information directly into the embedding space. Instead of applying sentiment corrections after keyword extraction (post-hoc), this approach injects **emotional context directly into the vector representation**, enabling **prior-aware keyword ranking**.  

---

### Sentiment Computation Scope

The sentiment is computed for **both the document (e.g., a review)** and **each candidate keyword**. This is because `SentimentEmbedder.encode()` is called for both sides of the similarity computation in KeyBERT:

- When generating the embedding for the full document.
- When generating embeddings for all candidate keywords.

This enables the model to favor keywords that are not only semantically related but also **emotionally aligned** with the document, enhancing both interpretability and contextual coherence.

---

### Core Idea

KeyBERT ranks keywords by the cosine similarity between the semantic embedding of a document and the embeddings of its candidate phrases. However, this ignores the **emotional polarity** of the text, which may be critical in applications such as reviews, opinion mining, or narrative modeling.

To address this, the `SentimentEmbedder` class extends the sentence embedding by fusing in a **sentiment-aware component** via one of several fusion strategies, ensuring that **semantic and affective dimensions are jointly encoded**.
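To make the ranking step concrete, the sketch below scores two candidates by cosine similarity, first on semantic vectors alone and then on vectors with scaled sentiment probabilities appended. All vectors, the candidate words, and the `cosine` helper are made-up illustrations; the example only shows how an appended sentiment component can reorder candidates.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy document: 2-D semantic vector and mostly positive sentiment
doc_sem = np.array([1.0, 0.0])
doc_sent = np.array([0.0, 0.1, 0.9])   # [p_neg, p_neu, p_pos]

# Toy candidates: (semantic vector, sentiment probabilities)
candidates = {
    "disappointing": (np.array([0.8, 0.6]), np.array([0.9, 0.1, 0.0])),
    "fantastic": (np.array([0.6, 0.8]), np.array([0.0, 0.1, 0.9])),
}
beta = 1.0  # sentiment weight, exaggerated so the effect is visible on toy data

# Semantic-only ranking
semantic_scores = {w: cosine(doc_sem, sem) for w, (sem, _) in candidates.items()}

# Sentiment-aware ranking with "concat" fusion on both sides of the comparison
fused_doc = np.concatenate([doc_sem, beta * doc_sent])
fused_scores = {
    w: cosine(fused_doc, np.concatenate([sem, beta * sent]))
    for w, (sem, sent) in candidates.items()
}

best_semantic = max(semantic_scores, key=semantic_scores.get)
best_fused = max(fused_scores, key=fused_scores.get)
print("semantic only :", best_semantic, semantic_scores)
print("with sentiment:", best_fused, fused_scores)
```

With these toy values the semantically closest candidate carries the opposite polarity, so appending sentiment moves the emotionally aligned candidate to the top.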

---

### Structure and Behavior

The class is compatible with any `KeyBERT` pipeline, and supports the following configurable components:

- **`base_model`**: the semantic encoder (e.g., `all-MiniLM-L6-v2`) used to generate standard sentence embeddings.
- **`sentiment_model`**: a HuggingFace classifier that outputs probabilities over sentiment classes: `[negative, neutral, positive]`.
- **`sentiment_mode`**:
  - `"linear"`: directly uses raw sentiment probabilities scaled by a factor `β`.
  - `"nonlinear"`: applies an MLP (Multi-Layer Perceptron) projection to obtain a dense, continuous representation aligned with the semantic space.
- **`combination_mode`**:
  - `"concat"`: appends the sentiment vector to the semantic embedding.
  - `"add"`: performs element-wise addition between the two vectors (requires matching dimensionality).
  - `"nonlinear"`: combines `add` with an element-wise product for richer interaction:  
    $$
    E_\text{final} = E_\text{sem} + E_\text{sent} + (E_\text{sem} \odot E_\text{sent})
    $$
- **`beta`**: scales the influence of the sentiment vector in `"linear"` mode.
- **`device`**: specifies the computation device (`"cpu"` or `"cuda"`).

### Output Format

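The `encode()` method returns a NumPy matrix whose width depends on the configuration: `(batch_size, dim)` for `"add"` and `"nonlinear"` fusion, `(batch_size, dim + 3)` for `"concat"` with the `"linear"` sentiment vector, and `(batch_size, 2 * dim)` for `"concat"` with the `"nonlinear"` sentiment vector. These vectors can be directly used by KeyBERT or any cosine-similarity-based retrieval system.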
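The mapping from configuration to output width can be summarized in a small helper. This is a sketch derived from the branches of `encode()` in this notebook; the function name `fused_dim` and the parameter `n_classes` are hypothetical, and 384 is used as the example width because that is the size of `all-MiniLM-L6-v2` embeddings.

```python
def fused_dim(dim, sentiment_mode, combination_mode, n_classes=3):
    """Return the embedding width for a configuration, or raise if it is invalid."""
    if combination_mode not in {"concat", "add", "nonlinear"}:
        raise ValueError("combination_mode must be 'concat', 'add', or 'nonlinear'")
    if sentiment_mode == "linear":
        # Raw probabilities have n_classes entries, so only concatenation is possible
        if combination_mode != "concat":
            raise ValueError("'linear' sentiment vectors only support 'concat'")
        return dim + n_classes
    if sentiment_mode == "nonlinear":
        # The projected sentiment vector already has `dim` entries
        return 2 * dim if combination_mode == "concat" else dim
    raise ValueError("sentiment_mode must be 'linear' or 'nonlinear'")


# Enumerate every pairing for a 384-dimensional semantic encoder
for s_mode in ("linear", "nonlinear"):
    for c_mode in ("concat", "add", "nonlinear"):
        try:
            print(f"{s_mode:9s} + {c_mode:9s} -> {fused_dim(384, s_mode, c_mode)}")
        except ValueError as err:
            print(f"{s_mode:9s} + {c_mode:9s} -> invalid ({err})")
```

Checking the pairing up front like this would surface an invalid configuration at construction time instead of at the first `encode()` call.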

---

### Continuous Sentiment Embedding (Nonlinear Mode)

In `"nonlinear"` mode, the sentiment classifier output is passed through a small MLP ending in a `Sigmoid()` activation. This vector is:

1. **Rescaled to the range [0, 10]**, representing sentiment intensity.
2. **Normalized to the interval [-1, +1]**, to match the distributional range of semantic embeddings.

This enables **compatibility with semantic vectors** and ensures that sentiment influences ranking without overpowering the semantic meaning.
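The two rescaling steps compose into a single affine map of the sigmoid output, which can be checked numerically. In the sketch below, `pre_activation` is a set of toy values standing in for the MLP's outputs before the final `Sigmoid()`; it is an assumption for illustration, not data from the notebook.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


pre_activation = np.linspace(-6.0, 6.0, 13)  # toy pre-sigmoid values

projected = sigmoid(pre_activation)  # values in (0, 1)
rescaled = projected * 10            # values in (0, 10)
normalized = (rescaled - 5) / 5      # values in (-1, 1)

# The two affine steps collapse to 2 * sigmoid(x) - 1, which equals tanh(x / 2)
shortcut = 2.0 * projected - 1.0
same_as_shortcut = np.allclose(normalized, shortcut)
same_as_tanh = np.allclose(normalized, np.tanh(pre_activation / 2.0))

print(same_as_shortcut, same_as_tanh)
print(normalized.min(), normalized.max())
```

In other words, the intermediate `[0, 10]` scale does not change the result: the final values are the same as applying a `tanh`-shaped squashing to the MLP output.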

---

### Applicability

This architecture supports both **intrinsic evaluation** (embedding structure, keyword quality) and **extrinsic downstream tasks** (clustering, classification, sentiment-coherent retrieval), and is general enough to be adapted to other affective signals (e.g., emotion, subjectivity, sarcasm).


In [3]:
class SentimentEmbedder(nn.Module):
    """
    A KeyBERT-compatible embedding model that fuses semantic and sentiment information
    into a single embedding vector, using configurable strategies.

    Parameters

    base_model : str
        Name of the SentenceTransformer model to use for semantic embeddings.
        Default is "all-MiniLM-L6-v2", a lightweight and efficient model.

    sentiment_model : str
        HuggingFace model identifier used for sentiment analysis.
        Must be a classification model with 3 outputs: [negative, neutral, positive].

    use_sentiment : bool
        If True, sentiment information will be integrated into the final embedding.
        If False, the model will behave like a standard semantic-only embedder.

    sentiment_mode : str
        Specifies how the sentiment vector is handled:
        - "linear": use raw sentiment probabilities [p_neg, p_neu, p_pos]
        - "nonlinear": pass the sentiment vector through a small MLP projection

    combination_mode : str
        Defines how semantic and sentiment embeddings are fused:
        - "concat": concatenate the vectors
        - "add": add them element-wise (requires same dimensionality)
        - "nonlinear": add + element-wise product for richer interaction

    beta : float
        Scaling factor for the sentiment vector, controlling its influence.
        Only applied when sentiment_mode="linear"; ignored in "nonlinear" mode.
        Typical values range between 0.1 and 0.5.

    device : str
        Device to run the models on: "cpu" or "cuda".
    """


    def __init__(self,
                 base_model="all-MiniLM-L6-v2",
                 sentiment_model="finiteautomata/bertweet-base-sentiment-analysis", # HuggingFace model for sentiment analysis
                 use_sentiment=True,
                 sentiment_mode="linear",         # "linear" or "nonlinear"
                 combination_mode="concat",       # "concat", "add", "nonlinear"
                 beta=0.5,                        # strength of sentiment influence
                 device="cpu"):
        super().__init__()

        # Initialize input parameters
        self.device = device                                # Device used for computation ("cpu" or "cuda")
        self.use_sentiment = use_sentiment                  # Whether to include sentiment in the embedding
        self.sentiment_mode = sentiment_mode                # How sentiment is processed: "linear" or "nonlinear"
        self.combination_mode = combination_mode            # How semantic and sentiment vectors are fused
        self.beta = beta                                    # Scaling factor controlling sentiment influence

        # Load the semantic model (e.g., MiniLM or other SentenceTransformer)
        self.base = SentenceTransformer(base_model, device=device)  # Pretrained semantic encoder
        self.dim = self.base.get_sentence_embedding_dimension()     # Dimensionality of the semantic embeddings

        # If sentiment is used, load the sentiment model and define its transformation
        if use_sentiment:

            # Load the tokenizer corresponding to the sentiment model
            # It transforms raw text into token IDs expected by the transformer
            self.tokenizer = AutoTokenizer.from_pretrained(sentiment_model)
            
            # Load the pretrained sentiment classification model
            # It outputs a 3-class probability distribution: [negative, neutral, positive]
            self.sent_model = AutoModelForSequenceClassification.from_pretrained(sentiment_model).to(device)

            # Define a small neural network (MLP) to project the 3-dimensional sentiment vector
            # into a continuous, dense vector aligned with the semantic embedding space.
            # Output of the last layer is passed through a Sigmoid, resulting in values ∈ [0, 1],
            # which will later be rescaled into [0, 10] and normalized to match the semantic scale.
            self.sent_proj = nn.Sequential(
                nn.Linear(3, 32),        # First layer expands the sentiment input to a hidden size
                nn.ReLU(),               # Apply non-linearity for expressiveness
                nn.Linear(32, self.dim), # Project to same dimension as semantic embedding (e.g., 384)
                nn.Sigmoid()             # Ensure all outputs are bounded between 0 and 1
            ).to(device)



    @torch.no_grad()  # Disable gradient tracking since we are only doing inference
    def _get_sentiment_vector(self, texts):
        """
        Compute a sentiment-aware vector for a list of texts.
        
        Output depends on the selected `sentiment_mode`:
        - "linear"    → returns raw probabilities [p_neg, p_neu, p_pos], scaled by beta
        - "nonlinear" → projects sentiment into a continuous embedding aligned with the semantic space,
                        scaled in [0, 10], then normalized to [-1, +1] to match semantic embedding range
        """

        # Preprocess the input texts by converting them into model-ready tokenized format.
        # This includes:
        # - tokenizing the text into subword IDs
        # - adding special tokens (e.g., [CLS], [SEP])
        # - padding all sequences in the batch to the same length
        # - truncating sequences that are too long
        # - returning the result as PyTorch tensors
        # The output includes both:
        # - input_ids: the token indices
        # - attention_mask: a mask to distinguish real tokens from padding
        inputs = self.tokenizer(
            texts, 
            padding=True,           # Pad all sequences to the same length
            truncation=True,        # Truncate longer sequences to fit the model's max length
            return_tensors="pt"     # Return as PyTorch tensors
        ).to(self.device)           # Move the batch to the specified device (CPU or GPU)


        # Perform inference with the sentiment classifier to get logits
        logits = self.sent_model(**inputs).logits  # shape: (batch_size, 3)

        # Convert logits to probabilities over [negative, neutral, positive]
        probs = torch.softmax(logits, dim=1).to(self.device)  # shape: (batch_size, 3)

        # LINEAR: return the scaled sentiment probabilities directly
        if self.sentiment_mode == "linear":
           
            # Values will be in range [0, beta]
            return self.beta * probs  # shape: (batch_size, 3)

        # NON-LINEAR: use an MLP to project the 3D sentiment vector into semantic space
        elif self.sentiment_mode == "nonlinear":
            
            # Step 1: project to [0, 1] via sigmoid (from MLP)
            projected = self.sent_proj(probs)  # shape: (batch_size, dim), values in [0, 1]

            # Step 2: rescale to [0, 10] to model sentiment intensity
            rescaled = projected * 10

            # Step 3: normalize to [-1, +1] so it aligns numerically with the semantic embedding
            normalized = (rescaled - 5) / 5  # now values ∈ [-1, +1]

            return normalized  # shape: (batch_size, dim)

        else:
            # Catch invalid input modes
            raise ValueError("Invalid sentiment_mode: choose 'linear' or 'nonlinear'")



    def encode(self, texts, **kwargs):
        """
        Main method required by KeyBERT: returns embeddings for a list of texts.
        
        The output depends on:
        - `use_sentiment`: whether sentiment should be included
        - `sentiment_mode`: how the sentiment vector is obtained ("linear" or "nonlinear")
        - `combination_mode`: how to fuse semantic and sentiment embeddings
        """

        # Step 1: Get semantic embedding from SentenceTransformer model
        base_emb = self.base.encode(
            texts, 
            convert_to_tensor=True, 
            **kwargs
        ).to(self.device)  # shape: (batch_size, dim)

        if not self.use_sentiment:
            # If sentiment is disabled, return semantic embedding as-is (KeyBERT solution)
            return base_emb.cpu().numpy()

        # Step 2: Get sentiment vector (either raw or projected)
        sent_vec = self._get_sentiment_vector(texts)  # shape depends on mode:
                                                      # linear  → (batch_size, 3)
                                                      # nonlinear → (batch_size, dim)

        # Step 3: Combine semantic and sentiment vectors based on selected strategy
        if self.sentiment_mode == "linear":
            # Linear mode → sentiment vector has shape (batch_size, 3)

            if self.combination_mode == "concat":
                # Concatenate semantic embedding and sentiment probabilities
                return torch.cat([base_emb, sent_vec], dim=1).cpu().numpy()  # shape: (batch_size, dim + 3)

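            elif self.combination_mode in ["add", "nonlinear"]:
                # Cannot add or multiply vectors of mismatched dimensions
                raise ValueError("Cannot use '{}' combination with 'linear' sentiment vector (dimension mismatch).".format(self.combination_mode))

            else:
                # Unsupported combination mode: raise instead of silently returning None
                raise ValueError("Invalid combination_mode: choose 'concat', 'add', or 'nonlinear'")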

        elif self.sentiment_mode == "nonlinear":
            # Nonlinear mode → sentiment vector has same shape as base embedding

            if self.combination_mode == "add":
                # Simple element-wise addition of semantic + sentiment vectors
                return (base_emb + sent_vec).cpu().numpy()  # shape: (batch_size, dim)

            elif self.combination_mode == "nonlinear":
                # Nonlinear fusion: sum + element-wise product
                return (base_emb + sent_vec + base_emb * sent_vec).cpu().numpy()  # shape: (batch_size, dim)

            elif self.combination_mode == "concat":
                # Concatenate semantic embedding and projected sentiment vector
                return torch.cat([base_emb, sent_vec], dim=1).cpu().numpy()  # shape: (batch_size, dim * 2)

            else:
                # Unsupported combination mode
                raise ValueError("Invalid combination_mode: choose 'concat', 'add', or 'nonlinear'")

        else:
            # Unsupported sentiment mode
            raise ValueError("Invalid sentiment_mode: choose 'linear' or 'nonlinear'")


# Tests

In [4]:
import time
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import snapshot_download

### Model Preloading

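This cell preloads a semantic model and a sentiment model so that they are cached locally before `SentimentEmbedder` is instantiated.  
Since HuggingFace models are downloaded the first time they are used, this step may take **several minutes**.
Once the models are cached, subsequent runs are significantly faster. The cache only helps if the checkpoint preloaded here is the same one passed as `sentiment_model` to `SentimentEmbedder`; otherwise the class downloads its own default on first use.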

In [None]:
# Timer start
start = time.time()

# Load semantic model (Sentence-BERT)
print("Loading semantic model...")
sem_model = SentenceTransformer("all-MiniLM-L6-v2")
print("Semantic model loaded.")

# Load sentiment model using snapshot_download (more robust download)
print("Downloading sentiment model snapshot...")
sentiment_model_path = snapshot_download("j-hartmann/sentiment-roberta-large-english-3-classes")

print("Loading sentiment model and tokenizer from local snapshot...")
tokenizer = AutoTokenizer.from_pretrained(sentiment_model_path)
sent_model = AutoModelForSequenceClassification.from_pretrained(sentiment_model_path)
print("Sentiment model loaded.")

# Timer end
end = time.time()
print(f"\nTotal loading time: {end - start:.2f} seconds")


Loading semantic model...
Semantic model loaded.
Downloading sentiment model snapshot...


Fetching 11 files:  55%|█████▍    | 6/11 [00:12<00:10,  2.09s/it]


### Test 1

In [None]:
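from keybert import KeyBERT
from keybert.backend import BaseEmbedder


class SentimentBackend(BaseEmbedder):
    """Adapter exposing SentimentEmbedder through KeyBERT's custom-backend interface.

    Assumption: KeyBERT may fall back to a default model when it is handed an object
    it does not recognize as a backend, so the embedder is wrapped explicitly.
    """

    def __init__(self, embedder):
        super().__init__()
        self.embedder = embedder

    def embed(self, documents, verbose=False):
        # Candidates may arrive as a NumPy array; the tokenizer expects a list of str
        return self.embedder.encode(list(documents), show_progress_bar=verbose)


# Initialize the sentiment-aware model
sent_model = SentimentEmbedder(
    sentiment_mode="nonlinear",        # "nonlinear" or "linear"
    combination_mode="nonlinear",      # "nonlinear", "add" or "concat"
    beta=0.5,                          # only used in "linear" mode
    device="cpu"                       # or "cuda" if available
)

# Plug it into KeyBERT through the adapter
kw_model = KeyBERT(model=SentimentBackend(sent_model))

doc = "I absolutely loved this product. It exceeded my expectations and the quality is fantastic."

keywords = kw_model.extract_keywords(doc, top_n=5)
print("Extracted Keywords:\n", keywords)

# View the shape and some sample values of the sentiment-aware embedding
embedding = sent_model.encode([doc])
print("Embedding shape:", embedding.shape)
print("Embedding preview:", embedding[0][:10])  # first 10 values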

ConnectTimeout: HTTPSConnectionPool(host='cas-bridge.xethub.hf.co', port=443): Max retries exceeded while downloading model.safetensors (Connection to cas-bridge.xethub.hf.co timed out, connect timeout=10). Request ID: 9d035b6f-25bb-4037-86cc-16e5b4e7bd0f