<!-- Title Slide -->

<table>
<tr>
<td style="width:40%; vertical-align:middle; text-align:center;">
  <img src="https://maartengr.github.io/BERTopic/logo.png" alt="Topic Modeling Illustration" width="300"/>
</td>
<td style="width:60%; vertical-align:middle; padding-left:20px;">

  # 📝 Topic Modeling with BERTopic
  ## Session: Text Embeddings

  <br>
  <br>
  <span style="font-size:1.2em; color:gray;">
    Michael Jantscher · TU Graz · Know Center Research GmbH
  </span>

</td>
</tr>
</table>


# 👋 Michael Jantscher

<img src="./images/profil_michael.jpg" alt="Michael Jantscher" width="250"/>

**PhD Student** - TU Graz <br>
**Researcher** - Know Center Research GmbH

---

### 🟢 Focus Areas
- Natural Language Processing (NLP) in medical & clinical domains
- Causal reasoning in healthcare and (neuro)radiology
- Agentic AI systems for decision support and research workflows


# 📌 What is a Text Embedding Vector? - Formal Definition
* Numerical representation of text (words, sentences or documents) in a multi-dimensional space
* Captures meaning and context
* Semantically similar words/sentences/documents -> vectors closer together

In [None]:
from IPython.display import HTML
HTML("""
<style>
.reveal .slides section { text-align: left !important; }
.reveal h1, .reveal h2, .reveal h3, .reveal p, .reveal table { text-align: left !important; }
.reveal table th, .reveal table td { text-align: left !important; }
</style>
""")

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
st_model_small = SentenceTransformer('all-MiniLM-L6-v2')

sample_string = "I really like this summer school!"

sample_string_embedding = st_model_small.encode(sample_string)
df = pd.DataFrame({
    "text": [sample_string],
    "embedding": [sample_string_embedding]
})
df

## Why Use Embedding Vectors Instead of Strings?

### 🟢 Drawbacks with Raw Strings
- Computers see `"dog"` as just `"d", "o", "g"` — no understanding of meaning
- Hard to measure **similarity** or **relationships** between words
- Not suitable for **search, clustering, or ML models**

---

### 🟠️ Benefits of Embeddings
- **Numerical Representation:** Converts text into vectors that algorithms understand
- **Capture Meaning:** Similar words or sentences are close in vector space
- **Efficient & Scalable:** Enables fast similarity search with dot product or cosine similarity
- **Handle Synonyms & Context:** `"car"` ≈ `"automobile"`, context-aware models disambiguate `"bank"`

---

### 🔵 Example

| Text      | Raw Representation         | Embedding (Example)          |
|-----------|---------------------------|-----------------------------|
| `"dog"`   | `"d", "o", "g"`           | `[0.12, -0.45, 0.87, ...]`  |
| `"puppy"` | `"p", "u", "p", "p", "y"` | `[0.11, -0.48, 0.90, ...]`  |

*Vectors are close → words are semantically similar*
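
To make the contrast concrete, the sketch below compares "dog" and "puppy" once as raw strings and once as vectors. The three-component vectors are only the visible part of the illustrative values in the table, not real model output, and `cosine_similarity` is a helper defined here for illustration.

```python
import math
from difflib import SequenceMatcher


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Character-level comparison: the two strings share no characters at all.
string_similarity = SequenceMatcher(None, "dog", "puppy").ratio()

# Toy vectors: the first three (illustrative) components from the table.
dog = [0.12, -0.45, 0.87]
puppy = [0.11, -0.48, 0.90]
vector_similarity = cosine_similarity(dog, puppy)

print(f"String similarity: {string_similarity:.2f}")
print(f"Vector similarity: {vector_similarity:.4f}")
```

The string comparison sees two unrelated character sequences, while the vectors point in almost the same direction.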


# 💡 Use Cases of Text Embedding Vectors

| **Use Case**                   | **Example** |
|--------------------------------|-------------|
| **Semantic Search**         | Search “doctor” → find content on “physicians” or “healthcare providers” even without exact keywords |
| **Recommendation Systems**  | Suggest similar research papers, movies, or products based on descriptions or reviews |
| **Sentiment Analysis**      | Understand tone (positive/negative/neutral) beyond simple keywords in tweets or reviews |
| **Clustering & Topic Modeling** | Group thousands of news articles or support tickets by topic automatically |
| **Chatbots & Virtual Assistants** | Improve NLU so bots answer contextually, not just by keyword |
| **Fraud Detection**        | Spot unusual or suspicious text patterns in financial or insurance claims |

---



# 🦖 Pre-Embedding Era
📌 **Definition:**
Early NLP approaches represented text as **sparse, high-dimensional vectors**.
Each dimension corresponded to a **unique word or token**, with no sense of meaning or context.

---

### Bag-of-Words (BoW)
- Represents documents as a **vector of word counts**
- Ignores grammar, order, and semantics
- Example:
  `"I like NLP"` → `[1, 1, 1, 0, 0, ...]`
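
A minimal Bag-of-Words sketch using only the standard library. It assumes simple lowercase whitespace tokenization and a small made-up corpus; `build_vocabulary` and `bag_of_words` are illustrative helper names. The vocabulary is sorted alphabetically, so the position of each count differs from the ordering used in the example above.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the unique lowercase tokens of all documents in sorted order."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Made-up mini corpus for illustration.
documents = ["I like NLP", "I like topic models", "topic models like text"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

print(vocabulary)
for doc, vec in zip(documents, vectors):
    print(vec, "<-", doc)
```

Each vector has one entry per vocabulary word, so its length grows with the vocabulary, and any reordering of the words in a document yields the same vector.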

---

### TF-IDF (Term Frequency – Inverse Document Frequency)
- Adjusts raw counts to emphasize **rare, informative words** and downweight common words
- Example: “the” → low weight, “quantum” → high weight

---

#### TF-IDF Formula

$$TF\text{-}IDF(t, d) = TF(t, d) \times \log\left(\frac{N}{DF(t)}\right)$$

Where:
- \(TF(t, d)\): Frequency of term \(t\) in document \(d\)
- \(DF(t)\): Number of documents containing \(t\)
- \(N\): Total number of documents
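
The formula can be evaluated by hand on a tiny corpus. The sketch below is a direct, unsmoothed translation of it, with raw counts as TF and the natural logarithm; the corpus and the function name `tf_idf` are made up for illustration. Library implementations such as scikit-learn typically apply smoothing and normalization, so their numbers will differ.

```python
import math


def tf_idf(term, document, corpus):
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t)), with raw counts as TF."""
    tf = document.count(term)                      # TF(t, d)
    df = sum(1 for doc in corpus if term in doc)   # DF(t)
    if df == 0:
        return 0.0  # term never occurs, so it carries no weight
    return tf * math.log(len(corpus) / df)         # N = len(corpus)


# Made-up corpus of three tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the quantum computer works".split(),
]

score_the = tf_idf("the", corpus[0], corpus)
score_quantum = tf_idf("quantum", corpus[2], corpus)

print(f"the:     {score_the:.4f}")
print(f"quantum: {score_quantum:.4f}")
```

Because "the" occurs in every document, \(N / DF(t) = 1\) and its weight is zero regardless of how often it appears, while "quantum" occurs in only one document and receives a positive weight.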


In [None]:
# Load parlamint dataset
df_parlamint = pd.read_csv("../../../datasets/parlamint/parlamint-it-is-2022.txt", sep="\t").head(2000)
df_parlamint_subset = df_parlamint.head(1000).copy(deep=True)
df_parlamint

# Group sentences by utterance (=Parent_ID)
df_parlamint_grouped = (df_parlamint.groupby(["Parent_ID"])["Text"]
                        .apply(lambda s: " ".join(s))
                        .reset_index(name="utterance_text")).head(100)

In [None]:
df_parlamint.head(5)

In [None]:
df_parlamint_grouped.head(5)

In [None]:
# Take a single utterance from the dataset
sample_utterance = df_parlamint[df_parlamint["Parent_ID"] == "ParlaMint-IS_2022-01-17-20.u1"]["Text"]
print("\n".join([e for e in sample_utterance]))

In [None]:
from models import CountVectorizerEmbedder

# Adding just the utterance sample as vocabulary
cv_model = CountVectorizerEmbedder(vocabulary=sample_utterance, max_features=5, stop_words='english')
# cv_model = CountVectorizerEmbedder(vocabulary=sample_utterance)

cv_embeddings = cv_model.embed(sample_utterance)
print(f"Number features: {len(cv_model.embedding_model.get_feature_names_out())}", cv_model.embedding_model.get_feature_names_out())
print(f"Shape embedding array: {cv_embeddings.toarray().shape}")
df_cv_output = pd.DataFrame(columns=cv_model.embedding_model.get_feature_names_out(), data=cv_embeddings.toarray())
df_cv_output

In [None]:
from models import TfIdfEmbedder

# Adding just the utterance sample as vocabulary
tfidf_model = TfIdfEmbedder(vocabulary=sample_utterance, max_features=5, stop_words='english')
tfidf_embeddings = tfidf_model.embed(sample_utterance)
print(f"Number features: {len(tfidf_model.embedding_model.get_feature_names_out())}", tfidf_model.embedding_model.get_feature_names_out())
print(f"Shape embedding array: {tfidf_embeddings.toarray().shape}")
df_tfidf_output = pd.DataFrame(columns=tfidf_model.embedding_model.get_feature_names_out(), data=tfidf_embeddings.toarray())
df_tfidf_output

## Dense Text Embeddings

📌 **Definition:**
Dense embeddings represent words, sentences, or documents as **low-dimensional, dense vectors**
where similar meanings are **close together in vector space**.

---

### 🟢 Characteristics
- **Low-dimensional** (e.g., 100–1,536 dimensions, not vocab-sized)
- **Dense representation**: most values ≠ 0
- Captures **semantic meaning** and context
- Learned from data via **neural networks**

---

### 🟠️ Brief History
- **Word2Vec (2013)** – First widely used dense word embeddings (Mikolov et al.)
- **GloVe (2014)** – Global Vectors for word representation
- **FastText (2016)** – Adds subword information for better handling of rare words
- **ELMo (2018)** – Contextual word embeddings
- **BERT (2018)** – Contextual token embeddings from a bidirectional Transformer encoder
- **OpenAI / Modern Embeddings (2020s)** – High-quality sentence/document embeddings (e.g., `text-embedding-3-large`)

---

### 🔴 Why It’s Better
- Reduces dimensionality dramatically
- Learns **semantic relationships**
- Powers modern **search, recommendation, and AI assistants**


## Word2Vec: CBOW & Skip-Gram Overview

### 🟢 What is Word2Vec?
- A **shallow, two-layer neural network** that learns word embeddings from context.
- Maps words to **dense vectors** in a continuous space; similar words are **close together**.
- Introduced by Mikolov et al., **2013**.

---

### 🟠️ Architectures

#### 🔹 Continuous Bag-of-Words (CBOW)
- Predicts the **center word** given its context words.
- Fast to train; works well for **frequent words**.

#### 🔹 Skip-Gram
- Predicts **context words** given a center word.
- Tends to represent **rare words** better, but is slower to train than CBOW.
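
The two architectures differ mainly in how training examples are built from a context window. The sketch below generates both kinds of examples for a short sentence; the function names and the `window` parameter are illustrative and do not come from any particular library.

```python
def skip_gram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is the input, each neighbor a target."""
    pairs = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(start, end) if j != i)
    return pairs


def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbors are the input, the center the target."""
    examples = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples


tokens = "the quick brown fox jumps".split()
sg = skip_gram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg[:3])
print(cb[2])
```

Skip-Gram produces one training pair per (center, neighbor) combination, whereas CBOW produces a single example per position with all neighbors as input.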

---

### 🔵 Why It Matters
- Captures **semantic relationships**:
  `king - man + woman ≈ queen`
- Major leap from sparse (BoW/TF-IDF) to **dense, meaningful embeddings**.
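
The analogy arithmetic can be illustrated with hand-crafted toy vectors. The two-dimensional values below are hypothetical (axis 0 loosely stands for "royalty", axis 1 for "gender") and were chosen so that the analogy works; real Word2Vec vectors have hundreds of dimensions and only approximately satisfy such relations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


result = analogy("king", "man", "woman", toy_vectors)
print(f"king - man + woman ≈ {result}")
```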


## Word2Vec: Training & Visual Intuition

### 🟢 Training Workflow
1. **One-hot encoding** for words.
2. **Hidden layer** = embedding lookup table.
3. **Output layer** predicts context words (softmax, in practice approximated with negative sampling).
4. Final **hidden layer weights** = embeddings.
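
Steps 1, 2 and 4 fit together because multiplying a one-hot vector with a weight matrix simply selects one row of that matrix. The sketch below shows this with plain Python lists; the vocabulary and the weights in `W_in` are made-up values rather than trained embeddings.

```python
vocabulary = ["i", "like", "nlp"]

# Hypothetical input-to-hidden weights: one row per word, 2 embedding dimensions.
W_in = [
    [0.2, -0.1],
    [0.7, 0.3],
    [-0.4, 0.9],
]


def one_hot(word, vocabulary):
    """Vector with a 1 at the word's vocabulary index and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocabulary]


def hidden_layer(one_hot_vector, weights):
    """Vector-matrix product of the one-hot input with the weight matrix."""
    n_dims = len(weights[0])
    return [
        sum(x * row[d] for x, row in zip(one_hot_vector, weights))
        for d in range(n_dims)
    ]


word = "like"
via_matmul = hidden_layer(one_hot(word, vocabulary), W_in)
via_lookup = W_in[vocabulary.index(word)]

print(via_matmul, via_lookup)
```

This is why the hidden layer is described as a lookup table: after training, row `i` of the weight matrix is the embedding of word `i`.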

---

### 🟠 Visual Intuition

<table>
<tr>
<td width="50%">
<img src="./images/word2vec_diagrams.png" alt="Word2Vec Architecture" style="width:100%;">
</td>
<td width="50%">
  <img src="./images/skip_gram_net_arch.png" alt="Word2Vec Context" style="width:100%;">
</td>
</tr>
</table>

- Left: Skip-Gram architecture predicting context words.
- Right: Example of context window for a target word.
- Only the **embedding layer weights** are retained.

---

### Reference
- Israel G. (2017). *Word2Vec Explained*.
  [https://israelg99.github.io/2017-03-23-Word2Vec-Explained/](https://israelg99.github.io/2017-03-23-Word2Vec-Explained/)


## BERT: Contextual Embeddings

### 🟢 What is BERT?
- **Bidirectional Encoder Representations from Transformers** (2018, Google AI).
- Uses **Transformer architecture** to create **contextual word embeddings**:
  - Each word’s vector depends on **all surrounding words** (left & right context).
- Trained on **masked language modeling** and **next sentence prediction** tasks.

---

### 🟠 Key Innovations
- **Bidirectional**: Unlike Word2Vec/GloVe, captures context from both sides.
- **Transformer encoder layers** with self-attention produce rich, deep embeddings.
- **Contextualization**: Same word gets **different vectors** depending on context
  (“bank” in “river bank” vs. “bank account”).

---

### 🔵 Visual Intuition

<img src="images/bert_embeddings.png" alt="BERT Transformer" style="width:40%;">

---

### Reference
- Devlin et al. (2018). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.*
  [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805)


## Sentence Transformers Recap & Training

### 🟢 Key Takeaways
- **Sparse vectors** (BoW, TF-IDF):
  - One dimension per word, mostly zeros
  - No deep semantic meaning
- **Dense embeddings** (Word2Vec, BERT, SBERT):
  - Low-dimensional, rich semantic context
  - Words/sentences cluster by meaning

---

### 🟠 How SBERT is Trained
- **Backbone:** Pretrained BERT or RoBERTa encoders
- **Siamese/Triplet Network Architecture:**
  - Encodes two or three sentences **independently** into embeddings
  - Trains to minimize the distance between similar sentences and maximize it between dissimilar ones
- **Training Objectives [Loss Overview](https://sbert.net/docs/sentence_transformer/loss_overview.html):**
  - **Contrastive loss** (distance-based similarity)
  - **Softmax classification loss** on Natural Language Inference data (entailment, contradiction, neutral)
  - **MultipleNegativesRankingLoss** for retrieval tasks
- **Result:**
  - Embeddings suitable for **cosine similarity** → semantic search, clustering, recommendations
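
To give a feeling for the distance-based objective, the sketch below implements one common textbook formulation of contrastive loss on top of cosine distance. It is an illustration under assumptions, not the SentenceTransformers implementation: the function names, the `margin` value and the toy embeddings are made up, and the library's own loss may differ in scaling and distance function.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so identical directions have distance 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def contrastive_loss(emb_a, emb_b, label, margin=0.5):
    """label = 1 for similar pairs, 0 for dissimilar pairs."""
    d = cosine_distance(emb_a, emb_b)
    similar_term = label * d ** 2                              # pull similar pairs together
    dissimilar_term = (1 - label) * max(0.0, margin - d) ** 2  # push dissimilar pairs apart
    return similar_term + dissimilar_term


# Toy 2-d "sentence embeddings".
close_pair = ([1.0, 0.0], [0.9, 0.1])
far_pair = ([1.0, 0.0], [0.0, 1.0])

for name, pair in [("close", close_pair), ("far", far_pair)]:
    for label in (1, 0):
        print(f"{name} pair, label={label}: {contrastive_loss(*pair, label=label):.4f}")
```

A similar pair is penalized for any remaining distance, while a dissimilar pair is only penalized if it lies closer than the margin.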

---

### Reference
- Reimers & Gurevych (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*
  [https://arxiv.org/abs/1908.10084](https://arxiv.org/abs/1908.10084)
- SentenceTransformers Documentation [https://www.sbert.net/](https://www.sbert.net/)


In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
st_model_small = SentenceTransformer('all-MiniLM-L6-v2')

sample_string = "I really like this summer school!"

sample_string_embedding = st_model_small.encode(sample_string)
df = pd.DataFrame({
    "text": [sample_string],
    "embedding": [sample_string_embedding]
})
df

## Distance & Similarity Measures (Brief Intro)

### 🟢 Why It Matters
- Compare embeddings → find **semantic similarity** between words, sentences, or documents.
- Core to **semantic search, clustering, and topic modeling**.

---

### 🟠 Common Measures

<table>
<tr>
<td width="70%">
🔹 <b>Dot Product</b>
Unnormalized similarity; sensitive to magnitude:
$$
a \cdot b = \sum a_i b_i
$$
</td>
<td width="30%">
<img src="images/dot_prod.png" alt="Dot Product" width="200" align="right">
</td>
</tr>
</table>

<table>
<tr>
<td width="70%">
🔹 <b>Cosine Similarity</b>
  Measures <b>angle</b> between vectors (ignores magnitude):
  $$
  \text{cosine\_sim}(a, b) = \frac{a \cdot b}{\|a\|\|b\|}
  $$
</td>
<td width="30%">
<img src="images/dot_prod.png" alt="Dot Product" width="200" align="right">
</td>
</tr>
</table>
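
A small sketch of both measures in plain Python, using made-up vectors, to show the difference in how they treat magnitude. The helper names are illustrative.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print("dot(a, a):", dot_product(a, a))
print("dot(a, b):", dot_product(a, b))
print("cos(a, a):", round(cosine_similarity(a, a), 6))
print("cos(a, b):", round(cosine_similarity(a, b), 6))
```

Doubling the length of a vector doubles the dot product but leaves the cosine similarity unchanged. For vectors normalized to unit length, the two measures coincide.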

## Distance & Similarity Measures (Brief Intro)

### 🟠 Common Measures cont'd

<table>
<tr>
<td width="70%">
🔹 <b>Euclidean Distance</b>
  Straight-line distance in vector space:
  $$
  d(a, b) = \sqrt{\sum (a_i - b_i)^2}
  $$
 </td>
<td width="30%">
<img src="images/euc_distance.png" alt="Euclidean Distance" width="200" align="right">
</td>
</tr>
</table>

<table>
<tr>
<td width="70%">
🔹 <b>Manhattan (L1) Distance</b>
  Sum of absolute differences:
  $$
  d_{\text{L1}}(a, b) = \sum |a_i - b_i|
  $$
</td>
<td width="30%">
<img src="images/manhattan_distance.png" alt="Manhattan Distance" width="200" align="right">
</td>
</tr>
</table>
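
Both distances are easy to check by hand on a 3-4-5 triangle. The sketch below uses plain Python and made-up two-dimensional points; the helper names are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


a = [0.0, 0.0]
b = [3.0, 4.0]

print("Euclidean:", euclidean_distance(a, b))
print("Manhattan:", manhattan_distance(a, b))
```

Unlike the similarity measures, smaller values mean "more similar" here, and the Manhattan distance is never smaller than the Euclidean distance between the same two points.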


## Chunking Techniques for Embeddings

### 🟢 Why Chunking?
- Long documents exceed model token limits (e.g., BERT ~512 tokens).
- Splitting text into **manageable chunks** improves:
  - ✅ Embedding quality
  - ✅ Retrieval accuracy
  - ✅ Context management in downstream tasks
- 🎯 **Goal of Good Chunking:**
  Create chunks that are **small enough** to fit model limits
  but **large enough** to retain full semantic meaning.

---

### 🟠 Common Techniques

1. **Fixed-Length Chunking**
   - Split text into chunks of `N` tokens/words.
   - Simple, fast, but can cut off sentences mid-way.

2. **Sentence-Based Chunking**
   - Split by sentence boundaries (NLTK, spaCy).
   - Better for readability, semantic grouping.

3. **Paragraph-Based Chunking**
   - Keep natural paragraph structure.
   - Good for preserving context, but chunk sizes vary.

4. **Sliding Window / Overlapping Chunks**
   - Add overlap between chunks (e.g., 50 tokens).
   - Prevents loss of context between splits.

5. **Semantic Chunking**
   - Use topic segmentation or embeddings to find boundaries.
   - Most accurate, but computationally heavier.
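
Techniques 1, 2 and 4 can be sketched in a few lines of plain Python. The example assumes whitespace-separated words as a stand-in for model tokens and a simple regular expression for sentence boundaries; in practice one would count tokens with the model's tokenizer and split sentences with NLTK or spaCy. The function names are illustrative.

```python
import re


def fixed_length_chunks(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def sentence_chunks(text, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + sentence.split()
        if current and len(candidate) > max_words:
            chunks.append(" ".join(current))
            current = sentence.split()
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


tokens = [f"t{i}" for i in range(10)]
windows = fixed_length_chunks(tokens, chunk_size=4, overlap=2)
packed = sentence_chunks("I like NLP. Embeddings are useful. Chunking helps!", max_words=5)

print(windows)
print(packed)
```

With `overlap=0` the first function reduces to plain fixed-length chunking. The sentence-based variant never cuts inside a sentence, so a single sentence longer than `max_words` stays as one oversized chunk.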

---

💡 **Choose technique based on document length, model limits, and retrieval needs.**


# 🏋️ Exercise: Simple Embedding Use Case

In [None]:
# Encode utterance-wise dataset
df_parlamint_embeddings_per_utterance = st_model_small.encode(df_parlamint_grouped["utterance_text"].to_list(),
                                                     show_progress_bar=True)

# Encode sentence-wise dataset
df_parlamint_embeddings_per_sentence = st_model_small.encode(df_parlamint["Text"].to_list(), show_progress_bar=True)

In [None]:
df_parlamint_grouped["embedding"] = list(df_parlamint_embeddings_per_utterance)
df_parlamint["embedding"] = list(df_parlamint_embeddings_per_sentence)
df_parlamint

In [None]:
import numpy as np
from sentence_transformers import util

question = "What is the government policy on climate change?"
# question = "America?"
k = 5  # choose how many results you want

# 1. Embed the question
question_embedding = st_model_small.encode(question)

# 2. Compute cosine similarities
cosine_similarities = util.cos_sim(question_embedding, df_parlamint_embeddings_per_utterance)[0].cpu().numpy()

# 3. Get indices of top-k most similar utterances
top_k_idx = np.argsort(cosine_similarities)[::-1][:k]

# 4. Retrieve the top-k utterances and their similarity scores
for idx in top_k_idx:
    text = df_parlamint_grouped.iloc[idx]["utterance_text"]
    score = cosine_similarities[idx]
    print(f"Score: {score:.4f} | Utterance: {text}\n")


## Putting It All Together

### 🟢 The Journey So Far
- **Sparse Vectors (BoW, TF-IDF):**
  High-dimensional, simple counts, no semantics
- **Dense Word Embeddings (Word2Vec, GloVe):**
  Compact vectors capturing basic word meaning
- **Contextual Models (BERT):**
  Token embeddings adapt to context
- **Sentence Transformers (SBERT):**
  Sentence-level semantic embeddings for similarity & search
- **Chunking Strategies:**
  Break long texts into meaningful, model-friendly pieces

---

### 🟠 Why It Matters
- Transform raw text into **meaningful numerical representations**
- Enable **semantic search, clustering, Q&A, recommendations**
- Foundation for **modern NLP pipelines & AI assistants**

---

### 🔵 Key Takeaway
A well-designed embedding pipeline =
**Chunking + Contextual Models + Smart Similarity Metrics**
→ Powerful, scalable text understanding!


# 🏋️ Exercises: Hands-On with Embeddings

### 🟠 Generate embeddings for the ParlaMint (sub) dataset
- **Experiment with different embedding techniques:**
  - Compare dense vs. sparse embeddings regarding (i) vector dimensionality and (ii) semantic similarity (are similar texts actually closer together?)
  - Suggested algorithms: [BERTopic Models](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) or [Hugging Face (general)](https://huggingface.co/models?other=embeddings&sort=trending) or [Hugging Face (sentence transformers)](https://huggingface.co/sentence-transformers/models)
- **Consider different chunking techniques:**
  - Sentence level vs. utterance level vs. other granularities
- **Save them as pickle file(s):** `df_dataset.to_pickle("<path_and_filename>.pkl")`

### 🟠 Generate embeddings for the HSA (sub) dataset
- **Same tasks as for the ParlaMint dataset**

### 🟠 Retrieval: Play around with embeddings and similarity retrieval
- **Write queries:**
  - Search for valid topics, write queries, and manually evaluate the results
- **Consider different chunking techniques:**
  - Sentence level vs. utterance level
  - Are topics semantically better captured at sentence level or at utterance level?
- **Utilize different embedding models for retrieval:**
  - Sparse vs dense embeddings
  - Experiment with different similarity scores