<!-- Title Slide -->

<table>
<tr>
<td style="width:40%; vertical-align:middle; text-align:center;">
  <img src="https://maartengr.github.io/BERTopic/logo.png" alt="Topic Modeling Illustration" width="300"/>
</td>
<td style="width:60%; vertical-align:middle; padding-left:20px;">

  # 📝 Topic Modeling with BERTopic
  ## Session: Text Embeddings

  <br>
  <br>
  <span style="font-size:1.2em; color:gray;">
    Michael Jantscher · TU Graz · Know Center Research GmbH
  </span>

</td>
</tr>
</table>


# 👋 Michael Jantscher

<img src="https://dhgraz.github.io/clariah2025-dse-ml/images/artists/profil_michael.jpg" alt="Michael Jantscher" width="250"/>

**PhD Student** - TU Graz <br>
**Researcher** - Know Center Research GmbH

---

### 🧠 Focus Areas
- Natural Language Processing (NLP) in medical & clinical domains
- Causal reasoning in healthcare and (neuro)radiology
- Agentic AI systems for decision support and research workflows


# 📌 What is a Text Embedding Vector? Formal Definition
* Numerical representation of text (words, sentences or documents) in a multi-dimensional space
* Captures meaning and context
* Semantically similar words/sentences/documents → vectors closer together

In [1]:
from IPython.display import HTML
HTML("""
<style>
.reveal .slides section { text-align: left !important; }
.reveal h1, .reveal h2, .reveal h3, .reveal p, .reveal table { text-align: left !important; }
.reveal table th, .reveal table td { text-align: left !important; }
</style>
""")

In [40]:
import pandas as pd
from sentence_transformers import SentenceTransformer
st_model_small = SentenceTransformer('all-minilm-l6-v2')

sample_string = "I really like this summer school!"

sample_string_embedding = st_model_small.encode(sample_string)
df = pd.DataFrame({
    "text": [sample_string],
    "embedding": [sample_string_embedding]
})
df

| | text | embedding |
|---|---|---|
| 0 | I really like this summer school! | [-0.057146158, -0.053543467, 0.045457065, 0.01... |


# 💡 Use Cases of Text Embedding Vectors

| **Use Case**                   | **Example** |
|--------------------------------|-------------|
| **Semantic Search**         | Search “doctor” → find content on “physicians” or “healthcare providers” even without exact keywords |
| **Recommendation Systems**  | Suggest similar research papers, movies, or products based on descriptions or reviews |
| **Sentiment Analysis**      | Understand tone (positive/negative/neutral) beyond simple keywords in tweets or reviews |
| **Clustering & Topic Modeling** | Group thousands of news articles or support tickets by topic automatically |
| **Chatbots & Virtual Assistants** | Improve NLU so bots answer contextually, not just by keyword |
| **Fraud Detection**        | Spot unusual or suspicious text patterns in financial or insurance claims |

---



# 🦖 Pre-Embedding Era
📌 **Definition:**
Early NLP approaches represented text as **sparse, high-dimensional vectors**.
Each dimension corresponded to a **unique word or token**, with no sense of meaning or context.

---

### Bag-of-Words (BoW)
- Represents documents as a **vector of word counts**
- Ignores grammar, order, and semantics
- Example:
  `"I like NLP"` → `[1, 1, 1, 0, 0, ...]`
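
The counting step can be sketched with the Python standard library alone. The two-document corpus, the whitespace tokenizer, and the helper name `bag_of_words` are illustrative assumptions, not part of the workshop code:

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    # Naive tokenization: lowercase and split on whitespace (an assumption;
    # real vectorizers also strip punctuation and may drop stop words).
    tokenized = [doc.lower().split() for doc in docs]
    # One dimension per unique token, in sorted order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors

docs = ["I like NLP", "I like like embeddings"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each vector has one entry per vocabulary word, so the dimensionality grows with the vocabulary and most entries are zero for realistic corpora.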

---

### TF-IDF (Term Frequency – Inverse Document Frequency)
- Adjusts raw counts to emphasize **rare, informative words** and downweight common words
- Example: “the” → low weight, “quantum” → high weight

---

#### TF-IDF Formula

$$TF\text{-}IDF(t, d) = TF(t, d) \times \log\left(\frac{N}{DF(t)}\right)$$

Where:
- \(TF(t, d)\): Frequency of term \(t\) in document \(d\)
- \(DF(t)\): Number of documents containing \(t\)
- \(N\): Total number of documents
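
As a worked example, the formula above can be evaluated directly. This is a minimal sketch on a made-up three-document corpus; the function name `tf_idf` and the whitespace tokenization are assumptions for illustration:

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t)), as in the formula above."""
    tf = doc_tokens.count(term)                            # TF(t, d): raw count
    df = sum(1 for toks in corpus_tokens if term in toks)  # DF(t)
    if df == 0:
        return 0.0  # term never occurs; avoid division by zero
    n_docs = len(corpus_tokens)                            # N
    return tf * math.log(n_docs / df)

corpus = [
    "the quantum computer".split(),
    "the meeting in january".split(),
    "the president of the council".split(),
]
print(tf_idf("the", corpus[0], corpus))      # appears in every document -> 0.0
print(tf_idf("quantum", corpus[0], corpus))  # appears in one document -> log(3)
```

Library implementations such as scikit-learn's `TfidfVectorizer` apply a smoothed IDF and L2-normalize each row by default, so their values differ from this plain formula.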


In [41]:
# Load the ParlaMint dataset (tab-separated file)
df_parlamint = pd.read_csv("../../materials/parlamint/parlamint-it-is-2022.txt", sep="\t")
# Keep a small subset for quick experiments
df_parlamint_subset = df_parlamint.head(1000).copy(deep=True)

# Group sentences by utterance (= Parent_ID) and join them into one text per utterance
df_parlamint_grouped = (df_parlamint.groupby(["Parent_ID"])["Text"]
                        .apply(lambda s: " ".join(s))
                        .reset_index(name="utterance_text"))

In [47]:
df_parlamint.head(5)

| | ID | Parent_ID | Text |
|---|---|---|---|
| 0 | ParlaMint-IS_2022-01-17-20.seg2.1 | ParlaMint-IS_2022-01-17-20.u1 | President of the United States reports: |
| 1 | ParlaMint-IS_2022-01-17-20.seg3.1 | ParlaMint-IS_2022-01-17-20.u1 | I have decided, according to the proposal of t... |
| 2 | ParlaMint-IS_2022-01-17-20.seg4.1 | ParlaMint-IS_2022-01-17-20.u1 | Arrange sites, January 11th, 2022. |
| 3 | ParlaMint-IS_2022-01-17-20.seg6.1 | ParlaMint-IS_2022-01-17-20.u1 | Katrín Jakobsdóttir's daughter. |
| 4 | ParlaMint-IS_2022-01-17-20.seg7.1 | ParlaMint-IS_2022-01-17-20.u1 | Presidential Letters for a meeting of the Gene... |


In [46]:
df_parlamint_grouped.head(5)

| | Parent_ID | utterance_text |
|---|---|---|
| 0 | ParlaMint-IS_2022-01-17-20.u1 | President of the United States reports: I have... |
| 1 | ParlaMint-IS_2022-01-17-20.u10 | Before the weekend, an article by Stefánssonar... |
| 2 | ParlaMint-IS_2022-01-17-20.u11 | I read this decision in Perconte, which is not... |
| 3 | ParlaMint-IS_2022-01-17-20.u12 | In fact, this is shown in the letter quoted by... |
| 4 | ParlaMint-IS_2022-01-17-20.u13 | Yes, that's right. That's right. A senator who... |


In [48]:
# Take a single utterance from the dataset (all sentences sharing one Parent_ID)
sample_utterance = df_parlamint[df_parlamint["Parent_ID"] == "ParlaMint-IS_2022-01-17-20.u1"]["Text"]
print("\n".join(sample_utterance))

President of the United States reports:
I have decided, according to the proposal of the prime minister, that the Council should meet for an extended meeting on Monday, January 17, 2022 p.m. 3:00.
Arrange sites, January 11th, 2022.
Katrín Jakobsdóttir's daughter.
Presidential Letters for a meeting of the General Assembly for a subsequent meeting on January 17, 2022
I'd like to use this opportunity here after reading this letter and offer the highest. President and w. Senators welcome to New Year's Parliamentary Conferences.


In [50]:
from src.models.models import CountVectorizerEmbedder

# Adding just the utterance sample as vocabulary
cv_model = CountVectorizerEmbedder(vocabulary=sample_utterance, max_features=5, stop_words='english',
                                   ngram_range=(1, 2))

cv_embeddings = cv_model.embed(sample_utterance)
print(f"Number features: {len(cv_model.embedding_model.get_feature_names_out())}", cv_model.embedding_model.get_feature_names_out())
print(f"Shape embedding array: {cv_embeddings.toarray().shape}")
df_cv_output = pd.DataFrame(columns=cv_model.embedding_model.get_feature_names_out(), data=cv_embeddings.toarray())
df_cv_output

Call 'transform' only...
Number features: 5 ['17 2022' '2022' 'january' 'january 17' 'meeting']
Shape embedding array: (6, 5)


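| | 17 2022 | 2022 | january | january 17 | meeting |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 | 2 |
| 5 | 0 | 0 | 0 | 0 | 0 |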


In [51]:
from src.models.models import TfIdfEmbedder

# Adding just the utterance sample as vocabulary
tfidf_model = TfIdfEmbedder(vocabulary=sample_utterance, max_features=5, stop_words='english')
tfidf_embeddings = tfidf_model.embed(sample_utterance)
print(f"Number features: {len(tfidf_model.embedding_model.get_feature_names_out())}", tfidf_model.embedding_model.get_feature_names_out())
print(f"Shape embedding array: {tfidf_embeddings.toarray().shape}")
df_tfidf_output = pd.DataFrame(columns=tfidf_model.embedding_model.get_feature_names_out(), data=tfidf_embeddings.toarray())
df_tfidf_output

Call 'transform' only...
Number features: 5 ['17' '2022' 'january' 'meeting' 'president']
Shape embedding array: (6, 5)


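| | 17 | 2022 | january | meeting | president |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.540298 | 0.456156 | 0.456156 | 0.540298 | 0.0 |
| 2 | 0.0 | 0.707107 | 0.707107 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.394497 | 0.333062 | 0.333062 | 0.788994 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |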


## Dense Text Embeddings

📌 **Definition:**
Dense embeddings represent words, sentences, or documents as **low-dimensional, dense vectors**
where similar meanings are **close together in vector space**.

---

### 🟢 Characteristics
- **Low-dimensional** (e.g., 100–1,536 dimensions, not vocab-sized)
- **Dense representation**: most values ≠ 0
- Captures **semantic meaning** and context
- Learned from data via **neural networks**

---

### 🟠️ Brief History
- **Word2Vec (2013)** – First widely used dense word embeddings (Mikolov et al.)
- **GloVe (2014)** – Global Vectors for word representation
- **FastText (2016)** – Adds subword information for better handling of rare words
- **ELMo (2018)** – Contextual word embeddings
- **BERT (2018)** – Contextual token embeddings from a bidirectional Transformer encoder
- **OpenAI / Modern Embeddings (2020s)** – High-quality sentence/document embeddings (e.g., `text-embedding-3-large`)

---

### 🔴 Why It’s Better
- Reduces dimensionality dramatically
- Learns **semantic relationships**
- Powers modern **search, recommendation, and AI assistants**


## Word2Vec: CBOW & Skip-Gram Overview

### 🟢 What is Word2Vec?
- A **shallow, two-layer neural network** that learns word embeddings from context.
- Maps words to **dense vectors** in a continuous space; similar words are **close together**.
- Introduced by Mikolov et al., **2013**.

---

### 🟠️ Architectures

#### 🔹 Continuous Bag-of-Words (CBOW)
- Predicts the **center word** given its context words.
- Fast to train; works well for **frequent words**.

#### 🔹 Skip-Gram
- Predicts **context words** given a center word.
- Performs better for **rare words**, large datasets.
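
The difference between the two architectures is easiest to see in the training examples they consume. The sketch below builds both kinds from one sentence; the helper name `training_examples`, the window size, and the sample sentence are assumptions for illustration, and no network is trained here:

```python
def training_examples(tokens, window=2):
    """Build CBOW and Skip-Gram training examples from one token sequence."""
    cbow, skip_gram = [], []
    for i, center in enumerate(tokens):
        # Context = up to `window` tokens on each side of the center word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # CBOW: (all context words) -> center word
        cbow.append((context, center))
        # Skip-Gram: center word -> each context word separately
        skip_gram.extend((center, ctx) for ctx in context)
    return cbow, skip_gram

tokens = "the council meets on monday".split()
cbow, skip_gram = training_examples(tokens, window=1)
print(cbow[1])        # (['the', 'meets'], 'council')
print(skip_gram[:3])  # [('the', 'council'), ('council', 'the'), ('council', 'meets')]
```

CBOW yields one example per position, while Skip-Gram yields one example per (center, context) pair, so a rare center word still contributes several updates.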

---

### 🔵 Why It Matters
- Captures **semantic relationships**:
  `king - man + woman ≈ queen`
- Major leap from sparse (BoW/TF-IDF) to **dense, meaningful embeddings**.


## Word2Vec: Training & Visual Intuition

### 🟢 Training Workflow
1. **One-hot encoding** for words.
2. **Hidden layer** = embedding lookup table.
3. **Output layer** predicts context words (softmax, usually approximated with negative sampling).
4. Final **hidden layer weights** = embeddings.
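
Steps 1, 2 and 4 amount to a table lookup: multiplying a one-hot vector by the hidden-layer weight matrix selects exactly one row. A minimal sketch in plain Python, where the four-word vocabulary and the weight values are made up for illustration and not trained:

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vector, matrix):
    """Row vector (length V) times matrix (V x D) -> vector of length D."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

vocab = ["king", "queen", "man", "woman"]
# Hypothetical hidden-layer weights: one row per word, 3 dimensions each.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

idx = vocab.index("queen")
hidden = vec_times_matrix(one_hot(idx, len(vocab)), W)
print(hidden)            # the hidden activation for "queen"
print(hidden == W[idx])  # identical to row `idx` of W
```

This is why the trained hidden-layer weights can be used directly as the word embeddings.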

---

### 🟠 Visual Intuition

<img src="https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/word2vec_diagrams.png" alt="Word2Vec Architecture" style="width:40%;">

- Diagram shows the Skip-Gram model predicting context words.
- Only the **embedding layer weights** are retained.

---

### Reference
- Israel G. (2017). *Word2Vec Explained*.
  [https://israelg99.github.io/2017-03-23-Word2Vec-Explained/](https://israelg99.github.io/2017-03-23-Word2Vec-Explained/)


## BERT: Contextual Embeddings

### 🟢 What is BERT?
- **Bidirectional Encoder Representations from Transformers** (2018, Google AI).
- Uses **Transformer architecture** to create **contextual word embeddings**:
  - Each word’s vector depends on **all surrounding words** (left & right context).
- Trained on **masked language modeling** and **next sentence prediction** tasks.

---

### 🟠 Key Innovations
- **Bidirectional**: Attends to left and right context jointly, unlike Word2Vec/GloVe, which assign each word a single static vector.
- **Transformer encoder layers** with self-attention result in rich, deep embeddings.
- **Contextualization**: Same word gets **different vectors** depending on context
  (“bank” in “river bank” vs. “bank account”).

---

### 🔵 Visual Intuition

<img src="images/bert_embeddings.png" alt="BERT Transformer" style="width:40%;">

---

### Reference
- Devlin et al. (2018). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.*
  [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805)


## Sentence Transformers Recap & Training

### 🟢 Key Takeaways
- **Sparse vectors** (BoW, TF-IDF):
  - One dimension per word, mostly zeros
  - No deep semantic meaning
- **Dense embeddings** (Word2Vec, BERT, SBERT):
  - Low-dimensional, rich semantic context
  - Words/sentences cluster by meaning

---

### 🟠 How SBERT is Trained
- **Backbone:** Pretrained BERT or RoBERTa encoders
- **Siamese/Triplet Network Architecture:**
  - Encodes two or three sentences **independently** into embeddings
  - Trains to minimize distance for similar sentences and maximize for dissimilar ones
- **Training Objectives ([Loss Overview](https://sbert.net/docs/sentence_transformer/loss_overview.html)):**
  - **Contrastive loss** (distance-based similarity)
  - **Softmax classification loss** on Natural Language Inference data (entailment, contradiction, neutral)
  - **MultipleNegativesRankingLoss** for retrieval tasks
- **Result:**
  - Embeddings suitable for **cosine similarity** → semantic search, clustering, recommendations
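
To make the "minimize distance for similar, maximize for dissimilar" idea concrete, here is a sketch of a contrastive loss for a single sentence pair in plain Python. It follows the common textbook form with cosine distance; the margin value and the toy 2-D vectors are assumptions, and real training would run over batches in a deep-learning framework and backpropagate into the encoder:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def contrastive_loss(emb_a, emb_b, label, margin=0.5):
    """label = 1 for a similar pair, 0 for a dissimilar pair."""
    d = cosine_distance(emb_a, emb_b)
    # Similar pairs are penalized for being far apart; dissimilar pairs
    # are penalized only while they are closer than the margin.
    return 0.5 * (label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2)

same_direction = ([1.0, 0.0], [1.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])

print(contrastive_loss(*same_direction, label=1))  # similar and close -> 0.0
print(contrastive_loss(*same_direction, label=0))  # dissimilar but close -> 0.125
print(contrastive_loss(*orthogonal, label=1))      # similar but far -> 0.5
print(contrastive_loss(*orthogonal, label=0))      # dissimilar and far -> 0.0
```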

---

### Reference
- Reimers & Gurevych (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*
  [https://arxiv.org/abs/1908.10084](https://arxiv.org/abs/1908.10084)
- SentenceTransformers Documentation [https://www.sbert.net/](https://www.sbert.net/)


In [53]:
import pandas as pd
from sentence_transformers import SentenceTransformer
st_model_small = SentenceTransformer('all-minilm-l6-v2')

sample_string = "I really like this summer school!"

sample_string_embedding = st_model_small.encode(sample_string)
df = pd.DataFrame({
    "text": [sample_string],
    "embedding": [sample_string_embedding]
})
df

| | text | embedding |
|---|---|---|
| 0 | I really like this summer school! | [-0.057146158, -0.053543467, 0.045457065, 0.01... |


## Distance & Similarity Measures (Brief Intro)

### 🟢 Why It Matters
- Compare embeddings → find **semantic similarity** between words, sentences, or documents.
- Core to **semantic search, clustering, and topic modeling**.

---

### 🟠 Common Measures

- 🔹 **Dot Product**
  Unnormalized similarity; sensitive to magnitude:
  $$
  a \cdot b = \sum a_i b_i
  $$

- 🔹 **Cosine Similarity**
  Measures **angle** between vectors (ignores magnitude):
  $$
  \text{cosine\_sim}(a, b) = \frac{a \cdot b}{\|a\|\|b\|}
  $$

- 🔹 **Euclidean Distance**
  Straight-line distance in vector space:
  $$
  d(a, b) = \sqrt{\sum (a_i - b_i)^2}
  $$

- 🔹 **Manhattan (L1) Distance**
  Sum of absolute differences:
  $$
  d_{\text{L1}}(a, b) = \sum |a_i - b_i|
  $$
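
The four measures above translate almost one-to-one into code. A minimal sketch in plain Python, where the function names and the two sample vectors are chosen for illustration (in practice a vectorized library would be used):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(dot(a, b))                 # 28.0
print(cosine_similarity(a, b))   # approximately 1.0 (same direction)
print(euclidean_distance(a, b))  # sqrt(14), about 3.74
print(manhattan_distance(a, b))  # 6.0
```

The example shows why the choice matters: `a` and `b` point in the same direction, so cosine similarity treats them as identical, while the dot product and both distances are affected by the difference in magnitude.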


## Chunking Techniques for Embeddings

### 🟢 Why Chunking?
- Long documents exceed model token limits (e.g., BERT ~512 tokens).
- Splitting text into **manageable chunks** improves:
  - ✅ Embedding quality
  - ✅ Retrieval accuracy
  - ✅ Context management in downstream tasks
- 🎯 **Goal of Good Chunking:**
  Create chunks that are **small enough** to fit model limits
  but **large enough** to retain full semantic meaning.

---

### 🟠 Common Techniques

1. **Fixed-Length Chunking**
   - Split text into chunks of `N` tokens/words.
   - Simple, fast, but can cut off sentences mid-way.

2. **Sentence-Based Chunking**
   - Split by sentence boundaries (NLTK, spaCy).
   - Better for readability, semantic grouping.

3. **Paragraph-Based Chunking**
   - Keep natural paragraph structure.
   - Good for preserving context, but chunk sizes vary.

4. **Sliding Window / Overlapping Chunks**
   - Add overlap between chunks (e.g., 50 tokens).
   - Prevents loss of context between splits.

5. **Semantic Chunking**
   - Use topic segmentation or embeddings to find boundaries.
   - Most accurate, but computationally heavier.
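
Techniques 1 and 4 can be combined in a few lines. The sketch below splits a token list into fixed-length chunks with an optional overlap; the helper name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for model tokens, which a real pipeline would obtain from the model's tokenizer:

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into fixed-length chunks with optional overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks

words = "the council should meet for an extended meeting on monday".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

With `overlap=1`, the last word of each chunk is repeated as the first word of the next, so context that straddles a boundary is present in both chunks.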

---

💡 **Choose technique based on document length, model limits, and retrieval needs.**


# 🏋️ Exercise: Simple Embedding Use Case

In [55]:
# Encode utterance-wise dataset
df_parlamint_embeddings_per_utterance = st_model_small.encode(df_parlamint_grouped["utterance_text"].to_list(),
                                                     show_progress_bar=True)

# Encode sentence-wise dataset
df_parlamint_embeddings_per_sentence = st_model_small.encode(df_parlamint["Text"].to_list(), show_progress_bar=True)

Batches: 100%|██████████| 432/432 [00:10<00:00, 40.07it/s] 
Batches: 100%|██████████| 5018/5018 [00:33<00:00, 148.66it/s]


In [58]:
df_parlamint_grouped["embedding"] = list(df_parlamint_embeddings_per_utterance)
df_parlamint["embedding"] = list(df_parlamint_embeddings_per_sentence)

In [67]:
import numpy as np
from sentence_transformers import util

question = "What is the government policy on climate change?"
# question = "America?"
k = 5  # choose how many results you want

# 1. Embed the question
question_embedding = st_model_small.encode(question)

# 2. Compute cosine similarities
cosine_similarities = util.cos_sim(question_embedding, df_parlamint_embeddings_per_utterance)[0].cpu().numpy()

# 3. Get indices of top-k most similar utterances
top_k_idx = np.argsort(cosine_similarities)[::-1][:k]

# 4. Retrieve the top-k utterances and their similarity scores
for idx in top_k_idx:
    text = df_parlamint_grouped.iloc[idx]["utterance_text"]
    score = cosine_similarities[idx]
    print(f"Score: {score:.4f} | Utterance: {text}\n")


Score: 0.5964 | Utterance: After listening to the highest. Minister, both today and yesterday, I get a little bit of the feeling that he looks at himself and his Ministry more like an observer than a doer when it comes to reducing greenhouse gas emissions. It is best that Ministers burn for more and larger activations, but, as the energy manager has noted, the energy flows directly into the energy exchange. It takes a very clear policy, but it needs a whole plan to make sure it does. That's why it hurts to the top. Ministers will not give us very clear answers on how Iceland's national target of climate change will be updated, when it will happen, and whether the government's climate programme of action will be reviewed and how these updated targets will appear in government policy and measures at all times. But I hope it reaches the highest. Minister to review it better afterwards. Last night we talked about the bus and the electric car truck. It turned out that the government still a

## Putting It All Together

### 🟢 The Journey So Far
- **Sparse Vectors (BoW, TF-IDF):**
  High-dimensional, simple counts, no semantics
- **Dense Word Embeddings (Word2Vec, GloVe):**
  Compact vectors capturing basic word meaning
- **Contextual Models (BERT):**
  Token embeddings adapt to context
- **Sentence Transformers (SBERT):**
  Sentence-level semantic embeddings for similarity & search
- **Chunking Strategies:**
  Break long texts into meaningful, model-friendly pieces

---

### 🟠 Why It Matters
- Transform raw text into **meaningful numerical representations**
- Enable **semantic search, clustering, Q&A, recommendations**
- Foundation for **modern NLP pipelines & AI assistants**

---

### 🔵 Key Takeaway
A well-designed embedding pipeline =
**Chunking + Contextual Models + Smart Similarity Metrics**
→ Powerful, scalable text understanding!


# 🏋️ Exercises: Hands-On with Embeddings

### 🟠 Generate embeddings for the ParlaMint (sub) dataset
- **Experiment with different embedding techniques:**
  - Compare dense vs. sparse techniques with respect to (i) vector dimensionality and (ii) semantic similarity (are similar texts actually closer together?)
  - Suggested algorithms: [BERTopic Models](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) or [Hugging Face (general)](https://huggingface.co/models?other=embeddings&sort=trending) or [Hugging Face (sentence transformers)](https://huggingface.co/sentence-transformers/models)
- **Consider different chunking techniques:**
  - Sentence-based vs. utterance-level vs. ??
- **Save them as pickle file(s):** `df_dataset.to_pickle("<path_and_filename>.pkl")`

### 🟠 Generate embeddings for the HSA (sub) dataset
- **Same tasks as for the ParlaMint dataset**

### 🟠 Retrieval: Play around with embeddings and similarity retrieval
- **Write queries:**
  - Search for valid topics, write queries, and manually evaluate the results
- **Consider different chunking techniques:**
  - Sentence-based vs. utterance-level
  - Are topics semantically better captured at sentence level or at utterance level?
- **Utilize different embedding models for retrieval:**
  - Sparse vs dense embeddings
  - Experiment with different similarity scores