<!-- Title Slide -->

<table>
<tr>
<td style="width:40%; vertical-align:middle; text-align:center;">
  <img src="https://maartengr.github.io/BERTopic/logo.png" alt="Topic Modeling Illustration" width="300"/>
</td>
<td style="width:60%; vertical-align:middle; padding-left:20px;">

  # 📝 Topic Modeling with BERTopic
  ## Session: Text Embeddings

  <br>
  <br>
  <span style="font-size:1.2em; color:gray;">
    Michael Jantscher · TU Graz · Know Center Research GmbH
  </span>

</td>
</tr>
</table>


# 👋 Michael Jantscher

<img src="https://dhgraz.github.io/clariah2025-dse-ml/images/artists/profil_michael.jpg" alt="Michael Jantscher" width="250"/>

**PhD Student** - TU Graz <br>
**Researcher** - Know Center Research GmbH

---

### 🧠 Focus Areas
- Natural Language Processing (NLP) in medical & clinical domains
- Causal reasoning in healthcare and (neuro)radiology
- Agentic AI systems for decision support and research workflows


# 📌 What Is a Text Embedding Vector? (Formal Definition)
* A numerical representation of text (words, sentences, or documents) in a multi-dimensional vector space
* Captures meaning and context
* Semantically similar words/sentences/documents -> vectors closer together

In [51]:
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load a small pre-trained sentence-transformers model
st_model_small = SentenceTransformer('all-MiniLM-L6-v2')

sample_string = "I really like this summer school!"

# encode() maps the string to a dense embedding vector
sample_string_embedding = st_model_small.encode(sample_string)

# Show the text next to its embedding
df = pd.DataFrame({
    "text": [sample_string],
    "embedding": [sample_string_embedding]
})
df

| | text | embedding |
|---|---|---|
| 0 | I really like this summer school! | [-0.057146158, -0.053543467, 0.045457065, 0.01... |
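
One common way to make "vectors closer together" concrete is cosine similarity. The sketch below is only an illustration: the 3-dimensional vectors are hand-picked toy values, not output of the model above, and the sentences other than the sample string are made up. With a real model, the vectors returned by `encode` would take their place.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors (assumption: NOT real model output).
toy_embeddings = {
    "I really like this summer school!": [0.9, 0.1, 0.2],
    "This summer school is great.": [0.8, 0.2, 0.3],
    "The train to Graz was delayed.": [0.1, 0.9, 0.1],
}

anchor = "I really like this summer school!"

# Compare the anchor sentence with every other sentence.
similarities = {
    text: cosine_similarity(toy_embeddings[anchor], vec)
    for text, vec in toy_embeddings.items()
    if text != anchor
}

for text, score in similarities.items():
    print(f"{score:.3f}  {text}")
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are close to orthogonal.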


# 💡 Use Cases
* **Search & Retrieval:** How similar is a query to a document in your database?
* Recommendation Systems
* Sentiment Analysis
* Clustering & Topic Modeling
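
For the Search & Retrieval use case, a minimal sketch could rank documents by their cosine similarity to a query vector. Everything here is hypothetical: the document ids, the helper `rank_documents`, and the toy vectors are made up for illustration and would be replaced by real embeddings in practice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy document embeddings (assumption: hand-picked, not model output).
doc_vecs = {
    "doc_radiology": [0.9, 0.1, 0.0],
    "doc_cooking": [0.0, 0.2, 0.9],
    "doc_clinical_nlp": [0.7, 0.6, 0.1],
}

# Toy query embedding, also hand-picked.
query_vec = [0.8, 0.3, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{score:.3f}  {doc_id}")
```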

# 🦖 Pre-Embedding Era
* Sparse and very large vectors (e.g. bag-of-words: one dimension per vocabulary word)
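
To see why such vectors are sparse and large, here is a minimal bag-of-words sketch. The three-sentence corpus and the helper `bag_of_words` are made up for illustration. Each vector has one dimension per vocabulary word, so its length grows with the vocabulary while most entries stay zero.

```python
# Tiny illustrative corpus (assumption: made up for this sketch).
corpus = [
    "I really like this summer school",
    "this summer school is about topic modeling",
    "embeddings capture meaning and context",
]

# Vocabulary: one vector dimension per distinct lowercase token.
vocabulary = sorted({token for doc in corpus for token in doc.lower().split()})
index = {token: i for i, token in enumerate(vocabulary)}


def bag_of_words(text):
    """Count how often each vocabulary token occurs in the text."""
    vec = [0] * len(vocabulary)
    for token in text.lower().split():
        if token in index:
            vec[index[token]] += 1
    return vec


vec = bag_of_words(corpus[0])
nonzero = sum(1 for value in vec if value != 0)
print(f"dimensions: {len(vec)}, non-zero entries: {nonzero}")
```

On a realistic corpus the vocabulary easily reaches tens of thousands of words, so the share of zero entries becomes far larger than in this toy case.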

### Select different sentence-transformer embedding models
* https://huggingface.co/models?pipeline_tag=sentence-similarity

All generated embeddings are based on [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) <br>
You could also try a smaller model: [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

### Training
* What could training look like?
* What kind of data does it require?
* How does training differ between sparse and dense models?

### Interesting Tasks
* Find other promising embedding models (sparse and dense) and try them out
    * What about Doc2Vec, word2vec, bag-of-words, etc.? See the [gensim Doc2Vec tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)
* Calculate embedding similarities as shown [here](https://huggingface.co/sentence-transformers)
    * Similarity scores may differ depending on the embedder (sparse vs. dense)
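
As a starting point for the sparse side of this comparison, the sketch below computes cosine similarity on plain token counts. The helper `sparse_cosine` and the example sentences are hypothetical. Because the score depends only on shared tokens, a paraphrase with no words in common scores zero, while a sentence with the same words in a different order scores as identical. Running the same pairs through a dense model would be the comparison to make for this task.

```python
import math
from collections import Counter


def sparse_cosine(text_a, text_b):
    """Cosine similarity between token-count vectors stored as Counters."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same meaning, no shared tokens.
paraphrase_score = sparse_cosine(
    "the physician examined the patient",
    "a doctor checked him",
)

# Different meaning, identical bag of tokens.
overlap_score = sparse_cosine(
    "the physician examined the patient",
    "the patient examined the physician",
)

print(f"paraphrase: {paraphrase_score:.3f}, reordered: {overlap_score:.3f}")
```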
