## Document Analysis: Computational Methods - Summer Term 2025
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb

# Exercise 07

### You will learn about:

- Language Models
- Contextualized Word Embeddings

---


## Task 1: Terminology and Understanding

Give a concise answer to each of the following questions.

1. What is fine-tuning?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) </font>

2. What is the difference between static and contextualized word embeddings?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) </font>

3. Explain the purpose of the [SEP] token in BERT.

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) </font>

---

## Task 2 - Contextualized Word Embeddings
To avoid issues due to package inconsistencies, please i

### Part 1:
Use a pre-trained BERT model to extract word embeddings of the word "bank" from the following seven sentences. 
To extract the embeddings, identify the token that corresponds to "bank" in each sentence 
(keep in mind that token != word), then extract the hidden states for this token from the model's final four 
hidden layers (this is a tensor of dimension 4 x 768). 
Average these four states to obtain a single 768-dimensional embedding for the token "bank" in each sentence. 


1. It is good to always have money in the bank.
2. A bank of fog came drifting in from the sea.
3. Sadly, my piggy bank is empty.
4. Gold frequently deposits on the bank of a river.
5. Vast banks of clouds were visible on the horizon.
6. The First Bank of the United States was chartered in 1791.
7. The flight instructor told her to bank hard left.
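
The extraction step can be sketched independently of the model. The following is a minimal illustration, not the reference solution. Nested Python lists stand in for the model's hidden states, the token list is hand-written rather than actual tokenizer output, and the helper names `find_token_index` and `average_last_layers` are hypothetical. With the real model, the per-layer states would typically be obtained by calling the model with `output_hidden_states=True`.

```python
def find_token_index(tokens, target):
    """Return the position of the first token equal to `target`."""
    for i, tok in enumerate(tokens):
        if tok == target:
            return i
    raise ValueError(f"token {target!r} not found in {tokens}")


def average_last_layers(hidden_states, token_index, n_layers=4):
    """Average one token's vectors over the last `n_layers` layers.

    `hidden_states[layer][position]` is the vector of the token at
    `position` in `layer` (a toy stand-in for the model output).
    """
    selected = [layer[token_index] for layer in hidden_states[-n_layers:]]
    dim = len(selected[0])
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(dim)]


# Hand-written token list (assumption: NOT real tokenizer output).
tokens = ["[CLS]", "my", "bank", "is", "empty", "[SEP]"]

# Toy hidden states: 5 layers x 6 tokens x 2 dimensions, where every
# vector in layer k is [k, 10 * k]. Real BERT vectors have 768 dimensions.
hidden_states = [[[float(k), 10.0 * k] for _ in tokens] for k in range(5)]

idx = find_token_index(tokens, "bank")
embedding = average_last_layers(hidden_states, idx)
print(idx, embedding)  # 2 [2.5, 25.0]
```

Note that an exact string match only finds the token if the tokenizer emits "bank" as a single token. Inflected or capitalized forms in the sentences may be tokenized differently, so the actual tokenizer output should be inspected for each sentence.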



In [1]:
# We use bert-base-uncased model from HuggingFace (https://huggingface.co/bert-base-uncased).
from transformers import BertTokenizer, BertModel

# BERT has its own specific tokenizer, you need to use it to tokenize the given sentences
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# And this is the language model; it consists of 12 layers. 
# You need to extract the hidden states for the "bank" token from the last four layers and then average them.
model = BertModel.from_pretrained("bert-base-uncased")

# TODO ADD YOUR CODE HERE

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



---

### Part 2:

Visualize your results by computing a suitable 2-dimensional projection of the seven embeddings (e.g., by using PCA, t-SNE, or UMAP). 
Make sure that each embedding is labeled clearly in the plot with the sentence number from which you extracted it.
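
As one possible way to obtain such a projection, the sketch below implements PCA directly with NumPy's singular value decomposition. It is an illustration under stated assumptions: the input is random stand-in data rather than real BERT embeddings, and `pca_2d` is a hypothetical helper name. The `PCA` class from scikit-learn would be an equally valid choice.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# Random stand-in data (assumption): 7 vectors with 768 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(7, 768))

points = pca_2d(fake_embeddings)
for sentence_id, (x, y) in enumerate(points, start=1):
    print(f"sentence {sentence_id}: ({x:.2f}, {y:.2f})")
```

Each row of `points` can then be drawn as one labeled point, for example with a matplotlib scatter plot that annotates every point with its sentence number.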


In [2]:
# You need to use **one** of the dimensionality reduction techniques:

# for PCA use:
# from sklearn.decomposition import PCA

# for t-SNE use:
# from sklearn.manifold import TSNE

# for UMAP first install the package: 
# pip install umap-learn
# import umap

# ADD YOUR CODE HERE


---

### Part 3:

Discuss your findings. Can you explain why some of the embeddings are more similar than others? Do some of the similarities make less sense than others?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) </font>

## Task 3 - Sentence Embeddings
You will analyze the following four sentences:
1. If a tree falls in a forest and no one is around to hear it, does it make a sound?
2. A tree is an abstract data model consisting of a root, internal nodes, leaves, and path branches.
3. In autumn, leaves will slowly fall from the branches towards the roots, covering the path below.
4. If a graph is a tree, all node pairs are connected by exactly one path which traverses the root node.

### Part 1:
Compute sentence embeddings of the given four sentences with a pre-trained BERT model (use the same model as in Task 2). 
To retrieve sentence embeddings, extract the representation of the [CLS] token from the final layer (i.e., layer 12).
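
The indexing involved can be illustrated with toy data. This sketch assumes the hidden states are laid out as in Hugging Face models called with `output_hidden_states=True`, where the first entry is the embedding output and each following entry is one layer, so a 12-layer model yields 13 entries and layer 12 is the last one. The nested lists and the helper name `cls_embedding` are stand-ins for illustration only.

```python
def cls_embedding(hidden_states):
    """Return the vector at position 0 ([CLS]) of the final layer."""
    final_layer = hidden_states[-1]
    return final_layer[0]


# Hand-written token list (assumption: NOT real tokenizer output).
tokens = ["[CLS]", "a", "tree", "[SEP]"]

# 13 entries: embedding output plus 12 layers. Each toy vector is
# [layer index, token position] so the selected entry is easy to verify.
hidden_states = [
    [[float(layer), float(pos)] for pos in range(len(tokens))]
    for layer in range(13)
]

sentence_vector = cls_embedding(hidden_states)
print(sentence_vector)  # [12.0, 0.0]
```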


In [3]:
# TODO - ADD YOUR CODE HERE



### Part 2:
Compute sentence embeddings of the same sentences by averaging the word2vec word embeddings of all words in the sentence (similar to Assignment 6).
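
The averaging itself can be sketched with a toy lookup table. The dictionary `toy_vectors` is an invented stand-in for a trained word2vec model, and `average_word_vectors` is a hypothetical helper. Skipping words that are missing from the vocabulary is an assumption of this sketch, not a requirement of the task.

```python
def average_word_vectors(words, vectors):
    """Average the vectors of all words that exist in `vectors`."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Invented 2-dimensional vectors (assumption: not real word2vec values).
toy_vectors = {
    "a": [0.0, 2.0],
    "tree": [4.0, 0.0],
    "falls": [2.0, 4.0],
}

# "quietly" has no vector and is skipped.
words = "a tree falls quietly".split()
sentence_vector = average_word_vectors(words, toy_vectors)
print(sentence_vector)  # [2.0, 2.0]
```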

In [4]:
# TODO - ADD YOUR CODE HERE



### Part 3: 
Repeat the process described in Part 2, but remove stop words before averaging the word2vec embeddings to create the sentence representation.
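
A minimal sketch of the filtering step, again with invented data: `STOP_WORDS` is a tiny hand-picked set used only for illustration (a real solution would more likely take its list from an NLP library), and `toy_vectors`, `remove_stop_words`, and `average_word_vectors` are hypothetical names. The sketch splits on whitespace, so punctuation handling is left out.

```python
# Tiny hand-picked stop word set (assumption, for illustration only).
STOP_WORDS = {"if", "a", "in", "and", "no", "is", "to", "it"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in `stop_words` (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]


def average_word_vectors(words, vectors):
    """Average the vectors of all words that exist in `vectors`."""
    known = [vectors[w.lower()] for w in words if w.lower() in vectors]
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Invented 2-dimensional vectors (assumption: not real word2vec values).
toy_vectors = {
    "a": [9.0, 9.0],
    "tree": [1.0, 0.0],
    "falls": [0.0, 1.0],
    "forest": [2.0, 2.0],
}

words = "If a tree falls in a forest".split()
content_words = remove_stop_words(words)

with_stop_words = average_word_vectors(words, toy_vectors)
without_stop_words = average_word_vectors(content_words, toy_vectors)
print(content_words)       # ['tree', 'falls', 'forest']
print(without_stop_words)  # [1.0, 1.0]
```

In this toy example the frequent word "a" dominates the average when stop words are kept, which is the effect the comparison in this part is meant to expose.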

In [5]:
# TODO - ADD YOUR CODE HERE



### Part 4:
Separately compute the pairwise cosine similarities between all four sentence embeddings in each model. 
That is, compute cosine similarities between 

(1) all sentence embeddings generated by BERT, 

(2) all sentence embeddings generated by word2vec with stop words, and 

(3) all sentence embeddings generated by word2vec without stop words 

(you **do not** need to compare the embeddings **between different models**!). 

Display your results in a table with five columns: SentenceID1, SentenceID2, cosine BERT, cosine w2v (with stop words), and cosine w2v-stopwords (stop words removed).
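
The pairwise computation can be sketched as follows. The vectors are invented 2-dimensional toy values, only two of the three models are shown to keep the example short, and `cosine` and `pairwise_rows` are hypothetical helper names. Four sentences yield six unordered pairs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_rows(embeddings_by_model):
    """Build one row per sentence pair: [id1, id2, cosine per model...]."""
    n = len(next(iter(embeddings_by_model.values())))
    rows = []
    for i, j in combinations(range(n), 2):
        row = [i + 1, j + 1]  # sentence IDs are 1-based
        for vectors in embeddings_by_model.values():
            row.append(cosine(vectors[i], vectors[j]))
        rows.append(row)
    return rows


# Invented toy embeddings (assumption): 4 sentences per model.
toy = {
    "cosine BERT": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    "cosine w2v": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]],
}

rows = pairwise_rows(toy)
print("SentenceID1\tSentenceID2\t" + "\t".join(toy))
for s1, s2, *cosines in rows:
    print(f"{s1}\t{s2}\t" + "\t".join(f"{c:.3f}" for c in cosines))
```

Similarities are only computed within a model, never across models, because each entry of the dictionary is processed on its own.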

In [6]:
# TODO - ADD YOUR CODE HERE




### Part 5:
Discuss the results. Which differences in sentence similarities make sense? Can you explain why?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) </font>


#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook files and all referenced files in there.
- log in to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. If you want to use publicly available code, please add a reference to the respective code snippet.
- Check that your code executes without errors, even after you have restarted the kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write your name and the name of your partner in the top section.