# Lab 1: Tokenisation and embeddings

In this lab, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will explore two key concepts: *tokenisation* and *embeddings*. Tokenisation splits text into smaller units such as words, subwords, or characters. Embeddings are dense, fixed-size vector representations of tokens in a continuous space.

*Tasks you can choose for the oral exam are marked with the graduation cap üéì emoji.*

## Part 1: Tokenisation

In the first part of the lab, you will code and analyse a tokeniser based on the Byte Pair Encoding (BPE) algorithm.

### Utility functions

The BPE tokeniser transforms text into a list of integers representing tokens. As a warm-up, you will implement two utility functions on such lists. To simplify things, we define a shorthand for the type of pairs of integers:

In [22]:
from typing import Tuple

Pair = Tuple[int, int]

#### üéà Task 1.01: Counting pairs

Write a function that counts all occurrences of pairs of consecutive token IDs in a given list. The function should return a dictionary that maps each pair to its count. Skip counts that are zero.

In [23]:
def count(ids: list[int]):
    result = {}
    for i in range(len(ids) - 1):
        if (ids[i], ids[i + 1]) in result:
            result[(ids[i] , ids[i+1])] += 1
        else:
            result[(ids[i] , ids[i+1])] = 1
    return result

In [24]:
print(count([1, 2, 2, 3, 1, 2, 1]))

{(1, 2): 2, (2, 2): 1, (2, 3): 1, (3, 1): 1, (2, 1): 1}


#### üéà Task 1.02: Replacing pairs

Write a function that traverses a list of token IDs from left to right and replaces all occurrences of a specified pair of consecutive IDs by a new ID. The function should return the modified list.

In [25]:
def replace(ids: list[int], pair: Pair, new_id: int) -> list[int]:
    result = []
    i = 0
    while i < len(ids):
        if ids[i] == pair[0] and i < len(ids) - 1 and ids[i + 1] == pair[1]:
            result.append(new_id)
            i += 2
        else:
            result.append(ids[i])
            i += 1
    return result

### Encoding and decoding

The next cell contains the core code for the tokeniser in the form of a class `Tokenizer`. This class implements two methods: `encode()` converts an input text to a list of token IDs by exhaustively applying rules for merging pairs of consecutive IDs (stored in the dictionary `self.merges`), and `decode()` reverses this process. Note that the set of merge rules is initially empty; you will add rules in Task&nbsp;1.04.

In [26]:
class Tokenizer:
    def __init__(self):
        self.merges: dict[Pair, int] = {}
        self.vocab: dict[int, bytes] = {i: bytes([i]) for i in range(2**8)}

    def encode(self, text):
        ids: list[int] = list(text.encode("utf-8")) # String is converted into a list of integers "ab" ==> [97, 98]
        while True:
            counts:dict[Pair, int] = count(ids) # {(97, 98): 3, (98, 97): 2}
            mergeable_pairs: set[Pair] = counts.keys() & self.merges.keys() # Check for matches tokens, ex merges = {(97, 98): 257, (98, 97): 258}
            if len(mergeable_pairs) == 0:
                break
            to_merge: Pair = min(mergeable_pairs, key=self.merges.get)  # take the token with lower ID ==> (97, 98)
            ids: list[int] = replace(ids, to_merge, self.merges[to_merge])  # Replace the token with the merged token
        return ids

    def decode(self, ids):
        return b"".join((self.vocab[i] for i in ids)).decode("utf-8")

#### üéì Task 1.03: Encoding and decoding

Explain how the code implements the BPE algorithm. Use the following steps to check your understanding:

**Step&nbsp;1.** Annotate the attributes and methods of the `Tokenizer` class with their Python types. In particular, what is the type of `self.merges`? Use the `Pair` shorthand. What does a merge rule look like?

**Step&nbsp;2.** Explain how the implementation chooses which merge rule to apply. Provide an example that illustrates the logic. Construct the example such that you get a different result when you use `max()` instead of `min()`.

The self.merges return a tuple with id, this id will be used later as the tuple that will be selected based on (Min or Max).
In the comment I explained with an example of which tuple might be selected with the current code, in case of Max then (98, 97) will be selected since ID 258 is higher than ID 257
Bascially, any new token generated will be assigned with new ID that is equal to the length of the current vocabulary, and we use min is to ensure the tokenizer applies the rule that was learned first (oldest) making the encoding predictable.

### Training a tokeniser

Upon initialisation, a tokeniser has an empty set of merge rules. Your next task is to complete the BPE algorithm and write code to learn these merge rules from a text.

#### üéì Task 1.04: Training a tokeniser

Write a function that induces a BPE tokeniser from a given text. The function should take the text (a string) and a target vocabulary size as input and return the trained tokeniser.

In [27]:
def from_text(text: str, vocab_size: int) -> Tokenizer:
    tok = Tokenizer()
    ids = list(text.encode("utf-8"))
    while len(tok.vocab) < vocab_size:
        counts = count(ids)
        if not counts:
            break
        selected_pair = max(counts, key=counts.get)
        token_id = len(tok.vocab)
        tok.merges[selected_pair] = token_id
        tok.vocab[token_id] = tok.vocab[selected_pair[0]] + tok.vocab[selected_pair[1]]
        ids = replace(ids, selected_pair, token_id)
    return tok

To help you test your implementation, we provide three text files together with tokenisers trained on these files. Each text file contains the first 1&nbsp;million Unicode characters in a language-specific Wikipedia:

| Text file | Tokeniser file | Wikipedia |
|---|---|---|
| `wiki-en-1m.txt` | `wiki-en-1m.tok` | [Simple English](https://simple.wikipedia.org/) |
| `wiki-is-1m.txt` | `wiki-is-1m.tok` | [Icelandic](https://is.wikipedia.org/) |
| `wiki-sv-1m.txt` | `wiki-sv-1m.tok` | [Swedish](https://sv.wikipedia.org/) |

A tokeniser file consists of lines specifying merge rules. For example, the first line in the tokeniser file for Swedish is `101 114`, which expresses that this rule combines the token with ID 101 (`e`) and the token with ID 114 (`r`). The ID of the new token (`er`) is 256 plus the (zero-indexed) line number on which the rule is found. The following code saves a `Tokenizer` to a file with this format:

In [28]:
def save(tokenizer: Tokenizer, filename: str) -> None:
    with open(filename, "w") as f:
        for fst, snd in tokenizer.merges:
            print(f"{fst} {snd}", file=f)

In [29]:
with open('wiki-sv-1m.txt', 'r', encoding='utf-8') as f:
    text_content = f.read()

my_tok = from_text(text_content, 1024)

save(my_tok, 'my_wiki_swedish.tok')

To test your code, compare your saved tokeniser to the provided tokeniser using the `diff` tool.

**Note that training a tokeniser can take a few minutes.**

### Tokenisation quirks

The tokeniser is a key component of language models, as it defines the minimal chunks of text the model can ‚Äúsee‚Äù and work with. As you will see in this section, tokenisation is also responsible for several deficiencies and unexpected behaviours of language models.

One helpful tool for experimenting with tokenisers in language models is the web app [Tiktokenizer](https://tiktokenizer.vercel.app/). This app lets you play around with, among others, [`cl100k_base`](https://tiktokenizer.vercel.app/?model=cl100k_base), the tokeniser used in the free version of ChatGPT and OpenAI‚Äôs APIs, and [`o200k_base`](https://tiktokenizer.vercel.app/?model=o200k_base), used in GPT-4o.

#### üéì Task 1.05: Tokenisation quirks

Prompt [ChatGPT](https://chatgpt.com/) to reverse the letters in the following words:

```
creativecommons
MERCHANTABILITY
NSNotification
authentication
```

How many of these words come out right? What happens when you modify the prompt and explicitly disable ‚Äúthinking‚Äù and external tools? What could be the problem when words come out wrong? Generate ideas by inspecting the words in Tiktokenizer. Try to come up with other prompts that illustrate problems related to tokenisation.

The result I got:

`creativecommons ‚Üí smonivtaerceerc`  ‚ùå

`MERCHANTABILITY ‚Üí YTILIBAREDNOMMOC` ‚ùå

`NSNotification ‚Üí noitacifitoSN` ‚ùå

`authentication ‚Üí noitacitnehtua` ‚úÖ

My word `interpretability ‚Üí ytilibatrepretni `‚ùå

Only `authentication` was correctly reversed, the result was acheived without using (thinking nor external tools options).

I belive the reason behind it is that the reversed word is something unfamilar to the model for ex if we takea word `creativecommons` it may be divided into tokens like this `crea` `tive` `com` `mons` so the model is reversing these tokens not their individual characters.


### Tokenisation and multi-linguality

Many NLP systems and the tokenisers used with them are primarily trained on English data. In the next task, you will reflect on the effect this has when they are used to process non-English data.

The *context length* of a language model is the maximum number of preceding tokens the model can condition on when predicting the next token. This number is fixed and cannot be changed after training the model. For example, the context length of GPT-2 ([Radford et al., 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)) is 1,024.

While the context length of a language model is fixed, the amount of information that can be squeezed into this context length will depend on the tokeniser. Informally speaking, a model that needs more tokens to represent a given text cannot extract as much information from that text as one that needs fewer tokens.

#### üéì Task 1.06: Tokenisation and multi-linguality

Train a tokeniser on the English text file from Task&nbsp;1.04 and test it on the same text. How many tokens does it split the text into? Based on this, what is the expected number of Unicode characters of English text that can be fit into a context length of 1,024?

What do the numbers look like if you test the English tokeniser on the Icelandic text instead? What could explain the differences?

Interpreting the expected number of Unicode characters as a measure of representation efficiency, what do your results tell you about the efficiency of a language model primarily trained on English data when it is used to process non-English data? Why are these findings relevant?

In [30]:
with open('wiki-en-1m.txt', 'r', encoding='utf-8') as f:
    text_content_en = f.read()

my_tok_english = from_text(text_content_en, 1256)


In [31]:
with open('wiki-is-1m.txt', 'r', encoding='utf-8') as f:
    text_content_is = f.read()

encoded_text = my_tok_english.encode(text_content_is)
print(len(encoded_text))

739566


- The number of tokens is : `355650`

Let's check the number of characters per token.
`1000000/ 355650 = 2.81`

The number of Unicode characters of English text that can be fit into a context of length 1024 would be


`1024/ 2.81 = 364`

- Testing on Icelandic language:

The number of token is : `739566`

Let's check the number of characters per token.
`1000000/ 739566 = 1.35`

The number of Unicode characters of Icelandc text that can be fit into a context of length 1024 would be

`1024 / 1.35 = 758 `

My interpretation is that the English‚Äëtrained tokenizer is much more efficient for English than for Icelandic. For English, tokens often correspond to whole (sub)words and carry more meaning (‚âà2.81 characters/token). For Icelandic, many words are split into shorter, almost character‚Äëlevel tokens (‚âà1.35 characters/token), so for a 1,024‚Äëtoken context the model ‚Äúsees‚Äù fewer whole words and less semantic content in Icelandic than in English.


## Part 2: Embeddings

In the second part of the lab, you will explore embeddings. An embedding layer is a network component that assigns each item in a finite set of elements (often called a *vocabulary*) a fixed-size vector. At first, these vectors are filled with random values, but during training, they are adjusted to suit the task at hand.

### Bag-of-words classifier

To help you build an intuition for embeddings and the vector representations learned by them, we will use a simple bag-of-words text classifier. The core part of this classifier only takes a few lines of code:

In [32]:
import torch.nn as nn
import torch.nn.init as init

class Classifier(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, num_classes, use_kaiming: bool = False):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.linear = nn.Linear(embedding_dim, num_classes)
        if use_kaiming:
            # same init as nn.Linear uses
            init.kaiming_uniform_(self.embedding.weight, a=0, mode="fan_in", nonlinearity="relu")

    def forward(self, x):
        return self.linear(self.embedding(x).mean(dim=-2))

#### üéà Task 1.07: Bag-of-words classifier

Explain how the bag-of-words classifier works. How does the code match the diagram you saw in the lectures? Why is there only one `nn.Embedding`, while the diagram shows three embedding layers? What does the keyword argument `dim=-2` do?

It is a way for representing the words of a text by writing the number of occurences of each word. Each embedding layer represent a word to embed, so `num_embeddings` is basically the number of words to embed.

An embedding does not represent the number of occurrences of each word. Also, each embedding layer represents a mapping for the complete vocabulary, not only a single word.

The input `x` to the bag-of-words classifier is a text represented as a vector of token IDs. The first step is to look up the embedding vectors for the token IDs and then calculate their element-wise mean. All token IDs are mapped using the same embedding, which is shown in the diagram with dotted lines linking the three embedding layers ‚Äì thus, there is actually only one embedding layer. The keyword argument `dim=-2` ensures that the mean is calculated across the columns of all vectors, rather than within each single vector. The mean vector is passed through the final linear layer. Unlike what is shown in the diagram, there is no softmax at the end ‚Äì the classifier outputs unnormalised logits instead.

### Dataset

You will apply the classifier to a small dataset with Amazon customer reviews. This dataset is taken from [a much larger dataset](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) first described by [Blitzer et al. (2007)](https://aclanthology.org/P07-1056/).

The dataset contains whitespace-tokenised product reviews from two categories: cameras (`camera`) and music (`music`). Each review is additionally annotated for sentiment towards the product at hand: negative (`neg`) or positive (`pos`). The category and sentiment labels are prepended to the review. As an example, here is the first review from the training data:

```
music neg oh man , this sucks really bad . good thing nu-metal is dead . thrash metal is real metal , this is for posers
```

The next cell contains a custom [`Dataset`](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) class for the review dataset. To initialise an instance of this class, you specify the name of the file containing the reviews you want to load (`filename`) and which of the two labels you want to use (`label`): product category (0) or sentiment (1).

In [33]:
from torch.utils.data import Dataset


class ReviewDataset(Dataset):
    def __init__(self, filename: str, label: int = 0) -> None:
        with open(filename) as f:
            tokenized_lines = [line.split() for line in f]
        self.items = [(tokens[2:], tokens[label]) for tokens in tokenized_lines]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> tuple[list[str], str]:
        return self.items[idx]

### Vectoriser

To feed a review into the bag-of-words classifier, you first need to turn it into a vector of token IDs. Likewise, you need to convert the label (product category or sentiment) into an integer. The next cell contains a partially completed `ReviewVectoriser` class that handles this transformation.

In [34]:
from collections import Counter

import torch

# Type abbreviation for review‚Äìlabel pairs
Item = tuple[list[str], str]


class ReviewVectorizer:
    PAD = "[PAD]"
    UNK = "[UNK]"

    def __init__(self, dataset: ReviewDataset, n_vocab: int = 1024) -> None:
        # Unzip the dataset into reviews and labels
        reviews, labels = zip(*dataset)

        # Count the tokens and get the most common ones
        counter = Counter(t for r in reviews for t in r)
        most_common = [t for t, _ in counter.most_common(n_vocab - 2)] # -2 to reserve PAD and UNK

        # Create the token-to-index and label-to-index mappings
        self.t2i = {t: i for i, t in enumerate([self.PAD, self.UNK] + most_common)}
        self.l2i = {l: i for i, l in enumerate(sorted(set(labels)))}

    def __call__(self, items: list[Item]) -> tuple[torch.Tensor, torch.Tensor]:
        reviews, labels = zip(*items)
        longest_review = max(len(review) for review in reviews)
        X = []
        # Now we need to add the padding to smaller reviews

        for review in reviews:
            token_ids = []
            while len(review) < longest_review:
                review.append(self.t2i[self.PAD])
            for token in review:
                if token in self.t2i:
                    token_ids.append(self.t2i[token])
                else:
                    token_ids.append(self.t2i[self.UNK])
            X.append(token_ids)
        y = [self.l2i[label] for label in labels]
        return torch.tensor(X, dtype=torch.long), torch.tensor(y, dtype=torch.long)

A `ReviewVectoriser` maps tokens and labels to IDs using two Python dictionaries. These dictionaries are set up when the vectoriser is initialised and queried when the vectoriser is called on a batch of review‚Äìlabel pairs. They include IDs for two special tokens:

`[PAD]` (Padding): Reviews can have different lengths, but PyTorch requires all vectors in a batch to be the same size. To handle this, the vectoriser adds `[PAD]` tokens to the end of shorter reviews so they match the length of the longest review in the batch.

`[UNK]` (Unknown): If a review contains a token that is not in the token-to-ID dictionary, the vectoriser assigns it the ID of the `[UNK]` token instead of a regular ID.

#### üéì Task 1.08: Vectoriser

Explain and complete the code of the vectoriser. Follow these steps:

**Step&nbsp;1.** Explain how unzipping works. What are the types of `reviews` and `labels`?

**Step&nbsp;2.** Explain how the token-to-ID and label-to-ID mappings are constructed. How does the `most_common()` method deal with elements that occur equally often?

**Step&nbsp;3.** Complete the implementation of the `__call__()` method. This method should convert a list of $m$ review‚Äìlabel pairs into a pair $(X, y)$ where $X$ is a matrix containing the vectors with token IDs for the reviews, and $y$ is a vector containing the IDs of the corresponding labels.

- 1_ Unzipping means separating the passed iterable pairs into individual iterable of items for each, in this example we are separating the reviews and the labels. the types of reviews and labels: list of string for reviews, and string for labels

- 2_ First we count the number of tokens from each review, then we count the most common token by calling the most common method, this method return list of token count in descending order, in case of ties => selecting will be arbitrary (the order does not matter). The first 2 indeces are reserved for PAD and UNK

### Training the classifier

With the vectoriser completed, you are ready to train a classifier. More specifically, you can train two separate classifiers: one to predict the product category of a review, and one to predict the sentiment. The next cell contains a simple training loop that you can adapt for this purpose.

In [35]:
import torch.nn.functional as F

def train( label: int,
    filename: str = "reviews-train.txt",
    n_vocab: int = 1024,
    embedding_dim: int = 64,
    learning_rate: float = 0.001,
    batch_size: int = 16,
    num_of_epochs: int = 10,
    use_kaiming: bool = False,
    ):
    # The dataset used and the selected label
    # It reads text from the file and assigns a label (here, hard-coded to 0 for all or reads labels if they exist in the file format.
    dataset = ReviewDataset(filename, label=label)
    # VECTORIZER: create the vocabulary and a text-to-tensor function.
    # It will learn the words that exist, assigns an ID to each of these words, and, it defines how a review will be converted into a 1024 dimension vector in this case.
    # basically used to transform the raw text into a vector that the machine can use.
    processor = ReviewVectorizer(dataset, n_vocab)
    num_embeddings: int = n_vocab # the size of our vocabulary. It's the total number of unique words that the model can recognize and work with.
    num_classes: int = len(processor.l2i) # the number of classes the model is trying to predict
    model = Classifier(num_embeddings, embedding_dim, num_classes, use_kaiming=use_kaiming)  # The model used which is the bag-of-words classifier
    optimizer = torch.optim.Adam(model.parameters(), learning_rate)  # The optimzer Adam (It doesn't use the same learning rate. Instead, it adjusts the learning rate for each parameter.)
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size,
        shuffle=True,
        collate_fn=processor, # # use the vectorizer to process the batch before sending it to the model
    )
    for epoch in range(num_of_epochs):
        model.train()  # puts the model into training mode
        running_loss = 0
        for bx, by in data_loader:
            optimizer.zero_grad() # Reseting the gradient, this is useful to avoid having a prior information each iteration. without reseting the model wouldn't learn correctly.
            output = model(bx) # is the prediction of the model
            loss = F.cross_entropy(output, by) # it shows how far off the model's predictions are from the correct answers.
            loss.backward() # Backpropagation which is used to reduce the loss
            optimizer.step() # Takes the gradients and use it to update the model's paramters. ( this is the actual learning happen)
            running_loss += loss.item() # This is just accumulating the losses from each iteration then use them all to show the final loss.
        print(f"Epoch {epoch}, loss: {running_loss / len(data_loader):.4f}")
    return processor, model

#### üéì Task 1.09: Training loop

Explain the training loop. Follow these steps:

**Step&nbsp;1.** Go through the training loop line-by-line and add comments where you find it suitable. Your comments should be detailed enough for you to explain the main steps of the loop.

**Step&nbsp;2.** The training loop contains various hard-coded values like filename, learning rate, batch size, and epoch count. This makes the code less flexible. Revise the code so that you can specify these values using keyword arguments. Use the concrete values from the code as defaults.

#### üéà Task 1.10: Training the classifier

Adapt the next cell to train the classifier for the two prediction tasks. Based on the loss values, which task appears to be the harder one? What is the purpose of setting a seed?

In [36]:
torch.manual_seed(42)
print("Without Kaiming ‚Äì Category")
vec_cat_no, model_cat_no = train(label=0, use_kaiming=False)
print("\nWithout Kaiming ‚Äì Sentiment")
vec_sent_no, model_sent_no = train(label=1, use_kaiming=False)

# With Kaiming
torch.manual_seed(42)
print("\nWith Kaiming ‚Äì Category")
vec_cat_k, model_cat_k = train(label=0, use_kaiming=True)
print("\nWith Kaiming ‚Äì Sentiment")
vec_sent_k, model_sent_k = train(label=1, use_kaiming=True)

Without Kaiming ‚Äì Category
Epoch 0, loss: 0.6814
Epoch 1, loss: 0.6506
Epoch 2, loss: 0.6077
Epoch 3, loss: 0.5397
Epoch 4, loss: 0.4670
Epoch 5, loss: 0.3934
Epoch 6, loss: 0.3332
Epoch 7, loss: 0.2856
Epoch 8, loss: 0.2471
Epoch 9, loss: 0.2177

Without Kaiming ‚Äì Sentiment
Epoch 0, loss: 0.6900
Epoch 1, loss: 0.6840
Epoch 2, loss: 0.6825
Epoch 3, loss: 0.6723
Epoch 4, loss: 0.6615
Epoch 5, loss: 0.6467
Epoch 6, loss: 0.6298
Epoch 7, loss: 0.6096
Epoch 8, loss: 0.5895
Epoch 9, loss: 0.5688

With Kaiming ‚Äì Category
Epoch 0, loss: 0.6804
Epoch 1, loss: 0.6376
Epoch 2, loss: 0.5476
Epoch 3, loss: 0.4381
Epoch 4, loss: 0.3441
Epoch 5, loss: 0.2756
Epoch 6, loss: 0.2273
Epoch 7, loss: 0.1920
Epoch 8, loss: 0.1661
Epoch 9, loss: 0.1453

With Kaiming ‚Äì Sentiment
Epoch 0, loss: 0.6911
Epoch 1, loss: 0.6865
Epoch 2, loss: 0.6751
Epoch 3, loss: 0.6534
Epoch 4, loss: 0.6205
Epoch 5, loss: 0.5843
Epoch 6, loss: 0.5450
Epoch 7, loss: 0.5095
Epoch 8, loss: 0.4786
Epoch 9, loss: 0.4498


the final loss for the category model (label=0) is ` 0.2177`, while the final loss for the sentiment model (label=1) is `0.5688`. Since therefore, sentiment classification appears to be the harder task. The seed is needed to get the same random values we got first time  when training our model, to avoid having different output each time we train the model

### Inspecting the embeddings

Now that you have trained the classifier on two separate prediction tasks, it is interesting to inspect and compare the embedding vectors it learned in the process. For this you will use an online tool called the [Embedding Projector](http://projector.tensorflow.org). The next cell contains code to save the embeddings from a trained classifier in a format that can be loaded into this tool.

In [37]:
def save_embeddings(
    vectorizer: ReviewVectorizer,
    model: Classifier,
    vectors_filename: str,
    metadata_filename: str,
):
    i2t = {i: t for t, i in vectorizer.t2i.items()}
    embeddings = model.embedding.weight.detach().numpy()
    items = [(i2t[i], e) for i, e in enumerate(embeddings)]
    with open(vectors_filename, "wt") as f1, open(metadata_filename, "wt") as f2:
        for w, e in items:
            print("\t".join("{:.5f}".format(x) for x in e), file=f1)
            print(w, file=f2)

Call this code as follows:

In [38]:
save_embeddings(vectorizer, model, "vectors.tsv", "metadata.tsv")

#### üéì Task 1.11: Inspecting the embeddings

Load the embeddings from the two classification tasks (product category classification and sentiment classification) into the Embedding Projector web app and inspect the vector spaces. How do they compare visually? Does the visualisation make sense to you?

The Embedding Projector offers visualisations based on three dimensionality reduction methods: [UMAP](https://umap-learn.readthedocs.io/en/latest/), [T-SNE](https://en.m.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding), and [PCA](https://en.m.wikipedia.org/wiki/Principal_component_analysis). Which of these seems most useful to you?

Focus on the embeddings for the words *repair* and *sturdy*. Are they close to each other or far away from another? What happens if you switch to the other task? How do you explain that?

`PCA` Looks crowded and all the words are crumbed together, it makes it hard to interpret usefull information.
`T-SNE` Does not look right, the data are not splitted correctly based on the dimensions.
`UMAP` So far this showed the best result, the words are separated in a meaningfull way.


The words are not as close, the distance shows 0.678

### Initialisation of embedding layers

The error surfaces explored when training neural networks can be very complex. Because of this, it is crucial to choose ‚Äúgood‚Äù initial values for the parameters. In the final task of this lab, you will run a small experiment to see how alternative initialisations can affect a model‚Äôs performance.

In PyTorch, the weights of the embedding layer are initially set by sampling from the standard normal distribution, $\mathcal{N}(0, 1)$. However, research suggests other approaches may work better. For example, given that embedding layers share similarities with linear layers, it makes sense to use the same initialisation method for both. The default initialisation method for linear layers in PyTorch is the so-called Kaiming initialisation, introduced by [He et al. (2015)](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf).

#### üéà Task 1.12: Initialisation of embedding layers

Check the [source code of `nn.Linear`](https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear) to see how PyTorch initialises the weights of linear layers using the Kaiming initialisation method. Apply the same method to the embedding layer of your classifier and see how this affects the loss of your model and the vector spaces.

`Without Kaiming:`

Category: final loss ‚âà 0.2177
Sentiment: final loss ‚âà 0.5688



` With Kaiming method `

Category: final loss ‚âà 0.1453
Sentiment: final loss ‚âà 0.4498

- Without Kaiming, the final loss was 0.2177 for the category model and 0.5688 for the sentiment model.  
- With Kaiming, the final loss dropped to 0.1453 for the category model and 0.4498 for the sentiment model.

So Kaiming initialisation clearly improves optimisation for both tasks in my setup (the loss is lower in both cases). This matches the idea that a better weight initialisation can make it easier for gradient descent to find good minima and leads to faster or more stable convergence.


**ü•≥ Congratulations on finishing lab&nbsp;1!**