# Introduction to (Causal) Language Models

Since the success of ChatGPT, if not before, most of us have a rough idea of how the technology works. Many providers offer low-code or even no-code options that make language models more accessible than ever. In this presentation, we implement some of the core ideas ourselves to develop a better understanding of the technology. The aim is not to build the most powerful language model, but to concentrate on the ideas.

The talk covers the following topics:

1. Understand the preprocessing step (tokenization).
2. Discuss what a causal language model actually models.
3. Generate text from a causal language model.
4. Fine-tune a model to learn a downstream task.
5. Mitigate hallucination in knowledge-intensive tasks with Retrieval-Augmented Generation (RAG).

---

**Content**

1. [Preprocessing: Tokenization of text](#Preprocessing:-Tokenization-of-text-(Input-representation-for-Language-Models))
2. [Causal Language Model](#Causal-Language-Model)
3. [Generate text from a causal model and its strategies](#Generate-text-from-a-causal-model-and-its-strategies)
4. [How to train a Language model](#How-to-train-a-Language-model)
5. [Retrieval Augmented Generation (RAG)](#Retrieval-Augmented-Generation-(RAG))

In [None]:
import os

from datasets import load_dataset
import matplotlib.pyplot as plt
import nltk.tokenize
import numpy as np 
import pandas as pd
import seaborn as sns
import torch
# import the transformer package from Huggingface
from transformers import AutoTokenizer

from plot_utils import pareto_plot

# use `'cuda'` instead of `'mps'` if you have a CUDA-compatible GPU; use `'mps'` on a Mac with Apple silicon (M1 or newer).
# if neither applies, use `'cpu'`
device = torch.device('mps')

# set to `True` if you want to train a model; this is needed to execute certain cells.
with_fine_tuning = False

# Preprocessing: Tokenization of text (Input representation for Language Models)

A tokenizer is the interface between the raw text and the language model. It decomposes text into smaller text chunks, called **tokens**, which are then encoded as **token ids** (non-negative integers) through a look-up table.

<img src="../images/tokenizer_schema.jpg">

(Image: [Andrej Karpathy - Let's build the GPT Tokenizer](https://youtu.be/zduSFxRajkE?si=V4SKTxELBiFGx5Kc))

**Important** The set of text units/tokens the tokenizer knows is fixed (and typically finite). This fixed set of tokens is called the _vocabulary_.

## Character-Level Tokenization (Unicode Code Point Encoding)
The most natural way to split text/strings into smaller units is to decompose the string into its characters and to use each character's Unicode code point as its encoding.
The [Unicode Standard](https://en.wikipedia.org/wiki/Unicode) defines 149,813 characters (as of May 2024). Hence, the vocabulary size of the character-level tokenizer is 149,813: each character defines one token in the vocabulary.

In [None]:
# tokenization on character level: character -> unicode
def string_unicode_code_point_seq(text: str) -> list[int]:
    return [ord(c) for c in text]


def unicode_code_point_seq_string(unicode_code_point_seq: list[int]) -> str:
    return "".join([chr(cp) for cp in unicode_code_point_seq])


first_example = "Some random sentence 😀😇"
encoded_example = string_unicode_code_point_seq(first_example)

print("Text:", first_example)
print("Unicode Code Point sequence:\n", encoded_example)
print("Length Unicode Code Point sequence:", len(encoded_example))
print("Decoded Unicode Code Point Sequence:", unicode_code_point_seq_string(encoded_example))

## Byte Level Tokenization (UTF-8 Encoding)
We can reduce the size of the vocabulary even further by using UTF-8 encoding. UTF-8 encodes each Unicode code point as a sequence of one to four bytes. A byte-level tokenizer based on UTF-8 therefore has only 256 tokens in its vocabulary: each possible byte value forms one token.

In [None]:
def string_byte_seq(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def byte_seq_string(byte_seq: list[int]) -> str:
    return bytes(byte_seq).decode("utf-8")


utf_8_encoded_example = string_byte_seq(first_example)

print("Text:", first_example)
print("Bytes sequence:\n", utf_8_encoded_example)
print("Length bytes sequence:\n", len(utf_8_encoded_example))
print("Decoded bytes sequence:\n", byte_seq_string(utf_8_encoded_example))


We observe that UTF-8 encoding reduces the number of distinct integers (the vocabulary size) needed to encode our text. However, it increases the length of the encoded sequence. The reason is that every code point outside the ASCII range needs more than one byte in UTF-8; the emojis in our example need four bytes each.
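To make the variable byte length concrete, here is a minimal sketch using only the Python standard library. The four example characters are chosen for illustration and are not part of the original notebook.

```python
# Number of UTF-8 bytes needed per character (standard library only).
examples = ["a", "ä", "€", "😀"]  # illustrative characters, one per byte length

for char in examples:
    code_point = ord(char)
    utf8_bytes = char.encode("utf-8")
    print(f"{char!r}: code point U+{code_point:04X}, {len(utf8_bytes)} byte(s): {list(utf8_bytes)}")

# map each character to the length of its UTF-8 encoding
bytes_per_char = {char: len(char.encode("utf-8")) for char in examples}
```

ASCII characters stay at one byte, while characters with larger code points expand to two, three, or four bytes, which is why the byte sequence is longer than the code point sequence.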

## Subword-Level Tokenization

Previous research indicates that language models generally perform better when modeled at the word level instead of character level (e.g. [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)).

However, word-level tokenization has drawbacks: 
  - The number of distinct words is very large (the Unicode standard alone includes nearly 150,000 characters to build them from), leading to potentially massive vocabularies.
  - To manage size, elements like emojis are often excluded.
  - Language evolution, especially with new internet slang and abbreviations, poses challenges as new terms are constantly introduced, which a word tokenizer might not recognize, resulting in unknown tokens (Out-of-Vocabulary (OOV) Words).

Given the limitations of both word-level and character-level tokenization, modern language models often employ methods that interpolate between the word and character (byte) levels to optimize both the performance of the language model and vocabulary management.
These methods are called **subword tokenizers**. Subword tokenizers decompose text into subwords instead of words which enables them to effectively handle the morphological variations by breaking words into subwords.

Below you see a typical example of a subword tokenizer (Each coloured field marks a token).

<img src="../images/tokenizer.jpg">

(Image GPT-2 Tokenizer)

Prominent subword tokenizers are

- BPE (BPE = Byte Pair Encoding)
- WordPiece
- SentencePiece


## How Does A Subword Tokenizer Work?

The tokenization process typically consists of at least the following steps: 

 - **Pre-Tokenization** Decompose the raw text into "words".
 - **Model** Encode "words" into a byte or character sequence (as seen above). Then merge byte/character pairs successively into larger tokens of the vocabulary.

<img src="../images/tokenizer_steps.jpg">

In [None]:
# load the tokenizer for a gpt2 model which is a byte-level BPE
model_checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# get the vocabulary size of the tokenizer
vocab_size = tokenizer.vocab_size

In [None]:
print(f"Vocabulary size of the {model_checkpoint}-Tokenizer: {vocab_size}")

### Pre-Tokenization

The first step is the **Pre-Tokenization**. In this step, the text is decomposed into smaller entities (typically word-like chunks).

In [None]:
example_text = "How do tokenizers work?"
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(example_text)

### Model

In this step, the pre-tokenized text chunks are encoded into Unicode code points or bytes, as seen above. Then byte/character pairs are merged successively into larger tokens (based on a certain **merging rule**).

**Important** The **merging rule** is a trained parameter of a subword tokenizer. It is learned with the compression algorithm [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE): starting from a small vocabulary, say the 256 byte values of the UTF-8 encoder, BPE repeatedly merges the most frequent pair of adjacent tokens in the text data into a new token. This process stops when a pre-defined vocabulary size is reached.

Hence, the **merging rule** depends on the data that BPE is trained on.
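The training loop described above can be sketched in a few lines. This is a toy version under simplifying assumptions (no pre-tokenization, a fixed number of merges instead of a target vocabulary size); the function name `train_bpe` and the training text are made up for illustration and are not how the GPT-2 tokenizer was actually produced.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merge rules from `text` (toy sketch)."""
    # start from the byte-level vocabulary: every token is a single byte
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # count how often each pair of adjacent tokens occurs
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # the most frequent pair becomes a new token
        (left, right), _count = pair_counts.most_common(1)[0]
        merges.append((left, right))
        # replace every occurrence of the pair by the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges


toy_merges = train_bpe("rain rains raining rain", num_merges=3)
print(toy_merges)
```

Training on different text yields different merges, which is exactly the data dependence mentioned above. With a byte-level start, `num_merges` corresponds to the target vocabulary size minus 256.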


In [None]:
# tiktoken is the implementation of OpenAI's tokenizers
# beside fast implementations of several tokenizers, the package also contains educational code
from tiktoken._educational import bpe_encode  # noqa

pre_tokenized_string = "Ġtokenizers"

mergeable_ranks = {
    tok.encode("utf-8"): rank 
    for tok, rank in tokenizer.get_vocab().items()  # get the vocabulary of our tokenizer
}
bpe_encode(mergeable_ranks, pre_tokenized_string.encode("utf-8"))

## Examples

In [None]:
small_text = """Today it rains.
The rain.
rain
Rain
raining
"""

dict(zip(tokenizer.encode(small_text), (tokenizer.decode(token_id) for token_id in tokenizer.encode(small_text))))

In [None]:
for i, token_idx in enumerate([0, 1, 2, 4, 200, 300, 5000, vocab_size// 2, vocab_size//2 + 1, vocab_size//2 + 2, vocab_size-2, vocab_size -1]):
    print(f"Example {i}:", f"Token Id {token_idx},", f"Token '{tokenizer.decode(token_idx)}'")

Besides the usual "tokens", there are several other technical (control) tokens that give the corresponding language model a certain signal:

- EOS Token: End of String (Text) Token.
- SEP Token: Separation Token.
- UNK Token: Unknown Token.

In [None]:
tokenizer.all_special_tokens

There is a nice visualisation for tokenizers. Let's try it out.

```
Let us make some experiments with given tokenizers.

How does the same word behave under the given tokenizer.

Today it rains.
The rain.
rain
Rain
raining

Some calculations:

11 + 23 = 34
1045 + 145 = 1190
10000000 + 10 = 10000010

Some bad code:

def bad_even_check(n: int) -> bool:
   if n % 2 == 0:
      return True
   else:
      return False

The equivalent function
def bad_even_check(n: int) -> bool:
   if n%2==0:
      return True
   else:
      return False
 ```
in [Tokenizer Visualisation](https://tiktokenizer.vercel.app/)

## Implication: Language-Specific Dependencies

Subword tokenizers are data-driven: their merging rules are learned from specific text data. This training introduces dependencies on the language of the text used.

Because English is the dominant language in most training datasets, English text is typically split into longer tokens (and therefore into fewer tokens per word) than text in other languages.

The difference in token length has direct financial implications, since many services charge based on the number of tokens:

- [OpenAI Pricing](https://openai.com/api/pricing/)
- [Mistral AI](https://mistral.ai/technology/)

In [None]:
def word_tokenize(text: str) -> list[str]:
    return nltk.tokenize.word_tokenize(text)


In [None]:
english_example_text = """Byte pair encoding (also known as digram coding) is an algorithm, first described in 1994 by Philip Gage for encoding strings of text into tabular form for use in downstream modeling. Its modification is notable as the large language model tokenizer with an ability to combine both tokens that encode single characters (including single digits or single punctuation marks) and those that encode whole words (even the longest compound words). This modification, in the first step, assumes all unique characters to be an initial set of 1-character long n-grams (i.e. initial "tokens"). Then, successively the most frequent pair of adjacent characters is merged into a new, 2-character long n-gram and all instances of the pair are replaced by this new token. This is repeated until a vocabulary of prescribed size is obtained. Note that new words can always be constructed from final vocabulary tokens and initial-set characters."""

# German text, translated with DeepL
german_translated_text = """Die Byte-Paar-Kodierung (auch Digram-Kodierung genannt) ist ein Algorithmus, der erstmals 1994 von Philip Gage für die Kodierung von Textstrings in Tabellenform zur Verwendung bei der nachgelagerten Modellierung beschrieben wurde. Seine Modifikation ist bemerkenswert als der große Sprachmodell-Tokenizer mit der Fähigkeit, sowohl Token zu kombinieren, die einzelne Zeichen kodieren (einschließlich einzelner Ziffern oder einzelner Satzzeichen) als auch solche, die ganze Wörter kodieren (sogar die längsten zusammengesetzten Wörter). Bei dieser Modifikation wird im ersten Schritt davon ausgegangen, dass alle eindeutigen Zeichen eine anfängliche Menge von n-Grammen mit einer Länge von einem Zeichen darstellen (d. h. anfängliche „Token“). Dann wird nach und nach das häufigste Paar benachbarter Zeichen zu einem neuen, 2 Zeichen langen n-Gramm verschmolzen und alle Instanzen des Paares werden durch dieses neue Token ersetzt. Dies wird so lange wiederholt, bis ein Vokabular der vorgeschriebenen Größe erreicht ist. Es ist zu beachten, dass neue Wörter immer aus den Token des endgültigen Vokabulars und den Zeichen des Anfangssatzes gebildet werden können."""


print("Tokens per word (english):", len(tokenizer.encode(english_example_text)) / len(word_tokenize(english_example_text)))
print("Tokens per word (german):", len(tokenizer.encode(german_translated_text)) / len(word_tokenize(german_translated_text)))

print(f"Mean Token length in characters (english): {np.mean([len(x) for x in tokenizer.tokenize(english_example_text)])}")
print(f"Mean Token length in characters (german): {np.mean([len(x) for x in tokenizer.tokenize(german_translated_text)])}")


## Implication: Issues with simple tasks.

Subword-level encodings have some drawbacks.

- Since tokenizers often encode whole words as single tokens, language models working on top of them have issues with spelling. For example, Andrej Karpathy showed that the string `".DefaultCellStyle"` is a single token for the GPT-4 tokenizer. If you ask GPT-4 how many "l" characters `".DefaultCellStyle"` contains, it gives a wrong answer. (This example is due to Andrej Karpathy.)
- Language models working on top of subword tokenizers struggle to reverse words.
- Tokenizers often tokenize words as "whitespace" + "word" (e.g. `" rain"`). Since the tokenizer is the interface to the language model, adding whitespace at the end of a prompt typically lowers the quality of the generated text.


## Remarks
If you want to learn how to develop such a tokenizer from scratch, the following tutorial is recommended: [Andrej Karpathy - Let's build the GPT Tokenizer](https://youtu.be/zduSFxRajkE?si=V4SKTxELBiFGx5Kc)

See also [TikToken](https://github.com/openai/tiktoken/tree/main): TikToken implements all tokenizers that are used for GPT-3, GPT-3.5, GPT-4 and GPT-4o.

# Causal Language Model

A causal language model tries to estimate how a given text could be continued.

* Simple example: The text `"The capital of France is "` is probably continued with the word "Paris".
* Harder example: It is not clear at all how to continue the text `"I like "`. How the text could be continued certainly depends very much on who you ask. And even for one person it is hard to answer the question, because there are potentially many correct answers.

But what we could do is to **rank** what words or tokens are rather likely to continue a given text and what words or tokens are not very likely to continue the text:

* For example, `"I like "` could be continued with words like `apples, cars` etc. It is rather unlikely that one would continue this text with `being ill, being thirsty`. Furthermore, continuing the given text `"I like "` with words like `I` or `like` results in grammatically incorrect text.

Hence, given a text and a vocabulary, we could rank each word in the vocabulary to indicate which words are likely to continue the text and which words are rather unlikely to continue it.

## Formal Objective
The formal objective of a causal language model is to determine a conditional probability measure:
$$
p(output \vert input)
$$
where "input" refers to the initial segment of text and "output" is the subsequent token. This probability quantifies how likely a particular continuation is, given the preceding text.

<img src="../images/language_model.jpg" alt="Causal Language Model" />

In the context of language models, the "input" is typically a sequence of tokens, and the model's task is to estimate the probability distribution of the next token in the sequence based on its understanding of language structure and content. This is mathematically represented as:

$$
p(x_t \vert x_{1}, \dots, x_{t-1})
$$

where $x_t$ is the token at position $t$, and the sequence $[x_{1}, \dots, x_{t-1}]$ constitutes the input or context.

A causal language model is therefore technically simply a mathematical function of the following type:
$$
p\colon 
\begin{cases}
    \left\{ [x_1,\dots, x_t] \vert t \geq 1,\, x_i \in \mathcal{V} \right\} \to \lbrace (p_w)_{w\in \mathcal{V}} \mid p_w \geq 0, \, \sum_{w\in \mathcal{V}} p_w = 1 \rbrace \\
    [x_1, \dots, x_t] \mapsto (p(x\vert x_1, \dots, x_t))_{x\in \mathcal{V}}
  \end{cases}
$$
where $\mathcal{V}$ is the vocabulary (set of tokens).
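As an illustration of this function type, here is a deliberately tiny "language model" built from bigram counts. It is a sketch under a strong assumption (only the last token $x_t$ of the context matters), and the corpus and the name `next_token_distribution` are invented for this example; a neural language model replaces the counting by a learned function.

```python
from collections import Counter, defaultdict

# toy corpus, already split into word-level tokens
corpus = "the rain falls . the rain stops . the sun shines .".split()
vocabulary = sorted(set(corpus))

# count which token follows which (bigram statistics)
follow_counts = defaultdict(Counter)
for previous, current in zip(corpus, corpus[1:]):
    follow_counts[previous][current] += 1


def next_token_distribution(context: list[str]) -> dict[str, float]:
    """Map a context [x_1, ..., x_t] to a probability vector over the vocabulary."""
    # toy assumption: only the last token x_t matters, and it has been
    # seen with at least one successor in the corpus
    counts = follow_counts[context[-1]]
    total = sum(counts.values())
    return {token: counts[token] / total for token in vocabulary}


distribution = next_token_distribution(["the"])
print(distribution)
```

The returned dictionary has one non-negative entry per vocabulary token and the entries sum to one, which is exactly the codomain of $p$ in the formula above.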


## Why are neural networks used to represent a language model?

Language models rely on tokenizers. As seen above, tokenizers break text into tokens without capturing the underlying semantics of the words. For instance, different forms or inflections of the same word stem are assigned distinct tokens, each with a unique encoding.

**Example 1** Variants of the word "rain" receive different tokens and encodings:

| Tokenizer | Word stem | Token     | Encoding |
|-----------|----------|-----------|--------|
| GPT-2     | "rain"    | ' rains'  | 29424  |
| GPT-2     | "rain"    | ' rain'   | 6290   |
| GPT-2     | "rain"    | 'rain'    | 3201   |
| GPT-2     | "rain"    | 'Rain'    | 31443  |
| GPT-2     | "rain"    | 'raining' | 24674  |

**Example 2** Different forms of the verb "go" are encoded differently:

| Tokenizer | Lemma  | Token    | Encoding |
|-----------|--------|----------|----------|
| GPT-2     | "go"   | 'go'     | 2188     |
| GPT-2     | "go"   | 'went'   | 19963    |
| GPT-2     | "go"   | 'going'  | 5146     |

**Example 3** Related words are assigned distinct tokens and encodings:

| Tokenizer | Related Words | Tokens       | Encoding     | Category     |
|-----------|---------------|--------------|--------------|--------------|
| GPT-2     | "cat", "dog"  | 'cat', 'dog' | 9246, 9703   | Animal, Pet  |
| GPT-2     | "run", "walk" | 'run', 'walk' | 5143, 11152  | motion verb  |


Given these examples, it is apparent that the integer token ids assigned by a tokenizer carry no information about how tokens relate to each other. To address this, tokens can be represented as points in a multidimensional vector space:

$$
\mathbb{R}^d = \lbrace (r_1, \dots, r_d) \vert r_i \in \mathbb{R} \rbrace,
$$

where $d$ is a positive integer. 

This vector space framework enables a concept of **closeness** between tokens. Metrics such as **Euclidean distance** or **cosine similarity** quantify how close or distant the tokens are, allowing semantically similar verbs like "go" and "goes" to be positioned near each other, whereas unrelated words are placed far apart. (see [Word Embedding Visualization](https://projector.tensorflow.org/))

Converting the discrete token encodings into a continuous vector representation supports modeling the conditional probability as a continuous function, preserving semantic closeness so that small changes in the input result in small, meaningful shifts in the output.

**For example**: The sentences `"I walk into a room"` and `"I go into a room"` can probably be continued quite similarly.
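The two closeness measures mentioned above can be computed by hand. The following sketch uses hypothetical three-dimensional embeddings whose values are invented for illustration; real models learn vectors with hundreds of dimensions.

```python
import math

# hypothetical 3-dimensional embeddings, chosen by hand for illustration only
embeddings = {
    "go": [0.9, 0.1, 0.0],
    "goes": [0.8, 0.2, 0.1],
    "cat": [0.0, 0.9, 0.7],
}


def euclidean_distance(u: list[float], v: list[float]) -> float:
    # length of the difference vector; smaller means closer
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u: list[float], v: list[float]) -> float:
    # cosine of the angle between the vectors; larger means more similar
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for other in ("goes", "cat"):
    distance = euclidean_distance(embeddings["go"], embeddings[other])
    similarity = cosine_similarity(embeddings["go"], embeddings[other])
    print(f"go vs {other}: distance={distance:.3f}, cosine similarity={similarity:.3f}")
```

With these made-up vectors, "go" is closer to "goes" than to "cat" under both measures, which is the behaviour one hopes a trained embedding exhibits.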

<img src="../images/neural_network_lm.jpg" alt="Causal Language Model" />

## Load pre-trained Causal Language Model

Neural-network-based language models differ in two main aspects:

* The architecture of the neural network which determines how the different _layers_ or _blocks_ are composed and how large these _layers_ or _blocks_ are. A typical building _block_ of a neural network consists of a matrix (a two-dimensional numeric array), a bias vector (one-dimensional array) and a so-called activation function. The values of the matrix and the bias vector are called _weights_ and are the parameters that are actually updated during training a model.
* The concrete _weights_ of the model, which are updated during training and store the learned information.

The architecture of a model determines (among other things) how large a model is in terms of the number of weights, and therefore also affects memory usage.

The saying ‘more is more’ is quite true for language models based on neural networks. Large models with many weights typically perform better than smaller models when modeling language. This is because they can capture more complex patterns and nuances in the data. However, the trade-off is that high-performance language models with many parameters rarely fit on ordinary GPUs due to their substantial memory requirements. 

| Model       | Number of Weights | Memory Size (weight precision float32) |
|-------------|-------------------|----------------------------------------|
| GPT2-small  | 0.124 B           | 0.5 GB                                 |
| GPT2-medium | 0.355 B           | 1.42 GB                                |
| PHI-3-mini  | 3.8 B             | 15.2 GB                                |
| Llama3-8B   | 8.03 B            | 32 GB                                  |
| Llama3-70B  | 70.6 B            | 280 GB                                 |
| GPT 3       | 175 B             | 700 GB                                 |
| GPT 4       | 1760 B (unofficial estimate) | 7 TB                        |
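The memory column follows from simple arithmetic: each float32 weight occupies 4 bytes. A minimal sketch of that calculation (the helper name `weight_memory_gb` is made up here, and the result only covers the weights, not activations or optimizer state):

```python
BYTES_PER_FLOAT32 = 4


def weight_memory_gb(num_weights: float, bytes_per_weight: int = BYTES_PER_FLOAT32) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_weights * bytes_per_weight / 1e9


# weight counts taken from the table above
for name, num_weights in [("GPT2-small", 0.124e9), ("GPT2-medium", 0.355e9), ("Llama3-8B", 8.03e9)]:
    print(f"{name}: {weight_memory_gb(num_weights):.2f} GB")
```

Storing the weights in a 16-bit format (`bytes_per_weight=2`) halves these numbers.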

---

We choose a fairly small language model, as you can see below, so that it fits on most devices.

To load the language model we again use Hugging Face's `transformers` library.

In [None]:
from transformers import AutoModelForCausalLM

# load pre-trained model from a HuggingFace repository
model = AutoModelForCausalLM.from_pretrained(model_checkpoint, device_map=device)

# set the model to evaluation mode to deactivate technical "modules" (like dropout)
model.eval() 


In [None]:
print(f"Number of model weights of {model_checkpoint}: {model.num_parameters() / 1e6:.3f} million")
print(f"Model footprint of {model_checkpoint}: {model.get_memory_footprint()/ 1e9:.3f} GB")

You can use the following link to estimate VRAM usage: [Memory Estimator](https://huggingface.co/docs/accelerate/main/en/usage_guides/model_size_estimator)

## Embedding

Let us have a look how our loaded model embeds tokens:

In [None]:
token = "go"

EMBEDDING_LAYER_NAME = {
    "gpt2": "transformer.wte",
    "keeeeenw/MicroLlama": "model.embed_tokens",
}

def get_module_by_name(name: str, model) -> torch.nn.Module | None:
    for module_name, module in model.named_modules():
        
        if module_name != name:
            continue
        
        return module

with torch.no_grad():
    token_embedding = get_module_by_name(EMBEDDING_LAYER_NAME[model_checkpoint], model)(tokenizer.encode(token, return_tensors="pt").to(model.device))

print(f"Shape of the embedding tensor: {tuple(token_embedding.shape)}")
print(token_embedding)


In [None]:
token = " dog" # "go" #  "people" # "night" # 

@torch.no_grad()
def get_closest_by_euclidean_distance(word: str, embedding_module_name: str, model, tokenizer, topk: int = 10) -> dict[str, list]:
    
    # get the embedding module from the neural network language model
    embedding = get_module_by_name(embedding_module_name, model)
    assert embedding is not None, "No embedding found"
    
    # get the token id for `word`
    word_token = tokenizer.encode(word, return_tensors="pt")
    word_token = torch.tensor(
        [
            token_id.item() for token_id in word_token.view(-1) if token_id not in tokenizer.all_special_ids
        ]
    ).to(model.device)

    assert word_token.shape[-1] == 1, "`word` decomposes into more than one token."

    # embed the `word`
    word_embedding = embedding(word_token)
    
    # embed the complete vocabulary (all token ids)
    all_embedded_tokens = embedding(torch.tensor(list(range(tokenizer.vocab_size))).to(model.device))
    
    # get the Euclidean distance between the embedded `word` and all other embeddings
    dists = torch.norm(word_embedding - all_embedded_tokens, dim=-1).view(-1)
    
    # get the top closest tokens 
    values, token_ids = torch.topk(-dists, k=topk + 1)
    
    return {
        "Token": [tokenizer.decode(token_id) for token_id in token_ids[1:]],
        "Distance": [-val.item() for val in values[1:]]
    }


plt.figure(figsize=(10, 6))
sns.barplot(
    y='Distance', 
    hue='Token', 
    data=pd.DataFrame(get_closest_by_euclidean_distance(token, EMBEDDING_LAYER_NAME[model_checkpoint], model, tokenizer, topk=20)), 
    palette='viridis'
)

plt.title(f"Tokens close to '{token}'") 
plt.xlabel('Token')
plt.ylabel('Euclidean Distance')
plt.show()

## Probability model output

The output of a neural network-based language model is typically not a probability vector over the vocabulary; instead, it consists of logits. Logits are unnormalized log-probabilities and can be any real number. To interpret a vector of logits correctly:

$$
\text{logits} = (l_1, \dots, l_V)
$$ 

we must apply the _softmax function_, which is a normalized exponential function:

$$
\sigma(l_1,\dots, l_V) = \frac{1}{\sum_{i=1}^V \exp(l_i)} [ \exp(l_1), \dots,  \exp(l_V) ],
$$

Here, $V$ is the vocabulary size.

Another distinctive aspect of the output is influenced by the architecture of the **transformer** model. The shape of the output depends on the length of the input token sequence $[ x_1, \dots, x_t ] $:

$$
[ x_1, \dots, x_t ] \xrightarrow[]{\text{transformer}} \begin{bmatrix} \text{logits}_1 \\ \vdots \\ \text{logits}_t \end{bmatrix} 
\xrightarrow[]{\sigma} \begin{bmatrix} p(\cdot \vert x_1) \\ \vdots \\ p(\cdot \vert x_1,\dots, x_t) \end{bmatrix},
$$
In this model, the softmax function $\sigma$ is applied to each row of the output logits matrix. This processing converts the logits into probabilities that predict the likelihood of each subsequent token given the prior sequence.
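The row-wise softmax can be illustrated without loading a model. The sketch below uses plain Python and made-up logits for $t = 2$ input positions and a vocabulary of size $V = 3$; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    # subtract the maximum for numerical stability (the result is unchanged)
    largest = max(logits)
    exps = [math.exp(logit - largest) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]


# made-up logits matrix: one row per input position, one column per vocabulary token
logits_matrix = [
    [2.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
]

# apply the softmax to each row separately
probabilities = [softmax(row) for row in logits_matrix]
for position, row in enumerate(probabilities, start=1):
    print(f"position {position}:", [round(value, 3) for value in row])
```

Each row is now a probability distribution over the vocabulary; for text generation only the last row, the distribution for the next token, is needed.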


In [None]:
example_text = "Some test continue to write some"
print("Input tokens sequence:", tokenizer.encode(example_text, return_tensors="pt").to(device))
model_output = model(tokenizer.encode(example_text, return_tensors="pt").to(device)).logits
print("Output shape:", list(model_output.shape))
print("Output:", model_output)

In [None]:
def get_top_k_from_transformer(
    model: torch.nn.Module, tokenizer, input_text: str, k: int = 10, temperature: float = 1.0
) -> dict[str, float]:
    
    # set the model to evaluation mode
    model.eval()
    
    # return pytorch tensors
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

    # evaluate model
    with torch.no_grad(): 
        outputs = model(inputs)
    
    # scale the logits by the inverse temperature
    logits = (1/temperature) * outputs.logits
    
    # get probabilities from the next token
    probabilities = torch.softmax(logits[0, -1, :], dim=-1)
    
    # get the k most likely 
    probs, token_ids = torch.topk(probabilities, k)
    
    return {token: val.item() for token, val in zip([tokenizer.decode(tok) for tok in token_ids.tolist()], probs)}

In [None]:
from plot_utils import display_process

input_text = "Today is a sunny day and"

display(display_process([
    input_text,
    tokenizer.tokenize(input_text),
    tokenizer.encode(input_text),
    get_top_k_from_transformer(model, tokenizer, input_text, k=4)
]))

In [None]:
input_text = "I like to think" # "I like to think that I'm a good person." # "Today is a sunny day and"

topk = get_top_k_from_transformer(model, tokenizer, input_text, k=20)

plt.figure(figsize=(10, 6))  # Set figure size
sns.barplot(y='Probability', hue='Token', data=pd.DataFrame(
    {"Probability": list(topk.values()), "Token": list(topk.keys())}
), palette='viridis')  # Create bar plot

plt.title('Probability of Next Tokens')  # Add title
plt.xlabel('Next Token')  # Label x-axis
plt.ylabel('Probability')  # Label y-axis
plt.show()  # Show plot


# Generate text from a causal model and its strategies

How to generate text from a given (causal) language model?

```
Given an array of n tokens [x_1,..., x_n]
Set context = [x_1,..., x_n]
while True:
  sample = draw sample from p(. | context)
  context.append(sample)
```
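A runnable version of this loop, as a sketch: the hand-written table `TOY_MODEL` stands in for $p(\cdot \mid \text{context})$ and is purely hypothetical. Unlike the pseudocode, the loop stops at an end-of-sequence token or after `max_new_tokens` steps, since an unbounded `while True` would never terminate.

```python
import random

# hypothetical toy "language model": next-token distribution given the last token
TOY_MODEL = {
    "the": {"rain": 0.6, "sun": 0.4},
    "rain": {"falls": 0.7, "stops": 0.3},
    "sun": {"shines": 1.0},
    "falls": {"<eos>": 1.0},
    "stops": {"<eos>": 1.0},
    "shines": {"<eos>": 1.0},
}


def generate(context: list[str], max_new_tokens: int = 10, seed: int = 0) -> list[str]:
    rng = random.Random(seed)  # seeded for reproducible sampling
    context = list(context)  # do not modify the caller's list
    for _ in range(max_new_tokens):
        # look up p(. | context); this toy model only looks at the last token
        distribution = TOY_MODEL[context[-1]]
        # draw one sample from the distribution
        sample = rng.choices(list(distribution), weights=list(distribution.values()), k=1)[0]
        if sample == "<eos>":
            break
        # feed the sample back in as part of the context
        context.append(sample)
    return context


generated = generate(["the"])
print(" ".join(generated))
```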

## Context Length

To generate text, we feed the language model iteratively with its own output to create longer text. This process involves using the model's previous output as input for generating the next token in the sequence.

Keep in mind that the maximal context length of language models is typically limited. This means that the length of the token array you can feed into a language model has a bounded size.

As with other specifications of a language model, the context length depends on the model. Different models have different maximum context lengths, which are determined by their architecture and design.

| Model       | Maximal Context length |
|-------------|------------------------|
| GPT2-small  | 1024 Tokens            |
| GPT2-medium | 1024 Tokens            |
| PHI-3-mini  | 4096 Tokens            |
| Llama3-8B   | 8192 Tokens            |
| Llama3-70B  | 8192 Tokens            |
| GPT 3       | 2048 Tokens            |
| GPT 4       | 32768 Tokens           |


In [None]:
print(f"Maximal context length of {model_checkpoint}: {tokenizer.model_max_length} Tokens")

## Top p

Language models based on neural networks, when given a context $[x_1, \dots, x_t]$ and a threshold $\varepsilon > 0$ (close to zero), typically reveal that a large number of tokens from the vocabulary have a probability

$$ p(x\vert x_1, \dots, x_t) > \varepsilon.$$ 

This characteristic suggests a strong generalization capacity of the model. However, relying solely on this broad probability distribution for text generation can lead to unpredictable outcomes. Specifically, tokens that are less likely to be the next logical choice might be randomly selected, which can result in nonsensical or incoherent text. This is particularly problematic when the goal is to generate fact-based content, where a more deterministic approach is preferable.

To illustrate this point, let's consider an example:

In [None]:
input_text = "The capital of france is"
plt.figure(figsize=(20, 6))

# get the token-likelihood mapping for all tokens of the vocabulary, sorted by likelihood
topk = get_top_k_from_transformer(model, tokenizer, input_text, k=tokenizer.vocab_size)

# create pareto plot
pareto_plot(
    pd.DataFrame({'Token': list(topk.keys()), 'Percentage': list(topk.values())}), 
    'Token', 
    'Percentage', 
    topk=30
);


### The "Top p" Sampling Strategy

As we have seen above, a large mass of the probability distribution is distributed over a large set of tokens that are rather unlikely to be a reasonable next token. Sampling directly from such a wide distribution can result in nonsensical or irrelevant text outputs, undermining the model's utility for practical applications.

To mitigate this issue, the "top p" sampling method, also known as "nucleus sampling," is employed. This method involves:

- **Ordering Tokens by Likelihood**: Tokens are ranked according to their probability given the context $p(x\vert x_1, \dots, x_n)$.
- **Cutoff at Cumulative Probability p**: "Top p" sampling keeps only the smallest set of top-ranked tokens whose cumulative probability reaches $p$ (for example $p = 0.9$, i.e. 90 % of the probability mass). This effectively narrows down the choice of tokens to those most likely to be contextually relevant, reducing the inclusion of implausible tokens.
- **Deterministic Sampling within a Probabilistic Framework**: The probabilities of the remaining tokens are renormalized and the next token is sampled from this smaller set. Limiting the scope to the more probable tokens makes generation more controlled and predictable while maintaining the inherent probabilistic nature of the model.

In [None]:
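def sample_top_p(probs, p):
    # sort probabilities in descending order and keep the original token ids
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    # cumulative probability mass along the sorted tokens
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # drop a token if the mass of all tokens ranked before it already exceeds p
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    # renormalize the remaining probabilities so that they sum to one
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    # sample a position in the sorted order and map it back to a token id
    next_token = torch.multinomial(probs_sort, num_samples=1)
    next_token = torch.gather(probs_idx, -1, next_token)
    return next_token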
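
To make the cutoff concrete, here is a minimal sketch on a made-up distribution over a hypothetical five-token vocabulary. The probabilities and the value of `p` are assumptions chosen for illustration; the masking rule is the same one used in `sample_top_p`.

```python
import torch

# Made-up next-token distribution over a hypothetical 5-token vocabulary.
probs = torch.tensor([0.50, 0.25, 0.15, 0.07, 0.03])
p = 0.8

# Sort by likelihood and accumulate the probability mass.
probs_sort, probs_idx = torch.sort(probs, descending=True)
probs_sum = torch.cumsum(probs_sort, dim=-1)

# A token is dropped once the mass of all tokens ranked before it exceeds p.
mask = probs_sum - probs_sort > p

# Zero out the tail and renormalize the remaining "nucleus".
nucleus = probs_sort.masked_fill(mask, 0.0)
nucleus = nucleus / nucleus.sum()

# Token ids that can still be sampled.
kept_ids = probs_idx[~mask].tolist()
print(kept_ids, nucleus)
```

With these numbers the three most likely tokens survive, because the two tokens ranked before the third one only cover 0.75 of the mass, which is still below `p`.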


## Temperature in Language Model Generation

As previously discussed, a language model predicts the likelihood of the next token based on the provided context. When sampling from this probability distribution:

1. If the model is very confident, the generation will tend to be deterministic.
2. If the model is less confident, the generation will be more random or unpredictable.


**Temperature** is a key parameter in controlling the predictability of text generation. High temperatures result in less predictable text generation, while low temperatures lead to more deterministic outputs.

Temperature is applied to the logits before the softmax. Given a temperature $\theta > 0$ and a vector of logits $(l_1, \dots, l_V)$, the next-token distribution is
$$
\sigma\left(\frac{l_1}{\theta}, \dots, \frac{l_V}{\theta} \right) = \frac{1}{\sum_{i=1}^V \exp({l_i/\theta})} \left[\exp({l_1/\theta}), \dots, \exp({l_V/\theta})
\right],
$$
where $V$ is the vocabulary size and $\sigma$ denotes the softmax function. Dividing the logits by the temperature controls the sharpness of the probability distribution: $\theta < 1$ sharpens it, $\theta > 1$ flattens it, and $\theta = 1$ leaves it unchanged.
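
As a small numerical illustration of the formula, the following sketch applies three temperatures to the same logits. The logits and the helper name `softmax_with_temperature` are made up for this example.

```python
import torch

# Made-up logits for a hypothetical 3-token vocabulary.
logits = torch.tensor([2.0, 1.0, 0.0])


def softmax_with_temperature(logits: torch.Tensor, theta: float) -> torch.Tensor:
    # Divide the logits by the temperature before applying the softmax.
    return torch.softmax(logits / theta, dim=-1)


cold = softmax_with_temperature(logits, 0.1)     # low temperature: sharp distribution
neutral = softmax_with_temperature(logits, 1.0)  # plain softmax
hot = softmax_with_temperature(logits, 10.0)     # high temperature: nearly uniform

for name, dist in [("cold", cold), ("neutral", neutral), ("hot", hot)]:
    print(name, dist)
```

The ranking of the tokens never changes, only how much probability mass the most likely token receives.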

Let's visualize the impact of temperature on the distribution.

In [None]:
input_text = "The capital of france is"
topk = get_top_k_from_transformer(model, tokenizer, input_text, k=20, temperature=100.3)

plt.figure(figsize=(10, 6))  # Set figure size
sns.barplot(y='Probability', hue='Word', data=pd.DataFrame(
    {"Probability": list(topk.values()), "Word": list(topk.keys())}
), palette='viridis')  # Create bar plot

plt.title('Probability of Next Words')  # Add title
plt.xlabel('Next Word')  # Label x-axis
plt.ylabel('Probability')  # Label y-axis (the bar height is the probability)
plt.show()  # Show plot


## Frequency penalty

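When generating text from a causal language model as described above, the model can start to repeat itself. A frequency penalty counteracts this: the logit of every token is reduced by `alpha_penalty` times the number of times the token already occurs in the context.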

In [None]:
def freq_penalty(logits: torch.Tensor, contexts: torch.Tensor, alpha_penalty: float) -> torch.Tensor:
    occurrences_vectors = []
    
    vocab_size = logits.shape[-1]
    
    for context in contexts:
        occurrences_vector = torch.zeros(vocab_size, dtype=torch.long, device=logits.device)
        
        unique, counts_unique = torch.unique(context, return_counts=True)

        occurrences_vector[unique[unique >= 0]] = counts_unique[unique >= 0]
        
        occurrences_vectors.append(occurrences_vector)
        
    occurrences = torch.stack(occurrences_vectors).to(logits.device)
    
    return logits - occurrences * alpha_penalty
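
To see the effect on concrete numbers, here is a minimal sketch for a single context. It uses `torch.bincount` to count occurrences, which is a shortcut chosen for this example and not the implementation above; the vocabulary size, context and penalty are made up.

```python
import torch

vocab_size = 5
logits = torch.zeros(vocab_size)     # made-up logits: all tokens equally likely
context = torch.tensor([1, 1, 3])    # token 1 occurred twice, token 3 once
alpha_penalty = 0.5

# Count how often each token id occurs in the context.
counts = torch.bincount(context, minlength=vocab_size)

# Subtract the penalty proportionally to the number of occurrences.
penalized = logits - alpha_penalty * counts

print(penalized)
print(torch.softmax(penalized, dim=-1))
```

Tokens that already appeared receive a lower logit and are therefore sampled less often; the more often a token occurred, the stronger the effect.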


## Sum up to generate text

We are now ready to implement a _generation_ function with which we generate text based on a language model.

**Note**: A language model is "stateless" and has no memory of its own. To generate text, we always feed the language model the complete context. In particular, the model's maximum context length limits how much text can be generated this way.

In [None]:
def get_next_token(
    model: torch.nn.Module, input_ids: torch.Tensor, top_p: float = 0.9, temperature=1.0, alpha_penalty=0.2
) -> torch.Tensor:
    # set model to evaluation mode
    model.eval()
    
    # evaluate model
    with torch.no_grad(): 
        outputs = model(input_ids.to(model.device))
    
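    # get logits from the model, shape [batch_size, sequence_length, vocab_size]
    logits = outputs.logits
    
    # penalize the logits at the last position for tokens that already occur in the context
    logits = freq_penalty(logits[0, -1, :], input_ids, alpha_penalty)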
    
    # calculate the probabilities from the logits
    probability = torch.softmax((1/temperature) * logits, dim=-1)

    # get the next token by sampling
    next_token = sample_top_p(probability, top_p)
    
    return next_token


def generate(
    model: torch.nn.Module, 
    tokenizer, 
    text: str, 
    max_gen_length: int = 124, 
    top_p: float = .9, 
    temperature: float = .6,
    alpha_penalty: float = 0
) -> tuple[torch.Tensor, list[int]]:
    # encode initial context
    tokens = tokenizer.encode(text, return_tensors="pt").to(model.device)  # shape [1, token_sequence_length]
    
    # instantiate sequence of output ids
    out_ids = []
    
    for _ in range(max_gen_length):
        
        # get next token id by sampling
        next_token = get_next_token(model, tokens, top_p, temperature, alpha_penalty)

        # append next token id to `out_ids`
        out_ids.append(next_token.item())
        
        # check escape conditions
        max_gen_length_reached = len(out_ids) == max_gen_length
        eos_token_reached = next_token.item() == tokenizer.eos_token_id
        
        if max_gen_length_reached or eos_token_reached:
            break
        
        # add next token to context
        tokens = torch.concat([tokens, next_token.view(1, -1)], dim=-1)
    
    return tokens, out_ids

example_text = "I like" # "The capital of france is"
tokens, out_ids = generate(model, tokenizer, example_text, max_gen_length=26, top_p=.3, temperature=100.8, alpha_penalty=10.)
tokenizer.decode(out_ids)


# How to train a Language model

## 1. Pre-training
**Objective**: Develop a broad understanding of language patterns and structures through extensive exposure to diverse textual data.

**Process**: The model is trained on a large corpus, often with billions of words spanning various topics. 

## 2. Supervised Fine-Tuning
After the pre-training step, the language model has a general understanding of language. At this point it is a **text completer**, i.e. it simply tries to continue the input text. It does not yet react to our input like a human conversation partner, as the language models behind chatbots do.

**Objective**: Adapt a pre-trained model to excel in specific tasks such as 
   * question-answering
   * **following instructions**
   * or summarizing text.

**Process**: In fine-tuning, the general model is trained further on a smaller, task-specific dataset. The model learns a specific prompt template and how it has to react to certain signals. This step is faster and less resource-intensive than pre-training because the model already knows the fundamentals of language. During fine-tuning, the model parameters are adjusted to optimize performance on the specific task.

**Technical notes**: Training a neural network is in almost all cases based on a gradient descent algorithm. In the implementation of the training loop, there is no difference between pre-training and fine-tuning except for certain hyperparameters such as batch size and learning rate.
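
The following is a minimal sketch of such a training loop for next-token prediction. The tiny embedding-plus-linear model, the token ids and the hyperparameters are assumptions made for illustration; a real language model would be a transformer trained on a large corpus.

```python
import torch

torch.manual_seed(0)

vocab_size, dim = 10, 8

# Toy stand-in for a language model: it predicts the next token from the current one.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, dim),
    torch.nn.Linear(dim, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One made-up training sequence of token ids, shape [batch_size, sequence_length].
token_ids = torch.tensor([[1, 2, 3, 4, 5]])

# The target at every position is the token that follows it.
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]

losses = []
for _ in range(20):
    logits = model(inputs)  # shape [1, 4, vocab_size]
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()    # compute gradients
    optimizer.step()   # gradient descent update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Whether this loop is called pre-training or fine-tuning depends only on the data it sees and on the hyperparameters, not on the code itself.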

## 3. Alignment Training
Language models reproduce text on which they have been trained. This can lead to the generation of conspiracy theories, violent language or other false information and common misconceptions.

**Objective**: Align the model with human preferences to increase its utility and safety. Leveraging human or AI preferences in the training loop can lead to large improvements.

**Process**: Uses different preference learning algorithms like 
- [Deep Reinforcement Learning from Human Preferences](https://arxiv.org/pdf/1706.03741)
- [Direct Preference Optimization](https://arxiv.org/pdf/2305.18290)

## Practical Notes 

**Domain-specific models**: As described above, a concrete language model is determined by its architecture and its weights. The weights determine the strengths and weaknesses of a model.

| Model                               | Model Architecture | Task        | Good at                       | Link                                                       | Fine-Tuned on                                               |
|-------------------------------------|--------------------|-------------|-------------------------------|------------------------------------------------------------|-------------------------------------------------------------|
| microsoft/Phi-3-mini-128k-instruct  | Phi-3              | Instruction | Math, Code, Logical Reasoning | https://huggingface.co/microsoft/Phi-3-mini-128k-instruct  |                                                             |
| nvidia/Llama3-ChatQA-1.5-70B        | Llama3             | QA/RAG      |                               | https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B        | https://huggingface.co/datasets/nvidia/ChatQA-Training-Data |
| aaditya/Llama3-OpenBioLLM-70B       | Llama3             | Instruction | Biomedicine                   | https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B       | biomedical data                                             |
| shenzhi-wang/Llama3-8B-Chinese-Chat | Llama3             | Instruction | Responses in Chinese          | https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat |                                                             |


## Fine-Tuning our language model to learn a specific task

Until now, the model we've been discussing primarily functions as a **text completer**. 

**Goal**: Train the model to respond to instructions.

To achieve this, the model must learn to interpret **control tokens** and **prompt templates**. These elements act as signals, guiding the model to respond appropriately to given inputs. 

Here is an example of what that could look like for a Mistral model.

```python
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", use_fast=True)

messages = [
    {"role": "user", "content": "{user content}"},
    {"role": "assistant", "content": "{assistant response}"},
    {"role": "user", "content": "{user content}"},
]

print(tokenizer.decode(tokenizer.apply_chat_template(messages)))
```
_Output_
```
'<s>[INST]  {user content} [/INST] {assistant response}</s>[INST]  {user content} [/INST]'
```

There is no standardised approach in this context. Prompt templates depend on the model. See for example [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) for another prompt template example.

In [None]:
text_dataset = load_dataset(
    "yahma/alpaca-cleaned"
)

In [None]:
text_dataset["train"].to_pandas().tail(5)

In [None]:
MAX_CONTEXT_LENGTH = 400

## Create a specific prompt template

As described above, we want to teach our model how to react to instructions. To make this possible, we need to establish a certain prompt template that gives the model the right signals. 

In [None]:
def prompt_template(example):
    text = ""
    if input_ := example.get("input"):
        text += f"<Input> {input_}"
    if instruction := example.get("instruction"):
        text += f"<Inst> {instruction}<Response>"
    if output := example.get("output"):
        text += f"{output} </Response>"
    return text


In [None]:
prompt_template(
    {
        "input": "The capital of germany is Berlin.",
        "instruction": "What is the capital of germany?"
    }
)


In [None]:
prompt_template(
    {
        "input": "The capital of germany is Berlin.",
        "instruction": "What is the capital of germany?",
        "output": "The capital of germany is Berlin."
    }
)


## Prepare the dataset

In [None]:
def prompt(example, tokenizer=None):
    
    text = prompt_template(example)
    
    if tokenizer:
        text += tokenizer.eos_token
    
    example["text"] = text
    
    return example


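# pass the tokenizer so that the end-of-sequence token is appended to every example
text_dataset = text_dataset.map(prompt, fn_kwargs={"tokenizer": tokenizer}).filter(
    lambda ex: len(tokenizer.encode(ex["text"])) <= MAX_CONTEXT_LENGTH
)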



In [None]:
text_dataset["train"].to_pandas()

In [None]:
train_val_dataset = text_dataset["train"].train_test_split(train_size=.9, seed=1901)

### Fine-Tuning


In [None]:
trainer = None

if with_fine_tuning:
    from transformers import TrainingArguments
    from trl import SFTTrainer
    from peft import LoraConfig
    
    # modules that receive LoRA adapters, per model checkpoint
    TARGET_MODULES = {
        "gpt2": ["wte", "c_attn", "c_proj", "c_fc", "lm_head"],
        "phi3": ['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj'],
        "keeeeenw/MicroLlama": ['gate_proj', 'up_proj', 'down_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
    }
    
    # lora config
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=TARGET_MODULES[model_checkpoint]
    )
    
    # Set training arguments
    training_arguments = TrainingArguments(
        output_dir=os.path.join("..", "logs", "model_checkpoints"),
        save_strategy="steps",
        save_steps=200,
        num_train_epochs=2,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        logging_steps=200,
        optim="adamw_torch",
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        warmup_steps=10,
        # fp16=True  # does not work on mps devices
    )
    
    # make sure a padding token is defined before the trainer is created
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    # Set supervised fine-tuning parameters
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_val_dataset["train"],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=MAX_CONTEXT_LENGTH,
        tokenizer=tokenizer,
        args=training_arguments,
    )


In [None]:
if trainer:
    # train model
    trainer.train()
    
    # save model
    trainer.save_model("gpt2-instruction")


In [None]:
def load_peft_model(model_checkpoint: str, adapters_name: str, device=None):
    from peft import PeftModel
    from transformers import AutoModelForCausalLM
    
    # load the base model
    model = AutoModelForCausalLM.from_pretrained(
        model_checkpoint,
    )
    
    # load our fine-tuned model
    peft_model = PeftModel.from_pretrained(model, adapters_name)
    
    # put model on specified device
    if device is not None:
        peft_model.to(device)
    
    # get the associated tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    
    return peft_model, tokenizer


peft_model, tokenizer = load_peft_model(model_checkpoint, "gpt2-instruction", device)


### Let's try our instruction model

<img src="../images/meme.jpeg" alt="Fine Tuned Model" />

In [None]:
_, out_ids = generate(
    peft_model, 
    tokenizer, 
    prompt_template({"input": "Answer the following instruction in 10 words.", "instruction": "How are you?"}), 
    max_gen_length=50, top_p=.5, temperature=.9)
print(tokenizer.decode(out_ids))

In [None]:
for i in range(10):
    _, out_ids = generate(
        peft_model, 
        tokenizer, 
        prompt_template({"instruction": "What is the capital of germany?"}), 
        max_gen_length=15, top_p=.5, temperature=.9)
    print(f"Response {i}:", tokenizer.decode(out_ids))

We observe that the language model has learned the _signals_ and _prompts_ we introduced.
However, it is also apparent that the model tends to **hallucinate**, that is, it generates incorrect information.

# Retrieval Augmented Generation (RAG)

We observed that our model can hallucinate. Hallucination in language models refers to generating incorrect or nonsensical information that is not supported by the input data. To mitigate this, Retrieval Augmented Generation (RAG) uses a retrieval system that looks up information relevant to the initial prompt and adds it to the input prompt. This helps ground the model's responses in factual data.

The concept of Retrieval Augmented Generation was originally proposed in the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401). This approach combines traditional language modeling with a retrieval mechanism, which fetches relevant external information that can be used during the generation process. This combination allows the model to enhance its outputs with up-to-date and pertinent information, reducing the risk of generating inaccurate content.

## Mechanism
**Retrieval Component**: During text generation, RAG retrieves relevant documents or information from an external knowledge base.

**Generation Component**: The retrieved information is then used to inform and improve the text generation process, grounding responses in factual data.

<img src="../images/rag.jpg" alt="Retrieval Augmented Generation Workflow" />

(illustration inspired by https://towardsdatascience.com/retrieval-augmented-generation-rag-from-theory-to-langchain-implementation-4e9bd5f6a4f2)
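
The two components can be sketched in a few lines. This is a toy illustration only: the knowledge base is made up, the retriever ranks documents by simple word overlap instead of a learned embedding model, and `retrieve` and `build_prompt` are hypothetical helper names. The `<Input>`, `<Inst>` and `<Response>` tags mirror the prompt template introduced in the fine-tuning section.

```python
# Made-up knowledge base for illustration.
knowledge_base = [
    "The capital of Germany is Berlin.",
    "The capital of France is Paris.",
    "The capital of Austria is Vienna.",
]


def tokenize(text: str) -> set[str]:
    # Lowercase, strip basic punctuation and split into a set of words.
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def retrieve(question: str, top_k: int = 1) -> list[str]:
    # Retrieval component: rank documents by the number of words shared with the question.
    question_words = tokenize(question)
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str) -> str:
    # Generation component input: the retrieved text is placed in front of the instruction.
    context = " ".join(retrieve(question))
    return f"<Input> {context}<Inst> {question}<Response>"


rag_prompt = build_prompt("What is the capital of Germany?")
print(rag_prompt)
```

The resulting prompt is then passed to the language model, which can copy the answer from the retrieved text instead of relying on what it memorized during training.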

## Benefits
**Reduced Hallucinations**: By incorporating real-time, relevant information, RAG decreases the likelihood of generating incorrect or nonsensical outputs.

**Enhanced Accuracy**: The retrieval process ensures that the model's outputs are more accurate and contextually appropriate.

**Dynamic Knowledge Update**: Unlike static language models, RAG can dynamically incorporate new information, making it adaptable and up-to-date (for example financial news).

## Application
**Knowledge-Intensive Tasks**: RAG is particularly useful for tasks that require access to a large and dynamic body of knowledge, such as question answering, dialogue systems, and information retrieval.

**Expert Systems**: It can be applied to domains requiring specialized knowledge, including company restricted knowledge, legal advice, and technical support.


Let's start with manually adding some info to examine the impact on the response.

In [None]:
for i in range(10):
    _, out_ids = generate(
        peft_model, 
        tokenizer, 
        prompt_template({
            "input": "Answer the following question using the input: Berlin is the capital of Germany.", 
            "instruction": "What is the capital of germany?"
        }), 
        max_gen_length=20, 
        top_p=.5, 
        temperature=.9
    )
    print(f"Response {i}:", tokenizer.decode(out_ids))
    

In our very limited test, we can see that adding relevant information has a positive effect on the quality of the response.

--- 

## Sentence Embedding and Semantic Search

We have already seen the concept of embeddings in a previous section. To recap, embeddings help a model determine how similar or different words are by placing them in a multidimensional vector space: similar words are placed closer together, and dissimilar words farther apart.

Building on this concept, there are advanced models that apply a similar technique to entire sentences or even paragraphs, not just single words.

**What Are Sentence Embeddings?**

Sentence embeddings extend the idea of word embeddings to longer pieces of text. These models take whole sentences and encode them into numerical vectors in a vector space. By doing this, each sentence is represented by a point in a multidimensional vector space.

**How Do Sentence Embeddings Encode Semantic Similarity?**

The goal of sentence embeddings is to capture the overall meaning of a sentence, rather than just the meanings of individual words. This is achieved by considering the context of the entire sentence. For example, the sentences "The weather is sunny" and "It's a bright sunny day" share a similar theme and, thus, their vector representations would be placed close together in the embedding vector space.


**Notes**

Sentence embedding models, unlike the token embeddings we talked about earlier, are standalone models.



In [None]:
# import sentence transformers from Huggingface
from sentence_transformers import SentenceTransformer, util

# load a sentence embedding model from HuggingFace
sent_embedder = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [None]:
example_sentences = [
    "What is the capital of germany.", 
    "The capital of germany is Graz."
]

sent_embedder.encode(example_sentences)

Because of the way sentence embedding models are trained, _closeness_ is often not measured by the Euclidean distance but rather by the _cosine similarity_.

### Cosine similarity

For two vectors $v_1, v_2 \in \mathbb{R}^d$, $v_1, v_2 \neq 0$, we have the following identity

$$
\cos(\theta_{v_1, v_2}) = \left\langle \frac{v_1}{\|v_1\|}, \frac{v_2}{\|v_2 \|} \right\rangle = \frac{1}{\|v_1\| \|v_2\|} \sum_{i =1}^d v_{1,i}v_{2,i},
$$
where $v_1 = (v_{1,1},\dots, v_{1, d}), \, v_2 = (v_{2,1}, \dots, v_{2,d})$, $\|v_j \| =  \sqrt{\langle v_j, v_j \rangle}$ and $\theta_{v_1, v_2}$ is the angle between $v_1$ and $v_2$.

The _cosine similarity_ for $v_1$ and $v_2$, $v_1, v_2 \neq 0$, is defined by 

$$
\mathrm{cosSim}(v_1, v_2) = \left\langle \frac{v_1}{\|v_1\|}, \frac{v_2}{\|v_2 \|} \right\rangle.
$$

Hence, $\mathrm{cosSim}$ takes values in $[-1, 1]$.

- If $v_1, v_2$ are unit vectors and $\mathrm{cosSim}(v_1, v_2) = 1$, then $v_1 = v_2$.
- If $v_1, v_2$ are unit vectors and $\mathrm{cosSim}(v_1, v_2) = -1$, then $v_1 = -v_2$.
- If $v_1, v_2$ are unit vectors and $\mathrm{cosSim}(v_1, v_2) = 0$, then $v_1$ and $v_2$ are orthogonal.

In summary, if two vectors have a cosine similarity close to one, they are similar. The smaller the cosine similarity value, the less similar they are.
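
The three special cases can be checked numerically with a small NumPy sketch. The helper name `cos_sim` and the example vectors are made up for this illustration.

```python
import numpy as np


def cos_sim(v1: np.ndarray, v2: np.ndarray) -> float:
    # Inner product of the two vectors after scaling both to unit length.
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


a = np.array([1.0, 2.0, 0.0])

same_direction = cos_sim(a, 3.0 * a)                 # scaling does not change the angle
opposite = cos_sim(a, -a)                            # vectors pointing in opposite directions
orthogonal = cos_sim(a, np.array([-2.0, 1.0, 0.0]))  # perpendicular vectors

print(same_direction, opposite, orthogonal)
```

Note that the length of the vectors does not matter, only their direction.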

In [None]:
v1 = np.array([0., 1., .8])
v2 = np.array([0., 0., 1.])

v1 = (1/np.linalg.norm(v1)) * v1
v2 = (1/np.linalg.norm(v2)) * v2

def add_3d_vector(x: np.ndarray, color: str, ax):
    x = x.copy().reshape(-1)
    
    # the vector must have exactly three components
    if x.shape != (3,):
        raise ValueError(f"expected a vector with 3 components, got shape {x.shape}")
    
    # draw an arrow from the origin to x
    ax.quiver(0, 0, 0, x[0], x[1], x[2], color=color)
    
    return ax


def add_unit_sphere(ax):
    # make data
    u = np.linspace(0, 2 * np.pi, 250)
    v = np.linspace(0, np.pi, 250)
    x = np.outer(np.cos(u), np.sin(v))
    y = np.outer(np.sin(u), np.sin(v))
    z = np.outer(np.ones(np.size(u)), np.cos(v))
    
    # plot the surface
    ax.plot_surface(x, y, z, rstride=4, cstride=4, color='grey', linewidth=0, alpha=0.5)
    
    return ax


def add_hyperplane(x1, x2, ax):
    
    u = np.linspace(-1, 1, 10)
    v = np.linspace(-1, 1, 10)
    um, vm = np.meshgrid(u, v)
    
    # generate points on the plane
    xs = um * x1[0] + vm * x2[0]
    ys = um * x1[1] + vm * x2[1]
    zs = um * x1[2] + vm * x2[2]
    
    ax.plot_surface(xs, ys, zs, alpha=0.2, rstride=10, cstride=10, color="blue")
    
    return ax


fig = plt.figure()
ax = fig.add_subplot(projection='3d')

ax = add_unit_sphere(ax)

ax = add_3d_vector(v1, color="red", ax=ax)
ax = add_3d_vector(v2, color="green", ax=ax)
ax = add_hyperplane(v1, v2, ax=ax)

ax.set_xlim((-1., 1.))
ax.set_ylim((-1., 1.))
ax.set_zlim((-1., 1.))

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

plt.show()

In [None]:
similar_sentences = [
    "The capital of germany is Berlin.", 
    "Berlin is the capital of Germany."
]

embeddings = sent_embedder.encode(similar_sentences)

util.pytorch_cos_sim(embeddings[0], embeddings[1])

In [None]:
non_similar_sentences = [
    "The capital of germany is Berlin.", 
    "I like large language models."
]

embeddings = sent_embedder.encode(non_similar_sentences)

util.pytorch_cos_sim(embeddings[0], embeddings[1])

## Semantic Search and Vector Database

Vector databases store data objects, such as text chunks, along with their vector representations or embeddings. This technique is not limited to text alone; it can also be applied to other data types like images, videos, and audio. By converting data into vector form, it becomes possible to perform queries based on semantic similarity rather than just exact matches or keywords.

### How it works

**Data Storage**:
- Each data object (e.g., a text document, image, or video) is processed through a model to generate a vector representation. These vectors capture the semantic meaning of the data.
- The vector representations are stored in the database alongside the original data.

**Querying**:
- When a query is made, it is first converted into a vector using the same model used for the data objects.
- The database then compares this query vector with the stored vectors to find the most semantically similar data objects.
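
The storage and querying steps above can be sketched with plain NumPy. The three-dimensional vectors below are hand-written stand-ins for real embeddings, and `normalize` and `search` are hypothetical helper names; an actual vector database would use an embedding model and an index structure for fast lookup.

```python
import numpy as np

# Data storage: each document is kept alongside its (made-up) vector representation.
documents = ["doc about capitals", "doc about weather", "doc about language models"]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])


def normalize(x: np.ndarray) -> np.ndarray:
    # Scale vectors to unit length so that the dot product equals the cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def search(query_vector: np.ndarray, top_k: int = 2) -> list[tuple[str, float]]:
    # Querying: compare the query vector with all stored vectors and return the best matches.
    scores = normalize(doc_vectors) @ normalize(query_vector)
    order = np.argsort(-scores)[:top_k]
    return [(documents[i], float(scores[i])) for i in order]


# Made-up query vector that points roughly in the direction of the first document.
results = search(np.array([0.8, 0.2, 0.1]), top_k=2)
print(results)
```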



In [None]:
# load our poor man's document database
def load_capital_db():
    doc_format = "The capital of {country} is {city}."
    import json
    
    with open(os.path.join("..", "data", "country_by_capital_city.json")) as file:
        capital = json.load(file)
        
    capital = pd.DataFrame(capital)
    
    capital["text"] = capital.apply(lambda row: doc_format.format(country=row["country"], city=row["city"] or "unknown"), axis=1)
    
    return capital[["text"]]

country_capital_documents = load_capital_db()
country_capital_documents.head()


In [None]:
sent_embedder.encode(country_capital_documents["text"].tolist(), convert_to_tensor=True)

In [None]:
document_db = {
    "documents": country_capital_documents,
    "embeddings": sent_embedder.encode(country_capital_documents["text"].tolist(), convert_to_tensor=True),
    "sentence_embedder": sent_embedder,
}

In [None]:
def retrieve_topk_documents(query: str, top_k: int = 10, score_threshold: float = .7):
    
    # embed query text
    query_embedding = document_db["sentence_embedder"].encode([query], convert_to_tensor=True)
    
    # calculate the cosine similarity of the query embedding to all document embeddings
    cos_sim = util.pytorch_cos_sim(query_embedding, document_db["embeddings"])[0]
    
    # get the most similar documents (sorted by descending score)
    scores, indices = torch.topk(cos_sim, k=min(top_k, cos_sim.shape[0]))
    
    # keep only the documents whose score reaches the threshold
    keep = scores >= score_threshold
    return scores[keep].tolist(), indices[keep].tolist()
    
scores, indices = retrieve_topk_documents("What is the capital of germany?", top_k=3, score_threshold=0.4)
print(scores)
print(indices)
document_db["documents"].iloc[indices].squeeze()

In [None]:
question = "The capital of germany is what?"
for i in range(10):
    scores, indices = retrieve_topk_documents(question, top_k=1, score_threshold=0.8)
    input_ = document_db["documents"]["text"].iloc[indices].to_list()
    prompt_input = {"instruction": question}
    
    if input_:
        prompt_input["input"] = input_[0]
    
    _, out_ids = generate(
        peft_model, 
        tokenizer, 
        prompt_template(prompt_input), 
        max_gen_length=20, 
        top_p=.5, 
        temperature=.9, 
        alpha_penalty=0.01
    )
    print(f"Response {i}:", tokenizer.decode(out_ids))