<a href="https://colab.research.google.com/github/Showcas/NLP/blob/main/03_2_Galactica.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with LLMs

## Introduction to Galactica Models

Galactica is a family of large language models optimized for scientific applications. Trained on a specialized scientific corpus, it excels at handling scientific terminology, mathematical expressions, citations, and other scientific content such as source code and molecular sequences.


The Galactica Paper: https://galactica.org/static/paper.pdf

---

## 1. Setup and Installation

To use Galactica, we install the required library and load the model.

We recommend using `float16` precision to reduce memory usage when running on Colab.

In [1]:
!pip install galai



In [2]:
import galai as gal
from galai.notebook_utils import display_latex, display_markdown

After installation, you can obtain an overview of the available model sizes with the following.

In [3]:
from galai.utils import ModelInfo
ModelInfo.all()

Name,Parameters,Layers,Heads,Head Size,Vocabulary Size,Context Size
mini,125.0 M,12,12,64,50000,2048
base,1.3 B,24,32,64,50000,2048
standard,6.7 B,32,32,128,50000,2048
large,30.0 B,48,56,128,50000,2048
huge,121.3 B,96,80,128,50000,2048


With 16-bit floating-point precision (FP16), each parameter takes 2 bytes, so we can estimate the memory footprint of the different model sizes:


| Model     | Parameters  | Size (FP16) |
|-----------|------------|------------|
| **Mini**     | 125M  | **0.25 GB** |
| **Base**     | 1.3B  | **2.6 GB**  |
| **Standard** | 6.7B  | **13.4 GB** |
| **Large**    | 30B   | **60 GB**   |
| **Huge**     | 121.3B | **242.6 GB** |
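The table values follow from simple arithmetic: parameter count times 2 bytes. The sketch below reproduces them, assuming 1 GB = 10^9 bytes and counting weights only (activations and other runtime memory are ignored). The names `PARAMETER_COUNTS` and `fp16_footprint_gb` are illustrative and not part of `galai`.

```python
BYTES_PER_PARAM_FP16 = 2  # 16 bits = 2 bytes per parameter

# Parameter counts taken from the ModelInfo overview above.
PARAMETER_COUNTS = {
    "mini": 125.0e6,
    "base": 1.3e9,
    "standard": 6.7e9,
    "large": 30.0e9,
    "huge": 121.3e9,
}

def fp16_footprint_gb(num_params):
    """Weights-only memory footprint in GB, assuming 1 GB = 1e9 bytes."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9

for name, num_params in PARAMETER_COUNTS.items():
    print(f"{name:>8}: {fp16_footprint_gb(num_params):.2f} GB")
```

In practice the model needs somewhat more memory than this estimate, since generation also allocates activations.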

Choose the model size according to the computing capacity your system has available. The cell below loads the `base` model so that it fits on Colab. If this still consumes too many resources, switch to `mini`; if you have more memory, try `standard` or larger.

In [4]:
# Load a smaller model for Colab compatibility
model = gal.load_model(
    name="base",  # change this for other model sizes (e.g. "mini" or "standard")
    dtype="float16",
    # num_gpus=None,    # If you have multiple GPUs, you can specify the number of GPUs to use
    parallelize=None, # Set to True to parallelize the model across multiple GPUs
    )


**Remark:** Models smaller than `standard` can produce poor results in the examples that follow. Use the largest model your hardware allows, and feel free to experiment with different model sizes.

## 2. Text Generation

Galactica can generate high-quality scientific text based on prompts. This is useful for research papers, blog posts, or academic discussions.

In [5]:
prompt = "What is Natural Language Processing?"
output = model.generate(prompt, max_new_tokens=100)
print(output)

What is Natural Language Processing?, Manning[END_REF]).

# 2.2.2.2.2.3.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2


## 3. Generating Citations

Galactica has been trained to recognize and suggest citations using `[START_REF]` tokens.

This feature can help researchers find references without manual searches.

Keep in mind that Galactica was trained in 2022 and therefore has a knowledge cutoff in that year.

In [6]:
model.generate("The Transformer architecture [START_REF]")

'The Transformer architecture [START_REF] Attention is All you Need, Vaswani[END_REF] is a popular choice for sequence-to-sequence models. It consists of a stack of encoder and decoder layers, each of which is composed of a multi-head self-attention mechanism and a feed-forward network. The encoder is used to encode the'
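Because references are wrapped in `[START_REF]` and `[END_REF]`, they can be pulled out of a generation with a regular expression. This is a minimal sketch, assuming the `Title, First author` layout seen in the output above; `extract_references` is a hypothetical helper and not part of `galai`.

```python
import re

# Non-greedy match of everything between the two reference markers.
REF_PATTERN = re.compile(r"\[START_REF\]\s*(.*?)\s*\[END_REF\]", re.DOTALL)

def extract_references(text):
    """Return a list of {'title', 'author'} dicts found in a generated text."""
    references = []
    for raw in REF_PATTERN.findall(text):
        # Split on the last ", " because titles may themselves contain commas.
        title, separator, author = raw.rpartition(", ")
        if not separator:  # no author part present
            title, author = raw, None
        references.append({"title": title, "author": author})
    return references

sample = (
    "The Transformer architecture [START_REF] Attention is All you Need, "
    "Vaswani[END_REF] is a popular choice for sequence-to-sequence models."
)
print(extract_references(sample))
```

An unterminated `[START_REF]` (for example when generation stops early) is simply ignored by this pattern.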

If we ask the model for more recent papers, we get, as with many LLMs, hallucinated results:

In [7]:
model.generate("The new LLM DeepSeek [START_REF]")

'The new LLM DeepSeek [START_REF] DeepSeek: A New Multi-Scale Deep Convolutional Neural Network for Fast Image Retrieval, Liu[END_REF] is a multi-scale CNN that uses a multi-scale feature pyramid to extract features from the input image. The features are then fed into a fully connected layer to generate a feature vector. The feature'

## 4. Handling Mathematical Expressions

Galactica was also trained on mathematical formulas and can generate and understand LaTeX equations, which is beneficial for scientific writing. The `galai` library even ships its own LaTeX and markdown display functions for appropriate formatting in notebooks. (Because of differing LaTeX conventions, math output sometimes fails to render in local Jupyter notebooks.)

In [8]:
from galai.notebook_utils import display_latex

math_prompt = "The Riemann zeta function is given by: \\["
math_output = model.generate(math_prompt, max_new_tokens=100)
display_latex(math_output)

#### new_doc = True:
Galactica was trained on scientific documents that were separated using `</s>`, which marks the end of one document and the beginning of another. By adding `</s>` before your prompt, it signals to the model that this is the start of a new document rather than a continuation.

Without the parameter, the model treats the prompt as part of an ongoing text, possibly completing an unfinished sentence. To prevent this, we set `new_doc = True`.

In [9]:
from galai.notebook_utils import  display_markdown
display_markdown(model.generate("# The Hitchhiker's Guide to the Galaxy (novel) Wikipedia Entry\n\n", new_doc=True,  max_new_tokens=200))

## 5. Step-by-Step Reasoning

The `<work>` token guides Galactica through logical step-by-step reasoning. This is useful for breaking down complex problems into smaller logical steps.

Note that this predates dedicated reasoning models and was achieved entirely through clever training and special tokens.


In [10]:
reasoning_prompt = "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?\n\n<work>"
reasoning_output = model.generate(reasoning_prompt, max_new_tokens=100)
print(reasoning_output)

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?

<work>

The bat costs $1.00 more than the ball.

calc_1.py
```
result = 1.00+1.10

with open("output.txt", "w") as file:
    file.write(str(round(result)))
```

<<run: "calc_1.py">>

<<read: "output.txt">>


**Remark**: The `huge` model solves this task correctly, arriving at **0.05**, when using step-by-step reasoning. Model size plays a decisive role for reasoning in particular.
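For comparison with the model's attempt, the correct answer can be checked by hand. With ball price `x`, the bat costs `x + 1.00`, so `x + (x + 1.00) = 1.10` and `x = (1.10 - 1.00) / 2`. The variable names below are illustrative only:

```python
total = 1.10       # bat + ball
difference = 1.00  # bat - ball

# Solve x + (x + difference) = total for the ball price x.
ball = (total - difference) / 2
bat = ball + difference

print(f"ball = {ball:.2f}, bat = {bat:.2f}")
```

The small `base` model instead adds the two numbers, which is the intuitive but wrong shortcut.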

In [11]:
output = model.generate(
        "What is the $7$-th harmonic number of the second order? Answer with Python source code.\n\n<work>",
        max_new_tokens=700,
    )

The output should follow the reasoning format, including generated Python code that performs the calculation.

For example, like this:


In [12]:
# A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?

# <work>

# The bat costs $1.00 more than the ball, so the ball costs $1.00 + $1.10

# calc_1.py
# ```
# result = 1.00+1.10

# with open("output.txt", "w") as file:
#     file.write(str(round(result)))
# ```

# <<run: "


#### TASK 3.3

Implement a function that extracts the code block from the Galactica reasoning output and executes it. Note that not every reasoning output from Galactica contains a code block.

In [15]:
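import re

def calculate_reasoning(model_output):
    """Extract the first fenced code block from a Galactica output and execute it."""
    ### IMPLEMENT YOUR SOLUTION HERE ###

    # Non-greedy match of the first fenced block; an optional "python" tag is skipped.
    match = re.search(r'```(?:python)?\s*\n(.*?)```', model_output, re.DOTALL)

    if not match:
        return "No code block found."

    code_block = match.group(1).strip()

    try:
        # Caution: exec runs model-generated code without any sandboxing.
        local_variables = {}
        exec(code_block, {}, local_variables)

        # Variables assigned by the executed code, e.g. {'result': 2.1}
        return local_variables
    except Exception as e:
        return f"Error in code: {str(e)}"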

In [16]:
test_output = 'A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?\n\n<work>\n\nThe bat costs $1.00 more than the ball, so the ball costs $1.00 + $1.10 \n\ncalc_1.py\n```\nresult = 1.00+1.10\n\nprint(result)\n```\n\n<<run: "'
calculate_reasoning(test_output)

## this should print 2.1

2.1


{'result': 2.1}

Let's try a harder problem where more code is required. This needs a bit more computing power, since more tokens have to be generated for the complete calculation.

In [17]:
output = model.generate(
        "What is the $7$-th harmonic number of the second order? Answer with Python source code.\n\n<work>",
        max_new_tokens=500,
    )

In [18]:
print(output)

What is the $7$-th harmonic number of the second order? Answer with Python source code.

<work>

The $7$-th harmonic number of the second order is $\dfrac{1}{2}\left(\dfrac{1}{2}+\dfrac{1}{4}+\dfrac{1}{8}+\dfrac{1}{16}+\dfrac{1}{32}+\dfrac{1}{64}\right)$.

Let's find the $7$-th harmonic number of the first order.

$\begin{aligned} \dfrac{1}{2}\left(\dfrac{1}{2}+\dfrac{1}{4}+\dfrac{1}{8}+\dfrac{1}{16}+\dfrac{1}{32}+\dfrac{1}{64}\right) &= \dfrac{1}{2}\left(\dfrac{1}{2}+\dfrac{1}{4}+\dfrac{1}{8}+\dfrac{1}{16}+\dfrac{1}{32}+\dfrac{1}{64}\right)+\dfrac{1}{2}\left(\dfrac{1}{2}+\dfrac{1}{4}+\dfrac{1}{8}+\dfrac{1}{16}+\dfrac{1}{32}+\dfrac{1}{64}\right) \\\\ &= \dfrac{1}{2}\left(\dfrac{1}{2}+\dfrac{1}{4}+\dfrac{1}{8}+\dfrac{1}{16}+\dfrac{1}{32}+\dfrac{1}{64}\right)+\dfrac{1}{2}\left(\dfrac{1}{2}+\dfrac{1}{4}+\dfrac{1}{8}+\dfrac{1}{16}+\dfrac{1}{32}+\dfrac{1}{64}\right) \\\\ &= \dfrac{1}{2}\left(


In [19]:
calculate_reasoning(output)

'No code block found.'
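The `base` model never reaches a code block here, and its derivation is also off. For reference, this is the kind of code a correct answer would contain. The sketch assumes that "harmonic number of the second order" means the generalized harmonic number, i.e. the sum of `1/k**2` for `k = 1..n`; `generalized_harmonic` is an illustrative name.

```python
from fractions import Fraction

def generalized_harmonic(n, order):
    """Sum of 1 / k**order for k = 1..n, computed exactly with fractions."""
    return sum(Fraction(1, k ** order) for k in range(1, n + 1))

value = generalized_harmonic(7, 2)
print(value, float(value))
```

Using `Fraction` avoids floating-point rounding, so the result can be compared exactly against a model's answer.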

## 6. Text Summarization

Galactica can generate concise summaries of longer texts, making it useful for academic papers and research abstracts. For summarization, we append `\n\nTLDR:` to the text in our prompt to trigger the desired summary.

In [20]:
TEXT = """Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6% and 52.9%. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench. We believe these results demonstrate the potential for language models as a new interface for science. We open source the model for the benefit of the scientific community."""
summary = model.generate(TEXT + "\n\nTLDR:", max_new_tokens=100)
print(summary)

Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6

## 7. Sentiment Analysis


With a few labeled examples in the prompt as guidance, the model can also determine the sentiment of a new text.

In [21]:
prompt = """Post: "I hate it when my phone battery dies."
Sentiment: Negative
###
Post: "My day has been 👍"
Sentiment: Positive
###
Post: "This is the link to the article"
Sentiment: Neutral
###
Post: "This new music video was incredible"
Sentiment:"""
output=model.generate(prompt, max_length=90)
display_markdown(output)

In [22]:
prompt = """Movie Review: "The movie was great, but the ending was disappointing."
Sentiment: Negative
\n
Movie Review: "The movie was a bit slow, but the acting was good."
Sentiment: Neutral
\n
Movie Review: "The movie was amazing, I loved it!"
Sentiment: Positive
\n
Movie Review: "The movie was terrible, I hated it."
Sentiment:"""

output=model.generate(prompt, max_new_tokens=3)
display_markdown(output)

## 8. Question-Answering Knowledge

Galactica can answer technical and scientific questions by retrieving stored knowledge. It also tries to back up this knowledge with references, marked with the `[START_REF]` and `[END_REF]` tokens.

In [23]:
prompt = "Question: What is a Transformer model?\n\nAnswer:"
output = model.generate(prompt, max_new_tokens=100)
print(output)

Question: What is a Transformer model?

Answer: A Transformer is a type of neural network that is used to model sequential data. It is a type of recurrent neural network (RNN).

A Transformer is a type of neural network that is used to model sequential data. It is a type of recurrent neural network (RNN).

A Transformer is a type of neural network that is used to model sequential data. It is a type of recurrent neural network (RNN).

A Transformer is a type of neural


The multiple modalities that Galactica is able to work with allow us to query for papers using math, source code, etc.:

In [24]:
prompt = """The paper that presented a novel computing block given by the formula:
\\[
f(Q, K, V) = \\textrm{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V
\\]

"""
reference = model.generate_reference(prompt)
display_markdown(f"**Prompt**: {prompt}\n\n**Reference**: {reference}")

In [25]:
prompt = """```python
while k > 1:
    if k % 2 == 0:
        k = k // 2
    else:
        k = 3 * k + 1
```

A paper studying if the loop above terminates for all positive integers """
reference = model.generate_reference(prompt)
display_markdown(f"**Prompt**:\n{prompt}\n\n**Reference**: {reference}")

You can get multiple reference suggestions for a given prompt by setting the `suggestions` parameter. With `suggestions > 1`, beam search decoding is used to generate more suggestions.

In [26]:
for reference in model.generate_reference("A survey paper on nlp models",suggestions=5):
    print(reference)

Natural Language Processing (Almost) from Scratch, Collobert
A survey of deep neural network architectures and their applications, Liu
A survey of deep neural network architectures and their applications, Liu
A survey of deep neural network architectures and their applications, Liu
A survey of deep neural network architectures and their applications, Liu


In [27]:
for reference in model.generate_reference(
    "A survey paper on nlp models",
    suggestions=5, diversity_penalty=0.9
):
    print(reference)



Natural Language Processing (Almost) from Scratch, Collobert
Neural Network Models for Natural Language Processing, Goldberg
A Survey of Neural Network Models for Natural Language Processing, Goldberg
Survey on Natural Language Processing, Joshi
A survey on natural language processing models, Soni


This example shows how important the choice of tokens during generation is, in addition to model training and prompting. The greedy method is not always the best way to select the next token. More complex search algorithms may take a little longer, but they can produce significantly better results.

It is comparable to a game of chess. The player who only ever thinks one move ahead will not play as well as a player who takes the next 5 moves or more into account.

We will learn more about that in the next section.

## 9. Text Generation Sampling

Galactica builds on the Hugging Face `transformers` library, so we can compare different decoding methods for text generation. We compare them with the following prompt:

In [28]:
prompt = "Title: A Literature Review on Sentiment Analysis\n\n# Abstract\n"

### Greedy Decoding

This is the standard algorithm used by `Model.generate`. Using the prompt and already generated tokens, the model computes a probability distribution of the next token over all tokens. The token with the highest score is appended to the generated text and the process is repeated.

In [29]:
output = model.generate(prompt, max_new_tokens=150)
display_markdown(output)
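The greedy procedure described above can be sketched on a toy example. `TOY_MODEL` is a hypothetical stand-in for a language model: it maps the last token to a next-token probability distribution, whereas a real model conditions on the whole prefix.

```python
# Hypothetical toy "language model": last token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def greedy_decode(model, start="<s>", end="</s>", max_new_tokens=10):
    """Always append the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        distribution = model[tokens[-1]]
        next_token = max(distribution, key=distribution.get)
        tokens.append(next_token)
        if next_token == end:
            break
    return tokens

print(greedy_decode(TOY_MODEL))
```

Greedy decoding picks `the` (0.6) and then `cat` (0.5), giving a sequence probability of 0.3, although the sequence starting with `a` reaches 0.36.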

### Beam Search

In Beam Search, the model computes a probability distribution of the next token over all tokens for each of the `num_beams` generated sequences. The `num_beams` sequences with the highest probability are kept and the process is repeated.

In [30]:
output = model.generate(prompt, num_beams=5, max_new_tokens=150)
display_markdown(output)

You can return up to `num_beams` sequences by specifying `num_return_sequences`.

Beam search is slower and requires more memory compared to the Greedy Decoding. The increase in memory consumption is proportional to the number of beams used.
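To make the beam procedure concrete, here is a minimal sketch on a toy example. `TOY_MODEL` is a hypothetical stand-in for a language model (it only looks at the last token), and scores are summed log-probabilities. This is not the `transformers` implementation, which adds length penalties and other refinements.

```python
import math

# Hypothetical toy "language model": last token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(model, start="<s>", end="</s>", num_beams=2, max_new_tokens=10):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:  # finished sequences are carried over unchanged
                candidates.append((tokens, score))
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams

best_tokens, best_score = beam_search(TOY_MODEL)[0]
print(best_tokens, round(math.exp(best_score), 2))
```

With two beams the search keeps the initially less likely `a` alive and finds a sequence with probability 0.36, which always choosing the most probable next token would miss (that path reaches only 0.3).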

### Contrastive Search

The contrastive search ([Su et al.](https://arxiv.org/abs/2202.06417), [Su et al.](https://arxiv.org/abs/2210.14140)) algorithm is a generation method that aims to produce more natural texts by penalizing repetitions. We can use the `transformers` implementation (see more at https://huggingface.co/blog/introducing-csearch) by specifying `penalty_alpha` and `top_k`.

In [31]:
# contrastive search: requires `penalty_alpha` together with `top_k` (0.6 is an illustrative value)
output = model.generate(prompt, top_k=5, penalty_alpha=0.6, max_new_tokens=150)
display_markdown(output)
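The selection rule behind contrastive search can be sketched for a single step. As I understand the papers linked above, each of the `top_k` candidates is scored as `(1 - penalty_alpha) * probability - penalty_alpha * max_similarity`, where the similarity is the cosine between the candidate's representation and those of the tokens generated so far. The 2-D vectors and the helper names below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_step(candidates, context_vectors, penalty_alpha=0.6):
    """Pick one token from (token, probability, vector) candidates."""
    def score(candidate):
        _, prob, vector = candidate
        # Penalty: similarity to the most similar previously generated token.
        degeneration = max(cosine(vector, h) for h in context_vectors)
        return (1 - penalty_alpha) * prob - penalty_alpha * degeneration
    return max(candidates, key=score)[0]

# Hypothetical state: one previous token whose vector equals that of "review".
context = [(1.0, 0.0)]
candidates = [("review", 0.6, (1.0, 0.0)), ("survey", 0.4, (0.0, 1.0))]

print(contrastive_step(candidates, context, penalty_alpha=0.6))
print(contrastive_step(candidates, context, penalty_alpha=0.0))
```

With `penalty_alpha=0.0` the rule reduces to picking the most probable candidate; a larger value favors tokens that differ from what has already been generated.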

You can find more options in the Hugging Face Transformers documentation: https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig

If you are interested in decoding and sampling, you can read further in the blog post by von Platen on Hugging Face: https://huggingface.co/blog/how-to-generate

## 10. Composition


Galactica models are able to mix & combine scientific modalities, stored knowledge and generalize to new tasks.

In [32]:
output = model.generate("""Question: Translate the following python code:

```python
def cheapestProduct(products: List[Product]) -> Product:
    return min(products, key=lambda p: p.price)
```

into C++.

Answer:""", max_new_tokens=130)
print(output)

Question: Translate the following python code:

```python
def cheapestProduct(products: List[Product]) -> Product:
    return min(products, key=lambda p: p.price)
```

into C++.

Answer:

```
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <cmath>
#include <cstdio>
#include <cmath>
#include <cstring>
#include <cmath>
#include <cstdio>
#include <cmath>
#include <cstdio>
#include <cmath>
#include <cstdio>
#include <cmath>
#include <cstdio>



In [33]:
output = model.generate("""Question: Translate the following math formula:

\\[
  \\zeta(s) = \\sum_{n=1}^{\\infty} n^{-s}
\\]

into plain English.

Answer:""", max_new_tokens=100)
print(output)

Question: Translate the following math formula:

\[
  \zeta(s) = \sum_{n=1}^{\infty} n^{-s}
\]

into plain English.

Answer: \(\zeta(s)=\sum_{n=1}^{\infty} n^{-s}=\sum_{n=1}^{\infty} \frac{1}{n^{s}}\)

## Exercise \(\PageIndex{1}\)

Translate the following math formula:

\[
  \zeta(s) = \sum_{n=


## 11. Few-Shot Prompting

As we did with the sentiment analysis, we can use examples to teach the model our desired result format before we ask for the actual answer.
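A few-shot prompt is just labeled examples followed by an unlabeled query in the same layout. The sketch below assembles one in the `Post:` / `Sentiment:` format used in the sentiment section; `build_few_shot_prompt` and its parameters are hypothetical helpers, not part of `galai`.

```python
def build_few_shot_prompt(examples, query, input_label="Post",
                          output_label="Sentiment", separator="\n###\n"):
    """Join labeled examples and leave the final label open for the model."""
    blocks = [
        f'{input_label}: "{text}"\n{output_label}: {label}'
        for text, label in examples
    ]
    # The query uses the same layout but ends right after the output label.
    blocks.append(f'{input_label}: "{query}"\n{output_label}:')
    return separator.join(blocks)

examples = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("This is the link to the article", "Neutral"),
]
prompt = build_few_shot_prompt(examples, "This new music video was incredible")
print(prompt)
```

The resulting string can be passed to `model.generate` with a small `max_new_tokens` value so that only the label is produced.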

In [34]:
display_markdown(model.generate("""Question: does "kayak" read the same backward as forward? Answer with code.

Code:

```python
def is_palindrome(s):
    return s == s[::-1]
```

Answer: `is_palindrome("kayak")`.

Question: An $i$-th Peanut Butter number is given by the formula $pb_i = \\prod_{k=2}^{i} \\frac{1}{1-1/k}$. An $i$-th Jelly number is given by $J_i = \\sum_{k=2}^{i} pb_k$. What is the 6-th Jelly number? Answer with code.
""", max_new_tokens=150))

#### TASK 3.4

Use few-shot prompting to classify the following paper into a category based on its abstract.

Hu et al. 2021, "LoRA: Low-Rank Adaptation of Large Language Models" (https://arxiv.org/abs/2106.09685)

Abstract:
> An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at this https URL (https://github.com/microsoft/LoRA).

In [35]:
### IMPLEMENT YOUR SOLUTION HERE ###
display_markdown(model.generate("""Question: How would you classify the sentence "Machine Learning is cool"? Answer with code.

Code:

```python
def classify(sentence):

    examples = [
        ("This paper discusses an efficient approach to train large-scale deep learning models using model pruning, quantization, and distillation techniques, which reduce the computational cost and memory usage of the models without sacrificing performance.",
         "Machine Learning/Artificial Intelligence - Model Efficiency and Adaptation"),

        ("We propose a new method for named entity recognition in texts that uses a hybrid of rule-based and deep learning models to improve the extraction of entities such as names, dates, and locations.",
         "Natural Language Processing"),

        ("In this paper, we introduce a novel deep convolutional neural network architecture for image classification that achieves state-of-the-art performance on the CIFAR-10 dataset.",
         "Computer Vision"),

        ("We explore the use of deep reinforcement learning for game-playing agents, focusing on training an agent to play Atari games using Q-learning with a deep neural network.",
         "Reinforcement Learning")
    ]

    for example, category in examples:
        if example in sentence:
            return category

    return "Unknown Category"

```
Answer: `classify("Machine Learning is cool")`.

Question: How would you classify this paper based on its abstract; An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at this https URL (https://github.com/microsoft/LoRA)? Answer with code.
""", max_new_tokens=150))

## 12. Tokenization


As we have already observed, the Galactica models use different special tokens that help to structure knowledge and tasks. All Galactica models share the same vocabulary of 50000 tokens. The vocabulary was trained on 2% of the training corpus using Byte-Pair Encoding (BPE) tokenization.

The following provides an overview of the tokens and what makes Galactica special.

### Special Tokens

Some of the tokens (e.g., the already mentioned `[START_REF]` or `<work>`) are special control tokens that can be used to steer model generation towards a specific type of content.


`<unk>` - reserved.

`<s>` - reserved.

`</s>` - end-of-document token used to split documents during training. Prepending this token to a prompt (see the `new_doc` parameter in `Model.generate`) biases the model towards generating a new document.

`<pad>` - a standard padding token to align sequences in a batch.

`[START_REF]` and `[END_REF]` - markers denoting a reference to a paper. Each paper is represented as `Title, First author name`, e.g., `[START_REF] Backpropagation Applied to Handwritten Zip Code Recognition, LeCun[END_REF]`.

`[IMAGE]` - a placeholder for an image removed from a text.

`<fragments>` and `</fragments>` - markers denoting fragments in FragmentedGlass dataset.

`<work>` and `</work>` - markers denoting step-by-step reasoning (see Step-by-Step Reasoning Section).

`[START_SUP]`, `[END_SUP]`, `[START_SUB]` and `[END_SUB]` - markers used to protect superscript and subscript digits from NFKC normalization. The tokenizer uses the standard NFKC rules, which means that `x²⁵` would be tokenized in the same way as `x25`. To prevent this, `x²⁵` is encoded as `x[START_SUP]25[END_SUP]`.

`[START_DNA]`, `[END_DNA]`, `[START_AMINO]`, `[END_AMINO]`, `[START_SMILES]`, `[END_SMILES]`, `[START_I_SMILES]` and `[END_I_SMILES]` - markers denoting special sequences, respectively: nucleic acid sequences, amino acid sequences, canonical simplified molecular-input line-entry system (SMILES) strings and isomeric SMILES strings. Besides marking a sequence of a given type, these tokens force a special tokenization mode in which each character is represented as a single token. For example, `GATTACA` is tokenized as `G|ATT|ACA`, while `[START_DNA]GATTACA[END_DNA]` is tokenized as `[START_DNA]|G|A|T|T|A|C|A|[END_DNA]`. Note that for this to work you need to transform your prompt with `galai.utils.escape_custom_split_sequence`. All standard text generation functions of `galai.model.Model` do this automatically.

The `galai` library takes care of handling the special tokens. If you are using `tokenizers` directly, you most likely want to keep the special tokens in the output for further processing. To do so, set `skip_special_tokens=False` in `tokenizers.Tokenizer.decode`.
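The NFKC issue behind `[START_SUP]` and `[END_SUP]` can be reproduced with Python's standard `unicodedata` module. The sketch below first shows the collapse of `x²⁵` to `x25`, then a hypothetical `protect_superscripts` helper that wraps superscript digits in the markers before normalization. The helper is an illustration only and not how `galai` or the training pipeline implements it.

```python
import re
import unicodedata

# NFKC maps superscript digits to plain digits, losing the exponent.
print(unicodedata.normalize("NFKC", "x²⁵"))

SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"

def protect_superscripts(text):
    """Wrap runs of superscript digits in [START_SUP]...[END_SUP] (hypothetical helper)."""
    def wrap(match):
        plain_digits = unicodedata.normalize("NFKC", match.group(0))
        return f"[START_SUP]{plain_digits}[END_SUP]"
    return re.sub(f"[{SUPERSCRIPT_DIGITS}]+", wrap, text)

print(protect_superscripts("x²⁵"))
```

After this step the text survives NFKC normalization unchanged, because the exponent is now carried by the markers instead of by special Unicode characters.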

### Decoupling of Tokens

The BPE training algorithm creates a vocabulary based on the frequencies of subwords in the training corpus, with more frequent subwords being represented by fewer tokens. This means that visually similar subwords may end up having totally different token representations. For example, in the GPT-2 tokenizer (trained before 2020) each of the numbers `{2000, 2001, ..., 2020}` is encoded with a unique token, and all of the numbers `{2021, 2022, ..., 2030}` are represented as two tokens: `20|21`, `20|22`, etc. When training on a corpus with math, TeX formulas and source code, it can happen that a single token encodes multiple independent functions. For example, `\(-` can end up being a single token, making prompting more difficult and the model less robust to changes in spacing.
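The frequency-driven merging can be illustrated with a single BPE training step on a made-up corpus of year strings. This is a toy sketch of the general algorithm, not the Galactica or GPT-2 tokenizer; the corpus counts and function names are invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Made-up corpus: frequent older years, one rare newer year.
corpus = {tuple("2019"): 50, tuple("2020"): 40, tuple("2021"): 1}
pair = most_frequent_pair(corpus)
print(pair, merge_pair(corpus, pair))
```

Repeating such merges turns frequent strings into single tokens, while rare ones stay split into several pieces, which is the effect described for the year numbers above.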

To prevent this issue, Galactica implements custom splitting rules, presented in the example below. For performance reasons, a leading space is kept (i.e., ` text` can be a single token).

In [36]:
from galai.utils import escape_custom_split_sequence
from IPython.display import HTML
import html

def tokenization_example(tokenizer, text):
    text = escape_custom_split_sequence(text)
    tokens = [tokenizer.decode([x], skip_special_tokens=False) for x in tokenizer.encode(text)]
    spans = "</span><span>".join([html.escape(t).replace(" ", "▁").replace("\n", "\\n") for t in tokens])
    style = "<style>.tok-examp1e > span {border: 1px solid #555; padding: 4px 6px; margin: 2px; background: #f8f8f8}</style>"
    return HTML(style + "<div class='tok-examp1e' style='display: flex; flex-wrap: wrap'><span>" + spans + "</span></div>")

tokenization_example(model.tokenizer, r"""Tokenization of most of the natural texts is not impacted by the rules.
However, most of the non-alphanumeric ASCII characters are split. This is mostly visible in TeX formulas,
for example: $\frac{d}{dx}\,\cos(x) = -\sin(x)$, \(\zeta(s)=\sum_{n=1}^{\infty} n^{-s}\).
It also impacts source codes, like: x+=((1,2));
As a side-effect, contractions (I'll, you've, it's, etc.) and emoticons (like this Santa Claus *<|:‑) ) are split.
This rule makes exception for a repeated sequence of the same character, so f.e., ---------------- is still a single token.
Additionally, EOL character is always split, so that




are 5 tokens.
Numbers are split into individual digits as before, f.e., $$\pi=3.14159265\ldots$$
Note that non-alphanumeric splitting splits space in front as well (f.e., i ++, x <-> y, if ( x <= y )).
Special tokens like [START_REF], <work> or [IMAGE] are left intact.
The tokenizer additionally supports custom sequence splitting (does not work by default, requires a custom preprocessing step), f.e.:
[START_DNA]GATTACA[END_DNA], [START_AMINO]PEPTIDES[END_AMINO],
[START_SMILES]CC(=O)NCCC1=CNc2c1cc(OC)cc2[END_SMILES] and [START_I_SMILES]CN1CCC[C@H]1c2cccnc2[END_I_SMILES]""")


## 13. Limitations and Pitfalls

While Galactica language models enable one to analyze and work with scientific data in multiple new ways, it is important to understand the shortcomings of the models. We explore some of these limitations in the following.


### Hallucinations

In [37]:
print(model.generate("# Götz-Henrik Wiegand\n", max_new_tokens=120))

# Götz-Henrik Wiegand

 Götz-Henrik Wiegand (born 1960) is a German politician of the Christian Democratic Union (CDU) who has been serving as a member of the Bundestag from the state of Hesse since 2017.

## Political career

 Wiegand was elected to the Bundestag in the 2017 German federal election. In parliament, he serves on the Committee on Foreign Affairs.

## External links

* Official website (in German)
* Bundestag biography (


Hallucinations are especially visible with prompts built on incorrect assumptions, that is, prompts that already contain made-up statements. Here are some examples:

In [38]:
display_latex(
    model.generate(
        "The Einstein-Handschuh-Wiegand equation is given by:\n",
        max_new_tokens=200,
        new_doc=True
    )
)

In [39]:
print(
    model.generate(
        "Question: what was the main reason that lead to the duel between Yann LeCun and Jürgen Schmidhuber?\n\nAnswer:"
    )
)

Question: what was the main reason that lead to the duel between Yann LeCun and Jürgen Schmidhuber?

Answer: Schmidhuber's work was more focused on the development of a general purpose learning algorithm, while LeCun's work was more focused on the development of a specific learning algorithm.


In [40]:
print(
    model.generate(
        "Question: what is the largest prime number?\n\nAnswer:"
    )
)

Question: what is the largest prime number?

Answer: $2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2^{2


### Multilingual

The Galactica models are not multilingual by design. Most of the natural-language documents in the NatureBook corpus are written in **English**, so prompting in other languages produces more random generations.

In [41]:
print(model.generate(" # Galaxie\nEine Galaxie ist eine Ansammlung von Sternen,", new_doc=True, max_new_tokens=65))

 # Galaxie
Eine Galaxie ist eine Ansammlung von Sternen, die sich in einem einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einzigen, einz


An English translation of the output:
> A galaxy is a collection of stars that are located in a single, single, single, single, ...

The model starts a plausible sentence and then degenerates into repeating the word "einzigen" ("single").
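
The German completion above collapses into repeating a single word. A minimal sketch of a check that flags such loops is shown below. It assumes simple whitespace splitting, and `repetition_ratio` is a hypothetical helper, not part of `galai`.

```python
from collections import Counter

def repetition_ratio(text: str, n: int = 1) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repetition.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)

print(repetition_ratio("x x x x"))  # 0.75
print(repetition_ratio("a b c d"))  # 0.0
```

A value close to 1 suggests a degenerate loop. Any threshold for rejecting an output would have to be chosen empirically.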

In [42]:
print(model.generate("Question: how do you say 'Good morning' in German?\n\nAnswer:", max_new_tokens=65)) # correct would be "Guten Morgen"

Question: how do you say 'Good morning' in German?

Answer: Good morning


In [43]:
print(model.generate("Question: how do you say 'Good morning' in Italian?\n\nAnswer:", max_new_tokens=65)) # correct would be "Buongiorno"

Question: how do you say 'Good morning' in Italian?

Answer: 'Già'


### Knowledge Cutoff

The NatureBook corpus was assembled in July 2022, so the models have no information about anything that happened afterwards.

In [44]:
print(model.generate("# Elizabeth II\n"))

# Elizabeth II

 Elizabeth II may refer to:

* Elizabeth II (1558–1618), Queen of England and Scotland
* Elizabeth II (1618–1658), Queen of England and Scotland
* Elizabeth II (16


In [45]:
print(model.generate("Question: What year is it?\n\nAnswer:", new_doc=True))

Question: What year is it?

Answer: 1999

Question: What is the year of the year?

Answer: 1999

Question: What is the year of the year?

Answer: 1999

Question: What is the year of the year?



### Prompt Robustness


#### Spelling Errors
A large part of the NatureBook corpus consists of documents written in formal, technical language. As a result, the model output may change depending on spelling, punctuation, and grammatical errors in a prompt.

**Reality**: Jürgen Schmidhuber was born in 1963 and studied at the Technische Universität München.

In [46]:
print(model.generate("# Jürgen Schmithuber\n"))  # a typo in the last name

# Jürgen Schmithuber

 Jürgen Schmithuber (born 1953) is a German politician of the Christian Democratic Union (CDU) who has been serving as a member of the Bundestag from the state of North Rhine-Westphalia since 2005.


In [47]:
print(model.generate("# Jürgen Schmidhuber\n")) # correct name

# Jürgen Schmidhuber

 Jürgen Schmidhuber (born 1950) is a German politician of the Christian Democratic Union (CDU) who has been serving as a member of the Bundestag from the state of North Rhine-Westphalia since 2005.


**Note:** Depending on the model size, the results may still be wrong even when the name is spelled correctly, as the output above shows.

#### TeX formula markers

Most of the documents in the NatureBook corpus use `\(` and `\)` for inline TeX formulas and `\[` and `\]` for display mode maths, but some of the data sources use `$` and `$$` instead.
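
The prompts in the following cells differ only in the delimiters around the formula. As a sketch, they could be generated from one template. The names `DELIMITERS`, `wrap_formula`, and `build_prompts` are hypothetical helpers introduced here for illustration.

```python
# Delimiter pairs for the marker styles compared in this section.
DELIMITERS = {
    "inline_paren": (r"\(", r"\)"),
    "display_bracket": (r"\[", r"\]"),
    "inline_dollar": ("$", "$"),
    "display_dollar": ("$$", "$$"),
    "plain": ("", ""),
}

TEMPLATE = (
    "Question: What is the expected value of a random variable "
    "uniformly distributed over the interval {formula}?\n\nAnswer:"
)

def wrap_formula(formula: str, style: str) -> str:
    """Surround a formula with the delimiters of the given style."""
    left, right = DELIMITERS[style]
    return f"{left}{formula}{right}"

def build_prompts(formula: str) -> dict[str, str]:
    """Build one prompt per delimiter style for the same formula."""
    return {
        style: TEMPLATE.format(formula=wrap_formula(formula, style))
        for style in DELIMITERS
    }

prompts = build_prompts("[a^2, b+c]")
print(prompts["inline_paren"])
```

Each entry of `prompts` can then be passed to `model.generate` to compare the outputs side by side.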

In [48]:
# using \( \)
display_markdown(
    model.generate(
"""Question: What is the expected value of a random variable uniformly distributed over the interval \\([a^2, b+c]\\)?

Answer:""", max_new_tokens=20)
)

In [49]:
# using \[ \]
display_markdown(
    model.generate(
"""Question: What is the expected value of a random variable uniformly distributed over the interval \\[[a^2, b+c]\\]?

Answer:""", max_new_tokens=20)
)

In [50]:
# using $
display_markdown(
    model.generate(
"""Question: What is the expected value of a random variable uniformly distributed over the interval $[a^2, b+c]$?

Answer:""", max_new_tokens=500)
)

In [51]:
# using $$
display_markdown(
    model.generate(
"""Question: What is the expected value of a random variable uniformly distributed over the interval $$[a^2, b+c]$$?

Answer:""", max_new_tokens=40)
)

In [52]:
# plaintext math
display_markdown(
    model.generate(
"""Question: What is the expected value of a random variable uniformly distributed over the interval [a^2, b+c]?

Answer:""", max_new_tokens=22)
)

In [53]:
# using $, beam search
display_markdown(
    model.generate(
"""Question: What is the expected value of a random variable uniformly distributed over the interval $[a^2, b+c]$?

Answer:""", max_new_tokens=50, num_beams=5)
)

#### Letter-case

In [54]:
print(
    model.generate("Question: what is Alzheimer's Disease?\n\n")
)

Question: what is Alzheimer's Disease?

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that is the most common cause of dementia in the elderly. It is characterized by the presence of extracellular amyloid plaques and intracellular neurofibrillary tangles. The amyloid plaques are composed of amyloid-β (Aβ) peptides, which


In [55]:
print(
    model.generate("Question: what is alzheimer's disease?\n\n")
)

Question: what is alzheimer's disease?

Answer: Alzheimer's disease (AD) is a progressive neurodegenerative disorder that is the most common cause of dementia in the elderly. It is characterized by the presence of extracellular amyloid plaques and intracellular neurofibrillary tangles. The amyloid plaques are composed of amyloid-β (Aβ) peptides,


In [56]:
print(
    model.generate("Question: what is ALZHEIMER'S DISEASE?\n\n")
)


Question: what is ALZHEIMER'S DISEASE?

Answer: ALZHEIMER'S DISEASE is a rare, chronic, progressive, neurodegenerative disorder that affects the brain and spinal cord. It is characterized by progressive dementia, ataxia, and spasticity.

ALZHEIMER'S DISEASE is a rare, chronic
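
The three prompts above differ only in the letter case of the term. A small loop makes this kind of comparison easier to repeat for other terms. This is a sketch: `case_variants` and `compare_casings` are hypothetical helpers, and `generate` stands for any callable that maps a prompt to text, such as `model.generate`.

```python
from typing import Callable

def case_variants(term: str) -> dict[str, str]:
    """Return the term as given, in lower case, and in upper case."""
    return {"original": term, "lower": term.lower(), "upper": term.upper()}

def compare_casings(
    generate: Callable[[str], str],
    term: str,
    template: str = "Question: what is {term}?\n\n",
) -> dict[str, str]:
    """Run the same question once per casing and collect the outputs."""
    results = {}
    for name, variant in case_variants(term).items():
        results[name] = generate(template.format(term=variant))
    return results

# Example with the loaded model (assumes `model` from the setup section):
# for name, text in compare_casings(model.generate, "Alzheimer's Disease").items():
#     print(f"--- {name} ---\n{text}\n")
```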


Beyond the problems outlined here, the models have further limitations, some of which are discussed in the paper (https://arxiv.org/abs/2211.09085).

---

Galactica shows how, with clever training, elements of reasoning could be incorporated into LLMs as early as 2022, though not at the level of modern models such as OpenAI's o-series models or DeepSeek.