Hugging Face's `transformers` library is the most convenient way to work with the models published on [Hugging Face](https://huggingface.co/).

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pretraining Version

The first version of the model was trained on the largest [dataset](https://huggingface.co/datasets/allenai/dolma3_mix-6T-1025) of the three training stages. If you take a closer look, you'll see it includes nearly 6T tokens (about 24TB). That may not seem like much; some high-end home NAS systems can easily store that amount.

Another key point is that roughly 75% of the data comes from [Common Crawl](https://commoncrawl.org/), an open repository of web crawl data that’s freely accessible to anyone.
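As a quick sanity check on those figures, here is a back-of-the-envelope calculation. It assumes decimal units (1TB = 10^12 bytes) and that the 75% share is measured in tokens; both are my assumptions, not something the dataset card is quoted as saying here.

```python
# Rough figures quoted in the text above.
total_tokens = 6e12        # "nearly 6T tokens"
total_bytes = 24e12        # "about 24TB", assuming decimal terabytes
common_crawl_share = 0.75  # "roughly 75%", assumed to be a share of tokens

# Average amount of raw text behind a single token.
bytes_per_token = total_bytes / total_tokens

# Approximate number of tokens that come from Common Crawl.
common_crawl_tokens = common_crawl_share * total_tokens

print(f"Bytes per token: {bytes_per_token:.1f}")
print(f"Common Crawl tokens: {common_crawl_tokens / 1e12:.1f}T")
```

About 4 bytes per token lines up with the common rule of thumb that an English token covers roughly four characters.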

In [16]:
# Olmo 3 7B. The model is ~15GB in size.
olmo3_pretraining = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-1025-7B", revision="stage1-step999000")
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-1025-7B", revision="stage1-step999000")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
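Where does the ~15GB figure come from? A minimal sketch of the estimate, assuming the weights are stored at 2 bytes per parameter (bfloat16 or float16) and taking "7B" literally as 7 billion parameters. The helper name `checkpoint_size_gb` is mine, not part of `transformers`.

```python
def checkpoint_size_gb(num_params, bytes_per_param):
    """Estimate the size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

nominal_params = 7e9  # assumption: "7B" taken at face value

for dtype_name, bytes_per_param in {"float32": 4, "bfloat16": 2}.items():
    size = checkpoint_size_gb(nominal_params, bytes_per_param)
    print(f"{dtype_name}: ~{size:.0f}GB")
```

The 2-byte estimate lands close to the size quoted in the comment; the remaining gap is presumably due to the real parameter count being somewhat above the nominal 7 billion. Note that the same weights held in float32 would need about twice as much memory.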

In [17]:
inputs = tokenizer(["Base Large Language Model is "], return_tensors='pt', return_token_type_ids=False)
print(f"Tokens count: {inputs['input_ids'].shape[1]} | Tokens: {inputs['input_ids']}")

Tokens count: 6 | Tokens: tensor([[ 4066, 20902, 11688,  5008,   374,   220]])


In [4]:
response = olmo3_pretraining.generate(**inputs, max_new_tokens=256, do_sample=True, top_k=0, temperature=1.0, top_p=0.7)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

Base Large Language Model is 13 billion parameters, and it is the 3rd most capable model in the world after the GPT-4 and Claude 2. In this article, we will see how to build and deploy a custom Llama 2 model on a Windows machine with Nvidia GPU using Azure Machine Learning.

First, we need to register our account with Hugging Face to access the Llama 2 model. You can do this by going to the Hugging Face website and clicking on the "Register" button. Once you have registered, you will be able to access the Llama 2 model from the Hugging Face website.

Next, we need to install the necessary packages to work with the Llama 2 model. You can do this by running the following command in your terminal:

pip install llama

Once the installation is complete, we can now download the Llama 2 model from the Hugging Face website. You can do this by running the following command:

wget https://huggingface.co/TheBloke/llama-2-13b-1bit-GGML-GPTQ -O llama-2-13b-1bit-GGML-GPTQ

This will download the Lla

The output is grammatical and uses real words and concepts, but it's a confused mixture of information that doesn't answer any question or serve a purpose.

The model is generating a _plausible text_ continuation based on what might appear in its training data (articles comparing GPT models); it neither understands nor responds to the implicit question of what a base model is.
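To make "plausible continuation" more concrete, here is a toy sketch of what `do_sample=True` with `top_p=0.7` does at each step: keep the smallest set of most likely tokens whose probabilities add up to at least `top_p`, renormalize, and draw one of them at random. The probabilities and the function names below are made up for illustration; this is not the `transformers` implementation.

```python
import random

def nucleus_filter(probs, top_p=0.7):
    """Keep the most likely tokens until their total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {token: p / mass for token, p in kept}

def sample_next(probs, top_p=0.7, rng=random):
    """Draw one token from the filtered, renormalized distribution."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution after "Base Large Language Model is ".
toy_probs = {"13": 0.40, "a": 0.25, "the": 0.20, "not": 0.10, "zebra": 0.05}

print(nucleus_filter(toy_probs))
print(sample_next(toy_probs))
```

Nothing in this loop checks whether the text is true or useful. Each token only has to be likely given the previous ones, which is why the result reads fluently and still goes nowhere.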

In [18]:
# Let's push the model to its absolute limits - ask it to reason about math! It should just fail spectacularly.
text = ["A bakery sells cupcakes in boxes of 6. Sarah bought some boxes and ate 4 cupcakes. "
        "She then gave half of what remained to her friend. If she now has 10 cupcakes left, "
        "how many boxes did she originally buy? Show your reasoning step by step."]
# To save you some time, the real answer is 4 boxes.
print(text)

['A bakery sells cupcakes in boxes of 6. Sarah bought some boxes and ate 4 cupcakes. She then gave half of what remained to her friend. If she now has 10 cupcakes left, how many boxes did she originally buy? Show your reasoning step by step.']
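If you'd rather not take the stated answer on faith, a brute-force check over a small range of box counts confirms it. The search range of 1 to 20 boxes is an arbitrary choice of mine.

```python
BOX_SIZE = 6

def cupcakes_left(boxes):
    """Follow the story: buy the boxes, eat 4, then give away half of the rest."""
    remaining = BOX_SIZE * boxes - 4
    return remaining / 2

# Try every box count in a small range and keep those that leave exactly 10.
solutions = [boxes for boxes in range(1, 21) if cupcakes_left(boxes) == 10]
print(solutions)  # [4]
```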


In [19]:
inputs = tokenizer(text, return_tensors='pt', return_token_type_ids=False)
response = olmo3_pretraining.generate(**inputs, max_new_tokens=256, do_sample=True, top_k=0, temperature=1.0, top_p=0.7)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

A bakery sells cupcakes in boxes of 6. Sarah bought some boxes and ate 4 cupcakes. She then gave half of what remained to her friend. If she now has 10 cupcakes left, how many boxes did she originally buy? Show your reasoning step by step. If you want to use algebra, that’s fine. But you should also include your algebraic thinking as part of your written explanation.

8. In your opinion, is it more important for a person to be a good listener or a good talker? Why? Explain your reasoning.

9. You are working on a project with 4 classmates. Your teacher says you can have as many days as you need to finish it. After 3 days, you have completed 40% of the project. How many days will it take you to complete the project? Show your reasoning step by step. If you want to use algebra, that’s fine. But you should also include your algebraic thinking as part of your written explanation.

10. You are at a store that sells specialty soaps. Each bar of soap is 4 inches long and 1 inch wide. If a pac

Spectacular failure, right? Base models autocomplete text; they don't follow instructions.

In [7]:
del olmo3_pretraining  # Free the model, or I will run out of memory

# Midtraining Version

The midtraining version is based on a focused curriculum that emphasizes reasoning-intensive data to enhance capabilities. [The dataset](https://huggingface.co/datasets/allenai/dolma3_dolmino_mix-100B-1125) contains "only" 100B tokens and includes a substantial amount of STEM-related content.

In [8]:
olmo3_midtraining = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-1025-7B", revision="stage2-step9000")
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-1025-7B", revision="stage2-step9000")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [9]:
# Let's see how the midtraining version does on the same math problem.
inputs = tokenizer(text, return_tensors='pt', return_token_type_ids=False)
response = olmo3_midtraining.generate(**inputs, max_new_tokens=256, do_sample=True, top_k=0, temperature=1.0, top_p=0.7)

In [10]:
from IPython.display import display, Markdown
display(Markdown(tokenizer.batch_decode(response, skip_special_tokens=True)[0]))

A bakery sells cupcakes in boxes of 6. Sarah bought some boxes and ate 4 cupcakes. She then gave half of what remained to her friend. If she now has 10 cupcakes left, how many boxes did she originally buy? Show your reasoning step by step.  

**Answer:**  
Let $ x $ = number of boxes Sarah bought.  
- Total cupcakes: $ 6x $  
- After eating 4 cupcakes: $ 6x - 4 $  
- She gave half of the remaining to her friend: $ \frac{1}{2}(6x - 4) $  
- Final cupcakes: $ \frac{1}{2}(6x - 4) $  

Set up the equation:  
$$
\frac{1}{2}(6x - 4) = 10
$$  
Multiply both sides by 2:  
$$
6x - 4 = 20
$$  
Add 4 to both sides:  
$$
6x = 24
$$  
Divide by 6:  
$$
x = 4
$$  

**Conclusion:** Sarah originally bought **4 boxes** of cupcakes.  

---

### Problem 2: Counting Squares in a Grid  
**Question:**  
A square grid has 5 rows and 5 columns. How many squares are in the grid?  

**Answer:**  
This is a classic "counting squares" problem. The number of squares in an $ n \times n $ grid is:  


In [11]:
del olmo3_midtraining  # Free the model, or I will run out of memory

Not bad at all, right? The problem was solved correctly. However, the model continued with the next problem, which we didn’t ask it to solve. Still, it's clear that we're getting much closer to having an implicit reasoning structure in place. We'll just need to enhance and refine it later.
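Until the model learns to stop on its own, one pragmatic workaround is to cut the decoded text at the first sign of a new section. A minimal sketch, assuming the extra content starts with a `---` rule or a `###` heading as it did in this run; the helper `trim_continuation` and its stop markers are my own choices, and other runs may need different ones.

```python
def trim_continuation(generated, prompt, stop_markers=("\n---", "\n### ")):
    """Drop the echoed prompt and everything from the first stop marker onward."""
    # The decoded output normally starts with the prompt; strip it if so.
    if generated.startswith(prompt):
        completion = generated[len(prompt):]
    else:
        completion = generated

    # Find the earliest stop marker, if any.
    cut = len(completion)
    for marker in stop_markers:
        idx = completion.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()

# Shortened stand-in for the decoded model output.
demo_prompt = "How many boxes did she originally buy?"
demo_output = demo_prompt + "\n\n**Answer:** x = 4\n\n---\n\n### Problem 2: Counting Squares"
print(trim_continuation(demo_output, demo_prompt))
```

This only hides the symptom: the model still spends tokens generating the unwanted problem before it is thrown away.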

# Long Context → Base Model

In this particular case, the __base model__ is the result of the third step: training on an additional 100B tokens with longer sequences to extend the context window.

During stages 1 and 2, the model was trained on sequences of a specific length. Since the initial sequences weren't very long, the model may have struggled to maintain attention over longer distances, meaning it hadn't yet learned to preserve coherence across extended contexts. This final training dataset directly addresses that issue.
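A toy illustration of why sequence length matters. It assumes a document is simply cut into consecutive chunks of `seq_len` tokens and that tokens in different chunks never see each other during training; real pipelines pack and mask sequences in more elaborate ways, and the lengths below are hypothetical, not the ones used for Olmo 3.

```python
def pairs_seen_together(doc_len, seq_len, distance):
    """Count token pairs `distance` apart that fall inside the same training sequence."""
    return sum(
        1
        for i in range(doc_len - distance)
        if i // seq_len == (i + distance) // seq_len
    )

doc_len = 64  # hypothetical document length in tokens

for seq_len in (8, 32):
    counts = {d: pairs_seen_together(doc_len, seq_len, d) for d in (4, 16)}
    print(f"seq_len={seq_len}: pairs seen together by distance -> {counts}")
```

With the short sequences, pairs of tokens 16 positions apart never share a training sequence, so the model gets no direct signal about dependencies at that range. Longer sequences are what provide that signal.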

I'm not going to try to show how previous models might struggle with longer contexts or how this one performs significantly better. However, feel free to explore examples with context lengths of 20K+ tokens. 

I will only demonstrate that the output of this model is meaningful.

In [12]:
# Base model does not require any special revision tag.
olmo3_base = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-1025-7B")
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-1025-7B")

`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [13]:
inputs = tokenizer(["Base Large Language Model is "], return_tensors='pt', return_token_type_ids=False)

In [14]:
response = olmo3_base.generate(**inputs, max_new_tokens=256, do_sample=True, top_k=0, temperature=1.0, top_p=0.7)

In [15]:
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

Base Large Language Model is 34% better than SOTA in most tasks

Researchers at the Allen Institute for AI have developed a large language model (LLM) that outperforms existing state-of-the-art (SOTA) models on a wide range of natural language processing (NLP) tasks. The model, called ALM, is based on the GPT-3 architecture and has been fine-tuned using a dataset of over 400 billion tokens.

The ALM model was trained on a dataset of over 400 billion tokens, which is 4.5 times larger than the dataset used to train the previous SOTA model. The ALM model was also trained using a more advanced training algorithm, which allowed it to learn more complex patterns in the data.

The ALM model was evaluated on a range of NLP tasks, including text classification, question answering, and machine translation. The model achieved state-of-the-art performance on all of these tasks, outperforming the previous SOTA model by a significant margin.

The researchers believe that the ALM model’s superior per