<a href="https://colab.research.google.com/github/Signed-B/build-your-own-llm/blob/main/lecture_workbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How To Build a Large Language Model
### Presented by Beckett Hyde - 03/05/2024

Welcome to this notebook that walks you through all the steps necessary for creating your first LLM application:
+ Pulling pre-trained base model weights
+ Loading those weights into memory, including on small, consumer GPUs
+ Pulling and tokenizing a dataset (or tokenizing your own dataset)
+ Fine-tuning the model
+ Model inference

This notebook is designed to accompany the in-person presentation given on March 5th at the University of Colorado Boulder. It does not cover training a model from scratch (that is extremely time- and resource-intensive), and it takes extra steps so that the model can run on higher-end consumer hardware. No NVIDIA A100 GPUs are required: with ~15GB of VRAM, and sometimes even less, this could work for you, including on Colab's T4s.

If you are interested in skipping straight to the working solutions, see [this notebook](https://github.com/Signed-B/build-your-own-llm/blob/main/lecture_solution.ipynb). If you are interested in the theory portion of the presentation, see [this slideshow](example.com).

<hr />

This presentation was sponsored by the CU Boulder Undergraduate SIAM Chapter. A thank you to them for their support in promoting and financing the event.

<center>
<p float="left">
  <img src="https://www.colorado.edu/brand/sites/default/files/styles/medium/public/block/boulder-one-line_4.png" width="300" />
  <img width="10" hspace="10" />
  <img src="https://www.siam.org/portals/0/Logo%20Guide/logo_cobrand.png" width="300" />
</p>
</center>
<hr/>

# Step 1: Install Software

A number of packages, some with specific versions, are required for this notebook to work (especially with the modifications we make for smaller hardware). For simplicity, we provide the install commands for you. Run the cell below to prepare your environment.

### A note about Google Colab:

Ensure you are running on Colab's T4 environment, which is available as part of the free tier.

It is also possible that you will have to disconnect and reconnect to your environment for some of these installs to take effect (particularly `accelerate`).

In [1]:
!pip install -q transformers==4.34.0
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets einops

# Step 2: Download & configure the base model

Almost all interesting pre-trained models, base models included, are available on [Huggingface](https://huggingface.co). We will be downloading a specific version of the `MPT-7B` model, released in July 2023, with modifications to work on our limited hardware.

The model is `eluzhnica/mpt-7b-8k-peft-compatible` available [here](https://huggingface.co/eluzhnica/mpt-7b-8k-peft-compatible).

In [2]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "eluzhnica/mpt-7b-8k-peft-compatible"
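
As a rough sketch of what the loading step can look like, the block below first estimates how much memory the weights alone need at different precisions, then defines a loader that asks `bitsandbytes` for 4-bit weights. The 7-billion parameter count is a nominal figure, and the quantization settings (`nf4`, double quantization, `float16` compute) are assumptions chosen with a T4 in mind, not necessarily the settings used in the lecture. The helper names are hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def load_quantized_model(model_id):
    """Load a causal LM with 4-bit weights (assumed settings; needs a CUDA GPU)."""
    # Imported here so the arithmetic above works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights in 4 bits
        bnb_4bit_quant_type="nf4",             # assumed quantization data type
        bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual matrix multiplies
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map={"": 0},      # place the whole model on GPU 0
        trust_remote_code=True,  # MPT ships its modeling code with the checkpoint
    )


NOMINAL_PARAMS = 7e9  # assumption: "7B" taken at face value
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from roughly 14 GB at 16 bits to roughly 3.5 GB at 4 bits, which is what leaves room on a T4 for activations, gradients, and optimizer state. In the notebook, `model = load_quantized_model(model_id)` would be the call that actually downloads and loads the weights.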


In order for our model to work, we need to use gradient checkpointing and "reformat" the model to work with our artificially reduced bit-sizes.

In [3]:
from peft import prepare_model_for_kbit_training
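
A minimal sketch of the preparation step described above, assuming `model` is a quantized `transformers` model that has already been loaded. The wrapper name `prepare_for_training` is hypothetical; the two calls inside it are the standard `transformers` and `peft` entry points.

```python
def prepare_for_training(model):
    """Enable gradient checkpointing and adapt a quantized model for training."""
    # Imported here so that defining this helper does not require peft.
    from peft import prepare_model_for_kbit_training

    # Trade compute for memory: recompute activations in the backward pass
    # instead of keeping all of them in GPU memory.
    model.gradient_checkpointing_enable()

    # Freeze the base weights and adjust the model so that training on top
    # of the low-bit weights is numerically stable.
    return prepare_model_for_kbit_training(model)
```

In the notebook this would be used as `model = prepare_for_training(model)`.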


How big is our model? How many parameters are trainable?

In [5]:
# print("All params:", sum([param.numel() for _, param in model.named_parameters()]))
# print("Trainable params:", sum([param.numel() if param.requires_grad else 0 for _, param in model.named_parameters()]))
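
The same count can be wrapped in a small helper so it is easy to call before and after adding adapters. This is a sketch with a hypothetical function name; it only assumes the model exposes `parameters()` the way PyTorch modules do.

```python
def count_parameters(model):
    """Return (total, trainable) parameter counts for a PyTorch-style model."""
    total = 0
    trainable = 0
    for param in model.parameters():
        n = param.numel()
        total += n
        # Only parameters with requires_grad=True are updated during training.
        if param.requires_grad:
            trainable += n
    return total, trainable
```

Usage would look like `total, trainable = count_parameters(model)`. One caveat, stated with some hedging: 4-bit weights are stored packed, so `numel()` on a quantized model may report fewer elements than the nominal parameter count.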

# Step 2.5: Prepare our model for training

Our model currently has no trainable parameters. We're going to use LoRA as a training algorithm (using `peft`), with a number of hyperparameters that control how we train.
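
To make the idea concrete, here is a hedged sketch of attaching LoRA adapters with `peft`. The hyperparameter values, the `Wqkv` target module name, and the MPT-7B shapes used in the arithmetic (hidden size 4096, fused QKV projection of width 12288, 32 layers) are assumptions for illustration, and the helper names are hypothetical.

```python
def lora_added_params(d_in, d_out, rank):
    """Parameters LoRA adds to one d_in x d_out weight: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def add_lora_adapters(model, rank=8, alpha=32, dropout=0.05, target_modules=("Wqkv",)):
    """Wrap a model with LoRA adapters via peft (values are illustrative)."""
    # Imported here so the arithmetic above works without peft installed.
    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        r=rank,                               # rank of the low-rank update matrices
        lora_alpha=alpha,                     # the update is scaled by alpha / r
        lora_dropout=dropout,                 # dropout applied to the adapter input
        target_modules=list(target_modules),  # which layers receive adapters
        bias="none",                          # leave bias terms frozen
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, config)


# Assumed shapes: 4096 -> 12288 fused QKV projection in each of 32 layers.
per_layer = lora_added_params(4096, 12288, rank=8)
print(per_layer, per_layer * 32)
```

With these assumed shapes, rank 8 adds 131,072 parameters per layer and 4,194,304 in total, a tiny fraction of the base model. Raising the rank grows that number linearly.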

### What does each hyperparameter do?

TODO

How many parameters are trainable now?

In [6]:
# print("All params:", sum([param.numel() for _, param in model.named_parameters()]))
# print("Trainable params:", sum([param.numel() if param.requires_grad else 0 for _, param in model.named_parameters()]))

# Step 3: Load and tokenize our data

Huggingface, like Kaggle, also holds publicly available datasets we can use, like `vicgalle/alpaca-gpt4`, which was actually created, [ironically](https://en.wikipedia.org/wiki/Dead_Internet_theory), using GPT-4!

Each model comes with a pre-trained tokenizer we can use. This converts the text into a sequence of integer token IDs; the model's embedding layer then maps each ID into a high-dimensional space where relative proximity encodes meaning. We use the `datasets` library to manage our datasets and the `transformers.AutoTokenizer` class to pull and hold our tokenizer.
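
A sketch of pulling the tokenizer and tokenizing the dataset, under a few assumptions: that the dataset exposes a single `train` split, that the full prompt lives in a column called `text`, and that reusing the end-of-sequence token for padding is acceptable. The function names are hypothetical.

```python
def load_tokenizer(model_id):
    """Fetch the tokenizer that ships with the model."""
    # Imported here so that defining this helper does not require transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Some tokenizers define no padding token; reuse end-of-sequence (assumption).
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


def load_and_tokenize(tokenizer, dataset_id="vicgalle/alpaca-gpt4", text_column="text"):
    """Download a dataset and add token ID columns for every row."""
    from datasets import load_dataset

    dataset = load_dataset(dataset_id, split="train")  # assumed split name
    # batched=True hands the tokenizer a list of strings at a time.
    return dataset.map(lambda batch: tokenizer(batch[text_column]), batched=True)
```

In the notebook this would be `tokenizer = load_tokenizer(model_id)` followed by `data = load_and_tokenize(tokenizer)`.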

### Let's explore the data!

You aren't like those *other* data scientists and MLEs, you actually *do your job!* (please god I am done fixing your messes).
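
One quick way to look at the data is to check how long the examples are, since long rows drive memory use during training. The sketch below uses a hypothetical helper and a toy list of records; it assumes each record is a dict with a `text` field, which is also how rows come out when iterating over a `datasets` dataset.

```python
from statistics import median


def describe_lengths(records, field="text"):
    """Summarize character lengths of one text field across records."""
    lengths = sorted(len(record[field]) for record in records)
    return {
        "count": len(lengths),
        "min": lengths[0],
        "median": median(lengths),
        "max": lengths[-1],
    }


# Toy stand-in for real dataset rows.
sample = [{"text": "short"}, {"text": "a longer example"}, {"text": "mid length"}]
print(describe_lengths(sample))
```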

### Train test splits
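
A minimal sketch of holding out part of the data for evaluation, assuming `dataset` is a `datasets.Dataset`. The wrapper name, the 10% test fraction, and the seed are illustrative choices.

```python
def split_dataset(dataset, test_fraction=0.1, seed=42):
    """Hold out a fraction of rows using the dataset's built-in splitter."""
    # Returns a mapping with "train" and "test" entries; fixing the seed
    # makes the split reproducible between runs.
    return dataset.train_test_split(test_size=test_fraction, seed=seed)
```

Usage would look like `splits = split_dataset(data)`, after which `splits["train"]` and `splits["test"]` hold the two parts.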

# Step 3.5: Test the base model

Warning: the model has only been pre-trained, not fine-tuned, so it simply predicts the most likely next token. It is very easy to get it to "say" unsavory things at this stage.

Let's try it with one of our prompts.

In [7]:
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Generate a creative activity for a child to do during their summer vacation.

### Response:
"""
print(prompt)




Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Generate a creative activity for a child to do during their summer vacation.

### Response:
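
A sketch of sending that prompt through the base model, assuming `model` and `tokenizer` have been loaded in the earlier steps. The function name and the 50-token limit are illustrative.

```python
def generate_once(model, tokenizer, prompt, max_new_tokens=50):
    """Generate a continuation and return only the newly produced text."""
    # Tokenize and move the tensors to wherever the model lives.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The output contains the prompt tokens followed by the new tokens,
    # so slice the prompt off before decoding.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

In the notebook, `print(generate_once(model, tokenizer, prompt))` would show what the model produces before any fine-tuning.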



# Step 4: Train the model!

Using the full power of `transformers` now, we create a training plan with `TrainingArguments` and a trainer with `Trainer`.

We use the following settings:
TODO

Now we train!
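
A hedged sketch of what the training setup could look like. Every value below (batch size, accumulation steps, step count, learning rate, optimizer) is an assumption picked to fit a small GPU, not the lecture's actual settings, and `build_trainer` is a hypothetical helper name.

```python
def build_trainer(model, tokenizer, train_dataset, output_dir="outputs"):
    """Create a Trainer for causal language modeling with small-GPU settings."""
    # Imported here so that defining this helper does not require transformers.
    from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=1,  # one example at a time to save memory
        gradient_accumulation_steps=4,  # effective batch size of 4
        warmup_steps=2,
        max_steps=20,                   # a short demonstration run
        learning_rate=2e-4,
        fp16=True,                      # mixed precision
        logging_steps=1,
        optim="paged_adamw_8bit",       # memory-friendly optimizer from bitsandbytes
    )
    # The generation cache is not useful while training.
    model.config.use_cache = False
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        # mlm=False selects next-token prediction; labels are built from the inputs.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
```

Training would then be started with `trainer = build_trainer(model, tokenizer, train_data)` followed by `trainer.train()`, where `train_data` stands for the tokenized training split.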

# Step 5: Inference

Now we use the model to answer some questions.

In [8]:
from transformers import TextStreamer

def stream(question, context=None):
    system_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    inst_tag, input_tag, resp_tag = "### Instruction:\n", "### Input:\n", "### Response:\n"

    prompt = f"{system_prompt}{inst_tag}{question.strip()}\n\n{input_tag}{context.strip()}\n\n{resp_tag}" \
             if context else f"{system_prompt}{inst_tag}{question.strip()}\n\n{resp_tag}"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=50)
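
The prompt construction inside `stream` can be pulled out and checked on its own, without a GPU or a loaded model. The sketch below mirrors the template used above; `build_prompt` is a hypothetical helper name.

```python
SYSTEM_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)


def build_prompt(question, context=None):
    """Assemble the instruction prompt, with an optional input section."""
    parts = [SYSTEM_PROMPT, "### Instruction:\n", question.strip(), "\n\n"]
    if context:
        parts += ["### Input:\n", context.strip(), "\n\n"]
    parts.append("### Response:\n")
    return "".join(parts)


print(build_prompt("Generate a creative activity for a child to do during their summer vacation."))
```

Keeping the template in one function makes it easier to guarantee that inference uses exactly the same format the model saw during fine-tuning.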