Mathematical Framework Describing the Goal of Open LLMs for UMN Math Presentation
This document is a companion to the Industrial Problems Seminar presentation given by Patrick Delaney on 29 March 2024 at the University of Minnesota Institute for Mathematics and its Applications, "Viva la Revolución of Open Source Large Language Models: Unleashing the Dark Horse in AI Innovation".
Physical resources in the Universe are scarce. At this point in time, there is a great deal of demand for Graphics Processing Units (GPUs), which are used for a wide variety of computational tasks such as cryptocurrency mining, image recognition, and large language model training.
The goal of encoding is to compress information such that we can get outputs which are, overall, more pleasing or acceptable to humans from the fixed amount of computational resources in the Universe, in this case, the limited number of GPUs.
The goal then is to maximize U, given the equation:
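$$U \propto \frac{E \cdot R}{D}$$

(This is a sketch: acceptability U is assumed here to scale with encoding efficiency E and available resources R, and inversely with demand D; the variables are described below.)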
We cannot control Demand (D), and because resources are scarce we may only have access to a small R at any given time, so ultimately the only lever we have to maximize U, how acceptable the outputs are as judged by people, is E, the efficiency of our encoding algorithms. This brings up the notion of precision.
In general, Precision, or Output Precision, is a metric used in binary classification tasks to evaluate the accuracy of the positive predictions made by a model. It's defined as:
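$$\text{Precision} = \frac{TP}{TP + FP}$$

where TP is the number of true positives and FP is the number of false positives.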
Encoding Precision, by contrast, is fundamentally different. As an analogy, UTF-8 encoding is a way to represent characters in a format that can be stored and processed by computers.
| Character | Encoding | Representation |
|---|---|---|
| A | UTF-8 | 41 (hex) |
| A | UTF-16 | 0041 (hex) |
| 🙂 | UTF-8 | F0 9F 99 82 (hex) |
| 🙂 | UTF-16 | D83D DE42 (hex) |
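A quick sketch in Python for checking the byte representations shown in the table above:

```python
# Print the UTF-8 and UTF-16 byte sequences for the characters in the table above.
for ch in ["A", "\U0001F642"]:  # "\U0001F642" is the slightly smiling face emoji
    utf8_hex = ch.encode("utf-8").hex(" ")
    utf16_hex = ch.encode("utf-16-be").hex(" ")  # big-endian, without a byte-order mark
    print(f"{ch!r}: UTF-8 = {utf8_hex}, UTF-16 = {utf16_hex}")
```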
Roughly speaking, Output Precision will be inversely proportional to the Encoding Errors:
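$$\text{Output Precision} \propto \frac{1}{\text{Encoding Errors}}$$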
This is why Output Precision and Encoding Precision get conflated: generally, having better Encoding Precision leads to better Output Precision, and both may be wrapped up together in the concept we introduced above, "E," the efficiency of encoding algorithms.
Implicit in the above, but not mentioned, is the time we have to run things on computers. If we had infinite time, none of the above would matter; we could take a tiny computing resource and set it to solve whatever we want it to solve, even beyond the heat death of the Universe. But obviously this is fanciful, and there are competitive pressures in life, so we may not even have a week or a few days to compute something; it might need to be instantaneous, or finished within an hour or so.
This brings up Big O notation (O), which is used to describe an upper bound on an algorithm's running time. One might say that for a particular task we will only allow a certain number of seconds or hours and assign that budget the notation O; in this document, anything beyond that budget will informally be labeled Θ (Theta). (Note that in standard complexity notation Θ denotes a tight bound rather than "over budget"; we use it loosely here.)
Within Large Language Models (LLMs), the smallest unit of text for a given application is called a "token." Tokens can be words, parts of words (such as a syllable or a subword), or even punctuation, depending on how the model's input is tokenized. Tokenization is the process of breaking text down into these manageable pieces (tokens) so that it can be read and written by a computer. So, that being said, Encoding in the context of tokens is a way to translate human language into computer bits and bytes.
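As a concrete sketch, here is how the GPT-2 tokenizer from Hugging Face (the same tokenizer used in the fine-tuning pseudocode later in this document) breaks a sentence into tokens:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization breaks text into manageable pieces."
tokens = tokenizer.tokenize(text)                     # subword pieces, e.g. ['Token', 'ization', ...]
token_ids = tokenizer.convert_tokens_to_ids(tokens)   # the integers the model actually consumes

print(tokens)
print(token_ids)
```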
That being said, there are many ways to tokenize language, and different tokenization systems are going to process different amounts of bytes. Also, the amount of text a task has to go through can vary: one task may be to have a computer read an entire book, another to have it read just a paragraph. Hence we introduce a couple of other new variables: n, the number of bytes needed to represent a token under the chosen tokenization scheme, and T, the amount of text (in tokens) the task must process.
That is to say, if the tokenization system being used holds a lot of bytes per token (analogous to using UTF-16 rather than UTF-8, as shown above), then n is going to be larger. If we're having a computer read an entire book rather than just a paragraph, then T is going to be larger.
So that being said, we can express the upper bound of an algorithm's running time with:
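$$O(n \cdot T)$$

(that is, assuming the running time scales with both n, the bytes per token, and T, the number of tokens the task must process).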
Going beyond O would give us the dreaded Θ, the informal "over budget" regime described above.
Managing memory requirements is crucial for maintaining performance, especially when processing large batches of text tokens with limited GPU memory. When the required memory exceeds the available GPU memory, settings within the Hugging Face libraries for LLMs allow a fallback to CPU and system memory, leading to a significant decrease in performance. There are also ways to adjust one's settings to disallow the fallback, but then exceeding GPU memory results in an error, so the operation doesn't get performed at all.
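A sketch of the two behaviors, assuming the `transformers` and `accelerate` packages are installed and `torch` is linked to a GPU:

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets layers that do not fit in GPU memory spill over to the
# CPU and system RAM: the job keeps running, but much more slowly.
model_with_fallback = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")

# Forcing the whole model onto the GPU disallows that fallback: if it does not
# fit, this raises a CUDA out-of-memory error and the operation simply fails.
model_gpu_only = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
```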
Drawing off of our Big O equation above, we'll define some variables to express this (the names here are illustrative): let M_req be the total memory required to process a batch of tokens, and M_GPU be the memory physically available on the GPU.
To communicate that, if we want to stay within our time requirement O, we can't allow the total memory requirement to exceed the available GPU memory, we can simply say:
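$$M_{\text{req}} \leq M_{\text{GPU}}$$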
As mentioned above, saying "O" is not enough, though, because there are built-in functions within the Hugging Face libraries which allow fallback to the CPU. In some instances this might be acceptable and not completely invalidate O, the time it takes to perform an operation. So here we just mention that, hypothetically, there might be an acceptable performance P, along with a slowdown factor that captures how much the CPU/memory fallback degrades that performance.
Tying this all together with our above equation, our overall Big O Notation equation is inversely proportional to the performance P under CPU/MEM fallback, which should be fairly intuitive: if the machine exceeds its system resources, then performance will suffer.
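In the loose, illustrative notation used above, this could be written as:

$$O(n \cdot T) \propto \frac{1}{P}$$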
But by how much will performance suffer? This is difficult to generalize, because it depends on what is being done. However, since we are in the age of "Large Language Models," with the key word being "Large," it is probably safe to assume that for much of the newer work, models with lots of parameters or tasks involving lots of text will see slowdowns of orders of magnitude when falling back to CPU.
Consider that all of the above applies to both:
- Using an existing LLM to provide text, e.g. feeding in some task and performing an, "inference,"
- As well as fine-tuning an LLM (adapting a model that has already been pre-trained), e.g. customizing some of the parameters in the model that can be customized, such that specific types of outputs will be given for specific types of inputs when performing an inference.
Merely using a hammer is quite different from customizing a hammer: making it bigger, putting a claw on the back, putting a rubber tip on the front, and so on. Just as with inference, training an LLM is still constrained by the Demand, Efficiency, and Resource factors discussed in the "Goal of Encoding" section above.
LoRA stands for "Low-Rank Adaptation of Large Language Models" (it is also expanded as "Layer-wise Learning Rate Adaptation" in some contexts, such as Hugging Face's Diffusers documentation). It is a technique to build efficiency into the fine-tuning, training, and adaptation phase, and it applies both to large language models and to other probabilistic models such as diffusion-based image generators.
To understand what "Layer-wise" means, one must first accept that underlying LLMs are neural networks (in modern LLMs, transformer networks rather than Convolutional Neural Networks): a sort of mathematical transform in which nodes pass data to other nodes, each applying a weight and passing the result on to additional nodes, and so on. If one imagines a gigantic spreadsheet, with the first column containing a vector of starting data, the next column multiplying that first vector by something, the third multiplying the second, and so on for many columns, this would be a very simple analogy for how a neural network works.
So one could reasonably imagine that there is a computational cost involved in applying a weighted factor from one column to the next. Intuitively, based on how we see computers work, one might think these computations happen instantaneously, but there is always a bit of a delay, even if only nanoseconds, and with sufficiently large "spreadsheets" this delay accumulates. We'll call this per-layer cost "η" (eta).
Now, suppose one could actually tweak the learning rate from layer to layer, so that some layers will be updated faster than others. This would mean that if we're going to fine-tune the parameters, we could do it faster using this special LoRA method.
So for a given layer in a neural network (analogous to a column in a spreadsheet), the transformation can be described as follows:
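$$h = W x + b$$

where x is the layer's input vector, W is the layer's weight matrix, b is the bias vector, and h is the layer's output.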
So we use a simple Matrix operation to represent the transformation.
So then, we use Adaptation, the A in "Layer-wise Learning Rate Adaptation," to adapt W by adding the product AB elementwise.
So our new result can be expressed by a matrix operation, where A and B are the low-rank matrices introduced as part of the LoRA technique, and their product AB is also a matrix with the same dimensions as W:
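$$h' = (W + AB)\,x + b$$

Here, if W is a d × k matrix, then A is d × r and B is r × k for some small rank r, so AB is still d × k but is parameterized by far fewer numbers.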
This approach allows for efficient fine-tuning because it maintains the general capabilities learned during pre-training while providing a mechanism to adapt the model to specific tasks with minimal adjustments, represented by the low-rank matrices.
LoRA's efficiency comes from the AB matrix, which is the part getting modified and which has significantly fewer parameters than W. In traditional LLM training, W is trained directly, which takes a huge amount of memory and GPU usage. In contrast, changing only AB requires significantly less memory.
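A small sketch, with illustrative layer dimensions, of just how large that parameter gap is:

```python
import torch

d, k, r = 4096, 4096, 8            # hypothetical layer dimensions and LoRA rank
W = torch.zeros(d, k)              # frozen pre-trained weight matrix
A = torch.zeros(d, r)              # trainable low-rank factor
B = torch.zeros(r, k)              # trainable low-rank factor

print(W.numel())                   # 16,777,216 parameters in W
print(A.numel() + B.numel())       # 65,536 trainable parameters in A and B combined

# The adapted transformation (W + AB) x, applied to a random input vector
x = torch.randn(k)
h = (W + A @ B) @ x
```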
But what about performance? Performance is measured according to various benchmarks and scores, and any fine-tuning is liable to either increase or decrease a benchmark score, depending on the quality and quantity of the data put in and the methodology used to elicit a response from the fine-tuned LLM.
That being said, by keeping the original weight matrix W unchanged or only minimally adjusted, LoRA ensures that the vast majority of the pre-trained knowledge is retained without the need for resource-intensive re-learning. This not only saves computational resources but also shortens the fine-tuning time, as the model does not need to re-learn representations that are already effective.
In our discussion on weights and biases above, we were fairly hand-wavy, because the purpose of this document is to describe the goal of encoding and why, from a software development perspective, someone may undergo the exercise of actual training. If one were going to do that, assuming you had a nice environment set up, with all of the proper dependencies installed and torch linked to a GPU, some pseudocode demonstrating LoRA training is the following:
```python
from transformers import (
    GPT2Tokenizer,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
import bitsandbytes as bnb

# Assuming CustomGPT2Model is your model class adjusted for LoRA
from your_model_file import CustomGPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = CustomGPT2Model.from_pretrained('gpt2')

# Prepare the dataset: one training example per line of text
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="your_dataset.txt",
    block_size=128,
)

# Data collator for causal language modeling (GPT-2 does not use masked LM)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Use bitsandbytes' 8-bit Adam optimizer for a reduced memory footprint
def custom_optimizer(params):
    return bnb.optim.Adam8bit(params, lr=5e-5)

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=100,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    # Pass an optimizer instance; Trainer creates a default scheduler when None is given
    optimizers=(custom_optimizer(model.parameters()), None),
)

trainer.train()
```
Assuming you had a proper dataset, and the model architecture you are working with is GPT-2-based, the above format would be basically how one would employ LoRA. The section "LoRA: An Example of Non-Encoding Technique to Adapt Fine-Tuning" describes why we use LoRA, whereas the pseudocode describes roughly how to apply LoRA.
The equation above, h = Wx + b, glossed over the notion of "bias." Bias here does not refer to bias in the statistical sense, e.g., data that fails to represent the underlying reality for some reason. Rather, it's a term specific to neural networks: a constant term, b, that can be tuned, as shown in the illustration below:
Concerning the, "how," going back to pseudocode, bias can be directly modified, though doing so arbitrarily can significantly impact a model's performance.
```python
# Zero out every bias term in the model (a blunt, purely illustrative modification)
for name, param in model.named_parameters():
    if 'bias' in name:
        param.data.fill_(0.0)
```
QLoRA modifies the LoRA framework by incorporating quantization. In the standard QLoRA formulation, the large frozen weight matrix W is stored in a quantized, low-precision form (for example, 4-bit), while the low-rank adaptation matrices, written here as QA and QB, remain the trainable parameters; their product (QA)(QB) still precisely targets the modifications needed for new tasks while further reducing computational and memory costs:
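$$W' = W_q + (QA)(QB)$$

(a sketch, where W_q denotes the quantized form of the frozen weights W).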
The layer's output, now enhanced by QLoRA's strategy, is recalculated as:
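$$h' = W_q\,x + (QA)(QB)\,x + b$$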
By storing the large frozen weights in quantized form while training only the small low-rank matrices, QLoRA achieves even further reductions in memory requirements; in short, it combines low-rank adaptation with the advantages of quantization.
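For the quantization of the base weights specifically, a minimal sketch using the BitsAndBytesConfig API from transformers (assuming bitsandbytes is installed and a CUDA GPU is available):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the frozen base model with 4-bit (NF4) quantized weights; LoRA adapters
# added on top of this model would remain in higher precision and be trainable.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=quant_config)
```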
Similar to the above, the "how" of applying QLoRA is shown as follows in pseudocode. Note again that the model used has to be compatible with the tokenization being used.
```python
from transformers import (
    GPT2Tokenizer,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
import bitsandbytes as bnb

# Import your custom QLoRA model class
# (ensure CustomGPT2ModelQLoRA incorporates the QLoRA modifications)
from your_model_file import CustomGPT2ModelQLoRA

# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = CustomGPT2ModelQLoRA.from_pretrained('gpt2')

# Prepare your dataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="your_dataset.txt",
    block_size=128,
)

# Define a custom optimizer using bitsandbytes' 8-bit Adam for a reduced memory footprint
def custom_optimizer(params):
    return bnb.optim.Adam8bit(params, lr=5e-5)

# Training arguments setup
training_args = TrainingArguments(
    output_dir="./gpt2-qlora-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10,
    save_total_limit=2,
)

# Prepare data collator for dynamic padding (GPT-2 does not use masked LM)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize the Trainer with the custom optimizer instance (no custom scheduler)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    optimizers=(custom_optimizer(model.parameters()), None),
)

# Start training
trainer.train()
```