This project is made in order to demonstrate the possibilities of customization of the xllm
library. In this project,
the simplest extensions will be implemented so that the user can understand as much as possible how to make his project
using xllm
.
xllm
enables not only to prototype models, but also facilitates the development of production-ready solutions through
built-in capabilities and customization.
Using xllm
to train a model is simple and involves these few steps:
Prepare
— Get the data and the model ready by downloading and preparing them. Saves data locally toconfig.train_local_path_to_data
andconfig.eval_local_path_to_data
if you are using eval datasetTrain
— Use the data prepared in the previous step to train the modelFuse
— If you used LoRA during the training, fuse LoRAQuantize
— Make your model take less memory by quantizing it
Remember, these tasks in xllm
start from the command line. In this project, we will cover all aspects of
customizing xllm
.
- X—LLM Repo: main repo of the
xllm
library - Quickstart: basics of
xllm
- Examples: minimal examples of using
xllm
- Guide: here, we go into detail about everything the library can do
- Demo project: here's a minimal step-by-step example of how to use X—LLM and fit it into your own project
- WeatherGPT: this repository features an example of how to utilize the xllm library. Included is a solution for a common type of assessment given to LLM engineers, who typically earn between $120,000 to $140,000 annually
- Shurale: project with the finetuned 7B Mistal model
Create python virtual environment
python3 -m venv venv
Activate venv
source venv/bin/activate
Install project (editable) and requirements
pip install -e .
Make .env
file and fill with your values. Please take a look at .env.template
file.
W&B values needs only for train
step.
For demo purposes we will use Anthropic/hh-rlhf
This dataset is widely used in the field of language models. It is made up of pairs of dialogues, labeled as chosen
and rejected
, with the distinction being only in the final line of dialogue. Our approach to utilizing this dataset
will be unconventional; rather than adhering to the authors' original design, we will select the chosen
dialogues and
use them to train a conventional language model to predict the following token.
To achieve this, we will need to implement a new xllm
dataset.
Implementing a new dataset is very simple. You need to create a new class that inherits from BaseDataset
and implement
two methods: get_data
(classmethod) and get_sample
.
from typing import Tuple, Dict, List, Optional
from xllm import Config
from xllm.datasets import BaseDataset
from xllm.types import RawSample
class AntropicDataset(BaseDataset):
@classmethod
def get_data(cls, config: Config) -> Tuple[List[RawSample], Optional[List[RawSample]]]:
...
def get_sample(self, index: int) -> RawSample:
...
Please read this information to understand what you need to do at each method: How to implement dataset
In brief, the get_data
method should return a list of training samples (and optionally evaluation samples) with
arbitrary structure. In the get_sample
method, you need to transform each sample you created in get_data
into a
specific structure, an example of which is provided below.
Example of get_sample
output:
{
"text_parts": [
"Hello!",
"My name is Boris"
]
}
We need to register new components such as dataset, collator, trainer, and experiment to make them accessible through
the xllm
command-line tools.
To do this, let's implement the components_registry
function. We need to import datasets_registry
and
add AntropicDataset
to it with the key antropic
. We can place the key antropic
in the constants of our project.
xllm_demo/core/registry.py
from xllm.datasets import datasets_registry
from xllm_demo.core.constants import DATASET_KEY
from xllm_demo.core.dataset import AntropicDataset
def components_registry():
datasets_registry.add(key=DATASET_KEY, item=AntropicDataset)
At this stage, we are now ready to initiate the first step in our pipeline, which is prepare
. This step is responsible
for data preparation, downloading the tokenizer, and the model.
Let's make a file prepare.py
xllm_demo/cli/prepare.py
from xllm.cli import cli_run_prepare
from xllm_demo.core.registry import components_registry
if __name__ == "__main__":
components_registry()
cli_run_prepare()
Now we can run the prepare
step, specifying the key of the new dataset.
python xllm_demo/cli/prepare.py --dataset_key antropic
More details you could find here: How to add CLI tools to your project
Let's add the remaining CLI functions: train
, fuse
, quantize
. In each function, we will call the component
registration and pass the new config.
xllm_demo/cli/train.py
from xllm.cli import cli_run_train
from xllm_demo.core.registry import components_registry
if __name__ == "__main__":
components_registry()
cli_run_train()
xllm_demo/cli/fuse.py
from xllm.cli import cli_run_fuse
from xllm_demo.core.registry import components_registry
if __name__ == "__main__":
components_registry()
cli_run_fuse()
xllm_demo/cli/quantize.py
from xllm.cli import cli_run_quantize
from xllm_demo.core.registry import components_registry
if __name__ == "__main__":
components_registry()
cli_run_quantize()
Now our project is ready and we can train the model.
cli/prepare.sh
Bash script
#!/bin/bash
python3 xllm_demo/cli/prepare.py \
--dataset_key antropic \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--path_to_env_file ./.env
cli/train.sh
Bash script
#!/bin/bash
python3 xllm_demo/cli/train.py \
--use_gradient_checkpointing True \
--deepspeed_stage 0 \
--stabilize True \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--use_flash_attention_2 False \
--load_in_4bit True \
--apply_lora True \
--raw_lora_target_modules all \
--per_device_train_batch_size 2 \
--warmup_steps 1000 \
--save_total_limit 0 \
--push_to_hub True \
--hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \
--hub_private_repo True \
--report_to_wandb True \
--path_to_env_file ./.env
cli/fuse.sh
Bash script
#!/bin/bash
python3 xllm_demo/cli/fuse.py \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--lora_hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \
--hub_model_id DemoXLLM7B-v1 \
--hub_private_repo True \
--force_fp16 True \
--fused_model_local_path ./fused_model/ \
--path_to_env_file ./.env
cli/quantize.sh
Bash script
#!/bin/bash
python3 xllm_demo/cli/quantize.py \
--model_name_or_path ./fused_model/ \
--apply_lora False --stabilize False \
--quantized_model_path ./quantized_model/ \
--prepare_model_for_kbit_training False \
--quantized_hub_model_id DemoXLLM7B-v1-GPTQ \
--quantized_hub_private_repo True \
--path_to_env_file ./.env
We can train our model on multiple GPUs using DeepSpeed.
cli/train-deepspeed.sh
Bash script
#!/bin/bash
deepspeed --num_gpus=8 xllm_demo/cli/train.py \
--use_gradient_checkpointing True \
--deepspeed_stage 2 \
--stabilize True \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--use_flash_attention_2 False \
--load_in_4bit True \
--apply_lora True \
--raw_lora_target_modules all \
--per_device_train_batch_size 8 \
--warmup_steps 1000 \
--save_total_limit 0 \
--push_to_hub True \
--hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \
--hub_private_repo True \
--report_to_wandb True \
--logging_steps 1 \
--num_train_epochs 3 \
--save_steps 1000 \
--max_steps 10050 \
--path_to_env_file ./.env
Don't forget to install additional libraries (deepspeed) for this case, which are specified in requirements-train.txt
.
pip install -r requirements-train.txt
Now we need to grant executable permission to our bash scripts, and then we can start our training.
chmod -R 755 ./scripts/
Run all steps
./scripts/prepare.sh
./scripts/train.sh
./scripts/fuse.sh
./scripts/quantize.sh
IMPORTANT: Actually, we can already run our entire project by implementing only the new dataset. All other steps are optional.
The main task of the collator is to convert a list of samples into a batch that can be input to the model. That is, to turn texts into a tokenized batch and create targets.
Every collator must be inherited from BaseCollator
and the method parse_batch
must be implemented.
from typing import List
from xllm.types import RawSample, Batch
from xllm.collators import BaseCollator
class LastPartCollator(BaseCollator):
def parse_batch(self, raw_batch: List[RawSample]) -> Batch:
...
Your task is to write the logic of how to process the list in order to eventually obtain a Batch
. A Batch
is a
dictionary where the key is a string, and the value is a PyTorch Tensor
.
More details you could find here: How to implement collator
Let's expand the registration of new components by adding a collator.
xllm_demo/core/registry.py
from xllm.datasets import datasets_registry
from xllm.collators import collators_registry
from xllm_demo.core.constants import DATASET_KEY, COLLATOR_KEY
from xllm_demo.core.dataset import AntropicDataset
from xllm_demo.core.collator import LastPartCollator
def components_registry():
datasets_registry.add(key=DATASET_KEY, item=AntropicDataset)
collators_registry.add(key=COLLATOR_KEY, item=LastPartCollator)
We can extend Config. This may be necessary in order to pass the values of new fields via the command line.
Let's add a new field to the config.
xllm_demo/core/config.py
from dataclasses import dataclass, field
from xllm import Config
from xllm_demo.core.constants import DATASET_KEY
@dataclass
class DemoXLLMConfig(Config):
text_field: str = field(default="chosen", metadata={
"help": "Field for Antropic RLHF dataset",
})
dataset_key: str = field(default=DATASET_KEY, metadata={
"help": "Dataset key",
})
At the moment of calling get_data
of the newly implemented Dataset
, we can choose not the chosen
dialog but, for
example, the rejected
one. Although this logic is made for demonstration purposes, it can be useful in real projects.
Now we will be able to pass through the command line which field needs to be selected.
Now by default, the dataset_key
in the project is the one that we have recently implemented. Therefore, there is no
need for us to pass it every time we call the CLI functions. This is an optional step for demonstration.
Now we need to pass the new config class to every CLI method that we implement.
xllm_demo/cli/prepare.py
from xllm.cli import cli_run_prepare
from xllm_demo.core.config import DemoXLLMConfig
from xllm_demo.core.registry import components_registry
if __name__ == "__main__":
components_registry()
cli_run_prepare(config_cls=DemoXLLMConfig)
Run CLI
python xllm_demo/cli/prepare.py --text_field rejected
Please note that we do not need to register a new config. We need to explicitly specify it in the calls to the xllm
CLI functions.
More details you could find here: How to extend config
In this example, we do not need to update the config for the other steps (train
, fuse
, quantize
), because all that
we added to the config only concerns the preparation of the dataset, meaning only the prepare
step. However, sometimes
it may be necessary. We can do this similarly to the prepare
step.
xllm_demo/cli/train.py
from xllm.cli import cli_run_train
from xllm_demo.core.config import DemoXLLMConfig
from xllm_demo.core.registry import components_registry
if __name__ == "__main__":
components_registry()
cli_run_train(config_cls=DemoXLLMConfig)
We can also implement a new trainer
. It should inherit from the standard trainer of transformers
.
xllm_demo/core/trainer.py
from typing import Union, Dict, Tuple, Optional
from peft import PeftModel
from torch import Tensor
from transformers import PreTrainedModel, TrainingArguments
from xllm.trainers import LMTrainer
from xllm.datasets import BaseDataset
from xllm.collators import BaseCollator
from xllm_demo.core.config import DemoXLLMConfig
class MyLMTrainer(LMTrainer):
def __init__(
self,
config: DemoXLLMConfig,
model: Union[PreTrainedModel, PeftModel],
args: TrainingArguments,
data_collator: BaseCollator,
train_dataset: BaseDataset,
ignore_index: int,
eval_dataset: Optional[BaseDataset] = None,
):
super().__init__(config, model, args, data_collator, train_dataset, ignore_index, eval_dataset)
self.my_steps = 0
def compute_loss(
self,
model: Union[PreTrainedModel, PeftModel],
inputs: Dict[str, Tensor],
return_outputs: bool = False,
) -> Union[Tensor, Tuple[Tensor, Dict[str, Tensor]]]:
self.my_steps += 1
return super().compute_loss(model=model, inputs=inputs, return_outputs=return_outputs)
Above, we simply added a step counter.
We also need to register the new trainer.
xllm_demo/core/registry.py
from xllm.datasets import datasets_registry
from xllm.collators import collators_registry
from xllm.trainers import trainers_registry
from xllm_demo.core.constants import DATASET_KEY, COLLATOR_KEY, TRAINER_KEY
from xllm_demo.core.dataset import AntropicDataset
from xllm_demo.core.collator import LastPartCollator
from xllm_demo.core.trainer import MyLMTrainer
def components_registry():
datasets_registry.add(key=DATASET_KEY, item=AntropicDataset)
collators_registry.add(key=COLLATOR_KEY, item=LastPartCollator)
trainers_registry.add(key=TRAINER_KEY, item=MyLMTrainer)
More details you could find here: How to implement trainer
The experiment acts as an aggregator for training, where the necessary components are loaded. We can customize
the Experiment
by overriding hook methods.
Let's add a few checks and slightly expand the logging.
xllm_demo/core/experiment.py
from xllm.experiments import Experiment
from xllm import dist_logger
class MyExperiment(Experiment):
def before_model_build(self) -> None:
assert self.model is None
dist_logger.info("Model is not None", local_rank=self.config.local_rank)
def after_model_build(self) -> None:
assert self.model is not None
dist_logger.info("Model is not None", local_rank=self.config.local_rank)
def after_train(self) -> None:
if hasattr(self.model, "my_steps"):
num_steps = self.model.my_steps
dist_logger.info(f"Steps: {num_steps}", local_rank=self.config.local_rank)
Let's register the new experiment.
xllm_demo/core/registry.py
from xllm.datasets import datasets_registry
from xllm.collators import collators_registry
from xllm.trainers import trainers_registry
from xllm.experiments import experiments_registry
from xllm_demo.core.constants import DATASET_KEY, COLLATOR_KEY, TRAINER_KEY, EXPERIMENT_KEY
from xllm_demo.core.dataset import AntropicDataset
from xllm_demo.core.collator import LastPartCollator
from xllm_demo.core.trainer import MyLMTrainer
from xllm_demo.core.experiment import MyExperiment
def components_registry():
datasets_registry.add(key=DATASET_KEY, item=AntropicDataset)
collators_registry.add(key=COLLATOR_KEY, item=LastPartCollator)
trainers_registry.add(key=TRAINER_KEY, item=MyLMTrainer)
experiments_registry.add(key=EXPERIMENT_KEY, item=MyExperiment)
More details you could find here: How to implement experiment
Having implemented and registered the new components, we can make some adjustments to the training bash scripts, as only in this step might the new components be utilized.
We add three new lines:
--collator_key last_part \
--trainer_key steps \
--experiment_key check_model
cli/train.sh
Bash script
#!/bin/bash
python3 xllm_demo/cli/train.py \
--use_gradient_checkpointing True \
--deepspeed_stage 0 \
--stabilize True \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--use_flash_attention_2 False \
--load_in_4bit True \
--apply_lora True \
--raw_lora_target_modules all \
--per_device_train_batch_size 2 \
--warmup_steps 1000 \
--save_total_limit 0 \
--push_to_hub True \
--hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \
--hub_private_repo True \
--report_to_wandb True \
--path_to_env_file ./.env \
--collator_key last_part \
--trainer_key steps \
--experiment_key check_model
cli/train-deepspeed.sh
Bash script
#!/bin/bash
deepspeed --num_gpus=8 xllm_demo/cli/train.py \
--use_gradient_checkpointing True \
--deepspeed_stage 2 \
--stabilize True \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--use_flash_attention_2 False \
--load_in_4bit True \
--apply_lora True \
--raw_lora_target_modules all \
--per_device_train_batch_size 8 \
--warmup_steps 1000 \
--save_total_limit 0 \
--push_to_hub True \
--hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \
--hub_private_repo True \
--report_to_wandb True \
--logging_steps 1 \
--num_train_epochs 3 \
--save_steps 1000 \
--max_steps 10050 \
--path_to_env_file ./.env \
--collator_key last_part \
--trainer_key steps \
--experiment_key check_model
Afterward, we can run all the scripts just as we did before.
./scripts/prepare.sh
./scripts/train.sh
./scripts/fuse.sh
./scripts/quantize.sh
xllm
library has a plethora of customization options. Not all of them are necessary, and most often you will only
utilize the implementation of a new dataset, which is by design.
- X—LLM Repo: main repo of the
xllm
library - Quickstart: basics of
xllm
- Examples: minimal examples of using
xllm
- Guide: here, we go into detail about everything the library can do
- Demo project: here's a minimal step-by-step example of how to use X—LLM and fit it into your own project
- WeatherGPT: this repository features an example of how to utilize the xllm library. Included is a solution for a common type of assessment given to LLM engineers, who typically earn between $120,000 to $140,000 annually
- Shurale: project with the finetuned 7B Mistal model
Tale Quest
is my personal project which was built using xllm
and Shurale
. It's an interactive text-based game
in Telegram
with dynamic AI characters, offering infinite scenarios
You will get into exciting journeys and complete fascinating quests. Chat
with George Orwell
, Tech Entrepreneur
, Young Wizard
, Noir Detective
, Femme Fatale
and many more
Try it now: https://t.me/talequestbot