# Quick-Start Example

Follow the instructions below to apply LIFT to your own models!

**⚠ Notice:** Do not run this notebook directly. Fine-tuning relies on `accelerate`, so the code is meant to be launched through `accelerate` rather than executed cell by cell.

## 1 Load the model to train

LIFT fine-tunes a pretrained LLM with LoRA. Empirically, a single 80 GB GPU is sufficient to train a 7B to 8B model such as `Llama-3-8B-Instruct` using LoRA with r=8, so we only use data parallelism.
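
To see why a small LoRA rank is cheap to train, here is a back-of-the-envelope sketch that counts trainable parameters. A rank-r adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The shapes below are those of the attention projections in Llama-3-8B; which modules are actually adapted is an assumption here, since that is decided by the adapter config, and `lora_param_count` is a hypothetical helper rather than part of LIFT.

```python
def lora_param_count(shapes, r):
    """Trainable parameters added by rank-r LoRA on linear layers of the given (d_in, d_out) shapes."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


# Attention projections (q, k, v, o) of one Llama-3-8B layer.
# Assumption: only these four modules carry adapters.
ATTN_SHAPES = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]
NUM_LAYERS = 32

per_layer = lora_param_count(ATTN_SHAPES, r=8)
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")  # 212,992 per layer, 6,815,744 in total
```

Under these assumptions the adapter has a few million trainable parameters, which is tiny next to the roughly 8 billion frozen base weights.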

In [None]:
# We take Llama 3 as an example here.
# Replace it with your own model path or name.
MODEL_NAME_OR_PATH = "meta-llama/Meta-Llama-3-8B-Instruct"

from lift.model import load_training_model

# A utility function that loads a model with a LoRA adapter and prepares it for fine-tuning.
# See the implementation for details.
model = load_training_model(MODEL_NAME_OR_PATH, adapter="lora", adapter_config="configs/adapter/lora_128.yaml")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH)

## 2 Prepare the dataset

Training data is generated by an LLM server, so launch the vLLM server first.

In [None]:
# This is the script we used in our experiments.
# You can adjust the parameters as needed.
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --served-model-name qwen \
    --gpu-memory-utilization 0.96 \
    --tensor-parallel-size 8 \
    --max-model-len 4000 \
    --distributed-executor-backend mp \
    --port 8001

# These environment variables are used by our data generation scripts.
# `vllm serve` runs in the foreground, so set them in the shell that runs those scripts.
export VLLM_MODEL_NAME="qwen"  # Use the --served-model-name specified above
export VLLM_BASE_URL="http://localhost:8001/v1"
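
Before generating data, it can help to confirm that the server is reachable and serves the expected model name. The sketch below queries the model-listing endpoint of the OpenAI-compatible API using only the standard library. It is not part of LIFT: `models_url` and `list_served_models` are hypothetical helper names, and the defaults simply mirror the values exported above.

```python
import json
import os
import urllib.request


def models_url(base_url):
    """Build the model-listing URL of an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url, timeout=5.0):
    """Return the model IDs the server reports (requires a running server)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload["data"]]


if __name__ == "__main__":
    base_url = os.environ.get("VLLM_BASE_URL", "http://localhost:8001/v1")
    expected = os.environ.get("VLLM_MODEL_NAME", "qwen")
    try:
        served = list_served_models(base_url)
    except OSError as exc:  # Connection refused, timeout, HTTP error, ...
        print(f"Server not reachable at {base_url}: {exc}")
    else:
        print(f"Served models: {served}; expected model present: {expected in served}")
```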

DeepSeek is also supported as the generator backend. You only need to set the following environment variable:

In [None]:
export DEEPSEEK_API_KEY="<your_api_key_here>"

Split your context into sentences with the spaCy sentence-splitting model. **⚠ Notice**: You may need to `import spacy` after `import torch` to avoid CUDA initialization issues. See our implementation in `pred/loogle/pred_lift_icl.py`.

In [None]:
import spacy

sentence_model = spacy.load("en_core_web_sm")

context = "<your_context_here>"  # It should be meaningful text in natural language.
sentences = [sent.text for sent in sentence_model(context).sents]

Our package provides `EverySentenceDataset`, which is an `IterableDataset`. Once instantiated, it begins to synthesize data.

In [None]:
from accelerate import Accelerator
from lift.synqa import AsyncOpenAIServer, EverySentenceDataset

accelerator = Accelerator()

server = AsyncOpenAIServer.from_config("configs/generator/vllm-qwen2.5-backend.yaml")
# Or use the following line if you adopt DeepSeek as the backend.
# server = AsyncOpenAIServer.from_config("configs/generator/deepseek-backend.yaml")

dataset = EverySentenceDataset.from_config(
    "configs/generator/vllm-qwen2.5-everysentence.yaml",
    tokenizer,
    is_main_process=accelerator.is_main_process,
    server=server,
    sentences=sentences,
)

## 3 Fine-tune

We use `transformers.Trainer` to fine-tune the model. Because the dataset is an `IterableDataset`, we provide a callback, `EverySentenceStopCallback`, that stops training after exactly `num_train_epochs` epochs.

Because the dataset generates training data on the fly, only the main process generates data and dispatches it to the other processes. This is enabled by setting `accelerator_config={"dispatch_batches": True}`.
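
As a conceptual illustration of batch dispatching, the sketch below splits one global batch, produced by a single process, into equal slices for all processes. This is not how `accelerate` is implemented internally; `split_global_batch` is a hypothetical function that only shows the idea.

```python
def split_global_batch(batch, num_processes):
    """Split one global batch into equal contiguous slices, one per process."""
    if len(batch) % num_processes != 0:
        raise ValueError("global batch size must be divisible by the number of processes")
    per_process = len(batch) // num_processes
    return [batch[i * per_process:(i + 1) * per_process] for i in range(num_processes)]


# The main process draws one global batch of 8 samples ...
global_batch = [f"sample_{i}" for i in range(8)]
# ... and each of the 4 processes receives its own slice of 2 samples.
for rank, shard in enumerate(split_global_batch(global_batch, num_processes=4)):
    print(rank, shard)
```

With `split_batches=True`, the configured batch size is treated as the global batch and must be divisible by the number of processes, which is presumably why the code below multiplies `per_device_train_batch_size` by `torch.cuda.device_count()`.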

In [None]:
import torch
import yaml
from lift.synqa import EverySentenceStopCallback
from lift.utils import custom_2_hf_training_args, WrapperForIterableDataset, LIFTDataCollator
from transformers import Trainer


# We set many default arguments here. You can modify them as needed.
with open("configs/training/lift_lora.yaml", "r") as f:
    training_args = yaml.safe_load(f)
training_args["per_device_train_batch_size"] *= torch.cuda.device_count()
training_args = custom_2_hf_training_args(
    training_args,
    accelerator_config={"split_batches": True, "dispatch_batches": True, "even_batches": True},
    max_steps=10000,  # Sufficiently large; training is actually stopped by `control_callback`.
    use_liger_kernel=True,
)
real_batch_size = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps

data_collator = LIFTDataCollator(tokenizer)
control_callback = EverySentenceStopCallback(dataset, training_args.num_train_epochs, real_batch_size)
trainer = Trainer(
    model=model,
    args=training_args,
    # The wrapper pads the dataset so that its length is divisible by the batch size.
    train_dataset=WrapperForIterableDataset(dataset, real_batch_size),
    data_collator=data_collator,
    callbacks=[control_callback],
)

trainer.train()
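
The padding and stopping logic above boils down to simple arithmetic, sketched here under the assumption that the dataset is padded up to the next multiple of the batch size and that each optimizer step consumes one full batch. The actual `WrapperForIterableDataset` and `EverySentenceStopCallback` may differ; `padded_length` and `total_steps` are hypothetical helpers.

```python
import math


def padded_length(num_samples, batch_size):
    """Smallest multiple of batch_size that is >= num_samples."""
    return math.ceil(num_samples / batch_size) * batch_size


def total_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps needed to see every (padded) sample num_epochs times."""
    return padded_length(num_samples, batch_size) // batch_size * num_epochs


print(padded_length(103, 8))   # 104
print(total_steps(103, 8, 3))  # 39
```

Any such step count stays far below a large `max_steps` value, so the callback, not `max_steps`, decides when training ends.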

## 4 Inference

After fine-tuning, the model can answer context-related questions without the context being provided in its context window. Enjoy it! :D

In [None]:
from lift.utils import get_pad_token_id

model.eval()
# Optional: merge the LoRA weights into the base model.
# model = model.merge_and_unload()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<your_question_here>"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # Return a dict-like encoding so that `.input_ids` is available.
).input_ids.to(model.device)
attention_mask = torch.ones_like(input_ids)
output_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    use_cache=True,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=get_pad_token_id(tokenizer),  # A utility function that handles tokenizers without a pad token.
)
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(response)
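
Some tokenizers do not define a pad token, and a common workaround is to fall back to the EOS token ID. The sketch below shows that fallback with stand-in objects. It is an assumption about what a helper like `get_pad_token_id` might do, not its actual implementation, and `pad_token_id_or_eos` is a hypothetical name.

```python
from types import SimpleNamespace


def pad_token_id_or_eos(tokenizer):
    """Return the tokenizer's pad token ID, falling back to the EOS token ID."""
    if tokenizer.pad_token_id is not None:
        return tokenizer.pad_token_id
    return tokenizer.eos_token_id


# Stand-in objects that mimic the two attributes used above.
with_pad = SimpleNamespace(pad_token_id=0, eos_token_id=2)
without_pad = SimpleNamespace(pad_token_id=None, eos_token_id=2)
print(pad_token_id_or_eos(with_pad), pad_token_id_or_eos(without_pad))  # 0 2
```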