<a href="https://colab.research.google.com/github/kaan1derful/GenAI-powered-creation-of-process-models/blob/main/LLM_Fine-Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Python notebook contains the script for fine-tuning a local LLM to model processes as process trees, based on textual process descriptions.
It covers all steps performed in phase 2 of the paper by Apaydin et al. (2024).
This notebook uses 40 examples (32 for training and 8 for validation) to fine-tune the model and evaluate its process modelling capabilities. To see the fine-tuning results for 80 training examples see this notebook. To see the fine-tuning results for 120 training examples see this notebook.

Unsloth is used for fine-tuning. The code is based on Unsloth's Llama 3 fine-tuning example.


To run the notebook yourself, select an NVIDIA L4 GPU, then click "*Runtime*" and "*Run all*".

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install pm4py

We tried Llama 3 and Llama 3.1, each as base and instruct versions, with and without 4-bit quantization. We also tested Gemma 2 27B with 4-bit quantization.
Other LLMs not yet tested include Mistral, Phi-3 and Gemma 2 9B.

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

tested_models = [
    "unsloth/llama-3-8b-bnb-4bit",
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-8b",
    "unsloth/llama-3-8b-Instruct",
    "unsloth/llama-3.1-8b-bnb-4bit",
    "unsloth/llama-3.1-8b-Instruct-bnb-4bit",
    "unsloth/llama-3.1-8b",
    "unsloth/llama-3.1-8b-Instruct",
    "unsloth/gemma-2-27b-bnb-4bit",
]
test_next = [
    "unsloth/gemma-2-27b-it-bnb-4bit",
    "unsloth/gemma-2-9b-it",
    "unsloth/llama-3-70b-bnb-4bit", #could not load with one A100 available
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Add special tokens for process tree operators to the vocabulary of the LLM

In [3]:
# special tokens for the process tree operators and the final answer marker
new_tokens = ["->_token", "X_token", "+_token", "*_token", "process_tree="]

# keep only the tokens that are not yet in the vocabulary
# (a list keeps the order stable, so the assigned token ids are reproducible across runs)
existing_vocab = tokenizer.get_vocab()
new_tokens = [token for token in new_tokens if token not in existing_vocab]

# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(new_tokens)

# resize the embedding matrix so the new tokens get newly initialized embeddings
model.resize_token_embeddings(len(tokenizer))

Embedding(128261, 4096)
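
The filtering step can be illustrated without loading a model. The sketch below uses a toy dictionary in place of the tokenizer vocabulary; the helper name `select_new_tokens` and the toy vocabulary are assumptions made for illustration only. It also shows how the 128261 rows of the embedding matrix above come about, assuming a base vocabulary of 128256 entries and that none of the five tokens existed before.

```python
def select_new_tokens(candidates, vocab):
    """Return the candidates that are missing from vocab, keeping their order."""
    return [token for token in candidates if token not in vocab]


# Toy stand-in for the tokenizer vocabulary (token -> id); illustrative only.
toy_vocab = {"process": 0, "tree": 1, "X_token": 2}

candidates = ["->_token", "X_token", "+_token", "*_token", "process_tree="]
missing = select_new_tokens(candidates, toy_vocab)
print(missing)  # "X_token" is skipped because the toy vocabulary already has it

# Assumed base vocabulary size plus the five added tokens.
base_vocab_size = 128256
expected_rows = base_vocab_size + len(candidates)
print(expected_rows)
```

Keeping the candidates in a list rather than a set means each token receives the same id on every run, which matters when adapters trained in one session are loaded in another.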

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
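
To make the effect of `r = 128` concrete, the following sketch counts the parameters that LoRA adds for the seven target modules. The layer shapes are assumptions based on the published Llama-3-8B architecture (hidden size 4096, MLP size 14336, 8 key/value heads of 128 dimensions, 32 layers); they are not printed anywhere in this notebook. The sketch also compares the standard LoRA scaling with the rank-stabilized variant enabled by `use_rslora = True`.

```python
import math

# Assumed Llama-3-8B shapes (not printed in this notebook).
hidden_size = 4096
intermediate_size = 14336
kv_size = 1024  # 8 key/value heads x 128 dimensions (grouped-query attention)
num_layers = 32

rank = 128
lora_alpha = 16

# (input features, output features) of every adapted projection in one layer.
projection_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_size),
    "v_proj": (hidden_size, kv_size),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_parameter_count(in_features, out_features, r):
    """LoRA adds a matrix A (r x in) and a matrix B (out x r) per adapted weight."""
    return r * (in_features + out_features)


per_layer = sum(
    lora_parameter_count(i, o, rank) for i, o in projection_shapes.values()
)
trainable = per_layer * num_layers
print(f"{trainable:,} trainable LoRA parameters")

# Scaling factor applied to the low-rank update.
standard_scaling = lora_alpha / rank            # classic LoRA: alpha / r
rslora_scaling = lora_alpha / math.sqrt(rank)   # rank-stabilized LoRA: alpha / sqrt(r)
print(standard_scaling, round(rslora_scaling, 3))
```

Under these assumptions the count comes to 335,544,320, which agrees with the number of trainable parameters the trainer reports further down and corresponds to roughly 4% of an 8B-parameter model.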


<a name="Data"></a>
### Data Prep
We now use the `Llama-3` format for conversation style finetunes. We use the `kaan1derful/process_trees_and_process_descriptions` dataset, whose conversations are stored in ShareGPT style. Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

We use Unsloth's `get_chat_template` function to get the correct chat template. It supports `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old` and Unsloth's own optimized `unsloth` template.

Note ShareGPT uses `{"from": "human", "value" : "Hi"}` and not `{"role": "user", "content" : "Hi"}`, so we use `mapping` to map it.
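
As an illustration of what the template and the mapping do together, here is a hand-written approximation of the Llama-3 layout shown above. It is only a sketch: `render_llama3` and `ROLE_MAPPING` are hypothetical names, and the real template applied by `apply_chat_template` may differ in details such as whitespace trimming.

```python
# Hand-written approximation of the Llama-3 chat layout; illustrative only.
ROLE_MAPPING = {"system": "system", "human": "user", "gpt": "assistant"}


def render_llama3(conversation, add_generation_prompt=False):
    """Render ShareGPT-style messages ({"from": ..., "value": ...}) as one string."""
    text = "<|begin_of_text|>"
    for message in conversation:
        role = ROLE_MAPPING[message["from"]]
        text += f"<|start_header_id|>{role}<|end_header_id|>\n\n"
        text += f"{message['value']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


conversation = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hey there! How are you?"},
    {"from": "human", "value": "I'm great thanks!"},
]
rendered = render_llama3(conversation)
print(rendered)
```

For training, the conversation is rendered without the trailing assistant header; for inference, `add_generation_prompt=True` appends it so that generation starts inside the assistant turn.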

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [5]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(list(convo), tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset, concatenate_datasets
dataset = load_dataset("kaan1derful/process_trees_and_process_descriptions", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

dataset_20_examples_with_4_activities_each = dataset.filter(lambda example: example['Number of Activities'] == 4)
dataset_20_examples_with_5_activities_each = dataset.filter(lambda example: example['Number of Activities'] == 5)
dataset_20_examples_with_6_activities_each = dataset.filter(lambda example: example['Number of Activities'] == 6)
dataset_20_examples_with_7_activities_each = dataset.filter(lambda example: example['Number of Activities'] == 7)
dataset_20_examples_with_8_activities_each = dataset.filter(lambda example: example['Number of Activities'] == 8)
dataset_20_examples_with_9_activities_each = dataset.filter(lambda example: example['Number of Activities'] == 9)

dataset_40_examples_with_4_or_5_activities_each = concatenate_datasets([dataset_20_examples_with_4_activities_each,
                                                                        dataset_20_examples_with_5_activities_each])

dataset_80_examples_with_4_5_6_or_7_activities_each = concatenate_datasets([dataset_40_examples_with_4_or_5_activities_each,
                                                                            dataset_20_examples_with_6_activities_each,
                                                                            dataset_20_examples_with_7_activities_each])

dataset_120_examples_with_4_5_6_7_8_or_9_activities_each = concatenate_datasets([dataset_80_examples_with_4_5_6_or_7_activities_each,
                                                                                 dataset_20_examples_with_8_activities_each,
                                                                                 dataset_20_examples_with_9_activities_each])

dataset_with_4_activities_each_split_datasets = dataset_20_examples_with_4_activities_each.train_test_split(test_size=0.2, seed=42)
train_dataset_with_4_activities_each = dataset_with_4_activities_each_split_datasets['train']
validation_dataset_with_4_activities_each = dataset_with_4_activities_each_split_datasets['test']

dataset_with_5_activities_each_split_datasets = dataset_20_examples_with_5_activities_each.train_test_split(test_size=0.2, seed=52)
train_dataset_with_5_activities_each = dataset_with_5_activities_each_split_datasets['train']
validation_dataset_with_5_activities_each = dataset_with_5_activities_each_split_datasets['test']

dataset_with_6_activities_each_split_datasets = dataset_20_examples_with_6_activities_each.train_test_split(test_size=0.2, seed=62)
train_dataset_with_6_activities_each = dataset_with_6_activities_each_split_datasets['train']
validation_dataset_with_6_activities_each = dataset_with_6_activities_each_split_datasets['test']

dataset_with_7_activities_each_split_datasets = dataset_20_examples_with_7_activities_each.train_test_split(test_size=0.2, seed=72)
train_dataset_with_7_activities_each = dataset_with_7_activities_each_split_datasets['train']
validation_dataset_with_7_activities_each = dataset_with_7_activities_each_split_datasets['test']

dataset_with_8_activities_each_split_datasets = dataset_20_examples_with_8_activities_each.train_test_split(test_size=0.2, seed=82)
train_dataset_with_8_activities_each = dataset_with_8_activities_each_split_datasets['train']
validation_dataset_with_8_activities_each = dataset_with_8_activities_each_split_datasets['test']

dataset_with_9_activities_each_split_datasets = dataset_20_examples_with_9_activities_each.train_test_split(test_size=0.2, seed=92)
train_dataset_with_9_activities_each = dataset_with_9_activities_each_split_datasets['train']
validation_dataset_with_9_activities_each = dataset_with_9_activities_each_split_datasets['test']


train_dataset_with_4_or_5_activities_each = concatenate_datasets([train_dataset_with_4_activities_each,
                                                                  train_dataset_with_5_activities_each])

train_dataset_with_4_5_6_or_7_activities_each = concatenate_datasets([train_dataset_with_4_or_5_activities_each,
                                                                      train_dataset_with_6_activities_each,
                                                                      train_dataset_with_7_activities_each])

train_dataset_with_4_5_6_7_8_or_9_activities_each = concatenate_datasets([train_dataset_with_4_5_6_or_7_activities_each,
                                                                          train_dataset_with_8_activities_each,
                                                                          train_dataset_with_9_activities_each])

validation_dataset_with_4_or_5_activities_each = concatenate_datasets([validation_dataset_with_4_activities_each,
                                                                       validation_dataset_with_5_activities_each])

validation_dataset_with_4_5_6_or_7_activities_each = concatenate_datasets([validation_dataset_with_4_or_5_activities_each,
                                                                           validation_dataset_with_6_activities_each,
                                                                           validation_dataset_with_7_activities_each])

validation_dataset_with_4_5_6_7_8_or_9_activities_each = concatenate_datasets([validation_dataset_with_4_5_6_or_7_activities_each,
                                                                               validation_dataset_with_8_activities_each,
                                                                               validation_dataset_with_9_activities_each])

Map:   0%|          | 0/120 [00:00<?, ? examples/s]

Filter:   0%|          | 0/120 [00:00<?, ? examples/s]

Filter:   0%|          | 0/120 [00:00<?, ? examples/s]

Filter:   0%|          | 0/120 [00:00<?, ? examples/s]

Filter:   0%|          | 0/120 [00:00<?, ? examples/s]

Filter:   0%|          | 0/120 [00:00<?, ? examples/s]

Filter:   0%|          | 0/120 [00:00<?, ? examples/s]
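
The split logic above follows one pattern: each group of 20 examples with the same number of activities is split 80/20 on its own, and the parts are then concatenated. A dependency-free sketch of that scheme is shown below. It uses plain Python lists and `random.Random` as a stand-in for `Dataset.train_test_split`, so the concrete examples that land in each split will differ from the notebook's; `stratified_split` and the toy records are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(examples, key, test_fraction=0.2, base_seed=42):
    """Split every group (same value of `key`) separately, then merge the parts."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[key]].append(example)

    train, validation = [], []
    for offset, group_value in enumerate(sorted(groups)):
        group = list(groups[group_value])
        # One seed per group, mirroring the seeds 42, 52, 62, ... used above.
        random.Random(base_seed + 10 * offset).shuffle(group)
        n_validation = round(len(group) * test_fraction)
        validation.extend(group[:n_validation])
        train.extend(group[n_validation:])
    return train, validation


# Toy records: 20 examples for each activity count from 4 to 9 (120 in total).
toy_examples = [
    {"Number of Activities": n, "id": f"{n}-{i}"}
    for n in range(4, 10)
    for i in range(20)
]

subset_4_or_5 = [e for e in toy_examples if e["Number of Activities"] in (4, 5)]
train, validation = stratified_split(subset_4_or_5, "Number of Activities")
print(len(train), len(validation))
```

Splitting per group keeps the ratio of 4-activity to 5-activity examples identical in the training and validation sets, which a single split over the concatenated data would not guarantee.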

Let's see how the `Llama-3` format works by printing the element at index 5 (the sixth example).

In [6]:
dataset[5]["conversations"]

[{'from': 'system',
  'value': 'Process trees allow us to model processes that comprise a control-flow hierarchy. A process tree is a mathematical tree, where the internal vertices are operators, and leaves are activities.  Operators specify how their children, i.e., sub-trees, need to be combined from a control-flow perspective. There are four operators and each operator has two children:  The sequence operator ->_token specifies sequential behavior, e.g., ->_token(X, Y) means that first X is executed and then Y.  The choice operator X_token specifies a choice, e.g., X_token(X, Y) means that either X is executed or Y is executed.  The parallel operator +_token specifies simultaneous behavior or indifferent executing order, e.g., +_token(X, Y) means that X is executed while Y is also executed or that X and Y are both executed independently from another.  The loop operator *_token specifies repetitive behaviour, e.g., *_token(X, Y) means that after X is executed, Y could be executed. If

In [7]:
print(dataset[5]["text"])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Process trees allow us to model processes that comprise a control-flow hierarchy. A process tree is a mathematical tree, where the internal vertices are operators, and leaves are activities.  Operators specify how their children, i.e., sub-trees, need to be combined from a control-flow perspective. There are four operators and each operator has two children:  The sequence operator ->_token specifies sequential behavior, e.g., ->_token(X, Y) means that first X is executed and then Y.  The choice operator X_token specifies a choice, e.g., X_token(X, Y) means that either X is executed or Y is executed.  The parallel operator +_token specifies simultaneous behavior or indifferent executing order, e.g., +_token(X, Y) means that X is executed while Y is also executed or that X and Y are both executed independently from another.  The loop operator *_token specifies repetitive behaviour, e.g., *_token(X, Y) means that after X is exec

Set Training Parameters for fine-tuning

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset_with_4_or_5_activities_each,
    eval_dataset = validation_dataset_with_4_or_5_activities_each,

    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        evaluation_strategy = "steps",
        eval_steps = 1, # Evaluate after every optimizer step
        warmup_steps = 5,
        num_train_epochs = 10,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)



Map (num_proc=2):   0%|          | 0/32 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/8 [00:00<?, ? examples/s]
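
The training arguments determine how many optimizer steps are run and how the learning rate evolves. The sketch below reproduces that arithmetic for a single GPU. The schedule function is a plain re-implementation of linear warmup followed by linear decay, written for illustration; it is an assumption about how `lr_scheduler_type = "linear"` behaves rather than the library's own code.

```python
import math

num_examples = 32
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 10
warmup_steps = 5
learning_rate = 2e-4

# Examples consumed per optimizer step on a single GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Mini-batches per epoch, then optimizer steps per epoch and in total.
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs


def linear_schedule(step):
    """Linear warmup from 0 to the peak rate, then linear decay back to 0."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size, steps_per_epoch, total_steps)
for step in (0, 1, 5, 20, 40):
    print(step, f"{linear_schedule(step):.2e}")
```

With `eval_steps = 1`, the validation set is evaluated after each of these optimizer steps, so a full run performs as many evaluations as it has steps.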

In [9]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.168 GB.
17.053 GB of memory reserved.


In [10]:
#@title Train
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 32 | Num Epochs = 10
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 40
 "-____-"     Number of trainable parameters = 335,544,320


Step,Training Loss,Validation Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. GPU 
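
A back-of-the-envelope estimate of the weight memory may help explain this error. The sketch assumes roughly 8.03 billion parameters for an 8B Llama-3 model and ignores quantization overhead, activations and optimizer state, so the numbers are indicative only. The two reported values are taken from the memory statistics printed above.

```python
GIB = 1024 ** 3

# Assumption: roughly 8.03 billion parameters for an 8B Llama-3 model.
approx_parameters = 8.03e9


def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Memory needed just to hold the weights, in GiB."""
    return num_parameters * bytes_per_parameter / GIB


bf16_gib = weight_memory_gib(approx_parameters, 2)        # 16-bit weights
four_bit_gib = weight_memory_gib(approx_parameters, 0.5)  # 4-bit weights, overhead ignored

# Values printed by the memory statistics cell before training started.
gpu_total_gib = 22.168
reserved_before_training_gib = 17.053
headroom_gib = gpu_total_gib - reserved_before_training_gib

print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f"4-bit weights:  {four_bit_gib:.2f} GiB")
print(f"Headroom before training: {headroom_gib:.3f} GiB")
```

Under these assumptions only about 5 GiB remain for the LoRA weights, gradients, optimizer state and activations. Setting `load_in_4bit = True`, lowering `r`, or reducing `max_seq_length` are options one could try on an L4.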

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Prompt fine-tuned LLM with process description
Let's run the model! Since we're using `Llama-3`, use `apply_chat_template` with `add_generation_prompt` set to `True` for inference.

In [None]:
from unsloth.chat_templates import get_chat_template
import pm4py
import pm4py.utils as u

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

instruction_prompt_with_special_tokens_and_same_placeholders = "Process trees allow us to model processes that comprise a control-flow hierarchy. A process tree is a mathematical tree, where the internal vertices are operators, and leaves are activities.  Operators specify how their children, i.e., sub-trees, need to be combined from a control-flow perspective. There are four operators and each operator has two children:  The sequence operator ->_token specifies sequential behavior, e.g., ->_token(X, Y) means that first X is executed and then Y.  The choice operator X_token specifies a choice, e.g., X_token(X, Y) means that either X is executed or Y is executed.  The parallel operator +_token specifies simultaneous behavior or indifferent executing order, e.g., +_token(X, Y) means that X is executed while Y is also executed or that X and Y are both executed independently from another.  The loop operator *_token specifies repetitive behaviour, e.g., *_token(X, Y) means that after X is executed, Y could be executed. If Y is executed then X has to be executed again. This implies that the loop only can be left after X is executed. Now your task is to analyze a process description to identify activities within the process description and the relationship between the activities within the process description.  Afterwards model a process tree that represents the process within the process description. Use the operators defined above to to model the control flow and use one verb and one noun if possible to model the activities.  You can reason step by step to analyze the process description and model the process tree. However, in the end finish your response with process_tree=[insert the modelled process tree here]. The process description you need to analyze and model a process tree for is: "

process_description = """Consider a process for purchasing items from an online shop.
The user starts an order by logging in to their account.
Then, the user simultaneously selects the items to purchase and sets a payment method.
Afterward, the user either pays or completes an installment agreement.
After selecting the items, the user chooses between multiple options for a free reward.
Since the reward value depends on the purchase value, this step is done after selecting the items, but it is independent of the payment activities.
Finally, the items are delivered. The user has the right to return items for exchange.
Every time items are returned, a new delivery is made."""


# Write Prompt in ShareGPT Style conversation
conversations = [
    {"from": "system", "value": instruction_prompt_with_special_tokens_and_same_placeholders},
    {"from": "human", "value": process_description},
]

# Convert Prompt to LLM Chat Template
inputs = tokenizer.apply_chat_template(
    conversations,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

# prompt the LLM and decode the single generated sequence to a string
outputs = model.generate(input_ids = inputs, max_new_tokens = 1024, use_cache = True)
llm_response = tokenizer.batch_decode(outputs)[0]

# Replace the special process tree tokens with pm4py process tree symbols
llm_response = llm_response.replace("->_token(", "->(").replace("X_token(", "X(").replace("+_token(", "+(",).replace("*_token(", "*(")

# Extract process tree string in LLM response
process_tree_string = llm_response[llm_response.rfind("process_tree=")+len("process_tree="):llm_response.rfind("<|eot_id|>")]

# Print the LLM response (it includes the process description for comparison),
# then view the process model as a process tree and as BPMN if the tree can be parsed
print(llm_response)

try:
  process_tree = u.parse_process_tree(process_tree_string)
  pm4py.view_process_tree(process_tree)
  bpmn_graph = pm4py.convert_to_bpmn(process_tree)
  pm4py.view_bpmn(bpmn_graph)
except Exception:
  print("Error parsing process tree")
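
The post-processing can be tried out on a plain string, without a model or pm4py. The sketch below wraps the two steps (locating the text after the last `process_tree=` marker and swapping the special tokens for process tree symbols) in a hypothetical helper `extract_process_tree`; the sample response is made up for illustration.

```python
TOKEN_TO_SYMBOL = {
    "->_token(": "->(",
    "X_token(": "X(",
    "+_token(": "+(",
    "*_token(": "*(",
}


def extract_process_tree(response, marker="process_tree=", end_token="<|eot_id|>"):
    """Return the text between the last marker and the next end token, or None."""
    start = response.rfind(marker)
    if start == -1:
        return None  # the response contains no answer marker at all
    start += len(marker)
    end = response.find(end_token, start)
    if end == -1:
        end = len(response)  # generation stopped before the end token
    tree = response[start:end].strip()
    for token, symbol in TOKEN_TO_SYMBOL.items():
        tree = tree.replace(token, symbol)
    return tree


# Made-up model response, for illustration only.
sample_response = (
    "The user logs in and then either pays or completes an agreement.\n"
    "process_tree=->_token('Log in', X_token('Pay', 'Complete agreement'))<|eot_id|>"
)
tree_string = extract_process_tree(sample_response)
print(tree_string)
```

Note that the decoded output of `model.generate` also contains the prompt, whose instructions mention `process_tree=` themselves. If the model never writes its own marker, a search like this one would match the placeholder in the prompt, so slicing off the prompt tokens before decoding may be worth considering.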

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

instruction_prompt_with_special_tokens_and_same_placeholders = """Process trees allow us to model processes that comprise a control-flow hierarchy.
A process tree is a mathematical tree, where the internal vertices are operators, and leaves are activities.
Operators specify how their children, i.e., sub-trees, need to be combined from a control-flow perspective.
There are four operators and each operator has two children:
The sequence operator ->_token specifies sequential behavior, e.g., ->_token(X, Y) means that first X is executed and then Y.
The choice operator X_token specifies a choice, e.g., X_token(X, Y) means that either X is executed or Y is executed.
The parallel operator +_token specifies simultaneous behavior or indifferent executing order, e.g., +_token(X, Y) means that X is executed while Y is also executed or that X and Y are both executed independently from another.
The loop operator *_token specifies repetitive behaviour, e.g., *_token(X, Y) means that after X is executed, Y could be executed. If Y is executed then X has to be executed again.
This implies that the loop only can be left after X is executed.
Now your task is to analyze a process description to identify activities within the process description and the relationship between the activities within the process description.
Afterwards model a process tree that represents the process within the process description. Use the operators defined above to to model the control flow and use one verb and one noun if possible to model the activities.
You can reason step by step to analyze the process description and model the process tree.
However, in the end finish your response with process_tree=[insert the modelled process tree here].
The process description you need to analyze and model a process tree for is: """

process_description = """The Evanstonian is an upscale independent hotel.
When a guest calls room service at The Evanstonian, the room-service manager takes down the order.
She then submits an order ticket to the kitchen to begin preparing the food.
She also gives an order to the sommelier (i.e., the wine waiter) to fetch wine from the cellar and to prepare any other alcoholic beverages.
Eighty percent of room-service orders include wine or some other alcoholic beverage.
Finally, she assigns the order to the waiter.
While the kitchen and the sommelier are doing their tasks, the waiter readies a cart (i.e., puts a tablecloth on the cart and gathers silverware).
The waiter is also responsible for nonalcoholic drinks. Once the food, wine, and cart are ready, the waiter delivers it to the guest’s room.
After returning to the room-service station, the waiter debits the guest’s account.
The waiter may wait to do the billing if he has another order to prepare or deliver."""

# ShareGPT Style conversation
conversations = [
    {"from": "system", "value": instruction_prompt_with_special_tokens_and_same_placeholders},
    {"from": "human", "value": process_description},
]

inputs = tokenizer.apply_chat_template(
    conversations,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

# prompt the LLM and decode the single generated sequence to a string
outputs = model.generate(input_ids = inputs, max_new_tokens = 1024, use_cache = True)
llm_response = tokenizer.batch_decode(outputs)[0]

# Replace the special process tree tokens with pm4py process tree symbols
llm_response = llm_response.replace("->_token(", "->(").replace("X_token(", "X(").replace("+_token(", "+(",).replace("*_token(", "*(")

# Extract process tree string in LLM response
process_tree_string = llm_response[llm_response.rfind("process_tree=")+len("process_tree="):llm_response.rfind("<|eot_id|>")]

# Print the LLM response (it includes the process description for comparison),
# then view the process model as a process tree and as BPMN if the tree can be parsed
print(llm_response)

try:
  process_tree = u.parse_process_tree(process_tree_string)
  pm4py.view_process_tree(process_tree)
  bpmn_graph = pm4py.convert_to_bpmn(process_tree)
  pm4py.view_bpmn(bpmn_graph)
except Exception:
  print("Error parsing process tree")

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
#model.save_pretrained("lora_model") # Local saving
if False: model.push_to_hub("kaan1derful/lora_model_process_tree_generator", token = "hf_...") # Online saving; pass your own Hugging Face token and never commit a real one
if False: tokenizer.push_to_hub("kaan1derful/lora_model_process_tree_generator", token = "hf_...")
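
Rather than writing an access token into a cell, it can be read from the environment at run time. The following is a minimal sketch; the helper `get_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how the token is provided in your environment.

```python
import os


def get_hf_token(variable="HF_TOKEN"):
    """Read the access token from the environment; return None if it is not set."""
    token = os.environ.get(variable, "").strip()
    return token or None


token = get_hf_token()
if token is None:
    print("No token found; skipping upload.")
else:
    print("Token loaded from the environment.")
    # Hypothetical upload call, shown commented out:
    # model.push_to_hub("<user>/<repository>", token = token)
```

This keeps the secret out of the notebook file, so sharing or publishing the notebook does not expose it.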

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "What is a famous tall tower in Paris?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
#text_streamer = TextStreamer(tokenizer)
#_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).