In [1]:
import pandas as pd
from datasets import Dataset


train_prompts_df = pd.read_json('../fn1.7-train-prompts.jsonl', lines=True)
train_prompts = Dataset.from_pandas(train_prompts_df)




In [2]:
from unsloth import FastLanguageModel

max_seq_length = 5000
model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name="llama-3.2-3b-fsp-ft",
    model_name="unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.9: Fast Qwen2 patching. Transformers = 4.46.3.
   \\   /|    GPU: NVIDIA GeForce RTX 4070 SUPER. Max memory: 11.994 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.9. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!




In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank; any integer > 0 (suggested: 8, 16, 32, 64, 128)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Any value is supported, but 0 is optimized
    bias = "none",    # Any value is supported, but "none" is optimized
    # "unsloth" uses 30% less VRAM and fits 2x larger batch sizes
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # Rank-stabilized LoRA is also supported
    loftq_config = None, # As is LoftQ
)

Unsloth 2024.11.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
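
The patch message lists which layers received adapters, and that is enough to estimate the adapter size by hand. The following is a minimal sketch, assuming the publicly documented Qwen2.5-7B shapes (hidden size 3584, MLP width 18944, 4 key/value heads of dimension 128, 28 layers). None of these numbers are stated in this notebook, so treat them as assumptions. Each wrapped linear layer of shape d_in x d_out gains r * (d_in + d_out) trainable weights.

```python
# LoRA adds two low-rank matrices per wrapped linear layer:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) weights.
# The shapes below are ASSUMED from the public Qwen2.5-7B config.
R = 16                 # matches r = 16 in get_peft_model
HIDDEN = 3584          # assumed hidden size
INTERMEDIATE = 18944   # assumed MLP width
KV_DIM = 4 * 128       # assumed: 4 key/value heads of dimension 128
NUM_LAYERS = 28        # matches "patched 28 layers"

PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes, num_layers):
    """Total trainable LoRA weights across all layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(R, PROJECTIONS, NUM_LAYERS)
print(f"{total:,}")
```

Under these assumptions the total is 40,370,176, which agrees with the number of trainable parameters that the trainer reports when training starts.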


In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

def formatting_prompts_func_test(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo[:-1], tokenize = False, add_generation_prompt = True) for convo in convos]
    return { "text" : texts, }

dataset = train_prompts.map(formatting_prompts_func, batched = True,)

test_prompts_df = pd.read_json('../fn1.7-test-prompts.jsonl', lines=True)
test_prompts = Dataset.from_pandas(test_prompts_df)

test_dataset = test_prompts.map(formatting_prompts_func_test, batched = True,)

Map: 100%|██████████| 18865/18865 [00:00<00:00, 21888.20 examples/s]
Map: 100%|██████████| 6223/6223 [00:00<00:00, 25829.53 examples/s]
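
The two formatting functions differ only in what they hand to the chat template: training rows keep the full conversation, while test rows drop the final assistant turn and end with an open assistant header. The sketch below illustrates that difference with a hand-rolled approximation of the ChatML layout. `render_chatml` and the sample conversation are hypothetical and exist only for illustration; the real strings come from `tokenizer.apply_chat_template`.

```python
# Hand-rolled approximation of the ChatML layout, for illustration only.
# The notebook itself relies on tokenizer.apply_chat_template.
def render_chatml(messages, add_generation_prompt=False):
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text

# Hypothetical conversation with the same three roles as the dataset rows.
convo = [
    {"role": "system", "content": "Label the frame elements."},
    {"role": "user", "content": "Sentence: a prehistoric lake"},
    {"role": "assistant", "content": "{'Locale': 'lake'}"},
]

# Training text: the full conversation, gold answer included.
train_text = render_chatml(convo)
# Test prompt: gold answer removed, assistant header left open.
test_text = render_chatml(convo[:-1], add_generation_prompt=True)

print(train_text)
print(test_text)
```

In this sketch the test prompt is a strict prefix of the training text, which is what lets the fine-tuned model continue a test prompt with the answer format it saw during training.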


In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 20,
        num_train_epochs = 1, # Set this for 1 full training run.
        # max_steps = 20,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2): 100%|██████████| 18865/18865 [00:10<00:00, 1809.64 examples/s]


In [6]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)

Map: 100%|██████████| 18865/18865 [00:03<00:00, 4817.62 examples/s]
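
`train_on_responses_only` restricts the loss to the assistant's tokens: every position outside the response gets the label -100, which is why the decoded labels further down show only whitespace before the output. The toy sketch below shows the idea on made-up token ids. It is not Unsloth's implementation, the helper names are hypothetical, and it handles only a single response per sequence.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1

def mask_prompt(input_ids, response_marker):
    """Copy input_ids to labels, masking everything up to and including the marker."""
    start = find_subsequence(input_ids, response_marker)
    if start == -1:
        # No response found: mask the whole sequence so it adds no loss.
        return [IGNORE_INDEX] * len(input_ids)
    cut = start + len(response_marker)
    return [IGNORE_INDEX] * cut + input_ids[cut:]

# Made-up token ids: 1, 2 stand in for the prompt, [7, 8] for the
# "<|im_start|>assistant\n" marker, and 3, 4, 5 for the response.
input_ids = [1, 2, 7, 8, 3, 4, 5]
labels = mask_prompt(input_ids, response_marker=[7, 8])
print(labels)  # [-100, -100, -100, -100, 3, 4, 5]
```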


In [16]:
print(tokenizer.decode(trainer.train_dataset[5]["input_ids"]))

<|im_start|>system
### Task:
You are given a sentence and a frame with its associated frame elements and sometimes examples. Your task is to label the frame elements in the sentence using JSON. Keys should only be one of the defined frame elements. Do not make up your own frame elements, and do not remove or change the input in any way. Identify the frame elements based on the highlighted target word. 

### Notes:
- Return the tagged sentence in a ```json ``` code block.
- Texts must not overlap.
<|im_end|>
<|im_start|>user
### Frame Information:
Frame Name: Natural_features
Frame Definition: The Locale is a geographical location as defined by shape. This frame includes natural geographic features, including land/ice forms and bodies of water.

Frame Elements:
Constituent_parts (Peripheral): Salient parts of the Locale.
Container_possessor (Extra-Thematic): The location that the Locale is a part of.
Formational_cause (Extra-Thematic): Indicates the action (or causer) which brings the f

In [14]:
# Masked label positions hold -100, which cannot be decoded; swap in a space token for display
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

"                                                                                                                                                                                                                                                                                                                                                               \n### Output: \n```json\n{'Type': 'prehistoric', 'Locale': 'lake'}\n```\n<|im_end|>\n"

In [17]:
import torch
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4070 SUPER. Max memory = 11.994 GB.
7.52 GB of memory reserved.


In [10]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 18,865 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 2,358
 "-____-"     Number of trainable parameters = 40,370,176


Step,Training Loss
1,1.9905
2,1.9137
3,1.6056
4,1.444
5,1.494
6,1.5741
7,1.4701
8,1.1018
9,1.217
10,0.8044


KeyboardInterrupt: 
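
The batch size and step count in the training banner follow from the trainer arguments. A minimal sketch of the arithmetic, assuming that a trailing partial accumulation group is not counted as a step (an assumption that reproduces the banner's figure):

```python
import math

num_examples = 18_865
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
num_epochs = 1

# Examples consumed per optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# One optimizer step consumes gradient_accumulation_steps micro-batches.
micro_batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
steps_per_epoch = micro_batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs

# The run above was interrupted after 10 logged steps.
examples_seen = 10 * effective_batch_size

print(effective_batch_size, total_steps, examples_seen)  # 8 2358 80
```

At 10 of 2,358 steps the model had seen about 80 of the 18,865 training examples, so any predictions made from this checkpoint reflect a barely trained adapter.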

In [22]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)

predictions = []
test_messages = test_dataset['messages'] # Read the column once instead of on every iteration
num_examples = len(test_messages)

for i, example in enumerate(test_messages):
    inputs = tokenizer.apply_chat_template(
        example[:-1], # Drop the gold assistant turn so the model has to generate it
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")
    print(tokenizer.decode(inputs[0]))
    outputs = model.generate(input_ids = inputs, max_new_tokens = 256, use_cache = True, streamer=text_streamer)
    # Keep only the newly generated tokens. Splitting the decoded text on the
    # assistant marker would break if that marker ever appeared inside the prompt.
    predictions.append(tokenizer.decode(outputs[0][inputs.shape[1]:]))
    print(f"Completed {i+1} of {num_examples} generations.")

test_prompts_df['responses'] = predictions
test_prompts_df.to_json('qwen-2.5-7b-preds.jsonl', orient='records', lines=True)

<|im_start|>system
### Task:
You are given a sentence and a frame with its associated frame elements and sometimes examples. Your task is to label the frame elements in the sentence using JSON. Keys should only be one of the defined frame elements. Do not make up your own frame elements, and do not remove or change the input in any way. Identify the frame elements based on the highlighted target word. 

### Notes:
- Return the tagged sentence in a ```json ``` code block.
- Texts must not overlap.
<|im_end|>
<|im_start|>user
### Frame Information:
Frame Name: Calendric_unit
Frame Definition: Words in this frame name the different parts of the calendric cycle, both man-made and natural. The Unit (e.g. Tuesday) specifies some time period as part of a specific larger temporal Whole (Tuesday of next week), or may be resolved to an exact time span by a Relative_time (next Tuesday).

Frame Elements:
Relative_time (Core): Relative_time is used for the word or words that locate the time with re

KeyboardInterrupt: 

In [12]:
predictions

["### Output: \n```json\n{'Name': 'December'}\n```\n Prostitutes\natedRoute\nصندnection\n.Disclaimer\n.Disclaimer\n.Disclaimer\n ... [".Disclaimer\n" repeats for the rest of the output, which is cut off here]
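
The prediction above carries a usable answer followed by runaway text, so downstream scoring needs to isolate the first json block. The sketch below is one possible way to do that, not part of the original pipeline: `extract_frame_elements` is a hypothetical helper, and it uses `ast.literal_eval` because the model emits Python-style single quotes that `json.loads` would reject.

```python
import ast
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this snippet
JSON_BLOCK = re.compile(
    re.escape(FENCE) + r"json\s*\n(.*?)\n" + re.escape(FENCE), re.DOTALL
)

def extract_frame_elements(prediction):
    """Return the dict inside the first json code block, or None if it cannot be parsed."""
    match = JSON_BLOCK.search(prediction)
    if match is None:
        return None
    try:
        # literal_eval accepts the single-quoted dict syntax seen in the outputs.
        parsed = ast.literal_eval(match.group(1))
    except (ValueError, SyntaxError):
        return None
    return parsed if isinstance(parsed, dict) else None

# Shortened copy of the first prediction shown above.
sample = (
    "### Output: \n" + FENCE + "json\n{'Name': 'December'}\n" + FENCE
    + "\n Prostitutes\n.Disclaimer\n"
)
print(extract_frame_elements(sample))  # {'Name': 'December'}
```

Returning `None` for anything unparseable keeps malformed generations visible as explicit misses rather than silently dropping them.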

In [None]:
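# Note: the directory name refers to Llama 3.2 3B, but the model loaded above is Qwen2.5-7B
model.save_pretrained("llama-3.2-3b-fsp-ft2") # Local saving (LoRA adapter weights)
tokenizer.save_pretrained("llama-3.2-3b-fsp-ft2")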

In [22]:
# Keep only the rows that received a prediction before the generation loop was interrupted
x = test_prompts_df.head(len(predictions)).copy()
x['prediction'] = predictions
x.to_json('half-pred.jsonl', lines=True, orient='records')