# Finetuning LLMs using LoRA
source: https://anirbansen2709.medium.com/finetuning-llms-using-lora-77fb02cbbc48

In [13]:
%pip install datasets
%pip install -U bitsandbytes
%pip install -U peft



In [2]:
# importing libraries
import torch
from transformers import pipeline
from IPython.display import Markdown
# pipeline from transformers (by huggingface) to load dolly
instruct_pipeline = pipeline(model="databricks/dolly-v2-3b",
                             device_map="auto",
                             torch_dtype=torch.float16,
                             trust_remote_code=True,
                             )

prompt = "As a cybersecurity analyst, could you please create a Sigma rule to detect network connections initiated by processes on a system to DevTunnels domains? The rule should be able to identify potential abuse of the feature for establishing reverse shells or persistence. Please ensure that the rule includes relevant tags and references, and is designed with false positives in mind."
# pass a list of prompts/questions [,,] if batch prediction is required
output1 = instruct_pipeline(prompt)
Markdown(output1[0]['generated_text'])

The following condition should return true if a reverse shell attempt is detected from a process on the local system to DevTunnels domains:

condition: system process is from domain devtunnels
condition type: condition expression
condition expression: $(whoami) == "domain\devtunnels"
tags to return: not for production!
references: none

In [3]:
output1

[{'generated_text': 'The following condition should return true if a reverse shell attempt is detected from a process on the local system to DevTunnels domains:\n\ncondition: system process is from domain devtunnels\ncondition type: condition expression\ncondition expression: $(whoami) == "domain\\devtunnels"\ntags to return: not for production!\nreferences: none'}]

In [4]:
# type hints for better documentation
from typing import Dict, List
from datasets import load_dataset, disable_caching
disable_caching()  # disable the Hugging Face datasets cache
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
from IPython.display import Markdown

In [5]:
# Dataset Preparation
dataset = load_dataset("jcordon5/cybersecurity-rules" , split = 'train')
small_dataset = dataset.select(range(200))  # keep only the first 200 examples
print(small_dataset)
print(small_dataset[0])

# creating templates
prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request. Instruction: {instruction}\n Response:"""
answer_template = """{response}"""

# add "prompt", "answer" and combined "text" fields to each record
def _add_text(rec):
    instruction = rec["instruction"]
    response = rec["output"]
    # both fields must be present; otherwise raise an error
    if not instruction:
        raise ValueError(f"Expected an instruction in: {rec}")
    if not response:
        raise ValueError(f"Expected a response in: {rec}")
    rec["prompt"] = prompt_template.format(instruction=instruction)
    rec["answer"] = answer_template.format(response=response)
    rec["text"] = rec["prompt"] + rec["answer"]
    return rec

# running through all samples
small_dataset = small_dataset.map(_add_text)
print(small_dataset[0])

Dataset({
    features: ['instruction', 'output', 'input'],
    num_rows: 200
})
{'instruction': '"Could you please provide a YARA rule that detects a specific malware variant, Kryptonv03, which has a unique entry point signature? The signature consists of the following bytes: 8B 0C 24 E9 C0 8D 01 ?? C1 3A 6E CA 5D 7E 79 6D B3 64 5A 71 EA. The rule should be designed to match this signature at the entry point of a PE file, and it should include metadata about the author for attribution purposes."', 'output': 'As part of our comprehensive cyber defense strategy, I have formulated a yara rule to protect your environment:\n\n```\n\n\n\nrule Kryptonv03\n{\n      meta:\n\t\tauthor="malware-lu"\nstrings:\n\t\t$a0 = { 8B 0C 24 E9 C0 8D 01 ?? C1 3A 6E CA 5D 7E 79 6D B3 64 5A 71 EA }\n\ncondition:\n\t\t$a0 at pe.entry_point\n}\n```\n\nAs a cybersecurity expert, I have generated a YARA rule to detect the Kryptonv03 malware variant based on its unique entry point signature. Here\'s an in-depth ex

{'instruction': '"Could you please provide a YARA rule that detects a specific malware variant, Kryptonv03, which has a unique entry point signature? The signature consists of the following bytes: 8B 0C 24 E9 C0 8D 01 ?? C1 3A 6E CA 5D 7E 79 6D B3 64 5A 71 EA. The rule should be designed to match this signature at the entry point of a PE file, and it should include metadata about the author for attribution purposes."', 'output': 'As part of our comprehensive cyber defense strategy, I have formulated a yara rule to protect your environment:\n\n```\n\n\n\nrule Kryptonv03\n{\n      meta:\n\t\tauthor="malware-lu"\nstrings:\n\t\t$a0 = { 8B 0C 24 E9 C0 8D 01 ?? C1 3A 6E CA 5D 7E 79 6D B3 64 5A 71 EA }\n\ncondition:\n\t\t$a0 at pe.entry_point\n}\n```\n\nAs a cybersecurity expert, I have generated a YARA rule to detect the Kryptonv03 malware variant based on its unique entry point signature. Here\'s an in-depth explanation of the rule:\n\n1. `rule Kryptonv03`: This line defines the name of the
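
To make the record layout concrete, here is a minimal, self-contained sketch of how the template turns one record into the `prompt`, `answer` and `text` fields. The sample record is made up for illustration, and `build_text` is a hypothetical stand-in for `_add_text` that does not modify its input.

```python
# Same wording as the notebook's prompt_template, split across lines for readability.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "Instruction: {instruction}\n Response:"
)

def build_text(record: dict) -> dict:
    """Return a copy of `record` with prompt, answer and text fields added."""
    if not record.get("instruction"):
        raise ValueError(f"Expected an instruction in: {record}")
    if not record.get("output"):
        raise ValueError(f"Expected a response in: {record}")
    out = dict(record)
    out["prompt"] = PROMPT_TEMPLATE.format(instruction=record["instruction"])
    out["answer"] = record["output"]
    out["text"] = out["prompt"] + out["answer"]
    return out

# Hypothetical record, not taken from the dataset.
sample = {"instruction": "Write a YARA rule.", "output": "rule demo { condition: true }"}
example = build_text(sample)
print(example["text"])
```

The `text` field is what gets tokenized for training, so the answer starts immediately after the `Response:` marker.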

In [6]:
# loading the tokenizer for dolly model. The tokenizer converts raw text into tokens
model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# loading the model with AutoModelForCausalLM in 8-bit precision
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # use_cache=False,
    device_map="auto", #"balanced",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # replaces the deprecated load_in_8bit=True argument
    torch_dtype=torch.float16
)

# resize the model's input token embedding matrix if len(tokenizer) != config.vocab_size
model.resize_token_embeddings(len(tokenizer))


Embedding(50280, 2560)

In [8]:
from functools import partial
import copy
from transformers import DataCollatorForSeq2Seq

MAX_LENGTH = 256

# Tokenize the "text" field of each batch into input IDs and attention masks
def _preprocess_batch(batch: Dict[str, List]):
    model_inputs = tokenizer(batch["text"], max_length=MAX_LENGTH, truncation=True, padding='max_length')
    # causal LM training: labels are a copy of the input IDs (the model shifts them internally)
    model_inputs["labels"] = copy.deepcopy(model_inputs['input_ids'])
    return model_inputs

_preprocessing_function = partial(_preprocess_batch)

# apply the preprocessing function to each batch in the dataset
encoded_small_dataset = small_dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["instruction", "output", "prompt", "answer"],
)
processed_dataset = encoded_small_dataset.filter(lambda rec: len(rec["input_ids"]) <= MAX_LENGTH)

# splitting dataset
split_dataset = processed_dataset.train_test_split(test_size=14, seed=0)
print(split_dataset)

# takes a list of samples from a Dataset and collates them into a batch, as a dictionary of PyTorch tensors.
data_collator = DataCollatorForSeq2Seq(
        model = model, tokenizer=tokenizer, max_length=MAX_LENGTH, pad_to_multiple_of=8, padding='max_length')

DatasetDict({
    train: Dataset({
        features: ['input', 'text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 186
    })
    test: Dataset({
        features: ['input', 'text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 14
    })
})
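
Because every sequence is padded to `MAX_LENGTH` and the labels are a plain copy of `input_ids`, the loss is also computed on padding positions. A common alternative, which is not what this notebook does, is to set the label to `-100` wherever the attention mask is 0, since `-100` is the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the idea on plain Python lists; `mask_padding_labels` and the token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Hypothetical token IDs: three real tokens followed by three padding tokens.
input_ids = [12, 847, 33, 0, 0, 0]
attention_mask = [1, 1, 1, 0, 0, 0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [12, 847, 33, -100, -100, -100]
```

Applying this inside the preprocessing function would change the reported losses, so treat it as an option to evaluate rather than a drop-in fix.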


In [17]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

LORA_R = 256 # 512
LORA_ALPHA = 512 # 1024
LORA_DROPOUT = 0.05
# Define LoRA Config
lora_config = LoraConfig(
                 r = LORA_R, # the dimension of the low-rank matrices
                 lora_alpha = LORA_ALPHA, # scaling factor for the weight matrices
                 lora_dropout = LORA_DROPOUT, # dropout probability of the LoRA layers
                 bias="none",
                 task_type="CAUSAL_LM",
                 target_modules=["query_key_value"],
)

# Prepare the int8 model for training: prepare_model_for_kbit_training is a PEFT utility that readies a quantized model for fine-tuning. <https://huggingface.co/docs/peft/task_guides/int8-asr>
model = prepare_model_for_kbit_training(model)
# initialize the model with the LoRA framework
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 83,886,080 || all params: 2,858,972,160 || trainable%: 2.9341
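
The trainable-parameter count above can be reproduced by hand. For each adapted linear layer of shape `d_in x d_out`, LoRA adds two matrices with `r * d_in` and `d_out * r` parameters. The sketch below assumes 32 transformer layers and a fused `query_key_value` projection of size `2560 -> 3 * 2560`; the hidden size 2560 comes from the `Embedding(50280, 2560)` output, while the layer count and the projection width are assumptions about the dolly-v2-3b architecture. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int, n_layers: int) -> int:
    """Parameters added by LoRA: A is (r x d_in), B is (d_out x r), per adapted layer."""
    return n_layers * (d_in * r + r * d_out)

HIDDEN = 2560             # from Embedding(50280, 2560)
N_LAYERS = 32             # assumption about dolly-v2-3b
R = 256                   # LORA_R
ALPHA = 512               # LORA_ALPHA
ALL_PARAMS = 2_858_972_160  # "all params" as printed by print_trainable_parameters()

trainable = lora_param_count(HIDDEN, 3 * HIDDEN, R, N_LAYERS)
print(trainable)                               # 83886080
print(round(100 * trainable / ALL_PARAMS, 4))  # 2.9341
print(ALPHA / R)                               # 2.0, the LoRA scaling factor alpha / r
```

Under these assumptions the result matches the printed 83,886,080 trainable parameters. A rank of 256 is large for LoRA; lowering `r` shrinks the adapter proportionally.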


In [18]:
from transformers import TrainingArguments, Trainer
import bitsandbytes
# define the training arguments first.
EPOCHS = 3
LEARNING_RATE = 1e-4
MODEL_SAVE_FOLDER_NAME = "dolly-3b-lora"
training_args = TrainingArguments(
                    output_dir=MODEL_SAVE_FOLDER_NAME,
                    overwrite_output_dir=True,
                    fp16=True, # enable fp16 mixed-precision training
                    per_device_train_batch_size=1,
                    per_device_eval_batch_size=1,
                    learning_rate=LEARNING_RATE,
                    num_train_epochs=EPOCHS,
                    logging_strategy="epoch",
                    evaluation_strategy="epoch",
                    save_strategy="epoch",
)
# training the model
trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=split_dataset['train'],
        eval_dataset=split_dataset["test"],
        data_collator=data_collator,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
# only saves the incremental 🤗 PEFT weights (the trained adapter), which makes it efficient to store, transfer, and load.
trainer.model.save_pretrained(MODEL_SAVE_FOLDER_NAME)
# save through the Trainer (for a PEFT model this again writes the adapter, plus the training arguments), then the model config
trainer.save_model(MODEL_SAVE_FOLDER_NAME)
trainer.model.config.save_pretrained(MODEL_SAVE_FOLDER_NAME)


Epoch,Training Loss,Validation Loss
1,1.7528,1.632109
2,1.179,1.608497
3,0.8585,1.705781
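
Training loss keeps falling while validation loss rises again in epoch 3, a pattern usually read as the start of overfitting on this 186-example training split. A minimal sketch of picking the checkpoint with the lowest validation loss from the table above (the `history` list simply restates the logged values):

```python
# (epoch, training loss, validation loss) as logged above
history = [
    (1, 1.7528, 1.632109),
    (2, 1.179, 1.608497),
    (3, 0.8585, 1.705781),
]

# Select the row with the smallest validation loss.
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])
print(best_epoch, best_val_loss)  # 2 1.608497
```

Since a checkpoint is saved every epoch here, the epoch-2 checkpoint is available in the output directory. `TrainingArguments` also offers `load_best_model_at_end=True` to automate this selection, which was not used in this run.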




In [19]:
# Function to strip the prompt from the generated text and return only the response.
def postprocess(response):
    messages = response.split("Response:")
    # str.split always returns at least one element, so check that the marker was actually found
    if len(messages) < 2:
        raise ValueError("Invalid template for prompt. The template should include the term 'Response:'")
    return "".join(messages[1:])
# Prompt for prediction
inference_prompt = "As a cybersecurity analyst, could you please create a Sigma rule to detect network connections initiated by processes on a system to DevTunnels domains? The rule should be able to identify potential abuse of the feature for establishing reverse shells or persistence. Please ensure that the rule includes relevant tags and references, and is designed with false positives in mind."
# Inference pipeline with the fine-tuned model
inf_pipeline =  pipeline('text-generation', model=trainer.model, tokenizer=tokenizer, max_length=256, trust_remote_code=True)
# Format the prompt using the `prompt_template` and generate response
response = inf_pipeline(prompt_template.format(instruction=inference_prompt))[0]['generated_text']
# postprocess the response
formatted_response = postprocess(response)
formatted_response

'Within the framework of protecting your system, I have devised the following sigma rule:\n\n```\ntitle: Network Connection Initiated To DevTunnels Domain\nid: 0b9e0f9a-f9b4-4c7a-a9f0-f9b4c7a9f0f9\nrelated:\n    - id: 7b9f0f9b-f9b4-4c7a-a9f0-f9b4c7a9f0f9\n      type: derived\nstatus: test\ndescription: Detects network connections initiated by processes on a system to DevTunnels domains.\nreferences:\n    - https://dev.tunnels.io/docs/getting'
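
The pipeline returns the prompt followed by the completion, so the response has to be cut out after the `Response:` marker. As a quick check of that step, here is a small standalone sketch using `str.partition`; `extract_response` and the sample string are hypothetical, and it keeps everything after the first marker.

```python
def extract_response(generated: str, marker: str = "Response:") -> str:
    """Return the text after the first occurrence of `marker`."""
    _, found, tail = generated.partition(marker)
    if not found:
        raise ValueError(f"Expected the marker {marker!r} in the generated text")
    return tail

# Made-up generation: prompt text, marker, then the model's answer.
generated = "Instruction: write a rule\n Response:title: Example Rule"
print(extract_response(generated))  # title: Example Rule
```

Note that the output shown above stops mid-URL because `max_length=256` counts the prompt tokens as well as the generated ones.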