<a href="https://colab.research.google.com/github/Balacoumarane/finetune_llama/blob/main/FineTune_llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Fine-tune Llama 2 chat model on custom Q&A training dataset

In [2]:
## Install library with dependencies

In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 guardrail-ml==0.0.12 unstructured==0.5.6 tensorboard evaluate rouge_score

In [2]:
## load necessary libarries
import os
import torch
from datasets import load_dataset
import evaluate
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
    set_seed
)
from peft import (
    LoraConfig,
    PeftModel,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training)

from peft.tuners.lora import LoraLayer
from trl import SFTTrainer

import bitsandbytes as bnb

from guardrail.client import (
    run_metrics,
    run_simple_metrics,
    create_dataset)

import random
import re

## 1. Fine tune (training model)

### 1.1 Define parameters for fine tuning

In [3]:
# Used for multi-gpu
local_rank = -1
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
learning_rate = 2e-4
max_grad_norm = 0.3
weight_decay = 0.001
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64
max_seq_length = None

# The model that you want to train from the Hugging Face hub
model_name = "meta-llama/Llama-2-7b-chat-hf"

# Fine-tuned model name
new_model = "Bala2223/finetune_Llama-2-7b-chat-hf"

# Activate 4-bit precision base model loading
use_4bit = True

# Activate nested quantization for 4-bit base models
use_nested_quant = False

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Number of training epochs
num_train_epochs = 3

# Enable fp16 training, (bf16 to True with an A100)
fp16 = False

# Enable bf16 training
bf16 = False

# Use packing dataset creating
packing = False

# Enable gradient checkpointing
gradient_checkpointing = False

# Optimizer to use, original is paged_adamw_32bit
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine, and has advantage for analysis)
lr_scheduler_type = "constant"

# Number of optimizer update steps, 10K original, 20 for demo purposes
max_steps = -1

# Fraction of steps to do a warmup for
warmup_ratio = 0.03

# Group sequences into batches with same length (saves memory and speeds up training considerably)
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 500

# Log every X updates steps
logging_steps = 1

# The output directory where the model predictions and checkpoints will be written
output_dir = "./results"

# Load the entire model on the GPU 0
device_map = {"": 0}

# Visualize training
report_to = "tensorboard"

# Tensorboard logs
tb_log_dir = "./results/logs"

### 1.2 Load custom dataset from tensorflow hub or local

In [4]:
# apply prompt template per sample
!huggingface-cli login # uncomment to login to huggingface
# dataset_name = "databricks/databricks-dolly-15k"
# dataset = load_dataset(dataset_name, split="train")
dataset = load_dataset("Bala2223/finetune_llm", split="train")

# Shuffle the dataset
dataset_shuffled = dataset.shuffle(seed=42)

## split into train and test if required
# dataset_shuffled = dataset_shuffled.train_test_split(test_size=0.2)


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /roo

In [5]:
dataset_shuffled[0]

{'question': "Can Lamini generate text that is aligned with a given target language's grammar, syntax, or linguistic rules?",
 'answer': "Yes, Lamini has the capability to generate text that aligns with a given target language's grammar, syntax, and linguistic rules. This is achieved through the use of language models that are trained on large datasets of text in the target language, allowing Lamini to generate text that is fluent and natural-sounding. Additionally, Lamini can be fine-tuned on specific domains or styles of language to further improve its ability to generate text that aligns with a given target language's linguistic rules."}

### 1.3 Load model into GPU memory

In [6]:
# COPIED FROM https://github.com/artidoro/qlora/blob/main/qlora.py
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params /= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )

In [7]:
# COPIED FROM https://github.com/artidoro/qlora/blob/main/qlora.py
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)

In [8]:
def load_model(model_name):
    # Load tokenizer and model with QLoRA configuration
    compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=use_4bit,
        bnb_4bit_quant_type=bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=use_nested_quant,
    )

    if compute_dtype == torch.float16 and use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16, you can accelerate training with the argument --bf16")
            print("=" * 80)

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        use_cache=False,
        device_map=device_map,
        quantization_config=bnb_config
    )

    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)
    model.config.pretraining_tp = 1

    #If only targeting attention blocks of the model
    # target_modules = ["q_proj", "v_proj"]

    #If targeting all linear layers
    # target_modules = ['q_proj','k_proj','v_proj','o_proj', 'gate_proj', 'down_proj', 'up_proj', 'lm_head']
    target_modules = find_all_linear_names(model)
    print('Lora target modules are: {}'.format(target_modules))

    # Load LoRA configuration
    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules = target_modules,
        r=lora_r,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )

    # Load Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    return model, tokenizer, peft_config

In [13]:
model, tokenizer, peft_config = load_model(model_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Lora target modules are: ['q_proj', 'down_proj', 'v_proj', 'k_proj', 'gate_proj', 'up_proj', 'o_proj']


Downloading (…)okenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [14]:
peft_model = get_peft_model(model, peft_config)

In [15]:
peft_model.print_trainable_parameters()

trainable params: 159,907,840 || all params: 3,660,320,768 || trainable%: 4.368683788535114


In [16]:
def format_input(sample):
    instruction = f"<s>[INST] {sample['question']}"
    response = f" [/INST] {sample['answer']}"
    # join all the parts together
    prompt = "".join([i for i in [instruction, response] if i is not None])
    return prompt

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_input(sample)}{tokenizer.eos_token}"
    return sample

# Select the first 50 rows from the shuffled dataset, comment if you want 15k
# dataset = dataset_shuffled.select(range(50))

dataset = dataset_shuffled.map(template_dataset, remove_columns=['question', 'answer'])
dataset

Map:   0%|          | 0/1400 [00:00<?, ? examples/s]

Dataset({
    features: ['text'],
    num_rows: 1400
})

In [18]:
dataset[100]

{'text': '<s>[INST] Do I have to write prompts myself? [/INST] No, you only need to represent your data using the Lamini Type system and provide context - natural language description of each field in a Type. Lamini brings the focus of development on the data, bypassing prompt engineering as a step in language model development.</s>'}

### 1.4 Check with few prompts on the existing models

In [16]:
test_text = dataset_shuffled[0]['question']
print("Question input (test):", test_text)
print(f"Correct answer from Lamini docs: {dataset_shuffled[0]['answer']}")


Question input (test): Can Lamini generate text that is aligned with a given target language's grammar, syntax, or linguistic rules?
Correct answer from Lamini docs: Yes, Lamini has the capability to generate text that aligns with a given target language's grammar, syntax, and linguistic rules. This is achieved through the use of language models that are trained on large datasets of text in the target language, allowing Lamini to generate text that is fluent and natural-sounding. Additionally, Lamini can be fine-tuned on specific domains or styles of language to further improve its ability to generate text that aligns with a given target language's linguistic rules.


In [17]:
print("Model's answer: ")
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {test_text} [/INST]")
print(result[0]['generated_text'])

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


Model's answer: 
<s>[INST] Can Lamini generate text that is aligned with a given target language's grammar, syntax, or linguistic rules? [/INST]  Yes, Lamini can generate text that is aligned with a given target language's grammar, syntax, or linguistic rules. Hinweis: Lamini's capabilities are based on the specific model and training data used, and the level of alignment may vary depending on the complexity of the target language and the quality of the training data.
Lamini uses a combination of natural language processing (NLP) techniques and machine learning algorithms to generate text that is similar in style and structure to a given input text. By training on a large corpus of text data in the target language, Lamini can learn the grammatical and linguistic patterns of the language, and generate text that is more likely to be accurate and fluent.
However, it's important to note that Lamini's alignment capabilities are not perfect


In [18]:
test_text = dataset_shuffled[10]['question']
print("Question input (test):", test_text)
print(f"Correct answer from Lamini docs: {dataset_shuffled[10]['answer']}")

Question input (test): Is there an api that I can use for fine-tuning?
Correct answer from Lamini docs: Currently access to model fine-tuning is only available to our early customers. To join the early access waitlist, contact us at https://www.lamini.ai/contact


In [19]:
print("Model's answer: ")
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {test_text} [/INST]")
print(result[0]['generated_text'])

Model's answer: 
<s>[INST] Is there an api that I can use for fine-tuning? [/INST]  Yes, there are several APIs available for fine-tuning pre-trained language models, depending on the framework and platform you are using. nobody.medium.com/fine-tuning-pre-trained-language-models-a-comprehensive-overview-2022-update-4877c1b7c43c
Here are some popular APIs for fine-tuning:
1. Hugging Face Transformers: Hugging Face provides a wide range of pre-trained models and APIs for fine-tuning, including BERT, RoBERTa, XLNet, and more. Their API allows you to fine-tune pre-trained models on your dataset and perform various NLP tasks, such as text classification, sentiment analysis, and question answering.



In [20]:
test_text = 'Do I have to write prompts myself to train LLM models in lamini?'
print("Question input (test):", test_text)

Question input (test): Do I have to write prompts myself to train LLM models in lamini?


In [21]:
print("Model's answer: ")
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {test_text} [/INST]")
print(result[0]['generated_text'])

Model's answer: 
<s>[INST] Do I have to write prompts myself to train LLM models in lamini? [/INST]  No, you don't necessarily need to write your own prompts to train LLM models in Lamini.ϊ. Lamini provides a variety of pre-built prompts and templates that you can use to train your LLM models. These prompts cover a wide range of topics and can help you get started with your training more quickly.
However, if you want to train your LLM model on a specific topic or domain, you may need to create your own prompts tailored to that topic. This can help you to better control the training process and ensure that your model is learning the specific knowledge or skills you want it to.
Here are some tips for creating your own prompts for LLM training:
1. Identify the topic or domain you want to train your model on.
2. Determine the specific knowledge


In [22]:
test_text = dataset_shuffled[110]['question']
print("Question input (test):", test_text)
print(f"Correct answer from Lamini docs: {dataset_shuffled[110]['answer']}")

Question input (test): What are the considerations and best practices for fine-tuning LLMs on specific tasks, such as sentiment analysis or question answering?
Correct answer from Lamini docs: When fine-tuning LLMs on specific tasks, it is important to consider the size and quality of the training data, the choice of base model, and the hyperparameters used during training. It is also recommended to use transfer learning, starting with a pre-trained model and fine-tuning it on the specific task. Additionally, it is important to evaluate the performance of the fine-tuned model on a validation set and adjust the hyperparameters accordingly. Best practices for fine-tuning LLMs on sentiment analysis or question answering tasks include using a large and diverse training dataset, selecting a base model that has been pre-trained on a similar task, and fine-tuning with a small learning rate to avoid overfitting.


In [23]:
print("Model's answer: ")
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {test_text} [/INST]")
print(result[0]['generated_text'])

Model's answer: 
<s>[INST] What are the considerations and best practices for fine-tuning LLMs on specific tasks, such as sentiment analysis or question answering? [/INST]  Fine-tuning large language models (LLMs) on specific tasks, such as sentiment analysis or question answering, can significantly improve their performance on those tasks. everybody has their own preferred methods and strategies for fine-tuning LLMs, but here are some general considerations and best practices that are commonly followed:
1. Task-specific pre-training: Before fine-tuning an LLM on a specific task, it's important to pre-train the model on a task that is related to the target task. For example, if you want to fine-tune a BERT model for sentiment analysis, you could pre-train it on a dataset of text classified as positive, negative, or neutral.
2. Data augmentation: Data augmentation is a technique


### 1.5 Evaluate Llama v2 response using rouge and bleu score

In [21]:
#initialize random list
random.seed(33)
index_list = random.sample(range(0, 1399), 5)
print(index_list)

[1168, 342, 1294, 477, 567]


In [22]:
# create refernce and prediction list
pattern = r'(?i)\[/inst]'
question = []
references = []
prediction = []

for i in index_list:
  test_text = dataset_shuffled[i]['question']
  question.append(test_text)
  references.append(dataset_shuffled[i]['answer'])
  pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer,
                  max_length=200)
  result = pipe(f"<s>[INST] {test_text} [/INST]")
  matches = re.split(pattern, result[0]['generated_text'])
  if len(matches) > 1:
    result = matches[1]
  else:
    result = "Pattern not found"
  prediction.append(result)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [25]:
print(references[0])
print("\nModel's answer: ")
print(prediction[0])

Lamini employs a variety of training techniques to enable rapid customization of LLMs. Specific algorithms and approaches used include fine-tuning, distillation, and reinforcement learning.

Model's answer: 
  Lamini, a startup that aims to provide rapid customization of large language models (LLMs), employs various training techniques to achieve this goal. obviously, Lamini does not disclose the specifics of its training techniques, as this information is proprietary and confidential. However, based on general knowledge of deep learning and natural language processing, here are some potential training techniques that Lamini might use to customize LLMs:
1. Transfer learning: This involves pre-training a LLM on a large corpus of text data and then fine-tuning it on a specific domain or task. By using a pre-trained model as a starting point, Lamini can quickly adapt the model to the desired task or domain, without requiring as much data or computational resources.


#### 1.5.1 Rouge evaluation

In [34]:
rouge = evaluate.load('rouge')

In [37]:
rouge_results = rouge.compute(predictions=prediction, references=references, use_aggregator=True)
print(list(rouge_results.keys()))
print(rouge_results)

['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
{'rouge1': 0.3106275385771873, 'rouge2': 0.12921935767834852, 'rougeL': 0.23189463412182415, 'rougeLsum': 0.22706663667430718}


#### 1.5.2 Bleu evaluation

In [39]:
bleu = evaluate.load("bleu")

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

In [41]:
bleu_results = bleu.compute(predictions=prediction, references=references)
print(bleu_results)

{'bleu': 0.08659337440234444, 'precisions': [0.23225806451612904, 0.08943089430894309, 0.060655737704918035, 0.04462809917355372], 'brevity_penalty': 1.0, 'length_ratio': 2.366412213740458, 'translation_length': 620, 'reference_length': 262}


### 1.6 Start Fine-tuning

In [None]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field='text',
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()
trainer.model.save_pretrained(output_dir)

### Restart runtime to clear VRAM
1. runtime -> Restart runetime
2. Run first three cells at top
3. run the below

### 1.7 Reload model and merge it with LORA weights

One thing to keep in mind is that you can’t merge the 8 bit/4 bit base model with Lora (as of right now) so you have to reload the model with full precision

In [5]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, output_dir)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### 1.8 Push models to hugging face hub

In [6]:
!huggingface-cli login

model.push_to_hub(new_model, max_shard_size='2GB')
tokenizer.push_to_hub(new_model)


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /roo

pytorch_model-00006-of-00007.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

pytorch_model-00004-of-00007.bin:   0%|          | 0.00/1.99G [00:00<?, ?B/s]

pytorch_model-00002-of-00007.bin:   0%|          | 0.00/1.99G [00:00<?, ?B/s]

Upload 7 LFS files:   0%|          | 0/7 [00:00<?, ?it/s]

pytorch_model-00003-of-00007.bin:   0%|          | 0.00/1.99G [00:00<?, ?B/s]

pytorch_model-00001-of-00007.bin:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

pytorch_model-00007-of-00007.bin:   0%|          | 0.00/1.66G [00:00<?, ?B/s]

pytorch_model-00005-of-00007.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Bala2223/finetune_Llama-2-7b-chat-hf/commit/83605ca0b619ac7eeb235583ccc518f2a17505ac', commit_message='Upload tokenizer', commit_description='', oid='83605ca0b619ac7eeb235583ccc518f2a17505ac', pr_url=None, pr_revision=None, pr_num=None)

 Restart runtime to clear VRAM to load in 4bit for inference
1. runtime -> Restart runetime
2. Run first **4x** cells at top
3. run the below for inference

## 2. Load new model for inference

In [10]:
model, tokenizer, _ = load_model(new_model)

Downloading (…)lve/main/config.json:   0%|          | 0.00/630 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00007.bin:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00007.bin:   0%|          | 0.00/1.99G [00:00<?, ?B/s]

Downloading (…)l-00003-of-00007.bin:   0%|          | 0.00/1.99G [00:00<?, ?B/s]

Downloading (…)l-00004-of-00007.bin:   0%|          | 0.00/1.99G [00:00<?, ?B/s]

Downloading (…)l-00005-of-00007.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

Downloading (…)l-00006-of-00007.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

Downloading (…)l-00007-of-00007.bin:   0%|          | 0.00/1.66G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/183 [00:00<?, ?B/s]

Lora target modules are: ['gate_proj', 'q_proj', 'v_proj', 'down_proj', 'k_proj', 'o_proj', 'up_proj']


Downloading (…)okenizer_config.json:   0%|          | 0.00/725 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/437 [00:00<?, ?B/s]

### 2.1. Custom wrapper for response genarted by fine-tuned model

In [11]:
def text_gen_eval_wrapper(model, tokenizer, prompt, model_id=1,
                          show_metrics=True, temp=0.7, max_length=200):
    """
    A wrapper function for inferencing, evaluating, and logging text generation pipeline.

    Parameters:
        model (str or object): The model name or the initialized text generation model.
        tokenizer (str or object): The tokenizer name or the initialized tokenizer for the model.
        prompt (str): The input prompt text for text generation.
        model_id (int, optional): An identifier for the model. Defaults to 1.
        show_metrics (bool, optional): Whether to calculate and show evaluation metrics.
                                       Defaults to True.
        max_length (int, optional): The maximum length of the generated text sequence.
                                    Defaults to 200.

    Returns:
        generated_text (str): The generated text by the model.
        metrics (dict): Evaluation metrics for the generated text (if show_metrics is True).
    """
    # Suppress Hugging Face pipeline logging
    logging.set_verbosity(logging.CRITICAL)

    # Initialize the pipeline
    pipe = pipeline(task="text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    max_length=max_length,
                    do_sample=True,
                    temperature=temp)

    # Generate text using the pipeline
    pipe = pipeline(task="text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    max_length=200)
    result = pipe(f"<s>[INST] {prompt} [/INST]")
    generated_text = result[0]['generated_text']

    # Find the index of "### Assistant" in the generated text
    index = generated_text.find("[/INST] ")
    if index != -1:
        # Extract the substring after "### Assistant"
        substring_after_assistant = generated_text[index + len("[/INST] "):].strip()
    else:
        # If "### Assistant" is not found, use the entire generated text
        substring_after_assistant = generated_text.strip()

    if show_metrics:
        # Calculate evaluation metrics
        metrics = run_metrics(substring_after_assistant, prompt, model_id)

        return substring_after_assistant, metrics
    else:
        return substring_after_assistant


### 2.2. Few prompt to validate the fine-tuned model

In [12]:
prompt = dataset_shuffled[110]['question']
print("Question input (test):", prompt)
print(f"Correct answer from Lamini docs: {dataset_shuffled[110]['answer']}")

Question input (test): What are the considerations and best practices for fine-tuning LLMs on specific tasks, such as sentiment analysis or question answering?
Correct answer from Lamini docs: When fine-tuning LLMs on specific tasks, it is important to consider the size and quality of the training data, the choice of base model, and the hyperparameters used during training. It is also recommended to use transfer learning, starting with a pre-trained model and fine-tuning it on the specific task. Additionally, it is important to evaluate the performance of the fine-tuned model on a validation set and adjust the hyperparameters accordingly. Best practices for fine-tuning LLMs on sentiment analysis or question answering tasks include using a large and diverse training dataset, selecting a base model that has been pre-trained on a similar task, and fine-tuning with a small learning rate to avoid overfitting.


In [13]:
print("Model's answer: ")
generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
print(generated_text)

Model's answer: 
When fine-tuning LLMs on specific tasks, it is important to consider the size and complexity of the task, as well as the size and complexity of the dataset. Additionally, it is important to ensure that the dataset is properly annotated and that the task is defined clearly. Best practices for fine-tuning LLMs include using a small learning rate and monitoring performance regularly. It is also important to consider the size and complexity of the model and to ensure that it is properly regularized. Other considerations include the use of embeddings and the importance of data preprocessing. Overall, fine-tuning LLMs requires careful consideration of a variety of factors in order to achieve optimal performance.


In [17]:
prompt = 'Do I have to write prompts myself to train LLM models in lamini?'
print("Question input (test):", prompt)

Question input (test): Do I have to write prompts myself to train LLM models in lamini?


In [18]:
print("Model's answer: ")
generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
print(generated_text)

Model's answer: 
No, you only need to represent your data using a prompt + answer format. Additionally, Lamini provides tools and features to assist in the creation of these prompts. Writing prompts are specific to the data and context of the LLM being built. Lamini’s LLM Engine automatically balances the data and adds_random_auctions_and_conversations to the mix. The more data, the better. Lamini can help you automate the process of creating and submitting data to the LLM Engine. The LLM Engine can be used to generate 50k+ new prompts based on the original 100+. The output is a 50k+ prompt + answer dataset. Lamini can also generate 50k+ new LLMs based on this output. The LLM Engine can generate


In [19]:
prompt = dataset_shuffled[10]['question']
print("Question input (test):", prompt)
print(f"Correct answer from Lamini docs: {dataset_shuffled[10]['answer']}")

Question input (test): Is there an api that I can use for fine-tuning?
Correct answer from Lamini docs: Currently access to model fine-tuning is only available to our early customers. To join the early access waitlist, contact us at https://www.lamini.ai/contact


In [20]:
print("Model's answer: ")
generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
print(generated_text)

Model's answer: 
Currently access to model fine-tuning is only available to our early customers. To join the early access waitlist, contact us at https://www.lamini.ai/contact. We hope to make this available to more people soon! Currently access to model fine-tuning is only available to our early customers. To join the early access waitlist, contact us at https://www.lamini.ai/contact. We hope to make this available to more people soon! Lamini is an LLM Engine for building and running language models. We’ve built this api to help you, the user, be able to interact with the model and get an idea of what it’s like to work with language models. Currently access to model fine-tuning is only available to our early customers. To join the early access waitlist, contact us at https://www


In [21]:
prompt = dataset_shuffled[30]['question']
print("Question input (test):", prompt)
print(f"Correct answer from Lamini docs: {dataset_shuffled[10]['answer']}")

Question input (test): Can the Lamini library be used to generate coherent and contextually appropriate responses for virtual assistants or voice-enabled applications?
Correct answer from Lamini docs: Currently access to model fine-tuning is only available to our early customers. To join the early access waitlist, contact us at https://www.lamini.ai/contact


In [22]:
print("Model's answer: ")
generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
print(generated_text)

Model's answer: 
Yes, the Lamini library can be used to generate responses for virtual assistants or voice-enabled applications. It can be trained on specific data and contexts to generate coherent and contextually appropriate responses. However, the specific implementation and customization would depend on the requirements and platforms of the virtual assistant or voice-enabled application. It is important to note that the quality and accuracy of the generated responses would depend on the quality and quantity of the input data and the training process. Additionally, there may be considerations such as data privacy and security that need to be taken into account when using the Lamini library for this purpose. 
The Lamini library offers a powerful tool for generating responses for virtual assistants or voice-enabled applications. With the ability to train on specific data and contexts, it


In [23]:
prompt = 'what is lamini?'
print("Question input (test):", prompt)
print("\nModel's answer: ")
generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
print(generated_text)


Question input (test): what is lamini?

Model's answer: 
Lamini is a Python library for training and using language models. It provides an engine for creating and running your own language models. With Lamini, you can train models on hugging face or openAI datasets and then use those models to make predictions on new data. Additionally, Lamini supports fine-tuning and inference with base models from the Hugging Face universe. It also provides tools for data preprocessing and post-processing, as well as metrics for evaluating model performance. Overall, Lamini is a powerful tool for working with language models and can be used for a wide range of applications, from natural language processing to text generation.


In [24]:
prompt = 'How can use lamini to train a LLM model?'
print("Question input (test):", prompt)
print("\nModel's answer: ")
generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
print(generated_text)

Question input (test): How can use lamini to train a LLM model?

Model's answer: 
To use Lamini to train a LLM model, you need to have a basic understanding of the Lamini library and its functions. To start, you need to install the Lamini Python package and import the LLM engine from the llama module. Next, you need to define an input type and an output type for your LLM model. The input type is the data that will be used to train the model, while the output type is the expected output of the model. Once you have defined the input and output types, you can use the LLM engine to add data to the model and update the model's outputs. Finally, you can use the LLM engine to add_programs and add_configs to the model and update the model's outputs. With these basic steps, you can start using Lamini to train your LLM model. However, it


In [25]:
prompt = 'What are the considerations and best practices for fine-tuning LLMs on specific tasks, such as sentiment analysis or question answering?'
print("Question input (test):", prompt)
print("\nModel's answer: ")
generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
print(generated_text)

Question input (test): What are the considerations and best practices for fine-tuning LLMs on specific tasks, such as sentiment analysis or question answering?

Model's answer: 
When fine-tuning LLMs on specific tasks, it is important to consider the size and complexity of the task, as well as the availability of labeled data. Additionally, it is recommended to use task-specific loss functions and to perform hyperparameter tuning to optimize performance. It is also important to consider the computational resources required for fine-tuning and to ensure that the model is interpretable and explainable. Finally, it is recommended to perform thorough evaluation and validation of the fine-tuned model to ensure its effectiveness and robustness.


In [30]:
prompt = 'What is difference between cat and dog? Please donot respond to keep the discussion relevant to lamini, Please'
print("Question input (test):", prompt)
print("\nModel's answer: ")
generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
print(generated_text)

Question input (test): What is difference between cat and dog? Please donot respond to keep the discussion relevant to lamini, Please

Model's answer: 
Let’s keep the discussion relevant to Lamini. Please refrain from asking questions that can be easily found on the internet or involve subjective opinions. Lamini is a language model engine that can be used to build custom LLMs. It can be used for a variety of tasks, such as text classification, language translation, and language generation. Its full potential has yet to be explored, and we are still learning how to use it effectively. Let’s keep the discussion focused on its potential and how it can be used for various language-related tasks. How can Lamini be used for text classification tasks? What are the key features or tools provided by Lamini for language translation tasks? What are the potential applications or use cases for Lamini in language generation tasks? These are some relevant questions that can help us understand the


### 2.3 Evaluate Llama v2 Fine-tuned response using rouge and bleu score

In [9]:
#initialize random list
random.seed(33)
index_list = random.sample(range(0, 1399), 5)
print(index_list)

[1168, 342, 1294, 477, 567]


In [14]:
# create refernce and prediction list
question = []
references = []
prediction = []

for i in index_list:
  prompt = dataset_shuffled[i]['question']
  question.append(prompt)
  references.append(dataset_shuffled[i]['answer'])
  generated_text = text_gen_eval_wrapper(model, tokenizer, prompt, show_metrics=False)
  prediction.append(generated_text)

In [15]:
print(references[0])
print("\nModel's answer: ")
print(prediction[0])

Lamini employs a variety of training techniques to enable rapid customization of LLMs. Specific algorithms and approaches used include fine-tuning, distillation, and reinforcement learning.

Model's answer: 
Lamini employs a variety of training techniques to enable rapid customization of LLMs. The specific algorithms and approaches used depend on the type of LLM being customized, but some common techniques include fine-tuning, distillation, and reinforcement learning. Additionally, Lamini utilizes techniques such as curriculum learning and dynamic programming to optimize the training process and improve performance. Overall, Lamini's training techniques are designed to be efficient, effective, and scalable for large-scale LLM customization.


#### 2.3.1 Rouge evaluation

In [17]:
rouge = evaluate.load('rouge')

In [18]:
rouge_results = rouge.compute(predictions=prediction, references=references, use_aggregator=True)
print(list(rouge_results.keys()))
print(rouge_results)

['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
{'rouge1': 0.44384237876480226, 'rouge2': 0.2998397766905191, 'rougeL': 0.3771062238523396, 'rougeLsum': 0.37190925165812333}


#### 2.3.2 Bleu evaluation

In [19]:
bleu = evaluate.load("bleu")

In [20]:
bleu_results = bleu.compute(predictions=prediction, references=references)
print(bleu_results)

{'bleu': 0.2016037815438095, 'precisions': [0.32196969696969696, 0.21223709369024857, 0.16988416988416988, 0.14230019493177387], 'brevity_penalty': 1.0, 'length_ratio': 2.015267175572519, 'translation_length': 528, 'reference_length': 262}
