### **Fine-tune TinyLlama-1.1B-Chat-v1.0, a relatively small model**

Here we will use `QLoRA (Efficient Finetuning of Quantized LLMs)`, an efficient fine-tuning technique that `quantizes` a pretrained LLM to just `4` bits and adds small, trainable `“Low-Rank Adapters”` on top of the frozen quantized weights. This makes it possible to fine-tune an LLM on a single GPU. The technique is supported by the `PEFT` library.

#### **Step-1 - Install Dependencies**

In [1]:
# !pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score bert_score pynvml

We are `not` going to track our training metrics, so let’s disable Weights and Biases (`W&B`), a platform for monitoring training runs, visualizing data and models, and sharing results. To deactivate it during fine-tuning, set the environment variable below.

If you have an account with Weights and Biases, feel free to enable it and experiment with it.

In [None]:
import os
# disable Weights and Biases
os.environ['WANDB_DISABLED']="true"
os.environ['HUGGING_FACE_API_TOKEN']="XXXXXXXXXXXXXXXXXXXXXXXXXXXX" 

- `bitsandbytes`: A lightweight wrapper around custom CUDA functions that make LLMs go faster: optimizers, matrix multiplication, and quantization. In this tutorial, we use it to load our model as efficiently as possible.
- `transformers`: A library by Hugging Face (🤗) that provides pre-trained models and training utilities for various natural language processing tasks.
- `peft`: A library by Hugging Face (🤗) that enables parameter-efficient fine-tuning.
- `accelerate`: Abstracts the boilerplate code related to multi-GPU/TPU/fp16 setups and leaves the rest of your code unchanged.
- `datasets`: Another library by Hugging Face (🤗) that provides easy access to a wide range of datasets.
- `einops`: A library that simplifies tensor operations.


Loading the required libraries



#### **Step-2 - Load the Libraries**

In [3]:
from datasets import load_dataset, DatasetDict
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig
)
from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login

interpreter_login()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).






In [4]:
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

Next we need to load a dataset. Numerous datasets are available for fine-tuning; here we use the `g3lu/addictive_manufacturing_reasoning` dataset from Hugging Face. Each record pairs a question about additive manufacturing with step-by-step reasoning and a final answer, and the dataset ships with 9,000 training rows and 1,000 validation rows.

#### **Step-3 - Load the Datasets**

In [5]:
huggingface_dataset_name = "g3lu/addictive_manufacturing_reasoning" # https://huggingface.co/datasets/g3lu/addictive_manufacturing_reasoning
dataset = load_dataset(huggingface_dataset_name)

In [8]:
print(dataset)
print(type(dataset["train"][0]), dataset["train"][0])


DatasetDict({
    train: Dataset({
        features: ['question', 'reason', 'answer'],
        num_rows: 9000
    })
    validation: Dataset({
        features: ['question', 'reason', 'answer'],
        num_rows: 1000
    })
})
<class 'dict'> {'question': 'How do the specific requirements for strength, wear resistance, and surface quality in injection mould inserts influence the design and production of additive manufactured tools, and what are the potential trade-offs between these factors in achieving optimal tool performance?', 'reason': "Okay, so I'm trying to figure out how the specific requirements for strength, wear resistance, and surface quality in injection mold inserts influence the design and production of additive manufactured tools. And also, what are the potential trade-offs between these factors when aiming for optimal tool performance.\n\nFirst, let me break down each factor:\n\n1. Strength: The insert needs to be strong enough to handle the forces during injection mol

In [9]:
# The Hugging Face dataset already ships with its own validation split (see the output above),
# but here we build our own: the original training set is shuffled and split into
# a training set (90%) and a validation set (10%), which replaces the original validation split.
# A held-out validation set lets us check that the model is not overfitting.

shuffled_dataset = dataset["train"].shuffle(seed=42) # Why? If the data is ordered by topic, difficulty, or other hidden biases — that can mess with generalization.
split_dataset = shuffled_dataset.train_test_split(test_size=0.1, seed=42)


# Now, split_dataset contains the "train" and "test" split
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

# Optional: Re-wrap into DatasetDict if you want to return the result as a dictionary
dataset = DatasetDict({
    "train": train_dataset,
    "validation": eval_dataset
})

# Print the new split dataset to confirm
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['question', 'reason', 'answer'],
        num_rows: 8100
    })
    validation: Dataset({
        features: ['question', 'reason', 'answer'],
        num_rows: 900
    })
})


Each record contains the following fields:

- question: a question about industrial implementation.
- reason: step-by-step reasoning on how to resolve the query.
- answer: the human-written answer to the question.


#### **Step-4 - Create Bitsandbytes configuration**

We use `BitsAndBytesConfig` to load our model in 4-bit format. This reduces memory consumption considerably, at the cost of some accuracy.

In [10]:
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )
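
To build intuition for what 4-bit loading does to the weights, here is a toy sketch of blockwise absmax quantization in plain Python. It is an illustration only: `bitsandbytes` with `nf4` uses a non-uniform 16-value codebook and CUDA kernels, whereas this sketch uses evenly spaced integer levels, and the helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes using the block's absolute maximum as scale."""
    levels = 2 ** (bits - 1) - 1                    # 7 for 4 bits
    scale = max(abs(v) for v in values) or 1.0      # absmax of the block (1.0 if all zeros)
    codes = [round(v / scale * levels) for v in values]
    return codes, scale

def dequantize_block(codes, scale, bits=4):
    """Recover approximate floats from the integer codes and the stored scale."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scale for c in codes]

weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.25]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight is stored as a small integer plus one shared scale per block, which is where the memory saving comes from; the reconstruction error is the accuracy cost mentioned above.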

#### **Step-5 - Loading the Pre-Trained model**

`TinyLlama` is a Small Language Model (SLM) with about 1.1 billion parameters.

We load TinyLlama from Hugging Face using `4-bit quantization`.

In [11]:
model_name='TinyLlama/TinyLlama-1.1B-Chat-v1.0'
device_map = {"": 0}
original_model = AutoModelForCausalLM.from_pretrained(model_name, 
                                                      device_map=device_map,
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      token=True)  # `use_auth_token` is deprecated in favor of `token`



The model is loaded in 4-bit using the `BitsAndBytesConfig` from the bitsandbytes library. This is a part of the QLoRA process, which involves quantizing the pre-trained weights of the model to 4-bit and keeping them fixed during fine-tuning.
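
As a rough back-of-the-envelope check, the sketch below estimates how much storage the weights alone need at 16 bits versus 4 bits. The parameter count of 1.1 billion is an approximation, and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real GPU usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 1.1e9  # approximate TinyLlama size; an assumption for illustration

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"fp16 weights : ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{int4_gb:.2f} GB")
```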

#### **Step-6 -  Tokenization**

Configure the tokenizer with left-padding and with BOS and EOS tokens added to each sequence; the EOS token also serves as the padding token.

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_name,trust_remote_code=True,padding_side="left",add_eos_token=True,add_bos_token=True,use_fast=False)
tokenizer.pad_token = tokenizer.eos_token

In [51]:
print(original_model.device)  # Should say 'cuda:0'

print_gpu_utilization()

cuda:0
GPU memory occupied: 6911 MB.


In [16]:
eval_tokenizer = AutoTokenizer.from_pretrained(model_name, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

def gen(model, p, maxlen=100, sample=True):
    # Tokenize the prompt, generate on the GPU, and decode prompt + continuation
    toks = eval_tokenizer(p, return_tensors="pt").to("cuda")
    res = model.generate(
        **toks, max_new_tokens=maxlen, do_sample=sample,
        num_return_sequences=1, temperature=0.1, num_beams=1, top_p=0.95,
    ).to('cpu')
    return eval_tokenizer.batch_decode(res, skip_special_tokens=True)

In [17]:
print(gen(original_model, "Explain additive manufacturing."))

['Explain additive manufacturing.']


#### **Step-7 - Test the Model with Zero Shot Inferencing**

Evaluate the base model that we loaded above using a few sample inputs.

In [18]:
%%time
from transformers import set_seed
seed = 42
set_seed(seed)

index = 10

prompt = dataset['train'][index]['question']
reason = dataset['train'][index]['reason']
answer = dataset['train'][index]['answer']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model, formatted_prompt, 100)

output = res[0].split('Output:\n')[1] if 'Output:\n' in res[0] else res[0]
dash_line = '-' * 100

print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN REASONING:\n{reason}\n')
print(dash_line)
print(f'HUMAN FINAL ANSWER:\n{answer}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')


----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
What specific material properties and microstructural characteristics must be optimized and controlled during the development of novel thermoplastic filaments for Material Extrusion (MEX) in order to achieve tailored mechanical performance, thermal stability, and processability, considering factors such as crystallinity, molecular weight, and additive formulations?
Output:

----------------------------------------------------------------------------------------------------
BASELINE HUMAN REASONING:
Okay, so I need to figure out what specific material properties and microstructural characteristics are important when developing new thermoplastic filaments for Material Extrusion (MEX). The goal is to get tailored mechanical performance, thermal stability, and processability. Factors like crystallinity, molecular weight, and addi

From the output above, the base model struggles to produce a useful response compared with the reference reasoning and answer. However, it manages to extract essential information from the text, suggesting the potential for fine-tuning the model for the specific task at hand.

#### **Step-8 - Pre-processing dataset**

The dataset cannot be used directly for fine-tuning. It must first be formatted into prompts the model can learn from: each record's question, reasoning, and answer are combined into a single instruction-style text, as described below.


To encourage the model to write more concise answers, we can try the following `QA` prompt format: "Instruct:<prompt>\nOutput:"

Ex:   
Instruct: Write a brief analogy between books and windows.  
Output: Books are like windows—each one opens to a new world, offering a glimpse into lives, ideas, and places beyond our own.  

The model generates its text after "Output:".
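
A minimal sketch of this format in plain Python: one helper wraps an instruction in the `Instruct:`/`Output:` layout and another pulls out whatever follows the marker. Both helper names are hypothetical, and the "generation" here is a hand-written stand-in rather than real model output.

```python
def build_qa_prompt(instruction):
    """Wrap an instruction in the 'Instruct: ... Output:' format."""
    return f"Instruct: {instruction}\nOutput:"

def extract_output(generated_text):
    """Return the text after the first 'Output:' marker, or the whole text if it is absent."""
    _, marker, tail = generated_text.partition("Output:")
    return tail.strip() if marker else generated_text.strip()

prompt = build_qa_prompt("Write a brief analogy between books and windows.")
fake_generation = prompt + " Books are like windows."   # stand-in for model output

print(extract_output(fake_generation))
```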


We will create a helper function to format the input dataset so that it is suitable for fine-tuning. Here, we convert the question, reasoning, and answer fields into an explicit instruction for the LLM.

In [None]:
def create_prompt_formats_with_answer(sample):
    """
    Format the fields of the sample, including 'question', 'reason', and 'answer'.
    This version includes the answer as part of the expected output for the model to generate.
    
    :param sample: Sample dictionary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation and answer the question."
    END_KEY = "### End"
    
    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"Question: {sample['question']}" if sample.get('question') else None
    reason = f"Reason: {sample['reason']}" if sample.get('reason') else None
    answer = f"Answer: {sample['answer']}" if sample.get('answer') else None
    end = f"{END_KEY}"
    
    # Combine the parts into a complete prompt
    parts = [part for part in [blurb, instruction, input_context, reason, answer, end] if part]

    formatted_prompt = "\n\n".join(parts)
    
    # Store the formatted prompt in the sample under the 'text' field
    sample["text"] = formatted_prompt

    return sample


The above function can be used to convert our input into prompt format.

Now, we will use our model tokenizer to process these prompts into tokenized ones.

Our aim here is to generate input sequences with consistent lengths, which is beneficial for fine-tuning the language model by optimizing efficiency and minimizing computational overhead. It is essential to ensure that these sequences do not surpass the model’s maximum token limit.

In [20]:
from functools import partial

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int,seed, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats_with_answer)

    # Tokenize each batch and remove the raw 'question', 'reason', 'answer' columns
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['question', 'reason', 'answer'],
    )

    # truncation=True caps every sequence at max_length, so this drops the samples that reached the cap (i.e. were truncated)
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset

By utilizing these functions, our dataset will be prepared for the fine-tuning process!
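
The interaction between truncation and the length filter is easy to miss, so here is a toy illustration. It uses a whitespace "tokenizer" as a stand-in for the real one (an assumption made purely to keep the example self-contained): sequences are first cut to `max_length`, then only those strictly shorter than `max_length` are kept.

```python
def toy_tokenize(text, max_length):
    """Whitespace 'tokenizer' with truncation; a stand-in for the real tokenizer."""
    return text.split()[:max_length]

max_length = 5
texts = [
    "short sample",
    "this sample has exactly five",
    "this sample is clearly longer than five tokens",
]

tokenized = [toy_tokenize(t, max_length) for t in texts]
kept = [ids for ids in tokenized if len(ids) < max_length]

for text, ids in zip(texts, tokenized):
    status = "kept" if len(ids) < max_length else "dropped"
    print(f"{len(ids)} tokens -> {status}: {text!r}")
```

In this notebook the same effect shows up in the row counts: 8,100 training rows go in and 7,728 remain after the filter.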

In [21]:
## Pre-process dataset
max_length = get_max_length(original_model)
print(max_length)

train_dataset = preprocess_dataset(tokenizer, max_length,seed, dataset['train'])


Found max lenth: 2048
2048
Preprocessing dataset...


Map: 100%|██████████| 8100/8100 [00:01<00:00, 4255.37 examples/s]
Map: 100%|██████████| 8100/8100 [01:08<00:00, 118.20 examples/s]
Filter: 100%|██████████| 8100/8100 [00:09<00:00, 832.50 examples/s]


In [22]:
print(f"Shapes of the datasets:")
print(f"Training: {train_dataset.shape}")
print(train_dataset)
print_gpu_utilization()

Shapes of the datasets:
Training: (7728, 3)
Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 7728
})
GPU memory occupied: 1773 MB.


#### **Step-9 - Setup the PEFT/LoRA model for Fine-Tuning**

Now, let's perform `Parameter-Efficient Fine-Tuning (PEFT)`. `PEFT is a family of fine-tuning methods that train only a small number of parameters, which makes it much more efficient than full fine-tuning`. PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they mean LoRA. LoRA enables fine-tuning with far fewer computational resources, often just a single GPU. After LoRA fine-tuning for a specific task or use case, the original LLM is unchanged and the result is a considerably smaller "LoRA adapter," often a single-digit percentage of the original LLM size (in MBs rather than GBs).

During inference, the LoRA adapter must be combined with its original LLM. The advantage lies in the ability of many LoRA adapters to reuse the original LLM, thereby reducing overall memory requirements when handling multiple tasks and use cases.

`Note the rank (r) hyper-parameter, which defines the rank/dimension of the adapter to be trained`. `r is the rank of the low-rank matrix used in the adapters, which thus controls the number of parameters trained`. A `higher rank will allow for more expressivity`, but there is a compute tradeoff.

`alpha` is the scaling factor for the learned weights. The LoRA update is scaled by alpha/r before it is added to the output of the frozen weight matrix, so a higher value for alpha assigns more weight to the LoRA activations.
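
The role of `r` and `alpha` can be sketched with a tiny hand-computed example. The code below computes `W x + (alpha / r) * B (A x)` on toy matrices in plain Python; it assumes the standard LoRA formulation with `B` initialised to zero, and the matrices and helper names are made up for illustration rather than taken from the PEFT implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) with W frozen and A, B trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: 3 inputs, 2 outputs, rank r = 1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen weights, shape (2, 3)
A = [[0.5, 0.5, 0.0]]                      # down-projection, shape (r, 3)
B_zero = [[0.0], [0.0]]                    # up-projection at initialisation, shape (2, r)
B_trained = [[1.0], [-2.0]]                # pretend values after some training
x = [1.0, 2.0, 3.0]

out_init = lora_forward(W, A, B_zero, x, alpha=2, r=1)
out_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(out_init)
print(out_trained)
```

With `B` at zero the adapter leaves the base model's output untouched, which is why training starts from the pretrained behaviour; once `B` moves away from zero, the update is added on top, scaled by `alpha / r`.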

In [23]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 131164160
all model parameters: 615606272
percentage of trainable model parameters: 21.31%
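
The total of roughly 615 million may look surprising for a 1.1B-parameter model. One reading that reproduces the printed numbers exactly is that the 4-bit linear layers store two values per byte, so `numel()` reports half their logical size, while the embeddings, output head, and norm weights stay unquantized. The sketch below checks that arithmetic; the layer shapes come from the printed architecture, and the shape of the output head (2048 to 32000) is an assumption.

```python
VOCAB, HIDDEN, KV_DIM, MLP_DIM, LAYERS = 32000, 2048, 256, 5632, 22

# Linear4bit layers in each decoder block (shapes from the printed architecture)
linear_per_layer = (
    2 * HIDDEN * HIDDEN      # q_proj, o_proj
    + 2 * HIDDEN * KV_DIM    # k_proj, v_proj
    + 3 * HIDDEN * MLP_DIM   # gate_proj, up_proj, down_proj
)
quantized = LAYERS * linear_per_layer

# Unquantized parts: embeddings, output head (assumed 2048 -> 32000), RMSNorm weights
unquantized = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN

full_precision_total = quantized + unquantized
reported_total = quantized // 2 + unquantized   # two 4-bit values packed per stored element

print(f"logical parameters : {full_precision_total:,}")
print(f"reported by numel(): {reported_total:,}")
print(f"unquantized        : {unquantized:,}")
```

Under this reading, the "trainable" count printed above corresponds to the unquantized tensors, which still have `requires_grad=True` before the model is prepared for k-bit training.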


In [24]:
print(original_model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), e

In [25]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=32, #Rank
    lora_alpha=32,
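    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        # 'dense' was listed here originally, but no TinyLlama module has that name (see the printed architecture), so it had no effect
    ],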
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

# 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()

# 2 - Using the prepare_model_for_kbit_training method from PEFT
original_model = prepare_model_for_kbit_training(original_model)

peft_model = get_peft_model(original_model, config)

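Once everything is set up and the base model is prepared, we can reuse the `print_number_of_trainable_model_parameters()` helper defined above to see how many trainable parameters the PEFT model has. (PEFT models also provide a built-in `print_trainable_parameters()` method.)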

In [26]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 6127616
all model parameters: 621733888
percentage of trainable model parameters: 0.99%
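
The trainable count can be reproduced by hand, which is a useful sanity check on which modules actually received adapters. The sketch below assumes rank-32 adapters on `q_proj`, `k_proj`, and `v_proj` in all 22 layers, with layer shapes taken from the printed architecture; `lora_params` is a hypothetical helper.

```python
HIDDEN, KV_DIM, LAYERS, RANK = 2048, 256, 22, 32

def lora_params(in_features, out_features, rank):
    """Parameters in one adapter: A is (rank x in_features), B is (out_features x rank)."""
    return rank * in_features + out_features * rank

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
)
trainable = LAYERS * per_layer
all_params = 615_606_272 + trainable     # base count reported earlier plus the adapters

print(f"trainable: {trainable:,}")
print(f"all      : {all_params:,}")
print(f"share    : {100 * trainable / all_params:.2f}%")
```

The result matches the printed figures, which indicates that only the three attention projections carry adapters in this run.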


In [27]:
# See how the model looks different now, with the LoRA adapters added:
print(peft_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 2048)
        (layers): ModuleList(
          (0-21): 22 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora

#### **Step-10 - Train PEFT Adapters**

Define Training arguments and create Trainer instance.

In [30]:
output_dir = f'./peft-tinyLlama-manufacturing-training-{str(int(time.time()))}'

import transformers

peft_training_args = TrainingArguments(
    output_dir = output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
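    # evaluation_strategy="steps",  # disabled, so no evaluation runs during training (renamed eval_strategy in newer transformers releases)
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,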
    group_by_length=True,
)

peft_model.config.use_cache = False

peft_trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
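    # Caution: eval_dataset is still the raw split (it was never passed through preprocess_dataset);
    # tokenize it the same way as train_dataset before enabling evaluation.
    eval_dataset=eval_dataset,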
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [31]:
peft_training_args.device

device(type='cuda', index=0)

In [32]:
peft_trainer.train()

| Step | Training Loss |
|------|---------------|
| 25 | 0.8164 |
| 50 | 0.7491 |
| 75 | 0.7057 |
| 100 | 0.6866 |
| 125 | 0.6915 |
| 150 | 0.6679 |
| 175 | 0.6649 |
| 200 | 0.6475 |
| 225 | 0.6647 |
| 250 | 0.6435 |


TrainOutput(global_step=1000, training_loss=0.6255666465759278, metrics={'train_runtime': 2419.9414, 'train_samples_per_second': 3.306, 'train_steps_per_second': 0.413, 'total_flos': 7.433320241664e+16, 'train_loss': 0.6255666465759278, 'epoch': 1.0351966873706004})
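
The `epoch` value in this output lets us work out how many samples each optimizer step consumed. The sketch below does that arithmetic using only numbers printed in this notebook. The final step, reading the result as a device count, is an inference: the notebook does not state how many GPUs were used.

```python
global_steps = 1000
epochs_logged = 1.0351966873706004   # 'epoch' from TrainOutput
train_rows = 7728                     # rows in train_dataset after filtering
per_device_batch = 1                  # per_device_train_batch_size
grad_accum_steps = 4                  # gradient_accumulation_steps

samples_seen = epochs_logged * train_rows
samples_per_step = samples_seen / global_steps
implied_devices = samples_per_step / (per_device_batch * grad_accum_steps)

print(f"samples seen: {samples_seen:.0f}")
print(f"samples per optimizer step: {samples_per_step:.1f}")
print(f"implied device count: {implied_devices:.1f}")
```

The same total follows from `train_samples_per_second` multiplied by `train_runtime`, so the effective batch size in this run appears to have been twice what the per-device settings alone would give.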

In [33]:
print_gpu_utilization()

GPU memory occupied: 4001 MB.


In [34]:
# Free memory for merging weights
del original_model
del peft_trainer
torch.cuda.empty_cache()

In [35]:
print_gpu_utilization()

GPU memory occupied: 2355 MB.


Once the model is trained successfully, we can use it for inference. Let’s now prepare the inference model by loading the trained adapter on top of the original `TinyLlama` model. Here, we set `is_trainable=False` because the plan is only to perform inference with this PEFT model.

In [36]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, 
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      token=True)  # `use_auth_token` is deprecated in favor of `token`



In [39]:
eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

In [None]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "./peft-tinyLlama-manufacturing-training-1744528743/checkpoint-1000",torch_dtype=torch.float16,is_trainable=False)

#### **Step-11 - Human Evaluation**

Let’s perform inference using the same input as in Step 7, this time with the PEFT model instead of the original model.

In [45]:
%%time
from transformers import set_seed
set_seed(seed)

index = 10
dialogue = dataset['train'][index]['question']
summary = dataset['train'][index]['reason']

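# Note: this prompt only approximates the training-time format. The training texts used
# 'Question:', 'Reason:' and 'Answer:' sections and contained no '### Output:' marker.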
prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruct: Summarize the below conversation.

{dialogue}

### Output:
"""

# Generate with longer max tokens to avoid cutoff
peft_model_res = gen(ft_model, prompt, maxlen=500)

# Extract model output cleanly
# Handle case where "### End" might not be present
generated_text = peft_model_res[0]
if "### Output:" in generated_text:
    generated_text = generated_text.split("### Output:")[1]
summary_output, _, _ = generated_text.partition("### End")

# Display nicely
dash_line = '-' * 100
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary.strip()}\n')
print(dash_line)
print(f'PEFT MODEL:\n{summary_output.strip()}')


----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruct: Summarize the below conversation.

What specific material properties and microstructural characteristics must be optimized and controlled during the development of novel thermoplastic filaments for Material Extrusion (MEX) in order to achieve tailored mechanical performance, thermal stability, and processability, considering factors such as crystallinity, molecular weight, and additive formulations?

### Output:

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Okay, so I need to figure out what specific material properties and microstructural characteristics are important when developing new thermoplastic filaments for Material Extrusion (MEX). The goal is to get tailored mechan
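
The prompt used in this cell differs from the layout the model saw during training (for example, it ends in `### Output:` rather than `Question:`/`Reason:` sections). One way to keep the two aligned is to derive the inference prompt from the training template, as sketched below. `build_inference_prompt` is a hypothetical helper, `build_training_text` is a simplified copy of `create_prompt_formats_with_answer` that assumes all three fields are present, and the example strings are made up.

```python
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
INSTRUCTION_KEY = "### Instruct: Summarize the below conversation and answer the question."
END_KEY = "### End"

def build_training_text(question, reason, answer):
    """Reproduce the layout produced by create_prompt_formats_with_answer."""
    parts = [f"\n{INTRO_BLURB}", INSTRUCTION_KEY, f"Question: {question}",
             f"Reason: {reason}", f"Answer: {answer}", END_KEY]
    return "\n\n".join(parts)

def build_inference_prompt(question):
    """Hypothetical helper: the training layout up to the point where the model should continue."""
    parts = [f"\n{INTRO_BLURB}", INSTRUCTION_KEY, f"Question: {question}", "Reason:"]
    return "\n\n".join(parts)

question = "What is additive manufacturing?"
prompt = build_inference_prompt(question)
full_text = build_training_text(question, "Step-by-step reasoning.", "A final answer.")

# The inference prompt is a string-level prefix of the corresponding training text
print(full_text.startswith(prompt))

# The continuation can then be cut at the end marker, as in the cell above
completion, _, _ = full_text[len(prompt):].partition(END_KEY)
print(completion.strip())
```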

Fine-tuning is often an iterative process. Based on the validation and test set results, we may need to adjust the model’s architecture, hyperparameters, or training data to improve its performance. Let’s now see how to evaluate the results of the fine-tuned LLM.

#### **Step-12 - Evaluate the Model Quantitatively (with ROUGE Metric)**

`ROUGE, or Recall-Oriented Understudy for Gisting Evaluation`, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.


Let’s now use the `ROUGE metric` to quantify the quality of the text produced by the models. It compares each model output to a reference text, which here is the dataset's `reason` field. While it’s not a perfect metric, it gives an indication of how far fine-tuning has moved the outputs toward the references.

To demonstrate the capability of ROUGE Metric Evaluation we will use some sample inputs to evaluate.
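
Before running the library version, it can help to see what ROUGE-1 measures. The sketch below computes unigram precision, recall, and F1 on lowercased whitespace tokens. It is a simplified illustration with made-up example sentences: the `evaluate` implementation applies its own tokenization and optional stemming, so its numbers will differ.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Unigram precision, recall and F1 on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())   # clipped unigram matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = rouge1("the part warps during cooling",
                  "the printed part warps while cooling")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L replaces n-gram counts with the longest common subsequence.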

In [43]:
original_model = AutoModelForCausalLM.from_pretrained(base_model_id, 
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      token=True)  # `use_auth_token` is deprecated in favor of `token`



In [46]:
import pandas as pd

# Use the same slice for prompts and references so the two lists stay aligned
num_samples = 3
dialogues = dataset['train'][0:num_samples]['question']
human_baseline_summaries = dataset['train'][0:num_samples]['reason']

original_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"
    
    # Keep only the text generated after the 'Output:' marker
    original_model_res = gen(original_model, prompt, 100)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]
    
    peft_model_res = gen(ft_model, prompt, 100)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    print(peft_model_output)
    # Cut the PEFT output at the first '###' marker, if any
    peft_model_text_output, _, _ = peft_model_output.partition('###')

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df.to_csv("model_summary_comparison.csv", index=False)
df



- Summarize the conversation on cooperative agreements between standardization bodies, such as the PSDO agreement between ASTM International and ISO, and their impact on harmonizing national standards and technical regulations in additive manufacturing.
- Discuss the potential benefits and challenges of implementing a three-tier structure of AM standards across different categories and classes, considering factors such as the complexity of the standards, the role of international organizations, and the need for standardization bodies
- Summarize the conversation about additive manufacturing optimization.
- Explain how the additive manufacturing process can be optimized to minimize errors and difficulties in product development, particularly in terms of identifying and addressing underlying issues early on.
- Discuss the role of iterative design, automated processes, and computer-aided design in facilitating this optimization.
- Provide examples of how these strategies have been success

In [47]:
import evaluate

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)


ORIGINAL MODEL:
{'rouge1': 0.1339806559953702, 'rouge2': 0.06260404490693139, 'rougeL': 0.09707812571423212, 'rougeLsum': 0.12751964964755005}
PEFT MODEL:
{'rouge1': 0.149750761876448, 'rouge2': 0.06631024046241864, 'rougeL': 0.10370722351347923, 'rougeLsum': 0.14226401745278225}


| Metric | Base Model | PEFT Model | 🔥 Improvement |
|--------|------------|------------|----------------|
| ROUGE-1 | 0.134 | 0.150 | ✅ (more overlap of unigrams) |
| ROUGE-2 | 0.063 | 0.066 | ✅ (slight nudge in bigram quality) |
| ROUGE-L | 0.097 | 0.104 | ✅ (better longest common subsequence match) |
| ROUGE-Lsum | 0.128 | 0.142 | ✅ (summary-level coherence) |

ROUGE scores range from 0 to 1, and higher values mean more word and phrase overlap between the model's output and the reference text.

In [48]:
import numpy as np

print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

# Absolute difference between the two models' scores, metric by metric
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 1.58%
rouge2: 0.37%
rougeL: 0.66%
rougeLsum: 1.47%


| Metric | Absolute improvement (%) | 💬 Interpretation |
|---|---|---|
| ROUGE-1 | +1.58% | Better unigram (word) overlap |
| ROUGE-2 | +0.37% | Slightly better phrase-level match (bigrams) |
| ROUGE-L | +0.66% | More structure-aligned summaries |
| ROUGE-Lsum | +1.47% | Better summary-level alignment with human ref |

As the results above show, the PEFT model improves on the original model on every ROUGE metric. The absolute gains are modest, ranging from 0.37 to 1.58 percentage points.
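The numbers printed above are absolute differences in ROUGE score. It can also help to look at the gain relative to the base model's score. The sketch below does that with the values copied from the ROUGE output earlier in this section. The variable names `original`, `peft` and `relative_gain` are illustrative and not part of the notebook's existing code.

```python
# ROUGE scores copied from the printed output of the evaluation cell above
original = {
    "rouge1": 0.1339806559953702,
    "rouge2": 0.06260404490693139,
    "rougeL": 0.09707812571423212,
    "rougeLsum": 0.12751964964755005,
}
peft = {
    "rouge1": 0.149750761876448,
    "rouge2": 0.06631024046241864,
    "rougeL": 0.10370722351347923,
    "rougeLsum": 0.14226401745278225,
}

relative_gain = {}
for name in original:
    absolute = peft[name] - original[name]   # difference in score
    relative = absolute / original[name]     # gain as a fraction of the base score
    relative_gain[name] = relative
    print(f"{name}: +{absolute * 100:.2f} points absolute, +{relative * 100:.1f}% relative")
```

Reporting both views avoids ambiguity: a small absolute change can be a noticeable relative change when the base score is low, as it is here.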

In [49]:
from bert_score import score

# Human references
references = df['human_baseline_summaries'].tolist()

# Model summaries
original_preds = df['original_model_summaries'].tolist()
peft_preds = df['peft_model_summaries'].tolist()

# BERTScore for Original Model
P_o, R_o, F1_o = score(original_preds, references, lang="en")
print(f"Original Model BERTScore F1: {F1_o.mean().item():.4f}")

# BERTScore for PEFT Model
P_p, R_p, F1_p = score(peft_preds, references, lang="en")
print(f"PEFT Model BERTScore F1: {F1_p.mean().item():.4f}")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Original Model BERTScore F1: 0.8351


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PEFT Model BERTScore F1: 0.8394


In [None]:
import torch
import math
from transformers import AutoTokenizer, AutoModelForCausalLM

# Score both sets of summaries with the same tokenizer and model so the two perplexities are comparable
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("./peft-tinyLlama-manufacturing-training-1744528743/checkpoint-1000").cuda()

def compute_perplexity(text, model, tokenizer):
    encodings = tokenizer(text, return_tensors="pt").to(model.device)
    input_ids = encodings.input_ids

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss

    return math.exp(loss.item())

# Apply to your outputs
original_ppl = [compute_perplexity(text, model, tokenizer) for text in original_preds]
peft_ppl = [compute_perplexity(text, model, tokenizer) for text in peft_preds]

print(f"Original Model Perplexity (avg): {sum(original_ppl)/len(original_ppl):.2f}")
print(f"PEFT Model Perplexity (avg): {sum(peft_ppl)/len(peft_ppl):.2f}")


Original Model Perplexity (avg): 5.91
PEFT Model Perplexity (avg): 4.70


In [51]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
smoothie = SmoothingFunction().method4

bleu_scores = [
    sentence_bleu([ref.split()], pred.split(), smoothing_function=smoothie)
    for ref, pred in zip(df['human_baseline_summaries'], df['peft_model_summaries'])
]

print(f"PEFT BLEU Score (avg, smoothed): {np.mean(bleu_scores):.4f}")


PEFT BLEU Score (avg, smoothed): 0.0332


In [54]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')  # Optional but improves METEOR synonyms


[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/ubuntu/nltk_data...


True

In [55]:
from nltk.translate.meteor_score import meteor_score

meteor_scores = [
    meteor_score([ref.split()], pred.split())
    for ref, pred in zip(df['human_baseline_summaries'], df['peft_model_summaries'])
]

print(f"PEFT METEOR Score (avg): {np.mean(meteor_scores):.4f}")


PEFT METEOR Score (avg): 0.2819
