# CHALLENGE 3: FINE TUNING OF LLM MODELS

- Enrique Álmazan
- Victor Miguel Álvarez Camarero
- Javier Alfonso Villolo Fernández.


In [None]:
!pip install datasets
!pip install accelerate
!pip install bitsandbytes
!pip install peft
!pip install evaluate
!pip install trl
!pip install rouge_score


In [None]:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, pipeline
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from huggingface_hub import notebook_login
import evaluate
from trl import SFTTrainer
import requests
import json

## **Basic concepts before starting:**

**- LoRA:** Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique for large language models. Instead of updating all of the model's weights, it freezes the pre-trained weights and adds a pair of small, trainable low-rank matrices to selected layers (typically the attention and MLP projections). The product of those two matrices is the update applied on top of the frozen weight, so only a small fraction of the parameters is trained. This reduces memory and compute requirements while still adapting the model to a specific task.

**- Transfer Learning:** Transfer learning takes a model pre-trained on massive, general-purpose datasets (for example, on language modeling) and adapts it to a distinct task or domain using task-specific data, which may include labeled examples. It is useful when there is not enough data, or not enough time, to train a model from scratch. It can be achieved through feature extraction (using the pre-trained model as a fixed feature extractor) or through fine-tuning (updating parameters on the target task); LoRA is one fine-tuning method.

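**- Prompt tuning:** Prompt tuning adapts a model to a task through the prompt rather than through its weights. In the strict sense, a small set of trainable "soft prompt" embeddings is prepended to the input and learned while the model stays frozen; more loosely, the term is also used for guiding the model with hand-written, task-specific prompts, as in the zero-shot and one-shot experiments in this notebook. In both cases the model's weights are left unchanged, whereas LoRA adds trainable low-rank matrices to the model's layers.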
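
The LoRA idea can be sketched numerically. The snippet below is a minimal illustration with NumPy, not the PEFT implementation: the matrix sizes, the rank `r`, the scaling factor `alpha` and the helper name `lora_forward` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 4, 8      # illustrative sizes, not taken from a real model

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init so the update starts at 0


def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus the scaled low-rank path: W x + (alpha / r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=d_in)

# Before any training step B is zero, so the adapted layer equals the original one.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# After training, the adapter can be merged into a single weight matrix.
B = rng.normal(size=(d_out, r))           # stand-in for a trained B
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W_merged @ x)

full_params = W.size                      # parameters a full fine-tune would update
lora_params = A.size + B.size             # parameters LoRA trains instead
print(full_params, lora_params)
```

Only `A` and `B` would receive gradients during training; `W` stays fixed, which is where the memory savings come from.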

In [None]:
# Clean the cache before training each model to avoid memory errors
torch.cuda.empty_cache()

In [None]:
# Check if CUDA (GPU) is available
# For processing with GPU instead of CPU
if torch.cuda.is_available():
    # Get the number of available CUDA devices
    num_devices = torch.cuda.device_count()
    print("Number of available CUDA devices:", num_devices)

    # Iterate over CUDA devices and print their indices and names
    for i in range(num_devices):
        print("GPU index", i, ":", torch.cuda.get_device_name(i))
else:
    print("CUDA is not available. CPU will be used.")

Number of available CUDA devices: 1
GPU index 0 : Tesla T4


In [None]:
notebook_login()


In [None]:
# Your token has been saved to /root/.cache/huggingface/token

In [None]:
# Base model
# Llama-2-7b-hf model architecture: It is an auto-regressive language model that uses an optimized transformer architecture.
# It is possible to use the official Meta Llama-2 model from Hugging Face, but you have to apply and wait a couple of days for confirmation.
# Model: https://huggingface.co/meta-llama/Llama-2-7b-hf
# Paper: https://arxiv.org/abs/2307.09288
# More info: https://llama.meta.com/llama2/

#model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

## **Objective**

Compare the performance of the **Llama-2-7b-hf** model seen in class with another model from the Hugging Face Hub, in terms of ROUGE scores and execution time. For our second model we have chosen **Mistral-7B-Instruct-v0.2**, because its parameter count is similar to that of the Llama model. Our aim is to assess how two model architectures trained on different data differ, and to gauge their efficacy when each is fine-tuned independently, with the parameter count held roughly equal.

In [None]:
# Base model
# Model: https://huggingface.co/meta-llama/Llama-2-7b-hf

# New model:
# Model: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In [None]:
# Dataset
# Train
url_train = "https://raw.githubusercontent.com/architkaila/Fine-Tuning-LLMs-for-Medical-Entity-Extraction/main/data/entity_extraction/entity-extraction-train-data.json"
response_train = requests.get(url_train)
if response_train.status_code == 200:
    data_train = response_train.json()
    print("Training OK")
else:
    print("Error obtaining training data:", response_train.status_code)

Training OK


In [None]:
data_train[0]

{'input': "Robert Johnson\nrobert.johnson@email.com\n789 Maple Lane, Chicago, IL 60601\n555-234-5678, United States\n\nRelationship to XYZ Pharma Inc.: Patient\nReason for contacting: Adverse Event\n\nMessage: I've been on Onglyza for a while, and I've noticed that I'm experiencing frequent painful urination. Is this a known side effect?",
 'output': '{"drug_name": "Onglyza", "adverse_events": ["painful urination"]}'}

In [None]:
# Test
url_test = "https://raw.githubusercontent.com/architkaila/Fine-Tuning-LLMs-for-Medical-Entity-Extraction/main/data/entity_extraction/entity-extraction-test-data.json"
response_test = requests.get(url_test)
if response_test.status_code == 200:
    data_test = response_test.json()
    print("Test OK")
else:
    print("Error obtaining test data:", response_test.status_code)

Test OK


In [None]:
data_test[0]

{'input': "Natalie Cooper,\nncooper@example.com\n6789 Birch Street, Denver, CO 80203,\n303-555-6543, United States\n\nRelationship to XYZ Pharma Inc.: Patient\nReason for contacting: Adverse Event\n\nMessage: Hi, after starting Abilify for bipolar I disorder, I've noticed that I am experiencing nausea and vomiting. Are these typical reactions? Best, Natalie Cooper",
 'output': '{"drug_name": "Abilify", "adverse_events": ["nausea", "vomiting"]}'}

```
---- LlaMa2 datasets ----
https://gpus.llm-utils.org/llama-2-prompt-template/
Note that this only applies to the llama 2 chat models. The base models have no prompt structure, they’re raw non-instruct tuned models.

<s>[INST] {user_message_1} [/INST] {model_reply_1}</s>

---- Alpaca datasets ----
### Instruction:
(Instruction Text)

### Input:
(Auxiliary Input Text)

### Response:
(Desired Response Text)

---- Vicuna datasets ----
### Human:
(Question Text)
### Assistant:
(Response Text)

---- Mistral datasets ----
<s>[INST] Instruction [/INST] Model answer</s>

---- Gemma ----
```
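
As a rough sketch, the templates listed above can be kept in a dictionary and filled in with `str.format`. The dictionary layout, the field names (`instruction`, `context`, `response`) and the `render_prompt` helper are illustrative assumptions, not part of any library.

```python
# Templates copied from the list above; everything else here is hypothetical.
PROMPT_TEMPLATES = {
    "llama2": "<s>[INST] {instruction} [/INST] {response}</s>",
    "mistral": "<s>[INST] {instruction} [/INST] {response}</s>",
    "alpaca": (
        "### Instruction:\n{instruction}\n\n"
        "### Input:\n{context}\n\n"
        "### Response:\n{response}"
    ),
    "vicuna": "### Human:\n{instruction}\n### Assistant:\n{response}",
}


def render_prompt(family, instruction, response="", context=""):
    """Fill the template of the given model family with the provided fields."""
    template = PROMPT_TEMPLATES[family]
    # str.format ignores keyword arguments that a template does not use
    return template.format(instruction=instruction, response=response, context=context)


print(render_prompt("mistral", "Extract the drug name.", "drug_name: Onglyza"))
print(render_prompt("vicuna", "Extract the drug name.", "drug_name: Onglyza"))
```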

## **Pre-Processing**

Now we format the downloaded data to fit the prompt structure expected by the LLM. Fortunately, we can keep the arrangement seen in class, since Llama 2 and Mistral use the same `[INST] ... [/INST]` structure.
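
The formatting step can also be written as one small helper, so that the training and test sets go through exactly the same code. This is a sketch only: `format_example` is a hypothetical name and the `sample` record is a shortened, made-up variant of a dataset entry.

```python
def format_example(item):
    """Turn one {'input', 'output'} record into a single training string."""
    prompt = f"<s>[INST] {item['input']}[/INST]"
    # Drop the JSON punctuation so the target reads 'drug_name: ..., adverse_events: [...]'
    answer = item["output"].replace('"', "").replace("{", "").replace("}", "")
    return {"text": prompt + answer + "</s>"}


# Shortened, illustrative record (not a verbatim dataset entry)
sample = {
    "input": "Message: I've been on Onglyza for a while.",
    "output": '{"drug_name": "Onglyza", "adverse_events": ["painful urination"]}',
}
print(format_example(sample)["text"])
```

With such a helper, each split could be built with a list comprehension, for example `[format_example(item) for item in data_train]`.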




In [None]:
# Without a system message
# <s>[INST] {user_message_1} [/INST] {model_reply_1}</s>

formatted_data_train = []

for item in data_train:
    input_text = item["input"]
    output_text = item["output"]

    formatted_input = f"<s>[INST] {input_text}[/INST]"

    formatted_output = output_text.replace('\"', '').replace('{', '').replace('}', '')

    formatted_data_train.append({'text':formatted_input + formatted_output + '</s>'})

In [None]:
formatted_data_train[0]

{'text': "<s>[INST] Robert Johnson\nrobert.johnson@email.com\n789 Maple Lane, Chicago, IL 60601\n555-234-5678, United States\n\nRelationship to XYZ Pharma Inc.: Patient\nReason for contacting: Adverse Event\n\nMessage: I've been on Onglyza for a while, and I've noticed that I'm experiencing frequent painful urination. Is this a known side effect?[/INST]drug_name: Onglyza, adverse_events: [painful urination]</s>"}

In [None]:
dataset_train = Dataset.from_list(formatted_data_train)

In [None]:
dataset_train

Dataset({
    features: ['text'],
    num_rows: 700
})

In [None]:
dataset_train[0]

{'text': "<s>[INST] Robert Johnson\nrobert.johnson@email.com\n789 Maple Lane, Chicago, IL 60601\n555-234-5678, United States\n\nRelationship to XYZ Pharma Inc.: Patient\nReason for contacting: Adverse Event\n\nMessage: I've been on Onglyza for a while, and I've noticed that I'm experiencing frequent painful urination. Is this a known side effect?[/INST]drug_name: Onglyza, adverse_events: [painful urination]</s>"}

In [None]:
formatted_data_test = []

for item in data_test:
    input_text = item["input"]
    output_text = item["output"]

    formatted_input = f"<s>[INST] {input_text}[/INST]"

    formatted_output = output_text.replace('\"', '').replace('{', '').replace('}', '')

    formatted_data_test.append({'text':formatted_input + formatted_output + '</s>'})

In [None]:
formatted_data_test[0]

{'text': "<s>[INST] Natalie Cooper,\nncooper@example.com\n6789 Birch Street, Denver, CO 80203,\n303-555-6543, United States\n\nRelationship to XYZ Pharma Inc.: Patient\nReason for contacting: Adverse Event\n\nMessage: Hi, after starting Abilify for bipolar I disorder, I've noticed that I am experiencing nausea and vomiting. Are these typical reactions? Best, Natalie Cooper[/INST]drug_name: Abilify, adverse_events: [nausea, vomiting]</s>"}

In [None]:
dataset_test = Dataset.from_list(formatted_data_test) # HuggingFace Dataset

In [None]:
dataset_test

Dataset({
    features: ['text'],
    num_rows: 59
})

In [None]:
dataset_test[0]

{'text': "<s>[INST] Natalie Cooper,\nncooper@example.com\n6789 Birch Street, Denver, CO 80203,\n303-555-6543, United States\n\nRelationship to XYZ Pharma Inc.: Patient\nReason for contacting: Adverse Event\n\nMessage: Hi, after starting Abilify for bipolar I disorder, I've noticed that I am experiencing nausea and vomiting. Are these typical reactions? Best, Natalie Cooper[/INST]drug_name: Abilify, adverse_events: [nausea, vomiting]</s>"}

In [None]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2") # Load tokenizer, specify the one for the corresponding LLM model
tokenizer.pad_token = tokenizer.eos_token # Padding token of the tokenizer to be the same as the end-of-sequence (eos) token
tokenizer.padding_side = "right" # Padding should be added to the right side of the input sequences

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Base model
# We will use NousResearch's Llama-2-7b-chat-hf as a base model, which is the same as the original, but easily accessible.
# Model: https://huggingface.co/NousResearch/Llama-2-7b-chat-hf

# New model
# We will use Mistral's Mistral-7B-Instruct-v0.2. Our aim is to assess the variance across various model architectures and gauge their efficacy when trained independently, irrespective of parameter count.

# Create quantization config (reduce precision as well as size)
# https://huggingface.co/docs/transformers/main_classes/quantization
# Quantization techniques reduce memory and computational costs
# by representing weights and activations with lower-precision data types
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, # This flag is used to enable 4-bit quantization (the keyword must be lowercase)
    bnb_4bit_compute_dtype=torch.float16, # This sets the computational type: once the weights are loaded in 4-bit, the computations will be performed using 16-bit floating-point precision.
    bnb_4bit_quant_type="nf4" # This sets the quantization data type
)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", # Load model
                                             quantization_config=quantization_config, # Quantization configuration
                                             device_map=0 # device_map = 0 means put the whole model on GPU 0; device_map="auto" computes the most optimized `device_map` automatically
)



In [None]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
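
The `Linear4bit` layers in the printout store their weights in 4 bits. To give an intuition of what that means, the sketch below quantizes a weight matrix block by block with a plain absmax scheme. This is a deliberately simplified illustration and an assumption on our part: it is not the NF4 data type that bitsandbytes uses, and the function names and block size are hypothetical.

```python
import numpy as np


def quantize_absmax_4bit(weights, block_size=64):
    """Quantize a float array to signed 4-bit codes with one scale per block."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                                # avoid dividing by zero for all-zero blocks
    codes = np.round(blocks / scales * 7).astype(np.int8)   # integers in [-7, 7]
    return codes, scales


def dequantize_absmax_4bit(codes, scales, shape):
    """Map the integer codes back to approximate float values."""
    return (codes.astype(np.float32) / 7 * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 64)).astype(np.float32)

codes, scales = quantize_absmax_4bit(w)
w_hat = dequantize_absmax_4bit(codes, scales, w.shape)

print("max abs error:", np.abs(w - w_hat).max())
print("distinct codes used:", np.unique(codes).size)  # 15 symmetric levels fit in 4 bits
```

Each weight is stored as a small integer plus a shared per-block scale, which is why the memory footprint drops to roughly a quarter of the 16-bit version at the cost of a small reconstruction error.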

### Zero-shot

Here we give the model a formatted input without the solution, so that it has to rely on what it learned during pre-training to come up with an answer.

In [None]:
print(data_test[1]['input'])

Mia Garcia
mia.garcia@email.com
321 Magnolia Drive, Dallas, TX 75201
555-890-1234, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: I experienced a feeling of light-headedness and near-fainting after taking Staxyn for my erectile dysfunction. Is this a common side effect, and should I be worried?


In [None]:
# Run text generation pipeline with our model
prompt = data_test[1]['input']
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Mia Garcia
mia.garcia@email.com
321 Magnolia Drive, Dallas, TX 75201
555-890-1234, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: I experienced a feeling of light-headedness and near-fainting after taking Staxyn for my erectile dysfunction. Is this a common side effect, and should I be worried? [/INST] Subject: Report of Adverse Event - Staxyn and Light-headedness

Dear XYZ Pharma Inc. Team,

I hope this message finds you well. I am writing to report an adverse event I recently experienced after taking Staxyn, a medication I have been using to manage my erectile dysfunction.

On


In [None]:
print(data_test[2]['input'])

Brandon Lee,
blee@example.com
3333 Pine Road, Hilltown, MA 02108,
617-555-3333, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: Since I started on Byetta, I've noticed an increase in thirst and dry mouth. Is this related to the medication? Best, Brandon Lee


In [None]:
# Run text generation pipeline with our model
prompt = data_test[2]['input']
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Brandon Lee,
blee@example.com
3333 Pine Road, Hilltown, MA 02108,
617-555-3333, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: Since I started on Byetta, I've noticed an increase in thirst and dry mouth. Is this related to the medication? Best, Brandon Lee [/INST] Subject: Report of Potential Adverse Effect from Byetta

Dear Brandon Lee,

Thank you for reaching out to XYZ Pharma Inc. regarding your experience with Byetta. We take all reports of adverse events seriously and appreciate your feedback.

Your symptoms of increased thirst and dry mouth are known side effects of Byetta. These symptoms are typically related to the medication'


### One-shot

In this case we give the model one formatted input together with its formatted output, followed by a second formatted input without the solution, so that the model can use the solved example to guide its answer.
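
The same construction generalizes to any number of solved examples. The helper below is a minimal sketch (the name `build_few_shot_prompt` is hypothetical): with an empty list it yields a zero-shot prompt, with one example a one-shot prompt.

```python
def build_few_shot_prompt(examples, query):
    """Concatenate solved examples and finish with an unsolved query.

    `examples` is a list of already formatted strings
    ('<s>[INST] ... [/INST]answer</s>'); `query` is the raw input text
    the model should answer.
    """
    parts = list(examples)
    parts.append(f"<s>[INST] {query} [/INST]")
    return "\n".join(parts)


solved = "<s>[INST] I took Abilify and feel nauseous.[/INST]drug_name: Abilify, adverse_events: [nausea]</s>"
print(build_few_shot_prompt([], "I took Staxyn and felt dizzy."))        # zero-shot
print(build_few_shot_prompt([solved], "I took Staxyn and felt dizzy."))  # one-shot
```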

In [None]:
prompt = dataset_test[0]['text'] + '\n' + f"<s>[INST] {data_test[1]['input']} [/INST]"
print(prompt)


<s>[INST] Natalie Cooper,
ncooper@example.com
6789 Birch Street, Denver, CO 80203,
303-555-6543, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: Hi, after starting Abilify for bipolar I disorder, I've noticed that I am experiencing nausea and vomiting. Are these typical reactions? Best, Natalie Cooper[/INST]drug_name: Abilify, adverse_events: [nausea, vomiting]</s>
<s>[INST] Mia Garcia
mia.garcia@email.com
321 Magnolia Drive, Dallas, TX 75201
555-890-1234, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: I experienced a feeling of light-headedness and near-fainting after taking Staxyn for my erectile dysfunction. Is this a common side effect, and should I be worried? [/INST]


In [None]:
print(dataset_test[0]['text'])

<s>[INST] Natalie Cooper,
ncooper@example.com
6789 Birch Street, Denver, CO 80203,
303-555-6543, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: Hi, after starting Abilify for bipolar I disorder, I've noticed that I am experiencing nausea and vomiting. Are these typical reactions? Best, Natalie Cooper[/INST]drug_name: Abilify, adverse_events: [nausea, vomiting]</s>


In [None]:
# Run text generation pipeline with our next model
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=350)
result = pipe(prompt)
print(result[0]['generated_text'])

<s>[INST] Natalie Cooper,
ncooper@example.com
6789 Birch Street, Denver, CO 80203,
303-555-6543, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: Hi, after starting Abilify for bipolar I disorder, I've noticed that I am experiencing nausea and vomiting. Are these typical reactions? Best, Natalie Cooper[/INST]drug_name: Abilify, adverse_events: [nausea, vomiting]</s>
<s>[INST] Mia Garcia
mia.garcia@email.com
321 Magnolia Drive, Dallas, TX 75201
555-890-1234, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: I experienced a feeling of light-headedness and near-fainting after taking Staxyn for my erectile dysfunction. Is this a common side effect, and should I be worried? [/INST] drug_name: Staxyn, adverse_events: [light-headedness, near-fainting]

Response:

Subject: Re: Inquiry Regarding Side Effects of Staxyn

Dear Mia Garcia,

Thank you for reaching out to us regarding your exper

### Training

Now we use the entire training set of formatted input/output pairs to fine-tune the model with the LoRA configuration.

In [None]:
# Create LoRA config
# More info in https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
peft_config = LoraConfig(
    r=8, # The rank of the update matrices, expressed in int. Lower rank results in smaller update matrices with fewer trainable parameters.
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"], # The modules to apply the LoRA update matrices.
    bias="none", # Specifies if the bias parameters should be trained.
    task_type = TaskType.CAUSAL_LM
)
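
To get a feel for how small the trainable part is, the number of LoRA parameters can be estimated from the layer shapes shown by `print(model)`. The sketch below assumes that all seven projection types (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) are adapted in each of the 32 decoder layers with rank 8; the helper name is hypothetical.

```python
# (in_features, out_features) of each targeted projection, read from print(model)
PROJECTIONS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32
RANK = 8


def lora_param_count(projections, num_layers, rank):
    """Each adapted Linear layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers


total = lora_param_count(PROJECTIONS, NUM_LAYERS, RANK)
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapters add about 21 million parameters, a small fraction of the roughly 7 billion frozen ones.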


In [None]:
# Subset of the arguments that we use for training.
# https://huggingface.co/docs/transformers/main_classes/trainer

training_params = TrainingArguments(
    output_dir="./results", # where the model's checkpoints and predictions will be stored
    num_train_epochs=1, # number of epochs
    per_device_train_batch_size=4, # batch size for training
    gradient_accumulation_steps=1, # number of update steps to accumulate the gradients for
    optim="paged_adamw_32bit", # AdamW optimizer
    save_steps=25, # save checkpoint every 25 update steps
    logging_steps=25, # logs every 25 update steps
    learning_rate=2e-4, # initial learning rate
    weight_decay=0.001, # weight decay to apply to all layers except bias/LayerNorm weights
    fp16=False,
    bf16=False,
    max_grad_norm=0.3, # maximum gradient norm (gradient clipping)
    max_steps=-1, # number of training steps (if not -1 overrides num_train_epochs)
    warmup_ratio=0.03, # ratio of steps for a linear warmup (from 0 to learning rate)
    group_by_length=True, # group sequences into batches with same length
    lr_scheduler_type="constant", # learning rate schedule
    report_to="tensorboard"
)
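
These settings determine how many optimizer updates one epoch takes. A rough estimate, assuming a single GPU and no `max_steps` override, can be sketched as follows (`optimizer_steps` is a hypothetical helper, and the formula is an approximation of what the trainer computes):

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, epochs, num_devices=1):
    """Approximate number of optimizer updates for a run without max_steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# 700 training examples, batch size 4, no gradient accumulation, 1 epoch
steps = optimizer_steps(num_examples=700, per_device_batch_size=4, grad_accum_steps=1, epochs=1)
print(steps, "update steps;", steps // 25, "log entries with logging_steps=25")
```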

In [None]:
# Set supervised fine-tuning parameters
max_seq_length = None
packing = False

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_params,
    packing=packing,
)

# Train model
trainer.train()




| Step | Training Loss |
|------|---------------|
| 25   | 1.0608        |
| 50   | 0.6319        |
| 75   | 0.488         |
| 100  | 0.4813        |
| 125  | 0.4565        |
| 150  | 0.4827        |
| 175  | 0.4342        |




TrainOutput(global_step=175, training_loss=0.5764838518415178, metrics={'train_runtime': 991.6445, 'train_samples_per_second': 0.706, 'train_steps_per_second': 0.176, 'total_flos': 4345650882772992.0, 'train_loss': 0.5764838518415178, 'epoch': 1.0})

In [None]:
# Run text generation pipeline with our next model
prompt = data_test[1]['input']
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Mia Garcia
mia.garcia@email.com
321 Magnolia Drive, Dallas, TX 75201
555-890-1234, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: I experienced a feeling of light-headedness and near-fainting after taking Staxyn for my erectile dysfunction. Is this a common side effect, and should I be worried? [/INST]drug_name: Staxyn, adverse_events: [feeling of light-headedness, near-fainting]

Drug information: Staxyn is a medication used to treat erectile dysfunction. It works by increasing blood flow to the penis, allowing a man to get and maintain an erection.

Regarding your


In [None]:
# Run text generation pipeline with our next model
prompt = data_test[12]['input']
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Daniel Wilson
daniel.wilson@example.com
112 Pine Avenue, Atlanta, GA 30301
4045554321, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: I took Nexium for acid reflux, and I had a headache and stomach pain. Could this be due to the medication? [/INST]drug_name: Nexium, adverse_events: [headache, stomach pain]

Drug Information
Nexium is a medication used to treat and prevent stomach ulcers, erosive esophagitis, and GERD. It works by blocking the production of stomach acid.

Relationship to Adverse Events
The adverse events you experienced, headache and stomach pain, are known side effects


### Evaluation

In [None]:
# Evaluate the Model Quantitatively
rouge = evaluate.load('rouge') # https://en.wikipedia.org/wiki/ROUGE_(metric)

In [None]:
prompts = []  # named `prompts` to avoid shadowing the built-in input()

for d in data_test:
  prompts.append(f"<s>[INST] {d['input']} [/INST]")

In [None]:
output = dataset_test['text']

In [None]:
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=250)

output_model = []

for p in prompts:
  output_model.append(pipe(p))

In [None]:
print(output[10])

<s>[INST] William Harris
william.harris@example.com
890 Oak Road, San Francisco, CA 94101
4155558765, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: I received Neupogen and had trouble breathing and fever. Are these common side effects of the medication?[/INST]drug_name: Neupogen, adverse_events: [trouble breathing, fever]</s>


In [None]:
output_model__ = []

for ii in output_model:
  output_model__.append(ii[0]['generated_text'])

print(output_model__[10])

<s>[INST] William Harris
william.harris@example.com
890 Oak Road, San Francisco, CA 94101
4155558765, United States

Relationship to XYZ Pharma Inc.: Patient
Reason for contacting: Adverse Event

Message: I received Neupogen and had trouble breathing and fever. Are these common side effects of the medication? [/INST]drug_name: Neupogen, adverse_events: [trouble breathing, fever]

Drug Information
Neupogen is a medication used to stimulate the production of white blood cells in the body. It is commonly used during chemotherapy to prevent infections.

Adverse Events
The adverse events you experienced, trouble breathing and fever, are not common side effects of Neupogen. However, they can be serious and require immediate medical attention.

If you have trouble breathing or a fever, contact your healthcare provider right away. These symptoms could be related to an allergic reaction to Neupogen or a side effect of the medication.

Additional Information
Neupogen can


In [None]:
rouge_results = rouge.compute(
    predictions=output_model__,
    references=output,
    use_aggregator=True, # Scores are averaged over all examples
    use_stemmer=True, # Stemmer will be used during the computation of the ROUGE scores (stemmer reduces words to their root form, which can help in matching similar words)
)

In [None]:
print(rouge_results)

{'rouge1': 0.6705005718546562, 'rouge2': 0.6584466929107382, 'rougeL': 0.669451575070622, 'rougeLsum': 0.6656510145259794}
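
Note that the scores above are computed on the full strings, so the shared prompt counts as overlap and any text generated after the answer counts against the model. One possible refinement, sketched here under the assumption that the answer is the first non-empty line after the last `[/INST]` tag, is to score only that line. `extract_answer` is a hypothetical helper and was not used to produce the reported results.

```python
def extract_answer(generated_text):
    """Return the first non-empty line that follows the last [/INST] tag."""
    completion = generated_text.rsplit("[/INST]", 1)[-1]
    completion = completion.replace("</s>", "")
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""


generated = (
    "<s>[INST] Message: I received Neupogen and had trouble breathing and fever. [/INST]"
    "drug_name: Neupogen, adverse_events: [trouble breathing, fever]\n\n"
    "Drug Information\nNeupogen is a medication used to stimulate..."
)
reference = (
    "<s>[INST] Message: I received Neupogen and had trouble breathing and fever.[/INST]"
    "drug_name: Neupogen, adverse_events: [trouble breathing, fever]</s>"
)
print(extract_answer(generated))
print(extract_answer(reference))

# Possible use with the notebook's variables (not executed here):
# rouge.compute(predictions=[extract_answer(t) for t in output_model__],
#               references=[extract_answer(t) for t in output])
```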


In [None]:
# RESULTS FOR LLAMA 2 MODEL
#  {'rouge1': 0.988583793986618, 'rouge2': 0.9802423646077636, 'rougeL': 0.9873031083402044, 'rougeLsum': 0.9871804787998142}
print(f"Rouge scores for Llama 2 model: {{'rouge1': 0.988583793986618, 'rouge2': 0.9802423646077636, 'rougeL': 0.9873031083402044, 'rougeLsum': 0.9871804787998142}}")

# RESULTS FOR MISTRAL MODEL
# {'rouge1': 0.6705005718546562, 'rouge2': 0.6584466929107382, 'rougeL': 0.669451575070622, 'rougeLsum': 0.6656510145259794}
print(f"Rouge scores for Mistral 7B 0.2v model: {{'rouge1': 0.6705005718546562, 'rouge2': 0.6584466929107382, 'rougeL': 0.669451575070622, 'rougeLsum': 0.6656510145259794}}")


Rouge scores for Llama 2 model: {'rouge1': 0.988583793986618, 'rouge2': 0.9802423646077636, 'rougeL': 0.9873031083402044, 'rougeLsum': 0.9871804787998142}
Rouge scores for Mistral 7B 0.2v model: {'rouge1': 0.6705005718546562, 'rouge2': 0.6584466929107382, 'rougeL': 0.669451575070622, 'rougeLsum': 0.6656510145259794}


# **DISCUSSION**

To compare the results, let's first explain each of the ROUGE metrics.

Overall, ROUGE metrics evaluate text generation tasks by measuring the overlap between generated and reference texts. They help assess how similar the generated text is to a human-written reference. For each metric, precision, recall and F1-score can be computed; the Hugging Face `evaluate` implementation used here reports the F1-score by default.

**- Rouge 1:** Measures unigram (single-word) overlap between the generated text and a reference (human-generated) text. Precision, recall and F1-score are computed from the number of overlapping unigrams.

**- Rouge 2:** Measures bigram overlap between the generated text and a reference text. It is similar to ROUGE-1, but considers pairs of adjacent words (bigrams) instead of single words (unigrams).

**- Rouge L:** Is based on the Longest Common Subsequence (LCS) between the generated text and a reference text. Precision, recall and F1-score are computed from the longest sequence of words that appears in both texts in the same order, not necessarily contiguously.

**- Rouge Lsum:** Is the summary-level variant of ROUGE-L. The texts are split into sentences (on newlines), the LCS is computed between each reference sentence and the generated sentences, and the results are combined into a single score. For single-line texts it is essentially the same as ROUGE-L.
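
The definitions above can be made concrete with a small pure-Python sketch. It is a simplified illustration, not the `rouge_score` implementation: it tokenizes on whitespace and applies no stemming, and the function names and example strings are assumptions.

```python
from collections import Counter


def _f1(matches, n_pred, n_ref):
    if matches == 0:
        return 0.0
    precision, recall = matches / n_pred, matches / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """F1 over overlapping n-grams (clipped counts)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    matches = sum((pred & ref).values())
    return _f1(matches, sum(pred.values()), sum(ref.values()))


def rouge_l(prediction, reference):
    """F1 based on the longest common subsequence of tokens."""
    a, b = prediction.lower().split(), reference.lower().split()
    # Classic dynamic-programming LCS table
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f1(table[-1][-1], len(a), len(b))


reference = "drug_name: Staxyn, adverse_events: [light-headedness, near-fainting]"
prediction = reference + " Drug information: Staxyn is a medication"

print(rouge_n(prediction, reference, 1))
print(rouge_n(prediction, reference, 2))
print(rouge_l(prediction, reference))
```

In this toy case the prediction contains the whole reference, so recall is perfect, but the extra trailing words lower precision and therefore the F1-score.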

With this in mind, we can infer the following from the results:

The ROUGE-1 score of the Llama 2 model (0.988583793986618) is much higher than that of the Mistral model (0.6705005718546562), which indicates that the outputs of Llama 2 share far more unigrams with the references.

The same holds for bigram overlap: Llama 2 has a rouge2 score of 0.9802423646077636 and Mistral a rouge2 score of 0.6584466929107382.

For the metrics based on the longest common subsequence, rougeL and rougeLsum are also significantly lower for the Mistral model (0.669451575070622 and 0.6656510145259794, respectively) than for the Llama 2 model (0.9873031083402044 and 0.9871804787998142, respectively).

The difference is not limited to the scores; running times also differ noticeably. Llama 2 took 6 minutes to train and another 5 minutes to run the text-generation pipeline over the test set for evaluation, whereas the Mistral model took 16 minutes to train and another 15 minutes to generate its results.
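
Running times like these can be collected in a consistent way with a small timing helper. The sketch below is one possible approach; the `timed` context manager and the `timings` dictionary are hypothetical names, and the summed range is only a stand-in for `trainer.train()` or the generation loop.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Store the wall-clock duration of the enclosed block in `timings[label]`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    sum(range(100_000))  # stand-in for trainer.train() or the generation loop

for label, seconds in timings.items():
    print(f"{label}: {seconds / 60:.2f} min")
```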

Although both models have a similar size, approximately 7 billion parameters, their performance differs significantly, with the Llama 2 model clearly outperforming the Mistral model. This gap could be attributed to differences in model architecture and in the data each model was originally trained on. The sample generations above also show the fine-tuned Mistral model continuing with extra text after the expected answer, which reduces its overlap with the short reference outputs.