<a href="https://www.kaggle.com/code/aisuko/fine-tuning-llm-for-dialogue-summarization?scriptVersionId=162954352" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

In this notebook, we will try to supervised fine-tune `microsoft/phi2` on the [DialogSum](https://huggingface.co/datasets/neil-code/dialogsum-test) dataset. DialogSum is an extensive dialogue summarization dataset, featuring 13,460 dialogues along with manually labeled summaries and topics.

We fine-tune `microsoft/phi2` on THUDM/webglm-qa dataset without supervised training in [Fine-tuning Microsoft-phi2](https://www.kaggle.com/code/aisuko/fine-tuning-microsoft-phi2)

In [1]:
%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install evaluate==0.4.1
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
# !pip install tqdm==4.66.1

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Supervised-fine-tune-models"
os.environ["WANDB_NOTES"] = "Supervised fine tune models"
os.environ["WANDB_NAME"] = "sft-microsoft-phi2-on-dialogsum"
os.environ["MODEL_NAME"] = "microsoft/phi-2"
os.environ["DATASET_NAME"] = "neil-code/dialogsum-test"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `microsoft/phi-2` from `transformers`...
config.json: 100%|█████████████████████████████| 863/863 [00:00<00:00, 4.51MB/s]
┌────────────────────────────────────────────────────┐
│     Memory Usage for loading `microsoft/phi-2`     │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│   500.2 MB  │ 10.37 GB │      41.48 GB     │
│float16│   250.1 MB  │ 5.19 GB  │      20.74 GB     │
│  int8 │  125.05 MB  │ 2.59 GB  │      10.37 GB     │
│  int4 │   62.52 MB  │  1.3 GB  │      5.19 GB      │
└───────┴─────────────┴──────────┴───────────────────┘


# Loading the dataset

In [4]:
from datasets import load_dataset

dataset=load_dataset(os.getenv('DATASET_NAME'))
dataset

Downloading readme:   0%|          | 0.00/4.56k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/1.81M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/441k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/447k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

  return pd.read_csv(xopen(filepath_or_buffer, "rb", download_config=download_config), **kwargs)


Generating validation split: 0 examples [00:00, ? examples/s]

  return pd.read_csv(xopen(filepath_or_buffer, "rb", download_config=download_config), **kwargs)


Generating test split: 0 examples [00:00, ? examples/s]

  return pd.read_csv(xopen(filepath_or_buffer, "rb", download_config=download_config), **kwargs)


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1999
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
})

## [Optional] Scale training data to a small slice

We want to make sure that we can run the training successed under the limited computing resources.

In [5]:
smaller_training=dataset['train'].select(range(100))
smaller_validation=dataset['validation'].select(range(100))
smaller_test=dataset['test'].select(range(100))

dataset['train']=smaller_training
dataset['validation']=smaller_validation
dataset['test']=smaller_test

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 100
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 100
    })
})

# Loading tokenizer



In [6]:
from transformers import AutoTokenizer
# see https://github.com/huggingface/transformers/issues/18388 for description about padding
tokenizer=AutoTokenizer.from_pretrained(
    os.getenv('MODEL_NAME'),
    padding_side='left',
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False
)
tokenizer.pad_token=tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
from transformers import DataCollatorForLanguageModeling

data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)

2024-02-15 11:21:52.601508: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-15 11:21:52.601653: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-15 11:21:52.895012: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Preprocess the data

We need to create some preprocess function to format the input dataset, ensuring its suitability for the fine-tuning process. Here we convert the dialog-summary(prompt-response) pairs into explicit instructions for the LLM.

In [8]:
from functools import partial
from transformers import set_seed

seed=42
set_seed(seed)

def create_prompt_formats(sample):
    """
    Format various feilds of the sample ('instruction','output')
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"
    
    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int,seed, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)
    
    # Apply preprocessing to each batch of the dataset & and remove 'instruction', 'context', 'response', 'category' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset

# Loading the model

We use QLora quantization the model to reduce the memory usage.

In [9]:
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM


bnb_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_type=torch.float16,
    llm_int8_enable_fp32_cpu_offload=True
)

model=AutoModelForCausalLM.from_pretrained(
    os.getenv('MODEL_NAME'),
    device_map='auto',
    quantization_config=bnb_config,
    # Solving the issue: ValueError: PhiForCausalLM does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.
    trust_remote_code=True,
#     attn_implementation="flash_attention_2", # Does not be supported in here
    torch_dtype=torch.float16
)
model.config.quantization_config

configuration_phi.py:   0%|          | 0.00/9.26k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-2:
- configuration_phi.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi.py:   0%|          | 0.00/62.7k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-2:
- modeling_phi.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

BitsAndBytesConfig {
  "bnb_4bit_compute_dtype": "float32",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": true,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

In [10]:
max_length=get_max_length(model)

Found max lenth: 2048


In [11]:
train_dataset=preprocess_dataset(tokenizer, max_length, seed, dataset['train'])
eval_dataset=preprocess_dataset(tokenizer, max_length, seed, dataset['validation'])
train_dataset

Preprocessing dataset...


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Filter:   0%|          | 0/100 [00:00<?, ? examples/s]

Preprocessing dataset...


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Filter:   0%|          | 0/100 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 100
})

# Frozen the model's parameters

In [12]:
from peft import prepare_model_for_kbit_training

# save memory
model.gradient_checkpointing_enable()
model=prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layern

In [13]:
from peft import LoraConfig, TaskType, get_peft_model

peft_config=LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense',
        'fc1',
        'fc2',
    ],
    bias="none",
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

peft_model=get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

trainable params: 23,592,960 || all params: 2,803,276,800 || trainable%: 0.8416207775129448


In [14]:
import time
from transformers import TrainingArguments, Trainer

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    overwrite_output_dir=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=5,
    gradient_checkpointing=True,  # Enable gradient checkpointing
    gradient_checkpointing_kwargs={"use_reentrant": False},
    warmup_steps=50,
    max_steps=100, # Total number of training steps
    num_train_epochs=2, # Number of training epochs
    learning_rate=5e-5, # Learning rate
    weight_decay=0.01, # Weight decay
    optim="paged_adamw_8bit", # Keep the optimizer state and quantize it
#     bf16=True, # Do not supported in Kaggle environment, require Ampere....
    fp16=True, # use fp16 16bit(mixed) precision training instead of 32-bit training.
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2, # Limit the total number of checkpoints
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True, # Load the best model at the end of training,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME")
)

peft_model.config.use_cache=False

trainer=Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator
)


start_time=time.time()
trainer.train()
end_time=time.time()

training_time=end_time-start_time

print(f"Training completed in {training_time} seconds.")

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.2
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240215_112251-du4g8ew6[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33msft-microsoft-phi2-on-dialogsum[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Supervised-fine-tune-models[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Supervised-fine-tune-models/runs/du4g8ew6[0m


Step,Training Loss,Validation Loss
50,1.4203,1.396566
100,1.2814,1.363865


Training completed in 1410.76864695549 seconds.


In [15]:
%%capture
trainer.push_to_hub(os.getenv('WANDB_NAME'))
tokenizer.push_to_hub(os.getenv('WANDB_NAME'))

# Evaluating

In [16]:
import gc

del model,peft_model, tokenizer, trainer
gc.collect()

torch.cuda.empty_cache()

In [17]:
eval_tokenizer=AutoTokenizer.from_pretrained(
    'aisuko/'+os.getenv('WANDB_NAME'),
    add_bos_token=True,
    trust_remote_code=True,
    use_fast=False
)

eval_tokenizer.pad_token=eval_tokenizer.eos_token
print(eval_tokenizer)

tokenizer_config.json:   0%|          | 0.00/7.47k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/999k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


CodeGenTokenizer(name_or_path='aisuko/sft-microsoft-phi2-on-dialogsum', vocab_size=50257, model_max_length=2048, is_fast=False, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	50257: AddedToken("                               ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50258: AddedToken("                              ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50259: AddedToken("                             ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50260: AddedToken("                            ", rstrip=False, lstrip=False, single_word=False, normalized=True,

In [18]:
from peft import PeftModel

model=AutoModelForCausalLM.from_pretrained(
    os.getenv('MODEL_NAME'),
    device_map='auto',
    trust_remote_code=True,
    torch_dtype=torch.float16
)

eval_model=PeftModel.from_pretrained(
    model, 'aisuko/'+os.getenv('WANDB_NAME'), device_map='auto'
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_config.json:   0%|          | 0.00/617 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/94.4M [00:00<?, ?B/s]

In [19]:
def inference(model, prompt, max_length=200):
    tokens=eval_tokenizer(prompt, return_tensors='pt')
    res=model.generate(
        **tokens.to('cuda'),
        max_new_tokens=max_length,
        do_sample=True,
        num_return_sequences=1,
        temperature=0.1,
        num_beams=1,
        top_p=0.95
    )
    return eval_tokenizer.batch_decode(res, skip_special_tokens=False)

In [20]:
dialogue=dataset['test'][5]['dialogue']
summary=dataset['test'][5]['summary']

prompt=f'Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n'

peft_model_res=inference(eval_model, prompt, 100)
peft_model_output=peft_model_res[0].split('Output:\n')[1]

prefix, success, result=peft_model_output.partition('###')

dashline='-'.join('' for x in range(100))
print(prompt)
print(dashline)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Instruct: Summarize the following conversation.
#Person1#: You're finally here! What took so long?
#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.
#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.
#Person2#: I don't think it can be avoided, to be honest.
#Person1#: perhaps it would be better if you started taking public transport system to work.
#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.
#Person1#: It would be better for the environment, too.
#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.
#Person1#: Taking the subway would be a lot less stressful than driving as well.
#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.
#Person1#: Well, when it's nicer outside, you can start biking t

In [21]:
print(summary)

#Person2# complains to #Person1# about the traffic jam, #Person1# suggests quitting driving and taking public transportation instead.


In [22]:
print(prefix)

#Person2# tells #Person1# that they got stuck in traffic again and that it's always congested near the Carrefour intersection during rush hour. #Person1# suggests that #Person2# should consider taking public transport system to work, and #Person2# agrees that it would be better for the environment. #Person1# also suggests that #Person2# could start biking to work when it's nicer outside. #Person2# decides to quit driving to work.
