 Importing Required Libraries

In [1]:
!pip install torch transformers datasets accelerate peft scikit-learn

from IPython.display import clear_output
clear_output()


  torch: The PyTorch library, used for tensor computations and for building and training neural networks.
  transformers: The Hugging Face library that provides pre-trained models and tokenizers for NLP tasks.
  peft: The Hugging Face parameter-efficient fine-tuning library, used here to apply Low-Rank Adaptation (LoRA) to the model.
  datasets: The Hugging Face library for loading and processing datasets.
  Trainer: A transformers class that implements the training and evaluation loop.
  TrainingArguments: A transformers class that holds the training hyperparameters and settings.
  DataCollatorForLanguageModeling: A transformers utility that assembles tokenized examples into batches for language modeling.
  train_test_split: A scikit-learn function, used here to split off a validation set.


In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset, Dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from sklearn.model_selection import train_test_split
from huggingface_hub import notebook_login


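# Step 1: Authenticate with Hugging Face.
# Do not hard-code an access token in the notebook: notebook_login() takes no token
# argument and prompts for the token interactively instead.
notebook_login()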


# Load the dataset
dataset = load_dataset("bitext/Bitext-retail-banking-llm-chatbot-training-dataset")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/25545 [00:00<?, ? examples/s]
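
Rather than pasting a token into a cell, one option is to read it from an environment variable. The sketch below is a minimal illustration: `get_hf_token` is a hypothetical helper (not part of huggingface_hub), and it assumes the token was stored under the name `HF_TOKEN`, the name used in the Colab message above.

```python
import os
from typing import Optional


def get_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `env_var`, or None if it is unset or blank.

    Hypothetical helper for illustration; not part of huggingface_hub.
    """
    token = os.environ.get(env_var, "").strip()
    return token or None


token = get_hf_token()
print("token found" if token else "no token set; public datasets can still be loaded")
```

If a token is found, it can be passed to `huggingface_hub.login(token=token)` in place of the interactive prompt.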

In [3]:
print(dataset)

In [4]:
# Convert the Hugging Face dataset to a pandas DataFrame for splitting
train_df = dataset['train'].to_pandas()

In [5]:
# Create a validation split (90% training, 10% validation)
train_df, eval_df = train_test_split(train_df, test_size=0.1, random_state=42)

In [6]:
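# Convert the split DataFrames back to Hugging Face datasets.
# preserve_index=False keeps the shuffled pandas index from being stored
# as an extra "__index_level_0__" column.
train_data = Dataset.from_pandas(train_df, preserve_index=False)
eval_data = Dataset.from_pandas(eval_df, preserve_index=False)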
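
As a sanity check on the split sizes: with a fractional test_size and no train_size, scikit-learn rounds the test set size up and gives the remainder to training. The sketch below reproduces that arithmetic for the 25,545 examples reported when the train split was generated. `split_sizes` is a hypothetical helper written for this note, not a scikit-learn function.

```python
import math


def split_sizes(n_samples: int, test_size: float):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper mirroring the rounding described above:
    the test set is rounded up, the training set gets the rest.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(25545, 0.1)
print(n_train, n_test)  # 22990 2555
```

These are the same counts that the two map progress bars report when the datasets are tokenized later in the notebook.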


  model_name: The Hub identifier of the model to load, here distilgpt2 (DistilGPT-2, a smaller and faster distilled version of GPT-2).

  AutoModelForCausalLM: Loads the pre-trained weights with a causal language modeling head.

  AutoTokenizer: Loads the tokenizer that matches the model; it converts text into token IDs.


In [7]:
# Load the DistilGPT-2 model and tokenizer
model_name = "distilgpt2"  # Using DistilGPT-2
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

GPT-2-style tokenizers do not define a padding token, so the next cell reuses the end-of-sequence token as the padding token. Without it, padding the inputs to a fixed length would fail.

In [8]:
# Set the padding token to be the same as the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

LoraConfig: This class defines the configuration for applying LoRA to the model.

  r: The rank of the LoRA adapters, i.e. the inner dimension of the two low-rank matrices whose product forms the weight update.

  lora_alpha: A scaling factor for the adapter output; the low-rank update is multiplied by lora_alpha / r (here 32 / 16 = 2).

  lora_dropout: The dropout rate applied to the input of the LoRA layers to reduce overfitting.

  task_type: Specifies the type of task, in this case causal language modeling.


In [9]:
# Define LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of the LoRA adapters
    lora_alpha=32,  # Scaling factor
    lora_dropout=0.1,  # Dropout rate
    task_type=TaskType.CAUSAL_LM
)
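
To make the effect of r and lora_alpha concrete, the sketch below counts the parameters one adapter adds and computes the scaling factor. The layer shape (a 768 -> 2304 projection) and the layer count (6) are assumptions chosen to resemble a DistilGPT-2-sized model with one adapted projection per layer. Which modules PEFT actually adapts depends on its defaults for the architecture, so treat the totals as illustrative only.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Assumed shapes for illustration only (not read from the model).
D_IN, D_OUT, N_LAYERS = 768, 2304, 6

per_layer = lora_param_count(D_IN, D_OUT, r=16)
print(per_layer)             # 49152 adapter parameters per adapted layer
print(per_layer * N_LAYERS)  # 294912 across the assumed 6 layers
print(D_IN * D_OUT)          # 1769472 weights in the frozen full matrix
print(lora_scaling(32, 16))  # 2.0
```

Once the model has been wrapped with get_peft_model, calling `model.print_trainable_parameters()` reports the actual trainable and total parameter counts.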

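get_peft_model wraps the pre-trained model with the LoRA adapters described by the configuration. The original weights are frozen and only the adapter weights are trained, which is what makes the fine-tuning efficient.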

In [10]:
# Get the LoRA model
model = get_peft_model(model, lora_config)




  This function tokenizes the instruction column of each example, truncating and padding every sequence to a fixed length of 512 tokens.

  It also sets the labels to be the same as the input IDs. For causal language modeling the model shifts the labels internally, so each position is trained to predict the next token.

  The function is only defined here; it is applied to the training and validation datasets with map in the following cell.



In [11]:
# Preprocess the dataset
def preprocess_function(examples):
    inputs = tokenizer(examples['instruction'], truncation=True, padding='max_length', max_length=512)
    inputs['labels'] = inputs['input_ids']  # Set labels to input_ids
    return inputs
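
The function above tokenizes only the instruction text. If the dataset also provides the assistant's reply, a common variation is to train on the instruction and the reply together. The sketch below builds such training strings from a batch; the `response` column name and the prompt template are assumptions made for illustration, and `build_texts` is a hypothetical helper.

```python
def build_texts(examples: dict, eos_token: str = "<|endoftext|>") -> list:
    """Join each instruction with its reply into one training string.

    `examples` is a batch in the dict-of-lists form that map(batched=True)
    passes to a preprocessing function. The "response" key and the
    template are assumptions, not taken from the notebook.
    """
    texts = []
    for instruction, response in zip(examples["instruction"], examples["response"]):
        texts.append(f"Instruction: {instruction}\nResponse: {response}{eos_token}")
    return texts


batch = {
    "instruction": ["How do I block my card?"],
    "response": ["You can block it from the app."],
}
print(build_texts(batch)[0])
```

The resulting list of strings could be passed to the tokenizer in place of `examples['instruction']`.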

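This cell applies preprocess_function to the training and validation datasets with map. With batched=True, the function receives the examples in batches rather than one at a time.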

In [12]:
# Tokenize the training and validation datasets
tokenized_train_dataset = train_data.map(preprocess_function, batched=True)
tokenized_eval_dataset = eval_data.map(preprocess_function, batched=True)

Map:   0%|          | 0/22990 [00:00<?, ? examples/s]

Map:   0%|          | 0/2555 [00:00<?, ? examples/s]

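TrainingArguments: Defines the parameters for training.

  output_dir: Directory where checkpoints and the fine-tuned model are saved.

  evaluation_strategy: When to evaluate the model (here, at the end of each epoch). Recent transformers releases rename this argument to eval_strategy.

  learning_rate: The initial learning rate for the optimizer.

  per_device_train_batch_size: The training batch size per device (GPU or CPU), so the total batch size grows with the number of devices.

  num_train_epochs: The number of passes over the training set.

  weight_decay: The weight decay coefficient applied by the optimizer, a form of regularization that helps prevent overfitting.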


In [13]:
# Set training arguments
training_args = TrainingArguments(
    output_dir="./lora-finetuned-model",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)
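
The arguments above do not set a learning-rate scheduler, so the Trainer uses its default. In the transformers versions I am aware of, that default is a linear decay from learning_rate to zero with no warmup; treat this as an assumption and check your installed version. The sketch below shows the shape of such a schedule. `linear_lr` is a hypothetical helper, and the total of 17,244 steps is taken from the training output reported later in the notebook.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 17244  # global_step reported by trainer.train()
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_lr(step, TOTAL_STEPS))
```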




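  This creates a data collator that assembles batches for language modeling. The mlm=False argument selects causal (next-token) language modeling rather than masked language modeling. In this mode the collator builds the labels from input_ids itself and sets padding positions to -100 so that they are ignored by the loss.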


In [14]:
# Create a data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
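
The label masking done by the collator can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `causal_lm_labels` is a hypothetical helper, and the first two token IDs are arbitrary example values. Because this notebook sets the padding token to the end-of-sequence token, any genuine end-of-sequence token is masked as well.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def causal_lm_labels(input_ids: list, pad_token_id: int) -> list:
    """Copy input_ids to labels, replacing padding positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS_ID = 50256  # GPT-2's <|endoftext|> id, reused as the padding id in this notebook
ids = [15496, 995, EOS_ID, EOS_ID, EOS_ID]  # two example tokens followed by padding
print(causal_lm_labels(ids, pad_token_id=EOS_ID))
# [15496, 995, -100, -100, -100]
```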


This creates a Trainer instance that will handle the training loop, using the specified model, training arguments, dataset, and data collator.

This line starts the training process, where the model will be fine-tuned on the provided dataset.

In [15]:
# Create Trainer instance with eval_dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,  # Include the validation dataset
    data_collator=data_collator,
)

# Start fine-tuning
trainer.train()

Epoch   Training Loss   Validation Loss
1       2.7267          2.386417
2       2.4724          2.192319
3       2.4632          2.13201
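
The validation loss is a mean cross-entropy per token (in nats), so exponentiating it gives an approximate perplexity, which is often easier to interpret. The sketch below does this for the three reported values. It is approximate because the Trainer averages the loss over batches rather than over all tokens at once.

```python
import math

# Validation losses copied from the table above.
eval_losses = {1: 2.386417, 2: 2.192319, 3: 2.13201}

# Perplexity is exp(mean cross-entropy).
perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: approximate perplexity = {ppl:.2f}")
```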


TrainOutput(global_step=17244, training_loss=2.750012536912749, metrics={'train_runtime': 4870.7395, 'train_samples_per_second': 14.16, 'train_steps_per_second': 3.54, 'total_flos': 9073303171891200.0, 'train_loss': 2.750012536912749, 'epoch': 3.0})
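
The step count and throughput in this output can be reproduced from the dataset size and the training arguments. The sketch below assumes a single device and no gradient accumulation, which is consistent with the reported numbers. `total_steps` is a hypothetical helper.

```python
import math


def total_steps(n_examples: int, batch_size: int, epochs: int, n_devices: int = 1) -> int:
    """Optimizer steps for a full run; the last partial batch of each epoch counts as a step."""
    steps_per_epoch = math.ceil(n_examples / (batch_size * n_devices))
    return steps_per_epoch * epochs


N_TRAIN, BATCH_SIZE, EPOCHS = 22990, 4, 3
RUNTIME_S = 4870.7395  # train_runtime from the output above

steps = total_steps(N_TRAIN, BATCH_SIZE, EPOCHS)
print(steps)                                   # 17244
print(round(N_TRAIN * EPOCHS / RUNTIME_S, 2))  # 14.16 samples per second
print(round(steps / RUNTIME_S, 2))             # 3.54 steps per second
```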

In [16]:
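# Save the fine-tuned LoRA adapter and the tokenizer.
# model is a PEFT model, so only the adapter weights and config are written, not the full base model.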
model.save_pretrained("./lora-finetuned-model")
tokenizer.save_pretrained("./lora-finetuned-model")

('./lora-finetuned-model/tokenizer_config.json',
 './lora-finetuned-model/special_tokens_map.json',
 './lora-finetuned-model/vocab.json',
 './lora-finetuned-model/merges.txt',
 './lora-finetuned-model/added_tokens.json',
 './lora-finetuned-model/tokenizer.json')
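
Because the saved directory holds only the adapter, using the model later means loading the base model first and attaching the adapter to it. The following is a minimal, untested sketch of that step. `load_finetuned` is a hypothetical helper, and it assumes the directory and base model name used earlier in this notebook.

```python
ADAPTER_DIR = "./lora-finetuned-model"  # directory written by save_pretrained above
BASE_MODEL = "distilgpt2"               # base model the adapter was trained on


def load_finetuned(adapter_dir: str = ADAPTER_DIR, base_model: str = BASE_MODEL):
    """Load the base model, attach the saved LoRA adapter, and return (model, tokenizer)."""
    # Imported lazily so that defining this function does not require the libraries.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    base = AutoModelForCausalLM.from_pretrained(base_model)
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()  # disable dropout for inference
    return model, tokenizer
```

Calling `model, tokenizer = load_finetuned()` would then give a model ready for `model.generate(...)`.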