## Fine-Tuning an LLM on the Bitext Dataset
Dataset: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset

The dataset has the following specs:

- Use case: Intent Detection
- Vertical: Customer Service
- 27 intents assigned to 10 categories
- 26,872 question/answer pairs, around 1,000 per intent
- 30 entity/slot types
- 12 different types of language generation tags

The categories and intents have been selected from Bitext's collection of 20 vertical-specific datasets, covering the intents that are common across all 20 verticals. The verticals are:

Automotive, Retail Banking, Education, Events & Ticketing, Field Services, Healthcare, Hospitality, Insurance, Legal Services, Manufacturing, Media Streaming, Mortgages & Loans, Moving & Storage, Real Estate/Construction, Restaurant & Bar Chains, Retail/E-commerce, Telecommunications, Travel, Utilities, Wealth Management



In [1]:
from datasets import load_dataset

ds = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")



In [2]:
ds

DatasetDict({
    train: Dataset({
        features: ['flags', 'instruction', 'category', 'intent', 'response'],
        num_rows: 26872
    })
})

## Structuring Data for BERT QA
- Context: The instruction (or the broader conversation, if available) can act as the context for the model. This dataset has no broader conversational context, so the instruction is the only source text available.
- Question: Derived from the instruction.
- Answer: Taken directly from the response.

## Example Breakdown for BERT QA

Context: This is usually general knowledge about the domain or background. Since this dataset doesn't provide a broader context, you can leave this field blank or generalize it.

Example Context: "You are interacting with a support system that helps with cancelling orders."

Question: Use the instruction as the question.

Example Question: "I need help cancelling purchase {{Order Number}}."

Answer: The response is the answer the model should learn to extract from the context. Because BERT QA is extractive, this only works when the answer text actually appears in the context.

Example Answer: "To cancel your purchase, follow these steps: 1. Log into your account. 2. Locate your order. 3. Initiate the cancellation."


## Full Example Format for BERT QA
- Data in Row Format:
  - Flags: BLQ
  - Instruction: "I need help cancelling purchase {{Order Number}}."
  - Response: "To cancel your purchase, follow these steps: 1. Log into your account. 2. Locate your order. 3. Initiate the cancellation."
- BERT QA Format:
  - Context: "You are interacting with a support system that helps with cancelling orders."
  - Question: "I need help cancelling purchase {{Order Number}}."
  - Answer: "To cancel your purchase, follow these steps: 1. Log into your account. 2. Locate your order. 3. Initiate the cancellation."
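
The row-to-QA mapping above can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: `row_to_qa` and its `context` argument are hypothetical names, and the row is a plain dict rather than a `datasets` record.

```python
def row_to_qa(row, context=""):
    """Convert one dataset row (a dict with 'instruction' and 'response')
    into a context/question/answer record. `context` defaults to an empty
    string because the dataset itself provides no background passage."""
    return {
        "context": context,
        "question": row["instruction"],
        "answer": row["response"],
    }


# Example row copied from the breakdown above.
row = {
    "flags": "BLQ",
    "instruction": "I need help cancelling purchase {{Order Number}}.",
    "response": (
        "To cancel your purchase, follow these steps: 1. Log into your account. "
        "2. Locate your order. 3. Initiate the cancellation."
    ),
}

record = row_to_qa(
    row,
    context="You are interacting with a support system that helps with cancelling orders.",
)
print(record["question"])
```

Keeping the context as an explicit argument makes it easy to try a blank context, a generic one, or a per-category one without touching the mapping itself.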

In [3]:
import pandas as pd
data=pd.DataFrame(ds['train'])
data

Unnamed: 0,flags,instruction,category,intent,response
0,B,question about cancelling order {{Order Number}},ORDER,cancel_order,I've understood you have a question regarding ...
1,BQZ,i have a question about cancelling oorder {{Or...,ORDER,cancel_order,I've been informed that you have a question ab...
2,BLQZ,i need help cancelling puchase {{Order Number}},ORDER,cancel_order,I can sense that you're seeking assistance wit...
3,BL,I need to cancel purchase {{Order Number}},ORDER,cancel_order,I understood that you need assistance with can...
4,BCELN,"I cannot afford this order, cancel purchase {{...",ORDER,cancel_order,I'm sensitive to the fact that you're facing f...
...,...,...,...,...,...
26867,BL,I am waiting for a rebate of {{Refund Amount}}...,REFUND,track_refund,Thank you for sharing your situation regarding...
26868,BIL,how to see if there is anything wrong with my ...,REFUND,track_refund,Ensuring the accuracy of your restitution is o...
26869,BLQZ,I'm waiting for a reimbjrsement of {{Currency ...,REFUND,track_refund,"Firstly, I genuinely understand the importance..."
26870,BL,I don't know what to do to see my reimbursemen...,REFUND,track_refund,I've understood you're unsure about how to che...


In [4]:
data['category'].value_counts()

category
ACCOUNT         5986
ORDER           3988
REFUND          2992
INVOICE         1999
CONTACT         1999
PAYMENT         1998
FEEDBACK        1997
DELIVERY        1994
SHIPPING        1970
SUBSCRIPTION     999
CANCEL           950
Name: count, dtype: int64

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Function to generate context, question, and answer from the dataset
def prepare_data(data):
    contexts = data['instruction'].tolist()
    questions = []
    answers = data['response'].tolist()

    # Generate questions based on the category and intent of each row
    for instruction, category, intent in zip(data['instruction'], data['category'], data['intent']):
        if category == "ACCOUNT":
            questions.append(f"How should I handle this account-related request: '{instruction}'?")
        elif category == "ORDER":
            if "cancel" in intent.lower():
                questions.append(f"What's the appropriate response to cancel an order: '{instruction}'?")
            else:
                questions.append(f"How should I address this order-related inquiry: '{instruction}'?")
        elif category == "REFUND":
            questions.append(f"What's the proper way to handle this refund request: '{instruction}'?")
        elif category == "INVOICE":
            questions.append(f"How should I assist with this invoice-related query: '{instruction}'?")
        elif category == "CONTACT":
            questions.append(f"What's the best way to guide the customer to contact support given: '{instruction}'?")
        elif category == "PAYMENT":
            questions.append(f"How should I respond to this payment-related inquiry: '{instruction}'?")
        elif category == "FEEDBACK":
            questions.append(f"What's an appropriate response to this feedback: '{instruction}'?")
        elif category == "DELIVERY":
            questions.append(f"How should I address this delivery-related question: '{instruction}'?")
        elif category == "SHIPPING":
            questions.append(f"What's the proper response to this shipping inquiry: '{instruction}'?")
        elif category == "SUBSCRIPTION":
            questions.append(f"How do I handle this subscription-related request: '{instruction}'?")
        elif category == "CANCEL":
            questions.append(f"What's the appropriate way to process this cancellation request: '{instruction}'?")
        else:
            questions.append(f"How should I respond to this request: '{instruction}'?")
    
    return contexts, questions, answers

# Prepare the data
contexts, questions, answers = prepare_data(data)

# Split the data into training and testing sets
train_contexts, test_contexts, train_questions, test_questions, train_answers, test_answers = train_test_split(
    contexts, questions, answers, test_size=0.1, random_state=42
)

# Create a DataFrame for the training and testing sets
train = pd.DataFrame({"context": train_contexts, "question": train_questions, "answer": train_answers})
test = pd.DataFrame({"context": test_contexts, "question": test_questions, "answer": test_answers})

# Display the first few rows of the training set
train.head()

Unnamed: 0,context,question,answer
0,I have to check how soon can I expect the purc...,How should I address this delivery-related que...,We understand your anticipation to know the es...
1,locate bill{{Invoice Number}},How should I assist with this invoice-related ...,Assuredly! I understand your need to locate th...
2,need assistance to obtain a refund of money,What's the proper way to handle this refund re...,I appreciate that you require assistance in ob...
3,I don't know how I can talk to cstomer support,What's the best way to guide the customer to c...,Grateful for your contact! I get the sense tha...
4,help editing purchase {{Order Number}},How should I address this order-related inquir...,We understand that you need assistance with ed...
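
The category-to-question mapping in `prepare_data` could also be written as a lookup table, which may be easier to extend than a long `if`/`elif` chain. The sketch below is one possible restructuring; the names `QUESTION_TEMPLATES`, `CANCEL_ORDER_TEMPLATE`, `DEFAULT_TEMPLATE`, and `build_question` are hypothetical, and the template strings are copied from the cell above.

```python
# One template per category; wording copied from prepare_data.
QUESTION_TEMPLATES = {
    "ACCOUNT": "How should I handle this account-related request: '{instruction}'?",
    "ORDER": "How should I address this order-related inquiry: '{instruction}'?",
    "REFUND": "What's the proper way to handle this refund request: '{instruction}'?",
    "INVOICE": "How should I assist with this invoice-related query: '{instruction}'?",
    "CONTACT": "What's the best way to guide the customer to contact support given: '{instruction}'?",
    "PAYMENT": "How should I respond to this payment-related inquiry: '{instruction}'?",
    "FEEDBACK": "What's an appropriate response to this feedback: '{instruction}'?",
    "DELIVERY": "How should I address this delivery-related question: '{instruction}'?",
    "SHIPPING": "What's the proper response to this shipping inquiry: '{instruction}'?",
    "SUBSCRIPTION": "How do I handle this subscription-related request: '{instruction}'?",
    "CANCEL": "What's the appropriate way to process this cancellation request: '{instruction}'?",
}
CANCEL_ORDER_TEMPLATE = "What's the appropriate response to cancel an order: '{instruction}'?"
DEFAULT_TEMPLATE = "How should I respond to this request: '{instruction}'?"


def build_question(instruction, category, intent):
    """Return the question string for one row, mirroring prepare_data's branches."""
    if category == "ORDER" and "cancel" in intent.lower():
        template = CANCEL_ORDER_TEMPLATE
    else:
        template = QUESTION_TEMPLATES.get(category, DEFAULT_TEMPLATE)
    # str.format does not re-parse braces inside the argument, so
    # placeholders such as {{Order Number}} survive unchanged.
    return template.format(instruction=instruction)


print(build_question("help editing purchase {{Order Number}}", "ORDER", "change_order"))
```

The intent name `change_order` is used here only as an example of a non-cancel ORDER intent.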


In [9]:
# Next step: fine-tune BERT
import torch
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to tokenize the data for BERT QA
def tokenize_data(contexts, questions, answers, max_length=512):
    input_ids = []
    attention_masks = []
    start_positions = []
    end_positions = []
    
    for context, question, answer in zip(contexts, questions, answers):
        # Encode the context and question using the tokenizer
        encoding = tokenizer.encode_plus(
            question,                      # The question
            context,                       # The context (input text)
            max_length=max_length,         # Maximum input length for BERT
            padding='max_length',          # Pad input sequences to the max length
            truncation=True,               # Truncate input sequences to the max length
            return_attention_mask=True,    # Return attention masks
            return_tensors='pt'            # Return PyTorch tensors
        )

        # Append the encoded inputs and attention masks
        input_ids.append(encoding['input_ids'])
        attention_masks.append(encoding['attention_mask'])

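        # Locate the answer in the context. NOTE: str.find returns a character
        # offset (or -1 when the answer is not a substring of the context),
        # whereas BertForQuestionAnswering expects token indices.
        start_idx = context.find(answer)
        end_idx = start_idx + len(answer) - 1

        # Append the start and end positions
        start_positions.append(start_idx)
        end_positions.append(end_idx)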
    
    # Convert lists to tensors
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    start_positions = torch.tensor(start_positions)
    end_positions = torch.tensor(end_positions)
    
    return input_ids, attention_masks, start_positions, end_positions

# Tokenize the training data
train_input_ids, train_attention_masks, train_start_positions, train_end_positions = tokenize_data(
    train['context'].tolist(), 
    train['question'].tolist(), 
    train['answer'].tolist()
)

# Tokenize the testing data
test_input_ids, test_attention_masks, test_start_positions, test_end_positions = tokenize_data(
    test['context'].tolist(), 
    test['question'].tolist(), 
    test['answer'].tolist()
)

# Check the shape of the tokenized data
print(f"Training input shape: {train_input_ids.shape}")
print(f"Testing input shape: {test_input_ids.shape}")

Training input shape: torch.Size([24184, 512])
Testing input shape: torch.Size([2688, 512])
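
`BertForQuestionAnswering` takes `start_positions` and `end_positions` as token indices, while `str.find` yields a character offset. The sketch below shows one way to convert a character span into a token span from per-token character offsets. It is an illustration under assumptions: `char_span_to_token_span` and `whitespace_offsets` are hypothetical helpers, and a toy whitespace tokenizer stands in for BERT so the example runs without downloads. With a fast Hugging Face tokenizer the offsets would instead come from `return_offsets_mapping=True`, and for a question/context pair the search would also need to be restricted to the context tokens.

```python
def char_span_to_token_span(offsets, char_start, char_end):
    """Map the character span [char_start, char_end) to inclusive token indices.

    `offsets` is a list of (start, end) character offsets, one per token.
    Returns None when the span is missing (char_start < 0) or no token overlaps it.
    """
    if char_start < 0:
        return None
    token_start = token_end = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # zero-width entries (e.g. special tokens) carry no text
        if tok_end > char_start and tok_start < char_end:
            if token_start is None:
                token_start = i
            token_end = i
    if token_start is None:
        return None
    return token_start, token_end


def whitespace_offsets(text):
    """Toy stand-in for a tokenizer's offset mapping: one token per word."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        offsets.append((start, start + len(word)))
        pos = start + len(word)
    return offsets


context = "To cancel your purchase, log into your account and locate your order."
answer = "log into your account"

offsets = whitespace_offsets(context)
char_start = context.find(answer)
span = char_span_to_token_span(offsets, char_start, char_start + len(answer))
print(span)  # (4, 7): tokens "log" through "account"

# An answer that is absent from the context has no valid span.
missing = context.find("refund")
print(char_span_to_token_span(offsets, missing, missing + len("refund")))  # None
```

Returning `None` for a missing answer leaves the caller free to skip the example or point both labels at a designated "no answer" position.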


In [10]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Create TensorDataset for training and testing data
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_start_positions, train_end_positions)
test_dataset = TensorDataset(test_input_ids, test_attention_masks, test_start_positions, test_end_positions)

# Create DataLoader for the training and testing sets
batch_size = 16

train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=batch_size)

# Check if DataLoader is working
for batch in train_dataloader:
    print(batch)
    break

[tensor([[ 101, 2054, 1005,  ...,    0,    0,    0],
        [ 101, 2129, 2323,  ...,    0,    0,    0],
        [ 101, 2129, 2323,  ...,    0,    0,    0],
        ...,
        [ 101, 2054, 1005,  ...,    0,    0,    0],
        [ 101, 2129, 2323,  ...,    0,    0,    0],
        [ 101, 2054, 1005,  ...,    0,    0,    0]]), tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]), tensor([ 850,  981,  315,  397,  504,  652,  472, 1264,  593,  581,  642,  459,
         788,  299,  350,  292])]
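
The label tensors in the printed batch can be checked against the sequence length before training. The sketch below is a minimal, hypothetical helper (`invalid_label_count` is not part of the notebook) applied to the exact values shown above, written with plain lists so it runs without PyTorch.

```python
def invalid_label_count(starts, ends, seq_len):
    """Count (start, end) pairs that cannot be token indices into a
    sequence of length seq_len: negative, past the end, or reversed."""
    bad = 0
    for start, end in zip(starts, ends):
        if start < 0 or end >= seq_len or start > end:
            bad += 1
    return bad


# Values copied from the batch printed above.
starts = [-1] * 16
ends = [850, 981, 315, 397, 504, 652, 472, 1264,
        593, 581, 642, 459, 788, 299, 350, 292]

print(invalid_label_count(starts, ends, seq_len=512))  # 16
print(sum(end >= 512 for end in ends))                 # 8
```

Every start label in this batch is -1, which is what `str.find` returns when the answer is not found in the context, and half of the end labels exceed the 512-token input length.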


In [11]:
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
from transformers import BertForQuestionAnswering

# Load the BERT QA model
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Set the model to training mode
model.train()

# Define the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Fine-tuning loop
epochs = 3

for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    
    # Training loop
    for step, batch in enumerate(train_dataloader):
        input_ids, attention_mask, start_positions, end_positions = batch
        
        # Zero the gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, 
                        start_positions=start_positions, end_positions=end_positions)
        
        loss = outputs.loss
        loss.backward()  # Backpropagation
        
        optimizer.step()  # Update the model parameters
        
        if step % 100 == 0:
            print(f"Step {step}, Loss: {loss.item()}")

# Save the fine-tuned model
model.save_pretrained("bert-qa-finetuned")
tokenizer.save_pretrained("bert-qa-finetuned")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Step 0, Loss: 6.205663681030273


KeyboardInterrupt: 
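
As a rough sanity check on the first logged loss, it can be compared with the cross-entropy of a predictor that spreads probability evenly over all 512 input positions, which is ln(512). This is only a back-of-the-envelope comparison and assumes the freshly initialized `qa_outputs` head starts out close to uniform.

```python
import math

seq_len = 512                    # max_length used in tokenize_data
uniform_loss = math.log(seq_len)  # cross-entropy of a uniform guess over positions
observed = 6.205663681030273      # "Step 0" loss from the output above

print(f"uniform baseline: {uniform_loss:.4f}")   # about 6.2383
print(f"observed - baseline: {observed - uniform_loss:+.4f}")
```

A step-0 loss this close to the uniform baseline is what one would expect before the span head has learned anything.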