## **Finetuning** `facebook/bart-large` **on the** `bitext/Bitext-customer-support-llm-chatbot-training-dataset` **dataset**

---

In [1]:
# !pip install datasets transformers torch

In [2]:
from transformers import BartForConditionalGeneration, BartTokenizer, Trainer, TrainingArguments

model_name = "facebook/bart-large"

model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)


The dataset is gated: it cannot be accessed until the user has authenticated and agreed to its terms and conditions, so a Hugging Face login is required before loading it.

In [5]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("Hugging_face")
secret_value_1 = user_secrets.get_secret("wandb_api_key")

In [7]:
from huggingface_hub import login

login(secret_value_0)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful
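
The `kaggle_secrets` client above is only available inside Kaggle. As a minimal sketch for other environments, the token could be read from an environment variable instead. The helper name `resolve_token` is hypothetical; `HF_TOKEN` is the variable name that `huggingface_hub` itself looks for.

```python
import os

def resolve_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or None when it is unset or blank."""
    value = os.environ.get(env_var, "").strip()
    return value or None

# Hypothetical usage: check whether a token is available before logging in.
token = resolve_token()
print("token found" if token else "HF_TOKEN is not set")
```

If a token is found, it can be passed to `login(...)` in place of `secret_value_0`.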


In [10]:
from datasets import load_dataset

ds = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")


In [11]:
ds

DatasetDict({
    train: Dataset({
        features: ['flags', 'instruction', 'category', 'intent', 'response'],
        num_rows: 26872
    })
})

In [12]:
import pandas as pd

df = pd.DataFrame(ds['train'])

In [14]:
df['instruction'].unique()

array(['question about cancelling order {{Order Number}}',
       'i have a question about cancelling oorder {{Order Number}}',
       'i need help cancelling puchase {{Order Number}}', ...,
       'how to see if there is anything wrong with my restitution?',
       "I don't know what to do to see my reimbursement status",
       'I need to know if there is anything new on the restitution'],
      dtype=object)

In [15]:
df['response'].unique()

array(["I've understood you have a question regarding canceling order {{Order Number}}, and I'm here to provide you with the information you need. Please go ahead and ask your question, and I'll do my best to assist you.",
       "I've been informed that you have a question about canceling order {{Order Number}}. I'm here to assist you! Please go ahead and let me know what specific question you have, and I'll provide you with all the information and guidance you need. Your satisfaction is my top priority.",
       "I can sense that you're seeking assistance with canceling your purchase with the purchase number {{Order Number}}. I apologize for any inconvenience caused, and I'm here to guide you through the process.\n\nTo cancel your purchase, please follow these steps:\n\n1. Log into your account: Visit our {{Online Company Portal Info}} and sign in using your credentials.\n2. Locate your order: Once logged in, navigate to the '{{Online Order Interaction}}' or '{{Online Order Interacti

In [18]:
df['intent'].unique()

array(['cancel_order', 'change_order', 'change_shipping_address',
       'check_cancellation_fee', 'check_invoice', 'check_payment_methods',
       'check_refund_policy', 'complaint', 'contact_customer_service',
       'contact_human_agent', 'create_account', 'delete_account',
       'delivery_options', 'delivery_period', 'edit_account',
       'get_invoice', 'get_refund', 'newsletter_subscription',
       'payment_issue', 'place_order', 'recover_password',
       'registration_problems', 'review', 'set_up_shipping_address',
       'switch_account', 'track_order', 'track_refund'], dtype=object)

In [19]:
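# Preprocessing: build "Intent | Instruction" inputs and tokenize the responses as labels.

def preprocess_function(examples):
    inputs = [f"Intent: {intent} | Instruction: {instr}" for intent, instr in zip(examples['intent'], examples['instruction'])]
    targets = examples['response']
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding='max_length')
    labels = tokenizer(targets, max_length=1024, truncation=True, padding='max_length').input_ids
    # Replace padding token ids with -100 so the loss ignores padded label positions.
    pad_id = tokenizer.pad_token_id
    model_inputs['labels'] = [[tok if tok != pad_id else -100 for tok in seq] for seq in labels]
    return model_inputs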
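
To make the input format concrete, here is a small sketch of the string the model receives for one example. The helper `build_prompt` is hypothetical and simply mirrors the f-string in `preprocess_function`; the intent and instruction values are taken from the outputs shown earlier, and pairing them is an assumption made for illustration.

```python
def build_prompt(intent, instruction):
    """Format one example the same way preprocess_function does."""
    return f"Intent: {intent} | Instruction: {instruction}"

# Illustrative values copied from the dataset outputs above (pairing assumed).
example = build_prompt("cancel_order", "question about cancelling order {{Order Number}}")
print(example)
```

The same format would need to be used at inference time, since the model only sees inputs of this shape during finetuning.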

In [20]:
tokenized_dataset = ds.map(preprocess_function, batched=True)

Map:   0%|          | 0/26872 [00:00<?, ? examples/s]
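
Because `padding='max_length'` is applied at tokenization time, every example is stored at the full length of 1024 tokens. A rough sketch of how many integers that produces, assuming three stored columns (`input_ids`, `attention_mask`, `labels`); the helper name `padded_cells` is hypothetical.

```python
NUM_EXAMPLES = 26872   # rows in the train split shown above
MAX_LENGTH = 1024      # max_length used in preprocess_function

def padded_cells(num_examples, max_length, num_columns=3):
    """Count stored integers when every sequence is padded to max_length."""
    return num_examples * max_length * num_columns

per_column = NUM_EXAMPLES * MAX_LENGTH
total = padded_cells(NUM_EXAMPLES, MAX_LENGTH)
print(f"{per_column:,} values per column, {total:,} in total")
```

Padding per batch with a data collator, or choosing a smaller `max_length`, are common ways to reduce this cost, though neither is part of the original notebook.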

In [22]:
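training_args = TrainingArguments(
    output_dir='./bart_chatbot',
    eval_strategy='epoch',    # evaluating every epoch requires an eval_dataset on the Trainer
    save_strategy='epoch',    # must match eval_strategy when load_best_model_at_end=True
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=500,           # has no effect while save_strategy='epoch'
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=50,
    load_best_model_at_end=True
)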
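
As a back-of-the-envelope check on these settings, the sketch below estimates the number of optimizer steps. It assumes all 26,872 rows are used for training on a single device with no gradient accumulation; a held-out evaluation split would lower the count proportionally. The helper name `training_steps` is hypothetical.

```python
import math

def training_steps(num_examples, per_device_batch_size, epochs, num_devices=1):
    """Return (steps per epoch, total steps) for plain data-parallel training."""
    per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return per_epoch, per_epoch * epochs

steps_per_epoch, total_steps = training_steps(26872, per_device_batch_size=2, epochs=3)
log_entries = total_steps // 50  # logging_steps=50
print(steps_per_epoch, total_steps, log_entries)
```

With a batch size of 2, the run is long in terms of step count, which is worth keeping in mind when budgeting GPU time.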

In [26]:
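# The dataset ships with a train split only, so hold out part of it for the
# per-epoch evaluation (the 10% size and the seed are arbitrary choices).
split = tokenized_dataset['train'].train_test_split(test_size=0.1, seed=42)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split['train'],
    eval_dataset=split['test'],
    tokenizer=tokenizer,
)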

In [27]:
import wandb
wandb.login(key= secret_value_1)
wandb.init(project="Chatbot_retry", name="version1")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mfirojpaudel[0m ([33mfirojpaudel-madan-bhandari-memorial-college[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [28]:
trainer.train()



KeyboardInterrupt: 