<a href="https://colab.research.google.com/github/Ochan-LOKIDORMOI/Ml_summative_assignment_chatbot/blob/main/Summative_Assignment_Chat_Bot_Ochan_LOKIDORMOI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Dataset Overview**

The dataset consists of ***3,000*** entries designed for training a domain-specific chatbot, covering queries from the ***healthcare*** and ***finance*** sectors.

It is structured into four key columns:

- **Query:**  Represents user input, capturing real-world questions or requests.
- **Response:**  Provides the chatbot's predefined answer to each query.
- **Intent:** Categorizes the query’s purpose, ensuring accurate chatbot responses.
- **Domain:** Specifies the industry context, such as healthcare (e.g., medical inquiries, appointment booking) or finance (e.g., account management, loan inquiries).

In [1]:
#mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Importing Libraries**

In [2]:
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments

# **Loading the Data**

In [3]:
# Load dataset (example, adjust path as needed)
df = pd.read_csv("/content/drive/MyDrive/domain_specific_chatbot_data.csv")

# Display a sample
df.head()

Unnamed: 0,query,response,intent,domain
0,What are the side effects of the COVID-19 vacc...,Common side effects of the COVID-19 vaccine in...,side effects inquiry,healthcare
1,How can I schedule an appointment with my doctor?,You can schedule an appointment by calling our...,appointment booking,healthcare
2,What should I do if I miss a dose of my medica...,"If you miss a dose, take it as soon as you rem...",medication inquiry,healthcare
3,How can I check my account balance?,You can check your balance by logging into you...,balance inquiry,finance
4,What is the interest rate for a personal loan?,The current interest rate for personal loans i...,loan inquiry,finance


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   query     3000 non-null   object
 1   response  3000 non-null   object
 2   intent    3000 non-null   object
 3   domain    3000 non-null   object
dtypes: object(4)
memory usage: 93.9+ KB


## **Checking for missing values in the dataset**

In [6]:
#checking for NaN
df.isnull().sum()

Unnamed: 0,0
query,0
response,0
intent,0
domain,0


## **Splitting Data into Training and Validation Sets**

- The code splits the original DataFrame `df` into two new DataFrames, `train_df` and `val_df`, with 80% of the rows in `train_df` and 20% in `val_df`.

In [7]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

train_df.shape, val_df.shape

((2400, 4), (600, 4))
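
The idea behind the 80/20 split can be sketched in plain Python. This is a simplified stand-in for `train_test_split`, not its actual implementation: the helper name `shuffled_split` is hypothetical, and the integer list merely stands in for the 3,000 DataFrame rows.

```python
import random

def shuffled_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then slice off the validation part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split reproducible
    n_val = round(len(shuffled) * test_size)   # number of validation rows
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)

rows = list(range(3000))  # stand-in for the 3,000 rows of the dataset
train_rows, val_rows = shuffled_split(rows)
print(len(train_rows), len(val_rows))  # 2400 600
```

Fixing the seed plays the same role as `random_state=42` in the notebook: rerunning the cell produces the same partition.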

## **Resetting Indices in Pandas DataFrames**

- This step resets the index of two pandas DataFrames, `train_df` and `val_df`, and stores the results as `train_data` and `validation_data`.

- These DataFrames were previously created by splitting the original DataFrame, `df`, into training and validation sets using the `train_test_split` function.

In [8]:
train_data = train_df.reset_index(drop=True)
validation_data = val_df.reset_index(drop=True)

validation_data

Unnamed: 0,query,response,intent,domain
0,How can I schedule an appointment with my doctor?,You can schedule an appointment by calling our...,appointment booking,healthcare
1,What are the side effects of the COVID-19 vacc...,Common side effects of the COVID-19 vaccine in...,side effects inquiry,healthcare
2,How do I update my contact details on my account?,"To update your contact details, log into your ...",contact update,finance
3,How can I schedule an appointment with my doctor?,You can schedule an appointment by calling our...,appointment booking,healthcare
4,"I lost my credit card, what should I do?",Please contact our customer service immediatel...,lost card reporting,finance
...,...,...,...,...
595,What is the interest rate for a personal loan?,The current interest rate for personal loans i...,loan inquiry,finance
596,How do I update my contact details on my account?,"To update your contact details, log into your ...",contact update,finance
597,How do I apply for a student loan?,You can apply for a student loan by visiting o...,student loan application,finance
598,What are the symptoms of flu?,"Flu symptoms include fever, cough, sore throat...",flu symptoms inquiry,healthcare
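
The validation preview above lists the same query at rows 0 and 3, which suggests the dataset repeats a small set of question/answer pairs. If so, validation queries may also occur in the training split, and validation metrics would then reflect memorization more than generalization. A minimal sketch for measuring that overlap follows; the helper name `query_overlap` and the toy lists are illustrative assumptions, not part of the notebook.

```python
def query_overlap(train_queries, val_queries):
    """Return the fraction of validation queries that also occur in the training set."""
    seen = set(train_queries)
    if len(val_queries) == 0:
        return 0.0
    hits = sum(1 for query in val_queries if query in seen)
    return hits / len(val_queries)

# Toy, made-up lists for illustration only
toy_train = ["how can i check my account balance?", "what are the symptoms of flu?"]
toy_val = ["what are the symptoms of flu?", "how do i apply for a student loan?"]
print(query_overlap(toy_train, toy_val))  # 0.5
```

With the notebook's data, the same helper could be called as `query_overlap(train_data["query"].tolist(), validation_data["query"].tolist())`.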


## **Text Cleaning Function**

- This part defines a function called `clean_text` which is designed to preprocess text data by removing unwanted characters and formatting.

In [9]:
# Cleaning the text by removing unwanted characters
import re

def clean_text(text):
    text = re.sub(r'\r\n', ' ', text)  # Replace Windows-style line breaks with a space
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space
    text = re.sub(r'<.*?>', '', text)  # Remove any HTML/XML tags
    text = text.strip().lower()  # Strip leading/trailing whitespace and convert to lower case
    return text
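
To illustrate what the function does, the sketch below repeats `clean_text` as defined above and applies it to a made-up string. It also shows one possible variant, `clean_text_tags_first` (a hypothetical name, not part of the notebook), that strips tags before collapsing whitespace, because removing a tag that sits between two spaces otherwise leaves a double space behind.

```python
import re

def clean_text(text):
    # Same steps as the notebook's function
    text = re.sub(r'\r\n', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'<.*?>', '', text)
    return text.strip().lower()

def clean_text_tags_first(text):
    # Variant: replace tags with a space first, then collapse whitespace.
    # Note this also inserts a space where a tag joined two words directly.
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

sample = "Hello <b>World</b>\r\n  How are   YOU?"
print(clean_text(sample))                       # hello world how are you?
print(repr(clean_text("a <br> b")))             # 'a  b' (double space remains)
print(repr(clean_text_tags_first("a <br> b")))  # 'a b'
```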



## **Applying the Text Cleaning Function**
- After defining the `clean_text` function, the code applies it to the `query` and `response` columns of the `train_data` and `validation_data` DataFrames.

In [10]:
# Apply cleaning to the query and response columns
train_data['query'] = train_data['query'].apply(clean_text)
train_data['response'] = train_data['response'].apply(clean_text)

validation_data['query'] = validation_data['query'].apply(clean_text)
validation_data['response'] = validation_data['response'].apply(clean_text)


# Display a sample after cleaning
train_data

Unnamed: 0,query,response,intent,domain
0,what should i do if i miss a dose of my medica...,"if you miss a dose, take it as soon as you rem...",medication inquiry,healthcare
1,what are the side effects of the covid-19 vacc...,common side effects of the covid-19 vaccine in...,side effects inquiry,healthcare
2,what are the symptoms of flu?,"flu symptoms include fever, cough, sore throat...",flu symptoms inquiry,healthcare
3,how do i update my contact details on my account?,"to update your contact details, log into your ...",contact update,finance
4,what are the side effects of the covid-19 vacc...,common side effects of the covid-19 vaccine in...,side effects inquiry,healthcare
...,...,...,...,...
2395,can i make changes to my loan repayment schedule?,changes to your loan repayment schedule can be...,loan repayment adjustment,finance
2396,"i lost my credit card, what should i do?",please contact our customer service immediatel...,lost card reporting,finance
2397,what are the side effects of the covid-19 vacc...,common side effects of the covid-19 vaccine in...,side effects inquiry,healthcare
2398,what is the interest rate for a personal loan?,the current interest rate for personal loans i...,loan inquiry,finance


## **Loading the T5 Tokenizer**

In [11]:
# Use the tokenizer that matches the t5-small model fine-tuned below
tokenizer = T5Tokenizer.from_pretrained("t5-small")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


## **Preprocessing Function**
- This function converts text into numbers so the T5 model can process it.

- Since the model doesn’t read raw text, this step converts the words into token IDs (numerical representations), padded or truncated to 250 tokens.
- It also sets up the data for training by using the tokenized response as the labels.

In [12]:
# Preprocessing function for tokenization
def preprocess_function(examples):
    # Tokenize the query (input) and the response (target)
    inputs = tokenizer(examples["query"], padding="max_length", truncation=True, max_length=250)
    targets = tokenizer(examples["response"], padding="max_length", truncation=True, max_length=250)
    inputs["labels"] = targets["input_ids"]
    return inputs
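
As written, `labels` contains the padding token ids, so the loss is also computed on padded positions, which can make the reported loss look smaller than the loss on real tokens alone. A common refinement in Hugging Face sequence-to-sequence fine-tuning, not used in this notebook, is to replace padding ids in the labels with -100, the index the loss function ignores. A minimal sketch, assuming T5's pad token id of 0; `mask_padding` is a hypothetical helper and the label ids are made up:

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Transformers
PAD_TOKEN_ID = 0     # T5's pad token id

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

labels = [125, 225, 58, 1, 0, 0, 0]  # made-up token ids followed by padding
print(mask_padding(labels))  # [125, 225, 58, 1, -100, -100, -100]
```

Inside `preprocess_function`, this would amount to assigning `mask_padding(targets["input_ids"])` to `inputs["labels"]`.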

## **Applying the preprocessing**
- This code transforms the text in `train_data` and `validation_data` into the tokenized numerical format required for training and evaluating the T5 model.
- The `preprocess_function` handles the tokenization and the creation of labels for the model.



In [13]:
# Apply the preprocessing
train_dataset = train_data.apply(preprocess_function, axis=1)
val_dataset = validation_data.apply(preprocess_function, axis=1)

In [14]:
train_data['response'][0]

"if you miss a dose, take it as soon as you remember unless it's almost time for your next dose. if you’re unsure, contact your healthcare provider."

In [15]:
train_dataset[0]

{'input_ids': [125, 225, 3, 23, 103, 3, 99, 3, 23, 3041, 3, 9, 6742, 13, 82, 7757, 58, 1, 0, 0, 0, ...], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...], ...
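
The padded `input_ids` and `attention_mask` shown above follow from `padding="max_length"` and `truncation=True`. The sketch below mimics that behaviour in plain Python for the 18 token ids of the first training query. It is a simplification (the real tokenizer, for example, keeps the end-of-sequence token when truncating), and `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(token_ids, max_length=250, pad_token_id=0):
    """Cut the sequence to max_length, then pad it and build the attention mask."""
    ids = list(token_ids)[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# Token ids of the first training query, copied from the output above
query_ids = [125, 225, 3, 23, 103, 3, 99, 3, 23, 3041, 3, 9, 6742, 13, 82, 7757, 58, 1]
encoded = pad_or_truncate(query_ids)
print(len(encoded["input_ids"]), sum(encoded["attention_mask"]))  # 250 18
```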

## **Training Parameters**
- This code uses the **TrainingArguments** class from the transformers library to define how the model will be trained.
- It configures various aspects of the training process.

In [16]:
# Model
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",          # output directory for checkpoints
    num_train_epochs=6,              # number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir="./logs",            # directory for storing logs
    logging_steps=50,                # how often to log training info
    save_steps=500,                  # how often to save a model checkpoint
    eval_steps=50,                   # only used when the strategy is "steps"; ignored here
    evaluation_strategy="epoch",     # run evaluation at the end of every epoch
)
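
These settings determine how many optimizer steps the run takes. A short worked calculation, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook):

```python
import math

train_examples = 2400  # rows in the training split
batch_size = 8         # per_device_train_batch_size
epochs = 6             # num_train_epochs
warmup_steps = 500     # warmup_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, round(warmup_fraction, 3))  # 300 1800 0.278
```

The total of 1,800 steps agrees with the `global_step=1800` reported after training, and it means roughly the first 28% of training is spent in learning-rate warmup.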




## **Initializing the Trainer**

In [17]:
# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

## **Training the Model**

In [None]:
# Train the model
trainer.train()



wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: denmarkeddy (denmarkeddy-alu) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,0.2555,0.180979
2,0.0258,0.006076
3,0.0077,0.000688
4,0.0038,0.000211
5,0.0029,0.000118
6,0.0026,0.000102


TrainOutput(global_step=1800, training_loss=0.7738839065697458, metrics={'train_runtime': 637.9099, 'train_samples_per_second': 22.574, 'train_steps_per_second': 2.822, 'total_flos': 951622041600000.0, 'train_loss': 0.7738839065697458, 'epoch': 6.0})

## **Saving the Model**

In [None]:
# Save the fine-tuned model and tokenizer to Google Drive
model.save_pretrained("/content/drive/MyDrive/chatbot_model")
tokenizer.save_pretrained("/content/drive/MyDrive/chatbot_model")

('/content/drive/MyDrive/chatbot_model/tokenizer_config.json',
 '/content/drive/MyDrive/chatbot_model/special_tokens_map.json',
 '/content/drive/MyDrive/chatbot_model/spiece.model',
 '/content/drive/MyDrive/chatbot_model/added_tokens.json')

## **Loading the model for Testing**

In [18]:
#load the model
model = T5ForConditionalGeneration.from_pretrained("/content/drive/MyDrive/chatbot_model")
tokenizer = T5Tokenizer.from_pretrained("/content/drive/MyDrive/chatbot_model")

## **Generating Responses with the Chatbot**

In [None]:
device = model.device


def chatbot(query):
    # Apply the same cleaning that was used on the training data
    query = clean_text(query)
    encoded = tokenizer(query, return_tensors="pt", max_length=250, truncation=True)

    # Move the input tensors to the same device as the model
    inputs = {key: value.to(device) for key, value in encoded.items()}

    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=250,
        num_beams=5,
        early_stopping=True
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    response = chatbot(user_input)
    print("Chatbot:", response)

You: what are the symptoms of flu?
Chatbot: flu symptoms include fever, cough, sore throat, body aches and fatigue.
You: what is the interest rate for a personal loan?	
Chatbot: the current interest rate for personal loans is 6.5%, but it may vary based on your credit score.
You: how do i update my contact details on my account?	
Chatbot: to update your contact details, log into your account and go to the 'profile' section.
You: what are the side effects of the covid-19 vaccine
Chatbot: common side effects of the covid-19 vaccine include soreness at the injection site, fever, and fatigue.
You: i lost my credit card, what should i do?
Chatbot: please contact our customer service immediately to report your lost card and request a replacement.
You: exit


## **Evaluation Metrics Using BLEU and ROUGE**

This part defines a function `evaluate_model` that calculates the performance of the trained T5 model using two common metrics:

***BLEU*** (Bilingual Evaluation Understudy) and ***ROUGE*** (Recall-Oriented Understudy for Gisting Evaluation).

This is a standard practice when evaluating text generation models, especially those used in tasks like chatbots or summarization.
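
To make the idea of overlap-based scoring concrete, the sketch below computes a unigram-overlap F1 score in plain Python. It is a simplified illustration of what ROUGE-1 measures, not the exact implementation in the `rouge` package; `unigram_f1` is a hypothetical helper, and the two strings are adapted from the flu-symptoms example.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 of unigram overlap between a generated text and a reference text."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    overlap = sum((cand_counts & ref_counts).values())  # clipped matching unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())  # share of generated words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)

reference = "flu symptoms include fever cough"
candidate = "flu symptoms include fever"
print(round(unigram_f1(candidate, reference), 3))  # 0.889
```

Here precision is 4/4 and recall is 4/5, so a response that is correct but cut short is penalized through recall.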

In [19]:
!pip install Rouge


Collecting Rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: Rouge
Successfully installed Rouge-1.0.1


In [20]:
from google.colab import drive
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

# Evaluation function
def evaluate_model(model, tokenizer, df):
    rouge = Rouge()
    smoothie = SmoothingFunction().method4
    bleu_scores = []
    rouge_scores = []

    for index, row in df.iterrows():
        input_text = row['query']
        reference_summary = row['response']

        # Generate with the model's default settings; long responses may be truncated
        input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
        generated_ids = model.generate(input_ids)
        generated_summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

        # Calculate BLEU score
        bleu = sentence_bleu([reference_summary.split()], generated_summary.split(), smoothing_function=smoothie)
        bleu_scores.append(bleu)

        # Calculate ROUGE scores
        try:
            scores = rouge.get_scores(generated_summary, reference_summary)
            rouge_scores.append(scores[0])  # scores for this hypothesis/reference pair
        except Exception as err:
            # For example, an empty generated string; record zeros so the averages stay aligned
            print(f"Error calculating ROUGE for index {index}: {err}. Skipping...")
            rouge_scores.append({'rouge-1': {'f': 0, 'p': 0, 'r': 0},
                                 'rouge-2': {'f': 0, 'p': 0, 'r': 0},
                                 'rouge-l': {'f': 0, 'p': 0, 'r': 0}})


    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    avg_rouge1_f = sum(score['rouge-1']['f'] for score in rouge_scores) / len(rouge_scores)
    avg_rouge2_f = sum(score['rouge-2']['f'] for score in rouge_scores) / len(rouge_scores)
    avg_rougeL_f = sum(score['rouge-l']['f'] for score in rouge_scores) / len(rouge_scores)


    print(f"Average BLEU score: {avg_bleu}")
    print(f"Average ROUGE-1 F1 score: {avg_rouge1_f}")
    print(f"Average ROUGE-2 F1 score: {avg_rouge2_f}")
    print(f"Average ROUGE-L F1 score: {avg_rougeL_f}")

# Evaluate the model on the validation set
import nltk
nltk.download('punkt')
evaluate_model(model, tokenizer, validation_data)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Average BLEU score: 0.9133675743012869
Average ROUGE-1 F1 score: 0.9578712949251741
Average ROUGE-2 F1 score: 0.955746026806324
Average ROUGE-L F1 score: 0.9578712949251741


## **The End**

**By:** ***Ochan Denmark LOKIDORMOI***