## Multifunctional Fine Tuned Retrieval-Based Chatbot Leveraging RoBERTa , BART Transformers

### Problem Statement

##### As internet digital communication has expanded rapidly, there's been a rising need for smarter and more responsive chatbots to enhance human-computer interactions which is very much helpful with customer interaction and etc. Traditional rule-based chatbots often fail to understand the complexity and nuances of human language. As there is a need for a versatile and adaptive chatbot that can comprehend and generate contextually relevant responses, leveraging state-of-the-art natural language processing (NLP) techniques.

### Objective

##### The object of the project is to create two chatbots

##### First would be a fine tuned and retrieval-based chatbot

##### Second would be a Fined-Tuned chatbot

##### A sophisticated fine tuned and  retrieval-based chatbot would integrate RoBERTa , Sentence Transformer and advanced NLP methodologies. While the Fine Tuned chatbot will be fine tuned using Bart transformer. These chatbot's aims to enhance the quality and relevance of user interactions by employing sentence transformers for semantic understanding, cosine similarity for response retrieval, and BART for conditional text generation. It also checks the intent of the questions are whether positive or negative in nature using text blob which helps to give better experience to the user.

##### The chabot will be able to answer questions related to healthcare , finance and also be able to keep up with general conversations.

### Dataset

##### The dataset consist of questions and answers pairs. Which will be used for training and retrieval purposes. This dataset have entries of healthcare , finance and conversational questions and answers.

##### Import Libraries

In [6]:
import torch
import re
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer , InputExample, losses

##### Reading CSV File

In [7]:
chatDF = pd.read_csv("/content/merged_cleaned-data.csv")

In [8]:
chatDF.head()

Unnamed: 0,domain,query,response,intent
0,finance,Can I make changes to my loan repayment schedule?,Changes to your loan repayment schedule can be...,loan repayment adjustment
1,finance,How do I apply for a student loan?,You can apply for a student loan by visiting o...,student loan application
2,healthcare,What are the side effects of the COVID-19 vacc...,Common side effects of the COVID-19 vaccine in...,side effects inquiry
3,healthcare,How can I schedule an appointment with my doctor?,You can schedule an appointment by calling our...,appointment booking
4,healthcare,What should I do if I miss a dose of my medica...,"If you miss a dose, take it as soon as you rem...",medication inquiry


The head returns the whole DataFrame which consist of four columns "query" , "response" , "intent" and "domain".

.

##### Using "value_counts()" to count the occurences of unique values.

In [9]:
chatDF["domain"].value_counts()

Unnamed: 0_level_0,count
domain,Unnamed: 1_level_1
3_GHR_QA,3120
5_NIDDK_QA,745
2_GARD_QA,715
4_MPlus_Health_Topics_QA,672
6_NINDS_QA,606
healthcare,535
finance,431
7_SeniorHealth_QA,418
conversation,282
9_CDC_QA,99


The dataset consist of Three major domains healthcare , finance and conversation. Healthcare has the highest count, followed by finance, and then conversation.

.

##### Checking the shape of the dataset

In [10]:
chatDF.shape

(7764, 4)

Dataset has 1676 rows and 4 columns.

.

##### Cleaning the text data

In [11]:
def clean_text(text):
    if not isinstance(text, str):
        text = str(text)
    text = re.sub(r'\r\n', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[?.,@!#$%^&*()]','',text)
    text = re.sub(r'\d+','',text)
    text = text.strip().lower()
    return text

.

##### Extracting Response and Query Columns from chatDF DataFrame

In [12]:
responseDF = chatDF["response"]

##### Apply cleaning to response and query columns

In [13]:
responseDF = responseDF.apply(clean_text)

In [14]:
responseDF[0]

'changes to your loan repayment schedule can be made by contacting our loan department or via the online portal'

In [15]:
len(responseDF)

7764

.

##### Cleaning and Storing chatDF into newChatDF DataFrame

In [16]:
newChatDF = chatDF.applymap(clean_text)

  newChatDF = chatDF.applymap(clean_text)


In [17]:
newChatDF.head()

Unnamed: 0,domain,query,response,intent
0,finance,can i make changes to my loan repayment schedule,changes to your loan repayment schedule can be...,loan repayment adjustment
1,finance,how do i apply for a student loan,you can apply for a student loan by visiting o...,student loan application
2,healthcare,what are the side effects of the covid- vaccine,common side effects of the covid- vaccine incl...,side effects inquiry
3,healthcare,how can i schedule an appointment with my doctor,you can schedule an appointment by calling our...,appointment booking
4,healthcare,what should i do if i miss a dose of my medica...,if you miss a dose take it as soon as you reme...,medication inquiry


In [18]:
!pip install nlpaug


Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/410.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m409.6/410.5 kB[0m [31m17.0 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m410.5/410.5 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: nlpaug
Successfully installed nlpaug-1.1.11


##### Contextual Word Embeddings Augmentation with NLPaug

Importing nlpaug library

In [19]:
import nlpaug.augmenter.word as naw

##### Initialize the augmenter

In [20]:
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

The following layers were not sharded: bert.encoder.layer.*.attention.self.query.bias, bert.encoder.layer.*.attention.output.LayerNorm.weight, bert.encoder.layer.*.attention.output.LayerNorm.bias, cls.predictions.transform.LayerNorm.bias, cls.predictions.bias, bert.encoder.layer.*.output.dense.weight, bert.encoder.layer.*.attention.output.dense.bias, bert.encoder.layer.*.attention.self.value.bias, bert.embeddings.word_embeddings.weight, cls.predictions.transform.dense.bias, bert.encoder.layer.*.output.dense.bias, cls.predictions.decoder.bias, cls.predictions.decoder.weight, bert.embeddings.position_embeddings.weight, bert.encoder.layer.*.intermediate.dense.weight, bert.encoder.layer.*.attention.self.key.weight, bert.encoder.layer.*.attention.output.dense.weight, bert.encoder.layer.*.attention.self.value.weight, cls.predictions.transform.LayerNorm.weight, bert.encoder.layer.*.attention.self.key.bias, bert.embeddings.LayerNorm.bias, bert.encoder.layer.*.attention.self.query.weight, bert.

##### Function to augment a single sentence

In [21]:
def augment_text(text):
    return aug.augment(text)

##### Applying Text Augmentation to DataFrame

In [22]:
newChatDF["augmentedQuery"] = newChatDF["query"].apply(augment_text)

In [None]:
# Save to CSV
newChatDF.to_csv("augmented_queries.csv", index=False)


##### Converting the list rows into string

In [23]:
newChatDF['augmentedQuery'] = newChatDF['augmentedQuery'].apply(lambda x: ', '.join(x))

In [24]:
newChatDF["augmentedQuery"]

Unnamed: 0,augmentedQuery
0,can grant i make changes to address my loan ba...
1,how do i have apply for for a student on loan
2,so what are these the side effects of using th...
3,except how can i schedule an unexpected appoin...
4,but what should i always do if i miss having a...
...,...
7759,what remained is the current outlook today for...
7760,what extent are the possible treatments availa...
7761,in what is the dire outlook for hiv meningitis...
7762,here what remain is like the changing outlook ...


##### Concatenating "query" and "augmentedQuery" columns into "fullQuery" column

In [25]:
newChatDF["fullQuery"] = newChatDF['query'] + ' ' + newChatDF['augmentedQuery']

##### Checking the columns in newChatDF

In [26]:
newChatDF.columns

Index(['domain', 'query', 'response', 'intent', 'augmentedQuery', 'fullQuery'], dtype='object')

##### Dropping the unnecessary columns

In [27]:
newChatDF = newChatDF.drop(columns=['intent','domain','query','augmentedQuery'])

##### Checking the type of "newChatDF"

In [28]:
type(newChatDF)

### InputExample

##### "InputExample" is a specific instance of input data, typically consisting of sentences or text pairs, used to demonstrate and evaluate the transformer's ability to generate meaningful sentence embeddings.

##### Converting the "newChatDF" DataFrame to InputExample objects with a default label

In [29]:
default_label = 1.0
input_examples = newChatDF.apply(lambda row: InputExample(
    guid=str(row.name),
    texts=[row['fullQuery'], row['response']],
    label=default_label
), axis=1).tolist()

In the above code :

**guid** : it gives a unique value to each question and answer pair, helping to keep track of each example distinctly.

**texts** : it combines the "fullQuery" and "response" into a list of two separate text elements.

**label** : it assigns the number 1.0 to each row, indicating a default label, which can be used to signify something like a positive example.

Finally, the apply method processes each row, creating InputExample objects, and .tolist() converts the entire result into a list of these objects.


.

##### Printing Input Examples

In [30]:
for example in input_examples:
    print(example)

<InputExample> label: 1.0, texts: can i make changes to my loan repayment schedule can grant i make changes to address my loan balance repayment schedule; changes to your loan repayment schedule can be made by contacting our loan department or via the online portal
<InputExample> label: 1.0, texts: how do i apply for a student loan how do i have apply for for a student on loan; you can apply for a student loan by visiting our website and filling out the application form
<InputExample> label: 1.0, texts: what are the side effects of the covid- vaccine so what are these the side effects of using the new covid - vaccine; common side effects of the covid- vaccine include soreness at the injection site fever and fatigue
<InputExample> label: 1.0, texts: how can i schedule an appointment with my doctor except how can i schedule an unexpected appointment again with my doctor; you can schedule an appointment by calling our office or using our online portal
<InputExample> label: 1.0, texts: wha

.

### DataLoader

##### A `DataLoader` in machine learning efficiently manages and batches data for training and evaluation, ensuring optimized and streamlined data processing.

##### Creating a Shuffled DataLoader "train_dataloader"

In [31]:
from torch.utils.data import DataLoader
train_dataloader = DataLoader(input_examples, shuffle=True, batch_size=16)

.

### Sentence Transformer

The SentenceTransformer('stsb-roberta-base') model is used to convert sentences into 768-dimensional vectors. These vectors capture the semantic meaning of the sentences, making it useful for tasks like sentence similarity, clustering, and semantic search. Essentially, it helps in understanding and comparing the meaning of sentences in a numerical format.

##### Initialize Sentence Transformer Model


In [32]:
sentenceModel = SentenceTransformer('paraphrase-albert-small-v2')

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.84k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/827 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/46.7M [00:00<?, ?B/s]

The following layers were not sharded: pooler.bias, encoder.albert_layer_groups.*.albert_layers.*.full_layer_layer_norm.weight, embeddings.word_embeddings.weight, encoder.embedding_hidden_mapping_in.bias, embeddings.LayerNorm.weight, encoder.albert_layer_groups.*.albert_layers.*.ffn.weight, embeddings.token_type_embeddings.weight, encoder.embedding_hidden_mapping_in.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.key.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.dense.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.LayerNorm.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn_output.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.value.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn.bias, embeddings.LayerNorm.bias, embeddings.position_embeddings.weight, encoder.albert_layer_gro

tokenizer_config.json:   0%|          | 0.00/465 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/245 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [33]:
train_loss = losses.MultipleNegativesRankingLoss(sentenceModel)

##### Training Sentence Model with Multiple Epochs and Warmup Steps

In [34]:
num_epochs = 5
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)

sentenceModel.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps
)

  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mchhetriria1234[0m ([33mchhetriria1234-upes[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss
500,0.1947
1000,0.0941
1500,0.0701
2000,0.0556


##### Preparing , Cleaning and Encoding New Query

In [35]:
new_query = "I wanted to book an appointment"
new_query = clean_text(new_query)
new_query_embedding = sentenceModel.encode([new_query])

In [37]:
print(newChatDF.columns)


Index(['response', 'fullQuery'], dtype='object')


In [38]:
faq_embeddings = sentenceModel.encode(newChatDF["fullQuery"])  # Or whatever correct column name shows up


##### Importing cosine similarity

In [39]:
from sklearn.metrics.pairwise import cosine_similarity

##### Calculating Query Embedding Similarities

In [40]:
similarities = cosine_similarity(new_query_embedding, faq_embeddings)

In [41]:
print(similarities)

[[ 0.15910253  0.18550338  0.02729448 ... -0.05133099  0.0609915
  -0.12071879]]


##### Finding the index of the most similar query and Best Score

In [42]:
most_similar_query_index = np.argmax(similarities)
best_score = similarities[0][most_similar_query_index].item()

In [44]:
print(most_similar_query_index)

1237


### TextBlob

Importing TextBlob

In [45]:
from textblob import TextBlob

##### Classifying The Sentiment Of The Given Input

In [46]:
def classify_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        print(blob)
        return "Positive"
    elif sentiment < 0:
        return "Negative"
    else:
        return "Neutral"


The text blob here will be helping us to counter the negative questions that user might ask which is not present in the dataset.

##### Classifying Sentiment of New Query

In [47]:
sentiment = classify_sentiment(new_query)

In [48]:
sentiment

'Neutral'

##### Handle Negative Sentiment and Similar Query Response

In [50]:
if sentiment == "Negative":
    print("Please drop us a mail regarding your concerns.")
elif best_score >= 0.70:
    # Retrieve the most similar query and its response
    most_similar_query = newChatDF['fullQuery'][most_similar_query_index]
    response = responseDF[most_similar_query_index]
    print(f"Most Similar Query: {most_similar_query}")
    print(f"Response: {response}")
elif best_score >= 0.30:
    print("Sorry we are facing some technical difficulties , please write to us on contact@healthcarerocks.com")
elif best_score >= 0.20:
    print("Please write to us on our mail ID contact@healthcarerocks.com")
else:
    print("please write a mail regarding any queries related to our services")

Most Similar Query: i want to book an appointment and i want to book her an appointment
Response: you can schedule an appointment by calling our office or using our online portal


This helps us to handle the questions that model have not yet seen or not present in the dataset,

.

##### Best Score

In [51]:
best_score

0.8633400797843933

##### Saving the trained model into a respective directory

In [52]:
import os
from sentence_transformers import SentenceTransformer

# Use your trained model here
# For example, model = your trained model from model.fit(...)
model = SentenceTransformer('paraphrase-albert-small-v2')  # Replace with your trained model

# Set the save path
save_path = "/content/saved_models/my_sentence_model_paraphrase-albert-small-v2"

# Create directory
os.makedirs(save_path, exist_ok=True)

# Save the model
model.save(save_path)

# Confirm
print(f"✅ Model saved at: {save_path}")


The following layers were not sharded: pooler.bias, encoder.albert_layer_groups.*.albert_layers.*.full_layer_layer_norm.weight, embeddings.word_embeddings.weight, encoder.embedding_hidden_mapping_in.bias, embeddings.LayerNorm.weight, encoder.albert_layer_groups.*.albert_layers.*.ffn.weight, embeddings.token_type_embeddings.weight, encoder.embedding_hidden_mapping_in.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.key.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.dense.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.LayerNorm.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn_output.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.value.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn.bias, embeddings.LayerNorm.bias, embeddings.position_embeddings.weight, encoder.albert_layer_gro

✅ Model saved at: /content/saved_models/my_sentence_model_paraphrase-albert-small-v2


##### Saving Data To Pickle File

In [None]:
import pickle

with open('pickleFiles/faq_embeddings.pkl', 'wb') as f:
    pickle.dump(faq_embeddings, f)

with open('pickleFiles/responseDF.pkl',"wb") as f:
    pickle.dump(responseDF,f)

.

.

.

### Model Loading and Testing the model

In [54]:
from sentence_transformers import SentenceTransformer

# Load the model
output_dir = '/content/saved_models/my_sentence_model_paraphrase-albert-small-v2'
sentenceModel = SentenceTransformer(output_dir)



The following layers were not sharded: pooler.bias, encoder.albert_layer_groups.*.albert_layers.*.full_layer_layer_norm.weight, embeddings.word_embeddings.weight, encoder.embedding_hidden_mapping_in.bias, embeddings.LayerNorm.weight, encoder.albert_layer_groups.*.albert_layers.*.ffn.weight, embeddings.token_type_embeddings.weight, encoder.embedding_hidden_mapping_in.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.key.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.dense.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.LayerNorm.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn_output.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.value.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn.bias, embeddings.LayerNorm.bias, embeddings.position_embeddings.weight, encoder.albert_layer_gro

In [55]:
from textblob import TextBlob

new_query = "how to create a demat account "
new_query = clean_text(new_query)
new_query_embedding = sentenceModel.encode([new_query])


similarities = cosine_similarity(new_query_embedding, faq_embeddings)

most_similar_query_index = np.argmax(similarities)
best_score = similarities[0][most_similar_query_index].item()


def classify_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        print(blob)
        return "Positive"
    elif sentiment < 0:
        return "Negative"
    else:
        return "Neutral"

sentiment = classify_sentiment(new_query)


if sentiment == "Negative":
    print("Please drop us a mail regarding your concerns.")
elif best_score >= 0.70:
    # Retrieve the most similar query and its response
    response = responseDF[most_similar_query_index]
    print(f"Response: {response}")
elif best_score >= 0.30:
    print("Sorry we are facing some technical difficulties , please write to us on contact@healthcarerocks.com")
elif best_score >= 0.20:
    print("Please write to us on our mail ID contact@healthcarerocks.com")
else:
    print("please write a mail regarding any queries related to our services")


Response: you can download the account opening forms from the site and submit them at our branches offering demat services you can also visit the branches offering demat service for opening the demat account there is no fee for opening a dp account with bank however a nominal fee towards services is levied as per our standard rate card


In [56]:
best_score

0.7077676057815552

.

.

.

.

### Fine Tuned Chatbot Using BART Transformer

In [58]:
import pandas as pd

# Load your dataset (replace with your actual file path)
df = pd.read_csv("/content/merged_cleaned-data.csv")

# Now df is defined
print(df.head())


       domain                                              query  \
0     finance  Can I make changes to my loan repayment schedule?   
1     finance                 How do I apply for a student loan?   
2  healthcare  What are the side effects of the COVID-19 vacc...   
3  healthcare  How can I schedule an appointment with my doctor?   
4  healthcare  What should I do if I miss a dose of my medica...   

                                            response  \
0  Changes to your loan repayment schedule can be...   
1  You can apply for a student loan by visiting o...   
2  Common side effects of the COVID-19 vaccine in...   
3  You can schedule an appointment by calling our...   
4  If you miss a dose, take it as soon as you rem...   

                      intent  
0  loan repayment adjustment  
1   student loan application  
2       side effects inquiry  
3        appointment booking  
4         medication inquiry  


##### Creating a DataFrame

In [59]:
# Example: Use query and response columns from your existing DataFrame
newQueryDataset = df['query'].tolist()
responseDF = df['response'].tolist()


# Now this will work
import pandas as pd
newChatDF = pd.DataFrame({
    "query": newQueryDataset,
    "response": responseDF
})

print(newChatDF)


                                                  query  \
0     Can I make changes to my loan repayment schedule?   
1                    How do I apply for a student loan?   
2     What are the side effects of the COVID-19 vacc...   
3     How can I schedule an appointment with my doctor?   
4     What should I do if I miss a dose of my medica...   
...                                                 ...   
7759    What is the outlook for Paroxysmal Hemicrania ?   
7760  What are the treatments for Meningitis and Enc...   
7761  What is the outlook for Meningitis and Encepha...   
7762             What is the outlook for Dysautonomia ?   
7763  what research (or clinical trials) is being do...   

                                               response  
0     Changes to your loan repayment schedule can be...  
1     You can apply for a student loan by visiting o...  
2     Common side effects of the COVID-19 vaccine in...  
3     You can schedule an appointment by calling our...  
4

In [60]:
newChatDF.head()

Unnamed: 0,query,response
0,Can I make changes to my loan repayment schedule?,Changes to your loan repayment schedule can be...
1,How do I apply for a student loan?,You can apply for a student loan by visiting o...
2,What are the side effects of the COVID-19 vacc...,Common side effects of the COVID-19 vaccine in...
3,How can I schedule an appointment with my doctor?,You can schedule an appointment by calling our...
4,What should I do if I miss a dose of my medica...,"If you miss a dose, take it as soon as you rem..."


The dataframe has two columns query and response.

.

##### Checking the shape of the dataset

In [61]:
newChatDF.shape

(7764, 2)

The Dataset has 1681 rows and 2 columns.

.

##### Train Test Split

In [62]:
from sklearn.model_selection import train_test_split

##### Spliting the data into training and validation sets.

In [63]:
train_df, val_df = train_test_split(newChatDF, test_size=0.2, random_state=42)

##### Checking the shape of the dataframe

In [64]:
train_df.shape, val_df.shape

((6211, 2), (1553, 2))

After the split the training dataset train_df has 1344 rows and 2 columns where validation dataset val_df has 337 rows and 2 columns.

.

In [65]:
train_df.head()

Unnamed: 0,query,response
2638,What is (are) Weill-Marchesani syndrome ?,Weill-Marchesani syndrome is an inherited conn...
5840,What are the treatments for thrombotic thrombo...,These resources address the diagnosis or manag...
2569,What are the treatments for Periventricular he...,Treatment of epilepsy generally follows princi...
4210,What are the treatments for Costeff syndrome ?,These resources address the diagnosis or manag...
3612,Do you have information about Native Hawaiian ...,Summary : Every racial or ethnic group has spe...


In [66]:
val_df.head()

Unnamed: 0,query,response
6934,How many people are affected by triple X syndr...,"This condition occurs in about 1 in 1,000 newb..."
6157,How many people are affected by porphyria ?,"The exact prevalence of porphyria is unknown, ..."
6917,How many people are affected by sepiapterin re...,Sepiapterin reductase deficiency appears to be...
3685,What is (are) Adrenal Gland Disorders ?,The adrenal glands are small glands located on...
3406,What is (are) Drowning ?,People drown when they get too much water in t...


##### Reseting Index for Training and Validation Data

In [67]:
train_data = train_df.reset_index(drop=True)
validation_data = val_df.reset_index(drop=True)

In [68]:
train_data.head()

Unnamed: 0,query,response
0,What is (are) Weill-Marchesani syndrome ?,Weill-Marchesani syndrome is an inherited conn...
1,What are the treatments for thrombotic thrombo...,These resources address the diagnosis or manag...
2,What are the treatments for Periventricular he...,Treatment of epilepsy generally follows princi...
3,What are the treatments for Costeff syndrome ?,These resources address the diagnosis or manag...
4,Do you have information about Native Hawaiian ...,Summary : Every racial or ethnic group has spe...


In [69]:
train_data['query'][0]

'What is (are) Weill-Marchesani syndrome ?'

In [70]:
validation_data.head()

Unnamed: 0,query,response
0,How many people are affected by triple X syndr...,"This condition occurs in about 1 in 1,000 newb..."
1,How many people are affected by porphyria ?,"The exact prevalence of porphyria is unknown, ..."
2,How many people are affected by sepiapterin re...,Sepiapterin reductase deficiency appears to be...
3,What is (are) Adrenal Gland Disorders ?,The adrenal glands are small glands located on...
4,What is (are) Drowning ?,People drown when they get too much water in t...


### BART Transformers

##### Importing Bart libraries

In [71]:
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict

##### Initializing tokenizer

In [72]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

##### Preprocessing By Tokenizing And Creating Labels

In [73]:
def preprocess_function(examples):
    inputs = tokenizer(examples["query"], padding="max_length", truncation=True, max_length=512)
    targets = tokenizer(examples["response"], padding="max_length", truncation=True, max_length=512)
    inputs["labels"] = targets["input_ids"]
    return inputs

##### Creating Dataset Objects from Pandas DataFrames

In [74]:
train_dataset = Dataset.from_pandas(train_data)
validation_dataset = Dataset.from_pandas(validation_data)

##### Apply Preprocessing Function to Datasets

In [75]:
from datasets import load_dataset, DatasetDict
from sklearn.model_selection import train_test_split
import pandas as pd

# Load CSV into pandas
df = pd.read_csv("merged_cleaned-data.csv", names=["query", "response"])

# Split manually
train_df, val_df = train_test_split(df, test_size=0.2)

# Convert to Hugging Face Dataset
from datasets import Dataset
dataset = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'validation': Dataset.from_pandas(val_df)
})




##### Creating dataset dictionary

In [76]:
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset
})

##### Initializing BART Model for Conditional Generation

In [77]:
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

The following layers were not sharded: decoder.layers.*.encoder_attn.q_proj.bias, decoder.layers.*.encoder_attn.out_proj.weight, encoder.layers.*.fc*.bias, encoder.layers.*.self_attn.q_proj.bias, encoder.layers.*.self_attn.q_proj.weight, encoder.layers.*.self_attn.v_proj.bias, decoder.layers.*.encoder_attn_layer_norm.bias, decoder.layers.*.self_attn.k_proj.bias, decoder.layers.*.fc*.bias, decoder.layers.*.self_attn_layer_norm.weight, decoder.layers.*.encoder_attn.out_proj.bias, decoder.layers.*.encoder_attn.v_proj.bias, decoder.layers.*.self_attn.q_proj.weight, decoder.layers.*.encoder_attn.k_proj.weight, decoder.layernorm_embedding.bias, decoder.layers.*.self_attn.v_proj.weight, decoder.layers.*.self_attn.q_proj.bias, decoder.layernorm_embedding.weight, encoder.layers.*.self_attn_layer_norm.weight, encoder.layers.*.self_attn.out_proj.bias, encoder.embed_positions.weight, decoder.layers.*.encoder_attn_layer_norm.weight, decoder.layers.*.final_layer_norm.weight, decoder.layers.*.final_l

In [78]:
!pip install -U transformers

from transformers import TrainingArguments
print(TrainingArguments)



<class 'transformers.training_args.TrainingArguments'>


##### Training The Model

In [79]:
!pip install --upgrade transformers



In [80]:
import transformers
print(transformers.__version__)

4.52.4


In [81]:
print(transformers.__file__)

/usr/local/lib/python3.11/dist-packages/transformers/__init__.py


In [83]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir='./logs'
)


##### Saving the trained model into chatbot_model directory

In [84]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Replace with your fine-tuned model and tokenizer
model.save_pretrained("/Users/riya/Documents/my_finetuned_model")
tokenizer.save_pretrained("/Users/riya/Documents/my_finetuned_model")



('/Users/riya/Documents/my_finetuned_model/tokenizer_config.json',
 '/Users/riya/Documents/my_finetuned_model/special_tokens_map.json',
 '/Users/riya/Documents/my_finetuned_model/vocab.json',
 '/Users/riya/Documents/my_finetuned_model/merges.txt',
 '/Users/riya/Documents/my_finetuned_model/added_tokens.json')

In [105]:
model.save_pretrained("./my_finetuned_model")
tokenizer.save_pretrained("./my_finetuned_model")


('./my_finetuned_model/tokenizer_config.json',
 './my_finetuned_model/special_tokens_map.json',
 './my_finetuned_model/vocab.json',
 './my_finetuned_model/merges.txt',
 './my_finetuned_model/added_tokens.json')

In [106]:
!zip -r my_finetuned_model.zip my_finetuned_model


  adding: my_finetuned_model/ (stored 0%)
  adding: my_finetuned_model/vocab.json (deflated 68%)
  adding: my_finetuned_model/special_tokens_map.json (deflated 85%)
  adding: my_finetuned_model/merges.txt (deflated 53%)
  adding: my_finetuned_model/1_Pooling/ (stored 0%)
  adding: my_finetuned_model/1_Pooling/config.json (deflated 57%)
  adding: my_finetuned_model/config_sentence_transformers.json (deflated 34%)
  adding: my_finetuned_model/tokenizer_config.json (deflated 75%)
  adding: my_finetuned_model/spiece.model (deflated 49%)
  adding: my_finetuned_model/config.json (deflated 53%)
  adding: my_finetuned_model/sentence_bert_config.json (deflated 4%)
  adding: my_finetuned_model/tokenizer.json (deflated 75%)
  adding: my_finetuned_model/README.md (deflated 59%)
  adding: my_finetuned_model/modules.json (deflated 53%)
  adding: my_finetuned_model/model.safetensors (deflated 7%)


In [107]:
from google.colab import files
files.download("my_finetuned_model.zip")


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

.

.

.

##### Checking the model by passing new inputs

In [89]:
from sentence_transformers import SentenceTransformer

# Load your sentence transformer model
model = SentenceTransformer('/content/saved_models/my_sentence_model_paraphrase-albert-small-v2')

# Encode input sentence to get its embedding
query = "How to book an appointment"
embedding = model.encode(query)

print("Embedding shape:", embedding.shape)
print("Embedding:", embedding)



The following layers were not sharded: pooler.bias, encoder.albert_layer_groups.*.albert_layers.*.full_layer_layer_norm.weight, embeddings.word_embeddings.weight, encoder.embedding_hidden_mapping_in.bias, embeddings.LayerNorm.weight, encoder.albert_layer_groups.*.albert_layers.*.ffn.weight, embeddings.token_type_embeddings.weight, encoder.embedding_hidden_mapping_in.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.key.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.dense.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.LayerNorm.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn_output.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.value.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn.bias, embeddings.LayerNorm.bias, embeddings.position_embeddings.weight, encoder.albert_layer_gro

Embedding shape: (768,)
Embedding: [-2.20584348e-01  2.10727692e-01 -4.37153757e-01 -1.13202333e-02
  5.46704829e-01  1.16415575e-01 -2.95240313e-01  2.80090600e-01
 -4.21983153e-01 -7.00509250e-01 -6.34299591e-02  2.16126278e-01
  7.91739583e-01  1.03286520e-01 -9.06628445e-02  1.55443037e-02
 -1.06687081e+00  2.36450106e-01 -1.55542985e-01  2.28700653e-01
 -7.06777036e-01  3.06286484e-01  1.16605572e-01  6.95017219e-01
 -4.82271574e-02  3.89206350e-01  2.46990044e-02  1.27173173e+00
  9.91023481e-01 -7.23532856e-01  4.76383656e-01 -2.60935009e-01
  7.02321231e-02  6.10356927e-01 -4.58434150e-02 -7.77771249e-02
  6.72881126e-01 -2.51517266e-01  8.50442886e-01  1.00818527e+00
 -5.47738791e-01 -4.53294039e-01  3.66548039e-02 -1.35643959e-01
 -2.51419961e-01 -4.46371943e-01  5.19800901e-01  9.85462844e-01
  4.58756149e-01 -5.58126986e-01 -6.05804861e-01 -6.98197663e-01
 -2.48708397e-01 -1.55165240e-01 -2.56889999e-01 -2.13864580e-01
 -2.77590334e-01  4.06282157e-01 -9.65276361e-02  1.046

##### Evaluating the model

In [95]:
!pip install datasets




In [98]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset

# Load your CSV file
chatbot_df = pd.read_csv("/content/merged_cleaned-data.csv")  # Change path if needed

# Split into train and eval sets
train_df, eval_df = train_test_split(chatbot_df[['query', 'response']], test_size=0.2, random_state=42)

# Convert to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)



In [99]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Step 1: Split data
train_df, eval_df = train_test_split(chatbot_df[['query', 'response']], test_size=0.2, random_state=42)

# Step 2: Convert to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)


In [101]:
from transformers import BartTokenizer

# Load the tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# Tokenization function
def tokenize_chat(example):
    # Tokenize the user query (input)
    model_inputs = tokenizer(example["query"], padding="max_length", truncation=True, max_length=128)

    # Tokenize the response (target) using target tokenizer context
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(example["response"], padding="max_length", truncation=True, max_length=128)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply tokenization to both datasets
train_dataset = train_dataset.map(tokenize_chat)
eval_dataset = eval_dataset.map(tokenize_chat)


Map:   0%|          | 0/6211 [00:00<?, ? examples/s]



Map:   0%|          | 0/1553 [00:00<?, ? examples/s]

In [103]:
from transformers import TrainingArguments, Trainer
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir='./logs'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)


  trainer = Trainer(


In [109]:
from sentence_transformers import SentenceTransformer

model_path = "/content/saved_models/my_sentence_model_paraphrase-albert-small-v2"
model = SentenceTransformer(model_path)

query = "How to reset password?"
embedding = model.encode(query)
print(embedding)



The following layers were not sharded: pooler.bias, encoder.albert_layer_groups.*.albert_layers.*.full_layer_layer_norm.weight, embeddings.word_embeddings.weight, encoder.embedding_hidden_mapping_in.bias, embeddings.LayerNorm.weight, encoder.albert_layer_groups.*.albert_layers.*.ffn.weight, embeddings.token_type_embeddings.weight, encoder.embedding_hidden_mapping_in.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.key.weight, encoder.albert_layer_groups.*.albert_layers.*.attention.query.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.dense.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.LayerNorm.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn_output.bias, encoder.albert_layer_groups.*.albert_layers.*.attention.value.bias, encoder.albert_layer_groups.*.albert_layers.*.ffn.bias, embeddings.LayerNorm.bias, embeddings.position_embeddings.weight, encoder.albert_layer_gro

[ 3.65055621e-01 -6.22617185e-01  4.18614864e-01 -9.42048132e-01
  1.68703347e-01  8.04268301e-01 -1.63048461e-01 -1.09796715e+00
 -9.80310500e-01  2.55452126e-01  3.30473483e-02  4.05555308e-01
 -1.40103981e-01 -5.64769864e-01  4.64763314e-01  3.49055007e-02
 -2.82224834e-01 -5.30114055e-01  1.22159600e+00  6.57735407e-01
 -3.98407519e-01 -3.19300950e-01  7.05380857e-01  4.36456911e-02
 -4.36956942e-01  7.27911472e-01  5.61081946e-01  5.35369776e-02
 -4.02573377e-01 -7.81326145e-02  2.97293454e-01 -1.09955512e-01
 -4.45774972e-01 -4.00199682e-01  1.22067821e+00 -2.51305431e-01
  1.10627878e+00 -2.40281627e-01  3.67958486e-01  5.29988348e-01
  9.89499211e-01  2.96432793e-01  6.50375932e-02  8.83737862e-01
 -1.21455824e+00 -6.96462035e-01  1.54224560e-01 -1.16182363e+00
 -9.36072886e-01 -1.14178814e-01 -1.03974986e+00 -2.33280063e-01
 -1.09972930e+00  1.27436265e-01 -4.81440037e-01 -8.35333914e-02
  3.06164533e-01 -9.13877904e-01  6.46743655e-01 -1.04089391e+00
 -1.34256348e-01 -7.55369