## Multifunctional Fine-Tuned Retrieval-Based Chatbot Leveraging RoBERTa and BART Transformers

### Problem Statement

##### As digital communication over the internet has expanded rapidly, the need has grown for smarter, more responsive chatbots that improve human-computer interaction, particularly in customer-facing settings. Traditional rule-based chatbots often fail to capture the complexity and nuance of human language. This creates a need for a versatile, adaptive chatbot that can comprehend queries and produce contextually relevant responses using state-of-the-art natural language processing (NLP) techniques.

### Objective

##### The objective of the project is to create two chatbots:

##### The first is a fine-tuned, retrieval-based chatbot.

##### The second is a fine-tuned generative chatbot.

##### The fine-tuned, retrieval-based chatbot integrates RoBERTa, a Sentence Transformer and advanced NLP methods, while the second chatbot is fine-tuned using the BART transformer. These chatbots aim to improve the quality and relevance of user interactions by using sentence transformers for semantic understanding, cosine similarity for response retrieval, and BART for conditional text generation. TextBlob is also used to check whether a question's sentiment is positive or negative, which helps give the user a better experience.

##### The chatbot should be able to answer questions about plant disease cures and keep up with general conversation.

### Dataset

##### The dataset consists of question-and-answer pairs, which are used for training and retrieval. It contains entries on plant disease cures as well as conversational questions and answers.

##### Import Libraries

In [None]:
import torch
import re
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer , InputExample, losses

In [None]:
import pandas as pd

# 1. Load your dataset (use pd.read_excel for .xlsx files)
df = pd.read_excel("/content/Plant_Disease_chatbot_data.xlsx")

# 2. Rename columns to match the code's expectation
df = df.rename(columns={"Question": "query", "Answer": "response"})

# 3. Add dummy columns (required by the notebook's 'drop' and 'value_counts' functions)
df["domain"] = "plant_disease"
df["intent"] = "faq"

# 4. Save the file with the name your notebook expects
df.to_csv("chatbot_data.csv", index=False)

# Check the result
print(df.head())

                                               query  \
0  What is the first step in curing a spider mite...   
1  How does high-pressure water spray help cure s...   
2                Why is pruning useful in mite cure?   
3  Can dust reduction be considered a cure for sp...   
4  How does improving air circulation cure mite o...   

                                            response         domain intent  
0  Immediately remove and destroy heavily infeste...  plant_disease    faq  
1  It physically dislodges mites from leaves; rep...  plant_disease    faq  
2  It eliminates infestation hotspots and improve...  plant_disease    faq  
3  Yes, reducing dust lowers stress on plants and...  plant_disease    faq  
4  Better airflow reduces hot, dry microclimates ...  plant_disease    faq  


##### Reading CSV File

In [None]:
chatDF = pd.read_csv("chatbot_data.csv")

In [None]:
chatDF.head()

Unnamed: 0,query,response,domain,intent
0,What is the first step in curing a spider mite...,Immediately remove and destroy heavily infeste...,plant_disease,faq
1,How does high-pressure water spray help cure s...,It physically dislodges mites from leaves; rep...,plant_disease,faq
2,Why is pruning useful in mite cure?,It eliminates infestation hotspots and improve...,plant_disease,faq
3,Can dust reduction be considered a cure for sp...,"Yes, reducing dust lowers stress on plants and...",plant_disease,faq
4,How does improving air circulation cure mite o...,"Better airflow reduces hot, dry microclimates ...",plant_disease,faq


`head()` returns the first five rows of the DataFrame, which has four columns: "query", "response", "domain" and "intent".

##### Using "value_counts()" to count the occurrences of unique values.

In [None]:
chatDF["domain"].value_counts()

Unnamed: 0_level_0,count
domain,Unnamed: 1_level_1
plant_disease,6189


The dataset contains a single domain, plant_disease, with 6189 entries.

##### Checking the shape of the dataset

In [None]:
chatDF.shape

(6189, 4)

The dataset has 6189 rows and 4 columns.

##### Cleaning the text data

In [None]:
def clean_text(text):
    if not isinstance(text, str):
        text = str(text)
    text = re.sub(r'\r\n', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[?.,@!#$%^&*()]','',text)
    text = re.sub(r'\d+','',text)
    text = text.strip().lower()
    return text
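
To see what these cleaning steps do in practice, here is a small standalone sketch. It repeats the same regular expressions so it can run on its own, and the sample sentence is a made-up input, not a row from the dataset. Note that the digit rule also strips numbers such as treatment intervals, which may or may not be desirable for answers that contain dosages.

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text, repeated so the demo runs standalone.
    if not isinstance(text, str):
        text = str(text)
    text = re.sub(r'\r\n', ' ', text)            # Windows line breaks -> space
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    text = re.sub(r'<.*?>', '', text)            # drop HTML-like tags
    text = re.sub(r'[?.,@!#$%^&*()]', '', text)  # drop listed punctuation
    text = re.sub(r'\d+', '', text)              # drop digits
    return text.strip().lower()

# Hypothetical input for illustration only
sample = "Repeat every 3\u20135 days!\r\n<b>Best</b> results."
cleaned = clean_text(sample)
print(cleaned)
```

Because the digits are removed after whitespace has been collapsed, an input like "3 to 5 days" would keep two consecutive spaces where the numbers used to be.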


##### Extracting the Response Column from chatDF DataFrame

In [None]:
responseDF = chatDF["response"]

##### Apply cleaning to the response column

In [None]:
responseDF = responseDF.apply(clean_text)

In [None]:
responseDF[0]

'immediately remove and destroy heavily infested leaves to slow population spread'

In [None]:
len(responseDF)

6189


##### Cleaning and Storing chatDF into newChatDF DataFrame

In [None]:
newChatDF = chatDF.applymap(clean_text)



In [None]:
newChatDF.head()

Unnamed: 0,query,response,domain,intent
0,what is the first step in curing a spider mite...,immediately remove and destroy heavily infeste...,plant_disease,faq
1,how does high-pressure water spray help cure s...,it physically dislodges mites from leaves; rep...,plant_disease,faq
2,why is pruning useful in mite cure,it eliminates infestation hotspots and improve...,plant_disease,faq
3,can dust reduction be considered a cure for sp...,yes reducing dust lowers stress on plants and ...,plant_disease,faq
4,how does improving air circulation cure mite o...,better airflow reduces hot dry microclimates w...,plant_disease,faq


In [None]:
!pip install nlpaug




##### Contextual Word Embeddings Augmentation with NLPaug

Importing nlpaug library

In [None]:
import nlpaug.augmenter.word as naw

##### Initialize the augmenter

In [None]:
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert")


##### Function to augment a single sentence

In [None]:
def augment_text(text):
    return aug.augment(text)

##### Applying Text Augmentation to DataFrame

In [None]:
newChatDF["augmentedQuery"] = newChatDF["query"].apply(augment_text)

In [None]:
# Save to CSV
newChatDF.to_csv("augmented_queries.csv", index=False)
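
One side effect of saving at this point is worth illustrating. Judging by the clean-up code that follows, the augmenter returns a list per query, and a list written to CSV comes back as its string representation. The sketch below reproduces that round trip with a hypothetical one-row frame and an in-memory buffer instead of a real file.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame; the augmented column holds a list, not a string.
frame = pd.DataFrame({
    "query": ["why is pruning useful in mite cure"],
    "augmentedQuery": [["why is root pruning useful in mite cure"]],
})

# Write to an in-memory CSV and read it back, as the notebook does with a file.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the cell is a string that merely looks like a list.
raw = reloaded["augmentedQuery"].iloc[0]
print(type(raw).__name__, raw)

# ast.literal_eval safely parses the string back into a real list.
restored = " ".join(ast.literal_eval(raw))
print(restored)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.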


In [None]:
# If you have already saved the augmented file, reload it here

import pandas as pd
import ast

# Load the file you already saved
df = pd.read_csv("/content/augmented_queries-2.csv")

# Convert a stringified list such as "['a b c']" back into plain text
def clean_aug_text(text):
    # Missing values become empty strings instead of the literal "nan"
    if isinstance(text, float) and pd.isna(text):
        return ""
    if isinstance(text, str) and text.strip().startswith("["):
        try:
            actual_list = ast.literal_eval(text)
            return " ".join(actual_list)
        except (ValueError, SyntaxError, TypeError):
            # Not a valid list literal: keep the raw string
            return text
    return str(text)

# Apply the fix
if 'augmentedQuery' in df.columns:
    df['augmentedQuery'] = df['augmentedQuery'].apply(clean_aug_text)
    df['full_input'] = df['query'].astype(str) + " " + df['augmentedQuery']
else:
    df['full_input'] = df['query'].astype(str)

print("✅ Data loaded and fixed.")


✅ Data loaded and fixed.


In [None]:
print(newChatDF.columns)


Index(['query', 'response', 'domain', 'intent'], dtype='object')


##### Converting the list rows into string

In [None]:
import pandas as pd
import ast
import os

# 1. Load the file
file_path = "/content/augmented_queries.csv"

if os.path.exists(file_path):
    df = pd.read_csv(file_path)
    print("✅ File loaded successfully.")

    # 2. Fix: Handle missing 'augmentedQuery' column safely
    if 'augmentedQuery' not in df.columns:
        print("⚠️ Warning: 'augmentedQuery' column not found in CSV.")
        print("   -> Using 'query' column as input instead.")
        # Fill missing values before casting, otherwise NaN becomes the string "nan"
        df['full_input'] = df['query'].fillna("").astype(str)
        # Add an empty 'augmentedQuery' column for consistency if it's missing from the CSV
        df['augmentedQuery'] = ''
    else:
        # If it exists, process it as before
        print("ℹ️ 'augmentedQuery' found. combining with 'query'.")

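        def clean_aug_text(text):
            # Missing values become empty strings instead of the literal "nan"
            if isinstance(text, float) and pd.isna(text):
                return ""
            if isinstance(text, str) and text.strip().startswith("["):
                try:
                    actual_list = ast.literal_eval(text)
                    return " ".join(actual_list)
                except (ValueError, SyntaxError, TypeError):
                    return text
            return str(text)

        df['augmentedQuery'] = df['augmentedQuery'].apply(clean_aug_text)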
        df['full_input'] = df['query'].astype(str) + " " + df['augmentedQuery']

    # 3. Ensure response column is clean
    df['response'] = df['response'].fillna("").astype(str)

    # IMPORTANT: Update newChatDF with the processed DataFrame
    newChatDF = df.copy()

    print(f"✅ Data Ready for Training. Rows: {len(newChatDF)}")
    print(f"   Sample Input: {newChatDF['full_input'].iloc[0]}")

else:
    print(f"❌ Error: File not found at {file_path}. Please make sure you uploaded it.")

✅ File loaded successfully.
   -> Using 'query' column as input instead.
✅ Data Ready for Training. Rows: 6189
   Sample Input: what is the first step in curing a spider mite outbreak on tomato what is the first completed step in disease curing... a spider mite whether outbreak or on tomato


In [None]:
# Rename 'full_input' to 'fullQuery' to match the rest of the notebook
newChatDF = newChatDF.rename(columns={"full_input": "fullQuery"})

# Verify it is correct
print(newChatDF.columns)

Index(['query', 'response', 'domain', 'intent'], dtype='object')


In [None]:
# 1. Copy the fixed data from 'df' to 'newChatDF'
newChatDF = df.copy()

# 2. Rename the column to match what the training code expects
newChatDF = newChatDF.rename(columns={"full_input": "fullQuery"})

# 3. Add back helper columns (to prevent errors later)
newChatDF["domain"] = "plant_disease"
newChatDF["intent"] = "faq"

# 4. Verify that 'fullQuery' is now present
print("Columns:", newChatDF.columns)
print("-" * 30)
print(newChatDF[["fullQuery", "response"]].head(1))

Columns: Index(['query', 'response', 'fullQuery', 'domain', 'intent'], dtype='object')
------------------------------
                                           fullQuery  \
0  what is the first step in curing a spider mite...   

                                            response  
0  immediately remove and destroy heavily infeste...  


##### Checking the columns in newChatDF

In [None]:
newChatDF.columns

Index(['query', 'response', 'fullQuery', 'domain', 'intent'], dtype='object')

##### Dropping the unnecessary columns

In [None]:
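# Later cells still read 'query', 'domain' and 'intent', so only the helper column is dropped.
# errors='ignore' avoids a KeyError when 'augmentedQuery' is absent.
newChatDF = newChatDF.drop(columns=['augmentedQuery'], errors='ignore')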

##### Checking the type of "newChatDF"

In [None]:
type(newChatDF)

### InputExample

##### `InputExample` is the sentence-transformers container for one training example: a list of texts (here a query and its response) plus an optional label.

##### Converting the "newChatDF" DataFrame to InputExample objects with a default label

In [None]:
default_label = 1.0
input_examples = newChatDF.apply(lambda row: InputExample(
    guid=str(row.name),
    texts=[row['fullQuery'], row['response']],
    label=default_label
), axis=1).tolist()

In the above code :

**guid**: gives each question-and-answer pair a unique identifier (the row index), which helps keep track of each example.

**texts**: holds "fullQuery" and "response" as a list of two separate text elements.

**label**: assigns 1.0 to every row as a placeholder that marks the pair as a positive example. The `MultipleNegativesRankingLoss` used later relies only on the text pairs, so this value does not influence training.

Finally, the `apply` method processes each row to create an `InputExample` object, and `.tolist()` converts the result into a list of these objects.



##### Printing Input Examples

In [None]:
# Print only the first few examples; printing all 6189 floods the output
for example in input_examples[:4]:
    print(example)

<InputExample> label: 1.0, texts: what is the first step in curing a spider mite outbreak on tomato what is the first completed step in disease curing... a spider mite whether outbreak or on tomato; immediately remove and destroy heavily infested leaves to slow population spread
<InputExample> label: 1.0, texts: how does high-pressure water spray help cure spider mites how precisely does high - pressure water spray help establish cure in spider flu mites; it physically dislodges mites from leaves; repeat every – days for best results
<InputExample> label: 1.0, texts: why is pruning useful in mite cure why what is root pruning useful in mite disease cure; it eliminates infestation hotspots and improves spray penetration
<InputExample> label: 1.0, texts: can dust reduction be considered a cure for spider mites mice can induced dust reduction be also considered a cure for various spider mites; yes reducing dust lowers stress on plants and directly reduces mite survival

### DataLoader

##### A `DataLoader` in machine learning efficiently manages and batches data for training and evaluation, ensuring optimized and streamlined data processing.

##### Creating a Shuffled DataLoader "train_dataloader"

In [None]:
from torch.utils.data import DataLoader
train_dataloader = DataLoader(input_examples, shuffle=True, batch_size=16)


### Sentence Transformer

The `SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')` model is used to convert sentences into 384-dimensional vectors. These vectors capture the semantic meaning of the sentences, which makes them useful for tasks like sentence similarity, clustering, and semantic search. In short, the model lets us compare the meaning of sentences numerically.

##### Initialize Sentence Transformer Model


In [None]:
sentenceModel = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')


In [None]:
train_loss = losses.MultipleNegativesRankingLoss(sentenceModel)

In [None]:
print(train_loss)

MultipleNegativesRankingLoss(
  (model): SentenceTransformer(
    (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'BertModel'})
    (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  (cross_entropy_loss): CrossEntropyLoss()
)
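
The idea behind this loss can be sketched in a few lines of NumPy. This is a simplified illustration and not the library's implementation: within a batch, each query's own response is treated as the correct class and every other response in the batch as a negative, and a cross-entropy is taken over scaled cosine similarities. The function name, the scale of 20 and the random toy vectors are assumptions made for the example.

```python
import numpy as np

def in_batch_ranking_loss(query_emb, response_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities; row i's correct column is i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = response_emb / np.linalg.norm(response_emb, axis=1, keepdims=True)
    logits = scale * (q @ r.T)                            # (batch, batch) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # diagonal = matching pairs

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))                         # toy "query embeddings"
matched = queries + 0.01 * rng.normal(size=(4, 8))        # responses close to their own query
shuffled = matched[[1, 2, 3, 0]]                          # every response paired with the wrong query

loss_matched = in_batch_ranking_loss(queries, matched)
loss_shuffled = in_batch_ranking_loss(queries, shuffled)
print(loss_matched, loss_shuffled)
```

The loss is small when each query sits closest to its own response and large when the pairs are misaligned, which is why shuffling the training data and using a reasonably large batch both matter for this objective.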


##### Training Sentence Model with Multiple Epochs and Warmup Steps

In [None]:
num_epochs = 5
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)

sentenceModel.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps
)


Step,Training Loss
500,0.6241
1000,0.2339
1500,0.1953


In [None]:
# Mount Google Drive first so the save path below exists
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Save to Google Drive
output_dir = '/content/drive/MyDrive/ChatbotProject/trainedModel'
sentenceModel.save(output_dir)
print("Model saved successfully!")

Model saved successfully!


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1. Define the question you want to ask
user_query = "What is the cure for leaf spot?"  # <--- Change this to test different questions

# 2. Convert the user's question into numbers (embedding)
query_embedding = sentenceModel.encode([user_query])

# 3. Convert your database (fullQuery column) into numbers
# (We do this so we can compare the question to every stored question)
print("Encoding database... this might take a second.")
faq_embeddings = sentenceModel.encode(newChatDF["fullQuery"].tolist())

# 4. Find the most similar question in your database
similarities = cosine_similarity(query_embedding, faq_embeddings)
best_index = np.argmax(similarities)
confidence_score = similarities[0][best_index].item()

# 5. Print the result
print("------------------------------------------------")
print(f"Question: {user_query}")
print(f"Confidence: {confidence_score:.4f}")
print("------------------------------------------------")

if confidence_score >= 0.50:
    answer = newChatDF.iloc[best_index]['response']
    print(f"Chatbot Answer: {answer}")
else:
    print("Chatbot Answer: I am not sure. Please contact an expert.")

Encoding database... this might take a second.
------------------------------------------------
Question: What is the cure for leaf spot?
Confidence: 0.5417
------------------------------------------------
Chatbot Answer: leaf spot is a disease of turmeric caused by colletotrichum capsici fungus it affects the leaves and shows symptoms such as brown oblong spots with grey centers the cure includes seed treatment fungicides
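
The retrieval step above boils down to a nearest-neighbour lookup with a cut-off. The following sketch shows the same logic on tiny hand-made vectors so it can run without the model; the `retrieve` helper, the three-dimensional vectors and the two responses are hypothetical, and real embeddings would come from `sentenceModel.encode`.

```python
import numpy as np

def retrieve(query_vec, faq_vecs, responses, threshold=0.5):
    """Return (response or None, score) for the closest stored question."""
    q = query_vec / np.linalg.norm(query_vec)
    f = faq_vecs / np.linalg.norm(faq_vecs, axis=1, keepdims=True)
    scores = f @ q                       # cosine similarity against every row
    best = int(np.argmax(scores))
    score = float(scores[best])
    # Below the threshold we return None so the caller can use a fallback message
    return (responses[best] if score >= threshold else None), score

# Hypothetical 3-dimensional "embeddings" for two stored questions
faq_vecs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
responses = ["remove infested leaves", "improve air circulation"]

answer, score = retrieve(np.array([0.9, 0.1, 0.0]), faq_vecs, responses)
print(answer, round(score, 4))

fallback, low_score = retrieve(np.array([0.0, 0.0, 1.0]), faq_vecs, responses)
print(fallback, low_score)
```

In a deployed chatbot the stored embeddings would normally be computed once and reused, since only the query embedding changes between requests.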


##### Preparing , Cleaning and Encoding New Query

In [None]:
new_query = "How do I cure spider mites?"
new_query = clean_text(new_query)
new_query_embedding = sentenceModel.encode([new_query])

In [None]:
print(newChatDF.columns)


Index(['query', 'response', 'fullQuery', 'domain', 'intent'], dtype='object')


In [None]:
# 1. Quick Fix: Ensure 'fullQuery' column exists
if 'fullQuery' not in newChatDF.columns:
    newChatDF['fullQuery'] = newChatDF['query']

# 2. Encode the database (The "Knowledge Base")
faq_embeddings = sentenceModel.encode(newChatDF["fullQuery"].tolist())

# 3. Compare and Find Answer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

similarities = cosine_similarity(new_query_embedding, faq_embeddings)
most_similar_index = np.argmax(similarities)
best_score = similarities[0][most_similar_index].item()

# 4. Print the Result
print("------------------------------------------------")
print(f"Query: {new_query}")
print(f"Confidence: {best_score:.4f}")
print("------------------------------------------------")

if best_score >= 0.50:
    print(f"Chatbot Answer: {newChatDF.iloc[most_similar_index]['response']}")
else:
    print("Chatbot Answer: I'm not sure about that.")

------------------------------------------------
Query: how do i cure spider mites
Confidence: 0.7580
------------------------------------------------
Chatbot Answer: prevent explosive populations with early integrated actions rather than relying solely on late-stage chemicals


In [None]:
# Use 'fullQuery' instead of 'query' to include the augmented data
faq_embeddings = sentenceModel.encode(newChatDF["fullQuery"].tolist())

##### Importing cosine similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

##### Calculating Query Embedding Similarities

In [None]:
similarities = cosine_similarity(new_query_embedding, faq_embeddings)

In [None]:
print(similarities)

[[ 0.61702025  0.743132    0.401717   ... -0.02439682  0.21830782
   0.21915418]]


##### Finding the index of the most similar query and Best Score

In [None]:
most_similar_query_index = np.argmax(similarities)
best_score = similarities[0][most_similar_query_index].item()

In [None]:
print(most_similar_query_index)

49


In [None]:
# 1. Retrieve the answer and the matched question from the dataframe
# Use the index you just found (most_similar_query_index)
best_response = newChatDF.iloc[most_similar_query_index]['response']
matched_question = newChatDF.iloc[most_similar_query_index]['query']

# 2. Print the detailed results
print("------------------------------------------------")
print(f"Matched Question: {matched_question}")
print(f"Confidence Score: {best_score:.4f}")
print("------------------------------------------------")

# 3. Logic to decide if we show the answer or not
if best_score >= 0.50:  # Threshold
    print(f"Chatbot Answer: {best_response}")
else:
    print("Chatbot Answer: I am not sure about that. Please contact an expert.")

------------------------------------------------
Matched Question: what is the ultimate cure principle for spider mites what is now the one ultimate cancer cure principle for spider mites
Confidence Score: 0.7580
------------------------------------------------
Chatbot Answer: prevent explosive populations with early integrated actions rather than relying solely on late-stage chemicals


### TextBlob

Importing TextBlob

In [None]:
from textblob import TextBlob

##### Classifying The Sentiment Of The Given Input

In [None]:
def classify_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        return "Positive"
    elif sentiment < 0:
        return "Negative"
    else:
        return "Neutral"


TextBlob helps us handle negatively worded questions that a user might ask and that are not present in the dataset.

##### Classifying Sentiment of New Query

In [None]:
sentiment = classify_sentiment(new_query)

In [None]:
sentiment

'Neutral'

##### Handle Negative Sentiment and Similar Query Response

In [None]:
if sentiment == "Negative":
    print("Please drop us a mail regarding your concerns.")
elif best_score >= 0.70:
    # Retrieve the most similar query and its response
    most_similar_query = newChatDF['query'][most_similar_query_index]
    response = responseDF[most_similar_query_index]
    print(f"Most Similar Query: {most_similar_query}")
    print(f"Response: {response}")
elif best_score >= 0.30:
    print("Sorry we are facing some technical difficulties , please write to us on contact@healthcarerocks.com")
elif best_score >= 0.20:
    print("Please write to us on our mail ID contact@healthcarerocks.com")
else:
    print("please write a mail regarding any queries related to our services")

Most Similar Query: what is the ultimate cure principle for spider mites what is now the one ultimate cancer cure principle for spider mites
Response: prevent explosive populations with early integrated actions rather than relying solely on late-stage chemicals


In [None]:
# 1. Analyze Sentiment
sentiment = classify_sentiment(new_query)

# 2. Smart Logic (Answer FIRST, Sentiment SECOND)
if best_score >= 0.50:  # If we have a good match...
    # ...Give the answer immediately!
    # best_response was retrieved earlier from the best-matching row
    print(f"Chatbot Answer: {best_response}")

elif sentiment == "Negative":
    # Only apologize if we DON'T know the answer AND the user is upset
    print("Chatbot Answer: I hear your frustration. I am not sure about the answer, but please email us for help.")

else:
    # Default fallback
    print("Chatbot Answer: I'm not sure about that. Please consult an expert.")

Chatbot Answer: prevent explosive populations with early integrated actions rather than relying solely on late-stage chemicals
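
The same decision logic can be wrapped in a function, which makes it easier to reuse and to test. This is a sketch rather than the notebook's own code: the name `route_reply` and its parameters are hypothetical, and the fallback messages are copied from the cell above.

```python
def route_reply(best_score, sentiment, answer, threshold=0.5):
    """Pick a reply: retrieved answer first, then a sentiment-aware fallback."""
    if best_score >= threshold:
        # A confident match wins regardless of sentiment
        return answer
    if sentiment == "Negative":
        # No good match and the user sounds upset
        return "I hear your frustration. I am not sure about the answer, but please email us for help."
    # Default fallback
    return "I'm not sure about that. Please consult an expert."

print(route_reply(0.758, "Neutral", "prevent explosive populations with early integrated actions"))
print(route_reply(0.21, "Negative", "unused"))
print(route_reply(0.21, "Neutral", "unused"))
```

Returning the message instead of printing it keeps the function usable from a web handler or a unit test.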


This helps us handle questions that the model has not seen or that are not present in the dataset.

##### Best Score

In [None]:
best_score

0.7580198645591736

##### Saving the trained model into a respective directory

In [None]:
import os

# Use the variable name of the model you trained above
# Do NOT initialize a new SentenceTransformer here.
model_to_save = sentenceModel

# Set the save path
save_path = "/content/saved_models/my_finetuned_sentence_model"

# Create directory
os.makedirs(save_path, exist_ok=True)

# Save the TRAINED model
model_to_save.save(save_path)

print(f"✅ Trained model saved at: {save_path}")


✅ Trained model saved at: /content/saved_models/my_finetuned_sentence_model


In [None]:
import shutil
from google.colab import files

# 1. Zip the folder (because you can't download a folder directly)
shutil.make_archive("my_chatbot_model", 'zip', "/content/saved_models/my_finetuned_sentence_model")

# 2. Trigger the download in your browser
files.download("my_chatbot_model.zip")


##### Saving Data To Pickle File

In [None]:
import pickle
import os

# Create local folder
os.makedirs('pickleFiles', exist_ok=True)

# Save the files
with open('pickleFiles/faq_embeddings.pkl', 'wb') as f:
    pickle.dump(faq_embeddings, f)

with open('pickleFiles/chatbot_data.pkl', "wb") as f:
    pickle.dump(newChatDF, f)

print("Files created! Now run the download script again.")

Files created! Now run the download script again.
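
Reading these files back in a later session or in a serving script mirrors the save step. The sketch below uses a temporary folder and stand-in data so that it runs on its own; the `load_pickle` helper is a hypothetical name, and in the notebook the pickled objects are `faq_embeddings` and `newChatDF`.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Read one pickled object back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in data so the sketch is self-contained
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "faq_embeddings.pkl")
    with open(path, "wb") as f:
        pickle.dump([[0.1, 0.2], [0.3, 0.4]], f)
    restored = load_pickle(path)

print(restored)
```

Unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.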


In [None]:
import shutil
from google.colab import files
import os

# 1. Check if the folder exists (just to be sure)
if os.path.exists('pickleFiles'):
    print("✅ Found 'pickleFiles' folder!")

    # 2. Zip the folder
    print("📦 Zipping files...")
    shutil.make_archive("my_chatbot_data", 'zip', "pickleFiles")

    # 3. Download
    print("⬇️ Downloading now...")
    files.download("my_chatbot_data.zip")
else:
    print("❌ Error: The folder 'pickleFiles' still doesn't exist.")

✅ Found 'pickleFiles' folder!
📦 Zipping files...
⬇️ Downloading now...



### Model Loading and Testing the model

In [None]:
import os

# Check the folder content
folder_path = "/content/saved_models/my_finetuned_sentence_model"

if os.path.exists(folder_path):
    print(f"✅ Folder found: {folder_path}")
    print("Files inside:", os.listdir(folder_path))
else:
    print("❌ Folder not found. Did you run the save code?")

✅ Folder found: /content/saved_models/my_finetuned_sentence_model
Files inside: ['tokenizer_config.json', 'sentence_bert_config.json', 'unigram.json', 'config_sentence_transformers.json', 'config.json', 'model.safetensors', 'tokenizer.json', 'modules.json', 'special_tokens_map.json', '1_Pooling', 'README.md']


In [None]:
best_score

0.7524635195732117


### Fine Tuned Chatbot Using BART Transformer

In [None]:
import pandas as pd
import os

# Assume newChatDF from the previous section is available in the kernel state.
# This DataFrame already contains 'response' and 'fullQuery' (which combines original and augmented queries).

# Create a DataFrame suitable for the BART fine-tuning, renaming 'fullQuery' to 'query'
# to match the expected column names in subsequent cells.
bart_df = newChatDF[['fullQuery', 'response']].rename(columns={'fullQuery': 'query'})

# Define the path for the new CSV file
file_path = "/content/merged_cleaned-data.csv"

# Save the DataFrame to a CSV file
bart_df.to_csv(file_path, index=False)

# Load the newly created CSV file
df = pd.read_csv(file_path)

# Now df is defined
print(df.head())

                                               query  \
0  what is the first step in curing a spider mite...   
1  how does high-pressure water spray help cure s...   
2  why is pruning useful in mite cure why what is...   
3  can dust reduction be considered a cure for sp...   
4  how does improving air circulation cure mite o...   

                                            response  
0  immediately remove and destroy heavily infeste...  
1  it physically dislodges mites from leaves; rep...  
2  it eliminates infestation hotspots and improve...  
3  yes reducing dust lowers stress on plants and ...  
4  better airflow reduces hot dry microclimates w...  


##### Creating a DataFrame

In [None]:
# Example: Use query and response columns from your existing DataFrame
newQueryDataset = df['query'].tolist()
responseDF = df['response'].tolist()


# Now this will work
import pandas as pd
newChatDF = pd.DataFrame({
    "query": newQueryDataset,
    "response": responseDF
})

print(newChatDF)


                                                  query  \
0     what is the first step in curing a spider mite...   
1     how does high-pressure water spray help cure s...   
2     why is pruning useful in mite cure why what is...   
3     can dust reduction be considered a cure for sp...   
4     how does improving air circulation cure mite o...   
...                                                 ...   
6184  what is quick wilt/foot rot in black pepper an...   
6185  what is azhukal/ capsule rot in cardamom and h...   
6186  what is cercospora leaf spot in gourd bitter/o...   
6187  what is fire blight in pear and how can it be ...   
6188  what is anthracnose in pomegranate and how can...   

                                               response  
0     immediately remove and destroy heavily infeste...  
1     it physically dislodges mites from leaves; rep...  
2     it eliminates infestation hotspots and improve...  
3     yes reducing dust lowers stress on plants and ...  
4

In [None]:
newChatDF.head()

| | query | response |
|---|---|---|
| 0 | what is the first step in curing a spider mite... | immediately remove and destroy heavily infeste... |
| 1 | how does high-pressure water spray help cure s... | it physically dislodges mites from leaves; rep... |
| 2 | why is pruning useful in mite cure why what is... | it eliminates infestation hotspots and improve... |
| 3 | can dust reduction be considered a cure for sp... | yes reducing dust lowers stress on plants and ... |
| 4 | how does improving air circulation cure mite o... | better airflow reduces hot dry microclimates w... |


The DataFrame has two columns: query and response.

##### Checking the shape of the dataset

In [None]:
newChatDF.shape

(6189, 2)

The dataset has 6189 rows and 2 columns.

##### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

##### Splitting the data into training and validation sets

In [None]:
train_df, val_df = train_test_split(newChatDF, test_size=0.2, random_state=42)

##### Checking the shape of the dataframe

In [None]:
train_df.shape, val_df.shape

((4951, 2), (1238, 2))

After the split, the training set train_df has 4951 rows and 2 columns, while the validation set val_df has 1238 rows and 2 columns.
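The split sizes can be checked by hand. The sketch below is a standalone illustration, not part of the notebook pipeline; `split_sizes` is a hypothetical helper that mirrors the rounding scikit-learn applies when only a fractional `test_size` is given (test count rounded up, training set gets the remainder).

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: the test count is rounded up and the
    training set receives whatever rows are left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 6189 rows with test_size=0.2, as used in this notebook
print(split_sizes(6189, 0.2))  # (4951, 1238)
```

The result agrees with the shapes reported by `train_df.shape` and `val_df.shape`.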

In [None]:
train_df.head()

| | query | response |
|---|---|---|
| 3950 | what sanitation steps reduce early blight on t... | scout tomato biweekly; remove infected tissue ... |
| 1020 | what is the scientific name of the early bligh... | alternaria solani |
| 1545 | where does the late blight pathogen survive be... | infected potato/tomato debris cull piles and v... |
| 1433 | how does staking or trellising help tomato hea... | it improves airflow reduces fruit contact with... |
| 3599 | can bacterial spot spread via tools or hands w... | scout tomato every – days; remove infected tis... |


In [None]:
val_df.head()

| | query | response |
|---|---|---|
| 996 | can windbreaks reduce tylcv spread can agricul... | yes by limiting whitefly dispersal |
| 468 | afternoon vs overnight humidity afternoon day ... | overnight wetness is more critical for infection |
| 3098 | how can i prune strawberry to boost airflow an... | for strawberry keep soil evenly moist; use fur... |
| 2127 | where does septoria leaf spot survive between ... | on plant debris volunteer tomatoes and solanac... |
| 2080 | how can fungicides be used for septoria how ca... | apply preventively when weather favors disease |


##### Resetting Index for Training and Validation Data

In [None]:
train_data = train_df.reset_index(drop=True)
validation_data = val_df.reset_index(drop=True)

In [None]:
train_data.head()

| | query | response |
|---|---|---|
| 0 | what sanitation steps reduce early blight on t... | scout tomato biweekly; remove infected tissue ... |
| 1 | what is the scientific name of the early bligh... | alternaria solani |
| 2 | where does the late blight pathogen survive be... | infected potato/tomato debris cull piles and v... |
| 3 | how does staking or trellising help tomato hea... | it improves airflow reduces fruit contact with... |
| 4 | can bacterial spot spread via tools or hands w... | scout tomato every – days; remove infected tis... |


In [None]:
train_data['query'][0]

'what sanitation steps reduce early blight on tomato what if sanitation steps can reduce early blight levels on tomato'

In [None]:
validation_data.head()

| | query | response |
|---|---|---|
| 0 | can windbreaks reduce tylcv spread can agricul... | yes by limiting whitefly dispersal |
| 1 | afternoon vs overnight humidity afternoon day ... | overnight wetness is more critical for infection |
| 2 | how can i prune strawberry to boost airflow an... | for strawberry keep soil evenly moist; use fur... |
| 3 | where does septoria leaf spot survive between ... | on plant debris volunteer tomatoes and solanac... |
| 4 | how can fungicides be used for septoria how ca... | apply preventively when weather favors disease |


### BART Transformers

##### Importing BART libraries

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict

##### Initializing tokenizer

In [None]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

##### Preprocessing By Tokenizing And Creating Labels

In [None]:
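def preprocess_function(examples):
    # Tokenize queries (encoder inputs) and responses (decoder targets); expects batched input.
    inputs = tokenizer(examples["query"], padding="max_length", truncation=True, max_length=512)
    targets = tokenizer(examples["response"], padding="max_length", truncation=True, max_length=512)
    # Set padded label positions to -100 so the loss ignores them.
    pad_id = tokenizer.pad_token_id
    inputs["labels"] = [[t if t != pad_id else -100 for t in ids] for ids in targets["input_ids"]]
    return inputs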
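
Seq2seq fine-tuning with Hugging Face models conventionally sets padded label positions to -100, the index that PyTorch's cross-entropy loss ignores by default, so padding does not contribute to the loss. The toy sketch below shows that masking on a single padded sequence. `mask_padding` is a hypothetical helper, the token ids are made up, and the pad id of 1 is an assumption matching `facebook/bart-base`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id):
    """Replace every padding id in a label sequence with IGNORE_INDEX."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


# Toy sequence padded to length 7 (ids are illustrative, pad id assumed to be 1)
padded_labels = [0, 713, 16, 2, 1, 1, 1]
print(mask_padding(padded_labels, pad_token_id=1))
# [0, 713, 16, 2, -100, -100, -100]
```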

##### Creating Dataset Objects from Pandas DataFrames

In [None]:
train_dataset = Dataset.from_pandas(train_data)
validation_dataset = Dataset.from_pandas(validation_data)

##### Apply Preprocessing Function to Datasets

In [None]:
# Tokenize both splits. batched=True passes lists of queries and responses
# to preprocess_function, which the tokenizer handles in one call per batch.
train_dataset = train_dataset.map(preprocess_function, batched=True)
validation_dataset = validation_dataset.map(preprocess_function, batched=True)




##### Creating dataset dictionary

In [None]:
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset
})

##### Initializing BART Model for Conditional Generation

In [None]:
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

The following layers were not sharded: model.decoder.layers.*.encoder_attn.k_proj.bias, model.decoder.layers.*.self_attn_layer_norm.bias, model.shared.weight, model.encoder.layers.*.self_attn_layer_norm.bias, model.encoder.layers.*.self_attn.q_proj.bias, model.encoder.embed_tokens.weight, model.decoder.layers.*.fc*.weight, model.decoder.layers.*.self_attn.v_proj.weight, model.encoder.layers.*.fc*.weight, model.encoder.layers.*.final_layer_norm.weight, model.encoder.layernorm_embedding.bias, model.decoder.embed_tokens.weight, model.decoder.layers.*.final_layer_norm.weight, model.encoder.layers.*.self_attn.v_proj.weight, model.decoder.layers.*.self_attn.k_proj.weight, model.encoder.layers.*.self_attn.q_proj.weight, model.decoder.layers.*.self_attn.k_proj.bias, final_logits_bias, model.decoder.layers.*.self_attn.out_proj.bias, model.decoder.layers.*.encoder_attn_layer_norm.bias, model.encoder.layers.*.self_attn.out_proj.bias, model.decoder.layers.*.encoder_attn.q_proj.bias, model.decoder.

In [None]:
!pip install -U transformers

from transformers import TrainingArguments
print(TrainingArguments)



Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Downloading transformers-4.57.3-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.57.2
    Uninstalling transformers-4.57.2:
      Successfully uninstalled transformers-4.57.2
Successfully installed transformers-4.57.3


<class 'transformers.training_args.TrainingArguments'>


##### Training The Model

In [None]:
import transformers
print(transformers.__version__)

4.57.2


In [None]:
print(transformers.__file__)

/usr/local/lib/python3.12/dist-packages/transformers/__init__.py


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # where checkpoints are written
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir='./logs',
    remove_unused_columns=True       # drop columns (e.g. raw text) that the model's forward() does not accept
)
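
The notebook defines the training arguments but shows no cell that runs the fine-tuning loop. A minimal sketch of that step follows. `build_trainer` is a hypothetical helper, and it assumes the splits in `dataset_dict` have already been passed through `preprocess_function` so that they contain `input_ids`, `attention_mask`, and `labels`.

```python
def build_trainer(model, training_args, dataset_dict):
    """Wire the model, arguments, and tokenized splits into a Trainer.

    Assumes dataset_dict['train'] and dataset_dict['validation'] are
    already tokenized; raw text columns are dropped by the Trainer when
    remove_unused_columns=True.
    """
    # Imported lazily so the helper can be defined before transformers is loaded.
    from transformers import Trainer

    return Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset_dict["train"],
        eval_dataset=dataset_dict["validation"],
    )


# Usage (hypothetical; this starts the actual fine-tuning run):
# trainer = build_trainer(model, training_args, dataset_dict)
# trainer.train()
```

Without a call such as `trainer.train()`, the weights saved in the next section would still be those of the pretrained `facebook/bart-base` checkpoint.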

##### Saving the trained model into the my_finetuned_model directory

In [None]:
# Save the fine-tuned model and its tokenizer to a local directory
model.save_pretrained("/Users/riya/Documents/my_finetuned_model")
tokenizer.save_pretrained("/Users/riya/Documents/my_finetuned_model")



('/Users/riya/Documents/my_finetuned_model/tokenizer_config.json',
 '/Users/riya/Documents/my_finetuned_model/special_tokens_map.json',
 '/Users/riya/Documents/my_finetuned_model/vocab.json',
 '/Users/riya/Documents/my_finetuned_model/merges.txt',
 '/Users/riya/Documents/my_finetuned_model/added_tokens.json')

In [None]:
model.save_pretrained("./my_finetuned_model")
tokenizer.save_pretrained("./my_finetuned_model")


('./my_finetuned_model/tokenizer_config.json',
 './my_finetuned_model/special_tokens_map.json',
 './my_finetuned_model/vocab.json',
 './my_finetuned_model/merges.txt',
 './my_finetuned_model/added_tokens.json')
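
Once saved, the directory can be loaded back for inference. The sketch below is one possible way to do that, not the notebook's own code: `generate_reply` is a hypothetical helper, and the generation settings (`max_new_tokens=64`, `num_beams=4`) are illustrative assumptions rather than tuned values.

```python
def generate_reply(query, model_dir="./my_finetuned_model", max_new_tokens=64):
    """Load the saved BART model and tokenizer and generate a reply to one query."""
    # Imported lazily so the helper can be defined before transformers is loaded.
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained(model_dir)
    model = BartForConditionalGeneration.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference

    # Encode the query as PyTorch tensors, truncating to the training max length.
    inputs = tokenizer(query, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (hypothetical query):
# print(generate_reply("how does improving air circulation cure mite outbreaks"))
```

Loading the model on every call keeps the sketch short; a long-running chatbot would load it once and reuse it.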

In [None]:
!zip -r my_finetuned_model.zip my_finetuned_model


  adding: my_finetuned_model/ (stored 0%)
  adding: my_finetuned_model/tokenizer_config.json (deflated 75%)
  adding: my_finetuned_model/config.json (deflated 64%)
  adding: my_finetuned_model/vocab.json (deflated 68%)
  adding: my_finetuned_model/model.safetensors (deflated 41%)
  adding: my_finetuned_model/merges.txt (deflated 53%)
  adding: my_finetuned_model/special_tokens_map.json (deflated 85%)
  adding: my_finetuned_model/generation_config.json (deflated 46%)


In [None]:
from google.colab import files
files.download("my_finetuned_model.zip")

