# Medical Advice Chatbot

In [30]:
!pip install transformers pandas datasets scikit-learn numpy tensorflow_text rouge nltk


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
data = '/content/drive/MyDrive/BSE ML-Techniques-1/med_chatbot.csv'

In [31]:
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import numpy as np
import re
import string
import random
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

## Data Loading

In [6]:
df = pd.read_csv(data)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 853 entries, 0 to 852
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Symptoms   853 non-null    object
 1   Diagnosis  853 non-null    object
dtypes: object(2)
memory usage: 13.5+ KB


In [8]:
# Checking for missing values
missing_values = df.isnull().sum()

# Checking for duplicate entries
duplicate_entries = df.duplicated().sum()

# Display the results
missing_values, duplicate_entries


(Symptoms     0
 Diagnosis    0
 dtype: int64,
 4)

## Preprocessing

### Cleaning Data

In [9]:
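# Remove duplicate entries; .copy() makes df_cleaned an independent frame,
# so later column assignments do not raise SettingWithCopyWarning
df_cleaned = df.drop_duplicates().copy()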

df_cleaned.duplicated().sum()


0

In [10]:
df_cleaned.head(5)

| | Symptoms | Diagnosis |
|---|---|---|
| 0 | I have I've been having a lot of pain in my ne... | cervical spondylosis |
| 1 | I have I have a rash on my face that is gettin... | impetigo |
| 2 | I have I have been urinating blood. I sometime... | urinary tract infection |
| 3 | I have I have been having trouble with my musc... | arthritis |
| 4 | I have I have been feeling really sick. My bod... | dengue |


In [11]:
response_templates = [
    "Based on your symptoms, it sounds like you may have {diagnosis}. It's best to consult a doctor for a proper diagnosis.",
    "Your symptoms suggest {diagnosis}. Have you noticed any other symptoms?",
    "It seems like you might be experiencing {diagnosis}. I recommend getting a medical opinion to confirm.",
    "From what you're describing, {diagnosis} could be a possibility. Make sure to seek professional medical advice.",
    "I see. This might be related to {diagnosis}. You should consider seeing a doctor for further evaluation."
]

In [12]:
# Apply the transformation to the Diagnosis column
df_cleaned['Diagnosis'] = df_cleaned['Diagnosis'].apply(lambda x: random.choice(response_templates).format(diagnosis=x))
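Because the template is drawn with `random.choice`, every run of the notebook produces different target sentences. The sketch below shows one way to make the assignment repeatable with a seeded generator. The helper names `to_response` and `build_responses` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import random

# Two of the notebook's five templates, copied verbatim.
TEMPLATES = [
    "Your symptoms suggest {diagnosis}. Have you noticed any other symptoms?",
    "I see. This might be related to {diagnosis}. You should consider seeing a doctor for further evaluation.",
]

def to_response(diagnosis, rng):
    """Wrap a diagnosis label in a template picked by the given generator."""
    return rng.choice(TEMPLATES).format(diagnosis=diagnosis)

def build_responses(diagnoses, seed=42):
    """Build one response per diagnosis using a dedicated, seeded generator."""
    # A private Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    return [to_response(d, rng) for d in diagnoses]

labels = ["impetigo", "dengue", "arthritis"]
first_run = build_responses(labels)
second_run = build_responses(labels)
print(first_run == second_run)  # True: same seed, same template choices
```

With a DataFrame, the same idea would amount to creating the generator once before calling `.apply`, so the whole column is reproducible from a single seed.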



In [13]:
df_cleaned.head()

| | Symptoms | Diagnosis |
|---|---|---|
| 0 | I have I've been having a lot of pain in my ne... | It seems like you might be experiencing cervic... |
| 1 | I have I have a rash on my face that is gettin... | It seems like you might be experiencing impeti... |
| 2 | I have I have been urinating blood. I sometime... | From what you're describing, urinary tract inf... |
| 3 | I have I have been having trouble with my musc... | I see. This might be related to arthritis. You... |
| 4 | I have I have been feeling really sick. My bod... | It seems like you might be experiencing dengue... |


In [14]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text.strip()
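To see what each step of `clean_text` does, it helps to run it on a single string. The sketch below repeats the function so it runs on its own and applies it to a made-up sentence. `clean_text_collapsed` is a hypothetical variant, not part of the notebook, that also squeezes the double spaces left behind when a token is deleted.

```python
import re
import string

def clean_text(text):
    # Same steps as the notebook cell above.
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)               # drop [bracketed] spans
    text = re.sub(r'http\S+|www\S+', '', text)        # drop URLs
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', ' ', text)                   # newlines to spaces
    text = re.sub(r'\w*\d\w*', '', text)              # drop tokens containing digits
    return text.strip()

def clean_text_collapsed(text):
    # Hypothetical variant: also squeeze runs of whitespace into one space.
    return re.sub(r'\s+', ' ', clean_text(text))

sample = "I've had a fever of 102F since Monday!\nSee http://example.com [ref]"
print(repr(clean_text(sample)))
# 'ive had a fever of  since monday see'
print(repr(clean_text_collapsed(sample)))
# 'ive had a fever of since monday see'
```

Note that the digit rule removes the temperature reading `102F` entirely, which may matter for symptom descriptions that rely on numbers.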

In [15]:
df_cleaned['Symptoms'] = df_cleaned['Symptoms'].apply(clean_text)
df_cleaned['Diagnosis'] = df_cleaned['Diagnosis'].apply(clean_text)



In [16]:
symptoms = df_cleaned['Symptoms'].tolist()
diagnoses = df_cleaned['Diagnosis'].tolist()

### Cleaned-up dataset, ready for tokenization

In [17]:
df_cleaned.head(5)

| | Symptoms | Diagnosis |
|---|---|---|
| 0 | i have ive been having a lot of pain in my nec... | it seems like you might be experiencing cervic... |
| 1 | i have i have a rash on my face that is gettin... | it seems like you might be experiencing impeti... |
| 2 | i have i have been urinating blood i sometimes... | from what youre describing urinary tract infec... |
| 3 | i have i have been having trouble with my musc... | i see this might be related to arthritis you s... |
| 4 | i have i have been feeling really sick my body... | it seems like you might be experiencing dengue... |


### Splitting Text

In [18]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
from torch.utils.data import Dataset, DataLoader

In [19]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df_cleaned['Symptoms'].tolist(), df_cleaned['Diagnosis'].tolist(), test_size=0.2, random_state=42
)

In [20]:
# Load tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")


In [21]:
class ChatbotDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = "chatbot: " + self.texts[idx]  # Add task prefix for T5
        label = self.labels[idx]

        encoding = self.tokenizer(text, padding="max_length", truncation=True, max_length=self.max_length, return_tensors="pt")
        target_encoding = self.tokenizer(label, padding="max_length", truncation=True, max_length=self.max_length, return_tensors="pt")

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': target_encoding['input_ids'].squeeze()
        }
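The dataset above pads the target sequence to `max_length` and passes the padded ids straight through as `labels`, so the loss is also computed on padding positions. A common adjustment when fine-tuning T5 is to replace padding ids in the labels with `-100`, which the loss ignores. The sketch below shows the idea on plain lists. The helper `mask_padding` and the token ids are made up for illustration, and it is an assumption that the author would want this change, since it would alter the loss values reported in this notebook.

```python
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss by default
T5_PAD_TOKEN_ID = 0   # pad id of the t5-small tokenizer (tokenizer.pad_token_id)

def mask_padding(label_ids, pad_token_id=T5_PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

# Made-up token ids: three real tokens, an end-of-sequence id (1), then padding.
padded = [363, 19, 8, 1, 0, 0, 0, 0]
print(mask_padding(padded))  # [363, 19, 8, 1, -100, -100, -100, -100]

# Tensor equivalent inside __getitem__ (assumption, not part of the notebook):
#   labels = target_encoding['input_ids'].squeeze(0)
#   labels[labels == self.tokenizer.pad_token_id] = -100
```

With this masking, the reported training loss would reflect only real target tokens, so it would likely come out higher than the values printed during training here.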

In [22]:
train_dataset = ChatbotDataset(train_texts, train_labels, tokenizer)
val_dataset = ChatbotDataset(val_texts, val_labels, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)
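As a quick sanity check, the split and loader sizes can be worked out by hand. The sketch below assumes that `drop_duplicates()` removed exactly the 4 duplicate rows reported earlier, that `train_test_split` rounds a fractional test size up, and that `DataLoader` keeps the last partial batch (its default, `drop_last=False`).

```python
import math

n_rows = 853 - 4     # rows left after dropping the 4 duplicates
test_size = 0.2
batch_size = 8

# scikit-learn rounds a fractional test size up and gives the rest to training.
n_val = math.ceil(test_size * n_rows)
n_train = n_rows - n_val

# The last, smaller batch is kept, so round the batch count up as well.
train_steps = math.ceil(n_train / batch_size)
val_steps = math.ceil(n_val / batch_size)

print(n_train, n_val)          # 679 170
print(train_steps, val_steps)  # 85 22
```

Comparing these numbers with `len(train_dataset)` and `len(train_loader)` is a cheap way to confirm the pipeline is wired as expected.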

In [23]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

In [24]:
for epoch in range(50):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}")


Epoch 1, Loss: 3.4658084013882804
Epoch 2, Loss: 0.7943515455021578
Epoch 3, Loss: 0.50919089457568
Epoch 4, Loss: 0.3707756375565248
Epoch 5, Loss: 0.29080310954767113
Epoch 6, Loss: 0.23019534805241754
Epoch 7, Loss: 0.18671643698916715
Epoch 8, Loss: 0.14755612182266573
Epoch 9, Loss: 0.11891172221478294
Epoch 10, Loss: 0.10343273010324029
Epoch 11, Loss: 0.0889737049008117
Epoch 12, Loss: 0.07876936070182744
Epoch 13, Loss: 0.0712067888940082
Epoch 14, Loss: 0.0652921145891442
Epoch 15, Loss: 0.060565662296379316
Epoch 16, Loss: 0.0589846889324048
Epoch 17, Loss: 0.05514152260387645
Epoch 18, Loss: 0.05401185821084415
Epoch 19, Loss: 0.052344047826002625
Epoch 20, Loss: 0.049913576552096536
Epoch 21, Loss: 0.04908414614551208
Epoch 22, Loss: 0.04701190235860207
Epoch 23, Loss: 0.04596904823008706
Epoch 24, Loss: 0.044519037183593305
Epoch 25, Loss: 0.04323557755526374
Epoch 26, Loss: 0.041420970769489515
Epoch 27, Loss: 0.04083057810716769
Epoch 28, Loss: 0.03909246917156612
Epoch 
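The loop above runs for a fixed 50 epochs and never uses `val_loader`, so there is no signal for when further training stops helping. One option is to compute a validation loss after each epoch and stop once it plateaus. The sketch below shows only the stopping rule. The class `EarlyStopping`, its thresholds, and the loss values are assumptions for illustration, not results from this notebook.

```python
class EarlyStopping:
    """Signal a stop once the loss has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=3, min_delta=1e-3):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return True when training should stop."""
        if self.best - loss > self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, for illustration only.
val_losses = [0.90, 0.60, 0.45, 0.451, 0.46, 0.47, 0.48]
stopper = EarlyStopping(patience=3, min_delta=1e-3)
stop_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stop_epoch = epoch
        break

print(stop_epoch, stopper.best)  # 6 0.45
```

In the training loop, `stopper.step(...)` would be called once per epoch with the average loss over `val_loader`, computed under `model.eval()` and `torch.no_grad()`.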

In [25]:
model.save_pretrained("medical_chatbot_t5")
tokenizer.save_pretrained("medical_chatbot_t5")

('medical_chatbot_t5/tokenizer_config.json',
 'medical_chatbot_t5/special_tokens_map.json',
 'medical_chatbot_t5/spiece.model',
 'medical_chatbot_t5/added_tokens.json')

In [26]:
def chatbot_response(input_text):
    model.eval()
    input_text = "chatbot: " + input_text
    encoding = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True, padding="max_length").to(device)

    output = model.generate(input_ids=encoding['input_ids'], attention_mask=encoding['attention_mask'], max_length=50)
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response

In [32]:
def evaluate_model():
    model.eval()
    rouge = Rouge()
    bleu_scores = []
    rouge_scores = []

    for symptom, actual_response in zip(val_texts, val_labels):
        predicted_response = chatbot_response(symptom)

        # Compute BLEU Score
        bleu_score = sentence_bleu([actual_response.split()], predicted_response.split())
        bleu_scores.append(bleu_score)

        # Compute ROUGE Score
        rouge_score = rouge.get_scores(predicted_response, actual_response)[0]['rouge-l']['f']
        rouge_scores.append(rouge_score)

    print(f"Average BLEU Score: {sum(bleu_scores) / len(bleu_scores):.4f}")
    print(f"Average ROUGE-L Score: {sum(rouge_scores) / len(rouge_scores):.4f}")

# Run evaluation
evaluate_model()

The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Average BLEU Score: 0.2096
Average ROUGE-L Score: 0.3744
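BLEU and ROUGE-L compare whole sentences, so they are sensitive to which response template the model happens to produce. A complementary check is whether the expected diagnosis name appears in the generated response at all. The sketch below assumes the raw diagnosis labels were kept in a separate list before being overwritten by the templates, which the notebook does not currently do. The function name and the example strings are hypothetical.

```python
def diagnosis_hit_rate(predictions, diagnoses):
    """Fraction of predictions that contain the expected diagnosis name."""
    if not predictions:
        return 0.0
    hits = sum(d.lower() in p.lower() for p, d in zip(predictions, diagnoses))
    return hits / len(predictions)

# Hypothetical predictions and raw labels, for illustration only.
predictions = [
    "your symptoms suggest dengue have you noticed any other symptoms",
    "i see this might be related to arthritis you should consider seeing a doctor for further evaluation",
    "it seems like you might be experiencing allergy i recommend getting a medical opinion to confirm",
]
diagnoses = ["dengue", "arthritis", "pneumonia"]

print(f"{diagnosis_hit_rate(predictions, diagnoses):.2f}")  # 0.67
```

Since the model's outputs are lowercase and unpunctuated, the raw labels would need to pass through the same `clean_text` step before matching.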


In [27]:
# Test the chatbot
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    response = chatbot_response(user_input)
    print("Chatbot:", response)

You: I have a mild headache and my stomach aches
Chatbot: based on your symptoms it sounds like you may have gastroesophageal reflux disease its best to consult a doctor for a proper diagnosis
You: My chest hurts and I cant breathe properly
Chatbot: it seems like you might be experiencing allergy i recommend getting a medical opinion to confirm
You: I have sores in my mouth 
Chatbot: i see this might be related to diabetes you should consider seeing a doctor for further evaluation
You: I am having trouble breathing and my chest hurts
Chatbot: from what youre describing pneumonia could be a possibility make sure to seek professional medical advice
You: I am having trouble moving my legs
Chatbot: i see this might be related to psoriasis you should consider seeing a doctor for further evaluation
You: exit


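On the validation set the model reaches an average BLEU score of 0.2096 and an average ROUGE-L score of 0.3744. These are modest overlap scores, and they need careful reading: each reference response wraps the diagnosis in a randomly chosen template, so part of the mismatch can come from the model picking a different template rather than a different diagnosis. The test conversation above also shows that the model produces fluent, template-shaped replies, although some of the suggested diagnoses look questionable (for example, psoriasis for trouble moving the legs).

T5-small was used as the pre-trained model because it is lightweight.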

In [33]:
!pip install streamlit
