<a href="https://colab.research.google.com/github/Ali-desu/Bert_llama2_medical_chatbot/blob/main/Copie_de_medical_intent_detector_Using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Project Objective:**

Build a tool that classifies a person's free-text description of a medical symptom into the correct category (intent). Such a classifier can be used in applications such as a medical chatbot.

## 1. GPU Setup in Colab

In [None]:
# A GPU can be added by going to the menu and selecting: Edit 🡒 Notebook Settings 🡒 Hardware accelerator 🡒 (GPU)
# confirm the GPU is detected:

import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')


Found GPU at: /device:GPU:0


In [None]:
# In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device

import torch

# If there's a GPU available...
if torch.cuda.is_available():

    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## 2. Load and analyze data

The dataset comes from Kaggle: https://www.kaggle.com/paultimothymooney/medical-speech-transcription-and-intent

This data contains thousands of audio utterances and corresponding transcriptions for common medical symptoms like “knee pain” or “headache”. Only the transcriptions are used in this project.

The sample below shows the two columns used:
* column "phrase" contains transcriptions in which a person describes a medical symptom
* column "prompt" contains the corresponding intent (25 intents in total)


In [None]:
%pip install transformers
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
data = pd.read_csv('gdrive/My Drive/Colab Datasets/overview-of-recordings.csv')
data1 = data[['phrase','prompt']]
data1.sample(5)

| | phrase | prompt |
|---|---|---|
| 2321 | I think my wound is infected | Infected wound |
| 5808 | I am worried how cold intolerant I am, I am al... | Feeling cold |
| 1160 | I used alot of pain killer to get better but i... | Back pain |
| 4883 | I feel back pain when I carry heavy things | Back pain |
| 713 | When i'm driving my eyes see in double | Blurry vision |


In [None]:
df=data1.copy()
df.isna().sum()

phrase    0
prompt    0
dtype: int64

In [None]:
df['prompt'].value_counts()

prompt
Acne                  328
Shoulder pain         320
Joint pain            318
Infected wound        306
Knee pain             305
Cough                 293
Feeling dizzy         283
Muscle pain           282
Heart hurts           273
Ear ache              270
Hair falling out      264
Feeling cold          263
Head ache             263
Skin issue            262
Stomach ache          261
Back pain             259
Neck pain             251
Internal pain         248
Blurry vision         246
Body feels weak       241
Hard to breath        233
Emotional pain        231
Injury from sports    230
Foot ache             223
Open wound            208
Name: count, dtype: int64

In [None]:
print('Total number of intents: %d'%(len(df['prompt'].value_counts().index)))

Total number of intents: 25


## 3. Split data to train, validation and test sets

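I split the data into training (70%), validation (10%) and test (20%) sets, stratified by the intent label (the "prompt" column). Stratification keeps the share of each intent the same in all three sets, which matters for both training and evaluation. The split is done in two steps: 20% of the data is held out for testing, then 12.5% of the remaining 80% (10% of the full data) is held out for validation.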

In [None]:
from sklearn.model_selection import train_test_split

X, sentence_test, y, intent_test = train_test_split(df.phrase, df.prompt, stratify = df.prompt,test_size=0.2, random_state=4612)
sentence_train, sentence_val, intent_train, intent_val = train_test_split(X, y, stratify = y,test_size=0.125, random_state=4612)


In [None]:
print(f"#examples in training set:{ sentence_train.shape[0]}\n#examples in validation set:{ sentence_val.shape[0]}\n#examples in test set:{ sentence_test.shape[0]}")

#examples in training set:4662
#examples in validation set:666
#examples in test set:1333
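
The set sizes printed above follow directly from the two `test_size` values. The sketch below reproduces that arithmetic in plain Python. It assumes that the held-out count is rounded up and everything else is kept, which is how I understand scikit-learn to round; `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

total = 6661  # sum of the per-intent counts shown in section 2

# Step 1: hold out 20% of everything for testing.
n_rest, n_test = split_sizes(total, 0.2)
# Step 2: hold out 12.5% of the remainder (10% of the total) for validation.
n_train, n_val = split_sizes(n_rest, 0.125)

print(n_train, n_val, n_test)
```

The second fraction is 0.125 rather than 0.1 because it is applied to the 80% that remains after the test set has been removed.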


## 4. Tokenization and input formatting

I prepare the input data in the correct format before training as follows:
* Tokenize all sentences.
* Pad and truncate all sentences to the same length.
* Create attention masks, which mark real tokens with 1 and [PAD] tokens with 0.
* Encode the intent labels as integers (25 intents map to 25 integers).
* Create DataLoaders for the training, validation and test sets.
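
To make the padding and attention-mask steps concrete, here is a minimal plain-Python sketch of what the tokenizer does for a batch. `pad_batch` is a hypothetical helper written for illustration, and the token ids are made up, apart from 101, 102 and 0, which are the `[CLS]`, `[SEP]` and `[PAD]` ids in the `bert-base-uncased` vocabulary.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad lists of token ids to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens first, then padding up to the common length.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a [PAD] position.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Two made-up sentences of different lengths.
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The model uses the mask to ignore padded positions, so the padding does not influence the prediction.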

In [None]:
# Defining some key variables that will be used later on in the training
TRAIN_BATCH_SIZE =32
VALID_BATCH_SIZE = 64
EPSILON = 1e-08
EPOCHS = 4
LEARNING_RATE = 2e-5
SEED = 1215
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


In [None]:
max_len = 0
input_list = []
length = []
# For every sentence...
for sent in sentence_train:

    # Tokenize the text and add the special `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)
    input_list.append(input_ids)
    length.append(len(input_ids))
    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

# Compute the mean once, after the loop, instead of on every iteration.
mean_len = sum(length) / len(length)
# The longest sentence (transcription) has 39 tokens; a sentence has 14 tokens on average.
print('Max sentence length:%d \nMean sentence length:%d' % (max_len, mean_len))

Max sentence length:39 
Mean sentence length:14


In [None]:
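# Create a function to tokenize sentences.
def tokenize(sentence):
  # padding=True pads every sentence to the longest one in the call, and
  # truncation=True cuts sentences that exceed the model's maximum length.
  # The returned batch also holds the attention masks, which separate real
  # tokens from [PAD] tokens.
  # The unsupported `is_pretokenized` argument has been dropped; it only
  # triggered a "not recognized" warning.
  batch = tokenizer(list(sentence),
                    padding=True,
                    truncation=True,
                    return_tensors="pt")
  return batch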

In [None]:
tok_train = tokenize(sentence_train)
tok_val = tokenize(sentence_val)
tok_test = tokenize(sentence_test)



In [None]:
from sklearn.preprocessing import LabelEncoder
# Encode "intent" as 25 integer labels.
# Fit the encoder on the training labels only, then reuse the same mapping
# for the validation and test labels so that the ids stay consistent.
LE = LabelEncoder()
label_train = torch.tensor(LE.fit_transform(intent_train))
label_val = torch.tensor(LE.transform(intent_val))
label_test = torch.tensor(LE.transform(intent_test))
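
Label ids are only meaningful if every split uses the same label-to-id mapping. A minimal plain-Python sketch of that idea follows; `fit_labels` and `encode` are hypothetical helpers standing in for what `LabelEncoder` does (distinct labels in sorted order, mapped to consecutive integers), and the intent lists are toy data.

```python
def fit_labels(labels):
    """Map each distinct label to an integer id, in sorted order."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}

def encode(labels, mapping):
    """Encode labels with an existing mapping; unseen labels raise KeyError."""
    return [mapping[name] for name in labels]

train_intents = ["Cough", "Acne", "Back pain", "Cough"]
val_intents = ["Back pain", "Acne"]

mapping = fit_labels(train_intents)         # fitted on the training labels only
train_ids = encode(train_intents, mapping)
val_ids = encode(val_intents, mapping)      # the same mapping is reused
print(mapping, train_ids, val_ids)
```

Refitting on each split happens to give the same ids when every split contains all classes, as stratification makes likely here, but reusing one fitted mapping does not depend on that.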


In [None]:
from torch.utils.data import TensorDataset

train_dataset = TensorDataset(tok_train['input_ids'], tok_train['attention_mask'],label_train)
validation_dataset = TensorDataset(tok_val['input_ids'], tok_val['attention_mask'],label_val)
test_dataset = TensorDataset(tok_test['input_ids'], tok_test['attention_mask'],label_test)


In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = TRAIN_BATCH_SIZE # Trains with this batch size.
        )

# For validation/test the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            validation_dataset, # The validation samples.
            sampler = SequentialSampler(validation_dataset), # Pull out batches sequentially.
            batch_size = VALID_BATCH_SIZE # Evaluate with this batch size.
        )

test_dataloader = DataLoader(
            test_dataset, # The test samples (not the validation set).
            sampler = SequentialSampler(test_dataset), # Pull out batches sequentially.
            batch_size = VALID_BATCH_SIZE # Evaluate with this batch size.
        )

## 5. Train BERT classification model

I use BertForSequenceClassification, a BERT model with a single linear layer added on top for classification. As we feed in the input data, the entire pre-trained BERT model and the additional untrained classification layer are fine-tuned on our specific task.



In [None]:
from transformers import BertForSequenceClassification
from torch.optim import AdamW  # The AdamW bundled with transformers is deprecated.

## Use the pretrained base (relatively small) BERT model for sequence classification.
#CUDA_LAUNCH_BLOCKING=1
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 25)
model.to(device) # Move the model to the device selected earlier (the GPU when available).

## Use the AdamW optimizer.
optimizer = AdamW(model.parameters(),
                  lr = LEARNING_RATE,
                  eps = EPSILON, # Very small number to prevent any division by zero.
                  weight_decay = 0.0 # Matches the default of the transformers implementation.
                  )

from transformers import get_linear_schedule_with_warmup

# Total number of training steps is [number of batches] x [number of epochs].
total_steps = len(train_dataloader) * EPOCHS

## Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
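
The scheduler created above decays the learning rate linearly to zero over the training run. The sketch below shows the multiplier as I understand `get_linear_schedule_with_warmup` to compute it; `linear_schedule_factor` is a hypothetical re-implementation for illustration, not the library code. The step count assumes 146 batches per epoch (4662 training examples in batches of 32) and 4 epochs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 146 * 4  # batches per epoch times epochs

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps = 0` there is no ramp-up, so training starts at the full learning rate and reaches zero at the last step.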


In [None]:
# Helper that counts the correct predictions in a batch (used to compute the accuracy)

def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

In [None]:
import time
import datetime

def format_time(elapsed):
    #Takes a time in seconds and returns a string hh:mm:ss
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [None]:
from torch.utils.tensorboard import SummaryWriter

# default `log_dir` is "runs" - we'll be more specific here
writer = SummaryWriter('runs/Tensorboard')

In [None]:
# Start the training process:
import random
import torch

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
def train(epochs):
  total_t0 = time.time() # Measure the total training time for the whole run.
  tr_loss = 0
  n_correct = 0
  nb_tr_steps = 0
  nb_tr_examples = 0

  # For each epoch...
  for epoch in range(0, epochs):
      print('======== Epoch {:} / {:} ========'.format(epoch + 1, epochs))
      print('Training...')

      t0 = time.time()     # Measure how long the training epoch takes.
      total_tr_loss = 0
      total_n_correct = 0
      total_nb_tr_examples = 0
      model.train()    # Put the model into training mode

      # For each batch of training data...
      for step, batch in enumerate(train_dataloader, 0):
          # 'batch' contains three pytorch tensors:[0]: input ids, [1]: attention masks, [2]: labels
          input_ids = batch[0].to(device, dtype = torch.long)
          input_mask = batch[1].to(device, dtype = torch.long)
          labels = batch[2].to(device, dtype = torch.long)

          model.zero_grad()       #clear any previously calculated gradients

          outputs = model(input_ids, token_type_ids=None, attention_mask=input_mask)
          loss_function = torch.nn.CrossEntropyLoss()
          loss = loss_function(outputs[0], labels) #`loss` is a Tensor containing a single value
          tr_loss += loss.item() # `.item()` returns the Python value held by the tensor
          total_tr_loss += loss.item()
          big_val, big_idx = torch.max(outputs[0], dim=1)
          n_correct += calcuate_accu(big_idx, labels)
          total_n_correct += calcuate_accu(big_idx, labels)
          nb_tr_steps += 1
          nb_tr_examples+=labels.size(0)
          total_nb_tr_examples+=labels.size(0)

          if step % 20==19:
              loss_step = tr_loss/nb_tr_steps
              accu_step = n_correct/nb_tr_examples # #correct examples/all examples
              print(f"Training Loss per 20 steps(batches): {loss_step}")
              print(f"Training Accuracy per 20 steps(batches): {accu_step}")
              elapsed = format_time(time.time() - t0)    # Elapsed time so far, formatted as hh:mm:ss.
              # Report progress.
              print('Batch {} of {}.  Elapsed: {:}.'.format(step+1, len(train_dataloader), elapsed))
              #writer.add_scalar('training loss', loss_step, (epoch +1)*len(trainloader) )
              tr_loss = 0;n_correct = 0;nb_tr_steps = 0;nb_tr_examples = 0

          loss.backward() # Perform a backward pass to calculate the gradients.
          torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Clip the norm of the gradients to 1.0. This is to help prevent the "exploding gradients" problem.
          optimizer.step()
          scheduler.step() # Update the learning rate.

      # Calculate the average loss over all of the batches.
      train_loss_per_epoch = total_tr_loss / len(train_dataloader)
      train_accuracy_per_epoch=total_n_correct/total_nb_tr_examples
      # Measure how long this epoch took.
      training_time = format_time(time.time() - t0)

      print("")
      print("training loss per epoch: {0:.2f}".format(train_loss_per_epoch))
      print("training accuracy per epoch: {0:.2f}".format(train_accuracy_per_epoch))
      print("Training 1 epoch took: {:}".format(training_time))

In [None]:
train(epochs = EPOCHS)
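
The two quantities tracked in the training loop are the cross-entropy loss and the number of correct predictions. The plain-Python sketch below shows both for a toy batch of logits. `cross_entropy` and `count_correct` are hypothetical helpers written for illustration, and the logits and labels are made up.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true label under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def count_correct(batch_logits, labels):
    """Number of rows whose largest logit sits at the true label."""
    preds = [row.index(max(row)) for row in batch_logits]
    return sum(int(p == y) for p, y in zip(preds, labels))

# Three examples, three classes.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.0, 1.5, 0.0]]
labels = [0, 2, 0]

losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
mean_loss = sum(losses) / len(losses)  # batch mean, as CrossEntropyLoss does by default
n_correct = count_correct(batch_logits, labels)
print(round(mean_loss, 4), n_correct, n_correct / len(labels))
```

Accuracy is the count of correct predictions divided by the number of examples seen, which is why the loop accumulates both numbers.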

## 6. Test the model on the validation set

In [None]:
# test the model on the validation set
def valid(model, validation_loader):
  model.eval()
  val_loss = 0
  nb_val_examples = 0
  n_correct = 0
  with torch.no_grad():
    for _, data in enumerate(validation_loader, 0):
      ids = data[0].to(device, dtype = torch.long)
      mask = data[1].to(device, dtype = torch.long)
      targets = data[2].to(device, dtype = torch.long)
      outputs = model(ids, mask)
      loss_function = torch.nn.CrossEntropyLoss()
      loss = loss_function(outputs[0], targets)
      val_loss += loss.item()
      big_val, big_idx = torch.max(outputs[0], dim=1)
      n_correct += calcuate_accu(big_idx, targets)
      nb_val_examples+=targets.size(0)

  val_ave_loss = val_loss/len(validation_loader)
  val_accu = (n_correct*100)/nb_val_examples
  print("Loss on validation/test data: %0.2f" % val_ave_loss)
  print("Accuracy on validation/test data: %0.2f%%" % val_accu)

  return

In [None]:
valid(model, validation_dataloader)

## 7. Obtain test error

In [None]:
valid(model, test_dataloader)

## 8. Save the model, tokenizer and labels

In [None]:
import os

# Saving best practice: if you use the default names for the model, you can reload it using from_pretrained()

output_dir = './Documents/intent_detection_healthcare_bert/saved_bert_model_and_tokenizer/'

# Create output directory if needed
if not os.path.exists(output_dir):
  os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)



In [None]:
df_label = pd.DataFrame(tuple(zip(range(25),LE.classes_)), columns=['id','intent'])
df_label.to_pickle('./Documents/intent_detection_healthcare_bert/saved_bert_model_and_tokenizer/df_label.pkl')

In [None]:
# Copy the model files to a directory in Google Drive.
!cp -r ./Documents/intent_detection_healthcare_bert/saved_bert_model_and_tokenizer/ "gdrive/My Drive/"

## 9. Prepare the model for deployment
* Load the saved model, tokenizer and labels
* Create a medical_symptom_detector function with the loaded model, tokenizer and labels, which helps predict the medical intent of a medical message.
* Test the detector on an unseen example

In [None]:
#### load the model and build the detector for deployment
!pip install transformers
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification

input_dir = 'gdrive/My Drive/saved_bert_model_and_tokenizer/'

loaded_model = BertForSequenceClassification.from_pretrained(input_dir)
loaded_model.eval()
loaded_tokenizer = BertTokenizer.from_pretrained(input_dir)
loaded_df_label = pd.read_pickle('gdrive/My Drive/saved_bert_model_and_tokenizer/df_label.pkl')



In [None]:
# Test the model on an unseen example.

def medical_symptom_detector(text):
  # Tokenize the incoming message the same way as the training data.
  pt_batch = loaded_tokenizer(
      text,
      padding=True,
      truncation=True,
      return_tensors="pt")

  # Inference only, so gradients are not needed.
  with torch.no_grad():
    pt_outputs = loaded_model(**pt_batch)
  # The index of the largest logit is the predicted label id.
  _, pred_id = torch.max(pt_outputs[0], dim=1)
  prediction = loaded_df_label.iloc[[pred_id.item()]]['intent'].item()
  print('You may have a medical condition: %s. Would you like me to transfer your call to your doctor?'%(prediction))
  return prediction

In [None]:
question = 'my shoulder hurt so much aaahgh'
intent = medical_symptom_detector(question)

In [None]:
doctor_answers = {
    "Acne": "Benzoyl peroxide works as an antiseptic to reduce the number of bacteria on the surface of the skin. It also helps to reduce the number of whiteheads and blackheads, and has an anti-inflammatory effect. Benzoyl peroxide is usually available as a cream or gel. It's used either once or twice a day.\nYou may want to see a dermatologist for more treatment options.",
    "Shoulder pain": "Apply an ice pack, covered in a damp towel, to your shoulder for about 20 minutes every few hours to reduce pain and inflammation. Use a covered hot water bottle or heat pack on your shoulder for around 20 minutes several times a day to relieve tight or sore muscles.\nYou should consult an orthopedic specialist to assess the cause of your shoulder pain.",
    "Joint pain": "Joint pain is common, especially as you get older. There are things you can do to ease the pain, but get medical help if it's very painful or it does not get better.\nThere are many possible causes of joint pain. It might be caused by an injury or a longer-lasting problem such as arthritis. Try to rest the affected joint if you can, put an ice pack (or bag of frozen peas) wrapped in a towel on the painful area for up to 20 minutes every 2 to 3 hours, and take painkillers such as ibuprofen or paracetamol, but do not take ibuprofen in the first 48 hours after an injury. Try to lose weight if you're overweight. Do not carry anything heavy, and do not completely stop moving the affected joint.",
    "Infected wound": "If bacteria or other pathogens enter a wound, an infection can occur. Symptoms or signs of wound infection include increasing pain, swelling, and redness. More severe infections may cause nausea, chills, or fever. A person may be able to treat minor wound infections at home. However, people with more severe or persistent wound infections should seek medical attention, particularly those with other symptoms such as fever, feeling unwell, or discharge and red streaks from the wound.",
    "Knee pain": "You may benefit from a consultation with an orthopedic surgeon or sports medicine specialist for your knee pain.",
    "Cough": "A primary care physician can evaluate your cough and recommend appropriate treatment.",
    "Feeling dizzy": "Consulting a neurologist or an ear, nose, and throat (ENT) specialist may help identify the cause of your dizziness.",
    "Muscle pain": "Consider seeing a physical therapist or sports medicine specialist to address your muscle pain.",
    "Heart hurts": "Seek immediate medical attention by visiting the emergency room or calling emergency services if you experience chest pain or discomfort.",
    "Ear ache": "You should see an otolaryngologist (ENT doctor) for evaluation and treatment of your earache.",
    "Hair falling out": "Consulting a dermatologist can help identify potential causes and treatment options for hair loss.",
    "Head ache": "See a primary care physician to determine the cause of your headaches and discuss treatment options.",
    "Feeling cold": "If you're experiencing persistent cold sensations, it's important to see a doctor to rule out underlying medical conditions.",
    "Skin issue": "Consider consulting a dermatologist for evaluation and management of your skin issues.",
    "Stomach ache": "Depending on the severity and duration of your stomach ache, you may need to see a gastroenterologist or a primary care physician.",
    "Back pain": "A visit to a spine specialist or physical therapist can help address your back pain.",
    "Neck pain": "Consulting an orthopedic specialist or physical therapist may help alleviate your neck pain.",
    "Internal pain": "It's important to see a doctor for further evaluation if you're experiencing internal pain, as it could be a sign of an underlying condition.",
    "Blurry vision": "You should see an optometrist or ophthalmologist for an eye examination to determine the cause of your blurry vision.",
    "Body feels weak": "A visit to a primary care physician or neurologist may help identify the cause of your weakness.",
    "Hard to breath": "Seek immediate medical attention if you're having difficulty breathing. Visit the emergency room or call emergency services.",
    "Emotional pain": "Consider seeing a mental health professional such as a psychiatrist or therapist to address your emotional pain.",
    "Injury from sports": "You may benefit from evaluation and treatment by a sports medicine specialist or orthopedic surgeon.",
    "Foot ache": "Consulting a podiatrist can help identify the cause of your foot pain and recommend appropriate treatment.",
    "Open wound": "It's important to clean and dress open wounds properly. If the wound is severe, seek medical attention from a healthcare professional."
}

In [None]:
answer = doctor_answers[intent]
print(answer)

In [None]:
import os
import requests

def getResponse(answer):
    endpoint = 'https://api.together.xyz/v1/chat/completions'
    # Read the API key from the environment instead of hard-coding it in the notebook.
    # The variable name TOGETHER_API_KEY is a choice made here; set it before running this cell.
    api_key = os.environ["TOGETHER_API_KEY"]
    res = requests.post(endpoint, json={
    "model": "meta-llama/Llama-2-70b-chat-hf",
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "repetition_penalty": 1,
    "stop": [
        "[/INST]",
        "</s>"
    ],
    "messages": [
        {
            "content": f"Reformulate the following answer as a doctor bot speaking to a patient. Be brief and concise, and do not add any information beyond the answer I provided. Answer: {answer}",
            "role": "user"
        }
    ]
    }, headers={
    "Authorization": f"Bearer {api_key}",
    })
    res.raise_for_status()  # Fail loudly on HTTP errors instead of on a missing key below.

    print(res.json()['choices'][0]['message']['content'])

In [None]:
getResponse(answer)
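
An API key should not be stored in a notebook, because anyone who can open the file can read it. The sketch below shows one way to keep the secret out of the source: the request is assembled by a small function that receives the key as an argument, and the key is read from an environment variable. `build_chat_request`, the variable name `TOGETHER_API_KEY` and the `"demo-key"` fallback are assumptions made for illustration, and nothing is sent over the network.

```python
import os

def build_chat_request(answer, api_key):
    """Return (headers, payload) for a chat-completion call, without sending it."""
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "model": "meta-llama/Llama-2-70b-chat-hf",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": "Reformulate the following answer as a doctor bot speaking "
                       f"to a patient. Be brief and add no new information: {answer}",
        }],
    }
    return headers, payload

# TOGETHER_API_KEY is an assumed variable name; "demo-key" is a placeholder fallback.
api_key = os.environ.get("TOGETHER_API_KEY", "demo-key")
headers, payload = build_chat_request("Rest the joint and apply ice.", api_key)
print(payload["messages"][0]["content"])
```

Keeping the key in the headers only, and the headers out of any printed output, means the notebook can be shared without leaking the credential.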

In [None]:
def main_loop():
    while True:
        input_text = input('Enter your symptom (or type quit to exit): ')
        if input_text.lower() == 'quit':
            print("Thank you for using the medical chatbot. Goodbye!")
            break
        else:
            intent = medical_symptom_detector(input_text)  # Detect the medical symptom
            if intent in doctor_answers:
                answer = doctor_answers[intent]  # Retrieve doctor's answer corresponding to the detected symptom
                print("Doctor's response:", answer)  # Print the doctor's answer
            else:
                print("Sorry, I couldn't understand your symptom.")

# Call the main_loop function to start the chatbot interface
main_loop()
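
Separating the per-message logic from the `input()` loop makes it possible to exercise the chatbot without a keyboard or a loaded model. A minimal sketch under that assumption follows; `respond`, `fake_detector` and `fake_answers` are hypothetical stand-ins for the loop body, the trained detector and the answer table.

```python
def respond(text, detector, answers):
    """Return the chatbot reply for one message, or None when the user quits."""
    if text.strip().lower() == "quit":
        return None
    intent = detector(text)
    # Fall back to a generic message when the intent has no prepared answer.
    return answers.get(intent, "Sorry, I couldn't understand your symptom.")

# Stand-in for the trained model: a keyword check instead of BERT.
def fake_detector(text):
    return "Cough" if "cough" in text.lower() else "Unknown"

fake_answers = {"Cough": "A primary care physician can evaluate your cough."}

for message in ["I keep coughing at night", "my elbow clicks", "quit"]:
    print(repr(respond(message, fake_detector, fake_answers)))
```

In the notebook, the same function could be called with `medical_symptom_detector` and `doctor_answers` from inside the `while` loop.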
