<a href="https://colab.research.google.com/github/ArmaanMistry/medical-advise-llama2-finetuning/blob/main/medical2_finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 --progress-bar off

In [None]:
!pip install datasets rouge_score -q

In [None]:
import json
import re
from pprint import pprint

import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

In [None]:
!huggingface-cli login

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
dataset = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k")
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 112165
    })
})

In [None]:
dataset['train'][11220]

{'instruction': "If you are a doctor, please answer the medical questions based on the patient's description.",
 'input': 'dr two days before my pet bitten me. in govt hospital thet took tt injection,IDRV INJECTION,and told to take equiralo injection. atd there a reaction for equiralo.adviced for HRIG VACCINE bUT I DIDNT TAKE .iS ANY PROBLEM FOR THIS.mY DOG IS VACCINATED ONE YEAR BACK. tHE BITE WAS APROVOKED BITE,?',
 'output': "Hello, Welcome to Chat Doctor, Rabies is a 100% fatal disease but 100% preventable by proper and adequate treatment. Dog is the known reservoir of rabies virus and can transmit rabies by biting. As you were bitten by your pet dog for which your doctor has advised In TT, anti rabies vaccine by intradermal route and passive immunization by equine rabies immunoglobulin (ERIC) around the wound.ERIC is advised if there is bleeding from the site of bite. As you were bitten by your pet dog, I would suggest you to take three doses of anti rabies vaccine and to watch th

In [None]:
dataset['train'][11220]['output']

"Hello, Welcome to Chat Doctor, Rabies is a 100% fatal disease but 100% preventable by proper and adequate treatment. Dog is the known reservoir of rabies virus and can transmit rabies by biting. As you were bitten by your pet dog for which your doctor has advised In TT, anti rabies vaccine by intradermal route and passive immunization by equine rabies immunoglobulin (ERIC) around the wound.ERIC is advised if there is bleeding from the site of bite. As you were bitten by your pet dog, I would suggest you to take three doses of anti rabies vaccine and to watch the dog for 10 days. If it is healthy you don't require serum. If dog develops any symptoms of rabies you need to take equine rabies immunoglobulin (ERIC) around the wound. Thank you."

In [None]:
# Defining prompt

DEFAULT_SYSTEM_PROMPT = """
You are a doctor, please answer the medical questions based on the patient's description.
""".strip()

In [None]:
# Prompt for training

def generate_training_prompt(
    input: str, response: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}

### Input:
{input.strip()}

### Response:
{response}
""".strip()

In [None]:
def generate_text(data_point):
  input_text = data_point['input']
  response_text = data_point['output']
  return {
      "input": input_text,
      "response": response_text,
      "text": generate_training_prompt(input_text, response_text),
  }

In [None]:
example = generate_text(dataset['train'][1])

In [None]:
example

{'input': 'My baby has been pooing 5-6 times a day for a week. In the last few days it has increased to 7 and they are very watery with green stringy bits in them. He does not seem unwell i.e no temperature and still eating. He now has a very bad nappy rash from the pooing ...help!',
 'response': 'Hi... Thank you for consulting in Chat Doctor. It seems your kid is having viral diarrhea. Once it starts it will take 5-7 days to completely get better. Unless the kids having low urine output or very dull or excessively sleepy or blood in motion or green bilious vomiting...you need not worry. There is no need to use antibiotics unless there is blood in the motion. Antibiotics might worsen if unnecessarily used causing antibiotic associated diarrhea. I suggest you use zinc supplements (Z&D Chat Doctor.',
 'text': "### Instruction: You are a doctor, please answer the medical questions based on the patient's description.\n\n### Input:\nMy baby has been pooing 5-6 times a day for a week. In the

In [None]:
def process_dataset(data: Dataset):
    return (
        data.shuffle(seed=42)
        .map(generate_text)
    )

In [None]:
train_subset = dataset["train"].select(range(1000))
validation_subset = dataset["train"].select(range(1000, 1100))
test_subset = dataset["train"].select(range(1100, 1200))

dataset["train"] = process_dataset(train_subset)
dataset["validation"] = process_dataset(validation_subset)
dataset["test"] = process_dataset(test_subset)


In [None]:
dataset["train"] = dataset["train"].remove_columns(
    [
        "output",
        "instruction"
    ])

In [None]:
dataset["validation"] = dataset['validation'].remove_columns([
    "output",
    "instruction"
])

In [None]:
dataset["test"] = dataset['test'].remove_columns([
    "output",
    "instruction"
])

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input', 'response', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['input', 'response', 'text'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input', 'response', 'text'],
        num_rows: 100
    })
})

In [None]:
def create_model_and_tokenizer():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        use_safetensors=True,
        quantization_config=bnb_config,
        trust_remote_code=True,
        device_map="auto",
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    return model, tokenizer

In [None]:
model, tokenizer = create_model_and_tokenizer()
model.config.use_cache = False




In [None]:
model.config.quantization_config.to_dict()

{'load_in_8bit': False,
 'load_in_4bit': True,
 'llm_int8_threshold': 6.0,
 'llm_int8_skip_modules': None,
 'llm_int8_enable_fp32_cpu_offload': False,
 'llm_int8_has_fp16_weight': False,
 'bnb_4bit_quant_type': 'nf4',
 'bnb_4bit_use_double_quant': False,
 'bnb_4bit_compute_dtype': 'float16'}

In [None]:
lora_r = 16
lora_alpha = 64
lora_dropout = 0.1
lora_target_modules = [
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    "down_proj",
    "gate_proj",
    "v_proj",
]

peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)
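As a rough sanity check on the LoRA settings above, the sketch below counts how many trainable parameters the adapter adds. It assumes the published Llama-2-7B shapes (hidden size 4096, MLP intermediate size 11008, 32 decoder layers), none of which are stated in this notebook, and that each targeted linear layer gains two low-rank matrices of shapes r x d_in and d_out x r. The helper name `lora_param_count` is made up for illustration.

```python
# Assumed Llama-2-7B dimensions (not stated in the notebook itself).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

# (input features, output features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes, num_layers):
    """Hypothetical helper: LoRA adds A (r x d_in) and B (d_out x r) per target module."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


lora_r = 16
lora_alpha = 64

trainable = lora_param_count(lora_r, TARGET_SHAPES, NUM_LAYERS)
scaling = lora_alpha / lora_r  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapter is a small fraction of the roughly 7 billion base weights, which is what makes training feasible on a single Colab GPU.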

In [None]:
OUTPUT_DIR = "experiments"

In [None]:
training_arguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,
    output_dir=OUTPUT_DIR,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)





In [None]:
trainer.train()



Step,Training Loss,Validation Loss
13,2.2564,2.277342
26,1.8896,2.174518
39,1.8703,2.146284
52,2.0645,2.135821


TrainOutput(global_step=62, training_loss=2.1731523929103727, metrics={'train_runtime': 1214.8955, 'train_samples_per_second': 0.823, 'train_steps_per_second': 0.051, 'total_flos': 6036047310716928.0, 'train_loss': 2.1731523929103727, 'epoch': 0.99})
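The step numbers in the log above can be reproduced from the training arguments. The sketch below is plain arithmetic, not trainer code; it assumes a single GPU, that an incomplete gradient-accumulation group at the end of the epoch does not count as an optimizer step, and that a fractional `eval_steps` is read as a share of the total steps, rounded up.

```python
import math

num_train_examples = 1000
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 1
eval_steps_ratio = 0.2

# Batches the dataloader yields per epoch (single device assumed).
batches_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)

# One optimizer step per full accumulation group (assumption: the leftover group is dropped).
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs

# Evaluation interval derived from the fractional eval_steps setting.
eval_every = math.ceil(total_steps * eval_steps_ratio)
eval_points = list(range(eval_every, total_steps + 1, eval_every))

# Effective batch size seen by each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

print(effective_batch_size)
print(total_steps)
print(eval_points)
```

With these settings the sketch yields 62 optimizer steps and evaluation at steps 13, 26, 39, and 52, which agrees with the table and `global_step` reported above.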

Save the model --code start--

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
model_save_path = "/content/drive/My Drive/lora_finetuned_medical_model"

import os
# Create the directory if it doesn't exist
os.makedirs(model_save_path, exist_ok=True)

In [None]:
import torch

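# Save the model state_dict to a file inside the save directory
# (torch.save needs a file path, and model_save_path is a directory)
torch.save(model.state_dict(), f"{model_save_path}/model.bin")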

In [None]:
trainer.model.save_pretrained(model_save_path)

In [None]:
# from peft import PeftModel
# model = PeftModel.from_pretrained(model, model_save_path) # Wrap your model with PeftModel for saving

ValueError: Can't find 'adapter_config.json' at '/content/drive/My Drive/lora_finetuned_medical_model'

Load the model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Path where the LoRA adapter was saved (same directory as in the save step)
model_save_path = "/content/drive/My Drive/lora_finetuned_medical_model"

# Load the base model; the save directory holds only the adapter weights
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, model_save_path)

# Load the tokenizer from the base model, since it was not saved with the adapter
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Save the model --code end--

In [None]:
def generate_prompt(
    input: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}

### Input:
{input.strip()}

### Response:
""".strip()
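The inference prompt is meant to be the training prompt with the answer left off. The standalone sketch below re-creates both templates under hypothetical names (`build_training_prompt`, `build_inference_prompt`, and a made-up question and answer) and checks that relationship. It also shows why a generated answer would be expected to start with a newline: the stripped inference prompt ends right after `### Response:`.

```python
SYSTEM_PROMPT = (
    "You are a doctor, please answer the medical questions "
    "based on the patient's description."
)


def build_training_prompt(question, answer):
    # Same layout as the training template: instruction, input, response.
    return f"""### Instruction: {SYSTEM_PROMPT}

### Input:
{question.strip()}

### Response:
{answer}
""".strip()


def build_inference_prompt(question):
    # Same layout, but the response section is left empty for the model to fill in.
    return f"""### Instruction: {SYSTEM_PROMPT}

### Input:
{question.strip()}

### Response:
""".strip()


# Made-up example values, used only to exercise the templates.
question = "  I have had a mild headache since yesterday.  "
answer = "Please rest and drink water."

train_text = build_training_prompt(question, answer)
infer_text = build_inference_prompt(question)

# The inference prompt should be an exact prefix of the training text.
is_prefix = train_text.startswith(infer_text)
remainder = train_text[len(infer_text):]

print(is_prefix)
print(repr(remainder))
```

If the two templates ever drift apart (extra whitespace, a renamed section header), the prefix check fails, which is a cheap way to catch a train/inference mismatch.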

In [None]:
dataset["train"][5]

{'input': 'i am 33 yr old unmarried girl, weight 45 kg, height 5.3, my skin seems to be very dull and blackheads on face appears, what to do',
 'response': 'Hi, Welcome to Chat Doctor. You seem to be underweight due to which your skin is dull. You need to have lots of fresh fruits and vegetables along with healthy nutritious food which help you to improve your skin texture as well as your hair. Consume a lot of water which will HY Chat Doctor.  You can also use based mixed with HALDE or lemon with honey on your face. A facial scrub will help you remove blackheads. Have a good health.',
 'text': "### Instruction: You are a doctor, please answer the medical questions based on the patient's description.\n\n### Input:\ni am 33 yr old unmarried girl, weight 45 kg, height 5.3, my skin seems to be very dull and blackheads on face appears, what to do\n\n### Response:\nHi, Welcome to Chat Doctor. You seem to be underweight due to which your skin is dull. You need to have lots of fresh fruits an

In [None]:
examples = []

for data_point in dataset["test"].select(range(5)):
    input_text = data_point['input']
    response_text = data_point['response']
    examples.append(
        {
            "input": input_text,
            "response": response_text,
            "prompt": generate_prompt(input_text),
        }
    )
example_df = pd.DataFrame(examples)
example_df

Unnamed: 0,input,response,prompt
0,"Hi, I ve been coughing for nearly two months n...",Thanks for your question on Chat Doctor. I can...,"### Instruction: You are a doctor, please answ..."
1,Blood sugar has been high for several days now...,Hello dear user! I have gone through your quer...,"### Instruction: You are a doctor, please answ..."
2,"hellooo,doctor m sufferring 4m pimples since i...","HAI, Welcome to Chat Doctor. Pimple at this ag...","### Instruction: You are a doctor, please answ..."
3,I have 10 months old child when he was for day...,No not at all. Crawling usually started on 3 -...,"### Instruction: You are a doctor, please answ..."
4,"sir, I had taken aten-25 last October-10 for 1...",Hellothanks for posting here. As per your stat...,"### Instruction: You are a doctor, please answ..."


In [None]:
from transformers import pipeline, Conversation

In [None]:
def give_advice_conversational(model, text: str):
    # Tokenize the prompt and move the tensors to the target device
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    inputs_length = len(inputs["input_ids"][0])
    with torch.inference_mode():
        outputs = model.generate(**inputs,
                                 max_new_tokens=256,
                                 temperature=0.5,
                                 no_repeat_ngram_size=3)

    # Decode only the newly generated tokens, skipping the prompt tokens
    generated_text = tokenizer.decode(outputs[0][inputs_length:], skip_special_tokens=True)

    # Phrases that mark the end of an answer in the ChatDoctor data
    stop_words = ['Chat Doctor.', 'Chat Doctor']

    # Cut the text right after the first stop phrase that is found
    for word in stop_words:
        if word in generated_text:
            generated_text = generated_text.split(word)[0] + word
            break

    return generated_text
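The stop-phrase step at the end of the function can be tried without a model. The sketch below isolates that logic under a hypothetical name, `truncate_at_stop_phrase`, and runs it on made-up strings. Note that the phrases are tried in list order, so when the longer phrase `Chat Doctor.` occurs anywhere in the text, the cut happens at its first occurrence even if the shorter `Chat Doctor` appears earlier.

```python
def truncate_at_stop_phrase(text, stop_phrases):
    """Cut text right after the first listed phrase that occurs in it."""
    for phrase in stop_phrases:
        if phrase in text:
            # Keep everything before the first occurrence, then re-append the phrase.
            return text.split(phrase)[0] + phrase
    return text


stop_phrases = ["Chat Doctor.", "Chat Doctor"]

# Made-up generations used only to exercise the helper.
with_sign_off = "Drink fluids and rest. Thanks for using Chat Doctor. Hello again"
greeting_first = "Welcome to Chat Doctor, take rest. Thanks, Chat Doctor. More text"
no_sign_off = "No sign-off here."

print(truncate_at_stop_phrase(with_sign_off, stop_phrases))
print(truncate_at_stop_phrase(greeting_first, stop_phrases))
print(truncate_at_stop_phrase(no_sign_off, stop_phrases))
```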

## Examples

### Example 1

In [None]:
example = example_df.iloc[0]
pprint(example.response)

('Thanks for your question on Chat Doctor. I can understand your concern. You '
 'are having chronic cough (cough since 2 months). We should definitely search '
 'for the cause of this chronic cough and treat accordingly. Common causes for '
 'chronic cough are 1. Bronchitis 2. Asthma 3. Chronic lung infection like '
 'tuberculosis or fungal infection. 4. Lung cancer. So better to consult '
 'pulmonologist and get done 1. Clinical examination of respiratory system 2. '
 'Chest x-ray 3. PUT (pulmonary function test) 4. CT thorax if required. So '
 'your next step should be to consult pulmonologist and discuss all these. '
 'Hope I have solved your query. I will be happy to help you further. Wish you '
 'good health. Thanks.')


In [None]:
pprint(example.input)

('Hi, I ve been coughing for nearly two months now, i don t feel sick at all. '
 'I ve just been coughing. No mucus or anything out of the ordinary. Breathing '
 'is also difficult when my coughing spells are bad. As time goes on, I m '
 'cough much less frequently, but I still have a pretty heavy smoker-like '
 'cough. Is what I have serious? Why am I still coughing? And what would be '
 'the best thing to do next? Thanks, Mateusz Majka')


In [None]:
print(example.prompt)

### Instruction: You are a doctor, please answer the medical questions based on the patient's description.

### Input:
Hi, I ve been coughing for nearly two months now, i don t feel sick at all. I ve just been coughing. No mucus or anything out of the ordinary. Breathing is also difficult when my coughing spells are bad. As time goes on, I m cough much less frequently, but I still have a pretty heavy smoker-like cough. Is what I have serious? Why am I still coughing? And what would be the best thing to do next? Thanks, Mateusz Majka

### Response:


In [None]:
%%time
advice = give_advice_conversational(model, example.prompt)



CPU times: user 1min 35s, sys: 30 s, total: 2min 5s
Wall time: 2min 6s


In [None]:
pprint(advice)

('\n'
 'Hello, Thanks for your query. I have gone through your query and understand '
 'your concern. I would suggest you to consult a pulmonologist for further '
 'evaluation. You may have an upper respiratory tract infection or an allergic '
 'bronchitis. You need to get a chest X-ray done to rule out any lung '
 'infection. You can take anti-allergic medication and anti-inflammatory '
 'medication to reduce the symptoms. You should also avoid smoking and take '
 'steam inhalation to relieve your symptoms and also to reduce your cough and '
 'breathing difficulty. Hope I have answered your query, if you have any '
 'further questions, please feel free to ask. Thanks for using Chat Doctor.')


In [None]:
generated_response1 = advice.strip().split("\n")[0]
pprint(generated_response1)

('Hello, Thanks for your query. I have gone through your query and understand '
 'your concern. I would suggest you to consult a pulmonologist for further '
 'evaluation. You may have an upper respiratory tract infection or an allergic '
 'bronchitis. You need to get a chest X-ray done to rule out any lung '
 'infection. You can take anti-allergic medication and anti-inflammatory '
 'medication to reduce the symptoms. You should also avoid smoking and take '
 'steam inhalation to relieve your symptoms and also to reduce your cough and '
 'breathing difficulty. Hope I have answered your query, if you have any '
 'further questions, please feel free to ask. Thanks for using Chat Doctor.')


### Example 2

In [None]:
example2 = example_df.iloc[1]
pprint(example2.response)

('Hello dear user! I have gone through your query and understood your '
 'concerns! Thank you for sharing them on Chat Doctor. Such levels of blood '
 'sugar, excessive urination and thirst you are experiencing lately indicate '
 'an uncontrolled diabetes. Unfortunately metformin you are taking seems to be '
 'unable to control blood sugar by itself. You will need a combination of Chat '
 'Doctor.  I would recommend you to contact your endocrinologist as soon as '
 "possible to help you get the proper cure and lower the sugar level. Don't "
 "stay at home waiting if it levels out. It won't. I hope this answer was "
 'helpful to you! Please kindly rate it as helpful and write a short review '
 'about your experience with me! I would appreciate that a lot. Thank you and '
 'best regards!')


In [None]:
%%time
advice2 = give_advice_conversational(model, example2.prompt)

CPU times: user 1min 27s, sys: 25.2 s, total: 1min 52s
Wall time: 1min 53s


In [None]:
pprint(advice2)

('\n'
 'Hi, Thanks for your query. I can understand your concern. I would suggest '
 'you to consult your doctor and get your blood sugar level checked. It may be '
 'due to some infection or some other reason. So, consult your physician and '
 'get it checked. Hope I have answered your query, if you have any further '
 'query, please feel free to ask. Thanks for using Chat Doctor.')


In [None]:
generated_response2 = advice2.strip().split("\n")[0]
pprint(generated_response2)

('Hi, Thanks for your query. I can understand your concern. I would suggest '
 'you to consult your doctor and get your blood sugar level checked. It may be '
 'due to some infection or some other reason. So, consult your physician and '
 'get it checked. Hope I have answered your query, if you have any further '
 'query, please feel free to ask. Thanks for using Chat Doctor.')


### Example 3

In [None]:
example3 = example_df.iloc[2]
pprint(example3.response)

('HAI, Welcome to Chat Doctor. Pimple at this age is due to excessive activity '
 'of your oil glands of your skin due to bubbling hormonal action. Up to the '
 'age of 25, it is an ongoing process. After the hormonal tides start reduced '
 'your problem will vanisChatDoctorpletely. So the treatment cannot totally '
 'eradicate the appearance of pimples whereas it will help to reduce the '
 'intense to the maximum extent.  Frequent face wash, consuming adequate '
 'water, vegetables, fruits, greens and avoiding foods containing butter '
 'cheese, ghee, and antipathy etc. will help you to keep this menace well '
 'under control along with specific medical treatment for pimples. Medical '
 'treatment has a lot of choice according to the severity, your skin and type '
 'of lesions from topical preparations, oral antibiotics, soaps etc. Your '
 'dermatologist will guide you in this regard for a better treatment of choice '
 'to suit your present condition. Following all these measures toge

In [None]:
%%time
advice3 = give_advice_conversational(model, example3.prompt)

CPU times: user 1min 33s, sys: 29.8 s, total: 2min 3s
Wall time: 2min 4s


In [None]:
pprint(advice3)

('\n'
 'Hi, Thanks for your query. I understand your concern. Pimples are caused by '
 'hormonal changes in the body. You can use benzoyl peroxide cream or gel to '
 'treat pimple. It is available in the market. You should use it twice a day '
 'for 2-3 weeks. It will help you to get rid of pimbles. You also use '
 'antibiotic cream to treat it. You have to use it for 1-2 weeks. You will get '
 'rid from pimble. You must avoid picking pimles. It can cause infection. You '
 'take warm water bath. It helps to reduce pimle. You use face wash. It also '
 'helps to get ride of pimply. You avoid oily food. It causes pimly. Hope I '
 'have answered your query, if you have any further query, please let me know. '
 'Thanks. Wish you good health. Take care. Chat Doctor.')


### Example 4

In [None]:
example4 = example_df.iloc[3]
pprint(example4.response)

('No not at all. Crawling usually started on 3 -6 months of age. At 6 month '
 'babies are usually sitting on the floor without any support and around 10 '
 'months of age they try to stand and walk with some support. According to '
 'your history the baby shows appropriate developmental stage. So do not '
 'worry. Fits may be due to febrile convulsions which may not need any '
 'treatment and usually resolve after 5 year of age.(But history is not '
 'sufficient to tell whether it is febrile convulsions)')


In [None]:
%%time
advice4 = give_advice_conversational(model, example4.prompt)

CPU times: user 1min 29s, sys: 26.1 s, total: 1min 55s
Wall time: 1min 55s


In [None]:
pprint(advice4)

('\n'
 'Hi, Thanks for your query. I have gone through your query and understand '
 'your concern. I would like to tell you that your child is having a '
 'neurological problem. He is having seizures and it is not a normal thing. It '
 'is a neonatal seizure disorder. It can be caused by various reasons like '
 'infection, fever, metabolic disorder, genetic disorder etc. It needs to be '
 'treated with anticonvulsants. I suggest you to consult a nephrologist and '
 'get a detailed examination done. Hope I have answered your query, if you '
 'have any further questions, please feel free to ask. Wish you a good health. '
 'Thanks. Chat Doctor.')


### Example 5

In [None]:
example5 = example_df.iloc[4]
pprint(example5.response)

('Hellothanks for posting here. As per your statement, a pulse rate of 50-70 '
 'are quite within the normal range. You seem to have missed beats every 20 '
 '-25 beats which may be normal in some people. But if it is causing you '
 'symptoms like dizziness, or loss of consciousness then they are a matter of '
 'concern. Also, we must know what type of missed beats they are. So if it is '
 'troubling you too much, a holder monitoring which is a 24 hr heart EKG '
 'monitoring should be done. Also a stress test should be done which will show '
 'if these missed beats increase with exercise. Atenolol and its group of Chat '
 'Doctor.  You can be started on half dose of 25 mph atenolol only if you are '
 'having symptoms or if any of the above tests come positive. Thank you')


In [None]:
%%time
advice5 = give_advice_conversational(model, example5.prompt)

CPU times: user 58.6 s, sys: 17.4 s, total: 1min 16s
Wall time: 1min 16s


In [None]:
pprint(advice5)

('\n'
 'Hi, Thanks for your query. I have gone through your query and understand '
 'your concern. I would suggest you to consult a cardiologist for this. You '
 'should consult a doctor and get your ECG done. If your EEG is normal, then '
 'you should consult an electrophysiologist. You can consult a general '
 'physician and get a routine ECG and EEG done. Hope I have answered your '
 'query, if you have any further questions, please feel free to ask. Wish you '
 'a good health. Thanks for using Chat Doctor.')


### Example 6 (Out-of-Domain Input)

In [None]:
example6 = """
### Instruction: You are a doctor. Please follow these rules while answering medical questions based on the patient's description given below:
- Only answer questions that are directly related to health, medicine, or medical advice.
- If the question is outside your scope of medical knowledge, do not attempt to provide an answer.
- For non-medical inquiries or topics unrelated to health, clearly state that you don't know or cannot respond.
- Avoid providing advice on non-medical topics, even if the question seems partially related.
- If unsure about the medical relevance of a question, default to saying you cannot respond.

Examples:
1. On-topic response:
   - Input: "My baby has been pooing 5-6 times a day for a week. In the last few days it has increased to 7 and they are very watery with green stringy bits in them. He does not seem unwell i.e no temperature and still eating. He now has a very bad nappy rash from the pooing ...help!"
   - Response: "Hi... Thank you for consulting in Chat Doctor. It seems your kid is having viral diarrhea. Once it starts it will take 5-7 days to completely get better. Unless the kids having low urine output or very dull or excessively sleepy or blood in motion or green bilious vomiting...you need not worry. There is no need to use antibiotics unless there is blood in the motion. Antibiotics might worsen if unnecessarily used causing antibiotic associated diarrhea. I suggest you use zinc supplements. Chat Doctor."

2. Off-topic response:
   - Input: "Hi, I've been experiencing issues with my car lately. It makes a strange noise whenever I accelerate, and the engine seems to be losing power. I tried changing the oil and checking the tire pressure, but nothing seems to help. Do you know what might be causing this problem, and how can I fix it?"
   - Response: "I cannot respond to such questions as they are not related to health or medicine."

### Input:
Hi, I've been experiencing issues with my car lately. It makes a strange noise whenever I accelerate, and the engine seems to be losing power. I tried changing the oil and checking the tire pressure, but nothing seems to help. Do you know what might be causing this problem, and how can I fix it? Thanks, John.

### Response:
""".strip()

In [None]:
%%time
advice6 = give_advice_conversational(model, example6)

CPU times: user 37.2 s, sys: 16.4 s, total: 53.6 s
Wall time: 53.9 s


In [None]:
pprint(advice6)

('\n'
 'I cannot answer your question as it is not related with health or medical '
 'issues. I would suggest you to consult a mechanic or a car expert. They will '
 'be able to diagnose and fix the problem. Hope I have answered your question. '
 'Let me know if I can assist you further. Thanks.')


## Comparing real and generated responses

In [None]:
# Examples 3-5 were never post-processed above, so clean them the same way as 1 and 2
generated_response3 = advice3.strip().split("\n")[0]
generated_response4 = advice4.strip().split("\n")[0]
generated_response5 = advice5.strip().split("\n")[0]

final_example = [[example.response, generated_response1],
                 [example2.response, generated_response2],
                 [example3.response, generated_response3],
                 [example4.response, generated_response4],
                 [example5.response, generated_response5]]

df = pd.DataFrame(final_example, columns=['Real Response', 'Generated Response'])

display(df)
df.to_csv('new_generated_responses.csv', index=False)

## Calculate ROUGE Score

In [None]:
real_response_column = df['Real Response'].tolist()
generated_response_column = df['Generated Response'].tolist()

In [None]:
import locale

# Workaround for "A UTF-8 locale is required" when running shell commands in Colab
locale.getpreferredencoding = lambda: "UTF-8"

!pip install datasets rouge_score -q

In [None]:
!pip install evaluate -q

In [None]:
import evaluate

# Loading the ROUGE metric
rouge = evaluate.load('rouge')

# Calculating the ROUGE score for the entire test set
results = rouge.compute(predictions=generated_response_column, references=real_response_column)

In [None]:
# Display the results
for key, value in results.items():
  print(f"{key}: {value}")
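To make the reported numbers easier to interpret, here is a minimal hand-rolled ROUGE-1 F1. It is a simplified sketch, not the implementation behind `evaluate.load('rouge')`: it assumes lowercase whitespace tokenization with no stemming, so its scores will not match the library exactly. The function name `rouge1_f1` and the two sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: lowercase, whitespace tokens, no stemming."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped unigram matches: each token counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "consult a pulmonologist and get a chest x-ray"
prediction = "get a chest x-ray and consult a doctor"

score = rouge1_f1(prediction, reference)
print(round(score, 3))
```

Because ROUGE-1 only counts shared words, an answer can score well while recommending the wrong specialist, so the metric is best read alongside the side-by-side comparison rather than on its own.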

In [None]:
import os

# Define the model directory
model_dir = '/content/drive/MyDrive/medical-chat-fine-tuned'

# Create the directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

In [None]:
import torch

# Save the model state_dict
torch.save(model.state_dict(), f"{model_dir}/model.bin")

In [None]:
from transformers import LlamaTokenizer

# Save the tokenizer
tokenizer.save_pretrained(model_dir)

('/content/drive/MyDrive/medical-chat-fine-tuned/tokenizer_config.json',
 '/content/drive/MyDrive/medical-chat-fine-tuned/special_tokens_map.json',
 '/content/drive/MyDrive/medical-chat-fine-tuned/tokenizer.model',
 '/content/drive/MyDrive/medical-chat-fine-tuned/added_tokens.json',
 '/content/drive/MyDrive/medical-chat-fine-tuned/tokenizer.json')

In [None]:
# # To Load
# from google.colab import drive
# drive.mount('/content/drive')

# from transformers import LlamaForCausalLM, LlamaTokenizer

# model_dir = '/content/drive/MyDrive/medical-chat-fine-tuned'
# MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

# # Define the model
# model = LlamaForCausalLM.from_pretrained(MODEL_NAME)  # Use the base model name here
# model.load_state_dict(torch.load(f"{model_dir}/model.bin"))
# model.eval()  # Set the model to evaluation mode

# # Load the tokenizer
# tokenizer = LlamaTokenizer.from_pretrained(model_dir)

## Identifying whether the input is medical or not

In [None]:
import numpy as np
import pandas as pd

In [None]:
mnm_dataset = pd.read_csv('medical_non_medical_questions_dataset.csv', lineterminator='\n')

In [None]:
mnm_dataset.head()

Unnamed: 0,questions,label
0,Who will do best at the World Cup . Trinidad o...,non-medical
1,I have recently had a MRI of my brain without ...,medical
2,DAE wish they had something to fight for? I fe...,non-medical
3,hello sir my kid is 4yrs old he has been treat...,medical
4,Hello im 18 years old and im always depressed ...,medical


In [None]:
mnm_dataset.shape

(46066, 2)

In [None]:
mnm_dataset.columns

Index(['questions', 'label'], dtype='object')

In [None]:
mnm_dataset['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
medical,25000
non-medical,21066


In [None]:
mnm_dataset.describe()

Unnamed: 0,questions,label
count,45990,46066
unique,45099,2
top,"Hello doctor,As I have PCOD problem and also c...",medical
freq,59,25000
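The counts in the two summaries above already reveal a few data-quality facts. The sketch below only redoes that arithmetic with the numbers copied from the outputs; the variable names are illustrative. Missing questions are worth knowing about because a plain `str()` conversion turns a missing value into the literal text `nan`.

```python
total_rows = 46066          # label count from describe()
non_null_questions = 45990  # question count from describe()
unique_questions = 45099    # unique questions from describe()
medical_rows = 25000        # from value_counts()
non_medical_rows = 21066    # from value_counts()

# Rows whose question text is missing.
missing_questions = total_rows - non_null_questions

# Rows that repeat a question already seen elsewhere in the file.
duplicate_questions = non_null_questions - unique_questions

# Accuracy of always predicting the larger class, a floor for any classifier.
majority_baseline = max(medical_rows, non_medical_rows) / total_rows

print(missing_questions)
print(duplicate_questions)
print(round(majority_baseline, 4))
```

Any classifier trained on this data should clearly beat the majority baseline, and the duplicated rows mean that a random split can place the same question in both the training and test sets.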


In [None]:
X = mnm_dataset.iloc[:, 0]
y = mnm_dataset.iloc[:, 1]

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((36852,), (36852,), (9214,), (9214,))

In [None]:
import re
import string

def clean_text(text):
  text = str(text)
  text = text.lower()

  # remove html tags
  html_removed = re.sub('<.*?>', '', text)

  # remove punctuation
  punctuation_removed = html_removed.translate(str.maketrans('', '', string.punctuation))

  # remove remaining characters that are not letters, digits, or whitespace
  # (A-Z, not A-z: the range A-z also matches [ \ ] ^ _ `)
  pattern = r'[^a-zA-Z0-9\s]'
  special_removed = re.sub(pattern, '', punctuation_removed)

  return special_removed

In [None]:
X_train[5]

'What should I order from a chinese restaurant if i am trying to diet?'

In [None]:
clean_text(X_train[5])

'what should i order from a chinese restaurant if i am trying to diet'

In [None]:
X_train = X_train.apply(clean_text)
X_test = X_test.apply(clean_text)

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# stop words
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Build these once instead of on every call
wordnet_lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def lemmatizer_text(text):
  words = word_tokenize(text)
  # lemmatize every token as a verb (pos='v')
  lemmatized_text = [wordnet_lemmatizer.lemmatize(word, pos='v') for word in words]

  # drop English stop words
  filtered_words = [word for word in lemmatized_text if word not in stop_words]
  return ' '.join(filtered_words)

In [None]:
X_train[10]

'we need a new list of intolerable acts such as the usa patriot act digital millennium copyright act and acta treaty being negotiated in secret government exists to protect life liberty and property property needs to be limited to physical items intellectual property is not property and is not limited to a single physical instance'

In [None]:
lemmatizer_text(X_train[10])

'need new list intolerable act usa patriot act digital millennium copyright act acta treaty negotiate secret government exist protect life liberty property property need limit physical items intellectual property property limit single physical instance'

In [None]:
X_train = X_train.apply(lemmatizer_text)
X_test = X_test.apply(lemmatizer_text)

### TF-IDF trained from dataset

In [None]:
# tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [None]:
X_train_tfidf.shape

(36852, 93767)

In [None]:
y_train.value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
medical,19927
non-medical,16925


In [None]:
y_train = y_train.map({'medical' : 1, 'non-medical' : 0})
y_test = y_test.map({'medical' : 1, 'non-medical' : 0})

In [None]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X_train_tfidf, y_train)

In [None]:
y_pred = lr_model.predict(X_test_tfidf)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

accuracy_score(y_test, y_pred)

0.974495333188626

In [None]:
confusion_matrix(y_test, y_pred)

array([[4031,  110],
       [ 125, 4948]])

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      4141
           1       0.98      0.98      0.98      5073

    accuracy                           0.97      9214
   macro avg       0.97      0.97      0.97      9214
weighted avg       0.97      0.97      0.97      9214



In [None]:
def identify_ques_type(question_asked):
  cleaned_question = clean_text(question_asked)
  lemmatized_question = lemmatizer_text(cleaned_question)
  question_tfidf = tfidf.transform([lemmatized_question])
  # use the logistic-regression classifier trained above (lr_model), not the LLM `model`
  question_type_pred = lr_model.predict(question_tfidf)[0]
  q_type = 'medical' if question_type_pred == 1 else 'non-medical'
  return q_type

In [None]:
# medical
example_ques_1 = "I am suffering from frequent headaches and dizziness. What could be causing this, and what should I do to feel better?"

identify_ques_type(example_ques_1)

'medical'

In [None]:
# non-medical
example_ques_2 = "I’ve been having a hard time managing my personal finances lately. Despite budgeting, I find it difficult to save money or stick to my financial plan. Do you have any tips or strategies to help me improve my financial management and avoid unnecessary expenses?"

identify_ques_type(example_ques_2)

'non-medical'

In [None]:
# non-medical, but the TF-IDF classifier labels it 'medical'
example_ques_3 = "Hi, I've been experiencing issues with my car lately. It makes a strange noise whenever I accelerate, and the engine seems to be losing power. I tried changing the oil and checking the tire pressure, but nothing seems to help. Do you know what might be causing this problem, and how can I fix it? Thanks, John."

identify_ques_type(example_ques_3)

'medical'

In [None]:
# non-medical, but the TF-IDF classifier labels it 'medical'
example_ques_4 = "I am having problem with my maths assignment. I have been working on a problem to add two numbers for last 3 days, but still can't figure out how to get the final answer. Can you please help me with this? And to whom do I consult for the solution to this problem?"

identify_ques_type(example_ques_4)

'medical'

### Word Similarity

In [None]:
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Tokenizing text for Word2Vec
def tokenize(text):
    return simple_preprocess(text)

In [None]:
X_train_tokens = X_train.apply(tokenize)
X_test_tokens = X_test.apply(tokenize)

In [None]:
word2vec_model = Word2Vec(sentences=X_train_tokens, vector_size=200, window=10, min_count=1)

In [None]:
# Function to get word vector for a sentence or keyword
def get_vector(text, model=word2vec_model):
    tokens = text.split()
    word_vectors = np.zeros((model.vector_size,))
    count = 0
    for word in tokens:
        if word in model.wv:
            word_vectors += model.wv[word]
            count += 1
    if count > 0:
        word_vectors /= count
    return word_vectors
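
`get_vector` splits the raw input on whitespace, while the Word2Vec vocabulary was built from cleaned, lowercased text. Tokens such as `dizziness.` or `MRI` are therefore skipped without any warning, and a text with no hits comes back as an all-zero vector. A small sketch of that effect using a made-up two-dimensional embedding table (`toy_wv`, `mean_vector`, and `normalize` are hypothetical names, and `normalize` is only a rough stand-in for `clean_text`):

```python
import re

import numpy as np

# Hypothetical embeddings standing in for word2vec_model.wv; keys are lowercase, no punctuation.
toy_wv = {
    "headache": np.array([0.9, 0.1]),
    "dizziness": np.array([0.8, 0.2]),
    "mri": np.array([0.7, 0.3]),
    "car": np.array([0.1, 0.9]),
}

def mean_vector(tokens, wv, dim=2):
    """Average the vectors of in-vocabulary tokens; also report how many tokens matched."""
    hits = [wv[t] for t in tokens if t in wv]
    if not hits:
        return np.zeros(dim), 0
    return np.mean(hits, axis=0), len(hits)

def normalize(text):
    """Lowercase and strip non-alphanumeric characters before splitting."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()

raw = "Headache and dizziness. Had an MRI"
_, raw_hits = mean_vector(raw.split(), toy_wv)
_, norm_hits = mean_vector(normalize(raw), toy_wv)
print(raw_hits, norm_hits)  # 0 3
```

Reporting the number of matched tokens alongside the vector makes it easy to spot inputs whose similarity score rests on very few words.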

In [None]:
health_keywords = [
    "symptom", "diagnosis", "treatment", "doctor", "medicine", "pain", "surgery", "infection", "therapy", "disease", "dizziness",
    "illness", "chronic", "fever", "allergy", "blood pressure", "diabetes", "anxiety", "depression", "MRI", "X-ray",
    "cough", "nausea", "vomiting", "fatigue", "pharmacy", "first aid", "chemotherapy", "cancer", "asthma", "dentist", "stroke", "fitness"
]

In [None]:
# Function to calculate similarity score between the sentence and the keywords
def calculate_similarity(sentence):
    sentence_vector = get_vector(sentence)
    highest_similarity = -1
    for keyword in health_keywords:
        keyword_vector = get_vector(keyword)
        similarity = cosine_similarity([sentence_vector], [keyword_vector])[0][0]
        if similarity > highest_similarity:
            highest_similarity = similarity
    return highest_similarity

In [None]:
# Function to classify and find the most similar keyword along with similarity score
def classify_sentence_with_keyword_similarity(sentence, model=word2vec_model, threshold=0.6):
    sentence_vector = get_vector(sentence, model)
    highest_similarity = -1
    most_similar_keyword = None

    for keyword in health_keywords:
        keyword_vector = get_vector(keyword, model)
        similarity = cosine_similarity([sentence_vector], [keyword_vector])[0][0]

        if similarity > highest_similarity:
            highest_similarity = similarity
            most_similar_keyword = keyword

    classification = 'medical' if highest_similarity > threshold else 'non-medical'

    print(f"Classification: {classification}")
    print(f"Most similar keyword: {most_similar_keyword}")
    print(f"Similarity score: {highest_similarity:.2f}")

    return classification

In [None]:
# non-medical
sentence = "I am having problem with my maths assignment. I have been working on a problem to add two numbers for last 3 days, but still can't figure out how to get the final answer. Can you please help me with this?"
classify_sentence_with_keyword_similarity(sentence)

Classification: non-medical
Most similar keyword: medicine
Similarity score: 0.49


'non-medical'

In [None]:
# medical:
sentence2 = "I am suffering from frequent headaches and dizziness. What could be causing this, and what should I do to feel better?"
classify_sentence_with_keyword_similarity(sentence2)

Classification: medical
Most similar keyword: nausea
Similarity score: 0.82


'medical'

In [None]:
# non-medical
sentence3 = "Hi, I've been experiencing issues with my car lately. It makes a strange noise whenever I accelerate, and the engine seems to be losing power. I tried changing the oil and checking the tire pressure, but nothing seems to help. Do you know what might be causing this problem, and how can I fix it? Thanks, John."
classify_sentence_with_keyword_similarity(sentence3)

Classification: non-medical
Most similar keyword: fitness
Similarity score: 0.50


'non-medical'

In [None]:
# non-medical, but it scores just above the 0.6 threshold and is classified as 'medical'
sentence4 = "I’ve been having a hard time managing my personal finances lately. Despite budgeting, I find it difficult to save money or stick to my financial plan. Do you have any tips or strategies to help me improve my financial management and avoid unnecessary expenses?"
classify_sentence_with_keyword_similarity(sentence4)

Classification: medical
Most similar keyword: fitness
Similarity score: 0.62


'medical'

In [None]:
# non-medical
sentence5 = "I've been trying to balance my schedule between studying for my exams and going out with friends. Lately, I've felt a bit overwhelmed with all the responsibilities. Should I consult someone to help me manage my time better, or is it just a matter of getting organized?"
classify_sentence_with_keyword_similarity(sentence5)

Classification: non-medical
Most similar keyword: symptom
Similarity score: 0.51


'non-medical'

In [None]:
# non-medical
sentence6 = "I’ve been having some serious issues with my car lately. It keeps making strange noises, especially when I go uphill, and I feel like it’s struggling to keep up. Should I be worried about this situation? Is there a specific type of expert I should consult to understand what might be going wrong with it? I just want to make sure everything is functioning properly to avoid any major breakdowns."
classify_sentence_with_keyword_similarity(sentence6)

Classification: non-medical
Most similar keyword: anxiety
Similarity score: 0.47


'non-medical'

## Restricting the model to healthcare-related questions

In [None]:
def ask_doctor(question_text):
  classified_ques_type = classify_sentence_with_keyword_similarity(question_text)
  if classified_ques_type == 'non-medical':
    return 'Sorry, I can only provide guidance on healthcare-related questions.'
  elif classified_ques_type == 'medical':
    prompt_text = generate_prompt(question_text)
    advice = give_advice_conversational(model, prompt_text)
    # advice = advice.strip().split("\n")[0]
    return advice

In [None]:
# non-medical
problem1 = "I am having problem with my maths assignment. I have been working on a problem to add two numbers for last 3 days, but still can't figure out how to get the final answer. Can you please help me with this?"
ask_doctor(problem1)

Classification: non-medical
Most similar keyword: medicine
Similarity score: 0.49


'Sorry, I can only provide guidance on healthcare-related questions.'

In [None]:
# medical
problem2 = "Hi, I ve been coughing for nearly two months now, i don t feel sick at all. I ve just been coughing. No mucus or anything out of the ordinary. Breathing is also difficult when my coughing spells are bad. As time goes on, I m cough much less frequently, but I still have a pretty heavy smoker-like cough. Is what I have serious? Why am I still coughing? And what would be the best thing to do next? Thanks, Mateusz Majka"
ask_doctor(problem2)

Classification: medical
Most similar keyword: nausea
Similarity score: 0.60



'\nHello, I have gone through your query and understand your concern. I would suggest you to consult a pulmonologist for a detailed evaluation. You may have an upper respiratory tract infection or an allergic bronchitis. You need to get a chest X-ray done to rule out any lung infection. You can take anti-allergic medication and anti-inflammatory medication to reduce the symptoms. You should also avoid smoking and take steam inhalation to relieve the symptom. Hope I have answered your query. Let me know if I can assist you further. Thanks. Wish you a good health. Chat Doctor.'

In [None]:
# medical
problem3 = "I've been experiencing persistent headaches for about six weeks now. They come and go, but when they do occur, they can be quite intense. I don't have any other symptoms like nausea or dizziness, and my vision seems fine. I often find it hard to concentrate when the headaches are at their worst. Should I be concerned about this?"
ask_doctor(problem3)

Classification: medical
Most similar keyword: dizziness
Similarity score: 0.78



'\nHi, Thanks for your query. I have gone through your query and understand your concern. I would suggest you to consult a neurologist for further evaluation. You may have migraine or tension headache. You need to consult neuro for further investigation and treatment. Hope I have answered your query, if you have any further query, please feel free to ask. Thanks for using Chat Doctor.'
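
The fixed threshold of 0.6 in `classify_sentence_with_keyword_similarity` can be checked against labelled examples. The sketch below sweeps a few candidate thresholds over the eight rounded similarity scores printed in this section. Eight points are far too few to pick a real threshold, and `accuracy_at` and the candidate grid are assumptions made for illustration:

```python
# (rounded score printed above, true label) with 1 = medical, 0 = non-medical
scored_examples = [
    (0.49, 0),  # maths assignment
    (0.82, 1),  # headaches and dizziness
    (0.50, 0),  # car trouble
    (0.62, 0),  # personal finances
    (0.51, 0),  # time management
    (0.47, 0),  # car trouble, second wording
    (0.60, 1),  # persistent cough
    (0.78, 1),  # persistent headaches
]

def accuracy_at(threshold, examples):
    """Share of examples classified correctly when score > threshold means medical."""
    correct = sum((score > threshold) == bool(label) for score, label in examples)
    return correct / len(examples)

# Candidate thresholds 0.40, 0.45, ..., 0.90, built from integers to avoid float drift.
candidates = [t / 100 for t in range(40, 91, 5)]
results = {t: accuracy_at(t, scored_examples) for t in candidates}
best_threshold = max(results, key=results.get)
print(best_threshold, results[best_threshold])
```

On these rounded scores no single threshold gets all eight right, because the cough question (0.60, medical) scores below the finance question (0.62, non-medical). A larger held-out set, for example `X_test` with `y_test`, would be a more reasonable basis for choosing the value.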