<h1>Question Answering System for Healthcare System</h1>

## Domain Intro:
Question-Answering Systems (QAS) are a type of information retrieval system that automatically answers questions posed by users in natural language. These systems aim to understand the semantics of the question and retrieve the most relevant information from a given corpus or knowledge base to provide accurate answers.

## Problem Introduction:
Building a question-answering system for a healthcare setting involves training a model to understand natural language questions and find the most appropriate answers in a given dataset or knowledge base. The task draws on natural language understanding, information retrieval, and text processing techniques.


This is a closed-domain problem: the system answers only from the passage it is given.

### Model Building

**Question Answering using BERT**

In [None]:
# 1: Install Required Libraries
!pip install transformers
!pip install torch


In [None]:
# 2: Importing required packages
from transformers import BertForQuestionAnswering  # pretrained BERT model for question answering
from transformers import BertTokenizer
import torch
import numpy as np

Internally, the BertTokenizer class works by implementing the tokenization process required for BERT models. Tokenization is the process of breaking down a piece of text into smaller units called tokens. For BERT, the tokenization process involves several key steps:

*   Basic Tokenization: The input text is first split into words and punctuation marks. For example, the sentence "Hello, how are you?" might be split into ["Hello", ",", "how", "are", "you", "?"].
*   WordPiece Tokenization: BERT further breaks down these words into smaller subword units called WordPieces. This helps BERT handle out-of-vocabulary (OOV) words and capture more meaningful subword units. For example, the word "playing" might be split into ["play", "##ing"].

*   Adding Special Tokens: BERT requires special tokens to mark the beginning of the input ([CLS]) and the end of each sentence ([SEP]). These tokens are added to the token list. For example, the tokens for the sentence pair "Sentence A" and "Sentence B" might be ["[CLS]", "Sentence", "A", "[SEP]", "Sentence", "B", "[SEP]"]. (An uncased model, like the one used below, also lowercases the text.)

*   Padding: To ensure that all input sequences have the same length, padding tokens ([PAD]) may be added to the end of the token list.
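
The four steps above can be sketched in plain Python. This is only a toy illustration, not the actual `BertTokenizer` implementation: the vocabulary `TOY_VOCAB` and the helper names `basic_tokenize`, `wordpiece`, and `encode_pair` are hypothetical, and the real vocabulary is far larger.

```python
import re

# Hypothetical, tiny vocabulary used only for this illustration.
TOY_VOCAB = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", ",", "how", "are",
             "you", "?", "play", "##ing", "i", "like"}

def basic_tokenize(text):
    """Step 1: lowercase, then split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def wordpiece(word, vocab=TOY_VOCAB):
    """Step 2: greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode_pair(text_a, text_b, max_len=16):
    """Steps 3 and 4: add [CLS]/[SEP], then pad with [PAD] up to max_len.
    Inputs longer than max_len are left as-is (no truncation in this sketch)."""
    tokens = ["[CLS]"]
    for text in (text_a, text_b):
        for word in basic_tokenize(text):
            tokens.extend(wordpiece(word))
        tokens.append("[SEP]")
    return tokens + ["[PAD]"] * (max_len - len(tokens))

print(basic_tokenize("Hello, how are you?"))
print(wordpiece("playing"))
print(encode_pair("How are you?", "I like playing", max_len=12))
```

In the notebook itself all of this is handled by `tokenizer_for_bert.encode(...)`, which returns token IDs rather than token strings.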





In [None]:
# 3: Loading the pre-trained Bert model
# Load a pre-trained BERT model for question answering from the Hugging Face model hub
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Load the corresponding tokenizer
tokenizer_for_bert = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
# 4: Defining the function for question-answering
def bert_question_answer(question, passage, max_len=500):

    # Tokenize the question and passage as a pair.
    # encode() adds the special tokens [CLS] and [SEP] automatically.
    input_ids = tokenizer_for_bert.encode(question, passage, max_length=max_len, truncation=True)

    # Number of tokens in the 1st segment (question) and 2nd segment (passage that contains the answer)
    sep_index = input_ids.index(tokenizer_for_bert.sep_token_id)  # the first [SEP] ends the question
    len_question = sep_index + 1
    len_passage = len(input_ids) - len_question

    # Segment ids are 0 for the question and 1 for the passage
    segment_ids = [0] * len_question + [1] * len_passage


    #Converting token ids to tokens
    tokens = tokenizer_for_bert.convert_ids_to_tokens(input_ids)

    # Get start and end scores for the answer.
    # The inputs are converted to torch tensors; a single forward pass returns both sets of scores.
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    start_token_scores = outputs[0]
    end_token_scores = outputs[1]

    # Convert the score tensors to flat numpy arrays
    start_token_scores = start_token_scores.detach().numpy().flatten()
    end_token_scores = end_token_scores.detach().numpy().flatten()

    #Getting start and end index of answer based on highest scores
    answer_start_index = np.argmax(start_token_scores)
    answer_end_index = np.argmax(end_token_scores)


    #Getting scores for start and end token of the answer
    start_token_score = np.round(start_token_scores[answer_start_index], 2)
    end_token_score = np.round(end_token_scores[answer_end_index], 2)


    # Combine subword pieces that start with ## to get full words in the output.
    # The tokenizer splits words that are not in its vocabulary into such pieces.
    answer = tokens[answer_start_index]
    for i in range(answer_start_index + 1, answer_end_index + 1):
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        else:
            answer += ' ' + tokens[i]

    # Fall back to a default message if no valid answer was found in the passage
    if (start_token_score < 0) or (answer_start_index == 0) or (answer_end_index < answer_start_index) or (answer == '[SEP]'):
        answer = "Sorry!, I was unable to discover an answer in the passage."

    return (answer_start_index, answer_end_index, start_token_score, end_token_score,  answer)

#Testing function
bert_question_answer("What are some effective home remedies and medical treatments for relieving a sore throat?", "Sore throats, often caused by viral infections like the common cold or flu, can be relieved through various home remedies such as gargling with warm salt water, drinking herbal teas with honey and lemon, and using over-the-counter pain relievers. For bacterial infections like strep throat, medical treatments like antibiotics may be necessary. It's important to consult with a healthcare professional for proper diagnosis and treatment.")


(48,
 74,
 5.8,
 6.23,
 'gargling with warm salt water , drinking herbal teas with honey and lemon , and using over - the - counter pain relievers')

### Steps


*   Tokenization: The function first encodes the input question and passage using the BERT tokenizer (tokenizer_for_bert). The encoding includes the special tokens [CLS] and [SEP], which mark the beginning of the input and the boundary between the question and the passage.

*   Segment IDs: It creates segment IDs to differentiate between the question (segment 0) and the passage (segment 1).

*   Converting IDs to Tokens: It converts the token IDs back to token strings using the BERT tokenizer, so the answer text can be reconstructed later.

*   Model Prediction: It passes the input token IDs and segment IDs to the BERT model (model) to get start and end scores for the answer span.

*   Span Selection: It identifies the start and end indices of the answer span based on the highest scores from the model.

*   Answer Reconstruction: It reconstructs the answer span from the tokens, merging subwords (tokens starting with ##) to form complete words.

*   Answer Validation: It checks whether the answer is valid. If the start token score is negative, the start index points at [CLS] (index 0), the end index precedes the start index, or the answer is just [SEP], it returns a default message indicating that no answer was found.

*   Return: It returns the start and end indices of the answer span, the scores for the start and end tokens, and the reconstructed answer.
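
To make the segment ID, span selection, and reconstruction steps concrete without loading the model, here is a small sketch on hand-made data. The tokens and scores are invented for illustration, and the helper names `join_wordpieces` and `best_span` are hypothetical. Note that `best_span` is a variant, not what `bert_question_answer` does: the function above takes the two argmax values independently, whereas `best_span` searches for the best pair with start <= end.

```python
import numpy as np

def join_wordpieces(tokens):
    """Merge '##' continuation pieces back into whole words."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

def best_span(start_scores, end_scores, max_answer_len=30):
    """Variant: pick the (start, end) pair with the highest summed score, start <= end."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_scores)):
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Invented tokens and scores standing in for tokenizer and model output.
tokens = ["[CLS]", "what", "helps", "?", "[SEP]",
          "gar", "##gling", "with", "salt", "water", "helps", ".", "[SEP]"]
start_scores = np.array([-5, -4, -4, -4, -5, 6.0, 1, 0, 2, 0, -1, -3, -5])
end_scores = np.array([-5, -4, -4, -4, -5, 0, 1, 0, 2, 5.5, 1, -2, -5])

# Segment IDs: 0 up to and including the first [SEP], 1 afterwards.
len_question = tokens.index("[SEP]") + 1
segment_ids = [0] * len_question + [1] * (len(tokens) - len_question)

# Independent argmax, as in bert_question_answer.
start, end = int(np.argmax(start_scores)), int(np.argmax(end_scores))
answer = join_wordpieces(tokens[start:end + 1])
print(segment_ids)
print(start, end, answer)
print(best_span(start_scores, end_scores))
```

On this data both selection strategies agree. They can differ when the highest end score lies before the highest start score, which is the case the validation step in `bert_question_answer` rejects.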





In [None]:
bert_question_answer("What can i do get cure cancer", "") # without the context

(8,
 8,
 -3.74,
 -1.04,
 'Sorry!, I was unable to discover an answer in the passage.')

In [None]:
# some more examples for testing

# Passage 1
passage = '''Regenerative medicine represents a revolutionary approach to healthcare, aiming to repair, replace, or regenerate damaged tissues and organs to restore normal function. This field encompasses a wide range of technologies and disciplines, including stem cell biology, tissue engineering, gene editing, and biomaterials science, all working synergistically to develop innovative therapies.
At the core of regenerative medicine are stem cells, which possess the remarkable ability to differentiate into various cell types and self-renew. These cells can be sourced from a variety of places, including embryos, adult tissues, and induced pluripotent stem cells (iPSCs) generated from adult cells. Stem cells are key players in regenerative medicine because of their potential to regenerate damaged tissues and organs. For example, in bone marrow transplantation, hematopoietic stem cells are used to regenerate the blood and immune system in patients with certain cancers and blood disorders.
Tissue engineering is another critical component of regenerative medicine, focusing on creating functional tissues and organs in the lab for transplantation. This involves combining cells with biomaterials and growth factors to create bioengineered constructs that mimic the structure and function of native tissues. These constructs can then be implanted into patients to replace or repair damaged tissues. For example, researchers have developed bioengineered skin grafts for patients with severe burns, providing a more effective and less painful alternative to traditional skin grafting techniques.
Gene editing technologies, such as CRISPR-Cas9, are also playing a significant role in advancing regenerative medicine. These tools allow researchers to precisely modify the genetic code of cells, opening up new possibilities for treating genetic disorders and enhancing the efficacy of stem cell therapies. For example, researchers have used CRISPR-Cas9 to correct genetic mutations in patient-derived iPSCs, paving the way for personalized cell therapies for genetic diseases.
Despite its immense potential, regenerative medicine faces several challenges. Ethical considerations surrounding the use of embryonic stem cells remain contentious, leading researchers to explore alternative cell sources such as iPSCs. Regulatory hurdles also exist, as ensuring the safety and efficacy of regenerative medicine therapies requires rigorous testing and approval processes. Additionally, the high cost of these therapies and the need for specialized infrastructure pose challenges to widespread adoption.
Nevertheless, the field of regenerative medicine continues to advance at a rapid pace, with ongoing research efforts driving innovation and discovery. As scientists gain a deeper understanding of stem cell biology, tissue engineering, and gene editing, the potential for regenerative medicine to transform healthcare and offer new hope to patients with currently incurable conditions grows ever greater. '''

print (f'Length of the passage: {len(passage.split())} words')

question1 ='How does tissue engineering contribute to the development of bioengineered tissues and organs?'
print ('\nQuestion 1:\n', question1)
_, _ , _ , _, ans  = bert_question_answer( question1, passage)
print('\nAnswer from BERT: ', ans ,  '\n')


question2 ='What regulatory hurdles exist for the approval of regenerative medicine therapies?'
print ('\nQuestion 2:\n', question2)
_, _ , _ , _, ans  = bert_question_answer( question2, passage)
print('\nAnswer from BERT: ', ans ,  '\n')

question3 ='What role do gene editing technologies, such as CRISPR-Cas9, play in regenerative medicine?'
print ('\nQuestion 3:\n', question3)
_, _ , _ , _, ans  = bert_question_answer( question3, passage)
print('\nAnswer from BERT: ', ans ,  '\n')

# Passage 2
passage = """Skin cancer is a significant global health concern characterized by the abnormal growth of skin cells, typically caused by damage from ultraviolet (UV) radiation. It is the most common type of cancer worldwide, with increasing incidence rates in many regions. Skin cancer can be broadly categorized into three main types: basal cell carcinoma (BCC), squamous cell carcinoma (SCC), and melanoma, each with distinct characteristics and treatment approaches.
Basal cell carcinoma (BCC) is the most common form of skin cancer, accounting for about 80% of cases. It usually develops in areas of the skin that have been exposed to the sun, such as the face, neck, and hands. BCCs typically appear as raised, pearly bumps or as flat, scaly patches that may be pink or red in color. While BCC rarely spreads to other parts of the body, it can cause disfigurement if not treated promptly.
Squamous cell carcinoma (SCC) is the second most common type of skin cancer, accounting for about 20% of cases. Like BCC, SCC is also primarily caused by sun exposure. SCCs often appear as firm, red nodules or as flat, scaly lesions with a crusty surface. While SCC is more likely than BCC to spread to other parts of the body if left untreated, it is still usually curable when detected early.
Melanoma is the least common but most dangerous form of skin cancer, accounting for the majority of skin cancer-related deaths. Melanomas arise from the pigment-producing cells (melanocytes) in the skin and can develop anywhere on the body, including areas not exposed to the sun. Melanomas often appear as asymmetrical moles with irregular borders and variegated colors. If left untreated, melanoma can metastasize to other organs, making early detection and treatment crucial for improving outcomes.
Risk factors for skin cancer include excessive exposure to UV radiation from the sun or tanning beds, fair skin, a history of sunburns, a weakened immune system, and a family history of skin cancer. Prevention strategies include wearing sunscreen with a high SPF, seeking shade during peak sun hours, wearing protective clothing, and avoiding indoor tanning.
Diagnosis of skin cancer is typically based on a skin examination and biopsy of suspicious lesions. Treatment options depend on the type, size, and location of the cancer but may include surgical removal, radiation therapy, chemotherapy, immunotherapy, or targeted therapy. Early detection and treatment can significantly improve the prognosis for most skin cancers, highlighting the importance of regular skin examinations and sun protection practices."""

print (f'Length of the passage: {len(passage.split())} words')


question ="How can skin cancer be prevented?"
print ('\nQuestion 1:\n', question)
_, _ , _ , _, ans  = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans ,  '\n')

question ="What are the differences between basal cell carcinoma, squamous cell carcinoma, and melanoma? "
print ('\nQuestion 2:\n', question)
_, _ , _ , _, ans  = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans ,  '\n')

question ="What are some common signs and symptoms of skin cancer? "
print ('\nQuestion 3:\n', question)
_, _ , _ , _, ans  = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans ,  '\n')

question ="How can individuals protect themselves from UV radiation exposure?"
print ('\nQuestion 4:\n', question)
_, _ , _ , _, ans  = bert_question_answer( question, passage)
print('\nAnswer from BERT: ', ans ,  '\n')



Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Length of the passage: 415 words

Question 1:
 How does tissue engineering contribute to the development of bioengineered tissues and organs?



Answer from BERT:  creating functional tissues and organs in the lab for transplantation . this involves combining cells with biomaterials and growth factors to create bioengineered constructs that mimic the structure and function of native tissues 


Question 2:
 What regulatory hurdles exist for the approval of regenerative medicine therapies?



Answer from BERT:  rigorous testing and approval processes 


Question 3:
 What role do gene editing technologies, such as CRISPR-Cas9, play in regenerative medicine?



Answer from BERT:  significant role in advancing regenerative medicine 

Length of the passage: 410 words

Question 1:
 How can skin cancer be prevented?



Answer from BERT:  wearing sunscreen with a high spf , seeking shade during peak sun hours , wearing protective clothing , and avoiding indoor tanning 


Question 2:
 What are the differences between basal cell carcinoma, squamous cell carcinoma, and melanoma? 



Answer from BERT:  distinct characteristics and treatment approaches 


Question 3:
 What are some common signs and symptoms of skin cancer? 



Answer from BERT:  Sorry!, I was unable to discover an answer in the passage. 


Question 4:
 How can individuals protect themselves from UV radiation exposure?

Answer from BERT:  wearing sunscreen with a high spf , seeking shade during peak sun hours , wearing protective clothing 



### Alternatives and Scope for Improvement



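*   The Haystack framework could be used to build the question-answering pipeline instead of calling BERT directly.

*   Instead of typing inputs directly into the notebook, questions and passages could be loaded from .json files.

*   Libraries that extract text from PDF files could also be used, for example to load passages into a data frame and run the same operations on them.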
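
As a rough sketch of the .json idea, the snippet below reads question and passage pairs from a file and runs an answer function over them. The file layout (a list of objects with `question` and `passage` keys), the file name `qa_inputs.json`, and the helpers `load_qa_pairs`, `answer_all`, and `first_sentence` are all assumptions made for illustration. `first_sentence` is only a stand-in so the example runs without the model.

```python
import json
import tempfile
from pathlib import Path

def load_qa_pairs(path):
    """Read a JSON file holding a list of {"question": ..., "passage": ...} objects."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [(r["question"], r["passage"]) for r in records]

def answer_all(pairs, answer_fn):
    """Apply answer_fn(question, passage) to every pair and collect the results."""
    return [{"question": q, "answer": answer_fn(q, p)} for q, p in pairs]

def first_sentence(question, passage):
    """Stand-in for the model: returns the first sentence of the passage."""
    return passage.split(".")[0]

# Write a small sample file (hypothetical schema), then read it back.
sample = [{"question": "How can skin cancer be prevented?",
           "passage": "Prevention strategies include wearing sunscreen with a high SPF."}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "qa_inputs.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    pairs = load_qa_pairs(path)

results = answer_all(pairs, first_sentence)
print(results)
```

In the notebook, the stand-in could be swapped for the real function, for example `answer_all(pairs, lambda q, p: bert_question_answer(q, p)[-1])`, since the answer text is the last element of the returned tuple.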




