# **Development of a Medical Chatbot Using Natural Language Processing**

## **Introduction**


In recent years, the integration of technology into healthcare has revolutionized how individuals access medical information and services. The increasing reliance on digital solutions has led to the development of medical chatbots, which are designed to provide instant, personalized responses to health-related inquiries. These AI-driven systems leverage Natural Language Processing (NLP) to understand user queries and deliver accurate, relevant information, effectively bridging the gap between medical expertise and patient needs.

The importance of reliable medical information cannot be overstated, particularly in an era marked by the rapid spread of misinformation. Many people turn to the internet as their first source of information regarding health issues, yet navigating through the vast sea of available data can often be overwhelming and confusing. Medical chatbots have the potential to serve as trusted resources, offering users immediate access to accurate health information while guiding them through their questions in an understandable manner.

The goal of this project is a chatbot with a user-friendly interface that individuals can use without friction. Users will receive instant responses to their health-related inquiries, helping them make informed decisions about their health and wellbeing. By prioritizing clarity and reliability, the chatbot aims to improve healthcare accessibility and literacy, particularly for those who find traditional medical resources hard to follow. This work highlights the role of artificial intelligence in healthcare and underscores the importance of grounding health-related technologies in credible medical sources.


## **Problem Statement**

Despite the abundance of medical information available online, many individuals struggle to find accurate and easily understandable health resources. Traditional healthcare services are often burdened by high demand, leading to challenges in accessibility and timely information dissemination. This gap in available resources can result in individuals relying on unreliable sources or misinterpreting health-related information, which can adversely affect their decision-making and health outcomes.

The primary goal of this project is to address these challenges by developing a medical chatbot that provides:

- **Immediate Access:** Users will have instant access to medical information without the need to sift through multiple websites or sources.

- **Clear Communication:** The chatbot will use simple language to explain medical terms and concepts, making information more accessible to the general public.

- **Reliable Information:** By utilizing the Gale Encyclopedia of Medicine, the chatbot will offer trusted and verified medical knowledge, ensuring that users receive accurate guidance.

## **Metrics of Success**

To evaluate the effectiveness of the medical chatbot, the following metrics will be considered:

- **User Satisfaction:** Measured through surveys assessing the clarity and usefulness of responses.

- **Accuracy of Information:** The chatbot's responses will be compared to verified medical sources to ensure correctness.

- **Response Time:** The average time taken by the chatbot to provide answers to user queries.

- **Engagement Rate:** The frequency of user interactions and follow-up questions will indicate user interest and trust in the chatbot.
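
Two of the metrics above, response time and accuracy, lend themselves to simple automated checks. The sketch below shows one possible way to compute them. `answer_fn` is a hypothetical callable that maps a query string to the chatbot's answer (the notebook does not define one), and matching on an expected key phrase is only a crude stand-in for a comparison against verified medical sources.

```python
import time
from statistics import mean
from typing import Callable, Iterable, Sequence, Tuple


def average_response_time(answer_fn: Callable[[str], str], queries: Iterable[str]) -> float:
    """Return the mean wall-clock time in seconds that answer_fn takes per query."""
    durations = []
    for query in queries:
        start = time.perf_counter()
        answer_fn(query)
        durations.append(time.perf_counter() - start)
    # statistics.mean raises StatisticsError if no queries were supplied.
    return mean(durations)


def accuracy(answer_fn: Callable[[str], str],
             labelled_queries: Sequence[Tuple[str, str]]) -> float:
    """Return the fraction of answers that contain the expected key phrase.

    Assumption: each item is a (query, expected_phrase) pair prepared by hand.
    """
    hits = sum(
        1
        for query, expected in labelled_queries
        if expected.lower() in answer_fn(query).lower()
    )
    return hits / len(labelled_queries)
```

User satisfaction and engagement rate depend on survey and usage data, so they are not covered by this sketch.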

## **Data Understanding**

The Gale Encyclopedia of Medicine serves as the foundational data source for this project. This resource is well-respected in the medical community and contains comprehensive, peer-reviewed articles that cover a wide range of medical topics, including:

- **Diseases and Conditions:** Detailed descriptions of various health conditions, their symptoms, causes, and treatment options.

- **Medications:** Information about different medications, including indications, dosage, side effects, and contraindications.

- **Preventive Measures:** Guidelines for maintaining health and preventing diseases, including lifestyle recommendations and screenings.

- **Medical Terminology:** Definitions and explanations of complex medical terms to facilitate better understanding for users.

In [3]:
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Pinecone
import pinecone
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.llms import CTransformers
from dotenv import load_dotenv
import os



In [5]:
def load_pdf(data):
    """Load every PDF in the `data` directory; PyPDFLoader returns one Document per page."""
    loader = DirectoryLoader(data,
                             glob="*.pdf",
                             loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

In [6]:
extracted_data = load_pdf("data")

In [7]:
for i, doc in enumerate(extracted_data[:3]):
    print(f"Document {i+1}")
    print(doc.page_content)
    print("\n")

Document 1



Document 2
TheGALE
ENCYCLOPEDIA
ofMEDICINE
SECOND EDITION


Document 3
TheGALE
ENCYCLOPEDIA
ofMEDICINE
SECOND EDITION
JACQUELINE L. LONGE, EDITOR
DEIRDRE S. BLANCHFIELD, ASSOCIATE EDITOR
VOLUME
A-B1




In [8]:
# View the first 5 words of the second page (index 1); the first page (index 0) is empty
second_page_words = extracted_data[1].page_content.split()[:5]
print("First 5 words of the second page:", second_page_words)


First 5 words of the second page: ['TheGALE', 'ENCYCLOPEDIA', 'ofMEDICINE', 'SECOND', 'EDITION']


In [9]:
def text_split(extracted_data):
    """Split the loaded pages into chunks of up to 500 characters with a 20-character overlap."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
    text_chunks = text_splitter.split_documents(extracted_data)

    return text_chunks

In [10]:
text_chunks = text_split(extracted_data)
print("Number of chunks:", len(text_chunks))

Number of chunks: 7020
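
To make the roles of `chunk_size` and `chunk_overlap` concrete, the sketch below implements a plain sliding-window splitter. It is a simplified illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, so its chunks will generally differ. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> List[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

The overlap means that a sentence cut at a chunk boundary still appears, at least in part, in the neighbouring chunk, which can help retrieval when the relevant passage straddles two chunks.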


In [11]:
# Download the embedding model
def download_hugging_face_embeddings():
    """Return the all-MiniLM-L6-v2 sentence-transformer wrapped for LangChain."""
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return embeddings

In [14]:
embeddings = download_hugging_face_embeddings()
embeddings



HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
), model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={})

In [15]:
query_result = embeddings.embed_query("Hello world")
print("Length", len(query_result))

Length 384


In [26]:
import os
from dotenv import load_dotenv
import pinecone

# Load environment variables from the .env file
load_dotenv()

# Retrieve Pinecone API key from the .env file
pinecone_api_key = os.getenv("PINECONE_API_KEY")

# Initialize Pinecone with the API key
pinecone.init(api_key=pinecone_api_key)

# Name of the index
index_name = "medical-chatbot"

# Check if the index exists, if not create it
if index_name not in pinecone.list_indexes():
    print(f"Creating index: {index_name}")
    # The dimension must match the embedding model: all-MiniLM-L6-v2 produces 384-dimensional vectors
    pinecone.create_index(index_name, dimension=384)

# Connect to the index
index = pinecone.Index(index_name)

# Now you can proceed with creating embeddings for the text chunks
docsearch = Pinecone.from_texts([t.page_content for t in text_chunks], embeddings, index_name=index_name)

# Output success message
print(f"Connected to Pinecone index: {index_name}")


Creating index: medical-chatbot


ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'x-pinecone-api-version': '2024-04', 'X-Cloud-Trace-Context': '1877fe05c1a6c61a25344a5536c815db', 'Date': 'Thu, 26 Sep 2024 18:37:20 GMT', 'Server': 'Google Frontend', 'Content-Length': '136', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: Request failed. You've reach the max pod-based indexes allowed in project Default (0). To add more pod-based indexes, upgrade your plan.
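
The 400 response above reports that the project allows zero pod-based indexes, which suggests the account's plan only permits serverless indexes. One possible way around this is sketched below, assuming a newer Pinecone client (3.x or later), which replaces `pinecone.init` with a `Pinecone` class and accepts a `ServerlessSpec`. `ensure_index` is a hypothetical helper, not part of any library, and the cloud and region in the usage comment are placeholder values. The dimension of 384 matches the embedding length printed earlier.

```python
def ensure_index(client, name: str, dimension: int, spec) -> bool:
    """Create the index if it is missing; return True when it was created.

    Assumption: `client` behaves like pinecone.Pinecone from pinecone-client 3.x+,
    i.e. list_indexes() returns an object with a names() method.
    """
    if name in client.list_indexes().names():
        return False
    client.create_index(name=name, dimension=dimension, metric="cosine", spec=spec)
    return True


# Usage with the real client (assumes pinecone-client 3.x or later and a
# PINECONE_API_KEY environment variable; cloud and region are placeholders):
#
#   import os
#   from pinecone import Pinecone, ServerlessSpec
#
#   pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
#   ensure_index(pc, "medical-chatbot", 384,
#                ServerlessSpec(cloud="aws", region="us-east-1"))
```

Note that the newer client's `Pinecone` class has the same name as the LangChain vector store imported earlier, so one of the two would need an alias if both are used in the same notebook.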


In [23]:
# Load environment variables from the .env file
load_dotenv()

# Retrieve Pinecone API key from the .env file
pinecone_api_key = os.getenv("PINECONE_API_KEY")

# Initialize Pinecone with the API key (no need for environment variable anymore)
pinecone.init(api_key=pinecone_api_key)

# Connect to the index by name
index = pinecone.Index("medical-chatbot")

In [25]:
print(pinecone.list_indexes())

[]


In [24]:
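# Create embeddings for each text chunk and store them in the index.
# index_name expects the index name as a string, not a pinecone.Index object.
docsearch = Pinecone.from_texts([t.page_content for t in text_chunks], embeddings, index_name="medical-chatbot")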

ValueError: No active indexes found in your Pinecone project, are you sure you're using the right API key and environment?

In [None]:
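# from_existing_index takes the index name as a string
docsearch = Pinecone.from_existing_index("medical-chatbot", embeddings)

query = "What are allergies?"

# Retrieve the 3 chunks most similar to the query
docs = docsearch.similarity_search(query, k=3)

print("Result", docs)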
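
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows the idea with a brute-force cosine-similarity ranking over plain Python lists. It is an illustration only: a vector database such as Pinecone typically uses approximate nearest-neighbour indexing rather than scanning every vector, and `cosine_similarity` and `top_k` are hypothetical helpers written for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to query_vec."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the notebook's setting, `query_vec` would be the result of `embeddings.embed_query(query)` and `doc_vecs` the embeddings of the text chunks.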