# **Development of a Medical Chatbot Using Natural Language Processing**

## **Introduction**


The integration of digital technology into healthcare has changed how people access medical information and services. One result is the medical chatbot: an AI-driven system that uses Natural Language Processing (NLP) to interpret health-related questions and return relevant, personalized answers in real time, helping to bridge the gap between medical expertise and patient needs.

Reliable medical information matters all the more when misinformation spreads quickly. Many people turn to the internet first when they have a health concern, yet the volume of available material is often overwhelming and confusing. A medical chatbot can serve as a trusted resource by giving users immediate access to accurate health information and guiding them through their questions in plain language.

The goal of this project is a chatbot with a user-friendly interface that lets individuals interact with the system easily. Users receive instant responses to their health-related inquiries, which helps them make informed decisions about their health and wellbeing. By prioritizing clarity and reliability, the chatbot aims to improve healthcare accessibility and health literacy, particularly for people who find traditional medical resources hard to understand. The project highlights the role of artificial intelligence in healthcare and underscores the importance of grounding health-related tools in credible medical sources.


## **Problem Statement**

Despite the abundance of medical information available online, many individuals struggle to find accurate and easily understandable health resources. Traditional healthcare services are often burdened by high demand, leading to challenges in accessibility and timely information dissemination. This gap in available resources can result in individuals relying on unreliable sources or misinterpreting health-related information, which can adversely affect their decision-making and health outcomes.

The primary goal of this project is to address these challenges by developing a medical chatbot that provides:

- **Immediate Access:** Users will have instant access to medical information without the need to sift through multiple websites or sources.

- **Clear Communication:** The chatbot will use simple language to explain medical terms and concepts, making information more accessible to the general public.

- **Reliable Information:** By utilizing the Gale Encyclopedia of Medicine, the chatbot will offer trusted and verified medical knowledge, ensuring that users receive accurate guidance.

## **Metrics of Success**

To evaluate the effectiveness of the medical chatbot, the following metrics will be considered:

- **User Satisfaction:** Measured through surveys assessing the clarity and usefulness of responses.

- **Accuracy of Information:** The chatbot's responses will be compared to verified medical sources to ensure correctness.

- **Response Time:** The average time taken by the chatbot to provide answers to user queries.

- **Engagement Rate:** The frequency of user interactions and follow-up questions will indicate user interest and trust in the chatbot.
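
Two of these metrics, response time and accuracy, can be approximated offline. The sketch below is one possible way to do that, not the project's evaluation procedure: `average_response_time`, `keyword_accuracy`, and `fake_chatbot` are hypothetical names, and checking for an expected keyword is only a crude stand-in for comparing answers against verified medical sources.

```python
import time
from statistics import mean
from typing import Callable, Mapping, Sequence


def average_response_time(answer_fn: Callable[[str], str], queries: Sequence[str]) -> float:
    """Return the mean wall-clock time, in seconds, that answer_fn takes per query."""
    durations = []
    for query in queries:
        start = time.perf_counter()
        answer_fn(query)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def keyword_accuracy(answer_fn: Callable[[str], str], reference: Mapping[str, str]) -> float:
    """Return the fraction of answers that contain the expected keyword (case-insensitive)."""
    hits = sum(
        1
        for question, keyword in reference.items()
        if keyword.lower() in answer_fn(question).lower()
    )
    return hits / len(reference)


def fake_chatbot(query: str) -> str:
    """Hypothetical stand-in for the real chatbot; always returns the same sentence."""
    return "An allergen is a foreign substance that triggers an allergic reaction."


reference_answers = {
    "What is an allergen?": "foreign substance",
    "What is insulin?": "hormone",
}
print(keyword_accuracy(fake_chatbot, reference_answers))  # prints 0.5
print(average_response_time(fake_chatbot, list(reference_answers)))
```

In practice `fake_chatbot` would be replaced by a function that calls the finished question-answering chain, and the reference set would be reviewed by someone with medical expertise.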

## **Data Understanding**

The Gale Encyclopedia of Medicine serves as the foundational data source for this project. This resource is well-respected in the medical community and contains comprehensive, peer-reviewed articles that cover a wide range of medical topics, including:

- **Diseases and Conditions:** Detailed descriptions of various health conditions, their symptoms, causes, and treatment options.

- **Medications:** Information about different medications, including indications, dosage, side effects, and contraindications.

- **Preventive Measures:** Guidelines for maintaining health and preventing diseases, including lifestyle recommendations and screenings.

- **Medical Terminology:** Definitions and explanations of complex medical terms to facilitate better understanding for users.

In [1]:
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.llms import CTransformers
from dotenv import load_dotenv
import os
# The Pinecone client and vector store are imported in the cell that sets up the index

In [2]:
def load_pdf(data):
    """Load every PDF in the `data` directory; PyPDFLoader returns one Document per page."""
    loader = DirectoryLoader(data,
                             glob="*.pdf",
                             loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

In [3]:
extracted_data = load_pdf("../data")

In [4]:
for i, doc in enumerate(extracted_data[:3]):
    print(f"Document {i+1}")
    print(doc.page_content)
    print("\n")

Document 1



Document 2
TheGALE
ENCYCLOPEDIA
ofMEDICINE
SECOND EDITION


Document 3
TheGALE
ENCYCLOPEDIA
ofMEDICINE
SECOND EDITION
JACQUELINE L. LONGE, EDITOR
DEIRDRE S. BLANCHFIELD, ASSOCIATE EDITOR
VOLUME
A-B1
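
The first document printed above is empty, so blank pages end up in `extracted_data`. One optional clean-up step, sketched here under assumptions, is to drop pages with no text before splitting. `Page` is a hypothetical stand-in for a LangChain `Document`, and `drop_blank_pages` is not part of the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_blank_pages(pages):
    """Keep only pages whose text is non-empty after stripping whitespace."""
    return [page for page in pages if page.page_content.strip()]


pages = [Page(""), Page("TheGALE\nENCYCLOPEDIA\nofMEDICINE"), Page("  \n")]
kept = drop_blank_pages(pages)
print(len(kept))  # prints 1
```

The same list comprehension would work on the real `extracted_data`, since it relies only on the `page_content` attribute.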




In [5]:
# View the first 5 words of the first page that contains text (index 1; the page at index 0 is blank)
first_doc_words = extracted_data[1].page_content.split()[:5]
print("First 5 words of the first document:", first_doc_words)


First 5 words of the first document: ['TheGALE', 'ENCYCLOPEDIA', 'ofMEDICINE', 'SECOND', 'EDITION']


In [6]:
def text_split(extracted_data):
    """Split the pages into chunks of at most 500 characters, with up to 20 characters of overlap."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
    text_chunks = text_splitter.split_documents(extracted_data)
    return text_chunks

In [7]:
text_chunks = text_split(extracted_data)
# len(text_chunks) is the number of chunks produced, not the size of any single chunk
print("Length of my chunk:", len(text_chunks))

Length of my chunk: 7020
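
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line boundaries); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share a few characters.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same idea as the 20-character overlap used in the notebook: text near a chunk boundary stays retrievable from either side.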


In [8]:
# Download the sentence-transformers embedding model (all-MiniLM-L6-v2)
def download_hugging_face_embeddings():
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return embeddings

In [9]:
embeddings = download_hugging_face_embeddings()
embeddings


HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
), model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [10]:
query_result = embeddings.embed_query("Hello world")
print("Length", len(query_result))

Length 384
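
Each query and each chunk becomes a 384-dimensional vector, and retrieval works by comparing those vectors. A common comparison is cosine similarity; whether the `medical-chatbot` index actually uses the cosine metric is an assumption, since the index configuration is not shown. The sketch below uses toy 3-dimensional vectors and a hypothetical `cosine_similarity` helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings
query_vec = [1.0, 0.0, 1.0]
similar_vec = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

ranked = sorted(
    {"similar": similar_vec, "unrelated": unrelated_vec}.items(),
    key=lambda item: cosine_similarity(query_vec, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # prints ['similar', 'unrelated']
```

The model printout above ends with a `Normalize()` layer, so its embeddings have unit length; for unit vectors, cosine similarity reduces to a plain dot product.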


In [11]:
import os
from dotenv import load_dotenv
from pinecone import Pinecone
from langchain.vectorstores import Pinecone as PineconeVectorStore

# Load environment variables from the .env file
load_dotenv()

# Retrieve Pinecone API key from the .env file
pinecone_api_key = os.getenv("PINECONE_API_KEY")
if not pinecone_api_key:
    raise ValueError("PINECONE_API_KEY is not set. Add it to the .env file.")

# Initialize Pinecone with the API key
pinecone_client = Pinecone(api_key=pinecone_api_key)

# Name of the index
index_name = "medical-chatbot"

# List existing indexes and check if the index exists
existing_indexes = pinecone_client.list_indexes().names()

if index_name not in existing_indexes:
    raise ValueError(f"Index '{index_name}' does not exist. Please create the index first.")

# Connect to the existing index
index = pinecone_client.Index(index_name)

# Embed the text chunks and upload the resulting vectors to the index
docsearch = PineconeVectorStore.from_texts(
    [t.page_content for t in text_chunks],
    embeddings,
    index_name=index_name
)

print(f"Connected to Pinecone index: {index_name}")


Connected to Pinecone index: medical-chatbot


In [12]:
# Connect to the existing Pinecone index and load it using LangChain's PineconeVectorStore
docsearch = PineconeVectorStore.from_existing_index(index_name, embeddings)

# Query the index
query = "What are Allergies"
docs = docsearch.similarity_search(query, k=3)

# Print the search results
print("Results:", docs)

Results: [Document(metadata={}, page_content="GALE ENCYCLOPEDIA OF MEDICINE 2 117Allergies\nAllergic rhinitis is commonly triggered by\nexposure to household dust, animal fur,or pollen. The foreign substance thattriggers an allergic reaction is calledan allergen.\nThe presence of an allergen causes the\nbody's lymphocytes to begin producingIgE antibodies. The lymphocytes of an allergy sufferer produce an unusuallylarge amount of IgE.\nIgE molecules attach to mast\ncells, which contain histamine.HistaminePollen grains\nLymphocyte\nFIRST EXPOSURE"), Document(metadata={}, page_content="GALE ENCYCLOPEDIA OF MEDICINE 2 117Allergies\nAllergic rhinitis is commonly triggered by\nexposure to household dust, animal fur,or pollen. The foreign substance thattriggers an allergic reaction is calledan allergen.\nThe presence of an allergen causes the\nbody's lymphocytes to begin producingIgE antibodies. The lymphocytes of an allergy sufferer produce an unusuallylarge amount of IgE.\nIgE molecules att
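
The first two results printed above begin with identical text, which suggests the same chunk may be stored in the index more than once. One possible cause, stated here as an assumption rather than something the output confirms, is that the upload cell was run more than once. A minimal way to hide such repeats at query time is to de-duplicate by `page_content`; `dedupe_by_content` is a hypothetical helper, and `SimpleNamespace` stands in for LangChain `Document` objects.

```python
from types import SimpleNamespace


def dedupe_by_content(docs):
    """Return docs in their original order, keeping only the first copy of each page_content."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Stand-ins for retrieved documents; the first two are exact duplicates
results = [
    SimpleNamespace(page_content="Allergic rhinitis is commonly triggered by pollen."),
    SimpleNamespace(page_content="Allergic rhinitis is commonly triggered by pollen."),
    SimpleNamespace(page_content="IgE molecules attach to mast cells."),
]
unique_results = dedupe_by_content(results)
print(len(unique_results))  # prints 2
```

De-duplicating after retrieval only treats the symptom. If the index really does contain repeated vectors, clearing and rebuilding it once would also stop duplicates from using up the `k` result slots.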

In [13]:
prompt_template="""
Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

In [14]:
PROMPT=PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain_type_kwargs={"prompt": PROMPT}

In [17]:
import os
print(os.getcwd())


e:\projects\MedicineChatbot\research


In [18]:
# Load the quantized Llama 2 7B chat model from a local GGML file via CTransformers
llm = CTransformers(model="../model/llama-2-7b-chat.ggmlv3.q4_0.bin",
                    model_type="llama",
                    config={'max_new_tokens': 512,
                            'temperature': 0.8})

In [19]:
# "stuff" places all retrieved chunks into a single prompt; k=2 retrieves the two closest chunks
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(search_kwargs={'k': 2}),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs)
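
Conceptually, the "stuff" chain type concatenates the retrieved chunks into one context string and substitutes it, together with the question, into the prompt template. The plain-Python sketch below mimics that step without LangChain. `build_stuff_prompt` and the shortened `TEMPLATE` are hypothetical, and the blank-line separator between chunks is an assumption about the library's default.

```python
from typing import Sequence

# Shortened, hypothetical version of the notebook's prompt template
TEMPLATE = (
    "Use the following pieces of information to answer the user's question.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Helpful answer:"
)


def build_stuff_prompt(template: str, chunks: Sequence[str], question: str) -> str:
    """Join the retrieved chunks into one context string and fill in the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


retrieved = [
    "Allergic rhinitis is commonly triggered by pollen.",
    "IgE molecules attach to mast cells.",
]
prompt = build_stuff_prompt(TEMPLATE, retrieved, "What are allergies?")
print(prompt)
```

Because every retrieved chunk lands in a single prompt, `k` and `chunk_size` together determine how much of the model's context window the retrieved text consumes.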

In [17]:
# while True:
#     user_input = input("Input Prompt: ")
#     if user_input.lower() == 'exit':
#         break
#     result = qa.invoke({"query": user_input})  # Use invoke instead of calling the object directly
#     print("Response: ", result["result"])
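
The commented-out loop above can also be written as a small function whose input and output are injectable, which makes it possible to exercise the loop without a live model or keyboard input. This is only a sketch: `chat_loop`, `read_input`, and `write` are hypothetical names, and the lambda used as the answer function is a stand-in for the real chain.

```python
from typing import Callable


def chat_loop(
    answer_fn: Callable[[str], str],
    read_input: Callable[[str], str] = input,
    write: Callable[..., None] = print,
) -> None:
    """Read questions until the user types 'exit', writing one response per question."""
    while True:
        user_input = read_input("Input Prompt: ").strip()
        if user_input.lower() == "exit":
            break
        if not user_input:
            continue  # Ignore empty lines instead of querying the model
        write("Response:", answer_fn(user_input))


# Scripted run: two inputs (one empty) followed by 'exit'
scripted = iter(["What are allergies?", "", "exit"])
outputs = []
chat_loop(
    answer_fn=lambda question: f"(answer to: {question})",
    read_input=lambda _prompt: next(scripted),
    write=lambda *parts: outputs.append(" ".join(parts)),
)
print(outputs)  # prints ['Response: (answer to: What are allergies?)']
```

With the notebook's chain, the answer function could be `lambda q: qa.invoke({"query": q})["result"]`, matching the call used in the commented-out loop.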
