# **💻 Author Information**

**Name:** CHUA JINGXUAN

**Note:** This Jupyter Notebook was created by CHUA JINGXUAN. If you find any issues, have questions, or want to provide feedback, please don't hesitate to reach out. Thank you for exploring this notebook!

**Date Written**: 28/03/2025 (FRI)

**Last Updated**: 28/03/2025 (FRI)



[![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/Kanon14) <a href="https://www.linkedin.com/in/chua-jingxuan-51a300173" target="_blank" style="margin-left: 10px;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/1/19/LinkedIn_logo.svg" alt="LinkedIn Icon" width="80" height="22">
</a>

---

# 🔬 **Experiment for Medical Chatbot (RAG Framework)**

In [2]:
print("Hello World")

Hello World


In [3]:
%pwd

'c:\\Users\\cjx14\\Personal_Projects\\End-to-end-Medical-Chatbot\\research'

In [4]:
import os
os.chdir("../")

In [5]:
%pwd

'c:\\Users\\cjx14\\Personal_Projects\\End-to-end-Medical-Chatbot'

### 🦜🔗 Importing LangChain Libraries

In [6]:
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [7]:
# Extract the data from the PDF file
def load_pdf_file(data):
    loader = DirectoryLoader(data, glob="*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

In [8]:
extracted_data = load_pdf_file("data/")

In [9]:
# extracted_data

In [10]:
# Split the data into chunks
def text_split(extracted_data):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
    text_chunks = text_splitter.split_documents(extracted_data)
    return text_chunks

In [11]:
text_chunks = text_split(extracted_data)
print(f"Length of the text chunks: {len(text_chunks)}")

Length of the text chunks: 12883
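The splitter above is configured with `chunk_size=500` and `chunk_overlap=20`. The sketch below illustrates what those two parameters mean using a plain fixed-size sliding window. This is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs and newlines). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small values make the overlap easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(demo_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence that straddles a boundary is less likely to be lost at retrieval time.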


### 🤗 Embedding Models from HuggingFace

Source: [sentence-transformers embedding](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

In [12]:
from langchain.embeddings import HuggingFaceEmbeddings

In [13]:
# Download the embeddings from HuggingFace
def download_hf_embeddings():
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
    return embeddings

In [14]:
embeddings = download_hf_embeddings()



In [15]:
query_result = embeddings.embed_query("Hello World")
print("Length of the query result: ", len(query_result))

Length of the query result:  384


### 🗃️ Creating Pinecone Vector Database

In [16]:
from dotenv import load_dotenv
load_dotenv()

True

<font color="teal">Load the API Keys</font>

In [17]:
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
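`os.environ.get` returns `None` when a variable is missing, so a typo in the `.env` file only surfaces later as a less obvious error. A minimal sketch of an early check, assuming a hypothetical helper `require_env` and a placeholder variable `DEMO_API_KEY` (neither is part of this project):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Placeholder variable used only for this demonstration
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))  # dummy-value
```

The same helper could wrap the lookups for `PINECONE_API_KEY` and `OPENAI_API_KEY` if failing fast is preferred.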

In [18]:
from pinecone import Pinecone, ServerlessSpec

In [None]:
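# Create the Pinecone client
pc = Pinecone(api_key=PINECONE_API_KEY)

index_name = "medicalbot"

# Run this cell once; the index only needs to be created the first time.
# dimension must match the embedding length (384 for all-MiniLM-L6-v2, as printed above).
pc.create_index(name=index_name,
                dimension=384,
                metric="cosine",
                spec=ServerlessSpec(cloud="aws", region="us-east-1"))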
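The index is created with `metric="cosine"`, meaning vectors are ranked by the cosine of the angle between them rather than by raw distance. A minimal pure-Python sketch of that formula on toy 2-dimensional vectors (the real embeddings have 384 dimensions; the function name is hypothetical):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```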

<font color="orange">Set Up the Environment Variables</font>

In [None]:
import os
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [20]:
# Embed each chunk and upsert the embeddings into the Pinecone index
from langchain_pinecone import PineconeVectorStore

docsearch = PineconeVectorStore.from_documents(
    documents=text_chunks,
    index_name=index_name,
    embedding=embeddings,
)

In [21]:
# Load the existing index
from langchain_pinecone import PineconeVectorStore

# Connect to the already-populated Pinecone index; nothing is re-embedded or upserted here
docsearch = PineconeVectorStore.from_existing_index(
    index_name=index_name,
    embedding=embeddings,
)

In [22]:
docsearch

<langchain_pinecone.vectorstores.PineconeVectorStore at 0x27369cafcd0>

In [23]:
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":3})

In [26]:
retrieved_docs = retriever.invoke("What is Diabetes?")
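The retriever is configured with `search_type="similarity"` and `k=3`, so it returns the three chunks whose embeddings score highest against the query embedding. The sketch below mimics that ranking in memory. The vectors and texts are made up for illustration (they are not real embeddings), and `top_k` is a hypothetical helper, not part of LangChain or Pinecone.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 3) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = ((cosine(query_vec, vec), text) for text, vec in store.items())
    return heapq.nlargest(k, scored)


# Toy "vector store": text -> made-up 3-dimensional vector
toy_store = {
    "diabetes overview": [0.9, 0.1, 0.0],
    "insulin therapy": [0.8, 0.2, 0.1],
    "bone fractures": [0.0, 0.1, 0.9],
    "blood glucose tests": [0.7, 0.3, 0.0],
}

results = top_k([1.0, 0.0, 0.0], toy_store, k=3)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With these toy vectors, "bone fractures" points in a different direction from the query and is the one entry left out of the top three.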

In [43]:
# retrieved_docs

### 🤖 Creating the RAG Framework

In [38]:
from langchain_openai import OpenAI
llm = OpenAI(temperature=0.4, max_tokens=500)

In [39]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Define a more specific system prompt for medical domain
system_prompt = (
    "You are an assistant specialized in answering medical-related questions only. "
    "Use the following pieces of retrieved context to answer the question. "
    "If the question is not related to medicine, healthcare, or biology, respond with 'I don't know'. "
    "If you don't know the answer to a medical question, also respond with 'I don't know'. "
    "Use three sentences maximum and keep the answer concise."
    "\n\n"
    "{context}"
)

# Create chat prompt template with system and user messages
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [40]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
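Conceptually, the "stuff" strategy places all retrieved chunks into the `{context}` slot of the prompt before the model is called. The sketch below reproduces that idea with plain string formatting. It is an illustration under assumptions, not LangChain's implementation: the names `stuff_documents` and `build_messages` are hypothetical, the template is shortened, and the `"\n\n"` separator is assumed to match the library default.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def stuff_documents(docs: list[str], separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string."""
    return separator.join(docs)


def build_messages(question: str, docs: list[str]) -> list[tuple[str, str]]:
    """Fill the system template with the stuffed context and attach the user question."""
    context = stuff_documents(docs)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Placeholder chunks standing in for retrieved documents
messages = build_messages(
    "What is Diabetes?",
    ["Chunk one about diabetes.", "Chunk two about insulin."],
)
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

Because every chunk is concatenated into a single prompt, `k` and `chunk_size` together bound how much context the model sees per question.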

In [41]:
response = rag_chain.invoke({"input": "what is Acromegaly and gigantism?"})
print(response["answer"])


Acromegaly and gigantism are both disorders caused by the abnormal release of a chemical from the pituitary gland in the brain, leading to increased growth in bone and soft tissue. Acromegaly occurs after bone growth has stopped, while gigantism occurs during childhood and can result in unusual height. Symptoms of these disorders can include changes in facial features, enlarged hands and feet, and other disturbances throughout the body. Diagnosis is often delayed and the majority of patients are not identified until they are middle aged.


In [42]:
response_unknown = rag_chain.invoke({"input": "What is virtual reality (VR)?"})
print(response_unknown["answer"])



I don't know.


### 🌟 End of this Notebook