# Build a Medical RAG Chatbot using the BioMistral Open-Source LLM

In this notebook, we build a medical chatbot using the BioMistral LLM and a heart-health PDF file.

## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


## Installation

In [None]:
!pip install langchain sentence-transformers chromadb llama-cpp-python langchain_community pypdf

Collecting langchain
  Downloading langchain-0.1.13-py3-none-any.whl (810 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl (525 kB)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.57.tar.gz (36.9 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... 

## Importing libraries

In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA, LLMChain

## Import the document

In [None]:
loader = PyPDFDirectoryLoader("/content/drive/MyDrive/BioMistral/Data")
docs = loader.load()

In [None]:
len(docs)  # number of pages

95

In [None]:
docs[6]

Document(page_content='2\nThese facts may seem frightening, but they need not be. The good\nnews is that you have a lot of power to protect and improve yourheart health. This guidebook will help you find out your own riskof heart disease and take steps to prevent it.\n“But,” you may still be thinking, “I take pretty good care of myself.\nI’m unlikely to get heart disease.” Yet a recent national survey showsthat only 3 percent of U.S. adults practice all of the “Big Four”habits that help to prevent heart disease: eating a healthy diet, getting regular physical activity, maintaining a healthy weight, andavoiding smoking. Many young people are also vulnerable. Arecent study showed that about two-thirds of teenagers already haveat least one risk factor for heart disease.\nEvery risk factor counts. Research shows that each individual risk\nfactor greatly increases the chances of developing heart disease.Moreover, the worse a particular risk factor is, the more likely youare to develop heart

## Chunking

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

In [None]:
len(chunks)

747

In [None]:
chunks[600]

Document(page_content='is to usesmaller plates and taller, narrower glasses so that moderate portionsdon’t seem skimpy. It can also help to set a regular eating schedule,especially if you tend to skip or delay meals.', metadata={'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf', 'page': 74})

In [None]:
chunks[601]

Document(page_content='How To Choose a Weight-Loss Program\nSome people lose weight on their own, while others like the supportof a structured program. If you decide to participate in a weight-loss program, here are some questions to ask before you join:\nDoes the program provide counseling to help you change your eat-', metadata={'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf', 'page': 74})
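
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on separators such as paragraph and line breaks, which is why the chunks above end at natural boundaries. The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# A 700-character dummy string stands in for one page of the PDF.
sample = "".join(str(i % 10) for i in range(700))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.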

## Embedding creation

In [None]:
import os
from getpass import getpass
# Read the token at runtime rather than hard-coding a secret in the notebook.
os.environ['HUGGINGFACEHUB_API_TOKEN'] = getpass("Hugging Face token: ")

In [None]:
embeddings = SentenceTransformerEmbeddings(model_name="NeuML/pubmedbert-base-embeddings")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Vector Store creation

In [None]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [None]:
query = "Who is at risk of heart disease?"

search_results = vectorstore.similarity_search(query)

In [None]:
search_results

[Document(page_content='While each risk factor increases your risk of heart disease, having', metadata={'page': 8, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}),
 Document(page_content='■As early as age 45, a man’s risk of heart disease begins to rise significantly. For a woman, risk starts to increase at age 55.\n■Fifty percent of men and 64 percent of women who die suddenlyof heart disease have no previous symptoms of the disease.1Heart Disease: Why Should You Care?Heart Disease:', metadata={'page': 5, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}),
 Document(page_content='factor greatly increases the chances of developing heart disease.Moreover, the worse a particular risk factor is, the more likely youare to develop heart disease. For example, if you have high bloodpressure, the higher it is, the greater your chances of developingheart disease, including its many', metadata={'page': 6, 'source': '/content/drive/MyDrive/BioMistral/Data/

In [None]:
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})

In [None]:
retriever.get_relevant_documents(query)

[Document(page_content='While each risk factor increases your risk of heart disease, having', metadata={'page': 8, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}),
 Document(page_content='■As early as age 45, a man’s risk of heart disease begins to rise significantly. For a woman, risk starts to increase at age 55.\n■Fifty percent of men and 64 percent of women who die suddenlyof heart disease have no previous symptoms of the disease.1Heart Disease: Why Should You Care?Heart Disease:', metadata={'page': 5, 'source': '/content/drive/MyDrive/BioMistral/Data/healthyheart.pdf'}),
 Document(page_content='factor greatly increases the chances of developing heart disease.Moreover, the worse a particular risk factor is, the more likely youare to develop heart disease. For example, if you have high bloodpressure, the higher it is, the greater your chances of developingheart disease, including its many', metadata={'page': 6, 'source': '/content/drive/MyDrive/BioMistral/Data/
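
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that idea in plain Python with hand-made three-dimensional vectors. The vectors, the `top_k` helper, and the choice of cosine similarity are assumptions for illustration; the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "vector store": (chunk text, made-up embedding) pairs.
store = [
    ("risk factors for heart disease", [0.9, 0.1, 0.0]),
    ("choosing a weight-loss program", [0.1, 0.9, 0.1]),
    ("high blood pressure and the heart", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # made-up embedding of the query

results = top_k(query_vec, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Here `k` plays the same role as the `'k'` entry in `search_kwargs`: it caps how many of the nearest chunks are returned.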

## LLM loading

In [None]:
llm = LlamaCpp(
    model_path="/content/drive/MyDrive/BioMistral/BioMistral-7B.Q4_K_M.gguf",
    temperature=0.2,
    max_tokens=2048,
    top_p=1
)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /content/drive/MyDrive/BioMistral/BioMistral-7B.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = hub
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.att

## Use the LLM, retriever, and query to generate the final response

In [None]:
template = """
<|context|>
You are a medical assistant that follows instructions and generates accurate responses based on the query and the context provided.
Please be truthful and give direct answers.
</s>
<|user|>
{query}
</s>
<|assistant|>
"""


In [None]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate

In [None]:
prompt = ChatPromptTemplate.from_template(template)

In [None]:
rag_chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
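
Note that `template` defines only a `{query}` placeholder, so the documents supplied under the `"context"` key have no slot in the rendered prompt. The sketch below shows, in plain Python, one way the retrieved chunks could be joined and placed into the prompt. The `{context}` placeholder, the `Doc` stand-in class, and the `format_docs` and `build_prompt` helpers are assumptions for illustration and are not part of the original notebook.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


TEMPLATE_WITH_CONTEXT = """
<|context|>
You are a medical assistant. Answer the question using only the context below.
{context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""


def format_docs(docs) -> str:
    # Join the text of each retrieved chunk and drop the metadata.
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(docs, query: str) -> str:
    # Fill both placeholders so the model sees the retrieved text.
    return TEMPLATE_WITH_CONTEXT.format(context=format_docs(docs), query=query)


docs = [
    Doc("As early as age 45, a man's risk of heart disease begins to rise significantly."),
    Doc("For a woman, risk starts to increase at age 55."),
]
prompt_text = build_prompt(docs, "Who is at risk of heart disease?")
print(prompt_text)
```

In the LangChain chain, the equivalent change would likely be to add `{context}` to `template` and to pipe the retriever through a formatting function, for example `{"context": retriever | format_docs, "query": RunnablePassthrough()}`. A prompt that carries five chunks is much longer, so the model's context window (the `n_ctx` parameter of `LlamaCpp`) may need to be raised to fit it.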

In [None]:
response = rag_chain.invoke(query)


llama_print_timings:        load time =    4322.57 ms
llama_print_timings:      sample time =      45.30 ms /    67 runs   (    0.68 ms per token,  1479.19 tokens per second)
llama_print_timings: prompt eval time =   36180.28 ms /    71 tokens (  509.58 ms per token,     1.96 tokens per second)
llama_print_timings:        eval time =   44605.08 ms /    66 runs   (  675.83 ms per token,     1.48 tokens per second)
llama_print_timings:       total time =   81144.96 ms /   137 tokens
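
The `llama_print_timings` lines report elapsed milliseconds and token counts, from which per-token latency and throughput follow directly. As a quick check of the "eval time" line above, here is a small sketch (the helper name `throughput` is hypothetical):

```python
def throughput(elapsed_ms: float, tokens: int) -> tuple[float, float]:
    """Return (milliseconds per token, tokens per second)."""
    ms_per_token = elapsed_ms / tokens
    tokens_per_second = tokens / (elapsed_ms / 1000.0)
    return ms_per_token, tokens_per_second


# Values taken from the "eval time" line: 44605.08 ms over 66 runs.
ms_per_token, tps = throughput(44605.08, 66)
print(f"{ms_per_token:.2f} ms per token, {tps:.2f} tokens per second")
```

At roughly 1.5 generated tokens per second, an answer of about 100 tokens takes over a minute in this session.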


In [None]:
response

'The people who are at risk of heart disease are those who have high blood pressure, high cholesterol, diabetes, or a family history of heart disease. It is also important to maintain a healthy diet, exercise regularly, and do not smoke to reduce your risk. Would you like more information on any of these factors?'

In [None]:
# Simple interactive loop: type 'exit' to stop.
while True:
    user_input = input("Input query: ")
    if user_input == "exit":
        print("Exiting...")
        break  # leave the loop without raising SystemExit in the notebook
    if user_input == "":
        continue
    result = rag_chain.invoke(user_input)
    print("Answer: ", result)

Input query: What are the diseases that affect heart health?


Llama.generate: prefix-match hit

llama_print_timings:        load time =    4322.57 ms
llama_print_timings:      sample time =      75.43 ms /   114 runs   (    0.66 ms per token,  1511.25 tokens per second)
llama_print_timings: prompt eval time =   11116.12 ms /    19 tokens (  585.06 ms per token,     1.71 tokens per second)
llama_print_timings:        eval time =   77089.44 ms /   113 runs   (  682.21 ms per token,     1.47 tokens per second)
llama_print_timings:       total time =   88803.10 ms /   132 tokens


Answer:  The diseases that affect heart health include coronary artery disease, hypertension, diabetes mellitus, hyperlipidemia, obesity, and smoking. Other factors such as age, family history, and stress can also contribute to heart disease. It is essential to maintain a healthy lifestyle to prevent or manage heart disease. This includes regular exercise, a healthy diet, not smoking, and managing stress levels. Regular check-ups with a healthcare provider can also help detect any potential issues early and prevent them from developing into more serious conditions.
Input query: what are the preventive measures


Llama.generate: prefix-match hit

llama_print_timings:        load time =    4322.57 ms
llama_print_timings:      sample time =      72.44 ms /   107 runs   (    0.68 ms per token,  1477.13 tokens per second)
llama_print_timings: prompt eval time =    7699.64 ms /    16 tokens (  481.23 ms per token,     2.08 tokens per second)
llama_print_timings:        eval time =   70889.61 ms /   106 runs   (  668.77 ms per token,     1.50 tokens per second)
llama_print_timings:       total time =   79119.50 ms /   122 tokens


Answer:  The best way to prevent COVID-19 is to regularly wash your hands with soap and water for at least 20 seconds or use hand sanitizer containing at least 60% alcohol. Also, avoid touching your eyes, nose, or mouth with unwashed hands, avoid close contact with people who are sick, stay home when you are sick, cover your cough or sneeze with a tissue, clean and disinfect frequently touched objects and surfaces, and wear a mask or cloth face cover when in public.
Input query: How High blood Cholesterol affect heart health?


Llama.generate: prefix-match hit

llama_print_timings:        load time =    4322.57 ms
llama_print_timings:      sample time =     106.42 ms /   149 runs   (    0.71 ms per token,  1400.17 tokens per second)
llama_print_timings: prompt eval time =   10798.06 ms /    21 tokens (  514.19 ms per token,     1.94 tokens per second)
llama_print_timings:        eval time =   98872.11 ms /   148 runs   (  668.05 ms per token,     1.50 tokens per second)
llama_print_timings:       total time =  110419.31 ms /   169 tokens


Answer:  High blood cholesterol is a condition in which there is too much cholesterol in your blood. It is a condition in which there is too much cholesterol in your blood. This can be caused by several factors, including diet, lifestyle, family history, and certain medical conditions. If left untreated, high cholesterol can lead to the buildup of plaque in your arteries, which can increase your risk of heart disease, stroke, and other chronic conditions. It is important to work with your healthcare provider to manage your cholesterol levels through lifestyle changes, medication, or a combination of both. This can help reduce your risk of developing these conditions and improve your overall health.
Input query: exit
Exiting...


