# RAG Development 

Ollama must be installed locally before running the code.

After installing Ollama, install the required dependencies and pull Phi-3, a small language model developed by Microsoft.

In [9]:
%pip install langchain_community 

Note: you may need to restart the kernel to use updated packages.


To download the phi3 model, uncomment and run the following pull command.

In [None]:
#!ollama pull phi3 

Querying the LLM directly with a non-domain question, before any retrieval is added.

In [12]:
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOllama(model="phi3")
prompt = ChatPromptTemplate.from_template("Where is Kerala ?")

# Using LangChain Expression Language (LCEL) chain syntax
chain = prompt | llm | StrOutputParser()
print(chain.invoke({}))

 Kerala is a state located in the southwestern region of India. It lies along the Malabar Coast, bordered by the Western Ghats mountain range to the east and the Arabian Sea (a part of the Indian Ocean) to the west. The neighboring states are Karnataka to the northwest, Tamil Nadu to the northeast, and Lakshadweep Island to the southwest across the Arabian Sea. Kerala is known for its natural beauty, diverse landscapes ranging from sandy beaches to dense rainforests, as well as its rich cultural heritage and vibrant festivals.


Installing the necessary packages and importing the libraries.

In [3]:
# Install prerequisites
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-ollama
!pip install llama-index-vector-stores-chroma
!pip install llama-index ipywidgets
!pip install llama-index-llms-huggingface
!pip install llama-index-readers-web
!pip install chromadb

# Import required modules from the llama_index library
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Import ChromaVectorStore and the chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

Collecting llama-index-vector-stores-chroma
  Obtaining dependency information for llama-index-vector-stores-chroma from https://files.pythonhosted.org/packages/df/b1/a8b06770de7eb8ddd3656b08c16884c43099aa9b140754e894dcb7528d8c/llama_index_vector_stores_chroma-0.1.8-py3-none-any.whl.metadata
  Downloading llama_index_vector_stores_chroma-0.1.8-py3-none-any.whl.metadata (653 bytes)
Collecting chromadb<0.6.0,>=0.4.0 (from llama-index-vector-stores-chroma)
  Obtaining dependency information for chromadb<0.6.0,>=0.4.0 from https://files.pythonhosted.org/packages/a4/e1/ce276f553811bd6c684cfe5f637a33ae6444750746f974a8f73d5dc92004/chromadb-0.5.0-py3-none-any.whl.metadata
  Downloading chromadb-0.5.0-py3-none-any.whl.metadata (7.3 kB)
Collecting build>=1.0.3 (from chromadb<0.6.0,>=0.4.0->llama-index-vector-stores-chroma)
  Obtaining dependency information for build>=1.0.3 from https://files.pythonhosted.org/packages/e2/03/f3c8ba0a6b6e30d7d18c40faab90807c9bb5e9a1e3b2fe2008af624a9c97/build-1.2.1

ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device





# Data Preprocessing

The dataset for this RAG application consists of four articles on recent advances and ongoing research in Alzheimer's disease. After the content is extracted from the webpages, it is cleaned and combined into a single document.

In [13]:
from llama_index.readers.web import BeautifulSoupWebReader
# Webpage URLs of the four source articles
urls = [
    "https://projects.research-and-innovation.ec.europa.eu/en/projects/success-stories/all/brain-study-opens-door-potential-new-disease-treatments",
    "https://medicalxpress.com/news/2024-03-door-earlier-diagnosis-potential-treatment.html",
    "https://www.mountsinai.org/about/newsroom/2024/altering-cellular-interactions-around-amyloid-plaques-may-offer-novel-alzheimers-treatment-strategies",
    "https://www.bmh.manchester.ac.uk/stories/blood-vessel-breakthrough/"
]

# Initializing an empty list to store text content from each article
documents = []

# Fetch the HTML for each URL and extract its text content
for url in urls:
    html_data = BeautifulSoupWebReader().load_data([url])
    text_content = html_data[0].text.strip()  
    documents.append(text_content)

# Combine the text content into a single document
combined_document = "\n\n".join(documents)

# Remove newline ("\n") and tab ("\t") characters; this also removes the "\n\n" separators added by the join above
combined_document = combined_document.replace("\n", "")
combined_document = combined_document.replace("\t", "")


# Print the combined document
print("Combined cleaned document:")
print(combined_document)

Combined cleaned document:
Brain study opens door to potential new disease treatments | Research and InnovationSkip to main content Search  You must have JavaScript enabled to use this form.SearchHome…ProjectsSuccess storiesAll success storiesBrain study opens door to potential new disease treatmentsBrain study opens door to potential new disease treatmentsMillions of people suffer from brain diseases. To better understand what happens in the brains of these patients, the EU-funded RobustSynapses project focused on synapses, where many brain conditions often first develop. By identifying key things that can go wrong, the project team has opened the door to potential new targets for life-saving treatments that would benefit everyone.©Zoran Milic #211950775, source: stock.adobe.com 2021PDF BasketNo article selectedConvert to PDFEmpty basketAdd to pdf basket5Jul2021 Neurodegenerative disorders such as Alzheimer’s disease, Parkinson’s disease and amyotrophic lateral sclerosis (ALS) all aff
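
The cleaning step above deletes newlines and tabs outright, which is why words such as "InnovationSkip" appear fused in the printed output. A minimal sketch of an alternative, assuming it is acceptable to change the cleaned text: collapse each run of whitespace into a single space. The helper name `normalize_whitespace` and the sample string are illustrative and not part of the original notebook.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample resembling scraped page text
sample = "Research and Innovation\nSkip to main content\t\tSearch"

# Deleting the characters fuses neighbouring words
print(sample.replace("\n", "").replace("\t", ""))

# Replacing each whitespace run with a single space keeps word boundaries
print(normalize_whitespace(sample))
```

Using this in place of the two `replace` calls would change the chunks and therefore the retrieval results shown later, so the outputs in this notebook would no longer match exactly.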

The cleaned, combined document is then saved as a text file named Research_paper.txt.

In [15]:
with open('Research_paper.txt', 'w', encoding='utf-8') as text_file:
    text_file.write(combined_document)

To prepare the knowledge base for the RAG application, the document is first chunked and then embedded using a Hugging Face embedding model. The embeddings are then used to build the vector database.

# Chunking

The chunking strategy is to split the document into sentences, approximated here by splitting on a period followed by a space.

In [16]:
split_docs = combined_document.split(". ")
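
Splitting on `". "` removes the period from every chunk except the last and ignores sentences ending in `?` or `!`. A hedged sketch of a regex-based splitter that keeps the terminating punctuation is shown below. The function name `split_sentences` and the sample text are illustrative assumptions, and abbreviations such as "e.g. " would still be split incorrectly.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Hypothetical sample text
sample = (
    "Millions of people suffer from brain diseases. "
    "The project focused on synapses. What can go wrong?"
)

print(sample.split(". "))       # periods are dropped from the first two chunks
print(split_sentences(sample))  # every chunk keeps its final punctuation
```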

Saving the chunks created from the dataset as output files.

In [18]:
import os

def save_chunks_to_files(folder_name, chunked_data):
    # Create the output directory under /content
    out_dir = os.path.join("/content/", folder_name)
    try:
        os.makedirs(out_dir, exist_ok=True)
    except OSError as e:
        print(f"Error creating directory: {e}")
        return

    # Write each chunk to its own numbered file
    for count, doc in enumerate(chunked_data):
        fname = os.path.join(out_dir, f"Output{count}.txt")
        with open(fname, "w", encoding="utf-8") as text_file:
            text_file.write(str(doc))

save_chunks_to_files("chunks", split_docs)
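
The function above hard-codes the `/content/` base path. A minimal sketch of a variant that takes the base directory as a parameter, assuming `pathlib` is preferred, is given below. The name `save_chunks`, the `base_dir` parameter, and the temporary directory are hypothetical and stand in for the notebook's own folder.

```python
import tempfile
from pathlib import Path


def save_chunks(base_dir, folder_name, chunks):
    """Write each chunk to <base_dir>/<folder_name>/Output<i>.txt and return the paths."""
    out_dir = Path(base_dir) / folder_name
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = out_dir / f"Output{i}.txt"
        path.write_text(str(chunk), encoding="utf-8")
        paths.append(path)
    return paths


# A temporary directory stands in for the notebook's /content folder
with tempfile.TemporaryDirectory() as tmp:
    written = save_chunks(tmp, "chunks", ["first chunk", "second chunk"])
    names = [p.name for p in written]
    contents = [p.read_text(encoding="utf-8") for p in written]

print(names)
print(contents)
```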

# Embedding

After chunking, embeddings are created for the chunks using a Hugging Face embedding model. The LLM used for text generation is Phi-3, a small language model developed by Microsoft.

In [19]:
# Use the Phi-3 as our LLM
# Set a timeout of 10 minutes
llm = Ollama(model="phi3", request_timeout=600.0)

In [20]:
# Initialize a HuggingFace Embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Register the LLM and embedding model in LlamaIndex's global settings
Settings.llm = llm
Settings.embed_model = embed_model


# Vector Database

The embeddings are then indexed in a vector database. ChromaDB is used as the database here.

In [21]:
# Import ChromaVectorStore and chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Load documents
reader = SimpleDirectoryReader("/content/chunks")  # load documents from the chunks folder created above
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# Create client ("db") and a database ("chroma_db")
db = chromadb.PersistentClient(path="./chroma_db")

# Create a collection/table in the db
chroma_collection = db.create_collection("Research_papers_collection")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Specify Chroma as our vector db
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the vector index
vector_index = VectorStoreIndex.from_documents(
    docs,  # the chunk documents loaded above
    storage_context = storage_context,
    embed_model = embed_model
)

# Print the metadata
print(chroma_collection)

# Print the name of the collection (table)
print(f'Collection name is: {chroma_collection.name}')

Loaded 50 docs
name='Research_papers_collection' id=UUID('95100208-0c26-4610-9b9e-da8d0c9b7fb8') metadata=None tenant='default_tenant' database='default_database'
Collection name is: Research_papers_collection
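
To illustrate what a vector index does at query time, the toy sketch below ranks stored vectors by similarity to a query vector and returns the top matches. It is not how Chroma or LlamaIndex are implemented internally. The three-dimensional vectors are hand-made stand-ins for real embeddings (BAAI/bge-small-en-v1.5 produces 384-dimensional vectors), cosine similarity is just one common choice of measure, and `k=2` is an illustrative value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (chunk_id, score) pairs most similar to query_vec."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy "embeddings" keyed by chunk file name
store = {
    "Output0.txt": [0.9, 0.1, 0.0],
    "Output1.txt": [0.1, 0.9, 0.0],
    "Output2.txt": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

for chunk_id, score in top_k(query_vec, store):
    print(f"{chunk_id}: {score:.3f}")
```

The chunks returned by this kind of lookup are what gets passed to the LLM as context.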


# Prompt Template

Creating a prompt template for querying the LLM.

In [22]:
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context provided, "
    "answer the question: {query_str}\n"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "If the question is not related to the context, please say you can't answer the question"
        ),
    ),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]

text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
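
To make the two placeholders concrete, the sketch below fills the same prompt string with plain Python `str.format`. This only illustrates the shape of the final user message. LlamaIndex performs its own substitution when the query engine runs, and the context and question used here are hypothetical examples.

```python
qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context provided, "
    "answer the question: {query_str}\n"
)

# Hypothetical retrieved chunk and user question
filled = qa_prompt_str.format(
    context_str="The RobustSynapses project focused on synapses.",
    query_str="What did the RobustSynapses project focus on?",
)
print(filled)
```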

# Query Pipeline

The query pipeline retrieves relevant chunks from the vector database, passes the query together with that context to the LLM, and generates a response.

The following queries relate to the knowledge base, that is, the domain of the RAG application.

In [23]:
print(
    vector_index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("What is the focus of the EU-funded RobustSynapses project, and why is it significant for understanding neurodegenerative diseases?")
)

 The focus of the EU-funded RobustSynapses project is on synapses in the brain. Synapses are where many brain conditions often first develop. This research is significant for understanding neurodegenerative diseases because, as explained by principal investigator Patrik Verstreken, there are no cures available for major neurodegenerative conditions such as dementia and paralysis. By studying synapses, the RobustSynapses project aims to address one of the largest unmet medical needs in this field, potentially leading to advancements that could help treat or prevent these debilitating diseases.


In [24]:
print(
    vector_index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("Who is Patrik Verstreken, and what is his role in the RobustSynapses project?")
)

 Patrik Verstreken is the principal investigator of the RobustSynapses project. He serves as the scientific director and group leader at the VIB Center for Brain & Disease Research at KU Leuven, Belgium. His role in the project involves addressing unmet medical needs related to major neurodegenerative conditions by exploring potential therapeutic targets and treatments.


In [25]:
print(
    vector_index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("What are the two most common types of dementia, and how many people in the UK are affected by them?")
)

 The two most common types of dementia are Alzheimer's disease and vascular dementia. Collectively, they affect around 700,000 people in the UK.


In [32]:
print(
    vector_index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("What role do reactive astrocytes play in Alzheimer's disease, and how do they contribute to the brain's ability to clear amyloid plaques?")
)

 Based on the given context information, reactive astrocytes are a type of brain cell that becomes activated in response to injury or disease. In relation to Alzheimer's disease, these cells play a crucial role as indicated by the study focusing on their involvement and interaction with the plexin-B1 protein. The context suggests that reactive astrocytes contribute to brain cell communication and potentially aid in the clearance of amyloid plaques, which are characteristic features of Alzheimer's pathophysiology. However, specific mechanisms by which they facilitate this process are not provided within the given information.


In [27]:
print(
    vector_index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("What is the main cause of vascular dementia ?")
)


 The main cause of vascular dementia, as mentioned in the context, is high blood pressure (known as hypertension). This condition leads to damage to the small arteries of the brain.


Out-of-context queries: these queries fall outside the knowledge base of the RAG application (non-domain queries).

In [28]:
print(
    vector_index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("Where is pyramid ?")
)


 Based on the given context information, there is no mention of a location called "pyramid". Therefore, it's not possible to provide an answer about where "pyramid" is located. Please ensure the details you are seeking pertain to the provided context. If additional information becomes available or if this question refers to something else in the wider scope of knowledge, feel free to ask!


In [36]:
print(
    vector_index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("Where is Kerala ?")
)

 I'm sorry, but based on the given context information, there is no mention or reference to Kerala. The content focuses on Mount Sinai Health System and related entities. Therefore, I cannot provide an answer about Kerala in this specific context.


In [42]:
print(
    vector_index.as_query_engine(
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("How to treat ulcers ?")
)

 I'm sorry, but based on the provided context information, it does not relate to the treatment of ulcers. The given texts discuss findings related to cell interactions with harmful plaques in neuroscience research. To provide information about treating ulcers, I would need a relevant source or context specifically addressing gastrointestinal health and ulcer treatments.
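
Each query cell above builds a new query engine. As a hedged sketch, a small helper can run a batch of questions through one callable instead. In the notebook that callable could be `query_engine.query`, with `query_engine = vector_index.as_query_engine(text_qa_template=text_qa_template, llm=llm)` created once. The helper name `run_queries` and the `fake_query` stand-in are hypothetical and are used only so the example runs without Ollama or Chroma.

```python
from typing import Callable, Iterable


def run_queries(query_fn: Callable[[str], object], questions: Iterable[str]) -> dict:
    """Call query_fn once per question and collect the answers as stripped strings."""
    return {q: str(query_fn(q)).strip() for q in questions}


# Stand-in for query_engine.query; it does not call any model
def fake_query(question: str) -> str:
    return f" (answer to: {question})"


answers = run_queries(
    fake_query,
    ["What is the main cause of vascular dementia ?", "Where is Kerala ?"],
)
for question, answer in answers.items():
    print(f"Q: {question}\nA: {answer}\n")
```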
