Motivation:
Traditional RAG systems often retrieve with dense vector embeddings, which can be computationally expensive to build and do not always reflect how important an individual term is. BM25 RAG instead ranks documents with BM25, a probabilistic lexical ranking function that scores each document using term frequency, inverse document frequency, and document length. Because every score comes from explicit term matches, the ranking is easier to interpret, and it tends to work well for queries that hinge on specific or rare terms.
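
To make the scoring concrete, here is a minimal sketch of Okapi BM25 in plain Python. It is an illustration only, not the implementation used by the retriever later in this notebook. The function name `bm25_scores`, the whitespace tokenizer, the toy documents, and the defaults `k1=1.5` and `b=0.75` are all assumptions chosen for the example, and the IDF shown is one common variant among several.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms receive a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 saturates term frequency; b normalizes by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [
    "rainfall forecast helps prevent floods",
    "random forest regression for weather data",
    "the economy depends on agriculture",
]
scores = bm25_scores("rainfall forecast", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first document shares terms with the query, so it is the only one with a non-zero score. Documents with no matching term always score zero, which is what makes BM25 rankings easy to trace back to specific words.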

https://github.com/adithya-s-k/AI-Engineering.academy/blob/main/RAG/01_BM25_RAG/notebook.ipynb

In [5]:
%pip install qdrant_client

In [6]:
import logging
import sys
import os

from IPython.display import Markdown, display
import qdrant_client



In [8]:
%pip install llama_index
%pip install llama-index-embeddings-fastembed

In [9]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings

In [11]:
%pip install llama-index-embeddings-fastembed

In [12]:
%pip install llama-index-vector-stores-qdrant

In [13]:
from llama_index.vector_stores.qdrant import QdrantVectorStore

In [14]:
from llama_index.embeddings.fastembed import FastEmbedEmbedding

In [None]:
%pip install llama-index-llms-ollama

In [18]:
%pip install llama-index-embeddings-ollama

In [19]:
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

In [21]:
Settings.llm = Ollama(
    model="llama3.2",
    temperature=0.1,
    context_window=8096,  # total tokens the model can attend to (prompt + response), not an output limit
    streaming=True
)

In [22]:
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    embed_batch_size=10
)

In [23]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader("./data", recursive=True).load_data(show_progress=True)

Loading files: 100%|██████████| 1/1 [00:00<00:00,  2.34file/s]


In [25]:
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(include_metadata=True),
        # Alternative splitters (enable one instead of MarkdownNodeParser):
        # TokenTextSplitter(chunk_size=500, chunk_overlap=20),
        # SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        # SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95, embed_model=Settings.embed_model),
        # Embeddings are not used by BM25; keep this step only if the nodes
        # will also go into a vector index.
        Settings.embed_model,
    ],
)

# No vector store is attached, so the pipeline returns the nodes in memory
nodes = pipeline.run(documents=documents, show_progress=True)
print("Number of Nodes:", len(nodes))

Parsing nodes: 100%|██████████| 15/15 [00:00<00:00, 5102.15it/s]
Generating embeddings: 100%|██████████| 15/15 [01:36<00:00,  6.44s/it]

Number of Nodes: 15





In [26]:
from llama_index.core.storage.docstore import SimpleDocumentStore

# Store the parsed nodes and persist them to disk so they can be reloaded later
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
docstore.persist(persist_path="./docstore.json")

In [31]:
%pip install llama-index-retrievers-bm25

In [32]:
from llama_index.retrievers.bm25 import BM25Retriever
import Stemmer

In [33]:
# We can pass in the index, docstore, or list of nodes to create the retriever
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=2,
    # Optional: We can pass in the stemmer and set the language for stopwords
    # This is important for removing stopwords and stemming the query + text
    # The default is english for both
    stemmer=Stemmer.Stemmer("english"),
    language="english",
)
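
The comments above say that stopword removal and stemming matter for matching the query against the text. The sketch below shows why, using a deliberately crude normalizer. The stopword list and the `toy_stem` suffix stripper are hypothetical stand-ins written for this example; they are not the Snowball stemmer that `Stemmer.Stemmer("english")` provides, and they are not how `BM25Retriever` tokenizes internally.

```python
# Tiny illustrative stopword list (an assumption, not the library's list).
STOPWORDS = {"is", "the", "to", "why", "as", "can", "a", "of"}


def toy_stem(token):
    """Crude suffix stripping for illustration only; NOT the Snowball algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Lowercase, drop stopwords, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


query_terms = normalize("Why is rainfall forecasting important")
doc_terms = normalize("Rainfall forecast is imperative")
overlap = set(query_terms) & set(doc_terms)
print(overlap)
```

Without stemming, "forecasting" and "forecast" would count as different terms and contribute nothing to the score. Without stopword removal, a frequent word such as "is" would match almost every document and add noise to the ranking.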

In [35]:
%pip install matplotlib

In [36]:
from llama_index.core.response.notebook_utils import display_source_node


In [37]:
# Retrieve the top-k nodes (similarity_top_k=2) for a rainfall-related query
retrieved_nodes = bm25_retriever.retrieve(
    "Why is Rainfall forecast imperative as overwhelming precipitation can lead to numerous catastrophes"
)
for node in retrieved_nodes:
    display_source_node(node, source_length=5000)

**Node ID:** cdef3399-50b8-4a0f-a25f-24d4d04eafdf<br>**Similarity:** 5.760929107666016<br>**Text:** Development of Multiple Combined Regression
Methods for Rainfall Measurement.
Nusrat Jahan Prottasha1, Md. Jashim Uddin 2, Md. Kowsher3, Rokeya Khatun
Shorna4, Niaz Al Murshed 5, and Boktiar Ahmed Bappy 6
1 Daﬀodil International University Dhaka 1207, Bangladesh,
jahannusratprotta@gmail.com
2 Noakhali Science and Technology University, 3814, Dhaka,
mdjaud12@gmail.com
3 Stevens Institute of Technology, Hoboken, NJ 07030 USA,
ga.kowsher@gmail.com
4 Daﬀodil International University, 1207, Dhaka,
rokeyashorna5@gmail.com
5 Jahangirnagar University, 1342, Dhaka,
niazalmurshed.ai@gmail.com
6 Jhenaidah polytechnic institute, 7300, Dhaka,
entbappy73@gmail.com
Abstract. Rainfall forecast is imperative as overwhelming precipitation
can lead to numerous catastrophes. The prediction makes a diﬀerence for
individuals to require preventive measures. In addition, the expectation
ought to be precise. Most of the nations in the world is an agricultural
nation and most of the economy of any nation depends upon agriculture.
Rain plays an imperative part in agribusiness so the early expectation of
rainfall plays a vital part within the economy of any agricultural. Over-
whelming precipitation may well be a major disadvantage. It’s a cause
for natural disasters like ﬂoods and drought that unit of measurement
experienced by people over the world each year. Rainfall forecast has
been one of the foremost challenging issues around the world in the ﬁnal
year. There are so many techniques that have been invented for predict-
ing rainfall but most of them are classiﬁcation, clustering techniques.
Predicting the quantity of rain prediction is crucial for countries’ people.
In our paperwork, we have proposed some regression analysis techniques
which can be utilized for predicting the quantity of rainfall (The amount
of rainfall recorded for the day in mm) based on some historical weather
conditions dataset. we have applied 10 supervised regressors (Machine
Learning Model) and some preprocessing methodology to the dataset.
We have also analyzed the result and compared them using various sta-
tistical parameters among these trained models to ﬁnd the bestperformed
model. Using this model for predicting the quantity of rainfall in some
diﬀerent places. Finally, the Random Forest regressor has predicted the
best r2 score of 0.869904217, and the mean absolute error is 0.194459262,
mean squared error is 0.126358647 and the root mean squared error is
0.355469615. . .<br>

**Node ID:** 9fd739ac-06c2-4fc3-a8d8-44ae31442bad<br>**Similarity:** 3.0598549842834473<br>**Text:** 2 Nusrat Jahan et al.
Keywords: Rainfall, Supervised Learning, Regression, Random Forest
Tree, AdaBoost Regressor, Gradient Boosting Regressor, XGBoost.
1 Introduction
This research paper proposed a scientiﬁc method to predict rainfall quantity
based on some diﬀerent weather conditions considering preceding weather records
and present weather situations using some regression analysis techniques .[1]
Rainfall determining is exceptionally vital since overwhelming and irregular rain-
fall can have numerous impacts on many other things like annihilation of river-
bank, crops, agriculture, and farms. One of the very deleterious departures is
ﬂooding due to the over rain.[2] According to Wikipedia in late summer 2002,
enormous storm downpours driven to gigantic ﬂooding in eastern India, Nepal,
and Bangladesh, killing over 500 individuals and clearing out millions of houses.
Each year in Bangladesh approximately 26,000 square kilometers (10,000 sq mi)
(around 18% of the country) is ﬂooded, killing over 5,000 individuals and wreck-
ing more than 7 million homes. On the other hand, Western Sydney is now
the ”greatest concern” from the worst ﬂoods in decades to have ravaged east-
ern Australia.[3] Jonh C, Rodda et al. presented a very rational method of the
rainfall measurement problem. The application of science and innovation that
predicts the state of the environment at any given speciﬁc period is known as
climate determining or weather forecasting. There are many distinctive strate-
gies for climate estimate and weather forecasting. But rainfall prediction is rare.
Some of the research has shown some classiﬁcation method to predict whether
it would be rain tomorrow or not. But instead of a classiﬁcation method for pre-
dicting rain, we need to the quantity of the rainfall in a particular place. There
is numerous equipment implement for foreseeing rainfall by utilizing the climate
conditions like temperature, humidity, weight. These conventional strategies can-
not work productively so by utilizing machine learning procedures. we can create
an exact comes about rain forecast. Ready to fair do it by having the histori-
cal information investigation of rainfall and can anticipate the precipitation for
future seasons. In our paper, we presented some predictive regression analysis
techniques to quantify rainfall quantity at a place. Here we used more than 10
years of historical data to train our model. The dataset contains various weather
conditions of diﬀerent places. This method can be utilized to predict the rainfall
(The amount of rainfall recorded for the day in mm) and avoid the annihilation
caused by it to life, agriculture, farm, and property. If we can quantify the rain-
fall most people can make some decisions before overwhelmed rain-aﬀected. The
contributions of this work are summarised as:
– We have assessed a pipeline of making choices for evaluating the ﬁnest rea-
sonable rain prediction.
– We have utilized 10 supervised regressors (Machine Learning Model). Be-
cause diﬀerent regressors give us diﬀerent results. So, it’s essential to ﬁnd
out the right model according to the requirements.<br>

In [44]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode

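response_synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.COMPACT
)

# Pass the synthesizer explicitly; otherwise the engine builds its own default one
BM25_QUERY_ENGINE = RetrieverQueryEngine(
    retriever=bm25_retriever,
    response_synthesizer=response_synthesizer,
)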
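
As a rough intuition for `ResponseMode.COMPACT`: the retrieved chunks are merged into as few prompts as will fit in the context window, which reduces the number of LLM calls. The sketch below mimics that packing step under simplifying assumptions. The helper `pack_chunks` is hypothetical, it measures size in characters rather than tokens, and it is not LlamaIndex's actual implementation.

```python
def pack_chunks(chunks, max_chars):
    """Greedily merge chunks into as few prompts as fit within max_chars each."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        # An oversized chunk is still kept, alone, rather than dropped.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


# Three retrieved chunks collapse into two prompts under a 10-character budget.
prompts = pack_chunks(["aaaa", "bbbb", "cccc"], max_chars=10)
print(len(prompts))
```

With a larger budget all three chunks would fit in a single prompt, so the answer could be produced in one LLM call.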

In [46]:
response = BM25_QUERY_ENGINE.query("Why is rainfall forecasting important to prevent disasters?")
response

ResponseError: model requires more system memory (7.3 GiB) than is available (4.4 GiB)