<a href="https://colab.research.google.com/github/Alfred9/Exploring-LLMs/blob/main/Document%20Retrieval/Document_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Document retrieval**
Document retrieval is the task of extracting relevant documents from a corpus in response to a query. In this notebook we use a hybrid retrieval technique.

`Hybrid Retrieval` combines keyword-based and embedding-based retrieval techniques, leveraging the strengths of both approaches. In essence, dense embeddings capture the contextual meaning of the query, while keyword-based methods excel at exact term matching.
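
To make the contrast concrete, here is a minimal sketch that scores the same toy documents two ways. The helper names, the example sentences, and the 2-dimensional vectors are all made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def keyword_overlap(query, text):
    """Count the distinct query words that also appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = "heart attack"
texts = [
    "heart attack warning signs",
    "myocardial infarction treatment",  # same topic, no shared words
    "tax filing deadline",
]

# Hypothetical hand-written vectors, NOT produced by an embedding model.
query_vector = [0.9, 0.1]
text_vectors = [[0.8, 0.2], [0.85, 0.15], [0.1, 0.9]]

keyword_scores = [keyword_overlap(query, t) for t in texts]
dense_scores = [cosine_similarity(query_vector, v) for v in text_vectors]

for text, kw, dense in zip(texts, keyword_scores, dense_scores):
    print(f"{text!r}: keyword={kw}, dense={dense:.3f}")
```

Keyword matching gives the second text a score of zero because it shares no words with the query, while a similarity over (well-trained) embeddings can still place it close to the query.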

In [1]:
!pip install haystack-ai
!pip install "datasets >=2.6.1"
!pip install "sentence-transformers >=2.2.0"
!pip install accelerate



#### **Initializing DocumentStore**

We use `InMemoryDocumentStore` to initialize the DocumentStore, which holds the Documents that our system searches to answer queries.

In [2]:
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()

#### **Fetching and Processing Document**

We use a PubMed dataset from the Hugging Face Hub: `anakin87/medrag-pubmed-chunk`.

In [3]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("anakin87/medrag-pubmed-chunk", split="train")




In [4]:
# Transform the dataset into a list of Document objects, each holding the
# content plus metadata (title, abstract, PubMed ID) from the corresponding item.

docs = []
for doc in dataset:
    docs.append(
        Document(content=doc["contents"], meta={"title": doc["title"], "abstract": doc["content"], "pmid": doc["id"]})
    )

#### **Indexing Documents with a Pipeline**

Let's create a pipeline that stores the data in the document store together with its embeddings.
We use a `DocumentSplitter` to split documents into chunks of 512 words (with a 32-word overlap), a `SentenceTransformersDocumentEmbedder` to create document embeddings for dense retrieval, and a `DocumentWriter` to write the documents to the document store.
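
As a rough picture of what word-based splitting with overlap means, here is a plain-Python sketch. The function `split_by_words` is a hypothetical helper written for this illustration, not Haystack's implementation; the real `DocumentSplitter` may treat whitespace and edge cases differently.

```python
def split_by_words(text, split_length=512, split_overlap=32):
    """Split text into chunks of at most `split_length` words, where each
    chunk repeats the last `split_overlap` words of the previous one."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 1000-word text: w0 w1 ... w999
sample = " ".join(f"w{i}" for i in range(1000))
chunks = split_by_words(sample)
sizes = [len(chunk.split()) for chunk in chunks]
print(len(chunks), sizes)
```

With these settings the window advances 480 words at a time, so the 1000-word sample yields chunks of 512, 512, and 40 words, and neighbouring chunks share 32 words.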

In [5]:
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack import Pipeline
from haystack.utils import ComponentDevice


document_splitter = DocumentSplitter(split_by="word", split_length=512, split_overlap=32)
document_embedder = SentenceTransformersDocumentEmbedder(model="BAAI/bge-small-en-v1.5", device=ComponentDevice.from_str("cuda:0"))
document_writer = DocumentWriter(document_store)

In [6]:
# Set up a pipeline for document processing with three main components:
# a document splitter, a document embedder, and a document writer
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("document_splitter", document_splitter)
indexing_pipeline.add_component("document_embedder", document_embedder)
indexing_pipeline.add_component("document_writer", document_writer)

In [7]:
# connect pipeline components and then run the pipeline on the documents
indexing_pipeline.connect("document_splitter", "document_embedder")
indexing_pipeline.connect("document_embedder", "document_writer")

indexing_pipeline.run({"document_splitter": {"documents": docs}})

{'document_writer': {'documents_written': 15380}}

#### **Creating a hybrid retrieval Pipeline**

A hybrid retrieval pipeline lets the system leverage the strengths of different approaches by combining keyword-based search and dense vector search for more accurate and diverse results.

We use `InMemoryEmbeddingRetriever` for dense retrieval and `InMemoryBM25Retriever` for keyword-based retrieval.
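
For intuition about the keyword side, here is a minimal sketch of classic Okapi BM25 scoring in plain Python. It is an illustration only: the function name, the parameter defaults `k1=1.5` and `b=0.75`, the whitespace tokenizer, and the three toy documents are assumptions, and the BM25 variant that Haystack's retriever actually applies may differ.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with classic Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents written for this example, not taken from the dataset.
toy_docs = [
    "ADHD symptoms in children include inattention",
    "stimulant medication for hyperkinetic children",
    "cardiac surgery outcomes in adults",
]
scores = bm25_scores("ADHD symptoms in children", toy_docs)
for score, doc in sorted(zip(scores, toy_docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

The first toy document matches every query term and therefore ranks highest; the other two each match a single common term and receive a small positive score.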

The steps involved are:
1. Initialize the text embedder and the two retrievers, `InMemoryEmbeddingRetriever` and `InMemoryBM25Retriever`.
2. Join the retrieval results from the BM25 retriever and the embedding retriever.
3. Rank the results by relevance to the query using `TransformersSimilarityRanker`, which scores every retrieved document against the search query with the cross-encoder model `BAAI/bge-reranker-base`.
4. Add all initialized components to the pipeline and connect them to form the `hybrid_retrieval` pipeline.
5. Visualize the pipeline using the `draw()` method.
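
The joining step (step 2) can be pictured as follows. This sketch assumes the joiner concatenates both result lists and, when the same document appears twice, keeps the copy with the higher score; that is our understanding of `DocumentJoiner`'s default mode, so treat it as an assumption. The function name, document ids, and scores are hypothetical.

```python
def concatenate_and_dedupe(*result_lists):
    """Merge ranked lists of (doc_id, score) pairs, keeping the highest
    score seen for each doc_id, and return them sorted by score."""
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: BM25 scores are unbounded, cosine similarities are at most 1.
bm25_hits = [("doc_a", 7.2), ("doc_b", 3.1)]
dense_hits = [("doc_b", 0.83), ("doc_c", 0.79)]

merged = concatenate_and_dedupe(bm25_hits, dense_hits)
print(merged)
```

Because the two retrievers score on different scales, the merged order is not meaningful by itself, which is one reason a cross-encoder ranker is applied afterwards to rescore every candidate against the query.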

In [8]:
# Initialize the retrievers and the text embedder

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder


text_embedder = SentenceTransformersTextEmbedder(
    model="BAAI/bge-small-en-v1.5", device=ComponentDevice.from_str("cuda:0")
)

embedding_retriever = InMemoryEmbeddingRetriever(document_store)
bm25_retriever = InMemoryBM25Retriever(document_store)

In [9]:
# Join the retrieval results

from haystack.components.joiners import DocumentJoiner

document_joiner = DocumentJoiner()

In [10]:
# Rank the results

from haystack.components.rankers import TransformersSimilarityRanker

ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

In [11]:
# create retrieval pipeline by adding all components to the pipeline

from haystack import Pipeline

hybrid_retrieval = Pipeline()
hybrid_retrieval.add_component("text_embedder", text_embedder)
hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
hybrid_retrieval.add_component("document_joiner", document_joiner)
hybrid_retrieval.add_component("ranker", ranker)


hybrid_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_retrieval.connect("bm25_retriever", "document_joiner")
hybrid_retrieval.connect("embedding_retriever", "document_joiner")
hybrid_retrieval.connect("document_joiner", "ranker")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7e2912b5b9a0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: TransformersSimilarityRanker
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (List[float])
  - embedding_retriever.documents -> document_joiner.documents (List[Document])
  - bm25_retriever.documents -> document_joiner.documents (List[Document])
  - document_joiner.documents -> ranker.documents (List[Document])

In [12]:
# Visualize the pipeline by saving its graph as an image
hybrid_retrieval.draw("hybrid-retrieval.png")


#### **Testing the Hybrid Retrieval Pipeline**
We pass the query to `text_embedder`, `bm25_retriever`, and `ranker`, then run the retrieval pipeline.

In [13]:
query = "ADHD symptoms in children"

result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
)

In [14]:
# Print the results like a search results page: each document's title and
# relevance score, followed by its abstract, in ranked order.
def pretty_print_results(prediction):
    for doc in prediction["documents"]:
        print(doc.meta["title"], "\t", doc.score)
        print(doc.meta["abstract"])
        print("\n", "\n")


pretty_print_results(result["ranker"])


Medication for hyperkinetic children. 	 0.9761005640029907
The hyperkinetic syndrome is a symptom complex of hyperactivity, short attention span, distractibility, impulsivity, learning difficulties, other behaviour problems and 'equivocal' neurological signs. However, none of these terms has ever been objectively defined and at present diagnosis is largely a matter of clinical judgement. In the management of the disorder, drugs do have a place but the decision to use medication is a complex procedure diagnostically and therapeutically calling for the highest in clinical skill and medical supervision. The most useful medication at present is the stimulant group of drugs, particularly dextroamphetamine and methylphenidate. Antipsychotic drugs are sometimes useful but carry the risk of depressing higher CNS functions such as attention and cognition. Other drugs which have been shown to be of value include tricyclic antidepressants (although their effect is less predictable and less striki