<a href="https://colab.research.google.com/github/Alfred9/Exploring-LLMs/blob/main/Document%20Retrieval/Document_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install haystack-ai
!pip install "datasets >=2.6.1"
!pip install "sentence-transformers >=2.2.0"
!pip install accelerate



#### **Initializing DocumentStore**

In [2]:
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()

#### **Fetching and Processing Document**

In [3]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("anakin87/medrag-pubmed-chunk", split="train")




In [4]:
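# Wrap each dataset row in a Haystack Document; title, abstract, and PubMed ID go into the metadata
docs = []
for doc in dataset:
    docs.append(
        Document(content=doc["contents"], meta={"title": doc["title"], "abstract": doc["content"], "pmid": doc["id"]})
    )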

#### **Indexing Documents with a Pipeline**

In [5]:
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack import Pipeline
from haystack.utils import ComponentDevice


document_splitter = DocumentSplitter(split_by="word", split_length=512, split_overlap=32)
document_embedder = SentenceTransformersDocumentEmbedder(model="BAAI/bge-small-en-v1.5", device=ComponentDevice.from_str("cuda:0"))
document_writer = DocumentWriter(document_store)
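
To make the `split_length=512, split_overlap=32` settings above concrete, here is a minimal pure-Python sketch of word-based splitting with overlap. The function name `split_words` is hypothetical, and this is a simplified illustration of the idea, not Haystack's actual `DocumentSplitter` implementation, which may treat whitespace and edge cases differently.

```python
def split_words(text, split_length=512, split_overlap=32):
    """Split text into chunks of up to `split_length` words,
    where consecutive chunks share `split_overlap` words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text
        if start + split_length >= len(words):
            break
    return chunks


# Small example: 10 words, windows of 4 words that overlap by 1 word
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_words(sample, split_length=4, split_overlap=1)
for chunk in chunks:
    print(chunk)
```

With an overlap, the last word(s) of one chunk reappear at the start of the next, which helps keep a sentence that straddles a chunk boundary retrievable from either side.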

In [6]:
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("document_splitter", document_splitter)
indexing_pipeline.add_component("document_embedder", document_embedder)
indexing_pipeline.add_component("document_writer", document_writer)

In [7]:
indexing_pipeline.connect("document_splitter", "document_embedder")
indexing_pipeline.connect("document_embedder", "document_writer")

indexing_pipeline.run({"document_splitter": {"documents": docs}})


{'document_writer': {'documents_written': 15380}}

#### **Creating a Hybrid Retrieval Pipeline**

A hybrid retrieval pipeline combines keyword-based search with dense vector search, so it can draw on the strengths of both approaches and return more accurate and diverse results.

We use `InMemoryBM25Retriever` for keyword-based retrieval and `InMemoryEmbeddingRetriever` for dense retrieval.
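
As a rough illustration of how two ranked result lists can be combined, the sketch below implements reciprocal rank fusion in plain Python. This is an assumption made for illustration only: the pipeline in this notebook joins the two result lists with a default `DocumentJoiner` and then relies on a ranker to order them. The function name, the document IDs, and the constant `k=60` are all hypothetical choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword retriever and a dense retriever
keyword_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]

fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
print(fused)
```

Documents found by both retrievers (`a` and `b` here) accumulate score from both lists, which is the intuition behind hybrid retrieval: agreement between a lexical and a semantic signal is treated as stronger evidence than either one alone.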

In [8]:
# Initialize the query embedder and the two retrievers

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder


text_embedder = SentenceTransformersTextEmbedder(
    model="BAAI/bge-small-en-v1.5", device=ComponentDevice.from_str("cuda:0")
)

embedding_retriever = InMemoryEmbeddingRetriever(document_store)
bm25_retriever = InMemoryBM25Retriever(document_store)
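
The embedders above are pinned to `"cuda:0"`, which assumes a GPU runtime. If the notebook might also run on a CPU-only machine, one possible approach is to pick the device string at runtime. The helper below is a hypothetical sketch (the name `pick_device_string` is not part of Haystack); it falls back to `"cpu"` when PyTorch is missing or no CUDA device is visible.

```python
def pick_device_string(preferred="cuda:0"):
    """Return `preferred` if a CUDA device is available, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of sentence-transformers
    except ImportError:
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


device_string = pick_device_string()
print(device_string)

# Possible usage with the components in this notebook:
# device = ComponentDevice.from_str(device_string)
```

Running the embedding model on CPU should still work, but indexing all documents is likely to be noticeably slower than on a GPU.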

In [9]:
# Join the retrieval results

from haystack.components.joiners import DocumentJoiner

document_joiner = DocumentJoiner()

In [10]:
# Rank the results

from haystack.components.rankers import TransformersSimilarityRanker

ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

In [11]:
# Create the pipeline, add all components, and connect them

from haystack import Pipeline

hybrid_retrieval = Pipeline()
hybrid_retrieval.add_component("text_embedder", text_embedder)
hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
hybrid_retrieval.add_component("document_joiner", document_joiner)
hybrid_retrieval.add_component("ranker", ranker)


hybrid_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_retrieval.connect("bm25_retriever", "document_joiner")
hybrid_retrieval.connect("embedding_retriever", "document_joiner")
hybrid_retrieval.connect("document_joiner", "ranker")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7d01ca37ed40>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: TransformersSimilarityRanker
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (List[float])
  - embedding_retriever.documents -> document_joiner.documents (List[Document])
  - bm25_retriever.documents -> document_joiner.documents (List[Document])
  - document_joiner.documents -> ranker.documents (List[Document])

In [12]:
hybrid_retrieval.draw("hybrid-retrieval.png")


In [18]:
query = "epilepsy symptoms in infants"

result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
)


In [19]:
def pretty_print_results(prediction):
    # Print each ranked document's title and score, followed by its abstract
    for doc in prediction["documents"]:
        print(doc.meta["title"], "\t", doc.score)
        print(doc.meta["abstract"])
        print("\n", "\n")


pretty_print_results(result["ranker"])


[Epileptic headaches (author's transl)]. 	 0.955604076385498
The author describes three observations characterized by the following similar features: 1. Children, between 2 and 9 years, suffering from attacks of sterotyped headaches. 2. EEG with bursts of high- voltage paroxysmal discharges, mainly during hyperventilation. 3. Efficiency of antiepileptic drugs contrary to antimigrainous treatment. These headaches are considered by the author as "epileptic", and this diagnosis is discussed.

 

Cerebrospinal fluid acid-base status and lactate and pyruvate concentrations after convulsions of varied duration and aetiology in children. 	 0.8956598043441772
Twenty-two infants and children were studied after convulsions of varied cause and duration. Arterial and CSF acid-base variables, lactate and pyruvate concentrations, and lactate/pyruvate ratios were measured between 3 and 18 hours after convulsive episodes. Biochemical signs of cerebral hypoxia were found in 7 patients with prolonged (g