# Data indexing

## Scifact Dataset


The Scifact dataset is a specialized collection of scientific claims and evidence from research papers, designed for scientific fact-checking and verification tasks. It consists of scientific claims paired with abstracts from research papers that either support or refute these claims.

The dataset contains over 5,000 scientific abstracts from research papers across scientific domains such as medicine, biology, and chemistry. Each entry includes a unique ID (`_id`), the paper's title (`title`), and the full text of the abstract (`text`).

Originally created to help evaluate scientific claim verification systems, this dataset is also distributed as part of BEIR (Benchmarking IR), a collection of datasets for evaluating information retrieval models, which is where the `BeIR/scifact` identifier used below comes from. It's particularly useful for building scientific fact-checking systems, training models to understand and verify scientific claims, and developing information retrieval systems for scientific literature.

Let's explore the dataset structure and prepare it for our RAG application.



In [None]:
from datasets import load_dataset

dataset = load_dataset("BeIR/scifact", "corpus", split="corpus")
dataset[0]

In [None]:
len(dataset)

## Dense embeddings

We're not going to choose the fanciest embedding model out there, but stick to something simple and efficient. FastEmbed comes with some pretrained models that we can use out of the box. Because FastEmbed runs them through ONNX Runtime, these models run efficiently even on a CPU. The `all-MiniLM-L6-v2` model is lightweight and a good starting point.

In [None]:
from fastembed import TextEmbedding

dense_embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
dense_embeddings = list(dense_embedding_model.passage_embed(dataset["text"][0:1]))
len(dense_embeddings)

In [None]:
len(dense_embeddings[0])
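
The collection created later in this notebook compares dense vectors with cosine distance. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch. The helper `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for this example; they are not FastEmbed output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
doc_vector = [1.0, 2.0, 2.0]
same_direction = [2.0, 4.0, 4.0]  # doc_vector scaled by 2
orthogonal = [2.0, -1.0, 0.0]     # dot product with doc_vector is 0

sim_same = cosine_similarity(doc_vector, same_direction)
sim_orthogonal = cosine_similarity(doc_vector, orthogonal)
print(sim_same, sim_orthogonal)
```

Only the direction of a vector matters for this metric, not its length, which is why the scaled copy scores as a perfect match.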

## Sparse embeddings

Similarly, we can use a BM25 model to generate sparse embeddings, which will hopefully handle the cases in which the dense embeddings fail, such as queries that hinge on exact keyword matches.

In [None]:
from fastembed import SparseTextEmbedding

bm25_embedding_model = SparseTextEmbedding("Qdrant/bm25")
bm25_embeddings = list(bm25_embedding_model.passage_embed(dataset["text"][0:1]))
bm25_embeddings
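
To build some intuition for what a BM25-style sparse vector holds, here is a simplified pure-Python sketch. It is not how `Qdrant/bm25` is implemented: it uses plain tokens as keys instead of integer indices, skips stemming and stopword removal, and relies on made-up documents. The constants `K1` and `B` are commonly used defaults that we assume here. The IDF part is kept separate from the document vector on purpose, since the `Modifier.IDF` setting in the collection configuration later in this notebook suggests that IDF is applied on the Qdrant side.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # assumed BM25 constants, not read from FastEmbed


def tokenize(text):
    return text.lower().split()


def bm25_tf_weights(text, avg_len):
    """Sparse document vector: token -> saturated term-frequency weight (no IDF)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    length_norm = 1 - B + B * len(tokens) / avg_len
    return {t: tf * (K1 + 1) / (tf + K1 * length_norm) for t, tf in counts.items()}


def idf(term, docs_tokens):
    """Inverse document frequency: rarer terms get a larger weight."""
    n = sum(1 for tokens in docs_tokens if term in tokens)
    total = len(docs_tokens)
    return math.log(1 + (total - n + 0.5) / (n + 0.5))


# Made-up mini corpus, for illustration only.
docs = [
    "zinc supplementation reduces infection risk",
    "vitamin d supplementation and bone density",
    "gene expression in zebrafish embryos",
]
docs_tokens = [set(tokenize(d)) for d in docs]
avg_len = sum(len(tokenize(d)) for d in docs) / len(docs)
doc_vectors = [bm25_tf_weights(d, avg_len) for d in docs]


def score(query, doc_vector):
    """Sum of IDF-weighted term weights over the query terms."""
    return sum(idf(t, docs_tokens) * doc_vector.get(t, 0.0)
               for t in set(tokenize(query)))


scores = [score("zinc supplementation", v) for v in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Most entries of such a vector are zero (every token not present in the document), which is why only the non-zero pairs are stored.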


## Putting data in a Qdrant collection

All the vectors can now be upserted into a Qdrant collection. Keeping them in a single collection makes it possible to combine different embeddings and even build a complex pipeline with several steps. Depending on the specifics of your data, you may prefer a different approach.

In [None]:
!docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.10.0


In [None]:
from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333", timeout=600)
client.create_collection(
    "scifact",
    vectors_config={
        "all-MiniLM-L6-v2": models.VectorParams(
            size=len(dense_embeddings[0]),
            distance=models.Distance.COSINE,
        )
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    }
)

In [None]:
import math

import tqdm

batch_size = 5
for batch in tqdm.tqdm(dataset.iter(batch_size=batch_size),
                       # Round up so that a final, partial batch is counted too.
                       total=math.ceil(len(dataset) / batch_size)):
    dense_embeddings = list(dense_embedding_model.passage_embed(batch["text"]))
    bm25_embeddings = list(bm25_embedding_model.passage_embed(batch["text"]))
    
    client.upload_points(
        "scifact",
        points=[
            models.PointStruct(
                id=int(batch["_id"][i]),
                vector={
                    "all-MiniLM-L6-v2": dense_embeddings[i].tolist(),
                    "bm25": bm25_embeddings[i].as_object(),
                },
                payload={
                    "_id": batch["_id"][i],
                    "title": batch["title"][i],
                    "text": batch["text"][i],
                }
            )
            for i, _ in enumerate(batch["_id"])
        ],
        # We send a lot of embeddings at once, so it's best to reduce the batch size.
        # Otherwise, we would have gigantic requests sent for each batch and we can
        # easily reach the maximum size of a single request.
        batch_size=batch_size,  
    )
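
The progress bar total in the loop above depends on how many batches the iteration yields. When the number of documents is not a multiple of the batch size, the last batch is smaller, so the batch count is the ceiling of the division rather than the floor. A minimal sketch with a hypothetical `iter_batches` helper and 12 stand-in items:

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


items = list(range(12))  # stand-in for 12 documents (hypothetical)
batch_size = 5

batches = list(iter_batches(items, batch_size))
batch_lengths = [len(b) for b in batches]
floor_count = len(items) // batch_size
ceil_count = math.ceil(len(items) / batch_size)
print(batch_lengths, floor_count, ceil_count)
```

Here the floor division reports one batch fewer than the loop actually processes.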

In [None]:
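# Alternative (left commented out): restore the collection from a snapshot
# instead of computing the embeddings and uploading the points above.
# client.recover_snapshot(
#     "scifact",
#     location="https://storage.googleapis.com/common-datasets-snapshots/scifact-multiple-representations.snapshot",
# )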

In [None]:
client.get_collection("scifact")
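
With both named vectors stored in one collection, a later search step can rank documents by each representation separately and then merge the two rankings. One common way to merge them is reciprocal rank fusion (RRF). It is our choice for illustration, not something this notebook prescribes. The sketch below is pure Python; the point IDs are made up, and `k=60` is a commonly used constant that we assume here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical top-3 results from a dense and a sparse search.
dense_ranking = [101, 205, 309]
sparse_ranking = [205, 412, 101]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that appear near the top of both lists end up ahead of documents that only one representation found.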