# Data indexing

## Scifact Dataset


The Scifact dataset is a specialized collection of scientific claims and evidence from research papers, designed for scientific fact-checking and verification tasks. It consists of scientific claims paired with abstracts from research papers that either support or refute these claims.

The dataset contains over 5,000 scientific abstracts from research papers across various scientific domains including medicine, biology, chemistry, and other life sciences. Each entry in the dataset includes a unique ID, the paper's title, and the full text of the abstract.

Originally created to help evaluate scientific claim verification systems, this dataset is also distributed as part of BEIR (Benchmarking IR), a benchmark collection for evaluating information retrieval models, which is why we load it as `BeIR/scifact`. It's particularly useful for building scientific fact-checking systems, training models to understand and verify scientific claims, and developing information retrieval systems for scientific literature.

Let's explore the dataset structure and prepare it for our RAG application.



In [2]:
from datasets import load_dataset

dataset = load_dataset("BeIR/scifact", "corpus", split="corpus")
dataset[0]



{'_id': '4983',
 'title': 'Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.',
 'text': 'Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, t

### Description
- **Loading the Dataset:** We use the `datasets` library to load the "corpus" split of the Scifact dataset from the BeIR collection. This split contains the abstracts we’ll index.
- **Inspecting a Sample:** `dataset[0]` retrieves the first entry, showing its structure: a dictionary with keys `_id` (unique identifier), `title` (paper title), and `text` (abstract text).
- **Purpose:** This step helps us understand the data we’re working with, confirming it matches the expected format for indexing.
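
A quick sanity check of that structure can be sketched as follows. The `validate_record` helper and the toy records are hypothetical and not part of the `datasets` library; the sketch only assumes that each row is a dictionary with the three string fields shown above.

```python
REQUIRED_FIELDS = ("_id", "title", "text")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one corpus record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: expected str, got {type(value).__name__}")
        elif field != "title" and not value.strip():
            # An empty title is tolerated here; an empty ID or abstract is not.
            problems.append(f"{field}: empty")
    # The upload step later converts `_id` to an integer point ID,
    # so it is worth confirming that the conversion is possible.
    if isinstance(record.get("_id"), str) and not record["_id"].isdigit():
        problems.append("_id: not a non-negative integer string")
    return problems


# Toy records shaped like the corpus rows (contents are made up).
good = {"_id": "4983", "title": "Some title", "text": "Some abstract."}
bad = {"_id": "abc", "title": "Some title", "text": ""}

print(validate_record(good))
print(validate_record(bad))
```

Running a check like this over every row before indexing surfaces malformed entries early, rather than in the middle of an upload.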

Next, let’s check the total number of documents in the dataset.


In [3]:
len(dataset)

5183

## Dense Embeddings

Dense embeddings capture the semantic meaning of text, allowing searches based on concepts rather than just exact keywords. For this notebook, we won’t pick the most sophisticated embedding model available; we’ll stick to something simple and efficient. FastEmbed provides pretrained models that work out of the box. Because they run on ONNX Runtime, these models are efficient even on a CPU. The `all-MiniLM-L6-v2` model is a lightweight Sentence Transformers model that makes a good starting point.


In [5]:
from fastembed import TextEmbedding

dense_embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
dense_embeddings = list(dense_embedding_model.passage_embed(dataset["text"][0:1]))
len(dense_embeddings)

1

### Description
- **Model Initialization:** We load the `all-MiniLM-L6-v2` model using `TextEmbedding` from FastEmbed. This model is optimized for semantic text representation and is lightweight, making it suitable for CPU-based environments.
- **Generating Embeddings:** We embed the text of the first abstract (`dataset["text"][0:1]`) to test the process. The result is a list of dense embedding vectors.
- **Output Check:** `len(dense_embeddings)` confirms we get one embedding vector for the single document processed.

Let’s inspect the dimensionality of the dense embeddings.

In [6]:
len(dense_embeddings[0])

384

### Description
- **Vector Dimensionality:** This returns the length of the embedding vector, which is 384 dimensions for `all-MiniLM-L6-v2`.
- **Significance:** The dimensionality is crucial for configuring the Qdrant collection later, as it defines the size of the vector space we’ll store and search.
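
To make the role of dimensionality concrete, here is a minimal sketch of cosine similarity, the measure the collection is configured with later. It uses toy 4-dimensional vectors in place of real 384-dimensional embeddings, and the `cosine_similarity` helper is illustrative rather than anything FastEmbed or Qdrant exposes.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # Vectors from different models (or sizes) cannot be compared.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0, 0.0]
doc_same_direction = [2.0, 0.0, 2.0, 0.0]
doc_orthogonal = [0.0, 3.0, 0.0, 3.0]

print(cosine_similarity(query, doc_same_direction))
print(cosine_similarity(query, doc_orthogonal))
```

The dimension check mirrors what the database enforces: every vector stored under a given name must have the size declared when the collection was created.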

## Sparse Embeddings

Sparse embeddings, like those generated by BM25, are effective for keyword-based searches, capturing exact term matches rather than semantic similarity. We can generate them with a BM25 model in the same way as the dense ones, so that keyword matching can cover the cases in which dense embeddings fail.


In [7]:
from fastembed import SparseTextEmbedding

bm25_embedding_model = SparseTextEmbedding("Qdrant/bm25")
bm25_embeddings = list(bm25_embedding_model.passage_embed(dataset["text"][0:1]))



### Description
- **Model Initialization:** We load the `Qdrant/bm25` model using `SparseTextEmbedding`. BM25 is a traditional ranking algorithm that scores documents based on term frequency and inverse document frequency.
- **Generating Embeddings:** We embed the first abstract’s text to produce a sparse vector, which highlights important keywords with non-zero values while most elements remain zero.
- **Output Inspection:** The cell itself prints nothing, but `bm25_embeddings` now holds one `SparseEmbedding` object per document, each storing the `indices` and `values` of its non-zero terms.
- **Complementary Role:** Sparse embeddings complement dense embeddings by excelling in exact-match scenarios, enhancing retrieval robustness.
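
The shape of such a sparse vector can be illustrated with a toy sketch. This is not how the `Qdrant/bm25` model is implemented: the tokenizer, the growing vocabulary, and the parameter values (`k1`, `b`, `avg_len`) are all assumptions chosen for illustration, and only the standard BM25 term-frequency component is computed.

```python
import re
from collections import Counter


def toy_sparse_vector(text, vocabulary, k1=1.2, b=0.75, avg_len=10.0):
    """Build a BM25-style sparse vector as parallel `indices` / `values` lists.

    `vocabulary` maps token -> integer index and grows as new tokens appear.
    Only the term-frequency part of BM25 is computed; the IDF part is left
    to the search engine.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    doc_len = len(tokens)
    indices, values = [], []
    for token, tf in counts.items():
        index = vocabulary.setdefault(token, len(vocabulary))
        # Standard BM25 term-frequency saturation with length normalization.
        weight = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        indices.append(index)
        values.append(weight)
    return {"indices": indices, "values": values}


vocab = {}
vec = toy_sparse_vector("white matter in the white matter tracts", vocab)
print(vocab)
print(vec)
```

Only the terms that occur in the text get an entry, which is what keeps the representation compact even when the vocabulary is very large.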


## Putting Data in a Qdrant Collection

Qdrant is a vector database optimized for storing and searching high-dimensional data efficiently. All the vectors can now be upserted into a Qdrant collection. Keeping them in a single collection makes it possible to combine different embeddings and even build a complex pipeline with several steps. Depending on the specifics of your data, you may prefer a different approach.

### Starting Qdrant with Docker

First, let’s set up a Qdrant instance if it’s not already running.


In [16]:
!docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.15.0


aee8942527922eca2700ea3203721b4278cde163edb7dc98ebe3e61e047b933b




### Description
- **Docker Command:** This runs Qdrant version 1.15.0 in detached mode (`-d`) and maps ports 6333 (REST API) and 6334 (gRPC) from the container to your local machine.
- **Purpose:** Ensures a Qdrant server is available locally to store and manage our embeddings.
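
Because the container starts in the background, it can be useful to confirm the server is reachable before creating a collection. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes the server exposes a `/healthz` endpoint on the REST port, which you should verify against your Qdrant version.

```python
import time
import urllib.request


def is_qdrant_up(base_url: str = "http://localhost:6333", timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        # Assumption: the server provides a `/healthz` endpoint.
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


def wait_for_qdrant(base_url: str = "http://localhost:6333",
                    attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll the health endpoint until it responds or the attempts run out."""
    for _ in range(attempts):
        if is_qdrant_up(base_url):
            return True
        time.sleep(delay)
    return False


print(is_qdrant_up())
```

Calling `wait_for_qdrant()` right after the Docker command gives the container a few seconds to finish starting before the client connects.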

### Creating the Qdrant Collection

Now, let’s configure a collection to store both dense and sparse embeddings.

In [17]:
from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333", timeout=600)
client.create_collection(
    "scifact",
    vectors_config={
        "all-MiniLM-L6-v2": models.VectorParams(
            size=len(dense_embeddings[0]),
            distance=models.Distance.COSINE,
        )
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    }
)

True


### Description
- **Client Setup:** We connect to the local Qdrant instance with a generous timeout (600 seconds) to handle large uploads.
- **Collection Creation:** We create a collection named "scifact" with configurations for:
  - **Dense Vectors:** Named `all-MiniLM-L6-v2`, with the vector size from our earlier embedding (e.g., 384) and Cosine distance for similarity searches.
  - **Sparse Vectors:** Named `bm25`, using an IDF (Inverse Document Frequency) modifier to weight terms based on their rarity across the dataset.
- **Why Combined Storage:** Storing both embedding types in one collection enables hybrid search capabilities later.
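
The effect of the IDF modifier can be sketched with the IDF formula commonly used in BM25. Qdrant computes this weighting on the server side; the exact formula is assumed here, and the document counts are made up to show how rarity changes the weight.

```python
import math


def bm25_idf(num_docs: int, docs_with_term: int) -> float:
    """Common BM25 inverse document frequency: rarer terms score higher."""
    if not 0 <= docs_with_term <= num_docs:
        raise ValueError("docs_with_term must be between 0 and num_docs")
    return math.log(
        (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0
    )


# 5183 is the corpus size; the per-term document counts are illustrative.
N = 5183
for label, n in [("rare", 3), ("common", 2500), ("ubiquitous", 5183)]:
    print(label, round(bm25_idf(N, n), 3))
```

A term that appears in only a handful of abstracts contributes far more to a match than one that appears in nearly all of them, which is the behavior the modifier adds on top of the stored term weights.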

### Uploading Data

We’ll upload the dataset’s embeddings and metadata to Qdrant in batches for efficiency.


In [18]:
import math

import tqdm

batch_size = 5
# Round up so the progress bar accounts for the final, partial batch.
total_batches = math.ceil(len(dataset) / batch_size)
for batch in tqdm.tqdm(dataset.iter(batch_size=batch_size), total=total_batches):
    # Embed every abstract in the batch with both models.
    dense_embeddings = list(dense_embedding_model.passage_embed(batch["text"]))
    bm25_embeddings = list(bm25_embedding_model.passage_embed(batch["text"]))

    client.upload_points(
        "scifact",
        points=[
            models.PointStruct(
                id=int(batch["_id"][i]),
                vector={
                    "all-MiniLM-L6-v2": dense_embeddings[i].tolist(),
                    "bm25": bm25_embeddings[i].as_object(),
                },
                payload={
                    "_id": batch["_id"][i],
                    "title": batch["title"][i],
                    "text": batch["text"][i],
                },
            )
            for i in range(len(batch["_id"]))
        ],
        # Each point carries two vectors plus the full abstract, so keep the
        # upload batch small to stay below the maximum size of a single request.
        batch_size=batch_size,
    )

1037it [00:34, 30.30it/s]                          



### Description
- **Batching:** We process the dataset in chunks of 5 documents (`batch_size=5`) to manage memory and avoid oversized requests. `tqdm` provides a progress bar for tracking.
- **Embedding Generation:** For each batch, we compute dense and sparse embeddings for all abstracts in the batch.
- **PointStruct:** Each document becomes a "point" in Qdrant with:
  - `id`: A unique integer ID from `_id`.
  - `vector`: A dictionary with dense (`all-MiniLM-L6-v2`) and sparse (`bm25`) embeddings.
  - `payload`: Metadata including the ID, title, and text for retrieval purposes.
- **Uploading:** `upload_points` sends these points to the "scifact" collection, with batching to optimize performance.
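
The batching arithmetic can be checked in isolation. The sketch below uses a plain list of integers in place of the dataset, and the `iter_batches` and `num_batches` helpers are hypothetical stand-ins for what `dataset.iter` does internally.

```python
import math
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def num_batches(total: int, batch_size: int) -> int:
    """Number of batches, counting a final partial batch."""
    return math.ceil(total / batch_size)


doc_ids = list(range(5183))  # same size as the corpus
batches = list(iter_batches(doc_ids, 5))

print(len(batches), len(batches[-1]))
print(num_batches(5183, 5), 5183 // 5)
```

With 5,183 documents and a batch size of 5, the last batch is only partly full, so floor division undercounts the number of iterations by one, while rounding up matches the 1,037 iterations reported by the progress bar.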

### Verifying the Collection

Let’s confirm the collection is set up correctly.


In [19]:
client.get_collection("scifact")

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=None, indexed_vectors_count=5183, points_count=5183, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors={'all-MiniLM-L6-v2': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}, shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors={'bm25': SparseVectorParams(index=None, modifier=<Modifier.IDF: 'idf'>)}), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=10000, flush_interval_sec=5, max_o