# Indexing, Search, and Retrieval with OpenSearch

This notebook walks through the functions for indexing, querying, and ranking documents with OpenSearch.

## Indexing

The Python script `ingest_pipeline.py` builds an index from the *Wikipedia* pages for a full list of countries. This notebook breaks the procedure down step by step, using a small sample of countries for demonstration.

Given a selection of documents, the script begins by preparing two DataFrames:
 - `document_df`: Each row represents one document. This table contains descriptive metadata and the full body of text.
 - `segment_df`: Each row holds one segment sized to fit within the context window of the retriever model. These segments are the main driver of semantic search.

In [18]:
from docutrance.index import (
    build_segment_dataframe,
    build_wikipedia_index
)


from pathlib import Path
from sentence_transformers import SentenceTransformer
import spacy
import random



# Load a list of URLs and select a sample
urls = Path("../data/links/countries.txt").read_text().splitlines()
sample = random.sample(urls, 10)

# Initiate models for processing text data.
lemmatizer = spacy.load('en_core_web_sm')
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

document_df = build_wikipedia_index(
    sample,
    lemmatizer,
    encoder
)

# Controls how documents are split into paragraphs.
paragraph_fn = lambda x: x.split('\n')

# Controls the size of token overlap between segments
stride = encoder.max_seq_length // 2

# Filters segments below a minimum length
min_segment_length = 8

segment_df = build_segment_dataframe(
    document_df,
    paragraph_fn,
    lemmatizer,
    encoder,
    stride=stride,
    min_segment_length=min_segment_length
)

document_df.head(10)

Extracting data from Wikipedia. ..: 100%|██████████| 10/10 [00:05<00:00,  1.77it/s]
Lemmatizing body. . .: 100%|██████████| 10/10 [00:13<00:00,  1.33s/it]
Computing title embeddings. . .: 100%|██████████| 10/10 [00:00<00:00, 97.48it/s]
Splitting documents into paragraphs. . .: 100%|██████████| 10/10 [00:00<?, ?it/s]
Getting sentence boundaries. . .: 100%|██████████| 1074/1074 [00:15<00:00, 69.45it/s]
Getting segment boundaries. . .: 100%|██████████| 1074/1074 [00:00<00:00, 3559.83it/s]
Smoothing segment boundaries. . .: 100%|██████████| 1074/1074 [00:00<00:00, 1335.67it/s]
Extracting segments. . .: 100%|██████████| 1074/1074 [00:00<00:00, 99669.94it/s]


Filtering underlength segments. . .
Removed 33 underlength segments.


Embedding segments. . .: 100%|██████████| 1713/1713 [00:34<00:00, 49.90it/s]
Assigning segment ids. . .: 100%|██████████| 1713/1713 [00:00<00:00, 107194.86it/s]


|   | url | document_id | title | body | body_lemmatized | title_embedding |
|---|---|---|---|---|---|---|
| 0 | https://en.wikipedia.org/wiki/Austria | 0 | Austria | Austria,[e] formally the Republic of Austria,[... | austria,[e ] formally republic austria,[f ] la... | [0.06930039, 0.36513674, -0.21867831, -0.08320... |
| 1 | https://en.wikipedia.org/wiki/Singapore | 1 | Singapore | Singapore,[f] officially the Republic of Singa... | singapore,[f ] officially republic singapore ,... | [0.3834032, 0.3134104, 0.12782218, -0.03487570... |
| 2 | https://en.wikipedia.org/wiki/Costa_Rica | 2 | Costa Rica | Costa Rica,[a] officially the Republic of Cost... | costa rica,[a ] officially republic costa rica... | [0.25713432, -0.24487439, 0.30020368, -0.41871... |
| 3 | https://en.wikipedia.org/wiki/Eritrea | 3 | Eritrea | Eritrea,[b] officially the State of Eritrea,[c... | eritrea,[b ] officially state eritrea,[c ] cou... | [-0.21471323, 0.23990066, 0.108718246, -0.0575... |
| 4 | https://en.wikipedia.org/wiki/Malawi | 4 | Malawi | Malawi,[a][9] officially the Republic of Malaw... | malawi,[a][9 ] officially republic malawi,[b ]... | [-0.1703589, 0.39855027, -0.2987493, 0.3856757... |
| 5 | https://en.wikipedia.org/wiki/Maldives | 5 | Maldives | The Maldives,[e] officially the Republic of Ma... | maldives,[e ] officially republic maldives,[f ... | [0.1535826, -0.31509265, 0.10253833, -0.450252... |
| 6 | https://en.wikipedia.org/wiki/Antigua_and_Barbuda | 6 | Antigua and Barbuda | Antigua and Barbuda[d] is a sovereign archipel... | antigua barbuda[d ] sovereign archipelagic cou... | [0.16648428, 0.04340147, 0.19107334, 0.2145875... |
| 7 | https://en.wikipedia.org/wiki/Canada | 7 | Canada | Canada[a] is a country in North America. Its t... | canada[a ] country north america . province te... | [0.40659752, -0.013133737, 0.45929885, -0.2987... |
| 8 | https://en.wikipedia.org/wiki/Solomon_Islands | 8 | Solomon Islands | Solomon Islands,[7] also known simply as the S... | solomon islands,[7 ] know simply solomons,[8 ]... | [0.16126138, -0.16808078, 0.25491053, 0.278386... |
| 9 | https://en.wikipedia.org/wiki/Danish_Realm | 9 | Danish Realm | The Danish Realm,[g] officially the Kingdom of... | danish realm,[g ] officially kingdom denmark,[... | [0.097465456, 0.59684706, -0.20928447, -0.3438... |


To build `segment_df`, the body text of each document in `document_df` is broken into segments in four steps:

 1. **Paragraph segmentation**: Each document body is heuristically split into paragraphs using a user-defined function. For Wikipedia pages, paragraphs are separated by newline characters.

 2. **Token boundary detection**: The retriever model's tokenizer identifies overlapping token windows based on the model's maximum sequence length.

 3. **Sentence boundary detection**: A spaCy model identifies sentence boundaries within each paragraph.

 4. **Segment boundary smoothing**: The initial token-based segment boundaries are adjusted to align with the nearest sentence boundary in the direction that reduces token count.

This procedure produces well-formed segments that consist of complete sentences and approach the retriever model’s maximum context length.
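
The four steps can be sketched in plain Python. This is a minimal illustration, not the `docutrance` implementation: whitespace tokens stand in for the retriever model's tokenizer, a regular expression stands in for the spaCy sentence splitter, and every function name here (`split_sentences`, `token_windows`, `smooth_window`, `segment_paragraph`) is hypothetical. It assumes the input is a single paragraph (step 1 already done), that `stride` is the number of tokens shared by consecutive windows, and that a window containing no complete sentence is kept unchanged.

```python
import re


def split_sentences(text):
    """Naive stand-in for a spaCy sentence splitter (assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_token_boundaries(sentences):
    """Cumulative whitespace-token offsets where sentences begin and end."""
    boundaries = [0]
    for sentence in sentences:
        boundaries.append(boundaries[-1] + len(sentence.split()))
    return boundaries


def token_windows(n_tokens, max_len, stride):
    """Overlapping [start, end) windows; consecutive windows share `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    windows, start = [], 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            return windows
        start += max_len - stride


def smooth_window(window, boundaries):
    """Shrink a window so that it starts and ends on sentence boundaries."""
    start, end = window
    new_start = min(b for b in boundaries if b >= start)  # move start forward
    new_end = max(b for b in boundaries if b <= end)      # move end backward
    if new_start >= new_end:
        return window  # no complete sentence fits; keep the raw window
    return new_start, new_end


def segment_paragraph(paragraph, max_len, stride):
    """Split one paragraph into sentence-aligned segments of at most max_len tokens."""
    sentences = split_sentences(paragraph)
    tokens = [tok for s in sentences for tok in s.split()]
    boundaries = sentence_token_boundaries(sentences)
    segments = []
    for window in token_windows(len(tokens), max_len, stride):
        start, end = smooth_window(window, boundaries)
        segment = " ".join(tokens[start:end])
        if segment and segment not in segments:  # skip empty and duplicate segments
            segments.append(segment)
    return segments


paragraph = (
    "Austria is a country. It is in Europe. "
    "Vienna is the capital. The Alps cover much of it."
)
for segment in segment_paragraph(paragraph, max_len=10, stride=5):
    print(segment)
```

Because boundaries only ever move inward, a smoothed segment never exceeds the window size, which is what keeps each segment within the model's context length.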

In [35]:
print("Example Segment:\n\n", segment_df['segment'].sample(1).iloc[0])

segment_df.head(10)

Example Segment:

 Days before, in the 30 May 1959 election, the People's Action Party (PAP) won a landslide victory.[76] Governor Sir William Allmond Codrington Goode served as the first Yang di-Pertuan Negara (Head of State).[77]


|   | segment_id | document_id | segment | segment_embedding |
|---|---|---|---|---|
| 0 | 000-0000 | 0 | Austria,[e] formally the Republic of Austria,[... | [0.06246262, -0.17715333, 0.19159143, -0.05378... |
| 1 | 000-0001 | 0 | The country occupies an area of 83,879 km2 (32... | [0.17630827, -0.25996053, -0.027568314, -0.098... |
| 2 | 000-0002 | 0 | The area of today's Austria has been inhabited... | [0.21554814, 0.15389599, -0.01523502, -0.06066... |
| 3 | 000-0003 | 0 | ] Austria, as a unified state, emerged from th... | [-0.011898865, 0.20504114, 0.14017548, -0.2356... |
| 4 | 000-0004 | 0 | Being the heartland of the Habsburg monarchy s... | [-0.054862324, 0.15583988, 0.09909635, -0.1205... |
| 5 | 000-0005 | 0 | Before the dissolution of the empire two years... | [-0.20920843, 0.2299816, 0.15384266, 0.0937039... |
| 6 | 000-0006 | 0 | After the assassination of Archduke Franz Ferd... | [-0.20502546, 0.29093865, 0.18609236, 0.038689... |
| 7 | 000-0007 | 0 | During the interwar period, anti-parliamentari... | [-0.104107656, 0.12012277, 0.05642662, 0.04945... |
| 8 | 000-0008 | 0 | Austria is a semi-presidential[d] representati... | [-0.08257274, 0.040588252, 0.13276948, -0.1226... |
| 9 | 000-0009 | 0 | It hosts the Organization for Security and Co-... | [-0.15144917, 0.008586518, -0.06958202, -0.143... |
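
Judging only from the sample rows above, each `segment_id` appears to join a three-digit, zero-padded document id and a four-digit, zero-padded segment position with a hyphen. The sketch below illustrates that inferred format; the helper names `make_segment_id` and `parse_segment_id` are hypothetical, and the library's actual scheme may differ.

```python
from typing import Tuple


def make_segment_id(document_id: int, position: int) -> str:
    """Format an id such as '000-0009' (inferred format, an assumption)."""
    return f"{document_id:03d}-{position:04d}"


def parse_segment_id(segment_id: str) -> Tuple[int, int]:
    """Recover (document_id, position) from an id in the inferred format."""
    document_part, position_part = segment_id.split("-")
    return int(document_part), int(position_part)


segment_id = make_segment_id(0, 9)
document_id, position = parse_segment_id(segment_id)
print(segment_id, document_id, position)
```

An id of this shape lets a retrieved segment be traced back to its parent row in `document_df` without a separate lookup table.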


The function `docutrance.index.index_documents` takes a DataFrame, connects to OpenSearch through the supplied client, and builds an index from the user-defined settings and mappings.
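
As a rough picture of what such a function has to do, the sketch below builds an index-creation body and a list of bulk actions from plain dictionaries. It is an illustration under stated assumptions, not the actual `index_documents` code: `build_index_body` and `build_bulk_actions` are hypothetical helpers, and dropping the id column from the stored source is a guess. The commented calls at the end use `client.indices.create` and `opensearchpy.helpers.bulk` from opensearch-py and require a running cluster.

```python
from typing import Any, Dict, Iterable, Iterator


def build_index_body(settings: Dict[str, Any], mappings: Dict[str, Any]) -> Dict[str, Any]:
    """Combine settings and mappings into a single index-creation body."""
    return {"settings": settings, "mappings": mappings}


def build_bulk_actions(
    rows: Iterable[Dict[str, Any]], index_name: str, id_column: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk 'index' action per row, using id_column as the document _id."""
    for row in rows:
        # Assumption: the id column is used only as _id and not stored in _source.
        source = {key: value for key, value in row.items() if key != id_column}
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": str(row[id_column]),
            "_source": source,
        }


rows = [
    {"document_id": 0, "title": "Austria", "url": "https://en.wikipedia.org/wiki/Austria"},
    {"document_id": 1, "title": "Singapore", "url": "https://en.wikipedia.org/wiki/Singapore"},
]
body = build_index_body({"index.knn": True}, {"properties": {"title": {"type": "text"}}})
actions = list(build_bulk_actions(rows, "sample_documents", "document_id"))
print(actions[0])

# With opensearch-py installed and a cluster running, these could be sent as:
#   from opensearchpy import OpenSearch, helpers
#   client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
#   client.indices.create(index="sample_documents", body=body)
#   helpers.bulk(client, actions)
```

Rows for a real DataFrame could be produced with `df.to_dict(orient="records")` before being passed to the action builder.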

In [None]:
from docutrance.index import index_documents
from opensearchpy import OpenSearch

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Object for managing indexing and retrieval for OpenSearch
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

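# Index name. Note: the same name is reused for segment_df below, so with
# overwrite_old_index=True the second call replaces the document index.
# Use a distinct name per DataFrame to keep both indexes.
index_name = 'sample_documents'

# Controls special index features. index.knn enables k-NN (semantic) search.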
index_settings = {
    "index.knn": True,
    "number_of_shards": 1,
    "number_of_replicas": 0
}

# 1. Special configuration for the document_df:

# Describes the index contents and data types to OpenSearch
index_mappings = {
    "properties": {
        "url": {"type": "keyword"},
        "body": {"type": "text"},
        "body_lemmatized": {"type": "text"},
        "title": {"type": "text"},
        "title_embedding": {
            "type": "knn_vector",
            # Must equal the encoder's output size; update when switching models
            "dimension": encoder.get_sentence_embedding_dimension(),
            "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
            }
        }
    }
}

# Identifies which column is to be taken as the unique identifier
id_column = 'document_id'

index_documents(
    document_df, client, index_name, index_settings, index_mappings, id_column, overwrite_old_index=True
)

# 2. Special configuration for the segment_df:

# Describe the columns.
index_mappings = {
    "properties": {
        "document_id": {"type": "keyword"},
        "segment": {"type": "text"},
        "segment_embedding": {
            "type": "knn_vector",
            "dimension": encoder.get_sentence_embedding_dimension(),
            "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
            }
        }
    }
}

# Identifies which column is to be taken as the unique identifier
id_column = 'segment_id'

# Push to OpenSearch.
index_documents(
    segment_df, client, index_name, index_settings, index_mappings, id_column, overwrite_old_index=True
)


🗑️ Deleted old index 'sample_documents'.
Created index sample_documents with configuration {'settings': {'index.knn': True, 'number_of_shards': 1, 'number_of_replicas': 0}, 'mappings': {'properties': {'url': {'type': 'keyword'}, 'body': {'type': 'text'}, 'body_lemmatized': {'type': 'text'}, 'title': {'type': 'text'}, 'title_embedding': {'type': 'knn_vector', 'dimension': 384, 'method': {'engine': 'lucene', 'space_type': 'l2', 'name': 'hnsw', 'parameters': {}}}}}}


Indexing documents to sample_documents: 100%|██████████| 10/10 [00:00<00:00, 19.04it/s]


✅ Successfully indexed 10 documents.
🗑️ Deleted old index 'sample_documents'.
Created index sample_documents with configuration {'settings': {'index.knn': True, 'number_of_shards': 1, 'number_of_replicas': 0}, 'mappings': {'properties': {'document_id': {'type': 'keyword'}, 'segment': {'type': 'text'}, 'segment_embedding': {'type': 'knn_vector', 'dimension': 384, 'method': {'engine': 'lucene', 'space_type': 'l2', 'name': 'hnsw', 'parameters': {}}}}}}


Indexing documents to sample_documents: 100%|██████████| 1713/1713 [01:30<00:00, 19.02it/s]

✅ Successfully indexed 1713 documents.



