# Indexing, Search, and Retrieval with OpenSearch

This notebook walks through the functions for indexing, retrieving, and ranking documents with OpenSearch.

## Indexing

The Python script, `ingest_pipeline.py`, builds an index from a full list of countries from *Wikipedia*. This notebook breaks down the procedure step-by-step, using a small sample of countries for demonstration.

Given a selection of documents, the script begins by preparing two DataFrames:
 - `document_df`: Each row represents one document. This table contains descriptive metadata and the full body of text.
 - `segment_df`: Each row holds one segment of text sized to fit within the context window of the retriever model. Segments are the main driver of semantic search.

In [1]:
from docutrance.index import (
    build_segment_dataframe,
    build_wikipedia_index
)


from pathlib import Path
from sentence_transformers import SentenceTransformer
import spacy
import random



# Load a list of URLs and select a sample
urls = Path("../data/links/countries.txt").read_text().splitlines()
random.seed(42)  # call the function; assigning to random.seed would not seed the generator
sample = random.sample(urls, 10)

# Initiate models for processing text data.
lemmatizer = spacy.load('en_core_web_sm')
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

document_df = build_wikipedia_index(
    sample,
    lemmatizer,
    encoder
)

# Controls how documents are split into paragraphs.
paragraph_fn = lambda x: x.split('\n')

# Controls the size of token overlap between segments
stride = encoder.max_seq_length // 2

# Filters segments below a minimum length
min_segment_length = 8

segment_df = build_segment_dataframe(
    document_df,
    paragraph_fn,
    lemmatizer,
    encoder,
    stride=stride,
    min_segment_length=min_segment_length
)

document_df.head(10)

Extracting data from Wikipedia. ..: 100%|██████████| 10/10 [00:06<00:00,  1.65it/s]
Lemmatizing body. . .: 100%|██████████| 10/10 [00:14<00:00,  1.49s/it]
Computing title embeddings. . .: 100%|██████████| 10/10 [00:00<00:00, 106.15it/s]
Splitting documents into paragraphs. . .: 100%|██████████| 10/10 [00:00<00:00, 18575.31it/s]
Getting sentence boundaries. . .: 100%|██████████| 1215/1215 [00:16<00:00, 75.05it/s]
Getting segment boundaries. . .: 100%|██████████| 1215/1215 [00:00<00:00, 4002.76it/s]
Smoothing segment boundaries. . .: 100%|██████████| 1215/1215 [00:00<00:00, 1381.15it/s]
Extracting segments. . .: 100%|██████████| 1215/1215 [00:00<00:00, 115335.05it/s]


Filtering underlength segments. . .
Removed 29 underlength segments.


Embedding segments. . .: 100%|██████████| 1973/1973 [00:37<00:00, 51.94it/s]
Assigning segment ids. . .: 100%|██████████| 1973/1973 [00:00<00:00, 169584.04it/s]


Unnamed: 0,url,document_id,title,body,body_lemmatized,title_embedding
0,https://en.wikipedia.org/wiki/Moldova,0,Moldova,"Moldova,[d] officially the Republic of Moldova...","moldova,[d ] officially republic moldova,[e ] ...","[-0.5732232, 0.2628887, 0.0038419936, -0.11685..."
1,https://en.wikipedia.org/wiki/Saint_Lucia,1,Saint Lucia,in the Caribbean\nSaint Lucia[a] is an island ...,caribbean \n saint lucia[a ] island country we...,"[0.34250408, -0.15099865, 0.11134313, -0.30862..."
2,https://en.wikipedia.org/wiki/New_Zealand,2,New Zealand,New Zealand (Māori: Aotearoa) is an island cou...,new zealand ( māori : aotearoa ) island countr...,"[-0.081107944, -0.07023762, 0.01571196, 0.1328..."
3,https://en.wikipedia.org/wiki/Slovenia,3,Slovenia,– in Europe (green & dark grey)– in the Europe...,– europe ( green & dark grey ) – europ...,"[0.13591444, 0.37356806, -0.5100972, -0.154435..."
4,https://en.wikipedia.org/wiki/Switzerland,4,Switzerland,"in Europe (green and dark grey)\nSwitzerland,[...","europe ( green dark grey ) \n switzerland,[d...","[-0.06259759, 0.27365315, 0.14753258, 0.081325..."
5,https://en.wikipedia.org/wiki/Italy,5,Italy,"Italy,[a] officially the Italian Republic,[b] ...","italy,[a ] officially italian republic,[b ] co...","[-0.03437813, -0.39357063, -0.002518681, 0.200..."
6,https://en.wikipedia.org/wiki/Saint_Vincent_an...,6,Saint Vincent and the Grenadines,"Saint Vincent and the Grenadines,[b] sometimes...","saint vincent grenadines,[b ] know simply sain...","[-0.2780527, 0.169605, 0.087471284, -0.0899487..."
7,https://en.wikipedia.org/wiki/Mongolia,7,Mongolia,Mongolia[b] is a landlocked country in East As...,"mongolia[b ] landlocked country east asia , bo...","[-0.58181524, 0.12627521, 0.16169146, 0.323582..."
8,https://en.wikipedia.org/wiki/Canada,8,Canada,Canada[a] is a country in North America. Its t...,canada[a ] country north america . province te...,"[0.40659752, -0.013133737, 0.45929885, -0.2987..."
9,https://en.wikipedia.org/wiki/Solomon_Islands,9,Solomon Islands,"Solomon Islands,[7] also known simply as the S...","solomon islands,[7 ] know simply solomons,[8 ]...","[0.16126138, -0.16808078, 0.25491053, 0.278386..."


To build `segment_df`, the body text of each row in `document_df` is broken down into segments in four steps:

 1. **Paragraph segmentation**: Each document body is heuristically split into paragraphs using a user-defined function. For Wikipedia pages, paragraphs are separated by newline characters.

 2. **Token boundary detection**: The retriever model's tokenizer identifies overlapping token windows based on the model's maximum sequence length.

 3. **Sentence boundary detection**: A spaCy model is used to identify sentence boundaries within each paragraph.

 4. **Segment boundary smoothing**: The initial token-based segment boundaries are adjusted to align with the nearest sentence boundary in the direction that reduces token count.

This procedure produces well-formed segments that consist of complete sentences and approach the retriever model’s maximum context length.
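
Step 4 can be illustrated with a minimal sketch. It assumes that boundaries are expressed as sorted token offsets within a paragraph. The function name `smooth_boundaries` and the sample offsets are hypothetical, and this is not necessarily how `build_segment_dataframe` implements the step.

```python
from bisect import bisect_right

def smooth_boundaries(window_ends, sentence_ends):
    """Snap each token-window end back to the last sentence end at or before it.

    Both arguments are sorted lists of token offsets (an assumption of this sketch).
    """
    smoothed = []
    for end in window_ends:
        i = bisect_right(sentence_ends, end)
        # If no sentence finishes inside the window, keep the raw boundary.
        smoothed.append(sentence_ends[i - 1] if i > 0 else end)
    return smoothed

sentence_ends = [12, 30, 47, 64]   # token offsets where sentences end
window_ends = [20, 40, 64]         # raw boundaries from fixed-size token windows
print(smooth_boundaries(window_ends, sentence_ends))  # [12, 30, 64]
```

Moving a boundary backwards rather than forwards keeps every segment within the token limit, at the cost of slightly shorter segments.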

In [2]:
print("Example Segment:\n\n", segment_df.sample(1).reset_index().loc[0, 'segment'])

segment_df.head(10)

Example Segment:

 Elections take place every four years. The National Council (Državni svet Republike Slovenije), consisting of forty members, appointed to represent social, economic, professional and local interest groups, has a limited advisory and control power.[126]


Unnamed: 0,segment_id,document_id,segment,segment_embedding
0,000-0000,0,"Moldova,[d] officially the Republic of Moldova...","[-0.21883741, -0.039802853, 0.11549487, -0.312..."
1,000-0001,0,The unrecognised breakaway state of Transnistr...,"[0.036985442, 0.06745155, 0.09227955, -0.26171..."
2,000-0002,0,Most of Moldovan territory was a part of the P...,"[-0.04496378, 0.025177646, 0.11631268, -0.1491..."
3,000-0003,0,but Russian rule was restored over the whole o...,"[-0.23470737, 0.10999178, 0.018682856, 0.02254..."
4,000-0004,0,"In February 1918, it declared independence and...","[-0.31539658, 0.072800584, -0.071843125, -0.03..."
5,000-0005,0,"In 1940, as a consequence of the Molotov–Ribbe...","[-0.31048736, 0.062092364, -0.10678636, -0.010..."
6,000-0006,0,"On 27 August 1991, as the dissolution of the S...","[-0.38658524, 0.091805875, 0.15380627, -0.3506..."
7,000-0007,0,The constitution of Moldova was adopted in 199...,"[-0.584582, 0.10552795, 0.34430596, -0.2833754..."
8,000-0008,0,"Under the presidency of Maia Sandu, elected in...","[-0.23851253, 0.08208075, 0.04983329, -0.01659..."
9,000-0009,0,Moldova is the second poorest country in Europ...,"[-0.4169526, 0.05651239, 0.09918604, -0.210805..."


The function, `docutrance.index.index_documents`, takes DataFrames to index, connects with OpenSearch, and builds an index according to the configuration defined by the user.

In [3]:
from docutrance.index import index_documents
from opensearchpy import OpenSearch

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Object for managing indexing and retrieval for OpenSearch
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Controls special index features. index.knn enables semantic (k-NN) search
index_settings = {
    "index.knn": True,
    "number_of_shards": 1,
    "number_of_replicas": 0
}

# 1. Special configuration for the document_df:
document_index_name = 'sample_documents'

# Describes the index contents and data types to OpenSearch
document_index_mappings = {
    "properties": {
        "url": {"type": "keyword"},
        "body": {"type": "text"},
        "body_lemmatized": {"type": "text"},
        "title": {"type": "text"},
        "title_embedding": {
            "type": "knn_vector",
            "dimension": encoder.get_sentence_embedding_dimension(), # Careful when switching models
            "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
                }
        }
    }
}

# Identifies which column is to be taken as the unique identifier
document_id_column = 'document_id'

index_documents(
    document_df, 
    client, 
    document_index_name, 
    index_settings, 
    document_index_mappings, 
    document_id_column, 
    overwrite_old_index=True
)

# 2. Special configuration for the segment_df:
segment_index_name = 'sample_segments'

# Describe the columns.
segment_index_mappings = {
    "properties": {
        "document_id": {"type": "keyword"},
        "segment": {"type": "text"},
        "segment_embedding": {
            "type": "knn_vector",
            "dimension": encoder.get_sentence_embedding_dimension(),
            "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
            }
        }
    }
}

# Identifies which column is to be taken as the unique identifier
segment_id_column = 'segment_id'

# Push to OpenSearch.
index_documents(
    segment_df, 
    client, 
    segment_index_name, 
    index_settings, 
    segment_index_mappings, 
    segment_id_column, 
    overwrite_old_index=True
)


🗑️ Deleted old index 'sample_documents'.
Created index sample_documents with configuration {'settings': {'index.knn': True, 'number_of_shards': 1, 'number_of_replicas': 0}, 'mappings': {'properties': {'url': {'type': 'keyword'}, 'body': {'type': 'text'}, 'body_lemmatized': {'type': 'text'}, 'title': {'type': 'text'}, 'title_embedding': {'type': 'knn_vector', 'dimension': 384, 'method': {'engine': 'lucene', 'space_type': 'l2', 'name': 'hnsw', 'parameters': {}}}}}}


Indexing documents to sample_documents: 100%|██████████| 10/10 [00:00<00:00, 22.42it/s]


✅ Successfully indexed 10 documents.
🗑️ Deleted old index 'sample_segments'.
Created index sample_segments with configuration {'settings': {'index.knn': True, 'number_of_shards': 1, 'number_of_replicas': 0}, 'mappings': {'properties': {'document_id': {'type': 'keyword'}, 'segment': {'type': 'text'}, 'segment_embedding': {'type': 'knn_vector', 'dimension': 384, 'method': {'engine': 'lucene', 'space_type': 'l2', 'name': 'hnsw', 'parameters': {}}}}}}


Indexing documents to sample_segments: 100%|██████████| 1973/1973 [01:45<00:00, 18.77it/s]

✅ Successfully indexed 1973 documents.
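
For readers curious how table rows can map onto OpenSearch documents, here is a minimal sketch. It assumes rows are available as dictionaries (for example via `document_df.to_dict("records")`). The helper `to_bulk_actions` is hypothetical, and the notebook does not show whether `index_documents` works this way. Actions of this shape are what `opensearchpy.helpers.bulk(client, actions)` accepts.

```python
def to_bulk_actions(records, index_name, id_column):
    """Yield one bulk-index action per record (hypothetical helper)."""
    for record in records:
        # The id column becomes the document _id; all other columns form the body.
        source = {k: v for k, v in record.items() if k != id_column}
        yield {
            "_index": index_name,
            "_id": record[id_column],
            "_source": source,
        }

records = [
    {"document_id": 0, "title": "Moldova"},
    {"document_id": 1, "title": "Saint Lucia"},
]
actions = list(to_bulk_actions(records, "sample_documents", "document_id"))
print(actions[0])
```

Embedding columns would need to be plain Python lists (not NumPy arrays) before they are serialized to JSON.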





## Retrieval

The process begins with the user issuing a query. During indexing, some fields were lemmatized and embedded to allow for advanced retrieval strategies. The query input must be processed in the same way.

The function, `docutrance.search.preprocess_input`, takes a query input and returns a dictionary with four key-value pairs:

 - **raw**: The original user input.
 - **stripped**: The query input with stop words removed.
 - **lemmatized**: The stripped input in its lemmatized form.
 - **embedding**: The raw input transformed into a sentence embedding.

In [4]:
from docutrance.search import preprocess_input
import spacy
from sentence_transformers import SentenceTransformer

# Initiate models for processing text data.
lemmatizer = spacy.load('en_core_web_sm')
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')


query = 'popular dishes and products'

processed = preprocess_input(query, lemmatizer, encoder)

print('Query Input:\n', query)
print()
print('Processed Query Input:')
for k,v in processed.items():
    if not isinstance(v, str):
        v = str(v[:4])[:-1] + '. . .'
    
    print('\t', k, ':', v)

Query Input:
 popular dishes and products

Processed Query Input:
	 raw : popular dishes and products
	 stripped : popular dishes products
	 lemmatized : popular dish product
	 embedding : [ 0.05008381 -0.5412065   0.328839    0.14115289. . .
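
The text-only part of this preprocessing can be mimicked with a toy sketch. The stop-word set and lemma table below are tiny stand-ins for what the spaCy model provides, the function name `preprocess_text` is hypothetical, and the embedding step is left out. It is meant only to show how `stripped` and `lemmatized` relate to `raw`.

```python
# Tiny stand-ins for spaCy's stop-word list and lemmatizer (illustrative only).
STOP_WORDS = {"and", "the", "of", "a"}
LEMMAS = {"dishes": "dish", "products": "product"}

def preprocess_text(query):
    """Return the raw, stop-word-stripped, and lemmatized forms of a query."""
    tokens = query.split()
    stripped = [t for t in tokens if t.lower() not in STOP_WORDS]
    # Lemmatization is applied to the stripped tokens, not the raw input.
    lemmatized = [LEMMAS.get(t.lower(), t) for t in stripped]
    return {
        "raw": query,
        "stripped": " ".join(stripped),
        "lemmatized": " ".join(lemmatized),
    }

print(preprocess_text("popular dishes and products"))
```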


Queries follow the OpenSearch query DSL.

During query formulation, each field is carefully matched with the appropriate input type.

In the example below, two `should` clauses cover three fields:

 - A `multi_match` query is applied to both the `body` and `title` fields, using the input text with stop words removed.

 - A `match_phrase` query matches the lemmatized query against the lemmatized body only.

In [None]:
from opensearchpy import OpenSearch

# This first query ranks documents by relevance using traditional keyword matching.
body = {
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": processed["stripped"],
                        "fields": [
                            "title",
                            "body"
                        ]
                    }
                },
                {
                    "match_phrase": {
                        "body_lemmatized": {
                            "query": processed["lemmatized"]
                        }
                    }
                }
            ]
        }
    }
}

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])
document_index_name = 'sample_documents'

response = client.search(body=body, index=document_index_name)

The OpenSearch client takes a query and returns a response, which includes matches and relevance scores.

The function `docutrance.search.post_process_response` loads the response into a DataFrame, aggregates scores, and ranks the results. Finally, it computes a Reciprocal Rank Fusion (RRF) score for each result, which can be used to combine multiple ranked lists.

In [None]:
from docutrance.search import post_process_response

# Mapping of columns to rename
column_map = {"_id": "document_id"}

# Specifies columns and methods for aggregation.
agg_map = {"_score": "sum"}

result = post_process_response(
    response, 
    column_map=column_map,
    agg_map=agg_map
    )
result.merge(document_df[['document_id', 'title']]).sort_values('rank')

Unnamed: 0,document_id,_score,rank,rrf,title
2,2,1.166804,1.0,0.016393,New Zealand
0,0,1.048601,2.0,0.016129,Moldova
4,4,1.017478,3.0,0.015873,Switzerland
5,5,0.973145,4.0,0.015625,Italy
8,8,0.847408,5.0,0.015385,Canada
1,1,0.709357,6.0,0.015152,Saint Lucia
3,3,0.589853,7.0,0.014925,Slovenia
7,7,0.489845,8.0,0.014706,Mongolia
9,9,0.464518,9.0,0.014493,Solomon Islands
6,6,0.090731,10.0,0.014286,Saint Vincent and the Grenadines
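
The `rank` and `rrf` columns above can be reproduced with a short sketch. It assumes hits are plain dictionaries with `_id` and `_score` keys, and it uses the constant 60 because the values in the table match `1 / (60 + rank)`. The function name `rank_hits` is hypothetical, and the actual default inside `post_process_response` is not shown in this notebook.

```python
from collections import defaultdict

def rank_hits(hits, k=60):
    """Sum scores per document, rank by total score, and attach an RRF value."""
    totals = defaultdict(float)
    for hit in hits:
        totals[hit["_id"]] += hit["_score"]
    # Highest aggregated score receives rank 1.
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        {"document_id": doc_id, "_score": score, "rank": rank, "rrf": 1.0 / (k + rank)}
        for rank, (doc_id, score) in enumerate(ordered, start=1)
    ]

hits = [
    {"_id": "0", "_score": 1.048601},
    {"_id": "2", "_score": 1.166804},
    {"_id": "4", "_score": 1.017478},
]
for row in rank_hits(hits):
    print(row["document_id"], row["rank"], round(row["rrf"], 6))
```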


In [None]:
# Semantic queries have a slightly different nested structure. This one returns k nearest neighbors between the query and segment embeddings.

body = {
    "query": {
        "bool": {
            "should": {
                "knn": {
                    "segment_embedding": {
                        "vector": processed['embedding'], # Make sure to select the appropriate query type.
                        "k": 500
                    }
                }
            }
        }
    }
}

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Segments are indexed separately.
segment_index_name = 'sample_segments'

response = client.search(body=body, index=segment_index_name)

In [24]:
from docutrance.search import post_process_response

# Segments are retained as semantic highlights.
column_map = {"segment": "semantic_highlight"}

# In addition to the score, high-scoring segments are returned in an ordered list
agg_map = {"_score": "sum", "semantic_highlight": lambda x: list(x)}

result = post_process_response(
    response, 
    column_map=column_map,
    agg_map=agg_map
    )

result = result.merge(document_df[['document_id', 'title']]).sort_values('rank').reset_index(drop=True)

print('Query:', processed['raw'])
print()
print('Top Ranked Country', result.loc[0, 'title'])
print()
print('Semantic Highlights:\n')

for highlight in result.loc[0, 'semantic_highlight']:
    print(highlight)
    print()

result



Query: popular dishes and products

Top Ranked Country Italy

Semantic Highlights:

Italian cuisine is heavily influenced by Etruscan, ancient Greek, ancient Roman, Byzantine, Arabic, and Jewish cuisines.[412] Significant changes occurred with the discovery of the New World, with items such as potatoes, tomatoes, and maize becoming main ingredients from the 18th century.[413]

The Italian meal structure is typical of the Mediterranean region and differs from North, Central, and East European meal structures, although it still often consists of breakfast (colazione), lunch (pranzo), and dinner (cena).[425] However, much less emphasis is placed on breakfast, which is often skipped or involves lighter portions than are seen in non-Mediterranean Western countries.[426] Late-morning and mid-afternoon snacks, called merenda (pl.: merende), are often included.[427]

The Mediterranean diet forms the basis of Italian cuisine, which is rich in pasta, fish, fruits, and vegetables and characterise

Unnamed: 0,document_id,_score,semantic_highlight,rank,rrf,title
0,5,0.128802,[Italian cuisine is heavily influenced by Etru...,1.0,0.016393,Italy
1,0,0.101247,"[Main dishes often include beef, pork, potatoe...",2.0,0.016129,Moldova
2,1,0.085334,"[Saint Lucian cuisine is a mix of African, Eur...",3.0,0.015873,Saint Lucia
3,3,0.048113,"[Ethnologically, the most characteristic Slove...",4.0,0.015625,Slovenia
4,4,0.046705,[The cuisine is multifaceted. While dishes suc...,5.0,0.015385,Switzerland
5,2,0.041896,[The national cuisine has been described as Pa...,6.0,0.015152,New Zealand
