# Indexing, Retrieval, and Re-Ranking with OpenSearch

This notebook walks through the functions for indexing, retrieving, and ranking documents with OpenSearch.

## Indexing

The Python script, `ingest_pipeline.py`, builds an index from a full list of countries from *Wikipedia*. This notebook breaks down the procedure step-by-step, using a small sample of countries for demonstration.

Given a selection of documents, the script begins by preparing two DataFrames:
 - `document_df`: Each row represents one document. This table contains descriptive metadata and the full body of text.
 - `segment_df`: Each row contains one segment sized to fit within the context window of the retriever model. These segments are the main driver of semantic search.

In [1]:
from docutrance.index import (
    build_segment_dataframe,
    build_wikipedia_index
)


from pathlib import Path
from sentence_transformers import SentenceTransformer
import spacy
import random



# Load a list of URLs and select a sample
urls = Path("../data/links/countries.txt").read_text().splitlines()
random.seed(42)
sample = random.sample(urls, 10)

# Initiate models for processing text data.
lemmatizer = spacy.load('en_core_web_sm')
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

document_df = build_wikipedia_index(
    sample,
    lemmatizer,
    encoder
)

# Controls how documents are split into paragraphs.
paragraph_fn = lambda x: x.split('\n')

# Controls the size of token overlap between segments
stride = encoder.max_seq_length // 2

# Filters segments below a minimum length
min_segment_length = 8

segment_df = build_segment_dataframe(
    document_df,
    paragraph_fn,
    lemmatizer,
    encoder,
    stride=stride,
    min_segment_length=min_segment_length
)

document_df.head(10)

Extracting data from Wikipedia. ..: 100%|██████████| 10/10 [00:06<00:00,  1.61it/s]
Lemmatizing body. . .: 100%|██████████| 10/10 [00:15<00:00,  1.55s/it]
Computing title embeddings. . .: 100%|██████████| 10/10 [00:00<00:00, 77.93it/s]
Splitting documents into paragraphs. . .: 100%|██████████| 10/10 [00:00<?, ?it/s]
Getting sentence boundaries. . .: 100%|██████████| 1250/1250 [00:16<00:00, 74.12it/s]
Getting segment boundaries. . .: 100%|██████████| 1250/1250 [00:00<00:00, 3930.83it/s]
Smoothing segment boundaries. . .: 100%|██████████| 1250/1250 [00:00<00:00, 1358.93it/s]
Extracting segments. . .: 100%|██████████| 1250/1250 [00:00<00:00, 103037.95it/s]


Filtering underlength segments. . .
Removed 57 underlength segments.


Embedding segments. . .: 100%|██████████| 2060/2060 [00:40<00:00, 50.91it/s]
Assigning segment ids. . .: 100%|██████████| 2060/2060 [00:00<00:00, 174455.68it/s]


Unnamed: 0,url,document_id,title,body,body_lemmatized,title_embedding
0,https://en.wikipedia.org/wiki/Romania,0,Romania,Romania[a] is a country located at the crossro...,"romania[a ] country locate crossroad central ,...","[-0.31080642, 0.23566397, -0.25159326, 0.39146..."
1,https://en.wikipedia.org/wiki/Federated_States...,1,Federated States of Micronesia,The Federated States of Micronesia (/ˌmaɪkroʊˈ...,federate state micronesia ( /ˌmaɪkroʊˈniːʒə/ ⓘ...,"[0.48593277, -0.3911962, 0.701863, -0.2735546,..."
2,https://en.wikipedia.org/wiki/Saint_Vincent_an...,2,Saint Vincent and the Grenadines,"Saint Vincent and the Grenadines,[b] sometimes...","saint vincent grenadines,[b ] know simply sain...","[-0.2780527, 0.169605, 0.087471284, -0.0899487..."
3,https://en.wikipedia.org/wiki/Spain,3,Spain,"Spain,[i] or the Kingdom of Spain,[a][j] is a ...","spain,[i ] kingdom spain,[a][j ] country south...","[0.099953264, -0.25360575, 0.21685211, 0.16622..."
4,https://en.wikipedia.org/wiki/North_Korea,4,North Korea,"North Korea,[a] officially the Democratic Peop...","north korea,[a ] officially democratic people ...","[-0.55245703, 0.3070167, 0.17727529, -0.129822..."
5,https://en.wikipedia.org/wiki/Israel,5,Israel,"Israel,[a] officially the State of Israel,[b] ...","israel,[a ] officially state israel,[b ] count...","[0.18425143, 0.8903475, -0.06465284, -0.285763..."
6,https://en.wikipedia.org/wiki/Tuvalu,6,Tuvalu,Tuvalu (/tuːˈvɑːluː/ ⓘ too-VAH-loo)[5] is an i...,tuvalu ( /tuːˈvɑːluː/ ⓘ - vah - loo)[5 ] islan...,"[0.35686594, -0.012433024, -0.07774307, 0.0379..."
7,https://en.wikipedia.org/wiki/Timor-Leste,7,Timor-Leste,"Timor-Leste,[b] also known as East Timor,[c] o...","timor - leste,[b ] know east timor,[c ] offici...","[0.21216178, -0.1054345, 0.17663197, -0.199195..."
8,https://en.wikipedia.org/wiki/Qatar,8,Qatar,"Qatar,[a] officially the State of Qatar,[b] is...","qatar,[a ] officially state qatar,[b ] country...","[-0.2051595, 0.43293706, -0.13102828, 0.309307..."
9,https://en.wikipedia.org/wiki/Libya,9,Libya,"Libya,[b] officially the State of Libya,[c] is...","libya,[b ] officially state libya,[c ] country...","[-0.48242846, 0.38598338, 0.13472082, -0.39562..."


To build `segment_df`, the body text of each document in `document_df` is broken down into segments in four steps:

 1. **Paragraph segmentation**: Each document body is heuristically split into paragraphs using a user-defined function. For Wikipedia pages, paragraphs are separated by newline characters.

 2. **Token boundary detection**: The retriever model's tokenizer identifies overlapping token windows based on the model's maximum sequence length.

 3. **Sentence boundary detection**: A SpaCy model is used to identify sentence boundaries within each paragraph.

 4. **Segment boundary smoothing**: The initial token-based segment boundaries are adjusted to align with the nearest sentence boundary in the direction that reduces token count.

This procedure produces well-formed segments that consist of complete sentences and approach the retriever model’s maximum context length.
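
The four steps above can be approximated in a few lines. The sketch below is not the `build_segment_dataframe` implementation. It assumes whitespace-separated words as a stand-in for the model's tokens and a naive regular expression in place of the spaCy sentence detector, and the names `split_sentences` and `segment_paragraph` are hypothetical. It packs whole sentences into windows of at most `max_tokens` and starts each new window far enough back to overlap the previous one by at most `stride` tokens.

```python
import re


def split_sentences(paragraph):
    # Naive stand-in for sentence boundary detection (assumption: the real
    # pipeline uses a spaCy model instead of this regular expression).
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def segment_paragraph(paragraph, max_tokens=16, stride=8):
    """Pack whole sentences into overlapping windows of at most max_tokens."""
    sentences = split_sentences(paragraph)
    # Assumption: one whitespace-separated word counts as one token.
    lengths = [len(s.split()) for s in sentences]
    segments = []
    start = 0
    while start < len(sentences):
        end, total = start, 0
        # Grow the window sentence by sentence while it fits the budget.
        # A single over-long sentence is still emitted on its own.
        while end < len(sentences) and (
            end == start or total + lengths[end] <= max_tokens
        ):
            total += lengths[end]
            end += 1
        segments.append(" ".join(sentences[start:end]))
        if end == len(sentences):
            break
        # Step back over trailing sentences to create the overlap, snapping
        # to the sentence boundary that keeps the overlap within `stride`.
        next_start, overlap = end, 0
        while next_start - 1 > start and overlap + lengths[next_start - 1] <= stride:
            next_start -= 1
            overlap += lengths[next_start]
        start = next_start
    return segments


text = "A b c d. E f g h. I j k l. M n o p."
for segment in segment_paragraph(text, max_tokens=8, stride=4):
    print(segment)
```

Snapping to the smaller side of a sentence boundary means a segment can be shorter than the budget, but it never cuts a sentence in half.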

In [2]:
print("Example Segment:\n\n", segment_df.sample(1).reset_index().loc[0, 'segment'])

segment_df.head(10)

Example Segment:

 Non-Muslim expatriates can obtain a permit to purchase alcohol for personal consumption. The Qatar Distribution Company (a subsidiary of Qatar Airways) is permitted to import alcohol and pork; it operates the only liquor store in the country, which also sells pork to holders of liquor licences.[150][151] Qatari officials had indicated a willingness to allow alcohol in "fan zones" at the 2022 FIFA World Cup.[152]


Unnamed: 0,segment_id,document_id,segment,segment_embedding
0,000-0000,0,Romania[a] is a country located at the crossro...,"[-0.07612449, -0.07460885, -0.01813557, 0.1916..."
1,000-0001,0,"It has a mainly continental climate, and an ar...","[0.11955495, -0.0035955096, 0.08295564, 0.0374..."
2,000-0002,0,"Europe's second-longest river, the Danube, emp...","[0.07221405, 0.020109031, -0.008036764, -0.059..."
3,000-0003,0,Settlement in the territory of modern Romania ...,"[-0.1975225, 0.10504465, -0.039830647, 0.10182..."
4,000-0004,0,"After World War I, Transylvania, Banat, Bukovi...","[-0.076479286, 0.010923682, 0.15934747, 0.1246..."
5,000-0005,0,"In 1940, under Axis pressure, Romania lost ter...","[-0.3039339, 0.08313686, 0.015751738, 0.189600..."
6,000-0006,0,Romania is a developing country with a high-in...,"[-0.12838116, 0.10129507, -0.11238637, -0.0669..."
7,000-0007,0,Romania is a net exporter of automotive and ve...,"[-0.17891245, -0.11273907, -0.21656702, -0.118..."
8,000-0008,0,Romania derives from the local name for Romani...,"[-0.24619849, 0.0360259, 0.026693404, 0.247025..."
9,000-0009,0,The oldest known surviving document written in...,"[-0.33449292, 0.20895389, -0.18661594, 0.07439..."


The function, `docutrance.index.index_documents`, takes DataFrames to index, connects with OpenSearch, and builds an index according to the configuration defined by the user.

In [3]:
from docutrance.index import index_documents
from opensearchpy import OpenSearch

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Object for managing indexing and retrieval for OpenSearch
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Index-level settings. Setting index.knn to True enables the k-NN (vector) search used for semantic search.
index_settings = {
    "index.knn": True,
    "number_of_shards": 1,
    "number_of_replicas": 0
}

# 1. Special configuration for the document_df:
document_index_name = 'sample_documents'

# Describes the index contents and data types to OpenSearch
document_index_mappings = {
    "properties": {
        "url": {"type": "keyword"},
        "body": {"type": "text"},
        "body_lemmatized": {"type": "text"},
        "title": {"type": "text"},
        "title_embedding": {
            "type": "knn_vector",
            "dimension": encoder.get_sentence_embedding_dimension(), # Careful when switching models
            "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
                }
        }
    }
}

# Identifies which column is to be taken as the unique identifier
document_id_column = 'document_id'

index_documents(
    document_df, 
    client, 
    document_index_name, 
    index_settings, 
    document_index_mappings, 
    document_id_column, 
    overwrite_old_index=True
)

# 2. Special configuration for the segment_df:
segment_index_name = 'sample_segments'

# Describe the columns.
segment_index_mappings = {
    "properties": {
        "document_id": {"type": "keyword"},
        "segment": {"type": "text"},
        "segment_embedding": {
            "type": "knn_vector",
            "dimension": encoder.get_sentence_embedding_dimension(),
            "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
            }
        }
    }
}

# Identifies which column is to be taken as the unique identifier
segment_id_column = 'segment_id'

# Push to OpenSearch.
index_documents(
    segment_df, 
    client, 
    segment_index_name, 
    index_settings, 
    segment_index_mappings, 
    segment_id_column, 
    overwrite_old_index=True
)


🗑️ Deleted old index 'sample_documents'.
Created index sample_documents with configuration {'settings': {'index.knn': True, 'number_of_shards': 1, 'number_of_replicas': 0}, 'mappings': {'properties': {'url': {'type': 'keyword'}, 'body': {'type': 'text'}, 'body_lemmatized': {'type': 'text'}, 'title': {'type': 'text'}, 'title_embedding': {'type': 'knn_vector', 'dimension': 384, 'method': {'engine': 'lucene', 'space_type': 'l2', 'name': 'hnsw', 'parameters': {}}}}}}


Indexing documents to sample_documents: 100%|██████████| 10/10 [00:00<00:00, 20.09it/s]


✅ Successfully indexed 10 documents.
🗑️ Deleted old index 'sample_segments'.
Created index sample_segments with configuration {'settings': {'index.knn': True, 'number_of_shards': 1, 'number_of_replicas': 0}, 'mappings': {'properties': {'document_id': {'type': 'keyword'}, 'segment': {'type': 'text'}, 'segment_embedding': {'type': 'knn_vector', 'dimension': 384, 'method': {'engine': 'lucene', 'space_type': 'l2', 'name': 'hnsw', 'parameters': {}}}}}}


Indexing documents to sample_segments: 100%|██████████| 2060/2060 [01:49<00:00, 18.85it/s]

✅ Successfully indexed 2060 documents.
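
For readers curious how rows could be turned into index requests, here is a minimal sketch. It is an assumption about one possible approach, not a description of how `index_documents` works internally. The helper name `build_bulk_actions` is hypothetical. The action layout (`_index`, `_id`, `_source`) is the one accepted by `opensearchpy.helpers.bulk`.

```python
def build_bulk_actions(records, index_name, id_column):
    """Yield one bulk action per record, using id_column as the document _id."""
    for record in records:
        # Everything except the identifier becomes the stored document body.
        source = {key: value for key, value in record.items() if key != id_column}
        yield {"_index": index_name, "_id": record[id_column], "_source": source}


# Toy records standing in for rows of document_df.
records = [
    {"document_id": 0, "title": "Romania", "url": "https://en.wikipedia.org/wiki/Romania"},
    {"document_id": 3, "title": "Spain", "url": "https://en.wikipedia.org/wiki/Spain"},
]

actions = list(build_bulk_actions(records, "sample_documents", "document_id"))
print(actions[0])

# With a running cluster, the actions could be sent in a single request:
# from opensearchpy import helpers
# helpers.bulk(client, actions)
```

With a DataFrame, `document_df.to_dict("records")` would produce a list of records in this shape.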





## Retrieval

The process begins with the user issuing a query. During indexing, some fields were lemmatized and embedded to allow for advanced retrieval strategies. The query input must be processed in the same way.

The function, `docutrance.search.preprocess_input`, takes a query input and returns a dictionary with four key-value pairs:

 - **raw**: The original user input.
 - **stripped**: The query input with stop words removed.
 - **lemmatized**: The stripped input in its lemmatized form.
 - **embedding**: The raw input transformed into a sentence embedding.

In [10]:
from docutrance.search import preprocess_input
import spacy
from sentence_transformers import SentenceTransformer

# Initiate models for processing text data.
lemmatizer = spacy.load('en_core_web_sm')
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')


query = 'popular dishes and products'

processed = preprocess_input(query, lemmatizer, encoder)

print('Query Input:\n', query)
print()
print('Processed Query Input:')
for k,v in processed.items():
    if not isinstance(v, str):
        v = str(v[:4])[:-1] + '. . .'
    
    print('\t', k, ':', v)

Query Input:
 popular dishes and products

Processed Query Input:
	 raw : popular dishes and products
	 stripped : popular dishes products
	 lemmatized : popular dish product
	 embedding : [ 0.05008381 -0.5412065   0.328839    0.14115289. . .
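
The four keys can be produced with very little code. The sketch below is a guess at the general shape of such a function, not the source of `preprocess_input`. It relies only on the spaCy token attributes `text`, `lemma_`, and `is_stop` and on an encoder exposing `encode`. The toy lemmatizer and encoder are hypothetical stand-ins so that the example runs without downloading any models.

```python
from types import SimpleNamespace


def preprocess_query(query, lemmatizer, encoder):
    """Return the raw, stripped, lemmatized, and embedded forms of a query."""
    tokens = [tok for tok in lemmatizer(query) if not tok.is_stop]
    return {
        "raw": query,
        "stripped": " ".join(tok.text for tok in tokens),
        "lemmatized": " ".join(tok.lemma_ for tok in tokens),
        "embedding": encoder.encode(query),
    }


def toy_lemmatizer(text):
    # Hypothetical stand-in for a spaCy pipeline such as en_core_web_sm.
    lemmas = {"dishes": "dish", "products": "product"}
    stop_words = {"and", "the", "of"}
    return [
        SimpleNamespace(text=w, lemma_=lemmas.get(w, w), is_stop=w in stop_words)
        for w in text.split()
    ]


class ToyEncoder:
    # Hypothetical stand-in for SentenceTransformer; returns a dummy vector.
    def encode(self, text):
        return [float(len(text))]


result = preprocess_query("popular dishes and products", toy_lemmatizer, ToyEncoder())
for key, value in result.items():
    print(key, ":", value)
```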


Queries follow the OpenSearch query DSL.

During query formulation, each field is carefully matched with the appropriate input type.

In the example below, matches on three fields are combined using two `should` clauses:

 - A `multi_match` query is applied to both the `title` and `body` fields using the input text with stop words removed.

 - A `match_phrase` query matches the lemmatized query against the `body_lemmatized` field only.

In [11]:
from opensearchpy import OpenSearch

# This first query ranks document relevance using traditional keyword matching.
body = {
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": processed["stripped"],
                        "fields": [
                            "title",
                            "body"
                        ]
                    }
                },
                {
                    "match_phrase": {
                        "body_lemmatized": {
                            "query": processed["lemmatized"]
                        }
                    }
                }
            ]
        }
    }
}

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])
document_index_name = 'sample_documents'

response = client.search(body=body, index=document_index_name)

The OpenSearch client takes a query and returns a response, which includes matches and relevance scores.

The function `docutrance.search.post_process_response` loads the response into a DataFrame, aggregates scores, and ranks the results. Finally, it computes a Reciprocal Rank Fusion (RRF) term for each result, a scalar derived from its rank that is later used to combine multiple ranked lists.

In [12]:
from docutrance.search import post_process_response

# Mapping of columns to rename
column_map = {"_id": "document_id"}

# Specifies columns and methods for aggregation.
agg_map = {"_score": "sum"}

keyword_results = post_process_response(
    response, 
    column_map=column_map,
    agg_map=agg_map
    )
keyword_results.merge(document_df[['document_id', 'title']]).sort_values('rank')

Unnamed: 0,document_id,_score,rank,rrf,title
4,4,1.517835,1.0,0.016393,North Korea
3,3,1.445086,2.0,0.016129,Spain
6,6,1.306109,3.0,0.015873,Tuvalu
5,5,1.120901,4.0,0.015625,Israel
9,9,1.101329,5.0,0.015385,Libya
0,0,0.64479,6.0,0.015152,Romania
8,8,0.605722,7.0,0.014925,Qatar
2,2,0.09105,8.0,0.014706,Saint Vincent and the Grenadines
7,7,0.08219,9.0,0.014493,Timor-Leste
1,1,0.080734,10.0,0.014286,Federated States of Micronesia
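
The `rank` and `rrf` columns can be reproduced from the raw hits. The sketch below is illustrative only and does not mirror the internals of `post_process_response`. The function name `rank_hits` is hypothetical, and the constant `k=60` is inferred from the `rrf` values in the table (1/61 ≈ 0.016393 for rank 1). The toy response copies three scores from the table and follows the standard `hits.hits` layout of an OpenSearch search response.

```python
def rank_hits(response, k=60):
    """Rank hits by descending score and attach an RRF term to each."""
    hits = response["hits"]["hits"]
    ordered = sorted(hits, key=lambda hit: hit["_score"], reverse=True)
    return [
        {
            "document_id": hit["_id"],
            "_score": hit["_score"],
            "rank": rank,
            "rrf": 1.0 / (k + rank),
        }
        for rank, hit in enumerate(ordered, start=1)
    ]


# Toy response with three of the scores shown above.
toy_response = {
    "hits": {
        "hits": [
            {"_id": "3", "_score": 1.445086},
            {"_id": "4", "_score": 1.517835},
            {"_id": "6", "_score": 1.306109},
        ]
    }
}

for row in rank_hits(toy_response):
    print(row["document_id"], row["rank"], round(row["rrf"], 6))
```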


Semantic queries have a slightly different nested structure. This one retrieves the segments whose embeddings are the k nearest neighbors of the query embedding.

In [13]:
body = {
    "query": {
        "bool": {
            "should": {
                "knn": {
                    "segment_embedding": {
                        "vector": processed['embedding'], # Make sure to select the appropriate query type.
                        "k": 500
                    }
                }
            }
        }
    }
}

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Segments are indexed separately.
document_index_name = 'sample_segments'

response = client.search(body=body, index=document_index_name)

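Semantic search compares the query embedding to the segment embeddings, with `k` set to 500 nearest neighbors. Because the request body does not set `size`, OpenSearch returns its default of 10 hits, so only a handful of documents appear in the results. Results are grouped by their `document_id` and ranked by aggregate score.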

Relevant segments are combined into an ordered list to be used as semantic highlights.

In [14]:
from docutrance.search import post_process_response

# Segments are retained as semantic highlights.
column_map = {"segment": "semantic_highlight"}

# In addition to the score, high-scoring segments are returned in an ordered list
agg_map = {"_score": "sum", "semantic_highlight": lambda x: list(x)}

semantic_results = post_process_response(
    response, 
    column_map=column_map,
    agg_map=agg_map
    )

semantic_results = semantic_results.merge(document_df[['document_id', 'title']]).sort_values('rank').reset_index(drop=True)

print('Query:\n', processed['raw'])
print()
print('Top Ranked Country:\n', semantic_results.loc[0, 'title'])
print()
print('Semantic Highlights:\n')

for highlight in semantic_results.loc[0, 'semantic_highlight']:
    print(highlight)
    print()

semantic_results



Query:
 popular dishes and products

Top Ranked Country:
 Libya

Semantic Highlights:

Common Libyan foods include several variations of red (tomato) sauce based pasta dishes (similar to the Italian Sugo all'arrabbiata dish); rice, usually served with lamb or chicken (typically stewed, fried, grilled, or boiled in-sauce); and couscous, which is steam cooked whilst held over boiling red (tomato) sauce and meat (sometimes also containing courgettes/zucchini and chickpeas), which is typically served along with cucumber slices, lettuce and olives.

Another popular way to serve Asida is with rub (fresh date syrup) and olive oil. Usban is animal tripe stitched and stuffed with rice and vegetables cooked in tomato based soup or steamed. Shurba is a red tomato sauce-based soup, usually served with small grains of pasta.[313]

Bazeen, a dish made from barley flour and served with red tomato sauce, is customarily eaten communally, with several people sharing the same dish, usually by hand. This 

Unnamed: 0,document_id,_score,semantic_highlight,rank,rrf,title
0,9,0.13193,[Common Libyan foods include several variation...,1.0,0.016393,Libya
1,3,0.128288,"[Inner Spain – Castile – hot, thick soups such...",2.0,0.016129,Spain
2,5,0.088598,[It incorporates many foods traditionally eate...,3.0,0.015873,Israel
3,4,0.079766,[Korean cuisine has evolved through centuries ...,4.0,0.015625,North Korea
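
The grouping step can be pictured as follows. This is a sketch under assumptions rather than the library's code: it assumes each segment hit carries `document_id` and `segment` in its `_source`, as the segment index mappings suggest, and the name `group_segment_hits` and the toy scores are hypothetical.

```python
from collections import defaultdict


def group_segment_hits(hits):
    """Sum segment scores per document and collect segments in score order."""
    scores = defaultdict(float)
    highlights = defaultdict(list)
    # Visit the best segments first so each highlight list is ordered by score.
    for hit in sorted(hits, key=lambda h: h["_score"], reverse=True):
        doc_id = hit["_source"]["document_id"]
        scores[doc_id] += hit["_score"]
        highlights[doc_id].append(hit["_source"]["segment"])
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [
        {
            "document_id": doc_id,
            "_score": scores[doc_id],
            "semantic_highlight": highlights[doc_id],
            "rank": rank,
        }
        for rank, doc_id in enumerate(ranked, start=1)
    ]


# Toy hits: two segments from document 9 and one from document 3.
toy_hits = [
    {"_score": 0.04, "_source": {"document_id": "9", "segment": "second Libya segment"}},
    {"_score": 0.06, "_source": {"document_id": "3", "segment": "Spain segment"}},
    {"_score": 0.05, "_source": {"document_id": "9", "segment": "first Libya segment"}},
]

for row in group_segment_hits(toy_hits):
    print(row["rank"], row["document_id"], row["semantic_highlight"])
```

Summing rewards documents with several relevant segments, which is why a document can outrank another whose single best segment scores higher.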


## Reranking

One of the challenges of working with hybrid retrieval is combining different scoring schemes. Reciprocal Rank Fusion (RRF) is a simple strategy that prioritizes rank over individual scores. 

The RRF Score is given by the formula:

$$
\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}
$$

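 - `d` is a document,
 - `R` is the set of rankings (e.g. keyword, semantic),
 - `rank_r(d)` is the rank position of document `d` in ranking `r`,
 - `k` is a smoothing constant (60 here) that dampens the influence of the highest-ranked positions, so that no single list dominates the fused score.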

RRF is an effective and simple strategy for combining different ranked lists.
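
As a worked example of the formula, the sketch below fuses the keyword and semantic rankings reported earlier in this notebook. It is a minimal, assumed implementation with the hypothetical name `reciprocal_rank_fusion`, and it uses `k=60` to match the `rrf` values in the results tables. A document missing from a ranking simply contributes nothing for that ranking.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Ranked titles copied from the keyword and semantic result tables.
keyword = [
    "North Korea", "Spain", "Tuvalu", "Israel", "Libya", "Romania", "Qatar",
    "Saint Vincent and the Grenadines", "Timor-Leste",
    "Federated States of Micronesia",
]
semantic = ["Libya", "Spain", "Israel", "North Korea"]

fused = reciprocal_rank_fusion([keyword, semantic])
for title, score in fused:
    print(f"{score:.6f}  {title}")
```

Spain comes out on top because it is ranked second in both lists (2/62 ≈ 0.032258), ahead of North Korea (1/61 + 1/64 ≈ 0.032018) and Libya (1/65 + 1/61 ≈ 0.031778), each of which is first in one list but lower in the other.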

The function, `docutrance.search.combine_responses`, merges the result sets and re-ranks documents by their aggregate RRF score.

In [16]:
from docutrance.search import combine_responses

responses = [keyword_results, semantic_results]

final_results = combine_responses(responses, document_df)
final_results[['rank', 'semantic_highlight', 'title', 'url']]

Unnamed: 0,rank,semantic_highlight,title,url
0,1.0,"[Inner Spain – Castile – hot, thick soups such...",Spain,https://en.wikipedia.org/wiki/Spain
1,2.0,[Korean cuisine has evolved through centuries ...,North Korea,https://en.wikipedia.org/wiki/North_Korea
2,3.0,[Common Libyan foods include several variation...,Libya,https://en.wikipedia.org/wiki/Libya
3,4.0,[It incorporates many foods traditionally eate...,Israel,https://en.wikipedia.org/wiki/Israel
4,5.0,[],Tuvalu,https://en.wikipedia.org/wiki/Tuvalu
5,6.0,[],Romania,https://en.wikipedia.org/wiki/Romania
6,7.0,[],Qatar,https://en.wikipedia.org/wiki/Qatar
7,8.0,[],Saint Vincent and the Grenadines,https://en.wikipedia.org/wiki/Saint_Vincent_an...
8,9.0,[],Timor-Leste,https://en.wikipedia.org/wiki/Timor-Leste
9,10.0,[],Federated States of Micronesia,https://en.wikipedia.org/wiki/Federated_States...


This completes the overview on how documents are indexed, retrieved, and ranked with OpenSearch.

The rankings in this notebook are not meaningful because the index is too small. To get better acquainted with OpenSearch, it is recommended that you build your own search engine and experiment.

Follow the instructions in `search_engine/README.md` and try it yourself! Use the Python script, `ingest_pipeline.py`, to build an index from the full list of Wikipedia countries. Then browse your index using `app.py`.