### Use dense retrieval for search

### Caveats:
1. If the texts do not contain a relevant answer, the search still returns the nearest results. A threshold on the distance (or similarity) can filter these out, and feedback gathered from users can improve future versions.
2. It does not handle exact matches well, because it is a semantic search. Hybrid search, which combines semantic and keyword search, can handle this case.
3. The embedding models are hard to get working properly in domains other than the ones they were trained on.
4. If the answer is spread across multiple sentences rather than contained in one, this approach does not work well. **This shows one of the important design parameters of dense retrieval systems: what is the best way to chunk long texts? And why do we need to chunk them in the first place?**
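
The threshold idea from caveat 1 can be sketched as a small filter over the neighbors a search returns. This is a minimal sketch, not a tuned implementation: the helper name `filter_by_distance` and the cutoff `max_distance=1.0` are assumptions made for illustration, and a real cutoff would have to be chosen by inspecting distances on your own data.

```python
import numpy as np

def filter_by_distance(ids, distances, max_distance=1.0):
    """Keep only neighbors whose distance is at or below max_distance.

    Assumes smaller distance means more similar (as with Annoy's
    'angular' metric). max_distance is a hypothetical cutoff.
    """
    ids = np.asarray(ids)
    distances = np.asarray(distances, dtype=float)
    keep = distances <= max_distance
    return ids[keep], distances[keep]

# Illustrative values only: three neighbor ids with their distances
kept_ids, kept_distances = filter_by_distance([10, 1, 9], [0.89, 1.07, 1.09])
print(kept_ids, kept_distances)
```

If nothing survives the filter, the application can report that no relevant answer was found instead of showing weak matches.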

## Chunking
### One Vector per Doc
* Embed only a representative part of the document and ignore the rest of the text. (This may work better for documents whose beginning captures the main points.)
* Split the document into chunks, embed those chunks, and then aggregate the chunk embeddings into a single vector, usually by averaging. Averaging can lose a lot of the information in the document, and because it weights every chunk equally, the important chunks can be diluted.
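
The averaging approach in the second bullet can be sketched with NumPy. The function name `mean_pool` and the toy 2-dimensional vectors are assumptions for illustration; real chunk embeddings would come from an embedding model and have far more dimensions.

```python
import numpy as np

def mean_pool(chunk_embeddings):
    """Average a (num_chunks, dim) array of chunk embeddings into one vector."""
    chunk_embeddings = np.asarray(chunk_embeddings, dtype=float)
    return chunk_embeddings.mean(axis=0)

# Toy chunk embeddings, one row per chunk
chunk_vectors = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0]])
doc_vector = mean_pool(chunk_vectors)
print(doc_vector.shape)
```

Every chunk contributes equally to `doc_vector`, which is exactly why a single highly relevant chunk can be washed out by the rest of the document.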

### Multiple vectors per doc
* Each sentence is a chunk. The issue here is this could be too granular and the vectors don’t capture enough of the context.
* Each paragraph is a chunk. This works well if the text is made up of short paragraphs. Otherwise, every 4-8 sentences can form a chunk.
* Some chunks derive a lot of their meaning from the text around them. So we can incorporate some context via:
  * Adding the title of the document to the chunk
  * Adding some of the text before and after the chunk. This way, chunks overlap and each includes some of its surrounding text.
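
One way to combine these ideas (multi-sentence chunks, overlap, and a title prefix) is sketched below. The helper names `split_sentences` and `chunk_sentences`, the regex-based sentence splitter, and the default sizes are all assumptions for illustration; a production pipeline would likely use a proper sentence tokenizer.

```python
import re

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def chunk_sentences(sentences, chunk_size=3, overlap=1, title=None):
    """Group sentences into chunks that share `overlap` sentences.

    If `title` is given, it is prepended to every chunk for extra context.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunk = ' '.join(sentences[start:start + chunk_size])
        if title:
            chunk = f"{title}: {chunk}"
        chunks.append(chunk)
        # Stop once the current window has reached the last sentence
        if start + chunk_size >= len(sentences):
            break
    return chunks

sample_sentences = split_sentences("One. Two. Three. Four. Five.")
sample_chunks = chunk_sentences(sample_sentences, chunk_size=3, overlap=1, title="Demo")
for chunk in sample_chunks:
    print(chunk)
```

Larger overlap gives each chunk more context at the cost of more vectors to embed and store.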

## Searching:
* For small datasets (thousands or tens of thousands of vectors), exact search with NumPy is reasonable.
* Beyond millions of vectors, approximate nearest neighbor search libraries such as Annoy or FAISS should be used to keep search fast.
* Alternatively, use a vector database.
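
For the small-data case, exact search is only a few lines of NumPy. This is a minimal sketch: the function name `top_k_cosine` and the toy vectors are assumptions for illustration, and the code assumes no embedding is an all-zero vector.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return indices and cosine similarities of the k closest rows.

    Brute force: compares the query against every document vector.
    """
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Normalize so that a dot product equals cosine similarity
    doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit
    top = np.argsort(-scores)[:min(k, len(scores))]
    return top, scores[top]

# Toy 2-dimensional "embeddings" for three documents
docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
top_ids, top_scores = top_k_cosine([1.0, 0.1], docs, k=2)
print(top_ids, top_scores)
```

The cost grows linearly with the number of vectors, which is why approximate indexes become attractive at larger scales.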

## Reranking

A common way to improve results, particularly when a search pipeline already exists, is to use an LLM or another model to rerank the output of dense retrieval.
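
The shape of a rerank step can be sketched independently of any particular model: take the candidates from the first-stage search, score each one against the query, and sort. In the sketch below, `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without a model; in practice the scoring function would be a cross-encoder or a hosted rerank endpoint.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score every candidate against the query and return the best top_n.

    score_fn(query, doc) -> float, where higher means more relevant.
    """
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query, doc):
    """Stand-in scorer (NOT a real reranker): fraction of query words in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0

candidates = [
    "It stars Matthew McConaughey",
    "The film premiered in Los Angeles",
    "The film grossed over $681 million worldwide",
]
ranked = rerank("film grossed", candidates, toy_overlap_score, top_n=2)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

Because the reranker only sees the short candidate list, it can afford a slower and more accurate model than the first-stage retriever.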

### Open-source Retrieve & Re-Rank using SBERT
[Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)

In [None]:
%pip install cohere tqdm Annoy

In [2]:
import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex

In [3]:
api_key = ''

In [4]:
co = cohere.Client(api_key)

In [25]:
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for humankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007 and was originally set to be directed by Steven Spielberg. Kip Thorne, a Caltech theoretical physicist and 2017 Nobel laureate in Physics,[4] was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar. Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles. Interstellar uses extensive practical and miniature effects, and the company Double Negative created additional digital effects.

Interstellar premiered in Los Angeles on October 26, 2014. In the United States, it was first released on film stock, expanding to venues using digital projectors. The film received generally positive reviews from critics and grossed over $681 million worldwide ($703 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014. It has been praised by astronomers for its scientific accuracy and portrayal of theoretical astrophysics.[5][6][7] Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades.
"""

texts = text.split('.')

# Clean up to remove empty space and new lines
texts = np.array([t.strip(' \n') for t in texts])

In [26]:
# Create embeddings
response = co.embed(texts=texts.tolist())
embeds = np.array(response.embeddings)
print(embeds.shape)

default model on embed will be deprecated in the future, please specify a model in the request.


(14, 4096)


In [27]:
# Create the search index
search_index = AnnoyIndex(embeds.shape[1], 'angular')

for index, embed in enumerate(embeds):
    search_index.add_item(index, embed)

search_index.build(10)
search_index.save('interstellar.ann')

True

In [36]:
def search(query):
    # Embed the query with the same model that embedded the documents
    response = co.embed(texts=[query])
    query_embed = response.embeddings[0]
    # With include_distances=True, Annoy returns a pair: (ids, distances)
    similar_item_ids = search_index.get_nns_by_vector(query_embed, 5, include_distances=True)
    results = pd.DataFrame({'text': texts[similar_item_ids[0]], 'distances': similar_item_ids[1]})
    print("Query for '{}' \n The nearest neighbors: \n {}".format(query, results))
    return results

In [35]:
search('How much did the film make?')

search('Which actors are involved?')

default model on embed will be deprecated in the future, please specify a model in the request.
default model on embed will be deprecated in the future, please specify a model in the request.


Query for 'How much did the film make?': and the nearest neighbors: 
                                                 text  distances
0  The film received generally positive reviews f...   0.890909
1  It stars Matthew McConaughey, Anne Hathaway, J...   1.066916
2  In the United States, it was first released on...   1.086900
3  Cinematographer Hoyte van Hoytema shot it on 3...   1.099424
4  Interstellar premiered in Los Angeles on Octob...   1.152348
Query for 'Which actors are involved?': and the nearest neighbors: 
                                                 text  distances
0  It stars Matthew McConaughey, Anne Hathaway, J...   0.917706
1  Principal photography began in late 2013 and t...   1.183912
2  Cinematographer Hoyte van Hoytema shot it on 3...   1.190191
3  Brothers Christopher and Jonathan Nolan wrote ...   1.194230
4  Interstellar premiered in Los Angeles on Octob...   1.228957


Unnamed: 0,text,distances
0,"It stars Matthew McConaughey, Anne Hathaway, J...",0.917706
1,Principal photography began in late 2013 and t...,1.183912
2,Cinematographer Hoyte van Hoytema shot it on 3...,1.190191
3,Brothers Christopher and Jonathan Nolan wrote ...,1.19423
4,Interstellar premiered in Los Angeles on Octob...,1.228957
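
The distances above come from Annoy's `'angular'` metric, which the Annoy documentation describes as the Euclidean distance between normalized vectors, i.e. `sqrt(2 * (1 - cos))`. Under that definition, a distance can be converted back to a cosine similarity, which is often easier to interpret. The helper name below is an assumption for illustration.

```python
def angular_distance_to_cosine(distance):
    """Invert distance = sqrt(2 * (1 - cos)) to recover cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0

# Distances taken from the first query's output above
for distance in (0.890909, 1.066916):
    print(f"distance={distance:.6f}  cosine={angular_distance_to_cosine(distance):.3f}")
```

A distance of 0 corresponds to identical directions, and a distance above about 1.41 corresponds to a negative cosine similarity.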


In [44]:
results = co.rerank(documents=texts.tolist(), query='film gross', model="rerank-english-02", top_n=3)

In [45]:
print(results)

for idx, r in enumerate(results):
    print(f"Document Rank: {idx+1}, Document Index : {r.index}")
    print(f"Document Text: {r.document['text']}")
    print(f"Document Score: {r.relevance_score:.2f}")

[RerankResult<document['text']: The film received generally positive reviews from critics and grossed over $681 million worldwide ($703 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014, index: 10, relevance_score: 0.8873999>, RerankResult<document['text']: It has been praised by astronomers for its scientific accuracy and portrayal of theoretical astrophysics, index: 11, relevance_score: 0.0434469>, RerankResult<document['text']: Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for humankind, index: 2, relevance_score: 0.042087726>]
Document Rank: 1, Document Index : 10
Document Text: The film received generally positive reviews from critics and grossed over $681 million worldwide ($703 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014
Document Score: 0.89
Documen