## Dense retrieval example

Let’s take a look at a dense retrieval example by using Cohere to search the
Wikipedia page for the film Interstellar. In this example, we will do the
following:

1. Get the text we want to make searchable and apply some light processing to chunk it into sentences.
2. Embed the sentences.
3. Build the search index.
4. Search and see the results.

In [3]:
#import
import numpy as np 
import cohere
import pandas as pd 
from tqdm import tqdm 

In [5]:
import os
api_key = os.environ['COHERE_API_KEY']  # assumed variable name; avoids hard-coding the key

In [8]:
# create and retrieve a Cohere API key from os.cohere.ai
co = cohere.Client(api_key)

### Getting the text archive and chunking it

Let’s use the first section of the Wikipedia article on the film Interstellar.
We’ll get the text, then break it into sentences:

In [9]:
text = """
 Interstellar is a 2014 epic science fiction film co-written, 
directed, and produced by Christopher Nolan. 
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, 
Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. 
Set in a dystopian future where humanity is struggling to 
survive, the film follows a group of astronauts who travel 
through a wormhole near Saturn in search of a new home for 
mankind.
 Brothers Christopher and Jonathan Nolan wrote the screenplay, 
which had its origins in a script Jonathan developed in 2007. 
Caltech theoretical physicist and 2017 Nobel laureate in 
Physics[4] Kip Thorne was an executive producer, acted as a 
scientific consultant, and wrote a tie-in book, The Science of 
Interstellar. 
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in 
the Panavision anamorphic format and IMAX 70 mm. 
Principal photography began in late 2013 and took place in 
Alberta, Iceland, and Los Angeles. 
Interstellar uses extensive practical and miniature effects and 
the company Double Negative created additional digital effects.
 Interstellar premiered on October 26, 2014, in Los Angeles. 
In the United States, it was first released on film stock, 
expanding to venues using digital projectors. 
The film had a worldwide gross over $677 million (and $773 
million with subsequent re-releases), making it the tenth-highest 
grossing film of 2014. 
It received acclaim for its performances, direction, screenplay, 
musical score, visual effects, ambition, themes, and emotional 
weight. 
It has also received praise from many astronomers for its 
scientific accuracy and portrayal of theoretical astrophysics. 
Since its premiere, Interstellar gained a cult following,[5] and 
now is regarded by many sci-fi experts as one of the best 
science-fiction films of all time.
 Interstellar was nominated for five awards at the 87th Academy 
Awards, winning Best Visual Effects, and received numerous other 
accolades"""

In [11]:
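# split into a list of sentences
texts = text.split('.')

# clean up by stripping newlines from the ends of each chunk
# (strip('\n') leaves spaces in place; t.strip() would trim those too)
texts = [t.strip('\n') for t in texts]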
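
Splitting on `'.'` is a rough heuristic: it drops the periods and leaves stray whitespace on each chunk. As a minimal alternative sketch, assuming a hypothetical helper named `split_sentences` and a made-up sample string, the standard library's `re` module can split after sentence-ending punctuation, collapse line breaks, and skip empty chunks. It is still a heuristic (an abbreviation followed by a space would be split), and using it would change the chunk text and therefore the embeddings and distances shown later.

```python
import re

def split_sentences(text):
    """Hypothetical helper: split text into whitespace-normalized sentences."""
    # Split wherever '.', '!' or '?' is followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Collapse internal line breaks and drop empty chunks.
    return [' '.join(p.split()) for p in parts if p.strip()]

# Made-up sample, shaped like the wrapped text above.
sample = "Interstellar is a 2014 film.\nIt was shot on 35 mm\nmovie film.\n"
chunks = split_sentences(sample)
print(chunks)
```

Unlike `text.split('.')`, this keeps the closing period on each sentence and does not split decimals such as `3.5`, because no whitespace follows the dot.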

### Embedding the text chunks

Let’s now embed the texts. We’ll send them to the Cohere API and get back
a vector for each text:

In [13]:
# get the embeddings
response = co.embed(
    texts=texts,
    input_type='search_document',
).embeddings

# convert to a NumPy array of shape (number of texts, embedding dimension)
embeds = np.array(response)
print(embeds.shape)

(15, 4096)


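### Building the search index

Before we can search, we need to build a search index. An index stores the
embeddings and is optimized to quickly retrieve the nearest neighbors even
if we have a very large number of points. Here we use the FAISS
`IndexFlatL2` index, which compares the query against every stored vector
using L2 distance: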

In [14]:
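import faiss

# dimensionality of the embedding vectors
dim = embeds.shape[1]
# flat index: exact search using L2 distance
index = faiss.IndexFlatL2(dim)
print(index.is_trained)
# FAISS expects float32 vectors
index.add(np.float32(embeds))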


True
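
To make the role of the index concrete, here is a minimal pure-Python sketch of what an exact L2 index computes: the squared L2 distance from the query to every stored vector, followed by selection of the `k` smallest. The function names `squared_l2` and `flat_l2_search` and the toy vectors are illustrative assumptions, not part of FAISS, which performs the same kind of computation far more efficiently.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Hypothetical brute-force search: return (distances, ids) of the k nearest vectors."""
    # Score every stored vector against the query.
    scored = [(squared_l2(v, query), i) for i, v in enumerate(vectors)]
    # Keep the k smallest distances, closest first.
    nearest = heapq.nsmallest(k, scored)
    distances = [d for d, _ in nearest]
    ids = [i for _, i in nearest]
    return distances, ids

# Toy 2-dimensional "embeddings" for illustration only.
vectors = [[0, 0], [1, 0], [3, 4]]
distances, ids = flat_l2_search(vectors, query=[0, 0], k=2)
print(distances, ids)
```

Because every vector is scanned, the cost grows linearly with the number of stored points; that is fine for the 15 sentences here, while larger collections are usually served by approximate index types.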


### Search the index

We can now search the dataset using any query we want. We simply embed
the query and present its embedding to the index, which will retrieve the
most similar sentences from the Wikipedia article.

In [17]:
# function for searching

def search(query, number_of_results=5):

    # get the query embedding
    query_embed = co.embed(texts=[query], input_type='search_query').embeddings[0]

    # retrieve the nearest neighbours
    distances, similar_item_ids = index.search(np.float32([query_embed]), number_of_results)

    # get the results
    texts_np = np.array(texts)
    results = pd.DataFrame(data={'texts': texts_np[similar_item_ids[0]],
                                 'distances': distances[0]})

    # print the query and return the results
    print(f"Query:'{query}' \n Nearest Neighbours:")
    return results

In [21]:
# testing
query = "how precise was the science"
results = search(query)
results

Query:'how precise was the science' 
 Nearest Neighbours:


| | texts | distances |
|---|---|---|
| 0 | \nIt has also received praise from many astro... | 10267.427734 |
| 1 | \nSince its premiere, Interstellar gained a c... | 12490.473633 |
| 2 | \nCaltech theoretical physicist and 2017 Nobe... | 12507.566406 |
| 3 | \nInterstellar uses extensive practical and m... | 12546.205078 |
| 4 | \nCinematographer Hoyte van Hoytema shot it o... | 13720.408203 |
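
The `distances` column holds the squared L2 distances returned by `IndexFlatL2`, so smaller values mean closer matches. Values in the thousands also show that these embeddings are not unit length, since the squared L2 distance between two unit vectors is at most 4. The sketch below, which uses toy vectors rather than real embeddings and hypothetical helper names, illustrates the identity that links the two measures once vectors are normalized: squared L2 distance equals 2 minus twice the cosine similarity.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only.
a, b = [3.0, 4.0], [4.0, 3.0]

# Squared L2 distance between the unit-length versions of a and b.
lhs = squared_l2(normalize(a), normalize(b))
# 2 - 2 * cosine similarity of the original vectors.
rhs = 2 - 2 * cosine(a, b)
print(lhs, rhs)
```

With normalized vectors, ranking by L2 distance and ranking by cosine similarity give the same order; with unnormalized vectors, as here, vector length also influences the distance.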
