# Project - Semantic Search using Cohere API
Language models give computers the ability to search by meaning and go beyond searching by matching keywords. This capability is called semantic search. 

![Searching an archive using sentence embeddings](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/basic-semantic-search-overview.png?3)

In this notebook, I'll build a simple semantic search engine. The applications of semantic search go beyond building a web search engine. Unlike a common lexical search algorithm, which matches keywords, the semantic approach uses the meaning of the search query and the intent of the user to return relevant results.
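
To make the contrast concrete, here is a minimal sketch of a purely lexical score: word-overlap (Jaccard) similarity. The function name `jaccard`, the lowercase whitespace tokenization, and the example sentences are illustrative assumptions, not part of this notebook's pipeline. Two questions that mean the same thing can share almost no words, which is the gap that embeddings are meant to close.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap score: shared distinct words / all distinct words."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


query = "what is the tallest mountain in the world ?"
paraphrase = "name the highest peak on earth ."            # same meaning, different words
unrelated = "what is the tallest building in the world ?"  # different meaning, same words

print("paraphrase:", jaccard(query, paraphrase))
print("unrelated: ", jaccard(query, unrelated))
```

The lexical score ranks the unrelated question above the paraphrase, which is the opposite of what a user would want.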

The approach is broken into the following steps:

1. Get the dataset of questions.
2. Create embeddings for the dataset.
3. Build an index of the embeddings and search it with nearest neighbor search.
4. Visualize the archive based on the embeddings.

In [1]:
# Install Cohere for embeddings, Umap to reduce embeddings to 2 dimensions, 
# Altair for visualization, Annoy for approximate nearest neighbor search
!pip install cohere umap-learn altair annoy datasets tqdm


Get your Cohere API key by [signing up here](https://os.cohere.ai/register). You'll paste it into the client cell that follows the imports.

## Importing Libraries

In [2]:
import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

You'll need your API key for this next cell. [Sign up to Cohere](https://os.cohere.ai/) and get one if you haven't yet.

In [3]:
# Paste your API key here. Remember not to share it publicly.
api_key = '<YOUR_API_KEY>'

# Create the Cohere client with the key obtained from os.cohere.ai
co = cohere.Client(api_key)
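
A key typed directly into a cell is easy to leak when the notebook is shared. One common alternative, sketched here as an assumption rather than something this notebook does, is to read the key from an environment variable. The helper `read_api_key` and the variable name `COHERE_API_KEY` are hypothetical names chosen for this example.

```python
import os


def read_api_key(env_var: str = "COHERE_API_KEY") -> str:
    """Return the key stored in the given environment variable.

    Both the helper and the default variable name are illustrative choices.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage in the notebook, once the variable is set:
# co = cohere.Client(read_api_key())
```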

## 1. Get The Dataset of Questions
We'll use the [trec](https://www.tensorflow.org/datasets/catalog/trec) dataset which is made up of questions and their categories.

In [5]:
# Fetching dataset and structuring as DataFrame
dataset = load_dataset("trec", split="train")

df = pd.DataFrame(dataset)
df.head(10)



| | text | coarse_label | fine_label |
|---|---|---|---|
| 0 | How did serfdom develop in and then leave Russia ? | 2 | 26 |
| 1 | What films featured the character Popeye Doyle ? | 1 | 5 |
| 2 | How can I find a list of celebrities ' real names ? | 2 | 26 |
| 3 | What fowl grabs the spotlight after the Chinese Year of the Monkey ? | 1 | 2 |
| 4 | What is the full form of .com ? | 0 | 1 |
| 5 | What contemptible scoundrel stole the cork from my lunch ? | 3 | 29 |
| 6 | What team did baseball 's St. Louis Browns become ? | 3 | 28 |
| 7 | What is the oldest profession ? | 3 | 30 |
| 8 | What are liver enzymes ? | 2 | 24 |
| 9 | Name the scar-faced bounty hunter of The Old West . | 3 | 29 |


## 2. Create Embeddings of the dataset
The next step is to embed the text of the questions.

![embedding archive texts](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-embed-text-archive.png)

In [6]:
# Get the embeddings.
# The embed endpoint returns one embedding per text: a list of floating point
# numbers that captures semantic information about the text it represents.
embeddings = co.embed(texts=list(df['text']),
                      model="large",
                      truncate="LEFT").embeddings

In [7]:
# Check the dimensions of the embeddings
embeddings = np.array(embeddings)
embeddings.shape

(5452, 4096)
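
The cell above sends all 5,452 questions in a single request. If the endpoint limits how many texts it accepts per call, or if the archive grows, the texts can be embedded in batches instead. The sketch below is an assumption about how one might do that, not part of the original notebook: `embed_in_batches` is a hypothetical helper, `embed_fn` is any function that maps a list of texts to a list of vectors, and the default `batch_size` is an arbitrary illustrative value. A fake embedding function stands in for the API so the example runs offline.

```python
from typing import Callable, List, Sequence


def embed_in_batches(texts: Sequence[str],
                     embed_fn: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 96) -> List[List[float]]:
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))
    return vectors


# Offline stand-in for the API: one number per text, and a log of batch sizes
calls = []

def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(len(vectors), calls)
```

With the real client, `embed_fn` could be `lambda batch: co.embed(texts=batch, model="large", truncate="LEFT").embeddings`, mirroring the call used above.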

## 3. Search using an index and nearest neighbor search
![Building the search index from the embeddings](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-index.png)
We use [Annoy](https://github.com/spotify/annoy) to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts.

After building the index, we can use it to retrieve the nearest neighbors either of existing questions (section 3.1), or of new questions that we embed (section 3.2).

In [8]:
# Create the search index, passing the size of each embedding.
# AnnoyIndex(f, metric) returns a new read-write index that stores vectors of
# f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".
search_index = AnnoyIndex(embeddings.shape[1], 'angular')

# Add all the vectors to the search index, keyed by their row number
for i in range(len(embeddings)):
    search_index.add_item(i, embeddings[i])

search_index.build(10)  # Build a forest of 10 trees
search_index.save('test.ann')

True

### 3.1. Find the neighbors of an example from the dataset
If we're only interested in measuring the distance between the questions in the dataset (no outside queries), a simple way is to calculate the distance between every pair of embeddings we have. Here we instead ask the index for the neighbors of an item that is already stored in it.
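
The pairwise calculation mentioned above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: the random vectors stand in for the real embeddings, and `top_k_by_cosine` is a hypothetical helper name. The similarity matrix has one entry per pair of items, so its size grows quadratically with the archive, which is one reason to prefer an index for larger collections.

```python
import numpy as np


def top_k_by_cosine(vectors: np.ndarray, item: int, k: int = 3) -> list:
    """Exact nearest neighbors of one row, ranked by cosine similarity."""
    # Normalize rows so that a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T               # (n, n) matrix of all pairwise similarities
    order = np.argsort(-sims[item])    # most similar first
    return [int(i) for i in order if i != item][:k]


rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 8))                   # 6 stand-in "embeddings" of size 8
toy[3] = toy[0] + 0.01 * rng.normal(size=8)     # make row 3 a near-duplicate of row 0

print(top_k_by_cosine(toy, item=0, k=2))
```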

In [9]:
df.iloc[92]['text']

'What are bear and bull markets ?'

In [11]:
# Choose an example (we'll retrieve others similar to it)
sentence_id = 100
print('Example Question: ',df.iloc[sentence_id]['text'])

# Retrieve the 10 nearest items. With include_distances=True, the call returns
# a tuple of two lists: the neighbors' ids and their distances from the queried item.
similar_text_ids = search_index.get_nns_by_item(sentence_id, 10, include_distances=True)

similar_text_ids

Example Question:  Who invented Make-up ?


([100, 5231, 583, 2569, 5258, 829, 1645, 4020, 4755, 26],
 [0.0,
  0.5644333362579346,
  0.8844353556632996,
  0.9291921854019165,
  0.939775288105011,
  0.9762884974479675,
  0.9800129532814026,
  0.9936729073524475,
  1.0033985376358032,
  1.0056532621383667])

In [12]:
# Create a Dataframe of the nearest items and display
df_nearest = pd.DataFrame({'Closest Sentence':df.iloc[similar_text_ids[0]]['text'], 'Distance':similar_text_ids[1]})

# The neighbors include the item we searched for, so drop it.
df_nearest.drop(index = sentence_id, inplace=True)
df_nearest

| | Closest Sentence | Distance |
|---|---|---|
| 5231 | Where did makeup originate ? | 0.564433 |
| 583 | Who invented the toothbrush ? | 0.884435 |
| 2569 | Who invented the Wonderbra ? | 0.929192 |
| 5258 | Who invented panties ? | 0.939775 |
| 829 | Who invented silly putty ? | 0.976288 |
| 1645 | Silly putty was invented by whom ? | 0.980013 |
| 4020 | Who invented the fountain ? | 0.993673 |
| 4755 | Who invented the horoscope ? | 1.003399 |
| 26 | Who was the inventor of silly putty ? | 1.005653 |
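
The distances above are easier to read once converted to cosine similarity. Per the Annoy README, the `'angular'` metric is the Euclidean distance between normalized vectors, which equals sqrt(2 * (1 - cos(u, v))). The helper below inverts that formula; its name `angular_to_cosine` is an illustrative choice, and the sample distances are copied from the output above.

```python
import math


def angular_to_cosine(distance: float) -> float:
    """Invert d = sqrt(2 * (1 - cos)) to recover cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


# The queried item itself, its closest neighbor, and its 9th neighbor
for d in (0.0, 0.5644333362579346, 1.0056532621383667):
    print(f"distance {d:.4f} -> cosine similarity {angular_to_cosine(d):.4f}")
```

Under this conversion a distance of 0 means identical direction, a distance of sqrt(2) (about 1.41) means orthogonal vectors, and the closest neighbor here has a cosine similarity of roughly 0.84.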


### 3.2. Find the neighbors of a user query
We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.

In [13]:
# Helper that accepts a user query and returns the semantic search results.
def return_similar_query(query, n=5):
  # Embedding for the query is required to search the index
  query_embed = co.embed(texts=[query], model='large', truncate='LEFT').embeddings

  similar_text_ids = search_index.get_nns_by_vector(vector=query_embed[0], n=n, include_distances=True)
  df_similar = pd.DataFrame({'Closest Sentence':df.iloc[similar_text_ids[0]]['text'], 'Distance':similar_text_ids[1]})
  return df_similar


In [14]:
query = "What is the tallest mountain in the world?"
result = return_similar_query(query, n=5)
result

| | Closest Sentence | Distance |
|---|---|---|
| 3885 | What is the tallest mountain ? | 0.438478 |
| 236 | What is the name of the tallest mountain in the world ? | 0.483403 |
| 670 | What is the highest mountain in the world ? | 0.544842 |
| 1293 | What is the world 's highest peak ? | 0.629433 |
| 4336 | Name the highest mountain . | 0.754351 |


In [15]:
query = "Who is the beautiful actress in hollywood?"
result = return_similar_query(query, n=5)
result

| | Closest Sentence | Distance |
|---|---|---|
| 4468 | Who is the most sexy celebrity ? | 0.898084 |
| 5288 | What famous actress made her first appearance on stage at the age of five in the year 191 as `` Baby '' ? | 0.973373 |
| 2999 | Who is the actress Bette Davis once said she wished she looked like ? | 0.97347 |
| 2748 | What American actress was the first to be called a `` vamp '' ? | 0.978531 |
| 2785 | What buxom blonde appeared on the cover of more than 5 magazines ? | 0.998832 |


In [16]:
query = "Which is highest grossing movie?"
result = return_similar_query(query, n=5)
result

| | Closest Sentence | Distance |
|---|---|---|
| 3279 | What movie has made the most money ? | 0.635258 |
| 923 | What was the top box office movie in April 1998 ? | 0.908633 |
| 4101 | What actor and actress have made the most movies ? | 0.971196 |
| 2619 | How many films are made by the major studios in a year ? | 1.006928 |
| 1006 | What 1915 film was the first to gross $5 million ? | 1.031404 |


## 4. Visualizing the archive
Finally, let's plot the questions (the first 5,000 of them) on a 2D chart so you can visualize the semantic similarities in this dataset!

In [24]:
#@title Plot the archive {display-mode: "form"}

# UMAP reduces the embeddings from 4096 dimensions to 2 so that we can plot them
reducer = umap.UMAP(n_neighbors=20)
umap_embeds = reducer.fit_transform(embeddings[:5000])
# Prepare the data for an interactive visualization using Altair
df_explore = pd.DataFrame(data={'text': df['text'][:5000]})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

# Plot
chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x',
        scale=alt.Scale(zero=False)
    ),
    y=alt.Y('y',
        scale=alt.Scale(zero=False)
    ),
    tooltip=['text']
).properties(
    width=700,
    height=400
)
chart.interactive()

The chart above is interactive: you can zoom in down to a single data point and hover over points to read their text. Some patterns are visible in the clusters: similar questions, or questions about similar topics, are grouped together.

