<a href="https://colab.research.google.com/github/Eddiebee/AI-Craft/blob/main/Semantic_Search_with_Cohere_and_HuggingFace_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search with Cohere and HuggingFace Dataset

Text embeddings have many areas of application, but they excel most visibly in search tasks. Because an embedding captures semantic information, the search experience goes beyond traditional keyword matching: it takes the meaning of the search query into account and matches it to the best-fitting documents.

We'll be building on my previous notebook on [Text Embedding using Cohere and HuggingFace Dataset](https://colab.research.google.com/drive/1y69Hy4hM1hGStPq8eeZ7Ey2jQmB7fh8y?usp=sharing).

Leggo! 🚀

# Setup

We'll kick off by installing and importing the required packages.

In [3]:
!pip install cohere datasets -q

In [4]:
import cohere
from datasets import load_dataset
import pandas as pd
import numpy as np

We have to initialize Cohere with an API Key.

If you don't have an account with Cohere yet, please [sign up for one over here](https://dashboard.cohere.com/welcome/register). NO CREDIT CARD REQUIRED! 🤗

I have mine stored inside my Colab secrets. Loading it becomes a breeze. 💨

In [5]:
from google.colab import userdata
COHERE_API_KEY = userdata.get('COHERE_TRIAL_KEY')

In [6]:
# initialize Cohere using your API key
co = cohere.Client(COHERE_API_KEY)

In the next code cell, we'll throw in the code needed to turn the `prompt` column of the [email intent classification dataset](https://huggingface.co/datasets/aadilsayad/email-intent-classification/) into text embeddings.

In [7]:
# load email intent dataset
dataset = load_dataset("aadilsayad/email-intent-classification")

# define get_embeddings function
def get_embeddings(texts,
                   model="embed-english-v3.0",
                   input_type="search_document"):
    # embed a list of texts with Cohere and return one vector per text
    response = co.embed(
        model=model,
        input_type=input_type,
        texts=texts)

    return response.embeddings

# embed dataset
dataset["train"] = dataset["train"].add_column(name="prompt_embeddings",
                   column=get_embeddings(texts=dataset["train"]["prompt"]))

# print out our column names
dataset.column_names

Downloading data:   0%|          | 0.00/59.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

{'train': ['prompt', 'label', 'prompt_embeddings']}

Notice that we set the `input_type` parameter to `search_document` because the texts (documents) we embed will be stored in a vector database.

`How about an article on this`? 🤔

# Applied Text Embeddings - Semantic Search

In this section, we'll apply these text embeddings to the task of semantic search.

## Embed the search prompt

First, we have to embed the search query.

To do this, we'll make use of our `get_embeddings` function.

Leggo! 🚀

In [8]:
# define search prompt
new_prompt = "I would love to send this email to the following contacts."
# new_prompt = "I would love to share this email to the following contacts."


In [9]:
new_prompt_embeddings = get_embeddings([new_prompt],
                                       input_type="search_query")[0]

# we use input_type="search_query" because this text is the query we use
# to find the most relevant documents among the embedded prompts.

## Compare to Embedded Documents

In [10]:
from sklearn.metrics.pairwise import cosine_similarity
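
For intuition, cosine similarity measures the angle between two vectors: the dot product divided by the product of their lengths. The sketch below reproduces that calculation with plain NumPy on toy vectors. The function name `cosine_sim` and the vectors are made up for illustration and are not part of this notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dot product divided by the product of the two vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy vectors, not real embeddings
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors

print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing the same way score close to 1 and perpendicular vectors score close to 0, which is why a higher score is read as "more similar in meaning" when the vectors are embeddings.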

In [11]:
prompt_embeddings = dataset["train"]["prompt_embeddings"]
len(prompt_embeddings)

1000

In [35]:
# number of documents (taken from the start of the dataset) to compare against
SAMPLE = 9

# calculate cosine similarity between the search query and the embedded documents
def get_similarity(target, candidates):
    # turn lists into arrays; target becomes a single-row 2-D array
    candidates = np.array(candidates)
    target = np.expand_dims(np.array(target), axis=0)

    # calculate cosine similarity
    sim = cosine_similarity(target, candidates)
    sim = np.squeeze(sim).tolist()

    # sort document indices from most to least similar
    sort_index = np.argsort(sim)[::-1]
    sort_score = [sim[i] for i in sort_index]
    similarity_scores = zip(sort_index, sort_score)

    # return (index, score) pairs as a zip object
    return similarity_scores

# get the similarity between the search query and the first SAMPLE documents
similarity = get_similarity(new_prompt_embeddings, prompt_embeddings[:SAMPLE])

In [36]:
similarity = list(similarity)

In [37]:
similarity

[(7, 0.669374261590068),
 (1, 0.6447462950277745),
 (2, 0.6209104441785771),
 (5, 0.6050300403484374),
 (0, 0.5967863521460304),
 (3, 0.5457825971152368),
 (4, 0.5264902235388096),
 (6, 0.5123163466083849),
 (8, 0.4838913311261836)]

In [39]:
# view the sampled documents, ranked from most to least similar
print("Prompt: ")
print(new_prompt,"\n")

print("Most Similar Documents:")
for idx, sim in similarity:
  print(f"Similarity: {sim:.2f};", dataset["train"]["prompt"][idx])

Prompt: 
I would love to send this email to the following contacts. 

Most Similar Documents:
Similarity: 0.67; I want to email someone.
Similarity: 0.64; I'd like to compose an email.
Similarity: 0.62; I need to send an email.
Similarity: 0.61; Let's write an email.
Similarity: 0.60; Can I send an email, please?
Similarity: 0.55; Could you help me write an email?
Similarity: 0.53; Is it possible to send an email with you?
Similarity: 0.51; Time to send an email.
Similarity: 0.48; Open email for writing.
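
The ranking above only covers the first `SAMPLE` documents. One possible way to search the whole collection and keep only the best few matches is sketched below. It uses random toy vectors in place of real embeddings so that it runs on its own; `top_k_search` and its parameters are hypothetical names, not something defined earlier in this notebook.

```python
import numpy as np

def top_k_search(query_vec, doc_vecs, k=5):
    """Return (index, score) pairs for the k documents most similar to the query."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)

    # normalize so that a plain dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    scores = docs @ query                # one score per document
    top = np.argsort(scores)[::-1][:k]   # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in top]

# toy data standing in for real embeddings (assumption: 100 docs, 8 dimensions)
rng = np.random.default_rng(0)
toy_docs = rng.normal(size=(100, 8))
toy_query = toy_docs[42] + 0.01 * rng.normal(size=8)  # near-copy of document 42

results = top_k_search(toy_query, toy_docs, k=3)
for idx, score in results:
    print(f"{idx}: {score:.2f}")
```

With the notebook's own variables, the equivalent call would be `top_k_search(new_prompt_embeddings, prompt_embeddings, k=5)`, and the returned indices can be used to look up the matching prompts in `dataset["train"]["prompt"]`.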
