<a href="https://colab.research.google.com/github/HAL22/SemanticSearchTutorial/blob/main/SemanticSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Diagram
 * Indexing
   text -> Cohere model -> vectors -> Pinecone vector DB
 * Querying
   text -> Cohere model -> vector -> Pinecone -> top_k matches -> metadata
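
The two flows above can be sketched without any external service. This is a toy illustration only: `toy_embed` is a hypothetical stand-in for the Cohere model (it hashes words into a small count vector), `DIM` is an arbitrary size, and a plain Python list stands in for the Pinecone index. None of it reflects how the real services work internally.

```python
import math
import zlib

DIM = 64  # toy dimensionality; an assumption made for this sketch


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a hashed bag of words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Indexing: text -> model -> vectors -> vector store
docs = ["where do hyenas live", "what is the highest peak in africa"]
store = [(str(i), toy_embed(d), {"text": d}) for i, d in enumerate(docs)]


def search(query, top_k=1):
    """Querying: text -> model -> vector -> store -> top_k matches with metadata."""
    q = toy_embed(query)
    scored = [(cosine(q, vec), meta) for _id, vec, meta in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


for score, meta in search("where do hyenas live", top_k=2):
    print(f"{score:.2f}: {meta['text']}")
```

In the notebook itself, Cohere's embedding endpoint plays the role of `toy_embed`, and Pinecone handles both storage and the similarity ranking.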

In [None]:
!pip install cohere pinecone-client datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cohere
  Downloading cohere-3.1.9.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting pinecone-client
  Downloading pinecone_client-2.1.0-py3-none-any.whl (170 kB)
Collecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)
Collecting urllib3~=1.26
  Downloading urllib3-1.26.14-py2.py3-none-any.whl (140 kB)
Collecting loguru>=0.5.0
  Downloading loguru-0.6.0-py3-none-any.whl (58 kB)

In [None]:
# Replace the placeholders with your own API keys; never commit real keys to a notebook
COHERE_KEY = '<YOUR_COHERE_API_KEY>'
PINECONE_KEY = '<YOUR_PINECONE_API_KEY>'

In [None]:
import cohere
co = cohere.Client(COHERE_KEY)

In [None]:
from datasets import load_dataset

In [None]:
# Retrieve the first 1,000 rows of the TREC training split
trec = load_dataset('trec', split='train[:1000]')

Downloading and preparing dataset trec/default to /root/.cache/huggingface/datasets/trec/default/2.0.0/f2469cab1b5fceec7249fda55360dfdbd92a7a5b545e91ea0f78ad108ffac1c2...


Dataset trec downloaded and prepared to /root/.cache/huggingface/datasets/trec/default/2.0.0/f2469cab1b5fceec7249fda55360dfdbd92a7a5b545e91ea0f78ad108ffac1c2. Subsequent calls will reuse this data.


In [None]:
trec

Dataset({
    features: ['text', 'coarse_label', 'fine_label'],
    num_rows: 1000
})

In [None]:
trec[1]

{'text': 'What films featured the character Popeye Doyle ?',
 'coarse_label': 1,
 'fine_label': 5}

In [None]:
embeds = co.embed(
    texts=trec['text'],
    model='small',
    truncate='LEFT'
).embeddings

In [10]:
# Get the shape of the embeddings: (number of texts, embedding dimension)
import numpy as np
shape = np.array(embeds).shape
shape

(1000, 1024)

In [12]:
import pinecone

pinecone.init(PINECONE_KEY, environment='us-west1-gcp')

index_name = 'cohere-pinecone-trec'

# create the index if it does not already exist
if index_name not in pinecone.list_indexes():
  pinecone.create_index(
      index_name,
      dimension=shape[1],  # must match the embedding dimension
      metric='cosine'  # rank matches by cosine similarity
  )

In [13]:
# connect to the index
index = pinecone.Index(index_name)

Now we can populate the index with our embeddings. Pinecone expects a list of tuples in the format *(id, vector, metadata)*, where *metadata* is an optional dictionary in which we can store extra information about each vector. For this example, we store the original text that each embedding was computed from.

While uploading, we send the data in batches to avoid pushing too much in a single request.
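
The batching idea can be isolated into a small helper. This is a sketch only: `batched` is a hypothetical name rather than a Pinecone function, and the ids, vectors and metadata below are dummy values that only mimic the *(id, vector, metadata)* shape.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Dummy (id, vector, metadata) tuples; real vectors would come from the embedding model
records = [(str(i), [0.0, 0.0, 0.0], {'text': f'question {i}'}) for i in range(300)]

# 300 records in batches of 128 give two full batches and one partial batch
batch_sizes = [len(batch) for batch in batched(records, 128)]
print(batch_sizes)
```

The final batch is simply shorter than the others, which is the same effect the `min(i+batch_size, shape[0])` bound has in the upsert loop that follows.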

In [14]:
batch_size = 128

ids = [str(i) for i in range(shape[0])]
# create list of metadata dictionaries
meta = [{'text': text} for text in trec['text']]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, embeds, meta))

for i in range(0, shape[0], batch_size):
    i_end = min(i+batch_size, shape[0])
    index.upsert(vectors=to_upsert[i:i_end])

# let's view the index statistics
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000}

Semantic Search

In [15]:
def get_query_embedding(query):
  # embed the query with the same model and settings used for the indexed texts
  return co.embed(
      texts=[query],
      model='small',
      truncate='LEFT'
  ).embeddings

In [16]:
def query_pinecone(query_embedding):
  return index.query(query_embedding, top_k=10, include_metadata=True)

In [17]:
def print_top_ten_query_results(query):
  query_embeddings = get_query_embedding(query)
  res = query_pinecone(query_embeddings)
  for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

In [18]:
print_top_ten_query_results("What was the caise pf the major depression")

0.74: When was `` the Great Depression '' ?
0.67: Why did the world enter a global depression in 1929 ?
0.47: What crop failure caused the Irish Famine ?
0.34: What war did the Wanna-Go-Home Riots occur after ?
0.33: When did the Dow first reach ?
0.31: What is considered the costliest disaster the insurance industry has ever faced ?
0.30: What are some of the significant historical events of the 1990s ?
0.29: What was the education system in the 1960 's ?
0.28: What were popular songs and types of songs in the 1920s ?
0.27: What events happened January 26 , 1978 ?


In [19]:
print_top_ten_query_results("Africa has a lot animals")

0.50: What is the smallest country in Africa ?
0.50: What animal has killed the most people ?
0.49: Where do hyenas live ?
0.49: What animal has the biggest eyes ?
0.48: What country has the largest sheep population ?
0.47: What is the highest peak in Africa ?
0.47: What mammal of North America is the world 's longest-lived for its size ?
0.46: What kind of animal is Babar ?
0.46: What predators exist on Antarctica ?
0.43: What is the largest snake in the world ?


In [20]:
print_top_ten_query_results("Why was there a long-term economic downturn in the early 20th century?")

0.71: Why did the world enter a global depression in 1929 ?
0.62: When was `` the Great Depression '' ?
0.40: What crop failure caused the Irish Famine ?
0.38: What are some of the significant historical events of the 1990s ?
0.38: When did the Dow first reach ?
0.35: What were popular songs and types of songs in the 1920s ?
0.33: What was the education system in the 1960 's ?
0.32: Give a reason for American Indians oftentimes dropping out of school .
0.31: What war did the Wanna-Go-Home Riots occur after ?
0.30: What historical event happened in Dogtown in 1899 ?
