## Build a simple semantic search app using Pinecone

I will take a Quora dataset, turn its questions into vector embeddings, and store them in Pinecone.
Then I will build a simple question-answering system and ask it several questions.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone, ServerlessSpec
from DLAIUtils import Utils

import os
import time
import torch

In [3]:
# extensible progress bar for Python and CLI
from tqdm.auto import tqdm

In [4]:
# subset of data between rows 240000 and 290000
dataset = load_dataset('quora', split='train[240000:290000]')

In [5]:
# Look at five rows of the dataset
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

In [6]:
questions = []

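for record in dataset['questions']:
    questions.extend(record['text'])
# Deduplicate into `question` (singular), which the upsert cell slices later;
# `questions` still holds every entry, duplicates included, and is what gets printed below
question = list(set(questions))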
print('\n'.join(questions[:10]))
print('-' * 50)
print(f'Number of questions: {len(questions)}')

What is the truth of life?
What's the evil truth of life?
Which is the best smartphone under 20K in India?
Which is the best smartphone with in 20k in India?
Steps taken by Canadian government to improve literacy rate?
Can I send homemade herbal hair oil from India to US via postal or private courier services?
What is a good way to lose 30 pounds in 2 months?
What can I do to lose 30 pounds in 2 months?
Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?
How do you graph x + 2y = -2?
--------------------------------------------------
Number of questions: 100000
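
Note that `set()` does not preserve order, so the deduplicated list can come back in a different order on each run. If a stable order matters (for example, so that a given vector ID maps to the same question across runs), one option is to deduplicate with `dict.fromkeys`. This is a minimal sketch; the helper name `dedupe_keep_order` is an assumption and not part of the notebook.

```python
def dedupe_keep_order(items):
    """Return items with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


sample = [
    'What is the truth of life?',
    "What's the evil truth of life?",
    'What is the truth of life?',
]
unique_sample = dedupe_keep_order(sample)
print(unique_sample)
```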


## Check CUDA and set up the model

Note: "Checking CUDA" means checking whether you have access to a GPU (faster compute). This course uses CPUs, so some code cells may take a little longer to run.

We are using the all-MiniLM-L6-v2 sentence-transformers model, which maps sentences to a 384-dimensional dense vector space.

In [7]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print('Sorry no cuda.')
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

Sorry no cuda.


In [8]:
# Create sample query/question and turn it into an embedding 
query = 'which city is the most populated in the world?'
xq = model.encode(query)
xq.shape

(384,)
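
Embeddings like `xq` are typically compared by cosine similarity, which is also the metric this notebook configures for the Pinecone index. As a rough illustration of what that score measures, here is a pure-Python sketch on toy 3-dimensional vectors rather than real 384-dimensional embeddings. The function name `cosine_similarity` is illustrative and is not taken from the notebook or from Pinecone.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1; perpendicular ones score 0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Because the score depends only on direction and not on vector length, two questions phrased at different lengths can still score highly if they mean similar things.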

In [9]:
# Instantiate utils class from DLAIUtils
utils = Utils()
# Setup Pinecone by getting the api key from the utils class
PINECONE_API_KEY = utils.get_pinecone_api_key()

In [10]:
# Connect to Pinecone serverless version and create an index
pinecone = Pinecone(api_key=PINECONE_API_KEY)
INDEX_NAME = utils.create_dlai_index_name('dl-ai')

# Check if the index exists and delete it if it does
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)
print(INDEX_NAME)
# Create the index
pinecone.create_index(
    name=INDEX_NAME,
    # Dimension of the embeddings
    dimension=model.get_sentence_embedding_dimension(),
    # Similarity metric used to compare embeddings
    metric='cosine',
    # Serverless spec: which public cloud and region to use
    spec=ServerlessSpec(cloud='aws', region='us-west-2'),
)
# Pointer to the index
index = pinecone.Index(INDEX_NAME)
print(index)

dl-ai-wzhfz2rot3blbkfjckgchvjahqrbvqjj4ygp
<pinecone.data.index.Index object at 0x14541cd90>


In [11]:
# Upload data, create embeddings and upsert to Pinecone
batch_size = 200
vector_limit = 10000

# `question` is the deduplicated list built in cell 6
questions = question[:vector_limit]

# Iterate over the questions in batches
for i in tqdm(range(0, len(questions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(questions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in questions[i:i_end]]
    # create embeddings
    xc = model.encode(questions[i:i_end])
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

  0%|          | 0/50 [00:00<?, ?it/s]
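
The loop above walks the list in steps of `batch_size`, with `min` guarding the final, possibly shorter batch. The same index arithmetic can be sketched without the model or Pinecone; the helper `batch_bounds` is hypothetical and exists only for this illustration.

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batches."""
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    for start in range(0, total, batch_size):
        # min() keeps the last batch from running past the end of the list
        yield start, min(start + batch_size, total)


bounds = list(batch_bounds(10000, 200))
print(len(bounds), bounds[0], bounds[-1])

# A total that is not a multiple of the batch size ends with a short batch
ragged = list(batch_bounds(450, 200))
print(ragged)
```

With 10,000 vectors and a batch size of 200 this gives 50 batches, which matches the `0/50` total shown in the progress bar.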

In [12]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 10000}},
 'total_vector_count': 10000}

## Run your query

In [13]:
# small helper function so we can repeat queries later
def run_query(query):
  # Create a vector embedding for the query
  embedding = model.encode(query).tolist()
  # Pass the embedding to the Pinecone index and get the top 10 results and ask for metadata
  results = index.query(top_k=10, vector=embedding, include_metadata=True, include_values=False)
  # Iterate over the results and print the score and the metadata
  for result in results['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

In [14]:
# Run the query and get back similar questions
run_query('which city has the highest population in the world?')

0.67: Which is the most urbanised city in India?
0.64: Which city has the most museums per capita?
0.62: How many cities are on Earth?
0.61: Which city deserves to be the Capital of the World?
0.59: Where is the highest place on Earth?
0.58: What is the largest race of people on Earth?
0.58: What country has the most beautiful people?
0.55: Where will the biggest increases in population come from the next 20 years?
0.55: Which country is the largest democracy in the world?
0.55: Where are the largest slums in the world?


In [15]:
query = 'how do i make chocolate cake?'
run_query(query)

0.87: How do I make cake?
0.53: How do you make a perfume out of Skittles?
0.52: How do you make shepherd's pie?
0.46: Where can I find affordable cake shops on the Gold Coast?
0.44: Is red velvet cake just cake batter dyed red?
0.44: What are some favorite recipes?
0.44: How does one eat ice cream?
0.42: Where can I buy very incredible and most amazing cupcakes in Gold Coast?
0.41: What are the different ways to make clothes?
0.39: What is the best way to make poppy seed tea?
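
Lower-ranked matches can be only loosely related to the query, as the scores above suggest. One option is to drop matches below a cutoff before showing them. The sketch below assumes results shaped like the `matches` list that `run_query` iterates over; `filter_matches` and the 0.5 threshold are illustrative choices, not part of the Pinecone API.

```python
def filter_matches(matches, min_score=0.5):
    """Keep only matches whose similarity score is at least min_score."""
    return [m for m in matches if m['score'] >= min_score]


# Hand-copied from the chocolate cake output above (scores already rounded)
sample_matches = [
    {'score': 0.87, 'metadata': {'text': 'How do I make cake?'}},
    {'score': 0.53, 'metadata': {'text': 'How do you make a perfume out of Skittles?'}},
    {'score': 0.46, 'metadata': {'text': 'Where can I find affordable cake shops on the Gold Coast?'}},
]

kept = filter_matches(sample_matches)
for m in kept:
    print(f"{m['score']}: {m['metadata']['text']}")
```

A fixed cutoff is a blunt tool: a suitable value depends on the embedding model and the data, so it is worth checking against a few known queries before relying on it.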
