[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/search/semantic-search/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/search/semantic-search/semantic-search.ipynb)

# Semantic Search

[![Open fast notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/fast-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb)

In this walkthrough we will see how to use Pinecone for semantic search. To begin, we install the prerequisite libraries:

In [1]:
!pip install -U \
  pinecone-datasets==0.5.0rc5 \
  datasets==2.12.0 \
  sentence-transformers==2.2.2

Collecting pinecone-datasets==0.5.0rc5
  Downloading pinecone_datasets-0.5.0rc5-py3-none-any.whl (12 kB)
Installing collected packages: pinecone-datasets
  Attempting uninstall: pinecone-datasets
    Found existing installation: pinecone-datasets 0.5.0rc4
    Uninstalling pinecone-datasets-0.5.0rc4:
      Successfully uninstalled pinecone-datasets-0.5.0rc4
Successfully installed pinecone-datasets-0.5.0rc5


---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Preprocessing

The dataset preparation process requires a few steps:

1. We download the Quora dataset from Hugging Face Datasets.

2. The text content of the dataset is embedded into vectors.

3. We create a Pinecone Dataset and save it.

4. We upload the dataset to Pinecone.

We will walk through steps `1` to `4` in this section, but we won't run step `2` across the dataset until we reach the *upsert loop*, where the embeddings are built iteratively, batch by batch.

In either case, this can take some time. If you'd rather skip the data preparation step and get straight to upserts and testing the semantic search functionality, refer to the [**fast notebook**](https://github.com/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb). It uses a premade dataset and is ready to go.

In [2]:
from datasets import load_dataset

dataset = load_dataset('quora', split='train[240000:320000]')
dataset

Found cached dataset quora (/Users/roymiara/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04)


Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 80000
})

The full dataset contains ~400K pairs of natural language questions from Quora. Here we load a slice of 80,000 of those pairs.

In [3]:
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

Whether or not the questions are duplicates is not important here. All we need for this example is the text itself, which we can extract into a single `questions` list.

In [4]:
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])
  
# remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))

Which Finance job requires traveling around the world?
Does anyone use the Xyleme learning content management system?
Which folder keeps the extension files of Google Chrome Portable Version?
What are some good ways to lose weight?
Can Health services research be a STEM major?
136057
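
One caveat with `list(set(questions))`: a Python `set` does not preserve insertion order, and string hashing is randomized between interpreter sessions by default, so the order of `questions` (and with it any positional IDs assigned later) can differ from run to run. If reproducible ordering matters, `dict.fromkeys` is one order-preserving alternative. The sketch below uses a small hypothetical `sample` list rather than the Quora data:

```python
# Hypothetical sample standing in for the extracted Quora questions.
sample = [
    'What is the truth of life?',
    'How do you graph x + 2y = -2?',
    'What is the truth of life?',  # duplicate of the first entry
    'What can I do to lose 30 pounds in 2 months?',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops duplicates while keeping first occurrences in place.
deduped = list(dict.fromkeys(sample))

print('\n'.join(deduped))
print(len(deduped))
```

This is a design choice rather than a requirement: the `set`-based version in the notebook works equally well when the order of the questions does not need to be stable.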


With our questions ready to go we can move on to demoing steps **2** and **3** above.

### Building Embeddings

To create our embeddings we will use the `MiniLM-L6` sentence transformer model. This is a very efficient semantic similarity embedding model from the `sentence-transformers` library. We initialize it like so:

In [5]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

You are using cpu. This is much slower than using a CUDA-enabled GPU. If on Colab you can change this by clicking Runtime > Change runtime type > GPU.


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

There are *three* interesting bits of information in the above model printout. Those are:

* `max_seq_length` is `256`. That means that the maximum number of tokens (like words) that can be encoded into a single vector embedding is `256`. Anything beyond this *must* be truncated.

* `word_embedding_dimension` is `384`. This number is the dimensionality of vectors output by this model. It is important that we know this number later when initializing our Pinecone vector index.

* `Normalize()`. This final normalization step indicates that all vectors produced by the model are normalized to unit length. That means that a model whose similarity we would typically measure with *cosine similarity* can also use the *dot product* similarity metric. In fact, with normalized vectors *cosine* and *dot product* are equivalent.
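
The equivalence in the last bullet can be checked by hand. The sketch below is a plain-Python illustration with two hypothetical 3-dimensional vectors (real embeddings from this model have 384 dimensions). The helper names `dot`, `norm`, `cosine`, and `normalize` are defined here for illustration and are not part of `sentence-transformers`:

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of lengths."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

# Two arbitrary, hypothetical vectors, scaled to unit length.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# For unit-length vectors the denominator of cosine similarity is 1,
# so cosine similarity and dot product agree up to floating-point error.
print(cosine(u, v), dot(u, v))
print(math.isclose(cosine(u, v), dot(u, v)))
```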

Moving on, we can create a sentence embedding using this model like so:

In [6]:
query = 'which city is the most populated in the world?'

xq = model.encode(query)
xq.shape

(384,)

Encoding this single sentence leaves us with a `384` dimensional sentence embedding (aligned to the `word_embedding_dimension` above).

To prepare this for `upsert` to Pinecone, we pair the embedding with a unique ID and a metadata dictionary.

Later, when we upsert our data to Pinecone, we will do so in batches, meaning `vectors` will be a list of `(id, embedding, metadata)` tuples.
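
As a rough sketch of that record format for a single item: the ID value `'0'` and the `'text'` metadata key are assumptions made for illustration, and `xq` here is a placeholder list standing in for the real output of `model.encode(query)` so that the snippet runs without the model.

```python
query = 'which city is the most populated in the world?'

# Placeholder for the 384-dimensional embedding returned by model.encode(query).
# These zeros are NOT a real embedding; they only reproduce its length.
xq = [0.0] * 384

_id = '0'                      # assumed ID, any unique string would do
metadata = {'text': query}     # keep the original text alongside the vector

# A list with one (id, embedding, metadata) tuple; a batch would hold many.
vectors = [(_id, xq, metadata)]

print(len(vectors), len(vectors[0][1]), vectors[0][2])
```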

Now we prepare the data for upsert by building the IDs, embeddings, and metadata in batches of `128`.

_**Note:** On Google Colab with GPU expected runtime is ~7 minutes. If using CPU this will be significantly longer. If you'd like to get this running faster refer to the [fast notebook](https://github.com/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb)._

In [7]:
# first create list for ids, embeddings and metadata
ids = []
embeddings = []
metadata = []

batch_size = 128


In [8]:
from tqdm.auto import tqdm

# note: we only embed the first 1024 questions here
num_questions = min(1024, len(questions))

for i in tqdm(range(0, num_questions, batch_size)):
    # find end of batch
    i_end = min(i+batch_size, num_questions)
    # create IDs batch
    _ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    _metadata = [{'text': text} for text in questions[i:i_end]]
    # create embeddings
    _embeddings = model.encode(questions[i:i_end])
    # add the batch to the full lists
    ids.extend(_ids)
    embeddings.extend(_embeddings)
    metadata.extend(_metadata)

100%|██████████| 8/8 [00:02<00:00,  3.77it/s]
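
The progress bar shows 8 iterations because 1024 questions split into batches of 128 gives exactly 8 batches. The index arithmetic of the loop can be sketched on its own, without the model. `batch_bounds` is a hypothetical helper written for this illustration:

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        # the final batch may be shorter than batch_size
        yield start, min(start + batch_size, total)

bounds = list(batch_bounds(1024, 128))
print(len(bounds))            # 8
print(bounds[0], bounds[-1])  # (0, 128) (896, 1024)

# A total that is not a multiple of the batch size ends with a short batch.
print(list(batch_bounds(300, 128)))  # [(0, 128), (128, 256), (256, 300)]
```

Clamping the end index with `min` is what keeps the last slice from running past the subset being embedded.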


In [9]:
# creating a dataframe
import pandas as pd

df = pd.DataFrame({'id': ids, 'values': embeddings, 'metadata': metadata})
df

| | id | values | metadata |
|---|---|---|---|
| 0 | 0 | [0.06995727, -0.041538, -0.07480858, 0.0554511... | {'text': 'Which Finance job requires traveling... |
| 1 | 1 | [0.020836761, -0.055749647, -0.06341955, 0.042... | {'text': 'Does anyone use the Xyleme learning ... |
| 2 | 2 | [-0.048633844, -0.011845093, 0.005464233, 0.00... | {'text': 'Which folder keeps the extension fil... |
| 3 | 3 | [-0.07174875, 0.07797023, 0.056134596, 0.12874... | {'text': 'What are some good ways to lose weig... |
| 4 | 4 | [-0.01627545, 0.070892945, -0.0010130059, -0.0... | {'text': 'Can Health services research be a ST... |
| ... | ... | ... | ... |
| 1019 | 1019 | [0.063618615, 0.016510956, 0.04254935, 0.01624... | {'text': 'Should I prepare & crack CAT by leav... |
| 1020 | 1020 | [0.029550772, 0.047756128, -0.028292943, -0.03... | {'text': 'Why are Europe and Asia separate con... |
| 1021 | 1021 | [0.0754963, -0.10504479, 0.0089978445, 0.05468... | {'text': 'Who is the most beautiful and glamor... |
| 1022 | 1022 | [-0.009103978, -0.07920409, 0.065698035, -0.04... | {'text': 'What are the major factors that moti... |


In [29]:
# Creating a Pinecone dataset
from pinecone_datasets import Dataset as PineconeDataset, DatasetMetadata

In [30]:
dataset_metadata = DatasetMetadata(
    **{
        'name': 'quora_all-MiniLM-L6-bm25',
        'created_at': '2023-02-17 14:17:01.481785',
        'documents': 522931,
        'queries': 0,
        'source': 'https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs',
        'license': None,
        'bucket': 'gs://pinecone-datasets-dev',
        'task': 'similar questions',
        'dense_model': {
            'name': 'sentence-transformers/all-MiniLM-L6-v2',
            'tokenizer': None,
            'dimension': 384
        },
        'sparse_model': None,
        'description': None,
        'tags': None,
        'args': None
    })

In [44]:
pds = PineconeDataset.from_pandas(df, metadata=dataset_metadata)

In [45]:
pds.head()

| | id | values | sparse_values | metadata | blob |
|---|---|---|---|---|---|
| 0 | 0 | [0.06995727, -0.041538, -0.07480858, 0.0554511... | | {'text': 'Which Finance job requires traveling... | |
| 1 | 1 | [0.020836761, -0.055749647, -0.06341955, 0.042... | | {'text': 'Does anyone use the Xyleme learning ... | |
| 2 | 2 | [-0.048633844, -0.011845093, 0.005464233, 0.00... | | {'text': 'Which folder keeps the extension fil... | |
| 3 | 3 | [-0.07174875, 0.07797023, 0.056134596, 0.12874... | | {'text': 'What are some good ways to lose weig... | |
| 4 | 4 | [-0.01627545, 0.070892945, -0.0010130059, -0.0... | | {'text': 'Can Health services research be a ST... | |


In [33]:
# saving dataset for later
pds.to_path('./tmp/quora_all-MiniLM-L6-bm25')



In [46]:
# upserting dataset to Pinecone
import os

# replace the placeholder with your own API key; never commit a real key
os.environ["PINECONE_API_KEY"] = "YOUR_API_KEY"
os.environ["PINECONE_ENVIRONMENT"] = "us-west1-gcp"

pds.to_index("semantic-search", batch_size=300, concurrency=16)
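
Assuming, as the cell above suggests, that `to_index` picks up its credentials from the `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT` environment variables, it can help to check that both are set before starting a long upsert. The following is a minimal sketch using only the standard library. `missing_env_vars` is a hypothetical helper, not part of `pinecone-datasets`:

```python
import os

def missing_env_vars(names):
    """Return the names that are unset or empty in the environment."""
    return [name for name in names if not os.environ.get(name)]

required = ["PINECONE_API_KEY", "PINECONE_ENVIRONMENT"]
missing = missing_env_vars(required)

if missing:
    print("Set these environment variables before upserting:", ", ".join(missing))
else:
    print("Pinecone credentials found in the environment.")
```

Setting the variables outside the notebook (for example in the shell that launches Jupyter) keeps the key out of the saved notebook file.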



---