[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/search/semantic-search/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/search/semantic-search/semantic-search.ipynb)

# Semantic Search

[![Open fast notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/fast-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb)

In this walkthrough we will see how to use Pinecone for semantic search. To begin we must install the required prerequisite libraries:

In [1]:
!pip install -U \
  pinecone-datasets==0.5.0rc2 \
  datasets==2.12.0 \
  sentence-transformers==2.2.2

Collecting pinecone-datasets==0.5.0rc2
  Using cached pinecone_datasets-0.5.0rc2-py3-none-any.whl (12 kB)
Collecting datasets==2.12.0
  Using cached datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting sentence-transformers==2.2.2
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting fsspec<2024.0.0,>=2023.1.0 (from pinecone-datasets==0.5.0rc2)
  Using cached fsspec-2023.6.0-py3-none-any.whl (163 kB)
Collecting gcsfs<2024.0.0,>=2023.1.0 (from pinecone-datasets==0.5.0rc2)
  Using cached gcsfs-2023.5.0-py2.py3-none-any.whl (26 kB)
Collecting pandas<3.0.0,>=2.0.0 (from pinecone-datasets==0.5.0rc2)
  Downloading pandas-2.0.2-cp39-cp39-macosx_11_0_arm64.whl (10.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.9/10.9 MB 10.2 MB/s eta 0:00:00
Collecting pinecone-client==3.0.0rc2 (from pinecone-datasets==0.5.0rc2)
  Using cached pinecone_client-3.0.0rc2-cp39-cp39-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Preprocessing

The dataset preparation process requires a few steps:

1. We download the Quora dataset from Hugging Face Datasets.

2. The text content of the dataset is embedded into vectors.

3. We create a Pinecone Dataset and save it.

4. We upload the dataset to Pinecone.

We will see how steps `1 - 4` are done in this section, but we won't run step `2` over more than a single query until we reach the batched embedding loop, where the questions are embedded iteratively.

In either case, this can take some time. If you'd rather skip the data preparation step and get straight to upserts and testing the semantic search functionality, you should refer to the [**fast notebook**](https://github.com/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb). It uses a premade dataset and is ready to go.

In [2]:
from datasets import load_dataset

dataset = load_dataset('quora', split='train[240000:320000]')
dataset

  from .autonotebook import tqdm as notebook_tqdm
Downloading builder script: 100%|██████████| 2.38k/2.38k [00:00<00:00, 9.31MB/s]
Downloading metadata: 100%|██████████| 1.13k/1.13k [00:00<00:00, 9.25MB/s]
Downloading readme: 100%|██████████| 5.69k/5.69k [00:00<00:00, 27.7MB/s]


Downloading and preparing dataset quora/default to /Users/roymiara/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04...


Downloading data: 100%|██████████| 58.2M/58.2M [00:05<00:00, 10.9MB/s]
Downloading data files: 100%|██████████| 1/1 [00:05<00:00,  5.88s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 983.19it/s]
                                                                                         

Dataset quora downloaded and prepared to /Users/roymiara/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04. Subsequent calls will reuse this data.




Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 80000
})

The full Quora dataset contains ~400K pairs of natural language questions; the split we loaded above holds 80,000 of those pairs.

In [4]:
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

Whether or not the questions are duplicates is not important here; all we need for this example is the text itself. We can extract the questions into a single `questions` list.

In [5]:
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])
  
# remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))

If Allah is merciful then why would He burn sinned humans with fire for eternity in hell? Cant' He show mercy?
What percentage of transgender women are sexually attracted to women as opposed to men, or both men and women?
What is the advantage of polarized sunglasses?
What are good gift ideas for a dad in his late 40s?
Who are the people still choosing to vote for Donald Trump and why do they want to vote for him (other than because he's the Republican nominee)?
136057


With our questions ready to go we can move on to demoing steps **2** and **3** above.

### Building Embeddings

To create our embeddings we will use the `MiniLM-L6` sentence transformer model. This is a very efficient semantic similarity embedding model from the `sentence-transformers` library. We initialize it like so:

In [6]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

You are using cpu. This is much slower than using a CUDA-enabled GPU. If on Colab you can change this by clicking Runtime > Change runtime type > GPU.


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

There are *three* interesting bits of information in the above model printout. Those are:

* `max_seq_length` is `256`. That means that the maximum number of tokens (like words) that can be encoded into a single vector embedding is `256`. Anything beyond this *must* be truncated.

* `word_embedding_dimension` is `384`. This number is the dimensionality of vectors output by this model. It is important that we know this number later when initializing our Pinecone vector index.

* `Normalize()`. This final normalization step indicates that all vectors produced by the model are normalized to unit length. That means that vectors we would typically compare using *cosine similarity* can also be compared using the *dotproduct* similarity metric. In fact, with normalized vectors *cosine* and *dotproduct* are equivalent.
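
The equivalence of *cosine* and *dotproduct* for normalized vectors can be checked numerically. The sketch below uses random NumPy vectors as stand-ins for model outputs (an assumption, so that it runs without downloading the model); the names `a`, `b`, and `cosine` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random 384-dimensional vectors standing in for sentence embeddings
a = rng.normal(size=384)
b = rng.normal(size=384)

# Scale each vector to unit length, mirroring what a normalization layer does
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

dot = float(np.dot(a, b))
cos = cosine(a, b)

# For unit-length vectors both norms are 1, so the two scores coincide
print(np.isclose(dot, cos))
```

Because the denominators in the cosine formula are both 1 for unit-length vectors, the division changes nothing and the dot product alone gives the same ranking.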

Moving on, we can create a sentence embedding using this model like so:

In [16]:
query = 'which city is the most populated in the world?'

xq = model.encode(query)
xq.shape

(384,)

Encoding this single sentence leaves us with a `384` dimensional sentence embedding (aligned to the `word_embedding_dimension` above).

To prepare this for `upsert` to Pinecone, we pair the embedding with an ID and a metadata dictionary that holds the original text.

Later, when we upsert our data to Pinecone, we will do so in batches, meaning `vectors` will be a list of `(id, embedding, metadata)` tuples.
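
A minimal sketch of that record format follows. It assumes a zero vector as a placeholder for the real embedding so that it runs without the model, and the helper name `make_record` is hypothetical rather than part of any library.

```python
import numpy as np

def make_record(_id, embedding, text):
    """Bundle an ID, an embedding, and its source text into one tuple."""
    return (str(_id), embedding, {'text': text})

query = 'which city is the most populated in the world?'
# Placeholder for model.encode(query); the real vector has 384 dimensions
xq = np.zeros(384, dtype=np.float32)

# A batch is simply a list of such tuples
vectors = [make_record(0, xq, query)]
print(vectors[0][0], vectors[0][1].shape, vectors[0][2])
```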

Now we embed the data in batches of `128`, collecting the IDs, embeddings, and metadata that we will upsert afterwards.

_**Note:** On Google Colab with GPU expected runtime is ~7 minutes. If using CPU this will be significantly longer. If you'd like to get this running faster refer to the [fast notebook](https://github.com/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb)._

In [38]:
# first create lists for ids, embeddings and metadata
ids = []
embeddings = []
metadata = []

batch_size = 128


In [39]:
from tqdm.auto import tqdm

# note: we only embed the first 1024 questions here
subset = questions[:1024]

for i in tqdm(range(0, len(subset), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(subset))
    # create IDs batch
    _ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    _metadata = [{'text': text} for text in subset[i:i_end]]
    # create embeddings
    _embeddings = model.encode(subset[i:i_end])
    # collect this batch for the dataframe built below
    ids.extend(_ids)
    embeddings.extend(_embeddings)
    metadata.extend(_metadata)

100%|██████████| 8/8 [00:02<00:00,  3.08it/s]
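
The eight iterations in the progress bar follow from 1,024 questions split into batches of 128. A small sketch of the same index arithmetic, using a hypothetical helper `batch_bounds`, also shows how a final partial batch would be handled:

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batches."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

bounds = list(batch_bounds(1024, 128))
print(len(bounds))   # 8
print(bounds[-1])    # (896, 1024)

# When the total is not a multiple of the batch size,
# the last batch is simply shorter
print(list(batch_bounds(300, 128)))  # [(0, 128), (128, 256), (256, 300)]
```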


In [52]:
# creating a dataframe
import pandas as pd

df = pd.DataFrame({'id': ids, 'values': embeddings, 'metadata': metadata})
df

| | id | values | metadata |
|---|---|---|---|
| 0 | 0 | [0.079809085, 0.13530786, -0.024871217, 0.0126... | {'text': 'If Allah is merciful then why would ... |
| 1 | 1 | [0.08731179, -0.043924116, -0.07815887, 0.0399... | {'text': 'What percentage of transgender women... |
| 2 | 2 | [-0.044097595, 0.012648403, 0.007437395, 0.013... | {'text': 'What is the advantage of polarized s... |
| 3 | 3 | [0.019823564, 0.062449012, 0.015589851, -0.011... | {'text': 'What are good gift ideas for a dad i... |
| 4 | 4 | [0.056418877, -0.089993075, 0.06808353, -0.019... | {'text': 'Who are the people still choosing to... |
| ... | ... | ... | ... |
| 1019 | 1019 | [-0.033038545, 0.08281174, -0.055870146, 0.061... | {'text': 'How was hemoglobin discovered? Who d... |
| 1020 | 1020 | [0.018484745, 0.062107757, 0.034533918, 0.0269... | {'text': 'Why did Steve Jobs drop out of colle... |
| 1021 | 1021 | [0.005033609, -0.084230006, -0.013950559, 0.01... | {'text': 'My wife and I fight a lot and I need... |
| 1022 | 1022 | [0.054642506, -0.06619325, -0.05927356, -0.035... | {'text': 'I'm worried about my relationship. S... |
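
Before handing a DataFrame like this to `pinecone_datasets`, it can help to confirm that every row has the expected shape. The check below is a sketch on a tiny stand-in frame: the function name `check_frame` is hypothetical, the zero vectors are placeholders for real embeddings, and the column names simply mirror the ones used above.

```python
import pandas as pd

def check_frame(frame, dimension):
    """Return True if the frame has id/values/metadata columns,
    unique IDs, and vectors of the given dimension."""
    if not {'id', 'values', 'metadata'}.issubset(frame.columns):
        return False
    unique_ids = frame['id'].is_unique
    right_dim = all(len(v) == dimension for v in frame['values'])
    return bool(unique_ids and right_dim)

# Stand-in data: three zero vectors instead of real embeddings
toy = pd.DataFrame({
    'id': ['0', '1', '2'],
    'values': [[0.0] * 384 for _ in range(3)],
    'metadata': [{'text': f'question {i}'} for i in range(3)],
})
print(check_frame(toy, 384))
```

A dimension mismatch is worth catching early because the index is created with a fixed vector dimension, as noted in the model printout discussion.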


In [53]:
# Creating a Pinecone dataset
from pinecone_datasets import Dataset as PineconeDataset, DatasetMetadata

In [60]:
dataset_metadata = DatasetMetadata(
    **{
        'name': 'quora_all-MiniLM-L6-bm25',
        'created_at': '2023-02-17 14:17:01.481785',
        'documents': 522931,
        'queries': 0,
        'source': 'https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs',
        'license': None,
        'bucket': 'gs://pinecone-datasets-dev',
        'task': 'similar questions',
        'dense_model': {
            'name': 'sentence-transformers/all-MiniLM-L6-v2',
            'tokenizer': None,
            'dimension': 384
        },
        'sparse_model': None,
        'description': None,
        'tags': None,
        'args': None
    })

In [55]:
pds = PineconeDataset.from_pandas(df, metadata=dataset_metadata)

In [56]:
pds.head()

| | id | values | sparse_values | metadata | blob |
|---|---|---|---|---|---|
| 0 | 0 | [0.079809085, 0.13530786, -0.024871217, 0.0126... | | {'text': 'If Allah is merciful then why would ... | |
| 1 | 1 | [0.08731179, -0.043924116, -0.07815887, 0.0399... | | {'text': 'What percentage of transgender women... | |
| 2 | 2 | [-0.044097595, 0.012648403, 0.007437395, 0.013... | | {'text': 'What is the advantage of polarized s... | |
| 3 | 3 | [0.019823564, 0.062449012, 0.015589851, -0.011... | | {'text': 'What are good gift ideas for a dad i... | |
| 4 | 4 | [0.056418877, -0.089993075, 0.06808353, -0.019... | | {'text': 'Who are the people still choosing to... | |


In [57]:
# saving dataset for later
pds.to_path('./tmp/quora_all-MiniLM-L6-bm25')



In [51]:
# upserting dataset to Pinecone
import os
# replace both placeholders with your own credentials before running
os.environ["PINECONE_API_KEY"] = "YOUR_API_KEY"
os.environ["PINECONE_ENVIRONMENT"] = "YOUR_ENVIRONMENT"

pds.to_index(
    "semantic-search",
    batch_size=300,
    concurrency=16,
    create_index=True,
    metadata_config={"indexed": []}
)

ConnectionError: Failed to connect to Pinecone's controller on region YOUR_ENVIRONMENT. Please verify client configuration: API key, region and project_id. See more info: https://docs.pinecone.io/docs/quickstart#2-get-and-verify-your-pinecone-api-key
Underlying Error: error sending request for url (https://controller.your_environment.pinecone.io/actions/whoami): error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known

---