# Demo: Vector Databases and Semantic Search with Pinecone

The corresponding notebook is available on [Google Colab](https://colab.research.google.com/drive/1J5vyHtWPmdOAopsFW63dvax2jz_uPe7e?usp=sharing).

In [1]:
!pip install pinecone-client


In [3]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.5.1


In [4]:
from pinecone import Pinecone, PodSpec

In [5]:
from sentence_transformers import SentenceTransformer

In [6]:
# download and instantiate the DistilBERT sentence transformer model as follows:
model_name = 'distilbert-base-nli-stsb-mean-tokens'
model = SentenceTransformer(model_name)


To use Pinecone, we need an account and an API key, both of which can be created on the Pinecone [website](https://app.pinecone.io/?sessionType=signup&ref=dailydoseofds.com).

In [7]:
pinecone_key = "<PUT YOUR KEY HERE>"
pc = Pinecone(api_key=pinecone_key)
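Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from an environment variable instead follows; the variable name `PINECONE_API_KEY` and the helper `load_api_key` are assumptions made for this illustration and are not part of the original notebook.

```python
import os


def load_api_key(env=None, var_name="PINECONE_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` and this helper are hypothetical; adapt them to your setup.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage with the client from this notebook (not executed here):
# pc = Pinecone(api_key=load_api_key())
```

Failing loudly when the variable is missing avoids sending an empty key to the service and getting a less obvious authentication error later.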

In Pinecone, vectors are stored in [indexes](https://docs.pinecone.io/docs/indexes?ref=dailydoseofds.com). All vectors in an index must share the same dimensionality and the same distance metric for measuring similarity. We create an index using the `create_index` method. First, we check that we do not have any index yet:

In [8]:
pc.list_indexes()

{'indexes': []}

The code that follows creates the index by calling `create_index` with these arguments:
* `name`: the name of the index.
* `dimension`: the dimensionality of the vectors that will be stored in the index. This must match the dimensionality of the vectors that will be inserted into the index. We specify 768 here because that is the embedding dimension returned by the SentenceTransformer model.
* `metric`: the distance metric used to calculate the similarity between vectors. In this case, we use `euclidean`, i.e. the Euclidean distance.
* `spec`: a `PodSpec` object that specifies the environment in which the index will be created. In this example, the index is created in the Google Cloud Platform environment called `gcp-starter`.

Executing this function creates an index that can be found in the dashboard of the Pinecone website.


In [10]:
pc.create_index(
    name="vector-demo",
    dimension=768,
    metric="euclidean",
    spec=PodSpec(environment='gcp-starter')
    )
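To make the `metric` choice more concrete, here is a small NumPy sketch of three common ways of comparing two vectors, corresponding to the metric names `euclidean`, `cosine` and `dotproduct` that Pinecone offers. The toy vectors are made up for illustration, and the functions are plain NumPy, not Pinecone's implementation.

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])  # same direction as `a`, twice the length


def euclidean_distance(u, v):
    """Straight-line distance: smaller means more similar."""
    return float(np.linalg.norm(u - v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: larger means more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def dot_product(u, v):
    """Unnormalized inner product: larger means more similar."""
    return float(np.dot(u, v))


print(euclidean_distance(a, b))  # 3.0
print(cosine_similarity(a, b))   # 1.0
print(dot_product(a, b))         # 18.0
```

The two vectors point in the same direction, so cosine similarity treats them as identical, while the Euclidean distance still sees them as 3 units apart. This is why the metric has to be fixed when the index is created.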

Now we have an index. Let us create some text, so that we can push its vector embeddings into the index.

In [11]:
data = [
    {"id": "vector1", "text": "I love using vector databases"},
    {"id": "vector2", "text": "Vector databases are great for storing and retrieving vectors"},
    {"id": "vector3", "text": "Using vector databases makes my life easier"},
    {"id": "vector4", "text": "Vector databases are efficient for storing vectors"},
    {"id": "vector5", "text": "I wnjoy working with vector databases"},
    {"id": "vector6", "text": "Vector databases are useful for many applications"},
    {"id": "vector7", "text": "I find vector databases very helpful"},
    {"id": "vector8", "text": "Vector databases can handle large amounts of data"},
    {"id": "vector9", "text": "I think vector databases are the future of data storage"},
    {"id": "vector10", "text": "Using vector databases has improved my workflow"},
]

Now we create embeddings for these sentences. The following code iterates over each sentence in the `data` list defined above, converts it into a vector using DistilBERT, and appends the embedding to a list.

In [12]:
vector_data = []
for sentence in data:
  embedding = model.encode(sentence['text'])
  vector_info = {"id": sentence['id'], "values": embedding.tolist()}
  vector_data.append(vector_info)

As we can see, the text has been converted into a vector. Here we look at the first 10 numbers of the embedding of the first text.

In [15]:
vector_data[0]['values'][:10]

[-1.0006766319274902,
 0.30460259318351746,
 0.6573171019554138,
 0.489531010389328,
 -0.5995281934738159,
 -0.5410853028297424,
 -0.013175887987017632,
 -0.3186182379722595,
 -0.34427568316459656,
 -0.5891623497009277]
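`SentenceTransformer.encode` also accepts a list of sentences and returns one row per sentence, so the loop can be replaced by a single batched call. The sketch below wraps that idea in a helper; `build_vector_data` and the stand-in `fake_encode` are hypothetical names, and the stand-in exists only so that the snippet runs without downloading a model.

```python
import numpy as np


def build_vector_data(records, encode_fn):
    """Encode all texts in one call and pair each embedding with its id."""
    texts = [record["text"] for record in records]
    embeddings = np.asarray(encode_fn(texts))  # shape: (len(records), dim)
    return [
        {"id": record["id"], "values": row.tolist()}
        for record, row in zip(records, embeddings)
    ]


def fake_encode(texts, dim=4):
    """Stand-in encoder: NOT a real embedding model, just deterministic numbers."""
    return np.array([[float(len(t) + i) for i in range(dim)] for t in texts])


sample = [
    {"id": "vector1", "text": "I love using vector databases"},
    {"id": "vector2", "text": "Vector databases are useful"},
]
vectors = build_vector_data(sample, fake_encode)
print(vectors[0]["id"], len(vectors[0]["values"]))  # vector1 4

# With the real model from this notebook (not executed here):
# vector_data = build_vector_data(data, model.encode)
```

Passing the encoder in as an argument keeps the helper independent of any particular model, which also makes it easy to test.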

In practice, a single account can hold multiple indexes. For this reason, we need an `index` object that identifies the index we wish to add these embeddings to. Let us create it:

In [16]:
index = pc.Index("vector-demo")
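Because every vector in an index must have the index's dimension (768 here), a quick check before writing can catch a mismatched model early. A minimal sketch follows; `check_dimensions` is a hypothetical helper, not a Pinecone function.

```python
def check_dimensions(vectors, expected_dim):
    """Return the ids of vectors whose length differs from `expected_dim`."""
    return [v["id"] for v in vectors if len(v["values"]) != expected_dim]


toy_vectors = [
    {"id": "ok", "values": [0.1, 0.2, 0.3]},
    {"id": "too-short", "values": [0.1, 0.2]},
]
print(check_dimensions(toy_vectors, expected_dim=3))  # ['too-short']

# In this notebook, check_dimensions(vector_data, expected_dim=768)
# is expected to return an empty list.
```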

Now that we have these embeddings, we perform the upsert. An **upsert** is a database operation that combines **update** and **insert**: a document (here, a vector) is inserted if its id does not already exist, and the existing one is updated if it does.

Upsert is a common operation in databases, especially NoSQL databases, where it is used to ensure that a document is either inserted or updated depending on whether it already exists in the collection.
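The insert-or-update behaviour can be illustrated with an ordinary dictionary keyed by id. This is only a toy model of the semantics, with a hypothetical `upsert` function; it says nothing about how Pinecone stores vectors internally.

```python
def upsert(store, records):
    """Insert new ids and overwrite existing ones; return how many were written."""
    for record in records:
        store[record["id"]] = record["values"]
    return len(records)


store = {}
upsert(store, [{"id": "vector1", "values": [0.1, 0.2]}])
upsert(store, [
    {"id": "vector1", "values": [0.9, 0.8]},  # existing id: updated
    {"id": "vector2", "values": [0.3, 0.4]},  # new id: inserted
])
print(store)  # {'vector1': [0.9, 0.8], 'vector2': [0.3, 0.4]}
```

Running the same upsert twice leaves the store unchanged, which is what makes the operation safe to repeat.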

In [17]:
index.upsert(vectors=vector_data)

{'upserted_count': 10}

The returned `upserted_count` of 10 confirms that the operation succeeded. We can also inspect the state of the index as follows:

In [18]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0001,
 'namespaces': {'': {'vector_count': 10}},
 'total_vector_count': 10}

* `dimension`: the dimensionality of the vectors stored in the index.
* `index_fullness`: a measure of how full the index is, i.e. the fraction of its capacity that is occupied.
* `namespaces`: a dictionary containing statistics for each namespace in the index. In this case, there is only one namespace (`''`) with a `vector_count` of 10, indicating that there are 10 vectors in it.
* `total_vector_count`: the total number of vectors in the index across all namespaces (10, in this case).
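As a small illustration of how these fields relate, the sketch below reads the statistics from a plain dictionary copied from the output above. Treating the result as a plain `dict` is an assumption made for this example; the object returned by the live client may need to be accessed differently.

```python
stats = {
    "dimension": 768,
    "index_fullness": 0.0001,
    "namespaces": {"": {"vector_count": 10}},
    "total_vector_count": 10,
}

# Map each namespace name to its vector count, labelling the default namespace.
per_namespace = {
    (name or "<default>"): info["vector_count"]
    for name, info in stats["namespaces"].items()
}
print(per_namespace)  # {'<default>': 10}

# The per-namespace counts should add up to the total.
assert sum(per_namespace.values()) == stats["total_vector_count"]
```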

Now we can run a similarity search and look at the results, using the `query()` method of the `index` object we created earlier.


In [23]:
search_text = "Vector databases are really helpful"

# the encoding creates a numpy array. We convert it to a list
search_embeddings = model.encode(search_text)
search_embeddings = search_embeddings.tolist()

Now we have everything to search the database:

In [26]:
index.query(vector=search_embeddings, top_k=3)

{'matches': [{'id': 'vector7', 'score': 20.8271179, 'values': []},
             {'id': 'vector4', 'score': 44.177948, 'values': []},
             {'id': 'vector1', 'score': 51.9566956, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 5}}

This code calls the `query` method on the `index` object, which performs a nearest-neighbor search for the given query vector (`search_embeddings`) and returns the top 3 matches. The response contains:

  * `matches`: A list of dictionaries, one per matching vector. Each dictionary holds the `id` of the matching vector, a `score` measuring how close it is to the query vector, and its `values`, which are empty here because the query did not ask for the stored vectors to be returned. Because we chose `euclidean` as the metric when creating this index, a higher score means a larger distance, i.e. less similarity, so the best match (`vector7`) has the lowest score.
  * `namespace`: The namespace of the index where the query was performed. In this case, the namespace is an empty string (''), indicating the default namespace.
  * `usage`: A dictionary describing the resources consumed by the query. Here, `read_units` reports that the query consumed 5 read units. Read units measure resource consumption, not the number of vectors inspected, so this figure alone does not tell us whether the search scanned all 10 vectors in the index.
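To see why a lower score means a better match under this metric, the following NumPy sketch runs an exhaustive nearest-neighbor search over a few toy vectors and returns results in a shape similar to `matches`. Pinecone's documentation describes the `euclidean` score as a squared Euclidean distance, which is why the sketch squares the differences; treat that, and the exhaustive scan itself, as assumptions for illustration rather than a description of how Pinecone searches internally. `brute_force_query` is a hypothetical helper.

```python
import numpy as np


def brute_force_query(vectors, query, top_k=3):
    """Exhaustive nearest-neighbor search using squared Euclidean distance."""
    ids = [v["id"] for v in vectors]
    matrix = np.array([v["values"] for v in vectors], dtype=float)
    diffs = matrix - np.asarray(query, dtype=float)
    scores = np.sum(diffs ** 2, axis=1)  # one squared distance per stored vector
    order = np.argsort(scores)[:top_k]   # smallest distance first
    return [{"id": ids[i], "score": float(scores[i])} for i in order]


toy_vectors = [
    {"id": "a", "values": [0.0, 0.0]},
    {"id": "b", "values": [1.0, 1.0]},
    {"id": "c", "values": [3.0, 4.0]},
]
print(brute_force_query(toy_vectors, query=[0.0, 0.5], top_k=2))
# [{'id': 'a', 'score': 0.25}, {'id': 'b', 'score': 1.25}]
```

An exhaustive scan like this compares the query against every stored vector, which is fine for 10 vectors but is exactly the cost that vector databases aim to avoid at scale.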