# Introduction to Vector Databases, Embeddings, and Pinecone
**OPL Stack: OpenAI, Pinecone, and LangChain**

**Embeddings** are the core of building LLM applications. 

Text embeddings are numeric representations of text, used in NLP (natural language processing) and ML (machine learning) tasks. Text embeddings can be used to measure the relatedness and similarity between two pieces of text. Relatedness measures how closely two pieces of text are related in meaning.

The distance between two embeddings (vectors) measures their relatedness, which translates to the relatedness between the text concepts they represent. Text concepts are words and phrases. Similar embeddings represent similar concepts.

**There are two common approaches to measuring relatedness and similarity between text embeddings:**

**1) Cosine similarity**

**2) Euclidean distance**
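
To make the two metrics concrete, here is a minimal sketch in plain Python. The three-dimensional vectors `cat`, `kitten`, and `car` are made-up values for illustration only; real embeddings come from an embedding model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b; 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-dimensional "embeddings", invented for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.0]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar direction
print(cosine_similarity(cat, car))      # close to 0: nearly orthogonal
print(euclidean_distance(cat, kitten))  # small distance
print(euclidean_distance(cat, car))     # larger distance
```

Note the opposite directions of the two measures: a higher cosine similarity means more related, while a smaller Euclidean distance means more related.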

  #### Embeddings Applications:
  ##### 1) Text Classification:
  Assigning a label to a piece of text.

  ##### 2) Text Clustering:
  Grouping together pieces of text that are similar in meaning.

  ##### 3) Question-Answering: 
  Answering a question posed in natural language.
  

## Vector Databases  
(**OPL Stack for developing LLM applications:** The OPL stack is a set of tools that can be used to build applications that use large language models (LLMs). **The stack consists of three main components: OpenAI, Pinecone, and LangChain.** Of these, LangChain is an open-source framework, while OpenAI and Pinecone are hosted services.)

One of the biggest challenges of AI applications is efficient data processing. AI applications such as LLMs, Generative AI, and semantic search require a large amount of data to train and operate. Efficient data processing is essential for making AI applications successful.

Many of the latest AI applications (e.g., chatbots, question-answering systems, and machine translation) rely on vector embeddings. **Creating vector embeddings means converting text to numbers that carry semantic information.** Vector embeddings represent text as a set of numbers in a high-dimensional space, and the numbers represent the meaning of the words in the text.

We need a specialized database or data store specifically designed to manage large quantities of data in a numeric representation. There are many vector databases available, both free and commercial. **Examples: Pinecone, Chroma, Milvus, Qdrant.**

Pinecone is a vector database designed for storing and querying high-dimensional vectors.

**Vector Databases are a new type of database, designed to store and query unstructured data.**

**Unstructured data is data that does not have a fixed schema, such as text, images, and audio.**

Compared with traditional databases, vector databases are much more efficient at storing and querying the embeddings of unstructured data, and they are heavily used in LLM applications.

Just as we use a "select" statement in SQL, in a vector database we apply a similarity metric to find the vectors that are most similar to our query. Vector databases use a combination of optimized algorithms that all participate in **Approximate Nearest Neighbor** (ANN) search.
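
As a point of comparison, the sketch below performs an exact (brute-force) nearest neighbor search: it scores every stored vector against the query and keeps the best matches. This is not how Pinecone works internally; ANN algorithms exist precisely to avoid comparing the query against every vector. The IDs and two-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(query, stored, top_k=2):
    """Score every stored vector against the query and keep the best top_k."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical stored vectors, keyed by ID.
stored = {
    'doc-1': [1.0, 0.0],
    'doc-2': [0.7, 0.7],
    'doc-3': [0.0, 1.0],
}

results = nearest_neighbors([0.9, 0.1], stored, top_k=2)
print(results)  # 'doc-1' ranks first, then 'doc-2'
```

The cost of this approach grows linearly with the number of stored vectors, which is why approximate methods become attractive for large collections.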


## How does a vector database work?
Three steps:

**1)Embedding:-** Create vector embeddings for the content we want to index. This is done by using an embedding model such as OpenAI's "text-embedding-ada-002" or "text-embedding-3-small".

**2)Indexing:-** Insert the vector embeddings into the vector database. This is done by associating each vector embedding with a reference to the original content used to create it.

**3)Querying:-** Query the vector database for similar content. This is done using the same embedding model used to create the vector embeddings. 

The **embedding model** is used to create a vector embedding for the query, and this vector embedding is then used to query the database for similar vector embeddings. The similar vector embeddings are then associated with the original content that was used to create them.

Example:- Let's imagine a company that wants to store and query its private documents. The company would first use an embedding model to create vector embeddings for each document. These vector embeddings would then be inserted into a vector database such as Pinecone. Each vector embedding would be associated with a reference to the original document. Finally, the company can use the vector database to find information that is similar to a given query or question. For instance, the company could search its documents for a product description.
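
The three steps can be sketched end to end with an in-memory dictionary standing in for the vector database. Everything here is an assumption made for illustration: `toy_embed` is a hypothetical word-counting function, not a real embedding model, and `VOCAB` and the documents are invented.

```python
import math

VOCAB = ['refund', 'shipping', 'password', 'invoice']  # hypothetical vocabulary

def toy_embed(text):
    """Stand-in for a real embedding model: counts vocabulary words in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

documents = {
    'doc-1': 'refund policy and refund deadlines',
    'doc-2': 'shipping times and shipping costs',
    'doc-3': 'how to reset a password',
}

# Steps 1 and 2: embed each document and store the vector under the document's ID,
# so every vector keeps a reference to the original content.
store = {doc_id: toy_embed(text) for doc_id, text in documents.items()}

# Step 3: embed the query with the same function, then pick the most similar stored vector.
query_vector = toy_embed('when do I get my refund')
best_id = max(store, key=lambda doc_id: cosine_similarity(query_vector, store[doc_id]))
print(best_id, '->', documents[best_id])
```

The important design point is that the query and the documents pass through the same embedding function; vectors produced by different models are not comparable.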

## Pinecone (example of vector database)

Pinecone is a high-performance, scalable and distributed vector store designed for LLMs.

In [4]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

To install the pinecone-client library:

In [5]:
pip install -q pinecone-client

Note: you may need to restart the kernel to use updated packages.


To upgrade the pinecone-client library to its latest version:

In [6]:
pip install --upgrade -q pinecone-client

Note: you may need to restart the kernel to use updated packages.


To display the version of the installed library:

In [5]:
pip show pinecone-client

Name: pinecone-client
Version: 3.2.2
Summary: Pinecone client and SDK
Home-page: https://www.pinecone.io
Author: Pinecone Systems, Inc.
Author-email: support@pinecone.io
License: Apache-2.0
Location: C:\Users\njm_s\AppData\Local\Programs\Python\Python311\Lib\site-packages
Requires: certifi, tqdm, typing-extensions, urllib3
Required-by: 
Note: you may need to restart the kernel to use updated packages.


With the new client, import the Pinecone class:

In [7]:
from pinecone import Pinecone
# The constructor reads the PINECONE_API_KEY environment variable. If it has already been
# loaded (for example from a .env file), authentication is handled automatically.
pc = Pinecone()
# Otherwise, pass the key explicitly through the api_key argument:
#pc = Pinecone(api_key='YOUR_API_KEY')

pc.list_indexes()

{'indexes': [{'dimension': 1536,
              'host': 'langchain-qppbyn2.svc.gcp-starter.pinecone.io',
              'metric': 'cosine',
              'name': 'langchain',
              'spec': {'pod': {'environment': 'gcp-starter',
                               'pod_type': 'starter',
                               'pods': 1,
                               'replicas': 1,
                               'shards': 1}},
              'status': {'ready': True, 'state': 'Ready'}}]}
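
If you prefer to fail early with a clear message when the key is missing, a small helper along these lines can be used before constructing the client. This is only a sketch; `get_pinecone_api_key` is a hypothetical helper name, not part of the Pinecone library.

```python
import os

def get_pinecone_api_key(env_var='PINECONE_API_KEY'):
    """Return the API key from the environment, or raise a clear error if it is missing."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f'{env_var} is not set; load your .env file or export the variable first.'
        )
    return api_key

# Usage with the client shown above:
# pc = Pinecone(api_key=get_pinecone_api_key())
```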

## Working with Pinecone Indexes
An **index** is the highest-level organizational unit of vector data in Pinecone. It accepts and stores vectors, serves queries over the vectors it contains, and does other vector operations over its contents.

Currently, there are **two types of indexes:**

**1) Serverless indexes:** You don't configure or manage any compute or storage resources. They scale automatically based on usage, and you pay only for the amount of data stored and operations performed, with no minimums.

**2) Pod-based indexes:** You choose one or more pre-configured units of hardware (pods) for running a Pinecone service. Depending on the pod type, pod size, and the number of pods used, you get a different amount of storage and higher or lower latency and throughput.

In [8]:
pc.list_indexes() # to get a complete description of all indexes in a project.

{'indexes': [{'dimension': 1536,
              'host': 'langchain-qppbyn2.svc.gcp-starter.pinecone.io',
              'metric': 'cosine',
              'name': 'langchain',
              'spec': {'pod': {'environment': 'gcp-starter',
                               'pod_type': 'starter',
                               'pods': 1,
                               'replicas': 1,
                               'shards': 1}},
              'status': {'ready': True, 'state': 'Ready'}}]}

Note that the output contains a list of dictionaries, one per index. You can access the first index, and then its name, as follows:

In [9]:
pc.list_indexes()[0]

{'dimension': 1536,
 'host': 'langchain-qppbyn2.svc.gcp-starter.pinecone.io',
 'metric': 'cosine',
 'name': 'langchain',
 'spec': {'pod': {'environment': 'gcp-starter',
                  'pod_type': 'starter',
                  'pods': 1,
                  'replicas': 1,
                  'shards': 1}},
 'status': {'ready': True, 'state': 'Ready'}}

In [10]:
pc.list_indexes()[0]['name']

'langchain'

In [10]:
pc.describe_index('langchain')

{'dimension': 1536,
 'host': 'langchain-qppbyn2.svc.gcp-starter.pinecone.io',
 'metric': 'cosine',
 'name': 'langchain',
 'spec': {'pod': {'environment': 'gcp-starter',
                  'pod_type': 'starter',
                  'pods': 1,
                  'replicas': 1,
                  'shards': 1}},
 'status': {'ready': True, 'state': 'Ready'}}

In [12]:
pc.list_indexes().names()

['langchain']

### Create a Pinecone Index

In [22]:
from pinecone import PodSpec    # PodSpec represents the configuration used to deploy a pod-based index.
index_name = 'langchain'        # name of the index to create

if index_name not in pc.list_indexes().names():
    print(f'Creating index {index_name}')
    pc.create_index(
        name=index_name,
        dimension=1536,   # default dimension of text-embedding-3-small, one of OpenAI's recommended embedding models
        metric='cosine',  # similarity metric used to compare vectors
        spec=PodSpec(
            environment='gcp-starter'
        )
    )
    print('Index created! :)')
else:
    print(f'Index {index_name} already exists!')

Creating index langchain
Index created! :)
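
Because the index is created with a fixed dimension, every vector stored in it must have exactly that many values. The sketch below checks this locally before any data is sent. `check_dimension` is a hypothetical helper, and `description` is a plain-dictionary stand-in for the index description, not a live API response.

```python
def check_dimension(index_description, vector):
    """Raise ValueError if the vector length differs from the index dimension."""
    expected = index_description['dimension']
    if len(vector) != expected:
        raise ValueError(f'Expected {expected} values, got {len(vector)}')
    return True

# Plain-dictionary stand-in for the description of the index created above.
description = {'name': 'langchain', 'dimension': 1536, 'metric': 'cosine'}

print(check_dimension(description, [0.0] * 1536))  # True

try:
    check_dimension(description, [0.0] * 768)
except ValueError as error:
    print(error)  # Expected 1536 values, got 768
```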


### Delete a Pinecone Index

In [21]:
index_name = 'langchain'
if index_name in pc.list_indexes().names():
    print(f'Deleting index {index_name} ...')
    pc.delete_index(index_name)
    print('Done')
else:
    print(f'Index {index_name} does not exist!')

Deleting index langchain ...
Done


**Important Note:** To perform any operation with an index, you must first select it. To select an index, do the following:-

In [17]:
index = pc.Index(index_name)  # returns an Index object for the selected index
index.describe_index_stats()  # display some statistics about the selected index

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

**Note:-** Serverless indexes automatically scale as needed, so index_fullness mainly applies to pod-based indexes.

## Working with Vectors

Here we generate five random vectors with a dimension of 1536, using the random module and a list comprehension. A vector is effectively a list of numbers, so a list of vectors is a list of lists.

In [23]:
import random
vectors = [[random.random() for _ in range(1536)] for v in range(5)]
print(vectors)  #These are the vectors

[[0.6833256878375683, 0.5377523008606321, 0.03689244473761499, 0.7914283080108644, 0.844260362780829, 0.5246127489089148, 0.9913029009050893, 0.6386933439754483, 0.4138752579349736, 0.4865159780996333, 0.15168755396566858, 0.446188369033229, 0.701397361421723, 0.9087853438098438, 0.442814995856489, 0.647360869559549, 0.6905426586199855, 0.6015401154256776, 0.023432061276011273, 0.5386953870350276, 0.1173523636496624, 0.9767709956176157, 0.7043311023831083, 0.3116705397105255, 0.2205269931765962, 0.5710849916861925, 0.7877749545531672, 0.47073896934790116, 0.6911069789919287, 0.26425605495195625, 0.2911213286021145, 0.718461327122258, 0.3644939602513674, 0.3615906955525403, 0.3124584773313357, 0.18192392375379207, 0.1914078383324933, 0.23272523391356614, 0.9386109914169358, 0.5307155039818369, 0.5746846727730863, 0.904445810030635, 0.24535019532157776, 0.5887282378042874, 0.21548287807511224, 0.08946640098293168, 0.02781408583088607, 0.9803587378221046, 0.12603057870191525, 0.1509323011

### Insert the Vectors
To insert a vector, we need the **vector itself and its ID**, which is a string.
Since there are five vectors in the above example, I am creating a list with five elements that represent the IDs.

**upsert()** is a single operation that can be used to insert a new value in the index or update an existing value if it already exists. This can be useful for situations where you are not sure whether the value already exists. The **upsert()** method returns the number of vectors inserted.

In [24]:
import random

vectors = [[random.random() for _ in range(1536)] for v in range(5)]
ids = list('abcde')

index_name = 'langchain' 
index = pc.Index(index_name)  #selecting the index to insert vectors into the index

#To insert new vectors into the index, use the upsert() method.
index.upsert(vectors=list(zip(ids, vectors)))  # zip() pairs each ID with its vector; list() turns the pairs into a list of (id, values) tuples.


{'upserted_count': 5}
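
For larger collections, it is common to send vectors in several smaller batches rather than in one call. The sketch below shows only the batching logic. `chunked` is a hypothetical helper, the batch size of 3 is arbitrary, and the four-dimensional vectors are placeholders. A suitable batch size for a real index depends on the vector dimension and on request-size limits.

```python
def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

ids = [f'vec-{i}' for i in range(7)]
vectors = [[float(i)] * 4 for i in range(7)]  # tiny 4-dimensional placeholders
records = list(zip(ids, vectors))

batches = list(chunked(records, batch_size=3))
print([len(batch) for batch in batches])  # [3, 3, 1]

# With a real index, each batch would be sent separately, for example:
# for batch in batches:
#     index.upsert(vectors=batch)
```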

### Update the Vectors
To update a vector, we can use the same **upsert()** method. For this, we need to provide two values: 1) the ID of the vector we want to update, and 2) the new values of the vector.

In [28]:
# updating vectors
index.upsert(vectors=[('c', [0.5] * 1536)])    # Here I am updating the vector with the ID c 

{'upserted_count': 1}
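
The insert-or-update behaviour can be illustrated with an ordinary dictionary. This is a conceptual sketch only, not Pinecone's implementation: the `upsert` function below is a local stand-in that mimics the shape of the response shown above.

```python
def upsert(store, records):
    """Insert new IDs and overwrite existing ones; report how many records were written."""
    count = 0
    for vec_id, values in records:
        store[vec_id] = values  # same statement handles both insert and update
        count += 1
    return {'upserted_count': count}

store = {}
print(upsert(store, [('a', [0.1, 0.2]), ('c', [0.3, 0.4])]))  # {'upserted_count': 2}
print(upsert(store, [('c', [0.5, 0.5])]))                      # {'upserted_count': 1}
print(store['c'])                                              # [0.5, 0.5]
print(len(store))                                              # 2
```

The second call does not add a new entry: the ID `c` already exists, so its values are replaced and the store still holds two vectors.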

### Fetch a Vector by ID

Select the index, if it's not already selected. Then call the **fetch()** method.

**Note:-** If you fetch a vector that doesn't exist, you won't get an error. Instead, the response contains an empty `vectors` dictionary.

In [29]:
# fetching vectors

#index = pc.Index(index_name)

index.fetch(ids=['c', 'd'])  # Here I am fetching the vectors with ids c and d.

{'namespace': '',
 'usage': {'read_units': 1},
 'vectors': {'c': {'id': 'c',
                   'values': [0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
                              0.5,
             

### Delete Vectors by ID

**Note:-** If you want to delete all the vectors in the current index, delete the index and recreate it.

In [16]:
# deleting vectors
index.delete(ids=['b', 'c'])     # Here I am deleting the vectors with ids b and c.

{}

In [25]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 5e-05,
 'namespaces': {'': {'vector_count': 5}},
 'total_vector_count': 5}

In [18]:
index.fetch(ids=['x'])      # Fetching the vector with the ID x, which doesn't exist, returns an empty 'vectors' dictionary.

{'namespace': '', 'usage': {'read_units': 1}, 'vectors': {}}
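
Since a missing ID does not raise an error, it can be useful to check explicitly which requested IDs came back. The sketch below assumes the response has been converted to a plain dictionary with the shape printed above. `missing_ids` is a hypothetical helper and the two responses are hand-written examples.

```python
def missing_ids(requested_ids, fetch_response):
    """Return the requested IDs that are absent from the response's 'vectors' mapping."""
    found = fetch_response.get('vectors', {})
    return [vec_id for vec_id in requested_ids if vec_id not in found]

# Plain-dictionary responses shaped like the outputs in this section.
empty_response = {'namespace': '', 'vectors': {}}
partial_response = {'namespace': '', 'vectors': {'a': {'id': 'a', 'values': [0.1, 0.2]}}}

print(missing_ids(['x'], empty_response))         # ['x']
print(missing_ids(['a', 'x'], partial_response))  # ['x']
```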

In [20]:
index.fetch(ids=['a', 'd'])    # Fetching the vectors with the IDs a and d, which do exist.

{'namespace': '',
 'usage': {'read_units': 1},
 'vectors': {'a': {'id': 'a',
                   'values': [0.194068819,
                              0.713190377,
                              0.818635106,
                              0.311499327,
                              0.0468405485,
                              0.529482365,
                              0.142504022,
                              0.911256671,
                              0.655106306,
                              0.556106687,
                              0.705024838,
                              0.838230073,
                              0.00279728603,
                              0.939765632,
                              0.239037022,
                              0.70895189,
                              0.417354047,
                              0.677159846,
                              0.449500501,
                              0.398692,
                              0.931859,
                        

## Perform a Query
The query operation retrieves the IDs of the most similar vectors in the index, along with their similarity scores.


In [32]:
# Query
query_vector = [random.random() for _ in range(1536)]

# Return the three stored vectors that are most similar to query_vector.
index.query(
    vector=query_vector,
    top_k=3,               # number of most similar vectors to return
    include_values=False   # omit the actual vector values from the response
)

{'matches': [{'id': 'c', 'score': 0.875969887, 'values': []},
             {'id': 'd', 'score': 0.764638841, 'values': []},
             {'id': 'a', 'score': 0.755540431, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 5}}
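
The scores above are fairly high even though the vectors are random. A likely explanation is that every component lies between 0 and 1, so all vectors point into the same positive region of the space. The sketch below reproduces the effect locally with plain Python. It does not call Pinecone, and the seed is fixed only to make the run repeatable.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

random.seed(0)  # fixed seed so the sketch is repeatable
dim = 1536
query = [random.random() for _ in range(dim)]
other = [random.random() for _ in range(dim)]
constant = [0.5] * dim  # same shape as the updated vector 'c'

# Two independent uniform vectors: expected similarity is near 0.75.
print(round(cosine_similarity(query, other), 3))
# A uniform vector against the constant vector: expected similarity is near 0.87.
print(round(cosine_similarity(query, constant), 3))
```

This matches the pattern in the query output, where the constant vector `c` ranks first and the random vectors score around 0.76. With real text embeddings, scores spread out far more and reflect semantic similarity.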