### Similarity search and vector embeddings
Embedding models map text to vectors so that texts with similar meanings end up close together in the vector space. In this example, we use LangChain's OllamaEmbeddings class, which calls a locally running Ollama server, to generate embeddings for a set of documents and then perform a similarity search using cosine similarity.

In [2]:
from langchain_community.embeddings import OllamaEmbeddings
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Define the documents
documents = [
    "The cat is on the mat.",
    "There is a cat on the mat.",
    "The dog is in the yard.",
    "There is a dog in the yard.",
]

# Initialize the OllamaEmbeddings instance
embeddings = OllamaEmbeddings()

# Generate embeddings for the documents
document_embeddings = embeddings.embed_documents(documents)

# Perform a similarity search for a given query
query = "A cat is sitting on a mat."
query_embedding = embeddings.embed_query(query)

# Calculate similarity scores
similarity_scores = cosine_similarity([query_embedding], document_embeddings)[0]

# Find the most similar document
most_similar_index = np.argmax(similarity_scores)
most_similar_document = documents[most_similar_index]

print(f"Most similar document to the query '{query}':")
print(most_similar_document)


Most similar document to the query 'A cat is sitting on a mat.':
The cat is on the mat.


- To perform this analysis, we need to convert our documents into a format that the similarity computation can work with. This is where the OllamaEmbeddings class comes in. We use it to generate an embedding for each document, transforming the text into a vector that represents its semantic content.

- Similarly, we transform our query string into an embedding. The query string is the text for which we want to find the most similar document.

- With our documents and query now in the form of embeddings, we compute the cosine similarity between the query embedding and each document embedding. Cosine similarity is the cosine of the angle between two vectors: it ranges from -1 to 1, and values closer to 1 mean the vectors point in more similar directions. In our case, it gives us one similarity score per document for our query.

- With our similarity scores in hand, we then identify the document most similar to our query. We do this by finding the index of the highest similarity score and retrieving the corresponding document from our list of documents.
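
To make the scoring step concrete, here is a minimal sketch of cosine similarity computed by hand with NumPy. The helper name cosine_sim and the three-dimensional toy vectors are assumptions for illustration only; real embeddings have hundreds of dimensions and come from the embedding model.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings (made-up values).
query_vec = [1.0, 2.0, 0.0]
doc_vecs = [
    [2.0, 4.0, 0.0],    # same direction as the query
    [0.0, 0.0, 3.0],    # orthogonal to the query
    [-1.0, -2.0, 0.0],  # opposite direction to the query
]

# One score per document, then pick the index of the highest score.
scores = [cosine_sim(query_vec, d) for d in doc_vecs]
best_index = int(np.argmax(scores))

print(scores)
print(best_index)
```

Vector length does not affect the score: the first toy document is twice as long as the query but still scores highest, because only the direction matters.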

### Embedding models

For this task, we've chosen the pre-trained "sentence-transformers/all-mpnet-base-v2" model. This model transforms sentences into embeddings, that is, vectors that capture the semantic meaning of the sentences. The model_kwargs parameter specifies that the computations should run on the CPU.

In [3]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Pre-trained sentence embedding model from the Hugging Face Hub
model_name = "sentence-transformers/all-mpnet-base-v2"
# Run the model on the CPU
model_kwargs = {'device': 'cpu'}
hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# Embed a list of documents; returns one vector per document
documents = ["Document 1", "Document 2", "Document 3"]
doc_embeddings = hf.embed_documents(documents)



- Now that we have our model, we define a list of documents. These are the pieces of text that we want to convert into semantic embeddings.

- With our model and documents ready, we move on to generate the embeddings. We do this by calling the embed_documents method on our HuggingFaceEmbeddings instance, passing our list of documents as an argument. This method processes each document and returns a corresponding list of embeddings.

- These embeddings are now ready for any downstream tasks such as classification, clustering, or similarity analysis. They represent our original documents in a form that machines can understand and process, enabling us to perform complex semantic tasks.
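
As one illustration of a downstream task, the sketch below classifies a new embedding by comparing it with the average embedding (centroid) of each labeled group. The function nearest_centroid, the labels, and the two-dimensional toy vectors are all hypothetical; in practice the vectors would come from embed_documents and embed_query.

```python
import numpy as np

def nearest_centroid(train_vecs, train_labels, new_vec):
    """Return the label whose centroid is most cosine-similar to new_vec."""
    train_vecs = np.asarray(train_vecs, dtype=float)
    new_vec = np.asarray(new_vec, dtype=float)
    label_array = np.array(train_labels)
    labels = sorted(set(train_labels))

    # Average the embeddings that share a label to get one centroid per label.
    centroids = np.stack(
        [train_vecs[label_array == lab].mean(axis=0) for lab in labels]
    )

    # Cosine similarity between each centroid and the new vector.
    sims = centroids @ new_vec / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(new_vec)
    )
    return labels[int(np.argmax(sims))]

# Toy 2-dimensional "embeddings" with made-up values.
train_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["cat", "cat", "dog", "dog"]

predicted = nearest_centroid(train_vecs, train_labels, [0.7, 0.3])
print(predicted)
```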

### Cohere embeddings
Cohere aims to make its multilingual language models widely accessible. Its multilingual embedding model maps text into a semantic vector space in which texts with similar meanings lie close together, which improves multilingual applications such as search. Unlike Cohere's English language model, the multilingual model is meant to be compared using the dot product, which is reported to give better results for this model.

These multilingual embeddings are represented in a 768-dimensional vector space.
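
The difference between dot product and cosine similarity can be seen in a short sketch. The vectors below are made-up two-dimensional values, not output from any Cohere model: the dot product grows with vector length, whereas cosine similarity depends only on direction, and the two agree once the vectors are scaled to unit length.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])  # same direction as a, twice the length

# Raw dot product: sensitive to the lengths of the vectors.
dot = float(a @ b)

# Cosine similarity: dot product divided by the product of the lengths.
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After normalizing to unit length, the dot product equals the cosine.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = float(a_unit @ b_unit)

print(dot, cos, dot_unit)
```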

## Deep Lake Vector Store
- Vector stores are data structures or databases designed to store and manage high-dimensional vectors efficiently. They enable efficient similarity search, nearest neighbor search, and other vector-related operations. Vector stores can be built on various index structures, such as KD trees or Vantage Point trees, or on approximate nearest neighbor (ANN) techniques that trade a small amount of accuracy for speed on large collections.

- Deep Lake serves as both a data lake for deep learning and a multi-modal vector store. As a multi-modal vector store, it allows users to store images, audio, videos, text, and metadata in a format optimized for deep learning. It enables hybrid search, allowing users to search both embeddings and their attributes.

- Users can save data locally, in their cloud, or on Activeloop storage. Deep Lake supports the training of PyTorch and TensorFlow models while streaming data with minimal boilerplate code. It also provides features like version control, dataset queries, and distributed workloads using a simple Python API.

- Moreover, as the size of datasets increases, it becomes increasingly difficult to store them in local memory. A local vector store could have been utilized in this particular instance since only a few documents are being uploaded. However, the necessity for a centralized cloud dataset arises in a typical production setting, where thousands or millions of documents may be involved and accessed by various programs.
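
To illustrate what a vector store does conceptually, here is a minimal in-memory sketch that supports similarity search combined with a metadata filter, in the spirit of the hybrid search described above. The class TinyVectorStore, its methods, and the toy vectors are hypothetical and are not Deep Lake's API; the search is a brute-force scan, which a production store would replace with an index.

```python
import numpy as np

class TinyVectorStore:
    """Hypothetical in-memory store: brute-force cosine search with a metadata filter."""

    def __init__(self):
        self.vectors = []
        self.texts = []
        self.metadata = []

    def add(self, vector, text, meta=None):
        v = np.asarray(vector, dtype=float)
        # Store unit-length vectors so a dot product gives cosine similarity.
        self.vectors.append(v / np.linalg.norm(v))
        self.texts.append(text)
        self.metadata.append(meta or {})

    def search(self, query_vector, k=1, where=None):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        hits = []
        for vec, text, meta in zip(self.vectors, self.texts, self.metadata):
            # Skip entries whose attributes do not match the filter.
            if where and any(meta.get(key) != val for key, val in where.items()):
                continue
            hits.append((float(vec @ q), text))
        # Highest similarity first, then keep the top k.
        hits.sort(key=lambda h: h[0], reverse=True)
        return hits[:k]

# Toy 2-dimensional vectors with made-up values standing in for embeddings.
store = TinyVectorStore()
store.add([0.9, 0.1], "The cat is on the mat.", {"topic": "cat"})
store.add([0.8, 0.3], "There is a cat on the mat.", {"topic": "cat"})
store.add([0.1, 0.9], "The dog is in the yard.", {"topic": "dog"})

top = store.search([1.0, 0.0], k=2)
dogs_only = store.search([1.0, 0.0], k=2, where={"topic": "dog"})
print(top)
print(dogs_only)
```

Scanning every vector is fine for a handful of documents, but its cost grows linearly with the collection, which is one reason larger deployments rely on indexed, centrally hosted stores.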