# Generate Embeddings

## Setup

This notebook requires a MongoDB Atlas connection string (`ATLAS_URI`) and a Llama API key (`LLAMA_API`). In addition, the `zendesk_data` collection of the `stackup_ai` database needs a search index (explained later). The query engine uses that index to retrieve the stored embeddings most relevant to a user's query, so that the LLM can answer as accurately as possible.

In [1]:
"""
This script sets up a connection to a MongoDB Atlas cluster using the ATLAS_URI environment variable.
It also configures logging to print messages at the INFO level to the console.
The script first imports necessary modules and sets up logging.
It then loads environment variables from a .env file using the dotenv_values function.
The ATLAS_URI environment variable is retrieved and used to create a MongoDB client connection.
If the ATLAS_URI environment variable is not set, the script raises an exception.
"""
import sys, os
import logging
import pymongo
import warnings

warnings.filterwarnings(action='ignore')
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Load settings from .env file
from dotenv import find_dotenv, dotenv_values

sys.path.insert(0, '../')

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception("'ATLAS_URI' is not set. Please set it in the .env file to continue.")

mongodb_client = pymongo.MongoClient(ATLAS_URI)
print ("Atlas client initialized")

Atlas client initialized


In [2]:
"""
Constants for the Zendesk data processing pipeline.
DB_NAME: The name of the database to store the processed data.
COLLECTION_NAME: The name of the MongoDB collection to store the Zendesk data.
INDEX_NAME: The name of the index to be created on the embeddings field in the Zendesk data collection.
"""
DB_NAME = 'stackup_ai'
COLLECTION_NAME = 'zendesk_data'
INDEX_NAME = 'index_embeddings'

In [3]:
'''
Sets the environment variable 'LLAMA_INDEX_CACHE_DIR' to the 'cache' directory under the parent of the current working directory.
This is likely used by LlamaIndex, the data framework this notebook uses to connect documents to large language models,
as the location for cached data such as downloaded models.
'''
os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath('../'), 'cache')

In [4]:
"""
Deletes all documents from the specified MongoDB collection.
This code connects to a MongoDB database, retrieves the specified collection, and then deletes all documents from that collection.
It first prints the total number of documents in the collection, then calls the `delete_many()` method to delete all documents,
and finally prints the number of deleted documents.
This cell is intended for maintenance or testing purposes, as it permanently removes all data from the specified collection.
Caution should be exercised when using this code in a production environment.
"""
database = mongodb_client[DB_NAME]
collection = database [COLLECTION_NAME]

doc_count = collection.count_documents (filter = {})
print (f"Document count before delete : {doc_count:,}")

result = collection.delete_many(filter= {})
print (f"Deleted docs : {result.deleted_count}")

Document count before delete : 26
Deleted docs : 26


In [5]:
"""
Initializes a HuggingFaceEmbedding model and a ServiceContext with the specified embedding model.
The HuggingFaceEmbedding model is used for generating embeddings of text data.
The ServiceContext is a container for various services used by the LlamaIndex library, including the embedding model.
Args:
    model_name (str): The name of the HuggingFace model to use for generating embeddings.
Returns:
    ServiceContext: A ServiceContext instance with the specified embedding model.
"""

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import ServiceContext

embed_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:sentence_transformers.SentenceTransformer:2 prompts are loaded, with the keys: ['query', 'text']
2 prompts are loaded, with the keys: ['query', 'text']
LLM is explicitly disabled. Using MockLLM.
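
The vector length produced by the embedding model has to match the `numDimensions` value (384) in the search index definition used by this notebook. The following is a minimal sketch of a sanity check for that, not part of the original pipeline. `check_embedding_dimension` and `fake_embed` are hypothetical names introduced here; `fake_embed` is a stand-in so the snippet runs without downloading a model.

```python
from typing import Callable, Sequence


def check_embedding_dimension(
    embed: Callable[[str], Sequence[float]], expected_dim: int
) -> int:
    """Embed a probe string and verify the length of the resulting vector."""
    vector = embed("dimension probe")
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, expected {expected_dim}"
        )
    return len(vector)


# Hypothetical stand-in embedder: returns a fixed-size zero vector.
def fake_embed(text: str) -> Sequence[float]:
    return [0.0] * 384


print(check_embedding_dimension(fake_embed, 384))
```

In the notebook itself, `embed_model.get_text_embedding` could be passed in place of `fake_embed` to check the real model.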


In [6]:
"""
Imports the MongoDBAtlasVectorSearch class from the llama_index.vector_stores.mongodb module.
The MongoDBAtlasVectorSearch class provides a vector store implementation that uses MongoDB Atlas as the backend.
It can be used to store and retrieve vector embeddings and associated metadata.
"""
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.core import StorageContext

vector_store = MongoDBAtlasVectorSearch(mongodb_client = mongodb_client,
                                 db_name = DB_NAME, collection_name = COLLECTION_NAME,
                                 vector_index_name  = INDEX_NAME,
                                 ## The following columns are set to default values
                                 # embedding_key = 'embedding', text_key = 'text', metadata_key = 'metadata',
                                 )

storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [7]:
"""
Reads data files from a directory using the SimpleDirectoryReader class from the llama_index.core module.
The `data_dir` variable specifies the directory path where the data files are located.
The `SimpleDirectoryReader` class is used to load all the data files in the directory and return a list of documents.
The number of chunks (documents) loaded from the directory is printed to the console.
"""
from llama_index.core import SimpleDirectoryReader

data_dir = './data/'

## This reads one doc
# docs = SimpleDirectoryReader(
#     input_files=["./data/10k/uber_2021.pdf"]
# ).load_data()

## This reads an entire directory
docs = SimpleDirectoryReader(
        input_dir=data_dir
).load_data()

print (f"Loaded {len(docs)} chunks from '{data_dir}'")


Loaded 17 chunks from './data/'


In [8]:
"""
Creates a VectorStoreIndex from the provided documents, using the given storage and service contexts.
Args:
    docs (List[Document]): A list of documents to create the index from.
    storage_context (StorageContext): The storage context to use for the index.
    service_context (ServiceContext): The service context to use for the index.
Returns:
    VectorStoreIndex: The created VectorStoreIndex.
"""

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    docs, 
    storage_context=storage_context,
    service_context=service_context,
)

Batches: 100%|██████████| 1/1 [00:10<00:00, 10.35s/it]
Batches: 100%|██████████| 1/1 [00:03<00:00,  3.39s/it]
Batches: 100%|██████████| 1/1 [00:02<00:00,  2.24s/it]


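In MongoDB Atlas, go to the `zendesk_data` collection of the `stackup_ai` database and create an Atlas Vector Search index named `index_embeddings` with the following definition. The `numDimensions` value of 384 matches the output size of the `all-MiniLM-L6-v2` embedding model.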
```json
{
  "fields": [
    {
      "numDimensions": 384,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
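
As an alternative to the Atlas UI, the same index could be created from Python. The sketch below assumes PyMongo 4.7 or later (where `SearchIndexModel` accepts `type="vectorSearch"`) and a cluster that allows programmatic search index creation. `build_vector_index_definition` and `create_vector_index` are hypothetical helper names, not part of the original notebook.

```python
from typing import Any, Dict


def build_vector_index_definition(
    num_dimensions: int = 384,
    path: str = "embedding",
    similarity: str = "cosine",
) -> Dict[str, Any]:
    """Return the same index definition as the JSON shown above."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection: Any, index_name: str = "index_embeddings") -> str:
    """Create the vector search index on a pymongo Collection (assumes PyMongo >= 4.7)."""
    # Imported lazily so that building the definition does not require pymongo.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(),
        name=index_name,
        type="vectorSearch",
    )
    # Returns the index name; Atlas builds the index asynchronously.
    return collection.create_search_index(model)


print(build_vector_index_definition())
```

With the notebook's variables, the call would be `create_vector_index(mongodb_client[DB_NAME][COLLECTION_NAME], INDEX_NAME)`. Because the index is built in the background, queries may return no results until the build has finished.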

## Demonstration

In [9]:
"""
Reads the LLAMA_API key from the .env configuration and initializes a LlamaAPI instance with it.
The LlamaAPI class is the LLM client used here to generate answers from the retrieved context.
The HuggingFace embedding model is loaded again so that queries are embedded with the same model as the documents.
Args:
    api_key (str): The API key for the Llama language model API.
"""
from llama_index.llms.llama_api import LlamaAPI

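LLAMA_API = config.get('LLAMA_API')
if not LLAMA_API:
    raise Exception("'LLAMA_API' is not set. Please set it in the .env file to continue.")

llm = LlamaAPI(api_key=LLAMA_API)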
embed_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:sentence_transformers.SentenceTransformer:2 prompts are loaded, with the keys: ['query', 'text']
2 prompts are loaded, with the keys: ['query', 'text']


In [10]:
"""
Initializes a ServiceContext object with the provided embed_model and llm parameters.
The ServiceContext class is used to configure the various services used by the LlamaIndex library,
such as the embedding model and language model.
This code sets the default embed_model and llm parameters for the ServiceContext, which can be used throughout the application.
"""
from llama_index.core import ServiceContext
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)

In [11]:
"""
Initializes a MongoDBAtlasVectorSearch object and creates a StorageContext and VectorStoreIndex from it.
The MongoDBAtlasVectorSearch object is configured with the provided MongoDB client, database name, collection name, and index name.
The default values for the embedding_key, text_key, and metadata_key parameters are used.
The StorageContext is created from the vector store, and the VectorStoreIndex is created from the vector store and the service context.
"""

from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex

vector_store = MongoDBAtlasVectorSearch(mongodb_client = mongodb_client,
                                 db_name = DB_NAME, collection_name = COLLECTION_NAME,
                                 vector_index_name  = INDEX_NAME,
                                 ## the following columns are set to default values
                                 # embedding_key = 'embedding', text_key = 'text', metadata_key = 'metadata',
                                 )

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

In [12]:
"""
Queries the index using the provided query string and displays the response as Markdown-formatted text.
Args:
    query (str): The query string to be used for searching the index.
Returns:
    None
"""
from IPython.display import Markdown, display

response = index.as_query_engine().query("What is bounty? Explain in detail.")
display(Markdown(f"<b>{response}</b>"))
# pprint(response, indent=4)

Batches: 100%|██████████| 1/1 [00:00<00:00, 18.44it/s]


<b>The StackUp [bounty](https://earn.stackup.dev/) program offers an additional opportunity for Stackies to engage in more advanced learning activities with higher expectations for their output. This program presents a new level of challenge compared to quests, allowing Stackies to tackle more complex challenges in exchange for a larger reward amount.</b>

In [13]:
"""
Prints the score of the first source node in the response.
"""
print(f"Score: {response.source_nodes[0].score}")
# print(response)

Score: 0.7607830762863159
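
To help interpret this number: for an index that uses `cosine` similarity, Atlas Vector Search reports a normalized score of `(1 + cosine) / 2`, which lies between 0 and 1. The sketch below illustrates that mapping in plain Python. It assumes the score above is passed through unchanged from Atlas; the helper names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def atlas_cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Map cosine similarity from [-1, 1] to the normalized range [0, 1]."""
    return (1.0 + cosine_similarity(a, b)) / 2.0


print(atlas_cosine_score([1.0, 0.0], [1.0, 0.0]))   # same direction -> 1.0
print(atlas_cosine_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.5
print(atlas_cosine_score([1.0, 0.0], [-1.0, 0.0]))  # opposite -> 0.0
```

Under that assumption, a score of about 0.76 corresponds to a raw cosine similarity of about 0.52 between the query embedding and the top-ranked chunk.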
