# Populate MongoDB Atlas Database
In this Python notebook, we generate embeddings for course material (a set of documents), index them, and store them in a MongoDB Atlas database.

## Step 1: Load Settings

In [3]:
# Load settings from .env file
import sys
from dotenv import find_dotenv, dotenv_values

# Add the project root directory to the module search path
sys.path.insert(0, '../')

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())


ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception("'ATLAS_URI' is not set. Please set it in your .env file to continue...")
else:
    print("ATLAS_URI Connection string found")


ATLAS_URI Connection string found
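
The same check can be applied to every key the notebook reads from `.env` (later cells also read `GEMINI_API_KEY`). A minimal sketch, assuming a hypothetical helper named `require_settings` (it is not part of python-dotenv) and placeholder values in place of real credentials:

```python
from typing import Dict, Mapping, Optional, Sequence


def require_settings(config: Mapping[str, Optional[str]], keys: Sequence[str]) -> Dict[str, str]:
    """Return the requested settings, raising if any is missing or empty.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    missing = [key for key in keys if not config.get(key)]
    if missing:
        raise ValueError(f"Missing settings in .env: {', '.join(missing)}")
    return {key: config[key] for key in keys}


# Placeholder values standing in for dotenv_values(find_dotenv())
sample_config = {
    "ATLAS_URI": "mongodb+srv://<user>:<password>@<cluster>/",
    "GEMINI_API_KEY": "",  # empty on purpose to trigger the error path
}

try:
    require_settings(sample_config, ["ATLAS_URI", "GEMINI_API_KEY"])
except ValueError as err:
    print(err)
```

Checking all keys up front reports every missing setting at once, instead of failing several cells later.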


In [4]:
# Define our variables
DB_NAME = 'abdulabiola21'
COLLECTION_NAME = 'ele-562'
INDEX_NAME = 'lambda_embedding'

In [5]:
# LlamaIndex will download embedding models as needed
# Set the LlamaIndex cache dir to ../cache here (the default is the system tmp dir)
# This way, we can easily see downloaded artifacts
import os

os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath('../'), 'cache')

In [6]:
import pymongo

mongodb_client = pymongo.MongoClient(ATLAS_URI)

print ("Atlas client initialized")

Atlas client initialized


## Step 2: Clear out the collection

If you are repeating this Quest, run the code cell below for a fresh start. It deletes any documents you previously added to the `ele-562` collection in the `abdulabiola21` database. If you are running this Quest for the first time, expect to see `0` below because you have not added any items to this collection yet.

In [7]:
database = mongodb_client[DB_NAME]
collection = database[COLLECTION_NAME]

# Count the existing documents, then delete them all for a fresh start
doc_count = collection.count_documents(filter={})
print(f"Document count before delete : {doc_count:,}")

result = collection.delete_many(filter={})
print(f"Deleted docs : {result.deleted_count}")

INFO:pymongo.serverSelection:{"message": "Waiting for suitable server to become available", "selector": "Primary()", "operation": "count", "topologyDescription": "<TopologyDescription id: 663e2e750ba9f33da19ff21f, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('ac-rtbkmtd-shard-00-00.mmc8nho.mongodb.net', 27017) server_type: Unknown, rtt: None>, <ServerDescription ('ac-rtbkmtd-shard-00-01.mmc8nho.mongodb.net', 27017) server_type: Unknown, rtt: None>, <ServerDescription ('ac-rtbkmtd-shard-00-02.mmc8nho.mongodb.net', 27017) server_type: Unknown, rtt: None>]>", "clientId": {"$oid": "663e2e750ba9f33da19ff21f"}, "remainingTimeMS": 29}
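
The count-then-delete pattern above can be wrapped in a small function. The sketch below is illustrative only: `clear_collection` is a hypothetical helper, and `FakeCollection` is an in-memory stand-in (not a pymongo class) so the example runs without an Atlas cluster. It relies only on the `count_documents` and `delete_many` methods used in the cell above.

```python
from dataclasses import dataclass, field
from typing import Tuple


def clear_collection(collection) -> Tuple[int, int]:
    """Delete every document and return (count_before, deleted_count).

    Accepts any object exposing count_documents() and delete_many(),
    such as a pymongo collection.
    """
    before = collection.count_documents({})
    result = collection.delete_many({})
    return before, result.deleted_count


@dataclass
class FakeDeleteResult:
    """Hypothetical stand-in for the result object returned by delete_many."""
    deleted_count: int


@dataclass
class FakeCollection:
    """Hypothetical in-memory collection; it ignores the filter argument."""
    docs: list = field(default_factory=list)

    def count_documents(self, filter):
        return len(self.docs)

    def delete_many(self, filter):
        deleted = len(self.docs)
        self.docs.clear()
        return FakeDeleteResult(deleted)


fake = FakeCollection(docs=[{"text": "a"}, {"text": "b"}])
print(clear_collection(fake))  # first run removes the stored documents
print(clear_collection(fake))  # second run finds nothing left to delete
```

With the real client, the same function could be called as `clear_collection(collection)`.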

## Step 3: Setup Embeddings

We'll use Google's free Gemini embedding model.

In [10]:
%pip install llama-index-embeddings-gemini -q

Note: you may need to restart the kernel to use updated packages.


In [11]:
import os

# Pass the Gemini key from .env to the client through the GOOGLE_API_KEY environment variable
os.environ["GOOGLE_API_KEY"] = config.get("GEMINI_API_KEY")

In [12]:
# Import the Gemini embedding integration installed above
from llama_index.embeddings.gemini import GeminiEmbedding

embed_model = GeminiEmbedding(model_name="models/embedding-001")

# Embed a test string to find the model's output dimension;
# the vector index in the vector database must be created with the same dimension
vector = embed_model.get_text_embedding("What is the meaning of life")
print(f"The dimension of the embedding model is: {len(vector)}")

INFO:numexpr.utils:NumExpr defaulting to 4 threads.
The dimension of the embedding model is: 768
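
The dimension reported above (768) is the value a vector index on the collection needs. The notebook does not show how the `lambda_embedding` index was created, so the following is only a sketch of an Atlas Vector Search index definition. It assumes the embeddings are stored under a field named `embedding` and that cosine similarity is wanted; both are assumptions, and `vector_index_definition` is a hypothetical helper.

```python
import json

ALLOWED_SIMILARITIES = {"cosine", "dotProduct", "euclidean"}


def vector_index_definition(num_dimensions: int,
                            path: str = "embedding",    # assumed field name
                            similarity: str = "cosine"  # assumed metric
                            ) -> dict:
    """Build an Atlas Vector Search index definition as a plain dict."""
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be positive")
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"similarity must be one of {sorted(ALLOWED_SIMILARITIES)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


# 768 is the dimension printed by the embedding test above
print(json.dumps(vector_index_definition(768), indent=2))
```

The printed JSON is the kind of definition that can be pasted into the Atlas UI when creating a vector search index. If the dimension does not match the model's output, queries against the index will not work as intended.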


In [13]:
# Set up the service context
# It bundles the embedding model with the LLM used to generate natural language responses to queries
# If no LLM is provided, it defaults to gpt-3.5-turbo from OpenAI
# If your OpenAI API key is not set, it falls back to llama2-chat-13B from Llama.cpp
# Since we don't need an LLM just yet, we set it to None
# Note: ServiceContext is deprecated since llama-index 0.10 in favor of llama_index.core.Settings

# from llama_index import ServiceContext
# Uncomment the line above and comment out the line below if you face an import error
from llama_index.core import ServiceContext

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)


LLM is explicitly disabled. Using MockLLM.


## Step 4: Connect Llama-Index and MongoDB Atlas

As mentioned in the Quest, we'll use MongoDB Atlas as our vector store. It holds the indexed data (text and embeddings) so that we can query it later.

In [14]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
# from llama_index.storage.storage_context import StorageContext
# Uncomment the line above and comment out the line below if you face an import error
from llama_index.core import StorageContext

# Point the vector store at the database, collection, and index defined earlier
vector_store = MongoDBAtlasVectorSearch(
    mongodb_client=mongodb_client,
    db_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    index_name=INDEX_NAME,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
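
To give a feel for what "query later" means, here is a toy, brute-force illustration of ranking stored embeddings by cosine similarity to a query embedding. It is not how Atlas implements vector search (the search runs server-side against the index), and the records, the `embedding` and `text` field names, and both function names are made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], records: List[Dict], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query, r["embedding"]), r["text"]) for r in records]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up 2-dimensional embeddings; the real model above produces 768 dimensions
records = [
    {"text": "doc A", "embedding": [0.9, 0.1]},
    {"text": "doc B", "embedding": [0.1, 0.9]},
    {"text": "doc C", "embedding": [0.7, 0.3]},
]

for score, text in top_k([1.0, 0.0], records):
    print(f"{text}: {score:.3f}")
```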

## Step 5: Read PDF Documents

LlamaIndex has a very handy `SimpleDirectoryReader` that can read a single file, multiple files, or an entire directory's contents. We'll use it to read our 2 PDF files and store the data in `docs`.

In [18]:
%%time

# from llama_index.readers.file.base import SimpleDirectoryReader
# Uncomment the line above and comment out the line below if you face an import error
from llama_index.core import SimpleDirectoryReader

cwd = os.getcwd()
data_dir = os.path.join(cwd, 'ELE562')

docs = SimpleDirectoryReader(
        input_dir=data_dir
).load_data()

print (f"Loaded {len(docs)} chunks from '{data_dir}'")

Loaded 128 chunks from '/home/engineerlambda/stackup-bounty/ELE562'
CPU times: user 21 s, sys: 242 ms, total: 21.2 s
Wall time: 23.3 s
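
Since loading takes a while, it can help to confirm which files the reader will find before calling `load_data()`. A minimal sketch using only the standard library; `list_input_files` is a hypothetical helper, and the file names are placeholders created in a temporary directory:

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def list_input_files(data_dir: str, suffixes: Tuple[str, ...] = (".pdf",)) -> List[Path]:
    """Return the files in data_dir whose extension matches, sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Data directory not found: {root}")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


# Demonstrate with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("lecture1.pdf", "lecture2.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    print([p.name for p in list_input_files(tmp)])
```

In the notebook, the call would be `list_input_files(data_dir)` with the `data_dir` defined above.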


## Step 6: Index the Documents and Store Them Into MongoDB Atlas

The code cell below is where everything we've prepared in this Python notebook comes together:
- Embeddings are generated using the embedding model packaged in `service_context`
- Our documents `docs` are indexed through `storage_context`, so both text and embeddings are stored in MongoDB Atlas

In [19]:
%%time

# from llama_index.indices.vector_store.base import VectorStoreIndex
# Uncomment the line above and comment out the line below if you face an import error
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    docs, 
    storage_context=storage_context,
    service_context=service_context,
)

CPU times: user 3.06 s, sys: 128 ms, total: 3.19 s
Wall time: 1min 22s
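
A quick sanity check after indexing is to confirm that each stored record carries an embedding of the expected length (768 for the model above). The sketch below works on plain dictionaries with made-up data. It assumes embeddings are stored under an `embedding` field, which is an assumption about the vector store's defaults, and `find_bad_embeddings` is a hypothetical helper.

```python
from typing import Iterable, List, Mapping


def find_bad_embeddings(records: Iterable[Mapping],
                        expected_dim: int,
                        embedding_key: str = "embedding"  # assumed field name
                        ) -> List[int]:
    """Return positions of records whose embedding is missing or the wrong length."""
    bad = []
    for position, record in enumerate(records):
        vector = record.get(embedding_key)
        if not isinstance(vector, list) or len(vector) != expected_dim:
            bad.append(position)
    return bad


# Made-up records with 3-dimensional embeddings for illustration
sample_records = [
    {"text": "page 1", "embedding": [0.1, 0.2, 0.3]},
    {"text": "page 2", "embedding": [0.1, 0.2]},  # wrong length
    {"text": "page 3"},                           # embedding missing
]

print(find_bad_embeddings(sample_records, expected_dim=3))
```

Against the live collection, the records could come from `collection.find({})`, with `expected_dim=768`.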
