# MongoDB-Atlas

- Author: [Ivy Bae](https://github.com/ivybae), [Jongho Lee](https://github.com/XaviereKU)
- Peer Review : [Haseom Shin](https://github.com/IHAGI-c), [ro__o_jun](https://github.com/ro-jun), [Sohyeon Yim](https://github.com/sohyunwriter)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-MongoDB.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-MongoDB.ipynb)

## Overview

This tutorial covers how to use ```MongoDB-Atlas``` with **LangChain**.

[MongoDB Atlas](https://www.mongodb.com/en/atlas) is a multi-cloud database service that provides an easy way to host and manage your data in the cloud.

This tutorial walks you through **CRUD** operations with ```MongoDB-Atlas```: **storing**, **updating**, and **deleting** documents, and performing **similarity-based retrieval**.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [What is MongoDB-Atlas?](#what-is-mongodb-atlas)
- [Prepare Data](#prepare-data)
- [Setting up MongoDB-Atlas](#setting-up-mongodb-atlas)
- [Document Manager](#document-manager)


### References

- [Get Started with Atlas](https://www.mongodb.com/docs/atlas/getting-started/)
- [Deploy a Free Cluster](https://www.mongodb.com/docs/atlas/tutorial/deploy-free-tier-cluster/)
- [Connection Strings](https://www.mongodb.com/docs/manual/reference/connection-string/)
- [Atlas Search and Vector Search Indexes](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/indexes/atlas-search-index/)
- [Review Atlas Search Index Syntax](https://www.mongodb.com/docs/atlas/atlas-search/index-definitions/)
- [JSON and BSON](https://www.mongodb.com/resources/basics/json-and-bson)
- [Write Data to MongoDB](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/write-operations/)
- [Read Data from MongoDB](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/read/)
- [Query Filter Documents](https://www.mongodb.com/docs/manual/core/document/#query-filter-documents)
- [Update Operators](https://www.mongodb.com/docs/manual/reference/operator/update/)
- [Integrate Atlas Vector Search with LangChain](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/)
- [Get Started with the LangChain Integration](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/get-started/)
- [Comparison Query Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/)
- [MongoDB Atlas](https://python.langchain.com/docs/integrations/vectorstores/mongodb_atlas/)
- [Document loaders](https://python.langchain.com/docs/concepts/document_loaders/)
- [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- You can check out [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain-core",
        "python-dotenv",
        "langchain_openai",
        "langchain_community",
        "pymongo",
        "certifi",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "{Project Name}",
        "MONGODB_ATLAS_CLUSTER_URI": "{Your Atlas URI}",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True
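
Before moving on, it can help to confirm that the variables this tutorial relies on are actually set. The following is a minimal sketch; the helper name `missing_env_vars` and the choice of required variables are our own assumptions, not part of ```langchain-opentutorial```.

```python
import os
from typing import List, Mapping, Sequence

# Variables this tutorial reads later on (an assumption based on the cells above)
REQUIRED_VARS = ("OPENAI_API_KEY", "MONGODB_ATLAS_CLUSTER_URI")


def missing_env_vars(
    required: Sequence[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Note that this check only detects unset or empty values; a placeholder such as `{Your Atlas URI}` still counts as set.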

To set ```MONGODB_ATLAS_CLUSTER_URI```, you need to sign up and create a cluster on [MongoDB Atlas](https://www.mongodb.com/en/atlas).

**Atlas** can be started with [Atlas CLI](https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-getting-started/) or **Atlas UI**.

**Atlas CLI** can be difficult to use if you're not used to working with development tools, so this tutorial will walk you through how to use **Atlas UI**.

To deploy a cluster, please select the appropriate project in your **Organization**. If the project doesn't exist, you'll need to create it.

If you select a project, you can create a cluster.

![mongodb-atlas-project](./assets/07-mongodb-atlas-initialization-01.png)

Follow the procedure below to deploy a cluster:

- Select **Cluster**: the **M0** Free cluster option.

> Note: You can deploy only one Free cluster per Atlas project.

- Select **Provider**: **M0** is available on AWS, GCP, and Azure.

- Select **Region**.

- Create a database user and add your IP address to the access list.

After you deploy a cluster, you can see the cluster you deployed as shown in the image below.

![mongodb-atlas-cluster-deploy](./assets/07-mongodb-atlas-initialization-02.png)

Click **Get connection string** in the image above to get the cluster URI.

Now set the value of `MONGODB_ATLAS_CLUSTER_URI` in the `.env` file or set it directly inside the Set environment variables cell.

The **connection string** resembles the following example:

> mongodb+srv://[databaseUser]:[databasePassword]@[clusterName].[hostName].mongodb.net/?retryWrites=true&w=majority
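
If your database username or password contains special characters such as `@`, `/`, or `:`, they must be percent-encoded before they go into the connection string. The sketch below shows one way to assemble the URI with the standard library; the function name `build_atlas_uri` and all sample values are hypothetical placeholders.

```python
from urllib.parse import quote_plus


def build_atlas_uri(user: str, password: str, host: str) -> str:
    """Build an SRV connection string with percent-encoded credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}"
        "/?retryWrites=true&w=majority"
    )


# All three values are placeholders, not real credentials.
print(build_atlas_uri("tutorial_user", "p@ss/word", "cluster0.example.mongodb.net"))
```

The printed URI contains `p%40ss%2Fword` instead of the raw password, so the `@` and `/` are no longer mistaken for URI delimiters.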


## What is MongoDB-Atlas?

[MongoDB Atlas](https://www.mongodb.com/en/atlas) is a multi-cloud database service that provides an easy way to host and manage your data in the cloud.

It improves security by blocking connections from every IP address except those you have approved.

Atlas provides both text-based search and vector-based similarity search, and you choose which fields to index, for example fields you plan to use later as a pre-filter.

This way you don't waste space indexing fields you never query.

You can change the settings of the indexed fields, along with other settings, on the Atlas web page.

## Prepare Data

This section guides you through the **data preparation process** .

This section includes the following components:

- Data Introduction

- Preprocess Data


### Data Introduction

In this tutorial, we will use the fairy tale **📗 The Little Prince** as our data.

This material complies with the **Apache 2.0 license**.

We use a text (.txt) version converted from the original PDF.

You can view the data at the link below.
- [Data Link](https://huggingface.co/datasets/sohyunwriter/the_little_prince)

### Preprocess Data

In this section, we preprocess the text of The Little Prince and convert it into a list of ```LangChain Document``` objects with metadata.

Each document chunk will include a ```title``` field in the metadata, extracted from the first line of each section.

In [5]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
import re
from typing import List


def preprocessing_data(content: str) -> List[Document]:
    # 1. Split the text by double newlines to separate sections
    blocks = content.split("\n\n")

    # 2. Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,  # Maximum number of characters per chunk
        chunk_overlap=50,  # Overlap between chunks to preserve context
        separators=["\n\n", "\n", " "],  # Order of priority for splitting
    )

    documents = []

    # 3. Loop through each section
    for block in blocks:
        lines = block.strip().splitlines()
        if not lines:
            continue

        # Extract title from the first line using square brackets [ ]
        first_line = lines[0]
        title_match = re.search(r"\[(.*?)\]", first_line)
        title = title_match.group(1).strip() if title_match else ""

        # Remove the title line from content
        body = "\n".join(lines[1:]).strip()
        if not body:
            continue

        # 4. Chunk the section using the text splitter
        chunks = text_splitter.split_text(body)

        # 5. Create a LangChain Document for each chunk with the same title metadata
        for chunk in chunks:
            documents.append(Document(page_content=chunk, metadata={"title": title}))

    print(f"Generated {len(documents)} chunked documents.")

    return documents

In [6]:
# Load the entire text file
with open("./data/the_little_prince.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Preprocess Data
docs = preprocessing_data(content=content)

Generated 262 chunked documents.
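
To see how the title extraction behaves in isolation, here is a small sketch that applies the same regular expression to a made-up block. The helper name `split_title_and_body` and the sample text are hypothetical; the real file may format its section headers differently.

```python
import re
from typing import Tuple


def split_title_and_body(block: str) -> Tuple[str, str]:
    """Split one section into (title, body) using the [ ... ] header convention."""
    lines = block.strip().splitlines()
    if not lines:
        return "", ""
    # Same pattern as in preprocessing_data: non-greedy match inside square brackets
    match = re.search(r"\[(.*?)\]", lines[0])
    title = match.group(1).strip() if match else ""
    body = "\n".join(lines[1:]).strip()
    return title, body


sample = "[ Chapter 1 ]\nfirst line of the body\nsecond line of the body"
title, body = split_title_and_body(sample)
print(repr(title))
print(repr(body))
```

When the first line has no brackets, the title falls back to an empty string and the first line is still dropped from the body, which mirrors the behavior of `preprocessing_data`.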


## Setting up MongoDB-Atlas

This part walks you through the initial setup of ```MongoDB-Atlas```.

This section includes the following components:

- Load Embedding Model

- Load ```MongoDB-Atlas``` Client

### Load Embedding Model

In this section, you'll learn how to load an embedding model.

This tutorial uses an **OpenAI** embedding model, which requires the ```OPENAI_API_KEY``` set earlier.

*💡 If you prefer to use another embedding model, see the instructions below.*
- [Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)

In [7]:
import os
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-3-large")

### Load MongoDB-Atlas Client

In this section, we'll show you how to load the **database client object** using **PyMongo**, the **Python SDK** for ```MongoDB-Atlas```.
- [Python SDK Docs](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/)

In [8]:
# Create Database Client Object Function
from pymongo import MongoClient
import certifi


def get_db_client(URI):
    """
    Initializes and returns a MongoDB client instance.

    This function creates a client object to interact with MongoDB-Atlas
    through the PyMongo driver, using the certifi CA bundle for TLS.

    Args:
        URI: The Atlas connection string (for example, the value of
            MONGODB_ATLAS_CLUSTER_URI).

    Returns:
        client: MongoClient - An instance of the PyMongo client.
    """
    client = MongoClient(URI, tlsCAFile=certifi.where())
    return client

In [9]:
# Get DB Client Object
URI = os.getenv("MONGODB_ATLAS_CLUSTER_URI")
client = get_db_client(URI)
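
`MongoClient` connects lazily, so creating the client does not by itself prove that the cluster is reachable. A common way to check is to send a `ping` command. The sketch below wraps that call in a hypothetical helper, `can_reach_cluster`, which accepts any client object exposing `admin.command`.

```python
def can_reach_cluster(client) -> bool:
    """Return True if the deployment answers a `ping` command."""
    try:
        client.admin.command("ping")
    except Exception as exc:  # broad on purpose: this is only a sketch
        print(f"Could not reach the cluster: {exc}")
        return False
    return True
```

With the `client` created above, you would call `can_reach_cluster(client)`. A `False` result usually points to a wrong URI, wrong credentials, or an IP address that is not on the access list.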

### Create Collection

Once you are successfully connected to ```MongoDB-Atlas```, a sample collection is already available.

In this tutorial, however, we will create a new collection with ```MongoDBAtlasCollectionManager```.

In [10]:
from utils.mongodb_atlas import MongoDBAtlasCollectionManager

# Get collectionManager
collectionManager = MongoDBAtlasCollectionManager(
    db_name="langchain-opentutorial-db", client=client
)

# Create new collection
collectionManager.create_collection("little-prince")

After you create a collection, you can check it on the Atlas web page.
![mongodb-atlas-collection](./assets/07-mongodb-atlas-database.png)

### Create Vector Search Index

To perform vector search in Atlas, you must create an **Atlas Vector Search Index**.

First, define either an **Atlas Search Index** or an **Atlas Vector Search Index** using a `SearchIndexModel` object.

- `definition` : the definition of the **Search Index**.

- `name` : the name used to refer to the **Search Index** in queries.

To learn more about `definition` of `SearchIndexModel` , see [Review Atlas Search Index Syntax](https://www.mongodb.com/docs/atlas/atlas-search/index-definitions/).

**[NOTE]**

If you want to use a metadata field as a filter, you must declare it when you define the index.

In the following example, ```title``` is declared as a filter field in ```vector_index``` for later use, but not in ```search_index```.

In [11]:
from pymongo.operations import SearchIndexModel

TEST_SEARCH_INDEX_NAME = "test_search_index"
TEST_VECTOR_SEARCH_INDEX_NAME = "test_vector_index"

search_index = SearchIndexModel(
    definition={
        "mappings": {"dynamic": True},
    },
    name=TEST_SEARCH_INDEX_NAME,
)

vector_index = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                # Infer the vector dimension from a sample embedding
                "numDimensions": len(embedding.embed_query("Hello")),
                "path": "embedding",
                "similarity": "cosine",
            },
            {
                "type": "filter",
                "path": "title"
            }
        ]
    },
    name=TEST_VECTOR_SEARCH_INDEX_NAME,
    type="vectorSearch",
)
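
If you index several collections or want more than one filter field, building the definition dictionary with a small helper keeps the shape consistent. This is a minimal sketch; `build_vector_index_definition` is a hypothetical helper, and the value `3072` assumes the default output size of `text-embedding-3-large`, which you should verify for your own model.

```python
from typing import Any, Dict, Iterable, List


def build_vector_index_definition(
    num_dimensions: int,
    filter_paths: Iterable[str] = (),
    embedding_path: str = "embedding",
    similarity: str = "cosine",
) -> Dict[str, Any]:
    """Return a vector search index definition with optional filter fields."""
    fields: List[Dict[str, Any]] = [
        {
            "type": "vector",
            "numDimensions": num_dimensions,
            "path": embedding_path,
            "similarity": similarity,
        }
    ]
    # Every metadata field used as a pre-filter needs its own "filter" entry
    fields.extend({"type": "filter", "path": path} for path in filter_paths)
    return {"fields": fields}


definition = build_vector_index_definition(3072, filter_paths=["title"])
print(definition)
```

The resulting dictionary has the same structure as the `definition` passed to `SearchIndexModel` above, without calling the embedding API just to measure the vector length.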

Now we can create the **indexes** based on the ```SearchIndexModel``` objects defined above.

In [12]:
# create actual index
collectionManager.create_index(TEST_SEARCH_INDEX_NAME, search_index)
collectionManager.create_index(TEST_VECTOR_SEARCH_INDEX_NAME, vector_index)

After you create the indexes, you can check them on the search tab of the Atlas web page.

![mongodb-atlas-search-index](./assets/07-mongodb-atlas-search-index-01.png)
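
Search indexes are built in the background, so a query issued right after creation may return nothing until the index is ready. The sketch below polls PyMongo's `Collection.list_search_indexes` until the index reports `queryable`. The helper name `wait_until_queryable` is hypothetical, and we assume you pass it a plain PyMongo collection such as `client["langchain-opentutorial-db"]["little-prince"]`.

```python
import time


def wait_until_queryable(
    collection, index_name: str, timeout: float = 120.0, interval: float = 2.0
) -> bool:
    """Poll the collection's search indexes until `index_name` is queryable."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for index in collection.list_search_indexes(index_name):
            if index.get("queryable"):
                return True
        time.sleep(interval)
    return False
```

The function returns `False` when the timeout expires, which lets the caller decide whether to wait longer or inspect the index status on the Atlas web page.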

## Document Manager

To support the **Langchain-Opentutorial** , we implemented a custom set of **CRUD** functionalities for VectorDBs. 

The following operations are included:

- ```upsert``` : Update existing documents or insert if they don’t exist

- ```upsert_parallel``` : Perform upserts in parallel for large-scale data

- ```search``` : Search for similar documents based on embeddings

- ```delete``` : Remove documents based on filter conditions

Each of these features is implemented as class methods specific to each VectorDB.

In this tutorial, you can easily utilize these methods to interact with your VectorDB.

*We plan to continuously expand the functionality by adding more common operations in the future.*

### Create Instance

First, we create an instance of the **MongoDB-Atlas** helper class to use its CRUD functionalities.

This class is initialized with the **MongoDB-Atlas Python SDK client instance** and the **embedding model instance** , both of which were defined in the previous section.

In [13]:
from utils.mongodb_atlas import MongoDBAtlasDocumentManager

crud_manager = MongoDBAtlasDocumentManager(
    client=client,
    db_name="langchain-opentutorial-db",
    collection_name="little-prince",
    embedding=embedding,
)

Now you can use the following **CRUD** operations with the ```crud_manager``` instance.

This instance allows you to easily manage documents in your ```MongoDB-Atlas``` collection.

### Upsert Document

**Update** existing documents or **insert** them if they don’t exist.

**✅ Args**

- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.

- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).

- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.

- ```**kwargs``` : Extra arguments for the underlying vector store.

**🔄 Return**

- None

In [14]:
from uuid import uuid4

ids = [str(uuid4()) for _ in docs]

args = {
    "texts": [doc.page_content for doc in docs[:2]],
    "metadatas": [doc.metadata for doc in docs[:2]],
    "ids": ids[:2],
    # Add additional parameters if you need
}
crud_manager.upsert(**args)
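
The IDs above come from `uuid4`, so they change every time the cell is re-run. Assuming the upsert matches existing documents by ID, a re-run would then insert new copies rather than update the old ones. One possible alternative, sketched below, derives a repeatable ID from the chunk itself with `uuid5`. The helper `stable_id` is hypothetical and not part of the tutorial's utilities.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_id(title: str, text: str) -> str:
    """Derive a repeatable ID from a chunk's title and content."""
    return str(uuid5(NAMESPACE_URL, f"{title}\n{text}"))


first = stable_id("Chapter 4", "Grown-ups are like that...")
second = stable_id("Chapter 4", "Grown-ups are like that...")
print(first == second)
```

With this approach you could build the list as `[stable_id(doc.metadata["title"], doc.page_content) for doc in docs]`. Two chunks with identical title and text would share one ID, so this only fits data where that collision is acceptable.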

### Upsert Parallel

Perform **upsert** in **parallel** for large-scale data.

**✅ Args**

- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.

- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).

- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.

- ```batch_size``` : int – Number of documents per batch (default: 32).

- ```workers``` : int – Number of parallel workers (default: 10).

- ```**kwargs``` : Extra arguments for the underlying vector store.

**🔄 Return**

- None

In [15]:
from uuid import uuid4

args = {
    "texts": [doc.page_content for doc in docs],
    "metadatas": [doc.metadata for doc in docs],
    "ids": ids,
    # Add additional parameters if you need
}

crud_manager.upsert_parallel(**args)
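
To make the roles of ```batch_size``` and ```workers``` concrete, the sketch below splits a list into batches and processes them with a thread pool. It illustrates the general pattern only and is not necessarily how `upsert_parallel` is implemented; `make_batches` and `run_in_parallel` are hypothetical names, and `len` stands in for the real per-batch upsert.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> List[Sequence[str]]:
    """Split `items` into consecutive slices of at most `batch_size` elements."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


def run_in_parallel(
    items: Sequence[str],
    handle_batch: Callable[[Sequence[str]], int],
    batch_size: int = 32,
    workers: int = 10,
) -> int:
    """Process every batch on a thread pool and return the total handled count."""
    batches = make_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map keeps batch order and re-raises any worker exception
        return sum(executor.map(handle_batch, batches))


texts = [f"chunk {i}" for i in range(262)]
processed = run_in_parallel(texts, handle_batch=len)
print(processed)  # 262
```

With 262 chunks and a batch size of 32, this produces nine batches, the last holding six items. Threads suit this workload because each batch spends most of its time waiting on network calls.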

### Similarity Search

Search for **similar documents** based on **embeddings** .

This method uses **cosine similarity**, the metric set in the vector index definition.
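
As a reminder of what the metric measures, here is a minimal pure-Python sketch of cosine similarity between two vectors. It is for illustration only; Atlas computes similarity on the server, and the function below is not part of the tutorial's utilities.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors `a` and `b`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

As far as we understand the Atlas Vector Search documentation, the score Atlas reports for `cosine` is normalized to the range 0 to 1 as `(1 + cosine) / 2`, so the scores printed by the search cells are not raw cosine values. Treat this as an assumption to verify against the official docs.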


**✅ Args**

- ```query``` : str – The text query for similarity search.

- ```k``` : int – Number of top results to return (default: 10).

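- ```filters``` : dict - Pre-filter applied before the similarity search. Consists of a ```field```, an ```operator```, and a ```value``` (default: None).

- ```**kwargs``` : Additional search options (e.g., ```vector_index```, the name of the vector search index to query).

**🔄 Return**

- ```results``` : A list of results ranked by similarity. In the examples below, each result is read like a dictionary with ```title```, ```page_content```, and ```score``` keys.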


To make a filter, pass a dictionary in the following form:
```python
{"field": {"operator": "value"}}
```

In [16]:
# Search by Query
results = crud_manager.search(
    query="What is essential is invisible to the eye.",
    k=3,
    vector_index="test_vector_index",
)
for idx, doc in enumerate(results):
    print(f"Rank {idx} | Title : {doc['title']}")
    print(f"Contents : {doc['page_content']}")
    print(f"Score : {doc['score']}")
    print()

Rank 0 | Title : Chapter 4
Contents : If I have told you these details about the asteroid, and made a note of its number for you, it is on account of the grown-ups and their ways. When you tell them that you have made a new friend, they never ask you any questions about essential matters. They never say to you, "What does his voice sound like? What games does he love best? Does he collect butterflies?" Instead, they demand: "How old is he? How many brothers has he? How much does he weigh? How much money does his father make?" Only
Score : 0.6709620356559753

Rank 1 | Title : Chapter 13
Contents : "Eh? Are you still there? Five-hundred-and-one million-- I can‘t stop... I have so much to do! I am concerned with matters of consequence. I don‘t amuse myself with balderdash. Two and five make seven..." 
"Five-hundred-and-one million what?" repeated the little prince, who never in his life had let go of a question once he had asked it.
The businessman raised his head.
Score : 0.6636617779731

Now, let us create a filter that restricts the search to chunks whose **title equals Chapter 4**.

```MongoDBAtlas``` supports the following operators.

|Operator|Type|Description|
|---|---|---|
|$eq|Equals|equal to the value|
|$ne|Equals|not equal to the value|
|$gt|Range|greater than the value|
|$lt|Range|less than the value|
|$gte|Range|greater or equal to the value|
|$lte|Range|less or equal to the value|
|$in|Inclusive|included among the values|
|$nin|Inclusive|not included among the values|
|$not|Logical|logical not|
|$nor|Logical|logical nor|
|$and|Logical|logical and|
|$or|Logical|logical or|
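
To get a feel for how these operators combine, the sketch below evaluates a filter dictionary against a metadata dictionary locally. It is a toy illustration covering only a subset of the operators (`$not` and `$nor` are omitted) and is not how Atlas evaluates filters on the server; `matches` is a hypothetical helper.

```python
from typing import Any, Dict

# Comparison operators; range operators treat a missing field as a non-match
_COMPARATORS = {
    "$eq": lambda actual, expected: actual == expected,
    "$ne": lambda actual, expected: actual != expected,
    "$gt": lambda actual, expected: actual is not None and actual > expected,
    "$gte": lambda actual, expected: actual is not None and actual >= expected,
    "$lt": lambda actual, expected: actual is not None and actual < expected,
    "$lte": lambda actual, expected: actual is not None and actual <= expected,
    "$in": lambda actual, expected: actual in expected,
    "$nin": lambda actual, expected: actual not in expected,
}


def matches(metadata: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies every condition in `filters`."""
    for key, condition in filters.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            actual = metadata.get(key)
            for operator, expected in condition.items():
                if not _COMPARATORS[operator](actual, expected):
                    return False
    return True


chunk = {"title": "Chapter 4"}
print(matches(chunk, {"title": {"$eq": "Chapter 4"}}))
print(matches(chunk, {"title": {"$in": ["Chapter 5", "Chapter 6"]}}))
print(matches(chunk, {"$or": [{"title": {"$eq": "Chapter 4"}}, {"title": {"$eq": "Chapter 5"}}]}))
```

The same dictionary shapes, for example `{"title": {"$in": ["Chapter 4", "Chapter 5"]}}`, are what you would pass as ```filters``` to the search method, provided the field was declared as a filter in the vector index.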

In [17]:
# Create Filter
filters = {"title": {"$eq": "Chapter 4"}}

# Filter Search
results = crud_manager.search(
    query="Which asteroid did the little prince come from?",
    k=3,
    filters=filters,
    vector_index="test_vector_index",
)
for idx, doc in enumerate(results):
    print(f"Rank {idx} | Title : {doc['title']}")
    print(f"Contents : {doc['page_content']}")
    print(f"Score : {doc['score']}")
    print()

Rank 0 | Title : Chapter 4
Contents : Grown-ups are like that... 
Fortunately, however, for the reputation of Asteroid B-612, a Turkish dictator made a law that his subjects, under pain of death, should change to European costume. So in 1920 the astronomer gave his demonstration all over again, dressed with impressive style and elegance. And this time everybody accepted his report. 
(picture)
Score : 0.8311741352081299

Rank 1 | Title : Chapter 4
Contents : - the narrator speculates as to which asteroid from which the little prince came　　
I had thus learned a second fact of great importance: this was that the planet the little prince came from was scarcely any larger than a house!
Score : 0.8178739547729492

Rank 2 | Title : Chapter 4
Contents : If you were to say to the grown-ups: "I saw a beautiful house made of rosy brick, with geraniums in the windows and doves on the roof," they would not be able to get any idea of that house at all. You would have to say to them: "I saw a house t

### Delete Document

Remove documents based on IDs or filter conditions.

**✅ Args**

- ```ids``` : Optional[List[str]] – List of document IDs to delete. If None, deletion is based on filter.

- ```filters``` : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).

- ```**kwargs``` : Any additional parameters.

**🔄 Return**

- None

In [18]:
# Delete by ids
ids_to_delete = ids[:2]  # The 'ids' values you want to delete
crud_manager.delete(ids=ids_to_delete)

In [19]:
# Delete by ids with filters
# Pass the candidate `ids` together with a filter on the title (here, chapter 6)
crud_manager.delete(ids=ids, filters={"title": {"$eq": "Chapter 6"}})

In [20]:
# Delete All
crud_manager.delete()