title | description | author | ms.service | ms.subservice | ms.topic | ms.date | ms.author | ms.reviewer | ms.custom |
---|---|---|---|---|---|---|---|---|---|
Use the graph bulk executor .NET library with Azure Cosmos DB Gremlin API |
Learn how to use the bulk executor library to massively import graph data into an Azure Cosmos DB Gremlin API container. |
jasonwhowell |
cosmos-db |
cosmosdb-graph |
how-to |
05/28/2019 |
jasonh |
sngun |
devx-track-csharp |
Using the graph bulk executor .NET library to perform bulk operations in Azure Cosmos DB Gremlin API
This tutorial provides instructions about using Azure CosmosDB's bulk executor .NET library to import and update graph objects into an Azure Cosmos DB Gremlin API container. This process makes use of the Graph class in the bulk executor library to create Vertex and Edge objects programmatically to then insert multiple of them per network request. This behavior is configurable through the bulk executor library to make optimal use of both database and local memory resources.
As opposed to sending Gremlin queries to a database, where the command is evaluated and then executed one at a time, using the bulk executor library will instead require to create and validate the objects locally. After creating the objects, the library allows you to send graph objects to the database service sequentially. Using this method, data ingestion speeds can be increased up to 100x, which makes it an ideal method for initial data migrations or periodical data movement operations. Learn more by visiting the GitHub page of the Azure Cosmos DB Graph bulk executor sample application.
The bulk executor library contains a Microsoft.Azure.CosmosDB.BulkExecutor.Graph
namespace to provide functionality for creating and importing graph objects.
The following process outlines how data migration can be used for a Gremlin API container:
- Retrieve records from the data source.
- Construct
GremlinVertex
andGremlinEdge
objects from the obtained records and add them into anIEnumerable
data structure. In this part of the application the logic to detect and add relationships should be implemented, in case the data source is not a graph database. - Use the Graph BulkImportAsync method to insert the graph objects into the collection.
This mechanism will improve the data migration efficiency as compared to using a Gremlin client. This improvement is experienced because inserting data with Gremlin will require the application send a query at a time that will need to be validated, evaluated, and then executed to create the data. The bulk executor library will handle the validation in the application and send multiple graph objects at a time for each network request.
GraphBulkExecutor
provides the BulkImportAsync
method that requires a IEnumerable
list of GremlinVertex
or GremlinEdge
objects, both defined in the Microsoft.Azure.CosmosDB.BulkExecutor.Graph.Element
namespace. In the sample, we separated the edges and vertices into two BulkExecutor import tasks. See the example below:
IBulkExecutor graphbulkExecutor = new GraphBulkExecutor(documentClient, targetCollection);
BulkImportResponse vResponse = null;
BulkImportResponse eResponse = null;
try
{
// Import a list of GremlinVertex objects
vResponse = await graphbulkExecutor.BulkImportAsync(
Utils.GenerateVertices(numberOfDocumentsToGenerate),
enableUpsert: true,
disableAutomaticIdGeneration: true,
maxConcurrencyPerPartitionKeyRange: null,
maxInMemorySortingBatchSize: null,
cancellationToken: token);
// Import a list of GremlinEdge objects
eResponse = await graphbulkExecutor.BulkImportAsync(
Utils.GenerateEdges(numberOfDocumentsToGenerate),
enableUpsert: true,
disableAutomaticIdGeneration: true,
maxConcurrencyPerPartitionKeyRange: null,
maxInMemorySortingBatchSize: null,
cancellationToken: token);
}
catch (DocumentClientException de)
{
Trace.TraceError("Document client exception: {0}", de);
}
catch (Exception e)
{
Trace.TraceError("Exception: {0}", e);
}
For more information on the parameters of the bulk executor library, refer to the BulkImportData to Azure Cosmos DB topic.
The payload needs to be instantiated into GremlinVertex
and GremlinEdge
objects. Here is how these objects can be created:
Vertices:
// Creating a vertex
GremlinVertex v = new GremlinVertex(
"vertexId",
"vertexLabel");
// Adding custom properties to the vertex
v.AddProperty("customProperty", "value");
// Partitioning keys must be specified for all vertices
v.AddProperty("partitioningKey", "value");
Edges:
// Creating an edge
GremlinEdge e = new GremlinEdge(
"edgeId",
"edgeLabel",
"targetVertexId",
"sourceVertexId",
"targetVertexLabel",
"sourceVertexLabel",
"targetVertexPartitioningKey",
"sourceVertexPartitioningKey");
// Adding custom properties to the edge
e.AddProperty("customProperty", "value");
Note
The bulk executor utility doesn't automatically check for existing Vertices before adding Edges. This needs to be validated in the application before running the BulkImport tasks.
- Visual Studio 2019 with the Azure development workload. You can get started with the Visual Studio 2019 Community Edition for free.
- An Azure subscription. You can create a free Azure account here. Alternatively, you can create a Cosmos database account with Try Azure Cosmos DB for free without an Azure subscription.
- An Azure Cosmos DB Gremlin API database with an unlimited collection. This guide shows how to get started with Azure Cosmos DB Gremlin API in .NET.
- Git. For more information check out the Git Downloads page.
In this tutorial, we'll follow through the steps for getting started by using the Azure Cosmos DB Graph bulk executor sample hosted on GitHub. This application consists of a .NET solution that randomly generates vertex and edge objects and then executes bulk insertions to the specified graph database account. To get the application, run the git clone
command below:
git clone https://github.com/Azure-Samples/azure-cosmosdb-graph-bulkexecutor-dotnet-getting-started.git
This repository contains the GraphBulkExecutor sample with the following files:
File | Description |
---|---|
App.config |
This is where the application and database-specific parameters are specified. This file should be modified first to connect to the destination database and collections. |
Program.cs |
This file contains the logic behind creating the DocumentClient collection, handling the cleanups and sending the bulk executor requests. |
Util.cs |
This file contains a helper class that contains the logic behind generating test data, and checking if the database and collections exist. |
In the App.config
file, the following are the configuration values that can be provided:
Setting | Description |
---|---|
EndPointUrl |
This is your .NET SDK endpoint found in the Overview blade of your Azure Cosmos DB Gremlin API database account. This has the format of https://your-graph-database-account.documents.azure.com:443/ |
AuthorizationKey |
This is the Primary or Secondary key listed under your Azure Cosmos DB account. Learn more about Securing Access to Azure Cosmos DB data |
DatabaseName , CollectionName |
These are the target database and collection names. When ShouldCleanupOnStart is set to true these values, along with CollectionThroughput , will be used to drop them and create a new database and collection. Similarly, if ShouldCleanupOnFinish is set to true , they will be used to delete the database as soon as the ingestion is over. Note that the target collection must be an unlimited collection. |
CollectionThroughput |
This is used to create a new collection if the ShouldCleanupOnStart option is set to true . |
ShouldCleanupOnStart |
This will drop the database account and collections before the program is run, and then create new ones with the DatabaseName , CollectionName and CollectionThroughput values. |
ShouldCleanupOnFinish |
This will drop the database account and collections with the specified DatabaseName and CollectionName after the program is run. |
NumberOfDocumentsToImport |
This will determine the number of test vertices and edges that will be generated in the sample. This number will apply to both vertices and edges. |
NumberOfBatches |
This will determine the number of test vertices and edges that will be generated in the sample. This number will apply to both vertices and edges. |
CollectionPartitionKey |
This will be used to create the test vertices and edges, where this property will be auto-assigned. This will also be used when re-creating the database and collections if the ShouldCleanupOnStart option is set to true . |
- Add your specific database configuration parameters in
App.config
. This will be used to create a DocumentClient instance. If the database and container have not been created yet, they will be created automatically. - Run the application. This will call
BulkImportAsync
two times, one to import Vertices and one to import Edges. If any of the objects generates an error when they're inserted, they will be added to either.\BadVertices.txt
or.\BadEdges.txt
. - Evaluate the results by querying the graph database. If the
ShouldCleanupOnFinish
option is set to true, then the database will automatically be deleted.
- To learn about NuGet package details and release notes of bulk executor .NET library, see bulk executor SDK details.
- Check out the Performance Tips to further optimize the usage of bulk executor.
- Review the BulkExecutor.Graph Reference article for more details about the classes and methods defined in this namespace.