The objective of this exercise is to illustrate the concept of sharding, a database partitioning technique for storing large data collections across multiple database servers. For this purpose, you will work with MongoDB, a document oriented database management system supporting different sharding strategies.
- Requirements
- Installation and Configuration
- Introduction to MongoDB & Sharded Clusters
- Preparing a Sharded Cluster
- Sharding Database Collections
- Balancing Data Across Shards
- Guiding Partitioning Using Tags
- Uninstalling (if locally installed)
- Docker environment:
- Option 1 (prefered): play-with-docker.com
- Option 2: Docker Toolbox or Docker CE depending on your OS
- Hands-on material (github repository)
Get the hands-on material:
git clone https://github.com/javieraespinosa/dxlab-sharding.git
Download the MongoDB docker image:
# Enter lab folder
cd dxlab-sharding
# Downloads images specified in docker-compose.yml
docker-compose pull
Verify the existence of the image:
docker images
If you are using Docker Toolbox, open VirtualBox and configure the default virtual machine to map host port 8888 to guest port 8888 on TCP (see instructions).
That's it! You are ready to rock.
As stated before, MongoDB is a document oriented database management system (i.e., you insert/get/update/delete documents).
In MongoDB, documents are represented using JSON, a text-based data format. The following example illustrates a JSON document containing information about the city of Washigton:
{
"city" : "WASHINGTON",
"loc" : [ -69.384237, 44.269281 ],
"pop" : 1261,
"state" : "ME"
}
Internally, MongoDB organizes documents in collections, and collections in databases. A mongo database may contain multiple collections.
MongoDB does not support SQL; queries are expressed as programs (see examples).
MongoDB supports sharding via a sharded cluster. A sharded cluster is composed of the following components:
- Shard(s): store data.
- Query router(s): redirect queries/operations to the appropriate shard (or shards).
- Config server(s): store cluster’s metadata. Query router(s) uses this metadata to select appropriate shards.
In MongoDB, sharding is enabled on a per-collection basis. When enabled, MongoDB distributes the documents of the collection across the shards (i.e. mongo servers) of a cluster.
For simplicity, you will start working with a partially configured cluster. The cluster will be composed of:
- 1 query router
- 1 config server
- 3 shards
Start the cluster as follows:
# Start docker containers
docker-compose up -d
Cluster components run inside a virtual network as docker containers. You can list containers (and their IPs) as follows:
# List containers
docker ps
# List containers IPs
docker network inspect dxlab-sharding_default
As shown in the figure, only the query router and config server are configured as part of a cluster. Let's complete the cluster by adding a shard server.
Enter the cluster environment:
docker-compose run --rm cli
Connect to the query router:
mongo --host queryrouter.docker
Add shard 1 to cluster:
// Change database
use admin
// Add shard to cluster
db.runCommand({
addShard: "shard1.docker",
name: "shard1"
})
You can verify the cluster status as follows:
sh.status()
Disconnect now from query router (ctr + c
).
At this point, your cluster does not contain any data (i.e., you have not created any collection or database). The following example illustrates how to import data to your cluster:
mongoimport \
--host queryrouter.docker \
--db mydb \
--collection cities \
--file ./cities.txt
This imports the content of cities.txt into collection cities of database mydb. Verify this by connecting to the query router:
# Connect to query router
mongo --host queryrouter.docker
Explore the database:
// List databases
show dbs
// Change to database mydb
use mydb
// List collections
show collections
Run some queries over cities:
// SELECT * FROM Cities
db.cities.find().pretty()
// SELECT COUNT(*) FROM Cities
db.cities.count()
Recall that in MongoDB, sharding is enabled on a per-collection basis. When enabled, MongoDB uses the shard key (an attribute that must exists in every document), for partitioning and distributing the collection documents across the cluster.
MongoDB uses two kinds of partitioning strategies:
-
Range based partitioning: data is partitioned into intervals
[min, max]
called chunks. Chunks limits depend on the domain of the shard key. -
Hash based partitioning: data is partitioned using a hash function.
In what follows, you will experiment with both kind of strategies.
Create a new collection:
db.createCollection("cities1")
show collections
Enable sharding on this collection using attribute state as shard key:
sh.enableSharding("mydb")
sh.shardCollection("mydb.cities1", { "state": 1} )
Verify the cluster state:
sh.status()
Populate the collection using mydb.cities:
db.cities.find().forEach(
function(doc) {
db.cities1.insert(doc);
}
)
Verify the new cluster state:
sh.status()
Create a new collection:
db.createCollection("cities2")
show collections
Enable sharding on this collection using also attribute state as shard key:
sh.shardCollection(
"mydb.cities2", { "state": "hashed" }
)
Verify the cluster state:
sh.status()
Populate the collection using mydb.cities:
db.cities.find().forEach(
function(doc) {
db.cities2.insert(doc);
}
)
Verify the new cluster state:
sh.status()
Independently of the selected partition strategy, when a shard server has too many chunks (compared to other shards in the cluster), MongoDB automatically redistributes the chunks across shards. This process is called cluster balancing.
Let's trigger the balancing process by adding more shards to your cluster.
use admin
db.runCommand( { addShard: "shard2.docker", name: "shard2" } )
db.runCommand( { addShard: "shard3.docker", name: "shard3" } )
Wait a few seconds. Then, check again the cluster status and compare to the previous result.
sh.status()
MongoDB also supports tagging a range of shard key values. Some advantages of using tags are:
- Isolation of data.
- Colocation of shards in geographical related regions.
In what follows, you will use tags for isolating some cities into specific shard servers.
Associate tags to shards:
sh.addShardTag("shard1", "CA")
sh.addShardTag("shard2", "NY")
sh.addShardTag("shard3", "Others")
Create, populate and enable sharding on a new collection:
use mydb
db.createCollection("cities3")
sh.shardCollection("mydb.cities3", { "state": 1} )
db.cities.find().forEach(
function(doc) {
db.cities3.insert(doc);
}
)
Define and associate key ranges [from, to)
to shards:
sh.addTagRange("mydb.cities3", { state: MinKey }, { state: "CA" }, "Others")
sh.addTagRange("mydb.cities3", { state: "CA" }, { state: "CA_" }, "CA")
sh.addTagRange("mydb.cities3", { state: "CA_" }, { state: "NY" }, "Others")
sh.addTagRange("mydb.cities3", { state: "NY" }, { state: "NY_" }, "NY")
sh.addTagRange("mydb.cities3", { state: "NY_" }, { state: MaxKey }, "Others")
Analize new cluster state:
sh.status()
Not sure about data distribution? Try the sharding notebook.
Disconnect from query router (ctr + c
).
Disconnect from cli (ctr + d
).
Stop containers and remove docker images:
docker-compose down
docker rmi -f mongo:3.0