
Sharding Data Collections with MongoDB

The objective of this exercise is to illustrate the concept of sharding, a database partitioning technique for storing large data collections across multiple database servers. For this purpose, you will work with MongoDB, a document-oriented database management system that supports different sharding strategies.

Requirements

Installation and Configuration

Get the hands-on material:

git clone https://github.com/javieraespinosa/dxlab-sharding.git

Download the MongoDB docker image:

# Enter lab folder
cd dxlab-sharding

# Downloads images specified in docker-compose.yml
docker-compose pull

Verify the existence of the image:

docker images

If you are using Docker Toolbox, open VirtualBox and configure the default virtual machine to map host port 8888 to guest port 8888 on TCP (see instructions).

That's it! You are ready to rock.

Introduction to MongoDB & Sharded Clusters

MongoDB

As stated before, MongoDB is a document-oriented database management system (i.e., you insert/get/update/delete documents).

In MongoDB, documents are represented using JSON, a text-based data format. The following example illustrates a JSON document containing information about the city of Washington:

{ 
  "city" : "WASHINGTON", 
  "loc" : [ -69.384237, 44.269281 ], 
  "pop" : 1261, 
  "state" : "ME"
}

Internally, MongoDB organizes documents in collections, and collections in databases. A mongo database may contain multiple collections.

MongoDB does not support SQL; queries are expressed as programs (see examples).
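To make "queries are expressed as programs" more concrete, here is a minimal sketch in plain JavaScript of matching documents against a query object. It is a toy model, not the MongoDB engine: the `matches` and `find` helpers and the second sample document are hypothetical and exist only for illustration. In the mongo shell, the equivalent query would be written `db.cities.find({ state: "ME" })`.

```javascript
// Toy in-memory "collection". The first document is the example above;
// the second one is made up for illustration.
const cities = [
  { city: "WASHINGTON", loc: [-69.384237, 44.269281], pop: 1261, state: "ME" },
  { city: "EXAMPLEVILLE", loc: [-120.0, 37.0], pop: 50000, state: "CA" },
];

// Returns true when every field named in the query equals the document's field.
// Only equality on top-level scalar fields is modeled here.
function matches(doc, query) {
  return Object.keys(query).every((field) => doc[field] === query[field]);
}

// Hypothetical stand-in for db.cities.find(query).
function find(collection, query) {
  return collection.filter((doc) => matches(doc, query));
}

const inMaine = find(cities, { state: "ME" });
console.log(inMaine.map((doc) => doc.city)); // [ 'WASHINGTON' ]
```

The query itself is just a document describing the fields you are looking for, which is why an empty query `{}` matches every document.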

Sharded Clusters

MongoDB supports sharding via a sharded cluster. A sharded cluster is composed of the following components:

  • Shard(s): store data.
  • Query router(s): redirect queries/operations to the appropriate shard (or shards).
  • Config server(s): store the cluster's metadata. Query routers use this metadata to select the appropriate shards.

In MongoDB, sharding is enabled on a per-collection basis. When enabled, MongoDB distributes the documents of the collection across the shards (i.e. mongo servers) of a cluster.
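The division of labor between the three components can be sketched in plain JavaScript as follows. This is only a simplified model under stated assumptions: the `chunkTable` contents, the split point `"M"`, and the `route` helper are hypothetical, and a real query router handles many more cases. The sketch shows the key idea: a query that includes the shard key can be sent to a single shard, whereas a query without it must be broadcast to every shard.

```javascript
// Hypothetical metadata, as a config server might hold it: each chunk
// covers the shard key range [min, max) and lives on one shard.
const chunkTable = [
  { min: null, max: "M", shard: "shard1" }, // null min = unbounded below
  { min: "M", max: null, shard: "shard2" }, // null max = unbounded above
];

// True when key falls inside the chunk's half-open range [min, max).
function inRange(key, chunk) {
  const aboveMin = chunk.min === null || key >= chunk.min;
  const belowMax = chunk.max === null || key < chunk.max;
  return aboveMin && belowMax;
}

// Returns the list of shards that a query must visit.
function route(query, shardKey) {
  if (shardKey in query) {
    // Targeted query: only the shard owning the matching chunk is contacted.
    return chunkTable
      .filter((chunk) => inRange(query[shardKey], chunk))
      .map((chunk) => chunk.shard);
  }
  // No shard key in the query: broadcast to every shard that holds chunks.
  return [...new Set(chunkTable.map((chunk) => chunk.shard))];
}

console.log(route({ state: "CA" }, "state")); // [ 'shard1' ]
console.log(route({ pop: 1261 }, "state"));   // [ 'shard1', 'shard2' ]
```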

Preparing a Sharded Cluster

For simplicity, you will start working with a partially configured cluster. The cluster will be composed of:

  • 1 query router
  • 1 config server
  • 3 shards

Start the cluster as follows:

# Start docker containers
docker-compose up -d

Cluster components run inside a virtual network as docker containers. You can list containers (and their IPs) as follows:

# List containers 
docker ps

# List containers IPs
docker network inspect dxlab-sharding_default

Cluster configuration

As shown in the figure, only the query router and config server are configured as part of a cluster. Let's complete the cluster by adding a shard server.

Enter the cluster environment:

docker-compose run --rm cli

Connect to the query router:

mongo --host queryrouter.docker

Add shard 1 to cluster:

// Change database 
use admin

// Add shard to cluster 
db.runCommand({ 
  addShard: "shard1.docker", 
  name: "shard1" 
})

You can verify the cluster status as follows:

sh.status()

Now disconnect from the query router (ctrl + c).

Q1. What important information does sh.status() report?

Inserting and Querying data

At this point, your cluster does not contain any data (i.e., you have not created any collection or database). The following example illustrates how to import data to your cluster:

mongoimport \
  --host queryrouter.docker \
  --db mydb \
  --collection cities \
  --file ./cities.txt

This imports the content of cities.txt into collection cities of database mydb. Verify this by connecting to the query router:

# Connect to query router
mongo --host queryrouter.docker

Explore the database:

// List databases
show dbs

// Change to database mydb
use mydb 

// List collections
show collections

Run some queries over cities:

// SELECT * FROM Cities
db.cities.find().pretty()

// SELECT COUNT(*) FROM Cities
db.cities.count()

Q2. Describe the content of the collection in natural language.

Sharding Database Collections

Recall that in MongoDB, sharding is enabled on a per-collection basis. When enabled, MongoDB uses the shard key (an attribute that must exist in every document) to partition and distribute the collection's documents across the cluster.

MongoDB uses two kinds of partitioning strategies:

  • Range based partitioning: data is partitioned into intervals [min, max) called chunks. Chunk limits depend on the domain of the shard key.

  • Hash based partitioning: data is partitioned using a hash function.

In what follows, you will experiment with both strategies.
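Before running the experiments, the difference between the two strategies can be sketched in plain JavaScript. Everything here is an illustrative assumption: the split points, the number of chunks, and in particular the hash function (FNV-1a, a simple stand-in and not the hash MongoDB uses). The point is only that range partitioning keeps neighboring key values in the same chunk, while hash partitioning assigns chunks based on the hashed value rather than the key order.

```javascript
// Hypothetical chunk boundaries for range partitioning: 3 split points
// give 4 chunks, numbered 0..3.
const splitPoints = ["G", "N", "T"];

// Range partitioning: the chunk index is the number of split points <= key.
function rangeChunk(key) {
  return splitPoints.filter((point) => key >= point).length;
}

// FNV-1a 32-bit hash: a simple stand-in, NOT the hash MongoDB uses.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep an unsigned 32-bit value
  }
  return hash;
}

const NUM_CHUNKS = 4;

// Hash partitioning: split the 32-bit hash space into NUM_CHUNKS equal ranges
// and pick the range that contains the hashed key.
function hashChunk(key) {
  return Math.floor((fnv1a(key) / 2 ** 32) * NUM_CHUNKS);
}

for (const state of ["CA", "CO", "CT", "NY", "TX"]) {
  console.log(state, "range chunk:", rangeChunk(state), "hash chunk:", hashChunk(state));
}
```

In this sketch, "CA", "CO", and "CT" always share range chunk 0 because they are adjacent in key order; their hash chunks depend only on the hash values.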

Range Based Partitioning

Create a new collection:

db.createCollection("cities1")
show collections

Enable sharding on this collection using the attribute state as the shard key:

sh.enableSharding("mydb") 
sh.shardCollection("mydb.cities1", { "state": 1} )

Verify the cluster state:

sh.status()

Q3. How many chunks did you create? What are their associated ranges? Support your answer with a screenshot of the command output.

Populate the collection using mydb.cities:

db.cities.find().forEach( 
   function(doc) {
      db.cities1.insert(doc); 
   }
)

Verify the new cluster state:

sh.status()

Q4. How many chunks are there now? What are their associated ranges? Which changes stand out? Support your answer with a screenshot of the command output.

Hash Based Partitioning

Create a new collection:

db.createCollection("cities2")
show collections

Enable sharding on this collection, again using the attribute state as the shard key:

sh.shardCollection(
  "mydb.cities2", { "state": "hashed" } 
)

Verify the cluster state:

sh.status()

Q5. How many chunks did you create? What differences do you see compared with the same step under range based partitioning? Support your answer with a screenshot of the command output.

Populate the collection using mydb.cities:

db.cities.find().forEach( 
   function(doc) {
      db.cities2.insert(doc); 
   }
)

Verify the new cluster state:

sh.status()

Q6. How many chunks are there now? Compare the result with range based partitioning. Support your answer with a screenshot of the command output.

Balancing Data Across Shards

Independently of the selected partition strategy, when a shard server has too many chunks (compared to other shards in the cluster), MongoDB automatically redistributes the chunks across shards. This process is called cluster balancing.
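The balancing idea can be sketched in plain JavaScript as follows. This is a simplified model and not MongoDB's balancer: the `balance` helper, the initial chunk counts, and the migration threshold of 2 are assumptions chosen for illustration. The sketch repeatedly moves one chunk from the shard with the most chunks to the shard with the fewest, until the difference falls below the threshold.

```javascript
// Moves one chunk at a time from the fullest shard to the emptiest one
// until their chunk counts differ by less than `threshold`.
function balance(chunkCounts, threshold) {
  if (threshold < 2) {
    // With a threshold below 2 a single chunk could bounce back and forth forever.
    throw new RangeError("threshold must be at least 2");
  }
  const counts = { ...chunkCounts }; // work on a copy
  const migrations = [];
  while (true) {
    const shards = Object.keys(counts);
    const fullest = shards.reduce((a, b) => (counts[b] > counts[a] ? b : a));
    const emptiest = shards.reduce((a, b) => (counts[b] < counts[a] ? b : a));
    if (counts[fullest] - counts[emptiest] < threshold) break;
    counts[fullest] -= 1;
    counts[emptiest] += 1;
    migrations.push({ from: fullest, to: emptiest });
  }
  return { counts, migrations };
}

// Hypothetical starting point: all chunks on shard1, two empty shards just added.
const result = balance({ shard1: 6, shard2: 0, shard3: 0 }, 2);
console.log(result.counts);            // { shard1: 2, shard2: 2, shard3: 2 }
console.log(result.migrations.length); // 4
```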


Let's trigger the balancing process by adding more shards to your cluster.

use admin
db.runCommand( { addShard: "shard2.docker", name: "shard2" } )
db.runCommand( { addShard: "shard3.docker", name: "shard3" } ) 

Wait a few seconds. Then check the cluster status again and compare it with the previous result.

sh.status()

Q7. Draw the new configuration of the cluster and label each element (router, config server and shards) with its corresponding IP.

Guiding Partitioning Using Tags

MongoDB also supports tagging a range of shard key values. Some advantages of using tags are:

  • Isolation of data.
  • Colocation of shards in geographically related regions.

In what follows, you will use tags for isolating some cities into specific shard servers.

Associate tags to shards:

sh.addShardTag("shard1", "CA") 
sh.addShardTag("shard2", "NY") 
sh.addShardTag("shard3", "Others")

Create, populate and enable sharding on a new collection:

use mydb

db.createCollection("cities3") 
sh.shardCollection("mydb.cities3", { "state": 1} ) 

db.cities.find().forEach( 
   function(doc) { 
      db.cities3.insert(doc); 
   } 
)

Define key ranges [from, to) and associate them with tags:

sh.addTagRange("mydb.cities3", { state: MinKey }, { state: "CA" }, "Others") 
sh.addTagRange("mydb.cities3", { state: "CA" }, { state: "CA_" }, "CA") 
sh.addTagRange("mydb.cities3", { state: "CA_" }, { state: "NY" }, "Others") 
sh.addTagRange("mydb.cities3", { state: "NY" }, { state: "NY_" }, "NY") 
sh.addTagRange("mydb.cities3", { state: "NY_" }, { state: MaxKey }, "Others")
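To see how these half-open ranges classify a shard key value, here is a minimal sketch in plain JavaScript. It is not how MongoDB evaluates tag ranges internally: the `tagFor` helper and the `MIN_KEY`/`MAX_KEY` symbols (stand-ins for the shell's MinKey and MaxKey) are hypothetical, and the sketch assumes that state codes are uppercase ASCII strings so that JavaScript string comparison orders them the same way.

```javascript
// Stand-ins for the shell's MinKey / MaxKey sentinels.
const MIN_KEY = Symbol("MinKey");
const MAX_KEY = Symbol("MaxKey");

// The same five ranges as in the sh.addTagRange calls above.
const tagRanges = [
  { from: MIN_KEY, to: "CA", tag: "Others" },
  { from: "CA", to: "CA_", tag: "CA" },
  { from: "CA_", to: "NY", tag: "Others" },
  { from: "NY", to: "NY_", tag: "NY" },
  { from: "NY_", to: MAX_KEY, tag: "Others" },
];

// Tag-to-shard association from the sh.addShardTag calls above.
const tagToShard = { CA: "shard1", NY: "shard2", Others: "shard3" };

// Returns the tag of the range [from, to) that contains the given state.
function tagFor(state) {
  const range = tagRanges.find(
    (r) =>
      (r.from === MIN_KEY || state >= r.from) &&
      (r.to === MAX_KEY || state < r.to)
  );
  return range.tag;
}

for (const state of ["AL", "CA", "CO", "NY", "TX"]) {
  const tag = tagFor(state);
  console.log(state, "->", tag, "->", tagToShard[tag]);
}
```

Because the ranges are contiguous from MinKey to MaxKey, every state value falls into exactly one range in this sketch.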

Analyze the new cluster state:

sh.status()

Q8. Analyze the results and explain the logic behind this tagging strategy. Connect to the shard that contains the data about California and count the documents. Do the same with the other shards. Is the sharded data collection complete with respect to the initial one? Are shards orthogonal?

Not sure about data distribution? Try the sharding notebook.

Uninstalling (if locally installed)

Disconnect from the query router (ctrl + c).

Disconnect from the cli (ctrl + d).

Stop containers and remove docker images:

docker-compose down
docker rmi -f mongo:3.0 
