Vagrant + Cascading + Hadoop 2 Cluster
The Cascading 2.7 SDK includes Cascading and many of its sub-projects:
- Lingual - ANSI SQL Command Shell and JDBC Driver
- Pattern - Machine Learning
- Cascalog - Clojure DSL over Cascding
- Scalding - Scala DSL over Cascading
- Multitool - Command line tool for managing large files
- Load - Command line tool for load testing Hadoop
To make getting started as easy as possible this setup also includes build tools used by parts of the SDK:
- gradle - build tool used by Cascading and its related projects
- leiningen 2 - a popular build tool in the clojure community, which is used in the cascalog tutorial included in the SDK
- sbt - a popular build tool in the scala community, which is used in the scalding tutorial included in the SDK
This work is based on: http://cscarioni.blogspot.co.uk/2012/09/setting-up-hadoop-virtual-cluster-with.html
Deploying the cluster
Then simply clone this repository, change into the directory and bring the cluster up.
$ vagrant up
This will set up 4 machines -
hadoop3. Each of them will have two CPUs and .5GB of
RAM. If this is too much for your machine, adjust the
The machines will be provisioned using Puppet. All of them will have hadoop (apache-hadoop-2.6.0) installed, ssh will be configured and local name resolution also works.
Hadoop is installed in
/opt/hadoop-2.6.0 and all tools are in the
master machine acts as the namenode and the yarn resource manager, the 3 others are data nodes and run node
The cluster uses zeroconf (a.k.a. bonjour) for name
resolution. This means that you never have to remember any IP nor will you have to fiddle with your
Name resolution works from the host to all VMs and between all VMs as well. If you are using linux, make sure you have
avahi-daemon installed and it is running. On a Mac everything should just work (TM) witouth doing anything. Windows
users have to install Bonjour for Windows before starting the cluster.
The network used is
192.168.7.0/24. If that causes any problems, change the
modules/avahi/file/hosts files to something that works for you. Since everything else is name based, no other change
Starting the cluster
This cluster uses the
ssh-into-all-the-boxes-and-start-things-up-approach, which is fine for testing.
Once all machines are up and provisioned, the cluster can be started. Log into the master, format hdfs and start the cluster.
$ vagrant ssh master $ (master) sudo prepare-cluster.sh $ (master) sudo start-all.sh
After a little while, all daemons will be running and you have a fully working hadoop cluster. Note that the
prepare-cluster.sh step is a one time action.
Stopping the cluster
If you want to shut down your cluster, but want to keep it around for later use, shut down all the services and tell vagrant to stop the machines like this:
$ vagrant ssh master $ (master) sudo stop-all.sh $ exit or Ctrl-D $ vagrant halt
When you want to use your cluster again, simply do this:
$ vagrant up $ vagrant ssh master $ (master) sudo start-all.sh
Getting rid of the cluster
If you don't need the cluster anymore and want to get your disk-space back do this:
$ vagrant destroy -f
This will only delete the VMs all local files in the directory stay untouched and can be used again, if you decide to start up a new cluster.
Interacting with the cluster
You can access all services of the cluster with your web-browser.
- namenode: http://master.local:50070/dfshealth.jsp
- application master: http://master.local:8088/cluster
- job history server: http://master.local:19888/jobhistory
To interact with the cluster on the command line, log into the master and use the hadoop command.
$ vagrant ssh master $ (master) hadoop fs -ls / $ ...
You can access the host file system from the
/vagrant directory, which means that you can drop your hadoop job in
there and run it on your own fully distributed hadoop cluster.
Since this is a fully virtualized environment running on your computer, it will not be super-fast. This is not the goal of this setup. The goal is to have a fully distributed cluster for testing and troubleshooting.
To not overload the host machine, has each tasktracker a hard limit of 1 map task and 1 reduce task at a time.
Puppet will download the latest Cascading SDK 2.7-wip build and put all SDK tools in
PATH. The SDK itself can be found in
The SDK allows you to install the [Driven plugin for Cascading]((http://cascading.io/driven) , by simply running
install-driven-plugin. This will install the plugin for the vagrant user in
Installing the plugin will cause every Cascading based application to send telemetry to
If you no longer want this to happen, you can simply delete the installation directory of the plugin mentioned above.
For more information about driven, please read the Driven documentation.
This version of the cluster also contains Apache HBase. The layout on disk is similar to
Hadoop. The distributition is in
/opt/hbase-<version>. You can start the HBase cluster like so.
$ (master) sudo start-hbase.sh
The Hadoop cluster must be running, before you issue this command, since HBase requires HDFS to be up and running.
To cluster is shut down like so:
$ (master) sudo stop-hbase.sh
The setup is fully distributed.
hadoop3 are running a
zookeeper instance and a region-server each. The HBase master is running on the
The webinterface of the HBase master is http://master.local:60010.
Hacking & Troubleshooting & Tips & Tricks
If something is not working right, join the Cascading mailinglist and post your problem there.
Single Node setup
If your computer is not capable of running 4 VMs at a time, you can still benefit from this setup. The
directory contains an alternative
Vagrantfile, which only starts the
master and deploys everything on it.
The interaction, the start- and stop sequence work the same ways as in the multi-VM cluster, except that it isn't fully distributed. This slimmed down version of the setup also does not include HBase.
To run the single node setup, run
vagrant up in the
single-node directory instead of the root directory. Everything
else stays the same.
Hacking & Troubleshooting
Vagrant makes it easy to share files between the vms of the cluster and your host machine. The project directory is
/vagrant, which enables you to get files from or to your host, by simply copying them into that
The namenode stores the
/srv/hadoop/namenode. The datanodes are storing all data in
Resetting the cluster
Sometimes, when experimenting too much, your cluster might not start anymore. If that is the case, you can easily reset it like so.
$ for host in master hadoop1 hadoop2 hadoop3; do vagrant ssh $host --command 'sudo rm -rf /srv/hadoop' ; done $ vagrant provision
After those two commands your cluster is in the same state as when you started it for the first time. You can now reformat the namenode and restart all services.
If you change any of the puppet modules, you can simply apply the changes with vagrants built-in provisioner.
$ vagrant provision
In order to save bandwidth and time we download hadoop only once and store it in the
/vagrant directory, so that the
other vms can reuse it. If the download fails for some reason, delete the tarball and rerun
We are also downloading a file containing checksums for the tarball. They are verified, before the cluster is started.
If something went wrong during the download, you will see the
verify_tarball part of puppet fail. If that is the case,
delete the tarball and the checksum file (
<tarball>.mds) and rerun
- have a way to configure the names/ips in only one file