Google Compute Engine Cluster for Hadoop
Copyright 2013 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The application is not an official Google product.
The application sets up Google Compute Engine instances as a Hadoop cluster and executes MapReduce tasks.
The purpose of the application is to demonstrate how to leverage Google Compute Engine for parallel processing with MapReduce on Hadoop.
The application is not meant to maintain a persistent Hadoop cluster. Because the application is designed to delete all disks at the deletion of the cluster, all data on the Hadoop cluster instances, including data in HDFS (Hadoop Distributed FileSystem), will be lost when the Hadoop cluster is torn down. For persistent storage beyond the lifespan of the cluster, use external persistent storage, such as Google Cloud Storage.
The application takes advantage of MapReduce tasks to parallelize the copy of input and output of MapReduce between Google Cloud Storage and HDFS.
gcutil and gsutil
The application uses gcutil and gsutil, the command line tools for Google Compute Engine and Google Cloud Storage respectively. The tools are distributed as part of the Google Cloud SDK. Follow the Cloud SDK installation instructions to install them.
Authentication and default project
After the installation of Cloud SDK, run the following command to authenticate, as instructed on the page.
gcloud auth login
The above command authenticates the user, and sets the default project. The default project must be set to the project where the Hadoop cluster will be deployed. The project ID is shown at the top of the project's "Overview" page of the Developers Console.
The default project can be changed with the command:
gcloud config set project <project ID>
The application uses SSH to start MapReduce tasks on the Hadoop cluster. In order for the application to execute remote commands automatically, gcutil must have an SSH key with an empty passphrase for Google Compute Engine instances.
gcutil uses its own SSH key, which is stored in the .ssh directory of the user's home directory as
$HOME/.ssh/google_compute_engine (private key) and
$HOME/.ssh/google_compute_engine.pub (public key).
If the SSH key for gcutil doesn't already exist, it can be created with the following command. Press Enter at the passphrase prompts to leave the passphrase empty.
ssh-keygen -t rsa -q -f $HOME/.ssh/google_compute_engine
Alternatively, the SSH key for gcutil can be generated upon the creation of an instance. The following command creates an instance, and an SSH key if it doesn't exist on the computer.
gcutil addinstance --zone us-central1-a --machine_type f1-micro \
    --image debian-7-wheezy <instance-name>
The instance is created only to create an SSH key, and is safe to delete.
gcutil deleteinstance -f --delete_boot_pd <instance-name>
If an SSH key with a passphrase already exists, the key files must be renamed or deleted before creating the SSH key with an empty passphrase.
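Whether the key files described above are present can be checked before running the application. The following is a minimal sketch, not part of the application: the helper name missing_gcutil_key_files is hypothetical, and it only checks that the two files exist, not whether the passphrase is empty.

```python
import os


def missing_gcutil_key_files(home_dir=None):
    """Returns the gcutil SSH key files that do not exist under home_dir.

    Hypothetical helper for illustration; defaults to the current user's
    home directory.
    """
    if home_dir is None:
        home_dir = os.path.expanduser('~')
    ssh_dir = os.path.join(home_dir, '.ssh')
    key_files = [
        os.path.join(ssh_dir, 'google_compute_engine'),      # private key
        os.path.join(ssh_dir, 'google_compute_engine.pub'),  # public key
    ]
    return [path for path in key_files if not os.path.isfile(path)]


if __name__ == '__main__':
    for path in missing_gcutil_key_files():
        print('Missing SSH key file: %s' % path)
```

An empty result means both files are in place; the passphrase still has to be verified separately.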
The application runs with Python 2.7. It's tested on Mac OS X and Linux.
Alternatively, a Google Compute Engine instance can be used to run the application, which works as a controller of the Hadoop cluster.
- On the HDFS NameNode Web console (http://<master external IP address>:50070), the "Browse the filesystem" link does not work. To browse HDFS from the Web UI, go to the live node list from the "Live Nodes" link, and click the IP address of an arbitrary DataNode.
- The script does not work properly if the path to the application directory contains whitespace. Please download the application to a path that doesn't include any whitespace.
- The application creates a file named .hadoop_on_compute.credentials (hidden file) under the home directory of the user to cache OAuth2 information. It is automatically created the first time access is authorized by the user. If incorrect information is cached and causes Google Compute Engine API access to fail, remove the file to redo the authentication.
- Without additional security consideration, which falls outside the scope of the application, Hadoop's Web UI is open to the public. Some resources on the Web are:
Setup Instructions
Setting up the environment should be done in the directory of the application ("root directory of the application"). In the following instructions, commands are expected to be run from the root directory.
Prepare Hadoop package
Download Hadoop package
Put the downloaded package in the root directory of the application.
The download can be performed from a Web browser, and the file can then be copied
to the working directory. Alternatively, command line tools such as
curl or wget may be used.
curl -O http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Customize Hadoop configuration and re-package
Apply hadoop-1.2.1.patch to customize Hadoop configuration. From the root directory of the application, execute the following commands.
tar zxf hadoop-1.2.1.tar.gz
patch -p0 < hadoop-1.2.1.patch
tar zcf hadoop-1.2.1.tar.gz hadoop-1.2.1
To include custom configurations, modify the Hadoop configuration files after applying the patch and before re-creating the package in the above steps.
Download Open JDK and Dependent Packages
Create a directory called "deb_packages" under the root directory of this application.
Download amd64 package of openjdk-6-jre-headless, and architecture-common package of openjdk-6-jre-lib from the following sites.
Also download dependent packages.
- http://packages.debian.org/wheezy/libnss3-1d [amd64]
- http://packages.debian.org/wheezy/libnss3 [amd64]
- http://packages.debian.org/wheezy/ca-certificates-java [architecture-common]
- http://packages.debian.org/wheezy/libnspr4 [amd64]
Put the downloaded packages into the deb_packages directory.
mkdir deb_packages
cd deb_packages
curl -O http://security.debian.org/debian-security/pool/updates/main/o/openjdk-6/openjdk-6-jre-headless_6b27-1.12.6-1~deb7u1_amd64.deb
curl -O http://security.debian.org/debian-security/pool/updates/main/o/openjdk-6/openjdk-6-jre-lib_6b27-1.12.6-1~deb7u1_all.deb
curl -O http://http.us.debian.org/debian/pool/main/n/nss/libnss3-1d_3.14.3-1_amd64.deb
curl -O http://http.us.debian.org/debian/pool/main/n/nss/libnss3_3.14.3-1_amd64.deb
curl -O http://http.us.debian.org/debian/pool/main/c/ca-certificates-java/ca-certificates-java_20121112+nmu2_all.deb
curl -O http://http.us.debian.org/debian/pool/main/n/nspr/libnspr4_4.9.2-1_amd64.deb
cd ..
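Since six separate files have to be downloaded, it may help to confirm that none is missing before moving on. The following is a small sketch under the assumption that the file names match those in the curl commands above; the helper name missing_packages is hypothetical and not part of the application.

```python
import os

# File names taken from the curl commands above. Newer package versions
# would have different names, so adjust the list if other files are used.
EXPECTED_PACKAGES = [
    'openjdk-6-jre-headless_6b27-1.12.6-1~deb7u1_amd64.deb',
    'openjdk-6-jre-lib_6b27-1.12.6-1~deb7u1_all.deb',
    'libnss3-1d_3.14.3-1_amd64.deb',
    'libnss3_3.14.3-1_amd64.deb',
    'ca-certificates-java_20121112+nmu2_all.deb',
    'libnspr4_4.9.2-1_amd64.deb',
]


def missing_packages(package_dir='deb_packages'):
    """Returns the expected package files that are absent from package_dir."""
    if not os.path.isdir(package_dir):
        return list(EXPECTED_PACKAGES)
    present = set(os.listdir(package_dir))
    return [name for name in EXPECTED_PACKAGES if name not in present]


if __name__ == '__main__':
    # Run from the root directory of the application.
    for name in missing_packages():
        print('Missing package: %s' % name)
```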
Prepare Google Cloud Storage bucket
Create a Google Cloud Storage bucket, from which the Google Compute Engine instances download Hadoop and other packages.
This can be done by one of:
- Using an existing bucket.
- Creating a new bucket from the "Cloud Storage" page on the project page of Developers Console
- Creating a new bucket with the gsutil command line tool.
gsutil mb gs://<bucket name>
Note this bucket can be different from the bucket where MapReduce input and output are located. Make sure to create the bucket in the same Google Cloud project as that specified in the "Default project" section above.
Create client ID and client secret
The client ID and client secret are required by OAuth2 authorization to identify the application. They are required in order for the application to access Google APIs (in this case, the Google Compute Engine API) on behalf of the user.
Client ID and client secret can be set up from "APIs & auth" menu of Developers Console of the project. Choose "Credentials" submenu, and click the red button at the top labeled "CREATE NEW CLIENT ID" to create a new pair of client ID and client secret for the application.
Choose "Installed application" as "Application type", choose "Other" as "Installed application type" and click "Create Client ID" button.
Replace the CLIENT_ID and CLIENT_SECRET values in the GceCluster class in gce_cluster.py with the values created in the above step.
CLIENT_ID = '12345....................com'
CLIENT_SECRET = 'abcd..........'
Download and set up Python libraries
The following instructions explain how to set up the additional libraries for the application.
Google Client API
Google Client API is a library for accessing various Google services via their APIs.
Download google-api-python-client-1.2.tar.gz from the download page or with the following command.
curl -O http://google-api-python-client.googlecode.com/files/google-api-python-client-1.2.tar.gz
Set up the library in the root directory of the application.
tar zxf google-api-python-client-1.2.tar.gz
ln -s google-api-python-client-1.2/apiclient .
ln -s google-api-python-client-1.2/oauth2client .
ln -s google-api-python-client-1.2/uritemplate .
curl -O https://httplib2.googlecode.com/files/httplib2-0.8.tar.gz
Set up the library in the root directory of the application.
tar zxf httplib2-0.8.tar.gz
ln -s httplib2-0.8/python2/httplib2 .
curl -O http://python-gflags.googlecode.com/files/python-gflags-2.0.tar.gz
Set up the library in the root directory of the application.
tar zxf python-gflags-2.0.tar.gz
ln -s python-gflags-2.0/gflags.py .
ln -s python-gflags-2.0/gflags_validators.py .
Python mock (required only for unit tests)
mock is a mocking library for Python. It is included in the Python standard library, as unittest.mock, from Python 3.3. However, since the application uses Python 2.7, it needs to be set up separately.
Download mock-1.0.1.tar.gz from the download page or with the following command.
curl -O https://pypi.python.org/packages/source/m/mock/mock-1.0.1.tar.gz
Set up the library in the root directory of the application.
tar zxf mock-1.0.1.tar.gz
ln -s mock-1.0.1/mock.py .
Usage of the Application
compute_cluster_for_hadoop.py is the main script of the application.
It sets up the Google Compute Engine and Google Cloud Storage environment
for the Hadoop cluster, starts up the Hadoop cluster, initiates MapReduce jobs
and tears down the cluster.
compute_cluster_for_hadoop.py has 4 subcommands: setup, start, mapreduce and shutdown.
Please refer to the following usage messages for available options.
./compute_cluster_for_hadoop.py setup --help
./compute_cluster_for_hadoop.py start --help
./compute_cluster_for_hadoop.py mapreduce --help
./compute_cluster_for_hadoop.py shutdown --help
Set up environment
The 'setup' subcommand sets up the environment. It performs the following steps.
- Create the SSH key used for communication between Hadoop instances.
- Upload Hadoop and Open JDK packages to Google Cloud Storage, so that Google Compute Engine instances can download them to set up Hadoop.
- Set up the firewall in the Google Compute Engine network to allow users to access Hadoop Web consoles.
'setup' must be performed at least once per combination of Google Compute Engine project and Google Cloud Storage bucket. 'setup' may safely be run repeatedly for the same Google Compute Engine project and/or Google Cloud Storage bucket. The project ID can be found in the "PROJECT ID" column of the project list of the Developers Console or at the top of the project's page on Developers Console. Bucket name is the name of Google Cloud Storage bucket without the "gs://" prefix.
Execute the following command to set up the environment.
./compute_cluster_for_hadoop.py setup <project ID> <bucket name>
The 'start' subcommand starts the Hadoop cluster. By default, it starts 6 instances: one master and 5 worker instances. In other words, the default number of workers is 5.
./compute_cluster_for_hadoop.py start <project ID> <bucket name> \
    [number of workers] [--prefix <prefix>] \
    [--data-disk-gb <size>]
Each instance is started with a Persistent Disk used for HDFS. The size of the
disk can be set with the
--data-disk-gb parameter. The default size is 500GB.
Review the documentation
for help in determining the disk size that provides the right performance
for the cluster.
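Because each instance gets its own Persistent Disk, the total disk capacity requested grows with the cluster size, which can matter for quota. The arithmetic can be sketched as follows, assuming that "each instance" includes the master; the function name is illustrative only and does not exist in the application.

```python
def total_data_disk_gb(num_workers=5, data_disk_gb=500):
    """Total size in GB of the HDFS Persistent Disks of one cluster.

    Assumes one data disk per instance, with the master counted as an
    instance. Defaults mirror the defaults described above.
    """
    num_instances = num_workers + 1  # workers plus one master
    return num_instances * data_disk_gb


if __name__ == '__main__':
    print(total_data_disk_gb())         # 6 instances x 500 GB = 3000
    print(total_data_disk_gb(10, 200))  # 11 instances x 200 GB = 2200
```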
When instances are started for the first time, the script requires the user to log in
and asks for authorization to access Google Compute Engine.
By default, the command opens a Web browser for the authorization.
If the script is run on a remote host in a terminal (over SSH, for example),
it cannot open a Web browser on the local machine.
In this case, the
--noauth_local_webserver option can be specified, as instructed
by the message, as follows.
./compute_cluster_for_hadoop.py --noauth_local_webserver start \
    <project ID> <bucket name> [number of workers] \
    [--prefix <prefix>] [--data-disk-gb <size>]
This avoids the attempt to open a local Web browser, and instead shows a URL for authentication and authorization. When authorization succeeds in the Web browser, the page shows a code to paste into the terminal. Once the correct code is pasted, the authorization process is complete in the script. The script can then access Google Compute Engine through the API.
As shown in the console log, the HDFS and MapReduce Web consoles are available at
http://<master external IP address>:50070 and
http://<master external IP address>:50030 respectively.
The 'start' subcommand accepts a custom command, executed on each instance, through the
--command option.
For example, it can be used to install required software and/or to download
necessary files onto each instance.
The custom command is executed with the permissions of the user who started
the instance. If superuser privilege is required for the command execution,
use sudo in the command.
External IP Addresses on Worker Instances
By default, all Google Compute Engine instances created by the application are equipped with external IP addresses.
A Hadoop cluster with no external IP addresses on worker instances can be
started by passing
--external-ip=master option to 'start' subcommand.
./compute_cluster_for_hadoop.py start <project ID> <bucket name> ... --external-ip=master
Quota and security are some of the reasons to set up a cluster without external IP addresses. Note the master instance always has an external IP address, so that MapReduce tasks can be started.
If a Google Compute Engine instance is created without an external IP address,
it cannot access the Internet directly, including accessing Google Cloud
Storage. When the --external-ip option is set to "master", however, the
application sets up routing and NAT via the master instance, so that workers
can still access the Internet transparently. Note that all traffic to the
Internet from the workers then goes through the single master instance.
When the workers are created with external IP addresses (the default), their
traffic goes directly to the Internet.
The 'mapreduce' subcommand starts a MapReduce task on the Hadoop cluster. It requires the project ID and a bucket for temporary use. The temporary bucket is used to transfer the mapper and reducer programs. It may or may not be the same as the bucket used to set up the cluster.
Input files must be located in a single directory on Google Cloud Storage, from which they are copied to the Hadoop cluster as the mapper's input. Either a directory or a single file on Google Cloud Storage can be specified as input.
The --input and --output parameters are required, and must point to directories on Google Cloud Storage, starting with "gs://". Alternatively, --input can be a single file on Google Cloud Storage, still starting with "gs://". The --input location, the --output location and the temporary bucket can be in different buckets.
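The "gs://" requirement on the input and output locations can be checked up front. The sketch below splits such a location into a bucket name and an object path; split_gcs_path is a hypothetical helper for illustration and is not how the application itself parses these parameters.

```python
def split_gcs_path(path):
    """Splits 'gs://bucket/dir/file' into ('bucket', 'dir/file').

    Raises ValueError if the path does not start with 'gs://' or if the
    bucket name is empty. Hypothetical helper, not part of the application.
    """
    prefix = 'gs://'
    if not path.startswith(prefix):
        raise ValueError('Path must start with gs://: %r' % path)
    # Everything up to the first '/' after the prefix is the bucket name.
    bucket, _, object_name = path[len(prefix):].partition('/')
    if not bucket:
        raise ValueError('Bucket name is missing: %r' % path)
    return bucket, object_name


if __name__ == '__main__':
    print(split_gcs_path('gs://my-bucket/input/data.txt'))
```

The same function also gives the bucket name without the "gs://" prefix, which is the form the setup and start subcommands expect.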
The output of MapReduce task is copied to the specified output directory on Google Cloud Storage. The existing files in the directory may be overwritten. The output directory does not need to exist in advance.
The command uses Hadoop streaming for MapReduce processing. The mapper and the reducer must be programmed to read input from standard input and write output to standard output. If a local file is specified as the mapper and/or the reducer, it is copied to the Hadoop cluster through Google Cloud Storage. Alternatively, files on Google Cloud Storage may be used as the mapper or the reducer.
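To illustrate the standard input and standard output contract, here is a minimal word count in the Hadoop streaming style. It is a generic sketch, not the sample shipped with the application: the function names are made up for this example, and it assumes the usual streaming convention of tab-separated key and value with reducer input sorted by key.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emits 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word


def reduce_lines(lines):
    """Reducer step: sums the counts of each word.

    The input lines must be sorted by key so that equal words are adjacent.
    """
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield '%s\t%d' % (word, sum(int(count) for _, count in group))


if __name__ == '__main__':
    # Local simulation of the pipeline: map, sort by key, reduce.
    # In a real streaming job, the mapper script would pass sys.stdin to
    # map_lines and write each result line to sys.stdout, and the reducer
    # script would do the same with reduce_lines.
    sample_input = ['a b a\n', 'b a\n']
    for out in reduce_lines(sorted(map_lines(sample_input))):
        print(out)
```

Keeping the logic in functions that take any iterable of lines makes the mapper and the reducer easy to try locally before running them on a cluster.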
If the mapper or the reducer requires additional files, such as data files or libraries,
they can be installed on each instance with the
--command option of the 'start' subcommand.
If mapper or reducer is not specified, the step (mapper or reducer) copies
input to output. Specifying 0 as
--reducer-count will skip shuffle and
reduce phases, making the output of mapper the final output of MapReduce.
The sample directory in the application includes a sample mapper and
reducer that count the occurrences of each word in the input files and list
the words from shortest to longest, in alphabetical order among words of the same length.
./compute_cluster_for_hadoop.py mapreduce <project ID> <bucket name> \
    [--prefix <prefix>] \
    --input gs://<input directory on Google Cloud Storage> \
    --output gs://<output directory on Google Cloud Storage> \
    --mapper sample/shortest-to-longest-mapper.pl \
    --reducer sample/shortest-to-longest-reducer.pl \
    --mapper-count 5 \
    --reducer-count 1
Shut down cluster
'shutdown' subcommand deletes all instances in the Hadoop cluster.
./compute_cluster_for_hadoop.py shutdown <project ID> [--prefix <prefix>]
The application is designed to delete Google Compute Engine instances and disks when the cluster is shut down. Therefore, all files on the Hadoop cluster, including HDFS, are deleted when the cluster is shut down. For persistent storage beyond the lifespan of the cluster, use external persistent storage, such as Google Cloud Storage.
Prefix and zone
The start, mapreduce and shutdown subcommands take a string value as the
--prefix parameter.
The prefix specified is prepended to the instance names of the cluster.
In order to start up multiple clusters in the same project, specify a different
prefix for each cluster, and run the mapreduce and
shutdown commands with the appropriate prefix.
There are restrictions on the prefix string.
- The prefix must be 15 characters or fewer.
- Only lower case letters, digits and hyphens can be used.
- The first character must be a lower case letter.
Similarly, if a zone is specified by the --zone parameter in the start subcommand,
the same zone must be specified for the mapreduce and shutdown subcommands,
so that the MapReduce task and shutdown are performed correctly.
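The prefix restrictions listed above translate directly into a regular expression. The following sketch shows one way to check a prefix before using it; is_valid_prefix is a hypothetical helper, not the validation code of the application, and since the text does not say whether an empty prefix is accepted, this sketch rejects it.

```python
import re

# First character: lower case letter.
# Remaining characters (up to 14): lower case letters, digits or hyphens.
# \Z anchors at the very end, so a trailing newline is not accepted.
_PREFIX_PATTERN = re.compile(r'[a-z][a-z0-9-]{0,14}\Z')


def is_valid_prefix(prefix):
    """Returns True if prefix satisfies the three restrictions above."""
    return _PREFIX_PATTERN.match(prefix) is not None


if __name__ == '__main__':
    for candidate in ['hadoop', 'my-cluster-1', '1cluster', 'Cluster', 'a' * 16]:
        print('%s: %s' % (candidate, is_valid_prefix(candidate)))
```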
Cleaning up the environment
After the cluster is shut down, and if the Hadoop cluster environment is no
longer necessary, the following commands clean up the environment set up
by the setup command.
gcutil deletefirewall -f hdfs-datanode hdfs-namenode \
    hadoop-mr-jobtracker hadoop-mr-tasktracker
gsutil rm -f -r gs://<bucket name>/mapreduce/
The same environment can be set up again with the
setup command, allowing
subsequent deployments of Hadoop clusters.
Log in to Hadoop master instance
hadoop-shell.sh provides a convenient way to log in to the master instance of the Hadoop cluster.
When logged in to the master instance, the user is automatically switched
to the Hadoop cluster's superuser ('hadoop').
The hadoop command can be used to work on HDFS and Hadoop.
hadoop dfs -ls /
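The application has 3 Python files: compute_cluster_for_hadoop.py, gce_cluster.py and gce_api.py.
They have corresponding unit tests: compute_cluster_for_hadoop_test.py, gce_cluster_test.py and gce_api_test.py.
Unit tests can be directly executed.
./compute_cluster_for_hadoop_test.py
./gce_cluster_test.py
./gce_api_test.py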
Note that some unit tests simulate error conditions, and those tests show error messages.
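For readers unfamiliar with how a test can simulate an error condition, the sketch below shows the general technique with the mock library. It is not taken from the application's tests: start_cluster and the create_instance method are hypothetical names invented for this example.

```python
import unittest

try:
    from unittest import mock  # Python 3.3 and later
except ImportError:
    import mock  # Python 2.7 with the separately installed mock package


def start_cluster(api):
    """Hypothetical function under test: returns False if the API call fails."""
    try:
        api.create_instance('master')
    except IOError:
        return False
    return True


class StartClusterTest(unittest.TestCase):

    def testApiError(self):
        # side_effect makes the mocked call raise, simulating an API failure.
        api = mock.Mock()
        api.create_instance.side_effect = IOError('simulated failure')
        self.assertFalse(start_cluster(api))
        api.create_instance.assert_called_once_with('master')

    def testSuccess(self):
        api = mock.Mock()
        self.assertTrue(start_cluster(api))


if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(StartClusterTest)
    unittest.TextTestRunner().run(suite)
```

Because the failure is injected through the mock, no real Google Compute Engine call is made, and any error text printed by the code under test is expected output rather than a sign of a broken test.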