Once you have created a cluster, you can submit "jobs" (work) to it. These can be entirely new jobs, or jobs you port from an existing environment.
To learn about how to write Hadoop jobs from the ground up, see the Apache Hadoop tutorials.
Google Cloud Platform offers input/output data connectors for your Hadoop and Spark jobs.
When porting a job from HDFS using the Cloud Storage connector for Hadoop, be sure to use the correct file path syntax (gs://). Also note that FileSystem.append is unsupported. If you choose Cloud Storage as your default file system, update your MapReduce code, if necessary, to avoid using the append method.
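For example, a quick way to confirm that the connector can reach your data is to list it with a gs:// path. This is a minimal sketch; my-bucket is a placeholder for your own configuration bucket:

hadoop fs -ls gs://my-bucket/   # list the bucket root through the Cloud Storage connector
hadoop fs -ls /                 # if Cloud Storage is your default file system, this resolves to the same bucket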
Once you've set up a Hadoop cluster and have written or ported a job, you can run the job using the following steps.
First, validate that your cluster is set up, and that you can access your data. Navigate to the command line to execute the following commands.
Type ./bdutil shell to SSH into the master node of the Hadoop cluster.
Type hadoop fs -ls / to check the cluster status. If the command returns a listing of the file system contents, the cluster is set up correctly.
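Put together, the validation is just two commands (the listing you see will vary with your data):

./bdutil shell        # run from your workstation to SSH into the master node
hadoop fs -ls /       # run on the master node; any listing of the default file system means the cluster is reachable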
Next, run the job from the command line while you are still connected to the cluster via SSH. Always run jobs as the hadoop user to avoid having to type full Hadoop paths in commands.
The following example runs a sample job called WordCount. Hadoop installations include this sample in the /home/hadoop/hadoop-install/hadoop-examples-*.jar file.
To run the WordCount job:
- Navigate to the command line.
- Type ./bdutil shell to SSH into the master node of the Hadoop cluster.
- Type hadoop fs -mkdir input to create the input directory. Note that when using Google Cloud Storage as your default file system, input automatically resolves to gs://$<CONFIGBUCKET>/input.
- Copy any file from the web, such as the following example text from Apache, by typing the following command: curl http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html > setup.html
- Copy one or more text files into the input directory. Using the same Apache text from the previous step, type the following command: hadoop fs -copyFromLocal setup.html input
- Type cd /home/hadoop/hadoop-install/ to navigate to the Hadoop install directory.
- Type hadoop jar hadoop-examples-*.jar wordcount input output to run the job on data in the input directory and place results in the output directory. (The complete command sequence is shown after this list.)
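For reference, the complete sequence looks roughly like the following. This is a sketch under the assumptions above: gs://$<CONFIGBUCKET> is your configuration bucket, the first command runs on your workstation while the rest run on the master node, and the final command assumes the standard WordCount output file naming:

./bdutil shell                      # SSH into the master node of the Hadoop cluster
hadoop fs -mkdir input              # resolves to gs://$<CONFIGBUCKET>/input when Cloud Storage is the default file system
curl http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html > setup.html
hadoop fs -copyFromLocal setup.html input
cd /home/hadoop/hadoop-install/
hadoop jar hadoop-examples-*.jar wordcount input output
hadoop fs -cat output/part-*        # optional: view the word counts once the job finishes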
To check the status of the Hadoop job, visit the JobTracker page. See the monitoring jobs page for instructions on how to access the JobTracker.
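If you prefer the command line, the standard Hadoop job commands also work from the master node; this is a general Hadoop sketch, not specific to this setup:

hadoop job -list               # list jobs that are currently running
hadoop job -status <job-id>    # show progress and counters for a specific job (use an ID from the list)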
After the job completes, be sure to shut down the Hadoop cluster so that you do not continue to incur costs for idle resources.
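If you created the cluster with bdutil, you can typically tear it down from the machine where you originally ran the deployment (not from the master node), passing the same environment flags you used at deploy time:

./bdutil delete    # deletes the cluster instances created by ./bdutil deploy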