<div style="font-size:18pt; padding-top:20px; text-align:center"><span style="font-weight:bold; color:green">MapReduce</span> <b>Workflow and Data Serialization Systems</b></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru, papulin_hse@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Content</span>
    <ol>
        <li><a href="#1">MapReduce Workflow</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#1a">ControlFlow</a></li>
                <li><a href="#1b">Oozie</a></li>
                <li><a href="#1c">Tez</a></li>
            </ol>
        </li>
        <li><a href="#2">Data Serialization Systems</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#2a">Java</a></li>
                <li><a href="#2b">Avro</a></li>
                <li><a href="#2c">Parquet</a></li>
            </ol>
        </li>
        <li><a href="#3">References</a></li>
    </ol>
</div>

<p>Run the cell below to apply the Jupyter notebook style</p>

In [1]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. MapReduce Workflow</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a name="1a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. ControlFlow
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#1b">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Creating a jar file with a workflow</b></p>

<div>
<p>Create a new project in <span class="code-font">IntelliJ</span>, and copy the following files to the current project:</p> 
<ul><li class="code-font">ProdAvgMapper.java</li>
    <li class="code-font">ProdAvgCombiner.java</li>
    <li class="code-font">ProdAvgReducer.java</li>
    <li class="code-font">SumCountWritable.java</li>
</ul>
</div>

<p>Change the package name in each copied file to <span class="code-font">edu.classes.mrjobflow</span></p>

<p>Create the additional Java files shown below in your project. They belong to the same package, <span class="code-font">edu.classes.mrjobflow</span></p>

<p class="code-font">GroupMapper.java</p>

In [None]:
package edu.classes.mrjobflow;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class GroupMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

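    // Reusable output objects (avoids allocating a new object per record)
    private final IntWritable one = new IntWritable(1);
    private final IntWritable group = new IntWritable();

    @Override
    public void map(Object key, Text value, Context context
    ) throws IOException, InterruptedException {

        // Input line: "<key>\t<rating>", as written by the first job
        String prod_rating = value.toString();
        String[] items = prod_rating.split("\t");

        // Skip lines that do not have exactly two tab-separated fields
        if (items.length == 2) {

            double rating = Double.parseDouble(items[1]);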

            if (rating >= 0 && rating < 1) {
                group.set(1);
                context.write(group, one);
            } else if (rating >= 1 && rating < 2) {
                group.set(2);
                context.write(group, one);
            } else if (rating >= 2 && rating < 3) {
                group.set(3);
                context.write(group, one);
            } else if (rating >= 3 && rating < 4) {
                group.set(4);
                context.write(group, one);
            } else if (rating >= 4 && rating <= 5) {
                group.set(5);
                context.write(group, one);
            }
        }
    }
}
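
<p>The bucketing rule in <span class="code-font">GroupMapper</span> can be tried out without Hadoop. The sketch below is only an illustration: the class <span class="code-font">RatingGroups</span> and the method <span class="code-font">groupOf</span> are hypothetical names that are not part of the project, and returning -1 for out-of-range ratings is an assumption (the mapper simply emits nothing in that case).</p>

```java
// Plain-Java sketch of the bucketing rule used in GroupMapper (no Hadoop needed).
// RatingGroups and groupOf are hypothetical names introduced for illustration.
class RatingGroups {

    // Returns the group number (1..5) for a rating in [0, 5],
    // or -1 if the rating is outside that range (assumption of this sketch).
    static int groupOf(double rating) {
        if (Double.isNaN(rating) || rating < 0 || rating > 5) {
            return -1;
        }
        // [0,1) -> 1, [1,2) -> 2, [2,3) -> 3, [3,4) -> 4, [4,5] -> 5
        // (a rating of exactly 5.0 is folded into the last group)
        return Math.min((int) rating + 1, 5);
    }

    public static void main(String[] args) {
        double[] samples = {0.0, 0.9, 1.0, 3.7, 4.0, 5.0, 5.1};
        for (double r : samples) {
            System.out.println(r + " -> " + groupOf(r));
        }
    }
}
```

<p>The arithmetic form produces the same groups as the chain of <span class="code-font">if</span> / <span class="code-font">else if</span> conditions in the mapper, including the closed upper bound of the last group.</p>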

<p class="code-font">GroupReducer.java</p>

In [None]:
package edu.classes.mrjobflow;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class GroupReducer extends Reducer<IntWritable,IntWritable,IntWritable,IntWritable> {

    // Reusable output object for the total count of a group
    private final IntWritable resultCount = new IntWritable();

    @Override
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int count = 0;

        for (IntWritable val : values) {
            count += val.get();
        }

        System.out.println(key + " : " + count);

        resultCount.set(count);

        context.write(key, resultCount);
    }
}
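
<p>The driver in this section registers <span class="code-font">GroupReducer</span> both as the combiner and as the reducer. That is safe here because summing partial sums gives the same total as summing all values at once. The sketch below illustrates this with plain Java collections; <span class="code-font">CombinerSketch</span> and its methods are hypothetical names, the sample pairs are made up, and the code only imitates the data flow rather than Hadoop's actual shuffle.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop): why a summing reducer can also serve as a combiner.
// CombinerSketch and its methods are hypothetical names introduced for illustration.
class CombinerSketch {

    // Sums values per key, which is what GroupReducer does for each group.
    static Map<Integer, Integer> sumByKey(List<int[]> pairs) {
        Map<Integer, Integer> totals = new TreeMap<>();
        for (int[] pair : pairs) {
            totals.merge(pair[0], pair[1], Integer::sum);
        }
        return totals;
    }

    // Turns a map of partial sums back into (key, value) pairs.
    static List<int[]> toPairs(Map<Integer, Integer> totals) {
        List<int[]> pairs = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : totals.entrySet()) {
            pairs.add(new int[] {e.getKey(), e.getValue()});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up (group, 1) pairs emitted by two map tasks
        List<int[]> mapTask1 = Arrays.asList(new int[] {5, 1}, new int[] {5, 1}, new int[] {3, 1});
        List<int[]> mapTask2 = Arrays.asList(new int[] {5, 1}, new int[] {1, 1});

        // Without a combiner: the reducer sees every pair
        List<int[]> all = new ArrayList<>(mapTask1);
        all.addAll(mapTask2);
        Map<Integer, Integer> direct = sumByKey(all);

        // With a combiner: each map task pre-aggregates, the reducer sums the partial sums
        List<int[]> combined = new ArrayList<>(toPairs(sumByKey(mapTask1)));
        combined.addAll(toPairs(sumByKey(mapTask2)));
        Map<Integer, Integer> viaCombiner = sumByKey(combined);

        System.out.println("direct:       " + direct);
        System.out.println("via combiner: " + viaCombiner);
        System.out.println("same result:  " + direct.equals(viaCombiner));
    }
}
```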

<p class="code-font">JobFlowDriver.java</p>

In [None]:
package edu.classes.mrjobflow;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.Job;

import java.util.List;

public class JobFlowDriver {

    public void run(String[] args) throws Exception {

        JobControl jobControl = new JobControl("MapReduceJobFlow");

        // Job 1 ----------------------------------------------
        // Create a configuration for the first job
        Configuration confAvgJob = new Configuration();

        // Create a job for calculating average product ratings
        Job avgJob = Job.getInstance(confAvgJob, "ProdAverageRating");
        avgJob.setJarByClass(JobFlowDriver.class);
        avgJob.setMapperClass(ProdAvgMapper.class);
        avgJob.setCombinerClass(ProdAvgCombiner.class);
        avgJob.setReducerClass(ProdAvgReducer.class);
        avgJob.setOutputKeyClass(Text.class);
        avgJob.setOutputValueClass(SumCountWritable.class);
        FileInputFormat.addInputPath(avgJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(avgJob, new Path(args[1]));

        // Wrap avgJob in a ControlledJob so that JobControl can track its state
        ControlledJob cntrAvgJob = new ControlledJob(confAvgJob);
        cntrAvgJob.setJob(avgJob);

        // Register the controlled job with JobControl
        jobControl.addJob(cntrAvgJob);
        //-----------------------------------------------------

        // Job 2 ----------------------------------------------
        Configuration confGroupJob = new Configuration();
        Job groupJob = Job.getInstance(confGroupJob, "GroupProdByRating");
        groupJob.setJarByClass(JobFlowDriver.class);
        groupJob.setMapperClass(GroupMapper.class);
        groupJob.setCombinerClass(GroupReducer.class);
        groupJob.setReducerClass(GroupReducer.class);
        groupJob.setOutputKeyClass(IntWritable.class);
        groupJob.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(groupJob, new Path(args[1]));
        FileOutputFormat.setOutputPath(groupJob, new Path(args[2]));

        // Wrap groupJob in a ControlledJob
        ControlledJob cntrGroupJob = new ControlledJob(confGroupJob);
        cntrGroupJob.setJob(groupJob);

        // Declare the dependency: Job 2 starts only after Job 1 has succeeded
        cntrGroupJob.addDependingJob(cntrAvgJob);

        // Register the controlled job with JobControl
        jobControl.addJob(cntrGroupJob);
        //-----------------------------------------------------

        Thread jobControlRunner = new Thread(jobControl);
        jobControlRunner.start();

        long startTime = System.currentTimeMillis();

        while (!jobControl.allFinished()) {
            System.out.println("Jobs in waiting state: " + jobControl.getWaitingJobList().size());
            System.out.println("Jobs in ready state: "   + jobControl.getReadyJobsList().size());
            System.out.println("Jobs in running state: " + jobControl.getRunningJobList().size());
            System.out.println("Jobs in success state: " + jobControl.getSuccessfulJobList().size());
            System.out.println("Jobs in failed state: "  + jobControl.getFailedJobList().size());
            System.out.println("\n");

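            try {
                Thread.sleep(10 * 1000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop polling
                Thread.currentThread().interrupt();
                break;
            }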
        }

        long endTime = System.currentTimeMillis();

        List<ControlledJob> fail = jobControl.getFailedJobList();
        List<ControlledJob> succeed = jobControl.getSuccessfulJobList();

        int numOfSuccessfulJob = succeed.size();
        if (numOfSuccessfulJob > 0) {
            System.out.println(numOfSuccessfulJob + " jobs succeeded");
        }

        int numOfFailedjob = fail.size();
        if (numOfFailedjob > 0) {
            System.out.println("------------------------------- ");
            System.out.println(numOfFailedjob + " jobs failed");
        }

        System.out.println("JobFlow results:");
        System.out.println("Total num of Jobs: " + 2);
        System.out.println("ExecutionTime: " + ((endTime-startTime) / 1000));
        jobControl.stop();

    }

    public static void main(String[] args) throws Exception {
        JobFlowDriver jobFlow = new JobFlowDriver();
        jobFlow.run(args);
    }

}
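
<p>The idea behind the driver is dependency-ordered execution: a job starts only when every job it depends on has succeeded. The sketch below imitates that idea in plain Java so it can be run without a cluster. It is an assumption-laden illustration, not Hadoop's implementation: <span class="code-font">FlowSketch</span>, <span class="code-font">addJob</span> and <span class="code-font">run</span> are hypothetical names, jobs run one at a time, and a "job" is just a name that either succeeds or fails.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Hadoop) of dependency-ordered execution.
// FlowSketch, addJob and run are hypothetical names; this is not Hadoop's JobControl.
class FlowSketch {

    // Job name -> names of the jobs that must succeed first
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    void addJob(String name, String... dependsOn) {
        dependencies.put(name, Arrays.asList(dependsOn));
    }

    // Starts every job whose dependencies have all succeeded and returns the start order.
    // Jobs named in "failing" fail when started, so their dependents are never started.
    List<String> run(List<String> failing) {
        List<String> succeeded = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        List<String> order = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Map.Entry<String, List<String>> e : dependencies.entrySet()) {
                String job = e.getKey();
                boolean done = succeeded.contains(job) || failed.contains(job);
                if (!done && succeeded.containsAll(e.getValue())) {
                    order.add(job);
                    (failing.contains(job) ? failed : succeeded).add(job);
                    progress = true;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        FlowSketch flow = new FlowSketch();
        // The dependent job is added first to show that insertion order does not matter
        flow.addJob("GroupProdByRating", "ProdAverageRating");
        flow.addJob("ProdAverageRating");

        System.out.println("all jobs succeed: " + flow.run(new ArrayList<>()));
        System.out.println("first job fails:  " + flow.run(Arrays.asList("ProdAverageRating")));
    }
}
```

<p>In the second call of the sketch the grouping job is never started, because the job it depends on did not succeed.</p>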


<p><b>Running on an EMR cluster</b></p>

<p>Create a JSON file that describes the jobflow step</p>

In [None]:
# %load /YOUR_PATH/config/jobflow-step-template.json
[
  {
     "Name": "JobFlow Step",
     "Type": "CUSTOM_JAR",
     "ActionOnFailure": "TERMINATE_CLUSTER",
     "Jar": "s3://YOUR_BUCKET/jobflow/MapReduceJobFlow.jar",
     "Args": [
         "s3://YOUR_BUCKET/data/reviews_Electronics_5.json",
         "s3://YOUR_BUCKET/data/temp_jobflow",
         "s3://YOUR_BUCKET/data/output_jobflow"]
  }
]
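
<p>The three entries of <span class="code-font">Args</span> reach <span class="code-font">JobFlowDriver</span> as <span class="code-font">args[0]</span>, <span class="code-font">args[1]</span> and <span class="code-font">args[2]</span>. The sketch below spells out that mapping and adds an argument check that the driver itself does not perform. <span class="code-font">JobFlowPaths</span> is a hypothetical helper introduced only for illustration.</p>

```java
// Plain-Java sketch: how the three "Args" of the step map onto the driver's args array.
// JobFlowPaths is a hypothetical helper, not part of the original project.
class JobFlowPaths {

    final String input;        // args[0]: input of job 1
    final String intermediate; // args[1]: output of job 1 and input of job 2
    final String output;       // args[2]: output of job 2

    JobFlowPaths(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                    "Expected <input> <intermediate> <output>, got " + args.length + " argument(s)");
        }
        input = args[0];
        intermediate = args[1];
        output = args[2];
    }

    public static void main(String[] args) {
        JobFlowPaths paths = new JobFlowPaths(new String[] {
                "s3://YOUR_BUCKET/data/reviews_Electronics_5.json",
                "s3://YOUR_BUCKET/data/temp_jobflow",
                "s3://YOUR_BUCKET/data/output_jobflow"});
        System.out.println("job 1: " + paths.input + " -> " + paths.intermediate);
        System.out.println("job 2: " + paths.intermediate + " -> " + paths.output);
    }
}
```

<p>Hadoop's <span class="code-font">FileOutputFormat</span> normally refuses to start a job whose output directory already exists, so the second and third paths should point to locations that do not exist yet.</p>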


<p>Create an EMR cluster with the jobflow step</p>

<div class="msg-block msg-info">
  <div class="msg-text-info">
      <p>1) find the subnet in which the cluster will run</p>
      <p class="code-block code-font">aws ec2 <span class="code-key">describe-subnets</span></p>
      <p>2) create an S3 bucket (if needed)</p>
      <p class="code-block code-font">aws s3 <span class="code-key">mb</span> s3://YOUR_BUCKET/</p>
      <p>3) copy the bootstrap action script to your S3 bucket</p>
      <p class="code-block code-font">aws s3 <span class="code-key">cp</span> /YOUR_PATH/config/download-unzip-s3.sh s3://YOUR_BUCKET/scripts/download-unzip-s3.sh</p>
      <p>4) copy the jobflow jar file to S3</p>
      <p class="code-block code-font">aws s3 <span class="code-key">cp</span> /YOUR_PATH/MapReduceJobFlow.jar s3://YOUR_BUCKET/jobflow/MapReduceJobFlow.jar</p>
  </div>
</div>

In [None]:
%%bash
aws emr create-cluster \
    --name "Hadoop_Cluster" \
    --release-label emr-5.8.0 \
    --applications Name=Hadoop Name=Zeppelin \
    --log-uri s3://YOUR_BUCKET/logs/ \
    --service-role emr-default-role \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=3,InstanceType=m4.large \
    --ec2-attributes InstanceProfile=emr-default-ec2-role,KeyName=BigData_Keys,SubnetId=YOUR_SUBNET \
    --bootstrap-actions Path=s3://YOUR_BUCKET/scripts/download-unzip-s3.sh \
    --configurations file:///YOUR_PATH/config/hdfs-config.json \
    --steps file:///YOUR_PATH/config/jobflow-step.json

<p>Set up an SSH tunnel using dynamic port forwarding (run in your terminal)</p>

In [None]:
sudo aws emr socks --cluster-id YOUR_CLUSTER_ID --key-pair-file /YOUR_PATH/bigdata_keys.pem

<div class="msg-block msg-ref">
      <div class="msg-text-ref">
          <p>To connect to the Hadoop web dashboards, see the following guides: <br>
          <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html">Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding</a><br>
          <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-proxy.html">Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node</a>
          </p>
     </div>
</div>

<p>Print the internal host name of the master node</p>

In [None]:
!aws emr list-instances \
        --cluster-id YOUR_CLUSTER_ID \
        --instance-group-types "MASTER" \
        --query "Instances[0].PrivateDnsName" \
        --output text

<p>Use the host name to connect to the Hadoop web dashboards. For example,</p>

<a class="code-block code-font" href="http://ip-10-0-1-14.eu-west-1.compute.internal:8088">ip-10-0-1-14.eu-west-1.compute.internal:8088</a> (ResourceManager dashboard)

<p>Display steps and their states </p>

In [None]:
!aws emr list-steps --cluster-id YOUR_CLUSTER_ID

<p>After the jobflow step has completed, check the output directory</p>

In [None]:
!sudo aws s3 ls s3://YOUR_BUCKET/data/output_jobflow/

<p>Display an output file</p>

In [None]:
!sudo aws s3 cp --quiet s3://YOUR_BUCKET/data/output_jobflow/part-r-00000 /dev/stdout
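
<p>Each line of the output file should hold a group number and a count separated by a tab, since both the key and the value of the second job are <span class="code-font">IntWritable</span>. The sketch below parses such text and prints the share of each group. <span class="code-font">GroupOutputReader</span> is a hypothetical helper, and the sample numbers are made up; they are not results of the actual job.</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch: reading "group<TAB>count" lines such as those in part-r-00000.
// GroupOutputReader is a hypothetical helper; the sample data in main is made up.
class GroupOutputReader {

    // Parses the text into a map sorted by group number.
    // Lines that do not have exactly two tab-separated fields are skipped.
    static Map<Integer, Integer> parse(String content) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : content.split("\n")) {
            String[] items = line.trim().split("\t");
            if (items.length == 2) {
                counts.put(Integer.parseInt(items[0]), Integer.parseInt(items[1]));
            }
        }
        return counts;
    }

    // Returns the total number of products over all groups.
    static int total(Map<Integer, Integer> counts) {
        int sum = 0;
        for (int count : counts.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        String sample = "3\t10\n4\t30\n5\t60\n"; // made-up counts
        Map<Integer, Integer> counts = parse(sample);
        int total = total(counts);
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            double share = 100.0 * e.getValue() / total;
            System.out.println("group " + e.getKey() + ": " + e.getValue() + " (" + share + "%)");
        }
    }
}
```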

<div class="msg-block msg-imp">
  <div class="msg-text-imp"><p>
      Don't forget to terminate the cluster; otherwise your free subscription will run out quickly. As a rule of thumb, terminate the cluster as soon as all jobs have completed. There are two ways to do this:<br>
      1) <span class="code-font">EMR -> Select Cluster -> Terminate</span><br>
      2) AWS CLI: <span class="code-font">aws emr terminate-clusters --cluster-ids j-xxxxx</span>
      
  </p></div>
</div>

<p>Terminate the cluster</p>

In [None]:
!aws emr terminate-clusters --cluster-ids YOUR_CLUSTER_ID

<p>Check that all clusters are terminated</p>

In [None]:
!aws emr list-clusters --active 

<a name="1b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Oozie
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#1c">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Creating configuration files with a workflow description</b></p>

<p>Build two jar files, <span class="code-font">ProdAvgRatingApp.jar</span> and <span class="code-font">GroupProdRating.jar</span>, from the classes of the previous section and copy them to <span class="code-font">/YOUR_PATH/oozie-project/lib</span>. To build <span class="code-font">GroupProdRating.jar</span>, also add the driver below</p> 

<p class="code-font">GroupDriver.java</p>

In [None]:
package edu.classes.mrjobflow;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GroupDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(getConf(), "GroupProdRating");
        job.setJarByClass(GroupDriver.class);
        job.setMapperClass(GroupMapper.class);
        job.setCombinerClass(GroupReducer.class);
        job.setReducerClass(GroupReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        System.exit(ToolRunner.run(conf, new GroupDriver(), args));
    }
}


<p>Create an Oozie workflow file</p>

<p class="code-font">workflow.xml</p>

In [None]:
# %load /YOUR_PATH/oozie-project/workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-job-flow">
    <start to="mr-avg-rating"/>
    <action name="mr-avg-rating">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="/user/jobflow/temp_avg"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapreduce.job.map.class</name>
                    <value>edu.classes.mapreduce.ProdAvgMapper</value>
                </property>
                <property>
                    <name>mapreduce.job.combine.class</name>
                    <value>edu.classes.mapreduce.ProdAvgCombiner</value>
                </property>
                <property>
                    <name>mapreduce.job.reduce.class</name>
                    <value>edu.classes.mapreduce.ProdAvgReducer</value>
                </property>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>/user/hadoop/reviews_Electronics_5.json</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>/user/jobflow/temp_avg</value>
                </property>
                <property>
                    <name>mapreduce.job.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.job.output.value.class</name>
                    <value>edu.classes.mapreduce.SumCountWritable</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="mr-group-rating"/>
        <error to="fail"/>
    </action>
    <action name="mr-group-rating">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="/user/jobflow/output"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapreduce.job.map.class</name>
                    <value>edu.classes.mrjobflow.GroupMapper</value>
                </property>
                <property>
                    <name>mapreduce.job.combine.class</name>
                    <value>edu.classes.mrjobflow.GroupReducer</value>
                </property>
                <property>
                    <name>mapreduce.job.reduce.class</name>
                    <value>edu.classes.mrjobflow.GroupReducer</value>
                </property>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>/user/jobflow/temp_avg</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>/user/jobflow/output</value>
                </property>
                <property>
                    <name>mapreduce.job.output.key.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapreduce.job.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
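
<p>The workflow above is a small state machine: each action names the node to go to on success (<span class="code-font">ok</span>) and on failure (<span class="code-font">error</span>). The sketch below walks the same transitions in plain Java. It is an illustration only and does not use or reproduce Oozie itself; <span class="code-font">WorkflowSketch</span> and <span class="code-font">walk</span> are hypothetical names.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch of the ok/error transitions declared in workflow.xml.
// WorkflowSketch and walk are hypothetical names; this is not how Oozie is implemented.
class WorkflowSketch {

    // Action name -> {node on <ok>, node on <error>}, copied from workflow.xml
    private static final Map<String, String[]> TRANSITIONS = new HashMap<>();
    static {
        TRANSITIONS.put("mr-avg-rating", new String[] {"mr-group-rating", "fail"});
        TRANSITIONS.put("mr-group-rating", new String[] {"end", "fail"});
    }

    // Returns the nodes visited, given the set of actions that fail.
    static List<String> walk(Set<String> failingActions) {
        List<String> visited = new ArrayList<>();
        String node = "mr-avg-rating"; // <start to="mr-avg-rating"/>
        while (TRANSITIONS.containsKey(node)) {
            visited.add(node);
            String[] next = TRANSITIONS.get(node);
            node = failingActions.contains(node) ? next[1] : next[0];
        }
        visited.add(node); // terminal node: "end" or "fail"
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(walk(Collections.<String>emptySet()));
        System.out.println(walk(Collections.singleton("mr-avg-rating")));
    }
}
```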


<p class="code-font">flow.properties</p>

In [None]:
# %load /YOUR_PATH/oozie-project/flow.properties
nameNode=hdfs://<change>
jobTracker=<change>
oozie.wf.application.path=${nameNode}/user/hadoop/oozie-app
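
<p>The last property refers to <span class="code-font">${nameNode}</span>, which is replaced by the value defined on the first line. The sketch below shows this kind of substitution in plain Java. It is not Oozie's own resolver: <span class="code-font">PropertySubstitution</span> is a hypothetical class, and the host and port in the example are placeholders chosen for illustration, not the values to put in place of <span class="code-font">&lt;change&gt;</span>.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of ${name} substitution in property values.
// PropertySubstitution is a hypothetical class; this is not Oozie's resolver.
class PropertySubstitution {

    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} with its value from props; unknown names are left untouched.
    static String resolve(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        // Placeholder host and port, chosen only for this example
        props.put("nameNode", "hdfs://example-host:8020");
        System.out.println(resolve("${nameNode}/user/hadoop/oozie-app", props));
    }
}
```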


<p><b>Running an EMR cluster</b></p>

<div class="msg-block msg-info">
  <div class="msg-text-info">
      <p>1) find the subnet in which the cluster will run</p>
      <p class="code-block code-font">aws ec2 <span class="code-key">describe-subnets</span></p>
      <p>2) create an S3 bucket (if needed)</p>
      <p class="code-block code-font">aws s3 <span class="code-key">mb</span> s3://YOUR_BUCKET/</p>
      <p>3) copy the bootstrap action script to your S3 bucket</p>
      <p class="code-block code-font">aws s3 <span class="code-key">cp</span> /YOUR_PATH/config/download-unzip-s3.sh s3://YOUR_BUCKET/scripts/download-unzip-s3.sh</p>
      <p>4) copy the Oozie workflow files to S3</p>
      <p class="code-block code-font">aws s3 <span class="code-key">cp</span> /YOUR_PATH/oozie-project s3://YOUR_BUCKET/oozie --recursive</p>
  </div>
</div>

<p>Launch an EMR cluster</p>

In [None]:
%%bash
aws emr create-cluster \
    --name "Hadoop_Cluster" \
    --release-label emr-5.8.0 \
    --applications Name=Hadoop Name=Zeppelin Name=Oozie \
    --log-uri s3://YOUR_BUCKET/logs/ \
    --service-role emr-default-role \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=3,InstanceType=m4.large \
    --ec2-attributes InstanceProfile=emr-default-ec2-role,KeyName=BigData_Keys,SubnetId=YOUR_SUBNET \
    --bootstrap-actions Path=s3://YOUR_BUCKET/scripts/download-unzip-s3.sh \
    --configurations file:///YOUR_PATH/config/hdfs-config.json

<p>Check the state of your cluster</p>

In [None]:
!aws emr describe-cluster \
    --cluster-id YOUR_CLUSTER_ID \
    --query "Cluster.Status"

<p>Set up an SSH tunnel using dynamic port forwarding (run in your terminal)</p>

In [None]:
sudo aws emr socks --cluster-id YOUR_CLUSTER_ID --key-pair-file /YOUR_PATH/bigdata_keys.pem

<div class="msg-block msg-ref">
      <div class="msg-text-ref">
          <p>To connect to the Hadoop web dashboards, see the following guides: <br>
          <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html">Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding</a><br>
          <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-proxy.html">Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node</a>
          </p>
     </div>
</div>

<p>Print the internal host name of the master node</p>

In [None]:
!aws emr list-instances \
        --cluster-id YOUR_CLUSTER_ID \
        --instance-group-types "MASTER" \
        --query "Instances[0].PrivateDnsName" \
        --output text

<p>Use the host name to connect to the Hadoop web dashboards. For example,</p>

<a class="code-block code-font" href="http://ip-10-0-1-14.eu-west-1.compute.internal:8088">ip-10-0-1-14.eu-west-1.compute.internal:8088</a> (ResourceManager dashboard)

<p>Connect to the master node via SSH</p>

<p>1. Display the public IP address of the master node</p>

In [None]:
!aws emr list-instances \
        --cluster-id YOUR_CLUSTER_ID \
        --instance-group-types "MASTER" \
        --query "Instances[0].PublicIpAddress" \
        --output text

<p>2. In your terminal, run the following command</p>

In [None]:
sudo ssh -i /YOUR_PATH/bigdata_keys.pem hadoop@MASTER_PUBLIC_IP

<p>On the master node, run the command below to copy the <span class="code-font">reviews_Electronics_5.json</span> file from S3 to HDFS</p>

In [None]:
hadoop distcp s3://YOUR_BUCKET/data/reviews_Electronics_5.json hdfs:///user/hadoop

<p><b>Running an Oozie workflow</b> (all commands on the master node)</p>

<p>Copy the <span class="code-font">flow.properties</span> file from S3 to a local directory of the master node</p>

In [None]:
sudo aws s3 cp s3://YOUR_BUCKET/oozie/flow.properties /home/hadoop/oozie/

<p>Insert the host name of the master node as the first line of the file using the following command</p>

In [None]:
sudo sed -i "1imasterNode=$HOSTNAME" /home/hadoop/oozie/flow.properties

<p>Create an HDFS directory for the jar files and the Oozie workflow</p>

In [None]:
hdfs dfs -mkdir -p /user/hadoop/oozie-app

<p>Copy the <span class="code-font">workflow.xml</span> file to HDFS</p> 

In [None]:
hdfs dfs -cp s3://YOUR_BUCKET/oozie/workflow.xml /user/hadoop/oozie-app/

<p>Copy the jar files to HDFS</p> 

In [None]:
hdfs dfs -cp s3://YOUR_BUCKET/oozie/lib /user/hadoop/oozie-app/lib

<p>Run the workflow. You can use the ResourceManager dashboard to track execution</p>

In [None]:
oozie job -config /home/hadoop/oozie/flow.properties -run

<p>Run the following command to see the status of the jobs in your workflow</p>

In [None]:
oozie job -info YOUR_JOB_ID

<p>Look at logs using the command below</p>

In [None]:
oozie job -log YOUR_JOB_ID

<div class="msg-block msg-info">
  <div class="msg-text-info">
      <p>You can find the Oozie log files here</p>
      <p class="code-block code-font"><span class="code-key">ls</span> /var/log/oozie</p>
  </div>
</div>

<p>Display the result of the computation</p>

In [None]:
hdfs dfs -cat /user/jobflow/output/part-r-0000*

<div class="msg-block msg-imp">
  <div class="msg-text-imp"><p>
      Don't forget to terminate the cluster; otherwise your free subscription will run out quickly. As a rule of thumb, terminate the cluster as soon as all jobs have completed. There are two ways to do this:<br>
      1) <span class="code-font">EMR -> Select Cluster -> Terminate</span><br>
      2) AWS CLI: <span class="code-font">aws emr terminate-clusters --cluster-ids j-xxxxx</span>
      
  </p></div>
</div>

<p>Terminate the cluster</p>

In [None]:
!aws emr terminate-clusters --cluster-ids YOUR_CLUSTER_ID

<p>Check that all clusters are terminated</p>

In [None]:
!aws emr list-clusters --active

<a name="1c"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            c. Tez
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1b">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2">Next</a>
            </div>
        </div>
    </div>
</div>

In [None]:
# TODO

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Data Serialization Systems</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a name="2a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Java
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2b">Next</a>
            </div>
        </div>
    </div>
</div>

In [None]:
# TODO

<a name="2b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Avro
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2c">Next</a>
            </div>
        </div>
    </div>
</div>

In [None]:
# TODO

<a name="2c"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            c. Parquet
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2b">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#3">Next</a>
            </div>
        </div>
    </div>
</div>

In [None]:
# TODO

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

JobControl
<ul>
    <li><a href="https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/mapreduce/lib/jobcontrol/JobControl.html">JobControl</a></li>
    <li><a  href="https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/mapreduce/lib/jobcontrol/ControlledJob.html">ControlledJob</a></li>
    <li><a href="http://coe4bd.github.io/HadoopHowTo/multipleJobsSingle/multipleJobsSingle.html">Chaining and Managing Multiple MapReduce Jobs with One Driver</a></li>
    <li><a href="https://www.programcreek.com/java-api-examples/index.php?source_dir=HadoopEKG-master/mapred/src/benchmarks/gridmix2/src/java/org/apache/hadoop/mapreduce/GridMixRunner.java">GridMixRunner.java</a></li>
</ul>

Oozie
<ul>
    <li><a href="https://oozie.apache.org/docs/4.3.0/">Oozie, Workflow Engine for Apache Hadoop</a></li>
    <li><a href="https://www.safaribooksonline.com/library/view/apache-oozie/9781449369910/ch04.html">Apache Oozie by Aravind Srinivasan, Mohammad Kamrul Islam</a></li>
    <li><a href="https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook">Map Reduce Cookbook</a></li>
    <li><a href="https://aws.amazon.com/ru/blogs/big-data/run-common-data-science-packages-on-anaconda-and-oozie-with-amazon-emr/">Run Common Data Science Packages on Anaconda and Oozie with Amazon EMR</a></li>
    <li><a href="https://aws.amazon.com/blogs/big-data/use-apache-oozie-workflows-to-automate-apache-spark-jobs-and-more-on-amazon-emr/">Use Apache Oozie Workflows to Automate Apache Spark Jobs (and more!) on Amazon EMR</a></li>
    <li><a href="https://discuss.pivotal.io/hc/en-us/articles/203355837-How-to-run-a-MapReduce-jar-using-Oozie-workflow">How to run a MapReduce jar using Oozie workflow</a></li>
    <li><a href="https://gist.github.com/airawat/6001806">Oozie Workflow Java MapReduce Action</a></li>
</ul>