<div style="font-size:18pt; padding-top:20px; text-align:center"><span style="font-weight:bold; color:green">Hadoop </span><b>Configuration on Amazon EMR cluster </b></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">Bootstrap Actions</a></li>
        <li><a href="#2">Hadoop Configuration</a></li>
        <li><a href="#3">EMR Steps</a></li>
        <li><a href="#4">Running MapReduce Job on EMR Cluster</a></li>
        <li><a href="#5">Configuration Files on EMR Cluster</a></li>
        <li><a href="#6">References</a></li>
    </ol>
</div>

<p>Run the cell below to apply the Jupyter notebook style</p>

In [1]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Bootstrap Actions</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<div class="msg-block msg-ref">
      <div class="msg-text-ref">
          <p>"You can use a bootstrap action to install additional software on your cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster" - <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html">Create Bootstrap Actions to Install Additional Software</a></p>
     </div>
</div>

<p>Create an S3 bucket where the scripts will be stored (if needed)</p>

In [None]:
!aws s3 mb s3://aws-bigdata

<p>The script below downloads a compressed JSON file to the EMR master node, uncompresses it, and uploads it to S3</p>

<p>Bash script for Bootstrap Actions on the master</p>

In [None]:
# %load /YOUR_PATH/config/download-unzip-s3.sh
#!/bin/bash
echo "Check whether it is the master"

cluster_id=$(cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId")

host_private_ip=$(hostname -i)
master_private_ip=$(aws emr list-instances \
                        --cluster-id $cluster_id \
                        --instance-group-types "MASTER" \
                        --query "Instances[0].PrivateIpAddress" \
                        --output text)

# Check whether it is the master. If it isn't -> exit
if [ "$host_private_ip" != "$master_private_ip" ]
then
    echo "exit 0"
    exit 0
fi

echo "Check whether the file with data exists"

aws s3 ls s3://aws-bigdata/data/reviews_Electronics_5.json

# Check whether the file with data exists. If so -> exit
if [ $? = 0 ]
then
    echo "exit 0"
    exit 0
fi


wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz -P "/home/hadoop"
gzip -d /home/hadoop/reviews_Electronics_5.json.gz

echo "Check whether the bucket exists"

aws s3 ls "s3://aws-bigdata"

# Check whether the bucket exists. If it doesn't -> create it
if [ $? = 255 ]
then
    echo "Doesn't exist"
    aws s3 mb s3://aws-bigdata
fi

aws s3api put-object --bucket aws-bigdata --key data/
aws s3 cp /home/hadoop/reviews_Electronics_5.json s3://aws-bigdata/data/reviews_Electronics_5.json
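
<p>A possible simplification of the master check, as a minimal sketch: EMR nodes also keep a local <span class="code-font">/mnt/var/lib/info/instance.json</span> file with an <span class="code-font">isMaster</span> flag, so the node role can be read without calling the EMR API. The helper name <span class="code-font">is_master_node</span> and its optional path argument are assumptions made for illustration.</p>

```shell
#!/bin/bash
# Sketch: detect the master node from the local instance metadata file.
# Assumption: the file contains an "isMaster" boolean field.
is_master_node() {
    local info_file="${1:-/mnt/var/lib/info/instance.json}"
    # Succeeds only when the flag is literally true; a missing file counts as "not master"
    grep -Eq '"isMaster"[[:space:]]*:[[:space:]]*true' "$info_file"
}
```

<p>In a bootstrap script it could be used as <span class="code-font">is_master_node || exit 0</span>.</p>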


<p>Upload your script with bootstrap actions to the S3 bucket. It will be used later</p>

In [None]:
!aws s3 cp /YOUR_PATH/config/download-unzip-s3.sh s3://aws-bigdata

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Hadoop Configuration</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<div class="msg-block msg-ref">
      <div class="msg-text-ref">
          <p>"You can override the default configurations for applications by supplying a configuration object for applications when you create a cluster" - <a href="http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html">Configuring Applications</a></p>
     </div>
</div>

<p>Set the default HDFS replication factor to 3 using a JSON configuration file</p>

In [None]:
# %load /YOUR_PATH/config/hdfs-config.json
[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "3"
    }
  }
]
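
<p>If several replication factors need to be tried, the same configuration object can be generated rather than edited by hand. The sketch below is only an illustration: <span class="code-font">make_hdfs_config</span> is a hypothetical helper, not part of the AWS CLI.</p>

```shell
#!/bin/bash
# Sketch: print an EMR configuration object for hdfs-site
# with the given replication factor (defaults to 3).
make_hdfs_config() {
    local replication="${1:-3}"
    cat <<EOF
[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "${replication}"
    }
  }
]
EOF
}
```

<p>For example, <span class="code-font">make_hdfs_config 2 > hdfs-config.json</span> writes a file for a replication factor of 2. Once the cluster is running, <span class="code-font">hdfs getconf -confKey dfs.replication</span> on the master shows the value that is actually in effect.</p>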


<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. EMR Steps</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<div class="msg-block msg-ref">
      <div class="msg-text-ref">
          <p>"Amazon EMR defines a unit of work called a step, which can contain one or more Hadoop jobs. A step is an instruction that manipulates the data" - <a href="http://docs.aws.amazon.com/emr/latest/DeveloperGuide//emr-steps.html">Steps</a>, <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-work-with-steps.html">Work with Steps Using the CLI and Console</a></p>
     </div>
</div>

<p>A single step that calculates the average rating of each product</p>

In [None]:
# %load /YOUR_PATH/config/emr-jar-step.json
[
  {
     "Name": "Average Product Rating Step",
     "Type": "CUSTOM_JAR",
     "ActionOnFailure": "TERMINATE_CLUSTER",
     "Jar": "s3://aws-bigdata/ProdAvgRatingApp.jar",
     "Args": [
         "s3://aws-bigdata/data/reviews_Electronics_5.json",
         "s3://aws-bigdata/output_avg"]
  }
]
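
<p>The same step file can also be submitted to a cluster that is already running, using <span class="code-font">aws emr add-steps</span>. A minimal sketch follows; the wrapper name <span class="code-font">submit_steps</span> and its arguments are assumptions for illustration.</p>

```shell
#!/bin/bash
# Sketch: submit a JSON step file to an already running cluster.
# $1 - cluster id (e.g. j-XXXXX), $2 - absolute path to the step file
submit_steps() {
    local cluster_id="$1"
    local steps_file="$2"
    aws emr add-steps --cluster-id "$cluster_id" --steps "file://${steps_file}"
}
```

<p>Example: <span class="code-font">submit_steps YOUR_CLUSTER_ID /YOUR_PATH/config/emr-jar-step.json</span>. Note that with <span class="code-font">"ActionOnFailure": "TERMINATE_CLUSTER"</span> a failed step shuts the cluster down.</p>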


<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. Running MapReduce Job on EMR Cluster</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<p>Display all subnets and use one of them (created in previous classes) for cluster creation later in this section. If there are no available subnets, create one</p>

In [None]:
!aws ec2 describe-subnets

<p><b>Run an EMR cluster and a MapReduce job WITHOUT STEPS</b></p>

<p>Create an EMR cluster with bootstrap actions and configuration. Specify your subnet</p>

In [None]:
%%bash
aws emr create-cluster \
    --name "Hadoop_Cluster" \
    --release-label emr-5.8.0 \
    --applications Name=Hadoop Name=Zeppelin \
    --log-uri s3://aws-emr-logs/logs/ \
    --service-role emr-default-role \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=3,InstanceType=m4.large \
    --ec2-attributes InstanceProfile=emr-default-ec2-role,KeyName=BigData_Keys,SubnetId=YOUR_SUBNET \
    --bootstrap-action Path=s3://aws-bigdata/download-unzip-s3.sh \
    --configurations file:///YOUR_PATH/config/hdfs-config.json
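
<p>Cluster creation takes several minutes, and the commands that follow need a running cluster. One way to wait for it, sketched under the assumption that the cluster id printed by <span class="code-font">create-cluster</span> has been saved, is the CLI waiter; <span class="code-font">wait_until_running</span> is a hypothetical helper name.</p>

```shell
#!/bin/bash
# Sketch: block until the cluster is up, then print its state.
# $1 - cluster id returned by "aws emr create-cluster"
wait_until_running() {
    local cluster_id="$1"
    # The waiter polls the cluster and fails if it terminates instead of starting
    aws emr wait cluster-running --cluster-id "$cluster_id" &&
    aws emr describe-cluster \
        --cluster-id "$cluster_id" \
        --query "Cluster.Status.State" \
        --output text
}
```

<p>Adding <span class="code-font">--query ClusterId --output text</span> to the <span class="code-font">create-cluster</span> call prints only the id, which makes it easy to store in a shell variable.</p>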

<p>Display the public IP address of the master</p>

In [None]:
!aws emr list-instances \
        --cluster-id YOUR_CLUSTER_ID \
        --instance-group-types "MASTER" \
        --query "Instances[0].PublicIpAddress" \
        --output text

<p>Copy the jar file to the master</p>

<p><i>Option 1. From local PC via SSH</i></p>

In [None]:
sudo scp -i "/YOUR_PATH/bigdata_keys.pem" /YOUR_PATH/ProdAvgRatingApp.jar hadoop@MASTER_PUBLIC_IP:/home/hadoop

<p><i>Option 2. Connect to the master and get the jar file from S3</i></p>

In [None]:
sudo ssh -i /YOUR_PATH/bigdata_keys.pem hadoop@MASTER_PUBLIC_IP

In [None]:
aws s3 cp s3://aws-bigdata/ProdAvgRatingApp.jar /home/hadoop

<p>Copy the data file from S3 to HDFS</p>

In [None]:
hadoop distcp s3://aws-bigdata/data/reviews_Electronics_5.json hdfs:///user/hadoop

<p>Check that the file was copied successfully</p>

In [None]:
hdfs dfs -tail /user/hadoop/reviews_Electronics_5.json

<p>Display how blocks of the file are distributed</p>

In [None]:
hdfs fsck /user/hadoop/reviews_Electronics_5.json -files -blocks -locations
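
<p>The number of blocks reported by <span class="code-font">fsck</span> can be checked by hand: it is the file size divided by the block size, rounded up. The sketch below assumes the Hadoop 2.x default block size of 128 MiB (the cluster value can be read with <span class="code-font">hdfs getconf -confKey dfs.blocksize</span>); the helper name and the 1 GB file size in the example are hypothetical.</p>

```shell
#!/bin/bash
# Sketch: number of HDFS blocks = ceil(file_size / block_size).
# $1 - file size in bytes, $2 - block size in bytes (default 128 MiB)
expected_blocks() {
    local file_size="$1"
    local block_size="${2:-134217728}"
    # Integer ceiling division
    echo $(( (file_size + block_size - 1) / block_size ))
}

expected_blocks 1000000000   # prints 8
```

<p>With a replication factor of 3, such a file would occupy 8 x 3 = 24 block replicas across the core nodes.</p>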

<p>Run the jar file with two reducers</p>

In [None]:
hadoop jar /home/hadoop/ProdAvgRatingApp.jar \
            -D mapreduce.job.reduces=2 \
            hdfs:///user/hadoop/reviews_Electronics_5.json \
            hdfs:///user/hadoop/output_ratings

<p>Display the result</p>

In [None]:
hdfs dfs -tail /user/hadoop/output_ratings/part-r-00000

<div class="msg-block msg-warning">
  <div class="msg-text-warn"><p>The job fails if the output directory already exists, so remove it before re-running the job</p>
            <p class="code-block code-font">hdfs <span class="code-key">dfs</span> -rm -r /user/hadoop/output_ratings</p></div>
</div>

<p>Copy the output files to S3</p>

In [None]:
# TODO
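
<p>One possible way to do this, mirroring the earlier S3-to-HDFS copy, is to run DistCp in the opposite direction. This is only a sketch: the helper name <span class="code-font">copy_output_to_s3</span> and the destination prefix <span class="code-font">s3://aws-bigdata/output_ratings</span> are assumptions.</p>

```shell
#!/bin/bash
# Sketch: copy the job output directory from HDFS to S3 with DistCp.
# $1 - source directory in HDFS, $2 - destination prefix in S3 (assumed name)
copy_output_to_s3() {
    local src="${1:-hdfs:///user/hadoop/output_ratings}"
    local dst="${2:-s3://aws-bigdata/output_ratings}"
    hadoop distcp "$src" "$dst"
}
```

<p>Run <span class="code-font">copy_output_to_s3</span> on the master node after the job has finished.</p>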

<p><b>Hadoop Web Dashboard</b></p>

<p>Set up an SSH tunnel using dynamic port forwarding with the AWS CLI</p>

In [None]:
sudo aws emr socks --cluster-id YOUR_CLUSTER_ID --key-pair-file /YOUR_PATH/bigdata_keys.pem

<div class="msg-block msg-ref">
      <div class="msg-text-ref">
          <p>To connect to hadoop web dashboards, see the following guides: <br>
          <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html">Part 1: Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding</a><br>
          <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-proxy.html">Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node</a>
          </p>
     </div>
</div>

<p>Display the private DNS name of the master to connect through a browser</p>

In [None]:
!aws emr list-instances \
        --cluster-id YOUR_CLUSTER_ID \
        --instance-group-types "MASTER" \
        --query "Instances[0].PrivateDnsName" \
        --output text

<p>Ports to access web dashboards</p>

<p class="code-block code-font">HDFS NameNode port:<span class="code-key"> 50070</span></p>
<p class="code-block code-font">YARN ResourceManager port:<span class="code-key"> 8088</span></p>
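
<p>Combining the private DNS name with these ports gives the dashboard addresses to open in the proxied browser. A small sketch, where <span class="code-font">dashboard_urls</span> is a hypothetical helper and the host name in the example is made up:</p>

```shell
#!/bin/bash
# Sketch: print dashboard URLs for a given master private DNS name.
# The URLs are reachable only through the SSH tunnel / SOCKS proxy.
dashboard_urls() {
    local master_dns="$1"
    echo "NameNode:        http://${master_dns}:50070"
    echo "ResourceManager: http://${master_dns}:8088"
}

dashboard_urls ip-10-0-0-1.ec2.internal
```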

<p><b>ResourceManager and NodeManager Dashboards</b></p>

<p>Nodes</p>
<img src="img/nodes.png"/>

<p>Scheduler</p>
<img src="img/sch.png"/>

<p>Application</p>
<img src="img/app.png"/>

<p>Job</p>
<img src="img/job.png"/>

<p>Map Tasks</p>
<img src="img/maps.png"/>

<p>Map Task</p>
<img src="img/map_task.png"/>

<p>Container</p>
<img src="img/container.png"/>

<p>NodeManager</p>
<img src="img/node_manager.png"/>

<div class="msg-block msg-imp">
  <div class="msg-text-imp"><p>
      Don't forget to terminate the cluster; otherwise your free subscription will run out quickly. As a rule of thumb, terminate the cluster as soon as all jobs have completed. There are two ways to do this:<br>
      1) <span class="code-font">EMR -> Select Cluster -> Terminate</span><br>
      2) AWS CLI: <span class="code-font">aws emr terminate-clusters --cluster-ids j-xxxxx</span>
      
  </p></div>
</div>

<p>Terminate the cluster</p>

In [None]:
!aws emr terminate-clusters --cluster-ids YOUR_CLUSTER_ID
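
<p>Termination is asynchronous, so the command returns before the instances are actually released. A sketch of how to wait for completion with the CLI waiter; the wrapper name <span class="code-font">terminate_and_wait</span> is an assumption.</p>

```shell
#!/bin/bash
# Sketch: request termination and block until the cluster is gone.
# $1 - cluster id (e.g. j-XXXXX)
terminate_and_wait() {
    local cluster_id="$1"
    aws emr terminate-clusters --cluster-ids "$cluster_id" &&
    aws emr wait cluster-terminated --cluster-id "$cluster_id"
}
```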

<p><b>Run an EMR cluster and a MapReduce job WITH STEPS</b></p>

In [None]:
%%bash
aws emr create-cluster \
    --name "Hadoop_Cluster" \
    --release-label emr-5.8.0 \
    --applications Name=Hadoop Name=Zeppelin \
    --log-uri s3://aws-emr-logs/logs/ \
    --service-role emr-default-role \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=3,InstanceType=m4.large \
    --ec2-attributes InstanceProfile=emr-default-ec2-role,KeyName=BigData_Keys,SubnetId=YOUR_SUBNET \
    --bootstrap-action Path=s3://aws-bigdata/download-unzip-s3.sh \
    --configurations file:///YOUR_PATH/config/hdfs-config.json \
    --steps file:///YOUR_PATH/config/emr-jar-step.json
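
<p>The step runs only after the bootstrap actions have finished, so the output does not appear in S3 immediately. Step progress can be followed from the CLI; below is a minimal sketch, with <span class="code-font">list_step_states</span> as a hypothetical helper name.</p>

```shell
#!/bin/bash
# Sketch: print id, name and state of every step of a cluster.
# $1 - cluster id (e.g. j-XXXXX)
list_step_states() {
    local cluster_id="$1"
    aws emr list-steps \
        --cluster-id "$cluster_id" \
        --query "Steps[].[Id,Name,Status.State]" \
        --output text
}
```

<p>The step has finished successfully once its state is <span class="code-font">COMPLETED</span>.</p>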

<div class="msg-block msg-info">
  <div class="msg-text-info"><p>You can use the <span class="code-font">--auto-terminate</span> option to terminate the cluster automatically after all the steps have completed</p></div>
</div>

<p>Print the result from S3 to your terminal</p>

In [None]:
sudo aws s3 cp --quiet s3://aws-bigdata/output_avg/part-r-00000 /dev/stdout

<div class="msg-block msg-imp">
  <div class="msg-text-imp"><p>
      Don't forget to terminate the cluster; otherwise your free subscription will run out quickly. As a rule of thumb, terminate the cluster as soon as all jobs have completed. There are two ways to do this:<br>
      1) <span class="code-font">EMR -> Select Cluster -> Terminate</span><br>
      2) AWS CLI: <span class="code-font">aws emr terminate-clusters --cluster-ids j-xxxxx</span>
  </p></div>
</div>

<a name="5"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">5. Configuration Files on EMR Cluster</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<p><span class="code-font">core-site.xml</span>, <span class="code-font">hdfs-site.xml</span>, <span class="code-font">yarn-site.xml</span>, <span class="code-font">mapred-site.xml</span></p>

In [None]:
ls /etc/hadoop/conf.empty

<p>Environment Variables</p>

In [None]:
ls /etc/default

<p>Daemons</p>

<div class="msg-block msg-ref">
      <div class="msg-text-ref">
          <p><a href="https://aws.amazon.com/ru/premiumsupport/knowledge-center/restart-service-emr/">How do I restart a service in Amazon EMR?</a>
          </p>
     </div>
</div>

In [None]:
initctl list

<p>Example with ResourceManager</p>

In [None]:
# Pick one action: start, stop, or restart
sudo restart hadoop-yarn-resourcemanager

<a name="6"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">6. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a href="https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/ClusterSetup.html">Hadoop Cluster Setup</a><br>
<a href="https://mapr.com/blog/best-practices-yarn-resource-management/">Best Practices for YARN Resource Management</a>