<div style="font-size:18pt; padding-top:20px; text-align:center"><b>BigData Cluster on </b> <span style="font-weight:bold; color:green">AWS</span></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru, papulin_hse@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Content</span>
    <ol>
        <li><a href="#1">Quick Start Local Cluster</a></li>
        <li><a href="#2">Virtual Private Cloud for Cluster</a></li>
        <li><a href="#3">Elastic MapReduce (EMR)</a></li>
        <li><a href="#4">Cloudera Cluster on AWS</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#4a">Running AWS Instances</a></li>
                <li><a href="#4b">Deploying Cluster using Cloudera Manager</a></li>
                <li><a href="#4c">Running User Code</a></li>
                <li><a href="#4d">Rerun and Resource Release</a></li>
            </ol>
        </li>
        <li><a href="#5">References</a></li>
    </ol>
</div>

<p>Run the cell below to apply the Jupyter notebook style</p>

In [1]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Quick Start Local Cluster</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>

<p>Install one of the available distributions (Cloudera QuickStart VM by default)</p>

In [None]:
# TODO: Add HDP and MapR

<p>Create VirtualBox VM port forwarding for SSH</p>

In [None]:
# VM -> Settings -> Network -> Advanced -> Port Forwarding -> Host: 127.0.0.1 Port: 2222, Guest: Port 22

<p>Connect to the local Cloudera VM via SSH</p>

In [None]:
sudo ssh cloudera@127.0.0.1 -p 2222

<div class="msg-block msg-info">
  <div class="msg-text-info"><p>Password of the Cloudera VM: <span class="code-font">cloudera</span></p></div>
</div>

<p>Use port forwarding from your local host to the local VM to access the HDFS dashboard</p>

In [None]:
sudo ssh -N -f -L 9961:quickstart.cloudera:50070 cloudera@127.0.0.1 -p 2222

<p>Open a web browser to see the HDFS dashboard</p>

<div class="code-block code-font"><a href="http://localhost:9961">http://localhost:9961</a></div>

<p>Forward a port for Hue in the same way</p>

In [None]:
sudo ssh -N -f -L 9962:quickstart.cloudera:8888 cloudera@127.0.0.1 -p 2222

<div class="code-block code-font"><a href="http://localhost:9962">http://localhost:9962</a></div>

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Virtual Private Cloud for Cluster</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>

<p><b>Python Script</b></p>

In [None]:
import subprocess

In [None]:
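def run_cmd_get_id(cmd):
    """Run a shell command and return its stdout stripped of quotes and newlines."""
    output = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, 
                            shell=True, universal_newlines=True)
    # print(output.stderr)  # uncomment to debug a failed AWS CLI call
    return output.stdout.strip('"\n')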
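
<p>The helper above returns an empty string when the AWS CLI call fails, so a broken step can go unnoticed. A minimal sketch of a stricter variant follows. The name <span class="code-font">run_cmd_checked</span> is hypothetical and not part of the original notebook.</p>

```python
import subprocess


def run_cmd_checked(cmd):
    """Run a shell command; return cleaned stdout or raise if it failed.

    Hypothetical stricter variant of run_cmd_get_id (the name is an assumption).
    """
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            shell=True, universal_newlines=True)
    if result.returncode != 0:
        # Surface the error text instead of silently returning an empty string
        raise RuntimeError("Command failed with code %d: %s"
                           % (result.returncode, result.stderr.strip()))
    # Same cleanup as run_cmd_get_id: drop surrounding quotes and newlines
    return result.stdout.strip('"\n')
```

<p>It can be used wherever <span class="code-font">run_cmd_get_id</span> is called if you prefer the notebook to stop at the first failed command.</p>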

In [None]:
# Create a VPC with cidr 10.0.1.0/24
vpc_id = \
run_cmd_get_id('aws ec2 create-vpc \
                        --cidr-block "10.0.1.0/24" \
                        --instance-tenancy "default" \
                        --query "Vpc.VpcId"')

# Enable DNS hostnames for instances launched in the VPC
run_cmd_get_id('aws ec2 modify-vpc-attribute --vpc-id ' + vpc_id + ' --enable-dns-hostnames')

# Create a subnet 10.0.1.0/28
subnet_id = \
run_cmd_get_id('aws ec2 create-subnet \
                        --vpc-id ' + vpc_id + ' \
                        --cidr-block "10.0.1.0/28" \
                        --query "Subnet.SubnetId"')

# Create an Internet gateway
gateway_id = \
run_cmd_get_id('aws ec2 create-internet-gateway \
                        --query "InternetGateway.InternetGatewayId"')

# Attach the Internet gateway to the VPC
run_cmd_get_id('aws ec2 attach-internet-gateway \
                        --vpc-id ' + vpc_id + ' \
                        --internet-gateway-id ' + gateway_id)

# Get id of default route table of the VPC
rtb_id = \
run_cmd_get_id('aws ec2 describe-route-tables \
                        --filters "Name=vpc-id,Values=' + vpc_id + '" \
                        --query "RouteTables[0].RouteTableId"')

# Route all outbound traffic (0.0.0.0/0) through the Internet gateway
run_cmd_get_id('aws ec2 create-route \
                        --route-table-id ' + rtb_id + ' \
                        --destination-cidr-block "0.0.0.0/0" \
                        --gateway-id ' + gateway_id)

# Get id of the default security group of the VPC
secgroup_id = \
run_cmd_get_id('aws ec2 describe-security-groups \
                        --filters "Name=vpc-id,Values=' + vpc_id + '" \
                        --query "SecurityGroups[0].GroupId"')

# Allow inbound SSH (TCP port 22) from any address
run_cmd_get_id('aws ec2 authorize-security-group-ingress \
                        --group-id ' + secgroup_id + ' \
                        --protocol tcp \
                        --port 22 \
                        --cidr "0.0.0.0/0"')

In [None]:
print("VPC ID: %s\nSubnet ID: %s\nGateway ID: %s\nRouteTable ID: %s\nSecurity Group ID: %s" % 
      (vpc_id, subnet_id, gateway_id, rtb_id, secgroup_id))
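
<p>The script places the subnet 10.0.1.0/28 inside the VPC range 10.0.1.0/24. The sketch below checks that relationship with the standard <span class="code-font">ipaddress</span> module and lists the addresses left for instances, given that AWS reserves the first four addresses and the last one in every subnet. The variable names are illustrative only.</p>

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.1.0/24")
subnet = ipaddress.ip_network("10.0.1.0/28")

# The subnet must lie inside the VPC range
inside_vpc = subnet.subnet_of(vpc)

# Iterating a network yields every address, including the first and the last
all_addresses = list(subnet)

# AWS reserves the first four addresses and the last one in every subnet
reserved = set(all_addresses[:4] + all_addresses[-1:])
usable = [str(ip) for ip in all_addresses if ip not in reserved]

print(inside_vpc, len(all_addresses), len(usable))
print(usable[0], usable[-1])
```

<p>A /28 subnet therefore leaves 11 assignable addresses, which is enough for a three-node cluster but not for a much larger one.</p>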

<p><b>boto3</b> - AWS SDK for Python</p>

In [None]:
# TODO
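
<p>One possible way to fill in this step is to mirror the AWS CLI calls of the Python script above with a boto3 EC2 client. This is a sketch under assumptions, not the author's implementation: the function name <span class="code-font">create_vpc_network</span> and the returned dictionary keys are hypothetical, and the client is passed in as an argument so that the code does not depend on how credentials are configured.</p>

```python
def create_vpc_network(ec2, vpc_cidr="10.0.1.0/24", subnet_cidr="10.0.1.0/28"):
    """Create a VPC, subnet, Internet gateway, default route and SSH rule.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    The function name and the result keys are assumptions for illustration.
    """
    vpc_id = ec2.create_vpc(CidrBlock=vpc_cidr,
                            InstanceTenancy="default")["Vpc"]["VpcId"]
    ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})

    subnet_id = ec2.create_subnet(VpcId=vpc_id,
                                  CidrBlock=subnet_cidr)["Subnet"]["SubnetId"]

    gateway = ec2.create_internet_gateway()["InternetGateway"]
    gateway_id = gateway["InternetGatewayId"]
    ec2.attach_internet_gateway(VpcId=vpc_id, InternetGatewayId=gateway_id)

    # The default route table and security group are looked up by VPC id
    vpc_filter = [{"Name": "vpc-id", "Values": [vpc_id]}]
    rtb_id = ec2.describe_route_tables(
        Filters=vpc_filter)["RouteTables"][0]["RouteTableId"]
    ec2.create_route(RouteTableId=rtb_id,
                     DestinationCidrBlock="0.0.0.0/0",
                     GatewayId=gateway_id)

    secgroup_id = ec2.describe_security_groups(
        Filters=vpc_filter)["SecurityGroups"][0]["GroupId"]
    # Allow inbound SSH from any address, as in the CLI version
    ec2.authorize_security_group_ingress(GroupId=secgroup_id, IpProtocol="tcp",
                                         FromPort=22, ToPort=22,
                                         CidrIp="0.0.0.0/0")

    return {"vpc_id": vpc_id, "subnet_id": subnet_id, "gateway_id": gateway_id,
            "rtb_id": rtb_id, "secgroup_id": secgroup_id}


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   ids = create_vpc_network(boto3.client("ec2"))
```

<p>Passing the client in also makes the function easy to try against a stub object before it touches a real account.</p>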

<p><b>Bash Script</b></p>

In [None]:
# TODO

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. Elastic MapReduce (EMR)</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p>
          <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html">What is Amazon EMR?</a><br>
              <a href="http://docs.aws.amazon.com/cli/latest/reference/emr/index.html#cli-aws-emr">CLI EMR</a><br>
<a href="http://docs.aws.amazon.com/cli/latest/reference/ec2/index.html#cli-aws-ec2">CLI EC2</a></p>
     </div>
</div>

<div class="msg-block msg-warning">
  <div class="msg-text-warn"><p>Before running the commands below: <br>
  1) grant your IAM user permission to manage EMR (see <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles-creatingroles.html">this</a>)<br>
  2) check a list of instance types that can be used to launch an EMR cluster (see <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html">this</a>)<br>
  3) create a new VPC or use an existing one (see <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-vpc-subnet.html#emr-vpc-launching-job-flows">this</a> and the previous class)<br>
  4) create a new key pair or use an existing one (see <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html">this</a> and the previous class)
  </p></div>
</div>

In [None]:
#Create S3 bucket and object lab1/logs/

<p><b>Create roles to launch an EMR Cluster</b> (create once and use them many times)</p>

<p><b>Option 1.</b> Create roles through the AWS CLI</p>

<p>Attach a policy to your user that allows it to create the required roles</p>

IAM -> Users -> Your_User -> Add inline policy -> Policy name: ROLE_ACCESS, Policy document:

In [None]:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "iam:GetInstanceProfile",
                "iam:CreateInstanceProfile",
                "iam:AddRoleToInstanceProfile"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
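
<p>As an alternative to the console path above, the same inline policy could be attached from the command line with <span class="code-font">aws iam put-user-policy</span>. The sketch below only builds the command string. The function name is hypothetical, <span class="code-font">Your_User</span> is the placeholder used above, and the caller needs IAM permissions of its own to attach policies (typically an administrator account).</p>

```python
import json
import shlex

# The policy document shown above, as a Python dictionary
ROLE_ACCESS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "iam:GetRole",
            "iam:CreateRole",
            "iam:AttachRolePolicy",
            "iam:GetInstanceProfile",
            "iam:CreateInstanceProfile",
            "iam:AddRoleToInstanceProfile",
        ],
        "Resource": ["*"],
    }],
}


def put_user_policy_cmd(user_name, policy_name="ROLE_ACCESS"):
    """Build an `aws iam put-user-policy` command (hypothetical helper)."""
    document = json.dumps(ROLE_ACCESS_POLICY)
    # shlex.quote keeps the JSON document intact as a single shell argument
    return ("aws iam put-user-policy --user-name %s --policy-name %s "
            "--policy-document %s"
            % (shlex.quote(user_name), shlex.quote(policy_name),
               shlex.quote(document)))


print(put_user_policy_cmd("Your_User"))
```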

<p>Create default roles for an EMR Cluster. There are three roles: <span class="code-font">EMR_DefaultRole</span>, <span class="code-font">EMR_EC2_DefaultRole</span>, and <span class="code-font">EMR_AutoScaling_DefaultRole</span></p>

In [None]:
!aws emr create-default-roles

<p><b>Option 2.</b> Create roles through the IAM Role Console</p>

In [None]:
IAM -> Roles -> Create Role -> EMR -> "EMR" (emr-default-role) and "EMR Role for EC2" (emr-default-ec2-role)

<p><b>Launch an EMR Cluster</b></p>

In [None]:
%%bash -s $subnet_id
aws emr create-cluster \
    --name "Hadoop_Cluster" \
    --release-label emr-5.0.0 \
    --applications Name=Hadoop Name=Zeppelin \
    --log-uri s3://aws-mr-jobs-labs/class_2/logs/ \
    --service-role emr-default-role \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
    --ec2-attributes InstanceProfile=emr-default-ec2-role,KeyName=BigData_Keys,SubnetId=$1
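
<p>If you prefer to launch the cluster from Python and keep its ID in a variable, the same command can be assembled as a string and passed to <span class="code-font">run_cmd_get_id</span>. The sketch below is an assumption-laden illustration: the function name and its parameters are hypothetical, the role names, key name and log bucket are copied from the cell above, and the global options <span class="code-font">--query ClusterId --output text</span> are added so that only the cluster ID is printed.</p>

```python
import shlex


def emr_create_cluster_cmd(subnet_id, name="Hadoop_Cluster",
                           release="emr-5.0.0", instance_type="m4.large",
                           core_count=2, key_name="BigData_Keys",
                           log_uri="s3://aws-mr-jobs-labs/class_2/logs/"):
    """Build the `aws emr create-cluster` command (hypothetical helper)."""
    args = [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release,
        "--applications", "Name=Hadoop", "Name=Zeppelin",
        "--log-uri", log_uri,
        "--service-role", "emr-default-role",
        "--instance-groups",
        "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=%s" % instance_type,
        "InstanceGroupType=CORE,InstanceCount=%d,InstanceType=%s"
        % (core_count, instance_type),
        "--ec2-attributes",
        "InstanceProfile=emr-default-ec2-role,KeyName=%s,SubnetId=%s"
        % (key_name, subnet_id),
        # Print only the cluster id so it can be stored in a variable
        "--query", "ClusterId", "--output", "text",
    ]
    return " ".join(shlex.quote(a) for a in args)


# Usage inside the notebook:
#   cluster_id = run_cmd_get_id(emr_create_cluster_cmd(subnet_id))
```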

<div class="msg-block msg-imp">
  <div class="msg-text-imp"><p>
      Don't forget to terminate the cluster, otherwise your free subscription will run out quickly. As a rule of thumb, terminate the cluster as soon as all jobs are completed. There are two options to do this:<br>
      1) <span class="code-font">EMR -> Select Cluster -> Terminate</span><br>
      2) AWS CLI: <span class="code-font">aws emr terminate-clusters --cluster-ids j-xxxxx</span>
      
  </p></div>
</div>

<p>Show all clusters</p>

In [None]:
%%bash
aws emr list-clusters

<p>List active clusters</p>

In [None]:
%%bash
aws emr list-clusters --active
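
<p>The JSON printed by <span class="code-font">aws emr list-clusters --active</span> can be turned into a terminate command, which helps to avoid leaving a cluster running by accident. The sketch below parses such output. The helper names are hypothetical, and the sample string is a shortened, made-up response that keeps only the fields used here.</p>

```python
import json


def active_cluster_ids(list_clusters_json):
    """Extract cluster ids from the JSON output of `aws emr list-clusters`."""
    clusters = json.loads(list_clusters_json).get("Clusters", [])
    return [cluster["Id"] for cluster in clusters]


def terminate_cmd(cluster_ids):
    """Build a terminate command, or return None if nothing is running."""
    if not cluster_ids:
        return None
    return "aws emr terminate-clusters --cluster-ids " + " ".join(cluster_ids)


# Made-up sample with placeholder ids, shortened to the fields used above
sample = """
{"Clusters": [
    {"Id": "j-AAAA1111", "Name": "Hadoop_Cluster", "Status": {"State": "WAITING"}},
    {"Id": "j-BBBB2222", "Name": "Hadoop_Cluster", "Status": {"State": "RUNNING"}}
]}
"""

print(terminate_cmd(active_cluster_ids(sample)))
```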

<p>Display a detailed description of a cluster by its id</p>

In [None]:
%%bash
aws emr describe-cluster --cluster-id "CLUSTER_ID"

<p>Print only the public DNS name of the Master node</p>

In [None]:
%%bash
aws emr describe-cluster --cluster-id "CLUSTER_ID" --query "Cluster.MasterPublicDnsName"

<p>Display the public IP and DNS name of every instance of your EMR cluster</p>

In [None]:
%%bash
aws emr list-instances \
        --cluster-id "CLUSTER_ID" \
        --query "Instances[*].[PublicIpAddress,PublicDnsName]" \
        --output text

<p>Connect to an EMR Master Node using SSH</p>

<div class="msg-block msg-warning">
  <div class="msg-text-warn"><p>
      Allow access to an EMR Master Node via SSH by setting an inbound rule in its security group
  </p></div>
</div>

In [None]:
sudo ssh -i /PATH/TO/.ssh/bigdata_keys.pem hadoop@PUBLIC_IP_MASTER_EMR_CLUSTER

<p>Access the HDFS dashboard</p>

In [None]:
sudo ssh -i /PATH/TO/.ssh/bigdata_keys.pem -N -f -L 7740:localhost:50070  hadoop@PUBLIC_IP_MASTER_EMR_CLUSTER

<div class="code-block code-font"><a href="http://localhost:7740">http://localhost:7740</a></div>

<p>Access Zeppelin</p>

In [None]:
sudo ssh -i /PATH/TO/.ssh/bigdata_keys.pem -N -f -L 7741:localhost:8890  hadoop@PUBLIC_IP_MASTER_EMR_CLUSTER

<div class="code-block code-font"><a href="http://localhost:7741">http://localhost:7741</a></div>
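
<p>The tunnel commands above share one pattern: <span class="code-font">-N</span> forwards ports without running a remote command, <span class="code-font">-f</span> sends ssh to the background, and <span class="code-font">-L local_port:host:remote_port</span> maps a local port to a port reachable from the Master node. The sketch below builds such a command for any service. The function name is hypothetical, and <span class="code-font">sudo</span> is left out on the assumption that the key file is readable by the current user.</p>

```python
import shlex


def ssh_tunnel_cmd(key_path, user, host, local_port, remote_port,
                   remote_host="localhost"):
    """Build an ssh local port forwarding command (hypothetical helper)."""
    args = [
        "ssh", "-i", key_path,
        "-N",  # do not run a remote command, only forward ports
        "-f",  # go to the background after authentication
        "-L", "%d:%s:%d" % (local_port, remote_host, remote_port),
        "%s@%s" % (user, host),
    ]
    return " ".join(shlex.quote(a) for a in args)


# HDFS dashboard (50070) and Zeppelin (8890), as in the cells above
print(ssh_tunnel_cmd("/PATH/TO/.ssh/bigdata_keys.pem", "hadoop",
                     "PUBLIC_IP_MASTER_EMR_CLUSTER", 7740, 50070))
print(ssh_tunnel_cmd("/PATH/TO/.ssh/bigdata_keys.pem", "hadoop",
                     "PUBLIC_IP_MASTER_EMR_CLUSTER", 7741, 8890))
```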

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p>Web interfaces of other EMR embedded services are <a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html">here</a>
          </p>
     </div>
</div>

<p>Terminate the cluster by its ID</p>

In [None]:
%%bash
aws emr terminate-clusters --cluster-ids j-xxxxx

<div class="msg-block msg-imp">
  <div class="msg-text-imp"><p>
      Don't forget to terminate the cluster, otherwise your free subscription will run out quickly. As a rule of thumb, terminate the cluster as soon as all jobs are completed. There are two options to do this:<br>
      1) <span class="code-font">EMR -> Select Cluster -> Terminate</span><br>
      2) AWS CLI: <span class="code-font">aws emr terminate-clusters --cluster-ids j-xxxxx</span>
      
  </p></div>
</div>

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. Cloudera Cluster on AWS</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>

<a name="4a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Running AWS Instances
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#3">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#4b">Next</a>
            </div>
        </div>
    </div>
</div>

<p>Create three AWS instances: <span class="code-font">Master, Slave1 and Slave2</span></p>

In [None]:
vm_name_list = ["Master", "Slave1", "Slave2"]
vm_id_list = []
vm_vol_id_list = []

for indx, vm_name in enumerate(vm_name_list):
    vm_id = \
    run_cmd_get_id('aws ec2 run-instances \
                        --image-id "ami-2944b450" \
                        --count 1 \
                        --instance-type "m4.xlarge" \
                        --key-name "BigData_Keys" \
                        --subnet-id ' + subnet_id + ' \
                        --security-group-ids ' + secgroup_id + ' \
                        --instance-initiated-shutdown-behavior "stop" \
                        --private-ip-address "10.0.1.' + str(indx + 5) + '" \
                        --associate-public-ip-address \
                        --block-device-mappings \'[{"DeviceName": "/dev/sda1", "Ebs": { "DeleteOnTermination": false, "VolumeSize": 64, "VolumeType": "gp2" }}]\' \
                        --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=' + vm_name + '}]" \
                        --query "Instances[0].InstanceId"')
    vm_vol_id = \
    run_cmd_get_id('aws ec2 describe-volumes \
                        --filters "Name=attachment.instance-id,Values=' + vm_id + '" \
                        --query "Volumes[0].VolumeId"')
    
    # Remember the ids so the instances and volumes can be released later
    vm_vol_id_list.append(vm_vol_id)
    vm_id_list.append(vm_id)
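
<p>Right after <span class="code-font">run-instances</span> returns, an instance is still starting, and its root volume may not yet be reported as attached, so the volume query can come back empty. This is an assumption about timing rather than something the notebook states. A possible safeguard is to wait with <span class="code-font">aws ec2 wait instance-running</span> before querying the volumes. The helper names below are hypothetical.</p>

```python
def wait_running_cmd(instance_ids):
    """Build a command that blocks until all given instances are running."""
    return "aws ec2 wait instance-running --instance-ids " + " ".join(instance_ids)


def root_volume_cmd(instance_id):
    """Build the describe-volumes query used above for a single instance."""
    return ('aws ec2 describe-volumes '
            '--filters "Name=attachment.instance-id,Values=%s" '
            '--query "Volumes[0].VolumeId"' % instance_id)


# Usage inside the notebook, after the instances have been launched:
#   run_cmd_get_id(wait_running_cmd(vm_id_list))
#   vm_vol_id_list = [run_cmd_get_id(root_volume_cmd(i)) for i in vm_id_list]
```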

<a name="4b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Deploying Cluster using Cloudera Manager
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#4a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#4c">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Preparation</b></p>

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p>Run all commands in this section on all three AWS instances
          </p>
     </div>
</div>

<p>Get the public IP of the running Master instance</p>

In [None]:
!aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=Master" \
    --query "Reservations[0].Instances[0].PublicIpAddress" \
    --output text

<p>Connect to the Master VM via SSH</p>

In [None]:
sudo ssh -i /home/sergo/.ssh/bigdata_keys.pem ubuntu@PUBLIC_IP_MASTER

<p>Check the current Ubuntu version</p>

In [None]:
lsb_release -a

<p>On the remote node, ping the other VMs within the same subnet</p> 

In [None]:
ping 10.0.1.6

In [None]:
ping 10.0.1.7

In [None]:
ping ip-10-0-1-6

In [None]:
#sudo nano /etc/hosts
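
<p>If the short names such as <span class="code-font">ip-10-0-1-6</span> do not resolve, entries can be added to <span class="code-font">/etc/hosts</span> by hand. Whether this is needed depends on the DNS settings of the VPC. The sketch below generates the lines for the three private addresses used in this section, and the helper names are hypothetical.</p>

```python
def private_hostname(ip):
    """Derive the short private host name from an IP, e.g. ip-10-0-1-6."""
    return "ip-" + ip.replace(".", "-")


def hosts_entries(ips):
    """Return /etc/hosts lines mapping each private IP to its short name."""
    return "\n".join("%s %s" % (ip, private_hostname(ip)) for ip in ips)


print(hosts_entries(["10.0.1.5", "10.0.1.6", "10.0.1.7"]))
```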

<p>Hash the password "cloudera" (the resulting hash is passed to <span class="code-font">useradd -p</span> in the next step)</p>

In [None]:
openssl passwd -crypt cloudera

<p>Create a new user with sudo rights</p>

In [None]:
sudo useradd -m -p lCxREFCR/yroc -s /bin/bash cloudera
sudo usermod -aG sudo cloudera
#sudo deluser --remove-home cloudera

In [None]:
sudo nano /etc/sudoers
%sudo ALL=(ALL:ALL) NOPASSWD:ALL

<p>Set the option "PasswordAuthentication" to "yes" in /etc/ssh/sshd_config</p>

In [None]:
sudo nano /etc/ssh/sshd_config
PasswordAuthentication yes

In [None]:
sudo service ssh restart

<p>Configure Swappiness</p>

In [None]:
sudo sysctl vm.swappiness=10

In [None]:
sudo nano /etc/sysctl.conf

# Swappiness
vm.swappiness=10

In [None]:
# To kill port forwarding 
# ps aux | grep ssh
# sudo kill 10424

In [None]:
# TODO: create script run at launch of VMs

<p><b>Run Cloudera Installer</b></p>

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p>Run all commands in this section on the master VM
          </p>
     </div>
</div>

<p>Create a directory for the Cloudera Manager Installer on the Master node</p>

In [None]:
sudo mkdir /home/ubuntu/cloudera-installer

<p>Download the Cloudera Manager Installer to the Master node</p>

In [None]:
sudo wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin \
    -P /home/ubuntu/cloudera-installer 

<p>Go to the installer directory</p>

In [None]:
cd /home/ubuntu/cloudera-installer

<p>Set executable permission</p>

In [None]:
sudo chmod u+x cloudera-manager-installer.bin

<p>Run the installer</p>

In [None]:
sudo ./cloudera-manager-installer.bin

In [None]:
Cloudera Manager
Next->Next->Yes->Next->Yes

In [None]:
sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log 

<p><b>Installation</b></p>

<p>Forward your local port 8861 to port 7180 of the Master VM over SSH</p>

In [None]:
sudo ssh -i /home/sergo/.ssh/bigdata_keys.pem -N -f -L 8861:ip-10-0-1-5:7180 ubuntu@PUBLIC_IP_MASTER

<p>Point your web browser to http://localhost:8861</p>

<p>Log in to Cloudera Manager: admin, admin</p>

<p>Select the Cloudera Express Option</p>

<p>Enter the following pattern</p>

In [None]:
10.0.1.[5-7]
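
<p>The pattern above stands for the three private addresses assigned when the instances were launched (10.0.1.5 for Master, 10.0.1.6 and 10.0.1.7 for the two slaves). The sketch below expands a pattern of this single-range form to make the mapping explicit. It is only an illustration with a hypothetical function name, and Cloudera Manager's own pattern syntax may accept more forms than this handles.</p>

```python
import re


def expand_host_pattern(pattern):
    """Expand a pattern with one [start-end] range into a list of hosts."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range in the pattern: treat it as a single host
        return [pattern]
    prefix, start, end, suffix = match.groups()
    return [prefix + str(n) + suffix for n in range(int(start), int(end) + 1)]


print(expand_host_pattern("10.0.1.[5-7]"))
```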

<p>Select all VMs</p>

<p>Select Repository -> Choose Method -> Use Parcels -> More Options -> Add the Anaconda parcel</p>

In [None]:
https://repo.continuum.io/pkgs/misc/parcels/

<p>Additional Parcels -> Anaconda and Kafka</p>

<p>Pick Install Oracle Java SE Development Kit (JDK)</p>

<p>Skip a Single User Mode configuration</p>

<p>Provide SSH login credentials -> Login To All Hosts As: cloudera -> password: cloudera</p>

<p>Installing... Finished</p>

<p>Choose a combination of services to install -> Core with Spark</p>

In [None]:
sudo tail --lines 100 /var/log/cloudera-scm-agent/cloudera-scm-agent.log

<a name="4c"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            c. Running User Code
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#4b">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#4d">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>HDFS dashboard</b></p>

<p>On the local Node</p>

In [None]:
sudo ssh -i /home/sergo/.ssh/bigdata_keys.pem -N -f -L 8840:ip-10-0-1-5:50070 ubuntu@PUBLIC_IP_MASTER

<div class="code-block code-font"><a href="http://localhost:8840">http://localhost:8840</a></div>

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p>To see ports of other services, click on this <a href="https://www.cloudera.com/documentation/enterprise/5-2-x/topics/cdh_ig_ports_cdh5.html">reference</a></p>
     </div>
</div>

<p><b>Jupyter Notebook</b></p>

<p>On the remote Master Node</p>

In [None]:
export PATH="/opt/cloudera/parcels/Anaconda-4.2.0/bin:$PATH"
jupyter notebook --port 8880 --no-browser

<p>On the local Node</p>

In [None]:
sudo ssh -i /home/sergo/.ssh/bigdata_keys.pem -N -f -L 8861:localhost:8880 ubuntu@PUBLIC_IP_MASTER

<div class="code-block code-font"><a href="http://localhost:8861">http://localhost:8861</a></div>

<p><b>Pyspark</b></p>

<p>On the remote Master Node</p>

In [None]:
#sudo -u hdfs hadoop fs -chmod 777 /user/spark
#sudo -u spark hadoop fs -chmod 777 /user/spark/applicationHistory
#sudo -u hdfs hadoop fs -chmod 777 /user

In [None]:
sudo pyspark --master yarn-client

<div class="msg-block msg-imp">
  <div class="msg-text-imp"><p>
      Don't forget to terminate the cluster after the work is completed
  </p></div>
</div>

<a name="4d"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            d. Rerun and Resource Release
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#4c">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#5">Next</a>
            </div>
        </div>
    </div>
</div>

<p>Terminate the VMs</p>

In [None]:
run_cmd_get_id('aws ec2 terminate-instances --instance-ids ' + " ".join(vm_id_list))

In [None]:
vm_id_list = list()

<p>Rerun using the existing EBS Volumes</p>

In [None]:
# TODO

<p>Delete the EBS Volumes</p>

In [None]:
# delete-volume accepts one volume id per call
for vol_id in vm_vol_id_list:
    run_cmd_get_id('aws ec2 delete-volume --volume-id ' + vol_id)

In [None]:
vm_vol_id_list = list()
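
<p>A volume can only be deleted once it is detached, so the EBS volumes should be removed after the instances have reached the terminated state. The sketch below lists the release commands in that order, using <span class="code-font">aws ec2 wait instance-terminated</span> to block until termination completes. The function name is hypothetical, and the ids in the example are placeholders.</p>

```python
def release_commands(instance_ids, volume_ids):
    """Return AWS CLI commands that release instances, then their volumes."""
    ids = " ".join(instance_ids)
    commands = [
        "aws ec2 terminate-instances --instance-ids " + ids,
        # Block until the instances are gone and their volumes are detached
        "aws ec2 wait instance-terminated --instance-ids " + ids,
    ]
    # delete-volume takes a single volume id, so emit one command per volume
    commands += ["aws ec2 delete-volume --volume-id " + v for v in volume_ids]
    return commands


for command in release_commands(["i-PLACEHOLDER1", "i-PLACEHOLDER2"],
                                ["vol-PLACEHOLDER1", "vol-PLACEHOLDER2"]):
    print(command)
```

<p>Each command can be passed to <span class="code-font">run_cmd_get_id</span> in turn while the id lists are still populated.</p>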

<p>Delete the VPC through the AWS console along with its security group, route table, gateway, and subnet</p>

In [None]:
vpc_id = ""; subnet_id = ""; gateway_id = ""; rtb_id = ""; secgroup_id = ""

<a name="5"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">5. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To content</a></div>
    </div>
</div>