<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# AWS Elastic Map Reduce

---

## LEARNING OBJECTIVES

- Spin up a cluster on AWS
- Browse HDFS using Hadoop User Experience (HUE)
- Launch a HIVE Shell and execute HIVE queries on an EMR cluster

## LESSON GUIDE

- [Introduction to EMR](#introduction)
- [EMR Pricing](#pricing)
- [EMR cluster](#guided-practice)
    1. [Prerequisites](#prerequisites)
    1. [Launching the cluster](#launch)
    1. [Prepare sample data and script](#prepare)
    1. [Process sample data](#process)
    1. [Check results](#check)
- [Accessing the EMR cluster](#more)
- [Configure Web Connection](#configure)
    - [Setting access and FoxyProxy](#access)
- [Hadoop User Experience (HUE)](#hue)
- [Example: AWS sample: CloudFront logs](#example)
- [Independent practice with HUE](#ind-practice)
- [Conclusion](#conclusion)
- [Additional resources](#resources)

<a name="introduction"></a>
## Intro to EMR 

In a previous lesson we discovered two very important AWS services: EC2 and S3. Today we will see how to spin up a computer cluster on Amazon.

**What is a cluster?**

**What is a typical topology for a Big Data computing cluster?**

Amazon Elastic MapReduce was introduced in April 2009 to automate _provisioning_ of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 (VM) and S3 (Object Storage). It simplifies the management of a Hadoop cluster, making it available to anyone at the click of a button.

EMR offers several pre-installed software packages including:

- Hadoop
- HBase
- Pig
- Hive
- Hue
- Spark

and many others.

**Which of these have you already encountered on your local VM?**

EMR has also supported Spot Instances since 2011. The usual recommendation is to run only the Task Instance Group on Spot Instances: task nodes do not store HDFS data, so losing one costs compute time but not data, and you get the lower price while keeping the cluster available.

<a name="pricing"></a>
## EMR Pricing

EMR pricing is based on the type of instances forming the cluster and is divided into tiers. The EMR charge is added on top of the cost of the underlying EC2 instances.

Just as importantly, costs are calculated in hourly increments. If we plan to use the cluster for two half-hour sessions, we should keep it up for one consecutive hour instead of spinning it up and down twice, which would be billed as two hours.
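To make the rounding concrete, here is a minimal sketch. It assumes that every cluster run is billed separately and rounded up to a full hour, as described above. The function name `billed_hours` and the session lengths are illustrative only and not part of any AWS API.

```python
import math


def billed_hours(session_minutes):
    """Return the hours billed when every cluster run is rounded up to a full hour.

    Assumption: each start/stop cycle is billed on its own, in whole hours.
    """
    return sum(math.ceil(minutes / 60) for minutes in session_minutes)


# Two half-hour sessions on a cluster that is terminated in between
separate_runs = billed_hours([30, 30])

# The same work on a cluster kept up for one consecutive hour
single_run = billed_hours([60])

print(separate_runs, single_run)  # prints: 2 1
```

Multiply the billed hours by the hourly price from the cost calculator to compare the two options in dollars.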

EMR is not included in the AWS free tier that you used in the previous class, so it's good practice to check prices before you spin up a cluster.

We can use the [AWS cost calculator](https://calculator.s3.amazonaws.com/index.html) to estimate the cost of a three-node cluster with medium-sized instances (`m3.xlarge`). The image below shows the cost for one hour: slightly more than one dollar.

![](./assets/images/emrcost.png)

If we kept the cluster alive for a month, the bill would be pretty high. That's why it's so convenient to spin clusters up and down as they are needed.

<a name="guided-practice"></a>
## EMR cluster 

Let's spin up an EMR cluster with Hive and use it to perform a simple count, much like the word count we did with Hive on the local VM. We will be following the [example provided by Amazon here](http://docs.aws.amazon.com//ElasticMapReduce/latest/ManagementGuide/emr-gs.html).

Let's first log in to AWS and go to the EMR service page:

![](./assets/images/emr.png)

<a name="prerequisites"></a>
### 1. Prerequisites

As a first step we will create two folders in one of our S3 buckets and call them:

- input
- output

**We can do this manually:**

![](./assets/images/bucket.png)

**Or via the command line:**

```bash
$ aws s3 ls
```

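```bash
$ aws s3 mb s3://bucket-name
# you can remove it using aws s3 rb s3://bucket-name
# create the "input" folder as an empty object; repeat with --key output/
$ aws s3api put-object --bucket bucket-name --key input/
```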

<a name="launch"></a>
### 2. Launch Cluster

![](./assets/images/clusterstart.png)

**Remember to choose the key pair you have already stored on your computer.**


![](./assets/images/clusterstarting.png)

**Notice also that, as with EC2, we can list the clusters using the Cluster List pane:**

![](./assets/images/clusterlist.png)

**The cluster will take several minutes to boot completely. Press the circular refresh button in the top right of the console summary ("Cluster list") to refresh your view and see if the cluster is ready.**

**In the meantime, let's do a couple of review checks:**

---

**Do you remember what exercise we did with HIVE?**

**Do you remember how to connect to an instance on EC2?**

**Do you remember which commands we used in the AWS CLI?**

---
**Once the cluster is ready we will see it in green:**

![](./assets/images/clusterready.png)

<a name="prepare"></a>
### 3. Prepare sample data and script

We will analyse log data much as we did in the HIVE exercise. The major difference is that both our data and the computing power are now somewhere in the cloud, instead of on a virtual machine running on our laptop.

The sample data is a series of Amazon CloudFront web distribution log files. They list the date, time, and various details of every user request (including the operating system used to access the server). The data is stored in Amazon S3 at `s3://us-west-2.elasticmapreduce.samples` (replace `us-west-2` with the region of your cluster).
Each entry in the CloudFront log files provides details about a single user request in the following format:

    2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - Mozilla/5.0%20(MacOS;%20U;%20Windows%20NT%205.1;%20en-US;%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9

**A sample HIVE script is also provided here:**

    s3://us-west-2.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q

**The sample Hive script does the following:**

- Creates a Hive table named cloudfront_logs.
- Reads the CloudFront log files from Amazon S3 using EMRFS and parses the CloudFront log files using the regular expression serializer/deserializer (RegEx SerDe).
- Writes the parsed results to the Hive table cloudfront_logs.
- Submits a HiveQL query against the data to retrieve the total requests per operating system for a given time frame.
- Writes the query results to your Amazon S3 output bucket.

**The Hive code that creates the table looks like the following:**

```SQL
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
    Date DATE,
    Time STRING,
    Location STRING,
    Bytes INT,
    RequestIP STRING,
    Method STRING,
    Host STRING,
    Uri STRING,
    Status INT,
    Referrer STRING,
    OS STRING,
    Browser STRING,
    BrowserVersion STRING
)
```

**The Hive code that parses the log files using the RegEx SerDe looks like the following:**

```SQL
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)
LOCATION 's3://us-west-2.elasticmapreduce.samples/cloudfront/data/';
```
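To see what the regular expression extracts, it can help to try it on the sample log entry outside of Hive. The sketch below is a Python translation of the `input.regex` value, applied to the example line shown earlier. Python's `re` module is not the Java regex engine Hive uses, so treat it as an approximation; the names `LOG_PATTERN`, `COLUMNS`, and `parse_log_line` are made up for this illustration.

```python
import re

# Python translation of the "input.regex" value: ten space-separated fields,
# then the OS, browser, and browser version pulled out of the user-agent string.
LOG_PATTERN = re.compile(
    r"^(?!#)"
    + r"\s+".join([r"([^ ]+)"] * 10)
    + r"\s+[^(]+\(([^;]+).*%20([^/]+)/(.*)$"
)

# Column names in the same order as the CREATE TABLE statement
COLUMNS = [
    "date", "time", "location", "bytes", "request_ip", "method", "host",
    "uri", "status", "referrer", "os", "browser", "browser_version",
]


def parse_log_line(line):
    """Return a dict of column name -> value, or None for comment or non-matching lines."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


sample = (
    "2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net "
    "/test-image-1.jpeg 200 - Mozilla/5.0%20(MacOS;%20U;%20Windows%20NT%205.1;"
    "%20en-US;%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9"
)

row = parse_log_line(sample)
print(row["os"], row["browser"], row["browser_version"])  # prints: MacOS IE 3.0.9
```

The leading `(?!#)` skips header lines that start with `#`, and each parenthesised group becomes one column of the table, in order.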

**The Hive query looks like the following:**

```sql
SELECT os, COUNT(*) count FROM cloudfront_logs WHERE date BETWEEN '2014-07-05' AND '2014-08-05' GROUP BY os;
```
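For readers less familiar with SQL, the same filter-and-count logic can be sketched in plain Python. The rows below are hypothetical stand-ins for the `cloudfront_logs` table, not real data.

```python
from collections import Counter

# Hypothetical (date, os) pairs standing in for rows of cloudfront_logs
rows = [
    ("2014-07-05", "MacOS"),
    ("2014-07-06", "Windows"),
    ("2014-07-20", "MacOS"),
    ("2014-08-05", "Linux"),
    ("2014-09-01", "Windows"),  # outside the date range, so it is filtered out
]

# WHERE date BETWEEN '2014-07-05' AND '2014-08-05' (BETWEEN is inclusive);
# ISO-formatted dates compare correctly as plain strings.
in_range = [os_name for date, os_name in rows if "2014-07-05" <= date <= "2014-08-05"]

# GROUP BY os with COUNT(*)
requests_per_os = Counter(in_range)

print(sorted(requests_per_os.items()))
# prints: [('Linux', 1), ('MacOS', 2), ('Windows', 1)]
```

On the cluster, Hive distributes this counting over the nodes instead of looping over the rows on a single machine.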

The output file summarises how many requests came from each operating system people used to access the system.

<a name="process"></a>
### 4. Process Sample Data

Following the instructions [here](http://docs.aws.amazon.com//ElasticMapReduce/latest/ManagementGuide/emr-gs-process-sample-data.html) we can create a new job step based on the Hive script by adding a `step` and assigning input, output and script buckets.

Once the cluster is ready, click on it in the **Cluster list** page and click **Add step**. For **Step type**, select **Hive program**. For **Script S3 location**, enter:

`s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q`

where you must replace `eu-west-1` with your region. For **Input S3 location**, enter:

`s3://eu-west-1.elasticmapreduce.samples`

with the same replacement. For **Output S3 location**, enter your own S3 bucket (you can use the folder browser to find it).

In the last box (**Arguments**), enter the following, which allows column names that are the same as reserved words:

`-hiveconf hive.support.sql11.reserved.keywords=false`

and select "Add".
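Since the sample locations differ only by region, the values for the step form can be generated rather than edited by hand. This is a minimal sketch: `sample_step_locations` is a hypothetical helper, and it assumes the output location is simply the root of your own bucket.

```python
def sample_step_locations(region, output_bucket):
    """Build the values entered in the Add step form for a given region."""
    samples = f"s3://{region}.elasticmapreduce.samples"
    return {
        "script": f"{samples}/cloudfront/code/Hive_CloudFront.q",
        "input": samples,
        # Assumption: results are written to the root of your own bucket
        "output": f"s3://{output_bucket}/",
        "arguments": "-hiveconf hive.support.sql11.reserved.keywords=false",
    }


locations = sample_step_locations("eu-west-1", "bucket-name")
print(locations["script"])
# prints: s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q
```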

- Use the circular refresh button to see whether the step has completed (it should only take about one minute).

- Go to the `os_requests` folder in your S3 bucket. It will contain a file (probably with a name like 00000) that you can download via right-click and then view. It contains the request counts per operating system.


![](./assets/images/steppending.png)

![](./assets/images/steprunning.png)

<a name="check"></a>
### 5. Check results


![](./assets/images/results.png)


You can navigate to your S3 bucket and check the results. There should be a new file, with the content:

    Android    855
    Linux      813
    MacOS      852
    OSX        799
    Windows    883
    iOS        794


Wonderful! We have just run a HIVE script on EMR!!

**We have run a HIVE script by defining a step. Do you think we could simply run hive commands from the HIVE command line?**

<a name="more"></a>
## Accessing the EMR cluster 

Go ahead and SSH to your master node and launch Hive. Then try to query the table you just created (`cloudfront_logs`).

To do so, go to your EC2 console. You will see three new EC2 instances. Select each one and look at its description: one of the instances is the EMR master and the other two are the EMR slaves.

Log in to the master with ssh, just as you would to a single EC2 instance (even though this is a cluster of three instances). Note that the user name is now **hadoop** instead of ec2-user:

```bash
$ ssh -i your_key_file.pem hadoop@your_public_DNS_for_the_master
```

Now you are inside the instance and you can type:

```bash
$ hive
```

Now you can type HiveQL queries:

```SQL
SHOW tables;
SELECT * FROM cloudfront_logs LIMIT 10;
SELECT COUNT(*) FROM cloudfront_logs;
```

The result of the count should match the sum of the per-OS counts in the text file output of the earlier Hive query (for an exact comparison, apply the same date filter as that query). Note that Hive runs fairly slowly here given the size of the table (only about 5000 rows), but the overhead pays off on larger tables.

- Try running queries to check whether the counts match the output text file that was stored in S3.
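As a quick cross-check, the per-OS counts from the results file shown earlier can be summed locally. This sketch assumes the counts are pasted as whitespace-separated text; the actual output file may use a different delimiter, and `parse_counts` is a hypothetical helper.

```python
# Output of the Hive step, copied from the results shown earlier
results_text = """\
Android    855
Linux      813
MacOS      852
OSX        799
Windows    883
iOS        794
"""


def parse_counts(text):
    """Parse 'name count' lines into a dict of operating system -> request count."""
    counts = {}
    for line in text.splitlines():
        if line.strip():
            name, count = line.split()
            counts[name] = int(count)
    return counts


counts = parse_counts(results_text)
total = sum(counts.values())
print(total)  # prints: 4996
```

Compare this total with a `COUNT(*)` that uses the same date range as the query in the step.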

Note that a normal single EC2 instance doesn't come with Hive and the rest of the Hadoop stack, since it has no need for a distributed file system. Even though we are using ssh to access only the master instance, our queries are served by the cluster as a whole, with the work distributed across the machines. This is seamless because of the Hadoop system that comes pre-installed.


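If the SSH connection is refused or times out, SSH access to the master node is probably not enabled yet. In the EC2 console, go to "Security Groups" under "Network & Security" in the left sidebar, select the security group of the master node, and add an inbound rule for:

    SSH | TCP | 22 | Custom | 0.0.0.0/0

Make sure you select "Save".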

<a name="configure"></a>
## Configure Web Connection 

**So far we have learned two ways of running HIVE. Can you list them?**

We will now learn about HUE (Hadoop User Experience), which is a great way to interact with a Hadoop cluster.

Before we can do that, we have to go through one more step. The default security settings for EMR are pretty tight and do not allow external web connections to our cluster. To connect with a browser we have to set up an _SSH tunnel_, i.e. have our browser communicate with the cluster via an encrypted channel.

Luckily, Amazon provides us with simple instructions:

![](./assets/images/webconnection.png)

![](./assets/images/sshtunnel.png)

<a name="access"></a>
### In order to follow them we first need to complete two steps:

#### 1. Enable SSH access to our master node. This is done in the Security Groups pane of the EC2 services page.

![](./assets/images/securitygroups.png)

#### 2. Install and configure FoxyProxy as explained [here](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-connect-master-node-proxy.html).

Once we have enabled SSH access, we can go ahead and connect:

```bash
ssh -i ~/.ssh/MyFirstKey.pem -ND 8157 hadoop@<YOUR_MASTER_DNS>
```

Note that this command will not end because it's keeping the tunnel alive.
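The flags in the tunnel command are easy to misread, so here is a small sketch that builds the same command as an argument list with each part annotated. `tunnel_command` is a hypothetical helper, and the key file and DNS name are the placeholders used above.

```python
def tunnel_command(key_file, master_dns, local_port=8157):
    """Return the ssh argument list for a SOCKS tunnel to the master node."""
    return [
        "ssh",
        "-i", key_file,          # private key of the cluster's key pair
        "-N",                    # do not run a remote command, only forward
        "-D", str(local_port),   # dynamic (SOCKS) forwarding on this local port
        f"hadoop@{master_dns}",  # EMR master nodes are accessed as the hadoop user
    ]


command = tunnel_command("~/.ssh/MyFirstKey.pem", "<YOUR_MASTER_DNS>")
print(" ".join(command))
# prints: ssh -i ~/.ssh/MyFirstKey.pem -N -D 8157 hadoop@<YOUR_MASTER_DNS>
```

Writing `-N -D 8157` is equivalent to the combined `-ND 8157` form. The local port must match the one FoxyProxy is configured to use.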

If the tunnel and FoxyProxy are configured correctly, we should be able to connect to several web services. The one we are interested in is HUE.

<a name="hue"></a>
## Hadoop User Experience (HUE)

[Hue](http://gethue.com/) aggregates the most common Apache Hadoop components into a single interface and targets the user experience. Its main goal is to have the users "just use" Hadoop without worrying about the underlying complexity or using a command line.

It's accessible on port 8888 of our master node through the SSH tunnel. Since this is the first time we use it, we'll have to set up a username (choose **hdfs**, otherwise you might run into problems with access rights) and a password. Let's go ahead and do that.


    http://<YOUR_MASTER_DNS>:8888/


![](./assets/images/hueuser.png)

**Let's also install all the examples:**

![](./assets/images/hueinstall.png)

**And we can finally open the Hue home page:**

![](./assets/images/huehome.png)

<a name="example"></a>
### Example: AWS Sample: CloudFront Logs

Among the examples there's one that looks familiar: the CloudFront sample logs script we've just executed in HIVE. Let's see what happens if we run it from HUE. Hit the EXECUTE button.

We will see the log of the MapReduce job being executed:

![](./assets/images/huecloudfront.png)

#### And the results:

![](./assets/images/huecfresults.png)

#### HUE also generates a nice chart for us:

![](./assets/images/huechart.png)

Note that you can click the `next` button to execute the following queries in the script.

Finally, note that we can also explore HDFS, as we did on the local VM, by pointing our browser to port 50070 (click [here](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html) for further browser interfaces such as YARN):

    http://<YOUR_MASTER_DNS>:50070/dfshealth.html#tab-overview
    
![](./assets/images/hdfs.png)

<a name="ind-practice"></a>
## Independent practice with HUE

The HUE Home offers several other examples. In pairs choose an example and work through the code. Make sure you understand what it does and how you execute it. Here are some questions to guide your discovery:

- What tables are present?
- How are they defined? What's the schema? How do you check it in HUE?
- What does the query do?
- How long does it take to execute?
- How much data does it process?
- What are the results?


<a name="conclusion"></a>
## Conclusions

We have learned how to spin up a cluster on AWS and how to run HIVE queries on it using a script or using HUE.

**Make sure you terminate your cluster now:**

![](./assets/images/terminate.png)

**Delete the buckets from S3, to avoid paying for storage space.**

![](./assets/images/deletebucket.png)


#### Now that you are able to process very large datasets in the cloud, what problems would you like to tackle?


<a name="resources"></a>
## ADDITIONAL RESOURCES

- [AWS EMR tutorial](http://docs.aws.amazon.com//ElasticMapReduce/latest/ManagementGuide/emr-gs.html)