<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Elastic MapReduce notes:

---

## Setting up EMR:

- Go to the EMR console at https://console.aws.amazon.com (select EMR, under Analytics) and create a cluster.

- Select one of your existing key pairs.

- Check that the region shown in the top right-hand corner is correct (not critical, but you might get authentication problems if you use services in different regions).

- Region names look like eu-west-1, us-east-1, us-west-2, etc.

- It takes several minutes for the EMR cluster to boot up. If it fails just try again. Press the circular refresh button in the top right of the console summary ("Cluster list") to refresh your view and see if the cluster is ready.

- Once it's ready click on the cluster from the **Cluster list** page and click **Add step**. On **Step Type** select **Hive program**. For Script s3 location input:

`s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q`

where you must replace eu-west-1 with your region. For Input s3 location input:

`s3://eu-west-1.elasticmapreduce.samples`

with the same replacement. For output s3 location you should put your s3 bucket (you can use the navigation to find it).

In the last box (arguments) enter the following, which allows column names that are the same as reserved words:

`-hiveconf hive.support.sql11.reserved.keywords=false`

and select **Add**.

- By selecting the circular refresh button you can see if it has completed (it will only take about one minute).

- Go to the os_requests folder in your s3 bucket. There will be a file (probably with a name like 00000) which you can select with a right-click to download and view. It will contain counts of operating systems.
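
The region substitution in the step settings above is easy to get wrong by hand. As a minimal sketch, the helper below assembles the four values for a given region. The function name `hive_step_settings` and the bucket name `my-example-bucket` are hypothetical and used only for illustration; the paths and the argument are the ones given in the steps above.

```python
def hive_step_settings(region, output_bucket):
    """Return the values to type into the "Add step" dialog for one region."""
    # The sample bucket name starts with the region, e.g. eu-west-1 or us-west-2.
    samples = f"s3://{region}.elasticmapreduce.samples"
    return {
        "script": f"{samples}/cloudfront/code/Hive_CloudFront.q",
        "input": samples,
        # Hypothetical: replace with the name of your own bucket.
        "output": f"s3://{output_bucket}",
        "arguments": "-hiveconf hive.support.sql11.reserved.keywords=false",
    }


settings = hive_step_settings("us-west-2", "my-example-bucket")
for field, value in settings.items():
    print(f"{field:>9}: {value}")
```

Printing the values first and then pasting them into the console keeps the script and input locations in the same region.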

### Discussion

This example is from Amazon's documentation. You can see a summary of the Hive script that was used, and its sources, here:
http://docs.aws.amazon.com//ElasticMapReduce/latest/ManagementGuide/emr-gs-prepare-data-and-script.html

The Hive script we imported creates a table with the following Hive-SQL:

```SQL
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
	Date Date,
	Time STRING,
	Location STRING,
	Bytes INT,
	RequestIP STRING,
	Method STRING,
	Host STRING,
	Uri STRING,
	Status INT,
	Referrer STRING,
	OS String,
	Browser String,
	BrowserVersion String
)
```

The original data is log data from Amazon's servers. Each line lists the date, time, and various information about a user's activity (including the operating system they used to access the server). The `CREATE` statement above does not end at the closing bracket: it continues with a Hive-SQL regular expression, much like the one you saw in the example in the previous lesson, which tells Hive how to split each log line into the columns, and with the location of the data:

```SQL
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ( "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$" ) LOCATION 's3://us-west-2.elasticmapreduce.samples/cloudfront/data/';
```
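
To see what the regular expression captures, here is a rough Python sketch that applies the same pattern to a single log line. The sample line is made up for illustration and is not taken from the real data set, and `parse_line` is a hypothetical helper name. The pattern is the `input.regex` from above with the Hive string escaping (`\\s`) undone.

```python
import re

# Column names in the order of the CREATE TABLE statement.
COLUMNS = ["Date", "Time", "Location", "Bytes", "RequestIP", "Method", "Host",
           "Uri", "Status", "Referrer", "OS", "Browser", "BrowserVersion"]

PATTERN = re.compile(
    r"^(?!#)"                                  # skip comment lines starting with '#'
    + r"\s+".join([r"([^ ]+)"] * 10)           # first ten whitespace-separated fields
    + r"\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"  # OS, browser, browser version
)

# Made-up log line in the shape the pattern expects.
SAMPLE = ("2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET "
          "eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - "
          "Mozilla/5.0%20(Android;%20U;%20Windows%20NT%205.1;%20en-US;"
          "%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9")


def parse_line(line):
    """Return a dict of column name -> captured text, or None if no match."""
    match = PATTERN.match(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


row = parse_line(SAMPLE)
print(row["OS"], row["Browser"], row["BrowserVersion"])
```

The last three groups pull the operating system out of the bracketed part of the user agent, and the browser name and version from the final `%20name/version` pair.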

Finally, we create our output file by summarising the count of different operating
systems people used to access the system:

```SQL
SELECT os, COUNT(*) count FROM cloudfront_logs GROUP BY os;
```

That creates the output file you stored in your s3 bucket.
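
If the `GROUP BY` is unfamiliar, the following sketch does the same kind of counting in plain Python on a handful of made-up OS values (the list `os_column` is hypothetical and stands in for the `os` column of the table).

```python
from collections import Counter

# Made-up stand-in for the os column of cloudfront_logs.
os_column = ["Android", "Linux", "Android", "Windows", "Linux", "Android"]

# One count per distinct value, like SELECT os, COUNT(*) ... GROUP BY os.
counts = Counter(os_column)

for os_name, count in sorted(counts.items()):
    print(os_name, count)
```

Hive does the same thing, except that the counting is spread across the machines in the cluster.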

## Accessing the cluster through the ec2 master

Go to your ec2 console. You will see three new ec2 instances. If you select the descriptions you can check which one is the EMR-master and which two are the EMR-slaves. 

Log in to the master using ssh, in the same way as for a normal single ec2 instance (even though this is a cluster of three instances). Note that you now have to use **hadoop** as the user name instead of ec2-user:

```bash
$ ssh -i your_key_file.pem hadoop@your_public_DNS_for_the_master
```

Now you are inside the instance and you can type:

```bash
$ hive
```

Now you can type Hive-SQL:

```SQL
SHOW tables;
SELECT * FROM cloudfront_logs LIMIT 10;
SELECT COUNT(*) FROM cloudfront_logs;
```

The result of the count is equal to the sum of the counts for the different operating systems in the text file output of the Hive query earlier. Note that Hive-SQL runs fairly slowly here given the size of the table (only 5000 rows), but it scales more sensibly for larger tables.

- Try running queries to check if the counts match the output text file that was stored in s3. 
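
One way to do this check is to sum the counts in the downloaded output file and compare the total with the result of `SELECT COUNT(*)`. The sketch below assumes each line holds an OS name and a count separated by Hive's default field delimiter (the control character `\x01`) or by whitespace; the sample text and its numbers are made up, and `parse_os_counts` is a hypothetical helper name.

```python
import re


def parse_os_counts(text):
    """Parse lines of '<os><delimiter><count>' into a dict of os -> int."""
    counts = {}
    for line in text.splitlines():
        # Accept \x01 (assumed Hive default), tabs, or spaces before the count.
        match = re.match(r"^(.*?)[\x01\t ]+(\d+)$", line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


# Made-up file content; in practice read the file downloaded from s3.
sample_output = "Android\x013\nLinux\x012\nWindows\x015\n"

os_counts = parse_os_counts(sample_output)
total = sum(os_counts.values())
print(os_counts, total)
```

The total should equal the number returned by the `COUNT(*)` query in Hive.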

Note that your normal single ec2 instance wouldn't come with Hive etc since it doesn't require a distributed file system. Even though we are using ssh to access the master ec2, we are actually querying data stored across all three machines in the cluster. This is seamless because of the Hadoop system that is in place and comes installed.


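If the ssh connection times out, the security group of the master probably does not allow inbound ssh yet. In the ec2 console, go to "Security Groups" under "Network & Security" on the left side-bar and add an inbound rule for:

SSH | TCP | 22 | Custom | 0.0.0.0/0

ensuring you select "Save".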

## FoxyProxy for Hadoop User Experience (Hue)

For this part you need Firefox or Google Chrome as your browser. For details, see [here](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/accessing-hue.html).


**Firefox:**

- Click on *Tools* on the top.

- Hover over *Foxy Proxy Standard* and choose *Options*. 

- A window pops up and in the menu at the top of the screen you see *File*. Click on *File* and choose *Import settings*.  

- A menu pops up to choose a file. Add the file *foxyproxy-settings.xml*, which is in the github repository for today. 

- In the first window that popped up, set *Select Mode* to "Use proxies based on their pre-defined patterns and priorities".

**Chrome:**

- Click on *Chrome* on the top left
- Choose *Preferences*
- From the menu, choose *Accessibility*
- In the left side bar which pops up choose *Extensions*
- Search for foxyproxy
- Add Foxy Proxy Basic to Chrome (ignore any window popping up)
- Foxy Proxy Basic should now be on top of your list of extensions - click on *Options* below
- In the window which pops up click *Import/Export* in the left side bar
- Add the file *foxyproxy-settings.xml*, which is in the github repository for today
- For "Proxy mode" select *Use Proxy emr-socks-proxy for all URLs*

- If the above does not pop up after adding Foxy Proxy Basic:
    - Choose *Window* from the Chrome menu
    - Choose *Extensions*
    - Foxy Proxy Basic should now be on top of your list of extensions; continue from the *Options* step as described above



## Creating an ssh tunnel


- In a new terminal you will need to open a special type of ssh connection to the master ec2 instance of the cluster with the following:

```bash
$ ssh -i your_key_file.pem -N -D 8157 hadoop@your_public_DNS_for_the_master
```

This opens a process that does not end - just leave it as it is and it stays connected. If the connection is lost, for example by closing the terminal window, you will not be able to connect to the HUE system via your browser (which is the next step). Now you can type into the address line of the browser:

`your_public_DNS_for_the_master:8888`

You should see a web interface for HUE (Hadoop User Experience). You have to create a new username and password for this and store them (I suggest you write them down in your AWS folder, for example).
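
If the page does not load, it can help to check that the tunnel is still up. The sketch below tests whether anything accepts connections on the local port used by `-D 8157` in the ssh command above; `port_is_open` is a hypothetical helper name, and the check assumes the tunnel listens on `127.0.0.1`, which is where ssh binds a `-D` port by default.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if port_is_open("127.0.0.1", 8157):
    print("Tunnel port 8157 is listening.")
else:
    print("Nothing is listening on port 8157 - restart the ssh tunnel.")
```

This only confirms that the local end of the tunnel exists; the browser still needs the FoxyProxy settings to route requests through it.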

## Using HUE

- Once you log in you have some quickstart settings, so you can install
all of the examples by clicking on them (it doesn't say anything
once they have installed but if you click on all four it will install them). Then go through the other settings and click next to launch the homepage.

- You now have several examples here that you can explore.

- However, let's go back to the cloudfront_logs table we saw earlier - we can query it in the web interface now. Click "Query Editors" then "Hive" (on the upper left).

- You now have 5 tables, rather than just the cloudfront_logs table we had before.

- Separately, try a normal ssh back to your ec2 master in your terminal,
as we have done before with:

```bash
$ ssh -i your_key_file.pem hadoop@your_public_DNS_for_the_master
```

and type

```bash
$ hive
> SHOW tables;
```

and you see the same five tables as in the HUE browser interface (asking
HUE to install the examples added the extra tables onto your ec2 cluster's
file system, which is a Hadoop Distributed File System).

- Back in HUE, you have a Hive-SQL editor. If you type the same query as before:

```SQL
SELECT os, COUNT(*) count FROM cloudfront_logs GROUP BY os;
```

you will see the same results (remember this is a bit slow because it's
optimised to run on larger datasets). You can also see graphical outputs
such as bar charts and other nice features.

- Separately, if you open a new tab in your browser and type:

`your_public_DNS_for_the_master:50070/dfshealth.html#tab-overview`

you will see a summary of the Hadoop Distributed File System we are using on the ec2 cluster, including how much of the available storage you are using and so on. See [here](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html) for further browser interfaces, such as YARN.
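
The two browser addresses used in this section differ only in port and path, so they can be built from the master's public DNS name. This is a small sketch: `web_interfaces` is a hypothetical helper, the DNS name in the example is a placeholder, and the `http://` scheme is an assumption (the notes give the addresses without one).

```python
def web_interfaces(master_dns):
    """Return the browser addresses used in these notes for one master node."""
    return {
        "HUE": f"http://{master_dns}:8888",
        "HDFS": f"http://{master_dns}:50070/dfshealth.html#tab-overview",
    }


# Placeholder: substitute the public DNS of your own master instance.
urls = web_interfaces("your_public_DNS_for_the_master")
for name, url in urls.items():
    print(f"{name}: {url}")
```

Both addresses only work in the browser while the ssh tunnel from the previous section is running.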

- Now go back and have a look at some of the installed examples. We will have a lab next to try out a few more features. Note that you can fully control the Hadoop system running on the ec2 cluster from the terminal; HUE just gives you a friendly interface. It doesn't fundamentally add extra features, only conveniences such as the graphical summaries of Hive-SQL query results, which you can't get in the terminal.

- You should leave the cluster up for the lab coming next, but in general you must terminate clusters when you are not using them, as you will otherwise incur charges (around $1 per hour).