In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

# Amazon's Cloud Computing Services

Amazon Web Services has thorough documentation covering everything from the commands available in the command line interface (CLI), `aws {commands}`, to `boto`, the Python SDK for the same services, to full tutorials and examples on how to fire up an EMR cluster or a set of EC2 instances with almost any desired data processing framework.

EC2 is cheaper than EMR, but EMR is recommended for immediate use of Hadoop and the other projects in its ecosystem, which are configurable for your cluster via [Amazon Machine Images](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html) (AMIs). In a production setting you'll likely want to pin specific versions for consistency; in our case it's safe to use the most recent version (`3.6.0` at the time of this writing).

## Setting up a personal AWS account

To use AWS you'll need to [create an account](http://aws.amazon.com/) if you haven't already. For the first year after new account creation, you'll be eligible for discounts on some services as part of the Free Tier program.

Access the AWS [web console](https://console.aws.amazon.com/s3/) to handle most of your configuration. You'll need at least one S3 bucket to serve as storage for your logs and output.

From there you can create EMR clusters as you wish and run jobs. Be careful about the nodes you use, as only certain sizes are eligible for the free tier discounts. Still, you only pay for what you use, and the costs for small, educational jobs are relatively manageable.

There's an in-depth [tutorial](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-get-started.html) available; more detailed cluster configuration information can be found in this notebook and in the Spark module.

### AWS credentials and command line tools

1. To verify that the AWS CLI is working, try
``` bash
aws s3 ls
```
You should get back a listing of the S3 buckets your user can access (it may be empty on a new account).  If you get an error instead, double-check that you set up your permissions correctly.

1. `boto` ([docs](https://boto.readthedocs.org/en/latest/)) is a Python library for working with AWS services from code, covering much of the same functionality as `awscli`.  You can install it using
``` bash
pip install boto
```
and follow the instructions in the docs to get started.

1. Another option for interacting with S3 from the command line is `s3cmd`. You can download it via
``` bash
git clone https://github.com/s3tools/s3cmd.git
```
and follow the documentation [here](https://github.com/s3tools/s3cmd).
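
The verification step above can also be scripted. The following is a minimal sketch, assuming the `aws` executable is on your `PATH`; the function name `aws_cli_works` and its injectable `runner` parameter are illustrative and not part of `awscli` or `boto`. It only inspects the exit status, which is zero when the command succeeds.

```python
import subprocess


def aws_cli_works(command=("aws", "s3", "ls"), runner=subprocess.run):
    """Return True if the given command runs and exits with status 0.

    `runner` defaults to subprocess.run; it is a parameter only so the
    check can be exercised without a real AWS account.
    """
    try:
        result = runner(
            list(command),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except FileNotFoundError:
        # The `aws` executable is not installed or not on PATH.
        return False
    return result.returncode == 0
```

Calling `aws_cli_works()` with no arguments runs the same `aws s3 ls` command shown above; a `False` result suggests revisiting the installation or the credentials.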

### Python `mrjob`

To test `mrjob`, clone its [GitHub repo](https://github.com/Yelp/mrjob) and run the word count example on `README.rst`:
```bash
git clone https://github.com/Yelp/mrjob.git
cd mrjob

# run command locally
# this is good for testing and debugging on files locally
python examples/mr_word_freq_count.py README.rst \
   --no-output --output-dir=/tmp/wc/

# check the output file contents:
cat /tmp/wc/* | more

# run command on ec2 and write output to our s3 bucket
# this costs money so only do it when you have working code

python examples/mr_word_freq_count.py -r emr README.rst \
   --no-output --output-dir=s3://thedataincubator-fellow/<user>-wc/
```
Note: if you're unable to start a new jobflow, use the `check_emr_jobflows.py` script in `datacourse/scripts/` and explicitly join the shortest queue by adding the flag `--emr-job-flow-id=j-JOBFLOWID`.

### Check the output file contents
```bash
aws s3 ls s3://thedataincubator-fellow/<user>-wc/
aws s3 cp --recursive s3://thedataincubator-fellow/<user>-wc/ /tmp/<user>-wc/
```
Note: be sure to replace `<user>` with a key that is unique to you.
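
To make the difference between the local run and the EMR run explicit, the sketch below assembles both command lines from the same pieces. The helper names `output_uri` and `word_count_command` and the user key `alice` are hypothetical; the script path, flags, and bucket name are the ones used in the commands above.

```python
def output_uri(user, bucket="thedataincubator-fellow"):
    """Build the per-user S3 output location used in the examples above."""
    return "s3://{0}/{1}-wc/".format(bucket, user)


def word_count_command(input_path, output_dir, on_emr=False):
    """Return the argument list for the mrjob word count example."""
    cmd = ["python", "examples/mr_word_freq_count.py"]
    if on_emr:
        # "-r emr" selects the EMR runner; without it the job runs locally.
        cmd += ["-r", "emr"]
    cmd += [input_path, "--no-output", "--output-dir=" + output_dir]
    return cmd


# Local run writing to /tmp/wc/, then the same job on EMR writing to S3.
print(" ".join(word_count_command("README.rst", "/tmp/wc/")))
print(" ".join(word_count_command("README.rst", output_uri("alice"), on_emr=True)))
```

The resulting lists could be passed to `subprocess.call` from inside the cloned `mrjob` directory, though running the EMR variant incurs the same costs as the shell command.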

A few notes:
1. You can also upload files to S3 using the AWS CLI so that your entire workflow can be on S3.
1. The server will take a while to boot up the first time, but it will stay alive, so any subsequent jobs you submit will not have to wait for it to boot again.  If there are already jobs running, it will wait up to 5 minutes before spawning another server (please be patient).  A server that stays idle for 2 hours will shut itself down.
1. Take a look at `examples/mr_word_freq_count.py`.  This is the simple "word count" mapreduce.
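
For intuition about what a word count MapReduce job does, here is a minimal pure-Python sketch of the map, shuffle, and reduce phases. It does not use `mrjob` and is not a copy of `examples/mr_word_freq_count.py`; the function names, the word-splitting regular expression, and the sample lines are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed tokenizer: runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    yield word, sum(counts)


def run_word_count(lines):
    """Simulate the shuffle step by grouping mapper output by key."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    results = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            results[key] = total
    return results


sample = ["the quick brown fox", "The lazy dog and the fox"]
print(run_word_count(sample))
```

A framework such as `mrjob` takes care of the grouping step and of distributing the mapper and reducer calls across machines; only the two small functions need to be written by hand.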

## Useful AWS CLI commands

- Access Hadoop web UI, e.g. ResourceManager
    - Option 1: via a local port
```bash 
ssh -L 8158:ec2-52-0-25-37.compute-1.amazonaws.com:9026 hadoop@ec2-52-0-25-37.compute-1.amazonaws.com -i /path/to/fellos201501.pem
```
Now in your browser, go to the address `localhost:8158`  

    - Option 2: via dynamic port forwarding
        1. Type: 
```bash
aws emr socks --cluster-id j-XXXX --key-pair-file ~/path/to/keypair.pem
```
        1. Then follow [these steps](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-proxy.html) to add FoxyProxy to your Chrome/Firefox browser to natively use the web UI.


- `ssh` into master node to access HDFS, Pig interactive, etc.

```bash
ssh hadoop@{master-public-dns}.compute-1.amazonaws.com -i /path/to/pemfile.pem
```

- One-liner to preview a .gz file:
```bash
s3cmd get s3://thedataincubator-course/wikidata/wikistats/pagecounts/pagecounts-20081208-030000.gz - | gunzip -c | less
```
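
The same kind of preview can be done from Python once the compressed bytes are available as a binary stream. The following is a sketch using only the standard library; `preview_gzip` is a hypothetical helper, and the sample data is made up rather than taken from the pagecounts file.

```python
import gzip
import io
import itertools


def preview_gzip(fileobj, n_lines=5, encoding="utf-8"):
    """Return the first n_lines of a gzip-compressed binary stream.

    Lines are decompressed lazily, so the whole file is never held
    in memory at once.
    """
    with gzip.open(fileobj, mode="rt", encoding=encoding, errors="replace") as text:
        return [line.rstrip("\n") for line in itertools.islice(text, n_lines)]


# Made-up sample data standing in for a downloaded .gz file.
compressed = gzip.compress(b"line one\nline two\nline three\n")
print(preview_gzip(io.BytesIO(compressed), n_lines=2))
```

For a file already on disk, an opened binary file object can be passed in place of the `io.BytesIO` wrapper.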

### Third-party software

There are some third-party tools that can help you navigate AWS S3. It can be time-consuming to go through the command line looking for logs when there's no autocomplete or easily viewable directory structure, in which case something like [Bucket Explorer](http://www.bucketexplorer.com/) might save you some time.

## Spark on EC2
Spark comes with a built-in script to launch clusters on Amazon EC2. This script launches a set of nodes and then installs the Standalone cluster manager on them.   
Cloudera [recommends running on YARN](http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/) for production clusters. Since we'll only be using Spark in isolation (i.e. without any interaction and IO with other frameworks like MapReduce and Impala), standalone is fine; however, there is currently better documentation for configuring and tuning Spark jobs running on YARN.


To launch 2 m3.xlarge slave nodes, from `$SPARK_HOME/ec2` run:
```bash
$ ./spark-ec2 --key-pair=springfellows2015 --identity-file=~/springfellows2015.pem -s 2 --instance-type=m3.xlarge --region=us-east-1 --zone=us-east-1a --copy-aws-credentials launch myclustername
```
Note: you may need to specify `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` as environment variables.
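
A quick way to check whether those two environment variables are set before launching is sketched below. The helper name `missing_aws_credentials` is hypothetical; the check only confirms that the variables are present and non-empty, not that the keys are valid.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_credentials()
if missing:
    print("Set these before launching: " + ", ".join(missing))
else:
    print("AWS credential variables are present.")
```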

## Spark on EMR 
If you want to use Spark on YARN, i.e. spin up a Hadoop cluster from which you can run multiple frameworks, you can install Spark and configure the other installed applications when you use the CLI to create an EMR cluster.  
Here we install Spark on a cluster with one m3.xlarge master instance and nine m3.2xlarge core instances, then run one of the example jobs packaged with Spark (calculating Pi):
```bash
aws emr create-cluster --name EMR-Spark-Step-Example --ami-version 3.8.0 \
--instance-groups Name=Master,InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
Name=Core,InstanceGroupType=CORE,InstanceType=m3.2xlarge,InstanceCount=9 \
--use-default-roles --ec2-attributes KeyName=<YOUR_EC2_KEY_NAME> \
--auto-terminate \
--log-uri s3://thedataincubator-fellow/logs/ \
--applications Name=Spark,Args=-x \
--steps Name=SparkApp,Type=CUSTOM_JAR,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--class,org.apache.spark.examples.SparkPi,s3://support.elasticmapreduce/spark/1.2.0/spark-examples-1.2.0-hadoop2.4.0.jar,10],ActionOnFailure=CONTINUE
```

The `-x` flag is important as it overrides the default executor allocations and instead creates one executor for each core node, with access to all the cores and RAM on that node.

There is a similar command, `aws emr add-steps`, which can add steps to an existing cluster.
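
Because the `--steps` value in the `create-cluster` command is a single long comma-separated string, it can help to assemble it from its parts. The sketch below is one way to do that; `custom_jar_step` is a hypothetical helper, not part of the AWS CLI, and it assumes none of the arguments contain commas or spaces (those would need quoting).

```python
def custom_jar_step(name, jar, args, action_on_failure="CONTINUE"):
    """Assemble one --steps value in the shorthand form used above."""
    template = "Name={0},Type=CUSTOM_JAR,Jar={1},Args=[{2}],ActionOnFailure={3}"
    return template.format(name, jar, ",".join(args), action_on_failure)


# The spark-submit invocation from the create-cluster example, one token per item.
spark_pi_args = [
    "/home/hadoop/spark/bin/spark-submit",
    "--deploy-mode", "cluster",
    "--master", "yarn-cluster",
    "--class", "org.apache.spark.examples.SparkPi",
    "s3://support.elasticmapreduce/spark/1.2.0/spark-examples-1.2.0-hadoop2.4.0.jar",
    "10",
]

step = custom_jar_step(
    "SparkApp",
    "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
    spark_pi_args,
)
print(step)
```

Laid out this way, it is easier to see that everything inside `Args=[...]` is simply the `spark-submit` command line with commas in place of spaces.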

You may compare this command to the one in `~/datacourse/scripts/create_spark_cluster.py` to see how this code can be extended for different projects. Please don't change the script though!

We'll be using a publicly available dataset from S3 of Wikipedia site traffic to demonstrate typical ETL tasks under different frameworks. We'll only directly compare two (Spark and Scalding), but feel free to fire up a cluster and use Amazon's ample documentation to learn how to use some of the other most popular frameworks, e.g. Pig, Impala, and Crunch.

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*