# Data Lake using AWS EMR Cluster

## Deployment

1. Install `awscli`

2. Run `aws configure` and provide:
    * AWS Access Key ID : 
    * AWS Secret Access Key : 
    * Default region name: `us-west-2`
    * Default output format : `json`
    
3. **Copy all the necessary files to an S3 bucket**

    * `emr_bootstrap.sh` and `etl.py`

    ```
    #!/bin/bash
    # emr_bootstrap.sh
    sudo easy_install pip3
    ```
    
    * Example: `aws s3 cp <filename> s3://<bucket_name>`


4. **Run the EMR `create-cluster` command with the ETL job**

```
aws emr create-cluster --name "Spark cluster with step" \
    --release-label emr-5.30.1 \
    --applications Name=Spark \
    --log-uri s3://dendsparktutorial/logs/ \
    --ec2-attributes KeyName=emr-key \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://dendsparktutorial/emr_bootstrap.sh \
    --steps Type=Spark,Name="Spark program",ActionOnFailure=CONTINUE,Args=[s3://dendsparktutorial/src/etl.py] \
    --use-default-roles \
    --auto-terminate
```
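
The same command can also be assembled programmatically, which may help when the bucket or key pair name differs between environments. The following is a minimal sketch and not part of the original project: the helper name `build_create_cluster_command` and its parameters are hypothetical, and it only builds the argument list shown above without calling AWS.

```python
import shlex


def build_create_cluster_command(bucket, key_name, instance_count=3):
    """Return the `aws emr create-cluster` call above as an argv list.

    Hypothetical helper: the flags mirror the command in this README;
    only the bucket, key pair name, and instance count are parameters.
    """
    # In an argv list the step name needs no inner quotes, because no shell
    # is involved in splitting the arguments.
    step = (
        "Type=Spark,Name=Spark program,ActionOnFailure=CONTINUE,"
        f"Args=[s3://{bucket}/src/etl.py]"
    )
    return [
        "aws", "emr", "create-cluster",
        "--name", "Spark cluster with step",
        "--release-label", "emr-5.30.1",
        "--applications", "Name=Spark",
        "--log-uri", f"s3://{bucket}/logs/",
        "--ec2-attributes", f"KeyName={key_name}",
        "--instance-type", "m5.xlarge",
        "--instance-count", str(instance_count),
        "--bootstrap-actions", f"Path=s3://{bucket}/emr_bootstrap.sh",
        "--steps", step,
        "--use-default-roles",
        "--auto-terminate",
    ]


command = build_create_cluster_command("dendsparktutorial", "emr-key")
# Print a shell-quoted line that can be pasted into a terminal.
print(shlex.join(command))
```

To actually launch the cluster, the list could be passed to `subprocess.run(command, check=True)`. Building a list rather than a single string avoids shell-quoting mistakes around the step name.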


**EMR Script Components**

* **aws emr** : Invokes the AWS CLI, and specifically the command for EMR.

* **create-cluster** : Creates a cluster
* **--name** : Any name you like; it is shown in the AWS EMR UI. The name does not have to be unique and can duplicate the name of an existing cluster.

* **--release-label**: This is the version of EMR you’d like to use.

* **--instance-count**: Sets the total number of instances. One instance is the primary node and the rest are secondary nodes. For example, with `--instance-count 4`, 1 instance is reserved for the primary and 3 for the secondary instances.

* **--applications**: The list of applications to pre-install on the cluster at launch time.

* **--bootstrap-actions**: A script stored in S3 that pre-installs software or sets environment variables; EMR calls this script when the cluster launches.

* **--ec2-attributes KeyName**: The name of your EC2 key pair, without the file extension. For example, if the key file is `MyKey.pem`, specify `MyKey`.

* **--instance-type**: The EC2 instance type to use. The AWS documentation has the detailed list; choose one that fits your data and your budget.

* **--log-uri**: The S3 location where EMR stores its logs. These logs can include EMR metrics as well as the metrics and logs for the submission of your code.
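
Two of the rules above (how `--instance-count` is divided, and how the `KeyName` relates to the key file) can be written as small helpers. This is an illustrative sketch only; the function names `split_instance_count` and `key_name_from_file` are hypothetical, and the second one assumes the key pair name matches the `.pem` file name, as the example above does.

```python
from pathlib import Path


def split_instance_count(instance_count):
    """Return (primary, secondary) node counts for a given --instance-count."""
    if instance_count < 1:
        raise ValueError("--instance-count must be at least 1")
    # One instance is always the primary node; the rest are secondary nodes.
    return 1, instance_count - 1


def key_name_from_file(pem_path):
    """Return the KeyName value for a key file by dropping its extension."""
    return Path(pem_path).stem


print(split_instance_count(4))          # (1, 3)
print(key_name_from_file("MyKey.pem"))  # MyKey
```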

```
aws s3 ls
```

```
2019-11-30 15:19:35 athena4dend
2019-12-01 18:15:50 aws-athena-query-results-257082603396-us-west-1
2019-11-02 16:18:57 aws-emr-resources-257082603396-us-west-2
2019-11-02 16:17:06 aws-logs-257082603396-us-west-2
2019-10-12 19:55:56 dendbucketdemo1
2020-06-20 10:00:58 dendsparktut
2020-06-29 15:37:02 dendsparktutorial
```


```
# List the local files that need to be moved to an S3 bucket
ls *etl*
ls *emr*
```

```
etl.py	koalas_etl.py
emr_bootstrap.sh
```


```
# Copy your ETL script, plus the emr_bootstrap file if you are using koalas.
# The trailing slash on src/ matters: without it, etl.py is stored as an object
# named "src" (as in the recorded output below) instead of src/etl.py,
# which is the path the EMR step expects.
aws s3 cp etl.py s3://dendsparktutorial/src/
aws s3 cp emr_bootstrap.sh s3://dendsparktutorial
```

```
upload: ./etl.py to s3://dendsparktutorial/src                 
upload: ./emr_bootstrap.sh to s3://dendsparktutorial/emr_bootstrap.sh
```
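
The upload output shows that `aws s3 cp` treats its destination differently depending on its form. A bucket alone, or a path ending in `/`, keeps the local file name. Any other path becomes the full object key, which is why `etl.py` lands at `s3://dendsparktutorial/src`. The sketch below approximates that rule for single-file uploads; `resolve_s3_destination` is a hypothetical helper, not part of the AWS CLI.

```python
import posixpath


def resolve_s3_destination(local_path, destination):
    """Approximate the object URI that `aws s3 cp <local_path> <destination>` writes."""
    if not destination.startswith("s3://"):
        raise ValueError("destination must start with s3://")
    filename = posixpath.basename(local_path)
    bucket, _, key = destination[len("s3://"):].partition("/")
    if key == "" or key.endswith("/"):
        # Bucket root or a prefix ending in "/": the local file name is appended.
        return f"s3://{bucket}/{key}{filename}"
    # Otherwise the destination is taken as the complete object key.
    return destination


print(resolve_s3_destination("./etl.py", "s3://dendsparktutorial/src"))
print(resolve_s3_destination("./etl.py", "s3://dendsparktutorial/src/"))
print(resolve_s3_destination("./emr_bootstrap.sh", "s3://dendsparktutorial"))
```

Since the EMR step reads `s3://dendsparktutorial/src/etl.py`, the second form (with the trailing slash) is the one that matches.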


```
# Check that the files are in place
aws s3 ls dendsparktutorial/src/
```

```
2020-07-04 18:04:22       7793 etl.py
2020-07-04 15:53:10       1226 koalas_etl.py
```
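
This check can also be scripted, for example to confirm that `etl.py` is present before launching the cluster. The sketch below parses the listing shown above; `parse_s3_ls_line` is a hypothetical helper, and it assumes every line is an object line (date, time, size, name) rather than a `PRE` prefix line.

```python
from datetime import datetime


def parse_s3_ls_line(line):
    """Parse one object line of `aws s3 ls` output into a dict."""
    # Split into at most four fields so names containing spaces stay intact.
    date, time, size, name = line.split(None, 3)
    modified = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return {"modified": modified, "size": int(size), "name": name}


listing = """\
2020-07-04 18:04:22       7793 etl.py
2020-07-04 15:53:10       1226 koalas_etl.py
"""

objects = [parse_s3_ls_line(line) for line in listing.splitlines() if line.strip()]
names = {obj["name"] for obj in objects}
print("etl.py" in names)
```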


```
# Run the Spark job
aws emr create-cluster --name "Spark cluster with step" \
    --release-label emr-5.30.1 \
    --applications Name=Spark \
    --log-uri s3://dendsparktutorial/logs/ \
    --ec2-attributes KeyName=emr-key \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://dendsparktutorial/emr_bootstrap.sh \
    --steps Type=Spark,Name="Spark program",ActionOnFailure=CONTINUE,Args=[s3://dendsparktutorial/src/etl.py] \
    --use-default-roles \
    --auto-terminate
```

```
{
    "ClusterId": "j-3CMP8BO03MU0N"
}
```
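
The returned `ClusterId` is what follow-up commands need. As a minimal sketch, the JSON above can be parsed and used to build `aws emr describe-cluster` and `aws emr wait cluster-terminated` calls. The variable names are illustrative, and the commands are only printed here, not executed.

```python
import json
import shlex

# Output of `aws emr create-cluster`, copied from above.
create_cluster_output = """
{
    "ClusterId": "j-3CMP8BO03MU0N"
}
"""

cluster_id = json.loads(create_cluster_output)["ClusterId"]

# Inspect the cluster state, or block until the auto-terminating cluster is gone.
describe_command = ["aws", "emr", "describe-cluster", "--cluster-id", cluster_id]
wait_command = ["aws", "emr", "wait", "cluster-terminated", "--cluster-id", cluster_id]

print(shlex.join(describe_command))
print(shlex.join(wait_command))
```

Because the cluster is created with `--auto-terminate`, waiting for termination is one way to tell that the step has finished; the logs under the `--log-uri` location can then be checked for the result.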
