# Final Project - AWS Instructions.

## EC2 instance.

<img src="images/ec2.jpg" />

#### Log into EC2:

`ssh -i ~/ssh/dsci6007_cpm.pem ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com`

#### scp files to EC2:

`scp -i ~/ssh/dsci6007_cpm.pem /Users/carles/api_meetup_cred.yml ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com:~`

`scp -i ~/ssh/dsci6007_cpm.pem /Users/carles/meet_up_kinesis.py ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com:~`

#### Install `requests` to run the script that loads data from MeetUp into S3 using Kinesis Firehose.

`sudo pip install requests`

#### Install `mailutils` to run the script that sends an email with S3 bucket information.

`sudo apt install mailutils`

#### Other software installed:

`sudo apt-get install python-dev python-pip`

`sudo apt-get install python-pandas`

`sudo apt-get install python-yaml`

`sudo apt install awscli`

`sudo pip install boto3`

`sudo pip install pandas`

`sudo pip install boto`

`sudo apt install libpq-dev python-dev`

`sudo pip install psycopg2`

##### To list the install commands that were run on the instance: `history | grep install`

#### In case something needs to be deleted from S3 (CAUTION!!).

`aws s3 rm --recursive s3://dsci6007-firehose-final-project/meet_up2017`

#### To check how much MeetUp data Kinesis Firehose has delivered to S3.

`aws s3 ls --summarize --human-readable --recursive s3://dsci6007-firehose-final-project/meet_up2017`
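If the totals are needed inside a script (for example, for a report), the two summary lines printed by `aws s3 ls --summarize` can be parsed. The sketch below is a minimal example and not part of the project scripts; `parse_s3_summary` and the sample text are hypothetical, and it assumes the summary format `Total Objects: N` / `Total Size: X UNIT`.

```python
import re

# Binary unit multipliers as printed by `aws s3 ls --human-readable`.
UNITS = {"Byte": 1, "Bytes": 1, "KiB": 1024, "MiB": 1024 ** 2,
         "GiB": 1024 ** 3, "TiB": 1024 ** 4, "PiB": 1024 ** 5}


def parse_s3_summary(output):
    """Return (total_objects, total_size_in_bytes) from the command output."""
    objects = re.search(r"Total Objects:[ \t]*(\d+)", output)
    size = re.search(r"Total Size:[ \t]*([\d.]+)[ \t]*(\w+)?", output)
    if not objects or not size:
        raise ValueError("no summary lines found in output")
    # Without --human-readable the size is a plain byte count with no unit.
    unit = size.group(2) or "Bytes"
    return int(objects.group(1)), int(float(size.group(1)) * UNITS[unit])


# Illustrative text only, not real output from the project bucket.
sample = "Total Objects: 12\n   Total Size: 1.5 MiB\n"
print(parse_s3_summary(sample))
```

The command output could be captured with `subprocess.check_output` and passed to this function.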

#### Script to `GET` raw JSON from MeetUp into S3 (runs on the above EC2 instance):

`meet_up_kinesis.py`

It is limited to 200 requests per hour to avoid being throttled by MeetUp.
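How `meet_up_kinesis.py` enforces this limit is not shown here. One possible approach is a rolling request budget, sketched below under that assumption; the class `HourlyBudget` and its parameters are hypothetical names, not taken from the script.

```python
import time


class HourlyBudget:
    """Allow at most `limit` calls per rolling `window` seconds."""

    def __init__(self, limit=200, window=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.stamps = []  # times of the calls still inside the window

    def try_acquire(self):
        """Record a call and return True if the budget allows it, else False."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.window]
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True


budget = HourlyBudget()
# Only the first 200 attempts within the hour are allowed.
allowed = sum(budget.try_acquire() for _ in range(250))
print(allowed)
```

A caller would check `try_acquire()` before each request to MeetUp and stop (or sleep) once it returns `False`.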

#### CRON job.

We will run the above script on EC2 every hour using CRON.

We make the Python script executable so that the cron entry does not need to call `python` explicitly (this relies on the script starting with a shebang line):

`chmod a+x meet_up_kinesis.py`

scp script to EC2:

`scp -i ~/ssh/dsci6007_cpm.pem /Users/carles/meet_up_kinesis.py ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com:~`

We create a CRON text file:

`cron_kinesis_meetup.txt`

which includes the following:

`0 * * * * /home/ubuntu/meet_up_kinesis.py && curl -sm 30 k.wdt.io/carles.poles@gmail.com/meetup_firehose_s3?c=0_*_*_*_*`

scp text file to EC2:

`scp -i ~/ssh/dsci6007_cpm.pem /Users/carles/cron_kinesis_meetup.txt ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com:~`

Then, run this on the EC2 command line (note that `crontab <file>` replaces the user's entire existing crontab):

`crontab cron_kinesis_meetup.txt`

### Note that we use the free monitoring utility from `https://crontab.guru` to get email alerts from the above CRON job.

<img src="images/cron-guru.jpg" />
<img src="images/w-1.jpg" />
<img src="images/w-2.jpg" />
<img src="images/w-3.jpg" />

### We get an email every hour with the size of our Firehose bucket.

The script is `email_s3_report.py`.

We make the Python script executable so that the cron entry does not need to call `python` explicitly:

`chmod a+x email_s3_report.py`

scp script to EC2:

`scp -i ~/ssh/dsci6007_cpm.pem /Users/carles/email_s3_report.py ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com:~`

and add this entry to the crontab:

`0 * * * * /home/ubuntu/email_s3_report.py dsci6007-firehose-final-project`

<img src="images/s3-email.jpg" />

### Permissions for S3 bucket.
`{
  "Id": "Policy1487106243408",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1487106238363",
      "Action": [
        "s3:GetObject"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::dsci6007-firehose-final-project/*",
      "Principal": {
        "AWS": [
          "*"
        ]
      }
    }
  ]
}`
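The policy above grants everyone (`"Principal": "*"`) read access (`s3:GetObject`) to every object in the bucket. If the same policy is needed for another bucket, it could be generated rather than edited by hand. This is a minimal sketch; `public_read_policy` is a hypothetical helper, and the optional `Id` and `Sid` fields are left out.

```python
import json


def public_read_policy(bucket):
    """Build a bucket policy that lets anyone read any object in `bucket`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": ["*"]},
                "Action": ["s3:GetObject"],
                "Resource": "arn:aws:s3:::{}/*".format(bucket),
            }
        ],
    }


policy = public_read_policy("dsci6007-firehose-final-project")
print(json.dumps(policy, indent=2))
```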

## Firehose Setup.

<img src="images/firehose.jpg" />

Note the S3 buffer size is set to 100MB.

## AWS Spark EMR instance.

<img src="images/emr-1.jpg" />

Log into the instance:

`ssh -i ~/ssh/dsci6007_cpm.pem hadoop@ec2-52-202-107-142.compute-1.amazonaws.com`

#### Software installed on the instance, listed with:

`history | grep install`

showing:

`sudo pip install jupyter`

`sudo pip install pandas`

`sudo pip install boto3`

#### URL for Zeppelin:

`http://ec2-52-202-107-142.compute-1.amazonaws.com:8890/`

#### URL for Jupyter:

`http://ec2-52-202-107-142.compute-1.amazonaws.com:8888/?token=TOKEN`

where `TOKEN` will be provided when running `pyspark` on the EMR command line.

#### In case we need to `scp` files to EMR:

`scp -i ~/ssh/dsci6007_cpm.pem ~/Downloads/meetup_test hadoop@ec2-52-202-107-142.compute-1.amazonaws.com:~`

https://developers.google.com/chart/interactive/docs/gallery/map#geocoded-locations

#### The following scripts write HTML report files to the EMR master node; these files need to be copied to an S3 bucket:

`spark_meetup_time_rdd.py`

`spark_meetup_time_df.py`

#### We create a .sh script that will run on the master node:

`emr_html_s3.sh`

which contains:

`aws s3 cp /home/hadoop/spark-top-meetup-categories-pie.html s3://dsci6007.com/`

`aws s3 cp /home/hadoop/spark-top-meetup-categories.html s3://dsci6007.com/`

`aws s3 cp /home/hadoop/spark-top-meetup-topics-pie.html s3://dsci6007.com/`

`aws s3 cp /home/hadoop/spark-top-meetup-topics.html s3://dsci6007.com/`

`aws s3 cp /home/hadoop/spark-top-meetup-organizers-pie.html s3://dsci6007.com/`

`aws s3 cp /home/hadoop/spark-top-meetup-organizers.html s3://dsci6007.com/`

`aws s3 cp /home/hadoop/spark-meetings-map.html s3://dsci6007.com/`

`aws s3 cp /home/hadoop/meetup_report_files.html s3://dsci6007.com/`
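The eight copy commands differ only in the file name, so they could also be generated from a list. The following is a sketch of that idea in Python, not the project's actual approach (which is the shell script above); `REPORT_FILES` and `copy_commands` are hypothetical names.

```python
import shlex

# Report files listed in emr_html_s3.sh.
REPORT_FILES = [
    "spark-top-meetup-categories-pie.html",
    "spark-top-meetup-categories.html",
    "spark-top-meetup-topics-pie.html",
    "spark-top-meetup-topics.html",
    "spark-top-meetup-organizers-pie.html",
    "spark-top-meetup-organizers.html",
    "spark-meetings-map.html",
    "meetup_report_files.html",
]


def copy_commands(files, source_dir="/home/hadoop", bucket="dsci6007.com"):
    """Return one `aws s3 cp` argument list per report file."""
    return [
        ["aws", "s3", "cp", "{}/{}".format(source_dir, name),
         "s3://{}/".format(bucket)]
        for name in files
    ]


# Print the commands instead of running them.
for cmd in copy_commands(REPORT_FILES):
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the copies, each argument list could be passed to `subprocess.check_call`.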

#### Then, we have a CRON job that runs on the master node, defined in:

`cron_copy_html_s3.txt`

which contains:

`0 */1 * * * sh /home/hadoop/emr_html_s3.sh`

We `scp` the two files to the master node:

`scp -i ~/ssh/dsci6007_cpm.pem emr_html_s3.sh hadoop@ec2-52-202-107-142.compute-1.amazonaws.com:~`

`scp -i ~/ssh/dsci6007_cpm.pem cron_copy_html_s3.txt hadoop@ec2-52-202-107-142.compute-1.amazonaws.com:~`

Then enable the CRON job by running:

`crontab cron_copy_html_s3.txt`

#### Note that we copy the HTML files to an S3 bucket that has been enabled for static website hosting.

<img src="images/static.jpg" />

#### We have another CRON job to check the timestamp of the html file reports.

We make the Python script executable so that the cron entry does not need to call `python` explicitly:

`chmod a+x email_timestamp_files.py`

scp script to EMR:

`scp -i ~/ssh/dsci6007_cpm.pem /Users/carles/email_timestamp_files.py hadoop@ec2-52-202-107-142.compute-1.amazonaws.com:~`

and this is the crontab entry:

`0 */1 * * * /home/hadoop/email_timestamp_files.py`

The script generates an email and an HTML page:

https://s3-us-west-1.amazonaws.com/dsci6007.com/meetup_report_files.html

<img src="images/mrt-1.jpg" />
<img src="images/mrt-2.jpg" />
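The contents of `email_timestamp_files.py` are not reproduced here. As a rough illustration of the idea, the sketch below reads the last-modified time of each report file and renders a small HTML table; the function names `report_timestamps` and `to_html` are hypothetical, and the emailing step is left out.

```python
import os
from datetime import datetime


def report_timestamps(paths):
    """Map each path to its last-modified time, or None if it is missing."""
    result = {}
    for path in paths:
        try:
            result[path] = datetime.fromtimestamp(os.path.getmtime(path))
        except OSError:
            result[path] = None
    return result


def to_html(timestamps):
    """Render the timestamps as a minimal HTML table."""
    rows = "".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(os.path.basename(p),
                                                 t or "missing")
        for p, t in sorted(timestamps.items())
    )
    return ("<table><tr><th>File</th><th>Last modified</th></tr>"
            "{}</table>".format(rows))


stamps = report_timestamps(["/home/hadoop/spark-meetings-map.html"])
print(to_html(stamps))
```

A file whose timestamp stops advancing would indicate that the corresponding Spark job or copy step is no longer running.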

#### Note that because CRON jobs run on the EMR master node, cron mails each job's output to the local `hadoop` user.

<img src = "images/mail-1.jpg" />

We go to:

`cd /var/spool/mail`

then:

`cat hadoop`

<img src = "images/mail-2.jpg" />

### Spark Submit.

`scp -i ~/ssh/dsci6007_cpm.pem spark_meetup_time_df.py hadoop@ec2-52-202-107-142.compute-1.amazonaws.com:~`

`scp -i ~/ssh/dsci6007_cpm.pem spark_meetup_time_rdd.py hadoop@ec2-52-202-107-142.compute-1.amazonaws.com:~`

#### We will run `spark_meetup_time_rdd.py` and `spark_meetup_time_df.py` from EMR as:

`spark-submit spark_meetup_time_df.py s3a://dsci6007-firehose-final-project/meet_up2017/*/*/*/*`

`spark-submit spark_meetup_time_rdd.py s3a://dsci6007-firehose-final-project/meet_up2017/*/*/*/*`

>NOTE: on the EMR command line we may need to run `unset PYSPARK_DRIVER_PYTHON` so that `spark-submit` does not launch a Jupyter notebook.
If Jupyter notebooks are required afterwards, run `source ~/.bashrc` once the `spark-submit` tasks have finished (refer to https://github.com/carlespoles/DSCI6007-student/blob/master/5.4%20-%20Spark%20Submit/Lab_5_4-VijethLomada-CarlesPolesMielgo/Lab_5_4-VijethLomada-CarlesPolesMielgo.ipynb)
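If `spark-submit` is launched from a Python wrapper instead of the interactive shell, a similar effect can be had by passing it an environment with the Jupyter-related variables removed. This is a minimal sketch: `spark_submit_env` is a hypothetical helper, and removing `PYSPARK_DRIVER_PYTHON_OPTS` as well is an assumption that goes beyond the note above.

```python
import os

# Variables that point the PySpark driver at Jupyter when set.
JUPYTER_VARS = ("PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS")


def spark_submit_env(environ=None):
    """Return a copy of the environment without the Jupyter driver variables."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k not in JUPYTER_VARS}


# Example with a made-up environment.
example = {"PYSPARK_DRIVER_PYTHON": "jupyter", "PATH": "/usr/bin"}
print(spark_submit_env(example))
```

The result can be passed as the `env` argument of `subprocess.call(["spark-submit", ...], env=spark_submit_env())`, which leaves the shell's own settings untouched.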

#### We run the job scripts in a CRON job every two hours:

`scp -i ~/ssh/dsci6007_cpm.pem cron_jobs_emr.txt hadoop@ec2-52-202-107-142.compute-1.amazonaws.com:~`

`crontab cron_jobs_emr.txt`

where `cron_jobs_emr.txt` contains:

`0 */2 * * * spark-submit /home/hadoop/spark_meetup_time_df.py s3a://dsci6007-firehose-final-project/meet_up2017/*/*/*/*`

`0 */2 * * * spark-submit /home/hadoop/spark_meetup_time_rdd.py s3a://dsci6007-firehose-final-project/meet_up2017/*/*/*/*`

##### We can chain both jobs with `&&`, so that the second starts only after the first finishes successfully:

`0 */2 * * * spark-submit /home/hadoop/spark_meetup_time_df.py s3a://dsci6007-firehose-final-project/meet_up2017/*/*/*/* && spark-submit /home/hadoop/spark_meetup_time_rdd.py s3a://dsci6007-firehose-final-project/meet_up2017/*/*/*/*`
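To check when a schedule such as `0 */2 * * *` fires, the minute and hour fields can be expanded by hand. The sketch below is a simplified illustration, not a full cron parser: it handles only `*`, `*/n`, `a-b`, `a-b/n`, comma lists and single numbers, and `expand_field` and `run_times` are hypothetical names.

```python
def expand_field(field, low, high):
    """Expand one cron field into the sorted list of values it matches."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step))
    return sorted(values)


def run_times(minute_field, hour_field):
    """List the HH:MM times of day at which the first two fields fire."""
    return ["{:02d}:{:02d}".format(h, m)
            for h in expand_field(hour_field, 0, 23)
            for m in expand_field(minute_field, 0, 59)]


print(run_times("0", "*/2"))
```

For `0 */2 * * *` this lists twelve times, on the hour from 00:00 to 22:00. The hour field `*/1` used elsewhere in these instructions expands to the same values as `*`.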

One run failed with the following S3 error:

`7/03/01 18:32:56 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
('Error: ', 'An error occurred while calling o104.saveAsTextFile.\n: com.amazonaws.services.s3.model.AmazonS3Exception: We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 200; Error Code: InternalError; Request ID: 50D9DA0B0C6FFC74), S3 Extended Request ID: yLDCHTtdetGgfoBSLoRQFlpmhsEXDSqVXKJ4QX9zzTfMq8R7T1xwDPFVh0sbAGM5iDdJSap1GPU=\n\tat com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1571)\n\tat com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:146)\n\tat com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:134)\n\tat com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:193)\n\tat com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:147)\n\tat com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:47)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat java.lang.Thread.run(Thread.java:745)\n')`

#### Because of the error above, I decided to resize the cluster:

<img src="images/resize.jpg" />

## Another report on the S3 bucket, run on EC2:

### The script is `email_s3_insert_graph_report.py`

### Example usage: `python email_s3_insert_graph_report.py dsci6007-firehose-final-project`

### Before that, we need a new DB: `dsci6007_s3_db`

<img src="images/s3-db.jpg" />

#### We create this table:

`-- Table: public.s3_bucket`

`-- DROP TABLE public.s3_bucket;`

`CREATE TABLE public.s3_bucket`

`(`
    `bucket_name character varying COLLATE pg_catalog."default",`
    
    `size bigint,`
    
    `timestamp_firehose timestamp without time zone,`
    
    `total_objects bigint`
    
`)`

`WITH (`

    `OIDS = FALSE`
    
`)`

`TABLESPACE pg_default;`


`ALTER TABLE public.s3_bucket`

    `OWNER to dsci6007;`

The script also creates two images, which are saved on EC2 and then moved to an S3 web bucket:

<img src="images/s3_size_plot.png" />

<img src="images/s3_objects_plot.png" />
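The script's name suggests that each run inserts one measurement into `public.s3_bucket`, from which plots like the ones above can be drawn. The sketch below shows what such an insert could look like with `psycopg2` (installed earlier); it is an assumption about the script, not its actual code, and `INSERT_SQL` and `s3_bucket_row` are hypothetical names.

```python
from datetime import datetime, timezone

# Parameterized statement; psycopg2 uses %s placeholders for all types.
INSERT_SQL = (
    "INSERT INTO public.s3_bucket "
    "(bucket_name, size, timestamp_firehose, total_objects) "
    "VALUES (%s, %s, %s, %s)"
)


def s3_bucket_row(bucket_name, size_bytes, total_objects, when=None):
    """Return the parameter tuple in the column order used by INSERT_SQL."""
    # The column is `timestamp without time zone`, so store naive UTC.
    when = when or datetime.now(timezone.utc).replace(tzinfo=None)
    return (bucket_name, int(size_bytes), when, int(total_objects))


# With an open psycopg2 connection `conn`, the row could be written as:
#     with conn.cursor() as cur:
#         cur.execute(INSERT_SQL, s3_bucket_row(name, size, count))
#     conn.commit()
print(INSERT_SQL)
```

Passing the values as parameters, rather than formatting them into the SQL string, lets the driver handle quoting.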

### IMPORTANT: to save generated images on EC2, which has no display, we need to switch matplotlib to a non-interactive backend by editing `/etc/matplotlibrc`:

`cd /etc/`

`sudo nano matplotlibrc`

and change this line from:

`backend      : TkAgg`

to:

`backend       : Agg`
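The manual edit above could also be scripted. The sketch below only transforms the text of a `matplotlibrc` file and leaves reading and writing `/etc/matplotlibrc` (which needs `sudo`) to the caller; `set_backend` is a hypothetical helper.

```python
import re


def set_backend(rc_text, backend="Agg"):
    """Return matplotlibrc text with its active `backend` line set to `backend`."""
    # Match an uncommented "backend : value" line, keeping its spacing.
    pattern = re.compile(r"^([ \t]*backend[ \t]*:[ \t]*)\S+", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + backend,
                                   rc_text, count=1)
    if count == 0:
        # No active backend line was found, so append one.
        new_text = rc_text.rstrip("\n") + "\nbackend : {}\n".format(backend)
    return new_text


print(set_backend("backend      : TkAgg\n"))
```

Alternatively, a script can select the backend itself by calling `matplotlib.use("Agg")` before importing `matplotlib.pyplot`, which avoids editing the system file.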

Then, `scp` the script to EC2:

`scp -i ~/ssh/dsci6007_cpm.pem /Users/carles/email_s3_insert_graph_report.py ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com:~`

To copy the images from EC2 to the local Mac:

`scp -i ~/ssh/dsci6007_cpm.pem ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com:~/s3_objects_plot.png /Users/carles/`

`scp -i ~/ssh/dsci6007_cpm.pem ubuntu@ec2-54-91-189-236.compute-1.amazonaws.com:~/s3_size_plot.png /Users/carles/`

### We could run the above script in another CRON job.

`chmod a+x email_s3_insert_graph_report.py`

`0 * * * * /home/ubuntu/email_s3_insert_graph_report.py dsci6007-firehose-final-project`

The generated report can be found here:

https://s3-us-west-1.amazonaws.com/dsci6007.com/s3_report.html