# Web and Cloud Computing (DATA 534): Lab 4
## General Lab Instructions
### Attention - Exercise 5 amended

- This assignment is to be completed in Python. Submit both a `.ipynb` file (you can add your answers directly to this one) and a rendered `.md`. Further submission requirements are listed at the end of the lab.
- I added an Intro section to help you with the basics for this lab.

## Intro

In this lab, you will perform analytics on large text data using MapReduce. You will work with the following Amazon Web Services (AWS): Elastic MapReduce (EMR), EC2, S3 and (optionally) DynamoDB.

I strongly advise you to read this [tutorial](https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/) (you do **not** need to carry out all of the steps). However, **before reading it**, you should understand standard input/output in Python.


Create a Python script named `learning_stdin.py` with the following code:

```python
#!/usr/bin/env python
import sys

for line in sys.stdin:
    print("The following was just provided by the user:" + line)
```

Now, open Bash (or Git Bash on Windows) and run the script with `python learning_stdin.py`. Nothing happens? Try typing something and press enter. See? The program is reading input from the user. The cool thing (and important for this lab) is that you can also pass the content of a file as the input. Let's try it out:

1. Create a file `my_input.txt` and put a bit of text there (whatever you want, the content is not important as long as there are multiple lines).
2. Next, execute the following command (should work on Linux, Mac, and Windows): `cat my_input.txt | ./learning_stdin.py`. If you get a "permission denied" error, make the script executable first with `chmod +x learning_stdin.py`, or run `cat my_input.txt | python learning_stdin.py` instead.
3. Note how the script reads one line at a time; now take another look at the `for` loop in `learning_stdin.py`.

That's it, you should be able to understand the tutorial now.

## Exercise 1: Distributed computing on AWS EMR

In this exercise, we will process all of the unigram files using AWS Elastic MapReduce (EMR). The overall strategy can be broken down into the following steps:

### Step 1: Download a nugget of data

The dataset that we will work with is available for download as part of the [Google Books Ngram dataset](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html). We will mainly work with the unigrams (1-grams) for this lab (you can read up on [n-grams on Wikipedia](https://en.wikipedia.org/wiki/N-gram) and also play around with the [Google Ngram viewer](https://books.google.com/ngrams)). Start by downloading a subset of the corpus: the American English unigrams starting with **w**, available [here](http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-us-all-1gram-20120701-w.gz).

The file is compressed; once uncompressed, it should be about 500 MB.


### Step 2: Map and reduce

Write two scripts, `mapper.py` and `reducer.py`, to analyse the data.

*  The task of `mapper.py` is to count the total number of occurrences of words containing the word `google` in the (nugget of) data; count both upper and lower case combinations of the word. For each match, it should write a year and a count to the standard output.
*  The task of `reducer.py` is to combine the outputs produced by `mapper.py` and to sum the totals from the different files by year.
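The streaming pattern behind the two scripts can be sketched as follows. This is only an illustration under stated assumptions: each input line is taken to be tab-separated as `ngram`, `year`, `match_count`, `volume_count` (check this against the dataset page), and a case-insensitive substring match on `google` is assumed to be what is wanted. The names `map_line` and `reduce_lines` are hypothetical; in the real scripts the same logic would read from `sys.stdin` and `print` to standard output.

```python
from collections import defaultdict

TARGET = "google"  # assumption: case-insensitive substring match


def map_line(line):
    """Return 'year<TAB>count' for a matching unigram line, else None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None  # skip malformed lines
    ngram, year, match_count = fields[0], fields[1], fields[2]
    if TARGET in ngram.lower():
        return f"{year}\t{match_count}"
    return None


def reduce_lines(lines):
    """Sum the counts per year over 'year<TAB>count' lines."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        year, count = line.strip().split("\t")
        totals[year] += int(count)
    return dict(totals)


# Tiny made-up sample, only to show the data flow.
sample = [
    "Google\t2005\t10\t3\n",
    "googled\t2005\t4\t2\n",
    "water\t2005\t99\t50\n",
    "GOOGLE\t2006\t7\t1\n",
]
mapped = [m for m in map(map_line, sample) if m is not None]
totals = reduce_lines(mapped)
print(totals)  # {'2005': 14, '2006': 7}
```

Keeping the per-line logic in small functions like these makes it easy to test locally before the scripts are run on EMR.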

**Submission**: _your zip file should include the `mapper.py` and `reducer.py` files_

### Step 3: More data and testing

Download an additional subset of the ngrams corpus, say the **g** corpus from [here](http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-us-all-1gram-20120701-g.gz). Test your map-reduce code locally on the **w** and **g** datasets.
   + You can test `mapper.py` by emulating the way Hadoop processes the text file: `cat googlebooks-eng-us-all-1gram-20120701-w | ./mapper.py >> test_output`. Do the same for the **g** dataset.
   + You can test `reducer.py` as follows: `cat test_output | ./reducer.py | sort -k1,1`.

(**WINDOWS ALERT:** Make sure when you create your scripts to change the end of line character to `LF` (linux-style) instead of `CRLF` (Windows style), otherwise your scripts will not work in Amazon's servers. It took me "only" 4 hours to figure this out.)
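If you would rather normalize line endings from Python than in your editor, a small sketch like the one below may help. The helper names `to_lf` and `convert_file` are hypothetical and not part of the lab.

```python
from pathlib import Path


def to_lf(data: bytes) -> bytes:
    """Replace Windows (CRLF) line endings with Unix (LF) ones."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite a file in place with LF line endings."""
    p = Path(path)
    p.write_bytes(to_lf(p.read_bytes()))
```

For example, `convert_file("mapper.py")` and `convert_file("reducer.py")` would rewrite both scripts before you upload them.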

**Submission**: _your zip file should include your test_output file, as well as a screenshot showing your commands for this step and the console output result from the reducer (Just the first ~25-35 lines)_

![](images/step3.png)

### Step 4: S3 Paraphernalia

1. Create a new S3 bucket: `firstname-lastname-ngrams` on AWS. For example, if your name is Jon Doe, the bucket name should be `jon-doe-ngrams`. You can do this interactively, using AWS CLI, or boto3.
2. Create a directory `scripts/` in your S3 bucket and upload `mapper.py` and `reducer.py` to it. Also, create a directory `logs/`.
3. Create a directory `unigrams/` in your S3 bucket. Your task in this step is to upload all 39 American English unigram files that are available [here](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) to your S3 bucket under the `unigrams/` directory.
   + The simplest approach is to first download each file to your local disk (you can do this manually or by writing a simple script, basically a series of `wget` commands). Then, upload the data to S3 the same way you uploaded the map-reduce scripts.
   + A faster approach might be to launch an EC2 instance, download the data to the EBS volume attached to the instance, and then upload the data from there to the S3 bucket. You can do this by writing a script that downloads the data, launching an EC2 instance, running the script on it, and using `aws s3` to upload the data. (Careful, charges may apply. Although, I did this step myself and I wasn't charged [at least yet]).
   + Note: before uploading all of the files to S3, first upload a small subset, say the **w** and the **g** subset. This will shorten the development-testing-debugging cycle. Once you are sure that your mapper and reducer code works properly, then you can upload all data.
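If you go the boto3 route, the bucket setup and script upload might look roughly like the sketch below. The helper names (`bucket_name`, `unigram_urls`, `upload_scripts`) are hypothetical. The list of file suffixes (ten digits, 26 letters, plus `other`, `pos` and `punctuation`) is an assumption that happens to add up to 39; verify it against the dataset page before relying on it. The boto3 calls need configured AWS credentials.

```python
import string

BASE_URL = "http://storage.googleapis.com/books/ngrams/books/"
PREFIX = "googlebooks-eng-us-all-1gram-20120701-"

# Assumption: 10 digits + 26 letters + 3 special files = 39 unigram files.
SUFFIXES = (
    list(string.digits)
    + list(string.ascii_lowercase)
    + ["other", "pos", "punctuation"]
)


def bucket_name(first, last):
    """Build the bucket name in the firstname-lastname-ngrams form."""
    return f"{first}-{last}-ngrams".lower()


def unigram_urls():
    """Return the download URL for every assumed unigram file."""
    return [f"{BASE_URL}{PREFIX}{suffix}.gz" for suffix in SUFFIXES]


def upload_scripts(bucket, paths=("mapper.py", "reducer.py")):
    """Create the bucket and upload the scripts under scripts/."""
    import boto3  # imported lazily so the helpers above work without it

    s3 = boto3.client("s3")
    # Outside us-east-1, create_bucket also needs a
    # CreateBucketConfiguration with a LocationConstraint.
    s3.create_bucket(Bucket=bucket)
    for path in paths:
        # S3 has no real directories; the "scripts/" key prefix acts as one.
        s3.upload_file(path, bucket, f"scripts/{path}")
```

For instance, `upload_scripts(bucket_name("Jon", "Doe"))` would target the `jon-doe-ngrams` bucket, and the output of `unigram_urls()` could feed a download loop.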

**Submission**: _your zip file should include screenshot(s) showing your files in your S3 bucket_

![](images/step4.png)
![](images/step4_1.png)

### Step 5: Take off
  

Execute your job by launching an EMR cluster. The main steps are outlined below. You can also refer to this [guide](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html).
   + Search for the **EMR** service in the AWS console.
   + Click on the **Create cluster** button.
   + Give it an appropriate name (e.g., unigram-cluster).
   + Specify the S3 directory in which the log files are to be stored.
   + Create the cluster as a **Hadoop** cluster. Don't add any steps yet; you will add them after the cluster is created.
   + After the cluster is created, go to the **Steps** tab and add a new streaming step.
   + Now give it some time. The job should run, and the output should appear in the output folder.
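The streaming step can also be added programmatically. The sketch below builds a step definition and submits it with boto3; it is a rough outline rather than the required approach. The function names, the step name, and the `output/` prefix are assumptions, and the bucket layout follows Step 4. Note that Hadoop refuses to write to an output location that already exists, so use a fresh prefix for each run.

```python
def streaming_step(bucket, name="google-count"):
    """Build a Hadoop streaming step definition for EMR."""
    scripts = f"s3://{bucket}/scripts"
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                # ship both scripts to every node
                "-files", f"{scripts}/mapper.py,{scripts}/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", f"s3://{bucket}/unigrams/",
                "-output", f"s3://{bucket}/output/",  # assumed prefix
            ],
        },
    }


def add_step(cluster_id, bucket):
    """Submit the streaming step to an already running cluster."""
    import boto3  # requires configured AWS credentials

    emr = boto3.client("emr")
    return emr.add_job_flow_steps(
        JobFlowId=cluster_id, Steps=[streaming_step(bucket)]
    )
```

Here `cluster_id` is the identifier shown for your cluster in the EMR console.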
   
**Submission**: _your zip file should include a screenshot showing your summary screen after the cluster finishes_

![](images/step5.png)

### Step 6: Land

Wait for the job to complete. Download the outputs from your S3 bucket and plot the number of occurrences of the word `google` by year.
  + The number of output files depends on the number of reducers used in the job, so you might need to aggregate the results.
  + If the processing fails due to an error, you can open the log files to see the error messages. Once you correct the error, you can upload your updated scripts. Then, instead of launching a new cluster, you can add a step to the cluster that you launched in Step 5. **Before you launch a new cluster, terminate any running cluster/instances to avoid incurring charges.**
    - You can do this by going to the EMR console and clicking on **Cluster list**, then on the cluster that you launched (assuming that this is your first cluster launch, there should only be one cluster). Scroll down and you will see the **Steps** menu. Select the step that failed to execute. Click on the **Clone step** button. Modify as necessary.
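One way to aggregate the downloaded outputs and draw the plot is sketched below. It assumes the reducer wrote lines of the form `year<TAB>count` and that the outputs were downloaded as `part-*` files into a local `output/` folder; the function names and the image file name are hypothetical.

```python
import glob
from collections import defaultdict


def aggregate(paths):
    """Sum counts per year across several reducer output files."""
    totals = defaultdict(int)
    for path in paths:
        with open(path) as fh:
            for line in fh:
                if not line.strip():
                    continue  # ignore blank lines
                year, count = line.split()[:2]
                totals[int(year)] += int(count)
    return dict(sorted(totals.items()))


def plot_totals(totals, out_path="google_by_year.png"):
    """Plot total occurrences by year and save the figure."""
    import matplotlib.pyplot as plt  # imported lazily; needs matplotlib

    fig, ax = plt.subplots()
    ax.plot(list(totals), list(totals.values()))
    ax.set_xlabel("Year")
    ax.set_ylabel("Occurrences of 'google'")
    fig.savefig(out_path)
```

A typical call would be `plot_totals(aggregate(glob.glob("output/part-*")))`.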
    
**Submission**: _your zip file should include the output you downloaded and the figure you generated._

**DO NOT FORGET TO TERMINATE ALL EC2 INSTANCES OR YOU WILL INCUR CHARGES TO YOUR ACCOUNT**

## Exercise 2 (optional): NoSQL databases

In this exercise, we will replicate the functionality of the [Google Books Ngram viewer](https://books.google.com/ngrams). Since we do not have a user interface, we will assume that we are given a file containing a comma-separated list of words. Your task is to modify your `mapper.py` and `reducer.py` (make sure to keep your old versions) so that they read in the words from the command line arguments and run MapReduce to compute the counts for each of the words by year. Upon completion, the results are to be stored in DynamoDB, which is Amazon's NoSQL database. Here are more detailed steps:
1. Follow the Jupyter notebook [tutorial](start_script.ipynb) to learn how to work with DynamoDB and launch an EMR cluster programmatically.
2. Modify the Jupyter notebook tutorial to read in `words.csv` from your local machine. Check the database to see whether any items exist in the DB for the words in `words.csv`. For the words that do not exist, launch EMR to process the unigrams data and total the counts for each of the words by year. Store the results in the DB table and output the plot(s).
   + To pass in the command line arguments when launching EMR, refer to this short [tutorial](launch_emr_with_arguments_to_mapper.ipynb).
3. Modify the `mapper.py` you created for Exercise 1 to read in the words to be processed from the command line arguments. The starter code is provided:

```python
#!/usr/bin/python

# starter code for mapper.py

import sys

# words to process, taken from the command line arguments
words = sys.argv[1:]
# any preprocessing on the words goes here

for line in sys.stdin:
    # the rest of your code goes here -- this part probably will not require much modification
    pass  # placeholder so the starter code runs as-is; replace with your logic
```

Now create a table named `words` in your database. This table stores, for each word that has been processed by your unigram job, the date on which the processing occurred.
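Creating the table and recording a word with boto3 could look roughly like this. It is a sketch under assumptions, not the notebook tutorial's method: the attribute names `word` and `processed_date`, the on-demand billing mode, and all function names are hypothetical choices. The boto3 calls need configured AWS credentials.

```python
from datetime import date


def word_item(word, processed_on=None):
    """Build the item stored for one processed word."""
    processed_on = processed_on or date.today()
    return {"word": word, "processed_date": processed_on.isoformat()}


def create_words_table():
    """Create the `words` table keyed on the word itself."""
    import boto3  # imported lazily; requires AWS credentials

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.create_table(
        TableName="words",
        KeySchema=[{"AttributeName": "word", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "word", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",  # assumption: on-demand billing
    )
    table.wait_until_exists()
    return table


def record_word(table, word):
    """Store the word together with today's date."""
    table.put_item(Item=word_item(word))
```

Storing the date as an ISO string keeps the items sortable and easy to read back.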

**Questions:**
1. Modify your script to add the word along with the date to this table for each word that is processed.
2. Can you perform a join operation of this new table with the unigrams table? Describe how you would perform this join. How would you perform the join if, instead of a NoSQL DB, you were using a relational database?
3. Name one advantage and one disadvantage each of relational and NoSQL databases.
4. For the unigram application that you wrote, which database is more suitable? Explain.

**REMINDER: DO NOT FORGET TO TERMINATE ALL EC2 INSTANCES OR YOU WILL INCUR CHARGES TO YOUR ACCOUNT. ALSO, DELETE ALL TABLES ONCE YOU ARE DONE.**

### Summary of Requirements for Submission
   + The `mapper.py`, `reducer.py`, and `test_output` files, and a file containing the output downloaded from step 6.
   + Screenshots (3) showing: commands for step 3 and the console output result from the reducer (Just the first ~25-35 lines), the files in the S3 bucket, the summary screen after the cluster finishes
   + The figure generated in step 6.