<a href="https://colab.research.google.com/github/rzl-ds/gu511_hw/blob/master/hw12.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

# Exercises due by EOD <span style="color:red">2020.12.08</span>

## goal

in this homework assignment we give `spark` (and, optionally, `hadoop`) a go, and do our penultimate `git` exercise of the year.

note: this homework assignment is a bit smaller and easier than others. don't let the number of exercises or points fool you; most are walkthroughs or asking for very simple `sh` scripts. there is a section of optional exercises around `hadoop` that are worth looking at if you're particularly interested in developing solutions using `hadoop`

## method of delivery

*as mentioned in our first lecture, the method of delivery may change from assignment to assignment. we will include this section in every assignment to provide an overview of how we expect homework results to be submitted, and to provide background notes or explanations for "new" delivery concepts or methods.*

this week you will be submitting the results of your homework via upload to your `s3` homework submission bucket, via `commit`s `push`ed to / actions on `github`, and via `google` forms

summary:

| exercise | deliverable | method of delivery | points |
|----------|-------------|--------------------|--------|
| 1 | a file `download_usaspending_sample.sh` | uploaded to your `s3` homework submission bucket | 10 |
| 2 | none | none | 10 |
| 3 | a commit which resolves a `github` issue and a `merge`d pull request | will be seen on `github` | 10 |
| 4 | an optional survey | I will receive anonymized answers | 5 |

total points: 35

<div style="border: 1px solid lightgrey;">

# exercise 1: downloading a sample `usaspending` dataset

## 1.1: background

[USA Spending](https://www.usaspending.gov/#/about) is an excellent open data source which offers details on government expenditures (e.g. IT accessory purchases, contract work). the datasets are freely available for download, either as one-off files (see https://www.usaspending.gov/#/download_center/award_data_archive), or as a `postgres` database (see [this page](https://files.usaspending.gov/database_download/)).

let's get a file to work with. we want one that is small enough that we can still work with it on our local laptops, or in development environments like our `docker` containers. the entire dataset is around 90GB, so it is something we would probably want to work with in a database or a distributed environment like `hadoop` or `spark`

## 1.2: downloading

on [the award data archive page](https://www.usaspending.gov/download_center/award_data_archive), make sure the "Fiscal Year" dropdown is set to 2020 (not the default 2021). the top line in the table of links will be something like `FY2020_All_Contracts_Full_YYYYMMDD.zip`, where (as of writing) the `YYYYMMDD` "as of" date is `20201108` -- that date will change as the underlying datasets are modified and updated.

as of writing, the 2020 contracts zip file `url` was: https://files.usaspending.gov/award_data_archive/FY2020_All_Contracts_Full_20201108.zip
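since the "as of" date in that `url` keeps moving, it can help to build the `url` from its parts. the following is a minimal `python` sketch, assuming every fiscal year follows the same naming pattern as the one `url` shown above (I have only this one example to go on). the helper name `contracts_archive_url` is hypothetical.

```python
# root taken from the url quoted above
ARCHIVE_ROOT = "https://files.usaspending.gov/award_data_archive"


def contracts_archive_url(fiscal_year: int, as_of: str) -> str:
    """build the zip url for one fiscal year's full contracts archive.

    hypothetical helper: the naming pattern is inferred from a single url,
    so double check it against the award data archive page.
    """
    return f"{ARCHIVE_ROOT}/FY{fiscal_year}_All_Contracts_Full_{as_of}.zip"


print(contracts_archive_url(2020, "20201108"))
```

if the date on the archive page has changed, only the second argument needs updating.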

1. write a simple `curl` or `wget` command to download this to a linux instance
1. write a simple `unzip` statement to unzip that file into a `csv`

create a file named `download_usaspending_sample.sh` and save those two lines into that file -- this is what we will eventually deliver. read on for more info about what these data files are and how we will use them, though!

## 1.3: record format

there is a complete data dictionary for this record online [here](https://www.usaspending.gov/#/download_center/data_dictionary), but the most important fields for our interest are:

+ unique record identifiers
    + `parent_award_id_piid`: "The identifier of the procurement award under which the specific award is issued (such as a Federal Supply Schedule). Term currently applies to procurement actions only"
    + `award_id_piid`: "The unique identifier of the specific award being reported."
    + `modification_number`: "The identifier of an action being reported that indicates the specific subsequent change to the initial award."
    + `transaction_number`: "Tie Breaker for legal, unique transactions that would otherwise have the same key."
+ other useful fields
    + `action_date`: "The date the action being reported was issued / signed by the Government or a binding agreement was reached."
    + `last_modified_date`: "The last modified date captures the change date."
    + `current_total_value_of_award`: "Total amount obligated to date on a contract, including the base and exercised options."
    + `awarding_agency_name`: "The name associated with a department or establishment of the Government as used in the Treasury Account Fund Symbol (TAFS)."
    + `recipient_name`: "The name of the awardee or recipient that relates to the unique identifier. For U.S. based companies, this name is what the business ordinarily files in formation documents with individual states (when required)."
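
one way to think about the "unique record identifiers" above is as a composite key. the sketch below is illustrative only: it assumes the four identifier fields together pin down one record, and the sample values are copied from the sample award used later in this exercise. `record_key` is a hypothetical helper name.

```python
# the four identifier fields listed above, in order
KEY_FIELDS = (
    "parent_award_id_piid",
    "award_id_piid",
    "modification_number",
    "transaction_number",
)


def record_key(record: dict) -> tuple:
    """return the tuple of identifier values for one record (hypothetical helper)."""
    return tuple(record[field] for field in KEY_FIELDS)


# one modification of the sample award; non-key fields are ignored by the key
record = {
    "parent_award_id_piid": "N6264917G0006",
    "award_id_piid": "N6264919F0572",
    "modification_number": "P00016",
    "transaction_number": 0.0,
    "recipient_name": "SUMITOMO HEAVY INDUSTRIES, LTD.",
}
print(record_key(record))
```

dropping the last two fields from `KEY_FIELDS` gives the coarser "one row per award" key that the filtering below groups on.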

## 1.4: a simple in-memory filter

this dataset includes multiple snapshots through time of expenditures -- as contracts get updated, we get new *modifications* and new `modification_number` values. for example, given the `csv` downloaded in part 1.2 above:

```python
import pandas as pd

df = pd.read_csv('FY2020_All_Contracts_Full_20201108_6.csv')

# a special award with multiple modifications
multi_mods = df[(df.parent_award_id_piid == 'N6264917G0006')
                & (df.award_id_piid == 'N6264919F0572')]

(multi_mods
 [['parent_award_id_piid', 'award_id_piid', 'modification_number',
   'transaction_number', 'current_total_value_of_award', 
   'action_date', 'last_modified_date', 'awarding_agency_name',
   'recipient_name', ]]
 .sort_values(by='last_modified_date'))
```

this will result in the following table:

| (index) | parent_award_id_piid | award_id_piid | modification_number | transaction_number | current_total_value_of_award | action_date | last_modified_date | awarding_agency_name | recipient_name |
|-|-|-|-|-|-|-|-|-|-|
| 349170 | N6264917G0006 | N6264919F0572 | P00016 | 0.0 | 1694963.50 | 2019-10-10 | 2019-10-09 19:41:15 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 348893 | N6264917G0006 | N6264919F0572 | P00017 | 0.0 | 1694963.50 | 2019-10-15 | 2019-10-15 03:44:30 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 348860 | N6264917G0006 | N6264919F0572 | P00023 | 0.0 | 1694963.50 | 2019-11-08 | 2019-11-08 00:47:08 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 348849 | N6264917G0006 | N6264919F0572 | P00021 | 0.0 | 1694963.50 | 2019-11-01 | 2019-11-08 02:10:45 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 348994 | N6264917G0006 | N6264919F0572 | P00020 | 0.0 | 1694963.50 | 2019-10-31 | 2019-11-08 02:11:15 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 348865 | N6264917G0006 | N6264919F0572 | P00026 | 0.0 | 1694963.50 | 2019-11-19 | 2019-11-18 22:56:09 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 349297 | N6264917G0006 | N6264919F0572 | P00028 | 0.0 | 1694963.50 | 2019-11-22 | 2019-11-22 02:16:01 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 349169 | N6264917G0006 | N6264919F0572 | P00031 | 0.0 | 1694963.50 | 2019-11-29 | 2019-11-29 00:44:08 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 349116 | N6264917G0006 | N6264919F0572 | P00035 | 0.0 | 1694963.50 | 2019-12-12 | 2019-12-11 22:22:21 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |
| 348874 | N6264917G0006 | N6264919F0572 | P00036 | 0.0 | 1694963.50 | 2019-12-16 | 2019-12-15 20:25:18 | DEPARTMENT OF DEFENSE (DOD) | SUMITOMO HEAVY INDUSTRIES, LTD. |

one thing we might care to do is filter down this larger dataset of all the updated versions of records into only the most recent value (as defined by `modification_number` and `transaction_number`, or `last_modified_date`).

we can do this with the following `python` code (assuming the dataframe has been loaded with `read_csv` as above):

```python
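# groupby silently drops rows whose key is NaN, so fill missing keys first
most_recent = (df
               .fillna({'parent_award_id_piid': 'paid_nan',
                        'award_id_piid': 'aip_nan'})
               # sort so the newest modification is last within each award
               .sort_values(by=['parent_award_id_piid', 'award_id_piid',
                                'last_modified_date'])
               .groupby(['parent_award_id_piid', 'award_id_piid'])
               .last())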
most_recent.shape
```

for the 6th of 6 files downloaded via the link https://files.usaspending.gov/award_data_archive/2019_all_Contracts_Full_20191108.zip, we had

+ 565,306 records in our *full*, unmodified `df` created with the `read_csv` function as above
+ 527,441 most-recent records in our `most_recent` dataframe created as above
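
to see what the sort-then-keep-last logic does without loading the full `csv`, here is a tiny standard-library sketch. three rows are copied from the sample award table above; the fourth is a made-up award added only so there is more than one group. it is a simplification: `pandas`' `.last()` takes the last non-null value *per column*, which this whole-row version does not reproduce.

```python
rows = [
    # (parent_award_id_piid, award_id_piid, modification_number, last_modified_date)
    ("N6264917G0006", "N6264919F0572", "P00036", "2019-12-15 20:25:18"),
    ("N6264917G0006", "N6264919F0572", "P00016", "2019-10-09 19:41:15"),
    ("N6264917G0006", "N6264919F0572", "P00023", "2019-11-08 00:47:08"),
    # hypothetical second award, not from the real dataset
    ("PARENT_X", "AWARD_Y", "0", "2019-10-01 00:00:00"),
]

most_recent = {}
# sort ascending by (keys, date); iso-formatted date strings sort chronologically.
# later rows overwrite earlier ones, so each key ends up holding its last row.
for row in sorted(rows, key=lambda r: (r[0], r[1], r[3])):
    most_recent[(row[0], row[1])] = row

for key, row in most_recent.items():
    print(key, "->", row[2], row[3])
```

four input rows collapse to two output rows, one per award, which is the same kind of reduction as the record counts above.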

## 1.5: what to submit

this one's simple: take the `curl` or `wget` statement (line 1) and the `unzip` statement (line 2) that you used in part 2 and write those two lines to a file named `download_usaspending_sample.sh`


##### upload `download_usaspending_sample.sh` to your s3 homework submission bucket

<div style="border: 1px solid lightgrey;">

# exercise 2: `pyspark` filter job for `usaspending` dataset

let's duplicate our `pandas` work above with `pyspark`. we will follow best practices and use `dataframes`, which means we need an environment that has `pyspark` version 2.x.x -- that's *not* our `docker` container, unfortunately.

we'll use our `databricks` CE instances instead!

## 2.1: log in to your `databricks` CE cluster

to log in, go [here](https://community.cloud.databricks.com/). if you missed the instructions on how to sign up for a CE cluster, review them in the `016_spark.ipynb` lecture.

## 2.2: create a notebook

in `databricks`, create a new notebook named `usaspending`

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1zTga64dBug5OjmRGWkLzmb1kefrFr-am"></div>

## 2.3: getting the data (again)

prefix the first cell of your new `usaspending` notebook with the `sh` magic prefix, and then write the two `bash` commands (`wget`/`curl`, then `unzip`) that were your answer for the previous exercise. the cell will look like

```sh
%sh
wget ...
unzip ...
```

run that cell (attach to a cluster if asked)

## 2.4: loading a `csv`

we learned in the `spark` lecture that you can easily load `csv`s as `dataframe`s with the `spark.read.csv` function. do that!

```python
df = spark.read.csv('file:///databricks/driver/FY2020_All_Contracts_Full_20201108_*',
                    header=True)
```

note that we provided a `glob` (it has the `*` to match all the `_N.csv` values), so this will read **all** the `csv`s -- not just one of them, like in the `pandas` example in the previous exercise. pretty cool!
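
if you want to check which file names a pattern like that would pick up, here is a rough sketch using `python`'s `fnmatch` module. this is an approximation: `spark` resolves the pattern with its own file system globbing, not with `fnmatch`, and the directory listing below is hypothetical.

```python
from fnmatch import fnmatchcase

pattern = "FY2020_All_Contracts_Full_20201108_*"

# hypothetical listing; the real driver directory will hold other files too
names = [
    "FY2020_All_Contracts_Full_20201108_1.csv",
    "FY2020_All_Contracts_Full_20201108_6.csv",
    "FY2020_All_Contracts_Full_20201108.zip",
]

# keep only the names the pattern matches (case-sensitive)
matched = [name for name in names if fnmatchcase(name, pattern)]
print(matched)
```

the `zip` file is left out because the pattern requires an underscore right after the date.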

when we were working with only one file, we had 565,306 records. how many do we have now that we're looking at all the files?

```python
# at time of writing, this returned 5,565,306
df.count()
```

## 2.5: create a `Window` object

the way that we will do our group by and sort is with a `Window` object (like the window operations in traditional `sql`). we do this in three steps. the first two `python` cells below are incomplete -- I'm walking you through building the final cell up line-by-line:

first, create a `spark` `Window` object:

```python
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

w = (Window
     ...)
```

second, we will update that `Window` to partition all of our records by the keys we care about ("partition" is synonymous with "group by" here):

```python
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

w = (Window
     .partitionBy('parent_award_id_piid', 'award_id_piid')
     ...)
```

finally, our `Window` object which partitions records by our favored keys should be *ordered*, so that we know our individual groups are sorted. we'll use the updated `Window` object's `.orderBy` method, and we will order based on the `col`umn `last_modified_date`, descending (so largest / most recent value on top)

```python
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

w = (Window
     .partitionBy('parent_award_id_piid', 'award_id_piid')
     .orderBy(col('last_modified_date').desc()))
```

add the above as a new cell in your notebook

## 2.6: using that `Window` object to calculate order within groups

the `Window` object we just created is something that can be used to calculate aggregation functions within the defined windows. for example, we could calculate the `sum` or `average` of different values within the `groups` defined by the above `partitionBy` partitions.

in this instance, we're looking to find the first record within each group (we have descending order), and we can do this by calculating the `row_number`. we will add this as a new column called `rn` to a dataframe, and save the result to a new dataframe of most recent records

```python
most_recent = (df
               .withColumn("rn", row_number().over(w))
               ...)
```

after adding that column, we will filter the overall dataset down to records where the `rn` value is 1 (the first row in the window group, which is the item with the largest `last_modified_date` value)

```python
most_recent = (df
               .withColumn("rn", row_number().over(w))
               .where(col("rn") == 1))
```

finally, the above plan will involve a wide transformation (the window partitioning in the window functions will require records to be shuffled around), so it will be good to `.cache` it

```python
most_recent = (df
               .withColumn("rn", row_number().over(w))
               .where(col("rn") == 1))

# semi-colon just suppresses the output
most_recent.cache();
```

add the above as the next cell in your notebook.
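
for comparison with traditional `sql`, the same partition / order / `row_number` / filter pattern can be tried on a toy table with `python`'s built-in `sqlite3` module. this is only a sketch: it assumes your `python` ships with SQLite 3.25 or newer (the first version with window functions), the table name `awards` is made up, and only the first three rows come from the sample award in exercise 1.

```python
import sqlite3

rows = [
    ("N6264917G0006", "N6264919F0572", "P00016", "2019-10-09 19:41:15"),
    ("N6264917G0006", "N6264919F0572", "P00036", "2019-12-15 20:25:18"),
    ("N6264917G0006", "N6264919F0572", "P00023", "2019-11-08 00:47:08"),
    # hypothetical second award so there are two partitions
    ("PARENT_X", "AWARD_Y", "0", "2019-10-01 00:00:00"),
]

con = sqlite3.connect(":memory:")
con.execute("""
    create table awards (
        parent_award_id_piid text,
        award_id_piid text,
        modification_number text,
        last_modified_date text
    )
""")
con.executemany("insert into awards values (?, ?, ?, ?)", rows)

# number the rows in each partition newest-first, then keep row 1 of each
query = """
    select parent_award_id_piid, award_id_piid, modification_number
    from (
        select *,
               row_number() over (
                   partition by parent_award_id_piid, award_id_piid
                   order by last_modified_date desc
               ) as rn
        from awards
    )
    where rn = 1
    order by award_id_piid
"""
most_recent = con.execute(query).fetchall()
print(most_recent)
```

the `partition by` and `order by ... desc` clauses play the roles of `.partitionBy` and `.orderBy(col(...).desc())`, and `where rn = 1` is the `.where(col("rn") == 1)` filter.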

if you want to take a peek at that dataframe as we have defined it, pass it to `display` as another cell:

```python
# for me, this took almost 9 minutes
display(most_recent)
```

we can again look at the number of records after filtering down to only the most recent transactions (i.e. the number of contracts) by running

```python
# at time of writing, this returned: 4,897,791
# for me, this took about 17 minutes
most_recent.count()
```

## 2.7: save the results

now that we've filtered our dataset, we can immediately save the results for other downstream computation. each dataframe has a `.write.parquet` method associated with it, and the first argument is the directory in `dbfs` in which we will save the output.

note:

1. the output is not a single `parquet` or `csv` file, but rather a directory containing a collection of files, each being the output of a chunk of records
1. I tend to put a `.mode('overwrite')` option on my `pyspark` `.write` calls even if it's the first time I'm writing things because I want to be able to overwrite things. of course, **if you don't want to overwrite things this is a bad idea!**

```python
out_dir = '/data/usaspending_most_recent/'

# this took about 3.75 minutes
(most_recent.write
 .mode('overwrite')
 .parquet(out_dir))
```

add this cell and run it to output your file. reading the result back in demonstrates both that this command worked and how much faster it is to perform actions (e.g. `display()`, `.count()`) on the `dataframe` we `read` from disk, since `spark` no longer has to do the partitioning and shuffling:

```python
most_recent_from_file = spark.read.parquet(out_dir)

# for me, this went from 9 minutes (before) down to 17 seconds
display(most_recent_from_file)
```

```python
# for me, this went from 17 minutes (before) down to 26.5 seconds
most_recent_from_file.count()
```
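
to put those timings in perspective, here is the arithmetic as a quick sketch. the numbers are the rough ones from the comments above ("almost 9 minutes", "about 17 minutes"), so treat the results as ballpark figures; your own cluster will differ.

```python
timings = {
    # action: (seconds before, seconds after reading from parquet)
    "display": (9 * 60, 17),
    "count": (17 * 60, 26.5),
}

# speedup = time before / time after
speedups = {name: before / after for name, (before, after) in timings.items()}

for name, speedup in speedups.items():
    print(f"{name}: about {speedup:.0f}x faster")
```

both actions come out somewhere between 30x and 40x faster once the shuffle has been paid for and the result saved.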

## 2.8: the whole thing, together

I like to keep everything broken up into cells, but if you don't, here's the whole flow:

```python
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window

df = spark.read.csv('file:///databricks/driver/FY2020_All_Contracts_Full_20201108_*',
                    header=True)
print(df.count())

w = (Window
     .partitionBy('parent_award_id_piid', 'award_id_piid')
     .orderBy(col('last_modified_date').desc()))

most_recent = (df
               .withColumn("rn", row_number().over(w))
               .where(col("rn") == 1))
most_recent.cache();

# display only renders if it's the last thing run in a cell
# display(most_recent)

print(most_recent.count())

out_dir = '/data/usaspending_most_recent/'
(most_recent.write
 .mode('overwrite')
 .parquet(out_dir))

most_recent_from_file = spark.read.parquet(out_dir)

print(most_recent_from_file.count())

display(most_recent_from_file)
```

<div style="border: 1px solid lightgrey;">

# exercise 3: resolving an issue with a `commit` to a `branch`

recall from a previous assignment that `github` -- *not* `git` itself -- allows you to create issues to track bugs and feature requests, and to close issues with commits. previously we closed issues by pushing changes to `master`; now we will resolve them by pushing changes to a `branch` and using the `github` web interface to close them via a `pull request`.

I have already added an issue to your repositories requesting a simple change be made. the issue's title is **die, 511 homeworks!**, and the goal is to create a new file called `status_report.txt` which logs that we've finished our homework for MATH 511.

## 3.1: viewing and assigning the issue

log in to `github` and click on the "issues" tab, and open the issue I created for you. in particular, I want you to **assign** it to yourself -- click on the "assign yourself" link on the issues page.

also, **make note of the issue number!** you will need to include that number in your `commit` `message`

## 3.2: creating a `branch`

in your local repo, create a new `branch` named `statusreport` and check that `branch` out

## 3.3: adding a status report

create a new file named `status_report.txt` with only one line: `homework is finished`.

`add` that file to tracking and `commit` it to your `statusreport` `branch`. for your `commit` message, write the following -- ***replace #N with YOUR issue number!!!***

```
status report: initial commit, fixes #N
```

***replace #N with YOUR issue number!!!***

`push` this `commit` to `github`. remember: you don't push `branch`es to `origin master`, you push them to `origin [branch name]`! so here we `push` to `origin statusreport`.
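
because the issue only gets closed if the `commit` message contains a closing keyword followed by a real issue number, it can be worth sanity checking your message before you `commit`. the sketch below is my own approximation of that matching, not `github`'s actual implementation; the keyword list (close / fix / resolve and their variants) is the documented one, while `closed_issues` is a hypothetical helper.

```python
import re

# closing keywords: close(s/d), fix(es/ed), resolve(s/d), followed by "#<digits>"
CLOSING = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def closed_issues(message: str) -> list:
    """return the issue numbers this commit message would (probably) close."""
    return [int(number) for number in CLOSING.findall(message)]


print(closed_issues("status report: initial commit, fixes #5"))
print(closed_issues("status report: initial commit, fixes #N"))
```

an empty list for the second message is the reminder: a literal `#N` closes nothing.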

## 3.4: check out the new `pull request` on `github`

log in to `github`. `github` noticed the `branch` you just pushed (whose `commit` message references an issue, e.g. `"fixes #5"`) and offers you a shortcut for creating a `pull request` from it

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=14RoWxzI2q5nUwRquQdrTxw9nyUiI7LCt" width="700px"></div>

click on the "compare & pull request" button to go to the "Open a pull request" page that `github` has generated for us:

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1WQRvvuQlAn5bnYKcFne_Rdu1hO-x3kcn" width="700px"></div>

this page is a form that will allow us to create a "pull request" -- a `github` concept that is, basically, a web-based merge of `branch`es resolving `issue`s. click the "create pull request" button to create an official pull request.

the page we are looking at will then update to a pull request. `github` will first calculate whether or not the changes in this `branch` can be directly merged into the `master` branch. if so, it will give us the ability to do the `merge` through the web console rather than the command line (pretty cool!)

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1BEUglXC0bjFLlCvziqNYTwNSapjmiq1f" width="700px"></div>

## 3.5: merge!

click that "Merge pull request" button and let 'er rip! the result is a successful pull request, and the webpage will update in place:

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1KKGex6RXsED75XWsNrO5q6RXydnjVs-9" width="700px"></div>

## 3.6: view the new `merge` `commit` on the command line

the changes we just made to `master` in that pull request were all done on `github`. our `remote` knows about them, but our `local` doesn't yet!

back in your local repo on your laptop or `ec2`, `checkout` `master` and `pull` the `remote` changes. you should see that the merge commit we just created entirely on `github` now appears in your local repo as a merge of your `statusreport` `branch` with `master`, like this:

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1GDsYUK8vQg-c_UE9MAgbPj7C-y2_jA2s" width="700px"></div>


##### submission will be verified via `github`

<div style="border: 1px solid lightgrey;">

# exercise 4: fill out a year-end course survey

I'd like your feedback on the course -- please fill out the form at https://forms.gle/Jf6iukdYkpYWLxbg7. this is 100% anonymous and not mandatory

<div style="border: 1px solid lightgrey;">

# <span style="color:red;font-weight:bold">EVERYTHING BELOW HERE IS OPTIONAL</span>

# <span style="color:red;font-weight:bold">EVERYTHING BELOW HERE IS OPTIONAL</span>

# <span style="color:red;font-weight:bold">EVERYTHING BELOW HERE IS OPTIONAL</span>

# <span style="color:red;font-weight:bold">EVERYTHING BELOW HERE IS OPTIONAL</span>

<div style="border: 1px solid lightgrey;">

# <span style="color:red;font-weight:bold">[OPTIONAL]</span> set up a virtual development environment

the appendix of the `015_hadoop.ipynb` lecture includes several ways of creating a `hadoop` development environment. as much as possible, we will try to execute our `hadoop` commands in a local development environment (using `docker` or virtual machines) until we know they work; then we can pay the big bucks for an `emr` environment when needed.

we are going to follow the `cloudera` `hadoop` distribution docker container quickstart method, with instructions outlined below. do all of the below **on your laptop**

+ ***IF YOU ARE ON WINDOWS, specifically Windows 10 Home edition***: shoot me an email. you will have to use an `emr` cluster as this walkthrough will not work for you
+ otherwise: if you run into issues, please check the lecture appendix first, and then reach out to us.


## .1: updating our `docker` vm memory size

these images require a lot of memory, and the default amount of memory set aside for `docker` `containers` is 2GB

if you are using a windows or mac computer, increase your vm memory as discussed [here](https://stackoverflow.com/questions/44533319/how-to-assign-more-memory-to-docker-container). I recommend `8 GB` if your computer has more than `8GB` of memory, otherwise `4 GB`.

for linux, just increase memory by invoking `docker run` (when you do) with the command line flag `--memory=8g`


## .2: finding good local ports

let's create a `docker` `container` running the `cloudera` `hadoop` distribution (version 5.7 quickstart method). the main thing to figure out is what local ports we can map to the `container`'s internal ports. go to each of the following and see if any web app is running there:

+ port `8888`: http://localhost:8888
+ port `7180`: http://localhost:7180
+ port `8880`: http://localhost:8880

hopefully there are no applications running on those ports, and what we see at every url is

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=13Neuqun5a7GZQRpnWmZ5U6Yeh-5V3hGr"></div>

if you see an application (especially `jupyter` running on port `8888`, `jupyter`'s default port) simply take the "problem" port and add 1 and try again (e.g. is anything running on http://localhost:8889? how about http://localhost:8890?). keep going until you have a series of unoccupied ports
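
if clicking through urls gets tedious, the "add 1 and try again" search can be sketched in `python`. this is a minimal version under a simplifying assumption: a port counts as occupied if something accepts a tcp connection on `127.0.0.1`. both helper names are hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """true if something accepts tcp connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising
        return s.connect_ex((host, port)) == 0


def next_free_port(start: int, max_tries: int = 100) -> int:
    """walk upward from start until we find a port nothing is listening on."""
    for port in range(start, min(start + max_tries, 65536)):
        if not port_in_use(port):
            return port
    raise RuntimeError(f"no free port found starting from {start}")


# the three default ports from the list above
for wanted in (8888, 7180, 8880):
    print(wanted, "->", next_free_port(wanted))
```

the printed right-hand values are candidates for the host ports in the next step; make sure the three you pick are distinct.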


## .3: running the `docker` `image`

take those unoccupied ports and fill them in below, then run the commands from the command line

```sh
HOST_PORT_HUE=9999
HOST_PORT_CLOUDERA_MANAGER=7180
HOST_PORT_TUTORIAL=8880

# if copy-paste doesn't work, try the one-line version below
docker run \
    --hostname=quickstart.cloudera \
    --privileged=true \
    --rm \
    -it \
    -p $HOST_PORT_HUE:8888 \
    -p $HOST_PORT_CLOUDERA_MANAGER:7180 \
    -p $HOST_PORT_TUTORIAL:80 \
    cloudera/quickstart:latest \
    /usr/bin/docker-quickstart
```

if the copy-paste didn't work, here's one that's all on one line (but less readable!)

```sh
# one-line version
docker run --hostname=quickstart.cloudera --privileged=true --rm -it -p $HOST_PORT_HUE:8888 -p $HOST_PORT_CLOUDERA_MANAGER:7180 -p $HOST_PORT_TUTORIAL:80 cloudera/quickstart:latest /usr/bin/docker-quickstart
```

this is a *very* large container, so pulling will take a long time. if, along the way, you get an error like

```sh
failed to register layer: Error processing tar file(exit status 1): write /var/lib/hadoop-hdfs/cache/hdfs/dfs/data/current/BP-1120155954-10.0.0.1-1459909528739/current/finalized/subdir0/subdir2/blk_1073742378: no space left on device
```

you will need to increase the disk size the same way you (might have) increased the memory size up above.


if it all works, though, this should download the `cloudera` `quickstart` `docker` `image` and then kick off a long stream of startup scripts. you will know you are done when the terminal looks like this:

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1TzvLGbGkvJlID19stE8NhI17aaLTP3a9"></div>

you may have a line that reads

```
Starting hue:                                              [FAILED]
```

this is (surprisingly!) okay. you should be able to execute `hadoop` commands inside this `container` now:

```sh
hadoop fs -ls /
```

you can also verify that it all worked by going to `http://localhost:8888`, where `8888` is replaced with whatever port you used up above for `HOST_PORT_HUE`. wait for possibly a *long* time (like, 2 minutes) and see if there is a web application `hue` running there:

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1zbnV1GmCBHx6XLGur17HLS3MVg4w6ReI"></div>

the username and password for this application are both `cloudera`

you're good to go!

when you're done with the `container` (after this assignment) exit the terminal with `exit`

##### this exercise is optional and ungraded; nothing to submit

# set up a virtual development environment <span style="color:red;font-weight:bold">ANSWERS</span>

follow the instructions in the lecture materials

<div style="border: 1px solid lightgrey;">

# <span style="color:red;font-weight:bold">[OPTIONAL]</span> re-learning `bash` for `hdfs`

## .1: setup

do the following in your *cloudera development environment*!

let's create some gibberish files on our local file system for use in interacting with the `hdfs`. simply execute the following code to create a number of local files we can reference in our commands.

*note: you may need to install `wget` with `yum install wget`*

```bash
mkdir /tmp/cachenet
cd /tmp/cachenet
wget -x www.google.com
wget -x www.nytimes.com
wget -x www.twitter.com/i/moments
wget -x www.twitter.com/i/notifications
wget -x www.facebook.com
wget -x www.youtube.com
wget -x www.youtube.com/feed/trending
wget -x www.espn.com
wget -x en.wikipedia.org
wget -x en.wikipedia.org/wiki/L33T
wget -x www.reddit.com
wget -x www.reddit.com/r/all
wget -x www.reddit.com/r/datascience
wget -x www.reddit.com/r/python
cd ~
```


## .2: commands

in the first column of the table below I have described some pretty common filesystem operations, and listed in the second column the `bash` commands to execute them. fill in the comparable `hadoop` commands (i.e. those you enter at a `bash` prompt) to perform the same general tasks on the `hdfs`. save the resulting three-column table as a `csv` with name `super_hadooper.csv`


### .2.1: quick note: running as an `hdfs` super user

some commands will require you to be running as the `hdfs` user instead of the `root` user -- this is a file permission control built in to `hadoop`. some distributions have the main `hadoop` user named `hadoop`; others named `hdfs`. `cloudera` has chosen `hdfs`.

for example, `chown` is one of those commands. if you ran `chown` as a user other than `hdfs`, you would receive an error

```sh
chown: changing ownership of '/this/path': Non-super user cannot change owner
```

this means you *must* execute that as the `hdfs` super user. you can do this by prefixing your commands with `sudo -u hdfs`, which will tell your local `bash` shell to execute the command as the `hdfs` linux user account (which is the `hdfs` super user). given a hadoop command you know, you can execute that same command by prefacing it like:

```
sudo -u hdfs hadoop [HADOOP CMD HERE]
```

e.g. `hadoop fs -ls /` would turn into `sudo -u hdfs hadoop fs -ls /`.

only use this `sudo -u hdfs` prefix if you are getting a `Non-super user cannot...` error!
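
if you end up scripting these commands from `python` (say, with `subprocess`), the prefixing rule is easy to express as a function. this is just an illustrative sketch; `as_hdfs_superuser` is a hypothetical helper and nothing in the homework requires it.

```python
import shlex

# tells the local shell to run the command as the hdfs linux user
HDFS_SUPERUSER_PREFIX = ["sudo", "-u", "hdfs"]


def as_hdfs_superuser(command: str) -> list:
    """split a hadoop command line and prefix it to run as the hdfs user."""
    return HDFS_SUPERUSER_PREFIX + shlex.split(command)


argv = as_hdfs_superuser("hadoop fs -ls /")
print(" ".join(argv))
```

the returned list is in the form `subprocess.run` expects, so no extra shell quoting is needed.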


### .2.2: read the manual!

feel free to check out the `hadoop fs` documentation. make sure you get the right version! check your `hadoop` version with `hadoop version`. some of the versions we have access to:

+ [the `cloudera` `docker` container: 2.6.0](http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/FileSystemShell.html)
+ [the `aws` `emr` instance: 2.8.5](http://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-common/FileSystemShell.html)


### .2.3: the command table

| task description | bash command | `hadoop` command |
|-|-|-|
| ask for help re: the `ls` command | `ls --help` | |
| make a directory named `/tmp` | `mkdir -p /tmp` | |
| create an empty file `/tmp/test` | `touch /tmp/test` | |
| list the contents of `/tmp` | `ls /tmp` | |
| add group write permission for `/tmp/test` | `chmod g+w /tmp/test` | |
| change the owner of `/tmp/test` to `hadoop` | `chown hadoop /tmp/test` | |
| remove a file `/tmp/test` which you don't own | `sudo rm /tmp/test` | |
| copy local `/tmp/cachenet` to `hdfs` | no meaningful analog | |
| print the final lines of `/tmp/cachenet/www.nytimes.com/index.html` | `tail /tmp/cachenet/www.nytimes.com/index.html` | |
| echo `hello world` to `/tmp/hello_world.txt` | `echo hello world > /tmp/hello_world.txt` | |
| print `/tmp/hello_world.txt` to `stdout` | `cat /tmp/hello_world.txt` | |
| get the free and used space | `df -h` | |
| count the files and dirs under `/tmp` | no equivalent! | |
| remove `/tmp/hello_world.txt` | `rm /tmp/hello_world.txt` | |
| recursively remove directory `/tmp/cachenet` | `rm -r /tmp/cachenet` | |


## .3: submitting

##### this exercise is optional and ungraded; nothing to submit

# re-learning `bash` for `hdfs` <span style="color:red;font-weight:bold">ANSWERS</span>

| task description | bash command | `hadoop` command |
|-|-|-|
| ask for help re: the `ls` command | `ls --help` | `hadoop fs -help ls` |
| make a directory named `/tmp` | `mkdir -p /tmp` | `hadoop fs -mkdir -p /tmp` |
| create an empty file `/tmp/test` | `touch /tmp/test` | `hadoop fs -touchz /tmp/test` |
| list the contents of `/tmp` | `ls /tmp` | `hadoop fs -ls /tmp` |
| add group write permission for `/tmp/test` | `chmod g+w /tmp/test` | `hadoop fs -chmod g+w /tmp/test` |
| change the owner of `/tmp/test` to `hadoop` | `chown hadoop /tmp/test` | `sudo -u hdfs hadoop fs -chown hadoop /tmp/test` |
| remove a file `/tmp/test` which you don't own | `sudo rm /tmp/test` | `sudo -u hdfs hadoop fs -rm /tmp/test` |
| copy local `/tmp/cachenet` to `hdfs` | no meaningful analog | `hadoop fs -put /tmp/cachenet /tmp` |
| print the final lines of `/tmp/cachenet/www.nytimes.com/index.html` | `tail /tmp/cachenet/www.nytimes.com/index.html` | `hadoop fs -tail /tmp/cachenet/www.nytimes.com/index.html` |
| echo `hello world` to `/tmp/hello_world.txt` | `echo hello world > /tmp/hello_world.txt` | `echo hello world \| hadoop fs -put - /tmp/hello_world.txt` |
| print `/tmp/hello_world.txt` to `stdout` | `cat /tmp/hello_world.txt` | `hadoop fs -cat /tmp/hello_world.txt` |
| get the free and used space | `df -h` | `hadoop fs -df -h` |
| count the files and dirs under `/tmp` | no easy analog | `hadoop fs -count /tmp/` |
| remove `/tmp/hello_world.txt` | `rm /tmp/hello_world.txt` | `hadoop fs -rm /tmp/hello_world.txt` |
| recursively remove directory `/tmp/cachenet` | `rm -r /tmp/cachenet` | `hadoop fs -rm -r /tmp/cachenet` |
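
most rows in that table follow one pattern: take the `bash` command and put `hadoop fs -` in front of it. here is a small sketch of that rule of thumb, with two rows from the table that break it listed as exceptions. it is a memory aid, not a real translator, and `hadoop_equivalent` is a hypothetical helper.

```python
# rows from the answers table that do NOT follow the simple pattern
EXCEPTIONS = {
    "ls --help": "hadoop fs -help ls",
    "touch /tmp/test": "hadoop fs -touchz /tmp/test",
}


def hadoop_equivalent(bash_command: str) -> str:
    """guess the hadoop fs command for a simple bash file command."""
    if bash_command in EXCEPTIONS:
        return EXCEPTIONS[bash_command]
    # the common case: same command name and arguments behind "hadoop fs -"
    return "hadoop fs -" + bash_command


for command in ("ls /tmp", "chmod g+w /tmp/test", "touch /tmp/test"):
    print(f"{command!r:28} -> {hadoop_equivalent(command)}")
```

rows with no real `bash` analog (like `-put` and `-count`) are outside what this rule can cover.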

<div style="border: 1px solid lightgrey;">

# <span style="color:red;font-weight:bold">[OPTIONAL]</span> loading `usaspending` dataset into `hdfs`

**inside the development environment you created above**, let's use the commands from exercise 1 to download the 2020 contracts `zip` file and `unzip` it to get the `csv`s inside that development environment container.

## .1: preparing to download

in your `docker` container you will need to run the following commands

```sh
yum install -y ca-certificates wget curl unzip
```

## .2: running the commands to download

execute the two lines you used in your `download_usaspending_sample.sh` to download and unzip the usaspending `csv` files into your `docker` development container.

*note: if your `docker` container complains about certificates and suggests you add `--no-check-certificate`, do it! it's okay*

## .3: `put`ting files in `hdfs`

let's load the 6th `csv` file (`FY2020_All_Contracts_Full_20201108_6.csv`) into `hdfs`

+ with one `hadoop` command, create a directory `/data` you can use to store data
+ with a second, `put` your `csv` into `hdfs` in your development environment at path `/data/usaspending.csv`

you can verify that the commands you wrote worked if you are able to run

```sh
hadoop fs -ls /data/
```

and the command output reads

```sh
Found 1 items
-rw-r--r--   1 root supergroup 1207115056 2019-12-05 03:18 /data/usaspending.csv
```
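
if you are curious what the columns in that listing mean, here is a small sketch that splits the sample line into its fields (permissions, replication factor, owner, group, size in bytes, modification date and time, path). it assumes the path contains no spaces; your own size and timestamp will differ from the sample.

```python
# the sample output line from above
line = "-rw-r--r--   1 root supergroup 1207115056 2019-12-05 03:18 /data/usaspending.csv"

# whitespace-split into the eight columns of a `hadoop fs -ls` file entry
permissions, replication, owner, group, size, date, time, path = line.split()

size_gib = int(size) / 2**30
print(f"{path}: {size_gib:.2f} GiB, owned by {owner}, replication {replication}")
```

a file a bit over 1 GiB is about what you should expect for a single one of the unzipped `csv`s.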

write the statements you used to create the `/data` directory and `put` that file into `hdfs` to a file named `load_usaspending_to_hdfs.sh`


##### this exercise is optional and ungraded; nothing to submit

# loading `usaspending` dataset into `hdfs` <span style="color:red;font-weight:bold">ANSWERS</span>

```sh
hadoop fs -mkdir /data
hadoop fs -put FY2020_All_Contracts_Full_20201108_6.csv /data/usaspending.csv
```

<div style="border: 1px solid lightgrey;">