# Answers for Exercises due by EOD 2017.09.21

## exercise 1: creating a useful bash script 

both of the following will be graded equivalently, so choose based on your familiarity with linux or desire for a challenge

### exercise 1.A: creating a "useful" bash script (linux beginners)

we're going to write a bash script that will download current weather information at DCA (Reagan National Airport). we'll do this in stages:

1. create a directory to hold our data
2. download the current weather and delay status for DCA (Reagan Washington National airport)
3. print a status message indicating whether or not we were successful to a log file

to create this script, we will move one step at a time; the final script will just be all of the commands put together into one script.

along the way, we will want to make sure that all of the commands we execute are *repeatable*: we should be able to run this script a *first* time (and it will do any setup we may need that first time), and then *again* (so it will be okay that this setup is already done, and not fail)

#### create a directory

write a command to make a directory `~/data/weather/`

#### make sure your "create a directory" command is *repeatable*

try running the command you just wrote *again* -- what happens?

in order to make this command repeatable, you will need to specify some flags to this command such that it will:

1. create both `~/data` and `~/data/weather` if they don't exist
    1. this is necessary the *first* time the script runs
2. not throw an error if that directory already exists
    1. this is necessary the *other* times the script runs

*hint: if you know how to make a directory, try `man [COMMAND]` to see how to make sure no error is thrown if a directory already exists*
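as a minimal sketch of what "repeatable" means here, the following runs the same directory-creating command twice. it assumes the `-p` flag of `mkdir` (which is also what the answer below uses), and it works in a throwaway directory (`DEMO_HOME`, a made-up name for this sketch) so it does not touch your real `~/data`:

```bash
# work in a throwaway directory so the sketch does not touch your real ~/data
# (DEMO_HOME is a hypothetical name used only for this illustration)
DEMO_HOME=$(mktemp -d)

# -p creates any missing parent directories and does not complain
# if the target directory already exists
mkdir -p "$DEMO_HOME/data/weather"
echo "first run exit status: $?"

# running the exact same command a second time succeeds as well
mkdir -p "$DEMO_HOME/data/weather"
echo "second run exit status: $?"
```

without `-p`, the second run would fail because the directory already exists.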

#### download the current weather and delay status for DCA

the FAA (Federal Aviation Administration) has created [a RESTful `xml` and `json` formatted endpoint](http://services.faa.gov/docs/services/airport/) for basic information about airports -- thanks, FAA!

the endpoint of that API is `http://services.faa.gov/airport/status/[airportCode]`, and it expects one of two values for the `format` parameter:

+ `application/xml`
+ `application/json`

let's open DCA's `json` formatted output. head to http://services.faa.gov/airport/status/DCA?format=application/json in your browser.

using a command line tool, download the json results of that API call to a file named 

`~/data/weather/dca.weather.json`

#### get the status code from the download request

you just successfully wrote a linux command that can download the DCA json information from the API and wrote it to a file. any time that command runs, it will either be *successful* or *unsuccessful*.

after you run that command, get the **exit status** of that command and print it to the terminal.
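a small sketch of how exit statuses behave, using stand-in commands instead of a real download so that it runs without a network connection. `failing_download` is a hypothetical function invented for this illustration, and the status `7` it returns is arbitrary:

```bash
# `true` always succeeds, so its exit status is 0
true
echo "exit status of true: $?"

# hypothetical stand-in for a download that fails with status 7
failing_download() { return 7; }

# $? is overwritten by every command that runs, so save it right away.
# the `|| ...` form stores the failing status without aborting the script
STATUS_CODE=0
failing_download || STATUS_CODE=$?
echo "exit status of failing_download: $STATUS_CODE"
```

the important detail is that `$?` always refers to the *most recent* command, so if you want to use a status later (say, in a log message), store it in a variable first.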

#### print a status message to a log file

let's get the following for a status message:

1. the current time
2. the result of the previous command (the download command) -- just as an error code, nothing more complicated than that

the end result should be a line formatted like

```
YYYY-mm-dd HH:MM:SS    gu511_download_A.sh    command status code was: [status code here]
```

write a command to save the current time to a variable called `$NOW`.

once you can construct such a line, *append* that line to a log file at `~/data/weather/download.log`
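one possible way to build and append that line is sketched below. the log path here is a temporary file (`LOG_FILE`, a name made up for this sketch) rather than `~/data/weather/download.log`, and the status code is hard-coded to `0` to stand in for a download that succeeded:

```bash
# hypothetical log location for this sketch; the exercise itself
# uses ~/data/weather/download.log
LOG_FILE=$(mktemp)

# pretend the download command just finished successfully
STATUS_CODE=0

# save the current time in YYYY-mm-dd HH:MM:SS format
NOW=$(date +'%Y-%m-%d %H:%M:%S')

# >> appends to the file; a single > would overwrite it every time
echo "$NOW    gu511_download_A.sh    command status code was: $STATUS_CODE" >> "$LOG_FILE"
echo "$NOW    gu511_download_A.sh    command status code was: $STATUS_CODE" >> "$LOG_FILE"

# both lines are still there
cat "$LOG_FILE"
```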

#### combine all of the above into a bash script

create a file called `gu511_download_A.sh` by filling in the following template:

```bash
#!/usr/bin/bash
# when this script is run, the line above tells the
# command line what program (binary) to use to
# execute the commands

# the following line(s) creates the directory 
# ~/data/weather if needed
FILL THIS IN

# the following line(s) downloads the current weather 
# and delay status for DCA into ~/data/weather
FILL THIS IN

# the following line(s) write a log message to file 
# indicating status code of previous line 
FILL THIS IN

# exit with the most recent error code -- you can
# leave this line alone
exit $?
```

#### submit this file: see exercise 4 below

### answer 1.A

```bash
#!/usr/bin/bash
# when this script is run, the line above tells the
# command line what program (binary) to use to
# execute the commands

# the following line(s) creates the directory 
# ~/data/weather if needed
mkdir -p ~/data/weather

# the following line(s) downloads the current weather 
# and delay status for DCA into ~/data/weather
curl --silent -o ~/data/weather/dca.weather.json "http://services.faa.gov/airport/status/DCA?format=application/json"

# the following line(s) write a log message to file 
# indicating status code of previous line 
STATUS_CODE=$?
NOW=$(date +'%Y-%m-%d %H:%M:%S')
echo "$NOW    gu511_download_A.sh    command status code was: $STATUS_CODE" >> ~/data/weather/download.log

# exit with the most recent error code -- you can
# leave this line alone
exit $?
```

### exercise 1.B: create a *useful* bash script (advanced linux users)

we're going to write a bash script that will download an arbitrary number of urls from a text file in a highly parallel way. we'll write this script in stages:

1. create a directory to hold our downloaded data
2. download a list of urls from a text file

to create this script, we will move one step at a time; the final script will just be all of the commands put together into one script

#### create a test file of urls

execute the following commands to create a list of test urls for downloading:

```bash
echo www.google.com >> /tmp/test.urls
echo www.georgetown.edu >> /tmp/test.urls
echo www.elderresearch.com >> /tmp/test.urls
echo www.twitter.com >> /tmp/test.urls
echo www.facebook.com >> /tmp/test.urls
```

#### create a directory

write a command to make a directory `~/data/downloads/`

#### make sure your "create a directory" command is *repeatable*

try running the command you just wrote *again* -- what happens?

in order to make this command repeatable, you will need to specify some flags to this command such that it will:

1. create both `~/data` and `~/data/downloads` if they don't exist
    1. this is necessary the *first* time the script runs
2. not throw an error if that directory already exists
    1. this is necessary the *other* times the script runs

#### write a command to print the list of urls in `test.urls` to `stdout`

print the contents of `/tmp/test.urls` to the terminal (for piping to a later command)

#### use `xargs` to pipe the contents of `test.urls` to the `echo` function

soon we will write a function which will take a *single* url and download it. to pass many urls to this script and to create several forks (separate processes which will work in parallel) we will use the `xargs` command.

let's get some practice with the `xargs` command before trying to use it for our download function. in particular, let's look at the following flags:

1. `-P` or `--max-procs`: specify the maximum number of separate processes we should start (default is 1, 0 is interpreted as "maximum number possible")
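2. `-n`: in conjunction with `-P`, the number of items passed to each process. note that GNU `xargs` treats `-n` and `-I` as mutually exclusive and generally honors whichever one comes last on the command line, so when both are given with `-I` last, each process receives a single line
3. `-I`: specify which sequence of characters in the command to follow should be replaced with the item passed in by `xargs`. a somewhat common option is `{}` because it is unlikely to be meaningful in any command that follows. escaping it as `\{\}` guarantees the shell passes the braces through untouched -- see below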

as an example, check out the results of the following:

```bash
cat /tmp/test.urls | xargs -P 100 -n 3 -I{} echo url is \{\}
```
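if you would like to see the same idea without depending on `/tmp/test.urls`, here is a sketch that builds its own small input file (`DEMO_URLS`, a name made up for this illustration) and captures the output. it only uses `-P` and `-I`:

```bash
# hypothetical input file so the sketch does not depend on /tmp/test.urls
DEMO_URLS=$(mktemp)
printf '%s\n' www.google.com www.georgetown.edu www.twitter.com > "$DEMO_URLS"

# -I{} runs one echo per input line, replacing {} with that line.
# -P 3 allows up to three of those echo processes to run at once,
# so the order of the output lines is not guaranteed
XARGS_OUT=$(xargs -P 3 -I{} echo "url is {}" < "$DEMO_URLS")
echo "$XARGS_OUT"
```

because the processes run in parallel, running this several times may print the three lines in different orders.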

#### `curl` one of those urls

take one of those urls -- say, www.google.com -- and download it to a file. do the following:

1. run it in "silent" mode
2. cap the maximum time the whole download operation should take at 10 seconds
3. write the contents of that download to a file in `~/data/downloads` with the same name as the final portion of that url (its `basename`)

*hint*: suppose we have the url in a bash variable `$URL`. we could write

```bash
curl [silent flag and maximum download time flag] "$URL" > ~/data/downloads/"$(basename "$URL")"
```

the `basename` piece is necessary for urls which are more complicated than just `www.xxxxxxxx.com`, such as `www.xxxxxxxx.com/a/longer/path/with?stuff=x&other_stuff=y`
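to make that concrete, here is a short sketch of what `basename` returns for a simple url and for a longer one. `LONG_URL` and `SAVE_AS` are names invented for this illustration:

```bash
# a simple url has no slash, so basename returns the whole thing
basename "www.google.com"

# for a longer url, basename keeps only what follows the final slash.
# the single quotes matter: ? and & are special characters to the shell
LONG_URL='www.xxxxxxxx.com/a/longer/path/with?stuff=x&other_stuff=y'
SAVE_AS=$(basename "$LONG_URL")
echo "$SAVE_AS"
```

the second call prints `with?stuff=x&other_stuff=y`, which is a legal (if ugly) file name.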

verify that the downloaded contents for one test url match the source on the corresponding webpage

#### export that `curl` statement as a function

you can create a bash function using the syntax

```bash
function my_function_name {
    # do bash stuff
}
```

arguments are passed to this function as bash variables `$1`, `$2`, and so on, such that if you write

```bash
my_function_name arg1 arg2 arg3 arg4
```

these will be "available" within the body of the function as

| variable name | value |
|---------------|-------|
| `$1`          | arg1  |
| `$2`          | arg2  |
| `$3`          | arg3  |
| `$4`          | arg4  |

for example, if we wanted to turn our echo command up above into a super l33t re-usable function, we could write

```bash
function l33t_url_echo {
    echo "the url is $1"
}

# test it out
l33t_url_echo www.google.com
```

we could also make this available in other bash shells by `export`-ing it:

```bash
export -f l33t_url_echo
```

so, let's talk about **what you should actually do**:

1. convert your `curl` statement from before into a bash function that will take a url as a parameter
2. export it for use in other bash sessions
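a quick way to convince yourself that the export worked is to call the function from a brand new `bash` process. the sketch below does this with a hypothetical function `shout_url` that only echoes (it does not download anything):

```bash
# hypothetical function for illustration; it only echoes its argument
function shout_url {
    echo "would download: $1"
}

# export -f makes the function definition available to
# bash processes started from this shell
export -f shout_url

# a separate bash process can now call the function by name
CHILD_OUT=$(bash -c 'shout_url www.georgetown.edu')
echo "$CHILD_OUT"
```

if you comment out the `export -f` line, the child shell should report that `shout_url` is not found.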

#### use that function with `xargs` on your test urls

for each of the urls passed along by `xargs` we want to run the newly-minted `bash` function with that url as the argument.

for example, if we wanted to use our `l33t_url_echo` function from above, we could write:

```bash
# ...it pays to read ahead...
cat /tmp/test.urls | xargs -P 100 -n 3 -I{} bash -c l33t_url_echo\ \{\}
```

in the above, the actual *command* we are executing with `xargs` is the `bash` command. for each url:

1. `xargs` replaces the occurrence of `\{\}` with that url
2. `xargs` starts a new `bash` shell
3. that shell executes the *command* following the `-c` flag (that's what the `-c` flag *is*)

the command given to `-c` must reach `bash` as a single argument, which is why the space between the function name and the braces is escaped
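one caveat with substituting the url directly into the `-c` command string: the new shell re-parses that string, so a url containing characters like `&` can be misread as shell syntax. a variation (not the approach this exercise asks for, just a common alternative) is to hand the url to the new shell as a positional argument instead. `show_url` is a hypothetical function for this sketch, and `_` is a conventional placeholder that fills `$0`:

```bash
# hypothetical function for illustration; it only echoes its argument
function show_url {
    echo "got: $1"
}
export -f show_url

# the command string stays fixed ('show_url "$1"'); xargs substitutes the
# url into the trailing {} argument, which the new shell receives as $1
SAFE_OUT=$(printf '%s\n' 'www.google.com' 'www.xxxxxxxx.com/path?stuff=x&other=y' \
    | xargs -P 2 -I{} bash -c 'show_url "$1"' _ {})
echo "$SAFE_OUT"
```

for the simple test urls in this exercise both forms behave the same way.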

write your own version of the command above, replacing `l33t_url_echo` with the function you created previously.

delete all of the items in `~/data/downloads` to start from scratch, and run the whole `cat + xargs + your_function` line. verify it downloads each test url.

#### replace `/tmp/test.urls` with a variable path name

create a variable `$URL_FILE` with a value of `/tmp/test.urls`, and invoke the previous `cat` + `xargs` + `your_function` line using the variable name instead of the hard-coded path

#### understand command line arguments

the way that bash handles command line arguments to a shell script is identical to the way functions receive them -- the first word (first in a space-separated list) is stored to a variable `$1`, the second to `$2`, and so on. 

a common convention for command line arguments is to supply a default value, and this can be done with a bash variable resolution construct:

```bash
MY_VAR=${TRY_THIS_FIRST:-USE_THIS_IF_NOTHING_FOUND}
```

if `$TRY_THIS_FIRST` is set and non-empty, bash resolves that expression to the value of `$TRY_THIS_FIRST` and uses it to set the value of `$MY_VAR`. if it is unset or empty, bash uses the text following the `:-` characters instead.

In the example above,

+ if `$TRY_THIS_FIRST` is set to some value, `MY_VAR` will be set to that value
+ if `$TRY_THIS_FIRST` is *not* set to some value, `MY_VAR` will be set to the string `USE_THIS_IF_NOTHING_FOUND`
    + if `USE_THIS_IF_NOTHING_FOUND` is *itself* a variable expression (e.g. `$USER`), it will be resolved and then assigned to the variable `MY_VAR`
    
a common use of this is setting default command line argument values. for example, suppose I create a file `my_script.sh` that contains the following:

```bash
#!/usr/bin/bash

FIRST_ARGUMENT=${1:-defaultval}

echo "$FIRST_ARGUMENT"
```

if I call

```bash
bash my_script.sh
```

there is no argument passed and therefore `$1` will not be set. This will result in `FIRST_ARGUMENT` being set to the default value `defaultval`, and the script will print `defaultval` to the terminal.

if, on the other hand, I call

```bash
bash my_script.sh "print me"
```

bash will create a variable `$1` with a value `print me`, and the script will end up printing `print me` to the terminal.
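since functions receive their arguments the same way scripts do, you can try all of this out without creating a file. the sketch below wraps the same logic in a hypothetical function `show_first_argument` and also covers one case not mentioned above: an argument that is passed but empty.

```bash
# hypothetical function standing in for my_script.sh
function show_first_argument {
    local FIRST_ARGUMENT=${1:-defaultval}
    echo "$FIRST_ARGUMENT"
}

# no argument at all: the default is used
NO_ARG=$(show_first_argument)
echo "$NO_ARG"

# an argument is passed: it wins over the default
WITH_ARG=$(show_first_argument "print me")
echo "$WITH_ARG"

# an empty argument: the :- form treats empty like unset,
# so the default is used here too
EMPTY_ARG=$(show_first_argument "")
echo "$EMPTY_ARG"
```

this prints `defaultval`, `print me`, and `defaultval`.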

#### combine all of the above into a bash script

create a file called `gu511_download_B.sh` by filling in the following template:

```bash
#!/usr/bin/bash
# when this script is run, the line above tells the
# command line what program (binary) to use to
# execute the commands

# allow the executing user to pass their own list of urls,
# but keep /tmp/test.urls as a default
URL_FILE=${1:-/tmp/test.urls}

# the following line(s) creates the directory 
# ~/data/downloads if needed
FILL THIS IN

# the following line(s) define our single-url curl
# download function
FILL THIS IN

# the following line(s) export that function for use
# in other bash session
FILL THIS IN

# the following line is the "cat + xargs + your_function"
# line from the previous step
FILL THIS IN

# exit with the most recent error code -- you can
# leave this line alone
exit $?
```

##### postscript

*if everything went according to plan, this script should be among the fastest download programs I've ever come across (no exaggeration there). it was useful enough that I put it and some variants on a github repo I own.*

*...it **really** pays to read ahead...*

#### submit this file: see exercise 4 below

### answer 1.B

I have a few versions in [my github repo](https://github.com/RZachLamberty/zshell) (as mentioned above), but this is basically what we want to do

```bash
#!/usr/bin/bash
# when this script is run, the line above tells the
# command line what program (binary) to use to
# execute the commands

# allow the executing user to pass their own list of urls,
# but keep /tmp/test.urls as a default
URL_FILE=${1:-/tmp/test.urls}

# the following line(s) creates the directory 
# ~/data/downloads if needed
mkdir -p ~/data/downloads

# the following line(s) define our single-url curl
# download function
function l33t_url_echo {
    curl -s --max-time 10 "$1" > ~/data/downloads/"$(basename "$1")"
}

# the following line(s) export that function for use
# in other bash session
export -f l33t_url_echo

# the following line is the "cat + xargs + your_function"
# line from the previous step
cat "$URL_FILE" | xargs -P 100 -n 3 -I{} bash -c l33t_url_echo\ \{\}

# exit with the most recent error code -- you can
# leave this line alone
exit $?
```

## exercise 2: installing `miniconda` on your `ec2` server

this will be a straightforward list of steps to execute from the terminal of your `ec2` server in order to install `miniconda`, the minimal installer for the `anaconda python` distribution.

1. on your laptop
    1. in your browser, go to [the `miniconda` download page](https://conda.io/miniconda.html)
    2. find the `python 3.6` installer for `64-bit linux`
    3. *copy the download link address*, don't just click
        1. at the time of writing, this link was: `https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh`
2. on your `ec2` server
    1. download this file: `wget [THE URL FROM ABOVE]`. this will create a file (say) `Miniconda3-latest-Linux-x86_64.sh` in your current working directory.
    2. execute that bash script: `bash Miniconda3-latest-Linux-x86_64.sh`
        1. read the license. scroll through.... yes... yes... okay... fine... sure... yes... okay... type `yes` to accept
        2. the default installation directory is fine but change it if you'd like
        3. I will recommend you *do* change your `PATH` -- the default is `no`, so you have to actively type `yes` at this prompt.
    3. if you experience an error or abort, you may have created the install directory. if you see an error: `ERROR: File or directory already exists: /home/ubuntu/miniconda3`, you can and should `rm` that directory

## answer 2

the instructions above should be sufficient -- reach out if you have any issues

## exercise 3: creating a `conda` environment and an `environment.yml` file

first, read [the documentation on creating and managing conda environments](https://conda.io/docs/user-guide/tasks/manage-environments.html).

once you've done that, create a conda environment called `gu511` with `python` version 3.5 and install into *that environment* (not your root environment) the following packages:

+ `jupyter`
+ `pandas`
+ `plotly`
+ `scikit-learn`

use the [environment sharing `export` command](https://conda.io/docs/user-guide/tasks/manage-environments.html#sharing-an-environment) to create an `environment.yml` file.

view that file with `less` and figure out what it is saying about your `conda` environment, and how someone might use that file

#### submit this file: see exercise 4 below

## answer 3

the following sequence of commands will create the environment, install the desired packages, and create the `environment.yml` file:

1. `conda create -n gu511 python=3.5`
2. `source activate gu511`
3. `conda install jupyter pandas plotly scikit-learn`
4. `conda env export > environment.yml`


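these are the contents of the `environment.yml` file for the environment created *on a linux machine* (there may be slight differences if you create yours on a mac, fwiw). someone else can use this file to recreate the same environment with `conda env create -f environment.yml`.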

```yaml
name: gu511
channels:
- defaults
dependencies:
- asn1crypto=0.22.0=py35h0d675fe_1
- bleach=2.0.0=py35h055c768_0
- ca-certificates=2017.08.26=h1d4fec5_0
- certifi=2017.7.27.1=py35h19f42a1_0
- cffi=1.10.0=py35h796c292_1
- chardet=3.0.4=py35hb6e9ddf_1
- cryptography=2.0.3=py35hef72dfd_1
- dbus=1.10.22=h3b5a359_0
- decorator=4.1.2=py35h3a268aa_0
- entrypoints=0.2.3=py35h48174a2_2
- expat=2.2.4=hc00ebd1_1
- fontconfig=2.12.4=h88586e7_1
- freetype=2.8=h52ed37b_0
- glib=2.53.6=hc861d11_1
- gmp=6.1.2=hb3b607b_0
- gst-plugins-base=1.12.2=he3457e5_0
- gstreamer=1.12.2=h4f93127_0
- html5lib=0.999999999=py35h0543385_0
- icu=58.2=h211956c_0
- idna=2.6=py35h8605a33_1
- intel-openmp=2018.0.0=h15fc484_7
- ipykernel=4.6.1=py35h29d130c_0
- ipython=6.1.0=py35h1b71439_1
- ipython_genutils=0.2.0=py35hc9e07d0_0
- ipywidgets=7.0.0=py35h8ebd919_0
- jedi=0.10.2=py35hc33c70f_0
- jinja2=2.9.6=py35h90b8645_1
- jpeg=9b=habf39ab_1
- jsonschema=2.6.0=py35h4395190_0
- jupyter=1.0.0=py35hd38625c_0
- jupyter_client=5.1.0=py35h2bff583_0
- jupyter_console=5.2.0=py35h4044a63_1
- jupyter_core=4.3.0=py35he2f7985_0
- libedit=3.1=heed3624_0
- libffi=3.2.1=h4deb6c0_3
- libgcc-ng=7.2.0=h7cc24e2_2
- libgfortran-ng=7.2.0=h9f7466a_2
- libpng=1.6.32=hda9c8bc_2
- libsodium=1.0.13=h31c71d8_2
- libstdcxx-ng=7.2.0=h7a57d05_2
- libxcb=1.12=h84ff03f_3
- libxml2=2.9.4=h6b072ca_5
- markupsafe=1.0=py35h4f4fcf6_1
- mistune=0.7.4=py35hfd0f961_0
- mkl=2018.0.0=hb491cac_4
- nbconvert=5.3.1=py35hc5194e3_0
- nbformat=4.4.0=py35h12e6e07_0
- ncurses=6.0=h06874d7_1
- notebook=5.0.0=py35h65c930e_2
- numpy=1.13.3=py35hd829ed6_0
- openssl=1.0.2l=h077ae2c_5
- pandas=0.20.3=py35h85c2c75_2
- pandoc=1.19.2.1=hea2e7c5_1
- pandocfilters=1.4.2=py35h1565a15_1
- pcre=8.41=hc71a17e_0
- pexpect=4.2.1=py35h8b56cb4_0
- pickleshare=0.7.4=py35hd57304d_0
- pip=9.0.1=py35haa8ec2a_3
- plotly=2.1.0=py35hac5c16f_0
- prompt_toolkit=1.0.15=py35hc09de7a_0
- ptyprocess=0.5.2=py35h38ce0a3_0
- pycparser=2.18=py35h61b3040_1
- pygments=2.2.0=py35h0f41973_0
- pyopenssl=17.2.0=py35h1d2a76c_0
- pyqt=5.6.0=py35h0e41ada_5
- pysocks=1.6.7=py35h6aefbb0_1
- python=3.5.4=he2c66cf_20
- python-dateutil=2.6.1=py35h90d5b31_1
- pytz=2017.2=py35h9225bff_1
- pyzmq=16.0.2=py35h4be1f71_2
- qt=5.6.2=h974d657_12
- qtconsole=4.3.1=py35h4626a06_0
- readline=7.0=hac23ff0_3
- requests=2.18.4=py35hb9e6ad1_1
- scikit-learn=0.19.0=py35h25e8076_2
- scipy=0.19.1=py35ha8f041b_3
- setuptools=36.5.0=py35ha8c1747_0
- simplegeneric=0.8.1=py35h2ec4104_0
- sip=4.18.1=py35h9eaea60_2
- six=1.10.0=py35h5312c1b_1
- sqlite=3.20.1=h6d8b0f3_1
- terminado=0.6=py35hce234ed_0
- testpath=0.3.1=py35had42eaf_0
- tk=8.6.7=h5979e9b_1
- tornado=4.5.2=py35hf879e1d_0
- traitlets=4.3.2=py35ha522a97_0
- urllib3=1.22=py35h2ab6e29_0
- wcwidth=0.1.7=py35hcd08066_0
- webencodings=0.5.1=py35hb6cf162_1
- wheel=0.29.0=py35h601ca99_1
- widgetsnbextension=3.0.2=py35h0be620c_1
- xz=5.2.3=h2bcbf08_1
- zeromq=4.2.2=hb0b69da_1
- zlib=1.2.11=hfbfcf68_1
- pip:
  - ipython-genutils==0.2.0
  - jupyter-client==5.1.0
  - jupyter-console==5.2.0
  - jupyter-core==4.3.0
  - prompt-toolkit==1.0.15
prefix: /home/ubuntu/miniconda3/envs/gu511
```

## exercise 4: submitting your homework

### tangent about how your `ssh` access was set up

in last week's exercises you created a public key and sent it to me along with a desired user name and an ip address.

after receiving them, I used the following script to create your users and configure `ssh` access:

```bash
#!/bin/bash

# command line args
USERNAME=${1}
HOME=/home/$USERNAME
IP=${2}
PUBKEY=${3}

# create user and set up home / .ssh directory
adduser --disabled-password $USERNAME
mkdir -p $HOME/.ssh
chown $USERNAME:$USERNAME $HOME/.ssh

# add public key to authorized_keys
echo $PUBKEY >> $HOME/.ssh/authorized_keys
chown -R $USERNAME:$USERNAME $HOME/.ssh/
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys

# use awscli to update ec2 port settings
aws --region us-east-1 ec2 authorize-security-group-ingress \
    --group-name ssh_for_hw \
    --protocol tcp \
    --port 22 \
    --cidr $IP/32
```

I then sent you the information you need to sign in:

1. the user name you requested and received
2. the server's ip address

you should then be able to log in to my `ec2` server with the command

```bash
ssh -i /path/to/your/private/key [YOUR USER NAME HERE]@[MY EC2 IP ADDRESS HERE]
```

### actually doing exercise 4

the point of this exercise is to use `scp` (the SSH copy command) or some secure copy application (e.g. WinSCP or FileZilla) to copy your bash script file to my `ec2` server.

you should copy it into your home directory (`~`, `/home/[YOUR USER NAME HERE]`) and keep the file name as `gu511_download_A.sh` or `gu511_download_B.sh`, depending on whether you completed `1.A` or `1.B` above.

if you are using `scp`, the general structure of the command is

```bash
# copying a *local* file to a *remote* machine
scp -i /path/to/your/private/key [local files to copy] [user name]@[host name or ip]:[path on remote machine]
```

to go in the other direction (*i.e.* copy remote files to your local machine), just flip the order between the `[local files to copy]` element and the `[user name]@[host name or ip]:[path on remote machine]` element.

so for this particular copy operation:

```bash
scp -i /path/to/your/private/key /path/to/your/gu511_download_A.sh [your user name here]@[my aws ec2 ip]:~/gu511_download_A.sh

# or

scp -i /path/to/your/private/key /path/to/your/gu511_download_B.sh [your user name here]@[my aws ec2 ip]:~/gu511_download_B.sh
```

and then

```bash
scp -i /path/to/your/private/key /path/to/your/environment.yml [your user name here]@[my aws ec2 ip]:~/
```

the final evaluation will be me running your script and creating `conda` environments using your `environment.yml` file and verifying that both behave as expected.

## answer 4

simply following the steps above is sufficient