# Command Line Interface (CLI) & Data collection <a class="tocSkip">

## Instructor: Sang-Yun Oh <a class="tocSkip">

## GUI? CLI?

- **Graphical User Interface (GUI)**:  
    interaction via graphical objects  
    e.g., Microsoft Windows and Apple OS X

- **Command Line Interface (CLI)**:  
    interaction via commands typed into shell  
    e.g., bash, zsh, tcsh, etc.

- The shell is usually accessed through a terminal (e.g., the Terminal in Jupyter)

- A GUI is simple for everyday use, but repetitive tasks are hard to automate with it

- A CLI is more cumbersome for everyday use, but it is scriptable

## Shell commands

### Commonly used commands for text files

- `cat`: prints the contents of a file
- `head`: prints the first few lines of a file
- `sed`: (stream editor) transforms text, e.g., search and replace
- `paste`: pastes text files side by side
- `cut`: extracts columns from a delimited text file
- `find`: searches the file system
- `grep`: searches text for lines matching a regular expression pattern
- `sort`: sorts a file line by line
- `uniq`: drops repeated adjacent lines, so it keeps the unique lines of sorted text
- etc.
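
Why `uniq` is usually paired with `sort` can be sketched in Python. The helper name `uniq` and the sample lines below are illustrative assumptions, not part of the course material; the sketch only mimics the behavior of collapsing adjacent duplicates.

```python
from itertools import groupby


def uniq(lines):
    """Mimic `uniq`: collapse runs of identical adjacent lines."""
    return [line for line, _ in groupby(lines)]


# Hypothetical input lines
lines = ["fruity", "chocolate", "fruity", "fruity", "chocolate"]

unsorted_result = uniq(lines)        # only the adjacent pair of "fruity" is collapsed
sorted_result = uniq(sorted(lines))  # like `sort | uniq`: every duplicate is removed

print(unsorted_result)
print(sorted_result)
```

Sorting first brings identical lines next to each other, which is what lets the adjacent-only comparison remove all duplicates.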

### Anatomy of shell commands

Here is a simple shell command:

In [1]:
! cat --help ## most shell commands have built-in help

Usage: cat [OPTION]... [FILE]...
Concatenate FILE(s) to standard output.

With no FILE, or when FILE is -, read standard input.

  -A, --show-all           equivalent to -vET
  -b, --number-nonblank    number nonempty output lines, overrides -n
  -e                       equivalent to -vE
  -E, --show-ends          display $ at end of each line
  -n, --number             number all output lines
  -s, --squeeze-blank      suppress repeated empty output lines
  -t                       equivalent to -vT
  -T, --show-tabs          display TAB characters as ^I
  -u                       (ignored)
  -v, --show-nonprinting   use ^ and M- notation, except for LFD and TAB
      --help     display this help and exit
      --version  output version information and exit

Examples:
  cat f - g  Output f's contents, then standard input, then g's contents.
  cat        Copy standard input to standard output.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>


1. `cat`: program name

2. `[OPTION]`: controls program behavior

3. `[FILE]`: specifies the file(s) to read; with no file, or with `-`, standard input is read
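
To make the anatomy concrete, here is a small Python sketch that splits command strings into the pieces the shell would hand to the program. It uses the standard `shlex` module; the example commands are taken from this notebook, except for the quoted pattern `'Tootsie Pop'`, which is an illustrative assumption.

```python
import shlex

# The shell splits a command line on whitespace: the first token is the
# program name, the remaining tokens are its arguments (options and files).
tokens = shlex.split("head -n 1 candy-data.csv")
program, arguments = tokens[0], tokens[1:]

# Quotes keep a string containing spaces together as a single argument.
quoted_tokens = shlex.split("grep 'Tootsie Pop' candy-data.csv")

print(program, arguments)
print(quoted_tokens)
```

How the arguments are interpreted (as options such as `-n 1` or as file names) is up to the program itself, which is why `--help` output is worth reading.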

### References to learn shell command line

- [Software Carpentry Lessons](https://software-carpentry.org/lessons/)

- [Unix Power Tools](https://ucsb-primo.hosted.exlibrisgroup.com/primo-explore/fulldisplay?docid=01UCSB_ALMA51295276690003776&context=L&vid=UCSB&search_scope=default_scope&tab=default_tab&lang=en_US)

- [Explain Shell](https://explainshell.com/)

### Example: Downloading Files

- File URLs are often directly visible (e.g., on GitHub)

- `wget` is a simple and effective download tool

- Example: https://github.com/fivethirtyeight/data

- The "Raw" button on a file's page links to the URL of the actual file

- Take the candy ratings data: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking

- `wget` can be used to download files to the course JupyterHub
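
The relationship between a GitHub file page and its "Raw" URL can be sketched as a string transformation. The function name `github_raw_url` and the `blob` page URL below are assumptions for illustration; the sketch assumes the layout `https://github.com/<user>/<repo>/blob/<branch>/<path>` and does not contact GitHub.

```python
from urllib.parse import urlsplit


def github_raw_url(blob_url):
    """Convert a GitHub file-page URL into the URL behind its "Raw" button.

    Assumes the layout https://github.com/<user>/<repo>/blob/<branch>/<path>.
    """
    parts = urlsplit(blob_url)
    # Split the path into user, repo, page kind, and "<branch>/<path>"
    user, repo, kind, rest = parts.path.lstrip("/").split("/", 3)
    if parts.netloc != "github.com" or kind != "blob":
        raise ValueError(f"not a GitHub file page: {blob_url}")
    return f"https://raw.githubusercontent.com/{user}/{repo}/{rest}"


# Assumed file-page URL for the candy data
page = "https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv"
raw = github_raw_url(page)
print(raw)
```

The resulting URL is the kind of address that `wget` expects: it returns the file itself rather than an HTML page describing the file.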

In [2]:
%%bash
wget https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv

--2020-04-14 22:15:18--  https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5205 (5.1K) [text/plain]
Saving to: ‘candy-data.csv.1’

     0K .....                                                 100% 16.0M=0s

2020-04-14 22:15:18 (16.0 MB/s) - ‘candy-data.csv.1’ saved [5205/5205]



### Example: Viewing file contents 

In [3]:
%%bash
head candy-data.csv

competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
100 Grand,1,0,1,0,0,1,0,1,0,.73199999,.86000001,66.971725
3 Musketeers,1,0,0,0,1,0,0,1,0,.60399997,.51099998,67.602936
One dime,0,0,0,0,0,0,0,0,0,.011,.116,32.261086
One quarter,0,0,0,0,0,0,0,0,0,.011,.51099998,46.116505
Air Heads,0,1,0,0,0,0,0,0,0,.90600002,.51099998,52.341465
Almond Joy,1,0,0,1,0,0,0,1,0,.465,.76700002,50.347546
Baby Ruth,1,0,1,1,1,0,0,1,0,.60399997,.76700002,56.914547
Boston Baked Beans,0,0,0,1,0,0,0,0,1,.31299999,.51099998,23.417824
Candy Corn,0,0,0,0,0,0,0,0,1,.90600002,.32499999,38.010963


In [4]:
! head candy-data.csv ## also works

competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
100 Grand,1,0,1,0,0,1,0,1,0,.73199999,.86000001,66.971725
3 Musketeers,1,0,0,0,1,0,0,1,0,.60399997,.51099998,67.602936
One dime,0,0,0,0,0,0,0,0,0,.011,.116,32.261086
One quarter,0,0,0,0,0,0,0,0,0,.011,.51099998,46.116505
Air Heads,0,1,0,0,0,0,0,0,0,.90600002,.51099998,52.341465
Almond Joy,1,0,0,1,0,0,0,1,0,.465,.76700002,50.347546
Baby Ruth,1,0,1,1,1,0,0,1,0,.60399997,.76700002,56.914547
Boston Baked Beans,0,0,0,1,0,0,0,0,1,.31299999,.51099998,23.417824
Candy Corn,0,0,0,0,0,0,0,0,1,.90600002,.32499999,38.010963


In [5]:
! head -n 1 candy-data.csv  ## first line is the header

competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent


In [6]:
! wc -l candy-data.csv      ## counts lines in text file

86 candy-data.csv


In [7]:
! cut -d',' -f1,3 candy-data.csv    ## prints columns of delimited text

competitorname,fruity
100 Grand,0
3 Musketeers,0
One dime,0
One quarter,0
Air Heads,1
Almond Joy,0
Baby Ruth,0
Boston Baked Beans,0
Candy Corn,0
Caramel Apple Pops,1
Charleston Chew,0
Chewey Lemonhead Fruit Mix,1
Chiclets,1
Dots,1
Dum Dums,1
Fruit Chews,1
Fun Dip,1
Gobstopper,1
Haribo Gold Bears,1
Haribo Happy Cola,0
Haribo Sour Bears,1
Haribo Twin Snakes,1
HersheyÕs Kisses,0
HersheyÕs Krackel,0
HersheyÕs Milk Chocolate,0
HersheyÕs Special Dark,0
Jawbusters,1
Junior Mints,0
Kit Kat,0
Laffy Taffy,1
Lemonhead,1
Lifesavers big ring gummies,1
Peanut butter M&MÕs,0
M&MÕs,0
Mike & Ike,1
Milk Duds,0
Milky Way,0
Milky Way Midnight,0
Milky Way Simply Caramel,0
Mounds,0
Mr Good Bar,0
Nerds,1
Nestle Butterfinger,0
Nestle Crunch,0
Nik L Nip,1
Now & Later,1
Payday,0
Peanut M&Ms,0
Pixie Sticks,0
Pop Rocks,1
Red vines,1
ReeseÕs Miniatures,0
ReeseÕs Peanut Butter cup,0
ReeseÕs pieces,0
ReeseÕs stuffed with pieces,0
Ring pop,1
Rolo,0
Root Beer B

In [8]:
! grep 'Tootsie' candy-data.csv      ## finds lines with pattern (regular expression)

Tootsie Pop,1,1,0,0,0,0,1,0,0,.60399997,.32499999,48.982651
Tootsie Roll Juniors,1,0,0,0,0,0,0,0,0,.31299999,.51099998,43.068897
Tootsie Roll Midgies,1,0,0,0,0,0,0,0,1,.17399999,.011,45.736748
Tootsie Roll Snack Bars,1,0,0,0,0,0,0,1,0,.465,.32499999,49.653503
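
For comparison, the same operations can be approximated in plain Python. This is a rough sketch, not a replacement for the shell tools: the in-memory `sample` keeps only the first four columns of a few rows shown above, so that the code does not depend on the downloaded file.

```python
import csv
import io
import re

# A few rows from candy-data.csv, truncated to the first four columns
sample = """\
competitorname,chocolate,fruity,caramel
100 Grand,1,0,1
Air Heads,0,1,0
Tootsie Pop,1,1,0
"""

lines = sample.splitlines()

header = lines[:1]                                  # head -n 1
line_count = len(lines)                             # wc -l
rows = list(csv.reader(io.StringIO(sample)))
columns = [[row[0], row[2]] for row in rows]        # cut -d',' -f1,3
matches = [line for line in lines if re.search("Tootsie", line)]  # grep 'Tootsie'

print(header, line_count)
print(columns)
print(matches)
```

The shell versions are one-liners, which is a large part of their appeal for quick inspection of text files.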


### Chaining commands together

- Commands can be chained together using "pipes" (`|`)

- Many shell commands send their output to what is called "stdout" (standard output; by default, printed to the screen)

- A pipe feeds the "stdout" of one command into the "stdin" (standard input) of the next command

- Hence, we can build commands such as the following

In [9]:
! head -n1 candy-data.csv

competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent


In [10]:
! head -n1 candy-data.csv | sed 's/,/\n/g'

competitorname
chocolate
fruity
caramel
peanutyalmondy
nougat
crispedricewafer
hard
bar
pluribus
sugarpercent
pricepercent
winpercent


In [11]:
! head -n1 candy-data.csv | sed 's/,/\n/g' | sed 's/chocolate/CHOCOLATE/g'

competitorname
CHOCOLATE
fruity
caramel
peanutyalmondy
nougat
crispedricewafer
hard
bar
pluribus
sugarpercent
pricepercent
winpercent
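
The idea behind a pipe, where each stage consumes the output of the previous one, can be mimicked with function composition in Python. The helpers `head` and `sed_substitute` below are hypothetical stand-ins for the shell commands, and the input is a shortened header line.

```python
import re


def head(lines, n=10):
    """Like `head -n`: keep the first n lines."""
    return lines[:n]


def sed_substitute(lines, pattern, replacement):
    """Like a global `sed` substitution, re-splitting on any new line breaks."""
    text = "\n".join(re.sub(pattern, replacement, line) for line in lines)
    return text.split("\n")


# Shortened stand-in for the contents of candy-data.csv
lines = [
    "competitorname,chocolate,fruity,caramel",
    "100 Grand,1,0,1",
]

# Equivalent in spirit to: head -n1 | sed (commas to newlines) | sed (uppercase chocolate)
stage1 = head(lines, 1)
stage2 = sed_substitute(stage1, ",", "\n")
stage3 = sed_substitute(stage2, "chocolate", "CHOCOLATE")
print(stage3)
```

Each stage only needs to know about lines of text coming in and lines of text going out, which is what makes shell commands easy to recombine.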


### Example: Text file download, search, and manipulation

Commands like `grep`, `sed`, and `awk` can be used for text processing.

In [12]:
%%bash

wget -q -O - https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi \
#     | grep 'zipcode.zip' \
#     | sed 's/<a data/\n<a data/g' \
#     | grep -Po '(?<=href=")[^"]*(?=")'

<!DOCTYPE html>
<html  lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
  <head>
    <meta charset="utf-8" /><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"b67fc6a152",applicationID:"70700070"};window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var i=n[t]={exports:{}};e[t][0].call(i.exports,function(n){var i=e[t][1][n];return r(i||n)},i,i.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(e,n,t){function r(){}function i(e,n,t){return function(){return o(e,[u.now()].concat(f(arguments)),n?null:t

## Shell and Jupyter Notebook

The shell and Jupyter can interact by passing values back and forth, e.g.:

1. In the shell, grab a webpage, extract all links, and keep the file links that end with `zipcode.zip`.  
    Then, store the file links in a Python variable: `files`

1. In Python, loop through the file links.  
    In each iteration, pass one file link, `f`, back to the shell.

1. In the shell, download the file at the link using `wget`

In [13]:
files = !wget -q -O - https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi | grep 'zipcode.zip' | sed 's/<a data/\n<a data/g' | grep -Po '(?<=href=")[^"]*(?=")'
files  ## file links from bash are now in a Python variable!

['https://www.irs.gov/pub/irs-soi/1998zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2001zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2002zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2004zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2005zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2006zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2007zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2008zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2009zipcode.zip',
 'https://www.irs.gov/pub/irs-soi/2010zipcode.zip']

In [14]:
for f in files[:3]:
    ! wget -nc {f}        ## pass python variables into shell!

File ‘1998zipcode.zip’ already there; not retrieving.

File ‘2001zipcode.zip’ already there; not retrieving.

File ‘2002zipcode.zip’ already there; not retrieving.
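
The link extraction step can also be sketched entirely in Python with the `re` module, using the same lookbehind/lookahead pattern that was passed to `grep -Po`. The HTML snippet below is a hypothetical stand-in for the downloaded page (including the made-up `data-x` attribute), and the loop only prints what would be downloaded.

```python
import re

# Hypothetical HTML standing in for the page fetched by wget
html = (
    '<li><a data-x="1" href="https://www.irs.gov/pub/irs-soi/1998zipcode.zip">1998</a></li>'
    '<li><a data-x="2" href="https://www.irs.gov/pub/irs-soi/2001zipcode.zip">2001</a></li>'
    '<li><a href="https://www.irs.gov/statistics">Statistics</a></li>'
)

# Text between href=" and the next double quote
links = re.findall(r'(?<=href=")[^"]*(?=")', html)

# Keep only the links to the zip files
files = [link for link in links if link.endswith("zipcode.zip")]

for f in files:
    filename = f.rsplit("/", 1)[-1]
    print(f"would download {f} -> {filename}")
```

Regular expressions are a pragmatic choice for a one-off scrape like this; for messier pages, a real HTML parser is usually more robust.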



## GET method and APIs

### GET method

- URLs often use the [GET method](https://www.w3schools.com/tags/ref_httpmethods.asp)

- The GET method passes parameters in the URL: e.g.  
    https://www.google.com/search?q=hello+there  
    https://www.google.com/search?q=hello+there&tbm=isch

- "[Urls Explained](https://www.freeformatter.com/url-parser-query-string-splitter.html#urls-explained)" dissects standard components of URLs ([online URL parser](https://www.freeformatter.com/url-parser-query-string-splitter.html)) 
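
The components of a GET URL can be inspected with Python's standard `urllib.parse` module. The sketch below takes apart one of the example URLs above; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://www.google.com/search?q=hello+there&tbm=isch"
parts = urlsplit(url)

print(parts.scheme)  # protocol
print(parts.netloc)  # host
print(parts.path)    # path on the host
print(parts.query)   # raw query string

# Decode the query string into parameter names and lists of values
params = parse_qs(parts.query)
print(params)
```

Note that `+` in a query string stands for a space, so the decoded search term is `hello there`.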

### Application Programming Interface (API)

> a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service.

- GET URLs are a commonly used form of API

- In the API context, an **endpoint** is a [URL path](https://www.freeformatter.com/url-parser-query-string-splitter.html#urls-explained) that processes your request

- The **query string** specifies the request: e.g., search term and type

### Example: Google Maps

- Web services often have documentation for developers:  
    [Google Maps Developer Documentation](https://developers.google.com/maps/documentation/urls/guide#forming-the-url)

- Demo: [Display a map](https://developers.google.com/maps/documentation/urls/guide#map-action):  
    e.g. https://www.google.com/maps/@?api=1&map_action=map&basemap=terrain&layer=bicycling

- Demo: [Searching Google Maps](https://developers.google.com/maps/documentation/urls/guide#forming-the-search-url):  
    e.g. https://www.google.com/maps/search/?api=1&query=home+depot
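
Going the other way, a request URL can be assembled from an endpoint and a set of parameters. The helper `build_url` is a hypothetical name used for illustration; it relies on `urllib.parse.urlencode` and reproduces the two demo URLs above without opening them.

```python
from urllib.parse import urlencode


def build_url(endpoint, **params):
    """Join an endpoint and keyword parameters into a GET URL."""
    return f"{endpoint}?{urlencode(params)}"


search_url = build_url("https://www.google.com/maps/search/", api=1, query="home depot")
map_url = build_url(
    "https://www.google.com/maps/@",
    api=1,
    map_action="map",
    basemap="terrain",
    layer="bicycling",
)

print(search_url)
print(map_url)
```

Letting `urlencode` handle the escaping avoids mistakes with spaces and special characters in parameter values.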

### Example: Film Locations in San Francisco

> _... listing of filming locations of movies shot in San Francisco starting from 1924 ..._

- [Dataset Metadata](https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am)

- [API documentation](https://dev.socrata.com/foundry/data.sfgov.org/yitu-d5am)

- [Simple Filters](https://dev.socrata.com/docs/filtering.html): selection criteria

- [Paging through Data](https://dev.socrata.com/docs/paging.html#2.1): paging through returned data

In [15]:
!wget -qO - "https://data.sfgov.org/resource/yitu-d5am.json?release_year=2013&title=Red%20Widow"

[{"title":"Red Widow","release_year":"2013","locations":"York & 24th St.","production_company":"Beyond Pix","distributor":"American Broadcasting Company (ABC)","director":"Alon Aranya","writer":"Melissa Rosenberg","actor_1":"Radha Mitchell","actor_2":"Sterling Beaumon","actor_3":"Clifton Collins Jr."}
,{"title":"Red Widow","release_year":"2013","locations":"2nd St & Howard","production_company":"Beyond Pix","distributor":"American Broadcasting Company (ABC)","director":"Alon Aranya","writer":"Melissa Rosenberg","actor_1":"Radha Mitchell","actor_2":"Sterling Beaumon","actor_3":"Clifton Collins Jr."}
,{"title":"Red Widow","release_year":"2013","locations":"Montgomery & Market Streets","production_company":"Beyond Pix","distributor":"American Broadcasting Company (ABC)","director":"Alon Aranya","writer":"Melissa Rosenberg","actor_1":"Radha Mitchell","actor_2":"Sterling Beaumon","actor_3":"Clifton Collins Jr."}
,{"title":"Red Widow","release_year":"2013","locations":"Broadway & Taylor",
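
A request URL for this dataset can be put together programmatically. The sketch below is a minimal example, assuming that simple filters are plain `field=value` pairs and that paging uses the `$limit` and `$offset` parameters described in the linked paging documentation; `film_query_url` is a hypothetical helper, and no request is actually sent.

```python
from urllib.parse import quote, urlencode

ENDPOINT = "https://data.sfgov.org/resource/yitu-d5am.json"


def film_query_url(limit=None, offset=None, **filters):
    """Build a query URL from simple filters plus optional paging parameters."""
    params = dict(filters)
    if limit is not None:
        params["$limit"] = limit    # assumed paging parameter: page size
    if offset is not None:
        params["$offset"] = offset  # assumed paging parameter: records to skip
    # quote_via=quote encodes spaces as %20; safe="$" keeps the dollar sign readable
    query = urlencode(params, safe="$", quote_via=quote)
    return f"{ENDPOINT}?{query}"


first_url = film_query_url(release_year=2013, title="Red Widow")
paged_url = film_query_url(limit=2, offset=2, release_year=2013, title="Red Widow")

print(first_url)
print(paged_url)
```

The first URL matches the one passed to `wget` above; the second would ask for two records, skipping the first two, if the paging parameters behave as assumed.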

## JavaScript Object Notation (JSON) format

- One of the most widely used standard data formats

- Usually a plain text file with Python dictionary-like formatting:  
    `{"key":"value"}`

- Can be nested:  
    `{"key0":{"key1":"value1", "key2":"value2"}}`

- In fact, Jupyter notebooks are stored in JSON format.

In [16]:
! head 03-Command-Line-and-Data-collection.ipynb

head: cannot open '03-Command-Line-and-Data-collection.ipynb' for reading: No such file or directory
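
Since the notebook file was not found in the working directory, the general shape of a notebook can still be illustrated with a hand-written stand-in. The structure below is a minimal sketch, not the contents of the course notebook; it uses the top-level keys `cells`, `metadata`, `nbformat`, and `nbformat_minor` of the notebook format.

```python
import json

# Hand-written, minimal stand-in for a notebook file
notebook_text = json.dumps(
    {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["## GUI? CLI?"]},
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "outputs": [],
                "source": ["! cat --help"],
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    },
    indent=1,
)

# Parsing the text gives back nested dictionaries and lists
notebook = json.loads(notebook_text)
top_level_keys = sorted(notebook)
cell_types = [cell["cell_type"] for cell in notebook["cells"]]

print(top_level_keys)
print(cell_types)
```

Printing `notebook_text` shows the nested, dictionary-like text that `head` would display for a real `.ipynb` file.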


### Example: Parsing Film Locations in San Francisco

- Raw JSON arrives as a string

- It needs to be parsed into Python objects: here, a list of dictionaries (keys and values), one per record

- Parse the returned JSON-formatted page with the `json` module

In [17]:
import json
json_str = !wget -qO - "https://data.sfgov.org/resource/yitu-d5am.json?release_year=2013&title=Red%20Widow"
json_str = ''.join(json_str) # `!` captures output as a list of lines; join them into one JSON string
data = json.loads(json_str)
data[0] ## show the first record

{'title': 'Red Widow',
 'release_year': '2013',
 'locations': 'York & 24th St.',
 'production_company': 'Beyond Pix',
 'distributor': 'American Broadcasting Company (ABC)',
 'director': 'Alon Aranya',
 'writer': 'Melissa Rosenberg',
 'actor_1': 'Radha Mitchell',
 'actor_2': 'Sterling Beaumon',
 'actor_3': 'Clifton Collins Jr.'}

* `data` is now a Python list of dictionaries, one per film location

* A list of dictionaries can be loaded into a pandas DataFrame

In [18]:
import pandas as pd

pd.DataFrame.from_dict(data).head()

| | title | release_year | locations | production_company | distributor | director | writer | actor_1 | actor_2 | actor_3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Red Widow | 2013 | York & 24th St. | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
| 1 | Red Widow | 2013 | 2nd St & Howard | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
| 2 | Red Widow | 2013 | Montgomery & Market Streets | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
| 3 | Red Widow | 2013 | Broadway & Taylor | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
| 4 | Red Widow | 2013 | Mason & Sacramento St | Beyond Pix | American Broadcasting Company (ABC) | Alon Aranya | Melissa Rosenberg | Radha Mitchell | Sterling Beaumon | Clifton Collins Jr. |
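
The step from parsed JSON records to a table can also be sketched with the standard library alone. The two records below are abbreviated from the API output shown earlier (only three fields are kept), and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io
import json

# Two records abbreviated from the API output (only three fields kept)
json_str = (
    '[{"title":"Red Widow","release_year":"2013","locations":"York & 24th St."}\n'
    ',{"title":"Red Widow","release_year":"2013","locations":"2nd St & Howard"}]'
)

records = json.loads(json_str)  # a list of dictionaries, one per record
locations = [record["locations"] for record in records]

# Each dictionary becomes a row; its keys become the column names
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(locations)
print(csv_text)
```

This mirrors what the DataFrame conversion does conceptually: keys line up as columns and each record fills one row.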
