In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:wrangling_command_line)=
# The Shell and Command Line Tools


Nearly all computers provide access to a **shell interpreter**, such
as `sh` or `bash`. Like the Python interpreter, shell interpreters allow users
to run code and view its output. Shell interpreters typically perform
operations on the files on a computer. Shell interpreters have their own
language, syntax, and built-in commands.

We use the term **command-line interface (CLI) tools** to refer to the commands
available in the shell interpreter. Although we cover only a few CLI
tools in this section, many others are available, and they enable all sorts of
useful operations on the computer.
For instance, running this command in `bash` produces a list of all the
files in the `figures/` folder along with their file sizes:

```bash
ls -l -h figures/
```

The basic syntax for a shell command is:

```bash
command -options arg1 arg2
```

CLI tools often take one or more **arguments**, similar to how Python functions
take in arguments, but we separate arguments with spaces rather than wrapping
them in parentheses.
The arguments appear at the end of the command line, and they are usually the name
of a file or some text.
In the `ls` example above, the argument to `ls` is `figures/`.
Additionally, CLI tools support **flags** that provide
additional options. These flags are specified immediately following the command
name, and each one begins with a dash.
In the `ls` example above, we provided the option flags `-l` (to provide
extra information about each file) and `-h` (to format the file sizes in a
human-readable way).
Many commands have default arguments and
options, and the `man` command displays the manual page for a command, which
lists its acceptable options, examples, and defaults.
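
To make the parts of a command line concrete, the sketch below uses Python's standard `shlex` module to split the `ls` example into tokens and then separates the flags from the arguments. The helper name `split_command` is hypothetical, and treating every token that starts with a dash as a flag is a simplifying assumption, not how every CLI tool parses its input.

```python
import shlex

def split_command(line):
    """Split a shell command line into (command, flags, arguments)."""
    tokens = shlex.split(line)  # POSIX-style splitting on whitespace
    command, rest = tokens[0], tokens[1:]
    # Simplification: any token that starts with a dash counts as a flag.
    flags = [t for t in rest if t.startswith("-")]
    args = [t for t in rest if not t.startswith("-")]
    return command, flags, args

command, flags, args = split_command("ls -l -h figures/")
print(command, flags, args)  # ls ['-l', '-h'] ['figures/']
```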

:::{note}

All CLI tools we cover in this book are specific to the `sh` shell
interpreter, the default interpreter for Jupyter installations on macOS and
Linux systems at the time of writing. Windows systems have a different
interpreter and the commands shown in the book may not run on Windows, although
Windows gives access to an `sh` interpreter through the Windows Subsystem for
Linux.

:::

Commonly, we open a terminal program to start a shell interpreter. Jupyter
notebooks, however, provide a convenience: if a line of code is prefixed with
the `!` character, the line will go directly to the system’s shell interpreter.
For example, the `!ls` command lists the files in the current directory.

In [4]:
!ls

data                            wrangling_granularity.ipynb
figures                         wrangling_intro.ipynb
wrangling_checks.ipynb          wrangling_missing.ipynb
wrangling_co2.ipynb             wrangling_restaurants.ipynb
wrangling_command_line.ipynb    wrangling_structure.ipynb
wrangling_datasets.ipynb        wrangling_summary.ipynb
wrangling_formats.ipynb         wrangling_transformations.ipynb


In the cell above, Jupyter runs the `ls` command through the `sh` shell
interpreter and displays the results of the command in the notebook.
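
Outside a notebook, a roughly similar result can be obtained from plain Python with the standard `subprocess` module. This is only a sketch of the idea, not a description of how Jupyter implements `!`, and it assumes an `ls` executable is available, as on macOS and Linux.

```python
import subprocess

# Run `ls` on the current directory and capture what it prints.
result = subprocess.run(["ls", "."], capture_output=True, text=True, check=True)
file_names = result.stdout.splitlines()
print(file_names)
```

Passing the command as a list of tokens avoids shell quoting issues, at the cost of not supporting shell features such as `*` expansion.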

Calling `ls` with a folder name as an argument shows all the files in the
folder. Let's examine the source files for the San Francisco restaurant scores
with `ls`.


In [26]:
!ls -lLh data

total 9496
-rw-r--r--  1 sam  staff   645K Dec  6 15:05 businesses.csv
-rw-r--r--  1 sam  staff   455K Dec  6 15:05 inspections.csv
-rw-r--r--  1 sam  staff   120B Dec  6 15:05 legend.csv
-rw-r--r--  1 sam  staff   3.6M Dec  6 15:05 violations.csv


The scoring information is provided in three files. The businesses.csv file
provides data on the restaurants, such as their names, addresses, and locations;
inspections.csv contains information about the inspection date and score; and
violations.csv has more detailed information about the type of violations found
during an inspection. The fourth file, legend.csv, provides the numeric score
ranges for the inspection categories of poor, needs improvement, adequate, and
good. To discover this information, however, we need to use other tools to look
at the files' structure, size, and encoding.

For example, we can display the first few
lines of a file with the `head` command. This is very useful for peeking at a
file's contents to determine whether it's formatted as a CSV, TSV, etc. Let's
look at the inspections.csv file.


In [11]:
!head data/inspections.csv

"business_id","score","date","type"
19,"94","20160513","routine"
19,"94","20171211","routine"
24,"98","20171101","routine"
24,"98","20161005","routine"
24,"96","20160311","routine"
31,"98","20151204","routine"
45,"78","20160104","routine"
45,"88","20170307","routine"
45,"85","20170914","routine"


By default, `head` displays the first 10 lines of a file. To display the last
10 lines, we use the `tail` command. The final 10 lines in inspections.csv
appear below.

In [12]:
!tail data/inspections.csv

93959,"100","20171218","routine"
93968,"98","20171120","routine"
93969,"98","20171221","routine"
93977,"96","20171219","routine"
94012,"100","20171220","routine"
94012,"90","20180112","routine"
94133,"100","20171227","routine"
94142,"100","20171220","routine"
94189,"96","20171130","routine"
94231,"85","20171214","routine"
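
The behavior of `head` and `tail` is simple enough to sketch in Python, which can be handy when a shell is not available. The function names `head` and `tail` below are our own, and the demonstration uses a throwaway file, not the restaurant data.

```python
import os
import tempfile
from collections import deque
from itertools import islice

def head(path, n=10):
    """Return the first n lines of a text file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

def tail(path, n=10):
    """Return the last n lines of a text file."""
    with open(path, encoding="utf-8") as f:
        # A deque with maxlen keeps only the most recent n lines.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]

# Demonstrate on a small throwaway file with 25 numbered lines.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("".join(f"line {i}\n" for i in range(1, 26)))
    first_three = head(demo_path, n=3)
    last_three = tail(demo_path, n=3)

print(first_three)  # ['line 1', 'line 2', 'line 3']
print(last_three)   # ['line 23', 'line 24', 'line 25']
```

Both functions read the file one line at a time, so neither needs to hold the whole file in memory.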


We can print the entire file’s contents using the `cat` command. However, you
should take care when using this command, as printing a large file can cause
the browser to crash. The legend.csv file is small, and we can use `cat` to
concatenate and print its contents.


In [13]:
!cat data/legend.csv

"Minimum_Score","Maximum_Score","Description"
0,70,"Poor"
71,85,"Needs Improvement"
86,90,"Adequate"
91,100,"Good"


In many cases, using `head` and `tail` alone gives us a good enough sense of the
file structure to proceed with loading it into a data frame. For example, we
can see from the first 10 lines that the inspections.csv file uses the CSV
format. Of course, we expect this format given that the file name extension is
".csv", but as mentioned earlier, there is no guarantee that a file with this
extension is indeed a CSV file. It's good practice to check the contents to
confirm the extension matches the actual format.
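
The same check can be approximated in Python. As a minimal sketch, the standard library's `csv.Sniffer` guesses the delimiter from a sample of text; here the sample is copied from the `head` output above. The sniffer is a heuristic, so its guess is worth confirming by eye, and the list of candidate delimiters is our own choice.

```python
import csv

# First lines of inspections.csv, as shown by `head` earlier.
sample = (
    '"business_id","score","date","type"\n'
    '19,"94","20160513","routine"\n'
    '19,"94","20171211","routine"\n'
    '24,"98","20171101","routine"\n'
)

# Ask the sniffer to choose among a few common delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|")
print(repr(dialect.delimiter))
```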

We can easily read in CSV files using the pandas `pd.read_csv` function.
But before we do that, let's look at the other two restaurant inspection files:
businesses.csv and violations.csv.


In [16]:
# The -n 3 flag tells head to only display 3 lines
!head -n 3 data/violations.csv

"business_id","date","description"
19,"20171211","Inadequate food safety knowledge or lack of certified food safety manager"
19,"20171211","Unapproved or unmaintained equipment or utensils"


In [19]:
!tail -n 3 data/businesses.csv

94571,"THE PHOENIX PASTIFICIO","200 CLEMENT ST ","San Francisco","CA","94118",,,"+14154726100"
94572,"BROADWAY DIM SUM CAFE","684 BROADWAY ST ","San Francisco","CA","94133",,,""
94574,"BINKA BITES","2241 GEARY BLVD ","San Francisco","CA","94115",,,"+14157712907"


Now that we have confirmed all three files use the CSV format, we can
read them into pandas data frames with `pd.read_csv`.

In [22]:
insp = pd.read_csv("data/inspections.csv")
insp

Unnamed: 0,business_id,score,date,type
0,19,94,20160513,routine
1,19,94,20171211,routine
2,24,98,20171101,routine
...,...,...,...,...
14219,94142,100,20171220,routine
14220,94189,96,20171130,routine
14221,94231,85,20171214,routine


In [23]:
viol = pd.read_csv("data/violations.csv")
viol

Unnamed: 0,business_id,date,description
0,19,20171211,Inadequate food safety knowledge or lack of ce...
1,19,20171211,Unapproved or unmaintained equipment or utensils
2,19,20160513,Unapproved or unmaintained equipment or utensi...
...,...,...,...
39039,94231,20171214,High risk vermin infestation [ date violation...
39040,94231,20171214,Moderate risk food holding temperature [ dat...
39041,94231,20171214,Wiping cloths not clean or properly stored or ...


It turns out that the businesses.csv file can't be immediately read into a 
dataframe. When we try to do so, we get this cryptic error:

```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
```

So, we'll need to learn more about this file's contents before we can successfully read
it. We address this decoding problem later on in this section.


Next, let's take another look at the shape of the inspections and violations 
dataframes.

In [24]:
insp.shape

(14222, 4)

In [25]:
viol.shape

(39042, 3)

Notice that the `insp` data frame has about 14,000 rows (and 4
columns), and `viol` has about 39,000 rows (and 3 columns). These files
are large, but not so large that they cannot be easily read into Python. In the
next section, we describe how to assess a file's size and what to do when it's
too large to read into a data frame.

## File Size

Computers have finite limits on computing power. You have likely
encountered these limits firsthand if your computer has slowed down from having
too many applications opened at once. We often want to make sure that we do not
exceed the computer's limits while working with data.

In many situations, we analyze datasets downloaded from the Internet. These
files reside on the computer's **disk storage**. In order to use Python to
explore and manipulate the data, we need to read the data into the computer's
**memory**, also known as random access memory (RAM). All Python code requires
the use of RAM, no matter how short the code is.

A computer's RAM is typically much smaller than a computer's disk storage. For
example, one computer model released in 2018 had 32 times more disk storage
than RAM.  Unfortunately, this means that data files can often be much bigger
than what is feasible to read into memory.


Both disk storage and RAM capacity are measured in terms of **bytes**. Roughly
speaking, each character in a text file adds one byte to the file's size. For
example, the legend.csv file has 120 characters and takes up 120 bytes of
disk space.
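
The "roughly" matters: one character equals one byte only for ASCII text. The short sketch below compares character and byte counts for two illustrative strings of our own choosing, written with an escape sequence so that the accented letter is unambiguous.

```python
ascii_text = "Poor"
accented_text = "caf\u00e9"  # "cafe" with an acute accent on the e

# ASCII characters take one byte each.
print(len(ascii_text), len(ascii_text.encode("ascii")))        # 4 4
# In UTF-8, the accented letter takes two bytes, so bytes > characters.
print(len(accented_text), len(accented_text.encode("utf-8")))  # 4 5
```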


Of course, many of the datasets we work with today contain many more
characters. To succinctly describe the sizes of larger files, we use the
prefixes listed in {numref}`byte-prefixes`.

:::{table} Prefixes for common file sizes.
:name: byte-prefixes

| Multiple | Notation | Number of Bytes |
| -------- | -------- | --------------- |
| Kibibyte | KiB      | 1024    |
| Mebibyte | MiB      | 1024²   |
| Gibibyte | GiB      | 1024³   |
| Tebibyte | TiB      | 1024⁴   |
| Pebibyte | PiB      | 1024⁵   |

:::


For example, a file containing 52428800 characters takes up 52428800 bytes = 50
mebibytes = 50 MiB on disk.
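
The conversion is repeated division by 1024. A minimal sketch, using a hypothetical helper named `format_bytes`:

```python
def format_bytes(num_bytes):
    """Express a byte count using the largest binary prefix that fits."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop once the value is below 1024 or we run out of prefixes.
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024

print(format_bytes(52428800))  # 50.0 MiB
print(format_bytes(120))       # 120.0 B
```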

**Why use multiples of 1024 instead of simple multiples of 1000 for these
prefixes?** This is a historical result of the fact that most computers
use a binary number scheme where powers of 2 are simpler to represent. You will
also see the typical SI prefixes used to describe size---kilobytes, megabytes,
and gigabytes, for example. Unfortunately, these prefixes are used
inconsistently. Sometimes a kilobyte refers to 1000 bytes; other times, a
kilobyte refers to 1024 bytes. To avoid confusion, we will stick to kibi-,
mebi-, and gibibytes which clearly represent multiples of 1024.

**When is it safe to read in a file?** Many computers have much more disk
storage than available memory. It is not uncommon to have a data file happily
stored on a computer that will overflow the computer's memory if we attempt to
manipulate it with a program, including Python programs. We often begin our
data work by making sure the files are of manageable size. To accomplish
this, we use the CLI tools `ls`, `wc`, and `du`.

Recall that `ls` shows the files within a folder. If we add the `-l` flag, the
output lists one file per line with additional metadata.

```bash
!ls -l data
total 9496
-rw-r--r--  1 sam  staff   645K Dec  6 15:05 businesses.csv
-rw-r--r--  1 sam  staff   455K Dec  6 15:05 inspections.csv
-rw-r--r--  1 sam  staff   120B Dec  6 15:05 legend.csv
-rw-r--r--  1 sam  staff   3.6M Dec  6 15:05 violations.csv
```

In particular, the fifth column of the listing shows the file size in bytes.
For example, we can see that violations.csv takes up `3726206` bytes on disk.
To make these file sizes more readable, we can use the `-h` flag.

```bash
!ls -lh data
total 9496
-rw-r--r--  1 sam  staff   645K Dec  6 15:05 businesses.csv
-rw-r--r--  1 sam  staff   455K Dec  6 15:05 inspections.csv
-rw-r--r--  1 sam  staff   120B Dec  6 15:05 legend.csv
-rw-r--r--  1 sam  staff   3.6M Dec  6 15:05 violations.csv
```


We see that businesses.csv takes up 645 KiB on disk, making it well within the
memory capacities of most systems. Although the violations.csv file takes up
3.6 MiB of disk storage, most machines can easily read violations.csv into a
pandas data frame too. A much larger file is DAWN-Data.txt, which stores the
DAWN survey data.

```bash
!ls -lh DAWN-Data.txt
-rw-r--r--@ 1 sam staff 267M Dec 6 15:05 DAWN-Data.txt
```

This file takes up 267 MiB of disk storage, and while some computers can work
with it in memory, it might slow down most systems. The approach we have taken
for working with these particular data is to reduce the number of features in
the data frame.

The command `wc` (short for word count) also provides helpful information about
a file's size. This CLI tool returns the number of lines, words, and bytes
in the file; for ASCII text, the byte count equals the number of characters.


```bash
!wc  DAWN-Data.txt
229211 22695570 280095842 DAWN-Data.txt
```
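
The three counts that `wc` reports can be reproduced with a few lines of Python. This is a sketch of the idea, not a reimplementation of `wc`: the function name is hypothetical, and "word" here means any run of non-whitespace bytes. The demonstration file holds two rows in the style of legend.csv.

```python
import os
import tempfile

def count_lines_words_bytes(path):
    """Count newline characters, whitespace-separated words, and bytes."""
    lines = words = num_bytes = 0
    with open(path, "rb") as f:  # binary mode so we count raw bytes
        for chunk in f:          # iterating yields one line at a time
            lines += chunk.count(b"\n")
            words += len(chunk.split())
            num_bytes += len(chunk)
    return lines, words, num_bytes

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write(b'0,70,"Poor"\n71,85,"Needs Improvement"\n')
    counts = count_lines_words_bytes(demo_path)

print(counts)  # (2, 3, 38)
```

Because the file is processed line by line, this approach works even for files too large to load into memory at once.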

**Folder Sizes.** Sometimes we are interested in the total size of a folder
instead of the size of individual files. For example, if we have one file of
sensor recordings for each month in a year, we might like to see whether we can
combine all the data into a single DataFrame. Note that `ls` does not calculate
the cumulative size of the contents of a folder. To properly calculate the
total size of a folder, including the files in the folder, we use the `du`
(short for disk usage) CLI tool. By default, the `du` tool shows the sizes of
folders in its own units called blocks.


```bash
!du data
9496 data
```

To show sizes in human-readable units such as KiB and MiB, we add the `-h` flag.

```bash
!du -h data
4.6M	data
```


We commonly also add the `-s` flag to `du` so that it prints a single summary
total for each item we name, rather than a line for every nested folder. The
asterisk in `data/*` below is expanded by the shell to every item in the `data`
folder, so `du` reports the size of each one.


```bash
!du -sh data/*
648K	data/businesses.csv
456K	data/inspections.csv
4.0K	data/legend.csv
3.6M	data/violations.csv
```
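
A folder total can also be computed in Python by walking the folder and summing file sizes. The helper `folder_size_bytes` below is a hypothetical name, and the sketch adds up file lengths in bytes, whereas `du` reports the disk blocks allocated to each file. That difference is why `du` lists legend.csv as 4.0K even though the file holds only 120 bytes.

```python
import os
import tempfile

def folder_size_bytes(folder):
    """Sum the sizes of all regular files under folder, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

# Build a throwaway folder with one nested subfolder to demonstrate.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.csv"), "wb") as f:
        f.write(b"x" * 100)
    with open(os.path.join(tmp, "sub", "b.csv"), "wb") as f:
        f.write(b"y" * 250)
    total = folder_size_bytes(tmp)

print(total)  # 350
```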

**Memory Overhead.** As a rule of thumb, reading in a file using `pandas`
usually requires available memory of at least double the file size. That
is, reading in a 1 GiB file will typically require at least 2 GiB of available
memory.

Note that memory is shared by all programs running on a computer, including the
operating system, web browsers, and yes, Jupyter notebook itself. A computer
with 4 GiB total RAM might have only 1 GiB available RAM with many applications
running. With 1 GiB available RAM, it is unlikely that `pandas` will be able to
read in a 1 GiB file.
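
The rule of thumb amounts to a one-line comparison. The sketch below encodes it in a hypothetical helper; the factor of 2 is the rough guideline from the text, not a guarantee, and the actual overhead depends on the data types in the file.

```python
GIB = 1024 ** 3

def likely_fits_in_memory(file_bytes, available_bytes, overhead=2):
    """Apply the rule of thumb: need roughly `overhead` times the file size."""
    return overhead * file_bytes <= available_bytes

print(likely_fits_in_memory(1 * GIB, 1 * GIB))  # False
print(likely_fits_in_memory(1 * GIB, 2 * GIB))  # True
```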

## File Encoding

All of the files that we have examined in this chapter are plain text files.
Plain text, as you might guess, is simple and limited. It supports standard
ASCII characters, which include the uppercase and lowercase English letters,
numbers, punctuation symbols, and spaces. In short, ASCII is a character
**encoding**.

However, the ASCII encoding is not sufficient to represent many
special characters and characters from other languages. Other, more modern,
character encodings can represent many more characters. Common
encodings for documents and Web pages are Latin-1 (ISO-8859-1) and UTF-8.
UTF-8 can represent over one million
characters, and is backwards compatible with ASCII, meaning that it uses the
same representation for English letters, numbers, and punctuation as ASCII.
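
A short experiment in Python illustrates both points. It uses the byte `0xd1` named in the error message earlier in this section, which Latin-1 maps to the letter Ñ; the strings themselves are illustrative and are not taken from the data files.

```python
# ASCII text has the same bytes in ASCII, Latin-1, and UTF-8.
assert "Good".encode("ascii") == "Good".encode("utf-8") == "Good".encode("latin-1")

# The letter Ñ (U+00D1) is one byte in Latin-1 but two bytes in UTF-8.
latin1_bytes = "\u00d1".encode("latin-1")
utf8_bytes = "\u00d1".encode("utf-8")
print(latin1_bytes, utf8_bytes)  # b'\xd1' b'\xc3\x91'

# Decoding Latin-1 bytes as UTF-8 fails, much like the error seen earlier.
try:
    b"A\xd1B".decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print(err.reason)
```

In UTF-8, the byte `0xd1` announces a two-byte character, so the decoder expects a continuation byte next and raises an error when it finds an ordinary letter instead.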


The `file` CLI tool can help us determine a file's encoding. Earlier in
this section, we had trouble reading the businesses.csv file, and were given an
error message concerning decoding the information. The `file` tool uncovers the
problem. (The `-I` flag used here is the macOS spelling; on Linux, the
equivalent flag is typically `-i`.)

```bash
!file -I data/*
data/businesses.csv:  text/plain; charset=iso-8859-1
data/inspections.csv: text/plain; charset=us-ascii
data/legend.csv:      text/plain; charset=us-ascii
data/violations.csv:  text/plain; charset=us-ascii
```


We see that the other three files are all ASCII, but businesses.csv has an
ISO-8859-1 encoding. We can provide this information to `pandas.read_csv` to
successfully read the business information.

In [32]:
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
bus

Unnamed: 0,business_id,name,address,city,...,postal_code,latitude,longitude,phone_number
0,19,NRGIZE LIFESTYLE CAFE,"1200 VAN NESS AVE, 3RD FLOOR",San Francisco,...,94109,37.79,-122.42,+14157763262
1,24,OMNI S.F. HOTEL - 2ND FLOOR PANTRY,"500 CALIFORNIA ST, 2ND FLOOR",San Francisco,...,94104,37.79,-122.40,+14156779494
2,31,NORMAN'S ICE CREAM AND FREEZES,2801 LEAVENWORTH ST,San Francisco,...,94133,37.81,-122.42,
...,...,...,...,...,...,...,...,...,...
6403,94571,THE PHOENIX PASTIFICIO,200 CLEMENT ST,San Francisco,...,94118,,,+14154726100
6404,94572,BROADWAY DIM SUM CAFE,684 BROADWAY ST,San Francisco,...,94133,,,
6405,94574,BINKA BITES,2241 GEARY BLVD,San Francisco,...,94115,,,+14157712907


## Summary

In this section, we have introduced the command-line tools `ls`, `du`, `wc`,
`head`, `tail`, `cat`, and `file`. These tools help us understand the format and
structure of data files. We also use these tools to ensure that the data file
is small enough to read into `pandas` and that the correct encoding is used so
that values are not gibberish. Once a file is read into `pandas`, we have a
data frame and can proceed with analysis.


Shell commands give us a programmatic way to work with files, rather than a
point-and-click "manual" approach. They are useful for:

- Documentation: if you need to record what you did
- Error reduction: if you want to reduce typographical errors and other simple
  but potentially harmful mistakes
- Reproducibility: if you need to repeat the same process in the future or you
  plan to share your process with others, since you have a record of your actions
- Volume: if you have many repetitive operations to perform, the
  file you are working with is large, or you need to work quickly


After the data are in a pandas data frame, our next task is to get a handle on
the table's shape and granularity. We need to understand what a row represents
and the expected kind of values in a field before we can begin to check the
quality of the data. This is the topic of the next section.
