In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:wrangling_formats)=
# File Formats

The data from the Mauna Loa Observatory, the DAWN survey, and the San Francisco
restaurant inspections are all made available online as plain text files. In
order to explore these data, we first need to get them into a `pandas`
dataframe. To do this, we need to know the file format.  For example, a
comma-separated-value formatted (CSV) file contains plain text where the data
values are separated using commas, and records are delimited by newlines
(`"\n"` or `"\r\n"` characters).  CSV files are a natural format for storing
data with a table structure; each line in the source file corresponds to a row
in the table, and the column/feature information in a line is separated by
commas.  Once we identify the file format, we know how to read it into Python
as a dataframe. We also need to know the file's encoding and size. If we don't
specify the encoding correctly, the values might contain gibberish, and if the
file is huge, we might not be able to read it into a dataframe. We give an
overview of file format and encoding in this section, and in
{numref}`Section %s <ch:wrangling_command_line>`, we demonstrate how to use
shell commands to find this information for a source file.
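
To see how a wrong encoding produces gibberish, here is a minimal sketch. The sample string is made up for illustration; it is written to bytes as UTF-8 and then decoded with two different encodings.

```python
# Hypothetical sample text (not from any of the datasets in this chapter).
text = "café"

# Write the text to bytes using UTF-8, as a file on disk might store it.
raw = text.encode("utf-8")

# Decoding with the encoding that was used to write the bytes recovers the text.
right = raw.decode("utf-8")

# Decoding with a different encoding (here Latin-1) silently gives gibberish.
wrong = raw.decode("latin-1")

print(right)
print(wrong)
print(len(raw), "bytes")
```

When reading a file, the encoding is typically passed to the reader, for example through the `encoding` argument of Python's built-in `open` or of `pd.read_csv`.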


A dataset’s file format describes how the data are stored. This is in contrast
to the data's structure, which is the mental representation of our data.  This
mental model for the structure helps us work with the data. For example, a
table representation corresponds to data values arranged in rows and columns,
and when we work with these data we think about transforming columns,
aggregating rows, and cleaning values. On the other hand, understanding how the
data table is stored on the computer (the data source) helps us figure out how
to read the data into Python and work with it as a dataframe. In this section,
we introduce several popular formats used to store data tables as plain text.

**Delimited Format.** These formats use a particular character to separate
values in a line. Typically the separators are: a comma (Comma-Separated-Values
or CSV for short), a tab (Tab-Separated Values or TSV), whitespace, or a colon.
These formats are natural for storing data with table structure. In these
files, each row represents a record; data values are delimited by a comma
character (`,`) for CSV or a tab character (`\t`) for TSV. The first line of
these files often contains the names of the data’s columns/features.

For example, the San Francisco restaurant scores are stored in CSV-formatted
files. Here are the first few lines of the `inspections.csv` file.

```
"business_id","score","date","type"
19,"94","20160513","routine"
19,"94","20171211","routine"
24,"98","20171101","routine"
24,"98","20161005","routine"
24,"96","20160311","routine"
31,"98","20151204","routine"
45,"78","20160104","routine"
45,"88","20170307","routine"
45,"85","20170914","routine"
```

Notice that the field names appear in the first line, comma-separated and in
quotation marks. We see there are four fields: the business identifier, the
restaurant's score, the date of the inspection, and the type of inspection.
Each line in the file corresponds to one inspection, and the ID, score, date,
and type values are separated by commas. In addition to identifying the file
format, we also want to identify the format of the values. One thing stands
out: the scores and dates both appear as strings. We will want to convert the
score values to numbers so we can calculate summary statistics and create a
histogram of scores. And, we will need to convert the date into a date-time
format so that we can make time series plots. We show how to carry out these
transformations in {numref}`Section %s <ch:wrangling_transformations>`.
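
As a minimal sketch of these observations, the snippet below parses the first few lines of the excerpt with Python's standard `csv` module. This module does no type conversion, so every value comes back as a string. The conversions at the end are only an illustration and not necessarily the approach taken later in the book. Note that `pd.read_csv` infers column types on its own, so the types it reports can differ from what we see in the raw text.

```python
import csv
import io
from datetime import datetime

# The first lines of inspections.csv, copied from the excerpt above.
sample = '''"business_id","score","date","type"
19,"94","20160513","routine"
19,"94","20171211","routine"
24,"98","20171101","routine"
'''

# DictReader uses the first line as field names and splits each record on commas.
rows = list(csv.DictReader(io.StringIO(sample)))

first = rows[0]
print(first)  # every value is still a string at this point

# Convert the score to a number and the date to a date object.
score = int(first["score"])
date = datetime.strptime(first["date"], "%Y%m%d").date()
print(score, date)
```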


**Fixed-width Format.** This format does not use delimiters to separate values.
Instead, the values for a field appear in the exact same position in each line.
The DAWN data follow this format. Below are four partial lines from the DAWN
survey data. Notice how the values appear to line up from one row to the next.
Notice also that they seem to be squished together. You need to know the exact
format of the file to make sense of it. SAMHDA provides a 2,000-page codebook
with all of the information needed to read the file. In the codebook we find
that the age field appears in positions 34-35 and is coded in intervals from 1
to 11. The four records below have age categories of 4, 11, 11, and 2, and they
stand for 18 to 20, 65+, 65+, and 6 to 11, respectively.



```
     1 2251082    .9426354082   3 4 1 2201141 2 865 105 1102005 1 2 1 2.00-7.00-7.0000-7.0000...
     2 2291292   5.9920106887   911 1 3201134 12077  81  82 283-8 21767.0067.0167.0140-7.0000...
     3 7 7 251   4.7231718669   611 2 2201143 12313   1  12  -7-8 21764.0064.99-7.0000-7.0000...
     410 8 292   4.0801470012   6 2 1 3201122 1 234 358  99 215 2 21773.0073.0173.0106-7.0000...
```
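
To make the idea of fixed positions concrete, here is a small sketch that pulls the age field out of the four records above by slicing each line. The lines are truncated to their first 45 characters, and the label dictionary holds only the three age codes decoded in the text; the variable names are our own.

```python
# Prefixes of the four DAWN records shown above (each line is truncated here).
records = [
    "     1 2251082    .9426354082   3 4 1 2201141",
    "     2 2291292   5.9920106887   911 1 3201134",
    "     3 7 7 251   4.7231718669   611 2 2201143",
    "     410 8 292   4.0801470012   6 2 1 3201122",
]

# Positions 34-35 in the codebook count from 1, so the Python slice is [33:35].
AGE = slice(33, 35)

# Only the three age codes decoded in the text; the codebook defines all 11.
age_labels = {2: "6 to 11", 4: "18 to 20", 11: "65+"}

# int() ignores the leading space in a one-digit code such as " 4".
ages = [int(line[AGE]) for line in records]
print(ages)
print([age_labels[a] for a in ages])
```

For a whole file, `pd.read_fwf` accepts a `colspecs` argument with zero-based, half-open column ranges, so the same field would likely be specified as `(33, 35)`.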

:::{note}

The widely adopted convention is to use the file extension, such as
`.csv`, `.tsv`, and `.txt`, to indicate the format of the contents of the file.
File names that end with `.csv` are expected to contain comma-separated values,
`.tsv` tab-separated values, and `.txt` generally plain text. However, these
extension names are only suggestions. Even if a file has a `.csv` extension,
the actual data it holds might not be formatted properly! It's good practice to
check the contents of the file either with shell commands ({numref}`Section %s
<ch:wrangling_command_line>`) or, if the file is not too large, by opening and
inspecting it with a plain text editor.

:::
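
One way to check a file's contents from Python, sketched below, is to read only the first few lines and count candidate separators. The helper names `peek` and `count_separators` and the file `misleading.csv` are hypothetical, created just for this illustration.

```python
import os
import tempfile

def peek(path, n=3):
    """Return the first n lines of a text file without reading the whole file."""
    lines = []
    with open(path, encoding="utf-8") as f:
        for _ in range(n):
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip("\n"))
    return lines

def count_separators(line, candidates=(",", "\t", ";", ":")):
    """Count how often each candidate separator occurs in one line."""
    return {sep: line.count(sep) for sep in candidates}

# Hypothetical file: the name ends in .csv but the values are tab-separated.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "misleading.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("id\tscore\tdate\n19\t94\t20160513\n24\t98\t20171101\n")
    head = peek(path)

counts = [count_separators(line) for line in head]
print(head)
print(counts[0])
```

Here the counts show two tabs and no commas on each line, which suggests the file is tab-separated despite its extension.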

Other plain text formats that are widely used include hierarchical formats and
loosely structured formats. These are covered in greater detail in other
chapters, but for completeness we briefly describe them here.

A dataset’s **structure** is a mental representation of the data. For example,
we represent data that have a **table** structure by considering data values
arranged in rows and columns. In contrast, we can imagine
a **hierarchical** structure as a family tree that allows a data value to
contain other values.


**Hierarchical Format.** JavaScript Object Notation (JSON) is a common format
used for communication by web servers. JSON files have a hierarchical structure
with keys and values similar to a Python dictionary. Each record in a JSON file
can have different fields and records can contain tables, making for a
potentially complicated structure.
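
The short sketch below shows what this looks like in Python. The JSON text is invented for illustration: the two records have different fields, and one of them contains a nested list of records.

```python
import json

# Hypothetical JSON text, invented for illustration.
text = '''
[
  {"name": "Cafe A", "inspections": [{"score": 94}, {"score": 98}]},
  {"name": "Cafe B", "phone": "555-0100"}
]
'''

records = json.loads(text)

# Each record becomes a Python dictionary; the records need not share fields.
fields = [sorted(r.keys()) for r in records]
print(fields)

# A value can itself hold a table-like list of records.
scores = [insp["score"] for insp in records[0]["inspections"]]
print(scores)
```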

The eXtensible Markup Language (XML) and HyperText Markup Language (HTML) are
common formats for storing documents on the Internet. Like JSON, these files
store data hierarchically: named elements can contain values and other
elements. We cover these formats in more detail in {numref}`Chapter %s <ch:web>`.

**Loosely Structured Formats.** Web logs, instrument readings, and program logs
typically provide data in plain text.  For example, below is one line of a Web
log (we've split it across multiple lines for readability). It contains
information such as the date and time and type of request made to the site.

```
169.237.46.168 - -
[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328
"http://anson.ucdavis.edu/courses"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
```

The line has structure, but it does not follow a simple delimited file
format.  We see that the date and time appear between square brackets and the
type of request (GET in this case) follows the date-time information and
appears in quotes. Later in {numref}`Chapter %s <ch:text>`, we will use these
observations about the structure and string manipulation tools to extract the
values of interest into a data table.
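
As a preview, here is one possible sketch of that extraction using a regular expression. The pattern is our own, written for this single line, and is not necessarily the approach used in the later chapter.

```python
import re
from datetime import datetime

# The log line from above, rejoined into a single string.
line = (
    '169.237.46.168 - - '
    '[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 '
    '"http://anson.ucdavis.edu/courses" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'
)

# The date-time sits between square brackets; the request type is the first
# word inside the quotes that follow.
pattern = re.compile(r'\[([^\]]+)\]\s*"(\w+) ')
match = pattern.search(line)
stamp, request_type = match.groups()

# Turn the date-time text into a timezone-aware datetime object.
when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
print(request_type, when.isoformat())
```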

As another example, below is a single recording of measurements taken with a
wireless device. The device reports the timestamp, identifier, location of the
device, and the signal strengths that it picks up from other devices. This
information appears to use a combination of formats: key=value pairs,
semicolon delimited, and comma delimited values. 

```
t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;
00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;
```
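
A rough sketch of how the nested delimiters in this record could be pulled apart is shown below: split on semicolons first, then on the first equals sign, then on commas. We keep the result as a list of pairs rather than a dictionary because the same device identifier can appear more than once.

```python
# The recording from above, rejoined into a single string.
record = (
    "t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;"
    "00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;"
)

pairs = []
for chunk in record.split(";"):
    if not chunk:  # the trailing semicolon leaves an empty piece
        continue
    key, _, value = chunk.partition("=")   # split at the first "=" only
    pairs.append((key, value.split(",")))  # values may be comma-delimited

for key, values in pairs:
    print(key, values)
```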

All of this is to show: there are many types of file formats that store data!
To keep the chapter manageable, we'll focus on data tables---all of the
examples in this chapter have a table structure.

## Summary

In this section, we have introduced the tabular structure representation for
data that we use throughout much of the rest of the book. We have introduced
examples of formats for plain text data for storing and exchanging tables. The
comma-separated-value format is the most common, but others such as
tab-separated and fixed-width are also widely used.  Behind the scenes, we have
used command line tools to display the contents of the files presented in this
section. The next section introduces these tools more formally, describing the
syntax of shell commands and providing more examples of their use.
