In [15]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:reading_format)=
# File Formats

The data from the DAWN survey and the San Francisco restaurant inspections are made available online as plain text files. To explore these data, we first need to get them into a `pandas` DataFrame, and to do that we need to know the source file's format and encoding. For example, a comma-separated values (CSV) file contains plain text where data values are separated by commas and records are delimited by newlines (`"\n"` or `"\r\n"` characters). CSV files are a natural format for storing data with a table structure: each line in the file corresponds to a row in the table, and the column/feature values within a line are separated by commas. Once we identify the file format, we know how to read it into a `pandas` DataFrame. Other potentially important aspects of the file are its encoding and size. If we don't specify the encoding correctly, then the values in the data frame might contain gibberish, and if the file is huge, then we might not be able to load it into Python. We give an overview of file format and encoding in this section.

A file format describes how data are stored on the computer. This is in contrast to the data's structure, which is a mental representation of our data. The mental model for structure helps us interact with and analyse the data. For example, a table representation corresponds to data values arranged in rows and columns, and when we work with these data we think about transforming columns, aggregating rows, and cleaning values. On the other hand, understanding how the raw data are stored on the computer (the source file) helps us figure out how to read the data into Python and work with them as a data frame. In this section, we introduce several popular formats used to store data tables as plain text.

**Delimited format.** These formats use a specific character to separate data values. Typical separators are a comma (comma-separated values, or CSV for short), a tab (tab-separated values, or TSV), white space, or a colon. These formats are natural for storing data that have a table structure. In these files, each line represents a record, and within a line, the record's values are delimited by the comma character (",") for CSV or the tab character ("\t") for TSV. The first line of these files often contains the names of the table’s columns/features.

As an example, the San Francisco restaurant scores are stored in CSV-formatted
files. Here are the first few lines of the `inspections.csv` file.

In [16]:
# Print the first six lines; the with block closes the file for us
with open("data/inspections.csv", 'r') as insp_file:
    for _ in range(6):
        print(insp_file.readline())

"business_id","score","date","type"

19,"94","20160513","routine"

19,"94","20171211","routine"

24,"98","20171101","routine"

24,"98","20161005","routine"

24,"96","20160311","routine"



Notice that the field names appear in the first line, comma-separated and in quotation marks. There are four fields: the business identifier, the restaurant's score, the date of the inspection, and the type of inspection. Each line in the file corresponds to one inspection, and the ID, score, date, and type values are separated by commas. In addition to identifying the file format, we also want to identify the format of the features. Two things stand out: the scores and the dates both appear as strings. We will want to convert the scores to numbers so we can calculate summary statistics and create a histogram of scores, and we will convert the dates into a date-time format so that we can make time-series plots. We show how to carry out these transformations in {numref}`Section %s <ch:wrangling_transformations>`.
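
As a rough illustration of what "delimited" means in practice, the sketch below parses the six lines shown above with the standard library's `csv` module (used here instead of `pandas` only to make the string values explicit) and then converts the scores and dates. The variable names are illustrative and not part of the book's code.

```python
import csv
import io
from datetime import datetime

# The first six lines of inspections.csv, as printed above.
raw = '''"business_id","score","date","type"
19,"94","20160513","routine"
19,"94","20171211","routine"
24,"98","20171101","routine"
24,"98","20161005","routine"
24,"96","20160311","routine"
'''

# DictReader takes the field names from the first line and splits each
# following line on commas; every value it returns is a string.
rows = list(csv.DictReader(io.StringIO(raw)))

# Convert the string-valued features into more useful types.
scores = [int(row["score"]) for row in rows]
dates = [datetime.strptime(row["date"], "%Y%m%d") for row in rows]

print(rows[0])
print("mean score:", sum(scores) / len(scores))
print("date range:", min(dates).date(), "to", max(dates).date())
```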

**Fixed-width Format.** This format does not use delimiters to separate data values. Instead, the values for a specific field appear in the exact same position in each line. The DAWN data follow this format. Below are two lines from the DAWN survey data. Notice how the values appear to align from one row to the next. Notice also that they seem to be squished together with no separators. You need to know the exact position of each piece of information in a line in order to make sense of it. SAMHDA provides a 2,000-page codebook with all of the information needed to read the file. In the codebook we find that the age field appears in positions 34-35 and is coded in intervals from 1 to 11. The records below have age categories of 4 and 11, and the codebook tells us that a code of 4 stands for the age bracket "6 to 11", and 11 is for "65+".

In [17]:
# Print the first two records; the with block closes the file for us
with open("data/DAWN-Data.txt") as dawn_file:
    print(dawn_file.readline())
    print(dawn_file.readline())

     1 2251082    .9426354082   3 4 1 2201141 2 865 105 1102005 1 2 1 2.00-7.00-7.0000-7.0000-7.00001255 105 1142032 4 1 1 2.50 5.00 5.0100-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.000
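
To make the idea of reading by position concrete, here is a minimal sketch that pulls the age code out of the start of the record printed above. It assumes the leading characters of the record are as displayed (the spacing is rebuilt explicitly so the column positions can be checked), and the label dictionary holds only the two codes mentioned in the text.

```python
# Start of the first DAWN record printed above, rebuilt with explicit
# space counts so the column positions are easy to verify.
line = " " * 5 + "1 2251082" + " " * 4 + ".9426354082" + " " * 3 + "3 4 1 2201141"

# The codebook places age in positions 34-35, counting from 1.
# Python slices count from 0 and exclude the end, so that is [33:35].
AGE_START, AGE_END = 34, 35
age_code = int(line[AGE_START - 1:AGE_END])

# Only the two age codes described in the text; the codebook defines 11.
age_labels = {4: "6 to 11", 11: "65+"}
print(age_code, age_labels.get(age_code, "see codebook"))
```

The same slicing applies to every field in the file, which is why the codebook's table of positions is essential for this format.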

:::{note}

**Note**: A widely adopted convention is to use the filename extension, such as `.csv`, `.tsv`, and `.txt`, to indicate the format of the contents of the file. File names that end with `.csv` are expected to contain comma-separated values, `.tsv` tab-separated values, and `.txt` generally is plain text. However, these extension names are only suggestions. Even if a file has a `.csv` extension, the actual contents might not be formatted properly! It's good practice to inspect the contents of the file before loading it into a data frame. If the file is not too large, you can open and examine it with a plain text editor. Otherwise, you might call `readline()` a few times to view a couple of lines, or use shell commands ({numref}`Section %s <ch:reading_command_line>`).

:::
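
One way to follow this advice is a small helper that returns only the first few lines of a file. The function name `peek` and its parameters are hypothetical, not part of the book's code; the sketch relies on the fact that a Python file object yields lines lazily, so the rest of the file is never read.

```python
from itertools import islice

def peek(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest."""
    # errors="replace" keeps the preview from failing on unexpected bytes
    with open(path, encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `peek("data/inspections.csv", 3)` would return the header line and the first two records as a list of strings.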

Other plain text formats that are popular include hierarchical formats and loosely structured formats (in contrast to formats that support table structures). These are covered in greater detail in other chapters, but for completeness we briefly describe them here.

**Hierarchical Format.** JavaScript Object Notation (JSON) is a common format used for communication by web servers. JSON files have a hierarchical structure with keys and values similar to a Python dictionary. Each record in a JSON file can have different fields and records can contain tables, making for a potentially complicated structure. The eXtensible Markup Language (XML) and HyperText Markup Language (HTML) are common formats for storing documents on the Internet. Like JSON, these files also contain data in a hierarchical, key-value format. We cover both formats (JSON and XML) in more detail in {numref}`Chapter %s <ch:web>`.
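
The following sketch shows how this hierarchy looks once loaded into Python. The two records are made up for illustration (they are not from the book's data sets); note that they have different fields and that one of them contains a nested list.

```python
import json

# Hypothetical records: different fields per record, one nested list.
text = '''
[
  {"name": "Cafe A", "inspections": [{"score": 94}, {"score": 98}]},
  {"name": "Cafe B", "phone": "555-0100"}
]
'''

# json.loads turns JSON objects into dicts and JSON arrays into lists.
records = json.loads(text)
for rec in records:
    print(rec["name"], sorted(rec.keys()))

# Reaching into the nested structure of the first record.
nested_scores = [insp["score"] for insp in records[0].get("inspections", [])]
print(nested_scores)
```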

**Loosely Structured Formats.** Web logs, instrument readings, and program logs typically provide data in plain text.  For example, below is one line of a Web log (we've split it across multiple lines for readability). It contains information such as the date and time and type of request made to the Web site.

```
169.237.46.168 - -
[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328
"http://anson.ucdavis.edu/courses"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
```

There is structure present, but not in a simple delimited file format.  We see that the date and time appear between square brackets and the type of request (GET in this case) follows the date-time information and appears in quotes. Later in {numref}`Chapter %s <ch:text>`, we will use these observations about the log's structure and string manipulation tools to extract the values of interest into a data table.
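
As a preview of that approach, the sketch below applies one regular expression to the log line above (rejoined into a single line with spaces). The pattern is only one of many that would work, and it is an illustration rather than the method used later in the book.

```python
import re
from datetime import datetime

# The log entry shown above, joined back into one line.
log_line = (
    '169.237.46.168 - - [26/Jan/2004:10:47:58 -0800] '
    '"GET /stat141/Winter04 HTTP/1.1" 301 328 '
    '"http://anson.ucdavis.edu/courses" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'
)

# Date-time sits between square brackets; the request follows in quotes.
pattern = re.compile(r'\[(?P<when>[^\]]+)\]\s*"(?P<method>[A-Z]+) (?P<path>\S+)')
match = pattern.search(log_line)

# Parse the bracketed text into a timezone-aware datetime.
when = datetime.strptime(match.group("when"), "%d/%b/%Y:%H:%M:%S %z")
print(when.isoformat(), match.group("method"), match.group("path"))
```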

As another example, below is a single recording of measurements taken with a wireless device. The device reports the timestamp, identifier, location of the device, and the signal strengths that it picks up from other devices. This information uses a combination of formats: key=value pairs, semicolon delimited, and comma delimited values. 

```
t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;
00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;
```

As with the Web logs, we can use string manipulation and the patterns in the recordings to extract features into a table.
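
A minimal sketch of that idea for the recording above: split on semicolons, then on the first `=`, then on commas. It assumes that the first number after each device address is the signal strength; the remaining values are kept as unparsed strings because the text does not say what they mean.

```python
# The single recording shown above, joined into one string.
record = (
    "t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;"
    "00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;"
)

device_keys = {"t", "id", "pos", "degree"}
device = {}    # information about the reporting device
readings = []  # one entry per detected device (addresses can repeat)

for field in record.split(";"):
    if not field:  # skip the empty piece after the trailing semicolon
        continue
    key, _, value = field.partition("=")  # split on the first "=" only
    if key in device_keys:
        device[key] = value
    else:
        # Assumption: the first comma-separated value is the signal strength.
        signal, *rest = value.split(",")
        readings.append({"mac": key, "signal": int(signal), "other": rest})

position = [float(v) for v in device["pos"].split(",")]
print(device["t"], position)
print(readings)
```

Collecting the readings in a list rather than a dictionary matters here, because the same address appears twice in this recording.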

All of this is to show: there are many types of file formats that store data!
To keep the chapter manageable, we'll focus on data tables.

## File Encoding

We have mentioned already that our example source files are plain text.
The most basic kind of plain text supports only standard ASCII characters, which include the upper and lowercase English letters, numbers, punctuation symbols, and spaces. In short, ASCII is a particular character **encoding**. For example, in ASCII, the bits 1000001 stand for the letter A, 1000010 for B, etc.

The ASCII encoding cannot represent many special characters or the characters of most other languages. More modern character encodings can represent many more characters. Common encodings for documents and Web pages are Latin-1 (ISO-8859-1) and UTF-8. UTF-8 can represent over one million characters and is backwards compatible with ASCII, meaning that it uses the same representation for English letters, numbers, and punctuation as ASCII.
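
These claims are easy to check in Python. The short sketch below prints the seven-bit ASCII patterns for A and B, confirms that ASCII text has the same bytes in UTF-8, and shows that a non-ASCII character such as Ñ (chosen here only as an example) is stored differently in Latin-1 and UTF-8.

```python
# Seven-bit ASCII patterns for the letters A and B.
bits = {ch: format(ord(ch), "07b") for ch in "AB"}
print(bits)

# Backwards compatibility: ASCII text has identical bytes in UTF-8.
same_bytes = "Data 101!".encode("ascii") == "Data 101!".encode("utf-8")
print(same_bytes)

# A non-ASCII character is one byte in Latin-1 but two bytes in UTF-8.
n_tilde = "\u00d1"  # the letter Ñ
latin1_bytes = n_tilde.encode("latin-1")
utf8_bytes = n_tilde.encode("utf-8")
print(latin1_bytes, utf8_bytes)
```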

The `chardet` package can help us determine a file's encoding. No program can say with 100% certainty which encoding is used for a file because, after all, a file is just a sequence of bits and bytes. For this reason, `chardet.detect()` returns the most likely encoding for a file and a number between 0 and 1 that reflects the confidence in the classification of the encoding. We use `chardet.detect()` to detect the encodings of our example files.

In [None]:
import chardet
import glob

# For each file, print its name, detected encoding, and confidence
print("File".ljust(25), "Encoding".ljust(10), "Confidence")
for filename in glob.glob('data/*'):
    # Read raw bytes; the with block closes the file automatically
    with open(filename, 'rb') as rawdata:
        result = chardet.detect(rawdata.read())
    # detect() can report None when it cannot guess an encoding
    print(filename.ljust(25), str(result['encoding']).ljust(10), result['confidence'])

File                      Encoding   Confidence
data/inspections.csv      ascii      1.0
data/co2_mm_mlo.txt       ascii      1.0
data/violations.csv       ascii      1.0
data/DAWN-Data.txt        ascii      1.0
data/legend.csv           ascii      1.0


The detection function is quite certain that all but one of the files are ASCII. The exception is `businesses.csv`, which appears to have an ISO-8859-1 encoding. We run into trouble if we ignore this encoding and try to read the business file into `pandas` without specifying it.

```
pd.read_csv('data/businesses.csv')
```  

The above call results in the following error.

```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
```

To successfully read the data, we must specify the ISO-8859-1 encoding.

In [None]:
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
bus
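
The same failure and fix can be reproduced on a handful of bytes. In this sketch the byte string is made up (it is not taken from `businesses.csv`); it contains the byte `0xd1`, which is the letter Ñ in ISO-8859-1 but cannot stand alone in UTF-8.

```python
# Hypothetical bytes as they might appear in an ISO-8859-1 file.
raw_bytes = "CA\u00d1ON CAFE".encode("latin-1")

# Decoding with the wrong encoding raises an error at the 0xd1 byte.
try:
    raw_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print("utf-8 failed:", err)

# Decoding with the encoding the bytes were written in succeeds.
text = raw_bytes.decode("latin-1")
print(text)
```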

## Summary

In this section, we have introduced the tabular structure representation for data that we use throughout much of the rest of the book. We have introduced formats for plain text data that are widely used for storing and exchanging tables. The comma-separated-value format is the most common, but others, such as tab-separated and fixed-width, are also prevalent.

In the next section, we address the issue of file size, and in {numref}`Section %s <ch:reading_command_line>`, we demonstrate how to use shell commands to find information about source files.