# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:reading)=
# Wrangling Data: Source File to Data Frame

Before we even begin to "read" the data, it often helps to understand a little about the source files. We want answers to a couple of basic questions:

+ How much data do I have?
+ How is it formatted?

Answers to these questions can be very helpful. For example, a rough estimate of the size of your data can help you choose an appropriate tool to view it, and knowledge of the format can help you successfully read the source file (containing the "raw" data) into a data frame.
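As a minimal sketch of how these two questions might be answered in Python, the snippet below checks a file's size on disk and peeks at its first lines without loading the whole file. The file `example.csv` and its contents are hypothetical stand-ins created only for illustration; in practice, `path` would point at your own source file.

```python
from itertools import islice
from pathlib import Path
import tempfile

# Hypothetical stand-in for a source file, written as bytes so that
# the size is the same on every platform.
path = Path(tempfile.mkdtemp()) / "example.csv"
path.write_bytes(b"name,score\nana,3.5\nben,4.0\n")

# How much data do I have? Ask the file system for the size in bytes.
size_bytes = path.stat().st_size

# How is it formatted? Read only the first two lines of the file.
with path.open(encoding="utf-8") as f:
    first_lines = [line.rstrip("\n") for line in islice(f, 2)]

print(size_bytes)   # 27
print(first_lines)  # ['name,score', 'ana,3.5']
```

Here the first line looks like a header and the fields are comma-separated, which suggests a CSV reader is a reasonable choice for loading the file.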

Once we have loaded the data into a table, we need to find its shape (the number of rows and columns) and granularity (what a row represents) to confirm that our data are what we expect. The data frame, or data table, is a useful mental representation of the data structure, where rows correspond to observations or records and columns
correspond to fields, features, or variables.
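The following sketch shows one way to check shape and granularity with pandas. The data are made up for illustration: we assume a hypothetical table where each row is meant to represent one student taking one quiz, so the `(student, quiz)` pair should identify a row uniquely.

```python
import io

import pandas as pd

# Hypothetical data: one row per (student, quiz) pair.
csv_text = "student,quiz,score\nana,1,3.5\nana,2,4.0\nben,1,4.5\n"
df = pd.read_csv(io.StringIO(csv_text))

# Shape: the number of rows and columns.
n_rows, n_cols = df.shape

# Granularity: if a row represents one student taking one quiz,
# then no (student, quiz) pair should appear more than once.
is_unique = not df.duplicated(subset=["student", "quiz"]).any()

print(n_rows, n_cols)  # 3 3
print(is_unique)       # True
```

If the uniqueness check fails, the rows may represent something finer or coarser than expected, which is worth resolving before any analysis.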

Although there are many types of structures that can represent data, in this book we primarily work with data tables, such as pandas DataFrames and SQL relations. ({numref}`Chapter %s <ch:text>` examines less-structured text data, and {numref}`Chapter %s <ch:web>` works with hierarchical data structures.) Why focus on data tables? First, research on how to best store and manipulate data tables has resulted in stable and efficient tools for working with them. Second, data in a tabular format are close cousins of matrices, the mathematical objects of the immensely rich field of linear algebra. Finally, data tables are very common.

In this chapter, we introduce typical file formats and encodings for plain text ({numref}`Section %s <ch:reading_format>`) and describe measures of file size ({numref}`Section %s <ch:reading_filesize>`). In these sections, we use Python tools to examine source files. Later, we introduce an alternative approach for working with files, the shell interpreter ({numref}`Section %s <ch:reading_command_line>`). Shell commands give us a programmatic way to get information about a file outside of the Python environment, and the shell can be especially useful with big data. Finally, in {numref}`Section %s <ch:reading_granularity>`, we address the topic of the data table's structure and shape.

In the next section, we provide brief descriptions of two datasets that we
use as examples throughout this chapter.