In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:files)=
# Wrangling Files

Before you work with data in Python, it helps to understand the source files that store the data.
You want answers to a couple of basic questions, such as:

+ How much data do you have?
+ How is the source file formatted?

The answers to these questions shape how you load the data. If your file is too large, you may need special approaches to read it in. If your file isn't formatted the way you expect, you may run into bad values after loading it into a dataframe.
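As a minimal sketch of how you might answer both questions from Python, the cell below measures a file's size in bytes and peeks at its first lines as plain text. The file `sample.csv` and its contents are hypothetical; we create it in a temporary folder only so that the example runs on its own.

```python
import tempfile
from pathlib import Path

# Hypothetical example file, written here only so the sketch is self-contained
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.csv"
sample_path.write_bytes(b"name,score\nada,91\ngrace,88\n")

# How much data do you have? Size on disk, in bytes
size_in_bytes = sample_path.stat().st_size

# How is the source file formatted? Read the first two lines as plain text
with sample_path.open(encoding="utf-8") as f:
    first_lines = [next(f).rstrip("\n") for _ in range(2)]

print(size_in_bytes)
print(first_lines)
```

Here the first lines show comma-separated values with a header row. Looking at a few raw lines like this, before loading the whole file, is a cheap way to confirm the format you expect.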

Although there are many types of structures that can represent data, in this
book we primarily work with data tables, such as Pandas DataFrames and SQL
relations. (But do note that {numref}`Chapter %s <ch:text>` examines
less-structured text data, and {numref}`Chapter %s <ch:web>` works with
hierarchical data structures.) We focus on data tables for several reasons. First, research on how
to best store and manipulate data tables has resulted in stable and efficient
tools for working with tables. Second, data in a tabular format are close
cousins of matrices, the mathematical objects of the immensely rich field of
linear algebra. Finally, data tables are common.

In this chapter, we introduce typical file formats and encodings for plain text, describe measures of file size, and use Python tools to examine source files. Later in the chapter, we introduce an alternative approach for working with files, the shell interpreter. Shell commands give us
a programmatic way to get information about a file outside of the Python
environment, and the shell can be especially useful with large files.
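To see why encoding and file size are connected, here is a small sketch that encodes the same text two ways and counts the bytes. The string is an arbitrary example chosen for illustration, not data from this chapter.

```python
# Arbitrary example text with one non-ASCII character
text = "café"

# The same characters take a different number of bytes in different encodings
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong encoding raises no error here;
# it silently produces the wrong characters
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

The accented character takes two bytes in UTF-8 and one byte in Latin-1, so the byte count of a file depends on its encoding as well as its text. Reading bytes with the wrong encoding is one way that bad values end up in a dataframe.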

Finally, once you have loaded the data into Python, the first things to find are its shape (the number of rows and columns) and granularity (what a row represents). These simple checks are the starting point for cleaning and analyzing your data.
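The following sketch shows what these two checks might look like in pandas. The dataframe is made up for illustration, and we assume each row represents one student's score on one exam.

```python
import pandas as pd

# Hypothetical data: we assume one row per (student, exam) pair
df = pd.DataFrame({
    "student": ["ada", "ada", "grace"],
    "exam": ["midterm", "final", "midterm"],
    "score": [91, 95, 88],
})

# Shape: the number of rows and columns
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Granularity: if a row represents one (student, exam) pair,
# then no pair should appear more than once
has_repeated_keys = df.duplicated(subset=["student", "exam"]).any()
print(has_repeated_keys)
```

If the duplicate check came back `True`, the assumed granularity would be wrong, or the data would contain repeated records that need attention before analysis.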

The first section provides brief descriptions of two datasets that we
use as examples throughout this chapter.