# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in recent pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:files_summary)=
# Summary

Data wrangling is an essential part of data analysis. Without it, we risk overlooking problems in data that can have major consequences for future analysis. This chapter covered an important first step in data wrangling: reading data from a plain text source file into a pandas DataFrame. We introduced different types of file formats and encodings, and we wrote code that can read data from these formats. We checked the size of source files and considered alternative tools for working with large datasets.
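As a reminder of how these steps fit together, here is a minimal sketch. The file name `visits.csv`, its columns, and its contents are hypothetical and are written to a temporary directory only so that the example runs on its own. The encoding check is a simple try-and-fall-back, not a full detection tool.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample file, created here only so the sketch is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / "visits.csv"
path.write_text("visit_id,age,sex\n1,34,F\n2,51,M\n3,27,F\n", encoding="utf-8")

# 1. Check the size on disk before loading anything into memory.
size_bytes = path.stat().st_size

# 2. Confirm that the bytes decode under the encoding we expect.
raw = path.read_bytes()
try:
    raw.decode("utf-8")
    encoding = "utf-8"
except UnicodeDecodeError:
    # Fallback assumption: latin-1 accepts any byte sequence.
    encoding = "latin-1"

# 3. Peek at the first two lines to confirm the delimiter and header row.
with path.open(encoding=encoding) as f:
    first_lines = [next(f).rstrip("\n") for _ in range(2)]

# 4. Read the file into a DataFrame.
visits = pd.read_csv(path, encoding=encoding)
```

Inspecting `size_bytes` and `first_lines` before calling `pd.read_csv` is the cheap part of the work; for a file too large for memory, we would stop after step 3 and reach for other tools.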

We also introduced command-line interface (CLI) tools as an alternative to Python for checking the format, encoding, and size of a file. These tools are handy for simple filesystem-oriented tasks like the ones we described. We have barely scratched the surface of what shell tools can do, and the shell is well worth learning: it is useful to have more than one tool for working with files.

Understanding the shape and granularity of a table gives us insight into what a row in the table represents. This helps us determine whether the granularity is mixed, whether aggregation is needed, and whether weights are required. After looking at the granularity of your dataset, you should have answers to the following questions.

+ *What does a record represent?*
+ *Do all records in a table capture granularity at the same level?* Sometimes a table contains additional summary rows that have a different granularity. 
+ *If the data were aggregated, how was the aggregation performed?* Summing and averaging are common types of aggregation. 
+ *What kinds of aggregations might we perform on the data?*
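Some of these questions can be checked in code. The following is a minimal sketch on a hypothetical `sales` table (the table, its column names, and the `"ALL"` marker for the summary row are all made up for illustration). It checks whether a candidate key identifies each record, sets aside a summary row of coarser granularity, and aggregates to a coarser level.

```python
import pandas as pd

# Hypothetical table: monthly revenue per store, plus one summary row.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "ALL"],
    "month": ["Jan", "Feb", "Jan", "Feb", "Total"],
    "revenue": [100, 120, 80, 90, 390],
})

# If a record is one (store, month) pair, the pair should never repeat.
is_unique_key = not sales.duplicated(subset=["store", "month"]).any()

# The summary row has a coarser granularity; set it aside before aggregating.
detail = sales[sales["store"] != "ALL"]

# Aggregate to a coarser level: one row per store.
per_store = detail.groupby("store")["revenue"].sum()
```

Leaving the summary row in place would double-count revenue in any total, which is one reason to settle the granularity before analyzing the data.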

Knowing your table's granularity is a first step to cleaning your data, and it informs you of how to analyze your data. For example, we saw the granularity of the DAWN survey is an ER visit. That naturally leads us to think about comparisons of patient demographics to the US as a whole.

The wrangling techniques in this chapter help us bring data from a source file into a data frame and understand its structure. Once we have a data frame, further wrangling is needed to assess quality and prepare the data for analysis; this is the topic of the next chapter.