# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' alias is not accepted by newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:wrangling)=
# Data Wrangling


We often need to perform preparatory work on our data before we can begin our
analysis. The amount of preparation can vary widely, but there are a few basics
for moving from raw data to data ready for analysis.  Beginning with a source
file, we "read" it into a data frame. We need to know the source file's size,
format, and encoding to properly create the data frame. We also need to
understand the table's shape (numbers of rows and columns) and granularity
(what a row represents). Once our data is in a data frame, we assess its
quality. We perform validity checks on individual data values and features. In
addition to checking the quality of the data, we learn whether or not the data
need to be transformed and reshaped to get ready for analysis. Quality checking
(and fixing) and transformation are often cyclical: the quality checks point us
toward transformations we need to make and we check the transformed features to
confirm whether the data are ready for analysis or need further cleaning and
transforming.
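
As a minimal sketch of these first steps, the cell below writes a tiny CSV file and then reads it back. The file name, the column names, and the `-999` value are all made up for illustration; they are assumptions, not one of the datasets used in this chapter.

```python
import os
import tempfile

import pandas as pd

# Hypothetical raw file, written here only so the sketch is self-contained
raw = "id,date,temp_f\n1,2021-01-01,54.0\n2,2021-01-02,-999\n3,2021-01-03,57.5\n"
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write(raw)

# Size in bytes: helps us judge whether the file fits comfortably in memory
size_bytes = os.path.getsize(path)

# Read the file, stating the format (CSV) and the encoding explicitly
df = pd.read_csv(path, encoding="utf-8")

# Shape: number of rows and number of columns
n_rows, n_cols = df.shape

# Granularity check: does each row correspond to one unique id?
one_row_per_id = df["id"].is_unique

# Validity check: flag temperatures outside an assumed plausible range
implausible = df[(df["temp_f"] < -100) | (df["temp_f"] > 150)]
```

Here `implausible` holds the row with the `-999` reading, which would point us toward a transformation (for example, treating that value as missing) followed by another round of checks.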

Note that the data frame is a useful mental representation of the data's
structure. Although there are many types of structures that can represent data,
in this book, we will almost always work with data tables, such as data frames
and relations, where rows correspond to observations/records and columns
correspond to fields/features/variables.  We note that {numref}`Chapter %s <ch:text>` examines
less-structured text data, and {numref}`Chapter %s <ch:web>` works with hierarchical data
structures. Why restrict ourselves to data tables? First, research on how to
best store and manipulate data tables has resulted in stable and efficient
tools for working with tables. Second, data in a tabular format are close
cousins of matrices, the mathematical objects of the immensely rich field of
linear algebra. Finally, data tables are very common.
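
To illustrate the link between data tables and matrices, here is a small sketch with a made-up two-column data frame (the column names and values are hypothetical). When all columns are numeric, the table converts directly to a NumPy array that supports matrix operations.

```python
import numpy as np
import pandas as pd

# Hypothetical table: rows are observations, columns are features
df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "weight": [4.0, 5.0, 6.0]})

# The same values as a 3 x 2 matrix
X = df.to_numpy()

# A matrix operation on the table's values: the 2 x 2 product X^T X
gram = X.T @ X
```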

Depending on the data source, we often have different expectations for quality.
Some datasets require extensive wrangling to get them into an analyzable form,
and other datasets arrive clean and we can quickly launch into modeling.  Below
are some examples of data sources and how much wrangling we might expect to do. 

- Data from a scientific experiment or study are typically clean and
  well-documented, and have a simple structure. These data are organized to be
  broadly shared so that others can build on and reproduce the findings.  They
  are typically ready for analysis after little to no wrangling.
- Data from government surveys often come with very detailed codebooks and
  metadata describing how the data are collected and formatted, and these
  datasets are also typically ready for exploration and analysis.
- Administrative data can be clean, but without inside knowledge of the source
  we often need to check their quality extensively. Also, since we are typically
  using these data for a purpose other than the one they were collected for, we
  usually need to transform features and combine data tables.
- Informally collected data, such as data scraped from the Web, can be quite
  messy and tend to come with little documentation. For example, texts,
  tweets, blogs, Wikipedia tables, etc. usually require formatting and cleaning
  to transform them into quantitative information that is ready for analysis.


In this chapter, we break down data wrangling into the following stages: learn
the source file's format; determine the table's structure and shape; assess
data quality; and transform and reshape the data. An important step in
assessing the quality of the data is to consider its scope. Data scope is
covered in {numref}`Chapter %s <ch:data_scope>`, and we refer you to that
chapter for a fuller treatment of the topic. 
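
The last two stages, assessing quality and then transforming and reshaping, can be sketched on a toy table. Everything in this cell is an assumption made for illustration: the station and year columns, the values, and the use of `-999` as a missing-value code.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per station, one column per year
wide = pd.DataFrame({
    "station": ["A", "B"],
    "2020": [54.0, -999.0],  # -999 is an assumed missing-value code
    "2021": [55.5, 61.0],
})

# Assess quality: count values that match the assumed missing-value code
n_coded_missing = int((wide[["2020", "2021"]] == -999).sum().sum())

# Transform: replace the code with NaN so it does not distort summaries
cleaned = wide.replace(-999.0, np.nan)

# Reshape: one row per (station, year) instead of one row per station
tidy = cleaned.melt(id_vars="station", var_name="year", value_name="temp_f")
tidy["year"] = tidy["year"].astype(int)

# Recheck after transforming: the coded value should now appear as missing
n_missing = int(tidy["temp_f"].isna().sum())
```

Note that reshaping changes the granularity: each row of `tidy` represents a station in a given year, whereas each row of `wide` represents a station.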

To clean and prepare data, we also rely on visualizations and exploratory data
analysis. In this chapter, however, we'll focus on data wrangling and cover the
other related topics in more detail in the {ref}`ch:eda` and {ref}`ch:viz`
chapters. 

In the next section, we provide brief descriptions of three datasets that we
use as examples throughout this chapter.