In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
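# Import the display function itself; importing the `display` module would
# make the display(df) call in display_df fail.
from IPython.display import display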

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
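# Use the full option name; the bare 'precision' pattern is ambiguous in
# recent pandas versions.
pd.set_option('display.precision', 2)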
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

# Working with Dataframes using pandas

Every data scientist needs to work with data stored as a table. In Python, we
work with data tables using a special data structure called a *dataframe*.
Here's an example:


```{image} figures/intro_df.png
:alt: dataframe
:align: center
```

The defining trait of a dataframe is that both rows and columns have labels
that we use to select data. For example, we can get the column of data under
the `Motto` label. We can also get the row of data under the `Alabama` label.
Having labels for both rows and columns is very convenient for working with
data, as we will see in this chapter.
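
To make label-based selection concrete, here is a minimal sketch. The small `states` table is illustrative only and is not necessarily the table shown in the figure; the variable names are our own.

```python
import pandas as pd

# Illustrative data only; not necessarily the table shown in the figure.
states = pd.DataFrame(
    {
        'Capital': ['Montgomery', 'Juneau'],
        'Motto': ['Audemus jura nostra defendere', 'North to the future'],
    },
    index=['Alabama', 'Alaska'],  # row labels
)

# Select a column by its label. The result is a Series indexed by row label.
mottos = states['Motto']

# Select a row by its label with .loc. The result is a Series indexed by
# column label.
alabama = states.loc['Alabama']

# Select a single value by giving both the row label and the column label.
alabama_motto = states.loc['Alabama', 'Motto']
```

Both selections use names rather than integer positions, so the code keeps working if the rows or columns are reordered.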

Data scientists use the `pandas` library when working with dataframes in
Python. This chapter explains how to use `pandas`. First, we'll explain the
main objects that `pandas` provides: the `DataFrame` and `Series` classes.
Then, we'll show you how to use `pandas` to perform common data manipulation
tasks, like slicing, filtering, sorting, grouping, and joining. We won't go
over every single `pandas` function---there are too many! Instead, we'll give
you enough to start working with real datasets. With those skills, you'll know
enough to read and understand the `pandas` documentation yourself.
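
As a rough preview, the tasks named above map onto `pandas` calls like the ones in this sketch. The `sales` and `managers` tables are hypothetical and made up for illustration; the chapter covers each operation properly later.

```python
import pandas as pd

# Hypothetical data, made up for this sketch.
sales = pd.DataFrame(
    {'region': ['east', 'west', 'east', 'west'],
     'units': [5, 3, 8, 1]},
    index=['a', 'b', 'c', 'd'],
)
managers = pd.DataFrame(
    {'region': ['east', 'west'],
     'manager': ['Ana', 'Bo']},
)

# Slicing: rows labeled 'a' through 'c' (label slices include the endpoint).
sliced = sales.loc['a':'c']

# Filtering: keep only rows where units is greater than 3.
filtered = sales[sales['units'] > 3]

# Sorting: order rows by units, smallest first.
sorted_sales = sales.sort_values('units')

# Grouping: total units for each region.
totals = sales.groupby('region')['units'].sum()

# Joining: attach each region's manager to the matching sales rows.
joined = sales.merge(managers, on='region')
```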

````{admonition} What's the difference between a matrix, relation, and dataframe?

There are multiple ways to put data in a table. Here's a brief overview of the
differences between three common data tables: the matrix, relation, and
dataframe.

A matrix stores numbers:

$$
\begin{aligned}
\mathbf{X} = \begin{bmatrix}
1 & 0 \\
0 & 4 \\
0 & 0 \\
\end{bmatrix}
\end{aligned}
$$

Neither columns nor rows of a matrix have labels. Matrices are the basic object
in linear algebra, a branch of math that is highly useful in machine learning.
You'll see matrices later when we go over modeling.

A relation is similar to a dataframe, but only has labels for columns, not
rows:

```{image} figures/intro_relation.png
:alt: relation
:align: center
```

Relations are the main data object for relational database systems, which we
cover in the SQL chapter of this book.

This overview covers only the basic differences between matrices, relations,
and dataframes. For a more thorough treatment, see Petersohn et al. 2020
[^petersohnScalable2020].
````
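
The contrast between a matrix and a dataframe can be sketched in code. This is a minimal illustration, assuming NumPy's `ndarray` as the matrix representation; the row labels `r1` to `r3` and column labels `c1`, `c2` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# The matrix X from the admonition: numbers only, addressed by integer position.
X = np.array([[1, 0],
              [0, 4],
              [0, 0]])
value_by_position = X[1, 1]  # second row, second column

# Wrapping the same numbers in a dataframe adds labels for rows and columns.
# These labels are hypothetical, chosen only for illustration.
df = pd.DataFrame(X, index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])
value_by_label = df.loc['r2', 'c2']  # same entry, selected by label
```

Both lookups return the same entry; the difference is that the dataframe lets us ask for it by name.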

[^petersohnScalable2020]: Petersohn, Devin, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya Parameswaran. “Towards Scalable Dataframe Systems.” arXiv preprint arXiv:2001.00888, 2020.