In [8]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

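def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    """Display df with temporary row and column limits.

    The global pandas display options are left unchanged.
    """
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)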

# DataFrames, Indices, Slicing


This section introduces the main `pandas` data structures for working with data
tables.


:::{admonition} Synopsis
:class: synopsis-box

This section introduces the `pandas.DataFrame` object, the primary data
structure data scientists use to work with data tables in Python. Dataframes
store row labels in `pandas.Index` objects. Every column of a dataframe is a
`pandas.Series` object.

To create a `pandas.DataFrame` object from a CSV file, use `pd.read_csv`:

```python
df = pd.read_csv('babynames.csv')
```

To select values using row and column labels, use `DataFrame.loc` with a slice:

```python
df.loc[:, 'Name']
```

To select values using row and column positions, use `DataFrame.iloc` with a
slice:

```python
df.iloc[:, :]
```

To sort a dataframe, use `DataFrame.sort_values()`:

```python
df.sort_values('Year')
```
:::

There's a 2021 New York Times article that talks about Prince Harry and
Meghan's unique choice for their new baby daughter's name: Lilibet
[^williamsLilith2021]. The article has an interview with Pamela Redmond, an
expert on baby names, who talks about interesting trends in how people name
their kids. For example, she says that names that start with the letter "L"
have become very popular in recent years, while names that start with the
letter "J" were popular in the 1970s and 1980s. Are these claims reflected in
data? We can use `pandas` to find out!

[^williamsLilith2021]: Williams, Alex. “Lilith, Lilibet … Lucifer? How Baby Names Went to ‘L.’” The New York Times, June 12, 2021, sec. Style. https://www.nytimes.com/2021/06/12/style/lilibet-popular-baby-names.html.

First, import the package as `pd`, the canonical abbreviation:

In [9]:
import pandas as pd

We have a dataset of baby names stored in a comma-separated values (CSV) file
called `babynames.csv`. Use the `pd.read_csv` function to read the file as a
`pandas.DataFrame` object.

In [10]:
baby = pd.read_csv('babynames.csv')
baby

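|         | Name     | Sex | Count | Year |
| ------- | -------- | --- | ----- | ---- |
| 0       | Mary     | F   | 9217  | 1884 |
| 1       | Anna     | F   | 3860  | 1884 |
| 2       | Emma     | F   | 2587  | 1884 |
| ...     | ...      | ... | ...   | ...  |
| 1891891 | Verna    | M   | 5     | 1883 |
| 1891892 | Winnie   | M   | 5     | 1883 |
| 1891893 | Winthrop | M   | 5     | 1883 |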


## DataFrames and Indices

Let's pause here and explain what you're looking at. A dataframe has rows and
columns. Every row and column has a label:

```{image} figures/baby_labels.svg
:alt: baby_labels
```

By default, `pandas` labels rows with incrementing numbers starting from 0. In
this case, the value at the row labeled `0` and the column labeled `Name` is
`'Mary'`.
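
As a minimal sketch of default row labels, the snippet below builds a small
dataframe by hand. Its three rows are copied from the first rows of the output
above; the variable name `small` is only for illustration.

```python
import pandas as pd

# Hypothetical three-row table; values copied from the first rows of `baby`
small = pd.DataFrame({
    'Name': ['Mary', 'Anna', 'Emma'],
    'Sex': ['F', 'F', 'F'],
    'Count': [9217, 3860, 2587],
    'Year': [1884, 1884, 1884],
})

# With no index specified, pandas labels the rows 0, 1, 2, ...
print(small.index)

# Look up the value at row label 0 and column label 'Name'
print(small.loc[0, 'Name'])
```

The lookup uses `.loc` with a row label and a column label, the same
label-based access mentioned in the synopsis.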

Dataframes can also have strings as row labels. Here's an example of a
dataframe with US state mottos. Every row is labeled with the state.

```{image} figures/motto_labels.svg
:alt: motto_labels
```

The row labels have a special name. We call them the **index** of a dataframe,
and `pandas` stores the row labels in a special `pandas.Index` object. We won't
discuss the `pandas.Index` object since you don't often have to manipulate the
index itself. But you should remember that even though the index looks like a
column of data, the index really represents row labels, not data. For instance,
the dataframe of US state mottos has 4 columns of data, not 5, since the index
doesn't count as a column.
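
To make the distinction concrete, here is a small sketch with string row
labels. The table is a hypothetical stand-in for the mottos dataframe in the
figure: its two columns and three rows are chosen only for illustration and
are not the book's data file.

```python
import pandas as pd

# Hypothetical stand-in for the state mottos table (illustration only)
mottos = pd.DataFrame(
    {
        'Motto': ['North to the Future', 'Eureka', 'Friendship'],
        'Language': ['English', 'Greek', 'English'],
    },
    index=['Alaska', 'California', 'Texas'],  # state names become row labels
)

print(mottos.index)    # the row labels, stored in a pandas.Index
print(mottos.columns)  # the column labels: 'Motto' and 'Language' only
print(mottos.shape)    # (rows, columns); the index is not counted as a column

# Row labels work like any other label in .loc
print(mottos.loc['California', 'Motto'])
```

Here `mottos.shape` reports two columns even though the printed table appears
to have three, because the state names are row labels rather than data.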

## About the Data

The data in the `baby` table come from the US Social Security Administration,
which records each baby's name and birth sex on Social Security card
applications. The agency makes the baby names data available on its website
[^babynamesData], and we've loaded this data into the `baby` table.

[^babynamesData]: “Social Security Baby Names.” Accessed August 16, 2021. https://www.ssa.gov/oact/babynames/index.html.

When you start working with a dataset, you should find out how the data were
collected. In this case, the Social Security website also has a page that
describes the data in more detail
([link](https://www.ssa.gov/oact/babynames/background.html)). We won't go
in-depth in this chapter about the data's limitations, but you should remember
this quote from the website:

> All names are from Social Security card applications for births that occurred
> in the United States after 1879. Note that many people born before 1937 never
> applied for a Social Security card, so their names are not included in our
> data. For others who did apply, our records may not show the place of birth,
> and again their names are not included in our data.
>
> All data are from a 100% sample of our records on Social Security card
> applications as of March 2021.

## Slicing
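
As a minimal sketch of the two slicing accessors named in the synopsis, the
example below uses a hypothetical three-row table whose values are copied from
the first rows of `baby`. One difference worth noticing: label slices in `.loc`
include their endpoint, while position slices in `.iloc` exclude it, like
ordinary Python slices.

```python
import pandas as pd

# Hypothetical three-row table; values copied from the first rows of `baby`
small = pd.DataFrame({
    'Name': ['Mary', 'Anna', 'Emma'],
    'Sex': ['F', 'F', 'F'],
    'Count': [9217, 3860, 2587],
    'Year': [1884, 1884, 1884],
})

# .loc slices by label and INCLUDES the end label:
# row labels 0 and 1, columns 'Name' through 'Count'
by_label = small.loc[0:1, 'Name':'Count']

# .iloc slices by position and EXCLUDES the end position:
# row position 0 only, column positions 0, 1, and 2
by_position = small.iloc[0:1, 0:3]

print(by_label)
print(by_position)
```

Both results keep the same three columns, but `by_label` has two rows and
`by_position` has one, even though both slices are written as `0:1`.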



## Summary

We now have the five most popular baby names in 2016 and have learned to express the following operations in `pandas`:

| Operation | `pandas` |
| --------- | -------  |
| Read a CSV file | `pd.read_csv()` |
| Slicing using labels or positions | `.loc` (labels) and `.iloc` (positions) |
| Slicing rows using a predicate | Use a boolean-valued Series in `.loc` |
| Sorting rows | `.sort_values()` |
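
The last two rows of the table can be combined in a short sketch. The six-row
table below is hypothetical: its values are copied from the rows of `baby`
shown earlier, listed in a shuffled order. Filtering and then sorting is one
reasonable way to rank names within a year, and it is not necessarily the
exact sequence of steps used to produce the result above.

```python
import pandas as pd

# Hypothetical six-row table; values copied from the `baby` output, shuffled
rows = pd.DataFrame({
    'Name': ['Anna', 'Verna', 'Mary', 'Winnie', 'Emma', 'Winthrop'],
    'Sex': ['F', 'M', 'F', 'M', 'F', 'M'],
    'Count': [3860, 5, 9217, 5, 2587, 5],
    'Year': [1884, 1883, 1884, 1883, 1884, 1883],
})

# A predicate is a boolean-valued Series: True where the row should be kept
is_1884 = rows['Year'] == 1884

# Passing the boolean Series to .loc keeps only the True rows
in_1884 = rows.loc[is_1884]

# Sort the remaining rows from most to least common
ranked = in_1884.sort_values('Count', ascending=False)

# The first rows of the sorted result are the most popular names
print(ranked.head(2))
```

Changing the year in the predicate and the number passed to `head` adapts the
same pattern to other questions, such as the five most popular names in a
given year.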