In [50]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

In [29]:
# Code to make small dogs table from all_dogs
# (pd.read_csv('all_dogs.csv')
#  .sort_values('popularity_all')
#  .head(7)
#  [['breed', 'grooming', 'food_cost', 'kids', 'size']]
#  .to_csv('dogs.csv', index=False)
# )

In [51]:
dogs = pd.read_csv('dogs.csv', index_col='breed')

(ch:pandas)=
# Working with Dataframes using pandas

Data scientists work with data stored in tables. This chapter introduces
*dataframes*, one of the most widely used ways to represent data tables. We'll
also introduce `pandas`, the standard Python package for working with
dataframes. Here's an example of a dataframe that holds information about
popular dog breeds:

In [10]:
dogs

| breed | grooming | food_cost | kids | size |
|---|---|---|---|---|
| Labrador Retriever | weekly | 466.0 | high | medium |
| German Shepherd | weekly | 466.0 | medium | large |
| Beagle | daily | 324.0 | high | small |
| Golden Retriever | weekly | 466.0 | high | medium |
| Yorkshire Terrier | daily | 324.0 | low | small |
| Bulldog | weekly | 466.0 | medium | medium |
| Boxer | weekly | 466.0 | high | medium |


In a dataframe, each row represents a single record---in this case, a single
dog breed. Each column represents a feature about the record---for example, the
`grooming` column represents how often each dog breed needs to be groomed.

Dataframes have labels for both columns and rows. For instance, this dataframe
has a column labeled `grooming` and a row labeled `German Shepherd`. The
columns and rows of a dataframe are ordered---we can refer to the Labrador
Retriever row as the first row of the dataframe. 
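
As a minimal sketch of what labels and ordering give us, the snippet below looks up a column and a row by label, then a row by position. It rebuilds the first three rows of the table by hand (values copied from the output above) so that it runs without `dogs.csv`; the variable names are ours, chosen only for illustration.

```python
import pandas as pd

# Hand-built copy of the first three rows of the dogs table shown above
dogs = pd.DataFrame(
    {
        'grooming': ['weekly', 'weekly', 'daily'],
        'food_cost': [466.0, 466.0, 324.0],
        'kids': ['high', 'medium', 'high'],
        'size': ['medium', 'large', 'small'],
    },
    index=pd.Index(['Labrador Retriever', 'German Shepherd', 'Beagle'],
                   name='breed'),
)

# Look up by label: one column, then one row
grooming = dogs['grooming']
shepherd = dogs.loc['German Shepherd']

# Look up by position: the first row of the dataframe
first_row = dogs.iloc[0]
print(first_row.name)  # Labrador Retriever
```

Here `.loc` selects by label and `.iloc` selects by integer position, which is why both the label `German Shepherd` and the phrase "the first row" identify a row unambiguously.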

Within a column, data have the same type. For instance, the `food_cost` column
contains numbers, and the `size` column contains categories. But data types can
be different within a row.
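
One way to see this, sketched here on a hand-built copy of the first three rows of the table (so the snippet runs without `dogs.csv`), is to ask `pandas` for the type of each column and then inspect the values in a single row:

```python
import pandas as pd

# Hand-built copy of the first three rows of the dogs table shown above
dogs = pd.DataFrame(
    {
        'grooming': ['weekly', 'weekly', 'daily'],
        'food_cost': [466.0, 466.0, 324.0],
        'kids': ['high', 'medium', 'high'],
        'size': ['medium', 'large', 'small'],
    },
    index=pd.Index(['Labrador Retriever', 'German Shepherd', 'Beagle'],
                   name='breed'),
)

# Each column has one type: food_cost is numeric, the other columns hold text
print(dogs.dtypes)

# A single row mixes types: text for grooming, a number for food_cost
beagle = dogs.loc['Beagle']
print(type(beagle['grooming']), type(beagle['food_cost']))
```

The exact name `pandas` reports for the text columns depends on the installed version, so the printed output is not reproduced here.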

Because rows and columns are labeled and ordered, and each column holds a
single type, dataframes support operations such as filtering, sorting, and
grouping.

:::{note}

As a data scientist, you'll often find yourself working with people from
different backgrounds who use different terms. For instance, computer
scientists say that the columns of a dataframe represent *features* of the
data, while statisticians call them *variables* instead.

Other times, people will use the same term to refer to slightly different
things. In a programming sense, a *data type* describes how a computer stores
data internally. For instance, the `size` column has a string data type in
Python. But from a statistical point of view, the `size` column stores ordered
categorical data (ordinal data). We talk more about this specific distinction
in the {ref}`ch:eda` chapter.

:::
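
To make the two senses of "data type" in the note concrete, here is a small sketch. It assumes the ordering small < medium < large for the `size` values, which is our reading of the labels rather than something the dataset states. The `sizes` series is typed in by hand for illustration.

```python
import pandas as pd

# A few size values, stored as plain strings (the programming view)
sizes = pd.Series(['medium', 'large', 'small', 'medium'])

# The statistical view: an ordered categorical with an assumed ordering
size_order = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)
ordinal_sizes = sizes.astype(size_order)

# With an order defined, comparisons and min/max follow size, not the alphabet
smallest = ordinal_sizes.min()
largest = ordinal_sizes.max()
bigger_than_small = ordinal_sizes > 'small'
print(smallest, largest)  # small large
```

With plain strings, `sizes.max()` would compare alphabetically and return `'small'`, which is why the stored type alone does not capture the ordinal meaning.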

In this chapter, we'll show you how to do common dataframe operations using
`pandas`. First, we'll explain the main objects that `pandas` provides: the
`DataFrame` and `Series` classes. Then, we'll show you how to use `pandas` to
perform common data manipulation tasks, like slicing, filtering, sorting,
grouping, and joining.
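
As a brief preview, and only a sketch, the snippet below shows the two classes side by side along with one filter and one sort. It uses a hand-built copy of the first three rows of the dogs table so that it runs on its own; the threshold of 400 is an arbitrary value picked for the example.

```python
import pandas as pd

# Hand-built copy of the first three rows of the dogs table shown earlier
dogs = pd.DataFrame(
    {
        'grooming': ['weekly', 'weekly', 'daily'],
        'food_cost': [466.0, 466.0, 324.0],
        'kids': ['high', 'medium', 'high'],
        'size': ['medium', 'large', 'small'],
    },
    index=pd.Index(['Labrador Retriever', 'German Shepherd', 'Beagle'],
                   name='breed'),
)

# The whole table is a DataFrame; a single column is a Series
print(type(dogs).__name__)               # DataFrame
print(type(dogs['food_cost']).__name__)  # Series

# Filtering: keep breeds whose food cost is under 400 (arbitrary cutoff)
cheaper = dogs[dogs['food_cost'] < 400]

# Sorting: order breeds from lowest to highest food cost
by_cost = dogs.sort_values('food_cost')
```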