# Summarizing Data

In this lecture, we'll discuss how to descriptively *summarize* data. Descriptive data summarization is one of the fundamental processes of exploratory data analysis. The `pandas` package offers us a powerful suite of tools for creating summaries. 

In [1]:
import pandas as pd
import numpy as np

In [21]:
penguins = pd.read_csv("palmer_penguins.csv")

cols = ["Species", "Region", "Island", "Culmen Length (mm)", "Culmen Depth (mm)"]

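penguins = penguins[cols].copy()  # independent copy, so the column assignment below is unambiguous

# shorten the species name by keeping only its first word
penguins["Species"] = penguins["Species"].str.split().str.get(0)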

penguins.head()

| | Species | Region | Island | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|---|---|---|
| 0 | Adelie | Anvers | Torgersen | 39.1 | 18.7 |
| 1 | Adelie | Anvers | Torgersen | 39.5 | 17.4 |
| 2 | Adelie | Anvers | Torgersen | 40.3 | 18.0 |
| 3 | Adelie | Anvers | Torgersen | NaN | NaN |
| 4 | Adelie | Anvers | Torgersen | 36.7 | 19.3 |


## Simple Aggregation

Because the columns of a data frame behave a lot like `numpy` arrays, we can use standard methods to compute summary statistics. Here are a few examples. 

In [3]:
x = penguins["Culmen Length (mm)"]
x

0      39.1
1      39.5
2      40.3
3       NaN
4      36.7
       ... 
339     NaN
340    46.8
341    50.4
342    45.2
343    49.9
Name: Culmen Length (mm), Length: 344, dtype: float64

In [4]:
np.sum(x) # note: on a pandas Series, NaNs are skipped by default

15021.3

In [5]:
x.sum() # also works

15021.3
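The note about NaNs applies to `pandas` objects rather than to raw `numpy` arrays. The sketch below contrasts the two, using a small made-up Series (these are toy values, not the penguins data).

```python
import numpy as np
import pandas as pd

# Toy values (not taken from the penguins data), with one missing entry.
s = pd.Series([39.1, 39.5, np.nan, 36.7])

series_total = s.sum()               # pandas skips NaN by default (skipna=True)
numpy_on_series = np.sum(s)          # dispatches to the Series method, so NaN is skipped too
strict_total = s.sum(skipna=False)   # ask pandas to propagate the NaN instead
array_total = np.sum(s.to_numpy())   # a plain NumPy array does not skip NaN
nan_aware_total = np.nansum(s.to_numpy())  # NumPy's NaN-aware sum

print(series_total, numpy_on_series, strict_total, array_total, nan_aware_total)
```

If you ever convert a column to a `numpy` array before aggregating, the NaN-aware functions (`np.nansum`, `np.nanmean`, and so on) are the ones that match the `pandas` defaults.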

In [6]:
x.mean(), x.std() # mean and standard deviation

(43.92192982456142, 5.459583713926532)

In [7]:
(x > 40).sum() # number of penguins with culmens longer than 40 mm

242
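Counting with a comparison works because `True` counts as 1 when summed. The following sketch, on made-up values, also shows how missing entries behave in the comparison and how to turn the count into a proportion.

```python
import numpy as np
import pandas as pd

# Toy values (not the penguins data), with one missing entry.
s = pd.Series([39.1, 41.5, np.nan, 36.7, 45.2])

mask = s > 40                 # boolean Series; the NaN row compares as False
n_long = mask.sum()           # number of True entries
share_of_rows = mask.mean()   # fraction of all rows, including the NaN row
share_of_known = mask.sum() / s.count()  # fraction among non-missing values only

print(n_long, share_of_rows, share_of_known)
```

Which denominator is appropriate depends on whether missing measurements should count as "not longer than 40 mm" or be left out entirely.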

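It's also possible to aggregate the entire data frame at once, in which case `pandas` applies the specified function to each column. Some aggregations, like `count()` and `max()`, work on text columns as well. Numerical aggregations like `mean()` do not: older versions of `pandas` silently ignored non-numeric columns, while versions 2.0 and later raise an error unless you pass `numeric_only=True`. 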

In [8]:
penguins.count() # excludes NA values, works for text columns

Species               344
Region                344
Island                344
Culmen Length (mm)    342
Culmen Depth (mm)     342
dtype: int64

In [9]:
penguins.mean(numeric_only=True) # restrict to numeric columns, skipping the text columns

Culmen Length (mm)    43.92193
Culmen Depth (mm)     17.15117
dtype: float64

In [10]:
# a bit counterintuitive: in text columns, returns the last 
# value alphabetically
penguins.max() 

Species                  Gentoo
Region                   Anvers
Island                Torgersen
Culmen Length (mm)         59.6
Culmen Depth (mm)          21.5
dtype: object
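To see how whole-frame aggregation treats mixed column types, here is a minimal sketch on a made-up frame. The column name `Length` and all values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy frame (made-up values) with one text column and one numeric column.
df = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap"],
    "Length": [39.1, 47.5, 48.8],
})

counts = df.count()                          # works for every column
numeric_means = df.mean(numeric_only=True)   # only the numeric column is averaged
maxima = df.max()                            # text column: last value in sort order

print(counts, numeric_means, maxima, sep="\n\n")
```

Passing `numeric_only=True` explicitly makes the intent clear and behaves the same way across `pandas` versions.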

It is technically possible to aggregate across columns (rather than rows) in `pandas`; however, doing so usually violates the [*tidy data* principles](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) and is not recommended. 
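For reference, aggregating across columns is done with the `axis` argument. The sketch below, on made-up values, shows why the result is rarely meaningful for tidy data: each row's "mean" mixes two different kinds of measurement.

```python
import pandas as pd

# Toy frame (made-up values): two different measurements per row.
df = pd.DataFrame({
    "Length": [39.1, 47.5],
    "Depth": [18.7, 15.0],
})

col_means = df.mean()         # default axis=0: one value per column
row_means = df.mean(axis=1)   # axis=1: one value per row, averaging length with depth

print(col_means, row_means, sep="\n\n")
```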

We've already seen `describe()`, a convenience method that computes several numerical summary statistics at once. 

In [11]:
penguins.describe()

| | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|
| count | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 |
| std | 5.459584 | 1.974793 |
| min | 32.1 | 13.1 |
| 25% | 39.225 | 15.6 |
| 50% | 44.45 | 17.3 |
| 75% | 48.5 | 18.7 |
| max | 59.6 | 21.5 |
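Each row of the `describe()` table corresponds to an aggregation we could have computed individually. A quick check on a made-up column (the name `Length` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy column (made-up values).
df = pd.DataFrame({"Length": [36.0, 38.0, 40.0, 42.0, 44.0]})

table = df.describe()

# Individual entries are retrieved by row label and column name.
mean_from_table = table.loc["mean", "Length"]
median_from_table = table.loc["50%", "Length"]
q1_from_table = table.loc["25%", "Length"]

print(table)
```

The result of `describe()` is itself a data frame, so `.loc` works on it like on any other.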


## Split-Apply-Combine

One of the fundamental tasks in exploratory data analysis is to summarize your data **by group**. In our penguins data, for example, a very natural thing to do is to compute summary statistics **by species**, or perhaps by island (or both!). We can break this task into three stages: 

1. **Split** the data frame into pieces, one for each species. 
2. **Apply** an aggregation function to each piece, yielding a single number. 
3. **Combine** the results into a new data frame.
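The three stages above can be written out by hand. This is only a sketch on made-up data (the column name `Length` and all values are illustrative), meant to show what each stage does rather than how `pandas` implements it.

```python
import pandas as pd

# Toy data (made-up values, not the real penguin measurements).
df = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Length": [38.0, 40.0, 46.0, 48.0],
})

# 1. Split: one sub-frame per species.
pieces = {name: df[df["Species"] == name] for name in df["Species"].unique()}

# 2. Apply: reduce each piece to a single number.
means = {name: piece["Length"].mean() for name, piece in pieces.items()}

# 3. Combine: assemble the per-group results into a new data frame.
result = pd.DataFrame({"Length": pd.Series(means)})
result.index.name = "Species"

print(result)
```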

This pattern is so common that the phrase "split-apply-combine" now appears in many texts on data analysis. This phrase was originally coined by Hadley Wickham, who is famous for developing many of the modern tools for data analysis in the `R` programming language. 

<figure class="image" style="width:50%">
  <img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png" alt="Left: A single dataframe is split into three pieces. Middle: The data within each piece is summed. Right: the resulting sums are combined, resulting in a new data frame with one sum for each piece.">
  <figcaption><i>split-apply-combine. Image credit: Jake VanderPlas, in the Python Data Science Handbook</i></figcaption>
</figure>

`pandas` lets us easily perform split-apply-combine operations using the `groupby()` method of data frames. 

In [12]:
penguins.groupby("Species")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f87fc2cedf0>

We can think of the result of `groupby()` as a special "view" of the data frame, such that any aggregation functions used will be applied to each of the individual "groups" (i.e. species). As before, numerical aggregation functions skip the text columns, which recent versions of `pandas` require you to request with `numeric_only=True`. 

In [13]:
penguins.groupby("Species").mean(numeric_only=True)

| Species | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|
| Adelie | 38.791391 | 18.346358 |
| Chinstrap | 48.833824 | 18.420588 |
| Gentoo | 47.504878 | 14.982114 |


We now have a pleasant summary of the mean culmen (bill) measurements for each species. It is now clear, for example, that Adelie penguins have much shorter bills than Chinstrap and Gentoo penguins. 
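The groups behind such a summary can also be inspected directly. The sketch below uses a made-up frame (the column name `Length` and the values are assumptions) to iterate over the groups, count their rows, and pull out a single group.

```python
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Adelie", "Gentoo", "Gentoo"],
    "Length": [38.0, 46.0, 40.0, 48.0, 47.0],
})

grouped = df.groupby("Species")  # nothing has been aggregated yet

# Iterating yields (group label, sub-frame) pairs.
for name, piece in grouped:
    print(name, piece["Length"].tolist())

sizes = grouped.size()                # number of rows in each group
adelie = grouped.get_group("Adelie")  # the sub-frame for one group
```

Checking group sizes like this is a useful habit before trusting a group mean, since a mean over very few rows can be misleading.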

If you only want to show summaries for certain columns, just pass those in list form as an index to the `groupby` object: 

In [24]:
# note the double brackets
penguins.groupby("Species")[["Culmen Length (mm)"]].mean()

| Species | Culmen Length (mm) |
|---|---|
| Adelie | 38.791391 |
| Chinstrap | 48.833824 |
| Gentoo | 47.504878 |
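The double brackets matter because they determine the type of the result. A short sketch on made-up data (the column name `Length` is an assumption for illustration):

```python
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Length": [38.0, 40.0, 46.0, 48.0],
})

as_frame = df.groupby("Species")[["Length"]].mean()  # list of columns -> DataFrame
as_series = df.groupby("Species")["Length"].mean()   # single label -> Series

print(type(as_frame).__name__, type(as_series).__name__)
```

Both hold the same numbers; the data frame form is convenient when you plan to add more columns later, while the Series form is handy for quick lookups such as `as_series["Adelie"]`.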


While it's useful to compute a single set of summary statistics like this, it's often more useful to apply multiple aggregation functions simultaneously. The `aggregate()` method allows us to pass multiple functions, all of which will be applied and represented as new columns. For example, a common format for measurements is the mean $\pm$ the standard deviation. We can easily compute both quantities simultaneously, per penguin species:

In [25]:
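# string function names plus an explicit column selection keep this working in recent pandas versions
penguins.groupby("Species")[["Culmen Length (mm)", "Culmen Depth (mm)"]].aggregate(["mean", "std"])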

| Species | Culmen Length (mm) / mean | Culmen Length (mm) / std | Culmen Depth (mm) / mean | Culmen Depth (mm) / std |
|---|---|---|---|---|
| Adelie | 38.791391 | 2.663405 | 18.346358 | 1.21665 |
| Chinstrap | 48.833824 | 3.339256 | 18.420588 | 1.135395 |
| Gentoo | 47.504878 | 3.081857 | 14.982114 | 0.98122 |
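When the goal is a "mean ± standard deviation" report, one option is *named aggregation*, which gives each result column a flat name of our choosing. The sketch below uses made-up data; the names `Length`, `mean_length`, and `sd_length` are hypothetical choices for illustration.

```python
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Length": [38.0, 40.0, 46.0, 48.0],
})

# Named aggregation: new_column=(source column, function name).
stats = df.groupby("Species").agg(
    mean_length=("Length", "mean"),
    sd_length=("Length", "std"),  # sample standard deviation (ddof=1)
)

# Format each row as "mean ± sd" with one decimal place.
report = stats.apply(
    lambda row: f"{row['mean_length']:.1f} ± {row['sd_length']:.1f}", axis=1
)

print(report)
```

Flat column names like these avoid the two-level column header produced by passing a list of functions, at the cost of spelling out each statistic.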


It's also possible to group by multiple columns -- just pass a list of column names to `groupby`: 

In [26]:
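# select the numeric columns explicitly so the text column "Region" is not aggregated
num_cols = ["Culmen Length (mm)", "Culmen Depth (mm)"]
summary = penguins.groupby(["Species", "Island"])[num_cols].aggregate(["mean", "std"])
summary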

| Species | Island | Culmen Length (mm) / mean | Culmen Length (mm) / std | Culmen Depth (mm) / mean | Culmen Depth (mm) / std |
|---|---|---|---|---|---|
| Adelie | Biscoe | 38.975 | 2.480916 | 18.370455 | 1.18882 |
| Adelie | Dream | 38.501786 | 2.465359 | 18.251786 | 1.133617 |
| Adelie | Torgersen | 38.95098 | 3.025318 | 18.429412 | 1.339447 |
| Chinstrap | Dream | 48.833824 | 3.339256 | 18.420588 | 1.135395 |
| Gentoo | Biscoe | 47.504878 | 3.081857 | 14.982114 | 0.98122 |
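Grouping by several columns produces a result whose row index has one level per grouping column, and only combinations that actually occur in the data appear. A minimal sketch on made-up data (the column name `Length` and all values are assumptions):

```python
import pandas as pd

# Toy data (made-up values): Gentoo appears on only one island.
df = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Gentoo"],
    "Island": ["Dream", "Dream", "Biscoe", "Biscoe"],
    "Length": [38.0, 40.0, 39.0, 47.0],
})

by_both = df.groupby(["Species", "Island"])["Length"].mean()

index_names = list(by_both.index.names)  # one index level per grouping column
flat = by_both.reset_index()             # turn the index levels back into ordinary columns

print(by_both)
print(flat)
```

Calling `reset_index()` is a simple way back to a flat table when the two-level index gets in the way.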


## Hierarchical Indexing

Complex data summary tables like the one above are useful and powerful, but they also pose an important problem: how can we extract the data from these summaries? For example, how can we get the mean bill length for Chinstrap penguins on Dream Island? To extract this kind of data, we need to use *hierarchical indexing*, in which we pass multiple keys to the `.loc` attribute. After passing all the row keys, we use `.loc` again on the result to select by the column keys. 

In [27]:
chinstrap_dream = summary.loc["Chinstrap", "Dream"]
chinstrap_dream

Culmen Length (mm)  mean    48.833824
                    std      3.339256
Culmen Depth (mm)   mean    18.420588
                    std      1.135395
Name: (Chinstrap, Dream), dtype: float64

In [28]:
# mean culmen length of chinstrap penguins on Dream Island 
chinstrap_dream.loc["Culmen Length (mm)", "mean"] 

48.83382352941177
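The two-step lookup can also be written as a single `.loc` call by passing one tuple for the row keys and one for the column keys. The sketch below demonstrates this on a made-up summary table (the column name `Length` and all values are assumptions), along with `xs()` for selecting every row at one level of the index.

```python
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap"],
    "Island": ["Dream", "Dream", "Dream", "Dream"],
    "Length": [38.0, 40.0, 48.0, 50.0],
})

summary = df.groupby(["Species", "Island"]).agg(["mean", "std"])

# One call: (row keys), (column keys).
value = summary.loc[("Chinstrap", "Dream"), ("Length", "mean")]

# All rows for one row key pair, keeping every column.
row = summary.loc[("Chinstrap", "Dream"), :]

# Cross-section: every species on Dream island.
dream = summary.xs("Dream", level="Island")

print(value)
print(dream)
```

Writing the keys as explicit tuples makes it clear which ones refer to rows and which to columns.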