## Split-Apply-Combine

It's common to want to ask questions like:

-  What is the average culmen length of the penguins living on Torgersen Island?
-  Does the average culmen depth differ from one species to another?


One of the fundamental tasks in exploratory data analysis is to summarize your data **by group**. In our penguins data, for example, a very natural thing to do is to compute summary statistics **by species**, or perhaps by island (or both!). We can break this task into three stages:

1. **Split** the data frame into pieces, one for each species.
2. **Apply** an aggregation function to each piece, yielding a single number. 
3. **Combine** the results into a new data frame.
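
To make the three stages concrete, here is a minimal sketch that carries them out by hand. It assumes a small made-up table (the values are invented for illustration and are not taken from the penguins data) with a `Species` column and a numeric `Length` column.

```python
import pandas as pd

# Toy stand-in for the penguins data: values are invented for illustration
penguins = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "Length": [39.0, 41.0, 49.0, 51.0, 46.0, 48.0],
})

# 1. Split: one smaller data frame per species
pieces = {
    species: penguins[penguins["Species"] == species]
    for species in penguins["Species"].unique()
}

# 2. Apply: reduce each piece to a single number (here, the mean length)
means = {species: piece["Length"].mean() for species, piece in pieces.items()}

# 3. Combine: collect the per-species results into a new data frame
summary = pd.Series(means, name="Length").rename_axis("Species").to_frame()
print(summary)
```

Writing the loop out like this works, but it is verbose and easy to get wrong, which is why a dedicated tool for the pattern is convenient.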

This pattern is so common that the phrase "split-apply-combine" now appears in many texts on data analysis. The phrase was coined by Hadley Wickham, who is known for developing many of the modern tools for data analysis in the `R` programming language.

<figure class="image" style="width:50%">
  <img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png" alt="Left: A single dataframe is split into three pieces. Middle: The data within each piece is summed. Right: the resulting sums are combined, resulting in a new data frame with one sum for each piece.">
  <figcaption><i>split-apply-combine. Image credit: Jake VanderPlas, in the Python Data Science Handbook</i></figcaption>
</figure>

### Python lets us easily perform split-apply-combine operations using the `groupby()` method of data frames. 

In [1]:
# Same preprocessing as last video 

import pandas as pd
import numpy as np

# Read in the data from CSV
penguins = pd.read_csv("palmer_penguins.csv")

# Keep only the columns we need; .copy() gives us an independent data frame to modify
cols = ["Species", "Region", "Island", "Culmen Length (mm)", "Culmen Depth (mm)"]
penguins = penguins[cols].copy()

# Shorten each species name to its first word
penguins["Species"] = penguins["Species"].str.split().str.get(0)

# Give the measurement columns shorter names
penguins = penguins.rename(columns={"Culmen Length (mm)": "Length",
                                    "Culmen Depth (mm)": "Depth"})
penguins.head()

We can group by species with the `groupby()` method.
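
As a rough sketch of what `groupby()` gives back, the example below uses a small made-up table in place of `palmer_penguins.csv` (the values are invented for illustration). The result is not a new data frame but a grouped object that remembers how the rows are split; nothing is aggregated until we ask for it.

```python
import pandas as pd

# Toy stand-in for the penguins data: values are invented for illustration
penguins = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "Island": ["Torgersen", "Torgersen", "Dream", "Dream",
               "Dream", "Dream", "Biscoe", "Biscoe"],
    "Length": [38.0, 40.0, 40.0, 42.0, 49.0, 51.0, 46.0, 48.0],
    "Depth": [18.0, 19.0, 18.0, 20.0, 18.0, 20.0, 14.0, 16.0],
})

# The "split" step: a grouped object, one group per species
by_species = penguins.groupby("Species")
print(type(by_species).__name__)
print(by_species.ngroups)

# Iterating yields (group key, sub data frame) pairs
for species, piece in by_species:
    print(species, piece.shape)
```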

Now we can get the mean of each numeric column for each species by adding `.mean()`.
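
A minimal sketch of this step, again on a made-up table (invented values, not the real data). One caveat that depends on your pandas version: in pandas 2.0 and later, calling `.mean()` on a grouped frame that still contains text columns such as `Island` raises an error, so the sketch passes `numeric_only=True` to average only the numeric columns.

```python
import pandas as pd

# Toy stand-in for the penguins data: values are invented for illustration
penguins = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "Island": ["Torgersen", "Torgersen", "Dream", "Dream",
               "Dream", "Dream", "Biscoe", "Biscoe"],
    "Length": [38.0, 40.0, 40.0, 42.0, 49.0, 51.0, 46.0, 48.0],
    "Depth": [18.0, 19.0, 18.0, 20.0, 18.0, 20.0, 14.0, 16.0],
})

# Split by species, apply the mean to each numeric column, combine into one table
species_means = penguins.groupby("Species").mean(numeric_only=True)
print(species_means)
```

The group labels become the index of the result, so each species has exactly one row.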

If we only want the mean of the `Length` column, we add `[["Length"]]` before `.mean()`. (Note the double brackets.)
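
The sketch below, on a made-up table with invented values, shows why the double brackets matter: selecting with a list of column names keeps the result a data frame, while single brackets give back a one-dimensional series.

```python
import pandas as pd

# Toy stand-in for the penguins data: values are invented for illustration
penguins = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "Length": [38.0, 40.0, 40.0, 42.0, 49.0, 51.0, 46.0, 48.0],
    "Depth": [18.0, 19.0, 18.0, 20.0, 18.0, 20.0, 14.0, 16.0],
})

# Double brackets: a list of columns, so the result is a DataFrame
length_frame = penguins.groupby("Species")[["Length"]].mean()

# Single brackets: one column, so the result is a Series
length_series = penguins.groupby("Species")["Length"].mean()

print(type(length_frame).__name__)
print(type(length_series).__name__)
```

Both hold the same numbers; the data frame version is usually nicer to display and to extend with more columns later.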

We can get multiple summary statistics, e.g., mean and standard deviation, together via the `aggregate()` method.

In [3]:
#note there is no () in the function names
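
Here is one way the `aggregate()` call might look, sketched on a made-up table (invented values). The aggregation functions are passed by name, not called, which is why there are no parentheses. This sketch uses the string names `"mean"` and `"std"`; function objects such as `np.mean` can also be passed, though recent pandas versions may warn about that.

```python
import pandas as pd

# Toy stand-in for the penguins data: values are invented for illustration
penguins = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "Length": [38.0, 40.0, 40.0, 42.0, 49.0, 51.0, 46.0, 48.0],
    "Depth": [18.0, 19.0, 18.0, 20.0, 18.0, 20.0, 14.0, 16.0],
})

# Apply two aggregation functions to every selected column at once
summary = (
    penguins.groupby("Species")[["Length", "Depth"]]
    .aggregate(["mean", "std"])
)
print(summary)

# The columns now have two levels: (original column, statistic)
print(summary.columns.tolist())
```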

### Group by multiple columns at the same time 

Group by Species and Island
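
A minimal sketch of grouping by two columns, using a made-up table with invented values. Passing a list of column names to `groupby()` creates one group for each combination of species and island that actually occurs in the data.

```python
import pandas as pd

# Toy stand-in for the penguins data: values are invented for illustration
penguins = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "Island": ["Torgersen", "Torgersen", "Dream", "Dream",
               "Dream", "Dream", "Biscoe", "Biscoe"],
    "Length": [38.0, 40.0, 40.0, 42.0, 49.0, 51.0, 46.0, 48.0],
    "Depth": [18.0, 19.0, 18.0, 20.0, 18.0, 20.0, 14.0, 16.0],
})

# Group by both columns, then compute two statistics per measurement
summary = (
    penguins.groupby(["Species", "Island"])[["Length", "Depth"]]
    .aggregate(["mean", "std"])
)
print(summary)

# The row index now has two levels: (Species, Island)
print(summary.index.names)
```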

### Hierarchical Indexing

Complex summary tables like the one above are useful and powerful, but they also pose a problem: how can we extract data from them? For example, how can we get the mean culmen length for Chinstrap penguins on Dream Island? To extract this kind of data, we use hierarchical indexing, in which we pass multiple keys to `.loc`. After passing all the row keys, we use `.loc` again to select the columns.

In [4]:
#first restrict to Chinstrap, Dream


In [5]:
#now grab the mean of the length column
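
The two steps can be sketched as follows, on a made-up table with invented values (so the numbers are not real penguin measurements). The variable name `summary` is an assumption for the grouped table. Row keys are passed as a tuple `(species, island)` and column keys as a tuple `(column, statistic)`; the same pattern works for any pair, such as Adelie penguins on Dream.

```python
import pandas as pd

# Toy stand-in for the penguins data: values are invented for illustration
penguins = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "Island": ["Torgersen", "Torgersen", "Dream", "Dream",
               "Dream", "Dream", "Biscoe", "Biscoe"],
    "Length": [38.0, 40.0, 40.0, 42.0, 49.0, 51.0, 46.0, 48.0],
    "Depth": [18.0, 19.0, 18.0, 20.0, 18.0, 20.0, 14.0, 16.0],
})

summary = (
    penguins.groupby(["Species", "Island"])[["Length", "Depth"]]
    .aggregate(["mean", "std"])
)

# Step 1: restrict to one row by passing both row keys
row = summary.loc[("Chinstrap", "Dream"), :]

# Step 2: use .loc again on that row to pick the column and statistic
mean_length = row.loc[("Length", "mean")]
print(mean_length)

# Both steps can also be combined into a single .loc call
same_value = summary.loc[("Chinstrap", "Dream"), ("Length", "mean")]
adelie_dream = summary.loc[("Adelie", "Dream"), ("Length", "mean")]
print(same_value, adelie_dream)
```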