In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

# Lecture 7 – Grouping and Pivoting

## DSC 80, Spring 2022

### Announcements

- Lab 2 is due **tonight at 11:59PM.**
    - See [this post](https://campuswire.com/c/G325FA25B/feed/277) for a clarification on Question 8.
    - Git issues? See [this post](https://campuswire.com/c/G325FA25B/feed/315).
- Project 1 is due on **Thursday at 11:59PM**.
- Watch [this video 🎥](https://www.youtube.com/watch?v=uUawZfAgA64) for tips on how to work with the command line.

### Agenda

- Data granularity.
- Grouping.
- Pivoting.

## Data granularity

### Granularity

- **Granularity** refers to the level of detail present in data.
    - Fine: small details.
    - Coarse: bigger picture.
- Typically, rows in a DataFrame correspond to individuals, and columns correspond to attributes.
- In the following example, what is an individual?

| Name | Assignment | Score |
| --- | --- | --- |
| Billy | Homework 1 | 94 |
| Sally | Homework 1 | 98 |
| Molly | Homework 1 | 82 |
| Sally | Homework 2 | 47 |

### Levels of granularity

<center><img src='imgs/caper.png' width=30%></center>

Each student submits CAPEs once for each course they are in.

| Student Name | Quarter | Course | Instructor | Recommend? | Expected Grade | Hours Per Week | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Billy | SP22 | DSC 80 | Suraj Rampure | No | A- | 14 | I hate this class |
| Billy | SP22 | DSC 40B | Arya Mazumdar | Yes | B+ | 9 | go big O |
| Sally | SP22 | DSC 10 | Janine Tiefenbruck | Yes | A | 11 | babypandas are so cute |
| Molly | SP22 | DSC 80 | Suraj Rampure | Yes | A+ | 2 | I wish there was music in class |

Only instructors can see individual responses. At [cape.ucsd.edu](https://cape.ucsd.edu), only overall class statistics are visible.

| Quarter | Course | Instructor | Recommend (%) | Expected Grade | Hours Per Week |
| --- | --- | --- | --- | --- | --- |
| SP22 | DSC 80 | Suraj Rampure | 23% | 3.15 (B) | 13.32 |
| SP22 | DSC 40B | Arya Mazumdar | 89% | 3.35 (B+) | 8.54 |
| SP22 | DSC 10 | Janine Tiefenbruck | 94% | 3.45 (B+) | 11.49 |

The university may be interested in looking at CAPEs results by department.

| Quarter | Department | Recommend (%) | Expected Grade | Hours Per Week |
| --- | --- | --- | --- | --- |
| SP22 | DSC | 91% | 3.01 (B) | 12.29 |
| SP22 | BILD | 85% | 2.78 (C+) | 13.21 |

Prospective students may be interested in comparing course evaluations across different universities.

| University | Recommend (%) | Average GPA | Hours Per Week |
| --- | --- | --- | --- |
| UC San Diego | 94% | 3.12 (B) | 42.19 |
| UC Irvine | 89% | 3.15 (B) | 38.44 |
| SDSU | 88% | 2.99 (B-) | 36.89 |

### Collecting data

- If you can control how your dataset is created, you should opt for **finer granularity** (more detail).
- You can always remove detail, but you cannot add detail if it is not already present in the dataset.
- However, obtaining fine-grained data can take more time and space.

### Manipulating granularity

- In the CAPEs example, we looked at the same information (course evaluations) at varying levels of detail.
- We'll now explore how to change the level of granularity present in our dataset.
    - While it may seem like we are "losing information," removing detail can help us understand bigger-picture trends in our data.

### Discussion Question

What is the average number of `'Years'` for each `'Degree'`? Write code that finds the answer as a **Series** indexed by `'Degree'`.

In [None]:
profs = pd.DataFrame(
[['Brad', 'UCB', 8, 'Neuro', 'Orange'],
 ['Janine', 'UCSD', 7, 'Math', 'Purple'],
 ['Marina', 'UIC', 6, 'CS', 'Yellow'],
 ['Justin', 'OSU', 4, 'CS', 'Yellow'],
 ['Aaron', 'UCB', 4, 'Math', 'Purple'],
 ['Soohyun', 'UCSD', 1, 'CS', 'Orange'],
 ['Suraj', 'UCB', 1, 'CS', 'Purple']],
    columns=['Name', 'School', 'Years', 'Degree', 'Color']
)

profs

### Approach 1: Looping through unique values

In [None]:
year_map = {}
for degree in profs['Degree'].unique():
    degree_only = profs.loc[profs['Degree'] == degree]
    year_map[degree] = degree_only['Years'].mean()
    
pd.Series(year_map)

For each unique `'Degree'`, we make a pass through the entire dataset.

### Approach 2: Single pass

Let's try to avoid passing over the dataset repeatedly.

In [None]:
profs

You can iterate over the rows of a DataFrame using the `iterrows` method (though you should rarely need to do this):

In [None]:
for idx, row in profs.iterrows():
    print(row, '\n')

In [None]:
year_map = {}
for idx, row in profs.iterrows():                            
    degree = row['Degree']
    person_years = row['Years']
    if degree in year_map:
        year_map[degree] += np.array([1, person_years])
    else:
        year_map[degree] = np.array([1, person_years])
        
year_map

In [None]:
df = pd.DataFrame(year_map, index=['total', 'years'])
df.loc['years'] / df.loc['total']

### Issues with the previous solutions

- These solutions were "ad-hoc", and depended on the specific problem we had.
    - What if we wanted the **median** `'Years'` for each `'Degree'`?
- Loops in Python are slow (though the **algorithmic reasoning** is still relevant).

## GroupBy

### 🤔

In [None]:
profs

In [None]:
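# numeric_only=True skips the string columns; pandas 2.0+ raises a TypeError without it
profs.groupby('Degree').mean(numeric_only=True)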
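
The discussion question asked for a **Series** indexed by `'Degree'`. Here is a minimal sketch of one way to get it with `groupby`, selecting `'Years'` before aggregating. The `profs` DataFrame is re-created so the example runs on its own, and the name `avg_years` is ours, not part of the lecture.

```python
import pandas as pd

profs = pd.DataFrame(
    [['Brad', 'UCB', 8, 'Neuro', 'Orange'],
     ['Janine', 'UCSD', 7, 'Math', 'Purple'],
     ['Marina', 'UIC', 6, 'CS', 'Yellow'],
     ['Justin', 'OSU', 4, 'CS', 'Yellow'],
     ['Aaron', 'UCB', 4, 'Math', 'Purple'],
     ['Soohyun', 'UCSD', 1, 'CS', 'Orange'],
     ['Suraj', 'UCB', 1, 'CS', 'Purple']],
    columns=['Name', 'School', 'Years', 'Degree', 'Color']
)

# Selecting 'Years' first means only that column is aggregated,
# and the result is a Series indexed by 'Degree'.
avg_years = profs.groupby('Degree')['Years'].mean()
print(avg_years)
```

Swapping `.mean()` for `.median()` answers the "median `'Years'`" variant without rewriting any loop.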

### Aside: Pandas Tutor

- [pandastutor.com](https://pandastutor.com) is a new tool that allows you to visualize DataFrame operations.
    - It works similarly to [pythontutor.com](https://pythontutor.com), which you may have seen in DSC 20.
    - Slight issue: can't upload `.csv` files.
- Follow along with our current example [here](https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0A%0Aprofs%20%3D%20pd.DataFrame%28%0A%5B%5B'Brad',%20'UCB',%208,%20'Neuro',%20'Orange'%5D,%0A%20%5B'Janine',%20'UCSD',%207,%20'Math',%20'Purple'%5D,%0A%20%5B'Marina',%20'UIC',%206,%20'CS',%20'Yellow'%5D,%0A%20%5B'Justin',%20'OSU',%204,%20'CS',%20'Yellow'%5D,%0A%20%5B'Aaron',%20'UCB',%204,%20'Math',%20'Purple'%5D,%0A%20%5B'Soohyun',%20'UCSD',%201,%20'CS',%20'Orange'%5D,%0A%20%5B'Suraj',%20'UCB',%201,%20'CS',%20'Purple'%5D%5D,%0A%20%20%20%20columns%3D%5B'Name',%20'School',%20'Years',%20'Degree',%20'Color'%5D%0A%29%0A%0Aprofs.groupby%28'Degree'%29.mean%28%29&d=2022-04-11&lang=py&v=v1).

### Split-apply-combine

- The `groupby` method involves three steps: **split**, **apply**, and **combine**.

<center><img src="imgs/image_0.png" width=40%></center>

- **Split** breaks up and "groups" the rows of a DataFrame according to the specified key. There is one "group" for every unique value of the key.
- **Apply** uses a function (e.g. aggregation, transformation, filtering) within the individual groups.
- **Combine** stitches the results of these operations into an output DataFrame.
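
To make the three steps concrete, here is a rough sketch that performs them by hand on a trimmed-down copy of `profs` and compares the result to the built-in aggregation. This only illustrates the idea; it is not how `pandas` implements `groupby` internally, and the names `pieces`, `results`, and `by_hand` are ours.

```python
import pandas as pd

profs = pd.DataFrame({
    'Name': ['Brad', 'Janine', 'Marina', 'Justin', 'Aaron', 'Soohyun', 'Suraj'],
    'Years': [8, 7, 6, 4, 4, 1, 1],
    'Degree': ['Neuro', 'Math', 'CS', 'CS', 'Math', 'CS', 'CS'],
})

# Split: one sub-DataFrame per unique value of 'Degree'.
pieces = {degree: group for degree, group in profs.groupby('Degree')}

# Apply: compute the mean of 'Years' within each piece.
results = {degree: group['Years'].mean() for degree, group in pieces.items()}

# Combine: stitch the per-group results into a single Series.
by_hand = pd.Series(results)

# The built-in aggregation does all three steps in one call.
built_in = profs.groupby('Degree')['Years'].mean()

print(by_hand)
print(built_in)
```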

### Runtime considerations

* The `groupby` method can often produce results using just a **single pass** over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.

* `groupby` is a **declarative** operation – the user just specifies **what** computation needs to be done, and `pandas` figures out **how** to do it under the hood.

* The split-apply-combine pattern can be parallelized to work on multiple computers or threads, by sending computations for each group to different processors.

### Example: Penguins 🐧

In [None]:
penguins = sns.load_dataset('penguins').dropna()
penguins.head()

In [None]:
penguins['species'].value_counts()

In [None]:
penguins['island'].value_counts()

### For each species...

What is the median bill length?

In [None]:
penguins.groupby('species').median(numeric_only=True)

What proportion live on Dream Island?

In [None]:
(
    penguins.assign(on_Dream = penguins['island'] == 'Dream')
            .groupby('species')
            .mean(numeric_only=True)
)

Now that we understand how to use `groupby`, let's dive deeper into **how** it works.

### Accessing groups

- If `df` is a DataFrame, then `df.groupby(key)` returns a `DataFrameGroupBy` object.
    - This object represents the "split" in "split-apply-combine".
- Methods and attributes of `DataFrameGroupBy` objects:
    - `.groups`: a dictionary in which the keys are group names and the values are the row labels belonging to each group.
    - `.get_group(key)`: a DataFrame with only the rows for the given key.
    - We usually don't use these directly, but they're useful in understanding how `groupby` works.

In [None]:
# Creates one group for each unique value in the species column
penguin_groups = penguins.groupby('species')
penguin_groups

In [None]:
penguin_groups.groups

In [None]:
penguin_groups.get_group('Chinstrap')

In [None]:
# Same as the above
penguins[penguins['species'] == 'Chinstrap']

In [None]:
for key, df in penguin_groups:
    display(df)

### Aggregation

- Once we create a `DataFrameGroupBy` object, we need to **apply** some function to each group, and **combine** the results.
- The most common operation applied to each group is an **aggregation**.
    - Aggregation refers to the process of reducing many values to one.
- To perform an aggregation, use an aggregator method on the `DataFrameGroupBy` object, e.g. `.mean()`, `.max()`, `.median()`, etc.

In [None]:
penguins

In [None]:
penguin_groups

In [None]:
penguin_groups.mean(numeric_only=True)

In [None]:
penguin_groups.sum()

In [None]:
penguin_groups.max()

### Column selection

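- By default, the aggregator is applied to **all** columns other than the grouping column(s).
    - `max` and `min` are defined on strings, while `median` and `mean` are not. Older versions of `pandas` silently dropped the columns an aggregator could not handle; `pandas` 2.0 and later raise a `TypeError` instead, unless you pass `numeric_only=True`.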
- If we only care about one column, we can select that column before aggregating to save time.
- `DataFrameGroupBy` objects support `[]` notation.

In [None]:
penguins.groupby('species').median(numeric_only=True)

In [None]:
penguins.groupby('species')['bill_length_mm'].median()

In [None]:
# Gives the same result, but involves wasted effort
# since the other columns had to be aggregated for no reason
penguins.groupby('species').median(numeric_only=True)['bill_length_mm']

In [None]:
# Note that this is a SeriesGroupBy object, not a DataFrameGroupBy object!
penguins.groupby('species')['bill_length_mm']

## Additional `GroupBy` methods

### Aggregation methods

- There are many built-in aggregation methods.
- What if you want to apply different aggregation methods to different columns?
- What if the aggregation method you want to use doesn't already exist in `pandas`?

### The `aggregate` method

- The `DataFrameGroupBy` object has a general `aggregate` method, which aggregates using one or more operations.
    - Remember, aggregation refers to the process of reducing many values to one.
- There are many ways of using `aggregate`; refer to [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) for a comprehensive list.
- Example arguments:
    - A single function.
    - A list of functions.
    - A dictionary mapping column names to functions.
- Per [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html), `agg` is an alias for `aggregate`.

### Example

How many penguins are there of each species, and what is the mean body mass of each species?

In [None]:
penguins.groupby('species')['body_mass_g'].aggregate(['count', 'mean'])

Note what happens when we don't select a column before aggregating.

In [None]:
# Drop the string columns first: in pandas 2.0+, 'mean' raises a TypeError on them
penguins.drop(columns=['island', 'sex']).groupby('species').aggregate(['count', 'mean'])

### Example

What is the max bill length of each species, and how many islands is each species found on?

In [None]:
penguins.groupby('species').aggregate({'bill_length_mm': 'max', 'island': 'nunique'})

### Example

What is the **interquartile range** of the body mass of each species?

In [None]:
def IQR(col):
    return np.percentile(col, 75) - np.percentile(col, 25)

In [None]:
penguins.groupby('species')['body_mass_g'].aggregate(IQR)

### The `transform` method

- Let's say we want to subtract the mean within each group.
- This is not an **aggregation**; it is a **transformation**.
- A transformation returns a DataFrame or Series of the same size as the input, with one output value per original row.

In [None]:
penguins

In [None]:
penguins.groupby('species')['body_mass_g'].transform(lambda ser: ser - ser.mean())
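
To see the difference in output shape between an aggregation and a transformation, here is a small sketch on a trimmed-down copy of `profs` (used instead of `penguins` so it runs without downloading data). The names `group_means`, `centered`, and `Years_centered` are ours.

```python
import pandas as pd

profs = pd.DataFrame({
    'Name': ['Brad', 'Janine', 'Marina', 'Justin', 'Aaron', 'Soohyun', 'Suraj'],
    'Years': [8, 7, 6, 4, 4, 1, 1],
    'Degree': ['Neuro', 'Math', 'CS', 'CS', 'Math', 'CS', 'CS'],
})

years_by_degree = profs.groupby('Degree')['Years']

# Aggregation: one value per group (3 degrees -> 3 values).
group_means = years_by_degree.mean()

# Transformation: one value per original row (7 professors -> 7 values),
# with the same index as profs, so it can be assigned back as a column.
centered = years_by_degree.transform(lambda ser: ser - ser.mean())
profs_centered = profs.assign(Years_centered=centered)

print(len(group_means), len(centered))
print(profs_centered)
```

Because the transformed Series keeps the original index, it lines up row-by-row with the DataFrame it came from.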

### The `filter` method

- Suppose we want to keep only the groups that satisfy a particular condition.
- To do this, we use the `filter` method, which takes in a function.
- That function should accept a DataFrame/Series and return a Boolean.
- The result is a new DataFrame/Series with only the groups for which the filter function returned `True`.
- For example, suppose we want only the species whose mean bill length is above 39 mm.

In [None]:
penguins

In [None]:
penguins.groupby('species').filter(lambda df: df['bill_length_mm'].mean() > 39)

No more Adelies!
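
As a second, self-contained illustration, the sketch below filters a trimmed-down copy of `profs` to keep only the degrees held by at least two professors. The threshold and the name `kept` are our own choices for the example.

```python
import pandas as pd

profs = pd.DataFrame({
    'Name': ['Brad', 'Janine', 'Marina', 'Justin', 'Aaron', 'Soohyun', 'Suraj'],
    'Years': [8, 7, 6, 4, 4, 1, 1],
    'Degree': ['Neuro', 'Math', 'CS', 'CS', 'Math', 'CS', 'CS'],
})

# The function receives each group as a DataFrame and returns a single Boolean.
# Groups for which it returns False are dropped entirely.
kept = profs.groupby('Degree').filter(lambda df: df.shape[0] >= 2)

print(kept)
```

Note that `filter` keeps or drops whole groups; the surviving rows are returned unaggregated, in their original order.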

### The `apply` method

- The `apply` method is a generalization of `aggregate`, `transform`, and `filter`.
- It accepts a group as a DataFrame/Series, and can return a DataFrame, Series, or scalar.
- Per [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html), it is slower than other aggregation and transformation methods, so use those instead whenever possible, and **avoid `apply`**.

In [None]:
penguins.groupby('species').apply(lambda s: s * 2)

In [None]:
penguins.groupby('species').apply(lambda s: s.mean(numeric_only=True).mean())

### Discussion Question

For each species, find the island on which the heaviest penguin of that species lives.

In [None]:
# Why doesn't this work?
penguins.groupby('species').max()

In [None]:
penguins.sort_values('body_mass_g', ascending=False).groupby('species').first()
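
Here is a sketch of why `max` gives the wrong answer, along with one alternative to sorting. It uses a small hypothetical table (made-up values, not rows from the real `penguins` dataset) so that the mismatch is easy to see. The names `toy`, `columnwise`, and `heaviest` are ours.

```python
import pandas as pd

# Hypothetical toy data, not values from the real penguins dataset.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3500, 3700, 5000, 5400],
})

# max() takes the maximum of each column separately, so the 'island' value
# (the alphabetically last one) need not come from the heaviest penguin's row.
columnwise = toy.groupby('species').max()

# idxmax() returns the row label of the heaviest penguin in each group;
# .loc then pulls out those complete rows.
heaviest_labels = toy.groupby('species')['body_mass_g'].idxmax()
heaviest = toy.loc[heaviest_labels].set_index('species')['island']

print(columnwise)
print(heaviest)
```

In the toy data, the heaviest Adelie lives on Biscoe, but the column-wise maximum reports Torgersen because it comes later alphabetically.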

### Grouping with multiple columns

When we group with multiple columns, one group is created for **every unique combination** of elements in the specified columns.

In [None]:
double_group = penguins.groupby(['species', 'island'])
double_group

In [None]:
double_group.groups

In [None]:
for key, df in double_group:
    display(df.head())

In [None]:
penguins.groupby(['species', 'island']).mean(numeric_only=True)

### Grouping and indexes

- The `groupby` method creates an index based on the specified columns.
- When grouping by multiple columns, the resulting DataFrame has a `MultiIndex`.
- Advice: When working with a `MultiIndex`, use `reset_index` or set `as_index=False` in `groupby`.

In [None]:
weird = penguins.groupby(['species', 'island']).mean(numeric_only=True)
weird

In [None]:
weird['body_mass_g']

In [None]:
weird.loc['Adelie']

In [None]:
weird.loc[('Adelie', 'Torgersen')]

In [None]:
weird.reset_index()

In [None]:
penguins.groupby(['species', 'island'], as_index=False).mean(numeric_only=True)

## Introduction to pivot tables

### Average body mass for every combination of species and island

To find the above information, we can group by both `'species'` and `'island'`.

In [None]:
penguins.groupby(['species', 'island'])['body_mass_g'].mean()

But we can also create a **pivot table**.

In [None]:
penguins.pivot_table(index='species', 
                     columns='island', 
                     values='body_mass_g', 
                     aggfunc='mean')

Note that the DataFrame above shows the same information as the Series above it, just in a different arrangement.
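
The sketch below makes that correspondence explicit on a small hypothetical table (made-up values, not the real `penguins` data). It reshapes the grouped Series with `unstack`, a `pandas` method not otherwise covered in this lecture, and compares the result to `pivot_table`.

```python
import pandas as pd

# Hypothetical toy data, not values from the real penguins dataset.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3500, 3700, 5000, 5400],
})

# Long format: one row per (species, island) combination that appears in the data.
grouped = toy.groupby(['species', 'island'])['body_mass_g'].mean()

# Moving the inner index level ('island') into the columns gives a wide table.
unstacked = grouped.unstack()

# pivot_table produces the same wide arrangement directly.
pivoted = toy.pivot_table(index='species',
                          columns='island',
                          values='body_mass_g',
                          aggfunc='mean')

print(unstacked)
print(pivoted)
```

Combinations that never occur in the data (here, Gentoo penguins on Torgersen) show up as `NaN` in the wide table.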

## Summary, next time

### Summary

- Grouping allows us to change the level of granularity in a DataFrame.
- Grouping involves three steps – split, apply, and combine.
- The `groupby` method returns a `DataFrameGroupBy` object, which creates one group for every unique combination of values in the column(s) being grouped on.
- Most often, we will use an aggregation method on a `DataFrameGroupBy` object, but we can also use `transform`, `filter`, or the more general `apply` method. Each of these methods acts on each group individually.
- **Next time:** More on `pivot` and `pivot_table`. Simpson's paradox. Combining DataFrames.