# Groupby
* Often we want to split up and work with data based on groups
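* Pandas allows us to iterate through the rows and columns of a `DataFrame`, but explicit loops are slow compared with vectorized operations
* Pandas also supports `groupby()`, which follows a split-apply-combine pattern: split the data into groups, apply a function to each group, and combine the results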

## Splitting
* Let's get motivated first

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/census.csv')
# Keep only the county-level rows (summary level 50)
df = df[df['SUMLEV'] == 50]
df.head()
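
To make the motivation concrete, here is a minimal sketch on made-up data (the column names `state` and `population` are hypothetical, not taken from `census.csv`). It computes a per-group mean twice: once by filtering the whole frame for every key, and once by letting `groupby()` do the split. Both give the same answer, but the second version does not rescan the full frame for each key.

```python
import pandas as pd

# Made-up data; "state" and "population" are hypothetical column names
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "B"],
    "population": [10, 30, 5, 15, 10],
})

# Manual approach: filter the whole frame once per unique key
loop_means = {}
for state in toy["state"].unique():
    loop_means[state] = toy.loc[toy["state"] == state, "population"].mean()

# groupby approach: pandas splits the frame once, then we visit each group
groupby_means = {}
for state, frame in toy.groupby("state"):
    groupby_means[state] = frame["population"].mean()

print(loop_means == groupby_means)
```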

* Ok, so `groupby` is great
* Usually you'll group by the values in a column, but you can also pass a function to `groupby()`. Pandas calls it on each index label and uses the return value as the group key.
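
As a small illustration of grouping with a function, the sketch below uses a made-up frame whose index holds names; the helper `half_of_alphabet` and the data are hypothetical. The function receives one index label at a time and returns the key for the group that row belongs to.

```python
import pandas as pd

# Made-up data with names in the index
toy = pd.DataFrame(
    {"population": [10, 30, 5, 15]},
    index=["Alpha", "Mango", "Nectar", "Zulu"],
)

def half_of_alphabet(label):
    # Called once per index label; the return value becomes the group key
    return "A-M" if label[0] <= "M" else "N-Z"

totals = toy.groupby(half_of_alphabet)["population"].sum()
print(totals)
```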

* We can also group by multiple columns

In [None]:
# Airbnb data
df = pd.read_csv("datasets/listings.csv")
df.head()

In [None]:
# It works pretty much as you would expect
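
A minimal sketch of grouping by two columns, using made-up rows rather than the real `listings.csv`. Passing a list of column names produces one group per distinct combination, and the group keys become a `MultiIndex` on the result.

```python
import pandas as pd

# Made-up rows; the values are illustrative only
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [10.0, 8.0, 10.0, 10.0],
})

# One group per (cancellation_policy, review_scores_value) combination
sizes = toy.groupby(["cancellation_policy", "review_scores_value"]).size()
print(sizes)
```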


## Applying
* So far we have just looked at splitting up data
* There are three broad kinds of apply step: aggregation, transformation, and filtering.

### Aggregation

In [None]:
# We should just be able to aggregate by calling .agg


In [None]:
# That didn't seem to work at all, NaN!
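
One common way to end up with `NaN` here is an aggregation function that does not skip missing values. The sketch below assumes that was the cause and uses made-up data: `np.average` returns `NaN` as soon as a group contains one, while the pandas `"mean"` aggregation skips missing values by default.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing review score
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [8.0, np.nan, 9.0, 10.0],
})
scores = toy.groupby("cancellation_policy")["review_scores_value"]

# np.average does not skip NaN, so any group containing one yields NaN
naive = scores.agg(lambda s: np.average(s.to_numpy()))

# The built-in "mean" aggregation ignores NaN by default
robust = scores.agg("mean")

print(naive)
print(robust)
```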


In [None]:
# We can just extend this dictionary to aggregate by multiple functions or multiple columns.
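
A sketch of what extending the dictionary can look like, on made-up data (the `reviews_per_month` column is assumed for illustration). Each key is a column name and each value is a list of aggregations to run on that column; the result has a two-level column index of (column, function).

```python
import pandas as pd

# Made-up data; "reviews_per_month" is an assumed second column
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [8.0, 10.0, 9.0, 10.0],
    "reviews_per_month": [1.0, 3.0, 2.0, 4.0],
})

summary = toy.groupby("cancellation_policy").agg({
    "review_scores_value": ["mean", "std"],  # two functions on one column
    "reviews_per_month": ["mean"],           # a second column
})
print(summary)
```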


### Transformation
* Transformation broadcasts the function you supply over each group of the grouped `DataFrame`, returning a new `DataFrame`.
* This is an important subtlety. `agg()` reduces each group to a single value per column. `transform()` instead returns an object the same size as the group it was given.
* So whereas `agg()` returns a `DataFrame` with one row per group, `transform()` returns a `DataFrame` the same size as your original `DataFrame`, with the same index.

In [None]:
# Let's just look at a couple of columns from our DataFrame


In [None]:
# Notice that we are indexed by some review number. If we want to find the average for each group, we can do


In [None]:
# But how do we put this average, say as a column called "related_averages", 
# back to our original dataframe
# How would YOU do that....?
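
One possible manual answer, sketched on made-up data: compute the per-group mean, turn it back into a regular column, and merge it onto the original rows. The column name `group_mean` is a hypothetical choice for this sketch.

```python
import pandas as pd

# Made-up data
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "flexible", "strict", "strict"],
    "review_scores_value": [8.0, 6.0, 10.0, 7.0, 8.0],
})

# Step 1: one mean per group, as a small lookup table
means = (
    toy.groupby("cancellation_policy")["review_scores_value"]
    .mean()
    .rename("group_mean")
    .reset_index()
)

# Step 2: attach the group mean to every original row
merged = toy.merge(means, on="cancellation_policy", how="left")
print(merged)
```

Note that `merge()` returns a frame with a fresh integer index, so this route loses the original index unless you restore it yourself.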


In [None]:
# Transform lets us do this in one step


In [None]:
# Since the return is indexed just like the original dataframe, we can just assign it to a column
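
A minimal sketch of the `transform()` route on made-up data. Because the result has one value per original row and keeps the original index, it can be assigned directly as a new column (named `group_mean` here, a hypothetical choice).

```python
import pandas as pd

# Made-up data
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "flexible", "strict", "strict"],
    "review_scores_value": [8.0, 6.0, 10.0, 7.0, 8.0],
})

# One value per original row: the mean of that row's group
group_mean = toy.groupby("cancellation_policy")["review_scores_value"].transform("mean")

# Same index as toy, so plain column assignment lines the rows up
toy["group_mean"] = group_mean
print(toy)
```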


### Filtering
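* You can also use `filter()` to drop whole groups. You pass a function that receives each group and returns `True` or `False`, and only the rows of groups that return `True` are kept. It is sort of like `where()`, except the condition is evaluated once per group rather than per value.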
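
A small sketch of `filter()` on made-up data, where the threshold of 8 is arbitrary. The lambda sees one group at a time as a `DataFrame`, and groups for which it returns `False` are removed from the result entirely.

```python
import pandas as pd

# Made-up data
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "flexible", "strict", "strict"],
    "review_scores_value": [8.0, 6.0, 10.0, 7.0, 8.0],
})

# Keep only the groups whose mean review score is above 8
kept = toy.groupby("cancellation_policy").filter(
    lambda group: group["review_scores_value"].mean() > 8
)
print(kept)
```

The surviving rows keep their original index labels, so `kept` can still be lined up against `toy`.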

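### Apply
* `apply()` is 95% of what I actually do with groups: it hands each group to your function as a regular `DataFrame` and combines whatever the function returns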

In [None]:
df = pd.read_csv("datasets/listings.csv")
df = df[['cancellation_policy', 'review_scores_value']]
df.head()

In [None]:
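def calc_mean_review_scores(group):
    # we can treat this as the complete dataframe
    avg = group['review_scores_value'].mean()  # mean() skips NaN by default
    # now broadcast our formula and create a new column
    # (assumed formula for illustration: each score's absolute distance from its group mean)
    return group.assign(review_scores_dev=(group['review_scores_value'] - avg).abs())

# Now just apply this to the groups
df.groupby('cancellation_policy', group_keys=False).apply(calc_mean_review_scores).head()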
