# Aggregating

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/dc-wikia-data-clean.csv')

## Basic summarizing and descriptive statistics

In [3]:
# numeric_only=True restricts the mean to numeric columns (needed in pandas 2.0+)
df.mean(numeric_only=True)

page_id        147441.209252
appearances        23.625134
year             1989.766662
dtype: float64

In [4]:
df.std(numeric_only=True)

page_id        108388.631149
appearances        87.378509
year               16.824194
dtype: float64

In [5]:
df['sex'].unique()

array(['Male', 'Female', nan, 'Genderless', 'Transgender'], dtype=object)

In [6]:
df['sex'].value_counts()

Male           4783
Female         1967
Genderless       20
Transgender       1
Name: sex, dtype: int64

In [7]:
df['year'].min()

1935.0

In [8]:
df['year'].max()

2013.0

## `groupby`

**Figure copied from [Jake VanderPlas's book](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb).**

![Split-apply-combine diagram](figures/jake-vanderplas-split-apply-combine.png)
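
The three steps in the figure can be sketched by hand and compared with the `groupby` one-liner. This is a minimal illustration on a hypothetical toy frame (the `toy`, `key`, and `data` names are assumptions, not part of the DC wikia file):

```python
import pandas as pd

# Hypothetical toy data, not taken from the DC wikia file
toy = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data': [1, 2, 3, 4, 5, 6],
})

# Split: one sub-frame per distinct key
pieces = {key: group for key, group in toy.groupby('key')}

# Apply: reduce each piece to a single number
sums = {key: group['data'].sum() for key, group in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='data').rename_axis('key')

# The same three steps done by pandas in one expression
builtin = toy.groupby('key')['data'].sum()

print(manual)
print(builtin)
```

Both results should hold the same per-key sums; the manual version only exists to make the split, apply, and combine stages visible.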

### Basic built-in aggregation functions

`count`, `sum`, `mean`, `median`, `std`, `var`, `min`, `max`, `prod`, `first`, `last`.
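
A short sketch of a few of these functions on a hypothetical toy frame (the data below is made up for illustration). It also shows that `count` only counts non-null values, whereas `size` (not in the list above) counts rows:

```python
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'appearances': [10.0, 30.0, 5.0, None, 15.0],
})

grouped = toy.groupby('sex')['appearances']

counts = grouped.count()     # non-null values per group
sizes = grouped.size()       # rows per group, missing values included
totals = grouped.sum()       # missing values are skipped
medians = grouped.median()
firsts = grouped.first()     # first non-null value in each group

print(pd.DataFrame({
    'count': counts,
    'size': sizes,
    'sum': totals,
    'median': medians,
    'first': firsts,
}))
```

Because `count` skips missing values, counting a column is only a reliable row count when that column has no nulls.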

In [11]:
df['page_id'].notnull().all()

True

In [12]:
# Select the column before aggregating so only page_id is counted
df.groupby('sex')['page_id'].count()

sex
Female         1967
Genderless       20
Male           4783
Transgender       1
Name: page_id, dtype: int64

In [13]:
df.groupby('sex').mean(numeric_only=True)

| sex | page_id | appearances | year |
|---|---|---|---|
| Female | 159307.531774 | 22.484574 | 1992.621983 |
| Genderless | 132061.75 | 12.842105 | 1990.85 |
| Male | 141814.427138 | 24.49989 | 1988.532841 |
| Transgender | 317067.0 | 4.0 | 2009.0 |


Grouping by several columns produces a result with a multi-level index (search for `MultiIndex` for more info).

In [15]:
df[df['sex'] == 'Transgender']

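| | page_id | name | urlslug | id | align | eye | hair | sex | gsm | alive | appearances | first appearance | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3877 | 317067 | Daystar (New Earth) | `\/wiki\/Daystar_(New_Earth)` | NaN | Bad | NaN | NaN | Transgender | NaN | Deceased | 4.0 | 2009, October | 2009.0 |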


In [14]:
df.groupby(['sex', 'align']).count()

| sex | align | page_id | name | urlslug | id | eye | hair | gsm | alive | appearances | first appearance | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | Bad | 597 | 597 | 597 | 419 | 325 | 472 | 7 | 596 | 568 | 596 | 596 |
| Female | Good | 953 | 953 | 953 | 714 | 582 | 812 | 18 | 953 | 909 | 941 | 941 |
| Female | Neutral | 196 | 196 | 196 | 138 | 125 | 169 | 3 | 196 | 193 | 194 | 194 |
| Female | Reformed | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| Genderless | Bad | 11 | 11 | 11 | 9 | 3 | 1 | 0 | 11 | 10 | 11 | 11 |
| Genderless | Good | 6 | 6 | 6 | 4 | 2 | 2 | 0 | 6 | 6 | 6 | 6 |
| Genderless | Neutral | 3 | 3 | 3 | 3 | 3 | 0 | 1 | 3 | 3 | 3 | 3 |
| Male | Bad | 2223 | 2223 | 2223 | 1542 | 861 | 1204 | 8 | 2223 | 2088 | 2208 | 2208 |
| Male | Good | 1843 | 1843 | 1843 | 1419 | 919 | 1306 | 25 | 1842 | 1756 | 1819 | 1819 |
| Male | Neutral | 359 | 359 | 359 | 254 | 191 | 253 | 1 | 359 | 348 | 353 | 353 |
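
A result grouped on two columns is indexed by a `MultiIndex`, which can be queried, pivoted, or flattened. The sketch below uses a hypothetical toy frame (the values are made up) to show a few common ways of working with such a result:

```python
import pandas as pd

# Hypothetical toy data, not taken from the DC wikia file
toy = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'align': ['Good', 'Bad', 'Good', 'Good', 'Bad'],
    'page_id': [1, 2, 3, 4, 5],
})

counts = toy.groupby(['sex', 'align'])['page_id'].count()

# Select one (sex, align) combination with a tuple
male_good = counts.loc[('Male', 'Good')]

# Select every alignment for one sex using the outer level only
female_counts = counts.loc['Female']

# Pivot the inner level into columns to get a wide table
wide = counts.unstack('align')

# Or flatten the index back into ordinary columns
flat = counts.reset_index(name='n')

print(wide)
print(flat)
```

`unstack` is convenient for reading the counts as a sex-by-alignment grid, while `reset_index` is usually easier when the result feeds further filtering or plotting.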


### Custom aggregations

Specifying pandas built-in functions by name.

In [16]:
df.groupby('sex').agg({'page_id': 'count'})

| sex | page_id |
|---|---|
| Female | 1967 |
| Genderless | 20 |
| Male | 4783 |
| Transgender | 1 |


Using multiple functions for the same column.

In [17]:
df.groupby('sex').agg({'appearances': ['mean', 'std']})

| sex | (appearances, mean) | (appearances, std) |
|---|---|---|
| Female | 22.484574 | 68.71708 |
| Genderless | 12.842105 | 11.922263 |
| Male | 24.49989 | 95.1682 |
| Transgender | 4.0 | NaN |
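
Passing a list of functions yields two-level column labels. As an alternative, pandas (0.25 and later) supports named aggregation, where each output column gets a flat name of your choosing. The sketch below runs on a hypothetical toy frame, and the output names `mean_appearances` and `std_appearances` are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical toy data; the last group has a single row
toy = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male', 'Transgender'],
    'appearances': [10.0, 20.0, 5.0, 15.0, 4.0],
})

# Each keyword is an output column: (input column, aggregation)
summary = toy.groupby('sex').agg(
    mean_appearances=('appearances', 'mean'),
    std_appearances=('appearances', 'std'),
)

print(summary)
```

The standard deviation of a group with a single row is `NaN`, because the sample standard deviation (pandas defaults to `ddof=1`) is undefined for one observation.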


Using custom Python functions.

In [19]:
def values_range(x):
    # x is one group's column as a Series; its max/min methods skip NaN
    return x.max() - x.min()

In [21]:
df.groupby('sex').agg({'appearances': values_range})

| sex | appearances |
|---|---|
| Female | 1230.0 |
| Genderless | 35.0 |
| Male | 3092.0 |
| Transgender | 0.0 |
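
A custom aggregation receives each group's column as a `Series`, so `Series` methods are available inside it. The sketch below, on hypothetical toy data, shows why that matters when values are missing: Python's built-in `max` can return `NaN` depending on where the missing value sits, while `Series.max` skips it. The function name `nan_aware_range` is an illustrative assumption:

```python
import pandas as pd

# A Series that starts with a missing value
s = pd.Series([float('nan'), 3.0, 10.0])

builtin_max = max(s)   # comparisons against NaN are always False
method_max = s.max()   # Series.max skips NaN by default

# Hypothetical toy data with a missing value in the Female group
toy = pd.DataFrame({
    'sex': ['Female', 'Female', 'Female', 'Male', 'Male'],
    'appearances': [float('nan'), 3.0, 10.0, 2.0, 7.0],
})

def nan_aware_range(x):
    # x is the 'appearances' Series of one group
    return x.max() - x.min()

ranges = toy.groupby('sex').agg({'appearances': nan_aware_range})

print(builtin_max, method_max)
print(ranges)
```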


## *Exercise*

Among bisexual characters, which sex appears the most? Is the same true for homosexual characters?

In [33]:
(
    df.groupby(['gsm', 'sex'])
    .agg({'appearances': ['sum', 'mean'], 'page_id': 'count'})
#     .sort_values('appearances')
    .rename(columns={'page_id': 'count'})
)

| gsm | sex | (appearances, sum) | (appearances, mean) | (count, count) |
|---|---|---|---|---|
| Bisexual | Female | 322.0 | 64.4 | 5 |
| Bisexual | Genderless | 20.0 | 20.0 | 1 |
| Bisexual | Male | 436.0 | 109.0 | 4 |
| Homosexual | Female | 1135.0 | 47.291667 | 24 |
| Homosexual | Male | 1001.0 | 33.366667 | 30 |
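
One possible way to pull out the top sex per `gsm` group programmatically, reading "appears the most" as the largest total of `appearances` (an assumption; counting characters instead would be an equally valid reading). The toy values below are hypothetical and are not the real dataset:

```python
import pandas as pd

# Hypothetical toy data, not taken from the DC wikia file
toy = pd.DataFrame({
    'gsm': ['Bisexual', 'Bisexual', 'Bisexual',
            'Homosexual', 'Homosexual', 'Homosexual'],
    'sex': ['Female', 'Female', 'Male', 'Female', 'Male', 'Male'],
    'appearances': [10.0, 5.0, 20.0, 30.0, 8.0, 12.0],
})

# Total appearances per (gsm, sex), kept as ordinary columns
totals = toy.groupby(['gsm', 'sex'], as_index=False)['appearances'].sum()

# For each gsm, keep the row whose total is largest
top_rows = totals.loc[totals.groupby('gsm')['appearances'].idxmax()]

print(top_rows)
```

Keeping the group keys as columns (`as_index=False`) avoids multi-level labels, which makes the `idxmax` lookup a plain row selection.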
