# Aggregates in Pandas

An aggregate is a single number that describes a group of numbers. Common aggregate statistics include the mean, median, and standard deviation.

Common calculations that are carried out:

- `mean`: average of all values in the column
- `std`: standard deviation
- `median`: median
- `max`: maximum value in the column
- `min`: minimum value in the column
- `count`: number of non-missing values in the column
- `nunique`: number of unique values in the column
- `unique`: list of unique values in the column
    
The general syntax:

```py
df.column_name.command()
```

```py
# The DataFrame customers contains the names and ages of all of your customers.
# You want to find the median age:
print(customers.age)
# >> [23, 25, 31, 35, 35, 46, 62]
print(customers.age.median())
# >> 35.0

# The DataFrame shipments contains address information for all shipments that
# you've sent out in the past year. You want to know how many different states:
print(shipments.state)
# >> ['CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ']
print(shipments.state.nunique())
# >> 3

# The DataFrame inventory contains a list of types of t-shirts that your company
# makes. You want a list of the colors that your shirts come in:
print(inventory.color)
# >> ['blue', 'blue', 'blue', 'blue', 'blue', 'green', 'green', 'orange', 'orange', 'orange']
print(inventory.color.unique())
# >> ['blue', 'green', 'orange']
```
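The three snippets above can be run end to end with a small sketch like the one below. The DataFrames are rebuilt from the printed values only; the real `customers`, `shipments`, and `inventory` tables presumably have more columns, so treat the construction as an assumption.

```python
import pandas as pd

# Small DataFrames rebuilt from the values printed above (assumed single-column)
customers = pd.DataFrame({"age": [23, 25, 31, 35, 35, 46, 62]})
shipments = pd.DataFrame({"state": ["CA"] * 4 + ["NY"] * 2 + ["NJ"] * 7})
inventory = pd.DataFrame({"color": ["blue"] * 5 + ["green"] * 2 + ["orange"] * 3})

median_age = customers.age.median()         # middle value of the sorted ages
n_states = shipments.state.nunique()        # number of distinct states
colors = inventory.color.unique().tolist()  # distinct colors, in order of appearance

print(median_age, n_states, colors)
```

Note that `.median()` returns a float, and `.unique()` returns an array rather than a plain list, which is why `.tolist()` is used here.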

```py
import pandas as pd
import numpy as np

df_train = pd.read_csv('data/train.csv')
df_train.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
```


```py
print('PClass types', df_train.Pclass.unique())
print('Embarked types', df_train.Embarked.unique())
```

```
PClass types [3 1 2]
Embarked types ['S' 'C' 'Q' nan]
```


```py
print('Max fare', df_train['Fare'].max())
print('Average fare', df_train['Fare'].mean())
print('Median', df_train['Fare'].median())
```

```
Max fare 512.3292
Average fare 32.204207968574636
Median 14.4542
```


```py
df_train.Fare.describe()
```

```
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64
```

```py
df_train.Age.describe()
```

```
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
```
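In the `describe()` output for `Age`, `count` is 714 rather than 891 because pandas aggregates skip missing values by default. A minimal sketch of that behavior, using a made-up `ages` Series (hypothetical values, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([22.0, 38.0, np.nan, 35.0], name="Age")

n_rows = len(ages)         # counts every row, missing or not
n_valid = ages.count()     # counts only non-missing values
mean_age = ages.mean()     # NaN is skipped: (22 + 38 + 35) / 3
summary = ages.describe()  # count, mean, std, min, quartiles, max in one call

print(n_rows, n_valid, round(mean_age, 2))
print(summary)
```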

### Groupby

Suppose we have a grade book with columns `student`, `assignment_name`, and `grade`. We want to get an average grade for each student across all assignments. Pandas provides the `.groupby()` method for this: group by `student`, then calculate the mean of `grade`.

```py
grades = df.groupby('student').grade.mean()
```

The general syntax:

```py
df.groupby('column1').column2.measurement()
```

Where:

- `column1`: the column we group by
- `column2`: the column we perform the calculation on
- `measurement`: the function applied, e.g. `mean()`, `std()`, etc.
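To make the pattern concrete, here is a small sketch of the grade book case. The student names and grades are invented for illustration; only the column names come from the description above.

```python
import pandas as pd

# Hypothetical grade book; names and grades are made up
df = pd.DataFrame({
    "student": ["Amy", "Amy", "Bob", "Bob"],
    "assignment_name": ["hw1", "hw2", "hw1", "hw2"],
    "grade": [90, 80, 70, 90],
})

# column1 = 'student', column2 = 'grade', measurement = mean
grades = df.groupby("student").grade.mean()

print(grades)
```

The result is a `Series` with one entry per student, indexed by the student name.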


```py
results = df_train.groupby('Pclass').Cabin.count()
print(type(results))
results
```

```
<class 'pandas.core.series.Series'>

Pclass
1    176
2     16
3     12
Name: Cabin, dtype: int64
```

### Example

```py
    id	first_name	last_name	shoe_type	shoe_material	shoe_color	price
0	97916	Douglas	Perez	stilettos	fabric	brown	90
1	67691	Tiffany	neill	wedges	leather	navy	94
2	72818	Susan	Rivas	sandals	faux-leather	white	96
3	28080	Angela	Hopper	stilettos	leather	red	96
4	89958	Thomas	Benjamin	sandals	faux-leather	navy	97

# find most expensive shoe for each type
pricey_shoes = orders.groupby('shoe_type').price.max()

print(type(pricey_shoes))
# <class 'pandas.core.series.Series'>
```

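```py
df_train.groupby('Fare')
```

On its own, `.groupby()` returns a `GroupBy` object rather than a result. Once we select a column and apply an aggregate, the result is a `Series` indexed by the group keys. We can turn that `Series` into a `DataFrame` using the `.reset_index()` method. You will generally see a `.groupby()` aggregation followed by `.reset_index()`. General syntax: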

```py
df.groupby('column1').column2.measurement().reset_index()
```
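The sketch below shows the difference `.reset_index()` makes, using only the `shoe_type` and `price` values from the five `orders` rows shown earlier (the rest of the table is omitted, so the maxima here apply to this subset only).

```python
import pandas as pd

# The five orders rows shown earlier, reduced to the two columns we need
orders = pd.DataFrame({
    "shoe_type": ["stilettos", "wedges", "sandals", "stilettos", "sandals"],
    "price": [90, 94, 96, 96, 97],
})

# Without reset_index: a Series indexed by shoe_type
as_series = orders.groupby("shoe_type").price.max()

# With reset_index: a DataFrame with shoe_type as an ordinary column
as_frame = orders.groupby("shoe_type").price.max().reset_index()

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame)
```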

When we use groupby, we often want to rename the column we get as a result.

```py
id	tea	category	caffeine	price
0	earl grey	black	38	3
1	english breakfast	black	41	3
2	irish breakfast	black	37	2.5
3	jasmine	green	23	4.5
```

We want to count how many teas we sell in each category.

```py
teas_counts = teas.groupby('category').id.count().reset_index()
```

This will return a `DataFrame`:

```py
	category	id
0	black	3
1	green	4
2	herbal	8
3	white	2
```

The new column is called `id` because we used the `id` column of `teas` to calculate the counts. We actually want to call this column `counts`. Use the `.rename()` method:

```py
teas_counts = teas_counts.rename(columns={"id": "counts"})
```
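As a runnable sketch, the count and rename steps can be chained. The `teas` frame below contains only the four rows shown in the table, with `id` values assumed to be 0 to 3, so the counts differ from the full output listed earlier. The second variant relies on the `name` argument of `Series.reset_index`, which names the value column directly and may be a convenient alternative to `.rename()`.

```python
import pandas as pd

# Only the four teas shown above; id values are assumed
teas = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "tea": ["earl grey", "english breakfast", "irish breakfast", "jasmine"],
    "category": ["black", "black", "black", "green"],
})

# Count, reset the index, then rename the 'id' column to 'counts'
renamed = (
    teas.groupby("category").id.count()
    .reset_index()
    .rename(columns={"id": "counts"})
)

# Alternative: name the value column while resetting the index
named = teas.groupby("category").id.count().reset_index(name="counts")

print(renamed)
```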

Some operations are too complex for a built-in aggregate and require a custom function. In that case we can use the `.apply()` method. When we combine `groupby` on a single column with `apply`, the input to our function is always the collection of that column's values for one group (a `Series`, which behaves like a list of values).

#### Example - calculating percentile

```py
id	name	wage	category
10131	Sarah Carney	39	product
14189	Heather Carey	17	design
15004	Gary Mercado	33	marketing
...
```

If we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) for each category, we can combine `apply` with a lambda function (this requires importing NumPy as `np`):

```py
import numpy as np
# np.percentile can calculate any percentile over an array of values
high_earners = (
    df.groupby('category').wage
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()
)
```

The output, `high_earners`, might look like this:

```py

    category	wage
0	design	23
1	marketing	35
2	product	48
...
```
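A self-contained sketch of the percentile calculation follows. The wages are hypothetical and chosen so the 75th percentile is easy to check by hand. The last line uses the built-in `quantile` method of grouped data, which should give the same numbers here because both default to linear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical wages, five employees per category
df = pd.DataFrame({
    "category": ["design"] * 5 + ["product"] * 5,
    "wage": [10, 20, 30, 40, 50, 30, 35, 40, 45, 50],
})

# The lambda receives one group's wages at a time as a Series
high_earners = (
    df.groupby("category").wage
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()
)

# Built-in alternative for this particular calculation
via_quantile = df.groupby("category").wage.quantile(0.75).reset_index()

print(high_earners)
```

`apply` is the more general tool; for common statistics a built-in method, when one exists, is usually shorter.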

`groupby` supports grouping by more than one column; simply pass in a list of column names.

```py
Location	Date	Day of Week	Total Sales
West Village	February 1	W	400
West Village	February 2	Th	450
Chelsea	February 1	W	375
Chelsea	February 2	Th	390
```

We suspect that sales are different at different locations on different days of the week. In order to test this hypothesis, we could calculate the average sales for each store on each day of the week across multiple months. The code would look like this:

```py
df.groupby(['Location', 'Day of Week'])['Total Sales'].mean().reset_index()
```

Returning:

```py
Location	Day of Week	Total Sales
Chelsea	M	402.50
Chelsea	Tu	422.75
Chelsea	W	452.00
...
West Village	M	390
West Village	Tu	400
...
```
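Here is a minimal runnable version of the two-column grouping. The first four rows are the ones shown in the sales table; the fifth (Chelsea, February 8) is a made-up extra Wednesday so that at least one group averages more than one value.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["West Village", "West Village", "Chelsea", "Chelsea", "Chelsea"],
    "Date": ["February 1", "February 2", "February 1", "February 2", "February 8"],
    "Day of Week": ["W", "Th", "W", "Th", "W"],
    "Total Sales": [400, 450, 375, 390, 425],  # last value is hypothetical
})

# One row per (Location, Day of Week) pair, with the mean of Total Sales
avg_sales = (
    df.groupby(["Location", "Day of Week"])["Total Sales"]
    .mean()
    .reset_index()
)

print(avg_sales)
```

Bracket notation (`['Total Sales']`) is needed here because the column name contains a space, so attribute access would not work.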

#### Example

Continuing the shoe example, create a `DataFrame` with the total number of shoes purchased for each `shoe_type`/`shoe_color` combination.

```py
orders.groupby(['shoe_type', 'shoe_color'])['id'].count().reset_index()

	shoe_type	shoe_color	id
0	ballet flats	black	2
1	ballet flats	brown	11
...
5	sandals	black	3
6	sandals	brown	10
...
```

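Note: When we're using `count()`, it doesn't matter which column we perform the calculation on, as long as that column has no missing values. `count()` skips `NaN` entries, which is why the `Cabin` counts per `Pclass` above add up to far fewer than the number of passengers.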
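One caveat worth keeping in mind is that `count()` ignores missing values, so the chosen column can change the answer when it contains `NaN`. The sketch below uses a tiny hypothetical `orders` frame with one missing color, and compares `count()` with `size()`, which counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical orders; the second row has no recorded color
orders = pd.DataFrame({
    "id": [1, 2, 3],
    "shoe_type": ["sandals", "sandals", "wedges"],
    "shoe_color": ["black", np.nan, "navy"],
})

by_id = orders.groupby("shoe_type").id.count()             # no NaN in id
by_color = orders.groupby("shoe_type").shoe_color.count()  # NaN is skipped
n_rows = orders.groupby("shoe_type").size()                # rows per group

print(by_id["sandals"], by_color["sandals"], n_rows["sandals"])
```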

### Pivot Tables

In the example above where we are running a chain of stores and have data about the number of sales at different locations on different days, it would be more useful if we could format the results like so:

```py
Location	M	Tu	W	Th	F	Sa	Su
Chelsea	400	390	250	275	300	150	175
West Village	300	310	350	400	390	250	200
...
```

Reorganizing the table in this way is called `pivoting`, and the new table is called a `pivot table`. The general syntax:

```py
df.pivot(columns='ColumnToPivot', index='ColumnToBeRows', values='ColumnToBeValues')
```

For our specific example, we would write the command like this:

```py
# First use the groupby statement:
unpivoted = df.groupby(['Location', 'Day of Week'])['Total Sales'].mean().reset_index()

# Now pivot the table
pivoted = unpivoted.pivot(
    columns='Day of Week',
    index='Location',
    values='Total Sales')
```

The result is a new `DataFrame` whose index is `Location`. Add `.reset_index()` to turn `Location` back into a regular column.
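The groupby-then-pivot steps can be tried on the four sales rows shown earlier. This is only a sketch on that small subset, so each average is over a single value.

```python
import pandas as pd

# The four rows from the sales table shown earlier
df = pd.DataFrame({
    "Location": ["West Village", "West Village", "Chelsea", "Chelsea"],
    "Day of Week": ["W", "Th", "W", "Th"],
    "Total Sales": [400, 450, 375, 390],
})

# Long format: one row per (Location, Day of Week)
unpivoted = df.groupby(["Location", "Day of Week"])["Total Sales"].mean().reset_index()

# Wide format: one row per Location, one column per day
pivoted = unpivoted.pivot(columns="Day of Week", index="Location", values="Total Sales")

# Location moves from the index back into an ordinary column
flat = pivoted.reset_index()

print(flat)
```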

#### Example

Continuing the shoe example:

```py
shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()

shoe_counts_pivot = shoe_counts.pivot(
    columns='shoe_color',
    index='shoe_type',
    values='id'
).reset_index()
```

This results in:

```py
	shoe_type	black	brown	navy	red	white
0	ballet flats	2.0	11.0	17.0	13.0	7.0
1	sandals	3.0	10.0	13.0	14.0	10.0
2	stilettos	8.0	14.0	7.0	16.0	5.0
3	wedges	nan	13.0	16.0	4.0	17.0
```