# Aggregating DataFrames

## Summary statistics

### Summarizing numerical data

Summary statistics are numbers that attempt to summarize a dataset. They are useful for describing a large amount of data with just a few numbers and for comparing datasets.

In [2]:
# Load the sales data and explore it by printing the first few rows
import pandas as pd
sales = pd.read_csv('./data/sales_subset.csv', index_col=0)
sales.head()

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808


In [3]:
# Print the mean and the median of the weekly_sales column
display(sales['weekly_sales'].mean())
display(sales['weekly_sales'].median())

23843.950148505668

12049.064999999999
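
The mean above is roughly twice the median, which is typical of a column where a few very large values pull the mean upward. A minimal sketch of that effect, using a small made-up Series (`toy_sales` and its values are hypothetical, not taken from the sales data):

```python
import pandas as pd

# Hypothetical weekly sales with one unusually large week
toy_sales = pd.Series([100.0, 120.0, 130.0, 150.0, 5000.0])

mean_sales = toy_sales.mean()      # pulled upward by the single large value
median_sales = toy_sales.median()  # middle value, unaffected by the outlier

print(mean_sales, median_sales)
```

Comparing the two numbers side by side is a quick way to spot skew before plotting anything.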

### Summarizing dates

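Summary statistics can also be calculated on date columns, ideally ones with the data type datetime64. Some summary statistics, like the mean, are rarely meaningful for dates, but others are helpful: the minimum and maximum show what time range your data covers. In the cell below the date column was read as plain strings (the outputs are quoted), and the minimum and maximum are still correct because ISO-formatted dates (YYYY-MM-DD) sort chronologically as text.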

In [4]:
# Print the maximum and minimum of the date column
display(sales['date'].min())
display(sales['date'].max())

'2010-02-05'

'2012-10-26'
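
The quoted outputs suggest the date column holds strings rather than datetime64 values. A minimal sketch of converting such a column with `pd.to_datetime()`, which also makes date arithmetic possible (the `toy` DataFrame is a hypothetical stand-in for `sales`):

```python
import pandas as pd

# Hypothetical stand-in for the date column of sales, stored as strings
toy = pd.DataFrame({"date": ["2010-02-05", "2012-10-26", "2011-06-17"]})

# Convert the strings to a datetime64 column
toy["date"] = pd.to_datetime(toy["date"])

first_day = toy["date"].min()
last_day = toy["date"].max()
span = last_day - first_day  # a Timedelta, only available for real datetimes

print(toy["date"].dtype)
print(first_day, last_day, span)
```

With the real data, passing `parse_dates=['date']` to `pd.read_csv()` should achieve the same conversion at load time.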

### The `.agg()` method

The `.agg()` method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once.

In [5]:
# Import NumPy to use np.mean()
import numpy as np

# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Use the custom iqr function along with .agg() to print the IQR and the mean of the temperature_c, fuel_price_usd_per_l, and unemployment columns
sales[['temperature_c', 'fuel_price_usd_per_l', 'unemployment']].agg([iqr, np.mean])

Unnamed: 0,temperature_c,fuel_price_usd_per_l,unemployment
iqr,16.583333,0.073176,0.565
mean,15.731978,0.749746,8.082009
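
To make the behavior of `.agg()` easier to check by hand, here is a minimal sketch on a tiny made-up DataFrame (`toy` and its columns `a` and `b` are hypothetical). It mixes the custom `iqr` function with built-in aggregations referred to by name:

```python
import pandas as pd

def iqr(column):
    # Interquartile range: 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical data, small enough to verify by hand
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per aggregation, one column per input column
summary = toy.agg([iqr, "mean", "median"])
print(summary)
```

Each function is applied to each column separately, so the result has the aggregation names as its index and the original column names as its columns.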


### Cumulative summary statistics

Cumulative statistics are useful for tracking a summary statistic over time. Methods such as `.cumsum()` and `.cummax()` return one value per row: the statistic computed over all rows up to and including that row.

In [12]:
# Create a DataFrame called sales_1_1, containing the sales data for department 1 of store 1
sales_1_1 = sales[(sales['department'] == 1) & (sales['store'] == 1)]

# Sort the rows of sales_1_1 by the date column in ascending order
# (sort_values returns a new DataFrame, so the result is reassigned)
sales_1_1 = sales_1_1.sort_values('date')

# Get the cumulative sum of weekly_sales and add it as a new column called cum_weekly_sales
# Note: the cumulative columns are added to the full sales DataFrame, not to sales_1_1,
# which is what the output below (and the later outputs) show
sales['cum_weekly_sales'] = sales['weekly_sales'].cumsum()

# Get the cumulative maximum of weekly_sales, and add it as a column called cum_max_sales
sales['cum_max_sales'] = sales['weekly_sales'].cummax()

# Print the date, weekly_sales, cum_weekly_sales, and cum_max_sales columns
sales[['date', 'weekly_sales', 'cum_weekly_sales', 'cum_max_sales']]

Unnamed: 0,date,weekly_sales,cum_weekly_sales,cum_max_sales
0,2010-02-05,24924.50,2.492450e+04,24924.50
1,2010-03-05,21827.90,4.675240e+04,24924.50
2,2010-04-02,57258.43,1.040108e+05,57258.43
3,2010-05-07,17413.94,1.214248e+05,57258.43
4,2010-06-04,17558.09,1.389829e+05,57258.43
...,...,...,...,...
10769,2011-12-09,895.00,2.568930e+08,293966.05
10770,2012-02-03,350.00,2.568934e+08,293966.05
10771,2012-06-08,450.00,2.568938e+08,293966.05
10772,2012-07-13,0.06,2.568938e+08,293966.05
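
The output above runs over every store and department in row order. A minimal sketch of restricting the cumulative statistics to a single store, and of computing them per store in one step, assuming a small made-up DataFrame (`toy`, `store_1`, and `per_store_cumsum` are hypothetical names):

```python
import pandas as pd

# Hypothetical data: two stores, dates deliberately out of order
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2],
    "date": ["2010-03-05", "2010-02-05", "2010-04-02", "2010-02-05", "2010-03-05"],
    "weekly_sales": [20.0, 10.0, 50.0, 5.0, 7.0],
})

# Filter to one store, sort by date, and copy so new columns can be added safely
store_1 = toy[toy["store"] == 1].sort_values("date").copy()
store_1["cum_weekly_sales"] = store_1["weekly_sales"].cumsum()
store_1["cum_max_sales"] = store_1["weekly_sales"].cummax()
print(store_1)

# Alternative: running totals that restart for each store
toy_sorted = toy.sort_values(["store", "date"])
per_store_cumsum = toy_sorted.groupby("store")["weekly_sales"].cumsum()
print(per_store_cumsum)
```

Sorting before calling `.cumsum()` matters because cumulative methods follow the current row order, not the date values.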


## Counting

### Dropping duplicates

To find out how many unique values one of your variables has, you sometimes need to drop duplicate values from your dataset.

In [16]:
# Remove rows of sales with duplicate pairs of store and type and print the head
store_types = sales.drop_duplicates(subset=['store', 'type'])
store_types.head()


Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment,cum_weekly_sales,cum_max_sales
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106,24924.5,24924.5
901,2,A,1,2010-02-05,35034.06,False,4.55,0.679451,8.324,18863178.61,140504.41
1798,4,A,1,2010-02-05,38724.42,False,6.533333,0.686319,8.623,42653008.31,178982.89
2699,6,A,1,2010-02-05,25619.0,False,4.683333,0.679451,7.259,66180317.34,178982.89
3593,10,B,1,2010-02-05,40212.84,False,12.411111,0.782478,9.765,85470611.89,178982.89


In [20]:
# Remove rows of sales with duplicate pairs of store and department, save as store_depts, and print the head
store_depts = sales.drop_duplicates(subset=['store', 'department'])
store_depts.head()

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment,cum_weekly_sales,cum_max_sales
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106,24924.5,24924.5
12,1,A,2,2010-02-05,50605.27,False,5.727778,0.679451,8.106,332506.33,57258.43
24,1,A,3,2010-02-05,13740.12,False,5.727778,0.679451,8.106,864694.67,57258.43
36,1,A,4,2010-02-05,39954.04,False,5.727778,0.679451,8.106,1045379.67,57258.43
48,1,A,5,2010-02-05,32229.38,False,5.727778,0.679451,8.106,1498242.08,57258.43
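
A minimal sketch of how the `subset` argument decides which rows count as duplicates, using a small made-up DataFrame (`toy` is hypothetical). By default `.drop_duplicates()` keeps the first row of each combination:

```python
import pandas as pd

# Hypothetical data with repeated store/type and store/department pairs
toy = pd.DataFrame({
    "store": [1, 1, 2, 2, 2],
    "type": ["A", "A", "A", "A", "B"],
    "department": [1, 2, 1, 1, 3],
})

# Keep the first row for each unique (store, type) pair
store_types = toy.drop_duplicates(subset=["store", "type"])

# Keep the first row for each unique (store, department) pair
store_depts = toy.drop_duplicates(subset=["store", "department"])

print(store_types)
print(store_depts)
```

The original index labels are preserved, which is why the outputs above show gaps such as 0, 901, 1798.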


### The `.value_counts()` method

Counting is a common way to summarize categorical variables and to get an overview of your data.

In [21]:
# Count the rows for each department in store_depts (one row per store that has the department), sorting the counts in descending order
store_depts.value_counts('department', sort=True)

department
1     12
55    12
72    12
71    12
67    12
      ..
37    10
48     8
50     6
39     4
43     2
Length: 80, dtype: int64

In [22]:
# Get the same counts as proportions of all rows in store_depts, sorting the proportions in descending order
store_depts.value_counts('department', normalize=True, sort=True)

department
1     0.012917
55    0.012917
72    0.012917
71    0.012917
67    0.012917
        ...   
37    0.010764
48    0.008611
50    0.006459
39    0.004306
43    0.002153
Length: 80, dtype: float64
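
A minimal sketch of counts versus proportions on a small made-up column (`toy` is hypothetical). It uses the Series form, `toy["department"].value_counts()`, rather than the DataFrame form used above:

```python
import pandas as pd

# Hypothetical data: department 1 appears three times, 2 twice, 3 once
toy = pd.DataFrame({"department": [1, 1, 1, 2, 2, 3]})

# Counts, sorted in descending order by default
counts = toy["department"].value_counts()

# Proportions of the total instead of raw counts
props = toy["department"].value_counts(normalize=True)

print(counts)
print(props)
```

The proportions are the counts divided by the number of rows, so they always sum to 1.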

## Grouped summary statistics

### The `.groupby()` method

Sometimes it is more informative to calculate summary statistics for groups of rows, defined by the values of a categorical variable, than to calculate them for the entire column.

In [23]:
# Group sales by "type", take the sum of "weekly_sales", and store as sales_by_type
sales_by_type = sales.groupby('type')['weekly_sales'].sum()

# Calculate the proportion of sales at each store type by dividing by the sum of sales_by_type
# Assign to sales_propn_by_type
sales_propn_by_type = sales_by_type/sales_by_type.sum()
sales_propn_by_type

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64

In [24]:
# Group sales by "type" and "is_holiday" and take the sum of weekly_sales
sales.groupby(['type', 'is_holiday'])['weekly_sales'].sum()

type  is_holiday
A     False         2.336927e+08
      True          2.360181e+04
B     False         2.317678e+07
      True          1.621410e+03
Name: weekly_sales, dtype: float64
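
A minimal sketch of what `.groupby()` computes, checked against a manual filter, on a small made-up DataFrame (`toy` and the variable names are hypothetical):

```python
import pandas as pd

# Hypothetical data: two store types, one holiday week
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "is_holiday": [False, True, False, False],
    "weekly_sales": [100.0, 50.0, 30.0, 20.0],
})

# Total sales per type, then each type's share of all sales
by_type = toy.groupby("type")["weekly_sales"].sum()
propn_by_type = by_type / by_type.sum()

# The same total for type A, computed by filtering manually
manual_a = toy[toy["type"] == "A"]["weekly_sales"].sum()

# Grouping by two columns gives one value per observed combination
by_type_holiday = toy.groupby(["type", "is_holiday"])["weekly_sales"].sum()

print(propn_by_type)
print(by_type_holiday)
```

Only combinations that occur in the data appear in the result, so there is no row for type B on a holiday here.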

### Multiple grouped summaries

In [25]:
# Get the min, max, mean, and median of unemployment and fuel_price_usd_per_l for each store type
sales.groupby('type')[['unemployment', 'fuel_price_usd_per_l']].agg([np.min, np.max, np.mean, np.median])

,unemployment,unemployment,unemployment,unemployment,fuel_price_usd_per_l,fuel_price_usd_per_l,fuel_price_usd_per_l,fuel_price_usd_per_l
type,amin,amax,mean,median,amin,amax,mean,median
A,3.879,8.992,7.972611,8.067,0.664129,1.10741,0.744619,0.735455
B,7.17,9.765,9.279323,9.199,0.760023,1.107674,0.805858,0.803348
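
The same kind of summary can be requested with string names instead of NumPy functions. A minimal sketch on a small made-up DataFrame (`toy` is hypothetical); to my understanding, recent pandas releases warn when NumPy functions such as `np.mean` are passed to `.agg()`, and the string form labels the columns `min` and `max` rather than `amin` and `amax`:

```python
import pandas as pd

# Hypothetical data with two store types
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "unemployment": [8.0, 7.0, 9.0, 10.0],
    "fuel_price_usd_per_l": [0.7, 0.8, 0.9, 1.0],
})

# Several statistics for several columns, one row per group
stats = toy.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg(
    ["min", "max", "mean", "median"]
)
print(stats)

# The columns form a two-level index: (column name, statistic)
mean_unemployment_a = stats.loc["A", ("unemployment", "mean")]
print(mean_unemployment_a)
```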


## Pivot tables

### Pivoting on one variable

Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations.

In [26]:
# Get the mean and median of weekly_sales by type using .pivot_table()
sales.pivot_table(values='weekly_sales', index='type', aggfunc=[np.mean, np.median])

,mean,median
type,weekly_sales,weekly_sales
A,23674.667242,11943.92
B,25696.67837,13336.08
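
To illustrate that a pivot table is another way of writing a grouped calculation, here is a minimal sketch that computes the same numbers both ways on a small made-up DataFrame (`toy`, `pivoted`, and `grouped` are hypothetical names):

```python
import pandas as pd

# Hypothetical data with two store types
toy = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B"],
    "weekly_sales": [10.0, 20.0, 60.0, 5.0, 15.0],
})

# Pivot table: rows are the groups, aggfunc lists the statistics
pivoted = toy.pivot_table(values="weekly_sales", index="type",
                          aggfunc=["mean", "median"])

# Equivalent grouped calculation
grouped = toy.groupby("type")["weekly_sales"].agg(["mean", "median"])

print(pivoted)
print(grouped)
```

The values match; the main difference is that the pivot table labels its columns with both the statistic and the value column.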


### Fill in missing values and sum values with pivot tables

In [27]:
# Print the mean weekly_sales by department and type, filling in any missing values with 0 and adding an "All" row and column (margins=True) that hold the overall means
sales.pivot_table(values='weekly_sales', index='department', columns='type', fill_value=0, margins=True)

type,A,B,All
department,,,
1,30961.725379,44050.626667,32052.467153
2,67600.158788,112958.526667,71380.022778
3,17160.002955,30580.655000,18278.390625
4,44285.399091,51219.654167,44863.253681
5,34821.011364,63236.875000,37189.000000
...,...,...,...
96,21367.042857,9528.538333,20337.607681
97,28471.266970,5828.873333,26584.400833
98,12875.423182,217.428333,11820.590278
99,379.123659,0.000000,379.123659
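
A minimal sketch of how `fill_value` and `margins` interact, on a small made-up DataFrame in which one department has no type B rows (`toy` and `table` are hypothetical names). With the default `aggfunc`, the "All" row and column hold means computed from the underlying rows, not sums and not averages of the filled-in zeros:

```python
import pandas as pd

# Hypothetical data: department 3 exists only in type A stores
toy = pd.DataFrame({
    "department": [1, 1, 2, 2, 3],
    "type": ["A", "B", "A", "B", "A"],
    "weekly_sales": [10.0, 30.0, 20.0, 40.0, 6.0],
})

table = toy.pivot_table(
    values="weekly_sales",
    index="department",
    columns="type",
    fill_value=0,   # replace the missing (3, B) cell with 0
    margins=True,   # add an "All" row and column
)
print(table)
```

Here the (3, B) cell is filled with 0, while the "All" value for department 3 is still 6.0, which mirrors department 99 in the output above.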
