# Chapter II : Aggregating DataFrames

## Summary Statistics

Summary statistics, as the name suggests, are numbers that summarize your dataset and tell you something about it.

### Summarizing Numerical Data

* Mean: The average of all the values in a column
* Median: The middle value of a column when its values are sorted
* Mode: The most common value in a column
* Min: The lowest value in a column
* Max: The highest value in a column
* Variance: The average of the squared differences between each value in a column and the mean (by default, pandas' **.var()** divides by n - 1 rather than n, giving the sample variance)
* Standard Deviation: The square root of the variance
* Sum: The sum of all the values in a column
* Quantile: The value below which a given fraction of a column's values fall; for example, the 0.3 quantile is the 30th percentile
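
As a minimal sketch of how the statistics above map onto pandas Series methods, the example below uses a small stand-in Series whose values match the dog weights printed later in this chapter. The variable names are illustrative only.

```python
import pandas as pd

# Stand-in data: the seven dog weights (kg) shown later in this chapter
weights = pd.Series([25, 23, 22, 17, 29, 2, 74])

mean = weights.mean()
median = weights.median()
lowest, highest = weights.min(), weights.max()
total = weights.sum()

# .var() and .std() use ddof=1 (sample statistics) by default;
# ddof=0 gives the population variance (divide by n instead of n - 1)
sample_var = weights.var()
population_var = weights.var(ddof=0)
std = weights.std()

# 0.3 quantile (30th percentile), linearly interpolated between data points
q30 = weights.quantile(0.3)

print(mean, median, lowest, highest, total, q30)
```

The 0.3 quantile computed here is the same value that the custom `pct30` function returns for the weight column further down.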


In [2]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 

dogs = pd.read_csv('datasets\\Dogs.csv')

In [4]:
# Summarizing numerical data

dogs['Height(cm)'].mean()

49.714285714285715

In [6]:
# Oldest Dog
dogs[' Date of Birth '].min()

'2011-12-11'

In [8]:
# Youngest Dog
dogs[' Date of Birth '].max()

'2018-02-27'

In [9]:
# The .agg() method
# This method allows you to compute custom summary statistics.

def pct30(column):
    return column.quantile(0.3)

dogs[' Weight (kg)'].agg(pct30)

21.0

In [10]:
# Summaries on multiple columns
# .agg() can also be used on multiple columns

dogs[[' Weight (kg)', 'Height(cm)']].agg(pct30)

 Weight (kg)    21.0
Height(cm)      45.4
dtype: float64

In [11]:
# Multiple Summaries
# We can also use .agg() to get multiple summary statistics at once.

def pct40(column):
    return column.quantile(0.4)

dogs[' Weight (kg)'].agg([pct30, pct40])

pct30    21.0
pct40    22.4
Name:  Weight (kg), dtype: float64

### Cumulative Sum

pandas also has methods for computing cumulative statistics, for example cumulative sum.  
Calling cumsum on a column returns not just one number, but a number for each row of the DataFrame.

In [12]:
# Weight Column
dogs[' Weight (kg)']

0    25
1    23
2    22
3    17
4    29
5     2
6    74
Name:  Weight (kg), dtype: int64

In [13]:
# Cumulative sum of Weight Column
dogs[' Weight (kg)'].cumsum()

0     25
1     48
2     70
3     87
4    116
5    118
6    192
Name:  Weight (kg), dtype: int64

### Cumulative Statistics

* .cummax() : The running maximum: for each row, the largest value from the first row up to and including that row
* .cummin() : The running minimum: for each row, the smallest value from the first row up to and including that row
* .cumprod() : The running product: for each row, the product of all values from the first row up to and including that row

These all return an entire column of a DataFrame, rather than a single number.
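
A short sketch of these methods, assuming the same seven weights printed above and a second, made-up Series for the product (chosen only to keep the numbers small):

```python
import pandas as pd

# Same values as the weight column shown earlier
weights = pd.Series([25, 23, 22, 17, 29, 2, 74])

# Each method returns a Series with one value per row
running = pd.DataFrame({
    'weight': weights,
    'cummax': weights.cummax(),  # largest value seen so far
    'cummin': weights.cummin(),  # smallest value seen so far
})
print(running)

# Illustrative data for the running product
small = pd.Series([1, 2, 3, 4])
running_product = small.cumprod()
print(running_product.tolist())
```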

### Walmart

In this chapter we'll be working with data on Walmart, a chain of department stores in the US. The dataset contains weekly sales in US dollars for various stores. Each store has an ID number and a specific store type, and the sales are also broken down by department ID. Along with weekly sales, there is information about whether it was a holiday week, the average temperature during the week at that location, the average fuel price in dollars per liter that week, and the national unemployment rate that week.

In [15]:
wallmart_df = pd.read_csv('datasets\\sales_subset.csv')

# Print the head of the wallmart_df DataFrame
print(wallmart_df.head())

# Print the info about the wallmart_df DataFrame
print(wallmart_df.info())

# Print the mean of weekly_sales
print(wallmart_df['weekly_sales'].mean())

# Print the median of weekly_sales
print(wallmart_df['weekly_sales'].median())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------

In [17]:
# Print the maximum of the date column
print(wallmart_df['date'].max())

# Print the minimum of the date column
print(wallmart_df['date'].min())

2012-10-26
2010-02-05


In [27]:
# Create a custom IQR (interquartile range) function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print IQR of the temperature_c column
print(wallmart_df['temperature_c'].agg(iqr))

# Print the IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(wallmart_df[['temperature_c', 'fuel_price_usd_per_l', 'unemployment']].agg([iqr, np.median]))

16.583333333333336
        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099


## Counting

Imagine a DataFrame that contains vet visits. The vet's office wants to know how many dogs of each breed have visited their office. However, some dogs have been to the vet more than once, so we can't just count the number of each breed in the breed column.

### Dropping duplicate names

Let's try to fix this by removing rows that contain a dog name already listed earlier in the dataset; in other words, we'll keep only the first row for each name.  
We can do this by using the **drop_duplicates()** method.

```python
vet_visits.drop_duplicates(subset=['Name'])
```

It takes an argument, subset, which is the column we want to find our duplicates based on.  
Because we have two different dogs with the same name, we'll need to consider more than just name when dropping duplicates.

### Dropping Duplicate pairs

To base our duplicate dropping on multiple columns, we can pass a list of column names to the subset argument, in this case, name and breed.

```python
unique_dogs = vet_visits.drop_duplicates(subset=['Name', 'Breed'])
```
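
The vet_visits data is not loaded anywhere in this chapter, so the sketch below builds a small hypothetical version (the names, breeds and dates are invented for illustration) to show how the two subset choices differ:

```python
import pandas as pd

# Hypothetical vet_visits data: two different dogs are both called Max
vet_visits = pd.DataFrame({
    'Name': ['Bella', 'Max', 'Bella', 'Max', 'Cooper'],
    'Breed': ['Labrador', 'Labrador', 'Labrador', 'Chow Chow', 'Schnauzer'],
    'Date': ['2018-09-02', '2019-01-03', '2019-03-13', '2019-01-20', '2019-02-11'],
})

# Deduplicating on Name alone wrongly drops the second Max (the Chow Chow)
by_name = vet_visits.drop_duplicates(subset=['Name'])

# Deduplicating on the Name/Breed pair keeps both dogs called Max
unique_dogs = vet_visits.drop_duplicates(subset=['Name', 'Breed'])

print(len(by_name), len(unique_dogs))
```

In both cases the first occurrence of each duplicate is the row that is kept, which is the default behavior of drop_duplicates.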

### Easy as 1, 2, 3

To count the dogs of each breed, we'll subset the breed column and use the **value_counts** method.  
By default the counts are sorted in descending order, so the breeds with the biggest counts come first; passing **sort=True** makes this explicit.

```python	
unique_dogs['Breed'].value_counts()
```

or with the **sort** argument

```python
unique_dogs['Breed'].value_counts(sort=True)
```

### Proportion

The normalize argument can be used to turn the counts into proportions of the total.  
For example, 25% of the dogs that go to this vet are Labrador Retrievers.

```python
unique_dogs['Breed'].value_counts(normalize=True)
```
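
A minimal sketch of counts versus proportions, assuming a tiny hypothetical unique_dogs DataFrame (the four dogs below are invented, so the proportions differ from the vet's real data):

```python
import pandas as pd

# Hypothetical deduplicated data: one row per dog
unique_dogs = pd.DataFrame({
    'Name': ['Bella', 'Max', 'Charlie', 'Cooper'],
    'Breed': ['Labrador', 'Chow Chow', 'Labrador', 'Schnauzer'],
})

# Absolute counts, largest first
counts = unique_dogs['Breed'].value_counts()

# Same counts divided by the total, so the values sum to 1
proportions = unique_dogs['Breed'].value_counts(normalize=True)

print(counts)
print(proportions)
```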


In [29]:
# Drop duplicate store/type combinations
store_types = wallmart_df.drop_duplicates(subset=['store', 'type'])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = wallmart_df.drop_duplicates(subset=['store', 'department']) 
print(store_depts.head())

# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = wallmart_df[wallmart_df['is_holiday']].drop_duplicates(subset='date')

# Print date col of holiday_dates
print(holiday_dates['date'])

      Unnamed: 0  store type  department        date  weekly_sales  \
0              0      1    A           1  2010-02-05      24924.50   
901          901      2    A           1  2010-02-05      35034.06   
1798        1798      4    A           1  2010-02-05      38724.42   
2699        2699      6    A           1  2010-02-05      25619.00   
3593        3593     10    B           1  2010-02-05      40212.84   

      is_holiday  temperature_c  fuel_price_usd_per_l  unemployment  
0          False       5.727778              0.679451         8.106  
901        False       4.550000              0.679451         8.324  
1798       False       6.533333              0.686319         8.623  
2699       False       4.683333              0.679451         7.259  
3593       False      12.411111              0.782478         9.765  
    Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0            0      1    A           1  2010-02-05      24924.50       False   

In [30]:
# Count the number of stores of each type
store_counts = store_types['type'].value_counts()
print(store_counts)

# get the proportion of stores of each type
store_props = store_types['type'].value_counts(normalize=True)
print(store_props)

# Count the number of each department number and sort 
dept_counts_sorted = store_depts['department'].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort 
dept_props_sorted = store_depts['department'].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

A    11
B     1
Name: type, dtype: int64
A    0.916667
B    0.083333
Name: type, dtype: float64
1     12
55    12
72    12
71    12
67    12
      ..
37    10
48     8
50     6
39     4
43     2
Name: department, Length: 80, dtype: int64
1     0.012917
2     0.012917
3     0.012917
4     0.012917
5     0.012917
        ...   
95    0.012917
96    0.012917
97    0.012917
98    0.012917
99    0.011841
Name: department, Length: 80, dtype: float64


## Grouped Summary Statistics

While computing summary statistics of entire columns may be useful, you can gain many insights from summaries of individual groups. For example, does one color of dog weigh more than another on average? Are female dogs taller than males?

In [35]:
# Summaries by group

dogs[dogs['Colour'] == 'Black'][' Weight (kg)'].mean()
dogs[dogs['Colour'] == 'Brown'][' Weight (kg)'].mean()
dogs[dogs['Colour'] == 'White'][' Weight (kg)'].mean()
dogs[dogs['Colour'] == 'Gray'][' Weight (kg)'].mean()
dogs[dogs['Colour'] == 'Tan'][' Weight (kg)'].mean()

# But this is cumbersome and error-prone; we can do it in just one line
print(dogs.groupby('Colour')[' Weight (kg)'].mean())

Colour
Black    26.0
Brown    23.5
Gray     17.0
Tan       2.0
White    74.0
Name:  Weight (kg), dtype: float64


That's where the **groupby** method comes in. We can group by the color variable, select the weight column, and take the mean. This gives us the mean weight for each dog color.  
Just like with ungrouped summary statistics, we can use the **agg** method to get multiple statistics.

In [36]:
# Here, we pass a list of function names into agg after grouping by color.

dogs.groupby('Colour')[' Weight (kg)'].agg(['min', 'max', 'sum'])

# This gives the minimum, maximum and sum of the different colored dogs' weights.

        min  max  sum
Colour               
Black    23   29   52
Brown    22   25   47
Gray     17   17   17
Tan       2    2    2
White    74   74   74


We can also group by multiple columns and calculate summary statistics.

In [37]:
# Here, we group by color and breed, select the weight column and take the mean.

dogs.groupby(['Colour', 'Breed'])[' Weight (kg)'].mean()

Colour  Breed      
Black   Labrador       29
        Poodle         23
Brown   Labrador       25
        Poodle         22
Gray    Schnauzer      17
Tan     Chihuahua       2
White   St. Bernard    74
Name:  Weight (kg), dtype: int64

In [38]:
# We can also group by multiple columns and aggregate by multiple columns.

dogs.groupby(['Colour', 'Breed'])[[' Weight (kg)', 'Height(cm)']].mean()

                     Weight (kg)  Height(cm)
Colour Breed                                
Black  Labrador               29          59
       Poodle                 23          43
Brown  Labrador               25          56
       Poodle                 22          46
Gray   Schnauzer              17          49
Tan    Chihuahua               2          18
White  St. Bernard            74          77


In [40]:
# Calc total weekly sales 
sales_all = wallmart_df['weekly_sales'].sum()

# Subset for type A stores, calc total weekly sales 
sales_A =  wallmart_df[wallmart_df['type'] == 'A' ]['weekly_sales'].sum()

# Subset for type B stores, calc total weekly sales
sales_B = wallmart_df[wallmart_df['type'] == 'B' ]['weekly_sales'].sum()

# Subset for type C stores, calc total weekly sales
sales_C = wallmart_df[wallmart_df['type'] == 'C' ]['weekly_sales'].sum()

# Get proportion for each type 
sales_propn_by_type =  [sales_A, sales_B, sales_C] / sales_all

print(sales_propn_by_type)

[0.9097747 0.0902253 0.       ]


In [47]:
# Group by type; calc total weekly sales
sales_by_type = wallmart_df.groupby('type')['weekly_sales'].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type / sales_by_type.sum()

print(sales_propn_by_type)

# Group by type and is_holiday; calc total weekly sales
sales_by_type_is_holiday = wallmart_df.groupby(['type', 'is_holiday'])['weekly_sales'].sum()
print(sales_by_type_is_holiday)

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64
type  is_holiday
A     False         2.336927e+08
      True          2.360181e+04
B     False         2.317678e+07
      True          1.621410e+03
Name: weekly_sales, dtype: float64


In [48]:
# Import numpy with alias np
import numpy as np 

# For each store type, aggregate weekly_sales: get min, max, mean and median
sales_stats = wallmart_df.groupby('type')['weekly_sales'].agg([np.min, np.max, np.mean, np.median])

# Print sales_stats
print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean and median
unemp_fuel_stats = wallmart_df.groupby('type')[['unemployment', 'fuel_price_usd_per_l']].agg([np.min, np.max, np.mean, np.median])

# Print unemp_fuel_stats
print(unemp_fuel_stats)

        amin       amax          mean    median
type                                           
A    -1098.0  293966.05  23674.667242  11943.92
B     -798.0  232558.51  25696.678370  13336.08
     unemployment                         fuel_price_usd_per_l            \
             amin   amax      mean median                 amin      amax   
type                                                                       
A           3.879  8.992  7.972611  8.067             0.664129  1.107410   
B           7.170  9.765  9.279323  9.199             0.760023  1.107674   

                          
          mean    median  
type                      
A     0.744619  0.735455  
B     0.805858  0.803348  




## Pivot Tables

If you've ever used a spreadsheet, chances are you've used a pivot table. Let's see how to create pivot tables in pandas.

In the last lesson, we grouped dogs by color and calculated their mean weights. We can do the same thing using the **pivot_table** method.  
The **values** argument is the column that you want to summarize, and the **index** is the column that you want to group by. By default, pivot_table takes the **mean** value for each group.

```python	
dogs.groupby('color')['weight_kg'].mean()
```

This is the same as:

```python
dogs.pivot_table(values='weight_kg', index='color', aggfunc='mean')
```
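
To check that the two calls agree, here is a minimal sketch on a hypothetical dogs DataFrame. The column names color, breed and weight_kg are assumed to follow the snippets above, and the values are borrowed from the outputs shown earlier in the chapter.

```python
import pandas as pd

# Hypothetical data using the column names from the snippets above
dogs = pd.DataFrame({
    'color': ['Black', 'Black', 'Brown', 'Brown', 'Gray', 'Tan', 'White'],
    'breed': ['Labrador', 'Poodle', 'Labrador', 'Poodle',
              'Schnauzer', 'Chihuahua', 'St. Bernard'],
    'weight_kg': [29, 23, 25, 22, 17, 2, 74],
})

# groupby on one selected column returns a Series
by_group = dogs.groupby('color')['weight_kg'].mean()

# pivot_table returns a DataFrame with one column per entry in values
by_pivot = dogs.pivot_table(values='weight_kg', index='color')

print(by_group)
print(by_pivot)
```

The numbers are identical; the only difference is the container, a Series from groupby and a one-column DataFrame from pivot_table.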

### Different statistics

If we want a different summary statistic, we can use the **aggfunc** argument and pass it a function.

```python
dogs.pivot_table(values='weight_kg', index='color', aggfunc=np.median)
```

We can also pass a list of functions to the **aggfunc** argument.

```python
dogs.pivot_table(values='weight_kg', index='color', aggfunc=[np.median, np.mean])
```

### Pivot on two variables

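We can also pivot on two variables. Grouping by both columns gives the mean weight for each color and breed pair:

```python
dogs.groupby(['color', 'breed'])['weight_kg'].mean()
```

With pivot_table, the second variable goes in the **columns** argument, which spreads its values across the columns of the result:
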
```python	
dogs.pivot_table(values='weight_kg', index='color', columns='breed') 
```
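
A small sketch of the two-variable pivot, again on a hypothetical dogs DataFrame (column names assumed from the snippets above, values borrowed from earlier outputs). It also shows that unstacking the grouped result yields the same wide layout.

```python
import pandas as pd

# Hypothetical data using the column names from the snippets above
dogs = pd.DataFrame({
    'color': ['Black', 'Black', 'Brown', 'Brown', 'Gray', 'Tan', 'White'],
    'breed': ['Labrador', 'Poodle', 'Labrador', 'Poodle',
              'Schnauzer', 'Chihuahua', 'St. Bernard'],
    'weight_kg': [29, 23, 25, 22, 17, 2, 74],
})

# One row per color, one column per breed; missing pairs become NaN
wide = dogs.pivot_table(values='weight_kg', index='color', columns='breed')

# The grouped Series has a two-level index; unstack() moves breed to the columns
unstacked = dogs.groupby(['color', 'breed'])['weight_kg'].mean().unstack()

print(wide)
```

Most cells are NaN here because each color only occurs with one or two breeds.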

### Filling the missing values in pivot tables

Instead of having lots of missing values in our pivot table, we can have them filled in using the **fill_value** argument.

```python
dogs.pivot_table(values='weight_kg', index='color', columns='breed', fill_value=0)
```
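
As a hedged illustration of fill_value, the sketch below reuses a hypothetical dogs DataFrame (column names assumed from the snippets above) and replaces every missing color/breed pair with 0:

```python
import pandas as pd

# Hypothetical data using the column names from the snippets above
dogs = pd.DataFrame({
    'color': ['Black', 'Black', 'Brown', 'Brown', 'Gray', 'Tan', 'White'],
    'breed': ['Labrador', 'Poodle', 'Labrador', 'Poodle',
              'Schnauzer', 'Chihuahua', 'St. Bernard'],
    'weight_kg': [29, 23, 25, 22, 17, 2, 74],
})

# Pairs that never occur (for example a Tan Labrador) are shown as 0, not NaN
filled = dogs.pivot_table(values='weight_kg', index='color',
                          columns='breed', fill_value=0)

print(filled)
```

Keep in mind that a filled 0 means "no dogs in this group", not a dog weighing 0 kg, so filled tables should be read with care.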

### Summing with Pivot Tables

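Setting the **margins** argument to True adds a final row and a final column, both labeled All, that summarize each column and each row of the pivot table. They use the same statistic as the rest of the table, so with the default aggfunc they contain means, not sums; pass aggfunc='sum' to get totals.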

```python
dogs.pivot_table(values='weight_kg', index='color', columns='breed', margins=True)
```
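
The sketch below, on a hypothetical dogs DataFrame (column names assumed from the snippets above), compares the All margins under the default mean with the margins under aggfunc='sum':

```python
import pandas as pd

# Hypothetical data using the column names from the snippets above
dogs = pd.DataFrame({
    'color': ['Black', 'Black', 'Brown', 'Brown', 'Gray', 'Tan', 'White'],
    'breed': ['Labrador', 'Poodle', 'Labrador', 'Poodle',
              'Schnauzer', 'Chihuahua', 'St. Bernard'],
    'weight_kg': [29, 23, 25, 22, 17, 2, 74],
})

# Default aggfunc: the All row and column hold means of the underlying rows,
# so the filled-in zeros do not pull them down
means = dogs.pivot_table(values='weight_kg', index='color', columns='breed',
                         fill_value=0, margins=True)

# With aggfunc='sum', the All row and column hold totals instead
sums = dogs.pivot_table(values='weight_kg', index='color', columns='breed',
                        aggfunc='sum', margins=True)

print(means.loc['Black', 'All'], means.loc['All', 'All'])
print(sums.loc['Black', 'All'], sums.loc['All', 'All'])
```

The bottom-right cell is the statistic over every row of the original data: the overall mean weight in the first table and the total weight in the second.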


In [49]:
# Pivot for mean weekly_sales for each store type 
mean_sales_by_type = wallmart_df.pivot_table(values='weekly_sales', index='type')

# Print mean_sales_by_type
print(mean_sales_by_type)

      weekly_sales
type              
A     23674.667242
B     25696.678370


In [50]:
# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = wallmart_df.pivot_table(values='weekly_sales', index='type', aggfunc=[np.mean, np.median])

# Print mean_med_sales_by_type
print(mean_med_sales_by_type)

              mean       median
      weekly_sales weekly_sales
type                           
A     23674.667242     11943.92
B     25696.678370     13336.08


In [52]:
# Pivot for mean weekly_sales by store type and holiday
mean_sales_by_type_holiday = wallmart_df.pivot_table(values='weekly_sales', index='type', columns='is_holiday')

# print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)

is_holiday         False      True 
type                               
A           23768.583523  590.04525
B           25751.980533  810.70500


In [53]:
# Print mean weekly_sales by department and type; fill missing values with 0
print(wallmart_df.pivot_table(values='weekly_sales', index='type', columns='department', fill_value=0))


department            1              2             3             4   \
type                                                                  
A           30961.725379   67600.158788  17160.002955  44285.399091   
B           44050.626667  112958.526667  30580.655000  51219.654167   

department            5             6             7             8   \
type                                                                 
A           34821.011364   7136.292652  38454.336818  48583.475303   
B           63236.875000  10717.297500  52909.653333  90733.753333   

department            9             10  ...            90            91  \
type                                    ...                               
A           30120.449924  30930.456364  ...  85776.905909  70423.165227   
B           66679.301667  48595.126667  ...  14780.210000  13199.602500   

department             92            93            94             95  \
type                                                         

In [54]:
# Print the mean weekly_sales by department and type; fill missing values with 0s; add an 'All' row and column of overall means
print(wallmart_df.pivot_table(values='weekly_sales', index='type', columns='department', fill_value= 0, margins=True))

department             1              2             3             4  \
type                                                                  
A           30961.725379   67600.158788  17160.002955  44285.399091   
B           44050.626667  112958.526667  30580.655000  51219.654167   
All         32052.467153   71380.022778  18278.390625  44863.253681   

department             5             6             7             8  \
type                                                                 
A           34821.011364   7136.292652  38454.336818  48583.475303   
B           63236.875000  10717.297500  52909.653333  90733.753333   
All         37189.000000   7434.709722  39658.946528  52095.998472   

department             9            10  ...            91             92  \
type                                    ...                                
A           30120.449924  30930.456364  ...  70423.165227  139722.204773   
B           66679.301667  48595.126667  ...  13199.602500   50859