<div class="alert alert-block" style = "background-color: black">
    <p><b><font size="+4" color="orange">Data Aggregation in Pandas</font></b></p>
    <p><b><font size="+1" color="white">by Jubril Davies</font></b></p>
    </div>

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
plt.rcParams.update({'font.size':14}) #sets global font size




$$\text{This work focuses on grouping and aggregating data for analysis}$$
---
Categorizing a dataset and applying an operation to each group is a critical part of a data analysis workflow. pandas provides a flexible `groupby` method that lets you slice, dice, and summarize datasets naturally.

In this work, we will learn how to:

* Split a pandas object into pieces using one or more keys
* Compute group summary statistics
* Apply a varying set of functions to each column of a DataFrame
* Apply group transformations such as normalization and scaling
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other derived group analyses

---
<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Groupby Mechanics</font></b></p>
    </div>
  
---

Group operations follow the **split-apply-combine** rule.

* In the first stage, data is split into groups based on one or more keys.
* A function is then applied to each group, producing a new value.
* The per-group results are then combined into a single result object.
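
The three stages above can be sketched by hand on a small, hypothetical DataFrame (the column names `key` and `value` are illustrative assumptions, not part of the Titanic data) and compared with what `groupby` produces in a single call:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three stages.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 8.0],
})

# Split: collect the values belonging to each distinct key.
pieces = {k: df.loc[df["key"] == k, "value"] for k in df["key"].unique()}

# Apply: reduce each piece to a single number.
means = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one object.
manual = pd.Series(means)

# groupby performs the same three steps in one call.
grouped = df.groupby("key")["value"].mean()

print(manual)
print(grouped)
```

Both objects hold the same per-group means; the `groupby` version is shorter and avoids building the intermediate pieces explicitly.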

> #### **Given the famous Titanic dataset**

In [139]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(url)
data = data.drop(columns = ['Name'])
data = data.sample(10,random_state=52)
data

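| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 580 | 581 | 1 | 2 | female | 25.0 | 1 | 1 | 237789 | 30.0 | NaN | S |
| 841 | 842 | 0 | 2 | male | 16.0 | 0 | 0 | S.O./P.P. 3 | 10.5 | NaN | S |
| 406 | 407 | 0 | 3 | male | 51.0 | 0 | 0 | 347064 | 7.75 | NaN | S |
| 394 | 395 | 1 | 3 | female | 24.0 | 0 | 2 | PP 9549 | 16.7 | G6 | S |
| 453 | 454 | 1 | 1 | male | 49.0 | 1 | 0 | 17453 | 89.1042 | C92 | C |
| 671 | 672 | 0 | 1 | male | 31.0 | 1 | 0 | F.C. 12750 | 52.0 | B71 | S |
| 257 | 258 | 1 | 1 | female | 30.0 | 0 | 0 | 110152 | 86.5 | B77 | S |
| 528 | 529 | 0 | 3 | male | 39.0 | 0 | 0 | 3101296 | 7.925 | NaN | S |
| 433 | 434 | 0 | 3 | male | 17.0 | 0 | 0 | STON/O 2. 3101274 | 7.125 | NaN | S |
| 773 | 774 | 0 | 3 | male | NaN | 0 | 0 | 2674 | 7.225 | NaN | C |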


**pandas `groupby` aggregates a DataFrame concisely by one or more specified keys.**

<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">1. Basic Grouping & Aggregation</font></b></p>
    </div>

### **1.1. Grouping by a Single Column**
The `groupby` method is used to group data. Let's group the Titanic dataset by the `Sex` column and calculate the average age and fare for each gender.

In [140]:
grouped = data.groupby('Sex')[['Age','Fare']].mean()
grouped

| Sex | Age | Fare |
|---|---|---|
| female | 26.333333 | 44.4 |
| male | 33.833333 | 25.947029 |


**The `size` method of a GroupBy object returns the number of rows in each group. By default, rows with a missing value in the group key are excluded from the result; pass `dropna=False` to `groupby` to keep them.**

> #### **Get the number of passengers in each cabin**

In [141]:
cabin_group = data.groupby(['Cabin'])['Age'].size()
cabin_group

Cabin
B71    1
B77    1
C92    1
G6     1
Name: Age, dtype: int64

In [142]:
cabin_group = data.groupby('Cabin',dropna=False)['Age'].size()
cabin_group

Cabin
B71    1
B77    1
C92    1
G6     1
NaN    6
Name: Age, dtype: int64

> #### **To compute the number of non-null values in each group, use `count`**

In [143]:
data.groupby('Cabin')['Age'].count()

Cabin
B71    1
B77    1
C92    1
G6     1
Name: Age, dtype: int64
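
In the sample above, `size` and `count` happen to agree because no passenger with a known cabin has a missing age. The difference can be sketched on a small hypothetical frame (the column names `cabin` and `age` and their values are illustrative assumptions) that has both a missing key and a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing group key and one missing value.
df = pd.DataFrame({
    "cabin": ["A1", "A1", np.nan, "B2"],
    "age": [30.0, np.nan, 22.0, 41.0],
})

# size: number of rows per group, rows with a missing age included.
sizes = df.groupby("cabin")["age"].size()

# dropna=False additionally keeps the rows whose key is missing as a group.
sizes_all = df.groupby("cabin", dropna=False)["age"].size()

# count: number of non-null ages per group.
counts = df.groupby("cabin")["age"].count()

print(sizes)
print(sizes_all)
print(counts)
```

Here group `A1` has a size of 2 but a count of 1, because one of its two ages is missing.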

### **1.2. Iterating over Groups**
Iterating over the object returned by `groupby` yields a sequence of 2-tuples, each containing the group name and the corresponding chunk of data. When grouping by several keys, the group name is itself a tuple of key values.

In [149]:
for (Sex,Pclass), group in data.groupby(['Sex','Pclass']):
    print((Sex,Pclass))
    print(group)

('female', 1)
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch  Ticket  Fare  \
257          258         1       1  female  30.0      0      0  110152  86.5   

    Cabin Embarked  
257   B77        S  
('female', 2)
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch  Ticket  Fare  \
580          581         1       2  female  25.0      1      1  237789  30.0   

    Cabin Embarked  
580   NaN        S  
('female', 3)
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch   Ticket  Fare  \
394          395         1       3  female  24.0      0      2  PP 9549  16.7   

    Cabin Embarked  
394    G6        S  
('male', 1)
     PassengerId  Survived  Pclass   Sex   Age  SibSp  Parch      Ticket  \
453          454         1       1  male  49.0      1      0       17453   
671          672         0       1  male  31.0      1      0  F.C. 12750   

        Fare Cabin Embarked  
453  89.1042   C92        C  
671  52.0000   B71        S  
('male', 2)
    

**A dictionary of the data pieces can be built in one line with a dict comprehension**

In [164]:
pieces = {Sex: group for Sex,group in data.groupby("Sex")}
pieces['female']

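| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 580 | 581 | 1 | 2 | female | 25.0 | 1 | 1 | 237789 | 30.0 | NaN | S |
| 394 | 395 | 1 | 3 | female | 24.0 | 0 | 2 | PP 9549 | 16.7 | G6 | S |
| 257 | 258 | 1 | 1 | female | 30.0 | 0 | 0 | 110152 | 86.5 | B77 | S |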
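
When only one piece is needed, building the whole dictionary is optional. As a minimal sketch on hypothetical toy data (the column names `sex` and `age` are illustrative assumptions), the GroupBy method `get_group` returns the same chunk directly:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "age": [25.0, 16.0, 24.0],
})

# Dictionary of all pieces, as in the one-liner above.
pieces = {name: group for name, group in df.groupby("sex")}

# A single piece fetched directly from the GroupBy object.
direct = df.groupby("sex").get_group("female")

print(direct)
```

Both approaches return the rows of the original frame with their original index labels.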


### **1.3. Group by Columns or Multiple Columns**
Indexing a GroupBy object created from a DataFrame with a column name or a list of column names selects those columns for aggregation.

`df.groupby("key1")["data1"]` and `df.groupby("key1")[["data2"]]` are conveniences for:

`df["data1"].groupby(df["key1"])` and `df[["data2"]].groupby(df["key1"])`
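
The equivalence can be checked on a small hypothetical frame. The names `key1`, `data1`, and `data2` follow the expressions above; the values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data matching the column names used above.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "data1": [1.0, 2.0, 3.0, 4.0],
    "data2": [10.0, 20.0, 30.0, 40.0],
})

# A single column name yields a grouped Series.
short_series = df.groupby("key1")["data1"].mean()
long_series = df["data1"].groupby(df["key1"]).mean()

# A list of column names yields a grouped DataFrame.
short_frame = df.groupby("key1")[["data2"]].mean()
long_frame = df[["data2"]].groupby(df["key1"]).mean()

print(short_series.equals(long_series))
print(short_frame.equals(long_frame))
```

Note that a single name returns a Series while a one-element list returns a DataFrame, in both the short and the long form.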

It is possible to group by multiple columns by passing a list of column names.
Let's group by sex and passenger class (`Pclass`).

In [24]:
multi_grouped = data.groupby(['Sex','Pclass'])['Fare'].mean()
multi_grouped

Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64

**This gives a Series with a hierarchical index made up of the unique pairs of keys observed. We can therefore unstack it into a table with one row per sex and one column per class.**

In [25]:
multi_grouped.unstack()

| Sex | Pclass=1 | Pclass=2 | Pclass=3 |
|---|---|---|---|
| female | 106.125798 | 21.970121 | 16.11881 |
| male | 67.226127 | 19.741782 | 12.661633 |
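
Not every pair of keys has to occur in the data. As a minimal sketch on hypothetical toy data (the column names and fares are illustrative assumptions), `unstack` moves the inner index level into the columns and fills combinations that were never observed with `NaN`:

```python
import pandas as pd

# Hypothetical toy data: no female in class 3 and no male in class 2.
df = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "pclass": [1, 2, 1, 1, 3],
    "fare": [80.0, 20.0, 60.0, 40.0, 8.0],
})

# Series with a two-level (sex, pclass) index; only observed pairs appear.
by_pair = df.groupby(["sex", "pclass"])["fare"].mean()

# The inner level (pclass) becomes the columns; unobserved pairs become NaN.
wide = by_pair.unstack()

print(by_pair)
print(wide)
```

The long Series has four entries, one per observed pair, while the wide table has six cells, two of which are missing.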
