<div class="alert alert-block" style = "background-color: black">
    <p><b><font size="+4" color="orange">Data Aggregation in Pandas</font></b></p>
    <p><b><font size="+1" color="white">by Jubril Davies</font></b></p>
    </div>

In [5]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
plt.rcParams.update({'font.size':14}) #sets global font size

$$\begin{align} \text{This work focuses on grouping & aggregating data for analysis} \end{align}$$
---
Categorizing a dataset and applying an operation to each group is a critical part of the data analysis workflow. Pandas provides a flexible `groupby` method that lets you slice, dice, and summarize datasets naturally.

In this work, we will learn how to:

* Split a pandas object into pieces using one or more keys
* Compute group summary statistics
* Apply a varying set of functions to each column of a DataFrame
* Apply group transformations such as normalization and scaling
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other derived group analyses

---
<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Groupby Mechanics</font></b></p>
    </div>
  
---

Group operations follow the **split-apply-combine** rule.

* In the first stage, data is split into groups based on one or more keys.
* A function is then applied to each group, producing a new value.
* The results are then combined into a result object.
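
The three stages can be written out by hand to see what `groupby` does internally. This is only a minimal sketch on a hypothetical toy frame (`toy`, `key`, and `value` are illustrative names, not part of the Titanic data):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three stages
toy = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                    "value": [1, 2, 3, 4, 5]})

# 1. Split: one sub-frame per distinct key
pieces = {key: chunk for key, chunk in toy.groupby("key")}

# 2. Apply: reduce each piece to a single value
applied = {key: chunk["value"].sum() for key, chunk in pieces.items()}

# 3. Combine: assemble the per-group results into one object
manual = pd.Series(applied, name="value").rename_axis("key")

# groupby performs the same three stages in a single call
builtin = toy.groupby("key")["value"].sum()
```

Both `manual` and `builtin` hold one sum per key; in practice the single `groupby` call is preferable.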

> #### **Given the famous titanic dataset**

In [6]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
dat = pd.read_csv(url)
dat = dat.drop(columns = ['Name'])
data = dat.sample(10,random_state=52)
data

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
580,581,1,2,female,25.0,1,1,237789,30.0,,S
841,842,0,2,male,16.0,0,0,S.O./P.P. 3,10.5,,S
406,407,0,3,male,51.0,0,0,347064,7.75,,S
394,395,1,3,female,24.0,0,2,PP 9549,16.7,G6,S
453,454,1,1,male,49.0,1,0,17453,89.1042,C92,C
671,672,0,1,male,31.0,1,0,F.C. 12750,52.0,B71,S
257,258,1,1,female,30.0,0,0,110152,86.5,B77,S
528,529,0,3,male,39.0,0,0,3101296,7.925,,S
433,434,0,3,male,17.0,0,0,STON/O 2. 3101274,7.125,,S
773,774,0,3,male,,0,0,2674,7.225,,C


**Pandas `groupby` aggregates a DataFrame concisely using one or more specified keys**

<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">1. Basic Grouping & Aggregation</font></b></p>
    </div>

### **1.1. Grouping by a Single Column**
The `groupby` method is used to group data. Let's group the Titanic dataset by the `Sex` column and calculate the average age and fare for each gender.

In [7]:
grouped = data.groupby('Sex')[['Age','Fare']].mean()
grouped

Sex,Age,Fare
female,26.333333,44.4
male,33.833333,25.947029


**It is useful to obtain the group sizes using the `size` method. By default, rows with a missing value in the group key are excluded from the result. This behavior can be disabled by passing `dropna=False` to `groupby`.**

> #### **Get the number of passengers in each cabin**

In [8]:
cabin_group = data.groupby(['Cabin'])['Age'].size()
cabin_group

Cabin
B71    1
B77    1
C92    1
G6     1
Name: Age, dtype: int64

In [9]:
cabin_group = data.groupby('Cabin',dropna=False)['Age'].size()
cabin_group

Cabin
B71    1
B77    1
C92    1
G6     1
NaN    6
Name: Age, dtype: int64

> #### **To compute the number of non-null values in each group, we use `count`**

In [10]:
data.groupby('Cabin')['Age'].count()

Cabin
B71    1
B77    1
C92    1
G6     1
Name: Age, dtype: int64
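
In the sample above every passenger with a known cabin also has a known age, so `size` and `count` agree. The following sketch uses a hypothetical toy frame (not the Titanic data) in which they differ:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age inside cabin A1 and one missing cabin
toy = pd.DataFrame({"Cabin": ["A1", "A1", "B2", np.nan],
                    "Age": [30.0, np.nan, 41.0, 25.0]})

sizes = toy.groupby("Cabin")["Age"].size()    # rows per group, missing ages included
counts = toy.groupby("Cabin")["Age"].count()  # non-null ages per group
sizes_all = toy.groupby("Cabin", dropna=False)["Age"].size()  # also keeps the NaN key
```

Here cabin `A1` has a size of 2 but a count of 1, and only `sizes_all` contains a group for the missing cabin.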

### **1.2. Iterating over Groups**
It is possible to iterate over the object returned by `groupby`, which generates a sequence of 2-tuples containing the group name along with the chunk of data. When grouping by several keys, the group name is itself a tuple of key values.

In [11]:
for (Sex,Pclass), group in data.groupby(['Sex','Pclass']):
    print((Sex,Pclass))
    print(group)

('female', 1)
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch  Ticket  Fare  \
257          258         1       1  female  30.0      0      0  110152  86.5   

    Cabin Embarked  
257   B77        S  
('female', 2)
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch  Ticket  Fare  \
580          581         1       2  female  25.0      1      1  237789  30.0   

    Cabin Embarked  
580   NaN        S  
('female', 3)
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch   Ticket  Fare  \
394          395         1       3  female  24.0      0      2  PP 9549  16.7   

    Cabin Embarked  
394    G6        S  
('male', 1)
     PassengerId  Survived  Pclass   Sex   Age  SibSp  Parch      Ticket  \
453          454         1       1  male  49.0      1      0       17453   
671          672         0       1  male  31.0      1      0  F.C. 12750   

        Fare Cabin Embarked  
453  89.1042   C92        C  
671  52.0000   B71        S  
('male', 2)
    

**It is possible to compute the dictionary of the data pieces as a one-liner**

In [12]:
pieces = {Sex: group for Sex,group in data.groupby("Sex")}
pieces['female']

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
580,581,1,2,female,25.0,1,1,237789,30.0,,S
394,395,1,3,female,24.0,0,2,PP 9549,16.7,G6,S
257,258,1,1,female,30.0,0,0,110152,86.5,B77,S


### **1.3. Group by Columns or Multiple Columns**
Indexing a groupby object created from a DataFrame with a column name or an array of column names has the effect of subsetting the columns for aggregation.

`df.groupby("key1")["data1"]` and `df.groupby("key1")[["data2"]]` are conveniences for:

`df["data1"].groupby(df["key1"])` and `df[["data2"]].groupby(df["key1"])`
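
The equivalence can be checked on a small frame. The sketch below assumes a hypothetical `df` with columns `key1`, `data1`, and `data2`, matching the placeholder names used above:

```python
import pandas as pd

# Hypothetical frame using the placeholder column names from the text
df = pd.DataFrame({"key1": ["a", "a", "b", "b"],
                   "data1": [1.0, 2.0, 3.0, 4.0],
                   "data2": [10.0, 20.0, 30.0, 40.0]})

# A single column name selects a Series to aggregate
short_series = df.groupby("key1")["data1"].mean()
long_series = df["data1"].groupby(df["key1"]).mean()

# A list of column names selects a DataFrame to aggregate
short_frame = df.groupby("key1")[["data2"]].mean()
long_frame = df[["data2"]].groupby(df["key1"]).mean()
```

The short and long forms return the same values; the only difference is whether a Series or a one-column DataFrame comes back.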

It is possible to group by multiple columns by passing a list of column names.
Let's group by sex and passenger class (`Pclass`).

In [13]:
multi_grouped = data.groupby(['Sex','Pclass'])['Fare'].mean()
multi_grouped

Sex     Pclass
female  1         86.50000
        2         30.00000
        3         16.70000
male    1         70.55210
        2         10.50000
        3          7.50625
Name: Fare, dtype: float64

**This gives a Series with a hierarchical index made up of the unique key pairs observed. We can therefore unstack it into a DataFrame**

In [14]:
multi_grouped.unstack()

Pclass,1,2,3
Sex,,,
female,86.5,30.0,16.7
male,70.5521,10.5,7.50625


### **1.4. Grouping with Dictionary & Series**

The grouping information may also exist in the form of a dictionary as opposed to an array.

> #### **Given a dataframe of students**

In [15]:
students = pd.DataFrame(np.random.randn(5,5),columns = ['Abel','Bale','Chris','Dave','Emma'],
                        index=['crimson','gold','silver','jade','violet'])
students.loc['silver',['Bale','Chris']] = np.nan #add a few NA values (select the row by label, since the index holds strings)
students

Unnamed: 0,Abel,Bale,Chris,Dave,Emma
crimson,-0.255572,-1.204198,-0.164973,-0.297862,-0.126625
gold,0.9205,-0.791032,0.587806,-1.31361,0.09204
silver,2.065835,,,0.645042,-1.756833
jade,0.063949,-0.620645,-1.888483,0.996261,-0.545076
violet,-0.121555,-0.088733,1.058428,0.937185,0.25706


> **Now suppose we have a group correspondence for the columns and would like to sum the columns by group**

In [16]:
mapping = {'Abel':'red','Bale':'red','Chris':'blue','Dave':'blue','Emma':'red','Frank':'orange'}
mapping

{'Abel': 'red',
 'Bale': 'red',
 'Chris': 'blue',
 'Dave': 'blue',
 'Emma': 'red',
 'Frank': 'orange'}

> **Now let's group**

In [17]:
# groupby(axis=1) is deprecated since pandas 2.1, so transpose, group the rows, and transpose back
by_column = students.T.groupby(mapping).sum().T
by_column

Unnamed: 0,blue,red
crimson,-0.462834,-1.586394
gold,-0.725805,0.221508
silver,0.645042,0.309002
jade,-0.892222,-1.101772
violet,1.995612,0.046772
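
The same functionality holds for a Series, which can be viewed as a fixed-size mapping. The sketch below rebuilds a smaller, hypothetical version of `students` so that it runs on its own, and counts the non-null values per color group:

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the students frame (values are random, names are illustrative)
rng = np.random.default_rng(0)
students = pd.DataFrame(rng.standard_normal((3, 4)),
                        columns=["Abel", "Bale", "Chris", "Dave"],
                        index=["crimson", "gold", "silver"])

# A Series works like the dictionary: its index holds the labels, its values the groups
map_series = pd.Series({"Abel": "red", "Bale": "red", "Chris": "blue", "Dave": "blue"})

# Transpose so the column labels become the row index, group, then transpose back
counts = students.T.groupby(map_series).count().T
```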


### **1.5. Grouping by Index Levels**

Hierarchically indexed data can also be aggregated using one of the levels of an axis index. To do this, pass the level number or name using the `level` keyword.

In [18]:
columns = pd.MultiIndex.from_arrays([['UK','UK','UK','HK','HK'],[1,2,4,1,4]],names=['county','base'])
county_df = pd.DataFrame(np.random.randn(4,5),columns=columns)
county_df

county,UK,UK,UK,HK,HK
base,1,2,4,1,4
0,-0.327755,0.902728,0.37294,0.456207,-1.694805
1,0.03745,-0.263498,-0.374695,0.221126,-1.178807
2,-0.478285,0.407877,-0.305877,-0.207106,-0.450895
3,-0.606311,1.373294,0.370047,-0.581456,-0.447385
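
A minimal sketch of grouping by the `county` level follows. It rebuilds a frame with the same column index but deterministic values (an assumption made here so the sums are easy to check), and it transposes before grouping because `groupby(axis=1)` is deprecated in recent pandas releases:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([["UK", "UK", "UK", "HK", "HK"], [1, 2, 4, 1, 4]],
                                    names=["county", "base"])
# Deterministic values 0..19 instead of random numbers, for easy checking
county_df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=columns)

# Group the columns by the 'county' level: transpose, group the rows by level, transpose back
by_county = county_df.T.groupby(level="county").sum().T
```

For the first row, the three `UK` columns (0, 1, 2) sum to 3 and the two `HK` columns (3, 4) sum to 7.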


<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">2. Aggregation Functions</font></b></p>
    </div>

### **2.1. Grouping with built-in Aggregation Functions**

>#### **Given a dataframe of student scores**
You can supply a list of functions to apply to all the columns or different functions per column

In [19]:
student_data = pd.DataFrame({'Name':['Abel','Bale','Chris','Davies','Emma','Frank'],
                             'Score':[85,92,58,70,95,62], 
                             'Age':[17,19,16,22,25,20],
                             'Subject':['Math','Math','English','English','Math','English']})
student_data

Unnamed: 0,Name,Score,Age,Subject
0,Abel,85,17,Math
1,Bale,92,19,Math
2,Chris,58,16,English
3,Davies,70,22,English
4,Emma,95,25,Math
5,Frank,62,20,English


> **Suppose we group the dataset by subject and want to compute the sum, min, max, and mean of the Score column and the mean of the Age column**

In [20]:
grp_by_subject = student_data.groupby('Subject').agg({'Score':['sum','mean','min','max'],
                                                    'Age':'mean'})
grp_by_subject

,Score,Score,Score,Score,Age
Subject,sum,mean,min,max,mean
English,190,63.333333,58,70,19.333333
Math,272,90.666667,85,95,20.333333


The resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using `concat` to glue the results together. This was achieved by passing a dictionary to the `agg` method.
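
The equivalence with `concat` can be sketched as follows. The frame is a trimmed copy of the student scores, and `via_dict` and `via_concat` are illustrative names:

```python
import pandas as pd

student_data = pd.DataFrame({"Score": [85, 92, 58, 70, 95, 62],
                             "Age": [17, 19, 16, 22, 25, 20],
                             "Subject": ["Math", "Math", "English", "English", "Math", "English"]})

grouped = student_data.groupby("Subject")

# One call with a dictionary mapping each column to its functions
via_dict = grouped.agg({"Score": ["sum", "mean"], "Age": ["mean"]})

# The same table assembled by hand: aggregate each column, then concatenate.
# The dictionary keys become the outer level of the column index.
via_concat = pd.concat({"Score": grouped["Score"].agg(["sum", "mean"]),
                        "Age": grouped["Age"].agg(["mean"])}, axis=1)
```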

### **2.2. Grouping with Custom-built Functions**
#### **This section treats the apply method**
Grouping with functions in pandas is a powerful way to perform operations on subsets of data.
`apply` splits the object being manipulated into pieces, invokes the passed function on each piece, then concatenates the pieces together.
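
As a small sketch of `apply` on a grouped object, the hypothetical helper `top_scorers` below returns the highest-scoring rows of each subject. The columns are selected explicitly before calling `apply` so that the function only sees the data it needs:

```python
import pandas as pd

student_data = pd.DataFrame({"Name": ["Abel", "Bale", "Chris", "Davies", "Emma", "Frank"],
                             "Score": [85, 92, 58, 70, 95, 62],
                             "Subject": ["Math", "Math", "English", "English", "Math", "English"]})

def top_scorers(group, n=2):
    """Return the n highest-scoring rows of one group (illustrative helper)."""
    return group.nlargest(n, "Score")

# apply calls top_scorers once per subject and concatenates the returned pieces;
# extra keyword arguments such as n are forwarded to the function
top = student_data.groupby("Subject")[["Name", "Score"]].apply(top_scorers, n=2)
```

The result is indexed by subject and by the original row label, with two rows per subject.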


> **Grouping by a Custom Function**

First define a function that categorizes students based on their scores

In [21]:
def performance_category(score):
    if score > 75:
        return 'High'
    else:
        return 'Low'

> **Apply the function, then group by the resulting categories**

In [22]:
student_data['Performance'] = student_data['Score'].apply(performance_category)
group_performance = student_data.groupby(student_data['Performance'])

> **Now retrieve the performers by iterating over the group**

In [23]:
for name , group in group_performance:
    print(f"\n{name} Performers:")
    print(group)


High Performers:
   Name  Score  Age Subject Performance
0  Abel     85   17    Math        High
1  Bale     92   19    Math        High
4  Emma     95   25    Math        High

Low Performers:
     Name  Score  Age  Subject Performance
2   Chris     58   16  English         Low
3  Davies     70   22  English         Low
5   Frank     62   20  English         Low


> **To have the result as a dataframe**

In [24]:
result_df = pd.concat([group for name, group in group_performance],axis=0)
result_df

Unnamed: 0,Name,Score,Age,Subject,Performance
0,Abel,85,17,Math,High
1,Bale,92,19,Math,High
4,Emma,95,25,Math,High
2,Chris,58,16,English,Low
3,Davies,70,22,English,Low
5,Frank,62,20,English,Low


<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">3. Group-wise Operations & Transformations</font></b></p>
    </div>
    
Aggregation is only one kind of group operation. It is a special case of the more general class of data transformations: it accepts functions that reduce a one-dimensional array to a scalar value. Transformations apply a function to each group but **return an object of the same shape as the input**, unlike aggregation, which **reduces the data size**.

### **This section introduces the transform and apply methods**
> #### **Suppose we want to add a column containing the group mean for each row**

One way to achieve this is to apply the function to each group and place the results in the appropriate locations. If each group produces a scalar value, it is broadcast to every row of that group.



> #### **A Transformation Example**
Let's compute the average score in each performance group

In [25]:
student_data['Average_Score'] = student_data.groupby('Performance')['Score'].transform('mean')
student_data

Unnamed: 0,Name,Score,Age,Subject,Performance,Average_Score
0,Abel,85,17,Math,High,90.666667
1,Bale,92,19,Math,High,90.666667
2,Chris,58,16,English,Low,63.333333
3,Davies,70,22,English,Low,63.333333
4,Emma,95,25,Math,High,90.666667
5,Frank,62,20,English,Low,63.333333


> **Suppose we want to subtract the mean value from each score. To do this we create a demeaning function**

In [26]:
def demean(Score):
    return Score - Score.mean()

demeaned = student_data.groupby('Performance')['Score'].transform(demean)
demeaned

0   -5.666667
1    1.333333
2   -5.333333
3    6.666667
4    4.333333
5   -1.333333
Name: Score, dtype: float64

<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">4. Quantile & Binning Analysis</font></b></p>
    </div>
    
Pandas has tools for slicing data into bins - `cut` and `qcut`. Combining these with groupby allows one to perform quantile analysis on a dataset.
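
The difference between the two is that `cut` makes bins of equal width while `qcut` makes bins holding roughly equal numbers of rows. A minimal sketch on hypothetical fare values (not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical, skewed fare values
fares = pd.Series([5, 7, 8, 10, 13, 26, 30, 80, 150, 500], name="Fare")

equal_width = pd.cut(fares, 4)   # 4 bins of equal width across the value range
equal_count = pd.qcut(fares, 4)  # 4 bins based on sample quartiles

# observed=False keeps empty bins in the result
width_sizes = fares.groupby(equal_width, observed=False).size()
count_sizes = fares.groupby(equal_count, observed=False).size()
```

On skewed data such as fares, most rows fall into the first equal-width bin, whereas the quartile bins stay balanced.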

>#### **Going back to the full Titanic DataFrame**

In [27]:
dat.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,male,35.0,0,0,373450,8.05,,S


In [28]:
factor = pd.cut(dat['Fare'],4)
factor[:10]

0    (-0.512, 128.082]
1    (-0.512, 128.082]
2    (-0.512, 128.082]
3    (-0.512, 128.082]
4    (-0.512, 128.082]
5    (-0.512, 128.082]
6    (-0.512, 128.082]
7    (-0.512, 128.082]
8    (-0.512, 128.082]
9    (-0.512, 128.082]
Name: Fare, dtype: category
Categories (4, interval[float64, right]): [(-0.512, 128.082] < (128.082, 256.165] < (256.165, 384.247] < (384.247, 512.329]]

The factor object returned by `cut` can be passed directly to `groupby` and used to compute a set of statistics for another column.

In [29]:
def get_stats(group):
    return {'min':np.min(group), 'max':np.max(group),
            'count':group.count(),'mean':np.mean(group)}

grouped = dat['Age'].groupby(factor)
grouped.apply(get_stats).unstack()

Fare,min,max,count,mean
"(-0.512, 128.082]",0.42,80.0,680.0,29.559191
"(128.082, 256.165]",0.92,58.0,25.0,33.1968
"(256.165, 384.247]",18.0,64.0,6.0,28.166667
"(384.247, 512.329]",35.0,36.0,3.0,35.333333


<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">5. Handling Missing Data</font></b></p>
    </div>

Data aggregation often involves missing data, and pandas provides options for handling missing values during such operations.

In [30]:
extra_students = pd.DataFrame({'Name':['George','Hank','Kingsley'],'Score':[np.nan,np.nan,88],
                              'Age':[18,20,26],'Subject':['Math','Math','English']})

# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
student_data_new = pd.concat([student_data, extra_students])
student_data_new.reset_index(drop=True)

Unnamed: 0,Name,Score,Age,Subject,Performance,Average_Score
0,Abel,85.0,17,Math,High,90.666667
1,Bale,92.0,19,Math,High,90.666667
2,Chris,58.0,16,English,Low,63.333333
3,Davies,70.0,22,English,Low,63.333333
4,Emma,95.0,25,Math,High,90.666667
5,Frank,62.0,20,English,Low,63.333333
6,George,,18,Math,,
7,Hank,,20,Math,,
8,Kingsley,88.0,26,English,,


In [31]:
fill_mean = student_data_new['Score'].mean()
student_data_new['Score'] = student_data_new['Score'].fillna(fill_mean)
student_data_new

Unnamed: 0,Name,Score,Age,Subject,Performance,Average_Score
0,Abel,85.0,17,Math,High,90.666667
1,Bale,92.0,19,Math,High,90.666667
2,Chris,58.0,16,English,Low,63.333333
3,Davies,70.0,22,English,Low,63.333333
4,Emma,95.0,25,Math,High,90.666667
5,Frank,62.0,20,English,Low,63.333333
0,George,78.571429,18,Math,,
1,Hank,78.571429,20,Math,,
2,Kingsley,88.0,26,English,,
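
Filling with the overall mean ignores the group structure. A possible refinement, sketched here on a trimmed, hypothetical copy of the scores, is to fill each missing score with the mean of its own subject by combining `fillna` with a group-wise `transform`:

```python
import numpy as np
import pandas as pd

# Trimmed stand-in for the student scores, with one missing Math score
scores = pd.DataFrame({"Name": ["Abel", "Bale", "Chris", "Davies", "George", "Kingsley"],
                       "Score": [85, 92, 58, 70, np.nan, 88],
                       "Subject": ["Math", "Math", "English", "English", "Math", "English"]})

# Mean of the non-missing scores within each subject, aligned to the original rows
subject_mean = scores.groupby("Subject")["Score"].transform("mean")

# Fill each missing score with the mean of its own subject
scores["Score_filled"] = scores["Score"].fillna(subject_mean)
```

George's missing Math score is filled with 88.5, the mean of the two known Math scores, rather than with the mean over all subjects.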
