# Grouping Data with Pandas

This post uses Python and Pandas to generate simple insights from data with a few grouping techniques. Grouping means aggregating data on the basis of one or more columns (attributes). `groupby` splits the data into groups according to the values in the columns you provide, so that each group can then be summarized separately.

Let's consider a small dataset of students, their marks in two subjects, and their respective teachers:

In [12]:
import pandas as pd
 
df = pd.DataFrame(
    {
        'Student':['Beth', 'Alex', 'Diana', 'Adrian'],
        'Age': [18, 19, 18, 19],
        'Math': [75, 82, 89, 85],
        'Science': [65, 75, 86, 90],
        'Teacher': ['William', 'William', 'Robert', 'Robert']
    }
)

In [13]:
df.head()

| | Student | Age | Math | Science | Teacher |
|---|---|---|---|---|---|
| 0 | Beth | 18 | 75 | 65 | William |
| 1 | Alex | 19 | 82 | 75 | William |
| 2 | Diana | 18 | 89 | 86 | Robert |
| 3 | Adrian | 19 | 85 | 90 | Robert |
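Before aggregating anything, it can help to see the split itself. The following is a minimal sketch that iterates over the groups and collects the students that fall into each one; the variable name `groups` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
    'Age': [18, 19, 18, 19],
    'Math': [75, 82, 89, 85],
    'Science': [65, 75, 86, 90],
    'Teacher': ['William', 'William', 'Robert', 'Robert'],
})

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs.
# `groups` is an illustrative name, not part of the original notebook.
groups = {
    teacher: group['Student'].tolist()
    for teacher, group in df.groupby('Teacher')
}
print(groups)
```

Each teacher's name maps to the rows that share that value in the `Teacher` column, which is exactly the subset that later aggregations operate on.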


## Basic Insights Using the GroupBy Function

In [14]:
df.groupby('Teacher').describe()

| Teacher | Age: count | Age: mean | Age: std | Age: min | Age: 25% | Age: 50% | Age: 75% | Age: max | Math: count | Math: mean | ... | Math: 75% | Math: max | Science: count | Science: mean | Science: std | Science: min | Science: 25% | Science: 50% | Science: 75% | Science: max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Robert | 2.0 | 18.5 | 0.707107 | 18.0 | 18.25 | 18.5 | 18.75 | 19.0 | 2.0 | 87.0 | ... | 88.0 | 89.0 | 2.0 | 88.0 | 2.828427 | 86.0 | 87.0 | 88.0 | 89.0 | 90.0 |
| William | 2.0 | 18.5 | 0.707107 | 18.0 | 18.25 | 18.5 | 18.75 | 19.0 | 2.0 | 78.5 | ... | 80.25 | 82.0 | 2.0 | 70.0 | 7.071068 | 65.0 | 67.5 | 70.0 | 72.5 | 75.0 |


These summary statistics give us some direct insights about the teachers. For instance, the mean values show that Robert's students score higher than William's in both Math and Science. This may suggest that Robert is a better teacher than William, or simply that he has stronger students. To check the numbers behind this, we can filter Robert's rows from the DataFrame as follows:

In [15]:
df[df['Teacher']=='Robert']

| | Student | Age | Math | Science | Teacher |
|---|---|---|---|---|---|
| 2 | Diana | 18 | 89 | 86 | Robert |
| 3 | Adrian | 19 | 85 | 90 | Robert |
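To compare the two teachers directly, we can also compute only the per-teacher means and their difference. This is a small sketch under the same data; the names `means` and `gap` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
    'Age': [18, 19, 18, 19],
    'Math': [75, 82, 89, 85],
    'Science': [65, 75, 86, 90],
    'Teacher': ['William', 'William', 'Robert', 'Robert'],
})

# Mean score per teacher, restricted to the two subject columns.
means = df.groupby('Teacher')[['Math', 'Science']].mean()

# Difference between the two teachers' means (Robert minus William).
gap = means.loc['Robert'] - means.loc['William']
print(means)
print(gap)
```

With only two students per teacher, the gap is a description of this small sample rather than strong evidence about teaching quality.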


We can go further and compute the median of each numeric column per teacher:

In [16]:
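# numeric_only=True skips the text column 'Student'; newer pandas versions raise an error without it
df.groupby('Teacher').median(numeric_only=True)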

| Teacher | Age | Math | Science |
|---|---|---|---|
| Robert | 18.5 | 87.0 | 88.0 |
| William | 18.5 | 78.5 | 70.0 |


We can extract further insights about the teachers and their students' ages by grouping on two columns and taking the median of each group:

In [17]:
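# numeric_only=True skips the text column 'Student'; newer pandas versions raise an error without it
df.groupby(['Teacher', 'Age']).median(numeric_only=True)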

| Teacher | Age | Math | Science |
|---|---|---|---|
| Robert | 18 | 89 | 86 |
| Robert | 19 | 85 | 90 |
| William | 18 | 75 | 65 |
| William | 19 | 82 | 75 |
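When grouping on several columns, it is worth checking how many rows land in each group. The sketch below counts the rows per (Teacher, Age) pair; `sizes` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
    'Age': [18, 19, 18, 19],
    'Math': [75, 82, 89, 85],
    'Science': [65, 75, 86, 90],
    'Teacher': ['William', 'William', 'Robert', 'Robert'],
})

# size() returns the number of rows in each (Teacher, Age) group.
sizes = df.groupby(['Teacher', 'Age']).size()
print(sizes)
```

In this dataset every pair contains exactly one student, so the median of each group is simply that student's score.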


<hr />

In [22]:
import numpy as np

In [23]:
new_df = pd.DataFrame( { 
    "Country" : ["C1", "C2", "C1", "C3", "C1", "C3", "C1"],
    "City" : ["ISB", "KHR", "LAH", "DUB", "RWP", "RWP", "RWP"],
    "Sales" : [10, 25, 40, 20, 45 , 43,10]  } )
new_df

| | Country | City | Sales |
|---|---|---|---|
| 0 | C1 | ISB | 10 |
| 1 | C2 | KHR | 25 |
| 2 | C1 | LAH | 40 |
| 3 | C3 | DUB | 20 |
| 4 | C1 | RWP | 45 |
| 5 | C3 | RWP | 43 |
| 6 | C1 | RWP | 10 |


In [24]:
grouped_new_df = new_df.groupby(['Country', 'City'])
grouped_new_df

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f49d519c390>

In [25]:
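# Nested-dict renaming inside agg() is no longer supported by pandas,
# so aggregate with a list of functions and rename the resulting columns.
new_new_df = grouped_new_df.agg({'Sales': ['mean', 'sum']}).rename(columns={'mean': 'Mean', 'sum': 'Sum'})
new_new_df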

| Country | City | Sales: Mean | Sales: Sum |
|---|---|---|---|
| C1 | ISB | 10.0 | 10 |
| C1 | LAH | 40.0 | 40 |
| C1 | RWP | 27.5 | 55 |
| C2 | KHR | 25.0 | 25 |
| C3 | DUB | 20.0 | 20 |
| C3 | RWP | 43.0 | 43 |
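The same Mean and Sum columns can also be produced with named aggregation, which pandas has supported since version 0.25. This is a sketch of that alternative, not the approach used in the cell above; it yields flat column names instead of a `Sales` column level, and `summary` is an illustrative name.

```python
import pandas as pd

new_df = pd.DataFrame({
    "Country": ["C1", "C2", "C1", "C3", "C1", "C3", "C1"],
    "City": ["ISB", "KHR", "LAH", "DUB", "RWP", "RWP", "RWP"],
    "Sales": [10, 25, 40, 20, 45, 43, 10],
})

# Named aggregation: output_column=(input_column, aggregation_name).
# `summary` is an illustrative name, not part of the original notebook.
summary = new_df.groupby(["Country", "City"]).agg(
    Mean=("Sales", "mean"),
    Sum=("Sales", "sum"),
)
print(summary)
```

Because the result has flat columns, `summary['Mean']` selects a single column directly, with no need to go through `['Sales']` first.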


In [26]:
new_new_df['Sales']

| Country | City | Mean | Sum |
|---|---|---|---|
| C1 | ISB | 10.0 | 10 |
| C1 | LAH | 40.0 | 40 |
| C1 | RWP | 27.5 | 55 |
| C2 | KHR | 25.0 | 25 |
| C3 | DUB | 20.0 | 20 |
| C3 | RWP | 43.0 | 43 |


In [30]:
#new_new_df.dtypes
new_new_df.index

MultiIndex(levels=[['C1', 'C2', 'C3'], ['DUB', 'ISB', 'KHR', 'LAH', 'RWP']],
           labels=[[0, 0, 0, 1, 2, 2], [1, 3, 4, 2, 0, 4]],
           names=['Country', 'City'])

In [31]:
np.lexsort((new_df.Country, new_df.City))

array([3, 0, 1, 2, 4, 6, 5])

In [29]:
# MultiIndex.labels was renamed to MultiIndex.codes in later pandas releases
new_new_df.index.levels[1][new_new_df.index.codes[1]]

Index(['ISB', 'LAH', 'RWP', 'KHR', 'DUB', 'RWP'], dtype='object', name='City')
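The lookup above reconstructs the City value of every row from the level values and their integer positions. A shorter route, sketched here on a freshly built grouped result (the name `summary` is illustrative), is `get_level_values`, which returns the same per-row values directly.

```python
import pandas as pd

new_df = pd.DataFrame({
    "Country": ["C1", "C2", "C1", "C3", "C1", "C3", "C1"],
    "City": ["ISB", "KHR", "LAH", "DUB", "RWP", "RWP", "RWP"],
    "Sales": [10, 25, 40, 20, 45, 43, 10],
})

# `summary` is an illustrative name for a result indexed by (Country, City).
summary = new_df.groupby(["Country", "City"])[["Sales"]].sum()

# Per-row City values, read straight from the MultiIndex.
cities = summary.index.get_level_values("City")

# The manual equivalent: unique level values indexed by their integer codes.
via_codes = summary.index.levels[1][summary.index.codes[1]]

print(list(cities))
print(list(via_codes))
```

Both expressions give one City per row of the grouped result, in row order.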

<hr />

# Applying Arbitrary Functions for Grouping Data with Pandas

In [2]:
import pandas as pd
df = pd.DataFrame({'Student':['Beth', 'Alex', 'Diana', 'Adrian'],
                  'Age': [18, 19, 18, 19],
                  'Math': [75, 82, 89, 85],
                  'Science': [65, 75, 86, 90],
                  'Teacher': ['William', 'William', 'Robert', 'Robert']})

In [3]:
df.head()

| | Student | Age | Math | Science | Teacher |
|---|---|---|---|---|---|
| 0 | Beth | 18 | 75 | 65 | William |
| 1 | Alex | 19 | 82 | 75 | William |
| 2 | Diana | 18 | 89 | 86 | Robert |
| 3 | Adrian | 19 | 85 | 90 | Robert |


In [4]:
df.groupby('Teacher').max()

| Teacher | Student | Age | Math | Science |
|---|---|---|---|---|
| Robert | Diana | 19 | 89 | 90 |
| William | Beth | 19 | 82 | 75 |


In [5]:
df.groupby('Teacher').apply(max)

| Teacher | Student | Age | Math | Science | Teacher |
|---|---|---|---|---|---|
| Robert | Diana | 19 | 89 | 90 | Robert |
| William | Beth | 19 | 82 | 75 | William |
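`apply` calls the given function once per group and passes it that group's rows as a DataFrame. As a minimal sketch of this, the hypothetical helper `score_range` below (not part of the original notebook) returns the spread between the highest and lowest score in each subject.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
    'Age': [18, 19, 18, 19],
    'Math': [75, 82, 89, 85],
    'Science': [65, 75, 86, 90],
    'Teacher': ['William', 'William', 'Robert', 'Robert'],
})

def score_range(group):
    # `group` holds the selected columns for one teacher's students.
    # Hypothetical helper: per-column difference between max and min.
    return group.max() - group.min()

# Select the subject columns first so the function only sees numeric data.
ranges = df.groupby('Teacher')[['Math', 'Science']].apply(score_range)
print(ranges)
```

The result has one row per teacher and one column per subject, because the function returns one value per column for each group.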


In [8]:
df.groupby('Teacher')['Math'].sum()

Teacher
Robert     174
William    157
Name: Math, dtype: int64

In [10]:
df.groupby(['Teacher'])['Math'].value_counts()

Teacher  Math
Robert   85      1
         89      1
William  75      1
         82      1
Name: Math, dtype: int64

In [9]:
df.groupby(['Teacher','Student'])['Math'].sum()

Teacher  Student
Robert   Adrian     85
         Diana      89
William  Alex       82
         Beth       75
Name: Math, dtype: int64

Earlier, we passed a function (the built-in `max`) as an argument to `apply`. In the same way, we can pass our own custom functions to get the results we need. Let's define a function that finds the best teacher in our case:

In [36]:
def best_teacher(group_dframe):
    # For each subject, look up the teacher of the student with the highest score in this group
    return pd.DataFrame({'Math': [group_dframe.loc[group_dframe.Math.idxmax()].Teacher],
                        'Science': [group_dframe.loc[group_dframe.Science.idxmax()].Teacher]})

The function above takes one group of a grouped DataFrame (itself a DataFrame) as its argument and returns a one-row DataFrame that holds, for each subject, the name of the teacher whose student has the highest score.

Let's examine the function more closely. Consider the list passed as the value for the key `Math` in the dictionary defined in the function above:

`[group_dframe.loc[group_dframe.Math.idxmax()].Teacher]`

Let's dissect this list step by step to understand what is going on.

`group_dframe.Math.idxmax()`

The line above returns the index label of the row that holds the maximum value of Math within the group.

`group_dframe.loc[group_dframe.Math.idxmax()]`

Next, `.loc` fetches the row with that index label. For more on `.loc`, see my post How to use .loc, .iloc, .ix in Pandas.

Finally:

`group_dframe.loc[group_dframe.Math.idxmax()].Teacher`

The line above fetches the Teacher from the row extracted in the previous step. Since that row holds the maximum Math score in the group, the Teacher returned is the one whose student scored the highest marks in Math.
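The three steps can be traced by hand on a single group. The sketch below takes the 18-year-old students as the group; the names `group_18`, `idx`, `row` and `teacher` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
    'Age': [18, 19, 18, 19],
    'Math': [75, 82, 89, 85],
    'Science': [65, 75, 86, 90],
    'Teacher': ['William', 'William', 'Robert', 'Robert'],
})

# One group, as `apply` would pass it when grouping by Age.
group_18 = df[df['Age'] == 18]

# Step 1: index label of the row with the highest Math score in the group.
idx = group_18.Math.idxmax()

# Step 2: fetch that row by its label.
row = group_18.loc[idx]

# Step 3: read the teacher's name from the row.
teacher = row.Teacher

print(idx, row.Student, teacher)
```

Note that `idxmax` returns the original index label (here the label of Diana's row), not the position within the group, which is why `.loc` rather than `.iloc` is used in the next step.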

In [42]:
group_dframe = df.groupby('Age')

In [43]:
group_dframe.apply(best_teacher)

| Age | | Math | Science |
|---|---|---|---|
| 18 | 0 | Robert | Robert |
| 19 | 0 | Robert | Robert |
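For a single subject, the same answer can be reached without a custom function. This is an alternative sketch rather than the method used above: it asks each age group for the index label of its top Math score and then selects those rows from the original DataFrame. The name `best_math` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
    'Age': [18, 19, 18, 19],
    'Math': [75, 82, 89, 85],
    'Science': [65, 75, 86, 90],
    'Teacher': ['William', 'William', 'Robert', 'Robert'],
})

# Index label of the highest Math score within each age group.
top_rows = df.groupby('Age')['Math'].idxmax()

# Select those rows and keep the columns of interest.
best_math = df.loc[top_rows, ['Age', 'Student', 'Teacher']]
print(best_math)
```

This keeps the student's name alongside the teacher, which makes it easier to see which row produced each result.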


# MultiIndexing

Use an index slicer (`pd.IndexSlice`) when filtering rows of a DataFrame that has a MultiIndex.
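As a minimal sketch of this, the example below builds a result indexed by (Country, City) from the sales data used earlier and selects rows by one level at a time. The names `summary`, `rwp` and `c1` are illustrative.

```python
import pandas as pd

new_df = pd.DataFrame({
    "Country": ["C1", "C2", "C1", "C3", "C1", "C3", "C1"],
    "City": ["ISB", "KHR", "LAH", "DUB", "RWP", "RWP", "RWP"],
    "Sales": [10, 25, 40, 20, 45, 43, 10],
})

# groupby sorts the keys, so the resulting MultiIndex is sorted,
# which slicing across levels requires.
summary = new_df.groupby(["Country", "City"])[["Sales"]].sum()

idx = pd.IndexSlice

# All countries, but only the city RWP.
rwp = summary.loc[idx[:, "RWP"], :]

# Only country C1, all of its cities.
c1 = summary.loc[idx["C1", :], :]

print(rwp)
print(c1)
```

`idx[:, "RWP"]` reads as "any value in the first level, `RWP` in the second", which is hard to express with plain tuples and slices inside `.loc`.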