
# **Aggregation and Grouping**

In [33]:
import numpy as np
import pandas as pd

In [34]:
# Define a class to display the HTML representation of multiple objects side-by-side
class display(object):
  """Display HTML representation of multiple objects"""

  template = """<div style ="float: left; padding: 10px;">
  <p style = 'font-family: "Courier New", Courier, monospace'>{0}</p>{1}
  </div>"""

  # Constructor to initialize the display object with the arguments to be displayed
  def __init__(self, *args):
    self.args = args

  # Method to return the HTML representation of the objects for display
  # (the hook Jupyter looks for is _repr_html_, with single underscores)
  def _repr_html_(self):
    return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                     for a in self.args)

  # Method to return the string representation of the objects
  def __repr__(self):
    return '\n\n'.join(a + '\n' + repr(eval(a))
                       for a in self.args)

# **Planet Data**

In [35]:
import seaborn as sns
# Load the 'planets' dataset from seaborn
planets = sns.load_dataset('planets')
# Display the shape of the DataFrame (number of rows and columns)
planets.shape

(1035, 6)

In [36]:
# Display the first 5 rows of the DataFrame
planets.head()

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009


# **Simple Aggregation in Pandas**

In [37]:
# Create a random number generator with a seed for reproducibility
rng = np.random.RandomState(42)
# Create a pandas Series with 5 random numbers
ser = pd.Series(rng.rand(5))
# Display the Series
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64


In [38]:
# Calculate and display the sum of the elements in the Series
ser.sum()

np.float64(2.811925491708157)

In [39]:
# Calculate and display the mean of the elements in the Series
ser.mean()

np.float64(0.5623850983416314)

In [40]:
# Create a pandas DataFrame with two columns ('A' and 'B') and 5 rows of random numbers
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
# Display the DataFrame
df

          A         B
0  0.155995  0.020584
1  0.058084  0.969910
2  0.866176  0.832443
3  0.601115  0.212339
4  0.708073  0.181825


In [41]:
# Calculate and display the mean of each column in the DataFrame
df.mean()

A    0.477888
B    0.443420
dtype: float64


In [42]:
# Calculate and display the mean of each row in the DataFrame (axis='columns')
df.mean(axis = 'columns')

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64


In [43]:
# Drop rows with missing values and display descriptive statistics of the remaining data
planets.dropna().describe()

           number  orbital_period        mass    distance         year
count  498.000000      498.000000  498.000000  498.000000   498.000000
mean     1.734940      835.778671    2.509320   52.068213  2007.377510
std      1.175720     1469.128259    3.636274   46.596041     4.167284
min      1.000000        1.328300    0.003600    1.350000  1989.000000
25%      1.000000       38.272250    0.212500   24.497500  2005.000000
50%      1.000000      357.000000    1.245000   39.940000  2009.000000
75%      2.000000      999.600000    2.867500   59.332500  2011.000000
max      6.000000    17337.500000   25.000000  354.000000  2014.000000



| Aggregation | Returns |
|-------------|---------|
| `count` | Total number of items |
| `first`, `last` | First and last item |
| `mean`, `median` | Mean and median |
| `min`, `max` | Minimum and maximum |
| `std`, `var` | Standard deviation and variance |
| `mad` | Mean absolute deviation (removed in pandas 2.0) |
| `prod` | Product of all items |
| `sum` | Sum of all items |
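
As a quick check of the aggregations listed above, the sketch below applies them to a small hand-made Series (the values are made up for illustration, not taken from the planets data). `first` and `last` are shown positionally with `iloc`, and the mean absolute deviation is computed by hand because `Series.mad` is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Small hand-made Series so every result can be checked by eye
s = pd.Series([2.0, 4.0, 4.0, 10.0])

summary = {
    "count": s.count(),    # number of non-null items
    "first": s.iloc[0],    # first item, by position
    "last": s.iloc[-1],    # last item, by position
    "mean": s.mean(),
    "median": s.median(),
    "min": s.min(),
    "max": s.max(),
    "std": s.std(),        # sample standard deviation (ddof=1)
    "var": s.var(),        # sample variance (ddof=1)
    "prod": s.prod(),
    "sum": s.sum(),
}

# Mean absolute deviation, written out explicitly
summary["mad"] = (s - s.mean()).abs().mean()

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```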

# **GroupBy: Split, Apply, Combine**

**Split, Apply, Combine**

In [44]:
# Create a pandas DataFrame with a 'key' column and a 'data' column
df = pd.DataFrame({'key': ['A','B','C','A','B','C'],'data':range(6)},columns = ['key','data'])
# Display the DataFrame
df

  key  data
0   A     0
1   B     1
2   C     2
3   A     3
4   B     4
5   C     5


In [45]:
# Group the DataFrame by the 'key' column
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f974cd1bed0>

In [46]:
# Group the DataFrame by the 'key' column and calculate the sum of the 'data' column for each group
df.groupby('key').sum()

     data
key
A       3
B       5
C       7
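
To make the three steps concrete, here is a minimal sketch that performs the split, the apply, and the combine by hand on the same toy data and compares the result with the one-line `groupby` call. The names `pieces`, `sums`, and `manual` are illustrative only; pandas does not run this loop internally, it is just a way to see what the result means.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split: one sub-frame per distinct key
pieces = {k: df[df['key'] == k] for k in sorted(df['key'].unique())}

# Apply: reduce each sub-frame to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one labeled Series
manual = pd.Series(sums, name='data')
manual.index.name = 'key'

# The GroupBy equivalent of the three steps above
grouped = df.groupby('key')['data'].sum()

print(manual)
print(grouped)
```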


# **The GroupBy Object**

In [47]:
# Group the 'planets' DataFrame by the 'method' column
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f974cd067d0>

In [48]:
# Group the 'planets' DataFrame by the 'method' column and select the 'orbital_period' column
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f974cd0e590>

In [49]:
# Group the 'planets' DataFrame by the 'method' column and calculate the median of the 'orbital_period' for each group
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64


In [50]:
# Iterate over the groups created by grouping 'planets' by 'method'
for (method, group) in planets.groupby('method'):
  # Print the method name and the shape of the corresponding group
  print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


**Dispatch methods**

In [51]:
# Group the 'planets' DataFrame by the 'method' column, select the 'year' column, calculate descriptive statistics for each group, and unstack the result
planets.groupby('method')['year'].describe().unstack()

       method
count  Astrometry                          2.0
       Eclipse Timing Variations           9.0
       Imaging                            38.0
       Microlensing                       23.0
       Orbital Brightness Modulation       3.0
                                         ...
max    Pulsar Timing                    2011.0
       Pulsation Timing Variations      2007.0
       Radial Velocity                  2014.0
       Transit                          2014.0
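
The idea behind dispatch is that a method called on the GroupBy object is applied to each group and the results are collected. The sketch below checks this on a tiny made-up frame (not the planets data) by comparing `gb.max()` with a hand-written loop over the groups.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'val': [1.0, 2.0, 3.0, 6.0]})
gb = df.groupby('key')['val']

# Calling a method on the GroupBy object...
dispatched = gb.max()

# ...gives the same numbers as calling it on each group and collecting the results
by_hand = pd.Series({name: group.max() for name, group in gb})

print(dispatched)
print(by_hand)
```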


# **Aggregate, Filter, Transform, Apply**

In [52]:
import numpy as np
import pandas as pd
# Create a random number generator with a seed for reproducibility
rng = np.random.RandomState(0)

# Create a pandas DataFrame with a 'key' column, 'data1' column, and 'data2' column
df = pd.DataFrame({'key': ['A','B','C','A','B','C'],
                   'data1': range(6),
                   'data2': rng.randint(0,10,6)},
                   columns = ['key','data1','data2'])
# Display the DataFrame
df

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9


**Aggregation**

In [53]:
# Group by 'key' and apply several aggregations at once; passing the names as strings
# avoids the FutureWarning that recent pandas raises for np.median and the builtin max
df.groupby('key').aggregate(['min', 'median', 'max'])


    data1            data2
      min median max   min median max
key
A       0    1.5   3     3    4.0   5
B       1    2.5   4     0    3.5   7
C       2    3.5   5     3    6.0   9


In [54]:
# Group the DataFrame by the 'key' column and apply specific aggregation functions to different columns
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})

     data1  data2
key
A        0      5
B        1      7
C        2      9
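
A related option is named aggregation, where each output column is given a name together with the input column and the function to apply. The sketch below uses the same toy data; the output column names (`data1_min`, `data2_max`, `data2_range`) are arbitrary choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword is an output column: (input column, aggregation)
result = df.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
    data2_range=('data2', lambda s: s.max() - s.min()),  # custom reduction
)

print(result)
```

This yields flat column names, which can be easier to work with than the two-level columns produced by passing a list of functions.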


# **Filtering**

In [55]:
# Define a function to filter groups based on the standard deviation of the 'data2' column
def filter_func(x):
  return x['data2'].std() > 4

# Display the original DataFrame, the standard deviation of 'data2' for each group, and the filtered DataFrame
display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")

df
  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9

df.groupby('key').std()
       data1     data2
key                   
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641

df.groupby('key').filter(filter_func)
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9
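
One way to see what `filter` does: it keeps every row whose group passes the test. The sketch below, on the same toy data, builds the equivalent row mask with `transform('std')` and compares the two results. This is an illustration of the semantics, not a claim about how pandas implements `filter`.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Group-level test, as in filter_func
via_filter = df.groupby('key').filter(lambda g: g['data2'].std() > 4)

# Same selection as a row-level boolean mask: broadcast each group's std to its rows
mask = df.groupby('key')['data2'].transform('std') > 4
via_mask = df[mask]

print(via_filter)
print(via_mask)
```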

# **Transformation**

In [58]:
# Define a function to center the data within each group by subtracting the mean of the group
def center(x):
  return x - x.mean()

# Group the DataFrame by the 'key' column and apply the 'center' transformation to each group
df.groupby('key').transform(center)

   data1  data2
0   -1.5    1.0
1   -1.5   -3.5
2   -1.5   -3.0
3    1.5   -1.0
4    1.5    3.5
5    1.5    3.0
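
The centering above can also be written by first broadcasting each group's mean back to the original rows with `transform('mean')` and then subtracting. The sketch below does that on the same toy data and checks that every group ends up with zero mean; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

cols = ['data1', 'data2']

# transform returns an object with the same index as the input,
# so each row receives the mean of its own group
group_means = df.groupby('key')[cols].transform('mean')

# Subtracting the broadcast means centers each group at zero
centered = df[cols] - group_means

print(centered)
print(centered.groupby(df['key']).mean())
```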


# **The Apply method**

In [63]:
# Define a function to normalize the 'data1' column by the sum of the 'data2' column within each group
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

# Group the DataFrame by the 'key' column and apply the 'norm_by_data2' function to each group
df.groupby('key').apply(norm_by_data2)



       key     data1  data2
key
A   0    A  0.000000      5
    3    A  0.375000      3
B   1    B  0.142857      0
    4    B  0.571429      7
C   2    C  0.166667      3
    5    C  0.416667      9
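
A variant of the same normalization, sketched under two assumptions that are not in the original: the function works on a copy instead of modifying the group in place, and the data columns are selected explicitly so the grouping column is not passed to the function. `group_keys=False` is used here to keep the original row labels rather than adding `key` as an extra index level.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def norm_by_data2(g):
    # g is a DataFrame holding one group's 'data1' and 'data2' values
    out = g.copy()  # leave the caller's data untouched
    out['data1'] = out['data1'] / out['data2'].sum()
    return out

result = (df.groupby('key', group_keys=False)[['data1', 'data2']]
            .apply(norm_by_data2))

print(result.sort_index())
```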


# **Specifying the Split Key**

**A list, array, series, or index providing the grouping keys**

In [64]:
# Create a list of keys for grouping
L = [0, 1, 0, 1, 2, 0]
# Group the rows by the list 'L' and sum each group; note that the string column 'key' is "summed" by concatenation
df.groupby(L).sum()

   key  data1  data2
0  ACC      7     17
1   BA      4      3
2    B      4      7


In [65]:
# Group the DataFrame using the 'key' column of the DataFrame itself as keys and calculate the sum for each group
df.groupby(df['key']).sum()

     data1  data2
key
A        3      8
B        5      7
C        7     12


**A dictionary or series mapping index to group**

In [66]:
# Set the 'key' column as the index of the DataFrame
df2 = df.set_index('key')
# Create a dictionary to map index values to groups
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
# Display the DataFrame with the index set and the result of grouping by the mapping and calculating the sum
display('df2', 'df2.groupby(mapping).sum()')

df2
     data1  data2
key              
A        0      5
B        1      0
C        2      3
A        3      3
B        4      7
C        5      9

df2.groupby(mapping).sum()
           data1  data2
key                    
consonant     12     19
vowel          3      8
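
The heading mentions a Series as well as a dictionary. A minimal sketch of the Series form, on the same toy data: the Series index holds the labels found in the DataFrame index and its values are the group names. The name `mapping_ser` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
df2 = df.set_index('key')

# Index = labels in df2's index, values = group each label belongs to
mapping_ser = pd.Series({'A': 'vowel', 'B': 'consonant', 'C': 'consonant'})

result = df2.groupby(mapping_ser).sum()
print(result)
```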

**Any Python function**

In [67]:
# Pass a function as the key: it is called on each index label, and its return value (here the lowercased label) names the group
df2.groupby(str.lower).mean()

     data1  data2
key
a      1.5    4.0
b      2.5    3.5
c      3.5    6.0


**A list of valid keys**

In [68]:
# Combine several valid keys (here a function and a dictionary) in a list to group on a multi-level index
df2.groupby([str.lower, mapping]).mean()

               data1  data2
key key
a   vowel        1.5    4.0
b   consonant    2.5    3.5
c   consonant    3.5    6.0


**Grouping Example**

In [69]:
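# Bin each discovery year into its decade, e.g. 2006 -> 2000
decade = 10 * (planets['year'] // 10)
# Turn the numbers into labels such as '2000s'
decade = decade.astype(str) + 's'
decade.name = 'decade'
# Total planets found per method and decade; method/decade pairs with no discoveries become 0
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)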

decade                         1980s  1990s  2000s  2010s
method
Astrometry                       0.0    0.0    0.0    2.0
Eclipse Timing Variations        0.0    0.0    5.0   10.0
Imaging                          0.0    0.0   29.0   21.0
Microlensing                     0.0    0.0   12.0   15.0
Orbital Brightness Modulation    0.0    0.0    0.0    5.0
Pulsar Timing                    0.0    9.0    1.0    1.0
Pulsation Timing Variations      0.0    0.0    1.0    0.0
Radial Velocity                  1.0   52.0  475.0  424.0
Transit                          0.0    0.0   64.0  712.0
Transit Timing Variations        0.0    0.0    0.0    9.0
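
The decade labels come from integer floor division. The sketch below isolates that step on a few made-up years (not the planets data) and counts how many fall in each decade.

```python
import pandas as pd

years = pd.Series([1989, 1995, 2000, 2009, 2014], name='year')

# 1989 // 10 == 198, times 10 gives 1980; then label it as '1980s'
decade = (10 * (years // 10)).astype(str) + 's'
decade.name = 'decade'

# A Series with the same index can be used directly as the grouping key
counts = years.groupby(decade).count()

print(decade.tolist())
print(counts)
```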
