Readme:


For more on these features, see Chapter 10, 'Data Aggregation and Group Operations', of 'Python for Data Analysis, 3E' by Wes McKinney.<br>
Link: https://wesmckinney.com/book/data-aggregation

In [1]:
import numpy as np
import pandas as pd

<h3><b>Task 1 </b></h3>
<p>
Given the DataFrame below, sum the values in 'data1', grouping by 'key1' and 'key2'.<br>
Then call unstack() on the result and observe what it does.
</p>


In [4]:
import pandas as pd
import numpy as np

# Create the DataFrame
df = pd.DataFrame({
    "key1": ["a", "a", None, "b", "b", "a", None],
    "key2": pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
    "data1": np.random.standard_normal(7),
    "data2": np.random.standard_normal(7)
})

# Group by key1 and key2, summing data1
grouped = df.groupby(["key1", "key2"], dropna=False)["data1"].sum()

# Display the result before unstacking
print("Grouped sum:\n", grouped)

# Unstack the result
unstacked = grouped.unstack()
print("\nUnstacked result:\n", unstacked)


Grouped sum:
 key1  key2
a     1      -0.277449
      2       0.515371
      <NA>   -1.517096
b     1       0.509951
      2       1.129948
NaN   1       0.922618
Name: data1, dtype: float64

Unstacked result:
 key2      1         2         <NA>
key1                              
a    -0.277449  0.515371 -1.517096
b     0.509951  1.129948       NaN
NaN   0.922618       NaN       NaN
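
<p>
Because the values above are random, the effect of unstack() can be easier to see on fixed numbers. The following is a minimal sketch on a hypothetical toy DataFrame (the name 'toy' and its values are assumptions for illustration, not part of the exercise): unstack() moves the innermost index level ('key2') into the columns, and any key combination that never occurred is filled with NaN.
</p>

```python
import pandas as pd

# Hypothetical toy data with fixed values so the result is reproducible.
# Note that the combination ("b", 2) never occurs.
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "a"],
    "key2": [1, 2, 1, 1],
    "data1": [1.0, 2.0, 3.0, 5.0],
})

# Long format: one row per (key1, key2) pair that actually occurs
sums = toy.groupby(["key1", "key2"])["data1"].sum()

# Wide format: key1 stays in the rows, key2 moves to the columns;
# the missing ("b", 2) combination becomes NaN
wide = sums.unstack()

print(sums)
print(wide)
```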


<h3><b>Task 2 </b></h3>
<p>
Can you sum the values in 'data1' of the DataFrame above, grouping by the arrays of data below?<br>
</p>


In [5]:
import pandas as pd
import numpy as np

# Original DataFrame
df = pd.DataFrame({
    "key1": ["a", "a", None, "b", "b", "a", None],
    "key2": pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
    "data1": np.random.standard_normal(7),
    "data2": np.random.standard_normal(7)
})

# Arrays to group by
states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

# Group by arrays and sum 'data1'
grouped = df["data1"].groupby([states, years]).sum()

# Display the result
print(grouped)


CA  2005    0.562916
    2006    1.369498
OH  2005    0.023481
    2006   -1.240904
Name: data1, dtype: float64
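
<p>
Grouping by external arrays should give the same sums as grouping by columns that hold the same values. The sketch below checks this on hypothetical fixed data (the names 'values', 'frame', 'by_arrays' and 'by_columns' are assumptions for illustration); the only requirement is that each array has the same length as the data being grouped.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data; each key array has one entry per value
values = pd.Series([1.0, 2.0, 3.0, 4.0])
states = np.array(["OH", "CA", "CA", "OH"])
years = [2005, 2005, 2006, 2005]

# Group directly by the arrays (matched to the values by position)
by_arrays = values.groupby([states, years]).sum()

# Same computation after storing the keys as columns
frame = pd.DataFrame({"state": states, "year": years, "value": values})
by_columns = frame.groupby(["state", "year"])["value"].sum()

print(by_arrays)
print(by_columns)
```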


<h3><b>Task 3 </b></h3>
<p>
What if you want to sum every numeric column, grouping by 'key2'?<br>
</p>


In [6]:
# Group by 'key2' and sum all numeric columns
grouped_sum = df.groupby("key2").sum(numeric_only=True)

# Display the result
print(grouped_sum)


         data1     data2
key2                    
1     0.103191  0.673154
2     0.269481 -1.574802
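
<p>
Two things happen in the result above: the non-numeric column 'key1' is left out, and rows whose 'key2' is NA are dropped because groupby excludes NA keys by default. A minimal sketch on hypothetical fixed data (the name 'toy' and its values are assumptions) makes both effects visible:
</p>

```python
import pandas as pd

# Hypothetical toy data: one string column, one nullable integer key,
# one numeric column; the last row has a missing key
toy = pd.DataFrame({
    "key1": ["a", "b", "a", "b"],
    "key2": pd.Series([1, 2, 1, None], dtype="Int64"),
    "data1": [1.0, 2.0, 3.0, 4.0],
})

# numeric_only=True skips 'key1'; the row with key2 = NA is not
# assigned to any group, so its value 4.0 is not counted
totals = toy.groupby("key2").sum(numeric_only=True)

print(totals)
```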


<h3><b>Task 4 </b></h3>
<p>
Group the DataFrame by 'key1' and calculate the size and count of each group, keeping the group whose key is NA.<br>
</p>


In [7]:
# Group by 'key1' and calculate size
group_size = df.groupby("key1", dropna=False).size()

# Group by 'key1' and calculate count (non-NA entries per column)
group_count = df.groupby("key1", dropna=False).count()

# Display results
print("Group Size (includes NA values):")
print(group_size)
print("\nGroup Count (non-NA values per column):")
print(group_count)


Group Size (includes NA values):
key1
a      3
b      2
NaN    2
dtype: int64

Group Count (non-NA values per column):
      key2  data1  data2
key1                    
a        2      3      3
b        2      2      2
NaN      2      2      2
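
<p>
The difference between size() and count() is that size() counts rows per group, while count() counts non-NA values per column. The sketch below illustrates this on hypothetical fixed data (the name 'toy' and its values are assumptions), and also shows that the NA key only appears as a group when dropna=False is passed.
</p>

```python
import pandas as pd

# Hypothetical toy data: one missing group key and one missing value
toy = pd.DataFrame({
    "key1": ["a", "a", None, "b"],
    "value": [1.0, None, 3.0, 4.0],
})

grouped = toy.groupby("key1", dropna=False)

# size(): number of rows in each group, including the NA-key group
sizes = grouped.size()

# count(): number of non-NA entries of 'value' in each group
counts = grouped["value"].count()

# With the default dropna=True the NA-key group disappears
sizes_default = toy.groupby("key1").size()

print(sizes)
print(counts)
print(sizes_default)
```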


<h3><b>Task 5 </b></h3>
<p>
Grouping with Dictionaries and Series.<br>
Can you sum the values in the 'people' DataFrame, grouping its columns according to the 'mapping' dictionary?<br>
</p>


In [11]:
import pandas as pd
import numpy as np

# Create the DataFrame
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wanda", "Jill", "Trey"])

# Mapping dictionary
mapping = {
    "a": "red", 
    "b": "red", 
    "c": "blue", 
    "d": "blue", 
    "e": "red", 
    "f": "orange"  # Not present in DataFrame, will be ignored
}

# Updated approach: Transpose, group by, then transpose back
grouped = people.T.groupby(mapping).sum().T

print(grouped)


           blue       red
Joe    1.847562  0.409425
Steve -0.595276  0.146231
Wanda -0.519190 -0.764807
Jill   0.403982  1.568263
Trey   1.467491  0.940447
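
<p>
The task title also mentions Series. A Series whose index holds the column names and whose values hold the group names can be used in place of the dictionary. The following is a minimal sketch with fixed numbers instead of random ones (the two-row 'people' frame and the name 'map_series' are assumptions for illustration):
</p>

```python
import numpy as np
import pandas as pd

# Small deterministic stand-in for the 'people' DataFrame
people = pd.DataFrame(
    np.arange(10, dtype=float).reshape(2, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve"],
)

mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}

# The same mapping as a Series: index = column name, value = group name
map_series = pd.Series(mapping)

# Transpose so the columns become the index, group, then transpose back
by_dict = people.T.groupby(mapping).sum().T
by_series = people.T.groupby(map_series).sum().T

print(by_dict)
print(by_series)
```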


<h3><b>Task 6 </b></h3>
<p>
Grouping with Functions.<br>
Any function passed as a group key will be called once per index value (or once per column value if using axis="columns"), with the return values being used as the group names.<br>
More concretely, consider the example DataFrame from the previous section, which has people’s first names as index values.<br>
Suppose you wanted to group by name length. While you could compute an array of string lengths, it's simpler to just pass the 'len' function.<br>
Run the code below and analyze the result.<br>
</p>


In [14]:
import pandas as pd
import numpy as np

# Create the DataFrame
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wanda", "Jill", "Trey"])

# Group by length of index (name length) and sum
grouped = people.groupby(len).sum()

print(grouped)


          a         b         c         d         e
3  0.710836  1.376156 -1.492009  0.541408 -1.464937
4  1.454821 -2.524155 -2.453808  1.182791  0.897367
5 -0.940448 -0.595841  1.007527  1.366857 -0.121438
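
<p>
To confirm that passing 'len' is equivalent to computing the array of string lengths by hand, the two approaches can be compared on fixed numbers. This is a minimal sketch; the single-column 'people' frame and the names 'by_func', 'name_lengths' and 'by_array' are assumptions for illustration.
</p>

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for 'people' with a single column
people = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=["Joe", "Steve", "Wanda", "Jill", "Trey"],
)

# Pass the function: it is called once per index value
by_func = people.groupby(len).sum()

# Compute the lengths explicitly and group by the resulting array
name_lengths = np.array([len(name) for name in people.index])
by_array = people.groupby(name_lengths).sum()

print(by_func)
print(by_array)
```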


<h3><b>Task 7 </b></h3>
<p>
Aggregations refer to any data transformation that produces scalar values from arrays.<br>
To use your own aggregation functions, pass any function that aggregates an array to the aggregate method or its short alias 'agg'.<br>
Run the code below and analyze the result.<br>
Note: Custom aggregation functions are generally much slower than the optimized built-in functions. This is because there is some extra overhead (function calls, data rearrangement) in constructing the intermediate group data chunks.<br>
</p>


In [16]:
grouped = df.groupby("key1")

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

      key2     data1     data2
key1                          
a        1  0.367722  2.274187
b        1  0.078564  1.328339


In [17]:
import pandas as pd
import numpy as np

# Sample DataFrame
df = pd.DataFrame({
    "key1": ["a", "a", None, "b", "b", "a", None],
    "key2": pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
    "data1": np.random.standard_normal(7),
    "data2": np.random.standard_normal(7)
})

# Group by 'key1'
grouped = df.groupby("key1")

# Custom aggregation function: range (max - min)
def peak_to_peak(arr):
    return arr.max() - arr.min()

# Apply aggregation
result = grouped.agg(peak_to_peak)
print(result)


      key2     data1     data2
key1                          
a        1  4.684470  3.074428
b        1  1.706936  3.344879
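
<p>
Given the note about speed, it can be worth checking whether a custom function can be rewritten with built-in aggregations. For peak_to_peak, subtracting the built-in min() from the built-in max() should give the same numbers. The sketch below checks this on hypothetical fixed data (the name 'toy' and its values are assumptions); it does not measure the speed difference.
</p>

```python
import pandas as pd

# Hypothetical toy data with fixed values
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 4.0, 2.0, 7.0, 10.0],
})

def peak_to_peak(arr):
    """Range of the values in one group: max minus min."""
    return arr.max() - arr.min()

grouped = toy.groupby("key1")["data1"]

# Custom aggregation: the Python function is called once per group
custom = grouped.agg(peak_to_peak)

# Same result from two optimized built-in aggregations
builtin = grouped.max() - grouped.min()

print(custom)
print(builtin)
```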
