The groupby() method in pandas groups rows by one or more keys and then applies a function to each group. It is the standard way to summarize large tables, and it supports three kinds of operation on grouped data: aggregation, transformation and filtration.

# The groupby() operation is divided into three main steps:

# Step 1: Splitting Data into Groups
The splitting process refers to dividing the dataset into groups based on a particular condition or key. This can be done using the groupby() function by passing one or more columns as keys.

Methods for splitting data are as follows:

1. Group data by a single key: To group by one key, pass a single column name to groupby(). Here we group the data by the Name column.

**Observations:**
- 8 records with names appearing 1-2 times each
- groupby('Name').groups returns a dictionary with row indices for each name group
- Groups are sorted alphabetically

**Interpretation:**
- groupby() partitions the DataFrame by the 'Name' column
- Each group contains all rows with the same name
- Enables aggregate operations on each group

In [19]:
import pandas as pd

data1 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi',
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
        'Age':[27, 24, 22, 32,
               33, 36, 27, 32],
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                   'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd',
                         'B.Tech', 'B.com', 'Msc', 'MA']}
df = pd.DataFrame(data1)
print(df)


df.groupby('Name')
print(df.groupby('Name').groups)

     Name  Age    Address Qualification
0     Jai   27     Nagpur           Msc
1    Anuj   24     Kanpur            MA
2     Jai   22  Allahabad           MCA
3  Princi   32    Kannuaj           Phd
4  Gaurav   33    Jaunpur        B.Tech
5    Anuj   36     Kanpur         B.com
6  Princi   27  Allahabad           Msc
7    Abhi   32    Aligarh            MA
{'Abhi': [7], 'Anuj': [1, 5], 'Gaurav': [4], 'Jai': [0, 2], 'Princi': [3, 6]}
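
A quick way to sanity-check the split is to count the groups and their sizes. The sketch below rebuilds only the Name and Age columns from the DataFrame above; `ngroups` and `size()` are standard GroupBy members, while the variable names are illustrative.

```python
import pandas as pd

# Name and Age values copied from the tutorial's DataFrame.
df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

grouped = df.groupby('Name')   # builds the grouping; no aggregation happens yet
n_groups = grouped.ngroups     # number of distinct names
sizes = grouped.size()         # Series: row count per group, indexed by name

print(n_groups)
print(sizes)
```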


**Observations:**
- gk.first() returns the first row of each group by Name
- Shows 5 unique names with their first occurrence's data (Age, Address, Qualification)

**Interpretation:**
- .first() is an aggregation function that retrieves the first non-null value of each column in each group; with no missing data, that is the group's first record
- Useful for getting initial values in grouped data
- Returns a DataFrame with one row per group

In [20]:
gk = df.groupby('Name') 
gk.first()
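
The difference between "first value" and "first row" only shows up when data is missing. The following sketch uses a small hypothetical frame (not the tutorial's data) in which one group's first row has a missing Age, and compares `first()` with `head(1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Jai's first row has no Age.
demo = pd.DataFrame({
    'Name': ['Jai', 'Jai', 'Anuj'],
    'Age': [np.nan, 22, 24],
    'City': ['Nagpur', 'Allahabad', 'Kanpur'],
})

g = demo.groupby('Name')
first_values = g.first()   # first non-null value per column, one row per group
first_rows = g.head(1)     # literal first row of each group, original index kept

print(first_values)
print(first_rows)
```

Here `first()` reports Jai's Age from the second row but City from the first, so the result is not a row that exists in the data.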

2. Grouping data with multiple keys: To group by several keys, pass a list of column names to groupby(). Here we group by "Name" and "Qualification" together.

**Observations:**
- Groups by two columns: Name and Qualification
- Creates more granular groups (e.g., ('Jai', 'Msc') and ('Jai', 'MCA') are separate)

**Interpretation:**
- Multi-key groupby partitions data by combinations of values
- Enables finer-grained analysis than single-key grouping
- Each unique combination becomes a separate group

In [None]:
df.groupby(['Name', 'Qualification'])
print(df.groupby(['Name', 'Qualification']).groups)

{('Abhi', 'MA'): [7], ('Anuj', 'B.com'): [5], ('Anuj', 'MA'): [1], ('Gaurav', 'B.Tech'): [4], ('Jai', 'MCA'): [2], ('Jai', 'Msc'): [0], ('Princi', 'Msc'): [6], ('Princi', 'Phd'): [3]}
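
As a rough illustration of what multi-key grouping produces, the sketch below counts rows per (Name, Qualification) pair. The result carries a two-level index, so one level can be selected on its own; the variable names are illustrative.

```python
import pandas as pd

# Name and Qualification values copied from the tutorial's DataFrame.
df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA'],
})

counts = df.groupby(['Name', 'Qualification']).size()  # Series with a MultiIndex
jai_counts = counts.loc['Jai']                         # select on the first level only

print(counts)
print(jai_counts)
```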


3. Grouping data by sorting keys: groupby() sorts group keys by default. The first example below relies on that default; the second passes sort=False to keep the groups in order of first appearance.

**Observations:**
- Groups by Name, then selects Age column
- Calculates sum of ages for each name group
- Returns a Series with names as index and total ages as values

**Interpretation:**
- Selects one column and sums it for each group
- Shows total age per person
- Quick way to get totals for each group

In [None]:
df.groupby('Name')['Age'].sum()

Name
Abhi      32
Anuj      60
Gaurav    33
Jai       49
Princi    59
Name: Age, dtype: int64

**Observations:**
- Groups by Name with sort=False
- Picks Age column and sums it
- Results appear in the order each name first occurs in the data, not alphabetically

**Interpretation:**
- sort=False keeps groups in order of first appearance
- The default (sort=True) orders group keys alphabetically
- Same totals, just a different order

In [None]:
df.groupby(['Name'], sort=False)['Age'].sum()

Name
Jai       49
Anuj      60
Princi    59
Gaurav    33
Abhi      32
Name: Age, dtype: int64
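
To confirm that `sort` changes only the ordering, the two results can be compared directly. This is a minimal sketch on the same Name and Age values; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

sorted_totals = df.groupby('Name')['Age'].sum()                 # keys in sorted order
unsorted_totals = df.groupby('Name', sort=False)['Age'].sum()   # keys in first-seen order

# After aligning both on the index, the values are identical.
same_values = sorted_totals.equals(unsorted_totals.sort_index())
print(same_values)
```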

4. Grouping data with object attributes: The groups attribute of a GroupBy object behaves like a dictionary whose keys are the unique group values and whose values are the index labels of the rows in each group.

**Observations:**
- Returns a dictionary of groups
- Keys are names, values are row labels (which equal row positions here because the index is the default 0..7)
- Shows which rows belong to each group

**Interpretation:**
- .groups gives you a dictionary mapping
- See all rows for each group at a glance
- Useful to check group structure

In [None]:
df.groupby('Name').groups

{'Abhi': [7], 'Anuj': [1, 5], 'Gaurav': [4], 'Jai': [0, 2], 'Princi': [3, 6]}
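
Labels and positions only differ once the index is not the default range. The sketch below uses a hypothetical three-row frame with string labels to contrast `.groups` (labels) with the related `.indices` attribute (integer positions).

```python
import pandas as pd

# Hypothetical frame with a non-default index to make the distinction visible.
small = pd.DataFrame(
    {'Name': ['Jai', 'Anuj', 'Jai'], 'Age': [27, 24, 22]},
    index=['a', 'b', 'c'],
)

g = small.groupby('Name')
labels = {key: list(idx) for key, idx in g.groups.items()}        # index labels
positions = {key: arr.tolist() for key, arr in g.indices.items()}  # integer positions

print(labels)
print(positions)
```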

5. Iterating through groups: Iterating over a GroupBy object yields one (key, DataFrame) pair per group.

**Observations:**
- Loops through each group by Name
- Each iteration gives group name (the key) and a DataFrame of rows
- Prints each group's name and full data

**Interpretation:**
- Loop through groups to process each one separately
- name is the group key (Name value)
- group is a DataFrame with all rows for that group
- Useful for custom operations on each group

In [None]:
grp = df.groupby('Name')
for name, group in grp:
    print(name)
    print(group)
    print()

Abhi
   Name  Age  Address Qualification
7  Abhi   32  Aligarh            MA

Anuj
   Name  Age Address Qualification
1  Anuj   24  Kanpur            MA
5  Anuj   36  Kanpur         B.com

Gaurav
     Name  Age  Address Qualification
4  Gaurav   33  Jaunpur        B.Tech

Jai
  Name  Age    Address Qualification
0  Jai   27     Nagpur           Msc
2  Jai   22  Allahabad           MCA

Princi
     Name  Age    Address Qualification
3  Princi   32    Kannuaj           Phd
6  Princi   27  Allahabad           Msc
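
Iteration is mostly useful when the per-group logic does not fit a built-in method. As one possible example, the sketch below collects the age range (max minus min) of each name into a plain dictionary; `age_range` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

age_range = {}
for name, group in df.groupby('Name'):
    # group is a DataFrame holding only the rows for this name
    age_range[name] = int(group['Age'].max() - group['Age'].min())

print(age_range)
```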



In [None]:
# Now we iterate over groups formed from multiple keys.

# **Observations:**
# - Groups by Name and Qualification (two columns)
# - Each iteration gives a tuple (Name, Qualification) as the key
# - Each group shows rows matching that Name-Qualification combination

#**Interpretation:**
# - name is a tuple with both column values
# - group is a DataFrame for that specific combination
# - More detailed grouping than single-key iteration
# - Useful for analyzing subgroups within groups

grp = df.groupby(['Name', 'Qualification'])
for name, group in grp:
    print(name)
    print(group)
    print()

('Abhi', 'MA')
   Name  Age  Address Qualification
7  Abhi   32  Aligarh            MA

('Anuj', 'B.com')
   Name  Age Address Qualification
5  Anuj   36  Kanpur         B.com

('Anuj', 'MA')
   Name  Age Address Qualification
1  Anuj   24  Kanpur            MA

('Gaurav', 'B.Tech')
     Name  Age  Address Qualification
4  Gaurav   33  Jaunpur        B.Tech

('Jai', 'MCA')
  Name  Age    Address Qualification
2  Jai   22  Allahabad           MCA

('Jai', 'Msc')
  Name  Age Address Qualification
0  Jai   27  Nagpur           Msc

('Princi', 'Msc')
     Name  Age    Address Qualification
6  Princi   27  Allahabad           Msc

('Princi', 'Phd')
     Name  Age  Address Qualification
3  Princi   32  Kannuaj           Phd
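
Because the key is a tuple when grouping by several columns, it can be unpacked directly in the loop header. A small sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA'],
})

summaries = []
for (name, qual), group in df.groupby(['Name', 'Qualification']):
    # name and qual are the two components of the group key
    summaries.append(f"{name}/{qual}: {len(group)} row(s)")

print(summaries)
```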



6. Selecting a group: GroupBy.get_group() selects a single group by its key and returns that group's rows.

**Observations:**
- Groups by Name, then selects a specific group using .get_group()
- Passes 'Jai' as the group key
- Returns all rows where Name is 'Jai'

**Interpretation:**
- .get_group() extracts a single group by its key
- Returns a DataFrame with all rows for that group
- Useful for accessing specific group data without iteration
- Key must match exactly with group name

In [None]:
grp = df.groupby('Name')
grp.get_group('Jai')

  Name  Age    Address Qualification
0  Jai   27     Nagpur           Msc
2  Jai   22  Allahabad           MCA


In [21]:
# Now we select a group from an object grouped on multiple columns.

# **Observations:**
# - Groups by Name and Qualification, then selects one group using .get_group()
# - Passes the tuple ('Jai', 'Msc') as the group key
# - Returns the rows where Name is 'Jai' and Qualification is 'Msc'

# **Interpretation:**
# - .get_group() with multi-key groups requires a tuple as the key
# - Tuple must match the order of the groupby columns
# - Returns the specific rows matching both conditions
# - Useful for accessing subgroups from multi-level groups

grp = df.groupby(['Name', 'Qualification'])
grp.get_group(('Jai', 'Msc'))

  Name  Age Address Qualification
0  Jai   27  Nagpur           Msc
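
Since the key must match exactly, asking for a group that does not exist raises a KeyError. One way to guard against that is to check membership in `.groups` first. The helper `safe_get_group` below is hypothetical, written for this illustration, and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA'],
})

def safe_get_group(grouped, key):
    """Hypothetical helper: return the group for key, or None if it is absent."""
    if key in grouped.groups:
        return grouped.get_group(key)
    return None

grp = df.groupby(['Name', 'Qualification'])
found = safe_get_group(grp, ('Jai', 'Msc'))     # one matching row
missing = safe_get_group(grp, ('Jai', 'Phd'))   # no such combination -> None

print(found)
print(missing)
```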


# Step 2: Applying Functions to Groups

After splitting the data into groups, we apply a function to each group. The operations fall into three kinds:

1. Aggregation: It allows us to find a summary statistic for each group like summing or averaging values.

**Observations:**
- Groups by Name
- Selects Age column from groups
- Uses .aggregate() with np.sum to sum ages
- Returns a Series with totals per name

**Interpretation:**
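- .aggregate() applies a function to the grouped column
- np.sum calculates the total for each group
- Equivalent to calling .sum() directly
- Other reducers such as np.mean or np.max work the same way; recent pandas releases emit a FutureWarning for NumPy reducers here and suggest the string names ('sum', 'mean', 'max') instead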

In [22]:
import numpy as np

grp1 = df.groupby('Name')
grp1['Age'].aggregate(np.sum)


Name
Abhi      32
Anuj      60
Gaurav    33
Jai       49
Princi    59
Name: Age, dtype: int64
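
The same totals can be requested by name rather than by passing a NumPy function. The sketch below checks that the string form and the direct method agree; variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

by_string = df.groupby('Name')['Age'].aggregate('sum')  # reducer given by name
by_method = df.groupby('Name')['Age'].sum()             # direct method call

print(by_string.equals(by_method))
```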

In [None]:
#Now we perform aggregation on a group containing multiple keys.

#**Observations:**
# - Groups by Name and Qualification (two columns)
# - Selects Age column from groups
# - Uses .aggregate() with np.sum
# - Returns sum for each Name-Qualification combination


#**Interpretation:**
# - Groups data at a finer level (by both Name and Qualification)
# - Calculates age totals for each specific subgroup
# - Shows aggregation across multi-key groups
# - More detailed breakdown than single-key grouping

grp1 = df.groupby(['Name', 'Qualification'])
grp1['Age'].aggregate(np.sum)


Name    Qualification
Abhi    MA               32
Anuj    B.com            36
        MA               24
Gaurav  B.Tech           33
Jai     MCA              22
        Msc              27
Princi  Msc              27
        Phd              32
Name: Age, dtype: int64
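
A multi-key aggregation returns a Series with a two-level index. If a flat table is easier to work with, `reset_index()` turns the key levels back into ordinary columns, as in this sketch (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA'],
})

totals = df.groupby(['Name', 'Qualification'])['Age'].sum()  # MultiIndex Series
flat = totals.reset_index()                                  # keys become columns

print(flat)
```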

In [None]:
# We can apply multiple functions at once by passing a list or dictionary of functions.

#**Observations:**
# - Groups by Name
# - Selects Age column
# - Applies multiple functions in a list: sum, mean, std
# - Returns a DataFrame with each function as a column

#**Interpretation:**
# - .agg() with a list applies multiple operations at once
# - Each function becomes a separate column in output
# - Shows multiple statistics for each group
# - More concise than calling each function separately

grp = df.groupby('Name')
grp['Age'].agg([np.sum, np.mean, np.std])


        sum  mean       std
Name
Abhi     32  32.0       NaN
Anuj     60  30.0  8.485281
Gaurav   33  33.0       NaN
Jai      49  24.5  3.535534
Princi   59  29.5  3.535534
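
When the default column names (sum, mean, std) are not descriptive enough, named aggregation lets each output column be labelled explicitly. A minimal sketch, assuming a pandas version that supports keyword-style aggregation on a grouped Series; the column names `total`, `average` and `spread` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

# Each keyword becomes an output column; each value names the reducer to apply.
stats = df.groupby('Name')['Age'].agg(total='sum', average='mean', spread='std')

print(stats)
```

Groups with a single row (Abhi, Gaurav) get NaN for the standard deviation, because the sample standard deviation is undefined for one observation.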


2. Transformation: It is a process in which we perform some group-specific computations and return a result with the same index as the original data.

Here we use a second DataFrame, df2, which repeats the data above and adds a Score column.

**Observations:**
- Groups by Name
- Selects Age column
- Uses lambda to standardize ages: (Age - mean) / std
- .transform() returns result with same shape as original

**Interpretation:**
- Standardizes Age within each group
- Shows how many standard deviations each age is from group mean
- Keeps original data structure (doesn't reduce rows)
- Useful for normalizing values within groups for comparison

In [24]:
import pandas as pd
data2 = {'Name':['Jai', 'Anuj', 'Jai', 'Princi', 
                 'Gaurav', 'Anuj', 'Princi', 'Abhi'], 
        'Age':[27, 24, 22, 32, 
               33, 36, 27, 32], 
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                   'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd',
                         'B.Tech', 'B.com', 'Msc', 'MA'],
        'Score':[23, 34, 35, 45, 47, 50, 52, 53]} 
df2 = pd.DataFrame(data2)
grp2 = df2.groupby('Name')
sc = lambda x: (x - x.mean()) / x.std() 
grp2['Age'].transform(sc)

0    0.707107
1   -0.707107
2   -0.707107
3    0.707107
4         NaN
5    0.707107
6   -0.707107
7         NaN
Name: Age, dtype: float64
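
The defining property of transform is that the result lines up row for row with the input. The sketch below broadcasts each group's mean back to its rows and subtracts it, then shows why single-row groups produce NaN in the standardization above: their standard deviation is undefined. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

ages = df.groupby('Name')['Age']
group_mean = ages.transform('mean')   # one value per original row
centered = df['Age'] - group_mean     # deviation from the group's mean
group_std = ages.transform('std')     # NaN where a group has a single row

print(centered)
print(group_std)
```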

3. Filtration: It is a process which is used to discard groups based on a condition. For example we can filter out groups where the number of records is less than a certain threshold.

**Observations:**
- Groups by Name
- Uses .filter() with lambda condition
- len(x) >= 2 checks if group has 2 or more rows
- Returns only rows from groups with 2+ members

**Interpretation:**
- Removes groups with only 1 record
- Keeps groups that appear multiple times
- Returns a DataFrame (not reduced like aggregation)
- Useful for filtering out rare or unique entries

In [25]:
grp2 = df2.groupby('Name')
grp2.filter(lambda x: len(x) >= 2)

     Name  Age    Address Qualification  Score
0     Jai   27     Nagpur           Msc     23
1    Anuj   24     Kanpur            MA     34
2     Jai   22  Allahabad           MCA     35
3  Princi   32    Kannuaj           Phd     45
5    Anuj   36     Kanpur         B.com     50
6  Princi   27  Allahabad           Msc     52
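
The condition passed to filter can test any property of the group, not only its size. As a further illustration, the sketch below keeps only the names whose mean age is at least 30; the threshold is arbitrary and chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

# The lambda receives each group's DataFrame and must return a single True/False.
kept = df.groupby('Name').filter(lambda g: g['Age'].mean() >= 30)

print(kept)
```

Rows come back in their original order with their original index, and whole groups are kept or dropped together.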


# Step 3: Combining Results

After applying the functions, results are combined back into a data structure like a DataFrame or Series. We can further analyze or visualize the data in its new form.

**Observations:**
- Groups by Name
- Uses .agg() with a dictionary {'Age': 'sum'}
- Sums Age for each name group
- Returns a DataFrame

**Interpretation:**
- .agg() lets you specify which columns and what operations to apply
- Dictionary format shows column name and function clearly
- More flexible than .sum() alone: different columns can be given different operations

In [26]:
df.groupby('Name').agg({'Age': 'sum'})

        Age
Name
Abhi     32
Anuj     60
Gaurav   33
Jai      49
Princi   59
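
To make the combine step concrete, the three steps can be spelled out by hand and compared with the one-line version. This is only an illustrative sketch of the idea, not how pandas implements it internally.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

# Split: iterate over groups. Apply: sum each group's ages.
pieces = {name: group['Age'].sum() for name, group in df.groupby('Name')}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name='Age')

# The built-in path performs all three steps in one call.
builtin = df.groupby('Name')['Age'].sum()

print(manual.to_dict() == builtin.to_dict())
```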
