### **GroupBy Operation in Pandas**
The **GroupBy** operation in Pandas is a flexible tool for data aggregation and transformation.
It allows you to **split** a DataFrame into groups, **apply** computations or transformations to each group, and then **combine** the results, much like the **SQL `GROUP BY`** clause.

---
➡️ **Common GroupBy Methods**

| Method                       | Description                                           |
| ---------------------------- | ----------------------------------------------------- |
| `group.mean()`               | Calculates the **mean** for each group                |
| `group.sum()`                | Calculates the **sum** of each group                  |
| `group.size()`               | Counts **all rows** in each group (including nulls)   |
| `group.count()`              | Counts **non-null** values in each group              |
| `group.min()`, `group.max()` | Finds **minimum** or **maximum** values in each group |
| `group.agg()`                | Applies **multiple aggregation functions** at once    |
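
The difference between `size()` and `count()` only shows up when a group contains missing values. The following is a minimal sketch on a small hypothetical DataFrame (`df_nulls` and its values are made up for illustration) in which one HR salary is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one HR salary is missing (NaN)
df_nulls = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT'],
    'Salary': [4000, np.nan, 5000, 6000]
})

group_nulls = df_nulls.groupby('Department')

# size() counts rows per group, whether or not values are missing
row_counts = group_nulls.size()

# count() counts only the non-null values in the selected column
non_null_counts = group_nulls['Salary'].count()

print(row_counts)
print(non_null_counts)
```

Here HR has two rows but only one non-null salary, so the two methods disagree for that group and agree for IT.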

---
✅ **Key Takeaways**
* GroupBy enables **powerful summarization** and **aggregation** of data.
* You can group by **one or more columns**.
* Use `agg()` for **custom or multiple** aggregation functions.
* The GroupBy operation follows the clear pattern: **Split → Apply → Combine**.
---

➡️ **`Concept:` Split → Apply → Combine**

1. **`Split:`** Divide data into groups based on one or more keys (columns).
2. **`Apply:`** Perform a function independently on each group (e.g., mean, sum, count).
3. **`Combine:`** Merge the results back into a new DataFrame or Series.
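
To make the three steps concrete, the sketch below performs them by hand and compares the result with the built-in aggregation. It is only an illustration of the idea, not how Pandas implements GroupBy internally; the names `df_demo`, `pieces`, `means`, and `manual_result` are chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance'],
    'Salary': [4000, 5000, 6000, 4500, 7000]
})

# Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs
pieces = {key: sub_df for key, sub_df in df_demo.groupby('Department')}

# Apply: compute the mean salary of each piece independently
means = {key: sub_df['Salary'].mean() for key, sub_df in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual_result = pd.Series(means, name='Salary').sort_index()

# The built-in aggregation does all three steps in one call
builtin_result = df_demo.groupby('Department')['Salary'].mean()

print(manual_result)
print(builtin_result)
```

In practice you would use the one-line form; the manual version is only useful for seeing what each stage contributes.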

In [2]:
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance'],
    'Salary': [4000, 5000, 6000, 4500, 7000],
    'Experience': [2, 3, 4, 5, 7]
})

# Create a groupby object
group = df.groupby('Department')

# View basic aggregations
print(
    f"Mean of each numeric column per department:\n{group.mean()}\n"
    f"\nSum of values per department:\n{group.sum()}\n"
    f"\nTotal number of entries per department:\n{group.size()}\n"
    f"\nCount of non-null values per department:\n{group.count()}"
)

Mean of each numeric column per department:
            Salary  Experience
Department                    
Finance     7000.0         7.0
HR          4250.0         3.5
IT          5500.0         3.5

Sum of values per department:
            Salary  Experience
Department                    
Finance       7000           7
HR            8500           7
IT           11000           7

Total number of entries per department:
Department
Finance    1
HR         2
IT         2
dtype: int64

Count of non-null values per department:
            Salary  Experience
Department                    
Finance          1           1
HR               2           2
IT               2           2


➡️ **Using `agg()` for Multiple Aggregations**

In [4]:
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance'],
    'Salary': [4000, 5000, 6000, 4500, 7000],
    'Experience': [2, 3, 4, 5, 7]
})

# Group by department, then apply multiple functions at once
result = df.groupby('Department').agg({
    'Salary': ['mean', 'max', 'min'],
    'Experience': 'sum'
})
print(result)

            Salary             Experience
              mean   max   min        sum
Department                               
Finance     7000.0  7000  7000          7
HR          4250.0  4500  4000          7
IT          5500.0  6000  5000          7


In [6]:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [10, 20, 30, 40, 50, 60],
    'Value2': [100, 200, 300, 400, 500, 600]
})

# Group the DataFrame by Category, then aggregate selected columns
category_grouping = df.groupby('Category')

print(
    f"Mean of Value1 for each Category:\n{category_grouping['Value1'].mean()}\n"
    f"\nMax of Value2 for each Category:\n{category_grouping['Value2'].max()}\n"
    f"\nSum of Value1 for each Category:\n{category_grouping['Value1'].sum()}"
)

Mean of Value1 for each Category:
Category
A    30.0
B    40.0
Name: Value1, dtype: float64

Max of Value2 for each Category:
Category
A    500
B    600
Name: Value2, dtype: int64

Sum of Value1 for each Category:
Category
A     90
B    120
Name: Value1, dtype: int64


##### **GroupBy Multiple Columns**
➡️ **`Problem:` Calculate the total salary for each Job Level-Department combination.**

In [6]:
import pandas as pd
import numpy as np

# Create sample employee data
np.random.seed(0)
departments = ['Sales', 'Marketing', 'Engineering', 'HR']
job_levels = ['Junior', 'Senior', 'Manager']
employees = 1000

data = {
    'Employee_ID': range(1, employees + 1),
    'Department': np.random.choice(departments, employees),
    'Job_Level': np.random.choice(job_levels, employees),
    'Years_of_Experience': np.random.randint(1, 20, employees),
    'Performance_Score': np.random.uniform(60, 100, employees),
    'Salary': np.random.randint(30000, 150000, employees)
}

df = pd.DataFrame(data)
dept_job_group = df.groupby(['Department', 'Job_Level']) # Basic grouping by multiple columns

# Average of Performance Score
avg_performance = dept_job_group['Performance_Score'].mean()
print(f"Average of Performance Score:\n{avg_performance}\n")

# Total salary for each Job Level-Department combination
total_salary = dept_job_group['Salary'].sum()
print(f"Total salary for each Job Level-Department combination:\n{total_salary}")

Average of Performance Score:
Department   Job_Level
Engineering  Junior       79.150533
             Manager      78.054209
             Senior       80.274742
HR           Junior       79.554394
             Manager      80.751565
             Senior       82.302734
Marketing    Junior       81.300716
             Manager      80.687662
             Senior       80.072105
Sales        Junior       81.292552
             Manager      79.957837
             Senior       81.067381
Name: Performance_Score, dtype: float64

Total salary for each Job Level-Department combination:
Department   Job_Level
Engineering  Junior       6548566
             Manager      7292836
             Senior       8782303
HR           Junior       8357975
             Manager      7101794
             Senior       7082253
Marketing    Junior       8332198
             Manager      7041657
             Senior       7698620
Sales        Junior       7594099
             Manager      8540204
             Senior  

##### **Multiple Aggregation Using a List**
Once data is grouped in Pandas, you can apply **multiple aggregation functions** to it by passing a list of function names to the **`agg()`** method.
This computes several summary statistics, such as mean, minimum, and maximum, for one or more columns in a single operation.

---
➡️ **Notes**
* You can mix different aggregation functions for different columns.
* The `agg()` method accepts **lists, dictionaries, and custom functions** (including lambdas).
* When `agg()` returns two-level column labels, you can **flatten** them with `result.columns = ['_'.join(col) for col in result.columns]`.
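
The sketch below combines two of these notes: it mixes a built-in aggregation with a custom function and then flattens the resulting column labels. The helper `value_range` and the names `df_flat` and `flat_result` are hypothetical, introduced only for this example; a named function is used instead of a lambda so the resulting column label is readable.

```python
import pandas as pd

df_flat = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

def value_range(values):
    """Hypothetical custom aggregation: spread between max and min."""
    return values.max() - values.min()

# Mix a built-in aggregation name with a custom function
flat_result = df_flat.groupby('Category').agg({'Value': ['mean', value_range]})

# Columns are a two-level MultiIndex such as ('Value', 'mean');
# join the levels to get single-level column names
flat_result.columns = ['_'.join(col) for col in flat_result.columns]
print(flat_result)
```

Flattening is mostly a convenience for later steps, such as exporting to CSV or selecting a column by a single name.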

In [7]:
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# Perform multiple aggregations on the 'Value' column
result = df.groupby('Category').agg({
    'Value': ['mean', 'min', 'max']
})
print(f"Multiple aggregations on the 'Value' column:\n{result}")

Multiple aggregations on the 'Value' column:
         Value        
          mean min max
Category              
A         30.0  10  50
B         40.0  20  60


➡️ **`Problem:` Using the student score DataFrame below, calculate the minimum, maximum, and average score for each subject.**

In [8]:
import pandas as pd

df = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science'],
    'Score': [85, 90, 78, 92, 88, 95]
})
sub_group = df.groupby('Subject')  # Group rows by Subject
result = sub_group.agg({'Score': ['min', 'max', 'mean']})  # Min, max, and mean score per subject

print(f"Student Scores After Aggregation:\n{result}")

Student Scores After Aggregation:
        Score               
          min max       mean
Subject                     
Math       78  90  84.333333
Science    88  95  91.666667


##### **Multiple Aggregation using Dictionary**
``` python
# Syntax for multiple aggregations: map each column to a function name or a list of names
df.groupby('column').agg({'col1': 'mean', 'col2': ['sum', 'max']})
```

➡️ **`Task:` Use a dictionary with `agg()` to count the students and find the maximum and minimum score for each subject.**

In [11]:
import pandas as pd

df = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science'],
    'Score': [85, 90, 78, 92, 88, 95]
})

subject_group = df.groupby('Subject')

# Get count of students and max/min of numeric Score.
result = subject_group.agg({
    'Student': 'count',
    'Score': ['max', 'min']
})
print(f"Student Score after Aggregation:\n{result}")

Student Score after Aggregation:
        Student Score    
          count   max min
Subject                  
Math          3    90  78
Science       3    95  88
