# pandas Exercise 3 - Group By and Aggregation

This exercise will help you practice the concepts from pandas lesson 3.

**Topics covered:**
- Basic grouping and aggregation
- Using multiple aggregation functions
- The `.agg()` method
- Grouping by multiple columns
- Understanding `size()` vs `count()`
- Transform vs Aggregate
- Filtering groups
- Custom aggregation functions

**Instructions:**
- Read each question carefully
- Write your code in the provided code cells
- Run your code to check it works
- Compare your results with classmates or check the answer file when available

## Setup

Run this cell to import pandas and create the sample dataset.

In [None]:
import pandas as pd
import numpy as np

# Sales data for a company with multiple stores
sales_data = {
    'store_id': ['S001', 'S001', 'S001', 'S001', 'S002', 'S002', 'S002', 'S002', 
                 'S003', 'S003', 'S003', 'S003', 'S004', 'S004', 'S004', 'S004'],
    'region': ['North', 'North', 'North', 'North', 'North', 'North', 'North', 'North',
               'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South'],
    'product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 
                'Laptop', 'Phone', 'Tablet', 'Headphones',
                'Laptop', 'Phone', 'Tablet', 'Headphones',
                'Laptop', 'Phone', 'Tablet', 'Headphones'],
    'quantity': [15, 45, 30, 60, 20, 50, 25, 55, 18, 40, 28, 50, 12, 35, 22, 45],
    'price': [1200, 800, 500, 150, 1200, 800, 500, 150, 1200, 800, 500, 150, 1200, 800, 500, 150],
    'discount_pct': [10, 5, 8, 0, 12, 8, 10, 5, 8, 5, 6, 0, 15, 10, 12, 8]
}

sales = pd.DataFrame(sales_data)

# Calculate gross revenue (quantity * price; discount_pct is not applied)
sales['revenue'] = sales['quantity'] * sales['price']

print("Sales DataFrame created successfully!")
print(f"Shape: {sales.shape}")
sales.head(10)

## Part 1: Basic Grouping and Aggregation

These exercises will help you practice basic groupby operations.
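As a warm-up, here is a minimal sketch of the split-apply-combine pattern. It uses a hypothetical toy DataFrame (`orders`, with made-up `city` and `items` columns) rather than the `sales` data, so it does not answer any exercise below.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
orders = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen", "Bergen"],
    "items": [2, 4, 1, 3, 5],
})

# Split the rows by city, aggregate the `items` column, combine into a Series
total_items = orders.groupby("city")["items"].sum()
mean_items = orders.groupby("city")["items"].mean()

print(total_items)
print(mean_items)
```

Selecting the column before aggregating (`["items"]`) keeps the result to the one column you care about.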

### Exercise 1.1: Total Revenue by Store

**Task:** Group the sales data by `store_id` and calculate the **total revenue** for each store.

**Hint:** Use `.groupby()` with `.sum()`

In [None]:
# Write your code here


### Exercise 1.2: Average Quantity by Product

**Task:** Group by `product` and find the **average (mean) quantity** sold for each product type.

**Hint:** Use `.groupby()` with `.mean()`

In [None]:
# Write your code here


### Exercise 1.3: Count Sales by Region

**Task:** Group by `region` and **count** how many sales transactions occurred in each region.

**Hint:** Use `.groupby()` with `.count()` or `.size()`

In [None]:
# Write your code here


## Part 2: Multiple Aggregation Functions

Practice using the `.agg()` method to apply multiple aggregation functions at once.

### Exercise 2.1: Multiple Statistics for Revenue

**Task:** Group by `region` and calculate the following statistics for the `revenue` column:
- Total (sum)
- Average (mean)
- Minimum
- Maximum

**Hint:** Use `.agg()` with a list of function names: `['sum', 'mean', 'min', 'max']`
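To illustrate the list form of `.agg()`, the sketch below uses a hypothetical `temps` DataFrame (the `station` and `temp_c` names are made up for this example) and a different set of functions from the ones the task asks for.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
temps = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B"],
    "temp_c": [10.0, 14.0, 12.0, 20.0, 22.0],
})

# A list of function names gives one output column per function
stats = temps.groupby("station")["temp_c"].agg(["min", "median", "max"])
print(stats)
```

Each name in the list becomes a column in the result, in the order given.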

In [None]:
# Write your code here


### Exercise 2.2: Different Aggregations for Different Columns

**Task:** Group by `store_id` and calculate:
- For `quantity`: sum and mean
- For `revenue`: sum and max
- For `discount_pct`: mean only

**Hint:** Use `.agg()` with a dictionary: `{'column_name': ['function1', 'function2']}`
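The dictionary form produces two-level column labels, which can be surprising at first. The sketch below shows this on a hypothetical `trips` DataFrame (all names are invented for illustration) and one possible way to flatten the labels afterwards.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
trips = pd.DataFrame({
    "driver": ["ann", "ann", "bob", "bob"],
    "km": [12.0, 8.0, 30.0, 10.0],
    "fare": [20.0, 15.0, 45.0, 18.0],
})

# Keys are column names, values are lists of functions for that column
summary = trips.groupby("driver").agg({"km": ["sum", "mean"], "fare": ["max"]})
print(summary.columns.tolist())  # pairs such as ('km', 'sum')

# Optional: join the two levels into single labels like 'km_sum'
flat = summary.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat)
```

Flattening is a matter of taste; the two-level labels are fine if you only need to look at the result.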

In [None]:
# Write your code here


## Part 3: Grouping by Multiple Columns

### Exercise 3.1: Revenue by Region and Product

**Task:** Group by both `region` AND `product`, then calculate the **total revenue** for each combination.

**Hint:** Pass a list of column names to `.groupby()`: `.groupby(['col1', 'col2'])`

**Expected result:** You should see revenue broken down by each region-product combination (e.g., North-Laptop, North-Phone, South-Laptop, etc.)
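Grouping by two columns returns a result indexed by both keys. The sketch below, on a hypothetical `visits` DataFrame with invented column names, shows what that index looks like and two common ways to reshape it.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
visits = pd.DataFrame({
    "clinic": ["east", "east", "east", "west", "west"],
    "weekday": ["Mon", "Mon", "Tue", "Mon", "Tue"],
    "patients": [5, 7, 4, 9, 6],
})

# The result is a Series with a two-level index (clinic, weekday)
by_pair = visits.groupby(["clinic", "weekday"])["patients"].sum()

# Look up one combination with a tuple
east_mon = by_pair.loc[("east", "Mon")]

# Pivot the inner level into columns, or turn the index back into columns
wide = by_pair.unstack("weekday")
flat = by_pair.reset_index()
print(wide)
print(flat)
```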

In [None]:
# Write your code here


### Exercise 3.2: Average Discount by Region and Product

**Task:** Group by `region` and `product`, then calculate the **average discount percentage** for each combination.

**Hint:** Round your results to 2 decimal places using `.round(2)`.

In [None]:
# Write your code here


## Part 4: Understanding size() vs count()

### Exercise 4.1: Compare size() and count()

**Task:** 
1. Group by `region` and use `.size()` to count rows per region
2. Group by `region` and use `.count()` to count non-null values per region
3. Compare the results - are they different? Why or why not?

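**Hint:** Since this dataset has no missing values, `size()` and `count()` should report the same numbers. Look at the shape of each result, though: `size()` returns one value per group, while `count()` returns one value per group for every remaining column.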
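The two methods only disagree when values are missing. The sketch below makes that visible with a hypothetical `readings` DataFrame (invented names) that deliberately contains `NaN` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values, unrelated to `sales`
readings = pd.DataFrame({
    "sensor": ["a", "a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, 4.0, np.nan],
})

# size() counts rows in each group, missing or not (returns a Series)
rows_per_sensor = readings.groupby("sensor").size()

# count() counts non-null values per column (returns a DataFrame)
non_null = readings.groupby("sensor").count()

print(rows_per_sensor)
print(non_null)
```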

In [None]:
# Write your code here
# 1. Using size()


# 2. Using count()



## Part 5: Transform vs Aggregate

### Exercise 5.1: Add Region Average to Each Row

**Task:** Create a new column called `region_avg_revenue` that contains the average revenue for each region. **Every row** should carry the average of its own region.

**Hint:** Use `.transform('mean')` instead of `.mean()` to broadcast the group average to all rows.

**Example output:**
- All North region rows should have the same `region_avg_revenue` value
- All South region rows should have the same `region_avg_revenue` value
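The key difference is the shape of the result. The sketch below compares the two on a hypothetical `scores` DataFrame (invented names): aggregating returns one value per group, while transforming returns one value per original row.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
scores = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "points": [10, 20, 3, 6, 9],
})

# Aggregate: one row per team (length 2)
per_team = scores.groupby("team")["points"].mean()

# Transform: same length and index as `scores` (length 5)
per_row = scores.groupby("team")["points"].transform("mean")

# Because the index lines up, the result can be assigned as a new column
scores["team_mean"] = per_row
print(scores)
```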

In [None]:
# Write your code here


# Display a few rows to verify
sales[['store_id', 'region', 'revenue', 'region_avg_revenue']].head(10)

### Exercise 5.2: Calculate Performance vs Store Average

**Task:** 
1. Add a new column `store_avg_quantity` that shows the average quantity for each store
2. Add another column `diff_from_store_avg` that shows how much each sale differs from the store's average

**Hint:** 
- Use `.transform('mean')` for the first column
- For the second column: `sales['quantity'] - sales['store_avg_quantity']`

In [None]:
# Write your code here


# Display results
sales[['store_id', 'product', 'quantity', 'store_avg_quantity', 'diff_from_store_avg']].head(8)

## Part 6: Filtering Groups

### Exercise 6.1: Filter High-Performing Stores

**Task:** Filter the data to keep only stores where the **total revenue exceeds 70,000**.

**Hint:** Use `.groupby().filter()` with a lambda function that checks if the sum of revenue is greater than 70,000.

**Expected behavior:** This should remove entire stores (all rows for that store_id) if they don't meet the criteria.
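Here is a minimal sketch of group filtering on a hypothetical `donations` DataFrame (the names and the 100 threshold are invented for illustration). The lambda receives each group as a DataFrame and must return a single `True` or `False`.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
donations = pd.DataFrame({
    "donor": ["kim", "kim", "lee", "lee", "max"],
    "amount": [50, 70, 10, 20, 200],
})

# Keep every row of a donor whose amounts add up to more than 100
big_donors = donations.groupby("donor").filter(lambda g: g["amount"].sum() > 100)
print(big_donors)
```

The result keeps the original rows and columns; it is not aggregated.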

In [None]:
# Write your code here


### Exercise 6.2: Filter Products with High Average Quantity

**Task:** Keep only products where the **average quantity sold across all stores is 40 or more**.

**Hint:** Group by `product` and filter based on the mean of quantity.

In [None]:
# Write your code here


## Part 7: Custom Aggregation Functions

### Exercise 7.1: Calculate Range

**Task:** Create a custom function that calculates the range (max - min) and use it to find the range of `quantity` for each `store_id`.

**Steps:**
1. Define a function called `calc_range` that takes a series and returns `max() - min()`
2. Use `.groupby()` and `.agg()` to apply this function to the quantity column
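The same pattern works for any function that takes a Series and returns a single value. The sketch below uses a different, hypothetical function (`top_share`) on an invented `budget` DataFrame, so `calc_range` is still yours to write.

```python
import pandas as pd

def top_share(values):
    """Fraction of the group's total contributed by its largest value."""
    return values.max() / values.sum()

# Hypothetical toy data, unrelated to the `sales` DataFrame
budget = pd.DataFrame({
    "dept": ["ops", "ops", "ops", "hr", "hr"],
    "cost": [50, 30, 20, 40, 60],
})

# Pass the function object itself (no parentheses, no quotes) to .agg()
shares = budget.groupby("dept")["cost"].agg(top_share)
print(shares)
```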

In [None]:
# Write your code here
def calc_range(x):
    # Complete this function
    pass

# Use the function with groupby


## Part 8: Challenge Exercises

These exercises combine multiple concepts. Try to solve them without looking at hints first!

### Challenge 8.1: Top Product per Region

**Task:** For each region, find which product generated the **highest total revenue**.

**Steps (one approach):**
1. Group by `region` and `product`, sum the revenue
2. For each region group, find the product with max revenue
3. Use `.idxmax()` or sort and select the top row

**Expected output:** Two rows showing the top product for North and South regions
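If you get stuck on the `.idxmax()` step, the sketch below shows one way it can be combined with `.loc` on a hypothetical `goals` DataFrame (invented names). It assumes the summed totals are kept as ordinary columns via `as_index=False`; other approaches work too.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
goals = pd.DataFrame({
    "league": ["L1", "L1", "L1", "L2", "L2"],
    "player": ["ada", "bo", "ada", "cy", "di"],
    "goals": [3, 4, 2, 1, 5],
})

# Step 1: total per (league, player), kept as regular columns
totals = goals.groupby(["league", "player"], as_index=False)["goals"].sum()

# Step 2: idxmax() gives the row label of the largest total in each league
best_labels = totals.groupby("league")["goals"].idxmax()

# Step 3: use those labels to pull out the full rows
top_rows = totals.loc[best_labels]
print(top_rows)
```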

In [None]:
# Write your code here


### Challenge 8.2: Performance Ranking

**Task:** Create a column called `store_rank` that ranks stores within each region by total revenue (1 = highest revenue).

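**Hint:** You'll need to:
1. Calculate total revenue per store and attach it to every row, for example with `.transform('sum')`
2. Group by `region` and call `.rank()` on that total with `ascending=False`
3. Pass `method='dense'` so the repeated rows of one store share a rank and the next store gets the following whole number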
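Ranking inside groups has one common pitfall: when several rows share the same value, the default `method='average'` returns fractional ranks. The sketch below shows a dense ranking on a hypothetical `runs` DataFrame (invented names), assuming each runner belongs to a single club.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
runs = pd.DataFrame({
    "club": ["c1", "c1", "c1", "c2", "c2"],
    "runner": ["p", "p", "q", "s", "t"],
    "km": [20, 30, 30, 5, 8],
})

# Broadcast each runner's total distance to all of that runner's rows
runs["runner_total"] = runs.groupby("runner")["km"].transform("sum")

# Rank the totals within each club; 'dense' gives tied rows the same rank
# and leaves no gaps between ranks
runs["club_rank"] = (
    runs.groupby("club")["runner_total"]
    .rank(method="dense", ascending=False)
    .astype(int)
)
print(runs)
```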

In [None]:
# Write your code here


### Challenge 8.3: Summary Report

**Task:** Create a comprehensive summary report that shows for each region:
- Total revenue
- Number of stores
- Average revenue per store
- Total quantity sold
- Average discount percentage

**Hint:** You'll need to use `.agg()` with a dictionary and potentially some creative aggregations.
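One alternative to the dictionary form is pandas' named aggregation, where each output column is written as `name=(column, function)`. The sketch below uses it on a hypothetical `books` DataFrame (invented names) and then derives a ratio column from the aggregated result, since a per-group ratio cannot be expressed as a single column aggregation.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the `sales` DataFrame
books = pd.DataFrame({
    "genre": ["scifi", "scifi", "scifi", "crime", "crime"],
    "author": ["a1", "a1", "a2", "a3", "a4"],
    "pages": [300, 200, 400, 250, 350],
})

# Named aggregation: output_name=(input_column, function)
report = books.groupby("genre").agg(
    total_pages=("pages", "sum"),
    n_authors=("author", "nunique"),  # counts distinct authors, not rows
)

# Derived column computed after aggregating
report["pages_per_author"] = report["total_pages"] / report["n_authors"]
print(report)
```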

In [None]:
# Write your code here


## Bonus: Reflection Questions

Answer these questions based on your work above:

1. **When would you use `.transform()` instead of `.agg()`?**
   - Your answer: 

2. **What's the difference between `.size()` and `.count()`?**
   - Your answer: 

3. **Give an example of when you might need to group by multiple columns in a real-world scenario.**
   - Your answer: 

4. **What advantage does using a custom aggregation function give you?**
   - Your answer: 

## Well Done!

You've completed the pandas groupby and aggregation exercises. 

**Next steps:**
- Review any exercises you found challenging
- Compare your solutions with classmates
- Try applying these techniques to your own datasets
- Check out the answer file (if available) to see alternative approaches