# GroupBy and Aggregating in pandas
This notebook demonstrates how to use groupby and aggregation functions in pandas.

In [93]:
import pandas as pd

## Load the Data
Read the CSV file into a DataFrame and display it.

In [94]:
df = pd.read_csv(r'E:\PortfolioProjects\Python Project\learning pandas\Flavors.csv')

df

,Flavor,Base Flavor,Liked,Flavor Rating,Texture Rating,Total Rating
0,Mint Chocolate Chip,Vanilla,Yes,10.0,8.0,18.0
1,Chocolate,Chocolate,Yes,8.8,7.6,16.6
2,Vanilla,Vanilla,No,4.7,5.0,9.7
3,Cookie Dough,Vanilla,Yes,6.9,6.5,13.4
4,Rocky Road,Chocolate,Yes,8.2,7.0,15.2
5,Pistachio,Vanilla,No,2.3,3.4,5.7
6,Cake Batter,Vanilla,Yes,6.5,6.0,12.5
7,Neapolitan,Vanilla,No,3.8,5.0,8.8
8,Chocolte Fudge Brownie,Chocolate,Yes,8.2,7.1,15.3


## Grouping the Data

The `groupby()` operation is the foundation of aggregation in pandas. Calling `df.groupby('column')` returns a **GroupBy object**, which drives the split-apply-combine pattern:

1. **Split** the data into groups based on the unique values in the specified column(s)
2. **Apply** an aggregation function to each group
3. **Combine** the results into a new DataFrame
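
As a rough illustration of the three steps, the sketch below performs them by hand and compares the result with the built-in aggregation. The small DataFrame is rebuilt from two columns of the table displayed earlier so that the snippet runs on its own; the variable names `manual` and `builtin` are only for this example.

```python
import pandas as pd

# Two columns rebuilt from the table shown above, so the snippet is self-contained
df = pd.DataFrame({
    "Base Flavor": ["Vanilla", "Chocolate", "Vanilla", "Vanilla", "Chocolate",
                    "Vanilla", "Vanilla", "Vanilla", "Chocolate"],
    "Flavor Rating": [10.0, 8.8, 4.7, 6.9, 8.2, 2.3, 6.5, 3.8, 8.2],
})

# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
# Apply: reduce each sub-DataFrame to a single number
manual = {key: group["Flavor Rating"].mean()
          for key, group in df.groupby("Base Flavor")}

# Combine: collect the per-group results into one Series
manual = pd.Series(manual, name="Flavor Rating")

# The built-in aggregation performs the same three steps in one call
builtin = df.groupby("Base Flavor")["Flavor Rating"].mean()

print(manual)
print(builtin)
```

In practice the one-line form is preferable; the loop is only meant to make the pattern visible.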

### What happens when you group:
```python
groupby_frame = df.groupby('Base Flavor')
```

This creates one group for each unique value in 'Base Flavor' (here, 'Chocolate' and 'Vanilla'). The GroupBy object stores information about these groups but doesn't perform any calculations yet.

### Key properties of GroupBy objects:
- **Lazy evaluation**: No aggregation is computed until you call an aggregation method
- **Memory efficient**: The object records which rows belong to which group rather than building a separate DataFrame per group up front
- **Flexible**: Can be used with any aggregation function (mean, sum, count, custom functions, etc.)
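
To see what a GroupBy object holds before any aggregation runs, it can be inspected directly. This is a minimal sketch on a DataFrame rebuilt from the table above; `ngroups`, `groups`, and `get_group()` are standard pandas attributes, but they are not used elsewhere in this notebook.

```python
import pandas as pd

# Rebuilt from the table shown above, so the snippet is self-contained
df = pd.DataFrame({
    "Flavor": ["Mint Chocolate Chip", "Chocolate", "Vanilla", "Cookie Dough",
               "Rocky Road", "Pistachio", "Cake Batter", "Neapolitan",
               "Chocolte Fudge Brownie"],
    "Base Flavor": ["Vanilla", "Chocolate", "Vanilla", "Vanilla", "Chocolate",
                    "Vanilla", "Vanilla", "Vanilla", "Chocolate"],
    "Flavor Rating": [10.0, 8.8, 4.7, 6.9, 8.2, 2.3, 6.5, 3.8, 8.2],
})

grouped = df.groupby("Base Flavor")

print(type(grouped).__name__)          # the GroupBy object, not a result table
print(grouped.ngroups)                 # number of distinct groups
print(grouped.groups)                  # mapping of group key -> row labels
print(grouped.get_group("Chocolate"))  # the rows that belong to one group
```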

### Next steps:
Apply aggregation functions like `.mean()`, `.sum()`, `.count()`, or `.agg()` to perform calculations on each group.

In [95]:
groupby_frame = df.groupby('Base Flavor')

## Aggregating with Mean
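When you compute the mean with `groupby`, the handling of non-numeric columns depends on the pandas version. Before pandas 2.0, non-numeric columns were silently dropped (with a deprecation warning in later 1.x releases). From pandas 2.0 onward, `numeric_only` defaults to `False`, so `groupby_frame.mean()` raises a `TypeError` when a column such as `Flavor` contains strings.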

Use the `numeric_only=True` parameter with reductions like `mean`, `sum`, `min`, and `max` to explicitly restrict aggregation to numeric columns and avoid errors or warnings when non-numeric columns exist:

- `groupby_frame.mean(numeric_only=True)`  # mean of numeric columns

If you want more control, select numeric columns first (for example with `select_dtypes`) or use `agg`/named aggregation to target specific columns.
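
Both alternatives can be sketched as follows. The DataFrame is rebuilt from the table above, and the output column names `avg_flavor` and `avg_texture` are made up for this example.

```python
import pandas as pd

# Rebuilt from the table shown above, so the snippet is self-contained
df = pd.DataFrame({
    "Flavor": ["Mint Chocolate Chip", "Chocolate", "Vanilla", "Cookie Dough",
               "Rocky Road", "Pistachio", "Cake Batter", "Neapolitan",
               "Chocolte Fudge Brownie"],
    "Base Flavor": ["Vanilla", "Chocolate", "Vanilla", "Vanilla", "Chocolate",
                    "Vanilla", "Vanilla", "Vanilla", "Chocolate"],
    "Flavor Rating": [10.0, 8.8, 4.7, 6.9, 8.2, 2.3, 6.5, 3.8, 8.2],
    "Texture Rating": [8.0, 7.6, 5.0, 6.5, 7.0, 3.4, 6.0, 5.0, 7.1],
})

# Option 1: pick the numeric columns up front, then aggregate only those
numeric_cols = df.select_dtypes(include="number").columns
means = df.groupby("Base Flavor")[list(numeric_cols)].mean()

# Option 2: named aggregation, one output column per (input column, function) pair
summary = df.groupby("Base Flavor").agg(
    avg_flavor=("Flavor Rating", "mean"),
    avg_texture=("Texture Rating", "mean"),
)

print(means)
print(summary)
```

Named aggregation also gives flat, descriptive column names, which avoids the two-level column headers that a dictionary of lists produces.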

In [96]:
groupby_frame.mean(numeric_only=True)

Base Flavor,Flavor Rating,Texture Rating,Total Rating
Chocolate,8.4,7.233333,15.7
Vanilla,5.7,5.65,11.35


In [97]:
groupby_frame.count()

Base Flavor,Flavor,Liked,Flavor Rating,Texture Rating,Total Rating
Chocolate,3,3,3,3,3
Vanilla,6,6,6,6,6


## Aggregating with Count, Min, Max, and Sum

### Understanding the differences:

- **`count()`**: Counts non-null values for each column within each group. Works with all data types and does NOT accept `numeric_only`. Use when you want to see how many valid (non-missing) values exist in each group.

- **`min()`, `max()`, `sum()`**: Accept `numeric_only=True` to restrict the result to numeric columns. Without it, pandas 2.0 and later also process non-numeric columns: on plain string (object) columns, `min()` and `max()` compare values alphabetically and `sum()` concatenates them, which is rarely what you want.

### Key differences in behavior:
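- `count()` includes ALL columns and has no `numeric_only` option
- `min()`, `max()`, `sum()` default to `numeric_only=False` in pandas 2.0 and later, so they include non-numeric columns unless told otherwise
- Pass `numeric_only=True` explicitly for clarity and for consistent results across pandas versions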

### Practical examples:
```python
# Count non-null values (works on all columns)
groupby_frame.count()

# Min/Max/Sum on numeric columns only
groupby_frame.min(numeric_only=True)
groupby_frame.max(numeric_only=True)
groupby_frame.sum(numeric_only=True)
```

### When to use each:
- **count()**: Understanding data completeness and sample sizes per group
- **min/max()**: Finding extreme values and ranges within groups
- **sum()**: Total values (useful for quantities, amounts, counts)
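
The flavors table has no missing values, so the "non-null" part of `count()` is not visible there. The sketch below uses a tiny hypothetical DataFrame (not the notebook's data) with one missing rating, and contrasts `count()` with `size()`, a standard pandas method not used elsewhere in this notebook that counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing rating, for illustration only
ratings = pd.DataFrame({
    "Base Flavor": ["Vanilla", "Vanilla", "Chocolate"],
    "Flavor Rating": [6.5, np.nan, 8.2],
})

grouped = ratings.groupby("Base Flavor")

non_null = grouped["Flavor Rating"].count()  # skips the NaN in the Vanilla group
rows = grouped.size()                        # counts every row in each group

print(non_null)
print(rows)
```

A gap between the two numbers for a group is a quick sign of missing data in that column.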

In [98]:
groupby_frame.min(numeric_only=True)

Base Flavor,Flavor Rating,Texture Rating,Total Rating
Chocolate,8.2,7.0,15.2
Vanilla,2.3,3.4,5.7


In [99]:
groupby_frame.max(numeric_only=True)

Base Flavor,Flavor Rating,Texture Rating,Total Rating
Chocolate,8.8,7.6,16.6
Vanilla,10.0,8.0,18.0


In [100]:
groupby_frame.sum(numeric_only=True)

Base Flavor,Flavor Rating,Texture Rating,Total Rating
Chocolate,25.2,21.7,47.1
Vanilla,34.2,33.9,68.1


## Advanced Aggregation with agg()

The `agg()` method (short for "aggregate") provides powerful flexibility for applying multiple aggregation functions to different columns. You can:

- Apply multiple functions to the same column
- Apply different functions to different columns
- Use custom functions
- Create hierarchical column structures

### Multiple aggregations per column:
```python
df.groupby('column').agg({'col1': ['mean', 'sum'], 'col2': ['min', 'max']})
```

### Different aggregations for different columns:
```python
df.groupby('column').agg({'col1': 'mean', 'col2': 'sum', 'col3': 'count'})
```
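
The "custom functions" point can be illustrated with a small helper passed to `agg()` alongside built-in function names. This is only a sketch: `rating_range` is a hypothetical helper written for this example, and the DataFrame is rebuilt from the table above.

```python
import pandas as pd

# Rebuilt from the table shown above, so the snippet is self-contained
df = pd.DataFrame({
    "Base Flavor": ["Vanilla", "Chocolate", "Vanilla", "Vanilla", "Chocolate",
                    "Vanilla", "Vanilla", "Vanilla", "Chocolate"],
    "Flavor Rating": [10.0, 8.8, 4.7, 6.9, 8.2, 2.3, 6.5, 3.8, 8.2],
    "Texture Rating": [8.0, 7.6, 5.0, 6.5, 7.0, 3.4, 6.0, 5.0, 7.1],
})


def rating_range(series):
    """Spread between the highest and lowest value in a group."""
    return series.max() - series.min()


# A list of functions for one column, a single function for another
result = df.groupby("Base Flavor").agg(
    {"Flavor Rating": ["mean", rating_range], "Texture Rating": "max"}
)

print(result)
```

The custom function receives each group's column as a Series, and its function name becomes the second-level column label.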

In [101]:
# Apply the same list of statistics to both rating columns
stats = ['mean', 'min', 'max', 'count', 'sum']
df.groupby('Base Flavor').agg({'Flavor Rating': stats, 'Texture Rating': stats})

,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating,Texture Rating,Texture Rating,Texture Rating,Texture Rating,Texture Rating
Base Flavor,mean,min,max,count,sum,mean,min,max,count,sum
Chocolate,8.4,8.2,8.8,3,25.2,7.233333,7.0,7.6,3,21.7
Vanilla,5.7,2.3,10.0,6,34.2,5.65,3.4,8.0,6,33.9


In [102]:
df

,Flavor,Base Flavor,Liked,Flavor Rating,Texture Rating,Total Rating
0,Mint Chocolate Chip,Vanilla,Yes,10.0,8.0,18.0
1,Chocolate,Chocolate,Yes,8.8,7.6,16.6
2,Vanilla,Vanilla,No,4.7,5.0,9.7
3,Cookie Dough,Vanilla,Yes,6.9,6.5,13.4
4,Rocky Road,Chocolate,Yes,8.2,7.0,15.2
5,Pistachio,Vanilla,No,2.3,3.4,5.7
6,Cake Batter,Vanilla,Yes,6.5,6.0,12.5
7,Neapolitan,Vanilla,No,3.8,5.0,8.8
8,Chocolte Fudge Brownie,Chocolate,Yes,8.2,7.1,15.3


## Multi-Level Grouping

You can group by multiple columns to create hierarchical groupings. This creates a MultiIndex result where you can analyze data across multiple dimensions.

### Basic multi-level groupby:
```python
df.groupby(['column1', 'column2']).mean(numeric_only=True)
```

### Important notes:
- Restrict aggregations like `mean()` and `sum()` to numeric columns, either by selecting them first or by passing `numeric_only=True`
- The result will have a MultiIndex with the grouping columns as index levels
- Use `reset_index()` if you want the grouping columns back as regular columns
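
A short sketch of these notes, using a DataFrame rebuilt from the table above: the grouped result carries a two-level index, individual combinations can be looked up with a tuple, and `reset_index()` flattens the keys back into columns. The variable names are only for this example.

```python
import pandas as pd

# Rebuilt from the table shown above, so the snippet is self-contained
df = pd.DataFrame({
    "Base Flavor": ["Vanilla", "Chocolate", "Vanilla", "Vanilla", "Chocolate",
                    "Vanilla", "Vanilla", "Vanilla", "Chocolate"],
    "Liked": ["Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Flavor Rating": [10.0, 8.8, 4.7, 6.9, 8.2, 2.3, 6.5, 3.8, 8.2],
})

# Series indexed by a two-level MultiIndex (Base Flavor, Liked)
result = df.groupby(["Base Flavor", "Liked"])["Flavor Rating"].mean()

# Look up one combination by passing both keys as a tuple
vanilla_yes = result.loc[("Vanilla", "Yes")]

# Move both grouping keys from the index back into regular columns
flat = result.reset_index()

print(result)
print(flat)
```

Only combinations that occur in the data appear in the result, which is why there is no row for Chocolate flavors that were not liked.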

### Example analysis:
Grouping by both 'Base Flavor' and 'Liked' status gives insights into:
- How different base flavors perform overall
- How the "liked" vs "not liked" ratings differ within each flavor category
- Which combinations have the highest/lowest ratings

In [106]:
grouped = df.groupby(['Base Flavor', 'Liked'])
# The dict passed to agg() already selects the column to aggregate
result = grouped.agg({'Flavor Rating': ['mean', 'min', 'max', 'count', 'sum']})
result

,,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating
Base Flavor,Liked,mean,min,max,count,sum
Chocolate,Yes,8.4,8.2,8.8,3,25.2
Vanilla,No,3.6,2.3,4.7,3,10.8
Vanilla,Yes,7.8,6.5,10.0,3,23.4


## Statistical Summary with describe()

The `describe()` method provides a statistical summary of your grouped data. For each numeric column within each group, it calculates:

- **count**: Number of non-null values
- **mean**: Average value
- **std**: Standard deviation
- **min**: Minimum value
- **25%**: First quartile (25th percentile)
- **50%**: Median (50th percentile)
- **75%**: Third quartile (75th percentile)
- **max**: Maximum value

This is extremely useful for getting a quick overview of the distribution and spread of your data within each group.

In [107]:
df.groupby('Base Flavor').describe()

,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating,Flavor Rating,Texture Rating,Texture Rating,Texture Rating,Texture Rating,Texture Rating,Total Rating,Total Rating,Total Rating,Total Rating,Total Rating,Total Rating,Total Rating,Total Rating
Base Flavor,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Chocolate,3.0,8.4,0.34641,8.2,8.2,8.2,8.5,8.8,3.0,7.233333,...,7.35,7.6,3.0,15.7,0.781025,15.2,15.25,15.3,15.95,16.6
Vanilla,6.0,5.7,2.710719,2.3,4.025,5.6,6.8,10.0,6.0,5.65,...,6.375,8.0,6.0,11.35,4.263684,5.7,9.025,11.1,13.175,18.0
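
Because the full `describe()` output is wide enough that the display elides some columns (the `...` above), it can help to look at one rating column at a time. The sketch below shows two ways to do that on a DataFrame rebuilt from the table above; the variable names are only for this example.

```python
import pandas as pd

# Rebuilt from the table shown above, so the snippet is self-contained
df = pd.DataFrame({
    "Base Flavor": ["Vanilla", "Chocolate", "Vanilla", "Vanilla", "Chocolate",
                    "Vanilla", "Vanilla", "Vanilla", "Chocolate"],
    "Flavor Rating": [10.0, 8.8, 4.7, 6.9, 8.2, 2.3, 6.5, 3.8, 8.2],
    "Texture Rating": [8.0, 7.6, 5.0, 6.5, 7.0, 3.4, 6.0, 5.0, 7.1],
})

# Select the column before describing: the result has flat column names
stats = df.groupby("Base Flavor")["Flavor Rating"].describe()
spread = stats[["mean", "std", "min", "max"]]

# Or describe everything, then pull out one top-level column block
full = df.groupby("Base Flavor").describe()
texture_stats = full["Texture Rating"]

print(spread)
print(texture_stats)
```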


## GroupBy Operations Summary

### Key Concepts:
1. **GroupBy Object**: Created with `df.groupby('column')` - doesn't perform calculations yet
2. **Aggregation Functions**: Apply functions to each group (mean, sum, count, min, max)
3. **numeric_only Parameter**: Use with `mean()`, `sum()`, `min()`, `max()` to avoid errors with non-numeric data
4. **Multi-level Grouping**: Group by multiple columns for hierarchical analysis
5. **agg() Method**: Flexible aggregation with custom functions and multiple operations per column

### Common Patterns:
```python
# Basic aggregation
df.groupby('column').mean(numeric_only=True)

# Multiple aggregations
df.groupby('column').agg({'col1': ['mean', 'sum'], 'col2': 'count'})

# Multi-level grouping
df.groupby(['col1', 'col2']).mean(numeric_only=True)

# Statistical summary
df.groupby('column').describe()
```

### Best Practices:
- Check column dtypes (for example with `df.dtypes`) before aggregating a DataFrame with mixed data types
- Use `numeric_only=True` for numeric aggregations when the DataFrame contains string columns
- Select specific columns when you only need certain aggregations
- Use `reset_index()` to move the grouping keys from the index back into regular columns
- Consider `as_index=False` in groupby to keep grouping columns as regular columns
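
The last two points lead to the same shape of result, as the following sketch suggests. It uses a DataFrame rebuilt from the table above, and the variable names `flat` and `same` are only for this example.

```python
import pandas as pd

# Rebuilt from the table shown above, so the snippet is self-contained
df = pd.DataFrame({
    "Base Flavor": ["Vanilla", "Chocolate", "Vanilla", "Vanilla", "Chocolate",
                    "Vanilla", "Vanilla", "Vanilla", "Chocolate"],
    "Flavor Rating": [10.0, 8.8, 4.7, 6.9, 8.2, 2.3, 6.5, 3.8, 8.2],
})

# Keep the grouping key as a regular column from the start
flat = df.groupby("Base Flavor", as_index=False)["Flavor Rating"].mean()

# Aggregate first, then move the key out of the index
same = df.groupby("Base Flavor")["Flavor Rating"].mean().reset_index()

print(flat)
print(same)
```

`as_index=False` saves a step when the result is going straight into a merge, a plot, or a CSV export.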

### When to Use GroupBy:
- Analyzing data by categories (sales by region, ratings by product type)
- Comparing statistics across different groups
- Finding patterns and trends within subgroups
- Data summarization and reporting