# GroupBy
- Used for aggregation and summarization.
- Lets you analyze patterns by category, group, or segment.
- Typical uses: mean/median by category, count analysis, custom aggregation, pivot table creation.
- If the grouping column has many repeated values, converting it with `.astype('category')` before `groupby` can reduce memory use and often speeds up grouping.
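
As a minimal sketch of the last point, the snippet below converts a grouping column to the `category` dtype before grouping. The small DataFrame is illustrative only, and any speed or memory gain depends on the data size and the number of distinct values. `observed=True` is passed explicitly because the default for categorical groupers differs between pandas versions.

```python
import pandas as pd

# Illustrative data: a few labels repeated across many rows is the typical case.
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 55000, 62000, 52000, 57000, 63000, 51000],
})

# Store the repeated labels as a categorical: each distinct value is kept once
# and the rows hold small integer codes.
df['Department'] = df['Department'].astype('category')

# observed=True limits the result to categories that actually occur in the data.
dept_mean = df.groupby('Department', observed=True)['Salary'].mean()
print(dept_mean)
```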

GroupBy follows the split-apply-combine strategy:
- Split the data into groups based on one or more columns.
- Apply a function (a built-in such as `mean`, or a custom one) to each group.
- Combine the results into a new DataFrame or Series.
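
The three steps can be spelled out by hand to see what `groupby` does in one call. This is only a sketch for intuition, not how pandas implements it internally; the names `pieces` and `manual` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 55000, 62000, 52000, 57000, 63000, 51000],
})

pieces = {}
# Split: iterating a GroupBy object yields (group key, sub-DataFrame) pairs.
for dept, group in df.groupby('Department'):
    # Apply: reduce each group's Salary column to a single number.
    pieces[dept] = group['Salary'].mean()

# Combine: collect the per-group results into one Series.
manual = pd.Series(pieces, name='Salary')
manual.index.name = 'Department'

# The one-line equivalent.
builtin = df.groupby('Department')['Salary'].mean()

print(manual)
print(builtin)
```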


In [1]:
import pandas as pd

data = {
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Employee': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Salary': [50000, 60000, 55000, 62000, 52000, 57000, 63000, 51000],
    'Experience': [2, 5, 7, 3, 4, 6, 8, 1]
}

df = pd.DataFrame(data)
df

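|   | Department | Employee | Salary | Experience |
| --- | --- | --- | --- | --- |
| 0 | HR | A | 50000 | 2 |
| 1 | IT | B | 60000 | 5 |
| 2 | Finance | C | 55000 | 7 |
| 3 | IT | D | 62000 | 3 |
| 4 | HR | E | 52000 | 4 |
| 5 | Finance | F | 57000 | 6 |
| 6 | IT | G | 63000 | 8 |
| 7 | HR | H | 51000 | 1 |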


### Basic groupby with aggregation

In [None]:
df.groupby('Department')['Salary'].mean()  # group rows by each unique Department, then take the mean Salary within each group

Department
Finance    56000.000000
HR         51000.000000
IT         61666.666667
Name: Salary, dtype: float64

In [4]:
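df.groupby('Department')['Employee'].count()  # non-null Employee values per group (not displayed: a cell shows only its last expression)
df.groupby('Department').size()               # rows per group, including rows with missing values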

Department
Finance    2
HR         3
IT         3
dtype: int64
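
On the data above `count()` and `size()` give the same numbers because nothing is missing. The sketch below uses a small hypothetical DataFrame with one missing `Employee` value to show where they differ.

```python
import pandas as pd

# Hypothetical data: one employee name is missing in the IT department.
df_missing = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR'],
    'Employee': ['A', 'B', None, 'E'],
})

non_null = df_missing.groupby('Department')['Employee'].count()  # skips missing values
rows = df_missing.groupby('Department').size()                   # counts every row

print(pd.DataFrame({'count': non_null, 'size': rows}))
```

Use `count()` when the question is "how many values are present", and `size()` when it is "how many rows are in each group".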

### agg() for multiple aggregations

In [6]:
df.groupby('Department').agg({
    'Salary':['mean', 'max', 'min'],
    'Experience': 'median'
})

| Department | Salary (mean) | Salary (max) | Salary (min) | Experience (median) |
| --- | --- | --- | --- | --- |
| Finance | 56000.0 | 57000 | 55000 | 6.5 |
| HR | 51000.0 | 52000 | 50000 | 2.0 |
| IT | 61666.666667 | 63000 | 60000 | 5.0 |
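
The dictionary form above produces two-level column labels such as `('Salary', 'mean')`. If flat column names are easier to work with, named aggregation is one alternative. The sketch below computes the same statistics; the output column names (`salary_mean` and so on) are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 55000, 62000, 52000, 57000, 63000, 51000],
    'Experience': [2, 5, 7, 3, 4, 6, 8, 1],
})

# Named aggregation: output_column=(input_column, aggregation function).
summary = df.groupby('Department').agg(
    salary_mean=('Salary', 'mean'),
    salary_max=('Salary', 'max'),
    salary_min=('Salary', 'min'),
    experience_median=('Experience', 'median'),
)
print(summary)
```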


### Group by multiple columns

In [7]:
df.groupby(['Department', 'Experience'])['Salary'].mean()

Department  Experience
Finance     6             57000.0
            7             55000.0
HR          1             51000.0
            2             50000.0
            4             52000.0
IT          3             62000.0
            5             60000.0
            8             63000.0
Name: Salary, dtype: float64
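
Grouping by two columns gives a result with a two-level index. The sketch below shows a few ways to read values out of it; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 55000, 62000, 52000, 57000, 63000, 51000],
    'Experience': [2, 5, 7, 3, 4, 6, 8, 1],
})

by_dept_exp = df.groupby(['Department', 'Experience'])['Salary'].mean()

# Select every row of one outer-level key: a Series indexed by Experience.
it_only = by_dept_exp.loc['IT']

# Select a single value with a (Department, Experience) tuple.
hr_four_years = by_dept_exp.loc[('HR', 4)]

# Pivot the inner level into columns; missing combinations become NaN.
wide = by_dept_exp.unstack('Experience')

print(it_only)
print(hr_four_years)
print(wide)
```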

### Reset index
- `groupby` returns a Series or DataFrame with the grouping columns as the index.
- Call `reset_index()` to convert them back into regular columns.

In [8]:
grouped = df.groupby('Department')['Salary'].mean().reset_index()
grouped

|   | Department | Salary |
| --- | --- | --- |
| 0 | Finance | 56000.0 |
| 1 | HR | 51000.0 |
| 2 | IT | 61666.666667 |
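
Another way to get the grouping column back as a regular column is to pass `as_index=False` to `groupby`. The sketch below compares it with `reset_index()` on the same data.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 55000, 62000, 52000, 57000, 63000, 51000],
})

# as_index=False keeps Department as a column instead of moving it to the index.
flat = df.groupby('Department', as_index=False)['Salary'].mean()

# Same result via reset_index().
via_reset = df.groupby('Department')['Salary'].mean().reset_index()

print(flat)
print(via_reset)
```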


### Using groupby with filter()

In [9]:
df.groupby('Department').filter(lambda x: x['Salary'].mean() > 55000)
# keeps only the rows of departments whose mean salary is above 55000 (HR, with mean 51000, is dropped)

|   | Department | Employee | Salary | Experience |
| --- | --- | --- | --- | --- |
| 1 | IT | B | 60000 | 5 |
| 2 | Finance | C | 55000 | 7 |
| 3 | IT | D | 62000 | 3 |
| 5 | Finance | F | 57000 | 6 |
| 6 | IT | G | 63000 | 8 |
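
The same row selection can also be written as a boolean mask built with `transform`, which avoids calling a Python lambda once per group. This is a sketch of an alternative, not a replacement for `filter()`; which one is faster depends on the data.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Employee': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Salary': [50000, 60000, 55000, 62000, 52000, 57000, 63000, 51000],
    'Experience': [2, 5, 7, 3, 4, 6, 8, 1],
})

# Per-row copy of each department's mean salary.
dept_mean = df.groupby('Department')['Salary'].transform('mean')

# Keep the rows whose department mean exceeds the threshold.
kept = df[dept_mean > 55000]

# The filter() version, for comparison.
via_filter = df.groupby('Department').filter(lambda g: g['Salary'].mean() > 55000)

print(kept)
```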


### Using groupby with transform()
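- `transform` returns a result with the same length and index as the original DataFrame, unlike `agg`, which returns one row per group.
- It broadcasts each group's aggregated result back to every row of that group.
- Use it when you need per-row values that depend on group aggregates.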

In [10]:
df['Dept_Avg_Salary'] = df.groupby('Department')['Salary'].transform('mean')
df

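|   | Department | Employee | Salary | Experience | Dept_Avg_Salary |
| --- | --- | --- | --- | --- | --- |
| 0 | HR | A | 50000 | 2 | 51000.0 |
| 1 | IT | B | 60000 | 5 | 61666.666667 |
| 2 | Finance | C | 55000 | 7 | 56000.0 |
| 3 | IT | D | 62000 | 3 | 61666.666667 |
| 4 | HR | E | 52000 | 4 | 51000.0 |
| 5 | Finance | F | 57000 | 6 | 56000.0 |
| 6 | IT | G | 63000 | 8 | 61666.666667 |
| 7 | HR | H | 51000 | 1 | 51000.0 |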
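
A typical use of the broadcast values is a per-row comparison against the group. The sketch below computes how far each salary is from its department average; the column name `Diff_From_Dept_Avg` is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Employee': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Salary': [50000, 60000, 55000, 62000, 52000, 57000, 63000, 51000],
})

# transform('mean') returns one value per original row, aligned on the index.
dept_avg = df.groupby('Department')['Salary'].transform('mean')

# Row-wise arithmetic works directly because the lengths and indexes match.
df['Diff_From_Dept_Avg'] = df['Salary'] - dept_avg

print(df)
```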


| Task             | Method                          |
| ---------------- | ------------------------------- |
| Group data       | `df.groupby('col')`             |
| Aggregate        | `agg({'col': ['mean', 'max']})` |
| Count            | `size()` or `count()`           |
| Filter groups    | `groupby().filter()`            |
| Transform groups | `groupby().transform()`         |
| Reset index      | `reset_index()`                 |