___

<a href='http://www.pieriandata.com'> <img src='./Pierian_Data_Logo.png' /></a>
___

# Groupby

The `groupby` method lets you group rows of data by the values in a column and apply an aggregate function to each group.  
An aggregate function is any function that takes in many values and returns a single value, such as the sum, the mean, or the standard deviation of a set of numbers.
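
As a minimal sketch of what "many values in, one value out" means, the snippet below reduces a small list with three aggregate functions from the Python standard library. The two numbers are the FB sales figures from the dataset used in this notebook; the variable names are only illustrative.

```python
import statistics

# The two FB sales values from the dataset in this notebook
values = [243, 350]

# Each aggregate function reduces the whole list to a single number
total = sum(values)                  # sum of all values
average = statistics.mean(values)    # arithmetic mean
spread = statistics.stdev(values)    # sample standard deviation (n - 1 denominator)

print(total, average, spread)
```

`groupby` applies this same kind of reduction separately to each group of rows.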

In [35]:
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}

In [36]:
df = pd.DataFrame(data)

In [37]:
df

| | Company | Person | Sales |
|---|---|---|---|
| 0 | GOOG | Sam | 200 |
| 1 | GOOG | Charlie | 120 |
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |
| 4 | FB | Carl | 243 |
| 5 | FB | Sarah | 350 |


In [38]:
df.set_index('Company')

| Company | Person | Sales |
|---|---|---|
| GOOG | Sam | 200 |
| GOOG | Charlie | 120 |
| MSFT | Amy | 340 |
| MSFT | Vanessa | 124 |
| FB | Carl | 243 |
| FB | Sarah | 350 |


**Now you can use the `.groupby()` method to group rows together based on the values in a column. For instance, let's group by Company. This creates a `DataFrameGroupBy` object:**

In [39]:
df.groupby('Company')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002B036B383B0>

You can save this object as a new variable:

In [40]:
by_comp = df.groupby('Company')
by_comp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002B03722C860>
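
The object's repr does not show what is inside it. The following sketch, which rebuilds the same `df` so it can run on its own, shows a few ways to inspect a grouped object before aggregating; the names `n_groups`, `group_sizes`, and `fb_rows` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})
by_comp = df.groupby('Company')

# Number of distinct groups
n_groups = by_comp.ngroups

# Iterating yields (group key, sub-DataFrame) pairs
group_sizes = {name: len(group) for name, group in by_comp}

# Pull out the rows that belong to a single group
fb_rows = by_comp.get_group('FB')

print(n_groups)
print(group_sizes)
print(fb_rows)
```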

And then call aggregate methods off the object:

In [41]:
# by_comp.mean()  # Calling mean() on the whole grouped object can fail when non-numeric columns (like 'Person', dtype 'object') are present; select a numeric column first or pass numeric_only=True.
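
A short sketch of the two usual workarounds, assuming the same `df` as above (rebuilt here so the snippet runs on its own): either restrict the aggregation to numeric columns with `numeric_only=True`, or select the numeric column before aggregating.

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})
by_comp = df.groupby('Company')

# Option 1: aggregate only the numeric columns (returns a DataFrame)
mean_numeric = by_comp.mean(numeric_only=True)

# Option 2: select the numeric column first (returns a Series)
mean_sales = by_comp['Sales'].mean()

print(mean_numeric)
print(mean_sales)
```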

In [None]:
by_comp['Sales'].mean()
# by_comp['Sales'].mean().index   # Index(['FB', 'GOOG', 'MSFT'], dtype='object', name='Company')
# type(by_comp['Sales'].mean())   # pandas.core.series.Series; same result as df.groupby('Company')['Sales'].mean()

Company
FB      296.5
GOOG    160.0
MSFT    232.0
Name: Sales, dtype: float64


In [47]:
by_comp['Sales'].sum()

Company
FB      593
GOOG    320
MSFT    464
Name: Sales, dtype: int64

More examples of aggregate methods:

In [None]:
by_comp['Sales'].std()  # Computes the sample standard deviation (ddof=1) by default.

Company
FB       75.660426
GOOG     56.568542
MSFT    152.735065
Name: Sales, dtype: float64

### Understanding Standard Deviation (std)

The formula for standard deviation depends on whether you're calculating it for a **population** or a **sample**.

📌 **1. Sample Standard Deviation (most common in data analysis):**

$$
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}
$$

- **s**: sample standard deviation  
- **n**: number of observations  
- **xᵢ**: each individual value  
- **$\bar{x}$**: sample mean  
- **$\sum$**: summation over all values  

📌 **2. Population Standard Deviation:**

$$
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}
$$

- **σ**: population standard deviation  
- **μ**: population mean  

🧠 **Key difference:**

- Use **$n - 1$** for sample std (Bessel’s correction) - accounts for estimation bias.  
- Use **$n$** for population std - when you have the entire dataset.  

### In pandas:

To calculate the **sample standard deviation** (default behavior with `ddof=1`):

`df['column'].std()`  

To calculate the **population standard deviation** (using `ddof=0`):

`df['column'].std(ddof=0)`  

### 🔍 What does ddof actually mean?

- **ddof** stands for *Delta Degrees of Freedom*.  
- It is the value subtracted from **n** in the denominator of the std formula.

The formula pandas and NumPy use is:

$$
\text{std} = \sqrt{\frac{1}{n - \text{ddof}} \sum (x_i - \bar{x})^2}
$$

- If **ddof = 1**, then denominator is $n - 1$ → Sample standard deviation  
- If **ddof = 0**, then denominator is $n$ → Population standard deviation  

### ✅ Why is it written as `ddof=0`?

Because you are telling pandas how much to subtract from **n**. For population standard deviation, you want **no correction**, so you subtract 0 - i.e., use **n** in the denominator.

Choosing the right one matters: for small groups, such as the two-row groups in this notebook, the two formulas give noticeably different values.
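
To make the role of `ddof` concrete, here is a small sketch that computes both versions by hand and compares them with pandas and NumPy. It uses the two FB sales values from this notebook; `std_with_ddof` is a hypothetical helper written for this illustration, not a pandas function. Note that NumPy's `np.std` defaults to `ddof=0`, whereas pandas defaults to `ddof=1`.

```python
import math

import numpy as np
import pandas as pd

sales = pd.Series([243, 350])  # the two FB sales values


def std_with_ddof(values, ddof):
    """Standard deviation with (n - ddof) in the denominator (illustrative helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_dev / (n - ddof))


sample_std = sales.std()               # pandas default: ddof=1
population_std = sales.std(ddof=0)     # population formula
numpy_default = np.std(sales.to_numpy())  # NumPy default: ddof=0

manual_sample = std_with_ddof(sales.tolist(), ddof=1)
manual_population = std_with_ddof(sales.tolist(), ddof=0)

print(sample_std, manual_sample)
print(population_std, manual_population, numpy_default)
```

With only two observations the sample value is larger than the population value by a factor of √2, which is why the default matters for small groups.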


In [None]:
by_comp['Sales'].sum()

Company
FB      593
GOOG    320
MSFT    464
Name: Sales, dtype: int64

In [60]:
by_comp['Sales'].sum().loc['FB']

593

In [62]:
df.groupby('Company')['Sales'].sum().loc['FB']

593

In [56]:
by_comp.min()

| Company | Person | Sales |
|---|---|---|
| FB | Carl | 243 |
| GOOG | Charlie | 120 |
| MSFT | Amy | 124 |


In [57]:
by_comp['Sales'].min()

Company
FB      243
GOOG    120
MSFT    124
Name: Sales, dtype: int64
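
One thing to watch in the `by_comp.min()` output above: the minimum is taken for each column independently, so for strings it is the alphabetically first name. The MSFT row shows Amy next to 124, but in the original data it was Vanessa who sold 124. If you want the full row with the lowest sales per company, one possible approach is `idxmin()`, sketched below with the same `df` rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})
by_comp = df.groupby('Company')

# Row label of the smallest Sales value within each company
lowest_idx = by_comp['Sales'].idxmin()

# Use those labels to fetch the complete original rows
lowest_rows = df.loc[lowest_idx]

print(lowest_rows)
```

The same pattern with `idxmax()` returns the row with the highest sales in each group.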

In [63]:
by_comp.max()

| Company | Person | Sales |
|---|---|---|
| FB | Sarah | 350 |
| GOOG | Sam | 200 |
| MSFT | Vanessa | 340 |


In [64]:
by_comp['Sales'].max()

Company
FB      350
GOOG    200
MSFT    340
Name: Sales, dtype: int64

In [65]:
by_comp.count()

| Company | Person | Sales |
|---|---|---|
| FB | 2 | 2 |
| GOOG | 2 | 2 |
| MSFT | 2 | 2 |
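
`count()` reports the number of non-null values in each column, which is why every cell above is 2. The sketch below uses a hypothetical variant of the dataset with one missing `Sales` value (not part of the original data) to show how `count()` differs from `size()`, which counts rows regardless of missing values.

```python
import pandas as pd

# Hypothetical variant of the dataset: one GOOG sale is missing
df_missing = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Sales': [200, None, 340, 124, 243, 350],
})
grouped = df_missing.groupby('Company')

non_null = grouped['Sales'].count()  # non-null Sales values per company
rows = grouped.size()                # total rows per company

print(non_null)
print(rows)
```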


In [None]:
by_comp.describe()

| Company | | Sales |
|---|---|---|
| FB | count | 2.0 |
| FB | mean | 296.5 |
| FB | std | 75.660426 |
| FB | min | 243.0 |
| FB | 25% | 269.75 |
| FB | 50% | 296.5 |
| FB | 75% | 323.25 |
| FB | max | 350.0 |
| GOOG | count | 2.0 |
| GOOG | mean | 160.0 |


In [None]:
by_comp.describe().transpose()

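| Company | FB | FB | FB | FB | FB | FB | FB | FB | GOOG | GOOG | GOOG | GOOG | GOOG | MSFT | MSFT | MSFT | MSFT | MSFT | MSFT | MSFT | MSFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max |
| Sales | 2.0 | 296.5 | 75.660426 | 243.0 | 269.75 | 296.5 | 323.25 | 350.0 | 2.0 | 160.0 | ... | 180.0 | 200.0 | 2.0 | 232.0 | 152.735065 | 124.0 | 178.0 | 232.0 | 286.0 | 340.0 |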


In [None]:
by_comp.describe().transpose()['GOOG']

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Sales | 2.0 | 160.0 | 56.568542 | 120.0 | 140.0 | 160.0 | 180.0 | 200.0 |


In [68]:
by_comp.describe().transpose().loc[('Sales', 'count')]

Company
FB      2.0
GOOG    2.0
MSFT    2.0
Name: (Sales, count), dtype: float64

In [69]:
by_comp.describe().transpose().loc[('Sales', 'count'), 'FB']

2.0

In [67]:
by_comp.describe().transpose().index

MultiIndex([('Sales', 'count'),
            ('Sales',  'mean'),
            ('Sales',   'std'),
            ('Sales',   'min'),
            ('Sales',   '25%'),
            ('Sales',   '50%'),
            ('Sales',   '75%'),
            ('Sales',   'max')],
           )
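
In recent pandas versions, `describe()` on a grouped object places the statistics in the columns, which is what the `MultiIndex` above reflects. As a sketch of a simpler way to reach a single statistic, you can select the `Sales` column before calling `describe()`, so that the result has one row per company and ordinary column names. The snippet rebuilds `df` so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Describe a single column: rows are companies, columns are statistics
desc = df.groupby('Company')['Sales'].describe()
goog_mean = desc.loc['GOOG', 'mean']
fb_count = desc.loc['FB', 'count']

# Describe a one-column DataFrame: columns become a (column, statistic) MultiIndex
wide = df.groupby('Company')[['Sales']].describe()
fb_count_t = wide.transpose().loc[('Sales', 'count'), 'FB']

print(desc)
print(goog_mean, fb_count, fb_count_t)
```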

# Great Job!