<center> <i> <font style= 'font-size:100px'>"Grouping Makes Sense" </font> </i> </center>

# 

Now, finally, we will see the power of ***Transformation***. It is similar to `apply`, but with a few constraints that help us. The function we pass:

* Can produce a scalar value (which is then broadcast to the whole group)
* Can produce an object of the same shape as the input group
* Must not mutate its input

Let's check and use them one by one.

# 

In [1]:
import pandas as pd
import numpy as np

In [6]:
df = pd.DataFrame(np.random.randint(0, 100, 10), columns= ['data'])
df.insert(0, 'key', np.random.choice(['A', 'B', 'C'], len(df)))
df

|   | key | data |
|---|-----|------|
| 0 | A | 28 |
| 1 | B | 1 |
| 2 | C | 73 |
| 3 | C | 38 |
| 4 | B | 1 |
| 5 | A | 39 |
| 6 | C | 90 |
| 7 | A | 38 |
| 8 | A | 67 |
| 9 | A | 25 |


In [13]:
grouped = df.groupby('key').data

In [14]:
grouped.mean()

key
A    39.4
B     1.0
C    67.0
Name: data, dtype: float64

# 

###  1st Use of `.transform()`
*(Returning scalar - per group)*

In [21]:
grouped.transform(min)

0    25
1     1
2    38
3    38
4     1
5    25
6    38
7    25
8    25
9    25
Name: data, dtype: int32

This is the SCALAR transformation: it took each whole group, reduced it to one value, and broadcast that value back to every row of the group.

In [22]:
grouped.transform(lambda _ : 1)

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: data, dtype: int32

See? We can do anything!

### How is it different from `.apply`?
Good question

In [23]:
grouped.apply(min)

key
A    25
B     1
C    38
Name: data, dtype: int64

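Apply works on the whole group too, agreed. But after that it no longer has access to the individual elements of that group: as the output shows, it gives back one value per group, indexed by the group key, not one value per original row.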
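To make the difference concrete, here is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration, and `.min()` stands in for `apply(min)`). It only shows how the two results are indexed.

```python
import pandas as pd

# Hypothetical toy frame with fixed values, so the result is reproducible
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B', 'A'],
                    'data': [5, 1, 3, 7, 9]})
toy_grouped = toy.groupby('key')['data']

# Aggregation: one row per group, indexed by the group label
per_group = toy_grouped.min()

# Transformation: one row per original row, indexed like `toy`
per_row = toy_grouped.transform('min')

# Because the index matches the original frame, it can be attached as a column
toy['group_min'] = per_row

print(per_group)
print(toy)
```

The aggregated result has 2 rows (one per key), while the transformed result has 5 rows and lines up with the original index.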

# 

**NOTE:** You may wonder: in one of the previous notebooks we saw `apply` return multiple values per group, which seems to conflict with the statement in the cell above. (See the notebook *3. Unseen groups* in the groupby folder.)

###### 

The argument is valid, but the point stays the same: `apply` doesn't keep access to the INDIVIDUAL elements and their locations afterwards, as `transform` does.

In that GLUED case, `apply` returns the DF for a group as a whole and then attaches it to the other groups' DFs. So, yeah.

# 

### 2nd Use of `.transform()`
*(Returning series - per group)*

In [47]:
grouped.transform(lambda x: x.rank())

0    2.0
1    1.5
2    2.0
3    1.0
4    1.5
5    4.0
6    3.0
7    3.0
8    5.0
9    1.0
Name: data, dtype: float64

In [53]:
pd.concat([df, grouped.transform(lambda x: x.rank())], axis=1).sort_values(by= ['key'])

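|   | key | data | data |
|---|-----|------|------|
| 0 | A | 28 | 2.0 |
| 5 | A | 39 | 4.0 |
| 7 | A | 38 | 3.0 |
| 8 | A | 67 | 5.0 |
| 9 | A | 25 | 1.0 |
| 1 | B | 1 | 1.5 |
| 4 | B | 1 | 1.5 |
| 2 | C | 73 | 2.0 |
| 3 | C | 38 | 1.0 |
| 6 | C | 90 | 3.0 |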
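Since the transformed Series shares the original index, it can also be attached with a fresh column name instead of `pd.concat`, which avoids ending up with two columns both called `data`. This is a small sketch on a hypothetical frame; `toy` and `rank_in_group` are illustrative names only.

```python
import pandas as pd

# Hypothetical toy frame; group B contains a tie on purpose
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B', 'A'],
                    'data': [28, 1, 39, 1, 25]})

# Rank each value inside its own group; ties get the average rank
ranks = toy.groupby('key')['data'].transform(lambda x: x.rank())

# Attach under a new, unambiguous column name
ranked = toy.assign(rank_in_group=ranks).sort_values(by=['key', 'data'])
print(ranked)
```

The two tied values in group B both get rank 1.5, the same behaviour as in the output above.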


# 

## → Multi Column group by WITH TimeStamps 

**In a normal workflow, we simply pass multiple columns to `groupby` to group by all of them together.**

(While reading that statement, please don't mix it up with sorting. In sorting, passing multiple columns means that ties in one column are broken by the other column's values. Here it does a groupby, which means one group for each unique combination of values in those columns.)
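A tiny sketch of that "unique combination" idea, using a hypothetical frame (`toy`, `flag` and the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with two label columns
toy = pd.DataFrame({'key':  ['A', 'A', 'B', 'B', 'A'],
                    'flag': ['x', 'y', 'x', 'x', 'x'],
                    'data': [1, 2, 3, 4, 5]})

# One group per unique (key, flag) pair that actually occurs in the data
sums = toy.groupby(['key', 'flag'])['data'].sum()
print(sums)
```

There are three distinct pairs here, `(A, x)`, `(A, y)` and `(B, x)`, so the result has three rows.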

**This is not used much with time series. But let's say you do need a label column and a time column together; then the normal workflow would go like...**

Here we will use the `repeat` method to repeat the dates. (Some notes on that are in the *2. Freq-ing frequencies* notebook.)

In [7]:
df = pd.DataFrame({'time': pd.date_range('25/4/2021', periods= 5).repeat(3),
                   'key': np.tile(['A','B','C'], 5),
                   'data': np.arange(1, 16)})
df

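|   | time | key | data |
|---|------|-----|------|
| 0 | 2021-04-25 | A | 1 |
| 1 | 2021-04-25 | B | 2 |
| 2 | 2021-04-25 | C | 3 |
| 3 | 2021-04-26 | A | 4 |
| 4 | 2021-04-26 | B | 5 |
| 5 | 2021-04-26 | C | 6 |
| 6 | 2021-04-27 | A | 7 |
| 7 | 2021-04-27 | B | 8 |
| 8 | 2021-04-27 | C | 9 |
| 9 | 2021-04-28 | A | 10 |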


In [8]:
# If we do groupby now...
df.groupby(['key', 'time']).mean()

| key | time | data |
|-----|------|------|
| A | 2021-04-25 | 1 |
| A | 2021-04-26 | 4 |
| A | 2021-04-27 | 7 |
| A | 2021-04-28 | 10 |
| A | 2021-04-29 | 13 |
| B | 2021-04-25 | 2 |
| B | 2021-04-26 | 5 |
| B | 2021-04-27 | 8 |
| B | 2021-04-28 | 11 |
| B | 2021-04-29 | 14 |


It does work... but there is no time logic behind the grouping: each timestamp is treated as just another label. So... we do have to go for...

### Real stuff ↓ 

## Introducing `pd.TimeGrouper()` 

**Announcement:** Sorry for that - `pd.TimeGrouper` is now deprecated (and has been removed from recent pandas versions). So, we have to use `pd.Grouper`

# 

<center> -- Diverging from the Topic -- </center>

# 

## → What is `pd.Grouper()`?

To show how Grouper works, I will start with a simple example and then move on to some time series stuff.

In [2]:
# Simple Case
DF = pd.DataFrame(np.random.randint(0, 100, 10),
                  columns= ['data'])
DF.insert(0, 'key', np.random.choice(['A','B','C'], len(DF)))
DF      

|   | key | data |
|---|-----|------|
| 0 | B | 60 |
| 1 | C | 69 |
| 2 | A | 77 |
| 3 | C | 97 |
| 4 | C | 78 |
| 5 | B | 70 |
| 6 | A | 88 |
| 7 | A | 23 |
| 8 | A | 44 |
| 9 | C | 81 |


In [3]:
pd.Grouper(key= 'key')

Grouper(key='key', axis=0, sort=False)

In [4]:
DF.groupby(pd.Grouper(key= 'key')).mean()

| key | data |
|-----|------|
| B | 65.0 |
| C | 81.25 |
| A | 58.0 |


The example above is equivalent to...
```python
DF.groupby('key').mean() == DF.groupby(pd.Grouper(key= 'key')).mean()
```
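If you want to actually verify that equivalence, note that `==` compares element-wise and is picky about row order. A hedged sketch of one way to check it: sort both results by index first and then compare whole frames. The frame `toy` and its values are hypothetical.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical toy frame with fixed values
toy = pd.DataFrame({'key': ['B', 'C', 'A', 'C', 'A'],
                    'data': [60, 69, 77, 97, 88]})

# Sort both results so the comparison does not depend on group order
by_label = toy.groupby('key').mean().sort_index()
by_grouper = toy.groupby(pd.Grouper(key='key')).mean().sort_index()

assert_frame_equal(by_label, by_grouper)   # raises if they differ
same = by_label.equals(by_grouper)
print(same)
```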

Then, the simple question is - why use it?<br>
And the simple answer is - because we have more control with it!

Then, the obvious question is - HOW?<br>
And the non-obvious answer is - more usability when grouping time-series data.

*Here is a nice article on [how Grouper is so useful](https://pbpython.com/pandas-grouper-agg.html)*

_

#### Basic Idea *(if you don't want to read the article)*:

Please read:<br>
    -----------

The thing is, when you want to group the data by plain, normal labels, *groupby* works well. But when you want to group by labels and some **time series** data that involves time-based logic such as a frequency, a plain groupby is not enough.

Now you would say - hey! We have *resample* for that! Yes, we do - but resample on its own **fails in grouping with labels**, and by default it also expects the DateTime to be the index (although `DataFrame.resample` does accept an `on=` argument that names a datetime column).

**But it is NOT IMPOSSIBLE to group by labels and time series using groupby; we can achieve it with the following syntax:**
```python
#  ↓ because .resample would need it   # ↓ Because we need time logic
DF.set_index('date_col').groupby('key').resample('5min').mean()
                       # ↑ Because we want to group with labels
                       
```
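Here is a runnable sketch of that chain. The frame `events`, its column names and its values are hypothetical, chosen so that each key has readings in two different 5-minute windows; selecting `['data']` before `.resample` is my own choice to keep the output small.

```python
import pandas as pd

# Hypothetical readings: two keys, timestamps spread over ten minutes
events = pd.DataFrame({
    'date_col': pd.to_datetime(['2021-04-25 00:00', '2021-04-25 00:02',
                                '2021-04-25 00:07', '2021-04-25 00:01',
                                '2021-04-25 00:06', '2021-04-25 00:08']),
    'key': ['A', 'A', 'A', 'B', 'B', 'B'],
    'data': [1, 3, 5, 2, 4, 6],
})

resampled = (events.set_index('date_col')   # resample needs a datetime index
                   .groupby('key')['data']  # group by the label first
                   .resample('5min')        # then apply the time logic per group
                   .mean())
print(resampled)
```

Each key ends up with two 5-minute bins, one starting at 00:00 and one at 00:05.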

Now you see, the syntax starts looking wonky. So, here is `pd.Grouper` to help us. And remember, Grouper is multi-talented (as you can see, it has replaced pd.TimeGrouper). It is not used JUST to eliminate the wonky syntax shown above; it can be useful for many other things, but those are rare.

For now... see the syntax below. 

```python
DF.groupby(['key', pd.Grouper(key= 'date_col', freq= '5min')]).mean()
```

**See how simple it is!** Now:
1. We don't need to set the datetime column as the index
2. We can use the time logic directly in groupby
3. Grouper supports several kinds of time logic, and `freq` is one of them.
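As a sanity check, here is a sketch that runs the Grouper syntax next to the `set_index` + `resample` chain on the same hypothetical frame (`events` and its values are made up). One caveat that I believe holds: `resample` also emits empty bins inside each group's time range, while the groupby version only returns bins that contain data, so the two can differ on sparser data than this.

```python
import pandas as pd

# Hypothetical readings: two keys, timestamps spread over ten minutes
events = pd.DataFrame({
    'date_col': pd.to_datetime(['2021-04-25 00:00', '2021-04-25 00:02',
                                '2021-04-25 00:07', '2021-04-25 00:01',
                                '2021-04-25 00:06', '2021-04-25 00:08']),
    'key': ['A', 'A', 'A', 'B', 'B', 'B'],
    'data': [1, 3, 5, 2, 4, 6],
})

# Grouper way: no set_index needed, time logic lives inside groupby
grouper_way = events.groupby(
    ['key', pd.Grouper(key='date_col', freq='5min')])['data'].mean()

# Chained way, for comparison
resample_way = (events.set_index('date_col')
                      .groupby('key')['data']
                      .resample('5min')
                      .mean())

print(grouper_way)
print(grouper_way.tolist() == resample_way.tolist())
```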

**One more notable thing about pd.Grouper**: <br>
When the index is a DateTime index, we don't need to give `key= 'date_col'` (if we do, it throws an error); Grouper takes the index by default (unless we have two or more date-based columns).

# 

<center> -- Getting back to the topic where we left off -- </center>

# 

In [16]:
# Without freq, Grouper just groups by the distinct values of 'time' (the data is daily, so it looks like 'D')
df.groupby(['key', pd.Grouper(key= 'time')]).mean()

| key | time | data |
|-----|------|------|
| A | 2021-04-25 | 1 |
| A | 2021-04-26 | 4 |
| A | 2021-04-27 | 7 |
| A | 2021-04-28 | 10 |
| A | 2021-04-29 | 13 |
| B | 2021-04-25 | 2 |
| B | 2021-04-26 | 5 |
| B | 2021-04-27 | 8 |
| B | 2021-04-28 | 11 |
| B | 2021-04-29 | 14 |
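One thing worth checking here: this result looks daily because the data has exactly one timestamp per day. As far as I understand, a Grouper without `freq` groups on the distinct values of the column and applies no time binning. A small sketch with hypothetical intraday timestamps (`stamps` and its values are made up) shows the difference:

```python
import pandas as pd

# Hypothetical frame: two readings on the same day, one on the next day
stamps = pd.DataFrame({
    'time': pd.to_datetime(['2021-04-25 08:00', '2021-04-25 20:00',
                            '2021-04-26 09:00']),
    'data': [1, 3, 5],
})

# No freq: every distinct timestamp is its own group
no_freq = stamps.groupby(pd.Grouper(key='time'))['data'].mean()

# freq='D': timestamps are binned into calendar days
daily = stamps.groupby(pd.Grouper(key='time', freq='D'))['data'].mean()

print(len(no_freq), len(daily))
```

So when the timestamps carry a time of day, passing `freq='D'` explicitly is the safer way to get per-day groups.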


In [18]:
# If we give freq 'M'
df.groupby(['key', pd.Grouper(key= 'time', freq= 'M')]).mean()

| key | time | data |
|-----|------|------|
| A | 2021-04-30 | 7 |
| B | 2021-04-30 | 8 |
| C | 2021-04-30 | 9 |


In [32]:
# If we give freq '2D'
df.groupby(['key', pd.Grouper(key= 'time', freq= '2D')]).mean()

| key | time | data |
|-----|------|------|
| A | 2021-04-25 | 2.5 |
| A | 2021-04-27 | 8.5 |
| A | 2021-04-29 | 13.0 |
| B | 2021-04-25 | 3.5 |
| B | 2021-04-27 | 9.5 |
| B | 2021-04-29 | 14.0 |
| C | 2021-04-25 | 4.5 |
| C | 2021-04-27 | 10.5 |
| C | 2021-04-29 | 15.0 |


Simple as that, but a little explanation: here it took 2-day bins - `[25, 26]`, `[27, 28]`, `[29]` <br>
And remember - it is not a rolling function, which would overlap the days. It is a groupby, so the freq splits the dates into consecutive, non-overlapping bins.
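A quick way to convince yourself that the bins don't overlap is to count the rows per bin: every row should land in exactly one of them. The sketch below rebuilds the same frame as above under a different, illustrative name (`ts_df`), with the start date written in ISO format.

```python
import numpy as np
import pandas as pd

# Same shape of data as the notebook's frame: 5 days x 3 keys
ts_df = pd.DataFrame({'time': pd.date_range('2021-04-25', periods=5).repeat(3),
                      'key': np.tile(['A', 'B', 'C'], 5),
                      'data': np.arange(1, 16)})

# Number of rows that fall into each (key, 2-day bin) group
sizes = ts_df.groupby(['key', pd.Grouper(key='time', freq='2D')])['data'].size()
print(sizes)

# Non-overlapping bins: the group sizes add up to the number of rows
print(sizes.sum() == len(ts_df))
```

Each key gets bins of sizes 2, 2 and 1, and the sizes add up to 15, the number of rows.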

In [41]:
# If the date col is the index - no need to provide key=col
df.set_index('time').groupby(['key', pd.Grouper(freq= 'M')]).mean()

| key | time | data |
|-----|------|------|
| A | 2021-04-30 | 7 |
| B | 2021-04-30 | 8 |
| C | 2021-04-30 | 9 |


# 

# 

# That's it!
Next up, we will talk about 'Method chaining'

# 