⚠️ **Disclaimer:** This notebook is a work in progress and may contain errors. For learning and experimentation only.  
Skip this for now and move on to the next step.

In [11]:
import pandas as pd

# Resampling Time Series

In [12]:
# Sample DataFrame
df = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "Sales": [100, 120, 90, 150, 130, 160, 170, 155, 140, 180]
})

df

,Date,Sales
0,2023-01-01,100
1,2023-01-02,120
2,2023-01-03,90
3,2023-01-04,150
4,2023-01-05,130
5,2023-01-06,160
6,2023-01-07,170
7,2023-01-08,155
8,2023-01-09,140
9,2023-01-10,180


In [13]:
# .resample() needs a datetime-like index (or a datetime column passed via on=),
# so make Date the index.

df.set_index("Date", inplace=True)
df

Date,Sales
2023-01-01,100
2023-01-02,120
2023-01-03,90
2023-01-04,150
2023-01-05,130
2023-01-06,160
2023-01-07,170
2023-01-08,155
2023-01-09,140
2023-01-10,180
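
Setting the index is one route; as a small sketch of an alternative, `resample` also accepts an `on=` argument naming a datetime column. The names `raw`, `weekly_on` and `weekly_idx` below are illustrative and not part of the notebook.

```python
import pandas as pd

# Same sample data as above, but Date stays an ordinary column.
raw = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "Sales": [100, 120, 90, 150, 130, 160, 170, 155, 140, 180],
})

# on="Date" tells resample which datetime column to bin on.
weekly_on = raw.resample("W", on="Date").sum()

# Same binning after moving Date into the index first.
weekly_idx = raw.set_index("Date").resample("W").sum()

print(weekly_on)
print(weekly_idx)
```

Both routes should give the same weekly totals; `on=` is convenient when the original integer index needs to be kept on the source frame.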


## 1. What is Resampling?
Resampling groups rows into time bins of a given frequency (for example weekly or monthly) and then aggregates each bin.
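
One way to see the "group, then aggregate" idea is to compare `resample` with an ordinary `groupby` on a time-based grouper. This is only a sketch on the notebook's sample data; `sales`, `by_resample` and `by_grouper` are illustrative names.

```python
import pandas as pd

sales = pd.DataFrame(
    {"Sales": [100, 120, 90, 150, 130, 160, 170, 155, 140, 180]},
    index=pd.date_range("2023-01-01", periods=10, freq="D", name="Date"),
)

# resample: bin the index by week, then sum each bin.
by_resample = sales.resample("W").sum()

# groupby with pd.Grouper(freq=...) forms the same weekly bins.
by_grouper = sales.groupby(pd.Grouper(freq="W")).sum()

print(by_resample)
print(by_grouper)
```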

## 2. Basic Resampling - Daily -> Monthly

In [14]:
# Monthly total sales
df.resample("ME").sum()

Date,Sales
2023-01-31,1395


In [15]:
# Monthly average sales
df.resample("ME").mean()

Date,Sales
2023-01-31,139.5


## 3. Common Resampling frequencies

|Alias|Description|Example|Explanation|
|-|-|-|-|
|'D'|Day|`df.resample('D').sum()`||
|'W'|Week|`df.resample('W').sum()`|`'W'` is short for `'W-SUN'`: each bin runs from Monday through Sunday and is labeled with the Sunday that ends it. Our data starts on Sunday, 1 Jan 2023, so that day lands in a bin of its own and the next bin covers 2 Jan to 8 Jan.|
|'ME'|Month End|`df.resample('ME').sum()`||
|'MS'|Month Start|`df.resample('MS').sum()`||
|'QE'|Quarter End|`df.resample('QE').sum()`||
|'YE'|Year End|`df.resample('YE').sum()`||
|'h'|Hourly|`df.resample('h').sum()`|The older alias `'H'` is deprecated since pandas 2.2.|
|'min'|Minutely|`df.resample('min').sum()`|The older alias `'T'` is deprecated since pandas 2.2.|
|'s'|Secondly|`df.resample('s').sum()`|The older alias `'S'` is deprecated since pandas 2.2.|
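
To check which weekday the weekly bins end on, the sketch below inspects the labels produced by `'W'` and compares them with an anchored alias, `'W-SAT'`, which ends each week on Saturday instead. The variable names are illustrative.

```python
import pandas as pd

sales = pd.DataFrame(
    {"Sales": [100, 120, 90, 150, 130, 160, 170, 155, 140, 180]},
    index=pd.date_range("2023-01-01", periods=10, freq="D", name="Date"),
)

# 'W' means 'W-SUN': every label is the Sunday that closes the bin.
weekly_sun = sales.resample("W").sum()
print(weekly_sun.index.day_name().tolist())

# 'W-SAT' moves the week boundary so bins end on Saturday.
weekly_sat = sales.resample("W-SAT").sum()
print(weekly_sat)
```

With the Saturday anchor, 1 Jan to 7 Jan fall into a single bin, so the lone first-day bin seen with `'W'` disappears.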

## 4. Downsampling vs Upsampling

Downsampling (high -> low frequency)  
Several rows fall into each new bin, so an aggregate function (such as `sum` or `mean`) is needed to combine them into a single value.

In [53]:
df.resample('W').sum()

Date,Sales
2023-01-01,100
2023-01-08,975
2023-01-15,320
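
The uneven weekly totals are easier to read next to the number of daily rows that went into each bin. A minimal sketch, using `count()` as the aggregate (the names `sales` and `rows_per_week` are illustrative):

```python
import pandas as pd

sales = pd.DataFrame(
    {"Sales": [100, 120, 90, 150, 130, 160, 170, 155, 140, 180]},
    index=pd.date_range("2023-01-01", periods=10, freq="D", name="Date"),
)

# How many daily rows were merged into each weekly bin?
rows_per_week = sales.resample("W").count()
print(rows_per_week)
```

The first bin holds only 1 Jan (a Sunday), the second a full Monday-to-Sunday week, and the last the two remaining days.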


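Upsampling (low -> high frequency)  
This introduces NaN values at the new timestamps, because no original data exists for them.  

No aggregate function is needed here; use `.asfreq()` to expose the new rows as they are.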

In [54]:
df.resample('0.5D').asfreq()

Date,Sales
2023-01-01 00:00:00,100.0
2023-01-01 12:00:00,
2023-01-02 00:00:00,120.0
2023-01-02 12:00:00,
2023-01-03 00:00:00,90.0
2023-01-03 12:00:00,
2023-01-04 00:00:00,150.0
2023-01-04 12:00:00,
2023-01-05 00:00:00,130.0
2023-01-05 12:00:00,


## 5. Filling missing values after Upsampling
Filling the gaps after upsampling is good practice. The cells below upsample and fill in a single step, but it is better to upsample first, inspect the NaN values, and only then apply a fill.  
Two types: 
- Forward fill
- Backward fill
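
The "upsample first, inspect, then fill" workflow described above could look roughly like this. It is a sketch, not the notebook's own cell: the half-day step is written as `"12h"`, which should describe the same interval as `"0.5D"`, and the names `upsampled`, `n_missing` and `filled` are illustrative.

```python
import pandas as pd

sales = pd.DataFrame(
    {"Sales": [100, 120, 90, 150, 130, 160, 170, 155, 140, 180]},
    index=pd.date_range("2023-01-01", periods=10, freq="D", name="Date"),
)

# Step 1: upsample only, so the gaps stay visible as NaN.
upsampled = sales.resample("12h").asfreq()

# Step 2: inspect how many values are missing before deciding how to fill.
n_missing = int(upsampled["Sales"].isna().sum())
print(f"{n_missing} of {len(upsampled)} rows are NaN")

# Step 3: fill once the gaps are understood (forward fill here).
filled = upsampled.ffill()
print(filled.head())
```

Keeping the steps separate makes it harder to hide gaps that should have been investigated rather than filled.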

In [18]:
# Forward fill
df.resample("0.5D").ffill()

Date,Sales
2023-01-01 00:00:00,100
2023-01-01 12:00:00,100
2023-01-02 00:00:00,120
2023-01-02 12:00:00,120
2023-01-03 00:00:00,90
2023-01-03 12:00:00,90
2023-01-04 00:00:00,150
2023-01-04 12:00:00,150
2023-01-05 00:00:00,130
2023-01-05 12:00:00,130


In [19]:
# Backward fill
df.resample("0.5D").bfill()

Date,Sales
2023-01-01 00:00:00,100
2023-01-01 12:00:00,120
2023-01-02 00:00:00,120
2023-01-02 12:00:00,90
2023-01-03 00:00:00,90
2023-01-03 12:00:00,150
2023-01-04 00:00:00,150
2023-01-04 12:00:00,130
2023-01-05 00:00:00,130
2023-01-05 12:00:00,160


## 6. Resample with Multiple Aggregations
Applying an aggregation means we are downsampling; when upsampling, no aggregation is used.

In [20]:
df.resample("ME").agg({
    "Sales": ["sum", "mean", "max"]
}) 

,Sales,Sales,Sales
,sum,mean,max
Date,,,
2023-01-31,1395,139.5,180
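
The dictionary form above produces two header levels (`Sales` over `sum`/`mean`/`max`). If flat column names are preferred, one option is to select the column first and pass a plain list of aggregations. A sketch with weekly bins, so that more than one row comes out (`weekly_stats` is an illustrative name):

```python
import pandas as pd

sales = pd.DataFrame(
    {"Sales": [100, 120, 90, 150, 130, 160, 170, 155, 140, 180]},
    index=pd.date_range("2023-01-01", periods=10, freq="D", name="Date"),
)

# Selecting the Sales column first yields single-level columns.
weekly_stats = sales.resample("W")["Sales"].agg(["sum", "mean", "max"])
print(weekly_stats)
```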


## 7. Resampling with custom Labeling

- `label` = Which edge of the bin is used as its label? `[ left / right ]`. The default is `left` for most frequencies; `'ME'`, `'QE'`, `'YE'` and `'W'` default to `right`.  
ex. Hourly -> Daily
```
2024-01-01 00:00
2024-01-01 01:00
...
2024-01-01 23:00
2024-01-02 00:00

```
`label=left` (default for daily bins) -> Label will be - `2024-01-01 00:00`  
`label=right` -> Label will be - `2024-01-02 00:00`

---

- `closed` = Which edge of the bin is inclusive? `[ left / right ]`. The default is `left` for most frequencies; `'ME'`, `'QE'`, `'YE'` and `'W'` default to `right`.  
ex. Hourly -> Daily
```
2024-01-01 00:00
2024-01-01 01:00
...
2024-01-01 23:00
2024-01-02 00:00

```
`closed = left` (default for daily bins) -> `2024-01-01 00:00 - 2024-01-01 23:00`  
`closed = right`  -> `2024-01-01 01:00 - 2024-01-02 00:00`
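
The hourly -> daily case can be checked directly. The sketch below builds 25 hourly timestamps (midnight on 1 Jan through midnight on 2 Jan), gives each the value 1, and sums per day, so every result is simply the number of timestamps in the bin. The series `hourly` and the other names are made up for illustration.

```python
import pandas as pd

# 25 hourly points: 2024-01-01 00:00 ... 2024-01-02 00:00, each worth 1.
idx = pd.date_range("2024-01-01 00:00", "2024-01-02 00:00", freq="h")
hourly = pd.Series(1, index=idx)

# Daily defaults: closed='left', label='left'.
default_daily = hourly.resample("D").sum()

# Same bins, but each one is named after its right edge.
label_right = hourly.resample("D", label="right").sum()

# Right edge inclusive: the 00:00 point moves into the previous day's bin.
closed_right = hourly.resample("D", closed="right").sum()

for name, result in [("default", default_daily),
                     ("label='right'", label_right),
                     ("closed='right'", closed_right)]:
    print(name)
    print(result)
```

`label` only renames the bins, while `closed` changes which rows belong to them, which is why only the last result has different counts.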

In [75]:
# Label each weekly bin with its left (start) edge instead of the default right edge
df.resample('W', label='left').sum()

Date,Sales
2022-12-25,100
2023-01-01,975
2023-01-08,320


In [64]:
# closed='left': each bin includes its starting Sunday and excludes the next one
df.resample("W", label='left', closed='left').sum()

Date,Sales
2023-01-01,920
2023-01-08,475


In [63]:
# closed='right' is already the default for 'W', so this matches label='left' on its own
df.resample("W", label='left', closed='right').sum()

Date,Sales
2022-12-25,100
2023-01-01,975
2023-01-08,320
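
As a closing check on the defaults, the sketch below compares `'W'` with its defaults spelled out, and contrasts it with `'7D'`, a plain seven-day step that uses left-closed, left-labeled bins starting at the first day of the data. All variable names are illustrative.

```python
import pandas as pd

sales = pd.DataFrame(
    {"Sales": [100, 120, 90, 150, 130, 160, 170, 155, 140, 180]},
    index=pd.date_range("2023-01-01", periods=10, freq="D", name="Date"),
)

# 'W' with and without its defaults written out.
default_w = sales.resample("W").sum()
explicit_w = sales.resample("W", closed="right", label="right").sum()

# '7D' is not anchored to a weekday; bins start at the first timestamp's day.
seven_day = sales.resample("7D").sum()

print(default_w)
print(seven_day)
```

The `'7D'` totals match the `closed='left'` weekly result above only because this data happens to start on a Sunday.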
