# Task 1: Lag Matrix Creation

Given a dataframe containing time series data for multiple items, write a Python function that creates a 2D
tensor containing the previous n values per item for each dataframe row. The output tensor should have the
same number of rows as the input dataframe and n columns representing the “lagged” values. Missing values
should be filled with NaN (in particular at the beginning of each time series).

For example, for n=2 and the input dataframe:

```python
pd.DataFrame(
	{
		"item": [23, 23, 23, 23, 11, 11, 11],
		"value": [9.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
	}
)
```

the expected output tensor is:

```python
torch.tensor(
	[
		[torch.nan, torch.nan],
		[torch.nan, 9.0],
		[9.0, 2.0],
		[2.0, 3.0],
		[torch.nan, torch.nan],
		[torch.nan, 5.0],
		[5.0, 6.0],
	]
)
```

# Solution

In [1]:
import pandas as pd
import numpy as np
import torch

We can achieve this with pandas and PyTorch: group the dataframe by the "item" column, use the `shift` function to generate the lagged values within each item, and convert the resulting dataframe into a PyTorch tensor.
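To see why the grouping matters, the following minimal sketch compares a plain `shift` with a grouped `shift` on a shortened version of the example data. The variable names `plain` and `grouped` are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "item": [23, 23, 23, 11, 11],
        "value": [9.0, 2.0, 3.0, 5.0, 6.0],
    }
)

# A plain shift ignores item boundaries: the last value of item 23
# leaks into the first row of item 11.
plain = df["value"].shift(1)

# A grouped shift restarts for every item, so the first row of each
# item has no previous value and becomes NaN.
grouped = df.groupby("item")["value"].shift(1)

print(pd.DataFrame({"item": df["item"], "plain": plain, "grouped": grouped}))
```

In the printed frame, the first row of item 11 holds 3.0 in the `plain` column but NaN in the `grouped` column.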

In [2]:
def create_lag_matrix(df, n):
    # Initialize a new dataframe to store the lagged values
    lagged_df = pd.DataFrame(index=df.index)

    # Create the lagged values for each item
    for i in range(n, 0, -1):
        lagged_df[f'lag_{i}'] = df.groupby('item')['value'].shift(i)

    # Convert the dataframe to a PyTorch tensor and return it
    return torch.tensor(lagged_df.values, dtype=torch.float)

The function `create_lag_matrix(df, n)` creates a 2D tensor that contains the previous `n` values (lags) for each item in the input DataFrame `df`. It works as follows:

1. The function first initializes an empty DataFrame `lagged_df` with the same index as the input DataFrame `df`.

2. It then loops `n` times, with `i` running from `n` down to `1`. Each iteration adds a column `lag_i` to `lagged_df` that holds the 'value' column of `df` shifted by `i` rows within each item. The pandas `shift` function moves the values in a column down by the given number of rows, and `groupby('item')` applies the shift separately to each item, so that the lags of different items do not mix.

3. After the loop has completed, the lag columns are ordered from the least recent (`lag_n`, leftmost) to the most recent (`lag_1`, rightmost), which matches the column order of the expected output.

4. Finally, `lagged_df` is converted to a PyTorch tensor using the `torch.tensor` function, and this tensor is returned. The `dtype=torch.float` argument produces a 32-bit floating point tensor, which can represent the `nan` entries.

The result is a 2D tensor where each row corresponds to a row in the input DataFrame `df`, and each column holds one previous value (lag) of the item in that row, with the most recent lag in the last column. Entries for which an item does not yet have enough previous values are `nan`.

A lag matrix like this is useful in time series tasks where previous values serve as inputs for predicting the next value.
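As an illustration of that use case, here is a minimal sketch that pairs each row's lags with the row's own value as a prediction target and drops rows with incomplete history. The names `features`, `targets`, `complete`, `X`, and `y` are assumptions made for this example; the function body is repeated only so that the block runs on its own.

```python
import pandas as pd
import torch

def create_lag_matrix(df, n):
    # Same logic as the solution cell, repeated to keep this sketch self-contained.
    lagged_df = pd.DataFrame(index=df.index)
    for i in range(n, 0, -1):
        lagged_df[f"lag_{i}"] = df.groupby("item")["value"].shift(i)
    return torch.tensor(lagged_df.values, dtype=torch.float)

df = pd.DataFrame(
    {
        "item": [23, 23, 23, 23, 11, 11, 11],
        "value": [9.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)

features = create_lag_matrix(df, 2)
targets = torch.tensor(df["value"].values, dtype=torch.float)

# Keep only the rows whose two previous values are both available.
complete = ~torch.isnan(features).any(dim=1)
X, y = features[complete], targets[complete]

print(X)
print(y)
```

Whether rows with partial history should be dropped or imputed depends on the model; dropping them is just the simplest choice for a sketch.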

In [3]:
# dataframe to test
df = pd.DataFrame(
    {
        "item": [23, 23, 23, 23, 11, 11, 11],
        "value": [9.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)

df

|  | item | value |
|---|---|---|
| 0 | 23 | 9.0 |
| 1 | 23 | 2.0 |
| 2 | 23 | 3.0 |
| 3 | 23 | 4.0 |
| 4 | 11 | 5.0 |
| 5 | 11 | 6.0 |
| 6 | 11 | 7.0 |


In [4]:
# test the function
display(create_lag_matrix(df, 2))

tensor([[nan, nan],
        [nan, 9.],
        [9., 2.],
        [2., 3.],
        [nan, nan],
        [nan, 5.],
        [5., 6.]])
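To check the result against the expected tensor programmatically, note that NaN never compares equal to itself, so an element-wise `==` comparison reports a mismatch even when the tensors agree. One option is `torch.allclose` with `equal_nan=True`. In this sketch, `result` is typed in by hand from the output shown above rather than computed.

```python
import torch

# Values copied from the printed output of create_lag_matrix(df, 2).
result = torch.tensor(
    [
        [float("nan"), float("nan")],
        [float("nan"), 9.0],
        [9.0, 2.0],
        [2.0, 3.0],
        [float("nan"), float("nan")],
        [float("nan"), 5.0],
        [5.0, 6.0],
    ]
)

# Expected tensor from the task description.
expected = torch.tensor(
    [
        [torch.nan, torch.nan],
        [torch.nan, 9.0],
        [9.0, 2.0],
        [2.0, 3.0],
        [torch.nan, torch.nan],
        [torch.nan, 5.0],
        [5.0, 6.0],
    ]
)

# Element-wise equality fails wherever both sides are NaN.
print(bool((result == expected).all()))

# Treating NaNs in matching positions as equal gives the intended check.
print(torch.allclose(result, expected, equal_nan=True))
```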

# Task 2: Bonus

Extend the function to handle gaps in the time series data. The "time" column gives the time step of each
row; a gap (a missing time step) should result in NaN entries in the output tensor.

For n=2 and the input dataframe:

```python
pd.DataFrame(
    {
        "item": [23, 23, 23, 23, 11, 11, 11],
        "time": [0, 1, 4, 5, 32, 34, 35],
        "value": [9.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)

```

the expected output is:

```python
torch.tensor(
    [
        [torch.nan, torch.nan],
        [torch.nan, 9.0],
        [torch.nan, torch.nan],
        [torch.nan, 3.0],
        [torch.nan, torch.nan],
        [5.0, torch.nan],
        [torch.nan, 6.0],
    ]
)

```

# Solution

In [5]:
# function to print dataframe as full, can be used to test intermediate dataframes in an updated create_lag_matrix() function
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
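Note that `pd.reset_option` restores the pandas default, not whatever value was set before the call. If that matters, a possible variant uses `pd.option_context`, which restores the previous setting when the block exits. The name `print_full_ctx` is hypothetical and chosen only to distinguish it from the helper above.

```python
import pandas as pd

def print_full_ctx(x):
    # Lift the row limit only inside the with-block; the previous setting
    # is restored afterwards, even if printing raises an exception.
    with pd.option_context("display.max_rows", None):
        print(x)

# Usage: print a frame that would normally be truncated.
print_full_ctx(pd.DataFrame({"a": range(100)}))
```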

The following updated `create_lag_matrix` function also accounts for gaps in the time series, based on the "time" column.
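The core idea is to make the gaps explicit as rows with NaN values, so that a row-based `shift` again corresponds to a shift in time. A minimal sketch of that intermediate step for a single item, assuming integer time steps starting at 0 (the names `grid` and `lag_1` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "item": [23, 23, 23],
        "time": [0, 1, 4],
        "value": [9.0, 2.0, 3.0],
    }
)

# Every item paired with every integer time step from 0 to the largest time.
grid = pd.MultiIndex.from_product(
    [df["item"].unique(), np.arange(df["time"].max() + 1)],
    names=["item", "time"],
)

# Left-merging the observations onto the grid leaves NaN at times 2 and 3.
df_full = pd.DataFrame(index=grid).reset_index().merge(
    df, how="left", on=["item", "time"]
)
print(df_full)

# On the filled grid, shifting by one row is the same as shifting by one time step,
# so the row at time 4 sees the (missing) value at time 3.
lag_1 = df_full.groupby("item")["value"].shift(1)
print(lag_1)
```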

In [6]:
import pandas as pd
import numpy as np
import torch

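def create_lag_matrix(df, n):
    # Build the full (item, time) grid: every item paired with every integer
    # time step from 0 up to the largest observed time
    all_time_steps = pd.MultiIndex.from_product([df['item'].unique(), np.arange(df['time'].max() + 1)], 
                                                names=['item', 'time'])
    
    all_time_steps_df = pd.DataFrame(index=all_time_steps).reset_index()

    # Left-merge the original rows onto the grid; missing time steps get NaN values.
    # The '_merge' indicator column records which grid rows exist in the original DataFrame
    df_full = pd.merge(all_time_steps_df, df, how='left', on=['item', 'time'], indicator=True)
    # print_full(df_full)

    # Create the lagged DataFrame; on the full grid, shifting by i rows equals shifting by i time steps
    lagged_df = pd.DataFrame(index=df_full.index)
    for i in range(n, 0, -1):
        lagged_df[f'lag_{i}'] = df_full.groupby('item')['value'].shift(i)
    
    # Only keep the rows that were present in the original DataFrame. Using the merge
    # indicator instead of value.notna() also keeps original rows whose value is NaN
    lagged_df = lagged_df[df_full['_merge'] == 'both']

    return torch.tensor(lagged_df.values, dtype=torch.float)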

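This function first builds a DataFrame (`all_time_steps_df`) that pairs every item with every integer time step from 0 up to the largest value in the 'time' column. It then left-merges the original DataFrame (`df`) onto this grid using 'item' and 'time' as the keys, so that time steps without an observation appear as rows with a NaN value. Because the grid has no gaps, shifting the 'value' column by `i` rows within each item now corresponds to looking back exactly `i` time steps. Finally, the function keeps only the rows that were present in the original DataFrame, so the output has one row per input row, and returns the lagged values as a PyTorch tensor.

This approach assumes non-negative integer time steps and input rows that are grouped by item and sorted by time; the grid also grows with the number of items times the largest time value, which can become expensive for sparse data.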
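When the time range is large and sparsely populated, building the full grid can be wasteful. One possible alternative, sketched here under the assumption that each (item, time) pair occurs at most once, looks up each lag with a merge instead: moving every observation `i` steps into the future makes a row at time `t` line up with the observation at time `t - i`. The function name `create_lag_matrix_sparse` is hypothetical.

```python
import pandas as pd
import torch

def create_lag_matrix_sparse(df, n):
    # Assumes each (item, time) pair is unique; duplicates would multiply rows.
    keys = df[["item", "time"]]
    columns = {}
    for i in range(n, 0, -1):
        # Move every observation i steps forward in time, so that a row at
        # time t matches the observation originally made at time t - i.
        shifted = df[["item", "time", "value"]].assign(time=df["time"] + i)
        # A left merge keeps the row order of the original data and yields
        # NaN wherever no observation exists at time t - i.
        merged = keys.merge(shifted, how="left", on=["item", "time"])
        columns[f"lag_{i}"] = merged["value"].to_numpy()
    lagged_df = pd.DataFrame(columns, index=df.index)
    return torch.tensor(lagged_df.values, dtype=torch.float)

df = pd.DataFrame(
    {
        "item": [23, 23, 23, 23, 11, 11, 11],
        "time": [0, 1, 4, 5, 32, 34, 35],
        "value": [9.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)
print(create_lag_matrix_sparse(df, 2))
```

This variant does not depend on the rows being sorted, and its cost grows with the number of rows times `n` rather than with the length of the time range.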

In [7]:
# dataframe to test
df = pd.DataFrame(
    {
        "item": [23, 23, 23, 23, 11, 11, 11],
        "time": [0, 1, 4, 5, 32, 34, 35],
        "value": [9.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)

df

|  | item | time | value |
|---|---|---|---|
| 0 | 23 | 0 | 9.0 |
| 1 | 23 | 1 | 2.0 |
| 2 | 23 | 4 | 3.0 |
| 3 | 23 | 5 | 4.0 |
| 4 | 11 | 32 | 5.0 |
| 5 | 11 | 34 | 6.0 |
| 6 | 11 | 35 | 7.0 |


In [8]:
# test the function
new_tensor = create_lag_matrix(df, 2)

In [9]:
# show the resulting tensor
new_tensor

tensor([[nan, nan],
        [nan, 9.],
        [nan, nan],
        [nan, 3.],
        [nan, nan],
        [5., nan],
        [nan, 6.]])