# A Vectorised Sliding Window

Adapted from https://towardsdatascience.com/fast-and-robust-sliding-window-vectorization-with-numpy-3ad950ed62f5


This code takes a dataset of daily temperature observations (any other daily data would work equally well). For each day between a start date and an end date (inclusive), it extracts the data within a 'window' period starting on that date and computes the mean temperature within that window. 
For example, with a window of 3 days and the dates 1 Feb 2021 to 10 Feb 2021, it computes the mean temperature for (1 Feb 2021 to 3 Feb 2021), (2 Feb 2021 to 4 Feb 2021), (3 Feb 2021 to 5 Feb 2021), ..., (10 Feb 2021 to 12 Feb 2021). The last few windows therefore extend past the end date.

In [1]:
# Module import
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.dates as dt

def date_index(y, m, d, y0 = 1942, m0 = 1, d0 = 1):
    
    '''
    This function computes the index value of the row in a dataset of once-daily data for a given date y,m,d. The 
    dataset used in this example has a starting date of 1 January 1942, so this is set as the origin date y0,m0,d0.
    A different origin date can be set by overriding the values of y0, m0 and d0 at the function call. 
    '''
    date_obj0 = datetime(y0,m0,d0)
    d0 = dt.date2num(date_obj0)
    
    date_obj = datetime(y,m,d)
    d = dt.date2num(date_obj)
    
    return int(d - d0)
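
As a quick sanity check on the idea, the same offset can be computed with the standard library alone, because subtracting two `date` objects yields a whole number of days. The sketch below is an alternative to the matplotlib-based helper above, not the version used in the rest of this notebook, and the name `date_index_stdlib` is hypothetical.

```python
from datetime import date

def date_index_stdlib(y, m, d, y0=1942, m0=1, d0=1):
    """Row index of date (y, m, d) in once-daily data that starts on (y0, m0, d0)."""
    return (date(y, m, d) - date(y0, m0, d0)).days

print(date_index_stdlib(1942, 1, 1))   # 0: the origin date is row 0
print(date_index_stdlib(1942, 1, 2))   # 1
print(date_index_stdlib(1942, 2, 1))   # 31: January has 31 days
```

This only holds if the dataset has exactly one row per calendar day with no gaps, which is the same assumption the original helper makes.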

First we set up the start and end dates of our analysis, plus the window size as described above. We also import the data file (a .csv file) using the pandas module, then extract the column that we need and convert it to a numpy array. 

In [2]:
#=================================Parameter Setup==============================
START_DATE = (2020, 1, 1) # Variable name in capitals for hard-coded constants
END_DATE = (2020, 1, 20)

WINDOW = 5
#===================================Data Input=================================
t_max_pd = pd.read_csv('/Users/nick/Documents/Science/Meteorology_and_Climate/Data/wagga_tmax.csv')
#t_min_pd = pd.read_csv('/Users/nick/Documents/Science/Meteorology_and_Climate/Data/wagga_tmin.csv')

# Extract the column for temperature, and convert it to a numpy array. 
t_max_array = np.array(t_max_pd.iloc[:,3])
#t_min_array = np.array(t_min_pd.iloc[:,3])

To help conceptualise the sliding window (and for the sake of comparison), we first implement it with a loop through the temperature data. For each day between (and including) the start date and the end date, the code extracts the temperature value for that day plus the next 4 days (for a total of 5 days), computes the mean temperature (ignoring NaN values), and saves the result in a new numpy array:

In [4]:
#=========================Sliding Window the Loop Way==========================
start_index = date_index(START_DATE[0], START_DATE[1], START_DATE[2])
end_index = date_index(END_DATE[0], END_DATE[1], END_DATE[2])

looprange = np.arange(start_index, end_index  + 1)

output_results_loop = np.zeros((len(looprange)))

for temp_index in looprange:
    
    window_temps = t_max_array[temp_index : temp_index + WINDOW]
    
    results_index = temp_index - start_index
    output_results_loop[results_index] = np.nanmean(window_temps)

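In each iteration of the loop above, the rows of the temperature data were indexed with a slice of WINDOW consecutive positions: the first iteration reads rows start_index to start_index + WINDOW - 1, the second reads rows start_index + 1 to start_index + WINDOW, and so on, shifting by one row each time.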

Instead of indexing the temperature data array with integer variables that increment by 1 on each iteration of a loop, we can index its rows all in one go with a 2-D numpy array in which each row holds the index values of one window. With a start index s and a window of 5, the rows of this indexing array are [s, s+1, ..., s+4], [s+1, s+2, ..., s+5], [s+2, s+3, ..., s+6], and so on, one row per window.
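
A toy-sized sketch of such an indexing array, assuming a hypothetical 10-element series `toy_data`, a start index of 2, a window of 3 and four windows (none of these values come from the temperature dataset). The numbers are written out by hand here purely to show what the array looks like.

```python
import numpy as np

toy_data = np.arange(10, 20, dtype=float)  # hypothetical stand-in: values 10..19

# Each row lists the positions belonging to one window (start index 2, window 3).
index_array = np.array([[2, 3, 4],
                        [3, 4, 5],
                        [4, 5, 6],
                        [5, 6, 7]])

# Indexing a 1-D array with a 2-D integer array returns an array with the
# shape of the index array: one row of data values per window.
windows = toy_data[index_array]
print(windows.shape)                        # (4, 3)

window_means = np.nanmean(windows, axis=1)  # mean along each row
print(window_means)                         # [13. 14. 15. 16.]
```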

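To make it easier to create this indexing array systematically without using loops (or worse still, hard-coding the numbers for each row...), we can decompose it into the sum of two new arrays: a row vector [0, 1, ..., WINDOW - 1] holding the offsets within a window, and a column vector [0, 1, ..., n_windows - 1] holding the offset of each window from the first. Adding the start index then shifts everything to the first day of the analysis.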

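With the help of numpy broadcasting, this addition is easy to set up: adding a row vector of shape (1, WINDOW) to a column vector of shape (n_windows, 1) stretches each operand along its length-1 axis and produces the full (n_windows, WINDOW) array.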
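
The decomposition and the broadcast sum can be sketched on toy values as follows. The names `offsets_in_window` and `window_starts` and the small parameter values are hypothetical, chosen only so that the result is easy to read.

```python
import numpy as np

WINDOW = 3        # hypothetical toy values, not those used in the analysis
n_windows = 4
start_index = 2

offsets_in_window = np.arange(WINDOW)[np.newaxis, :]   # shape (1, 3): [[0 1 2]]
window_starts = np.arange(n_windows)[:, np.newaxis]    # shape (4, 1): [[0] [1] [2] [3]]

# Broadcasting stretches both operands to shape (4, 3) before adding them.
index_array = start_index + offsets_in_window + window_starts
print(index_array)
# [[2 3 4]
#  [3 4 5]
#  [4 5 6]
#  [5 6 7]]
```

Indexing with `np.newaxis` is used here as a shorthand for `np.expand_dims` followed by a transpose; both give the same shapes.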

Applied to the temperature data, the resulting code looks like this:

In [5]:
#=========================Sliding Window the Vectorised Way====================

start_index = date_index(START_DATE[0], START_DATE[1], START_DATE[2])
end_index = date_index(END_DATE[0], END_DATE[1], END_DATE[2])

n_windows = end_index - start_index + 1

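# Row vector of offsets within a window, shape (1, WINDOW)
first_array = np.expand_dims(np.arange(0, WINDOW, 1), axis = 0)
# Column vector of window offsets from the first window, shape (n_windows, 1)
second_array = np.expand_dims(np.arange(0, n_windows, 1), axis = 0).transpose()

# Broadcasting gives one row of indices per window, shape (n_windows, WINDOW)
index_array = start_index + first_array + second_array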

output_results_vector = np.nanmean((t_max_array[index_array]), axis = 1)

We can confirm that the two methods produce equal results by computing the difference between the two results arrays:

In [6]:
print(output_results_vector - output_results_loop)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
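
Printing the difference works for a short date range, but a single boolean is easier to read for a long one, and a window made up entirely of NaN would show up as `nan` rather than `0.` in the difference. Below is a minimal sketch of a stricter check, run on synthetic data (the array `toy_temps` and the toy parameters are hypothetical). It also cross-checks against `np.lib.stride_tricks.sliding_window_view`, which is available in NumPy 1.20 and later and is a different technique from the index-array approach used above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
toy_temps = rng.normal(25.0, 5.0, size=50)   # hypothetical stand-in for t_max_array
toy_temps[[7, 8, 30]] = np.nan               # a few missing observations

WINDOW, start_index, n_windows = 5, 10, 20   # toy parameters

# Loop version: one slice per window
loop_means = np.array([np.nanmean(toy_temps[i:i + WINDOW])
                       for i in range(start_index, start_index + n_windows)])

# Index-array version: all windows gathered in one indexing operation
index_array = (start_index
               + np.arange(WINDOW)[np.newaxis, :]
               + np.arange(n_windows)[:, np.newaxis])
vector_means = np.nanmean(toy_temps[index_array], axis=1)

# Strided-view version: every length-WINDOW window, then keep the rows we need
views = sliding_window_view(toy_temps, WINDOW)[start_index:start_index + n_windows]
view_means = np.nanmean(views, axis=1)

print(np.allclose(loop_means, vector_means, equal_nan=True))   # True
print(np.allclose(loop_means, view_means, equal_nan=True))     # True
```

One behavioural difference is worth knowing: near the end of the data, the slice in the loop version silently shortens the window, whereas the index-array version raises an `IndexError` if any index runs past the last row.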
