# Data Prep

This script splits the meteorological data into three training and three validation data sets:
1. Single Frame
2. Extreme Frames
3. Middle Frames

These are used for the single frame, late fusion, and early & slow fusion models respectively.

Training and validation data are split by year. The validation years are as follows:
1. 1993
2. 1996
3. 1999
4. 2002
5. 2005
6. 2008
7. 2011
8. 2014

Holding out every third year spreads the validation data evenly across the time frame of this study, so both the training and validation sets cover the whole period.
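
As a minimal sketch, the same split can be derived rather than typed out by hand. The 1993 to 2015 bounds are an assumption taken from the year lists defined later in this notebook, and `first_year` and `last_year` are illustrative names.

```python
# Assumed study period, inferred from the year lists used later in this notebook.
first_year, last_year = 1993, 2015

# Every third year, starting with the first, is held out for validation.
validation_years = list(range(first_year, last_year, 3))

# All remaining years in the period are used for training.
training_years = [y for y in range(first_year, last_year + 1)
                  if y not in validation_years]
```

This gives 8 validation years and 15 training years.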

In [25]:
from tqdm.notebook import tqdm
import numpy as np

## 1. Load Data

To begin, the meteorological data is loaded from pre-extracted files into training and validation lists. These lists are then used to create each of the data sets described in the introduction.

In [26]:
training_years = [1994, 1995, 1997, 1998, 2000, 2001, 2003, 2004, 2006, 2007, 2009, 2010, 2012, 2013, 2015]
validation_years = [1993, 1996, 1999, 2002, 2005, 2008, 2011, 2014]

data_folder = "E:/31-12-2020/forecastee-data/"
rainfall_file = "./data/rainfall/truth_rf.npy"

In [38]:
def load_data(years, data_folder, rainfall_file):
    """ Loads the meteorology (mean sea level pressure and 2m air temperature) and rainfall
        for each year provided. For each month of that year, the MSLP and 2m air temperature are
        combined into a single matrix of size [2, time, 61, 121], and the matching rainfall row is
        taken from a 2D array in the format [month, year, region_0_rainfall, ..., region_12_rainfall].
        Parameters:
            years (list<int>): The years to load.
            data_folder (string): Folder holding the meteorological data.
            rainfall_file (string): Path to the rainfall file.
        Returns:
            List<Numpy Matrix>: List of monthly matrices of size [2, time, 61, 121].
            Numpy Matrix: CEH-GEAR rainfall values for each month loaded, in the format:
                            [month, year, rain_region_0, ..., rain_region_12]"""
    monthly_meteo = []
    monthly_rain = []
    # Use the path passed in rather than a hard-coded one.
    rainfall = np.load(rainfall_file)
    for y in tqdm(years):
        for m in range(1, 13):
            month_data = []
            try:
                for v in ['msl', 't2m']:
                    data_file = data_folder + "{}/forecasted-months/{}-{}.npy".format(v, m, y)
                    data = np.load(data_file)
                    # Some files carry an extra leading dimension; keep its first entry.
                    if data.ndim != 3:
                        data = data[0, :, :, :]
                    month_data.append(data)
                # Get the rainfall row for this month and year
                mrain = rainfall[(rainfall[:, 0] == m) & (rainfall[:, 1] == y), :]
            except Exception:
                # Skip the whole month if either variable fails to load.
                print("Unable to load {}/{}-{}".format(v, m, y))
            else:
                monthly_meteo.append(np.array(month_data))
                monthly_rain.append(mrain)
    return monthly_meteo, np.squeeze(monthly_rain)
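
The rainfall lookup relies on a boolean mask over the first two columns of the rainfall table. The sketch below illustrates that selection on synthetic data; the random values and the two-year range are assumptions made purely for illustration, not the CEH-GEAR data.

```python
import numpy as np

# Synthetic stand-in for the rainfall table: [month, year, region_0, ..., region_12].
rng = np.random.default_rng(0)
years = [1993, 1994]
rows = [[m, y, *rng.random(13)] for y in years for m in range(1, 13)]
rainfall = np.array(rows)  # shape (24, 15)

# Select the row for March 1994 with a boolean mask on the month and year columns.
mask = (rainfall[:, 0] == 3) & (rainfall[:, 1] == 1994)
mrain = rainfall[mask, :]  # shape (1, 15): a 2D array holding one row

# Collecting one such row per month and squeezing gives [no. months, 15].
monthly_rain = [rainfall[(rainfall[:, 0] == m) & (rainfall[:, 1] == 1994), :]
                for m in range(1, 13)]
stacked = np.squeeze(monthly_rain)  # shape (12, 15)
```

Because each selection keeps its 2D shape, `np.squeeze` is what removes the singleton axis once the rows are collected.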

In [39]:
training_meteo_raw, training_rainfall = load_data(training_years, data_folder, rainfall_file)
validation_meteo_raw, validation_rainfall = load_data(validation_years, data_folder, rainfall_file)

Unable to load msl/1-1993
Unable to load msl/2-1993



## 2. Preparation Methods

The methods defined below each convert a list of monthly numpy matrices, which differ in length along the time dimension, into a single regular-sized matrix. Each resulting data set is then saved for later use. First, we define a folder to hold the resulting data sets:

In [7]:
prepared_data_folder = "E:/31-12-2020/prepared-data/"

### 2.1 Single Frame

This first method averages across all days in each month to provide a mean forecast MSLP and 2m air temperature field for that month. The validation and training sets are then saved under the names defined below.

In [8]:
training_file = prepared_data_folder + "single_train.npy"
validation_file = prepared_data_folder + "single_valid.npy"

In [9]:
def single_frame(monthly_data):
    """ Averages across each matrix in the time dimension to produce a new
        matrix such that all matrices in the list are of equal size.
        Parameters:
            - monthly_data List<Numpy Matrix>: The matrices, each should have a size of [2, time, 61, 121].
        Returns:
            Numpy Matrix:   A matrix containing all aggregated data from the input through taking the mean of
                            the time dimension. Size: [no. months, 2, 61, 121]"""
    composite_matrices = []
    for m in monthly_data:
        # Collapse the time axis so every month has shape [2, 61, 121]
        data = np.mean(m, axis=1)
        composite_matrices.append(data)
    # Return an array, as documented, rather than a list
    return np.array(composite_matrices)
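
To illustrate why averaging makes the months stackable, here is a small sketch on synthetic data. The month lengths and constant values are assumptions chosen so the result is easy to check.

```python
import numpy as np

# Synthetic months with 28 to 31 time steps; each month is filled with its own length.
months = [np.ones((2, days, 61, 121)) * days for days in (31, 28, 31, 30)]

# Averaging over axis 1 (time) removes the only dimension that differs between months,
# so the results can be stacked into one regular array.
single = np.array([np.mean(m, axis=1) for m in months])
```

The stacked result has shape `[4, 2, 61, 121]`, one averaged frame per variable per month.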

In [10]:
training_single = single_frame(training_meteo_raw)
validation_single = single_frame(validation_meteo_raw)

In [11]:
np.save(training_file, training_single)
np.save(validation_file, validation_single)

### 2.2 Extreme Frames

This next method takes the first and last frames (days) of each month. These are then saved in files defined below:

In [12]:
training_file = prepared_data_folder + "extreme_train.npy"
validation_file = prepared_data_folder + "extreme_valid.npy"

In [13]:
def extreme_frames(monthly_data):
    """ Takes the first and last entries across the time dimension in each matrix to produce a new
        matrix such that all matrices in the list are of equal size.
        Parameters:
            - monthly_data List<Numpy Matrix>: The matrices, each should have a size of [2, time, 61, 121].
        Returns:
            Numpy Matrix:   A matrix containing the first and last frames of each month.
                            Size: [no. months, 2, 2, 61, 121], ordered as
                            [month, frame (first/last), variable, 61, 121]"""
    matrices = []
    for m in monthly_data:
        month_matrix = np.zeros((2, 2, 61, 121))
        # Index 0 holds the first frame of both variables, index 1 the last frame
        month_matrix[0, :, :, :] = m[:, 0, :, :]
        month_matrix[1, :, :, :] = m[:, -1, :, :]
        matrices.append(month_matrix)
    return np.array(matrices)
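
Note that the output puts the frame axis before the variable axis, whereas the input is ordered `[variable, time, ...]`. The sketch below reproduces that layout with `np.stack` on synthetic data; `extreme_frames_stacked` is a hypothetical name used only for this illustration.

```python
import numpy as np

def extreme_frames_stacked(monthly_data):
    """Hypothetical alternative: stack the first and last time steps of each month."""
    # m[:, 0] and m[:, -1] each have shape [2, 61, 121] (variable, lat, lon).
    # Stacking them on a new axis 0 gives [frame, variable, 61, 121].
    return np.array([np.stack([m[:, 0], m[:, -1]], axis=0) for m in monthly_data])

# Synthetic 30-step month in which every value equals its time index.
time_index = np.arange(30, dtype=float)[None, :, None, None]
month = np.broadcast_to(time_index, (2, 30, 61, 121))

out = extreme_frames_stacked([month])
```

Here `out[0, 0]` holds time step 0 for both variables and `out[0, 1]` holds the final time step.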

In [14]:
training_extreme = extreme_frames(training_meteo_raw)
validation_extreme = extreme_frames(validation_meteo_raw)

In [15]:
np.save(training_file, training_extreme)
np.save(validation_file, validation_extreme)

### 2.3 Middle Frames

This final method takes the middle 28 days of data and combines them into a single matrix. 28 days is chosen because this is the minimum number of days in a month. These are then also saved as separate datasets.

In [16]:
training_file = prepared_data_folder + "middle_train.npy"
validation_file = prepared_data_folder + "middle_valid.npy"

In [17]:
def middle_frames(monthly_data):
    """ Takes the middle 28 entries across the time dimension in each matrix to produce a new
        matrix such that all matrices in the list are of equal size.
        Parameters:
            - monthly_data List<Numpy Matrix>: The matrices, each should have a size of [2, time, 61, 121].
        Returns:
            Numpy Matrix:   A matrix containing the middle 28 frames of each month.
                            Size: [no. months, 2, 28, 61, 121]"""
    matrices = []
    for m in monthly_data:
        # Centre the 28-frame window; using shape[1] - 28 alone would select the last 28 frames
        start_index = (m.shape[1] - 28) // 2
        matrices.append(m[:, start_index:start_index+28, :, :])
    return np.array(matrices)
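
One way to centre a 28-step window is to split the surplus time steps evenly between the start and end of the month. The sketch below assumes one time step per day; `centred_window` and `WINDOW` are hypothetical names introduced for illustration.

```python
import numpy as np

WINDOW = 28  # shortest month length, in days

def centred_window(m, window=WINDOW):
    """Hypothetical helper: take the central `window` steps along the time axis."""
    start = (m.shape[1] - window) // 2
    return m[:, start:start + window, :, :]

# Start index for each possible month length.
starts = {days: (days - WINDOW) // 2 for days in (28, 29, 30, 31)}

# Synthetic 31-step month in which every value equals its time index.
time_index = np.arange(31, dtype=float)[None, :, None, None]
month = np.broadcast_to(time_index, (2, 31, 61, 121))
middle = centred_window(month)
```

For a 31-day month the window covers time indices 1 to 28. For odd surpluses such as 29 days, integer division keeps the extra day at the end of the month out of the window.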

In [18]:
training_middle = middle_frames(training_meteo_raw)
validation_middle = middle_frames(validation_meteo_raw)

In [19]:
np.save(training_file, training_middle)
np.save(validation_file, validation_middle)

### 2.4 Rainfall

Now, save the rainfall values in training and validation files.

In [20]:
training_file = prepared_data_folder + "expected_train.npy"
validation_file = prepared_data_folder + "expected_valid.npy"

In [21]:
np.save(training_file, training_rainfall)
np.save(validation_file, validation_rainfall)
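
Since each meteorological data set is paired with the rainfall values by position, it may be worth confirming that the saved files agree on the number of samples. The following is a sketch on synthetic arrays written to a temporary folder; the array contents are placeholders, not the real data.

```python
import os
import tempfile
import numpy as np

# Synthetic stand-ins for a prepared data set and its rainfall targets.
meteo = np.zeros((24, 2, 61, 121))
rain = np.zeros((24, 15))

with tempfile.TemporaryDirectory() as folder:
    meteo_file = os.path.join(folder, "single_train.npy")
    rain_file = os.path.join(folder, "expected_train.npy")
    np.save(meteo_file, meteo)
    np.save(rain_file, rain)
    # Reload from disk, as a later training script would.
    loaded_meteo = np.load(meteo_file)
    loaded_rain = np.load(rain_file)

# The first axis of both arrays indexes months, so the lengths should match.
samples_match = loaded_meteo.shape[0] == loaded_rain.shape[0]
```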