# Data Preparation

This script splits the meteorological data into three training and three validation data sets:
1. Single frame
2. Extreme frames
3. Middle frames

These are used for the single-frame, late-fusion, and early- and slow-fusion approaches, respectively.

Training and validation data are split by year. The validation years are:
1. 1993
2. 1996
3. 1999
4. 2002
5. 2005
6. 2008
7. 2011
8. 2014

Holding out every third year for validation spreads both the training and the validation data across the full time frame of this study.
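
The two year lists can also be derived rather than typed by hand. The sketch below assumes the study period runs from 1993 to 2017 inclusive, which is the range covered by the year lists used later in this notebook. The names `study_years`, `validation_years_derived`, and `training_years_derived` are illustrative and do not appear in the original code.

```python
# Assumption: the study period spans 1993-2017 inclusive.
study_years = range(1993, 2018)

# Every third year from 1993 up to and including 2014 is held out for validation.
validation_years_derived = list(range(1993, 2015, 3))

# All remaining years in the study period are used for training.
training_years_derived = [y for y in study_years if y not in validation_years_derived]

print(validation_years_derived)
print(training_years_derived)
```

Deriving the lists this way makes it easy to confirm that no year appears in both sets.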

In [2]:
import numpy as np

## 1. Load Data

To begin, the meteorological data is loaded from pre-extracted files into training and validation lists. These lists are then used to build each of the data sets described in the introduction.

In [7]:
training_years = [1994, 1995, 1997, 1998, 2000, 2001, 2003, 2004, 2006, 2007, 2009, 2010, 2012, 2013, 2015, 2016, 2017]
validation_years = [1993, 1996, 1999, 2002, 2005, 2008, 2011, 2014]

data_folder = "E:/31-12-2020/forecastee-data/"

In [37]:
def load_meteo(years, data_folder):
    """ Loads the meteorology (mean sea level pressure and 2m air temperature) for each year
        provided. For each month of that year the MSLP and 2m air temperature are combined into a
        single matrix of size [2, time, 61, 121]. Months with a file that cannot be loaded are skipped.
        Parameters:
            years (list<int>): The years to load data for.
            data_folder (str): Root folder containing the 'msl' and 't2m' subfolders.
        Returns:
            List<Numpy Matrix>: List of monthly matrices of size [2, time, 61, 121]."""
    months = []
    for y in years:
        for m in range(1, 13):
            month_data = []
            try:
                for v in ['msl', 't2m']:
                    data_file = data_folder + "{}/{}-{}.npy".format(v, m, y)
                    month_data.append(np.load(data_file))
            except Exception:
                # v still holds the variable whose file failed to load.
                print("Unable to load {}/{}-{}".format(v, m, y))
            else:
                # Only keep the month when both variables loaded successfully.
                months.append(np.array(month_data))
    return months
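
To make the output shape concrete, the following sketch stacks two synthetic arrays in the same way `load_meteo` stacks `msl` and `t2m`. The random values and the 28-step time dimension are placeholders, not real meteorological fields.

```python
import numpy as np

# Placeholder data standing in for one month of each variable: [time, 61, 121].
rng = np.random.default_rng(0)
msl = rng.random((28, 61, 121))
t2m = rng.random((28, 61, 121))

# Wrapping the list of two arrays in np.array adds a leading variable axis.
month = np.array([msl, t2m])
print(month.shape)  # (2, 28, 61, 121)
```

Index 0 of the first axis holds the pressure field and index 1 the temperature field, matching the order of the loop over `['msl', 't2m']`.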

In [38]:
training_meteo_raw = load_meteo(training_years, data_folder)
validation_meteo_raw = load_meteo(validation_years, data_folder)

Unable to load msl/5-1994
Unable to load msl/6-1994
Unable to load msl/7-1994
Unable to load msl/8-1994
Unable to load msl/9-1994
Unable to load msl/10-1994
Unable to load msl/11-1994
Unable to load msl/12-1994
Unable to load msl/1-1995
Unable to load msl/2-1995
Unable to load msl/3-1995
Unable to load msl/4-1995
Unable to load msl/5-1995
Unable to load msl/6-1995
Unable to load msl/7-1995
Unable to load msl/8-1995
Unable to load msl/9-1995
Unable to load msl/10-1995
Unable to load msl/11-1995
Unable to load msl/12-1995
Unable to load msl/1-1997
Unable to load msl/2-1997
Unable to load msl/3-1997
Unable to load msl/4-1997
Unable to load msl/5-1997
Unable to load msl/6-1997
Unable to load msl/7-1997
Unable to load msl/8-1997
Unable to load msl/9-1997
Unable to load msl/10-1997
Unable to load msl/11-1997
Unable to load msl/12-1997
Unable to load msl/1-1998
Unable to load msl/2-1998
Unable to load msl/3-1998
Unable to load msl/4-1998
Unable to load msl/5-1998
Unable to load msl/6-1998
...
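
Because a failed load drops the whole month from the result, it can help to list which files are absent before loading. The helper below is a hypothetical addition (`find_missing_files` is not part of the original notebook). It assumes the same `<variable>/<month>-<year>.npy` layout that `load_meteo` reads.

```python
import os

def find_missing_files(years, data_folder, variables=("msl", "t2m")):
    """Return the relative paths of expected monthly files that do not exist.

    Hypothetical helper: assumes files are stored as
    <data_folder>/<variable>/<month>-<year>.npy, as in load_meteo.
    """
    missing = []
    for y in years:
        for m in range(1, 13):
            for v in variables:
                relative_path = "{}/{}-{}.npy".format(v, m, y)
                if not os.path.isfile(os.path.join(data_folder, relative_path)):
                    missing.append(relative_path)
    return missing
```

Calling `find_missing_files(training_years, data_folder)` would return a list of paths that can be inspected or counted, instead of scrolling through printed messages.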

The functions defined below convert a list of NumPy matrices, whose time dimensions differ from month to month, into a single regular-sized training matrix.

In [39]:
print(training_meteo_raw[0].shape)

(2, 28, 61, 121)

In [40]:
def single_frame(monthly_data):
    """ Averages each matrix across the time dimension so that all matrices in the
        list have the same size and can be stacked.
        Parameters:
            monthly_data (List<Numpy Matrix>): The matrices, each of size [2, time, 61, 121].
        Returns:
            Numpy Matrix:   A matrix of size [months, 2, 61, 121] holding the time-mean of
                            each input matrix."""
    composite_matrices = []
    for m in monthly_data:
        print(m.shape)
        composite_matrices.append(np.mean(m, axis=1))
    return np.array(composite_matrices)
print(single_frame(training_meteo_raw).shape)

(2, 28, 61, 121)
(2, 31, 61, 121)
(2, 30, 61, 121)
(2, 31, 61, 121)
(4, 2, 61, 121)
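
The effect of averaging over the time axis can be checked on synthetic data. In this sketch the month lengths (28, 31, 30) and the random values are placeholders; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three synthetic months with different numbers of time steps.
months = [rng.random((2, t, 61, 121)) for t in (28, 31, 30)]

# Averaging over axis 1 (time) removes the only dimension that differs,
# so the results can be stacked into one regular array.
frames = np.array([np.mean(m, axis=1) for m in months])
print(frames.shape)  # (3, 2, 61, 121)
```

The leading axis indexes the months, which is why the real run above reports one entry per successfully loaded month.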
