## GREMLIN dataset timeslice R&D

This dataset comes from https://mountainscholar.org/handle/10217/235392 and is, for the most part, licensed CC BY.

The paper describing the U-Net they used is here: https://journals.ametsoc.org/view/journals/apme/60/1/jamc-d-20-0084.1.xml?tab_body=pdf

In [1]:
import xarray as xr  # reading .nc files also needs a netCDF backend installed (e.g. netCDF4)
import numpy as np
# from numpy import random
import pandas as pd

import random

import metpy

import matplotlib.pyplot as plt

# Loading in the netCDF

In [2]:
data = 'gremlin_conus2_dataset.nc'

In [3]:
ds = xr.open_dataset(data)

ds

The dataset loads fine, but it has no coordinates. A short explainer of the xarray terminology: https://docs.xarray.dev/en/stable/user-guide/terminology.html



In [4]:
ds.time

Although lat and long change over time, let's assign some coordinates.

In [5]:
ds = ds.assign_coords(time=ds.time)
ds
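To make the difference between a data variable and a coordinate concrete, here is a minimal sketch on a toy dataset. The layout (a `sample` dimension, a `field` variable, daily timestamps) is an assumption for illustration only and is not the structure of the GREMLIN file. It uses `set_coords`, which promotes an existing variable to a coordinate.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset (hypothetical layout): "time" is stored as an ordinary data
# variable along a "sample" dimension, so nothing is labeled as a coordinate.
toy = xr.Dataset(
    {
        "time": ("sample", pd.date_range("2020-01-01", periods=4, freq="D").values),
        "field": (("sample", "y", "x"), np.zeros((4, 2, 3))),
    }
)
print(list(toy.coords))

# set_coords promotes an existing data variable to a coordinate.
with_coord = toy.set_coords("time")
print(list(with_coord.coords))
print(list(with_coord.data_vars))
```

After the promotion, `time` is listed under the coordinates and only `field` remains a data variable.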

In [6]:
num_slices = len(ds.time)
print('Number of time slices:', num_slices)

Number of time slices: 2246


In [7]:
def int_splits(int_length, train=0.8, test=0.15, val=0.05):
    '''
    Function to split an integer into separate
    training, testing and validation counts.

    Integers are commonly the number of timesteps, or the
    length of some other timeseries.

    int_length = the integer you want to split up
    train + test + val needs to equal 1 to work!

    '''
    _sum = train + test + val

    # Raise instead of printing: a printed error would let the function
    # return None and fail later, when the caller unpacks the result.
    if not isinstance(int_length, int):
        raise TypeError('int_length is not an integer')
    for name, frac in (('train', train), ('test', test), ('val', val)):
        if frac <= 0:
            raise ValueError(f'bad value for {name}: {frac}')
    # Compare with a tolerance, since float sums can miss 1 by a rounding error
    if abs(_sum - 1) > 1e-9:
        raise ValueError(f'train+test+val equals {_sum} instead of 1')

    n_train = int(train * int_length)
    n_test = int(test * int_length)
    n_val = int(val * int_length)
    # int() truncates, so a few slices can be left over; give them to train
    _diff = int_length - (n_train + n_test + n_val)
    n_train = n_train + _diff

    return n_train, n_test, n_val

In [8]:
train_set, test_set, val_set = int_splits(int_length=num_slices)
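As a worked example of the rounding step, the sketch below repeats the arithmetic for the 2246 time slices with the default 0.8 / 0.15 / 0.05 fractions. The `fractions` and `counts` names are only for illustration and are not part of the notebook.

```python
int_length = 2246
fractions = {"train": 0.8, "test": 0.15, "val": 0.05}

# int() truncates each product: 1796.8 -> 1796, 336.9 -> 336, 112.3 -> 112
counts = {name: int(frac * int_length) for name, frac in fractions.items()}

# The truncated counts sum to 2244, so 2 slices are left over
remainder = int_length - sum(counts.values())

# The leftover slices are added to the training set
counts["train"] += remainder

print(remainder)  # 2
print(counts)     # {'train': 1798, 'test': 336, 'val': 112}
```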

I thought about how to automate this, but in the end the variables that are needed will have to be selected by hand.

In [9]:
def randomizer(num_train: int, num_test: int, num_val: int):
    '''
    Create lists of randomly sampled slice indices drawn from the entire dataset.
    '''
    # Train, test, and validation counts together give the total number of slices
    total = num_train + num_test + num_val
    
    options = random.sample(range(total), k=total)

    train = options[:num_train]
    print('number of training slices:', len(train))
    
    test = options[num_train:num_train + num_test]
    print('number of testing slices:', len(test))
    
    val = options[num_train + num_test:]
    print('number of validation slices:', len(val))
    
    return train, test, val

In [10]:
train_rnd, test_rnd, val_rnd = randomizer(train_set, test_set, val_set)

number of training slices: 1798
number of testing slices: 336
number of validation slices: 112
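Because `random.sample` draws from the module's global generator, the indices above change on every run. If a repeatable split is wanted, one option is a dedicated seeded generator. The sketch below is a hypothetical variant (`seeded_split` and its `seed` parameter are not part of the notebook), not a replacement for `randomizer`.

```python
import random

def seeded_split(num_train, num_test, num_val, seed=0):
    """Shuffle range(total) with a seeded generator and cut it into three lists."""
    total = num_train + num_test + num_val
    rng = random.Random(seed)  # local generator, leaves the global state alone
    options = rng.sample(range(total), k=total)

    train = options[:num_train]
    test = options[num_train:num_train + num_test]
    val = options[num_train + num_test:]
    return train, test, val

# Same seed, same split; a different seed gives a different shuffle.
first = seeded_split(1798, 336, 112, seed=42)
second = seeded_split(1798, 336, 112, seed=42)
print(first == second)
print([len(part) for part in first])
```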


In [11]:
np.shape(ds.GOES_ABI_C07.data[train_rnd]) == np.shape(ds.GOES_ABI_C07.data[:train_set])

True

In [12]:
np.shape(ds.GOES_ABI_C07.data[test_rnd]) == np.shape(ds.GOES_ABI_C07.data[train_set:(train_set+test_set)])

True

In [13]:
np.shape(ds.GOES_ABI_C07.data[val_rnd]) == np.shape(ds.GOES_ABI_C07.data[(train_set+test_set):(train_set+test_set+val_set)])

True
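The shape comparisons above only confirm that each index list has the expected length. A stricter check is that the three lists are disjoint and together cover every slice exactly once. A minimal sketch, using a hypothetical helper `is_partition` and a freshly shuffled stand-in for the notebook's index lists:

```python
import random

def is_partition(train, test, val, total):
    """Return True if the index lists are disjoint and cover range(total)."""
    combined = list(train) + list(test) + list(val)
    # Equal length plus equal sets rules out both duplicates and gaps
    return len(combined) == total and set(combined) == set(range(total))

# Stand-in for train_rnd, test_rnd, val_rnd with the sizes 1798 / 336 / 112
total = 2246
shuffled = random.sample(range(total), k=total)
train, test, val = shuffled[:1798], shuffled[1798:2134], shuffled[2134:]

print(is_partition(train, test, val, total))
```

In the notebook, the same call could be made with `train_rnd`, `test_rnd`, `val_rnd` and `num_slices`.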