# PyTorch Dataloader for the International Tree-Ring Data Bank (ITRDB)

This notebook wraps our parsed ITRDB data in a PyTorch Dataset for use with PyTorch neural networks. The parsed data is publicly hosted in an AWS S3 bucket and is retrieved with the Python requests library. The Dataset caches the resulting DataFrame, so the HTTP request only needs to be made once per session; this lets you create multiple Datasets (train, test, and validate) with little to no wait time. For simplicity, the Dataset restricts the tree ring widths to the years 1900-2023 and drops any row that has no measurements in that window (that is, the row is all NaN between 1900 and 2023).
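To make the filtering step concrete, here is a minimal sketch on a made-up frame. The column names (`id` plus one column per year) and the values are assumptions for illustration only; the real CSV is indexed by column position rather than by name, and its layout may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: an id column followed by one column per year.
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "1898": [0.5, 0.4, 0.3],
        "1899": [0.6, np.nan, 0.2],
        "1900": [0.7, np.nan, np.nan],
        "1901": [np.nan, np.nan, 0.9],
        "1902": [0.8, np.nan, np.nan],
    }
)

# Keep only the id column and the years from 1900 onwards.
kept_years = [c for c in toy.columns if c != "id" and int(c) >= 1900]
filtered = toy[["id"] + kept_years]

# Drop rows that have no measurement at all in the retained window.
filtered = filtered.dropna(subset=kept_years, how="all")

print(filtered["id"].tolist())  # row "b" is all NaN from 1900 on, so it is dropped
```

Rows that still contain some NaN values are kept; only rows with no measurement in the whole window are removed.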

Import the necessary dependencies:

In [None]:
import torch
import requests
import pandas as pd
from io import StringIO
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

Next is the Dataset class itself. It retrieves the data from S3, caches it, and uses a split function to produce a 70-15-15 split of the data, keeping only the subset requested through the constructor's set_type parameter:

In [None]:
class TreeRingDataset(Dataset):

  _cache = None

  def __init__(self, set_type="train"):

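    # Download and parse the CSV once; later instances reuse the cached DataFrame.
    if TreeRingDataset._cache is None:
      res = requests.get("https://paleo-data.s3.amazonaws.com/data.csv")
      res.raise_for_status()  # fail on an HTTP error instead of parsing an error page
      TreeRingDataset._cache = pd.read_csv(StringIO(res.text), sep=",")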

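    # Work on a copy so the cached DataFrame is never modified.
    self.df = TreeRingDataset._cache.copy()
    # Drop the columns at positions 1-1899 so that positions 1-124 hold the 1900-2023 widths.
    self.df.drop(self.df.columns[list(range(1, 1900))], axis=1, inplace=True)
    # Drop rows with no measurement anywhere in the 1900-2023 window.
    self.df.dropna(subset=self.df.columns[1:125], how='all', inplace=True)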

    training, test, validate = self.__split()
    self.df = training if set_type == "train" else (test if set_type == "test" else validate)

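  def __split(self):
    # A fixed random_state (the value 42 is arbitrary) makes every instance compute the
    # same partition, so subsets built by separate instances never overlap.
    train, rest = train_test_split(self.df, train_size=.70, stratify=self.df['loc'], random_state=42)
    validate, test = train_test_split(rest, train_size=.5, stratify=rest['loc'], random_state=42)
    return (train, test, validate)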

  def __len__(self):
    return self.df.shape[0]

  def __getitem__(self, index):
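    # Convert to float64 NumPy arrays first so torch receives a plain numeric array.
    label = torch.tensor(self.df.iloc[index, 127:131].to_numpy(dtype="float64"))
    x = torch.tensor(self.df.iloc[index, 1:125].to_numpy(dtype="float64"))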

    return x, label
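Because every `TreeRingDataset` instance recomputes the split from the cached frame, the three subsets only partition the data if each instance shuffles identically. The sketch below illustrates this on synthetic data. The helper name `split_70_15_15`, the `seed` value, and the toy columns are assumptions for illustration, not part of the notebook's dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_70_15_15(df, seed=42):
    """Hypothetical helper: stratified 70-15-15 split with a fixed seed."""
    train, rest = train_test_split(
        df, train_size=0.70, stratify=df["loc"], random_state=seed
    )
    validate, test = train_test_split(
        rest, train_size=0.50, stratify=rest["loc"], random_state=seed
    )
    return train, test, validate


# Synthetic stand-in for the real frame: 200 rows, two balanced locations.
toy = pd.DataFrame({"value": range(200), "loc": ["north", "south"] * 100})

# Two independent calls mimic two separate Dataset instances.
train_a, test_a, validate_a = split_70_15_15(toy)
train_b, test_b, validate_b = split_70_15_15(toy)

# With the same seed, both calls yield the same partition ...
same_each_time = train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)

# ... so subsets taken from different calls never share a row.
no_overlap = (
    set(train_a.index).isdisjoint(test_b.index)
    and set(train_a.index).isdisjoint(validate_b.index)
    and set(test_a.index).isdisjoint(validate_b.index)
)

print(same_each_time, no_overlap)
```

Without a shared seed, each call would shuffle differently, and rows used for training in one instance could appear in the test or validation subset of another.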

The class can then be instantiated once for each of the train, test, and validate sets. Only the first instantiation sends an HTTP request, so all three finish fairly quickly:

In [None]:
train_data = TreeRingDataset(
    set_type="train"
)

test_data = TreeRingDataset(
    set_type="test"
)

validate_data = TreeRingDataset(
    set_type="validate"
)

Finally, we can wrap each Dataset in a DataLoader:

In [None]:
train_dataloader = DataLoader(train_data, batch_size=32)
test_dataloader = DataLoader(test_data, batch_size=32)
validate_dataloader = DataLoader(validate_data, batch_size=32)
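The Dataset only drops rows that are entirely NaN, so individual batches can still contain NaN for years without a measurement. One possible way to handle this, sketched here under assumptions, is a custom `collate_fn` that zero-fills the gaps and returns a mask of the observed values. The function name `collate_with_nan_mask`, the zero fill value, and the toy tensors are illustrative choices, not something the notebook prescribes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def collate_with_nan_mask(batch):
    """Hypothetical collate function: zero-fill NaNs and report which values were observed."""
    features = torch.stack([x for x, _ in batch])
    labels = torch.stack([y for _, y in batch])
    mask = ~torch.isnan(features)  # True where a ring width was measured
    features = torch.nan_to_num(features, nan=0.0)
    return features, mask, labels


# Toy stand-in for (ring widths, label) pairs; the values are made up.
toy_x = torch.tensor(
    [[0.5, float("nan"), 0.7], [float("nan"), 0.2, 0.3]], dtype=torch.float64
)
toy_y = torch.tensor([[31.22, -84.48], [34.30, -94.65]], dtype=torch.float64)

loader = DataLoader(
    TensorDataset(toy_x, toy_y), batch_size=2, collate_fn=collate_with_nan_mask
)
features, mask, labels = next(iter(loader))

print(features)
print(mask)
```

A model can then use the mask to ignore the filled positions, for example when computing a loss or a mean over years. Whether zero is a sensible fill value depends on the model, since a ring width of 0.0 is also a legitimate measurement.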

As an example of what the data looks like, we can display the first batch:

In [None]:
train_features, train_labels = next(iter(train_dataloader))
print(train_features)
print(train_labels)

tensor([[   nan,    nan,    nan,  ...,    nan,    nan,    nan],
        [1.0320, 0.9730, 0.6680,  ...,    nan,    nan,    nan],
        [0.2320, 0.1790, 0.0980,  ...,    nan,    nan,    nan],
        ...,
        [0.6200, 0.7800, 0.4900,  ...,    nan,    nan,    nan],
        [0.3000, 0.3630, 0.2230,  ...,    nan,    nan,    nan],
        [0.1100, 0.1400, 0.0000,  ...,    nan,    nan,    nan]],
       dtype=torch.float64)
tensor([[  31.2200,   31.2200,  -84.4800,  -84.4800],
        [  34.3000,   34.3000,  -94.6500,  -94.6500],
        [  43.3319,   43.3319, -110.7991, -110.7991],
        [  33.1300,   33.1300, -116.6000, -116.6000],
        [  47.5300,   47.5300, -121.0500, -121.0500],
        [  37.1500,   37.1500,  -91.0800,  -91.0800],
        [  37.8700,   37.8700, -119.3700, -119.3700],
        [  35.9000,   35.9000, -107.6300, -107.6300],
        [  38.8300,   38.8300, -108.5700, -108.5700],
        [  48.2300,   48.2300,  -90.9000,  -90.9000],
        [  37.7700,   37.7700, -11