# Demo Notebook:

This notebook demonstrates the data-loader functionality. The data-loader class code is in the 'Dataset' directory. The data generation processes are in the 'DataGen' directory, if you wish to see how the data prepared by these loaders originated.

In [8]:
#imports
import numpy as np
import pandas as pd

#torch imports
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

#transformer imports
from transformers import DistilBertModel, DistilBertTokenizer
from transformers import DataCollatorWithPadding

#misc. imports
import sys
import os
import time
import logging
import json

#hide the beta-transforms warning emitted by torchvision
import torchvision
torchvision.disable_beta_transforms_warning()

#set up logging config
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler()]
)

#plotting
import matplotlib.pyplot as plt



### Dataloader Demo
In the following two cells, I import the create_dataloaders function from the Dataset directory and use it to generate the dataloaders for each data split.
To do so, I temporarily add the project root to the Python path so that the project's modules can be imported for use in this demo.
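
A path added with sys.path.append stays in place for the rest of the kernel session. If a strictly temporary change is wanted, one option is a small context manager. The sketch below is only an illustration: the helper name `temporary_sys_path` is hypothetical and is not part of this project, and it assumes the notebook runs from a subdirectory of the project root.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def temporary_sys_path(path):
    """Hypothetical helper: put `path` on sys.path only inside the with-block."""
    path = os.path.abspath(path)
    already_present = path in sys.path
    if not already_present:
        sys.path.insert(0, path)
    try:
        yield path
    finally:
        # Only undo the change if this helper made it.
        if not already_present and path in sys.path:
            sys.path.remove(path)


# Assumption: the project root is the parent of the current working directory.
project_root = os.path.dirname(os.getcwd())

with temporary_sys_path(project_root) as root:
    # Project imports would go here, e.g. the create_dataloaders import.
    print(f"Inside the block, on sys.path: {root in sys.path}")
```

Modules imported inside the block remain available afterwards because Python caches them in sys.modules; only the search path itself is restored.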

In [2]:
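# Add project root to sys.path
# The notebook has no __file__, so the current working directory stands in for the
# script directory; the project root is assumed to be its parent.
script_dir = os.getcwd()
project_root = os.path.dirname(script_dir)
sys.path.append(project_root)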

#data loader imports
from Dataset.data_loaders import create_dataloaders

#model method imports
from ModelMethods.dact_bert_methods import dact_bert_training_loop, dact_bert_plotting
from ModelMethods.methods import train_step, evaluation_step

#import dact-bert
from Models.DACT_BERT import DACT_BERT
from Models.VanillaBert import VanillaBert
from Models.DactBert.dact_bert import TrainingEngine, DactBert

2025-03-14 20:59:41,888 - INFO - PyTorch version 2.4.1 available.


### DataLoader Functions:
---
The "create_dataloaders" function takes two parameters: batch_size and split_prop. The split_prop parameter selects one of the available dataset split proportions. The options follow from how the datasets are stored on "Talapas": four datasets were generated, each with a different train, validation, and test split. The valid options are 0.1, 0.2, 0.3, and 0.4. Any other value results in an error, unless you run another data-generation round with your desired split.
For this demo, I will use 0.3 as the selected split.
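
The restriction on the split options can be pictured as a simple membership check. The sketch below is not the code in Dataset/data_loaders; the helper name `validate_split_prop`, the tolerance-based float comparison, and the choice of ValueError are all assumptions made for illustration.

```python
import math

# The four split proportions for which datasets were generated.
ALLOWED_SPLITS = (0.1, 0.2, 0.3, 0.4)


def validate_split_prop(split_prop: float) -> float:
    """Hypothetical check: return the matching allowed split or raise ValueError."""
    for allowed in ALLOWED_SPLITS:
        # Compare with a tolerance so values such as 0.1 + 0.2 still match 0.3.
        if math.isclose(split_prop, allowed, abs_tol=1e-9):
            return allowed
    raise ValueError(
        f"split_prop={split_prop} is not available; choose one of {ALLOWED_SPLITS} "
        "or generate a new dataset with the desired split."
    )


print(validate_split_prop(0.3))

try:
    validate_split_prop(0.5)
except ValueError as err:
    print(f"Rejected: {err}")
```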

In [3]:
#generate the dataloaders; the options are batch size and split proportion
train_dataloader, valid_dataloader, test_dataloader = create_dataloaders(batch_size=256, split_prop=0.3)

2025-03-14 20:59:43,848 - INFO - Loading tokenizer...

2025-03-14 20:59:44,107 - INFO - Successfully loaded tokenizer.

2025-03-14 20:59:44,108 - INFO - Loading Datasets...

2025-03-14 20:59:44,111 - INFO - Checking Split Choice: Train...
2025-03-14 20:59:44,112 - INFO - Selected valid split.

2025-03-14 20:59:44,113 - INFO - Attempting to load data

2025-03-14 20:59:44,135 - INFO - Successfully Loaded Dataset.

2025-03-14 20:59:44,135 - INFO - Checking Split Choice: Valid...
2025-03-14 20:59:44,136 - INFO - Selected valid split.

2025-03-14 20:59:44,138 - INFO - Attempting to load data

2025-03-14 20:59:44,163 - INFO - Successfully Loaded Dataset.

2025-03-14 20:59:44,163 - INFO - Checking Split Choice: Test...
2025-03-14 20:59:44,164 - INFO - Selected valid split.

2025-03-14 20:59:44,166 - INFO - Attempting to load data

2025-03-14 20:59:44,179 - INFO - Successfully Loaded Dataset.

2025-03-14 20:59:44,179 - INFO - Successfully loaded datasets.

2025-03-14 20:59:44,180 - INFO - Crea
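
When inspecting loaders like these, it helps to keep two sizes apart: in PyTorch, the length of a DataLoader is its number of batches, while the length of its dataset is its number of samples. The sketch below shows the arithmetic linking the two. It assumes the final partial batch is kept (PyTorch's default, drop_last=False); whether create_dataloaders overrides that default is not shown here. The helper name `num_batches` and the sample count of 1000 are hypothetical.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Hypothetical helper: batches produced for a dataset of `num_samples` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    # The incomplete final batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


print(num_batches(1000, 256))                  # 4
print(num_batches(1000, 256, drop_last=True))  # 3
```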

In [4]:
#Display the sizes and contents of the dataloaders
#len(dataloader.dataset) counts samples; len(dataloader) would count batches
print(f"Training Dataloader Number of Samples: {len(train_dataloader.dataset)}\n")
print(f"Validation Dataloader Number of Samples: {len(valid_dataloader.dataset)}\n")
print(f"Test Dataloader Number of Samples: {len(test_dataloader.dataset)}\n")

print(f"Dataloader Batch Contents: {next(iter(train_dataloader)).items()}\n")

Training Dataloader Number of Samples: 3591

Validation Dataloader Number of Samples: 1540

Test Dataloader Number of Samples: 2199

Dataloader Batch Contents: dict_items([('input_ids', tensor([[ 101, 2129, 2079,  ...,    0,    0,    0],
        [ 101, 2054, 2024,  ...,    0,    0,    0],
        [ 101, 2054, 2535,  ...,    0,    0,    0],
        ...,
        [ 101, 2129, 2079,  ...,    0,    0,    0],
        [ 101, 2054, 2024,  ...,    0,    0,    0],
        [ 101, 2129, 2079,  ...,    0,    0,    0]])), ('attention_mask', tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])), ('labels', tensor([1, 2, 0, 2, 2, 1, 1, 0, 0, 2, 2, 2, 1, 2, 0, 1, 0, 0, 1, 0, 2, 0, 1, 0,
        1, 1, 2, 0, 1, 1, 1, 0, 2, 0, 1, 2, 1, 0, 0, 1, 1, 1, 0, 0, 1, 2, 0, 1,
        1, 2, 1, 0, 0, 0, 2, 1, 0, 0, 1, 1, 0, 1, 2, 2, 2, 0, 1, 0, 0, 0, 1,