<a href="https://colab.research.google.com/github/Jez-Carter/Learning_Development_Statistics/blob/master/python/mlcroissant/recipes/tfds_croissant_builder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train a model with Croissant 🥐, Hugging Face 🤗 and TFDS

[TensorFlow Datasets](https://www.tensorflow.org/datasets/overview) (in short, TFDS) is an established library to handle downloading and preparing data efficiently and deterministically.

TFDS is framework-agnostic: it can generate datasets by constructing a `tf.data.Dataset`, a `np.array` or an [`ArrayRecord`](https://github.com/google/array_record) data source, for use with TensorFlow, JAX, PyTorch, and other machine learning frameworks.

TFDS has recently introduced a `CroissantBuilder`, which defines a TFDS dataset based on a Croissant 🥐 metadata file.

## Setup



Let's install and import the needed dependencies:

In [6]:
%%capture --no-display
# Install mlcroissant from the source
!apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
!pip install "git+https://github.com/${GITHUB_REPOSITORY:-mlcommons/croissant}.git@${GITHUB_HEAD_REF:-main}#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"
!pip install array_record
!pip install tfds-nightly
!pip install tensorflow
!pip install torch
!apt-get install tree

In [7]:
%%capture --no-display
import json
import os

from etils import epath
import mlcroissant as mlc
import requests
import tensorflow_datasets as tfds
import torch
from tqdm import tqdm

local_croissant_file = epath.Path("/tmp/croissant.json")
data_dir = "/tmp/croissant"

## Download the Croissant JSON-LD file

To initialize a `CroissantBuilder` in TFDS, we need a Croissant 🥐 file describing a dataset.
In this notebook, we will create a TFDS `CroissantBuilder` for [fashion_mnist](https://huggingface.co/datasets/fashion_mnist), a popular dataset for computer vision.

In [3]:
api_url = "https://huggingface.co/api/datasets/fashion_mnist/croissant"

# Download the JSON and write it to `local_croissant_file`.
response = requests.get(api_url, headers=None).json()
with local_croissant_file.open("w") as f:
  jsonld = json.dumps(response, indent=2)
  f.write(jsonld)
  print(jsonld)

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataBiases": "cr:dataBiases",
    "dataCollection": "cr:dataCollection",
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "personalSensitiveInformation": "cr:personalSensitiveInformation",
    "recordSet": "cr:recordSet",
    "references": "cr:reference

## Build the TFDS dataset

A `CroissantBuilder` takes as input a Croissant 🥐 file and a list of `RecordSet` names to generate. Each `RecordSet` corresponds to a separate [`BuilderConfig`](https://www.tensorflow.org/datasets/api_docs/python/tfds/core/BuilderConfig).
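To see which names could be passed as `record_set_ids`, one option is to read them straight from the JSON-LD. The sketch below runs on a hypothetical, heavily trimmed Croissant document. Neither the `sample_jsonld` dict nor the `list_record_sets` helper is part of TFDS or `mlcroissant`, and the sketch assumes each entry under `recordSet` carries an `@id` or a `name`.

```python
import json


def list_record_sets(jsonld: dict) -> list[str]:
    """Return the identifier of every RecordSet declared in a Croissant document."""
    record_sets = jsonld.get("recordSet", [])
    # Prefer "@id" and fall back to "name" when "@id" is absent.
    return [rs.get("@id") or rs.get("name") for rs in record_sets]


# Hypothetical, trimmed-down Croissant document used only for illustration.
sample_jsonld = json.loads("""
{
  "name": "example_dataset",
  "recordSet": [
    {"@type": "cr:RecordSet", "@id": "default", "name": "default"},
    {"@type": "cr:RecordSet", "name": "labels"}
  ]
}
""")

record_set_names = list_record_sets(sample_jsonld)
print(record_set_names)
```

For the file downloaded earlier, the equivalent call would be `list_record_sets(json.loads(local_croissant_file.read_text()))`.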

In [12]:
builder = tfds.core.dataset_builders.CroissantBuilder(
    jsonld="/tmp/croissantSpikeZip.json",
    record_set_ids=["rs-abberfraw"],
    file_format='array_record',
    data_dir="/tmp/croissant_ukceh",
)

In [None]:
builder = tfds.core.dataset_builders.CroissantBuilder(
    jsonld=local_croissant_file,
    record_set_ids=["fashion_mnist"],
    file_format='array_record',
    data_dir=data_dir,
)

  -  [Metadata(fashion_mnist)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(fashion_mnist)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(fashion_mnist)] Property "https://schema.org/version" is recommended, but does not exist.


Our `CroissantBuilder` uses the information contained in the Croissant 🥐 file to initialize the TFDS dataset's [documentation](https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetInfo), which we can explore through the [`DatasetBuilder.info`](https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetBuilder#info) property:

In [10]:
print(f"Dataset's description:\n{builder.info.description}\n")
print(f"Dataset's citation:\n{builder.info.citation}\n")
print(f"Dataset's features:\n{builder.info.features}")


Dataset's description:
This data contains values of bare sand area, modelled wind speed, aspect and slope at a 2.5 m spatial resolution for four UK coastal dune fields, Abberfraw (Wales), Ainsdale (England), Morfa Dyffryn (Wales), Penhale (England). Data is stored as a .csv file. Data is available for 620,756.25 m2 of dune at Abberfraw, 550,962.5 m2 of dune at Ainsdale, 1,797,756.25 m2 of dune at Morfa Dyffryn and 2,275,056.25 m2 of dune at Penhale. All values were calculated from aerial imagery and digital terrain models collected between 2014 and 2016.

Dataset's citation:
Smyth, T.A.G. (2022). Bare sand, wind speed, aspect and slope at four English and Welsh coastal sand dunes, 2014-2016. NERC EDS Environmental Information Data Centre. https://doi.org/10.5285/972599af-0cc3-4e0e-a4dc-2fab7a6dfc85

Dataset's features:
FeaturesDict({
    'Aspect': float32,
    'BareSand_it1': float32,
    'Slope': float32,
    'WindSpeed': float32,
    'X': float32,
    'Y': float32,
    'id': int64,
})

In [18]:
print(f"Dataset's features:\n{builder.info.features}")

Dataset's features:
FeaturesDict({
    'Aspect': float32,
    'BareSand_it1': float32,
    'Slope': float32,
    'WindSpeed': float32,
    'X': float32,
    'Y': float32,
    'id': int64,
})


In [None]:
print(f"Dataset's description:\n{builder.info.description}\n")
print(f"Dataset's citation:\n{builder.info.citation}\n")
print(f"Dataset's features:\n{builder.info.features}")

# ...

Dataset's description:
Dataset Card for FashionMNIST







		Dataset Summary


Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and… See the full description on the dataset page: https://huggingface.co/datasets/zalando-datasets/fashion_mnist.

Dataset's citation:


Dataset's features:
FeaturesDict({
    'image': Image(shape=(None, None, 3), dtype=uint8, description=Image column 'image' from the Hugging Face parquet file.),
    'label': int64,
    'split': Text(shape=(), dtype=string),
})


We can now generate and materialize the TFDS dataset on disk:

In [13]:
%%capture --no-display
builder.download_and_prepare()

Generating splits...:   0%|          | 0/1 [00:00<?, ? splits/s]





`download_and_prepare` downloads the data and writes it to disk in a format suited to ML workloads; here that is ArrayRecord, as requested through `file_format='array_record'`. You can read more [in the documentation](https://www.tensorflow.org/datasets/tfless_tfds). Let's inspect the result on disk:

In [14]:
!tree "/tmp/croissant_ukceh/"

/tmp/croissant_ukceh/
├── downloads
│   └── extracted
└── dunes_data
    └── rs_abberfraw
        └── 1.0.0
            ├── dataset_info.json
            ├── dunes_data-default.array_record-00000-of-00001
            ├── features.json
            └── LICENSE

5 directories, 4 files


In [None]:
!tree {data_dir}/

/tmp/croissant/
├── downloads
│   └── extracted
└── zalando_datasets__fashion_mnist
    └── fashion_mnist
        └── 1.0.0
            ├── dataset_info.json
            ├── features.json
            ├── LICENSE
            ├── zalando_datasets__fashion_mnist-test.array_record-00000-of-00001
            └── zalando_datasets__fashion_mnist-train.array_record-00000-of-00001

5 directories, 5 files


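Once prepared, a builder exposes its data as a dictionary keyed by split. For comparison, the built-in TFDS `mnist` dataset yields a dictionary of `tf.data.Dataset` objects with a train/test split: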

In [None]:
mnist_builder = tfds.builder("mnist")
mnist_info = mnist_builder.info
mnist_builder.download_and_prepare()
datasets = mnist_builder.as_dataset()

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/mnist/3.0.1...


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Extraction completed...: 0 file [00:00, ? file/s]

Generating splits...:   0%|          | 0/2 [00:00<?, ? splits/s]

Generating train examples...: 0 examples [00:00, ? examples/s]

Shuffling /root/tensorflow_datasets/mnist/incomplete.LU0K0P_3.0.1/mnist-train.tfrecord*...:   0%|          | 0…

Generating test examples...: 0 examples [00:00, ? examples/s]

Shuffling /root/tensorflow_datasets/mnist/incomplete.LU0K0P_3.0.1/mnist-test.tfrecord*...:   0%|          | 0/…

Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.


In [15]:
ds = builder.as_data_source()

In [16]:
len(ds['default'])

99321

In [22]:
for example in ds['default']:
  print(example['Slope'])

Streaming output truncated to the last 5000 lines.
19.31231117248535
27.41070556640625
1.716382622718811
7.871978759765625
2.3910343647003174
9.75335693359375
30.52189064025879
9.85482406616211
3.3960633277893066
2.3704044818878174
1.9129993915557861
37.67653274536133
3.111509323120117
12.258570671081543
5.2287116050720215
21.794788360595703
25.679420471191406
1.3989866971969604
29.92999267578125
8.990714073181152
22.73369598388672
7.792518615722656
3.5305118560791016
1.5240890979766846
1.520542860031128
6.745203495025635
3.4201231002807617
3.8318936824798584
9.888105392456055
12.900400161743164
15.369452476501465
7.039074420928955
2.4216601848602295
27.991626739501953
1.30861496925354
3.8083484172821045
2.7673556804656982
17.903339385986328
1.351202130317688
26.016864776611328
4.003037929534912
18.345813751220703
1.9730933904647827
3.836639881134033
6.608618259429932
20.011491775512695
7.188477993011475
28.439760208129883
21.926599502563477
10.902979850769043
5.551762580

KeyboardInterrupt: 

In [23]:
import torch.nn as nn
import torch.optim as optim

# Extract features and target variable from the dataset
features = []
targets = []

for example in ds['default']:
    features.append([example['Slope'], example['Aspect'], example['WindSpeed']])
    targets.append(example['BareSand_it1'])

features = torch.tensor(features, dtype=torch.float32)
targets = torch.tensor(targets, dtype=torch.float32).view(-1, 1)

# Define the linear regression model
class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.linear(x)

In [24]:
model = LinearRegressionModel(input_dim=3, output_dim=1)

# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the model
num_epochs = 100
batch_size = 32
num_batches = len(features) // batch_size

for epoch in range(num_epochs):
    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        inputs = features[start_idx:end_idx]
        labels = targets[start_idx:end_idx]

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("Training complete.")

Epoch [10/100], Loss: nan
Epoch [20/100], Loss: nan
Epoch [30/100], Loss: nan
Epoch [40/100], Loss: nan


KeyboardInterrupt: 
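The `nan` losses above are consistent with plain SGD diverging on unscaled inputs: `Aspect` ranges over hundreds of degrees, while the `WindSpeed` values shown in this notebook are close to 1. One common remedy, sketched below under that assumption, is to standardize each feature column before training. The data here is synthetic and the `standardize` helper is hypothetical. Missing values in the source data would be a separate cause that this sketch does not address.

```python
import torch


def standardize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale each column to zero mean and unit variance."""
    mean = x.mean(dim=0, keepdim=True)
    std = x.std(dim=0, keepdim=True)
    return (x - mean) / (std + eps)


torch.manual_seed(0)
# Synthetic stand-ins for Slope, Aspect and WindSpeed (not the real data).
raw = torch.rand(1000, 3) * torch.tensor([40.0, 360.0, 2.0])
target = 0.5 * raw[:, :1] + torch.randn(1000, 1)

features = standardize(raw)
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

# Full-batch gradient descent on the standardized inputs.
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(features), target)
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")
```

In practice the mean and standard deviation would be computed on the training portion only and reused for the test portion.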

In [43]:
import tensorflow as tf
# Convert the datasource into a tf.data.Dataset
dataset = tf.data.Dataset.from_generator(
    lambda: (example for example in ds),  # Generator function
    output_signature={
        "feature1": tf.TensorSpec(shape=(), dtype=tf.float32),
        "feature2": tf.TensorSpec(shape=(), dtype=tf.float32),
        "label": tf.TensorSpec(shape=(), dtype=tf.int32),
    }
)

In [50]:
import tensorflow as tf

# Define TensorFlow dataset from generator
dataset = tf.data.Dataset.from_generator(
    lambda: (example for example in ds['default']),
    output_signature={
        "id": tf.TensorSpec(shape=(), dtype=tf.int32),
        "X": tf.TensorSpec(shape=(), dtype=tf.float32),
        "Y": tf.TensorSpec(shape=(), dtype=tf.float32),
        "Aspect": tf.TensorSpec(shape=(), dtype=tf.float32),
        "Slope": tf.TensorSpec(shape=(), dtype=tf.float32),
        "WindSpeed": tf.TensorSpec(shape=(), dtype=tf.float32),
        "BareSand_it1": tf.TensorSpec(shape=(), dtype=tf.float32),
    }
)


In [51]:
# Print a few records from tf.data.Dataset
for example in dataset.take(5):  # Take the first 5 records
    print(example)


{'id': <tf.Tensor: shape=(), dtype=int32, numpy=140520>, 'X': <tf.Tensor: shape=(), dtype=float32, numpy=235566.25>, 'Y': <tf.Tensor: shape=(), dtype=float32, numpy=368331.25>, 'Aspect': <tf.Tensor: shape=(), dtype=float32, numpy=287.2237854003906>, 'Slope': <tf.Tensor: shape=(), dtype=float32, numpy=0.9244160652160645>, 'WindSpeed': <tf.Tensor: shape=(), dtype=float32, numpy=1.1475861072540283>, 'BareSand_it1': <tf.Tensor: shape=(), dtype=float32, numpy=0.0>}
{'id': <tf.Tensor: shape=(), dtype=int32, numpy=216762>, 'X': <tf.Tensor: shape=(), dtype=float32, numpy=235916.25>, 'Y': <tf.Tensor: shape=(), dtype=float32, numpy=368126.25>, 'Aspect': <tf.Tensor: shape=(), dtype=float32, numpy=136.97999572753906>, 'Slope': <tf.Tensor: shape=(), dtype=float32, numpy=21.083030700683594>, 'WindSpeed': <tf.Tensor: shape=(), dtype=float32, numpy=1.300819754600525>, 'BareSand_it1': <tf.Tensor: shape=(), dtype=float32, numpy=52.0>}
{'id': <tf.Tensor: shape=(), dtype=int32, numpy=202054>, 'X': <tf.Ten

In [52]:
# Shuffle and batch
dataset = dataset.shuffle(1000).batch(32)

# Example model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(6,)),  # 6 features (excluding 'id')
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile and train
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(dataset, epochs=10)


Epoch 1/10


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


ValueError: Exception encountered when calling Sequential.call().

The structure of `inputs` doesn't match the expected structure.
Expected: keras_tensor
Received: inputs={'id': 'Tensor(shape=(None,))', 'X': 'Tensor(shape=(None,))', 'Y': 'Tensor(shape=(None,))', 'Aspect': 'Tensor(shape=(None,))', 'Slope': 'Tensor(shape=(None,))', 'WindSpeed': 'Tensor(shape=(None,))', 'BareSand_it1': 'Tensor(shape=(None,))'}

Arguments received by Sequential.call():
  • inputs={'id': 'tf.Tensor(shape=(None,), dtype=int32)', 'X': 'tf.Tensor(shape=(None,), dtype=float32)', 'Y': 'tf.Tensor(shape=(None,), dtype=float32)', 'Aspect': 'tf.Tensor(shape=(None,), dtype=float32)', 'Slope': 'tf.Tensor(shape=(None,), dtype=float32)', 'WindSpeed': 'tf.Tensor(shape=(None,), dtype=float32)', 'BareSand_it1': 'tf.Tensor(shape=(None,), dtype=float32)'}
  • training=True
  • mask={'id': 'None', 'X': 'None', 'Y': 'None', 'Aspect': 'None', 'Slope': 'None', 'WindSpeed': 'None', 'BareSand_it1': 'None'}
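The error arises because `Sequential` expects a single input tensor, whereas each dataset element is a dictionary that also contains the target. One way around it, sketched below, is to turn every record into a `(features, target)` pair before it reaches the model. The `to_xy` helper and the sample record are hypothetical; the column names come from the feature list printed earlier. Since `BareSand_it1` takes values such as 52.0, a regression loss may suit it better than `binary_crossentropy`, though that choice is outside this sketch.

```python
FEATURE_NAMES = ("X", "Y", "Aspect", "Slope", "WindSpeed")
TARGET_NAME = "BareSand_it1"


def to_xy(example: dict) -> tuple[list[float], float]:
    """Split one record into an ordered feature vector and its target value."""
    features = [float(example[name]) for name in FEATURE_NAMES]
    return features, float(example[TARGET_NAME])


# Hypothetical record with the same keys as the dune dataset.
record = {"id": 1, "X": 235566.25, "Y": 368331.25, "Aspect": 287.2,
          "Slope": 0.92, "WindSpeed": 1.15, "BareSand_it1": 0.0}
x, y = to_xy(record)
print(x, y)
```

A generator that yields `to_xy(example)` can be passed to `tf.data.Dataset.from_generator` with a tuple `output_signature`, for example `(tf.TensorSpec(shape=(5,), dtype=tf.float32), tf.TensorSpec(shape=(), dtype=tf.float32))`. With the target removed from the inputs, the first layer's `input_shape` would be `(5,)` rather than `(6,)`.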

In [44]:
dataset

<_FlatMapDataset element_spec={'feature1': TensorSpec(shape=(), dtype=tf.float32, name=None), 'feature2': TensorSpec(shape=(), dtype=tf.float32, name=None), 'label': TensorSpec(shape=(), dtype=tf.int32, name=None)}>

In [41]:
%%timeit
ds['default'][0]

The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1.1 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [15]:
%%timeit
ds['default'][20000]

532 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [4]:
for example in ds['default']:
  print(example)

NameError: name 'ds' is not defined

In [None]:
datasets

{Split('train'): <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 Split('test'): <_PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>}

In [22]:
print(data_source['default'][6])

{'Aspect': 299.36572265625, 'Slope': 12.624134063720703, 'WindSpeed': 1.2469048500061035, 'X': 235726.25, 'Y': 368731.25, 'id': 175176}


In [23]:
import pandas as pd

def dataset_to_dataframe(dataset):
    features_list = []
    labels_list = []

    for example in dataset:  # Iterate through dataset
        features = example['image'].numpy().flatten()  # Flatten image
        label = example['label'].numpy()  # Extract label

        features_list.append(features)
        labels_list.append(label)

    df = pd.DataFrame(features_list)
    df["label"] = labels_list  # Add labels column

    return df

# Convert the dataset
df = dataset_to_dataframe(data_source['default'])
print(df.head())  # Check the data

KeyError: 'image'

In [None]:
# ds_all_dict = builder.as_dataset()
# assert isinstance(ds_all_dict, dict)
print(datasets.keys())  # ==> dict_keys([Split('train'), Split('test')])


dict_keys([Split('train'), Split('test')])


In [25]:
train,test = builder.as_data_source(split=['default[:75%]','default[75%:]'])
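The two percent slices partition the 99,321 records of the `default` split into non-overlapping ranges. As a rough check, the boundary can be reproduced by hand, assuming TFDS rounds percent boundaries to the closest index (its default rounding mode). The `percent_boundary` helper is hypothetical.

```python
def percent_boundary(num_examples: int, percent: int) -> int:
    """Index at which a `[:percent%]` slice ends, rounded to the closest example."""
    return round(num_examples * percent / 100)


num_examples = 99_321  # len(ds['default']) reported earlier
boundary = percent_boundary(num_examples, 75)
train_size, test_size = boundary, num_examples - boundary
print(train_size, test_size)
```

The resulting sizes agree with the `len(train)` and `len(test)` values printed later in the notebook.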

In [30]:
batch_size = 128
train_sampler = torch.utils.data.RandomSampler(train, num_samples=len(train))
train_loader = torch.utils.data.DataLoader(
    train,
    sampler=train_sampler,
    batch_size=batch_size,
)
test_loader = torch.utils.data.DataLoader(
    test,
    sampler=None,
    batch_size=batch_size,
)

In [31]:
class TabularRegressor(torch.nn.Module):
    def __init__(self, input_dim):
        super(TabularRegressor, self).__init__()
        self.regressor = torch.nn.Linear(input_dim, 1)

    def forward(self, features):
        return self.regressor(features)

# Extract feature names and target name
feature_names = ['Aspect', 'Slope', 'WindSpeed']
target_name = 'BareSand_it1'

In [38]:
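for example in tqdm(train_loader):
  # The default collate function already returns one tensor per column,
  # so there is no need to wrap it in torch.tensor() again.
  print(example['Aspect'].float())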

  1%|          | 6/582 [00:00<00:10, 55.11it/s]

tensor([ 94.0064, 312.2454, 274.5022, 141.2862, 309.0225, 195.6481,  48.8076,
        175.4713,  59.1675, 234.5903, 162.7396, 312.4264, 296.3405, 252.2441,
         82.3940, 125.2640, 284.3794, 230.5449, 207.4568, 216.1075, 271.5662,
        221.8014, 210.4821, 292.5097, 149.7092, 134.0172, 251.4913,  92.6541,
        278.4960, 240.3867, 259.0233,  92.6552, 301.6798,  61.3684, 184.2796,
        292.8975,  73.3530,  89.5366, 150.5941, 180.4246, 328.3862, 296.5338,
        283.8618, 198.1688,  76.2737, 286.2415,  87.9675, 260.2562, 264.9105,
        202.3024, 122.2946, 335.6507, 310.4983, 189.3767,  93.9819,  94.9282,
         14.0901, 130.9073, 214.5049, 163.0445, 156.8659, 164.1215,  37.7254,
        255.6591, 231.3089, 180.4357, 131.8166, 267.3774, 243.0891, 110.4502,
        146.0989,  94.7487,  81.9081, 125.1146, 186.0023,  81.3784, 190.7237,
        194.6858, 180.7667, 318.7409,  99.6311,  22.6216, 326.5854, 275.5746,
         77.0715, 300.0505, 196.1028,  84.1600, 214.3488, 191.82

KeyboardInterrupt: 

In [41]:
[example[feature] for feature in feature_names]

[tensor([ 45.7197, 279.7827,  91.6005, 129.6441, 331.4779, 284.7853, 175.4904,
          31.0813,   7.1307, 245.6486, 327.7542, 126.1900, 154.8481, 204.1511,
         282.3218, 111.6741, 135.7579, 340.9374, 169.5077, 149.9320, 177.0745,
         246.3654, 216.5456, 207.7438, 226.5246, 218.8110, 308.8786, 223.4639,
         194.9151, 290.0657, 171.2713, 183.7512, 179.6918,  92.3407, 111.4425,
         267.9100, 257.3386,  61.7543,  92.9076,  97.5094, 235.7552, 255.0526,
         149.8383, 161.1867, 188.2688, 107.1730, 244.4128, 336.9975, 176.0969,
         185.7072, 103.3558, 161.7164, 114.4496, 163.6087, 143.5305, 264.6136,
         275.1865,  75.0609, 184.4847, 325.8745, 151.2295,  26.0516, 143.4626,
         268.2968, 184.6591, 276.0323, 267.5862,  83.7694, 237.2989, 106.8413,
         131.3429, 112.6300, 306.8586, 137.8302, 287.6258,  88.8585, 172.5125,
         261.8296, 319.8157, 314.3868, 211.5917, 167.2879, 311.8498,  82.5310,
         104.2166, 307.9885, 157.0848, 128.8651,  66

In [33]:
# Assuming the features are in a tensor of shape (num_features,)
input_dim = len(feature_names)
model = TabularRegressor(input_dim)
optimizer = torch.optim.Adam(model.parameters())
loss_function = torch.nn.MSELoss()

print('Training...')
model.train()
for example in tqdm(train_loader):
    features = torch.tensor([example[feature] for feature in feature_names]).float().unsqueeze(0)
    target = torch.tensor(example[target_name]).float().unsqueeze(1)
    prediction = model(features)
    loss = loss_function(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print('Testing...')
model.eval()
total_loss = 0
num_examples = 0
for example in tqdm(test_loader):
    features = torch.tensor([example[feature] for feature in feature_names]).float()
    target = torch.tensor(example[target_name]).float().unsqueeze(1)
    prediction = model(features)
    loss = loss_function(prediction, target)
    total_loss += loss.item() * features.shape[0]
    num_examples += features.shape[0]
print(f'\nMean Squared Error: {total_loss / num_examples:.4f}')

Training...


  0%|          | 0/582 [00:00<?, ?it/s]


ValueError: only one element tensors can be converted to Python scalars

In [34]:
print('Training...')
model.train()
for example in tqdm(train_loader):
    features = torch.tensor([[example[feature] for feature in feature_names] for example in example]).float()
    target = torch.tensor([example[target_name] for example in example]).float().unsqueeze(1)
    prediction = model(features)
    loss = loss_function(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Training...


  0%|          | 0/582 [00:00<?, ?it/s]


TypeError: string indices must be integers, not 'str'

In [21]:
print(len(train),len(test))

74491 24830


In [23]:
batch_size = 128
train_sampler = torch.utils.data.RandomSampler(train, num_samples=5_000)
train_loader = torch.utils.data.DataLoader(
    train,
    sampler=train_sampler,
    batch_size=batch_size,
)
test_loader = torch.utils.data.DataLoader(
    test,
    sampler=None,
    batch_size=batch_size,
)

In [29]:
from torch.utils.data import DataLoader, Dataset

# Convert data to PyTorch Dataset
class DuneDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = list(dataset)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        example = self.dataset[idx]
        x = torch.tensor([example['Aspect'], example['Slope'], example['WindSpeed']], dtype=torch.float32)
        y = torch.tensor(example['BareSand_it1'], dtype=torch.float32)
        return x, y

train_dataset = DuneDataset(train)
test_dataset = DuneDataset(test)

# DataLoader for PyTorch
batch_size = 128
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [30]:
# Define the PyTorch Linear Regression Model
class LinearRegressor(torch.nn.Module):
    def __init__(self, input_dim):
        super(LinearRegressor, self).__init__()
        self.linear = torch.nn.Linear(input_dim, 1)

    def forward(self, x):
        return self.linear(x)

# Initialize model, loss, and optimizer
model = LinearRegressor(input_dim=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()


In [28]:
for

TypeError: 'DataLoader' object is not subscriptable

In [31]:
# Train the model
epochs = 50
for epoch in range(epochs):
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        y_pred = model(x_batch).squeeze()
        loss = criterion(y_pred, y_batch)
        loss.backward()
        optimizer.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")


Epoch 0, Loss: nan
Epoch 10, Loss: nan
Epoch 20, Loss: nan
Epoch 30, Loss: nan
Epoch 40, Loss: nan
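
A loss that is already `nan` at the first printed epoch usually points to one of two things: non-finite values in the inputs, or an optimization that diverged. With unscaled features (an aspect in degrees can reach the hundreds) and plain SGD at `lr=0.01`, divergence is plausible, but the outputs above do not show which cause applies here. The sketch below uses synthetic data with hypothetical value ranges to illustrate both checks: counting non-finite entries, then standardizing the features before training.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for (Aspect, Slope, WindSpeed); ranges are hypothetical.
x = torch.rand(256, 3) * torch.tensor([360.0, 45.0, 20.0])
y = 0.01 * x[:, 0] + 0.1 * x[:, 1] + 0.1 * torch.randn(256)

# 1. Rule out bad inputs: any NaN or inf here propagates straight into the loss.
print("non-finite features:", (~torch.isfinite(x)).sum().item())
print("non-finite targets:", (~torch.isfinite(y)).sum().item())

# 2. Standardize each feature column to zero mean and unit variance.
mean, std = x.mean(dim=0), x.std(dim=0)
x_scaled = (x - mean) / std

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

# Full-batch gradient descent on the scaled features.
for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x_scaled).squeeze(1), y)
    loss.backward()
    optimizer.step()

final_loss = loss.item()
print("loss is finite:", final_loss == final_loss and final_loss != float("inf"))
```

If the real data does contain missing values, those rows would need to be dropped or imputed before scaling, since the mean and standard deviation of a column with a `nan` are themselves `nan`.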


In [33]:
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate the model
y_true, y_preds = [], []
with torch.no_grad():
    for x_batch, y_batch in test_loader:
        y_pred = model(x_batch).squeeze()
        y_preds.extend(y_pred.numpy())
        y_true.extend(y_batch.numpy())

mse = mean_squared_error(y_true, y_preds)
r2 = r2_score(y_true, y_preds)

print("Model Performance:")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared: {r2:.4f}")

ValueError: Input contains NaN.

In [24]:
train.Aspect

AttributeError: 'ArrayRecordDataSource' object has no attribute 'Aspect'
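
The `AttributeError` suggests that the data source behaves like a random-access sequence of example dictionaries and not like a table with column attributes. Other cells in this notebook already rely on `len(train)` and `train[0][...]`, so fields are read per example. The sketch below uses a plain list as a hypothetical stand-in for the data source (the values are invented) to show how a single column could be collected.

```python
# Hypothetical stand-in for the data source: a sequence of example dicts.
train = [
    {"Aspect": 180.0, "Slope": 5.0, "WindSpeed": 7.5, "BareSand_it1": 120.0},
    {"Aspect": 90.0, "Slope": 12.0, "WindSpeed": 6.0, "BareSand_it1": 80.0},
    {"Aspect": 270.0, "Slope": 3.0, "WindSpeed": 9.0, "BareSand_it1": 150.0},
]

# Index first, then look up the field by key.
first_aspect = train[0]["Aspect"]

# Collect one column by walking over the examples.
aspect = [example["Aspect"] for example in train]

print(len(train), first_aspect, aspect)
```

On the full split this walks over every record, so for a quick look it may be cheaper to index a handful of examples.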

In [None]:
builder.info.splits['default[:75%]'].num_examples  # 74_491 (also works with slices)

74491

In [None]:
train_ds, test_ds = tfds.load('mnist', split=['train', 'test[:50%]'])

In [None]:
data

ArrayRecordDataSource(name=dunes_data, split='default[:75%]', decoders=None)

In [None]:
, split='train[:75%]')

In [None]:
data

{'default': ArrayRecordDataSource(name=dunes_data, split='default', decoders=None)}

In [None]:
data = builder.as_data_source()

In [None]:
train.info

AttributeError: 'ArrayRecordDataSource' object has no attribute 'info'

In [None]:
train, test = builder.as_data_source(split=['train', 'test'])

## Train a model

TFDS can be used with TensorFlow, JAX and PyTorch, because it supports many data loaders like [tf.data](https://www.tensorflow.org/guide/data), [PyGrain](https://github.com/google/grain) and [PyTorch DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). For example, let's try it with PyTorch:

In [None]:
torch.utils.data.RandomSampler(data, num_samples=len(data))

<torch.utils.data.sampler.RandomSampler at 0x7dc10d910f50>

In [None]:
batch_size = 128
train_sampler = torch.utils.data.RandomSampler(train, num_samples=len(train))
train_loader = torch.utils.data.DataLoader(
    train,
    sampler=train_sampler,
    batch_size=batch_size,
)
test_loader = torch.utils.data.DataLoader(
    test,
    sampler=None,
    batch_size=batch_size,
)

DataLoaders can be fed as input to any ML pipeline. Let's try a very simple example:

In [None]:
class LinearClassifier(torch.nn.Module):
  def __init__(self, shape, num_classes):
    super(LinearClassifier, self).__init__()
    height, width, channels = shape
    self.classifier = torch.nn.Linear(height * width * channels, num_classes)

  def forward(self, image):
    image = image.view(image.size()[0], -1).to(torch.float32)
    return self.classifier(image)

shape = train[0]["image"].shape
num_classes = 10
model = LinearClassifier(shape, num_classes)
optimizer = torch.optim.Adam(model.parameters())
loss_function = torch.nn.CrossEntropyLoss()

print('Training...')
model.train()
for example in tqdm(train_loader):
  image = example['image']
  label = example['label']
  prediction = model(image)
  loss = loss_function(prediction, label)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

print('Testing...')
model.eval()
num_examples = 0
true_positives = 0
for example in tqdm(test_loader):
  image = example['image']
  label = example['label']
  prediction = model(image)
  num_examples += image.shape[0]
  predicted_label = prediction.argmax(dim=1)
  true_positives += (predicted_label == label).sum().item()
print(f'\nAccuracy: {true_positives/num_examples * 100:.2f}%')

Training...


100%|██████████| 469/469 [00:10<00:00, 43.92it/s]


Testing...


100%|██████████| 79/79 [00:02<00:00, 32.32it/s]


Accuracy: 74.04%





# Test

In [None]:
train,test = builder.as_data_source(split=['default[:75%]','default[75%:]'])

In [None]:
import tensorflow as tf

In [None]:
# Convert data source to dataset
dataset = tf.data.Dataset.from_generator(
    lambda: data_source,
    output_signature={
        'modeled_wind_speed': tf.TensorSpec(shape=(), dtype=tf.float32),
        'aspect': tf.TensorSpec(shape=(), dtype=tf.float32),
        'slope': tf.TensorSpec(shape=(), dtype=tf.float32),
        'bare_sand_area': tf.TensorSpec(shape=(), dtype=tf.float32)
    }
)

In [None]:
# Step 2: Data Preprocessing
def preprocess(features):
    x = tf.stack([features['modeled_wind_speed'], features['aspect'], features['slope']], axis=-1)
    y = features['bare_sand_area']
    return x, y

# Apply preprocessing; shuffle individual examples before batching them
dataset = dataset.map(preprocess).shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)


In [None]:
dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(None, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.float32, name=None))>

In [None]:
# Split dataset into train and test sets
data_size = sum(1 for _ in dataset)
train_size = int(0.8 * data_size)
train_dataset = dataset.take(train_size)
test_dataset = dataset.skip(train_size)


UnknownError: {{function_node __wrapped__IteratorGetNext_output_types_2_device_/job:localhost/replica:0/task:0/device:CPU:0}} NameError: name 'data_source' is not defined
Traceback (most recent call last):

  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 865, in get_iterator
    return self._iterators[iterator_id]
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^

KeyError: 1


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 269, in __call__
    ret = func(*args)
          ^^^^^^^^^^^

  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/data/ops/from_generator_op.py", line 198, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 867, in get_iterator
    iterator = iter(self._generator(*self._args.pop(iterator_id)))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "<ipython-input-31-b7c5203b39d1>", line 3, in <lambda>
    lambda: data_source,
            ^^^^^^^^^^^

NameError: name 'data_source' is not defined


	 [[{{node PyFunc}}]] [Op:IteratorGetNext] name: 

In [None]:
# Apply preprocessing
dataset = train.map(preprocess).batch(32).shuffle(1000).prefetch(tf.data.experimental.AUTOTUNE)


AttributeError: 'ArrayRecordDataSource' object has no attribute 'map'