# The Official Sig53 Dataset

**Number of Classes:** 53   
**Size on Disk:** 71 GB (uncompressed)   
**Time to Generate:** ~1 hour   

**Description:**
The Sig53 dataset is a synthetic dataset of modulated RF bursts of various families of modulations that are augmented with highly parameterizable transformations and can be generated on-the-fly or stored on disk. Such a dataset can be used for comparing and benchmarking the performance of machine-learning techniques.

For much more detailed information about the nature of the data, please see [the associated paper on arXiv](https://arxiv.org/pdf/2207.09918.pdf).

### On-the-Fly Dataset
#### Generating
A Sig53-like dataset can be generated from scratch, on-the-fly, yielding an essentially infinite number of unique exemplars (called samples in machine-learning circles) per class. The downside to generating data on-the-fly is that **training is slower.** 
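To make "on-the-fly" concrete, here is a minimal sketch that uses only NumPy. The `ToyOnTheFlyDataset` class, its two constellations, and the noise scale are hypothetical stand-ins for illustration, not TorchSig code. It shows that nothing is stored: every index access synthesizes a fresh exemplar, which is why the supply of exemplars is effectively unlimited and why each access costs compute.

```python
import numpy as np


class ToyOnTheFlyDataset:
    """Hypothetical stand-in for an on-the-fly dataset (not TorchSig code)."""

    # Two toy constellations; the real dataset has 53 classes.
    CONSTELLATIONS = {
        "bpsk": np.array([1 + 0j, -1 + 0j]),
        "qpsk": np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2),
    }

    def __init__(self, num_samples: int, num_iq_samples: int, seed: int = 0):
        self.num_samples = num_samples
        self.num_iq_samples = num_iq_samples
        self.classes = list(self.CONSTELLATIONS)
        self.rng = np.random.default_rng(seed)

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, index: int):
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        # Nothing is stored: the exemplar is synthesized when it is requested.
        label = self.classes[index % len(self.classes)]
        symbols = self.rng.choice(self.CONSTELLATIONS[label], self.num_iq_samples)
        noise = 0.1 * (
            self.rng.standard_normal(self.num_iq_samples)
            + 1j * self.rng.standard_normal(self.num_iq_samples)
        )
        return symbols + noise, label


toy_ds = ToyOnTheFlyDataset(num_samples=4, num_iq_samples=128)
first, first_label = toy_ds[0]
again, again_label = toy_ds[0]
print(first.shape, first_label)      # (128,) bpsk
print(np.array_equal(first, again))  # False: same index, new exemplar
```

In this sketch, asking for index 0 twice returns two different exemplars of the same class, because the random stream keeps advancing between accesses.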

Before we jump into generating the whole dataset, let's start with a basic example that uses ```ModulationsDataset```, the underlying class used to generate ```Sig53Dataset```. The class has a number of parameters; the example below sets three of them: ```level```, ```num_samples```, and ```num_iq_samples```.

In [None]:
from matplotlib import pyplot as plt
from torchsig.datasets.modulations import ModulationsDataset

# Instantiate the dataset, no data is generated at this point.
ds = ModulationsDataset(
    level=0, # only AWGN
    num_samples=53*10, # 10 exemplars per class
    num_iq_samples=4096, # 4096 IQ samples
)

# Index into the dataset. This actually generates the data
# and the associated label
data, label = ds[0]

# Plot it.
plt.subplot(3, 1, 1)
plt.title("Modulation Type = {}".format(label))
plt.ylabel("Time Domain")
plt.plot(data[:100].real, marker=".")
plt.plot(data[:100].imag, marker=".")
plt.legend(["Real", "Imag"])
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 2)
_ = plt.psd(data)
plt.ylabel("PSD")
plt.xlabel("")
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 3)
_ = plt.specgram(data)
ax = plt.gca()
plt.ylabel("Spectrogram")
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

Now, if you want a more interesting dataset with more substantial impairments, you can pass a different ```level``` as an argument. Typically you'll want to look at more than just one sample in the dataset as well. You might want to iterate over the entire dataset to analyze the data yourself, store it in some custom format, or train an ML algorithm with the data. The ```Dataset``` class is iterable, so you can loop over it just like a list, which is nice.

We'll use a more advanced way of iterating over the dataset later, but this may be helpful if you're just getting started.
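As background on why a plain `for` loop works, here is a small sketch in pure Python. It assumes the dataset is a map-style object exposing `__len__` and `__getitem__`; whether `ModulationsDataset` relies on exactly this mechanism is an assumption. The `IndexableOnly` class is hypothetical. Python falls back to calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised, so no `__iter__` method is required.

```python
class IndexableOnly:
    """Hypothetical class with __len__ and __getitem__ but no __iter__."""

    def __init__(self, n: int):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, index: int):
        # Raising IndexError past the end is what stops a for-loop.
        if index >= self.n:
            raise IndexError(index)
        return index * index, "label-{}".format(index)


items = [pair for pair in IndexableOnly(3)]
print(items)  # [(0, 'label-0'), (1, 'label-1'), (4, 'label-2')]
```

If `__getitem__` never raised `IndexError`, the same loop would not terminate, which is worth keeping in mind when writing a custom dataset.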

In [None]:
# Instantiate the dataset, no data is generated at this point.
ds = ModulationsDataset(
    level=1, # More like a cabled environment
    num_samples=53*10, # 10 exemplars per class
    num_iq_samples=4096, # 4096 IQ samples
)

count = 0

# iterate over the dataset once.
for data, label in ds:
    # Store the data, train on it, or analyze it here.
    # The data is left unmodified so the plot below shows the generated signal.
    count += 1

print("There are {} samples in this dataset!".format(count))

# Plot it.
plt.subplot(3, 1, 1)
plt.title("Modulation Type = {}".format(label))
plt.ylabel("Time Domain")
plt.plot(data[:100].real, marker=".")
plt.plot(data[:100].imag, marker=".")
plt.legend(["Real", "Imag"])
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 2)
_ = plt.psd(data)
plt.ylabel("PSD")
plt.xlabel("")
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 3)
_ = plt.specgram(data)
ax = plt.gca()
plt.ylabel("Spectrogram")
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

#### Iterating Over the Dataset
Now the Sig53 dataset is a specific configuration of the ```ModulationsDataset``` class. To keep the configuration fixed, we implement classes with fixed parameters of type ```Sig53Conf```. Here we use ```Sig53ImpairedTrainQAConfig```. In this configuration, the dataset ```label``` is actually a Python ```tuple``` that also includes the estimated SNR of the produced sample. 

Using this example, you can choose one of the fixed configurations and generate a Sig53-like dataset on-the-fly as you iterate over it, and train an ML model, store it, or perform some analysis on it!

In [None]:
from torchsig.datasets import conf

# A VERY small portion of the Sig53 impaired-train dataset, which is level 2
config = conf.Sig53ImpairedTrainQAConfig

# Instantiate the dataset, no data is generated at this point.
ds = ModulationsDataset(
    level=config.level,
    num_samples=config.num_samples,
    num_iq_samples=config.num_iq_samples,
    use_class_idx=config.use_class_idx,
    include_snr=config.include_snr,
    eb_no=config.eb_no
)

data, (modulation, snr) = ds[0]
print(data)
print(modulation, snr)

# Plot it.
plt.subplot(3, 1, 1)
plt.title("Modulation Type = {}".format(modulation))
plt.ylabel("Time Domain")
plt.plot(data[:100].real, marker=".")
plt.plot(data[:100].imag, marker=".")
plt.legend(["Real", "Imag"])
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 2)
_ = plt.psd(data)
plt.ylabel("PSD")
plt.xlabel("")
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 3)
_ = plt.specgram(data)
ax = plt.gca()
plt.ylabel("Spectrogram")
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

### Static Dataset
#### Generating
The official Sig53 dataset is a fixed-size static dataset, not one of the on-the-fly datasets mentioned above. Fixed datasets are useful because training on them is much faster, and they are the way to go when comparing or reproducing results is important. To use a static dataset, we first generate it as we did before, but store it on disk. Then we need a way to read it back from disk so we can train on it.

So what we'll do is use a PyTorch ```DataLoader```, which parallelizes the generation of data from ```ModulationsDataset```. It's a very useful class for ML training, which is usually done with batches of data. For each batch of data that's produced, we'll write it to a file. 

We write data as raw binary to disk here, but TorchSig doesn't use that format by default; this is just an example of how you could do it. Raw binary is much more portable than other formats, but is not commonly used in the ML community.
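Because a raw binary file carries no header, the reader has to know the dtype and the exemplar length in advance. The following sketch uses only NumPy and a temporary file; the file name, batch contents, and tiny exemplar length are made up for illustration. It appends a few batches and then recovers the number of stored exemplars from the file size alone.

```python
import os
import tempfile

import numpy as np

num_iq_samples = 8  # tiny on purpose; the tutorial uses 4096
batches = [
    (np.arange(2 * num_iq_samples) + 1j).reshape(2, num_iq_samples)
    for _ in range(3)
]  # three batches of two complex128 exemplars each

path = os.path.join(tempfile.mkdtemp(), "toy_data.bin")  # hypothetical file name
with open(path, "wb") as f:  # open once, write every batch
    for batch in batches:
        f.write(batch.tobytes())

# complex128 takes 16 bytes per IQ sample (two 8-byte floats).
bytes_per_exemplar = num_iq_samples * np.dtype(np.complex128).itemsize
num_exemplars = os.path.getsize(path) // bytes_per_exemplar
print(bytes_per_exemplar, num_exemplars)  # 128 6
```

Note that an `.fc32` extension conventionally suggests 32-bit float complex samples, so if the array being written is `complex128`, as in this sketch, the reader must use 16 bytes per IQ sample rather than 8.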

In [None]:
from torch.utils.data import DataLoader
import numpy as np
import os

# A VERY small portion of the Sig53 impaired-train dataset, which is level 2
config = conf.Sig53ImpairedTrainSmallConfig

# Instantiate the dataset, no data is generated at this point.
ds = ModulationsDataset(
    level=config.level,
    num_samples=config.num_samples,
    num_iq_samples=config.num_iq_samples,
    use_class_idx=config.use_class_idx,
    include_snr=config.include_snr,
    eb_no=config.eb_no
)

# You can iterate through this in a similar way.
# The DataLoader produces a torch.Tensor
loader = DataLoader(ds, batch_size=16, num_workers=16)

if os.path.exists("data.fc32"):
    os.remove("data.fc32")

if os.path.exists("label.int8"):
    os.remove("label.int8")

for data, (mod, snr) in loader:
    # We convert torch.Tensor to a familiar numpy.ndarray
    # with tensor.numpy()
    with open("data.fc32", "ab+") as data_file:
        data_file.write(data.numpy().tobytes())

    with open("label.int8", "ab+") as label_file:
        label_file.write(mod.numpy().astype(np.int8).tobytes())


#### Reading a Static Dataset
We've chosen to write our data as raw binary. We can write a Dataset class that allows us to iterate over that stored data. A Dataset class only needs to implement ```__init__``` and ```__getitem__```. We talk more about creating a custom dataset in a different tutorial.

In [None]:
import os
from typing import Callable, Tuple, Union

import numpy as np
from torchsig.datasets.synthetic import SignalDataset


class CustomDataset(SignalDataset):
    def __init__(
        self,
        path: str,
        transform: Union[Callable, None] = None,
        target_transform: Union[Callable, None] = None,
        seed: Union[int, None] = None,
    ) -> None:
        self.data_file = os.path.join(path, "data.fc32")
        self.label_file = os.path.join(path, "label.int8")

        super().__init__(transform, target_transform, seed)

    def __getitem__(self, index: int) -> Tuple:
        # Each exemplar is 4096 complex samples, stored as 8192
        # double-precision (8-byte) floats, so exemplar `index`
        # starts index * 8192 * 8 bytes into the file.
        iq_data = np.fromfile(
            self.data_file,
            dtype=np.float64,
            count=4096 * 2,
            offset=index * 4096 * 2 * 8,
        ).view(np.complex128)
        label = np.fromfile(self.label_file, dtype=np.int8, count=1, offset=index)
        return iq_data, label
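The offset arithmetic used in `__getitem__` can be checked in isolation. The sketch below uses only NumPy, with a made-up array and a temporary file, and compares the offset-based read with an alternative based on `np.memmap`. It is an illustration under those assumptions, not how TorchSig stores Sig53.

```python
import os
import tempfile

import numpy as np

num_iq_samples = 8  # tiny on purpose; the tutorial uses 4096
stored = (np.arange(5 * num_iq_samples) * (1 + 2j)).reshape(5, num_iq_samples)

path = os.path.join(tempfile.mkdtemp(), "toy_data.bin")  # hypothetical file name
stored.tofile(path)

index = 3
itemsize = np.dtype(np.complex128).itemsize  # 16 bytes per IQ sample

# Option 1: seek straight to the exemplar with a byte offset.
via_offset = np.fromfile(
    path,
    dtype=np.complex128,
    count=num_iq_samples,
    offset=index * num_iq_samples * itemsize,
)

# Option 2: map the file and index it like a 2-D array.
mapped = np.memmap(path, dtype=np.complex128, mode="r").reshape(-1, num_iq_samples)
via_memmap = np.array(mapped[index])

print(np.array_equal(via_offset, stored[index]))  # True
print(np.array_equal(via_memmap, stored[index]))  # True
print(len(mapped))  # 5
```

The memory-mapped view also gives the number of stored exemplars for free, which is handy if the dataset class should implement `__len__`.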

Now we instantiate the ```CustomDataset```, point it at the data we generated, and plot the first exemplar.

In [None]:
custom_ds = CustomDataset(path=".")

data, label = custom_ds[0]

# Plot it.
plt.subplot(3, 1, 1)
plt.title("Modulation Type = {}".format(ModulationsDataset.default_classes[int(label)]))
plt.ylabel("Time Domain")
plt.plot(data[:100].real, marker=".")
plt.plot(data[:100].imag, marker=".")
plt.legend(["Real", "Imag"])
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 2)
_ = plt.psd(data)
plt.ylabel("PSD")
plt.xlabel("")
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 3)
_ = plt.specgram(data)
ax = plt.gca()
plt.ylabel("Spectrogram")
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

### The Official Sig53 Dataset
#### Generating
The examples above use a simplified setup to explain how TorchSig statically generates Sig53, and to give you enough information to roll your own static generation method if you'd like. The officially supported method for generation is to use the provided scripts in the ```scripts/``` directory.

The official method here employs our own ```DatasetLoader``` and ```DatasetCreator```, which are thin wrappers around a ```DataLoader``` that enable seeded generation and the optional overloading of the storage format. The default storage format is LMDB, which is one format used in the broader machine-learning community.
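To illustrate what seeded generation buys you, here is a sketch that uses only NumPy. The `generate_exemplar` function and its per-index seeding scheme are hypothetical, and how `DatasetLoader` actually seeds its workers is not shown here. The idea is that deriving each exemplar's random stream from a base seed and its index makes the output reproducible regardless of the order in which exemplars are produced.

```python
import numpy as np


def generate_exemplar(base_seed: int, index: int, num_iq_samples: int = 16):
    """Hypothetical generator: noise-only exemplar seeded by (base_seed, index)."""
    # SeedSequence mixes the base seed with the index, so each exemplar
    # gets its own independent, reproducible random stream.
    rng = np.random.default_rng(np.random.SeedSequence([base_seed, index]))
    return rng.standard_normal(num_iq_samples) + 1j * rng.standard_normal(num_iq_samples)


# Generate the same four exemplars in two different orders.
forward = [generate_exemplar(12345678, i) for i in range(4)]
backward = [generate_exemplar(12345678, i) for i in reversed(range(4))][::-1]

print(all(np.array_equal(a, b) for a, b in zip(forward, backward)))  # True
print(np.array_equal(forward[0], forward[1]))  # False
```

A scheme along these lines matters when several workers generate data in parallel, because the order in which batches complete is then not fixed.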

In [None]:
from torchsig.utils.writer import DatasetLoader, DatasetCreator

path = "sig53_qa/"

config = conf.Sig53CleanTrainQAConfig

if not os.path.exists(path):
    os.mkdir(path)

ds = ModulationsDataset(
    level=config.level,
    num_samples=config.num_samples,
    num_iq_samples=config.num_iq_samples,
    use_class_idx=config.use_class_idx,
    include_snr=config.include_snr,
    eb_no=config.eb_no,
)
loader = DatasetLoader(
    ds,
    seed=12345678,
    num_workers=os.cpu_count() // 2,
    batch_size=os.cpu_count() // 2,
)
creator = DatasetCreator(
    ds,
    seed=12345678,
    path="{}".format(os.path.join(path, config.name)),
    loader=loader,
)
creator.create()

#### Reading
The ```Sig53``` class in TorchSig is similar to ```CustomDataset```: it is given a path to the stored data and simply reads it from disk as it is accessed with ```__getitem__```. 

In [None]:
from torchsig.datasets.sig53 import Sig53

# The root, train, and impaired arguments tell
# this dataset class precisely where to look for data
official_sig53 = Sig53(
    root="sig53_qa",
    train=True,
    impaired=False
)

data, (mod, snr) = official_sig53[0]

# Plot it.
plt.subplot(3, 1, 1)
plt.title("Modulation Type = {}".format(ModulationsDataset.default_classes[int(mod)]))
plt.ylabel("Time Domain")
plt.plot(data[:100].real, marker=".")
plt.plot(data[:100].imag, marker=".")
plt.legend(["Real", "Imag"])
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 2)
_ = plt.psd(data)
plt.ylabel("PSD")
plt.xlabel("")
ax = plt.gca()
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])

plt.subplot(3, 1, 3)
_ = plt.specgram(data)
ax = plt.gca()
plt.ylabel("Spectrogram")
ax.grid(False)
ax.set_xticks([])
ax.set_yticks([])
