## Generating the Dataset

In [2]:
%matplotlib inline
import random 
import torch
from d2l import torch as d2l

In [24]:
class SyntheticRegressionData(d2l.DataModule):
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size = 32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X  = torch.randn(n, len(w))
        noise = torch.randn(n,1) *noise
        self.y = torch.matmul(self.X, w.reshape((-1,1))) + b + noise

- `super().__init__()`:
`super().__init__()` calls the initializer (`__init__`) of the parent class. Here `SyntheticRegressionData` inherits from `d2l.DataModule`, so the call runs `d2l.DataModule.__init__()` and sets up whatever attributes the parent class defines. When a subclass overrides `__init__()`, Python does not call the parent's `__init__()` automatically; the subclass must invoke it explicitly through `super()`.
- `torch.rand()` generates random numbers uniformly distributed in the interval [0,1)
- `self.X = torch.randn(n, len(w))`
This line generates a feature matrix `X` with `n` rows and `len(w)` columns using `torch.randn()`, which creates a tensor filled with random values sampled from a normal distribution (mean 0, standard deviation 1).
`self.X` stores the input features of the synthetic data.
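The point about `super().__init__()` can be checked with a small plain-Python sketch. The classes `Base`, `WithSuper`, and `WithoutSuper` are hypothetical names invented for illustration and are not part of `d2l`.

```python
class Base:
    def __init__(self):
        self.ready = True  # attribute set by the parent initializer


class WithSuper(Base):
    def __init__(self):
        super().__init__()  # explicitly run Base.__init__
        self.extra = 1


class WithoutSuper(Base):
    def __init__(self):
        self.extra = 1  # Base.__init__ is never run here


has_ready_with = hasattr(WithSuper(), "ready")
has_ready_without = hasattr(WithoutSuper(), "ready")
print(has_ready_with, has_ready_without)
```

Only the subclass that calls `super().__init__()` ends up with the parent's `ready` attribute.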

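- `noise = torch.randn(n,1) * noise`
`torch.randn(n, 1)` draws `n` values from the standard normal distribution (mean 0, standard deviation 1). Multiplying by `noise` rescales them to standard deviation `noise` (0.01 by default) while the mean stays 0.
For a different mean and standard deviation, scale the samples and then shift them:
```python
mean = desired_mean        # target mean (placeholder)
std_dev = desired_std_dev  # target standard deviation (placeholder)
noise = torch.randn(n, 1) * std_dev + mean
```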

- `self.y = torch.matmul(self.X, w.reshape((-1,1))) + b + noise`
This line computes the target values `y` for the synthetic linear regression problem.
`torch.matmul(self.X, w.reshape((-1,1)))` multiplies the feature matrix `self.X` (shape `n x len(w)`) by the weight vector `w`, reshaped into a column vector of shape `len(w) x 1` so that the inner dimensions match. The result has shape `n x 1`; the scalar `b` and the `n x 1` noise tensor are then added elementwise.
So the formula for `y` is:
\[
y = X \cdot w + b + \text{noise}
\]
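As a rough check of the formula, the sketch below rebuilds `y` without the noise term and compares it with an explicit per-example sum. The number of examples and the seed are arbitrary choices made for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
w = torch.tensor([2, -3.4])
b = 4.2
X = torch.randn(5, len(w))  # 5 examples, 2 features

# Matrix form: (5 x 2) @ (2 x 1) -> (5 x 1)
y = torch.matmul(X, w.reshape((-1, 1))) + b

# Same quantity written out per example: w_1 * x_1 + w_2 * x_2 + b
y_manual = (X[:, 0] * w[0] + X[:, 1] * w[1] + b).reshape((-1, 1))

print(y.shape)
print(torch.allclose(y, y_manual, atol=1e-5))
```

Reshaping `w` to a column is what makes `y` come out with shape `(5, 1)` rather than `(5,)`, which matches the shape of the noise tensor in the class above.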

In [23]:
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
print('features:', data.X[0], '\nlabel:', data.y[0])

features: tensor([-0.7909, -0.5829]) 
label: tensor([4.6064])


## Reading the Dataset

In [8]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        random.shuffle(indices)  # read training examples in random order
    else:
        indices = list(range(self.num_train, self.num_train + self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i: i+self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]

X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


The loop `for i in range(0, len(indices), self.batch_size):` steps through the index list in strides of `batch_size`, so the `num_train` training examples (or the `num_val` validation examples) are split into minibatches and one minibatch is yielded at a time.

`get_dataloader` works from a batch size, a matrix of features, and a vector of labels, and generates minibatches of size `batch_size`. Each minibatch is therefore a tuple of features and labels. Note that we need to be mindful of whether we are in training or validation mode: in the former we want to read the data in random order, whereas in the latter reading the data in a pre-defined order may be important for debugging.
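The index arithmetic can be seen in isolation with plain Python lists. The helper `batch_indices` below is a hypothetical name, not part of `d2l`; it mirrors the loop above on 10 examples with a batch size of 4.

```python
import random


def batch_indices(num_examples, batch_size, shuffle):
    """Yield lists of example indices, one list per minibatch."""
    indices = list(range(num_examples))
    if shuffle:
        random.shuffle(indices)  # training mode: random order
    for i in range(0, len(indices), batch_size):
        yield indices[i: i + batch_size]  # last batch may be smaller


batches = list(batch_indices(10, 4, shuffle=False))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

shuffled = list(batch_indices(10, 4, shuffle=True))
print([len(batch) for batch in shuffled])  # [4, 4, 2]
```

Shuffling changes which examples land in each minibatch but not the batch sizes, and every index still appears exactly once per pass.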

While the iteration implemented above is good for didactic purposes, it is inefficient in ways that might get us into trouble with real problems. For example, it requires that we load all the data in memory and that we perform lots of random memory access. The built-in iterators implemented in a deep learning framework are considerably more efficient and they can deal with sources such as data stored in files, data received via a stream, and data generated or processed on the fly.

## Concise Implementation of the Data Loader

In [15]:
@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=train)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
len(data.train_dataloader())

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


32

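`get_tensorloader` wraps PyTorch's `torch.utils.data.DataLoader` and is saved in the `d2l.torch` library. It is more efficient than the hand-written iterator and adds functionality such as built-in shuffling. `len(data.train_dataloader())` returns 32 because 1000 training examples at batch size 32 give ceil(1000 / 32) = 32 minibatches, the last of which holds only 8 examples.
Data loaders differ from one data processing pipeline to another, which is why `get_dataloader` is not implemented in `d2l.torch.DataModule` itself but in `SyntheticRegressionData`, which covers a commonly used case.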
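A minimal sketch of the same pattern using PyTorch directly, without `d2l`. The tensor sizes are chosen to match the 1000 training examples and batch size 32 used above; the variable names are illustrative, and the data here is random rather than the regression data.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 2)
y = torch.randn(1000, 1)

dataset = TensorDataset(X, y)  # pairs X[i] with y[i]
loader = DataLoader(dataset, batch_size=32, shuffle=True)

num_batches = len(loader)  # number of minibatches per epoch
print(num_batches, math.ceil(1000 / 32))

batch_sizes = [len(xb) for xb, yb in loader]
print(batch_sizes[0], batch_sizes[-1])
```

The final minibatch is smaller than the rest because 1000 is not a multiple of 32; `DataLoader` accepts `drop_last=True` if equal-sized batches are needed.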

In [35]:
class SyntheticRegressionData_onfly(d2l.DataModule):
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size = 32, seed:int = 0):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        torch.manual_seed(seed)
        self.X  = torch.randn(n, len(w))
        noise = torch.randn(n,1) *noise
        self.y = torch.matmul(self.X, w.reshape((-1,1))) + b + noise

data = SyntheticRegressionData_onfly(w=torch.tensor([2, -3.4]), b=4.2, seed=5)
print('features:', data.X[0], '\nlabel:', data.y[0])

features: tensor([1.8423, 0.5189]) 
label: tensor([6.1095])
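
The `seed` argument makes the generated data repeatable, since `torch.manual_seed` resets PyTorch's random number generator to a known state. A small sketch of that effect, using an arbitrary seed value:

```python
import torch

torch.manual_seed(5)
first = torch.randn(3, 2)

torch.manual_seed(5)  # reset the generator to the same state
second = torch.randn(3, 2)

third = torch.randn(3, 2)  # generator has advanced, so values differ

print(torch.equal(first, second))
print(torch.equal(second, third))
```

Two instances of `SyntheticRegressionData_onfly` built with the same `seed` and the same arguments should therefore hold identical `X` and `y`.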
