### **3.3. Synthetic Regression Data**

- Synthetic data is information that is artificially generated rather than produced by real-world events.

- While we might not care intrinsically about the patterns that we ourselves baked into an artificial data generating model, such datasets are nevertheless **useful for didactic purposes**, helping us to **evaluate the properties** of our learning algorithms and to **confirm that our implementations work as expected**.

In [1]:
%matplotlib inline
import random
import torch
from d2l import torch as d2l

#### **3.3.1. Generating the Dataset**

- *Ground truth linear function*:
$$\mathbf{y} = \mathbf{X}\mathbf{w} + b + \boldsymbol{\epsilon}$$
> Assume each noise term is drawn independently as $\epsilon \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = 0.01$.
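
As a framework-free illustration of this generating process, the sketch below draws a small dataset using only Python's standard library. The helper name `make_synthetic` and the fixed seed are assumptions made for illustration; the notebook's own implementation works on `torch` tensors instead.

```python
import random
import statistics

def make_synthetic(w, b, sigma=0.01, n=1000, seed=0):
    """Draw n examples from y = <x, w> + b + eps, with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w]  # standard normal features
        clean = sum(xi * wi for xi, wi in zip(x, w)) + b
        X.append(x)
        y.append(clean + rng.gauss(0.0, sigma))  # add Gaussian noise
    return X, y

w, b = [2.0, -3.4], 4.2
X, y = make_synthetic(w, b)

# Subtracting the noise-free function from the labels recovers the noise term.
residuals = [yi - (sum(xi * wi for xi, wi in zip(x, w)) + b) for x, yi in zip(X, y)]
noise_std = statistics.pstdev(residuals)
print(f"empirical noise std: {noise_std:.4f}")  # should be close to sigma = 0.01
```

Because the true parameters are known, the spread of the residuals can be compared directly with the chosen $\sigma$, which is the kind of sanity check that synthetic data makes possible.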

In [2]:
class SyntheticRegressionData(d2l.DataModule): #@save
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000, batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise


In [3]:
# w = [2, -3.4]^T, b = 4.2
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)

print('features:', data.X[0], len(data.X), '\nlabel:', data.y[0])

features: tensor([-0.5329, -0.7585]) 2000 
label: tensor([5.7128])
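
The printed example can be checked by hand against the ground-truth function. The short sketch below plugs the displayed features into $\mathbf{w}^\top \mathbf{x} + b$ and compares the result with the displayed label. The numbers are copied from the output above, which is rounded to four decimals, so the residual is only approximate.

```python
# Values copied from the printed output above (rounded to four decimals).
w = [2.0, -3.4]
b = 4.2
x0 = [-0.5329, -0.7585]
y0 = 5.7128

# Noise-free prediction: 2 * (-0.5329) + (-3.4) * (-0.7585) + 4.2
noise_free = sum(xi * wi for xi, wi in zip(x0, w)) + b
residual = y0 - noise_free  # what remains is attributed to the noise term

print(f"noise-free label: {noise_free:.4f}")
print(f"residual:         {residual:.4f}")
```

The noise-free value is about 5.7131, so the residual is roughly -0.0003, well within what a noise standard deviation of 0.01 would produce.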


#### **3.3.2. Reading the Dataset**

- The `get_dataloader` method below uses the stored batch size, feature matrix, and label vector to generate minibatches of size `batch_size`.

- As such, each minibatch consists of a tuple of features and labels. 

- Note that we need to be mindful of whether we’re in training or validation mode: in the former, we will want to read the data in random order, whereas for the latter, being able to read data in a pre-defined order may be important for debugging purposes.

In [4]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train+self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i: i+self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]
        # `yield` pauses the function here and resumes on the next iteration

In [5]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])
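
To make the difference between training and validation order concrete, here is a minimal sketch of the same index logic without tensors. The function name `batch_indices` and the small dataset sizes are assumptions chosen so that the batches are easy to inspect.

```python
import random

def batch_indices(num_train, num_val, batch_size, train):
    """Yield lists of example indices, mirroring the loop above without tensors."""
    if train:
        indices = list(range(num_train))
        random.shuffle(indices)  # training: a new random order on every pass
    else:
        # validation: the same fixed order on every pass
        indices = list(range(num_train, num_train + num_val))
    for i in range(0, len(indices), batch_size):
        yield indices[i:i + batch_size]

# Small sizes (an assumption) so the batches can be read at a glance.
val_pass_1 = list(batch_indices(10, 6, 4, train=False))
val_pass_2 = list(batch_indices(10, 6, 4, train=False))
train_pass = list(batch_indices(10, 6, 4, train=True))

print(val_pass_1)  # [[10, 11, 12, 13], [14, 15]]
print(train_pass)  # indices 0..9 in random order, in batches of 4, 4 and 2
```

Two validation passes yield identical batches, which helps when debugging, while each training pass visits every training example exactly once in a fresh order.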


#### **3.3.3. Concise Implementation of the Data Loader**

In [6]:
@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    # "*" used to unpack iterables
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

In [7]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
len(data.train_dataloader())

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


32
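
The length of 32 follows from simple arithmetic, assuming the `DataLoader` keeps its default `drop_last=False` so that a final partial batch is retained. A quick check:

```python
import math

num_train, batch_size = 1000, 32  # defaults from SyntheticRegressionData

# 31 full batches cover 992 examples; one partial batch holds the rest.
num_batches = math.ceil(num_train / batch_size)
last_batch_size = num_train - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)  # 32 8
```

The first minibatch therefore has 32 rows, as printed above, while the last one contains only 8 examples.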

#### **3.3.4. Summary**

- Data loaders are a convenient way of abstracting out the process of loading and manipulating data. This way the same machine learning algorithm is capable of processing many different types and sources of data without the need for modification. 

- One of the nice things about data loaders is that they can be composed. For instance, we might be loading images and then have a postprocessing filter that crops them or modifies them in other ways. As such, data loaders can be used to describe an entire data processing pipeline.
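
Composition can be sketched with plain Python generators. In the example below, the names `batches`, `mapped`, and `crop` are hypothetical, and `crop` is only a stand-in for an image transform; it is not how `torch.utils.data` implements its pipelines.

```python
def batches(examples, batch_size):
    """Base loader: yield consecutive minibatches from a list of examples."""
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]

def mapped(loader, transform):
    """Wrap a loader so that `transform` is applied to every example it yields."""
    for batch in loader:
        yield [transform(example) for example in batch]

def crop(row, width=2):
    """Stand-in for an image crop: keep only the first `width` entries."""
    return row[:width]

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The wrapped loader is itself a loader, so further stages could be stacked on top.
pipeline = mapped(batches(rows, batch_size=2), crop)
result = list(pipeline)
print(result)  # [[[1, 2], [4, 5]], [[7, 8]]]
```

Since each stage consumes and produces an iterator over minibatches, the code that trains on the batches does not need to know how many processing steps came before it.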