# Real-world data representation using tensors

### The following topics will be covered:
- Representing real-world data as PyTorch tensors
- Working with a range of data types
- Loading data from a file
- Converting data to tensors
- Shaping tensors so they can be used as inputs for neural network models

Neural networks take tensors as input and produce tensors as outputs. In
fact, all operations within a neural network and during optimization are operations
between tensors, and all parameters (for example, weights and biases) in a neural
network are tensors. Having a good sense of how to perform operations on tensors
and index them effectively is central to using tools like PyTorch successfully.

## Working with images

An image is represented as a collection of scalars arranged in a regular grid with a
height and a width (in pixels). We might have a single scalar per grid point (the
pixel), which would be represented as a grayscale image; or multiple scalars per grid
point, which would typically represent different colors, as we saw in the previous chapter, or different features like depth from a depth camera.

### Adding color channels

In the RGB color model, a color is defined by three numbers representing the intensity of red, green, and blue.

### Loading an image file

Images come in several different file formats, but luckily there are plenty of ways to
load images in Python. Let’s start by loading a JPEG image using the imageio module:

In [1]:
import imageio
import torch
import os
import pandas as pd
import numpy as np
import csv



In [2]:
img_arr = imageio.imread('./dlwpt-code-master/dlwpt-code-master/data/p1ch4/image-dog/bobby.jpg')
img_arr[:3, :3, ]



Array([[[77, 45, 22],
        [77, 45, 22],
        [78, 46, 21]],

       [[75, 43, 20],
        [76, 44, 21],
        [77, 45, 20]],

       [[74, 39, 17],
        [75, 41, 16],
        [77, 43, 18]]], dtype=uint8)

### Changing the layout

We can use the tensor’s permute method with the old dimensions for each new dimension to get to an appropriate layout. Given an input tensor H × W × C as obtained previously, we get a proper C × H × W layout by putting dimension 2 first, followed by dimensions 0 and 1:

In [3]:
img = torch.from_numpy(img_arr)
out = img.permute(2, 0, 1)
out.shape, out[:1]

(torch.Size([3, 720, 1280]),
 tensor([[[ 77,  77,  78,  ..., 118, 117, 116],
          [ 75,  76,  77,  ..., 118, 117, 116],
          [ 74,  75,  77,  ..., 119, 117, 116],
          ...,
          [215, 216, 217,  ..., 172, 174, 174],
          [215, 216, 217,  ..., 173, 174, 174],
          [215, 216, 217,  ..., 159, 158, 158]]], dtype=torch.uint8))

This operation does not make a copy of the
tensor data. Instead, out uses the same underlying storage as img and only plays with
the size and stride information at the tensor level. This is convenient because the
operation is very cheap; but just as a heads-up: changing a pixel in img will lead to a
change in out.

Note also that other deep learning frameworks use different layouts. For instance,
originally TensorFlow kept the channel dimension last, resulting in an H × W × C layout (it now supports multiple layouts). This strategy has pros and cons from a low-level
performance standpoint, but for our concerns, it doesn’t make a difference as long as
we reshape our tensors properly.
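To make the shared-storage behavior concrete, here is a minimal sketch on a tiny stand-in tensor (the 2 × 3 × 3 `img` below is made up for illustration and is not the dog image loaded earlier). It checks that `permute` only changes strides, and that a write through one tensor is visible through the other.

```python
import torch

# Small stand-in for an H x W x C image (2 x 3 pixels, 3 channels).
img = torch.arange(18, dtype=torch.uint8).reshape(2, 3, 3)
out = img.permute(2, 0, 1)  # C x H x W view of the same storage

# Same underlying memory, different strides.
same_storage = img.data_ptr() == out.data_ptr()
print(same_storage, img.stride(), out.stride())

# Writing through img is visible through out.
img[0, 0, 0] = 255
print(out[0, 0, 0])

# .contiguous() copies the data here because out is not contiguous,
# so the copy no longer follows later changes to img.
out_copy = out.contiguous()
img[0, 0, 0] = 0
print(out_copy[0, 0, 0], out[0, 0, 0])
```

If an independent tensor is needed, calling `.clone()` on the permuted tensor is another option.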


So far, we have described a single image. Following the same strategy we’ve used
for earlier data types, to create a dataset of multiple images to use as an input for our
neural networks, we store the images in a batch along the first dimension to obtain an
N × C × H × W tensor.
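As a minimal sketch of building such a batch with `torch.stack`, the example below uses random uint8 tensors as stand-ins for real images (the `images` list is an assumption made for illustration; it is not loaded from disk).

```python
import torch

# Three stand-in C x H x W images; random uint8 values replace real pixel data.
images = [torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8) for _ in range(3)]

# torch.stack adds a new leading dimension and copies each image into it.
batch_stacked = torch.stack(images, dim=0)
print(batch_stacked.shape)
```

All images must have the same shape for `torch.stack` to work, so real images usually need to be resized or cropped first.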

As a slightly more efficient alternative to using stack to build up the tensor, we can preallocate a tensor of appropriate size and fill it with images loaded from a directory, like so:

In [4]:
batch_size = 3
batch = torch.zeros(batch_size, 3, 256, 256, dtype=torch.uint8)
batch.shape

torch.Size([3, 3, 256, 256])

We can now load all PNG images from an input directory and store them in
the tensor:

In [5]:
data_dir = './dlwpt-code-master/dlwpt-code-master/data/p1ch4/image-cats/'
filename = [name for name in os.listdir(data_dir) if os.path.splitext(name)[1]=='.png']
filename

['cat1.png', 'cat2.png', 'cat3.png']

In [6]:
for i, file in enumerate(filename):
    img_arr = imageio.imread(os.path.join(data_dir, file))
    img_t = torch.from_numpy(img_arr)
    img_t = img_t.permute(2, 0, 1)  # H x W x C -> C x H x W
    img_t = img_t[:3]               # keep only the first three channels (drops an alpha channel, if any)
    batch[i] = img_t
    print(batch.shape)


torch.Size([3, 3, 256, 256])
torch.Size([3, 3, 256, 256])
torch.Size([3, 3, 256, 256])




In [7]:
img_arr.shape

(256, 256, 3)

In [8]:
batch.shape

torch.Size([3, 3, 256, 256])

### Normalizing the data

Neural networks usually work with floating-point tensors as
their input. Neural networks exhibit the best training performance when the input
data ranges roughly from 0 to 1, or from -1 to 1 (this is an effect of how their building
blocks are defined).

So a typical thing we’ll want to do is cast a tensor to floating-point and normalize
the values of the pixels. Casting to floating-point is easy, but normalization is trickier,
as it depends on what range of the input we decide should lie between 0 and 1 (or -1
and 1). One possibility is to just divide the values of the pixels by 255 (the maximum
representable number in 8-bit unsigned):

In [9]:
batch = batch.float()
batch = batch/255
batch

tensor([[[[0.6118, 0.5961, 0.4863,  ..., 0.5882, 0.5843, 0.6196],
          [0.6824, 0.5255, 0.6471,  ..., 0.4706, 0.5333, 0.5412],
          [0.4980, 0.6118, 0.4196,  ..., 0.5137, 0.5608, 0.6431],
          ...,
          [0.4549, 0.5098, 0.5059,  ..., 0.4980, 0.4627, 0.4392],
          [0.5059, 0.5098, 0.4824,  ..., 0.4510, 0.4745, 0.4471],
          [0.5059, 0.4824, 0.4627,  ..., 0.4431, 0.4745, 0.4706]],

         [[0.5451, 0.5294, 0.4275,  ..., 0.5294, 0.5294, 0.5765],
          [0.6275, 0.4667, 0.5843,  ..., 0.4118, 0.4784, 0.4863],
          [0.4431, 0.5490, 0.3529,  ..., 0.4627, 0.5059, 0.5961],
          ...,
          [0.3882, 0.4314, 0.4353,  ..., 0.4588, 0.4235, 0.4039],
          [0.4353, 0.4353, 0.4157,  ..., 0.4157, 0.4392, 0.4118],
          [0.4353, 0.4078, 0.4000,  ..., 0.4039, 0.4314, 0.4353]],

         [[0.5059, 0.4824, 0.3843,  ..., 0.5137, 0.5176, 0.5686],
          [0.6078, 0.4314, 0.5373,  ..., 0.4000, 0.4667, 0.4745],
          [0.4078, 0.5176, 0.3137,  ..., (output truncated)

Another possibility is to compute the mean and standard deviation of the input data
and scale it so that the output has zero mean and unit standard deviation across each
channel:

In [10]:
n_channels = batch.shape[1]
n_channels

3

In [11]:
for c in range(n_channels):
    mean = torch.mean(batch[:, c])
    std = torch.std(batch[:, c])
    batch[:, c] = (batch[:, c] - mean)/std
        

In [12]:
batch[:, 2].shape

torch.Size([3, 256, 256])

In [13]:
batch

tensor([[[[ 0.1439,  0.0730, -0.4234,  ...,  0.0375,  0.0198,  0.1794],
          [ 0.4631, -0.2461,  0.3035,  ..., -0.4944, -0.2107, -0.1752],
          [-0.3703,  0.1439, -0.7249,  ..., -0.2993, -0.0866,  0.2858],
          ...,
          [-0.5653, -0.3171, -0.3348,  ..., -0.3703, -0.5298, -0.6362],
          [-0.3348, -0.3171, -0.4412,  ..., -0.5830, -0.4766, -0.6007],
          [-0.3348, -0.4412, -0.5298,  ..., -0.6185, -0.4766, -0.4944]],

         [[ 0.4632,  0.3874, -0.1058,  ...,  0.3874,  0.3874,  0.6150],
          [ 0.8615,  0.0839,  0.6529,  ..., -0.1816,  0.1408,  0.1787],
          [-0.0299,  0.4822, -0.4661,  ...,  0.0649,  0.2736,  0.7098],
          ...,
          [-0.2954, -0.0868, -0.0678,  ...,  0.0460, -0.1247, -0.2196],
          [-0.0678, -0.0678, -0.1627,  ..., -0.1627, -0.0489, -0.1816],
          [-0.0678, -0.2006, -0.2385,  ..., -0.2196, -0.0868, -0.0678]],

         [[ 0.7792,  0.6573,  0.1495,  ...,  0.8198,  0.8401,  1.1041],
          [ 1.3072,  0.3933,  ... (output truncated)

Here, we normalize just a single batch of images because we do not
know yet how to operate on an entire dataset. In working with images, it is good
practice to compute the mean and standard deviation on all the training data
in advance and then subtract and divide by these fixed, precomputed quantities. 
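A minimal sketch of that practice follows, assuming the whole training set fits in one tensor. The random `train_images` tensor and the `normalize` helper are hypothetical names introduced here for illustration. The statistics are computed once per channel and then reused for any batch.

```python
import torch

torch.manual_seed(0)
# Stand-in for the full training set: 8 float images, N x C x H x W.
train_images = torch.rand(8, 3, 16, 16)

# Per-channel statistics, reduced over batch, height, and width.
mean = train_images.mean(dim=(0, 2, 3))
std = train_images.std(dim=(0, 2, 3))

def normalize(batch, mean, std):
    # Reshape to 1 x C x 1 x 1 so the statistics broadcast over N, H, and W.
    return (batch - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)

normalized = normalize(train_images, mean, std)
print(normalized.mean(dim=(0, 2, 3)), normalized.std(dim=(0, 2, 3)))
```

Unlike the loop above, this version does not modify the input in place, and the same `mean` and `std` can be applied to validation or test batches later.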

## 3D Images - Volumetric Data

In some contexts, such as medical imaging applications involving, say, CT (computed
tomography) scans, we typically deal with sequences of images stacked along the head-to-foot axis, each corresponding to a slice across the human body. In CT scans, the intensity represents the density of the different parts of the body (lungs, fat, water, muscle,
and bone, in order of increasing density), mapped from dark to bright when the CT
scan is displayed on a clinical workstation. The density at each point is computed from
the amount of X-rays reaching a detector after crossing through the body, with some
complex math to deconvolve the raw sensor data into the full volume.

CTs have only a single intensity channel, similar to a grayscale image. This means
that often, the channel dimension is left out in native data formats. By stacking individual 2D
slices into a 3D tensor, we can build volumetric data representing the 3D anatomy of a
subject.

For now, it suffices to say that there’s no fundamental difference between a tensor storing volumetric data versus image data. We just have an extra dimension, depth, after the channel
dimension, leading to a 5D tensor of shape N × C × D × H × W.

Let’s load a sample CT scan using the volread function in the imageio module, which
takes a directory as an argument and assembles all Digital Imaging and Communications in Medicine (DICOM) files in a series into a NumPy 3D array:

In [14]:
dir_path = './dlwpt-code-master/dlwpt-code-master/data/p1ch4/volumetric-dicom/2-LUNG 3.0  B70f-04083/'
vol_read = imageio.volread(dir_path, 'DICOM')

Reading DICOM (examining files): 99/99 files (100.0%)
  Found 1 correct series.
Reading DICOM (loading data): 99/99  (100.0%)


In [15]:
vol_read[0].shape

(512, 512)

As was true when we changed the layout of a 2D image, the layout is different from what PyTorch expects, due to
having no channel information. So we’ll have to make room for the channel dimension using unsqueeze:

In [16]:
vol = torch.from_numpy(vol_read).float()
vol = torch.unsqueeze(vol, dim=0)
vol.shape

torch.Size([1, 99, 512, 512])

At this point we could assemble a 5D dataset by stacking multiple volumes along the
batch direction, just as we did in the previous section. 
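As a rough sketch of that stacking step, the example below uses small random tensors as hypothetical stand-ins for several CT volumes that have already been given a channel dimension. It does not read any DICOM files.

```python
import torch

# Hypothetical stand-ins for three CT volumes, each already C x D x H x W.
# Small sizes keep the sketch light; the scan above is 1 x 99 x 512 x 512.
volumes = [torch.randn(1, 4, 8, 8) for _ in range(3)]

# Stacking along a new first dimension gives N x C x D x H x W.
vol_batch = torch.stack(volumes, dim=0)
print(vol_batch.shape)
```

This only works when every volume has the same number of slices and the same slice size. Real scans often differ in depth, so they would typically need cropping, padding, or resampling before they can be stacked.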

In [80]:
# batch_size = 3
# batch = torch.zeros(batch_size, 1, 99, vol.shape[2], vol.shape[3]).to(torch.uint8)
# batch.shape

In [81]:
# data_dir = './dlwpt-code-master/dlwpt-code-master/data/p1ch4/volumetric-dicom/2-LUNG 3.0  B70f-04083/'
# filenames = [name for name in os.listdir(data_dir) if os.path.splitext(name)[-1] == '.dcm']
# filenames[:2]

In [82]:
# for i, file in enumerate(filenames):
#     img_arr = imageio.volread(os.path.join(data_dir, file))
#     img_t = torch.from_numpy(img_arr).float()
#     img_t = torch.unsqueeze(img_t, axis=0)
#     batch[i] = img_t
  

## Representing tabular data


The simplest form of data we’ll encounter on a machine learning job is sitting in a
spreadsheet, CSV file, or database. Whatever the medium, it’s a table containing one
row per sample (or record), where each column contains one piece of information about
our sample.

Columns may contain numerical values, like temperatures at specific locations; or
labels, like a string expressing an attribute of the sample, like “blue.” Therefore, tabular data is typically not homogeneous: different columns don’t have the same type. We
might have a column showing the weight of apples and another encoding their color
in a label. PyTorch tensors, on the other hand, are homogeneous. Information in PyTorch is
typically encoded as a number, typically floating-point (though integer types and
Boolean are supported as well). This numeric encoding is deliberate, since neural
networks are mathematical entities that take real numbers as inputs and produce real
numbers as output through successive application of matrix multiplications and
nonlinear functions.

### Using a real-world dataset

Our first job as deep learning practitioners is to encode heterogeneous, real-world
data into a tensor of floating-point numbers, ready for consumption by a neural network. The Wine Quality dataset is a freely available table containing
chemical characterizations of samples of vinho verde, a wine from north Portugal,
together with a sensory quality score.

The file contains a comma-separated collection of values organized in 12 columns
preceded by a header line containing the column names. The first 11 columns contain values of chemical variables, and the last column contains the sensory quality
score from 0 (very bad) to 10 (excellent).

A possible machine learning task on this dataset is predicting the quality score from
chemical characterization alone.

### Loading a wine data tensor

Before we can get to that, however, we need to be able to examine the data in a more
usable way than opening the file in a text editor. Let’s see how we can load the data
using Python and then turn it into a PyTorch tensor. Python offers several options for
quickly loading a CSV file. Three popular options are
- The csv module that ships with Python
- NumPy
- Pandas

The third option is the most time- and memory-efficient. However, we’ll avoid introducing an additional library in our learning trajectory just because we need to load a
file. Since we already introduced NumPy in the previous section, and PyTorch has
excellent NumPy interoperability, we’ll go with that. Let’s load our file and turn the
resulting NumPy array into a PyTorch tensor:

In [118]:
# data = pd.read_csv('./dlwpt-code-master/dlwpt-code-master/data/p1ch4/tabular-wine/winequality-white.csv', delimiter=';')
# data

In [116]:
wine_path = './winequality-white.csv'
wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=";", skiprows=1)
wineq_numpy

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

Here we specify the type of the 2D array (32-bit floating-point),
the delimiter used to separate values in each row, and the fact that the first line should
not be read since it contains the column names. Let’s check that all the data has been
read by looking at the column names in the header line:

In [128]:
with open(wine_path, newline='') as f:  # close the file once the header is read
    col_list = next(csv.reader(f, delimiter=';'))
col_list

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality']

Next, we convert the NumPy array to a PyTorch tensor:

In [131]:
wineq = torch.from_numpy(wineq_numpy).float()
wineq.dtype

torch.float32

At this point, we have a floating-point torch.Tensor containing all the columns,
including the last, which refers to the quality score.

### Representing scores

We could treat the score as a continuous variable, keep it as a real number, and perform a regression task, or treat it as a label and try to guess the label from the chemical analysis in a classification task. In both approaches, we will typically remove the
score from the tensor of input data and keep it in a separate tensor, so that we can use
the score as the ground truth without it being input to our model:

In [136]:
data = wineq[:, :-1]
data.shape

torch.Size([4898, 11])

In [143]:
target = wineq[:, -1].long()
target.shape

torch.Size([4898])

If we want to transform the target tensor into a tensor of labels, we have two options,
depending on the strategy or what we use the categorical data for. One is simply to
treat labels as an integer vector of scores:

In [146]:
target.shape[0]

4898

### One-hot encoding
The other approach is to build a one-hot encoding of the scores: that is, encode each of
the 10 scores in a vector of 10 elements, with all elements set to 0 but one, at a different index for each score. This way, a score of 1 could be mapped onto the vector
(1,0,0,0,0,0,0,0,0,0), a score of 5 onto (0,0,0,0,1,0,0,0,0,0), and so on. Note
that the fact that the score corresponds to the index of the nonzero element is purely
incidental: we could shuffle the assignment, and nothing would change from a classification standpoint.

There’s a marked difference between the two approaches. Keeping wine quality
scores in an integer vector of scores induces an ordering on the scores—which might
be totally appropriate in this case, since a score of 1 is lower than a score of 4. It also
induces some sort of distance between scores: that is, the distance between 1 and 3 is the
same as the distance between 2 and 4. If this holds for our quantity, then great. If, on
the other hand, scores are purely discrete, like grape variety, one-hot encoding will be
a much better fit, as there’s no implied ordering or distance. One-hot encoding is also
appropriate for quantitative scores when fractional values in between integer scores,
like 2.4, make no sense for the application, that is, when the score is either this or that.

We can achieve one-hot encoding using the scatter_ method, which fills the tensor with values from a source tensor along the indices provided as arguments:

In [156]:
target_onehot = torch.zeros(target.shape[0], 10)
target_onehot

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

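scatter_ expects the index tensor to have dtype int64. The earlier call to long() already produced that dtype, so the cast below is a no-op that simply makes the requirement explicit: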

In [174]:
target = target.to(torch.int64)

In [175]:
target_onehot.scatter_(1, target.unsqueeze(1), 1)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

Note that the name scatter_ ends with an underscore. This is a convention in PyTorch indicating that
the method will not return a new tensor, but will instead modify the tensor in place.

The arguments for scatter_ are as follows:
- The dimension along which the following two arguments are specified
- A column tensor indicating the indices of the elements to scatter
- A tensor containing the elements to scatter or a single scalar to scatter (1, in this case)

In other words, the previous invocation reads, “For each row, take the index of the target label (which coincides with the score in our case) and use it as the column index
to set the value 1.0.” The end result is a tensor encoding categorical information.
The second argument of scatter_, the index tensor, is required to have the same
number of dimensions as the tensor we scatter into. Since target_onehot has two
dimensions (4,898 × 10), we need to add an extra dummy dimension to target using
unsqueeze:

In [182]:
target.unsqueeze(1)[0, 0]

tensor(6)

The call to unsqueeze adds a singleton dimension, from a 1D tensor of 4,898 elements
to a 2D tensor of size (4,898 × 1), without changing its contents—no extra elements
are added; we just decided to use an extra index to access the elements. That is, we
access the first element of target as target[0] and the first element of its
unsqueezed counterpart as target.unsqueeze(1)[0, 0].

PyTorch allows us to use class indices directly as targets while training neural networks. However, if we wanted to use the score as a categorical input to the network, we
would have to transform it to a one-hot-encoded tensor.
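To see the scatter_ mechanics on something small enough to read, here is a minimal sketch with four made-up labels (the `labels` tensor is an assumption for illustration, not taken from the wine data). It also compares the result with `torch.nn.functional.one_hot`, which builds the same encoding directly.

```python
import torch
import torch.nn.functional as F

# Toy integer labels standing in for wine scores.
labels = torch.tensor([3, 6, 9, 6])

# scatter_: for each row i, write 1.0 into column labels[i].
onehot = torch.zeros(labels.shape[0], 10)
onehot.scatter_(1, labels.unsqueeze(1), 1.0)
print(onehot)

# torch.nn.functional.one_hot builds the same encoding (as int64).
onehot_f = F.one_hot(labels, num_classes=10)
print(torch.equal(onehot, onehot_f.float()))

# argmax along dim 1 recovers the original labels.
print(onehot.argmax(dim=1))
```

With this indexing a label of k lands in column k, counting from zero, so 10 columns cover labels 0 through 9.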

### When to categorize

Let’s go back to our data tensor, containing the 11 variables associated with the chemical
analysis. We can use the functions in the PyTorch Tensor API to manipulate our data in
tensor form. Let’s first obtain the mean and variance for each column:

In [191]:
data_mean = torch.mean(data, dim=0)
data_mean

tensor([6.8548e+00, 2.7824e-01, 3.3419e-01, 6.3914e+00, 4.5772e-02, 3.5308e+01,
        1.3836e+02, 9.9403e-01, 3.1883e+00, 4.8985e-01, 1.0514e+01])

In [194]:
data_var = torch.var(data, dim=0)
data_var

tensor([7.1211e-01, 1.0160e-02, 1.4646e-02, 2.5726e+01, 4.7733e-04, 2.8924e+02,
        1.8061e+03, 8.9455e-06, 2.2801e-02, 1.3025e-02, 1.5144e+00])

In this case, dim=0 indicates that the reduction is performed along dimension 0. At
this point, we can normalize the data by subtracting the mean and dividing by the
standard deviation (the square root of the variance), which helps with the learning process:

In [198]:
data_normalized = (data-data_mean)/torch.sqrt(data_var)
data_normalized[0]

tensor([ 0.1721, -0.0818,  0.2133,  2.8211, -0.0354,  0.5699,  0.7445,  2.3313,
        -1.2468, -0.3491, -1.3930])
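A quick way to sanity-check this kind of column-wise normalization is to confirm that every column ends up with mean close to 0 and standard deviation close to 1. The sketch below does that on a made-up three-column tensor (`toy_data` is a hypothetical stand-in with deliberately different scales, not the wine data).

```python
import torch

torch.manual_seed(0)
# Stand-in for a samples x features table whose columns have very different scales.
scales = torch.tensor([1.0, 50.0, 0.01])
offsets = torch.tensor([7.0, 140.0, 0.99])
toy_data = torch.randn(100, 3) * scales + offsets

toy_mean = torch.mean(toy_data, dim=0)
toy_std = torch.std(toy_data, dim=0)  # equivalent to torch.sqrt(torch.var(toy_data, dim=0))
toy_normalized = (toy_data - toy_mean) / toy_std

print(toy_normalized.mean(dim=0))
print(toy_normalized.std(dim=0))
```

The 1D mean and standard deviation tensors broadcast across the rows, so no explicit loop over columns is needed.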

### Finding thresholds
Next, let’s start to look at the data with an eye to seeing if there is an easy way to tell
good and bad wines apart at a glance. First, we’re going to determine which rows in
target correspond to a score less than or equal to 3:

In [204]:
bad_indexes = target <= 3
bad_indexes.shape, bad_indexes.sum()

(torch.Size([4898]), tensor(20))

Alternatively, we can index data with the comparison directly to get the matching rows:

In [225]:
indexes = data[target <= 3]
indexes, len(indexes)

(tensor([[8.5000e+00, 2.6000e-01, 2.1000e-01, 1.6200e+01, 7.4000e-02, 4.1000e+01,
          1.9700e+02, 9.9800e-01, 3.0200e+00, 5.0000e-01, 9.8000e+00],
         [5.8000e+00, 2.4000e-01, 4.4000e-01, 3.5000e+00, 2.9000e-02, 5.0000e+00,
          1.0900e+02, 9.9130e-01, 3.5300e+00, 4.3000e-01, 1.1700e+01],
         [9.1000e+00, 5.9000e-01, 3.8000e-01, 1.6000e+00, 6.6000e-02, 3.4000e+01,
          1.8200e+02, 9.9680e-01, 3.2300e+00, 3.8000e-01, 8.5000e+00],
         [7.1000e+00, 3.2000e-01, 3.2000e-01, 1.1000e+01, 3.8000e-02, 1.6000e+01,
          6.6000e+01, 9.9370e-01, 3.2400e+00, 4.0000e-01, 1.1500e+01],
         [6.9000e+00, 3.9000e-01, 4.0000e-01, 4.6000e+00, 2.2000e-02, 5.0000e+00,
          1.9000e+01, 9.9150e-01, 3.3100e+00, 3.7000e-01, 1.2600e+01],
         [1.0300e+01, 1.7000e-01, 4.7000e-01, 1.4000e+00, 3.7000e-02, 5.0000e+00,
          3.3000e+01, 9.9390e-01, 2.8900e+00, 2.8000e-01, 9.6000e+00],
         [7.9000e+00, 6.4000e-01, 4.6000e-01, 1.0600e+01, 2.4400e-01, 3.3000e+01,
          ... (output truncated)


In [210]:
target.min()

tensor(3)

Note that only 20 of the bad_indexes entries are set to True! By using a feature in
PyTorch called advanced indexing, we can use a tensor with data type torch.bool to
index the data tensor. This will essentially filter data to be only items (or rows) corresponding to True in the indexing tensor. The bad_indexes tensor has the same shape
as target, with values of False or True depending on the outcome of the comparison
between our threshold and each element in the original target tensor:

In [226]:
bad_data = data[bad_indexes]
bad_data, bad_data.shape

(tensor([[8.5000e+00, 2.6000e-01, 2.1000e-01, 1.6200e+01, 7.4000e-02, 4.1000e+01,
          1.9700e+02, 9.9800e-01, 3.0200e+00, 5.0000e-01, 9.8000e+00],
         [5.8000e+00, 2.4000e-01, 4.4000e-01, 3.5000e+00, 2.9000e-02, 5.0000e+00,
          1.0900e+02, 9.9130e-01, 3.5300e+00, 4.3000e-01, 1.1700e+01],
         [9.1000e+00, 5.9000e-01, 3.8000e-01, 1.6000e+00, 6.6000e-02, 3.4000e+01,
          1.8200e+02, 9.9680e-01, 3.2300e+00, 3.8000e-01, 8.5000e+00],
         [7.1000e+00, 3.2000e-01, 3.2000e-01, 1.1000e+01, 3.8000e-02, 1.6000e+01,
          6.6000e+01, 9.9370e-01, 3.2400e+00, 4.0000e-01, 1.1500e+01],
         [6.9000e+00, 3.9000e-01, 4.0000e-01, 4.6000e+00, 2.2000e-02, 5.0000e+00,
          1.9000e+01, 9.9150e-01, 3.3100e+00, 3.7000e-01, 1.2600e+01],
         [1.0300e+01, 1.7000e-01, 4.7000e-01, 1.4000e+00, 3.7000e-02, 5.0000e+00,
          3.3000e+01, 9.9390e-01, 2.8900e+00, 2.8000e-01, 9.6000e+00],
         [7.9000e+00, 6.4000e-01, 4.6000e-01, 1.0600e+01, 2.4400e-01, 3.3000e+01,
          ... (output truncated)
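The boolean-mask indexing used here is easier to follow on a tensor small enough to print in full. This is a minimal sketch with made-up `scores` and `features` tensors (both hypothetical, chosen only for illustration).

```python
import torch

scores = torch.tensor([6, 3, 7, 3, 5])
features = torch.arange(10, dtype=torch.float32).reshape(5, 2)

mask = scores <= 3         # True at positions 1 and 3
selected = features[mask]  # keeps rows 1 and 3
print(mask.dtype, mask.sum(), selected)

# Masks combine with & (and), | (or), and ~ (not). The parentheses are needed
# because these operators bind more tightly than the comparisons.
mid = features[(scores > 3) & (scores < 7)]
print(mid.shape)
```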


Note that the new bad_data tensor has 20 rows, the same as the number of rows with
True in the bad_indexes tensor. It retains all 11 columns. Now we can start to get
information about wines grouped into good, middling, and bad categories. Let’s take
the .mean() of each column:

In [230]:
bad_data = data[target <= 3]
mid_data = data[(target > 3) & (target < 7)]
good_data = data[(target >= 7)]
bad_data.shape

torch.Size([20, 11])

In [232]:
bad_mean = torch.mean(bad_data, dim=0)
bad_mean

tensor([7.6000e+00, 3.3325e-01, 3.3600e-01, 6.3925e+00, 5.4300e-02, 5.3325e+01,
        1.7060e+02, 9.9488e-01, 3.1875e+00, 4.7450e-01, 1.0345e+01])

In [233]:
mid_mean = torch.mean(mid_data, dim=0)
mid_mean

tensor([6.8869e+00, 2.8153e-01, 3.3644e-01, 6.7051e+00, 4.7841e-02, 3.5424e+01,
        1.4183e+02, 9.9447e-01, 3.1808e+00, 4.8707e-01, 1.0265e+01])

In [234]:
good_mean = torch.mean(good_data, dim=0)
good_mean

tensor([6.7251e+00, 2.6535e-01, 3.2606e-01, 5.2615e+00, 3.8160e-02, 3.4550e+01,
        1.2525e+02, 9.9241e-01, 3.2151e+00, 5.0014e-01, 1.1416e+01])

In [243]:
list(zip(col_list, bad_mean, mid_mean, good_mean))

[('fixed acidity', tensor(7.6000), tensor(6.8869), tensor(6.7251)),
 ('volatile acidity', tensor(0.3332), tensor(0.2815), tensor(0.2653)),
 ('citric acid', tensor(0.3360), tensor(0.3364), tensor(0.3261)),
 ('residual sugar', tensor(6.3925), tensor(6.7051), tensor(5.2615)),
 ('chlorides', tensor(0.0543), tensor(0.0478), tensor(0.0382)),
 ('free sulfur dioxide', tensor(53.3250), tensor(35.4240), tensor(34.5505)),
 ('total sulfur dioxide',
  tensor(170.6000),
  tensor(141.8330),
  tensor(125.2453)),
 ('density', tensor(0.9949), tensor(0.9945), tensor(0.9924)),
 ('pH', tensor(3.1875), tensor(3.1808), tensor(3.2151)),
 ('sulphates', tensor(0.4745), tensor(0.4871), tensor(0.5001)),
 ('alcohol', tensor(10.3450), tensor(10.2648), tensor(11.4160))]

In [274]:
for i, arg in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
    print('{:2}, {:20}, {:6.2f}, {:6.2f} {:6.2f}'.format(i, *arg))
    
    

 0, fixed acidity       ,   7.60,   6.89   6.73
 1, volatile acidity    ,   0.33,   0.28   0.27
 2, citric acid         ,   0.34,   0.34   0.33
 3, residual sugar      ,   6.39,   6.71   5.26
 4, chlorides           ,   0.05,   0.05   0.04
 5, free sulfur dioxide ,  53.33,  35.42  34.55
 6, total sulfur dioxide, 170.60, 141.83 125.25
 7, density             ,   0.99,   0.99   0.99
 8, pH                  ,   3.19,   3.18   3.22
 9, sulphates           ,   0.47,   0.49   0.50
10, alcohol             ,  10.34,  10.26  11.42


It looks like we’re on to something here: at first glance, the bad wines seem to have
higher total sulfur dioxide, among other differences. We could use a threshold on
total sulfur dioxide as a crude criterion for discriminating good wines from bad ones.
Let’s get the indexes where the total sulfur dioxide column is below the midpoint we
calculated earlier, like so:

In [302]:
total_sulfur_threshold = 141.83
total_sulfur_data = data[:, 6]
total_sulfur_data

tensor([170., 132.,  97.,  ..., 111., 110.,  98.])

In [303]:
predicted_index = torch.lt(total_sulfur_data, total_sulfur_threshold)
predicted_index.sum()

tensor(2727)

This means our threshold implies that just over half of all the wines are going to be
high quality. Next, we’ll need to get the indexes of the actually good wines:

In [305]:
actual_index = target > 5
actual_index.sum()

tensor(3258)

In [314]:
n_matches = torch.sum(actual_index & predicted_index).item()
n_matches

2018

In [318]:
n_predicted = torch.sum(predicted_index).item()
n_predicted

2727

In [319]:
n_actual = torch.sum(actual_index).item()
n_actual

3258

In [321]:
n_matches, n_matches/n_predicted, n_matches/n_actual

(2018, 0.74000733406674, 0.6193984039287906)

We got around 2,000 wines right! Since we predicted 2,727 wines to be good, this gives us a 74%
chance that if we predict a wine to be high quality, it actually is. Unfortunately, there
are 3,258 good wines, and we identified only 62% of them. Well, we got what we
signed up for; that’s barely better than random! Of course, this is all very naive: we
know for sure that multiple variables contribute to wine quality, and the relationships
between the values of these variables and the outcome (which could be the actual
score, rather than a binarized version of it) is likely more complicated than a simple
threshold on a single value.
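The two ratios computed above are commonly called precision (the fraction of predicted-good wines that really are good) and recall (the fraction of good wines we managed to find). As a small sketch, they can be wrapped in a helper. The function name `precision_recall` and the toy tensors below are assumptions for illustration.

```python
import torch

def precision_recall(predicted, actual):
    """Return (precision, recall) for two boolean tensors of equal shape.

    Assumes at least one True in each tensor; otherwise the divisions fail.
    """
    n_matches = torch.sum(predicted & actual).item()
    n_predicted = torch.sum(predicted).item()
    n_actual = torch.sum(actual).item()
    return n_matches / n_predicted, n_matches / n_actual

# Toy example: 4 predicted good, 5 actually good, 3 in common.
predicted = torch.tensor([True, True, True, True, False, False, False, False])
actual = torch.tensor([True, True, True, False, True, True, False, False])
precision, recall = precision_recall(predicted, actual)
print(precision, recall)

# The same ratios from the counts reported in the cells above.
print(2018 / 2727, 2018 / 3258)
```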

Indeed, a simple neural network would overcome all of these limitations, as would
a lot of other basic machine learning methods. We’ll have the tools to tackle this problem after the next two chapters, once we have learned how to build our first neural network from scratch. We will also revisit how to better grade our results in chapter 12.
Let’s move on to other data types for now.

## Working with time series
Going back to the wine dataset, we could have had a “year” column that allowed us
to look at how wine quality evolved year after year. Unfortunately, we don’t have such
data at hand, but we’re working hard on manually collecting the data samples, bottle
by bottle. (Stuff for our second edition.) In the meantime, we’ll switch to another
interesting dataset: data from a Washington, D.C., bike-sharing system reporting the
hourly count of rental bikes in 2011–2012 in the Capital Bikeshare system, along with
weather and seasonal information (available here: http://mng.bz/jgOx). Our goal
will be to take a flat, 2D dataset and transform it into a 3D one.

### Adding a time dimension
In the source data, each row is a separate hour of data (figure 4.5 shows a transposed
version of this to better fit on the printed page). We want to change the row-per-hour
organization so that we have one axis that increases at a rate of one day per index increment, and another axis that represents the hour of the day (independent of the date).
The third axis will be our different columns of data (weather, temperature, and so on).

In [17]:
# Column 1 holds date strings; the converter keeps only characters 8-9, the day of month.
bikes_numpy = np.loadtxt('./dlwpt-code-master/bike-sharing-dataset/hour-fixed.csv',
                         dtype=np.float32, delimiter=',', skiprows=1,
                         converters={1: lambda x: float(x[8:10])})
bikes = torch.from_numpy(bikes_numpy)
bikes

tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 3.0000e+00, 1.3000e+01,
         1.6000e+01],
        [2.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 8.0000e+00, 3.2000e+01,
         4.0000e+01],
        [3.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 5.0000e+00, 2.7000e+01,
         3.2000e+01],
        ...,
        [1.7377e+04, 3.1000e+01, 1.0000e+00,  ..., 7.0000e+00, 8.3000e+01,
         9.0000e+01],
        [1.7378e+04, 3.1000e+01, 1.0000e+00,  ..., 1.3000e+01, 4.8000e+01,
         6.1000e+01],
        [1.7379e+04, 3.1000e+01, 1.0000e+00,  ..., 1.2000e+01, 3.7000e+01,
         4.9000e+01]])

In [350]:
# dat = pd.read_csv('./dlwpt-code-master/bike-sharing-dataset/hour-fixed.csv', delimiter=',')
# dat["dteday"][17375], float(dat["dteday"][17375][8:10])

For every hour, the dataset reports the following variables:
- Index of record: instant
- Day of month: day
- Season: season (1: spring, 2: summer, 3: fall, 4: winter)
- Year: yr (0: 2011, 1: 2012)
- Month: mnth (1 to 12)
- Hour: hr (0 to 23)
- Holiday status: holiday
- Day of the week: weekday
- Working day status: workingday
- Weather situation: weathersit (1: clear, 2: mist, 3: light rain/snow, 4: heavy rain/snow)
- Temperature in °C: temp
- Perceived temperature in °C: atemp
- Humidity: hum
- Wind speed: windspeed
- Number of casual users: casual
- Number of registered users: registered
- Count of rental bikes: cnt

In a time series dataset such as this one, rows represent successive time-points: there is
a dimension along which they are ordered. Sure, we could treat each row as independent and try to predict the number of circulating bikes based on, say, a particular time
of day regardless of what happened earlier. However, the existence of an ordering
gives us the opportunity to exploit causal relationships across time. For instance, it
allows us to predict bike rides at one time based on the fact that it was raining at an
earlier time. For the time being, we’re going to focus on learning how to turn our
bike-sharing dataset into something that our neural network will be able to ingest in
fixed-size chunks.

### Shaping the data by time period
We might want to break up the two-year dataset into wider observation periods, like
days. This way we’ll have N (for number of samples) collections of C sequences of length
L. In other words, our time series dataset would be a tensor of dimension 3 and shape
N × C × L. The C would remain our 17 channels, while L would be 24: 1 per hour of
the day. There’s no particular reason why we must use chunks of 24 hours, though the
general daily rhythm is likely to give us patterns we can exploit for predictions. We
could also use 7 × 24 = 168 hour blocks to chunk by week instead, if we desired. All of
this depends, naturally, on our dataset having the right size—the number of rows must
be a multiple of 24 or 168. Also, for this to make sense, we cannot have gaps in the
time series.
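
The size requirement can be checked with a little arithmetic. The sketch below uses a synthetic stand-in with the same shape as the bikes tensor (17,520 rows and 17 columns); the helper chunk_by_hours is hypothetical and simply drops any trailing rows that do not fill a whole chunk, which is one possible policy rather than the only one.

```python
import torch

# Synthetic stand-in for the bikes tensor: 17,520 hourly rows, 17 columns.
n_rows, n_cols = 17520, 17
data = torch.arange(n_rows * n_cols, dtype=torch.float32).view(n_rows, n_cols)


def chunk_by_hours(t, hours):
    """Reshape (rows, C) into (N, C, hours), dropping a trailing partial chunk."""
    usable = (t.shape[0] // hours) * hours
    return t[:usable].view(-1, hours, t.shape[1]).transpose(1, 2)


daily = chunk_by_hours(data, 24)
weekly = chunk_by_hours(data, 7 * 24)
print(daily.shape)    # torch.Size([730, 17, 24])
print(weekly.shape)   # torch.Size([104, 17, 168])
print(n_rows % 168)   # 48 hourly rows do not fit into a whole week
```

With 17,520 rows, daily chunks divide evenly, while weekly chunks leave 48 hours over that would have to be trimmed or padded.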

Let’s go back to our bike-sharing dataset. The first column is the index (the global
ordering of the data), the second is the date, and the sixth is the time of day. We have
everything we need to create a dataset of daily sequences of ride counts and other
exogenous variables. Our dataset is already sorted, but if it were not, we could use
torch.sort on it to order it appropriately.
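
As a rough illustration of that fallback, the sketch below sorts the rows of a small, made-up tensor by its first column. The values are hypothetical; the point is that torch.sort returns both the sorted values and the indices, and the indices can be used to reorder whole rows.

```python
import torch

# Hypothetical shuffled rows: column 0 is the record index, column 1 a value.
rows = torch.tensor([[3.0, 30.0],
                     [1.0, 10.0],
                     [2.0, 20.0]])

# torch.sort returns the sorted values and the indices that produce them.
_, order = torch.sort(rows[:, 0])

# Indexing with those indices reorders entire rows, not just column 0.
sorted_rows = rows[order]
print(sorted_rows)
```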

All we have to do to obtain our daily hours dataset is view the same tensor in batches
of 24 hours. Let’s take a look at the shape and strides of our bikes tensor:


In [18]:
bikes.shape, bikes.stride()

(torch.Size([17520, 17]), (17, 1))

In [19]:
daily_bikes = bikes.view(-1, 24, bikes.shape[1])
daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 24, 17]), (408, 17, 1))

What happened here? First, bikes.shape[1] is 17, the number of columns in the
bikes tensor. The crux of this code, though, is the call to view: it changes the way the tensor looks at the same data contained in storage.

Calling view on a tensor returns a new tensor that changes the number of dimensions and the striding information, without
changing the storage. This means we can rearrange our tensor at basically zero cost,
because no data will be copied. Our call to view requires us to provide the new shape
for the returned tensor. We use -1 as a placeholder for “however many indexes are
left, given the other dimensions and the original number of elements.”

Storage is a contiguous, linear container for numbers (floating-point, in this case). Our bikes tensor will have each row
stored one after the other in its corresponding storage. This is confirmed by the output from the call to bikes.stride() earlier.
For daily_bikes, the stride is telling us that advancing by 1 along the hour dimension (the second dimension) requires us to advance by 17 places in the storage (or
one set of columns); whereas advancing along the day dimension (the first dimension) requires us to advance by a number of elements equal to the length of a row in
the storage times 24 (here, 408, which is 17 × 24).
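
The relationship between view, strides, and storage can be verified on a small example. This is a sketch on a synthetic tensor (two days of 24 hours with 3 columns, filled with 0, 1, 2, ... so that each value equals its position in storage), not on the bikes data itself.

```python
import torch

# Two "days" of 24 hourly rows with 3 columns each, numbered 0..143.
flat = torch.arange(48 * 3, dtype=torch.float32).view(48, 3)
daily = flat.view(-1, 24, flat.shape[1])     # -1 is inferred as 48 / 24 = 2

print(daily.shape, daily.stride())           # torch.Size([2, 24, 3]) (72, 3, 1)
# Both tensors start at the same memory address: no data was copied.
print(daily.data_ptr() == flat.data_ptr())   # True

# The strides give the storage offset of element (day, hour, col).
day, hour, col = 1, 5, 2
offset = sum(i * s for i, s in zip((day, hour, col), daily.stride()))
print(offset, daily[day, hour, col].item())  # 89 89.0
```

Because the values were chosen to match their storage position, the computed offset and the element read through the view agree.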

We see that the rightmost dimension is the number of columns in the original
dataset. Then, in the middle dimension, we have time, split into chunks of 24 sequential hours. In other words, we now have N sequences of L hours in a day, for C channels. To get to our desired N × C × L ordering, we need to transpose the tensor:

In [20]:
daily_bikes = daily_bikes.transpose(1, 2)
daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 17, 24]), (408, 1, 17))
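
Note that transpose only swaps the stride entries; the data stays where it is, so the result is no longer contiguous. The sketch below shows this on a small synthetic tensor and uses contiguous() to obtain a compact copy, which some operations (view, for instance) may require.

```python
import torch

daily = torch.arange(2 * 24 * 3, dtype=torch.float32).view(2, 24, 3)
transposed = daily.transpose(1, 2)           # shape (2, 3, 24), strides swapped

print(transposed.stride())                   # (72, 1, 3)
print(transposed.is_contiguous())            # False
# Same underlying storage as the original tensor.
print(transposed.data_ptr() == daily.data_ptr())   # True

# contiguous() copies the data into a fresh row-major layout.
compact = transposed.contiguous()
print(compact.stride())                      # (72, 24, 1)
print(torch.equal(compact, transposed))      # True
```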

### Ready for training
The “weather situation” variable is ordinal. It has four levels: 1 for good weather, and 4
for, er, really bad. We could treat this variable as categorical, with levels interpreted as
labels, or as a continuous variable. If we decided to go with categorical, we would turn
the variable into a one-hot-encoded vector and concatenate the columns with the
dataset.

In order to make it easier to render our data, we’re going to limit ourselves to the
first day for a moment. We initialize a zero-filled matrix with a number of rows equal
to the number of hours in the day and number of columns equal to the number of
weather levels:

In [21]:
first_day = bikes[:24].long()
first_day[:, 9]

tensor([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2])

In [33]:
weather_onehot = torch.zeros(first_day.shape[0], 4)
weather_onehot[:5]


tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

In [41]:
weather_onehot.scatter_(dim=1, index=first_day[:, 9].unsqueeze(1).long() - 1, value=1.0)
weather_onehot[:5]

tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.]])
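
For comparison, PyTorch also provides torch.nn.functional.one_hot, which builds the same encoding directly from zero-based class indices. The sketch below uses a few made-up weather levels (not taken from the dataset) and checks that both routes agree.

```python
import torch
import torch.nn.functional as F

# Weather levels for a few hypothetical hours, coded 1..4 as in the dataset.
weather = torch.tensor([1, 1, 2, 3, 4])

# scatter_ approach: write 1.0 at column (level - 1) of each row.
onehot_scatter = torch.zeros(weather.shape[0], 4)
onehot_scatter.scatter_(dim=1, index=weather.unsqueeze(1) - 1, value=1.0)

# Built-in helper: returns an int64 tensor, so cast it to float to compare.
onehot_builtin = F.one_hot(weather - 1, num_classes=4).float()

print(torch.equal(onehot_scatter, onehot_builtin))  # True
```

Passing num_classes explicitly matters here: without it, one_hot infers the width from the largest index present, which would give fewer than four columns for a day with only mild weather.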

In [43]:
torch.cat((bikes[:24], weather_onehot), dim=1)[0]

tensor([ 1.0000,  1.0000,  1.0000,  0.0000,  1.0000,  0.0000,  0.0000,  6.0000,
         0.0000,  1.0000,  0.2400,  0.2879,  0.8100,  0.0000,  3.0000, 13.0000,
        16.0000,  1.0000,  0.0000,  0.0000,  0.0000])

Here we prescribed our original bikes dataset and our one-hot-encoded “weather situation” matrix to be concatenated along the column dimension (that is, 1). In other
words, the columns of the two datasets are stacked together; or, equivalently, the new
one-hot-encoded columns are appended to the original dataset. For cat to succeed, it
is required that the tensors have the same size along the other dimensions—the row
dimension, in this case. Note that our new last four columns are 1, 0, 0, 0, exactly
as we would expect with a weather value of 1.
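
The shape requirement for cat can be seen on small stand-in tensors. In this sketch the tensors are filled with zeros and ones purely for illustration; only their shapes matter.

```python
import torch

a = torch.zeros(24, 17)   # stand-in for one day of the original columns
b = torch.ones(24, 4)     # stand-in for the one-hot weather columns
print(torch.cat((a, b), dim=1).shape)   # torch.Size([24, 21])

# A tensor with a different number of rows cannot be joined along dim=1.
c = torch.ones(23, 4)
try:
    torch.cat((a, c), dim=1)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print(mismatch_raised)                  # True
```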

In [54]:
(daily_bikes.shape[2])

24

We could have done the same with the reshaped daily_bikes tensor. Remember
that it is shaped (N, C, L), where L = 24. We first create the zero tensor with the same
N and L, but with C set to the number of additional columns (four, one per weather level):

In [52]:
daily_weather_onehot = torch.zeros(daily_bikes.shape[0], 4, daily_bikes.shape[2])
daily_weather_onehot

tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        ...,

        [[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]])

Then we scatter the one-hot encoding into the tensor in the C dimension. Since this
operation is performed in place, only the content of the tensor will change:

In [81]:
daily_weather_onehot.scatter_(
    1, daily_bikes[:,9,:].long().unsqueeze(1) - 1, 1.0)
daily_weather_onehot.shape

torch.Size([730, 4, 24])
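
To see what the three-dimensional scatter does, it helps to run it on a batch small enough to inspect. The sketch below uses made-up weather levels for two days of three hours each and checks two properties: every (day, hour) position has exactly one active level, and taking argmax along the channel dimension recovers the original labels.

```python
import torch

# Hypothetical batch: 2 days, 3 hours each, weather levels coded 1..4.
weather = torch.tensor([[1, 2, 2],
                        [4, 3, 1]])          # shape (N=2, L=3)

onehot = torch.zeros(weather.shape[0], 4, weather.shape[1])   # (N, 4, L)
# unsqueeze(1) gives the index shape (N, 1, L) so it lines up with dim=1.
onehot.scatter_(1, weather.unsqueeze(1) - 1, 1.0)

# Every (day, hour) pair has exactly one active level.
print(onehot.sum(dim=1))
# argmax along the channel dimension recovers the original labels.
print(torch.equal(onehot.argmax(dim=1) + 1, weather))   # True
```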

And we concatenate along the C dimension:

In [95]:
daily_bikes = torch.cat((daily_bikes, daily_weather_onehot), dim=1)
daily_bikes[0][13]

tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0896, 0.0000, 0.0000, 0.0000,
        0.0000, 0.2537, 0.2836, 0.2836, 0.2985, 0.2836, 0.2985, 0.2985, 0.2836,
        0.2537, 0.2537, 0.2537, 0.1940, 0.2239, 0.2985])

We mentioned earlier that this is not the only way to treat our “weather situation” variable. Indeed, its labels have an ordinal relationship, so we could pretend they are special values of a continuous variable. We could just transform the variable so that it runs
from 0.0 to 1.0:

In [105]:
daily_bikes[:, 9, :] = (daily_bikes[:, 9, :] - 1)/3
daily_bikes[:, 9, :][10]

tensor([-0.4993, -0.4993, -0.4989, -0.4989, -0.4989, -0.4989, -0.4989, -0.4989,
        -0.4989, -0.4989, -0.4989, -0.4989, -0.4989, -0.4989, -0.4989, -0.4989,
        -0.4989, -0.4989, -0.4984, -0.4984, -0.4984, -0.4984, -0.4984, -0.4984])
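
Because the assignment overwrites channel 9 in place, executing that cell more than once rescales values that were already rescaled and pulls them toward -0.5 (the fixed point of x → (x - 1)/3), which is consistent with the output shown above. One way to guard against this, sketched here on a synthetic tensor with illustrative names, is to compute the rescaled channel from an untouched copy of the raw levels.

```python
import torch

# Synthetic (N=1, C=2, L=4) tensor; channel 1 plays the role of weathersit.
data = torch.tensor([[[0.5, 0.6, 0.7, 0.8],
                      [1.0, 2.0, 3.0, 4.0]]])
raw_weather = data[:, 1, :].clone()       # keep the original 1..4 levels


def rescale_weather(t, raw):
    """Write (raw - 1) / 3 into channel 1; safe to call repeatedly."""
    t[:, 1, :] = (raw - 1.0) / 3.0
    return t


rescale_weather(data, raw_weather)
rescale_weather(data, raw_weather)        # second call leaves values unchanged
print(data[0, 1])                         # tensor([0.0000, 0.3333, 0.6667, 1.0000])
```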

As we mentioned in the previous section, rescaling variables to the [0.0, 1.0] interval
or the [-1.0, 1.0] interval is something we’ll want to do for all quantitative variables,
like temperature (column 10 in our dataset). We’ll see why later; for now, let’s just say
that this is beneficial to the training process.

There are multiple possibilities for rescaling variables. We can either map their
range to [0.0, 1.0]:

In [114]:
temp = daily_bikes[:, 10, :]
temp_min = torch.min(temp)
temp_max = torch.max(temp)
temp_min, temp_max

(tensor(0.), tensor(1.))

In [111]:
daily_bikes[:, 10, :]

tensor([[0.2245, 0.2041, 0.2041,  ..., 0.3878, 0.3878, 0.4490],
        [0.4490, 0.4286, 0.4082,  ..., 0.2449, 0.2245, 0.2041],
        [0.2041, 0.1837, 0.1837,  ..., 0.1633, 0.1224, 0.1633],
        ...,
        [0.2245, 0.2245, 0.2245,  ..., 0.2653, 0.2449, 0.2449],
        [0.2449, 0.2449, 0.2449,  ..., 0.1837, 0.1837, 0.1837],
        [0.1633, 0.1633, 0.1429,  ..., 0.2449, 0.2449, 0.2449]])

In [113]:
daily_bikes[:, 10, :] = ((daily_bikes[:, 10, :] - temp_min)
                         / (temp_max - temp_min))
daily_bikes[:, 10:, :]

tensor([[[0.2245, 0.2041, 0.2041,  ..., 0.3878, 0.3878, 0.4490],
         [0.2879, 0.2727, 0.2727,  ..., 0.4091, 0.4091, 0.4545],
         [0.8100, 0.8000, 0.8000,  ..., 0.8700, 0.9400, 0.8800],
         ...,
         [0.0000, 0.0000, 0.0000,  ..., 1.0000, 1.0000, 1.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.4490, 0.4286, 0.4082,  ..., 0.2449, 0.2245, 0.2041],
         [0.4545, 0.4394, 0.4242,  ..., 0.2273, 0.2121, 0.2273],
         [0.8800, 0.9400, 1.0000,  ..., 0.4400, 0.4400, 0.4700],
         ...,
         [1.0000, 1.0000, 1.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]],

        [[0.2041, 0.1837, 0.1837,  ..., 0.1633, 0.1224, 0.1633],
         [0.1970, 0.1667, 0.1667,  ..., 0.1970, 0.1515, 0.2121],
         [0.4400, 0.4400, 0.4400,  ..., 0.6400, 0.6900, 0.

or subtract the mean and divide by the standard deviation:


In [115]:
daily_bikes[:, 10, :] = ((daily_bikes[:, 10, :] - torch.mean(temp))
                         / torch.std(temp))
daily_bikes[:, 10, :]

tensor([[-1.3213, -1.4248, -1.4248,  ..., -0.4932, -0.4932, -0.1827],
        [-0.1827, -0.2862, -0.3897,  ..., -1.2178, -1.3213, -1.4248],
        [-1.4248, -1.5284, -1.5284,  ..., -1.6319, -1.8389, -1.6319],
        ...,
        [-1.3213, -1.3213, -1.3213,  ..., -1.1143, -1.2178, -1.2178],
        [-1.2178, -1.2178, -1.2178,  ..., -1.5284, -1.5284, -1.5284],
        [-1.6319, -1.6319, -1.7354,  ..., -1.2178, -1.2178, -1.2178]])

In the latter case, our variable will have zero mean and unit standard deviation. If our
variable were drawn from a Gaussian distribution, about 68% of the samples would sit in the
[-1.0, 1.0] interval.
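
Both rescaling options, and the 68% figure, can be checked on synthetic data. The sketch below draws Gaussian samples with an arbitrary mean and spread (these numbers are assumptions for illustration, not properties of the bike-sharing temperatures) and applies each transformation to a fresh copy rather than in place.

```python
import torch

torch.manual_seed(0)
# Synthetic Gaussian "temperature" readings; mean 15 and std 8 are arbitrary.
temp = 15.0 + 8.0 * torch.randn(100_000)

# Min-max scaling maps the observed range onto [0.0, 1.0].
temp_minmax = (temp - temp.min()) / (temp.max() - temp.min())

# Standardization gives zero mean and unit standard deviation.
temp_std = (temp - temp.mean()) / temp.std()

print(temp_minmax.min().item(), temp_minmax.max().item())   # 0.0 1.0
print(round(temp_std.mean().item(), 3), round(temp_std.std().item(), 3))

# Fraction of standardized samples inside [-1.0, 1.0], close to 0.68.
inside = ((temp_std >= -1.0) & (temp_std <= 1.0)).float().mean().item()
print(round(inside, 2))
```

Min-max scaling is sensitive to outliers, since a single extreme reading sets the range, whereas standardization does not bound the values but is less affected by any one sample.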

Great: we’ve built another nice dataset, and we’ve seen how to deal with time series
data. For this tour d’horizon, it’s important only that we got an idea of how a time
series is laid out and how we can wrangle the data in a form that a network will digest.
Other kinds of data look like a time series, in that there is a strict ordering. Top
two on the list? Text and audio. We’ll take a look at text next, and the “Conclusion”
section has links to additional examples for audio.