# Day 20 - Real-world data representation using tensors

## Representing tabular data

* Tabular data, in spreadsheets, is often heterogeneous
* PyTorch tensors have to be homogeneous, so this data has to be converted to floats

### Using a real-world dataset

* A lot of datasets are freely available on the internet, like this [wine quality dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)
    * This dataset has 12 columns, where the first 11 are measured characteristics, and the final column is a quality rating from 0 to 10

### Loading a wine data tensor

* We first have to examine the data ourselves
* Options for this include:
    1. Python's `csv` module
    2. NumPy
    3. Pandas
* Pandas is generally the strongest of the three
* To avoid introducing yet another library, we will use NumPy for now

In [1]:
import numpy as np
import csv

wine_path = "./DLPT/data/winequality-white.csv"
wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=";", skiprows=1)
wineq_numpy

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

In [2]:
col_list = next(csv.reader(open(wine_path), delimiter=";"))

wineq_numpy.shape, col_list

((4898, 12),
 ['fixed acidity',
  'volatile acidity',
  'citric acid',
  'residual sugar',
  'chlorides',
  'free sulfur dioxide',
  'total sulfur dioxide',
  'density',
  'pH',
  'sulphates',
  'alcohol',
  'quality'])

In [3]:
import torch

wineq = torch.from_numpy(wineq_numpy)

wineq.shape, wineq.dtype

(torch.Size([4898, 12]), torch.float32)

### Representing scores

* For training, we split the label off from the data into a separate tensor

In [4]:
data = wineq[:, :-1]
data, data.shape

(tensor([[ 7.0000,  0.2700,  0.3600,  ...,  3.0000,  0.4500,  8.8000],
         [ 6.3000,  0.3000,  0.3400,  ...,  3.3000,  0.4900,  9.5000],
         [ 8.1000,  0.2800,  0.4000,  ...,  3.2600,  0.4400, 10.1000],
         ...,
         [ 6.5000,  0.2400,  0.1900,  ...,  2.9900,  0.4600,  9.4000],
         [ 5.5000,  0.2900,  0.3000,  ...,  3.3400,  0.3800, 12.8000],
         [ 6.0000,  0.2100,  0.3800,  ...,  3.2600,  0.3200, 11.8000]]),
 torch.Size([4898, 11]))

In [5]:
target = wineq[:, -1]
target, target.shape

(tensor([6., 6., 6.,  ..., 6., 7., 6.]), torch.Size([4898]))

* We have two options for transforming the target into labels
* The first one is to simply treat the labels as a vector of integer scores

In [6]:
target = wineq[:, -1].long()
target

tensor([6, 6, 6,  ..., 6, 7, 6])

### One-hot encoding

* The other approach is to perform one-hot encoding, where each target label becomes its own vector with one element for each possible value
* All values will be zero, except the one corresponding to the target category
* The first option (integer scores) induces an ordering on the values, as well as a measure of distance between two scores
* One-hot encoding is better suited when this is not the case, for example when assigning categories
* PyTorch gives us the `scatter_` method for turning our target into the corresponding one-hot representation

In [7]:
target_onehot = torch.zeros(target.shape[0], 10)

#                      dim,               index, value
# Writes value into target_onehot along dim, at the positions given by index
target_onehot.scatter_(1  , target.unsqueeze(1), 1.0)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

* As the second argument is required to have the same dimensionality as the tensor we `scatter_` into, we have to `unsqueeze` it to match

In [8]:
target_unsqueezed = target.unsqueeze(1)
target_unsqueezed, target_unsqueezed.shape

(tensor([[6],
         [6],
         [6],
         ...,
         [6],
         [7],
         [6]]),
 torch.Size([4898, 1]))
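
As a cross-check on the `scatter_` approach, the same encoding can be produced with `torch.nn.functional.one_hot`. The sketch below is not part of the original notebook; it uses a small made-up `toy_target` tensor in place of the real `target`, and assumes 10 classes as above.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the `target` tensor of integer wine scores (assumed values)
toy_target = torch.tensor([6, 6, 7, 3, 9])

# scatter_-based encoding, following the same pattern as the notebook cell
onehot_scatter = torch.zeros(toy_target.shape[0], 10)
onehot_scatter.scatter_(1, toy_target.unsqueeze(1), 1.0)

# F.one_hot expects an integer (long) tensor and returns a long tensor,
# so we convert to float to compare with the scatter_ result
onehot_builtin = F.one_hot(toy_target, num_classes=10).float()

print(torch.equal(onehot_scatter, onehot_builtin))
```

Both routes should give identical tensors; `one_hot` simply hides the `unsqueeze` step.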

* PyTorch actually allows us to use class indices like `target` directly as targets during neural network training
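
A minimal sketch of what this can look like, assuming a classification setup with `nn.CrossEntropyLoss`. The `logits` tensor here is random and merely stands in for the output of some hypothetical model; `toy_target` is a made-up stand-in for `target`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model outputs: 4 samples, 10 classes (raw, unnormalized scores)
logits = torch.randn(4, 10)
# Class indices as a long tensor, in the same form as `target`
toy_target = torch.tensor([6, 6, 7, 5])

loss_fn = nn.CrossEntropyLoss()
# The loss takes the class indices directly; no one-hot encoding is needed
loss = loss_fn(logits, toy_target)

print(loss)
```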

### When to categorize

* When the data is continuous, or ordinal and ordering is a priority, use the values directly
    * Remember that this introduces a notion of distance between the values
* When the data is categorical, or ordering does not matter, use a one-hot encoding, or an embedding
* Next, we further process our `data` by calculating the per-column mean and variance

In [9]:
data_mean = torch.mean(data, dim=0)
data_mean

tensor([6.8548e+00, 2.7824e-01, 3.3419e-01, 6.3914e+00, 4.5772e-02, 3.5308e+01,
        1.3836e+02, 9.9403e-01, 3.1883e+00, 4.8985e-01, 1.0514e+01])

In [10]:
data_var = torch.var(data, dim=0)
data_var

tensor([7.1211e-01, 1.0160e-02, 1.4646e-02, 2.5726e+01, 4.7733e-04, 2.8924e+02,
        1.8061e+03, 8.9455e-06, 2.2801e-02, 1.3025e-02, 1.5144e+00])

* Finally, we use these values to normalize each column to zero mean and unit variance, which helps training

In [11]:
data_normalized = (data - data_mean) / torch.sqrt(data_var)
data_normalized

tensor([[ 1.7208e-01, -8.1761e-02,  2.1326e-01,  ..., -1.2468e+00,
         -3.4915e-01, -1.3930e+00],
        [-6.5743e-01,  2.1587e-01,  4.7996e-02,  ...,  7.3995e-01,
          1.3422e-03, -8.2419e-01],
        [ 1.4756e+00,  1.7450e-02,  5.4378e-01,  ...,  4.7505e-01,
         -4.3677e-01, -3.3663e-01],
        ...,
        [-4.2043e-01, -3.7940e-01, -1.1915e+00,  ..., -1.3130e+00,
         -2.6153e-01, -9.0545e-01],
        [-1.6054e+00,  1.1666e-01, -2.8253e-01,  ...,  1.0049e+00,
         -9.6251e-01,  1.8574e+00],
        [-1.0129e+00, -6.7703e-01,  3.7852e-01,  ...,  4.7505e-01,
         -1.4882e+00,  1.0448e+00]])
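
One way to sanity-check this kind of normalization is to confirm that every column ends up with a mean close to 0 and a variance close to 1. The sketch below does this on synthetic data (`toy_data` is an assumption, not the wine tensor), using the same formula as the cell above.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for `data`: 100 samples, 3 features on different scales
scales = torch.tensor([1.0, 50.0, 0.1])
offsets = torch.tensor([7.0, 140.0, 1.0])
toy_data = torch.randn(100, 3) * scales + offsets

toy_mean = torch.mean(toy_data, dim=0)
toy_var = torch.var(toy_data, dim=0)
toy_normalized = (toy_data - toy_mean) / torch.sqrt(toy_var)

# After normalization, each column should have mean ~0 and variance ~1
print(toy_normalized.mean(dim=0))
print(toy_normalized.var(dim=0))
```

Broadcasting takes care of applying the per-column statistics to every row, just as it does for `data_mean` and `data_var`.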

### Finding thresholds

* To get a feel for the data, and help judge our model, we can look at the data ourselves to find easy ways of telling good and bad wines apart at a glance
* Let us see if we can determine what makes a wine score 3 or lower

In [13]:
bad_indexes = target <= 3
bad_indexes.shape, bad_indexes.dtype, bad_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(20))

In [16]:
bad_data = data[bad_indexes]
bad_data.shape

torch.Size([20, 11])

* We can now have a look at the average values for bad, mediocre, and good wines, to get a feel for the data

In [19]:
bad_data = data_normalized[target <= 3]
mid_data = data_normalized[(target > 3) & (target < 7)]
good_data = data_normalized[target >= 7]

bad_mean = torch.mean(bad_data, 0)
mid_mean = torch.mean(mid_data, 0)
good_mean = torch.mean(good_data, 0)

for i, args in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
    print("{:2} {:20} {:6.2f} {:6.2f} {:6.2f}".format(i, *args))

 0 fixed acidity          0.88   0.04  -0.15
 1 volatile acidity       0.55   0.03  -0.13
 2 citric acid            0.01   0.02  -0.07
 3 residual sugar         0.00   0.06  -0.22
 4 chlorides              0.39   0.09  -0.35
 5 free sulfur dioxide    1.06   0.01  -0.04
 6 total sulfur dioxide   0.76   0.08  -0.31
 7 density                0.29   0.15  -0.54
 8 pH                    -0.01  -0.05   0.18
 9 sulphates             -0.13  -0.02   0.09
10 alcohol               -0.14  -0.20   0.73


* We can see that the bad wines have a noticeably higher sulfur dioxide content (both free and total), among other differences
* One crude criterion for discriminating good from bad wines would then be a threshold on total sulfur dioxide

In [22]:
total_sulfur_threshold = data[(target > 3) & (target < 7)].mean(0)[6] # 141.83
total_sulfur_data = data[:, 6]
predicted_indexes = torch.lt(total_sulfur_data, total_sulfur_threshold)

predicted_indexes.shape, predicted_indexes.dtype, predicted_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(2727))

* Using this threshold, just over half of our wines (2727 of 4898) fall below it and would be considered high quality
* We can compare this to the true indexes of the higher quality wines

In [23]:
actual_indexes = target > 5

actual_indexes.shape, actual_indexes.dtype, actual_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(3258))

In [25]:
n_matches = torch.sum(actual_indexes & predicted_indexes).item()
n_predicted = torch.sum(predicted_indexes).item()
n_actual = torch.sum(actual_indexes).item()

n_matches, n_matches / n_predicted, n_matches / n_actual

(2018, 0.74000733406674, 0.6193984039287906)

* 2018 of our predictions match with the actual data, which represents 74% of our predictions
* About 62% of the actually good wines were included with this threshold
* This is barely better than random, but can serve as a baseline for debugging model performance later
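
The two ratios computed above are what is usually called precision (matches over predictions) and recall (matches over actual positives). A small sketch of a reusable helper, with the hypothetical name `precision_recall` and toy boolean masks that are not taken from the wine data:

```python
import torch

def precision_recall(predicted, actual):
    """Return (precision, recall) for two boolean tensors of equal shape.

    Assumes at least one True value in each tensor; otherwise the
    divisions below would fail.
    """
    n_matches = torch.sum(predicted & actual).item()
    n_predicted = torch.sum(predicted).item()
    n_actual = torch.sum(actual).item()
    return n_matches / n_predicted, n_matches / n_actual

# Toy boolean masks (assumed values)
toy_predicted = torch.tensor([True, True, False, True, False])
toy_actual = torch.tensor([True, False, False, True, True])

precision, recall = precision_recall(toy_predicted, toy_actual)
print(precision, recall)
```

Calling it with `predicted_indexes` and `actual_indexes` should reproduce the last two values of the cell above.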

## Working with time series

* The dataset we will be using is the [bike sharing dataset](https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset)

### Adding a time dimension

* In the original dataset, each row represents one hour
* We want to reshape this data so that stepping along the first axis moves from one day to the next
* The second axis then indexes the hours of each day, and the third axis the different features
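
The reshaping described here can be sketched with `view` on a synthetic tensor. This assumes the rows are in chronological order and that every day has exactly 24 rows; the names `hourly` and `daily` and the sizes are made up for illustration.

```python
import torch

# Synthetic stand-in: 3 days x 24 hours = 72 hourly rows, 4 features per row
n_days, n_hours, n_features = 3, 24, 4
hourly = torch.arange(n_days * n_hours * n_features, dtype=torch.float32)
hourly = hourly.view(n_days * n_hours, n_features)

# Split the row axis into (day, hour); -1 lets PyTorch infer the number of days
daily = hourly.view(-1, n_hours, n_features)

print(daily.shape)
# Hour 5 of day 1 is row 24 + 5 = 29 of the original tensor
print(torch.equal(daily[1, 5], hourly[29]))
```

Because `view` only changes how the same storage is indexed, no data is copied in this step.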

In [26]:
bikes_numpy = np.loadtxt(
    "./DLPT/data/bike-sharing-dataset/hour-fixed.csv",
    dtype=np.float32,
    delimiter=",",
    skiprows=1,
    converters={1: lambda x: float(x[8:10])}
)
bikes = torch.from_numpy(bikes_numpy)

bikes

tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 3.0000e+00, 1.3000e+01,
         1.6000e+01],
        [2.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 8.0000e+00, 3.2000e+01,
         4.0000e+01],
        [3.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 5.0000e+00, 2.7000e+01,
         3.2000e+01],
        ...,
        [1.7377e+04, 3.1000e+01, 1.0000e+00,  ..., 7.0000e+00, 8.3000e+01,
         9.0000e+01],
        [1.7378e+04, 3.1000e+01, 1.0000e+00,  ..., 1.3000e+01, 4.8000e+01,
         6.1000e+01],
        [1.7379e+04, 3.1000e+01, 1.0000e+00,  ..., 1.2000e+01, 3.7000e+01,
         4.9000e+01]])

* In a time series dataset, rows represent successive time-points
* The existence of this ordering gives us the opportunity to exploit causal relationships