<a href="https://colab.research.google.com/github/ImranNust/DeepLearningWithPyTorch/blob/main/Chapter4/Module3_RepresentingTabularData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> <center> <b> <u> Introduction </u> </b> </center> </h1>



* The simplest form of data we’ll encounter on a machine learning job is sitting in a spreadsheet, CSV file, or database. Whatever the medium, it’s a table containing one row per sample (or record), where each column holds one piece of information about that sample.

* Columns may contain numerical values, like temperatures at specific locations, or labels, like a string expressing an attribute of the sample (for example, “blue”). Therefore, tabular data is typically not homogeneous: different columns don’t have the same type. We might have a column showing the weight of apples and another encoding their color in a label.

* PyTorch tensors, on the other hand, are homogeneous. Information in PyTorch is typically encoded as a number, usually floating-point (though integer and Boolean types are supported as well). This numeric encoding is deliberate, since neural networks are mathematical entities that take real numbers as inputs and produce real numbers as output through successive application of matrix multiplications and nonlinear functions.
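
As a rough illustration of the point above, the following sketch encodes a tiny heterogeneous table as one floating-point tensor. The table, the column names (`weight_g`, `color`), and the `color_to_code` mapping are hypothetical and only serve the example.

```python
import torch

# Hypothetical mini-table: one row per apple, with a numeric and a label column.
rows = [
    {"weight_g": 150.0, "color": "red"},
    {"weight_g": 172.5, "color": "green"},
    {"weight_g": 160.0, "color": "red"},
]

# Assumed mapping from color label to an integer code (any consistent mapping works).
color_to_code = {"red": 0, "green": 1}

# Every cell becomes a float so the whole table fits in one homogeneous tensor.
apples = torch.tensor(
    [[row["weight_g"], float(color_to_code[row["color"]])] for row in rows],
    dtype=torch.float32,
)

print(apples.shape, apples.dtype)
```

Storing a label as a plain number like this is only one possible encoding; one-hot encoding is often a better fit for categorical columns.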

---


<h2> <center> <b> <u> Using a real-world dataset </u> </b> </center> </h2>

* Our first job as deep learning practitioners is to encode heterogeneous, real-world data into a tensor of floating-point numbers, ready for consumption by a neural network.

* The wine dataset file (`winequality-white.csv`) contains a semicolon-delimited collection of values organized in 12 columns, preceded by a header line containing the column names. The first 11 columns contain values of chemical variables, and the last column contains the sensory quality score from 0 (very bad) to 10 (excellent). These are the column names in the order they appear in the dataset:
    - fixed acidity
    - volatile acidity
    - citric acid
    - residual sugar
    - chlorides
    - free sulfur dioxide
    - total sulfur dioxide
    - density
    - pH
    - sulphates
    - alcohol
    - quality

![](https://raw.githubusercontent.com/ImranNust/DeepLearningWithPyTorch/main/Images/winedataset.png)

* A possible machine learning task on this dataset is predicting the quality score from chemical characterization alone.

---

<h2> <center> <b> <u> Loading a wine data tensor </u> </b> </center> </h2>

* Let’s see how we can load the data using Python and then turn it into a PyTorch tensor. Python offers several options for quickly loading a CSV file. Three popular options are
    - The csv module that ships with Python
    - NumPy
    - Pandas (the most convenient and preferable option, but since we are
      already familiar with NumPy and PyTorch, we will stick with NumPy)
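
For comparison, here is a minimal sketch of the first option, the built-in `csv` module. It reads from an in-memory string instead of a file so that it runs anywhere; the two sample rows and the three-column layout are illustrative and are not the real file contents.

```python
import csv
import io

import torch

# Stand-in for a file on disk: semicolon-delimited text with a quoted header,
# truncated to three columns for brevity (values are illustrative).
sample_text = '"fixed acidity";"volatile acidity";"quality"\n7;0.27;6\n6.3;0.3;6\n'

reader = csv.reader(io.StringIO(sample_text), delimiter=";")
header = next(reader)                                  # first line holds the column names
values = [[float(cell) for cell in row] for row in reader]

sample = torch.tensor(values, dtype=torch.float32)
print(header)
print(sample)
```

The `csv` module returns every cell as a string, so the conversion to `float` has to be done by hand. `np.loadtxt` handles that conversion itself.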



In [1]:
import numpy as np

wine_path = '/content/winequality-white.csv'
try:
  wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=';',
                           skiprows=1)
except OSError:
  # File not found locally: fetch it from the course repository.
  # The lines starting with "!" are notebook shell commands.
  !git clone https://github.com/ImranNust/DeepLearningWithPyTorch
  !mv DeepLearningWithPyTorch/Images/winequality-white.csv /content/
  !rm -rf DeepLearningWithPyTorch
  wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=';',
                           skiprows=1)

print(wineq_numpy)

Cloning into 'DeepLearningWithPyTorch'...
remote: Enumerating objects: 202, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 202 (delta 6), reused 59 (delta 3), pack-reused 129
Receiving objects: 100% (202/202), 37.87 MiB | 21.83 MiB/s, done.
Resolving deltas: 100% (16/16), done.
[[ 7.    0.27  0.36 ...  0.45  8.8   6.  ]
 [ 6.3   0.3   0.34 ...  0.49  9.5   6.  ]
 [ 8.1   0.28  0.4  ...  0.44 10.1   6.  ]
 ...
 [ 6.5   0.24  0.19 ...  0.46  9.4   6.  ]
 [ 5.5   0.29  0.3  ...  0.38 12.8   7.  ]
 [ 6.    0.21  0.38 ...  0.32 11.8   6.  ]]


---

Here we just prescribe what the type of the 2D array should be (32-bit floating-point), the delimiter used to separate values in each row, and the fact that the first line should not be read since it contains the column names. 

Let’s check that all the data has been read and proceed to convert the NumPy array to a PyTorch tensor:

---

In [2]:
import csv, torch
# Read only the header line; the with-block closes the file afterwards.
with open(wine_path, newline='') as f:
  col_list = next(csv.reader(f, delimiter = ';'))
print('The downloaded file has shape: {}\n and column names: {}'.
      format(wineq_numpy.shape, col_list))

wineq = torch.from_numpy(wineq_numpy)

print(wineq.shape, wineq.dtype)

The downloaded file has shape: (4898, 12)
 and column names: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
torch.Size([4898, 12]) torch.float32


At this point, we have a floating-point torch.Tensor containing all the columns,
including the last, which refers to the quality score.
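
One detail worth keeping in mind is that `torch.from_numpy` does not copy the data: the tensor and the source array share the same memory. The sketch below shows this on a small, made-up array; use `clone()` when an independent copy is needed.

```python
import numpy as np
import torch

array = np.array([[7.0, 0.27], [6.3, 0.30]], dtype=np.float32)
tensor = torch.from_numpy(array)   # no copy: tensor and array share one buffer

array[0, 0] = 99.0                 # modify the NumPy side...
print(tensor[0, 0])                # ...and the change is visible through the tensor

independent = tensor.clone()       # clone() makes a separate copy
array[0, 0] = 7.0                  # restore the original value
print(independent[0, 0], tensor[0, 0])
```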

**[IMPORTANT]** Please read about the three types of data: continuous, ordinal, and categorical values.

---

<h2> <center> <b> <u> Representing Scores </u> </b> </center> </h2>

We could treat the score as a continuous variable, keep it as a real number, and perform a regression task, or treat it as a label and try to guess the label from the chemical analysis in a classification task. In both approaches, we will typically remove the score from the tensor of input data and keep it in a separate tensor, so that we can use the score as the ground truth without it being input to our model:

In [3]:
data = wineq[:, :-1]
print(data, data.shape)

target = wineq[:, -1]
print(target, target.shape)

tensor([[ 7.0000,  0.2700,  0.3600,  ...,  3.0000,  0.4500,  8.8000],
        [ 6.3000,  0.3000,  0.3400,  ...,  3.3000,  0.4900,  9.5000],
        [ 8.1000,  0.2800,  0.4000,  ...,  3.2600,  0.4400, 10.1000],
        ...,
        [ 6.5000,  0.2400,  0.1900,  ...,  2.9900,  0.4600,  9.4000],
        [ 5.5000,  0.2900,  0.3000,  ...,  3.3400,  0.3800, 12.8000],
        [ 6.0000,  0.2100,  0.3800,  ...,  3.2600,  0.3200, 11.8000]]) torch.Size([4898, 11])
tensor([6., 6., 6.,  ..., 6., 7., 6.]) torch.Size([4898])


---

If we want to transform the target tensor into a tensor of labels, we have two options, depending on the strategy or what we use the categorical data for.

1. One is simply to treat labels as an integer vector of scores:
```
target = wineq[:, -1].long()
target
```
If targets were string labels, like wine color, assigning an integer number to each string would let us follow the same approach.
2. The other approach is one-hot encoding, covered in the next section.

In [4]:
target = wineq[:, -1].long()
target

tensor([6, 6, 6,  ..., 6, 7, 6])
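
For string labels, the integer assignment mentioned in the first option could look like the sketch below. The `colors` list is hypothetical, since the wine dataset used here has no such column.

```python
import torch

# Hypothetical string labels; the wine dataset itself has no such column.
colors = ["white", "red", "white", "rose"]

# Sort the unique labels so the mapping is reproducible between runs.
label_to_index = {label: i for i, label in enumerate(sorted(set(colors)))}

color_target = torch.tensor([label_to_index[c] for c in colors], dtype=torch.long)
print(label_to_index)
print(color_target)
```

The integer codes carry no meaningful order here, so they should be treated as class indices and not as quantities.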

---

<h2> <center> <b> <u> One-hot Encoding </u> </b> </center> </h2>

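The other approach is to build a one-hot encoding of the scores: that is, encode each of the 10 scores in a vector of 10 elements, with all elements set to 0 but one, at a different index for each score. This way, a score of 1 could be mapped onto the vector (1,0,0,0,0,0,0,0,0,0), a score of 5 onto (0,0,0,0,1,0,0,0,0,0), and so on. Note that the `scatter_` call used in this section takes each score directly as a zero-based column index, so there a score of 5 sets the element at index 5 (the sixth position), not the fifth.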

We can achieve one-hot encoding using the `scatter_` method, which fills the tensor with values from a source tensor along the indices provided as arguments:

In [5]:
target_onehot = torch.zeros(target.shape[0], 10)
target_onehot.scatter_(1, target.unsqueeze(1), 1.0)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

---

The arguments for `scatter_` are as follows:
- The dimension along which the following two arguments are specified
- A column tensor indicating the indices of the elements to scatter
- A tensor containing the elements to scatter or a single scalar to scatter (1, in this case)
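
To make the three arguments concrete, here is a small sketch with four made-up scores. It also cross-checks the result against `torch.nn.functional.one_hot`, a built-in helper that is not used elsewhere in this notebook.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([6, 3, 9, 6])           # illustrative class indices
num_classes = 10

onehot = torch.zeros(scores.shape[0], num_classes)
# dim=1: write along columns. The index tensor must have as many dimensions
# as `onehot`, so unsqueeze(1) turns shape (4,) into (4, 1). 1.0 is the value written.
onehot.scatter_(1, scores.unsqueeze(1), 1.0)

# Cross-check against the built-in helper, which returns an integer tensor.
reference = F.one_hot(scores, num_classes=num_classes)
print(torch.equal(onehot, reference.float()))
print(onehot.argmax(dim=1))                   # recovers the original indices
```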


PyTorch allows us to use class indices directly as targets while training neural networks. However, if we wanted to use the score as a categorical input to the network, we would have to transform it to a one-hot-encoded tensor.
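
As an example of class indices used directly as targets, the sketch below passes integer labels to `torch.nn.functional.cross_entropy`. The random `logits` are a stand-in for the output of a model, which this notebook does not build. The manual computation from a one-hot matrix is included only to show that the two forms agree.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # stand-in for model outputs: 4 samples, 10 classes
labels = torch.tensor([6, 3, 9, 6])    # class indices, dtype torch.long

# cross_entropy accepts the integer class indices directly; no one-hot needed.
loss = F.cross_entropy(logits, labels)

# The same value computed by hand from a one-hot matrix, for comparison.
onehot = F.one_hot(labels, num_classes=10).float()
manual = -(onehot * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
print(loss.item(), manual.item())
```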

---



<h2> <center> <b> <u> When To Categorize </u> </b> </center> </h2>

How to treat columns with continuous, ordinal, and categorical data is summarized in the following figure.

![](https://raw.githubusercontent.com/ImranNust/DeepLearningWithPyTorch/main/Images/whentocategorize.png)

Let’s go back to our data tensor, which contains the 11 variables associated with the chemical analysis. We can use the functions in the PyTorch Tensor API to manipulate our data in tensor form.

Let’s first obtain the mean and variance for each column:

In [10]:
data_mean = torch.mean(data, dim = 0)
data_var = torch.var(data, dim = 0)

print('Mean = \n{} \nVariance = \n{}'.format(data_mean, data_var))

Mean = 
tensor([6.8548e+00, 2.7824e-01, 3.3419e-01, 6.3914e+00, 4.5772e-02, 3.5308e+01,
        1.3836e+02, 9.9403e-01, 3.1883e+00, 4.8985e-01, 1.0514e+01]) 
Variance = 
tensor([7.1211e-01, 1.0160e-02, 1.4646e-02, 2.5726e+01, 4.7733e-04, 2.8924e+02,
        1.8061e+03, 8.9455e-06, 2.2801e-02, 1.3025e-02, 1.5144e+00])


---

In this case, `dim=0` indicates that the reduction is performed along dimension `0`, giving one value per column. At this point, we can normalize the data by subtracting the `mean` and dividing by the `standard deviation` (the square root of the variance), which helps with the learning process.

In [14]:
data_normalized = (data - data_mean)/ torch.sqrt(data_var)
print(data_normalized)

tensor([[ 1.7208e-01, -8.1761e-02,  2.1326e-01,  ..., -1.2468e+00,
         -3.4915e-01, -1.3930e+00],
        [-6.5743e-01,  2.1587e-01,  4.7996e-02,  ...,  7.3995e-01,
          1.3422e-03, -8.2419e-01],
        [ 1.4756e+00,  1.7450e-02,  5.4378e-01,  ...,  4.7505e-01,
         -4.3677e-01, -3.3663e-01],
        ...,
        [-4.2043e-01, -3.7940e-01, -1.1915e+00,  ..., -1.3130e+00,
         -2.6153e-01, -9.0545e-01],
        [-1.6054e+00,  1.1666e-01, -2.8253e-01,  ...,  1.0049e+00,
         -9.6251e-01,  1.8574e+00],
        [-1.0129e+00, -6.7703e-01,  3.7852e-01,  ...,  4.7505e-01,
         -1.4882e+00,  1.0448e+00]])
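
The normalization above can be sanity-checked on a small tensor. The sketch below uses a made-up 3x2 matrix and confirms that each column ends up with mean close to 0 and variance close to 1. Note that `torch.var` and `torch.std` apply the unbiased (n - 1) estimator by default.

```python
import torch

# Illustrative 3x2 matrix: rows are samples, columns are features.
x = torch.tensor([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

mean = x.mean(dim=0)               # one mean per column -> shape (2,)
var = x.var(dim=0)                 # unbiased (n - 1) variance by default
std = x.std(dim=0)                 # matches torch.sqrt(var)

x_norm = (x - mean) / std          # broadcasting applies the per-column stats to every row
print(x_norm.mean(dim=0))          # approximately 0 in each column
print(x_norm.var(dim=0))           # approximately 1 in each column
```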


---

<h2> <center> <b> <u> Finding Thresholds </u> </b> </center> </h2>

Next, let’s start to look at the data with an eye to seeing if there is an easy way to tell good and bad wines apart at a glance. 

First, we’re going to determine which rows in `target` correspond to a score less than or equal to 3:

In [15]:
bad_indexes = target <= 3
print(bad_indexes.shape, bad_indexes.dtype, bad_indexes.sum())

torch.Size([4898]) torch.bool tensor(20)


---

Note that only 20 of the `bad_indexes` entries are set to `True`! By using a feature in PyTorch called `advanced indexing`, we can use a tensor with data type `torch.bool` to index the data tensor. This will essentially filter data to be only items (or rows) corresponding to `True` in the indexing tensor. The `bad_indexes` tensor has the same shape as `target`, with values of `False` or `True` depending on the outcome of the comparison between our threshold and each element in the original `target` tensor:

---

In [16]:
bad_data = data[bad_indexes]
bad_data.shape

torch.Size([20, 11])
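
To see the mechanics of Boolean-mask indexing in isolation, here is a sketch on a tiny made-up tensor. The names `features` and `scores` are illustrative and stand in for `data` and `target`.

```python
import torch

features = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
scores = torch.tensor([3, 6, 2, 8])

mask = scores <= 3                     # elementwise comparison -> torch.bool, shape (4,)
selected = features[mask]              # keeps rows 0 and 2, all columns

# Masks combine with & (and), | (or), ~ (not); parentheses are required
# because these operators bind more tightly than comparisons.
middle = features[(scores > 3) & (scores < 7)]
print(selected)
print(middle)
```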

Note that the new `bad_data` tensor has 20 rows, the same as the number of rows with `True` in the `bad_indexes` tensor. It retains all 11 columns. Now we can start to get information about wines grouped into good, middling, and bad categories. 

Let’s take the `.mean()` of each column:

In [17]:
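bad_data = data[target <= 3]
# Note: `target >= 3` also includes the score-3 wines already in bad_data.
# Use `target > 3` for disjoint groups (the printed means would shift slightly).
mid_data = data[(target >= 3) & (target < 7)]
good_data = data[target >= 7]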

bad_mean = torch.mean(bad_data, dim=0)
mid_mean = torch.mean(mid_data, dim=0)
good_mean = torch.mean(good_data, dim=0)

for i, args in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
  print('{:2} {:20} {:6.2f} {:6.2f} {:6.2f}'.format(i, *args))

 0 fixed acidity          7.60   6.89   6.73
 1 volatile acidity       0.33   0.28   0.27
 2 citric acid            0.34   0.34   0.33
 3 residual sugar         6.39   6.70   5.26
 4 chlorides              0.05   0.05   0.04
 5 free sulfur dioxide   53.33  35.52  34.55
 6 total sulfur dioxide 170.60 141.98 125.25
 7 density                0.99   0.99   0.99
 8 pH                     3.19   3.18   3.22
 9 sulphates              0.47   0.49   0.50
10 alcohol               10.34  10.27  11.42


It looks like we’re on to something here: at first glance, the bad wines seem to have higher total sulfur dioxide, among other differences. We could use a threshold on total sulfur dioxide as a crude criterion for discriminating good wines from bad ones.

Let’s get the indexes where the total sulfur dioxide column is below a threshold of 141.83, roughly the mean for the middling wines computed above:

In [18]:
total_sulfur_threshold = 141.83
total_sulfur_data = data[:, 6]
predicted_indexes = torch.lt(total_sulfur_data,
                             total_sulfur_threshold)


print(predicted_indexes.shape,
      predicted_indexes.dtype,
      predicted_indexes.sum())

torch.Size([4898]) torch.bool tensor(2727)


This means our threshold implies that just over half of all the wines are going to be high quality. Next, we’ll need to get the indexes of the actually good wines:

In [19]:
actual_indexes = target > 5

print(actual_indexes.shape, actual_indexes.dtype, actual_indexes.sum())

torch.Size([4898]) torch.bool tensor(3258)


Since there are about 500 more actually good wines than our threshold predicted, we already have hard evidence that it’s not perfect. Now we need to see how well our predictions line up with the actual rankings. We will perform a logical “and” between our prediction indexes and the actual good indexes (remember that each is just a Boolean tensor of `True` and `False` values) and use that intersection of wines-in-agreement to determine how well we did:

In [20]:
n_matches = torch.sum(actual_indexes & predicted_indexes).item()
n_predicted = torch.sum(predicted_indexes).item()
n_actual = torch.sum(actual_indexes).item()

print(n_matches, n_matches/n_predicted, n_matches/n_actual)

2018 0.74000733406674 0.6193984039287906


We got around 2,000 wines right! Since we predicted about 2,700 wines, this gives us a 74% chance that if we predict a wine to be high quality, it actually is. Unfortunately, there are about 3,300 good wines, and we only identified 62% of them. Well, we got what we signed up for; that’s barely better than random! Of course, this is all very naive: we know for sure that multiple variables contribute to wine quality, and the relationships between the values of these variables and the outcome (which could be the actual score, rather than a binarized version of it) are likely more complicated than a simple threshold on a single value.
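
The two ratios computed above are commonly called precision (matches divided by predictions) and recall (matches divided by actual positives). The sketch below computes them on two small made-up Boolean masks, along with plain accuracy for comparison. The masks are illustrative and not taken from the dataset.

```python
import torch

# Illustrative masks for six wines (not taken from the dataset).
predicted = torch.tensor([True, True, False, True, False, False])
actual = torch.tensor([True, False, True, True, True, False])

n_matches = (predicted & actual).sum().item()      # predicted good and actually good
precision = n_matches / predicted.sum().item()     # share of predictions that were right
recall = n_matches / actual.sum().item()           # share of good wines that were found
accuracy = (predicted == actual).float().mean().item()  # also credits correct "not good" calls

print(n_matches, precision, recall, accuracy)
```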

Indeed, a simple neural network would overcome all of these limitations, as would a lot of other basic machine learning methods.