#### In this lecture we cover:
- Representing different types of real-world data as PyTorch tensors
- Working with a range of data types, including spreadsheet, time series, text, image, and medical imaging
- Loading data from file
- Converting data to tensors
- Shaping tensors so that they can be used as inputs for neural network models

#### Tabular data 
The simplest form of data you’ll encounter in your machine learning job is sitting in a
 spreadsheet, in a CSV (comma-separated values) file, or in a database.
 
  Whatever the
 medium, this data is a table containing one row per sample (or record), in which each column contains one piece of information about the sample. 
 
 Such a table is a collection of independent samples, unlike a time-series, in
 which samples are related by a time dimension. 
 
 Columns may contain numerical values, such as temperatures at specific locations,
 or labels, such as a string expressing an attribute of the sample (like "blue").
 
 Therefore, tabular data typically isn’t homogeneous; different columns don’t have the same
 type. You might have a column showing the weight of apples and another encoding
 their color in a label. 
 
 Note: PyTorch tensors, on the other hand, are homogeneous. 
 
 Information in PyTorch is encoded as a number, typically floating-point (though integer types are supported as well)
 
 Numeric encoding is deliberate, because neural networks are mathematical entities that
 take real numbers as inputs and produce real numbers as output through successive
 application of matrix multiplications and nonlinear functions.
 
 Your first job as a deep learning practitioner, therefore, is to encode heterogeneous,
 real-world data in a tensor of floating-point numbers, ready for consumption by a neural network. 
 
  A large number of tabular data sets are freely available on the internet. See
 https://github.com/caesar0301/awesome-public-datasets

 We start with the Wine Quality data set, a freely available
 table containing chemical characterizations of samples of vinho verde (a wine from
 northern Portugal) together with a sensory quality score.

 Python offers several options for loading a CSV file quickly.
 Three popular options are:

- The csv module that ships with Python
- NumPy
- Pandas
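
To make the comparison concrete, here is a minimal sketch that reads the same data with each of the three options. It assumes a tiny, hypothetical two-row sample held in memory (`sample_csv`) rather than the real wine file, so the names and values are illustrative only.

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout as the wine file.
sample_csv = '"fixed acidity";"alcohol";"quality"\n7.0;8.8;6\n6.3;9.5;6\n'

# 1. csv module: yields rows as lists of strings; converting to float is up to you.
rows = list(csv.reader(io.StringIO(sample_csv), delimiter=";"))
header = rows[0]
records = [[float(v) for v in row] for row in rows[1:]]

# 2. NumPy: parses straight into a float32 array, skipping the header row.
arr = np.loadtxt(io.StringIO(sample_csv), dtype=np.float32, delimiter=";", skiprows=1)

# 3. pandas: keeps the column names and infers a dtype per column.
df = pd.read_csv(io.StringIO(sample_csv), sep=";")

print(header)
print(arr.shape)
print(list(df.columns))
```

The csv module leaves type conversion to you, NumPy returns a single homogeneous array, and pandas keeps per-column names and types, which is convenient for inspection before converting to a tensor.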


In [20]:
# Here we load the file with pandas, convert the DataFrame to a NumPy array,
# and finally turn that array into a PyTorch tensor
import pandas as pd
import numpy as np

wine_path = "C:/Users/Haier/Dataset/pytorch_dataset/tabular_data/winequality-white.csv"

wine_df = pd.read_csv(wine_path,sep=';')
wine_df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.270,0.36,20.70,0.045,45.0,170.0,1.00100,3.00,0.45,8.800000,6
1,6.3,0.300,0.34,1.60,0.049,14.0,132.0,0.99400,3.30,0.49,9.500000,6
2,8.1,0.280,0.40,6.90,0.050,30.0,97.0,0.99510,3.26,0.44,10.100000,6
3,7.2,0.230,0.32,8.50,0.058,47.0,186.0,0.99560,3.19,0.40,9.900000,6
4,7.2,0.230,0.32,8.50,0.058,47.0,186.0,0.99560,3.19,0.40,9.900000,6
5,8.1,0.280,0.40,6.90,0.050,30.0,97.0,0.99510,3.26,0.44,10.100000,6
6,6.2,0.320,0.16,7.00,0.045,30.0,136.0,0.99490,3.18,0.47,9.600000,6
7,7.0,0.270,0.36,20.70,0.045,45.0,170.0,1.00100,3.00,0.45,8.800000,6
8,6.3,0.300,0.34,1.60,0.049,14.0,132.0,0.99400,3.30,0.49,9.500000,6
9,8.1,0.220,0.43,1.50,0.044,28.0,129.0,0.99380,3.22,0.45,11.000000,6


In [18]:
# convert to numpy
wineq_numpy = np.array(wine_df)

In [21]:
wineq_numpy
# wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, 
# delimiter=";", skiprows=1)     # direct numpy function 

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]])

In [22]:
wineq_numpy.shape

(4898, 12)

Next, check that all the data has been read

In [26]:
import csv
# Read only the header row to recover the column names
with open(wine_path, newline='') as f:
    col_list = next(csv.reader(f, delimiter=';'))
wineq_numpy.shape, col_list

((4898, 12),
 ['fixed acidity',
  'volatile acidity',
  'citric acid',
  'residual sugar',
  'chlorides',
  'free sulfur dioxide',
  'total sulfur dioxide',
  'density',
  'pH',
  'sulphates',
  'alcohol',
  'quality'])

Proceed to convert the NumPy array to a PyTorch tensor:

In [34]:
import torch
wineq = torch.from_numpy(wineq_numpy)
wineq.shape, wineq.type()

(torch.Size([4898, 12]), 'torch.DoubleTensor')

At this point, you have a torch.DoubleTensor (64-bit floats, inherited from the NumPy array) containing all columns, including the
 last, which refers to the quality score. 
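
The element type of the tensor follows the NumPy array it was built from. The sketch below uses a small hypothetical array in place of `wineq_numpy` to show how to check the dtype and, if 32-bit floats are preferred (the default for most PyTorch layers), how to convert.

```python
import numpy as np
import torch

# Hypothetical stand-in for wineq_numpy: NumPy defaults to float64.
values = np.array([[7.0, 0.27, 6.0], [6.3, 0.30, 6.0]])

t64 = torch.from_numpy(values)   # shares memory with `values`, keeps float64
t32 = t64.float()                # new tensor, converted to float32

print(t64.dtype, t32.dtype)
```

The rest of this lecture keeps the 64-bit tensor, which is why the outputs show `dtype=torch.float64`.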
 
 Before going further, let's discuss:

#### Interval, ordinal, and categorical values: 
You should be aware of three kinds of numerical values as you attempt to make
 sense of your data.

#### 1. continuous values: (Interval values)
These values are the most intuitive when represented as numbers; they’re strictly ordered, and a difference between various values
 has a strict meaning.
 
  If you’re counting or measuring something with units,
 the value probably is a continuous value
 
#### 2. ordinal values:
The strict ordering of continuous values remains, but the fixed
 relationship between values no longer applies. 
 
 A good example is ordering a small,
 medium, or large drink, with small mapped to the value 1, medium to 2, and large to 3. If you were to convert 1, 2, and 3 to
 the actual volumes (say, 8, 12, and 24 fluid ounces), those values would switch to interval values. It’s important to remember that you can’t do math on the values beyond ordering them; trying to average large=3 and small=1 does not result in a medium drink!
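
The point about averaging can be checked with a few lines of arithmetic. This sketch uses the codes and volumes from the drink example; the dictionary names are only illustrative.

```python
# Ordinal codes and the actual volumes (fluid ounces) from the drink example.
size_code = {"small": 1, "medium": 2, "large": 3}
volume_oz = {"small": 8, "medium": 12, "large": 24}

# Averaging the ordinal codes of a small and a large drink lands on the "medium" code...
mean_code = (size_code["small"] + size_code["large"]) / 2

# ...but averaging the real volumes does not give the volume of a medium drink.
mean_volume = (volume_oz["small"] + volume_oz["large"]) / 2

print(mean_code, mean_volume, volume_oz["medium"])  # 2.0 16.0 12
```

The average code is 2 ("medium"), yet the average volume is 16 fluid ounces rather than 12, so arithmetic on the ordinal codes gives a misleading answer.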
 
#### 3.  categorical values: 
 Categorical values have neither ordering nor numerical meaning. These values
 are often enumerations of possibilities, assigned arbitrary numbers. 
 
 Assigning water to
 1, coffee to 2, soda to 3, and milk to 4 is a good example.  Placing water first and milk
 last has no real logic; you simply need distinct values to differentiate them. You could
 assign coffee to 10 and milk to –3 with no significant change (although assigning values
 in the range 0..N-1 will have advantages when we discuss one-hot encoding later).


>> Coming back to the original problem

So, we have two choices in the wine example:

- You could treat the score as a continuous variable, keep it as a real number, and perform a 'regression task'.

- You could treat it as a label and try to guess that label from the chemical analysis in a 'classification task'. 

> In both methods, you typically remove the score
 from the tensor of input data and keep it in a separate tensor, so that you can use the
 score as the ground truth without it being input to your model

In [35]:
# separate the data from the desired output
data = wineq[:,:-1] # all rows, all columns except the last one (score)

In [37]:
data,data.shape

(tensor([[ 7.0000,  0.2700,  0.3600,  ...,  3.0000,  0.4500,  8.8000],
         [ 6.3000,  0.3000,  0.3400,  ...,  3.3000,  0.4900,  9.5000],
         [ 8.1000,  0.2800,  0.4000,  ...,  3.2600,  0.4400, 10.1000],
         ...,
         [ 6.5000,  0.2400,  0.1900,  ...,  2.9900,  0.4600,  9.4000],
         [ 5.5000,  0.2900,  0.3000,  ...,  3.3400,  0.3800, 12.8000],
         [ 6.0000,  0.2100,  0.3800,  ...,  3.2600,  0.3200, 11.8000]],
        dtype=torch.float64), torch.Size([4898, 11]))

In [38]:
target = wineq[:,-1] # select all rows of last column (-1)

In [39]:
target,target.shape

(tensor([6., 6., 6.,  ..., 6., 7., 6.], dtype=torch.float64),
 torch.Size([4898]))

If you want to transform the target tensor into a tensor of labels, you have two options,
 depending on the strategy or how you want to use the categorical data.
#### 1.
 One option is to treat the labels as an integer vector of scores:


In [40]:
target = wineq[:, -1].long()
target

tensor([6, 6, 6,  ..., 6, 7, 6])

In [47]:
# example: .long() truncates the fractional part rather than rounding
a = torch.tensor([4.8,5.9])
a.long()

tensor([4, 5])

Note: If targets were string labels (such as wine color), assigning an integer number to each
 string would allow you to follow the same approach

#### 2.  
The other approach is to build a one-hot encoding of the scores—that is, encode
 each of the ten scores in a vector of ten elements, with all elements set to zero but
 one, at a different index for each score. This way, a score of 1 could be mapped to the
 vector (1,0,0,0,0,0,0,0,0,0), a score of 5 to (0,0,0,0,1,0,0,0,0,0) and so on.
 

>> Note: The two approaches have marked differences. 
Keeping the wine-quality scores in an integer vector induces an ordering of the scores, which may be appropriate here because a score of 1 is lower than a score of 4. It also induces a distance between scores. (The distance between 1 and 3 is the same as the distance
 between 2 and 4, for example.) If this holds for your quantity, great. 
 
>> If, on the other
 hand, scores are purely qualitative, such as color, one-hot encoding is a much better
 fit, as no implied ordering or distance is involved. One-hot encoding is also appropriate
 for quantitative scores when fractional values between integer scores
 make no sense for the application (when the score is either this or that). 

You can achieve one-hot encoding by using the scatter_ method, which fills the
 tensor with values from a source tensor along the indices provided as arguments.

In [48]:
target_onehot = torch.zeros(target.shape[0], 10)
target_onehot.scatter_(1, target.unsqueeze(1), 1.0)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

The preceding invocation reads this way: “For each row, take the index of
 the target label (which coincides with the score in this case), and use it as the column
 index to set the value 1.0. The result is a tensor encoding categorical information.” 
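
The same call is easier to follow on a toy example. The sketch below uses three hypothetical labels in place of `target` and compares the `scatter_` result with `torch.nn.functional.one_hot`, a built-in alternative that the text above does not use.

```python
import torch
import torch.nn.functional as F

# Toy labels standing in for `target`; values must lie in 0..num_classes-1.
toy_target = torch.tensor([0, 2, 1])
num_classes = 3

# scatter_ writes 1.0 into column toy_target[i] of row i.
onehot_scatter = torch.zeros(toy_target.shape[0], num_classes)
onehot_scatter.scatter_(1, toy_target.unsqueeze(1), 1.0)

# one_hot returns an integer tensor, so cast to float before comparing.
onehot_builtin = F.one_hot(toy_target, num_classes=num_classes).float()

print(onehot_scatter.tolist())
print(torch.equal(onehot_scatter, onehot_builtin))  # True
```

Both routes give the same matrix; `scatter_` is the more general tool because it can write arbitrary values at arbitrary indices.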
 
 The second argument of scatter_, the index tensor, is required to have the same
 number of dimensions as the tensor you scatter into. Because target_onehot has two
 dimensions (4898x10), you need to add an extra dummy dimension to target by
 using unsqueeze:


In [58]:
target_unsqueezed = target.unsqueeze(1) # to convert it into 2D
print(target_unsqueezed.shape)
target_unsqueezed[0][0]

torch.Size([4898, 1])


tensor(6)

In [59]:
target.shape
target[0]

tensor(6)

PyTorch allows you to use class indices directly as targets while training neural networks. If you want to use the score as a categorical input to the network, however,
 you’d have to transform it to a one-hot encoded tensor. 
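
As a minimal sketch of the first point, assuming a hypothetical batch of raw model outputs (`logits`) rather than a trained network, `nn.CrossEntropyLoss` accepts integer class indices as targets directly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model outputs: 4 samples, 10 possible scores (raw, unnormalized).
logits = torch.randn(4, 10)

# Targets are plain class indices of dtype long; no one-hot encoding is needed.
labels = torch.tensor([6, 5, 7, 6])

loss = nn.CrossEntropyLoss()(logits, labels)
print(loss.item())
```

The loss is a single scalar; the one-hot form only becomes necessary when the score is fed to the network as an input feature.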

 > Now go back to your data tensor, containing the 11 variables associated with the
 chemical analysis. You can use the functions in the PyTorch Tensor API to manipulate
 your data in tensor form. First, obtain the mean and variance of each column (the standard deviation is the square root of the variance):

In [60]:
data_mean = torch.mean(data, dim=0) 
data_mean

tensor([6.8548e+00, 2.7824e-01, 3.3419e-01, 6.3914e+00, 4.5772e-02, 3.5308e+01,
        1.3836e+02, 9.9403e-01, 3.1883e+00, 4.8985e-01, 1.0514e+01],
       dtype=torch.float64)

In [62]:
data_mean.shape

torch.Size([11])

In [63]:
data_var = torch.var(data, dim=0) 
data_var

tensor([7.1211e-01, 1.0160e-02, 1.4646e-02, 2.5726e+01, 4.7733e-04, 2.8924e+02,
        1.8061e+03, 8.9455e-06, 2.2801e-02, 1.3025e-02, 1.5144e+00],
       dtype=torch.float64)

In this case, dim=0 indicates that the reduction is performed along dimension 0. At
 this point, you can normalize the data by subtracting the mean and dividing by the
 standard deviation, which helps with the learning process.


In [66]:
data_normalized = (data - data_mean) / torch.sqrt(data_var)
data_normalized.shape

torch.Size([4898, 11])

In [67]:
data_normalized[0]

tensor([ 0.1721, -0.0818,  0.2133,  2.8211, -0.0354,  0.5699,  0.7445,  2.3313,
        -1.2468, -0.3491, -1.3930], dtype=torch.float64)
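
A quick way to check a normalization like this is to confirm that every column ends up with mean 0 and standard deviation 1. The sketch below does so on a small hypothetical tensor standing in for `data`; it also shows that the per-column statistics of shape `[2]` are broadcast across all rows.

```python
import torch

# Hypothetical 3x2 stand-in for `data` (rows are samples, columns are features).
toy = torch.tensor([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]], dtype=torch.float64)

mean = toy.mean(dim=0)         # shape [2], one mean per column
std = toy.var(dim=0).sqrt()    # same result as toy.std(dim=0)

# Broadcasting stretches the [2]-shaped statistics across all 3 rows.
toy_normalized = (toy - mean) / std

print(toy_normalized.mean(dim=0))  # close to 0 in every column
print(toy_normalized.std(dim=0))   # close to 1 in every column
```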

Next, look at the data with an eye to finding an easy way to tell good and bad wines
 apart at a glance. First, use the torch.le function to determine which rows in target correspond to a score less than or equal to 3:

In [68]:
bad_indexes = torch.le(target, 3) 
bad_indexes.shape, bad_indexes.dtype, bad_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(20))

Note that only 20 of the bad_indexes entries are set to True! The bad_indexes tensor has the same shape as target, with a
 value of False or True depending on the outcome of the comparison between your threshold
 and each element in the original target tensor.

By leveraging a feature in
 PyTorch called advanced indexing, you can use a Boolean tensor to index the data tensor:

In [73]:
# advanced indexing
bad_data = data[bad_indexes]
bad_data.shape

torch.Size([20, 11])
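
The row selection is easier to see on a tensor small enough to read in full. This sketch uses hypothetical scores and features; only the rows whose mask entry is True survive.

```python
import torch

# Toy scores and a matching 4x2 feature table (hypothetical values).
scores = torch.tensor([3, 6, 2, 7])
features = torch.tensor([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.4, 4.0]])

mask = torch.le(scores, 3)    # same as scores <= 3; dtype is torch.bool
selected = features[mask]     # keeps only the rows where mask is True

print(mask.tolist())          # [True, False, True, False]
print(selected.shape)         # torch.Size([2, 2])
```

The number of rows in the result equals `mask.sum()`, which is why `bad_data` above has 20 rows.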

In [77]:
bad_data_pd = pd.DataFrame(np.array(bad_data))

In [80]:
bad_data_pd.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,8.5,0.26,0.21,16.2,0.074,41.0,197.0,0.998,3.02,0.5,9.8
1,5.8,0.24,0.44,3.5,0.029,5.0,109.0,0.9913,3.53,0.43,11.7
2,9.1,0.59,0.38,1.6,0.066,34.0,182.0,0.9968,3.23,0.38,8.5
3,7.1,0.32,0.32,11.0,0.038,16.0,66.0,0.9937,3.24,0.4,11.5
4,6.9,0.39,0.4,4.6,0.022,5.0,19.0,0.9915,3.31,0.37,12.6


Note: To view tensor data as a pandas DataFrame, first convert the tensor to a NumPy array, as done above with np.array.

>  Now you can start to get information about wines grouped into good, middling,
 and bad categories. Take the .mean() of each column:

For Boolean NumPy arrays and PyTorch tensors, the & operator performs an element-wise logical AND.
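
A small sketch of combining two conditions, using hypothetical scores. It also shows the operator spelling of the same test, where parentheses are required because `&` binds more tightly than `>` and `<` in Python.

```python
import torch

scores = torch.tensor([3, 4, 6, 7, 9])   # hypothetical quality scores

# Function form, as used for mid_data in the cell that follows.
mid_mask = torch.gt(scores, 3) & torch.lt(scores, 7)

# Operator form: the parentheses are required because & binds tighter than > and <.
mid_mask_ops = (scores > 3) & (scores < 7)

print(mid_mask.tolist())                   # [False, True, True, False, False]
print(torch.equal(mid_mask, mid_mask_ops)) # True
```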

In [81]:
bad_data = data[torch.le(target, 3)]  # less than or equal to 3
mid_data = data[torch.gt(target, 3) & torch.lt(target, 7)]  # greater than 3
# and less than 7
good_data = data[torch.ge(target, 7)]  # greater than or equal to 7

In [82]:
bad_data.shape, mid_data.shape, good_data.shape

(torch.Size([20, 11]), torch.Size([3818, 11]), torch.Size([1060, 11]))

In [84]:
# take mean of each column 

bad_mean = torch.mean(bad_data, dim=0) 
mid_mean = torch.mean(mid_data, dim=0)
good_mean = torch.mean(good_data, dim=0)

In [86]:
for i, args in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
    # col_list holds the column names; zip stops after the 11 means, so 'quality' is skipped
    print('{:2} {:20} {:6.2f} {:6.2f} {:6.2f}'.format(i, *args))

 0 fixed acidity          7.60   6.89   6.73
 1 volatile acidity       0.33   0.28   0.27
 2 citric acid            0.34   0.34   0.33
 3 residual sugar         6.39   6.71   5.26
 4 chlorides              0.05   0.05   0.04
 5 free sulfur dioxide   53.33  35.42  34.55
 6 total sulfur dioxide 170.60 141.83 125.25
 7 density                0.99   0.99   0.99
 8 pH                     3.19   3.18   3.22
 9 sulphates              0.47   0.49   0.50
10 alcohol               10.35  10.26  11.42


#### Analysis: 
It looks as though you’re on to something here. At first glance, the bad wines seem to
 have higher total sulfur dioxide, among other differences. You could use a threshold
 on total sulfur dioxide as a crude criterion for discriminating good wines from bad ones. Now get the indexes in which the total sulfur dioxide column is below the mean for the middling wines (141.83) that you calculated earlier, like so:

 

In [87]:
total_sulfur_threshold = 141.83 
total_sulfur_data = data[:,6]  # column number 6 
predicted_indexes = torch.lt(total_sulfur_data, total_sulfur_threshold)

In [88]:
predicted_indexes.shape

torch.Size([4898])

In [89]:
predicted_indexes[0:3]

tensor([False,  True,  True])

In [91]:
predicted_indexes.shape, predicted_indexes.dtype, predicted_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(2727))

Analysis: Your threshold implies that slightly more than half of the wines are going to be high-quality. Next, you need to get the indexes of the good wines:

In [92]:
actual_indexes = torch.gt(target, 5)
actual_indexes.shape, actual_indexes.dtype, actual_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(3258))

Because you have about 500 more good wines than your threshold predicted, you
 already have hard evidence that the threshold isn’t perfect.
 
 
 Now you need to see how well your predictions line up with the actual rankings.
 Perform a logical AND between your prediction indexes and the good indexes
 (remembering that each is a Boolean tensor), and use that intersection of
 wines in agreement to determine how well you did:

In [93]:
n_matches = torch.sum(actual_indexes & predicted_indexes).item()
n_predicted = torch.sum(predicted_indexes).item()
n_actual = torch.sum(actual_indexes).item()
n_matches, n_matches / n_predicted, n_matches / n_actual


(2018, 0.74000733406674, 0.6193984039287906)

You got around 2,000 wines right! Because you had about 2,700 wines predicted, a 74 percent chance exists that if you predict a wine to be high-quality, it is. Unfortunately, you
 have about 3,300 good wines and identified only 62 percent of them. Well, we guess you got
 what you signed up for; that result is barely better than random. 
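
The two ratios in the output are commonly called precision and recall. As a worked check using the counts printed above, the sketch below also computes the share of good wines in the whole data set (the name `base_rate` is illustrative), which is the precision you would get by calling every wine good.

```python
# Counts copied from the outputs above.
n_matches, n_predicted, n_actual, n_total = 2018, 2727, 3258, 4898

precision = n_matches / n_predicted   # share of predicted-good wines that are good
recall = n_matches / n_actual         # share of good wines the threshold found
base_rate = n_actual / n_total        # share of good wines in the whole data set

print(f"precision={precision:.3f} recall={recall:.3f} base_rate={base_rate:.3f}")
# precision=0.740 recall=0.619 base_rate=0.665
```

A precision of 0.740 against a base rate of 0.665 is only a modest gain, which is what "barely better than random" means here.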

Analysis:  This example is naïve, of course. You know for sure that multiple variables contribute to wine quality and that the relationships between the values of these variables and
 the outcome (which could be the actual score rather than a binarized version of it) are
 likely to be more complicated than a simple threshold on a single value.
 
- A simple neural network would overcome all these limitations, as would a lot of other basic machine learning methods. 

<<<<<<<<----------------------------- End of Topic ---------------------------->>>>>>>>>>>>