# 3.1 Tabular data 
The simplest form of data you'll encounter is tabular: a spreadsheet, a CSV (comma-separated values) file, or a database table.

Whatever the medium, this data is a table containing one row per sample (or record), in which each column contains one piece of information about the sample.

## Interval, ordinal, and categorical values
1. continuous values: these values are the most intuitive when represented as numbers.

2. ordinal values: the strict ordering of continuous values remains, but the fixed relationship between values no longer applies. 

3. categorical values: these values have neither ordering nor numerical meaning. 
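
As a rough sketch of how the three kinds of values might end up in tensors: the size and color vocabularies below are made up for illustration and are not part of the wine data set; only the alcohol values come from the rows shown later in this notebook.

```python
import torch
import torch.nn.functional as F

# continuous: stored directly as floating-point numbers
alcohol = torch.tensor([8.8, 9.5, 10.1])

# ordinal (hypothetical example): the order matters, the spacing does not
size_levels = {"small": 0, "medium": 1, "large": 2}
sizes = torch.tensor([size_levels[s] for s in ["small", "large", "medium"]])

# categorical (hypothetical example): no order at all, so use one-hot rows
color_names = ["red", "white", "rose"]
color_index = torch.tensor([0, 1, 0])
color_onehot = F.one_hot(color_index, num_classes=len(color_names))

print(alcohol.dtype, sizes.tolist())
print(color_onehot)
```

The integer codes for the ordinal column keep the ordering, while the one-hot rows give every category its own column so that no ordering is implied.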

In [178]:
!ls

2_loading_data.ipynb  data  pictures


In [181]:
# https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
import csv
import torch
import numpy as np
wine_path = "data/winequality-white.csv"

# read data
wineq_numpy = np.loadtxt(wine_path, 
                         dtype=np.float32, 
                         delimiter=";",
                         skiprows=1)

# 2d array, 32-bit floating point
wineq_numpy

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

In [183]:
# check all data has been read
col_list = next(csv.reader(open(wine_path), 
                           delimiter=';'))

print(wineq_numpy.shape)
print(col_list)

(4898, 12)
['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
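
If the CSV file is not at hand, the same loading steps can be tried on a small in-memory sample. This is only a sketch: the three-column text below is a cut-down stand-in for the real file, built from values visible in the output above.

```python
import io
import numpy as np

# a tiny stand-in for winequality-white.csv (semicolon-separated, one header row)
csv_text = (
    "fixed acidity;alcohol;quality\n"
    "7.0;8.8;6\n"
    "6.3;9.5;6\n"
)

# np.loadtxt accepts file-like objects as well as paths
sample = np.loadtxt(io.StringIO(csv_text),
                    dtype=np.float32,
                    delimiter=";",
                    skiprows=1)

# the header row holds the column names
header = csv_text.splitlines()[0].split(";")

print(sample.shape, header)
```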


In [184]:
# convert numpy to pytorch
wineq = torch.from_numpy(wineq_numpy)

print(wineq.shape, wineq.type())

torch.Size([4898, 12]) torch.FloatTensor


In [185]:
# select all rows and all columns except the last
data = wineq[:, :-1]
print(data.shape)

# select all rows and the last column
target = wineq[:, -1]
print(target.shape)
# treat label as an integer vector of scores
target = wineq[:, -1].long()
target

torch.Size([4898, 11])
torch.Size([4898])


tensor([6, 6, 6,  ..., 6, 7, 6])

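### Data Manipulation
1. one-hot encode the target column: keeping wine-quality scores in an integer vector induces an ordering of the scores, which may be appropriate in this case because a score of 1 is lower than a score of 4. One-hot encoding drops that ordering and treats each score as a separate category.
2. normalize the data: subtract each column's mean and divide by its standard deviation.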

In [186]:
# one hot target column
target_onehot = torch.zeros(target.shape[0], 10)
target_onehot.scatter_(1, target.unsqueeze(1), 1.0)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
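
Because the printed tensor is mostly elided, it can help to run the same `scatter_` call on a handful of scores. A minimal sketch, assuming only three possible scores (0, 1, 2) to keep the output readable:

```python
import torch

scores = torch.tensor([0, 2, 1, 2])           # one integer label per sample
onehot = torch.zeros(scores.shape[0], 3)      # one row per sample, one column per label

# for each row i, write 1.0 into column scores[i]
onehot.scatter_(1, scores.unsqueeze(1), 1.0)

# argmax along the columns recovers the original labels
recovered = onehot.argmax(dim=1)

print(onehot)
print(recovered)
```

The index tensor must have the same number of dimensions as the tensor being written into, which is why `unsqueeze(1)` turns the 1D labels into a column.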

In [187]:
# Normalize data
data_mean = torch.mean(data, dim=0)
data_var = torch.var(data, dim=0)
data_normalized = (data - data_mean) / torch.sqrt(data_var)
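
A quick way to check this kind of normalization is to confirm that every column ends up with zero mean and unit standard deviation. The sketch below uses random data with three differently scaled columns as a stand-in for the wine features:

```python
import torch

torch.manual_seed(0)
# three columns on very different scales (synthetic stand-in data)
x = torch.rand(100, 3) * torch.tensor([1.0, 10.0, 100.0])

# same recipe as above: subtract the column mean, divide by the column std
x_norm = (x - x.mean(dim=0)) / x.std(dim=0)

col_mean = x_norm.mean(dim=0)
col_std = x_norm.std(dim=0)
print(col_mean, col_std)
```

`torch.std` and `torch.sqrt(torch.var(...))` agree here because both use the same default estimator.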

### EDA
1. Get information about wines grouped into good, middling, and bad categories. At first glance, the bad wines seem to have higher total sulfur dioxide, among other differences.

In [188]:
bad_indexes = torch.le(target, 3)
print(bad_indexes.shape, bad_indexes.dtype, bad_indexes.sum())

bad_data = data[bad_indexes]
print(bad_data.shape)

torch.Size([4898]) torch.bool tensor(20)
torch.Size([20, 11])
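
The row selection above relies on indexing with a boolean tensor. A minimal sketch on four made-up samples shows which rows survive:

```python
import torch

scores = torch.tensor([3, 6, 8, 2])          # made-up quality scores
features = torch.arange(8.0).view(4, 2)      # four samples, two features each

low_mask = scores <= 3                       # same result as torch.le(scores, 3)
low_rows = features[low_mask]                # keeps only rows where the mask is True

print(low_mask)
print(low_rows)
```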


In [189]:
bad_data = data[torch.le(target,3)]
mid_data = data[torch.gt(target,3) & torch.le(target,7)]
good_data = data[torch.gt(target,7)]

bad_mean = torch.mean(bad_data, dim=0)
mid_mean = torch.mean(mid_data, dim=0)
good_mean = torch.mean(good_data, dim=0)

for i, args in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
    print('{:2} {:20} {:6.2f} {:6.2f} {:6.2f}'.format(i, *args))

 0 fixed acidity          7.60   6.86   6.68
 1 volatile acidity       0.33   0.28   0.28
 2 citric acid            0.34   0.33   0.33
 3 residual sugar         6.39   6.42   5.63
 4 chlorides              0.05   0.05   0.04
 5 free sulfur dioxide   53.33  35.18  36.63
 6 total sulfur dioxide 170.60 138.70 125.88
 7 density                0.99   0.99   0.99
 8 pH                     3.19   3.19   3.22
 9 sulphates              0.47   0.49   0.49
10 alcohol               10.34  10.47  11.65


2. Set the threshold and get the indexes in which the total sulfur dioxide column is below the midpoint you calculated earlier. The result implies that slightly more than half of the wines are predicted to be high quality.

In [190]:
total_sulfur_threshold = 141.83
total_sulfur_data = data[:,6]
predicted_indexes = torch.lt(total_sulfur_data, 
                             total_sulfur_threshold)

print(predicted_indexes.shape, 
      predicted_indexes.dtype, 
      predicted_indexes.sum())

torch.Size([4898]) torch.bool tensor(2727)


The actual rankings below contain about 500 more good wines than the threshold predicted (3,258 versus 2,727), which is hard evidence that the threshold isn't perfect.

In [191]:
actual_indexes = torch.gt(target, 5)
print(actual_indexes.shape, 
      actual_indexes.dtype, 
      actual_indexes.sum())

torch.Size([4898]) torch.bool tensor(3258)


How well do the predictions line up with the actual rankings?

- You got around 2,000 wines right!
- Because you predicted about 2,700 wines, there is a 74% chance that a wine you predict to be high-quality actually is.
- Unfortunately, there are about 3,200 good wines, and you identified only 62% of them. Well, we guess you got what you signed up for.
- The result is barely better than random.

In [192]:
n_matches = torch.sum(actual_indexes & predicted_indexes).item()
n_predicted = torch.sum(predicted_indexes).item()
n_actual = torch.sum(actual_indexes).item() 

print(n_matches, 
      n_matches / n_predicted, 
      n_matches / n_actual)

2018 0.74000733406674 0.6193984039287906
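
In classification terms, the two ratios printed above are usually called precision and recall. The sketch below recomputes them from the counts in this notebook; the comparison against the share of good wines in the whole data set is an added illustration of why the result is only slightly better than guessing.

```python
# counts taken from the outputs above
true_positives = 2018     # predicted good and actually good
predicted_good = 2727     # wines below the sulfur threshold
actual_good = 3258        # wines with quality > 5
total_wines = 4898

precision = true_positives / predicted_good   # how often a "good" prediction is right
recall = true_positives / actual_good         # share of good wines that were found

# labeling every wine as good would be right this often
base_rate = actual_good / total_wines

print(f"precision={precision:.3f} recall={recall:.3f} base_rate={base_rate:.3f}")
```

A precision of about 0.74 against a base rate of about 0.67 means the threshold adds only a few percentage points over calling every wine good.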


# 3.2 Time Series

Here it’s important to get an idea of how a time series is laid out and how to wrangle the data into a form that a network will digest. The examples use the bike-sharing [dataset](https://github.com/deep-learning-with-pytorch/dlwpt-code/tree/master/data/p1ch4/bike-sharing-dataset).

In [193]:
!ls

2_loading_data.ipynb  data  pictures


In [194]:
data_path = 'https://raw.githubusercontent.com/deep-learning-with-pytorch/dlwpt-code/master/data/p1ch4/bike-sharing-dataset/hour-fixed.csv'

In [195]:
import numpy as np
bikes_numpy = np.loadtxt(data_path,
                         dtype=np.float32,
                         delimiter=",",
                         skiprows=1,
                         converters={1: lambda x: float(x[8:10])})
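
The `converters` argument is what turns the date column into a number: for column 1 it keeps characters 8 and 9 of the date string, which is the day of the month. A small sketch on in-memory text, where the four-column layout is a simplified stand-in for the real 17-column file:

```python
import io
import numpy as np

# simplified stand-in rows: index, date, hour, count
csv_text = (
    "instant,dteday,hr,cnt\n"
    "1,2011-01-01,0,16\n"
    "2,2011-01-01,1,40\n"
    "3,2011-01-02,0,17\n"
)

rows = np.loadtxt(io.StringIO(csv_text),
                  dtype=np.float32,
                  delimiter=",",
                  skiprows=1,
                  # "2011-01-02"[8:10] is "02", which becomes 2.0
                  converters={1: lambda x: float(x[8:10])})

print(rows)
```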

In [196]:
import torch
bikes = torch.from_numpy(bikes_numpy)
# That’s 17,520 hours, 17 columns. 
print(bikes.shape, bikes.stride())
# print(bikes.storage()[0:24])
# bikes[0:24]

torch.Size([17520, 17]) (17, 1)


Now reshape the data to have three axes (day, hour, and 17 columns)

1. N x C x L tensor:
    - N is the number of days
    - C remains your 17 channels
    - L would be 24, one per hour of the day
    - In the source table, the first column is the index (the global ordering of the data); the second is the date; the sixth is the time of day.

2. Note: calling view on a tensor returns a new tensor that changes the number of dimensions and the striding information without changing the storage.

In [197]:
# step 1:
daily_bikes = bikes.view(-1, 24, bikes.shape[1])
print(daily_bikes.shape, daily_bikes.stride())
# print(daily_bikes.storage()[0:24])
# daily_bikes[0]

# step 2:
daily_bikes = daily_bikes.transpose(1, 2)
print(daily_bikes.shape, daily_bikes.stride())
# print(daily_bikes.storage()[0:24])
# daily_bikes[0]

torch.Size([730, 24, 17]) (408, 17, 1)
torch.Size([730, 17, 24]) (408, 1, 17)
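
The stride tuples printed above are the visible sign that `view` and `transpose` only reinterpret the same memory. A minimal sketch on a 12-element tensor (sizes chosen only for readability) makes the sharing explicit:

```python
import torch

flat = torch.arange(12.0)          # 12 values in one contiguous block
grid = flat.view(3, 4)             # same storage, seen as 3 rows of 4
grid[0, 0] = 100.0                 # writing through the view changes the original

swapped = grid.transpose(0, 1)     # again the same storage, with strides swapped

print(flat[0].item())
print(grid.stride(), swapped.stride())
print(flat.data_ptr() == grid.data_ptr() == swapped.data_ptr())
print(swapped.is_contiguous())
```

Since no data is copied, both operations are cheap even on large tensors; the cost is that a transposed tensor is no longer contiguous.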


Now convert the ordinal variable weather situation into a one-hot categorical variable:
1. First, initialize a zero-filled matrix with a number of rows equal to the number of hours in the day and a number of columns equal to the number of weather levels.
2. Then, scatter ones into the matrix according to the corresponding level at each row.
3. Last, concatenate the matrix to the original data set, using the cat function.


Then do the same thing with the reshaped daily_bikes tensor.
1. It's shaped (B, C, L), where L = 24. First, create the zero tensor, with the same B and L but with the number of added columns as C.

In [198]:
# use first day as an example
# step1:
first_day = bikes[:24].long()
weather_onehot = torch.zeros(first_day.shape[0], 4)
print(first_day[:,9])

# step2:
weather_onehot.scatter_(
    dim=1,
    index=first_day[:,9].unsqueeze(1) - 1, # You’re decreasing the values by 1 because the weather situation ranges from 1 to 4, whereas indices are 0-based.
    value=1.0)

#step3:
torch.cat((bikes[:24], weather_onehot), 1)[:1]

tensor([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2])


tensor([[ 1.0000,  1.0000,  1.0000,  0.0000,  1.0000,  0.0000,  0.0000,  6.0000,
          0.0000,  1.0000,  0.2400,  0.2879,  0.8100,  0.0000,  3.0000, 13.0000,
         16.0000,  1.0000,  0.0000,  0.0000,  0.0000]])

In [199]:
# apply to whole data set
# step 1:
daily_weather_onehot = torch.zeros(daily_bikes.shape[0], 
                                   4, 
                                   daily_bikes.shape[2])
print(daily_weather_onehot.shape)
# step 2:
daily_weather_onehot.scatter_(
    dim=1,
    index=daily_bikes[:,9,:].long().unsqueeze(1)-1,
    value=1.0)
print(daily_weather_onehot.shape)
# step 3:
daily_bikes = torch.cat((daily_bikes, daily_weather_onehot), dim=1)
print(daily_bikes.shape)

torch.Size([730, 4, 24])
torch.Size([730, 4, 24])
torch.Size([730, 21, 24])
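
To convince yourself that scattering along the channel dimension does the right thing, a tiny batch is enough. The sketch below uses two made-up "days" of three hours each, with weather levels from 1 to 4:

```python
import torch

# made-up weather levels, shape (B=2, L=3)
weather = torch.tensor([[1, 2, 4],
                        [3, 3, 1]])

# one channel per weather level, shape (B, 4, L)
onehot = torch.zeros(weather.shape[0], 4, weather.shape[1])

# index needs a channel dimension of size 1; levels are shifted to start at 0
onehot.scatter_(1, weather.unsqueeze(1) - 1, 1.0)

# every hour has exactly one active channel, and argmax recovers the level
print(onehot.sum(dim=1))
print(onehot.argmax(dim=1) + 1)
```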


Alternatively, treat the ordinal variable weather situation as a special case of a continuous variable.

In that case, rescale it to the [0.0, 1.0] interval like the other variables. Rescaling to [0.0, 1.0] or [-1.0, 1.0] is something that you’ll want to do for all quantitative variables.

For example, column 10 holds the temperature, and it's beneficial to rescale it to [0.0, 1.0] or to standardize it.

In [200]:
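# option 1: min-max scaling into [0.0, 1.0]
temp = daily_bikes[:, 10, :]
temp_min = torch.min(temp)
temp_max = torch.max(temp)

daily_bikes[:, 10, :] = (daily_bikes[:, 10, :] - temp_min) / (temp_max - temp_min)

# option 2: standardize to zero mean and unit standard deviation
# (the result is centered on 0 but not bounded to [-1.0, 1.0]);
# use one option or the other, since both overwrite column 10 in place
temp = daily_bikes[:, 10, :]
daily_bikes[:, 10, :] = (daily_bikes[:, 10, :] - torch.mean(temp)) / torch.std(temp)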
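
The two rescaling recipes behave differently, which a four-value example makes visible. The numbers are made up; the point is that min-max scaling is bounded while mean and standard deviation scaling is not:

```python
import torch

values = torch.tensor([0.0, 5.0, 10.0, 25.0])   # made-up readings

# min-max scaling: smallest value maps to 0, largest to 1
minmax = (values - values.min()) / (values.max() - values.min())

# standardization: zero mean, unit standard deviation, no fixed bounds
standardized = (values - values.mean()) / values.std()

print(minmax)
print(standardized)
```

Here the largest standardized value is greater than 1, so this recipe should not be read as mapping into [-1.0, 1.0].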

# 3.3 Text

Networks operate on text at two levels: at character level, by processing one character at a time, and at word level, in which individual words are the finest-grained entities seen by the network. The technique you use to encode text information into tensor form is the same whether you operate at character level or at word level.

In [201]:
with open('data/1342-0.txt', encoding='utf8') as f:
    text = f.read()

In [202]:
# split text into a list of lines 
# character level encoding
lines = text.split('\n') 
line = lines[200]
print(line)

# create a tensor that can hold the total number of one-hot encoded characters for the whole line
letter_tensor = torch.zeros(len(line), 128)
print(letter_tensor.size())

for i, letter in enumerate(line.lower().strip()):
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_tensor[i][letter_index] = 1

print(letter_tensor)

“Impossible, Mr. Bennet, impossible, when I am not acquainted with him
torch.Size([70, 128])
tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


Build a mapping of words to indexes in your encoding

In [203]:
def clean_words(input_str):
    punctuation = '.,;:"!?”“_-'
    word_list = input_str.lower().replace('\n',' ').split()
    word_list = [word.strip(punctuation) for word in word_list]
    return word_list

In [204]:
words_in_line = clean_words(line) 

print(line)
print(words_in_line)

“Impossible, Mr. Bennet, impossible, when I am not acquainted with him
['impossible', 'mr', 'bennet', 'impossible', 'when', 'i', 'am', 'not', 'acquainted', 'with', 'him']


In [205]:
word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for (i, word) in enumerate(word_list)}
len(word2index_dict), word2index_dict['impossible']

(7260, 3394)

Now focus on your sentence. Break it into words and one-hot encode it; that is, populate a tensor with one one-hot encoded vector per word. Create an empty tensor, and assign the one-hot encoded values of the words in the sentence:

In [206]:
word_tensor = torch.zeros(len(words_in_line), len(word2index_dict)) 
for i, word in enumerate(words_in_line):
    word_index = word2index_dict[word] 
    word_tensor[i][word_index] = 1
    print('{:2} {:4} {}'.format(i, word_index, word))

 0 3394 impossible
 1 4305 mr
 2  813 bennet
 3 3394 impossible
 4 7078 when
 5 3315 i
 6  415 am
 7 4436 not
 8  239 acquainted
 9 7148 with
10 3215 him
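
The same word-level encoding can be followed end to end on a vocabulary small enough to print. This sketch builds the vocabulary from four words of the sentence only, so the indexes differ from the full-text ones above:

```python
import torch

words = "impossible mr bennet impossible".split()

# vocabulary: sorted unique words, each mapped to its position
vocab = sorted(set(words))
word2index = {word: i for i, word in enumerate(vocab)}

# one row per word in the sentence, one column per vocabulary entry
onehot = torch.zeros(len(words), len(vocab))
for i, word in enumerate(words):
    onehot[i][word2index[word]] = 1

# argmax per row gives back the index, and therefore the word
decoded = [vocab[j] for j in onehot.argmax(dim=1).tolist()]

print(word2index)
print(onehot)
print(decoded)
```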


## 3.3.1 Text embeddings

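Embeddings are an essential tool when a large number of entries in a set has to be represented with numeric vectors: instead of one very wide one-hot vector per entry, each entry is mapped to a much shorter vector of floating-point numbers.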
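
One way to sketch the idea is with `torch.nn.Embedding`, which is a lookup table from integer indexes to dense vectors. The vocabulary size and word indexes below come from the word-level example above; the embedding dimension of 8 is an arbitrary choice, and the vectors are randomly initialized rather than trained.

```python
import torch

vocab_size = 7260        # size of word2index_dict in the example above
embedding_dim = 8        # arbitrary choice for illustration

torch.manual_seed(0)
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

# indexes of 'impossible', 'mr', 'bennet', 'impossible'
word_indexes = torch.tensor([3394, 4305, 813, 3394])
vectors = embedding(word_indexes)

print(vectors.shape)
```

Each word now takes 8 numbers instead of 7,260, and the same index always maps to the same vector. Meaningful vectors only emerge once the table is trained as part of a model.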

# 3.4 Images
1. read one dog image

In [208]:
import imageio

img_arr = imageio.imread('data/bobby.jpg')
print(img_arr.shape)

(720, 1280, 3)


At this point, img_arr is a NumPy array-like object with three dimensions: two spatial dimensions (height and width) and a third dimension corresponding to the channels red, green, and blue.


PyTorch modules that deal with image data require tensors to be laid out as C x H x W (channels, height, and width, respectively).

In [209]:
img = torch.from_numpy(img_arr)
# H x W x C -> C x H x W; transpose(0, 2) would give C x W x H instead
out = img.permute(2, 0, 1)
print(out.shape)

torch.Size([3, 720, 1280])
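
Axis order is easy to get wrong here, so it is worth checking on a small synthetic array with distinct height and width. In this sketch, `permute(2, 0, 1)` yields C x H x W, while swapping only axes 0 and 2 yields C x W x H:

```python
import numpy as np
import torch

# synthetic "image": height 4, width 6, 3 channels
fake_img = np.zeros((4, 6, 3), dtype=np.uint8)
t = torch.from_numpy(fake_img)

chw = t.permute(2, 0, 1)     # channels, height, width
cwh = t.transpose(0, 2)      # channels, width, height

print(chw.shape, cwh.shape)
```

With square images the two results have the same shape, so the difference only shows up as a transposed picture.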


2. read multiple images

In [210]:
!ls

2_loading_data.ipynb  data  pictures  raw.githubusercontent.com


In [211]:
import os

In [212]:
# preallocate a batch of 100 images, 256 pixels in height and 256 pixels in width;
# note the 4 channels: that fits RGBA images, while plain RGB images need 3
batch_size = 100
batch = torch.zeros(batch_size, 4, 256, 256, dtype=torch.uint8)

In [218]:
data_dir = '/Users/awang/Documents/Deep_Learning_with_PyTorch/Chapter3/code/data/'
filenames = [name for name in os.listdir(data_dir) if name.endswith('.png')]
for i, filename in enumerate(filenames):
    img_arr = imageio.imread(os.path.join(data_dir, filename))
    # H x W x C -> C x H x W
    batch[i] = torch.from_numpy(img_arr).permute(2, 0, 1)

A typical thing that you’ll want to do is cast a tensor to floating-point and normalize the values of the pixels. Casting to floating-point is easy, but normalization is trickier, as it depends on what range of the input you decide should lie between 0 and 1 (or –1 and 1). 
- One possibility is to divide the values of pixels by 255 (the maximum representable number in 8-bit unsigned)

In [219]:
batch = batch.float()
batch /= 255.0

- Another possibility is to compute mean and standard deviation of the input data and scale it so that the output has zero mean and unit standard deviation across each channel:

In [220]:
n_channels = batch.shape[1] 
for c in range(n_channels):
    mean = torch.mean(batch[:, c])
    std = torch.std(batch[:, c])
    batch[:, c] = (batch[:, c] - mean) / std

- You can perform several other operations on inputs, including geometric transformations such as rotation, scaling, and cropping. These operations may help with training or may be required to make an arbitrary input conform to the input requirements of a network, such as the size of the image.
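
The per-channel normalization shown earlier can also be written without a Python loop by reducing over the batch, height, and width dimensions at once. A sketch on a small random batch (the sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
images = torch.rand(8, 3, 16, 16)    # N x C x H x W, random stand-in data

# one mean and one std per channel, kept as shape (1, C, 1, 1) for broadcasting
mean = images.mean(dim=(0, 2, 3), keepdim=True)
std = images.std(dim=(0, 2, 3), keepdim=True)

normalized = (images - mean) / std

print(mean.shape)
print(normalized.mean(dim=(0, 2, 3)), normalized.std(dim=(0, 2, 3)))
```

Keeping the reduced dimensions lets broadcasting line the statistics up with the channel axis, so the result matches the loop version.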

# 3.5 Volumetric data

For now, it suffices to say that no fundamental difference exists between a tensor that stores volumetric data and one that stores image data. You have an extra dimension, depth, after the channel dimension, leading to a 5D tensor of shape N x C x D x H x W.

![figure](pictures/ctscans.png)

In [221]:
!ls

2_loading_data.ipynb  data  pictures  raw.githubusercontent.com


Load a sample CT scan by using the volread function in the imageio module, which takes a directory as argument and assembles all DICOM (Digital Imaging and Communications in Medicine) files in a series into a NumPy 3D array.

In [223]:
import imageio
dir_path = "/Users/awang/Documents/Deep_Learning_with_PyTorch/Chapter3/code/data/volumetric-dicom/2-LUNG 3.0  B70f-04083"
vol_arr = imageio.volread(dir_path, 'DICOM')
print(vol_arr.shape)

Reading DICOM (examining files): 99/99 files (100.0%)
  Found 1 correct series.
Reading DICOM (loading data): 99/99  (100.0%)
(99, 512, 512)


In [224]:
vol = torch.from_numpy(vol_arr).float()
print(vol.shape)
# the array is already laid out as D x H x W, so no transpose is needed;
# make room for the channel dimension by using unsqueeze
vol = torch.unsqueeze(vol, 0)
print(vol.shape)

torch.Size([99, 512, 512])
torch.Size([1, 99, 512, 512])
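
To reach the 5D N x C x D x H x W layout described at the start of this section, a single-channel volume needs a channel dimension and a batch dimension. A sketch on a small synthetic volume (5 slices of 8 x 8 pixels, sizes chosen arbitrarily):

```python
import torch

volume = torch.zeros(5, 8, 8)            # D x H x W, synthetic stand-in for a scan

with_channel = volume.unsqueeze(0)       # C x D x H x W, one channel
vol_batch = with_channel.unsqueeze(0)    # N x C x D x H x W, batch of one

print(with_channel.shape, vol_batch.shape)
```

Several volumes of the same size could instead be combined with `torch.stack`, which creates the batch dimension while joining them.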


# How to deal with these pictures?