# 2 Preliminaries

## 2.1 Data Manipulation

### 2.1.1 Getting Started

In [74]:
import torch

A tensor represents an array of numerical values. With one axis, a tensor is called a *vector*. With two axes, a tensor is called a *matrix*. With $k > 2$ axes, we drop the specialized names and just refer to the object as a $k^\text{th}$-order tensor.
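As a quick check of this terminology, the number of axes of a tensor is available through its `ndim` attribute. The snippet below is a minimal sketch; the variable names are illustrative only.

```python
import torch

vector = torch.arange(12)        # one axis: a vector
matrix = vector.reshape(3, 4)    # two axes: a matrix
cube = torch.zeros(2, 3, 4)      # three axes: a 3rd-order tensor

# ndim reports the number of axes, i.e. the length of .shape
print(vector.ndim, matrix.ndim, cube.ndim)
```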

In [75]:
# arange(n) starts at 0 (inclusive) and ends at n (exclusive); the default step is 1
x = torch.arange(12, dtype=torch.float32)
x

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.])

In [76]:
# inspect the total number of elements in a tensor
x.numel()

12

In [77]:
# access a tensor's shape
x.shape

torch.Size([12])

In [78]:
# change the shape of a tensor without altering its size or values
X = x.reshape(3, 4)
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

In [79]:
# pass -1 for one dimension to let reshape infer it from the others
X1 = x.reshape(-1, 4)
X1

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

In [80]:
# tensor with all zeros
torch.zeros((2, 3, 4))

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])

In [81]:
# tensor with all ones
torch.ones((2, 3, 4))

tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

In [82]:
# create a tensor with elements drawn from a standard Gaussian (normal) distribution
# with mean 0 and std 1
torch.randn(3, 4)

tensor([[-1.1114, -0.5883, -0.9171, -0.4005],
        [ 0.0134, -1.1274,  0.8206, -0.5827],
        [-1.0057,  1.2253,  0.1807,  1.4177]])

In [83]:
# construct tensors by supplying the exact values for each element
torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

tensor([[2, 1, 4, 3],
        [1, 2, 3, 4],
        [4, 3, 2, 1]])

### 2.1.2 Indexing and Slicing

In [84]:
# index as python lists
X[-1], X[1:3]

(tensor([ 8.,  9., 10., 11.]),
 tensor([[ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]))

In [85]:
# write elements of a matrix by specifying indices
X[1, 2] = 17
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5., 17.,  7.],
        [ 8.,  9., 10., 11.]])

In [86]:
# assign multiple elements the same value
X[:2, :] = 12
X

tensor([[12., 12., 12., 12.],
        [12., 12., 12., 12.],
        [ 8.,  9., 10., 11.]])

### 2.1.3 Operations

In [87]:
torch.exp(x)

tensor([162754.7969, 162754.7969, 162754.7969, 162754.7969, 162754.7969,
        162754.7969, 162754.7969, 162754.7969,   2980.9580,   8103.0840,
         22026.4648,  59874.1406])
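The first eight entries above all equal $e^{12}$ instead of $e^0, e^1, \ldots$. The likely reason is that `X` was created with `x.reshape(3, 4)`, which returns a view onto the same storage for a contiguous tensor, so the earlier in-place assignment `X[:2, :] = 12` also overwrote the first eight elements of `x`. A minimal sketch of this effect, rebuilt from scratch (`shares_storage` and `X_copy` are names introduced here for illustration):

```python
import torch

x = torch.arange(12, dtype=torch.float32)
X = x.reshape(3, 4)          # a view: no data is copied for a contiguous tensor
X[:2, :] = 12                # in-place write through the view

# Both tensors start at the same memory address
shares_storage = x.data_ptr() == X.data_ptr()
print(shares_storage)
print(x[:8])                 # the first eight entries of x are now 12

# clone() first if an independent copy is wanted
X_copy = x.clone().reshape(3, 4)
X_copy[:] = 0
print(x[8:])                 # untouched by the write to X_copy
```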

In [88]:
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y

(tensor([ 3.,  4.,  6., 10.]),
 tensor([-1.,  0.,  2.,  6.]),
 tensor([ 2.,  4.,  8., 16.]),
 tensor([0.5000, 1.0000, 2.0000, 4.0000]),
 tensor([ 1.,  4., 16., 64.]))

We can also concatenate multiple tensors together, stacking them end-to-end to form a larger tensor. We just need to provide a list of tensors and tell the system along which axis to concatenate. The example below shows what happens when we concatenate two matrices along rows (axis 0) vs. columns (axis 1). We can see that the first output's axis-0 length (6) is the sum of the two input tensors' axis-0 lengths $(3+3)$, while the second output's axis-1 length (8) is the sum of the two input tensors' axis-1 lengths $(4+4)$.

In [89]:
X = torch.arange(12, dtype=torch.float32).reshape((3, 4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)

(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [ 2.,  1.,  4.,  3.],
         [ 1.,  2.,  3.,  4.],
         [ 4.,  3.,  2.,  1.]]),
 tensor([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
         [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
         [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]]))

In [90]:
# Construct a boolean tensor via logical statements.
# If X[i, j] == Y[i, j], the corresponding entry is True; otherwise it is False.
X == Y

tensor([[False,  True, False,  True],
        [False, False, False, False],
        [False, False, False, False]])

In [91]:
# Summing all elements in the tensor yields a tensor with only one element
X.sum()

tensor(66.)

### 2.1.4 Broadcasting

Under certain conditions, even when shapes differ, we can still perform elementwise binary operations by invoking the ***broadcasting*** mechanism.

Broadcasting works according to the following two-step procedure:
1. expand one or both arrays by copying elements along axes with length 1 so that after this transformation, the two tensors have the same shape
2. perform an elementwise operation on the resulting arrays.

In [92]:
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b

(tensor([[0],
         [1],
         [2]]),
 tensor([[0, 1]]))

Since a and b are $3 \times 1$ and $1 \times 2$ matrices, respectively, their shapes do not match up.

Broadcasting produces a larger $3 \times 2$ matrix by replicating matrix a along the columns and matrix b along the rows before adding them elementwise.

In [93]:
a + b

tensor([[0, 1],
        [1, 2],
        [2, 3]])
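The two-step procedure can also be spelled out by hand. The sketch below uses `Tensor.expand` to perform step 1 explicitly and then adds the expanded tensors; `a_big`, `b_big`, and `manual` are names introduced here for illustration. Note that `expand` returns a view without copying data, which is an implementation detail rather than part of the broadcasting rule itself.

```python
import torch

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))

# Step 1: expand each length-1 axis to the common shape (3, 2)
a_big = a.expand(3, 2)   # repeats the single column of a
b_big = b.expand(3, 2)   # repeats the single row of b

# Step 2: ordinary elementwise addition on equal shapes
manual = a_big + b_big

# The result matches what broadcasting computes implicitly
print(torch.equal(manual, a + b))
```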

### 2.1.5 Saving Memory

Running operations can cause new memory to be allocated to host results. For example, if we write `Y = X + Y`, we dereference the tensor that `Y` used to point to and instead point `Y` at the newly allocated memory. We can demonstrate this with Python's `id()` function, which gives us the exact address of the referenced object in memory. Note that after we run `Y = Y + X`, `id(Y)` points to a different location. That's because Python first evaluates `Y + X`, allocating new memory for the result, and then points `Y` to this new location in memory.

In [94]:
before = id(Y)
Y = Y + X
id(Y) == before

False

This might be undesirable for two reasons.
1. We do not want to run around allocating memory unnecessarily all the time. We often have hundreds of megabytes of parameters and update all of them multiple times per second. We want to perform these updates in place.
2. We might point at the same parameters from multiple variables. If we do not update in place, we must be careful to update all of these references, lest we spring a memory leak or inadvertently refer to stale parameters.

Performing in-place operations is easy. We can assign the result of an operation to a previously allocated array `Y` by using slice notation: `Y[:] = <expression>`.

In [95]:
# overwrite the values of tensor Z in place, after initializing it with zeros_like to have the same shape as Y
Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))

id(Z): 5086955216
id(Z): 5086955216


If the value of `X` is not reused in subsequent computations, we can also use `X[:] = X + Y` or `X += Y` to reduce the memory overhead of the operation.

In [96]:
before = id(X)
X += Y
id(X) == before

True
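The second reason given earlier (stale references) can be illustrated with two names bound to the same tensor. This is a minimal sketch; `params` and `alias` are hypothetical names, not part of the notes above.

```python
import torch

params = torch.ones(3)
alias = params               # a second name for the same tensor object

params += 1                  # in place: the update is visible through alias
still_same = alias is params
print(still_same, alias)

params = params + 1          # out of place: params now names a new tensor
now_stale = alias is not params
print(now_stale, alias, params)   # alias still holds the old values
```

After the out-of-place update, `alias` keeps referring to the old values, which is exactly the situation that in-place updates avoid.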

### 2.1.6 Conversion to Other Python Objects

Converting to a NumPy tensor (`ndarray`), or vice versa, is easy. The torch tensor and the NumPy array share their underlying memory, so changing one through an in-place operation also changes the other.

In [97]:
A = X.numpy()
B = torch.from_numpy(A)
type(A), type(B)

(numpy.ndarray, torch.Tensor)
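The shared-memory behaviour can be checked directly. The sketch below assumes a CPU tensor (the case in which `numpy()` shares memory) and uses fresh variables so that it runs on its own; `C` is an extra name introduced to contrast with `torch.tensor`, which copies its input.

```python
import torch

X = torch.zeros(3)
A = X.numpy()              # NumPy array backed by the tensor's memory
B = torch.from_numpy(A)    # tensor backed by the same memory again

X += 1                     # in-place update on the tensor
print(A)                   # the array reflects the change

A[0] = 5                   # in-place update on the array
print(X, B)                # both tensors reflect the change

C = torch.tensor(A)        # torch.tensor() copies, so C is independent
A[1] = -1
print(C)                   # C keeps the values it was created with
```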

In [98]:
# convert a size-1 tensor to a python scalar, invoke item function
a = torch.tensor([3.5])
a, a.item(), float(a), int(a)

(tensor([3.5000]), 3.5, 3.5, 3)

### 2.1.7 Summary

The tensor class is the main interface for storing and manipulating data in deep learning libraries. Tensors provide a variety of functions including construction routines; indexing and slicing; basic mathematics operations; broadcasting; memory-efficient assignment; and conversion to and from other Python objects.

### 2.1.8 Exercises

1. Change the conditional statement `X == Y` to `X < Y` or `X > Y`, and then see what kind of tensor you can get.

In [99]:
X = torch.arange(12, dtype=torch.float32).reshape((3, 4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
X < Y, X > Y

(tensor([[ True, False,  True, False],
         [False, False, False, False],
         [False, False, False, False]]),
 tensor([[False, False, False, False],
         [ True,  True,  True,  True],
         [ True,  True,  True,  True]]))

2. Replace the two tensors that operate by element in the broadcasting mechanism with other shapes, e.g., 3-dimensional tensors. Is the result the same as expected?

In [100]:
a = torch.randn(2, 3, 1)
b = torch.randn(1, 1, 4)
a, b
a + b

tensor([[[ 0.3514, -1.1736, -2.1658, -1.3078],
         [ 0.7536, -0.7714, -1.7636, -0.9056],
         [ 2.4033,  0.8783, -0.1139,  0.7441]],

        [[ 0.8936, -0.6315, -1.6236, -0.7657],
         [ 1.7422,  0.2172, -0.7750,  0.0830],
         [ 0.2158, -1.3092, -2.3014, -1.4434]]])

## 2.2 Data Preprocessing
### 2.2.1 Reading the Dataset

To demonstrate how to load a CSV file with pandas, we first create one below.

In [101]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [102]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


### 2.2.2 Data Preparation

In [103]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN               0             1
1       2.0               0             1
2       4.0               1             0
3       NaN               0             1


In [104]:
# replace the NaN entries with the mean value of the corresponding column
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1


### 2.2.3 Conversion to the Tensor Format

Now that all the entries in `inputs` and `targets` are numerical, we can load them into a tensor.

In [105]:
import torch

X, y = torch.tensor(inputs.values), torch.tensor(targets.values)
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
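One caveat, stated as an assumption about the installed pandas version: in recent releases `pd.get_dummies` returns boolean indicator columns rather than 0/1 integers, and a frame that mixes float and boolean columns may not convert cleanly through `.values`. Requesting a float array explicitly with `to_numpy(dtype=float)` should work in either case. The sketch below rebuilds a stand-in for `inputs` in memory instead of reading the CSV file.

```python
import pandas as pd
import torch

# Stand-in for the `inputs` frame above (same values, built in memory)
inputs = pd.DataFrame({
    "NumRooms": [float("nan"), 2.0, 4.0, float("nan")],
    "RoofType": [None, None, "Slate", None],
})

inputs = pd.get_dummies(inputs, dummy_na=True)   # one-hot encode RoofType
inputs = inputs.fillna(inputs.mean())            # impute NumRooms with its mean

# Ask for a float array explicitly so the column dtypes do not matter
X = torch.tensor(inputs.to_numpy(dtype=float))
print(X)
```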

### 2.2.4 Exercises

1. Try loading datasets, e.g., Abalone from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php) and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?

In [106]:
abalone_data = pd.read_csv("../ML example/abalone.data", 
                            names=["sex", "length", "diameter", "height", "whole_weight",
                                    "shucked_weight", "viscera_weight", "shell_weight", 
                                    "rings"
                            ])
abalone_data.describe(include="all")

         sex    length  diameter    height  whole_weight  shucked_weight  viscera_weight  shell_weight     rings
count   4177    4177.0    4177.0    4177.0        4177.0          4177.0          4177.0        4177.0    4177.0
unique     3       NaN       NaN       NaN           NaN             NaN             NaN           NaN       NaN
top        M       NaN       NaN       NaN           NaN             NaN             NaN           NaN       NaN
freq    1528       NaN       NaN       NaN           NaN             NaN             NaN           NaN       NaN
mean     NaN  0.523992  0.407881  0.139516      0.828742        0.359367        0.180594      0.238831  9.933684
std      NaN  0.120093   0.09924  0.041827      0.490389        0.221963        0.109614      0.139203  3.224169
min      NaN     0.075     0.055       0.0         0.002           0.001          0.0005        0.0015       1.0
25%      NaN      0.45      0.35     0.115        0.4415           0.186          0.0935          0.13       8.0
50%      NaN     0.545     0.425      0.14        0.7995           0.336           0.171         0.234       9.0
75%      NaN     0.615      0.48     0.165         1.153           0.502           0.253         0.329      11.0


2. Try out indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.

In [107]:
abalone_data[["sex", "rings", "length"]]

      sex  rings  length
0       M     15   0.455
1       M      7   0.350
2       F      9   0.530
3       M     10   0.440
4       I      7   0.330
...   ...    ...     ...
4172    F     11   0.565
4173    M     10   0.590
4174    M      9   0.600
4175    F     10   0.625


## 2.3 Linear Algebra

### 2.3.1 Scalars

We denote scalars by ordinary lower-cased letters (e.g., x, y, and z) and the space of all continuous *real-valued* scalars by $\mathbb{R}$.

Scalars are implemented as tensors that contain only one element.

In [108]:
import torch

x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x ** y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

### 2.3.2 Vectors

We denote vectors by bold lowercase letters (e.g., **x**, **y**, and **z**).

Vectors are implemented as $1^\text{st}$-order tensors. In general, such tensors can have arbitrary lengths, subject to memory limitations.

In [109]:
x = torch.arange(3)
x

tensor([0, 1, 2])

In [110]:
# 0-based indexing
x[2]

tensor(2)

To indicate that a vector contains $n$ elements, we write $x\in \mathbb{R}^n$. Formally, we call $n$ the dimensionality of the vector. In code, this corresponds to the tensor's length, accessible via Python's built-in `len()` function.

In [111]:
len(x)

3

In [112]:
# also can access shape attribute
x.shape

torch.Size([3])

### 2.3.3 Matrices

We denote matrices by bold capital letters (e.g., **X**, **Y**, and **Z**), and represent them in code by tensors with two axes.

In [113]:
A = torch.arange(6).reshape(3,2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

In [114]:
# flip the axes, A's transpose
A.T

tensor([[0, 2, 4],
        [1, 3, 5]])

In [115]:
# a symmetric matrix
A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

### 2.3.4 Tensors

In [116]:
# higher-order tensors
torch.arange(24).reshape(2, 3, 4)

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

### 2.3.5 Basic Properties of Tensor Arithmetic

Elementwise operations produce outputs that have the same shape as their operands.

In [117]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()
A, A + B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

The elementwise product of two matrices is called their ***Hadamard product (denoted $\odot$)***.

In [118]:
A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])

Adding or multiplying a scalar and a tensor produces a result with the same shape as the original tensor. Here, each element of the tensor is added to (or multiplied by) the scalar.

In [119]:
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],
 
         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))

### 2.3.6 Reduction

Sum of the elements in a vector $x$ of length $n$: $\sum_{i=1}^n x_i$.

In [120]:
x = torch.arange(3, dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2.]), tensor(3.))

To express sums over the elements of tensors of arbitrary shape, we simply sum over all of its axes. The sum of the elements of an $m\times n$ matrix $A$ could be written $\sum_{i=1}^m\sum_{j=1}^n a_{ij}$.

In [121]:
A.shape, A.sum()

(torch.Size([2, 3]), tensor(15.))

Specifying axis = 1 will reduce the column dimension (axis 1) by summing up elements of all the columns.

In [122]:
A, A.shape, A.sum(axis=1), A.sum(axis=1).shape

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 torch.Size([2, 3]),
 tensor([ 3., 12.]),
 torch.Size([2]))
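For comparison, summing along axis 0 collapses the row axis instead and leaves one value per column. A minimal sketch that rebuilds the same `A`; `col_sums` and `row_sums` are names introduced here.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

col_sums = A.sum(axis=0)   # axis 0 (rows) disappears: one sum per column
row_sums = A.sum(axis=1)   # axis 1 (columns) disappears: one sum per row

print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
```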

In [123]:
# reduce along both rows and columns via sum is equivalent to sum all
A.sum(axis=[0, 1]) == A.sum()

tensor(True)

In [124]:
# find the mean
A.mean(), A.sum() / A.numel()

(tensor(2.5000), tensor(2.5000))

In [125]:
# The function for calculating the mean can also reduce a tensor along axes
A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))

### 2.3.7 Non-Reduction Sum

Sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.

In [126]:
sum_A = A.sum(axis=1, keepdim=True)
sum_A, sum_A.shape

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

For instance, since sum_A keeps its two axes after summing each row, we can divide A by sum_A with broadcasting to create a matrix where each row sums up to 1.

In [127]:
A / sum_A

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])
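To see why `keepdim=True` matters here, the sketch below compares the two variants. Without it, the row sums have shape `(2,)`, which cannot be broadcast against `(2, 3)` because the trailing axes (3 and 2) do not match. The names `normalized` and `failed` are introduced for illustration.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# Shapes (2, 3) / (2, 1): the length-1 axis is broadcast across columns
normalized = A / A.sum(axis=1, keepdim=True)
print(normalized.sum(axis=1))      # each row sums to 1 (up to rounding)

# Shapes (2, 3) / (2,): trailing axes 3 and 2 are incompatible
try:
    A / A.sum(axis=1)
    failed = False
except RuntimeError:
    failed = True
print(failed)
```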

If we want to calculate the cumulative sum of elements of A along some axis, say axis=0 (row by row), we can call the cumsum function. By design, this function does not reduce the input tensor along any axis.

In [128]:
A.cumsum(axis=0)

tensor([[0., 1., 2.],
        [3., 5., 7.]])

### 2.3.8 Dot Products

Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, their dot product $\mathbf{x}^\top\mathbf{y}$ (or $\langle \mathbf{x}, \mathbf{y} \rangle$) is a sum over the products of the elements at the same position: $\mathbf{x}^\top\mathbf{y} = \sum_{i=1}^d x_i y_i$.

In [129]:
y = torch.ones(3, dtype = torch.float32)
x, y, torch.dot(x, y)

(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))

In [130]:
# Equivalently, calculate the dot product of two vectors
torch.sum(x * y)

tensor(3.)

### 2.3.9 Matrix-Vector Products

In [131]:
# matrix-vector product: torch.mv(A, x) and A @ x agree (@ also handles matrix-matrix products)
A.shape, x.shape, torch.mv(A, x), A@x

(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))
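One way to read this result is that each entry of the product is the dot product of one row of `A` with `x`. The sketch below checks this reading on the same values; `by_rows` is a name introduced here, and the explicit loop is for illustration only, not a recommended way to compute the product.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
x = torch.arange(3, dtype=torch.float32)

# Entry i of A @ x is the dot product of row i of A with x
by_rows = torch.stack([torch.dot(row, x) for row in A])

print(by_rows)
print(torch.equal(by_rows, A @ x))
```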

### 2.3.10 Matrix-Matrix Multiplication

In [132]:
B = torch.ones(3, 4)
torch.mm(A, B), A@B

(tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]),
 tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]))

### 2.3.11 Norms

1. $\ell_2$ norm: Euclidean norm

$$\|x\|_2=\sqrt{\sum_{i=1}^nx_i^2}$$

In [133]:
u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)

2. $\ell_1$ norm: Manhattan distance

$$\|x\|_1=\sum_{i=1}^n|x_i|$$

In [134]:
torch.abs(u).sum()

tensor(7.)

3. Frobenius norm, defined as the square root of the sum of the squares of a matrix's elements:

$$\|\mathbf{X}\|_F=\sqrt{\sum_{i=1}^m\sum_{j=1}^nx_{ij}^2}$$

In [135]:
torch.norm(torch.ones((4, 9)))

tensor(6.)
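The three norms can also be computed with the `torch.linalg` functions and checked against the formulas above. This is a sketch assuming a PyTorch version that ships `torch.linalg.vector_norm` and `torch.linalg.matrix_norm`; the variable names are illustrative.

```python
import torch

u = torch.tensor([3.0, -4.0])
M = torch.ones((4, 9))

l2 = torch.linalg.vector_norm(u)            # default ord=2
l1 = torch.linalg.vector_norm(u, ord=1)
fro = torch.linalg.matrix_norm(M)           # default ord='fro'

# The same quantities written out from the definitions
l2_manual = torch.sqrt((u ** 2).sum())
l1_manual = u.abs().sum()
fro_manual = torch.sqrt((M ** 2).sum())

print(l2, l1, fro)
```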

## 2.5 Automatic Differentiation

In [136]:
import torch

x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

In [137]:
# Alternatively, create x directly with torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad # The default value is None

In [138]:
# calculate our function of x and assign the result to y
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

In [139]:
# take the gradient of y with respect to x by calling its
# backward method

y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

In [140]:
x.grad == 4 * x

tensor([True, True, True, True])
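This check agrees with a short derivation by hand. Writing the function out componentwise for a vector of length $n$:

$$y = 2\,\mathbf{x}^\top\mathbf{x} = 2\sum_{i=1}^n x_i^2, \qquad \frac{\partial y}{\partial x_j} = 4x_j, \qquad \nabla_{\mathbf{x}}\, y = 4\mathbf{x}.$$

For $\mathbf{x} = (0, 1, 2, 3)$ this gives $y = 2(0 + 1 + 4 + 9) = 28$ and $\nabla_{\mathbf{x}}\, y = (0, 4, 8, 12)$, matching the outputs above.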

In [141]:
x.grad.zero_() # Reset the gradient
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])

In [142]:
# backward for non-scalar variables
x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y))) # Faster: y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])
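The `gradient` argument need not be all ones. For a non-scalar `y`, `backward(gradient=v)` accumulates the vector-Jacobian product $\mathbf{v}^\top \mathbf{J}$ into `x.grad`, which equals the gradient of the scalar `(v * y).sum()`. A minimal sketch with a hypothetical weight vector `v`:

```python
import torch

v = torch.tensor([1.0, 0.5, 0.25, 0.0])    # hypothetical weights

x = torch.arange(4.0, requires_grad=True)
y = x * x                                  # elementwise, so dy_i/dx_i = 2 * x_i
y.backward(gradient=v)                     # accumulates v^T J into x.grad
print(x.grad)                              # equals v * 2 * x

# Same result via an explicit scalar: the weighted sum of y
x2 = torch.arange(4.0, requires_grad=True)
(v * x2 * x2).sum().backward()
print(torch.equal(x.grad, x2.grad))
```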

In [143]:
# Detaching Computation
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])

In [144]:
x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

tensor([True, True, True, True])

In [145]:
# Gradients and Python Control Flow
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c
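A possible way to exercise this function is sketched below. For any fixed nonzero input, `f` only multiplies `a` by constants, so the output is `k * a` for some `k` that depends on the path taken, and the gradient should equal `f(a) / a`. The input value `3.0` is an arbitrary choice made here so that the result is reproducible; an input of exactly zero would never leave the `while` loop.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:   # number of iterations depends on the input
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

a = torch.tensor(3.0, requires_grad=True)   # arbitrary nonzero scalar input
d = f(a)
d.backward()

# f is linear in a along the executed path, so the gradient is d / a
print(a.grad, d / a)
print(a.grad == d / a)
```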