<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/DiveIntoDeepLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dive Into Deep Learning

> Self notes on the book: http://www.d2l.ai/
---

If you are able to devise a solution to the problem that will work 100% of the time, __do not__ use ML!

The supervision comes into play because for choosing the parameters, we (the supervisors) provide the model with a dataset consisting of labeled examples, where each example is matched with the ground-truth label. In probabilistic terms, we typically are interested in estimating the conditional probability of a label given input features. While it is just one among several paradigms within machine learning, supervised learning accounts for the majority of successful applications of machine learning in industry. Partly, that is because many important tasks can be described crisply as estimating the probability of something unknown given a particular set of available data.
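To make "the conditional probability of a label given input features" concrete, here is a minimal sketch that estimates it by counting. The weather data is made up purely for illustration, and the helper name `conditional_probability` is an assumption, not something from the book.

```python
from collections import Counter

# Toy, made-up labeled examples: (feature, label) pairs.
examples = [
    ("cloudy", "rain"), ("cloudy", "rain"), ("cloudy", "dry"),
    ("sunny", "dry"), ("sunny", "dry"), ("sunny", "rain"), ("sunny", "dry"),
]

pair_counts = Counter(examples)
feature_counts = Counter(feature for feature, _ in examples)


def conditional_probability(label, feature):
    """Estimate P(label | feature) as a ratio of counts."""
    return pair_counts[(feature, label)] / feature_counts[feature]


print(conditional_probability("rain", "cloudy"))  # 2 of the 3 cloudy examples
print(conditional_probability("rain", "sunny"))   # 1 of the 4 sunny examples
```

A real model replaces the counting table with a parameterized function, but the quantity being estimated is the same.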

Lots of practical problems are well-described regression problems. Predicting the rating that a user will assign to a movie can be thought of as a regression problem, and if you had designed a great algorithm to accomplish this feat in 2009, you might have won the 1-million-dollar Netflix prize. Predicting the length of stay for patients in the hospital is also a regression problem. A good rule of thumb is that any "how much?" or "how many?" problem should suggest regression, such as:

- How many hours will this surgery take?

- How much rainfall will this town have in the next six hours?

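NOTE: problems that predict a whole number (a count, such as a number of hours) are still regression problems, not classification: the target is an ordered quantity, not a category.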

## Types of Supervised ML

### 1. Regression + 2. Classification - we know

### 2.1 Tagging

Some classification problems fit neatly into the binary or multiclass classification setups. For example, we could train a normal binary classifier to distinguish cats from dogs. Given the current state of computer vision, we can do this easily, with off-the-shelf tools. Nonetheless, no matter how accurate our model gets, we might find ourselves in trouble when the classifier encounters an image of the Town Musicians of Bremen, a popular German fairy tale featuring four animals in Fig. 1.3.3.

![](http://www.d2l.ai/_images/stackedanimals.png)

As you can see, there is a cat in Fig. 1.3.3, and a rooster, a dog, and a donkey, with some trees in the background. Depending on what we want to do with our model ultimately, treating this as a binary classification problem might not make a lot of sense. Instead, we might want to give the model the option of saying the image depicts a cat, a dog, a donkey, and a rooster.

The problem of learning to predict classes that are not mutually exclusive is called multi-label classification. Auto-tagging problems are typically best described as multi-label classification problems. Think of the tags people might apply to posts on a technical blog, e.g., “machine learning”, “technology”, “gadgets”, “programming languages”, “Linux”, “cloud computing”, “AWS”. A typical article might have 5–10 tags applied because these concepts are correlated. Posts about “cloud computing” are likely to mention “AWS” and posts about “machine learning” could also deal with “programming languages”.

We also have to deal with this kind of problem when dealing with the biomedical literature, where correctly tagging articles is important because it allows researchers to do exhaustive reviews of the literature. At the National Library of Medicine, a number of professional annotators go over each article that gets indexed in PubMed to associate it with the relevant terms from MeSH, a collection of roughly 28000 tags. This is a time-consuming process and the annotators typically have a one-year lag between archiving and tagging. Machine learning can be used here to provide provisional tags until each article can have a proper manual review. Indeed, for several years, the BioASQ organization has hosted competitions to do precisely this.
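A minimal sketch of how multi-label prediction differs from multiclass prediction: each tag gets its own independent probability and every tag above a threshold is kept, so several tags can fire at once. The logits below are hypothetical, and the sigmoid-plus-threshold rule is one common choice rather than the only one.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_tags(scores, threshold=0.5):
    """Return every tag whose independent probability reaches the threshold."""
    return [tag for tag, z in scores.items() if sigmoid(z) >= threshold]


# Hypothetical raw scores (logits) a model might emit for the Bremen image.
scores = {"cat": 2.1, "dog": 1.3, "donkey": 0.4, "rooster": 0.9, "horse": -3.0}
tags = predict_tags(scores)
print(tags)  # ['cat', 'dog', 'donkey', 'rooster']
```

A softmax over the same scores would force the probabilities to compete and sum to 1, which is exactly what we do not want when the classes are not mutually exclusive.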

### 3. Search

Sometimes we do not just want to assign each example to a bucket or to a real value. In the field of information retrieval, we want to impose a ranking on a set of items. Take web search as an example. The goal is less to determine whether a particular page is relevant for a query than to determine which of the plethora of search results is most relevant for a particular user. We really care about the ordering of the relevant search results, and our learning algorithm needs to produce ordered subsets of elements from a larger set. In other words, if we are asked to produce the first 5 letters from the alphabet, there is a difference between returning “A B C D E” and “C A B E D”. Even if the result set is the same, the ordering within the set matters.


One possible solution to this problem is to first assign to every element in the set a corresponding relevance score and then to retrieve the top-rated elements. PageRank, the original secret sauce behind the Google search engine, was an early example of such a scoring system, but it was peculiar in that it did not depend on the actual query. Google relied on a simple relevance filter to identify the set of relevant items and then on PageRank to order those results that contained the query term. Nowadays, search engines use machine learning and behavioral models to obtain query-dependent relevance scores. There are entire academic conferences devoted to this subject.
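The "score every element, then retrieve the top-rated ones" recipe can be sketched in a few lines. The page names and scores are hypothetical, and `rank` is an illustrative helper, not a real search API.

```python
def rank(items, relevance, k=3):
    """Return the k items with the highest relevance score, best first."""
    return sorted(items, key=relevance, reverse=True)[:k]


# Hypothetical query-dependent relevance scores for five pages.
scores = {"page_a": 0.2, "page_b": 0.9, "page_c": 0.5, "page_d": 0.7, "page_e": 0.1}

top = rank(scores.keys(), relevance=scores.get)
print(top)  # ['page_b', 'page_d', 'page_c']
```

Note that the output is an ordered list, not a set: swapping two entries would be a different (and worse) answer.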

### 4. Recommender Systems

In the simplest formulations, these systems are trained to estimate some score, such as an estimated rating or the probability of purchase, given a user and an item.

Given such a model, for any given user, we could retrieve the set of objects with the largest scores, which could then be recommended to the user. Production systems are considerably more advanced and take detailed user activity and item characteristics into account when computing such scores.

Despite their tremendous economic value, recommendation systems naively built on top of predictive models suffer some serious conceptual flaws. To start, we only observe censored feedback: users preferentially rate movies that they feel strongly about. For example, on a five-point scale, you might notice that items receive many five and one star ratings but that there are conspicuously few three-star ratings. Moreover, current purchase habits are often a result of the recommendation algorithm currently in place, but learning algorithms do not always take this detail into account. Thus it is possible for feedback loops to form where a recommender system preferentially pushes an item that is then taken to be better (due to greater purchases) and in turn is recommended even more frequently. Many of these problems about how to deal with censoring, incentives, and feedback loops, are important open research questions.
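The feedback loop described above can be illustrated with a deliberately crude simulation. Everything here is an assumption made for illustration: two items of equal quality, a recommender that always pushes the current best seller, and a fixed `boost` of extra purchases for whatever is recommended.

```python
def simulate_feedback_loop(purchases, rounds, boost=10):
    """Each round, recommend the current best seller and credit it with `boost` purchases."""
    purchases = dict(purchases)  # work on a copy, leave the input untouched
    for _ in range(rounds):
        recommended = max(purchases, key=purchases.get)
        purchases[recommended] += boost
    return purchases


# Nearly identical starting popularity; the items themselves are equally good.
start = {"item_a": 11, "item_b": 10}
end = simulate_feedback_loop(start, rounds=5)
print(end)  # {'item_a': 61, 'item_b': 10}
```

A one-purchase head start turns into a sixfold gap, even though nothing about the items differs. A model trained on the final counts would conclude that `item_a` is far better.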

### Sequence learning

So far, we have looked at problems where we have some fixed number of inputs and produce a fixed number of outputs. For example, we considered predicting house prices from a fixed set of features: square footage, number of bedrooms, number of bathrooms, walking time to downtown. We also discussed mapping from an image (of fixed dimension) to the predicted probabilities that it belongs to each of a fixed number of classes, or taking a user ID and a product ID, and predicting a star rating. In these cases, once we feed our fixed-length input into the model to generate an output, the model immediately forgets what it just saw.

This might be fine if our inputs truly all have the same dimensions and if successive inputs truly have nothing to do with each other. But how would we deal with video snippets? In this case, each snippet might consist of a different number of frames. And our guess of what is going on in each frame might be much stronger if we take into account the previous or succeeding frames. Same goes for language. One popular deep learning problem is machine translation: the task of ingesting sentences in some source language and predicting their translation in another language.

These problems also occur in medicine. We might want a model to monitor patients in the intensive care unit and to fire off alerts if their risk of death in the next 24 hours exceeds some threshold. We definitely would not want this model to throw away everything it knows about the patient history each hour and just make its predictions based on the most recent measurements.

These problems are among the most exciting applications of machine learning and they are instances of sequence learning. They require a model to either ingest sequences of inputs or to emit sequences of outputs (or both). Specifically, sequence to sequence learning considers problems where input and output are both variable-length sequences, such as machine translation and transcribing text from speech. While it is impossible to consider all types of sequence transformations, the following special cases are worth mentioning.

__Tagging and Parsing__. This involves annotating a text sequence with attributes. In other words, the number of inputs and outputs is essentially the same. For instance, we might want to know where the verbs and subjects are. Alternatively, we might want to know which words are the named entities.

__Automatic Speech Recognition__. With speech recognition, the input sequence is an audio recording of a speaker (shown in Fig. 1.3.5), and the output is the textual transcript of what the speaker said. The challenge is that there are many more audio frames (sound is typically sampled at 8kHz or 16kHz) than text, i.e., there is no 1:1 correspondence between audio and text, since thousands of samples may correspond to a single spoken word. These are sequence to sequence learning problems where the output is much shorter than the input.

__Text to Speech__. This is the inverse of automatic speech recognition. In other words, the input is text and the output is an audio file. In this case, the output is much longer than the input. While it is easy for humans to recognize a bad audio file, this is not quite so trivial for computers.

__Machine Translation__. Unlike the case of speech recognition, where corresponding inputs and outputs occur in the same order (after alignment), in machine translation, order inversion can be vital. In other words, while we are still converting one sequence into another, neither the number of inputs and outputs nor the order of corresponding data examples are assumed to be the same. Consider the following illustrative example of the peculiar tendency of Germans to place the verbs at the end of sentences.

```
German:           Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?
English:          Did you already check out this excellent tutorial?
Wrong alignment:  Did you yourself already this excellent tutorial looked-at?
```



## Unsupervised Learning

At the opposite extreme from supervised learning, where every example comes with a target, it could be frustrating to work for a boss who has no idea what they want you to do. However, if you plan to be a data scientist, you had better get used to it. The boss might just hand you a giant dump of data and tell you to do some data science with it! This sounds vague because it is. We call this class of problems unsupervised learning, and the type and number of questions we could ask is limited only by our creativity. We will address unsupervised learning techniques in later chapters. To whet your appetite for now, we describe a few of the questions you might ask.

- Can we find a small number of prototypes that accurately summarize the data? Given a set of photos, can we group them into landscape photos, pictures of dogs, babies, cats, and mountain peaks? Likewise, given a collection of users’ browsing activities, can we group them into users with similar behavior? This problem is typically known as clustering.

- Can we find a small number of parameters that accurately capture the relevant properties of the data? The trajectories of a ball are quite well described by velocity, diameter, and mass of the ball. Tailors have developed a small number of parameters that describe human body shape fairly accurately for the purpose of fitting clothes. These problems are referred to as subspace estimation. If the dependence is linear, it is called principal component analysis.

- Is there a representation of (arbitrarily structured) objects in Euclidean space such that symbolic properties can be well matched? This can be used to describe entities and their relations, such as “Rome”  −  “Italy”  +  “France”  =  “Paris”.

- Is there a description of the root causes of much of the data that we observe? For instance, if we have demographic data about house prices, pollution, crime, location, education, and salaries, can we discover how they are related simply based on empirical data? The fields concerned with causality and probabilistic graphical models address this problem.

- Another important and exciting recent development in unsupervised learning is the advent of generative adversarial networks. These give us a procedural way to synthesize data, even complicated structured data like images and audio. The underlying statistical mechanisms are tests to check whether real and fake data are the same.
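As a taste of the first question in the list (clustering), here is a minimal sketch of k-means on one-dimensional points. The data, the initial centers, and the function name `kmeans_1d` are all assumptions chosen for illustration; real clustering code would handle more dimensions and smarter initialization.

```python
def kmeans_1d(points, centers, steps=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    for _ in range(steps):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)   # two prototypes, close to 1 and 9
print(clusters)
```

No labels were provided: the two "prototypes" emerge from the data alone, which is what makes this unsupervised.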

## Basic Data Manipulation

To start, we can use arange to create a row vector x containing the first 12 integers starting with 0. In PyTorch these are created as integers (dtype torch.int64) by default; pass a floating-point dtype such as torch.float32 to get floats. Each of the values in a tensor is called an element of the tensor. For instance, there are 12 elements in the tensor x. Unless otherwise specified, a new tensor will be stored in main memory and designated for CPU-based computation.



In [91]:
import torch
import tensorflow as tf

print (tf.__version__)
print (torch.__version__)

2.4.1
1.8.0+cu101


In [92]:
x = torch.arange(start=0, end=12, step=1)

print (x)

print (x.shape)

print (x.numel())

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
torch.Size([12])
12


In [93]:
X = x.reshape(3, 4)
print (X)

## automatically infer the first dimension
X = x.reshape(-1, 4)
print (X) 

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])


Typically, we will want our matrices initialized either with zeros, ones, some other constants, or numbers randomly sampled from a specific distribution. We can create a tensor with all elements set to 0 and a shape of (2, 3, 4) as follows:



In [94]:
print (torch.zeros((2,3,4)))

## Similarly, we can create tensors with each element set to 1 as follows:

print (torch.ones((2,3,4)))

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])


In [95]:
print (torch.randn((2,3,4)))

tensor([[[ 1.3257,  0.3933, -0.9286,  0.0174],
         [-1.2369, -2.2574, -1.5908, -0.6852],
         [ 0.6165,  0.3781,  0.5947,  2.0286]],

        [[ 0.7829,  0.5163,  0.0536,  2.0441],
         [-0.6979,  0.5384, -0.1187, -0.3572],
         [ 2.2903,  0.0625,  0.3381, -2.2635]]])


We can also specify the exact values for each element in the desired tensor by supplying a Python list (or list of lists) containing the numerical values. Here, the outermost list corresponds to axis 0, and the inner list to axis 1.



In [96]:
torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

tensor([[2, 1, 4, 3],
        [1, 2, 3, 4],
        [4, 3, 2, 1]])

In [97]:
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y  # The ** operator is exponentiation

(tensor([ 3.,  4.,  6., 10.]),
 tensor([-1.,  0.,  2.,  6.]),
 tensor([ 2.,  4.,  8., 16.]),
 tensor([0.5000, 1.0000, 2.0000, 4.0000]),
 tensor([ 1.,  4., 16., 64.]))

In [98]:
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

print (X)

print (Y)

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
tensor([[2., 1., 4., 3.],
        [1., 2., 3., 4.],
        [4., 3., 2., 1.]])


In [99]:
torch.cat(tensors=[X, Y], dim=0)

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [ 2.,  1.,  4.,  3.],
        [ 1.,  2.,  3.,  4.],
        [ 4.,  3.,  2.,  1.]])

In [100]:
torch.cat((X, Y), dim=1)

tensor([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
        [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
        [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]])

### Broadcasting Mechanism

In the above section, we saw how to perform elementwise operations on two tensors of the same shape. Under certain conditions, even when shapes differ, we can still perform elementwise operations by invoking the broadcasting mechanism. This mechanism works in the following way: First, expand one or both arrays by copying elements appropriately so that after this transformation, the two tensors have the same shape. Second, carry out the elementwise operations on the resulting arrays.

In most cases, we broadcast along an axis where an array initially only has length 1, such as in the following example:



In [101]:
a = torch.arange(3).reshape((3,1))
print (a)
b = torch.arange(2).reshape((1, 2))
print (b)

tensor([[0],
        [1],
        [2]])
tensor([[0, 1]])


In [102]:
# Broadcasting expands both operands to shape (3, 2) before adding:
# a -> [[0, 0],
#       [1, 1],
#       [2, 2]]
# b -> [[0, 1],
#       [0, 1],
#       [0, 1]]
a + b

tensor([[0, 1],
        [1, 2],
        [2, 3]])
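The two-step description of broadcasting (expand, then operate elementwise) can be checked by doing the expansion by hand. This is a small sketch using `Tensor.expand`; the names `a_big`, `b_big`, and `manual` are illustrative.

```python
import torch

a = torch.arange(3).reshape(3, 1)
b = torch.arange(2).reshape(1, 2)

# Step 1: bring both operands to the common shape (3, 2).
# expand repeats data along size-1 axes (as a view, without copying memory).
a_big = a.expand(3, 2)
b_big = b.expand(3, 2)

# Step 2: ordinary elementwise addition on equal shapes.
manual = a_big + b_big

print(manual)
print(torch.equal(manual, a + b))  # True: same result as implicit broadcasting
```

In practice the implicit form `a + b` is preferred; the explicit version is only useful for convincing yourself what the mechanism does.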

### Indexing and slicing

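As in any other Python array, elements in a tensor can be accessed by index: the first element has index 0, negative indices count from the end, and a range includes its first index but excludes its last. Thus, [-1] selects the last element and [1:3] selects the second and the third elements. For the matrix X, these elements are rows, because indexing applies along axis 0: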



In [103]:
print (X)

print (X[-1])

print (X[1:3])

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
tensor([ 8.,  9., 10., 11.])
tensor([[ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])


In [104]:
X[1, 2] = 9
print (X)

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  9.,  7.],
        [ 8.,  9., 10., 11.]])


In [105]:
X[0:2, :] = 12

print (X)

tensor([[12., 12., 12., 12.],
        [12., 12., 12., 12.],
        [ 8.,  9., 10., 11.]])


### Saving memory

Running operations can cause new memory to be allocated to host results. For example, if we write Y = X + Y, we will dereference the tensor that Y used to point to and instead point Y at the newly allocated memory. In the following example, we demonstrate this with Python’s id() function, which gives us the exact address of the referenced object in memory. After running Y = Y + X, we will find that id(Y) points to a different location. That is because Python first evaluates Y + X, allocating new memory for the result and then makes Y point to this new location in memory.



In [106]:
before = id(Y)
print (before)
Y=Y+X
print (id(Y))

140475785894640
140475714687456


This might be undesirable for two reasons. First, we do not want to run around allocating memory unnecessarily all the time. In machine learning, we might have hundreds of megabytes of parameters and update all of them multiple times per second. Typically, we will want to perform these updates in place. Second, we might point at the same parameters from multiple variables. If we do not update in place, other references will still point to the old memory location, making it possible for parts of our code to inadvertently reference stale parameters.

Fortunately, performing in-place operations is easy. We can assign the result of an operation to a previously allocated array with slice notation, e.g., `Y[:] = <expression>`. To illustrate this concept, we first create a new matrix Z with the same shape as another Y, using zeros_like to allocate a block of 0 entries.

In [107]:
Z = torch.zeros_like(Y)
print (Z)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))
Z = X+Y
print('id(Z):', id(Z))

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
id(Z): 140475714605216
id(Z): 140475714605216
id(Z): 140475714606176


If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y or X += Y to reduce the memory overhead of the operation.



In [108]:
before = id(X)
X = X + Y
id(X) == before

False

In [109]:
before = id(X)
X +=  Y
id(X) == before

True
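The second reason given above for preferring in-place updates (other references going stale) is easy to reproduce. This is a minimal sketch; `params` and `alias` are hypothetical names standing in for model parameters that are referenced from two places.

```python
import torch

# Out-of-place update: the second name keeps pointing at the old tensor.
params = torch.zeros(3)
alias = params                 # a second reference to the same tensor
params = params + 1            # allocates a new tensor and rebinds `params`
stale = alias.tolist()         # still the old values
print(stale)                   # [0.0, 0.0, 0.0]

# In-place update: both names see the change.
params = torch.zeros(3)
alias = params
params += 1                    # writes into the existing memory
fresh = alias.tolist()
print(fresh)                   # [1.0, 1.0, 1.0]
print(alias is params)         # True
```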

## Basic Linear Algebra

### Dot Product

One of the most fundamental operations is the dot product. Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, their dot product

$\mathbf{x}^\top \mathbf{y}$ or $\langle \mathbf{x}, \mathbf{y} \rangle$
is a sum over the products of the elements at the same position: $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$.

In [110]:
print (x)
y = torch.ones(4, dtype=torch.float32)

print (x.dot(y))

### Note that we can express the dot product of two vectors equivalently by performing an elementwise multiplication and then a sum:

print (torch.sum(x*y))

tensor([1., 2., 4., 8.])
tensor(15.)
tensor(15.)


Dot products are useful in a wide range of contexts. For example, given some set of values and a set of weights, the dot product between these vectors gives us a weighted sum. When the weights are non-negative and sum to 1, the dot product gives us the weighted average.

After normalizing two vectors to have the unit length, the dot products express the cosine of the angle between them. We will formally introduce this notion of length later in this section.
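Both uses of the dot product mentioned above can be sketched in plain Python. The uniform weights and the example vectors are assumptions picked so the results are easy to check by hand.

```python
import math


def dot(u, v):
    """Sum of products of elements at the same position."""
    return sum(a * b for a, b in zip(u, v))


# Weighted average: weights are non-negative and sum to 1.
values = [1.0, 2.0, 4.0, 8.0]
weights = [0.25, 0.25, 0.25, 0.25]
weighted_average = dot(values, weights)
print(weighted_average)  # 3.75


def cosine(u, v):
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 for perpendicular vectors
print(cosine([1.0, 1.0], [2.0, 2.0]))  # about 1 for vectors pointing the same way
```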

### Matrix-Vector Products

Now that we know how to calculate dot products, we can begin to understand matrix-vector products. Recall the matrix  $\mathbf{A} \in \mathbb{R}^{m \times n}$ and the vector $\mathbf{x} \in \mathbb{R}^n$ 

$\begin{split}\mathbf{A}=
\begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix},\end{split}$

where each $\mathbf{a}^\top_{i} \in \mathbb{R}^n$ is a row vector representing the ith row of the matrix A. The matrix-vector product Ax is simply a column vector of length  m , whose  ith  element is the dot product  $\mathbf{a}^\top_i \mathbf{x}$

$\begin{split}\mathbf{A}\mathbf{x}
= \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix}\mathbf{x}
= \begin{bmatrix}
 \mathbf{a}^\top_{1} \mathbf{x}  \\
 \mathbf{a}^\top_{2} \mathbf{x} \\
\vdots\\
 \mathbf{a}^\top_{m} \mathbf{x}\\
\end{bmatrix}.\end{split}$

We can think of multiplication by a matrix  A as a transformation that projects vectors (here x) from  $\mathbb{R}^n$ to $\mathbb{R}^m$. These transformations turn out to be remarkably useful. For example, we can represent rotations as multiplications by a square matrix. As we will see in subsequent chapters, we can also use matrix-vector products to describe the most intensive calculations required when computing each layer in a neural network given the values of the previous layer.

Expressing matrix-vector products in code with tensors, we use the mv function. When we call torch.mv(A, x) with a matrix A and a vector x, the matrix-vector product is performed. Note that the column dimension of A (its length along axis 1) must be the same as the dimension of x (its length).



In [111]:
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)


print (A.shape, x.shape)

print (A)

print (x)

print (torch.mv(input=A, vec=x))

torch.Size([5, 4]) torch.Size([4])
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]])
tensor([1., 2., 4., 8.])
tensor([ 34.,  94., 154., 214., 274.])
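To connect the row-by-row definition with the remark that rotations are multiplications by a square matrix, here is a library-free sketch. `matvec` and `rotation` are illustrative helper names; the rotation matrix is the standard 2-D counterclockwise one.

```python
import math


def matvec(A, x):
    """Matrix-vector product: one dot product per row of A."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def rotation(theta):
    """2x2 matrix that rotates a vector counterclockwise by theta radians."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]


# Rotating the unit vector along the first axis by 90 degrees
# lands (up to floating-point rounding) on the unit vector along the second axis.
rotated = matvec(rotation(math.pi / 2), [1.0, 0.0])
print(rotated)
```

Applied to the 5x4 matrix and the vector from the cell above, the same `matvec` reproduces the `torch.mv` result.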


### Matrix Multiplication

![](https://i.imgur.com/8lGoukC.png)

#### Walkthrough

![](https://i.imgur.com/YPiBqU8.jpeg)



---






In [112]:
print (A)

B = torch.ones((4,3))

print (A.shape, B.shape)
mul = torch.mm(A, B)
print (mul)
print (mul.shape)

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]])
torch.Size([5, 4]) torch.Size([4, 3])
tensor([[ 6.,  6.,  6.],
        [22., 22., 22.],
        [38., 38., 38.],
        [54., 54., 54.],
        [70., 70., 70.]])
torch.Size([5, 3])


In [113]:
### a is each row vector; extract the one at idx 1
a_2 = A[1, :]

print (a_2)

tensor([4., 5., 6., 7.])


In [114]:
### b is each col vector; extract the last one: b_m
b_m = B[:, -1]

print (b_m)

tensor([1., 1., 1., 1.])


In [115]:
### a_2 . b_m should give the result at position (2, m); a_2 is 1-D, so no transpose is needed

print (a_2.dot(b_m))

tensor(22.)


In [116]:
print (mul)

tensor([[ 6.,  6.,  6.],
        [22., 22., 22.],
        [38., 38., 38.],
        [54., 54., 54.],
        [70., 70., 70.]])
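The rule that the cells above verify for one entry (entry (i, j) of the product is the dot product of row i of A with column j of B) can be written out for the whole matrix without any library. This is a sketch for intuition only; `matmul` is an illustrative name, and `torch.mm` is what you would use in practice.

```python
def matmul(A, B):
    """C[i][j] is the dot product of row i of A with column j of B."""
    inner = len(B)
    assert all(len(row) == inner for row in A), "columns of A must match rows of B"
    columns = list(zip(*B))  # transpose B so that its columns are easy to iterate
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


# Same values as the tensors A (5x4) and B (4x3) used above.
A = [[float(4 * i + j) for j in range(4)] for i in range(5)]
B = [[1.0] * 3 for _ in range(4)]

C = matmul(A, B)
for row in C:
    print(row)
```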


### Norms

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big a vector is. The notion of size under consideration here concerns not dimensionality but rather the magnitude of the components.

You might notice that norms sound a lot like measures of distance. And if you remember Euclidean distances (think Pythagoras’ theorem) from grade school, then the concepts of non-negativity and the triangle inequality might ring a bell. In fact, the Euclidean distance is a norm: specifically, it is the L2 norm. Suppose that the elements in the n-dimensional vector $\mathbf{x}$ are $x_1, \ldots, x_n$. The L2 norm of $\mathbf{x}$ is the square root of the sum of the squares of the vector elements:

$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2},$

where the subscript  2  is often omitted in  L2  norms, i.e., $\|\mathbf{x}\|$ is  equivalent to $\|\mathbf{x}\|_2$ 

In [117]:
u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)

In deep learning, we work more often with the squared  L2  norm. You will also frequently encounter the  L1  norm, which is expressed as the sum of the absolute values of the vector elements:

$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$

As compared with the  L2  norm, it is less influenced by outliers. To calculate the  L1  norm, we compose the absolute value function with a sum over the elements.

In [118]:
torch.abs(u).sum()

tensor(7.)

Both the  L2  norm and the  L1  norm are special cases of the more general  Lp  norm:

$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.$
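A direct transcription of the Lp formula shows how the L1 and L2 norms fall out as special cases. `lp_norm` is an illustrative helper, and the vector is the same u = (3, -4) used above.

```python
def lp_norm(x, p):
    """Lp norm: p-th root of the sum of absolute values raised to the p-th power."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


u = [3.0, -4.0]
print(lp_norm(u, 1))   # L1 norm: |3| + |-4| = 7
print(lp_norm(u, 2))   # L2 norm: sqrt(9 + 16) = 5
print(lp_norm(u, 50))  # for large p the value approaches the largest |x_i|, here 4
```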

Analogous to  L2  norms of vectors, the Frobenius norm of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ is the square root of the sum of the squares of the matrix elements:

$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$

The Frobenius norm satisfies all the properties of vector norms. It behaves as if it were an  L2  norm of a matrix-shaped vector. Invoking the following function will calculate the Frobenius norm of a matrix.

In [119]:
print (A)
torch.norm(A)

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]])


tensor(49.6991)

While we do not want to get too far ahead of ourselves, we can plant some intuition already about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; minimize the distance between predictions and the ground-truth observations. Assign vector representations to items (like words, products, or news articles) such that the distance between similar items is minimized, and the distance between dissimilar items is maximized. Oftentimes, the objectives, perhaps the most important components of deep learning algorithms (besides the data), are expressed as norms.

### Exercises

6. Run A / A.sum(axis=1) and see what happens. Can you analyze the reason?



In [120]:
print (A)

print (A.shape)

print (A.sum(axis=1))

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]])
torch.Size([5, 4])
tensor([ 6., 22., 38., 54., 70.])


In [121]:
A.size(), A.sum(axis=1).size()

(torch.Size([5, 4]), torch.Size([5]))

In [122]:
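## A / A.sum(axis=1) raises a RuntimeError: shapes (5, 4) and (5,) cannot be broadcast,
## because broadcasting aligns the trailing axes and 4 != 5.

With a square matrix the division runs, because shapes (5, 5) and (5,) do broadcast. Note, though, that the vector of row sums is aligned with the last axis, so column j is divided by the sum of row j; this is not a row-wise normalization: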

In [123]:
B = torch.arange(25, dtype = torch.float32).reshape(5, 5)
print (B)
print (B.sum(axis=1))
B / B.sum(axis=1)

tensor([[ 0.,  1.,  2.,  3.,  4.],
        [ 5.,  6.,  7.,  8.,  9.],
        [10., 11., 12., 13., 14.],
        [15., 16., 17., 18., 19.],
        [20., 21., 22., 23., 24.]])
tensor([ 10.,  35.,  60.,  85., 110.])


tensor([[0.0000, 0.0286, 0.0333, 0.0353, 0.0364],
        [0.5000, 0.1714, 0.1167, 0.0941, 0.0818],
        [1.0000, 0.3143, 0.2000, 0.1529, 0.1273],
        [1.5000, 0.4571, 0.2833, 0.2118, 0.1727],
        [2.0000, 0.6000, 0.3667, 0.2706, 0.2182]])
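If the intent behind the exercise is to divide every row by its own sum, one way (an assumption about the intent, not something stated in the exercise) is to keep the summed axis with `keepdim=True`, so the sums have shape (5, 1) and broadcast across the columns.

```python
import torch

A = torch.arange(20, dtype=torch.float32).reshape(5, 4)

# keepdim=True keeps the reduced axis as size 1: shape (5, 1) instead of (5,).
row_sums = A.sum(dim=1, keepdim=True)
print(row_sums.shape)

# (5, 4) / (5, 1) broadcasts along axis 1, so each row is divided by its own sum.
normalized = A / row_sums
print(normalized.sum(dim=1))  # every row now sums to (approximately) 1
```

The same trick fixes the square case too, where the division silently succeeds but normalizes along the wrong axis.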

8. Consider a tensor with shape (2, 3, 4). What are the shapes of the summation outputs along axis 0, 1, and 2?



In [124]:
a = torch.randint(low=0, high=9, size=(2,3,4))
a

tensor([[[6, 0, 6, 1],
         [4, 4, 4, 6],
         [3, 8, 1, 0]],

        [[1, 8, 0, 3],
         [3, 2, 6, 7],
         [8, 6, 2, 2]]])

In [125]:
print (a.shape)

print (a.sum(axis=0))

torch.Size([2, 3, 4])
tensor([[ 7,  8,  6,  4],
        [ 7,  6, 10, 13],
        [11, 14,  3,  2]])


In [126]:
### shape of a is (2,3,4), i.e. 2 separate 3x4 matrices
a[0], a[1]

(tensor([[6, 0, 6, 1],
         [4, 4, 4, 6],
         [3, 8, 1, 0]]), tensor([[1, 8, 0, 3],
         [3, 2, 6, 7],
         [8, 6, 2, 2]]))

In [127]:
## a.sum(axis=0) will sum up the 2 3x4 matrices resulting in a 3x4 matrix
a.sum(axis=0).shape

torch.Size([3, 4])

In [128]:
a[0]+a[1]

tensor([[ 7,  8,  6,  4],
        [ 7,  6, 10, 13],
        [11, 14,  3,  2]])

In [129]:
a.sum(axis=0)

tensor([[ 7,  8,  6,  4],
        [ 7,  6, 10, 13],
        [11, 14,  3,  2]])

In [130]:
a.sum(axis=1).shape

torch.Size([2, 4])

In [131]:
a

tensor([[[6, 0, 6, 1],
         [4, 4, 4, 6],
         [3, 8, 1, 0]],

        [[1, 8, 0, 3],
         [3, 2, 6, 7],
         [8, 6, 2, 2]]])

In [132]:
### 2x4 shape: one row per matrix; each row holds the 4 column sums of that 3x4 matrix (the row axis is summed away)
a.sum(axis=1)

tensor([[13, 12, 11,  7],
        [12, 16,  8, 12]])

In [133]:
a.sum(axis=2)

tensor([[13, 18, 12],
        [12, 18, 18]])

In [134]:
### 2x3 shape: one row per matrix; each row holds the 3 row sums of that 3x4 matrix (the column axis is summed away)

a.sum(axis=2).shape

torch.Size([2, 3])

9. Feed a tensor with 3 or more axes to the linalg.norm function and observe its output. What does this function compute for tensors of arbitrary shape?



In [135]:
Y= torch.arange(24,dtype = torch.float32).reshape(2, 3, 4)
print (Y)

tensor([[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]],

        [[12., 13., 14., 15.],
         [16., 17., 18., 19.],
         [20., 21., 22., 23.]]])


In [136]:
torch.linalg.norm(Y)

tensor(65.7571)

With no extra arguments, torch.linalg.norm treats the tensor as one long vector: the result is the square root of the sum of the squares of all elements, i.e. the Frobenius norm formula extended to three axes. We can check this by hand:


In [137]:
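# Accumulate the sum of squares of every element by hand
# (named total to avoid shadowing the built-in sum).
total = 0
for matrix in Y:
    print (matrix.shape)
    for row in matrix:
        for value in row:
            total += value.item() ** 2

print (total ** 0.5)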

torch.Size([3, 4])
torch.Size([3, 4])
65.75712889109438


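The manual result matches: for the tensor of shape 2x3x4, the function computed the square root of the sum of squares over all elements, which is the Frobenius norm generalized to more than two axes.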

## Calculus

### Derivatives and Differentiation

We begin by addressing the calculation of derivatives, a crucial step in nearly all deep learning optimization algorithms. In deep learning, we typically choose loss functions that are differentiable with respect to our model’s parameters. Put simply, this means that for each parameter, we can determine how rapidly the loss would increase or decrease, were we to increase or decrease that parameter by an infinitesimally small amount.

Suppose that we have a function  $f: \mathbb{R} \rightarrow \mathbb{R}$ whose input and output are both scalars. The derivative of  f  is defined as

$f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h},$

if this limit exists. If  f'(a) exists,  f  is said to be differentiable at a . If  f  is differentiable at every number of an interval, then this function is differentiable on this interval. We can interpret the derivative  f′(x)  in (2.4.1) as the instantaneous rate of change of  f(x)  with respect to  x . The so-called instantaneous rate of change is based on the variation  h  in  x , which approaches  0 .

To illustrate derivatives, let us experiment with an example. Define  
$u = f(x) = 3x^2-4x$


In [138]:
%matplotlib inline
import numpy as np
from IPython import display

def f(x):
    return 3 * x ** 2 - 4 * x

In [139]:
def numerical_lim(f, x, h):
    return (f(x+h) - f(x))/h

h = 0.1
for i in range(10):
    print (f'h = {h}, numerical limit = {numerical_lim(f, 1, h)}')
    h *= 0.1

h = 0.1, numerical limit = 2.3000000000000043
h = 0.010000000000000002, numerical limit = 2.029999999999976
h = 0.0010000000000000002, numerical limit = 2.0029999999993104
h = 0.00010000000000000003, numerical limit = 2.000299999997956
h = 1.0000000000000004e-05, numerical limit = 2.0000300000155837
h = 1.0000000000000004e-06, numerical limit = 2.0000030001021676
h = 1.0000000000000005e-07, numerical limit = 2.000000298707504
h = 1.0000000000000005e-08, numerical limit = 1.999999987845057
h = 1.0000000000000005e-09, numerical limit = 2.000000165480741
h = 1.0000000000000006e-10, numerical limit = 2.000000165480741
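
The output above shows two effects: the error of the forward difference shrinks roughly in proportion to h, and below about h = 1e-8 floating-point round-off stops the estimate from improving. A common alternative is the symmetric (central) difference. The sketch below compares the two; `forward_diff` and `central_diff` are hypothetical helper names, not part of the book's code.

```python
def f(x):
    return 3 * x ** 2 - 4 * x

def forward_diff(f, x, h):
    # Same quotient as numerical_lim above.
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    # Hypothetical helper: symmetric difference quotient.
    return (f(x + h) - f(x - h)) / (2 * h)

exact = 2.0  # f'(x) = 6x - 4, so f'(1) = 2
for h in (1e-1, 1e-2, 1e-3):
    fwd_err = abs(forward_diff(f, 1.0, h) - exact)
    ctr_err = abs(central_diff(f, 1.0, h) - exact)
    print(f'h = {h}: forward error = {fwd_err:.2e}, central error = {ctr_err:.2e}')
```

For a quadratic such as this f, the central difference is exact up to rounding, so its error stays near machine precision while the forward error is about 3h.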


Let us familiarize ourselves with a few equivalent notations for derivatives. Given  y = f(x) , where  x  and  y  are the independent variable and the dependent variable of the function  f , respectively, the following expressions are equivalent:

$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x).$

### Partial Derivatives

So far we have dealt with the differentiation of functions of just one variable. In deep learning, functions often depend on many variables. Thus, we need to extend the ideas of differentiation to these multivariate functions.

Let $y = f(x_1, x_2, \ldots, x_n)$ be a function with  n  variables. The partial derivative of  y  with respect to its  $i^\mathrm{th}$  parameter  $x_i$  is:

$\frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}.$

To calculate  $\frac{\partial y}{\partial x_i}$ we can simply treat $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants and calculate the derivative of  y  with respect to  $x_i$ . For notation of partial derivatives, the following are equivalent:

$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_{x_i} = f_i = D_i f = D_{x_i} f.$
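
The limit definition translates almost directly into code: perturb one input and hold the others fixed. The following is a minimal sketch in plain Python; `partial_derivative` and the example function `g` are hypothetical names chosen for illustration.

```python
def partial_derivative(f, xs, i, h=1e-6):
    # Hypothetical helper: perturb only the i-th input, hold the others fixed.
    shifted = list(xs)
    shifted[i] += h
    return (f(*shifted) - f(*xs)) / h

def g(x1, x2):
    # Illustrative function: g = x1^2 * x2 + 3 * x2
    return x1 ** 2 * x2 + 3 * x2

point = [2.0, 5.0]
# Analytically: dg/dx1 = 2 * x1 * x2 = 20 and dg/dx2 = x1^2 + 3 = 7
print(partial_derivative(g, point, 0))
print(partial_derivative(g, point, 1))
```

The printed values should be close to 20 and 7, with a small error coming from the finite step h.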

### Gradients

Suppose that the input of function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an  n -dimensional vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$ and the output is a scalar. The gradient of the function  $f(\mathbf{x})$  with respect to  $\mathbf{x}$  is a vector of  n  partial derivatives:

$\nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top,$

where $\nabla_{\mathbf{x}} f(\mathbf{x})$ is often replaced by $\nabla f(\mathbf{x})$  when there is no ambiguity.

Note that the gradient is itself an n-dimensional vector, with the same shape as $\mathbf{x}$.

![](https://i.imgur.com/2qsZFkH.png)
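
Since the gradient is just the n partial derivatives stacked into a vector, it can be approximated one coordinate at a time. The sketch below does this with central differences in plain Python; `numerical_gradient` and `two_x_dot_x` are hypothetical names, and the function $2\mathbf{x}^\top\mathbf{x}$ is chosen only because its gradient $4\mathbf{x}$ is easy to check by hand.

```python
def numerical_gradient(f, x, h=1e-5):
    # Hypothetical helper: one central difference per coordinate.
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def two_x_dot_x(x):
    # f(x) = 2 * x^T x, a scalar-valued function of a vector
    return 2 * sum(v * v for v in x)

point = [0.0, 1.0, 2.0, 3.0]
grad = numerical_gradient(two_x_dot_x, point)
print(grad)  # approximately [0, 4, 8, 12], i.e. 4 * x
```

The result has one entry per input coordinate, so it has the same length as the input vector.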

### Chain Rule

However, such gradients can be hard to find. This is because multivariate functions in deep learning are often composite, so we may not be able to apply any of the aforementioned rules to differentiate these functions. Fortunately, the chain rule enables us to differentiate composite functions.

Let us first consider functions of a single variable. Suppose that functions y = f(u) and u = g(x) are both differentiable. Then the chain rule states that

$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.$

Now let us turn our attention to a more general scenario where functions have an arbitrary number of variables. Suppose that the differentiable function  $y$  has variables $u_1, u_2, \ldots, u_m$, where each differentiable function  $u_i$  has variables  $x_1, x_2, \ldots, x_n$ . Note that  $y$  is a function of  $x_1, x_2, \ldots, x_n$ . Then the chain rule gives

$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \cdots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i}$

for any $i = 1, 2, \ldots, n$.
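
To make the multivariate chain rule concrete, here is a small numerical check with m = 2 intermediate variables and n = 2 inputs. The functions $u_1 = x_1 + x_2$, $u_2 = x_1 x_2$ and $y = u_1 u_2$ are made up for this sketch and are not taken from the book.

```python
def y_of_x(x1, x2):
    u1 = x1 + x2
    u2 = x1 * x2
    return u1 * u2

x1, x2 = 2.0, 3.0
u1, u2 = x1 + x2, x1 * x2

# Chain rule for x1: (dy/du1)(du1/dx1) + (dy/du2)(du2/dx1) = u2 * 1 + u1 * x2
chain_dx1 = u2 * 1.0 + u1 * x2
# Chain rule for x2: (dy/du1)(du1/dx2) + (dy/du2)(du2/dx2) = u2 * 1 + u1 * x1
chain_dx2 = u2 * 1.0 + u1 * x1

# Independent check: central differences on the composed function.
h = 1e-6
numeric_dx1 = (y_of_x(x1 + h, x2) - y_of_x(x1 - h, x2)) / (2 * h)
numeric_dx2 = (y_of_x(x1, x2 + h) - y_of_x(x1, x2 - h)) / (2 * h)

print(chain_dx1, numeric_dx1)
print(chain_dx2, numeric_dx2)
```

Each pair should agree closely: the chain-rule sum over the intermediate variables gives the same slope as differentiating the composed function directly.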



### Automatic Differentiation

As we have explained in Section 2.4, differentiation is a crucial step in nearly all deep learning optimization algorithms. While the calculations for taking these derivatives are straightforward, requiring only some basic calculus, for complex models, working out the updates by hand can be a pain (and often error-prone).

Deep learning frameworks expedite this work by automatically calculating derivatives, i.e., automatic differentiation. In practice, based on our designed model the system builds a computational graph, tracking which data combined through which operations to produce the output. Automatic differentiation enables the system to subsequently backpropagate gradients. Here, backpropagate simply means to trace through the computational graph, filling in the partial derivatives with respect to each parameter.

As a toy example, say that we are interested in differentiating the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to the column vector  x . To start, let us create the variable x and assign it an initial value.

In [140]:
x = torch.arange(4.0)
print (x.shape)
x

torch.Size([4])


tensor([0., 1., 2., 3.])

x is a 1-D tensor of shape (4,), which we treat as a column vector of length 4, so $2\mathbf{x}^\top\mathbf{x}$ will be a scalar. We also know that the gradient of a scalar-valued function with respect to a vector  x  is itself vector-valued and has the same shape as  x .

Before we even calculate the gradient of  y  with respect to  x , we will need a place to store it. It is important that we do not allocate new memory every time we take a derivative with respect to a parameter, because we will often update the same parameters thousands or millions of times and could quickly run out of memory.

In [141]:
# Track operations on x so that gradients with respect to x can be computed
x.requires_grad_(True)  # Same as `x = torch.arange(4.0, requires_grad=True)`
x.grad  # The default value is None

Now let us calculate  y . Note: y here is a scalar

In [142]:
y = 2*torch.dot(x, x) ## 2 * (0*0 + 1*1 + 2*2 + 3*3) = 28
print (x)
print (y)

tensor([0., 1., 2., 3.], requires_grad=True)
tensor(28., grad_fn=<MulBackward0>)


Since x is a vector of length 4, an inner product of x and x is performed, yielding the scalar output that we assign to y. Next, we can automatically calculate the gradient of y with respect to each component of x by calling the function for backpropagation and printing the gradient.



In [143]:
y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

The gradient of the function  $y = 2\mathbf{x}^{\top}\mathbf{x}$  with respect to $\mathbf{x}$ will be $4\mathbf{x}$.  Let us quickly verify that our desired gradient was calculated correctly.

In [144]:
4*x

tensor([ 0.,  4.,  8., 12.], grad_fn=<MulBackward0>)

Now let us calculate another function of x.



In [145]:
# PyTorch accumulates gradients by default, so we need to clear the previous values
x.grad.zero_()

tensor([0., 0., 0., 0.])

In [146]:
y = x.sum()
print (y)

tensor(6., grad_fn=<SumBackward0>)


In [147]:
y.backward()
x.grad

tensor([1., 1., 1., 1.])

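Here $y = x_1 + x_2 + x_3 + x_4$, where $x_1, x_2, x_3, x_4$ are the elements of the vector x.

The gradient collects the partial derivatives: $\nabla_{\mathbf{x}} y = [\partial y/\partial x_1, \partial y/\partial x_2, \partial y/\partial x_3, \partial y/\partial x_4]^\top = [1, 1, 1, 1]^\top$, which matches x.grad above.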
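
The comment about accumulation in the cell above is worth seeing in action. The sketch below uses a fresh tensor `w` (a hypothetical name, to avoid disturbing `x`) and calls backward twice without clearing the gradient in between.

```python
import torch

w = torch.arange(4.0, requires_grad=True)

# First backward pass: the gradient of sum(w) is a vector of ones.
w.sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
w.sum().backward()
accumulated = w.grad.clone()

# After zeroing, a fresh backward pass starts from zero again.
w.grad.zero_()
w.sum().backward()
fresh = w.grad.clone()

print(first)
print(accumulated)
print(fresh)
```

The second result is twice the first, which is why `x.grad.zero_()` is called before each new computation in these notes.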



### Backward for Non-Scalar Variables

Technically, when y is not a scalar, the most natural interpretation of the differentiation of a vector y with respect to a vector x is a matrix. For higher-order and higher-dimensional y and x, the differentiation result could be a high-order tensor.

However, while these more exotic objects do show up in advanced machine learning (including in deep learning), more often when we are calling backward on a vector, we are trying to calculate the derivatives of the loss functions for each constituent of a batch of training examples. Here, our intent is not to calculate the differentiation matrix but rather the sum of the partial derivatives computed individually for each example in the batch.

In [148]:
x.grad.zero_()
y = x * x

In [149]:
print (x)
print (y)

tensor([0., 1., 2., 3.], requires_grad=True)
tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)


In [150]:
print (y.sum())

tensor(14., grad_fn=<SumBackward0>)


In [151]:
y.sum().backward()

In [152]:
x.grad

tensor([0., 2., 4., 6.])

Here y.sum() is the scalar $x_1^2 + x_2^2 + x_3^2 + x_4^2$,

so its gradient is the vector of partial derivatives $[2x_1, 2x_2, 2x_3, 2x_4]^\top = [0, 2, 4, 6]^\top$, which matches x.grad above.
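
Summing before calling backward is one way to handle a non-scalar y. Another route, sketched below, is to pass a `gradient` argument to `backward`, which weights each element of y; with a vector of ones this should give the same result as `y.sum().backward()`. The tensor `v` and the result names are hypothetical.

```python
import torch

v = torch.arange(4.0, requires_grad=True)

# Route 1: reduce to a scalar first, then backpropagate.
(v * v).sum().backward()
grad_via_sum = v.grad.clone()

# Route 2: call backward on the vector itself, weighting each element by 1.
v.grad.zero_()
y = v * v
y.backward(gradient=torch.ones_like(y))
grad_via_ones = v.grad.clone()

print(grad_via_sum)
print(grad_via_ones)
```

Calling `y.backward()` on a non-scalar without the `gradient` argument raises an error, because there is no single output to differentiate.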

### Detaching Computation

Sometimes, we wish to move some calculations outside of the recorded computational graph. For example, say that y was calculated as a function of x, and that subsequently z was calculated as a function of both y and x. Now, imagine that we wanted to calculate the gradient of z with respect to x, but wanted for some reason to treat y as a constant, and only take into account the role that x played after y was calculated.

For example, say y = x^2 and z = y * x = x^2 * x.

Here, we can detach y to return a new variable u that has the same value as y but discards any information about how y was computed in the computational graph. In other words, the gradient will not flow backwards through u to x. Thus, the following backpropagation function computes the partial derivative of z = u * x with respect to x while treating u as a constant, instead of the partial derivative of z = x * x * x with respect to x.





In [153]:
x.grad.zero_()
print (x)
y = x * x
print (y)

u = y.detach()
print (u)

tensor([0., 1., 2., 3.], requires_grad=True)
tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)
tensor([0., 1., 4., 9.])


In [154]:
z = u * x
z.sum().backward()
print (x.grad) ### should return u

tensor([0., 1., 4., 9.])


In [155]:
### without detaching
x.grad.zero_()
print (x)
y = x * x
print (y)
z = y * x
z.sum().backward()
print (x.grad) ### 3 * x^2 = [3*0^2, 3*1^2, 3*2^2, 3*3^2] = [0, 3, 12, 27]

tensor([0., 1., 2., 3.], requires_grad=True)
tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)
tensor([ 0.,  3., 12., 27.])


### Computing the Gradient of Python Control Flow

One benefit of using automatic differentiation is that even if building the computational graph of a function required passing through a maze of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. In the following snippet, note that the number of iterations of the while loop and the evaluation of the if statement both depend on the value of the input a.



In [156]:
def f(a):
    ## init b = 2a
    b = a*2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        print ('here', b)
        c = b
    else:
        c = 100 * b
    return c

In [157]:
a = torch.randn(size=(), requires_grad=True)
print (a)

tensor(0.6563, requires_grad=True)


In [158]:
print (a.norm())

tensor(0.6563, grad_fn=<CopyBackwards>)


In [159]:
d = f(a)

here tensor(1344.0516, grad_fn=<MulBackward0>)


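d = f(a) = 2a * 2^m = k * a, where m is the number of loop iterations and k = 2^(m+1) is a constant for this particular a. Here a > 0, so the else branch is not taken; the loop runs m = 10 times, giving k = 2048.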


In [160]:
d = f(a)
d.backward()

here tensor(1344.0516, grad_fn=<MulBackward0>)


In [161]:
print (a.grad)

tensor(2048.)


In [162]:
d/a

tensor(2048., grad_fn=<DivBackward0>)

We can now analyze the f function defined above. Note that it is piecewise linear in its input a. In other words, for any a there exists some constant scalar k such that f(a) = k * a, where the value of k depends on the input a.

f(a) = k * a, so df/da = k = f(a)/a = d/a. Consequently, d/a lets us verify that the gradient is correct: both a.grad and d/a are 2048 above.
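
The piecewise-linear claim can also be checked without autograd. The sketch below rewrites the same control flow for a plain Python float (`f_float` is a hypothetical name; `abs()` stands in for `norm()` and the debug print is dropped) and compares a finite-difference slope with the ratio f(a)/a. It assumes a is nonzero, since the loop would never terminate for a = 0.

```python
def f_float(a):
    # Same control flow as f above, on a plain Python float.
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    return b if b > 0 else 100 * b

h = 1e-6
for a in (0.6563, -0.6563):
    slope = (f_float(a + h) - f_float(a)) / h  # finite-difference estimate of df/da
    ratio = f_float(a) / a                     # k in f(a) = k * a
    print(a, slope, ratio)
```

For the positive input the slope and the ratio are both about 2048, matching a.grad above. For the negative input the else branch multiplies by 100, so k is a different constant, which is what piecewise linear means here.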