This notebook builds on the first block and covers basic concepts that are useful for understanding the PyTorch deep learning framework, such as **objective functions**, **non-linearities**, **affine maps**, etc.

In [0]:
import torch

In [0]:
# check whether CUDA is available on your computer
print("---------------------------------------------------")
print("Cuda status:", torch.cuda.is_available())

device_0 = torch.device("cuda")
device_1 = torch.device("cpu")

print("---------------------------------------------------")

---------------------------------------------------
Cuda status: True
---------------------------------------------------


In [0]:
A = torch.tensor([1, 2, 3, 4])          # created on the CPU by default
B = torch.tensor([1, 2, 3, 4]).cuda()   # copied to the GPU; requires CUDA to be available

In [4]:
A.device

device(type='cpu')

In [5]:
B.device

device(type='cuda', index=0)

In [6]:
A.is_cuda

False

In [7]:
B.is_cuda

True
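
The cells above assume a machine with a GPU. As a minimal sketch of a device-agnostic variant (the name `device` is our own choice, not something the notebook defines), the same tensors can be created so that the code also runs on a CPU-only machine:

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

A = torch.tensor([1, 2, 3, 4])   # created on the CPU by default
B = A.to(device)                 # on `device`; returns A itself if it is already there

print(A.device, B.device)
print(B.is_cuda == torch.cuda.is_available())
```

Using `.to(device)` instead of `.cuda()` avoids a runtime error when no GPU is present.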

# **Computation Graph**
A simplified definition of a neural network is a chain of differentiable functions that we can combine to build more complicated functions. An intuitive way to express this process is through **computation graphs**. PyTorch provides efficient functionality for **automatic differentiation** over such graphs.




![222](https://drive.google.com/uc?id=1lqRo4lahewpAJddyE6Y0uPhxvgQBQF2k)

In [8]:
### FORWARD
print("---------------------------------------------------")
print("FORWARD: ")

# layer 1
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# layer 2
c = a + b
c.retain_grad()         # retain gradients for non-leaf Tensors

d = b + 1.0
d.retain_grad()         # retain gradients for non-leaf Tensors

# layer 3
e = c * d

print("e: ", e)

### BACKWARD
print("---------------------------------------------------")
print("BACKWARD: ")

e.backward()

print("c.grad: ", c.grad.detach().item()) # de/dc
print("d.grad: ", d.grad.detach().item()) # de/dd

---------------------------------------------------
FORWARD: 
e:  tensor(6., grad_fn=<MulBackward0>)
---------------------------------------------------
BACKWARD: 
c.grad:  2.0
d.grad:  3.0
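
The same backward pass also fills in the gradients of the leaf tensors `a` and `b`. The sketch below rebuilds the graph from the cell above and checks the values against the chain rule worked out by hand in the comments:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

c = a + b      # 3.0
d = b + 1.0    # 2.0
e = c * d      # 6.0

e.backward()

# Chain rule by hand:
#   de/da = de/dc * dc/da                 = d * 1         = 2
#   de/db = de/dc * dc/db + de/dd * dd/db = d * 1 + c * 1 = 5
print("a.grad:", a.grad.item())
print("b.grad:", b.grad.item())
```

Note that `b` feeds into both `c` and `d`, so its gradient is the sum of the contributions from the two paths.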


https://towardsdatascience.com/getting-started-with-pytorch-part-1-understanding-how-automatic-differentiation-works-5008282073ec

Compute the derivatives of c with respect to a and b. This covers just the left part of the figure above.

In [0]:
# layer 1
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# layer 2
c = a + b
c.retain_grad() # retain gradients for non-leaf Tensors

print("---------------------------------------------------")
print("BACKWARD: ")

c.backward()

print("a.grad: ", a.grad.detach().item()) # dc/da
print("b.grad: ", b.grad.detach().item()) # dc/db

---------------------------------------------------
BACKWARD: 
a.grad:  1.0
b.grad:  1.0


# **Input**

The first component of a neural network is the **input**. Inputs must be represented as tensors, since the tensor is the main data structure in PyTorch. We introduced tensors and a few of their operations in the previous segment of this tutorial, so here we only briefly review the kinds of inputs that are common in **NLP**. This is the first part of the tutorial that introduces NLP concepts and shows how they fit with the other components of the PyTorch ecosystem. Inputs can be scalars, vectors, or multi-dimensional arrays; whichever the type, they are all represented as tensors in PyTorch. Typically, the inputs come from publicly available datasets.

Inputs to an NLP deep learning model usually have the following dimensions: _{batch_size * max_sequence_length * vocab_size}_. Let's assume our batch size is 64, the sequence length is 60, and the vocabulary size is 10000. Let's see how this looks below:

In [0]:
sample = torch.rand(64, 60, 10000)
print(f'the size number is of: {sample.size()}')
print(f'the total element number is of: {sample.numel()}')

the size number is of: torch.Size([64, 60, 10000])
the total element number is of: 38400000


The first thing you will notice is the huge vocabulary dimension, which is typical when using what are called **one-hot encodings**. There is also the option to encode words and sentences as dense **embeddings**. This gives a more efficient representation, because embeddings can capture semantic relationships between words.

In that case, the dimensions are reduced and typically look like the following:

In [0]:
sample_with_embeddings = torch.rand(64, 60, 100)
print(f'the size number is of: {sample_with_embeddings.size()}')
print(f'the total element number is of: {sample_with_embeddings.numel()}')

the size number is of: torch.Size([64, 60, 100])
the total element number is of: 384000


Note that the third dimension is now much smaller because we are using embeddings as input to represent sequences. This not only gives a more meaningful representation, but the network will also train more efficiently because the dimensions are reduced.
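
The two random tensors above only illustrate the shapes. As a rough sketch of how such inputs could actually be produced, the example below starts from a hypothetical batch of integer token ids (`token_ids` and the small batch size of 4 are our own assumptions, chosen to keep memory use low) and builds both representations with `torch.nn.functional.one_hot` and `nn.Embedding`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 10000, 100   # sizes used in the text above
batch_size, seq_len = 4, 60              # small batch to keep memory low

# Hypothetical batch of token ids: one integer per word.
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# One-hot: a vector of length vocab_size with a single 1 per token.
one_hot = F.one_hot(token_ids, num_classes=vocab_size)

# Embedding: a learnable lookup table with one dense row per vocabulary entry.
embedding = nn.Embedding(vocab_size, embedding_dim)
embedded = embedding(token_ids)

print(one_hot.size())    # last dimension is vocab_size
print(embedded.size())   # last dimension is embedding_dim
```

The embedding weights here are randomly initialized; in practice they are learned during training or loaded from pretrained vectors.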

#### **Tensor Transformation**
Sometimes we need to **permute** the dimensions of a tensor. How do we do this in PyTorch? Please visit the PyTorch documentation to find out how to reorder the dimensions of the original tensor. Hint: **A.permute()**. Try to permute the sample_with_embeddings tensor above so that it has the following dimensions instead: _{max_sequence_length * batch_size * embedding_size}_.

In [0]:
sample_with_embeddings[0]

tensor([[0.4167, 0.5443, 0.7273,  ..., 0.4806, 0.2747, 0.8416],
        [0.2645, 0.9291, 0.8072,  ..., 0.9039, 0.0592, 0.3381],
        [0.6887, 0.3476, 0.1404,  ..., 0.0358, 0.2000, 0.9399],
        ...,
        [0.3001, 0.7335, 0.3546,  ..., 0.8808, 0.8815, 0.4774],
        [0.6429, 0.6869, 0.9452,  ..., 0.5972, 0.1473, 0.8735],
        [0.9811, 0.7691, 0.3593,  ..., 0.2998, 0.5371, 0.1857]])

In [0]:
sample_with_embeddings[0].size()

torch.Size([60, 100])
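
One possible answer to the permutation exercise above, as a small sketch: `permute` takes the new order of the dimensions, so swapping the first two and keeping the last in place gives the requested layout.

```python
import torch

sample_with_embeddings = torch.rand(64, 60, 100)   # (batch, sequence, embedding)

# New order: old dim 1 first, then old dim 0, then old dim 2.
permuted = sample_with_embeddings.permute(1, 0, 2)

print(permuted.size())   # torch.Size([60, 64, 100])

# permute returns a view: element [s, b] of the result is element [b, s] of the input.
print(torch.equal(permuted[5, 3], sample_with_embeddings[3, 5]))   # True
```

Because the result is a view with rearranged strides, it is not contiguous in memory; calling `.contiguous()` makes a compact copy if a later operation requires one.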

## **Linear Transformation**
A fundamental operation in a neural network is the **affine map** (often loosely called a **linear transformation**), which multiplies the input by a weight matrix and adds a bias vector. PyTorch already packages these transformations as layers, so we don't need to implement them manually.

Let's look at the example below. We wish to output the hidden representation using randomly initialized weights and biases. In other words, we wish to compute the following:

**y = Wx + b**

In [10]:
import torch.nn as nn

# using Linear unit in PyTorch

sample_x = torch.rand(64, 60, 100)

fc = nn.Linear(100, 50)               # computes Wx + b; you only need to provide the input and output dimensions

# calling the layer applies Wx + b along the last dimension of the input
out = fc(sample_x)

print(out.size())
print(sample_x)

torch.Size([64, 60, 50])
tensor([[[3.2660e-01, 2.7459e-01, 8.0994e-02,  ..., 2.8569e-01,
          2.0536e-01, 6.6281e-01],
         [3.9845e-01, 8.4902e-01, 5.3731e-01,  ..., 7.4523e-01,
          1.9759e-01, 3.2963e-01],
         [2.8975e-01, 2.0450e-01, 2.6999e-01,  ..., 2.7248e-01,
          6.2096e-01, 8.6310e-01],
         ...,
         [7.3211e-01, 4.3403e-01, 4.0419e-01,  ..., 1.3821e-01,
          3.1642e-01, 1.9150e-01],
         [4.4580e-01, 1.2637e-01, 1.7299e-01,  ..., 8.6654e-01,
          7.8285e-01, 1.0592e-01],
         [6.4025e-01, 8.7210e-01, 7.7999e-01,  ..., 2.7700e-01,
          2.3263e-01, 4.2536e-01]],

        [[4.3663e-01, 5.4482e-01, 5.7934e-02,  ..., 5.2575e-01,
          6.4258e-01, 5.8679e-01],
         [4.0286e-01, 5.5827e-01, 7.2883e-01,  ..., 2.7344e-01,
          4.2456e-01, 2.4720e-01],
         [2.8393e-01, 7.2880e-01, 6.7231e-01,  ..., 3.6974e-02,
          1.6632e-01, 3.7125e-01],
         ...,
         [2.6654e-01, 3.8918e-01, 3.5493e-01,  ..., 6.
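
To make the connection to **y = Wx + b** concrete, the sketch below recomputes the output of `nn.Linear` by hand from the layer's own `weight` and `bias` parameters. The variable `manual` is our own name, and the tolerance passed to `torch.allclose` is an arbitrary choice to absorb floating-point rounding:

```python
import torch
import torch.nn as nn

sample_x = torch.rand(64, 60, 100)
fc = nn.Linear(100, 50)

# fc.weight has shape (out_features, in_features) = (50, 100); fc.bias has shape (50,).
# The layer therefore computes x @ W.T + b over the last dimension of x.
manual = sample_x @ fc.weight.T + fc.bias

print(fc.weight.size(), fc.bias.size())
print(torch.allclose(fc(sample_x), manual, atol=1e-5))
```

The leading dimensions (batch and sequence) are left untouched; only the last dimension changes from 100 to 50.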

## **Non-linear Transformation**
We can then apply a non-linear transformation to the result of the previous linear transformation, computed as follows:

**h = sigmoid(Wx + b)**

We use the sigmoid activation function in the example below:

In [0]:
sample_x = torch.rand(64, 60, 100)

fc = nn.Linear(100, 50)         # the Linear class handles details such as weight initialization in the background
sig = nn.Sigmoid()

out = fc(sample_x)
out = sig(out)                  # squashes each value into the range (0, 1)

print(out.size())
print(sample_x)

torch.Size([64, 60, 50])
tensor([[[0.8327, 0.4580, 0.8580,  ..., 0.0284, 0.3508, 0.4379],
         [0.4855, 0.5231, 0.9947,  ..., 0.0727, 0.9151, 0.2560],
         [0.7236, 0.3229, 0.3063,  ..., 0.7087, 0.0864, 0.0976],
         ...,
         [0.4146, 0.5652, 0.4103,  ..., 0.5533, 0.5456, 0.4220],
         [0.9939, 0.2359, 0.6844,  ..., 0.6125, 0.5642, 0.7184],
         [0.4461, 0.5078, 0.2558,  ..., 0.6217, 0.9633, 0.5419]],

        [[0.3198, 0.4743, 0.1305,  ..., 0.5555, 0.9426, 0.4458],
         [0.6244, 0.2177, 0.8523,  ..., 0.5752, 0.3186, 0.6287],
         [0.7769, 0.7241, 0.5836,  ..., 0.5176, 0.2847, 0.3039],
         ...,
         [0.6459, 0.2224, 0.0854,  ..., 0.3090, 0.1527, 0.1926],
         [0.9144, 0.8688, 0.3167,  ..., 0.9928, 0.2339, 0.4209],
         [0.5101, 0.6048, 0.5862,  ..., 0.7638, 0.1307, 0.6792]],

        [[0.1397, 0.5883, 0.5324,  ..., 0.5727, 0.5222, 0.0061],
         [0.4897, 0.2426, 0.5369,  ..., 0.1028, 0.9214, 0.2836],
         [0.0210, 0.1467, 0.5592,

There are other popular non-linear transformations, or activation functions, available for use, such as **ReLU** and **tanh**.
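
As a quick illustration of these two activations, the sketch below applies `nn.ReLU` and `nn.Tanh` to a small hand-picked tensor (the input values are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = nn.ReLU()   # max(0, x): negative values become 0, positive values pass through
tanh = nn.Tanh()   # squashes values into the range (-1, 1)

print(relu(x))
print(tanh(x))
```

Both modules are applied element-wise, so they can follow an `nn.Linear` layer in the same way as `nn.Sigmoid` does in the cell above.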

## **Softmax Classifier**
This component of the neural network is called the **classifier**, which is usually in charge of making the final prediction via a normalized representation of the output layer. From the equation below you can see that to get this output we just need to apply a **softmax** function. The values returned will be in the range (0, 1) and sum to 1.

**output = softmax(x)**

In [0]:
m = nn.Softmax(dim=1)
x = torch.randn(4, 5)
out = m(x)
print(out)

tensor([[0.3248, 0.0938, 0.2058, 0.3047, 0.0710],
        [0.1008, 0.0570, 0.3523, 0.1582, 0.3316],
        [0.1321, 0.5767, 0.0290, 0.0691, 0.1931],
        [0.1899, 0.5192, 0.1439, 0.0254, 0.1217]])
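
The claim that the values sum to 1 can be checked directly. The sketch below also recomputes softmax by hand along the same dimension; the name `manual` and the tolerance are our own choices:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 5)
out = nn.Softmax(dim=1)(x)

# Softmax by hand: exponentiate, then normalize along dim=1 (across the 5 classes).
manual = torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)

print(out.sum(dim=1))                         # each row sums to 1, up to rounding
print(torch.allclose(out, manual, atol=1e-6))
print(out.argmax(dim=1))                      # index of the most probable class per row
```

With `dim=1`, each row is normalized independently, which matches the usual layout of one row per example and one column per class.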


#### **Chaining Linear Layers**
Go ahead and try to chain a few linear transformations; make the chain deep if you like. Review the previous notebook to help you build a chain of operations.

Feel free to explore the PyTorch documentation to familiarize yourself with more of the basic linear and non-linear transformations. In addition, try changing the sizes of the Linear layers and combining a series of them.
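
As one possible starting point for this exercise, the sketch below chains three linear layers with `nn.Sequential`. The layer sizes and the choice of activations are arbitrary assumptions, not a recommended architecture:

```python
import torch
import torch.nn as nn

# Each layer's input size must match the previous layer's output size.
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 20),
    nn.Tanh(),
    nn.Linear(20, 5),
    nn.Softmax(dim=-1),   # normalize over the 5 output classes
)

sample_x = torch.rand(64, 60, 100)
out = model(sample_x)

print(out.size())   # torch.Size([64, 60, 5])
```

Changing a layer size only requires keeping adjacent layers consistent, for example replacing 50 with 80 in both the first and second `nn.Linear`.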