Submission Guideline
* Do not clear your outputs. This notebook has no autograder, and we will not download and run your solution.
* You have until 11:59 pm tonight to finish.

In [1]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [2]:
print(torch.cuda.is_available())

True


# Why do we offload computations to the GPU?

In [3]:
random_matrix_a = torch.randn(10000,20000)
random_matrix_b = torch.randn(20000,10000)

In [5]:
%%timeit -n 1 -r 1
#CPU version
out = torch.matmul(random_matrix_a, random_matrix_b)
# out = random_matrix_a @ random_matrix_b
# out = torch.einsum('ij,jk->ik', random_matrix_a, random_matrix_b)
print(out.shape)

torch.Size([10000, 10000])
19.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [6]:
#Move to GPU
random_matrix_a = random_matrix_a.to(device)
random_matrix_b = random_matrix_b.to(device)

In [7]:
%%timeit -n 1 -r 1
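#GPU version (both tensors were moved to `device` in the previous cell)
# Note: CUDA kernels launch asynchronously and this cell never calls
# torch.cuda.synchronize(), so the reported time may under-report the actual compute time.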
out = torch.matmul(random_matrix_a, random_matrix_b)
# out = random_matrix_a @ random_matrix_b
# out = torch.einsum('ij,jk->ik', random_matrix_a, random_matrix_b)
print(out.shape)

torch.Size([10000, 10000])
24.3 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
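Because CUDA work is queued asynchronously, a fair CPU vs. GPU comparison usually waits for the device to finish before stopping the clock. The following is a minimal sketch of that idea, not part of the original notebook: the helper name `timed` and the small matrix sizes are assumptions chosen so the example runs quickly on any machine, and it falls back to the CPU when no GPU is present.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def timed(fn, *args):
    """Run fn(*args) once and return (result, elapsed_seconds)."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish any pending GPU work before starting the clock
    start = time.perf_counter()
    result = fn(*args)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait until the kernel has actually finished
    return result, time.perf_counter() - start

# Small matrices (hypothetical sizes) so the sketch is cheap to run.
a = torch.randn(256, 512, device=device)
b = torch.randn(512, 256, device=device)

out, seconds = timed(torch.matmul, a, b)
print(out.shape, f"{seconds * 1e3:.3f} ms on {device}")
```

The same helper can wrap `torch.inverse` or `torch.mean`; the two `synchronize` calls are skipped on the CPU, where operations already run synchronously.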


In [8]:
del random_matrix_a, random_matrix_b

### 1) Use `%%timeit` to compare the performance of the [inverse](https://docs.pytorch.org/docs/stable/generated/torch.inverse.html) and [mean](https://docs.pytorch.org/docs/stable/generated/torch.mean.html) functions on CPU and GPU.

In [9]:
random_matrix_inv = torch.randn(10000,10000) # create a random square matrix
random_matrix_mean = torch.randn(5,20000) # Compute Mean across rows (5,20000) -> (20000)

In [11]:
%%timeit -n 1 -r 1 
#CPU Inverse 
out_inv_cpu = torch.inverse(random_matrix_inv) 
print(out_inv_cpu.shape)

torch.Size([10000, 10000])
14.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [12]:
# Move random_matrix_inv to GPU
random_matrix_inv = random_matrix_inv.to(device)

In [13]:
%%timeit -n 1 -r 1 
#GPU Inverse 
out_inv_gpu = torch.inverse(random_matrix_inv) 
torch.cuda.synchronize() 
print(out_inv_gpu.shape)

torch.Size([10000, 10000])
352 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
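Timing alone does not confirm that both devices compute the same inverse. As a hedged sanity check (not part of the original exercise), the sketch below inverts a small matrix on the CPU and on `device`, then measures how far `A @ A^{-1}` is from the identity and how far the two inverses are from each other. It uses `torch.linalg.inv`, which `torch.inverse` is documented as an alias for. The size `n = 64` and the added `n * I` term are assumptions that keep the matrix well conditioned and the example fast.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n = 64  # hypothetical small size; the notebook uses 10000
# Adding n * I keeps the random matrix well conditioned (an assumption for this sketch).
a_cpu = torch.randn(n, n) + n * torch.eye(n)
a_dev = a_cpu.to(device)

inv_cpu = torch.linalg.inv(a_cpu)
inv_dev = torch.linalg.inv(a_dev)

# Largest deviation of A @ A^{-1} from the identity matrix.
identity_error = (a_cpu @ inv_cpu - torch.eye(n)).abs().max().item()
# Largest elementwise gap between the CPU result and the device result.
device_gap = (inv_cpu - inv_dev.cpu()).abs().max().item()

print(f"identity error: {identity_error:.2e}, CPU/device gap: {device_gap:.2e}")
```

Small nonzero gaps are expected in float32, since the two backends may order their floating-point operations differently.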


In [14]:
%%timeit -n 1 -r 1 
#CPU Mean 
out_mean_cpu = torch.mean(random_matrix_mean, dim=0) 
print(out_mean_cpu.shape)

torch.Size([20000])
375 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [15]:
#Move random_matrix_mean to GPU
random_matrix_mean = random_matrix_mean.to(device)

In [16]:
%%timeit -n 1 -r 1 
#GPU Mean 
out_mean_gpu = torch.mean(random_matrix_mean, dim=0) 
torch.cuda.synchronize() 
print(out_mean_gpu.shape)

torch.Size([20000])
7.22 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
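In the single-run measurements above, the GPU mean (7.22 ms) is slower than the CPU mean (375 μs). With `-n 1 -r 1`, a tiny operation like this is likely dominated by one-time and per-launch overhead rather than by the arithmetic itself. One way to get a steadier number is `torch.utils.benchmark.Timer`, which performs a warm-up and handles CUDA synchronization. This is a sketch under assumptions: the helper name `mean_seconds` and the run count of 100 are illustrative choices, not part of the exercise.

```python
import torch
import torch.utils.benchmark as benchmark

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x_cpu = torch.randn(5, 20000)
x_dev = x_cpu.to(device)

def mean_seconds(x, runs=100):
    """Average seconds per call of torch.mean(x, dim=0) over `runs` calls."""
    timer = benchmark.Timer(
        stmt="torch.mean(x, dim=0)",
        globals={"torch": torch, "x": x},
    )
    return timer.timeit(runs).mean

cpu_s = mean_seconds(x_cpu)
dev_s = mean_seconds(x_dev)
print(f"CPU: {cpu_s * 1e6:.1f} us per call, {device}: {dev_s * 1e6:.1f} us per call")
```

Even with warm-up, a reduction over only 100,000 elements may stay faster on the CPU, because there is too little work to amortize the cost of launching a GPU kernel.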


### 2) Complete the following code-blocks to generate a classifier for predicting 1000 classes.

Assume you have input features $\in \mathbb{R}^{512}$. Complete the shapes of the weight matrices and the einsum strings so that the feature sizes are [1024, 2048, 1000] for your three-layer network with no bias and ReLU activations.

In [28]:
input_features = torch.randn(64,1024,512).to(device) # batch size of 64, 1024 pixels, 512 features
weight_1 = torch.randn(512, 1024).to(device)
weight_2 = torch.randn(1024, 2048).to(device)
weight_3 = torch.randn(2048, 1000).to(device)

In [30]:
layer_1 = torch.einsum('bpf, fk -> bpk', input_features, weight_1)
layer_1 = torch.relu(layer_1) # Suppress Negative Values
layer_2 = torch.einsum('bpf, fk -> bpk', layer_1, weight_2)
layer_2 = torch.relu(layer_2) # Suppress Negative Values
output = torch.einsum('bpf, fk -> bpk', layer_2, weight_3)
output = output.mean(dim=1) # Average All Pixels (64,1024,1000) -> (64,1000)
print(output.shape)

torch.Size([64, 1000])
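The einsum string `'bpf, fk -> bpk'` contracts the last axis of the input with the first axis of the weight, which is what a plain batched matrix product does. The sketch below checks that equivalence on a scaled-down version of the network. The dimensions (16 → 32 → 64 → 10), the 0.1 weight scale, and the function names are hypothetical choices made so the example runs quickly on the CPU.

```python
import torch

torch.manual_seed(0)

# Scaled-down stand-ins for the (64, 1024, 512) input and the three weight matrices.
x = torch.randn(4, 8, 16)        # (batch, pixels, features)
w1 = torch.randn(16, 32) * 0.1   # features 16 -> 32
w2 = torch.randn(32, 64) * 0.1   # features 32 -> 64
w3 = torch.randn(64, 10) * 0.1   # features 64 -> 10 classes

def forward_einsum(x):
    h = torch.relu(torch.einsum("bpf,fk->bpk", x, w1))
    h = torch.relu(torch.einsum("bpf,fk->bpk", h, w2))
    out = torch.einsum("bpf,fk->bpk", h, w3)
    return out.mean(dim=1)       # average over pixels: (4, 8, 10) -> (4, 10)

def forward_matmul(x):
    h = torch.relu(x @ w1)       # @ broadcasts the 2-D weight over the batch axis
    h = torch.relu(h @ w2)
    return (h @ w3).mean(dim=1)

out_einsum = forward_einsum(x)
out_matmul = forward_matmul(x)
print(out_einsum.shape, torch.allclose(out_einsum, out_matmul, rtol=1e-4, atol=1e-4))
```

Each weight has shape `(in_features, out_features)`, so the output feature size of one layer has to match the first dimension of the next weight, exactly as in the 512 → 1024 → 2048 → 1000 chain above.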
