# This is part 1 of a tutorial series that will get you up and running in PyTorch.

In this section we will cover 5 major things:
- Tensors (making them, using them, converting from numpy)
- nn.Sequential for basic models
- Defining our own model class
- Running on GPU
- Running on multiple GPUs (very basic introduction)


This tutorial is a high-level overview to get you started as quickly as possible. If you need or want more in-depth descriptions of anything discussed here, check out the official [PyTorch guides](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html).

In [18]:
import torch # pytorch generally
import torch.nn as nn # module used for neural networks
import numpy as np

First, we're going to start by defining a few tensors. You can think of a tensor as a type of array specific to the PyTorch library. Almost all operations are done through tensors, so it's best to get used to working with them over numpy arrays; however, there are easy ways to convert between the two.

In [5]:
x = torch.randn(1,5) # Define a tensor of random numbers of shape (1,5)
print(x.shape)
y = torch.zeros(5,1) # Define a tensor of all zeros of shape (5,1)
print(y.shape)

torch.Size([1, 5])
torch.Size([5, 1])


Additionally, every tensor has a data type, with the default being float32. When initializing a tensor you can usually specify a data type by passing in a dtype kwarg. Alternatively, you can convert it after the tensor has been created.

In [11]:
print(x.dtype) # float32 by default
x = x.double() # returns a new tensor; it does not change x in place!
print(x.dtype)
y = y.double()

torch.float32
torch.float64
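
To make the dtype kwarg mentioned above concrete, here is a minimal sketch. The variable names are made up for illustration; the point is just that the dtype can be chosen at creation time or converted afterwards.

```python
import torch

# dtype passed as a keyword argument at creation time
counts = torch.zeros(2, 3, dtype=torch.int64)
weights = torch.ones(2, 3, dtype=torch.float64)

# converting afterwards returns a new tensor and leaves the original untouched
counts_as_float = counts.to(torch.float32)

print(counts.dtype, weights.dtype, counts_as_float.dtype)
```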


Now that we have 2 tensors, we can do some simple operations on them.

In [12]:
z = torch.matmul(x,y)
print(z)

tensor([[0.]], dtype=torch.float64)
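
Since the result above is all zeros, it is hard to see what happened. As a small sketch with made-up values, here is the same matrix product next to a couple of element-wise operations, which follow numpy-style broadcasting.

```python
import torch

row = torch.ones(1, 5)              # shape (1, 5)
col = torch.full((5, 1), 2.0)       # shape (5, 1)

product = torch.matmul(row, col)    # matrix product, shape (1, 1)
same_product = row @ col            # the @ operator is shorthand for matmul
outer = row * col                   # element-wise with broadcasting, shape (5, 5)
total = row + col.t()               # transpose col to (1, 5), then add element-wise

print(product.shape, outer.shape, total.shape)
```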


Lastly, let's go over the two ways to convert a numpy array into a tensor.

In [15]:
a = np.array([1,2,3,4,5])

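# Convert by first converting back to a Python list
# not advised for all situations: it copies the data and torch.Tensor always gives the default float type (float32)
a_as_tensor = torch.Tensor(a.tolist())

# built-in helper function for this exact task. Use this as your first choice
# (it keeps the numpy dtype and shares memory with the original array)
a_as_tensor = torch.from_numpy(a)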
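
One thing worth knowing about torch.from_numpy is that the tensor and the array share the same memory. The sketch below (variable names are just for illustration) contrasts it with torch.tensor, which makes a copy, and also shows the way back to numpy.

```python
import numpy as np
import torch

a = np.array([1, 2, 3, 4, 5], dtype=np.int64)

shared = torch.from_numpy(a)    # shares memory with `a`, keeps int64
copied = torch.tensor(a)        # makes an independent copy

a[0] = 100                      # modify the numpy array in place

print(shared[0].item())         # the shared tensor sees the change
print(copied[0].item())         # the copy does not

back_to_numpy = copied.numpy()  # tensor -> numpy array (works for CPU tensors)
```

If you plan to keep modifying the numpy array afterwards, the copying version is the safer choice.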

Tensors have just about the same available operations as a numpy array, so I'm not going into much depth on usage, as I assume the reader has some experience working with numpy. For a full list of available methods go [here](https://pytorch.org/docs/stable/torch.html#tensors).

### Now, we're going to make our first model

We will be working with random data, as the goal here is not to make a true working neural network but rather to get a feel for the syntax and design choices of the PyTorch framework. Since we work in NLP and most of our models end up being variants of recurrent neural networks (RNNs), the examples here will use the Gated Recurrent Unit (GRU) cell. For those unfamiliar, a very general description of an RNN is a neural network that applies the same cell at each of the n steps of a sequence, where each step is given a new input vector as well as a secondary input vector, the hidden state, from the previous step (i.e. recurrent). This is greatly helpful for most sequence modeling tasks.
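
To make the recurrence concrete, here is a rough sketch of the idea using a plain tanh update rather than the GRU's gated update. The weight matrices W_in and W_rec are random stand-ins invented for this illustration, not anything a real layer would expose under those names.

```python
import torch

torch.manual_seed(0)
seq_len, input_size, hidden_size = 10, 50, 35

# hypothetical weights for a plain tanh RNN (a GRU adds gates on top of this idea)
W_in = torch.randn(input_size, hidden_size) * 0.1
W_rec = torch.randn(hidden_size, hidden_size) * 0.1

sequence = torch.randn(seq_len, input_size)   # one sequence of 10 input vectors
hidden = torch.zeros(hidden_size)             # initial hidden state

states = []
for x_t in sequence:
    # each step combines the new input with the hidden state from the previous step
    hidden = torch.tanh(x_t @ W_in + hidden @ W_rec)
    states.append(hidden)

states = torch.stack(states)                  # shape (seq_len, hidden_size)
print(states.shape)
```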

In [39]:
num_examples = 5
seq_len = 10
input_size = 50
# 5 examples, each with sequence length of 10, with each sequence element being 50-dimensional
data = torch.randn(num_examples, seq_len, input_size) 
labels = torch.zeros(num_examples) # all labels are zero because why not


# Model Hyperparams
hidden_size = 35 # number of hidden units in our rnn layer
dropout_rate = 0.0 # dropout probability (0-1)
bidirectional = False # if we want a bidirectional RNN
stacked_layers = 1 # number of RNNs we want to stack

# we need to define this function to get GRU to work in the sequential layout
# GRU returns a tuple of (output, hidden_state) but the final linear layer only expects 1 value
# so we just drop the final hidden state in this forward definition
class DropHidden(nn.Module):
    def forward(self, x):
        output, hidden = x[0], x[1]
        return output

# nn.Sequential is a quick way to define a forward pass of your data as a chain of neural network layers
# It's not advised to use this for anything more than basic prototyping, as you're restricted to
# a purely declarative approach where each layer's output feeds straight into the next
model = nn.Sequential(
    nn.GRU(input_size, hidden_size, batch_first=True), # batch_first means first dim is size of batch
    DropHidden(),
    nn.Linear(hidden_size, 1) # fully connected layer to do a regression
)
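
To see why the extra layer is needed, it helps to look at what nn.GRU actually returns. This is a small sketch using the same sizes as above; the final linear head on the last time step is just one possible way to get a single prediction per sequence, not the only one.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=50, hidden_size=35, batch_first=True)
batch = torch.randn(5, 10, 50)             # (batch, seq_len, input_size)

output, hidden = gru(batch)                # a tuple, which nn.Linear cannot consume directly
print(output.shape)                        # torch.Size([5, 10, 35])
print(hidden.shape)                        # torch.Size([1, 5, 35])

# for a single-layer, unidirectional GRU the last time step of `output`
# holds the same values as the final hidden state
last_step = output[:, -1, :]
matches = torch.allclose(last_step, hidden[-1])

head = nn.Linear(35, 1)
per_sequence_prediction = head(last_step)  # shape (5, 1): one value per sequence
```

Note that applying nn.Linear to the full output, as the Sequential model does, gives one value per time step, i.e. shape (5, 10, 1).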

### How to train your model

PyTorch offers some helpful classes for organizing your train/test data that remove a lot of boilerplate for looping, batching, etc. These are called Dataset and DataLoader. Essentially, a Dataset organizes your data into pairs of (example, label), and a DataLoader handles splitting the Dataset into batches and shuffling it.

In [40]:
import torch.utils.data as utils

train_data = utils.TensorDataset(data, labels)
train_dataloader = utils.DataLoader(train_data, batch_size=1, shuffle=True, drop_last=True)
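
As a quick sketch of what these two classes hand back, the snippet below rebuilds the same random data and uses a batch size of 2 (an arbitrary choice for illustration) so the effect of drop_last is visible.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

data = torch.randn(5, 10, 50)
labels = torch.zeros(5)
dataset = TensorDataset(data, labels)

# indexing the dataset returns one (example, label) pair
example, label = dataset[0]

# 5 examples with batch_size=2 and drop_last=True give 2 full batches;
# the leftover example is dropped
loader = DataLoader(dataset, batch_size=2, shuffle=True, drop_last=True)
for inputs, targets in loader:
    print(inputs.shape, targets.shape)

num_batches = len(loader)
```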


### Additionally, we need to define 2 very important aspects of our network

First, we need to define the optimizer we are going to use to update our network parameters. The most common of these is stochastic gradient descent, but from experience I find [Adam](https://arxiv.org/abs/1412.6980) to work well for RNNs. Second, we need to pick a loss function. A loss function is essentially what tells your network how 'wrong' or 'right' it is when performing a prediction, and it is the starting point for backpropagation through the network to update the parameters. The problem laid out in this tutorial is a simple regression, so we will stick with mean squared error.

In [41]:
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=.0001)
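
If mean squared error is new to you, this tiny sketch with made-up numbers shows that nn.MSELoss is just the average of the squared differences between prediction and target.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
prediction = torch.tensor([1.0, 2.0, 4.0])
target = torch.tensor([1.0, 1.0, 2.0])

loss = criterion(prediction, target)

# the same thing by hand: squared errors are 0, 1 and 4, so the mean is 5/3
manual = ((prediction - target) ** 2).mean()

print(loss.item(), manual.item())
```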

Time to write our first training loop!

In [48]:
def train_model(model):
    # very important to put model into 'training mode'
    # Turns ON dropout and a few other aspects
    model.train()
    total_loss = 0
    
    for inputs, label in train_dataloader:
        optimizer.zero_grad() # clear gradients left over from the previous batch
        output = model(inputs) # pass to our defined model
        
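        # calculate loss; output is (batch, seq_len, 1) while label is (batch,), so PyTorch
        # broadcasts the label across time steps and warns about the size mismatch
        batch_loss = criterion(output, label)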
        total_loss += batch_loss.item() # get numeric value to track loss (print per epoch/etc)
        batch_loss.backward() # trigger backprop
        optimizer.step() # update network based on backprop results
        
    print(total_loss / len(train_dataloader)) # print average loss across all batches
    
train_model(model) # run the function

0.0504742007702589
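
The loop above relies on the global optimizer and criterion, which is easy to trip over once you swap in a different model. As one possible variant (the function name and the toy linear model are hypothetical, chosen only to keep the sketch short), you can pass everything in explicitly and call the function once per epoch.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

def train_one_epoch(model, dataloader, criterion, optimizer):
    """Run one pass over the data and return the average batch loss."""
    model.train()
    total_loss = 0.0
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

# toy setup (hypothetical): a linear model on random inputs with all-zero targets
toy_model = nn.Linear(50, 1)
toy_loader = DataLoader(TensorDataset(torch.randn(5, 50), torch.zeros(5, 1)), batch_size=1)
toy_criterion = nn.MSELoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.0001)

# one average loss value per epoch
history = [train_one_epoch(toy_model, toy_loader, toy_criterion, toy_optimizer) for _ in range(3)]
print(history)
```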


### nn.Sequential clearly has its limits

As we saw, nn.Sequential is great for prototyping, but it already shows its limitations (it doesn't even work natively with RNNs without custom code!). The suggested approach in PyTorch is to define your own class for your model. You start by subclassing nn.Module and defining the layers of your network in the constructor. Once your layers are defined, the key function to implement is forward, which does an entire forward pass of your network, similar to what nn.Sequential did for us. If you need more fine-grained control over your model, for example a custom calculation at each step of the sequence, you can implement a step function and write a loop in forward that calls it.

In [85]:
class myGRU(nn.Module):
    
    def __init__(self, input_size, hidden_size, drop=0, bidirect=False, batch_first=True, layers=1):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers=layers, dropout=drop,
                          batch_first=batch_first, bidirectional=bidirect)
        # a bidirectional GRU concatenates both directions, doubling the output size
        self.num_directions = 2 if bidirect else 1
        self.linear = nn.Linear(hidden_size * self.num_directions, 1)
        
        self.bidirect = bidirect
        self.batch_first = batch_first
        self.num_layers = layers
        self.input_size = input_size
        self.hidden_size = hidden_size
    
    def forward(self, inputs, hidden=None):
        if hidden is None:
            # the position of the batch dimension depends on batch_first
            batch_size = inputs.size(0) if self.batch_first else inputs.size(1)
            hidden = self.init_hidden(batch_size)
            
        output, hidden = self.gru(inputs, hidden)
        return self.linear(output)
    
    def init_hidden(self, batch_size=1):
        # new_zeros creates the tensor with the same dtype and device as the model weights
        weight = next(self.parameters())
        return weight.new_zeros(self.num_layers * self.num_directions, batch_size, self.hidden_size)
        


In [86]:
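model = myGRU(input_size, hidden_size)
# the old optimizer still points at the previous model's parameters, so create a new one
optimizer = torch.optim.Adam(model.parameters(), lr=.0001)
train_model(model)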

0.026621105894446374
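
For the step-function approach mentioned earlier, here is a minimal sketch built on nn.GRUCell, which processes a single time step at a time. The class name StepGRU and its layout are hypothetical; treat it as one way to structure such a model rather than the recommended one.

```python
import torch
import torch.nn as nn

class StepGRU(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.linear = nn.Linear(hidden_size, 1)
        self.hidden_size = hidden_size

    def step(self, step_input, hidden):
        # one move forward in the sequence; custom per-step logic would go here
        return self.cell(step_input, hidden)

    def forward(self, inputs):
        # inputs: (batch, seq_len, input_size)
        batch_size, seq_len, _ = inputs.shape
        hidden = inputs.new_zeros(batch_size, self.hidden_size)
        for t in range(seq_len):
            hidden = self.step(inputs[:, t, :], hidden)
        # predict from the final hidden state: one value per sequence
        return self.linear(hidden)

step_model = StepGRU(50, 35)
prediction = step_model(torch.randn(5, 10, 50))
print(prediction.shape)
```

The explicit loop is slower than letting nn.GRU handle the whole sequence, so it is only worth it when you really need to do something at each step.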


### All the above code was running on the CPU, what about GPU?

At some point you'll create a model that takes far too long to train on the CPU alone. If you're running for 50 epochs and each epoch takes an hour, that's just not viable. PyTorch offers a very simple way to move your code to the GPU for a massive speed boost. As long as you have an NVIDIA GPU and CUDA configured, this should work easily (Hercules, as a server, is already configured, so running there is fine).

In [63]:
# first check to make sure pytorch recognizes your GPU
if torch.cuda.is_available():
    print('yay')
else:
    print('boo')
    
    
# First move your model, criterion to cuda
if torch.cuda.is_available():
    model.cuda()
    criterion.cuda()
    # need to update optimizer with the params that are now ON the GPU
    optimizer = torch.optim.Adam(model.parameters(), lr=.0001)
    
def train_model_cuda(model):
    model.train()
    total_loss = 0
    
    for inputs, label in train_dataloader:
        optimizer.zero_grad() 
        
        if torch.cuda.is_available():
            inputs = inputs.cuda()
            label = label.cuda()
        
        output = model(inputs) 
        
        batch_loss = criterion(output, label) # output is already on the same device as the model
        total_loss += batch_loss.item()
        batch_loss.backward() 
        optimizer.step()
    
    print(total_loss / len(train_dataloader))
        
train_model_cuda(model)

yay
0.03352753575891256
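
An alternative to sprinkling is_available checks through the code is to pick a device once and move everything to it with .to(device). The sketch below uses a small stand-in linear model rather than the GRU from above; it runs the same way on a machine with or without a GPU.

```python
import torch
import torch.nn as nn

# pick the device once; falls back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

small_model = nn.Linear(50, 1).to(device)   # stand-in model for illustration
inputs = torch.randn(4, 50).to(device)      # data must live on the same device as the model

output = small_model(inputs)                # the output is created on that device too
print(output.device)
```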


If for some reason you want to run your code on a specific GPU the easiest way to do that is to just set an environment variable when running your code. 

CUDA_VISIBLE_DEVICES=2 python3 my_nn_code.py

The above code would force your python file to only see GPU with ID 2 and would run your code there. This is helpful if another GPU is already full or you just want to run a lot of models with different parameters to see what gives better results. 
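
If you cannot control the command line, the same variable can be set from inside Python, as sketched below. The assumption here is that this runs at the very top of a fresh script: it has to happen before CUDA is initialized, so it will not help in a notebook where torch has already been using the GPU.

```python
import os

# must happen before CUDA is initialized, so set it before importing torch
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch

# at most one GPU is visible now; inside this process it is addressed as cuda:0
visible_gpus = torch.cuda.device_count()
print(visible_gpus)
```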

### What if my model is taking a long time even on the GPU? Can I use a distributed setup?

Yes, you can parallelize your code across multiple GPUs, but this is a more advanced topic. PyTorch offers some easy ways to get at least some parallelization out of your code, but to some extent it comes down to you as the programmer. If you're interested, I recommend checking out this [guide](https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html).
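
As a very rough starting point, the simplest option is to wrap the model in nn.DataParallel, which splits each batch across the GPUs on a single machine. The sketch below uses a stand-in linear model and only wraps it when more than one GPU is visible, so it also runs on a CPU-only machine. For serious multi-GPU training the PyTorch documentation generally recommends DistributedDataParallel instead, which needs more setup than fits here.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

parallel_model = nn.Linear(50, 1)   # stand-in model for illustration
if torch.cuda.device_count() > 1:
    # each forward pass splits the batch along dim 0 across the visible GPUs
    parallel_model = nn.DataParallel(parallel_model)
parallel_model = parallel_model.to(device)

batch = torch.randn(8, 50, device=device)
output = parallel_model(batch)      # results are gathered back into one tensor
print(output.shape)
```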