# Introduction to Autograd

**"Central to all neural networks in PyTorch is the autograd package."** -- The PyTorch website


As much as we've presumably enjoyed taking derivatives so far, it turns out that there's an easier way.

In [1]:
import torch
x = torch.tensor(3., requires_grad=True)
y = x**2
y.backward()   # computes partial derivatives of y with respect to ancestors
print(x.grad)  # dy/dx = 2*x

tensor(6.)


What, torch could do derivatives this whole time? Yes. It can even do it over vectors and general tensors:

In [2]:
x = torch.tensor([3., 4.], requires_grad=True)
y = (x**2).sum()
y.backward()   # computes partial derivatives of y with respect to ancestors
print(x.grad)  # dy/dx1 = 2*x1; dy/dx2 = 2*x2

tensor([6., 8.])


**Exercise**: What will the following code produce?

In [3]:
x = torch.tensor([3., 4.], requires_grad=True)
y = (2*x**3).mean()
y.backward()
print(x.grad)

tensor([27., 48.])


It can also use the Chain Rule to compute more distant partial derivatives.

In [4]:
import torch
x = torch.tensor([[1.,2.],[3.,4.]], requires_grad=True)
y = x + 2
z = y * y * 3
out = z.sum()
out.backward()
print(x.grad)

tensor([[18., 24.],
        [30., 36.]])


To activate autograd (automatic gradient) for a particular tensor, you simply need to set ```requires_grad=True``` when constructing the tensor. This ensures that the tensor will keep track of any operations that involve itself or its descendants in a computation graph (i.e., causal diagram), such that it can compute any (reasonable) partial derivatives of its descendants with respect to itself. Recall the following problem, in which we examine how the area of this right triangle changes as we change $\theta$, assuming that the base $b$ remains constant (and the triangle remains a right triangle).

![right triangle](./img/triangle.png)

Previously, we set up the following causal diagram to depict this situation:

![causal diagram](./img/trianglegraph.png)

In the causal model, the structural equations were:

    c = b / cos(theta)
    h = c * sin(theta)
    A = (b * h) / 2

And we determined that the partial derivative of ```A``` with respect to ```theta``` was:

In [5]:
def our_custom_made_gradient(b, theta):
    return (b**2/2) / math.cos(theta)**2

Now let's do this with torch and autograd!

In [6]:
import math
theta = torch.tensor(math.pi/4., requires_grad=True)
b = torch.tensor(4.)
c = b / torch.cos(theta)
h = c * torch.sin(theta)
a = (h * b) / 2.
a.backward()

print('Our custom solution: {:.1f}'.format(our_custom_made_gradient(b, theta)))
print("Autograd's solution: {:.1f}".format(theta.grad))


Our custom solution: 16.0
Autograd's solution: 16.0


Cool. So guess we never have to do math again.

Notice that since we aren't interested in any partial derivatives with respect to ```b```, we don't bother to set ```requires_grad=True``` for that tensor. It wouldn't hurt, but it does use additional memory and makes things a bit slower.

But what happens if we don't set ```requires_grad=True``` for ```theta```?

In [7]:
theta = torch.tensor(math.pi/4.)
b = torch.tensor(4.)
c = b / torch.cos(theta)
h = c * torch.sin(theta)
a = (h * b) / 2.
a.backward()
print(theta.grad)

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

You get a bit of an ugly error, but now you know what it'll look like, just in case you get it in the future (which you probably will at some point). It's also worth exploring a bit how the ```requires_grad``` flag gets propagated to descendants and copies of a tensor.

In [8]:
theta = torch.tensor(math.pi/4., requires_grad=True)
b = torch.tensor(4.)
c = b / torch.cos(theta)
h = c * torch.sin(theta)
a = (h * b) / 2.
print('theta: {}'.format(theta.requires_grad))
print('b: {}'.format(b.requires_grad))
print('c: {}'.format(c.requires_grad))
print('h: {}'.format(h.requires_grad))
print('a: {}'.format(a.requires_grad))

theta: True
b: False
c: True
h: True
a: True


If a tensor has an explicit ```requires_grad==True```, then all of its descendants in the causal diagram also do (but not siblings, etc.) Clones also acquire the property:

In [9]:
theta = torch.tensor(math.pi/4., requires_grad=True)
theta_clone = theta.clone()
print(theta_clone.requires_grad)

True


But the actual behavior of clones may not be what you might expect.

In [10]:
a = torch.tensor(4., requires_grad=True)
b = a.clone()
c = b ** 2
c.backward()
print('dc/db = {}'.format(b.grad))
print('dc/da = {}'.format(a.grad))

dc/db = None
dc/da = 8.0


Note that PyTorch isn't really treating ```b``` as a real part of the computational graph, just as sort of a standin for ```a```. Often, when we're cloning a tensor, we'll really want to detach it from the original tensor's causal diagram, and then build a new causal diagram for the clone. To do this, we need the rather more verbose ```.clone().detach().requires_grad_(True)```.

In [11]:
a = torch.tensor(4., requires_grad=True)
b = a.clone().detach().requires_grad_(True)
c = b ** 2
c.backward()
print('dc/db = {}'.format(b.grad))
print('dc/da = {}'.format(a.grad))

dc/db = 8.0
dc/da = None


In Project 3, we implemented logistic regression. Let's try to use autograd instead of explicitly computing the gradient ourselves. First, we need some data. Let's just create some data using the "underweight disease" example from class.

In [12]:
from random import randint
import pandas as pd
import torch
def create_simple_data(how_many):
    def create_datum():
        height = randint(40,70) / 10.
        mass = randint(100,300)
        overweight = mass - 45*height
        underweight = 40*height - mass - 120
        if underweight > 0:
            response = 1
        else:
            response = 0
        datum = {'offset': 1, 'height': height, 'mass': mass,
                 'response': response }
        return datum
    def create_datum_with_response(response):
        datum = create_datum()
        while(datum['response'] == response):
            datum = create_datum()
        return datum

    data = []
    for i in range(how_many // 2):
        data.append(create_datum_with_response(0))
        data.append(create_datum_with_response(1))
    return pd.DataFrame(data)

simple_train = create_simple_data(200)
simple_test = create_simple_data(100)
print(simple_train[:10])

   offset  height  mass  response
0       1     6.7   137         1
1       1     5.4   168         0
2       1     7.0   121         1
3       1     4.7   190         0
4       1     6.3   106         1
5       1     6.6   189         0
6       1     6.3   116         1
7       1     6.2   224         0
8       1     6.3   128         1
9       1     6.3   263         0


Let's also create some functions for getting the evidence matrix and the response vector from this Pandas dataframe.

In [13]:
def evidence_matrix(dataframe):
    """
    Gets the evidence matrix (as a Torch tensor) from the 
    Pandas dataframe.
   
    """
    columns = list(dataframe.columns)
    if 'Unnamed: 0' in columns:
        columns.remove('Unnamed: 0')
    columns.remove('response')
    return torch.from_numpy(dataframe[columns].values)
    
def response_vector(dataframe):
    """
    Gets the response vector (as a Torch tensor) from the 
    Pandas dataframe.
    
    """
    return torch.from_numpy(dataframe['response'].values)  

print("DATAFRAME:")
print(simple_train[:5])
print("\nEVIDENCE MATRIX:")
print(evidence_matrix(simple_train[:5]))
print("\nRESPONSE VECTOR:")
print(response_vector(simple_train[:5]))

DATAFRAME:
   offset  height  mass  response
0       1     6.7   137         1
1       1     5.4   168         0
2       1     7.0   121         1
3       1     4.7   190         0
4       1     6.3   106         1

EVIDENCE MATRIX:
tensor([[  1.0000,   6.7000, 137.0000],
        [  1.0000,   5.4000, 168.0000],
        [  1.0000,   7.0000, 121.0000],
        [  1.0000,   4.7000, 190.0000],
        [  1.0000,   6.3000, 106.0000]], dtype=torch.float64)

RESPONSE VECTOR:
tensor([1, 0, 1, 0, 1])


Next, we'll reuse some of the code from Project 3 for training a logistic regression model.

In [14]:
MAX_STEPS = 5000 # do not change
PRECISION = 0.0000001  # do not change

from descent import Environment, adagrad
from logistic import LogisticRegressionModel

def train_logistic_regression(data, task_factory):
    """
    Trains a logistic regression model from a given Pandas DataFrame.
    
    The function returns a trained LogisticRegressionModel.
    
    """  
    X = evidence_matrix(data)
    y = response_vector(data)
    task = task_factory(X, y)
    steps = adagrad(0.9, task)
    result = steps[-1]
    return LogisticRegressionModel(result.double())

class LogisticRegressionTask(Environment):
    
    def __init__(self, X, y):
        Environment.__init__(self, 
                             torch.ones(X.shape[1], dtype=torch.float64), 
                             PRECISION, MAX_STEPS)  # do not change
        self.X = X.double()
        self.y = y.double()
    
    def gradient(self, w):
        Xt = torch.t(self.X)
        Xw = torch.mv(self.X, w.double())
        result = torch.mv(Xt, torch.sigmoid(Xw) - self.y)
        return result

lr = train_logistic_regression(simple_train, LogisticRegressionTask)
print('Accuracy of trained model on train data: {:.3f}'.format(
    lr.evaluate(evidence_matrix(simple_train), response_vector(simple_test))))
print('Accuracy of trained model on test data: {:.3f}'.format(
    lr.evaluate(evidence_matrix(simple_test), response_vector(simple_test))))

Accuracy of trained model on train data: 0.980
Accuracy of trained model on test data: 0.980


Now, instead of figuring out the gradient ourselves and then expressing it in code (as done in the above .gradient method), could we just use autograd? Let's try:

**Exercise:** Create an alternative version of ```LogisticRegressionTask``` (called ```LogisticRegressionTaskAutograd``` that computes its gradient using autograd.

In [25]:
class LogisticRegressionTaskAutograd(Environment):
    
    def __init__(self, X, y):
        w_size = X.shape[1]
        Environment.__init__(self, 
                             torch.ones(w_size, requires_grad=True, 
                                        dtype=torch.float64), 
                             PRECISION, MAX_STEPS)  # do not change
        self.X = X.double()
        self.y = y.double()
    
    def gradient(self, w):
        #L = <1-y, xw> - log(sigm(xw))
        wtilde = w.clone().detach().requires_grad_(True)
        xw = torch.mv(self.X.clone().detach(), wtilde)
        d = torch.dot(1-self.y.clone().detach(), xw)
        lg = torch.log(torch.sigmoid(xw))
        L = torch.sum(d - lg)
        L.backward()
        return wtilde.grad
    
lr = train_logistic_regression(simple_train, LogisticRegressionTaskAutograd)
print('Accuracy of trained model on train data: {:.3f}'.format(
    lr.evaluate(evidence_matrix(simple_train), response_vector(simple_test))))
print('Accuracy of trained model on test data: {:.3f}'.format(
    lr.evaluate(evidence_matrix(simple_test), response_vector(simple_test))))

Accuracy of trained model on train data: 0.500
Accuracy of trained model on test data: 0.500


For fun, we can check that the disease that affects **underweight and overweight** subjects cannot be modeled well by logistic regression. First, let's create that data:

In [26]:
from random import randint
import pandas as pd
def create_xor_data(how_many):
    def create_datum():
        height = randint(40,70) / 10.
        mass = randint(100,300)
        income = randint(20,200)
        overweight = mass - 45*height
        underweight = 40*height - mass - 120
        if underweight > 0 or overweight > 0:
            response = 1
        else:
            response = 0
        datum = {'offset': 1, 'height': height, 'mass': mass, 
                 'response': response }
        return datum
    data = []
    for i in range(how_many // 2):
        datum = create_datum()
        while(datum['response'] == 0):
            datum = create_datum()
        data.append(datum)
        datum = create_datum()
        while(datum['response'] == 1):
            datum = create_datum()
        data.append(datum)       
    return pd.DataFrame(data)

xor_train = create_xor_data(1000)
xor_test = create_xor_data(1000)
print(xor_train[:10])

   offset  height  mass  response
0       1     4.4   206         1
1       1     5.9   128         0
2       1     4.9   236         1
3       1     6.3   164         0
4       1     4.3   267         1
5       1     6.5   203         0
6       1     5.8   294         1
7       1     5.7   245         0
8       1     6.4   291         1
9       1     6.4   213         0


Then, let's run logistic regression (with autograd!) on it, and watch it fail.

In [27]:
lr = train_logistic_regression(xor_train, LogisticRegressionTaskAutograd)
print('Accuracy of trained model on train data: {:.3f}'.format(
    lr.evaluate(evidence_matrix(xor_train), response_vector(xor_test))))
print('Accuracy of trained model on test data: {:.3f}'.format(
    lr.evaluate(evidence_matrix(xor_test), response_vector(xor_test))))

Accuracy of trained model on train data: 0.500
Accuracy of trained model on test data: 0.500


**Exercise:** Encode the example two-layer neural network from the lecture notes that can successfully model the "underweight and overweight disease" example, and check its gradients for correct and incorrect responses.

In [53]:
from torch import tensor
x = tensor([1, 6.6, 120], dtype=torch.float64)
w1 = tensor([[-120.,    0., 1],
             [  40.,  -45., 0],
             [  -1.,    1., 0]], dtype=torch.float64, requires_grad=True)
pi1 = torch.mv(w1.t(),x)
x1 = torch.clamp(pi1, min=0)
w2 = tensor([[ 1.],
             [ 1.],
             [-1.]], dtype=torch.float64, requires_grad=True)
pi2 = torch.mv(w2.t(), x1)

#y = tensor(1.) 
y = tensor(0.) # try me instead!
loss = torch.sum((1-y.to(torch.float64)* pi2) - torch.log(torch.sigmoid(pi2)))
loss.backward()
print(w1.grad)
print(w2.grad)

tensor([[-1.0262e-10,  0.0000e+00,  1.0262e-10],
        [-6.7728e-10,  0.0000e+00,  6.7728e-10],
        [-1.2314e-08,  0.0000e+00,  1.2314e-08]], dtype=torch.float64)
tensor([[-2.4629e-09],
        [ 0.0000e+00],
        [-1.0262e-10]], dtype=torch.float64)
