# Adding Two Scalars

In [1]:
import numpy as np
import theano.tensor as T
from theano import function
import theano 

In Theano, all symbols must be _typed_. In particular, below we will use `T.dscalar` as the type for a "0-dimensional array (scalar) of type double". This is a Theano `type`. 

In [2]:
x = T.dscalar('x')
y = T.dscalar('y')

Note that `dscalar` is _not_ a class, so `x` and `y` are not instances of `dscalar`. They are instances of `TensorVariable`. However, `x` and `y` are assigned the Theano Type `dscalar` in their `type` field, as you can see below:

In [3]:
type(x)

theano.tensor.var.TensorVariable

In [4]:
x.type

TensorType(float64, scalar)

In [5]:
T.dscalar

TensorType(float64, scalar)

In [6]:
x.type is T.dscalar

True

By calling `T.dscalar` with a string argument, you create a _Variable_ representing a floating-point scalar quantity with the given name. If you provide no argument, the symbol will be unnamed. Names are not required, but they often help with debugging.

Next, we can combine `x` and `y` into their sum, `z`:

In [7]:
z = x + y

`z` is yet another variable, which represents the addition of `x` and `y`. You can use the `pp` function to pretty-print the computation associated with `z`:

In [8]:
from theano import pp
print(pp(z))

(x + y)


The last step is to create a function taking `x` and `y` as inputs, and giving `z` as output: 

In [9]:
f = function([x, y], z)

In [10]:
f(2, 3)

array(5.)

Note that there is a slight delay when executing the `function` call. This is because, behind the scenes, `f` is being compiled into C code.

The first argument to `function` is a list of Variables that will be provided as inputs to the function. The second argument is a single Variable or a list of Variables. For either case, the second argument is what we want to see as output when we apply the function. `f` may then be used like a normal Python function.

# Adding Two Matrices
This next step simply requires that we instantiate `x` and `y` using the matrix Types:

In [11]:
x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y
f = function([x, y] , z)

We can pass in either Python lists or NumPy arrays:

In [12]:
f([[1,2],[3,4]], [[10,20],[30,40]])

array([[11., 22.],
       [33., 44.]])

In [13]:
f(np.array([[1,2], [3,4]]), np.array([[10,20], [30,40]]))

array([[11., 22.],
       [33., 44.]])

# Shared Variables
It is also possible to make a function with an internal state. For example, let's say we want to make an accumulator: at the beginning, the state is initialized to zero. Then, on each function call, the state is incremented by the function's argument. 

First, let's define the `accumulator` function. It adds its argument to the internal state, and returns the old state value. 

In [14]:
from theano import shared
state = shared(0)
inc = T.iscalar('inc')
accumulator = function([inc], state, updates=[(state, state+inc)])

The above code introduces a few new concepts. The `shared` function constructs so-called _shared variables_. These are hybrid symbolic and non-symbolic variables whose value may be shared between multiple functions. Shared variables can be used in symbolic expressions just like the objects returned by `dmatrices(...)` but they also have an internal value that defines the value taken by this symbolic variable in all the functions that use it. It is called a _shared_ variable because its value is shared between many functions. The value can be accessed and modified by the `.get_value()` and `.set_value()` methods. 

The other new thing in this code is the `updates` parameter of `function`. `updates` must be supplied with a list of pairs of the form `(shared-variable, new expression)`. It can also be a dictionary whose keys are shared variables and whose values are the new expressions. Either way, it means “whenever this function runs, it will replace the `.value` of each shared variable with the result of the corresponding expression”. Above, our accumulator replaces the value of `state` with the sum of the state and the increment amount.

We can now try this out:

In [15]:
print(state.get_value())

0


In [16]:
accumulator(1)

array(0)

In [17]:
print(state.get_value())

1


In [18]:
accumulator(300)

array(1)

In [19]:
print(state.get_value())

301


We can also reset the state by using the `.set_value()` method:

In [20]:
state.set_value(-1)

In [21]:
accumulator(3)

array(-1)

In [22]:
print(state.get_value())

2


As we mentioned above, you can define more than one function to use the same shared variable. These functions can all update the value.

In [23]:
decrementor = function([inc], state, updates=[(state, state-inc)])

In [24]:
decrementor(2)

array(2)

In [25]:
print(state.get_value())

0


You might be wondering why the updates mechanism exists. You can always achieve a similar result by returning the new expressions, and working with them in NumPy as usual. The updates mechanism can be a syntactic convenience, but it is mainly there for efficiency. Updates to shared variables can sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix updates). Also, Theano has more control over where and how shared variables are allocated, which is one of the important elements of getting good performance on the GPU.
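The update semantics used throughout this section (compute the output from the old state, then write the new value back) can be mimicked in plain Python. The sketch below is only an illustration of that ordering: `SharedValue` and `make_function` are hypothetical helpers invented here, not part of Theano, and nothing is compiled or optimized.

```python
class SharedValue:
    """Hypothetical stand-in for a shared variable (illustration only)."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value


def make_function(output, updates):
    """Return a callable that evaluates `output`, then applies `updates`.

    `output` maps the argument to the returned value; `updates` is a list
    of (shared, new_value_fn) pairs, loosely mirroring Theano's `updates`.
    """
    def compiled(arg):
        # Everything is evaluated against the OLD state first...
        result = output(arg)
        new_values = [(shared, new_fn(arg)) for shared, new_fn in updates]
        # ...and only then are the new values written back.
        for shared, value in new_values:
            shared.set_value(value)
        return result
    return compiled


state = SharedValue(0)
accumulator = make_function(
    output=lambda inc: state.get_value(),
    updates=[(state, lambda inc: state.get_value() + inc)],
)
decrementor = make_function(
    output=lambda inc: state.get_value(),
    updates=[(state, lambda inc: state.get_value() - inc)],
)

first = accumulator(1)     # returns the old state, then adds 1
second = accumulator(300)  # returns the old state, then adds 300
third = decrementor(2)     # shares the same state as accumulator
print(first, second, third, state.get_value())
```

Both callables read and write the same `state` object, which is the sense in which the value is shared between functions.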

# Graph Structures
The first step in writing Theano code is to write down all mathematical relations using symbolic placeholders (variables). When writing down these expressions you use operations like `+`, `-`, `**`, `sum()`, `tanh()`. All these are represented internally as **ops**. An op represents a certain computation on some type of inputs producing some type of output. You can see it as a _function definition_ in most programming languages.

Theano represents symbolic mathematical computations as graphs. These graphs are composed of interconnected _Apply_, _Variable_ and _Op_ nodes. An _Apply_ node represents the application of an _op_ to some _variables_. It is important to distinguish between the definition of a computation, which is represented by an op, and its application to some actual data, which is represented by the apply node. Furthermore, data types are represented by Type instances. Here is a piece of code and a diagram showing the structure built by that piece of code. This should help you understand how these pieces fit together:

In [26]:
x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y

### Diagram

<img src="https://drive.google.com/uc?id=19KR8tIZk0EVPkyeV1o5FgDRm5r6am6VK">

Arrows represent references to the Python objects pointed at. The blue box is an Apply node. Red boxes are Variable nodes. Green circles are Ops. Purple boxes are Types.

When we create _Variables_ and then _Apply Ops_ to them to make more Variables, we build a bi-partite, directed, acyclic graph. Variables point to the Apply nodes representing the function application producing them via their `owner` field. These Apply nodes point in turn to their input and output Variables via their `inputs` and `outputs` fields. (Apply instances also contain a list of references to their `outputs`, but those pointers don’t count in this graph.)

The `owner` field of both `x` and `y` points to `None` because they are not the result of another computation. If one of them were the result of another computation, its `owner` field would point to another blue box like `z` does, and so on.

Note that the `Apply` instance’s `outputs` field points to `z`, and `z.owner` points back to the `Apply` instance.
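To make the pointer structure concrete, here is a toy model of the graph in plain Python. The classes below are hypothetical simplifications written for this illustration; they borrow the field names `owner`, `op`, `inputs` and `outputs` from the description above but are not Theano's own classes.

```python
class Variable:
    """Toy variable: knows its name and the Apply node that produced it."""

    def __init__(self, name=None, owner=None):
        self.name = name
        self.owner = owner  # None for inputs that no computation produced


class Op:
    """Toy op: only carries the name of the computation it defines."""

    def __init__(self, name):
        self.name = name


class Apply:
    """Toy apply node: one op applied to specific input variables."""

    def __init__(self, op, inputs):
        self.op = op
        self.inputs = list(inputs)
        # The output variable points back to this node through `owner`.
        self.outputs = [Variable(owner=self)]


def add(a, b):
    """Build an Apply node for a + b and return its single output."""
    return Apply(Op("add"), [a, b]).outputs[0]


x = Variable("x")
y = Variable("y")
z = add(x, y)

print(x.owner, y.owner)          # None None
print(z.owner.op.name)           # add
print(z.owner.outputs[0] is z)   # True
```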

## Traversing the graph
The graph can be traversed starting from outputs (the result of some computation) down to its inputs using the owner field. Take for example the following code:

In [27]:
x = theano.tensor.dmatrix('x')
y = x * 2

If you enter `type(y.owner)` you get `<class 'theano.gof.graph.Apply'>`, which is the apply node that connects the op and the inputs to get this output. You can now print the name of the op that is applied to get y:

In [28]:
type(y.owner)

theano.gof.graph.Apply

In [29]:
y.owner.op.name

'Elemwise{mul,no_inplace}'

Hence, an elementwise multiplication is used to compute y. This multiplication is done between the inputs:

In [30]:
len(y.owner.inputs)

2

In [31]:
y.owner.inputs[0]

x

In [32]:
y.owner.inputs[1]

InplaceDimShuffle{x,x}.0

Note that the second input is not 2, as we might have expected. This is because 2 was first broadcast to a matrix of the same shape as `x`. This is done by using the op `DimShuffle`:

In [33]:
type(y.owner.inputs[1])

theano.tensor.var.TensorVariable

In [34]:
type(y.owner.inputs[1].owner)

theano.gof.graph.Apply

In [35]:
y.owner.inputs[1].owner.op

<theano.tensor.elemwise.DimShuffle at 0x10f9752b0>

In [36]:
y.owner.inputs[1].owner.inputs

[TensorConstant{2}]
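The manual steps above can be generalized into a small recursive walk. The sketch below relies only on the attribute names used in this section (`owner`, `op`, `inputs`, `name`) and is exercised here on toy objects built with `SimpleNamespace` that mimic the graph of `y = x * 2`; the helper `describe` is a hypothetical name, and its behavior on real Theano variables has not been checked here.

```python
from types import SimpleNamespace


def describe(var, depth=0):
    """Return indented lines describing the graph that produces `var`."""
    owner = getattr(var, "owner", None)
    if owner is None:
        # Leaf: an input variable or a constant.
        return ["  " * depth + str(var.name)]
    lines = ["  " * depth + str(owner.op)]
    for inp in owner.inputs:
        lines.extend(describe(inp, depth + 1))
    return lines


# Toy stand-ins that mirror the structure explored above.
x = SimpleNamespace(name="x", owner=None)
two = SimpleNamespace(name="2", owner=None)
shuffled = SimpleNamespace(
    name=None, owner=SimpleNamespace(op="DimShuffle", inputs=[two])
)
y = SimpleNamespace(
    name="y", owner=SimpleNamespace(op="mul", inputs=[x, shuffled])
)

lines = describe(y)
print("\n".join(lines))
```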

# Derivative in Theano
Now let’s use Theano for a slightly more sophisticated task: create a function which computes the derivative of some expression `y` with respect to its parameter `x`. To do this we will use the macro `T.grad`. For instance, we can compute the gradient of $x^2$ with respect to $x$:

$$\frac{d(x^2)}{dx} = 2x$$

In [37]:
from theano import pp

In [38]:
x = T.dscalar('x')
y = x ** 2
gy = T.grad(y, x)
display(pp(gy)) # Pretty print gradient prior to optimization

f = theano.function([x], gy)
f(4)

'((fill((x ** TensorConstant{2}), TensorConstant{1.0}) * TensorConstant{2}) * (x ** (TensorConstant{2} - TensorConstant{1})))'

array(8.)

In [39]:
pp(f.maker.fgraph.outputs[0])

'(TensorConstant{2.0} * x)'
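A quick way to sanity-check a symbolic derivative is to compare it against a numerical estimate. The following sketch uses a central finite difference in plain Python (no Theano involved); `central_difference` and `square` are names chosen for this illustration.

```python
def central_difference(f, x, h=1e-5):
    """Estimate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def square(x):
    return x ** 2


approx = central_difference(square, 4.0)
print(approx)  # should be very close to 2 * 4 = 8
```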

# Logistic Regression Example

In [40]:
import numpy as np
import theano.tensor as T
rng = np.random

In [41]:
N = 400      # Training sample size
feats = 784  # Number of input variables

# generate a dataset: D = (input_values, target_class)
# D[0].shape = (400, 784), D[1].shape = (400,)
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = T.dmatrix('x')
y = T.dvector('y')

# Initialize the weight vector w randomly
# 
# This and the following bias variable b 
# are shared so they keep their updated values 
# between training iterations (updates)
w = theano.shared(rng.randn(feats), name='w')

# Initialize bias term
b = theano.shared(0., name='b')

print('Initial model: ')
print(w.get_value())
print(b.get_value())

# ------- Construct Theano Expression Graph ---------
# Prediction, Probability that target = 1 
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))      
prediction = p_1 > 0.5                        # Prediction threshold

# Cross-entropy loss function, returns an array of cross-entropies
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) 

# Get the average of all the cross-entropies, add L2 regularization  
cost = xent.mean() + 0.01 * (w ** 2).sum()    

# Compute the gradient of the cost (w/ reg), w.r.t weight vector w and bias term b
gw, gb = T.grad(cost, [w,b])                  

# Compile
train = theano.function(
  inputs=[x,y], 
  outputs=[prediction, xent], 
  updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb))
)
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
  pred, err = train(D[0], D[1])
  
print("Final model:")
print(w.get_value())
print(b.get_value())
print("target values for D:")
print(D[1])
print("prediction on D:")
print(predict(D[0]))

Initial model: 
[ 1.54379689e+00  1.66714736e+00  9.37672181e-01 -2.61255422e-02
  9.94943018e-01  2.98970392e-01  1.72044152e+00  1.54661462e-01
  4.78673393e-01 -6.84652399e-02 -1.02263216e+00  8.20644211e-01
 -1.54131767e+00  3.15173490e-01 -5.22912056e-02  1.98288786e+00
  6.29161221e-01 -4.60054710e-01 -4.06288355e-01  2.41440577e-01
 -8.33355575e-01 -2.23144351e+00  7.65137608e-02 -6.14199999e-01
 -5.09320426e-02 -2.00915828e+00 -6.71276486e-02 -3.36337364e-03
 -2.44088596e-01  1.86395845e-01 -1.19035445e+00  7.70154479e-01
  4.79713791e-01  1.73952917e+00  1.20507064e+00 -9.15916881e-02
 -2.69536917e-01  1.98228827e+00 -5.26812764e-01 -9.72237140e-02
  1.35997071e+00  5.90587699e-01 -2.99488123e-02 -3.82591629e-02
  1.56182862e+00 -6.39931355e-01  5.53747099e-01  8.53853746e-01
  6.25335866e-02  1.40593055e+00 -5.21166420e-02 -1.80091443e+00
 -1.80745771e+00 -1.07335911e+00 -1.67091989e+00 -8.76487706e-01
  2.49786628e-01 -2.97215260e-01 -1.67604692e-01  3.02142096e-02
 -1.73993

Final model:
[-1.43967984e-02  2.51516705e-02 -2.17543586e-02  3.80772268e-02
  1.87556467e-02  8.66290328e-02 -6.96147812e-02  1.20688873e-02
  9.14800709e-02 -1.70165037e-01 -8.54772876e-02  6.20245005e-02
 -3.60309187e-02  1.25488711e-02 -8.13697221e-02 -2.37536067e-02
  1.91861278e-01  6.77952608e-02  7.31826765e-02  3.19839058e-03
 -1.53041182e-02 -1.57256386e-02  8.04234874e-02  5.95462531e-03
  4.14901496e-03 -8.56252129e-02  7.41784273e-02  1.63092658e-01
  2.77743408e-02  5.54133578e-02  3.94750837e-02 -2.72766613e-01
 -1.77850552e-02 -8.64965603e-04 -1.94108930e-01 -7.65133207e-02
 -1.92922073e-01  1.07940793e-01  9.80115998e-02  5.83033520e-02
 -4.73050632e-02 -9.88776306e-02 -3.40057273e-02 -5.23304867e-02
  3.80320952e-02 -1.25949937e-01 -4.68581512e-02 -6.79512739e-02
  5.66820384e-02  5.12201543e-02  9.11578628e-02  1.61090818e-01
 -3.59888252e-02  1.59828105e-01  7.95472852e-02 -2.51488354e-02
  1.22219180e-01 -1.71385266e-01  2.08274583e-01 -1.97954961e-02
 -4.91793037
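For comparison, the gradients that `T.grad` derives automatically in the example above can also be written by hand. The NumPy sketch below assumes the same cost (mean cross-entropy plus `0.01 * (w ** 2).sum()`) and checks the hand-derived formulas against finite differences on a small random problem; the function names and the reduced problem size are choices made for this illustration.

```python
import numpy as np


def cost_fn(w, b, X, y, lam=0.01):
    """Mean cross-entropy plus L2 penalty, mirroring the Theano graph."""
    p_1 = 1.0 / (1.0 + np.exp(-X.dot(w) - b))
    xent = -y * np.log(p_1) - (1 - y) * np.log(1 - p_1)
    return xent.mean() + lam * (w ** 2).sum()


def gradients(w, b, X, y, lam=0.01):
    """Hand-derived gradients of cost_fn with respect to w and b."""
    p_1 = 1.0 / (1.0 + np.exp(-X.dot(w) - b))
    residual = p_1 - y                      # d(xent_i)/d(logit_i)
    gw = X.T.dot(residual) / len(y) + 2.0 * lam * w
    gb = residual.mean()
    return gw, gb


rng = np.random.RandomState(0)
X = rng.randn(20, 5)
y = rng.randint(0, 2, size=20).astype(float)
w = 0.1 * rng.randn(5)
b = 0.0

gw, gb = gradients(w, b, X, y)

# Finite-difference check on the first weight and on the bias.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
numeric_gw0 = (cost_fn(w_plus, b, X, y) - cost_fn(w_minus, b, X, y)) / (2 * eps)
numeric_gb = (cost_fn(w, b + eps, X, y) - cost_fn(w, b - eps, X, y)) / (2 * eps)
print(gw[0], numeric_gw0)
print(gb, numeric_gb)
```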

# Logistic Regression: Classifying MNIST Digits
Recall that logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix $W$ and a bias vector $b$. Classification is done by projecting an input vector onto a set of hyperplanes, each of which corresponds to a class. The distance from the input to a hyperplane reflects the probability that the input is a member of the corresponding class.

Mathematically, the probability that an input vector $x$ is a member of a class $i$, a value of a stochastic variable $Y$, can be written as:

$$P(Y = i \mid x, W, b) = \mathrm{softmax}_i\big(Wx + b\big)$$

$$= \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}$$

The model’s prediction $y_{pred}$ is the class whose probability is maximal, specifically:

$$y_{pred} = \mathrm{argmax}_i \, P\big(Y = i \mid x, W, b\big)$$

This can be done in Theano as follows:

```python
# Initialize with 0 the weights W as a matrix of shape (n_in, n_out)
self.W = theano.shared(
  value=np.zeros((n_in, n_out),dtype=theano.config.floatX),
  name='W',
  borrow=True
)

# Initialize biases b as a vector of n_out 0s
self.b = theano.shared(
  value=np.zeros((n_out,), dtype=theano.config.floatX),
  name='b',
  borrow=True
)

# Creating a symbolic expression for computing the matrix of class-membership probabilities
# Where:
# W is a matrix where column-k represents the separation hyperplane for class-k
# x is a matrix where row-j represents input training sample-j
# b is a vector where element-k represents the free parameter of hyperplane-k
self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

# Symbolic description of how to compute prediction as class whose probability is maximal
self.y_pred = T.argmax(self.p_y_given_x, axis=1)
```

Since the parameters of the model must maintain a persistent state throughout training, we allocate shared variables for $W$ and $b$. This both declares them as symbolic Theano variables and initializes their contents. The dot and softmax operators are then used to compute $P(Y|x,W,b)$. For a single input this is a vector of class probabilities; because `input` is a minibatch matrix here, the result `p_y_given_x` is a symbolic matrix with one row per example.

To get the actual model prediction, we can use the `T.argmax` operator, which will return the index at which `p_y_given_x` is maximal (i.e. the class with maximum probability).
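For intuition, the same softmax and argmax computation can be written in plain NumPy on concrete numbers. This is only an illustrative sketch: the tiny `X`, `W` and `b` below are made-up values, and the row-wise maximum is subtracted before exponentiating as a common precaution against overflow.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# 3 examples with 2 features each, and 3 classes (made-up values).
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, 0.5]])   # shape (n_in, n_out)
b = np.zeros(3)

p_y_given_x = softmax(X.dot(W) + b)      # one row of probabilities per example
y_pred = p_y_given_x.argmax(axis=1)      # most probable class per example
print(p_y_given_x.sum(axis=1))
print(y_pred)
```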

Now of course, the model we have defined so far does not do anything useful yet, since its parameters are still in their initial state. The following section will thus cover how to learn the optimal parameters.

## Defining a Loss Function
Learning optimal model parameters involves minimizing a loss function. In the case of multi-class logistic regression, it is very common to use the negative log-likelihood as the loss. Minimizing it is equivalent to maximizing the likelihood of the data set $\mathcal{D}$ under the model parameterized by $\theta$. Let us start by defining the log-likelihood $\mathcal{L}$ and the loss $\ell$:

$$\mathcal{L}(\theta = \{W, b\}, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|-1} \log\big(P(Y = y^{(i)} \mid x^{(i)}, W, b)\big)$$

$$\ell(\theta = \{W, b\}, \mathcal{D}) = -\mathcal{L}(\theta = \{W, b\}, \mathcal{D})$$

Recall that $y^{(i)}$ is the target (true) class of training sample $i$, not the class the model predicted. Technically, each log-probability in the top equation could be multiplied by a target distribution over classes. Because that distribution puts weight 1 on the true class and 0 on every other class, the factor can safely be dropped here.

```python
# y.shape[0] is (symbolically) the number of rows in y, i.e. the
# number of examples (n) in the minibatch
# T.arange(y.shape[0]) is a symbolic vector which will contain 
# [0,1,2,...,n-1]. NOTE: y is the TARGET!
# T.log(self.p_y_given_x) is a matrix of Log-probabilities (call
# it LP, our predictions) with one row per example and one column per class
# LP[T.arange(y.shape[0]), y] is a vector v containing 
# [LP[0,y[0]], LP[1, y[1]], LP[2, y[2]],..., LP[n-1, y[n-1]]]
# And T.mean(LP[T.arange(y.shape[0]), y]) is the mean (across 
# minibatch samples) of the elements in v, i.e., the mean log 
# likelihood across the minibatch.

return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
```
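The indexing trick described in the comments above is easier to see on concrete numbers. The NumPy sketch below uses a made-up probability matrix for three examples and three classes; NumPy's integer-array indexing is used here as an analogue of the Theano expression.

```python
import numpy as np

# Made-up class probabilities: one row per example, one column per class.
p_y_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])   # target class of each example

LP = np.log(p_y_given_x)
# Pairs row i with column y[i]: [LP[0, 0], LP[1, 1], LP[2, 2]]
picked = LP[np.arange(y.shape[0]), y]
nll = -picked.mean()
print(picked)
print(nll)
```

Only the log-probability assigned to each example's target class contributes to the loss; the other entries in each row are ignored.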

## Creating a Logistic Regression Class

In [54]:
class LogisticRegression(object):
    """Multi-class Logistic Regression Class

    The logistic regression is fully described by a weight matrix `W`
    and bias vector `b`. Classification is done by projecting data
    points onto a set of hyperplanes, the distance to which is used to
    determine a class membership probability.
    """
    
    def __init__(self, input, n_in, n_out):
        """ Initialize the parameters of the logistic regression

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoints lie

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the labels lie

        """
        
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(
            value=np.zeros(
                (n_in, n_out),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(
            value=np.zeros(
                (n_out,),
                dtype=theano.config.floatX
            ),
            name='b',
            borrow=True
        )

        # symbolic expression for computing the matrix of class-membership
        # probabilities
        # Where:
        # W is a matrix where column-k represents the separation hyperplane for
        # class-k
        # x is a matrix where row-j represents input training sample-j
        # b is a vector where element-k represents the free parameter of
        # hyperplane-k
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # symbolic description of how to compute prediction as class whose
        # probability is maximal
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
        
        # parameters of the model
        self.params = [self.W, self.b]
        
        # keep track of model input
        self.input = input
        
    def negative_log_likelihood(self, y):
        """Return the mean of the negative log-likelihood of the prediction
        of this model under a given target distribution.

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label

        Note: we use the mean instead of the sum so that
              the learning rate is less dependent on the batch size
        """
        # y.shape[0] is (symbolically) the number of rows in y, i.e.,
        # the number of examples (call it n) in the minibatch.
        # T.arange(y.shape[0]) is a symbolic vector which will contain
        # [0, 1, 2, ..., n-1].
        # T.log(self.p_y_given_x) is a matrix of log-probabilities
        # (call it LP) with one row per example and one column per class.
        # LP[T.arange(y.shape[0]), y] is a vector v containing
        # [LP[0, y[0]], LP[1, y[1]], LP[2, y[2]], ..., LP[n-1, y[n-1]]].
        # T.mean(LP[T.arange(y.shape[0]), y]) is the mean (across
        # minibatch examples) of the elements in v, i.e., the mean
        # log-likelihood across the minibatch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
      
    def errors(self, y):
        """Return a float representing the number of errors in the minibatch
        over the total number of examples of the minibatch ; zero one
        loss over the size of the minibatch

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label
        """
        # check if y has same dimension of y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()

In [55]:
import numpy as np

### Instantiate Class
```python
# Generate symbolic variables for input (x and y represent a minibatch)
x = T.matrix('x')  # Data, presented as rasterized images
y = T.ivector('y') # labels, presented as 1D vector of [int] labels

# Construct the logistic regression class 
# Each MNIST image has size 28x28
classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)
```

We start by allocating symbolic variables for the training inputs $x$ and their corresponding classes $y$. Note that `x` and `y` are defined outside the scope of the `LogisticRegression` object. Since the class requires the input to build its graph, it is passed as a parameter of the `__init__` function. This is useful in case you want to connect instances of such classes to form a deep network. The output of one layer can be passed as the input of the layer above. (This tutorial does not build a multi-layer network, but this code will be reused in future tutorials that do.)

Finally, we define a (symbolic) `cost` variable to minimize, using the instance method `classifier.negative_log_likelihood`.

```python
# The cost we minimize during training is the negative log likelihood
# of the model in symbolic format
cost = classifier.negative_log_likelihood(y)
```

Note that `x` is an implicit symbolic input to the definition of `cost`, because the symbolic variables of `classifier` were defined in terms of `x` at initialization.

### Learning the Model
To implement minibatch stochastic gradient descent (MSGD) in most programming languages (C/C++, Matlab, Python), one would start by manually deriving the expressions for the gradient of the loss with respect to the parameters: in this case $\partial{\ell}/\partial{W}$ and $\partial{\ell}/\partial{b}$. This can get pretty tricky for complex models, as expressions for $\partial{\ell}/\partial{\theta}$ can get fairly complex, especially when taking into account problems of numerical stability.

With Theano, this work is greatly simplified. It performs automatic differentiation and applies certain math transforms to improve numerical stability.
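As an illustration of the kind of numerical-stability problem mentioned here, consider computing the log of a softmax. The NumPy sketch below contrasts a naive formula with the widely used log-sum-exp rewrite; it is a generic example and makes no claim about which specific rewrites Theano applies.

```python
import numpy as np


def naive_log_softmax(scores):
    """Direct formula: overflows when a score is large."""
    exp = np.exp(scores)
    return np.log(exp / exp.sum())


def stable_log_softmax(scores):
    """Same quantity after subtracting the maximum score first."""
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())


scores = np.array([1000.0, 0.0])

# Silence the overflow/invalid warnings the naive version triggers.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = naive_log_softmax(scores)
stable = stable_log_softmax(scores)

print(naive)   # contains nan / -inf
print(stable)  # finite values
```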

To get the gradients $\partial{\ell}/\partial{W}$ and $\partial{\ell}/\partial{b}$ in Theano, simply do the following:

```python
g_W = T.grad(cost=cost, wrt=classifier.W)
g_b = T.grad(cost=cost, wrt=classifier.b)
```

`g_W` and `g_b` are symbolic variables, which can be used as part of a computation graph. The function `train_model`, which performs one step of gradient descent, can then be defined as follows:

```python
# Specify how to update the parameters of the model as a list of 
# (variable, update expression) pairs
updates = [(classifier.W, classifier.W - learning_rate * g_W),
           (classifier.b, classifier.b - learning_rate * g_b)]

# Compiling a Theano function `train_model` that returns the cost, but
# at the same time updates the parameters of the model based on the rules
# defined in `updates`
train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size],
    }
)
```
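To see what one such training step computes, here is a NumPy sketch of minibatch gradient descent for the same softmax model. It is an analogue written for illustration, not the tutorial's Theano code: the gradients are derived by hand instead of with `T.grad`, the data is random, and the values chosen for `batch_size` and `learning_rate` are assumptions.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def nll(W, b, X, y):
    """Mean negative log-likelihood of the targets y."""
    p = softmax(X.dot(W) + b)
    return -np.mean(np.log(p[np.arange(len(y)), y]))


def train_step(W, b, X, y, learning_rate):
    """Return the minibatch cost and the updated parameters."""
    n = len(y)
    p = softmax(X.dot(W) + b)
    cost = -np.mean(np.log(p[np.arange(n), y]))
    delta = p.copy()
    delta[np.arange(n), y] -= 1.0        # p minus one-hot targets
    g_W = X.T.dot(delta) / n             # hand-derived d(cost)/dW
    g_b = delta.mean(axis=0)             # hand-derived d(cost)/db
    return cost, W - learning_rate * g_W, b - learning_rate * g_b


rng = np.random.RandomState(0)
train_set_x = rng.randn(60, 4)               # random stand-in data
train_set_y = rng.randint(0, 3, size=60)
W = np.zeros((4, 3))
b = np.zeros(3)
batch_size, learning_rate = 20, 0.1          # assumed values

initial_cost = nll(W, b, train_set_x, train_set_y)
for epoch in range(50):
    for index in range(len(train_set_y) // batch_size):
        # Same slicing pattern as the `givens` dictionary above.
        rows = slice(index * batch_size, (index + 1) * batch_size)
        cost, W, b = train_step(
            W, b, train_set_x[rows], train_set_y[rows], learning_rate
        )
final_cost = nll(W, b, train_set_x, train_set_y)
print(initial_cost, final_cost)
```

With all-zero parameters every class gets probability 1/3, so the initial cost equals `log(3)`; the updates should then drive the training cost below that value.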