# CMU 17-400/17-700 auto-graded notebook

Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE."

---

# Homework 4: Autodiff

In [None]:
# Who did you collaborate with on this assignment? 
# if no one, collaborators should contain an empty string,
# else list your collaborators below

# collaborators = [""]
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
try:
    collaborators
except:
    raise AssertionError("you did not list your collaborators, if any")

# Overview
In this notebook, we will implement a multi-layer perceptron to perform entity classification (i.e., predicting the category label of a DBPedia entity based on its title) without using any deep learning frameworks. The auto-differentiation will be handled by the provided code. You only need to use the existing APIs to build and evaluate a model. 

First, we review some background knowledge about Multi-layer Perceptrons.

# Multi-layer Perceptron
An MLP is a simple neural network architecture consisting of multiple layers, each of which applies a linear transformation to its inputs followed by a non-linear mapping:
\begin{align*}
    o_i &= f(x^T w_i + b_i)
\end{align*}

Here $x \in \mathbb{R}^{d_{in}}$ is the layer input and $o_i \in \mathbb{R}$ is the 
$i$-th output of the layer. $w_i$, $b_i$ are layer parameters which will be optimized
during training. The number of outputs at each layer is called the *dimension* of that layer and we denote it by $d_{hid}$. We would like to use vector/matrix multiplications 
wherever possible to utilize their fast implementation in `numpy`, so we combine the above
for all $i$ as:
\begin{equation*}
    o = f(x^T W + b)
\end{equation*}

$W=[w_1,w_2,\ldots,w_{d_{hid}}] \in \mathbb{R}^{d_{in} \times d_{hid}}$ stacks all the weight vectors 
horizontally, and $b \in \mathbb{R}^{d_{hid}}$ holds all the biases. The non-linearity $f$ is
applied elementwise.

To further speed up the computation we can process a minibatch of inputs together. Let $X \in \mathbb{R}^{N \times d_{in}}$
be a matrix holding $N$ examples row-wise. We can compute the layer outputs for all of these together:
\begin{equation}
    O = f(X W + B)
\end{equation}

$B = \mathbf{1} \otimes b^T \in \mathbb{R}^{N \times d_{hid}}$ is a *broadcast* version of the bias
with the appropriate dimensions. For this assignment we will use `numpy` for all matrix operations, which takes care
of broadcasting automatically (see https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html for details), hence we can use the vector $b$ directly.
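As a minimal sketch of the minibatch computation above, the following compares adding the explicitly tiled matrix $B$ with adding the vector $b$ directly. The sizes and random values are illustrative assumptions and are not taken from the assignment data; only the pre-activation $XW + b$ is computed here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_hid = 4, 3, 5             # illustrative sizes only

X = rng.normal(size=(N, d_in))       # N examples stored row-wise
W = rng.normal(size=(d_in, d_hid))   # one weight vector per column
b = rng.normal(size=(d_hid,))        # one bias per output unit

# Explicit B = 1 (outer product) b^T, shape (N, d_hid)
B = np.outer(np.ones(N), b)

pre_explicit = X @ W + B             # bias matrix added entry by entry
pre_broadcast = X @ W + b            # numpy broadcasts b across the N rows

print(pre_broadcast.shape)
print(np.allclose(pre_explicit, pre_broadcast))
```

Both expressions give the same $N \times d_{hid}$ matrix, which is why the tiled matrix never needs to be built in practice.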

The nonlinearity we will use is the `Rectified Linear Unit (ReLU)`:
\begin{equation*}
    f(x) = 
    \begin{cases}
        x &\quad x>0 \\
        0 &\quad x\leq0
    \end{cases}
\end{equation*}
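A small sketch of this definition in `numpy` follows. The function name `relu` and the input values are assumptions made for illustration; the assignment's own version lives in the provided `functions.py`.

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0); works for scalars, vectors and matrices
    return np.maximum(x, 0.0)

z = np.array([[-2.0, 0.0, 3.5],
              [ 1.0, -0.5, 0.0]])
out = relu(z)
print(out)
```

Negative entries are replaced by zero and positive entries pass through unchanged.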

For vectors the nonlinearity is applied element-wise, and we can again use numpy broadcasting for
this. In multi-layer networks, the output of layer $k$ is passed as input to layer $k+1$ (with $O^{(0)} = X$):
\begin{equation}
    O^{(k+1)} = f(O^{(k)}W^{(k+1)} + b^{(k+1)})
\end{equation}

We can set the output size of the last layer of the MLP to produce a vector the same size as the number of
labels $C$ in our dataset. The operations described thus far map inputs to non-negative reals, but for 
classification tasks we are interested 
in obtaining a *distribution* over class labels. This is usually done by
passing the output of the last layer through a `softMax` operation:
\begin{equation}
    p_j = \frac{e^{o_j}}{\sum_{j'=1}^C e^{o_{j'}}}
\end{equation}

Note that $p$ defines a valid distribution, and elements of $o$ which have a
high relative value will have a high probability in $p$. If the $o_j$ are very large or very negative, computing the above directly can cause numerical overflow or underflow. A more numerically stable version of softMax uses the following:

\begin{equation}
    p_j = \frac{e^{o_j-a}}{\sum_{j'=1}^C e^{o_{j'}-a}} 
\end{equation}

This holds for any $a$, because the common factor $e^{-a}$ cancels between the numerator and the denominator; we will use $a=\max_j o_j$. Lastly, we need to define a loss function which measures how far the output distribution $p_i$ for input $i$ is from its target distribution $t_i$. We will use the cross-entropy loss for this:
\begin{equation}
    l_i = - \sum_{j=1}^C t_{i}^{(j)} \log p_{i}^{(j)} 
\end{equation}
For single-label classification, $t_i$ is a $C$-dimensional one-hot vector encoding the correct label for this example. The above equations compute $p$ and $l$ for a single $o$, but in your code you should use `numpy`
operations to compute a minibatch of distributions $P$ and losses $L$ from a minibatch of outputs $O$.
The objective function we will optimize is the average of losses across a minibatch:
\begin{equation}
    \text{loss} = \frac{1}{N} \sum_{i=1}^N l_i 
\end{equation}
Now we can take gradients of $\text{loss}$ with respect to the parameters of the network and perform
Stochastic Gradient Descent (SGD).
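The softMax, cross-entropy and mean-loss equations above can be sketched for a whole minibatch as follows. The function names `softmax` and `cross_entropy` and the example values are assumptions for illustration, not the provided `f.softMax` / `f.crossEnt` operations, and the sketch assumes that no entry of $P$ is exactly zero.

```python
import numpy as np

def softmax(O):
    # Subtract the row-wise maximum (a = max_j o_j) before exponentiating
    shifted = O - O.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(P, T):
    # One loss per row: l_i = -sum_j t_i^(j) * log p_i^(j)
    return -(T * np.log(P)).sum(axis=1)

O = np.array([[1000.0, 1001.0, 1002.0],   # np.exp would overflow without the shift
              [0.0, 0.0, 0.0]])
T = np.array([[0.0, 0.0, 1.0],            # one-hot targets
              [1.0, 0.0, 0.0]])

P = softmax(O)
L = cross_entropy(P, T)
loss = L.mean()
print(P.sum(axis=1))   # each row of P sums to 1
print(loss)
```

The first row of `O` shows why the shift matters: exponentiating values around 1000 directly would overflow, while the shifted version only exponentiates values in $[-2, 0]$.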

To summarize, the architecture you will implement for this assignment consists of an MLP
with one hidden layer, followed by a softMax layer and cross-entropy loss.
Given an input minibatch $X$ and its associated targets $T$, the output $P$ and the loss are computed as:
\begin{align*}
    O^{(1)} &= \text{relu}(X W^{(1)} + b^{(1)}) \\
    O^{(2)} &= \text{relu}({O^{(1)}} W^{(2)} + b^{(2)}) \\
    P &= \text{softMax}(O^{(2)}) \\
    \text{loss} &= \text{mean}(\text{crossEnt}(P,T))
\end{align*}

# Start: Download and read through the helper functions
We provide the following files to make implementing a multi-layer perceptron easier:
* `xman.py` -- classes for expression manager, registers and operations.
* `utils.py` -- classes for data preprocessing and forming minibatches.
* `functions.py` -- function definitions and their gradients are declared here.
* `autograd.py` -- class for performing forward and backward propagation over a Wengert list.
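To make the idea of a Wengert list concrete, here is a toy scalar sketch: a flat list of operations evaluated in order for the forward pass and in reverse for the backward pass. All names here (`wengert`, `EVAL`, `BP`, `forward`, `backward`) are hypothetical and only illustrate the idea; the provided `autograd.py` has its own interface and should be read directly.

```python
# Each entry: (output name, operation, input names)
wengert = [('z1', 'mul', ('x', 'w')),
           ('z2', 'add', ('z1', 'b')),
           ('loss', 'square', ('z2',))]

EVAL = {'mul': lambda a, b: a * b,
        'add': lambda a, b: a + b,
        'square': lambda a: a * a}

# One derivative function per input: (delta, out, *inputs) -> gradient
BP = {'mul': [lambda d, out, a, b: d * b, lambda d, out, a, b: d * a],
      'add': [lambda d, out, a, b: d, lambda d, out, a, b: d],
      'square': [lambda d, out, a: d * 2 * a]}

def forward(values):
    # Evaluate the operations in list order, storing every intermediate
    vals = dict(values)
    for out, op, ins in wengert:
        vals[out] = EVAL[op](*[vals[i] for i in ins])
    return vals

def backward(vals):
    # Walk the list in reverse, accumulating d(loss)/d(register)
    grads = {'loss': 1.0}
    for out, op, ins in reversed(wengert):
        args = [vals[i] for i in ins]
        for k, name in enumerate(ins):
            g = BP[op][k](grads[out], vals[out], *args)
            grads[name] = grads.get(name, 0.0) + g
    return grads

vals = forward({'x': 3.0, 'w': 2.0, 'b': 1.0})
grads = backward(vals)
print(vals['loss'], grads['w'], grads['b'])
```

Here the loss is $(xw + b)^2$, so the gradients can be checked by hand: $\partial \text{loss}/\partial w = 2(xw+b)x$ and $\partial \text{loss}/\partial b = 2(xw+b)$.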

*Next, we will download these files and the small train/val/test data we will use.*

## Import the helper functions and raw data 

In [None]:
# Just run this cell
# Load dependencies and data for this assignment
!wget https://raw.githubusercontent.com/17-700/data/master/hw4/autodiff_dependencies/autograd.py
!wget https://raw.githubusercontent.com/17-700/data/master/hw4/autodiff_dependencies/functions.py
!wget https://raw.githubusercontent.com/17-700/data/master/hw4/autodiff_dependencies/utils.py
!wget https://raw.githubusercontent.com/17-700/data/master/hw4/autodiff_dependencies/xman.py
!wget https://raw.githubusercontent.com/17-700/data/master/hw4/autodiff_dependencies/data/tiny.train
!wget https://raw.githubusercontent.com/17-700/data/master/hw4/autodiff_dependencies/data/tiny.valid
!wget https://raw.githubusercontent.com/17-700/data/master/hw4/autodiff_dependencies/data/tiny.test    

## Understand the declared operations and their gradient computation in the helper classes
We have declared all the primitive operations in the `XManFunctions` class for you. Please read the `functions.py` file carefully. Here are some explanations that help you understand the code.

    import numpy as np
    # forward pass
    EVAL_FUNS = {
            'add':      lambda x1,x2: x1+x2,
    }
        
    def _derivAdd(delta,x1):
        if delta.shape!=x1.shape:
            # broadcast, sum along axis=0
            return delta.sum(axis=0)
        else: 
            return delta

    # backward pass
    BP_FUNS = {
            'add':    	[lambda delta,out,x1,x2: _derivAdd(delta,x1),    
                        lambda delta,out,x1,x2: _derivAdd(delta,x2)],
    }

`EVAL_FUNS` is a dictionary whose keys are the names of the operators declared in the `XManFunctions` class and whose values are the functions that compute them (usually defined as lambda expressions). `BP_FUNS` is another dictionary with the same set of keys as `EVAL_FUNS`, but whose value for a key is a list of functions, each computing the gradient wrt one of the operator's inputs. 

In the above example `BP_FUNS['add'][0]` computes the derivative wrt `x1`, and `BP_FUNS['add'][1]` computes the derivative wrt `x2`. As input each of these functions receives:
* `delta` - partial derivative of the loss wrt the output of this operation
* `out` - output of this operation in the forward pass. This is sometimes useful for computing the derivative. For example, the sigmoid nonlinearity satisfies $\sigma'(x)=\sigma(x)(1-\sigma(x))$.
* `x1,x2,...` - all inputs to the operation
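The following sketch re-creates the `add` entries shown above and calls them on made-up inputs, to show what happens when one input (for example a bias vector) was broadcast in the forward pass. The input values and the constant `delta` are assumptions chosen for illustration.

```python
import numpy as np

def _derivAdd(delta, x1):
    # If x1 was broadcast in the forward pass, sum delta over the batch axis
    if delta.shape != x1.shape:
        return delta.sum(axis=0)
    return delta

EVAL_FUNS = {'add': lambda x1, x2: x1 + x2}
BP_FUNS = {'add': [lambda delta, out, x1, x2: _derivAdd(delta, x1),
                   lambda delta, out, x1, x2: _derivAdd(delta, x2)]}

x1 = np.ones((4, 3))              # e.g. a minibatch of pre-activations
x2 = np.array([0.1, 0.2, 0.3])    # e.g. a bias vector, broadcast over rows

out = EVAL_FUNS['add'](x1, x2)
delta = np.full(out.shape, 2.0)   # pretend upstream gradient d(loss)/d(out)

g1 = BP_FUNS['add'][0](delta, out, x1, x2)   # gradient wrt x1
g2 = BP_FUNS['add'][1](delta, out, x1, x2)   # gradient wrt x2
print(g1.shape, g2.shape)
```

The gradient wrt `x1` keeps the shape of `delta`, while the gradient wrt `x2` is summed over the batch axis so that it matches the shape of the bias vector.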

Next, we will implement three key primitive operations (1) forward pass, (2) backward pass, and (3) optimization/model updating, using the provided APIs.

In [None]:
import argparse
import numpy as np
import time
import os
from xman import *
from utils import *
from autograd import *

np.random.seed(0)

EPS=1e-4


# Implement the Forward / Backward Pass

In [None]:
# TODO: define the forward pass function using Autograd

def fwd(network, valueDict):
    """
        network: the MLP object, use network.my_xman to get the MLP model
        valueDict: dict where the keys are the name for the data (e.g., 'x', 'y') and parameters, 
                    and the values are the corresponding data/parameter values
        return: loss that you want to compute gradients on
    """
    
    # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
    # # hint: use `network.my_xman.operationSequence` defined in xman.py to get the loss register in ad.eval()
    # ad = Autograd(network.my_xman)
    # return ad.eval(<FILL IN>)
    
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# TODO: define the backward pass function using Autograd

def bwd(network, valueDict):
    """
        network: the MLP object, use network.my_xman to get the MLP model
        valueDict: dict where the keys are the names for the data (e.g., 'x', 'y') and parameters, 
                    and the values are the corresponding data/parameter values
        return: gradients
    """
    
    # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
    # # important: see bprop() defined in autograd.py 
    # ad = Autograd(network.my_xman)
    # return ad.bprop(<FILL IN>)
    
    # YOUR CODE HERE
    raise NotImplementedError()

# Implement Optimization

Parameters of the above network can be trained using minibatch SGD. Once the loss function is defined we can take its derivative wrt any parameter $w_{ij}$ and update it as follows:
\begin{equation}
    w_{ij}^{(k)} \leftarrow w_{ij}^{(k-1)} - \lambda \frac{\partial \text{loss}}{\partial w_{ij}}
\end{equation}
Here $\lambda$ is the learning rate and the superscript $(k)$ indexes the update step.
In this assignment, you are not required to modify the learning rate as the training proceeds. 
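As a hedged illustration of the update rule, the sketch below applies one SGD step to plain dictionaries of `numpy` arrays using a toy quadratic loss. The names `params`, `toy_loss` and `toy_grads` are hypothetical and stand in for the network's parameter dictionary and the gradients returned by the backward pass.

```python
import numpy as np

# Hypothetical parameter dictionary keyed by register name
params = {'W': np.array([[1.0, -2.0], [0.5, 0.0]]),
          'b': np.array([0.1, -0.1])}

def toy_loss(p):
    # Toy loss: 0.5 * (sum of squares of all parameters)
    return 0.5 * sum((v ** 2).sum() for v in p.values())

def toy_grads(p):
    # The gradient of the toy loss wrt each parameter is the parameter itself
    return {name: value.copy() for name, value in p.items()}

rate = 0.1
before = toy_loss(params)
grads = toy_grads(params)
for name in grads:
    # w <- w - lambda * d(loss)/dw
    params[name] = params[name] - rate * grads[name]
after = toy_loss(params)
print(before, after)
```

For this toy loss every parameter shrinks by a factor of $1 - \lambda$ per step, so the loss decreases after the update.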

In [None]:
# TODO: implement the optimization step (i.e., applying the gradients)

def update(network, dataParamDict, grads, rate):
    """
        network: the MLP object, use network.my_xman to get the MLP model
        dataParamDict: dict of parameter values (key: parameter names)
        grads: dict that contains gradients of parameters (key: parameter names)
        rate: learning rate, a scalar value
        return: the updated dataParamDict
    """

    # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
    # for rname in grads:
    #     if network.my_xman.isParam(rname):
    #         <FILL IN>
    # return dataParamDict
    
    # YOUR CODE HERE
    raise NotImplementedError()

# Build the Model
Once the primitive operations are defined, we can go ahead and define the model. First we need to declare registers to hold inputs and parameters. Suppose there is one input *x*, a target *y* and two parameters *W* and *b*:

    W = f.param(name='W', default=a*np.random.uniform(-1.,1.,(10,10)))
    b = f.param(name='b', default=0.1*np.random.uniform(-1.,1.,(10,)))
    x = f.input(name='x', default=np.random.rand(1,10))
    y = f.input(name='y', default=np.random.rand(1,10))

We will specify the `name` and `default` fields for each register, including `input` registers! The `name` is used during the forward and backward passes to bind values to the correct register indexed by their names. The `default` value is used as initialization for parameters and also for performing gradient checks. We will use the `inputDict` method described in the next section to collect values for all registers and perform gradient checking using that. For this purpose, you can assign any reasonable random default value to the `input` registers (e.g., don't make them all zeros) with the right shape. 
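Gradient checking compares an analytic gradient against a central finite difference evaluated at some concrete values, which is why every register needs a usable default. The sketch below shows the idea on a toy scalar loss; `loss_fn` and `analytic_grad` are hypothetical stand-ins for the network's forward and backward passes.

```python
import numpy as np

def loss_fn(w):
    # Toy scalar loss standing in for the network loss
    return np.sum(w ** 3)

def analytic_grad(w):
    # Exact gradient of the toy loss
    return 3.0 * w ** 2

EPS = 1e-4
w = np.array([0.5, -1.0, 2.0])

num = np.zeros_like(w)
for i in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += EPS
    w_minus[i] -= EPS
    # Central difference: (f(w + eps) - f(w - eps)) / (2 * eps)
    num[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * EPS)

auto = analytic_grad(w)
print(np.allclose(auto, num, atol=1e-3))
```

If the analytic gradient were wrong, the two arrays would disagree by much more than the tolerance.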

**Note on initialization of parameters**: It is important to initialize parameters such that intermediate values in the network do not lie in the saturated regions of the non-linearity. One good heuristic is to sample the weights for $W$ of size $d_{in} \times d_{out}$ from a uniform distribution $\mathcal{U}[-a,a]$ whose scale $a$ is given by:

\begin{equation}
a = \sqrt[]{\frac{6}{d_{in} + d_{out}}}
\end{equation}

This is called Glorot initialization (http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf). Bias terms can be initialized at a scale of 0.1.
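A short sketch of this heuristic, assuming illustrative layer sizes that are not those of the assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 300, 50     # illustrative layer sizes

# Glorot scale a = sqrt(6 / (d_in + d_out))
a = np.sqrt(6.0 / (d_in + d_out))

W = a * rng.uniform(-1.0, 1.0, size=(d_in, d_out))   # weights within [-a, a]
b = 0.1 * rng.uniform(-1.0, 1.0, size=(d_out,))      # biases at a scale of 0.1

print(a)
print(np.abs(W).max() <= a, np.abs(b).max() <= 0.1)
```

Larger layers give a smaller scale $a$, which keeps the magnitude of the pre-activations roughly comparable across layer sizes.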

Now write the model in terms of primitive operations:

    xm = XMan()
    xm.o1 = f.relu( f.mul(x,W) + b )
    ...
    xm.loss = ...
    my_xman = xm.setup()



In [None]:
def glorot(m,n):
    # return scale for glorot initialization
    return np.sqrt(6./(m+n))

class MLP(object):
    """
    Multilayer Perceptron
    Accepts list of layer sizes [in_size, hid_size1, hid_size2, ..., out_size]
    """
    
    def __init__(self, layer_sizes):
        
        # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
        # self.num_layers = <FILL IN>
    
        # YOUR CODE HERE
        raise NotImplementedError()
        
        self.my_xman = self._build(layer_sizes) # DO NOT REMOVE THIS LINE. Store the output of xman.setup() in this variable       

    def _build(self, layer_sizes):
        
        
        # TODO: Define your model here
        
        # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
        # self.params = {}
        # for i in range(self.num_layers):
        #     k = i+1
        #     sc = glorot(layer_sizes[i], layer_sizes[i+1])
        #     self.params['W'+str(k)] = f.param(name='W'+str(k), default=<FILL IN>)
        #     self.params['b'+str(k)] = f.param(name='b'+str(k), default=<FILL IN>)
      
        
        # YOUR CODE HERE
        raise NotImplementedError()
        
        self.inputs = {}
        
        # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
        # self.inputs['X'] = f.input(name='X', default=<FILL IN>)
        # self.inputs['y'] = f.input(name='y', default=<FILL IN>)
        # x = XMan()
        # inp = self.inputs['X']
        # for i in range(self.num_layers):
        #     <FILL IN>
        # x.output = <FILL IN>
        # x.loss = f.mean(f.crossEnt(<FILL IN>))
        
        # YOUR CODE HERE
        raise NotImplementedError()
        
        return x.setup()

    
    def data_dict(self, X, y):
        dataDict = {}
        dataDict['X'] = X
        dataDict['y'] = y
        return dataDict
     

## Data Format and Data Loading
For this assignment you need to predict the category label of a DBPedia entity based on its title. The data contains two columns separated by
a tab, with the title in the first column and the label in the second.

    Lloyd_Stinson   Person
    Lobogenesis_centrota    Species
    Loch_of_Craiglush   Place


*We have done (almost) all data pre-processing in the provided code. The following are some details regarding that.*

We will encode entities for input to the networks by converting characters to a one-hot representation. Suppose we have a dictionary mapping each character in the data to an index `chardict = {'a':1,'b':2...}` and the total number of distinct characters in the dataset is $V$; then we will represent `'a'` as a $V$-dimensional vector $[1,0,0,\ldots,0]$, and `'b'` as another $V$-dimensional vector $[0,1,0,\ldots,0]$. A string of characters will be encoded as a matrix in which each row is a $V$-dimensional vector.
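The encoding can be sketched as follows. The three-character `chardict` with 0-based indices and the helper `one_hot_string` are assumptions for illustration; the provided `utils.py` builds its own dictionary and may index characters differently.

```python
import numpy as np

# Hypothetical 0-based character dictionary; utils.py builds its own
chardict = {'a': 0, 'b': 1, 'c': 2}
V = len(chardict)

def one_hot_string(s):
    # One row per character, with a single 1 in that character's column
    mat = np.zeros((len(s), V))
    for row, ch in enumerate(s):
        mat[row, chardict[ch]] = 1.0
    return mat

enc = one_hot_string('cab')
print(enc)
```

Each row contains exactly one 1, so the matrix for a string of length 3 has shape $3 \times V$.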

We will fix the maximum length of an entity to $M$: longer entities will be truncated to this length, and shorter ones will be padded with white-space. We have provided you with code in `utils.py` that preprocesses the data and divides it into minibatches. You can use it as follows:

    from utils import *
    # load data and preprocess
    dp = DataPreprocessor()
    data = dp.preprocess(<training_file>, <validation_file>, 
        <testing_file>)
    # minibatches
    mb_train = MinibatchLoader(data.training, batch_size, max_len, 
            len(data.chardict), len(data.labeldict))
    mb_valid = MinibatchLoader(data.validation, len(data.validation), 
            max_len, len(data.chardict), len(data.labeldict), 
            shuffle=False)
    mb_test = MinibatchLoader(data.test, len(data.test), max_len, 
            len(data.chardict), len(data.labeldict), shuffle=False)

`max_len` is the maximum length $M$ which we set. `shuffle=True/False` tells the batch loader whether to shuffle the data after every epoch. For validation and test sets we set the batch size same as the size of the dataset. You can then iterate over the data using,

    for (idxs,e,l) in mb_train:
        # idxs - ids of examples in minibatch
        # e - entities in one-hot format
        # l - corresponding output labels also in one-hot format

After every epoch (full sweep through `mb_train`) the data is shuffled for the next epoch in `mb_train`. `idxs` has shape $N$, `e` has shape $N \times M \times V$ and `l` has shape $N \times C$, where $N$ is the batch size. Make sure that this makes sense to you.

For input to the MLP we will concatenate all the one-hot encodings into one row vector, so you will need to flatten `e` to an $N \times MV$ matrix in which each row consists of the encodings of all characters in the entity, one after the other. You can use `numpy.reshape` for this. 
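A small sketch of this flattening step, using made-up sizes and a hand-built tensor in place of a real minibatch:

```python
import numpy as np

N, M, V = 2, 3, 4                             # illustrative sizes
e = np.zeros((N, M, V))
e[0, 0, 1] = e[0, 1, 3] = e[0, 2, 0] = 1.0    # first entity: characters 1, 3, 0
e[1, 0, 2] = e[1, 1, 2] = e[1, 2, 2] = 1.0    # second entity: characters 2, 2, 2

# Row i holds the M one-hot vectors of entity i, one after the other
flat = e.reshape(N, M * V)
print(flat.shape)
print(flat[0])
```

Because `numpy` reshapes in row-major order by default, the first $V$ entries of each row are the encoding of the first character, the next $V$ entries the second character, and so on.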

The `Data` and `MinibatchLoader` classes create dictionaries for all characters and labels in the dataset and use them to encode the inputs and labels into a one-hot vector format.


In [None]:
# TODO: prepare the input, do a forward-backward pass over it, and update the weights
# by calling fwd(), bwd(), and update() we implemented previously

def train_epoch(train_dataset, mlp, lr, value_dict, logger):
    """
        train_dataset: data for train
        mlp: MLP object
        lr: learning rate
        value_dict: dict where the keys are the names for the data (e.g., 'x', 'y') and parameters, 
                    and the values are the corresponding data/parameter values
        logger: file object for logging the training loss
        return: training loss
    """
    
    # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
    # train_loss = np.ndarray([0])
    # for ii, (idxs,e,l) in enumerate(train_dataset):      
    #     data_dict = mlp.data_dict(e.reshape(<FILL IN>),l)
    #     for k,v in data_dict.items():
    #         value_dict[k] = v
    #     # fwd-bwd
    #     vd = <FILL IN>
    #     gd = <FILL IN>
    #     value_dict = update(<FILL IN>)
    #     message = 'TRAIN loss = %.3f' % vd['loss']
    #     logger.write(message+'\n')
    #     
    #     train_loss = np.append(train_loss, vd['loss'])
    
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return train_loss


In [None]:
# TODO: prepare the input and do a fwd pass over it to compute the loss

def validate(valid_dataset, mlp, value_dict):
    """
        valid_dataset: data for validation
        mlp: MLP object
        value_dict: dict where the keys are the names for the data (e.g., 'x', 'y') and parameters, 
                    and the values are the corresponding data/parameter values
    """
       
    # tot_loss is the sum of loss over all data, n is the number of data points
    # probs are the list of output probabilities, targets are a list of labels
    tot_loss, n= 0., 0
    probs = []
    targets = []
    
    # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
    # for (idxs,e,l) in valid_dataset: 
    #     data_dict = mlp.data_dict(e.reshape(<FILL IN>),l)
    #     for k,v in data_dict.items():
    #         value_dict[k] = v
    #     # fwd
    #     vd = <FILL IN>
    #     tot_loss += <FILL IN>
    #     probs.append(<FILL IN>)
    #     targets.append(<FILL IN>)
    #     n += 1     
    
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return tot_loss, probs, targets, n


In [None]:
# TODO: prepare input and do a fwd pass over it to compute the output probs

def test(test_dataset, mlp, best_param_dict):
    """
        test_dataset: data for test
        mlp: MLP object
        best_param_dict: dict containing the learned parameter values with the best validation loss
    """
    
    # tot_loss is the sum of loss over all data, n is the number of data points
    # probs are the list of output probabilities, targets are a list of labels
    tot_loss, n= 0., 0
    probs = []
    targets = []
    
    # # TODO: Uncomment the lines below and replace <FILL IN> with appropriate code
    # for (idxs,e,l) in test_dataset: 
    #     data_dict = mlp.data_dict(e.reshape(<FILL IN>),l)
    #     for k,v in data_dict.items():
    #         best_param_dict[k] = v
    #     # fwd
    #     vd = <FILL IN>
    #     tot_loss += <FILL IN>
    #     probs.append(<FILL IN>)
    #     targets.append(<FILL IN>)
    #     n += 1   

    # YOUR CODE HERE
    raise NotImplementedError()
    
    return tot_loss, probs, targets, n


In [None]:
def accuracy(probs, targets):
    preds = np.argmax(probs, axis=1)
    targ = np.argmax(targets, axis=1)
    return float((preds==targ).sum())/preds.shape[0]


In [None]:
# Main function doing training and testing

def main(params):
    # params: dict for all relevant parameters
    
    epochs = params['epochs']
    max_len = params['max_len']
    num_hid = params['num_hid']
    batch_size = params['batch_size']
    dataset = params['dataset']
    init_lr = params['init_lr']
    output_file = params['output_file']
    train_loss_file = params['train_loss_file']

    # load data and preprocess
    dp = DataPreprocessor()
    data = dp.preprocess('%s.train'%dataset, '%s.valid'%dataset, '%s.test'%dataset)
    
    # create minibatches
    mb_train = MinibatchLoader(data.training, batch_size, max_len,
            len(data.chardict), len(data.labeldict))
    mb_valid = MinibatchLoader(data.validation, len(data.validation), max_len,
            len(data.chardict), len(data.labeldict), shuffle=False)
    mb_test = MinibatchLoader(data.test, len(data.test), max_len,
            len(data.chardict), len(data.labeldict), shuffle=False)

    # build    
    mlp = MLP([max_len*mb_train.num_chars, num_hid, mb_train.num_labels])
      
    logger = open('%s_mlp4c_L%d_H%d_B%d_E%d_lr%.3f.txt'%
            (dataset,max_len,num_hid,batch_size,epochs,init_lr),'w')

    # get default data and params
    value_dict = mlp.my_xman.inputDict()
    min_loss = 1e5
    lr = init_lr
    best_param_dict = {}
    
    for i in range(epochs):
        # training
        
        # # TODO: Uncomment the line below and replace <FILL IN> with appropriate code
        # hint: call train_epoch
        # train_loss = <FILL IN>
        
        # YOUR CODE HERE
        raise NotImplementedError()
        
        # validate
        tot_loss, probs, targets, n = validate(mb_valid, mlp, value_dict)
        
        acc = accuracy(np.vstack(probs), np.vstack(targets))
        c_loss = tot_loss/n
        if c_loss<min_loss:
            min_loss = c_loss
            for k,v in value_dict.items():
                best_param_dict[k] = np.copy(v)
        message = ('Epoch %d VAL loss %.3f min_loss %.3f acc %.3f' %
                (i,c_loss,min_loss,acc))
        print(message)

    np.save(train_loss_file, train_loss)

    # testing
    tot_loss, probs, targets, n = test(mb_test, mlp, best_param_dict)
    acc = accuracy(np.vstack(probs), np.vstack(targets))
    c_loss = tot_loss/n
    np.save(output_file, np.vstack(probs))
    
    return c_loss, acc


## Good job! Let's train the model!

In [None]:
# Don't change this cell
# Helper functions used for testing training loss 
EPS = 1e-4

LOSS_INV_MAX = 20 

MLP_LOSS_THRESHOLD = [(1.022,15), (1.5,10), (2.0, 0)]
MLP_TIME_THRESHOLD = [(25,15), (50,10), (100,0)]
MLP_LOSS_INV_THRESHOLD = [(LOSS_INV_MAX/2,10), (LOSS_INV_MAX*3/4,5), (LOSS_INV_MAX,0)]

def linear(thresholds, x):
    return float(thresholds[1][1]-thresholds[0][1])*(x- thresholds[0][0])/(thresholds[1][0]-thresholds[0][0])+thresholds[0][1]

def linear_mark(thresholds, x):
    if x<=thresholds[0][0]:
        return thresholds[0][1]
    elif x<=thresholds[1][0]:
        return linear(thresholds[:2], x)
    elif x<=thresholds[2][0]:
        return linear(thresholds[1:3], x)
    else:
        return thresholds[2][1]    

def _crossEnt(x,y):
    # X, y: 2-D numpy array
    # return: return an array of cross entropy, where each element is crossEnt between x_i and y_i
    
    # YOUR CODE HERE
    raise NotImplementedError()

def load_params_from_file(filename):
    return np.load(filename)[()]

def save_params_to_file(d, filename):
    np.save(filename, d)    

def loss_inv_check(loss_arr):
    try:
        in_arr = np.reshape(loss_arr, [len(loss_arr)])
        if len(in_arr) < LOSS_INV_MAX:
            raise ValueError("Not enough Train Loss measurements. Found only %d train loss entries!"%len(loss_arr))
        in_arr = in_arr[0:LOSS_INV_MAX]
        inv = 0
        for i in range(LOSS_INV_MAX-1):
            if in_arr[i] < in_arr[i+1]:
                inv += 1
        
        return inv
    except Exception as e:
        print ("MLP TRAIN LOSS FAILED" )
        print (e)  
        return -1 


In [None]:
assert 1.60>_crossEnt(np.array([[0.1,0.1,0.8]]), np.array([[0.33,0.33,0.33]])).mean()>1.59

In [None]:
# Prepare the dataset and parameters for testing
# Do not tune the hyper-parameters below (though in practice you should)

params = dict()
params['max_len'] = 10
params['num_hid'] = 50
params['batch_size'] = 64
params['dataset'] = 'tiny'
params['epochs'] = 50
params['init_lr'] = 0.1
params['output_file'] = 'output'
params['train_loss_file'] = 'train_loss'

# make sure you didn't change the hyper-parameters
assert params['init_lr'] == 0.1
assert params['epochs'] == 50
assert params['max_len'] == 10
assert params['num_hid'] == 50
assert params['batch_size'] == 64

In [None]:
# Load the data and parameters for testing

epochs = params['epochs']
max_len = params['max_len']
num_hid = params['num_hid']
batch_size = params['batch_size']
dataset = params['dataset']
init_lr = params['init_lr']
output_file = params['output_file']
train_loss_file = params['train_loss_file']

# load data and preprocess
dp = DataPreprocessor()
data = dp.preprocess('%s.train'%dataset, '%s.valid'%dataset, '%s.test'%dataset)

# minibatches
mb_train = MinibatchLoader(data.training, batch_size, max_len,
        len(data.chardict), len(data.labeldict))
mb_test = MinibatchLoader(data.test, len(data.test), max_len,
        len(data.chardict), len(data.labeldict), shuffle=False)



In [None]:
# make sure we've correctly loaded data

assert mb_train.num_examples == 1857
assert mb_test.num_examples == 221

In [None]:
# Testing gradient correctness

targets = []
indices = []
for (idxs,e,l) in mb_test:
    targets.append(l)
    indices.extend(idxs)

mlp = MLP([max_len*mb_train.num_chars,num_hid,mb_train.num_labels])
    
# function which takes a network object and checks gradients
dataParamDict = mlp.my_xman.inputDict()
fd = fwd(mlp, dataParamDict)
grads = bwd(mlp, fd)
for rname in grads:
    if mlp.my_xman.isParam(rname):
        fd[rname].ravel()[0] += EPS
        fp = fwd(mlp, fd)
        a = fp['loss']
        fd[rname].ravel()[0] -= 2*EPS
        fm = fwd(mlp, fd)
        b = fm['loss']
        fd[rname].ravel()[0] += EPS
        auto = grads[rname].ravel()[0]
        num = (a-b)/(2*EPS)
        if not np.isclose(auto, num, atol=1e-3):
            raise ValueError("gradients not close for %s, Auto %.5f Num %.5f"
                        % (rname, auto, num))
            

In [None]:
# Testing training loss
t_start = time.process_time()  # time.clock() was removed in Python 3.8

t_start1 = os.times()[0]
params["output_file"] = output_file+"_mlp"

loss, accu = main(params)
mlp_time = time.process_time()-t_start
user_time = os.times()[0]-t_start1

student_mlp_loss = _crossEnt(np.load(params["output_file"]+".npy"), np.vstack(targets)).mean()

In [None]:
assert accu > 0.5