## 10-714: Homework 0

The goal of this homework is to give you a quick overview of some of the concepts and ideas that you should be familiar with _prior_ to taking this course.  The assignment will require you to build a basic softmax regression algorithm, plus a simple two-layer neural network.  You will create these implementations both in native Python (using the numpy library), and (for softmax regression) in native C/C++.  The homework will also walk you through the process of submitting your assignments to our autograding system. 

All the code development for the homeworks in 10-714 can be done in the Google Colab environment.  However, instead of making extensive use of actual code blocks within a colab notebook, most of the code you develop will be written in `.py` files downloaded (automatically) to your Google Drive, and you will largely use the notebook for running shell scripts that test and submit the code to the auto-grader.  This is a somewhat non-standard usage of Colab Notebooks.

In [None]:
# # Code to set up the assignment
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/
# !mkdir -p 10714
# %cd /content/drive/MyDrive/10714
# !git clone https://github.com/dlsyscourse/hw0.git
# %cd /content/drive/MyDrive/10714/hw0

This next cell will then install the libraries required.

In [None]:
# !pip3 install --upgrade --no-deps git+https://github.com/dlsyscourse/mugrade.git
# !pip3 install pybind11
# !pip3 install numdifftools

## Question 1: A basic `add` function, and testing/autograding basics

To illustrate the workflow of these assignments and the autograding system, we'll use a simple example of implementing an `add` function.  Note that the commands run above will create the following structure in your `10714/hw0` directory:

    data/
        train-images-idx3-ubyte.gz
        train-labels-idx1-ubyte.gz
        t10k-images-idx3-ubyte.gz
        t10k-labels-idx1-ubyte.gz
    src/
        simple_ml.py
        simple_ml_ext.cpp
    tests/
        test_simple_ml.py
    Makefile
    
The `data/` directory contains the data needed for this assignment (a copy of the MNIST data set); the `src/` directory contains the source files where you will write your implementations; the `tests/` directory contains tests that will evaluate (locally) your solution, and also submit them for autograding.  And the `Makefile` file is a makefile that will compile the code (relevant for the C++ portions of the assignment).

The first homework question requires you to implement the `simple_ml.add()` function.  In the `src/simple_ml.py` file, find the function stub for `add()` and fill in its implementation.

### Running local tests

Now you will want to test whether your code works and, if so, submit it to the autograding system.  Throughout this course, we are using standard tools for running unit tests on code, namely the `pytest` system.  Once you've written the correct code in the `src/simple_ml.py` file, run the command below.

In [4]:
!python -m pytest -k "add"

platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/hujunhao/10714/hw0
collected 6 items / 5 deselected / 1 selected

tests/test_simple_ml.py .                                                [100%]



If all goes correctly, you will see that one test passed.  To see how this test works, take a look at the `tests/test_simple_ml.py` file, specifically the `test_add()` function.

Write additional tests for your implementations, especially if you find that your code is passing the local tests, but still seems to be failing on submission.
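As a rough sketch of what an extra local test might look like, the snippet below checks a few more input types. The `add` defined here is only a stand-in so the snippet runs on its own (in the assignment you would import it from `simple_ml`), and the test name `test_add_extra` is hypothetical.

```python
import numpy as np

def add(x, y):
    # Stand-in for simple_ml.add, defined here only to keep the sketch self-contained.
    return x + y

def test_add_extra():
    # Plain integers
    assert add(2, 3) == 5
    # Floats: compare with a tolerance rather than exact equality
    np.testing.assert_allclose(add(0.1, 0.2), 0.3)
    # numpy arrays should add elementwise
    np.testing.assert_array_equal(add(np.array([1, 2]), np.array([3, 4])),
                                  np.array([4, 6]))
    # The result type should follow the inputs
    assert isinstance(add(1.5, 2), float)

# pytest would discover test_add_extra automatically; calling it directly also works.
test_add_extra()
```

Placing a function like this in `tests/test_simple_ml.py` and selecting it with `-k "add"` would run it alongside the provided test.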

If you're used to debugging code via print statements, note that **pytest will by default capture any output**. You can disable this behavior and have the tests display all output in all cases by passing the `-s` flag to pytest.

### Submitting to the autograder

To start the autograding process, go to http://mugrade.dlsyscourse.org (or http://mugrade-online.dlsyscourse.org for the public online version of the course) and log in **using your course email**.

Once you've created an account, in the left hand navigation bar, click on the "Grader Key" link, and copy the associated key (including the leading underscore if present). Once you have this key, run the following command.

In [1]:
!python3 -m mugrade submit YOUR_GRADER_KEY_HERE -k "add"

submit
platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/hujunhao/dlsys/hw0
collected 6 items / 5 deselected / 1 selected

tests/test_simple_ml.py F

__________________________________ submit_add __________________________________

pyfuncitem = <Function submit_add>

    @pytest.hookimpl(hookwrapper=True)
    def pytest_pyfunc_call(pyfuncitem):
        ## prior to test, initialize submission
        global _values, _submission_key, _errors
        _values = []
        _errors = 0
        func_name = pyfuncitem.name[7:]
        if os.environ["MUGRADE_OP"] == "submit":
>           _submission_key = start_submission(func_name)

../../.conda/envs/dlsys/lib/pytho

Running this command will submit your `add` function to the mugrade autograding system.  To see how this works internally, take a look at the `tests/test_simple_ml.py` file again, but this time at the `submit_add()` function right below the `test_add()` function.

Instead of assertions there are calls to `mugrade.submit()`.  These calls each evaluate the `add` function on different inputs, then send the result to the mugrade server.  The server compares the output of your function with the correct output.  If you are logged into the mugrade system, you can go to the "Homework 0" assignment to see your updated grade. All the executions run on your local machine.

## Question 2: Loading MNIST data

Next you will implement the `parse_mnist()` function in the `src/simple_ml.py` file.

We'd recommend you use the `struct` module in python (along with the `gzip` module and of course `numpy` itself), in order to implement this function.
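To illustrate how `struct`, `gzip`, and `numpy` fit together, here is a minimal sketch that parses a tiny synthetic image file held in memory rather than the real files in `data/`. It relies on the standard MNIST IDX layout (a 16-byte big-endian header of magic number 2051, image count, rows, and columns, followed by unsigned bytes). The helper name `read_idx_images` is hypothetical, and the choice to return `float32` values scaled to [0, 1] is an assumption; follow the docstring in `src/simple_ml.py` for the required output format.

```python
import gzip
import io
import struct
import numpy as np

def read_idx_images(fileobj):
    """Parse an IDX3 image stream into a (num, rows*cols) float32 array (hypothetical helper)."""
    # Header: four big-endian unsigned 32-bit ints (magic, count, rows, cols).
    magic, num, rows, cols = struct.unpack(">IIII", fileobj.read(16))
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}")
    pixels = np.frombuffer(fileobj.read(num * rows * cols), dtype=np.uint8)
    # Assumption: flatten each image and scale pixel values to [0, 1].
    return pixels.reshape(num, rows * cols).astype(np.float32) / 255.0

# Build a tiny synthetic payload: 2 images of 2x3 pixels with values 0, 20, ..., 220.
raw = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(0, 240, 20))
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(raw)
buf.seek(0)

# Read it back through gzip, exactly as one would with a .gz file on disk.
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    images = read_idx_images(f)
```

For a file on disk, `gzip.open(path, "rb")` returns a file object that could be passed to the same helper. Label files use the same idea with an 8-byte header (magic number 2049 and a count).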


In [4]:
!python3 -m pytest -k "parse_mnist"

platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/hujunhao/dlsys/hw0
collected 6 items / 5 deselected / 1 selected

tests/test_simple_ml.py .                                                [100%]



In [None]:
!python3 -m mugrade submit YOUR_GRADER_KEY_HERE -k "parse_mnist"

## Question 3: Softmax loss

Implement the softmax (a.k.a. cross-entropy) loss as defined in the `softmax_loss()` function in `src/simple_ml.py`. See lecture 2 for reference.
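For orientation, the softmax loss of a logit vector $z$ with true class $y$ is commonly written as $\log \sum_i \exp(z_i) - z_y$. The sketch below averages this over a batch; the function name `softmax_loss_sketch` and the max-subtraction step for numerical stability are choices made for this illustration, not necessarily what the assignment stub expects.

```python
import numpy as np

def softmax_loss_sketch(Z, y):
    """Average cross-entropy loss for logits Z (m x k) and integer labels y (m,)."""
    # Subtracting the row-wise max leaves the loss unchanged but avoids overflow in exp.
    Z_shift = Z - Z.max(axis=1, keepdims=True)
    log_sum_exp = np.log(np.exp(Z_shift).sum(axis=1))
    # Pick out the (shifted) logit of the true class for each example.
    true_logit = Z_shift[np.arange(Z.shape[0]), y]
    return float(np.mean(log_sum_exp - true_logit))

# With all-zero logits every class is equally likely, so the loss is log(k).
Z = np.zeros((4, 3), dtype=np.float32)
y = np.array([0, 1, 2, 1])
loss_uniform = softmax_loss_sketch(Z, y)
```

The all-zero case is a handy sanity check: the loss depends only on the number of classes, not on the labels.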

In [3]:
!python3 -m pytest -k "softmax_loss"

platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/hujunhao/dlsys/hw0
collected 6 items / 5 deselected / 1 selected

tests/test_simple_ml.py .                                                [100%]



In [None]:
!python3 -m mugrade submit YOUR_GRADER_KEY_HERE -k "softmax_loss"

## Question 4: Stochastic gradient descent for softmax regression

In this question you will implement stochastic gradient descent (SGD) for (linear) softmax regression. See lecture 2 for reference.

Implement the `softmax_regression_epoch()` function, which runs a single epoch of SGD (one pass over a data set) using the specified learning rate / step size `lr` and minibatch size `batch`.
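One way such an epoch could be structured is sketched below, on synthetic data. It assumes the minibatches are taken in order without shuffling, that `theta` has shape (n, k) and is updated in place, and that the gradient is the usual `X^T (softmax(X theta) - onehot(y)) / batch`; the names `sgd_epoch_sketch` and `avg_softmax_loss` are hypothetical.

```python
import numpy as np

def avg_softmax_loss(X, y, theta):
    # Average cross-entropy loss of the linear classifier X @ theta.
    logits = X @ theta
    logits = logits - logits.max(axis=1, keepdims=True)
    lse = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(lse - logits[np.arange(X.shape[0]), y]))

def sgd_epoch_sketch(X, y, theta, lr=0.1, batch=100):
    """One in-order pass over (X, y), updating theta (n x k) in place."""
    for start in range(0, X.shape[0], batch):
        Xb, yb = X[start:start + batch], y[start:start + batch]
        logits = Xb @ theta
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Subtract the one-hot encoding of the labels.
        probs[np.arange(Xb.shape[0]), yb] -= 1.0
        theta -= lr * (Xb.T @ probs) / Xb.shape[0]

# Synthetic, linearly separable data so the sketch runs without MNIST.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)).astype(np.float32)
y = np.argmax(X @ rng.standard_normal((5, 3)), axis=1)
theta = np.zeros((5, 3), dtype=np.float32)

loss_before = avg_softmax_loss(X, y, theta)
for _ in range(5):
    sgd_epoch_sketch(X, y, theta, lr=0.1, batch=50)
loss_after = avg_softmax_loss(X, y, theta)
```

Comparing the loss before and after a few epochs is a quick check that the update has the right sign.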

In [2]:
!python3 -m pytest -k "softmax_regression_epoch and not cpp"

platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/hujunhao/dlsys/hw0
collected 6 items / 5 deselected / 1 selected

tests/test_simple_ml.py .                                                [100%]



In [None]:
!python3 -m mugrade submit YOUR_GRADER_KEY_HERE -k "softmax_regression_epoch and not cpp"

### Training MNIST with softmax regression

To train a softmax regression classifier on MNIST, you can use the `train_softmax()` function in the `src/simple_ml.py` file.

You can see how this works using the following code.  For reference, as seen below, our implementation runs in ~3 seconds on Colab, and achieves 7.97% error.

In [3]:
import sys
sys.path.append("src/")
from simple_ml import train_softmax, parse_mnist

X_tr, y_tr = parse_mnist("data/train-images-idx3-ubyte.gz", 
                         "data/train-labels-idx1-ubyte.gz")
X_te, y_te = parse_mnist("data/t10k-images-idx3-ubyte.gz",
                         "data/t10k-labels-idx1-ubyte.gz")

train_softmax(X_tr, y_tr, X_te, y_te, epochs=10, lr=0.2, batch=100)

| Epoch | Train Loss | Train Err | Test Loss | Test Err |
|     0 |    0.35134 |   0.10182 |   0.33588 |  0.09400 |
|     1 |    0.32142 |   0.09268 |   0.31086 |  0.08730 |
|     2 |    0.30802 |   0.08795 |   0.30097 |  0.08550 |
|     3 |    0.29987 |   0.08532 |   0.29558 |  0.08370 |
|     4 |    0.29415 |   0.08323 |   0.29215 |  0.08230 |
|     5 |    0.28981 |   0.08182 |   0.28973 |  0.08090 |
|     6 |    0.28633 |   0.08085 |   0.28793 |  0.08080 |
|     7 |    0.28345 |   0.07997 |   0.28651 |  0.08040 |
|     8 |    0.28100 |   0.07923 |   0.28537 |  0.08010 |
|     9 |    0.27887 |   0.07847 |   0.28442 |  0.07970 |


## Question 5: SGD for a two-layer neural network

Now that you've written SGD for a linear classifier, let's consider the case of a simple two-layer neural network.  Specifically, for input $x \in \mathbb{R}^n$, we'll consider a two-layer neural network (without bias terms) of the form
\begin{equation}
z = W_2^T \mathrm{ReLU}(W_1^T x)
\end{equation}
where $W_1 \in \mathbb{R}^{n \times d}$ and $W_2 \in \mathbb{R}^{d \times k}$ represent the weights of the network (which has a $d$-dimensional hidden unit), and where $z \in \mathbb{R}^k$ represents the logits output by the network.  We again use the softmax / cross-entropy loss, meaning that we want to solve the optimization problem
\begin{equation}
\min_{W_1, W_2} \;\; \frac{1}{m} \sum_{i=1}^m \ell_{\mathrm{softmax}}(W_2^T \mathrm{ReLU}(W_1^T x^{(i)}), y^{(i)}).
\end{equation}
Or alternatively, overloading the notation to describe the batch form with matrix $X \in \mathbb{R}^{m \times n}$, this can also be written 
\begin{equation}
\min_{W_1, W_2} \;\; \ell_{\mathrm{softmax}}(\mathrm{ReLU}(X W_1) W_2, y).
\end{equation}

Using the chain rule, we can derive the backpropagation updates for this network (we'll briefly cover these in class, on 9/8, but also provide the final form here for ease of implementation).  Specifically, let
\begin{equation}
\begin{split}
Z_1 \in \mathbb{R}^{m \times d} & = \mathrm{ReLU}(X W_1) \\
G_2 \in \mathbb{R}^{m \times k} & = \mathrm{normalize}(\exp(Z_1 W_2)) - I_y \\
G_1 \in \mathbb{R}^{m \times d} & = \mathrm{1}\{Z_1 > 0\} \circ (G_2 W_2^T)
\end{split}
\end{equation}
where $\mathrm{1}\{Z_1 > 0\}$ is a binary matrix with entries equal to zero or one depending on whether each term in $Z_1$ is strictly positive and where $\circ$ denotes elementwise multiplication.  Then the gradients of the objective are given by
\begin{equation}
\begin{split}
\nabla_{W_1} \ell_{\mathrm{softmax}}(\mathrm{ReLU}(X W_1) W_2, y) & = \frac{1}{m} X^T G_1  \\
\nabla_{W_2} \ell_{\mathrm{softmax}}(\mathrm{ReLU}(X W_1) W_2, y) & = \frac{1}{m} Z_1^T G_2.  \\
\end{split}
\end{equation}
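The expressions above translate almost line for line into numpy. The following is a minimal sketch on small random data that evaluates them and compares one gradient entry against a central finite difference; the function names `two_layer_grads` and `two_layer_loss` are hypothetical, and the actual `nn_epoch()` additionally has to loop over minibatches and apply the updates.

```python
import numpy as np

def two_layer_grads(X, y, W1, W2):
    """Return (grad_W1, grad_W2) of the average softmax loss, following the formulas above."""
    m = X.shape[0]
    Z1 = np.maximum(X @ W1, 0)                    # Z_1 = ReLU(X W_1)
    logits = Z1 @ W2
    expz = np.exp(logits - logits.max(axis=1, keepdims=True))
    G2 = expz / expz.sum(axis=1, keepdims=True)   # normalize(exp(Z_1 W_2))
    G2[np.arange(m), y] -= 1.0                    # ... minus I_y
    G1 = (Z1 > 0) * (G2 @ W2.T)                   # 1{Z_1 > 0} elementwise (G_2 W_2^T)
    return X.T @ G1 / m, Z1.T @ G2 / m

def two_layer_loss(X, y, W1, W2):
    logits = np.maximum(X @ W1, 0) @ W2
    lse = np.log(np.exp(logits).sum(axis=1))
    return np.mean(lse - logits[np.arange(X.shape[0]), y])

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
y = rng.integers(0, 3, size=8)
W1 = rng.standard_normal((4, 5)) * 0.5
W2 = rng.standard_normal((5, 3)) * 0.5
gW1, gW2 = two_layer_grads(X, y, W1, W2)

# Central finite difference on a single entry of W1 as a sanity check.
eps = 1e-6
E = np.zeros_like(W1)
E[1, 2] = eps
numeric = (two_layer_loss(X, y, W1 + E, W2) - two_layer_loss(X, y, W1 - E, W2)) / (2 * eps)
```

If `numeric` and `gW1[1, 2]` agree to several decimal places, the analytic gradient is very likely implemented correctly.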

**Note:** If the details of these precise equations seem a bit cryptic to you (prior to the 9/8 lecture), don't worry too much.  These _are_ just the standard backpropagation equations for a two-layer ReLU network: the $Z_1$ term just computes the "forward" pass while the $G_2$ and $G_1$ terms denote the backward pass.  But the precise form of the updates can vary depending upon the notation you've used for neural networks, the precise ways you formulate the losses, if you've derived these previously in matrix form, etc.  If the notation seems like it might be familiar from when you've seen deep networks in the past, and makes more sense after the 9/8 lecture, that is more than sufficient in terms of background (after all, the whole _point_ of deep learning systems, to some extent, is that we don't need to bother with these manual calculations).  But if these entire concepts are _completely_ foreign to you, then it may be better to take a separate course on ML and neural networks prior to this course, or at least be aware that there will be substantial catch-up work to do for the course.

Using these gradients, now write the `nn_epoch()` function in the `src/simple_ml.py` file.  As with the previous question, your solution should modify the `W1` and `W2` arrays in place.  After implementing the function, run the following test.  Be sure to use matrix operations as indicated by the expressions above to implement the function: this will be _much_ faster, and more efficient, than attempting to use loops (and it requires far less code).

In [None]:
!python3 -m pytest -k "nn_epoch"

And finally submit for autograding.

In [None]:
!python3 -m mugrade submit YOUR_GRADER_KEY_HERE -k "nn_epoch"

### Training a full neural network

As before, though it isn't a strict necessity to pass the autograder, it's rather fun to see how well you can use your neural network function to train an MNIST classifier.  Analogous to the softmax regression case, there is a `train_nn()` function in the `simple_ml.py` file you can use to train this two-layer network via SGD with multiple epochs.  Here is code, for example, that trains a two-layer network with 400 hidden units.

In [25]:
import sys
sys.path.append("src/")

# Reload the simple_ml module which has been cached from the earlier experiment
import importlib
import simple_ml
importlib.reload(simple_ml)

from simple_ml import train_nn, parse_mnist

X_tr, y_tr = parse_mnist("data/train-images-idx3-ubyte.gz", 
                         "data/train-labels-idx1-ubyte.gz")
X_te, y_te = parse_mnist("data/t10k-images-idx3-ubyte.gz",
                         "data/t10k-labels-idx1-ubyte.gz")
train_nn(X_tr, y_tr, X_te, y_te, hidden_dim=400, epochs=20, lr=0.2)

| Epoch | Train Loss | Train Err | Test Loss | Test Err |
|     0 |    0.15324 |   0.04697 |   0.16305 |  0.04920 |
|     1 |    0.09854 |   0.02923 |   0.11604 |  0.03660 |
|     2 |    0.07392 |   0.02163 |   0.09750 |  0.03200 |
|     3 |    0.06006 |   0.01757 |   0.08825 |  0.02960 |
|     4 |    0.04869 |   0.01368 |   0.08147 |  0.02620 |
|     5 |    0.04061 |   0.01093 |   0.07698 |  0.02380 |
|     6 |    0.03494 |   0.00915 |   0.07446 |  0.02320 |
|     7 |    0.03027 |   0.00758 |   0.07274 |  0.02320 |
|     8 |    0.02674 |   0.00650 |   0.07103 |  0.02240 |
|     9 |    0.02373 |   0.00552 |   0.06989 |  0.02150 |
|    10 |    0.02092 |   0.00477 |   0.06870 |  0.02130 |
|    11 |    0.01914 |   0.00403 |   0.06837 |  0.02130 |
|    12 |    0.01705 |   0.00325 |   0.06748 |  0.02150 |
|    13 |    0.01541 |   0.00272 |   0.06688 |  0.02130 |
|    14 |    0.01417 |   0.00232 |   0.06657 |  0.02090 |
|    15 |    0.01282 |   0.00195 |   0.06591 |  0.02040 |
|    16 |    0

This takes about 30 seconds to run on Colab for our implementation, and as seen above, it achieves an error of 1.89\% on MNIST.  Not bad for less than 20 lines of code or so...

## Question 6: Softmax regression in C++

Strictly speaking, the actual implementation here is more like raw C, but we use C++ features to build the interface to Python using the [pybind11](https://pybind11.readthedocs.io) library.  Although there are other alternatives, the pybind11 library is relatively nice as an interface, as it is a header-only library, and allows you to implement the entire Python/C++ interface within a single C++ source file.

The C++ file you'll implement things in is the `src/simple_ml_ext.cpp` file. You will specifically implement your code in the function `softmax_regression_epoch_cpp`.

The function essentially mirrors that of the Python implementation, but requires passing some additional arguments because we are operating on raw pointers to the array data rather than any sort of higher-level "matrix" data structure.  Specifically, `X`, `y`, and `theta` are pointers to the raw data of the corresponding numpy arrays from the previous section.  We also assume there is no padding in the data; that is, the second row begins immediately after the first row, with no additional bytes added, e.g., to align the memory to a certain boundary (all these issues will be mentioned in subsequent discussion in the course, but avoided for now). Of course, because only the raw data is passed into the function, in order to know the actual sizes of the underlying matrices, we also need to pass these sizes explicitly to the function, which is what is provided by the `m`, `n`, and `k` arguments.

Unlike in Python, you need to be very careful when accessing memory directly like this in C++, and get very used to flat, row-major index arithmetic (or build additional data structures that help you access things in a more intuitive fashion, but for this assignment you should just stick to the raw indexing).
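The memory layout described above can be inspected from the Python side. The sketch below is an illustration in numpy rather than C++: it shows that for a row-major $m \times n$ array with no padding, entry $(i, j)$ sits at flat offset `i*n + j`, which is the same arithmetic a raw pointer needs. The variable names are illustrative only.

```python
import numpy as np

m, n = 3, 4
X = np.arange(m * n, dtype=np.float32).reshape(m, n)  # C-contiguous (row-major) by default
flat = X.ravel()  # 1-D view of the same buffer, analogous to a raw float pointer in C++

i, j = 2, 1
same_entry = flat[i * n + j] == X[i, j]  # entry (i, j) lives at offset i*n + j

# No padding: each row starts n * itemsize bytes after the previous one.
row_stride_bytes, col_stride_bytes = X.strides

# A column slice is not contiguous, so its underlying buffer would not follow the rule above.
sliced = X[:, :2]
packed = np.ascontiguousarray(sliced)  # copies into a fresh, tightly packed buffer
```

Arrays that are slices or transposes of other arrays are the typical way the "no padding" assumption gets violated, which is why packing them first is a common precaution.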


The second piece of importance for the implementation is the pybind11 code that actually provides the Python interface, defined in the `PYBIND11_MODULE` block.

This code essentially just extracts the raw pointers from the provided inputs (using pybind11's numpy interface), and then calls the corresponding `softmax_regression_epoch_cpp` function.

You will need to perform all the matrix-vector products manually, rather than rely on numpy to do all the matrix operations for you (**note: do not use an external matrix library like Eigen for this assignment, but code the multiplication yourself ... it is a relatively simple one**).
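To make the loop structure concrete, here is a sketch of a hand-written matrix product over flat, row-major buffers. It is written in Python purely as a stand-in for the C++ loops (the index arithmetic carries over unchanged); the function name `matmul_flat` and its argument order are hypothetical, and it is checked against numpy on random data.

```python
import numpy as np

def matmul_flat(A, B, m, n, k):
    """Return C = A @ B as a flat row-major list, given flat row-major A (m x n) and B (n x k)."""
    C = [0.0] * (m * k)
    for i in range(m):          # rows of A
        for j in range(k):      # columns of B
            acc = 0.0
            for l in range(n):  # inner (shared) dimension
                acc += A[i * n + l] * B[l * k + j]
            C[i * k + j] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = np.array(matmul_flat(A.ravel(), B.ravel(), 3, 4, 2)).reshape(3, 2)
```

Products involving a transpose, such as `X^T G`, follow the same pattern with the roles of the row and column indices swapped in the flat offset.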

In [4]:
!make
!python3 -m pytest -k "softmax_regression_epoch_cpp"

c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) src/simple_ml_ext.cpp -o src/simple_ml_ext.so
platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/hujunhao/dlsys/hw0
collected 6 items / 5 deselected / 1 selected

tests/test_simple_ml.py F                                                [100%]

______________________ test_softmax_regression_epoch_cpp _______________________

    def test_softmax_regression_epoch_cpp():
        # test numeical gradient
        np.random.seed(0)
        X = np.random.randn(50,5).astype(np.float32)
        y = np.random.randint(3, size=(50,)).astype(np.uint8)
        Theta = np.zeros((5,3), dtype=np.float32)
        dTheta = -nd.Gradient(lambda Th : softmax_loss(X@T

In [None]:
!python3 -m mugrade submit YOUR_GRADER_KEY_HERE -k "softmax_regression_epoch_cpp"

### Training a full softmax regression classifier with the C++ version

Let's finally try training the whole softmax regression classifier using our "direct memory access" C++ version.  If the previous Python version took ~3 seconds, this should be blazing fast, right?

In [30]:
import sys
sys.path.append("src/")

# Reload the simple_ml module to include the newly-compiled C++ extension
import importlib
import simple_ml
importlib.reload(simple_ml)

from simple_ml import train_softmax, parse_mnist

X_tr, y_tr = parse_mnist("data/train-images-idx3-ubyte.gz", 
                         "data/train-labels-idx1-ubyte.gz")
X_te, y_te = parse_mnist("data/t10k-images-idx3-ubyte.gz",
                         "data/t10k-labels-idx1-ubyte.gz")

train_softmax(X_tr, y_tr, X_te, y_te, epochs=10, lr = 0.2, batch=100, cpp=True)

| Epoch | Train Loss | Train Err | Test Loss | Test Err |
|     0 |    0.35134 |   0.10182 |   0.33588 |  0.09400 |
|     1 |    0.32142 |   0.09268 |   0.31086 |  0.08730 |
|     2 |    0.30802 |   0.08795 |   0.30097 |  0.08550 |
|     3 |    0.29987 |   0.08532 |   0.29558 |  0.08370 |
|     4 |    0.29415 |   0.08323 |   0.29215 |  0.08230 |
|     5 |    0.28981 |   0.08182 |   0.28973 |  0.08090 |
|     6 |    0.28633 |   0.08085 |   0.28793 |  0.08080 |
|     7 |    0.28345 |   0.07997 |   0.28651 |  0.08040 |
|     8 |    0.28100 |   0.07923 |   0.28537 |  0.08010 |
|     9 |    0.27887 |   0.07847 |   0.28442 |  0.07970 |


As expected, the numbers exactly match our Python version, and the code is ... about 5 times slower?!  What is going on here?  Well, it turns out that the "manual" matrix multiplication code you probably wrote for the C++ version is extremely inefficient.  While Python itself is a slow, interpreted language, numpy is backed by matrix multiplications written in C (or, more likely, Fortran, believe it or not) that have been highly optimized to make use of vector operations, the cache hierarchy of different processors, and other features that are essential for efficient numerical operations.  We will cover these details much more in later lectures, and you'll even write a matrix library that can actually perform these operations relatively efficiently (at least for some special cases ... it's honestly not that easy to beat numpy in general).