In [None]:
# # this mounts your Google Drive to the Colab VM.
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

# # enter the foldername in your Drive where you have saved the unzipped
# # assignment folder, e.g. 'cs231n/assignments/assignment3/'
# FOLDERNAME = None
# assert FOLDERNAME is not None, "[!] Enter the foldername."

# # symlink to make it easier to load your files
# !ln -s "/content/drive/My Drive/$FOLDERNAME" "/content/assignment2"

# # now that we've mounted your Drive, this ensures that
# # the Python interpreter of the Colab VM can load
# # python files from within it.
# import sys
# sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))


# %cd /content

# What's this TensorFlow business?

You've written a lot of code in this assignment to provide a whole host of neural network functionality. Dropout, Batch Norm, and 2D convolutions are some of the workhorses of deep learning in computer vision. You've also worked hard to make your code efficient and vectorized.

For the last part of this assignment, though, we're going to leave behind your beautiful codebase and instead migrate to one of two popular deep learning frameworks: in this instance, TensorFlow (or PyTorch, if you choose to work with that notebook).

#### What is it?
TensorFlow is a system for executing computational graphs over Tensor objects, with native support for performing backpropagation for its Variables. In it, we work with Tensors, which are n-dimensional arrays analogous to the NumPy `ndarray`.

#### Why?

* Our code will now run on GPUs! Much faster training. Writing your own modules to run on GPUs is beyond the scope of this class, unfortunately.
* We want you to be ready to use one of these frameworks for your project so you can experiment more efficiently than if you were writing every feature you want to use by hand. 
* We want you to stand on the shoulders of giants! TensorFlow and PyTorch are both excellent frameworks that will make your lives a lot easier, and now that you understand their guts, you are free to use them :) 
* We want you to be exposed to the sort of deep learning code you might run into in academia or industry. 

## How will I learn TensorFlow?

TensorFlow has many excellent tutorials available, including those from [Google themselves](https://www.tensorflow.org/get_started/get_started).

Otherwise, this notebook will walk you through much of what you need to do to train models in TensorFlow. See the end of the notebook for some links to helpful tutorials if you want to learn more or need further clarification on topics that aren't fully explained here.

**NOTE: This notebook is meant to teach you the latest version of TensorFlow, which as of this homework is version `2.2.0-rc3`. Most examples on the web today are still in 1.x, so be careful not to confuse the two when looking up documentation.**

## Install Tensorflow 2.0 (ONLY IF YOU ARE WORKING LOCALLY)

1. Have the latest version of Anaconda installed on your machine.
2. Create a new conda environment starting from Python 3.7. In this setup example, we'll call it `tf_20_env`.
3. Run the command: `source activate tf_20_env`
4. Then pip install TF 2.0 as described here: https://www.tensorflow.org/install

# Table of Contents

This notebook has 5 parts. We will walk through TensorFlow at **three different levels of abstraction**, which should help you better understand it and prepare you for working on your project.

1. Part I, Preparation: load the CINIC-10 dataset.
2. Part II, Barebones TensorFlow: **Abstraction Level 1**, we will work directly with low-level TensorFlow graphs. 
3. Part III, Keras Model API: **Abstraction Level 2**, we will use `tf.keras.Model` to define arbitrary neural network architecture. 
4. Part IV, Keras Sequential + Functional API: **Abstraction Level 3**, we will use `tf.keras.Sequential` to define a linear feed-forward network very conveniently, and then explore the functional libraries for building unique and uncommon models that require more flexibility.
5. Part V, CINIC-10 open-ended challenge: please implement your own network to get as high accuracy as possible on CINIC-10. You can experiment with any layer, optimizer, hyperparameters or other advanced features. 

We will discuss Keras in more detail later in the notebook.

Here is a table of comparison:

| API           | Flexibility | Convenience |
|---------------|-------------|-------------|
| Barebones     | High        | Low         |
| `tf.keras.Model`     | High        | Medium      |
| `tf.keras.Sequential` | Low         | High        |

# Part I: Preparation

First, we load the CINIC-10 dataset. This might take a few minutes to download the first time you run it, but after that the files should be cached on disk and loading should be faster.

For the purposes of this assignment we will still write our own code to preprocess the data and iterate through it in minibatches. The `tf.data` package in TensorFlow provides tools for automating this process, but working with this package adds extra complication and is beyond the scope of this notebook. However, using `tf.data` can be much more efficient than the simple approach used in this notebook, so you should consider using it for your project.
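
For reference, here is a minimal sketch of what the `tf.data` route could look like. It assumes small random arrays in place of the real CINIC-10 data; the names `X_demo`, `y_demo`, and `demo_ds` are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
import tensorflow as tf

# Stand-in data (hypothetical): 256 random "images" in N x C x H x W layout.
X_demo = np.random.rand(256, 3, 32, 32).astype(np.float32)
y_demo = np.random.randint(0, 6, size=256).astype(np.int32)

# Slice the arrays into examples, shuffle them, and group them into minibatches.
demo_ds = (
    tf.data.Dataset.from_tensor_slices((X_demo, y_demo))
    .shuffle(buffer_size=256)
    .batch(64)
)

# Iterating over the dataset yields (images, labels) pairs of Tensors.
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in demo_ds]
print(batch_shapes[0])
```

The hand-written `Dataset` class below plays the same role with plain NumPy slicing.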

In [5]:
import os
import tensorflow as tf
import numpy as np
import math
import timeit
import matplotlib.pyplot as plt
from ece697ls.data_utils import get_CINIC10_data
from ece697ls.pruning_helper import invert_ch

%matplotlib inline

In [111]:
def load_cinic10():
  data = get_CINIC10_data()

  # Convert to float32 for TensorFlow
  X_train = data['X_train'].astype(np.float32)
  X_val = data['X_val'].astype(np.float32)
  X_test = data['X_test'].astype(np.float32)

  y_train = data['y_train'].astype(np.int32)
  y_val = data['y_val'].astype(np.int32)
  y_test = data['y_test'].astype(np.int32)

  return X_train, y_train, X_val, y_val, X_test, y_test

data = get_CINIC10_data()

# Invoke the above function to get our data.
NHW = (0, 1, 2)
X_train, y_train, X_val, y_val, X_test, y_test = load_cinic10()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape, y_train.dtype)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)


Train data shape:  (53973, 3, 32, 32)
Train labels shape:  (53973,) int32
Validation data shape:  (10195, 3, 32, 32)
Validation labels shape:  (10195,)
Test data shape:  (10196, 3, 32, 32)
Test labels shape:  (10196,)


In [116]:
class Dataset(object):
    def __init__(self, X, y, batch_size, shuffle=False):
        """
        Construct a Dataset object to iterate over data X and labels y
        
        Inputs:
        - X: Numpy array of data, of any shape
        - y: Numpy array of labels, of any shape but with y.shape[0] == X.shape[0]
        - batch_size: Integer giving number of elements per minibatch
        - shuffle: (optional) Boolean, whether to shuffle the data on each epoch
        """
        assert X.shape[0] == y.shape[0], 'Got different numbers of data and labels'
        self.X, self.y = X, y
        self.batch_size, self.shuffle = batch_size, shuffle

    def __iter__(self):
        N, B = self.X.shape[0], self.batch_size
        idxs = np.arange(N)
        if self.shuffle:
            np.random.shuffle(idxs)
        # Index through idxs so that shuffling actually changes the batch contents.
        return iter((self.X[idxs[i:i+B]], self.y[idxs[i:i+B]]) for i in range(0, N, B))


train_dset = Dataset(X_train, y_train, batch_size=64, shuffle=True)
val_dset = Dataset(X_val, y_val, batch_size=64, shuffle=False)
test_dset = Dataset(X_test, y_test, batch_size=64)

In [117]:
# We can iterate through a dataset like this:
for t, (x, y) in enumerate(train_dset):
    print(t, x.shape, y.shape)
    if t > 5: break

0 (64, 3, 32, 32) (64,)
1 (64, 3, 32, 32) (64,)
2 (64, 3, 32, 32) (64,)
3 (64, 3, 32, 32) (64,)
4 (64, 3, 32, 32) (64,)
5 (64, 3, 32, 32) (64,)
6 (64, 3, 32, 32) (64,)


You can optionally **use GPU by setting the flag to True below**.

## Colab Users

If you are using Colab, you need to manually switch to a GPU device. You can do this by clicking `Runtime -> Change runtime type` and selecting `GPU` under `Hardware Accelerator`. Note that you have to rerun the cells from the top since the kernel gets restarted upon switching runtimes.
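
If you are unsure whether the runtime actually exposes a GPU, a quick check along these lines may help. This is a sketch that assumes a TensorFlow version providing `tf.config.list_physical_devices` (2.1 or later); the fallback to CPU is our own choice for illustration.

```python
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means only the CPU is available.
gpus = tf.config.list_physical_devices('GPU')

# Pick a device string in the same format used elsewhere in this notebook.
device = '/device:GPU:0' if gpus else '/cpu:0'
print('Visible GPUs:', len(gpus), '-> using', device)
```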

In [None]:
# # Set up some global variables
# USE_GPU = True

# if USE_GPU:
#     device = '/device:GPU:0'
# else:
#     device = '/cpu:0'

# # Constant to control how often we print when training models
# print_every = 100

# print('Using device: ', device)

# Part II: Barebones TensorFlow
TensorFlow ships with various high-level APIs which make it very convenient to define and train neural networks; we will cover some of these constructs in Part III and Part IV of this notebook. In this section we will start by building a model with basic TensorFlow constructs to help you better understand what's going on under the hood of the higher-level APIs.

**"Barebones Tensorflow" is important to understanding the building blocks of TensorFlow, but much of it involves concepts from TensorFlow 1.x.** We will be working with legacy modules such as `tf.Variable`.

Therefore, please read and understand the differences between legacy (1.x) TF and the new (2.0) TF.

### Historical background on TensorFlow 1.x

TensorFlow 1.x is primarily a framework for working with **static computational graphs**. Nodes in the computational graph are Tensors which will hold n-dimensional arrays when the graph is run; edges in the graph represent functions that will operate on Tensors when the graph is run to actually perform useful computation.

Before TensorFlow 2.0, we had to work with the graph in two separate phases. There are plenty of tutorials online that explain this two-step process. The process generally looks like the following for TF 1.x:
1. **Build a computational graph that describes the computation that you want to perform**. This stage doesn't actually perform any computation; it just builds up a symbolic representation of your computation. This stage will typically define one or more `placeholder` objects that represent inputs to the computational graph.
2. **Run the computational graph many times.** Each time the graph is run (e.g. for one gradient descent step) you will specify which parts of the graph you want to compute, and pass a `feed_dict` dictionary that will give concrete values to any `placeholder`s in the graph.
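
To make the two phases concrete, here is a small sketch of the 1.x style written against the `tf.compat.v1` compatibility aliases that ship with TensorFlow 2. The toy computation (doubling and summing a vector) is only an illustration, not something used in this assignment.

```python
import tensorflow as tf

# Phase 1: build a symbolic graph. Nothing is computed here.
graph = tf.Graph()
with graph.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=(None, 3), name='x')
    total = tf.reduce_sum(2.0 * x)

# Phase 2: run the graph, feeding a concrete value for the placeholder.
with tf.compat.v1.Session(graph=graph) as sess:
    result = sess.run(total, feed_dict={x: [[1.0, 2.0, 3.0]]})

print(result)
```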

### The new paradigm in Tensorflow 2.0
Now, with TensorFlow 2.0, we can simply adopt a functional form that is more Pythonic and similar in spirit to PyTorch and direct NumPy operation, instead of the two-step paradigm with computation graphs. Among other things, this makes TF code easier to debug. You can read more details at https://www.tensorflow.org/guide/eager.

The main difference between the TF 1.x and 2.0 approaches is that the 2.0 approach doesn't make use of `tf.Session`, `Session.run`, `placeholder`, or `feed_dict`. For more details on what's different between the two versions and how to convert between them, check out the official migration guide: https://www.tensorflow.org/alpha/guide/migration_guide

The rest of this notebook focuses on this new, simpler approach.
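
As a small illustration of the eager style, the sketch below multiplies two constant Tensors and reads the result back immediately, with no session or placeholder involved. The values are arbitrary.

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0], [1.0]])

# The product is computed as soon as this line runs.
row_sums = tf.matmul(a, b)

# .numpy() converts the result to a NumPy array for inspection.
print(row_sums.numpy())
```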

### TensorFlow warmup: Flatten Function

We can see this in action by defining a simple `flatten` function that will reshape image data for use in a fully-connected network.

In TensorFlow, data for convolutional feature maps is typically stored in a Tensor of shape N x H x W x C where:

- N is the number of datapoints (minibatch size)
- H is the height of the feature map
- W is the width of the feature map
- C is the number of channels in the feature map

This is the right way to represent the data when we are doing something like a 2D convolution, that needs spatial understanding of where the intermediate features are relative to each other. When we use fully connected affine layers to process the image, however, we want each datapoint to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data. So, we use a "flatten" operation to collapse the `H x W x C` values per representation into a single long vector. 

Notice the `tf.reshape` call has the target shape as `(N, -1)`, meaning it will reshape/keep the first dimension to be N, and then infer as necessary what the second dimension is in the output, so we can collapse the remaining dimensions from the input properly.

**NOTE**: TensorFlow and PyTorch differ on the default Tensor layout; TensorFlow uses N x H x W x C but PyTorch uses N x C x H x W.
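
The layout conversion and the `(N, -1)` reshape can be tried out in plain NumPy first. The sketch below uses a small made-up array (2 images, 3 channels, 4 x 5 pixels) rather than real image data.

```python
import numpy as np

# Hypothetical minibatch in N x C x H x W layout.
x_nchw = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Move the channel axis last to get N x H x W x C.
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

# Collapse everything except the batch axis; -1 lets NumPy infer H * W * C.
x_flat = x_nhwc.reshape(x_nhwc.shape[0], -1)

print(x_nhwc.shape, x_flat.shape)
```

The same permutation, `(0, 2, 3, 1)`, is what `tf.transpose` needs when data arrives channels-first.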

In [51]:
def flatten(x):
    """    
    Input:
    - TensorFlow Tensor of shape (N, D1, ..., DM)
    
    Output:
    - TensorFlow Tensor of shape (N, D1 * ... * DM)
    """
    N = tf.shape(x)[0]
    return tf.reshape(x, (N, -1))

In [52]:
def test_flatten():
    # Construct concrete values of the input data x using numpy
    x_np = np.arange(24).reshape((2, 3, 4))
    print('x_np:\n', x_np, '\n')
    # Compute a concrete output value.
    x_flat_np = flatten(x_np)
    print('x_flat_np:\n', x_flat_np, '\n')

test_flatten()

x_np:
 [[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]] 

x_flat_np:
 tf.Tensor(
[[ 0  1  2  3  4  5  6  7  8  9 10 11]
 [12 13 14 15 16 17 18 19 20 21 22 23]], shape=(2, 12), dtype=int64) 



### Barebones TensorFlow: Define a Two-Layer Network
We will now implement our first neural network with TensorFlow: a fully-connected ReLU network with one hidden layer and no biases on the CINIC-10 dataset. For now we will use only low-level TensorFlow operators to define the network; later we will see how to use the higher-level abstractions provided by `tf.keras` to simplify the process.

We will define the forward pass of the network in the function `two_layer_fc`; this will accept TensorFlow Tensors for the inputs and weights of the network, and return a TensorFlow Tensor for the scores. 

After defining the network architecture in the `two_layer_fc` function, we will test the implementation by checking the shape of the output.

**It's important that you read and understand this implementation.**

In [53]:
def two_layer_fc(x, params):
    """
    A fully-connected neural network; the architecture is:
    fully-connected layer -> ReLU -> fully connected layer.
    Note that we only need to define the forward pass here; TensorFlow will take
    care of computing the gradients for us.
    
    The input to the network will be a minibatch of data, of shape
    (N, d1, ..., dM) where d1 * ... * dM = D. The hidden layer will have H units,
    and the output layer will produce scores for C classes.

    Inputs:
    - x: A TensorFlow Tensor of shape (N, d1, ..., dM) giving a minibatch of
      input data.
    - params: A list [w1, w2] of TensorFlow Tensors giving weights for the
      network, where w1 has shape (D, H) and w2 has shape (H, C).
    
    Returns:
    - scores: A TensorFlow Tensor of shape (N, C) giving classification scores
      for the input data x.
    """
    w1, w2 = params                   # Unpack the parameters
    x = flatten(x)                    # Flatten the input; now x has shape (N, D)
    h = tf.nn.relu(tf.matmul(x, w1))  # Hidden layer: h has shape (N, H)
    scores = tf.matmul(h, w2)         # Compute scores of shape (N, C)
    return scores

In [54]:
def two_layer_fc_test():
    hidden_layer_size = 42

    # Scoping our TF operations under a tf.device context manager 
    # lets us tell TensorFlow where we want these Tensors to be
    # multiplied and/or operated on, e.g. on a CPU or a GPU.
    device = "/CPU:0"   # using my CPU


    with tf.device(device):        
        x = tf.zeros((64, 32, 32, 3))
        w1 = tf.zeros((32 * 32 * 3, hidden_layer_size))
        w2 = tf.zeros((hidden_layer_size, 6))

        # Call our two_layer_fc function for the forward pass of the network.
        scores = two_layer_fc(x, [w1, w2])

    print(scores.shape)

two_layer_fc_test()

(64, 6)


### Barebones TensorFlow: Three-Layer ConvNet
Here you will complete the implementation of the function `three_layer_convnet` which will perform the forward pass of a three-layer convolutional network. The network should have the following architecture:

1. A convolutional layer (with bias) with `channel_1` filters, each with shape `KW1 x KH1`, and zero-padding of two
2. ReLU nonlinearity
3. A convolutional layer (with bias) with `channel_2` filters, each with shape `KW2 x KH2`, and zero-padding of one
4. ReLU nonlinearity
5. Fully-connected layer with bias, producing scores for `C` classes.

**HINT**: For convolutions: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/conv2d; be careful with padding!

**HINT**: For biases: https://www.tensorflow.org/performance/xla/broadcasting
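
Both hints can be checked with a little arithmetic before touching TensorFlow. The sketch below assumes stride 1 and a 32 x 32 input with 5 x 5 and 3 x 3 filters (the sizes used in the test cell further down); `conv_output_size` is a helper we define here for illustration, and the bias values are arbitrary.

```python
import numpy as np

def conv_output_size(in_size, kernel, pad, stride=1):
    """Output size of a convolution along one spatial dimension."""
    return (in_size + 2 * pad - kernel) // stride + 1

# 32 x 32 input, 5 x 5 filter with pad 2, then 3 x 3 filter with pad 1.
size_after_conv1 = conv_output_size(32, kernel=5, pad=2)
size_after_conv2 = conv_output_size(size_after_conv1, kernel=3, pad=1)

# A bias of shape (C,) broadcasts over the N, H and W axes of an NHWC array.
conv_out = np.zeros((2, size_after_conv2, size_after_conv2, 4))
bias = np.array([0.1, 0.2, 0.3, 0.4])
with_bias = conv_out + bias

print(size_after_conv1, size_after_conv2, with_bias.shape)
```

With these paddings the spatial size is preserved, which is why the fully-connected layer sees `32 * 32 * channel_2` inputs.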

In [67]:
device = "/CPU:0"  # using my CPU
def three_layer_convnet(x, params):
    """
    A three-layer convolutional network with the architecture described above.
    
    Inputs:
    - x: A TensorFlow Tensor of shape (N, 3, H, W) giving a minibatch of images
      (it is transposed to N x H x W x 3 before the first convolution)
    - params: A list of TensorFlow Tensors giving the weights and biases for the
      network; should contain the following:
      - conv_w1: TensorFlow Tensor of shape (KH1, KW1, 3, channel_1) giving
        weights for the first convolutional layer.
      - conv_b1: TensorFlow Tensor of shape (channel_1,) giving biases for the
        first convolutional layer.
      - conv_w2: TensorFlow Tensor of shape (KH2, KW2, channel_1, channel_2)
        giving weights for the second convolutional layer
      - conv_b2: TensorFlow Tensor of shape (channel_2,) giving biases for the
        second convolutional layer.
      - fc_w: TensorFlow Tensor giving weights for the fully-connected layer.
        Can you figure out what the shape should be?
      - fc_b: TensorFlow Tensor giving biases for the fully-connected layer.
        Can you figure out what the shape should be?
    """
    conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b = params
    x = tf.transpose(x, perm=[0, 2, 3, 1])
    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the three-layer ConvNet.            #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    x_padded = tf.pad(x, [[0, 0], [2, 2], [2, 2], [0, 0]]) # Pad height and width dimensions by 2 on each side
    conv1 = tf.nn.conv2d(x_padded, conv_w1, strides=[1, 1, 1, 1], padding='VALID') # 'VALID' padding since we've already padded the input
    conv1 = tf.nn.bias_add(conv1, conv_b1) # Add the bias term to the convolution output
    relu1 = tf.nn.relu(conv1) # ReLU activation for non-linearity

    relu1_padded = tf.pad(relu1, [[0, 0], [1, 1], [1, 1], [0, 0]]) # Pad height and width dimensions by 1 on each side
    conv2 = tf.nn.conv2d(relu1_padded, conv_w2, strides=[1, 1, 1, 1], padding='VALID') # 'VALID' padding since we've already padded the input
    conv2 = tf.nn.bias_add(conv2, conv_b2) # Add the bias term to the convolution output
    relu2 = tf.nn.relu(conv2) # ReLU activation again for non-linearity

    flat = tf.reshape(relu2, [tf.shape(relu2)[0], -1]) # Flatten the output to shape (N, D)
    scores = tf.matmul(flat, fc_w) + fc_b # Fully connected layer to get scores of shape (N, C)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                              END OF YOUR CODE                            #
    ############################################################################
    return scores


After defining the forward pass of the three-layer ConvNet above, run the following cell to test your implementation. As with the two-layer network, we run the function on a batch of zeros just to make sure it doesn't crash and produces outputs of the correct shape.

When you run this function, `scores_np` should have shape `(64, 6)`.

In [68]:
def three_layer_convnet_test():
    
    with tf.device(device):
        # The model expects N x C x H x W input and transposes it internally.
        x = tf.zeros((64, 3, 32, 32))
        conv_w1 = tf.zeros((5, 5, 3, 6))
        conv_b1 = tf.zeros((6,))
        conv_w2 = tf.zeros((3, 3, 6, 9))
        conv_b2 = tf.zeros((9,))
        fc_w = tf.zeros((32 * 32 * 9, 6))
        fc_b = tf.zeros((6,))
        params = [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]
        scores = three_layer_convnet(x, params)

    # Inputs to convolutional layers are 4-dimensional arrays with shape
    # [batch_size, height, width, channels]
    print('scores_np has shape: ', scores.shape)

three_layer_convnet_test()

scores_np has shape:  (64, 6)


### Barebones TensorFlow: Training Step

We now define the `training_step` function, which performs a single training step. This takes three basic steps:

1. Compute the loss
2. Compute the gradient of the loss with respect to all network weights
3. Make a weight update step using (stochastic) gradient descent.


We need to use a few new TensorFlow functions to do all of this:
- For computing the cross-entropy loss we'll use `tf.nn.sparse_softmax_cross_entropy_with_logits`: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits

- For averaging the loss across a minibatch of data we'll use `tf.reduce_mean`:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/reduce_mean

- For computing gradients of the loss with respect to the weights we'll use `tf.GradientTape` (useful for Eager execution):  https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/GradientTape

- We'll mutate the weight values stored in each `tf.Variable` in place using its `assign_sub` method ("sub" is for subtraction): https://www.tensorflow.org/api_docs/python/tf/assign_sub 
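
The four pieces fit together as in the toy sketch below. It assumes a single made-up example with two features and three classes, and a weight matrix initialised to zero, so the numbers are easy to check by hand; none of these values come from the assignment.

```python
import math
import tensorflow as tf

w = tf.Variable(tf.zeros((2, 3)))   # toy weights: 2 features, 3 classes
x = tf.constant([[1.0, 2.0]])       # one example
y = tf.constant([1])                # its integer class label

# Record the forward pass so the loss can be differentiated.
with tf.GradientTape() as tape:
    scores = tf.matmul(x, w)
    per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=scores)
    loss = tf.reduce_mean(per_example)

grad = tape.gradient(loss, w)

# One step of gradient descent, applied in place to the Variable.
w.assign_sub(0.1 * grad)

# With all-zero scores the softmax is uniform, so the loss equals ln(3).
print(loss.numpy(), math.log(3.0))
```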


In [57]:
def training_step(model_fn, x, y, params, learning_rate):
    # Record the forward pass so TensorFlow can differentiate through it.
    with tf.GradientTape() as tape:
        scores = model_fn(x, params) # Forward pass of the model
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=scores)
        total_loss = tf.reduce_mean(loss)

    # Gradients and updates do not need to be recorded, so they go outside the tape.
    grad_params = tape.gradient(total_loss, params)

    # Make a vanilla gradient descent step on all of the model parameters
    # by updating each tf.Variable in place with assign_sub().
    for w, grad_w in zip(params, grad_params):
        w.assign_sub(learning_rate * grad_w)

    return total_loss

In [70]:
num_iters = 200  # number of iterations to run
batch_size = 50  # minibatch size
print_every = 10 # how often to print the loss
device = "/CPU:0"
def train_part2(model_fn, init_fn, learning_rate):
    """
    Train a model on CINIC-10.
    
    Inputs:
    - model_fn: A Python function that performs the forward pass of the model
      using TensorFlow; it should have the following signature:
      scores = model_fn(x, params) where x is a TensorFlow Tensor giving a
      minibatch of image data, params is a list of TensorFlow Tensors holding
      the model weights, and scores is a TensorFlow Tensor of shape (N, C)
      giving scores for all elements of x.
    - init_fn: A Python function that initializes the parameters of the model.
      It should have the signature params = init_fn() where params is a list
      of TensorFlow Tensors holding the (randomly initialized) weights of the
      model.
    - learning_rate: Python float giving the learning rate to use for SGD.
    """
    
    
    params = init_fn()  # Initialize the model parameters            
        
    for t, (x_np, y_np) in enumerate(train_dset):
        # Run the graph on a batch of training data.
        loss = training_step(model_fn, x_np, y_np, params, learning_rate)
        
        # Periodically print the loss and check accuracy on the val set.
        if t % print_every == 0:
            print('Iteration %d, loss = %.4f' % (t, loss))
            check_accuracy(val_dset, x_np, model_fn, params)

In [71]:
def check_accuracy(dset, x, model_fn, params):
    """
    Check accuracy on a classification model, e.g. for validation.
    
    Inputs:
    - dset: A Dataset object against which to check accuracy
    - x: Not used by this eager implementation; kept so the call signature
      stays the same
    - model_fn: the Model we will be calling to make predictions on x
    - params: parameters for the model_fn to work with
      
    Returns: Nothing, but prints the accuracy of the model
    """
    num_correct, num_samples = 0, 0
    for x_batch, y_batch in dset:
        scores_np = model_fn(x_batch, params).numpy()
        y_pred = scores_np.argmax(axis=1)
        num_samples += x_batch.shape[0]
        num_correct += (y_pred == y_batch).sum()
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))

### Barebones TensorFlow: Initialization
We'll use the following utility method to initialize the weight matrices for our models using Kaiming normal initialization [1].

[1] He et al, *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*, ICCV 2015, https://arxiv.org/abs/1502.01852
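
In short, each weight is drawn from a zero-mean normal distribution with standard deviation `sqrt(2 / fan_in)`. The NumPy sketch below checks this empirically; the layer sizes and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fully-connected layer: fan_in is the number of input features.
fan_in, fan_out = 3072, 100
target_std = np.sqrt(2.0 / fan_in)
w = rng.standard_normal((fan_in, fan_out)) * target_std

# Convolutional filter of shape (KH, KW, C_in, C_out): fan_in is KH * KW * C_in.
conv_shape = (5, 5, 3, 32)
conv_fan_in = int(np.prod(conv_shape[:3]))

print(w.std(), target_std, conv_fan_in)
```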

In [60]:
def create_matrix_with_kaiming_normal(shape):
    if len(shape) == 2:
        fan_in, fan_out = shape[0], shape[1]
    elif len(shape) == 4:
        fan_in, fan_out = np.prod(shape[:3]), shape[3]
    return tf.keras.backend.random_normal(shape) * np.sqrt(2.0 / fan_in)

### Barebones TensorFlow: Train a Two-Layer Network
We are finally ready to use all of the pieces defined above to train a two-layer fully-connected network on CINIC-10.

We just need to define a function to initialize the weights of the model, and call `train_part2`.

Defining the weights of the network introduces another important piece of the TensorFlow API: `tf.Variable`. A TensorFlow Variable is a Tensor whose value persists across calls (in TF 1.x terms, it is stored in the graph and survives across runs of the graph). Unlike tensors created with `tf.zeros` or `tf.random.normal`, however, the value of a Variable can be mutated in place, and these mutations persist. Learnable parameters of the network are usually stored in Variables.
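
A short sketch of that mutability, using arbitrary values:

```python
import tensorflow as tf

# A Variable holds state that can be updated in place.
v = tf.Variable(tf.zeros((2,)))
v.assign_add([1.0, 2.0])   # v is now [1.0, 2.0]
v.assign_sub([0.5, 0.5])   # v is now [0.5, 1.5]

print(v.numpy())
```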

You don't need to tune any hyperparameters, but you should achieve validation accuracies above 30% after one epoch of training.

In [63]:
def two_layer_fc_init():
    """
    Initialize the weights of a two-layer network, for use with the
    two_layer_fc function defined above. 
    You can use the `create_matrix_with_kaiming_normal` helper!
    
    Inputs: None
    
    Returns: A list of:
    - w1: TensorFlow tf.Variable giving the weights for the first layer
    - w2: TensorFlow tf.Variable giving the weights for the second layer
    """
    hidden_layer_size = 4000
    w1 = tf.Variable(create_matrix_with_kaiming_normal((3 * 32 * 32, hidden_layer_size)))
    w2 = tf.Variable(create_matrix_with_kaiming_normal((hidden_layer_size, 6)))
    return [w1, w2]

learning_rate = 1e-4
train_part2(two_layer_fc, two_layer_fc_init, learning_rate)

Iteration 0, loss = 111.0117
Got 1910 / 10195 correct (18.73%)
Iteration 10, loss = 69.9516
Got 2063 / 10195 correct (20.24%)
Iteration 20, loss = 80.0376
Got 2473 / 10195 correct (24.26%)
Iteration 30, loss = 83.1900
Got 2511 / 10195 correct (24.63%)
Iteration 40, loss = 76.4602
Got 2448 / 10195 correct (24.01%)
Iteration 50, loss = 67.1428
Got 2567 / 10195 correct (25.18%)
Iteration 60, loss = 87.6555
Got 2647 / 10195 correct (25.96%)
Iteration 70, loss = 68.7226
Got 2647 / 10195 correct (25.96%)
Iteration 80, loss = 43.5270
Got 2628 / 10195 correct (25.78%)
Iteration 90, loss = 67.0045
Got 2556 / 10195 correct (25.07%)
Iteration 100, loss = 96.2568
Got 2448 / 10195 correct (24.01%)
Iteration 110, loss = 57.7961
Got 2671 / 10195 correct (26.20%)
Iteration 120, loss = 65.4543
Got 2931 / 10195 correct (28.75%)
Iteration 130, loss = 79.3598
Got 2660 / 10195 correct (26.09%)
Iteration 140, loss = 55.9113
Got 2910 / 10195 correct (28.54%)
Iteration 150, loss = 68.9272
Got 2501 / 10195 cor

### Barebones TensorFlow: Train a three-layer ConvNet
We will now use TensorFlow to train a three-layer ConvNet on CINIC-10.

You need to implement the `three_layer_convnet_init` function. Recall that the architecture of the network is:

1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding 2
2. ReLU
3. Convolutional layer (with bias) with 16 3x3 filters, with zero-padding 1
4. ReLU
5. Fully-connected layer (with bias) to compute scores for 6 classes

You don't need to do any hyperparameter tuning, but you should see validation accuracies above 30% after one epoch of training.
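
To get a feel for where the parameters of this architecture live, the sketch below counts them by hand. It assumes 32 x 32 inputs with 3 channels and paddings that preserve the spatial size; `conv_params` and `fc_params` are helpers defined here for illustration.

```python
def conv_params(kh, kw, c_in, c_out):
    """Weights plus biases of a convolutional layer."""
    return kh * kw * c_in * c_out + c_out

def fc_params(d_in, d_out):
    """Weights plus biases of a fully-connected layer."""
    return d_in * d_out + d_out

conv1 = conv_params(5, 5, 3, 32)    # first conv layer
conv2 = conv_params(3, 3, 32, 16)   # second conv layer
fc = fc_params(32 * 32 * 16, 6)     # scores for 6 classes
total = conv1 + conv2 + fc

print(conv1, conv2, fc, total)
```

Under these assumptions the fully-connected layer holds the large majority of the parameters.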

In [101]:
def three_layer_convnet_init():
    """
    Initialize the weights of a Three-Layer ConvNet, for use with the
    three_layer_convnet function defined above.
    You can use the `create_matrix_with_kaiming_normal` helper!
    
    Inputs: None
    
    Returns a list containing:
    - conv_w1: TensorFlow tf.Variable giving weights for the first conv layer
    - conv_b1: TensorFlow tf.Variable giving biases for the first conv layer
    - conv_w2: TensorFlow tf.Variable giving weights for the second conv layer
    - conv_b2: TensorFlow tf.Variable giving biases for the second conv layer
    - fc_w: TensorFlow tf.Variable giving weights for the fully-connected layer
    - fc_b: TensorFlow tf.Variable giving biases for the fully-connected layer
    """
    params = None
    ############################################################################
    # TODO: Initialize the parameters of the three-layer network.              #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    channel_1, channel_2, num_classes = 32, 16, 6 # filter counts and number of classes from the architecture above
    conv_w1 = tf.Variable(create_matrix_with_kaiming_normal((5, 5, 3, channel_1)), dtype=tf.float32) # (KH1, KW1, 3, channel_1)
    conv_b1 = tf.Variable(tf.zeros(channel_1, dtype=tf.float32)) # (channel_1,)
    conv_w2 = tf.Variable(create_matrix_with_kaiming_normal((3, 3, channel_1, channel_2)), dtype=tf.float32) # (KH2, KW2, channel_1, channel_2)
    conv_b2 = tf.Variable(tf.zeros(channel_2, dtype=tf.float32)) # (channel_2,)
    fc_input_dim = 32 * 32 * channel_2 
    fc_w = tf.Variable(create_matrix_with_kaiming_normal((fc_input_dim, num_classes)), dtype=tf.float32)
    fc_b = tf.Variable(tf.zeros(num_classes, dtype=tf.float32))
    params = [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
    return params

learning_rate = 1e-4
train_part2(three_layer_convnet, three_layer_convnet_init, learning_rate)


Iteration 0, loss = 157.6129
Got 1712 / 10195 correct (16.79%)
Iteration 10, loss = 17.2059
Got 2341 / 10195 correct (22.96%)
Iteration 20, loss = 15.1101
Got 2378 / 10195 correct (23.33%)
Iteration 30, loss = 13.7050
Got 2363 / 10195 correct (23.18%)
Iteration 40, loss = 11.8909
Got 2392 / 10195 correct (23.46%)
Iteration 50, loss = 11.6599
Got 2366 / 10195 correct (23.21%)
Iteration 60, loss = 9.1301
Got 2306 / 10195 correct (22.62%)
Iteration 70, loss = 7.8653
Got 2230 / 10195 correct (21.87%)
Iteration 80, loss = 6.9172
Got 2238 / 10195 correct (21.95%)
Iteration 90, loss = 7.2950
Got 2265 / 10195 correct (22.22%)
Iteration 100, loss = 5.7653
Got 2225 / 10195 correct (21.82%)
Iteration 110, loss = 4.5083
Got 2200 / 10195 correct (21.58%)
Iteration 120, loss = 5.0539
Got 2228 / 10195 correct (21.85%)
Iteration 130, loss = 5.0198
Got 2258 / 10195 correct (22.15%)
Iteration 140, loss = 5.0178
Got 2253 / 10195 correct (22.10%)
Iteration 150, loss = 4.9254
Got 2233 / 10195 correct (21.9

# Part III: Keras Model Subclassing API

Implementing a neural network using the low-level TensorFlow API is a good way to understand how TensorFlow works, but it's a little inconvenient: we had to manually keep track of all Tensors holding learnable parameters. This was fine for a small network, but could quickly become unwieldy for a large, complex model.

Fortunately, TensorFlow 2.0 provides higher-level APIs such as `tf.keras` that make it easy to build models out of modular, object-oriented layers. Further, TensorFlow 2.0 uses eager execution, which evaluates operations immediately without explicitly constructing any computational graphs. This makes it easy to write and debug models, and reduces boilerplate code.

In this part of the notebook we will define neural network models using the `tf.keras.Model` API. To implement your own model, you need to do the following:

1. Define a new class which subclasses `tf.keras.Model`. Give your class an intuitive name that describes it, like `TwoLayerFC` or `ThreeLayerConvNet`.
2. In the initializer `__init__()` for your new class, define all the layers you need as attributes on `self`. The `tf.keras.layers` package provides many common neural-network layers, like `tf.keras.layers.Dense` for fully-connected layers and `tf.keras.layers.Conv2D` for convolutional layers. Under the hood, these layers will construct `Variable` Tensors for any learnable parameters. **Warning**: Don't forget to call `super(YourModelName, self).__init__()` as the first line in your initializer!
3. Implement the `call()` method for your class; this implements the forward pass of your model, and defines the *connectivity* of your network. Layers defined in `__init__()` implement `__call__()` so they can be used as function objects that transform input Tensors into output Tensors. Don't define any new layers in `call()`; any layers you want to use in the forward pass should be defined in `__init__()`.

After you define your `tf.keras.Model` subclass, you can instantiate it and use it like the model functions from Part II.

### Keras Model Subclassing API: Two-Layer Network

Here is a concrete example of using the `tf.keras.Model` API to define a two-layer network. There are a few new bits of API to be aware of here:

We use an `Initializer` object to set up the initial values of the learnable parameters of the layers; in particular `tf.initializers.VarianceScaling` gives behavior similar to the Kaiming initialization method we used in Part II. You can read more about it here: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/initializers/VarianceScaling

We construct `tf.keras.layers.Dense` objects to represent the two fully-connected layers of the model. In addition to multiplying their input by a weight matrix and adding a bias vector, these layers can also apply a nonlinearity for you. For the first layer we specify a ReLU activation function by passing `activation='relu'` to the constructor; the second layer uses a softmax activation function. Finally, we use `tf.keras.layers.Flatten` to flatten the input before it is passed to the first fully-connected layer.
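To make the link between `VarianceScaling(scale=2.0)` and Kaiming initialization concrete, here is a small framework-free sketch of the target standard deviation, `sqrt(scale / fan_in)`. It assumes the initializer's default `fan_in` mode and the kernel layouts used in this notebook; the helper names `fan_in` and `kaiming_std` are made up for illustration and are not TensorFlow functions.

```python
import math

def fan_in(shape):
    """Number of inputs feeding each output unit for a kernel of this shape.

    Assumes (in_dim, out_dim) for dense weights and
    (KH, KW, in_channels, out_channels) for conv kernels.
    """
    receptive_field = math.prod(shape[:-2])  # 1 for a dense weight matrix
    return receptive_field * shape[-2]

def kaiming_std(shape, scale=2.0):
    """Target standard deviation sqrt(scale / fan_in)."""
    return math.sqrt(scale / fan_in(shape))

# Dense layer on a flattened 3x32x32 image, and a 5x5 conv over 3 channels.
dense_std = kaiming_std((3 * 32 * 32, 4000))
conv_std = kaiming_std((5, 5, 3, 32))
print(dense_std, conv_std)
```

Note that the standard deviation depends only on the number of inputs per unit, not on the number of outputs, which is why the same initializer object can be shared across layers of different sizes.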

In [93]:
class TwoLayerFC(tf.keras.Model):
    def __init__(self, hidden_size, num_classes):
        super(TwoLayerFC, self).__init__()        
        initializer = tf.initializers.VarianceScaling(scale=2.0)
        self.fc1 = tf.keras.layers.Dense(hidden_size, activation='relu',
                                   kernel_initializer=initializer)
        self.fc2 = tf.keras.layers.Dense(num_classes, activation='softmax',
                                   kernel_initializer=initializer)
        self.flatten = tf.keras.layers.Flatten()
    
    def call(self, x, training=False):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.fc2(x)
        return x


def test_TwoLayerFC():
    """ A small unit test to exercise the TwoLayerFC model above. """
    input_size, hidden_size, num_classes = 50, 42, 6
    x = tf.zeros((64, input_size))
    model = TwoLayerFC(hidden_size, num_classes)
    with tf.device(device):
        scores = model(x)
        print(scores.shape)
        
test_TwoLayerFC()

(64, 6)


In [None]:
hidden_size, num_classes = 4000, 6
learning_rate = 1e-4

def model_init_fn():
    return TwoLayerFC(hidden_size, num_classes)

def optimizer_init_fn():
    return tf.keras.optimizers.SGD(learning_rate=learning_rate)

train_part34(model_init_fn, optimizer_init_fn, is_training=True)




Iteration 0, Epoch 1, Loss: 96.62543487548828, Accuracy: 15.625, Val Loss: 102.14210510253906, Val Accuracy: 18.528690338134766
Iteration 10, Epoch 1, Loss: 87.73381042480469, Accuracy: 21.875, Val Loss: 85.21002197265625, Val Accuracy: 19.55860710144043
Iteration 20, Epoch 1, Loss: 82.72669219970703, Accuracy: 23.214284896850586, Val Loss: 69.49610137939453, Val Accuracy: 24.12947654724121
Iteration 30, Epoch 1, Loss: 78.68995666503906, Accuracy: 23.941532135009766, Val Loss: 78.77084350585938, Val Accuracy: 21.196664810180664
Iteration 40, Epoch 1, Loss: 76.35467529296875, Accuracy: 24.352134704589844, Val Loss: 76.78191375732422, Val Accuracy: 23.845022201538086
Iteration 50, Epoch 1, Loss: 78.9779052734375, Accuracy: 23.71323585510254, Val Loss: 70.71685791015625, Val Accuracy: 23.815595626831055
Iteration 60, Epoch 1, Loss: 78.7224349975586, Accuracy: 23.616802215576172, Val Loss: 85.17578887939453, Val Accuracy: 21.53997039794922
Iteration 70, Epoch 1, Loss: 78.10658264160156, Ac

### Keras Model Subclassing API: Three-Layer ConvNet
Now it's your turn to implement a three-layer ConvNet using the `tf.keras.Model` API. Your model should have the same architecture used in Part II:

1. Convolutional layer with 5 x 5 kernels, with zero-padding of 2
2. ReLU nonlinearity
3. Convolutional layer with 3 x 3 kernels, with zero-padding of 1
4. ReLU nonlinearity
5. Fully-connected layer to give class scores
6. Softmax nonlinearity

You should initialize the weights of your network using the same initialization method as was used in the two-layer network above.

**Hint**: Refer to the documentation for `tf.keras.layers.Conv2D` and `tf.keras.layers.Dense`:

https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Conv2D

https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Dense
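The padding values in the architecture above are chosen so that each convolution preserves the spatial size of its input, which is what `padding='same'` gives you for stride-1 convolutions. The following sketch checks this with the standard output-size formula; `conv_output_size` is an illustrative helper, not a TensorFlow function, and the channel count of 16 is just an example value.

```python
def conv_output_size(input_size, kernel_size, padding, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

h1 = conv_output_size(32, kernel_size=5, padding=2)   # after the 5x5 conv
h2 = conv_output_size(h1, kernel_size=3, padding=1)   # after the 3x3 conv

channel_2 = 16  # example number of filters in the second conv layer
fc_input_dim = h2 * h2 * channel_2  # features seen by the fully-connected layer
print(h1, h2, fc_input_dim)
```

Because both convolutions keep the 32 x 32 resolution, the flattened input to the fully-connected layer has `32 * 32 * channel_2` features.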

In [102]:
class ThreeLayerConvNet(tf.keras.Model):
    def __init__(self, channel_1, channel_2, num_classes):
        super(ThreeLayerConvNet, self).__init__()
        ########################################################################
        # TODO: Implement the __init__ method for a three-layer ConvNet. You   #
        # should instantiate layer objects to be used in the forward pass.     #
        ########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        initializer = tf.initializers.VarianceScaling(scale=2.0)
        self.conv1 = tf.keras.layers.Conv2D(
            filters=channel_1,
            kernel_size=(5, 5),
            padding='same',
            activation=None,
            kernel_initializer=initializer,
            bias_initializer='zeros')
        self.relu1 = tf.keras.layers.ReLU()
        self.conv2 = tf.keras.layers.Conv2D(
            filters=channel_2,
            kernel_size=(3, 3),
            padding='same',
            activation=None,
            kernel_initializer=initializer,
            bias_initializer='zeros')
        self.relu2 = tf.keras.layers.ReLU()
        self.flatten = tf.keras.layers.Flatten()
        self.fc = tf.keras.layers.Dense(
            units=num_classes,
            activation='softmax',
            kernel_initializer=initializer,
            bias_initializer='zeros')
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################
        
    def call(self, x, training=False):
        scores = None
        ########################################################################
        # TODO: Implement the forward pass for a three-layer ConvNet. You      #
        # should use the layer objects defined in the __init__ method.         #
        ########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        x = tf.transpose(x, perm=[0, 2, 3, 1])  # NCHW -> NHWC for TF conv layers
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.flatten(x)
        scores = self.fc(x)
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################        
        return scores


Once you complete the implementation of the `ThreeLayerConvNet` above, you can run the following to ensure that your implementation does not crash and produces outputs of the expected shape.

In [103]:
def test_ThreeLayerConvNet():    
    channel_1, channel_2, num_classes = 12, 8, 6
    model = ThreeLayerConvNet(channel_1, channel_2, num_classes)
    with tf.device(device):
        x = tf.zeros((64, 3, 32, 32))
        scores = model(x)
        print(scores.shape)

test_ThreeLayerConvNet()


(64, 6)


### Keras Model Subclassing API: Eager Training

While Keras models have a built-in training loop (`model.fit`), sometimes you need more customization. Here's an example of a training loop implemented with eager execution.

In particular, notice `tf.GradientTape`. Frameworks like TensorFlow implement backpropagation using automatic differentiation. During eager execution, `tf.GradientTape` records operations as they run so that gradients can be computed afterwards. By default, a `tf.GradientTape` can compute only one gradient: subsequent calls to `tape.gradient()` will throw a runtime error unless the tape was created with `persistent=True`.
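To build intuition for what "recording operations for computing gradients later" means, here is a toy, framework-free tape for scalars. It is only a sketch of the idea and not how TensorFlow implements `tf.GradientTape`; the names `ToyTape` and `Var` are invented for this illustration, and the single-use behaviour is imitated with a simple flag.

```python
class ToyTape:
    """Records each operation so gradients can be replayed in reverse."""

    def __init__(self):
        self.records = []   # entries: (output, [(input, d_output/d_input), ...])
        self.used = False

    def gradient(self, target, sources):
        if self.used:
            raise RuntimeError("this toy tape can only compute one gradient")
        self.used = True
        grads = {id(target): 1.0}
        # Walk the recorded ops backwards, applying the chain rule.
        for out, parents in reversed(self.records):
            upstream = grads.get(id(out), 0.0)
            for parent, local_grad in parents:
                grads[id(parent)] = grads.get(id(parent), 0.0) + upstream * local_grad
        return [grads.get(id(s), 0.0) for s in sources]


class Var:
    """A scalar whose arithmetic is recorded on a ToyTape."""

    def __init__(self, value, tape):
        self.value, self.tape = value, tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        self.tape.records.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        self.tape.records.append((out, [(self, other.value), (other, self.value)]))
        return out


tape = ToyTape()
w, x, b = Var(3.0, tape), Var(2.0, tape), Var(1.0, tape)
y = w * x + b          # the forward pass is recorded as it runs
loss = y * y
dw, db = tape.gradient(loss, [w, b])
print(dw, db)
```

The forward pass here plays the role of the code inside the `with tf.GradientTape() as tape:` block, and `tape.gradient(loss, [w, b])` plays the role of `tape.gradient(loss, model.trainable_variables)`.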

TensorFlow 2.0 ships with easy-to-use built-in metrics under the `tf.keras.metrics` module. Each metric is an object: we call `update_state()` to add observations, `reset_states()` to clear all observations (newer TensorFlow releases name this method `reset_state()`), and `result()` to get the current value of the metric.
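One consequence of this design is that a metric such as `tf.keras.metrics.Mean` reports an average over everything observed since the last reset, not the most recent value. The toy class below sketches that behaviour in plain Python; `RunningMean` is a made-up stand-in, and returning `0.0` when no observations have been added is an assumption of this sketch.

```python
class RunningMean:
    """Toy stand-in for a mean metric with update/result/reset methods."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update_state(self, value):
        # Accumulate one observation (for example, one batch loss).
        self.total += value
        self.count += 1

    def result(self):
        # Average over everything seen since the last reset.
        return self.total / self.count if self.count else 0.0

    def reset_states(self):
        self.total = 0.0
        self.count = 0


train_loss = RunningMean()
for batch_loss in [2.0, 4.0]:
    train_loss.update_state(batch_loss)
mean_before_reset = train_loss.result()

train_loss.reset_states()
mean_after_reset = train_loss.result()
print(mean_before_reset, mean_after_reset)
```

This is why the training loss printed by the loop below changes smoothly within an epoch: it is the mean over all batches seen so far in that epoch, while the validation metrics are reset before every evaluation.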

In [104]:
def train_part34(model_init_fn, optimizer_init_fn, num_epochs=1, is_training=False):
    """
    Simple training loop for use with models defined using tf.keras. It trains
    a model for num_epochs epochs on the CINIC-10 training set and periodically
    checks accuracy on the CINIC-10 validation set.
    
    Inputs:
    - model_init_fn: A function that takes no parameters; when called it
      constructs the model we want to train: model = model_init_fn()
    - optimizer_init_fn: A function which takes no parameters; when called it
      constructs the Optimizer object we will use to optimize the model:
      optimizer = optimizer_init_fn()
    - num_epochs: The number of epochs to train for
    - is_training: Value passed as the `training` flag to the model during
      training steps (validation always uses training=False)
    
    Returns: Nothing, but prints progress during training
    """    
    with tf.device(device):
        # Compute the loss like we did in Part II
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

        model = model_init_fn()
        optimizer = optimizer_init_fn()

        train_loss = tf.keras.metrics.Mean(name='train_loss')
        train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

        val_loss = tf.keras.metrics.Mean(name='val_loss')
        val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='val_accuracy')

        t = 0
        for epoch in range(num_epochs):
            # Reset the metrics - https://www.tensorflow.org/alpha/guide/migration_guide#new-style_metrics
            train_loss.reset_states()
            train_accuracy.reset_states()

            for x_np, y_np in train_dset:
                with tf.GradientTape() as tape:
                    # Use the model function to build the forward pass.
                    scores = model(x_np, training=is_training)
                    loss = loss_fn(y_np, scores)

                gradients = tape.gradient(loss, model.trainable_variables)
                optimizer.apply_gradients(zip(gradients, model.trainable_variables))

                # Update the metrics
                train_loss.update_state(loss)
                train_accuracy.update_state(y_np, scores)

                if t % print_every == 0:
                    val_loss.reset_states()
                    val_accuracy.reset_states()
                    for test_x, test_y in val_dset:
                        # Validation runs every print_every iterations, always with training=False
                        prediction = model(test_x, training=False)
                        t_loss = loss_fn(test_y, prediction)

                        val_loss.update_state(t_loss)
                        val_accuracy.update_state(test_y, prediction)

                    template = 'Iteration {}, Epoch {}, Loss: {}, Accuracy: {}, Val Loss: {}, Val Accuracy: {}'
                    print(template.format(
                        t, epoch + 1,
                        train_loss.result(),
                        train_accuracy.result() * 100,
                        val_loss.result(),
                        val_accuracy.result() * 100
                    ))
                t += 1


### Keras Model Subclassing API: Train a Two-Layer Network
We can now use the tools defined above to train a two-layer network on CINIC-10. We define the `model_init_fn` and `optimizer_init_fn` that construct the model and optimizer respectively when called. Here we want to train the model using stochastic gradient descent with no momentum, so we construct a `tf.keras.optimizers.SGD` object; you can [read about it here](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/optimizers/SGD).

You don't need to tune any hyperparameters here, but you should achieve validation accuracies above 30% after one epoch of training.

In [105]:
learning_rate = 1e-3
num_epochs = 15
channel_1, channel_2, channel_3 = 64, 128, 256
num_classes = 6

def model_init_fn():
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(3, 32, 32)),
        tf.keras.layers.Permute((2, 3, 1)),  # NCHW -> NHWC
        tf.keras.layers.RandomFlip('horizontal'),
        tf.keras.layers.RandomRotation(0.05),
        tf.keras.layers.Conv2D(channel_1, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(channel_1, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPool2D(2),
        tf.keras.layers.Conv2D(channel_2, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(channel_2, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPool2D(2),
        tf.keras.layers.Conv2D(channel_3, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation='relu', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(num_classes, activation='softmax', kernel_initializer=initializer, bias_initializer='zeros'),
    ])

def optimizer_init_fn():
    return tf.keras.optimizers.RMSprop(learning_rate=learning_rate, rho=0.95, momentum=0.0, epsilon=1e-7)

train_part34(model_init_fn, optimizer_init_fn, num_epochs=num_epochs, is_training=True)




Iteration 0, Epoch 1, Loss: 91.34944915771484, Accuracy: 10.9375, Val Loss: 56.33817672729492, Val Accuracy: 15.743011474609375
Iteration 10, Epoch 1, Loss: 46.9233283996582, Accuracy: 14.772727012634277, Val Loss: 36.8912239074707, Val Accuracy: 18.73467445373535
Iteration 20, Epoch 1, Loss: 41.011634826660156, Accuracy: 17.931547164916992, Val Loss: 31.84600830078125, Val Accuracy: 19.794017791748047
Iteration 30, Epoch 1, Loss: 37.39537811279297, Accuracy: 18.598791122436523, Val Loss: 29.59173583984375, Val Accuracy: 20.62775993347168
Iteration 40, Epoch 1, Loss: 35.06196212768555, Accuracy: 19.474084854125977, Val Loss: 26.47980308532715, Val Accuracy: 20.892595291137695
Iteration 50, Epoch 1, Loss: 33.46692657470703, Accuracy: 19.577205657958984, Val Loss: 24.708444595336914, Val Accuracy: 20.60814094543457
Iteration 60, Epoch 1, Loss: 31.862350463867188, Accuracy: 19.87704849243164, Val Loss: 23.219335556030273, Val Accuracy: 21.245708465576172
Iteration 70, Epoch 1, Loss: 30.73

### Keras Model Subclassing API: Train a Three-Layer ConvNet
Here you should use the tools we've defined above to train a three-layer ConvNet on CINIC-10. Your ConvNet should use 32 filters in the first convolutional layer and 16 filters in the second layer.

To train the model you should use gradient descent with Nesterov momentum 0.9.  

**HINT**: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/optimizers/SGD
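For reference, the update rule behind SGD with Nesterov momentum can be sketched for a single scalar parameter as follows. This follows the update rule described in the Keras SGD documentation as best we understand it; the function name `nesterov_sgd_step` and the example numbers are made up for illustration. In Keras itself, the corresponding optimizer is constructed with `tf.keras.optimizers.SGD(learning_rate=..., momentum=0.9, nesterov=True)`.

```python
def nesterov_sgd_step(w, velocity, grad, learning_rate=1e-3, momentum=0.9):
    """One SGD step with Nesterov momentum on a scalar parameter."""
    # Update the velocity with the current gradient.
    velocity = momentum * velocity - learning_rate * grad
    # Nesterov look-ahead: step along the updated velocity plus the gradient.
    w = w + momentum * velocity - learning_rate * grad
    return w, velocity

# Example values chosen only to make the arithmetic easy to follow.
w, v = nesterov_sgd_step(w=1.0, velocity=0.0, grad=2.0, learning_rate=0.1)
print(w, v)
```

With `momentum=0.0` the rule reduces to plain gradient descent, `w = w - learning_rate * grad`.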


In [106]:
learning_rate = 1e-3
channel_1, channel_2, num_classes = 32, 16, 6

def model_init_fn():
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(3, 32, 32)),
        tf.keras.layers.Permute((2, 3, 1)),  # NCHW -> NHWC
        tf.keras.layers.Conv2D(channel_1, (5, 5), padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(channel_2, (3, 3), padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation='softmax', kernel_initializer=initializer, bias_initializer='zeros'),
    ])

def optimizer_init_fn():
    return tf.keras.optimizers.RMSprop(learning_rate=learning_rate, rho=0.9, momentum=0.0, epsilon=1e-7)

train_part34(model_init_fn, optimizer_init_fn, is_training=True)




Iteration 0, Epoch 1, Loss: 55.43280029296875, Accuracy: 17.1875, Val Loss: 218.27096557617188, Val Accuracy: 16.929868698120117
Iteration 10, Epoch 1, Loss: 32.809722900390625, Accuracy: 17.471590042114258, Val Loss: 1.792354941368103, Val Accuracy: 16.58656120300293
Iteration 20, Epoch 1, Loss: 18.04017448425293, Accuracy: 16.220239639282227, Val Loss: 1.7917680740356445, Val Accuracy: 16.6650333404541
Iteration 30, Epoch 1, Loss: 12.798770904541016, Accuracy: 15.625, Val Loss: 1.7917680740356445, Val Accuracy: 16.6650333404541
Iteration 40, Epoch 1, Loss: 10.11412525177002, Accuracy: 15.891768455505371, Val Loss: 1.7917684316635132, Val Accuracy: 16.6650333404541
Iteration 50, Epoch 1, Loss: 8.482284545898438, Accuracy: 15.870099067687988, Val Loss: 1.7917699813842773, Val Accuracy: 16.674840927124023
Iteration 60, Epoch 1, Loss: 7.385487079620361, Accuracy: 15.625, Val Loss: 1.7917696237564087, Val Accuracy: 16.645414352416992
Iteration 70, Epoch 1, Loss: 6.597649097442627, Accurac

# Part IV: Keras Sequential API
In Part III we introduced the `tf.keras.Model` API, which allows you to define models with any number of learnable layers and with arbitrary connectivity between layers.

However, for many models you don't need such flexibility: a lot of models can be expressed as a sequential stack of layers, with the output of each layer fed to the next layer as input. If your model fits this pattern, then there is an even easier way to define your model: using `tf.keras.Sequential`. You don't need to write any custom classes; you simply call the `tf.keras.Sequential` constructor with a list containing a sequence of layer objects.

One complication with `tf.keras.Sequential` is that you must define the shape of the input to the model by passing a value to the `input_shape` argument of the first layer in your model.
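The "stack of layers" idea can be sketched without any framework: a sequential model simply applies each callable in order, feeding each output to the next. `ToySequential` below is a hypothetical stand-in for illustration only; it has no parameters, shapes, or training support, unlike `tf.keras.Sequential`.

```python
from functools import reduce

class ToySequential:
    """Applies a list of callables in order, like a stack of layers."""

    def __init__(self, layers):
        self.layers = list(layers)

    def __call__(self, x):
        # Feed the output of each layer into the next one.
        return reduce(lambda out, layer: layer(out), self.layers, x)


toy_model = ToySequential([
    lambda x: [2 * v for v in x],        # a fixed "linear" layer
    lambda x: [max(v, 0) for v in x],    # ReLU-like nonlinearity
    sum,                                 # collapse to a single score
])
result = toy_model([1, -2, 3])
print(result)
```

Anything that needs a layer to see more than one earlier output, such as a residual connection, does not fit this pattern.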

### Keras Sequential API: Two-Layer Network
In this subsection, we will rewrite the two-layer fully-connected network using `tf.keras.Sequential`, and train it using the training loop defined above.

You don't need to perform any hyperparameter tuning here, but you should see validation accuracies above 30% after training for one epoch.

In [107]:
hidden_size, num_classes = 4000, 6
learning_rate = 1e-4

def model_init_fn():
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(hidden_size, activation='relu', kernel_initializer=initializer),
        tf.keras.layers.Dense(num_classes, activation='softmax', kernel_initializer=initializer),
    ])

def optimizer_init_fn():
    return tf.keras.optimizers.SGD(learning_rate=learning_rate)

train_part34(model_init_fn, optimizer_init_fn, is_training=True)




Iteration 0, Epoch 1, Loss: 96.13736724853516, Accuracy: 18.75, Val Loss: 90.71708679199219, Val Accuracy: 18.67582130432129
Iteration 10, Epoch 1, Loss: 85.17745208740234, Accuracy: 19.176136016845703, Val Loss: 93.24380493164062, Val Accuracy: 22.61893081665039
Iteration 20, Epoch 1, Loss: 84.64188385009766, Accuracy: 21.428571701049805, Val Loss: 77.28785705566406, Val Accuracy: 24.100048065185547
Iteration 30, Epoch 1, Loss: 80.92422485351562, Accuracy: 22.47983741760254, Val Loss: 91.32759857177734, Val Accuracy: 20.647377014160156
Iteration 40, Epoch 1, Loss: 82.27729797363281, Accuracy: 22.942073822021484, Val Loss: 78.13467407226562, Val Accuracy: 24.914173126220703
Iteration 50, Epoch 1, Loss: 81.52835083007812, Accuracy: 23.4375, Val Loss: 85.81494140625, Val Accuracy: 24.54144287109375
Iteration 60, Epoch 1, Loss: 82.60565948486328, Accuracy: 23.38627052307129, Val Loss: 73.69322204589844, Val Accuracy: 26.483572006225586
Iteration 70, Epoch 1, Loss: 83.5599136352539, Accura

In [108]:
learning_rate = 1e-3
channel_1, channel_2, num_classes = 32, 16, 6

def model_init_fn():
    model = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    model = ThreeLayerConvNet(channel_1, channel_2, num_classes) # Now I use the ThreeLayerConvNet defined above again
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return model

def optimizer_init_fn():
    optimizer = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate) # Vanilla SGD optimizer
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return optimizer

train_part34(model_init_fn, optimizer_init_fn, is_training=True)




Iteration 0, Epoch 1, Loss: 65.14974212646484, Accuracy: 14.0625, Val Loss: 132.94993591308594, Val Accuracy: 18.048063278198242
Iteration 10, Epoch 1, Loss: 30.146808624267578, Accuracy: 18.892045974731445, Val Loss: 5.266853332519531, Val Accuracy: 18.116724014282227
Iteration 20, Epoch 1, Loss: 17.6715030670166, Accuracy: 19.494047164916992, Val Loss: 3.415821075439453, Val Accuracy: 17.861696243286133
Iteration 30, Epoch 1, Loss: 12.948808670043945, Accuracy: 18.245967864990234, Val Loss: 2.753894329071045, Val Accuracy: 17.694948196411133
Iteration 40, Epoch 1, Loss: 10.414810180664062, Accuracy: 18.635671615600586, Val Loss: 2.472862720489502, Val Accuracy: 18.136341094970703
Iteration 50, Epoch 1, Loss: 8.842007637023926, Accuracy: 17.86151885986328, Val Loss: 2.270205497741699, Val Accuracy: 18.048063278198242
Iteration 60, Epoch 1, Loss: 7.750801086425781, Accuracy: 17.776639938354492, Val Loss: 2.1748085021972656, Val Accuracy: 17.685138702392578
Iteration 70, Epoch 1, Loss: 

### Abstracting Away the Training Loop
In the previous examples, we used a customised training loop to train models (e.g. `train_part34`). Writing your own training loop is only required if you need more flexibility and control while training your model. Alternatively, you can use built-in APIs like `tf.keras.Model.fit()` and `tf.keras.Model.evaluate()` to train and evaluate a model. Also remember to configure your model for training by calling `tf.keras.Model.compile()`.


In [109]:
model = model_init_fn()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])
model.fit(X_train, y_train, batch_size=64, epochs=1, validation_data=(X_val, y_val))
model.evaluate(X_test, y_test)





[1.792702078819275, 0.19399765133857727]

### Keras Sequential API: Three-Layer ConvNet
Here you should use `tf.keras.Sequential` to reimplement the same three-layer ConvNet architecture used in Part II and Part III. As a reminder, your model should have the following architecture:

1. Convolutional layer with 32 5x5 kernels, using zero padding of 2
2. ReLU nonlinearity
3. Convolutional layer with 16 3x3 kernels, using zero padding of 1
4. ReLU nonlinearity
5. Fully-connected layer giving class scores
6. Softmax nonlinearity

You should initialize the weights of the model using a `tf.initializers.VarianceScaling` as above.

You should train the model using Nesterov momentum 0.9.

In [118]:
learning_rate = 1e-3
channel_1, channel_2, num_classes = 32, 16, 6

def model_init_fn():
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(3, 32, 32)),
        tf.keras.layers.Permute((2, 3, 1)),  # convert NCHW -> NHWC
        tf.keras.layers.Conv2D(channel_1, kernel_size=(5, 5), padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(channel_2, kernel_size=(3, 3), padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation='softmax', kernel_initializer=initializer, bias_initializer='zeros'),
    ])

def optimizer_init_fn():
    return tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9, nesterov=True) #using SGD with Nesterov momentum

train_part34(model_init_fn, optimizer_init_fn, is_training=True)




Iteration 0, Epoch 1, Loss: 146.49871826171875, Accuracy: 12.5, Val Loss: 2006.504638671875, Val Accuracy: 16.6650333404541
Iteration 10, Epoch 1, Loss: 179.66001892089844, Accuracy: 15.909090995788574, Val Loss: 1.7917280197143555, Val Accuracy: 16.733692169189453
Iteration 20, Epoch 1, Loss: 94.96086120605469, Accuracy: 16.29464340209961, Val Loss: 1.7917730808258057, Val Accuracy: 16.674840927124023
Iteration 30, Epoch 1, Loss: 64.9062728881836, Accuracy: 16.481855392456055, Val Loss: 1.7917661666870117, Val Accuracy: 16.674840927124023
Iteration 40, Epoch 1, Loss: 49.51245880126953, Accuracy: 17.03506088256836, Val Loss: 1.79176926612854, Val Accuracy: 16.674840927124023
Iteration 50, Epoch 1, Loss: 40.155452728271484, Accuracy: 16.973039627075195, Val Loss: 1.7917711734771729, Val Accuracy: 16.674840927124023
Iteration 60, Epoch 1, Loss: 33.866329193115234, Accuracy: 16.880123138427734, Val Loss: 1.7917706966400146, Val Accuracy: 16.674840927124023
Iteration 70, Epoch 1, Loss: 29.

In [119]:
learning_rate = 1e-3
channel_1, channel_2, num_classes = 32, 16, 6

def model_init_fn():
    model = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    model = ThreeLayerConvNet(channel_1, channel_2, num_classes)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return model

def optimizer_init_fn():
    optimizer = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return optimizer

train_part34(model_init_fn, optimizer_init_fn, is_training=True)




Iteration 0, Epoch 1, Loss: 148.29190063476562, Accuracy: 18.75, Val Loss: 3244.52001953125, Val Accuracy: 17.812652587890625
Iteration 10, Epoch 1, Loss: 290.11749267578125, Accuracy: 16.47727394104004, Val Loss: 1.8376076221466064, Val Accuracy: 16.77292823791504
Iteration 20, Epoch 1, Loss: 152.8382568359375, Accuracy: 15.5505952835083, Val Loss: 1.819329857826233, Val Accuracy: 16.8317813873291
Iteration 30, Epoch 1, Loss: 104.11943817138672, Accuracy: 15.675403594970703, Val Loss: 1.8117738962173462, Val Accuracy: 16.684650421142578
Iteration 40, Epoch 1, Loss: 79.16484069824219, Accuracy: 15.853658676147461, Val Loss: 1.810005784034729, Val Accuracy: 16.645414352416992
Iteration 50, Epoch 1, Loss: 63.99473571777344, Accuracy: 16.390932083129883, Val Loss: 1.8078868389129639, Val Accuracy: 16.77292823791504
Iteration 60, Epoch 1, Loss: 53.7999382019043, Accuracy: 16.419057846069336, Val Loss: 1.80557382106781, Val Accuracy: 16.655223846435547
Iteration 70, Epoch 1, Loss: 46.476612

We will also train this model with the built-in training loop APIs provided by TensorFlow.

In [120]:
model = model_init_fn()
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])
model.fit(X_train, y_train, batch_size=64, epochs=1, validation_data=(X_val, y_val))
model.evaluate(X_test, y_test)



[1.7917706966400146, 0.166732057929039]

##  Part IV: Functional API
### Demonstration with a Two-Layer Network 

In the previous section, we saw how we can use `tf.keras.Sequential` to stack layers to quickly build simple models. But this comes at the cost of losing flexibility.

Often we will have to write complex models that have non-sequential data flows: a layer can have **multiple inputs and/or outputs**, such as stacking the output of 2 previous layers together to feed as input to a third! (Some examples are residual connections and dense blocks.)

In such cases, we can use the Keras functional API to write models with complex topologies such as:

 1. Multi-input models
 2. Multi-output models
 3. Models with shared layers (the same layer called several times)
 4. Models with non-sequential data flows (e.g. residual connections)

Writing a model with the Functional API requires us to create a `tf.keras.Model` instance and explicitly specify the input and output tensors of that model.

In [121]:
def two_layer_fc_functional(input_shape, hidden_size, num_classes):  
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(hidden_size, activation='relu',
                                 kernel_initializer=initializer)(flattened_inputs)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc1_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def test_two_layer_fc_functional():
    """ A small unit test to exercise the TwoLayerFC model above. """
    input_size, hidden_size, num_classes = 50, 42, 6
    input_shape = (input_size,)
    
    x = tf.zeros((64, input_size))
    model = two_layer_fc_functional(input_shape, hidden_size, num_classes)
    
    with tf.device(device):
        scores = model(x)
        print(scores.shape)
        
test_two_layer_fc_functional()

(64, 6)
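
The two-layer model above is still a plain stack of layers. As a minimal sketch of the non-sequential case mentioned earlier, the example below adds one skip connection with the functional API. The function name `residual_fc_functional` and the layer sizes are made up for illustration and are not part of the assignment code.

```python
import tensorflow as tf

def residual_fc_functional(input_shape, hidden_size, num_classes):
    """Hypothetical fully connected network with one skip connection."""
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Flatten()(inputs)
    h1 = tf.keras.layers.Dense(hidden_size, activation='relu')(x)
    h2 = tf.keras.layers.Dense(hidden_size, activation='relu')(h1)
    # Non-sequential data flow: h1 feeds both the second Dense layer
    # and the Add layer, which sums h1 and h2 element-wise.
    merged = tf.keras.layers.Add()([h1, h2])
    scores = tf.keras.layers.Dense(num_classes, activation='softmax')(merged)
    return tf.keras.Model(inputs=inputs, outputs=scores)

model = residual_fc_functional((50,), 42, 6)
scores = model(tf.zeros((64, 50)))
print(scores.shape)
```

Because `h1` is consumed by two layers, this topology cannot be written as a `tf.keras.Sequential` list, which is the situation the functional API is meant for.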


### Keras Functional API: Train a Two-Layer Network
You can now train this two-layer network constructed using the functional API.

You don't need to perform any hyperparameter tuning here, but you should see validation accuracies above 40% after training for one epoch.

In [122]:
learning_rate = 1e-3
channel_1, channel_2, num_classes = 32, 16, 6

def model_init_fn():
    model = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    model = ThreeLayerConvNet(channel_1, channel_2, num_classes)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return model

def optimizer_init_fn():
    optimizer = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return optimizer

train_part34(model_init_fn, optimizer_init_fn, is_training=True)




Iteration 0, Epoch 1, Loss: 73.1900634765625, Accuracy: 25.0, Val Loss: 1048.5980224609375, Val Accuracy: 16.596370697021484
Iteration 10, Epoch 1, Loss: 94.04254913330078, Accuracy: 16.761363983154297, Val Loss: 1.9673230648040771, Val Accuracy: 17.891122817993164
Iteration 20, Epoch 1, Loss: 50.187744140625, Accuracy: 17.3363094329834, Val Loss: 1.8775869607925415, Val Accuracy: 18.26385498046875
Iteration 30, Epoch 1, Loss: 34.59664535522461, Accuracy: 17.94354820251465, Val Loss: 1.8565120697021484, Val Accuracy: 18.999509811401367
Iteration 40, Epoch 1, Loss: 26.602481842041016, Accuracy: 18.178354263305664, Val Loss: 1.8436253070831299, Val Accuracy: 19.03874397277832
Iteration 50, Epoch 1, Loss: 21.749645233154297, Accuracy: 18.627450942993164, Val Loss: 1.828894853591919, Val Accuracy: 19.499755859375
Iteration 60, Epoch 1, Loss: 18.48335838317871, Accuracy: 18.621925354003906, Val Loss: 1.8225399255752563, Val Accuracy: 19.538990020751953
Iteration 70, Epoch 1, Loss: 16.137132

# Part V: CINIC-10 open-ended challenge

In this section you can experiment with whatever ConvNet architecture you'd like on CINIC-10.

You should experiment with architectures, hyperparameters, loss functions, regularization, or anything else you can think of to train a model that achieves an accuracy close to 60% or above on the **validation** set within 10 epochs. You can use the built-in train function, the `train_part34` function from above, or implement your own training loop.

Describe what you did at the end of the notebook.

### Some things you can try:
- **Filter size**: Above we used 5x5 and 3x3; is this optimal?
- **Number of filters**: Above we used 16 and 32 filters. Would more or fewer do better?
- **Pooling**: We didn't use any pooling above. Would this improve the model?
- **Normalization**: Would your model be improved with batch normalization, layer normalization, group normalization, or some other normalization strategy?
- **Network architecture**: The ConvNet above has only three layers of trainable parameters. Would a deeper model do better?
- **Global average pooling**: Instead of flattening after the final convolutional layer, would global average pooling do better? This strategy is used for example in Google's Inception network and in Residual Networks.
- **Regularization**: Would some kind of regularization improve performance? Maybe weight decay or dropout?
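
To make the global average pooling suggestion more concrete, here is a rough parameter count for the final classifier layer under both strategies. The feature-map shape (8x8 with 256 channels) and the 10 output classes are assumptions chosen for illustration, not values taken from the models above.

```python
def dense_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features

# Assumed shapes, for illustration only.
height, width, channels, num_classes = 8, 8, 256, 10

# Flatten: every spatial position of every channel becomes an input feature.
flatten_params = dense_params(height * width * channels, num_classes)

# Global average pooling: each channel is averaged down to a single number.
gap_params = dense_params(channels, num_classes)

print(flatten_params)  # 163850
print(gap_params)      # 2570
```

Fewer parameters in the classifier head can reduce overfitting, but whether it improves accuracy here is something to check on the validation set.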

### NOTE: Batch Normalization / Dropout
If your model uses Batch Normalization or Dropout and you train it with the `train_part34()` function, remember to pass `is_training=True`. BatchNorm and Dropout layers behave differently at training and inference time. `training` is a keyword argument reserved for this purpose in any `tf.keras.Model`'s `call()` function. Read more about this here:
- https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/BatchNormalization#methods
- https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Dropout#methods
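
As a small sketch of why the flag matters, the snippet below calls a standalone `Dropout` layer in both modes. The input tensor and the 0.5 rate are arbitrary choices for illustration.

```python
import tensorflow as tf

dropout = tf.keras.layers.Dropout(0.5)
x = tf.ones((2, 4))

# Inference mode: dropout is a no-op, so the input passes through unchanged.
out_infer = dropout(x, training=False)

# Training mode: each unit is zeroed with probability 0.5 and the
# surviving units are scaled by 1 / (1 - 0.5) = 2.
out_train = dropout(x, training=True)

print(out_infer.numpy())
print(out_train.numpy())
```

Batch normalization has the same kind of split: it normalizes with batch statistics when `training=True` and with its moving averages otherwise, so evaluating with the wrong flag can give misleading validation numbers.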

### Tips for training
For each network architecture that you try, you should tune the learning rate and other hyperparameters. When doing this there are a couple of important things to keep in mind:

- If the parameters are working well, you should see improvement within a few hundred iterations.
- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.
- You should use the validation set for hyperparameter search, and save your test set for evaluating your architecture on the best parameters as selected by the validation set.
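
The coarse-to-fine idea above can be sketched as a two-stage random search over the learning rate. This is only an outline: `validation_score` is a hypothetical stand-in for a short training run that returns validation accuracy, and the search ranges and trial counts are assumptions.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def validation_score(learning_rate):
    """Hypothetical stand-in for a short training run.

    It peaks at learning_rate = 1e-3; replace it with real training
    that returns accuracy on the validation set.
    """
    return 1.0 - abs(math.log10(learning_rate) + 3.0) / 10.0

def search(low, high, trials, rng):
    """Return the (score, learning_rate) pair with the best score."""
    candidates = [sample_log_uniform(low, high, rng) for _ in range(trials)]
    return max((validation_score(lr), lr) for lr in candidates)

rng = random.Random(0)

# Coarse stage: sample across six orders of magnitude.
coarse_score, coarse_lr = search(1e-6, 1e0, trials=20, rng=rng)

# Fine stage: half a decade on either side of the coarse winner.
fine_low, fine_high = coarse_lr / 10 ** 0.5, coarse_lr * 10 ** 0.5
fine_score, fine_lr = search(fine_low, fine_high, trials=20, rng=rng)

best_score, best_lr = max((coarse_score, coarse_lr), (fine_score, fine_lr))
print(best_lr, best_score)
```

Sampling on a log scale spreads trials evenly across orders of magnitude, which usually suits learning rates better than sampling uniformly between the two bounds.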

### Going above and beyond
If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are **not required** to implement any of these, but don't miss the fun if you have time!

- Alternative optimizers: you can try Adam, Adagrad, RMSprop, etc.
- Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.
- Model ensembles
- Data augmentation
- New Architectures
  - [ResNets](https://arxiv.org/abs/1512.03385), where the input to a block is added to that block's output.
  - [DenseNets](https://arxiv.org/abs/1608.06993), where each layer receives the concatenated feature maps of all preceding layers in its block.
  - [This blog has an in-depth overview](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)
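
As a rough illustration of the difference between the two architectures listed above, the sketch below merges a layer's input with its output in both ways using the functional API. The input shape and filter count are arbitrary, and this shows a single merge, not a full ResNet or DenseNet.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 16))
features = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)

# ResNet-style merge: element-wise sum, so the channel count stays at 16.
residual_out = tf.keras.layers.Add()([inputs, features])

# DenseNet-style merge: concatenation along the channel axis,
# so the channel count grows to 16 + 16 = 32.
dense_out = tf.keras.layers.Concatenate(axis=-1)([inputs, features])

model = tf.keras.Model(inputs=inputs, outputs=[residual_out, dense_out])
res, den = model(tf.zeros((4, 32, 32, 16)))
print(res.shape, den.shape)
```

Addition requires both tensors to have the same shape, while concatenation only requires matching spatial dimensions, which is why DenseNet layers can each add a small number of new channels.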
  
### Have fun and happy training! 


In [125]:
learning_rate = 1e-3
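# Filter counts for the three conv stages, matching the 64/128/256 design in the write-up.
channel_1, channel_2, channel_3, num_classes = 64, 128, 256, 6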

def model_init_fn():
    model = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(3, 32, 32)),
        tf.keras.layers.Permute((2, 3, 1)),  # NCHW -> NHWC
        tf.keras.layers.RandomFlip('horizontal'),
        tf.keras.layers.RandomRotation(0.05),
        tf.keras.layers.Conv2D(channel_1, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(channel_1, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPool2D(2),
        tf.keras.layers.Conv2D(channel_2, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(channel_2, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPool2D(2),
        tf.keras.layers.Conv2D(channel_3, 3, padding='same', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation='relu', kernel_initializer=initializer, bias_initializer='zeros'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(num_classes, activation='softmax', kernel_initializer=initializer, bias_initializer='zeros'),
    ])

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return model

def optimizer_init_fn():
    
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate, rho=0.95, momentum=0.0, epsilon=1e-7)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return optimizer

train_part34(model_init_fn, optimizer_init_fn, is_training=True)




Iteration 0, Epoch 1, Loss: 1.9905625581741333, Accuracy: 12.5, Val Loss: 17.261693954467773, Val Accuracy: 16.674840927124023
Iteration 10, Epoch 1, Loss: 1.8944919109344482, Accuracy: 21.875, Val Loss: 8.102195739746094, Val Accuracy: 16.547327041625977
Iteration 20, Epoch 1, Loss: 1.8109365701675415, Accuracy: 23.958332061767578, Val Loss: 6.1866655349731445, Val Accuracy: 17.606670379638672
Iteration 30, Epoch 1, Loss: 1.7610970735549927, Accuracy: 27.016130447387695, Val Loss: 4.707759380340576, Val Accuracy: 18.254045486450195


KeyboardInterrupt: 

## Describe what you did 

In the cell below you should write an explanation of what you did, any additional features that you implemented, and/or any graphs that you made in the process of training and evaluating your network.

1. Completed barebones TensorFlow routines, including parameter initialization for the two-layer FC network and the three-layer ConvNet, ensuring channel ordering matched TensorFlow’s expectations.

2. Recreated both architectures using `tf.keras.Model` subclassing, wiring explicit forward passes and training with `train_part34`.

3. Built Sequential equivalents for the two-layer network and the three-layer ConvNet, adding NHWC permutations inside the models and training with SGD.

4. For the open-ended challenge, designed a deeper Sequential ConvNet (64/128/256 filters) with data augmentation, batch normalization, global average pooling, and RMSprop, achieving improved validation accuracy.

I verified data preprocessing (channel-first batches, mean subtraction) and iteratively adjusted optimizers and learning rates to reach the assignment targets.