<a href="https://colab.research.google.com/github/GDS-Education-Community-of-Practice/DSECOP/blob/daleas_module/Connecting_MonteCarlo_to_ModernAI/N3_Predicting_Inverse_Temperature.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 3: Predicting the Inverse Temperature of a Lattice
Ashley Dale

---

In this notebook, you will learn the following concepts:

- The difference between linear/non-linear regression and Deep Neural Networks
- The Universal Approximation Theorem
- How to implement a Fully-connected Deep Neural Network (FC-DNN) for regression of 1D data
- How to implement a Convolutional Neural Network (CNN) for regression of 2D data

> **Note**: This notebook can take 5+ hours to compute.  It is recommended that you start it early, leave it running in the background, then return to it after some time to complete the analysis.  Don't leave everything to the last minute!

## Setup Python Environment

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from numba import jit
from tqdm import tqdm, trange
import copy

import timeit

from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

import tensorflow as tf
from tensorflow.keras import layers, models

from ipywidgets import interactive

# Background

## Problem Statement

Consider a situation where we have magnetic domain images taken of a ferromagnetic system when it is close to the transition temperature $T_C$, but we aren't sure what the thermodynamic temperature actually is.  (This is not the case for our simulated data, but it would be the case for most experimental data.) We would like a model that predicts the temperature of the input lattice.

**The easy way:** Train a regression model using the net-magnetism of the system as the input variable, and the temperature as the output variable.  The net-magnetism of the system is a *latent feature*: it is something we can calculate from the raw data to make training the prediction model easier.

**The hard way:** Train a regression model using the lattice directly.  The benefit of doing it this way is that we keep the coupling information defined by the lattice site adjacencies, as well as the magnetic domain shapes, structures, and distribution.  The bad thing is that it requires a larger model and more effort to get it to work successfully.

We are going to try both, compare them, and see what gives the best result.

## Regression Model for 1D Data

If we use the net-magnetism $M$ and inverse temperature $\beta$ as our dataset features, each lattice state can be represented as just two numbers, or a row vector of shape $1 \times 2$.  We can represent this as $l_i = [\beta_i, M_i]$, where $l_i$ is the $i^{th}$ lattice.
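
As a minimal sketch of this representation, the cell below builds $l_i$ for a hypothetical $4 \times 4$ toy lattice. The lattice values and the label `toy_beta` are invented for illustration and do not come from the simulation; the net-magnetism is computed as the absolute sum of spins divided by the number of sites.

```python
import numpy as np

# Hypothetical 4x4 toy lattice of spins (+1 / -1); not taken from the simulation
toy_lattice = np.array([
    [ 1,  1, -1,  1],
    [ 1,  1,  1,  1],
    [-1,  1,  1,  1],
    [ 1,  1,  1, -1],
])
toy_beta = 0.9  # assumed inverse temperature label for this toy example

# Net magnetization per site: |sum of spins| / number of sites
M_toy = np.abs(toy_lattice.sum()) / toy_lattice.size

# The 1 x 2 row vector l_i = [beta_i, M_i]
l_i = np.array([[toy_beta, M_toy]])
print(l_i.shape, l_i)
```

Whatever the lattice size, the whole lattice collapses to these two numbers, which is why the spatial information is lost in the 1D approach.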

[<img src=https://mathworld.wolfram.com/images/interactive/TanhReal.gif alt="Picture of $tanh(x)$.  Source: WolframAlpha" align="left" width=300>](https://mathworld.wolfram.com/images/interactive/TanhReal.gif)

It would be tempting to try to use a linear model, except we know that the analytical solution should be non-linear and something resembling hyperbolic tangent (as shown to the left; [source](https://mathworld.wolfram.com/HyperbolicTangent.html)).  A better approach is to use our domain knowledge to pick a model that we know will be able to learn the $\tanh(x)$ function directly and well.  Because $\tanh(x)$ has two horizontal asymptotes and polynomial expressions do not, any polynomial-based model such as [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR) is expected to behave poorly.  This is why we will explore a multi-layer perceptron (MLP) model, also known as a deep neural network (DNN).
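
The asymptote argument can be checked numerically. The sketch below fits a degree-5 polynomial to $\tanh(x)$ on an assumed interval $[-3, 3]$ and then evaluates both far outside that interval; the interval, degree, and evaluation point are arbitrary choices made for illustration.

```python
import numpy as np

# Fit a degree-5 polynomial to tanh(x) on a limited interval
x_fit = np.linspace(-3.0, 3.0, 200)
coeffs = np.polyfit(x_fit, np.tanh(x_fit), deg=5)

# Evaluate both functions far outside the fitted interval
x_far = 100.0
poly_far = np.polyval(coeffs, x_far)
tanh_far = np.tanh(x_far)

print("tanh(100) =", tanh_far)
print("degree-5 polynomial at 100 =", poly_far)
```

The hyperbolic tangent stays at its asymptote, while the polynomial grows without bound once it leaves the region where it was fitted.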

### When to Use a DNN

DNNs are [difficult to get right](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607), [hard to explain](https://arxiv.org/abs/2004.14545), [can require a tremendous amount of energy to train](https://www.idgconnect.com/article/3602888/reducing-energy-use-in-neural-networks.html) in addition to a [tremendous amount of data](https://machinelearningmastery.com/much-training-data-required-machine-learning/), and [struggle to quantify uncertainty in predictions](https://imerit.net/blog/a-comprehensive-introduction-to-uncertainty-in-machine-learning-all-una/).  This is bad news for scientists seeking rigorous results with small datasets.

Therefore, in general, a [linear](https://www.analyticsvidhya.com/blog/2021/10/everything-you-need-to-know-about-linear-regression/) or [non-linear](https://www.investopedia.com/terms/n/nonlinear-regression.asp) regression approach to modeling should be the first approach to generating a model.  Linear and non-linear approaches may also return an analytical expression with constants and variables which can be connected backwards to your experiment.  [This resource](https://www.ibm.com/topics/linear-regression) has a good discussion of what assumptions are required for linear regression to successfully model data; most of these assumptions also hold true for non-linear regression.

However, there are many, many problems for which DNNs definitely are the best solution.  This can include probabilistic and stochastic systems and data for which no analytical expression can be presumed to exist.

So. When should you use a DNN to model your data?

**When nothing else works better for your data.**

#### The Universal Approximation Theorem

Why does a DNN work when other linear or non-linear results don't?

---
*A deep neural network that is wide enough and/or deep enough can be used to approximate any measurable function $f(x)$ for an input $x$ to arbitrary accuracy.*
---

This is the Universal Approximation Theorem, and it was [first published in 1989](https://www.sciencedirect.com/science/article/abs/pii/0893608089900208), long before computers capable of calculating such models existed.  It tells us a path forward exists, but says nothing about how to get there.  It also doesn't say anything about why the theorem might be true.

However, the results of the theorem are extremely powerful, and justify why choosing a DNN to model our data is a valid decision.
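
The flavor of the theorem can be illustrated (not proven) with a small numerical experiment. The sketch below is an assumption-laden stand-in for a trained network: the hidden layer uses randomly chosen, fixed $\tanh$ neurons, and only the output weights are fitted by least squares. The target $\sin(x)$, the widths, and the random distributions are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)

# One hidden layer of tanh neurons with random, fixed weights and biases
max_width = 200
w = rng.normal(0.0, 2.0, size=max_width)
b = rng.uniform(-np.pi, np.pi, size=max_width)
hidden = np.tanh(np.outer(x, w) + b)   # shape (400, max_width)

def fit_error(width):
    """RMS error when only the first `width` hidden neurons are used."""
    H = np.column_stack([hidden[:, :width], np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.sqrt(np.mean((H @ coef - target) ** 2))

errors = {width: fit_error(width) for width in (2, 20, 200)}
for width, err in errors.items():
    print(f"width={width:4d}  RMS error={err:.2e}")
```

Widening the hidden layer lets the same kind of model follow the target more closely, which is the behavior the theorem guarantees is possible; it does not say how to find the weights in general.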

## Regression Model for 2D Data

If we use the full 2D lattice array and inverse temperature $\beta$ as our dataset features, we want a model that takes advantage of the spatial relationships that are preserved in the array.  This way, we are training the model not only with the net-magnetization information produced by the simulation, but also with the data features such as magnetic domain shape, size, and distribution.  These are features which are definitely present in the data, but difficult to encode numerically into a vector.

These difficult features arising from structural information can be learned using a [*Convolutional Neural Network*](https://www.ibm.com/topics/convolutional-neural-networks) or CNN.  CNNs get very complicated very quickly, but here are some things you should know:

- CNNs are a special kind of DNN, and are typically used for image processing.
- The heart of a CNN is [2D convolution](http://www.songho.ca/dsp/convolution/convolution2d_example.html).
- Because [2D convolution can be expressed as matrix multiplication](https://www.baeldung.com/cs/convolution-matrix-multiplication), this makes CNNs a very good choice for GPUs.
- By repeating 2D convolution on the input image, along with other operations, we can condense the input image to a single number that represents the label.  For this notebook, that number will be the inverse temperature $\beta$.


## More Notebooks about DNNs and CNNs

Any of the following notebooks can help deepen your understanding of why we are going to do what we do next.  There are also many good internet resources, some of which are linked in the "Additional Readings" section at the end of the notebook.

- [Intro to Deep Neural Networks](https://github.com/GDS-Education-Community-of-Practice/DSECOP/tree/main/Intro_to_Deep_Learning)
- [NN Basics](https://github.com/GDS-Education-Community-of-Practice/DSECOP/blob/main/Learning_the_Schrodinger_Equation/00_NN_Basics.ipynb)
- [Intro to Machine Learning Workflow with Linear Regression](https://github.com/GDS-Education-Community-of-Practice/DSECOP/tree/main/Machine_Learning_Workflow)
- [What is a neural network?](https://github.com/GDS-Education-Community-of-Practice/DSECOP/blob/main/Solving_Differential_Equations_with_NNs/02_neural_networks.ipynb)

## Summary

In the previous notebook, we used statistical mechanics and the Ising Hamiltonian to model and simulate a ferromagnetic system of atoms.  Now, we are trying to model and simulate the same physics, but without the benefit of having the model know what the physics is.

We are also going to try and run the simulation "backwards".  In the previous notebook, a value for the inverse temperature was chosen first, and then we simulated a lattice based on the probability of that lattice for the given inverse temperature value.  Here, we are going to start with a lattice where the temperature is unknown, then work backwards to see if we can figure out what inverse temperature might have allowed this particular lattice to form.   We will try to figure it out based on all of the other lattice-temperature relationships.

# Exercises

The first code blocks in the `Generate data for usage` section can be executed directly to generate the training and test data for the coding exercises later.

The programming exercises here are meant to help you gain intuition about how the different kinds of models behave, so you will use pre-defined functions and models when they are available.  If you would like to try coding some models from scratch, there are many tutorials on the internet that are beyond the scope of this notebook.

**Programming Exercise 1** walks you through implementing a *Support Vector Machine - Regression* or SVR.

**Programming Exercise 2** walks you through implementing a *Fully-Connected Deep Neural Network* or FC-DNN.

**Programming Exercise 3** walks you through implementing a *2D Convolutional Neural Network* or CNN.

## Generate data for usage

Before training the models, we need data to train them with.  Executing the following code cell will create the functions needed to duplicate the simulations from the previous notebook.

In [None]:
def initialize_lattice(L: int, c1 = 0):
    """
    Function to initialize lattice.  Adds a border of zeros
    to represent non-interacting atoms and make the neighbor
    calculation easier

    L: The square root of the number of atoms in the lattice
    returns padded_lattice: A lattice of size (L+2, L+2)
    """

    # initialize the lattice using np.random.random for a lattice of size
    # LxL

    init_lattice = np.random.random(size=(L,L))

    # create a lattice of zeros that has an extra row on the top and bottom,
    # and an extra column on the left and the right

    padded_lattice = np.zeros((L+2, L+2))

    #mask lattice by setting values above 0.5 to 1, and everything else to -1
    for idx in range(L):
        for jdx in range(L):
            if init_lattice[idx, jdx] > 0.5:
                init_lattice[idx, jdx] = 1
            else:
                init_lattice[idx, jdx] = -1
    #init_lattice[init_lattice>0.5]= 1
    #init_lattice[init_lattice !=1]= -1

    # added step to create non-interacting atoms
    padded_lattice[1:L+1, 1:L+1] = init_lattice

    return np.array(padded_lattice)


@jit(nopython=True)
def MCMC_step_optimized(beta: float, lattice: np.array):
    """
    Function to perform one Markov Chain Monte Carlo (MCMC) sweep over the lattice.
    beta: the inverse temperature value for the MCMC step
    lattice: the system of spins that will be simulated
    returns: an updated version of the input lattice
    """

    # Figure out the size of the lattice
    [rows, cols] = lattice.shape

    # keep the neighbors inside the region
    for r in range(1,rows-1):
        for c in range(1,cols-1):

            # sum over the nearest neighbors
            sum_NN = (lattice[r-1,c]+lattice[r+1, c]+lattice[r,c+1]+lattice[r,c-1])

            # calculate the energy
            E_a = -0.5*lattice[r,c]*sum_NN

            # re-calculate the energy for a spin state change
            E_b = -1*E_a

            # choose whether to keep the new state or not
            if E_b < E_a or np.exp(-(E_b - E_a)*beta) > np.random.rand():
                lattice[r, c] *= -1

    return lattice

def generate_lattice_set(sqrt_N: int, min_temp: float, max_temp: float, num_temps: int):
    """
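    Main function to complete a simple Metropolis-Hastings Ising model
    experiment.

    sqrt_N: integer that is the square root of the number of atoms in the system
    min_temp: the lowest inverse temperature to simulate
    max_temp: the highest inverse temperature to simulate
    num_temps: the number of inverse temperature values to simulate

    returns inverse_temperature, M, lattice_at_T: where M is the net
            magnetization and lattice_at_T is the padded lattice at each
            inverse temperature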
    """

    spin_lattice = initialize_lattice(int(sqrt_N))

    inverse_temperature = np.linspace(min_temp, max_temp, num_temps)

    M = np.zeros(num_temps) #empty variable to hold net magnetism

    # Declare new variable to hold spin lattices here
    lattice_at_T = np.zeros((num_temps, sqrt_N+2, sqrt_N+2))

    # For each temperature
    for i in (trange(len(inverse_temperature))):
        beta = inverse_temperature[i]

        # Repeat the MCMC step 100 times to make sure the system is stable
        for n in range(100):

            spin_lattice = MCMC_step_optimized(beta, spin_lattice)

        M[i] = (np.abs(np.sum(np.sum(spin_lattice)))/(sqrt_N*sqrt_N))
        lattice_at_T[i, :, :] = (copy.copy(spin_lattice))

    return np.array(inverse_temperature), np.array(M), np.array(lattice_at_T)
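
The commented-out lines in `initialize_lattice` hint at a loop-free version. As a hedged sketch only, the hypothetical function `initialize_lattice_vectorized` below builds the same kind of padded lattice with `np.where` and `np.pad`; it is an alternative for illustration and is not used by the rest of the notebook.

```python
import numpy as np

def initialize_lattice_vectorized(L: int):
    """Hypothetical vectorized variant: random +1/-1 spins with a border of zeros."""
    # Values above 0.5 become +1, everything else becomes -1
    spins = np.where(np.random.random(size=(L, L)) > 0.5, 1.0, -1.0)
    # Add one row/column of zeros on every side for the non-interacting atoms
    return np.pad(spins, pad_width=1, mode="constant", constant_values=0.0)

lattice = initialize_lattice_vectorized(6)
print(lattice.shape)
```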

Let's define some parameters for our data:

In [None]:
sqrt_N = 120 # Consider a system of 120x120 atoms

# We want to limit the temperature range to capture the phase transition between
# ordered and disordered.  If too many lattices are simulated before or after the
# transition, we can expect them to not show much temperature dependence
# and this will make training the models difficult

T_min = 0.8 # The lowest inverse temperature will be 0.8
T_max = 1.0 # The highest inverse temperature will be 1.0

# We want to increase the number of lattices so that
# a sufficient number are generated *during* the phase transition.
num_temps = 5000

Repeat the simulation three times, to make sure there is enough data.

In [None]:
b, m, L = generate_lattice_set(sqrt_N, T_min, T_max, num_temps)
b2, m2, L2 = generate_lattice_set(sqrt_N, T_min, T_max, num_temps)
b3, m3, L3 = generate_lattice_set(sqrt_N, T_min, T_max, num_temps)

Combine all of the simulation results to make them easier to process in the next step.

In [None]:
b = np.concatenate((b, b2, b3))
m = np.concatenate((m, m2, m3))
L = np.concatenate((L, L2, L3))

### Format Data into Training and Testing splits

Before the data can be used to train a model, we want to divide it into *testing* data and *training* data.  We can do this by randomly sampling the generated data using a uniform distribution.

In [None]:
# partition into train and test
partition_mask = np.random.uniform(size=(len(b)))
train = np.where(partition_mask > 0.3)
test = np.where(partition_mask <= 0.3)

# create labeled data partitions for reuse later
X_train = m[train[0]]
y_train = b[train[0]]

X_test = m[test[0]]
y_test = b[test[0]]

# These are the np.arrays that hold the lattices.  They have
# the same y_train and y_test labels as above
X_lattice_train = L[None, train[0]]
X_lattice_test = L[None, test[0]]
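
The masking approach above can be checked in isolation. The sketch below repeats it on index arrays only, with an assumed fixed seed so the result is repeatable, and confirms that every sample lands in exactly one partition with roughly a 70/30 split.

```python
import numpy as np

rng = np.random.default_rng(0)          # assumed seed, for repeatability only
n_samples = 15000                       # 3 runs x 5000 inverse temperatures
mask = rng.uniform(size=n_samples)

train_idx = np.where(mask > 0.3)[0]
test_idx = np.where(mask <= 0.3)[0]

train_fraction = len(train_idx) / n_samples
no_overlap = len(np.intersect1d(train_idx, test_idx)) == 0
print(train_fraction, no_overlap)
```

Because the mask is random, the split is only approximately 70/30, and it changes every time the cell is re-run unless a seed is fixed.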

It is always a good idea to double check that the training and test data agree with each other.  A plot is a fast way to do this.

In [None]:
fig, ax = plt.subplots(1, 1)
ax.scatter(y_train, X_train, s=1, label="Train")
ax.scatter(y_test, X_test, s=1, label="Test")
ax.set_xlabel(r"$\beta$")
ax.set_ylabel("M")
ax.set_title("Sanity Check: Train and Test data matches")
ax.legend()
plt.show()

In [None]:
def plot_mag_domains(x):
    plt.figure(2)
    plt.imshow(L[x])
    plt.title(r"$\beta$ = "+str(b[x]))
    plt.show()

interactive_plot = interactive(plot_mag_domains, x=(0,len(b)-1))
output = interactive_plot.children[-1]
output.layout.height = '500px'
interactive_plot

## Programming Exercise 1: SVR Implementation

Earlier, we hypothesized that SVR will not do a good job of modeling our data because [SVR uses n-dimensional polynomials](https://www.mathworks.com/help/stats/understanding-support-vector-machine-regression.html) to create its model, and we already know our data looks similar to hyperbolic tangent.  However, it is an easy thing to double check, and we will do this check now.

### 1. Build the Model

Look carefully at the SVR model object in the code cell below, and note that it has [four hyperparameters](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html):

- `kernel` says what kind of math modeling will be used.  The options are `linear`, `poly`, `rbf`, and `sigmoid`.  For some insight in how these different kernels behave, [click here](https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html#sphx-glr-auto-examples-svm-plot-svm-regression-py).
- `C` is the regularization constant.  The strength of the regularization, which tries to keep the model parameters small, is inversely proportional to `C`.  Large parameters make it difficult to balance terms, and also can require a lot of memory. **C > 0** always.
- `gamma` scales the kernel parameters.  If your data has a particular distribution (e.g. between -0.01 and 0.01), then you want to scale your model parameters so that it will output data within the same range.
- `epsilon` determines how much model error is acceptable.  If you think about the example of fitting your data with a line, the `epsilon` term says that data points which lie within a distance `epsilon` of the line won't contribute to the model error.
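
The role of `epsilon` can be sketched with the epsilon-insensitive loss, in which residuals smaller than `epsilon` cost nothing. The function name `epsilon_insensitive_loss` and the toy numbers are assumptions made for this illustration; this is not a call into scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the residual outside of it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([0.80, 0.90, 1.00])   # toy inverse temperatures
y_pred = np.array([0.82, 0.90, 0.90])   # toy predictions
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.05)
print(losses)
```

Only the last point, whose residual of 0.10 exceeds the tube width of 0.05, contributes to the loss.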

In [None]:
svr = SVR(kernel="sigmoid", C=1E-5, gamma="auto", epsilon=1E-12,verbose=True)

### 2. Train the Model

Now that the model is defined, we want to fit the model on the net-magnetization and inverse temperature training data, then use the model to predict the inverse temperature for the net-magnetization testing data.  Execute the code cell below to fit the model.

> Note: `X_train[:, None]` and `X_test[:, None]` have the `None` in the index because the `svr` model object expects input data with two dimensions (samples by features).  Indexing with `None` adds a new axis of length one, so a 1D array of shape `(n,)` becomes a 2D array of shape `(n, 1)`.
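
A short check of what indexing with `None` does to the array shape, using a made-up three-element array:

```python
import numpy as np

X = np.array([0.1, 0.5, 0.9])   # toy 1D array of net-magnetization values
X_col = X[:, None]              # None is an alias for np.newaxis

print(X.shape)      # one dimension
print(X_col.shape)  # two dimensions: samples by features
```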

In [None]:
start = timeit.default_timer()
svr.fit(X_train[:, None], y_train)
print("\nTime to completion for SVR: "+str(timeit.default_timer() - start))
print("\nNumber of Iterations: "+str(svr.n_iter_))

### 3. Evaluate the Model

Once the model is trained, we can use it to predict inverse temperature $\beta$ values.

In [None]:
y_predict = svr.predict(X_test[:, None])

#### Make a Plot
Visualize the difference between the true inverse temperature in `y_test` and the predicted inverse temperature in `y_predict`.
- Create a plot that shows `y_test` on the x-axis and `X_test` on the y-axis.  Add appropriate axis labels and legend.
- On the same plot, add `y_predict` on the x-axis and `X_test` on the y-axis as a second data series.

Based on this figure, do you think the model did a good job predicting the true temperature?

In [None]:
# Add plotting code here

Calculate the [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) for `y_test` and `y_predict` by calling the `mean_squared_error` function. (Click the link for documentation on how to use the function; no need to write one from scratch.)  An `mse` value of zero means that there is perfect agreement between the predicted inverse temperatures and the actual inverse temperatures.  Any `mse` value larger than zero indicates some error, and it needs to be interpreted carefully: how "bad" a given value is depends on your data distribution and how many samples you have.

In [None]:
mse = mean_squared_error(y_test, y_predict)
print("MSE: "+str(mse))
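
One hedged way to put an `mse` value in context is to compare it with a trivial baseline that always predicts the mean inverse temperature. The sketch below recreates the label distribution with `np.linspace` using the same range and count as the data parameters above; the actual test set is a random subset, so its baseline will differ slightly.

```python
import numpy as np

# Same range and number of inverse temperatures as used to generate the data
beta = np.linspace(0.8, 1.0, 5000)

# Baseline "model": always predict the mean inverse temperature
baseline_prediction = np.full_like(beta, beta.mean())
baseline_mse = np.mean((beta - baseline_prediction) ** 2)
print("Baseline MSE:", baseline_mse)
```

For labels spread evenly over a range of width 0.2, this baseline is close to $0.2^2/12 \approx 0.0033$. A model whose `mse` is not clearly below that value has learned little beyond the average label.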

### 4. Retrain the Model

The most important hyperparameter for `SVR` is the kernel, but the other hyperparameters have an impact as well.

Rerun steps 1-3 and try five different combinations of kernels with other hyperparameter values to see if you can reduce the `mse` value.  Fill out the table below with your results.  What was your best combination?

#### Fill out this table

The first row is completed for you.  Add four more rows.

|Kernel Type|C|$\gamma$|$\epsilon$|MSE|Time (s)|
|---|---|---|---|---|---|
|sigmoid|1E-5|auto|1E-12|0.01069|4.971|
|||||||
|||||||
|||||||
|||||||

## Programming Exercise 2: Fully Connected Deep Neural Network

Next, we will try to use a Fully Connected Deep Neural Network or *FC-DNN* (see example on the right: [source](https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Colored_neural_network.svg)) to predict the inverse temperature for our data.

[<img src=https://upload.wikimedia.org/wikipedia/commons/4/46/Colored_neural_network.svg alt="Picture of feed forward neural network.  Source: Wikipedia" align="right" width=300>](https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Colored_neural_network.svg)

In an FC-DNN, only `dense` layers are used.  A `dense` layer contains a set of [*artificial neurons*](https://en.wikipedia.org/wiki/Artificial_neuron), where each neuron has two main parts:
1. A linear operation of the form $y=mx+b$, where
   - $x$ is the neuron input
   - $y$ is the neuron output
   - $m$ is the neuron *weight*
   - $b$ is the neuron *bias*

   The weights and biases are what the model learns during training.  This means that a `dense` layer can **almost** be thought of as a set of linear functions that emphasize the most important parts of the data.  *Almost.*

2. An activation function that decides whether to pass all, some, or none of the layer output values to the next layer.  This is where the FC-DNN becomes non-linear.  It typically mixes the outputs of the neurons in complicated, non-intuitive ways.

These dense layers are then stacked in a *sequence*, with each neuron connected to every neuron in the previous layer **and** every neuron in the following layer, making it *fully connected*.
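
A single dense layer can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the function name `dense_forward`, the ReLU activation, and the weight values are all chosen by hand here, whereas a real model learns its weights and biases during training. With several inputs, each neuron computes a weighted sum (one weight per input) plus its bias.

```python
import numpy as np

def dense_forward(x, W, b):
    """Linear step (W @ x + b) followed by a ReLU activation."""
    return np.maximum(0.0, W @ x + b)

x = np.array([2.0, 1.0])            # two inputs from the previous layer
W = np.array([[1.0, -1.0],          # weights of neuron 0
              [0.5,  0.5]])         # weights of neuron 1
b = np.array([0.0, -2.0])           # one bias per neuron

out = dense_forward(x, W, b)
print(out)   # neuron 0 passes its value on; neuron 1 is set to zero
```

The linear step gives 1.0 and -0.5; the activation keeps the positive value and zeroes the negative one, which is where the non-linearity enters.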

### 1. Build the Model

The Universal Approximation Theorem is very important here, as there is no rule or mathematical result that states how many neurons and in what order they should be connected.  Therefore, the most important hyperparameters for a `FC-DNN` model are:

- `hidden_layer_sizes`: if you pass an argument `hidden_layer_sizes=[5,2,3]` then your FC-DNN will have three hidden layers with five, two, and three neurons respectively for a total of five layers in the model when including the input and output layers.
- `max_iter`: how long to train the network.  Too long, and the model will over-fit to the data.  Too short, and the model won't learn enough to generalize to the test data.
- `random_state`: makes sure that you can re-initialize the model weights the same way for repeatability.
- `batch_size`: How many data points the model sees at one time.  Too few, and the model will never learn a comprehensive pattern.  Too many, and the model will struggle to capture details.

In the model below, the model is initialized with `[15, 30, 20, 50, 20, 30, 15]`, for a total of 180 hidden neurons.  Each neuron has one weight for every neuron in the previous layer plus one bias, so with one input and one output the model has 4,311 trainable parameters.
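
The number of trainable parameters can be counted by hand. The sketch below assumes one input neuron (the net-magnetism) and one output neuron (the inverse temperature), with one weight per connection between consecutive layers and one bias per non-input neuron.

```python
layer_neurons = [15, 30, 20, 50, 20, 30, 15]

# Layer sizes including the assumed single input and single output neuron
sizes = [1] + layer_neurons + [1]

# One weight for every connection between consecutive layers
n_weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
# One bias for every neuron that is not in the input layer
n_biases = sum(sizes[1:])

n_params = n_weights + n_biases
print("hidden neurons:", sum(layer_neurons))
print("trainable parameters:", n_params)
```

Counting this way is useful in the optimization step later, where the number of neurons stands in for the memory the model needs.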

In [None]:
layer_neurons = [15, 30, 20, 50, 20, 30, 15]

In [None]:
regr = MLPRegressor(random_state=1,
                    max_iter=2000,
                    hidden_layer_sizes=layer_neurons,
                    batch_size=15,
                   verbose=True)

### 2. Train the Model

Once the model is defined, it can be trained.  Remember to time the execution!

In [None]:
start = timeit.default_timer()
regr.fit(X_train[:, None], y_train)
print("\nTime to completion for FC-DNN: "+str(timeit.default_timer() - start))

### 3. Evaluate the Model

Now that the model is trained, we can use it to predict some inverse temperature values:

In [None]:
y_pred_dnn = regr.predict(X_test[:, None])
mse = mean_squared_error(y_test, y_pred_dnn)
print("MSE: "+str(mse))

#### Make a Plot
Visualize the difference between the true inverse temperature in `y_test` and the predicted inverse temperature in `y_pred_dnn`.
- Create a plot that shows `y_test` on the x-axis and `X_test` on the y-axis.  Add appropriate axis labels and legend.
- On the same plot, add `y_pred_dnn` on the x-axis and `X_test` on the y-axis as a second data series.

In [None]:
# Add plotting code here

### 4. Retrain the Model

Now try to optimize the model for speed (less training time), number of neurons (less memory), and performance (reduce MSE).  Re-run steps 1-3 with five different combinations, and fill out the table below. Here are some suggestions to get you started:
- Try reducing the number of hidden layers from 7 to a different number
- Try reducing the number of neurons in each layer
- Try changing the batch size to more/fewer.  If you go too large, you might get an out-of-memory (OOM) error; just reduce the batch size further.

#### Fill out this table

The first row is completed for you.  Add four more rows.  What is your best combination?

|`layer_neurons`|Batch Size|MSE|Time (s)|
|---|---|---|---|
|`[15, 30, 20, 50, 20, 30, 15]`|15|0.0010047|4.226|
|||||
|||||
|||||
|||||

## Programming Exercise 3: Convolutional Neural Network

Using a 2D Convolutional Neural Network to study Ising lattices is a novel and increasingly popular approach for physicists.  Here are some references that link to recently published work in this area: [1](https://www.nature.com/articles/s41598-020-69848-5), [2](https://academic.oup.com/ptep/article/2021/6/061A01/6270799), [3](https://journals.aps.org/pre/abstract/10.1103/PhysRevE.103.033305), [4](https://arxiv.org/abs/1710.04987), [5](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.99.094427).

> Note: Plan on several hours to complete this exercise.  Google Colab notebooks are limited to 12 hours of runtime, and GPU usage is limited based on availability.  If you run out of GPU resources and compute time, you can decrease the number of epochs, but it will severely affect your results.

### 0. GPU Availability Check

You will want to make sure that you have GPU usage enabled for this next exercise, as it is compute intensive.

To Enable GPU usage in Google Colab, do the following:

1. In the right top corner find the **RAM/Disk** icon.
2. Click on the dropdown -> View resources -> Change runtime -> Hardware accelerator -> select GPU -> Save

If you are using a local machine with a GPU, discuss with your instructor about the proper procedure to enable GPU access.

The following code cell checks to see if a GPU is available.

> Note: If you choose to proceed without a GPU, expect the following calculation to take more than 10 hours.  If you use Google Colab for this period of time, the connection may time out and your notebook (with all calculations) may be reset.

In [None]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

### 1. Build a CNN Model

Unlike the `SVR` and `FC-DNN` models, there is no pre-built `CNN` model that is a good fit for the Ising-Model simulation data.  Instead, we will build one ourselves from scratch using *layers* and *activation functions*.  An example architecture is shown below ([source](https://en.wikipedia.org/wiki/Convolutional_neural_network#/media/File:Typical_cnn.png)).

[<img src=https://upload.wikimedia.org/wikipedia/commons/6/63/Typical_cnn.png alt="Picture of typical convolutional neural network.  Source: Wikipedia" align="center" width=800>](https://en.wikipedia.org/wiki/Convolutional_neural_network#/media/File:Typical_cnn.png)

#### Layers
A *Convolutional Neural Network* is typically a mix of several different kinds of layers:
- `convolutional layer`: This layer takes a matrix of data, then filters it to emphasize the most important information in the dataset.  The filters are learned and updated to be specific to the problem.  [See here for a good in-depth tutorial.](https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/)  You need the correct ratio of convolutional layers to the amount of data in your dataset.  Too few layers, and your model cannot learn all of the information it needs.  Too many layers, and your dataset may not have enough information for all of the layers to be helpful, leading to the untrained layers negatively affecting the trained layers.
-  `pooling layer`: This layer reduces the size of the input matrix in order to emphasize big, important patterns in the input matrix.  *Max pooling* means that for an $n\times n$ patch, only the maximum value of that patch is kept.  *Average pooling* means that the average value of the $n\times n$ patch is kept. In both methods, small variations get filtered out.  For example, if the pooling layer uses a $2\times 2$ patch with a stride of 2, the output of the layer will have half the height and half the width of the input to the layer.  No model information is stored in this layer, but you can "lose" too much information from previous layers if you aren't careful.  [See here for a good in-depth explanation.](https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/)
-  `batch normalization`: This layer keeps all of the matrix values centered around zero.  If the distribution of the values shifts too far from zero, it can take a longer time for the model to learn the distribution.  [See here for a good in-depth explanation.](https://www.baeldung.com/cs/batch-normalization-cnn)
-  `flatten`: This is a layer that reshapes the matrix of data into a vector.  No model information is stored in this layer.
-  `dense`: These are exactly the same layers used in the FC-DNN example above; the difference is that the network as a whole is no longer fully connected, because the convolutional layers connect each output only to a small patch of their input.  For regression and classification tasks, a model is expected to return a number or set of numbers (vector), not a matrix.  This layer learns the best vector to represent the input created by layers earlier in the network.
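
The two pooling variants described above can be sketched in NumPy. The helper `pool2x2` is a hypothetical name; it assumes a 2D input with even height and width and a $2\times 2$ patch with a stride of 2.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2 on a 2D array whose sides are even."""
    h, w = x.shape
    # Group the array into non-overlapping 2x2 blocks
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2x2(x, mode="max")
avg_pooled = pool2x2(x, mode="average")
print(max_pooled)
print(avg_pooled)
```

The $4\times 4$ input becomes a $2\times 2$ output in both cases: half the height and half the width, with no learned parameters involved.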

There are many types of layers available, and each has its own nuances.  [See here for a complete list of layers available in the Keras Python package.](https://keras.io/api/layers/)

#### Activation Functions
After choosing the layers, the next thing to understand is the `activation` function for each layer.  [See here for a good overview of the different activation functions.](https://towardsdatascience.com/everything-you-need-to-know-about-activation-functions-in-deep-learning-models-84ba9f82c253)  Two different activation functions are selected for the model below: `relu` and `sigmoid`.  The reasons for choosing these two functions are discussed in the [Model Intuition section at the end of this notebook](#model_intuition).
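
For reference, the two activation functions can be written directly in NumPy. The function names below are chosen for this sketch and are not the Keras implementations.

```python
import numpy as np

def relu(z):
    """Pass positive values through unchanged; set negative values to zero."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))
print(sigmoid(z))
```

One property worth noticing is that `sigmoid` can only output values between 0 and 1, a range that contains the inverse temperatures between 0.8 and 1.0 used to generate the data.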

---

Obtaining this particular configuration of layers and parameters required much testing and iterating.  However, it is [mathematically unlikely](https://www.youtube.com/watch?v=ZVVnvZdUMUk) that this is the only model configuration that can successfully learn the data.

If you have the resources, see if you can create a more efficient and trainable solution at the end of this exercise.

The model described above can be built by executing the code below.

In [None]:
model = models.Sequential()
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', input_shape=(sqrt_N+2, sqrt_N+2, 1)))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding="same"))
model.add(layers.AveragePooling2D((2, 2)))
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding="same"))
model.add(layers.AveragePooling2D((2, 2)))
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding="same"))
model.add(layers.BatchNormalization())
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='sigmoid'))
model.add(layers.Dense(1, activation='sigmoid'))

The next line prints a summary of the model.  Pay attention to the output shapes and the number of parameters; they should match the above explanation.

In [None]:
model.summary()
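
One way to check the summary is to count the parameters by hand.  The sketch below does this in plain Python, assuming the $122 \times 122 \times 1$ input size used in this notebook (so `sqrt_N = 120`); the layer labels and helper functions are hypothetical names for this example and may not match the names Keras prints.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights of one kernel x kernel x in_channels block per filter, plus one bias each."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(n_in, n_out):
    """One weight per input-output pair, plus one bias per output."""
    return (n_in + 1) * n_out

side = 122                      # assumed input side length (sqrt_N + 2)
side = side - 3 + 1             # first Conv2D has no padding="same": 122 -> 120
side = side // 2 // 2           # two AveragePooling2D((2, 2)) layers: 120 -> 30
flat_size = side * side * 64    # length of the vector produced by Flatten

layer_params = {
    "conv2d_1": conv2d_params(3, 1, 32),
    "conv2d_2": conv2d_params(3, 32, 32),
    "conv2d_3": conv2d_params(3, 32, 64),
    "conv2d_4": conv2d_params(3, 64, 64),
    "batch_norm": 4 * 64,       # gamma, beta, moving mean, moving variance per channel
    "dense_1": dense_params(flat_size, 64),
    "dense_2": dense_params(64, 1),
}
for name, count in layer_params.items():
    print(f"{name:>10}: {count:>9,}")
print(f"{'total':>10}: {sum(layer_params.values()):>9,}")
```

Under these assumptions, almost all of the parameters sit in the first dense layer, because it connects every value of the flattened feature maps to each of its 64 outputs.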

### 2. Choose how to optimize the model.

The model `optimizer` calculates how the model parameters are updated.  There are [many parameters and options here](https://keras.io/api/optimizers/), but the most important things to know are the following:
- The `adam` optimizer is considered state-of-the-art.  It is based on an algorithm called [*stochastic gradient descent*](https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/) or `sgd`, which is loosely analogous to the energy minimization in simulated annealing: the parameters are repeatedly changed in the direction that lowers the loss function on a randomly chosen batch of data.  Because each batch gives only a noisy estimate of the true loss, some updates increase the loss on the full data set, much as simulated annealing accepts an energy-raising change with some probability.

- The loss function is analogous to the energy calculated by a Hamiltonian: loss and energy are both minimized by their respective systems.
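
To make the idea concrete, here is a minimal sketch of plain mini-batch stochastic gradient descent on a toy problem.  It is not the Adam algorithm, and the data, learning rate, and batch size are invented for this example; the "model" is a single weight `w` fitted to noisy data from $y = 2x$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x plus a little noise
x = rng.uniform(-1, 1, size=256)
y = 2.0 * x + 0.05 * rng.normal(size=256)

def mse(w):
    """Loss on the full data set for a given weight."""
    return np.mean((w * x - y) ** 2)

w = 0.0
learning_rate = 0.1
losses = [mse(w)]
for step in range(200):
    batch = rng.choice(x.size, size=16, replace=False)        # random mini-batch
    grad = np.mean(2 * (w * x[batch] - y[batch]) * x[batch])  # d(MSE)/dw on the batch
    w -= learning_rate * grad                                 # step downhill
    losses.append(mse(w))

# Count the steps where the full-data loss went up instead of down
n_uphill = sum(b > a for a, b in zip(losses, losses[1:]))
print(f"w = {w:.3f}, first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
print("updates that increased the full-data loss:", n_uphill)
```

Each step follows the gradient of a small random batch, so individual steps may raise the loss on the full data set even though the overall trend is downward.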

In [None]:
optimizer=tf.keras.optimizers.Adam(learning_rate=3E-8)

### 3. Choose the model loss function.

The choice of loss function is important, and again, there are many options to choose from.

Here, the *mean squared error* (MSE) is chosen because it is easy to understand:

<center>$MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - y'_i)^2$</center>

where $y_i$ is the true value and $y'_i$ is the predicted value for sample $i$, and $N$ is the number of samples.  It is equivalent to the squared Euclidean distance between the vectors of true and predicted values, divided by $N$.

Choosing MSE as the loss function also makes it easier to compare the error between the *SVR*, *FC-DNN*, and *CNN* models.
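
As a quick check of the definition, the MSE can be computed by hand in NumPy.  The true and predicted values below are made up for this example.

```python
import numpy as np

y_true = np.array([0.2, 0.4, 0.6, 0.8])   # made-up inverse temperatures
y_pred = np.array([0.1, 0.4, 0.7, 1.0])   # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)
squared_distance = np.sum((y_true - y_pred) ** 2)   # squared Euclidean distance

print(mse, squared_distance / y_true.size)
```

Both printed numbers agree, since the mean of the squared differences is the squared distance divided by the number of samples.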

In [None]:
loss_function = tf.keras.losses.MeanSquaredError()

### 4. Compile the model

Compiling the model attaches the optimizer and loss function to it so that it can be trained.  The next command does this.

In [None]:
model.compile(optimizer=optimizer,
            loss=loss_function)

### 5. Train the Model for 500 Epochs

Because the model has several million parameters, we are going to train it for some time, then pause to check the results predicted by the model.  During this process, we need to consider the following parameters:

- `epochs`: This is how many complete passes the model makes through the training data
- `batch_size`: This is how many data points the model sees before each update of its weights
- `shuffle`: This is whether the training data is reshuffled before each epoch, or seen in the same order every time
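
Since the weights are updated once per batch, these settings together fix the total number of weight updates.  A small sketch of the arithmetic follows; the number of training lattices is a hypothetical value, so substitute the size of your own training set.

```python
import math

n_train = 10_000    # hypothetical number of training lattices
batch_size = 128
epochs = 500

updates_per_epoch = math.ceil(n_train / batch_size)   # one weight update per batch
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)
```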

Finally, make sure to time the training so that you can keep track of your resource usage.

In [None]:
start1 = timeit.default_timer()
with tf.device(device_name):
    history = model.fit(X_lattice_train[0], y_train, epochs=500, batch_size=128,
                        shuffle=True, validation_data=(X_lattice_test[0], y_test))
stop1 = timeit.default_timer()

In [None]:
print("Training Part 1: "+str((stop1 - start1)/60)+ " minutes")

### 6. Evaluate the Model

Since the model has finished 500 epochs, we can stop and see how it is doing by having the model `predict` on the test data and then calculating the error:

In [None]:
results = model.predict(X_lattice_test[0])
print("MSE: "+str(mean_squared_error(results, y_test)))

We also want to double check that it is learning the `train` data correctly, so we will have it predict temperatures for these values as well.

In [None]:
results_train = model.predict(X_lattice_train[0])

Finally, we want to visualize both the training loss over time, and the prediction results compared to the true values.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].set_title("How the network is learning over time")
ax[0].plot(np.linspace(1, len(history.history['loss']), len(history.history['loss'])),history.history['loss'], label="Train")
ax[0].plot(np.linspace(1, len(history.history['loss']), len(history.history['loss'])),history.history['val_loss'], label="Validation")
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].set_yscale("log")
ax[0].legend()

ax[1].set_title("How predicted values compare to the actual values")
ax[1].scatter(y_test, X_test, s=1, label="Test")
ax[1].scatter(results_train, X_train, s=1, label="Predictions on Train")
ax[1].scatter(results, X_test, s=1, label="Predictions on Test")
ax[1].set_xlabel(r"$\beta$")
ax[1].set_ylabel("M")
ax[1].legend()
plt.show()

In the plot of the *Loss* vs *Epochs*, we see that the mean squared error for the data decreases quickly at first, and then more slowly.  This is typical.  The model values begin with a random initialization, and quickly settle down to a more consistent improvement trend.  If the model is learning things correctly, then the loss value for both the `train` and `validation` data should be decreasing at about the same rate.  A constant offset between them is fine; a validation loss that increases while the training loss decreases is not.  If this happens, the model is *overfitting* on the training data, and the model parameters or training parameters should be adjusted.
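
The overfitting check described above can be sketched as a small helper.  The function name, the `patience` threshold, and the loss values are hypothetical; with a trained model you could pass `history.history['loss']` and `history.history['val_loss']` instead.

```python
def first_overfit_epoch(train_loss, val_loss, patience=3):
    """Return the 1-based epoch at which the validation loss starts a run of
    `patience` consecutive increases while the training loss keeps falling,
    or None if no such run exists."""
    run = 0
    for i in range(1, len(val_loss)):
        if val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]:
            run += 1
            if run == patience:
                return i - patience + 2   # epoch number of the first increase
        else:
            run = 0
    return None

train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_overfit = [1.1, 0.9, 0.8, 0.85, 0.9, 0.95]
val_healthy = [1.1, 0.9, 0.8, 0.7, 0.6, 0.5]

print(first_overfit_epoch(train, val_overfit))
print(first_overfit_epoch(train, val_healthy))
```

Requiring several consecutive increases is a design choice that avoids flagging the small epoch-to-epoch fluctuations that are normal in a noisy validation curve.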

The second plot uses the net-magnetization values calculated earlier so that we can get a sense of how the model is behaving.  The y-axis values are all fixed, and the x-axis values are either the *ground-truth* values used as inputs to the Metropolis-Hastings algorithm that generated the lattices, *predictions on the training data*, or *predictions on the testing data*.  What we would like to see is all three distributions perfectly aligned.  If they aren't, we can continue training the model to try and improve the prediction results.

### 7. (OPTIONAL) Continue Training the Model

We can continue training the model for 500 epochs at a time until we see the loss decrease to an acceptable threshold.

### Keep going until you see the loss quit decreasing, or you run out of time.

Add more code cells as needed below.

#### Epochs 501-1000

In [None]:
start2 = timeit.default_timer()
with tf.device(device_name):
    history2 = model.fit(X_lattice_train[0], y_train, initial_epoch=500, epochs=1000, batch_size=128,
                        shuffle=True, validation_data=(X_lattice_test[0], y_test))
stop2 = timeit.default_timer()

In [None]:
print("Training Part 2: "+str((stop2 - start2)/60)+ " minutes")
results2 = model.predict(X_lattice_test[0])
print("MSE: "+str(mean_squared_error(results2, y_test)))
results_train2 = model.predict(X_lattice_train[0])

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].set_title("How the network is learning over time")
epochs2 = np.array(history2.epoch) + 1  # Keras stores 0-based epoch indices
ax[0].plot(epochs2, history2.history['loss'], label="train")
ax[0].plot(epochs2, history2.history['val_loss'], label="val")
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].set_yscale("log")
ax[0].legend()

ax[1].set_title("How predicted values compare to the actual values")
ax[1].scatter(y_test, X_test, s=1, label="Test")
ax[1].scatter(results_train2, X_train, s=1, label="Predictions on Train")
ax[1].scatter(results2, X_test, s=1, label="Predictions on Test")
ax[1].set_xlabel(r"$\beta$")
ax[1].set_ylabel("M")
ax[1].legend()
plt.show()

As you generate additional checkpoints for the model, pay attention to the plots created.  How do they compare to the first ones you made?  Have the values shifted?
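
One way to make this comparison easier is to join the loss histories from each training run into a single curve.  The helper below is a hypothetical sketch that works on plain dictionaries shaped like `history.history`; the loss values are stand-ins, not results from this model.

```python
def merge_histories(*histories):
    """Concatenate Keras-style history dicts, e.g. history.history and history2.history."""
    merged = {}
    for h in histories:
        for key, values in h.items():
            merged.setdefault(key, []).extend(values)
    return merged

part1 = {"loss": [0.9, 0.5], "val_loss": [1.0, 0.6]}    # stand-in values
part2 = {"loss": [0.4, 0.3], "val_loss": [0.5, 0.45]}
full = merge_histories(part1, part2)
epochs = list(range(1, len(full["loss"]) + 1))           # 1-based epoch numbers

print(epochs, full["loss"], full["val_loss"])
```

Plotting `full["loss"]` and `full["val_loss"]` against `epochs` then shows the whole training run on one axis.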

# Answer these questions
Edit the cell below with your answers.

1. What are the six steps used to build and train a CNN?
   > **Answer**:

2. How many epochs did you train the CNN?  How long (in hours) did it take to train? What was the best `MSE` achieved?
   > **Answer**:

3. How many iterations did you train the FC-DNN?  How long (in hours) did it take to train?  What was the best `MSE` achieved?
   > **Answer**:

4. How many iterations did you train the SVR?  How long (in hours) did it take to train?  What was the best `MSE` achieved? (Hint: Look for the `#iter` variable reported for number of iterations.)
   > **Answer**:

5. Which of the three model types (SVR, FC-DNN, or CNN) was easiest to use and why?  Which model had the best results (how do you know)?  Which model used the most resources (how do you know)?  Which one do you think best solved the problem of reliably predicting the inverse temperature and why?
   > **Answer**:

6. A physicist has experimental velocity vs time data of a feather undergoing free-fall in a vacuum, [as in this video](https://www.youtube.com/watch?v=AV-qyDnZx0A).  Which type of model should the physicist choose: linear or non-linear?  Why?  Give a kinematics equation to support your answer.
   > **Answer**:

7. A physicist has experimental 2D X-ray Diffraction (XRD) data for a range of lattice spacings $d$ that [follow Bragg's law: $n\lambda = 2d\sin(\theta)$](https://www.doitpoms.ac.uk/tlplib/xray-diffraction/bragg.php).  Which of the three model types (SVR, FC-DNN, CNN) should the physicist choose?  Why?  [Click here](https://www.doitpoms.ac.uk/tlplib/diffraction/diffraction3.php) and [here](https://www.researchgate.net/figure/Two-dimensional-2D-X-ray-diffraction-patterns-of-a-Cs-b-Rb5Cs-and-c-GA5Cs-based_fig3_331256089) for examples of 2D data.
   > **Answer**:

# Additional Readings

1. [Overview of different Non-linear Regression Techniques](https://towardsdatascience.com/3-techniques-for-building-a-machine-learning-regression-model-from-a-multivariate-nonlinear-dataset-88b25fc24ad5)
2. [When to use what kind of machine learning model](https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/)
3. [Understanding the Universal Approximation Theorem](https://towardsai.net/p/deep-learning/understanding-the-universal-approximation-theorem)
4. [Tensorflow Tutorial on CNN](https://www.tensorflow.org/tutorials/images/cnn)


<a name="model_intuition"></a>
## Intuition for the CNN Model

The CNN model provided has the following sequence:

>0. **Input**: accepts an input matrix of size [$\sqrt{N}+2$, $\sqrt{N}+2$, $1$] because this is the size we defined for our Ising lattices.  There is a $1$ in the size definition because, even though the lattice is not 3D, the `Conv2D` layer still requires a defined value along the third (channel) dimension.

In this notebook, the lattice size is $122 \times 122 \times 1$.

>1. **Conv2D**: first pass at learning the spatial relationships and values of the lattice
>2. **Conv2D**: second pass at learning the spatial relationships and values of the lattice

We want the model to have a good opportunity to figure out the most important details about the data before we reduce the lattice size.  If the data is very noisy, or has a lot of very small domains that distinguish it from lattices at similar temperatures, the model should have information at this scale.  A `kernel_size=(3,3)` argument means that a matrix of size $3 \times 3$ is convolved across the matrix.  Depending on the size of the magnetic domains, a $3 \times 3$ kernel may learn relevant information, or the model may benefit from a larger kernel of $5 \times 5$.  Kernels larger than $5 \times 5$ are less common (even for very large matrices), and even-sized kernels such as $4 \times 4$ are usually avoided because they have no central element.

Like the *nearest-neighbor* calculation in the previous notebook, which required the lattice to have some padding or skip calculating values for the atoms on the edges of the lattice, the convolution operation requires special handling at the edges.  The `padding="same"` argument puts zeros around the edges to keep the matrix size constant; otherwise, with a $3 \times 3$ kernel, it decreases by 2 along each dimension every time.  The first `Conv2D` layer in this model does not set `padding`, so it trims the $122 \times 122$ input to $120 \times 120$.
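
The effect of the padding choice on the lattice size can be sketched with a short helper.  The function name is hypothetical and the sketch assumes a stride of 1, which is the Keras default for `Conv2D`.

```python
def conv_output_size(size, kernel, padding):
    """Side length of the output of a stride-1 convolution."""
    if padding == "same":
        return size              # zero-padding keeps the size unchanged
    return size - kernel + 1     # "valid": the kernel must fit entirely inside

print(conv_output_size(122, 3, "valid"))   # first Conv2D layer, no padding
print(conv_output_size(120, 3, "same"))    # later Conv2D layers
print(conv_output_size(122, 5, "valid"))   # a 5 x 5 kernel would trim more
```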

Finally, the `Conv2D` layer does not output a single lattice: it outputs `filters=32` *feature maps* of the input lattice, where each feature map is a version of the input that emphasizes different information.  For example, one filter might emphasize only edges of domains in the input lattice, another filter might emphasize only the smallest domains, and yet another might emphasize only the largest domains.

>3. **AveragePooling2D**: Reduces the lattice size by a factor of 2 in each dimension, so we can expect the output to be [$0.5\sqrt{N}$, $0.5\sqrt{N}$, $32$] (the first, unpadded convolution has already trimmed the input from $\sqrt{N}+2$ to $\sqrt{N}$, and each of the 32 filters produces its own feature map).

As the data moves deeper into the network, it is condensed.  (This layer can be thought of in terms of the [Renormalization Group flow](https://ieeexplore.ieee.org/abstract/document/9110872) of the lattice.)  Since only one-fourth of the lattice remains, and the initial size was $120 \times 120$ before padding, the lattice is now $60 \times 60$.  Larger domains have shrunk, and smaller domains have disappeared or become specks.  However, because the number of `filters` in the `Conv2D` layers is still large, we can think of this layer as sorting lattice patterns into different parts of a large $M \times M \times P$ *feature vector*.
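
The model code in this notebook uses `AveragePooling2D((2, 2))` for this step.  A NumPy sketch of that operation on a single small feature map is shown below; the function name and the $4 \times 4$ spin pattern are made up for this example.

```python
import numpy as np

def average_pool_2x2(lattice):
    """Replace each non-overlapping 2 x 2 block by its mean (even side lengths assumed)."""
    h, w = lattice.shape
    return lattice.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

spins = np.array([[ 1,  1, -1, -1],
                  [ 1,  1, -1,  1],
                  [-1, -1,  1,  1],
                  [-1, -1,  1,  1]])
pooled = average_pool_2x2(spins)
print(pooled.shape)
print(pooled)
```

Uniform blocks keep their spin value, while the mixed block in the top-right corner is averaged to an intermediate value, which is the sense in which small features fade as the lattice is condensed.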

>4. **Conv2D**: Now the network is learning "medium" sized patterns, and the number of feature maps created has increased to `filters=64`.
>5. **AveragePooling2D**: Reduces the lattice again, so that it is now one-sixteenth of its original size.

Now, the lattice size is $30 \times 30$.  Another convolutional layer is added to the model to learn patterns at this coarser scale, followed by a `BatchNormalization()` layer.

>6. **Conv2D**: Same parameters as Layer #4, only now each kernel covers a larger region of the original lattice.
>7. **BatchNormalization**: This centers the data around zero and gives it a standard deviation of 1.
>8. **Flatten**: Reshapes all of the matrix values into one long vector

The `BatchNormalization()` layer is important because, up until this point, the matrix values passed by the `relu` activation function have been allowed to spread over $[0, \infty)$.  We need the value returned for the inverse temperature $\beta$ to be $\beta \in [0, 1]$, and the normalization makes sure the values are centered around zero with a standard deviation of 1 before the data moves into the fully connected layers.
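
The core of this normalization can be sketched in NumPy as follows.  This is a simplified version under stated assumptions: the Keras layer also learns a scale and an offset for each channel and uses running averages at prediction time, and the batch of `relu`-like activations and the `eps` value here are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 128 samples with 8 features, non-negative like relu outputs
activations = np.maximum(0.0, rng.normal(loc=3.0, scale=5.0, size=(128, 8)))

eps = 1e-3                              # small constant to avoid division by zero
mean = activations.mean(axis=0)         # per-feature mean over the batch
var = activations.var(axis=0)           # per-feature variance over the batch
normalized = (activations - mean) / np.sqrt(var + eps)

print(normalized.mean(axis=0).round(3))
print(normalized.std(axis=0).round(3))
print(normalized.max().round(3))
```

The normalized values have a mean near 0 and a standard deviation near 1 for each feature, but individual values can still be larger than 1.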

Finally, since the `dense` layers require a vector input, the output matrices of the convolutional layers are reshaped by the `flatten` layer.

>9. **Dense**: This layer takes all of the values created at this point and combines them into $64$ numbers.  A `sigmoid` activation function is used to keep the values between [0, 1].

Moving directly from the full set of matrix values to a single number requires the model to condense all of the information very quickly, without much opportunity to shift the data distribution correctly.  Adding Layer 9 before the final output helps correct for this.

>10. **Dense**: This layer returns a single number, the predicted value for $\beta$.

---
