<a href="https://colab.research.google.com/github/DavidSchineis/Math-Physics/blob/main/Copy_of_Lab_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Abstract
This lab introduces neural networks using TensorFlow and Keras. We began by training a linear model to perform regression, learning how the learning rate affects the results. We then extended the model with more layers and different activation functions to better fit y=ln(x+1). Next, we explored overfitting by comparing training and validation losses over epochs, identifying where the validation loss stopped improving. Finally, we applied all of this to the MNIST handwritten digits dataset. We trained a neural network to recognize digits and evaluated its performance with a confusion matrix. This lab demonstrates how neural networks can be built, trained, and evaluated on real problems in Python.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

Today, we will provide a brief introduction to neural networks. Neural networks are designed to take a set of inputs and learn a model composed of functions that can process these inputs to generate a specific output. This concept is similar to matrix operations, where you can consider the inputs and outputs as known matrices. During the training process, the goal is to find a matrix that can be multiplied with the inputs to transform them and produce the desired output.
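
To make the matrix picture concrete, here is a minimal NumPy sketch of what a single fully connected layer computes when no activation is applied. The weights `W`, the biases `b`, and the inputs `X` are made-up values chosen for illustration; in a real network they would be learned during training.

```python
import numpy as np

# Hypothetical weights for a layer with 2 inputs and 3 outputs.
W = np.array([[1.0, 0.0, -1.0],
              [2.0, 1.0,  0.5]])   # shape (inputs, outputs)
b = np.array([0.5, -0.5, 0.0])     # one bias per output

# A batch of 4 samples, each with 2 input features.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# A dense layer without an activation computes X @ W + b.
Y = X @ W + b
print(Y.shape)
print(Y)
```

Training amounts to searching for the entries of `W` and `b` that bring `Y` as close as possible to the desired outputs.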

We will begin with an example of using a neural net to perform a linear regression. We will first generate a random noisy dataset.

Note: as you are writing this code, do not use the "Run All" button. If you overwrite some variables, it may break some of the logic.

In [None]:
np.random.seed(0)

x = np.arange(0,10,0.1)
y = 2*x+3+np.random.normal(size=len(x))

plt.scatter(x,y,label='random data')

plt.plot(x,2*x+3,c='red',label='underlying relation')
plt.legend()

plt.show()

We will now construct a simple model consisting of a single input (x) and a single output (y).


In [None]:
# Clear any existing TensorFlow graph
tf.keras.backend.clear_session()


model = tf.keras.Sequential()
# Define the input layer
model.add(tf.keras.Input(shape=(1,)))
# Define the fully connected dense layer that would consist of our outputs
model.add(tf.keras.layers.Dense(1))

print(model.summary())

Before it can be used, the model needs to be compiled, which specifies what the neural net should optimize. It does this by evaluating a loss, and the goal of training is to minimize that loss. The loss can be defined in a number of different ways, but for regression tasks it is common to use the mean squared error, i.e.
$$MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-p_i)^2$$
where $n$ is the number of points, $y_i$ are the outputs, and $p_i$ are the predictions.
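
As a small worked example of the MSE formula above, the sketch below evaluates it on three made-up points. The helper name `mean_squared_error` and the numbers are assumptions for illustration only.

```python
import numpy as np

def mean_squared_error(y, p):
    """Average of the squared differences between outputs y and predictions p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return np.mean((y - p) ** 2)

y_true = [1.0, 2.0, 3.0]
p_pred = [1.0, 2.0, 5.0]

# Squared errors are 0, 0, and 4, so the mean is 4/3.
mse = mean_squared_error(y_true, p_pred)
print(mse)
```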

Additionally, the model needs an optimizer, which is an algorithm that adjusts the parameters of a model during the training process. Adam and SGD (stochastic gradient descent) are two popular choices.

Both of these optimizers depend on a particular learning rate, which roughly controls the relative scale by which the weights of the model get adjusted. If the learning rate is too high, it will perform very large modifications to the weights, and the model may fail to converge. If the learning rate is too low, it may be unnecessarily slow to converge.
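
As a rough illustration of this trade-off, the sketch below runs plain gradient descent on the toy function $f(w)=w^2$. This is not Adam, which rescales its steps, so the numbers do not carry over to the Keras models in this lab. The function name `descend` and the three rates are assumptions chosen for illustration.

```python
def descend(learning_rate, steps=20, w0=1.0):
    """Minimize f(w) = w**2 by gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * grad    # step against the gradient
    return w

# A small rate barely moves, a moderate rate converges toward 0,
# and a rate that is too large overshoots further on every step.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {descend(lr):.4g}")
```

Each step multiplies `w` by `1 - 2 * learning_rate`, so the iteration shrinks toward the minimum only while that factor has magnitude below 1.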

Afterwards, we are ready to train the model. During the training process, the model will make predictions, evaluate the loss, adjust the weights, and repeat these steps for a fixed number of epochs. Each epoch can be split into several batches, and the weights are adjusted after each batch.
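
The epoch and batch structure can be sketched by hand for a straight-line fit. This is a simplified stand-in using plain mini-batch gradient descent in NumPy, not what Keras or Adam does internally. The variable names, the learning rate, and the batch size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.arange(0, 10, 0.1)
ys = 2 * xs + 3 + rng.normal(size=len(xs))

w, b = 0.0, 0.0            # slope and intercept to be learned
learning_rate = 0.005
epochs, batch_size = 500, 20
n_updates = 0

for epoch in range(epochs):
    order = rng.permutation(len(xs))           # reshuffle every epoch
    for start in range(0, len(xs), batch_size):
        idx = order[start:start + batch_size]
        err = (w * xs[idx] + b) - ys[idx]      # predictions minus targets
        # Gradients of the batch MSE with respect to w and b.
        grad_w = 2 * np.mean(err * xs[idx])
        grad_b = 2 * np.mean(err)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        n_updates += 1

print(n_updates)   # 100 points / 20 per batch = 5 updates per epoch
print(w, b)
```

With 100 points and a batch size of 20, every epoch performs 5 weight updates, and the learned slope and intercept should end up close to the least-squares line through the data.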

In [None]:
# Compile the model
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(learning_rate=100.))

# Train the model
history = model.fit(x, y, epochs=100, verbose=False,batch_size=100)

# Make the predictions
p=model.predict(x)



plt.scatter(x,y,label='random data')
plt.plot(x,p,c='orange',label='predictions')
plt.plot(x,2*x+3,c='red',label='underlying relation')
plt.legend()
plt.show()

plt.plot(history.history['loss'],label='loss over training epochs')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.yscale('log')
plt.legend()
plt.show()

Every single training run is random, producing a different set of predictions. However, for a given set of hyperparameters, the training is likely to take a predictable path.

Most likely, the resulting fit was roughly representative of the data, but it was far from ideal.

Train 8 different models, adjusting the learning rate from $10^{-5}$ to $10^2$, evenly spaced on a logarithmic scale. After every training session, plot the loss over all of the epochs, keeping all of the outputs on the same graph. Set the y scale to be from 0.5 to 3.

Remember that the goal is to minimize the loss, and to do it as fast as possible. What is the best learning rate to accomplish this?

#### Answer
The best learning rate to accomplish this is $10^{-1}$. See the graph below for justification of minimizing loss.

In [None]:
learnRate = np.logspace(-5, 2, 8)
final_losses = []

plt.figure()
for lr in learnRate:
    tf.keras.backend.clear_session()
    model = tf.keras.Sequential()
    # Define the input layer
    model.add(tf.keras.Input(shape=(1,)))
    # Define the fully connected dense layer that would consist of our outputs
    model.add(tf.keras.layers.Dense(1))
    # Compile the model
    model.compile(loss='mean_squared_error',
                  optimizer=tf.keras.optimizers.Adam(learning_rate=lr))
    # Train the model
    history = model.fit(x, y, epochs=100, verbose=False,batch_size=100)
    # Record the final loss so the runs can also be compared numerically
    final_losses.append(history.history['loss'][-1])

    plt.plot(history.history['loss'], label=f'lr={lr:g}')


plt.ylim(0.5, 3)
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend()
plt.show()


### Caption
This is a plot of the loss vs epoch for varying learning rates of a sequential model. The learning rates vary from $10^{-5}$ to $10^{2}$.

Show the predictions of a model trained with this learning rate.

In [None]:

tf.keras.backend.clear_session()
model = tf.keras.Sequential()
# Define the input layer
model.add(tf.keras.Input(shape=(1,)))
# Define the fully connected dense layer that would consist of our outputs
model.add(tf.keras.layers.Dense(1))
# Compile the model
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.1))
# Train the model
history = model.fit(x, y, epochs=100, verbose=False,batch_size=100)

# Make the predictions
p=model.predict(x)



plt.scatter(x,y,label='random data')
plt.plot(x,p,c='orange',label='predictions')
plt.legend()
plt.show()


### Caption
This plot shows the predictions of our sequential model, trained with a learning rate of $10^{-1}$, against the noisy dataset generated from $y = 2x+3$.

Let's try a different dataset. Train a model with the same architecture on data generated from the function $y=\ln(x+1)$.

In [None]:
np.random.seed(0)

x = np.arange(0,10,0.1)
y = np.log(x+1)+np.random.normal(size=len(x))*0.1

tf.keras.backend.clear_session()
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(1,)))
model.add(tf.keras.layers.Dense(1))
# Compile the model
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.1))
# Train the model
history = model.fit(x, y, epochs=100, verbose=False,batch_size=100)

# Make the predictions
p=model.predict(x)



plt.scatter(x,y,label='random data')
plt.plot(x,p,c='orange',label='predictions')
plt.legend()
plt.show()


### Caption
This plot shows the predictions of our sequential model, trained with a learning rate of $10^{-1}$, against the noisy dataset generated from $y = \ln(x+1)$.

The current model is insufficient to handle the complexity of the function. To enhance its performance, we can incorporate additional layers. Instead of directly producing a single output from a single input, we will introduce a hidden layer in between. This layer will consist of 3 neurons (dividing the data into 3 distinct channels), each conditioned uniquely, which are subsequently merged to generate the prediction. This approach allows for a more flexible representation of the data, potentially improving the model's capabilities.

Use this model to produce a more faithful prediction. Experiment with different parameters, including the learning rate and the batch size in order to achieve this. Make sure to run it a couple of times to ensure a self-consistent performance.

In [None]:
x = np.arange(0,10,0.1)
y = np.log(x+1)+np.random.normal(size=len(x))*0.1


tf.keras.backend.clear_session()
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(1,)))
model.add(tf.keras.layers.Dense(3,activation='tanh'))
model.add(tf.keras.layers.Dense(1))

# Compile the model
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.1))
# Train the model
history = model.fit(x, y, epochs=100, verbose=False,batch_size=200)

# Make the predictions
p=model.predict(x)

plt.scatter(x,y,label='random data')
plt.plot(x,p,c='orange',label='predictions')
plt.legend()
plt.show()

### Caption
This plot shows the predictions of our sequential model, with an added dense layer of 3 neurons using the tanh activation function and trained with a learning rate of $10^{-1}$, against the noisy dataset generated from $y = \ln(x+1)$.

In the model above, we introduced an activation function for the hidden layers.
Activation functions are mathematical functions applied to the output of a neuron in a neural network. They introduce non-linearity to the network, enabling it to learn and model complex relationships in the data.
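
One way to see why the non-linearity matters: without an activation, two stacked dense layers collapse into a single linear layer. The NumPy sketch below checks this with randomly drawn weights. The names `W1`, `b1`, `W2`, and `b2` are hypothetical and not taken from the Keras model above.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 1))             # 5 samples, 1 feature

# Hypothetical weights: 1 input -> 3 hidden units -> 1 output.
W1, b1 = rng.normal(size=(1, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=1)

# Two stacked layers with no activation...
stacked = (inputs @ W1 + b1) @ W2 + b2
# ...give the same result as one linear layer with merged weights.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
single = inputs @ W_merged + b_merged
print(np.allclose(stacked, single))

# With tanh in between, the output is no longer a linear function of the input.
nonlinear = np.tanh(inputs @ W1 + b1) @ W2 + b2
print(np.allclose(nonlinear, single))
```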

Here are explanations of some commonly used activation functions:

1. sigmoid: The sigmoid function, also known as the logistic function, maps the input to a value between 0 and 1. It is often used in binary classification problems where the output represents the probability of belonging to a certain class. However, it can suffer from vanishing gradients and is not commonly used in deeper networks.

2. relu (Rectified Linear Unit): The ReLU function sets all negative values to zero and keeps positive values unchanged. It is the most widely used activation function due to its simplicity and ability to mitigate the vanishing gradient problem. ReLU works well in most cases but can cause dead neurons (i.e., neurons that output zero) during training.

3. tanh (Hyperbolic Tangent): The tanh function maps the input to a value between -1 and 1. It is symmetric around the origin and is useful in models where negative values are meaningful. Tanh can be used in both hidden layers and output layers.

4. softmax: The softmax function is commonly used in multi-class classification problems. It converts a vector of real numbers into a probability distribution, where the sum of all probabilities is equal to 1. Softmax is useful when dealing with mutually exclusive classes.

5. linear: The linear activation function simply outputs the input value without any transformation. It is primarily used in regression problems where the output can be any real value.

The choice of activation function depends on the problem at hand and the characteristics of the data. Experimentation and understanding the behavior of different activation functions can help in selecting the most suitable one for a particular neural network architecture.
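
The functions listed above are short enough to write out directly. The following NumPy sketch is for illustration only. Keras ships its own implementations, and the helper names here are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Map values to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Set negative values to zero and keep positive values unchanged."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turn a vector of real numbers into probabilities that sum to 1."""
    shifted = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / np.sum(e)

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("relu:   ", relu(z))
print("tanh:   ", np.tanh(z))      # values between -1 and 1
print("softmax:", softmax(z), "sum =", softmax(z).sum())
print("linear: ", z)               # the input passed through unchanged
```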

When dealing with more complex models, you should be careful to prevent overfitting. You should have not only a sufficiently large training set, but also a validation (development) set and/or a test set that the model has not seen, which you can use to vet your predictions. The validation set is used during training: the model is evaluated on it after every epoch, without affecting any of the weights. The test set is used only after training.
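
A minimal sketch of how a dataset might be divided into these three parts is shown below. The 70/15/15 proportions and the variable names are assumptions for illustration. This lab generates its validation data separately rather than splitting a single dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100
indices = rng.permutation(n_samples)   # shuffle before splitting

# Hypothetical 70/15/15 split; the proportions are an assumption.
n_train, n_val = 70, 15
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```

Because the three index arrays are disjoint, no sample used to adjust the weights is ever used to judge the model.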

----

Run the following code (it will take a little while longer to compute due to a larger number of epochs, especially running it on CPU instead of GPU, so please be patient). Then compare the difference in loss between the training and validation sets.

In [None]:
np.random.seed(0)

x = np.arange(0,10)
y = np.log(x+1)+np.random.normal(size=len(x))*0.2
x1 = np.arange(0,10,0.1)
y1 = np.log(x1+1)+np.random.normal(size=len(x1))*0.2

tf.keras.backend.clear_session()
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(1,)))
model.add(tf.keras.layers.Dense(3,activation='tanh'))
model.add(tf.keras.layers.Dense(1))

model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(learning_rate=1e-2))
history = model.fit(x, y, epochs=500, verbose=False,batch_size=10,validation_data=(x1,y1))
p=model.predict(x)


plt.scatter(x,y,label='random data')
plt.scatter(x1,y1,label='validation data')
plt.plot(x,p,c='blue',label='predictions')
plt.legend()
plt.show()


plt.plot(history.history['loss'],label='loss over training epochs')
plt.plot(history.history['val_loss'],label='validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.yscale('log')
plt.xscale('log')
plt.legend()
plt.show()

You might notice that the loss on the training set eventually becomes quite small, since the model was able to "memorize" the positions of these 10 points. The loss on the validation (dev) set, on the other hand, initially tracks the training loss before starting to lag behind significantly. There are a number of best practices one could employ to prevent this from happening, including terminating the training at the point when the validation loss stops improving.
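
The stopping rule mentioned above can be sketched as a small function over a list of validation losses. The function name `early_stopping_epoch`, the `patience` and `min_delta` parameters as written here, and the loss values are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    If that never happens, the last epoch index is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: improving at first, then drifting upward.
made_up_val_losses = [1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.38, 0.40]
stop = early_stopping_epoch(made_up_val_losses, patience=3)
print(stop)
```

Keras offers a built-in `tf.keras.callbacks.EarlyStopping` callback with `monitor` and `patience` arguments that can be passed to `model.fit`, which would likely be the more practical choice in this notebook.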

#### Question

At what epoch does the loss in the validation set stop substantively improving?

#### Answer
The validation loss stops substantively improving at around epoch $10^2$ (roughly epoch 100).

----
In addition to performing regression tasks, neural networks can assign labels for classification. For this, we will load the MNIST dataset of handwritten digits.

In [None]:
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[i], cmap='gray')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f"Label: {test_labels[i]}")

plt.tight_layout()
plt.show()

For classification, rather than a mean squared error loss, a popular choice of loss is sparse categorical crossentropy. The goal of the model can also change from minimizing the loss to maximizing the accuracy (these goals are not fully equivalent, but they correlate strongly with each other).
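
To illustrate how the two quantities differ, the sketch below computes both on made-up predictions for a 3-class problem. Sparse categorical crossentropy is the average of $-\ln$ of the probability assigned to the true class, with labels given as integers. The helper name and the numbers are assumptions for illustration.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """Mean of -log(probability assigned to the true class)."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]   # one probability per sample
    return -np.mean(np.log(picked))

# Made-up predictions for 3 samples and 3 classes (each row sums to 1).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
labels = np.array([0, 1, 2])

loss = sparse_categorical_crossentropy(labels, probs)
accuracy = np.mean(np.argmax(probs, axis=1) == labels)
print(loss, accuracy)
```

The loss rewards confident correct predictions, while the accuracy only counts whether the most probable class matches the label, which is why the two are related but not equivalent.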

We will redefine our model architecture to take a 2D image as input and produce 10 outputs, each representing the probability that the image corresponds to a particular digit.
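
As a quick size check, the arithmetic below counts the trainable parameters of such a network, assuming 28 by 28 pixel images, one hidden layer of 128 units, and 10 outputs. The helper `dense_params` is a hypothetical name introduced here.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

flattened = 28 * 28                      # pixels per image after flattening
hidden = dense_params(flattened, 128)    # hidden layer with 128 units
output = dense_params(128, 10)           # 10 outputs, one per digit
print(flattened, hidden, output, hidden + output)
```

These numbers can be compared against the output of `model.summary()` once the model is built.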

In [None]:
# Preprocess the data - it is usually suggested to normalize the input data
# in such a way that it would fit in the range of -1 to 1 or 0 to 1
train_images_norm = train_images / 255.0
test_images_norm = test_images / 255.0


tf.keras.backend.clear_session()
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(28,28)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128,activation='relu'))
model.add(tf.keras.layers.Dense(10,activation='softmax'))

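# The output layer already applies softmax, so the loss receives probabilities
# rather than logits (from_logits=False)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])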

model.fit(train_images_norm, train_labels, epochs=10)

test_loss, test_acc = model.evaluate(test_images_norm, test_labels, verbose=2)
print('Test accuracy:', test_acc)

# Predict on the normalized images, matching the data the model was trained on
predictions = tf.argmax(model.predict(test_images_norm),axis=1).numpy()


In [None]:
#confirming that the predictions are accurate
fig, axes = plt.subplots(3, 3, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[i], cmap='gray')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f"Predicted: {predictions[i]}")

plt.tight_layout()
plt.show()

The cell below will create a confusion matrix, i.e., it shows how often the predictions were accurate versus mismatched for each of the classes.
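
To show what the matrix contains, the sketch below builds one by hand for a made-up 3-class example, with rows indexing the true class and columns the predicted class (the same layout used by `sklearn.metrics.confusion_matrix`). The function name `confusion_counts` and the labels are assumptions for illustration.

```python
import numpy as np

def confusion_counts(true_labels, predicted_labels, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm

# Made-up labels for a 3-class problem.
true_demo = [0, 0, 1, 1, 2, 2]
pred_demo = [0, 1, 1, 1, 2, 0]
cm_demo = confusion_counts(true_demo, pred_demo, n_classes=3)
print(cm_demo)
print(np.trace(cm_demo) / cm_demo.sum())   # fraction of samples on the diagonal
```

Correct predictions land on the diagonal, so the accuracy is the trace divided by the total count.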

In [None]:
cm = confusion_matrix(test_labels, predictions)
accuracy = np.trace(cm) / float(np.sum(cm))

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=False, fmt="d", cmap="Blues", cbar=False)

# Set labels, title, and accuracy
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title(f"Confusion Matrix (Accuracy: {accuracy:.2f})")

# Show the heatmap
plt.show()

Use the `np.where` function to find the images where there is a mismatch between the test labels and the predictions.
Display 9 such cases, similarly to the above, and show the predicted class for each of these images.
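
As a reminder of how `np.where` behaves, here is a tiny example on made-up label arrays. The names `true_demo` and `pred_demo` are hypothetical.

```python
import numpy as np

true_demo = np.array([7, 2, 1, 0, 4])
pred_demo = np.array([7, 2, 7, 0, 9])

# np.where on a boolean array returns a tuple with one index array per
# dimension, so [0] selects the indices along the only axis.
mismatch_idx = np.where(pred_demo != true_demo)[0]
print(mismatch_idx)
for i in mismatch_idx:
    print(f"index {i}: true {true_demo[i]}, predicted {pred_demo[i]}")
```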

In [None]:
x=np.random.permutation(np.arange(len(predictions)))
predictions=predictions[x]
test_images=test_images[x]
test_labels=test_labels[x]

a=np.where(predictions != test_labels)[0]

fig, axes = plt.subplots(3, 3, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[a[i]], cmap='gray')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f"True: {test_labels[a[i]]}, Pred: {predictions[a[i]]}")

plt.tight_layout()
plt.show()

### Caption
9 examples of misclassified digits showing true and predicted labels for each.

### Extra credit

Build a model and make predictions for the dataset presented below. Tune the hyperparameters (including the number of hidden layers, number of neurons, learning rate, and batch size) to produce a good fit; keep the number of epochs to no more than 100, for the sake of speed.

In [None]:
np.random.seed(0)

x = np.linspace(0,np.pi*5,1000)
y = np.sin(x)+np.random.normal(size=len(x))*0.1
x1 = np.random.random(100)*np.pi*5
y1 = np.sin(x1)+np.random.normal(size=len(x1))*0.1

# put your code here