<!-- TITLE: Training Neural Nets -->

# Introduction #

In the first two lessons, we learned how to build neural networks out of stacks of dense layers. 

When we first create a network, all of its weights are set randomly -- it doesn't "know" anything yet. Training the network means adjusting its weights so that its predictions better capture the patterns in the training data.

Before training can begin, however, we need to tell our model two things. First, we tell it *what* problem it's trying to solve with a "loss function", and second we tell it *how* to solve the problem by choosing an "optimizer". Both plug into stochastic gradient descent, the training procedure described next: the loss function scores each batch of predictions, and the optimizer uses that score to update the weights.

# Stochastic Gradient Descent # 

The way we train a network is through an iterative process called **stochastic gradient descent**. One *step* of training goes like this:
1. Sample some training data and run it through the network to make predictions.
2. Measure the disparity between the predictions and the true values.
3. Finally, adjust the weights in a direction that makes the prediction closer to the true values.

Each iteration's sample of training data is called a **minibatch** (or often just "batch"), while a complete round of the training data is called an **epoch**. The number of epochs you train for is how many times the network will see each training example.
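
The three steps, together with the ideas of minibatch and epoch, can be sketched in plain Python. This is only an illustration under stated assumptions: the toy dataset, the one-feature linear model, and the names `learning_rate`, `batch_size`, and `epochs` are made up for this sketch, and Keras does considerably more than this internally.

```python
import random

random.seed(0)

# Toy training set (made up for illustration): y = 3x + 1 plus a little noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + 1.0 + random.gauss(0.0, 0.1)) for x in xs]

w, b = 0.0, 0.0          # the model starts out knowing nothing
learning_rate = 0.1      # assumed value, chosen for this toy problem
batch_size = 20
epochs = 30


def mae(w, b, rows):
    """Mean absolute error of the line w*x + b on the given rows."""
    return sum(abs(y - (w * x + b)) for x, y in rows) / len(rows)


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


initial_loss = mae(w, b, data)

for epoch in range(epochs):
    random.shuffle(data)  # fresh random minibatches every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]        # step 1: sample a minibatch
        errors = [(w * x + b) - y for x, y in batch]  # step 2: measure the disparity
        # step 3: move the weights against the gradient of the batch MAE
        grad_w = sum(sign(e) * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = sum(sign(e) for e in errors) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mae(w, b, data)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}, w={w:.2f}, b={b:.2f}")
```

With 200 rows and a batch size of 20, each epoch consists of 10 steps, so this loop takes 300 steps in total.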

The next two sections make this structure more formal: the loss function defines how the disparity in step 2 is measured, and the optimizer defines how the weights are adjusted in step 3.

# The Loss Function #

The **loss function** measures the disparity between the target's true value and the value the model predicts.

Different problems call for different loss functions. In this lesson we look at *regression* problems, where the task is to predict a numerical value, such as the quality rating in the Red Wine Quality example below.

A common loss function for regression problems is **mean absolute error** or **MAE**. For each prediction `y_pred`, MAE measures the disparity from the true target `y_true` by an absolute difference `abs(y_true - y_pred)`.

The total MAE loss on a dataset is the mean of all these absolute differences.
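
As a minimal sketch of that calculation, here is MAE written out in plain Python. The function name `mean_absolute_error` and the sample values are made up for illustration. In practice Keras computes the loss for you.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of abs(y_true - y_pred) over all examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, purely for illustration.
y_true = [5.0, 6.0, 7.0, 5.0]
y_pred = [5.5, 6.0, 5.0, 6.5]

# The absolute differences are 0.5, 0.0, 2.0, and 1.5, so their mean is 1.0.
print(mean_absolute_error(y_true, y_pred))
```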

<figure style="padding: 1em;">
<img src="https://i.imgur.com/VDcvkZN.png" width="500" alt="A graph depicting error bars from data points to the fitted line.">
<figcaption style="text-align: center; font-style: italic"><center>The mean absolute error is the average distance between the fitted curve and the data points.
</center></figcaption>
</figure>

Besides MAE, other loss functions you might see for regression problems are the mean-squared error (MSE) or the Huber loss (both available in Keras).
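
To give a rough feel for how these alternatives differ, the sketch below writes MSE and the Huber loss in plain Python rather than calling Keras. The function names, the sample values, and the threshold `delta=1.0` are assumptions made for this illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)


# Made-up values; the last prediction is far off (error of 4.0).
y_true = [5.0, 6.0, 7.0, 5.0]
y_pred = [5.5, 6.0, 5.0, 9.0]

print("MSE:  ", mean_squared_error(y_true, y_pred))
print("Huber:", huber_loss(y_true, y_pred))
```

Because MSE squares each error, the single large miss dominates its total, while the Huber loss grows only linearly once an error exceeds `delta`.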

During training, the model will use the loss function as a guide for finding the correct values of its weights (lower loss is better). In other words, the loss function tells the network its objective.

# The Optimizer #

The **optimizer** is an algorithm that trains the network by adjusting the weights to minimize the loss.

The size of these adjustments is determined by the **learning rate**. A smaller learning rate means the network needs to see more minibatches before its weights settle. A larger learning rate makes bigger adjustments, which can speed up training but risks overshooting a minimum, so that the loss bounces around or even grows. Choosing a learning rate is a tradeoff between the two.

The learning rate has a big effect on how the SGD training proceeds.  We'll explore this in the exercises.  
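
As a small illustration ahead of the exercises, the sketch below runs plain gradient descent on a toy one-weight loss, `loss(w) = w**2`, whose minimum is at `w = 0`. The function `descend` and the three learning rates are made up for this example and are not recommendations for real networks.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run plain gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


for lr in (0.01, 0.1, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {descend(lr):.4f}")
```

On this toy loss, the smallest rate is still far from the minimum after 20 steps, the middle rate gets close to it, and the largest rate overshoots on every step so that `w` moves further from zero each time.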

Fortunately, for most work it won't be necessary to search extensively for a good learning rate to get satisfactory results. There have been a number of modifications to the original SGD algorithm that are easier to use. The variant that we'll use most is called **Adam**. Adam adapts the size of each weight's update as training proceeds, so it tends to perform well in most situations and typically doesn't require tuning. It's a great general-purpose optimizer.


# Visualization #

<figure style="padding: 1em;">
<img src="https://i.imgur.com/rFI1tIk.gif" width="1600" alt="Fitting a line batch by batch. The loss decreases and the weights approach their true values.">
<figcaption style="text-align: center; font-style: italic"><center>Training a neural network with Stochastic Gradient Descent.
</center></figcaption>
</figure>

The animation shows the linear model from Lesson 1 being trained with SGD. The pale red dots depict the entire training set, while the solid red dots are the minibatches. 

Every time SGD sees a new minibatch, it shifts the weights (`w` the slope and `b` the y-intercept) toward values that lower the loss on that batch. Notice that the line only makes a small shift in the direction of each batch (instead of moving all the way).

Batch after batch, the line settles toward a good fit. You can see that the loss gets smaller as the weights are adjusted.

<blockquote style="margin-right:auto; margin-left:auto; background-color: #ebf9ff; padding: 1em; margin:24px;">
    <strong>What's In a Name?</strong><br>
The <strong>gradient</strong> is a vector that tells us in what direction the weights need to go. More precisely, it tells us how to change the weights to make the loss change <em>fastest</em>. We call our process gradient <strong>descent</strong> because it uses the gradient to <em>descend</em> the loss curve towards a minimum. <strong>Stochastic</strong> means "determined by chance." Our training is <em>stochastic</em> because the minibatches are <em>random samples</em> from the dataset. And that's why it's called SGD!
</blockquote>
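
To make the idea of a gradient a little more concrete, here is a minimal sketch for a hypothetical loss with a single weight. The functions `loss` and `numerical_gradient` and the step size of 0.1 are assumptions made for this example. Real networks compute gradients for all of their weights at once rather than estimating them numerically.

```python
def loss(w):
    """Hypothetical one-weight loss with its minimum at w = 2."""
    return (w - 2.0) ** 2


def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of the slope of f at w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


w = 5.0
grad = numerical_gradient(loss, w)  # positive here: the loss rises as w grows
w_new = w - 0.1 * grad              # so step the other way, against the gradient

print(f"gradient at w={w}: {grad:.3f}")
print(f"loss before: {loss(w):.3f}, loss after: {loss(w_new):.3f}")
```

The gradient points in the direction where the loss increases fastest, which is why the update subtracts it.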

# Example - Red Wine Quality #

Now we are ready to start training deep learning models. So let's see it in action! We'll use the *Red Wine Quality* dataset. <mark><b>TODO: add embedded link to dataset</b></mark>

This dataset consists of physicochemical measurements from about 1600 Portuguese red wines. Also included is a quality rating for each wine from blind taste-tests. How well can we predict a wine's perceived quality from these measurements?

We've put all of the data preparation into this next hidden cell. It's not essential to what follows so feel free to skip it. One thing you might note for now though is that we've rescaled each feature to lie in the interval $[0, 1]$. As we'll discuss more in Lesson 5, neural networks tend to perform best when their inputs are on a common scale.

The hidden cell leaves the training features and target in `X_train` and `y_train`, and the validation features and target in `X_valid` and `y_valid`.

In [None]:
#$HIDE$
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/dl-course-data/red-wine.csv')

# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

# Scale to [0, 1]
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# Split features and target
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

How many inputs should this network have? We can discover this by looking at the number of columns in the data matrix. Be sure not to include the target (`'quality'`) here -- only the input features.

In [None]:
print(X_train.shape)

Eleven columns means eleven inputs.

We've chosen a three-layer network with over 1500 neurons. This network should be capable of learning fairly complex relationships in the data.
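
As a quick sanity check on the size of this architecture, the sketch below counts hidden units and trainable parameters by hand. It assumes the layer widths used in this lesson (11 inputs, three hidden layers of 512 units, one output); the variable names are made up for the example.

```python
# Layer widths from this lesson: 11 inputs, three hidden layers of 512, one output.
layer_sizes = [11, 512, 512, 512, 1]

# Neurons in the hidden layers only.
hidden_units = sum(layer_sizes[1:-1])

# Each dense layer has (inputs * units) weights plus one bias per unit.
params = sum(
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)

print(f"hidden units: {hidden_units}")
print(f"trainable parameters: {params}")
```

Every one of those parameters is a weight or bias that training has to adjust.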

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])

After defining a model, you can add a loss function and optimizer with the model's `compile` method.  Notice that we are able to specify the loss (mean absolute error) and optimizer (Adam) with just a string.

In [None]:
model.compile(
    optimizer='adam',
    loss='mae',
)

Now we're ready to start the training! We've told Keras to feed the optimizer 256 rows of the training data at a time (the `batch_size`) and to do that 10 times all the way through the dataset (the `epochs`).
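
To see how `batch_size` and `epochs` translate into individual training steps, here is a rough calculation. The row count is an assumption (about 70% of roughly 1600 wines), so check `X_train.shape` for the exact figure.

```python
import math

n_rows = 1120      # assumption: about 70% of roughly 1600 wines
batch_size = 256
epochs = 10

# The last batch of an epoch may be smaller than batch_size, hence the ceiling.
steps_per_epoch = math.ceil(n_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```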

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)

You can see that Keras will keep you updated on the loss as the model trains.

Often, though, a better way to view the loss is to plot it. The `fit` method in fact keeps a record of the loss produced during training in a `History` object. We'll convert the data to a Pandas dataframe, which makes the plotting easy.

In [None]:
import pandas as pd

# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# use Pandas native plot method
history_df['loss'].plot();

Notice how the loss curve flattens as the epochs go by. When the loss curve becomes horizontal, it means the model has stopped improving, at least for the moment, and there may be little reason to continue for additional epochs.
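
One rough way to judge flattening numerically is to look at how much the loss improves from one epoch to the next. The sketch below uses hypothetical loss values shaped like the list stored in `history.history['loss']`. The numbers and the `tolerance` cutoff are made up for illustration and are not results from this model.

```python
# Hypothetical per-epoch losses, shaped like history.history['loss'].
losses = [0.30, 0.18, 0.14, 0.125, 0.120, 0.119, 0.1185]

# Improvement from each epoch to the next (positive means the loss went down).
improvements = [prev - curr for prev, curr in zip(losses, losses[1:])]

tolerance = 0.002  # assumed cutoff for "no meaningful improvement"

# Index of the first epoch whose improvement over the previous one is below the cutoff.
flat_from = next(
    (i + 1 for i, gain in enumerate(improvements) if gain < tolerance),
    None,
)

print([round(gain, 4) for gain in improvements])
print("curve looks flat from epoch index", flat_from)
```

A check like this only says the curve is flat so far. The loss can sit on a plateau for a while and then start descending again.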

# Conclusion #
