# Introduction #

# The Loss Function #

The **loss function** measures the disparity between the true value of the target and the value the model predicts. During training, the model will use the loss function as a guide for finding the correct values of its weights. It tells the network its objective, "where it's supposed to go."

Different problems call for different loss functions. For this course, we'll use the **mean absolute error** or **MAE**. For each prediction `y_pred`, MAE measures the disparity from the true target `y_true` using `abs(y_true - y_pred)`.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="300" alt="The graph of the MAE function, a 'V'.">
<figcaption style="text-align: center; font-style: italic"><center>MAE takes the absolute value of the difference between the true target and the prediction.
</center></figcaption>
</figure>

The total MAE loss on a dataset would be the mean of all those absolute differences. In the illustration, it would be the average length of all the red bars.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="300" alt="The graph of the MAE function, a 'V'.">
<figcaption style="text-align: center; font-style: italic"><center>
</center></figcaption>
</figure>
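
As a minimal sketch of the calculation described above, the following computes MAE with NumPy. The helper name `mean_absolute_error` and the sample values are assumptions made up for illustration; they are not part of the course data.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical targets and predictions, chosen only for illustration
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 6.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # absolute errors are 0.5, 0.0, 2.0, 1.0, so the mean is 0.875
```

Each absolute error plays the role of one bar's length, and the loss is their average.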



# Training the Network #

When we first create a neural network, all of its weights are set randomly -- it doesn't "know" anything yet. Training the network means finding the right values for the weights, the values that minimize the loss.

The way we train a network is through an iterative process called **stochastic gradient descent**. One *step* of training goes like this:
1. Sample some training data and run it through the network to make predictions.
2. Measure the loss between the predictions and the true values.
3. Finally, adjust the weights in a direction that makes the loss smaller.

Then just do this over and over until the loss is as small as you like (or until it won't decrease any further).
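
The three steps can be sketched in plain NumPy for a one-feature linear model trained with MAE. This is an illustrative sketch, not how Keras implements training: the synthetic data, the learning rate of 0.05, the batch size of 32, and the 500 steps are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: points scattered around the line y = 2x + 1
X = rng.uniform(-1.0, 1.0, size=256)
y = 2.0 * X + 1.0 + rng.normal(scale=0.1, size=256)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# Weights start out random: the model doesn't "know" anything yet
w, b = rng.normal(size=2)
initial_loss = mae(y, w * X + b)

learning_rate = 0.05  # assumed value
batch_size = 32       # assumed value

for step in range(500):
    # 1. Sample some training data and make predictions
    idx = rng.choice(len(X), size=batch_size, replace=False)
    x_batch, y_batch = X[idx], y[idx]
    y_pred = w * x_batch + b

    # 2. Measure the loss on this batch
    batch_loss = mae(y_batch, y_pred)

    # 3. Adjust the weights in the direction that lowers the loss.
    #    For MAE, the (sub)gradient depends only on the sign of each error.
    sign = np.sign(y_pred - y_batch)
    grad_w = np.mean(sign * x_batch)
    grad_b = np.mean(sign)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mae(y, w * X + b)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print(f"w = {w:.2f}, b = {b:.2f}")
```

With a fixed step size the weights keep jittering slightly around their best values instead of settling exactly, which is one reason practical optimizers adapt the step as training proceeds.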

<figure style="padding: 1em;">
<img src="https://i.imgur.com/rFI1tIk.gif" width="1600" alt="Fitting a line batch by batch. The loss decreases and the weights approach their true values.">
<figcaption style="text-align: center; font-style: italic"><center>Training a neural network with Stochastic Gradient Descent.
</center></figcaption>
</figure>

Each iteration's sample of training data is called a **minibatch** (or often just "batch"), while a complete pass through the training data is called an **epoch**. The number of epochs you train for is how many times the network will see each training example.
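
To make the relationship between minibatches and epochs concrete, here is a small arithmetic sketch. The dataset size, batch size, and epoch count are hypothetical, and it assumes each epoch splits the data into consecutive batches with a smaller final batch when the sizes don't divide evenly.

```python
import math

n_examples = 1000   # hypothetical dataset size
batch_size = 256    # hypothetical minibatch size
epochs = 10

# One epoch is one full pass, so it takes this many minibatches
# (the last batch is smaller when the sizes don't divide evenly)
steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 4
print(total_steps)      # 40
```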

The animation shows the linear model from Lesson 1 being trained with SGD. The pale red dots depict the entire training set, while the solid red dots are the minibatches. Every time SGD sees a new minibatch, it will shift the weights (`w` the slope and `b` the y-intercept) toward their correct values on that batch. Batch after batch, the line eventually converges to its best fit. You can see that the loss gets smaller as the weights get closer to their true values.

Notice that the line only makes a small shift in the direction of each batch (instead of moving all the way). The size of these shifts is determined by the **learning rate**. A smaller learning rate means the network needs to see more minibatches before its weights converge to their best values.
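
The effect of the learning rate can be seen on a toy problem. The sketch below uses ordinary (non-stochastic) gradient descent on a one-weight loss `(w - 3)**2`, which is an assumption chosen for simplicity and not the model in the animation; the function name `steps_to_converge` is likewise hypothetical.

```python
def steps_to_converge(learning_rate, tolerance=1e-3, max_steps=10_000):
    """Count gradient-descent steps until w is within `tolerance` of the minimum.

    Toy loss: (w - 3)**2, whose gradient is 2 * (w - 3) and whose minimum is at w = 3.
    """
    w = 0.0
    for step in range(max_steps):
        if abs(w - 3.0) < tolerance:
            return step
        gradient = 2.0 * (w - 3.0)
        w -= learning_rate * gradient  # small shift against the gradient
    return max_steps

fast = steps_to_converge(learning_rate=0.1)
slow = steps_to_converge(learning_rate=0.01)
print(fast, slow)  # the smaller learning rate needs many more steps
```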

The learning rate and the size of the minibatches are the two parameters that have the largest effect on how the SGD training proceeds. Their interaction is often subtle and the right choice for these parameters isn't always obvious. We'll explore these effects in the exercises.

Fortunately, for most work, it won't be necessary to do an extensive hyperparameter search to get satisfactory results. Several modifications of the original SGD algorithm are easier to use. The variant that we'll use most is called **Adam**. It's thought to perform well in most situations and typically doesn't require hyperparameter tuning. It's a great general-purpose algorithm.

After defining a model, you can add a loss function and optimizer with the `compile` method:

```
model.compile(
    optimizer="adam",
    loss="mae",
)
```

Notice that we are able to specify the loss and optimizer with just a string. You can also access these directly through the Keras API -- if you wanted to tune parameters, for instance -- but for us, the defaults will work fine.
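
For reference, here is a sketch of what going through the Keras API directly might look like. The one-layer model is only a stand-in so the snippet runs on its own, and the learning rate shown is simply Adam's default written out explicitly.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny stand-in model, only so that the snippet runs on its own
model = keras.Sequential([
    keras.Input(shape=(11,)),
    layers.Dense(1),
])

# An Adam optimizer and an MAE loss built as objects, with the
# learning rate written out so that it can be changed
optimizer = keras.optimizers.Adam(learning_rate=0.001)
loss = keras.losses.MeanAbsoluteError()

model.compile(optimizer=optimizer, loss=loss)
```

Building the objects yourself is only needed when you want something other than the defaults.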

<blockquote style="margin-right:auto; margin-left:auto; background-color: #ebf9ff; padding: 1em; margin:24px;">
    <strong>What's In a Name?</strong><br>
The <strong>gradient</strong> is a vector that tells us in what direction the weights need to go. More precisely, it tells us how to change the weights to make the loss change <em>fastest</em>. We call our process gradient <strong>descent</strong> because it uses the gradient to <em>descend</em> the loss curve towards a minimum. <strong>Stochastic</strong> means "determined by chance." Our training is <em>stochastic</em> because the minibatches are <em>random samples</em> from the dataset. And that's why it's called SGD!
</blockquote>

# Example - Red Wine Quality #

**TODO - discussion**

Now let's see it in action!

In [None]:
#$HIDE_INPUT$
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/dl-course-data/red-wine.csv')

# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

# Scale to [0, 1]
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# Split features and target
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

How many inputs should this network have? We can find out by looking at the number of columns in the feature matrix, which doesn't include the target (`quality`).

In [None]:
print(X_train.shape)

Now make the network.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])

We include the loss and optimization algorithm in `compile`. 

In [None]:
model.compile(
    optimizer='adam',
    loss='mae',
)

After you've defined the network architecture and compiled it with an optimizer and loss function, you're ready to start training.

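A few things to note about the call to `fit`:
- **Validation data**: passing `validation_data` makes Keras also report the loss on the held-out split (as `val_loss`) at the end of each epoch.
- **Batch size**: `batch_size=256` feeds the optimizer 256 rows of training data at a time.
- **Epochs**: `epochs=10` makes ten complete passes through the training data.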

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)

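The `fit` method returns a `History` object that records the training and validation loss for each epoch. Converting it to a dataframe makes these "learning curves" easy to plot. We'll learn more about them in the next lesson.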

In [None]:
import pandas as pd
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();

# Conclusion #