<!--TITLE: Deep Classifiers -->

# Introduction #

So far in this course, we've learned about how neural networks can solve regression problems. Now we're going to apply neural networks to another common machine learning problem: classification. Most everything we've learned up until now still applies. The main difference is in the loss function we use and in what kind of outputs we want the final layer to produce.

# Binary Classification #

We are going to look specifically at what is known as "binary" classification. In a **binary classification** problem, each example in the dataset is assigned to one of two target labels, usually 0 or 1. Our goal is to teach the model to correctly predict these labels.

For instance, in the [Titanic competition](https://www.kaggle.com/c/titanic) the goal is to predict which passengers survived the Titanic shipwreck. Given some information about a passenger (age, sex, family relations, ...), you want a model that can predict whether they survived (1) or perished (0).

<blockquote style="margin-right:auto; margin-left:auto; background-color: #ebf9ff; padding: 1em; margin:24px;">
    <strong>The Titanic Competition</strong><br>
If you've never entered a competition on Kaggle before, the [Titanic competition](https://www.kaggle.com/c/titanic) is a great place to start. Check out [Alexis Cook's Titanic Tutorial](https://www.kaggle.com/alexisbcook/titanic-tutorial) for a step-by-step walkthrough! Can your deep learning model beat the traditional models?
</blockquote>

# Accuracy and Cross-Entropy #

**Accuracy** is one of the many metrics in use for measuring success on a classification problem. It is the ratio of correct predictions to total predictions. All else being equal, it's a reasonable metric to use whenever the classes in the dataset occur with about the same frequency.
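As a minimal sketch of that ratio in plain Python: the `accuracy` helper and the five example labels below are hypothetical and exist only for illustration. In practice Keras computes this for you through the `binary_accuracy` metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for five passengers (1 = survived, 0 = perished).
y_true = [1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
print(acc)  # 3 of the 5 predictions are correct -> 0.6
```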

The problem with accuracy (and most other metrics) is that it can't be used as a loss function. SGD needs a loss function that changes smoothly, but accuracy changes discretely (in "jumps") because it works on categories. So, we have to choose a substitute to act as the loss function -- this substitute is the *cross-entropy* function.

Now, recall that the loss function defines the *objective* of the network during training. With regression, our goal was to minimize the distance between truth and prediction, and so we used MAE.

For classification, what we want instead is a distance between *probabilities*, and this is what cross-entropy provides. **Cross-entropy** is a sort of measure of the distance from one probability distribution to another: in our case, from the predicted probabilities to the true labels.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/DwVV9bR.png" width="400" alt="Graphs of accuracy and cross-entropy.">
<figcaption style="textalign: center; font-style: italic"><center>Cross-entropy penalizes incorrect probability predictions.</center></figcaption>
</figure>

The idea is that we want our network to predict the correct class with probability `1.0`. The further the predicted probability is from `1.0`, the greater the cross-entropy loss.
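To make this concrete, here is a rough sketch of the binary cross-entropy computation in plain Python. The `binary_crossentropy` function and its `eps` clipping value are assumptions made for illustration; Keras's built-in `'binary_crossentropy'` loss may differ in its numerical details.

```python
import math


def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip the probability so that log(0) never occurs.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# One example whose true label is 1, with increasingly poor predictions.
for p in [0.9, 0.6, 0.1]:
    print(p, round(binary_crossentropy([1], [p]), 3))
# 0.9 0.105
# 0.6 0.511
# 0.1 2.303
```

Note that the predictions `0.9` and `0.6` would both count as correct under a 0.5 threshold, so accuracy cannot tell them apart, while the cross-entropy still rewards the more confident one.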

The main thing to take away from this section is this: use cross-entropy for your classification loss while still monitoring the accuracy or other metrics you care about.

## Adding the Cross-Entropy Loss and Accuracy Metric ##

Add cross-entropy and accuracy to a model with its `compile` method. For two-class problems, be sure to use `'binary'` versions. (Problems with more classes will be slightly different.)

```
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)
```

# Making Probabilities with the Sigmoid Function #

The cross-entropy and accuracy functions both require probabilities as inputs, that is, numbers from 0 to 1. To convert the real-valued outputs produced by a dense layer into probabilities, we attach a new kind of activation function, the **sigmoid activation**.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/FYbRvJo.png" width="400" alt="The sigmoid graph is an 'S' shape with horizontal asymptotes at 0 to the left and 1 to the right. ">
<figcaption style="textalign: center; font-style: italic"><center>The sigmoid function maps real numbers into the interval $(0, 1)$.</center></figcaption>
</figure>
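For reference, the sigmoid is the function $1 / (1 + e^{-x})$. The sketch below is a plain-Python version written for illustration only; the two-branch form is one common way to avoid overflow for large negative inputs and is not necessarily how Keras implements `activation='sigmoid'`.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this form cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)


for x in [-4.0, 0.0, 4.0]:
    print(x, round(sigmoid(x), 3))
# -4.0 0.018
# 0.0 0.5
# 4.0 0.982
```

Large negative outputs from the dense layer map to probabilities near 0, large positive outputs map to probabilities near 1, and an output of exactly 0 maps to 0.5.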

To get the final class prediction, we define a *threshold* probability. Typically this will be 0.5, so that rounding will give us the correct class: below 0.5 means the class with label 0 and 0.5 or above means the class with label 1.
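The thresholding rule just described can be sketched in a few lines. The `to_class_labels` helper and the example probabilities are hypothetical, chosen to show what happens on either side of the threshold and exactly at it.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Map each probability to label 1 if it reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical sigmoid outputs for four passengers.
probs = [0.08, 0.5, 0.49, 0.93]
labels = to_class_labels(probs)
print(labels)  # [0, 1, 0, 1]
```

Raising the threshold makes the model more conservative about predicting class 1; for instance, `to_class_labels(probs, threshold=0.9)` would label only the last passenger as 1.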

We'll see some examples of how we can work with these probabilities in the exercises.

## Adding the Sigmoid Activation ##

You'll want to add this sigmoid activation to the final layer of your network. You might define a small classifier like this:

```
model = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
```

Now, instead of any real number, the network will output numbers from `0.0` to `1.0`.

# Example - Survival on the Titanic #

Now let's try it out!

```
#$HIDE_INPUT$
import pandas as pd

train_data = pd.read_csv('../input/dl-course-data/dl-course-data/titanic.csv')
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
y = train_data["Survived"]

train_data.head(4)
```

```
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

model = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)

early_stopping = EarlyStopping(
    min_delta=0.001,
    patience=20,
    restore_best_weights=True,
)

history = model.fit(
    X, y,
    validation_split=0.2,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0,
)

history_frame = pd.DataFrame(history.history)

history_frame.loc[:, ['loss', 'val_loss']].plot();
```

# Conclusion #