# Welcome to Deep Learning! #

In this course, you'll learn to:

- do deep learning for **regression** and **classification**
- design **neural network architectures**
- navigate the **loss landscape**
- master **stochastic gradient descent**
- solve real-world problems

You'll be prepared for deep learning if you've taken our *Introduction to Machine Learning* course.

Let's get started!

# A One Neuron Neural Network #

As an introduction to the fundamental ideas behind deep learning, we're going to build a neural "network" with just *one* neuron. With the flexibility of the deep-learning framework, we'll see that we can push even this one neuron beyond what many classical models can do. In Lesson 2 we'll start building full networks of hundreds or thousands of neurons!

There will be three parts to our model:
1. a linear neural unit, the model **architecture**
2. a mean squared error **loss function**, and
3. the SGD **optimizer**

We'll look briefly at each of these and see what they contribute.

# The Linear Unit #

A single neuron with one input looks like:

<figure style="padding: 1em;">
<img src="https://i.imgur.com/xxS8rzf.png" width="250" alt="Diagram of a linear unit.">
<figcaption style="text-align: center; font-style: italic"><center>The Linear Unit
</center></figcaption>
</figure>

When reading this diagram think about the computation as flowing from left to right. The numbers on the connections we call **weights** and the values that flow from input to output we call **activations**. Notice that this neuron has a constant input of 1 attached; its connection has a special weight called the **bias**. This neuron has two weights, `w` and `b`.

The rule is that whenever an activation flows through a connection, you multiply it by the weight, and to get the output of the unit you just sum up all of the inputs. So, this unit computes a function like $y = w x + b$, or in Python `output = w * input + b`.

<blockquote style="margin-right:auto; margin-left:auto; background-color: #ebf9ff; padding: 1em; margin:24px;">
    <strong>The Linear Unit Makes a Line</strong><br>
Does the formula $y=w x + b$ look familiar? It's an equation of a line! It's the slope-intercept equation, where $w$ is the slope and $b$ is the y-intercept. That's why we call it the <em>linear</em> unit.
</blockquote>

## Example ##

Say the weights on our neuron happened to be `w=3` and `b=2`. What would we get if we plug in `x=-4`?

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="300" alt="Diagram of neural computation.">
<figcaption style="text-align: center; font-style: italic"><center>Computing with the linear unit.
</center></figcaption>
</figure>

This agrees with our formula: $y = 3(-4) + 2 = -10$.
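The same computation can be written as a small Python function. This is only a sketch: the name `linear_unit` is made up for illustration and is not part of Keras or any other library.

```python
def linear_unit(x, w, b):
    """Compute the output of a linear unit with a single input."""
    # Multiply the input by its weight, then add the bias
    # (the bias is the weight on the constant input of 1).
    return w * x + b


# Weights and input from the example above
w, b = 3, 2
output = linear_unit(-4, w, b)
print(output)  # -10
```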

(By the way, running inputs through a network like this to compute its outputs is sometimes called the *forward pass*.)

# Training the Network #

When we first create a neuron, its weights are usually initialized randomly. Our goal is to use the training data to find values for the weights that will create the "best fitting" line to the data. We want, in other words, to find the right values for the slope and the y-intercept.

To find the right values for the weights, all neural networks use a procedure that goes more or less like this: Run some training data through the network to make predictions. Measure the difference between the predictions and the true values. Then, adjust the weights in a direction that makes the difference smaller. Do this over and over until you're satisfied.
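To make "measure the difference" concrete, here is a minimal sketch that uses mean squared error, the loss this lesson relies on. The helper `mse`, the toy data, and the two candidate weight settings are all assumptions chosen for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Toy data lying exactly on the line y = 3x + 2 (made up for illustration)
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [3 * x + 2 for x in xs]

# Two candidate settings of the weights (w, b)
for w, b in [(1.0, 0.0), (2.5, 1.5)]:
    preds = [w * x + b for x in xs]  # run the data through the unit
    print(f"w={w}, b={b}: loss={mse(ys, preds):.3f}")
```

The second setting is closer to the line that generated the data, so its loss is lower (0.875 versus 14.0). Training amounts to repeatedly nudging the weights toward settings like that.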

<figure style="padding: 1em;">
<img src="https://i.imgur.com/rFI1tIk.gif" width="1200" alt="Fitting a line batch by batch. The loss decreases and the weights approach their true values.">
<figcaption style="text-align: center; font-style: italic"><center>Training a neural network with Stochastic Gradient Descent.
</center></figcaption>
</figure>

You can see from this animation how the training goes. When adjusting the weights, we don't use the entire dataset at once, just a sample from it called a **minibatch**. To measure the difference between prediction and truth, we use the mean squared error (MSE) as our **loss function**. If everything goes to plan, the loss goes down as we feed in more minibatches, and the weights approach their true values.
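The loop in the animation can be sketched in plain Python for a single linear unit. This is a simplified illustration, not how Keras implements its optimizer: the synthetic data, the learning rate, the batch size, and the number of epochs are all assumptions.

```python
import random

random.seed(0)

# Synthetic data from a known line, y = 3x + 2, plus a little noise
TRUE_W, TRUE_B = 3.0, 2.0
xs = [random.uniform(-1, 1) for _ in range(256)]
data = [(x, TRUE_W * x + TRUE_B + random.gauss(0, 0.1)) for x in xs]

w, b = random.uniform(-1, 1), 0.0  # start from arbitrary weights
learning_rate = 0.1
batch_size = 32

for epoch in range(100):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradients of the minibatch MSE with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        # Step the weights in the direction that lowers the loss
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"w={w:.2f}, b={b:.2f}")
```

After training, `w` and `b` should land close to the slope and intercept that generated the data, which is what "the weights approach their true values" means.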

# Example - Red Wine Quality #

Our goal in this example will be to predict the perceived quality of a wine (on a scale of 3-8) given its *residual sugar* content, which is the amount of grape sugar remaining after fermentation. High levels of residual sugar make a wine *sweet* while low levels make it *dry*. The data is from the *Red Wine Quality* dataset.

As we'll discuss later, neural networks perform best when your data is put on a common scale -- we will rescale each feature into the interval $[0, 1]$. (Check out our [course on Pandas](https://www.kaggle.com/learn/pandas) for a review of working with dataframes!)
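As a quick illustration of the rescaling, here is a minimal sketch of min-max scaling on a plain list. The function name `min_max_scale` is hypothetical, and it assumes the values are not all identical (otherwise it would divide by zero). The setup code for this example applies the same formula to every column with pandas.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers linearly into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# A few quality scores on the dataset's 3-8 scale
print(min_max_scale([3, 5, 6, 8]))  # [0.0, 0.4, 0.6, 1.0]
```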

In [None]:
#$HIDE$
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/dl-course-data/red-wine.csv')

# Create the training split (a random 70% of the rows)
df_train = red_wine.sample(frac=0.7, random_state=0)

# Scale to [0, 1]
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)

# Split features and target
# (double brackets keep the feature 2D, with shape (n_samples, 1))
x_train = df_train[['residual sugar']]
y_train = df_train['quality']

In Keras, you can create a model with a single linear unit using what's called a `Dense` layer. Most neural networks are built by stacking layers of neurons that connect in a particular way, which we'll learn about in Lesson 2.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

# Create a network with 1 linear unit
model = keras.Sequential([
    layers.Dense(units=1)
])

To train the network, we first need to choose a loss function and an optimizer. We'll learn more about these in Lesson 3.

In [None]:
# Add the optimizer and loss function
model.compile(
    optimizer='sgd',
    loss='mse',
)

# Fit the network to the training data
history = model.fit(
    x=x_train,
    y=y_train,
    batch_size=256,
    epochs=50,
    verbose=0,
)

Finally, Keras' fitting method returns a record of the training in a `History` object. One of the goals of this course is to show you how you can use this loss history to guide your model development.

In [None]:
import pandas as pd

# Plot the loss history
history_df = pd.DataFrame(history.history)
history_df['loss'].plot();

So our single linear unit was able to fit this training data very well, attaining an MSE loss less than **TODO**. There are a couple of objections we might make, though. The first is that this is the MSE loss on the *training* data. We don't necessarily know that the model will perform this well on data it hasn't seen before. You might recall that the right way to check how well the model generalizes is by using a [*validation set*](https://www.kaggle.com/dansbecker/model-validation). You can actually include this in the `fit` method with a `validation_data` argument, and then you'll get two loss curves, one for the loss on the training set and one for the loss on the validation set. (We'll start doing this from now on.)

The second objection is that MSE might not be the right kind of loss for this data. The target, remember, is a set of ranks from 3 to 8. Is being 2 ranks off really 4 times as bad as being 1 rank off? It's a matter of interpretation, but it could have practical consequences if you were trying to use this model in the real world. Fortunately, it's easy to swap in a different loss function to suit your circumstances. Keras includes mean absolute error (MAE), mean squared logarithmic error (MSLE), Huber loss, and many more, or you could even define your own.
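To see how the choice of loss changes the answer to that question, the sketch below compares the penalty for being 2 ranks off with the penalty for being 1 rank off under three per-example losses. These are hand-written illustrations working on the original, unscaled ranks, not the Keras implementations. The Huber threshold `delta=1.0` is an assumption chosen for the example.

```python
def squared_error(error):
    return error ** 2


def absolute_error(error):
    return abs(error)


def huber(error, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)


losses = [("squared", squared_error), ("absolute", absolute_error), ("huber", huber)]
for name, loss in losses:
    ratio = loss(2) / loss(1)
    print(f"{name}: being 2 off costs {ratio:g}x being 1 off")
```

Squared error makes the larger miss 4 times as costly, absolute error only 2 times, and Huber (with this threshold) 3 times. So the loss you pick encodes how harshly you want to punish large mistakes.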

# Conclusion #

This lesson introduced you to the basic ideas behind deep learning. We learned about the parts of a neuron and how they compute a linear function. We learned that you can use stochastic gradient descent to train a network against your choice of a loss function. The remainder of this course is just about developing everything you've just seen in this tutorial.

# Your Turn #

Now [move on]() to the Exercise, where you'll **TODO**.