# Welcome to Deep Learning! #

This course has everything you need to get started with deep learning in Keras. You'll learn how to:
- design neural networks to perform **regression** and **classification**
- effectively train a network with **stochastic gradient descent**
- tune a model with **early stopping** and **dropout**
- improve training behavior with **Adam** and **batch normalization**

# What Makes Deep Learning Different? #
- layers of simple transformations
- optimization through SGD

In this lesson, we're going to introduce a new way of building and training machine-learning models. We typically think of a classical machine learning model as performing a single transformation that maps inputs directly to outputs. Moreover, a classical model is typically trained on the entire dataset at once. This is true of most of the models in `scikit-learn`, for instance.

To illustrate this new way of doing machine learning, we're going to reconceive linear regression as a neural network. As we'll see, the solution we arrive at is essentially identical to that produced by classical methods. This new framework, however, is much more powerful and much more flexible. Starting in Lesson 2, we'll see how we can build on our simple linear regression model to produce sophisticated deep learning networks that can be trained on very large datasets.

# Linear Regression #

So let's review linear regression. In linear regression we model the target as a linear function of the features. So, the features become the inputs $x$ and the target becomes the output $y$ of some function like:
\[y = W x + b\]

When there is a single feature, you could think of linear regression as fitting a line through the $(x, y)$ data points. When there are multiple features, it will fit a plane or hyperplane. (Let's just say "line" for now.)

<!-- fitting a line and plane -->

In the equation above, the variable $W$ represents the **weights**. The weights tell you how much the output changes for a unit change in each input. $W$ is a vector with one weight for each feature. The other variable $b$ is called the **bias**. The bias is the value of the output $y$ when every input is 0. It gives a vertical shift to the regression line.
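To make the roles of the weights and the bias concrete, here is a minimal NumPy sketch of the function $y = W x + b$ for a model with two features. The values of `W` and `b` and the helper name `predict` are made up for illustration.

```python
import numpy as np

# Hypothetical weights and bias for a model with two features.
W = np.array([2.0, -1.0])  # one weight per feature
b = 0.5                    # bias: the output when every input is 0

def predict(X, W, b):
    """Apply y = W x + b to each row of X."""
    return X @ W + b

X = np.array([
    [0.0, 0.0],  # all-zero input, so the prediction equals the bias
    [1.0, 0.0],  # raising the first feature by 1 adds W[0] to the output
    [1.0, 3.0],
])
y_pred = predict(X, W, b)
print(y_pred)
```

Comparing the first two rows shows the effect of a single weight, while the first row alone shows the effect of the bias.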

<!-- effect of weights and bias -->

As you can see, both the weights and the bias are needed to find a well-fitting line. (We'll talk more about what "well-fitting" means in a moment.)

# The Linear Layer #

In Keras, we build models in **layers**. A layer essentially is just something that takes some inputs, does a computation on them, and produces some outputs. In Keras, layers are implemented so as to be easy to combine. Typically, you won't need to keep track of how many inputs or outputs a layer has. Most of that is taken care of for you.

The most general kind of layer is the `Dense` layer. A `Dense` layer, in fact, is just a layer that does a computation like $y = W x + b$. Training a `Dense` layer means finding values for $W$ and $b$ that fit the inputs to the outputs. So a `Dense` layer is essentially calculating a linear regression.

Here is how we could define a linear regression model in Keras:

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([ # this creates a model that we can put a stack of layers in
    layers.Dense(1),
])

The argument given to the `Dense` layer defines how many outputs it has. In this course, we'll just do regression on a single output variable, but if you were doing multivariate regression, you could have more.
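To see what the number of outputs means in practice, here is a NumPy sketch of the computation a dense layer performs. This is an illustration, not the Keras implementation itself. It assumes the weights are stored as a matrix of shape `(n_features, n_outputs)`; the names `dense_forward`, `kernel`, and `bias` and all the values are made up for this example.

```python
import numpy as np

def dense_forward(inputs, kernel, bias):
    """Linear computation of a dense layer: inputs @ kernel + bias."""
    return inputs @ kernel + bias

n_features = 3
inputs = np.ones((4, n_features))  # a batch of 4 examples

# One output per example, comparable to layers.Dense(1)
kernel_1 = np.full((n_features, 1), 0.5)
bias_1 = np.zeros(1)
out_1 = dense_forward(inputs, kernel_1, bias_1)

# Two outputs per example, comparable to layers.Dense(2)
kernel_2 = np.full((n_features, 2), 0.5)
bias_2 = np.zeros(2)
out_2 = dense_forward(inputs, kernel_2, bias_2)

print(out_1.shape, out_2.shape)
```

Changing the number of outputs only changes the number of columns in the weight matrix and the length of the bias vector. The inputs stay the same.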

# Optimization #

A well-fitting line in linear regression is one that minimizes the distance between itself and the data points. Most commonly, the line is chosen to minimize **mean-squared error** (MSE): for every input $x$, take the difference between its $y$-value and the $y$-value on the line, square each difference, and then average the squares. Fitting a line this way is called *ordinary least-squares* (OLS).
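As a quick illustration, the following sketch computes the mean-squared error by hand. It averages the squared differences, as the "mean" in the name implies. The target and prediction values are made up for the example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets and the corresponding y-values on a candidate line
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print(mse)
```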

An **optimization** problem is a problem of finding parameter values for a function that will produce a minimum or a maximum. Ordinary least squares is the problem of choosing a $W$ and $b$ that will minimize MSE for the given dataset. Most commonly, the OLS problem is solved by applying certain matrix transformations to the training data, transforming the entire dataset at once. This method is guaranteed to find the optimal solution: the "best" values for $W$ and $b$.
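For comparison with the iterative approach used in deep learning, here is a sketch of the classical all-at-once solution using NumPy's least-squares solver, `np.linalg.lstsq`. The data is synthetic and noise-free, generated from an assumed line $y = 3x + 2$, so the solver should recover those values. Appending a column of ones is one common way to let the solver fit the bias along with the weights.

```python
import numpy as np

# Synthetic, noise-free data from an assumed line y = 3x + 2
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0

# Append a column of ones so the last coefficient plays the role of the bias
A = np.hstack([X, np.ones((len(X), 1))])

# Solve the least-squares problem on the whole dataset at once
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
W_ols, b_ols = solution[:-1], solution[-1]
print(W_ols, b_ols)
```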

Methods for finding guaranteed optimal parameter values only exist for models that are relatively simple. Deep learning models can be extremely complex, so there's no hope of finding a simple formula that produces a solution. Instead, we train the model a little bit at a time on small samples of the dataset, driving down the error a little with each step. We call these samples **minibatches** (or sometimes just "batches"), and we call the full optimization process **stochastic gradient descent** (SGD). We'll look at the details of SGD in Lesson 3 and how you can tune it to get good results.
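The idea of training on minibatches can be sketched in plain NumPy. This is a simplified illustration for linear regression with MSE loss, not what Keras runs internally. The synthetic data (from an assumed line $y = 3x + 2$), the learning rate, the batch size, and the number of passes over the data are all arbitrary choices for the example.

```python
import numpy as np

# Synthetic, noise-free data from an assumed line y = 3x + 2
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(256, 1))
y = 3.0 * X[:, 0] + 2.0

# Start from arbitrary parameter values
W = np.zeros(1)
b = 0.0

learning_rate = 0.1  # size of each update step (arbitrary choice)
batch_size = 32      # examples per minibatch (arbitrary choice)

for _ in range(50):  # 50 passes over the data
    order = rng.permutation(len(X))  # visit the examples in a new random order
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        X_batch, y_batch = X[idx], y[idx]

        # Prediction error on this minibatch only
        error = X_batch @ W + b - y_batch

        # Gradients of the minibatch MSE with respect to W and b
        grad_W = 2.0 * X_batch.T @ error / len(idx)
        grad_b = 2.0 * error.mean()

        # Move the parameters a small step against the gradient
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b

print(W, b)
```

Each update looks at only 32 examples, yet the parameters drift toward the same line the all-at-once method would find.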

<!-- animation of sgd -->

You set up SGD in Keras by "compiling" your model with a loss function and an optimizer. Taking the model we defined before, we would then do:

In [None]:
model.compile(
    optimizer="sgd",
    loss="mse",
)

This tells Keras we want to minimize mean-squared error using stochastic gradient descent. 

Usually, you'll run through the dataset multiple times. Each complete pass through the dataset is called an **epoch**. To train the model for 20 epochs on batches of 64 examples at a time, we do:

In [None]:
model.fit(x=X, y=y, epochs=20, batch_size=64)
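To get a feel for how epochs and batch size relate, the following sketch counts the number of parameter updates performed during training. The dataset size of 1,000 rows is a made-up figure. The calculation assumes the last, smaller batch of each epoch is still used rather than dropped.

```python
import math

n_examples = 1000  # hypothetical dataset size
batch_size = 64
epochs = 20

# Each batch triggers one parameter update; the final batch may be smaller
steps_per_epoch = math.ceil(n_examples / batch_size)
total_updates = steps_per_epoch * epochs
last_batch_size = n_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, total_updates, last_batch_size)
```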

<!-- note -->
<strong>Linear Transformation?</strong>
If you've had some linear algebra, you might know that a linear transformation is what you get when you multiply something by a matrix. Geometrically, this means some combination of reflections, rotations, and constant stretches or shrinks. Technically, the addition of the bias makes the transformation *affine*, but since (it turns out) an affine transformation is just a linear transformation one dimension higher, it will be okay if we just stick with "linear".
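The claim that an affine transformation is a linear transformation one dimension higher can be checked numerically. In this sketch, the bias is appended to the weights and a constant 1 is appended to the input, so a single dot product gives the same result as $W x + b$. All values are made up for illustration.

```python
import numpy as np

# Hypothetical weights, bias, and input
W = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

# Affine form: multiply by the weights, then add the bias
affine = W @ x + b

# Linear form, one dimension higher: the bias becomes an extra weight
# and the input gains a constant 1
W_aug = np.append(W, b)
x_aug = np.append(x, 1.0)
linear = W_aug @ x_aug

print(affine, linear)
```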

# Example - Linear Regression in Keras #

Now let's carry this out on an actual dataset.

First let's load the data. We've hidden the cell since the details aren't important for this example, but feel free to take a look if you like.

In [None]:
# load some data

In addition to the loss and the optimizer, you can also include **metrics**. These are additional functions evaluated alongside the loss that don't affect the training. We might be interested in the mean-absolute error as well, so let's include that as a metric.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(1),
])

model.compile(
    optimizer='SGD',
    loss='mse',
    metrics=['mae'],
)

history = model.fit(x=X, y=y, epochs=20, batch_size=64)

The `fit` method returns a record of the loss and metrics computed during training. It's nice to save this to produce some plots afterwards.

In [None]:
import pandas as pd

# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# no validation data was passed to `fit`, so only the training columns exist
history_df.loc[:, ['loss']].plot()
history_df.loc[:, ['mae']].plot();

# Conclusion #

In this tutorial, we introduced the basic framework of deep learning. We learned that deep learning uses models built with layers and iteratively trained with SGD. Everything we develop in the rest of this course will concern one of these two things: how we compose the layers, or how we perform the optimization.

# Your Turn #

During this course, you'll develop a deep learning model to *solve a real-world problem*. Move on to the first exercise to get started!