# Welcome to Deep Learning! #

In this course, you'll learn how to:

- do deep learning for **regression** and **classification**
- design **neural network architectures**
- navigate the **loss landscape**
- master **stochastic gradient descent**
- solve real world problems

You'll be prepared for deep learning if you've taken our *Introduction to Machine Learning* course.

Let's get started!

# A New Kind of Model #

Deep learning is fundamentally about neural networks. But it's also a new way of building and training machine learning models.

In addition to a dataset, building and training a deep learning model means deciding on three things:
1. A **model architecture** built of layers of data transformations
2. A **loss function** that defines the solution, and
3. A method of **optimization** that fits the model to the data

With classical machine learning models, these three things are usually bundled together in a single algorithm. In `scikit-learn`, for instance, you define a linear model with a single call, like `model = LinearRegression()`.

With the classical approach, you choose from a library of distinct algorithms. With the deep learning approach, instead of choosing from a library of predefined models, you design the model yourself.

These three things -- architecture, loss, and optimization -- are the three central ideas of this course. Each of our future lessons will explore some aspect of these three ideas, how they interact, and how your decisions about them will ultimately decide the success of your project. We hope to develop strong intuitions about what's actually happening when you train a neural network so that you can intelligently diagnose problems and quickly iterate towards a successful solution.

As an introduction, we'll implement a linear regression model in our new framework. Then, we'll apply our model to the [Red Wine Quality](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) dataset.

<blockquote style="margin-right:auto; margin-left:auto; background-color: #ebf9ff; padding: 1em; margin:24px;">
    <strong>Keras and TensorFlow</strong><br>
<a href="https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD">TensorFlow</a> is a large and robust machine learning platform.
<a href="https://keras.io/">Keras</a> is TensorFlow's deep-learning API.
</blockquote>

# The Linear Model #

In our *Introduction to Machine Learning* course you predicted the price of a home using features like its number of bedrooms or the size of its lot. When we use data to predict a continuous quantity like this we are solving a **regression** problem. (With *classification* we'd instead be trying to predict one label out of an unordered set of classes.)

## Architecture ##

In linear regression we model the target as a linear function of the features. So, the features become the inputs $x$ and the target becomes the output $y$ of some function like:
$y = W x + b$.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="600" alt="Graphs illustrating linear regression models.">
<figcaption style="textalign: center; font-style: italic"><center><strong>Left: </strong>With one input feature, linear regression fits a line. <strong>Right: </strong>With two inputs, it fits a plane.
</center></figcaption>
</figure>

The variables $W$ and $b$ are the parameters we determine when we fit the model to the training data. $W$ is a matrix we call the **weights** and $b$ is a vector we call the **bias**.
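
To make the shapes concrete, here is a minimal NumPy sketch of $y = W x + b$ for a single example. The sizes (three input features, one output) and the parameter values are made up for illustration; in practice they would be learned from the training data.

```python
import numpy as np

# Hypothetical sizes: 3 input features, 1 output.
W = np.array([[0.5, -1.0, 2.0]])   # weights, shape (1, 3)
b = np.array([0.25])               # bias, shape (1,)

def linear(x):
    """Apply y = W x + b to a single feature vector x of shape (3,)."""
    return W @ x + b

x = np.array([1.0, 2.0, 3.0])
y = linear(x)   # 0.5*1 - 1.0*2 + 2.0*3 + 0.25
print(y)
```

Each row of `W` produces one output value, so a model with a single output needs a single row of weights and a single bias term.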

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="600" alt="Illustration of weights as slope and bias as y-intercept.">
<figcaption style="textalign: center; font-style: italic"><center>The weights determine the slope and the bias determines the vertical intercept. <strong>Left:</strong> Without weights. <strong>Right:</strong> Without bias.
</center></figcaption>
</figure>

For our first decision then, we choose a linear architecture.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

# Build a model by stacking layers inside of Sequential
model = keras.Sequential([
    layers.Dense(units=1),
])

Each layer in a Keras model represents some kind of data transformation. The `Dense` layer represents a transformation like $W x + b$ -- exactly what we want for linear regression. The `units` argument sets how many outputs the layer produces. In a regression problem like ours, the model predicts just a single value for each input example, so we choose `units=1`.

<blockquote style="margin-right:auto; margin-left:auto; background-color: #ebf9ff; padding: 1em; margin:24px;">
    <strong>Example: House Prices</strong><br>
In the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques">House Prices: Advanced Regression Techniques</a> competition, each home has a set of features, like its size in square feet or its number of bedrooms. From this set of features, you are trying to predict a single value, its selling price. A simple linear regression model might look like:

<code>
Price = w_0 * LotArea + w_1 * Bedroom + b
</code>
</blockquote>

## Loss ##

The second thing we need to choose is a *loss function*. The "loss" is simply a measure of how well the model fits the training data. Using the optimizer, Keras will try to choose model parameters that make the loss as small as possible.

The `LinearRegression` model in scikit-learn minimizes **mean squared error (MSE)**. Mathematically, this is the average of `(y_true - y_pred) ** 2` over all of the training examples.
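
As a quick illustration of the "mean" in mean squared error, here is a small NumPy sketch: square each error, then average over the examples. The function name and the sample values are our own, chosen only for the example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
print(mse)
```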

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="600" alt="Graph illustrating MSE.">
<figcaption style="textalign: center; font-style: italic"><center>The MSE function. At $x=0$, <code>y_true == y_pred</code>.</center></figcaption>
</figure>

Keras includes a number of loss functions in its `keras.losses` module. *Mean absolute error*, for instance, would also be a fine choice.
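
To get a feel for how the two losses differ, the sketch below computes both by hand in NumPy on made-up values where a single prediction is badly wrong. Squaring makes MSE react much more strongly to that one large error than MAE does. The helper names are our own and are not part of Keras.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 14.0])  # one prediction is off by 10

mae = mean_absolute_error(y_true, y_pred)  # 10 / 4
mse = mean_squared_error(y_true, y_pred)   # 100 / 4
print(mae, mse)
```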

Training a model means *optimizing* the loss function, that is, making the loss smaller. We define the loss at the same time as the optimization method, so let's take a look at that now.

## The Optimizer ##

**Stochastic gradient descent (SGD)** is the method of optimization universally used in practice with deep learning models. To minimize the loss, SGD will go through a process of *iterative refinement*:
1. take a random sample from the training data (a **minibatch**)
2. measure the loss on that sample
3. adjust the weights and biases in a way that makes the loss smaller

One round of this is called a **step**, while one run through the entire dataset is an **epoch**. It's not uncommon to train deep learning models for hundreds or even thousands of epochs.
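
The three steps above can be sketched in plain NumPy for a one-feature linear model. This is only an illustration of the idea, not how Keras implements it: the learning rate, the number of epochs, and the fake data are assumed values, and minibatches are drawn by reshuffling the data each epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: y = 2x + 1 plus a little noise (values chosen for illustration).
x = rng.normal(0, 1, 256)
y = 2 * x + 1 + rng.normal(0, 0.2, 256)

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.1      # step size (an assumed value)
batch_size = 16
epochs = 20

for epoch in range(epochs):
    order = rng.permutation(len(x))              # reshuffle every epoch
    for start in range(0, len(x), batch_size):   # each iteration is one step
        idx = order[start:start + batch_size]    # 1. sample a minibatch
        error = (w * x[idx] + b) - y[idx]
        loss = np.mean(error ** 2)               # 2. measure the loss (MSE)
        grad_w = 2 * np.mean(error * x[idx])     # 3. gradient of the loss w.r.t. w
        grad_b = 2 * np.mean(error)              #    and w.r.t. b
        w -= learning_rate * grad_w              #    move against the gradient
        b -= learning_rate * grad_b

steps_per_epoch = len(x) // batch_size   # 256 / 16 = 16 steps per epoch
print(w, b, steps_per_epoch)
```

With 256 examples and minibatches of 16, one epoch consists of 16 steps, so 20 epochs amount to 320 parameter updates.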

<figure>
<img src="https://i.imgur.com/TTxs4y2.mp4" width="600" alt="A linear regression model iteratively trained using SGD.">
<figcaption style="textalign: center; font-style: italic"><center>A linear regression model iteratively trained using SGD. The fit improves batch by batch.
</center></figcaption>
</figure>

This iterative approach is quite different from how classical models are usually fit. Ordinary least squares regression, for instance, finds its line of best fit by solving a certain matrix equation using the entire dataset at once.
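
For comparison, here is a minimal sketch of that one-shot approach using NumPy's `np.linalg.lstsq`. The fake data (slope 2, intercept 1) is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 256)
y = 2 * x + 1 + rng.normal(0, 0.2, 256)

# Design matrix: one column for the feature, one column of ones for the bias.
A = np.column_stack([x, np.ones_like(x)])

# Solve the least squares problem min ||A @ [w, b] - y||^2 in a single call,
# using every example at once (no minibatches, no epochs).
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)
```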

Stochastic gradient descent actually comprises a whole family of algorithms. They differ primarily in their strategy of updating the model weights (step 3 above). We'll consider some of them in future lessons, but for now our choice of optimizer will be the one called simply `'SGD'`.

In [None]:
# Add the loss and optimizer with the model's compile method
model.compile(
    optimizer='SGD',
    loss='MSE',
)

Passing a string for the arguments like this will use the defaults for the loss and optimizer, which generally work well. In later lessons we'll configure the optimizer a bit by instead using an object from the [`keras.optimizers` module](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD). (There's nothing to configure for MSE.)

You define the number of training epochs and the size of the minibatches in the `fit` method, which trains the model. Let's fit it on some fake data just to see how it goes.

In [None]:
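import numpy as np

# Fake data: a noisy line with slope 2 and intercept 1.
# The arrays have shape (256, 1): one row per example, one column per feature.
x = np.random.normal(0, 1, (256, 1))
err = np.random.normal(0, 0.2, (256, 1))
y = 2*x + 1 + err

model.fit(x=x, y=y, batch_size=16, epochs=7);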

In [None]:
import matplotlib.pyplot as plt
plt.scatter(x, y, alpha=0.25)
plt.plot(x, model.predict(x), color='r')
plt.show();

# Example - Red Wine Quality #

Now that we know how to create linear regression models in Keras, let's see it in action on an actual dataset. We'll use the *Red Wine Quality* dataset. 

Our goal in this example will be to predict the quality of a wine (on a scale of 3-8) given some of its chemical properties (numeric measurements). As we'll discuss later, neural networks perform best when your data is put on a common scale -- we will rescale each feature into the interval $[0, 1]$. (Check out our [course on Pandas](https://www.kaggle.com/learn/pandas) for a review of working with dataframes!)

In [None]:
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/dl-course-data/red-wine.csv')
display(red_wine.head())

# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)

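# Scale every column to [0, 1] using statistics from the training split only.
# Note that this rescales the target column 'quality' as well as the features.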
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# Split features and target
x_train = df_train.copy()
y_train = x_train.pop('quality')
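
Because the scaling above is applied to the whole dataframe before `quality` is popped off, the target ends up on $[0, 1]$ too, and the model's predictions will be in those scaled units. The sketch below shows one way to map scaled values back to the original quality scale. It uses a tiny made-up dataframe in place of the wine data, and `to_quality_scale` is a hypothetical helper, not something from the lesson.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine data.
df_train = pd.DataFrame({
    'alcohol': [9.0, 10.0, 12.0, 14.0],
    'quality': [3, 5, 6, 8],
})

# Same min-max scaling as in the lesson.
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
scaled = (df_train - min_) / (max_ - min_)

def to_quality_scale(y_scaled):
    """Undo the min-max scaling for values of the 'quality' column."""
    return y_scaled * (max_['quality'] - min_['quality']) + min_['quality']

restored = to_quality_scale(scaled['quality'])
print(restored.tolist())
```

The same function could be applied to the output of `model.predict` to read predictions on the original scale.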

And here is how we define and train the linear regression model:

In [None]:
model = keras.Sequential([
    layers.Dense(1),
])
model.compile(
    optimizer='SGD',
    loss='MSE',
)

history = model.fit(
    x=x_train, y=y_train,
    batch_size=16,
    epochs=200,
)

We've hidden the output here since, with 200 epochs, it's rather long. If you'd like to take a look just click the **Output** button to the upper right of the cell.

Often, a better way to view the loss is to plot it. The `fit` method in fact keeps a record of the loss produced during training in a `History` object. We'll convert the data to a Pandas dataframe, which makes the plotting easy.

In [None]:
# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# use Pandas native plot method
history_df['loss'].plot();

Notice how the loss levels off as the epochs go by. When the loss curve becomes horizontal like that, it means the model has learned all it can and there would be no reason to train for additional epochs.
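
If you'd rather not judge the plateau by eye, a rough check like the one below compares the average loss over the most recent epochs with the average over the epochs just before. This is only a sketch: `has_plateaued`, the window size, and the tolerance are our own choices, and the loss curves are made up for illustration.

```python
def has_plateaued(losses, window=10, tol=1e-3):
    """Return True if the mean loss over the last `window` epochs improved
    by less than `tol` (relative) compared with the window before it."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return (previous - recent) < tol * abs(previous)

# Hypothetical loss curves for illustration.
still_improving = [1.0 / (epoch + 1) for epoch in range(30)]
flat = [0.5] * 30

print(has_plateaued(still_improving))
print(has_plateaued(flat))
```

With a trained Keras model, the list of per-epoch losses is available as `history.history['loss']`.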

# Conclusion #

Deep learning, as it's practiced in Keras and other libraries, is less a single algorithm than a general framework that you can use to solve a great variety of machine learning problems. This framework is very powerful and very flexible. Its flexibility, however, means that there are more choices you have to make, and the quality of your model will ultimately depend on how well you understand those choices.

In this first lesson, we introduced the deep learning framework by implementing a linear regression model. We made appropriate choices for the **architecture**, **loss**, and **optimizer**. These three things are the start of every deep learning model.

In the remainder of this course, we'll investigate these three components more deeply. In Lesson 2, we go beyond linear models by stacking layers into **neural networks**. Lessons 3 and 5 build skill with **stochastic gradient descent**, the universal deep learning optimizer. We learn methods of developing models to control overfitting in Lesson 4. In Lesson 6 we solve a **binary classification** problem by using a new kind of loss function.

By the end of this course, you should have a strong grasp of the fundamentals of deep learning. You'll know how to design deep learning models to solve practical problems and feel confident in the choices you make while developing your projects.

# Your Turn #

During this course, you'll develop a deep learning model to *solve a real-world problem*. Move on to the first exercise to get started!