# Machine Learning Crash Course with TensorFlow APIs

## Part 1 - Machine Learning Concepts

### Introduction to Machine Learning

This module introduces Machine Learning (ML).

#### Additional Information

- Rules of Machine Learning, [Rule #1: Don't be afraid to launch a product without machine learning](https://developers.google.com/machine-learning/rules-of-ml/?authuser=1#rule_1_dont_be_afraid_to_launch_a_product_without_machine_learning).

### Framing

This module investigates how to frame a task as a machine learning problem, and covers many of the basic vocabulary terms shared across a wide range of machine learning (ML) methods.

#### Framing: Key ML Terminology

What is (supervised) machine learning? Concisely put, it is the following:

    ML systems learn how to combine input to produce useful predictions on never-before-seen data.

Let's explore fundamental machine learning terminology.

##### Labels

A label is the thing we're predicting — the `y` variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.

##### Features

A feature is an input variable — the `x` variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features, specified as:

    x_1, x_2, ..., x_n

In the spam detector example, the features could include the following:

- words in the email text
- sender's address
- time of day the email was sent
- email contains the phrase "one weird trick."

##### Examples

An example is a particular instance of data, **x**. (We put **x** in boldface to indicate that it is a vector.) We break examples into two categories:

- labeled examples
- unlabeled examples

A labeled example includes both feature(s) and the label. That is:

    labeled examples: {features, label}: (x, y)

Use labeled examples to train the model. In our spam detector example, the labeled examples would be individual emails that users have explicitly marked as "spam" or "not spam."

For example, the following table shows 5 labeled examples from a [data set](https://developers.google.com/machine-learning/crash-course/california-housing-data-description?authuser=1) containing information about housing prices in California:

| housingMedianAge | totalRooms | totalBedrooms | medianHouseValue |
|-------------------|------------|---------------|-------------------|
|         15        |    5612    |     1283      |       66900       |
|         19        |    7650    |     1901      |       80100       |
|         17        |    720     |      174      |       85700       |
|         14        |    1501    |      337      |       73400       |
|         20        |    1454    |      326      |       65500       |


An unlabeled example contains features but not the label. That is:

    unlabeled examples: {features, ?}: (x, ?)

Here are 3 unlabeled examples from the same housing dataset, which exclude `medianHouseValue`:

| housingMedianAge | totalRooms | totalBedrooms |
|-------------------|------------|---------------|
|         42        |    1686    |      361      |
|         34        |    1226    |      180      |
|         33        |    1077    |      271      |


Once we've trained our model with labeled examples, we use that model to predict the label on unlabeled examples. In the spam detector, unlabeled examples are new emails that humans haven't yet labeled.
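As a rough sketch of the distinction, the rows from the two tables above can be held as plain Python values: a labeled example pairs a feature tuple with its label, while an unlabeled example is the feature tuple alone. The `describe` helper and the variable names are hypothetical and exist only for this illustration.

```python
# Feature names taken from the housing tables above.
FEATURE_NAMES = ("housingMedianAge", "totalRooms", "totalBedrooms")

# Labeled examples: (features, label) pairs, i.e. (x, y).
labeled_examples = [
    ((15, 5612, 1283), 66900),
    ((19, 7650, 1901), 80100),
    ((17, 720, 174), 85700),
    ((14, 1501, 337), 73400),
    ((20, 1454, 326), 65500),
]

# Unlabeled examples: features only, i.e. (x, ?). The label is unknown.
unlabeled_examples = [
    (42, 1686, 361),
    (34, 1226, 180),
    (33, 1077, 271),
]


def describe(features, label=None):
    """Return a readable one-line summary of an example (hypothetical helper)."""
    pairs = ", ".join(
        f"{name}={value}" for name, value in zip(FEATURE_NAMES, features)
    )
    label_text = "?" if label is None else str(label)
    return f"({pairs}) -> medianHouseValue={label_text}"


for features, label in labeled_examples:
    print(describe(features, label))
for features in unlabeled_examples:
    print(describe(features))
```

Training would consume `labeled_examples`; inference would fill in the `?` for each entry of `unlabeled_examples`.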

##### Models

A model defines the relationship between features and label. For example, a spam detection model might associate certain features strongly with "spam". Let's highlight two phases of a model's life:

- Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.

- Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y'). For example, during inference, you can predict `medianHouseValue` for new unlabeled examples.

##### Regression vs. classification

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:

- What is the value of a house in California?

- What is the probability that a user will click on this ad?

A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:

- Is a given email message spam or not spam?

- Is this an image of a dog, a cat, or a hamster?

#### Framing: Check Your Understanding

##### Supervised Learning

Explore the options below.

Suppose you want to develop a supervised machine learning model to predict whether a given email is "spam" or "not spam." Which of the following statements are true?

- [x] The labels applied to some examples might be unreliable.
- [x] Emails not marked as "spam" or "not spam" are unlabeled examples.
- [ ] Words in the subject header will make good labels.
- [ ] We'll use unlabeled examples to train the model.

##### Features and Labels

Explore the options below.

Suppose an online shoe store wants to create a supervised ML model that will provide personalized shoe recommendations to users. That is, the model will recommend certain pairs of shoes to Marty and different pairs of shoes to Janet. The system will use past user behavior data to generate training data. Which of the following statements are true?

- [x] "Shoe size" is a useful feature.
- [x] "The user clicked on the shoe's description" is a useful label.
- [ ] "Shoes that a user adores" is a useful label.
- [ ] "Shoe beauty" is a useful feature.

### Descending into ML

Linear regression is a method for finding the straight line or hyperplane that best fits a set of points. This module explores linear regression intuitively before laying the groundwork for a machine learning approach to linear regression.

#### Descending into ML: Linear Regression

It has long been known that crickets (an insect species) chirp more frequently on hotter days than on cooler days. For decades, professional and amateur scientists have cataloged data on chirps-per-minute and temperature. As a birthday gift, your Aunt Ruth gives you her cricket database and asks you to learn a model to predict this relationship. Using this data, you want to explore this relationship.

First, examine your data by plotting it:

![Raw data of chirps/minute (x-axis) vs. temperature (y-axis).](https://developers.google.com/static/machine-learning/crash-course/images/CricketPoints.svg?authuser=1)

*Figure 1. Chirps per Minute vs. Temperature in Celsius.*

As expected, the plot shows the temperature rising with the number of chirps. Is this relationship between chirps and temperature linear? Yes, you could draw a single straight line like the following to approximate this relationship:

![Best line establishing relationship of chirps/minute (x-axis) vs. temperature (y-axis).](https://developers.google.com/static/machine-learning/crash-course/images/CricketLine.svg?authuser=1)

*Figure 2. A linear relationship.*

True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:

$ y = mx + b $

where:

- y is the temperature in Celsius—the value we're trying to predict.
- m is the slope of the line.
- x is the number of chirps per minute—the value of our input feature.
- b is the y-intercept.

By convention in machine learning, you'll write the equation for a model slightly differently:

$ y' = b + w_1x_1 $

where:

- y' is the predicted label (a desired output).
- b is the bias (the y-intercept), sometimes referred to as w_0.
- w_1 is the weight of feature 1. Weight is the same concept as the "slope" m in the traditional equation of a line.
- x_1 is a feature (a known input).

To infer (predict) the temperature y' for a new chirps-per-minute value x_1, just substitute the x_1 value into this model.

Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w_1, w_2, etc.). For example, a model that relies on three features might look as follows:

$ y' = b + w_1x_1 + w_2x_2 + w_3x_3 $
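The equations above translate almost directly into code. The following is a minimal sketch, not a trained model: the `predict` function name and every numeric value are made up for illustration.

```python
def predict(bias, weights, features):
    """Compute y' = b + w_1*x_1 + ... + w_n*x_n for one example."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical parameters for the single-feature cricket model
# (illustrative values, not fitted to any real data).
b = 4.0     # bias
w1 = 0.25   # weight for chirps per minute

# One feature: predicted temperature for 100 chirps per minute.
print(predict(b, [w1], [100]))

# Three features, each with its own (made-up) weight.
print(predict(1.0, [0.5, -2.0, 3.0], [2.0, 1.0, 4.0]))
```

Inference is nothing more than this substitution; training is the process of choosing `b` and the weights.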

#### Descending into ML: Training and Loss

Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.

Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples. For example, Figure 3 shows a high loss model on the left and a low loss model on the right. Note the following about the figure:

- The arrows represent loss.
- The blue lines represent predictions.

![Two Cartesian plots, each showing a line and some data points. In the first plot, the line is a terrible fit for the data, so the loss is high. In the second plot, the line is a better fit for the data, so the loss is low.](https://developers.google.com/machine-learning/crash-course/images/LossSideBySide.png?authuser=1)

*Figure 3. High loss in the left model; low loss in the right model.*

Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the line in the right plot is a much better predictive model than the line in the left plot.

You might be wondering whether you could create a mathematical function—a loss function—that would aggregate the individual losses in a meaningful fashion.

##### Squared loss: a popular loss function

The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:

    = the square of the difference between the label and the prediction
    = (observation - prediction(x))^2
    = (y - y')^2

Mean squared error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

$ MSE = \frac{1}{N} \sum_{(x,y)\in D} (y - prediction(x))^2 $

where:

- (x, y) is an example in which
    - x is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
    - y is the example's label (for example, temperature).
- prediction(x) is a function of the weights and bias in combination with the set of features x.
- D is a data set containing many labeled examples, which are (x, y) pairs.
- N is the number of examples in D.
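The definitions above can be sketched in a few lines of Python. The dataset `D` and the model parameters below are invented for illustration; only the formulas come from the text.

```python
def squared_loss(label, prediction):
    """Squared (L2) loss for a single example: (y - y')^2."""
    return (label - prediction) ** 2


def mean_squared_error(dataset, predict_fn):
    """Average squared loss over a dataset of (x, y) pairs."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    total = sum(squared_loss(y, predict_fn(x)) for x, y in dataset)
    return total / len(dataset)


# Made-up (chirps per minute, temperature) pairs, for illustration only.
D = [(8, 3.0), (16, 5.0), (24, 8.0), (32, 9.0)]


def model(x):
    """Hypothetical model y' = b + w_1 * x_1 with b = 1.0 and w_1 = 0.25."""
    return 1.0 + 0.25 * x


print(mean_squared_error(D, model))
```

In this toy dataset only the third example is mispredicted (by 1.0), so the sum of squared losses is 1.0 and the mean over four examples is 0.25.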

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.

#### Descending into ML: Check Your Understanding

##### Mean Squared Error

Consider the following two plots:

![A plot of 10 points. A line runs through 6 of the points. 2 points are 1.](https://developers.google.com/static/machine-learning/crash-course/images/MCEDescendingIntoMLLeft.png?authuser=1) ![A plot of 10 points. A line runs through 8 of the points. 1 point is 2.](https://developers.google.com/static/machine-learning/crash-course/images/MCEDescendingIntoMLRight.png?authuser=1)

Explore the options below.

Which of the two data sets shown in the preceding plots has the higher Mean Squared Error (MSE)?

- [x] The dataset on the right.
- [ ] The dataset on the left.

### Reducing Loss

To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.

#### Reducing Loss: An Iterative Approach

The previous module introduced the concept of loss. Here, in this module, you'll learn how a machine learning model iteratively reduces loss.

Iterative learning might remind you of the "Hot and Cold" kids' game for finding a hidden object like a thimble. In this game, the "hidden object" is the best possible model. You'll start with a wild guess ("The value of $w_1$ is 0.") and wait for the system to tell you what the loss is. Then, you'll try another guess ("The value of $w_1$ is 0.5.") and see what the loss is. Aah, you're getting warmer. Actually, if you play this game right, you'll usually be getting warmer. The real trick to the game is trying to find the best possible model as efficiently as possible.

The following figure suggests the iterative trial-and-error process that machine learning algorithms use to train a model:

![The cycle of moving from features and labels to models and predictions.](https://developers.google.com/static/machine-learning/crash-course/images/GradientDescentDiagram.svg?authuser=1)

*Figure 1. An iterative approach to training a model.*

We'll use this same iterative approach throughout the Machine Learning Crash Course, detailing various complications, particularly within that stormy cloud labeled "Model (Prediction Function)." Iterative strategies are prevalent in machine learning, primarily because they scale so well to large data sets.

The "model" takes one or more features as input and returns one prediction as output. To simplify, consider a model that takes one feature ($x_1$) and returns one prediction ($y'$):

$ y' = b + w_1x_1 $

What initial values should we set for $b$ and $w_1$? For linear regression problems, it turns out that the starting values aren't important. We could pick random values, but we'll just take the following trivial values instead:

- $b = 0$
- $w_1 = 0$

Suppose that the first feature value is 10. Plugging that feature value into the prediction function yields:

$ y' = 0 + 0 \cdot 10 = 0 $

The "Compute Loss" part of the diagram is the loss function that the model will use. Suppose we use the squared loss function. The loss function takes in two input values:

- $y'$: The model's prediction for features x.
- $y$: The correct label corresponding to features x.

At last, we've reached the "Compute parameter updates" part of the diagram. It is here that the machine learning system examines the value of the loss function and generates new values for $b$ and $w_1$. For now, just assume that this mysterious box devises new values and then the machine learning system re-evaluates all those features against all those labels, yielding a new value for the loss function, which yields new parameter values. And the learning continues iterating until the algorithm discovers the model parameters with the lowest possible loss. Usually, you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.

#### Reducing Loss: Gradient Descent

The iterative approach diagram (Figure 1) contained a green hand-wavy box entitled "Compute parameter updates." We'll now replace that algorithmic fairy dust with something more substantial.

Suppose we had the time and the computing resources to calculate the loss for all possible values of $w_1$. For the kind of regression problems we've been examining, the resulting plot of loss vs. $w_1$ will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:

![A plot of a U-shaped curve, with the vertical axis labeled as 'loss' and the horizontal axis labeled as value of weight w_i](https://developers.google.com/static/machine-learning/crash-course/images/convex.svg?authuser=1)

*Figure 2. Regression problems yield convex loss vs. weight plots.*

Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.

Calculating the loss function for every conceivable value of $w_1$ over the entire data set would be an inefficient way of finding the convergence point. Let's examine a better mechanism—very popular in machine learning—called gradient descent.
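To make the brute-force idea concrete, here is a small sketch that evaluates the MSE on a coarse grid of candidate $w_1$ values. The data points, the grid, and the choice to hold the bias at 0 are all assumptions made for illustration.

```python
# Made-up (x_1, y) pairs that roughly follow y = 2 * x_1.
DATA = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]


def mse_for_weight(w1, data, bias=0.0):
    """MSE of the model y' = bias + w1 * x_1 on the given data."""
    return sum((y - (bias + w1 * x)) ** 2 for x, y in data) / len(data)


# Evaluate the loss on a coarse grid of candidate weights from 0.0 to 4.0.
candidates = [i * 0.5 for i in range(9)]
losses = [mse_for_weight(w, DATA) for w in candidates]

for w, loss in zip(candidates, losses):
    print(f"w1={w:.1f}  loss={loss:.3f}")

best_w = candidates[losses.index(min(losses))]
print("lowest loss on this grid at w1 =", best_w)
```

The printed losses fall and then rise again, tracing the bowl shape. The cost is one full pass over the data per candidate, and the answer is only as precise as the grid, which is why this approach does not scale.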

The first stage in gradient descent is to pick a starting value (a starting point) for $w_1$. The starting point doesn't matter much; therefore, many algorithms simply set $w_1$ to 0 or pick a random value. The following figure shows that we've picked a starting point slightly greater than 0:

![A plot of a U-shaped curve. A point about halfway up the left side of the curve is labeled 'Starting Point'.](https://developers.google.com/static/machine-learning/crash-course/images/GradientDescentStartingPoint.svg?authuser=1)

*Figure 3. A starting point for gradient descent.*

The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here in Figure 3, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.

Note that a gradient is a vector, so it has both of the following characteristics:

- a direction
- a magnitude

The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.
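As a hedged illustration, assume the single-feature model $ y' = b + w_1x_1 $ and the mean squared error defined in the previous module. Under those assumptions, the two partial derivatives that make up the gradient would be:

$ \frac{\partial MSE}{\partial w_1} = -\frac{2}{N} \sum_{(x,y)\in D} x_1 (y - (b + w_1x_1)) $

$ \frac{\partial MSE}{\partial b} = -\frac{2}{N} \sum_{(x,y)\in D} (y - (b + w_1x_1)) $

For example, if every prediction is too low, each term $ y - (b + w_1x_1) $ is positive, so the derivative with respect to $b$ is negative and a step along the negative gradient raises $b$.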

![A plot of a U-shaped curve. A point on the left side of the curve is labeled 'Starting Point'. An arrow labeled 'negative gradient' points from this point to the right.](https://developers.google.com/static/machine-learning/crash-course/images/GradientDescentNegativeGradient.svg?authuser=1)

*Figure 4. Gradient descent relies on negative gradients.*

To determine the next point along the loss function curve, the gradient descent algorithm moves from the starting point in the direction of the negative gradient, by some fraction of the gradient's magnitude, as shown in the following figure:

![A plot of a U-shaped curve. A point on the left side of the curve is labeled 'Starting Point'. An arrow labeled 'negative gradient' points from this point to the right. Another arrow points from the tip of the first arrow down to a second point on the curve. The second point is labeled 'next point'.](https://developers.google.com/static/machine-learning/crash-course/images/GradientDescentGradientStep.svg?authuser=1)

*Figure 5. A gradient step moves us to the next point on the loss curve.*

Gradient descent then repeats this process, edging ever closer to the minimum.
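The loop described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not a production optimizer: the data is synthetic, the bias is held at 0 so that the loss depends on $w_1$ alone, and `step_fraction`, `tolerance`, and `max_steps` are hypothetical parameter names.

```python
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # synthetic: y = 2 * x


def loss(w1, data):
    """MSE of the model y' = w1 * x_1 (bias fixed at 0)."""
    return sum((y - w1 * x) ** 2 for x, y in data) / len(data)


def gradient(w1, data):
    """Derivative of the MSE with respect to w1."""
    return sum(-2.0 * x * (y - w1 * x) for x, y in data) / len(data)


def gradient_descent(data, w1=0.0, step_fraction=0.05,
                     tolerance=1e-9, max_steps=1000):
    """Step against the gradient until the loss (almost) stops changing."""
    previous = loss(w1, data)
    for step in range(1, max_steps + 1):
        # Move in the direction of the negative gradient.
        w1 -= step_fraction * gradient(w1, data)
        current = loss(w1, data)
        if abs(previous - current) < tolerance:
            return w1, step   # converged
        previous = current
    return w1, max_steps


w1, steps = gradient_descent(DATA)
print(f"w1 = {w1:.4f} after {steps} steps")
```

Starting from $w_1 = 0$, the estimate moves toward 2, the slope used to generate the synthetic data, and the loop stops once successive losses differ by less than `tolerance`.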

#### Reducing Loss: Learning Rate

As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.

Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long:

![Same U-shaped curve. Lots of points are very close to each other and their trail is making extremely slow progress towards the bottom of the U.](https://developers.google.com/static/machine-learning/crash-course/images/LearningRateTooSmall.svg?authuser=1)

*Figure 6. Learning rate is too small.*

Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong:

![Same U-shaped curve. This one contains very few points. The trail of points jumps clean across the bottom of the U and then jumps back over again.](https://developers.google.com/static/machine-learning/crash-course/images/LearningRateTooLarge.svg?authuser=1)

*Figure 7. Learning rate is too large.*

There's a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small, then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.

![Same U-shaped curve. The trail of points gets to the minimum point in about eight steps.](https://developers.google.com/static/machine-learning/crash-course/images/LearningRateJustRight.svg?authuser=1)

*Figure 8. Learning rate is just right.*
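The three regimes in Figures 6 through 8 can be reproduced on a toy problem. The sketch below assumes a made-up one-dimensional loss, $(w - 3)^2$, whose minimum is at $w = 3$; the function name `descend` and the three learning rates are chosen only for illustration.

```python
def descend(learning_rate, steps=20, start=0.0):
    """Run gradient descent on the toy loss (w - 3)^2 and return the path."""
    w = start
    path = [w]
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)          # derivative of (w - 3)^2
        w = w - learning_rate * grad    # step = learning rate * gradient
        path.append(w)
    return path


for rate in (0.01, 0.3, 1.0):
    final = descend(rate)[-1]
    print(f"learning rate {rate}: w after 20 steps = {final:.4f}")
```

With a rate of 0.01 the weight is still far from 3 after 20 steps; with 0.3 it is essentially at the minimum; with 1.0 each step overshoots by exactly as much as it corrects, so the weight bounces between 0 and 6 without ever settling.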

#### Reducing Loss: Optimizing Learning Rate

##### Exercise 1

Set a learning rate of 0.03 on the slider. Keep hitting the STEP button until the gradient descent algorithm reaches the minimum point of the loss curve. How many steps did it take?

Answer: Gradient descent reaches the minimum of the curve in 40 steps.

##### Exercise 2

Can you reach the minimum more quickly with a higher learning rate? Set a learning rate of 0.1, and keep hitting STEP until gradient descent reaches the minimum. How many steps did it take this time?

Answer: Gradient descent reaches the minimum of the curve in 11 steps.

##### Exercise 3

How about an even larger learning rate? Reset the graph, set a learning rate of 1, and try to reach the minimum of the loss curve. What happened this time?

Answer: Gradient descent never reaches the minimum. As a result, steps progressively increase in size. Each step jumps back and forth across the bowl, climbing the curve instead of descending to the bottom.

##### Optional Challenge

Can you find the Goldilocks learning rate for this curve, where gradient descent reaches the minimum point in the fewest number of steps? What is the fewest number of steps required to reach the minimum?

Answer: The Goldilocks learning rate for this data is somewhere between 0.2 and 0.3, which would reach the minimum in three or four steps.

#### Reducing Loss: Stochastic Gradient Descent

In gradient descent, a **batch** is the set of examples you use to calculate the gradient in a single training iteration. So far, we've assumed that the batch has been the entire data set. When working at Google scale, data sets often contain billions or even hundreds of billions of examples. Furthermore, Google data sets often contain huge numbers of features. Consequently, a batch can be enormous. A very large batch may cause even a single iteration to take a very long time to compute.

A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.

What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we could estimate (albeit, noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.

Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
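The only thing that changes between full-batch gradient descent, mini-batch SGD, and SGD is how many examples feed each gradient estimate. The sketch below makes that explicit with a single `batch_size` parameter. Everything in it is an assumption for illustration: the synthetic noise-free data, the learning rate, the iteration count, and the function name `minibatch_sgd`.

```python
import random


def minibatch_sgd(data, batch_size, learning_rate=0.01,
                  iterations=2000, seed=0):
    """Fit y' = b + w1 * x using gradients estimated from random batches."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    b, w1 = 0.0, 0.0
    for _ in range(iterations):
        # Draw the batch at random; batch_size=1 is plain SGD.
        batch = rng.sample(data, batch_size)
        grad_b = sum(-2.0 * (y - (b + w1 * x)) for x, y in batch) / batch_size
        grad_w = sum(-2.0 * x * (y - (b + w1 * x)) for x, y in batch) / batch_size
        b -= learning_rate * grad_b
        w1 -= learning_rate * grad_w
    return b, w1


# Synthetic data from y = 1 + 2x (noise-free, for illustration only).
DATA = [(k / 10.0, 1.0 + 2.0 * k / 10.0) for k in range(50)]

for size in (1, 10, len(DATA)):   # SGD, mini-batch SGD, full batch
    b, w1 = minibatch_sgd(DATA, size)
    print(f"batch size {size:2d}: b = {b:.3f}, w1 = {w1:.3f}")
```

Because this toy data contains no noise, all three settings end up near the same parameters; what differs is the work per iteration, which grows in proportion to `batch_size`. On real, noisy data the single-example estimates would fluctuate far more from step to step.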

To simplify the explanation, we focused on gradient descent for a single feature. Rest assured that gradient descent also works on feature sets that contain multiple features.

#### Learning Rate and Convergence

This is the first of several Playground exercises. Playground is a program developed especially for this course to teach machine learning principles. Each Playground exercise in this course includes an embedded playground instance with presets.

Each Playground exercise generates a dataset. The label for this dataset has two possible values. You could think of those two possible values as spam vs. not spam or perhaps healthy trees vs. sick trees. The goal of most exercises is to tweak various hyperparameters to build a model that successfully classifies (separates or distinguishes) one label value from the other. Note that most data sets contain a certain amount of noise that will make it impossible to successfully classify every example.

The interface for this exercise provides three buttons:

- Reset button: resets Iterations to 0. Resets any weights that the model had already learned.
- Step button: advances one iteration. With each iteration, the model changes, sometimes subtly and sometimes dramatically.
- Regenerate button: generates a new data set. Does not reset Iterations.

In this first Playground exercise, you'll experiment with learning rate by performing two tasks.

**Task 1**: Notice the Learning rate menu at the top-right of Playground. The given Learning rate—3—is very high. Observe how that high Learning rate affects your model by clicking the "Step" button 10 or 20 times. After each early iteration, notice how the model visualization changes dramatically. You might even see some instability after the model appears to have converged. Also notice the lines running from x1 and x2 to the model visualization. The weights of these lines indicate the weights of those features in the model. That is, a thick line indicates a high weight.

**Task 2**: Do the following:

1. Press the Reset button.
1. Lower the Learning rate.
1. Press the Step button a bunch of times.

How did the lower learning rate impact convergence? Examine both the number of steps needed for the model to converge, and also how smoothly and steadily the model converges. Experiment with even lower values of learning rate. Can you find a learning rate too slow to be useful? (You'll find a discussion just below the exercise.)

Answer: Due to the non-deterministic nature of Playground exercises, we can't always provide answers that will correspond exactly with your data set. That said, a learning rate of 0.1 converged efficiently for us. Smaller learning rates took much longer to converge; that is, smaller learning rates were too slow to be useful.

#### Reducing Loss: Check Your Understanding

##### Check Your Understanding: Batch Size

Explore the options below.

When performing gradient descent on a large data set, which of the following batch sizes will likely be more efficient?

- [x] A small batch or even a batch of one example (SGD).
- [ ] The full batch.

### First Steps with TensorFlow

TensorFlow is an end-to-end open source platform for machine learning. TensorFlow is a rich system for managing all aspects of a machine learning system; however, this class focuses on using a particular TensorFlow API to develop and train machine learning models. See the [TensorFlow documentation](https://tensorflow.org/?authuser=1) for complete details on the broader TensorFlow system.

TensorFlow APIs are arranged hierarchically, with the high-level APIs built on the low-level APIs. Machine learning researchers use the low-level APIs to create and explore new machine learning algorithms. In this class, you will use a high-level API named tf.keras to define and train machine learning models and to make predictions. tf.keras is the TensorFlow variant of the open-source Keras API.

The following figure shows the hierarchy of TensorFlow toolkits:

![Simplified hierarchy of TensorFlow toolkits. tf.keras API is at the top.](https://developers.google.com/static/machine-learning/crash-course/images/TFHierarchyNew.png?authuser=1)

*Figure 1. TensorFlow toolkit hierarchy.*

#### First Steps with TensorFlow: Programming Exercises

As you progress through Machine Learning Crash Course, you'll put machine learning concepts into practice by coding models in tf.keras. You'll use Colab as a programming environment. Colab is Google's version of Jupyter Notebook. Like Jupyter Notebook, Colab provides an interactive Python programming environment that combines text, code, graphics, and program output.

##### NumPy and pandas

Using tf.keras requires at least a little understanding of the following two open-source Python libraries:

- NumPy, which simplifies representing arrays and performing linear algebra operations.
- pandas, which provides an easy way to represent datasets in memory.

If you are unfamiliar with NumPy or pandas, please begin by doing the following two Colab exercises:

1. [NumPy UltraQuick Tutorial](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/numpy_ultraquick_tutorial.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=numpy_tf2-colab&hl=en&authuser=1) Colab exercise, which provides all the NumPy information you need for this course.
1. [pandas UltraQuick Tutorial](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=pandas_tf2-colab&hl=en&authuser=1) Colab exercise, which provides all the pandas information you need for this course.

##### Linear regression with tf.keras

After gaining competency in NumPy and pandas, do the following two Colab exercises to explore linear regression and hyperparameter tuning in tf.keras:

1. [Linear Regression with Synthetic Data](/Machine%20Learning%20Crash%20Course%20with%20TensorFlow%20APIs/Exercises/Linear_Regression_with_Synthetic_Data.ipynb) Colab exercise, which explores linear regression with a toy dataset.
1. [Linear Regression with a Real Dataset](/Machine%20Learning%20Crash%20Course%20with%20TensorFlow%20APIs/Exercises/Linear_Regression_with_a_Real_Dataset.ipynb) Colab exercise, which guides you through the kinds of analysis you should do on a real dataset.

### Generalization

Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.

#### Generalization: Peril of Overfitting

This module focuses on generalization. In order to develop some intuition about this concept, you're going to look at three figures. Assume that each dot in these figures represents a tree's position in a forest. The two colors have the following meanings:

- The blue dots represent sick trees.
- The orange dots represent healthy trees.

With that in mind, take a look at Figure 1.

![This figure contains about 50 dots, half of which are blue and the other half orange. The orange dots are mainly in the southwest quadrant, though a few orange dots sneak briefly into the other three quadrants. The blue dots are mainly in the northeast quadrant, but a few of the blue dots spill into other quadrants.](https://developers.google.com/machine-learning/crash-course/images/GeneralizationA.png?authuser=1)

*Figure 1. Sick (blue) and healthy (orange) trees.*

Can you imagine a good model for predicting subsequent sick or healthy trees? Take a moment to mentally draw an arc that divides the blues from the oranges, or mentally lasso a batch of oranges or blues. Then, look at Figure 2, which shows how a certain machine learning model separated the sick trees from the healthy trees. Note that this model produced a very low loss.

![This Figure contains the same arrangement of blue and orange dots as Figure 1. However, this figure accurately encloses nearly all of the blue dots and orange dots with a collection of complex shapes.](https://developers.google.com/machine-learning/crash-course/images/GeneralizationB.png?authuser=1)

*Figure 2. A complex model for distinguishing sick from healthy trees.*

At first glance, the model shown in Figure 2 appeared to do an excellent job of separating the healthy trees from the sick ones. Or did it?

##### Low loss, but still a bad model?

Figure 3 shows what happened when we added new data to the model. It turned out that the model adapted very poorly to the new data. Notice that the model miscategorized much of the new data.

![Same illustration as Figure 2, except with about 100 more dots added. Many of the new dots fall well outside of the predicted model.](https://developers.google.com/machine-learning/crash-course/images/GeneralizationC.png?authuser=1)

*Figure 3. The model did a bad job predicting new data.*

The model shown in Figures 2 and 3 **overfits** the peculiarities of the data it trained on. An overfit model gets a low loss during training but does a poor job predicting new data. If a model fits the current sample well, how can we trust that it will make good predictions on new data? As you'll see later on, overfitting is caused by making a model more complex than necessary. The fundamental tension of machine learning is between fitting our data well and fitting the data as simply as possible.

Machine learning's goal is to predict well on new data drawn from a (hidden) true probability distribution. Unfortunately, the model can't see the whole truth; the model can only sample from a training data set. If a model fits the current examples well, how can you trust the model will also make good predictions on never-before-seen examples?

William of Ockham, a 14th century friar and philosopher, loved simplicity. He believed that scientists should prefer simpler formulas or theories over more complex ones. To put Ockham's razor in machine learning terms:

    The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.

In modern times, we've formalized Ockham's razor into the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds, which are statistical descriptions of a model's ability to generalize to new data based on factors such as:

- the complexity of the model
- the model's performance on training data

While the theoretical analysis provides formal guarantees under idealized assumptions, those guarantees can be difficult to apply in practice. Machine Learning Crash Course focuses instead on empirical evaluation to judge a model's ability to generalize to new data.

A machine learning model aims to make good predictions on new, previously unseen data. But if you are building a model from your data set, how would you get the previously unseen data? Well, one way is to divide your data set into two subsets:

- training set: a subset to train a model.
- test set: a subset to test the model.

Good performance on the test set is a useful indicator of good performance on the new data in general, assuming that:

- The test set is large enough.
- You don't cheat by using the same test set over and over.

##### The ML fine print

The following three basic assumptions guide generalization:

- We draw examples independently and identically (i.i.d.) at random from the distribution. In other words, examples don't influence each other. (An alternate explanation: i.i.d. is a way of referring to the randomness of variables.)
- The distribution is stationary; that is, the distribution doesn't change within the data set.
- We draw examples from partitions from the same distribution.

In practice, we sometimes violate these assumptions. For example:

- Consider a model that chooses ads to display. The i.i.d. assumption would be violated if the model bases its choice of ads, in part, on what ads the user has previously seen.
- Consider a data set that contains retail sales information for a year. Users' purchases change seasonally, which would violate stationarity.

When we know that any of the preceding three basic assumptions are violated, we must pay careful attention to metrics.

### Training and Test Sets

A test set is a data set used to evaluate the model developed from a training set.

#### Training and Test Sets: Splitting Data

The previous module introduced the idea of dividing your data set into two subsets:

- training set: a subset to train a model.
- test set: a subset to test the trained model.

You could imagine slicing the single data set as follows:

![A horizontal bar divided into two pieces: 80% of which is the training set and 20% the test set.](https://developers.google.com/static/machine-learning/crash-course/images/PartitionTwoSets.svg?authuser=1)

*Figure 1. Slicing a single data set into a training set and test set.*
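The slicing in Figure 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's own code: the helper name `train_test_split`, the seed, and the list of integers standing in for labeled examples are all assumptions made for the example.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and split it into (train, test)."""
    shuffled = list(examples)
    # Shuffle first so that both subsets are drawn from the whole data set,
    # rather than from whatever order the data happened to be stored in.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(100))  # hypothetical stand-in for 100 labeled examples
train_set, test_set = train_test_split(examples)
print(len(train_set), len(test_set))  # 80 20
```

Shuffling before slicing is one simple way to keep the test set representative; data with a time ordering may call for a different strategy.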

Make sure that your test set meets the following two conditions:

- Is large enough to yield statistically meaningful results.
- Is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.

Assuming that your test set meets the preceding two conditions, your goal is to create a model that generalizes well to new data. Our test set serves as a proxy for new data. For example, consider the following figure. Notice that the model learned for the training data is very simple. This model doesn't do a perfect job—a few predictions are wrong. However, this model does about as well on the test data as it does on the training data. In other words, this simple model does not overfit the training data.

![Two models: one run on training data and the other on test data. The model is very simple, just a line dividing the orange dots from the blue dots. The loss on the training data is similar to the loss on the test data.](https://developers.google.com/static/machine-learning/crash-course/images/TrainingDataVsTestData.svg?authuser=1)

*Figure 2. Validating the trained model against test data.*

Never train on test data. If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. For example, high accuracy might indicate that test data has leaked into the training set.

For example, consider a model that predicts whether an email is spam, using the subject line, email body, and sender's email address as features. We apportion the data into training and test sets, with an 80-20 split. After training, the model achieves 99% precision on both the training set and the test set. We'd expect a lower precision on the test set, so we take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set (we neglected to scrub duplicate entries for the same spam email from our input database before splitting the data). We've inadvertently trained on some of our test data, and as a result, we're no longer accurately measuring how well our model generalizes to new data.
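The duplicate-email scenario above suggests two simple safeguards: scrub duplicates before splitting, and check afterwards whether any test example also appears in the training set. The following sketch assumes examples are hashable tuples; the helper names and the sample emails are hypothetical.

```python
def deduplicate(examples):
    """Drop exact duplicate examples, keeping the first occurrence."""
    seen = set()
    unique = []
    for example in examples:
        if example not in seen:
            seen.add(example)
            unique.append(example)
    return unique

def leaked_examples(train_set, test_set):
    """Return the test examples that also appear in the training set."""
    train_keys = set(train_set)
    return [example for example in test_set if example in train_keys]

# Hypothetical (subject, sender, label) tuples; the spam email was logged twice.
emails = [
    ("Win a prize", "promo@example.com", "spam"),
    ("Lunch tomorrow?", "friend@example.com", "not spam"),
    ("Win a prize", "promo@example.com", "spam"),
    ("Quarterly report", "boss@example.com", "not spam"),
]

# Splitting the raw data puts one copy of the duplicate on each side.
print(leaked_examples(emails[:2], emails[2:]))

# Deduplicating first removes the leak.
clean = deduplicate(emails)
print(leaked_examples(clean[:2], clean[2:]))  # []
```

Exact matching only catches verbatim duplicates. Near-duplicates (for example, the same spam with a different tracking ID) would need a fuzzier comparison.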

#### Training and Test Sets: Playground Exercise

##### Training Sets and Test Sets

We return to Playground to experiment with training sets and test sets.

This exercise provides both a test set and a training set, both drawn from the same data set. By default, the visualization shows only the training set. If you'd like to also see the test set, click the **Show test data** checkbox just below the visualization. In the visualization, note the following distinction:

- The training examples have a white outline.
- The test examples have a black outline.

**Task 1**: Run Playground with the given settings by doing the following:

1. Click the Run/Pause button.
1. Watch the Test loss and Training loss values change.
1. When the Test loss and Training loss values stop changing or only change once in a while, press the Run/Pause button again to pause Playground.

Note the delta between the Test loss and Training loss. We'll try to reduce this delta in the following tasks.

**Task 2**: Do the following:

1. Press the Reset button.
1. Modify the Learning rate.
1. Press the Run/Pause button.
1. Let Playground run for at least 150 epochs.

Is the delta between Test loss and Training loss lower or higher with this new Learning rate? What happens if you modify both Learning rate and batch size?

**Optional Task 3**: A slider labeled Training data percentage lets you control the proportion of training data to test data. For example, when set to 90%, then 90% of the data is used for the training set and the remaining 10% is used for the test set.

Do the following:

1. Reduce the "Training data percentage" from 50% to 10%.
1. Experiment with Learning rate and Batch size, taking notes on your findings.

Does altering the training data percentage change the optimal learning settings that you discovered in Task 2? If so, why? 

### Validation Set

Partitioning a data set into a training set and test set lets you judge whether a given model will generalize well to new data. However, using only two partitions may be insufficient when doing many rounds of hyperparameter tuning.

#### Validation Set: Check Your Intuition

Before beginning this module, consider whether there are any pitfalls in using the training process outlined in Training and Test Sets.

Explore the options below.

We looked at a process of using a test set and a training set to drive iterations of model development. On each iteration, we'd train on the training data and evaluate on the test data, using the evaluation results on test data to guide choices of and changes to various model hyperparameters like learning rate and features. Is there anything wrong with this approach? (Pick only one answer.)

- [ ] Totally fine, we're training on training data and evaluating on separate, held-out test data.
- [ ] This is computationally inefficient. We should just pick a default set of hyperparameters and live with them to save resources.
- [x] Doing many rounds of this procedure might cause us to implicitly fit to the peculiarities of our specific test set. 

#### Validation Set: Another Partition

The previous module introduced partitioning a data set into a training set and a test set. This partitioning enabled you to train on one set of examples and then to test the model against a different set of examples. With two partitions, the workflow could look as follows:

![A workflow diagram consisting of three stages. 1. Train model on training set. 2. Evaluate model on test set. 3. Tweak model according to results on test set. Iterate on 1, 2, and 3, ultimately picking the model that does best on the test set.](https://developers.google.com/static/machine-learning/crash-course/images/WorkflowWithTestSet.svg?authuser=1)

*Figure 1. A possible workflow?*

In the figure, "Tweak model" means adjusting anything about the model you can dream up—from changing the learning rate, to adding or removing features, to designing a completely new model from scratch. At the end of this workflow, you pick the model that does best on the test set.

Dividing the data set into two sets is a good idea, but not a panacea. You can greatly reduce your chances of overfitting by partitioning the data set into the three subsets shown in the following figure:

![A horizontal bar divided into three pieces: 70% of which is the training set, 15% the validation set, and 15% the test set.](https://developers.google.com/static/machine-learning/crash-course/images/PartitionThreeSets.svg?authuser=1)

*Figure 2. Slicing a single data set into three subsets.*

Use the **validation set** to evaluate results from the training set. Then, use the test set to double-check your evaluation after the model has "passed" the validation set. The following figure shows this new workflow:

![Similar workflow to Figure 1, except that instead of evaluating the model against the test set, the workflow evaluates the model against the validation set. Then, once the training set and validation set more-or-less agree, confirm the model against the test set.](https://developers.google.com/static/machine-learning/crash-course/images/WorkflowWithValidationSet.svg?authuser=1)

*Figure 3. A better workflow.*

In this improved workflow:

1. Pick the model that does best on the validation set.
1. Double-check that model against the test set.

This is a better workflow because it creates fewer exposures to the test set.
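As a rough sketch of this workflow, the snippet below splits a toy data set 70/15/15, picks the best candidate using only the validation set, and touches the test set once at the end. Everything here is an assumption for illustration: the helper names, the toy data (`y = 3x`), and the list of candidate slopes, which stand in for models already trained with different hyperparameter settings.

```python
import random

def three_way_split(examples, validation_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of `examples` and return (train, validation, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_fraction))
    n_test = int(round(len(shuffled) * test_fraction))
    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, validation, test

def mean_squared_error(slope, examples):
    """Loss of the one-parameter model y' = slope * x on (x, y) pairs."""
    return sum((slope * x - y) ** 2 for x, y in examples) / len(examples)

examples = [(x, 3.0 * x) for x in range(100)]  # toy data: y = 3x
train, validation, test = three_way_split(examples)

# Hypothetical candidates, standing in for models trained on `train`.
candidate_slopes = [1.0, 2.0, 3.0, 4.0]

# Step 1: pick the model that does best on the validation set.
best_slope = min(candidate_slopes, key=lambda s: mean_squared_error(s, validation))

# Step 2: double-check that one model against the test set, exactly once.
test_loss = mean_squared_error(best_slope, test)
print(best_slope, test_loss)  # 3.0 0.0
```

The point of the design is that the test set never influences which candidate is chosen, so its loss remains an honest estimate.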

#### Validation Sets and Test Sets: Programming Exercise

The following programming exercise explores training, validating, and testing a model:

- [Validation Sets and Test Sets](/Machine%20Learning%20Crash%20Course%20with%20TensorFlow%20APIs/Exercises/Validation_and_Test_Sets.ipynb) Colab exercise.

### Representation

A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a **representation** of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.

#### Representation: Feature Engineering

In traditional programming, the focus is on code. In machine learning projects, the focus shifts to representation. That is, one way developers hone a model is by adding and improving its features.

##### Mapping Raw Data to Features

The left side of Figure 1 illustrates raw data from an input data source; the right side illustrates a **feature vector**, which is the set of floating-point values comprising the examples in your data set. **Feature engineering** means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering.

Many machine learning models must represent the features as real-numbered vectors since the feature values must be multiplied by the model weights.

![Raw data is mapped to a feature vector through a process called feature engineering.](https://developers.google.com/static/machine-learning/crash-course/images/RawDataToFeatureVector.svg?authuser=1)

*Figure 1. Feature engineering maps raw data to ML features.*

##### Mapping numeric values

Integer and floating-point data don't need a special encoding because they can be multiplied by a numeric weight. As suggested in Figure 2, converting the raw integer value 6 to the feature value 6.0 is trivial:

![An example of a feature that can be copied directly from the raw data.](https://developers.google.com/static/machine-learning/crash-course/images/FloatingPointFeatures.svg?authuser=1)

*Figure 2. Mapping integer values to floating-point values.*

##### Mapping categorical values

Categorical features have a discrete set of possible values. For example, there might be a feature called street_name with options that include:

    {'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}

Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.

We can accomplish this by defining a mapping from the feature values, which we'll refer to as the vocabulary of possible values, to integers. Since not every street in the world will appear in our dataset, we can group all other streets into a catch-all "other" category, known as an **OOV (out-of-vocabulary) bucket**.

Using this approach, here's how we can map our street names to numbers:

- map Charleston Road to 0
- map North Shoreline Boulevard to 1
- map Shorebird Way to 2
- map Rengstorff Avenue to 3
- map everything else (OOV) to 4

However, if we incorporate these index numbers directly into our model, it will impose some constraints that might be problematic:

- We'll be learning a single weight that applies to all streets. For example, if we learn a weight of 6 for street_name, then we will multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, 2 for Shorebird Way and so on. Consider a model that predicts house prices using street_name as a feature. It is unlikely that there is a linear adjustment of price based on the street name, and furthermore this would assume you have ordered the streets based on their average house price. Our model needs the flexibility of learning different weights for each street that will be added to the price estimated using the other features.
- We aren't accounting for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and there's no way to encode that information in the street_name value if it contains a single index.

To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows:

- For values that apply to the example, set corresponding vector elements to 1.
- Set all other elements to 0.

The length of this vector is equal to the number of elements in the vocabulary. This representation is called a **one-hot encoding** when a single value is 1, and a **multi-hot encoding** when multiple values are 1.

Figure 3 illustrates a one-hot encoding of a particular street: Shorebird Way. The element in the binary vector for Shorebird Way has a value of 1, while the elements for all other streets have values of 0.

![Mapping a string value](https://developers.google.com/static/machine-learning/crash-course/images/OneHotEncoding.svg?authuser=1)

*Figure 3. Mapping street address via one-hot encoding.*

This approach effectively creates a Boolean variable for every feature value (e.g., street name). Here, if a house is on Shorebird Way then the binary value is 1 only for Shorebird Way. Thus, the model uses only the weight for Shorebird Way.

Similarly, if a house is at the corner of two streets, then two binary values are set to 1, and the model uses both their respective weights.

One-hot encoding extends to numeric data that you do not want to directly multiply by a weight, such as a postal code.
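The street-name mapping above can be sketched as follows. This is an illustrative example rather than a library API: the function name `encode_streets` is hypothetical, and the vector is given one extra slot (index 4) for the OOV bucket, matching the mapping listed earlier.

```python
VOCABULARY = [
    "Charleston Road",
    "North Shoreline Boulevard",
    "Shorebird Way",
    "Rengstorff Avenue",
]
OOV_INDEX = len(VOCABULARY)  # everything else maps to index 4

STREET_TO_INDEX = {street: index for index, street in enumerate(VOCABULARY)}

def encode_streets(street_names):
    """Return a binary vector with a 1 for each street that applies."""
    vector = [0] * (len(VOCABULARY) + 1)  # +1 slot for the OOV bucket
    for name in street_names:
        vector[STREET_TO_INDEX.get(name, OOV_INDEX)] = 1
    return vector

# One-hot: a house on a single street.
print(encode_streets(["Shorebird Way"]))                         # [0, 0, 1, 0, 0]
# Multi-hot: a house at the corner of two streets.
print(encode_streets(["Charleston Road", "Rengstorff Avenue"]))  # [1, 0, 0, 1, 0]
# Out-of-vocabulary street.
print(encode_streets(["Main Street"]))                           # [0, 0, 0, 0, 1]
```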

##### Sparse Representation

Suppose that you had 1,000,000 different street names in your data set that you wanted to include as values for street_name. Explicitly creating a binary vector of 1,000,000 elements where only 1 or 2 elements are true is a very inefficient representation in terms of both storage and computation time when processing these vectors. In this situation, a common approach is to use a [sparse representation](https://developers.google.com/machine-learning/glossary#sparse_representation) in which only nonzero values are stored. In sparse representations, an independent model weight is still learned for each feature value, as described above.
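To make the idea concrete, here is a minimal sketch that stores only the indices of the nonzero elements and computes a weighted sum without visiting the zeros. The helper names and the dictionary of weights are assumptions for this example; real frameworks provide their own sparse tensor types.

```python
def to_sparse(dense_vector):
    """Return the indices of the nonzero elements of a binary vector."""
    return [index for index, value in enumerate(dense_vector) if value != 0]

def sparse_weighted_sum(active_indices, weights):
    """Sum the weights of the active feature values, skipping all zeros."""
    return sum(weights.get(index, 0.0) for index in active_indices)

# A 1,000,000-element multi-hot vector with only two active street names.
dense = [0] * 1_000_000
dense[2] = 1
dense[999_999] = 1

sparse = to_sparse(dense)
print(sparse)  # [2, 999999]

# Hypothetical learned weights, one per feature value that has been seen.
weights = {2: 0.5, 999_999: -0.25}
print(sparse_weighted_sum(sparse, weights))  # 0.25
```

The sparse form needs two integers instead of a million elements, yet each street still has its own independent weight.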

#### Representation: Qualities of Good Features

We've explored ways to map raw data into suitable feature vectors, but that's only part of the work. We must now explore what kinds of values actually make good features within those feature vectors.

##### Avoid rarely used discrete feature values

Good feature values should appear more than 5 or so times in a data set. Doing so enables a model to learn how this feature value relates to the label. That is, having many examples with the same discrete value gives the model a chance to see the feature in different settings, and in turn, determine when it's a good predictor for the label. For example, a house_type feature would likely contain many examples in which its value was victorian:

- house_type: victorian :white_check_mark:

Conversely, if a feature's value appears only once or very rarely, the model can't make predictions based on that feature. For example, unique_house_id is a bad feature because each value would be used only once, so the model couldn't learn anything from it:

- unique_house_id: 8SK982ZZ1242Z :x:

##### Prefer clear and obvious meanings

Each feature should have a clear and obvious meaning to anyone on the project. For example, the following good feature is clearly named and the value makes sense with respect to the name:

- house_age_years: 27 :white_check_mark:

Conversely, the meaning of the following feature value is pretty much indecipherable to anyone but the engineer who created it:

- house_age: 851472000 :x:

In some cases, noisy data (rather than bad engineering choices) causes unclear values. For example, the following user_age_years came from a source that didn't check for appropriate values:

- user_age_years: 277 :x:

##### Don't mix "magic" values with actual data

Good floating-point features don't contain peculiar out-of-range discontinuities or "magic" values. For example, suppose a feature holds a floating-point value between 0 and 1. So, values like the following are fine:

- quality_rating: 0.82 :white_check_mark:
- quality_rating: 0.37 :white_check_mark:

However, if a user didn't enter a quality_rating, perhaps the data set represented its absence with a magic value like the following:

- quality_rating: -1 :x:

To explicitly mark magic values, create a Boolean feature that indicates whether or not a quality_rating was supplied. Give this Boolean feature a name like is_quality_rating_defined.

In the original feature, replace the magic values as follows:

- For variables that take a finite set of values (discrete variables), add a new value to the set and use it to signify that the feature value is missing.
- For continuous variables, ensure missing values do not affect the model by using the mean value of the feature's data.
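For a continuous feature such as quality_rating, the two steps above might look like the following sketch. The helper name, the sample ratings, and the choice of -1 as the magic value are assumptions taken from the example in this section.

```python
MAGIC_MISSING = -1  # the magic value the data set used for "no rating"

def replace_magic_values(ratings, magic=MAGIC_MISSING):
    """Return (cleaned_ratings, is_quality_rating_defined).

    Missing ratings are replaced with the mean of the ratings that were
    actually supplied, and a parallel Boolean feature records which
    values were real.
    """
    supplied = [r for r in ratings if r != magic]
    mean = sum(supplied) / len(supplied)
    cleaned = [r if r != magic else mean for r in ratings]
    is_defined = [r != magic for r in ratings]
    return cleaned, is_defined

ratings = [0.82, 0.37, -1, 0.61]
cleaned, is_quality_rating_defined = replace_magic_values(ratings)
print(cleaned)                    # the -1 is replaced by the mean, about 0.6
print(is_quality_rating_defined)  # [True, True, False, True]
```

In a real pipeline, the mean would typically be computed on the training set only and then reused for validation and test data.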

##### Account for upstream instability

The definition of a feature shouldn't change over time. For example, the following value is useful because the city name probably won't change. (Note that we'll still need to convert a string like "br/sao_paulo" to a one-hot vector.)

- city_id: "br/sao_paulo" :white_check_mark:

But gathering a value inferred by another model carries additional costs. Perhaps the value "219" currently represents Sao Paulo, but that representation could easily change on a future run of the other model:

- inferred_city_cluster: "219" :x:

#### Representation: Cleaning Data

Apple trees produce some mixture of great fruit and wormy messes. Yet the apples in high-end grocery stores display 100% perfect fruit. Between orchard and grocery, someone spends significant time removing the bad apples or throwing a little wax on the salvageable ones. As an ML engineer, you'll spend enormous amounts of your time tossing out bad examples and cleaning up the salvageable ones. Even a few "bad apples" can spoil a large data set.

##### Scaling feature values

**Scaling** means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:

- Helps gradient descent converge more quickly.
- Helps avoid the "NaN trap", in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training), and — due to math operations — every other number in the model also eventually becomes a NaN.
- Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.

You don't have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000.
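One common way to map a natural range onto a standard range is linear (min-max) scaling. The sketch below is an illustration under that assumption; the function name and the sample values are hypothetical.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale `values` from [min, max] to [low, high]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("cannot scale a feature whose values are all equal")
    return [low + (v - v_min) * (high - low) / span for v in values]

raw = [100.0, 300.0, 500.0, 900.0]  # natural range: 100 to 900

print(min_max_scale(raw))             # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(raw, -1.0, 1.0))  # [-1.0, -0.5, 0.0, 1.0]
```

As with other statistics, the minimum and maximum would normally come from the training set, so that validation and test data are scaled with the same constants.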

##### Handling extreme outliers

The following plot represents a feature called roomsPerPerson from the [California Housing data set](https://developers.google.com/machine-learning/crash-course/california-housing-data-description?authuser=1). The value of roomsPerPerson was calculated by dividing the total number of rooms for an area by the population for that area. The plot shows that the vast majority of areas in California have one or two rooms per person. But take a look along the x-axis.

![A plot of roomsPerPerson in which nearly all the values are clustered between 0 and 4, but there's a verrrrry long tail reaching all the way out to 55 rooms per person.](https://developers.google.com/static/machine-learning/crash-course/images/ScalingNoticingOutliers.svg?authuser=1)

*Figure 4. A verrrrry lonnnnnnng tail.*

How could we minimize the influence of those extreme outliers? Well, one way would be to take the log of every value:

![A plot of log(roomsPerPerson) in which 99% of values cluster between about 0.4 and 1.8, but there's still a longish tail that goes out to 4.2 or so.](https://developers.google.com/static/machine-learning/crash-course/images/ScalingLogNormalization.svg?authuser=1)

*Figure 5. Logarithmic scaling still leaves a tail.*

Log scaling does a slightly better job, but there's still a significant tail of outlier values. Let's pick yet another approach. What if we simply "cap" or "clip" the maximum value of roomsPerPerson at an arbitrary value, say 4.0?

![A plot of roomsPerPerson in which all values lie between -0.3 and 4.0. The plot is bell-shaped, but there's an anomalous hill at 4.0.](https://developers.google.com/static/machine-learning/crash-course/images/ScalingClipping.svg?authuser=1)

*Figure 6. Clipping feature values at 4.0.*

Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
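Both transformations are one-liners, as the following sketch shows. The sample values are made up to mimic a long tail, and using `log1p` (the log of one plus the value, which stays defined at zero) is an assumption of this example rather than something the plots above specify.

```python
import math

def log_scale(values):
    """Compress large values with log(1 + v)."""
    return [math.log1p(v) for v in values]

def clip(values, maximum=4.0):
    """Cap every value above `maximum` at `maximum`; others pass through."""
    return [min(v, maximum) for v in values]

rooms_per_person = [0.8, 1.5, 2.1, 3.9, 12.0, 55.0]  # made-up long-tailed data

print(log_scale(rooms_per_person))  # the tail shrinks but is still present
print(clip(rooms_per_person))       # [0.8, 1.5, 2.1, 3.9, 4.0, 4.0]
```

Note how both outliers land exactly on 4.0 after clipping, which is what produces the hill at the right edge of Figure 6.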

##### Binning

The following plot shows the relative prevalence of houses at different latitudes in California. Notice the clustering—Los Angeles is about at latitude 34 and San Francisco is roughly at latitude 38.

![A plot of houses per latitude. The plot is highly irregular, containing doldrums around latitude 36 and huge spikes around latitudes 34 and 38.](https://developers.google.com/static/machine-learning/crash-course/images/ScalingBinningPart1.svg?authuser=1)

*Figure 7. Houses per latitude.*

In the data set, latitude is a floating-point value. However, it doesn't make sense to represent latitude as a floating-point feature in our model. That's because no linear relationship exists between latitude and housing values. For example, houses at latitude 35 are not $\frac{35}{34}$ more expensive (or less expensive) than houses at latitude 34. And yet, individual latitudes probably are a pretty good predictor of house values.

To make latitude a helpful predictor, let's divide latitudes into "bins" as suggested by the following figure:

![A plot of houses per latitude. The plot is divided into bins.](https://developers.google.com/static/machine-learning/crash-course/images/ScalingBinningPart2.svg?authuser=1)

*Figure 8. Binning values.*

Instead of having one floating-point feature, we now have 11 distinct boolean features (LatitudeBin1, LatitudeBin2, ..., LatitudeBin11). Having 11 separate features is somewhat inelegant, so let's unite them into a single 11-element vector. Doing so will enable us to represent latitude 37.4 as follows:

    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

Thanks to binning, our model can now learn completely different weights for each latitude.
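A binning function along these lines reproduces the vector shown above. The bin boundaries are an assumption: the sketch uses 11 equal-width bins covering latitudes 32 through 43, one degree each, which is consistent with latitude 37.4 landing in the sixth bin. The function name is hypothetical.

```python
def bin_latitude(latitude, lower=32.0, upper=43.0, num_bins=11):
    """One-hot encode `latitude` into `num_bins` equal-width bins."""
    width = (upper - lower) / num_bins
    index = int((latitude - lower) // width)
    # Clamp so that values outside [lower, upper) fall into the end bins.
    index = max(0, min(num_bins - 1, index))
    vector = [0] * num_bins
    vector[index] = 1
    return vector

print(bin_latitude(37.4))  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```

Equal-width bins are the simplest choice. Bins chosen by quantile, so that each holds roughly the same number of examples, are a common alternative when the data is as clustered as it is here.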

##### Scrubbing

Until now, we've assumed that all the data used for training and testing was trustworthy. In real life, many examples in data sets are unreliable due to one or more of the following:

- Omitted values. For instance, a person forgot to enter a value for a house's age.
- Duplicate examples. For example, a server mistakenly uploaded the same logs twice.
- Bad labels. For instance, a person mislabeled a picture of an oak tree as a maple.
- Bad feature values. For example, someone typed in an extra digit, or a thermometer was left out in the sun.

Once detected, you typically "fix" bad examples by removing them from the data set. To detect omitted values or duplicated examples, you can write a simple program. Detecting bad feature values or labels can be far trickier.

In addition to detecting bad individual examples, you must also detect bad data in the aggregate. Histograms are a great mechanism for visualizing your data in the aggregate. In addition, getting statistics like the following can help:

- Maximum and minimum
- Mean and median
- Standard deviation

Consider generating lists of the most common values for discrete features. For example, does the number of examples with country:uk match the number you expect? Should language:jp really be the most common language in your data set?
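The checks described in this section (omitted values, duplicates, aggregate statistics, and most-common values) can each be written as a short program. The records and field names below are invented for illustration.

```python
import statistics
from collections import Counter

# Hypothetical records with the kinds of problems described above.
records = [
    {"house_age_years": 27, "country": "uk"},
    {"house_age_years": None, "country": "uk"},  # omitted value
    {"house_age_years": 27, "country": "uk"},    # duplicate of the first record
    {"house_age_years": 41, "country": "jp"},
    {"house_age_years": 277, "country": "uk"},   # suspicious: maybe an extra digit
]

# Omitted values: positions of records whose age is missing.
missing = [i for i, r in enumerate(records) if r["house_age_years"] is None]

# Duplicate examples: count identical records by their sorted (key, value) pairs.
counts = Counter(tuple(sorted(r.items())) for r in records)
duplicates = [dict(key) for key, n in counts.items() if n > 1]

# Aggregate statistics over the values that are present.
ages = [r["house_age_years"] for r in records if r["house_age_years"] is not None]
summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
}

# Most common values of a discrete feature.
top_country = Counter(r["country"] for r in records).most_common(1)

print(missing, duplicates, summary, top_country)
```

Here the maximum of 277 and the large gap between the mean and the median both point at the suspicious record, even though no single rule flagged it as invalid.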

##### Know your data

Follow these rules:

- Keep in mind what you think your data should look like.
- Verify that the data meets these expectations (or that you can explain why it doesn’t).
- Double-check that the training data agrees with other sources (for example, dashboards).

Treat your data with all the care that you would treat any mission-critical code. Good ML relies on good data.

##### Additional Information

Rules of Machine Learning, [ML Phase II: Feature Engineering](https://developers.google.com/machine-learning/rules-of-ml/?authuser=1#ml_phase_ii_feature_engineering).

### Feature Crosses

A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.

#### Feature Crosses: Encoding Nonlinearity

In Figures 1 and 2, imagine the following:

- The blue dots represent sick trees.
- The orange dots represent healthy trees.

![Blues dots occupy the northeast quadrant; orange dots occupy the southwest quadrant.](https://developers.google.com/machine-learning/crash-course/images/LinearProblem1.png?authuser=1)

*Figure 1. Is this a linear problem?*

Can you draw a line that neatly separates the sick trees from the healthy trees? Sure. This is a linear problem. The line won't be perfect. A sick tree or two might be on the "healthy" side, but your line will be a good predictor.

Now look at the following figure:

![Blues dots occupy the northeast and southwest quadrants; orange dots occupy the northwest and southeast quadrants.](https://developers.google.com/machine-learning/crash-course/images/LinearProblem2.png?authuser=1)

*Figure 2. Is this a linear problem?*

Can you draw a single straight line that neatly separates the sick trees from the healthy trees? No, you can't. This is a nonlinear problem. Any line you draw will be a poor predictor of tree health.

![Same drawing as Figure 2, except that a horizontal line breaks the plane. Blue and orange dots are above the line; blue and orange dots are below the line.](https://developers.google.com/machine-learning/crash-course/images/LinearProblemNot.png?authuser=1)

*Figure 3. A single line can't separate the two classes.*

To solve the nonlinear problem shown in Figure 2, create a feature cross. A **feature cross** is a synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together. (The term cross comes from [cross product](https://wikipedia.org/wiki/Cross_product).) Let's create a feature cross named $x_3$ by crossing $x_1$ and $x_2$:

$ x_3 = x_1 x_2 $

We treat this newly minted $x_3$ feature cross just like any other feature. The linear formula becomes:

$ y = b + w_1 x_1 + w_2 x_2 + w_3 x_3 $

A linear algorithm can learn a weight for $w_3$ just as it would for $w_1$ and $w_2$. In other words, although $w_3$ encodes nonlinear information, you don’t need to change how the linear model trains to determine the value of $w_3$.
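The following sketch shows why the cross helps on data laid out like Figure 2. The four points and the weights are hand-picked for illustration (nothing is trained here): blue points (label +1) sit in the northeast and southwest quadrants, orange points (label -1) in the northwest and southeast.

```python
# One hypothetical point per quadrant: NE, SW, NW, SE.
points = [(1.0, 2.0), (-1.5, -0.5), (-2.0, 1.0), (0.5, -3.0)]
labels = [1, 1, -1, -1]  # +1 = blue, -1 = orange

def predict(x1, x2, b=0.0, w1=0.0, w2=0.0, w3=1.0):
    """Linear model over x1, x2, and the feature cross x3 = x1 * x2."""
    x3 = x1 * x2  # the synthetic feature
    y = b + w1 * x1 + w2 * x2 + w3 * x3
    return 1 if y > 0 else -1

predictions = [predict(x1, x2) for x1, x2 in points]
print(predictions)  # [1, 1, -1, -1]
```

The product of the two features is positive exactly in the northeast and southwest quadrants, so a single weight on the cross separates the classes, while the model itself remains linear in its inputs $x_1$, $x_2$, and $x_3$.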

##### Kinds of feature crosses

We can create many different kinds of feature crosses. For example:

- [A x B]: a feature cross formed by multiplying the values of two features.
- [A x B x C x D x E]: a feature cross formed by multiplying the values of five features.
- [A x A]: a feature cross formed by squaring a single feature.

Thanks to stochastic gradient descent, linear models can be trained efficiently. Consequently, supplementing scaled linear models with feature crosses has traditionally been an efficient way to train on massive-scale data sets.

#### Feature Crosses: Crossing One-Hot Vectors

So far, we've focused on feature-crossing two individual floating-point features. In practice, machine learning models seldom cross continuous features. However, machine learning models do frequently cross one-hot feature vectors. Think of feature crosses of one-hot feature vectors as logical conjunctions. For example, suppose we have two features: country and language. A one-hot encoding of each generates vectors with binary features that can be interpreted as country=USA, country=France or language=English, language=Spanish. Then, if you do a feature cross of these one-hot encodings, you get binary features that can be interpreted as logical conjunctions, such as:

    country:usa AND language:spanish

As another example, suppose you bin latitude and longitude, producing separate one-hot five-element feature vectors. For instance, a given latitude and longitude could be represented as follows:

    binned_latitude = [0, 0, 0, 1, 0]
    binned_longitude = [0, 1, 0, 0, 0]

Suppose you create a feature cross of these two feature vectors:

    binned_latitude X binned_longitude

This feature cross is a 25-element one-hot vector (24 zeroes and 1 one). The single 1 in the cross identifies a particular conjunction of latitude and longitude. Your model can then learn particular associations about that conjunction.
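One way to sketch this cross in code is with an outer product, which pairs every latitude bin with every longitude bin. Flattening the result in row-major order is an assumption made here for illustration; any fixed ordering of the 25 positions would serve equally well.

```python
import numpy as np

binned_latitude = np.array([0, 0, 0, 1, 0])
binned_longitude = np.array([0, 1, 0, 0, 0])

# The outer product pairs every latitude bin with every longitude bin;
# flattening the 5x5 grid gives the 25-element crossed vector.
crossed = np.outer(binned_latitude, binned_longitude).flatten()

print(crossed.size)            # number of elements in the cross
print(int(crossed.sum()))      # number of ones in the cross
print(int(crossed.argmax()))   # position of the single 1 (row-major order)
```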

Suppose we bin latitude and longitude much more coarsely, as follows:

    binned_latitude(lat) = [
    0  < lat <= 10
    10 < lat <= 20
    20 < lat <= 30
    ]
    binned_longitude(lon) = [
    0  < lon <= 15
    15 < lon <= 30
    ]

Creating a feature cross of those coarse bins leads to a synthetic feature having the following meanings:

    binned_latitude_X_longitude(lat, lon) = [
        0  < lat <= 10 AND 0  < lon <= 15
        0  < lat <= 10 AND 15 < lon <= 30
        10 < lat <= 20 AND 0  < lon <= 15
        10 < lat <= 20 AND 15 < lon <= 30
        20 < lat <= 30 AND 0  < lon <= 15
        20 < lat <= 30 AND 15 < lon <= 30
    ]
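The coarse cross above can be sketched as a small function that maps a latitude/longitude pair to a six-element one-hot vector. The function name `crossed_one_hot`, the ordering of the six positions, and the choice to reject out-of-range values are assumptions made for this illustration.

```python
import numpy as np

# Bin edges taken from the coarse binning above; each bin is (low, high].
LAT_EDGES = [0, 10, 20, 30]
LON_EDGES = [0, 15, 30]

def crossed_one_hot(lat, lon):
    """Return the 6-element one-hot cross for a (lat, lon) pair."""
    # With right=True, np.digitize returns i such that edges[i-1] < x <= edges[i].
    lat_bin = int(np.digitize(lat, LAT_EDGES, right=True)) - 1
    lon_bin = int(np.digitize(lon, LON_EDGES, right=True)) - 1
    n_lat, n_lon = len(LAT_EDGES) - 1, len(LON_EDGES) - 1
    if not (0 <= lat_bin < n_lat and 0 <= lon_bin < n_lon):
        raise ValueError("lat/lon outside the binned range")
    one_hot = np.zeros(n_lat * n_lon, dtype=int)
    # Same ordering as the list above: latitude bin first, then longitude bin.
    one_hot[lat_bin * n_lon + lon_bin] = 1
    return one_hot

example = crossed_one_hot(lat=12.0, lon=20.0)
print(example)
```

Here the example point falls in the "10 < lat <= 20 AND 15 < lon <= 30" conjunction, so the fourth position of the vector is set.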

Now suppose our model needs to predict how satisfied dog owners will be with dogs based on two features:

- Behavior type (barking, crying, snuggling, etc.)
- Time of day

If we build a feature cross from both these features:

    [behavior type X time of day]

then we'll end up with vastly more predictive ability than either feature on its own. For example, a dog crying (happily) at 5:00 pm when the owner returns from work will likely be a great positive predictor of owner satisfaction. Crying (miserably, perhaps) at 3:00 am when the owner was sleeping soundly will likely be a strong negative predictor of owner satisfaction.

Linear learners scale well to massive data. Using feature crosses on massive data sets is one efficient strategy for learning highly complex models. [Neural networks](https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks?authuser=1) provide another strategy.

#### Feature Crosses: Playground Exercises

##### Introducing Feature Crosses

Can a feature cross truly enable a model to fit nonlinear data? To find out, try this exercise.

**Task**: Try to create a model that separates the blue dots from the orange dots by manually changing the weights of the following three input features:

- x1
- x2
- x1 x2 (a feature cross)

To manually change a weight:

1. Click on a line that connects FEATURES to OUTPUT. An input form will appear.
1. Type a floating-point value into that input form.
1. Press Enter.

Note that the interface for this exercise does not contain a Step button. That's because this exercise does not iteratively train a model. Rather, you will manually enter the "final" weights for the model.

Answer: w1 = 0; w2 = 0; weight of x1x2 = 1 (or any positive value)

If you enter a negative value for the feature cross, the model will separate the blue dots from the orange dots but the predictions will be completely wrong. That is, the model will predict orange for the blue dots, and blue for the orange dots.

##### More Complex Feature Crosses

Now let's play with some advanced feature cross combinations. The data set in this Playground exercise looks a bit like a noisy bullseye from a game of darts, with the blue dots in the middle and the orange dots in an outer ring.

**Task 1**: Run this linear model as given. Spend a minute or two (but no longer) trying different learning rate settings to see if you can find any improvements. Can a linear model produce effective results for this data set?

Answer: No. A linear model cannot effectively model this data set. Reducing the learning rate reduces loss, but loss still converges at an unacceptably high value.

**Task 2**: Now try adding in cross-product features, such as x1x2, trying to optimize performance.

- Which features help most?
- What is the best performance that you can get?

Answer: Playground's data sets are randomly generated. Consequently, our answers may not always agree exactly with yours. In fact, if you regenerate the data set between runs, your own results won't always agree exactly with your previous runs. That said, you'll get better results by doing the following:

- Using both $x_1^2$ and $x_2^2$ as feature crosses. (Adding x1x2 as a feature cross doesn't appear to help.)
- Reducing the learning rate, perhaps to 0.001.

**Task 3**: When you have a good model, examine the model output surface (shown by the background color).

1. Does it look like a linear model?
1. How would you describe the model?

Answer: The model output surface does not look like a linear model. Rather, it looks elliptical.
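To see why squared features give a curved boundary, here is a minimal sketch with made-up points and hand-picked weights (none of these values come from Playground). The model is still linear in its inputs $x_1^2$ and $x_2^2$, but the boundary where the score equals zero is a circle in the original $(x_1, x_2)$ plane; unequal weights would stretch it into an ellipse.

```python
import numpy as np

# Illustrative points: four near the center (blue, label 1) and four on
# an outer ring (orange, label 0). The coordinates are made up.
points = np.array([
    [0.5, 0.0], [0.0, -0.5], [-0.3, 0.4], [0.2, 0.2],    # inner cluster
    [3.0, 0.0], [0.0, -3.0], [-2.0, 2.5], [2.2, -2.1],   # outer ring
])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Squared features: each input crossed with itself.
x1_sq = points[:, 0] ** 2
x2_sq = points[:, 1] ** 2

# Hand-picked weights: score > 0 inside the circle x1^2 + x2^2 = 4.
b, w1, w2 = 4.0, -1.0, -1.0
score = b + w1 * x1_sq + w2 * x2_sq
predictions = (score > 0).astype(int)
print(predictions)
```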

#### Feature Crosses: Programming Exercise

In the following exercise, you'll explore feature crosses in TensorFlow:

- [Representation with Feature Crosses](/Machine%20Learning%20Crash%20Course%20with%20TensorFlow%20APIs/Exercises/Representation_with_a_Feature_Cross.ipynb) Colab exercise.

#### Feature Crosses: Check Your Understanding

Explore the options below.

Different cities in California have markedly different housing prices. Suppose you must create a model to predict housing prices. Which of the following sets of features or feature crosses could learn city-specific relationships between roomsPerPerson and housing price?

- [x] One feature cross: [binned latitude X binned longitude X binned roomsPerPerson]
- [ ] Two feature crosses: [binned latitude X binned roomsPerPerson] and [binned longitude X binned roomsPerPerson]
- [ ] One feature cross: [latitude X longitude X roomsPerPerson]
- [ ] Three separate binned features: [binned latitude], [binned longitude], [binned roomsPerPerson]

### Regularization for Simplicity

#### Playground Exercise (Overcrossing?)

Before you watch the video or read the documentation, please complete this exercise that explores overuse of feature crosses.

**Task 1**: Run the model as is, with all of the given cross-product features. Are there any surprises in the way the model fits the data? What is the issue?

Answer: Surprisingly, the model's decision boundary looks kind of wacky. In particular, there's a region in the upper left that's hinting towards blue, even though there's no visible support for that in the data. Notice the relative thickness of the five lines running from INPUT to OUTPUT. These lines show the relative weights of the five features. The lines emanating from X1 and X2 are much thicker than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal (uncrossed) features.


**Task 2**: Try removing various cross-product features to improve performance (albeit only slightly). Why would removing features improve performance?

Answer: Removing all the feature crosses gives a more reasonable model (there is no longer a curved boundary suggestive of overfitting) and makes the test loss converge. After 1,000 iterations, test loss should be a slightly lower value than when the feature crosses were in play (although your results may vary a bit, depending on the data set). The data in this exercise is basically linear data plus noise. If we use a model that is too complicated, such as one with too many crosses, we give it the opportunity to fit to the noise in the training data, often at the cost of making the model perform badly on test data.

#### Regularization for Simplicity

Regularization means penalizing the complexity of a model to reduce overfitting.

#### Regularization for Simplicity: L₂ Regularization

Consider the following generalization curve, which shows the loss for both the training set and validation set against the number of training iterations.

![The loss function for the training set gradually declines. By contrast, the loss function for the validation set declines, but then starts to rise.](https://developers.google.com/static/machine-learning/crash-course/images/RegularizationTwoLossFunctions.svg?authuser=1)

*Figure 1. Loss on training set and validation set.*

Figure 1 shows a model in which training loss gradually decreases, but validation loss eventually goes up. In other words, this generalization curve shows that the model is overfitting to the data in the training set. Channeling our inner Ockham, perhaps we could prevent overfitting by penalizing complex models, a principle called **regularization**.

In other words, instead of simply aiming to minimize loss (empirical risk minimization):

$ \text{minimize(Loss(Data|Model))} $

we'll now minimize loss+complexity, which is called **structural risk minimization**:

$ \text{minimize(Loss(Data|Model) + complexity(Model))} $

Our training optimization algorithm is now a function of two terms: the **loss term**, which measures how well the model fits the data, and the **regularization term**, which measures model complexity.

Machine Learning Crash Course focuses on two common (and somewhat related) ways to think of model complexity:

- Model complexity as a function of the weights of all the features in the model.
- Model complexity as a function of the total number of features with nonzero weights. (A [later module](https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization?authuser=1) covers this approach.)

If model complexity is a function of weights, a feature weight with a high absolute value is more complex than a feature weight with a low absolute value.

We can quantify complexity using the **L2 regularization** formula, which defines the regularization term as the sum of the squares of all the feature weights:

$ L_2\text{ regularization term} = ||\boldsymbol w||_2^2 = {w_1^2 + w_2^2 + ... + w_n^2} $

In this formula, weights close to zero have little effect on model complexity, while outlier weights can have a huge impact.

For example, a linear model with the following weights:

$$ \{w_1 = 0.2, w_2 = 0.5, w_3 = 5, w_4 = 1, w_5 = 0.25, w_6 = 0.75\} $$

has an L2 regularization term of 26.915:

$$
\begin{aligned}
& w_1^2 + w_2^2 + \boldsymbol{w_3^2} + w_4^2 + w_5^2 + w_6^2 \\
&= 0.2^2 + 0.5^2 + \boldsymbol{5^2} + 1^2 + 0.25^2 + 0.75^2 \\
&= 0.04 + 0.25 + \boldsymbol{25} + 1 + 0.0625 + 0.5625 \\
&= 26.915
\end{aligned}
$$

But $w_3$ (bolded above), with a squared value of 25, contributes nearly all the complexity. The sum of the squares of all five other weights adds just 1.915 to the L2 regularization term.
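The same arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation above, using the six example weights.

```python
weights = [0.2, 0.5, 5, 1, 0.25, 0.75]

# L2 regularization term: the sum of the squared weights.
l2_term = sum(w ** 2 for w in weights)

# Contribution of the single largest squared weight (w_3 in the example).
largest = max(w ** 2 for w in weights)

print(round(l2_term, 3))             # total L2 term
print(round(l2_term - largest, 3))   # contribution of the other five weights
```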

#### Regularization for Simplicity: Lambda

Model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as **lambda** (also called the **regularization rate**). That is, model developers aim to do the following:

$$ \text{minimize(Loss(Data|Model)} + \lambda \text{ complexity(Model))} $$

Performing L2 regularization has the following effects on a model:

- Encourages weight values toward 0 (but not exactly 0);
- Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.

Increasing the lambda value strengthens the regularization effect. For example, the histogram of weights for a high value of lambda might look as shown in Figure 2.

![Histogram of a model's weights with a mean of zero and a normal distribution.](https://developers.google.com/static/machine-learning/crash-course/images/HighLambda.svg?authuser=1)

*Figure 2. Histogram of weights.*

Lowering the value of lambda tends to yield a flatter histogram, as shown in Figure 3.

![Histogram of a model's weights with a mean of zero that is somewhere between a flat distribution and a normal distribution.](https://developers.google.com/static/machine-learning/crash-course/images/LowLambda.svg?authuser=1)

*Figure 3. Histogram of weights produced by a lower lambda value.*

When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:

- If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions;
- If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won't be able to generalize to new data.

Note: Setting lambda to zero removes regularization completely. In this case, training focuses exclusively on minimizing loss, which poses the highest possible overfitting risk.

The ideal value of lambda produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value of lambda is data-dependent, so you'll need to do some tuning.
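As a rough illustration of how lambda pulls weights toward 0, the sketch below fits a linear regression with squared loss plus $\lambda ||\boldsymbol w||_2^2$ on synthetic data, using the closed-form solution rather than gradient descent. The data, the helper name `ridge_weights`, and the lambda values are all assumptions chosen for the demonstration.

```python
import numpy as np

# Synthetic data (an assumption for this sketch): 50 examples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)

def ridge_weights(X, y, lam):
    """Minimize sum of squared errors + lam * ||w||_2^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.sum(ridge_weights(X, y, lam) ** 2)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>6}: L2 term of learned weights = {norm:.3f}")
```

The L2 term of the learned weights shrinks as lambda grows. Whether a given lambda improves generalization still has to be judged on held-out data.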

There's a close connection between learning rate and lambda. Strong L2 regularization values tend to drive feature weights closer to 0. Lower learning rates (with early stopping) often produce the same effect because the steps away from 0 aren't as large. Consequently, tweaking learning rate and lambda simultaneously may have confounding effects.

**Early stopping** means ending training before the model fully reaches convergence. In practice, we often end up with some amount of implicit early stopping when training in an online (continuous) fashion. That is, some new trends just haven't had enough data yet to converge.

As noted, the effects from changes to regularization parameters can be confounded with the effects from changes in learning rate or number of iterations. One useful practice (when training across a fixed batch of data) is to give yourself a high enough number of iterations that early stopping doesn't play into things.

#### Regularization for Simplicity: Playground Exercise (L2 Regularization)

##### Examining L2 regularization

This exercise contains a small, noisy training data set. In this kind of setting, overfitting is a real concern. Fortunately, regularization might help.

This exercise consists of three related tasks. To simplify comparisons across the three tasks, run each task in a separate tab.

- **Task 1**: Run the model as given for at least 500 epochs. Note the following:
    - Test loss.
    - The delta between Test loss and Training loss.
    - The learned weights of the features and the feature crosses. (The relative thickness of each line running from FEATURES to OUTPUT represents the learned weight for that feature or feature cross. You can find the exact weight values by hovering over each line.)
- **Task 2**: (Consider doing this Task in a separate tab.) Increase the regularization rate from 0 to 0.3. Then, run the model for at least 500 epochs and find answers to the following questions:
    - How does the Test loss in Task 2 differ from the Test loss in Task 1?
    - How does the delta between Test loss and Training loss in Task 2 differ from that of Task 1?
    - How do the learned weights of each feature and feature cross differ from Task 2 to Task 1?
    - What do your results say about model complexity?
- **Task 3**: Experiment with regularization rate, trying to find the optimum value.

Answer:

Increasing the regularization rate from 0 to 0.3 produces the following effects:

- Test loss drops significantly.
    - Note: While test loss decreases, training loss actually increases. This is expected, because you've added another term to the loss function to penalize complexity. Ultimately, all that matters is test loss, as that's the true measure of the model's ability to make good predictions on new data.
- The delta between Test loss and Training loss drops significantly.
- The weights of the features and some of the feature crosses have lower absolute values, which implies that model complexity drops.

Given the randomness in the data set, it is impossible to predict which regularization rate produced the best results for you. For us, a regularization rate of either 0.3 or 1 generally produced the lowest Test loss.

#### Regularization for Simplicity: Check Your Understanding

##### L2 Regularization

Explore the options below.

Imagine a linear model with 100 input features:

- 10 are highly informative.
- 90 are non-informative.

Assume that all features have values between -1 and 1. Which of the following statements are true?

- [x] L2 regularization will encourage many of the non-informative weights to be nearly (but not exactly) 0.0.
- [x] L2 regularization may cause the model to learn a moderate weight for some non-informative features.
- [ ] L2 regularization will encourage most of the non-informative weights to be exactly 0.0.

##### L2 Regularization and Correlated Features

Explore the options below.

Imagine a linear model with two strongly correlated features; that is, these two features are nearly identical copies of one another but one feature contains a small amount of random noise. If we train this model with L2 regularization, what will happen to the weights for these two features?

- [x] Both features will have roughly equal, moderate weights.
- [ ] One feature will have a large weight; the other will have a weight of exactly 0.0.
- [ ] One feature will have a large weight; the other will have a weight of almost 0.0.

### Logistic Regression

Instead of predicting exactly 0 or 1, logistic regression generates a probability—a value between 0 and 1, exclusive. For example, consider a logistic regression model for spam detection. If the model infers a value of 0.932 on a particular email message, it implies a 93.2% probability that the email message is spam. More precisely, it means that in the limit of infinite training examples, the set of examples for which the model predicts 0.932 will actually be spam 93.2% of the time and the remaining 6.8% will not.

#### Logistic Regression: Calculating a Probability

Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities. Practically speaking, you can use the returned probability in either of the following two ways:

- "As is";
- Converted to a binary category.

Let's consider how we might use the probability "as is." Suppose we create a logistic regression model to predict the probability that a dog will bark during the middle of the night. We'll call that probability:

$ p(bark | night) $

If the logistic regression model predicts $p(bark | night) = 0.05$, then over a year, the dog's owners should be startled awake approximately 18 times:

$$
\begin{aligned}
startled &= p(bark | night) \cdot nights \\
&= 0.05 \cdot 365 \\
&\approx 18
\end{aligned}
$$

In many cases, you'll map the logistic regression output into the solution to a binary classification problem, in which the goal is to correctly predict one of two possible labels (e.g., "spam" or "not spam"). A [later module](https://developers.google.com/machine-learning/crash-course/classification/video-lecture?authuser=1) focuses on that.

You might be wondering how a logistic regression model can ensure output that always falls between 0 and 1. As it happens, a sigmoid function, defined as follows, produces output having those same characteristics:

$$ y = \frac{1}{1 + e^{-z}} $$

The sigmoid function yields the following plot:

![Sigmoid function. The x axis is the raw inference value. The y axis extends from 0 to +1, exclusive.](https://developers.google.com/machine-learning/crash-course/images/SigmoidFunction.png?authuser=1)

*Figure 1: Sigmoid function.*

If z represents the output of the linear layer of a model trained with logistic regression, then $sigmoid(z)$ will yield a value (a probability) between 0 and 1. In mathematical terms:

$$ y' = \frac{1}{1 + e^{-z}} $$

where:

- $y'$ is the output of the logistic regression model for a particular example.
- $ z = b + w_1x_1 + w_2x_2 + \ldots + w_Nx_N $
    - The $w$ values are the model's learned weights, and $b$ is the bias.
    - The $x$ values are the feature values for a particular example.

Note that z is also referred to as the log-odds because the inverse of the sigmoid states that z can be defined as the log of the probability of the 1 label (e.g., "dog barks") divided by the probability of the 0 label (e.g., "dog doesn't bark"):

$$ z = \log\left(\frac{y}{1-y}\right) $$
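A quick way to convince yourself that the log-odds formula inverts the sigmoid is to round-trip a few values. The function names `sigmoid` and `log_odds` below are chosen for this sketch.

```python
import math

def sigmoid(z):
    """Map a raw value z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(y):
    """Inverse of the sigmoid: log of p(label 1) / p(label 0)."""
    return math.log(y / (1.0 - y))

for z in [-3.0, 0.0, 2.5]:
    y = sigmoid(z)
    # The last column should reproduce z, up to floating-point error.
    print(z, round(y, 4), round(log_odds(y), 4))
```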

Here is the sigmoid function with ML labels:

![The Sigmoid function with the x-axis labeled as the sum of all the weights and features (plus the bias); the y-axis is labeled Probability Output.](https://developers.google.com/static/machine-learning/crash-course/images/LogisticRegressionOutput.svg?authuser=1)

*Figure 2: Logistic regression output.*

##### Sample logistic regression inference calculation

Suppose we had a logistic regression model with three features that learned the following bias and weights:

$$
b = 1 \\
w_1 = 2 \\
w_2 = -1 \\
w_3 = 5
$$

Further suppose the following feature values for a given example:

$$
x_1 = 0 \\
x_2 = 10 \\
x_3 = 2
$$

Therefore, the log-odds:

$$ b + w_1x_1 + w_2x_2 + w_3x_3 $$

will be:

$$ (1) + (2)(0) + (-1)(10) + (5)(2) = 1 $$

Consequently, the logistic regression prediction for this particular example will be 0.731:

$$ y' = \frac{1}{1 + e^{-1}} = 0.731 $$

![Plot on the sigmoid function. X = 1, so y = 0.731.](https://developers.google.com/static/machine-learning/crash-course/images/LogisticRegressionOutput0.731.svg?authuser=1)

*Figure 3: 73.1% probability.*
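The calculation above can be reproduced directly in Python, using the bias, weights, and feature values from this example:

```python
import math

b = 1
weights = [2, -1, 5]
features = [0, 10, 2]

# Log-odds: z = b + w1*x1 + w2*x2 + w3*x3
z = b + sum(w * x for w, x in zip(weights, features))

# Logistic regression prediction: sigmoid of the log-odds.
y_prime = 1 / (1 + math.exp(-z))

print(z)                   # 1
print(round(y_prime, 3))   # 0.731
```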

#### Logistic Regression: Loss and Regularization

##### Loss function for Logistic Regression

The loss function for linear regression is squared loss. The loss function for logistic regression is **Log Loss**, which is defined as follows:

$ \text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y') $

where:

- $(x,y) \in D$ is the data set containing many labeled examples, which are $(x,y)$ pairs.
- y is the label in a labeled example. Since this is logistic regression, every value of y must either be 0 or 1.
- y' is the predicted value (somewhere between 0 and 1), given the set of features in x.
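A minimal sketch of the Log Loss formula follows. It takes (label, prediction) pairs directly instead of features, and the sample predictions are made up. It also assumes every prediction is strictly between 0 and 1, because the logarithm is undefined at the endpoints.

```python
import math

def log_loss(examples):
    """Sum of -y*log(y') - (1 - y)*log(1 - y') over (y, y') pairs."""
    total = 0.0
    for y, y_pred in examples:
        total += -y * math.log(y_pred) - (1 - y) * math.log(1 - y_pred)
    return total

# Two examples predicted confidently and correctly...
confident_right = log_loss([(1, 0.9), (0, 0.1)])
# ...versus the same two examples predicted confidently and wrongly.
confident_wrong = log_loss([(1, 0.1), (0, 0.9)])

print(round(confident_right, 3), round(confident_wrong, 3))
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is the behavior the formula is designed to have.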

##### Regularization in Logistic Regression

[Regularization](https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/video-lecture?authuser=1) is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions. Consequently, most logistic regression models use one of the following two strategies to dampen model complexity:

- L2 regularization.
- Early stopping, that is, limiting the number of training steps or the learning rate.

(We'll discuss a third strategy — L1 regularization — in a [later module](https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/video-lecture?authuser=1).)

Imagine that you assign a unique id to each example, and map each id to its own feature. If you don't specify a regularization function, the model will become completely overfit. That's because the model would try to drive loss to zero on all examples and never get there, driving the weights for each indicator feature to +infinity or -infinity. This can happen in high dimensional data with feature crosses, when there’s a huge mass of rare crosses that happen only on one example each.

Fortunately, using L2 or early stopping will prevent this problem.
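The unique-id scenario can be sketched with a tiny experiment. Each example gets its own indicator feature, so the feature matrix is the identity and each weight sees exactly one example. The helper `train`, the learning rate, the step counts, and the lambda value are assumptions for illustration. The penalty used is $\lambda ||\boldsymbol w||_2^2$, and there is no bias term.

```python
import numpy as np

def train(labels, steps, lam, learning_rate=0.5):
    """Gradient descent on Log Loss + lam * ||w||^2, one indicator feature per example."""
    w = np.zeros(len(labels))
    for _ in range(steps):
        # With identity features, the log-odds for example i is just w[i].
        predictions = 1.0 / (1.0 + np.exp(-w))
        gradient = (predictions - labels) + 2.0 * lam * w
        w -= learning_rate * gradient
    return w

labels = np.array([1.0, 0.0, 1.0, 0.0])
for steps in (100, 1000, 10000):
    unregularized = train(labels, steps, lam=0.0)
    regularized = train(labels, steps, lam=0.05)
    print(steps,
          round(float(np.abs(unregularized).max()), 2),
          round(float(np.abs(regularized).max()), 2))
```

Without regularization the largest weight keeps growing the longer training runs, while the L2-regularized weights settle at a finite value. Capping the number of steps (early stopping) bounds the unregularized weights in a similar way.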

### Classification

This module shows how logistic regression can be used for classification tasks, and explores how to evaluate the effectiveness of classification models.

Explore the options below.

Consider a classification model that separates email into two categories: "spam" or "not spam." If you raise the classification threshold, what will happen to precision?

- [ ] Definitely decrease.
- [ ] Probably decrease.
- [x] Probably increase.
- [ ] Definitely increase. 

#### Classification: Thresholding

Logistic regression returns a probability. You can use the returned probability "as is" (for example, the probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value (for example, this email is spam).

A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic regression model is very likely not spam. However, what about an email message with a prediction score of 0.6? In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.

The following sections take a closer look at metrics you can use to evaluate a classification model's predictions, as well as the impact of changing the classification threshold on these predictions.

Note: "Tuning" a threshold for logistic regression is different from tuning hyperparameters such as learning rate. Part of choosing a threshold is assessing how much you'll suffer for making a mistake. For example, mistakenly labeling a non-spam message as spam is very bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly the end of your job.
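Applying a classification threshold is a one-line comparison. The sketch below uses the three prediction scores mentioned above and two candidate thresholds. The helper name `classify` and the threshold of 0.8 are illustrative, and treating a score exactly equal to the threshold as "not spam" is an assumption, since the text only covers values above or below it.

```python
def classify(probabilities, threshold):
    """Label each score as spam if it is above the threshold, else not spam."""
    return ["spam" if p > threshold else "not spam" for p in probabilities]

scores = [0.9995, 0.0003, 0.6]

print(classify(scores, threshold=0.5))
print(classify(scores, threshold=0.8))
```

Only the message scored 0.6 changes category between the two thresholds, which is why the borderline cases drive the choice of threshold.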

#### Classification: True vs. False and Positive vs. Negative

In this section, we'll define the primary building blocks of the metrics we'll use to evaluate classification models. But first, a fable:

<blockquote>
<strong>An Aesop's Fable: The Boy Who Cried Wolf (compressed)</strong>

A shepherd boy gets bored tending the town's flock. To have some fun, he cries out, "Wolf!" even though no wolf is in sight. The villagers run to protect the flock, but then get really mad when they realize the boy was playing a joke on them.

[Iterate previous paragraph N times.]

One night, the shepherd boy sees a real wolf approaching the flock and calls out, "Wolf!" The villagers refuse to be fooled again and stay in their houses. The hungry wolf turns the flock into lamb chops. The town goes hungry. Panic ensues.
</blockquote>

Let's make the following definitions:

- "Wolf" is a positive class.
- "No wolf" is a negative class.

We can summarize our "wolf-prediction" model using a 2x2 [confusion matrix](https://developers.google.com/machine-learning/glossary?authuser=1#confusion_matrix) that depicts all four possible outcomes:

<div><table>
  <tbody><tr>
    <td>
      <b>True Positive (TP):</b>
      <ul>
        <li>Reality: A wolf threatened.</li>
        <li>Shepherd said: "Wolf."</li>
        <li>Outcome: Shepherd is a hero.</li>
      </ul>
    </td>
    <td>
      <b>False Positive (FP):</b>
      <ul>
        <li>Reality: No wolf threatened.</li>
        <li>Shepherd said: "Wolf."</li>
        <li>Outcome: Villagers are angry at shepherd for waking them up.</li>
    </ul></td>
  </tr>
  <tr>
    <td>
      <b>False Negative (FN):</b>
      <ul>
        <li>Reality: A wolf threatened.</li>
        <li>Shepherd said: "No wolf."</li>
        <li>Outcome: The wolf ate all the sheep.</li>
      </ul>
    </td>
    <td>
      <b>True Negative (TN):</b>
      <ul>
        <li>Reality: No wolf threatened.</li>
        <li>Shepherd said: "No wolf."</li>
        <li>Outcome: Everyone is fine.</li>
      </ul>
    </td>
  </tr>
</tbody></table></div>

A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.

A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.

In the following sections, we'll look at how to evaluate classification models using metrics derived from these four outcomes.
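The four outcomes can be tallied with a short function. The list of nights below is hypothetical, and `confusion_counts` is a name chosen for this sketch.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for booleans where True means the positive class."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for is_wolf, said_wolf in zip(actual, predicted):
        if said_wolf:
            # The shepherd said "Wolf": correct only if a wolf was there.
            counts["TP" if is_wolf else "FP"] += 1
        else:
            # The shepherd said "No wolf": correct only if no wolf was there.
            counts["FN" if is_wolf else "TN"] += 1
    return counts

# Hypothetical nights: was there a wolf, and did the shepherd cry "Wolf"?
wolf_present = [False, False, False, True, False, True]
cried_wolf = [True, True, False, False, False, True]

print(confusion_counts(wolf_present, cried_wolf))
```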

#### Classification: Accuracy

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

$ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $

For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

$ \text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} $

Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.

Let's try calculating accuracy for the following model that classified 100 tumors as malignant (the positive class) or benign (the negative class):

<div><table>
  <tbody><tr>
    <td>
      <b>True Positive (TP):</b>
      <ul>
        <li>Reality: Malignant</li>
        <li>ML model predicted: Malignant</li>
        <li>Number of TP results: 1</li>
      </ul>
    </td>
    <td>
      <b>False Positive (FP):</b>
      <ul>
        <li>Reality: Benign</li>
        <li>ML model predicted: Malignant</li>
        <li>Number of FP results: 1</li>
    </ul></td>
  </tr>
  <tr>
    <td>
      <b>False Negative (FN):</b>
      <ul>
        <li>Reality: Malignant</li>
        <li>ML model predicted: Benign</li>
        <li>Number of FN results: 8</li>
      </ul>
    </td>
    <td>
      <b>True Negative (TN):</b>
      <ul>
        <li>Reality: Benign</li>
        <li>ML model predicted: Benign</li>
        <li>Number of TN results: 90</li>
      </ul>
    </td>
  </tr>
</tbody></table></div>

$ \text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} = \frac{1+90}{1+90+1+8} = 0.91 $

Accuracy comes out to 0.91, or 91% (91 correct predictions out of 100 total examples). That means our tumor classifier is doing a great job of identifying malignancies, right?

Actually, let's do a closer analysis of positives and negatives to gain more insight into our model's performance.

Of the 100 tumor examples, 91 are benign (90 TNs and 1 FP) and 9 are malignant (1 TP and 8 FNs).

Of the 91 benign tumors, the model correctly identifies 90 as benign. That's good. However, of the 9 malignant tumors, the model only correctly identifies 1 as malignant—a terrible outcome, as 8 out of 9 malignancies go undiagnosed!

While 91% accuracy may seem good at first glance, another tumor-classifier model that always predicts benign would achieve the exact same accuracy (91/100 correct predictions) on our examples. In other words, our model is no better than one that has zero predictive ability to distinguish malignant tumors from benign tumors.

Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, like this one, where there is a significant disparity between the number of positive and negative labels.

In the next section, we'll look at two better metrics for evaluating class-imbalanced problems: precision and recall.
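The comparison with an always-benign model can be verified from the four counts in the tumor example. The function name `accuracy` is chosen for this sketch.

```python
tp, fp, fn, tn = 1, 1, 8, 90

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

model_accuracy = accuracy(tp, fp, fn, tn)

# A baseline that always predicts benign: every malignant tumor becomes a
# false negative and every benign tumor becomes a true negative.
baseline_accuracy = accuracy(tp=0, fp=0, fn=tp + fn, tn=tn + fp)

print(model_accuracy, baseline_accuracy)
```

Both come out to 0.91, even though the baseline never identifies a single malignant tumor.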

#### Classification: Precision and Recall

##### Precision

Precision attempts to answer the following question:

  What proportion of positive identifications was actually correct?

Precision is defined as follows:

$ \text{Precision} = \frac{TP}{TP+FP} $

Note: A model that produces no false positives has a precision of 1.0.

Let's calculate precision for our ML model from the previous section that analyzes tumors:

<table>
  <tr>
    <td>
      <b>True Positives (TPs): 1</b>
    </td>
    <td>
      <b>False Positives (FPs): 1</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FNs): 8</b>
    </td>
    <td>
      <b>True Negatives (TNs): 90</b>
    </td>
  </tr>
</table>

$ \text{Precision} = \frac{TP}{TP+FP} = \frac{1}{1+1} = 0.5 $

Our model has a precision of 0.5—in other words, when it predicts a tumor is malignant, it is correct 50% of the time.

##### Recall

Recall attempts to answer the following question:

  What proportion of actual positives was identified correctly?

Mathematically, recall is defined as follows:

$ \text{Recall} = \frac{TP}{TP+FN} $

Note: A model that produces no false negatives has a recall of 1.0.

Let's calculate recall for our tumor classifier:

<table>
  <tr>
    <td>
      <b>True Positives (TPs): 1</b>
    </td>
    <td>
      <b>False Positives (FPs): 1</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FNs): 8</b>
    </td>
    <td>
      <b>True Negatives (TNs): 90</b>
    </td>
  </tr>
</table>

$ \text{Recall} = \frac{TP}{TP+FN} = \frac{1}{1+8} = 0.11 $

Our model has a recall of 0.11 — in other words, it correctly identifies 11% of all malignant tumors.
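As a small sketch, the two definitions translate directly into code. The function names are illustrative, and returning `None` when a denominator is zero is an assumption made here for safety; the text does not specify that case.

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns None if the model made no positive predictions."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else None


def recall(tp, fn):
    """TP / (TP + FN). Returns None if the data has no actual positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else None


# Counts for the tumor classifier from the tables above.
tumor_precision = precision(tp=1, fp=1)
tumor_recall = recall(tp=1, fn=8)

print(f"precision = {tumor_precision:.2f}")
print(f"recall    = {tumor_recall:.2f}")
```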

##### Precision and Recall: A Tug of War

To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa. Explore this notion by looking at the following figure, which shows 30 predictions made by an email classification model. Those to the right of the classification threshold are classified as "spam", while those to the left are classified as "not spam."

![A number line from 0 to 1.0 on which 30 examples have been placed.](https://developers.google.com/static/machine-learning/crash-course/images/PrecisionVsRecallBase.svg?authuser=1)

*Figure 1. Classifying email messages as spam or not spam.*

Let's calculate precision and recall based on the results shown in Figure 1:

<table>
  <tr>
    <td>
      <b>True Positives (TP): 8</b>
    </td>
    <td>
      <b>False Positives (FP): 2</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FN): 3</b>
    </td>
    <td>
      <b>True Negatives (TN): 17</b>
    </td>
  </tr>
</table>

Precision measures the percentage of emails flagged as spam that were correctly classified — that is, the percentage of dots to the right of the threshold line that are green in Figure 1:

$ \text{Precision} = \frac{TP}{TP + FP} = \frac{8}{8+2} = 0.8 $

Recall measures the percentage of actual spam emails that were correctly classified—that is, the percentage of green dots that are to the right of the threshold line in Figure 1:

$ \text{Recall} = \frac{TP}{TP + FN} = \frac{8}{8 + 3} = 0.73 $

Figure 2 illustrates the effect of increasing the classification threshold.

![Same set of examples, but with the classification threshold increased slightly. 2 of the 30 examples have been reclassified.](https://developers.google.com/static/machine-learning/crash-course/images/PrecisionVsRecallRaiseThreshold.svg?authuser=1)

*Figure 2. Increasing classification threshold.*

The number of false positives decreases, but false negatives increase. As a result, precision increases, while recall decreases:

<table>
  <tr>
    <td>
      <b>True Positives (TP): 7</b>
    </td>
    <td>
      <b>False Positives (FP): 1</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FN): 4</b>
    </td>
    <td>
      <b>True Negatives (TN): 18</b>
    </td>
  </tr>
</table>

$ \text{Precision} = \frac{TP}{TP + FP} = \frac{7}{7+1} = 0.88 $

$ \text{Recall} = \frac{TP}{TP + FN} = \frac{7}{7 + 4} = 0.64 $

Conversely, Figure 3 illustrates the effect of decreasing the classification threshold (from its original position in Figure 1).

![Same set of examples, but with the classification threshold decreased.](https://developers.google.com/static/machine-learning/crash-course/images/PrecisionVsRecallLowerThreshold.svg?authuser=1)

*Figure 3. Decreasing classification threshold.*

False positives increase, and false negatives decrease. As a result, this time, precision decreases and recall increases:

<table>
  <tr>
    <td>
      <b>True Positives (TP): 9</b>
    </td>
    <td>
      <b>False Positives (FP): 3</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FN): 2</b>
    </td>
    <td>
      <b>True Negatives (TN): 16</b>
    </td>
  </tr>
</table>

$ \text{Precision} = \frac{TP}{TP + FP} = \frac{9}{9+3} = 0.75 $

$ \text{Recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 2} = 0.82 $

Various metrics have been developed that rely on both precision and recall. For example, see [F1 score](https://wikipedia.org/wiki/F1_score).
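To illustrate how one such combined metric behaves, the sketch below computes the F1 score (the harmonic mean of precision and recall) for the three thresholds in Figures 1 through 3. The counts are taken from the tables above; the helper name `f1_score` and the dictionary labels are illustrative.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)


# Confusion-matrix counts from Figures 1, 2, and 3.
counts = {
    "Figure 1 (original threshold)": dict(tp=8, fp=2, fn=3),
    "Figure 2 (raised threshold)": dict(tp=7, fp=1, fn=4),
    "Figure 3 (lowered threshold)": dict(tp=9, fp=3, fn=2),
}

f1_by_figure = {name: f1_score(**c) for name, c in counts.items()}
for name, f1 in f1_by_figure.items():
    print(f"{name}: F1 = {f1:.2f}")
```

Because F1 folds both quantities into one number, it gives a single value to compare across thresholds, at the cost of weighting precision and recall equally.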

#### Classification: Check Your Understanding (Accuracy, Precision, Recall)

##### Accuracy

Explore the options below.

In which of the following scenarios would a high accuracy value suggest that the ML model is doing a good job?

- [ ] An expensive robotic chicken crosses a very busy road a thousand times per day. An ML model evaluates traffic patterns and predicts when this chicken can safely cross the street with an accuracy of 99.99%.
- [x] In the game of roulette, a ball is dropped on a spinning wheel and eventually lands in one of 38 slots. Using visual features (the spin of the ball, the position of the wheel when the ball was dropped, the height of the ball over the wheel), an ML model can predict the slot that the ball will land in with an accuracy of 4%.
- [ ] A deadly, but curable, medical condition afflicts 0.01% of the population. An ML model uses symptoms as features and predicts this affliction with an accuracy of 99.99%.

##### Precision

Explore the options below.

Consider a classification model that separates email into two categories: "spam" or "not spam." If you raise the classification threshold, what will happen to precision?

- [ ] Probably decrease.
- [ ] Definitely decrease.
- [x] Probably increase.
- [ ] Definitely increase.

##### Recall

Explore the options below.

Consider a classification model that separates email into two categories: "spam" or "not spam." If you raise the classification threshold, what will happen to recall?

- [ ] Always increase.
- [ ] Always stay constant.
- [x] Always decrease or stay the same.

##### Precision and Recall

Explore the options below.

Consider two models — A and B — that each evaluate the same dataset. Which one of the following statements is true?

- [ ] If model A has better recall than model B, then model A is better.
- [ ] If model A has better precision than model B, then model A is better.
- [x] If model A has better precision and better recall than model B, then model A is probably better.

#### Classification: ROC Curve and AUC

##### ROC curve

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

- True Positive Rate
- False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

$ TPR = \frac{TP} {TP + FN} $

False Positive Rate (FPR) is defined as follows:

$ FPR = \frac{FP} {FP + TN} $

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.

![ROC Curve showing TP Rate vs. FP Rate at different classification thresholds.](https://developers.google.com/static/machine-learning/crash-course/images/ROCCurve.svg?authuser=1)

*Figure 4. TP vs. FP rate at different classification thresholds.*

To compute the points in an ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, an efficient, sorting-based algorithm can provide this information for us, and it yields a summary metric called AUC.
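One possible sorting-based sweep is sketched below. It is an illustration under stated assumptions, not necessarily the algorithm any particular library uses: the function name `roc_points` and the toy labels and scores are hypothetical, and the code assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return ROC points as (FPR, TPR) pairs, one per distinct threshold.

    Examples are sorted by descending score. Lowering the threshold past
    each distinct score turns those examples into positive predictions,
    so running TP and FP counts are enough to produce every point.
    Assumes labels contain at least one 1 and at least one 0.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    ranked = sorted(zip(scores, labels), reverse=True)

    points = [(0.0, 0.0)]  # Threshold above every score: nothing is positive.
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a tied score group.
        last_of_group = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_group:
            points.append((fp / num_neg, tp / num_pos))
    return points


# Hypothetical toy data: 1 = positive, 0 = negative.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

A single sort followed by one pass replaces the repeated re-evaluation of the model at each threshold.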

##### AUC: Area Under the ROC Curve

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

![AUC (Area under the ROC Curve).](https://developers.google.com/static/machine-learning/crash-course/images/AUC.svg?authuser=1)

*Figure 5. AUC (Area under the ROC Curve).*

AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. For example, consider the following examples, which are arranged from left to right in ascending order of logistic regression predictions:

![Positive and negative examples ranked in ascending order of logistic regression score.](https://developers.google.com/static/machine-learning/crash-course/images/AUCPredictionsRanked.svg?authuser=1)

*Figure 6. Predictions ranked in ascending order of logistic regression score.*

AUC represents the probability that a random positive (green) example is positioned to the right of a random negative (red) example.

AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

AUC is desirable for the following two reasons:

- AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
- AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
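The ranking interpretation and the scale invariance described above can both be checked with a brute-force sketch. The function name `auc_pairwise` and the toy data are hypothetical, and counting ties as half a win is a common convention assumed here rather than something the text specifies.

```python
from itertools import product


def auc_pairwise(labels, scores):
    """Probability that a random positive outscores a random negative.

    Compares every (positive, negative) pair; ties count as half a win.
    This is O(P * N), so it is only suitable for small illustrations.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = positive, 0 = negative.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc = auc_pairwise(labels, scores)

# Doubling every score changes the absolute values but not the ranking,
# so the pairwise comparison results are identical.
auc_rescaled = auc_pairwise(labels, [2.0 * s for s in scores])

print(f"AUC          = {auc:.3f}")
print(f"AUC rescaled = {auc_rescaled:.3f}")
```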

However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:

- Scale invariance is not always desirable. For example, sometimes we really do need well-calibrated probability outputs, and AUC won't tell us about that.
- Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.


#### Classification: Check Your Understanding (ROC and AUC)

##### ROC and AUC

Explore the options below.

Which of the following ROC curves produce AUC values greater than 0.5?

- [ ] ![An ROC curve with a horizontal line running from (0,0) to (1,0), and a vertical line from (1,0) to (1,1). The FP rate is 1.0 for all TP rates.](https://developers.google.com/static/machine-learning/crash-course/images/mc_roc2.svg?authuser=1)
- [x] ![An ROC curve with a vertical line running from (0,0) to (0,1), and a horizontal from (0,1) to (1,1). The TP rate is 1.0 for all FP rates.](https://developers.google.com/static/machine-learning/crash-course/images/mc_roc1.svg?authuser=1)
- [ ] ![An ROC curve with one diagonal line running from (0,0) to (1,1). TP and FP rates increase linearly at the same rate.](https://developers.google.com/static/machine-learning/crash-course/images/mc_roc3.svg?authuser=1)
- [x] ![An ROC curve that arcs up and right from (0,0) to (1,1). TP rate increases at a faster rate than FP rate.](https://developers.google.com/static/machine-learning/crash-course/images/mc_roc4.svg?authuser=1)
- [ ] ![An ROC curve that arcs right and up from (0,0) to (1,1). FP rate increases at a faster rate than TP rate.](https://developers.google.com/static/machine-learning/crash-course/images/mc_roc5.svg?authuser=1)

##### AUC and Scaling Predictions

Explore the options below.

How would multiplying all of the predictions from a given model by 2.0 (for example, if the model predicts 0.4, we multiply by 2.0 to get a prediction of 0.8) change the model's performance as measured by AUC?

- [x] No change. AUC only cares about relative prediction scores.
- [ ] It would make AUC better, because the prediction values are all farther apart.
- [ ] It would make AUC terrible, since the prediction values are now way off.

#### Classification: Prediction Bias

Logistic regression predictions should be unbiased. That is:

    "average of predictions" should ≈ "average of observations" 

**Prediction bias** is a quantity that measures how far apart those two averages are. That is:

$ \text{prediction bias} = \text{average of predictions} - \text{average of labels in data set} $

Note: "Prediction bias" is a different quantity than bias (the b in wx + b).

A significant nonzero prediction bias tells you there is a bug somewhere in your model, as it indicates that the model is wrong about how frequently positive labels occur.

For example, let's say we know that on average, 1% of all emails are spam. If we don't know anything at all about a given email, we should predict that it's 1% likely to be spam. Similarly, a good spam model should predict on average that emails are 1% likely to be spam. (In other words, if we average the predicted likelihoods of each individual email being spam, the result should be 1%.) If instead, the model's average prediction is 20% likelihood of being spam, we can conclude that it exhibits prediction bias.
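The spam example can be reproduced numerically. This is a minimal sketch: the helper name `prediction_bias` and the two sets of made-up predictions are illustrative, chosen to match the 1% and 20% figures in the paragraph above.

```python
def prediction_bias(predictions, labels):
    """Average of predictions minus average of labels."""
    mean_prediction = sum(predictions) / len(predictions)
    mean_label = sum(labels) / len(labels)
    return mean_prediction - mean_label


# Hypothetical data set: 100 emails, of which 1 is spam (label 1).
labels = [1] + [0] * 99

# A model whose average prediction matches the 1% base rate.
unbiased_predictions = [0.01] * 100
# A model whose average prediction is 20%.
biased_predictions = [0.20] * 100

print(f"unbiased model: {prediction_bias(unbiased_predictions, labels):+.2f}")
print(f"biased model:   {prediction_bias(biased_predictions, labels):+.2f}")
```

Note that the first model has near-zero bias even though it predicts the same value for every email, so a low prediction bias says nothing about how well individual emails are ranked.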

Possible root causes of prediction bias are:

- Incomplete feature set
- Noisy data set
- Buggy pipeline
- Biased training sample
- Overly strong regularization

You might be tempted to correct prediction bias by post-processing the learned model — that is, by adding a calibration layer that adjusts your model's output to reduce the prediction bias. For example, if your model has +3% bias, you could add a calibration layer that lowers the mean prediction by 3%. However, adding a calibration layer is a bad idea for the following reasons:

- You're fixing the symptom rather than the cause.
- You've built a more brittle system that you must now keep up to date.

If possible, avoid calibration layers. Projects that use calibration layers tend to become reliant on them — using calibration layers to fix all their model's sins. Ultimately, maintaining the calibration layers can become a nightmare.

Note: A good model will usually have near-zero bias. That said, a low prediction bias does not prove that your model is good. A really terrible model could have a zero prediction bias. For example, a model that just predicts the mean value for all examples would be a bad model, despite having zero bias.

##### Bucketing and Prediction Bias

Logistic regression predicts a value between 0 and 1. However, all labeled examples are either exactly 0 (meaning, for example, "not spam") or exactly 1 (meaning, for example, "spam"). Therefore, when examining prediction bias, you cannot accurately determine the prediction bias based on only one example; you must examine the prediction bias on a "bucket" of examples. That is, prediction bias for logistic regression only makes sense when grouping enough examples together to be able to compare a predicted value (for example, 0.392) to observed values (for example, 0.394).

You can form buckets in the following ways:

- Linearly breaking up the target predictions.
- Forming quantiles.
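As a sketch of the quantile approach, the code below sorts examples by prediction, splits them into equal-sized buckets, and reports the mean prediction and mean label per bucket. The function name `quantile_buckets` and the toy data are hypothetical; real calibration plots use far more examples per bucket.

```python
def quantile_buckets(predictions, labels, num_buckets):
    """Split examples into equal-sized buckets by ascending prediction.

    Returns one (mean_prediction, mean_label) pair per non-empty bucket.
    """
    ranked = sorted(zip(predictions, labels))
    n = len(ranked)
    summary = []
    for b in range(num_buckets):
        bucket = ranked[b * n // num_buckets:(b + 1) * n // num_buckets]
        if not bucket:
            continue
        mean_prediction = sum(p for p, _ in bucket) / len(bucket)
        mean_label = sum(y for _, y in bucket) / len(bucket)
        summary.append((mean_prediction, mean_label))
    return summary


# Hypothetical toy data: predicted probabilities and 0/1 labels.
predictions = [0.1, 0.2, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

buckets = quantile_buckets(predictions, labels, num_buckets=2)
for mean_prediction, mean_label in buckets:
    bias = mean_prediction - mean_label
    print(f"pred = {mean_prediction:.2f}, label = {mean_label:.2f}, bias = {bias:+.2f}")
```

Each pair corresponds to one dot on a calibration plot: the mean prediction is the x-coordinate and the mean label is the y-coordinate.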

Consider the following calibration plot from a particular model. Each dot represents a bucket of 1,000 values. The axes have the following meanings:

- The x-axis represents the average of values the model predicted for that bucket.
- The y-axis represents the actual average of values in the data set for that bucket.

Both axes are logarithmic scales.

![X-axis is Prediction; y-axis is Label. For middle and high values of prediction, the prediction bias is negligible. For low values of prediction, the prediction bias is relatively high.](https://developers.google.com/static/machine-learning/crash-course/images/BucketingBias.svg?authuser=1)

*Figure 8. Prediction bias curve (logarithmic scales)*

Why are the predictions so poor for only part of the model? Here are a few possibilities:

- The training set doesn't adequately represent certain subsets of the data space.
- Some subsets of the data set are noisier than others.
- The model is overly regularized. (Consider reducing the value of lambda.)

#### Binary Classification: Programming Exercise

In the following exercise, you'll explore binary classification in TensorFlow:

- [Binary Classification](/Machine%20Learning%20Crash%20Course%20with%20TensorFlow%20APIs/Exercises/Binary_Classification.ipynb) Colab exercise.

### Regularization for Sparsity

This module focuses on the special requirements for models learned on feature vectors that have many dimensions.

#### Regularization for Sparsity: L₁ Regularization

Sparse vectors often contain many dimensions. Creating a feature cross results in even more dimensions. Given such high-dimensional feature vectors, model size may become huge and require huge amounts of RAM.

In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.

For example, consider a housing data set that covers not just California but the entire globe. Bucketing global latitude at the minute level (60 minutes per degree) gives about 10,000 dimensions in a sparse encoding; global longitude at the minute level gives about 20,000 dimensions. A feature cross of these two features would result in roughly 200,000,000 dimensions. Many of those 200,000,000 dimensions represent areas of such limited residence (for example, the middle of the ocean) that it would be difficult to use that data to generalize effectively. It would be silly to pay the RAM cost of storing these unneeded dimensions. Therefore, it would be nice to encourage the weights for the meaningless dimensions to drop to exactly 0, which would allow us to avoid paying for the storage cost of these model coefficients at inference time.
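The dimension counts in this example follow from simple arithmetic, sketched below. The 4-bytes-per-weight figure is an assumption (one 32-bit float per coefficient) added only to give the RAM cost a rough scale; it does not come from the text.

```python
MINUTES_PER_DEGREE = 60

# Latitude spans 180 degrees (-90 to +90); longitude spans 360 degrees.
latitude_buckets = 180 * MINUTES_PER_DEGREE
longitude_buckets = 360 * MINUTES_PER_DEGREE

# A feature cross has one dimension per (latitude, longitude) bucket pair.
crossed_dimensions = latitude_buckets * longitude_buckets

# Assumption: one float32 weight (4 bytes) per crossed dimension.
BYTES_PER_WEIGHT = 4
weight_bytes = crossed_dimensions * BYTES_PER_WEIGHT

print(f"latitude buckets:   {latitude_buckets:,}")
print(f"longitude buckets:  {longitude_buckets:,}")
print(f"crossed dimensions: {crossed_dimensions:,}")
print(f"weight storage:     {weight_bytes / 1e9:.2f} GB")
```

The exact counts (10,800 and 21,600) round to the "about 10,000" and "about 20,000" quoted above, and their product is on the order of 200,000,000.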

We might be able to encode this idea into the optimization problem done at training time, by adding an appropriately chosen regularization term.

Would L2 regularization accomplish this task? Unfortunately not. L2 regularization encourages weights to be small, but doesn't force them to exactly 0.0.

An alternative idea would be to try to create a regularization term that penalizes the count of non-zero coefficient values in a model. Increasing this count would only be justified if there were a sufficient gain in the model's ability to fit the data. Unfortunately, while this count-based approach is intuitively appealing, it would turn our convex optimization problem into a non-convex optimization problem. So this idea, known as L0 regularization, isn't something we can use effectively in practice.

However, there is a regularization term called L1 regularization that serves as an approximation to L0, but has the advantage of being convex and thus efficient to compute. So we can use L1 regularization to encourage many of the uninformative coefficients in our model to be exactly 0, and thus reap RAM savings at inference time.

##### L1 vs. L2 Regularization

L2 and L1 penalize weights differently:

- L2 penalizes $\text{weight}^2$.
- L1 penalizes $|\text{weight}|$.

Consequently, L2 and L1 have different derivatives:

- The derivative of L2 is $2 \cdot \text{weight}$.
- The derivative of L1 is $k$ (a constant, whose value is independent of weight).

You can think of the derivative of L2 as a force that removes x% of the weight every time. As Zeno knew, even if you remove x percent of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, L2 does not normally drive weights to zero.

You can think of the derivative of L1 as a force that subtracts some constant from the weight every time. However, thanks to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out. For example, if subtraction would have forced a weight from +0.1 to -0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed out the weight.
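The two "forces" can be simulated for a single weight. This is a sketch under simplifying assumptions: it applies only the regularization term (no data loss), the function names and hyperparameter values are hypothetical, and the clip-to-zero rule in `l1_step` is one common way to realize the zeroing behavior described above rather than the only one.

```python
def l2_step(weight, learning_rate, lam):
    """Gradient step on lam * weight**2: removes a fixed fraction of the weight."""
    return weight - learning_rate * lam * 2 * weight


def l1_step(weight, learning_rate, lam):
    """Step on lam * |weight|: subtracts a constant, clipping to 0 on a sign flip."""
    shrink = learning_rate * lam
    if abs(weight) <= shrink:
        return 0.0  # The update would cross zero, so the weight is zeroed out.
    return weight - shrink if weight > 0 else weight + shrink


w_l2 = w_l1 = 1.0
for _ in range(100):
    w_l2 = l2_step(w_l2, learning_rate=0.1, lam=0.5)
    w_l1 = l1_step(w_l1, learning_rate=0.1, lam=0.5)

print(f"after 100 steps, L2 weight = {w_l2:.2e}")
print(f"after 100 steps, L1 weight = {w_l1}")

# The example from the text: a step of size 0.3 applied to a weight of +0.1
# would land on -0.2, so the weight is set to exactly 0 instead.
clipped = l1_step(0.1, learning_rate=1.0, lam=0.3)
```

With these settings the L2 weight shrinks by 10% per step and stays small but positive, while the L1 weight reaches exactly 0.0.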

L1 regularization — penalizing the absolute value of all the weights — turns out to be quite efficient for wide models.

Note that this description is true for a one-dimensional model.

#### Regularization for Sparsity: Playground Exercise

##### Examining L1 Regularization

This exercise contains a small, slightly noisy, training data set. In this kind of setting, overfitting is a real concern. Regularization might help, but which form of regularization?

This exercise consists of five related tasks. To simplify comparisons across the five tasks, run each task in a separate tab. Notice that the thicknesses of the lines connecting FEATURES and OUTPUT represent the relative weights of each feature.

| Task | Regularization Type | Regularization Rate (lambda) |
|------|----------------------|-------------------------------|
| 1    | L2                   | 0.1                           |
| 2    | L2                   | 0.3                           |
| 3    | L1                   | 0.1                           |
| 4    | L1                   | 0.3                           |
| 5    | L1                   | experiment                    |

Questions:

1. How does switching from L2 to L1 regularization influence the delta between test loss and training loss?
1. How does switching from L2 to L1 regularization influence the learned weights?
1. How does increasing the L1 regularization rate (lambda) influence the learned weights?

Answers:
1. Switching from L2 to L1 regularization dramatically reduces the delta between test loss and training loss.
1. Switching from L2 to L1 regularization dampens all of the learned weights.
1. Increasing the L1 regularization rate generally dampens the learned weights; however, if the regularization rate goes too high, the model can't converge and losses are very high.

#### Regularization for Sparsity: Check Your Understanding

##### L1 regularization

Explore the options below.

Imagine a linear model with 100 input features:
- 10 are highly informative.
- 90 are non-informative.

Assume that all features have values between -1 and 1. Which of the following statements are true?

- [x] L1 regularization will encourage most of the non-informative weights to be exactly 0.0.
- [ ] L1 regularization will encourage many of the non-informative weights to be nearly (but not exactly) 0.0.
- [x] L1 regularization may cause informative features to get a weight of exactly 0.0.

##### L1 vs. L2 Regularization

Explore the options below.

Imagine a linear model with 100 input features, all having values between -1 and 1:
- 10 are highly informative.
- 90 are non-informative.

Which type of regularization will produce the smaller model?

- [ ] L2 regularization.
- [x] L1 regularization.

### Neural Networks

Neural networks are a more sophisticated version of feature crosses. In essence, neural networks learn the appropriate feature crosses for you.

#### Neural Networks: Structure

If you recall from the [Feature Crosses unit](https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture?authuser=1), the following classification problem is nonlinear:

![Cartesian plot. Traditional x axis is labeled 'x1'. Traditional y axis is labeled 'x2'. Blue dots occupy the northwest and southeast quadrants; yellow dots occupy the southwest and northeast quadrants.](https://developers.google.com/machine-learning/crash-course/images/FeatureCrosses1.png?authuser=1)

*Figure 1. Nonlinear classification problem.*

"Nonlinear" means that you can't accurately predict a label with a model of the form $b + w_1x_1 + w_2x_2$. In other words, the "decision surface" is not a line. Previously, we looked at feature crosses as one possible approach to modeling nonlinear problems.

Now consider the following data set:

![Data set contains many orange and many blue dots. It is hard to determine a coherent pattern, but the orange dots vaguely form a spiral and the blue dots perhaps form a different spiral.](https://developers.google.com/machine-learning/crash-course/images/NonLinearSpiral.png?authuser=1)

*Figure 2. A more difficult nonlinear classification problem.*

The data set shown in Figure 2 can't be solved with a linear model.

To see how neural networks might help with nonlinear problems, let's start by representing a linear model as a graph:

![Three blue circles in a row connected by arrows to a green circle above them.](https://developers.google.com/static/machine-learning/crash-course/images/linear_net.svg?authuser=1)

*Figure 3. Linear model as graph.*

Each blue circle represents an input feature, and the green circle represents the weighted sum of the inputs.

How can we alter this model to improve its ability to deal with nonlinear problems?

##### Hidden Layers

In the model represented by the following graph, we've added a "hidden layer" of intermediary values. Each yellow node in the hidden layer is a weighted sum of the blue input node values. The output is a weighted sum of the yellow nodes.

![Three blue circles in a row labeled.](https://developers.google.com/static/machine-learning/crash-course/images/1hidden.svg?authuser=1)

*Figure 4. Graph of two-layer model.*

Is this model linear? Yes — its output is still a linear combination of its inputs.

In the model represented by the following graph, we've added a second hidden layer of weighted sums.

![Three blue circles in a row labeled.](https://developers.google.com/static/machine-learning/crash-course/images/2hidden.svg?authuser=1)

*Figure 5. Graph of three-layer model.*

Is this model still linear? Yes, it is. When you express the output as a function of the input and simplify, you get just another weighted sum of the inputs. This sum won't effectively model the nonlinear problem in Figure 2.
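The simplification can be demonstrated with small matrices. In this sketch the weights are arbitrary made-up integers, biases are omitted for brevity, and the helpers `matvec` and `matmul` are written out by hand so the example needs no external library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]


# Hypothetical weights: 3 inputs -> 4 hidden nodes -> 1 output.
W1 = [[1, -2, 0],
      [0, 1, 3],
      [2, 0, -1],
      [-1, 1, 1]]
W2 = [[1, 0, -1, 2]]

x = [3, 1, 2]

# Pass the input through the hidden layer, then the output layer.
two_layer_output = matvec(W2, matvec(W1, x))

# Collapse both layers into a single weight matrix first.
W_combined = matmul(W2, W1)
collapsed_output = matvec(W_combined, x)

print("two-layer output:", two_layer_output)
print("collapsed weights:", W_combined)
print("collapsed output:", collapsed_output)
```

The two outputs agree for any input, which shows that stacking weighted sums without a nonlinearity adds no expressive power.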

##### Activation Functions

To model a nonlinear problem, we can directly introduce a nonlinearity. We can pipe each hidden layer node through a nonlinear function.

In the model represented by the following graph, the value of each node in Hidden Layer 1 is transformed by a nonlinear function before being passed on to the weighted sums of the next layer. This nonlinear function is called the activation function.

![The same as the previous figure, except that a row of pink circles labeled 'Non-Linear Transformation Layer' has been added in between the two hidden layers.](https://developers.google.com/static/machine-learning/crash-course/images/activation.svg?authuser=1)

*Figure 6. Graph of three-layer model with activation function.*

Now that we've added an activation function, adding layers has more impact. Stacking nonlinearities on nonlinearities lets us model very complicated relationships between the inputs and the predicted outputs. In brief, each layer is effectively learning a more complex, higher-level function over the raw inputs. If you'd like to develop more intuition on how this works, see [Chris Olah's excellent blog post](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/).
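A tiny hand-built network shows what the nonlinearity buys. The sketch below computes XOR on binary inputs, which no purely linear model can represent, using two hidden nodes with ReLU. The weights are chosen by hand for illustration rather than learned, and the function names are hypothetical.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def xor_net(x1, x2):
    # Hidden layer: two weighted sums, each passed through ReLU.
    h1 = relu(x1 + x2)      # weights (1, 1), bias 0
    h2 = relu(x1 + x2 - 1)  # weights (1, 1), bias -1
    # Output layer: a plain weighted sum of the hidden activations.
    return h1 - 2 * h2


outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, output in outputs.items():
    print(inputs, "->", output)
```

If the two `relu` calls were removed, the output would reduce to a single weighted sum of `x1` and `x2`, and it could no longer be 0 for both (0, 0) and (1, 1) while being 1 for the mixed inputs.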

##### Common Activation Functions

The following **sigmoid** activation function converts the weighted sum to a value between 0 and 1.

$ F(x)=\frac{1} {1+e^{-x}} $

Here's a plot:

![Sigmoid function.](https://developers.google.com/static/machine-learning/crash-course/images/sigmoid.svg?authuser=1)

*Figure 7. Sigmoid activation function.*

The following **rectified linear unit** activation function (or ReLU, for short) often works a little better than a smooth function like the sigmoid, while also being significantly easier to compute.

$ F(x)=\max(0,x) $

The superiority of ReLU is based on empirical findings, probably driven by ReLU having a more useful range of responsiveness. A sigmoid's responsiveness falls off relatively quickly on both sides.

![ReLU activation function.](https://developers.google.com/static/machine-learning/crash-course/images/relu.svg?authuser=1)

*Figure 8. ReLU activation function.*

In fact, any mathematical function can serve as an activation function. Suppose that $\sigma$ represents our activation function (ReLU, sigmoid, or whatever). Then the value of a node in the network is given by the following formula:

$ \sigma(\boldsymbol w \cdot \boldsymbol x+b) $
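The formula can be written out in plain Python for a single node. This is a minimal sketch with made-up weights, inputs, and bias; the function names are illustrative and do not correspond to any TensorFlow API.

```python
import math


def sigmoid(z):
    """Squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Passes positive values through and maps negative values to 0."""
    return max(0.0, z)


def node_value(weights, inputs, bias, activation):
    """Compute activation(w . x + b) for one node."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical values; here w . x + b = 0.5 - 2.0 + 1.0 - 0.5 = -1.0.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]
b = -0.5

sigmoid_value = node_value(w, x, b, sigmoid)
relu_value = node_value(w, x, b, relu)

print(f"sigmoid node value: {sigmoid_value:.4f}")
print(f"ReLU node value:    {relu_value:.4f}")
```

Passing the activation in as an argument mirrors the point made above: the weighted sum is the same for every node, and only the function applied to it changes.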

TensorFlow provides out-of-the-box support for many activation functions. You can find these activation functions within [TensorFlow's list of wrappers for primitive neural network operations](https://www.tensorflow.org/api_docs/python/tf/nn?authuser=1). That said, we still recommend starting with ReLU.

##### Summary

Now our model has all the standard components of what people usually mean when they say "neural network":

- A set of nodes, analogous to neurons, organized in layers.
- A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
- A set of biases, one for each node.
- An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.

A caveat: neural networks aren't necessarily always better than feature crosses, but neural networks do offer a flexible alternative that works well in many cases.

#### Neural Networks: Playground Exercises

##### A First Neural Network

In this exercise, we will train our first little neural net. Neural nets will give us a way to learn nonlinear models without the use of explicit feature crosses.

**Task 1**: The model as given combines our two input features into a single neuron. Will this model learn any nonlinearities? Run it to confirm your guess.

**Task 2**: Try increasing the number of neurons in the hidden layer from 1 to 2, and also try changing from a Linear activation to a nonlinear activation like ReLU. Can you create a model that can learn nonlinearities? Can it model the data effectively?

**Task 3**: Try increasing the number of neurons in the hidden layer from 2 to 3, using a nonlinear activation like ReLU. Can it model the data effectively? How does model quality vary from run to run?

**Task 4**: Continue experimenting by adding or removing hidden layers and neurons per layer. Also feel free to change learning rates, regularization, and other learning settings. What is the smallest number of neurons and layers you can use that gives test loss of 0.177 or lower?

Does increasing the model size improve the fit, or how quickly it converges? Does this change how often it converges to a good model? For example, try the following architecture:

- First hidden layer with 3 neurons.
- Second hidden layer with 3 neurons.
- Third hidden layer with 2 neurons.

Answers:
1. The Activation is set to Linear, so this model cannot learn any nonlinearities. The loss is very high, and we say the model underfits the data.
1. The nonlinear activation function can learn nonlinear models. However, a single hidden layer with 2 neurons cannot reflect all the nonlinearities in this data set, and will have high loss even without noise: it still underfits the data. These exercises are nondeterministic, so some runs will not learn an effective model, while other runs will do a pretty good job. The best model may not have the shape you expect!
1. Playground's nondeterministic nature shines through on this exercise. A single hidden layer with 3 neurons is enough to model the data set (absent noise), but not all runs will converge to a good model. 3 neurons are enough because the XOR function can be expressed as a combination of 3 half-planes (ReLU activation). You can see this from looking at the neuron images, which show the output of the individual neurons. In a good model with 3 neurons and ReLU activation, there will be 1 image with an almost vertical line, detecting X1 being positive (or negative; the sign may be switched), 1 image with an almost horizontal line, detecting the sign of X2, and 1 image with a diagonal line, detecting their interaction. However, not all runs will converge to a good model. Some runs will do no better than a model with 2 neurons, and you can see duplicate neurons in these cases.
1. A single hidden layer with 3 neurons can model the data, but there is no redundancy, so on many runs it will effectively lose a neuron and not learn a good model. A single layer with more than 3 neurons has more redundancy, and thus is more likely to converge to a good model.

   As we saw, a single hidden layer with only 2 neurons cannot model the data well. If you try it, you can see then that all of the items in the output layer can only be shapes composed of the lines from those two nodes. In this case, a deeper network can model the data set better than the first hidden layer alone: individual neurons in the second layer can model more complex shapes, like the upper-right quadrant, by combining neurons in the first layer. While adding that second hidden layer can still model the data set better than the first hidden layer alone, it might make more sense to add more nodes to the first layer to let more lines be part of the kit from which the second layer builds its shapes.

   However, a model with 1 neuron in the first hidden layer cannot learn a good model no matter how deep it is. This is because the output of the first layer only varies along one dimension (usually a diagonal line), which isn't enough to model this data set well. Later layers can't compensate for this, no matter how complex; information in the input data has been irrecoverably lost.

   What if instead of trying to have a small network, we had lots of layers with lots of neurons, for a simple problem like this? Well, as we've seen, the first layer will have the ability to try lots of different line slopes. And the second layer will have the ability to accumulate them into lots of different shapes, with lots and lots of shapes on down through the subsequent layers.

   By allowing the model to consider so many different shapes through so many different hidden neurons, you've created enough space for the model to start easily overfitting on the noise in the training set, allowing these complex shapes to match the foibles of the training data rather than the generalized ground truth. In this example, larger models can have complicated boundaries to match the precise data points. In extreme cases, a large model could learn an island around an individual point of noise, which is called memorizing the data. By allowing the model to be so much larger, you'll see that it actually often performs worse than the simpler model with just enough neurons to solve the problem.

##### Neural Net Initialization

This exercise uses the XOR data again, but looks at the repeatability of training Neural Nets and the importance of initialization.

**Task 1**: Run the model as given four or five times. Before each trial, hit the Reset the network button to get a new random initialization. (The Reset the network button is the circular reset arrow just to the left of the Play button.) Let each trial run for at least 500 steps to ensure convergence. What shape does each model output converge to? What does this say about the role of initialization in non-convex optimization?

**Task 2**: Try making the model slightly more complex by adding a layer and a couple of extra nodes. Repeat the trials from Task 1. Does this add any additional stability to the results?

Answers:
1. The learned model had different shapes on each run. The converged test loss varied almost 2X from lowest to highest.
1. Adding the layer and extra nodes produced more repeatable results. On each run, the resulting model looked roughly the same. Furthermore, the converged test loss showed less variance between runs.

##### Neural Net Spiral

This data set is a noisy spiral. Obviously, a linear model will fail here, but even manually defined feature crosses may be hard to construct.

**Task 1**: Train the best model you can, using just X1 and X2. Feel free to add or remove layers and neurons, change learning settings like learning rate, regularization rate, and batch size. What is the best test loss you can get? How smooth is the model output surface?

**Task 2**: Even with Neural Nets, some amount of feature engineering is often needed to achieve best performance. Try adding in additional cross product features or other transformations like sin(X1) and sin(X2). Do you get a better model? Is the model output surface any smoother?

#### Neural Networks: Programming Exercise

The following exercise allows you to develop and train a neural network:

- [Intro to Neural Networks](/Machine%20Learning%20Crash%20Course%20with%20TensorFlow%20APIs/Exercises/Intro_to_Neural_Nets.ipynb) Colab exercise.

### Training Neural Networks

**Backpropagation** is the most common training algorithm for neural networks. It makes gradient descent feasible for multi-layer neural networks. TensorFlow handles backpropagation automatically, so you don't need a deep understanding of the algorithm. To get a sense of how it works, walk through the following: [Backpropagation algorithm visual explanation](https://developers.google.com/machine-learning/crash-course/backprop-scroll?authuser=1). As you scroll through the preceding explanation, note the following:

- How data flows through the graph.
- How dynamic programming lets us avoid computing exponentially many paths through the graph. Here "dynamic programming" just means recording intermediate results on the forward and backward passes.
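
To make "recording intermediate results" concrete, here is a minimal sketch of one forward and one backward pass through a tiny network with a single hidden unit. The network shape, the sigmoid activation, the squared-error loss, and all names are assumptions chosen for illustration; TensorFlow performs the equivalent bookkeeping automatically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x, y, w1, w2):
    """One pass through x -> h = sigmoid(w1 * x) -> y_hat = w2 * h."""
    # Forward pass: record every intermediate value.
    z = w1 * x
    h = sigmoid(z)
    y_hat = w2 * h
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: reuse the recorded values instead of recomputing them.
    d_y_hat = y_hat - y            # dLoss/dy_hat
    d_w2 = d_y_hat * h             # dLoss/dw2
    d_h = d_y_hat * w2             # dLoss/dh
    d_z = d_h * h * (1.0 - h)      # sigmoid'(z) = h * (1 - h)
    d_w1 = d_z * x                 # dLoss/dw1
    return loss, d_w1, d_w2

def numeric_grad_w1(x, y, w1, w2, eps=1e-6):
    """Finite-difference estimate of dLoss/dw1, used as a sanity check."""
    loss_plus, _, _ = forward_backward(x, y, w1 + eps, w2)
    loss_minus, _, _ = forward_backward(x, y, w1 - eps, w2)
    return (loss_plus - loss_minus) / (2 * eps)

loss, d_w1, d_w2 = forward_backward(x=1.5, y=1.0, w1=0.4, w2=-0.7)
```

Comparing `d_w1` against `numeric_grad_w1(1.5, 1.0, 0.4, -0.7)` is a quick way to confirm that the chain-rule steps are wired correctly.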

#### Training Neural Networks: Best Practices

This section explains backpropagation's failure cases and the most common way to regularize a neural network.

##### Failure Cases

There are a number of common ways for backpropagation to go wrong.

###### Vanishing Gradients

The gradients for the lower layers (closer to the input) can become very small. In deep networks, computing these gradients can involve taking the product of many small terms.

When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all.

The ReLU activation function can help prevent vanishing gradients.
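
The "product of many small terms" can be illustrated with a rough sketch. It assumes a sigmoid activation (whose derivative is at most 0.25) and ignores the weights entirely, so it is a simplification rather than a full gradient computation; the function names are hypothetical.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(num_layers, local_derivative):
    """Product of identical per-layer derivative terms along one path."""
    return local_derivative ** num_layers

best_case_sigmoid = sigmoid_derivative(0.0)   # 0.25, the largest value the sigmoid derivative takes
relu_active = 1.0                             # ReLU derivative for a positive input

for depth in (2, 10, 20):
    print(depth,
          gradient_scale(depth, best_case_sigmoid),
          gradient_scale(depth, relu_active))
```

Even in the sigmoid's best case the factor shrinks geometrically with depth, while an active ReLU contributes a factor of 1 at every layer.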

###### Exploding Gradients

If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you can have exploding gradients: gradients that grow so large that training fails to converge.

Batch normalization can help prevent exploding gradients, as can lowering the learning rate.

###### Dead ReLU Units

Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network's output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may not ever change enough to bring the weighted sum back above 0.

Lowering the learning rate can help keep ReLU units from dying.
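
The following sketch shows why a dead unit stays dead: when the weighted sum is negative, the ReLU's local gradient is 0, so a gradient step leaves the unit's parameters untouched. The single-input unit, the values, and the helper names are assumptions for illustration.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU with respect to its input (taken as 0 at z == 0).
    return 1.0 if z > 0 else 0.0

def sgd_step(w, b, x, upstream_grad, learning_rate):
    """One gradient step for a single ReLU unit a = relu(w * x + b)."""
    z = w * x + b
    local = relu_grad(z) * upstream_grad
    return w - learning_rate * local * x, b - learning_rate * local

# A unit whose weighted sum is negative for this input.
w, b = -2.0, -1.0
new_w, new_b = sgd_step(w, b, x=3.0, upstream_grad=0.5, learning_rate=0.1)
```

Here `new_w` and `new_b` equal `w` and `b`: no gradient reaches the parameters, so nothing pushes the weighted sum back above 0 for this input.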

##### Dropout Regularization

Yet another form of regularization, called Dropout, is useful for neural networks. It works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization:

- 0.0 = No dropout regularization.
- 1.0 = Drop out everything. The model learns nothing.
- Values between 0.0 and 1.0 = More useful.
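
As a rough illustration of the dropout rates listed above, the sketch below zeroes each activation with probability `rate`. Rescaling the surviving activations by `1 / (1 - rate)` is a common convention (often called inverted dropout) and is an assumption here, as are the function name and the sample values.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if rate == 1.0:
        return [0.0 for _ in activations]    # everything is dropped
    keep_scale = 1.0 / (1.0 - rate)          # assumed "inverted dropout" rescaling
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [0.5, 1.2, 0.0, 2.0, 0.7, 1.1]

no_dropout = dropout(layer_output, 0.0, rng)     # unchanged
half_dropout = dropout(layer_output, 0.5, rng)   # roughly half the units zeroed
all_dropped = dropout(layer_output, 1.0, rng)    # all zeros: nothing to learn from
```

A fresh random mask is drawn on every call, which mirrors dropping units "for a single gradient step".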

### Multi-Class Neural Networks

Earlier, you encountered binary classification models that could pick between one of two possible choices, such as whether:

- A given email is spam or not spam.
- A given tumor is malignant or benign.

In this module, we'll investigate multi-class classification, which can pick from multiple possibilities. For example:

- Is this dog a beagle, a basset hound, or a bloodhound?
- Is this flower a Siberian Iris, Dutch Iris, Blue Flag Iris, or Dwarf Bearded Iris?
- Is that plane a Boeing 747, Airbus 320, Boeing 777, or Embraer 190?
- Is this an image of an apple, bear, candy, dog, or egg?

Some real-world multi-class problems entail choosing from millions of separate classes. For example, consider a multi-class classification model that can identify the image of just about anything.

#### Multi-Class Neural Networks: One vs. All

One vs. all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers—one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question. For example, given a picture of a dog, five different recognizers might be trained, four seeing the image as a negative example (not an apple, not a bear, etc.) and one seeing the image as a positive example (a dog). That is:

1. Is this image an apple? No.
1. Is this image a bear? No.
1. Is this image candy? No.
1. Is this image a dog? Yes.
1. Is this image an egg? No.

This approach is fairly reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes rises.
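
The five questions above can be sketched as a label-building step: each of the N binary classifiers sees the same example with its own 0/1 label. The helper names and the scores are hypothetical, and choosing the most confident classifier at prediction time is an assumption, not something the text specifies.

```python
CLASSES = ["apple", "bear", "candy", "dog", "egg"]

def one_vs_all_labels(true_class, classes=CLASSES):
    """Return the binary label each of the N classifiers sees for one example."""
    return {name: int(name == true_class) for name in classes}

def predict_one_vs_all(scores):
    """Pick the class whose binary classifier is most confident (an assumed rule)."""
    return max(scores, key=scores.get)

dog_labels = one_vs_all_labels("dog")

# Hypothetical per-classifier scores for the same image.
scores = {"apple": 0.02, "bear": 0.30, "candy": 0.01, "dog": 0.91, "egg": 0.03}
prediction = predict_one_vs_all(scores)
```

Because every class needs its own classifier, the work grows linearly with the number of classes.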

We can create a significantly more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. The following figure suggests this approach:

![A neural network with five hidden layers and five output layers.](https://developers.google.com/static/machine-learning/crash-course/images/OneVsAll.svg?authuser=1)

*Figure 1. A one-vs.-all neural network.*

#### Multi-Class Neural Networks: Softmax

Recall that logistic regression produces a decimal between 0 and 1.0. For example, a logistic regression output of 0.8 from an email classifier suggests an 80% chance of an email being spam and a 20% chance of it being not spam. Clearly, the sum of the probabilities of an email being either spam or not spam is 1.0.

Softmax extends this idea into a multi-class world. That is, Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.

For example, returning to the image analysis we saw in Figure 1, Softmax might produce the following likelihoods of an image belonging to a particular class:

| Class | Probability |
|-------|-------------|
| apple | 0.001       |
| bear  | 0.04        |
| candy | 0.008       |
| dog   | 0.95        |
| egg   | 0.001       |


Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.

![A deep neural net with an input layer, two nondescript hidden layers, then a Softmax layer, and finally an output layer with the same number of nodes as the Softmax layer.](https://developers.google.com/static/machine-learning/crash-course/images/SoftmaxLayer.svg?authuser=1)

*Figure 2. A Softmax layer within a neural network.*

The Softmax equation is as follows:

$ p(y = j|\textbf{x})  = \frac{e^{(\textbf{w}_j^{T}\textbf{x} + b_j)}}{\sum_{k\in K} {e^{(\textbf{w}_k^{T}\textbf{x} + b_k)}}} $

Note that this formula basically extends the formula for logistic regression into multiple classes.
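
A minimal sketch of the equation in plain Python follows. The logits are made-up values standing in for each class's weighted sum plus bias, and subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(logits):
    """Turn one score per class into probabilities that sum to 1."""
    m = max(logits)                              # stability shift; cancels out in the ratio
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for apple, bear, candy, dog, egg.
logits = [-2.0, 1.5, 0.0, 4.5, -2.0]
probs = softmax(logits)

# With two classes and one logit fixed at 0, Softmax reduces to the sigmoid.
two_class = softmax([2.0, 0.0])[0]
sigmoid_of_2 = 1.0 / (1.0 + math.exp(-2.0))
```

The two-class check at the end illustrates the sense in which Softmax extends logistic regression.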

##### Softmax Options

Consider the following variants of Softmax:

- Full Softmax is the Softmax we've been discussing; that is, Softmax calculates a probability for every possible class.
- Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.

Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.
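
One way to picture candidate sampling is as a choice of which classes get scored at all. The sketch below keeps every positive class and draws a random sample of negatives; the function name and sizes are hypothetical, and it omits any correction for sampling probabilities that a real implementation may apply.

```python
import random

def candidate_classes(positive_ids, num_classes, num_sampled, rng):
    """Positive classes plus a uniform random sample of negative classes."""
    positives = set(positive_ids)
    if num_sampled > num_classes - len(positives):
        raise ValueError("not enough negative classes to sample from")
    sampled = set()
    while len(sampled) < num_sampled:
        c = rng.randrange(num_classes)
        if c not in positives:
            sampled.add(c)
    return sorted(positives) + sorted(sampled)

rng = random.Random(0)
candidates = candidate_classes(positive_ids=[7], num_classes=10_000,
                               num_sampled=20, rng=rng)
```

In this example only 21 of the 10,000 classes would need a score for the training step, instead of all of them.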

##### One Label vs. Many Labels

Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For such examples:

- You may not use Softmax.
- You must rely on multiple logistic regressions.

For example, suppose your examples are images containing exactly one item: a piece of fruit. Softmax can determine the likelihood of that one item being a pear, an orange, an apple, and so on. If your examples are images containing all sorts of things (bowls of different kinds of fruit, say), then you'll have to use multiple logistic regressions instead.
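
The multi-label case can be sketched as one independent logistic (sigmoid) output per class. The logits, the class names beyond those in the text, and the 0.5 decision threshold are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multi_label_probabilities(logits):
    """One independent logistic regression output per class."""
    return {name: sigmoid(z) for name, z in logits.items()}

# Hypothetical logits for an image of a fruit bowl.
logits = {"pear": 2.0, "orange": 1.5, "apple": -3.0}
probs = multi_label_probabilities(logits)

# The 0.5 threshold is an assumption; any class above it counts as present.
present = [name for name, p in probs.items() if p >= 0.5]
```

Unlike Softmax outputs, these probabilities are not forced to add up to 1.0, so several classes can be likely at the same time.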

#### Multi-Class Neural Networks: Programming Exercise

In the following exercise, you'll explore Softmax in TensorFlow by developing a model that will classify handwritten digits:

- [Multi-Class Classification with MNIST](/Machine%20Learning%20Crash%20Course%20with%20TensorFlow%20APIs/Exercises/Multi_class_classification_with_MNIST.ipynb) Colab exercise.

### Embeddings

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

#### Embeddings: Motivation From Collaborative Filtering

Collaborative filtering is the task of making predictions about the interests of a user based on interests of many other users. As an example, let's look at the task of movie recommendation. Suppose we have 500,000 users, and a list of the movies each user has watched (from a catalog of 1,000,000 movies). Our goal is to recommend movies to users.

To solve this problem some method is needed to determine which movies are similar to each other. We can achieve this goal by embedding the movies into a low-dimensional space created such that similar movies are nearby.

Before describing how we can learn the embedding, we first explore the type of qualities we want the embedding to have, and how we will represent the training data for learning the embedding.

##### Arrange Movies on a One-Dimensional Number Line

To help develop intuition about embeddings, on a piece of paper, try to arrange the following movies on a one-dimensional number line so that the movies nearest each other are the most closely related:

| Movie                               | [Rating](https://wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system#MPAA_film_ratings) | Description                                                                                                                             |
|--------------------------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------|
| Bleu                                 | R      | A French widow grieves the loss of her husband and daughter after they perish in a car accident.                                     |
| The Dark Knight Rises                | PG-13  | Batman endeavors to save Gotham City from nuclear annihilation in this sequel to The Dark Knight, set in the DC Comics universe.         |
| Harry Potter and the Sorcerer's Stone | PG     | An orphaned boy discovers he is a wizard and enrolls in Hogwarts School of Witchcraft and Wizardry, where he wages his first battle against the evil Lord Voldemort. |
| The Incredibles                      | PG     | A family of superheroes forced to live as civilians in suburbia come out of retirement to save the superhero race from Syndrome and his killer robot. |
| Shrek                                | PG     | A lovable ogre and his donkey sidekick set off on a mission to rescue Princess Fiona, who is imprisoned in her castle by a dragon.          |
| Star Wars                            | PG     | Luke Skywalker and Han Solo team up with two androids to rescue Princess Leia and save the galaxy.                                       |
| The Triplets of Belleville           | PG-13  | When professional cycler Champion is kidnapped during the Tour de France, his grandmother and overweight dog journey overseas to rescue him, with the help of a trio of elderly jazz singers. |
| Memento                              | R      | An amnesiac desperately seeks to solve his wife's murder by tattooing clues onto his body.                                               |

![An illustration of a one-dimensional embedding of movies going from more kid-appropriate movies to more adult-oriented movies.](https://developers.google.com/machine-learning/crash-course/images/Embedding1d.svg?authuser=1)

*Figure 1. A possible one-dimensional arrangement.*

While this embedding does help capture how much the movie is geared towards children versus adults, there are many more aspects of a movie that one would want to capture when making recommendations. Let's take this example one step further, adding a second embedding dimension.

##### Arrange Movies in a Two-Dimensional Space

Try the same exercise as before, but this time arrange the same movies in a two-dimensional space.

![When moving to a two-dimensional movie embedding we now capture both how much the movie is geared towards children and also the degree to which it is a blockbuster or art-house film.](https://developers.google.com/machine-learning/crash-course/images/Embedding2dWithLabels.svg?authuser=1)

*Figure 2. A possible two-dimensional arrangement.*

With this two-dimensional embedding we define a distance between movies such that movies are nearby (and thus inferred to be similar) if they are both alike in the extent to which they are geared towards children versus adults, as well as the extent to which they are blockbuster movies versus arthouse movies. These, of course, are just two of many characteristics of movies that might be important.

More generally, what we've done is mapped these movies into an **embedding space**, where each movie is described by a two-dimensional set of coordinates. For example, in this space, "Shrek" maps to (-1.0, 0.95) and "Bleu" maps to (0.65, -0.2). In general, when learning a d-dimensional embedding, each movie is represented by d real-valued numbers, each one giving the coordinate in one dimension.

In this example, we have given a name to each dimension. When learning embeddings, the individual dimensions are not learned with names. Sometimes, we can look at the embeddings and assign semantic meanings to the dimensions, and other times we cannot. Often, each such dimension is called a **latent dimension**, as it represents a feature that is not explicit in the data but rather inferred from it.

Ultimately, it is the distances between movies in the embedding space that are meaningful, rather than a single movie's values along any given dimension.
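
To illustrate distance as the measure of similarity, the sketch below computes Euclidean distances in the two-dimensional space. The coordinates for "Shrek" and "Bleu" come from the text; the point for "The Incredibles" is made up, and Euclidean distance is one reasonable choice of metric rather than the only one.

```python
import math

embedding = {
    "Shrek": (-1.0, 0.95),
    "Bleu": (0.65, -0.2),
    "The Incredibles": (-0.8, 0.9),   # hypothetical coordinates
}

def distance(a, b):
    (x1, y1), (x2, y2) = embedding[a], embedding[b]
    return math.hypot(x1 - x2, y1 - y2)

def nearest(movie):
    """The other movie closest to `movie` in the embedding space."""
    others = (m for m in embedding if m != movie)
    return min(others, key=lambda m: distance(movie, m))

shrek_to_bleu = distance("Shrek", "Bleu")   # about 2.01
closest_to_shrek = nearest("Shrek")
```

With these coordinates, "Shrek" sits far from "Bleu" and close to the hypothetical point for "The Incredibles", which is how nearby movies would be treated as similar.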

#### Embeddings: Categorical Input Data

Categorical data refers to input features that represent one or more discrete items from a finite set of choices. For example, it can be the set of movies a user has watched, the set of words in a document, or the occupation of a person.

Categorical data is most efficiently represented via sparse tensors, which are tensors with very few non-zero elements. For example, if we're building a movie recommendation model, we can assign a unique ID to each possible movie, and then represent each user by a sparse tensor of the movies they have watched, as shown in Figure 3.

![A sample input for our movie recommendation problem.](https://developers.google.com/machine-learning/crash-course/images/InputRepresentationWithValues.png?authuser=1)

*Figure 3. Data for our movie recommendation problem.*

Each row of the matrix in Figure 3 is an example capturing a user's movie-viewing history, and is represented as a sparse tensor because each user only watches a small fraction of all possible movies. The last row corresponds to the sparse tensor [1, 3, 999999], using the vocabulary indices shown above the movie icons.
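
The difference between the sparse and dense forms can be sketched in a few lines. The catalog size of 1,000,000 comes from the earlier movie example; the helper name is hypothetical.

```python
NUM_MOVIES = 1_000_000

# Sparse form: only the indices of watched movies (the last row of Figure 3).
watched = [1, 3, 999999]

def to_dense(indices, size):
    """Expand a list of indices into a full 0/1 vector."""
    dense = [0] * size
    for i in indices:
        dense[i] = 1
    return dense

dense = to_dense(watched, NUM_MOVIES)
nonzero_fraction = sum(dense) / len(dense)
```

The sparse form stores 3 numbers, while the dense form stores 1,000,000 values of which only 3 are non-zero.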

Likewise one can represent words, sentences, and documents as sparse vectors where each word in the vocabulary plays a role similar to the movies in our recommendation example.

In order to use such [representations](https://developers.google.com/machine-learning/crash-course/representation/video-lecture?authuser=1) within a machine learning system, we need a way to represent each sparse vector as a vector of numbers so that semantically similar items (movies or words) are close together in the vector space. But how do you represent a word as a vector of numbers?

The simplest way is to define a giant input layer with a node for every word in your vocabulary, or at least a node for every word that appears in your data. If 500,000 unique words appear in your data, you could represent a word with a length 500,000 vector and assign each word to a slot in the vector.

If you assign "horse" to index 1247, then to feed "horse" into your network you might copy a 1 into the 1247th input node and 0s into all the rest. This sort of representation is called a **one-hot encoding**, because only one index has a non-zero value.

More typically your vector might contain counts of the words in a larger chunk of text. This is known as a "bag of words" representation. In a bag-of-words vector, several of the 500,000 nodes would have non-zero value.

But however you determine the non-zero values, one-node-per-word gives you very sparse input vectors—very large vectors with relatively few non-zero values. Sparse representations have a couple of problems that can make it hard for a model to learn effectively.
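
The one-hot and bag-of-words representations described above can be sketched with a toy vocabulary. The five-word vocabulary and the whitespace tokenization are assumptions; a real vocabulary might have 500,000 entries.

```python
from collections import Counter

# A tiny hypothetical vocabulary.
vocab = ["a", "horse", "is", "not", "television"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """A vector with a single 1 in the slot assigned to `word`."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(text):
    """Per-word counts for a chunk of text, in vocabulary order."""
    counts = Counter(w for w in text.lower().split() if w in index)
    return [counts[w] for w in vocab]

horse = one_hot("horse")
bow = bag_of_words("A horse is a horse")
```

The one-hot vector has exactly one non-zero slot, while the bag-of-words vector has a count in every slot whose word appears in the text.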

##### Size of Network

Huge input vectors mean a super-huge number of weights for a neural network. If there are M words in your vocabulary and N nodes in the first layer of the network above the input, you have M × N weights to train for that layer. A large number of weights causes further problems:

- Amount of data. The more weights in your model, the more data you need to train effectively.
- Amount of computation. The more weights, the more computation required to train and use the model. It's easy to exceed the capabilities of your hardware.
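
A quick back-of-the-envelope calculation shows how fast the weight count grows. The vocabulary size of 500,000 comes from the text; the first-layer width of 1,000 nodes and the 4 bytes per weight (float32) are assumptions.

```python
def first_layer_weights(vocab_size, first_layer_nodes):
    """Number of weights between the input layer and the first layer."""
    return vocab_size * first_layer_nodes

M = 500_000     # vocabulary size used in the text
N = 1_000       # hypothetical first-layer width

weights = first_layer_weights(M, N)
megabytes = weights * 4 / 1e6      # assuming 4-byte float32 weights
```

Under these assumptions the first layer alone holds 500 million weights, or about 2,000 MB, before any other layer is counted.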

##### Lack of Meaningful Relations Between Vectors

If you feed the pixel values of RGB channels into an image classifier, it makes sense to talk about "close" values. Reddish blue is close to pure blue, both semantically and in terms of the geometric distance between vectors. But a vector with a 1 at index 1247 for "horse" is not any closer to a vector with a 1 at index 50,430 for "antelope" than it is to a vector with a 1 at index 238 for "television".
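
This point can be checked directly: any two distinct one-hot vectors are the same distance apart. The sketch uses the indices from the text; the vector length of 60,000 is a hypothetical size chosen only to be large enough to hold them.

```python
import math

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

SIZE = 60_000   # hypothetical vocabulary size
horse, antelope, television = (one_hot(i, SIZE) for i in (1247, 50430, 238))

horse_to_antelope = euclidean(horse, antelope)
horse_to_television = euclidean(horse, television)
```

Both distances equal the square root of 2, so the encoding carries no information about which words are semantically related.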

##### The Solution: Embeddings

The solution to these problems is to use embeddings, which translate large sparse vectors into a lower-dimensional space that preserves semantic relationships. We'll explore embeddings intuitively, conceptually, and programmatically in the following sections of this module.

#### Embeddings: Translating to a Lower-Dimensional Space

You can solve the core problems of sparse input data by mapping your high-dimensional data into a lower-dimensional space.

As you saw in the movie exercises earlier, even a small multi-dimensional space provides the freedom to group semantically similar items together and keep dissimilar items far apart. Position (distance and direction) in the vector space can encode semantics in a good embedding. For example, the following visualizations of real embeddings show geometrical relationships that capture semantic relations like the relation between a country and its capital:

![Three examples of word embeddings that represent word relationships geometrically: gender (man/woman and king/queen), verb tense (walking/walked and swimming/swam), and capital cities (Turkey/Ankara and Vietnam/Hanoi).](https://developers.google.com/static/machine-learning/crash-course/images/linear-relationships.svg?authuser=1)

*Figure 4. Embeddings can produce remarkable analogies.*

This sort of meaningful space gives your machine learning system opportunities to detect patterns that may help with the learning task.

##### Shrinking the network

While we want enough dimensions to encode rich semantic relations, we also want an embedding space that is small enough to allow us to train our system more quickly. A useful embedding may be on the order of hundreds of dimensions. This is likely several orders of magnitude smaller than the size of your vocabulary for a natural language task.

#### Embeddings: Obtaining Embeddings

There are a number of ways to get an embedding, including a state-of-the-art algorithm created at Google.

##### Standard Dimensionality Reduction Techniques

There are many existing mathematical techniques for capturing the important structure of a high-dimensional space in a low dimensional space. In theory, any of these techniques could be used to create an embedding for a machine learning system.

For example, principal component analysis (PCA) has been used to create word embeddings. Given a set of instances like bag of words vectors, PCA tries to find highly correlated dimensions that can be collapsed into a single dimension.
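
As a toy illustration of collapsing correlated dimensions, the sketch below finds the first principal component of two strongly correlated features and projects each point onto it. The data is made up, and power iteration on the 2×2 covariance matrix is used only to keep the example dependency-free; it is one of several ways to compute PCA.

```python
import math

def top_principal_component(points, iterations=100):
    """Direction of greatest variance for 2-D points, via power iteration."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(2)]
    centered = [[p[0] - mean[0], p[1] - mean[1]] for p in points]
    # 2x2 sample covariance matrix.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(2)]
           for i in range(2)]
    v = [1.0, 0.0]
    for _ in range(iterations):
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return mean, v

def project(points, mean, direction):
    """Coordinate of each point along `direction` (2-D collapsed to 1-D)."""
    return [(p[0] - mean[0]) * direction[0] + (p[1] - mean[1]) * direction[1]
            for p in points]

# Hypothetical data in which the second feature is roughly twice the first.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
mean, direction = top_principal_component(points)
coords_1d = project(points, mean, direction)
```

Each point is now described by a single number in `coords_1d`, with little information lost because the two original features were highly correlated.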

##### Word2vec

Word2vec is an algorithm invented at Google for training word embeddings. Word2vec relies on the distributional hypothesis to map semantically similar words to geometrically close embedding vectors.

The distributional hypothesis states that words which often have the same neighboring words tend to be semantically similar. Both "dog" and "cat" frequently appear close to the word "veterinarian", and this fact reflects their semantic similarity. As the linguist John Firth put it in 1957, "You shall know a word by the company it keeps".

Word2vec exploits contextual information like this by training a neural net to distinguish actually co-occurring groups of words from randomly grouped words. The input layer takes a sparse representation of a target word together with one or more context words. This input connects to a single, smaller hidden layer.

In one version of the algorithm, the system makes a negative example by substituting a random noise word for the target word. Given the positive example "the plane flies", the system might swap in "jogging" to create the contrasting negative example "the jogging flies".

The other version of the algorithm creates negative examples by pairing the true target word with randomly chosen context words. So it might take the positive examples (the, plane), (flies, plane) and the negative examples (compiled, plane), (who, plane) and learn to identify which pairs actually appeared together in text.
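
The pair-based version can be sketched as a data-preparation step. The window size, the toy vocabulary of noise words, and the helper names are assumptions; this is not the actual word2vec implementation, only an illustration of how positive and negative pairs might be built.

```python
import random

def context_pairs(tokens, window=1):
    """Positive (context, target) pairs from words that actually co-occur."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((tokens[j], target))
    return pairs

def negative_pairs(target, vocabulary, true_contexts, k, rng):
    """Pair the target with k random words that were not its context."""
    candidates = [w for w in vocabulary if w != target and w not in true_contexts]
    return [(w, target) for w in rng.sample(candidates, k)]

sentence = ["the", "plane", "flies"]
positives = [p for p in context_pairs(sentence) if p[1] == "plane"]

# Hypothetical vocabulary that includes some noise words.
vocabulary = ["the", "plane", "flies", "compiled", "who", "jogging"]
negatives = negative_pairs("plane", vocabulary, {"the", "flies"},
                           k=2, rng=random.Random(0))
```

A classifier would then be trained to label the pairs in `positives` as real and the pairs in `negatives` as noise.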

The classifier is not the real goal for either version of the system, however. After the model has been trained, you have an embedding. You can use the weights connecting the input layer with the hidden layer to map sparse representations of words to smaller vectors. This embedding can be reused in other classifiers.

For more information about word2vec, see the [tutorial on tensorflow.org](https://www.tensorflow.org/tutorials/word2vec/index.html?authuser=1).

##### Training an Embedding as Part of a Larger Model

You can also learn an embedding as part of the neural network for your target task. This approach gets you an embedding well customized for your particular system, but may take longer than training the embedding separately.

In general, when you have sparse data (or dense data that you'd like to embed), you can create an embedding unit that is just a special type of hidden unit of size d. This embedding layer can be combined with any other features and hidden layers. As in any DNN, the final layer will be the loss that is being optimized. For example, let's say we're performing collaborative filtering, where the goal is to predict a user's interests from the interests of other users. We can model this as a supervised learning problem by randomly setting aside (or holding out) a small number of the movies that the user has watched as the positive labels, and then optimize a softmax loss.

![](https://developers.google.com/static/machine-learning/crash-course/images/EmbeddingExample3-1.svg?authuser=1)

*Figure 5. A sample DNN architecture for learning movie embeddings from collaborative filtering data.*

As another example, if you want to create an embedding layer for the words in a real-estate ad as part of a DNN to predict housing prices, then you'd optimize an L2 Loss using the known sale price of homes in your training data as the label.

When learning a d-dimensional embedding, each item is mapped to a point in a d-dimensional space so that similar items are nearby in this space. Figure 6 helps to illustrate the relationship between the weights learned in the embedding layer and the geometric view. The edge weights between an input node and the nodes in the d-dimensional embedding layer correspond to the coordinate values for each of the d axes.

![A figure illustrating the relationship between the embedding layer weights and the geometric view of the embedding.](https://developers.google.com/static/machine-learning/crash-course/images/dnn-to-geometric-view.svg?authuser=1)

*Figure 6. A geometric view of the embedding layer weights.*
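
The correspondence between edge weights and coordinates can be sketched as follows: multiplying a one-hot input by the embedding layer's weight matrix simply selects one row of that matrix. The 4-item vocabulary, the weight values, and the helper names are hypothetical.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    """Multiply a row vector of length M by an M x d matrix."""
    d = len(matrix[0])
    return [sum(vector[i] * matrix[i][k] for i in range(len(vector)))
            for k in range(d)]

# Hypothetical weights for a 4-item vocabulary embedded in d = 2 dimensions.
# Row i holds the edge weights from input node i to the embedding layer.
weights = [
    [-1.0, 0.95],
    [0.65, -0.2],
    [0.3, 0.4],
    [-0.5, -0.6],
]

item = 1
via_matmul = matvec(one_hot(item, len(weights)), weights)
via_lookup = weights[item]
```

Both `via_matmul` and `via_lookup` give the same d coordinates, so the learned weights are the embedding.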

## Part 2 - Machine Learning Engineering

### Production ML Systems

There's a lot more to machine learning than just implementing an ML algorithm. A production ML system involves a significant number of components.

#### Video Lecture Summary

So far, Machine Learning Crash Course has focused on building ML models. However, as the following figure suggests, real-world production ML systems are large ecosystems of which the model is just a single part.

![ML system diagram containing the following components: data collection, feature extraction, process management tools, data verification, configuration, machine resource management, monitoring, and serving infrastructure, and ML code. The ML code part of the diagram is dwarfed by the other nine components.](https://developers.google.com/static/machine-learning/crash-course/images/MlSystem.svg?authuser=1)

*Figure 1. Real-world production ML system.*

The ML code is at the heart of a real-world ML production system, but that box often represents only 5% or less of the overall code of that total ML production system. (That's not a misprint.) Notice that an ML production system devotes considerable resources to input data—collecting it, verifying it, and extracting features from it. Furthermore, notice that a serving infrastructure must be in place to put the ML model's predictions into practical use in the real world.

Fortunately, many of the components in the preceding figure are reusable. Furthermore, you don't have to build all the components in Figure 1 yourself.

[TensorFlow Extended (TFX)](https://www.tensorflow.org/tfx?authuser=1) is an end-to-end platform for deploying production ML pipelines.

Subsequent modules will help guide your design decisions in building a production ML system.

### Static vs. Dynamic Training

Broadly speaking, there are two ways to train a model:

- A static model is trained offline. That is, we train the model exactly once and then use that trained model for a while.
- A dynamic model is trained online. That is, data is continually entering the system and we're incorporating that data into the model through continuous updates.

#### Video Lecture Summary

Broadly speaking, the following points dominate the static vs. dynamic training decision:

- Static models are easier to build and test.
- Dynamic models adapt to changing data. The world is a highly changeable place. Sales predictions built from last year's data are unlikely to successfully predict next year's results.

If your data set truly isn't changing over time, choose static training because it is cheaper to create and maintain than dynamic training. However, many information sources really do change over time, even those with features that you think are as constant as, say, sea level. The moral: even with static training, you must still monitor your input data for change.

For example, consider a model trained to predict the probability that users will buy flowers. Because of time pressure, the model is trained only once using a dataset of flower buying behavior during July and August. The model is then shipped off to serve predictions in production, but is never updated. The model works fine for several months, but then makes terrible predictions around Valentine's Day because user behavior during that holiday period changes dramatically.
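
The moral above (monitor your input data even with static training) can be sketched as a simple check that compares a feature's recent mean against its training-time mean. This is only an illustration: the function name `feature_has_drifted`, the threshold `max_z`, and all the purchase numbers are hypothetical, and a production system would likely use more robust drift tests.

```python
from statistics import mean, stdev

def feature_has_drifted(training_values, serving_values, max_z=3.0):
    """Return True if the serving mean is far from the training mean.

    Distance is measured in standard errors of the serving mean, using the
    training standard deviation. The threshold max_z is an arbitrary assumption.
    """
    train_mean = mean(training_values)
    train_std = stdev(training_values)
    standard_error = train_std / len(serving_values) ** 0.5
    z = abs(mean(serving_values) - train_mean) / standard_error
    return z > max_z

# Hypothetical daily flower purchases per 1,000 users.
summer = [10, 12, 11, 9, 10, 11, 12, 10]   # training period (July and August)
october = [11, 10, 12, 9, 11, 10]          # ordinary serving traffic
february = [30, 42, 55, 61, 48, 35]        # around Valentine's Day

print(feature_has_drifted(summer, october))   # False
print(feature_has_drifted(summer, february))  # True
```

A check like this does not fix the model, but it can signal that the statically trained model is now seeing data unlike what it was trained on.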

#### Static vs. Dynamic Training: Check Your Understanding

##### Dynamic (Online) Training

Explore the options below.

Which one of the following statements is true of dynamic (online) training?

- [ ] Very little monitoring of training jobs needs to be done.
- [x] The model stays up to date as new data arrives.
- [ ] Very little monitoring of input data needs to be done at inference time.

##### Static (Offline) Training

Explore the options below.

Which of the following statements are true about static (offline) training?

- [x] You can verify the model before applying it in production.
- [ ] The model stays up to date as new data arrives.
- [ ] Very little monitoring of input data needs to be done at inference time.
- [x] Offline training requires less monitoring of training jobs than online training.

### Static vs. Dynamic Inference

You can choose either of the following inference strategies:

- offline inference, meaning that you make all possible predictions in a batch, using a MapReduce or something similar. You then write the predictions to an SSTable or Bigtable, and then feed these to a cache/lookup table.
- online inference, meaning that you predict on demand, using a server.
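
The two strategies can be contrasted in a minimal sketch. Everything here is a stand-in: `predict` is a made-up scoring rule rather than a trained model, and a plain dictionary plays the role of the cache/lookup table.

```python
def predict(features):
    """Stand-in for a trained model; the scoring rule is a made-up assumption."""
    return 0.1 * features["clicks"] + 0.5 * features["purchases"]

# Offline (static) inference: score every known item in a batch job, then
# store the predictions in a lookup table for serving.
known_items = {
    "item_a": {"clicks": 10, "purchases": 2},
    "item_b": {"clicks": 3, "purchases": 0},
}
prediction_table = {item_id: predict(f) for item_id, f in known_items.items()}

def serve_offline(item_id):
    # Cheap lookup at serving time, but items the batch job never saw
    # have no prediction at all (returns None).
    return prediction_table.get(item_id)

def serve_online(features):
    # Online (dynamic) inference: run the model on demand for any input.
    return predict(features)
```

With offline inference, `serve_offline("item_c")` returns `None` because the item was not in the batch, whereas `serve_online` can score any feature dictionary at the cost of running the model per request.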

#### Video Lecture Summary

Here are the pros and cons of offline inference:

- Pro: Don’t need to worry much about cost of inference.
- Pro: Can likely use batch quota or some giant MapReduce.
- Pro: Can do post-verification of predictions before pushing.
- Con: Can only predict things we know about — bad for long tail.
- Con: Update latency is likely measured in hours or days.

Here are the pros and cons of online inference:

- Pro: Can make a prediction on any new item as it comes in — great for long tail.
- Con: Compute intensive, latency sensitive—may limit model complexity.
- Con: Monitoring needs are more intensive.

#### Static vs. Dynamic Inference: Check Your Understanding

##### Static (Offline) Inference

Explore the options below.

In offline inference, we make predictions on a big batch of data all at once. We then put those predictions in a look-up table for later use. Which of the following are true of offline inference?

- [x] We must create predictions for all possible inputs.
- [x] After generating the predictions, we can verify them before applying them.
- [ ] We will need to carefully monitor our input signals over a long period of time.
- [ ] We will be able to react quickly to changes in the world.
- [x] For a given input, we can serve a prediction more quickly than with online inference.

##### Dynamic (Online) Inference

Explore the options below.

Dynamic (online) inference means making predictions on demand. That is, in online inference, we put the trained model on a server and issue inference requests as needed. Which of the following are true of dynamic inference?

- [x] You must carefully monitor input signals.
- [x] You can provide predictions for all possible items.
- [ ] When performing online inference, you do not need to worry about prediction latency (the lag time for returning predictions) as much as when performing offline inference.
- [ ] You can do post-verification of predictions before they are used. 

### Data Dependencies

Data is as important to ML developers as code is to traditional programmers. This lesson focuses on the kinds of questions you should be asking of your data.

#### Video Lecture Summary

The behavior of an ML system is dependent on the behavior and qualities of its input features. As the input data for those features changes, so too will your model. Sometimes that change is desirable, but sometimes it is not.

In traditional software development, you focus more on code than on data. In machine learning development, although coding is still part of the job, your focus must widen to include data. For example, on traditional software development projects, it is a best practice to write unit tests to validate your code. On ML projects, you must also continuously test, verify, and monitor your input data.

For example, you should continuously monitor your model to remove unused (or little used) features. Imagine a certain feature that has been contributing little or nothing to the model. If the input data for that feature abruptly changes, your model's behavior might also abruptly change in undesirable ways.

##### Reliability

Some questions to ask about the reliability of your input data:

- Is the signal always going to be available or is it coming from an unreliable source? For example:
    - Is the signal coming from a server that crashes under heavy load?
    - Is the signal coming from humans who go on vacation every August?

##### Versioning

Some questions to ask about versioning:

- Does the system that computes this data ever change? If so:
    - How often?
    - How will you know when that system changes?

Sometimes, data comes from an upstream process. If that process changes abruptly, your model can suffer.

Consider creating your own copy of the data you receive from the upstream process. Then, only advance to the next version of the upstream data when you are certain that it is safe to do so.

##### Necessity

The following question might remind you of [regularization](https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/video-lecture?authuser=1):

- Does the usefulness of the feature justify the cost of including it?

It is always tempting to add more features to the model. For example, suppose you find a new feature whose addition makes your model slightly more accurate. More accuracy certainly sounds better than less accuracy. However, now you've just added to your maintenance burden. That additional feature could degrade unexpectedly, so you've got to monitor it. Think carefully before adding features that lead to minor short-term wins.

##### Correlations

Some features correlate (positively or negatively) with other features. Ask yourself the following question:

- Are any features so tied together that you need additional strategies to tease them apart?
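
One way to start answering this question is to compute pairwise Pearson correlations and flag strongly correlated pairs. The sketch below uses NumPy's `corrcoef`; the feature values, the helper name `highly_correlated_pairs`, and the 0.95 threshold are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical feature columns for five housing blocks.
features = {
    "total_rooms": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "households": [210.0, 390.0, 610.0, 790.0, 1010.0],
    "median_income": [5.0, 2.0, 4.5, 2.5, 3.5],
}

def highly_correlated_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for feature pairs with |Pearson r| >= threshold."""
    names = list(columns)
    # Each row passed to corrcoef is treated as one variable.
    matrix = np.corrcoef([columns[name] for name in names])
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = float(matrix[i, j])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    return pairs

for name_a, name_b, r in highly_correlated_pairs(features):
    print(f"{name_a} and {name_b}: r = {r:.3f}")
```

In this made-up data, only `total_rooms` and `households` are flagged, which suggests the two features carry largely the same information.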

##### Feedback Loops

Sometimes a model can affect its own training data. For example, the results from some models, in turn, are directly or indirectly input features to that same model.

Sometimes a model can affect another model. For example, consider two models for predicting stock prices:

- Model A, which is a bad predictive model.
- Model B.

Since Model A is buggy, it mistakenly decides to buy shares of Stock X. Those purchases drive up the price of Stock X. Model B uses the price of Stock X as an input feature, so Model B can easily come to some false conclusions about the value of Stock X. Model B could, therefore, buy or sell shares of Stock X based on the buggy behavior of Model A. Model B's behavior, in turn, can affect Model A, possibly triggering a tulip mania or a slide in the price of Stock X.
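
The first case above, a model that affects its own training data, can be illustrated with a deliberately simplified simulation. The setup is hypothetical: a recommender (`simulate_popularity_loop`, an invented name) always recommends the item with the most purchases so far, and we assume every recommendation leads to a purchase.

```python
def simulate_popularity_loop(purchases, rounds):
    """Recommend the most purchased item each round; assume the user buys it."""
    counts = dict(purchases)
    for _ in range(rounds):
        recommended = max(counts, key=counts.get)  # the model's output
        counts[recommended] += 1                   # output becomes new input data
    return counts

start = {"item_a": 11, "item_b": 10, "item_c": 10}
print(simulate_popularity_loop(start, rounds=50))
# {'item_a': 61, 'item_b': 10, 'item_c': 10}
```

A one-purchase head start turns into a large lead because each prediction feeds the very signal the next prediction is based on.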

#### Data Dependencies: Check Your Understanding

Explore the options below.

Which of the following models are susceptible to a feedback loop?

- [ ] An election-results model that forecasts the winner of a mayoral race by surveying 2% of voters after the polls have closed.
- [ ] A face-attributes model that detects whether a person is smiling in a photo, which is regularly trained on a database of stock photography that is automatically updated monthly.
- [x] A traffic-forecasting model that predicts congestion at highway exits near the beach, using beach crowd size as one of its features.
- [x] A book-recommendation model that suggests novels its users may like based on their popularity (i.e., the number of times the books have been purchased).
- [ ] A housing-value model that predicts house prices, using size (area in square meters), number of bedrooms, and geographic location as features.
- [x] A university-ranking model that rates schools in part by their selectivity—the percentage of students who applied that were admitted.

### Fairness

Evaluating a machine learning model responsibly requires doing more than just calculating loss metrics. Before putting a model into production, it's critical to audit training data and evaluate predictions for bias.

This module looks at different types of human biases that can manifest in training data. It then provides strategies to identify them and evaluate their effects.

#### Fairness: Types of Bias

Machine learning models are not inherently objective. Engineers train models by feeding them a data set of training examples, and human involvement in the provision and curation of this data can make a model's predictions susceptible to bias.

When building models, it's important to be aware of common human biases that can manifest in your data, so you can take proactive steps to mitigate their effects.

##### Reporting Bias

Reporting bias occurs when the frequency of events, properties, and/or outcomes captured in a data set does not accurately reflect their real-world frequency. This bias can arise because people tend to focus on documenting circumstances that are unusual or especially memorable, assuming that the ordinary can "go without saying."

    EXAMPLE: A sentiment-analysis model is trained to predict whether book reviews are positive or negative based on a corpus of user submissions to a popular website. The majority of reviews in the training data set reflect extreme opinions (reviewers who either loved or hated a book), because people were less likely to submit a review of a book if they did not respond to it strongly. As a result, the model is less able to correctly predict sentiment of reviews that use more subtle language to describe a book.

##### Automation Bias

Automation bias is a tendency to favor results generated by automated systems over those generated by non-automated systems, irrespective of the error rates of each.

    EXAMPLE: Software engineers working for a sprocket manufacturer were eager to deploy the new "groundbreaking" model they trained to identify tooth defects, until the factory supervisor pointed out that the model's precision and recall rates were both 15% lower than those of human inspectors.

##### Selection Bias

Selection bias occurs if a data set's examples are chosen in a way that is not reflective of their real-world distribution. Selection bias can take many different forms:

- Coverage bias: Data is not selected in a representative fashion.

    `EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product. Consumers who instead opted to buy a competing product were not surveyed, and as a result, this group of people was not represented in the training data.`

- Non-response bias (or participation bias): Data ends up being unrepresentative due to participation gaps in the data-collection process.

    `EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Consumers who bought the competing product were 80% more likely to refuse to complete the survey, and their data was underrepresented in the sample.`

- Sampling bias: Proper randomization is not used during data collection.

    `EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Instead of randomly targeting consumers, the surveyor chose the first 200 consumers that responded to an email, who might have been more enthusiastic about the product than average purchasers.`

##### Group Attribution Bias

Group attribution bias is a tendency to generalize what is true of individuals to an entire group to which they belong. Two key manifestations of this bias are:

- In-group bias: A preference for members of a group to which you also belong, or for characteristics that you also share.

    `EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that applicants who attended the same computer-science academy as they both did are more qualified for the role.`

- Out-group homogeneity bias: A tendency to stereotype individual members of a group to which you do not belong, or to see their characteristics as more uniform.

    `EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that all applicants who did not attend a computer-science academy do not have sufficient expertise for the role.`

##### Implicit Bias

Implicit bias occurs when assumptions are made based on one's own mental models and personal experiences that do not necessarily apply more generally.

    EXAMPLE: An engineer training a gesture-recognition model uses a head shake as a feature to indicate a person is communicating the word "no." However, in some regions of the world, a head shake actually signifies "yes."

A common form of implicit bias is confirmation bias, where model builders unconsciously process data in ways that affirm preexisting beliefs and hypotheses. In some cases, a model builder may actually keep training a model until it produces a result that aligns with their original hypothesis; this is called experimenter's bias.

    EXAMPLE: An engineer is building a model that predicts aggressiveness in dogs based on a variety of features (height, weight, breed, environment). The engineer had an unpleasant encounter with a hyperactive toy poodle as a child, and ever since has associated the breed with aggression. When the trained model predicted most toy poodles to be relatively docile, the engineer retrained the model several more times until it produced a result showing smaller poodles to be more violent.

#### Fairness: Identifying Bias

As you explore your data to determine how best to represent it in your model, it's important to also keep issues of fairness in mind and proactively audit for potential sources of bias.

Where might bias lurk? Here are three red flags to look out for in your data set.

##### Missing Feature Values

If your data set has one or more features that have missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.

For example, the table below shows a summary of key stats for a subset of features in the [California Housing dataset](https://developers.google.com/machine-learning/crash-course/california-housing-data-description?authuser=1), stored in a pandas DataFrame and generated via `DataFrame.describe`. Note that all features have a count of 17000, indicating there are no missing values:

|           | longitude | latitude | total_rooms | population | households | median_income | median_house_value |
|-----------|-----------|----------|-------------|------------|------------|---------------|--------------------|
| count     | 17000.0   | 17000.0  | 17000.0     | 17000.0    | 17000.0    | 17000.0       | 17000.0            |
| mean      | -119.6    | 35.6     | 2643.7      | 1429.6     | 501.2      | 3.9           | 207.3              |
| std       | 2.0       | 2.1      | 2179.9      | 1147.9     | 384.5      | 1.9           | 116.0              |
| min       | -124.3    | 32.5     | 2.0         | 3.0        | 1.0        | 0.5           | 15.0               |
| 25%       | -121.8    | 33.9     | 1462.0      | 790.0      | 282.0      | 2.6           | 119.4              |
| 50%       | -118.5    | 34.2     | 2127.0      | 1167.0     | 409.0      | 3.5           | 180.4              |
| 75%       | -118.0    | 37.7     | 3151.2      | 1721.0     | 605.2      | 4.8           | 265.0              |
| max       | -114.3    | 42.0     | 37937.0     | 35682.0    | 6082.0     | 15.0          | 500.0              |

Suppose instead that three features (population, households, and median_income) only had a count of 3000—in other words, that there were 14,000 missing values for each feature:

|           | longitude | latitude | total_rooms | population | households | median_income | median_house_value |
|-----------|-----------|----------|-------------|------------|------------|---------------|--------------------|
| count     | 17000.0   | 17000.0  | 17000.0     | 3000.0     | 3000.0     | 3000.0        | 17000.0            |
| mean      | -119.6    | 35.6     | 2643.7      | 1429.6     | 501.2      | 3.9           | 207.3              |
| std       | 2.0       | 2.1      | 2179.9      | 1147.9     | 384.5      | 1.9           | 116.0              |
| min       | -124.3    | 32.5     | 2.0         | 3.0        | 1.0        | 0.5           | 15.0               |
| 25%       | -121.8    | 33.9     | 1462.0      | 790.0      | 282.0      | 2.6           | 119.4              |
| 50%       | -118.5    | 34.2     | 2127.0      | 1167.0     | 409.0      | 3.5           | 180.4              |
| 75%       | -118.0    | 37.7     | 3151.2      | 1721.0     | 605.2      | 4.8           | 265.0              |
| max       | -114.3    | 42.0     | 37937.0     | 35682.0    | 6082.0     | 15.0          | 500.0              |

These 14,000 missing values would make it much more difficult to accurately correlate median income of households with median house prices. Before training a model on this data, it would be prudent to investigate the cause of these missing values to ensure that there are no latent biases responsible for missing income and population data.
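
A quick audit for missing values might look like the following sketch. The tiny DataFrame is made up (it is not the California Housing data), and the 50% cutoff for flagging a feature is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# A tiny, made-up stand-in for the housing DataFrame (values are illustrative).
df = pd.DataFrame({
    "total_rooms": [1462.0, 2127.0, 3151.0, 2643.0],
    "population": [790.0, np.nan, np.nan, 1429.0],
    "median_income": [2.6, np.nan, 4.8, np.nan],
})

# describe() reports the number of non-missing values per column in its "count" row.
print(df.describe().loc["count"])

# isna().mean() gives the fraction of missing values per feature.
missing_fraction = df.isna().mean()
suspect_features = missing_fraction[missing_fraction >= 0.5].index.tolist()
print(suspect_features)  # ['population', 'median_income']
```

Flagged features are a prompt to investigate why the values are missing, not a signal to drop them automatically.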

##### Unexpected Feature Values

When exploring data, you should also look for examples that contain feature values that stand out as especially uncharacteristic or unusual. These unexpected feature values could indicate problems that occurred during data collection or other inaccuracies that could introduce bias.

For example, take a look at the following excerpted examples from the California housing data set:

|       | longitude | latitude | total_rooms | population | households | median_income | median_house_value |
|-------|-----------|----------|-------------|------------|------------|---------------|--------------------|
| 1     | -121.7    | 38.0     | 7105.0      | 3523.0     | 1088.0     | 5.0           | 0.2                |
| 2     | -122.4    | 37.8     | 2479.0      | 1816.0     | 496.0      | 3.1           | 0.3                |
| 3     | -122.0    | 37.0     | 2813.0      | 1337.0     | 477.0      | 3.7           | 0.3                |
| 4     | -103.5    | 43.8     | 2212.0      | 803.0      | 144.0      | 5.3           | 0.2                |
| 5     | -117.1    | 32.8     | 2963.0      | 1162.0     | 556.0      | 3.6           | 0.2                |
| 6     | -118.0    | 33.7     | 3396.0      | 1542.0     | 472.0      | 7.4           | 0.4                |

Can you pinpoint any unexpected feature values?

Answer: The longitude and latitude coordinates in example 4 (-103.5 and 43.8, respectively) do not fall within the U.S. state of California. In fact, they are the approximate coordinates of Mount Rushmore National Memorial in the state of South Dakota. This is a bogus example that we inserted into the data set. 
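
A check like this can be automated. The sketch below tests the six excerpted coordinate pairs against a rough bounding box for California; the box limits are approximations chosen for illustration, and a bounding box is only a coarse filter because the state is not rectangular.

```python
# Approximate bounding box for California (an assumption for illustration;
# a real check might use the state's actual boundary).
CA_LONGITUDE = (-124.5, -114.0)
CA_LATITUDE = (32.5, 42.0)

# (longitude, latitude) for the six excerpted examples above.
examples = {
    1: (-121.7, 38.0),
    2: (-122.4, 37.8),
    3: (-122.0, 37.0),
    4: (-103.5, 43.8),
    5: (-117.1, 32.8),
    6: (-118.0, 33.7),
}

def outside_california(longitude, latitude):
    in_lon = CA_LONGITUDE[0] <= longitude <= CA_LONGITUDE[1]
    in_lat = CA_LATITUDE[0] <= latitude <= CA_LATITUDE[1]
    return not (in_lon and in_lat)

suspicious = [idx for idx, (lon, lat) in examples.items() if outside_california(lon, lat)]
print(suspicious)  # [4]
```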

##### Data Skew

Any sort of skew in your data, where certain groups or characteristics may be under- or over-represented relative to their real-world prevalence, can introduce bias into your model.

If you completed the Validation programming exercise, you may recall discovering how a failure to randomize the California housing data set prior to splitting it into training and validation sets resulted in a pronounced data skew. Figure 1 visualizes a subset of data drawn from the full data set that exclusively represents the northwest region of California.

![A California state map overlaid with data from the California Housing data set. Each dot represents a housing block. Dots are all clustered in northwest California, with no dots in southern California, illustrating the geographical skew of the data.](https://developers.google.com/static/machine-learning/crash-course/images/california_housing_state_map.svg?authuser=1)

*Figure 1. California state map overlaid with data from the California Housing data set. Each dot represents a housing block, with colors ranging from blue to red corresponding to median house price ranging from low to high, respectively.*

If this unrepresentative sample were used to train a model to predict California housing prices statewide, the lack of housing data from southern portions of California would be problematic. The geographical bias encoded in the model might adversely affect homebuyers in unrepresented communities.
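
The skew described above comes from splitting sorted data without shuffling first. Here is a minimal sketch of the remedy, using hypothetical examples tagged only by region; the helper name `split_examples` and its parameters are assumptions.

```python
import random

def split_examples(examples, validation_size, seed=0):
    """Shuffle a copy of the examples, then split into training and validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - validation_size
    return shuffled[:cut], shuffled[cut:]

# Hypothetical examples sorted by region, as in a file sorted by location.
examples = [("north", i) for i in range(50)] + [("south", i) for i in range(50)]

# Without shuffling, the last 30 rows (the validation set) are all southern,
# and the training set over-represents the north.
unshuffled_validation = examples[70:]
print({region for region, _ in unshuffled_validation})

# With shuffling, both regions appear in the validation set.
train, validation = split_examples(examples, validation_size=30)
print({region for region, _ in validation})
```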

#### Fairness: Evaluating for Bias

When evaluating a model, metrics calculated against an entire test or validation set don't always give an accurate picture of how fair the model is.

Consider a new model developed to predict the presence of tumors that is evaluated against a validation set of 1,000 patients' medical records. 500 records are from female patients, and 500 records are from male patients. The following confusion matrix summarizes the results for all 1,000 examples:

<div>
<table>
  <tr>
    <td>
      <b>True Positives (TPs): 16</b>
    </td>
    <td>
      <b>False Positives (FPs): 4</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FNs): 6</b>
    </td>
    <td>
      <b>True Negatives (TNs): 974</b>
    </td>
  </tr>
</table>

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{16}{16+4} = 0.800$$
$$\text{Recall} = \frac{TP}{TP+FN} = \frac{16}{16+6} = 0.727$$

</div>

These results look promising: precision of 80% and recall of 72.7%. But what happens if we calculate the result separately for each set of patients? Let's break out the results into two separate confusion matrices: one for female patients and one for male patients.

<div>
<div style="width: 45%; display: inline-block; margin: 10px">
  <h4 style="text-align: center" data-text="Female Patient Results" tabindex="-1">Female Patient Results</h4>
<table>
  <tr>
    <td>
      <b>True Positives (TPs): 10</b>
    </td>
    <td>
      <b>False Positives (FPs): 1</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FNs): 1</b>
    </td>
    <td>
      <b>True Negatives (TNs): 488</b>
    </td>
  </tr>
</table>

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{10}{10+1} = 0.909$$
$$\text{Recall} = \frac{TP}{TP+FN} = \frac{10}{10+1} = 0.909$$

</div>

<div style="float: right; width: 45%; display: inline-block; margin: 10px;">
  <h4 style="text-align: center" data-text="Male Patient Results" tabindex="-1">Male Patient Results</h4>
<table>
  <tr>
    <td>
      <b>True Positives (TPs): 6</b>
    </td>
    <td>
      <b>False Positives (FPs): 3</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FNs): 5</b>
    </td>
    <td>
      <b>True Negatives (TNs): 486</b>
    </td>
  </tr>
</table>

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{6}{6+3} = 0.667$$
$$\text{Recall} = \frac{TP}{TP+FN} = \frac{6}{6+5} = 0.545$$

</div>
</div>

When we calculate metrics separately for female and male patients, we see stark differences in model performance for each group.

Female patients:

- Of the 11 female patients who actually have tumors, the model correctly predicts positive for 10 patients (recall rate: 90.9%). In other words, the model misses a tumor diagnosis in 9.1% of female cases.
- Similarly, when the model returns positive for tumor in female patients, it is correct in 10 out of 11 cases (precision rate: 90.9%); in other words, 9.1% of the model's positive tumor predictions for female patients are incorrect.

Male patients:

- However, of the 11 male patients who actually have tumors, the model correctly predicts positive for only 6 patients (recall rate: 54.5%). That means the model misses a tumor diagnosis in 45.5% of male cases.
- And when the model returns positive for tumor in male patients, it is correct in only 6 out of 9 cases (precision rate: 66.7%); in other words, 33.3% of the model's positive tumor predictions for male patients are incorrect.

We now have a much better understanding of the biases inherent in the model's predictions, as well as the risks to each subgroup if the model were to be released for medical use in the general population.
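
The per-group breakdown above is easy to reproduce in code. This sketch recomputes precision and recall from the confusion-matrix counts given in the tables; the dictionary layout and helper names are just one convenient arrangement.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Confusion-matrix counts from the tables above.
groups = {
    "female": {"tp": 10, "fp": 1, "fn": 1, "tn": 488},
    "male": {"tp": 6, "fp": 3, "fn": 5, "tn": 486},
}

# Summing the per-group counts reproduces the overall confusion matrix.
overall = {
    key: sum(counts[key] for counts in groups.values())
    for key in ("tp", "fp", "fn", "tn")
}

for name, counts in {"overall": overall, **groups}.items():
    p = precision(counts["tp"], counts["fp"])
    r = recall(counts["tp"], counts["fn"])
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
# overall: precision=0.800 recall=0.727
# female: precision=0.909 recall=0.909
# male: precision=0.667 recall=0.545
```

The aggregate numbers sit between the two groups and hide how much worse the model does on male patients.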

##### Additional Fairness Resources

Fairness is a relatively new subfield within the discipline of machine learning. To learn more about research and initiatives devoted to developing new tools and techniques for identifying and mitigating bias in machine learning models, check out [Google's Machine Learning Fairness resources page](https://developers.google.com/machine-learning/fairness-overview?authuser=1). 

#### Fairness: Programming Exercise

The following exercise demonstrates how to audit data sets with fairness in mind, and one strategy for evaluating a model for fairness:

- [Intro to Fairness](/Machine%20Learning%20Crash%20Course%20with%20TensorFlow%20APIs/Exercises/Intro_to_Fairness.ipynb) Colab exercise.

#### Fairness: Check Your Understanding

##### Types of Bias

Explore the options below.

Which of the following models' predictions have been affected by selection bias?

- [x] Engineers at a company developed a model to predict staff turnover rates (the percentage of employees quitting their jobs each year) based on data collected from a survey sent to all employees. After several years of use, engineers determined that the model underestimated turnover by more than 20%. When conducting exit interviews with employees leaving the company, they learned that more than 80% of people who were dissatisfied with their jobs chose not to complete the survey, compared to a company-wide opt-out rate of 15%.
- [x] A German handwriting recognition smartphone app uses a model that frequently incorrectly classifies ß (Eszett) characters as B characters, because it was trained on a corpus of American handwriting samples, mostly written in English.
- [ ] Engineers developing a movie-recommendation system hypothesized that people who like horror movies will also like science-fiction movies. When they trained a model on 50,000 users' watchlists, however, it showed no such correlation between preferences for horror and for sci-fi; instead it showed a strong correlation between preferences for horror and for documentaries. This seemed odd to them, so they retrained the model five more times using different hyperparameters. Their final trained model showed a 70% correlation between preferences for horror and for sci-fi, so they confidently released it into production.
- [ ] Engineers built a model to predict the likelihood of a person developing diabetes based on their daily food intake. The model was trained on 10,000 "food diaries" collected from a randomly chosen group of people worldwide representing a variety of different age groups, ethnic backgrounds, and genders. However, when the model was deployed, it had very poor accuracy. Engineers subsequently discovered that food diary participants were reluctant to admit the true volume of unhealthy foods they ate, and were more likely to document consumption of nutritious food than less healthy snacks.

##### Evaluating for Bias

A sarcasm-detection model was trained on 80,000 text messages: 40,000 messages sent by adults (18 years and older) and 40,000 messages sent by minors (less than 18 years old). The model was then evaluated on a test set of 20,000 messages: 10,000 from adults and 10,000 from minors. The following confusion matrices show the results for each group (a positive prediction signifies a classification of "sarcastic"; a negative prediction signifies a classification of "not sarcastic"):

<div>
<div style="display: inline-block; width: 45%; margin: 10px">
  <h4 style="text-align: center" data-text="Adults" tabindex="-1">Adults</h4>
<table>
  <tr>
    <td>
      <b>True Positives (TPs): 512</b>
    </td>
    <td>
      <b>False Positives (FPs): 51</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FNs): 36</b>
    </td>
    <td>
      <b>True Negatives (TNs): 9401</b>
    </td>
  </tr>
</table>

$$\text{Precision} = \frac{TP}{TP+FP} = 0.909$$
$$\text{Recall} = \frac{TP}{TP+FN} = 0.934$$

</div>

<div style="float: right; width: 45%; display: inline-block; margin: 10px;">
  <h4 style="text-align: center" data-text="Minors" tabindex="-1">Minors</h4>
<table>
  <tr>
    <td>
      <b>True Positives (TPs): 2147</b>
    </td>
    <td>
      <b>False Positives (FPs): 96</b>
    </td>
  </tr>
  <tr>
    <td>
      <b>False Negatives (FNs): 2177</b>
    </td>
    <td>
      <b>True Negatives (TNs): 5580</b>
    </td>
  </tr>
</table>

$$\text{Precision} = \frac{TP}{TP+FP} = 0.957$$
$$\text{Recall} = \frac{TP}{TP+FN} = 0.497$$

</div>
</div>

Explore the options below.

Which of the following statements about the model's test-set performance are true?

- [ ] The 10,000 messages sent by minors are a [class-imbalanced](https://developers.google.com/machine-learning/glossary?authuser=1#class_imbalanced_data_set) dataset.
- [x] Overall, the model performs better on examples from adults than on examples from minors.
- [ ] Approximately 50% of messages sent by minors are classified as "sarcastic" incorrectly.
- [x] The 10,000 messages sent by adults are a [class-imbalanced](https://developers.google.com/machine-learning/glossary?authuser=1#class_imbalanced_data_set) dataset.
- [x] The model fails to classify approximately 50% of minors' sarcastic messages as "sarcastic."
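
The class-imbalance statements can be checked directly from the confusion matrices: the number of truly sarcastic messages in each group is TP + FN. The helper name `positive_rate` is an assumption for this sketch.

```python
def positive_rate(tp, fp, fn, tn):
    """Fraction of examples whose true label is positive ("sarcastic")."""
    return (tp + fn) / (tp + fp + fn + tn)

# Counts from the confusion matrices above.
adults = {"tp": 512, "fp": 51, "fn": 36, "tn": 9401}
minors = {"tp": 2147, "fp": 96, "fn": 2177, "tn": 5580}

print(f"adults: {positive_rate(**adults):.1%} sarcastic")  # adults: 5.5% sarcastic
print(f"minors: {positive_rate(**minors):.1%} sarcastic")  # minors: 43.2% sarcastic
```

Only about 5.5% of adults' messages are actually sarcastic, so that set is heavily imbalanced, while minors' messages are much closer to an even split.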

Explore the options below.

Engineers are working on retraining this model to address inconsistencies in sarcasm-detection accuracy across age demographics, but the model has already been released into production. Which of the following stopgap strategies will help mitigate errors in the model's predictions?

- [x] Restrict the model's usage to text messages sent by adults.
- [ ] Adjust the model output so that it returns "sarcastic" for all text messages sent by minors, regardless of what the model originally predicted.
- [ ] Restrict the model's usage to text messages sent by minors.
- [x] When the model predicts "not sarcastic" for text messages sent by minors, adjust the output so the model returns a value of "unsure" instead.
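
The second stopgap could be implemented as a thin wrapper around the model's output. This is a sketch under stated assumptions: the function name `stopgap_prediction` and the `sender_is_minor` flag are hypothetical, and it assumes the sender's age group is known at prediction time.

```python
def stopgap_prediction(model_label, sender_is_minor):
    """Wrap the model's output with the stopgap rule for minors' messages."""
    # For minors, recall is only about 50%, so "not sarcastic" is unreliable.
    if sender_is_minor and model_label == "not sarcastic":
        return "unsure"
    # Precision for minors is high (0.957), so "sarcastic" predictions are kept,
    # and predictions for adults pass through unchanged.
    return model_label

print(stopgap_prediction("not sarcastic", sender_is_minor=True))   # unsure
print(stopgap_prediction("sarcastic", sender_is_minor=True))       # sarcastic
print(stopgap_prediction("not sarcastic", sender_is_minor=False))  # not sarcastic
```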

## Part 3 - Machine Learning Systems in the Real World

### Cancer Prediction

In this lesson, you'll debug a real-world ML problem* related to cancer prediction.

\* We based this module very loosely (making some modifications along the way) on ["Leakage in data mining: formulation, detection, and avoidance" by Kaufman, Rosset, and Perlich](http://dl.acm.org/citation.cfm?id=2020496).

### Literature

In this lesson, you'll debug a real-world ML problem* related to 18th century literature.

\* We based this module very loosely (making some modifications along the way) on ["Meaning and Mining: the Impact of Implicit Assumptions in Data Mining for the Humanities" by Sculley and Pasanek](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.21).

### Guidelines

This lesson summarizes the guidelines learned from these real-world examples.

#### Video Lecture Summary

Here's a quick synopsis of effective ML guidelines:

- Keep your first model simple.
- Focus on ensuring data pipeline correctness.
- Use a simple, observable metric for training & evaluation.
- Own and monitor your input features.
- Treat your model configuration as code: review it, check it in.
- Write down the results of all experiments, especially "failures."

#### Other Resources

[Rules of Machine Learning](https://developers.google.com/machine-learning/rules-of-ml?authuser=1) contains additional guidance.