# Fundamentals of machine learning
________
<!--
Author Muhammad Qasim
Date: 19-March-2018
-->
<style>
.gray{
background-color: #ccc !important;
}
</style>

## This chapter covers

- Forms of machine learning beyond classification and regression
- Formal evaluation procedures for machine-learning models
- Preparing data for deep learning
- Feature engineering
- Tackling overfitting
- The universal workflow for approaching machine-learning problems


## 4.1 Four branches of machine learning
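Earlier examples covered three specific types of machine-learning problems: binary classification, multiclass classification, and scalar regression. All three are instances of supervised learning, where the goal is to learn the relationship between training inputs and training targets. Supervised learning is only one of four broad branches, each covered below.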


### 4.1.1 Supervised learning

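Supervised learning consists of learning to map input data to known targets, given a set of examples. It covers applications such as optical character recognition, speech recognition, image classification, and language translation. Beyond classification and regression, it also includes more exotic variants: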

- Sequence generation—Given a picture, predict a caption describing it. Sequence
generation can sometimes be reformulated as a series of classification problems
(such as repeatedly predicting a word or token in a sequence).
- Syntax tree prediction—Given a sentence, predict its decomposition into a syntax
tree.
- Object detection—Given a picture, draw a bounding box around certain objects
inside the picture. This can also be expressed as a classification problem (given
many candidate bounding boxes, classify the contents of each one) or as a joint
classification and regression problem, where the bounding-box coordinates are
predicted via vector regression.
- Image segmentation—Given a picture, draw a pixel-level mask on a specific object. 

### 4.1.2 Unsupervised learning
- Unsupervised learning consists of finding interesting transformations of the input data without the help of any targets, for the purposes of data visualization, data compression, or data denoising, or to better understand the correlations present in the data at hand.
- Unsupervised learning is the bread and butter of data analytics, and it’s often a necessary step in better understanding a dataset.
- Dimensionality reduction and clustering are well-known categories of unsupervised learning.

### 4.1.3 Self-supervised learning
- This is a specific instance of supervised learning, but it’s different enough that it
deserves its own category.
- Self-supervised learning is supervised learning without human-annotated labels.
- There are still labels involved (because the learning has to be
supervised by something), but they’re generated from the input data, typically using a
heuristic algorithm
- <b>Autoencoders</b> are a well-known instance of self-supervised learning, where the generated targets are the input, unmodified. In the same way, trying to predict the next frame in a video, given past frames, or the next word in a text, given previous words, are instances of self-supervised learning (temporally supervised learning, in this case: supervision comes from future input data). Note that the distinction between supervised, self-supervised, and unsupervised learning can be blurry sometimes.

- <b>NOTE</b> In this book, we’ll focus specifically on supervised learning, because
it’s by far the dominant form of deep learning today, with a wide range of
industry applications. We’ll also take a briefer look at self-supervised learning
in later chapters.

### 4.1.4 Reinforcement learning
-  In reinforcement learning,
an agent receives information about its environment and learns to choose actions that
will maximize some reward. For instance, a neural network that “looks” at a videogame
screen and outputs game actions in order to maximize its score can be trained
via reinforcement learning.
- We expect to see reinforcement learning take over an increasingly large range of real-world applications: self-driving cars, robotics, resource management, education, and so on. It’s an idea whose time has come, or will come soon.


- Classification and regression glossary<br>Classification and regression involve many specialized terms. They have precise, machine-learning-specific definitions, and you should be familiar
with them:
  - Sample or input—One data point that goes into your model.
  - Prediction or output—What comes out of your model.
  - Target—The truth. What your model should ideally have predicted, according
to an external source of data.
  - Prediction error or loss value—A measure of the distance between your
model’s prediction and the target.
  - Classes—A set of possible labels to choose from in a classification problem.
For example, when classifying cat and dog pictures, “dog” and “cat” are the
two classes.
  - Label—A specific instance of a class annotation in a classification problem.
For instance, if picture #1234 is annotated as containing the class “dog,”
then “dog” is a label of picture #1234.
  - Ground-truth or annotations—All targets for a dataset, typically collected by
humans.
  - Binary classification—A classification task where each input sample should
be categorized into one of two exclusive categories.
  - Multiclass classification—A classification task where each input sample
should be categorized into more than two categories: for instance, classifying
handwritten digits.
  - Multilabel classification—A classification task where each input sample can
be assigned multiple labels. For instance, a given image may contain both a
cat and a dog and should be annotated both with the “cat” label and the
“dog” label. The number of labels per image is usually variable.
  - Scalar regression—A task where the target is a continuous scalar value. Predicting
house prices is a good example: the different target prices form a continuous
space.
  - Vector regression—A task where the target is a set of continuous values: for
example, a continuous vector. If you’re doing regression against multiple values
(such as the coordinates of a bounding box in an image), then you’re
doing vector regression.
  - Mini-batch or batch—A small set of samples (typically between 8 and 128)
that are processed simultaneously by the model. The number of samples is
often a power of 2, to facilitate memory allocation on GPU. When training, a
mini-batch is used to compute a single gradient-descent update applied to
the weights of the model. 
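
To make the mini-batch entry concrete, here is a minimal sketch of how a dataset might be cut into mini-batches. The helper name `iterate_minibatches` and the toy arrays are assumptions made for illustration; they do not come from any library.

```python
import numpy as np

def iterate_minibatches(samples, targets, batch_size=32):
    """Yield successive (samples, targets) mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        stop = start + batch_size
        yield samples[start:stop], targets[start:stop]

samples = np.zeros((100, 8), dtype="float32")  # toy data: 100 samples, 8 features each
targets = np.zeros((100,), dtype="float32")    # one scalar target per sample

# Each mini-batch would drive one gradient-descent update during training.
batch_sizes = [len(x) for x, _ in iterate_minibatches(samples, targets)]
print(batch_sizes)
```

With 100 samples and a batch size of 32, the final mini-batch holds only the 4 leftover samples.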

## 4.2 Evaluating machine-learning models
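- The available data is split into three sets: a training set, a validation set, and a test set.
- The reason for not evaluating a model on the same data it was trained on is <b>overfitting</b>: performance on the training data keeps improving while performance on never-before-seen data stalls or degrades.
- The goal is a model that generalizes, that is, one that performs well on never-before-seen data.
- Overfitting is the central obstacle to that goal.
- This section focuses on how to measure generalization: how to evaluate machine-learning models.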


### 4.2.1 Training, validation, and test sets
- You train on the training data and evaluate your model on the validation data. Once your model is ready for prime time, you test it one final time on the test data.
- You may ask, why not have two sets: a training set and a test set? You’d train on the
training data and evaluate on the test data. Much simpler!
- The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (called the hyperparameters of the model, to distinguish them from the parameters, which are the network’s weights). You do this tuning by using as a feedback signal the performance of the model on the validation data. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
- If anything about the model has been tuned based on test set performance, then your
measure of generalization will be flawed.
- Three classic evaluation recipes are reviewed below: simple hold-out validation, K-fold validation, and iterated K-fold validation with shuffling.

#### SIMPLE HOLD-OUT VALIDATION
Set apart some fraction of your data as your test set. Train on the remaining data, and
evaluate on the test set.
<img src='images/f4.1.png'>

### <div style='color:#fff; background-color: skyblue;padding:10px 20px;'>Listing 4.1 Hold-out validation</div>
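
A minimal sketch of what a hold-out split could look like, assuming the data is a NumPy array with one sample per row. The helper name `hold_out_split` and the toy dataset are hypothetical.

```python
import numpy as np

def hold_out_split(data, num_validation_samples, seed=0):
    """Shuffle the samples, then split them into (training, validation) parts."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]  # shuffle before splitting
    validation_data = shuffled[:num_validation_samples]
    training_data = shuffled[num_validation_samples:]
    return training_data, validation_data

data = np.arange(100).reshape(50, 2)  # toy dataset: 50 samples, 2 features
training_data, validation_data = hold_out_split(data, num_validation_samples=10)

# A model would now be trained on training_data and evaluated on validation_data.
print(training_data.shape, validation_data.shape)
```

Shuffling is not appropriate for every dataset (for example, when the sample order carries meaning, as in time series), so treat the shuffle step as an assumption.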

#### K-FOLD VALIDATION
With this approach, you split your data into K partitions of equal size. For each partition
i, train a model on the remaining K – 1 partitions, and evaluate it on partition i.
Your final score is then the average of the K scores obtained.
<img src='images/f4.2.png'>

### <div style='color:#fff; background-color: skyblue;padding:10px 20px;'>Listing 4.2 K-fold cross-validation</div>
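
A minimal sketch of K-fold validation under stated assumptions: `train_and_score` is a hypothetical callback that stands in for "build a fresh model, train it, and return its validation score", and `mean_baseline` is a toy stand-in used only so the example runs without a deep-learning library.

```python
import numpy as np

def k_fold_scores(data, k, train_and_score):
    """Return one validation score per fold."""
    num_val = len(data) // k  # leftover samples (if any) always stay in the training part
    scores = []
    for fold in range(k):
        start, stop = fold * num_val, (fold + 1) * num_val
        validation_data = data[start:stop]
        training_data = np.concatenate([data[:start], data[stop:]])
        scores.append(train_and_score(training_data, validation_data))
    return scores

def mean_baseline(train, val):
    """Toy 'model': predict the training mean, score with mean absolute error."""
    return float(np.abs(val - train.mean()).mean())

data = np.arange(12, dtype="float32")
scores = k_fold_scores(data, k=4, train_and_score=mean_baseline)
final_score = float(np.mean(scores))  # average of the K scores
print(scores, final_score)
```

In a real setting, a new untrained model must be created inside the callback for every fold, so that no fold benefits from weights fitted on another.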

#### ITERATED K-FOLD VALIDATION WITH SHUFFLING
This approach consists of applying K-fold validation multiple times, shuffling
the data every time before splitting it K ways. The final score is the average of the
scores obtained at each run of K-fold validation. Note that you end up training and
evaluating P × K models (where P is the number of iterations you use), which can be very
expensive.
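
The cost of P × K models is easy to see in a small sketch. As before, `train_and_score` is a hypothetical callback standing in for training and evaluating a fresh model, and `mean_baseline` is a toy substitute that also records how many times it was called.

```python
import numpy as np

def iterated_k_fold(data, k, p, train_and_score, seed=0):
    """Run K-fold validation p times, reshuffling the data before each run."""
    rng = np.random.default_rng(seed)
    num_val = len(data) // k
    run_scores = []
    for _ in range(p):
        shuffled = data[rng.permutation(len(data))]
        fold_scores = []
        for fold in range(k):
            start, stop = fold * num_val, (fold + 1) * num_val
            validation_data = shuffled[start:stop]
            training_data = np.concatenate([shuffled[:start], shuffled[stop:]])
            fold_scores.append(train_and_score(training_data, validation_data))
        run_scores.append(np.mean(fold_scores))
    return float(np.mean(run_scores))  # average over the p runs

calls = []  # one entry per model "trained"

def mean_baseline(train, val):
    """Toy 'model': predict the training mean, score with mean absolute error."""
    calls.append(len(train))
    return float(np.abs(val - train.mean()).mean())

data = np.arange(20, dtype="float32")
final_score = iterated_k_fold(data, k=4, p=3, train_and_score=mean_baseline)
print(len(calls))  # number of models trained: P * K
```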

# 4.3 Data preprocessing, feature engineering, and feature learning

## 4.3.1 Data preprocessing for neural networks

Data preprocessing aims at making the raw data at hand more amenable to neural
networks. This includes vectorization, normalization, handling missing values, and
feature extraction.

#### VECTORIZATION
All inputs and targets in a neural network must be tensors of floating-point data (or, in
specific cases, tensors of integers). Whatever data you need to process (sound,
images, text), you must first turn it into tensors, a step called data vectorization.
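
As an illustration, here is a minimal sketch of one common vectorization scheme, multi-hot encoding. It assumes the raw data is a list of integer word indices per sample, as in a text-classification setting; the helper name `vectorize_sequences` and the tiny inputs are chosen for this example only.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    """Turn lists of integer indices into a float32 matrix of 0s and 1s."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # set the listed indices of row i to 1
    return results

# Two toy "documents", with a small dimension to keep the output readable.
x = vectorize_sequences([[3, 5], [1, 3, 9]], dimension=10)
print(x.shape, x.dtype)
print(x[0])
```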

#### VALUE NORMALIZATION
- In the digit-classification example, you started from image data encoded as integers in the 0–255 range, encoding grayscale values. Before you fed this data into your network, you had to cast it to float32 and divide by 255 so you’d end up with floating-point values in the 0–1 range.
- Similarly, when predicting house prices, you started from features that took a variety of ranges: some features had small floating-point values, others had fairly large integer values. Before you fed this data into your network, you had to normalize each feature independently so that it had a standard deviation of 1 and a mean of 0.

To make learning easier for your network, your data should have the following characteristics:

  - Take small values: typically, most values should be in the 0–1 range.
  - Be homogenous: all features should take values in roughly the same range.

Additionally, the following stricter normalization practice is common and can help, although it isn’t always necessary (for example, you didn’t do this in the digit-classification example):

  - Normalize each feature independently to have a mean of 0.
  - Normalize each feature independently to have a standard deviation of 1.
This is easy to do with Numpy arrays:

```
# Assumes x is a 2D float array of shape (samples, features)
x -= x.mean(axis=0)
x /= x.std(axis=0)
```
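
When a separate test set exists, a common practice is to compute the mean and standard deviation on the training data only and reuse them for the test data. The following sketch assumes randomly generated toy arrays; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
train_data = rng.uniform(0, 500, size=(100, 3))  # toy features with a large range
test_data = rng.uniform(0, 500, size=(20, 3))

# Statistics come from the training data only.
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

train_data = (train_data - mean) / std
test_data = (test_data - mean) / std  # reuse the training statistics

print(train_data.mean(axis=0).round(6), train_data.std(axis=0).round(6))
```

Using statistics computed on the test data would leak information about the test set into the workflow.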

#### HANDLING MISSING VALUES

Note that if you’re expecting missing values in the test data, but the network was
trained on data without any missing values, the network won’t have learned to ignore
missing values! In this situation, you should artificially generate training samples with
missing entries: copy some training samples several times, and drop some of the features
that you expect are likely to be missing in the test data.
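
A minimal sketch of that augmentation step. It assumes that a missing value is encoded as 0 and that 0 is not already a meaningful value for the affected features; the helper name `add_missing_value_copies` and its parameters are hypothetical.

```python
import numpy as np

def add_missing_value_copies(samples, feature_indices, num_copies=2, drop_prob=0.5, seed=0):
    """Append copies of the samples in which some chosen features are zeroed out."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype="float32")
    copies = np.repeat(samples, num_copies, axis=0)  # each sample copied num_copies times
    # Decide independently, per copy and per chosen feature, whether to drop it.
    drop_mask = rng.random((len(copies), len(feature_indices))) < drop_prob
    block = copies[:, feature_indices]  # fancy indexing returns a copy
    block[drop_mask] = 0.0              # assumption: 0 stands for "missing"
    copies[:, feature_indices] = block
    return np.concatenate([samples, copies])

samples = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
augmented = add_missing_value_copies(samples, feature_indices=[1], num_copies=3)
print(augmented.shape)
```

Only the features listed in `feature_indices` are ever dropped; the original samples are kept unchanged at the top of the result.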

## 4.3.2 Feature engineering
Feature engineering is the process of using your own knowledge about the data and about
the machine-learning algorithm at hand (in this case, a neural network) to make the
algorithm work better by applying hardcoded (nonlearned) transformations to the data
before it goes into the model.
<img src='images/f4.3.png'>

 Fortunately, modern deep learning removes the need for most feature engineering,
because neural networks are capable of automatically extracting useful features
from raw data. Does this mean you don’t have to worry about feature engineering as
long as you’re using deep neural networks? No, for two reasons:
- Good features still allow you to solve problems more elegantly while using fewer
resources. For instance, it would be ridiculous to solve the problem of reading a
clock face using a convolutional neural network.
- Good features let you solve a problem with far less data. The ability of deep-learning
models to learn features on their own relies on having lots of training
data available; if you have only a few samples, then the information value in
their features becomes critical.
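
To illustrate the clock-face remark, suppose the engineered features are the angles of the two clock hands rather than raw pixels. This choice of features is an assumption made for this sketch. With such features, reading the time needs no learning at all, only a rounding operation:

```python
import math

def time_from_hand_angles(hour_angle, minute_angle):
    """Recover (hour, minute) from hand angles.

    Angles are in radians, measured clockwise from the 12 o'clock position.
    """
    minute = round(minute_angle / (2 * math.pi) * 60) % 60
    # The hour hand also moves with the minutes, so subtract that share before rounding.
    hour = round(hour_angle / (2 * math.pi) * 12 - minute / 60) % 12
    return hour, minute

three_oclock = time_from_hand_angles(math.pi / 2, 0.0)
quarter_to_eight = time_from_hand_angles(2 * math.pi * 7.75 / 12, 2 * math.pi * 0.75)
print(three_oclock, quarter_to_eight)
```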

# 4.4 Overfitting and underfitting

-  The fundamental issue in machine learning is the tension between optimization
and generalization. Optimization refers to the process of adjusting a model to get the
best performance possible on the training data (the learning in machine learning),
whereas generalization refers to how well the trained model performs on data it has
never seen before. The goal of the game is to get good generalization, of course, but
you don’t control generalization; you can only adjust the model based on its training
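data.
- At the beginning of training, optimization and generalization are correlated: the lower the loss on training data, the lower the loss on test data. While this is happening, your model is said to be underfit: there is still progress to be made. After a certain number of iterations on the training data, generalization stops improving and validation metrics begin to degrade: the model is starting to overfit.
- The best remedy is more training data. When that isn’t possible, the next-best option is to constrain the quantity or kind of information the model is allowed to store. The process of fighting overfitting this way is called regularization.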

## 4.4.1 Reducing the network’s size
- The simplest way to prevent overfitting is to reduce the size of the model: the number
of learnable parameters in the model (which is determined by the number of layers
and the number of units per layer).
- Unfortunately, there is no magical formula to determine the right number of layers
or the right size for each layer. You must evaluate an array of different architectures
(on your validation set, not on your test set, of course) in order to find the
correct model size for your data. The general workflow to find an appropriate model
size is to start with relatively few layers and parameters, and increase the size of the layers
or add new layers until you see diminishing returns with regard to validation loss.

### <div style='color:#fff; background-color: skyblue;padding:10px 20px;'>Listing 4.3 Original model</div>

```
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

Now let’s try to replace it with this smaller network.


### <div style='color:#fff; background-color: skyblue;padding:10px 20px;'>Listing 4.4 Version of the model with lower capacity</div>

```
model = models.Sequential()
model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```


Figure 4.4 shows a comparison of the validation losses of the original network and the
smaller network. The dots are the validation loss values of the smaller network, and
the crosses are the initial network (remember, a lower validation loss signals a better
model).
<img src='images/f4.4.png'>

### <div style='color:#fff; background-color: skyblue;padding:10px 20px;'>Listing 4.5 Version of the model with higher capacity</div>

```
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

Figure 4.5 shows how the bigger network fares compared to the reference network.
The dots are the validation loss values of the bigger network, and the crosses are the
initial network.
<img src='images/f4.4_5.png'>

The bigger network starts overfitting almost immediately, after just one epoch, and it
overfits much more severely. Its validation loss is also noisier.
 Meanwhile, figure 4.6 shows the training losses for the two networks. As you can
see, the bigger network gets its training loss near zero very quickly. The more capacity
the network has, the more quickly it can model the training data (resulting in a low
training loss), but the more susceptible it is to overfitting (resulting in a large difference
between the training and validation loss). 
<img src='images/f4.6.png'>

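## 4.4.2 Adding weight regularization

- A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters, as you saw in the previous section). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization.
- It’s done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:
  - L1 regularization: the cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
  - L2 regularization: the cost added is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights). In the context of neural networks, it is also called weight decay.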

In Keras, weight regularization is added by passing weight regularizer instances to layers
as keyword arguments. Let’s add L2 weight regularization to the movie-review classification
network.

### <div style='color:#fff; background-color: skyblue;padding:10px 20px;'>Listing 4.6 Adding L2 weight regularization to the model</div>

```
from keras import regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

`l2(0.001)` means every coefficient in the weight matrix of the layer will add `0.001 * weight_coefficient_value ** 2` to the total loss of the network. Note that because this penalty is only added at training time, the loss for this network will be much higher at training than at test time.

Figure 4.7 shows the impact of the L2 regularization penalty. As you can see, the
model with L2 regularization (dots) has become much more resistant to overfitting
than the reference model (crosses), even though both models have the same number
of parameters.
<img src='images/f4.7.png'>
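
The penalty itself is simple arithmetic. The sketch below computes L1 and L2 penalties for a small, made-up weight matrix using plain NumPy; the function names `l1_penalty` and `l2_penalty` are illustrative and are not Keras functions.

```python
import numpy as np

def l1_penalty(weights, factor):
    """Cost proportional to the absolute value of each weight coefficient."""
    return float(factor * np.sum(np.abs(weights)))

def l2_penalty(weights, factor):
    """Cost proportional to the square of each weight coefficient."""
    return float(factor * np.sum(np.square(weights)))

weights = np.array([[0.5, -2.0], [1.0, 0.0]])  # toy weight matrix
l1_cost = l1_penalty(weights, 0.001)  # 0.001 * (0.5 + 2.0 + 1.0 + 0.0)
l2_cost = l2_penalty(weights, 0.001)  # 0.001 * (0.25 + 4.0 + 1.0 + 0.0)
print(l1_cost, l2_cost)
```

Because the L2 cost grows with the square of each coefficient, the single large weight (-2.0) dominates it far more than it dominates the L1 cost.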

As an alternative to L2 regularization, you can use one of the following Keras weight
regularizers.

### <div style='color:#fff; background-color: skyblue;padding:10px 20px;'>Listing 4.7 Different weight regularizers available in Keras</div>

```
from keras import regularizers

# L1 regularization
regularizers.l1(0.001)

# Simultaneous L1 and L2 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)
```

## 4.4.3 Adding dropout

- Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. Let’s say a given layer would normally return a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training. After applying dropout, this vector will have a few zero entries distributed at random: for example, [0, 0.5, 1.3, 0, 1.1]. The dropout rate is the fraction of the features that are zeroed out; it’s usually set between 0.2 and 0.5. At test time, no units are dropped out; instead, the layer’s output values are scaled down by a factor equal to the fraction of units kept (1 minus the dropout rate), to balance for the fact that more units are active than at training time.

Consider a Numpy matrix containing the output of a layer, `layer_output`, of
shape `(batch_size, features)`. At training time, we zero out at random a fraction of
the values in the matrix:

```
# At training time, drops out 50% of the units in the output
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
```

At test time, we scale down the output to compensate. Here, we scale by 0.5
(because we previously dropped half the units):

```
# At test time
layer_output *= 0.5
```

Note that this process can be implemented by doing both operations at training time
and leaving the output unchanged at test time, which is often the way it’s implemented
in practice (see figure 4.8):

```
# At training time: drop half the units, then scale up to compensate
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
layer_output /= 0.5
```
<img src='images/f4.8.png'>
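
The snippets above hardcode a rate of 0.5. A minimal sketch that generalizes the same idea to an arbitrary rate is shown below; the function name `dropout` and its signature are assumptions for this example, not the Keras implementation.

```python
import numpy as np

def dropout(layer_output, rate, training, rng):
    """Zero out a fraction `rate` of the values and rescale the rest (training only)."""
    if not training:
        return layer_output  # output left unchanged at test time
    keep_prob = 1.0 - rate
    mask = rng.random(layer_output.shape) < keep_prob  # True for the units that are kept
    return layer_output * mask / keep_prob             # scale up to compensate

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 100))
train_out = dropout(layer_output, rate=0.3, training=True, rng=rng)
test_out = dropout(layer_output, rate=0.3, training=False, rng=rng)

# About 30% of the values are zero, yet the average activation stays close to 1.
print((train_out == 0).mean(), train_out.mean(), test_out.mean())
```

Dividing by the fraction of units kept is what keeps the expected activation the same at training and test time.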

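In Keras, you can introduce dropout in a network via the `Dropout` layer, which is applied to the output of the layer right before it:

```
model.add(layers.Dropout(0.5))
```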

### <div style='color:#fff; background-color: skyblue;padding:10px 20px;'>Listing 4.8 Adding dropout to the IMDB network</div>

```
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
```

Figure 4.9 shows a plot of the results. Again, this is a clear improvement over the reference
network.

<img src='images/f4.9.png'>

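To recap, these are the most common ways to prevent overfitting in neural networks:

- Get more training data.
- Reduce the capacity of the network.
- Add weight regularization.
- Add dropout.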

# 4.5 The universal workflow of machine learning
____________

This section presents a universal blueprint that you can use to attack and solve
any machine-learning problem. The blueprint ties together the concepts you’ve
learned about in this chapter: problem definition, evaluation, feature engineering,
and fighting overfitting.

## 4.5.1 Defining the problem and assembling a dataset
- What will your input data be? What are you trying to predict? You can only learn
to predict something if you have available training data: for example, you can
only learn to classify the sentiment of movie reviews if you have both movie
reviews and sentiment annotations available. As such, data availability is usually
the limiting factor at this stage (unless you have the means to pay people to collect
data for you).
- What type of problem are you facing? Is it binary classification? Multiclass classification?
Scalar regression? Vector regression? Multiclass, multilabel classification?
Something else, like clustering, generation, or reinforcement learning?
Identifying the problem type will guide your choice of model architecture, loss
function, and so on.

You can’t move to the next stage until you know what your inputs and outputs are, and
what data you’ll use. Be aware of the hypotheses you make at this stage:
1. You hypothesize that your outputs can be predicted given your inputs.
2. You hypothesize that your available data is sufficiently informative to learn the
relationship between inputs and outputs.

## 4.5.2 Choosing a measure of success
<img src='images/recall.png'>

## 4.5.3 Deciding on an evaluation protocol

## 4.5.4 Preparing your data

## 4.5.5 Developing a model that does better than a baseline

 Assuming that things go well, you need to make three key choices to build your
first working model:
- Last-layer activation—This establishes useful constraints on the network’s output.
For instance, the IMDB classification example used sigmoid in the last
layer; the regression example didn’t use any last-layer activation; and so on.
- Loss function—This should match the type of problem you’re trying to solve. For
instance, the IMDB example used binary_crossentropy, the regression example
used mse, and so on.
- Optimization configuration—What optimizer will you use? What will its learning
rate be? In most cases, it’s safe to go with rmsprop and its default learning rate.
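
Regarding the loss function, it isn’t always possible to directly optimize the metric that measures success, so a proxy is often used: for instance, crossentropy in place of ROC AUC for classification. In general, you can hope that the lower the crossentropy gets, the higher the ROC AUC will be.

Table 4.1 can help you choose a last-layer activation and a loss function for a few
common problem types.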
<img src='images/4.1.png'>
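
Since the table is only available as an image, the sketch below encodes the conventional pairings as a lookup. It is assumed, not verified, that these match the table; the names `LAST_LAYER_CHOICES` and `choose_last_layer` are hypothetical.

```python
# Conventional (last-layer activation, loss) pairings per problem type.
# None means the last layer uses no activation.
LAST_LAYER_CHOICES = {
    "binary classification": ("sigmoid", "binary_crossentropy"),
    "multiclass, single-label classification": ("softmax", "categorical_crossentropy"),
    "multiclass, multilabel classification": ("sigmoid", "binary_crossentropy"),
    "regression to arbitrary values": (None, "mse"),
    # binary_crossentropy is also sometimes used for targets in the 0-1 range
    "regression to values between 0 and 1": ("sigmoid", "mse"),
}

def choose_last_layer(problem_type):
    """Return the (activation, loss) pair for a known problem type."""
    try:
        return LAST_LAYER_CHOICES[problem_type]
    except KeyError:
        raise ValueError(f"Unknown problem type: {problem_type!r}") from None

activation, loss = choose_last_layer("binary classification")
print(activation, loss)
```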

## 4.5.6 Scaling up: developing a model that overfits
Remember that the universal tension in machine learning is between
optimization and generalization; the ideal model is one that stands right at the border
between underfitting and overfitting; between undercapacity and overcapacity. To figure
out where this border lies, first you must cross it.

To figure out how big a model you’ll need, you must develop a model that overfits.
This is fairly easy:
1. Add layers.
2. Make the layers bigger.
3. Train for more epochs.

Always monitor the training loss and validation loss, as well as the training and validation
values for any metrics you care about. When you see that the model’s performance
on the validation data begins to degrade, you’ve achieved overfitting.

The next stage is to start regularizing and tuning the model, to get as close as possible
to the ideal model that neither underfits nor overfits.
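
One simple way to spot the point where validation performance begins to degrade is to find the epoch with the lowest validation loss. The sketch below uses made-up loss values, and the helper name `best_epoch` is an assumption.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

# Made-up per-epoch losses: training loss keeps falling, validation loss turns back up.
train_losses = [0.60, 0.40, 0.30, 0.22, 0.15, 0.10]
val_losses = [0.62, 0.45, 0.38, 0.36, 0.39, 0.44]

epoch = best_epoch(val_losses)
print(f"Validation loss is lowest at epoch {epoch}; later epochs overfit.")
```

Real validation curves are noisy, so in practice a single uptick is weaker evidence of overfitting than a sustained rise.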

## 4.5.7 Regularizing your model and tuning your hyperparameters
This step will take the most time: you’ll repeatedly modify your model, train it, evaluate
on your validation data (not the test data, at this point), modify it again, and
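repeat, until the model is as good as it can get. These are some things you should try:

- Add dropout.
- Try different architectures: add or remove layers.
- Add L1 and/or L2 regularization.
- Try different hyperparameters (such as the number of units per layer or the learning rate of the optimizer) to find the optimal configuration.
- Optionally, iterate on feature engineering: add new features, or remove features that don’t seem to be informative.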
Once you’ve developed a satisfactory model configuration, you can train your final
production model on all the available data (training and validation) and evaluate it
one last time on the test set. If it turns out that performance on the test set is significantly
worse than the performance measured on the validation data, this may mean
either that your validation procedure wasn’t reliable after all, or that you began overfitting
to the validation data while tuning the parameters of the model. In this case,
you may want to switch to a more reliable evaluation protocol (such as iterated K-fold
validation).

# Chapter summary
- Define the problem at hand and the data on which you’ll train. Collect
this data, or annotate it with labels if need be.
- Choose how you’ll measure success on your problem. Which metrics will
you monitor on your validation data?
- Determine your evaluation protocol: hold-out validation? K-fold validation?
Which portion of the data should you use for validation?
- Develop a first model that does better than a basic baseline: a model with
statistical power.
- Develop a model that overfits.
- Regularize your model and tune its hyperparameters, based on performance
on the validation data. A lot of machine-learning research tends to
focus only on this step—but keep the big picture in mind.