Commit
readthedocs: work on background
Evizero committed Jan 16, 2017
1 parent 63711b8 commit b88c184
Showing 3 changed files with 108 additions and 65 deletions.
5 changes: 2 additions & 3 deletions docs/index.rst
@@ -90,9 +90,8 @@ Loss Functions for Regression

Loss functions that belong to the category "distance-based" are
primarily used in regression problems. They utilize the numeric
difference between the predicted output and the true target as a
proxy variable to quantify the quality of individual predictions.
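
For intuition, the following sketch (assuming the exported
:func:`value` and :class:`L2DistLoss`) shows that such a loss
depends only on the difference :math:`\hat{y} - y`:

.. code-block:: jlcon

   julia> value(L2DistLoss(), 1.0, 1.5)   # difference of 0.5
   0.25

   julia> value(L2DistLoss(), 0.0, 0.5)   # same difference, same loss
   0.25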

+----------------------------------------------+----------------------------------------------------------------------------------------+
| .. toctree:: | .. image:: https://rawgithub.com/JuliaML/FileStorage/master/LossFunctions/distance.svg |
46 changes: 31 additions & 15 deletions docs/introduction/gettingstarted.rst
@@ -1,14 +1,14 @@
Getting Started
================

LossFunctions.jl is the result of a collaborative effort to
design and implement an efficient but also convenient-to-use
`Julia <http://julialang.org/>`_ library for, well, loss
functions. As such, this package implements the functionality
needed to query various properties about a loss function (such as
convexity), as well as a number of methods to compute its value,
derivative, and second derivative for single observations or
arrays of observations.
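
As a quick taste of those queries, consider the following sketch
(assuming the exported names :func:`isconvex`, :func:`value`, and
:func:`deriv`, which take the true target first and the predicted
output second):

.. code-block:: jlcon

   julia> isconvex(L2DistLoss())   # query a property of the loss
   true

   julia> value(L2DistLoss(), 1.0, 0.5)   # (0.5 - 1.0)^2
   0.25

   julia> deriv(L2DistLoss(), 1.0, 0.5)   # 2 * (0.5 - 1.0)
   -1.0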

In this section we will provide a condensed overview of the
package. In order to keep this overview concise, we will not
@@ -47,7 +47,7 @@ as usual.
using LossFunctions
Typically, the losses we work with in Machine Learning are
multivariate functions of two variables, the **true target**
:math:`y`, which represents the "ground truth" (i.e. correct
answer), and the **predicted output** :math:`\hat{y}`, which is
@@ -67,9 +67,9 @@ returns a value that quantifies how "bad" our prediction is
in comparison to the truth. In other words: the lower the
loss, the better the prediction.

From an implementation perspective, we should point out that all
the concrete loss "functions" that this package provides are
actually defined as immutable types, instead of native Julia
functions. We can compute the value of some type of loss using
the function :func:`value`. Let us start with an example of how
to compute the loss of a single observation (i.e. two numbers).
@@ -213,6 +213,10 @@ derivatives using :func:`deriv2`.

.. code-block:: jlcon

   julia> true_targets = [ 1, 0, -2];

   julia> pred_outputs = [0.5, 2, -1];

   julia> deriv(L2DistLoss(), true_targets, pred_outputs)
   3-element Array{Float64,1}:
    -1.0
@@ -225,13 +229,25 @@ derivatives using :func:`deriv2`.
2.0
2.0
Additionally, we provide mutating versions for the subset of
methods that return an array. These have the same function
signatures, with the only difference being that they require an
additional parameter as the first argument. This variable should
always be the preallocated array that is to be used as storage.

.. code-block:: jlcon

   julia> buffer = zeros(3)
   3-element Array{Float64,1}:
    0.0
    0.0
    0.0

   julia> deriv!(buffer, L2DistLoss(), true_targets, pred_outputs)
   3-element Array{Float64,1}:
    -1.0
     4.0
     2.0

Regression vs Classification
-----------------------------
122 changes: 75 additions & 47 deletions docs/introduction/motivation.rst
@@ -4,8 +4,8 @@ Background and Motivation
In this section we will discuss the concept "loss function" in
more detail. We will start by introducing some terminology and
definitions. However, please note that we won't attempt to give a
complete treatment of loss functions and the math involved
(unlike a book or a lecture could do). So this section won't be a
substitute for proper literature on the topic. While we will
try to cover all the basics necessary to get a decent intuition
of the ideas involved, we do assume basic knowledge about Machine
@@ -25,9 +25,9 @@ Terminology
To start off, let us go over some basic terminology. In **Machine
Learning** (ML) we are primarily interested in automatically
learning meaningful patterns from data. For our purposes it
suffices to say that in ML we try to teach the computer to solve
a task by induction rather than by definition. This package is
primarily concerned with the subset of Machine Learning that
falls under the umbrella of **Supervised Learning**. There we are
interested in teaching the computer to predict a specific output
for some given input. In contrast to unsupervised learning the
@@ -40,17 +40,15 @@ require some meaningful way to show the true answers to the
computer so that it can learn from "seeing" them. More
importantly, we have to somehow put the true answer into relation
to what the computer currently predicts the answer should be.
This would provide the basic information needed for the computer
to be able to improve; that is what loss functions are for.

When we say we want our computer to learn something that is able
to make predictions, we are talking about a **prediction
function**, denoted as :math:`h` and sometimes called "fitted
hypothesis", or "fitted model". Note that we will avoid the term
hypothesis for the simple reason that it is widely used in
statistics for something completely different.

We don't consider a prediction *function* as the same thing as a
prediction *model*, because we think of a **prediction model** as
a family of prediction functions. What that boils down to is that
@@ -66,8 +64,8 @@ in that scenario be a concrete linear function with a particular
set of coefficients.
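
For example, one concrete member of the linear model family might
look like this (a hypothetical sketch; the names ``w``, ``b``, and
``h`` are purely illustrative):

.. code-block:: julia

   # one particular prediction function from the linear family
   w = [0.3, -0.1]   # fitted coefficients
   b = 0.5           # fitted intercept
   h(x) = dot(w, x) + b

   h([1.0, 2.0])   # ≈ 0.6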

The purpose of a prediction function is to take some input and
produce a corresponding output. That output should be as faithful
as possible to the true answer. In the context of this package we
will refer to the "true answer" as the **true target**, or short
"target". During training, and only during training, inputs and
targets can both be considered as part of our data set. We say
@@ -76,7 +74,7 @@ actually have the targets available to us (otherwise there would
be no prediction problem to solve in the first place). In essence
we can think of our data as two entities with a 1-to-1 connection
in each observation, the inputs, which we call **features**, and
the corresponding desired outputs, which we call **true targets**.

Let us be a little more concrete with the two terms we really
care about in this package.
@@ -122,74 +120,104 @@ Predicted Outputs
In a classification setting, the predicted outputs and the true
targets are usually of different form and type. For example, in
margin-based classification it could be the case that the target
:math:`y=-1` and the predicted output :math:`\hat{y} = -1000`. It
would seem that the prediction is not really reflecting the
target properly, but in this case we would actually have a
perfectly correct prediction. This is because in margin-based
classification the main thing that matters about the predicted
output is that the sign agrees with the true target.
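
We can see this with one of the package's margin-based losses (a
sketch assuming :class:`L1HingeLoss`): the seemingly extreme
prediction incurs zero loss because its sign matches the target.

.. code-block:: jlcon

   julia> value(L1HingeLoss(), -1.0, -1000.0)   # sign agrees => no loss
   0.0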

More generally speaking, to be able to directly compare the
predicted outputs to the targets in a classification setting, one
first has to convert the predictions into the same form as the
targets. When doing this, we say that we **classify** the
prediction. We often refer to the initial predictions that are
not yet classified as **raw predictions**.
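
For margin-based predictions, classifying usually just means
taking the sign of the raw prediction (a minimal sketch):

.. code-block:: jlcon

   julia> raw_prediction = -1000.0;

   julia> sign(raw_prediction)   # the classified prediction
   -1.0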

Definitions
----------------------

We base most of our definitions on the work presented in
[STEINWART2008]_. Note, however, that we will adapt or simplify
in places at our discretion, if it makes sense to us considering
the scope of this package.

Let us again consider the term **prediction function**. More
formally, a prediction function :math:`h` is a function that maps
an input from the feature space :math:`X` to the real numbers
:math:`\mathbb{R}`. So calling :math:`h` with some features
:math:`x \in X` will produce the prediction :math:`\hat{y} \in
\mathbb{R}`.

.. math::

   h : X \rightarrow \mathbb{R}

This resulting prediction :math:`\hat{y}` is what we want to
compare to the target :math:`y` using some supervised loss. We
think of a **supervised loss** as a function of two parameters,
the true targets :math:`y \in Y` and the predicted outputs
:math:`\hat{y} \in \mathbb{R}`. The result of computing such a
loss will be a non-negative real number. The larger the number of
the loss, the worse the prediction.

.. math::

   L : Y \times \mathbb{R} \rightarrow [0,\infty)

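
To connect this definition back to the package's design of losses
as immutable types, here is a hand-rolled sketch (not the
package's actual implementation; all names are illustrative):

.. code-block:: julia

   # a toy distance-based supervised loss L(y, ŷ) = (ŷ - y)²,
   # defined as an immutable type plus a method, mirroring the
   # package's approach
   immutable MySquaredLoss end

   myvalue(::MySquaredLoss, y, ŷ) = abs2(ŷ - y)   # always >= 0

   myvalue(MySquaredLoss(), 1.0, 0.5)   # 0.25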
Note a few interesting things about supervised loss functions.

- The absolute value of a loss is often (but not always)
meaningless and doesn't offer itself to a useful
interpretation. What we usually care about is that the loss is
as small as it can be.

- In general the loss function we use is not the function we are
actually interested in minimizing. Instead we are minimizing
what is referred to as a "surrogate". For binary classification
for example we are really interested in minimizing the ZeroOne
loss (which simply counts the number of misclassified
predictions). However, that loss is difficult to minimize given
that it is neither convex nor continuous. That is why we use other
loss functions, such as the hinge loss or logistic loss. Those
losses are "classification calibrated", which basically means
they are good enough surrogates to solve the same problem.
Additionally, surrogate losses tend to have other nice
properties.

- For classification it does not need to be the case that a
"correct" prediction has a loss of zero. In fact, some
classification calibrated losses are never truly zero.
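
Both points can be illustrated with the package's margin-based
losses (a sketch assuming :class:`ZeroOneLoss`,
:class:`L1HingeLoss`, and :class:`LogitMarginLoss`):

.. code-block:: jlcon

   julia> value(ZeroOneLoss(), 1.0, 0.5)   # correct sign => zero loss
   0.0

   julia> value(L1HingeLoss(), 1.0, 0.5)   # surrogate penalizes small margin
   0.5

   julia> value(LogitMarginLoss(), 1.0, 10.0) > 0   # never truly zero
   true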



Alternative Viewpoints
------------------------

While the term "loss function" is usually used in the same
context throughout the literature, the specifics differ from one
textbook to another. For that reason we would like to mention
alternative definitions of what a "loss function" is. Note that
we will only give a partial and thus very simplified description
of these. Please refer to the listed sources for more specifics.

In [SHALEV2014]_ the authors consider a loss function as a
higher-order function of two parameters, a prediction model and
an observation tuple. So in that definition a loss function and
the prediction function are tightly coupled. This way of thinking
about it makes a lot of sense, considering the process of how a
prediction model is usually fit to the data. For gradient descent
to do its job it needs the, well, gradient of the empirical risk.
This gradient is computed using the chain rule for the inner loss
and the prediction model. If one views the loss and the
prediction model as one entity, then the gradient can sometimes
be simplified immensely. That said, we chose to not follow this
school of thought, because from a software-engineering standpoint
it made more sense to us to have small modular pieces. So in our
implementation the loss functions don't need to know that
prediction functions even exist. This makes the package easier to
maintain, test, and reason about. Thanks to Julia's multiple
dispatch we don't even lose the ability to simplify the gradient
if need be.

.. [SHALEV2014] Shalev-Shwartz, Shai, and Shai Ben-David. `"Understanding machine learning: From theory to algorithms" <http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning>`_. Cambridge University Press, 2014.
