Commit
readthedocs: work on background
Evizero committed Jan 16, 2017
1 parent 63711b8 commit b88c184
Showing 3 changed files with 108 additions and 65 deletions.
5 changes: 2 additions & 3 deletions docs/index.rst
@@ -90,9 +90,8 @@ Loss Functions for Regression

Loss functions that belong to the category "distance-based" are
primarily used in regression problems. They utilize the numeric
difference between the predicted output and the true target as a
proxy variable to quantify the quality of individual predictions.
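
For intuition, the following sketch (assuming the exported
:func:`value` and :class:`L2DistLoss`) shows that such a loss
depends only on the difference :math:`\hat{y} - y`:

.. code-block:: jlcon

   julia> value(L2DistLoss(), 1.0, 1.5)   # difference of 0.5
   0.25

   julia> value(L2DistLoss(), 0.0, 0.5)   # same difference, same loss
   0.25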

+----------------------------------------------+----------------------------------------------------------------------------------------+
| .. toctree:: | .. image:: https://rawgithub.com/JuliaML/FileStorage/master/LossFunctions/distance.svg |
46 changes: 31 additions & 15 deletions docs/introduction/gettingstarted.rst
@@ -1,14 +1,14 @@
Getting Started
================

LossFunctions.jl is the result of a collaborative effort to
design and implement an efficient but also convenient-to-use
`Julia <http://julialang.org/>`_ library for, well, loss
functions. As such, this package implements the functionality
needed to query various properties about a loss function (such as
convexity), as well as a number of methods to compute its value,
derivative, and second derivative for single observations or
arrays of observations.
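
As a quick taste of those queries, consider the following sketch
(assuming the exported names :func:`isconvex`, :func:`value`, and
:func:`deriv`, which take the true target first and the predicted
output second):

.. code-block:: jlcon

   julia> isconvex(L2DistLoss())   # query a property of the loss
   true

   julia> value(L2DistLoss(), 1.0, 0.5)   # (0.5 - 1.0)^2
   0.25

   julia> deriv(L2DistLoss(), 1.0, 0.5)   # 2 * (0.5 - 1.0)
   -1.0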

In this section we will provide a condensed overview of the
package. In order to keep this overview concise, we will not
@@ -47,7 +47,7 @@ as usual.
using LossFunctions
Typically, the losses we work with in Machine Learning are
multivariate functions of two variables, the **true target**
:math:`y`, which represents the "ground truth" (i.e. correct
answer), and the **predicted output** :math:`\hat{y}`, which is
@@ -67,9 +67,9 @@ returns a value that quantifies how "bad" our prediction is
in comparison to the truth. In other words: the lower the
loss, the better the prediction.

From an implementation perspective, we should point out that all
the concrete loss "functions" that this package provides are
actually defined as immutable types, instead of native Julia
functions. We can compute the value of some type of loss using
the function :func:`value`. Let us start with an example of how
to compute the loss of a single observation (i.e. two numbers).
@@ -213,6 +213,10 @@ derivatives using :func:`deriv2`.

.. code-block:: jlcon

   julia> true_targets = [ 1, 0, -2];

   julia> pred_outputs = [0.5, 2, -1];

   julia> deriv(L2DistLoss(), true_targets, pred_outputs)
   3-element Array{Float64,1}:
    -1.0
@@ -225,13 +229,25 @@ derivatives using :func:`deriv2`.
2.0
2.0
Additionally, we provide mutating versions for the subset of
methods that return an array. These have the same function
signatures, with the only difference being that they require an
additional parameter as the first argument. This variable should
always be the preallocated array that is to be used as storage.

.. code-block:: jlcon

   julia> buffer = zeros(3)
   3-element Array{Float64,1}:
    0.0
    0.0
    0.0

   julia> deriv!(buffer, L2DistLoss(), true_targets, pred_outputs)
   3-element Array{Float64,1}:
    -1.0
     4.0
     2.0

Regression vs Classification
-----------------------------
122 changes: 75 additions & 47 deletions docs/introduction/motivation.rst
@@ -4,8 +4,8 @@ Background and Motivation
In this section we will discuss the concept "loss function" in
more detail. We will start by introducing some terminology and
definitions. However, please note that we won't attempt to give a
complete treatment of loss functions and the math involved
(unlike a book or a lecture could do). So this section won't be a
substitute for proper literature on the topic. While we will
try to cover all the basics necessary to get a decent intuition
of the ideas involved, we do assume basic knowledge about Machine
@@ -25,9 +25,9 @@ Terminology
To start off, let us go over some basic terminology. In **Machine
Learning** (ML) we are primarily interested in automatically
learning meaningful patterns from data. For our purposes it
suffices to say that in ML we try to teach the computer to solve
a task by induction rather than by definition. This package is
primarily concerned with the subset of Machine Learning that
falls under the umbrella of **Supervised Learning**. There we are
interested in teaching the computer to predict a specific output
for some given input. In contrast to unsupervised learning the
@@ -40,17 +40,15 @@ require some meaningful way to show the true answers to the
computer so that it can learn from "seeing" them. More
importantly, we have to somehow put the true answer into relation
to what the computer currently predicts the answer should be.
This would provide the basic information needed for the computer
to be able to improve; that is what loss functions are for.

When we say we want our computer to learn something that is able
to make predictions, we are talking about a **prediction
function**, denoted as :math:`h` and sometimes called "fitted
hypothesis", or "fitted model". Note that we will avoid the term
hypothesis for the simple reason that it is widely used in
statistics for something completely different.

We don't consider a prediction *function* as the same thing as a
prediction *model*, because we think of a **prediction model** as
a family of prediction functions. What that boils down to is that
@@ -66,8 +64,8 @@ in that scenario be a concrete linear function with a particular
set of coefficients.
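
For example, one concrete member of the linear model family might
look like this (a hypothetical sketch; the names ``w``, ``b``, and
``h`` are purely illustrative):

.. code-block:: julia

   # one particular prediction function from the linear family
   w = [0.3, -0.1]   # fitted coefficients
   b = 0.5           # fitted intercept
   h(x) = dot(w, x) + b

   h([1.0, 2.0])   # ≈ 0.6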

The purpose of a prediction function is to take some input and
produce a corresponding output. That output should be as faithful
as possible to the true answer. In the context of this package we
will refer to the "true answer" as the **true target**, or short
"target". During training, and only during training, inputs and
targets can both be considered as part of our data set. We say
@@ -76,7 +74,7 @@ actually have the targets available to us (otherwise there would
be no prediction problem to solve in the first place). In essence
we can think of our data as two entities with a 1-to-1 connection
in each observation, the inputs, which we call **features**, and
the corresponding desired outputs, which we call **true targets**.

Let us be a little more concrete with the two terms we really
care about in this package.
@@ -122,74 +120,104 @@ Predicted Outputs
In a classification setting, the predicted outputs and the true
targets are usually of different form and type. For example, in
margin-based classification it could be the case that the target
:math:`y=-1` and the predicted output :math:`\hat{y} = -1000`. It
would seem that the prediction is not really reflecting the
target properly, but in this case we would actually have a
perfectly correct prediction. This is because in margin-based
classification the main thing that matters about the predicted
output is that the sign agrees with the true target.
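
We can see this with one of the package's margin-based losses (a
sketch assuming :class:`L1HingeLoss`): the seemingly extreme
prediction incurs zero loss because its sign matches the target.

.. code-block:: jlcon

   julia> value(L1HingeLoss(), -1.0, -1000.0)   # sign agrees => no loss
   0.0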

More generally speaking, to be able to directly compare the
predicted outputs to the targets in a classification setting, one
first has to convert the predictions into the same form as the
targets. When doing this, we say that we **classify** the
prediction. We often refer to the initial predictions that are
not yet classified as **raw predictions**.
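
For margin-based predictions, classifying usually just means
taking the sign of the raw prediction (a minimal sketch):

.. code-block:: jlcon

   julia> raw_prediction = -1000.0;

   julia> sign(raw_prediction)   # the classified prediction
   -1.0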

Definitions
----------------------

We base most of our definitions on the work presented in
[STEINWART2008]_. Note, however, that we will adapt or simplify
in places at our discretion, if it makes sense to us considering
the scope of this package.

Let us again consider the term **prediction function**. More
formally, a prediction function :math:`h` is a function that maps
an input from the feature space :math:`X` to the real numbers
:math:`\mathbb{R}`. So calling :math:`h` with some features
:math:`x \in X` will produce the prediction :math:`\hat{y} \in
\mathbb{R}`.

.. math::

   h : X \rightarrow \mathbb{R}

This resulting prediction :math:`\hat{y}` is what we want to
compare to the target :math:`y` using some supervised loss. We
think of a **supervised loss** as a function of two parameters,
the true targets :math:`y \in Y` and the predicted outputs
:math:`\hat{y} \in \mathbb{R}`. The result of computing such a
loss will be a non-negative real number. The larger the number of
the loss, the worse the prediction.

.. math::

   L : Y \times \mathbb{R} \rightarrow [0,\infty)

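
To connect this definition back to the package's design of losses
as immutable types, here is a hand-rolled sketch (not the
package's actual implementation; all names are illustrative):

.. code-block:: julia

   # a toy distance-based supervised loss L(y, ŷ) = (ŷ - y)²,
   # defined as an immutable type plus a method, mirroring the
   # package's approach
   immutable MySquaredLoss end

   myvalue(::MySquaredLoss, y, ŷ) = abs2(ŷ - y)   # always >= 0

   myvalue(MySquaredLoss(), 1.0, 0.5)   # 0.25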
Note a few interesting things about supervised loss functions.

- The absolute value of a loss is often (but not always)
meaningless and doesn't offer itself to a useful
interpretation. What we usually care about is that the loss is
as small as it can be.

- In general the loss function we use is not the function we are
actually interested in minimizing. Instead we are minimizing
what is referred to as a "surrogate". For binary classification
for example we are really interested in minimizing the ZeroOne
loss (which simply counts the number of misclassified
predictions). However, that loss is difficult to minimize given
that it is neither convex nor continuous. That is why we use other
loss functions, such as the hinge loss or logistic loss. Those
losses are "classification calibrated", which basically means
they are good enough surrogates to solve the same problem.
Additionally, surrogate losses tend to have other nice
properties.

- For classification it does not need to be the case that a
"correct" prediction has a loss of zero. In fact, some
classification calibrated losses are never truly zero.
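
Both points can be illustrated with the package's margin-based
losses (a sketch assuming :class:`ZeroOneLoss`,
:class:`L1HingeLoss`, and :class:`LogitMarginLoss`):

.. code-block:: jlcon

   julia> value(ZeroOneLoss(), 1.0, 0.5)   # correct sign => zero loss
   0.0

   julia> value(L1HingeLoss(), 1.0, 0.5)   # surrogate penalizes small margin
   0.5

   julia> value(LogitMarginLoss(), 1.0, 10.0) > 0   # never truly zero
   true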



Alternative Viewpoints
------------------------

While the term "loss function" is usually used in the same
context throughout the literature, the specifics differ from one
textbook to another. For that reason we would like to mention
alternative definitions of what a "loss function" is. Note that
we will only give a partial and thus very simplified description
of these. Please refer to the listed sources for more specifics.

In [SHALEV2014]_ the authors consider a loss function as a
higher-order function of two parameters, a prediction model and
an observation tuple. So in that definition a loss function and
the prediction function are tightly coupled. This way of thinking
about it makes a lot of sense, considering the process of how a
prediction model is usually fit to the data. For gradient descent
to do its job it needs the, well, gradient of the empirical risk.
This gradient is computed using the chain rule for the inner loss
and the prediction model. If one views the loss and the
prediction model as one entity, then the gradient can sometimes
be simplified immensely. That said, we chose to not follow this
school of thought, because from a software-engineering standpoint
it made more sense to us to have small modular pieces. So in our
implementation the loss functions don't need to know that
prediction functions even exist. This makes the package easier to
maintain, test, and reason about. Thanks to Julia's multiple
dispatch we don't even lose the ability to simplify the gradient
if need be.

.. [SHALEV2014] Shalev-Shwartz, Shai, and Shai Ben-David. `"Understanding machine learning: From theory to algorithms" <http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning>`_. Cambridge University Press, 2014.
