I explore the theory behind the calculus of variations: functionals, the divergence theorem, functional differentiation, and functional gradient descent. I implement functional gradient descent using both analytical and numerical methods. Furthermore, I explore kernels, reproducing kernel Hilbert spaces, support vector machines, and much more.
Multivariate calculus concerns itself with infinitesimal changes of numerical functions – that is, functions that accept a vector of real numbers and output a real number:
We can take this concept one level further, and inspect functions of functions, called functionals. Given a set of functions,
In other words, a functional is a function defined over functions, that maps functions to the field of real (or complex) numbers.
Some examples of functionals include:
- The evaluation functional: $E_x(f) = f(x)$
- The sum functional:
- The integration functional: $I_{[a, b]}(f) = \int_a^b f(x)\, dx$
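As a tiny illustration (a sketch, not part of the original write-up), the evaluation and integration functionals can be written as ordinary Python functions that take a function and return a number:

```python
import numpy as np

# Functionals map a function to a number. Two of the examples above,
# realized numerically (the integral via a simple Riemann sum).
def evaluation_functional(f, x):
    return f(x)                              # E_x(f) = f(x)

def integration_functional(f, a, b, n=10_000):
    xs = np.linspace(a, b, n)
    return (b - a) * np.mean(f(xs))          # ≈ I_[a,b](f) = ∫_a^b f(x) dx

print(evaluation_functional(np.sin, 0.0))        # 0.0
print(integration_functional(np.square, 0, 1))   # ≈ 1/3
```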
It follows that the composition of a function
This is where it gets interesting. Assuming we can create a loss functional, we get a measure of how similar two functions are. For example,
could be considered a loss function, where the first term (the
The previous loss function maps a hypothesized function
provides a space of functions
This is good for when data points
A kernel
where
Basically, it is a dot product between "features" of x (
This means, if there is a function
Assume we want to do functional gradient descent where our loss function is
In the case of the Fourier Series, we have
We can see clearly that we can represent a function
we can compute an eigenvector decomposition as
where
Instead of computing
Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the "kernel trick".
Here's the idea in an example. For the RBF kernel:
For this kernel, the dimension of the feature space defined by
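One common way to see this, written here in one dimension with an assumed width parameter $\gamma$: the RBF kernel factors into an infinite sum of products of features of $x$ and $y$,

$$
e^{-\gamma (x - y)^2}
= e^{-\gamma x^2}\, e^{-\gamma y^2} \sum_{n=0}^{\infty} \frac{(2\gamma)^n}{n!}\, x^n y^n
= \sum_{n=0}^{\infty} \phi_n(x)\,\phi_n(y),
\qquad
\phi_n(x) = \sqrt{\frac{(2\gamma)^n}{n!}}\; x^n e^{-\gamma x^2},
$$

so the implicit feature map $\phi$ has infinitely many components.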
The whole idea boils down to this. We want to represent the estimated function $f(x)$ as a linear combination of basis functions. Instead of doing this computation directly, we can express it as a linear combination of a kernel evaluated at specific "centers".
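Concretely, with the data points themselves as the centers, the estimate takes the form

$$
f(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i),
$$

so fitting $f$ reduces to fitting the finite coefficient vector $\alpha$, which is exactly what the algorithm below does.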
For
we have the functional derivative
where
Finally, we can choose to represent
Finally, we define our algorithm and implement it.
- Get dataset $x, y := f(x)$
- Create the Gram kernel $K(i,j) = k(x_i, x_j)$, where $k$ is the selected kernel (e.g. the RBF kernel)
- Initialize the $\alpha$ vector (random values close to 0, perhaps)
- Perform descent:
  - Calculate the function: $f_k(x) = K \cdot \alpha_k$
  - Update the coefficients: $\alpha_{k+1} = 2 \eta (y - f_k(x)) + (1 - 2\eta \lambda) \alpha_k$
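Below is a minimal sketch of this loop in Python/NumPy. The toy dataset, the RBF width $\gamma$, and the values of $\eta$ and $\lambda$ are assumptions for the sketch, not the exact settings used in the experiments that follow:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=10.0):
    # k(x, x') = exp(-gamma * (x - x')^2) for 1-D inputs (gamma is an assumed value)
    return np.exp(-gamma * (x1[:, None] - x2[None, :]) ** 2)

# Toy dataset: noisy samples of a target function (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

K = rbf_kernel(x, x)                          # Gram kernel K(i, j) = k(x_i, x_j)
alpha = 1e-3 * rng.standard_normal(len(x))    # alpha initialized close to 0

eta, lam = 0.01, 0.01                         # learning rate and regularization (assumed)
for _ in range(200):
    f = K @ alpha                             # f_k(x) = K · alpha_k
    alpha = 2 * eta * (y - f) + (1 - 2 * eta * lam) * alpha   # alpha_{k+1} update

f_hat = K @ alpha                             # fitted function at the data points
```

At a new point $x_*$ the fitted function would be evaluated as $\sum_i \alpha_i k(x_*, x_i)$, in line with the linear-combination-of-centers form above.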
Using the RBF Kernel defined earlier, we get the following results:
rbf_kernel_close.mp4
After 200 iterations using the RBF kernel with
rbf_kernel.mp4
It looks like there is a region where the function becomes flat, and the RBF function with
I will now attempt to implement a Fourier series kernel. The kernel will be defined on
where
I specified the kernel to be
fourier_kernel.mp4
From Wikipedia:
In the calculus of variations, a field of mathematical analysis, the functional derivative (or variational derivative) relates a change in a functional (a functional in this sense is a function that acts on functions) to a change in a function on which the functional depends.
In the calculus of variations, functionals are usually expressed in terms of an integral of functions, their arguments, and their derivatives. In an integrand $L$ of a functional, if a function $f$ is varied by adding to it another function $\delta f$ that is arbitrarily small, and the resulting integrand is expanded in powers of $\delta f$, the coefficient of $\delta f$ in the first order term is called the functional derivative.
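For reference, the defining relation of the functional derivative (the form used in the examples below) is

$$
\int \frac{\delta F}{\delta \rho}(x)\, \phi(x)\, dx
= \lim_{\epsilon \to 0} \frac{F[\rho + \epsilon \phi] - F[\rho]}{\epsilon}
= \left[ \frac{d}{d\epsilon} F[\rho + \epsilon \phi] \right]_{\epsilon = 0}
$$

for an arbitrary test function $\phi$.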
Given a function
Consider the functional:
The functional derivative of
The derivation can be found on Wikipedia; each line uses very interesting math that sent me down deep rabbit holes before I could actually understand and grasp it. The concepts used are the total derivative, the product rule for divergence, the divergence theorem, and the fundamental lemma of calculus of variations.
Therefore, the functional derivative is
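Assuming the functional has the standard form $F[\rho] = \int f\big(r, \rho(r), \nabla\rho(r)\big)\, dr$ considered in that derivation, the result is

$$
\frac{\delta F}{\delta \rho} = \frac{\partial f}{\partial \rho} - \nabla \cdot \frac{\partial f}{\partial \nabla \rho}.
$$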
I will now attempt to apply this derivation and solve multiple examples.
The entropy of a discrete random variable is a functional of the probability mass function.
We can utilize the equation that defines the functional derivative:
Thus,
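For reference, here is the computation written out, assuming the entropy uses the natural logarithm (with a base-2 logarithm the result is divided by $\ln 2$):

$$
H[p] = -\sum_x p(x) \ln p(x), \qquad
\left.\frac{d}{d\epsilon} H[p + \epsilon \phi]\right|_{\epsilon=0}
= -\sum_x \big(\ln p(x) + 1\big)\, \phi(x)
\;\;\Rightarrow\;\;
\frac{\delta H}{\delta p(x)} = -\ln p(x) - 1.
$$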
I created the functional:
where
First, I will use the definition of the functional derivative.
If we didn't know how to expand
We will use the generic formula:
Finally, we arrive at the functional derivative, which can be verified using MATLAB's functionalDerivative method:
Let
This can be verified using MATLAB's functionalDerivative method:
Finally, after understanding how the functional derivative works, we will attempt functional gradient descent. Unlike the kernel approach, we do not have the functional
This means that to get the gradient along a direction, we can perturb the function
The approach looks like this:
- Generate a range of the independent variable $x$
- Generate a perturbation function $\phi(x)$
- Initialize the learning rate $\eta$ and epsilon $\epsilon$
- Generate an initial estimate $\rho(x)$
- Set any initial conditions, e.g. $\rho(0) = c$
- Start iterating until the maximum number of iterations is reached or the functional drops below a threshold cost:
  - Up to a specific degree $d$, generate $\frac{\partial^d \rho}{\partial x^d}, \frac{\partial^d \phi}{\partial x^d}$
  - Calculate the gradient $Df = \frac{f(x,\rho + \epsilon \phi,\partial \rho + \epsilon \partial \phi,\dots) - f(x,\rho,\partial \rho,\dots)}{\epsilon}$
  - Update the estimate $\rho_{k+1}(x) = \rho_k(x) - \eta\, Df$
Important things to note:
- The learning rate (step size) must be tuned correctly or else we risk the solution diverging.
- When an initial condition exists, we don't update that specific point. This means that if $\rho(x_0) = cst$, then the neighboring points $\rho(x_k)$ will be affected by the first-degree derivative after $k$ iterations. This is seen only if the functional depends on derivatives of $\rho(x)$. The example $L(\rho) = \int (\rho(x)+\frac{\partial \rho}{\partial x}(x))^2 dx$ solves to $\rho(x) = c_0e^{-x}$. This example is illustrated below.
- When calculating the gradient numerically, I used central differences for points not at the edges. For points at the edges I used forward and backward differences.
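As an illustration, here is a minimal sketch of this procedure in Python/NumPy, applied to the differential-equation example mentioned in the note above, $L(\rho) = \int (\rho + \rho')^2\, dx$ with $\rho(0) = 1$. The grid, the constant perturbation $\phi(x) \equiv 1$, and the values of $\eta$ and $\epsilon$ are assumptions for the sketch, not the exact settings used in the experiments:

```python
import numpy as np

# Integrand of the functional L[rho] = ∫ (rho(x) + rho'(x))^2 dx from the note above;
# with rho(0) = 1 its (continuous) minimizer is rho(x) = exp(-x).
def integrand(x, rho, drho):
    return (rho + drho) ** 2

def ddx(u, dx):
    # First derivative: central differences inside, one-sided at the edges
    du = np.empty_like(u)
    du[1:-1] = (u[2:] - u[:-2]) / (2 * dx)
    du[0] = (u[1] - u[0]) / dx           # forward difference at the left edge
    du[-1] = (u[-1] - u[-2]) / dx        # backward difference at the right edge
    return du

x = np.linspace(0.0, 3.0, 61)
dx = x[1] - x[0]
phi = np.ones_like(x)                    # perturbation function phi(x) (assumed constant)
eps, eta = 1e-6, 2e-3                    # epsilon and learning rate (assumed values)

rho = np.zeros_like(x)                   # initial estimate of rho(x)
rho[0] = 1.0                             # initial condition rho(0) = 1

for _ in range(3000):
    drho, dphi = ddx(rho, dx), ddx(phi, dx)
    # Numerical directional derivative of the integrand along phi
    Df = (integrand(x, rho + eps * phi, drho + eps * dphi)
          - integrand(x, rho, drho)) / eps
    rho -= eta * Df                      # gradient descent update
    rho[0] = 1.0                         # don't update the constrained point
```

The step size has to be kept small relative to the grid spacing, echoing the note above about tuning the learning rate, since the update couples neighbouring points through the finite-difference derivative.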
I'll start with a simple functional:
sin.mp4
Now, I'll get to a differential equation:
exp_fast.mp4
We can see very clearly how the initial point, set to 1, propagates information about the estimated function's derivative with every iteration.
This turned out to be a huge project, as I had not planned to dive as deep as I did. However, I learned so much about the calculus of variations, the divergence theorem, functionals, functional differentiation, SVMs, kernels, Hilbert spaces, RKHS, Mercer's theorem, Gram kernels, and so much more interesting math. I had to go through grad lectures from multiple universities, Wikipedia, books, and published papers just to scratch the surface of this field and build something that applies the theory and actually works.
- Wikipedia Functional Derivative
- Lecture 1, Lecture 2
- Gradient Boosting blog
- Tompkins, A., & Ramos, F. (2018, April). Fourier feature approximations for periodic kernels in time-series modelling. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1). (link)
- Great reference for Functional Gradient Descent with Kernels: Simple Complexities Blog
- Summary reference for Reproducing Kernel Hilbert Spaces
- Great simple to understand reference to RKHS and kernels lecture
- Kernels and Kernel Methods lecture
- Wikipedia Kernel Method
- Wikipedia Divergence and Wolfram Divergence
- Wikipedia Divergence Theorem and Wolfram Divergence Theorem
- SVM and Kernel SVM