# Chapter 1

## Notes

* With $p$ features there may be up to $2^p - p - 1$ multiplicative interaction effects, one for each subset of two or more features.
* In supervised learning, we are given labeled data, i.e. pairs $(x_1, y_1), \dots, (x_n, y_n)$ with $x_1, \dots, x_n \in \textbf{X}$ and $y_1, \dots, y_n \in \textbf{Y}$, and the goal is to learn the relationship between $\textbf{X}$ and $\textbf{Y}$. Supervised machine learning uses a parameterized model $g(X \mid \theta)$ over independent variables $X$ to predict the continuous or categorical output $Y$. The model is parameterized by one or more free parameters $\theta$, which are fitted to data.
* Each observation $x_i$ is referred to as a feature vector and $y_i$ is the label or response.
* In unsupervised learning, we are given unlabeled data, $x_1, x_2, \dots, x_n$, and our goal is to retrieve exploratory information about the data, perhaps grouping similar observations or capturing some hidden patterns.
* The third machine learning paradigm is reinforcement learning, an algorithmic approach for enforcing Bellman optimality of a Markov Decision Process: it defines a set of states and actions in response to a changing regime so as to maximize some notion of cumulative reward.
* There are two different classes of supervised learning models, discriminative
and generative. A discriminative model learns the decision boundary between
the classes and implicitly learns the distribution of the output conditional on the
input. A generative model explicitly learns the joint distribution of the input and
output. Neural networks and decision trees are examples of the former; a
restricted Boltzmann machine (RBM) is an example of the latter.
* A model is referred to as non-parametric if its parameter space is infinite-dimensional and parametric if its parameter space is finite-dimensional.
* Model selection in machine learning is based on a quantity known as **entropy**. Entropy represents the amount of information associated with each event.
* Shannon entropy is defined as: $$ H(X) = -\sum_{x} p_x \log(p_x) $$
* Cross entropy is defined as: $$ H(f,g) = -\mathbf{E}_f[\log(g)] = -\sum_{y \in Y} f(y)\log(g(y \mid \theta)) \geq H(f) $$
* Neural networks represent the non-linear map $F(X)$ over a high-dimensional input
space using hierarchical layers of abstractions. The activation functions are essential for the network to
approximate non-linear functions. For example, if there is one hidden layer and $\sigma^{(1)}$
is the identity function, then the network collapses to a linear model in $X$:
$$\hat{Y}(X) = W^{(2)}(W^{(1)}X + b^{(1)}) + b^{(2)} = W^{(2)}W^{(1)}X + W^{(2)}b^{(1)} + b^{(2)} = W'X + b'$$
* Machine learning and statistical methods can be further characterized by whether
they are parametric or non-parametric. Parametric models assume some finite
set of parameters and attempt to model the response as a function of the input
variables and the parameters.
* Within probabilistic modeling, a particular niche is occupied by the so-called
state-space models. In these models one assumes the existence of a certain
unobserved, latent, process, whose evolution drives a certain observable process.
The evolution of the latent process and the dependence of the observable process on
the latent process may be given in stochastic, probabilistic terms, which places the
state-space models within the realm of probabilistic modeling. A deterministic model may also produce
a probabilistic output: for example, a logistic regression gives the probability
that the response is positive given the input variables.
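
The entropy and cross-entropy definitions in the notes above can be checked numerically. The following is a minimal sketch, assuming natural logarithms and two small hypothetical discrete distributions `f` and `g` that are not taken from the source. It illustrates the inequality $H(f,g) \geq H(f)$, with equality when the model distribution matches the true one.

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_x p_x * log(p_x), skipping zero-probability events."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(f, g):
    """H(f, g) = -sum_y f(y) * log(g(y)); assumes g(y) > 0 wherever f(y) > 0."""
    return -sum(fy * math.log(gy) for fy, gy in zip(f, g) if fy > 0)

# Hypothetical distributions chosen only for illustration.
f = [0.5, 0.25, 0.25]   # "true" distribution
g = [0.4, 0.4, 0.2]     # model distribution

h_f = shannon_entropy(f)
h_fg = cross_entropy(f, g)

print("H(f)    =", h_f)
print("H(f, g) =", h_fg)
print("H(f, f) =", cross_entropy(f, f))  # equals H(f) up to rounding
```

The gap `h_fg - h_f` is the Kullback-Leibler divergence from `g` to `f`, which is why minimizing cross entropy over $\theta$ pulls the model distribution toward the data distribution.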

![](first.png)
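
The notes above state that a network with one hidden layer and an identity activation reduces to a linear map $W'X + b'$. A small sketch of that collapse follows, using plain Python lists rather than a numerical library. The weights, biases, and input are hypothetical values picked for illustration only.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, 0.0, -1.0]]
b2 = [0.5]

def network(x):
    hidden = add(matvec(W1, x), b1)      # identity activation: no non-linearity
    return add(matvec(W2, hidden), b2)

# Collapsed parameters: W' = W2 W1 and b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)

def linear(x):
    return add(matvec(W_prime, x), b_prime)

x = [0.7, -1.3]
print(network(x), linear(x))  # the two outputs agree up to rounding
```

Replacing the identity with a non-linear activation breaks this equivalence, which is the sense in which activation functions are essential.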

Regularized regression without use of parametric assumptions on the error
distribution is an example of machine learning. Unregularized regression with, say,
Gaussian error is not a machine learning technique.

Supervised machine learning
1. is an algorithmic approach to statistical inference which, crucially, does not
depend on a data generation process;
2. estimates a parameterized map between inputs and outputs, with the functional
form defined by the methodology such as a neural network or a random forest;
3. automates model selection, using regularization and ensemble averaging techniques
to iterate through possible models and arrive at a model with the best
out-of-sample performance; and
4. is often well suited to large sample sizes of high-dimensional non-linear covariates.
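
Point 3 in the list above can be sketched in miniature. The example below is a toy illustration under stated assumptions, not a method given in the notes: it fits a one-dimensional ridge regression without intercept in closed form, on synthetic data, and picks the penalty `lam` with the lowest error on a held-out split. The data-generating slope, noise level, split sizes, and candidate penalties are all hypothetical.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for y ~ beta * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(xs, ys, beta):
    """Mean squared error of the prediction beta * x."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data, used only to drive the example.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# Hold out the last 20 points to estimate out-of-sample error.
train_x, val_x = xs[:40], xs[40:]
train_y, val_y = ys[:40], ys[40:]

# Iterate over candidate models (one per penalty) and score each out of sample.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {
    lam: mse(val_x, val_y, fit_ridge_1d(train_x, train_y, lam))
    for lam in lambdas
}
best_lambda = min(scores, key=scores.get)

print("validation MSE by lambda:", scores)
print("selected lambda:", best_lambda)
```

In practice the single holdout split would usually be replaced by cross-validation, and the model by something higher-dimensional, but the selection loop has the same shape.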

The emphasis on out-of-sample performance and automated model selection, together
with the absence of a pre-determined parametric data generation process, is what
makes machine learning a more robust approach than many of the parametric financial
econometrics techniques in use today.