# Deep Learning Book

### 1. Introduction
**Deep learning**: building complicated concepts out of simpler ones. Drawn as a graph, the hierarchy of concepts is deep, with many layers.

Knowledge base approach (knowledge is hard-coded in a formal language) vs machine learning approach (AI systems acquire their own knowledge by extracting patterns from raw data).

**Features and Representation**: simple ML algorithms depend on being given a good representation (a collection of features) of the input. For example, a price estimator is given the area of an apartment rather than pictures of it.

**Representation learning**: not only learn the mapping from representation to output, but also the representation itself.

**Autoencoder**: encoder function converts input data into a different representation, decoder function converts new representation back to the original format.

When designing features, we want to separate the **factors of variation** that explain the observed data. These factors may be unobserved or imagined. It can be hard to disentangle them from the raw data.

**Deep learning** helps address this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations.

#### 1.2.1 Neural networks

Neuroscience has given us a reason to hope that a single deep learning algorithm can solve many different tasks. Neuroscientists have found that ferrets can learn to “see” with the auditory processing region of their brain if their brains are rewired to send visual signals to that area (Von Melchner et al., 2000). This suggests that much of the mammalian brain might use a single algorithm to solve most of the different tasks that the brain solves.

Most neural networks today are based on a model neuron called the **rectified linear unit**. However, deep learning research is not an attempt to simulate the brain. **Computational neuroscience** is the field that attempts to understand how the brain works.

Two important concepts of connectionism: **distributed representation** - each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs (e.g., three color features, red-green-blue, combined with three object features, trucks-cars-birds, describe all nine color-object combinations with six features instead of nine); and **back-propagation**.
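
The color and object example can be made concrete with a small sketch. The function names `one_hot` and `distributed` are hypothetical, chosen only for illustration; the point is to compare how many units each encoding needs.

```python
from itertools import product

colors = ["red", "green", "blue"]
objects = ["truck", "car", "bird"]

# Local (one-hot) representation: one dedicated unit per (color, object) pair.
combos = list(product(colors, objects))

def one_hot(color, obj):
    vec = [0] * len(combos)
    vec[combos.index((color, obj))] = 1
    return vec

# Distributed representation: three color units followed by three object units.
def distributed(color, obj):
    return [int(c == color) for c in colors] + [int(o == obj) for o in objects]

print(len(one_hot("red", "car")))   # 9 units, exactly one of them active
print(distributed("red", "car"))    # [1, 0, 0, 0, 1, 0]
```

With the distributed encoding, the "red" unit is shared by red trucks, red cars and red birds, so whatever is learned about redness applies to all three.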


## I Applied Math and Machine Learning Basics
### 2. Linear Algebra
#### 2.1 Scalars, Vectors, Matrices and Tensors
**Scalars**: single numbers, written in lowercase italic typeface and introduced together with their type, e.g., "Let $s\in\mathbb{R}$ be the slope of the line."

**Vectors**: arrays of numbers, written in lowercase bold italic typeface, with elements written in italics with a subscript, e.g.,
$\boldsymbol{x}=
\left[\begin{array}{c}
x_1\\
x_2\\
\vdots\\
x_n\\
\end{array}
\right]$

**Matrices**: 2-D arrays of numbers, written in uppercase bold typeface. For a real-valued matrix with a height of $m$ and a width of $n$, $\mathbf{A}\in\mathbb{R}^{m\times n}$. $A_{i,j}$ indicates the element in the $i$-th row and $j$-th column.

**Tensors**: array with more than two axes.

**Transpose**: the mirror image of a matrix across its main diagonal, i.e., $(\mathbf{A}^\top)_{i,j}=A_{j,i}$.
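
As a rough illustration of these objects, the sketch below uses plain nested lists; this is an assumption made to keep the example dependency-free, and an array library would normally be used instead. The helper `transpose` is a hypothetical name.

```python
s = 2.5                          # scalar
x = [1.0, 2.0, 3.0]              # vector with n = 3 elements
A = [[1, 2, 3],
     [4, 5, 6]]                  # matrix with m = 2 rows and n = 3 columns
T = [[[0, 1], [2, 3]],
     [[4, 5], [6, 7]]]           # tensor with three axes (2 x 2 x 2)

def transpose(M):
    """Return the transpose: element (i, j) of the result is M[j][i]."""
    return [list(col) for col in zip(*M)]

At = transpose(A)
print(At)         # [[1, 4], [2, 5], [3, 6]]
print(A[0][2])    # 3, i.e. A_{1,3}, because Python indices start at 0
```

Transposing twice returns the original matrix, which is a quick sanity check for any transpose routine.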

#### 2.2 Multiplying Matrices and Vectors
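The product of $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $\mathbf{B}\in\mathbb{R}^{n\times p}$ is $\mathbf{C}=\mathbf{A}\mathbf{B}\in\mathbb{R}^{m\times p}$, with $C_{i,j}=\sum_k A_{i,k}B_{k,j}$. The number of columns of $\mathbf{A}$ must equal the number of rows of $\mathbf{B}$.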

The matrix product is distributive - $A(B+C)=AB+AC$, and associative - $A(BC)=(AB)C$, but not commutative - $AB=BA$ does not always hold. However, the dot product between two vectors is commutative - $x^Ty=y^Tx$.
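
These properties can be checked numerically on small matrices. The following is a minimal sketch with hypothetical helpers `matmul`, `matadd` and `dot` written over nested lists; the example matrices are arbitrary.

```python
def matmul(A, B):
    """C[i][j] = sum over k of A[i][k] * B[k][j]."""
    assert len(A[0]) == len(B), "inner dimensions must match"
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matadd(A, B):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def dot(x, y):
    """Dot product of two vectors of the same length."""
    return sum(a * b for a, b in zip(x, y))

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 3]]

distributive = matmul(A, matadd(B, C)) == matadd(matmul(A, B), matmul(A, C))
associative = matmul(A, matmul(B, C)) == matmul(matmul(A, B), C)
print(distributive, associative)   # True True
print(matmul(A, B))                # [[2, 1], [4, 3]]
print(matmul(B, A))                # [[3, 4], [1, 2]], so AB != BA here
print(dot([1, 2, 3], [4, 5, 6]) == dot([4, 5, 6], [1, 2, 3]))  # True
```

A single counterexample is enough to show that commutativity fails in general, whereas the `True` results only confirm the other identities for these particular inputs.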

#### 2.3 Identity and Inverse Matrices
#### 2.4 Linear Dependence and Span
A square matrix with linearly dependent columns is known as singular.
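
One way to see this in the 2 x 2 case is through the determinant, which is zero exactly when the columns are linearly dependent. The determinant is not covered in the notes above, so treat this as a side illustration; `det2` is a hypothetical helper.

```python
def det2(M):
    """Determinant of a 2 x 2 matrix [[a, b], [c, d]]: ad - bc."""
    (a, b), (c, d) = M
    return a * d - b * c

dependent = [[1, 2],
             [2, 4]]      # second column is 2 times the first
independent = [[1, 2],
               [3, 4]]

print(det2(dependent))    # 0, so the matrix is singular
print(det2(independent))  # -2, so the matrix is invertible
```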

#### 2.5 Norms



### 3. Probability and Information Theory
Probability theory allows us to make uncertain statements and to reason in the presence of uncertainty. Information theory allows us to quantify the amount of uncertainty in a probability distribution.

**Frequentist probability** describes the rate at which an outcome would occur if the same experiment were repeated infinitely many times. **Bayesian probability** describes a degree of belief.

#### 3.3 Probability Distributions
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way it is described depends on whether the variables are discrete or continuous.

If $\mathrm{x}$ follows probability mass function $P(\mathrm{x})$, it's written as $\mathrm{x}\sim P(\mathrm{x})$.

**Joint probability distribution**: a distribution over several variables at once, e.g., $P(\mathrm{x}=x,\mathrm{y}=y)$.

A PMF that satisfies $\sum_{x\in \mathrm{x}}P(x)=1$ is **normalized**.

**Probability mass function (PMF)** for discrete variables and **Probability density function (PDF)** for continuous variables.

#### 3.4 Marginal Probability
The probability distribution over a subset of the variables is known as the marginal probability distribution. For example, if we know $P(\mathrm{x,y})$, we can find $P(\mathrm{x})$ with the **sum rule**: $$\forall x\in\mathrm{x},\ P(\mathrm{x}=x)=\sum_{y}P(\mathrm{x}=x,\mathrm{y}=y)$$
For continuous variables, $$p(x)=\int p(x,y)\,dy$$
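
For the discrete case, the sum rule amounts to adding up the rows of a joint table. The sketch below uses a made-up joint PMF (the numbers are an assumption for illustration only) and exact fractions to avoid rounding; `marginal_x` is a hypothetical helper.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(x, y) over x in {0, 1} and y in {"a", "b", "c"}.
joint = {
    (0, "a"): F(1, 10), (0, "b"): F(2, 10), (0, "c"): F(1, 10),
    (1, "a"): F(3, 10), (1, "b"): F(1, 10), (1, "c"): F(2, 10),
}
assert sum(joint.values()) == 1   # the joint PMF is normalized

def marginal_x(joint):
    """Sum rule: P(x) = sum over y of P(x, y)."""
    p = {}
    for (x, _y), prob in joint.items():
        p[x] = p.get(x, 0) + prob
    return p

p_x = marginal_x(joint)
print({x: str(p) for x, p in p_x.items()})   # {0: '2/5', 1: '3/5'}
```

The marginal is itself normalized, since summing it over $x$ just sums every entry of the joint table.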

#### 3.5 Conditional Probability
$$P(\mathrm{y}=y|\mathrm{x}=x)=\frac{P(\mathrm{y}=y,\mathrm{x}=x)}{P(\mathrm{x}=x)}$$
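
The definition can be applied directly to a joint table: restrict to the entries with the given $x$ and divide by their total. The joint PMF below is a made-up example and `conditional_y_given_x` is a hypothetical helper; note that the conditional is undefined when $P(\mathrm{x}=x)=0$.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(x, y) over x in {0, 1} and y in {"a", "b", "c"}.
joint = {
    (0, "a"): F(1, 10), (0, "b"): F(2, 10), (0, "c"): F(1, 10),
    (1, "a"): F(3, 10), (1, "b"): F(1, 10), (1, "c"): F(2, 10),
}

def conditional_y_given_x(joint, x):
    """P(y | x) = P(y, x) / P(x); raises if P(x) = 0."""
    p_x = sum(prob for (xi, _y), prob in joint.items() if xi == x)
    if p_x == 0:
        raise ValueError("P(x) = 0, so the conditional probability is undefined")
    return {y: prob / p_x for (xi, y), prob in joint.items() if xi == x}

p_y_given_x1 = conditional_y_given_x(joint, 1)
print({y: str(p) for y, p in p_y_given_x1.items()})
# {'a': '1/2', 'b': '1/6', 'c': '1/3'}
```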

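Conditional probability should not be confused with computing what would happen if some action were undertaken. The latter is an **intervention query**, which belongs to causal modeling.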

#### 3.6 The Chain Rule of Conditional Probabilities


### 4. Numeric Computation


### 5. Machine Learning Basics




## II Deep Networks
### 6. Deep Feedforward Networks


### 7. Regularization for Deep Learning


### 8. Optimization for Training Deep Models


### 9. Convolutional Networks


### 10. Sequence Modeling: Recurrent and Recursive Nets


### 11. Practical Methodology


### 12. Applications




## III Deep Learning Research
### 13. Linear Factor Models


### 14. Autoencoders


### 15. Representation Learning


### 16. Structured Probabilistic Models for Deep Learning


### 17. Monte Carlo Methods


### 18. Confronting the Partition Function


### 19. Approximate Inference


### 20. Deep Generative Models
