#### Steps of an ML Project:
1. Select a problem: for example, a voice-activated device (via supervised learning). We need to specify the input and output
    - X: audioclip $\to$ Y: 1{found trigger word}
2. Get labelled data
    - How many days (1, 2, 3, 5, 8, ...) to spend collecting data
    - How to collect the data
    - Don't spend too much time on the first round of data collection. ML is a very iterative process: until we try, we never know what is hard and what is easy (unless we are very experienced). After a rudimentary training run, you will know where your algorithm fails, and can go back and collect data for those cases
3. Design model
4. Train model (may loop back to steps 2 and 3)
    - In steps 3 and 4, existing research is quite helpful
    - Keep clear notes on experiments, for example in a spreadsheet
5. Test model
6. Deploy
    - Location:
        - Edge (end device): avoids network latency, better privacy
        - Cloud: easier to maintain
    - VAD (voice activity detection)
        1. Non-ML: check whether volume $\ge \epsilon$. Less reliable but simpler, and more robust when applied to a new dataset
        2. Train a small NN/SVM on human speech. Less robust, but in practice it is a must
        - Fewer parameters, more general
    - Data change: the data the algorithm needs to perform well on differ from the training data
        - New accents
        - Different background noises
        - New microphone
7. Monitor
    - Web search (the world is changing)
    - Self-driving (different rules)
8. Q&A: kind of statistical testing
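
The non-ML VAD option from step 6 can be sketched as a simple volume threshold. This is only an illustration under assumptions not in the notes: volume is measured as per-frame RMS, and the names `frame_rms`, `simple_vad`, the frame length of 160 samples, and the threshold `epsilon=0.1` are all hypothetical choices.

```python
import numpy as np

def frame_rms(signal, frame_len):
    """Split a 1-D signal into non-overlapping frames and return each frame's RMS volume."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def simple_vad(signal, frame_len=160, epsilon=0.1):
    """Flag a frame as voice activity when its RMS volume is >= epsilon (assumed threshold)."""
    return frame_rms(signal, frame_len) >= epsilon

# Synthetic input: 1600 samples of silence followed by 1600 samples of a loud tone.
silence = np.zeros(1600)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
flags = simple_vad(np.concatenate([silence, tone]))
```

Here `flags` holds one boolean per frame. The detector has a single parameter, `epsilon`, which is why it tends to carry over to new microphones or accents more easily than a trained model, at the cost of being fooled by any loud non-speech sound.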

#### Criteria to Select a Project
- Interest
- Availability of Data
- Domain Knowledge
- Utility
- Feasibility

Structured Data:
- Databases of data
- Each feature has a well-defined meaning

Unstructured Data:
- Audio, images, text
- People have a natural ability to understand unstructured data

### Activation functions
1. Sigmoid
    $$f(x) = \frac{1}{1 + \exp(-x)}$$
2. Tanh: mathematically a shifted and rescaled version of the sigmoid function, and it almost always works better than sigmoid. It gives a sense of "centering": the mean of the activations is close to 0, which makes learning easier for the next layer. For binary classification, however, the sigmoid function may be better at the output, since it makes more sense to output a value between 0 and 1
    $$f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
    $$f'(z) = 1 - f^2(z)$$
#### Weakness of the two functions above:
The gradient becomes very small when $z$ is large in magnitude (very positive or very negative), which may slow down gradient descent
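
To make the saturation problem concrete, here is a minimal NumPy sketch of both functions and their derivatives. The helper names (`sigmoid`, `sigmoid_grad`, `tanh_grad`) and the sample points are illustrative choices. It also checks the standard identity $\tanh(z) = 2\sigma(2z) - 1$, which is the precise sense in which tanh is a shifted and rescaled sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([-10.0, 0.0, 10.0])

# tanh is a shifted and rescaled sigmoid.
same = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

# Gradients are largest at z = 0 and nearly vanish at |z| = 10.
sg = sigmoid_grad(z)
tg = tanh_grad(z)
```

At $z = 0$ the gradients are 0.25 for sigmoid and 1 for tanh, while at $|z| = 10$ both are below $10^{-4}$, so updates flowing through a saturated unit become tiny.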

3. ReLU:
    $$a = \max(0, z)$$
4. Leaky ReLU: it usually works better than ReLU. The slope for negative $z$ is a small constant, e.g. 0.01:
    $$a = \max(0.01z, z)$$
#### The network will usually learn faster with either of the two functions above
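
A minimal NumPy sketch of ReLU and leaky ReLU with their gradients follows. The function names and the `slope` parameter are illustrative, and the gradient at exactly $z = 0$ is set by convention (here it takes the negative-side value).

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """Leaky ReLU: max(slope * z, z); slope=0.01 is just an example value."""
    return np.maximum(slope * z, z)

def relu_grad(z):
    """1 for z > 0, else 0 (value at z == 0 chosen by convention)."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, slope=0.01):
    """1 for z > 0, else slope, so the gradient never becomes exactly 0."""
    return np.where(z > 0, 1.0, slope)

z = np.array([-2.0, 0.5, 3.0])
r, lr = relu(z), leaky_relu(z)
rg, lrg = relu_grad(z), leaky_relu_grad(z)
```

For positive $z$ the gradient is exactly 1 no matter how large $z$ gets, so these units do not saturate the way sigmoid and tanh do.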

##### Random Initialization:
The parameters should be initialized to small random values, on the order of 0.01 (for a shallow NN). With tanh, for example, a large $W$ gives a large $z$, which pushes $a$ into the saturated region where the gradient is small, so learning is slow
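
The effect of the initialization scale can be checked with a small NumPy experiment. This is a sketch under assumptions: the layer sizes, the Gaussian input, the zero-initialized bias, and the helper names `init_layer` and `mean_tanh_grad` are all illustrative and not from the notes.

```python
import numpy as np

def init_layer(n_out, n_in, scale=0.01, seed=0):
    """Small random weights and zero biases for one layer (zero biases are an assumption)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_out, n_in)) * scale
    b = np.zeros((n_out, 1))
    return W, b

def mean_tanh_grad(scale, n_in=100, n_out=50):
    """Average tanh gradient 1 - tanh(z)^2 over one layer for one random input."""
    W, b = init_layer(n_out, n_in, scale=scale)
    x = np.random.default_rng(1).standard_normal((n_in, 1))
    z = W @ x + b
    return float(np.mean(1.0 - np.tanh(z) ** 2))

small = mean_tanh_grad(scale=0.01)  # z stays near 0, gradients stay close to 1
large = mean_tanh_grad(scale=10.0)  # z saturates tanh, gradients collapse toward 0
```

Comparing `small` and `large` shows why the multiplier matters: the same architecture starts in the steep part of tanh with the 0.01 scale and in the flat part with the large scale.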

There are functions that only very deep NNs can learn and that shallower models often fail to capture. It is difficult to predict in advance exactly how deep a neural network needs to be, so we may try a shallower NN (one or two hidden layers) first and treat the number of layers as a hyperparameter, evaluating candidates on a hold-out cross-validation set.

#### Notations:
- L: the number of layers
- $n^{[\ell]}$: the number of units in layer $\ell$
- $a^{[\ell]}$: activations in layer $\ell$
- $W^{[\ell]}, b^{[\ell]}$: weights and bias for computing $z^{[\ell]}$, for example: $z^{[\ell]} = W^{[\ell]} a^{[\ell - 1]} + b^{[\ell]}$
- $X = a^{[0]}, \hat{y} = a^{[L]}$
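
Using this notation, forward propagation can be sketched in NumPy as a loop over layers $\ell = 1, \dots, L$. The sketch makes assumptions beyond the notes: tanh in the hidden layers, sigmoid at the output (a binary-classification setup), examples stacked as columns of $X$, and hypothetical helper names `init_params` and `forward` with parameters stored under keys like `"W1"` and `"b1"`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, scale=0.01, seed=0):
    """layer_sizes[l] plays the role of n^[l]; layer_sizes[0] is the input size."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        params[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * scale
        params[f"b{l}"] = np.zeros((layer_sizes[l], 1))
    return params

def forward(X, params):
    """Compute y_hat = a^[L], starting from a^[0] = X."""
    L = len(params) // 2          # one (W, b) pair per layer
    a = X                         # a^[0]
    for l in range(1, L + 1):
        z = params[f"W{l}"] @ a + params[f"b{l}"]   # z^[l] = W^[l] a^[l-1] + b^[l]
        a = sigmoid(z) if l == L else np.tanh(z)    # assumed activation choice
    return a

layer_sizes = [3, 4, 1]           # n^[0] = 3, n^[1] = 4, n^[2] = 1, so L = 2
params = init_params(layer_sizes)
X = np.random.default_rng(1).standard_normal((3, 5))  # 5 examples as columns
y_hat = forward(X, params)
```

`y_hat` has shape `(1, 5)`, one prediction per example. The bias of shape `(n, 1)` is broadcast across the columns of `a`, which is what lets the same line handle a whole batch.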

##### Propagation
![](images/propgate.png)

#### Hyperparameters:
![](images/params.png)

Applied deep learning is a very empirical process