## Week 1 - Neurons and Layers

**Regression/Linear Model:**

The function implemented by a neuron with no activation is the same as in Course 1, linear regression:
$$ f_{\mathbf{w},b}(x^{(i)}) = \mathbf{w}\cdot x^{(i)} + b \tag{1}$$

**Neuron with Sigmoid activation:**

The function implemented by a neuron/unit with a sigmoid activation is the same as in Course 1, logistic regression:
$$ f_{\mathbf{w},b}(x^{(i)}) = g(\mathbf{w}\cdot x^{(i)} + b) \tag{2}$$
where $g(z) = \text{sigmoid}(z) = \dfrac{1}{1+e^{-z}}$.
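Equations (1) and (2) can be sketched in plain Python to make the relationship concrete. This is only an illustration using the standard library, not how TensorFlow implements a unit; the function names and the sample values of `w`, `x`, and `b` are made up for the example.

```python
import math

def linear_neuron(w, x, b):
    """Equation (1): dot product of weights and inputs, plus the bias."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(w, x, b):
    """Equation (2): the sigmoid applied to the linear neuron's output."""
    return sigmoid(linear_neuron(w, x, b))

# Hypothetical weights and input, chosen so that z comes out to exactly 0.
w, x, b = [1.0, 2.0], [3.0, -1.5], 0.0
z = linear_neuron(w, x, b)   # 1*3 + 2*(-1.5) + 0 = 0.0
a = sigmoid_neuron(w, x, b)  # sigmoid(0) = 0.5
```

The only difference between the two neurons is the final call to `sigmoid`, which squashes the linear output into the range (0, 1).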

---

- Recall that the dimensions of these parameters are determined as follows:
    - If the network has $s_{in}$ units in a layer and $s_{out}$ units in the next layer, then
        - $W$ will be of dimension $s_{in} \times s_{out}$.
        - $b$ will be a vector with $s_{out}$ elements.

>**Note:** The bias vector `b` could be represented as a 1-D (n,) or 2-D (1,n) array. TensorFlow uses a 1-D representation and this lab maintains that convention.

**How to calculate the number of parameters?** 

Each `Dense` layer has $s_{in} \times s_{out}$ weights plus $s_{out}$ biases. For example:
```python
import tensorflow as tf
from tensorflow.keras import Sequential

model = Sequential(
    [               
        tf.keras.Input(shape=(400,)),                    # L0
        tf.keras.layers.Dense(units=25, activation='sigmoid'), # L1
        tf.keras.layers.Dense(units=15, activation='sigmoid'), # L2
        tf.keras.layers.Dense(units= 1, activation='sigmoid'), # L3
    ], name = "my_model" 
)      

L1_num_params = 400 * 25 + 25  # W1 parameters  + b1 parameters
L2_num_params =  25 * 15 + 15  # W2 parameters  + b2 parameters
L3_num_params =  15 *  1 +  1  # W3 parameters  + b3 parameters

"""
W1 shape = (400, 25), b1 shape = (25,)
W2 shape = (25, 15), b2 shape = (15,)
W3 shape = (15, 1), b3 shape = (1,)
"""
```
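The same arithmetic can be written as a small helper that walks through the layer sizes. This is a sketch in plain Python with no TensorFlow dependency; `dense_param_counts` is a hypothetical name introduced here for illustration.

```python
def dense_param_counts(input_size, layer_units):
    """Return the parameter count of each fully connected layer, in order.

    A layer with s_in inputs and s_out units has s_in * s_out weights
    and s_out biases.
    """
    counts = []
    s_in = input_size
    for s_out in layer_units:
        counts.append(s_in * s_out + s_out)  # weights + biases
        s_in = s_out  # this layer's output feeds the next layer
    return counts

counts = dense_param_counts(400, [25, 15, 1])
total = sum(counts)
print(counts, total)  # [10025, 390, 16] 10431
```

In Keras, `model.summary()` should report the same per-layer counts, which makes it a convenient cross-check.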

## Week 2

**Softmax**

A multiclass neural network generates N outputs. One output is selected as the predicted answer. In the output layer, a vector $\mathbf{z}$ is generated by a linear function which is fed into a softmax function. The softmax function converts $\mathbf{z}$  into a probability distribution as described below. After applying softmax, each output will be between 0 and 1 and the outputs will sum to 1. They can be interpreted as probabilities. The larger inputs to the softmax will correspond to larger output probabilities.

The softmax function can be written:
$$a_j = \frac{e^{z_j}}{ \sum_{k=0}^{N-1}{e^{z_k} }} \tag{1}$$

where $z_j = \mathbf{w}_j \cdot \mathbf{x} + b_j$ and $N$ is the number of categories (units) in the output layer.
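The softmax formula above can be sketched in a few lines of plain Python. Subtracting the largest $z_k$ before exponentiating is a common numerical-stability trick that this example adds; it does not change the result. The input values are hypothetical.

```python
import math

def softmax(z):
    """Convert a list of scores z into probabilities that sum to 1."""
    z_max = max(z)  # shifting by the max avoids overflow in exp()
    exps = [math.exp(z_k - z_max) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0, 4.0]  # hypothetical output-layer values
a = softmax(z)
print(a)       # four probabilities, increasing with z
print(sum(a))  # sums to 1, up to floating-point rounding
```

Every output lies between 0 and 1, and the largest input receives the largest probability.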

- Hidden layers should have non-linear activation functions. If every hidden layer uses a linear activation, the whole network collapses to a single linear function of the input, so it is equivalent to linear regression (or to logistic regression if the output layer uses a sigmoid).
- A linear activation is the same as "no activation function", so a neural network with many layers but only linear activations is no more effective than a single linear layer.
- ReLU is the most common choice for hidden layers because it trains faster than the sigmoid. ReLU is flat on only one side (the left side), whereas the sigmoid goes flat (horizontal, slope approaching zero) on both sides of the curve, which slows down gradient descent.
- For multiclass classification, the recommended way to implement softmax regression is to set `from_logits=True` in the loss function and to define the model's output layer with a _linear_ activation.
- When the true label is 3, the cross-entropy loss for that training example is just the negative of the log of the activation of the third neuron of the softmax, $-\log(a_3)$. All other terms of the cross-entropy loss equation $(-\log(a_1), -\log(a_2), -\log(a_4))$ are ignored.
- _Convolutional layer_: a different layer type in which each neuron looks at only part of the input vector fed into that layer, not all of its values.
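To see why a linear output layer plus `from_logits=True` works, the loss for one example can be computed straight from the linear outputs $\mathbf{z}$ without ever forming the softmax probabilities. The sketch below shows the idea in plain Python; it is a conceptual illustration, not TensorFlow's actual implementation, and the function name and sample values are hypothetical. Note that the code uses 0-based class indices, so the "third neuron" is index 2.

```python
import math

def sparse_cross_entropy_from_logits(z, y):
    """Loss for one example: -log(softmax(z)[y]), via the log-sum-exp form.

    z is the list of linear outputs (logits); y is the 0-based true class.
    -log(e^{z_y} / sum_k e^{z_k}) = log(sum_k e^{z_k}) - z_y
    """
    z_max = max(z)  # shift by the max for numerical stability
    log_sum_exp = z_max + math.log(sum(math.exp(z_k - z_max) for z_k in z))
    return log_sum_exp - z[y]

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical outputs of a linear layer
loss = sparse_cross_entropy_from_logits(logits, 2)  # true class: third neuron
```

Only the term for the true class contributes to the loss, and combining the softmax and the log in one expression avoids the rounding error of computing tiny probabilities first.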

## Week 3 

**High bias (underfit)**

- What to do?
  1. Collect additional features or add polynomial features
  2. Decrease the regularization parameter $\lambda$ (lambda)
- What not to do?
  - Collecting more training data does not resolve high bias: the model is already failing to fit the existing training data, so more data is unlikely to help.

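**High variance (overfit)**

- What to do?
  1. Collect more training data
  2. Increase the regularization parameter $\lambda$ (lambda)
- What not to do?
  - Adding features or decreasing $\lambda$ does not resolve high variance: both give the model more flexibility, which tends to make overfitting worse.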

---

- What is the best way to determine whether your model has high bias (has underfit the training data)?
  - Compare the training error to the baseline level of performance. A training error well above the baseline indicates high bias.
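The comparison can be written as a small diagnostic: the gap between baseline and training error signals bias, and the gap between training and cross-validation error signals variance. This is a minimal sketch; the function name, the `gap` threshold, and the error values are all hypothetical choices for illustration, since what counts as a "large" gap depends on the application.

```python
def diagnose(baseline_error, train_error, cv_error, gap=0.02):
    """Return (high_bias, high_variance) from three error rates.

    `gap` is an assumed threshold for what counts as a large difference.
    """
    high_bias = (train_error - baseline_error) > gap
    high_variance = (cv_error - train_error) > gap
    return high_bias, high_variance

# Illustrative error rates (not measured results).
print(diagnose(0.106, 0.108, 0.148))  # (False, True): high variance
print(diagnose(0.106, 0.150, 0.155))  # (True, False): high bias
print(diagnose(0.106, 0.150, 0.197))  # (True, True): both
```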

**Error Analysis**

- Manually examine a sample of the training examples that the model misclassified in order to identify common traits and trends. 

**Data Augmentation**

- Take an existing training example and modify it (for example, by rotating an image slightly) to create a new example with the same label.
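As a rough illustration, the sketch below augments a tiny dataset by mirroring each image left to right. It uses a horizontal flip instead of the slight rotation mentioned above, only because a flip is simple to write in plain Python. The function names and the toy image are hypothetical, and a flip is only valid when it does not change the label (it would not be for some letters or digits).

```python
def flip_horizontal(image):
    """Mirror a 2-D image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def augment(examples):
    """Return the original (image, label) pairs plus one flipped copy of each."""
    flipped = [(flip_horizontal(img), label) for img, label in examples]
    return examples + flipped

# Toy 2x3 "image" with a hypothetical label.
dataset = [([[1, 2, 3],
             [4, 5, 6]], "cat")]
augmented = augment(dataset)
print(len(augmented))   # 2
print(augmented[1])     # ([[3, 2, 1], [6, 5, 4]], 'cat')
```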

**Transfer Learning**

Two possible ways to perform transfer learning:

1. _Only train output layers_: You can choose to train just the output layers' parameters and leave the other parameters of the model fixed.
2. _Train all parameters_: You can choose to train all parameters of the model, including the output layers, as well as the earlier layers.

---

- In the context of machine learning, what is a diagnostic?
  - A test that you run to gain insight into what is/isn’t working with a learning algorithm.
- For a classification task, suppose you train three different models using three different neural network architectures. Which data do you use to evaluate the three models in order to choose the best one?
  - Use the cross-validation (CV) set. You'll only use the test set after choosing the best model based on the CV set. You want to avoid using the test set while you are still selecting model options, because the test set is meant to serve as an estimate of how the model will generalize to new examples that it has never seen before.
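The selection procedure can be sketched as follows: pick the model with the lowest CV error, and only then look up that one model's test error. The model names and error values are hypothetical placeholders, not measured results.

```python
def select_model(cv_errors):
    """Return the name of the model with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)

# Hypothetical error rates for three architectures.
cv_errors = {"small": 0.12, "medium": 0.08, "large": 0.10}
test_errors = {"small": 0.13, "medium": 0.09, "large": 0.11}

best = select_model(cv_errors)              # chosen using the CV set only
generalization_estimate = test_errors[best] # test set consulted once, afterwards
print(best, generalization_estimate)        # medium 0.09
```

Because the test set played no part in choosing `best`, its error remains a fair estimate of performance on unseen data.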








