In [1]:
import numpy as np
%matplotlib widget
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.activations import relu,linear
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import L2  # used by the regularised layers in section 4

import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)

from autils import * 

tf.keras.backend.set_floatx('float64')
# from assigment_utils import *

tf.autograph.set_verbosity(0)






## Advice on building ML Algorithms

### 1. Debugging a learning algorithm

What should you try next when a trained model makes unacceptably large errors in its predictions?
For example, regularised linear regression:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2  + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 \tag{1}$$ 
- To fix high bias: add more features, add polynomial features, or decrease $\lambda$ (the regularisation parameter)
- To fix high variance: get more training examples, use a smaller set of features, or increase $\lambda$
- To tell which of these applies, we need machine learning diagnostics
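As a concrete reading of equation (1), here is a minimal NumPy sketch of the regularised cost. The function name `compute_cost_reg` and the toy data are illustrative assumptions, not part of the course code.

```python
import numpy as np

def compute_cost_reg(X, y, w, b, lambda_=1.0):
    """Regularised squared-error cost, following equation (1)."""
    m = X.shape[0]
    predictions = X @ w + b                      # f_wb(x^(i)) for every example
    squared_error = np.sum((predictions - y) ** 2) / (2 * m)
    penalty = (lambda_ / (2 * m)) * np.sum(w ** 2)   # b is not regularised
    return squared_error + penalty

# Toy data (assumed): w = [1, 2], b = 0 fits these two examples exactly
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 11.0])
w = np.array([1.0, 2.0])
b = 0.0

cost_no_reg = compute_cost_reg(X, y, w, b, lambda_=0.0)  # only the fit term
cost_reg = compute_cost_reg(X, y, w, b, lambda_=1.0)     # fit term + penalty
print(cost_no_reg, cost_reg)
```

With a perfect fit the first term is zero, so any remaining cost comes entirely from the $\lambda$ penalty on the weights.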

#### Machine learning diagnostic

Diagnostic: A test that you run to gain insight into what is or isn't working with a learning algorithm, and guidance on how to improve its performance.  


Diagnostics can take time to implement but doing so can be a very good use of your time.

### 2. Evaluating a Model

Overfitting: the model fits the training data well but fails to generalise to new examples outside the training set, typically **because it is too flexible (e.g. high-order polynomials or too many features)**


#### So how?
* Split your original data set into a training set (roughly 70%) and a test set (roughly 30%; the `train_test_split` call below holds out 33%).
    * Use the training data to fit the parameters of the model
    * Use the test data to evaluate the model on *new* data
    * `#split the data using sklearn routine 
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33, random_state=1)`
    
* Develop an error function to evaluate your model.
    * Compute the test error and the training error (both without the regularisation term)
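The steps above can be sketched end to end as follows. The synthetic data is an assumption made only so the example runs; the errors are divided by 2 to match the $\frac{1}{2m}$ convention of equation (1) and contain no regularisation term.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): a noisy straight line
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# Hold out 33% of the examples as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Fit the parameters on the training set only
model = LinearRegression().fit(X_train, y_train)

# Unregularised errors, halved to match the 1/(2m) convention
J_train = mean_squared_error(y_train, model.predict(X_train)) / 2
J_test = mean_squared_error(y_test, model.predict(X_test)) / 2
print(f"J_train = {J_train:.3f}, J_test = {J_test:.3f}")
```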

### 3. Model Selection

Once the parameters $\vec{w},b$ are fit to the training set, the training error $J_{train}(\vec{w},b)$ is likely to be lower than the actual generalisation error. $J_{test}(\vec{w},b)$ is a better estimate than $J_{train}(\vec{w},b)$ of how well the model will generalise to new data.

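- Choosing the polynomial degree that gives the lowest test error is not recommended: the degree has then been fit to the test set, so $J_{test}$ becomes an optimistic estimate of the generalisation error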

#### Solution: Split dataset into train, cross-validation and test set.
- Training set (60%)
- Cross-validation set (20%)
- Test set (20%)
- The cross-validation set is used to choose the best-suited polynomial degree, so the test set stays untouched and still gives a fair estimate of the generalisation error
- The same procedure gives a better way to select between models or neural network architectures
<img  src="./images/Wk6_1.png"  style=" width:600px; padding: 10px 20px ; ">


| data             | % of total | Description |
|------------------|:----------:|:---------|
| training         | 60         | Data used to tune model parameters $w$ and $b$ in training or fitting |
| cross-validation | 20         | Data used to tune other model parameters like degree of polynomial, regularization or the architecture of a neural network.|
| test             | 20         | Data used to test the model after tuning to gauge performance on new data |


 **(Don't use the test set to make decisions about the model!)**
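A minimal sketch of this selection procedure, assuming synthetic quadratic data and a scikit-learn pipeline (polynomial features, scaling, linear regression). The degree range 1 to 6 and all variable names are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumed): quadratic signal plus noise
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

# 60% training, then split the remaining 40% evenly into cv and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.40, random_state=1
)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=1
)

models, cv_errors = {}, {}
for degree in range(1, 7):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_train, y_train)            # w, b come from the training set
    models[degree] = model
    cv_errors[degree] = mean_squared_error(y_cv, model.predict(X_cv)) / 2

# The degree is chosen on the cv set; the test set is used once, at the end
best_degree = min(cv_errors, key=cv_errors.get)
J_test = mean_squared_error(y_test, models[best_degree].predict(X_test)) / 2
print(best_degree, J_test)
```

Because the test set played no part in choosing the degree, `J_test` remains a fair estimate of the generalisation error.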


### 4. Bias and Variance (Diagnostic)

<img  src="./images/Wk6_2.png"  style=" width:600px; padding: 10px 20px ; ">


#### Regularisation and Bias/Variance
- How to choose $\lambda$ value?
- Note: $\lambda$ adds a penalty term to the cost function that shrinks the weights $w_j$, discouraging the model from fitting the training data too closely
- The gap between the training error and the cross-validation error indicates variance: high variance shows up as $J_{cv} \gg J_{train}$
<img  src="./images/Wk6_3.png"  style=" width:600px; padding: 10px 20px ; ">

- Tune $\lambda$ using $J_{cv}(\vec{w},b)$
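One way to sketch this tuning loop is with scikit-learn's `Ridge`, whose `alpha` argument plays the role of $\lambda$. Note that `Ridge` does not divide its objective by $2m$, so its `alpha` is not numerically identical to the $\lambda$ in equation (1). The data, the degree-10 features and the candidate grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumed): noisy sine wave
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(150, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=150)

X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, random_state=1
)

lambdas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]   # candidate values
J_train, J_cv = [], []
for lam in lambdas:
    model = make_pipeline(
        PolynomialFeatures(10, include_bias=False),
        StandardScaler(),
        Ridge(alpha=lam),                  # alpha acts as lambda here
    )
    model.fit(X_train, y_train)
    J_train.append(mean_squared_error(y_train, model.predict(X_train)) / 2)
    J_cv.append(mean_squared_error(y_cv, model.predict(X_cv)) / 2)

# Pick the candidate with the lowest cross-validation error
best_lambda = lambdas[int(np.argmin(J_cv))]
print(best_lambda)
```

The training error can only grow as the penalty grows, which is why the choice is made on $J_{cv}$ rather than $J_{train}$.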

<img  src="./images/Wk6_4.png"  style=" width:600px; padding: 10px 20px ; ">

#### Establishing a baseline level of performance

- Judge the training error (e.g. 10.8%) and the cross-validation error against a baseline rather than against 0%:
    * **Human-level performance** (e.g. 10.2%): the gap to the training error is only 0.6%, so bias is low
    * **Competing algorithms' performance**
    * **A guess based on experience**
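The comparison can be written as simple arithmetic. In this sketch the baseline (10.2%) and training error (10.8%) come from the notes above, while the cross-validation error (14.8%), the `tolerance` threshold and the `diagnose` helper are assumptions for illustration.

```python
def diagnose(baseline_error, train_error, cv_error, tolerance=1.0):
    """Compare error gaps (in percentage points) against a hypothetical tolerance."""
    bias_gap = train_error - baseline_error      # how far training is from the baseline
    variance_gap = cv_error - train_error        # how much worse cv is than training
    return {
        "bias_gap": bias_gap,
        "variance_gap": variance_gap,
        "high_bias": bias_gap > tolerance,
        "high_variance": variance_gap > tolerance,
    }

# cv_error=14.8 is an assumed value, not taken from the notes
result = diagnose(baseline_error=10.2, train_error=10.8, cv_error=14.8)
print(result)
```

Here the bias gap is small and the variance gap is large, which would point towards a variance problem rather than a bias problem.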

#### Learning curves
As training set size increases: $J_{cv}$ gradually decreases, $J_{train}$ gradually increases

- High bias: getting more training data will not help much; $J_{train}$ and $J_{cv}$ both flatten out well above the baseline (e.g. human-level performance)

- High variance: getting more training data is likely to help, because it reduces overfitting and pulls $J_{cv}$ down towards $J_{train}$
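A small NumPy sketch of a learning curve for the high-bias case: a straight line is fit to quadratic data for growing training set sizes. The data, the sizes and the helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(m):
    """Synthetic quadratic data with noise (assumed for illustration)."""
    x = rng.uniform(-3, 3, size=m)
    y = x ** 2 + rng.normal(0, 0.5, size=m)
    return x, y

def half_mse(coeffs, x, y):
    """Squared-error cost with the 1/(2m) convention."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2

x_train, y_train = make_data(200)
x_cv, y_cv = make_data(200)

sizes = [5, 10, 20, 50, 100, 200]
J_train, J_cv = [], []
for m in sizes:
    # Degree-1 fit on the first m examples: too simple for quadratic data
    coeffs = np.polyfit(x_train[:m], y_train[:m], deg=1)
    J_train.append(half_mse(coeffs, x_train[:m], y_train[:m]))
    J_cv.append(half_mse(coeffs, x_cv, y_cv))

for m, jt, jc in zip(sizes, J_train, J_cv):
    print(f"m={m:3d}  J_train={jt:.2f}  J_cv={jc:.2f}")
```

Both errors stay high even at the largest size, which is the plateau that signals high bias; swapping in a more flexible model would instead show a gap between the two curves that narrows with more data.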


#### Neural networks and bias variance

- Large neural networks are low-bias machines
- Two main questions:
    * Does it do well on the training set? If not (high bias), use a bigger network
    * Does it do well on the cross-validation set? If not (high variance), get more data
- A large neural network will usually do as well as or better than a smaller one, as long as the regularisation is chosen appropriately

#### Regularised MNIST Model
`layer_1 = Dense(units=25, activation='relu', kernel_regularizer=L2(0.01))`


`layer_2 = ..., kernel_regularizer=L2(0.01)`


`layer_3 = ..., kernel_regularizer=L2(0.01)`


`model = Sequential([layer_1, layer_2, layer_3])`


### 5. Iterative loop of ML Development

Choose architecture (model, data, etc.) > Train model > Diagnostics (bias, variance)
- Usually requires multiple iterations of this loop

#### Building a spam classifier

Supervised learning:
- $\vec{x}$ = features of email
- $y$ = spam (1) or not spam (0)

Features: list the top 10,000 words to compute $x_1,x_2,...,x_{10,000}$

Ways to reduce error:
- Collect more data, e.g. the "Honeypot" project
- Develop sophisticated features based on the email routing information in the header
- Define sophisticated features from the email body, e.g. should "discounting" and "discount" be treated as the same word?
- Design algorithms to detect deliberate misspellings, e.g. "w4tches", "med1cine"
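To make the feature definition concrete, here is a minimal sketch that turns an email into a binary word-presence vector. The five-word vocabulary stands in for the top 10,000 words, and `email_to_features` is a hypothetical helper.

```python
import re

def email_to_features(email_text, vocabulary):
    """Return x_j = 1 if vocabulary word j appears in the email, else 0."""
    words = set(re.findall(r"[a-z0-9]+", email_text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Tiny stand-in vocabulary (assumed); the notes use the top 10,000 words
vocabulary = ["buy", "deal", "discount", "meeting", "watches"]
x = email_to_features("Huge DISCOUNT: buy watches now!", vocabulary)
print(x)  # [1, 0, 1, 0, 1]
```

A deliberately misspelled word such as "w4tches" would not match "watches" here, which is exactly the gap the last idea in the list tries to close.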

### 6. Error Analysis


$m_{cv}$ = 500 examples in cross validation set
- Algorithm misclassifies 100
- **Manually examine examples and categorise based on common traits**.

Eg (spam classification):
1. Pharma: 21
2. Deliberate misspellings: 3
3. Unusual email routing: 7
4. Steal passwords (phishing): 18
5. Spam message in embedded image: 5
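The tally above can be ranked to decide where to spend effort first. This sketch uses the counts from the list; the dictionary keys are shortened labels.

```python
# Counts from the manual review of the 100 misclassified cv examples
category_counts = {
    "pharma": 21,
    "deliberate misspellings": 3,
    "unusual email routing": 7,
    "phishing": 18,
    "spam in embedded image": 5,
}
n_misclassified = 100

# Rank categories by how many errors they account for
ranked = sorted(category_counts.items(), key=lambda kv: kv[1], reverse=True)
for name, count in ranked:
    print(f"{name:25s} {count:3d}  ({count / n_misclassified:.0%} of errors)")

top_category = ranked[0][0]
```

Pharma and phishing emails dominate, so work on those is likely to pay off more than, say, detecting deliberate misspellings.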


### 7. Adding data

#### 7A. Data Augmentation
Augmentation: modifying an existing training example to create a new training example.
<img src="./images/Wk6_5.png"  alt="Image 1" style="width: 45%; display: inline-block; margin-right: 5px;">
<img src="./images/Wk6_6.png"  alt="Image 2" style="width: 45%; display: inline-block;">

- Data augmentation also applies to speech recognition, e.g. by adding background noise or distortions to existing audio clips
- The modifications or distortions should be representative of what you expect to see in the test set
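A minimal NumPy sketch of image augmentation by small random shifts and pixel noise. The image, the shift range and the noise level are assumptions; `np.roll` wraps pixels around the border, which is acceptable here only because the toy image has an empty margin.

```python
import numpy as np

rng = np.random.default_rng(4)

def augment_image(image, max_shift=2, noise_std=0.05):
    """Create a new training example by shifting and adding noise."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))  # wraps at edges
    noisy = shifted + rng.normal(0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)          # keep pixel values in [0, 1]

# Toy 28x28 "image" (assumed): a bright rectangle on a dark background
image = np.zeros((28, 28))
image[10:18, 12:16] = 1.0

# Each augmented copy keeps the same label as the original example
augmented = [augment_image(image) for _ in range(5)]
print(len(augmented), augmented[0].shape)
```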

#### 7B. Data Synthesis
Eg: Artificial data synthesis for photo Optical Character Recognition (OCR)
- Mostly used for computer vision tasks
<img  src="./images/Wk6_7.png"  style=" width:600px; padding: 10px 20px ; ">

#### 7C. Data engineering
- Taking a data-centric approach rather than model-centric approach

### 8. Transfer learning
- Using data from a different task of the **same input type**
- By training on pictures of cats, dogs, etc., the earlier layers may already have learned useful parameters (e.g. for detecting edges and shapes); copying these parameters into the new neural network starts it from a good place

- Option 1: only train the output layer's parameters (suits a very small data set)
- Option 2: train all parameters, initialised from the pretrained values (suits a larger data set)

<img src="./images/Wk6_8.png"  alt="Image 1" style="width: 45%; display: inline-block; margin-right: 5px;">
<img src="./images/Wk6_9.png"  alt="Image 2" style="width: 45%; display: inline-block;">

Steps:
1. Download neural network parameters pretrained on a large dataset with the same input data type (e.g. images, audio, text) as your application.

2. Further train (fine tune) the network on your own data.


### 9. Full cycle of a Machine Learning Project

<img  src="./images/Wk6_10.png"  style=" width:600px; padding: 10px 20px ; ">

<img  src="./images/Wk6_11.png"  style=" width:600px; padding: 10px 20px ; ">

### 10. Skewed Datasets

#### Error metrics for skewed datasets

Eg: Rare disease classification example
- A classifier might reach 99% accuracy (1% error) on the test set, yet only 0.5% of patients have the disease
- A program that always predicts $y = 0$ (`print("0")`) would reach 99.5% accuracy / 0.5% error, so accuracy alone is misleading

#### Precision/Recall
These metrics expose the always-predict-0 classifier: TP = 0, so recall is 0% and precision is undefined (0/0).

Precision = $\frac{TP}{TP+ FP}$  
Recall = $\frac{TP}{TP + FN}$ (also known as sensitivity)
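A short NumPy sketch of these two formulas, applied to a skewed data set where 0.5% of examples are positive. The labels and predictions are made up for illustration.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # 0/0 is returned as nan in this sketch; some texts define it as 0 by convention
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return precision, recall

# Assumed data: 5 positives among 1000 examples (0.5%)
y_true = np.array([1] * 5 + [0] * 995)

# Classifier A always predicts 0
y_pred_zero = np.zeros(1000, dtype=int)
accuracy_zero = float(np.mean(y_pred_zero == y_true))
precision_zero, recall_zero = precision_recall(y_true, y_pred_zero)

# Classifier B flags 4 of the 5 positives plus 4 negatives
y_pred_b = np.zeros(1000, dtype=int)
y_pred_b[:4] = 1
y_pred_b[5:9] = 1
precision_b, recall_b = precision_recall(y_true, y_pred_b)

print(accuracy_zero, recall_zero, precision_b, recall_b)
```

Classifier A scores 99.5% accuracy yet finds none of the positives, while classifier B has lower accuracy but non-trivial precision and recall.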

#### Trading off precision and recall

If we raise the logistic regression threshold above the default 0.5 (predict 1 only when very confident),  
**Higher precision, lower recall**

If we lower the threshold (to avoid missing positive cases),  
**Lower precision, higher recall**
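The trade-off can be seen by sweeping the threshold over a fixed set of predicted probabilities. The probabilities and labels below are made up for illustration.

```python
import numpy as np

# Assumed model outputs f(x) and true labels for 10 examples
probs = np.array([0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05])
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0])

results = {}
for threshold in (0.3, 0.5, 0.8):
    y_pred = (probs >= threshold).astype(int)    # predict 1 if f(x) >= threshold
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    results[threshold] = (tp / (tp + fp), tp / (tp + fn))

for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.8 moves precision up and recall down, matching the rule of thumb above.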

#### F1 Score
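- Combines precision and recall into a single score, so that algorithms (or thresholds) can be compared
- It is the harmonic mean of P and R, so it is pulled towards the lower of the two values
- Choose the algorithm with the highest F1 score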

Formula : $$\frac{1}{\frac{1}{2}(\frac{1}{P}+\frac{1}{R})} = 2\frac{PR}{P+R}$$
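A short sketch of using the formula to pick between algorithms. The three precision/recall pairs are illustrative values, not results from the notes.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0                      # avoid division by zero
    return 2 * precision * recall / (precision + recall)

# Illustrative (precision, recall) pairs, assumed for this example
candidates = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}

scores = {name: f1_score(p, r) for name, (p, r) in candidates.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
print("best:", best)
```

Algorithm 3 has perfect recall but a very low F1 score, because the harmonic mean penalises its poor precision far more than a simple average would.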