In [2]:
import math
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Week 6

## Debugging a learning algorithm

Which of these should I choose to do better?

- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features
- Try decreasing $\lambda$
- Try increasing $\lambda$


1. How to evaluate algorithms
2. How to run machine learning diagnostics

Diagnostics can take a lot of time to implement, but can end up saving you even more time.

### Evaluating a hypothesis

When we fit our parameters, we choose them so that the hypothesis fits the training data well.  But we do not want to overfit.

Standard way to evaluate the hypothesis:

- Split 70% of sample data into the "training set"
- Split other 30% of sample data into the "test set"
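A minimal sketch of the 70/30 split, assuming the examples are held in NumPy arrays. The helper name `train_test_split`, its parameters, and the toy data are illustrative assumptions, not part of the original notes. Shuffling before the split guards against data that happens to be sorted.

```python
import numpy as np

def train_test_split(X, y, train_fraction=0.7, seed=0):
    """Shuffle the examples, then split them into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))  # random order of example indices
    n_train = int(round(train_fraction * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data: 10 examples with 2 features each (illustrative values only).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

X_train, y_train, X_test, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```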

#### Training/testing procedure for linear regression

- Learn parameter $\theta$ from training data (minimizing training error $J(\theta)$)
- Compute test set error
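The two steps above can be sketched as follows. This is a rough illustration on synthetic data, assuming unregularized linear regression fitted by least squares and the squared-error cost $\frac{1}{2m}\sum (h_\theta(x) - y)^2$; the function names are hypothetical.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so theta[0] acts as the intercept."""
    return np.column_stack([np.ones(len(X)), X])

def fit_linear_regression(X, y):
    """Least-squares fit, which minimizes the squared-error training cost."""
    theta, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return theta

def squared_error_cost(theta, X, y):
    """J = 1/(2m) * sum((h(x) - y)^2) on whichever set is passed in."""
    residuals = add_intercept(X) @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (assumption): y = 3 + 2x plus unit-variance noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, size=100)

# The rows are already in random order, so slicing gives a 70/30 split.
X_train, y_train = X[:70], y[:70]
X_test, y_test = X[70:], y[70:]

theta = fit_linear_regression(X_train, y_train)      # learn from training data only
J_train = squared_error_cost(theta, X_train, y_train)
J_test = squared_error_cost(theta, X_test, y_test)   # error on unseen examples
print(theta, J_train, J_test)
```

The test error is computed with the parameters learned on the training set; the test examples never influence `theta`.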

### Model selection and train/validation/test sets

Split the data 60 / 20 / 20 into training, cross-validation, and test sets.
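As a hedged illustration of how the three sets are used, the sketch below picks a polynomial degree: each candidate is fitted on the training set, the degree with the lowest cross-validation error is chosen, and the test set is touched only once at the end. The synthetic data, the degree range 1 to 6, and the helper `half_mse` are assumptions made for this example.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2) for a polynomial fit."""
    residuals = np.polyval(coeffs, x) - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic quadratic data (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 1, size=100)

# 60 / 20 / 20 split; the data was generated in random order, so slicing suffices.
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_test, y_test = x[80:], y[80:]

# Fit one model per candidate degree using the training set only.
degrees = range(1, 7)
models = {d: np.polyfit(x_train, y_train, d) for d in degrees}

# Choose the degree with the lowest cross-validation error.
cv_errors = {d: half_mse(models[d], x_cv, y_cv) for d in degrees}
best_degree = min(cv_errors, key=cv_errors.get)

# Report generalization error on the untouched test set.
test_error = half_mse(models[best_degree], x_test, y_test)
print(best_degree, cv_errors[best_degree], test_error)
```

Because the degree was selected using the cross-validation set, the cross-validation error is an optimistic estimate; the test error is the fairer one.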

### Diagnosing bias vs. variance

- Plot degree of polynomial d on x axis and error on y axis

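- As we increase the degree of the polynomial, we are able to fit the training set better, so $J_{train}$ decreases

- The cross-validation (or test set) error instead falls at first and then rises again once the model starts to overfit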

- Bias = underfit

$J_{train}$ will be high
$J_{CV} \approx J_{train}$

- Variance = overfit

$J_{train}$ will be low
$J_{CV}$ will be high
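The rules above can be turned into a rough check. This is only a sketch: the function `diagnose` and the `acceptable_error` threshold are hypothetical, and what counts as "high" depends on the problem.

```python
def diagnose(j_train, j_cv, acceptable_error):
    """Rough bias/variance check; acceptable_error is a problem-specific assumption."""
    if j_train > acceptable_error:
        # The model cannot even fit the training set: underfitting.
        return "high bias (underfit)"
    if j_cv > acceptable_error:
        # Training error is low but the model does not generalize: overfitting.
        return "high variance (overfit)"
    return "no obvious problem"

print(diagnose(j_train=5.0, j_cv=5.5, acceptable_error=1.0))  # high bias (underfit)
print(diagnose(j_train=0.1, j_cv=4.0, acceptable_error=1.0))  # high variance (overfit)
```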

### Regularization and Bias/Variance

- Let's go deeper into bias and variance

Suppose fitting high order polynomial

- Large $\lambda$ (underfit)
- Intermediate $\lambda$ (just right)
- Small $\lambda$ (overfit)

Choosing the regularization parameter $\lambda$

1. Try $\lambda = 0$
2. Try $\lambda = 0.01$
3. Try $\lambda = 0.02$
4. Try $\lambda = 0.04$
5. Try $\lambda = 0.08$

...

12. Try $\lambda = 10$
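A minimal sketch of this search, assuming regularized linear regression on polynomial features solved in closed form with the intercept left unregularized. The doubling sequence ends at $0.01 \cdot 2^{10} = 10.24$, which the list above rounds to 10. The data, the degree 8, and all function names are assumptions for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, where L leaves the intercept unpenalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def cost(theta, X, y):
    """Squared-error cost without the regularization term."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (illustrative only).
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=80)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=80)
X = poly_features(x, degree=8)
X_train, y_train = X[:60], y[:60]
X_cv, y_cv = X[60:], y[60:]

# Candidates: 0, then 0.01 doubled repeatedly up to 10.24 (12 values in total).
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

# Train with each lambda, then compare on the cross-validation set.
cv_errors = [cost(fit_regularized(X_train, y_train, lam), X_cv, y_cv) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_errors))]
print(best_lambda)
```

The cross-validation error is measured without the penalty term so that the candidates are compared on prediction error alone.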

### Learning curves

Plot $J_{train}$ and $J_{CV}$ as functions of the training set size $m$:

$J_{train} = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2 $

$J_{CV} = \frac{1}{2m_{CV}} \sum_{i=1}^{m_{CV}}(h_{\theta}(x^{(i)}_{CV}) - y^{(i)}_{CV})^2$
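A sketch of how the two curves might be computed, assuming a straight-line hypothesis fitted with `np.polyfit` on synthetic data (both are assumptions for illustration). For each training set size, the model is fitted on the first $m$ training examples, $J_{train}$ is measured on those same $m$ examples, and $J_{CV}$ on the whole cross-validation set.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2)."""
    residuals = np.polyval(coeffs, x) - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic linear data with noise (illustrative only).
rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=80)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=80)
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:], y[60:]

sizes = list(range(2, 61))  # a line needs at least 2 points
train_errors, cv_errors = [], []
for m in sizes:
    coeffs = np.polyfit(x_train[:m], y_train[:m], 1)        # fit on the first m examples
    train_errors.append(half_mse(coeffs, x_train[:m], y_train[:m]))
    cv_errors.append(half_mse(coeffs, x_cv, y_cv))           # always the full CV set

print(train_errors[0], train_errors[-1], cv_errors[-1])
```

Plotting `sizes` against `train_errors` and `cv_errors` (for example with `plt.plot`) gives the learning curves. With two examples the line fits the training data almost exactly, so $J_{train}$ starts near zero and grows as $m$ increases.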

### Deciding what to do next

Suppose you have implemented regularized linear regression to predict housing prices.  However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions.

- Get more training examples --> fixes high variance
- Try smaller sets of features --> fixes high variance
- Try getting additional features --> fixes high bias
- Try adding polynomial features --> fixes high bias
- Try decreasing $\lambda$ --> fixes high bias
- Try increasing $\lambda$ --> fixes high variance

#### "Small" neural network
- fewer parameters, more prone to underfitting

#### "Large" neural network
- more parameters, more prone to overfitting

## Machine learning system design

We begin with the question of how to prioritize your time.

Building a spam classifier

Supervised learning.

$x=$ features of email
$y=$ spam(1) or not spam(0)

Example features: the presence of words such as deal, buy, discount, andrew, now...
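One way to build such a feature vector is sketched below, assuming each feature is 1 if the corresponding word appears in the email and 0 otherwise. The five-word vocabulary is taken from the example above; the function name and the tokenization rule are illustrative assumptions.

```python
import re

VOCABULARY = ["deal", "buy", "discount", "andrew", "now"]

def email_to_features(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector marking which vocabulary words occur in the email."""
    words = set(re.findall(r"[a-z]+", text.lower()))  # crude lowercase tokenization
    return [1 if word in words else 0 for word in vocabulary]

print(email_to_features("Buy now! Best deal of the week."))  # [1, 1, 0, 0, 1]
```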

### Error Analysis

Recommended approach

- Start with a simple algorithm that you can implement quickly.
- At most 24 hours to get something quick and dirty running.
- Then plot learning curves to decide if more data, more features, etc. are likely to help
- Error analysis: manually examine the examples (in cross validation set) that your algorithm made errors on.  See if you spot any systematic trend in what type of examples it is making errors on.

$m_{cv} = 500$ examples in cross validation set
Algorithm misclassifies 100 emails
Manually examine the 100 errors, and categorize them based on:

(i) what type of email it is
(ii) what cues (features) you think would have helped the algorithm classify them correctly
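Once the misclassified emails have been labelled by hand, a simple tally shows which category deserves attention first. The category names and the six labels below are hypothetical placeholders, not results from the notes.

```python
from collections import Counter

# Hypothetical hand-assigned categories for a few misclassified emails.
labelled_errors = ["pharma", "phishing", "phishing", "replica", "phishing", "other"]

counts = Counter(labelled_errors)
for category, n in counts.most_common():
    # Most frequent error categories are printed first.
    print(f"{category}: {n}")
```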

### Error metrics for skewed classes

Cancer classification example

- Train logistic regression model $h_{\theta}(x)$. ($y=1$ if cancer, $y=0$ otherwise)

Find that you got 1% error on test set. (99% correct diagnoses)

Only 0.50% of patients have cancer

If your function just predicts $y=0$ and ignores $x$, then you only have 0.5% error.

This is called skewed classes
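The arithmetic can be checked with a small sketch. The test set of 1000 patients is a hypothetical size chosen so that 0.5% corresponds to 5 positive cases.

```python
import numpy as np

# Hypothetical test set: 1000 patients, 5 of whom (0.5%) have cancer.
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# A "classifier" that ignores x and always predicts y = 0.
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_pred == y_true))
print(accuracy)  # 0.995
```

The trivial predictor beats the 99% accuracy of the trained model while never detecting a single cancer case.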

Going from 99.2% accuracy to 99.5% accuracy sounds good, but it is not clear whether the change actually improves the value of your classifier.

When faced with skewed classes, we want a different evaluation metric.

#### Precision / recall

Predict $y=1$ in presence of rare class that we want to detect

$$ \text{Precision} = \frac{\text{True positives}}{\text{# predicted positive}} $$

$$ \text{Recall} = \frac{\text{True positives}}{\text{# actual positive}} $$

Want both high precision and recall.

Define the labels such that $y=1$ in the presence of the rare class.
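Both definitions can be computed directly from label arrays, as in the sketch below. Returning 0.0 when a denominator is zero is a convention assumed for this example, and the toy labels are illustrative only.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class y = 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_positives = int(np.sum((y_pred == 1) & (y_true == 1)))
    predicted_positive = int(np.sum(y_pred == 1))
    actual_positive = int(np.sum(y_true == 1))
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))    # 2 of 3 predicted, 2 of 3 actual
print(precision_recall(y_true, [0] * 10))  # (0.0, 0.0)
```

A classifier that always predicts $y=0$ gets zero recall, so it can no longer look good the way it does under plain accuracy.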

### Trading off precision and recall

$F_1 \text{ score} = 2\frac{PR}{P + R}$
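A short sketch of the formula, compared against the plain average of precision and recall. The two precision/recall pairs are made-up values for illustration, and returning 0.0 when $P + R = 0$ is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined here as 0.0 when P + R is zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up (precision, recall) pairs for two hypothetical classifiers.
balanced = (0.5, 0.4)
predicts_almost_all_positive = (0.02, 1.0)

for p, r in (balanced, predicts_almost_all_positive):
    print(f"P={p}, R={r}, average={(p + r) / 2:.3f}, F1={f1_score(p, r):.3f}")
```

The second classifier has the higher average but a far lower $F_1$ score, because $F_1$ is small whenever either precision or recall is small.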

## Data for machine learning

How much data should I train on?

Under certain conditions, getting a lot of data can be very effective.

#### Designing a high accuracy learning system

E.g. Classify between confusable words.

{to, two, too} {then, than}

For breakfast I ate _two_ eggs.

Algorithms

- Perceptron (logistic regression)
- Winnow
- Memory-based
- Naive Bayes

All algorithms get better with more data.

[Banko and Brill, 2001]

"It's not who has the best algorithm that wins.  It's who has the most data"

#### Large data rationale

Assume feature $x \in \mathbb{R}^{n+1}$ has sufficient information to predict $y$ accurately.

Example: for breakfast I ate _two_ eggs.

Counterexample: predict housing price from only size (feet$^2$) and no other features.

Useful test: given the input $x$, can a human expert confidently predict $y$?

Use a learning algorithm with many parameters (e.g. logistic regression / linear regression with many features; neural network with many hidden units)

Use a very large training set (unlikely to overfit)






