## Build a Spam Classifier
Supervised Learning
- $x$: features of the email, e.g. 100 words indicative of spam or not spam
    - Check whether each word appears in the email
        - 1 if it appears, 0 if it does not
    - In practice, take the $n$ most frequently occurring words in the training set (10K - 50K) instead of manually picking 100 words
- $y$: whether the email is spam (1) or not (0)
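
A minimal sketch of this feature construction follows. The function names (`tokenize`, `build_vocabulary`, `email_to_features`), the letters-only tokenizer, and the toy emails are assumptions for illustration; a real pipeline would also handle stemming, HTML, URLs, and so on.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplifying assumption: lowercase and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(emails, n):
    """Return the n most frequently occurring words across the training emails."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(n)]

def email_to_features(email_text, vocabulary):
    """Binary feature vector x: 1 if the vocabulary word appears in the email, else 0."""
    present = set(tokenize(email_text))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical toy training set, only to show the vocabulary step.
emails = ["Buy now, great deal", "Deal of the week: buy now", "Lunch now?"]
print(build_vocabulary(emails, 3))

# Hand-picked toy vocabulary, only to show the 0/1 encoding.
vocabulary = ["buy", "deal", "discount", "andrew", "now"]
x = email_to_features("Deal of the week! Buy now!", vocabulary)
print(x)  # [1, 1, 0, 0, 1]
```

Each position in `x` corresponds to one vocabulary word, so the vector length equals the vocabulary size regardless of how long the email is.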

### Implementation
- Collect lots of data
- Develop sophisticated features based on email routing information (from email header)
- Develop sophisticated features for message body
- Develop a sophisticated algorithm to detect misspellings
    - Otherwise we cannot recognize misspelled words that should be regarded as "spam" indicators
    - Spam emails may use deliberate misspellings to evade spam classifiers
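
As a rough illustration of the misspelling idea, the sketch below maps a few digit and symbol substitutions back to letters before the vocabulary lookup. The substitution table and the name `normalize_token` are hypothetical; real spam uses many more tricks, and this naive mapping would also mangle legitimate numbers.

```python
# Hypothetical substitution table for common character swaps.
SUBSTITUTIONS = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s",
})

def normalize_token(token):
    """Lowercase a token and undo simple character substitutions."""
    return token.lower().translate(SUBSTITUTIONS)

print(normalize_token("m0rtgage"))  # mortgage
print(normalize_token("Med1cine"))  # medicine
print(normalize_token("w4tches"))   # watches
```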
    
## Good Practice
- Start with a simple algorithm that you can implement quickly (within about 24 hours). Implement it and test it on your cross-validation data
    - Quick and dirty
- **Plot learning curves** to decide whether more data, more features, etc. are likely to help
- **Error Analysis**: manually examine the examples in the cross-validation set that your algorithm made errors on
    - See if you can identify any systematic trend/pattern in what type of examples it is making errors on
    - For example, for misclassified emails, identify:
        - What the major types of misclassified emails are
        - What features you think would have helped the algorithm classify them correctly
            - Deliberate misspellings, unusual email routing, unusual punctuation
- **Numerical Evaluation of your learning algorithm**: a single real-number metric lets you quickly compare alternatives and check whether a change actually improves the algorithm
    - Accuracy, precision, cross-validation error, etc.
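
The learning-curve step can be sketched as below: train on growing subsets and record training and cross-validation error for each size. Everything here is an assumption for illustration (synthetic linear data, a least-squares model, the chosen subset sizes); the spam classifier itself would use its own model and error measure.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    # Synthetic data (assumption): y is linear in x plus Gaussian noise.
    x = rng.uniform(-1, 1, size=(m, 1))
    y = 3.0 * x[:, 0] + 1.0 + rng.normal(0, 0.3, size=m)
    X = np.hstack([np.ones((m, 1)), x])  # prepend intercept column
    return X, y

def squared_error(X, y, theta):
    # J(theta) = 1/(2m) * sum of squared residuals
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

X_train, y_train = make_data(200)
X_cv, y_cv = make_data(100)

sizes = [5, 10, 20, 50, 100, 200]
train_errors, cv_errors = [], []
for m in sizes:
    # Fit on the first m training examples only.
    theta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    train_errors.append(squared_error(X_train[:m], y_train[:m], theta))
    cv_errors.append(squared_error(X_cv, y_cv, theta))

for m, j_train, j_cv in zip(sizes, train_errors, cv_errors):
    print(f"m={m:4d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
```

Plotting `train_errors` and `cv_errors` against `sizes` gives the learning curve. Two curves that flatten close together at a high error suggest high bias, so more data is unlikely to help; a persistent gap between them suggests high variance, where more data may help.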

## Skewed Classes
Skewed classes: one target class is far rarer than the other in the sample data, so misclassifying every example of the rare class barely affects the overall error
- We can get a high accuracy by "doing nothing", i.e. always predicting the majority class
- So we cannot rely on the traditional error/accuracy metrics alone
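
A small worked example of the problem, using made-up numbers (1000 examples, 5 of them in the rare positive class):

```python
# Assumed toy data: 1 marks the rare positive class (0.5% of examples).
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000  # "do nothing" classifier: always predict y = 0

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
positives_found = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)

print(accuracy)         # 0.995
print(positives_found)  # 0
```

The classifier scores 99.5% accuracy while never detecting a single positive example.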

### Error Metrics of the Skewed Classes
- Precision: of all the patients for whom we predicted positive ($y = 1$), what **fraction** actually has cancer (**True Positive**): $\frac{TP}{TP + FP}$
- Recall: of all the patients who actually have cancer, what **fraction** did we correctly **predict as Positive**: $\frac{TP}{TP + FN}$
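
A minimal sketch of both metrics computed from true-positive (TP), false-positive (FP), and false-negative (FN) counts. The helper name `precision_recall`, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions made here.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the rare positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the metric is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy labels: 4 actual positives; the classifier finds 3 and raises 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))    # (0.75, 0.75)

# A classifier that always predicts 0 gets recall 0 despite 60% accuracy here.
print(precision_recall(y_true, [0] * 10))  # (0.0, 0.0)
```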

#### Tradeoff
- If we use a **higher threshold** (like 0.7 instead of 0.5), a positive prediction requires higher confidence -- result: **Higher Precision & Lower Recall**
- If we want to **avoid missing too many cases of cancer** (false negatives), we set a lower threshold (like 0.3) -- result: **Lower Precision & Higher Recall**

![W6-PRTRADEOFF](Plots/W6-PRTRADEOFF.png)
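
The tradeoff can be seen numerically by sweeping the threshold over a fixed set of predicted probabilities. The probabilities and labels below are made up for illustration, and `predict` and `precision_recall` are helper names assumed here.

```python
def predict(probabilities, threshold):
    # Predict y = 1 when the estimated probability reaches the threshold.
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical classifier outputs and the corresponding true labels.
probs  = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
y_true = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(y_true, predict(probs, threshold))
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.57, recall=1.00
# threshold=0.5: precision=0.60, recall=0.75
# threshold=0.7: precision=0.67, recall=0.50
```

On this toy data precision rises and recall falls as the threshold increases. Real precision-recall curves need not be this smooth.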

#### F score: Compare Precision & Recall 
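- $F_1 = 2 \cdot \frac{PR}{P+R}$
    - The higher, the better
    - $F_1$ is high only when both $P$ and $R$ are high; if either is 0, $F_1 = 0$
- Simply computing the average of Precision and Recall, $\frac{P+R}{2}$, is not desirable
    - A classifier can push one metric very high at the cost of the other (e.g. predicting $y = 1$ for everyone gives $R = 1$ but very low $P$) and still get a decent average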
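
A short comparison of the two scores on three hypothetical classifiers. The precision and recall values are illustrative, not measured, and `f1_score` is a helper defined here rather than a library call.

```python
def f1_score(precision, recall):
    # Define F1 as 0 when both precision and recall are 0 to avoid dividing by zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative (precision, recall) pairs.
candidates = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),  # e.g. predicts y = 1 almost always
}

for name, (p, r) in candidates.items():
    print(f"{name}: average={(p + r) / 2:.3f}, F1={f1_score(p, r):.3f}")

best_by_average = max(candidates, key=lambda k: sum(candidates[k]) / 2)
best_by_f1 = max(candidates, key=lambda k: f1_score(*candidates[k]))
print(best_by_average)  # Algorithm 3
print(best_by_f1)       # Algorithm 1
```

The plain average rewards the degenerate Algorithm 3, while $F_1$ prefers the balanced Algorithm 1.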
    
## Use Large Data Sets
### Rationale
- First, use a learning algorithm with **many parameters**: $J_{train}(\theta)$ will be small and bias will be low (avoid underfitting)
- Second, use a very large training set: $J_{train}(\theta) \approx J_{test}(\theta)$ and variance will be low (avoid overfitting)
- Result: $J_{test}(\theta)$ will be small