# Week 8 - Supervised Data Analysis II

## Learning Outcomes
By the end of this week, you should be able to:  
- Explain overfitting and underfitting of different models  
- Comprehend bias and variance trade-off  
- Utilise different model evaluation metrics  
- Comprehend the importance of “No Free Lunch Theorem”  
- Explain what ensemble models are and work through the mathematics  


## 1. Truth of fitting models
For the variables of an individual data case (e.g. a single loan
application or a single heart disease patient), the “truth” can be
measured directly  
- Across examples, the “true” model is harder to define:  
    - What is a “true” model of physics? – Newtonian physics, String Theory?  
- How can you measure the “true” model for the heart disease problem?  
    - collect infinite data and infer statistically  
    - but it's a dynamic problem, and general population characteristics
      are always changing  
- Regardless, we assume some underlying “truth” is out there  

### Quality
- To evaluate the quality of results derived from learning, we need notions of value  
- So we will review quality and value  
- May be the quality of your prediction  
- May be the consequence of your actions (making a prediction is a kind of action)  
- Can be measured on a positive or negative scale  
    - Loss: positive when things are bad, negative (or zero) when they’re good  
    - Gain: positive when things are good, negative when they’re not  
- Error: measure of “miss”, sometimes a distance, but not a direct measure of quality  
    - Absolute error = $|x|$  
    - Square error = $x^2$  
    - Hinge-error = $|x|$ if $|x| \le 1$, 1 otherwise  
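
The three error measures above can be sketched in plain Python. This is only an illustration: `x` is assumed to be the "miss" (predicted value minus observed value), and the function names are made up here. The third function follows the definition given in these notes, which caps the absolute error at 1; it is not the same as the SVM hinge loss.

```python
def absolute_error(x: float) -> float:
    """|x|: grows linearly with the size of the miss."""
    return abs(x)


def square_error(x: float) -> float:
    """x**2: penalises large misses much more than small ones."""
    return x ** 2


def capped_error(x: float) -> float:
    """The notes' "hinge-error": |x| while |x| <= 1, otherwise 1."""
    return min(abs(x), 1.0)


# Compare the three measures on a few hypothetical misses.
for miss in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(miss, absolute_error(miss), square_error(miss), capped_error(miss))
```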

### Regression
- Looks for relationships between variables  
- Linear regression: line of best fit y = mx + c  
    - Minimises loss function Mean Square Error  
    - Residuals: distances between observed values and predicted values  
        - Explained variation ~ variation in y explained by the model  
        - Residual variation ~ variation in y unexplained by the model  
        - Total variation in y = Explained variation + Residual variation  
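    - SST (total sum of squares) = SSE (explained sum of squares) + SSR (residual sum of squares)  
    - $R^2$ value = SSE/SST = 1 − SSR/SST  
        - $R^2$ ranges between 0 and 1 (for a least-squares line with an intercept, measured on the data it was fitted to)  
            - 1 is good: variability in y is fully explained by the model  
            - 0 is bad: no variability in y is explained by the model  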
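
A minimal sketch of this decomposition in plain Python, assuming a least-squares line with an intercept. The data points and the helper names `fit_line` and `variation` are made up for illustration, and the naming follows the notes (SSE = explained, SSR = residual).

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar


def variation(xs, ys):
    """Return (SST, SSE, SSR): total, explained and residual sums of squares."""
    m, c = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    preds = [m * x + c for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)
    sse = sum((p - y_bar) ** 2 for p in preds)
    ssr = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return sst, sse, ssr


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up observations, roughly y = 2x
sst, sse, ssr = variation(xs, ys)
r_squared = sse / sst
print(f"SST={sst:.3f}  SSE={sse:.3f}  SSR={ssr:.3f}  R^2={r_squared:.4f}")
```

For a least-squares fit, SSE + SSR should reproduce SST up to floating-point rounding, which is why SSE/SST and 1 − SSR/SST agree.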

### Correlation coefficient
- We can measure the strength and direction of the linear relationship of
  two variables  
- (Pearson product-moment) correlation coefficient is the covariance of
  the variables divided by the product of their standard deviations  
- Called R, or Pearson’s R, when applied to a sample  
    - R=+1 is total positive linear correlation  
    - R=0 is no linear correlation  
    - R=−1 is total negative linear correlation  
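
The definition above (covariance divided by the product of the standard deviations) can be sketched directly. The function name `pearson_r` and the small data sets are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly increasing line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly decreasing line
# y = x**2 on a symmetric range: clearly related, but not linearly.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The last case is a reminder that R only measures the linear part of a relationship: the points lie exactly on a parabola, yet the covariance is zero.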

### Regression tree (Decision Tree)
A supervised machine learning algorithm that predicts a continuous-valued response
variable by learning decision rules from the predictors (or independent variables).  
- Divide data into subsets of similar values
- Estimate the response within each subset

ANOVA  
- The partitioning is a top-down, greedy approach.  
    - Start with all data  
    - Once split, don’t change  
- Searches every distinct value of every input predictor to find the
  predictor/value pair that best splits the data into two subgroups (G1 and G2).  
    - That is, for the data inside that node, this predictor/value pair
      improves the chosen criterion (e.g., ANOVA) the most.  
- ANOVA criterion = $SST - (SSG_1 + SSG_2)$  
- $SST = \sum_i (y_i - \bar{y})^2$, the total variation of the dependent variable.  
- $SSG_1$ and $SSG_2$ use the SST formula but with the values for the two
  subgroups created by the partition.  
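
The split search described above can be sketched as follows. This is a simplified illustration of a single greedy step, not a full tree implementation: it assumes numeric predictors, splits of the form "value <= threshold", and a tiny made-up data set; `sum_sq` and `best_split` are hypothetical helper names.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (the SST formula)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target):
    """Try every distinct value of every predictor; return (gain, name, threshold)."""
    sst = sum_sq(target)
    best = None
    for name in rows[0]:
        for threshold in sorted({row[name] for row in rows}):
            g1 = [y for row, y in zip(rows, target) if row[name] <= threshold]
            g2 = [y for row, y in zip(rows, target) if row[name] > threshold]
            if not g1 or not g2:
                continue  # a split must create two non-empty subgroups
            gain = sst - (sum_sq(g1) + sum_sq(g2))  # ANOVA criterion
            if best is None or gain > best[0]:
                best = (gain, name, threshold)
    return best


# Made-up data: the response jumps when age goes above 30.
rows = [
    {"age": 20, "income": 40}, {"age": 25, "income": 10},
    {"age": 30, "income": 50}, {"age": 50, "income": 20},
    {"age": 55, "income": 60}, {"age": 60, "income": 30},
]
target = [10, 12, 11, 30, 31, 29]
gain, name, threshold = best_split(rows, target)
print(f"split on {name} <= {threshold}, criterion = {gain}")
```

A full regression tree would apply the same search recursively inside G1 and G2, never revisiting an earlier split, and predict the subgroup mean at each leaf.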


## 2. Overfitting and underfitting
More parameters = the model can fit a more complicated curve  
Too many parameters = the model makes wild predictions  
- Low-degree polynomial: cannot fit the data well; said to have high bias  
- High-degree polynomial: fits the training data too well; said to have low bias (but high variance)  
- A poor fit due to high bias is called underfitting  
- A poor fit to new data due to low bias and high variance is called overfitting  
- Split the data into training and test sets so that evaluation metrics are measured on unseen data and are not inflated by overfitting
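
A small sketch of the idea, using three polynomial models of increasing flexibility: degree 0 (the mean), degree 1 (a least-squares line) and the degree 5 polynomial that passes through all six training points (built in Lagrange form). The data are made up: the training targets are y = 2x + 1 with noise of ±1, and the test targets are assumed noise-free for simplicity. All function names are illustrative.

```python
def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_mean(xs, ys):
    """Degree 0: always predict the training mean."""
    y_bar = sum(ys) / len(ys)
    return lambda x: y_bar


def fit_line(xs, ys):
    """Degree 1: least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: m * x + (y_bar - m * x_bar)


def fit_interpolant(xs, ys):
    """Degree len(xs)-1: polynomial through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


train_x, train_y = [0, 1, 2, 3, 4, 5], [2, 2, 6, 6, 10, 10]
test_x, test_y = [0.5, 1.5, 2.5, 3.5, 4.5], [2, 4, 6, 8, 10]

results = {}
for label, fit in [("degree 0", fit_mean), ("degree 1", fit_line),
                   ("degree 5", fit_interpolant)]:
    model = fit(train_x, train_y)
    results[label] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(label, "train MSE:", results[label][0], "test MSE:", results[label][1])
```

On this toy data the degree 5 polynomial has zero training error but a much larger test error than the line, while the degree 0 model is poor on both sets. Only the test-set figures reveal that the most flexible model is not the best one.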

## 3. Bias and variance
Bias (Accuracy): measures how much the prediction differs from
the desired regression function.  
Variance (Precision): measures how much the predictions for individual
data sets vary around their average.  
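
One common way to make these two definitions precise is the bias-variance decomposition of the expected squared error. The notation here is an assumption, not taken from the notes: $f$ is the desired regression function, $y = f(x) + \varepsilon$ with noise of mean 0 and variance $\sigma^2$, and $\hat{f}_D$ is the model fitted to a training set $D$.

$$
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

Under these assumptions, a simple model tends to have a large first term (underfitting) and a very flexible model a large second term (overfitting); the noise term cannot be reduced by any model.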

![image.png](attachment:image.png)  
![image-2.png](attachment:image-2.png)  
![image-3.png](attachment:image-3.png)  
![image-4.png](attachment:image-4.png)  
<style type="text/css">
    img {
        width: 400px;
    }
</style>

### No Free Lunch Theorem
If a [learning] algorithm performs well on a certain
class of problems then it necessarily pays for that with
degraded performance on the set of all remaining problems.
- There is no universally good machine learning algorithm (when
  one has finite data)
    - e.g. Naive Bayesian classification performs well for text classification
    with smaller data sets
    - e.g. linear Support Vector Machines perform well for text
    classification

## 4. Model evaluation metrics
- Evaluation metrics are tied to machine learning tasks.  
- For classification models, the task is about predicting the
  class labels given input data.  
- Binary classification vs. Multi-class classification  

Confusion Matrix
- True Positive  
- True Negative  
- False Positive  
- False Negative  

Accuracy = $\frac{TP+TN}{TP+TN+FP+FN}$  
How often the classifier predicts correctly  

Precision = $\frac{TP}{TP+FP}$  
How many predicted positive cases are really positive; useful when False Positives are much more costly than False Negatives, e.g. e-commerce  

Recall = $\frac{TP}{TP+FN}$  
How many actual positive cases were caught; useful when False Negatives are much more costly than False Positives, e.g. medical scans  

F1 Score = $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$  
When both precision and recall are important  
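
The four formulas above can be computed from the confusion matrix counts. A minimal sketch for the binary case, with made-up labels and illustrative function names; it assumes at least one predicted positive and one actual positive, so the divisions are defined.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # assumes tp + fp > 0
    recall = tp / (tp + fn)     # assumes tp + fn > 0
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical actual labels
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # hypothetical predictions
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```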

R^2 Score = SSE/SST  
Goodness of fit  

Mean Absolute Error  
Mean Squared Error - amplifies large errors  
Root Mean Squared Error - standard deviation of residuals  
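
A short sketch of the three regression error metrics, with made-up values chosen so that one prediction is badly off. The function names are illustrative.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average size of the residuals."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared residual: large misses dominate."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, back in the units of y."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 13]  # residuals: -1, 0, 1, 4
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, mse, rmse)
```

Here the single residual of 4 contributes 16 of the 18 total squared error, so the MSE (4.5) is three times the MAE (1.5).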


## 5. Multiple models
- Suppose you wanted to fit a linear model of the life expectancy for
  every country in your data.  
- Filtering each country one-by-one to fit 142 individual models is not
  a practical solution.  
- Use nest() and map() in R  
    - Group by country  
        - Gives a nested dataframe with a new column of country-specific data  
    - Map lm to each country  
    - tidy() the lm output into a tibble, then unnest it  
    - Provides slope (gradient) and intercept coefficients for each country  
    - Reorganise for analysis  
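
The notes describe this workflow with R's nest() and map(). As a rough Python analogue of the same group, fit, collect pattern (not the R code itself), the sketch below groups records by country, fits one line per group and gathers the coefficients in a single table. The country names and life expectancy values are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


# Hypothetical (country, year, life expectancy) records.
records = [
    ("Aland", 2000, 70.0), ("Aland", 2005, 71.0), ("Aland", 2010, 72.0),
    ("Borduria", 2000, 60.0), ("Borduria", 2005, 62.5), ("Borduria", 2010, 65.0),
]

# "Nest": collect each country's rows under one key.
nested = {}
for country, year, life_exp in records:
    nested.setdefault(country, []).append((year, life_exp))

# "Map": fit the same model to every group and keep the coefficients.
coefficients = {}
for country, rows in nested.items():
    years = [year for year, _ in rows]
    values = [value for _, value in rows]
    slope, intercept = fit_line(years, values)
    coefficients[country] = {"slope": slope, "intercept": intercept}

for country, coef in coefficients.items():
    print(country, coef)
```

The resulting table of slopes and intercepts can then be sorted or filtered, for example to find the countries with the fastest or slowest change.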


## 6. Ensemble models
- Given only data, we do not know the truth and
  can only estimate what may be the “truth”  
- An ensemble is a collection of possible/reasonable models  
- From this we can understand the variability and range of
  predictions that are realistic  
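
One way to build such a collection of reasonable models is bootstrapping: refit the same model on many resamples of the data and look at the spread of the predictions. The notes do not name a specific method, so this is only one possible sketch; the data, the seed and the function names are assumptions.

```python
import random


def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def bootstrap_ensemble(points, n_models, seed=0):
    """Fit one line per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        sample = rng.choices(points, k=len(points))
        if len({x for x, _ in sample}) < 2:
            continue  # a line needs at least two distinct x values
        models.append(fit_line(sample))
    return models


# Made-up observations, roughly y = 2x + 1.
points = [(1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1),
          (5, 11.2), (6, 12.8), (7, 15.1), (8, 16.9)]
models = bootstrap_ensemble(points, n_models=200)
predictions = [slope * 10 + intercept for slope, intercept in models]
mean_prediction = sum(predictions) / len(predictions)
print("prediction at x = 10:")
print("  lowest :", min(predictions))
print("  average:", mean_prediction)
print("  highest:", max(predictions))
```

The average gives a single ensemble prediction, while the gap between the lowest and highest values indicates how much the prediction depends on which data happened to be observed.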

## Tutorial notes
- Data transformations (e.g. a log transformation) can be helpful for regression models  
- Dimensionality reduction: good for reducing model noise and computational burden  