# Lecture 6: Linear Models and Validation Metrics
Jan 22, 2024
***

## Supervised Learning

### Review: What is Supervised Learning

- Supervised learning is a type of machine learning that learns
from labeled data, which consists of input/output pairs.
- The output can be either a numerical value (regression) or a
class (classification).
- Supervised learning aims to make accurate predictions for new
data that has not been seen before.

### Review: Classification and Regression Models

Classification and regression are two types of supervised machine learning
problems, where the goal is to `learn a mapping function` from input variables
to output variables.
- In `classification`, we want to assign a discrete label to an input, such as
"spam" or "not spam" for an email.
- In `regression`, we want to estimate a continuous value for an input, such as
the price of a house based on its features.

The main difference is that the output variable is
- categorical for classification
- continuous for regression.

## Linear Models

### What are Linear Models?

Linear models are supervised learning algorithms that predict an
output variable based on a `linear combination of input features`.
They can be used for both regression and classification tasks,
depending on whether the output variable is continuous or binary.

#### Linear combination example: TVs

Screen Size, Refresh Rate, Price

`Screen Size` and `Refresh Rate` are Inputs<br/>
`Price` is an Output

`y = mx + b`<br/>
m = slope, b = y-intercept

`w_0*(Screen Size) + w_1*(Refresh Rate) + b = Price`

A linear model keeps things simple, but what if the relationship between the inputs and the output is not linear?
- It is a very basic model.
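
As a minimal sketch of the linear combination above, the snippet below plugs one TV into the formula. The weights `w_0`, `w_1` and the intercept `b` are made-up values for illustration only; in practice they would be learned from data.

```python
# Hypothetical parameters, chosen only to illustrate the formula.
w_0 = 20.0   # price change per inch of screen size
w_1 = 2.5    # price change per Hz of refresh rate
b = 100.0    # intercept (base price)


def predict_price(screen_size, refresh_rate):
    """Predicted price as a linear combination of the two inputs."""
    return w_0 * screen_size + w_1 * refresh_rate + b


# A 55-inch TV with a 120 Hz refresh rate.
price = predict_price(screen_size=55, refresh_rate=120)
print(price)
```

Each weight says how much the predicted price moves when its input increases by one unit while the other input stays fixed.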

#### Some Common Linear Models

- Linear regression: Predicts a continuous output variable from one or more
input features.
    - For example, it can model how the height of a person varies with age, or
how the price of a house depends on the size and location.
- Logistic regression: Predicts a binary output variable from one or more input
features.
    - For example, it can estimate the probability of a patient having a heart
disease or not, or the likelihood of a customer buying a product or not.
- Linear models are simple, interpretable, and fast to train. However, they
may not perform well on complex or non-linear data.
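
To make the difference between the two models concrete, here is a minimal sketch of how logistic regression turns a linear combination into a probability. It assumes the standard sigmoid (logistic) function, which these notes do not spell out, and the weights are hypothetical rather than learned.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of the positive class for one data point x."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # same linear combination as in regression
    return sigmoid(z)


def predict_label(x, w, b, threshold=0.5):
    """Binary prediction: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical weights for two features.
w = [1.5, -0.5]
b = -1.0
print(predict_proba([2.0, 1.0], w, b), predict_label([2.0, 1.0], w, b))
```

Linear regression would return the value of `z` directly, whereas logistic regression passes it through the sigmoid and thresholds the result.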

## Linear Models for Regression

- For regression, the general prediction formula for a linear model looks as
follows:<br/>
$\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$<br/>
$\hat{y} = w[0] \times x[0] + w[1] \times x[1] + \dots + w[p] \times x[p] + b$
- Here, $\mathbf{w} = [w[0], w[1], \dots, w[p]]$ and $\mathbf{x} = [x[0], x[1], \dots, x[p]]$ are two
vectors, and $\cdot$ denotes the `dot product`.
- $x[0]$ to $x[p]$ denote the `features` of a single data point
(in this example, the number of features is p+1).
- Also, $w[0]$ to $w[p]$ and $b$ are `parameters of the model` that are
learned, and $\hat{y}$ is the prediction the model makes.
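
The prediction formula above maps directly onto a dot product in NumPy. The sketch below uses made-up values for `w`, `b`, and the data points, with p = 2 (three features); nothing here is learned.

```python
import numpy as np

# Hypothetical parameters: w[0]..w[p] with p = 2, plus the intercept b.
w = np.array([2.0, -1.0, 0.5])
b = 4.0

# One data point with features x[0]..x[p].
x = np.array([1.0, 3.0, 2.0])
y_hat = np.dot(w, x) + b  # w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b

# Several data points at once: each row of X is one data point.
X = np.array([
    [1.0, 3.0, 2.0],
    [2.0, 1.0, 4.0],
])
y_hat_all = X @ w + b  # one prediction per row

print(y_hat, y_hat_all)
```

Stacking data points as rows of a matrix lets one matrix-vector product compute every prediction in a single step.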

- For a dataset with a single feature, this is:<br/>
$\hat{y} = w[0] \times x[0] + b$
- This is the equation for a line.
- Here, $w[0]$ is the `slope`, and $b$ is the `y-axis offset` (intercept).
- For more features, `w contains the slopes` along each feature axis.
- Alternatively, you can think of the `predicted response` as being a
`weighted sum of the input features`, with weights (which can be negative)
given by the entries of w.<br/>
    <img src="./images/L6/L6-1.png" alt="L6-1.png" width="300"/>

- There are many different linear models for regression.
- The difference between these models lies in how the model parameters
(w and b) are learned from the training data, and how model complexity
can be controlled.
- Popular models used:
    - Linear regression (ordinary least squares)
    - [Ridge regression](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression)
    - [Lasso regression](https://scikit-learn.org/stable/modules/linear_model.html#lasso)

### Ridge Regression

- Ridge regression is also a linear model for regression, so it uses the same
prediction formula as ordinary least squares.
- For ridge regression, the coefficients (w) are chosen not only so that they
predict well on the training data, but also so that they satisfy an additional constraint.
- The additional constraint is that the magnitude of the coefficients should be as small
as possible; all entries of w should be close to zero.
- The square of the [𝑙_2-norm](https://en.wikipedia.org/wiki/Norm_(mathematics)) of the w's is defined as:<br/>
    <img src="./images/L6/L6-3.png" alt="L6-3.png" width="600"/>
- The cost function to minimize becomes:<br/>
    <img src="./images/L6/L6-4.png" alt="L6-4.png" width="600"/>
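
For readers without access to the figures, a text version of the two formulas can be written as below. This follows the formulation in the linked scikit-learn documentation; the symbol $\alpha$ for the regularization strength, the sample index $i$, and the sample count $n$ are assumptions and may differ from the notation in the figures.

$$\|\mathbf{w}\|_2^2 = \sum_{j=0}^{p} w[j]^2$$

$$\min_{\mathbf{w},\, b} \; \sum_{i=1}^{n} \left( y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \right)^2 + \alpha \, \|\mathbf{w}\|_2^2$$

The first term is the usual squared prediction error on the training data, and the second term penalizes large coefficients. Setting $\alpha = 0$ gives back ordinary least squares.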

- Having the coefficients close to zero means each feature should have as
little effect on the outcome as possible (which translates to having a
small slope), while still predicting well.
- This constraint is an example of what is called regularization.
- Regularization means explicitly restricting a model to avoid overfitting.
- Ridge regression uses $L_2$ regularization.
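
The shrinking effect of the penalty can be seen in a small NumPy sketch. It assumes the closed-form ridge solution with an unpenalized intercept (handled by centering the data) and a regularization strength named `alpha`; the helper `fit_ridge` and the synthetic data are hypothetical, not taken from the lecture.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge fit; the intercept b is not penalized."""
    X_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - X_mean  # centering removes the intercept from the problem
    yc = y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - X_mean @ w
    return w, b


# Synthetic data from hypothetical parameters, with some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + 1.0 + rng.normal(scale=0.5, size=30)

for alpha in (0.0, 1.0, 100.0):
    w, b = fit_ridge(X, y, alpha)
    print(alpha, np.round(w, 3), round(float(np.linalg.norm(w)), 3))
```

With `alpha = 0` the fit matches ordinary least squares; as `alpha` grows, the norm of w shrinks toward zero, trading some training accuracy for a more restricted model.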