# K-nearest neighbor (KNN)
* A **non-parametric** model
* A **non-linear** model

It makes few assumptions about the structure of the data and usually gives accurate results, but its predictions can be unstable under small changes in the dataset.

* Classifier
* Regressor

Instance-based (memory-based) supervised learning.

- KNN classifier: memorizes the entire training set.

Four things should be specified:

1) A distance metric.
    * It controls the distance function between points, and thus which points count as nearest when finding neighbors.
    * Typically Euclidean (Minkowski with p = 2)
2) How many nearest neighbors to look at (model complexity).
    * k = 5
3) An optional weighting function on the neighbor points.
    * Ignored
4) A method for aggregating the classes of the neighbor points.
    * Majority class
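
The four choices above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements KNN; the helper names `minkowski` and `knn_predict`, the `weighted` flag, and the toy data are all assumptions made up for this example.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(X_train, y_train, query, k=5, p=2, weighted=False):
    """Classify `query` from its k nearest training points."""
    # 1) Distance metric: distance from the query to every stored point.
    dists = [(minkowski(x, query, p), label) for x, label in zip(X_train, y_train)]
    # 2) Number of neighbors: keep the k closest points.
    neighbors = sorted(dists, key=lambda d: d[0])[:k]
    # 3) Optional weighting: uniform votes, or closer points count more.
    # 4) Aggregation: the class with the largest total vote wins.
    votes = Counter()
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + 1e-9) if weighted else 1.0
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two clusters in 2-D.
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_train = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(X_train, y_train, (1.1, 1.0), k=3)
pred_near_b = knn_predict(X_train, y_train, (5.1, 5.0), k=3)
print(pred_near_a, pred_near_b)
```

Note that "training" is just storing `X_train` and `y_train`; all of the work happens at prediction time.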

### Relation between k and model complexity

* **Reducing k** in a KNN classifier **increases** the variance of the decision boundaries and the risk of **overfitting**, because very local changes are captured.

* With **k = the total number** of points in the training set, the result is a **single decision**: the **most frequent** class in the training set.
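
Both extremes can be checked with a short sketch using scikit-learn's `KNeighborsClassifier`. The synthetic, imbalanced dataset is an assumption for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic, imbalanced data (assumed for illustration): 30 points of class 0, 10 of class 1.
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)), rng.normal(1.5, 1.0, size=(10, 2))])
y = np.array([0] * 30 + [1] * 10)

# k = 1: every training point is its own nearest neighbor, so the training set is fit perfectly.
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
train_acc_k1 = knn_1.score(X, y)

# k = N: every query sees the whole training set, so the vote is always the majority class.
knn_all = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)
preds_all = knn_all.predict(X)

print(train_acc_k1, np.unique(preds_all))
```

A perfect training score at k = 1 says nothing about accuracy on unseen data; it is the overfitting end of the range.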

### Drawback:

When the training data has many samples, or each sample has many features, prediction with a KNN model can become slow.

For datasets with hundreds of thousands of features, especially sparse ones, we should use another model instead of KNN.

# Linear regression:

A **parametric** model. It assumes a linear relationship between the input variables (features) and the single output variable (target).

It predicts the target as a weighted sum of the features. The task of machine learning is to find the weighting parameters from previously seen (training) data.

**Least-squares linear regression** (a.k.a. ordinary least squares)

* It finds the weights $w$ and the bias/intercept parameter $b$ that minimize the residual sum of squares (equivalently, the mean squared error) between target and prediction.

$RSS(w,b) = \sum_{i=1}^N (y_i - (w.x_i + b))^2$

**Implementation in Sklearn**:

* $w$ : linreg.coef_

* $b$: linreg.intercept_

    - The trailing `_` in `linreg.coef_` means the attribute was learned from the training data rather than set by the user.
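
A minimal sketch of fitting and inspecting these attributes, assuming noise-free toy data generated from $y = 3x + 2$ (the data is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data generated from y = 3x + 2 (assumed for illustration).
X = np.arange(10, dtype=float).reshape(-1, 1)  # shape (n_samples, n_features)
y = 3.0 * X.ravel() + 2.0

linreg = LinearRegression().fit(X, y)

# Learned attributes carry a trailing underscore.
print("w:", linreg.coef_)        # one weight per feature
print("b:", linreg.intercept_)   # scalar intercept

# RSS from the formula above, computed by hand from the fitted parameters.
rss = np.sum((y - (X @ linreg.coef_ + linreg.intercept_)) ** 2)
print("RSS:", rss)
```

Because the data lies exactly on a line, the recovered weight and intercept should match the generating values up to floating-point error, and the RSS should be essentially zero.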

## Comparing KNN and linear regression:

* **KNN**:
    - does not make many assumptions about the structure of the data.
    - gives potentially accurate but sometimes unstable predictions that are sensitive to small changes in the training data.
    - tends to fit the training set better.

* **Linear regression**:
    - makes strong assumptions about the structure of the data: a linear relationship.
    - gives stable but potentially inaccurate predictions.
    - often does better on unseen data.
    - extends readily to new data beyond the range of the training set.
    - has no parameter to control the model complexity.
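
The difference in behavior beyond the training range can be sketched as follows. The toy data from $y = 2x + 1$ and the query point $x = 20$ are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy data from y = 2x + 1 on x in [0, 10] (assumed for illustration).
X_train = np.linspace(0.0, 10.0, 21).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + 1.0

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

# Query far outside the training range.
X_new = np.array([[20.0]])
knn_pred = knn.predict(X_new)[0]   # average of the targets of the 3 nearest training points
lin_pred = lin.predict(X_new)[0]   # follows the fitted line

print(knn_pred, lin_pred)
```

The KNN prediction can never exceed the largest target it has memorized, while the linear model keeps following the fitted line. That only helps if the linear assumption actually holds outside the training range.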

# Regularization:
Regularization helps prevent **overfitting** by restricting the model, typically to reduce its complexity.

* Ridge regression (L2)
* Lasso regression (L1)

### Ridge regression:

It uses the same least-squares criterion but adds a penalty for **large variations** in the weight parameters.

$RSS_{ridge}(w,b) = \sum_{i=1}^N (y_i - (w.x_i+b))^2 + \alpha \sum_{j=1}^P w_j^2$

**Higher** $\alpha$ means **more** regularization and **simpler** models.
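
The shrinking effect of $\alpha$ can be seen in a small sketch with scikit-learn's `Ridge`. The synthetic data and the particular `alphas` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Synthetic data (assumed for illustration): 50 samples, 5 features.
X = rng.normal(size=(50, 5))
true_w = np.array([4.0, -3.0, 2.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)

alphas = [0.01, 1.0, 100.0]
norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # The L2 norm of the weights measures how "large" the model's coefficients are.
    norms.append(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha:>6}: ||w|| = {norms[-1]:.3f}")
```

The norm of the weight vector decreases as `alpha` grows, which is the sense in which a higher $\alpha$ gives a simpler model.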

### Lasso regression:

Like ridge regression, lasso adds a regularization penalty term to the ordinary RSS that causes the $w$ coefficients to shrink toward zero.

$RSS_{lasso}(w,b) = \sum_{i=1}^N (y_i - (w.x_i+b))^2 + \alpha \sum_{j=1}^P |w_j|$

With lasso, a subset of the coefficients is forced to be exactly zero. (This is called a sparse solution, which is a kind of **feature selection**.)

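In scikit-learn, `Ridge` and `Lasso` both default to $\alpha = 1.0$; $\alpha = 0$ corresponds to no regularization (ordinary least squares).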
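
A minimal sketch of the sparsity effect with scikit-learn's `Lasso`, assuming synthetic data in which only the first 2 of 10 features influence the target. Note that scikit-learn scales the squared-error term by $1/(2N)$, so its `alpha` is not numerically identical to the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# Synthetic data (assumed for illustration): only the first 2 of 10 features matter.
X = rng.normal(size=(200, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

n_zero = int(np.sum(lasso.coef_ == 0.0))
selected = np.flatnonzero(lasso.coef_)  # indices of features the model kept
print("coefficients:", np.round(lasso.coef_, 2))
print("zeroed:", n_zero, "selected:", selected)
```

The surviving coefficients are also shrunk relative to the generating values, so lasso both selects features and regularizes the ones it keeps.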


### Use
* **Ridge**: many small/medium-sized effects.
* **Lasso**: only a few variables with medium/large effects.

# Polynomial Features:

Generate polynomial and interaction features.

* It is still a **linear** model.
* Polynomial feature expansion is often combined with a regularized learning method like ridge regression.
* Using higher degrees leads to more complex models and regularization might be needed to avoid overfitting.
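
As a sketch of how this looks in scikit-learn, the example below expands a single sample with `PolynomialFeatures` and then fits a ridge model on the expanded features. The toy target $y = x^2$ and the choice of `degree=2` and `alpha=1.0` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of (a, b) = (2, 3): columns are 1, a, b, a^2, a*b, b^2.
expanded = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))
print(expanded)

# Toy data from y = x^2 (assumed for illustration): not linear in x,
# but linear in the expanded features (1, x, x^2).
X = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
y = X.ravel() ** 2

model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(X, y)
r2 = model.score(X, y)
print("R^2:", r2)
```

The model is still linear in its weights; only the inputs have been transformed, which is why a linear learner can fit a curved relationship here.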

# Linear model for Classification

# Logistic regression

* Linear model
* Default: binary classification, but it can be extended to multi-class problems
* Applying the logistic function (activation function) to the linear combination of the features gives an estimated probability, which determines the class
* Parameter $C$ controls **regularization**
    - default: $C = 1$ with ridge (L2) regularization
    - **Higher** $C$ corresponds to **less** regularization
* Feature normalization is important here, because the regularization penalty treats all weights on the same scale
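
The points above can be sketched with scikit-learn's `LogisticRegression`. The synthetic two-class data, the deliberately mismatched feature scales, and the particular `C` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic two-class data (assumed for illustration), with features on very different scales.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(2.0, 1.0, size=(50, 2))])
X[:, 1] *= 1000.0
y = np.array([0] * 50 + [1] * 50)

# Standardize so the L2 penalty treats both features comparably.
X_scaled = StandardScaler().fit_transform(X)

clf = LogisticRegression(C=1.0).fit(X_scaled, y)

# The logistic function maps the linear score w.x + b to P(class 1).
scores = X_scaled @ clf.coef_.ravel() + clf.intercept_[0]
probs_manual = 1.0 / (1.0 + np.exp(-scores))
probs_sklearn = clf.predict_proba(X_scaled)[:, 1]

# Higher C means a weaker penalty, so the weights are free to grow.
norm_low_c = np.linalg.norm(LogisticRegression(C=0.01).fit(X_scaled, y).coef_)
norm_high_c = np.linalg.norm(LogisticRegression(C=100.0).fit(X_scaled, y).coef_)
print(norm_low_c, norm_high_c)
```

For this binary case, the hand-computed probabilities should agree with `predict_proba`, and the weight norm should be larger for the larger `C`.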

# Support vector machine (SVM):

* Applies the **sign function** as the activation function to produce a binary output
    - feature vector -> linear function: $Sign(w.x+b)$ -> class value
* The **classifier margin** is defined as the width by which the decision boundary area can be increased before hitting a data point.
* The **best** classifier has the **maximum** margin.
* The **maximum** margin classifier is called the **linear support vector machine (LSVM)**
* Parameter $C$ controls **regularization**
    - default: $C = 1$ with ridge (L2) regularization
    - **Higher** $C$ corresponds to **less** regularization.
        * Fit the training data as well as possible
        * Each individual data point is important to classify correctly.
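
A minimal sketch of the sign-function decision rule using scikit-learn's `LinearSVC`. The synthetic data with labels in $\{-1, +1\}$ and the `max_iter` value are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic two-class data (assumed for illustration), labels in {-1, +1}.
X = np.vstack([rng.normal(-2.0, 1.0, size=(40, 2)), rng.normal(2.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)

svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Linear score w.x + b, then the sign of the score gives the class.
scores = X @ svm.coef_.ravel() + svm.intercept_[0]
preds_manual = np.where(scores > 0, 1, -1)
preds_sklearn = svm.predict(X)

print("train accuracy:", svm.score(X, y))
```

Raising `C` pushes the model to classify more individual training points correctly, at the cost of a narrower margin; lowering it favors a wider margin even if some points are misclassified.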

