## Machine Learning Algorithms

https://wch.github.io/latexsheet/latexsheet.pdf

<ol>
    <li>Introduction to machine learning</li>
    <li> Supervised ML and Unsupervised ML</li>
    <li>Linear Regression</li>
    <li>$R^2$ and Adjusted $R^2$</li>
    <li>Ridge and Lasso Regression</li>
</ol>

### Introduction to Machine Learning <br>


#### <u>AI application</u>
    An AI application performs its tasks without human intervention.
    e.g. Netflix, Amazon, self-driving cars
    
#### <u>Machine learning</u>
    Machine learning is a subset of AI that uses statistics to analyze, visualize, predict, and forecast.
   
#### <u>Deep learning</u>
    Deep learning is a subset of machine learning that mimics the human brain through multilayer neural networks.
    
Machine learning and deep learning include supervised learning, unsupervised learning, and reinforcement learning. Supervised learning covers regression and classification techniques; unsupervised learning covers clustering and dimensionality reduction.



#### Supervised Learning

Models are built from both the independent features (inputs) and the dependent features (outputs).

##### <u>Regression</u> 
    If the dependent feature (output) is continuous, the problem is a regression problem.
    
##### <u>Classification</u>
    If the dependent feature (output) takes one of a fixed number of categories, the problem is a classification problem.
    e.g. binary classification (2 categories)
         multiclass classification (more than 2 categories)

####  Unsupervised Learning

Models are built from the independent features only; there are no dependent features.

#####  <u>Clustering</u>
    Clustering groups observations with similar features into clusters.
    e.g. customer segmentation
#####  <u>Dimensionality Reduction</u>
    Dimensionality reduction maps higher-dimensional data to a lower-dimensional representation.

#### Algorithms to be discussed <br>

##### Supervised learning  
<ol>
    <li>Linear Regression</li>
    <li>Ridge and Lasso</li>
    <li>Logistic Regression</li>
    <li>Decision Tree</li>
    <li>AdaBoost</li>
    <li>Random Forest</li>
    <li>Gradient Boosting</li>
    <li>XGBoost</li>
    <li>Naive Bayes</li>
    <li>Support Vector Machine, SVM</li>
    <li>KNN</li>
</ol>
<br>

##### Unsupervised learning  
<ol>
    <li>K-means</li>
    <li>DBSCAN</li>
    <li>Hierarchical Clustering</li>
    <li>K Nearest Clustering</li>
    <li>Principal Component Analysis, PCA</li>
    <li>Latent Dirichlet Allocation, LDA</li>
</ol>

#####  Linear Regression
<br>
  
  Training dataset --> Model --> Hypothesis (new input --> predicted output)
  
  Model : $h_{\theta}(x) = \theta_0 + \theta_1 x$
 
  where $\theta_0$ (intercept) and $\theta_1$ (slope) are the weights.

  The goal is to minimize the distance between the data points and the best-fit line.
  
  <u>Cost function</u>
  
  $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)^2$
  
  This cost function is called the squared error function.
  
$$\underset{\theta_0,\theta_1}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)^2$$
 
$$\underset{\theta_0,\theta_1}{\text{minimize}}\ J(\theta_0,\theta_1)$$
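
To make the cost function concrete, here is a minimal NumPy sketch of $h_{\theta}(x)$ and $J(\theta_0,\theta_1)$. The function names `predict` and `cost` and the toy data are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Squared error cost J = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    residuals = predict(theta0, theta1, x) - y
    return np.sum(residuals ** 2) / (2 * m)

# Toy data (made up for illustration) lying exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

print(cost(1.0, 2.0, x, y))  # the true line fits perfectly, so the cost is 0.0
print(cost(0.0, 0.0, x, y))  # all-zero weights: (1 + 9 + 25 + 49) / 8 = 10.5
```

Minimizing $J$ means searching for the pair $(\theta_0, \theta_1)$ that drives this value as low as possible.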

![cost function graph](cost_curve.png "cost function graph")

 <u>Convergence Algorithm</u>
 
 repeat until convergence
 
   {
 
 $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$$
 
   }
   
$\alpha$ is the learning rate; a small value such as 0.01 is a common starting point.

<u>Gradient Descent Algorithm</u>

 repeat until convergence
 
   {
 
 $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$$
 
   }
<br><br>
Taking the partial derivatives

   $$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1) = \frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)^2$$
<br><br>
   
$j = 0$ <br>
   $$\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)$$
   
<br><br>
$j = 1$<br>
   $$\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)x_i$$
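
The intermediate step behind both results is the chain rule. As a short worked derivation (using $h_{\theta}(x_i) = \theta_0 + \theta_1 x_i$ from the model above):

$$\frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)^2 = \frac{1}{2m}\sum_{i=1}^{m}2\,(h_{\theta}(x_i) - y_i)\,\frac{\partial h_{\theta}(x_i)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)\,\frac{\partial h_{\theta}(x_i)}{\partial\theta_j}$$

$$\frac{\partial h_{\theta}(x_i)}{\partial\theta_0} = 1, \qquad \frac{\partial h_{\theta}(x_i)}{\partial\theta_1} = x_i$$

The factor of 2 from the square cancels the $\frac{1}{2}$ in the cost function, which is why the cost is defined with $\frac{1}{2m}$ rather than $\frac{1}{m}$.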
<br><br>   
repeat until convergence
 
   {
 
 $$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)$$ <br>
 $$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_i) - y_i)x_i$$
 
   }
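
The update rules above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the name `gradient_descent`, the fixed iteration count (used here in place of an explicit convergence test), and the toy data are all assumptions.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(n_iters):
        error = theta0 + theta1 * x - y   # h(x_i) - y_i for every observation
        grad0 = np.sum(error) / m         # partial derivative w.r.t. theta0
        grad1 = np.sum(error * x) / m     # partial derivative w.r.t. theta1
        # Simultaneous update: both gradients were computed from the old values.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data (made up for illustration) lying exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta0, theta1 = gradient_descent(x, y, alpha=0.05, n_iters=5000)
print(theta0, theta1)  # approaches the true intercept 1 and slope 2
```

If $\alpha$ is too large the updates overshoot and diverge; if it is too small, many more iterations are needed to get close to the minimum.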
   
<br><br>

<u>Performance Metrics</u>

$R^2$ and Adjusted $R^2$

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
<br>
Where RSS is the residual sum of squares, TSS is the total sum of squares, $\hat{y}_i$ is the value predicted by the best-fit line for observation $i$, and $\bar{y}$ is the mean of the y values. <br><br>
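
As a quick check of the formula, the sketch below computes $R^2$ directly from RSS and TSS. The helper name `r_squared` and the numbers are made up for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1.0 - rss / tss

y = np.array([1.0, 3.0, 5.0, 7.0])        # observed values (mean is 4)
y_hat = np.array([1.5, 2.5, 5.5, 6.5])    # hypothetical model predictions

print(r_squared(y, y_hat))                      # RSS = 1, TSS = 20, so 0.95
print(r_squared(y, np.full_like(y, y.mean())))  # predicting the mean gives 0.0
```

A model that always predicts $\bar{y}$ scores 0, and a perfect fit scores 1.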

![R-square graph](R_square_curve.jpg 'R-square graph')

<br><br>

<u>Adjusted R-square</u><br><br>
Adjusted R-square is used because adding features that are uncorrelated with the outcome can still raise R-square, even though they do not improve the model. <br><br>

$$\text{Adjusted } R^2 = 1 - \frac{(1-R^2)(N-1)}{N - p - 1}$$ <br><br>

Where N is the sample size and p is the number of features (predictors, or independent variables). <br><br>

$R^2$ is never smaller than adjusted $R^2$: for $p \ge 1$ the factor $\frac{N-1}{N-p-1}$ exceeds 1, and it grows as more features are added because $(N - p - 1)$ gets smaller. Adding a feature never lowers $R^2$: a feature correlated with the outcome raises it substantially, while an uncorrelated feature raises it only slightly. Adjusted $R^2$ rises only when the gain in $R^2$ outweighs the penalty for the extra feature. <br><br>
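
The effect of an irrelevant feature can be checked numerically. The sketch below fits ordinary least squares with `np.linalg.lstsq` on synthetic data, once with a single informative feature and once with an extra pure-noise feature. The helper names, the data-generating process, and the random seed are assumptions for illustration only.

```python
import numpy as np

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns the fitted values."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)                  # informative feature
noise_feature = rng.normal(size=n)       # unrelated to y by construction
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

r2_one = r_squared(y, fit_predict(x1.reshape(-1, 1), y))
r2_two = r_squared(y, fit_predict(np.column_stack([x1, noise_feature]), y))

print("R^2:         ", r2_one, r2_two)
print("Adjusted R^2:", adjusted_r_squared(r2_one, n, 1),
      adjusted_r_squared(r2_two, n, 2))
```

With least squares, the second $R^2$ cannot be lower than the first, since the larger model can always set the noise feature's weight to zero. Whether adjusted $R^2$ goes up or down depends on how much the noise feature happens to help on this particular sample.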

<u>Overfitting</u>

The model performs well on the training data, indicating low bias, but performs poorly on the test data, indicating high variance. <br>

<u>Underfitting</u>

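The model performs poorly on both the training data and the test data, indicating high bias. Its variance is typically low, because an overly simple model changes little from one training set to another.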
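
One way to see both failure modes is to fit polynomials of different degrees to the same noisy data and compare training and test error. The sketch below is an assumption-laden illustration: the quadratic data-generating function, the noise level, the sample sizes, and the chosen degrees are all made up, and polynomial fitting stands in for "model complexity" in general.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a quadratic curve (synthetic, for illustration)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.2, size=n)
    return x, y

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(coefs, x_train, y_train)
    test_mse[degree] = mse(coefs, x_test, y_test)
    print(degree, train_mse[degree], test_mse[degree])
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The straight line (degree 1) is too simple for the curved data, so its error stays high on both sets; the degree-9 fit has enough freedom to chase the noise in 15 points, which is the situation in which test error tends to pull away from training error.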

##### Ridge and Lasso Regression

Ridge and Lasso regression are simple techniques for reducing model complexity and preventing the overfitting that can result from plain linear regression. Each adds a penalty term to the cost function that penalizes large weights, since large weights tend to lead to overfitting.<br><br>

Regularization refers to techniques that are used to calibrate machine learning models in order to minimize the adjusted loss function and prevent overfitting or underfitting. <br><br>

<u>Ridge Regression (L2 Regularization)</u>

Ridge regression introduces a small amount of bias into the model, so that the new model does not fit the training data quite as closely, in exchange for lower variance.

![linear regression](Ridge1.png 'linear regression')
![Ridge regression](Ridge2.png 'ridge regression')

The red line fits the training data perfectly but has high variance on the test data (green dots). The blue line, Ridge regression, fits the test data better at the cost of a slightly worse fit to the training data; hence, it has lower variance. Thus, Ridge regression aims to improve generalization. <br><br>

Model for one feature and m observations: $$h_{\theta}(x) = \theta_0 + \theta_1x $$

Ridge cost function: <br><br>
$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(y_i - h_{\theta}(x_i))^2 + \frac{\lambda}{2m}\theta_1^2$$

In general, if there are $p$ features and $m$ observations, then <br><br>
Model: $$h_{\theta}(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \dots + \theta_px_p $$
$$h_{\theta}(x) = \theta_0 + \sum_{j=1}^{p}\theta_jx_j$$

Ridge cost function:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(y_i - \theta_0 - \sum_{j=1}^{p}\theta_j x_{ij}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{p}\theta_j^2$$

Where $\lambda \ge 0$ is a hyperparameter that controls the strength of the penalty.
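
The general Ridge cost function translates almost directly into NumPy. In this sketch, `ridge_cost` is a hypothetical helper name, `theta[0]` is assumed to hold the intercept $\theta_0$ (which is not penalized, matching the sum from $j = 1$), and the data are made up.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """Ridge cost: (1/2m) * RSS + (lam/2m) * sum of squared feature weights.

    theta[0] is the intercept; theta[1:] are the p feature weights.
    """
    m = len(y)
    residuals = y - (theta[0] + X @ theta[1:])
    penalty = lam * np.sum(theta[1:] ** 2)    # the intercept is excluded
    return (np.sum(residuals ** 2) + penalty) / (2 * m)

# Toy data (made up for illustration): one feature, y = 1 + 2x exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = np.array([1.0, 2.0])

print(ridge_cost(theta, X, y, lam=0.0))  # no penalty and a perfect fit: 0.0
print(ridge_cost(theta, X, y, lam=1.0))  # penalty only: 1 * 2^2 / (2 * 4) = 0.5
```

With $\lambda = 0$ the expression reduces to the ordinary squared error cost; a larger $\lambda$ makes large weights more expensive.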
<br><br>

To obtain the $\theta$ values, the cost function must be minimized.

$$
\underset{\theta \in \mathbb{R}^{p+1}}{\arg\min}\ J(\theta)
$$
$$
\underset{\theta \in \mathbb{R}^{p+1}}{\arg\min}\ \frac{1}{2m}\sum_{i=1}^{m}\left(y_i - \theta_0 - \sum_{j=1}^{p}\theta_j x_{ij}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{p}\theta_j^2
$$
<br> $$\text{or}$$

$$
\underset{\theta \in \mathbb{R}^{p+1}}{\arg\min}\ \lVert \mathbf{y} - \mathbf{X}\theta \rVert_2^2 + \lambda \lVert \theta \rVert_2^2
$$

Where $\lVert\theta\rVert_2$ is the L2 norm of $\theta$, defined as

$$\lVert\theta\rVert_2 = \sqrt{\theta_1^2 + \theta_2^2 + \theta_3^2 + \dots + \theta_p^2}$$
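
The notes do not derive how to solve this minimization. One standard route, used here purely as an illustration, is the closed-form solution $(\mathbf{X}_c^T\mathbf{X}_c + \lambda I)^{-1}\mathbf{X}_c^T\mathbf{y}_c$ on centered data, which leaves the intercept unpenalized. The name `ridge_fit`, the synthetic data, and the choice of $\lambda$ values are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Ridge fit with an unpenalized intercept.

    Centering X and y removes the intercept from the problem, so only the
    feature weights w are shrunk by the L2 penalty.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    theta0 = y_mean - x_mean @ w          # recover the intercept
    return theta0, w

# Synthetic data (made up for illustration): 30 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)

norms = []
for lam in (0.0, 1.0, 100.0):
    theta0, w = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(w))       # L2 norm of the feature weights
    print(lam, theta0, w, norms[-1])
```

The printed L2 norm of the weights shrinks as $\lambda$ grows, which is the penalty doing its job; $\lambda = 0$ reproduces ordinary least squares.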