# **Machine learning**

### *Introduction*
- Machine learning helps to automate tasks that cannot be programmed by hand, including tasks that are usually performed by humans. It is also used for data mining and for self-customizing programs (adapting to user preferences and data).
- ML gives machines the ability to learn without being explicitly programmed.
- Types of ML :
  - Supervised learning: We're given a dataset with labels, in other words, we already know what our output should look like. Supervised learning is categorized into :
    1. Regression problems: Predicting an output that takes continuous values (mapping the input to a continuous function).
    2. Classification problems: Predicting an output that takes discrete values (mapping the input to discrete categories, which can be seen as a step function).
  - Unsupervised learning: We're given a dataset without knowing what our results should look like, and we don't necessarily know the effect of each variable. A usual approach is to cluster the data based on relationships among the variables. With unsupervised learning, there's no feedback on the prediction results.
  - Other: Reinforcement learning, recommender systems.

# Linear regression :
- Given a training set, our main goal is to learn a mapping function $h: x \rightarrow y$ that can predict correctly (or approximately) the value of $y$ given $x$ as input ($x, y \in \mathbb{R}$).
- By default, we define the hypothesis function as follows : $h_{\theta}(x) = \theta_{0} + \theta_{1}x$ where $\theta_{i}$ are parameters.
- This $h_{\theta}(x)$ is the hypothesis for univariate linear regression (a single input variable).
- The cost function measures the loss between the output of $h_{\theta}(x)$ and the ground truth value $y$.
- The most common cost function is the "Mean squared error" (MSE), defined as follows :
$$ J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^m (\widehat{y}^{(i)} - y^{(i)})^2 $$  

- Where $m$ is the number of training examples and $\widehat{y}^{(i)} = h_{\theta}(x^{(i)})$ is the predicted output for the $i$-th example. The factor $\frac{1}{2}$ is added to the MSE to simplify the gradient computation: when the cost is differentiated, it cancels the 2 that comes from the square.
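
As a minimal sketch of the cost function above, the following plain-Python snippet evaluates the halved MSE for a univariate hypothesis. The function names (`hypothesis`, `cost`) and the toy data are assumptions made for illustration only.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Halved mean squared error J(theta0, theta1) over m training examples."""
    m = len(xs)
    squared_errors = (
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    )
    return sum(squared_errors) / (2 * m)


# Toy data generated by y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect_fit = cost(1.0, 2.0, xs, ys)  # every prediction matches, so J = 0
poor_fit = cost(0.0, 0.0, xs, ys)     # predicting 0 everywhere: J = 84 / 8 = 10.5
```

The closer the parameters get to the ones that generated the data, the smaller the cost becomes.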

#### 1. Gradient descent:
- We update the parameters by repeatedly subtracting the gradient of the cost with respect to each of them (scaled by a learning rate) until we get a function $h$ that predicts well. In other words, we look for a combination of parameters $\theta_{0}, \theta_{1}, ..., \theta_{n}$ that minimizes the cost function as much as possible. With complex cost functions, gradient descent can get stuck in a local minimum; one possible solution is to restart the learning phase with a different initialization of the parameters.  
``Repeat until convergence:``
$$ \theta_{j} := \theta_{j} - \eta \times \frac{\partial}{\partial \theta_{j}}J(\theta_{0}, \theta_{1},...,\theta_{n}) $$  
Where $\eta$ is the learning rate. This parameter affects the training speed: a large value means faster learning but can cause oscillations or even divergence, whereas a value that is too small makes gradient descent slow.
- *Note*: All the parameters should be updated simultaneously (every gradient is computed from the old values before any parameter is changed).
- If the cost starts growing after $k$ iterations, one possible solution is to reduce the learning rate $\eta$.
- If $\eta$ is sufficiently small, the cost should decrease on every iteration, but if $\eta$ is too small, gradient descent will converge very slowly.
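
The update rule and the remarks on the learning rate can be sketched as follows for the univariate case. This is an illustrative sketch, not a reference implementation: the function name `gradient_descent`, the zero initialization, the fixed iteration count (used in place of a convergence test) and the toy data are all assumptions.

```python
def gradient_descent(xs, ys, eta, iterations):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the halved MSE.

    Returns the learned parameters and the cost recorded before each update.
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    cost_history = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        cost_history.append(sum(e * e for e in errors) / (2 * m))
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients come from the old parameters.
        theta0, theta1 = theta0 - eta * grad0, theta1 - eta * grad1
    return theta0, theta1, cost_history


# Toy data generated by y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# A moderate learning rate lets the cost shrink towards 0.
theta0, theta1, costs = gradient_descent(xs, ys, eta=0.1, iterations=2000)

# A learning rate that is too large makes the cost grow instead.
_, _, diverging_costs = gradient_descent(xs, ys, eta=1.0, iterations=20)
```

On this toy data, the first run ends close to $\theta_{0} = 1$ and $\theta_{1} = 2$, while the second run shows the cost increasing from one iteration to the next, which is the signal to reduce $\eta$.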

#### 2. Multivariate linear regression (Linear algebra):
- Let $n$ be the number of input features. If $n > 1$, then we're working on a multivariate linear regression problem :
$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n}$
- By vectorizing our hypothesis function, we can reduce the computing time and apply gradient descent to all training examples at once. This method is called '**Batch gradient descent (BGD)**' (the update is made on a whole batch of examples). With the convention $x_{0}^{(i)} = 1$ for every example $i$, so that $\theta_{0}$ is handled like the other parameters, we have :  
$$ \theta = \begin{bmatrix}\theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n} \end{bmatrix},
x = \begin{bmatrix}x_{0}^{(1)} & x_{0}^{(2)} & \cdots & x_{0}^{(m)} \\ x_{1}^{(1)} & x_{1}^{(2)} & \cdots & x_{1}^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n}^{(1)} & x_{n}^{(2)} & \cdots & x_{n}^{(m)} \end{bmatrix},
h_{\theta}(x) = \theta^{T}x$$  

- In more detail, here's what the **BGD** update looks like : $\theta_{j} := \theta_{j} - \frac{\eta}{m} \sum_{i=1}^m \frac{\partial}{\partial \theta_{j}}J^{(i)}(\theta_{0}, \theta_{1},..., \theta_{n})$ where $J^{(i)} = \frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2$ is the cost on the $i$-th example and its partial derivative is obtained with the chain rule :  
$$\frac{\partial J^{(i)}}{\partial \theta_{j}} = \frac{\partial J^{(i)}}{\partial h}\frac{\partial h}{\partial \theta_{j}} = (h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}$$
$$ \theta_{j} := \theta_{j} - \frac{\eta}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)} $$
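
The batch update above can be sketched in plain Python as follows. The names `predict` and `batch_gradient_descent`, the zero initialization, the fixed iteration count and the toy data are assumptions for illustration. Each training example is stored as one row starting with $x_{0} = 1$, which is the transpose of the matrix $x$ shown earlier.

```python
def predict(theta, row):
    """h_theta(x) = theta^T x for one example whose first entry is x_0 = 1."""
    return sum(t * x for t, x in zip(theta, row))


def batch_gradient_descent(rows, ys, eta=0.1, iterations=5000):
    """Update every theta_j from the errors of the whole training batch."""
    m = len(rows)
    theta = [0.0] * len(rows[0])
    for _ in range(iterations):
        errors = [predict(theta, row) - y for row, y in zip(rows, ys)]
        # One gradient component per parameter, averaged over the batch.
        gradient = [
            sum(e * row[j] for e, row in zip(errors, rows)) / m
            for j in range(len(theta))
        ]
        # All parameters are replaced at once, using the old theta only.
        theta = [t - eta * g for t, g in zip(theta, gradient)]
    return theta


# Toy data from y = 1 + 2*x1 + 3*x2; each row is [x0, x1, x2] with x0 = 1.
rows = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
ys = [1.0, 3.0, 4.0, 6.0]

theta = batch_gradient_descent(rows, ys)
```

With these settings the learned `theta` ends close to `[1, 2, 3]`. In practice a numerical library would replace the inner loops with matrix products, which is where the speed-up from vectorization comes from.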

#### 3. Feature scaling (Mean normalization):
- When dealing with multivariate regression problems, it is important to rescale all the features to a similar range around 0, especially when they have different units. Variables that are measured at different scales do not contribute equally to the analysis and might end up creating a bias. A variable that ranges between 0 and 1000 will outweigh a variable that ranges between 0 and 1.
- One way to rescale input features is to apply the formula :
$$ x_{i} := \frac{x_{i} - \mu_{i}}{s_{i}} $$  
Where $\mu_{i}$ is the average value of the feature $x_{i}$ and $s_{i}$ is either its standard deviation or its range (``max - min``).
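
A minimal sketch of this rescaling for a single feature, using only the standard library. The function name `mean_normalize`, the `use_range` flag and the sample values are assumptions for illustration; the sketch also assumes the feature is not constant, since $s_{i}$ would then be 0.

```python
from statistics import mean, pstdev


def mean_normalize(values, use_range=False):
    """Rescale one feature to (x - mu) / s.

    s is the population standard deviation by default, or max - min
    when use_range is True. A constant feature would give s = 0.
    """
    mu = mean(values)
    s = (max(values) - min(values)) if use_range else pstdev(values)
    return [(v - mu) / s for v in values]


# Hypothetical feature measured on a large scale.
sizes = [500.0, 1000.0, 1500.0, 2000.0]

scaled_by_std = mean_normalize(sizes)                     # mean 0, std 1
scaled_by_range = mean_normalize(sizes, use_range=True)   # values in [-0.5, 0.5]
```

Both variants center the feature on 0; they only differ in the spread of the rescaled values.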

#### 4. Normal equations :
- Another way to minimize the cost function $J$ without an iterative algorithm like gradient descent is to use the '**Normal equation**'. In this method we minimize the cost by solving $\frac{\partial J}{\partial \theta} = 0$ for $\theta$; simplifying this equation gives the normal equation formula, where $X$ holds one training example per row (the transpose of the matrix $x$ above) and $y$ is the vector of target values:
$$\theta = (X^TX)^{-1}X^Ty$$ 
- A more detailed derivation of this formula is given [here]() ([NBviewer version]()).
- The normal equation has some advantages and disadvantages over the default method which is gradient descent. Here are some points (From Andrew Ng's course):
  - **Gradient descent**:
    - Pros:
      - Complexity $O(kn^2)$, where $k$ is the number of iterations.
      - Works well when the number of features is large.
    - Cons:
      - Feature scaling is necessary.
      - Needs many iterations.
      - Needs to define a learning rate $\eta$.
  - **Normal equation**:
    - Pros:
      - No iteration.
      - No need for a learning rate.
    - Cons:
      - Computing the inverse of $X^TX$ has a complexity of $O(n^3)$.
      - Slow when dealing with a lot of features.
- *Note:* Feature scaling is not necessary when working with the normal equation; keeping the features as they are won't affect the results.
- It is recommended to use the **Normal equation** when the number of features $n$ does not exceed roughly $10^4$; beyond that, gradient descent is better.
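
As a rough sketch of the normal equation in plain Python: instead of forming the inverse $(X^TX)^{-1}$ explicitly, the snippet solves the equivalent linear system $(X^TX)\theta = X^Ty$ by Gaussian elimination, which gives the same $\theta$ when $X^TX$ is invertible. The function names, the solver and the toy data are assumptions for illustration. The features are deliberately left unscaled, in line with the note above.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def normal_equation(rows, ys):
    """Return theta solving (X^T X) theta = X^T y; each row starts with x_0 = 1."""
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data from y = 1 + 2*x1 + 3*x2, with x1 on a much larger scale than x2.
rows = [
    [1.0, 0.0, 0.0],
    [1.0, 1000.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1000.0, 1.0],
]
ys = [1.0, 2001.0, 4.0, 2004.0]

theta = normal_equation(rows, ys)
```

Here `theta` comes out close to `[1, 2, 3]` in a single step, with no learning rate and no scaling. If $X^TX$ is singular (for example with duplicated features), this sketch fails with a division by zero.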

# Polynomial regression:
- Sometimes, the distribution of our training data forms a non-linear curve, and a linear hypothesis function cannot fit these data. To deal with non-linear problems, we can use polynomial regression.
- The idea behind polynomial regression is to apply non-linear functions to our input features and to add the results as new features. Here are some examples:
  - Quadratic function: $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{1}^2$
  - Cubic function: $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{1}^2 + \theta_{3}x_{1}^3$
  - Square root function: $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}\sqrt{x_{1}}$
- One important thing to keep in mind: if you build your features this way, feature scaling becomes very important, because powers of a feature span very different ranges (if $x_{1}$ ranges from 1 to 100, then $x_{1}^2$ ranges from 1 to $10^4$ and $x_{1}^3$ from 1 to $10^6$). Do not forget to normalize your inputs.
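
A small sketch of how polynomial features can be built and then rescaled, using only the standard library. The function names `polynomial_features` and `scale_columns` and the sample inputs are assumptions for illustration; the scaling step is the mean normalization described earlier, applied column by column.

```python
from statistics import mean, pstdev


def polynomial_features(xs, degree):
    """Map each x to the new feature row [x, x**2, ..., x**degree]."""
    return [[x ** p for p in range(1, degree + 1)] for x in xs]


def scale_columns(rows):
    """Mean-normalize every feature column: (value - mean) / std."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [[(v - mu) / s for v, (mu, s) in zip(row, stats)] for row in rows]


# Hypothetical inputs spanning two orders of magnitude.
xs = [1.0, 10.0, 100.0]

features = polynomial_features(xs, degree=3)
# The cubic column already reaches 1000000.0 for x = 100.0.

scaled = scale_columns(features)
# After scaling, every column has mean 0 and standard deviation 1.
```

The rows of `scaled`, with a leading $x_{0} = 1$ added, can then be used as inputs to ordinary linear regression, since the hypothesis stays linear in the parameters $\theta$.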