In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Conditional probability: $P(A|B)$

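In the context of models and data, probability describes the data for a fixed model: Probability = $P(data|model)$

and likelihood describes the model for fixed, observed data: Likelihood = $L(model|data)$

If we have two competing models, we want to choose the one under which the observed data are most probable, that is, the model with the highest likelihood.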

<img src = 'gaussian.png' width = 500>

If $x$ comes from a normal distribution, then $$P(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$ The **likelihood** is this same expression read as a function of the distribution parameters, $\mu$ and $\sigma$, with the observations $x_1,...,x_N$ held fixed. For independent observations, the likelihood is just the individual probabilities multiplied together: $$\mathcal{L}(\mu, \sigma|x_1,...,x_N) = \prod\limits_{i=1}^{N} P(x_i|\mu, \sigma)$$
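As a small illustration of the product formula above, the sketch below evaluates the likelihood of a handful of made-up observations under two candidate parameter settings. The helper names `gaussian_pdf` and `likelihood`, the data values, and both parameter settings are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Normal density P(x | mu, sigma)."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def likelihood(x, mu, sigma):
    """Product of the densities of independent observations."""
    return np.prod(gaussian_pdf(x, mu, sigma))

# made-up observations (illustrative only)
obs = np.array([4.1, 5.3, 4.8, 5.9, 5.0])

lik_a = likelihood(obs, mu=5.0, sigma=1.0)  # candidate model A
lik_b = likelihood(obs, mu=0.0, sigma=1.0)  # candidate model B

print('L(A) =', lik_a)
print('L(B) =', lik_b)
print('preferred model:', 'A' if lik_a > lik_b else 'B')
```

Because the observations cluster around 5, model A gives the larger likelihood and would be the one we choose.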

We can take the **log-likelihood**, which allows us to sum log-probabilities instead of multiplying probabilities.  Maximizing the log-likelihood is the same as maximizing the likelihood because the logarithm is monotonically increasing.  We can do it like this: $$\ln\mathcal{L}(\mu, \sigma|x_1,...,x_N) = \ln\left(\prod\limits_{i=1}^{N} P(x_i|\mu, \sigma)\right)$$ $$\ln\mathcal{L} = \sum\limits_{i=1}^{N} \ln P(x_i|\mu, \sigma)$$

or $$\ln\mathcal{L} = \sum\limits_{i=1}^{N}\left(\ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
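One practical reason to work with the sum is numerical: a product of many densities smaller than 1 underflows to zero in floating point, while the sum of their logarithms stays finite. The sketch below shows this on synthetic data. The helper name `gaussian_logpdf`, the seed, and the sample size of 2000 are assumptions made for the example.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """ln P(x | mu, sigma) for a normal distribution."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
sample = rng.normal(0.0, 1.0, 2000)     # synthetic observations

# log-likelihood: sum of the log-densities
log_lik = np.sum(gaussian_logpdf(sample, 0.0, 1.0))

# likelihood as a raw product of densities
lik = np.prod(np.exp(gaussian_logpdf(sample, 0.0, 1.0)))

# on a few points the two forms agree: ln(prod) == sum(ln)
small = sample[:5]
small_log_lik = np.sum(gaussian_logpdf(small, 0.0, 1.0))
small_lik = np.prod(np.exp(gaussian_logpdf(small, 0.0, 1.0)))

print('log-likelihood (2000 points):', log_lik)
print('raw product    (2000 points):', lik)
print('5 points, ln(product) vs sum:', np.log(small_lik), small_log_lik)
```

Each standard normal density is at most about 0.399, so the product of 2000 of them is far below the smallest number a 64-bit float can represent.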

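Notice on the right side of this equation the squared-difference term $(x_i-\mu)^2$.  Summed over the observations, this is the squared-error cost function $J$ for linear regression, with the model's prediction playing the role of $\mu$.  The first term does not depend on $\mu$, so for a fixed $\sigma$ minimizing $J$ maximizes the likelihood.  Up to a positive scale factor and an additive constant, the log-likelihood is the negative of the cost: $$\ln\mathcal{L} = -\frac{1}{2\sigma^2}J + \text{const}$$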
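To check this relationship numerically, the sketch below scans a grid of candidate slopes for a line through the origin and compares the slope that minimizes the squared error with the slope that maximizes the Gaussian log-likelihood. The synthetic data, the fixed noise level `sigma`, the grid, and the names `sse` and `log_likelihood` are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = np.linspace(0, 10, 50)
sigma = 2.0                                   # noise level, assumed known
y_demo = 3 * x_demo + rng.normal(0.0, sigma, x_demo.size)

slopes = np.linspace(2.0, 4.0, 201)           # candidate slopes to compare

def sse(w):
    """Sum of squared errors J for the model y = w * x."""
    return np.sum((y_demo - w * x_demo)**2)

def log_likelihood(w):
    """Gaussian log-likelihood of the data for the model y = w * x."""
    resid = y_demo - w * x_demo
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

J_vals = np.array([sse(w) for w in slopes])
ll_vals = np.array([log_likelihood(w) for w in slopes])

best_by_cost = slopes[np.argmin(J_vals)]
best_by_ll = slopes[np.argmax(ll_vals)]

print('slope minimizing J:       ', best_by_cost)
print('slope maximizing ln(L):   ', best_by_ll)
```

Both searches pick the same grid point, which is what we expect when the log-likelihood is a decreasing linear function of $J$.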

One of the most important algorithms in machine learning is **gradient descent**, which we use to minimize the cost function.  We use it in place of something like the normal equation because it does not require inverting a matrix, so it scales much better when we have many columns, or features.  Each weight $w_i$ is repeatedly moved a small step, set by the learning rate $\alpha$, against the gradient of the cost:

$$w_i = w_i - \alpha\frac{\partial{J}}{\partial{w_i}}$$
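The sketch below applies this update rule to a tiny noiseless problem and compares the result with the closed-form normal-equation solution mentioned above. The data, the learning rate of 0.5, the iteration count, and the choice of cost $J = \frac{1}{2m}\sum (X\theta - y)^2$ are assumptions for this example, not settings taken from the notebook.

```python
import numpy as np

# noiseless line y = 3x + 3 on a small grid (illustrative data)
x_demo = np.linspace(0, 1, 20)
y_demo = 3 * x_demo + 3

# design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x_demo), x_demo])
m = X.shape[0]

# normal equation: solve (X^T X) theta = X^T y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y_demo)

# gradient descent: theta <- theta - alpha * dJ/dtheta
alpha = 0.5                      # assumed learning rate
theta_gd = np.zeros(2)
for _ in range(2000):
    grad = X.T @ (X @ theta_gd - y_demo) / m
    theta_gd = theta_gd - alpha * grad

print('normal equation  [intercept, slope]:', theta_ne)
print('gradient descent [intercept, slope]:', theta_gd)
```

With only two parameters the normal equation is the cheaper route; the trade-off described above only starts to favor gradient descent as the number of features grows.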

In [3]:
x = np.linspace(0, 100, 100)
y = 3*x + 3 + np.random.normal(5, 10, 100)

def gradient_descent(alpha, x, y, ep=0.0001, max_iter=10000):
    """Fit y ~ intercept + slope*x by batch gradient descent.

    Stops when the sum of squared errors changes by at most ep between
    iterations, or after max_iter iterations.
    """
    converged = False
    n_iter = 0  # iteration counter
    m = x.shape[0]  # number of samples

    # initial theta
    intercept = 0
    slope = 0

    # total error, J(theta): sum of squared residuals
    J = np.sum((intercept + slope*x - y)**2)

    # iterate until the error stops changing or max_iter is reached
    while not converged:
        # gradient of J/(2m) with respect to each parameter, averaged over all samples
        residuals = intercept + slope*x - y
        grad0 = 1.0/m * np.sum(residuals)
        grad1 = 1.0/m * np.sum(residuals*x)

        # update both parameters simultaneously (gradients were computed first)
        intercept = intercept - alpha * grad0
        slope = slope - alpha * grad1

        # sum of squared errors after the update
        e = np.sum((intercept + slope*x - y)**2)

        if abs(J-e) <= ep:
            print('Converged, iterations: ', n_iter, '!!!')
            converged = True

        J = e   # update error
        n_iter += 1  # update iteration count

        if n_iter == max_iter:
            print('Max iterations exceeded!')
            converged = True

    return intercept, slope