In [None]:
# using Flux
using Random, Distributions, LinearAlgebra, DataFrames
using Plots

# Linear Regression
Regression problems pop up whenever we want to predict a numerical value. Common examples include predicting prices, such as those of homes or stocks.

As a running example, suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years). To develop a model for predicting house prices, we assume that the price can be expressed as a weighted sum of the features plus a bias:

$$
\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b
$$

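Here $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are the weights, which determine the influence of each feature on the prediction, and $b$ is the bias, the value the prediction takes when all features are zero.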

In [None]:
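# Element-wise addition of two vectors with an explicit loop
function tst(n)
    a = ones(n)
    b = ones(n)
    c = zeros(n)   # result vector

    for i in 1:n
        c[i] = a[i] + b[i]
    end
    return c       # return only after the loop has processed every element
end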

In [None]:
@time tst(1000)

Collecting all features into a vector $\mathbf{x}$ and all weights into a vector $\mathbf{w}$, we can express our model compactly via the dot product between $\mathbf{w}$ and $\mathbf{x}$:
$$
\hat{y} = \mathbf{w}^\top \mathbf{x} + b
$$
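
As a minimal sketch of this dot-product form, the snippet below computes a single prediction. The weights, bias, and feature values are made-up numbers chosen only for illustration; they are not fitted parameters.

```julia
using LinearAlgebra

# Hypothetical parameters, chosen only for illustration
w = [150.0, -500.0]   # weights for area (dollars per sq ft) and age (dollars per year)
b = 20_000.0          # bias in dollars

# One house: 1,200 square feet, 10 years old
x = [1200.0, 10.0]

# Prediction as a dot product plus bias
y_hat = dot(w, x) + b   # 150 * 1200 - 500 * 10 + 20000 = 195000.0
```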

### Loss Function
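Before we can go about searching for the best model parameters $\mathbf{w}$ and $b$, we will need two more things:
- a quality measure for some given model
- a procedure for updating the model to improve its quality.

For the quality measure we use the squared error. When the prediction for example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the loss is: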
$$
l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.
$$
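
The squared error can be sketched in a couple of lines of Julia. The function names `squared_loss` and `mean_squared_loss` are illustrative choices, and averaging over the dataset is shown here as one common way to turn per-example losses into a single quality measure.

```julia
# Squared error for a single example: half the squared difference
# between the prediction y_hat and the observed target y
squared_loss(y_hat, y) = 0.5 * (y_hat - y)^2

# Average loss over a dataset, broadcasting the per-example loss over vectors
mean_squared_loss(y_hat, y) = sum(squared_loss.(y_hat, y)) / length(y)

single = squared_loss(3.0, 1.0)                                  # 0.5 * 2.0^2 = 2.0
average = mean_squared_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])    # (0 + 0 + 2) / 3
```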

### Gradient Descent: Reducing the Loss Function to Find a Local Minimum
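The most straightforward use of gradient descent consists of taking the derivative of the loss function, averaged over every example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Even worse, if there is a lot of redundancy in the training data, the benefit of a full update is even lower. Minibatch stochastic gradient descent instead estimates the gradient from a small random sample of examples at each step: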
$$
(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).
$$

In each iteration $t$, we sample a minibatch $\mathcal{B}_t$ consisting of a fixed number $|\mathcal{B}|$ of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with respect to the model parameters. Finally, we multiply the gradient by a predetermined small positive value $\eta$, called the learning rate, and subtract the resulting term from the current parameter values.
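
The update rule above can be sketched as follows for linear regression with squared loss. This is an illustrative implementation, not a reference one: the helper names (`predict`, `sgd_step`, `train`), the synthetic data, and the hyperparameters (learning rate 0.03, batch size 10, 3 epochs) are all assumptions made for the example.

```julia
using Random

# Forward pass: predictions for all rows of X (an n × d matrix)
predict(X, w, b) = X * w .+ b

# One minibatch SGD update for linear regression with squared loss.
# idx holds the row indices of the current minibatch; eta is the learning rate.
function sgd_step(X, y, w, b, idx, eta)
    Xb, yb = X[idx, :], y[idx]
    err = predict(Xb, w, b) .- yb          # prediction error for each minibatch example
    grad_w = Xb' * err ./ length(idx)      # average gradient with respect to w
    grad_b = sum(err) / length(idx)        # average gradient with respect to b
    return w .- eta .* grad_w, b - eta * grad_b
end

# Synthetic data generated from known parameters (illustrative values only)
Random.seed!(0)
true_w, true_b = [2.0, -3.4], 4.2
X = randn(1000, 2)
y = X * true_w .+ true_b .+ 0.01 .* randn(1000)

# Repeatedly shuffle the data and sweep over it in minibatches
function train(X, y; eta=0.03, batch_size=10, epochs=3)
    w, b = zeros(size(X, 2)), 0.0
    for epoch in 1:epochs
        perm = randperm(size(X, 1))
        for start in 1:batch_size:length(perm)
            idx = perm[start:min(start + batch_size - 1, end)]
            w, b = sgd_step(X, y, w, b, idx, eta)
        end
    end
    return w, b
end

w, b = train(X, y)
```

After training, `w` and `b` should lie close to `true_w` and `true_b`. How close depends on the learning rate, the batch size, and the number of epochs.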

In [None]:
# Density of the normal distribution with mean mu and standard deviation sigma, evaluated at x
function normal(x, mu, sigma)
    p = 1 / sqrt(2 * π * sigma^2)
    return p * exp(-0.5 * (x - mu)^2 / sigma^2)
end

### Normal Distribution
Linear regression was invented at the turn of the 19th century. While it has long been debated whether Gauss or Legendre first thought up the idea, it was Gauss who also discovered the normal distribution (also called the Gaussian). It turns out that the normal distribution and linear regression with squared loss share a deeper connection than common parentage. To begin, recall that a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$) has the density:
$$
p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).
$$

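One way to motivate the squared loss is to assume that observations arise from noisy measurements, $y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$, where the noise $\epsilon$ is normally distributed with mean $0$ and variance $\sigma^2$. Thus, we can now write out the likelihood of seeing a particular $y$ for a given $\mathbf{x}$: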

$$
P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).
$$
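
To make the connection to the squared loss explicit, we can take the negative logarithm of this likelihood for a single example, still assuming Gaussian noise on the targets:

$$
-\log P(y \mid \mathbf{x}) = \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2.
$$

If $\sigma$ is treated as a fixed constant, the first term does not depend on $\mathbf{w}$ or $b$, and the second term is the squared loss up to the factor $\frac{1}{\sigma^2}$. Under this assumption, maximizing the likelihood is therefore equivalent to minimizing the squared error.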

# Predictions
Given a model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can now make predictions for a new example, e.g., predicting the sales price of a previously unseen house given its area and age.


In [None]:
A = collect(-10.0:0.1:10.0)
# μ is the mean
# σ is the standard deviation
# here we choose μ = 0 and σ = 3 and visualize the normal distribution

y = normal.(A, 0.0, 3.0)

plot(A, y)