 ##### In the terminology of machine learning, the dataset is called a training dataset or training set, and each row (containing the data corresponding to one sale) is called an example (or data point, instance, sample). The thing we are trying to predict (price) is called a label (or target). The variables (age and area) upon which the predictions are based are called features (or covariates).

In [1]:
import math
import time
import numpy as np
import torch


##### Linear regression rests on several assumptions. The first is that the relationship between the features $\bm{x}$ and the target $y$ is approximately linear, i.e., that the conditional mean $E[Y \mid X=x]$ can be expressed as a weighted sum of the features $\bm{x}$. This setup allows the target value to deviate from its expected value on account of observation noise. In general we also assume that the noise is well behaved, i.e., that it follows a Gaussian distribution.
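
To make these assumptions concrete, the following is a minimal sketch that generates synthetic data whose targets are a weighted sum of the features plus Gaussian noise. The names `true_w`, `true_b`, and `noise_std`, and all numeric values, are assumptions chosen for illustration only; they do not come from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 1000                        # number of examples
true_w = np.array([2.0, -3.4])  # hypothetical "true" weights
true_b = 4.2                    # hypothetical "true" bias
noise_std = 0.01                # assumed standard deviation of the noise

X = rng.normal(size=(n, 2))                  # features, one example per row
noise = rng.normal(scale=noise_std, size=n)  # Gaussian observation noise
y = X @ true_w + true_b + noise              # targets: linear signal plus noise

print(X.shape, y.shape)  # (1000, 2) (1000,)
```

Each target deviates from its conditional mean `X @ true_w + true_b` only by the small noise term.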

The assumption of linearity means that the expected value of the target (price) can be expressed as a weighted sum of the features (area and age):

<a id="eq1"></a>
$$
\text{price} = w_{area} \cdot \text{area} + w_{age} \cdot \text{age} + b
$$

where $w_{area}$ and $w_{age}$ are the weights of the corresponding features and $b$ is the bias. The weights express the influence of each feature on our prediction.
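
As a small worked example of the equation above, the sketch below computes a price from an area and an age. The function name `predict_price` and the parameter values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def predict_price(area, age, w_area, w_age, b):
    """Affine prediction: weighted sum of the features plus a bias."""
    return w_area * area + w_age * age + b


# Hypothetical parameter values, for illustration only.
w_area, w_age, b = 3.0, -2.0, 50.0

price = predict_price(area=100.0, age=10.0, w_area=w_area, w_age=w_age, b=b)
print(price)  # 3.0*100 - 2.0*10 + 50 = 330.0
```

With both features set to zero the prediction equals the bias, which is the translation part of the model.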

#### <font color='red'>The above [equation 1](#eq1) is an affine transformation of the input features, which is characterized by a linear transformation of features via a weighted sum, combined with a translation via the added bias.</font>  

#### <font color='lightgreen'>So the goal is to choose the weights "w" and the bias "b" such that, on average, our model's predictions fit the true prices observed in the data as closely as possible. </font>

In machine learning we usually work with high-dimensional datasets, so it is more convenient to use compact linear algebra notation. Assuming each data point has $d$ features, we can express our prediction $\hat{y}$ as

<a id="eq2"></a>
$$
\hat{y} = w_1x_1 + w_2x_2 + \dots +w_dx_d + b  
$$

The features can be collected into a single vector $\bm{x} \in \mathbb{R}^d$ and all weights into a vector $\bm{w} \in \mathbb{R}^d$, so we can express our model compactly via the dot product between $\bm{w}$ and $\bm{x}$:

<a id="eq3"></a>
$$
\hat{y} = \bm{w}^T\bm{x} + b
$$

When we consider a dataset with $n$ examples and $d$ features, we can collect the features in the *design matrix* $\mathbf{X} \in \mathbb{R}^{n \times d}$, which holds one example per row. The predictions $\mathbf{\hat{y}} \in \mathbb{R}^n$ can then be expressed as a matrix-vector product:

$$
\mathbf{\hat{y}} = \mathbf{Xw} + b
$$

Here the scalar $b$ is added to each of the $n$ entries of $\mathbf{Xw}$.
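
The matrix-vector form above can be checked against the per-example dot product. The following is a minimal NumPy sketch; the matrix `X`, the weights `w`, and the bias `b` hold made-up values used only for illustration.

```python
import numpy as np

# Toy design matrix: n = 3 examples, d = 2 features (made-up values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # hypothetical weights
b = 2.0                    # hypothetical bias

# One example at a time: y_hat_i = w^T x_i + b
y_hat_loop = np.array([np.dot(w, x_i) + b for x_i in X])

# All examples at once: y_hat = Xw + b (the scalar b is broadcast to all n entries)
y_hat_vec = X @ w + b

print(np.allclose(y_hat_loop, y_hat_vec))  # True
```

Both routes give the predictions 0.5, -0.5, and -1.5; the vectorized form simply avoids the explicit Python loop.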

Given features of a training dataset $\mathbf{X}$ and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example’s label will (in expectation) be predicted with the smallest error.

#### <font color='red'> Before we can go about searching for the best model parameters (the weights "w" and the bias "b"), we will need two more things: (i) a measure of the quality of some given model; and (ii) a procedure for updating the model to improve its quality. </font>



# Loss Function

Loss functions quantify the distance between the real and the predicted values of the target. The loss is usually a non-negative number, where smaller values are better. For regression problems the most common loss function is the squared error loss. For an example $i$, the squared error between the predicted and the true label is given by

$$
l^i(\bm{w},b) = \frac{1}{2}(\hat{y}^i-y^i)^2
$$

where

$$
\hat{y}^i = \bm{w}^T\bm{x}^i + b
$$

Note that large differences between estimates $\hat{y}^i$ and targets $y^i$ lead to even larger contributions to the loss, due to its quadratic form. This quadratic growth can be a double-edged sword: while it encourages the model to avoid large errors, it can also lead to excessive sensitivity to anomalous data.
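
The effect of the quadratic form can be seen with a few numbers. The sketch below evaluates the per-example loss for three made-up errors; the helper name `squared_error` is an assumption introduced for this illustration.

```python
def squared_error(y_hat, y):
    """Squared error for a single example, including the 1/2 factor."""
    return 0.5 * (y_hat - y) ** 2


# Made-up prediction errors of size 1, 2, and 10 against a true label of 0.
errors = [1.0, 2.0, 10.0]
losses = [squared_error(e, 0.0) for e in errors]
print(losses)  # [0.5, 2.0, 50.0]

# Share of the total loss contributed by the single largest error.
outlier_share = losses[-1] / sum(losses)
print(outlier_share > 0.95)  # True
```

Doubling the error from 1 to 2 quadruples the loss, and the one example with error 10 accounts for more than 95% of the total, which is the sensitivity to anomalous data described above.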

To measure the quality of a model on the entire dataset of $n$ examples, we simply average the losses over the training set:

$$
L(\bm{w},b) = \frac{1}{n}\sum_{i=1}^{n}l^i(\bm{w},b) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}(\bm{w}^T\bm{x}^i + b - y^i)^2
$$
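
The averaged loss can be computed in a few lines of NumPy. This is a minimal sketch, not a definitive implementation: the function name `mean_squared_loss` and the toy values for `X`, `y`, `w`, and `b` are assumptions made for illustration.

```python
import numpy as np


def mean_squared_loss(X, y, w, b):
    """Average over all examples of 0.5 * (w^T x_i + b - y_i)^2."""
    residuals = X @ w + b - y
    return 0.5 * np.mean(residuals ** 2)


X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # made-up features, one example per row
y = np.array([1.0, 0.0, -1.0])  # made-up labels
w = np.array([0.5, -1.0])       # hypothetical weights
b = 2.0                         # hypothetical bias

loss = mean_squared_loss(X, y, w, b)
print(loss)  # 0.125
```

Here every prediction misses its label by 0.5, so each per-example loss is 0.125 and the average is 0.125 as well.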

