## Linear Regression (Multiple Features and m Training Examples)

### Hypothesis
We will use $\mathbf{x}^{(i)}$ to denote the feature vector and $y^{(i)}$ to denote the output variable of the $i^{th}$ training example.  

$h(\mathbf{x}^{(i)})=w_1x^{(i)}_1+w_2x^{(i)}_2+\cdots+w_{nx}x^{(i)}_{nx}+b$  

Where,   

$m$ : Number of training examples  
$nx$ : Number of features  

Let us write $\hat{y}$ for the prediction from the hypothesis.

$\hat{y}^{(i)} = w_1x^{(i)}_1+w_2x^{(i)}_2+\cdots+w_{nx}x^{(i)}_{nx}+b $   

Feature vector for $i^{th}$ training example:
$\mathbf{x}^{(i)} =\begin{pmatrix}
  {x}_1^{(i)} \\ {x}_2^{(i)} \\ \vdots \\ {x}_{nx}^{(i)}
 \end{pmatrix} $   

Feature matrix of the dataset (one column per training example):   

$ \mathbf{X} = \begin{pmatrix}
\mathbf{x}^{(1)} & \mathbf{x}^{(2)} & \cdots & \mathbf{x}^{(m)}
\end{pmatrix}$   

$ \mathbf{X} = \begin{pmatrix}
{x}_1^{(1)} & {x}_1^{(2)} & \cdots & {x}_1^{(m)} \\ 
{x}_2^{(1)} & {x}_2^{(2)} & \cdots & {x}_2^{(m)} \\ 
\vdots & \vdots & \ddots & \vdots \\ 
{x}_{nx}^{(1)} & {x}_{nx}^{(2)} & \cdots & {x}_{nx}^{(m)}
\end{pmatrix}$


Parameter vector :
$\mathbf{w} =\begin{pmatrix}
  {w}_1 \\ {w}_2 \\ \vdots \\ {w}_{nx}
 \end{pmatrix}, b $   

### Example Data   

| Size      | Bedrooms | Price (L) |
| :---:     | :----:   |   :---:   |
| 100       | 1        | 20        |
| 150       | 2        | 28        | 
| 200       | 3        | 39        | 
| 250       | 4        | 51        | 
| 500       | 4        | 80        |  

$m=5; nx=2$

Features for $3^{rd}$ training example:   
$$x^{(3)}_1=200; x^{(3)}_2=3$$  

Feature vector for $2^{nd}$ training example:
$\mathbf{x}^{(2)} =\begin{pmatrix}
  150 \\ 2
 \end{pmatrix} $   

Feature matrix of the dataset (one column per training example):   

$ \mathbf{X} = \begin{pmatrix}
\mathbf{x}^{(1)} & \mathbf{x}^{(2)} & \cdots & \mathbf{x}^{(5)}
\end{pmatrix}$   

$ \mathbf{X} = \begin{pmatrix}
100 & 150 & 200 & 250 & 500 \\ 1 & 2 & 3 & 4 & 4
\end{pmatrix}$
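
The column-per-example layout above can be reproduced in NumPy. This is a minimal sketch; the names `X_cols`, `x_3_1` and `x_2` are illustrative and not part of the original notebook. Note that this array is the transpose of the `(m, nx)` array `X` used in the code cells further down.

```python
import numpy as np

# Each column is one training example: row 0 = size, row 1 = bedrooms.
X_cols = np.array([[100, 150, 200, 250, 500],
                   [  1,   2,   3,   4,   4]])

nx, m = X_cols.shape      # nx = number of features, m = number of examples

x_3_1 = X_cols[0, 2]      # x^(3)_1: feature 1 of the 3rd example (0-based index 2)
x_2 = X_cols[:, 1]        # x^(2): feature vector of the 2nd example

print(nx, m, x_3_1, x_2)
```

The superscript $(i)$ selects a column and the subscript $j$ selects a row, both shifted by one because NumPy indices start at 0.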

Parameter vector :   
$ \mathbf{w} = \begin{pmatrix}
w_{1} \\ w_{2}
\end{pmatrix}, b$  

Output vector (Given/Labelled) : 
$\mathbf {y} = \begin{pmatrix}
y^{(1)} & y^{(2)} & \cdots & y^{(m)}
\end{pmatrix} = \begin{pmatrix}
20 & 28 & 39 & 51 & 80
\end{pmatrix}$  

Output vector (Prediction): 
$\mathbf {\hat{y}} = \begin{pmatrix}
\hat{y}^{(1)} & \hat{y}^{(2)} & \cdots & \hat{y}^{(m)}
\end{pmatrix} $ 

**Predicted Values**   

$\hat{y}^{(1)} = w_1x^{(1)}_1+w_2x^{(1)}_2+b = \mathbf {w}^T \mathbf {x}^{(1)} + b $  

$\hat{y}^{(2)} = w_1x^{(2)}_1+w_2x^{(2)}_2+b = \mathbf {w}^T \mathbf {x}^{(2)} + b $  

$\hat{y}^{(3)} = w_1x^{(3)}_1+w_2x^{(3)}_2+b = \mathbf {w}^T \mathbf {x}^{(3)} + b $  

$\hat{y}^{(4)} = w_1x^{(4)}_1+w_2x^{(4)}_2+b = \mathbf {w}^T \mathbf {x}^{(4)} + b $  

$\hat{y}^{(5)} = w_1x^{(5)}_1+w_2x^{(5)}_2+b = \mathbf {w}^T \mathbf {x}^{(5)} + b $   


$\mathbf{\hat{y}} = \mathbf {w}^T \mathbf {X} + b $  
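
To see that the vectorized form gives the same numbers as the five separate equations, here is a small sketch. The values of `w` and `b` are hypothetical, chosen only for illustration, and `X_cols` is an assumed name for the $(nx, m)$ matrix $\mathbf{X}$.

```python
import numpy as np

X_cols = np.array([[100, 150, 200, 250, 500],
                   [1, 2, 3, 4, 4]], dtype=float)   # shape (nx, m)
w = np.array([0.1, 5.0])   # hypothetical weights, for illustration only
b = 2.0                    # hypothetical bias, for illustration only

# One prediction at a time: y_hat^(i) = w^T x^(i) + b
y_hat_loop = np.array([w @ X_cols[:, i] + b for i in range(X_cols.shape[1])])

# All predictions at once: y_hat = w^T X + b (b is broadcast over the m entries)
y_hat_vec = w @ X_cols + b

print(y_hat_loop)
print(y_hat_vec)
```

Both lines print the same five predictions; the scalar $b$ is added to every entry of $\mathbf{w}^T\mathbf{X}$ by broadcasting.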


In [1]:
import numpy as np

# Each row is one training example: [size, bedrooms]
X = np.array([[100, 1], [150, 2], [200, 3], [250, 4], [500, 4]])
# Price for each training example
y = np.array([20, 28, 39, 51, 80])
print(X, y)

[[100   1]
 [150   2]
 [200   3]
 [250   4]
 [500   4]] [20 28 39 51 80]


### Solving using sklearn library

In [2]:
from sklearn import linear_model
print(X.shape, type(X), y.shape, type(y))

(5, 2) <class 'numpy.ndarray'> (5,) <class 'numpy.ndarray'>


In [None]:
# Create a LinearRegression object and fit it to the data
lr = linear_model.LinearRegression()
lr.fit(X, y)


In [None]:
print(lr.coef_)        # learned weights w_1, w_2
print(lr.intercept_)   # learned bias b

[0.1196 4.42  ]
2.5199999999999747
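
As an independent cross-check of the values reported by sklearn, the same least-squares problem can be solved with plain NumPy. This sketch swaps in `np.linalg.lstsq` and says nothing about how sklearn computes its result internally; the name `A` for the matrix with an appended column of ones is an assumption made here.

```python
import numpy as np

X = np.array([[100, 1], [150, 2], [200, 3], [250, 4], [500, 4]], dtype=float)
y = np.array([20, 28, 39, 51, 80], dtype=float)

# Append a column of ones so the bias b is estimated together with w.
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of A @ [w_1, w_2, b] ~= y
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = params[:2], params[2]

print(w, b)
```

Up to floating-point rounding, this should agree with `lr.coef_` and `lr.intercept_`.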


**Data substitution**   

$\hat{y}^{(1)} = w_1x^{(1)}_1+w_2x^{(1)}_2+b = w_1(100)+w_2(1) + b $    

$\hat{y}^{(2)} = w_1x^{(2)}_1+w_2x^{(2)}_2+b = w_1(150)+w_2(2) + b $    

$\hat{y}^{(3)} = w_1x^{(3)}_1+w_2x^{(3)}_2+b = w_1(200)+w_2(3) + b $    

$\hat{y}^{(4)} = w_1x^{(4)}_1+w_2x^{(4)}_2+b = w_1(250)+w_2(4) + b $    

$\hat{y}^{(5)} = w_1x^{(5)}_1+w_2x^{(5)}_2+b = w_1(500)+w_2(4) + b $    
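
Plugging the coefficients reported by sklearn (rounded here to $w_1=0.1196$, $w_2=4.42$, $b=2.52$) into these five equations gives concrete predictions. A minimal sketch, with the rounding being an assumption of this example:

```python
import numpy as np

X = np.array([[100, 1], [150, 2], [200, 3], [250, 4], [500, 4]], dtype=float)
y = np.array([20, 28, 39, 51, 80], dtype=float)

# Rounded values taken from the sklearn output
w = np.array([0.1196, 4.42])
b = 2.52

y_hat = X @ w + b          # one prediction per training example
residuals = y_hat - y      # prediction error per training example

for i, (pred, target) in enumerate(zip(y_hat, y), start=1):
    print(f"y_hat^({i}) = {pred:.1f}, y^({i}) = {target:.0f}")
```

The predictions come out as 18.9, 29.3, 39.7, 50.1 and 80.0, close to the labelled prices, and the errors sum to zero, as expected for a least-squares fit with a bias term.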



### Cost Function  
$J(\mathbf{w},b)=\frac{1}{2m}\sum \limits _{i=1} ^{m} (\hat{y}^{(i)}-y^{(i)})^{2} $  
$J(\mathbf{w},b)=\frac{1}{2m}\sum \limits _{i=1} ^{m} ((\mathbf{w}^T \mathbf{x}^{(i)}+b)-y^{(i)})^{2} $  

Start with some assumed values of $\mathbf{w}$ and $b$ and evaluate $J(\mathbf{w},b)$.  

$J(w_1, w_2,b)=\frac{1}{2m}[(100w_1+w_2+b-20)^2+(150w_1+2w_2+b-28)^2+(200w_1+3w_2+b-39)^2+(250w_1+4w_2+b-51)^2+(500w_1+4w_2+b-80)^2]$  

**Our aim is to minimize the cost function,** $J(\mathbf{w},b)$   
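
The cost can be evaluated for any choice of parameters. Below is a minimal sketch; the helper name `compute_cost` is an assumption, and `X` uses the `(m, nx)` layout of the code cells rather than the column layout of $\mathbf{X}$.

```python
import numpy as np

def compute_cost(X, y, w, b):
    """J(w, b) = 1/(2m) * sum_i (y_hat^(i) - y^(i))^2, with X of shape (m, nx)."""
    m = X.shape[0]
    errors = X @ w + b - y
    return float(errors @ errors) / (2 * m)

X = np.array([[100, 1], [150, 2], [200, 3], [250, 4], [500, 4]], dtype=float)
y = np.array([20, 28, 39, 51, 80], dtype=float)

# Cost at an all-zero starting point
cost_at_zero = compute_cost(X, y, np.zeros(2), 0.0)
# Cost at the (rounded) parameters reported by sklearn
cost_at_fit = compute_cost(X, y, np.array([0.1196, 4.42]), 2.52)

print(cost_at_zero, cost_at_fit)
```

Starting from $\mathbf{w}=\mathbf{0}, b=0$ the cost is about 1170.6, while at the fitted parameters it drops to about 0.42.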

### Gradient Descent
$ \frac{\partial J}{\partial w_j} = \frac{1}{m} \sum \limits _{i=1} ^m (\hat {y}^{(i)}-y^{(i)}){x}^{(i)}_j$  

$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum \limits _{i=1} ^m (\hat {y}^{(i)}-y^{(i)})$  

**Substituting**   

$ \frac{\partial J}{\partial w_1} = \frac{1}{m} [(\hat {y}^{(1)}-y^{(1)}){x}^{(1)}_1 + (\hat {y}^{(2)}-y^{(2)}){x}^{(2)}_1 + (\hat {y}^{(3)}-y^{(3)}){x}^{(3)}_1+(\hat {y}^{(4)}-y^{(4)}){x}^{(4)}_1 + (\hat {y}^{(5)}-y^{(5)}){x}^{(5)}_1]$  

$ \frac{\partial J}{\partial w_2} = \frac{1}{m} [(\hat {y}^{(1)}-y^{(1)}){x}^{(1)}_2 + (\hat {y}^{(2)}-y^{(2)}){x}^{(2)}_2 + (\hat {y}^{(3)}-y^{(3)}){x}^{(3)}_2+(\hat {y}^{(4)}-y^{(4)}){x}^{(4)}_2 + (\hat {y}^{(5)}-y^{(5)}){x}^{(5)}_2]$   


$ \frac{\partial J}{\partial w_1} = \frac{1}{m} \begin{pmatrix}
{x}^{(1)}_1 & {x}^{(2)}_1 & {x}^{(3)}_1 & {x}^{(4)}_1 & {x}^{(5)}_1 
\end{pmatrix}\begin{pmatrix}
\hat {y}^{(1)}-y^{(1)}\\
\hat {y}^{(2)}-y^{(2)}\\
\hat {y}^{(3)}-y^{(3)}\\
\hat {y}^{(4)}-y^{(4)}\\
\hat {y}^{(5)}-y^{(5)}
\end{pmatrix}$   

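$$ \frac{\partial J}{\partial w_1} = \frac{1}{m} \mathbf{x}_{(1)}(\mathbf{\hat {y}-y})^{T}$$  

Here $\mathbf{x}_{(j)}$ denotes the $j^{th}$ row of $\mathbf{X}$ (feature $j$ across all examples), and $(\mathbf{\hat {y}-y})^{T}$ is the column vector of errors.  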

$ \frac{\partial J}{\partial w_2} = \frac{1}{m} \begin{pmatrix} 
{x}^{(1)}_2 & {x}^{(2)}_2 & {x}^{(3)}_2 & {x}^{(4)}_2 & {x}^{(5)}_2
\end{pmatrix}\begin{pmatrix}
\hat {y}^{(1)}-y^{(1)}\\
\hat {y}^{(2)}-y^{(2)}\\
\hat {y}^{(3)}-y^{(3)}\\
\hat {y}^{(4)}-y^{(4)}\\
\hat {y}^{(5)}-y^{(5)}
\end{pmatrix}$

$$ \frac{\partial J}{\partial w_2} = \frac{1}{m} \mathbf{x}_{(2)}(\mathbf{\hat {y}-y})^{T}$$  

$ \frac {\partial J}{\partial \mathbf{w}} = \begin{pmatrix}
\frac{\partial J}{\partial w_1} \\ 
\frac{\partial J}{\partial w_2}
\end{pmatrix}$ 

$ \frac {\partial J}{\partial \mathbf{w}} = \frac{1}{m}\begin{pmatrix}
{x}^{(1)}_1 & {x}^{(2)}_1 & {x}^{(3)}_1 & {x}^{(4)}_1 & {x}^{(5)}_1 \\ 
{x}^{(1)}_2 & {x}^{(2)}_2 & {x}^{(3)}_2 & {x}^{(4)}_2 & {x}^{(5)}_2
\end{pmatrix}\begin{pmatrix}
\hat {y}^{(1)}-y^{(1)}\\
\hat {y}^{(2)}-y^{(2)}\\
\hat {y}^{(3)}-y^{(3)}\\
\hat {y}^{(4)}-y^{(4)}\\
\hat {y}^{(5)}-y^{(5)}
\end{pmatrix}$ 

$$ \frac {\partial J}{\partial \mathbf{w}} =\frac{1}{m} \mathbf {X} (\mathbf{\hat {y}-y})^{T} $$   
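
The vectorized gradient translates almost directly into NumPy. This is a sketch under the assumption that `X_cols` holds $\mathbf{X}$ in its $(nx, m)$ layout; the helper name `compute_gradients` is illustrative.

```python
import numpy as np

def compute_gradients(X_cols, y, w, b):
    """Return (dJ/dw, dJ/db) for X_cols of shape (nx, m) and y of shape (m,)."""
    m = X_cols.shape[1]
    errors = w @ X_cols + b - y      # y_hat - y, shape (m,)
    dw = X_cols @ errors / m         # (1/m) X (y_hat - y)^T, shape (nx,)
    db = errors.sum() / m            # (1/m) sum_i (y_hat^(i) - y^(i))
    return dw, db

X_cols = np.array([[100, 150, 200, 250, 500],
                   [1, 2, 3, 4, 4]], dtype=float)
y = np.array([20, 28, 39, 51, 80], dtype=float)

# Gradients at the all-zero starting point
dw, db = compute_gradients(X_cols, y, np.zeros(2), 0.0)
print(dw, db)
```

At $\mathbf{w}=\mathbf{0}, b=0$ this gives $\frac{\partial J}{\partial w_1}=-13350$, $\frac{\partial J}{\partial w_2}=-143.4$ and $\frac{\partial J}{\partial b}=-43.6$. The gradient for $w_1$ is far larger because the size feature takes much larger values than the bedroom count.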

**Updating Parameters**  

$ \mathbf{w} = \mathbf{w} - \alpha \frac {\partial J}{\partial \mathbf{w}}$  

$ b = b - \alpha \frac {\partial J}{\partial b}$  

Where,  
$\alpha$ : Learning rate (typical values: 0.0001, 0.001, 0.01, ...)
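
Putting the gradients and the update rule together gives a basic gradient descent loop. The sketch below is one possible arrangement, not a tuned implementation: the starting point, the iteration count and $\alpha = 10^{-5}$ are all assumptions. The learning rate is set below the values listed above because the size feature is unscaled (values in the hundreds), and a rate such as 0.0001 makes the updates diverge on this data.

```python
import numpy as np

X = np.array([[100, 1], [150, 2], [200, 3], [250, 4], [500, 4]], dtype=float)  # (m, nx)
y = np.array([20, 28, 39, 51, 80], dtype=float)
m = len(y)

def cost(w, b):
    """J(w, b) = 1/(2m) * sum of squared errors."""
    errors = X @ w + b - y
    return float(errors @ errors) / (2 * m)

w, b = np.zeros(X.shape[1]), 0.0   # assumed starting point
alpha = 1e-5                       # assumed learning rate for unscaled features
history = [cost(w, b)]

for _ in range(1000):
    errors = X @ w + b - y         # y_hat - y
    dw = X.T @ errors / m          # dJ/dw
    db = errors.sum() / m          # dJ/db
    w = w - alpha * dw             # w = w - alpha * dJ/dw
    b = b - alpha * db             # b = b - alpha * dJ/db
    history.append(cost(w, b))

print(history[0], history[-1])
print(w, b)
```

The cost falls at every step, but after so few iterations the parameters will generally not yet match the least-squares solution, since the direction associated with the small-valued features converges very slowly at this learning rate.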