$g(x_i) = g(x_{i1}, x_{i2}, ..., x_{in}) = w_0 + \sum_{j=1}^n x_{ij}w_j$

where  
$w_0$ is the bias term  
$w_1, w_2, ..., w_n$ are the weights for the features $x_{i1}, x_{i2}, ..., x_{in}$

As an example, let's use these values:  
$w_0 = 7.17$  
$w_1 = 0.01$  
$w_2 = 0.04$  
$w_3 = 0.002$

In [1]:
import numpy as np

w0 = 7.17
## [ w1,  w2,    w3]
w = [0.01, 0.04, 0.002]
n = 3 

xi = [453, 11, 86] ## values as example

def linear_regression(xi):
    result = w0
    for j in range(n):
        result = result + xi[j] * w[j]
    return result
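
As a quick check, the function can be called with the example values. This is a minimal, self-contained sketch that repeats the definitions above; the rounding is only there to hide floating-point noise.

```python
w0 = 7.17
w = [0.01, 0.04, 0.002]
xi = [453, 11, 86]  # example feature values from above

def linear_regression(xi):
    # start from the bias and add one weighted feature at a time
    result = w0
    for j in range(len(w)):
        result = result + xi[j] * w[j]
    return result

prediction = linear_regression(xi)
# 7.17 + 453*0.01 + 11*0.04 + 86*0.002
print(round(prediction, 3))
```

Working the sum by hand gives 7.17 + 4.53 + 0.44 + 0.172 = 12.312, which is what the function returns up to floating-point error.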

The bias term is the value we would predict if we didn't know anything about the observation; it serves as a baseline.

Because we now think of both the features and the weights as vectors, $x_i$ and $w$ respectively, we can replace the sum of the element-wise products of these vectors with a dot product between them:

In [2]:
def dot(xi, w):
    n = len(w)
    result = 0.0
    for j in range(n):
        result = result + xi[j] * w[j]
    return result
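
To check that the hand-written loop really computes a dot product, it can be compared with NumPy's `np.dot`. This is a small sketch using the example values from earlier; the variable names `manual` and `builtin` are only illustrative.

```python
import numpy as np

def dot(xi, w):
    n = len(w)
    result = 0.0
    for j in range(n):
        result = result + xi[j] * w[j]
    return result

xi = [453, 11, 86]
w = [0.01, 0.04, 0.002]

manual = dot(xi, w)
builtin = np.dot(xi, w)  # np.dot accepts plain lists as well as arrays

# compare with a tolerance, since the two may differ by floating-point rounding
print(np.isclose(manual, builtin))
```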

Using the new notation, we can rewrite the entire equation for linear regression as:  
$g(x_i) = w_0 + x^T_i w$

Now we can use the new **dot** function, so the linear regression function in Python becomes very short:

In [3]:
def linear_regression(xi):
    return w0 + dot(xi,w)

Alternatively, if __xi__ and __w__ are NumPy arrays, we can use the built-in __dot__ method:

In [4]:
def linear_regression(xi):
    return w0 + xi.dot(w)
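
The `dot` method only exists on NumPy arrays, so plain Python lists have to be converted first. The following sketch assumes the same example values as before and shows the array version in use.

```python
import numpy as np

w0 = 7.17
w = np.array([0.01, 0.04, 0.002])
xi = np.array([453, 11, 86])

def linear_regression(xi):
    # xi must be a NumPy array here; a plain list has no .dot method
    return w0 + xi.dot(w)

prediction = linear_regression(xi)
print(round(float(prediction), 3))
```

Passing a plain list such as `[453, 11, 86]` to this version raises an `AttributeError`; wrapping the input in `np.asarray` avoids that.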

To make it even shorter, we can combine $w_0$ and $w$ into one $(n+1)$-dimensional vector:

In [5]:
w = [w0] + w

Because $w$ is now an $(n+1)$-dimensional vector, we also need to adjust the feature vector $x_i$ so that the dot product between them still works:

In [6]:
xi = [1] + xi

With these modifications, we can express the model as the dot product between the new $x_i$ and the new $w$:  
$g(x_i) = x^T_iw$  
The translation to the code is simple:

In [7]:
w0 = 7.17
w = [0.01,0.04,0.002]
w = [w0] + w

def linear_regression(xi):
    xi = [1] + xi
    return dot(xi,w)
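
A short usage sketch of the combined form, again with the example values. It repeats the `dot` helper so that it runs on its own.

```python
def dot(xi, w):
    result = 0.0
    for j in range(len(w)):
        result = result + xi[j] * w[j]
    return result

w0 = 7.17
w = [w0] + [0.01, 0.04, 0.002]  # bias first, then the feature weights

def linear_regression(xi):
    xi = [1] + xi  # list concatenation: prepend the dummy feature
    return dot(xi, w)

xi = [453, 11, 86]
prediction = linear_regression(xi)
print(round(prediction, 3))
print(xi)  # the caller's list is unchanged; the function builds a new one
```

Note that this version expects `xi` to be a Python list. With a NumPy array, `[1] + xi` would add 1 to every element instead of prepending it.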

Let's talk about the matrix form. There are many observations, and $x_i$ is one of them. Thus, we have $m$ feature vectors $x_1, x_2, ..., x_i, ..., x_m$, and each of these vectors consists of $n+1$ features. We can put these vectors together as rows of a matrix. Let's call this matrix $X$.
Let's see how it looks in code. We can take a few rows from the training dataset, such as the first, second and tenth:

In [8]:
x1  = [1, 148, 24, 1385]
x2  = [1, 132, 25, 2031]
x10 = [1, 453, 11, 86]

## now let's put the rows together in another list
X = [x1, x2, x10]
X = np.asarray(X)

We already learned that to make a prediction for a single feature vector, we need to calculate the dot product between this feature vector and the weights vector.  
To make predictions for all the rows of the matrix, we can simply iterate over all rows of $X$ and compute the dot product:

In [9]:
predictions = []

for xi in X:
    pred = dot(xi, w)
    predictions.append(pred)

In linear algebra, this is matrix-vector multiplication: we multiply the matrix $X$ by the vector $w$. Because $X$ already contains the dummy column of ones and $w$ already contains $w_0$, the formula for linear regression becomes:  
$g(X) = Xw$  
The result is an array with predictions for each row of $X$.  
With this matrix formulation, the code for applying linear regression to make predictions becomes very simple. The translation to NumPy is straightforward:

In [10]:
X.dot(w)

array([12.38 , 13.552, 12.312])
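
The row-by-row loop and the single matrix-vector product should give the same predictions. The sketch below rebuilds the three example rows and the combined weights vector and compares the two approaches; `loop_predictions` and `matrix_predictions` are illustrative names.

```python
import numpy as np

# three example rows, each starting with the dummy feature 1
X = np.asarray([
    [1, 148, 24, 1385],
    [1, 132, 25, 2031],
    [1, 453, 11, 86],
])
# combined weights vector: bias first, then the feature weights
w = np.asarray([7.17, 0.01, 0.04, 0.002])

loop_predictions = np.asarray([xi.dot(w) for xi in X])  # one dot product per row
matrix_predictions = X.dot(w)                           # all rows at once

# an (m, n+1) matrix times an (n+1,) vector gives an (m,) vector
print(X.shape, w.shape, matrix_predictions.shape)
print(np.allclose(loop_predictions, matrix_predictions))
```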

How do we get the weights $w$?  
We will use the normal equation, which is the simplest method to implement:  
$w = (X^T X)^{-1}X^T y$  
$X^T$ is the transpose of $X$; in NumPy: X.T  
$X^TX$ in NumPy: X.T.dot(X)  
$(X^TX)^{-1}$ is the inverse of $X^TX$; in NumPy we can use np.linalg.inv  
So the formula translates directly to: inv(X.T.dot(X)).dot(X.T).dot(y)

In [12]:
# Linear regression with NumPy
def train_linear_regression(X,y):
    # Adding the dummy column
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones,X]) #Adds the array of 1's as the first column of X
    
    # Normal Equation Formula
    XTX = X.T.dot(X) # Computes X^TX
    XTX_inv = np.linalg.inv(XTX) # Computes the inverse of X^TX
    w = XTX_inv.dot(X.T).dot(y) # Computes the rest of the normal equation
    
    return w[0], w[1:] # Splits the weights vector into the bias and the rest of weights
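
One way to sanity-check the training function is to generate targets from known weights and see whether they are recovered. The data below is synthetic and purely an assumption for illustration (random features, no noise); it is not the dataset used in this notebook.

```python
import numpy as np

def train_linear_regression(X, y):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])  # dummy column of ones for the bias
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    return w[0], w[1:]

# synthetic data: 50 observations, 3 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

true_w0 = 7.17
true_w = np.array([0.01, 0.04, 0.002])
y = true_w0 + X.dot(true_w)  # noise-free targets built from the known weights

w0, w = train_linear_regression(X, y)
print(np.isclose(w0, true_w0), np.allclose(w, true_w))
```

Because the targets are an exact linear function of the features, the normal equation should return the original bias and weights up to floating-point error.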

If the weights are split into the bias term and the rest, the linear regression formula for making predictions changes slightly (here $X$ is the original feature matrix, without the dummy column of ones):  
$g(X) = w_0 + Xw$  
This is still very easy to translate to NumPy:  
y_pred = w0 + X.dot(w)
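
The split form and the combined form should produce the same predictions. The sketch below uses the three example rows without the dummy column, together with the example bias and weights, and compares both forms; `X_with_ones` and `w_full` are illustrative names.

```python
import numpy as np

# feature matrix without the dummy column of ones
X = np.asarray([
    [148, 24, 1385],
    [132, 25, 2031],
    [453, 11, 86],
])
w0 = 7.17
w = np.asarray([0.01, 0.04, 0.002])

# split form: the bias is added to every row's dot product
y_pred = w0 + X.dot(w)

# combined form: dummy column in X, bias folded into the weights vector
X_with_ones = np.column_stack([np.ones(X.shape[0]), X])
w_full = np.concatenate([[w0], w])

print(np.allclose(y_pred, X_with_ones.dot(w_full)))
```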