# Simple Linear Regression

We will use pandas to work with dataframes and NumPy to do the math.

In [None]:
import pandas as pd
import numpy as np

For linear regression we'll have a set of $m$ samples. Each sample consists of a feature $x_i$ and a label $y_i$. Linear regression determines the weights $w_0$ (the intercept) and $w_1$ (the slope) that estimate a linear relation between the feature vector ($X$) and the label vector ($Y$). The estimate will be denoted $\hat Y$.

Pointwise:

$${\hat y}_i=w_0+w_1x_i$$

And in vector form:
$${\hat Y}=w_0+w_1 \cdot X$$


In [None]:
#Compute the prediction of the linear model given the weights and the input feature
def predictions_linear_reg(X, w_0, w_1):
    pred=w_0+w_1*X
    return pred
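
As a quick illustration of the vector form, the sketch below evaluates the model on a tiny feature column. The weights and the array `X_demo` are hypothetical values chosen only for this example; the point is that NumPy broadcasts the scalar $w_0$ over every row, so no loop is needed.

```python
import numpy as np

# Hypothetical weights and a tiny feature column, chosen only for illustration.
w_0, w_1 = 7.0, 3.0
X_demo = np.array([[0.0], [1.0], [2.0]])  # shape (3, 1), like X in this notebook

# The scalar w_0 is broadcast across every row of w_1 * X_demo.
pred_demo = w_0 + w_1 * X_demo

print(pred_demo.shape)             # (3, 1)
print(pred_demo.ravel().tolist())  # [7.0, 10.0, 13.0]
```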

We will use the Mean Squared Error (MSE) as the cost measure. The formula is:

$$mse=\frac{\sum_{i=1}^m(y_i-\hat{y}_i)^2}{m}$$

In [None]:
#Compute MSE
def get_mse(X, Y, w_0,w_1):
    #X is the feature vector (m,1)
    #Y is the labels vector (m,1)
    m=X.shape[0]
    pred=predictions_linear_reg(X, w_0, w_1)
    res=pred-Y
    sqrd=res**2
    MSE=np.sum(sqrd)/m
    return MSE
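
To make the formula concrete, here is a small worked example with three hand-picked labels and predictions (hypothetical values, not taken from any data set in this notebook). Only the last prediction is off, by 2, so the MSE should be $(0+0+4)/3$.

```python
import numpy as np

# Hypothetical labels and predictions, chosen so the result is easy to check by hand.
Y_true = np.array([1.0, 2.0, 3.0])
Y_hat = np.array([1.0, 2.0, 5.0])

residuals = Y_true - Y_hat          # [0, 0, -2]
mse_demo = np.mean(residuals ** 2)  # (0 + 0 + 4) / 3

print(mse_demo)
```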

Now we write the linear regression algorithm. Since we are working with one variable, it is not difficult to obtain the exact formulas for the $w_0$ and $w_1$ that minimize the MSE (all sums are from $1$ to $m$):

$$w_1=\frac{\sum x_iy_i - \frac{\sum x_i \sum y_i}{m} }{\sum x_i^2 - \frac{(\sum x_i)^2}{m}}$$

$$w_0= \frac{\sum y_i - w_1 \sum x_i}{m}$$

In [None]:
## simple linear regression with exact formula
def linear_regression_simple_exact_formula(X, Y):
    m=Y.size
    sum_x=np.sum(X)
    sum_y=np.sum(Y)
    prod=X*Y
    sum_prod=np.sum(prod)
    sq_x=X*X
    sum_sq_x=np.sum(sq_x)
    sq_sum_x=sum_x**2
    prod_sum_x_y=sum_x*sum_y
    w_1=(sum_prod -prod_sum_x_y/m)/(sum_sq_x-sq_sum_x/m)
    w_0=(sum_y- w_1*sum_x)/m
    return w_0,w_1
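
One way to sanity-check the closed-form weights is to compare them with NumPy's own least-squares fit, `np.polyfit` with `deg=1`. The sketch below does this on synthetic data; the helper `closed_form_fit` is a compact restatement of the formulas above written for this example, and the data, seed, and names are assumptions.

```python
import numpy as np

def closed_form_fit(x, y):
    # Same closed-form expressions as above, for 1-D arrays x and y.
    m = x.size
    w_1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / m) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    w_0 = (np.sum(y) - w_1 * np.sum(x)) / m
    return w_0, w_1

# Synthetic, seeded data: a noisy line with slope 3 and intercept 7.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 100, size=20)
y_demo = 3 * x_demo + 7 + rng.uniform(-5, 5, size=20)

w_0_cf, w_1_cf = closed_form_fit(x_demo, y_demo)
# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
w_1_np, w_0_np = np.polyfit(x_demo, y_demo, deg=1)

weights_agree = np.allclose([w_0_cf, w_1_cf], [w_0_np, w_1_np])
print(weights_agree)
```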

To test this algorithm we are going to create a sample set.

In [None]:
#First we generate a random list of 10 uniformly distributed numbers
X=np.random.uniform(0,1,[10,1])*100
print("The feature vector is: ")
print(X)
#Then we create the Y vector applying a linear function to X
Y=3*X+7
print("This is the image of X when we apply a linear function: ")
print(Y)
#Finally we add some noise to Y
for i in range(len(Y)):
    Y[i]+=np.random.uniform(-5,5)
print("The label (or output) vector is: ")    
print(Y)
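
The noise can also be added in one vectorized step instead of a loop, and a seeded generator makes the sample set reproducible. The following is a sketch of that alternative using `np.random.default_rng`; the seed and the `_demo` names are assumptions, and the values it produces will differ from the cell above.

```python
import numpy as np

# A seeded generator gives the same sample set on every run.
rng = np.random.default_rng(seed=0)

X_demo = rng.uniform(0, 100, size=(10, 1))

# One noise value per sample, drawn in a single call with the same shape as X_demo.
noise = rng.uniform(-5, 5, size=X_demo.shape)
Y_demo = 3 * X_demo + 7 + noise

print(Y_demo.shape)  # (10, 1)
```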

Let's apply our algorithm to this set:

In [None]:
wa_0, wa_1 =linear_regression_simple_exact_formula(X, Y)
print("wa_0: "+str(wa_0))
print("wa_1: "+str(wa_1))

Let's calculate the cost.

In [None]:
MSE=get_mse(X,Y,wa_0,wa_1)
print(MSE)

Now we will plot the sample points and the line obtained by the regression algorithm. We will use matplotlib.

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.scatter(X,Y)
t = np.arange(0.0, np.max(X), 0.01)
plt.plot(t,wa_1*t+wa_0,'-',color='r')

plt.show()

We proceed now to write the gradient descent algorithm. We need to calculate the derivatives of the MSE with respect to the weights $w_0,w_1$:

$$\frac{d}{dw_0}mse=-\frac{2}{m}\sum(y_i-\hat{y}_i)$$

$$\frac{d}{dw_1}mse=-\frac{2}{m}\sum x_i(y_i-\hat{y}_i)$$

In [None]:
def derivative(X,Y,w_0,w_1):
    m=X.shape[0]
    error=Y- predictions_linear_reg(X, w_0,w_1)
    dw_0=-2*np.sum(error)/m
    dw_1=-2*np.sum(X*error )/m
    return dw_0, dw_1
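
A common way to check derivative formulas like these is to compare them with central finite differences of the cost. The sketch below does so on synthetic data; the functions `mse` and `analytic_gradient` are compact restatements written for this example, and the step size `h`, the seed, and the data are assumptions.

```python
import numpy as np

def mse(X, Y, w_0, w_1):
    return np.mean((Y - (w_0 + w_1 * X)) ** 2)

def analytic_gradient(X, Y, w_0, w_1):
    # Same expressions as the derivative formulas above.
    error = Y - (w_0 + w_1 * X)
    return -2 * np.mean(error), -2 * np.mean(X * error)

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=50)
Y_demo = 3 * X_demo + 7 + rng.uniform(-1, 1, size=50)

w_0, w_1, h = 1.0, 1.0, 1e-6

# Central differences: perturb one weight at a time.
num_dw_0 = (mse(X_demo, Y_demo, w_0 + h, w_1) - mse(X_demo, Y_demo, w_0 - h, w_1)) / (2 * h)
num_dw_1 = (mse(X_demo, Y_demo, w_0, w_1 + h) - mse(X_demo, Y_demo, w_0, w_1 - h)) / (2 * h)

dw_0, dw_1 = analytic_gradient(X_demo, Y_demo, w_0, w_1)
gradients_match = np.allclose([dw_0, dw_1], [num_dw_0, num_dw_1], rtol=1e-5)
print(gradients_match)
```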

Recall that in this case we need a hyperparameter: the learning rate $\eta$.
We update the weights as follows:

$$w_j:=w_j- \eta \frac{d}{dw_j} mse$$

In [None]:
def gradient_descent_(X, Y, learning_rate, tolerance, initial_w_0, initial_w_1,max_iter=5000):
    #X is the feature matrix
    #Y is the output vector

    w_0=initial_w_0
    w_1=initial_w_1
    converged = False
    k=0
    while k<max_iter and not converged:
        #We print the current iteration every 1000 iterations
        if k % 1000 == 0:
            print("Iteration: "+str(k))
        dw_0,dw_1=derivative(X,Y, w_0,w_1)
        w_0=w_0- learning_rate*dw_0
        w_1=w_1-learning_rate*dw_1
        gradient_norm=np.linalg.norm([dw_0,dw_1])
        k=k+1
        if gradient_norm < tolerance:
            converged= True
            print("Converged on iteration: "+str(k))
   
    return w_0,w_1

Now we will apply this algorithm to the set we defined above.

In [None]:
learning_rate=0.0001
tolerance=.5
initial_weights=[1.,1.]

In [None]:
wb_0,wb_1= gradient_descent_(X, Y, learning_rate, tolerance, initial_weights[0], initial_weights[1],max_iter=50000)
print("wb_0: "+str(wb_0))
print("wb_1: "+str(wb_1))

Let's calculate the MSE with these weights.

In [None]:
MSE=get_mse(X,Y,wb_0,wb_1)
print(MSE)

And now let's plot the results.

In [None]:
plt.scatter(X,Y)
t = np.arange(0.0, np.max(X), 0.01)
plt.plot(t,wb_1*t+wb_0,'-',color='y')

plt.show()

Let's plot both lines, the one obtained with the exact formula, and the one obtained with gradient descent.

In [None]:
t = np.arange(0.0, np.max(X)/2, 0.01)
plt.plot(t,wa_1*t+wa_0,'-',color='r')
plt.plot(t,wb_1*t+wb_0,'-',color='y')

plt.show()

Even if the error is bigger than the one obtained with the exact formula, and the weights are not quite the same, the line seems to approximate the sample set pretty well.
It is important to keep in mind that for the gradient descent algorithm we must choose the learning rate and the initial weights.
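
Why the learning rate matters can be seen from the curvature of the cost. Because the MSE is quadratic in $(w_0, w_1)$, its Hessian is constant, and plain batch gradient descent converges only when the learning rate is below $2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of that Hessian. The sketch below estimates this limit on synthetic data and runs a few steps just below and just above it. The data, seed, step count, and the helper `run_gd` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=10)
y = 3 * x + 7 + rng.uniform(-5, 5, size=10)
m = x.size

# Hessian of the MSE with respect to (w_0, w_1); constant because the cost is quadratic.
H = (2 / m) * np.array([[m, x.sum()], [x.sum(), (x ** 2).sum()]])
lam_max = np.linalg.eigvalsh(H).max()
lr_limit = 2 / lam_max

def run_gd(lr, steps=200):
    # Plain batch gradient descent from (1, 1); returns the final MSE.
    w_0, w_1 = 1.0, 1.0
    for _ in range(steps):
        error = y - (w_0 + w_1 * x)
        w_0, w_1 = w_0 + lr * 2 * error.mean(), w_1 + lr * 2 * (x * error).mean()
    return np.mean((y - (w_0 + w_1 * x)) ** 2)

mse_start = np.mean((y - (1.0 + 1.0 * x)) ** 2)
mse_below = run_gd(0.9 * lr_limit)  # slightly under the limit: the cost goes down
mse_above = run_gd(1.1 * lr_limit)  # slightly over the limit: the cost blows up

print(lr_limit, mse_start, mse_below, mse_above)
```

Features on a large scale make $\lambda_{\max}$ large, which is one reason a very small learning rate can be needed when the feature values are in the thousands.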

Now we're going to use these algorithms on a "real" data set. We will import into our notebook a CSV file with house price data and load it into a dataframe.

In [None]:
house_prices=pd.read_csv("houseprices.csv")

Let us explore the dataframe.

In [None]:
house_prices.head()

In [None]:
house_prices[["price","LivingArea"]].head()

We need to split our data set into two parts: a training set and a test set. We will store them in two new dataframes. We set a seed for the random process in order to make everything repeatable.

In [None]:
#split the dataframe into train and test sets
def trainset__testset_split(df, train_ratio=.8, seed=0):
    np.random.seed(seed)
    m = len(df.index)
    shuffle = np.random.permutation(df.index)
    train_end = int(train_ratio * m)
    train = df.loc[shuffle[:train_end]] 
    test = df.loc[shuffle[train_end:]]
    return train, test
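
A split like this should leave the two parts disjoint and together cover every row. The sketch below repeats the same shuffle-and-slice idea on a toy dataframe and checks both properties. The dataframe contents, the seed, and the `_demo` names are hypothetical, and `np.random.default_rng` is used here in place of `np.random.seed`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with 10 rows, only for illustration.
df_demo = pd.DataFrame({
    "LivingArea": np.arange(10) * 100.0,
    "price": np.arange(10) * 1000.0,
})

rng = np.random.default_rng(0)
shuffled = rng.permutation(df_demo.index.to_numpy())
train_end = int(0.8 * len(df_demo))

train_demo = df_demo.loc[shuffled[:train_end]]
test_demo = df_demo.loc[shuffled[train_end:]]

# No row label may appear in both parts.
overlap = train_demo.index.intersection(test_demo.index)
print(len(train_demo), len(test_demo), len(overlap))  # 8 2 0
```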

Let's divide our set.

In [None]:
train_set, test_set= trainset__testset_split(house_prices, train_ratio=.8, seed=0)

In [None]:
train_set.head()

Note that the indexes are shuffled (that is what we did to create the random partition). So in order to locate the ith row of the train set we need to use the index vector.

In [None]:
print("The index corresponding to the fourth row is: " +str(train_set.index[3]) )
print("The fourth row is: "+ str(train_set.loc[train_set.index[3]]) ) 
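
The difference between selecting by label and selecting by position is easy to see on a small Series whose index is not in order. In this sketch (with hypothetical values), `.loc` looks up the index label while `.iloc` looks up the row position, so `.iloc[3]` would be a shorter way to reach the fourth row of a shuffled dataframe.

```python
import pandas as pd

# Hypothetical Series with a shuffled integer index.
s_demo = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

first_by_position = s_demo.iloc[0]  # first row, whatever its label is
row_with_label_0 = s_demo.loc[0]    # the row whose index label is 0

print(first_by_position)  # 10.0
print(row_with_label_0)   # 20.0
```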

We will use "LivingArea" as feature and "price" as target label.

In [None]:
X_train=train_set['LivingArea']
Y_train=train_set['price']

Let us apply the exact formula to the train set.

In [None]:
w_0,w_1=linear_regression_simple_exact_formula(X_train,Y_train)
print("w_0 = "+str(w_0))
print("w_1 = "+str(w_1))

We calculate the MSE for the training set, and we plot our model.

In [None]:
MSE_train=get_mse(X_train,Y_train,w_0,w_1)
print(np.format_float_scientific(MSE_train) )

In [None]:
train_set.plot(kind="scatter", x="LivingArea",y="price")
t = np.arange(0.0, 6000.0, 0.01)
plt.plot(t,w_1*t+w_0,'-',color='r')
plt.show()

Now to evaluate our model we must see how it does on the test set.

In [None]:
X_test=test_set["LivingArea"]
Y_test=test_set["price"]
MSE_test=get_mse(X_test,Y_test,w_0,w_1)
print(np.format_float_scientific(MSE_test))

In [None]:
print(np.format_float_scientific(MSE_test-MSE_train))
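
Since the MSE is measured in squared units of the label, these numbers are hard to read against actual prices. Taking the square root gives the root mean squared error (RMSE), which is in the same units as the label. The value below is a hypothetical MSE used only to show the conversion, not a result from this data set.

```python
import numpy as np

# Hypothetical MSE in squared price units, chosen only for illustration.
mse_demo = 4.0e8

# RMSE is expressed in the same units as the label (price).
rmse_demo = np.sqrt(mse_demo)

print(rmse_demo)  # 20000.0
```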

Let us apply gradient descent. First we must set the hyperparameters: the initial weights, the tolerance and the learning rate.

In [None]:
initial_weights = np.array([15000., 10.])
learning_rate = 1e-10
tolerance = 1e3

In [None]:
w_gd_0,w_gd_1=gradient_descent_(X_train,Y_train, learning_rate, tolerance, initial_weights[0], initial_weights[1],max_iter=200000)
print("w_0 = "+str(w_gd_0))
print("w_1 = "+str(w_gd_1))

Let's calculate the cost for these weights and its difference from the cost we obtained with the exact formula.

In [None]:
MSE_train_gradient_desc=get_mse(X_train,Y_train,w_gd_0,w_gd_1)
print(np.format_float_scientific(MSE_train_gradient_desc))

In [None]:
print(np.format_float_scientific(MSE_train_gradient_desc-MSE_train))

Next we calculate the predictions of a data sample, to see how the algorithms do in a particular case:

In [None]:
#The index is shuffled, so we select the sample by position with .iloc
print("Prediction with the exact formula: "+str(predictions_linear_reg(X_train.iloc[10], w_0, w_1)))
print("Prediction with gradient descent: "+str(predictions_linear_reg(X_train.iloc[10],w_gd_0,w_gd_1)))
print("Actual value: "+str(Y_train.iloc[10]))

Now we plot the line obtained with gradient descent.

In [None]:
train_set.plot(kind="scatter", x="LivingArea",y="price")
t = np.arange(0.0, 6000.0, 0.01)
plt.plot(t,w_1*t+w_0,'-',color='r')
plt.plot(t,w_gd_1*t+w_gd_0,'-',color='y')
plt.show()

Let's evaluate on the test set.

In [None]:
#MSE of the test set for the Gradient Descent model
MSE_test_gradient_desdent=get_mse(X_test,Y_test,w_gd_0,w_gd_1)
print(np.format_float_scientific(MSE_test_gradient_desdent))

In [None]:
#Difference between the gradient descent MSE and the exact formula MSE for the test set
print(np.format_float_scientific(MSE_test_gradient_desdent-MSE_test))

In [None]:
#Plot of the lines and the test set
test_set.plot(kind="scatter", x="LivingArea",y="price")
t = np.arange(0.0, 6000.0, 0.01)
plt.plot(t,w_1*t+w_0,'-',color='r')
plt.plot(t,w_gd_1*t+w_gd_0,'-',color='y')
plt.show()

In [None]:
#Plot of both models
t = np.arange(0.0, 1000.0, 0.01)
plt.plot(t,w_1*t+w_0,'-',color='r')
plt.plot(t,w_gd_1*t+w_gd_0,'-',color='y')
plt.show()

Of course we can choose a different feature (as long as it is "numeric") and apply our algorithms. Let's do it with "Bedrooms".

In [None]:
X_train2=train_set['Bedrooms']
#Y is the same.

In [None]:
#with the exact formula:
w_0,w_1=linear_regression_simple_exact_formula(X_train2,Y_train)
print("w_0 = "+str(w_0))
print("w_1 = "+str(w_1))

In [None]:
MSE_train=get_mse(X_train2,Y_train,w_0,w_1)
print(np.format_float_scientific(MSE_train) )

In [None]:
train_set.plot(kind="scatter", x="Bedrooms",y="price")
t = np.arange(0.0, 8.0, 0.01)
plt.plot(t,w_1*t+w_0,'-',color='r')

plt.show()

This does not look right. It could be that "Bedrooms" is not a good feature to predict the price. Anyway, let's see what happens on the test set.

In [None]:
X_test2=test_set["Bedrooms"]
MSE_test=get_mse(X_test2,Y_test,w_0,w_1)
print(np.format_float_scientific(MSE_test))

In [None]:
print(np.format_float_scientific(MSE_test-MSE_train))
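
The impression that a feature is weak can be quantified with the coefficient of determination, $R^2 = 1 - mse/\mathrm{Var}(Y)$, which for a least-squares line equals the squared correlation between feature and label. The sketch below compares a strongly related feature with an unrelated one on synthetic data. The data, seed, and the helper `r_squared` are assumptions for this example and say nothing about the house price data itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic label that depends linearly on x_strong, plus noise.
x_strong = rng.uniform(0, 100, size=200)
y = 3 * x_strong + 7 + rng.normal(0, 5, size=200)

# A second feature drawn independently of y, so it carries no information about it.
x_weak = rng.uniform(0, 100, size=200)

def r_squared(x, y):
    # Fit a least-squares line, then compare its MSE with the variance of y.
    w_1, w_0 = np.polyfit(x, y, deg=1)
    mse = np.mean((y - (w_0 + w_1 * x)) ** 2)
    return 1 - mse / np.var(y)

r2_strong = r_squared(x_strong, y)
r2_weak = r_squared(x_weak, y)

print(r2_strong, r2_weak)
```

An $R^2$ near 1 means the line explains most of the variation in the label, while a value near 0 means it does little better than predicting the mean.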

Now let's apply gradient descent.

In [None]:
initial_weights = np.array([20000., 20000.])
learning_rate = 1e-4
tolerance = 1e5

In [None]:
w_gd_0,w_gd_1=gradient_descent_(X_train2,Y_train, learning_rate, tolerance, initial_weights[0], initial_weights[1],max_iter=200000)
print("w_0 = "+str(w_gd_0))
print("w_1 = "+str(w_gd_1))

Let's calculate the cost and plot the line.

In [None]:
MSE_train_gd=get_mse(X_train2,Y_train,w_gd_0,w_gd_1)
print(np.format_float_scientific(MSE_train_gd) )

In [None]:
train_set.plot(kind="scatter", x="Bedrooms",y="price")
t = np.arange(0.0, 8.0, 0.01)
plt.plot(t,w_gd_1*t+w_gd_0,'-',color='y')

plt.show()

This is the cost for the test set:

In [None]:
MSE_test_gd=get_mse(X_test2,Y_test,w_gd_0,w_gd_1)
print(np.format_float_scientific(MSE_test_gd))

Now we plot the two models and the whole data set.

In [None]:
house_prices.plot(kind="scatter", x="Bedrooms",y="price")
t = np.arange(0.0, 8.0, 0.01)
plt.plot(t,w_1*t+w_0,'-',color='r')
plt.plot(t,w_gd_1*t+w_gd_0,'-',color='y')

plt.show()