# Regression Week 1: Simple Linear Regression

In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression. You will:
* Write a function to compute the Simple Linear Regression weights using the closed form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input given the output
* Compare two different models for predicting house prices

In this notebook you will be provided with some already complete code as well as some code that you should complete yourself in order to answer quiz questions. The code we provide for you to complete is optional and is there to assist you with solving the problems, but feel free to ignore the helper code and write your own.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load house sales data

The dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [2]:
df = pd.read_csv('kc_house_data.csv')

# Split data into training and testing

There are multiple ways to split the data into training and testing sets
* The train and test sets are already available as 2 separate csv files : <b><i>kc_house_train_data</i></b> and <b><i>kc_house_test_data</i></b> 
* Using <b><i>sklearn</i></b> to split the data into train and test sets
* Manually splitting the data into train and test sets using pandas built-in <b><i>sample</i></b> method

For this notebook, I will use the already available csv files for the train and test sets. The code for splitting the data with sklearn and for splitting it manually is included in comments for reference. In both cases a random state is set so that the data is split in the same way every time, which makes the results reproducible.

In [3]:
# Loading the train and test datasets using the csv files
train_data = pd.read_csv('kc_house_train_data.csv')
test_data = pd.read_csv('kc_house_test_data.csv')


# Code for splitting data manually
# train_data = df.sample(frac = 0.8, random_state = 0)
# test_data = df.drop(train_data.index)
 
# Code for splitting data into train and test sets using sklearn
# train_data, test_data = train_test_split(df, test_size=0.2, random_state = 0)
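
As a small illustration of the manual split described above, the sketch below applies pandas `sample` and `drop` to a made-up ten-row frame (the `toy_*` names and values are hypothetical, not the house sales data) and checks that the two parts do not share any rows.

```python
import pandas as pd

# Toy frame standing in for the house sales data (values are made up)
toy_df = pd.DataFrame({'sqft_living': range(10), 'price': range(100, 110)})

# Draw 80% of the rows at random, then keep the remaining rows for testing
toy_train = toy_df.sample(frac=0.8, random_state=0)
toy_test = toy_df.drop(toy_train.index)

# Row counts of the two parts, and any index labels they have in common
print(len(toy_train), len(toy_test))
print(set(toy_train.index) & set(toy_test.index))
```

Dropping by index works here because the index labels of the toy frame are unique.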

# Extracting the Input Features and the Output Variable

Machine learning algorithms require a set of input features <b><i>X</i></b> and an output variable <b><i>Y</i></b>. Here the required columns are selected from the training DataFrame by name and converted to NumPy arrays, so both <b><i>X</i></b> and <b><i>Y</i></b> are one-dimensional arrays. This notebook uses simple (single-variable) linear regression, so there is only one input feature. The input feature is <b><i>'sqft_living'</i></b> and the output variable is <b><i>'price'</i></b>.

In [4]:
X = np.array(train_data['sqft_living'])
Y = np.array(train_data['price'])

# Build a generic simple linear regression function 

The following function computes the simple linear regression slope and intercept. The equations used for the simple linear regression model are as below :

\begin{equation}
Y = a + bX
\end{equation}

\begin{equation}
slope (b) = \frac{\sum (X - \bar X) (Y - \bar Y)}{\sum (X - \bar X)^2}
\end{equation}

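\begin{equation}
intercept (a) = \bar Y - b \bar X
\end{equation}

<b> Note : </b> The intercept uses the mean values of X and Y, because the least-squares line always passes through the point (mean of X, mean of Y). An individual data point lies on the fitted line only when its residual is zero, so in general it cannot be used in place of the means.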

In [5]:
def simple_linear_regression(X, Y):
    '''
    Arguments :
    X -- a numpy array containing the values from train dataset for the input feature (e.g. 'sqft_living')
    Y -- a numpy array containing the values from train dataset for the variable 'price'
    
    Returns :
    intercept -- the intercept for the simple linear regression model
    slope -- the slope for the simple linear regression model
    
    '''
    
    # Work on float arrays so that pandas Series and lists are handled the same way
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    
    # Compute the mean of X and Y
    mean_X = np.mean(X)
    mean_Y = np.mean(Y)
    
    # Compute X - mean_X and Y - mean_Y
    diff_X = X - mean_X
    diff_Y = Y - mean_Y
    
    # Compute the summation of (diff_X * diff_Y) and the summation of diff_X squared
    summation_diff_X_Y = np.dot(diff_X, diff_Y)
    summation_diff_X_squared = np.sum(np.square(diff_X))
    
    # Compute the slope
    slope = summation_diff_X_Y / summation_diff_X_squared
    
    # Compute the intercept from the means
    intercept = mean_Y - (slope * mean_X)
    
    return intercept, slope
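
One way to gain confidence in the closed-form equations is to compare them with an independent least-squares routine. The sketch below is self-contained: `closed_form_fit` is a hypothetical stand-in that follows the mean-centred formulas above, the data are randomly generated (not the house sales data), and `np.polyfit` with `deg=1` serves as the reference.

```python
import numpy as np

def closed_form_fit(x, y):
    """Least-squares intercept and slope from the mean-centred formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    slope = np.dot(dx, y - y.mean()) / np.dot(dx, dx)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Made-up noisy data scattered around the line y = 3 + 2x
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=50)
y_demo = 3 + 2 * x_demo + rng.normal(0, 1, size=50)

intercept_cf, slope_cf = closed_form_fit(x_demo, y_demo)

# np.polyfit returns coefficients from the highest degree to the lowest
slope_np, intercept_np = np.polyfit(x_demo, y_demo, deg=1)

print(np.isclose(slope_cf, slope_np), np.isclose(intercept_cf, intercept_np))
```

Both comparisons should agree up to floating-point rounding, since both approaches minimise the same sum of squared residuals.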

An equivalent implementation of simple linear regression is shown below. It uses the form of the equations given in the course:

\begin{equation}
Y = a + bX
\end{equation}

\begin{equation}
slope (b) = \frac{\sum XY - \frac{1}{N} (\sum X \sum Y)}{\sum X^2 - \frac{1}{N} (\sum X)^2}
\end{equation}

\begin{equation}
intercept (a) = \bar Y - b \bar X
\end{equation}

<b> Note : </b> The intercept is computed from the mean values of X and Y, since the least-squares line passes through the point of means.

    def simple_linear_regression(X, Y):
        '''
        Arguments :
        X -- a numpy array containing the values from train dataset for the variable 'sqft_living'
        Y -- a numpy array containing the values from train dataset for the variable 'price'

        Returns :
        intercept -- the intercept for the simple linear regression model
        slope -- the slope for the simple linear regression model
        '''
        N = len(X)

        # Sum of the products and product of the sums
        sum_XY = np.dot(X, Y)
        prod_sumX_sumY = np.sum(X) * np.sum(Y)

        # Sum of the squares and square of the sum
        sum_X_square = np.sum(np.square(X))
        square_sum_X = np.square(np.sum(X))

        # Compute the slope
        slope = (sum_XY - (prod_sumX_sumY / N)) / (sum_X_square - (square_sum_X / N))

        # Compute the intercept
        intercept = np.mean(Y) - (slope * np.mean(X))

        return intercept, slope
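
The two slope formulas in this notebook give the same value. A short derivation, using only the definitions of the means (sums run over all N data points), expands the numerator of the mean-centred form:

\begin{equation}
\sum (X - \bar X)(Y - \bar Y) = \sum XY - \bar Y \sum X - \bar X \sum Y + N \bar X \bar Y = \sum XY - \frac{1}{N} \sum X \sum Y
\end{equation}

The last step holds because each of the three terms involving a mean equals (1/N) times the product of the two sums. Setting Y = X in the same expansion gives the denominator:

\begin{equation}
\sum (X - \bar X)^2 = \sum X^2 - \frac{1}{N} (\sum X)^2
\end{equation}

Dividing the first result by the second recovers the course form of the slope.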
    
    



We can test that our function works by passing it data for which we know the answer. In particular, we can generate a feature and put the output exactly on a line: output = 1 + 1\*input_feature. Both the slope and the intercept should then be 1.

In [6]:
test_feature = np.array(range(5))
test_output = np.array(1 + 1*test_feature)
test_intercept, test_slope  =  simple_linear_regression(test_feature, test_output)
print('Slope : ' , test_slope)
print('Intercept : ' , test_intercept)

Slope :  1.0
Intercept :  1.0


Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on train_data!

In [7]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print('Intercept : ' , sqft_intercept)
print('Slope : ' , sqft_slope)

Intercept :  -47116.07907289383
Slope :  281.9588396303424


# Predicting Values

Now that we have the model parameters (intercept and slope), we can make predictions. With NumPy arrays it is easy to multiply an array by a constant and add a constant value. Complete the following function to return the predicted output given the input feature, intercept and slope:

In [8]:
def get_regression_predictions(X, intercept, slope):
    
    '''
    Arguments :
    X -- a number or numpy array containing values of the input feature (e.g. 'sqft_living')
    intercept -- intercept value learned from train data by the simple linear regression model
    slope -- slope value learned from the train data by the simple linear regression model
    
    Returns :
    predicted_values -- the predicted output computed from X, intercept and slope
    
    '''
    
    # Compute the predicted values
    predicted_values = intercept + (slope * X)
    
    return predicted_values
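
The same expression works for a single number and for a whole array because NumPy applies the multiplication and addition element by element. The sketch below illustrates this with hypothetical parameters (`demo_intercept` and `demo_slope` are made-up values, not the fitted model).

```python
import numpy as np

# Hypothetical parameters chosen only for illustration
demo_intercept = 10.0
demo_slope = 2.0

# A scalar input gives a scalar prediction
demo_single = demo_intercept + demo_slope * 3
print(demo_single)

# An array input gives one prediction per element (NumPy broadcasting)
demo_inputs = np.array([0.0, 1.0, 2.5])
demo_predictions = demo_intercept + demo_slope * demo_inputs
print(demo_predictions)
```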

Now that we can calculate a prediction given the slope and intercept, let's make one. Use (or alter) the following to find the estimated price for a house with 2650 square feet according to the square feet model we estimated above.

<font color = 'steelblue'><b> Quiz : Using your Slope and Intercept from (4), What is the predicted price for a house with 2650 sqft? </b></font>

In [9]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print('The estimated price for a house with %d squarefeet is $%.2f' % (my_house_sqft, estimated_price))

The estimated price for a house with 2650 squarefeet is $700074.85


# Residual Sum of Squares

Now that we have a model and can make predictions, let's evaluate the model using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, and a residual is just a fancy word for the difference between the predicted output and the true output.

Complete the following (or write your own) function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

In [10]:
def get_residual_sum_of_squares(X, Y, intercept, slope) :
    
    '''
    Arguments :
    X -- a numpy array containing the values of the input feature (e.g. 'sqft_living')
    Y -- a numpy array containing the true values of the output variable 'price'
    intercept -- intercept value learned from train data by the simple linear regression model
    slope -- slope value learned from the train data by the simple linear regression model
    
    Returns :
    RSS -- the residual sum of squares, i.e. the sum of squared differences between the predicted values and the true values
    
    '''
    
    # First get the predictions. We convert the predictions to numpy array as Y is also a numpy array
    predictions = np.array(get_regression_predictions(X, intercept, slope))

    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = predictions - Y

    # square the residuals and add them up
    RSS = np.sum(np.square(residuals))

    return RSS

Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!

In [11]:
print(get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope)) # should be 0.0

0.0
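
A zero RSS only occurs when every point lies on the line. As a small worked example with a non-zero result, the sketch below uses three made-up points, first with a guessed line y = 1 + 1\*x and then with the least-squares line for the same points. All names and values here are hypothetical.

```python
import numpy as np

x_small = np.array([0.0, 1.0, 2.0])
y_small = np.array([1.0, 3.0, 2.0])

# Guessed line y = 1 + 1*x (not a fitted model)
pred_guess = 1.0 + 1.0 * x_small        # predictions: 1, 2, 3
resid_guess = pred_guess - y_small      # residuals: 0, -1, 1
rss_guess = np.sum(resid_guess ** 2)    # 0 + 1 + 1

# Least-squares line for the same three points
dx = x_small - x_small.mean()
slope_fit = np.dot(dx, y_small - y_small.mean()) / np.dot(dx, dx)
intercept_fit = y_small.mean() - slope_fit * x_small.mean()
rss_fit = np.sum((intercept_fit + slope_fit * x_small - y_small) ** 2)

print(rss_guess, rss_fit)
```

The guessed line gives an RSS of 2.0, while the least-squares line (intercept 1.5, slope 0.5) gives 1.5. No other straight line can do better on these points, which is what "least squares" means.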


Now use your function to calculate the RSS on training data from the squarefeet model calculated above.

<font color = 'steelblue'><b> Quiz Question: According to this function and the slope and intercept from the squarefeet model What is the RSS for the simple linear regression using squarefeet to predict prices on TRAINING data? </b></font>

In [12]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)

print('The RSS of predicting Prices based on Square Feet is : ' , rss_prices_on_sqft)

The RSS of predicting Prices based on Square Feet is :  1201918354177283.0


# Predict the squarefeet given price

What if we want to predict the square feet given the price? Since we have the equation y = a + b\*x, we can solve it for x: given the intercept (a), the slope (b) and the price (y), the estimated square feet is x = (y - a) / b.

Complete the following function to compute the inverse regression estimate, i.e. predict the input_feature given the output.

In [13]:
def inverse_regression_predictions(Y, intercept, slope):
    
    '''
    Arguments :
    Y -- a number or numpy array containing values of the output variable 'price'
    intercept -- intercept value learned from train data by the simple linear regression model
    slope -- slope value learned from the train data by the simple linear regression model
    
    Returns :
    estimated_feature -- estimated value of the input feature 'sqft_living' for the given price
    
    '''
    
    # solve output = intercept + slope*input_feature for input_feature. Use this equation to compute the inverse predictions:
    estimated_feature = (Y - intercept) / slope

    return estimated_feature
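
Inverting the fitted line is not the same as fitting a new regression with the roles of input and output swapped. The sketch below illustrates the difference on randomly generated data (the "square feet" and "prices" here are made up, and `np.polyfit` is used for the fits instead of the notebook's own function).

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.uniform(500, 4000, size=200)                # made-up "square feet"
y_demo = 50000 + 250 * x_demo + rng.normal(0, 1e5, 200)  # made-up "prices"

# Fit y on x, predict, then invert the line algebraically
slope_yx, intercept_yx = np.polyfit(x_demo, y_demo, deg=1)
y_pred = intercept_yx + slope_yx * x_demo
x_roundtrip = (y_pred - intercept_yx) / slope_yx
print(np.allclose(x_roundtrip, x_demo))  # inverting predictions recovers the inputs

# Fit x on y directly: a different least-squares problem
slope_xy, intercept_xy = np.polyfit(y_demo, x_demo, deg=1)
print(1 / slope_yx, slope_xy)
```

The two printed slopes differ whenever the data do not lie exactly on a line: the product of `slope_yx` and `slope_xy` equals the squared correlation between the two variables, which is below 1 for noisy data.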

Now that we have a function to compute the squarefeet given the price from our simple regression model let's see how big we might expect a house that costs $800,000 to be.

<font color = 'steelblue'><b> Quiz Question: According to this function and the regression slope and intercept from (3) what is the estimated square-feet for a house costing $800,000? </b></font>

In [14]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)

print('The estimated squarefeet for a house worth $%.2f is %d' % (my_house_price, estimated_squarefeet))

The estimated squarefeet for a house worth $800000.00 is 3004


# New Model: estimate prices from bedrooms

We have made one model for predicting house prices using squarefeet, but there are many other features in the sales DataFrame. 
Use your simple linear regression function to estimate the regression parameters for predicting prices based on the number of bedrooms. Use the training data!

In [15]:
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])
print('Intercept for bedroom : ', bedrooms_intercept)
print('Slope for bedroom : ', bedrooms_slope)

Intercept for bedroom :  109473.17762295937
Slope for bedroom :  127588.9529339879


# Test your Linear Regression Algorithm

Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using squarefeet.

<font color = 'steelblue'><b> Quiz : Which model (square feet or bedrooms) has lowest RSS on TEST data? Think about why this might be the case. </b></font>

In [16]:
# Compute RSS when using bedrooms on TEST data:
RSS_Bedroom_Test = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedrooms_intercept, bedrooms_slope)
print('RSS for bedroom on test data : ', RSS_Bedroom_Test)

RSS for bedroom on test data :  493364585960300.94


In [17]:
# Compute RSS when using squarefeet on TEST data:
RSS_Squarefeet_Test = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)
print('RSS for squarefeet on test data : ', RSS_Squarefeet_Test)

RSS for squarefeet on test data :  275402933617812.12


The Model with the squarefeet as the input feature has the <b>lower RSS</b> on the <b>TEST DATA</b>
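
RSS values are hard to read because they grow with the number of rows and are in squared dollars. One common way to make them easier to interpret, sketched below, is to divide by the number of rows and take the square root, which gives the root mean squared error (RMSE) in the units of the output. The helper `rmse_from_rss` is a hypothetical name and the numbers are made up; this is not part of the original assignment.

```python
import numpy as np

def rmse_from_rss(rss, n_rows):
    """Root mean squared error from a residual sum of squares over n_rows points."""
    return np.sqrt(rss / n_rows)

# Made-up example: an RSS of 400 over 4 points means a typical error of 10 units
demo_rmse = rmse_from_rss(400.0, 4)
print(demo_rmse)
```

Because both models here are evaluated on the same test rows, comparing RSS and comparing RMSE lead to the same ranking.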