# Regression Week 2:

# Exercise 1-1
# Regression Week 2: Multiple Regression (Interpretation)

The goal of this first notebook is to explore multiple regression and feature engineering with existing sklearn functions.

In this notebook you will use data on house sales in King County to predict prices using multiple regression. You will:
* Use pandas to do some feature engineering
* Use built-in sklearn functions to compute the regression weights (coefficients/parameters)
* Given the regression weights, predictors and outcome write a function to compute the Residual Sum of Squares
* Look at coefficients and interpret their meanings
* Evaluate multiple models via RSS

# Load libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

from sklearn.model_selection import train_test_split
from sklearn import linear_model

# Load in house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

sales = pd.read_csv("data/week2_kc_house_data.csv")
sales.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [3]:
print('len Sales: %d'%(len(sales)))

len Sales: 21613


# Split data into training and testing.
* Split the data into 80% training data and 20% test data.

In [4]:
train_data,test_data = train_test_split(sales,test_size=0.2)
print('len Train: %d\nlen Test: %d'%(len(train_data),len(test_data)))

len Train: 17290
len Test: 4323
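
The split above is random: the row counts stay fixed, but which rows land in each set (and every number computed from them) changes from run to run. A minimal sketch, assuming you want repeatable results, is to pass `random_state` to `train_test_split`. The toy array and the seed value 0 are illustrative only and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10)  # stand-in for the rows of the sales dataframe

# Same seed, same split: both calls return identical train/test rows.
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=0)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```

The same keyword works when the first argument is a dataframe such as `sales`.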


# Learning a multiple regression model
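Recall that we can learn a multiple regression model that predicts 'price' from the features example_features = ['sqft_living', 'bedrooms', 'bathrooms'] on the training data with the following code:

(Aside: LinearRegression fits by ordinary least squares, so it needs no validation_set argument and gives the same weights every time for a given training set. The train/test split above is random, however, so the numbers below change from run to run unless you pass random_state to train_test_split.)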

In [5]:
example_features = ['sqft_living', 'bedrooms', 'bathrooms']

example_model = linear_model.LinearRegression()
example_model.fit(train_data[example_features],train_data['price'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that we have fitted the model we can extract the regression weights (coefficients) as a numpy array as follows:

In [6]:
example_weight_summary = example_model.coef_
print(example_weight_summary)

[   312.28759548 -63374.36308577   6893.18662492]


# Making Predictions

In the gradient descent notebook we use numpy to do our regression. In this notebook we will use existing sklearn functions to analyze multiple regressions.

Recall that once a model is built we can use the .predict() function to find the predicted values for data we pass. For example using the example model above:

In [7]:
example_predictions = example_model.predict(train_data[example_features])

print(example_predictions[0]) # the exact value depends on the random train/test split

1038520.9319986172


# Compute RSS

Now that we can make predictions given the model, let's write a function to compute the RSS of the model. Complete the function below to calculate RSS given the model, data, and the outcome.

In [8]:
def get_residual_sum_of_squares(model, data, m_features, outcome):
    # First get the predictions
    predictions = model.predict(data[m_features])
    
    # Then compute the residuals/errors
    residuals_errors = outcome - predictions

    # Then square and add them up
    residuals_errors_square = residuals_errors*residuals_errors
    RSS = residuals_errors_square.sum()

    return(RSS)    

Test your function by computing the RSS on TEST data for the example model:

In [9]:
rss_example_test = get_residual_sum_of_squares(example_model, test_data, example_features, test_data['price'])
print(rss_example_test) # the exact value depends on the random train/test split

303947419523376.2
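
To make the arithmetic behind RSS concrete, here is a tiny worked example on made-up numbers (the two arrays are hypothetical and unrelated to the house data). It follows the same three steps as the function above: subtract, square, sum.

```python
import numpy as np

observed = np.array([3.0, 5.0, 7.0])   # made-up outcomes
predicted = np.array([2.5, 5.0, 8.0])  # made-up predictions

residuals = observed - predicted        # [0.5, 0.0, -1.0]
rss = np.sum(residuals ** 2)            # 0.25 + 0.0 + 1.0
print(rss)  # 1.25
```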


# Create some new features

We often think of multiple regression as including several different features (e.g. # of bedrooms, squarefeet, and # of bathrooms), but we can also consider transformations of existing features, e.g. the log of the squarefeet, or even "interaction" features such as the product of bedrooms and bathrooms.

You will use the logarithm function to create a new feature, so first import it from the math library.

In [10]:
from math import log

Next create the following 4 new features as columns in both TEST and TRAIN data:
* bedrooms_squared = bedrooms\*bedrooms
* bed_bath_rooms = bedrooms\*bathrooms
* log_sqft_living = log(sqft_living)
* lat_plus_long = lat + long

As an example here's the first one:

In [11]:
train_data['bedrooms_squared'] = train_data['bedrooms'].apply(lambda x: x**2)
test_data['bedrooms_squared'] = test_data['bedrooms'].apply(lambda x: x**2)

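
Assigning a new column to a dataframe that was itself sliced out of another one can make pandas emit a SettingWithCopyWarning, depending on the pandas version. A sketch of one common way around it, using a made-up dataframe `toy`, is to take an explicit `.copy()` of the subset before adding columns. The example also uses a vectorized `** 2` instead of `apply`, which gives the same values.

```python
import pandas as pd

toy = pd.DataFrame({'bedrooms': [3, 2, 4, 5],
                    'bathrooms': [2.5, 1.0, 1.75, 1.0]})

# An explicit copy owns its data, so adding columns to it is unambiguous.
subset = toy.iloc[:3].copy()
subset['bedrooms_squared'] = subset['bedrooms'] ** 2  # vectorized, no apply() needed

print(subset['bedrooms_squared'].tolist())  # [9, 4, 16]
```

In this notebook the equivalent step would be calling `.copy()` on `train_data` and `test_data` right after the split; that is a suggestion, not something the original cells do.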
  


In [12]:
# create the remaining 3 features in both TEST and TRAIN data

# bed_bath_rooms = bedrooms*bathrooms
train_data['bed_bath_rooms'] = train_data['bedrooms']*train_data['bathrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms']*test_data['bathrooms']

# log_sqft_living = log(sqft_living)
train_data['log_sqft_living'] = train_data['sqft_living'].apply(lambda x: log(x))
test_data['log_sqft_living'] = test_data['sqft_living'].apply(lambda x: log(x))

# lat_plus_long = lat + long
train_data['lat_plus_long'] = train_data['lat']+train_data['long']
test_data['lat_plus_long'] = test_data['lat']+test_data['long']


* Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since $1^2 = 1$ but $4^2 = 16$. Consequently this feature will mostly affect houses with many bedrooms.
* bedrooms times bathrooms gives what's called an "interaction" feature. It is large when *both* of them are large.
* Taking the log of squarefeet has the effect of bringing large values closer together and spreading out small values.
* Adding latitude to longitude is totally nonsensical but we will do it anyway (you'll see why).
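
The effects described in the list above can be seen on a handful of made-up values (the arrays below are illustrative, not rows from the dataset):

```python
import numpy as np

bedrooms = np.array([1.0, 4.0])
bathrooms = np.array([1.0, 3.0])
sqft = np.array([1000.0, 2000.0, 10000.0])

# Squaring: a gap of 3 bedrooms becomes a gap of 15.
print(bedrooms ** 2)

# Interaction: the product is large only when both factors are large.
print(bedrooms * bathrooms)

# Log: the raw gaps are 1000 and 8000, but the gaps between the logs
# are log(2) and log(5), so the large values are pulled much closer together.
log_sqft = np.log(sqft)
print(np.diff(log_sqft))
```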

In [13]:
train_data[['bedrooms','bathrooms','lat','long','bedrooms_squared','bed_bath_rooms','log_sqft_living','lat_plus_long']].head()

Unnamed: 0,bedrooms,bathrooms,lat,long,bedrooms_squared,bed_bath_rooms,log_sqft_living,lat_plus_long
8028,3,2.5,47.595,-122.173,9,7.5,8.185907,-74.578
11967,2,1.0,47.5532,-122.28,4,2.0,6.917706,-74.7268
15743,4,1.75,47.6209,-122.302,16,7.0,7.791523,-74.6811
5750,5,1.0,47.6605,-122.324,25,5.0,7.154615,-74.6635
10630,4,1.5,47.7501,-122.302,16,6.0,7.489971,-74.5519


**Quiz Question: What is the mean (arithmetic average) value of your 4 new features on TEST data? (round to 2 digits)**

**Answer:**

In [14]:
# Note: %d truncates to a whole number; use %.2f to see the two decimals the quiz asks for.
print('Mean of TestData---(bedrooms squared: %d'%test_data['bedrooms_squared'].mean())
print('Mean of TestData-----(bed bath rooms: %d '%test_data['bed_bath_rooms'].mean())
print('Mean of TestData----(log sqft living: %d '%test_data['log_sqft_living'].mean())
print('Mean of TestData------(lat plus long: %d '%test_data['lat_plus_long'].mean())

Mean of TestData---(bedrooms squared: 12
Mean of TestData-----(bed bath rooms: 7 
Mean of TestData----(log sqft living: 7 
Mean of TestData------(lat plus long: -74 


# Learning Multiple Models

Now we will learn the weights for three (nested) models for predicting house prices. The first model will have the fewest features, the second model will add one more feature, and the third will add a few more:
* Model 1: squarefeet, # bedrooms, # bathrooms, latitude & longitude
* Model 2: add bedrooms\*bathrooms
* Model 3: Add log squarefeet, bedrooms squared, and the (nonsensical) latitude + longitude
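
One thing to keep in mind about Model 3: lat_plus_long is an exact linear combination of lat and long, which are already in the model, so it carries no new information. A small check of that, using the five lat/long pairs printed in the table above and numpy's `matrix_rank` (the array names are illustrative):

```python
import numpy as np

# lat/long values copied from the five rows shown in the table above
lat = np.array([47.595, 47.5532, 47.6209, 47.6605, 47.7501])
long = np.array([-122.173, -122.28, -122.302, -122.324, -122.302])

two_columns = np.column_stack([lat, long])
three_columns = np.column_stack([lat, long, lat + long])

# The third column adds no new direction, so the rank stays the same.
print(np.linalg.matrix_rank(two_columns))
print(np.linalg.matrix_rank(three_columns))
```

When columns are linearly dependent like this, the individual weights on lat, long and lat_plus_long are not uniquely determined, which is worth remembering when reading Model 3's coefficients.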

In [15]:
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']

Now that you have the features, learn the weights for the three different models for predicting target = 'price' using linear_model.LinearRegression() and look at the value of the weights/coefficients:

In [16]:
# Learn the three models:
# MODEL 1
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features],train_data['price'])

# MODEL 2
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features],train_data['price'])

# MODEL 3
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features],train_data['price'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [17]:
# Examine/extract each model's coefficients:
model_1_weight_summary = model_1.coef_
model_2_weight_summary = model_2.coef_
model_3_weight_summary = model_3.coef_

print('Model1 Coefficients:')
print(model_1_weight_summary)

print('\nModel2 Coefficients:')
print(model_2_weight_summary)

print('\nModel3 Coefficients:')
print(model_3_weight_summary)

Model1 Coefficients:
[ 3.08887695e+02 -5.83106922e+04  1.64003544e+04  6.56212809e+05
 -3.04971323e+05]

Model2 Coefficients:
[ 3.04530899e+02 -1.02903339e+05 -5.62387893e+04  6.53407715e+05
 -2.92481427e+05  2.12575454e+04]

Model3 Coefficients:
[ 5.14465125e+02  2.71898262e+04  8.02273301e+04  5.32003957e+05
 -4.04032857e+05 -1.23095455e+04 -4.64241039e+03 -5.30058064e+05
  1.27971100e+05]
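
The entries of `coef_` follow the order of the feature columns passed to `fit`, so pairing them with the feature names avoids counting positions by hand. A small sketch using the Model 1 numbers printed above (the names `model_1_coefs` and `by_name` are illustrative):

```python
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']

# Coefficients copied from the Model 1 output above (they vary with the random split).
model_1_coefs = [3.08887695e+02, -5.83106922e+04, 1.64003544e+04,
                 6.56212809e+05, -3.04971323e+05]

by_name = dict(zip(model_1_features, model_1_coefs))
for name, weight in by_name.items():
    print('%-12s %15.2f' % (name, weight))
```

With a fitted model in memory, `zip(model_1_features, model_1.coef_)` gives the same pairing directly.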


**Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 1?**

**Answer:** Positive (the third coefficient, 1.64e+04).


**Quiz Question: What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 2?**

**Answer:** Negative (the third coefficient, -5.62e+04).

Think about what this means.
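
One way to reason about the sign change: in model 2, bathrooms appears twice, once on its own and once inside bed_bath_rooms = bedrooms\*bathrooms. Holding everything else fixed, the predicted effect of one extra bathroom is therefore w_bathrooms + w_bed_bath \* bedrooms, not w_bathrooms alone. A short sketch using the Model 2 coefficients printed above (the function name is illustrative):

```python
# Model 2 coefficients copied from the printed output above.
w_bathrooms = -5.62387893e+04
w_bed_bath = 2.12575454e+04

def bathroom_effect(bedrooms):
    """Predicted price change for one extra bathroom at a fixed bedroom count."""
    return w_bathrooms + w_bed_bath * bedrooms

for n in (2, 3, 4):
    print(n, round(bathroom_effect(n), 2))
```

For these weights the combined effect is negative for a 2-bedroom house and positive from 3 bedrooms upward, so a negative bathrooms coefficient in model 2 does not by itself mean that bathrooms lower the predicted price.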

# Comparing multiple models

Now that you've learned three models and extracted the model weights we want to evaluate which model is best.

First use your functions from earlier to compute the RSS on TRAINING Data for each of the three models.

In [18]:
# Compute the RSS on TRAINING data for each of the three models and record the values:
rss_model_1_train = get_residual_sum_of_squares(model_1, train_data, model_1_features, train_data['price'])
rss_model_2_train = get_residual_sum_of_squares(model_2, train_data, model_2_features, train_data['price'])
rss_model_3_train = get_residual_sum_of_squares(model_3, train_data, model_3_features, train_data['price'])

print('RSS Model 1: %d'%(rss_model_1_train))
print('\nRSS Model 2: %d'%(rss_model_2_train))
print('\nRSS Model 3: %d'%(rss_model_3_train))

RSS Model 1: 940940329132490

RSS Model 2: 934349514420330

RSS Model 3: 885348573944778


**Quiz Question: Which model (1, 2 or 3) has lowest RSS on TRAINING Data?** Is this what you expected?

**Answer:** Model 3 has the lowest RSS on the training data.
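
This outcome is expected for nested models fitted by least squares: adding a feature can never increase the training RSS, even if the feature is pure noise. A sketch on synthetic data (everything here, including the helper `train_rss`, is made up for illustration and uses `np.linalg.lstsq` rather than sklearn):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

def train_rss(X, y):
    """Fit least squares on (X, y) and return the RSS on the same data."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ w) ** 2)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, rng.normal(size=n)])  # extra pure-noise feature

rss_small = train_rss(X_small, y)
rss_big = train_rss(X_big, y)
print(rss_small, rss_big)
```

A lower training RSS therefore says little about whether the extra features help on new data, which is what the test-set comparison is for.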


Now compute the RSS on TEST data for each of the three models.

In [19]:
# Compute the RSS on TESTING data for each of the three models and record the values:
rss_model_1_test = get_residual_sum_of_squares(model_1, test_data, model_1_features, test_data['price'])
rss_model_2_test = get_residual_sum_of_squares(model_2, test_data, model_2_features, test_data['price'])
rss_model_3_test = get_residual_sum_of_squares(model_3, test_data, model_3_features, test_data['price'])

print('RSS Model 1: %d'%(rss_model_1_test))
print('\nRSS Model 2: %d'%(rss_model_2_test))
print('\nRSS Model 3: %d'%(rss_model_3_test))

RSS Model 1: 252310966056068

RSS Model 2: 247600046018032

RSS Model 3: 255242885877590


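## Result
**Quiz Question: Which model (1, 2 or 3) has lowest RSS on TESTING Data?** Is this what you expected? Think about the features that were added to each model from the previous.

* RSS Model-1 on TEST data: 252310966056068
* RSS Model-2 on TEST data: 247600046018032
* RSS Model-3 on TEST data: 255242885877590

In the run shown above, Model 2 has the lowest RSS on the test data. Model 3 has the lowest RSS on the training data but the highest on the test data, so its extra features did not help it generalize. Because the train/test split is random, these values (and possibly the ranking) change from run to run.

> #### Model_2 is the best model on the test data in this run.
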
# Exercise 1-2
# Regression Week 2: Multiple Regression (gradient descent)

In the first notebook we explored multiple regression using sklearn. Now we will use pandas along with numpy to solve for the regression weights with gradient descent.

In this notebook we will cover estimating multiple regression weights via gradient descent. You will:
* Add a constant column of 1's to a dataframe to account for the intercept
* Convert a dataframe into a Numpy array
* Write a predict_output() function using Numpy
* Write a numpy function to compute the derivative of the regression cost with respect to a single weight
* Write a gradient descent function to compute the regression weights given an initial weight vector, step size and tolerance.
* Use the gradient descent function to estimate regression weights for multiple features

# Load in house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [20]:
sales_2 = pd.read_csv("data/week2_kc_house_data.csv")
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

sales_2.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


If we want to do any "feature engineering" like creating new features or adjusting existing ones we should do this directly using the dataframe as seen in the other Week 2 notebook. For this notebook, however, we will work with the existing features.

# Convert to Numpy Array
Although dataframes offer a number of benefits to users (especially when using Big Data and built-in ... functions), in order to understand the details of the implementation of algorithms it's important to work with a library that allows for direct (and optimized) matrix operations. Numpy is a Python solution to work with matrices (or any multi-dimensional "array").

Recall that the predicted value given the weights and the features is just the dot product between the feature and weight vector. Similarly, if we put all of the features row-by-row in a matrix then the predicted value for *all* the observations can be computed by right multiplying the "feature matrix" by the "weight vector". 

First we need to take the dataframe of our data and convert it into a 2D numpy array (also called a matrix). To do this we can use pandas' .to_numpy() method (older pandas releases offered .as_matrix() for the same job, but it has since been removed).

Now we will write a function that will accept a pandas dataframe, a list of feature names (e.g. ['sqft_living', 'bedrooms']) and a target feature (e.g. 'price') and will return two things:
* A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
* A numpy array containing the values of the output

With this in mind, complete the following function (where there's an empty line you should write a line of code that does what the comment above indicates)

In [21]:
def get_numpy_data(dataframe, features, output):
    dataframe['constant'] = 1 # this is how you add a constant column to a dataframe
    
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features # this is how you combine two lists
    
    # select the columns of data given by the features list into features_dataframe (now including constant):
    features_dataframe = dataframe[features]
    
    # convert features_dataframe into a 2D numpy array
    # (.to_numpy() replaces .as_matrix(), which was removed from pandas):
    feature_matrix = features_dataframe.to_numpy()
    
    # select the column of data associated with the output
    output_dataframe = dataframe[output]
    
    # convert the output column into a 1D numpy array
    output_array = output_dataframe.to_numpy()
    
    return(feature_matrix, output_array)
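
For intuition about what the feature matrix looks like, here is a numpy-only sketch that builds the same kind of matrix by hand with `np.column_stack`. It uses the first three sqft_living values from the table above; the variable names are illustrative and this is not a replacement for the function.

```python
import numpy as np

sqft_living = np.array([1180.0, 2570.0, 770.0])  # first three rows shown earlier
constant = np.ones_like(sqft_living)             # the column that produces the intercept

feature_matrix = np.column_stack([constant, sqft_living])
print(feature_matrix.shape)  # (3, 2)
print(feature_matrix[0])

# With weights [1, 1] each prediction is simply 1 + sqft_living.
predictions = feature_matrix @ np.array([1.0, 1.0])
print(predictions)
```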

For testing let's use the 'sqft_living' feature and a constant as our features and price as our output:

In [22]:
(example_features, example_output) = get_numpy_data(sales_2, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list

print("Example_features: ")
print(example_features[0,:]) # this accesses the first row of the data the ':' indicates 'all columns'
print("\nExample_output: ")
print(example_output[0]) # and the corresponding output

Example_features: 
[   1 1180]

Example_output: 
221900.0




# Predicting output given regression weights
Suppose we had the weights [1.0, 1.0] and the features [1.0, 1180.0] and we wanted to compute the predicted output 
$$
\begin{align}
1.0*1.0 + 1.0*1180.0 = 1181.0 
\end{align}
$$
this is the dot product between these two arrays. If they're numpy arrays we can use np.dot() to compute this:

In [23]:
my_weights = np.array([1., 1.])    # the example weights
my_features = example_features[0,] # we'll use the first data point
predicted_value = np.dot(my_features, my_weights)

print("Predicted Value: %d"%(predicted_value))

Predicted Value: 1181


np.dot() also works when dealing with a matrix and a vector. Recall that the predictions from all the observations is just the RIGHT (as in weights on the right) dot product between the features *matrix* and the weights *vector*. With this in mind finish the following predict_output function to compute the predictions for an entire matrix of features given the matrix and the weights:

In [24]:
def predict_output(feature_matrix, weights):
    # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding numpy array
    # create the predictions vector by using np.dot()
    predictions = np.dot(feature_matrix, weights)

    return(predictions)

If you want to test your code run the following cell:

In [25]:
test_predictions = predict_output(example_features, my_weights)

print("Test_prediction[0]: %d"%(test_predictions[0])) # should be 1181.0
print("Test_prediction[1]: %d"%(test_predictions[1])) # should be 2571.0

Test_prediction[0]: 1181
Test_prediction[1]: 2571


# Computing the Derivative
We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output.

Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:

$$
\begin{align}
(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... +  w[k]*[feature_k] - output)^2
\end{align}
$$

Where we have k features and a constant. So the derivative with respect to weight w[i] by the chain rule is:

$$
\begin{align}
2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... +  w[k]*[feature_k] - output)* [feature_i]
\end{align}
$$

The term inside the parentheses is just the error (difference between prediction and output). So we can re-write this as:

$$
\begin{align}
2*error*[feature_i]
\end{align}
$$

That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself. In the case of the constant then this is just twice the sum of the errors!

Recall that twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors. 

With this in mind complete the following derivative function which computes the derivative of the weight given the value of the feature (over all data points) and the errors (over all data points).

In [26]:
def feature_derivative(errors, feature):
    # Assume that errors and feature are both numpy arrays of the same length (number of data points)
    # compute twice the dot product of these vectors as 'derivative' and return the value
    derivative = 2*np.dot(errors, feature)
    return(derivative)

To test your feature derivative run the following:

In [27]:
(example_features, example_output) = get_numpy_data(sales_2, ['sqft_living'], 'price') 
my_weights = np.array([0., 0.]) # this makes all the predictions 0
test_predictions = predict_output(example_features, my_weights) 
# just like dataframes 2 numpy arrays can be elementwise subtracted with '-': 
errors = test_predictions - example_output # prediction errors in this case is just the -example_output
feature = example_features[:,0] # let's compute the derivative with respect to 'constant', the ":" indicates "all rows"
derivative = feature_derivative(errors, feature)

print(derivative)
print(-np.sum(example_output)*2) # should be the same as derivative

-23345850016.0
-23345850016.0
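
Another way to gain confidence in a derivative formula is a finite-difference check: nudge one weight slightly, see how much the RSS changes, and compare with the analytic value 2\*dot(errors, feature). A sketch on a made-up three-point dataset (the data, the helper `rss` and the step `h` are assumptions for illustration):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # constant column + one feature
y = np.array([2.0, 3.0, 5.0])
w = np.array([0.5, 1.0])

def rss(weights):
    """Sum of squared errors for the given weights."""
    return np.sum((X @ weights - y) ** 2)

errors = X @ w - y                      # [-0.5, -0.5, -1.5]
analytic = 2 * np.dot(errors, X[:, 1])  # derivative with respect to w[1]

h = 1e-6
step = np.array([0.0, h])
numeric = (rss(w + step) - rss(w - step)) / (2 * h)  # central difference

print(analytic, numeric)
```

The two numbers should agree to several decimal places; a large gap usually points to a sign error or a missing factor of 2.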




# Gradient Descent
Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of *increase* and therefore the negative gradient is the direction of *decrease* and we're trying to *minimize* a cost function. 

The amount by which we move in the negative gradient *direction* is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. We define this by requiring the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.

With this in mind, complete the gradient descent function below using your derivative function above. For each step in the gradient descent we update the weight for each feature before computing our stopping criterion.

In [28]:
from math import sqrt # recall that the magnitude/length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)

In [29]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False 
    weights = np.array(initial_weights) # make sure it's a numpy array
    while not converged:
        # compute the predictions based on feature_matrix and weights using your predict_output() function
        predictions = predict_output(feature_matrix, weights)

        # compute the errors as predictions - output
        errors = predictions - output

        gradient_sum_squares = 0 # initialize the gradient sum of squares
        # while we haven't reached the tolerance yet, update each feature's weight
        for i in range(len(weights)): # loop over each weight
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            derivative = feature_derivative(errors, feature_matrix[:, i])

            # add the squared value of the derivative to the gradient sum of squares (for assessing convergence)
            gradient_sum_squares += (derivative**2)

            # subtract the step size times the derivative from the current weight
            weights[i] -= (step_size * derivative)
            
        # compute the square-root of the gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)

A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points and involves a product of an error and a feature the gradient itself will be very large since the features are large (squarefeet) and the output is large (prices). So while you might expect "tolerance" to be small, small is only relative to the size of the features. 

For similar reasons the step size will be much smaller than you might expect but this is because the gradient has such large values.
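
To see how much the scale of the data drives these choices, here is a sketch of gradient descent on a small synthetic problem whose single feature lies in [0, 1]. It is a vectorized variant with an iteration cap, not the notebook's function: the name `gradient_descent`, the `max_iter` guard and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def gradient_descent(X, y, initial_weights, step_size, tolerance, max_iter=10000):
    """Minimize RSS; stop when the gradient magnitude drops below tolerance."""
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iter):
        errors = X @ weights - y
        gradient = 2 * X.T @ errors      # all partial derivatives at once
        if np.linalg.norm(gradient) < tolerance:
            break
        weights -= step_size * gradient
    return weights

x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 2.0 * x                        # noiseless line: intercept 3, slope 2

weights = gradient_descent(X, y, [0.0, 0.0], step_size=0.005, tolerance=1e-6)
print(weights)
```

Here a step size of 0.005 and a tolerance of 1e-6 are enough to recover the intercept and slope. With square footage in the thousands and prices in the hundreds of thousands, the gradient is many orders of magnitude larger, which is why the cells below use step sizes around 1e-12 and tolerances around 1e7 to 1e9.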

# Running the Gradient Descent as Simple Regression
First let's split the data into training and test data.

In [30]:
train_data,test_data = train_test_split(sales_2,test_size=0.2)

Although the gradient descent is designed for multiple regression, since the constant is now a feature we can use the gradient descent function to estimate the parameters in the simple regression on squarefeet. The following cell sets up the feature_matrix, output, initial weights and step size for the first model:

In [31]:
# let's test out the gradient descent
simple_features = ['sqft_living']
my_output = 'price'

(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)

initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7



Next run your gradient descent with the above parameters.

In [32]:
test_weight = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)

print(test_weight)

[-46999.88427043    282.48042519]


How do your weights compare to those achieved in week 1 (don't expect them to be exactly the same)?

**Quiz Question: What is the value of the weight for sqft_living -- the second element of 'test_weight' (rounded to 1 decimal place)?**

Use your newly estimated weights and your predict_output() function to compute the predictions on all the TEST data (you will need to create a numpy array of the test feature_matrix and test output first):

In [33]:
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)



Now compute your predictions using test_simple_feature_matrix and your weights from above.

In [34]:
test_predictions = predict_output(test_simple_feature_matrix, test_weight)

print(test_predictions)

[306100.64722172 871061.49760915 534909.79162863 ... 458640.07682632
 667675.59146968 362596.73226046]


**Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 1 (round to nearest dollar)?**

In [35]:
print(test_predictions[0])

306100.6472217162


Now that you have the predictions on test data, compute the RSS on the test data set. Save this value for comparison later. Recall that RSS is the sum of the squared errors (difference between prediction and output).

In [36]:
test_residuals = test_output - test_predictions
test_RSS = (test_residuals * test_residuals).sum()

print(test_RSS)

291220853705378.5


# Running a multiple regression
Now we will use more than one actual feature. Use the following code to produce the weights for a second model with the following parameters:

In [37]:
model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average squarefeet for the nearest 15 neighbors. 
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9



Use the above parameters to estimate the model weights. Record these values for your quiz.

In [38]:
weight_2 = regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance)

print(weight_2)

[-9.99999546e+04  2.42341529e+02  6.88997536e+01]


Use your newly estimated weights and the predict_output function to compute the predictions on the TEST data. Don't forget to create a numpy array for these features from the test set first!

In [39]:
(test_feature_matrix, test_output) = get_numpy_data(test_data, model_features, my_output)

test_predictions_2 = predict_output(test_feature_matrix, weight_2)

print(test_predictions_2)

[280783.67886893 891553.28695681 526688.14035549 ... 453676.95449708
 670215.55326846 351299.90592349]




**Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?**

In [40]:
print(test_predictions_2[0])

280783.67886893265


What is the actual price for the 1st house in the test data set?

In [41]:
# print(test_data['price'][0])
test_data['price'].iloc[0]

155000.0

**Quiz Question: Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2?**

Now use your predictions and the output to compute the RSS for model 2 on TEST data.

In [42]:
test_residuals_2 = test_output - test_predictions_2
test_RSS_2 = (test_residuals_2**2).sum()

print(test_RSS_2)

287039655170051.94


## Result
**Quiz Question: Which model (1 or 2) has lowest RSS on all of the TEST data?**
* RSS Model-1 on TEST data: 291220853705378.5
* RSS Model-2 on TEST data: 287039655170051.94

The RSS of Model_2 is lower than that of Model_1.

> #### Model_2 is the better model.