This notebook compares regression models of different complexity to assess which one fits best, using polynomial regression as the running example.

* Write a function that takes a pandas Series and a degree and returns a pandas DataFrame whose columns are the Series raised to each power from 1 up to that degree. For example, with degree = 3, column 1 is the Series, column 2 is the Series squared, and column 3 is the Series cubed
* Use matplotlib to visualize polynomial regressions
* Use matplotlib to visualize the same polynomial degree on different subsets of the data
* Use a validation set to select a polynomial degree
* Assess the final fit using test data
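
As a rough illustration of the first bullet, the same "one column per power" layout can be sketched with NumPy broadcasting. The helper name `polynomial_columns` is hypothetical and is not used elsewhere in this notebook, which builds a pandas DataFrame instead.

```python
import numpy as np

def polynomial_columns(values, degree):
    """Return an (n, degree) array whose column k holds values ** (k + 1)."""
    values = np.asarray(values, dtype=float)
    powers = np.arange(1, degree + 1)          # 1, 2, ..., degree
    # broadcasting an (n, 1) column against a (1, degree) row gives (n, degree)
    return values[:, None] ** powers[None, :]

features = polynomial_columns([1.0, 2.0, 3.0], 3)
print(features)
```

Each row corresponds to one input value and each column to one power, which is the shape the regression models below expect.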

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from scipy import stats

In [None]:
# using lambda to create polynomial features
tmp = pd.Series([1., 2., 3.])
tmp_cubed = tmp.apply(lambda x: x**3)
print(tmp)
print(tmp_cubed)

In [None]:
# creating new columns
ex_df = pd.DataFrame()
ex_df['power_1'] = tmp
print(ex_df)

In [None]:
def polynomial_dataframe(feature, degree):
    # assume that degree >= 1
    # initialize the DataFrame:
    poly_df = pd.DataFrame()
    # and set poly_df['power_1'] equal to the passed feature
    poly_df['power_1'] = feature
    # first check if degree > 1
    if degree > 1:
        # then loop over the remaining degrees:
        # range(2, degree + 1) starts at 2 and stops at degree (the end point is exclusive)
        for power in range(2, degree + 1):
            # first we'll give the column a name:
            name = 'power_' + str(power)
            # then assign poly_df[name] to the appropriate power of feature
            poly_df[name] = feature ** power
    return poly_df
# test it
print(polynomial_dataframe(tmp, 3))

#### Visualizing polynomial regression

In [None]:
# load data
sales = pd.read_csv("../../ML Data & Script/kc_house_data.csv")
sales.head()

# sort by sqft_living and price (for plotting purposes)
sales = sales.sort_values(by=['sqft_living', 'price'])

In [None]:
# use statsmodels
import statsmodels.formula.api as smf
import statsmodels.api as sm
# use sklearn
from sklearn.linear_model import LinearRegression

In [None]:
# degree one polynomial
X = polynomial_dataframe(sales['sqft_living'], 1)
Y = sales['price']  # price is the target
model_one_sk = LinearRegression()
model_one_sk.fit(X, Y)
# print the intercept and the coefficient(s) separately
print(model_one_sk.intercept_, model_one_sk.coef_)

# R^2 on the data used for fitting
print(model_one_sk.score(X, Y))

In [None]:
# degree one polynomial
X = polynomial_dataframe(sales['sqft_living'], 1)
Y = sales['price']
X = sm.add_constant(X)
model_one = sm.OLS(Y, X).fit()
model_one.summary()

In [None]:
#let's take a look at the weights before we plot
print('Parameters: ', model_one.params)

In [None]:
plt.plot(X['power_1'], Y,'.',
        X['power_1'], model_one.predict(X),'-')

In [None]:
# degree two polynomial
X = polynomial_dataframe(sales['sqft_living'], 2)
Y = sales['price']
X = sm.add_constant(X)
model_two = sm.OLS(Y, X).fit()
print(model_two.summary())
plt.plot(X['power_1'], Y,'.',
        X['power_1'], model_two.predict(X),'-')
plt.show()

In [None]:
# degree three polynomial
X = polynomial_dataframe(sales['sqft_living'], 3)
Y = sales['price']
X = sm.add_constant(X)
model_three = sm.OLS(Y, X).fit()
print(model_three.summary())
plt.plot(X['power_1'], Y,'.',
        X['power_1'], model_three.predict(X),'-')
plt.show()

In [None]:
# degree 15 polynomial
X = polynomial_dataframe(sales['sqft_living'], 15)
Y = sales['price']
X = sm.add_constant(X)
model_fifteenth = sm.OLS(Y, X).fit()
print(model_fifteenth.summary())
plt.plot(X['power_1'], Y,'.',
        X['power_1'], model_fifteenth.predict(X),'-')
plt.show()

#### Changing Data and Re-learning

* Split the sales data into four subsets of roughly equal size.
* Estimate a 15th degree polynomial model on all four subsets of the data. 
* Print the coefficients (use `.params`, or `.summary()` for the full table, to view all 16 of them: the intercept plus 15 powers) and plot the resulting fit (as we did above).

The cell below creates four subsets (`set_1`, `set_2`, `set_3`, `set_4`) of approximately equal size.
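
A minimal sketch of the same idea on row positions only, assuming a stand-in row count of 10 rather than the real data. It uses `np.array_split`, which spreads any remainder over the first pieces; the notebook's own cell uses integer division instead, so there the last subset absorbs the remainder.

```python
import numpy as np

n_rows = 10                                # stand-in for sales.shape[0]
rng = np.random.default_rng(5)             # fixed seed so the shuffle is repeatable
shuffled = rng.permutation(n_rows)         # shuffled row positions 0..n_rows-1
subsets = np.array_split(shuffled, 4)      # four nearly equal pieces
sizes = [len(s) for s in subsets]
print(sizes)
```

Every row position lands in exactly one subset, so the four pieces together cover the whole dataset without overlap.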

In [None]:
# re-read the data to remove the sorting
sales = pd.read_csv("../../ML Data & Script/kc_house_data.csv")
sales = sales[['sqft_living', 'price']]
# randomize data
sales = sales.sample(frac=1,random_state=5)
amount = sales.shape[0] // 4
# create the four sets
set_1 = sales[0:amount * 1].sort_values(by=['sqft_living', 'price'])
set_2 = sales[amount * 1:amount * 2].sort_values(by=['sqft_living', 'price'])
set_3 = sales[amount * 2:amount * 3].sort_values(by=['sqft_living', 'price'])
set_4 = sales[amount * 3:].sort_values(by=['sqft_living', 'price'])

In [None]:
fig = plt.figure(figsize=(16,6))
fig.subplots_adjust(hspace = 0.4, wspace = 0.2)
fig.suptitle("Model Variance")
fig.add_subplot(221)

# degree 15 polynomial for set1
X = polynomial_dataframe(set_1['sqft_living'], 15)
Y = set_1['price']
X = sm.add_constant(X)
model_set1 = sm.OLS(Y, X).fit()
plt.plot(X['power_1'], Y,'.',
        X['power_1'], model_set1.predict(X),'-')

# degree 15 polynomial for set2
X = polynomial_dataframe(set_2['sqft_living'], 15)
Y = set_2['price']
X = sm.add_constant(X)
model_set2 = sm.OLS(Y, X).fit()
fig.add_subplot(222)
plt.plot(X['power_1'], Y,'.',
        X['power_1'], model_set2.predict(X),'-')

# degree 15 polynomial for set3
X = polynomial_dataframe(set_3['sqft_living'], 15)
Y = set_3['price']
X = sm.add_constant(X)
model_set3 = sm.OLS(Y, X).fit()
fig.add_subplot(223)
plt.plot(X['power_1'], Y,'.',
        X['power_1'], model_set3.predict(X),'-')

# degree 15 polynomial for set4
X = polynomial_dataframe(set_4['sqft_living'], 15)
Y = set_4['price']
X = sm.add_constant(X)
model_set4 = sm.OLS(Y, X).fit()
fig.add_subplot(224)
plt.plot(X['power_1'], Y,'.',
        X['power_1'], model_set4.predict(X),'-')
plt.show()

In [None]:
model_set1.summary()

In [None]:
model_set2.summary()

In [None]:
model_set3.summary()

In [None]:
model_set4.summary()

#### Selecting a Polynomial Degree

Whenever a model has a "magic" parameter such as the degree of the polynomial, one well-known way to choose it is with a validation set. (We will explore another approach in week 4.)

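We split the sales dataset three ways into a training set, a validation set, and a test set as follows:

* Split the sales data into two sets: `training_and_validation` (90%) and `testing` (10%). The code below uses `train_test_split` with `test_size=0.1, random_state=5`, producing `X_remaining`/`y_remaining` and `X_test`/`y_test`.
* Further split the remaining 90% into two equal sets: `training` and `validation`. The code below uses `train_test_split` with `test_size=0.5, random_state=5`, producing `X_train`/`y_train` and `X_val`/`y_val`.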
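
To make the resulting proportions concrete, here is a small sketch of a 90/10 split followed by a 50/50 split, written with plain NumPy on row positions. The function `three_way_split` and the row count of 1000 are illustrative assumptions; the sketch mirrors the proportions only, not how `train_test_split` is implemented.

```python
import numpy as np

def three_way_split(n_rows, test_frac=0.1, val_frac=0.5, seed=5):
    """Return (train, validation, test) arrays of shuffled row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_frac))        # first cut: hold out the test rows
    test_idx = order[:n_test]
    remaining = order[n_test:]
    n_val = int(round(len(remaining) * val_frac))  # second cut: halve what is left
    val_idx = remaining[:n_val]
    train_idx = remaining[n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Halving the 90% that remains after the test split is what gives 45% each for training and validation.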

In [None]:
from sklearn.model_selection import train_test_split
X = sales['sqft_living']
y = sales['price']
# obtain 10% test data and remaining 90%
X_remaining, X_test, y_remaining, y_test = train_test_split(X, y, test_size=0.1, random_state=5)
# obtain 45% train and 45% validation data
X_train, X_val, y_train, y_val = train_test_split(X_remaining, y_remaining, test_size=0.5, random_state=5)

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
X_val.head()

* For each degree in [1, 2, ..., 15] (in Python, `range(1, 15 + 1)`):
    * Build a DataFrame of polynomial features from the training `sqft_living` values (`X_train`) at the current degree
    * Fit a polynomial regression of price on sqft with that degree using the TRAIN data (`X_train`, `y_train`)
    * Compute the RSS on the VALIDATION data
        * Use `.predict()` for that degree; this requires a polynomial DataFrame built from the validation data (`X_val`).
* Report which degree had the lowest RSS on the validation data (remember that Python indexes from 0)
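
The steps above can be sketched on synthetic data, where RSS is the sum of squared differences between observed and predicted values. This is an illustration under stated assumptions, not the notebook's own procedure: the data are generated from a made-up quadratic, the degrees stop at 8, and `np.polyfit`/`np.polyval` stand in for `polynomial_dataframe` plus `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
# synthetic target: a quadratic plus a little noise (made up for illustration)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=200)

# first half for training, second half for validation
x_train, y_train = x[:100], y[:100]
x_val, y_val = x[100:], y[100:]

degrees = range(1, 8 + 1)
rss_by_degree = []
for degree in degrees:
    coeffs = np.polyfit(x_train, y_train, degree)      # fit on TRAIN only
    residuals = y_val - np.polyval(coeffs, x_val)      # predict on VALIDATION
    rss_by_degree.append(float(np.sum(residuals ** 2)))

# argmin is 0-based while degrees start at 1, hence the + 1
best_degree = int(np.argmin(rss_by_degree)) + 1
print(best_degree)
```

The model is never refit on the validation data; the validation set is used only to score each candidate degree.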

In [None]:
def get_residual_sum_of_squares(model, data, outcome):
    # First get the predictions
    predictions = model.predict(data)
    # Then compute the residuals/errors
    residuals = outcome - predictions
    # Then square and add them up
    rss = (residuals * residuals).sum()
    return rss
rss_list = []
for i in range(1, 15+1):
    # poly dataframe using training data
    X_train_poly = polynomial_dataframe(X_train, i)
    model = LinearRegression()
    model.fit(X_train_poly, y_train)    
    
    # making polynomial dataframe using validation data
    X_val_poly = polynomial_dataframe(X_val, i)
      
    rss = get_residual_sum_of_squares(model, X_val_poly, y_val)
   
    # calculate rss using sklearn
    #from sklearn import metrics
    #rss2 = X_val_poly.shape[0] * metrics.mean_squared_error(y_val, model.predict(X_val_poly))
    
    rss_list.append(rss)
    print("{} {:E}" .format(i ,rss))

In [None]:
# find the degree with the smallest RSS on the validation set
# (np.argmin is 0-based while degrees start at 1, hence the + 1)
min_index = np.argmin(rss_list) + 1
min_index

In [None]:
# plotting complexity vs validation error
plt.plot(list(np.arange(1,16,1)), rss_list)
plt.xlabel("Complexity")
plt.ylabel("Validation Error")
plt.show()

#### Test Error

In [None]:
from sklearn import metrics
X_train_poly = polynomial_dataframe(X_train, min_index)
model = LinearRegression()
model.fit(X_train_poly, y_train)    
rss_train = get_residual_sum_of_squares(model, X_train_poly, y_train)
#rss_train = X_train_poly.shape[0] * metrics.mean_squared_error(model.predict(X_train_poly), y_train)


X_test_poly = polynomial_dataframe(X_test, min_index)

rss_test = get_residual_sum_of_squares(model, X_test_poly, y_test)

print("Training Error: {:E}" .format(rss_train))
print("Test Error: {:E}" .format(rss_test))
#print("Validation Error: {:E}".format(rss_list[min_index - 1]))

#### Reflection

Polynomial regression fits a linear model to powers of the input variable; these powers are sometimes called feature transformations.
In this notebook, sqft_living is the predictor variable and price is the dependent variable.
The polynomial features are sqft_living, sqft_living_squared, sqft_living_cubed, sqft_living_raised_to_four, and so on.
I wrote a function that creates polynomial features up to a given degree.
The house price data is split into two parts: a 90% Train_Validation_Set and a 10% Test_Set.
The 90% Train_Validation_Set is then split into two equal halves, a Training Set and a Validation Set, giving:

* Training Set: 45%
* Validation Set: 45%
* Test Set: 10%
    
The model is trained on the training data. The validation set is used to select hyperparameters such as the degree of the polynomial. The smallest validation error was obtained with the degree-2 polynomial model, which has an intercept, a sqft term, and a sqft^2 term.

The degree-2 polynomial model is then evaluated on the test set. The test error approximates the generalization error.
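
Because RSS is a sum over rows, it grows with the size of the set, so the training RSS (45% of the rows) and the test RSS (10% of the rows) are not on the same scale. One way to compare them is to convert each RSS to a root-mean-square error. The helper `rmse_from_rss` below is a hypothetical addition, not part of the original notebook.

```python
import math

def rmse_from_rss(rss, n_rows):
    """Convert a residual sum of squares into a per-row root-mean-square error."""
    return math.sqrt(rss / n_rows)

# toy numbers: an RSS of 100 over 4 rows gives a mean squared error of 25
print(rmse_from_rss(100.0, 4))  # 5.0
```

In this notebook it could be applied as `rmse_from_rss(rss_train, len(y_train))` and `rmse_from_rss(rss_test, len(y_test))`.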

In [None]:
X_test_poly.head()


In [None]:
X_train_poly.head()

In [None]:
# note: X_val_poly was last assigned inside the degree loop, so it holds the degree-15 features
X_val_poly.head()