Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

In [None]:
#######################
# standard code block #
#######################

# see https://ipython.readthedocs.io/en/stable/interactive/magics.html
%pylab inline
%config InlineBackend.figure_formats = ['retina']

#######################
#       imports       #
#######################
import pandas as pd
import seaborn as sns
# import sklearn

sns.set_style("whitegrid", {"font.family": ["serif"]})

# Predicting House Prices - A case study with scikit-learn and pandas

In this section, we will walk through how to build regression models in scikit-learn.

We will load the Ames Housing Data, split it into train and test sets, and build some models.

Using the Ames Housing Data:

Dean De Cock
Truman State University
Journal of Statistics Education Volume 19, Number 3(2011), www.amstat.org/publications/jse/v19n3/decock.pdf

In [None]:
df=pd.read_csv("http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt", sep='\t')

In [None]:
df.info()

### Data Dictionary
A description of the variables can be found here:

https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt



### Data Cleaning
From the above, and reading the documentation, here are a few things to note about this data set:
- SalePrice is our target variable
- The authors recommend removing the few houses that are >4000 SQFT (based on the 'Gr Liv Area' variable)
- Many columns have missing data (based on the number of "non-null" entries in each column)
- We have many predictor variables
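
One way to quantify the missing-data point is to count the nulls in each column. The sketch below does this on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not the Ames data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "Gr Liv Area": [1500, 2200, 900, 1800],
    "Garage Cars": [2.0, np.nan, 1.0, 3.0],
    "Alley": [np.nan, "Pave", np.nan, np.nan],
})

# Number of missing entries per column, largest first
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

The same call on the real `df` would show which columns are mostly empty and which are missing only a handful of values.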

### Challenge 1: Remove all houses that are greater than 4000 sqft with filtering (‘Gr Liv Area’)

In [None]:
df.shape

In [None]:
# Let's remove the large houses as suggested by the authors

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
assert df.shape == (2925, 82)

- How many data points did we remove from the data set?

## Next, let's restrict ourselves to just a few variables to get started

In [None]:
smaller_df= df[['Lot Area','Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add',
        'Gr Liv Area', 
        'Full Bath', 'Bedroom AbvGr',
        'Fireplaces', 'Garage Cars','SalePrice']]

In [None]:
smaller_df = df.select_dtypes("number").dropna()

In [None]:
## Let's have a look at these variables

smaller_df.describe()

In [None]:
smaller_df.info()

In [None]:
# There appears to be one NA in Garage Cars - fill with 0
smaller_df = smaller_df.fillna(0)

In [None]:
smaller_df.info()

In [None]:
import seaborn as sns

In [None]:
## Let's do a pairplot with seaborn to get a sense of the variables in this data set
sns.pairplot(smaller_df, y_vars=["SalePrice"], x_vars=smaller_df.columns)

### Comprehension question
From the pairplot above:

- Which variables seem to have the strongest correlations with SalePrice?
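
A numeric complement to reading the plots is to rank features by their correlation with the target. This is a minimal sketch on made-up numbers; the `toy` frame and its column names `area`, `age`, and `price` are hypothetical.

```python
import pandas as pd

# Made-up data: price rises with area, and has a weaker relation to age
toy = pd.DataFrame({
    "area": [1000, 1500, 2000, 2500, 3000],
    "age": [50, 10, 30, 40, 20],
    "price": [100, 160, 195, 240, 310],
})

# Pearson correlation of every feature with the target
corr_with_price = toy.corr()["price"].drop("price")

# Rank by absolute value so strong negative correlations also surface
ranked = corr_with_price.abs().sort_values(ascending=False)
print(ranked)
```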

In [None]:
sns.heatmap(smaller_df.corr())

### Train - Test Splits

Train-test splitting is a big part of the data science pipeline because we are always trying to build models that perform well "in the wild." To evaluate a model's performance, we need to test it on data that we didn't use when building the model. So before any model-building, we often set aside a portion of our data to serve as an "evaluator" of how the model performs on data it has never seen.

<img src="images/train_test_split.png">

In scikit-learn, we use `train_test_split` to do this, which allows us to randomly sample the data instead of taking one big chunk.
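
As a minimal sketch of how `train_test_split` behaves, the toy example below splits ten made-up samples and checks that a fixed `random_state` reproduces the same split. The arrays `X_toy` and `y_toy` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each, and a toy target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Hold out 30% of the rows; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` matters whenever the split has to be recreated later, for example after adding a new feature column.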

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#Split the data 70-30 train/test
X = smaller_df.drop(['SalePrice'], axis=1)
y = smaller_df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)

In [None]:
X_train.shape, X_test.shape

In [None]:
X_train.columns

## Linear Regression
In the first part of this notebook we will use linear regression.  We will start with a simple one-variable linear regression and then proceed to more complicated models.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# First let us fit only on Living Area (sqft)
selected_columns_1 = ['Gr Liv Area']

### Sklearn Modeling
Scikit-learn's predictive models share a common structure. Typically, a model is "defined", then "fit" to a set of examples with their answers. The trained model can then be used to predict on a set of unlabeled data points. We will walk through this process in the next few cells.
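
The define / fit / predict pattern can be sketched on toy data before applying it to the housing set. The data here is made up so that it follows y = 2x + 1 exactly; `X_toy`, `y_toy`, and `model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2*x + 1 exactly
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

model = LinearRegression()                 # 1. define
model.fit(X_toy, y_toy)                    # 2. fit on labeled examples
pred = model.predict(np.array([[10.0]]))   # 3. predict on new data

print(model.coef_, model.intercept_, pred)
```

Note that `fit` and `predict` expect a 2-D feature matrix (rows are samples, columns are features), which is why the notebook indexes `X_train` with a list of column names.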

In [None]:
## First we define a default LinearRegression model and fit it to the data (with just 'Gr Liv Area' as a predictor
## and SalePrice as the target).

lr_model1 = LinearRegression()
lr_model1.fit(X_train[selected_columns_1],y_train)

In [None]:
## Let us look at the (single) variable coefficient and the intercept
lr_model1.coef_, lr_model1.intercept_

### Comprehension Question
- What would this simple model predict as the sales price of a 1000 sq ft home?
- Does that seem reasonable? (Remember, these are house prices in Ames, Iowa between 2006 and 2010)
- Write a function that takes the variables above and predicts the output

In [None]:
def make_prediction(beta_0, beta_1, sqft):
    return beta_0 + beta_1*sqft

# coef_ is an array with one entry per feature, so take its first element
make_prediction(lr_model1.intercept_, lr_model1.coef_[0], 1000)

## Plotting the Regression Line
Let's use our knowledge of Matplotlib/Seaborn to make some plots of this data. Let's begin by plotting Price vs Square Footage. Let's also add a line for our model.

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(X_train['Gr Liv Area'],y_train,alpha=.1)
vec1 = np.linspace(0,4000,1000)
plt.plot(vec1, lr_model1.intercept_ + lr_model1.coef_[0]*vec1,'r')
plt.title("Housing Prices in Ames Iowa by Sq Ft (Training Set)")
plt.xlabel("Sq ft of home")
plt.ylabel("Price of home");

In [None]:
# Let's make a similar plot for the test set

plt.figure(figsize=(10,8))
plt.scatter(X_test['Gr Liv Area'],y_test,alpha=.1)
vec1 = np.linspace(0,4000,1000)
plt.plot(vec1, lr_model1.intercept_ + lr_model1.coef_[0]*vec1,'r')
plt.title("Housing Prices in Ames Iowa by Sq Ft (Test Set)")
plt.xlabel("Sq ft of home")
plt.ylabel("Price of home");

In [None]:
# Let's get predictions of the model on the test set
# Note the use of the `model.predict(feature_matrix)` syntax

test_set_pred1 = lr_model1.predict(X_test[selected_columns_1])

A very useful plot for diagnosing problems is the actual price vs the predicted price. If our model were perfect, every point would lie on a 45-degree line starting from (0,0) with slope 1. Let's see how we did here.

In [None]:
## Let's plot the actual vs predicted house price (along with the line y=x for reference)
plt.figure(figsize=(10,8))
plt.scatter(test_set_pred1,y_test,alpha=.1)
plt.plot(np.linspace(0,600000,1000),np.linspace(0,600000,1000), 'r-')
plt.xlabel("Predicted")
plt.ylabel("Actual");

Let's talk about some metrics and what they're used for.
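
For reference, write $y_i$ for the actual price, $\hat{y}_i$ for the predicted price, $\bar{y}$ for the mean actual price, and $n$ for the number of test points. The metrics defined in the next cells are then:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

scikit-learn provides equivalents in `sklearn.metrics`. The sketch below, on made-up numbers (the arrays `y_true` and `y_pred` are illustrative assumptions), checks that the library functions agree with the formulas:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

# Library versions
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same quantities written out from the formulas
manual_mse = np.mean((y_pred - y_true) ** 2)
manual_mae = np.mean(np.abs(y_pred - y_true))
manual_r2 = 1.0 - np.sum((y_pred - y_true) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, manual_mse)
print(mae, manual_mae)
print(r2, manual_r2)
```

RMSE and MAE are in the same units as the target (dollars here), which makes them easier to interpret than MSE; $R^2$ is unitless.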

In [None]:
# How good is our model on the test set?

# Mean Squared Error
def mean_square_error(true, pred):
    return np.mean((pred - true)**2)

mean_square_error(y_test,test_set_pred1)

In [None]:
# Root Mean Square Error
def root_mean_square_error(true,pred):
    return np.sqrt(mean_square_error(true,pred))

root_mean_square_error(y_test,test_set_pred1)

In [None]:
# Mean Absolute Deviation (also called mean absolute error, MAE)
def mean_absolute_deviation(true,pred):
    return np.mean(np.abs(pred - true))

mean_absolute_deviation(y_test, test_set_pred1)

In [None]:
# R^2

def R2_score(true,pred):
    y_bar_test = np.mean(true)
    SSE = np.sum((pred - true)**2)
    SST = np.sum((true - y_bar_test)**2)
    return 1.-SSE/SST

R2_score(y_test, test_set_pred1)

Let's put all of those into one nice function that prints out all of our stats:

In [None]:
def model_stats(true, pred):
    results = {
        "MSE": mean_square_error(true, pred),
        "MAE": mean_absolute_deviation(true, pred),
        "RMSE": root_mean_square_error(true,pred),
        "R2": R2_score(true,pred),
    }
    print(results)
    return results
    
model_stats(y_test, test_set_pred1)

That was all well and good, but we left a lot of information out when we restricted ourselves to just the square footage. So let's add some information back in by allowing the lot size ('Lot Area') to inform our predictions as well.

In [None]:
selected_columns_2 = ['Lot Area', 'Gr Liv Area']

In [None]:
lr_model2 = LinearRegression()
lr_model2.fit(X_train[selected_columns_2],y_train)

In [None]:
lr_model2.coef_

In [None]:
## This is a hack to show the variables next to their values
list(zip(selected_columns_2,lr_model2.coef_))

In [None]:
test_set_pred2 = lr_model2.predict(X_test[selected_columns_2])

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(test_set_pred2,y_test,alpha=.2)
plt.plot(np.linspace(0,600000,1000),np.linspace(0,600000,1000),'r-');
plt.xlabel("Predicted")
plt.ylabel("Actual");

In [None]:
model_stats(y_test,test_set_pred2)

Excellent! That's an improvement. You can see that our errors went down and our R2 went up. That's lovely. 

### Challenge 2: Add the overall quality information into your process for prediction


Modify the code below to create a model named `lr_model_cha` that adds the Overall Qual column


In [None]:

#---------modify this code----------------------------------#
selected_columns_challenge = ['Lot Area', 'Gr Liv Area']
lr_model_cha = LinearRegression()
lr_model_cha.fit(X_train[selected_columns_challenge],y_train)


#---------modify this code----------------------------------#

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
test_set_pred_cha = lr_model_cha.predict(X_test[selected_columns_challenge])

print(list(zip(selected_columns_challenge,lr_model_cha.coef_)))

plt.figure(figsize=(10,6))
plt.scatter(test_set_pred_cha,y_test,alpha=.2)
plt.plot(np.linspace(0,600000,1000),np.linspace(0,600000,1000),'r-');
plt.xlabel("Predicted")
plt.ylabel("Actual");

stats = model_stats(y_test, test_set_pred_cha);

In [None]:
assert 'Gr Liv Area' in selected_columns_challenge
assert 'SalePrice' not in selected_columns_challenge

test_set_pred_cha = lr_model_cha.predict(X_test[selected_columns_challenge])
r2  = model_stats(y_test, test_set_pred_cha)["R2"]

assert r2 > .6
assert r2 < .95


### Feature Engineering
Since there seems to be some non-linearity, let's make a new variable that is the square of 'Gr Liv Area' (the above-ground living area). This is called feature engineering, since we're "engineering" (making) a new feature out of our old features.
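
To see why a squared feature can help, here is a minimal sketch on synthetic, noise-free data where the target depends on `x` quadratically. The data and the names `X_lin`, `X_sq`, `r2_lin`, and `r2_sq` are illustrative assumptions, not part of the housing analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy target that depends on x quadratically (no noise)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x**2 + 5.0

X_lin = x.reshape(-1, 1)             # original feature only
X_sq = np.column_stack([x, x**2])    # original feature plus its square

# .score returns R^2 on the data passed in
r2_lin = LinearRegression().fit(X_lin, y).score(X_lin, y)
r2_sq = LinearRegression().fit(X_sq, y).score(X_sq, y)
print(r2_lin, r2_sq)
```

The model is still linear in its coefficients; only the inputs changed. That is what lets a linear regression capture a curved relationship.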

In [None]:
X['GLA2'] = X['Gr Liv Area']**2
X.columns

In [None]:
## We need to recreate the train and test sets -- make sure you use the same random seed!
#Split the data 70-30 train/test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)

In [None]:
selected_columns_3 = ['Lot Area', 'Overall Qual', 'GLA2']

In [None]:
lr_model3 = LinearRegression()
lr_model3.fit(X_train[selected_columns_3],y_train)

In [None]:
list(zip(X_train[selected_columns_3].columns,lr_model3.coef_))

In [None]:
test_set_pred3 = lr_model3.predict(X_test[selected_columns_3])

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(test_set_pred3,y_test,alpha=.1)
plt.plot(np.linspace(0,600000,1000),np.linspace(0,600000,1000));
plt.xlabel("Predicted")
plt.ylabel("Actual");

In [None]:
model_stats(y_test,test_set_pred3)

## Exercise

Attempt to build the best model you can using the techniques shown above. Some recommendations:

* Add some of the features we removed. But be careful, we haven't talked about how to handle categorical data, so your model won't work with categories.
* Do some feature engineering. We played with GLA^2, but there are more variables you can try things with. You might also try multiplying some features together to see if there are "interaction" terms.

Go wild. Next week, you will have a chance to describe what sort of work you tried and how your model performed!
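
As a small sketch of what an "interaction" term looks like in pandas, the example below multiplies two columns of a made-up frame. The rows are invented and the new column name `Qual_x_Area` is hypothetical.

```python
import pandas as pd

# Made-up rows; column names mirror the housing data for readability
toy = pd.DataFrame({
    "Overall Qual": [5, 7, 8],
    "Gr Liv Area": [1000, 1500, 2000],
})

# An interaction term is the product of two existing features
toy["Qual_x_Area"] = toy["Overall Qual"] * toy["Gr Liv Area"]
print(toy)
```

Any column engineered this way needs to be added to `X` before the train/test split so that both sets contain it.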

In [None]:
challenge_columns = ["Year Built", 'Gr Liv Area']

# YOUR CODE HERE
raise NotImplementedError()

In [None]:

X_challenge = X.loc[:,challenge_columns]

X_train, X_test, y_train, y_test = train_test_split(X_challenge, y, test_size=0.3,random_state=42)


lr_model = LinearRegression()
lr_model.fit(X_train,y_train)
print("Coeffs: ")
for col, coe in list(zip(X_train.columns,lr_model.coef_)):
    print((col,coe))

test_set_pred = lr_model.predict(X_test)

plt.figure(figsize=(8,6))
plt.scatter(test_set_pred,y_test,alpha=.1)
plt.plot(np.linspace(0,600000,1000),np.linspace(0,600000,1000),'r-')
plt.xlabel('Predicted Price')
plt.ylabel('Actual Price')


stats = model_stats(y_test,test_set_pred);

In [None]:
assert stats["R2"] > .796
assert stats["R2"] < .999