# Machine Learning - SF vs NYC Housing 
#### Going Down the EECS Stack DeCal Spring 2017
#### Data courtesy of [r2d3](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)

<img src="https://grapecollective.com/media/article/image/cache/720x337-center/c/o/comparison.jpg">

In this notebook, we'll explore some ideas behind machine learning using housing data from San Francisco and New York. </br>
Each data point corresponds to a house with the following fields:
- year_build (int, e.g. 1990)
- price_per_sqft (int, dollars)
- bath (float)
- beds (float)
- elevation (int, ft)
- price (int, dollars)
- in_sf (int, 1 if in sf, 0 otherwise)

In the first part of this notebook we'll try to classify houses and predict whether a house is from NYC or SF. In the second part we'll try to regress and predict the prices of the houses.

<br />
Mathy Notation for later parts:
- $n$ denotes the number of data points (houses) 
- $d$ denotes the number of features 
- $X$ is an $n \times d$ matrix, where each row corresponds to a house. $X_i$ means the $i$th row, or the $i$th feature vector.
- $y$ is a length $n$ vector, where each index corresponds to a label. $y_i$ means the label for the $i$th house. For part 1 the labels are either $1$ or $0$ for SF and NYC respectively. For part 2 the labels are real numbers denoting housing price.


In [1]:
# imports
import numpy as np
%matplotlib notebook
from sklearn.metrics import mean_absolute_error, r2_score
from IPython.display import clear_output
from time import sleep
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from util import extract_cols, visualize_linear_regression, visualize_perceptron, load_data

In [2]:
# load housing data
all_data, features_c, labels_c, features_r, labels_r = load_data()

## Part 1: Classification: NYC vs SF
In this section, our goal is to learn a model that predicts whether a given house is from SF or from NYC. Our label is "is_sf", which is 1 if the house belongs to SF, 0 if NYC.

### Part 1.1 Manual Classification
1. Use the cell below to explore statistics about the housing data. You can look for things such as "average elevation of houses in SF" vs "average elevation of houses in NY".
2. Use your knowledge about the data to fill in the function "is_in_sf," which takes in a feature dictionary and returns 1 if you think this house belongs in SF, 0 if it belongs to NY
3. Test how good your result is. Try to improve your score as much as possible!

In [3]:
# 1) explore the data
def query(in_sf, feature_name, statistic):
    '''
    in_sf is either 1 or 0
    statistic is either 'mean', 'std', 'min', 'max'
    feel free to modify this function to explore other properties about the housing data.
    '''
    if in_sf not in (0, 1):
        raise ValueError("Location can only be 0 for nyc or 1 for sf!")
    if feature_name not in features_c[0]:
        raise ValueError("Invalid feature_name!")
    if statistic not in ('mean', 'std', 'max', 'min'):
        raise ValueError("Statistic can only be 'mean', 'std', 'min', 'max'")
    
    col = []
    for data in all_data:
        if data['in_sf'] == in_sf:
            col.append(data[feature_name])
    
    statistics_map = {
        'min': np.min,
        'max': np.max,
        'mean': np.mean,
        'std': np.std
    }
    
    return statistics_map[statistic](col)
    
print "The mean elevation of houses in NYC is", query(0, 'elevation', 'mean')
print "The mean elevation of houses in SF is", query(1, 'elevation', 'mean')

The mean elevation of houses in NYC is 13.5044642857
The mean elevation of houses in SF is 61.8619402985


In [4]:
# 2) Implement is_in_sf
def is_in_sf(feature):
    '''
    feature is a dictionary with the following keys: 
    - 'year_built'
    - 'price_per_sqft'
    - 'bath'
    - 'beds'
    - 'elevation'
    - 'price'
    
    return 1 if house is predicted to be in SF, 0 if NYC
    '''
    ### Your Code Below ###
    # SAMPLE CODE #
    if feature['elevation'] > 30:
        return 1
    else:
        0

In [5]:
# 3) Test your performance!
num_correct = 0
sf_but_pred_ny = 0
total_num = len(features_c)
for i, feature in enumerate(features_c):
    prediction_in_sf = is_in_sf(feature)
    if prediction_in_sf == labels_c[i]:
        num_correct += 1
    else:
        if labels_c[i] and not prediction_in_sf:
            sf_but_pred_ny += 1
print "Got {:.2f}% correct!".format(num_correct/1./total_num*100)
print "Out of incorrect predictions, {:.2f}% were SF houses predicted to be in NYC".format(
                                                                                sf_but_pred_ny/1./(total_num - num_correct)*100)

Got 37.20% correct!
Out of incorrect predictions, 27.51% were SF houses predicted to be in NYC


### Part 1.2 The Perceptron Algorithm

In this section we will implement the perceptron algorithm, which will learn a linear decision boundary function $f(x)$ of the form:
$$
f(X_i) = \left\{
        \begin{array}{ll}
            1 & \quad w^\top X_i + b > 0 \\
            0 & \quad else
        \end{array}
    \right.
$$
You can think of this function as drawing a line in the feature space. If a data point is above this line, we'll say it's from SF. If a data point is below this line, we'll say it's from NYC. In the 1D case where there is only 1 feature used, $w$ would be the slope of the line, and $b$ the y-intercept.

<br />

Note that $f$ is *parameterized* by $w$ and $b$. So our goal is to find the $w$ and $b$ that best minimizes a *loss* function:
$$
L(X, y) = \frac{1}{N} \sum_{i=1}^{N} |f_{w, b}(X_i) - y_i|
$$

<br />
The perceptron algorithm has 2 *hyperparameters*: the learning rate and the number of epochs to be trained on. In addition, we can also select a smaller set of features instead of using all of them to learn on (sometimes this performs better). Complete the steps below:
1. Understand the perceptron algorithm implementation
2. Experiment w/ learning rate and epochs. How do these affect the train and test performance? Why?
3. Experiment w/ feature selection. Which features seem to work the best? Why?
4. Tweak the above settings to get the best test performance. 
5. If you used 2 or 3 features, you can run the visualization code to visualize the decision boundary.

In [6]:
# helper functions
def f(w, b, x):
    if w.dot(x) + b > 0:
        return 1
    else:
        return 0

def classification_accuracy(X, y, w, b):
    y_pred = [f(w, b, x) for x in X]
    return (1 - mean_absolute_error(y_pred, y))*100

def regression_mae(X, y, w, b):
    y_pred = X.dot(w) + b
    return mean_absolute_error(y_pred, y)

In [7]:
# 1) Read this implementation. Try matching it with the pseudocode
def perceptron_learn_w_b(X_tr, y_tr, X_t, y_t, epochs, learning_rate):
    '''
    Run the perceptron algorithm for epochs iterations
    Return w, b
    '''
    # dimensions
    N = X_tr.shape[0] # number of data points we have
    d = X_tr.shape[1] # dimension of a feature vector
    
    # initialize weights
    w = np.zeros(d) # a vector of 0's of size d
    b = 0 # bias starts at 0
    
    # perceptron learning algorithm
    for t in range(epochs):
        for i in range(N):            
            x = X_tr[i]
            pred_y = f(w, b, x)
            error = y_tr[i] - pred_y
            
            # TODO: HIDE THIS FOR IMPLEMENTATION??
            b = b + learning_rate * error
            w = w + learning_rate * error * x
        
        # reporting accuracy
        train_accuracy = classification_accuracy(X_tr, y_tr, w, b)
        test_accuracy = classification_accuracy(X_t, y_t, w, b)
        clear_output(wait=True)
        print 'epoch={}/{}'.format(t+1, epochs), 'train={:.2f}%'.format(train_accuracy), 'test={:.2f}%'.format(test_accuracy)
                
    return w, b

In [8]:
# 2) Choose hyperparameters
epochs = 100
learning_rate = 1

In [9]:
# 3) Choose which features to use
# available features are 'year_built', 'price_per_sqft', 'bath', 'beds', 'elevation', 'price'
features_to_use_c = ['elevation', 'price_per_sqft', 'bath', 'beds', 'elevation', 'price']

In [10]:
# 4) Run this cell to train perceptron!
index_train_split = int(0.8 * len(features_c))
features_c_small = extract_cols(features_c, features_to_use_c)
features_c_train_array, features_c_test_array = features_c_small[:index_train_split], features_c_small[index_train_split:]
labels_c_train_array, labels_c_test_array = np.array(labels_c[:index_train_split]), np.array(labels_c[index_train_split:])

w_c, b_c = perceptron_learn_w_b(features_c_train_array, labels_c_train_array, features_c_test_array, labels_c_test_array, 
                            epochs, learning_rate)
print 'w is ', w_c, 'b is ', b_c

epoch=100/100 train=46.82% test=40.40%
w is  [  834777.   2758516.     18145.5    25886.    834777.    -91700. ] b is  9501


In [11]:
# 5) visualize data and learned decision boundary
visualize_perceptron(features_to_use_c, features_c_train_array, labels_c_train_array, w_c, b_c)

Visualize 2d can only take 2 or 3 features at once!


## Part 2 Regression on Housing Price

In this section, our goal is to learn a model that predicts house prices. Our label is now "price" instead of "in_sf." "in_sf" is now included as a feature.

### Part 2.1 Linear Regression for Expected House Price

In linear regression, we use a linear function to map from input features to output labels. Similar to the perceptron algorithm above, this model has the form:
$$
y_i = w^\top X_i + b
$$
The new $f$ for regression which outputs the predicted $y$ values will be:
$$
f(X) = Xw + B
$$
Where $B$ is a length $n$ vector, and every single value is $b$. Our loss function will be the same as above:
$$
L(X, y) = \frac{1}{N} \sum_{i=1}^{N} |f_{w, b}(X_i) - y_i|
$$

Complete the steps below:
1. Implement Linear Regression using $b$ as the mean of $y$ and $w$ the pseudoinverse of $X$
2. Experiment with which features to use to get the best performance
3. Run Linear Regression. Observe results. MAE stands for Mean Absolute Error. 
4. Visualize the linear regression line for when using only 1 feature or 2 features.
5. Are the results good or bad? Why?

In [12]:
# 1) Implement Linear Regression
def linear_regression(X_tr, y_tr, X_t, y_t):
    '''
    return weight vector w and bias b
    '''
    # TODO: ERASE FOR IMPLEMENTATION?
    b = y_tr.mean()
    
    w = np.linalg.inv(X_tr.T.dot(X_tr)).dot(X_tr.T).dot(y_tr - b)
    
    train_mae = regression_mae(X_tr, y_tr, w, b)
    test_mae = regression_mae(X_t, y_t, w, b)
    
    print "Train MAE {}".format(train_mae), "Test MAE {}".format(test_mae)   
    
    return w, b

In [13]:
# 2) Choose which features to use
features_to_use_r = ['price_per_sqft', 'year_built', 'bath', 'beds', 'elevation', 'in_sf']

In [14]:
# 3) Run this cell to run linear regression
index_train_split = int(0.8 * len(features_r))
features_r_small = extract_cols(features_r, features_to_use_r)
features_r_train_array, features_r_test_array = features_r_small[:index_train_split], features_r_small[index_train_split:]
labels_r_train_array, labels_r_test_array = np.array(labels_r[:index_train_split]), np.array(labels_r[index_train_split:])

w_r, b_r = linear_regression(features_r_train_array, labels_r_train_array, features_r_test_array, labels_r_test_array)
print 'w is ', w_r, 'b is ', b_r

Train MAE 740438.642185 Test MAE 850913.904033
w is  [  2.76338481e+03  -2.87204623e+03   7.80270315e+05   2.40583926e+05
   5.51170228e+02   4.22113454e+05] b is  2039970.8855


In [15]:
# 5) Visualize linear predictor for 1 or 2 features
visualize_linear_regression(features_to_use_r, features_r_train_array, labels_r_train_array, w_r, b_r)

Can only visualize 1 or 2 features at a time.


### Part 2.2 Regression with Decision Tree

The main drawback of linear regression is that it is a model with low expressiveness (or representational power) - it can't fit to complex patterns in data. Another popular method in supervised learning is called Decision Tree. Below we demonstrate fitting this data using decision trees and show the improvements in prediction.

<br />

First, because mean absolute error can be difficult to interpret, we introduce another metric called the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), or r2 score. R2 scores roughly compute how good a set of predictions are given ground truth data. A higher r2 score means better predictions, and 100% accuracy correspond to an r2 score of 1. 

<br />
Complete the following steps:
1. Compute r2 scores for linear regressor
2. Run Decision Tree Regressor
3. Compute r2 scores for decision tree
4. Export and visualize decision tree using dot. Then run the command below in terminal to generate a png file. Compare results with your neighbors. Which features are being split on? Which feature is the most important one?

```bash
dot -Tpng decision_tree_regressor.dot -o tree.png
```

In [16]:
# 1) Compute linear regressor's r2 score using the linear regressor above
pred_tr_linear = features_r_train_array.dot(w_r) + b_r
pred_t_linear = features_r_test_array.dot(w_r) + b_r
pred_tr_linear_r2 = r2_score(pred_tr_linear, labels_r_train_array)
pred_t_linear_r2 = r2_score(pred_t_linear, labels_r_test_array)
print "Linear Regressor | Train r2 {:.2f}. Test r2 {:.2f}".format(pred_tr_linear_r2, pred_t_linear_r2)

Linear Regressor | Train r2 0.81. Test r2 0.19


In [17]:
# 2) Train a Decision Tree Regressor
decision_tree = DecisionTreeRegressor()
decision_tree.fit(features_r_train_array, labels_r_train_array)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [18]:
# 3) Compute decision tree regressor's r2 score
pred_tr_decisiontree = decision_tree.predict(features_r_train_array)
pred_t_decisiontree = decision_tree.predict(features_r_test_array)
pred_tr_decisiontree_r2 = r2_score(pred_tr_decisiontree, labels_r_train_array)
pred_t_decisiontree_r2 = r2_score(pred_t_decisiontree, labels_r_test_array)
print "Decision Tree Regressor | Train r2 {:.2f}. Test r2 {:.2f}".format(pred_tr_decisiontree_r2, pred_t_decisiontree_r2)

Decision Tree Regressor | Train r2 1.00. Test r2 0.65


In [19]:
# 4) export decision tree
export_graphviz(decision_tree, out_file='decision_tree_regressor.dot', feature_names=features_to_use_r)