#### Jason Schenck
#### Date: 02/13/16
<hr>


## Boston Housing Assignment

In this assignment you'll be using linear regression to estimate the cost of houses in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the provided model using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new model using regularization (L2 for Ridge, L1 for Lasso)
> Use sklearn.linear_model.Ridge or sklearn.linear_model.Lasso
+  Get the best model you can by optimizing the regularization parameter.

In [1]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression



In [2]:
bean = datasets.load_boston()
print(bean.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [3]:
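def load_boston():
    """Load the Boston data, standardize the features, and return a random train/test split."""
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    # Note: the scaler is fitted on all rows, before the split.
    X = StandardScaler().fit_transform(X)
    # No random_state is set, so every call returns a different split.
    return train_test_split(X, y)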
    

In [4]:
X_train, X_test, y_train, y_test = load_boston()

In [5]:
X_train.shape

(379, 13)
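The helper above standardizes the full dataset before splitting, so the held-out rows contribute to the scaler's mean and standard deviation. A minimal sketch of the split-first alternative follows. The function name `split_then_scale` is hypothetical, the fixed `random_state` is an assumption added so that repeated calls return the same split, and `make_regression` data stands in for the Boston data (same shape, different values).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_then_scale(X, y, random_state=0):
    """Split first, then fit the scaler on the training rows only (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    # The scaler's mean and standard deviation come from the training rows only.
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Synthetic stand-in with the same shape as the Boston data (506 rows, 13 features).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = split_then_scale(X, y)
print(X_train.shape, X_test.shape)
```

With the default 75/25 split this gives the same training shape as above, and calling the helper again with the same `random_state` reproduces the split exactly.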

### Fitting a Linear Regression

It's as easy as instantiating a new regression object (line 1) and giving your regression object your training data
(line 2) by calling .fit(independent variables, dependent variable)



In [6]:
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Making a Prediction
X_test is our holdout set of data. We know the answer (y_test) but the computer does not.

Using the command below, I create a tuple for each observation, pairing the real value (y_test) with
the value our regressor predicts (clf.predict(X_test)).

Use a similar format to get your r2 and mse metrics working. Use the [scikit-learn API](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help!

In [7]:
list(zip(y_test, clf.predict(X_test)))

[(20.399999999999999, 19.890594582524699),
 (13.1, 13.721248305563673),
 (10.800000000000001, 11.410207753360217),
 (17.699999999999999, 20.672868758671722),
 (21.399999999999999, 24.364416550830267),
 (15.6, 15.617517777743258),
 (43.799999999999997, 35.600427776912767),
 (23.800000000000001, 25.176264336397882),
 (11.800000000000001, 12.8287358982341),
 (25.0, 29.923767372244686),
 (22.800000000000001, 25.059633816787645),
 (26.199999999999999, 24.338188771678674),
 (32.200000000000003, 31.388681924472735),
 (15.1, 17.449968729960077),
 (18.399999999999999, 16.467001528469467),
 (29.100000000000001, 30.395083519334303),
 (14.4, 6.6383699100961735),
 (13.800000000000001, -0.98799661870104671),
 (23.0, 24.300793887277049),
 (11.9, 7.8115821797585792),
 (21.699999999999999, 19.789708241145323),
 (14.1, 16.272874978156324),
 (22.199999999999999, 25.695772080710739),
 (24.600000000000001, 25.134047168153014),
 (23.699999999999999, 28.316535000729971),
 (16.100000000000001, 22.524605457011

## Task 1: Measure the performance of the model Professor Bernico created using $R^{2}$ and MSE

In [8]:
# Store the predicted values for easier access
y_predicted = clf.predict(X_test)

Check the $R^{2}$ score, also known as the coefficient of determination, using sklearn.metrics.r2_score. $R^{2}$ is at most 1, and it can be negative for a model that fits worse than always predicting the mean of y_test. The closer the value is to 1, the smaller the error between our predicted values (the line of fit) and the actual values (y_test). From Wikipedia:
>"In statistics, the coefficient of determination, denoted $R^{2}$ or $r^{2}$ and pronounced "R squared", is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s)."

In [9]:
r2_score(y_test, y_predicted)

0.72726891273767869
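To make the score concrete, here is a small sketch of the usual definition, $R^{2} = 1 - SS_{res}/SS_{tot}$, written in plain Python. The helper name `r_squared` is hypothetical, and the sample values are the first five (actual, predicted) pairs from the prediction output above, rounded to two decimals. This is an illustration of the formula, not a replacement for r2_score.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


# First five (actual, predicted) pairs from the output above, rounded.
y_true = [20.4, 13.1, 10.8, 17.7, 21.4]
y_pred = [19.89, 13.72, 11.41, 20.67, 24.36]
print(r_squared(y_true, y_pred))

# Always predicting the mean of y_true gives a score of 0 ...
mean_y = sum(y_true) / len(y_true)
print(r_squared(y_true, [mean_y] * len(y_true)))

# ... and predictions worse than the mean give a negative score.
print(r_squared(y_true, [0.0] * len(y_true)))
```

The last two calls show why the score has no lower bound: the mean predictor is the zero point, and anything worse than it goes below zero.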

Next, I will check the mean squared error (MSE). From Wikipedia:
> "In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations—that is, the difference between the estimator and what is estimated."

The smaller the MSE, the closer the predicted and observed values are.

In [10]:
mean_squared_error(y_test, y_predicted)

23.340201656823584
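MSE is in squared units of the target, which makes it hard to read directly. A short sketch in plain Python shows the definition and the square root (RMSE), which is back in the target's own units. The helper name `mse` is hypothetical. The interpretation in the last comment assumes the target is the median home value in thousands of dollars, as the dataset's full description states (that part of the description is cut off in the output above).

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values (hypothetical helper)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Tiny hand-checkable example: errors are 0, 0 and 2, so MSE = (0 + 0 + 4) / 3.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

# RMSE for the score reported above.
reported_mse = 23.340201656823584
rmse = math.sqrt(reported_mse)
print(round(rmse, 2))  # 4.83, i.e. a typical error of roughly $4,800 if the target is in $1000s
```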

## Task 2: Implement new models using L1 and L2 regularization (Lasso & Ridge)
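Lasso and Ridge differ in the penalty they add to the least-squares loss: Lasso penalizes the sum of the absolute coefficient values (L1), while Ridge penalizes the sum of their squares (L2). One practical consequence is that Lasso can set coefficients exactly to zero, whereas Ridge only shrinks them. The sketch below illustrates this on synthetic data. The `make_regression` settings and `alpha=1.0` are arbitrary assumptions chosen for the illustration, not values from this assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 13 features, of which only 3 actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=13, n_informative=3, noise=5.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count the coefficients each model set exactly to zero.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", lasso_zeros)
print("Ridge zero coefficients:", ridge_zeros)
```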

### Lasso:

In [183]:
# Import for lasso model
from sklearn import linear_model

# Reload the data. train_test_split is random, so this is a new split, not the one used above.
X_train, X_test, y_train, y_test = load_boston()
print("Data Shape: ", X_train.shape)

# Parameters follow the example in the sklearn docs; alpha was adjusted by hand a few times.
clf_lasso = linear_model.Lasso(alpha=0.05, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=True)
clf_lasso.fit(X_train, y_train)

Data Shape:  (379, 13)


Lasso(alpha=0.05, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=True)

In [184]:
# Make predictions on the held-out data.
y_pred_lasso = clf_lasso.predict(X_test)
list(zip(y_test, y_pred_lasso))  # (actual, predicted) pairs; not displayed, since it is not the cell's last expression

# Print the R-squared and MSE scores
print("R-Squared Score: ", r2_score(y_test, y_pred_lasso))
print("MSE Score: ", mean_squared_error(y_test, y_pred_lasso))

R-Squared Score:  0.764486033667
MSE Score:  19.8437462135


### Ridge:

In [115]:
# Import for Ridge model
from sklearn.linear_model import Ridge

# Reload the data. train_test_split is random, so this is a new split, not the one used above.
X_train, X_test, y_train, y_test = load_boston()
print("Data Shape: ", X_train.shape)

# Parameters follow the example in the sklearn docs; alpha was adjusted by hand a few times.
clf_ridge = Ridge(alpha=0.9997, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)
clf_ridge.fit(X_train, y_train)

Data Shape:  (379, 13)


Ridge(alpha=0.9997, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [116]:
# Make predictions on the held-out data.
y_pred_ridge = clf_ridge.predict(X_test)
list(zip(y_test, y_pred_ridge))  # (actual, predicted) pairs; not displayed, since it is not the cell's last expression

# Print the R-squared and MSE scores
print("R-Squared Score: ", r2_score(y_test, y_pred_ridge))
print("MSE Score: ", mean_squared_error(y_test, y_pred_ridge))

R-Squared Score:  0.802148948873
MSE Score:  17.9035829556
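Two caveats apply to the scores above. Each model was evaluated on a different random split, because load_boston() re-splits the data on every call, so the three $R^{2}$ values are not directly comparable. The alpha values were also adjusted by hand. The sketch below shows one way to address both, and to approach the remaining goal of optimizing the regularization parameter: fix the split with `random_state` and let `RidgeCV` and `LassoCV` choose alpha by cross-validation on the training data. The synthetic `make_regression` data and the alpha grid are assumptions standing in for the Boston data, so the printed numbers will not match the ones above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data (506 rows, 13 features).
X, y = make_regression(
    n_samples=506, n_features=13, n_informative=8, noise=20.0, random_state=0
)
# One fixed split shared by all models, so their scores are comparable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

alphas = np.logspace(-3, 2, 30)  # candidate regularization strengths (arbitrary grid)
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)),
    "lasso": make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5, max_iter=10000)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # alpha is chosen using the training data only
    pred = model.predict(X_test)
    results[name] = (r2_score(y_test, pred), mean_squared_error(y_test, pred))
    print(name, "R-Squared:", results[name][0], "MSE:", results[name][1])

# The alpha selected by cross-validation for each regularized model.
best_ridge_alpha = models["ridge"].named_steps["ridgecv"].alpha_
best_lasso_alpha = models["lasso"].named_steps["lassocv"].alpha_
print("Best Ridge alpha:", best_ridge_alpha)
print("Best Lasso alpha:", best_lasso_alpha)
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, and the test set is touched just once, for the final scores.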
