## Boston Housing Assignment

In this assignment you'll use linear regression to estimate house prices in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the model I created using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new regularized model (Ridge uses an L2 penalty, Lasso an L1 penalty)
> Use sklearn.linear_model.Ridge or sklearn.linear_model.Lasso 
+  Get the best model you can by optimizing the regularization parameter.   
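
For reference, the two penalties differ only in the norm applied to the coefficient vector. The objectives below follow the scikit-learn documentation for `Ridge` and `Lasso`; the symbols ($X$ for the feature matrix, $y$ for the target, $w$ for the coefficients, $n$ for the number of samples, $\alpha$ for the regularization parameter) are notation introduced here, not taken from the assignment.

```latex
\text{Ridge (L2):}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad
\text{Lasso (L1):}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

With $\alpha = 0$ both reduce to ordinary least squares; larger values of $\alpha$ shrink the coefficients more strongly. The intercept is not penalized.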

In [1]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
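# train_test_split moved to sklearn.model_selection in scikit-learn 0.18; sklearn.cross_validation was later removed
from sklearn.model_selection import train_test_split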
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [2]:
bean = datasets.load_boston()
print bean.DESCR

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [3]:
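def load_boston():
    # Load the data, standardize the features, and return a random train/test split
    scaler = StandardScaler()
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    # Note: the scaler is fit on all rows here, before the split
    X = scaler.fit_transform(X)
    # train_test_split holds out 25% of the rows by default
    return train_test_split(X, y)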
    

In [4]:
X_train, X_test, y_train, y_test = load_boston()

In [5]:
X_train.shape

(379L, 13L)
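
The loader above standardizes every row before splitting, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. Here is a minimal sketch of that ordering. It assumes a recent scikit-learn and Python 3, and it uses made-up synthetic data (`X`, `y`, and the coefficients are hypothetical) rather than the Boston set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (hypothetical data, not the Boston set)
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

# Split first so the test rows play no part in computing the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.shape, X_test_scaled.shape)
```

Passing `random_state` also makes the split repeatable, which helps when comparing models across runs.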

### Fitting a Linear Regression

Fitting takes two steps: instantiate a new regression object (line 1), then give that object your training data
(line 2) by calling .fit(independent variables, dependent variable).



In [6]:
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
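
To see what .fit actually learns, it can help to try it on data where the answer is known. The sketch below uses hypothetical, noise-free data generated from y = 2*x0 - 3*x1 + 1 (the names `X_demo`, `y_demo`, and `demo` are made up for this illustration), then reads the fitted slope and intercept back from the `coef_` and `intercept_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 2*x0 - 3*x1 + 1
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 1.0

demo = LinearRegression()
demo.fit(X_demo, y_demo)

print(demo.coef_)       # one coefficient per feature
print(demo.intercept_)  # the constant term
```

Because the data has no noise, the recovered coefficients should be very close to 2 and -3, and the intercept very close to 1.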

### Making a Prediction
X_test is our holdout set of data.  We know the answer (y_test) but the computer does not.   

Using the command below, I create a tuple for each observation, where I'm combining the real value (y_test) with
the value our regressor predicts (clf.predict(X_test))

Use a similar format to get your r2 and mse metrics working.  Use the [scikit-learn API](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help!

In [7]:
list(zip(y_test, clf.predict(X_test)))

[(16.199999999999999, 14.934111473561986),
 (18.5, 19.322066513247371),
 (18.199999999999999, 19.089616349443425),
 (29.600000000000001, 25.178325080974297),
 (35.100000000000001, 35.307831002685944),
 (15.0, 18.801808788235789),
 (23.199999999999999, 18.233997910079196),
 (23.100000000000001, 10.317583496596818),
 (24.399999999999999, 23.602938054922429),
 (17.100000000000001, 20.084666734900395),
 (23.300000000000001, 28.431228608747467),
 (12.300000000000001, 13.981159147899476),
 (14.800000000000001, 14.495524088419224),
 (20.899999999999999, 21.15639158922442),
 (45.399999999999999, 38.514628576872639),
 (14.199999999999999, 18.633914255735554),
 (10.4, 6.8167075602965621),
 (22.199999999999999, 19.56951305094886),
 (7.0, 8.8219284906517874),
 (26.199999999999999, 24.402001872851635),
 (14.4, 9.0851904118948088),
 (22.100000000000001, 26.846428067516008),
 (25.100000000000001, 31.143980653341067),
 (33.100000000000001, 35.702393069004252),
 (13.6, 12.512582643514264),
 (21.3999999

### Assignment

1.  Implement scikit-learn's r2 and mse methods to measure the performance of my linear regressor.

2.  Implement either sklearn.linear_model.Ridge or sklearn.linear_model.Lasso.

3.  Optimize (by reviewing the r2 and mse scores and adjusting the regularization parameter) the regression model you pick.


I found this to be useful.  

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

In [26]:
import math
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV


In [11]:
print 'Your Linear Regressor R2: ', r2_score(y_test, clf.predict(X_test))
print 'Your Linear Regressor MSE: ', mean_squared_error(y_test, clf.predict(X_test))
print 'Your Linear Regressor RMSE: ', math.sqrt(mean_squared_error(y_test, clf.predict(X_test)))

Your Linear Regressor R2:  0.744962114957
Your Linear Regressor MSE:  21.882568998
Your Linear Regressor RMSE:  4.67788082341
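
The three numbers above come straight from the standard definitions: MSE is the mean of the squared errors, RMSE is its square root, and $R^{2}$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. As a rough illustration, here they are written out in plain Python. The helper names `mse` and `r2` and the small `actual`/`predicted` lists are hypothetical, picked only to keep the arithmetic easy to check by hand.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical values, chosen only to make the arithmetic easy to follow
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

print(mse(actual, predicted))
print(r2(actual, predicted))
print(math.sqrt(mse(actual, predicted)))
```

For these four points the squared errors sum to 2, so the MSE is 2/4 = 0.5; the total sum of squares is 20, so $R^{2}$ is 1 - 2/20 = 0.9.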


In [106]:
clf_rcv = RidgeCV(alphas=[.1, .4, .8, 1, 4, 8, 10])
clf_rcv.fit(X_train, y_train)

RidgeCV(alphas=[0.1, 0.4, 0.8, 1, 4, 8, 10], cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)

In [107]:
clf_rcv.alpha_

8.0
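
RidgeCV picks alpha_ by cross-validating on the training data only, so the test set stays untouched during tuning (with cv=None it uses an efficient form of leave-one-out). The same idea can be written as an explicit loop. The sketch below is not what RidgeCV does internally: it assumes 5-fold cross-validation, a recent scikit-learn, and synthetic data (`X_cv`, `y_cv` are made up), and it simply keeps the alpha with the lowest average validation MSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for X_train / y_train
rng = np.random.RandomState(0)
X_cv = rng.normal(size=(120, 5))
y_cv = X_cv @ np.array([3.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=2.0, size=120)

alphas = [0.1, 0.4, 0.8, 1, 4, 8, 10]  # same grid as the RidgeCV call above
mean_mse = {}
for alpha in alphas:
    # scikit-learn reports negated MSE so that "higher is better"; flip the sign back
    scores = cross_val_score(Ridge(alpha=alpha), X_cv, y_cv,
                             cv=5, scoring="neg_mean_squared_error")
    mean_mse[alpha] = -scores.mean()

best_alpha = min(mean_mse, key=mean_mse.get)
print(best_alpha, mean_mse[best_alpha])
```

Note that a fitted RidgeCV object already holds the coefficients for its chosen alpha, so it can be used for prediction directly.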

In [108]:
clf_r = Ridge(alpha=8)
clf_r.fit(X_train, y_train)

Ridge(alpha=8, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [109]:
print 'My Ridge r2: ', r2_score(y_test, clf_r.predict(X_test))
print 'My Ridge MSE: ', mean_squared_error(y_test, clf_r.predict(X_test))
print 'My Ridge RMSE: ', math.sqrt(mean_squared_error(y_test, clf_r.predict(X_test)))

My Ridge r2:  0.744142926454
My Ridge MSE:  21.9528563945
My Ridge RMSE:  4.68538753941
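
The Ridge scores are almost identical to the plain linear regression scores (r2 of 0.7441 versus 0.7450), which suggests the penalty at alpha=8 changes the fit only slightly on this split. To get a feel for what alpha does to the coefficients, here is a small sketch of the closed-form ridge solution in NumPy. The function name `ridge_coefficients` and the synthetic data are assumptions made for illustration; the data is centered first, which plays the role of fitting an unpenalized intercept.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data: (X'X + alpha*I)^-1 X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Hypothetical synthetic data, not the Boston set
rng = np.random.RandomState(0)
X_syn = rng.normal(size=(100, 4))
y_syn = X_syn @ np.array([4.0, -3.0, 2.0, 1.0]) + rng.normal(size=100)

for alpha in [0, 1, 8, 100, 1000]:
    w = ridge_coefficients(X_syn, y_syn, alpha)
    # Print alpha, the coefficients, and the overall size of the coefficient vector
    print(alpha, np.round(w, 3), round(float(np.linalg.norm(w)), 3))
```

The size of the coefficient vector shrinks as alpha grows; alpha=0 gives the ordinary least squares solution.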
