# Boston Housing Assignment

Using scikit-learn's `r2_score` and `mean_squared_error` metrics to measure the performance of a linear regressor and an `sklearn.linear_model.Ridge` model.
 

In [1]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [2]:
bean = datasets.load_boston()
print(bean.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [3]:
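def load_boston():
    # Standardize every feature to zero mean and unit variance, then split.
    # Note: the scaler is fit on all rows before the split, and no random_state
    # is set, so the split (and the scores below) change from run to run.
    scaler = StandardScaler()
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = scaler.fit_transform(X)
    return train_test_split(X, y)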
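
The helper above standardizes all 506 rows before splitting, so the scaler's mean and variance are computed with the test rows included. One hedged alternative is to split first and fit the scaler on the training rows only. The sketch below uses a synthetic array of the same shape; the data, the `random_state` value, and the `*_scaled` names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Boston data (506 rows, 13 features).
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(506, 13))
y = rng.normal(size=506)

# Split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training mean and variance
```

Fixing `random_state` also makes the split repeatable, which the original helper does not guarantee.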

In [4]:
X_train, X_test, y_train, y_test = load_boston()

In [5]:
X_train.shape

(379, 13)
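
The 379 rows follow from `train_test_split` holding out 25% of the data when no size is given: a quarter of 506 is 126.5, which is rounded up to 127 test rows, leaving 379 for training. A quick check on a placeholder array of the same shape (the zeros array is an illustrative stand-in, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: only the shape matters for this check.
X = np.zeros((506, 13))
y = np.zeros(506)

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)
```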

## Fitting a Linear Regression
It's as easy as instantiating a new regression object (line 1) and handing it your training data (line 2) by calling `.fit(independent variables, dependent variable)`.

In [6]:
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
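
After `.fit`, the learned weights are available as `coef_` and `intercept_`. As a minimal sketch on synthetic data (the weights 3 and -2 and the intercept 5 are made up for illustration), the regressor recovers a known, noise-free linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship and no noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

model = LinearRegression()
model.fit(X, y)

# The fitted parameters should match the weights used to build y.
print(model.coef_, model.intercept_)
```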

## Making a Prediction
`X_test` is our holdout set of data. We know the answers (`y_test`), but the regressor never saw them during training.
Using the command below, I pair each observation's real value (`y_test`) with the value our regressor predicts (`clf.predict(X_test)`), giving one tuple per observation.

In [7]:
list(zip(y_test, clf.predict(X_test)))

[(5.0, 9.3656636850083235),
 (19.300000000000001, 21.39753677389043),
 (37.899999999999999, 33.089125967384817),
 (24.5, 28.013571843863616),
 (23.399999999999999, 24.380522482250853),
 (26.600000000000001, 27.43482409595941),
 (23.800000000000001, 22.692242901613188),
 (22.699999999999999, 24.382585776139685),
 (20.800000000000001, 17.48766606718134),
 (16.399999999999999, 18.819193723828533),
 (24.399999999999999, 23.720144866575527),
 (24.699999999999999, 23.080754938498448),
 (13.0, 17.103337493587819),
 (13.1, 13.65067442617786),
 (29.100000000000001, 31.951307371477981),
 (22.899999999999999, 23.11703946662308),
 (22.5, 29.39625172526717),
 (7.2000000000000002, 17.920346361698957),
 (43.799999999999997, 35.412536165484127),
 (20.800000000000001, 16.996707961573563),
 (50.0, 31.711538630002075),
 (13.4, 14.719281444064926),
 (24.800000000000001, 25.95614950588903),
 (15.0, 13.933790483297321),
 (23.300000000000001, 21.699498262916688),
 (13.800000000000001, 13.585506300237551),
 (

In [8]:

mean_squared_error(y_test, clf.predict(X_test))

25.604546245580071
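
`mean_squared_error` averages the squared differences between the true and predicted values. The sketch below reproduces it by hand on the first four (true, predicted) pairs from the output above, with the predictions rounded to two decimals for readability:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# First four (true, predicted) pairs from the notebook output, rounded.
y_true = np.array([5.0, 19.3, 37.9, 24.5])
y_pred = np.array([9.37, 21.40, 33.09, 28.01])

# MSE by hand: mean of the squared residuals.
manual_mse = np.mean((y_true - y_pred) ** 2)
print(manual_mse, mean_squared_error(y_true, y_pred))
```

Taking the square root puts the error back in the target's own units; for the full test set above, the square root of 25.60 is roughly 5.06.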

In [9]:

r2_score(y_test, clf.predict(X_test))

0.70660580797569805
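
`r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values, so 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. A by-hand sketch on the first four (true, predicted) pairs from the earlier output, with the predictions rounded to two decimals:

```python
import numpy as np
from sklearn.metrics import r2_score

# First four (true, predicted) pairs from the notebook output, rounded.
y_true = np.array([5.0, 19.3, 37.9, 24.5])
y_pred = np.array([9.37, 21.40, 33.09, 28.01])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
manual_r2 = 1.0 - ss_res / ss_tot
print(manual_r2, r2_score(y_true, y_pred))
```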

## Linear Regression Results
MSE = 25.604546245580071

r2_score = 0.70660580797569805

In [10]:
# Use alpha=0.01 as the regularization strength for sklearn.linear_model.Ridge
from sklearn.linear_model import Ridge
r = Ridge(alpha=.01)
r.fit(X_train, y_train)

Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
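
Ridge adds an L2 penalty, `alpha` times the sum of squared coefficients, to the least-squares objective. With `alpha` as small as 0.01 the penalty is negligible next to the data term, so the fit should be nearly identical to ordinary least squares. A sketch on synthetic data (the data, the larger `alpha=1000.0`, and the variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem with known weights plus a little noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X.dot(np.array([1.5, -2.0, 0.5])) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge_small = Ridge(alpha=0.01).fit(X, y)
ridge_large = Ridge(alpha=1000.0).fit(X, y)

# A tiny alpha barely moves the coefficients; a large alpha shrinks them.
print(np.abs(ridge_small.coef_ - ols.coef_).max())
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge_large.coef_))
```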

In [11]:
list(zip(y_test, r.predict(X_test)))

[(5.0, 9.3671305427965983),
 (19.300000000000001, 21.398438435676677),
 (37.899999999999999, 33.088672377256174),
 (24.5, 28.013570343473472),
 (23.399999999999999, 24.380415168878233),
 (26.600000000000001, 27.434786450273705),
 (23.800000000000001, 22.692125229200361),
 (22.699999999999999, 24.382428119971138),
 (20.800000000000001, 17.48730028283769),
 (16.399999999999999, 18.819155379978532),
 (24.399999999999999, 23.720050366901265),
 (24.699999999999999, 23.080680941699065),
 (13.0, 17.103303326973762),
 (13.1, 13.651151258333199),
 (29.100000000000001, 31.951179514917399),
 (22.899999999999999, 23.117098999113502),
 (22.5, 29.396292270726597),
 (7.2000000000000002, 17.920342809863421),
 (43.799999999999997, 35.412751310794548),
 (20.800000000000001, 16.996414219379378),
 (50.0, 31.711081759620225),
 (13.4, 14.720037917903193),
 (24.800000000000001, 25.956060388333036),
 (15.0, 13.934380762973451),
 (23.300000000000001, 21.699461327913518),
 (13.800000000000001, 13.58620385395193

In [12]:
mean_squared_error(y_test, r.predict(X_test))


25.605134378214487

In [13]:
r2_score(y_test, r.predict(X_test))

0.70659906875456824

## Ridge Linear Regression Results
MSE = 25.605134378214487

r2_score = 0.70659906875456824
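
Both summaries come from the same fit, predict, score pattern, so the comparison can be scripted in one loop. The sketch below uses synthetic data from `make_regression`; the dataset, noise level, and `random_state` values are assumptions for illustration, and the numbers it prints are unrelated to the results above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with the same shape as the Boston set (506 rows, 13 features).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=0.01)": Ridge(alpha=0.01),
}

# Fit each model on the same split and record (MSE, R^2) on the test rows.
results = {}
for name, model in models.items():
    predictions = model.fit(X_train, y_train).predict(X_test)
    results[name] = (
        mean_squared_error(y_test, predictions),
        r2_score(y_test, predictions),
    )
    print(name, results[name])
```

In the notebook's own run the two MSE values differ by less than 0.001 (25.6045 versus 25.6051), which is consistent with such a small `alpha`.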
