### Loading data

In [30]:
from sklearn.datasets import load_boston
boston_market = load_boston()
print(boston_market["DESCR"])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [31]:
dir(boston_market)

['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'target']

In [32]:
print(boston_market['data'].shape)
print(boston_market['target'].shape)

(506, 13)
(506,)


In [33]:
print(boston_market.data[1,:])
print(boston_market.target[1])

[2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
 7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
 9.1400e+00]
21.6


### Representing data

In [34]:
import pandas as pd
# Pass the feature names directly; wrapping them in a list would create a MultiIndex
boston_p = pd.DataFrame(boston_market.data, columns=boston_market.feature_names)
boston_p.head(10)

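| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 |
| 6 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5.0 | 311.0 | 15.2 | 395.6 | 12.43 |
| 7 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5.0 | 311.0 | 15.2 | 396.9 | 19.15 |
| 8 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5.0 | 311.0 | 15.2 | 386.63 | 29.93 |
| 9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5.0 | 311.0 | 15.2 | 386.71 | 17.1 |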


In [35]:
boston_p.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


### Standardization with StandardScaler

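The raw features are not centered around 0 and their scales differ by orders of magnitude (compare the ranges of TAX and NOX above). Many terms in the objective functions of learning algorithms, such as the L1 and L2 regularizers of linear models, assume features that are centered around 0 with variances of the same order.

After **standardization**, each feature has zero mean and unit variance.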

In [36]:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
boston_market_scaled_data = standard_scaler.fit_transform(boston_market["data"])

print("Before standardization:\n", boston_market.data[1])
print("After standardization:\n", boston_market_scaled_data[1])

Before standardization:
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
 7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
 9.1400e+00]
After standardization:
 [-0.41733926 -0.48772236 -0.59338101 -0.27259857 -0.74026221  0.19427445
  0.36716642  0.55715988 -0.8678825  -0.98732948 -0.30309415  0.44105193
 -0.49243937]
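To make the transformation concrete, here is a minimal NumPy sketch of the column-wise z-score that `StandardScaler` applies with its default settings. The helper name `standardize` and the toy matrix are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)                   # ddof=0, i.e. population standard deviation
    std = np.where(std == 0.0, 1.0, std)  # leave constant columns unscaled
    return (X - mean) / std

# Toy matrix (made-up values): two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])
Z_toy = standardize(X_toy)
print(Z_toy.mean(axis=0))  # approximately 0 for each column
print(Z_toy.std(axis=0))   # approximately 1 for each column
```

Note that `describe()` reports the sample standard deviation (`ddof=1`), so dividing by the `std` row of that table gives slightly different values from the scaler's output.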


### Splitting the data into training and test sets with **sklearn.model_selection.train_test_split**

In [37]:
from sklearn.model_selection import train_test_split

boston_market_train_data, boston_market_test_data, \
boston_market_train_target, boston_market_test_target = \
train_test_split(boston_market_scaled_data, boston_market['target'], test_size=0.1)

In [38]:
import math

print("Training dataset for Boston market:")
print(boston_market_train_data.shape)
print(boston_market_train_target.shape)
# Expected size: whatever remains after the test set is taken out
print(506 - math.ceil(506 * 0.1))

Training dataset for Boston market:
(455, 13)
(455,)
455


In [39]:
print("Testing dataset for Boston market:")
print(boston_market_test_data.shape)
print(boston_market_test_target.shape)
# Expected size: train_test_split rounds the test fraction up
print(math.ceil(506 * 0.1))

Testing dataset for Boston market:
(51, 13)
(51,)
51
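The notebook fits the scaler on all 506 rows before splitting, so the test rows contribute to the scaling statistics. A common variant, sketched below as an alternative rather than as the notebook's method, splits first and lets a pipeline fit the scaler on the training rows only. The data here is synthetic (`make_regression`), and the `*_demo` names and `random_state=0` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# Split first; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when transforming the test rows.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
demo_r2 = model.score(X_test, y_test)  # R^2 on the held-out rows
print(X_train.shape, X_test.shape)
```

Passing a fixed `random_state` also makes the reported metrics repeatable across runs, which the unseeded split in the notebook does not guarantee.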


## Learning

### Training sklearn.linear_model.LinearRegression

In [40]:
from sklearn.linear_model import LinearRegression

boston_market_linear_regression = LinearRegression()
boston_market_linear_regression.fit(boston_market_train_data, boston_market_train_target)

LinearRegression()

### Training sklearn.linear_model.Lasso

In [41]:
from sklearn.linear_model import Lasso

boston_market_lasso_regression = Lasso(alpha=0.3)
boston_market_lasso_regression.fit(boston_market_train_data, boston_market_train_target)

Lasso(alpha=0.3)

### Training sklearn.linear_model.SGDRegressor

In [42]:
from sklearn.linear_model import SGDRegressor

boston_market_sgd_regression = SGDRegressor()
boston_market_sgd_regression.fit(boston_market_train_data, boston_market_train_target)

SGDRegressor()
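As a rough illustration of what stochastic gradient descent does for a squared-error loss, the sketch below updates the weights one sample at a time. It is a simplified stand-in, not `SGDRegressor` itself: it uses a constant learning rate and no regularization, whereas the library's defaults differ. The function name, the hyperparameters, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=50, seed=0):
    """Plain SGD on the per-sample loss 0.5 * (x @ w + b - y) ** 2."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order each epoch
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]
            w -= lr * error * X[i]  # gradient with respect to w
            b -= lr * error         # gradient with respect to b
    return w, b

# Synthetic data with known coefficients (made-up values)
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y_syn = X_syn @ w_true + 1.0 + rng.normal(scale=0.01, size=200)

w_hat, b_hat = sgd_linear_regression(X_syn, y_syn)
print(w_hat, b_hat)  # close to w_true and 1.0
```

Because each update depends on raw feature values, this kind of solver is sensitive to feature scale, which is one reason the data was standardized beforehand.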

## Model evaluation

In [43]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

### LinearRegression

In [44]:
linear_regression_predict = boston_market_linear_regression.predict(boston_market_test_data)

linear_regression_mean_sq_error = mean_squared_error(boston_market_test_target, linear_regression_predict)
linear_regression_variance_score = r2_score(boston_market_test_target, linear_regression_predict)

print("Mean squared error for linear regression: %.2f" % linear_regression_mean_sq_error)
print("Variance score for linear regression: %.2f" % linear_regression_variance_score)

Mean squared error for linear regression: 27.93
Variance score for linear regression: 0.76
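For reference, the two metrics printed above can be written out by hand. The sketch below uses NumPy and made-up target values; the helper names `mse` and `r2` are illustrative, not scikit-learn functions.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up example values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(mse(y_true, y_pred))           # 0.375
print(round(r2(y_true, y_pred), 4))  # 0.9486
```

An R² of 1 means a perfect fit, 0 matches always predicting the mean of the targets, and negative values mean the model does worse than that baseline.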


##### With cross-validation

In [45]:
linear_cross_valid_scores = cross_val_score(LinearRegression(), boston_market_scaled_data, boston_market['target'], cv=6)
print(linear_cross_valid_scores)
# Average over all six folds (cv=6)
print("%.4f" % linear_cross_valid_scores.mean())

[ 0.64286835  0.6124552   0.51498797  0.78529513 -0.14696285 -0.00747687]
0.4002
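To show what the per-fold scores represent, here is a hand-rolled sketch of k-fold cross-validation with contiguous, unshuffled folds, which, to my understanding, is how an integer `cv` behaves for regressors by default. It fits ordinary least squares with `np.linalg.lstsq` on synthetic data; the function names and data are assumptions made for the example.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def kfold_r2_scores(X, y, k):
    """Fit least squares on k-1 folds and score R^2 on the held-out fold."""
    indices = np.arange(len(y))
    folds = np.array_split(indices, k)  # contiguous, unshuffled folds
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(indices, test_idx)
        # Prepend a column of ones so the first coefficient is the intercept
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        scores.append(r2(y[test_idx], A_test @ coef))
    return np.array(scores)

# Synthetic data (made-up coefficients)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=60)

scores = kfold_r2_scores(X_syn, y_syn, k=6)
print(scores)
print(scores.mean())  # average over all k folds
```

With unshuffled folds, rows that are ordered in some meaningful way can produce folds that differ systematically from the rest, which is one possible reason for the low scores on the last two folds above.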


### Lasso

In [46]:
lasso_regression_predict = boston_market_lasso_regression.predict(boston_market_test_data)

lasso_regression_mean_sq_error = mean_squared_error(boston_market_test_target, lasso_regression_predict)
lasso_regression_variance_score = r2_score(boston_market_test_target, lasso_regression_predict)

print("Mean squared error for lasso regression: %.2f" % lasso_regression_mean_sq_error)
# Report the Lasso score here, not linear_regression_variance_score
print("Variance score for lasso regression: %.2f" % lasso_regression_variance_score)

Mean squared error for lasso regression: 33.15


##### With cross-validation

In [47]:
lasso_cross_valid_scores = cross_val_score(Lasso(alpha=0.3), boston_market_scaled_data, boston_market['target'], cv=6)
print(lasso_cross_valid_scores)
# Average over all six folds (cv=6)
print("%.4f" % lasso_cross_valid_scores.mean())

[ 0.71977325  0.70723812  0.50743388  0.72275262 -0.13691052  0.11145143]
0.4386


### Stochastic Gradient Descent

In [48]:
sgd_regression_predict = boston_market_sgd_regression.predict(boston_market_test_data)

sgd_regression_mean_sq_error = mean_squared_error(boston_market_test_target, sgd_regression_predict)
sgd_regression_variance_score = r2_score(boston_market_test_target, sgd_regression_predict)

print("Mean squared error for sgd regression: %.2f" % sgd_regression_mean_sq_error)
print("Variance score for sgd regression: %.2f" % sgd_regression_variance_score)

Mean squared error for sgd regression: 28.30
Variance score for sgd regression: 0.76


##### With cross-validation

In [49]:
sgd_cross_valid_scores = cross_val_score(SGDRegressor(), boston_market_scaled_data, boston_market['target'], cv=6)
print(sgd_cross_valid_scores)
# Average over all six folds (cv=6)
print("%.4f" % sgd_cross_valid_scores.mean())

[ 0.65099435  0.58440434  0.51806986  0.78629624 -0.16179572  0.02231388]
0.4000


## Prediction

In [50]:
sample_id = 5

### LinearRegression

In [51]:
linear_regression_prediction = boston_market_linear_regression.predict(boston_market_test_data[sample_id,:].reshape(1,-1))

print("Model predicted for house = {0} value {1}".format(sample_id, linear_regression_prediction))
print("Real value for house = {0} value {1}".format(sample_id, boston_market_test_target[sample_id]))
print('Coefficients of a learned model: \n', boston_market_linear_regression.coef_)

Model predicted for house = 5 value [31.90642518]
Real value for house = 5 value 33.2
Coefficients of a learned model: 
 [-0.83523992  0.76056796  0.37268977  0.76112575 -1.97649451  2.5734149
  0.11360303 -2.76222327  2.66141538 -1.91921502 -2.06375878  1.03715287
 -4.04824417]
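The printed coefficients fully determine the prediction: for a fitted linear model it is the intercept plus the dot product of the coefficients with the feature vector. The sketch below checks this on synthetic, noise-free data; the variable names and data are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (made-up coefficients and intercept)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(50, 4))
y_syn = X_syn @ np.array([1.0, -2.0, 0.0, 3.0]) + 5.0

reg = LinearRegression().fit(X_syn, y_syn)

sample = X_syn[5]
# predict expects a 2-D array, hence the reshape to a single row
by_predict = reg.predict(sample.reshape(1, -1))[0]
by_hand = reg.intercept_ + sample @ reg.coef_
print(by_predict, by_hand)  # the two values agree
```

Because the features were standardized, coefficient magnitudes are roughly comparable across features. In the output above, the largest magnitude belongs to the last feature, LSTAT.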


### Lasso

In [52]:
lasso_regression_prediction = boston_market_lasso_regression.predict(boston_market_test_data[sample_id,:].reshape(1,-1))

print("Model predicted for house = {0} value {1}".format(sample_id, lasso_regression_prediction))
print("Real value for house = {0} value {1}".format(sample_id, boston_market_test_target[sample_id]))
print('Coefficients of a learned model: \n', boston_market_lasso_regression.coef_)

Model predicted for house = 5 value [31.65260384]
Real value for house = 5 value 33.2
Coefficients of a learned model: 
 [-0.17835818  0.         -0.          0.65048963 -0.40377936  2.81613311
 -0.         -1.03300128  0.         -0.         -1.61542146  0.81452814
 -3.96510303]
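Several Lasso coefficients above are exactly zero, unlike in the other two models. One way to see why, under the simplifying assumption of uncorrelated standardized features (which the Boston features are not), is that the L1 penalty then amounts to soft-thresholding each least-squares coefficient by `alpha`. The sketch below uses made-up coefficient values; it illustrates the mechanism and is not scikit-learn's actual solver.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink each value toward zero by alpha; values within alpha of zero become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# Hypothetical least-squares coefficients (made-up values)
ols_coef = np.array([-0.2, 0.1, 2.8, -4.0])
lasso_coef = soft_threshold(ols_coef, alpha=0.3)
print(lasso_coef)  # small entries become exactly zero, large ones shrink by alpha
```

A larger `alpha` widens the zero band, so more coefficients are dropped from the model.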


### Stochastic Gradient Descent

In [53]:
sgd_regression_prediction = boston_market_sgd_regression.predict(boston_market_test_data[sample_id,:].reshape(1,-1))

print("Model predicted for house = {0} value {1}".format(sample_id, sgd_regression_prediction))
print("Real value for house = {0} value {1}".format(sample_id, boston_market_test_target[sample_id]))
print('Coefficients of a learned model: \n', boston_market_sgd_regression.coef_)

Model predicted for house = 5 value [32.22598944]
Real value for house = 5 value 33.2
Coefficients of a learned model: 
 [-0.791065    0.62011932  0.07326268  0.80983955 -1.76922169  2.65223318
  0.03867356 -2.63933013  1.84627835 -1.09115078 -2.00055899  1.05960826
 -4.00703434]
