The code in this notebook builds a simple linear regression model on the Boston housing prices dataset.

In [7]:
import pandas as pd
from sklearn import datasets
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
# Load the Boston dataset; load_boston was removed in scikit-learn 1.2, so this call needs an earlier release
data = datasets.load_boston()

The dataset describes Boston house prices and ships with a text description, which you can print with print(data.DESCR). The DESCR attribute is provided by the datasets bundled with scikit-learn, not by arbitrary data you load yourself. Let's read the description to better understand the variables.

In [9]:
print(data.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

data.feature_names holds the column names of the independent variables, and data.target holds the values of the dependent variable. Next, fit a linear regression model on this dataset.

In [10]:
import numpy as np
import pandas as pd
# define the data/predictors as the pre-set feature names  
X = pd.DataFrame(data.data, columns=data.feature_names)

# Put the target (housing value -- MEDV) in another DataFrame
y = pd.DataFrame(data.target, columns=["MEDV"])

The data is now loaded into pandas DataFrames. The predictor variables are in `X`, and `y` holds the dependent variable (MEDV).

Choose variables that you expect to be good predictors of the dependent variable. To do this, check the correlation between each variable and the target, and plot the data to look for relationships visually. It also helps to do some preliminary research into which variables are known to predict the target well.
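
The correlation check described above can be sketched as follows. This is a minimal illustration on synthetic data, not the Boston values: the column names `rooms`, `crime`, and `noise_col`, the coefficients, and the `X_demo`/`y_demo` variables are all assumptions made for the example. With the notebook's own data, the equivalent call would be `X.corrwith(y["MEDV"])`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical; these are not Boston values).
rng = np.random.default_rng(0)
n = 200
rooms = rng.normal(6.0, 0.7, n)
crime = rng.exponential(3.0, n)
noise_col = rng.normal(0.0, 1.0, n)  # unrelated to the target by construction
price = 9.0 * rooms - 1.5 * crime + rng.normal(0.0, 2.0, n)

X_demo = pd.DataFrame({"rooms": rooms, "crime": crime, "noise_col": noise_col})
y_demo = pd.Series(price, name="price")

# Pearson correlation of each predictor column with the target.
corr = X_demo.corrwith(y_demo)

# Rank predictors by the strength of the linear relationship, strongest first.
ranked = corr.abs().sort_values(ascending=False)
print(corr)
print(ranked)
```

A high absolute correlation only flags a linear relationship with the target. It says nothing about redundancy between predictors, so a scatter plot of the top-ranked columns is still worth a look.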

## Make the training/validation split, train the model, and validate it

In [11]:
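# No random_state is set, so the split (and the score below) changes from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = LinearRegression()     # an unfitted model
model.fit(X_train, y_train)    # train the model on the training split
model.score(X_test, y_test)    # R squared on the held-out test split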

0.74649046601372104

This score is the R-squared value of the model on the held-out test data.
It measures the proportion of the total variation in the target that the model explains.
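
To make the definition concrete, the sketch below recomputes R-squared by hand as 1 - SS_res / SS_tot and compares it with `score`. It runs on synthetic data, and the names `X_demo`, `y_demo`, and `demo_model`, the coefficients, and the fixed 90/30 split are assumptions made for the example rather than part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data (hypothetical; not the Boston values).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=120)

# Simple fixed split: first 90 rows for training, last 30 held out.
X_tr, X_te = X_demo[:90], X_demo[90:]
y_tr, y_te = y_demo[:90], y_demo[90:]

demo_model = LinearRegression().fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# Residual sum of squares: variation the model fails to explain.
ss_res = np.sum((y_te - y_hat) ** 2)
# Total sum of squares: variation around the mean of the held-out target.
ss_tot = np.sum((y_te - y_te.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, demo_model.score(X_te, y_te))
```

Because the ratio is computed on held-out data, the value can be negative when the model predicts worse than simply using the mean of the test targets.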

In [12]:
predictions = model.predict(X_test)
print(predictions[0:5])

[[ 15.9257846 ]
 [ 20.8853591 ]
 [  8.3809087 ]
 [ 20.17628287]
 [ 24.39435958]]


The print call shows the first 5 predictions for y. Removing [0:5] would print the entire array. Each prediction sits in its own row because y was passed to fit as a one-column DataFrame.

scikit-learn also provides a convenience function for computing the mean squared error.

In [13]:
from sklearn.metrics import mean_squared_error

# Measure the model error based on expected output and predicted output
mean_squared_error(y_test, model.predict(X_test))  # mean squared error (MSE)

21.673675089289198
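
As a small worked example of what this function computes, the sketch below averages the squared differences by hand and compares the result with `mean_squared_error`. The four values in `y_true` and `y_pred` are made up for illustration and are not taken from the notebook's output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values (not Boston data).
y_true = np.array([20.0, 25.0, 30.0, 15.0])
y_pred = np.array([22.0, 24.0, 27.0, 15.0])

# Errors are -2, 1, 3, 0, so the squared errors are 4, 1, 9, 0.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

# The square root (RMSE) is expressed in the same units as the target.
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```

Applied to the result above, the square root of 21.67 is about 4.66, which is easier to interpret than the MSE because it is in the same units as MEDV.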