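The [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) is a good dataset for practicing regression techniques. It ships with scikit-learn and can be loaded from the `datasets` submodule. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older release.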

## Loading the Data

In [1]:
from sklearn.datasets import load_boston

data = load_boston()

Doing so gives us a `Bunch` object:

In [2]:
type(data)

sklearn.utils.Bunch

A `Bunch` is a subclass of `dict` that also lets you read its keys as attributes (for example, `data.target`):

In [3]:
data.__class__.__bases__

(dict,)
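
To make the "dictionary plus attribute access" idea concrete, here is a minimal sketch of a `dict` subclass that behaves this way. `MiniBunch` is a hypothetical name used only for illustration; it is not scikit-learn's implementation, just one simple way to get the same effect.

```python
class MiniBunch(dict):
    """Hypothetical Bunch-like container: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so regular dict methods such as .keys() still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        # Store attribute assignments as dictionary items.
        self[name] = value


bunch = MiniBunch(data=[[1.0, 2.0]], target=[3.0])
print(bunch["target"])          # dictionary-style access
print(bunch.target)             # attribute-style access to the same value
print(isinstance(bunch, dict))  # still a real dict
```

Both access styles return the same object, which is why `data['target']` and `data.target` can be used interchangeably on a `Bunch`.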

## Inspecting the Data

Let's look at the keys

In [4]:
data.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

The `data` and `target` keys hold NumPy arrays: a 2D feature matrix and a 1D target vector.

In [5]:
print(type(data['data']), data['data'].shape)
print(type(data['target']), data['target'].shape)

<class 'numpy.ndarray'> (506, 13)
<class 'numpy.ndarray'> (506,)


`feature_names`, on the other hand, holds the column names for `data`:

In [6]:
print(data['feature_names'])

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


However, making sense of those abbreviations requires a look at the documentation (see the `DESCR` output at the end of this post).
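
One way to keep those abbreviations readable while exploring is a small lookup table. The sketch below copies the descriptions from the `DESCR` text printed at the end of this post; `FEATURE_NOTES` and `describe` are hypothetical names, and only the attributes listed in that output are included.

```python
# Descriptions copied from the DESCR output shown at the end of this post.
# Attributes not listed there fall back to a placeholder.
FEATURE_NOTES = {
    "CRIM": "per capita crime rate by town",
    "ZN": "proportion of residential land zoned for lots over 25,000 sq.ft.",
    "INDUS": "proportion of non-retail business acres per town",
    "CHAS": "Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)",
    "NOX": "nitric oxides concentration (parts per 10 million)",
    "RM": "average number of rooms per dwelling",
    "AGE": "proportion of owner-occupied units built prior to 1940",
    "DIS": "weighted distances to five Boston employment centres",
    "RAD": "index of accessibility to radial highways",
    "TAX": "full-value property-tax rate per $10,000",
    "PTRATIO": "pupil-teacher ratio by town",
}


def describe(name):
    """Return the description for a feature name, or a placeholder if unknown."""
    return FEATURE_NOTES.get(name, "(see DESCR)")


for name in ["CRIM", "RM", "PTRATIO"]:
    print(f"{name:8s}{describe(name)}")
```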

## Using the Data

### Sklearn

The data is already split into features and target, so we can assign them to `X` and `y` directly.

In [7]:
X = data['data']
y = data['target']

Done deal.

### Pandas

This is a bit trickier. First, we want to merge `X` and `y` into a single array, with `y` as the last column.

In [8]:
import numpy as np

values = np.c_[X, y]
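
If `np.c_` is unfamiliar, the sketch below shows what it does on a tiny example. `X_demo` and `y_demo` are made-up stand-ins for the real arrays, used here only so the shapes are easy to follow.

```python
import numpy as np

# Made-up stand-ins for the real X (2D) and y (1D) arrays.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # shape (3, 2)
y_demo = np.array([10.0, 20.0, 30.0])                    # shape (3,)

# np.c_ treats the 1D array as a column and appends it on the right.
merged = np.c_[X_demo, y_demo]
print(merged.shape)

# np.column_stack is an equivalent, more explicit spelling.
same = np.array_equal(merged, np.column_stack([X_demo, y_demo]))
print(same)
```

Applied to the real data, the same operation turns the `(506, 13)` feature matrix and the `(506,)` target into a single `(506, 14)` array.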

Then stuff those into a `DataFrame`

In [9]:
import pandas as pd

df = pd.DataFrame(values)
df.head()

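|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |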


Finally, label the columns, using `MEDV` as the name of the target:

In [10]:
# MEDV, the median value of owner-occupied homes in $1000's
cols = list(data['feature_names']) + ['MEDV']

df.columns = cols
df.head()

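|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |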
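
As an alternative to merging first and renaming afterwards, the frame can also be built with named feature columns and the target attached as a new column. This is only a sketch on made-up values: `X_demo`, `y_demo`, and `names` are stand-ins for `X`, `y`, and `data['feature_names']`.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for X, y, and data['feature_names'].
X_demo = np.array([[0.1, 18.0], [0.2, 0.0]])
y_demo = np.array([24.0, 21.6])
names = ["CRIM", "ZN"]

# Name the feature columns at construction time...
df_demo = pd.DataFrame(X_demo, columns=names)
# ...then attach the target as its own column.
df_demo["MEDV"] = y_demo

print(df_demo)
```

This avoids the intermediate array with integer column labels, at the cost of building the frame in two steps.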


## Some More Description

The `DESCR` key holds a plain-text overview of the dataset, including what each column means:

In [11]:
print(data['DESCR'])

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      