## Boston Housing using regression trees


### Setup libraries and download data

In [16]:
from js import fetch 
import io

URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv"
resp = await fetch(URL)
reg_tree_data = io.BytesIO((await resp.arrayBuffer()).to_py())

In [17]:
import piplite
await piplite.install(['pandas'])
await piplite.install(['numpy'])
await piplite.install(['scikit-learn'])

In [18]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor 
from sklearn.model_selection import train_test_split

## About the dataset

The data were collected for various areas of Boston, and the goal is to predict the median home price in each area so that it can be used when making offers.

The dataset describes areas/towns rather than individual houses. The features are:

CRIM: Crime per capita

ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS: Proportion of non-retail business acres per town

CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX: Nitric oxides concentration (parts per 10 million)

RM: Average number of rooms per dwelling

AGE: Proportion of owner-occupied units built prior to 1940

DIS: Weighted distances to five Boston employment centers

RAD: Index of accessibility to radial highways

TAX: Full-value property-tax rate per $10,000

PTRATIO: Pupil-teacher ratio by town

LSTAT: Percent lower status of the population

MEDV: Median value of owner-occupied homes in $1000s

### Read data


In [19]:
# read the data 
data = pd.read_csv(reg_tree_data)
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,,36.2


In [23]:
# size of data
print('The size of the data is {}'.format(data.shape))

The size of the data is (506, 13)


#### The data has missing values

In [22]:
data.isna().sum()

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
LSTAT      20
MEDV        0
dtype: int64

### Data preprocessing

In [27]:
# remove missing values from the dataset
data.dropna(inplace=True)

In [28]:
data.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
LSTAT      0
MEDV       0
dtype: int64

#### Split data into features and target columns

In [31]:
X = data.drop(columns=['MEDV'])
Y = data['MEDV']

In [32]:
X.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3,222,18.7,5.21


In [33]:
Y.head()

0    24.0
1    21.6
2    34.7
3    33.4
5    28.7
Name: MEDV, dtype: float64

##### Train/Test split

In [35]:
train_data, test_data, train_label, test_label = train_test_split(X, Y, test_size=.2, random_state= 1)

### Build regression tree
We will use the Mean Squared Error (MSE) as the `criterion` parameter, which scikit-learn names `'squared_error'`.

In [49]:
reg_tree = DecisionTreeRegressor(criterion='squared_error')
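
As a rough illustration of what the squared-error criterion measures, the sketch below scores one candidate split by the weighted MSE of its two child nodes, each of which predicts the mean of its targets. The helper names (`node_mse`, `split_mse`) and the toy values are hypothetical; they are not part of scikit-learn or of the dataset, and the real implementation searches many features and thresholds far more efficiently.

```python
from statistics import mean

def node_mse(values):
    """MSE of a node that predicts the mean of its target values."""
    center = mean(values)
    return sum((v - center) ** 2 for v in values) / len(values)

def split_mse(feature, target, threshold):
    """Weighted MSE of the two children created by `feature <= threshold`."""
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    n = len(target)
    return len(left) / n * node_mse(left) + len(right) / n * node_mse(right)

# Toy values for illustration only (not taken from the dataset)
rooms = [5.0, 5.5, 6.0, 7.0, 7.5, 8.0]
price = [15.0, 17.0, 16.0, 30.0, 32.0, 34.0]

parent_mse = node_mse(price)
child_mse = split_mse(rooms, price, threshold=6.5)
print("MSE before split:", parent_mse)
print("Weighted MSE after split:", child_mse)
```

A split is attractive when the weighted child MSE is well below the parent MSE, as it is for this toy threshold.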

In [75]:
### Training
reg_tree.fit(train_data,train_label)

DecisionTreeRegressor()

### Model Evaluation

In [76]:
print('The R2 score is {}'.format(reg_tree.score(test_data,test_label)))

The R2 score is 0.7369364586969518


#### Average absolute error on the test set, i.e. the average error in the predicted median home value



In [77]:
prediction = reg_tree.predict(test_data)

print("$",(prediction - test_label).abs().mean()*1000)

$ 3117.721518987341
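
The quantity printed above is a mean absolute error, scaled by 1000 because MEDV is recorded in thousands of dollars. A minimal plain-Python sketch of the same calculation follows; the function name and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
def mean_absolute_error_dollars(predicted, actual):
    """Average |prediction - actual|, converted from $1000s to dollars."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors) * 1000

# Toy values in $1000s, for illustration only
toy_predicted = [24.0, 20.0, 35.0]
toy_actual = [22.0, 21.5, 35.5]

avg_error_dollars = mean_absolute_error_dollars(toy_predicted, toy_actual)
print("$", round(avg_error_dollars, 2))
```

In the notebook itself, `sklearn.metrics.mean_absolute_error(test_label, prediction) * 1000` should give the same figure as the pandas expression.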


### Train a regression tree using the Mean Absolute Error criterion
#### Then report its $R^2$ value and average error

In [78]:
reg_tree_mae = DecisionTreeRegressor(criterion='absolute_error')

# train the model
reg_tree_mae.fit(train_data, train_label)

print('The R2 score is {}'.format(reg_tree_mae.score(test_data,test_label)))


The R2 score is 0.8581655704232203
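
One way to see why the two criteria can behave differently: within a leaf, the constant that minimises squared error is the mean of the targets, while the constant that minimises absolute error is the median, which is less sensitive to outliers. The sketch below checks this on toy values; the helper names and numbers are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

def mse_loss(values, guess):
    """Mean squared error of predicting `guess` for every value."""
    return sum((v - guess) ** 2 for v in values) / len(values)

def mae_loss(values, guess):
    """Mean absolute error of predicting `guess` for every value."""
    return sum(abs(v - guess) for v in values) / len(values)

# Toy leaf with one expensive outlier (illustration only)
leaf = [20.0, 21.0, 22.0, 23.0, 50.0]

leaf_mean = mean(leaf)      # pulled upward by the outlier
leaf_median = median(leaf)  # stays with the bulk of the values

print("mean:", leaf_mean, "median:", leaf_median)
print("MSE with mean / median:", mse_loss(leaf, leaf_mean), mse_loss(leaf, leaf_median))
print("MAE with mean / median:", mae_loss(leaf, leaf_mean), mae_loss(leaf, leaf_median))
```

The mean gives the lower MSE and the median gives the lower MAE, so a tree trained with `absolute_error` is pulled less by a few extreme home values.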


In [79]:
#### The average error in median home value prediction

In [72]:
prediction = reg_tree_mae.predict(test_data)

print("$",(prediction - test_label).abs().mean()*1000)

$ 2667.0886075949365


## Conclusion 

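Using the _Mean Absolute Error_ criterion, we get an R2 score that ranges from _0.84 to 0.86_, compared with about 0.74 for the _Mean Squared Error_ criterion in the run above. The average error is also lower (about $2,667 versus $3,118), so on this test set the tree trained with _Mean Absolute Error_ gives the more reliable predictions.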

#### Definitions 

A regression model's R2 score is a statistical measure of how much of the variance in the dependent variable is explained by the independent variable(s). A score of 1 means the model explains all of the variation, and a score of 0 means it does no better than always predicting the mean. The score usually falls between 0 and 1, although on held-out data it can be negative when the model does worse than predicting the mean. The higher the score, the more accurate the predictions.
The R2 score is therefore used to assess the performance of regression-based machine learning models.
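
The definition above can be sketched directly as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2` and the toy values are hypothetical; in the notebook, `reg_tree.score(...)` reports this same quantity.

```python
def r2(actual, predicted):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of `actual`."""
    avg = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy values for illustration only
toy_targets = [3.0, 5.0, 7.0, 9.0]
close_guess = [2.5, 5.5, 7.0, 9.5]     # small errors
reversed_guess = [9.0, 7.0, 5.0, 3.0]  # worse than predicting the mean

print("close predictions:", r2(toy_targets, close_guess))
print("reversed predictions:", r2(toy_targets, reversed_guess))
```

The second call returns a negative value, which shows that the score is not bounded below by 0 when predictions are worse than the mean.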