# Decision trees (random forest) - Boston Housing dataset

The dataset contains information collected by the U.S. Census Service concerning housing in the Boston, Massachusetts area.

The features in the dataset:
* CRIM: Per capita crime rate by town
* ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
* INDUS: Proportion of non-retail business acres per town
* CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX: Nitric oxide concentration (parts per 10 million)
* RM: Average number of rooms per dwelling
* AGE: Proportion of owner-occupied units built prior to 1940
* DIS: Weighted distances to five Boston employment centers
* RAD: Index of accessibility to radial highways
* TAX: Full-value property tax rate per 10,000 USD
* PTRATIO: Pupil-teacher ratio by town
* B: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
* LSTAT: Percentage of the population classified as lower status
* PRICE: Median value of owner-occupied homes in $1000s

There are 506 instances in the dataset, each describing a census tract in the Boston area. PRICE is the target variable for regression problems: the objective is to predict the median value of owner-occupied homes from the other 13 columns.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn import tree
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score


In [2]:

# Load the Boston Housing dataset from the UCI Machine Learning Repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'PRICE']
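# The file is whitespace-delimited and has no header row, so the column names are passed explicitly.
# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas releases.
data = pd.read_csv(url, sep=r'\s+', names=names)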
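
To check the parsing step without downloading anything, the sketch below reads two sample rows in the same whitespace-delimited layout from an in-memory string. The rows are illustrative stand-ins for the real file, and `sample` and `df` are names introduced only for this example.

```python
import io

import pandas as pd

names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
         'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'PRICE']

# Two illustrative rows (assumed sample values), separated by runs of spaces
sample = (
    "0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00\n"
    "0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60\n"
)

# sep=r"\s+" treats any run of whitespace as a single delimiter
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=names)

print(df.shape)               # rows x columns
print(df.isna().sum().sum())  # total number of missing values
```

On the real file, the same checks should report 506 rows and 14 columns if the download and parsing succeeded.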


In [3]:

# Split the data into features (X) and target (y)
X = data.drop('PRICE', axis=1)
y = data['PRICE']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
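
As a rough check of what `test_size=0.2` means for 506 rows, the sketch below runs the same split on random stand-in arrays of the same shape. `X_demo` and `y_demo` are assumptions made for the example, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the housing features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test share is rounded up: ceil(0.2 * 506) = 102, leaving 404 rows for training
print(X_tr.shape, X_te.shape)

# Re-running with the same random_state reproduces the same partition
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```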


In [4]:


# Random forest: an ensemble of 500 decision trees; random_state fixes the seed for reproducibility
regressor_rf = RandomForestRegressor(n_estimators=500, random_state=0)


In [5]:

# Train the model on the training set (y_train is already one-dimensional, so no ravel() is needed)
regressor_rf.fit(X_train, y_train)
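
The notebook imports `cross_val_score` but relies on a single train/test split. A minimal sketch of how k-fold cross-validation could complement that split is shown below. It uses synthetic data from `make_regression` and a smaller forest to keep it quick; both are assumptions for illustration, and the scores it prints say nothing about the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with 13 features (stand-in for the training set)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# 5-fold cross-validation: fit on 4 folds, score R2 on the held-out fold, repeat
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

print(scores)
print(scores.mean(), scores.std())
```

With the notebook's own objects, the equivalent call would pass `regressor_rf`, `X_train`, and `y_train`; the spread across folds gives a sense of how much the score depends on the particular split.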



In [6]:
# Make predictions using the testing set
y_pred = regressor_rf.predict(X_test)

# Evaluate the performance of the algorithm
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


Mean Absolute Error: 2.070966666666667
Mean Squared Error: 8.565180143921568
Root Mean Squared Error: 2.9266329021456667
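
The three metrics above are in the units of PRICE ($1000s) for MAE and RMSE and in squared units for MSE, so an RMSE of about 2.93 corresponds to roughly $2,930. The sketch below computes each metric by hand on a tiny made-up example (`y_true` and `y_hat` are illustrative values) and compares the results with scikit-learn.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

errors = y_hat - y_true          # [-1, 0, 2]

mae = np.mean(np.abs(errors))    # mean of |error|
mse = np.mean(errors ** 2)       # mean of error squared
rmse = np.sqrt(mse)              # back in the units of the target

print(mae, mse, rmse)

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, metrics.mean_squared_error(y_true, y_hat)))
```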


In [11]:

from sklearn.metrics import r2_score

# R2 score on the training set
y_pred_rf_train = regressor_rf.predict(X_train)
r2_score_rf_train = r2_score(y_train, y_pred_rf_train)

# R2 score on the test set
y_pred_rf_test = regressor_rf.predict(X_test)
r2_score_rf_test = r2_score(y_test, y_pred_rf_test)

# RMSE on the test set
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf_test))
print('R2_score (train): ', r2_score_rf_train)
print('R2_score (test): ', r2_score_rf_test)
print("RMSE: ", rmse_rf)

R2_score (train):  0.9785315412639791
R2_score (test):  0.8832028053810302
RMSE:  2.9266329021456667
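
The training R2 (about 0.979) is noticeably higher than the test R2 (about 0.883), which suggests the forest fits the training data more closely than unseen data. For reference, R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below works that out on a small made-up example; the arrays are illustrative values only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 6.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 2.25
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 14.75

r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

A model that always predicts the mean of `y_true` scores 0 under this definition, and a perfect model scores 1.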


In [9]:
# Make predictions using the testing set
y_pred = regressor_rf.predict(X_test)

# print the first 10 predicted values
print(y_pred[:10])

[22.8384 31.1626 17.0584 23.3398 16.7504 21.385  19.3182 15.5084 21.417
 20.9662]


In [10]:
# Example of a new data point, built as a DataFrame so its column names match those seen during fit
new_data = pd.DataFrame(
    [[0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14]],
    columns=X.columns,
)

# Make a prediction on the new data point
new_pred = regressor_rf.predict(new_data)

# Print the predicted median house price (in $1000s)
print(new_pred)

[21.8816]
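
A random forest regressor produces a prediction like the one above by averaging the predictions of its individual decision trees. The sketch below illustrates this on synthetic data; the dataset, the forest size, and the names `forest`, `x_new`, and `per_tree` are assumptions made for the example, not the model trained in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with 13 features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# One query point, kept two-dimensional as predict() expects
x_new = X_demo[:1]

# Ask every fitted tree in the ensemble for its own prediction
per_tree = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])

# The forest's prediction matches the mean of the per-tree predictions
print(per_tree.mean(), forest.predict(x_new)[0])
print(per_tree.std())  # spread across trees, a rough indication of disagreement
```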


