# 1. Introduction - Regression Trees
Decision Trees can also be used for regression, in which case they are commonly called Regression Trees. The idea is to split the data into groups based on feature values and, for a new sample, predict the average target value of the training samples that fall into the same group.
- In Regression Trees, each split (a feature and a threshold) is chosen to minimise an error measure (e.g. MSE)
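
To make the split rule concrete, here is a minimal sketch of how one split on a single feature could be chosen by minimising the weighted MSE of the two resulting groups. The helper `best_split` and the toy arrays are made up for illustration; this is not how scikit-learn implements the search internally.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) for the split on x with the lowest MSE."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_error = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # Each side predicts its own mean, so its MSE is its variance;
        # weight each side by the number of samples it contains.
        error = (left.var() * len(left) + right.var() * len(right)) / len(y_sorted)
        if error < best_error:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_error = error
    return best_threshold, best_error

# Toy data (hypothetical): two clearly separated groups of target values
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])
threshold, error = best_split(x, y)
print(threshold, error)
```

A full tree repeats this search over every feature at every node and stops when a limit such as `max_depth` or `min_samples_leaf` is reached.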

Regression Trees are implemented using `DecisionTreeRegressor` from `sklearn.tree`

The important parameters of `DecisionTreeRegressor` are

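`criterion`: {"squared_error", "friedman_mse", "absolute_error", "poisson"} - The function used to measure error ("squared_error" and "absolute_error" were called "mse" and "mae" before scikit-learn 1.0)

`max_depth` - The maximum depth of the tree

`min_samples_split` - The minimum number of samples required to split a node

`min_samples_leaf` - The minimum number of samples that a leaf can contain

`max_features`: int, float or {"sqrt", "log2"} - The number of features examined when looking for the best split; lower values speed up training. By default (`None`) all features are considered. The older "auto" option has been removed from recent scikit-learn releases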

## About the Dataset
You are a data scientist working for a real estate company that is planning to invest in Boston real estate. You have collected information about various areas of Boston and are tasked with creating a model that can predict the median price of houses in each area so it can be used to make offers.

## Read the Data
First we import the necessary libraries and download the dataset.

In [79]:
# Necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn import metrics

In [80]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Load the data
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv")
df.head()

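|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | NaN | 36.2 |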


Next, we inspect the data to understand its shape and to see whether any values are missing.

In [81]:
# Shape of the dataset
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.\n")
# Missing values
print("The following columns are missing values:\n", df.isna().sum())

The dataset has 506 rows and 13 columns.

The following columns are missing values:
 CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
LSTAT      20
MEDV        0
dtype: int64
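
Per-column counts do not say how many rows are affected, because a single row can be missing values in several columns. A small sketch on a made-up frame (the values in `toy` are illustrative, not taken from the real dataset) shows one way to count affected rows before deciding to drop them:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with a few missing entries
toy = pd.DataFrame({
    "CRIM": [0.006, np.nan, 0.027, 0.032],
    "RM": [6.575, 6.421, 7.185, 6.998],
    "LSTAT": [4.98, 9.14, np.nan, np.nan],
})

# A row counts once, no matter how many of its columns are missing
rows_with_missing = toy.isna().any(axis=1).sum()
print(f"{rows_with_missing} of {len(toy)} rows contain at least one missing value")

# dropna() removes exactly those rows
cleaned = toy.dropna()
print(len(cleaned))
```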


# 2. Data Pre-Processing
Because the dataset is large enough to spare them, we first drop all rows that contain a missing value.

In [82]:
# drop missing values
df.dropna(inplace=True)

Check if all missing values are gone:

In [83]:
# Missing values
df.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
LSTAT      0
MEDV       0
dtype: int64

Now we can separate the dataset into the feature matrix `X` and the target vector `Y`.

In [84]:
# X holds the independent features, Y the dependent target (MEDV)
# Selecting the column instead of using pop() leaves df unchanged
X = df.drop(columns='MEDV').to_numpy()
Y = df['MEDV'].to_numpy()
# Display the result
print(f"{X[:5]}\n\n")
print(Y[:5])

[[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
  6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
  7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 4.0300e+00]
 [3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00
  4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 2.9400e+00]
 [2.9850e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.4300e+00
  5.8700e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 5.2100e+00]]


[24.  21.6 34.7 33.4 28.7]


Finally, we split our data into training and testing sets using `train_test_split`, holding out 20% of the rows for testing.

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
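
To illustrate what `test_size` and `random_state` do, here is a small sketch on made-up arrays (the `_demo` names are hypothetical). It checks the resulting shapes and that a fixed seed reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each; row i has target i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(X_tr.shape, X_te.shape)  # 80% of the rows for training, 20% for testing

# Repeating the call with the same random_state gives the same test rows
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(np.array_equal(X_te, X_te_again))
```

Features and targets are shuffled together, so each row of `X_te` still lines up with its entry in `y_te`.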

# 3. Create a Regression Tree

In [86]:
# Model a Regression Tree
regression_tree = DecisionTreeRegressor(criterion='squared_error')
# Train the model
regression_tree.fit(X_train, y_train)

## Evaluation of the model
To evaluate our model we use the `score` method of the `DecisionTreeRegressor` object. It returns the $R^2$ value, the coefficient of determination, for the data passed to it; a value of 1 means perfect predictions.

In [87]:
# R^2 Score
regression_tree.score(X_test, y_test)

0.8554975314009299
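
For reference, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the predictions and $\bar{y}$ is the mean of the true values. The sketch below computes it by hand and compares the result with `r2_score` from `sklearn.metrics`; the predictions in `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([24.0, 21.6, 34.7, 33.4, 28.7])  # example target values
y_pred = np.array([25.0, 20.0, 33.0, 35.0, 28.0])  # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A model that always predicts the mean of `y_true` scores 0, and a model that does worse than that scores below 0.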

We can also compute the mean absolute error on the testing set, which is the average size of the error in our median home value predictions.

In [88]:
prediction = regression_tree.predict(X_test)
print("$", (prediction - pd.Series(y_test)).abs().mean()*1000)

$ 2673.4177215189875


The output is the mean absolute error (MAE) of the regression tree's predictions, multiplied by 1000. MAE is the average size of the difference between the predicted values and the actual values (`y_test`), ignoring direction. The multiplication by 1000 converts the result to dollars, because `MEDV` is recorded in thousands of dollars. Here, an MAE of 2673.42 means that, on average, the model's predictions are off by approximately $2,673.42. **A lower MAE indicates better prediction accuracy**.
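
The same quantity is available as `mean_absolute_error` in `sklearn.metrics`. A minimal sketch, using made-up predictions rather than the model's actual output, shows that it agrees with the manual calculation:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([24.0, 21.6, 34.7, 33.4, 28.7])  # example target values (thousands of dollars)
y_pred = np.array([25.0, 20.0, 33.0, 35.0, 28.0])  # hypothetical predictions

mae_manual = np.abs(y_pred - y_true).mean()
mae_sklearn = metrics.mean_absolute_error(y_true, y_pred)

# Multiply by 1000 to report the error in dollars
print(f"${mae_sklearn * 1000:,.2f}")
```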