

# ***Regression Trees***

Estimated time needed: **20** minutes



$ \ $

-----

## ***Objectives.***

In this lab you will learn how to implement regression trees using scikit-learn. We will show:

* which parameters are important

* how to train a regression tree

* how to determine our regression tree's accuracy


After completing this lab you will be able to:

*   Train a Regression Tree
*   Evaluate a Regression Tree's Performance


$ \ $

---

## ***Setup (configuration).***

For this lab, we are going to be using Python and several Python libraries. 


In [None]:
#(1) Pandas will allow us to create a dataframe of the data so it can be used and manipulated
import pandas as pd

#(2) Regression Tree Algorithm
from sklearn.tree import DecisionTreeRegressor

#(3) Split our data into training and testing sets
from sklearn.model_selection import train_test_split


$ \ $

----

## ***About the Dataset.***


Imagine you are a data scientist working for a real estate company that is planning to invest in Boston real estate. You have collected information about various areas of Boston and are tasked with creating a model that can predict the median price of houses in each area so it can be used to make offers. The dataset has information on areas/towns, not individual houses. The $\color{yellow}{\text{features}}$ are:

* $\color{aquamarine}{\text{CRIM: }}$ Crime per capita

* $\color{aquamarine}{\text{ZN: }}$  Proportion of residential land zoned for lots over 25,000 sq.ft.

* $\color{aquamarine}{\text{INDUS: }}$  Proportion of non-retail business acres per town

* $\color{aquamarine}{\text{CHAS: }}$  Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

* $\color{aquamarine}{\text{NOX: }}$   Nitric oxides concentration (parts per 10 million)

* $\color{aquamarine}{\text{RM: }}$   Average number of rooms per dwelling

* $\color{aquamarine}{\text{AGE: }}$   Proportion of owner-occupied units built prior to 1940

* $\color{aquamarine}{\text{DIS: }}$  Weighted distances to five Boston employment centers

* $\color{aquamarine}{\text{RAD: }}$  Index of accessibility to radial highways

* $\color{aquamarine}{\text{TAX: }}$   Full-value property-tax rate per \$10,000

* $\color{aquamarine}{\text{PTRATIO: }}$ Pupil-teacher ratio by town

* $\color{aquamarine}{\text{LSTAT: }}$  Percent lower status of the population

* $\color{aquamarine}{\text{MEDV: }}$  Median value of owner-occupied homes in \$1000s (this is the target we will predict)


$ \ $

---

## ***Read the Data.***

$ \ $

$(1)$ Let's read the data directly from its URL into a pandas DataFrame.


In [None]:
data = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv")

In [None]:
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,,36.2


$ \ $

$(2)$ Now let's look at the size of our data: there are 506 rows and 13 columns.


In [None]:
data.shape

(506, 13)

$ \ $

$(3)$ Most of the data is valid, but some rows have missing values, which we will deal with in pre-processing.


In [None]:
data.isna().sum()

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
LSTAT      20
MEDV        0
dtype: int64

$ \ $

---

## ***Data Pre-Processing.***


$ \ $

$(1)$ First, let's drop the rows with missing values; the dataset is large enough that we can afford to lose them.


In [None]:
data.shape

(506, 13)

In [None]:
data.dropna(inplace=True)

$ \ $

$(2)$ Now we can see that our dataset has no missing values. Dropping the incomplete rows reduced the data from 506 rows to 394.


In [None]:
data.shape

(394, 13)

In [None]:
data.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
LSTAT      0
MEDV       0
dtype: int64
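
$ \ $

The two calls used above, `isna().sum()` and `dropna()`, can be tried on a small frame to see exactly what they do. The sketch below uses a hypothetical four-row toy DataFrame (not the lab's data) and the non-in-place form of `dropna`, which returns a new frame instead of modifying the original.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values (not the lab's dataset)
toy = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.4],
    "RM": [6.0, 6.5, np.nan, 7.0],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value;
# without inplace=True the original frame is left untouched
cleaned = toy.dropna()
print(cleaned.shape)        # (2, 3)
print(list(cleaned.index))  # [0, 3]
```

Note that the surviving rows keep their original index labels. The same thing happens in the lab's data, which is why `data.head()` jumps from label 3 to label 5 after the drop.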

$ \ $

$(3)$ Let's split the dataset into our features (`X`) and the target we are predicting (`Y`).


In [None]:
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3,222,18.7,5.21,28.7


In [None]:
X = data.drop(columns=["MEDV"])
Y = data["MEDV"]

In [None]:
X.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3,222,18.7,5.21


In [None]:
Y.head()

0    24.0
1    21.6
2    34.7
3    33.4
5    28.7
Name: MEDV, dtype: float64

$ \ $

$(4)$ Finally, let's split our data into training and testing datasets using `train_test_split` from `sklearn.model_selection`. With `test_size=0.2`, 20% of the rows are held out for testing.


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
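
$ \ $

To see what this split produces, here is a minimal sketch on placeholder arrays with the same number of rows as our cleaned dataset (394). The names `X_demo` and `y_demo` are made up for illustration; only the split sizes and the effect of `random_state` are the point.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 394 rows, matching the cleaned dataset's row count
X_demo = np.arange(394 * 2).reshape(394, 2)
y_demo = np.arange(394)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)  # (315, 2) (79, 2)

# Using the same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(np.array_equal(X_te, X_te_again))  # True
```

The 79 test rows match the length of the error series shown later in the evaluation section.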

$ \ $

-----

## ***Create Regression Tree.***

$ \ $

Regression Trees are implemented using `DecisionTreeRegressor` from ***sklearn.tree***. The important parameters of ***DecisionTreeRegressor*** are:

* $\color{aquamarine}{\text{criterion: {"mse", "friedman_mse", "mae", "poisson"}}}$ - The function used to measure the error of a split. In scikit-learn 1.0 and later, `"mse"` and `"mae"` are named `"squared_error"` and `"absolute_error"`; the old names were removed in version 1.2.

* $\color{aquamarine}{\text{max_depth}}$ - The maximum depth of the tree.

* $\color{aquamarine}{\text{min_samples_split}}$ - The minimum number of samples required to split a node.

* $\color{aquamarine}{\text{min_samples_leaf}}$ - The minimum number of samples that a leaf can contain.

* $\color{aquamarine}{\text{max_features: {"auto", "sqrt", "log2"}}}$ - The number of features considered when looking for the best split; limiting it can speed up training.
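
$ \ $

As a rough illustration of how the parameters above constrain a tree, the sketch below fits two trees on synthetic data (a noisy sine curve, an assumption made purely for this example) and compares their size. The values `max_depth=3` and `min_samples_leaf=5` are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

# No limits: the tree keeps splitting until it cannot improve further
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Limited depth and leaf size: a much smaller tree
small_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_demo, y_demo)

print("full :", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("small:", small_tree.get_depth(), "levels,", small_tree.get_n_leaves(), "leaves")
```

A tree of depth 3 can have at most $2^3 = 8$ leaves, so the constrained tree makes at most 8 distinct predictions, while the unconstrained tree can memorize the training data.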

$ \ $

$(1)$ First, let's create a `DecisionTreeRegressor` object, setting the `criterion` parameter to `mse` for mean squared error.




In [None]:
# "mse" works on scikit-learn < 1.2; on newer versions use criterion="squared_error"
regression_tree = DecisionTreeRegressor(criterion = "mse")

$ \ $

$(2)$ Now let's train our model by calling the `fit` method of the `DecisionTreeRegressor` object with our training data.


In [None]:
regression_tree.fit(X_train, Y_train)



DecisionTreeRegressor(criterion='mse')

$ \ $

----

## ***Evaluation.***


$ \ $

$(1)$ To evaluate our model we will use the `score` method of the `DecisionTreeRegressor` object, providing our testing data. The number it returns is the $R^2$ value, also called the coefficient of determination.


In [None]:
regression_tree.score(X_test, Y_test)

0.8619442285600525
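
$ \ $

For reference, $R^2$ is defined as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand on a few made-up values (not the lab's predictions) and checks the result against `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 33.7, 34.4])

# Sum of squared residuals (prediction errors)
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares (spread of the true values around their mean)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # True
```

A value of 1 means perfect predictions, and a value of 0 means the model does no better than always predicting the mean of the true values.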

$ \ $

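$(2)$ We can also find the mean absolute error on our testing set, which is the average error in the median home value prediction. Since `MEDV` is in thousands of dollars, multiplying by 1000 gives the error in dollars.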


In [None]:
prediction = regression_tree.predict(X_test)


In [None]:
(prediction - Y_test).abs()

96      2.5
289     1.5
456    10.5
143     2.2
267     6.9
       ... 
27      1.7
379     5.2
17      0.7
106     2.1
71      2.4
Name: MEDV, Length: 79, dtype: float64

In [None]:
(prediction - Y_test).abs().mean()

2.834177215189873

In [None]:
print("$",(prediction - Y_test).abs().mean()*1000)

$ 2834.177215189873
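
$ \ $

The same quantity is available as `mean_absolute_error` in `sklearn.metrics`. The sketch below, using a few made-up values rather than the lab's predictions, shows that it matches the manual pandas computation used above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up target values (in $1000s) and predictions, for illustration only
y_true = pd.Series([24.0, 21.6, 34.7, 33.4], name="MEDV")
y_hat = np.array([25.0, 20.6, 33.7, 36.4])

# Manual computation, as in the cells above
manual_mae = (y_hat - y_true).abs().mean()

# Equivalent scikit-learn helper
sk_mae = mean_absolute_error(y_true, y_hat)

print(np.isclose(manual_mae, sk_mae))  # True
print(f"${manual_mae * 1000:,.0f}")    # $1,500
```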


$ \ $

----

## ***Exercise.***


Train a regression tree using the `criterion` `mae`, then report its $R^2$ value and average error.


In [None]:
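# "mae" works on scikit-learn < 1.2; on newer versions use criterion="absolute_error"
reg_tree = DecisionTreeRegressor(criterion = "mae")
reg_tree.fit(X_train, Y_train)
# Score the newly trained reg_tree, not the earlier regression_tree
reg_tree.score(X_test, Y_test)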



0.846753001591031

In [None]:
pred = reg_tree.predict(X_test)
print("$",(pred - Y_test).abs().mean()*1000)

$ 2644.3037974683543
