# Exercise 3: Decision Tree

---
**Written by Hendi Lie (h2.lie@qut.edu.au) and Richi Nayak (r.nayak@qut.edu.au). All rights reserved.**

Welcome to the third practical exercise for CAB330. Each exercise sheet contains a number of theoretical and programming exercises, designed to strengthen both conceptual and practical understanding of data mining processes in this unit.

To answer conceptual questions, write the answer to each question on paper or in a note, together with your reasoning. For programming exercises, open your IPython console or Jupyter notebook and use the Python commands and libraries introduced in each practical to answer the questions. In many cases, you will need to write code to support your conceptual answers.

## 0. Prerequisite

Perform the following steps before trying the exercises:
1. Import pandas as "pd" and load the house price dataset into "df".
2. Print dataset information to refresh your memory.
3. Run the `preprocess_data` function on the dataframe to perform the preprocessing steps discussed last week.

In [1]:
import pandas as pd

df = pd.read_csv('datasets/melbourne_house_price.csv', index_col=0)

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24197 entries, 0 to 24196
Data columns (total 22 columns):
Suburb                24197 non-null object
Address               24197 non-null object
Rooms                 24197 non-null int64
Type                  24197 non-null object
Price                 24197 non-null float64
Method                24197 non-null object
SellerG               24197 non-null object
Date                  24197 non-null object
Distance              24196 non-null float64
Postcode              24196 non-null float64
Bedroom2              18673 non-null float64
Bathroom              18669 non-null float64
Car                   18394 non-null float64
Landsize              15946 non-null float64
BuildingArea          9609 non-null float64
YearBuilt             10961 non-null float64
CouncilArea           24194 non-null object
Lattitude             18843 non-null float64
Longtitude            18843 non-null float64
Regionname            24194 non-null object
Pr

In [3]:
def preprocess_data(df):
    # Q1.4 and Q6.2
    df = df.drop(['Address', 'Landsize', 'BuildingArea', 'YearBuilt', 'Price', 'Bedroom2', 'SellerG'], axis=1)
    
    # Q1.1
    cols_miss_drop =['Postcode', 'CouncilArea', 'Regionname', 'Propertycount']
    mask = pd.isnull(df['Distance'])

    for col in cols_miss_drop:
        mask = mask | pd.isnull(df[col])

    df = df[~mask].copy()  # copy so the assignments below modify an independent DataFrame
    
    # Q1.2
    df['Bathroom'] = df['Bathroom'].fillna(df['Bathroom'].mean())
    df['Car'] = df['Car'].fillna(df['Car'].mean())
    
    df['Latitude_nan'] = pd.isnull(df['Lattitude'])
    df['Longtitude_nan'] = pd.isnull(df['Longtitude'])
    df['Lattitude'] = df['Lattitude'].fillna(0)
    df['Longtitude'] = df['Longtitude'].fillna(0)
    
    # Q6.1. Change date into weeks and months
    sale_dates = pd.to_datetime(df['Date'])
    df['Sales_week'] = sale_dates.dt.isocalendar().week.astype('int64')  # ISO week number; .dt.week was removed in pandas 2.0
    df['Sales_month'] = sale_dates.dt.month
    df = df.drop(['Date'], axis=1)  # drop the date, not required anymore
    
    # Q4
    df = pd.get_dummies(df)
    
    return df

df_prep = preprocess_data(df)

In [4]:
df_prep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24194 entries, 0 to 24196
Columns: 402 entries, Rooms to Regionname_Western Victoria
dtypes: bool(2), float64(7), int64(4), uint8(389)
memory usage: 11.2 MB


## 1. Data Partitioning

Perform the following operations and answer the following questions:
1. Describe the training, validation and test datasets. What is the purpose of each split?
2. What is k-fold cross validation? What are the advantages and disadvantages of k-fold CV compared to the normal training/validation/test method?
3. What is meant by *stratification*?
4. What does the random state do?
5. Set the random state to 0. Split the dataframe into X and y, then split the respective data into training and test sets in a 70/30 proportion.

### Answer

Questions 1, 2, 3 and 4 are thoroughly explained in the practical and lecture notes.
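
As a small illustration of Questions 2 to 4, the sketch below uses a made-up toy label array (not the house price data) to show two things: stratified k-fold CV keeps the class ratio the same in every validation fold, and a fixed `random_state` reproduces the same split. The names `X_toy`, `y_toy`, `fold_ratios` and `same_split` are hypothetical and only serve this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy data (hypothetical): 20 samples, 5 of them (25%) in the positive class
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([1] * 5 + [0] * 15)

# Stratified 5-fold CV: each validation fold keeps the overall class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_ratios = [float(y_toy[val_idx].mean()) for _, val_idx in skf.split(X_toy, y_toy)]
print("Positive ratio per validation fold:", fold_ratios)

# The same random_state reproduces exactly the same train/test split
split_a = train_test_split(X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)
same_split = np.array_equal(split_a[1], split_b[1])
print("Identical test sets:", same_split)
```

Each of the five folds holds four samples, exactly one of which is positive, so every fold mirrors the 25% ratio of the full toy dataset.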

The answer to Question 5 is as follows.

In [5]:
# Suppress all warnings (including FutureWarning) to keep the output readable
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split

rs = 0
X = df_prep.drop(['Price_above_median'], axis=1)
y = df_prep['Price_above_median']

X_mat = X.to_numpy()  # as_matrix() was removed in pandas 1.0
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.3, stratify=y, random_state=rs)

## 2. Decision Tree

Perform the following operations and answer the question.
1. Import and build a decision tree classifier. Set the random state to 0 to ensure your result matches the answers. Fit it against the training data.
2. What is the performance of the model against the training data? How about against the test data? Do you see any indication of overfitting here?
3. What are the top 5 most important features in this model?
4. Find the best hyperparameters using GridSearchCV. What is the optimal parameter set? Use the following parameters as initial guidance: **criterion** of `gini` or `entropy`, **max depth** of 2-7 and **min_samples_leaf** from 20-60, in increments of 10.
5. Visualise the structure of your decision tree. Can you identify characteristics of expensive houses?

### Answer

The code to answer all questions is as follows.

In [6]:
# import required libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
from dm_tools import visualize_decision_tree, analyse_feature_importance  # use the functions we built in the practical

In [7]:
model = DecisionTreeClassifier(random_state=rs)
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [8]:
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

0.9981694715087098
0.8424025347844055


There is a major difference between performance on the training data (about 99.8% accuracy) and on the test data (about 84.2%). This gap is a strong indication of overfitting.
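
To make the idea of a train/test gap concrete, here is a minimal sketch on synthetic data (a hypothetical stand-in, not the house price dataset). It compares a fully grown tree with one limited to depth 3. The dataset settings and the names `full_tree` and `small_tree` are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical) with 10% label noise
X_syn, y_syn = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                   flip_y=0.1, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0)

# Fully grown tree versus a tree limited to depth 3
full_tree = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xs_train, ys_train)

for name, tree in [("full", full_tree), ("depth 3", small_tree)]:
    gap = tree.score(Xs_train, ys_train) - tree.score(Xs_test, ys_test)
    print(f"{name}: depth={tree.get_depth()}, train-test gap={gap:.3f}")
```

The fully grown tree memorises the training set, noisy labels included, which is why its training accuracy says little about how it will perform on unseen data.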

In [9]:
analyse_feature_importance(model, X.columns, 20)

Type_u : 0.1663717197004314
Distance : 0.14105464031286377
Regionname_Southern Metropolitan : 0.13320735698073669
Longtitude : 0.0735364961075368
Regionname_Eastern Metropolitan : 0.07287281292272205
Rooms : 0.05997213997847961
Lattitude : 0.05692904637420696
Sales_week : 0.0454671733784824
Postcode : 0.02886272712831667
Type_h : 0.02641844227105693
Sales_month : 0.02341347233009052
Car : 0.022237535375885464
Bathroom : 0.02199774346522733
CouncilArea_Kingston City Council : 0.015772356922391702
Propertycount : 0.01518330039036208
CouncilArea_Monash City Council : 0.012291156725175655
Method_S : 0.009123962832994232
Method_PI : 0.007633009722691979
Type_t : 0.00604574606420386
Method_VB : 0.004303988332194074


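The five most important features are `Type_u` (house type), `Distance` (distance from the CBD), `Regionname_Southern Metropolitan`, `Longtitude` and `Regionname_Eastern Metropolitan`, followed closely by `Rooms` and `Lattitude`. In short, house type, distance from the CBD, region, location and number of rooms are the most important features.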
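
The `analyse_feature_importance` helper comes from `dm_tools` and its source is not shown here. The sketch below is one possible way to produce a similar ranking from a fitted tree's `feature_importances_` attribute. The function `top_feature_importances`, the toy data and the feature names are all hypothetical, and this is not necessarily how the practical's helper is written.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def top_feature_importances(model, feature_names, n_to_display=5):
    """Return the n most important (name, importance) pairs, largest first."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:n_to_display]  # indices sorted by descending importance
    return [(feature_names[i], float(importances[i])) for i in order]

# Toy data with made-up feature names
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     random_state=0)
names = [f"feature_{i}" for i in range(6)]
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

top3 = top_feature_importances(demo_tree, names, 3)
for name, importance in top3:
    print(name, ":", importance)
```

The importances of a fitted scikit-learn tree sum to 1, so each value can be read as that feature's share of the total impurity reduction.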

In [10]:
from sklearn.model_selection import GridSearchCV

# grid search CV
params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(2, 7),                  # depths 2 to 6 (range excludes the end value)
          'min_samples_leaf': range(200, 600, 100)}  # 200, 300, 400, 500

cv = GridSearchCV(param_grid=params, estimator=DecisionTreeClassifier(random_state=rs), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)

print("Train accuracy:", cv.score(X_train, y_train))
print("Test accuracy:", cv.score(X_test, y_test))

# test the best model
y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

Train accuracy: 0.8478299379982285
Test accuracy: 0.8422647747623639
              precision    recall  f1-score   support

           0       0.83      0.87      0.85      3634
           1       0.86      0.82      0.84      3625

   micro avg       0.84      0.84      0.84      7259
   macro avg       0.84      0.84      0.84      7259
weighted avg       0.84      0.84      0.84      7259

{'criterion': 'entropy', 'min_samples_leaf': 200, 'max_depth': 6}


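The tuned model generalises better: training accuracy (about 84.8%) and test accuracy (about 84.2%) are now almost identical, so there is no longer a sign of overfitting. Test accuracy itself is essentially unchanged from the unconstrained tree.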
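
To get a feel for how much work the grid search in cell `In [10]` does, the sketch below rebuilds the same parameter grid (written out as lists) and counts the candidates with scikit-learn's `ParameterGrid`. The names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid in In [10], written out as explicit lists
params = {'criterion': ['gini', 'entropy'],
          'max_depth': list(range(2, 7)),
          'min_samples_leaf': list(range(200, 600, 100))}

print(params['max_depth'])         # [2, 3, 4, 5, 6]
print(params['min_samples_leaf'])  # [200, 300, 400, 500]

# Every combination is one candidate; with cv=10 each candidate is fitted 10 times
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * 10
print(n_candidates, n_fits)        # 40 400
```

With the default `refit=True`, GridSearchCV also fits the best candidate once more on the full training set, which is the model used by `cv.score` and `cv.predict`.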

In [11]:
visualize_decision_tree(cv.best_estimator_, X.columns, "optimal_tree.png")

![DT visualisation](optimal_tree.png "DT visualisation")

The tree is quite hard to read in the notebook, but if you open the file and zoom in on the rightmost branch in the picture, one set of characteristics of the more expensive houses is:
1. (Type_u <= 0.5 == False) = Unit houses.
2. (Rooms <= 2.5 == False) = More than 2.5 rooms (3 rooms and above).
3. (Regionname_Southern Metropolitan <= 0.5 == False) = Located in Melbourne's Southern Metropolitan area.
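
When the image is too large to read comfortably, the same split conditions can also be printed as text. The sketch below uses scikit-learn's `export_text` (available from scikit-learn 0.21) on a small tree fitted to the iris dataset. The iris data and the name `toy_tree` are stand-ins for illustration and are not part of this exercise.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small stand-in tree; in the exercise you would pass your own fitted tree instead
iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each indented line is one split condition; leaf lines show the predicted class
rules = export_text(toy_tree, feature_names=list(iris.feature_names))
print(rules)
```

For this exercise, a call along the lines of `export_text(cv.best_estimator_, feature_names=list(X.columns))` should print the tuned tree in the same way, assuming your scikit-learn version provides `export_text`.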

# Answer

When you have finished all exercise questions, the sample answers are available in the following GitHub repository. Please attempt the exercises before viewing the answers.

https://github.com/cab330/2019/tree/master/CAB330_answers