Importing modules

In [87]:
import numpy as np
import graphviz
from pandas import read_csv
from sklearn.tree import DecisionTreeClassifier,export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score,f1_score,recall_score

Loading the data. Unlike with logistic regression, we won't standardize the features: a decision tree splits on per-feature thresholds, so rescaling a feature does not change the resulting classification.

In [88]:
data=read_csv("../data/california_housing.csv")
x=data.drop("MedianHouseValue",axis=1)
priceColumn=data["MedianHouseValue"]
medianPrice=np.median(priceColumn)
# label 1 = affordable (below the median price), label 0 = expensive
y=(priceColumn < medianPrice).astype(int)
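The claim that standardization does not affect a decision tree can be checked empirically. The following is a minimal sketch on synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the housing CSV), comparing a tree trained on raw features with one trained on standardized features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: used only for illustration)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training split only
scaler = StandardScaler().fit(X_train)

# Same hyperparameters and seed for both trees; only the feature scale differs
tree_raw = DecisionTreeClassifier(max_depth=5, random_state=0)
tree_raw.fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=0)
tree_scaled.fit(scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Share of test samples on which the two trees agree
agreement = np.mean(pred_raw == pred_scaled)
print("Share of identical predictions:", agreement)
```

Standardization is a monotonic transformation of each feature, so the ordering of samples along every feature (and therefore the set of candidate splits) is preserved, apart from possible floating-point edge cases.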

Splitting the data into a 70% training set and a 30% test set

In [89]:
xTrain,xTest,yTrain,yTest=train_test_split(x,y,random_state=42,test_size=0.3)

Decision tree results are highly sensitive to hyperparameters. To find a good configuration, we use `GridSearchCV` to evaluate every combination in the parameter grid with 5-fold cross-validation. Precision, recall, and F1 (the metrics used in part 2) are all recorded during the search, and the final model is selected and refit based on F1.

In [90]:
hyperParams={
    "max_depth":[3,5,7],
    "min_samples_leaf":[10,20,50]
}

scorer = ['f1','recall','precision']


model = DecisionTreeClassifier()
grid=GridSearchCV(model,hyperParams,cv=5,scoring=scorer,refit="f1")
grid.fit(xTrain,yTrain)
pred=grid.predict(xTest)
print("The best params : "+str(grid.best_params_))
precision = precision_score(yTest, pred)
print("Precision is : "+str(precision))
recall = recall_score(yTest, pred)
print("recallScore is : "+str(recall))
f1 = f1_score(yTest, pred)
print("F1Score is : "+str(f1))

The best params : {'max_depth': 7, 'min_samples_leaf': 20}
Precision is : 0.8188727042431919
recallScore is : 0.8428943937418514
F1Score is : 0.8307099261162866
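Because the search records several metrics, the per-combination cross-validation scores are available in `cv_results_` under keys such as `mean_test_f1`. The sketch below shows one way to list them, again on synthetic data from `make_classification` (an assumption; the variable names `search` and `rows` are illustrative and not part of the notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: used only for illustration)
X, y = make_classification(n_samples=600, n_features=8, random_state=42)

param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [10, 20, 50]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring=["f1", "recall", "precision"],
    refit="f1",  # the final model is chosen by mean cross-validated F1
)
search.fit(X, y)

cv = search.cv_results_
# Stable sort so that tied combinations keep their grid order
order = np.argsort(cv["rank_test_f1"], kind="stable")

rows = []
for i in order:
    rows.append(
        (
            cv["params"][i],
            cv["mean_test_f1"][i],
            cv["mean_test_recall"][i],
            cv["mean_test_precision"][i],
        )
    )

for params, f1, recall, precision in rows:
    print(f"{params}  f1={f1:.3f}  recall={recall:.3f}  precision={precision:.3f}")
```

Looking at the full table rather than only `best_params_` shows how close the runner-up combinations are and whether the three metrics agree on the ranking.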


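The results are close to those of logistic regression, and both models can benefit from hyperparameter tuning. Of the tested max_depth values (3, 5, 7), a depth of 7 gave the best cross-validated F1. Since 7 is the largest value in the grid, a deeper tree might score even higher; capping the depth and requiring at least 20 samples per leaf helps limit overfitting.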
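One way to check how tree depth relates to overfitting is to compare training and test F1 across several depths. This is a minimal sketch on synthetic data with some label noise (an assumption; the depth values beyond 7 are chosen for illustration and were not part of the original grid).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% flipped labels (assumption: illustration only)
X, y = make_classification(
    n_samples=1000, n_features=8, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scores = {}
for depth in [3, 5, 7, 10, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_f1 = f1_score(y_train, tree.predict(X_train))
    test_f1 = f1_score(y_test, tree.predict(X_test))
    scores[depth] = (train_f1, test_f1)
    print(f"max_depth={depth}: train F1={train_f1:.3f}, test F1={test_f1:.3f}")
```

A widening gap between training and test F1 as depth grows is the usual sign of overfitting; a depth is a reasonable choice when the test score has stopped improving.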

Visualizing the tree with graphviz

In [91]:
dot_data = export_graphviz(grid.best_estimator_, out_file=None, feature_names=x.columns, class_names=["expensive","affordable"])
graph=graphviz.Source(dot_data)
graph.render("decision tree",cleanup=True)

'decision tree.pdf'
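If Graphviz is not installed, scikit-learn's `export_text` gives a plain-text view of the same kind of tree. The sketch below uses a small synthetic dataset and hypothetical feature names (`f0` to `f3`), both assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data and placeholder feature names (assumptions)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each line is either a threshold test on a feature or a leaf with its class
text = export_text(tree, feature_names=feature_names)
print(text)
```

With the notebook's own objects, the equivalent call would presumably be `export_text(grid.best_estimator_, feature_names=list(x.columns))`.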

After looking at the generated tree, we can deduce the following:

1. `Median income`: The most important feature; it splits the data at the root. Lower income leads to an "affordable" prediction.
2. `Average rooms`: Higher values often predict expensive housing.
3. `Average occupancy`: Lower occupancy is associated with affordable housing.
4. `Location`: Significantly impacts affordability based on longitude and latitude.
5. `House age`: Older houses are more likely to be affordable.
6. `Population density`: May influence affordability.
7. `Average bedrooms`: Plays a role in some cases.
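Observations like these can be cross-checked against the fitted tree's `feature_importances_` attribute, which reports each feature's normalized share of the total impurity reduction. The sketch below uses synthetic data and placeholder feature names (both assumptions), with the hyperparameters reported as best above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and placeholder feature names (assumptions)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=7, min_samples_leaf=20, random_state=0)
tree.fit(X, y)

# Impurity-based importances; they sum to 1 across all features
importances = tree.feature_importances_

# Sort features from most to least important
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With the notebook's own objects, `grid.best_estimator_.feature_importances_` paired with `x.columns` would give the corresponding ranking. Impurity-based importances show how much a feature was used for splitting, not the direction of its effect, so they complement rather than replace reading the tree.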