### Decision Trees as Regression Models - California Housing: Predicting House Prices from House Age, Room Count, and Geographical Location

We use the California Housing dataset to obtain a Decision Tree model, which is then used to predict the median house prices in various districts across California.

In [None]:
# 1) Import the tools you’ll need
# These are fundamental libraries:
#
# numpy (np): fast numerical arrays and math operations.
# pandas (pd): for working with tables (dataframes).
# random (rnd): a tiny helper if you need simple randomness.
#
# Why it matters — even simple data science projects start
# by importing these building blocks.

import numpy as np
import pandas as pd  
import random as rnd

In [None]:
# 2) Make Jupyter show every output in a cell. This is optional; I added it because I like to see what each line does.
# Jupyter normally shows only the last expression’s output in a cell.
# This line configures the notebook so that every expression’s result in a cell is displayed.
#
# Why it matters — great for teaching and debugging because you can see
# intermediate outputs (dataframes, numbers, etc.) without adding print() everywhere.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# 3) Load the California Housing dataset into a DataFrame
#
# What this does —
# scikit-learn provides the California housing dataset.
# We load it and wrap the features into a pandas.DataFrame with readable column names.
# Displaying cali_df shows the table.
#
# Why it matters —
# DataFrames make it easy to inspect columns and values,
# which is the first step in any data project.

from sklearn.datasets import fetch_california_housing
cali_dataset=fetch_california_housing()
cali_df = pd.DataFrame(cali_dataset.data,columns=cali_dataset.feature_names)
cali_df

In [None]:
# 4) Add the target variable (median house value)
#
# What this does —
# The dataset separates features and targets; here we add the target
# (median house value for each row) as a new column named 'MedHouseVal'.
#
# Why it matters —
# Now the DataFrame has both the input features and
# the value we want to predict (the target).

cali_df['MedHouseVal']=cali_dataset.target
cali_df

In [None]:
# 5) Document the data (comments)
# Data source: https://scikit-learn.org/0.24/datasets/real_world.html#california-housing-dataset
# Attribute Information
# MedInc median income in block
# HouseAge median house age in block
# AveRooms average number of rooms
# AveBedrms average number of bedrooms
# Population block population
# AveOccup average house occupancy
# Latitude house block latitude
# Longitude house block longitude

names=cali_dataset.feature_names
names


In [None]:
# 6) (Typical) Quick EDA — inspect head, info, shape, describe
#
# What this does —
# Shows the first few rows, data types, summary statistics,
# and the number of rows and columns.
#
# Why it matters —
# Quick checks help catch missing values, unexpected data types,
# or obvious data issues before moving on to modeling.

cali_df.head()
cali_df.info()
cali_df.describe()
cali_df.shape

In [None]:
# 7) Prepare features and target for modeling
#
# What this does —
# X contains the input features; y contains the target values.
# We separate them because scikit-learn models expect this format.
#
# Why it matters —
# Modeling functions require a feature matrix (X) and a label vector (y).

X = cali_df.drop('MedHouseVal', axis=1)
y = cali_df['MedHouseVal']

In [None]:
# 7b) Rebuild the features and target as NumPy arrays and import DecisionTreeRegressor from sklearn
#
# What this does —
# X contains the 8 input feature columns; Y contains the target values (the MedHouseVal column).
# These arrays replace the DataFrame-based X and y from the previous cell and are the ones used below.
#
# Why it matters —
# Modeling functions require a feature matrix (X) and a label vector (y).



from sklearn.tree import DecisionTreeRegressor 
array= cali_df.values

X=array[:,0:8]
Y=array[:,8]

X
Y

In [None]:
# 8) Split into training and test sets
#
# What this does —
# Randomly splits the data into a training set (used to train the model)
# and a test set (used to evaluate it).
# test_size=0.3 reserves 30% of the data for testing.
# random_state ensures the split is reproducible.
#
# Why it matters —
# Testing on unseen data gives a realistic sense of how well
# the model will perform on new, unseen inputs.

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=1234)
len(X_train)
len(Y_train)
len(X_test)
len(Y_test)
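
As a minimal sketch of what the split does, the cell below applies `train_test_split` to a tiny made-up array (an assumption for illustration, not the housing data), so the 70/30 sizes and the effect of `random_state` are easy to see. The names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical): 10 samples, 2 features each.
# Row i holds the values (2*i, 2*i + 1), and its target is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same arguments as in the notebook: 30% test, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1234)
print(len(X_tr), len(X_te))

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1234)
print(np.array_equal(X_te, X_te2))

# Features and targets are shuffled together, so each row keeps its own target.
print(np.array_equal(X_te[:, 0], 2 * y_te))
```

Changing `random_state` to another integer gives a different, but equally reproducible, split.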

### Decision Tree

<u>Note</u>: Changing max_leaf_nodes from 10 to 20 only results in an R-squared increase
of 0.04, and a far greater increase in depth/complexity.

In [None]:
# 9) Fit a Decision Tree Regressor
#
# What this does —
# Creates a Decision Tree regression model and trains (fits) it
# using the training data (X_train, Y_train).
#
# Why it matters —
# Decision Trees are easy to visualize and understand.
# They’re great for teaching regression intuition and
# for seeing how the model makes decisions based on feature values.

model=DecisionTreeRegressor(max_leaf_nodes = 10)
regTree=model.fit(X_train,Y_train)
regTree
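
The note on max_leaf_nodes above can be checked with a small comparison loop. The sketch below uses synthetic stand-in data (an assumption, so that it runs without downloading anything); the R-squared values it prints will not match the housing results, but the same loop can be pointed at `X_train`, `Y_train`, `X_test`, and `Y_test`. The names `X_syn`, `y_syn`, and `results` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data (hypothetical): a noisy nonlinear function of 3 features.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(2000, 3))
y_syn = np.sin(X_syn[:, 0]) + 0.5 * X_syn[:, 1] + rng.normal(0, 0.3, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1234)

# Fit one tree per setting and record test R-squared, depth, and leaf count.
results = {}
for n_leaves in (10, 20):
    m = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0).fit(X_tr, y_tr)
    results[n_leaves] = {
        "r2": r2_score(y_te, m.predict(X_te)),
        "depth": m.get_depth(),
        "leaves": m.get_n_leaves(),
    }
    print(n_leaves, results[n_leaves])
```

Comparing the gain in R-squared with the growth in depth and leaf count is one way to judge whether the larger tree is worth its extra complexity.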

In [None]:
# Step 10: Make a prediction for one random sample
# What this does:
# rnd.seed(123458): sets the random seed, so every run of the notebook
# picks the same random number. (Reproducibility = good science!)
#
# rnd.randrange(X.shape[0]): chooses a random integer from 0 up to,
# but not including, the total number of rows in X.
#
# X[...]: selects that one row (a single data point) from the dataset.
# Displaying X_new shows its feature values, e.g. median income, house age, etc.
#
# X_new.reshape(1,8): turns the 1-D row into a 2-D array with 1 row and 8 columns,
# because predict() expects one row per sample.
#
# Why this matters:
# Instead of predicting on the whole test set, this picks one district and asks the model,
# “What median house value do you predict for this district?”

rnd.seed(123458)
X_new=X[rnd.randrange(X.shape[0])]
X_new

X_new=X_new.reshape(1,8)
X_new
YHat=model.predict(X_new)
YHat
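
The reshape step is easiest to see on array shapes alone. The following is a small sketch with a made-up row of 8 numbers (a hypothetical sample, not a real district):

```python
import numpy as np

# Hypothetical single sample with 8 feature values.
row = np.arange(8, dtype=float)
print(row.shape)                  # (8,)

# reshape(1, 8) makes it a 2-D array: 1 sample (row) by 8 features (columns).
row_2d = row.reshape(1, 8)
print(row_2d.shape)               # (1, 8)

# reshape(1, -1) does the same without hard-coding the number of features.
print(row.reshape(1, -1).shape)   # (1, 8)
```

scikit-learn estimators expect a 2-D feature array, with one row per sample, even when predicting for a single sample.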

In [None]:
# Step 11: Presenting the prediction in a clean table

# Creates a small table (DataFrame) from your one data sample.
# Each column has its proper name (like MedInc, HouseAge, etc.).
# Adds a new column called "Predicted Price", which stores your model’s output.
# df.head(1) shows the first (and only) row: all the feature values plus the model’s predicted price.

df=pd.DataFrame(X_new,columns = names)
df["Predicted Price"]=YHat
df.head(1)

In [None]:
# Step 12 Predict & evaluate the model
#
# What this does —

# imports the R² (R-squared) metric function from scikit-learn.
# R² is a standard way to measure the performance of regression models 
# (like your DecisionTreeRegressor).
#
# Interpretation of R²:
# R² = 1 → Perfect prediction (the model explains 100% of the variation in the target).
# R² = 0 → The model is no better than predicting the mean (no explanatory power).
# R² < 0 → The model performs worse than a simple average guess.
#
# The predict() method applies the trained Decision Tree model
# to unseen test data (X_test) to generate predictions (YHat).
# The result, YHat, is an array of predicted values, e.g. predicted house prices.
# r2_score(Y_test, YHat) then compares those predictions with the true test values.
#
# Why it matters —
# These metrics tell you whether the model is accurate enough to be useful
# or if it needs improvement (through tuning, more data, or a different model).

from sklearn.metrics import r2_score
YHat=model.predict(X_test)
r2 = r2_score(Y_test, YHat)
print("R-Squared = ", np.round(r2, 2))
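
To make the interpretation of R² above concrete, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot. The helper name `r2_manual` and the four sample values are made up for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]   # made-up values for illustration
y_pred = [2.5, 0.0, 2.0, 8.0]

print(round(r2_manual(y_true, y_pred), 3))        # 0.949
print(r2_manual(y_true, [np.mean(y_true)] * 4))   # 0.0 (no better than the mean)
print(r2_manual(y_true, y_true))                  # 1.0 (perfect prediction)
```

Predictions that are worse than the mean make SS_res larger than SS_tot, which is how R² becomes negative.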

In [None]:
from IPython.display import Image   # To display images directly inside Jupyter Notebook
from sklearn import tree            # To access decision tree functions from scikit-learn
import matplotlib.image as mpimg     # To read and display image files using matplotlib
import io                            # To handle in-memory file-like objects (used for .dot data)
import matplotlib.pyplot as plt      # To create and show plots or images
!pip install pydotplus               # Install pydotplus, which converts .dot data to images
import pydotplus                     # Import pydotplus to create image graphs from .dot data
!pip install graphviz                # Install the Python graphviz package; the Graphviz program itself (dot) may need a separate system install

In [None]:
dot_data = io.StringIO()   # Create an in-memory text stream to store the decision tree data in .dot format
dot_data                   # Display the empty StringIO object (optional, just shows it's created)

tree.export_graphviz(regTree, out_file=dot_data, filled=True, feature_names=names)  
# Export the trained decision tree (regTree) as .dot data with filled color nodes and feature names

# Draw graph
pydotplus.graph_from_dot_data(dot_data.getvalue()).write_png('dt.png')  
# Convert the .dot data to a graph object and save it as a PNG image file (dt.png)

plt.figure(figsize=(100, 100))  
# Create a very large figure to properly display the detailed decision tree image

img = mpimg.imread('dt.png')  
# Read the saved PNG image file into Python using matplotlib

implot = plt.imshow(img)  
# Display the image of the decision tree inside the notebook

In [None]:
# How to read the tree (in simple words):
# Start at the top (root node): here the model first checks MedInc (median income).
# Follow the True/False branches depending on the condition.
# Keep following until you reach a leaf node (no further splits).
# The “value” at the leaf is your model’s predicted target for that path.

# Attribute        Meaning
# ------------------------------------------------------------
# Feature ≤ Threshold   # Splitting rule for that node
# squared_error         # How “impure” or variable the node’s values are
# samples               # Number of data points in that node
# value                 # Predicted output (mean target value) for that node
# Color                 # Indicates relative prediction value (light = low, dark = high)
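
The same reading rules can be tried on a text printout of a tree, which needs no Graphviz install. The sketch below fits a tiny tree on synthetic stand-in data (an assumption, not the housing data) and prints its rules with `sklearn.tree.export_text`; the feature names are placeholders borrowed from the housing columns, and `small_tree` is a hypothetical name.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (hypothetical): the target is 1.0 when the first
# feature is at most 5 and 3.0 otherwise; the second feature is irrelevant.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 2))
y_syn = np.where(X_syn[:, 0] <= 5, 1.0, 3.0)

small_tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X_syn, y_syn)

# Each indented line is one node: a "feature <= threshold" rule or a leaf value.
rules = export_text(small_tree, feature_names=["MedInc", "HouseAge"])
print(rules)

# Following the rules by hand matches predict():
# a low first feature leads to the 1.0 leaf, a high one to the 3.0 leaf.
print(small_tree.predict([[2.0, 7.0], [8.0, 7.0]]))
```

Calling `export_text(regTree, feature_names=list(names))` on the notebook's fitted tree should give the same kind of printout for the housing model.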




### Examine the output

In [None]:
cali_df.shape   # Returns the number of rows and columns in the DataFrame (as a tuple)
regTree.max_features_   # Shows the number of features considered when looking for the best split in the trained Decision Tree

In [None]:
# Create a DataFrame showing each feature name and its importance score from the trained decision tree
feature_imp = pd.DataFrame({"Feature": cali_dataset.feature_names,
                           "Importance": regTree.feature_importances_})
feature_imp

In [None]:
relevant_features = feature_imp[feature_imp["Importance"] > 0] # Filter out only the features that have non-zero importance in the model
relevant_features = relevant_features.sort_values(["Importance"]) # Sort the remaining features in ascending order of their importance

In [None]:
import seaborn as sns   # Import Seaborn library for creating attractive and easy-to-read plots
sns.barplot(x="Feature", y="Importance", data=relevant_features)  
# Plot a bar chart showing each feature on the x-axis and its importance score on the y-axis
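
For a rough feel of what the importance scores mean, the sketch below fits a tree on synthetic stand-in data (an assumption) where the target depends strongly on one feature, weakly on a second, and not at all on a third. In scikit-learn, `feature_importances_` holds normalized impurity reductions, so the scores sum to 1. The names `f0`, `f1`, and `f2` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical): 3 features on the same scale.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 3))

# The target depends strongly on f0, weakly on f1, and not at all on f2.
y_syn = 3.0 * X_syn[:, 0] + 0.5 * X_syn[:, 1]

m = DecisionTreeRegressor(max_leaf_nodes=10, random_state=0).fit(X_syn, y_syn)
imp = m.feature_importances_

# One score per feature; the feature driving most splits gets the largest share.
print(dict(zip(["f0", "f1", "f2"], np.round(imp, 3))))
print(imp.sum())
```

A feature with zero importance was never used for a split in this particular tree, which does not by itself prove that it is unrelated to the target.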