# Decision Tree Classification — Heart Disease Dataset

This notebook walks through EDA, preprocessing, Decision Tree training, hyperparameter tuning, evaluation, and interpretation using `sklearn.tree.plot_tree` for visualization.

Change `file_path` in the data-loading cell if you run the notebook locally.

In [9]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix, RocCurveDisplay)

plt.rcParams['figure.figsize'] = (8,6)
print('Imports ready')

Imports ready


In [10]:
# Load dataset (change file_path if needed)
file_path = r'D:\NEW DATA SCIENCE ASSIGNMENT\8.Decision Tree\heart_disease.xlsx'  # change this to your local path if needed

try:
    df = pd.read_excel(file_path)
    print('Loaded:', file_path)
except Exception as e:
    raise SystemExit(f"Could not load the dataset at {file_path}: {e}")

print('Shape:', df.shape)
df.head()

Loaded: D:\NEW DATA SCIENCE ASSIGNMENT\8.Decision Tree\heart_disease.xlsx
Shape: (12, 2)


Unnamed: 0,age,Age in years
0,Gender,"Gender ; Male - 1, Female -0"
1,cp,Chest pain type
2,trestbps,Resting blood pressure
3,chol,cholesterol measure
4,fbs,(fasting blood sugar > 120 mg/dl) (1 = true; 0...
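
The shape `(12, 2)` and the rows shown above (column names paired with descriptions) suggest that `pd.read_excel` returned a data-dictionary sheet rather than the patient records; by default it reads only the first sheet of a workbook. A minimal sketch of one way to locate the data, assuming the records sit on another sheet of the same workbook: load every sheet with `sheet_name=None` and pick the one with the most rows. The helper `pick_data_sheet` and the sheet names and values in the demo are hypothetical, not taken from the actual file.

```python
import pandas as pd

def pick_data_sheet(sheets):
    """Return (name, DataFrame) for the sheet with the most rows.

    `sheets` is a dict of sheet name -> DataFrame, the structure that
    pd.read_excel(path, sheet_name=None) returns.
    """
    name = max(sheets, key=lambda s: len(sheets[s]))
    return name, sheets[name]

# Stand-in for pd.read_excel(file_path, sheet_name=None); names and values are made up.
demo_sheets = {
    "Description": pd.DataFrame({"age": ["Gender", "cp"],
                                 "Age in years": ["Gender", "Chest pain type"]}),
    "Records": pd.DataFrame({"age": [63, 41, 57], "chol": [233, 204, 192],
                             "num": [0, 1, 0]}),
}

sheet_name, data = pick_data_sheet(demo_sheets)
print("Using sheet:", sheet_name, "with shape", data.shape)
```

With the real workbook, `sheets = pd.read_excel(file_path, sheet_name=None)` followed by `print({k: v.shape for k, v in sheets.items()})` shows which sheet name to pass as `sheet_name`.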


## 1 — Exploratory Data Analysis (EDA)

Inspect structure, missing values, summary statistics, class balance, and correlations.

In [11]:
# Basic EDA
print(df.info())
print('\nMissing values per column:\n', df.isnull().sum())

display(df.describe().T)

# Class distribution: try to detect a binary target column
possible_targets = [c for c in df.columns if c.lower() in ('target','outcome','disease','heartdisease','hd','diagnosis','num')]
print('Candidate target columns found:', possible_targets)

# If none found, try numeric columns with only 0/1 values
if not possible_targets:
    for c in df.select_dtypes(include=[np.number]).columns:
        unique_vals = df[c].dropna().unique()
        if set(unique_vals).issubset({0,1}):
            possible_targets.append(c)
print('Final candidate targets:', possible_targets)

# Show distribution for top candidate (not applied yet)
if possible_targets:
    tgt = possible_targets[0]
    print(f"Using candidate target column: {tgt}")
    print(df[tgt].value_counts(dropna=False))
else:
    print('No obvious binary target found. You will need to set the target column manually in the next cell.')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   age           12 non-null     object
 1   Age in years  12 non-null     object
dtypes: object(2)
memory usage: 324.0+ bytes
None

Missing values per column:
 age             0
Age in years    0
dtype: int64


Unnamed: 0,count,unique,top,freq
age,12,12,Gender,1
Age in years,12,12,"Gender ; Male - 1, Female -0",1


Candidate target columns found: []
Final candidate targets: []
No obvious binary target found. You will need to set the target column manually in the next cell.
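
Because the cell above ran on the data-dictionary frame, class balance and correlations were never actually computed. A sketch of those two checks on a small made-up frame; the column names `age`, `chol`, and `num` and all values are illustrative assumptions, not the real dataset:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "age":  [63, 41, 57, 45, 68, 52],
    "chol": [233, 204, 192, 260, 274, 212],
    "num":  [1, 0, 0, 1, 1, 0],   # hypothetical binary target
})

# Class balance as proportions of each target value
balance = toy["num"].value_counts(normalize=True).sort_index()
print(balance)

# Pearson correlation of each feature with the target
corr_with_target = toy.corr()["num"].drop("num")
print(corr_with_target.sort_values(ascending=False))
```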


## 2 — Feature Engineering & Preprocessing

- Identify the target column (auto-detected if possible).  
- Handle missing values and encode categorical variables.  
- Scale numeric features when appropriate.
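
The three steps above can also be arranged as a scikit-learn `Pipeline`, so that imputation values and category levels are learned only from the data passed to `fit`. This is an alternative sketch under assumed column names (`age`, `chol`, `cp`) and made-up values, not the notebook's own preprocessing. Scaling is left out here because tree splits depend only on the ordering of feature values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up data for illustration only
X_toy = pd.DataFrame({
    "age":  [63, 41, np.nan, 45, 68, 52, 39, 60],
    "chol": [233, 204, 192, np.nan, 274, 212, 180, 250],
    "cp":   ["typical", "atypical", "typical", "none",
             "none", "atypical", "typical", "none"],
})
y_toy = pd.Series([1, 0, 0, 1, 1, 0, 0, 1])

num_features = ["age", "chol"]   # assumed numeric columns
cat_features = ["cp"]            # assumed categorical column

preprocess = ColumnTransformer([
    # Median imputation for numeric columns
    ("num", SimpleImputer(strategy="median"), num_features),
    # One-hot encoding; unseen categories at predict time become all-zero rows
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
])

model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
])

model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(preds)
```

A pipeline like this can be passed directly to `GridSearchCV`, in which case the preprocessing is refitted on each training fold.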

In [12]:
# === Preprocessing ===
# Auto-detect target (falls back to manual assignment)
candidate_targets = [c for c in df.columns if c.lower() in ('target','outcome','disease','heartdisease','hd','diagnosis','num')]
if not candidate_targets:
    # try 0/1 numeric detection
    for c in df.select_dtypes(include=[np.number]).columns:
        uv = df[c].dropna().unique()
        if set(uv).issubset({0,1}):
            candidate_targets.append(c)

if candidate_targets:
    target_col = candidate_targets[0]
else:
    # If auto-detection fails, set target_col manually here
    target_col = 'target'  # <-- change this to your known target column name

print('Selected target column:', target_col)

# Basic cleaning: drop rows where target is missing
df = df.copy()
if target_col not in df.columns:
    raise SystemExit(f"Target column '{target_col}' not found. Edit `target_col` to the correct column name.")

print('Rows before dropping missing target:', df.shape[0])
df = df.dropna(subset=[target_col])
print('Rows after dropping missing target:', df.shape[0])

# Separate features and target
X = df.drop(columns=[target_col])
y = df[target_col]

# Identify categorical columns (object or category dtypes)
cat_cols = X.select_dtypes(include=['object','category']).columns.tolist()
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
print('Categorical columns:', cat_cols)
print('Numeric columns:', num_cols)

# Simple encoding: one-hot encode categoricals (drop_first to avoid collinearity)
if cat_cols:
    X = pd.get_dummies(X, columns=cat_cols, drop_first=True)

# Impute numeric missing values with median
for c in num_cols:
    if c in X.columns and X[c].isnull().any():
        # Assign the result back; chained inplace fillna is deprecated in recent pandas
        X[c] = X[c].fillna(X[c].median())

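# Scale numeric features (optional for decision trees, whose splits depend only on value ordering).
# Note: fitting the scaler here, before the train/test split, uses test-set statistics.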
scaler = StandardScaler()
# Only scale columns that still exist (num_cols may overlap with get_dummies modifications)
num_cols_existing = [c for c in num_cols if c in X.columns]
if num_cols_existing:
    X[num_cols_existing] = scaler.fit_transform(X[num_cols_existing])

print('Processed feature shape:', X.shape)
print('Sample processed features:')
X.head()

Selected target column: target


SystemExit: Target column 'target' not found. Edit `target_col` to the correct column name.

In [7]:
print(df.columns.tolist())


['age', 'Age in years']


## 3 — Train/Test Split and Baseline Decision Tree

Split the data 80/20, stratified on the target, and train a baseline Decision Tree with default hyperparameters.

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

# Baseline Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)

# Metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, zero_division=0)
rec = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

print('Baseline Decision Tree performance:')
print(f'Accuracy: {acc:.4f}  Precision: {prec:.4f}  Recall: {rec:.4f}  F1: {f1:.4f}')

# ROC-AUC; report the reason if it cannot be computed instead of failing silently
y_proba = dt.predict_proba(X_test)[:,1]
try:
    roc = roc_auc_score(y_test, y_proba)
    print(f'ROC-AUC: {roc:.4f}')
except ValueError as e:
    print('ROC-AUC not available:', e)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', cm)
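
The `stratify=y` argument in the split above keeps the class proportions of the full dataset in both partitions. A small sketch on made-up, imbalanced labels (80 negatives, 20 positives) compares a plain split with a stratified one:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives, 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# Plain random split: the test-set positive rate depends on the shuffle
_, _, _, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Stratified split: the test set keeps the overall positive rate
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print("Positive rate overall:     ", y_demo.mean())
print("Positive rate, plain split:", y_te_plain.mean())
print("Positive rate, stratified: ", y_te_strat.mean())
```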

## 4 — Hyperparameter Tuning (GridSearchCV)

Tune `max_depth`, `min_samples_split`, and `criterion` with 5-fold cross-validation, selecting the combination with the best F1 score.

In [None]:
# Grid search
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5, 7, 9],
    'min_samples_split': [2, 5, 10]
}

gscv = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='f1', n_jobs=-1)
gscv.fit(X_train, y_train)
print('Best params:', gscv.best_params_)
print('Best CV score (f1):', gscv.best_score_)

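# GridSearchCV refits the best configuration on all of X_train by default (refit=True),
# so best_estimator_ is already fitted and needs no second fit call
best_dt = gscv.best_estimator_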

# Evaluate best model
y_pred_best = best_dt.predict(X_test)
acc_b = accuracy_score(y_test, y_pred_best)
prec_b = precision_score(y_test, y_pred_best, zero_division=0)
rec_b = recall_score(y_test, y_pred_best, zero_division=0)
f1_b = f1_score(y_test, y_pred_best, zero_division=0)
print('\nTuned Decision Tree performance:')
print(f'Accuracy: {acc_b:.4f}  Precision: {prec_b:.4f}  Recall: {rec_b:.4f}  F1: {f1_b:.4f}')

# ROC-AUC for the tuned model; report the reason if it cannot be computed
try:
    roc_b = roc_auc_score(y_test, best_dt.predict_proba(X_test)[:,1])
    print(f'ROC-AUC: {roc_b:.4f}')
except ValueError as e:
    print('ROC-AUC not available:', e)

cm_b = confusion_matrix(y_test, y_pred_best)
print('\nConfusion matrix (tuned):\n', cm_b)
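
A sketch of what the search object exposes after fitting, using synthetic data from `make_classification` (an assumption for illustration; the grid is smaller than the one above). It shows that `best_estimator_` is usable immediately and that `cv_results_` holds the mean score of every combination, not only the winner.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 4, None], "min_samples_split": [2, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X_syn, y_syn)

# best_estimator_ was refitted on the full data passed to fit()
best = search.best_estimator_
print("Best params:", search.best_params_)
print("Depth of refitted tree:", best.get_depth())

# Rank all parameter combinations by mean cross-validated F1
ranked = sorted(
    zip(search.cv_results_["mean_test_score"],
        map(str, search.cv_results_["params"])),
    reverse=True,
)
for score, params in ranked[:3]:
    print(f"{score:.4f}  {params}")
```

Looking at the runners-up helps judge whether the winning setting is clearly better or within noise of simpler trees.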

## 5 — Model Interpretation & Visualization

Plot the decision tree, the feature importances, the ROC curve, and the confusion matrix of the tuned model.

In [None]:
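# Plot the full tree (pass max_depth=... to plot_tree to limit the plotted depth for readability)
plt.figure(figsize=(20,12))
# feature_names is passed as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(best_dt, feature_names=list(X.columns), class_names=[str(c) for c in np.unique(y)], filled=True, rounded=True)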
plt.title('Decision Tree (full)')
plt.show()

# Feature importances
fi = pd.Series(best_dt.feature_importances_, index=X.columns).sort_values(ascending=False)
print('Top 20 feature importances:')
display(fi.head(20))

plt.figure(figsize=(10,6))
fi.head(20).plot(kind='bar')
plt.title('Top 20 Feature Importances')
plt.ylabel('Importance')
plt.show()

# ROC Curve
try:
    RocCurveDisplay.from_estimator(best_dt, X_test, y_test)
    plt.show()
except ValueError as e:
    print('ROC curve not available:', e)

# Confusion matrix heatmap
plt.figure(figsize=(6,5))
sns.heatmap(confusion_matrix(y_test, y_pred_best), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (tuned)')
plt.show()
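
When the plotted tree is too large to read, a text rendering limited to the top levels can be easier to inspect. A minimal sketch with `sklearn.tree.export_text` on synthetic data; the feature names are placeholders, not the dataset's columns:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data, for illustration only
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
names = [f"feature_{i}" for i in range(X_syn.shape[1])]  # placeholder names

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_syn, y_syn)

# Text rendering of the top levels only; deeper branches are marked as truncated
rules = export_text(tree, feature_names=names, max_depth=2)
print(rules)
```

`plot_tree` accepts a `max_depth` argument as well, so the same limit can be applied to the figure above.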

## Interview Questions & Answers

1. What are some common hyperparameters of decision tree models, and how do they affect the model's performance?

- `max_depth`: maximum depth of the tree. Smaller values reduce overfitting but may underfit; larger values allow more complex rules but risk overfitting.
- `min_samples_split`: minimum number of samples required to split an internal node. Larger values make the tree more conservative.
- `min_samples_leaf`: minimum number of samples required at a leaf node. Increasing it reduces variance, because no prediction rests on only a handful of samples.
- `criterion`: the function that measures the quality of a split (e.g., 'gini' or 'entropy'). Different criteria can produce slightly different trees.
- `max_features`: number of features considered when searching for the best split. Limiting it can reduce variance and speed up training.
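
The effect of these settings on tree size can be seen directly. The sketch below fits three trees on synthetic, slightly noisy data (an assumption for illustration) and prints depth, leaf count, and training accuracy for each:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, for illustration only
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, flip_y=0.1, random_state=42)

settings = {
    "defaults": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
}

fitted = {}
for label, params in settings.items():
    clf = DecisionTreeClassifier(random_state=42, **params).fit(X_syn, y_syn)
    fitted[label] = clf
    print(f"{label:>20}: depth={clf.get_depth()}, leaves={clf.get_n_leaves()}, "
          f"train accuracy={clf.score(X_syn, y_syn):.3f}")
```

Training accuracy alone favors the unconstrained tree; comparing the same settings on held-out data is what reveals overfitting.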

2. What is the difference between Label encoding and One-hot encoding?

- Label encoding assigns each category a unique integer. It is compact, but the integers imply an order between categories, which is inappropriate for nominal data.
- One-hot encoding creates one binary column per category level. It assumes no ordering and is typically safer for nominal categorical data, but it increases dimensionality.
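
A short pandas sketch of both encodings on a made-up nominal feature (the values of `cp` are illustrative):

```python
import pandas as pd

# Made-up nominal feature for illustration
cp = pd.Series(["typical", "atypical", "none", "typical"], name="cp")

# Label encoding: one integer per category (pandas assigns codes in sorted category order)
label_encoded = cp.astype("category").cat.codes
print(label_encoded.tolist())

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(cp, prefix="cp")
print(one_hot.columns.tolist())
```

The integer codes suggest that one category is "greater" than another, an ordering that exists only because of alphabetical sorting; the one-hot columns carry no such ordering.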
