# Decision Tree

#### Question 1: What is a Decision Tree, and how does it work in the context of classification?

- A decision tree is a flowchart-like structure where:

    - Each internal node represents a decision based on a feature (e.g., "Is age > 30?").

    - Each branch represents the outcome of that decision (Yes/No or a split based on values).

    - Each leaf node represents the final prediction (a class label in classification).

It works by recursively splitting the dataset into smaller and smaller groups based on the feature that best separates the data until a stopping condition is reached (like maximum depth, minimum samples, or pure class distribution).


The algorithm tries to partition the dataset so that each group (leaf) contains mostly instances of a single class.

Steps:

1. Start with the full dataset as the root.

2. Choose the best feature to split the data — the one that creates the "purest" child nodes.

- Metrics used:

    - Gini Impurity

    - Entropy / Information Gain

    - Chi-Square

3. Split the dataset into subsets based on the chosen feature’s values.

4. Repeat recursively on each subset until:

    - All samples in a node belong to the same class, or

    - A maximum depth is reached, or

    - No further improvement can be made.

5. Assign a class label to each leaf node (the majority class in that subset).
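Step 2 is the heart of the procedure. As a minimal sketch (not how any particular library implements it), the hypothetical helpers `gini` and `best_split` below scan every candidate threshold of every feature on a small made-up dataset and return the split with the lowest weighted Gini impurity.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    n_samples, n_features = X.shape
    best = (None, None, gini(y))  # start from the impurity of the unsplit node
    for j in range(n_features):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if weighted < best[2]:
                best = (j, t, weighted)
    return best

# Toy data (made up): feature 0 separates the classes, feature 1 is noise
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [7.0, 2.0], [8.0, 6.0], [9.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

feature, threshold, impurity = best_split(X, y)
print(f"Best split: feature {feature} <= {threshold:.1f} (weighted Gini = {impurity:.3f})")
```

A full tree builder would call `best_split` recursively on the left and right subsets until one of the stopping conditions in step 4 is met.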

#### Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?


- Gini Impurity

Definition: Probability of incorrectly classifying a randomly chosen sample if it was randomly labeled according to the class distribution in the node.

$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

Where:

$p_i$ = proportion of samples belonging to class $i$

$C$ = total number of classes.

Example: Suppose a node has:

70% samples of Class A ($p_A = 0.7$)

30% samples of Class B ($p_B = 0.3$)

Then $\text{Gini} = 1 - (0.7^2 + 0.3^2) = 1 - 0.58 = 0.42$.

If a node has all samples of one class → Gini = 0 (pure).

For two classes, the maximum Gini occurs when the classes are evenly split (50-50 → Gini = 0.5).



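- Entropy (Information Gain)

Definition: Measures the amount of uncertainty or disorder in a node.

$$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

Where:

$p_i$ = proportion of samples in class $i$.

Impact on splits: at each node the tree evaluates candidate splits and keeps the one that gives the largest drop in weighted child impurity relative to the parent. Gini and entropy usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.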
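To make the two measures concrete, here is a small sketch that evaluates both on a pure node, the 70/30 node from the Gini example, and an evenly split node. The function names `gini` and `entropy` are illustrative helpers, not library calls.

```python
import math

def gini(proportions):
    """Gini impurity from a list of class proportions that sum to 1."""
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    """Entropy in bits; classes with p == 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in proportions if p > 0)

for dist in ([1.0, 0.0], [0.7, 0.3], [0.5, 0.5]):
    print(dist, f"gini={gini(dist):.4f}", f"entropy={entropy(dist):.4f}")
```

Both measures are zero for a pure node and largest for the even split, which is why either one can be used to rank candidate splits.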



#### Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.


- Pre-Pruning (Early Stopping)

    - Also called early stopping.

    - The tree stops growing before it becomes too complex.

    - We set constraints while building the tree.

Examples of Pre-Pruning techniques:

    - Limit the maximum depth of the tree (max_depth).

    - Require a minimum number of samples per node (min_samples_split, min_samples_leaf).

    - Set a maximum number of leaf nodes (max_leaf_nodes).

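    - Stop splitting if impurity reduction (Gini/Entropy) is below a threshold (min_impurity_decrease).

Practical advantage: training is faster and cheaper, because branches that would later be discarded are never grown.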
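The constraints listed above correspond to constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below sets several of them at once on the Iris data; the specific values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Pre-pruning: growth stops as soon as any of these constraints would be violated
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # limit tree depth
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf keeps at least 5 samples
    max_leaf_nodes=8,            # cap on the number of leaves
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=42,
)
pre_pruned.fit(X_train, y_train)

print("Depth:", pre_pruned.get_depth())
print("Leaves:", pre_pruned.get_n_leaves())
print("Test accuracy:", pre_pruned.score(X_test, y_test))
```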


- Post-Pruning (Cost Complexity Pruning)

    - The tree is allowed to grow fully (possibly overfitting).

    - Afterward, we prune back the tree by removing branches that do not improve performance on validation data.

How it works:

    - Build a deep tree.

    - Use a validation set or cross-validation to evaluate subtrees.

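    - Iteratively remove the nodes/branches whose removal causes the smallest increase in validation error.

    - Select the pruned subtree that performs best on the validation data.

Practical advantage: pruning decisions are made after the full tree has been seen, so splits that only pay off deeper in the tree are not cut off prematurely.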
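One way to carry out these steps in scikit-learn is cost-complexity pruning. The sketch below, which assumes a simple hold-out validation split on Iris rather than cross-validation, gets candidate `ccp_alpha` values from `cost_complexity_pruning_path`, refits one tree per value, and keeps the one that scores best on the validation set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 1) Grow the full tree and compute its pruning path (candidate alphas)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# 2) Refit one tree per alpha and score it on the validation set
best_alpha, best_score, best_tree = None, -1.0, None
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score, best_tree = alpha, score, tree

print("Leaves in full tree:", full_tree.get_n_leaves())
print("Chosen ccp_alpha:", best_alpha)
print("Leaves after pruning:", best_tree.get_n_leaves())
print("Validation accuracy:", best_score)
```

Larger `ccp_alpha` values prune more aggressively; the largest value on the path collapses the tree to a single node.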

#### Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

- Information Gain is a measure of how much “knowledge” a feature gives us about the class labels.

In Decision Trees, when we split on a feature, we want the resulting child nodes to be as pure (homogeneous) as possible.

    - If a split reduces the disorder (impurity) a lot → high information gain.

    - If a split doesn’t change much → low information gain.

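It is calculated using Entropy, as the entropy of the parent node minus the weighted average entropy of the child nodes:

$$\text{IG} = \text{Entropy}(\text{parent}) - \sum_{k} \frac{n_k}{n}\,\text{Entropy}(\text{child}_k)$$

where $n_k$ is the number of samples in child $k$ and $n$ is the number of samples in the parent.

Information Gain is important in Decision Trees because: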

    - At each step, a decision tree algorithm chooses the feature that provides the highest Information Gain.

    - This ensures the split maximally reduces impurity and increases class homogeneity in the child nodes.

    - Repeating this process recursively leads to a tree that efficiently separates classes.
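A small worked example may help here. The sketch below uses two hypothetical helpers, `entropy` and `information_gain`, to compare a split that separates two classes well against one that barely changes the class mix. The labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["A"] * 5 + ["B"] * 5
good_split = [["A"] * 4 + ["B"], ["A"] + ["B"] * 4]                   # mostly pure children
poor_split = [["A", "A", "B", "B", "B"], ["A", "A", "A", "B", "B"]]  # still mixed

ig_good = information_gain(parent, good_split)
ig_poor = information_gain(parent, poor_split)
print(f"Good split IG: {ig_good:.4f}")
print(f"Poor split IG: {ig_poor:.4f}")
```

The tree would pick the first split, since it removes far more uncertainty about the class label.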




#### Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Real-World Applications of Decision Trees

1. Healthcare

    - Diagnosing diseases (e.g., “Does the patient have diabetes?” based on symptoms, test results).

    - Predicting patient outcomes (survival, risk factors).

2. Finance

    - Credit risk assessment (approve/reject a loan).

    - Fraud detection in banking transactions.

3. Marketing & Business

    - Customer segmentation (who is likely to buy a product).

    - Churn prediction (which customers may stop using a service).

4. Retail & E-commerce

    - Recommendation systems (products based on past behavior).

    - Price prediction for items.

5. Manufacturing & Operations

    - Predicting equipment failure (maintenance scheduling).

    - Quality control decisions.

6. Government & Law

    - Crime prediction (identifying high-risk areas).

    - Tax fraud detection.

Main Advantages of Decision Trees

1. Simple and Interpretable

    - Easy to visualize and explain (“white-box model”).

    - Non-technical stakeholders can understand rules.

2. Handles Different Data Types

    - Works with both categorical and numerical data.

3. Requires Little Data Preparation

    - No need for feature scaling (like standardization/normalization).

    - Handles missing values (depending on implementation).

4. Fast Training and Prediction

    - Splitting is straightforward and computationally efficient.

5. Versatility

    - Can be used for classification and regression problems.
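The interpretability point can be illustrated with scikit-learn's `export_text`, which prints a fitted tree as nested if/else rules. This is a minimal sketch on Iris with a shallow tree (`max_depth=2` is an arbitrary choice to keep the output short).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Render the learned rules as indented text that a non-technical reader can follow
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```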



Main Limitations of Decision Trees

1. Overfitting

    - If grown too deep, trees capture noise and lose generalization power.

2. Instability

    - Small changes in data can produce very different trees (high variance).

3. Biased Splits

    - Features with many levels (e.g., unique IDs) may dominate splits.

4. Limited Predictive Accuracy

    - Alone, decision trees are often less accurate than ensemble methods like Random Forest or Gradient Boosted Trees.

5. Poor Fit for Smooth or Diagonal Boundaries

    - Splits are axis-aligned, so smooth or diagonal decision boundaries must be approximated with many stair-step splits (linear classifiers often do better here).
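The overfitting limitation is easy to observe empirically. The sketch below assumes a synthetic dataset with deliberately noisy labels (`flip_y=0.2`) and compares training and test accuracy for a shallow tree and a fully grown one. The fully grown tree memorizes the training set, and its test accuracy typically lags well behind.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): about 20% of labels are random
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in (3, None):  # None lets the tree grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```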

#### Question 6: Write a Python program to:
#### ● Load the Iris Dataset
#### ● Train a Decision Tree Classifier using the Gini criterion
#### ● Print the model’s accuracy and feature importances

In [3]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


#### Question 7: Write a Python program to:
#### ● Load the Iris Dataset
#### ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [4]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# --- Fully-grown Decision Tree (no depth limit) ---
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# --- Pruned Decision Tree (max_depth=3) ---
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
y_pred_pruned = clf_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Print results
print("Accuracy of fully-grown tree:", accuracy_full)
print("Accuracy of pruned tree (max_depth=3):", accuracy_pruned)


Accuracy of fully-grown tree: 1.0
Accuracy of pruned tree (max_depth=3): 1.0


#### Question 8: Write a Python program to:
#### ● Load the California Housing dataset from sklearn
#### ● Train a Decision Tree Regressor
#### ● Print the Mean Squared Error (MSE) and feature importances

In [5]:
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Make predictions
y_pred = reg.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


#### Question 9: Write a Python program to: 
#### ● Load the Iris Dataset 
#### ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV 
#### ● Print the best parameters and the resulting model accuracy

In [6]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the Decision Tree model
clf = DecisionTreeClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    "max_depth": [2, 3, 4, 5, None],          # tree depth options
    "min_samples_split": [2, 3, 4, 5, 10]     # min samples per split
}

# Perform GridSearchCV (5-fold cross-validation)
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Test Accuracy with Best Parameters: 1.0


#### Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
#### Explain the step-by-step process you would follow to:
#### ● Handle the missing values
#### ● Encode the categorical features
#### ● Train a Decision Tree model
#### ● Tune its hyperparameters
#### ● Evaluate its performance
#### And describe what business value this model could provide in the real-world setting.


1. Handle missing values — step by step

- Assess the missingness

    - Compute % missing per column; visualize patterns (heatmap, missingno, or df.isna().sum()).

    - Try to determine mechanism: MCAR / MAR / MNAR — this influences strategy.

- Decide to drop or impute

    - Drop a feature if > threshold missing (e.g., >60–80%) and it’s not critical.

    - Drop rows only if very few and missingness appears random.

- Impute carefully (avoid leakage)

    - Use SimpleImputer (mean/median for numeric, most_frequent for categorical) for a baseline.

    - For better quality, use KNNImputer or IterativeImputer (model-based).

    - Add a missing indicator column when “missingness” may itself be predictive.

- Always fit imputers on training data only (use Pipeline/ColumnTransformer).
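As a minimal sketch of these steps, the example below uses a tiny made-up patient table (the column names `age` and `glucose` are hypothetical). It measures missingness, fits a median imputer with a missing-indicator on the training rows only, and then applies the learned medians to unseen data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test rows with gaps
train = pd.DataFrame({"age": [34, np.nan, 58, 45], "glucose": [90.0, 110.0, np.nan, 130.0]})
test = pd.DataFrame({"age": [np.nan], "glucose": [100.0]})

# Assess missingness: fraction of missing values per column
print(train.isna().mean())

# Median imputation plus one indicator column per feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_imputed = imputer.fit_transform(train)  # medians learned from training data only
test_imputed = imputer.transform(test)        # same medians reused, no leakage

# Columns: age, glucose, age_was_missing, glucose_was_missing
print(test_imputed)
```

In a real project the imputer would sit inside a `Pipeline` so that cross-validation refits it on each training fold.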


2. Encode categorical features

- Low-cardinality (<= ~10) → OneHotEncoder(drop='first').

- High-cardinality → target encoding / hashing / leave-one-out / embedding. Beware target leakage (do target encoding with cross fold or within CV).

- Label/Ordinal encoding: trees can split on integer values, but label encoding may introduce spurious order — use only if category has real order.

- Consider algorithms that handle categoricals natively (CatBoost / LightGBM) if you want to avoid heavy encodings.
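The first and third options can be combined in one `ColumnTransformer`. The sketch below assumes two hypothetical columns: `blood_type` (no natural order, so one-hot encoded) and `severity` (a real order, so ordinal encoded with the order stated explicitly).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up example data
df = pd.DataFrame({
    "blood_type": ["A", "B", "O", "A"],                   # nominal
    "severity": ["mild", "severe", "moderate", "mild"],   # ordered
})

encoder = ColumnTransformer(
    [
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["blood_type"]),
        ("ordinal", OrdinalEncoder(categories=[["mild", "moderate", "severe"]]), ["severity"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)
encoded = encoder.fit_transform(df)

# Columns: blood_type A, B, O (one-hot), then severity as 0/1/2
print(encoded)
```

Passing `categories` explicitly matters for the ordinal column; without it the encoder would order the levels alphabetically.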


3. Training a Decision Tree (best practice)

- Use a Pipeline that includes preprocessing (imputation + encoding) and the classifier.

- Use class_weight='balanced' or sample weights if the disease is rare.

- Use stratified splitting (StratifiedKFold, train_test_split(..., stratify=y)) to preserve prevalence in train/test.


4. Hyperparameter tuning

- Tune important parameters:

    - max_depth, min_samples_split, min_samples_leaf, max_features, ccp_alpha (cost-complexity pruning), criterion (gini/entropy), class_weight.

- Use GridSearchCV or RandomizedSearchCV with StratifiedKFold.

- Choose scoring aligned with business objective (e.g., recall/sensitivity if false negatives are expensive, average_precision / PR-AUC for severe class imbalance, or ROC-AUC for overall ranking).

- Consider nested CV for unbiased generalization-error estimates if you report a final score.
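To show how the scoring choice enters the search, here is a hedged sketch using `RandomizedSearchCV` with `scoring="recall"` on a synthetic imbalanced dataset. The roughly 10% positive rate stands in for a rare disease, and the parameter ranges are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: about 10% positives
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0)

param_distributions = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=20,         # sample 20 of the 96 possible combinations
    scoring="recall",  # optimize sensitivity rather than accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean cross-validated recall:", search.best_score_)
```

Swapping `scoring` for `"average_precision"` or `"roc_auc"` changes which candidate wins without touching the rest of the code.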


5. Evaluation (what to compute & why)

- Primary metrics (choose per business need):

    - Sensitivity (Recall), Specificity, Precision, F1.

    - ROC-AUC and PR-AUC (PR-AUC is more informative with rare disease).

    - Confusion matrix at chosen threshold.

- Probability calibration: Decision Tree probabilities are often poorly calibrated; use CalibratedClassifierCV (isotonic or sigmoid) if you need reliable probabilities for risk scoring.

- Threshold tuning: pick a decision threshold that balances false negatives against false positives according to a cost matrix.

- Explainability: show global feature_importances_ and run SHAP or LIME for per-patient explanations.

- Fairness & subgroup analysis: check performance across age, gender, ethnicity subgroups.
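Several of these checks fit in a few lines. The sketch below assumes synthetic imbalanced data and a hypothetical decision threshold of 0.2. It calibrates a tree with `CalibratedClassifierCV`, reports ROC-AUC and PR-AUC, and compares recall at the default and the lowered threshold.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import (
    average_precision_score, confusion_matrix, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: about 10% positives
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Wrap the tree so its predicted probabilities are calibrated via 5-fold isotonic regression
base = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
model = CalibratedClassifierCV(base, method="isotonic", cv=5).fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, proba))
print("PR-AUC:", average_precision_score(y_test, proba))

# Threshold tuning: a lower threshold trades more false positives for fewer false negatives
recall_default = recall_score(y_test, (proba >= 0.5).astype(int))
threshold = 0.2  # hypothetical value; in practice derive it from the cost matrix
y_pred = (proba >= threshold).astype(int)
recall_low = recall_score(y_test, y_pred)

print("Recall at 0.5:", recall_default)
print(f"Recall at {threshold}:", recall_low)
print(confusion_matrix(y_test, y_pred))
```

The threshold should be chosen on validation data, not on the test set used for the final report.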


6. Business value & real-world considerations

- Clinical value: early detection, prioritize high-risk patients, target diagnostic testing, reduce time-to-treatment.

- Operational value: improve allocation of scarce resources (specialist visits, tests), reduce downstream costs by early intervention.

- Risk management: tune for high recall if missing disease is costly; use human-in-the-loop to verify positive predictions.

- Trust & adoption: provide explanations (SHAP), plot decision rules for clinicians, and pilot before full deployment.

- Governance: ensure privacy (HIPAA/PDPA), audit model for bias, maintain retraining/monitoring pipeline to detect drift.

- Deployment approach: start as decision-support (assistive), not autonomous diagnosis; measure clinical outcomes via an A/B or prospective study.

In [None]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score, confusion_matrix

# -----------------------------
# 1) Load dataset as DataFrame
# -----------------------------
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# -----------------------------
# 2) Train-test split (stratified)
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# -----------------------------
# 3) Identify numeric/categorical columns
# -----------------------------
numeric_features = X.select_dtypes(include=['int64','float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object','category']).columns.tolist()  # none in iris

# -----------------------------
# 4) Preprocessing pipelines
# -----------------------------
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore", drop="first"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])

# -----------------------------
# 5) Full pipeline with Decision Tree
# -----------------------------
pipe = Pipeline([
    ("preproc", preprocessor),
    ("clf", DecisionTreeClassifier(random_state=42))
])

# -----------------------------
# 6) Hyperparameter grid for tuning
# -----------------------------
param_grid = {
    "clf__criterion": ["gini", "entropy"],
    "clf__max_depth": [3, 5, 8, None],
    "clf__min_samples_split": [2, 5, 10],
    "clf__min_samples_leaf": [1, 2, 5],
    "clf__class_weight": [None, "balanced"],
    "clf__ccp_alpha": [0.0, 0.001, 0.01]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy", n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)

# -----------------------------
# 7) Best parameters
# -----------------------------
print("Best Parameters:", grid.best_params_)

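# -----------------------------
# 8) Evaluate on test set
# -----------------------------
# GridSearchCV refits the best pipeline on the full training set by default
y_pred = grid.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))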


Fitting 5 folds for each of 432 candidates, totalling 2160 fits
