# Decision Tree | Assignment


**Question 1: What is a Decision Tree, and how does it work in the context of
classification?**


**Answer:**
A **Decision Tree** is a machine learning model that looks like a flowchart. It is used for classification problems. The tree splits the data into smaller groups based on different features.

It works by asking a series of questions. For example: “Is age > 30?” Depending on the answer, it moves to the next branch. At each step, the tree chooses the feature that best separates the classes using measures like **Gini impurity** or **information gain**.

This process continues until the tree reaches a **leaf node**, which gives the final class label (typically the majority class of the training samples that ended up in that leaf).
In short, a Decision Tree repeatedly divides the data based on conditions and predicts the class of a new sample by following the path from the root to a leaf.
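
The root-to-leaf walk can be sketched in plain Python. The tree below is hand-written for illustration only: its features (`age`, `income`), thresholds, and labels are hypothetical and were not learned from any data.

```python
# A hand-written toy tree (hypothetical features and thresholds, not learned from data).
# Internal nodes ask "is feature <= threshold?"; leaves hold a class label.
toy_tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "no"},                      # age <= 30
    "right": {                                    # age > 30
        "feature": "income", "threshold": 50_000,
        "left": {"label": "no"},                  # income <= 50,000
        "right": {"label": "yes"},                # income > 50,000
    },
}


def predict(node, sample):
    """Follow the path from the root to a leaf and return the leaf's label."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(predict(toy_tree, {"age": 25, "income": 80_000}))  # no
print(predict(toy_tree, {"age": 45, "income": 80_000}))  # yes
```

A trained tree works the same way at prediction time; the difference is that the features and thresholds are chosen by the learning algorithm rather than written by hand.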



**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?**

Answer:
Gini Impurity and Entropy are impurity measures used to decide the quality of a split in a Decision Tree.

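Gini Impurity: The probability that a randomly chosen sample from the node would be misclassified if it were labeled at random according to the node's class distribution. With class proportions p_i, Gini = 1 - Σ p_i².

Gini = 0 means the node is pure (all samples in one class).

Entropy: Measures the amount of randomness or disorder in the node. With class proportions p_i, Entropy = -Σ p_i log2(p_i).

Entropy = 0 also means the node is pure.

Impact on splits:
For each candidate split, the tree computes the weighted average impurity of the resulting child nodes and chooses the split that reduces impurity the most.
Lower Gini or lower Entropy after a split means purer child nodes, so the tree selects that split and keeps growing from there.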
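
As a small illustration, both measures can be computed from a list of class labels in a few lines of standard-library Python. The helper names (`class_proportions`, `gini`, `entropy`) are chosen for this sketch and are not part of any library.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of samples belonging to each class in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    return sum(-p * log2(p) for p in class_proportions(labels))


pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]

print(gini(pure), entropy(pure))    # 0.0 0.0 for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 1.0 for a 50/50 two-class node
```

Both measures are zero for a pure node and reach their maximum when the classes are evenly mixed.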

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.**

Answer:

Pre-Pruning:

Pre-pruning stops the tree from growing too deep while it is being built.
Examples: setting max depth, minimum samples to split, etc.
Advantage: Saves time and prevents the tree from becoming too complex.

Post-Pruning:

Post-pruning allows the tree to grow fully first and then cuts off unnecessary branches afterward.
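Advantage: Often generalizes better than pre-pruning, because the tree sees every candidate split first and then removes only the branches that contribute little, instead of stopping early and missing a useful split deeper down.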
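
The two approaches can be compared with a short scikit-learn sketch. Pre-pruning is done here with `max_depth`, and post-pruning with cost-complexity pruning (`ccp_alpha`). The depth limit of 2 and the mid-range alpha are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: cap the depth while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Illustrative choice: a mid-range alpha taken from the pruning path.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, tree in [("pre-pruned", pre_pruned), ("full", full_tree), ("post-pruned", post_pruned)]:
    print(name, "leaves:", tree.get_n_leaves(), "test accuracy:", tree.score(X_test, y_test))
```

In practice the value of `ccp_alpha` would be selected by cross-validation rather than picked from the middle of the path.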


**Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?**

Answer:

Information Gain is a measure that tells us how much a feature helps in reducing uncertainty (entropy) when splitting the data in a Decision Tree.

It is calculated as:

Information Gain = Entropy(parent) - weighted average Entropy(children)

where each child's entropy is weighted by the fraction of the parent's samples that fall into it.

Information Gain helps the tree choose the best feature for splitting.
A higher Information Gain means the split produces purer child nodes, so the tree picks the feature (and threshold) with the highest gain at each node.
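
A worked example in plain Python, assuming the definition of entropy in bits and two made-up candidate splits of an eight-sample node:

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 4 + ["no"] * 4

# Hypothetical split A separates the classes perfectly.
split_a = [["yes"] * 4, ["no"] * 4]
# Hypothetical split B leaves both children as mixed as the parent.
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, split_a))  # 1.0
print(information_gain(parent, split_b))  # 0.0
```

Split A removes all uncertainty, so the tree would prefer it over split B, which removes none.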

**Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**

Dataset Info:

● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).

● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).

**Answer:**

**Real-world applications of Decision Trees:**

1. **Medical Diagnosis** – predicting diseases based on symptoms.
2. **Finance** – loan approval, credit risk prediction.
3. **Marketing** – identifying customer segments.
4. **Manufacturing** – predicting equipment failure.
5. **Classification problems** like the **Iris dataset** (classifying flower species).
6. **Regression problems** like the **Boston Housing dataset** (predicting house prices).

**Advantages of Decision Trees:**

* Easy to understand and visualize.
* Works with both classification and regression.
* Requires little data preprocessing (no scaling needed).
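* Can handle both numerical and categorical features (although scikit-learn's implementation expects categorical features to be encoded as numbers first).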

**Limitations of Decision Trees:**

* Can easily **overfit** if not pruned.
* Small changes in data can change the entire tree (unstable).
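* Often less accurate than ensemble models such as Random Forests.
* Splits are axis-aligned, so smooth or diagonal decision boundaries are approximated poorly and very complex relationships may need deep, hard-to-read trees.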


**Question 6: Write a Python program to:**

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)
print("Feature Importances:", model.feature_importances_)


Model Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]
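
The raw importance array is easier to read when each value is paired with its feature name. The sketch below retrains the same model and prints the importances sorted from most to least important; the `ranked` variable is introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
model = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X_train, y_train)

# Pair each importance with its feature name and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances sum to 1, and in the output above the third value, which corresponds to petal length, dominates.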


**Question 7: Write a Python program to:**

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


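In [2]:
# Imports repeated here so the cell also runs on its own
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()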
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, pred_limited)

tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, pred_full)

print("Accuracy with max_depth=3:", accuracy_limited)
print("Accuracy of fully-grown tree:", accuracy_full)


Accuracy with max_depth=3: 1.0
Accuracy of fully-grown tree: 1.0
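
Both trees score 1.0 on this single 45-sample test split, so the comparison cannot tell them apart. A steadier comparison, sketched here with 5-fold cross-validation (`cross_val_score` is an addition, not part of the original solution), averages accuracy over several splits:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for name, depth in [("max_depth=3", 3), ("fully grown", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5)  # accuracy on 5 stratified folds
    results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two means are close, the shallower tree is usually the better choice because it is simpler to interpret.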


**Question 8: Write a Python program to:**

● Load the Boston Housing Dataset

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

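In [4]:
# Note: load_boston was removed in scikit-learn 1.2, so the California Housing
# dataset is used here as a substitute regression dataset.
from sklearn.datasets import fetch_california_housing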
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Feature Importances:", model.feature_importances_)



Mean Squared Error (MSE): 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]


**Question 9: Write a Python program to:**

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy: 1.0


**Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.**

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.


**Answer:**

Below is a clear, practical procedure you could follow when building a Decision Tree to predict a disease from a large mixed dataset.



### 1) Handle missing values

1. **Understand the pattern:** Check how much is missing per feature and whether missingness is random (MCAR), related to other features (MAR), or informative (MNAR).
2. **Remove only if justified:** Drop features with a very high missing rate (e.g., > 50%), and drop rows only when a tiny fraction of them contain missing values.
3. **Impute sensibly:**

   * **Numerical:** median (robust) or mean; use KNN or iterative imputer if relationships are complex.
   * **Categorical:** a new category like `"Missing"` or the mode; or use predictive imputation.
4. **Keep a missing indicator:** For important features, add a boolean column `feature_is_missing` so the model can learn missingness patterns.
5. **Use the same imputer in production:** Fit imputers on training data and apply to test/production (use pipelines).
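
A minimal sketch of median imputation with missing indicators, using scikit-learn's `SimpleImputer`. The two columns (age and blood pressure) and their values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns: age, blood pressure (np.nan marks a missing value).
X_train = np.array([[25.0, 120.0], [40.0, np.nan], [np.nan, 135.0], [60.0, 150.0]])
X_new = np.array([[np.nan, 128.0]])

# Median imputation plus one "was missing" indicator column per affected feature.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imputed = imputer.fit_transform(X_train)  # medians are learned here only
X_new_imputed = imputer.transform(X_new)          # reuse the training medians

print(X_train_imputed)
print(X_new_imputed)
```

The new record's missing age is filled with the training median (40), and the extra indicator columns record which values were originally missing.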


### 2) Encode categorical features

1. **Low-cardinality categorical (few unique values):** Use **One-Hot Encoding**.
2. **High-cardinality categorical:** Use **Target Encoding**, **Ordinal Encoding**, or embeddings. Be careful of leakage: fit target encoding inside the CV folds, never on the full dataset.
3. **Ordered categories:** Use **Ordinal Encoding** if order matters.
4. **Trees & scaling:** Decision Trees don’t need feature scaling.
5. **Use pipelines / ColumnTransformer** to apply different encodings to different columns and avoid leakage.


### 3) Train a Decision Tree model

1. **Train/test split:** Split (e.g., 70/30 or 80/20) or use nested CV for robust estimates.
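2. **Baseline model:** Train a simple `DecisionTreeClassifier` with default settings (in scikit-learn the default criterion is Gini; entropy is the alternative).
3. **Use class weights if imbalanced:** Set `class_weight='balanced'`, or oversample (e.g., SMOTE) or undersample the training folds only, inside CV.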
4. **Fit using a pipeline** that includes imputation and encoding to avoid leakage.



### 4) Tune hyperparameters

1. **Key hyperparameters to tune:** `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, `criterion` (gini/entropy), `ccp_alpha` (cost complexity pruning).
2. **Search method:** GridSearchCV or RandomizedSearchCV with cross-validation (e.g., 5-fold). For speed, RandomizedSearch is good with many params.
3. **Nested CV (if possible):** Use nested CV to get unbiased performance estimates when tuning.
4. **Pruning:** Use `ccp_alpha` (cost-complexity pruning) to control overfitting.
5. **Use scoring aligned with business goal:** e.g., maximize recall if missing a disease is costly, or maximize precision if false positives are expensive.
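
Steps 3 and 4 can be sketched together as a pipeline tuned with `GridSearchCV`. This is an illustration under stated assumptions: scikit-learn's bundled breast cancer dataset stands in for the healthcare data (it has no missing values, so the imputer is a no-op here), the grid values are arbitrary, and recall is used as the scoring metric.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)  # relabel so that 1 = malignant (the positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Preprocessing lives inside the pipeline, so it is re-fit on each CV training fold.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__ccp_alpha": [0.0, 0.005, 0.01],
}

# scoring="recall" favours catching positive cases over avoiding false alarms.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated recall:", round(search.best_score_, 3))
print("Test-set recall:", round(search.score(X_test, y_test), 3))
```

Keeping the imputer inside the pipeline means its statistics are learned only from each training fold, which avoids leaking information from the validation folds.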


### 5) Evaluate performance

1. **Choose metrics suited to problem & cost of errors:**

   * Classification: **Accuracy, Precision, Recall, F1-score, ROC-AUC, PR-AUC.**
   * For imbalanced disease detection, emphasize **Recall (sensitivity)** and **ROC-AUC / PR-AUC**.
2. **Confusion matrix:** Inspect TP, TN, FP, FN to understand types of errors.
3. **Calibration:** Check probability calibration (especially important in clinical settings) using calibration plots or `CalibratedClassifierCV`.
4. **Threshold tuning:** Select decision threshold based on business trade-offs (e.g., choose threshold that achieves desired recall).
5. **Explainability:** Report feature importances and use SHAP or LIME to explain individual predictions (critical in healthcare).
6. **Robustness checks:** Test on holdout set, time-split validation, and subgroups (age, gender) to check fairness.
7. **Monitoring:** Once deployed, monitor drift, performance, and data quality.
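
The confusion-matrix and threshold-tuning steps can be illustrated with plain Python. The labels and predicted probabilities below are made up for the example and do not come from a real model.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, FN, TN when predicting positive for prob >= threshold."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_prob, threshold):
    """Precision and recall of the positive class at the given threshold."""
    tp, fp, fn, _ = confusion_counts(y_true, y_prob, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.6, 0.4, 0.5, 0.35, 0.1, 0.7, 0.8]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold from 0.5 to 0.3 raises recall from 0.75 to 1.00 while precision falls from 0.60 to 0.57, which is the trade-off a disease-screening model has to balance.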


### 6) Business value (real-world)

* **Early detection:** Identify high-risk patients earlier so clinicians can intervene sooner.
* **Prioritize resources:** Triage tests, specialist appointments, and follow-ups for likely positive cases.
* **Cost reduction:** Reduce unnecessary expensive tests by screening patients first.
* **Improved outcomes:** Early treatment can reduce morbidity, hospital stays, and downstream costs.
* **Operational efficiency:** Automate part of screening workflows, freeing clinician time.
* **Interpretability helps adoption:** Decision Trees are interpretable, which makes them easier to explain to clinicians and regulators.


### 7) Practical & ethical considerations (important in healthcare)

* **Avoid data leakage** (e.g., features recorded after diagnosis).
* **Bias & fairness:** Check model performance across demographic groups to avoid harming vulnerable populations.
* **Privacy & compliance:** Ensure HIPAA/GDPR-like safeguards for patient data.
* **Human-in-the-loop:** Use model as decision support, not an absolute decision-maker.
* **Clinical validation:** Validate prospectively and with clinicians before deployment.








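In [7]:
# Illustration of the tuning and evaluation steps on the Iris dataset
# (a stand-in for demonstration, not healthcare data)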
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(random_state=42)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10],
}

grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Accuracy: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

