Question 1: What is a Decision Tree, and how does it work in the context of classification?

- Answer:

A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. In classification, it works by recursively splitting the dataset based on feature values to create branches that lead to a class label.

Each internal node represents a decision based on a feature, each branch represents an outcome of that decision, and each leaf node represents a predicted class. The tree selects splits using impurity measures such as Gini Impurity or Entropy to maximize class separation. During prediction, a data point traverses the tree from root to leaf following the decision rules, and the class at the leaf node is assigned as the output.
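
As a minimal sketch of the prediction step described above, the snippet below walks a small hand-built tree from root to leaf. The dictionary layout, the `predict` helper, and the thresholds (2.45 and 1.75) are illustrative assumptions, not values produced by any library.

```python
# Hypothetical hand-built tree: internal nodes hold a feature index and a
# threshold, leaf nodes hold a class label. Thresholds are illustrative only.
tree = {
    "feature": 2, "threshold": 2.45,        # petal length (cm)
    "left": {"label": "setosa"},
    "right": {
        "feature": 3, "threshold": 1.75,    # petal width (cm)
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(tree, [5.1, 3.5, 1.4, 0.2]))  # setosa
print(predict(tree, [6.7, 3.0, 5.2, 2.3]))  # virginica
```

Each loop iteration corresponds to one internal node on the path, so the cost of a prediction is proportional to the depth of the tree.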

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

- Answer:

Gini Impurity and Entropy are metrics used to measure the impurity or disorder of a dataset at a node.

Gini Impurity measures the probability of incorrectly classifying a randomly chosen sample if labels were assigned randomly according to class distribution.

Entropy measures the amount of uncertainty or information disorder in the dataset.

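Lower impurity values indicate purer nodes, and a value of 0 means every sample in the node belongs to one class. During tree construction, the algorithm evaluates candidate splits and selects the one that gives the largest reduction in the weighted impurity of the child nodes. Gini is slightly cheaper to compute because it avoids the logarithm, while Entropy has an information-theoretic interpretation. In practice the two criteria often produce similar trees.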
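
For class proportions p_k at a node, the usual definitions are Gini = 1 - sum(p_k^2) and Entropy = -sum(p_k * log2(p_k)). The sketch below computes both from a list of labels; the helper names `gini` and `entropy` are chosen here for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count)
               for count in Counter(labels).values())


mixed = ["a", "a", "b", "b"]   # evenly mixed node
pure = ["a", "a", "a", "a"]    # pure node

print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(gini(pure), entropy(pure))    # 0.0 0.0
```

For two classes, Gini peaks at 0.5 and Entropy at 1.0 when the classes are evenly mixed, and both fall to 0 for a pure node.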

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

- Answer:

Pre-Pruning stops the tree from growing further during training by setting constraints such as maximum depth or minimum samples per split.
Advantage: Reduces overfitting and training time.

Post-Pruning allows the tree to grow fully and then removes unnecessary branches after training.
Advantage: Often generalizes better, because pruning decisions are made after the full tree is visible and are typically guided by validation performance or a cost-complexity penalty.
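
A hedged sketch of both approaches in scikit-learn follows. Pre-pruning uses the `max_depth` and `min_samples_split` constraints, and post-pruning uses minimal cost-complexity pruning (`ccp_alpha`), which is one post-pruning method among several. The 70/30 split and the rule for choosing alpha on a validation set are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pre-pruning: constrain growth while the tree is being built.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: compute the pruning path of the fully grown tree, then keep
# the alpha that scores best on the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0)
post.fit(X_train, y_train)

print("Pre-pruned depth/leaves:", pre.get_depth(), pre.get_n_leaves())
print("Post-pruned depth/leaves:", post.get_depth(), post.get_n_leaves())
print("Chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)
```

In a real project the alpha would usually be chosen by cross-validation rather than a single validation split, and the final model would be scored on a separate test set.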

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

- Answer:

Information Gain measures the reduction in entropy achieved after splitting a dataset based on a feature. It quantifies how much information a feature provides about the target variable.

The feature with the highest Information Gain is chosen for splitting because it best separates the data into distinct classes. This greedy choice means each split reduces uncertainty about the class on the training data, although it does not guarantee a globally optimal tree.
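
Information Gain can be written as IG = H(parent) - sum((n_child / n_parent) * H(child)), where H is entropy. The sketch below applies this to two hypothetical splits of a toy dataset; the function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count)
               for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose children are as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The tree-building algorithm would prefer the first split, since it removes all uncertainty about the class, while the second leaves the class mix unchanged.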

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

- Answer:

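Applications:

- Medical diagnosis
- Credit risk assessment
- Fraud detection
- Customer churn prediction
- Recommendation systems

Advantages:

- Easy to interpret and visualize
- Works with both numerical and categorical data (scikit-learn's implementation expects categorical features to be encoded as numbers first)
- Requires minimal data preprocessing; feature scaling is not needed

Limitations:

- Prone to overfitting, especially when grown to full depth
- Unstable: small changes in the data can produce a very different tree
- Can create biased trees with imbalanced datasets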
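
To illustrate the interpretability advantage listed above, the sketch below prints a shallow tree as plain if/else rules using scikit-learn's `export_text`. The choice of the Iris dataset and `max_depth=2` is an assumption made to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read at a glance.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# export_text renders the fitted tree as indented threshold rules.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, so a domain expert can check the rules directly against their own knowledge.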

In [1]:
# Question 6: Write a Python program to load the Iris Dataset, train a Decision Tree Classifier using the Gini criterion, and print accuracy and feature importances.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Output
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)

Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


In [2]:
# Question 7: Write a Python program to train a Decision Tree with max_depth=3 and compare its accuracy with a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fully grown tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

# Limited depth tree
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_acc = accuracy_score(y_test, limited_tree.predict(X_test))

print("Fully grown tree accuracy:", full_acc)
print("Max depth=3 tree accuracy:", limited_acc)

Fully grown tree accuracy: 1.0
Max depth=3 tree accuracy: 1.0


In [3]:
# Question 8: Write a Python program to train a Decision Tree Regressor on the Boston Housing Dataset and print MSE and feature importances.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset from OpenML (load_boston was removed from scikit-learn 1.2)
boston = fetch_openml(name="boston", version=1, as_frame=False)
X, y = boston.data, boston.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)

Mean Squared Error: 10.416078431372549
Feature Importances: [5.12956739e-02 3.35270585e-03 5.81619171e-03 2.27940651e-06
 2.71483790e-02 6.00326256e-01 1.36170630e-02 7.06881622e-02
 1.94062297e-03 1.24638653e-02 1.10116089e-02 9.00872742e-03
 1.93328464e-01]


In [4]:
# Question 9: Write a Python program to tune max_depth and min_samples_split using GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10]
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5
)

grid.fit(X_train, y_train)

best_model = grid.best_estimator_
accuracy = accuracy_score(y_test, best_model.predict(X_test))

print("Best Parameters:", grid.best_params_)
print("Best Model Accuracy:", accuracy)

Best Parameters: {'max_depth': None, 'min_samples_split': 2}
Best Model Accuracy: 1.0


Question 10: Explain the full step-by-step process of building a Decision Tree for disease prediction and its business value.

- Answer:

- First, missing values are handled by either removing incomplete records or imputing values using statistical measures such as mean, median, or mode. Medical datasets often benefit from domain-aware imputation.

- Next, categorical features are encoded so they can be processed by the model: One-Hot Encoding for nominal features, or ordinal (label-style) encoding for features with a natural order.

- The dataset is then split into training and testing sets, and a Decision Tree model is trained using an impurity criterion such as Gini or Entropy. Hyperparameters such as max_depth and min_samples_split are tuned using GridSearchCV to reduce overfitting.

- Model performance is evaluated using metrics like accuracy, precision, recall, F1-score, and ROC-AUC to ensure reliability in healthcare decisions.

- Business Value:
A well-validated model can support early disease detection, improve clinical decision support, reduce diagnostic costs, and contribute to better patient outcomes by giving healthcare professionals interpretable and actionable predictions.
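
The steps above (imputation, encoding, splitting, tuning, and evaluation) can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names, the synthetic values, and the rule that generates the label are invented stand-ins for real patient records, and pandas is used only for convenience.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (hypothetical columns, not real records).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Introduce missing values so the imputation step has work to do.
df["smoker"] = df["smoker"].astype(object)
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

# Steps 1-2: impute missing values, then one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 3: split, then tune hyperparameters with cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 4: evaluate on the held-out test set.
y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", auc)
```

Putting the preprocessing inside the pipeline means the imputer and encoder are fitted only on the training folds during cross-validation, which avoids leaking information from the validation data. Scoring by F1 is one possible choice; in a screening setting recall may matter more.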