Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Answer: A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks, though it is most commonly applied to classification problems. It has a tree-like structure in which each internal node tests the value of a particular feature, each branch represents an outcome of that test, and each leaf node holds a final class label. The tree is built by recursively splitting the dataset into subsets, choosing at each step the feature and threshold that best separate the classes. For classification, the algorithm scores candidate splits with an impurity measure such as Gini Impurity or Entropy (the reduction in entropy is called Information Gain) and picks the split that reduces impurity the most. Once the tree is fully grown or meets a stopping criterion, a new instance is classified by traversing the tree from the root node to a leaf node, following at each internal node the branch that matches the instance’s feature values. This makes decision trees intuitive, interpretable, and effective for many practical classification problems.
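To make the root-to-leaf traversal concrete, here is a minimal sketch that hand-codes a tiny tree as nested dictionaries and classifies one sample. The feature names, thresholds, and labels are hypothetical and chosen only for illustration; a real tree would learn them from training data.

```python
# A hand-built toy tree (hypothetical thresholds, not learned from data).
# Internal nodes hold a feature name and a threshold; leaves hold a class label.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},            # petal_length <= 2.5
    "right": {
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},    # petal_width <= 1.7
        "right": {"label": "virginica"},    # petal_width > 1.7
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, following the branch each test selects."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


sample = {"petal_length": 4.8, "petal_width": 1.5}
print(classify(toy_tree, sample))  # versicolor
```

Each prediction costs only one comparison per level of the tree, which is part of why trained trees are fast at inference time.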


Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Answer: Gini Impurity and Entropy are two commonly used metrics in decision trees that measure the “impurity” or disorder of the class labels at a given node, helping the algorithm decide which feature to split on. Gini Impurity is the probability of misclassifying a randomly chosen sample from the node if it were labeled at random according to the node’s class distribution. A lower Gini Impurity means the node is more homogeneous, with most of its samples belonging to a single class. Entropy, derived from information theory, measures the uncertainty in the class distribution at a node. Higher entropy indicates more evenly mixed classes, while zero entropy corresponds to a perfectly pure node containing samples from only one class. When constructing a decision tree, the algorithm evaluates candidate splits by calculating the reduction in impurity, known as Information Gain for entropy or Gini Gain for Gini Impurity. The split that maximizes this reduction is chosen, so that the resulting child nodes are as pure as possible. These measures therefore directly shape the structure of the tree: splits that produce more homogeneous child nodes are preferred. In practice the two measures usually select similar splits.
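Both measures can be computed directly from the class counts at a node. The sketch below uses the standard definitions (Gini = 1 - sum of squared class proportions, entropy = -sum of p * log2(p)); the helper names `gini` and `entropy` and the sample labels are illustrative only.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]

print(gini(pure), entropy(pure))    # 0.0 0.0 for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 1.0 for a 50/50 two-class node
```

For two classes, Gini peaks at 0.5 and entropy at 1.0, both at an even class split.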


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Answer: Pre-pruning and post-pruning are two techniques used in decision trees to prevent overfitting, which occurs when the tree becomes too complex and models noise in the training data rather than the underlying patterns. Pre-pruning, also known as early stopping, halts the growth of the tree during construction based on criteria such as a maximum depth, a minimum number of samples required to split a node, or a minimum required reduction in impurity. The main advantage of pre-pruning is lower computational cost: because unnecessary splits are never made, the resulting model is simpler, faster to train, and easier to interpret. Post-pruning, on the other hand, allows the tree to grow fully and then removes branches that do not provide significant predictive power, often by evaluating the effect of pruning on a validation dataset or through cross-validation. Because it considers the fully grown tree before deciding which branches are redundant, it can keep a useful split that sits below a weak-looking one, which pre-pruning would have cut off. A practical advantage of post-pruning is that it can achieve a better balance between bias and variance, often producing more robust and generalizable models than pre-pruned trees.
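In scikit-learn, pre-pruning corresponds to constructor limits such as `max_depth` and `min_samples_leaf`, while post-pruning is available as minimal cost-complexity pruning through `ccp_alpha`. The following is a sketch on the Iris data; the specific limits and the choice of the middle alpha are arbitrary assumptions for illustration, and in practice alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Reference: an unpruned tree grown until every leaf is pure.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: stop growth early with constructor limits.
pre = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute candidate alphas from the training data, then refit
# with one of them (the middle one here, purely as an illustrative choice).
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```

Comparing the leaf counts shows how much each strategy simplifies the tree relative to the unpruned reference.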


Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Answer: Information Gain is a key metric used in decision trees to evaluate the effectiveness of a potential split at a given node. It is based on the concept of entropy from information theory and measures the reduction in uncertainty about the class labels after splitting the data on a particular feature. Concretely, it is the entropy of the parent node minus the weighted average entropy of the child nodes, where each child is weighted by its share of the parent’s samples. In other words, Information Gain quantifies how much “information” about the class is obtained by knowing the value of that feature. A higher Information Gain indicates that the split produces more homogeneous child nodes, in which the samples belong predominantly to a single class. This matters because the goal of a decision tree is to create nodes that are as pure as possible, which improves its ability to classify new instances accurately. During tree construction, the algorithm evaluates the candidate splits and chooses the one with the highest Information Gain, so that each decision reduces uncertainty as much as possible. By prioritizing the splits that yield the greatest reduction in entropy, Information Gain guides the tree toward the most informative features first, although pruning is still needed to keep the tree from fitting noise.
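As a worked illustration, the sketch below computes Information Gain for two hypothetical candidate splits of the same parent node. The function names and the toy labels are assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes perfectly; split B leaves both children mixed.
split_a = [["yes"] * 5, ["no"] * 5]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print(f"gain A: {information_gain(parent, split_a):.3f}")  # gain A: 1.000
print(f"gain B: {information_gain(parent, split_b):.3f}")  # gain B: 0.029
```

A tree builder using the entropy criterion would pick split A here, since it removes all of the parent node's uncertainty.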


Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Answer: Decision Trees have a wide range of real-world applications due to their simplicity, interpretability, and effectiveness in handling both categorical and numerical data. In healthcare, they are used to support disease diagnosis by analyzing patient symptoms and medical test results. In finance, they assist in credit scoring, fraud detection, and risk assessment by evaluating patterns in customer data. In marketing, they help segment customers, predict purchasing behavior, and optimize targeted campaigns. Other applications include predicting equipment failures in manufacturing, classifying species in biology, and guiding decision-making in business strategy. The main advantages of decision trees are their intuitive, visual representation, which makes them easy to interpret and explain to non-technical stakeholders; their ability to handle both continuous and categorical features without extensive preprocessing such as feature scaling; and their relatively fast training and prediction times. However, they also have limitations: a tendency to overfit on noisy or small datasets, sensitivity to small changes in the data (which can lead to very different tree structures), and often lower predictive accuracy than ensemble methods like Random Forests or Gradient Boosted Trees. Despite these limitations, decision trees remain a foundational tool in machine learning due to their clarity and versatility.


Question 6: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

(Include your Python code and output in the code box below.)

Answer:


In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

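# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree using Gini impurity as the split criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy:.2f}")

# Print the importance of each feature (the values sum to 1)
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.3f}")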


Decision Tree Accuracy: 1.00
sepal length (cm): 0.000
sepal width (cm): 0.019
petal length (cm): 0.893
petal width (cm): 0.088


Question 7: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

(Include your Python code and output in the code box below.)

Answer:


In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fully-grown tree: no depth limit, so it splits until the leaves are pure
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)
print(f"Fully-grown Tree Accuracy: {accuracy_full:.2f}")

# Depth-limited tree: at most 3 levels of splits (pre-pruning)
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)
print(f"Tree with max_depth=3 Accuracy: {accuracy_limited:.2f}")


Fully-grown Tree Accuracy: 1.00
Tree with max_depth=3 Accuracy: 1.00


Question 8: Write a Python program to:

● Load the Boston Housing Dataset

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

(Include your Python code and output in the code box below.)

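Answer: The Boston Housing dataset (load_boston) was removed from scikit-learn in version 1.2, so this solution uses the California Housing dataset instead.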


In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

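# Load the California Housing dataset (target: median house value in units of $100,000)
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor with default settings (fully grown)
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Print the importance of each feature (the values sum to 1)
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.3f}")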



Mean Squared Error: 0.53
MedInc: 0.523
HouseAge: 0.052
AveRooms: 0.049
AveBedrms: 0.025
Population: 0.032
AveOccup: 0.139
Latitude: 0.090
Longitude: 0.089


Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy

(Include your Python code and output in the code box below.)

Answer:


In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

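# Candidate values for the two hyperparameters being tuned
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# 5-fold cross-validated grid search on the training set only
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# The best estimator is refit on the full training set, then scored on the test set
best_params = grid_search.best_params_
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Best Parameters: {best_params}")
print(f"Model Accuracy: {accuracy:.2f}")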


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Model Accuracy: 1.00


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

Answer: To predict whether a patient has a certain disease using a dataset with mixed data types and missing values, I would follow a structured, step-by-step approach to ensure data quality, model performance, and actionable business insights. The first step is **handling missing values**. For numerical features, missing values can be imputed using strategies such as the mean, median, or a more advanced approach like K-Nearest Neighbors imputation. For categorical features, missing values can be replaced with the mode or a new category labeled “Unknown.” This keeps the dataset complete and avoids the bias and loss of data that come from simply dropping rows with missing values.
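The imputation step could be sketched with scikit-learn's `SimpleImputer`, as below. The column names and values form a small hypothetical patient table invented for illustration; median imputation and an "Unknown" category are just two of the strategies mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with gaps in both column types.
patients = pd.DataFrame({
    "age": [54.0, np.nan, 61.0, 47.0],
    "cholesterol": [210.0, 195.0, np.nan, 240.0],
    "smoker": pd.Series(["yes", "no", np.nan, "no"], dtype=object),
})

numeric_cols = ["age", "cholesterol"]
categorical_cols = ["smoker"]

# Median for numeric columns; a constant "Unknown" category for categorical ones.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")

patients[numeric_cols] = num_imputer.fit_transform(patients[numeric_cols])
patients[categorical_cols] = cat_imputer.fit_transform(patients[categorical_cols])

print(patients)
```

On real data the imputers would be fitted on the training split only and then applied to the test split with `transform`.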

Next, I would **encode categorical features** so that the Decision Tree can interpret them. Ordinal (label) encoding can be applied to ordinal categorical features where the order matters, while one-hot encoding is suitable for nominal features with no inherent order. Proper encoding matters because an arbitrary integer coding of a nominal feature would let the tree split on an ordering that does not exist.
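A minimal sketch of both encodings with scikit-learn follows. The `severity` and `blood_type` values are hypothetical examples of an ordinal and a nominal feature, respectively.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical feature columns, one value per patient.
severity = [["mild"], ["severe"], ["moderate"], ["mild"]]   # ordinal
blood_type = [["A"], ["O"], ["B"], ["A"]]                   # nominal

# Ordinal encoding with an explicit category order: mild < moderate < severe.
ordinal = OrdinalEncoder(categories=[["mild", "moderate", "severe"]])
severity_codes = ordinal.fit_transform(severity)

# One-hot encoding; categories unseen during fit become all-zero rows.
onehot = OneHotEncoder(handle_unknown="ignore")
blood_type_dummies = onehot.fit_transform(blood_type).toarray()

print(severity_codes.ravel())   # [0. 2. 1. 0.]
print(onehot.categories_)       # learned categories, sorted alphabetically
print(blood_type_dummies)       # one column per category, a single 1 per row
```

Passing the category order explicitly avoids the default alphabetical ordering, which would not match the clinical meaning of the levels.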

Once the data is cleaned and encoded, I would **train a Decision Tree model**. This involves splitting the dataset into training and test sets to ensure unbiased evaluation, initializing the Decision Tree classifier, and fitting it on the training data. To avoid data leakage, the imputation and encoding steps should be fitted on the training set only and then applied unchanged to the test set. During training, the algorithm uses impurity measures such as Gini Impurity or Entropy to determine optimal splits and create a tree that separates patients into disease-positive and disease-negative groups.

After training, I would **tune the model’s hyperparameters** to optimize performance and prevent overfitting. Key parameters include `max_depth` (to limit the tree size), `min_samples_split` (to control the minimum number of samples required to split a node), and `min_samples_leaf` (to enforce a minimum number of samples at leaf nodes). Techniques like GridSearchCV or RandomizedSearchCV can systematically test combinations of these parameters and select the best set based on cross-validated accuracy or other metrics.
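One way to combine preprocessing, training, and tuning is to wrap everything in a scikit-learn `Pipeline` and search over the tree's parameters with `GridSearchCV`. The sketch below runs on synthetic data: the `age` and `smoker` columns, the toy risk formula, and the grid values are all assumptions made for illustration, and `scoring="recall"` is just one reasonable choice for a disease-detection setting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient table (not real data).
rng = np.random.default_rng(42)
n = 400
age = rng.normal(55, 12, n)
is_smoker = rng.random(n) < 0.4
# Toy generator: disease risk rises with age and smoking.
risk = 0.03 * (age - 55) + 1.0 * is_smoker - 0.5
y = (risk + rng.normal(0, 1, n) > 0).astype(int)
age[rng.random(n) < 0.1] = np.nan  # knock out roughly 10% of the ages
X = pd.DataFrame({
    "age": age,
    "smoker": np.where(is_smoker, "yes", "no").astype(object),
})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Preprocessing lives inside the pipeline, so it is refit on each CV training fold.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10, 20],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated recall: {search.best_score_:.2f}")
```

Keeping the imputer and encoder inside the pipeline means each cross-validation fold learns its preprocessing statistics from its own training portion only.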

Finally, I would **evaluate the model’s performance** using the test set, focusing not only on accuracy but also on metrics relevant to healthcare, such as precision, recall, F1-score, and the area under the ROC curve (AUC). High recall is particularly important in disease prediction to minimize false negatives, ensuring patients with the disease are correctly identified. Feature importance analysis can also provide insights into which factors most strongly predict disease risk.
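The metrics named above can be computed with `sklearn.metrics`. This sketch uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for patient data; the class names, the roughly 10% positive rate, `max_depth=5`, and `class_weight="balanced"` are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: roughly 10% positive (disease) cases.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight="balanced" is one way to push the tree toward higher recall
# on the minority class; max_depth=5 is an arbitrary illustrative setting.
clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Per-class precision, recall, and F1-score in one table.
print(classification_report(y_test, y_pred, target_names=["healthy", "disease"]))
print(f"Recall (disease): {recall_score(y_test, y_pred):.2f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_score):.2f}")
```

On imbalanced data like this, accuracy alone can look high even when most disease cases are missed, which is why recall on the positive class is reported separately.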

In a real-world healthcare setting, this model offers significant **business value**. It can assist clinicians in early disease detection, prioritize high-risk patients for further testing, reduce unnecessary diagnostic costs, and ultimately improve patient outcomes. Moreover, the insights derived from feature importance can inform preventive health strategies and guide resource allocation in hospitals or insurance companies, making the model both a predictive and strategic tool for healthcare decision-making.
