1. What is a Decision Tree, and how does it work in the context of classification?
- A Decision Tree is a tree-structured model where internal nodes represent feature-based tests, branches represent outcomes of those tests, and leaves represent class labels.
- For classification, the tree repeatedly splits the data based on feature thresholds (e.g., petal_length <= 2.45) to create groups that are as “pure” (single class) as possible.
- Prediction: a new sample is passed from the root node down the tree by evaluating these tests until it reaches a leaf, where the majority class in that leaf is returned as the prediction.
- It creates axis-aligned decision boundaries in feature space and can naturally model non-linear relationships and interactions between features.
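
The prediction walk described above can be sketched with a tiny hand-written tree. This is only an illustration: the root test (petal_length <= 2.45) is the one quoted in the answer, while the second test, its threshold, and the names `toy_tree` and `predict` are hypothetical.

```python
# A hand-written toy tree: internal nodes hold a test, leaves hold a class label.
# Only the root threshold (petal_length <= 2.45) comes from the text above;
# the petal_width test and its threshold are made up for illustration.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.45,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width",
        "threshold": 1.75,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, going left when feature <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

Each test looks at a single feature, which is why the resulting decision boundaries are axis-aligned.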

2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
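- Gini Impurity: 1 − Σ p_k², where p_k is the fraction of samples in the node that belong to class k. It is the probability of mislabeling a randomly drawn sample if it were labeled according to the node's class distribution; it is 0 for a pure node and largest when the classes are evenly mixed.
- Entropy: −Σ p_k log2(p_k), the average uncertainty (in bits) about the class of a sample in the node. It is also 0 for a pure node and largest when the classes are evenly mixed.
- During training, the tree algorithm evaluates potential splits and chooses the one that maximally reduces impurity (Gini or Entropy), i.e., produces child nodes that are purer than the parent.
- In practice, both usually give similar trees; Gini is slightly cheaper to compute because it avoids the logarithm and tends to isolate the most frequent class in its own branch, while Entropy tends to produce slightly more balanced trees.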
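
To make the two measures concrete, here is a minimal sketch that computes them from a list of class labels, using the standard definitions (Gini = 1 − Σ p_k², Entropy = −Σ p_k log2 p_k). The function names `gini` and `entropy` are chosen for this illustration and are not part of scikit-learn.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]

print(gini(pure), entropy(pure))    # 0.0 0.0  (a pure node has no impurity)
print(gini(mixed), entropy(mixed))  # 0.5 1.0  (an even two-class mix is the worst case)
```

Both measures agree on the extremes; they differ only in how they score the nodes in between.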

3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
- Pre-pruning (early stopping): stops growing the tree early using constraints like max_depth, min_samples_split, or min_samples_leaf instead of expanding until perfectly pure.
- Post-pruning: first grows a large tree (possibly overfitted), then prunes back by removing or merging subtrees based on validation performance to reduce overfitting.
- Practical advantage of Pre-pruning: simpler, faster training and smaller trees without needing a separate pruning phase; good when you want tight control on model size.
- Practical advantage of Post-pruning: can start from a very expressive tree, then carefully remove branches based on data, often giving better generalization than aggressive early stopping alone.
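
The contrast can be sketched in scikit-learn as follows. This is one possible setup, not the only one: post-pruning is done here with scikit-learn's cost-complexity pruning (`ccp_alpha`), which is one specific post-pruning method, and the split sizes, `max_depth=3`, and `min_samples_leaf=5` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-pruning: growth is stopped early by constraints set before training.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then try progressively stronger
# cost-complexity pruning and keep the candidate that scores best on validation data.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alphas = [max(float(a), 0.0) for a in path.ccp_alphas]  # guard against tiny negative values

candidates = [
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train), a)
    for a in alphas
]
# On ties in validation accuracy, prefer the larger alpha (the smaller tree).
post_pruned, best_alpha = max(
    candidates, key=lambda pair: (pair[0].score(X_val, y_val), pair[1])
)

print("Full tree nodes:       ", full_tree.tree_.node_count)
print("Pre-pruned tree nodes: ", pre_pruned.tree_.node_count)
print("Post-pruned tree nodes:", post_pruned.tree_.node_count, "with ccp_alpha =", best_alpha)
```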

4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
- Information Gain (IG) measures how much uncertainty (impurity) is reduced by a split: the entropy of the parent node minus the size-weighted average entropy of its child nodes.
- It quantifies how “informative” a feature and threshold are in separating the classes at that node.
- During training, the algorithm evaluates possible splits and chooses the one with the maximum Information Gain, i.e., maximum impurity reduction.
- Choosing splits by Information Gain makes the tree ask the most discriminative questions first, which tends to give shorter, more accurate trees. The search is greedy, so the resulting tree is not guaranteed to be globally optimal.
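
As a small worked example, the sketch below computes Information Gain as parent entropy minus the size-weighted entropy of the children. The helper names `entropy` and `information_gain` and the toy label lists are assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


parent = [0, 0, 1, 1]

# A split that separates the classes perfectly removes all uncertainty.
print(information_gain(parent, [[0, 0], [1, 1]]))  # 1.0

# A split whose children are as mixed as the parent gains nothing.
print(information_gain(parent, [[0, 1], [0, 1]]))  # 0.0
```

At each node the training algorithm would compute this quantity for every candidate feature and threshold and keep the split with the largest value.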

5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).
- Applications: credit risk scoring, medical diagnosis, churn prediction, fraud detection, marketing segmentation, and decision-support systems where explainability is important.
- Advantages: easy to interpret/visualize; handles non-linear relationships; works with numerical and categorical data; little feature scaling or preprocessing needed.
- Limitations: high risk of overfitting (especially deep trees); unstable to small changes in data (small change can yield a very different tree); biased towards features with many possible splits/categories. Often, better performance is achieved by using ensembles of trees (Random Forest, Gradient Boosting) rather than a single Decision Tree.

In [1]:
# 6. Write a Python program to:
#    ● Load the Iris Dataset
#    ● Train a Decision Tree Classifier using the Gini criterion
#    ● Print the model’s accuracy and feature importances
# (Include your Python code and output in the code box below.)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

importances = clf.feature_importances_

print("Accuracy on test set:", accuracy)
print("Feature importances:")
for name, imp in zip(feature_names, importances):
    print(f"  {name}: {imp:.4f}")


Accuracy on test set: 0.9333333333333333
Feature importances:
  sepal length (cm): 0.0062
  sepal width (cm): 0.0292
  petal length (cm): 0.5586
  petal width (cm): 0.4060


In [2]:
# 7. Write a Python program to:
#    ● Load the Iris Dataset
#    ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.
# (Include your Python code and output in the code box below.)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_tree.fit(X_train, y_train)
y_pred_shallow = shallow_tree.predict(X_test)
acc_shallow = accuracy_score(y_test, y_pred_shallow)

print("Accuracy of fully-grown tree:", acc_full)
print("Accuracy of max_depth=3 tree:", acc_shallow)


Accuracy of fully-grown tree: 0.9333333333333333
Accuracy of max_depth=3 tree: 0.9666666666666667


In [5]:
# 8. Write a Python program to:
#    ● Load the Boston Housing Dataset
#    ● Train a Decision Tree Regressor
#    ● Print the Mean Squared Error (MSE) and feature importances
# (Include your Python code and output in the code box below.)

# load_boston was removed in scikit-learn 1.2, so calling it raises ImportError.
# The data is read from the original source instead, following the fallback
# given in scikit-learn's ImportError message.
# scikit-learn discourages this dataset because of an ethical problem with its
# "B" variable and suggests fetch_california_housing() as an alternative.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record spans two physical lines in the source file; stitch them back together.
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]

# Column names of the 13 features, in file order.
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error on test set:", mse)
print("Feature importances:")
for name, imp in zip(feature_names, reg.feature_importances_):
    print(f"  {name}: {imp:.4f}")

In [4]:
# 9. Write a Python program to:
#    ● Load the Iris Dataset
#    ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#    ● Print the best parameters and the resulting model accuracy
# (Include your Python code and output in the code box below.)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

dt = DecisionTreeClassifier(random_state=42)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10]
}

grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print("Best parameters:", grid_search.best_params_)
print("Best cross-val accuracy:", grid_search.best_score_)
print("Test set accuracy with best model:", test_accuracy)


Best parameters: {'max_depth': 4, 'min_samples_split': 2}
Best cross-val accuracy: 0.9416666666666668
Test set accuracy with best model: 0.9333333333333333


10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in a real-world setting. Answer the theoretical questions in 3-4 points.

- 1. Handle missing values
  - Impute numeric features (median/mean) and categorical features (most frequent or “Missing”).
  - Fit the imputers inside a pipeline so they learn only from the training folds, which prevents data leakage.
- 2. Encode categories & train model
  - Apply OneHotEncoder to the categorical columns via ColumnTransformer.
  - Build a pipeline with preprocessing + DecisionTreeClassifier.
- 3. Tune & evaluate
  - Use GridSearchCV to tune max_depth, min_samples_split, etc.
  - Evaluate on a held-out test set using precision, recall, F1, and ROC-AUC, since accuracy alone can be misleading when the disease is rare.
- 4. Business value
  - Supports early disease detection and better patient risk prioritization.
  - Reduces costs and provides interpretable, trustworthy decision rules for clinicians.
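
The steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not a real clinical workflow: the column names (`age`, `blood_pressure`, `smoker`, `sex`), the rule used to generate the label, the parameter grid, and the choice of F1 as the tuning metric are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and label rule).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to simulate missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "sex"]

# Steps 1 and 2: impute and encode inside the pipeline so nothing leaks from test data.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: tune with cross-validation on the training set only.
param_grid = {
    "model__max_depth": [3, 5, None],
    "model__min_samples_split": [2, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Evaluate the refit best pipeline on the held-out test set.
y_pred = search.predict(X_test)
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", auc)
```

Keeping imputation and encoding inside the pipeline means GridSearchCV refits them on each training fold, so the cross-validation scores are not inflated by information from the validation folds.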