Q1. What is a Decision Tree, and how does it work in classification?

- A Decision Tree is a supervised learning model that makes predictions by splitting data into branches based on feature values. For classification, it divides the dataset into pure class groups by asking sequential yes/no questions until it reaches a final class label.

Q2. Explain Gini Impurity and Entropy. How do they impact splits?

- Gini Impurity: Measures how often a randomly chosen sample would be misclassified.

- Entropy: Measures disorder or uncertainty in a node.
The tree chooses splits that reduce impurity the most, creating purer child nodes.

Q3. Difference between Pre-Pruning and Post-Pruning + one advantage each.

- Pre-Pruning: Stops tree growth early using limits like max_depth or min_samples_split.
Advantage: Prevents overfitting and reduces training time.

- Post-Pruning: Grows the full tree first, then removes unnecessary branches.
Advantage: Produces a simpler, more accurate model after evaluating real performance.

Q4. What is Information Gain and why is it important?

- Information Gain is the reduction in impurity (Entropy or Gini) after a split. It helps the tree choose the best feature and threshold, leading to better decision boundaries and higher accuracy.

Q5. Real-world applications of Decision Trees + advantages and limitations.

- Applications: Fraud detection, medical diagnosis, credit scoring, customer segmentation, and loan approval.
Advantages: Easy to interpret, handles numerical + categorical data, no scaling needed.
Limitations: Prone to overfitting, unstable to small data changes, and can create overly complex trees.

In [2]:
#Question 6: Write a Python program to:
#● Train a Decision Tree Classifier using the Gini criterion
#● Load the Iris Dataset
#● Print the model’s accuracy and feature importances

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.datasets import load_iris
iris = load_iris()
df= pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

x= df.drop('target', axis=1)
y= df['target']

from  sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='gini')
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import accuracy_score
print("Model Accuracy:", accuracy_score(y_test, y_pred))
print(classifier.feature_importances_)

Model Accuracy: 0.9555555555555556
[0.01906318 0.         0.05330732 0.9276295 ]


In [3]:
#Question 7: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

from sklearn.tree import DecisionTreeClassifier
classifier_new = DecisionTreeClassifier(criterion='gini', max_depth=3)
classifier_new.fit(x_train, y_train)

y_pred = classifier_new.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.9777777777777777


In [4]:
#Question 8: Write a Python program to:
#● Load the Boston Housing Dataset
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load the Boston Housing Dataset from the original source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Define feature names for the Boston Housing Dataset
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

df = pd.DataFrame(data, columns=feature_names)
df['Price'] = target

X= df.drop('Price', axis=1)
Y= df['Price']

from  sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(criterion='squared_error', random_state=42)
regressor.fit(X_train, Y_train)

Y_pred = regressor.predict(X_test)

from sklearn.metrics import mean_squared_error
print(mean_squared_error(Y_test, Y_pred))
print(regressor.feature_importances_)

11.588026315789474
[5.84654523e-02 9.88919249e-04 9.87244881e-03 2.97334284e-04
 7.05056208e-03 5.75807411e-01 7.17019866e-03 1.09624049e-01
 1.64635669e-03 2.18111251e-03 2.50428658e-02 1.18729904e-02
 1.89980299e-01]


In [5]:
#Question 9: Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#● Print the best parameters and the resulting model accuracy

parameter = {

             'criterion': ['gini', 'entropy', 'log_loss'],
             'splitter': ['best', 'random'],
             'max_depth': [3],
             'min_samples_split': [1,2, 3, 4, 5],
             'max_features': ['auto', 'sqrt', 'log2']
}

from sklearn.model_selection import GridSearchCV
cls = DecisionTreeClassifier()
model = GridSearchCV(cls, param_grid=parameter, cv=5, scoring='accuracy', verbose=3)
model.fit(x_train, y_train)

print(model.best_params_)
print(model.best_score_)

Fitting 5 folds for each of 90 candidates, totalling 450 fits
[CV 1/5] END criterion=gini, max_depth=3, max_features=auto, min_samples_split=1, splitter=best;, score=nan total time=   0.0s
[CV 2/5] END criterion=gini, max_depth=3, max_features=auto, min_samples_split=1, splitter=best;, score=nan total time=   0.0s
[CV 3/5] END criterion=gini, max_depth=3, max_features=auto, min_samples_split=1, splitter=best;, score=nan total time=   0.0s
[CV 4/5] END criterion=gini, max_depth=3, max_features=auto, min_samples_split=1, splitter=best;, score=nan total time=   0.0s
[CV 5/5] END criterion=gini, max_depth=3, max_features=auto, min_samples_split=1, splitter=best;, score=nan total time=   0.0s
[CV 1/5] END criterion=gini, max_depth=3, max_features=auto, min_samples_split=1, splitter=random;, score=nan total time=   0.0s
[CV 2/5] END criterion=gini, max_depth=3, max_features=auto, min_samples_split=1, splitter=random;, score=nan total time=   0.0s
[CV 3/5] END criterion=gini, max_depth=3, max

Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

- To build a disease-prediction model, I would first handle missing values by imputing numerical features with median, categorical features with mode or a “missing” category, and adding missing-indicator flags if needed. For categorical encoding, I would apply one-hot encoding for low-cardinality features and ordinal encoding where order matters. Then I would train a Decision Tree using a preprocessing pipeline and stratified train–test split. Next, I would tune hyperparameters such as max_depth, min_samples_split, min_samples_leaf, and ccp_alpha using GridSearchCV. Finally, I would evaluate performance using accuracy, recall, F1-score, ROC-AUC, and a confusion matrix to ensure good detection of positive cases. This model provides business value by enabling early disease detection, reducing diagnostic costs, improving patient triage, and supporting faster, data-driven clinical decisions.