In [None]:
# 1. What is a Decision Tree, and how does it work in the context of
# classification?
"""
A Decision Tree is a supervised learning algorithm that classifies samples by routing them through a sequence of
feature-based tests. Training is recursive: at each node the algorithm picks the feature and threshold that best separate
the classes, scored by a metric such as Gini impurity or Information Gain, and repeats on each child node until a stopping
condition is met. Each leaf node then assigns a class label.
"""

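The recursive splitting described above can be made visible by printing the rules of a fitted tree. This is a minimal sketch, assuming scikit-learn is installed; the shallow `max_depth=2` and `random_state=0` are illustrative choices, not part of the answer.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a feature test; "class: k" lines are leaf nodes.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output top to bottom follows one sample's path from the root test down to a leaf.
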
In [None]:
# 2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
# How do they impact the splits in a Decision Tree?
"""
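Gini Impurity measures how often a randomly chosen sample would be misclassified if it were labeled at random according
to the class distribution in a node. It is 0 for a pure node, and lower values mean purer nodes.
Entropy measures the uncertainty or disorder in a node. It is also 0 for a pure node, and higher values mean more mixed classes.
Impact on splits: at each node the tree evaluates candidate splits and keeps the one that most reduces the weighted impurity
of the child nodes. Both measures usually produce similar trees; Gini is slightly cheaper to compute because it needs no logarithm.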
"""

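Both measures are easy to compute by hand from the class counts in a node. The sketch below uses the standard formulas, Gini = 1 - sum(p_k^2) and Entropy = sum(p_k * log2(1 / p_k)); the helper names `gini` and `entropy` are made up for illustration.

```python
import math

def gini(counts):
    """Gini impurity for a list of per-class sample counts in one node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; classes with zero samples contribute nothing."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# A pure node versus an evenly mixed two-class node.
print("pure :", gini([10, 0]), entropy([10, 0]))
print("mixed:", gini([5, 5]), entropy([5, 5]))
```

For two classes, Gini peaks at 0.5 and entropy at 1 bit, both at a 50/50 mix.
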
In [None]:
# 3. What is the difference between Pre-Pruning and Post-Pruning in Decision
# Trees? Give one practical advantage of using each.
"""
Pre-Pruning stops the tree from growing too deep during training by setting limits such as maximum depth or minimum samples
per split. Advantage: it curbs overfitting early and saves computation.
Post-Pruning lets the tree grow fully, then removes branches that contribute little to accuracy (for example, via
cost-complexity pruning). Advantage: it simplifies the model while keeping strong predictive performance.
"""

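In scikit-learn the two approaches map onto different constructor arguments. This is a rough sketch: `max_depth` and `min_samples_split` act as pre-pruning limits, while `ccp_alpha` enables cost-complexity post-pruning. The value `0.02` is an arbitrary illustration, not a tuned setting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: no limits, the tree grows until its leaves are pure.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: growth is capped while the tree is being built.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42).fit(X_train, y_train)

# Post-pruning: the tree is grown fully, then its weakest branches are collapsed.
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "nodes:", model.tree_.node_count, "depth:", model.get_depth(),
          "test accuracy:", model.score(X_test, y_test))
```

In practice `ccp_alpha` would be chosen by cross-validation rather than fixed by hand.
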
In [None]:
# 4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
"""
Information Gain measures how much a split reduces uncertainty: it is the entropy of the parent node minus the weighted
average entropy of the child nodes.
It is important because the Decision Tree selects the split with the highest Information Gain at each node, which is the
split that gives the purest separation of classes.
"""

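A small worked example makes the definition concrete. The sketch below computes Information Gain as parent entropy minus the size-weighted entropy of two child nodes; the function names and the toy "yes"/"no" labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect = information_gain(parent, ["yes"] * 5, ["no"] * 5)

# A split whose children are as mixed as the parent.
useless = information_gain(parent, ["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3)

print("perfect split gain:", perfect)
print("useless split gain:", useless)
```

The tree would prefer the first split, since it removes all uncertainty while the second removes none.
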
In [None]:
# 5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
"""
Applications: Decision Trees are widely used in loan approval, medical diagnosis, customer churn prediction, and fraud detection.
Advantages: They are easy to interpret, handle both numerical and categorical data, and require little data preprocessing.
Limitations: They can overfit easily, are sensitive to small data changes, and may become biased toward features with many categories.
"""

In [1]:
# 6. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Model Accuracy:", accuracy_score(y_test, y_pred))

for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")


Model Accuracy: 1.0
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
# 7. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_full = DecisionTreeClassifier(random_state=42)
model_full.fit(X_train, y_train)
y_pred_full = model_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

model_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
model_limited.fit(X_train, y_train)
y_pred_limited = model_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

print("Accuracy of Fully-Grown Tree:", acc_full)
print("Accuracy of Tree with max_depth=3:", acc_limited)


Accuracy of Fully-Grown Tree: 1.0
Accuracy of Tree with max_depth=3: 1.0


In [3]:
# 8. Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances


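# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2 to run as written.
from sklearn.datasets import load_boston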
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

boston = load_boston()
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

for name, importance in zip(boston.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error: 10.416078431372549
CRIM: 0.0513
ZN: 0.0034
INDUS: 0.0058
CHAS: 0.0000
NOX: 0.0271
RM: 0.6003
AGE: 0.0136
DIS: 0.0707
RAD: 0.0019
TAX: 0.0125
PTRATIO: 0.0110
B: 0.0090
LSTAT: 0.1933


In [4]:
# 9. Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using
# GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10, 15]
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Model Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.0


In [None]:
# 10. Imagine you’re working as a data scientist for a healthcare company that
# wants to predict whether a patient has a certain disease. You have a large dataset with
# mixed data types and some missing values.
# Explain the step-by-step process you would follow to:
# ● Handle the missing values
# ● Encode the categorical features
# ● Train a Decision Tree model
# ● Tune its hyperparameters
# ● Evaluate its performance
# And describe what business value this model could provide in the real-world
# setting.

"""1. Handle Missing Values

Identify missing values in the dataset using methods like isnull() or info().

Numerical features: Impute missing values using mean or median.

Categorical features: Impute missing values using the mode or a placeholder like 'Unknown'.

Optional: Use advanced methods like K-Nearest Neighbors imputation for more accuracy.

2. Encode Categorical Features

Convert categorical variables into numerical form so that the Decision Tree can process them.

Ordinal (Label) Encoding: For ordinal categories (e.g., 'low', 'medium', 'high'), mapped to integers that preserve their order.

One-Hot Encoding: For nominal categories (e.g., 'blood type').

3. Train a Decision Tree Model

Split the dataset into training and testing sets (e.g., 80/20 split).

Initialize a Decision Tree Classifier using sklearn.tree.DecisionTreeClassifier.

Fit the model on the training data.

4. Tune Hyperparameters

Use GridSearchCV or RandomizedSearchCV to find the best hyperparameters:

max_depth → prevents overfitting by limiting tree depth.

min_samples_split → minimum samples required to split a node.

criterion → Gini or Entropy to measure node purity.

Select the best model based on cross-validated performance.

5. Evaluate Performance

Use the testing set to evaluate the model.

Common metrics for classification:

Accuracy → overall correctness.

Precision → proportion of predicted positives that are correct (important in healthcare to avoid false positives).

Recall (Sensitivity) → proportion of actual positives correctly identified (critical to avoid missing patients with the disease).

F1-Score → balances precision and recall.

ROC-AUC → evaluates model discrimination ability.

6. Business Value

Early Detection: Helps doctors identify patients at risk earlier, improving treatment outcomes.

Resource Optimization: Focus medical resources on high-risk patients.

Decision Support: Assists healthcare professionals in making data-driven decisions.

Cost Reduction: Reduces unnecessary tests for low-risk patients while prioritizing high-risk ones."""
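
The steps above can be sketched end to end with a scikit-learn pipeline. This is only an illustration under stated assumptions: the patient data is synthetic, every column name and the label rule are made up, and scoring on recall is one possible choice that reflects the priority of not missing sick patients.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all names and values are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "blood_type": rng.choice(["A", "B", "AB", "O"], n),
    "smoker": rng.choice(["yes", "no"], n),
})
# Made-up label loosely tied to age and blood pressure, plus noise.
risk = 0.03 * (df["age"] - 55) + 0.04 * (df["blood_pressure"] - 120) + rng.normal(0, 1, n)
y = (risk > 0).astype(int)

# Blank out roughly 10% of two columns to mimic missing values.
df.loc[rng.random(n) < 0.1, "age"] = np.nan
df.loc[rng.random(n) < 0.1, "blood_type"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["blood_type", "smoker"]

# Steps 1 and 2: median imputation for numbers; mode imputation plus one-hot for categories.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and the tree are fitted together on training data only.
pipe = Pipeline([("prep", preprocess), ("tree", DecisionTreeClassifier(random_state=42))])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: cross-validated grid search over the hyperparameters named in the answer.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate the refitted best model on the held-out test set.
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("ROC-AUC:", test_auc)
```

Keeping imputation and encoding inside the pipeline means each cross-validation fold learns them from its own training portion, which avoids leaking information from validation data.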