# **Decision Tree**


1. What is a Decision Tree, and how does it work in the context of classification?

    - A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It models decisions and their possible consequences as a tree-like structure. For classification, it recursively splits the dataset into subsets based on the values of input features, applying one test at each node (for example, whether a feature is below a threshold). Splitting continues until a node contains a single class or a stopping criterion such as a maximum depth is met. In the resulting tree, each internal node holds a decision rule on a feature and each leaf holds a class label; a new sample is classified by following the rules from the root down to a leaf.
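To make the node-and-leaf structure concrete, the following minimal sketch fits a shallow tree on Iris and prints its learned rules as text. The values `max_depth=2` and `random_state=0` are illustrative assumptions, chosen only to keep the output short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an illustrative choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented "feature <= threshold" line is an internal node;
# each "class: k" line is a leaf
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows the same path a sample takes from the root to a leaf.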

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

    - Gini Impurity measures the probability that a randomly chosen element of a node would be misclassified if it were labeled at random according to the class distribution in that node. It is 0 for a perfectly pure node and reaches its maximum when the classes are evenly mixed: 0.5 for binary classification, and 1 - 1/K for K classes.

    - Entropy quantifies the level of disorder, unpredictability, or impurity in a node. It is zero when the node is perfectly pure and highest when the classes are equally mixed (1 bit for two classes when computed with a base-2 logarithm).

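    - Both metrics are used to decide how to split the data at each node: every candidate split is scored by how much it reduces impurity, and the split with the largest reduction is chosen. This produces purer child nodes, which tends to improve classification accuracy. In practice the two measures usually select similar splits.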
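The two measures described above can be written as Gini = 1 - Σ p_k² and Entropy = -Σ p_k log2(p_k), where p_k is the proportion of class k in the node. A minimal sketch in plain Python follows; the function names are illustrative and not part of any library.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the proportion of each class present in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


mixed = ["a", "a", "b", "b"]   # evenly mixed binary node
pure = ["a", "a", "a", "a"]    # perfectly pure node

print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(gini(pure), entropy(pure))    # 0.0 0.0
```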

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

    - Pre-Pruning stops the growth of the tree early, before it perfectly classifies the training data, based on predefined criteria such as max_depth, min_samples_split, or min_impurity_decrease. Advantage: it keeps the tree small, which reduces overfitting and also shortens training time.

    - Post-Pruning lets the tree grow fully and then removes branches that contribute little, judged on validation data or a penalized performance criterion (e.g., cost-complexity pruning). Advantage: pruning decisions are made with the whole tree in view, so a useful split that sits below a weak-looking one is not cut off prematurely, which can increase accuracy.
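A minimal sketch of both approaches in scikit-learn, assuming the Iris data and a simple train/validation split (the split size, seeds, and pre-pruning values are illustrative choices). Pre-pruning uses constructor limits; post-pruning uses `cost_complexity_pruning_path` to get candidate `ccp_alpha` values and keeps the one that scores best on the validation set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-pruning: limit growth up front
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=0
).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path of the fully grown tree,
# then refit once per candidate alpha (clipped at 0 to guard against rounding)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
candidates = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=max(alpha, 0.0)).fit(
        X_train, y_train
    )
    for alpha in path.ccp_alphas
]

# Keep the pruned tree that does best on the validation data
post_pruned = max(candidates, key=lambda t: t.score(X_val, y_val))

print("Pre-pruned leaves:", pre_pruned.get_n_leaves())
print("Post-pruned leaves:", post_pruned.get_n_leaves())
print("Validation accuracy (pre, post):",
      pre_pruned.score(X_val, y_val), post_pruned.score(X_val, y_val))
```

On a dataset this small the two trees may end up nearly identical; the sketch is meant to show the mechanics rather than a performance difference.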

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

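    - Information Gain measures the reduction in impurity after a dataset is split on a feature: the impurity of the parent node minus the size-weighted average impurity of the child nodes. Strictly, the term refers to the reduction in entropy; the analogous quantity for Gini is usually called Gini gain or impurity decrease. It quantifies how well a feature separates the classes. At each node the split with the highest gain is chosen, which yields the purest child nodes available at that step and improves the classification power of the tree.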
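As a worked illustration, information gain can be computed as IG = H(parent) - Σ (n_j / n) · H(child_j), where H is entropy and n_j is the size of child j. The sketch below uses plain Python and made-up labels; the helper names are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class labels in one node."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split whose children are as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree builder would evaluate this quantity for every candidate split and keep the one with the largest value.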

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

    - Applications: medical diagnosis (predicting diseases), credit approval and fraud detection, customer segmentation in marketing, churn prediction, and risk assessment in insurance.

    - Advantages: easy to interpret and visualize, handles both numerical and categorical data, and requires little data preprocessing.

    - Limitations: prone to overfitting, can be unstable under small changes in the data, and may yield biased trees if classes are imbalanced.

In [1]:
# 6. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

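# Load the four Iris measurements (X) and the species labels (y)
iris = load_iris()
X, y = iris.data, iris.target

# Fit a tree that chooses splits by Gini impurity
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, y)

# Note: predictions are made on the same rows used for training, so this is
# training accuracy, not an estimate of performance on unseen data
y_pred = clf.predict(X)
accuracy = accuracy_score(y, y_pred)

# Impurity-based importances, one value per feature, summing to 1
importances = clf.feature_importances_

print("Accuracy:", accuracy)
print("Feature Importances:", importances)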


Accuracy: 1.0
Feature Importances: [0.01333333 0.01333333 0.05072262 0.92261071]
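The accuracy of 1.0 above is measured on the same rows the tree was trained on, which an unconstrained tree can usually fit perfectly. A minimal sketch of a held-out evaluation follows; the 70/30 split and `random_state=42` are illustrative assumptions, and the numbers it prints will differ from the output above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hold out 30% of the rows; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, clf.predict(X_train))
test_accuracy = accuracy_score(y_test, clf.predict(X_test))

print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)

# Pair each importance with its feature name for readability
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```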


In [2]:
# 7. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

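X, y = load_iris(return_X_y=True)

# Unconstrained tree vs. a tree pre-pruned to depth 3
clf_full = DecisionTreeClassifier()
clf_depth3 = DecisionTreeClassifier(max_depth=3)

clf_full.fit(X, y)
clf_depth3.fit(X, y)

# Both scores are computed on the training data, so the fully grown tree is
# expected to score at least as high; this does not show which one generalizes better
acc_full = accuracy_score(y, clf_full.predict(X))
acc_depth3 = accuracy_score(y, clf_depth3.predict(X))

print("Fully-grown tree accuracy:", acc_full)
print("Max depth 3 accuracy:", acc_depth3)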



Fully-grown tree accuracy: 1.0
Max depth 3 accuracy: 0.9733333333333334
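Because both figures above are training accuracies, the comparison favors the deeper tree by construction. One way to compare the two on unseen data is 5-fold cross-validation, sketched below; the fold count and `random_state=0` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "fully grown": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
}

cv_means = {}
for name, model in models.items():
    # Each fold is scored on rows the model did not see during fitting
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two means are close, the shallower tree is usually the better choice because it is simpler to read and less sensitive to noise.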


In [4]:
# 8. Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

# Note: load_boston was removed from scikit-learn (as of version 1.2), so fetch_california_housing is used here instead.

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

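# Downloads the dataset on first use and caches it locally
data = fetch_california_housing()
X, y = data.data, data.target

# Unconstrained regression tree (default criterion: squared error)
reg = DecisionTreeRegressor()
reg.fit(X, y)

# Note: the MSE is computed on the training rows, so a fully grown tree
# reproduces the targets almost exactly and the value is close to zero
y_pred = reg.predict(X)
mse = mean_squared_error(y, y_pred)

# One importance per feature, in the order of data.feature_names
importances = reg.feature_importances_

print("MSE:", mse)
print("Feature Importances:", importances)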

MSE: 9.555001274479309e-32
Feature Importances: [0.52468705 0.0510357  0.0536226  0.02716341 0.03212433 0.13153595
 0.0938629  0.08596806]
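The MSE of roughly 1e-31 above reflects memorization of the training rows rather than predictive quality. A minimal sketch of a held-out evaluation for a regression tree follows. As an assumption made only so the example runs without a download, it uses the small `load_diabetes` dataset bundled with scikit-learn as a stand-in; the same pattern should apply to `fetch_california_housing`.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes is a stand-in dataset, not the one used above
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

reg = DecisionTreeRegressor(random_state=0)
reg.fit(X_train, y_train)

# Training error is near zero for a fully grown tree; test error is the
# figure that indicates how well the model predicts unseen rows
train_mse = mean_squared_error(y_train, reg.predict(X_train))
test_mse = mean_squared_error(y_test, reg.predict(X_test))

print("Training MSE:", train_mse)
print("Test MSE:", test_mse)
```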


In [5]:
# 9. Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

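X, y = load_iris(return_X_y=True)

# Candidate values for the two pre-pruning hyperparameters (4 x 3 = 12 combinations)
params = {'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 5, 10]}
clf = DecisionTreeClassifier()

# 5-fold cross-validation; the default score for a classifier is accuracy
gs = GridSearchCV(clf, params, cv=5)
gs.fit(X, y)

print("Best parameters:", gs.best_params_)
# best_score_ is the mean cross-validated accuracy of the best combination
print("Best accuracy:", gs.best_score_)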


Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Best accuracy: 0.9733333333333334


10. Imagine you're working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
Finally, describe what business value this model could provide in a real-world
setting.

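    - Handle missing values: First check how much is missing in each column. Then impute (mean or median for numeric features, mode or a dedicated "missing" category for categorical ones), fitting the imputer on the training data only so that no information leaks from the test set.

    - Encode categorical features: Use one-hot encoding for nominal attributes and ordinal (label) encoding only where the categories have a natural order, since integer codes imply an ordering that the tree will split on.

    - Train Decision Tree model: Split the data into training and test sets (stratified by the disease label), then fit the tree to the training data.

    - Tune hyperparameters: Use GridSearchCV with cross-validation over parameters like max_depth, min_samples_split, and min_samples_leaf.

    - Evaluate performance: Report precision, recall, and F1 score on the held-out test set, plus ROC-AUC if relevant. Accuracy alone can be misleading when the disease is rare, and recall matters most when a missed case is costly.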

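    - Business value: An interpretable screening model makes triage more efficient and consistent, supports early detection and intervention, and helps prioritize cases needing urgent care. Because the rules of a tree can be read directly, clinicians can review why a patient was flagged before acting on the prediction.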
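The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is hypothetical: the column names, the synthetic values, the rule that generates the label, and the 10% missing rate are all invented for illustration. Scoring by F1 is likewise an assumed choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient data (synthetic; column names are made up)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Hypothetical label rule, computed before values are removed
y = ((df["age"] > 60) | (df["smoker"] == "yes")).astype(int)

# Knock out about 10% of two columns to simulate missing values
df.loc[rng.random(n) < 0.1, "age"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Impute and encode inside the pipeline so each CV fold is fitted separately
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

# Tune the tree's hyperparameters with 5-fold cross-validation
search = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_split": [2, 5, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)

print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", auc)
```

Keeping imputation and encoding inside the pipeline means they are refitted on each training fold, which avoids leaking test information into preprocessing.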