1] What is a Decision Tree, and how does it work in the context of
classification?
- A Decision Tree is a supervised, tree-like model that predicts a label by asking a sequence of questions about feature values, splitting the data into smaller and more homogeneous groups at each step.
How it works in classification:

1) Start at the root node

- All data is in one place.

- The algorithm chooses the best feature to split the data (e.g., "Age" or "Income") using metrics like Gini Impurity or Entropy (Information Gain).

2) Split the data

- Based on the chosen feature, the data is divided into branches.

Example: If "Age ≤ 30" → go left branch, else → right branch.

3) Repeat for each branch

- Keep splitting until:

- The node is pure (all samples belong to one class), or

- A stopping criterion is met (e.g., max depth reached).

4) Assign a class at the leaf node

- When no more splitting is done, the node becomes a leaf and is labeled with the most common class in that group.
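The four steps above can be sketched with scikit-learn on a tiny toy dataset. The data, the feature names `age` and `income`, and the labels are made up for illustration; `export_text` prints the learned rules so the root split and the leaves are visible.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, income in thousands] -> 1 = buys, 0 = does not buy
X = [[22, 30], [25, 45], [28, 60], [35, 38], [42, 80], [50, 52]]
y = [0, 0, 0, 1, 1, 1]

# Steps 1-3: the tree searches for the split that reduces Gini impurity the most
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Each indented line is a split; each "class:" line is a leaf (step 4)
print(export_text(tree, feature_names=["age", "income"]))

# Classify new samples by walking from the root to a leaf
young = tree.predict([[30, 40]])[0]
older = tree.predict([[40, 40]])[0]
print(young, older)
```

In this toy data only `age` separates the two classes cleanly, so a single split is enough and both children are pure leaves.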

2] Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
- Gini Impurity: Measures how often a randomly chosen sample would be misclassified if labels were assigned based on class proportions.

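$$\text{Gini} = 1 - \sum_i p_i^2$$

where $p_i$ is the proportion of samples in the node that belong to class $i$. 0 = pure, higher = more mixed.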

- Entropy: Measures the level of disorder or unpredictability in the class labels.

$$\text{Entropy} = -\sum_i p_i \log_2 p_i$$

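Impact on splits: For each candidate split, the tree computes the weighted impurity of the resulting child nodes and picks the split with the largest drop in impurity (Gini Gain or Information Gain). Gini is slightly cheaper to compute because it avoids the logarithm, while Entropy has an information-theoretic interpretation. In practice both usually give similar trees.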
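As a quick check of the two formulas, here is a minimal sketch in plain Python. The helper names `gini` and `entropy` are illustrative, not library functions; each takes a list of class proportions that sum to 1.

```python
import math

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

examples = [
    ("pure", [1.0, 0.0]),
    ("mixed 80/20", [0.8, 0.2]),
    ("mixed 50/50", [0.5, 0.5]),
]
for label, props in examples:
    print(f"{label}: gini={gini(props):.3f}, entropy={entropy(props):.3f}")
```

Both measures are zero for a pure node and reach their maximum when the classes are evenly mixed, which is why they tend to rank candidate splits in a similar order.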

3] What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
- Pre-pruning: Stops tree growth early (e.g., max depth, min samples split).
 - How: The tree stops splitting when certain conditions are met (max depth, min samples at a node, min impurity decrease).

 - Effect: Prevents the tree from becoming too complex from the start.

 - Example: Stop splitting if a node has fewer than 10 samples.

 - Advantage: Faster training and less overfitting risk without building unnecessary branches.

- Post-pruning: Grows full tree first, then removes weak branches.

 - How: The tree grows to full depth, then weak/insignificant branches are cut back based on performance on a validation set.

 - Effect: Keeps only branches that meaningfully improve accuracy.

 - Example: Build the full tree, then remove any split that does not reduce validation error.

 - Advantage: Finds a better complexity–accuracy balance because it starts from maximum detail and trims.
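The contrast can be sketched with scikit-learn as follows. Note one substitution: scikit-learn's built-in post-pruning is cost-complexity pruning (the `ccp_alpha` parameter), not direct removal of individual splits, so this sketch picks the pruning strength that scores best on a held-out validation set. The 70/30 split and the specific limits are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-pruning: limits are enforced while the tree is being grown
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then evaluate each candidate pruning strength
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

best_tree, best_score = full_tree, full_tree.score(X_val, y_val)
for alpha in alphas:  # alphas are sorted from weakest to strongest pruning
    candidate = DecisionTreeClassifier(ccp_alpha=max(alpha, 0.0), random_state=0)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the more heavily pruned tree
        best_tree, best_score = candidate, score

print("full tree nodes:  ", full_tree.tree_.node_count)
print("pre-pruned nodes: ", pre_pruned.tree_.node_count)
print("post-pruned nodes:", best_tree.tree_.node_count)
```

Pre-pruning never builds the extra branches at all, while post-pruning pays the cost of growing them and then decides, with evidence from the validation set, which ones to keep.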

4] What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
- Information Gain (IG): Measures how much a split reduces impurity (usually using Entropy).

$$IG = \text{Entropy}_{\text{parent}} - \sum_{\text{children}} \frac{n_{\text{child}}}{n_{\text{parent}}} \times \text{Entropy}_{\text{child}}$$

Importance:

 - Higher IG = greater reduction in uncertainty.

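 - Decision trees choose the split with the maximum IG because it produces the purest child nodes at that step. The choice is greedy: it is the best split locally, which usually, but not always, leads to better classification accuracy.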
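A small worked example may help make the formula concrete. The sketch below uses hypothetical class counts: a parent node with 5 positive and 5 negative samples, and two candidate splits. The function names are illustrative only.

```python
import math

def entropy(counts):
    """Entropy in bits from a list of per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n_parent = sum(parent)
    weighted = sum(sum(child) / n_parent * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [5, 5]               # 5 positive, 5 negative
split_a = [[4, 1], [1, 4]]    # children are mostly one class
split_b = [[3, 2], [2, 3]]    # children are still heavily mixed

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"IG split A: {ig_a:.4f}")
print(f"IG split B: {ig_b:.4f}")
```

Split A yields the larger gain because its children are closer to pure, so a tree using Information Gain would prefer it.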

5] What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
- Applications:

 - Credit scoring and loan approval

 - Predicting customer churn

 - Medical diagnosis support

 - Fraud detection in transactions

 - Product/service recommendations

- Advantages:

 - Simple and easy to interpret

 - Handles numerical and categorical features (scikit-learn's implementation needs categorical features encoded as numbers first)

 - Minimal preprocessing needed

 - Transparent decision-making process


- Limitations:

 - Easily overfits without pruning

 - Sensitive to small changes in data

 - Can produce biased results with imbalanced datasets

In [3]:
# 6] Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier using the Gini criterion
#● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Feature importances
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")




Accuracy: 1.00
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [4]:
# 7] Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully-grown tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_pred = full_tree.predict(X_test)
full_acc = accuracy_score(y_test, full_pred)

# Tree with max_depth=3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_pred = pruned_tree.predict(X_test)
pruned_acc = accuracy_score(y_test, pruned_pred)

# Results
print(f"Accuracy (Full Tree): {full_acc:.2f}")
print(f"Accuracy (Max Depth=3): {pruned_acc:.2f}")


Accuracy (Full Tree): 1.00
Accuracy (Max Depth=3): 1.00


In [5]:
# 8] Write a Python program to:
#● Load the California Housing dataset from sklearn
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

# MSE
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Feature importances
print("\nFeature Importances:")
for name, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error: 0.50

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


In [6]:
# 9] Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#● Print the best parameters and the resulting model accuracy
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# GridSearchCV
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid.best_params_)

# Accuracy with best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy: 1.00


10] Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world setting.

- As a data scientist for a healthcare company that wants to predict whether a patient has a certain disease, here is how I would approach the problem:

1. Handle Missing Values

 - Numerical features: Replace missing values with median (robust to outliers).

 - Categorical features: Replace with most frequent category or create a new category "Unknown".

 - Why: Keeps every row usable by the model while limiting the distortion that imputation introduces.
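The imputation choices in step 1 could look like this with scikit-learn's `SimpleImputer`. The `ages` and `smoker` arrays are hypothetical stand-ins for real patient columns; in a real project the imputers would be fitted on the training split only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical patient columns; np.nan marks a missing entry
ages = np.array([[34.0], [np.nan], [51.0], [47.0], [np.nan], [90.0]])
smoker = np.array([["yes"], ["no"], [np.nan], ["no"], ["no"], [np.nan]], dtype=object)

# Numerical: the median is not pulled upward by the outlier (90)
median_imputer = SimpleImputer(strategy="median")
ages_filled = median_imputer.fit_transform(ages)

# Categorical, option 1: fill with the most frequent category
mode_imputer = SimpleImputer(strategy="most_frequent")
smoker_mode = mode_imputer.fit_transform(smoker)

# Categorical, option 2: keep missingness visible as its own category
unknown_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
smoker_unknown = unknown_imputer.fit_transform(smoker)

print(ages_filled.ravel())
print(smoker_mode.ravel())
print(smoker_unknown.ravel())
```

The "Unknown" option is worth considering when the fact that a value is missing may itself carry signal, for example when a test was never ordered.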

2. Encode Categorical Features

 - One-Hot Encoding: For nominal variables (e.g., blood type).

 - Ordinal Encoding: For ordered variables (e.g., disease stage).

 - Use ColumnTransformer to apply different encodings to different columns.

3. Train a Decision Tree Model

 - Split data into train/test (e.g., 80/20).

 - Use DecisionTreeClassifier(criterion='gini', random_state=42).

 - Fit the model on the processed training data.

4. Tune Hyperparameters

- Parameters to tune:

 - max_depth → controls tree depth

 - min_samples_split → minimum samples to split a node

 - min_samples_leaf → minimum samples at a leaf

- Use GridSearchCV with cross-validation to find the best combination.

5. Evaluate Performance

- Metrics:

 - Accuracy (overall correctness)

 - Precision & Recall (recall is especially important in healthcare because a false negative is a missed diagnosis; precision tracks false alarms)

 - F1-score (balance between precision & recall)

 - ROC-AUC (ability to separate classes)

- Evaluate on the test set to ensure generalization.
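Steps 3 to 5 can be sketched together as below. Since no real patient data is available here, `make_classification` generates a synthetic, imbalanced stand-in (roughly 20% positive cases). The grid values and the choice of `scoring="recall"` are assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for preprocessed patient data
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Tune the three hyperparameters listed above with 5-fold cross-validation
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5, 10],
}
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=42),
    param_grid,
    cv=5,
    scoring="recall",  # optimise for catching positive cases
)
grid.fit(X_train, y_train)

# Evaluate the best model once on the untouched test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_prob)

print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred, digits=3))
print(f"ROC-AUC: {roc_auc:.3f}")
```

`stratify=y` keeps the class ratio the same in both splits, which matters when positives are rare. `classification_report` shows precision, recall, and F1 per class in one table, so accuracy is never read in isolation.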

6. Business Value

 - Early detection: Helps doctors flag high-risk patients quickly.

 - Resource optimization: Prioritize testing for those most likely to have the disease.

 - Cost savings: Reduces unnecessary medical tests.

 - Better patient outcomes: Timely interventions can save lives.