###Question 1: What is a Decision Tree, and how does it work in the context of classification?

A Decision Tree is a supervised learning model used for classification and regression. It represents decisions and their possible consequences as a tree structure composed of internal nodes (tests on features), branches (outcomes of tests), and leaf nodes (predicted class or value).
In the context of classification, a decision tree works as follows:
1.	Root node: The tree starts with all training samples at the root.
2.	Splitting: The algorithm chooses a feature and threshold to split the data into subsets that are more homogeneous with respect to the target class. The split is selected by maximizing a measure of purity improvement (e.g., information gain based on entropy, or reduction in Gini impurity).
3.	Repeat recursively: For each child node, the algorithm repeats splitting until a stopping criterion is met (pure node, minimum samples, max depth, or no improvement).
4.	Leaf nodes: When no further splitting is done, a leaf node stores the majority class (or class probabilities) based on the training samples that ended there. For classification, prediction follows a path from the root to a leaf according to the sample’s features.
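The splitting step above can be illustrated with a minimal sketch. It assumes Gini impurity as the purity measure and tries midpoints between consecutive feature values as candidate thresholds; the helper names `gini` and `best_split` and the toy data are hypothetical, and real implementations such as scikit-learn's are considerably more optimized.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold, weighted child impurity) of the best split."""
    best = (None, None, gini(y))  # start from the parent's impurity (no split)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if weighted < best[2]:
                best = (j, t, weighted)
    return best

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0],
              [6.0, 2.0], [7.0, 5.0], [8.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

feature, threshold, impurity = best_split(X, y)
print(feature, threshold, impurity)
```

On this toy data the search picks feature 0 with threshold 4.5, which yields two pure children. A full tree builder would apply `best_split` recursively to each child until a stopping criterion is met.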

###Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
Answer:
Both Gini Impurity and Entropy are measures of node impurity (how mixed the class labels are). They are used to evaluate how good a split is by computing the impurity before and after the split; a good split reduces impurity significantly.
Entropy (Information Theory)
	For a node with class probabilities p_1,…,p_K, entropy is defined as: H=-∑_(k=1)^K p_k log_2 p_k.
	Entropy is 0 when all samples belong to one class, and maximum when classes are uniformly distributed.
	Information Gain for a split is the reduction in entropy from parent to children.
Gini Impurity
	Defined as: G=1-∑_(k=1)^K p_k^2.
	Like entropy, Gini is 0 for a pure node and increases as class mixing increases.
Impact on splits:
	Both measures generally choose similar splits. Gini is slightly faster to compute (no logarithm) and tends to isolate the most frequent class in its own branch, whereas entropy tends to produce slightly more balanced trees. Entropy (used for information gain) is grounded in information theory.
	In practice, the chosen impurity metric rarely changes final performance much; however, subtle differences can exist in the tree shape and chosen thresholds.
When to use which:
	scikit-learn defaults to Gini for classification (criterion='gini'); entropy is used only when criterion='entropy' is passed explicitly. For most use cases, try both if you suspect sensitivity; otherwise Gini is fine.
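As a small illustration of the two impurity measures discussed above, the sketch below computes both for a few class-probability vectors. The function names are hypothetical; entropy is written as the sum of p_k log_2(1/p_k), which is the same quantity as the usual negative-sum form.

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class probabilities."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits for a vector of class probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # a class with probability 0 contributes nothing
    return np.sum(p * np.log2(1.0 / p))

for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, "gini:", round(gini(probs), 3), "entropy:", round(entropy(probs), 3))
```

For a pure node both measures are 0; for a 50/50 binary node Gini reaches 0.5 and entropy reaches 1 bit, so the two differ in scale but rank these nodes the same way.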


###Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
Answer:
Pre-Pruning (Early Stopping): Stop tree growth early by specifying constraints while building the tree. Common pre-pruning hyperparameters include:
•	max_depth: maximum tree depth
•	min_samples_split: minimum samples required to split
•	min_samples_leaf: minimum samples required to be at a leaf
•	max_leaf_nodes: maximum number of leaf nodes
Advantage of Pre-Pruning: Controls tree size during training, reducing overfitting and computation cost; it's simple and prevents unnecessarily complex splits.
Post-Pruning (Prune after full growth): Grow a full or large tree and then prune back nodes that do not improve generalization, typically using a validation set or cross-validation. Methods include cost-complexity pruning (e.g., ccp_alpha in scikit-learn) and reduced error pruning.
Advantage of Post-Pruning: Potentially yields better generalization, since pruning decisions are made with knowledge of the full tree structure and validation performance, so only branches that actually harm performance are removed.
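The contrast between the two approaches can be sketched with scikit-learn on the Iris data. This is an illustrative setup, not a recommended recipe: the hyperparameter values, the single validation split, and the tie-breaking rule are assumptions made for brevity, and cross-validation would usually be preferable for picking `ccp_alpha`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pre-pruning: constrain growth while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path of the full tree,
# then keep the alpha that scores best on the validation split
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post_pruned.fit(X_train, y_train)

print("pre-pruned leaves: ", pre_pruned.get_n_leaves())
print("post-pruned leaves:", post_pruned.get_n_leaves(), "alpha:", best_alpha)
```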


###Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
Answer:
Information Gain measures how much a split decreases the impurity (usually entropy) of a node. Formally, for a parent node with entropy H(parent) and two child nodes with entropies H(left) and H(right) and sample proportions w_left, w_right, the information gain is:
IG=H(parent)-(w_left H(left)+w_right H(right)).
It is important because:
	It quantifies the expected reduction in uncertainty about the class label when partitioning by a feature and threshold.
	The algorithm chooses the split with the highest information gain (or equivalently the largest impurity reduction) at each node.
	High information gain means the split creates children that are more homogeneous (closer to pure), which typically improves classification accuracy.
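A worked example of the formula above, as a minimal sketch: the helper names and the eight-sample toy node are hypothetical. One split separates the classes perfectly, the other leaves both children as mixed as the parent.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * np.log2(1.0 / p))

def information_gain(parent, left, right):
    """IG = H(parent) - (w_left * H(left) + w_right * H(right))."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # H(parent) = 1 bit

# Split A: children are pure; split B: each child is still 50/50
ig_a = information_gain(parent, parent[:4], parent[4:])
ig_b = information_gain(parent, parent[[0, 1, 4, 5]], parent[[2, 3, 6, 7]])
print("IG split A:", ig_a)
print("IG split B:", ig_b)
```

Split A removes all uncertainty (gain of 1 bit), while split B removes none (gain of 0), so a greedy tree builder would choose split A.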


###Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
Answer:
Applications:
•	Finance: credit scoring, loan default prediction, fraud detection (interpretable rules).
•	Healthcare: disease diagnosis, patient risk stratification (interpretable decision rules useful for clinicians).
•	Marketing: customer segmentation, churn prediction, targeted promotions.
•	Manufacturing: fault detection, quality control.
•	Operations: routing/decision policies, rule-based automation.
Advantages:
•	Interpretable: easy-to-understand rules and visualizations.
•	Handles numerical and categorical data (with appropriate encodings).
•	Non-parametric: no assumptions about data distributions.
•	Fast inference; some implementations handle missing values natively (scikit-learn added NaN support to its decision trees in version 1.3, while earlier versions require imputation).
Limitations:
•	Overfitting: fully-grown trees often overfit to training data.
•	Instability: small changes in the data can produce very different trees.
•	Greedy splits: the algorithm makes locally optimal choices, which may not be globally optimal.
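•	Bias toward features with many distinct values (e.g., high-cardinality categoricals), because they offer more candidate splits; impurity-based feature importances inherit this bias.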
•	Often outperformed by ensembles (Random Forest, Gradient Boosting) in predictive accuracy.
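The interpretability advantage listed above can be made concrete with scikit-learn's `export_text`, which prints a fitted tree as nested if/else rules. The depth limit of 2 is an arbitrary choice to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the learned splits as human-readable rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```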


###Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances


In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data       # Features
y = iris.target     # Target labels

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 3: Create a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Step 4: Train the classifier
clf.fit(X_train, y_train)

# Step 5: Make predictions on the test data
y_pred = clf.predict(X_test)

# Step 6: Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Step 7: Print accuracy and feature importances
print("Decision Tree Classifier using Gini Criterion")
print("------------------------------------------------")
print(f"Accuracy: {accuracy:.2f}")
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Decision Tree Classifier using Gini Criterion
------------------------------------------------
Accuracy: 1.00
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


###Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
(Include your Python code and output in the code box below.)

In [3]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 3: Train a Decision Tree with max_depth=3
clf_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)

# Step 4: Train a fully-grown Decision Tree
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)

# Step 5: Predict on test data
y_pred_limited = clf_limited.predict(X_test)
y_pred_full = clf_full.predict(X_test)

# Step 6: Calculate accuracy
accuracy_limited = accuracy_score(y_test, y_pred_limited)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Step 7: Print results
print("Decision Tree Classifier Comparison")
print("-----------------------------------")
print(f"Accuracy (max_depth=3): {accuracy_limited:.2f}")
print(f"Accuracy (fully grown): {accuracy_full:.2f}")


Decision Tree Classifier Comparison
-----------------------------------
Accuracy (max_depth=3): 1.00
Accuracy (fully grown): 1.00
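Both trees score the same on this particular 45-sample test split, so a single split cannot separate them. One way to get a steadier comparison, sketched here as an optional extra rather than part of the assignment, is stratified 5-fold cross-validation over the whole dataset; the fold count and seed are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for depth in (3, None):  # None means the tree is grown fully
    clf = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=42)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    results[depth] = scores
    print(f"max_depth={depth}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```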


###Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
(Include your Python code and output in the code box below.)

In [4]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Step 1: Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 3: Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(criterion='squared_error', random_state=42)
regressor.fit(X_train, y_train)

# Step 4: Predict on the test set
y_pred = regressor.predict(X_test)

# Step 5: Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Step 6: Print MSE and feature importances
print("Decision Tree Regressor Results")
print("--------------------------------")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print("Feature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Decision Tree Regressor Results
--------------------------------
Mean Squared Error (MSE): 0.5280
Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


###Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

In [5]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 3: Define the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Step 4: Define the parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 3, 4, 5, 6, 10]
}

# Step 5: Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Step 6: Get the best model and evaluate it
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 7: Print results
print("Decision Tree Classifier - Grid Search Results")
print("----------------------------------------------")
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.2f}")
print(f"Test Set Accuracy: {accuracy:.2f}")


Decision Tree Classifier - Grid Search Results
----------------------------------------------
Best Parameters: {'max_depth': 4, 'min_samples_split': 6}
Best Cross-Validation Score: 0.94
Test Set Accuracy: 1.00


###Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


Answer:


1) Quick data understanding & EDA

Inspect data types, class balance (value_counts()), missingness patterns (df.isna().mean()), unique counts for categoricals.

Look for target leakage (features that are derived from the target or measured after diagnosis).

Visualize distributions, pairwise relationships, and missingness heatmaps.

Decide if some columns should be dropped (IDs, high leakage, constant columns).

Why: informs choices for imputation, encoding, and evaluation metrics (e.g., heavy class imbalance → different metrics and resampling).

2) Handle missing values

Principles

Determine mechanism (MCAR / MAR / MNAR) qualitatively — affects whether simple imputation is acceptable.

Never impute with values derived from the test set (use a pipeline so fit/transform are done only on training data).

Use simple methods first: median for numeric (robust to outliers), most_frequent or 'MISSING' token for categoricals.

For complex dependencies consider IterativeImputer (model-based) or KNN imputer.

Add binary missing indicators (MissingIndicator) when missingness itself is potentially informative.

Options

Small fraction missing (<~5%): consider dropping rows.

Numeric: SimpleImputer(strategy='median') or IterativeImputer.

Categorical: SimpleImputer(strategy='constant', fill_value='MISSING'), which treats missingness as its own category; high-cardinality features can then be target/mean encoded (careful about leakage).

Keep a flag column (<feature>_was_missing) if missingness is correlated with outcome.
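A minimal sketch of median imputation with missing-value flags, using `SimpleImputer(add_indicator=True)`. The column names and values are hypothetical toy data. The imputer is fitted on the training rows only, and the learned medians are then reused for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; column names are illustrative only
train = pd.DataFrame({"age": [34.0, np.nan, 58.0, 41.0],
                      "bmi": [22.5, 31.0, np.nan, 27.3]})
test = pd.DataFrame({"age": [np.nan, 50.0],
                     "bmi": [24.0, np.nan]})

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_imputed = imputer.fit_transform(train)  # medians learned from training rows only
test_imputed = imputer.transform(test)        # the same medians are reused here

print("learned medians:", imputer.statistics_)
print(test_imputed)
```

The output has four columns: the two imputed features followed by the two missing-value indicators.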

3) Encode categorical features

For decision trees

Trees are robust to monotonic transforms, but ordinal codes can inject artificial order. Prefer:

OneHotEncoder(handle_unknown='ignore') for low-cardinality categories.

Target encoding / leave-one-out / CatBoost encodings for high-cardinality features (with careful cross-validation to avoid leakage).

If using libraries that natively support categoricals (CatBoost, LightGBM), consider them — they often perform better and avoid high-dimensional OHE.

Why: correct encoding avoids artificial ordering and prevents the model from failing on unseen categories.
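The effect of `handle_unknown='ignore'` can be seen in a short sketch. The blood-type values are hypothetical; the category "AB" appears only at prediction time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["A"], ["B"], ["O"], ["A"]], dtype=object)
new = np.array([["B"], ["AB"]], dtype=object)  # "AB" was never seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

encoded = encoder.transform(new).toarray()
print(encoder.categories_[0])
print(encoded)
```

The unseen category is encoded as an all-zero row instead of raising an error, so the downstream tree can still produce a prediction.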

4) Build a preprocessing + modeling pipeline

Use ColumnTransformer to apply different preprocessing to numeric and categorical columns.

Compose a pipeline: preprocessor -> estimator so all transformations are done properly per fold (avoids leakage).

Example estimator: DecisionTreeClassifier(random_state=42, class_weight='balanced') (use class_weight if imbalance exists).
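A compact sketch of such a pipeline follows. The DataFrame, its column names, and the target column `disease` are hypothetical stand-ins; a real dataset would supply its own column lists.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data
df = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 62, 29, 55, 40],
    "bmi": [22.5, 31.0, 27.3, np.nan, 29.8, 21.1, 33.4, 25.0],
    "smoker": ["no", "yes", "no", "yes", "yes", "no", "yes", "no"],
    "disease": [0, 1, 0, 1, 1, 0, 1, 0],
})
numeric_cols = ["age", "bmi"]
categorical_cols = ["smoker"]

# Different preprocessing per column type
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="MISSING")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# preprocessor -> estimator, fitted together so each CV fold is handled correctly
model = Pipeline([
    ("preprocess", preprocessor),
    ("tree", DecisionTreeClassifier(random_state=42, class_weight="balanced")),
])

X = df[numeric_cols + categorical_cols]
y = df["disease"]
model.fit(X, y)
print(model.predict(X))
```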

5) Hyperparameter tuning

Hyperparameters for Decision Tree to tune

max_depth (prevents overfitting)

min_samples_split, min_samples_leaf (control complexity)

max_features (subset of features considered per split)

criterion ('gini' or 'entropy')

class_weight (or use sampling techniques)

Tuning strategy

Use StratifiedKFold and GridSearchCV or RandomizedSearchCV (for large search spaces).

Use scoring consistent with business objective: e.g., roc_auc or average_precision (PR-AUC) if classes imbalanced; or custom cost-sensitive scoring.

Prefer nested CV when you want an unbiased estimate of generalization while tuning hyperparameters.
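Nested cross-validation can be sketched by passing a `GridSearchCV` object to `cross_val_score`: the inner loop tunes hyperparameters, the outer loop scores the tuned model on folds it never saw. The synthetic imbalanced dataset, the grid values, and the fold counts below are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for a real patient dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=42)

param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search scored by PR-AUC
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid, cv=inner_cv, scoring="average_precision",
)

# Outer loop: each fold refits the whole search, then scores on held-out data
nested_scores = cross_val_score(search, X, y, cv=outer_cv,
                                scoring="average_precision")
print(f"nested PR-AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```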

6) Evaluate performance (clinical emphasis)

Metrics: sensitivity (recall) for the positive disease class, specificity, precision (PPV), NPV, F1, ROC-AUC, PR-AUC.

Confusion matrix for a chosen threshold; tune the threshold using domain costs (for many diseases a false negative is costlier than a false positive).

Calibration: check predicted probabilities with calibration curve and Brier score (are probabilities reliable?).

Explainability: show tree rules (for small trees), feature_importances_, permutation importance, and SHAP values for local explanations.

Validation: holdout test set + external validation on data from other hospitals/populations, and subgroup analysis (by age, sex, ethnicity).

Decision curve analysis to measure net clinical benefit across thresholds.
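Several of these metrics can be computed from predicted probabilities and a confusion matrix, as in the sketch below. The synthetic data and the threshold of 0.3 are hypothetical; in practice the threshold would come from clinical cost considerations. Note that probabilities from a class-weighted tree are generally not calibrated, which is one reason to inspect the Brier score.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for a real patient dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

threshold = 0.3  # hypothetical: lowered to favor sensitivity over specificity
y_pred = (proba >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp) if (tp + fp) else float("nan")
npv = tn / (tn + fn) if (tn + fn) else float("nan")

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"PPV={ppv:.3f} NPV={npv:.3f}")
print(f"ROC-AUC={roc_auc_score(y_test, proba):.3f}")
print(f"PR-AUC={average_precision_score(y_test, proba):.3f}")
print(f"Brier score={brier_score_loss(y_test, proba):.3f}")
```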

7) Deployment & monitoring (operational)

Convert pipeline to a reproducible artifact (pickle, joblib, or an ML-serving container) with preprocessing included.

Monitor data drift, model performance, and calibration in production; set alerts for performance degradation.

Retraining plan (frequency or trigger-based).

Auditability, logging, and explainability for clinicians.

Privacy & regulatory: HIPAA/GDPR compliance, access controls, and model documentation (Model Card).
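Persisting the whole pipeline, rather than only the tree, keeps preprocessing and model in one artifact. A minimal sketch with joblib follows; the file name `model.joblib` and the temporary directory are placeholders, and Iris stands in for real data. A joblib file should only be loaded from a trusted source and with a compatible scikit-learn version.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # placeholder file name
    joblib.dump(model, path)       # preprocessing and tree saved together
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same = bool((restored.predict(X) == model.predict(X)).all())
print("predictions match after reload:", same)
```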

8) Clinical & business value (with caveats)

Value

Early detection and triage → faster treatment, reduced morbidity.

Risk stratification → prioritize high-risk patients for tests/interventions.

Resource optimization → focusing expensive diagnostics on likely positives.

Improved patient outcomes and lower long-term costs when combined with clinical pathways.

Caveats / Risks

False negatives can harm patients; false positives add unnecessary tests/costs and anxiety.

Biases in training data can propagate to inequitable care—require subgroup validation and mitigation.

Clinical validation and prospective trials often required before deployment.

Clinician trust demands interpretability and clear documentation.

End-to-end Python example

This is a reusable template (adapt df, target_col and column lists). It builds preprocessing pipelines, trains a Decision Tree with GridSearchCV, prints the best params and evaluation metrics, and shows feature importances mapped back to input features.

# Example pipeline / tuning code for a mixed-type healthcare dataset
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    average_precision_score, brier_score_loss
)



   