**1. What is a Decision Tree, and how does it work in the context of
classification?**

* A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In the context of classification, it is one of the most intuitive and easy-to-understand algorithms because it mimics human decision-making.

* A Decision Tree is a flowchart-like structure:
  * Each internal node represents a decision based on a feature (attribute).
  * Each branch represents the outcome of that decision.
  * Each leaf node represents a class label (the predicted outcome).
* The model recursively splits the dataset into subsets, using at each step the feature that best separates the classes, until it reaches a stopping condition (such as maximum depth, a minimum number of samples, or pure class labels).
* Root Node Selection
  * The algorithm chooses the best feature to split the dataset (based on metrics like Gini Impurity, Entropy/Information Gain, or Chi-square).
* Splitting
  * The dataset is split into subsets based on the chosen feature’s values.
  * For example, if the test is "Age < 30", the tree creates two branches: one for "Yes" and one for "No".
* Recursive Partitioning
  * The algorithm continues to split each subset into further nodes using the same process.
  * This creates a hierarchy of decisions that narrows down possibilities.
* Leaf Nodes (Classification Output)
  * Splitting stops when:
    * All samples in a node belong to the same class (pure node).
    * Or the maximum tree depth or minimum sample count is reached.
  * The leaf node then assigns a class label (the majority class of the samples in that node).
* Example
  * Suppose we want to classify whether a person will buy a computer.
  * Features: Age, Income, Student Status, Credit Rating.
  * A tree might, for instance, first test Student Status and then Age, with each path ending in a "buys" or "does not buy" leaf.

* Advantages
    * Easy to interpret and visualize.
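    * No need for feature scaling, and the algorithm handles both categorical and numerical data (some implementations, such as scikit-learn, require categorical features to be encoded numerically first).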
    * Can capture nonlinear relationships.

* Disadvantages
   * Prone to overfitting if not pruned.
   * Can be unstable (small changes in data can lead to a different tree).
   * Biased toward features with more levels.
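
The flowchart structure described above can be made concrete with a minimal sketch. The eight rows below are hypothetical toy data invented for illustration (they are not from any real dataset), and only two of the four features mentioned in the example are used.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, is_student] -> buys_computer (1 = yes, 0 = no)
X = [[22, 1], [25, 1], [28, 0], [35, 0], [42, 0], [48, 1], [55, 0], [60, 1]]
y = [1, 1, 0, 0, 0, 1, 0, 1]

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# export_text prints the learned tree as nested threshold tests,
# one line per internal node or leaf
rules = export_text(clf, feature_names=["age", "is_student"])
print(rules)

# Classify a new (hypothetical) 30-year-old student
prediction = clf.predict([[30, 1]])[0]
print("Predicted class:", prediction)
```

In this toy data the label happens to coincide with `is_student`, so a single split is enough to produce pure leaves; real data would normally need several levels.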

**2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?**

**1. Gini Impurity**
* Definition: Gini impurity is the probability of misclassifying a randomly chosen sample from the node if it were labeled at random according to the node's class distribution.
* Formula:

  $$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

  * Where: $C$ = number of classes, $p_i$ = proportion of samples in the node belonging to class $i$.
* Range:
  * Minimum = 0 (pure node: all samples in one class)
  * Maximum = $1 - \frac{1}{C}$ (highest impurity: classes evenly distributed)
* Example: If a node has 10 samples, 4 in Class A and 6 in Class B:
  * $p_A = 0.4$, $p_B = 0.6$
  * $\text{Gini} = 1 - (0.4^2 + 0.6^2) = 1 - (0.16 + 0.36) = 0.48$
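
As a quick check of the arithmetic, here is a small sketch that computes Gini impurity from a list of labels. The helper name `gini_impurity` is an illustrative choice, not a library function.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Same node as in the example: 4 samples of class A, 6 of class B
node = ["A"] * 4 + ["B"] * 6
gini_value = gini_impurity(node)
print(round(gini_value, 2))  # 0.48
```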

**2. Entropy**
* Definition: Entropy measures the "disorder" or "uncertainty" in a node. A pure node has entropy = 0, while mixed classes have higher entropy. Information Gain is the reduction in entropy produced by a split.
* Formula:

  $$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

* Range:
  * Minimum = 0 (pure node: all samples in one class)
  * Maximum = $\log_2(C)$ (when classes are equally likely)
* Example (same node as before: 40% A, 60% B):
  * $\text{Entropy} = -[(0.4 \times -1.32) + (0.6 \times -0.74)] \approx 0.97$
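
The same node can be checked numerically. This is a minimal sketch; the helper name `entropy` is an illustrative choice, not a library function.

```python
import math
from collections import Counter


def entropy(labels):
    """Return -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((count / n) * math.log2(count / n) for count in counts.values())


# Same node as before: 40% class A, 60% class B
node = ["A"] * 4 + ["B"] * 6
entropy_value = entropy(node)
print(round(entropy_value, 2))  # 0.97
```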
* How they impact Decision Tree Splits
  * At each split, the Decision Tree algorithm:
    * Calculates impurity before splitting (parent node).
    * Calculates the weighted impurity after splitting (child nodes).
    * Chooses the feature and threshold that maximize the impurity reduction.
  * Impurity Reduction (Information Gain):

    $$\text{Gain} = \text{Impurity}(\text{parent}) - \sum_{k=1}^{m} \frac{n_k}{n}\, \text{Impurity}(\text{child}_k)$$

  * Where:
    * $n$ = total samples in the parent
    * $n_k$ = samples in child $k$
    * $m$ = number of child nodes
  * The split with the highest gain is chosen.
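
The split-selection rule can be sketched in a few lines. The helper names and the two candidate splits below are hypothetical, chosen only to show that a split producing purer children earns a higher gain.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_reduction(parent, children, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["A"] * 4 + ["B"] * 6

# Candidate split 1: isolates most of class A on the left
good_gain = impurity_reduction(parent, [["A"] * 4 + ["B"], ["B"] * 5])

# Candidate split 2: both children keep the parent's 40/60 mix
poor_gain = impurity_reduction(
    parent, [["A"] * 2 + ["B"] * 3, ["A"] * 2 + ["B"] * 3]
)

print(round(good_gain, 2), round(poor_gain, 2))  # 0.32 0.0
```

A tree builder would evaluate every candidate feature and threshold this way and keep the one with the largest reduction.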

**3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.**

* Pruning in Decision Trees
   * Decision Trees, if left unregulated, can grow very deep, perfectly fitting the training data.
   * This often leads to overfitting (great accuracy on training data, poor on test data).
   * Pruning reduces tree complexity by limiting growth (Pre-Pruning) or trimming after full growth (Post-Pruning).

**1. Pre-Pruning (Early Stopping)**
 * The tree stops growing early, before it becomes too complex.
 * Done by setting constraints such as:
    * Maximum depth (max_depth)
    * Minimum samples required to split a node (min_samples_split)
    * Minimum samples per leaf (min_samples_leaf)
    * Minimum impurity decrease (min_impurity_decrease)
* Practical Advantage:
   * Saves computation time and memory, since the tree is never fully grown.
   * Useful when working with large datasets where training speed matters.
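
A minimal sketch of pre-pruning with scikit-learn, using the four constraints listed above. The specific values are arbitrary illustrations, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Growth stops as soon as any one of these constraints would be violated
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # never go deeper than 3 levels
    min_samples_split=10,        # a node needs >= 10 samples to be split
    min_samples_leaf=5,          # every leaf keeps >= 5 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=42,
)
pre_pruned.fit(X_train, y_train)

print("Depth:", pre_pruned.get_depth())
print("Leaves:", pre_pruned.get_n_leaves())
print("Test accuracy:", pre_pruned.score(X_test, y_test))
```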

**2. Post-Pruning (Pruning After Full Growth)**
* First, the tree is allowed to grow fully.
* Then, branches that contribute little to predictive power are cut back.
* Techniques include:
  * Cost Complexity Pruning (CCP) → removes the subtrees that contribute least to accuracy relative to their size; a complexity parameter (ccp_alpha in scikit-learn) controls the trade-off between accuracy and tree size.
  * Reduced Error Pruning → removes nodes if accuracy on validation data does not decrease.
* Practical Advantage:
  * Often yields better generalization, since the pruning decision is based on the fully grown tree and validation performance.
  * Useful when accuracy is more important than speed.
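
One way to sketch cost complexity pruning in scikit-learn is to grow the full tree, ask it for its candidate `ccp_alpha` values, and pick the value that scores best on held-out data. The validation split and the tie-breaking rule here are assumptions made for illustration; a separate test set would still be needed for an unbiased final estimate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 1: grow the full tree and compute the candidate alpha values
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Step 2: refit once per alpha and keep the best validation score
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)

print("Leaves (full):  ", full_tree.get_n_leaves())
print("Leaves (pruned):", pruned_tree.get_n_leaves())
print("Chosen ccp_alpha:", best_alpha)
```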

**4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?**

* Information Gain (IG) measures the reduction in impurity (uncertainty) achieved by splitting a dataset based on a particular feature.
* In decision trees (like ID3, C4.5), it is usually based on Entropy.
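* Formula

  $$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v)$$

   * Where:
        * $S$ = parent dataset
        * $A$ = feature we split on
        * $v$ = possible values of feature $A$
        * $S_v$ = subset of $S$ where $A = v$
        * $|S_v| / |S|$ = fraction of the parent's samples that fall into subset $S_v$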
* Start with overall entropy (uncertainty) before the split.
* Subtract the weighted average entropy of the subsets after splitting.
* The bigger the Information Gain, the better the split.
* Why is it important?
    * Decision Trees need to decide which feature to split on at each step.
    * Information Gain tells us how much a feature reduces uncertainty in the dataset.
    * The feature with the highest IG is chosen because it produces the purest child nodes.
* Example
   * Suppose we want to classify whether students "Play Tennis" based on "Outlook" (Sunny, Overcast, Rainy).
   * Parent node: 9 "Yes", 5 "No" → Entropy(Parent) ≈ 0.94
   * Splitting on Outlook gives subsets:
     * Sunny (2 Yes, 3 No) → Entropy ≈ 0.97
     * Overcast (4 Yes, 0 No) → Entropy = 0
     * Rainy (3 Yes, 2 No) → Entropy ≈ 0.97
   * Weighted Entropy = (5/14)(0.97) + (4/14)(0) + (5/14)(0.97) ≈ 0.69
   * IG = 0.94 − 0.69 = 0.25
   * This means splitting on Outlook reduces uncertainty by about 0.25 bits of information.
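
The Play Tennis numbers can be reproduced with a short sketch that works directly from the (Yes, No) counts given in the example. The helper name `entropy_from_counts` is an illustrative choice.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a node described by its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


parent = (9, 5)  # (Yes, No) before the split
subsets = {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)}

n = sum(parent)
parent_entropy = entropy_from_counts(parent)

# Weight each subset's entropy by its share of the parent's samples
weighted_entropy = sum(
    sum(counts) / n * entropy_from_counts(counts) for counts in subsets.values()
)
info_gain = parent_entropy - weighted_entropy

print(f"Entropy(parent)  = {parent_entropy:.3f}")
print(f"Weighted entropy = {weighted_entropy:.3f}")
print(f"Information gain = {info_gain:.3f}")
```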

**5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**
* Real-World Applications of Decision Trees
* Finance & Banking
  * Credit scoring / loan approval → Classify whether a person is likely to repay or default.
  * Fraud detection → Identify suspicious transactions.
* Healthcare
  * Disease diagnosis → Classify patients based on symptoms/test results.
  * Treatment recommendation → Decide treatment paths based on patient history.
* Marketing & Sales
  * Customer segmentation → Identify which customers are likely to buy a product.
  * Churn prediction → Predict if a customer is likely to leave.
* Manufacturing & Operations
  * Quality control → Detect defective products in production.
  * Supply chain optimization → Classify demand levels for better inventory planning.
* Retail & E-commerce
  * Recommendation systems → Classify users’ preferences for personalized ads.
  * Pricing decisions → Decide discounts/promotions for specific customer groups.
* Education
  * Student performance prediction → Predict dropout risk or success probability.
  * Adaptive learning systems → Personalize learning paths.
* Advantages of Decision Trees
  * Easy to interpret & visualize → Even non-technical users can understand them.
  * Works with both categorical & numerical data.
  * No need for feature scaling/normalization.
  * Captures non-linear relationships → Can split data into complex decision boundaries.
  * Fast predictions → Good for real-time classification.

* Limitations of Decision Trees
  * Overfitting → Trees can become too deep, memorizing the training data (reduced by pruning or using ensembles like Random Forest).
  * Instability → Small changes in data can lead to a very different tree structure.
  * Bias toward features with many levels → Categorical variables with many distinct values may dominate splits.
  * Not always optimal → A single tree may have lower accuracy compared to ensemble methods (Random Forest, Gradient Boosted Trees).



6. Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Feature importance
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


7. Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully-grown Decision Tree (no max_depth limit)
clf_full = DecisionTreeClassifier(criterion="gini", random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Decision Tree with max_depth=3
clf_pruned = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
y_pred_pruned = clf_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Print comparison results
print("Accuracy of fully-grown tree:", accuracy_full)
print("Accuracy of max_depth=3 tree:", accuracy_pruned)

Accuracy of fully-grown tree: 1.0
Accuracy of max_depth=3 tree: 1.0


8. Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

In [3]:
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predict on test set
y_pred = reg.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


9. Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy

In [4]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid for tuning
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}

# Initialize Decision Tree and GridSearchCV
dt = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring="accuracy")

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", accuracy)

Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Test Set Accuracy: 1.0


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

* 1) Handle missing values: steps & rationale
   * Understand the missingness: check whether data are MCAR, MAR, or MNAR (missing completely at random, at random, or not at random). This affects the choice of strategy.
   * Simple strategies (fast & robust):
      * Numeric: median (robust to outliers) or mean.
      * Categorical: a new category such as "missing", or the most frequent value.
   * Advanced strategies:
      * IterativeImputer (MICE) or KNNImputer when relationships between features are important.
      * Keep a “missingness indicator” (a boolean column) for important variables, since the fact that a value is missing is sometimes predictive.
   * Avoid leakage: fit imputers only on training data (use Pipeline/ColumnTransformer).
* 2) Encode categorical features — options & when to use them
  * One-Hot Encoding: good for low-cardinality categories; safe for trees (uses more columns).
  * Ordinal / Label Encoding: fast but introduces ordinal relationships (use only if category order is meaningful).
  * Target / Mean encoding: works well with high-cardinality categories — but must be done carefully (cross-validated target encoding) to avoid leakage / overfitting.
  * Frequency / Count encoding: useful for high-cardinality when you want compactness.
  * For Decision Trees, you don’t need scaling; they work with integer encodings — but be mindful of implicit order if using label encoding.
* 3) Train a Decision Tree model — best practices
   * Always use a pipeline: preprocessing (imputer + encoder) → classifier.
   * Use stratified train/test split for binary disease label (preserve class ratio).
   * Consider class_weight='balanced' or resampling (SMOTE) for imbalanced disease prevalence (apply SMOTE only on training set).
* 4) Tune hyperparameters
  * Important hyperparameters:
        * max_depth, min_samples_split, min_samples_leaf, max_features
        * criterion (gini / entropy)
        * ccp_alpha (cost complexity pruning)
        * class_weight
  * Tuning strategy:
      * Use GridSearchCV or RandomizedSearchCV with StratifiedKFold.
      * Use scoring='roc_auc' if classes are imbalanced and you care about ranking; use f1 if you want balanced precision/recall.
  * For unbiased final evaluation, use nested CV if possible (outer CV for performance estimate, inner CV for tuning).
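
Steps 1 to 4 can be sketched as a single scikit-learn pipeline. Everything about the data below is an assumption: the column names, the synthetic label, and the 10% missingness are invented for illustration, and the parameter grid is a small example rather than a recommended search space.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# --- Hypothetical synthetic patient data (not real) ---
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(125, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
risk = 0.03 * (df["age"] - 50) + 1.0 * (df["smoker"] == "yes") + rng.normal(0, 1, n)
y = (risk > 1.0).astype(int)
# Knock out roughly 10% of two columns to simulate missing values
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "sex"]

# Steps 1 and 2: impute and encode inside the pipeline to avoid leakage
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the classifier sits at the end of the same pipeline
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: stratified cross-validated grid search over tree hyperparameters
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_leaf": [5, 10, 20],
    "tree__ccp_alpha": [0.0, 0.005, 0.01],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(model, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best parameters:", search.best_params_)
print("Cross-validated ROC AUC:", search.best_score_)
print("Held-out ROC AUC:", test_auc)
```

Because the imputers and encoder live inside the pipeline, each cross-validation fold fits them on its own training portion only.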

* 5) Evaluate performance
  * Primary metrics: ROC AUC, Precision, Recall (sensitivity), F1-score.
  * Important in healthcare: prioritize sensitivity (recall) if missing a disease is costly; balance with acceptable false positives.
  * Other checks: confusion matrix, PR AUC (more informative than ROC AUC for very imbalanced data), and decision threshold tuning.
  * Calibration: check whether predicted probabilities are well calibrated, and apply Platt scaling or isotonic regression if probabilities will drive clinical decision thresholds.
  * Explainability: feature_importances_, SHAP, partial dependence plots for clinician trust.
  * Post-deployment: continuous monitoring for data drift and performance decay.
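
The evaluation checks can be sketched as follows. The data comes from `make_classification` with a 90/10 class imbalance as a stand-in for real patient records, and the two thresholds are arbitrary values chosen to show the sensitivity trade-off.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 10% positive cases)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Threshold-free ranking metrics
print("ROC AUC:", roc_auc_score(y_test, proba))
print("PR AUC (average precision):", average_precision_score(y_test, proba))

# Threshold-dependent metrics: lowering the threshold trades precision for recall
recalls = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    recalls[threshold] = recall_score(y_test, pred)
    precision = precision_score(y_test, pred, zero_division=0)
    print(
        f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
        f"recall={recalls[threshold]:.3f} precision={precision:.3f}"
    )
```

In a screening setting where missed cases are costly, the lower threshold may be preferable even though it raises the number of false positives.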

* Business value (real-world benefits)
   * Early detection & triage: flag high-risk patients for faster follow-up, reducing time to treatment.
   * Resource optimization: prioritize diagnostic testing or specialist visits for high-probability cases.
   * Cost savings: reduce unnecessary testing for low-risk patients, focus resources where they matter.
   * Improved outcomes: early intervention can lead to better clinical outcomes and lower downstream costs.
   * Actionable insights: feature importances / SHAP can suggest which symptoms/tests most strongly predict disease — aiding clinicians and care pathway design.
* Important caveats & deployment considerations
   * False negatives are often more costly in healthcare — choose metrics and thresholds accordingly.
   * Bias & fairness: audit model performance across subgroups (age, sex, ethnicity).
   * Clinical validation: models must be validated prospectively and ideally run in pilot clinical studies before production use.
   * Privacy & compliance: follow HIPAA / local regulations for patient data.
   * Monitoring & retraining: data distributions change — set up monitoring and periodic retraining.
   * Explainability: clinicians require interpretable outputs — pair the model with explanations (feature importances, SHAP).