# Decision Tree

1.  What is a Decision Tree, and how does it work in the context of
classification?

 - A Decision Tree is a non-parametric supervised learning algorithm that can be used for both classification and regression tasks. It uses a hierarchical, tree-like structure, much like a flowchart, to model decisions and their potential consequences.
   - Root Node: Represents the entire dataset, which is then split into two or more subsets that are more homogeneous than the whole.
   - Internal Nodes: Represent a test or condition on a specific feature.
   - Branches: Represent the outcome of the test or condition.
   - Leaf Nodes: Represent the final decision or prediction.
 - For classification, a decision tree, often called a Classification Tree, works by recursively partitioning the data based on feature values to create subsets that are as "pure" as possible with respect to the class label.
   1. Feature Selection:
   - The algorithm chooses the best feature to split on using a criterion such as Gini Impurity, Entropy (Information Gain), or the Chi-square test.
   - The goal is to create the most homogeneous branches possible.

   2. Splitting:
   - The dataset is split into subsets based on the selected feature.
   - This process is repeated recursively for each subset.

   3. Stopping Criteria: The tree stops growing when:
   - All data in a node belongs to the same class.
   - Maximum depth is reached.
   - A node contains fewer samples than the minimum required to split it.

   4. Prediction:
   - For a new input, the tree is traversed from root to leaf by following the decision rules.
   - The label at the leaf node is the predicted class.
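
The prediction step can be illustrated with a minimal sketch. The tree below is hand-built for illustration only: the feature names and thresholds are hypothetical rather than learned from data. Each internal node tests one feature, and each leaf stores a class label.

```python
# Hypothetical, hand-built tree: internal nodes test one feature against a
# threshold; leaves hold the predicted class label.
TREE = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},            # petal_length <= 2.5
    "right": {                              # petal_length > 2.5
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},    # petal_width <= 1.7
        "right": {"label": "virginica"},    # petal_width > 1.7
    },
}


def predict(node, sample):
    """Follow the decision rules from the root until a leaf is reached."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


sample = {"petal_length": 4.8, "petal_width": 1.5}
prediction = predict(TREE, sample)
print(prediction)  # versicolor
```

The sample fails the first test (4.8 > 2.5), moves right, passes the second test (1.5 <= 1.7), and lands in the "versicolor" leaf.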

2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

 - A Decision Tree's ability to classify data relies on an iterative process of splitting nodes, which is guided by metrics that measure the impurity or disorder of a dataset. The two most common impurity measures are Gini Impurity and Entropy.
   1. Gini Impurity measures the likelihood of misclassifying a randomly chosen element from the dataset if it were labeled according to the distribution of labels in the subset.
   - Impact on Splits: In decision trees, the algorithm selects the split that results in the lowest weighted Gini Impurity across the child nodes. The chosen split therefore creates child nodes that are as pure as possible, which typically improves classification accuracy.

   2. Entropy is a measure of the disorder or uncertainty in a dataset. It quantifies the impurity of a node by measuring the unpredictability of the class labels.
   - Impact on Splits: The decision tree algorithm uses entropy to determine the best feature for splitting. The feature that results in the highest information gain is selected for the split. This helps in creating nodes that are more homogeneous in terms of class distribution.
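
As a small illustration, both measures can be computed from the class proportions p_k in a node, using the standard definitions Gini = 1 - sum(p_k^2) and Entropy = -sum(p_k * log2(p_k)). The helper names and the toy labels below are assumptions made for this sketch.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k))."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in proportions)


pure = ["yes"] * 4                   # every sample has the same class
mixed = ["yes", "yes", "no", "no"]   # 50/50 binary node

print(gini(pure), entropy(pure))     # both are zero for a pure node
print(gini(mixed), entropy(mixed))   # 0.5 and 1.0 for a 50/50 binary node
```

Both measures are zero for a pure node and reach their maximum when the classes are evenly mixed, which is why minimizing either one pushes the tree toward purer children.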

3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

 - Pruning is a technique used to avoid overfitting in Decision Trees, which happens when the tree becomes too complex and captures noise in the training data. Pruning simplifies the tree by removing sections that provide little power to classify instances.

   1. Pre-Pruning: Stop growing the tree early before it becomes too complex.

   - How: Apply constraints during tree building, such as:
     - Maximum tree depth
     - Minimum number of samples required to split a node
     - Minimum impurity decrease required for a split
   - Goal: Prevent overfitting by not letting the tree grow unnecessarily deep.
   - Practical Advantage: Faster training and simpler trees.
   - Example: In real-time fraud detection, pre-pruning ensures the model is lightweight and quick to train/deploy, avoiding overly complex rules.

   2. Post-Pruning: First grow the tree fully, then prune back branches that don't improve performance on validation data.

   - How: Techniques like Reduced Error Pruning or Cost-Complexity Pruning.
   - Goal: Simplify the model by removing sections that lead to overfitting.
   - Practical Advantage: Better generalization.
   - Example: In medical diagnosis models, post-pruning ensures the tree doesn't overfit rare noise patterns, improving reliability on unseen patient data.
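
The two strategies can be sketched with scikit-learn, which exposes pre-pruning through constructor constraints and post-pruning through cost-complexity pruning (`ccp_alpha`). The dataset, the validation split, and the specific constraint values are assumptions chosen for illustration; picking alpha on a held-out split is one reasonable option among several.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Pre-pruning: constrain growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=4, min_impurity_decrease=0.01, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow a full tree, compute the cost-complexity pruning path,
# then refit with each candidate alpha and keep the best one on validation data.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from rounding
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # ties go to the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(
    ccp_alpha=best_alpha, random_state=42
).fit(X_train, y_train)

print("Full tree nodes:  ", full_tree.tree_.node_count)
print("Pre-pruned nodes: ", pre_pruned.tree_.node_count)
print("Post-pruned nodes:", post_pruned.tree_.node_count)
```

Pre-pruning needs only one fit, which reflects its speed advantage; post-pruning fits one tree per candidate alpha but chooses the final size based on measured validation performance.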

4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

 - Information Gain is a key metric in Decision Trees that quantifies the effectiveness of a feature in classifying the data. It measures the reduction in uncertainty achieved after splitting a dataset based on that feature.
    - Before the split, we have some uncertainty about class labels.
    - After the split, if subsets are purer, entropy decreases.
    - The decrease is the Information Gain.
    - A higher IG means the split gives us more information about the target.

 - Importance for Choosing the Best Split:
    - Decision Tree algorithms such as ID3 (and C4.5, which uses the normalized variant, Gain Ratio) build the tree greedily from the top down by recursive partitioning. At each node, the algorithm chooses the locally best feature and, for numeric features, the best split point.
    - Metric for Optimality: Information Gain serves as the objective function. The algorithm calculates the IG for every candidate split.
    - Maximization: The decision tree then greedily selects the split that yields the maximum Information Gain. At each step, the tree therefore asks the question that best separates the classes in the current node, moving it closer to its goal of pure leaf nodes. This choice is locally optimal; it does not guarantee the best overall tree.
    - Efficiency and Purity: By prioritizing the split with the highest IG, the algorithm tends to need fewer splits to separate the classes, which usually results in a more concise tree.
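
A short worked example may help. Information Gain is the parent node's entropy minus the size-weighted average entropy of its children. The two candidate splits below are hypothetical and chosen so that the arithmetic is easy to follow.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5  # entropy is 1 bit

# Hypothetical split A separates the classes perfectly.
split_a = [["yes"] * 5, ["no"] * 5]
# Hypothetical split B leaves both children as mixed as the parent.
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(gain_a, gain_b)  # split A has the higher gain, so it would be chosen
```

Split A removes all uncertainty (gain of 1 bit), while split B removes none (gain of 0), so a greedy learner would pick split A.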

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

 - Common Real-World Applications of Decision Trees:
   1. Healthcare & Diagnosis
   - Predicting diseases based on patient attributes like age, symptoms, and medical history.
   - Easy to interpret for doctors: “If blood sugar > X and BMI > Y, then…”
   2. Finance & Credit Scoring
   - Evaluating loan eligibility, credit risk, or likelihood of default.
   - Useful because regulators often require interpretable models.
   3. Marketing & Customer Segmentation
   - Identifying customer groups likely to buy a product or churn.
   - E.g., If Age < 25 and Monthly Spend > $100 → likely to churn.
   4. Fraud Detection
   - Detecting fraudulent transactions by checking patterns in transaction data.
   5. Manufacturing & Quality Control
   - Diagnosing machine failures or predicting product defects based on sensor readings.
   6. Retail & Recommendation Systems
   - Deciding promotions: “If customer buys A and not B, then recommend C.”

 - Main Advantages:
    - Easy to Understand and Interpret: The decision path from the root to a leaf node is easy to follow and translate into simple if-then-else rules. This makes them invaluable for communicating model results to non-technical stakeholders.
    - Minimal Data Preparation: They generally do not require feature scaling or normalization, unlike many other machine learning algorithms.
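    - Handle Both Data Types: The algorithm applies to both categorical and numerical data, although some implementations (scikit-learn's included) require categorical features to be numerically encoded first.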
    - Non-Parametric: They make no assumptions about the statistical distribution of the data.
    - Feature Selection is Built-in: Features chosen for splits near the top of the tree tend to be the most informative, and impurity-based importance scores provide a natural ranking of the features.

 - Main Limitations:
    - Prone to Overfitting: If a tree is allowed to grow too deep, it can memorize the training data and its noise, leading to poor generalization on new, unseen data. Techniques like pruning or limiting the tree's depth are necessary to combat this.
    - Instability: Small variations in the training data can result in a completely different tree structure, making the model unstable and less reliable.
    - Greedy Approach: The algorithm uses a greedy search to find the best split at the current node, without checking if that split will lead to the overall best tree structure down the line. It's locally optimal, not globally optimal.
    - Bias with Imbalanced Data: If a dataset is heavily skewed toward one class, the decision tree can become biased toward the majority class, leading to poor predictions for the minority class.
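
The interpretability and built-in feature ranking listed among the advantages can be seen directly in scikit-learn. The sketch below assumes the Iris dataset and a shallow tree (`max_depth=2`) purely to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Render the fitted tree as indented if-then rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Impurity-based importances, highest first
ranking = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The text output can be read top to bottom as nested conditions, which is the form usually shared with non-technical stakeholders.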

6.  Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model's accuracy and feature importances


  - Iris, DecisionTreeClassifier with Gini

        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import accuracy_score

        # Load the data and hold out a stratified 25% test split
        iris = load_iris()
        X, y = iris.data, iris.target
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=42, stratify=y
        )

        # Fit a tree that uses Gini impurity as the split criterion
        clf_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
        clf_gini.fit(X_train, y_train)
        y_pred = clf_gini.predict(X_test)

        print("Accuracy:", accuracy_score(y_test, y_pred))
        print("Feature importances:", list(zip(iris.feature_names, clf_gini.feature_importances_)))

    - Output
    - Accuracy: 0.8947 (≈ 89.47%)
    - Feature importances:
      - sepal length (cm): 0.0134
      - sepal width (cm): 0.0201
      - petal length (cm): 0.9199
      - petal width (cm): 0.0466

7. Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

   - Compare max_depth=3 vs fully-grown tree

         from sklearn.datasets import load_iris
         from sklearn.model_selection import train_test_split
         from sklearn.tree import DecisionTreeClassifier
         from sklearn.metrics import accuracy_score

         # Same stratified split as in the previous answer
         X, y = load_iris(return_X_y=True)
         X_train, X_test, y_train, y_test = train_test_split(
             X, y, test_size=0.25, random_state=42, stratify=y
         )

         # Depth-limited (pre-pruned) tree
         clf_md3 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
         clf_md3.fit(X_train, y_train)
         acc_md3 = accuracy_score(y_test, clf_md3.predict(X_test))

         clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)  # default (no max_depth)
         clf_full.fit(X_train, y_train)
         acc_full = accuracy_score(y_test, clf_full.predict(X_test))

         print("Accuracy (max_depth=3):", acc_md3)
         print("Accuracy (fully-grown):", acc_full)
     
     - Output
     - Accuracy (max_depth=3): 0.8947
     - Accuracy (fully-grown): 0.8947

8. Write a Python program to:
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

  -  Boston Housing

         from sklearn import datasets
         from sklearn.model_selection import train_test_split
         from sklearn.tree import DecisionTreeRegressor
         from sklearn.metrics import mean_squared_error

         boston = datasets.load_boston()  # removed in scikit-learn 1.2; this line needs scikit-learn < 1.2
         X, y = boston.data, boston.target
         feature_names = list(boston.feature_names)

         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
         regr = DecisionTreeRegressor(random_state=42)
         regr.fit(X_train, y_train)
         y_pred = regr.predict(X_test)

         print("MSE:", mean_squared_error(y_test, y_pred))
         print("Feature importances:", list(zip(feature_names, regr.feature_importances_)))

    - Output
    - MSE: 16.6884
    - Feature importances:
      - CRIM: 0.0663
      - ZN: 0.0013
      - INDUS: 0.0115
      - CHAS: 0.0011
      - NOX: 0.0070
      - RM: 0.5872
      - AGE: 0.0140
      - DIS: 0.0739
      - RAD: 0.0008
      - TAX: 0.0056
      - PTRATIO: 0.0096
      - B: 0.0113
      - LSTAT: 0.2103
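
Because `load_boston` is no longer available in scikit-learn 1.2 and later, the same workflow can be sketched on a dataset that still ships with the library. The swap to `load_diabetes` is an assumption made only so that the example runs on current versions; its numbers will not match the Boston output above.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in dataset (an assumption): load_diabetes is bundled with scikit-learn.
data = load_diabetes()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fully-grown regression tree, as in the Boston example
regr = DecisionTreeRegressor(random_state=42)
regr.fit(X_train, y_train)

mse = mean_squared_error(y_test, regr.predict(X_test))
print("MSE:", mse)
for name, importance in zip(data.feature_names, regr.feature_importances_):
    print(f"{name}: {importance:.4f}")
```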

9.  Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree's max_depth and min_samples_split using
GridSearchCV
- Print the best parameters and the resulting model accuracy

   - GridSearchCV for DecisionTree on Iris

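         from sklearn.datasets import load_iris
         from sklearn.model_selection import GridSearchCV, train_test_split
         from sklearn.tree import DecisionTreeClassifier

         # Same stratified split as in the earlier Iris answers
         X, y = load_iris(return_X_y=True)
         X_train, X_test, y_train, y_test = train_test_split(
             X, y, test_size=0.25, random_state=42, stratify=y
         )

         # Try every combination of the two hyperparameters with 4-fold CV
         param_grid = {'max_depth': [2, 3, 4, None], 'min_samples_split': [2, 4, 6]}
         dt = DecisionTreeClassifier(criterion='gini', random_state=42)
         grid = GridSearchCV(dt, param_grid, cv=4, scoring='accuracy', n_jobs=1)
         grid.fit(X_train, y_train)

         print("Best params:", grid.best_params_)
         print("Best CV accuracy:", grid.best_score_)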

    - Output
    - Best params: {'max_depth': 4, 'min_samples_split': 2}
    - Best CV accuracy (mean over 4-fold CV on the training set): 0.9554 (≈ 95.54%)

10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

And describe what business value this model could provide in the real-world setting.

  - Here is the step-by-step process for building and evaluating the predictive model, along with the business value it provides.

  1. Data Preprocessing: Decision Trees work with mixed data types once the features are encoded, and some implementations tolerate missing values, so simple preprocessing steps are usually sufficient.
     - Step 1: Handle Missing Values: Some Decision Tree implementations handle missing values natively (for example, CART's surrogate splits), but explicit imputation is generally safer and keeps the data usable for other models, such as ensembles, later on.
       - For Numerical Features: Replace missing values with the median of the non-missing values for that feature. The median is robust to outliers.
       - For Categorical Features: Replace missing values with the mode or treat "Missing" as its own category. If the feature is highly predictive, treating "Missing" as a category is often the best approach to capture that information.
     - Step 2: Encode Categorical Features: Many implementations, including scikit-learn's, require numeric inputs, so text or string categories must be converted to numbers before training.
       - Nominal/Unordered Categories: Use One-Hot Encoding. This creates a new binary column for each unique category.
       - Ordinal/Ordered Categories: Use Ordinal Encoding with an explicit category order, so that the integers reflect the inherent ranking. (scikit-learn's LabelEncoder is intended for target labels rather than input features.)
  2. Model Training and Tuning: The next phase is to build and optimize the Decision Tree model.
     - Step 3: Train the Initial Decision Tree Model
       - Split Data: Divide the preprocessed dataset into three parts: Training Set, Validation Set, and Test Set. A typical split is 60% Train, 20% Validation, 20% Test.
       - Initial Training: Train a baseline Decision Tree classifier using default settings on the Training Set.
     - Step 4: Tune Hyperparameters: Search over settings such as max_depth and min_samples_split (for example, with GridSearchCV) and keep the combination that performs best on the Validation Set or in cross-validation.
  3. Evaluation and Business Value
     - Step 5: Evaluate Model Performance: Once the best hyperparameters are found, the model is evaluated on the completely unseen Test Set to get an unbiased estimate of its real-world performance.
     - Given the disease prediction task, the following metrics are essential:
       1. Area Under the ROC Curve: This measures the model's ability to distinguish between the two classes. A score near 1.0 is excellent.
       2. F1-Score: This is the harmonic mean of Precision and Recall. It is especially useful in medical diagnosis because such datasets are often imbalanced, which makes plain accuracy misleading.
       - High Recall: Crucial for minimizing False Negatives (patients with the disease who are missed).
       - High Precision: Important for minimizing False Positives (healthy patients flagged as having the disease).
       3. Confusion Matrix: A table to visualize the actual versus predicted outcomes, clearly showing the number of True Positives, True Negatives, False Positives, and False Negatives.
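
The workflow above can be sketched end to end with a scikit-learn pipeline. Everything about the data is an assumption: the DataFrame is synthetic, and the column names (`age`, `blood_sugar`, `smoker`) are hypothetical stand-ins for real patient attributes. The sketch also uses cross-validation on the training split in place of a separate validation set, which is one common alternative to the 60/20/20 split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; all column names and values are hypothetical.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_sugar": rng.normal(110, 25, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["blood_sugar"] > 120) | (df["smoker"] == "yes")).astype(int)

# Inject missing values after the labels have been derived
df.loc[rng.choice(n, 40, replace=False), "blood_sugar"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_sugar"]
categorical = ["smoker"]

# Median imputation for numbers; "Missing" category plus one-hot for categories
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

# Tune tree hyperparameters; preprocessing is refit inside each CV fold
grid = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 5, None], "tree__min_samples_leaf": [1, 5, 20]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Evaluate the refit best model on the untouched test split
y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Best params:", grid.best_params_)
print("ROC AUC:", auc)
print("F1:", f1)
print("Confusion matrix:\n", cm)
```

Keeping imputation and encoding inside the pipeline means they are learned from training data only, so the test-set metrics are not inflated by information leaking from held-out rows.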

       