# Decision Tree | Assignment

1. What is a Decision Tree, and how does it work in the context of classification?
   - A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In classification, it works like a flowchart: the data is split at each node based on feature values until a decision (the final class label) is reached.

     How it works in classification: The tree starts at the root node, which represents the entire training set. At each decision node, the algorithm picks the feature (and, for numerical features, the threshold) that best separates the classes, usually judged by the Gini Index (a measure of impurity) or by Entropy/Information Gain (a measure of the reduction in uncertainty). Each split creates branches, one per outcome of the test at that node. Splitting continues until a stopping condition is met, ending in leaf (terminal) nodes, each of which is assigned a class label. A new sample is classified by following the path from the root to a leaf according to its feature values.
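
To make the root-to-leaf path concrete, the sketch below fits a shallow tree on Iris, prints its rules as text, and traces one sample through the nodes. The `max_depth=2` setting, the variable names, and the choice of the first sample are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree (max_depth=2 is an illustrative choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the learned splits as an indented, flowchart-like rule list
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Trace one sample from the root (node 0) down to its leaf
sample = iris.data[:1]
node_path = tree.decision_path(sample).indices
predicted = iris.target_names[tree.predict(sample)[0]]
print("Nodes visited:", list(node_path))
print("Predicted class:", predicted)
```

Each indented line in the printed rules is one decision node or leaf, so reading it top to bottom follows the same path the classifier takes.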



2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
   - In Decision Trees, Gini Impurity and Entropy are two commonly used measures of node impurity that help determine the best feature splits. Gini Impurity is the probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in that node. A Gini value of 0 means the node is pure (all samples belong to one class); the maximum, reached when the classes are evenly distributed, is 0.5 for two classes. Entropy comes from information theory and measures the uncertainty or disorder in a node. It is also 0 for a pure node and peaks when the classes are evenly distributed, at 1 bit for two classes. In practice, a Decision Tree algorithm evaluates the candidate splits and chooses the one with the greatest reduction in impurity, that is, the lowest weighted Gini of the child nodes or the largest reduction in Entropy (Information Gain). Both measures therefore steer the tree toward purer subsets, and in practice they often select similar splits.
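
As a minimal sketch of the two measures, the functions below compute Gini Impurity (1 minus the sum of squared class proportions) and Entropy (the negative sum of each proportion times its base-2 logarithm) for a list of class labels. The function names and the toy label lists are assumptions made for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum(-p_k * log2(p_k)) over the class proportions p_k."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

pure_node = ["A", "A", "A", "A"]    # one class only
mixed_node = ["A", "A", "B", "B"]   # two classes, evenly split

print("Pure node:  gini =", gini(pure_node), " entropy =", entropy(pure_node))
print("Mixed node: gini =", gini(mixed_node), " entropy =", entropy(mixed_node))
```

The pure node scores 0 on both measures, while the evenly mixed two-class node reaches the two-class maxima of 0.5 (Gini) and 1 bit (Entropy).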

3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
   - In Decision Trees, pre-pruning (also called early stopping) halts the growth of the tree during construction by setting constraints such as a maximum depth, a minimum number of samples required to split a node, or a minimum information gain needed for a split. This prevents the tree from becoming too complex and overfitting the training data. A practical advantage of pre-pruning is that it makes the model simpler and faster to train, which is especially useful when working with large datasets. Post-pruning, on the other hand, allows the tree to grow fully and then trims back branches that do not improve performance on validation data. This reduces overfitting by removing unnecessary complexity after observing the complete structure. A practical advantage of post-pruning is that it often produces a more accurate and generalizable model, since it evaluates the actual impact of each branch before deciding to remove it, whereas pre-pruning may stop before a useful deeper split is found.
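
The two strategies could look like this in scikit-learn. This is a sketch under stated assumptions: pre-pruning uses the `max_depth` and `min_samples_split` constraints, while post-pruning is approximated with scikit-learn's minimal cost-complexity pruning (`ccp_alpha`), which is one specific post-pruning technique and not necessarily the one the answer above has in mind. The held-out split used to pick the pruning strength and all variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constrain the tree while it is being grown
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=4, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then evaluate pruned versions
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

candidates = []
for alpha in path.ccp_alphas:
    # Clip tiny negative values that can arise from floating-point error
    pruned = DecisionTreeClassifier(ccp_alpha=max(alpha, 0.0), random_state=42)
    candidates.append(pruned.fit(X_train, y_train))

# Keep the pruned tree that scores best on the held-out validation data
post = max(candidates, key=lambda t: t.score(X_val, y_val))

print("Pre-pruned depth:", pre.get_depth(), "leaves:", pre.get_n_leaves())
print("Full tree depth:", full.get_depth(), "leaves:", full.get_n_leaves())
print("Post-pruned depth:", post.get_depth(), "leaves:", post.get_n_leaves())
```

Pre-pruning fits a single constrained tree, whereas this post-pruning loop fits one tree per candidate `ccp_alpha`, which illustrates the extra training cost mentioned above.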

4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
   - Information Gain measures how much a split on a given feature reduces uncertainty (impurity) in the data. It is calculated as the entropy of the parent node minus the weighted sum of the entropies of the child nodes, where each child is weighted by its share of the parent's samples. A higher Information Gain means the feature provides more useful information for separating the classes, leading to purer subsets. This matters because an entropy-based Decision Tree greedily picks the split with the highest Information Gain at each node, so the most discriminative splits are made first and the tree can reach pure leaves in fewer steps. The choice is only locally optimal: it tends to produce compact, accurate trees, but it does not by itself guarantee better generalization.
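
The calculation described above can be sketched in a few lines. The helper names and the toy "yes"/"no" labels are hypothetical; the example contrasts a split that separates the classes perfectly with one that leaves both children as mixed as the parent.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children are each still half "yes" and half "no"
uninformative_split = [["yes", "yes", "no", "no"],
                       ["yes", "yes", "yes", "no", "no", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_uninformative = information_gain(parent, uninformative_split)
print("Gain of perfect split:", gain_perfect)
print("Gain of uninformative split:", gain_uninformative)
```

A tree using the entropy criterion would prefer the first split, since it removes all of the parent's uncertainty while the second removes none.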

5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
   - Decision Trees are versatile and can be applied to both classification and regression tasks, which is why they are often demonstrated using datasets like the Iris dataset and the Boston Housing dataset.

     In the Iris dataset (classification), a Decision Tree classifies flowers into species (Setosa, Versicolor, Virginica) based on petal length, petal width, sepal length, and sepal width. The same approach carries over to real-world classification problems such as medical diagnosis (e.g., disease identification), customer segmentation, and spam detection.

     In the Boston Housing dataset (regression), a Decision Tree predicts house prices from features like crime rate, number of rooms, property tax rate, and access to highways. This mirrors real-world regression problems such as property valuation, sales forecasting, and predicting the time to equipment failure in manufacturing.

     Advantages:

     * Easy to understand and interpret with tree visualizations.
     * Handle both categorical and numerical features with little preprocessing (no feature scaling is needed), although scikit-learn's implementation expects categorical features to be numerically encoded.
     * Can model both classification and regression tasks.
     * Fast training and prediction compared to some complex models.

     Limitations:

     * Tend to overfit, especially with deep trees (though pruning helps).
     * Unstable: small changes in the data can lead to very different trees.
     * Usually less accurate than ensemble methods (Random Forests, Gradient Boosting).
     * Can be biased toward the majority class if the data is imbalanced.

     So, with datasets like Iris, Decision Trees show their strength in classification tasks, and with Boston Housing, their ability to handle regression tasks. In both cases, overfitting and instability remain the key challenges.

6. Write a Python program to:

   ● Load the Iris Dataset

   ● Train a Decision Tree Classifier using the Gini criterion

   ● Print the model’s accuracy and feature importances

   (Include your Python code and output in the code box below.)

In [1]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier using Gini index
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Decision Tree Classifier (Gini Criterion)")
print("Accuracy on test data: {:.2f}%".format(accuracy * 100))

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Decision Tree Classifier (Gini Criterion)
Accuracy on test data: 100.00%

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


7. Write a Python program to:

   ● Load the Iris Dataset

   ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

   (Include your Python code and output in the code box below.)

In [2]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with max_depth=3
clf_limited = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Train fully-grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(criterion="gini", random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print results
print("Decision Tree Classifier Comparison on Iris Dataset")
print("---------------------------------------------------")
print(f"Accuracy with max_depth=3: {accuracy_limited*100:.2f}%")
print(f"Accuracy with fully-grown tree: {accuracy_full*100:.2f}%")


Decision Tree Classifier Comparison on Iris Dataset
---------------------------------------------------
Accuracy with max_depth=3: 100.00%
Accuracy with fully-grown tree: 100.00%
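
Both trees score 100% on the 45-sample test split, so a single split cannot tell them apart. One way to get a steadier comparison, sketched here as an optional extra rather than part of the required answer, is 5-fold cross-validation over the whole dataset. The dictionary of models and the variable names are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

# Average accuracy over 5 folds instead of relying on one train/test split
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The standard deviation across folds gives a rough sense of how much of any difference between the two trees is just noise from the particular split.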


8. Write a Python program to:

   ● Load the Boston Housing Dataset

   ● Train a Decision Tree Regressor

   ● Print the Mean Squared Error (MSE) and feature importances

   (Include your Python code and output in the code box below.)


In [None]:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# where importing it raises ImportError. In that case, fall back to
# fetch_california_housing() as a substitute (downloaded on first use).
try:
    from sklearn.datasets import load_boston
    data = load_boston()
    dataset_name = "Boston Housing"
except ImportError:
    from sklearn.datasets import fetch_california_housing
    data = fetch_california_housing()
    dataset_name = "California Housing"
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(criterion="squared_error", random_state=42)
regressor.fit(X_train, y_train)

# Predict on test data
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print results, naming whichever dataset was actually loaded
title = f"Decision Tree Regressor on {dataset_name} Dataset"
print(title)
print("-" * len(title))
print(f"Mean Squared Error (MSE): {mse:.2f}")

print("\nFeature Importances:")
for feature, importance in zip(data.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


9. Write a Python program to:
   ● Load the Iris Dataset

   ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

   ● Print the best parameters and the resulting model accuracy

   (Include your Python code and output in the code box below.)

In [5]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 6]
}

# Initialize Decision Tree and GridSearchCV
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Get best model
best_clf = grid_search.best_estimator_

# Predict on test data
y_pred = best_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Decision Tree Hyperparameter Tuning with GridSearchCV (Iris Dataset)")
print("--------------------------------------------------------------------")
print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Score: {grid_search.best_score_:.2f}")
print(f"Accuracy on Test Data: {accuracy*100:.2f}%")


Decision Tree Hyperparameter Tuning with GridSearchCV (Iris Dataset)
--------------------------------------------------------------------
Best Parameters: {'max_depth': 4, 'min_samples_split': 6}
Best Cross-Validation Score: 0.94
Accuracy on Test Data: 100.00%


10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
    Explain the step-by-step process you would follow to:

    ● Handle the missing values

    ● Encode the categorical features

    ● Train a Decision Tree model

    ● Tune its hyperparameters

    ● Evaluate its performance

    And describe what business value this model could provide in the real-world setting.

    - If I were working as a data scientist in a healthcare company aiming to predict whether a patient has a certain disease, I would first address the missing values by imputing numerical features with the median (or mean) and categorical features with the most frequent value or a new “Unknown” category, fitting the imputers on the training split only so that no information leaks from the test set. Next, I would encode the categorical features, using one-hot encoding for nominal variables like blood type and ordinal encoding for ordered features such as disease stage, so that the model receives purely numerical input. After preprocessing, I would train a Decision Tree classifier on the training data, starting with a simple tree as a baseline. Since Decision Trees are prone to overfitting, I would then tune hyperparameters such as max_depth, min_samples_split, and min_samples_leaf using GridSearchCV or RandomizedSearchCV with cross-validation to balance model complexity against generalization. Once the best model is chosen, I would evaluate it on the held-out test set using metrics beyond accuracy, such as precision, recall, F1-score, and ROC-AUC, because in healthcare a false negative (a missed disease case) is especially costly.

      In a real-world setting, this model could provide significant business value by supporting doctors in early disease detection, reducing the burden of manual screening, improving patient outcomes through timely interventions, and directing healthcare resources toward high-risk patients, which leads to both better care delivery and cost savings for the organization.
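
The workflow above could be wired together roughly as follows. This is a sketch on synthetic data, not a real clinical pipeline: the column names (`age`, `blood_pressure`, `blood_type`), the rule that generates the labels, the parameter grid, and the choice of recall as the tuning metric are all assumptions made for illustration, and ordinal encoding is omitted to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient data: two numerical features, one categorical feature
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "blood_type": rng.choice(["A", "B", "AB", "O"], n),
})
# Synthetic label from an invented rule, computed before values go missing
y = ((df["age"] > 55) & (df["blood_pressure"] > 115)).astype(int)
# Randomly blank out about 10% of two columns to simulate missing values
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "blood_type"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 1 and 2: impute, then one-hot encode; fitted on training folds only
numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "blood_pressure"]),
    ("cat", categorical, ["blood_type"]),
])

# Step 3: the Decision Tree sits at the end of the same pipeline
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

# Step 4: tune with cross-validation (recall chosen to penalize missed cases)
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set with several metrics
y_pred = search.predict(X_test)
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {auc:.3f}")
```

Keeping the imputers and encoder inside the pipeline means each cross-validation fold refits them on its own training portion, which is what prevents leakage during tuning.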