## Decision Tree | Assignment

### **Question 1:** What is a Decision Tree, and how does it work in the context of classification?

Ans:-
A **Decision Tree** is a supervised machine learning algorithm that can be used for both classification and regression tasks. It's a flowchart-like structure where each internal node represents a 'test' on an attribute (e.g., 'Is the email subject line spammy?'), each branch represents the outcome of the test, and each leaf node represents a class label (for classification) or a numerical value (for regression).



**How it works in classification:**

1. **Splitting Criteria:** The tree starts with a single root node that contains the entire dataset. The algorithm then looks for the best way to split this node into two or more sub-nodes. The 'best split' is determined by metrics that measure the 'purity' of the resulting sub-nodes (e.g., Gini Impurity or Entropy, which we'll discuss next).

2. **Recursive Splitting:** This splitting process is repeated recursively for each sub-node. The goal is to create sub-nodes that are as 'pure' as possible, meaning they contain data points belonging predominantly to a single class.

3. **Stopping Criteria:** The tree-building process stops when certain conditions are met, such as:
    - All data points in a node belong to the same class (perfectly pure).
    - No more features are available for splitting.
    - A pre-defined maximum depth of the tree is reached.
    - The number of data points in a node falls below a certain threshold.

4. **Leaf Nodes:** Once the splitting stops, the final nodes are called leaf nodes. Each leaf node is assigned a class label, which is typically the majority class of the data points within that node.

5. **Prediction:** To classify a new data point, you traverse the tree from the root to a leaf node by following the decisions made at each internal node based on the data point's features. The class label of the reached leaf node is the predicted class for the new data point.

Decision Trees are intuitive and easy to interpret, as their structure mirrors human decision-making.
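The prediction step above can be sketched in a few lines of plain Python. This is only an illustration: the tree is hand-written, and its feature names and thresholds (`petal_length`, `petal_width`, 2.5, 1.7) are made-up values, not the output of a trained model.

```python
# A hand-written toy tree; feature names and thresholds are hypothetical.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},            # taken when petal_length <= 2.5
    "right": {
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},    # taken when petal_width <= 1.7
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, going left when the value is <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


prediction = predict(toy_tree, {"petal_length": 4.8, "petal_width": 1.5})
print(prediction)  # versicolor
```

Each internal node holds one test and each leaf holds one class label, so classifying a sample costs at most one comparison per level of the tree.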



### **Question 2:** Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Ans:-
**Gini Impurity and Entropy** are two common metrics used by Decision Tree algorithms to measure the 'impurity' or 'mixed-up-ness' of a set of samples. The algorithm aims to find splits that minimize the impurity in the child nodes.

1. Gini Impurity (or Gini Index):

    - Concept: Gini Impurity measures how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. A Gini Impurity of 0 means all elements belong to a single class (perfectly pure), while a Gini Impurity of 0.5 (for a binary classification) indicates an even distribution of classes (maximum impurity).
    - Formula: For a node containing data points from C classes, Gini Impurity is calculated as 1 - sum(p_i^2), where p_i is the proportion of samples belonging to class i in that node.
    - Impact on Splits: The Decision Tree algorithm calculates the Gini Impurity for a potential split. It chooses the split that results in the largest decrease in Gini Impurity (or largest Gini Gain). This means it prefers splits that create child nodes where the classes are more homogeneous.
2. Entropy:

    - Concept: Entropy is a measure of the disorder or unpredictability in a set of data. In the context of classification, it quantifies the amount of uncertainty about the class of a randomly chosen data point. Like Gini Impurity, an Entropy of 0 means perfect purity (all samples belong to one class), and higher Entropy values indicate greater disorder.
    - Formula: For a node, Entropy is calculated as -sum(p_i * log2(p_i)), where p_i is the proportion of samples belonging to class i.
    - Impact on Splits: Similar to Gini Impurity, the Decision Tree algorithm seeks to maximize the information gain (the decrease in entropy) for each split. A split that leads to child nodes with lower entropy is preferred because it reduces the uncertainty about the class labels.

**How they impact splits in a Decision Tree:**

Both Gini Impurity and Entropy guide the Decision Tree toward the most informative splits. At each node, the algorithm evaluates the candidate splits across all available features and chooses the one that maximizes the reduction in impurity (Gini Gain or Information Gain). Repeating this process partitions the data into increasingly homogeneous subsets. In practice the two criteria usually produce very similar trees. Gini Impurity is slightly cheaper to compute because it involves no logarithms. It is often said that Gini tends to isolate the most frequent class in its own branch while Entropy tends to produce slightly more balanced trees, but this is a rule of thumb rather than a guarantee.
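To make the two formulas concrete, here is a small sketch in plain Python that computes both measures for a pure node, an evenly mixed node, and a skewed node. The helper names (`class_proportions`, `gini`, `entropy`) and the example label lists are chosen for illustration only.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the proportion p_i of each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini Impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)), summed over classes that are present."""
    return sum(-p * log2(p) for p in class_proportions(labels))


pure = ["a"] * 10
mixed = ["a"] * 5 + ["b"] * 5
skewed = ["a"] * 9 + ["b"] * 1

for name, node in [("pure", pure), ("mixed", mixed), ("skewed", skewed)]:
    print(f"{name}: gini={gini(node):.3f}, entropy={entropy(node):.3f}")
```

Both measures are 0 for the pure node and reach their binary maximum (0.5 for Gini, 1.0 for Entropy) on the evenly mixed node, which is why they tend to rank candidate splits similarly.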

### **Question 3:** What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Ans:-
**Pruning** in Decision Trees is the process of removing branches that make the model more complex and prone to overfitting. It helps simplify the tree and improve its generalization ability.

- **Pre-Pruning (Early Stopping):**

    - **Difference:** This technique stops the tree construction early, before it has perfectly classified the training data. It sets conditions that, if met, will stop the splitting process at a node. Common conditions include a maximum tree depth, a minimum number of samples required to split a node, or a minimum decrease in impurity (e.g., Gini Gain or Information Gain).
    - **Practical Advantage:** A significant advantage of pre-pruning is **reduced computational cost and training time.** By stopping the tree growth early, it prevents the generation of overly complex subtrees that might later be pruned anyway, saving resources.
- **Post-Pruning (Backward Pruning):**

    - **Difference:** In post-pruning, the decision tree is grown to its full potential (or a very large size) first, allowing it to overfit the training data. Then, sub-nodes and branches are removed or collapsed from the bottom-up, usually based on their impact on a validation set or statistical measures. The idea is to identify and remove branches that contribute little to the generalization performance.
    - **Practical Advantage:** Post-pruning often leads to **more optimal or accurate** trees compared to pre-pruning. Since the full tree is built first, it can capture more complex patterns before simplification, potentially finding better structures that pre-pruning might miss by stopping too early.
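As a rough sketch of the difference in scikit-learn terms: pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, while post-pruning is available through minimal cost-complexity pruning (`ccp_alpha`). Note that cost-complexity pruning is scikit-learn's mechanism and is only one of several post-pruning methods. The particular limits and the mid-range alpha below are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is capped while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then collapse weak subtrees.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# A mid-range alpha picked purely for illustration; in practice it would be
# chosen by cross-validation over path.ccp_alphas.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```

Comparing the leaf counts shows how much each strategy simplifies the tree; whether accuracy changes depends on the dataset and the chosen limits.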

### **Question 4:** What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Information Gain (IG)** is a metric used in the construction of Decision Trees to determine the effectiveness of a feature in classifying the training data. It quantifies the expected reduction in entropy (or uncertainty) caused by splitting a dataset based on a particular attribute.

- **How it works:**

  1. First, the entropy of the original dataset (before any split) is calculated.
  2. Then, for each potential feature to split on, the **weighted average entropy** of the child nodes created by that split is calculated. The weights are typically the proportion of samples going into each child node.
  3. Information Gain is then calculated as: Entropy(Parent) - Weighted Average Entropy(Children).
- **Why it's important for choosing the best split:** Information Gain is crucial because the Decision Tree algorithm's primary goal is to create homogeneous (pure) child nodes. A higher Information Gain indicates that a particular split does a better job of separating the data into distinct classes, thereby reducing the uncertainty about the class labels in the resulting subsets. The algorithm greedily selects the feature and split point that yields the **maximum Information Gain** at each step, ensuring that the most informative splits are made first. This iterative process helps in building an efficient and accurate tree that effectively partitions the data.
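The three-step calculation above can be sketched in plain Python as follows. The function names and the toy "yes"/"no" labels are illustrative assumptions; the arithmetic follows the formula Entropy(Parent) - Weighted Average Entropy(Children).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a node: -sum(p_i * log2(p_i)) over the classes present."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split that barely changes the class mix.
weak_split = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

ig_perfect = information_gain(parent, perfect_split)
ig_weak = information_gain(parent, weak_split)
print(f"perfect split: {ig_perfect:.3f}")
print(f"weak split:    {ig_weak:.3f}")
```

The perfect split recovers the full 1 bit of parent entropy, while the weak split gains only about 0.03 bits, so a greedy tree builder would pick the first.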

### **Question 5:** What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Real-World Applications of Decision Trees:**

Decision Trees are versatile algorithms used across many domains due to their interpretability and ease of use. Some common applications include:

- **Medical Diagnosis:** Identifying potential diseases based on symptoms and patient history.
- **Financial Risk Assessment:** Evaluating credit risk for loan applications or predicting stock market trends.
- **Customer Relationship Management (CRM):** Segmenting customers for targeted marketing campaigns, predicting customer churn, or identifying valuable customers.
- **Fraud Detection:** Detecting fraudulent transactions in credit card usage or insurance claims.
- **Manufacturing Quality Control:** Identifying defects in products based on manufacturing parameters.
- **Image Classification:** Simple classification tasks in computer vision.
- **Recommendation Systems:** Suggesting products or content to users based on their preferences and past behavior.

**Main Advantages of Decision Trees:**

1. **Easy to Understand and Interpret:** Their tree-like structure closely mimics human decision-making, making them intuitive and easy to explain to non-technical stakeholders.
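2. **Handle Both Numerical and Categorical Data:** The algorithm itself can split on either type, although some implementations (scikit-learn among them) expect categorical features to be encoded as numbers first.
3. **Require Little Data Preprocessing:** They are not sensitive to feature scaling or normalization. Some implementations can also handle missing values directly, but support varies by library and version.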
4. **No Parametric Assumptions:** They don't assume linearity or specific data distributions.
5. **Feature Selection:** The tree structure implicitly performs feature selection, as more important features are typically used closer to the root.
6. **Non-linear Relationships:** Can capture non-linear relationships between features and target variables.

**Main Limitations of Decision Trees:**

1. **Prone to Overfitting:** Without proper pruning or setting appropriate hyperparameters, Decision Trees can easily overfit the training data, leading to poor generalization on unseen data.
2. **Instability:** Small changes in the training data can lead to a completely different tree structure, making them unstable.
3. **Bias towards Dominant Classes:** If the dataset is imbalanced, Decision Trees might be biased towards the majority class.
4. **Sub-optimality (Greedy Approach):** The greedy approach of choosing the best split at each step doesn't guarantee a globally optimal tree. It might miss better solutions that involve less optimal splits at an earlier stage.
5. **Difficulty with Continuous Variables:** Splitting continuous variables involves creating binary splits (e.g., age > 30), which can lead to information loss if not handled carefully.
6. **Complex for Large Trees:** While individual decisions are simple, a very deep or complex tree can still be difficult to interpret.
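To illustrate how a continuous variable is turned into a binary split (limitation 5), here is a minimal sketch of a threshold search: it tries the midpoint between each pair of consecutive distinct values and keeps the one with the lowest weighted Gini Impurity. The `ages` and `outcome` data are invented, and real implementations are considerably more optimized than this.

```python
from collections import Counter


def gini(labels):
    """Gini Impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values.

    Returns (threshold, weighted Gini of the two children) for the best split.
    """
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: age in years and a yes/no outcome.
ages = [22, 25, 28, 35, 41, 47, 52, 60]
outcome = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]

threshold, impurity = best_threshold(ages, outcome)
print(threshold, impurity)  # 31.5 0.0
```

The search is greedy and local to one node, which also illustrates limitation 4: the threshold that looks best here is not guaranteed to be part of the globally best tree.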

**Dataset Info:**
- Iris Dataset for classification tasks (`sklearn.datasets.load_iris()` or provided CSV).
- Boston Housing Dataset for regression tasks (`sklearn.datasets.load_boston()` or provided CSV). Note that `load_boston` was removed in scikit-learn 1.2, so Question 8 uses the California Housing dataset instead.

### **Question 6:** Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

In [6]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier using the Gini criterion
decision_tree_model = DecisionTreeClassifier(criterion='gini', random_state=42)
decision_tree_model.fit(X_train, y_train)

# Predict on the test set
y_pred = decision_tree_model.predict(X_test)

# Print the model’s accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
feature_importances = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': decision_tree_model.feature_importances_
}).sort_values(by='importance', ascending=False)

print("\nFeature Importances:")
print(feature_importances)

Model Accuracy: 1.00

Feature Importances:
             feature  importance
2  petal length (cm)    0.893264
3   petal width (cm)    0.087626
1   sepal width (cm)    0.019110
0  sepal length (cm)    0.000000


### **Question 7:** Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


In [7]:
# Reuses X_train, X_test, y_train, y_test and the imports from the Question 6 cell
# Train a fully-grown Decision Tree Classifier
decision_tree_full = DecisionTreeClassifier(random_state=42) # max_depth=None by default
decision_tree_full.fit(X_train, y_train)
y_pred_full = decision_tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy of fully-grown Decision Tree: {accuracy_full:.2f}")

# Train a Decision Tree Classifier with max_depth=3
decision_tree_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
decision_tree_pruned.fit(X_train, y_train)
y_pred_pruned = decision_tree_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

print(f"Accuracy of Decision Tree with max_depth=3: {accuracy_pruned:.2f}")

Accuracy of fully-grown Decision Tree: 1.00
Accuracy of Decision Tree with max_depth=3: 1.00
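Because both trees score 1.00 on this single 45-sample test split, the comparison above cannot distinguish them. One possible way to get a steadier estimate is 5-fold cross-validation over the whole dataset, sketched below; the `models` and `cv_means` names are just for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "fully grown": DecisionTreeClassifier(random_state=42),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
}

cv_means = {}
for name, model in models.items():
    # Each model is trained and scored on 5 different train/validation folds.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On a dataset as small and clean as Iris the two configurations are expected to score close to each other, so the shallower tree may be preferable simply because it is easier to read.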


### **Question 8:** Write a Python program to
- load the Boston Housing Dataset
- train a Decision Tree Regressor
- print the Mean Squared Error (MSE) and feature importances.

In [8]:
from sklearn.datasets import fetch_california_housing  # load_boston was removed in scikit-learn 1.2
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load the California Housing Dataset (used here in place of Boston Housing)
# The target is the median house value in units of $100,000
housing = fetch_california_housing()
X_housing = housing.data
y_housing = housing.target

# Split the dataset into training and testing sets
X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(
    X_housing, y_housing, test_size=0.3, random_state=42
)

# Train a Decision Tree Regressor
decision_tree_regressor = DecisionTreeRegressor(random_state=42)
decision_tree_regressor.fit(X_train_housing, y_train_housing)

# Predict on the test set
y_pred_housing = decision_tree_regressor.predict(X_test_housing)

# Print the Mean Squared Error (MSE)
mse = mean_squared_error(y_test_housing, y_pred_housing)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Print feature importances, most important first
feature_importances_housing = pd.DataFrame({
    'feature': housing.feature_names,
    'importance': decision_tree_regressor.feature_importances_
}).sort_values(by='importance', ascending=False)

print("\nFeature Importances (California Housing):")
print(feature_importances_housing)

Mean Squared Error (MSE): 0.53

Feature Importances (California Housing):
      feature  importance
0      MedInc    0.523456
5    AveOccup    0.139012
6    Latitude    0.089992
7   Longitude    0.088806
1    HouseAge    0.052135
2    AveRooms    0.049418
4  Population    0.032206
3   AveBedrms    0.024974


### **Question 9:** Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

In [9]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris Dataset and split it (same split as in the earlier questions)
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Decision Tree Classifier
dtc = DecisionTreeClassifier(random_state=42)

# Define the parameter grid to search
param_grid = {
    'max_depth': [None, 3, 5, 7, 10],
    'min_samples_split': [2, 5, 10, 15]
}

# Initialize GridSearchCV
# 'cv' is the number of folds for cross-validation
# 'scoring' specifies the evaluation metric
grid_search = GridSearchCV(estimator=dtc, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
print("Performing GridSearchCV...")
grid_search.fit(X_train, y_train)
print("GridSearchCV completed.")

# Print the best parameters found
print(f"\nBest Parameters: {grid_search.best_params_}")

# Get the best estimator (model) from the grid search
best_dtc_model = grid_search.best_estimator_

# Predict on the test set using the best model
y_pred_tuned = best_dtc_model.predict(X_test)

# Print the accuracy of the best model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print(f"Accuracy with Best Parameters: {accuracy_tuned:.2f}")

Performing GridSearchCV...
GridSearchCV completed.

Best Parameters: {'max_depth': None, 'min_samples_split': 10}
Accuracy with Best Parameters: 1.00


### **Question 10:** Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

**Explain the step-by-step process you would follow to:**
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

And describe what business value this model could provide in the real-world setting.

Ans:-

**Step-by-Step Process for Predicting Disease with a Decision Tree Model:**
1. **Data Understanding and Initial Exploration:**

    - **Load and Inspect Data:** Load the dataset (e.g., using Pandas). Examine the first few rows (df.head()), data types (df.info()), and summary statistics (df.describe()).
    - **Identify Mixed Data Types:** Note which columns are numerical, categorical, or mixed.
    - **Identify Missing Values:** Use df.isnull().sum() to get a count of missing values per column. Visualize missingness patterns if the dataset is very large or complex (e.g., using missingno library).
2. **Handle Missing Values:** This step is crucial as Decision Trees can sometimes handle missing values (e.g., by treating them as a separate category or by using surrogate splits), but explicit handling often improves performance.

    - **For Numerical Features:**
      - **Imputation:** Replace missing values with the mean, median, or mode of the column. The median is often preferred for skewed distributions to avoid being influenced by outliers. sklearn.impute.SimpleImputer is a good tool.
      - **Advanced Imputation:** Consider more sophisticated methods like K-Nearest Neighbors (KNN) imputation if the data relationships are complex, or regression imputation.
    - **For Categorical Features:**
      - **Imputation with Mode:** Replace missing values with the most frequent category (mode).
      - **Create a 'Missing' Category:** If 'missing' itself could be informative (e.g., patient didn't report a symptom because they don't have it),  create a new category like 'Unknown' or 'Missing'.
      - **Deletion (Use with Caution):** If a feature has a very high percentage of missing values (e.g., >70-80%), it might be better to drop the entire column, or if a row has many missing values, drop the row. This should be done carefully to avoid losing valuable information.
3. **Encode Categorical Features:** Some Decision Tree implementations can split on categorical features directly, but scikit-learn's trees require numerical input, so categorical columns must be encoded first.

    - **Nominal (Unordered) Categorical Features:**
      - **One-Hot Encoding:** Convert each category into a new binary column (0 or 1). This is suitable for features with a small to moderate number of unique categories to avoid creating too many new features. pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder.
      - **Binary Encoding:** Converts categories to binary code. More compact than one-hot for high cardinality categories but loses interpretability. Not always needed for Decision Trees if one-hot works.
    - **Ordinal (Ordered) Categorical Features:**
      - **Ordinal Encoding:** Assign an integer to each category so that the order is preserved (e.g., 'low'=0, 'medium'=1, 'high'=2). This lets threshold splits respect the ordering. Use sklearn.preprocessing.OrdinalEncoder with an explicit category order; LabelEncoder is intended for target labels rather than features.
4. **Train a Decision Tree Model:**

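    - **Split Data:** Divide the dataset into training and testing sets (e.g., 70-80% for training, 20-30% for testing) using sklearn.model_selection.train_test_split, so the model is evaluated on unseen data. To avoid data leakage, fit the imputers and encoders from steps 2 and 3 on the training set only and then apply them to the test set; wrapping the preprocessing and the model in a Pipeline takes care of this.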
    - **Initialize and Train:** Instantiate sklearn.tree.DecisionTreeClassifier (for classification) or DecisionTreeRegressor (if predicting a continuous risk score). Fit the model to the training data (model.fit(X_train, y_train)).
    - **Initial Prediction:** Make predictions on the test set (y_pred = model.predict(X_test)).
5. **Tune its Hyperparameters:** Decision Trees are prone to overfitting, so tuning hyperparameters is critical to achieve a balance between bias and variance.

    - **Identify Key Hyperparameters:** Common hyperparameters for Decision Trees include:
      - **max_depth:** Maximum depth of the tree.
      - **min_samples_split:** Minimum number of samples required to split an internal node.
      - **min_samples_leaf:** Minimum number of samples required to be at a leaf node.
      - **criterion:** Function to measure the quality of a split (e.g., 'gini' or 'entropy' for classification; 'squared_error' or 'absolute_error' for regression, which older scikit-learn releases called 'mse' and 'mae').
    - **Hyperparameter Tuning Strategy:**
      - **Grid Search (GridSearchCV):** Define a grid of hyperparameters to explore. GridSearchCV will systematically try every combination of parameters, using cross-validation on the training set to find the best set.
      - **Random Search (RandomizedSearchCV):** Define distributions for hyperparameters and randomly sample combinations. More efficient than Grid Search for large search spaces.
    - **Cross-Validation:** Use k-fold cross-validation during tuning to get a more robust estimate of model performance and prevent overfitting to a single train-test split.
    - **Select Best Model:** After tuning, retrieve the best_estimator_ from the search object.
6. **Evaluate its Performance:** Evaluate the performance of the best model on the unseen test set.

    - **For Classification (Predicting Disease Presence/Absence):**
      - **Accuracy:** Overall correct predictions (accuracy_score).
      - **Precision, Recall, F1-Score:** Important for imbalanced datasets, especially in healthcare where false negatives (missing a disease) can be critical (classification_report).
      - **Confusion Matrix:** Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives.
      - **ROC Curve and AUC:** Measures the classifier's performance across all possible classification thresholds, particularly useful for understanding the trade-off between sensitivity and specificity (roc_curve, roc_auc_score).
    - **For Regression (Predicting Disease Severity/Risk Score):**
      - **Mean Squared Error (MSE) / Root Mean Squared Error (RMSE):** MSE is the average squared difference between predicted and actual values; RMSE is its square root, which puts the error back in the units of the target.
      - **Mean Absolute Error (MAE):** Less sensitive to outliers than MSE.
      - **R-squared:** Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
    - **Feature Importances:** Analyze model.feature_importances_ to understand which patient attributes are most influential in the disease prediction. This provides valuable insights to medical professionals.
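The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the patient data is synthetic, every column name (`age`, `blood_pressure`, `smoker`, `region`, `has_disease`) is hypothetical, and the rule that generates the labels is made up so that the model has something to learn. Recall is used as the tuning metric on the assumption that missed cases are the costlier error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all column names are hypothetical.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(125, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Made-up risk rule plus noise, only so the labels carry some signal.
risk = (0.04 * (df["age"] - 50)
        + 0.03 * (df["blood_pressure"] - 125)
        + (df["smoker"] == "yes").astype(float))
df["has_disease"] = (risk + rng.normal(0, 1, n) > 0.5).astype(int)
# Knock out some values to mimic missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "region"]

# Steps 2 and 3: imputation and encoding, fitted on training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 4: preprocessing and the tree in a single estimator.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X = df.drop(columns="has_disease")
y = df["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Step 5: tune with cross-validation.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 6: evaluate the refitted best model on the held-out test set.
y_pred = search.predict(X_test)
y_proba = search.predict_proba(X_test)[:, 1]
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.2f}")
```

Keeping the imputers and encoder inside the Pipeline means they are refitted on each training fold during the grid search, so no information from validation or test rows leaks into preprocessing.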

**Business Value in a Real-World Healthcare Setting:**

A predictive model for disease presence, especially one as interpretable as a Decision Tree, offers significant business value:

1. **Early Detection and Intervention:** Identifying patients at high risk of a certain disease earlier allows for proactive medical interventions, potentially improving patient outcomes and reducing long-term treatment costs. This can lead to better quality of life for patients.
2. **Resource Optimization:** Hospitals and clinics can allocate resources (e.g., specialist appointments, diagnostic tests, preventative care programs) more efficiently by prioritizing high-risk patients. This reduces unnecessary screenings for low-risk individuals and ensures critical resources are directed where they are most needed.
3. **Personalized Treatment Plans:** By understanding the specific features that contribute to a patient's risk (via feature importances), clinicians can tailor treatment and prevention plans to individual patient profiles, leading to more effective care.
4. **Cost Reduction:** Early diagnosis and prevention can reduce the severity of disease progression, leading to less expensive and less invasive treatments. For instance, preventing a chronic condition from worsening can avoid the much higher cost of treating it at an advanced stage.
5. **Improved Patient Experience:** Proactive care based on risk prediction can lead to fewer medical emergencies, less time spent in hospital, and a more streamlined healthcare journey for patients.
6. **Drug Development and Research:** Insights from the model can highlight previously unrecognized risk factors or interactions between patient characteristics, guiding pharmaceutical companies and researchers towards new targets for drug development or deeper epidemiological studies.
7. **Enhanced Diagnostic Support:** The model can serve as a decision-support tool for clinicians, helping them confirm diagnoses, especially in complex cases, or flagging patients who might warrant a second look despite ambiguous initial symptoms. Its interpretability is a major advantage here, as clinicians can follow the 'reasoning' of the model.