#Decision Tree Assignment

#Question 1: What is a Decision Tree, and how does it work in the context of classification?

- A Decision Tree is a supervised learning model that works like a flowchart: it splits data into branches based on feature values until it reaches a final decision (output class) at a leaf node.

- How It Works in Classification:

   1. Root Node: The tree starts with a root node that contains the entire dataset.

   2. Splitting: The algorithm selects the best feature to split the data based on a criterion like:

     - Gini Impurity

     - Entropy / Information Gain

   3. Decision Nodes: Each internal node represents a test on a feature (e.g., “Age > 30?”).

   4. Branches: Each branch represents the outcome of that test (e.g., “Yes” or “No”).

   5. Leaf Nodes: The end of each path is a leaf node that represents a class label (e.g., “Approved” or “Denied”).

- Advantages:

    - Easy to visualize and interpret

    - Handles both numerical and categorical data

    - No need for feature scaling

- Disadvantages:

   - Prone to overfitting when grown without a depth limit or pruning

   - Unstable: small changes in the training data can lead to a very different tree
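
The flowchart structure described above can be inspected directly. The following is a minimal sketch, assuming scikit-learn is installed; the choice of `max_depth=2` is arbitrary and only keeps the printed tree short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed flowchart short (max_depth=2 is an arbitrary choice).
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# export_text renders each internal node as a threshold test on one feature
# and each leaf as the predicted class.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printout is a decision node (a test such as a feature being below a threshold), and each `class:` line is a leaf.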

#Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?


- Both Gini Impurity and Entropy are measures of impurity (or disorder) used to decide how to split data at each node in a Decision Tree.

1. Gini Impurity

  - Definition: Gini Impurity measures the probability that a randomly chosen sample would be incorrectly classified if it were labeled according to the class distribution in that node.

  - Interpretation:

      - Gini = 0 → Node is pure (all samples belong to one class).

      - Higher Gini → More mixed classes (higher impurity).

2. Entropy (used to compute Information Gain)

  - Definition: Entropy measures the amount of randomness or uncertainty in the data.

  - Interpretation:

     - Entropy = 0 → Node is pure (all samples same class).

     - Entropy = 1 → Two classes are evenly mixed (50-50). With k classes, the maximum entropy is log2(k).

- How They Impact Splits

   - At each node, the Decision Tree algorithm tries different splits on features.

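   - It calculates the size-weighted Gini or Entropy of the child nodes for each candidate split.

   - The split that gives the lowest weighted impurity is chosen (that is, the largest impurity reduction, called Information Gain when Entropy is used).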
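
The two impurity measures can be computed by hand in a few lines. This is a minimal sketch in plain Python; the helper names `gini` and `entropy` and the toy label lists are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the class proportions p_k."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["A"] * 10                 # all samples in one class
mixed = ["A"] * 5 + ["B"] * 5     # two classes, evenly mixed

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For the pure node both measures are 0; for the evenly mixed two-class node Gini is 0.5 and Entropy is 1.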

#Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.


- Decision Trees often overfit: they become too complex and perform well on training data but poorly on unseen data. To control this, we use pruning, which simplifies the tree.

1. Pre-Pruning (Early Stopping)

   - Definition: Pre-pruning stops the tree from growing too deep by applying certain stopping conditions during the training process.

   - Common Stopping Conditions:

      - Maximum depth of the tree (max_depth)

      - Minimum number of samples required to split a node (min_samples_split)

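      - Minimum impurity decrease required for a split (min_impurity_decrease)

      - Minimum samples per leaf (min_samples_leaf)

   - Example: Stop splitting a node if it has fewer than 5 samples.

   - Practical Advantage: Training is faster and uses less memory, because branches that would be discarded are never grown.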

2. Post-Pruning (e.g., Cost Complexity Pruning)

   - Definition: Post-pruning allows the tree to grow fully first, and then removes unnecessary branches that do not improve model performance on a validation set.

   - How It Works:

     - Build a full decision tree.

     - Evaluate subtrees on validation data.

     - Prune (remove) branches that provide little or no accuracy gain.

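   - Example: The CART algorithm uses cost complexity pruning, which penalizes each leaf by a factor alpha (ccp_alpha in scikit-learn) to balance accuracy against tree size.

   - Practical Advantage: Every split is judged in the context of the full tree, so a split that only pays off deeper down is not cut off prematurely. This often generalizes better than early stopping.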
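
Both pruning styles are available in scikit-learn. The sketch below is illustrative only: the specific stopping values and the choice of the middle candidate alpha are arbitrary assumptions, and in practice alpha would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: stopping conditions are applied while the tree is being grown.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, min_samples_leaf=2, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then compute the candidate alphas
# at which successive subtrees would be collapsed.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Picking the middle candidate alpha is an arbitrary illustrative choice.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(
    X_train, y_train
)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(), "test accuracy:", model.score(X_test, y_test))
```

Larger values of `ccp_alpha` remove more branches, so scanning `path.ccp_alphas` from small to large walks from the full tree down to a single node.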


#Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?


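- Definition:

    - Information Gain (IG) measures how much uncertainty (impurity) in the data is reduced after a dataset is split based on a feature.
    - It is computed as IG = Entropy(parent) - weighted average of Entropy(children), where each child is weighted by its share of the parent's samples.
    - It helps the Decision Tree decide which feature and threshold to split on at each node.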

- Intuition:

  - A good split separates the data so that each child node is purer (contains mostly one class).

  - The greater the reduction in impurity (i.e., higher Information Gain), the better the split.

- Why It’s Important:

    - Guides the Tree: It helps the Decision Tree algorithm choose the best feature and threshold for splitting.

    - Reduces Disorder: Maximizing Information Gain makes the child nodes more homogeneous, leading to clearer class separation.

    - Improves Accuracy: Splits with high IG separate the classes in fewer steps, which usually gives a smaller and more accurate tree. Pruning is still needed for the tree to generalize to unseen data.
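
A small worked example makes the comparison between splits concrete. This is a sketch in plain Python; the helper names `entropy` and `information_gain` and the toy "yes"/"no" labels are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# Perfect split: each child contains a single class.
ig_perfect = information_gain(parent, ["yes"] * 4, ["no"] * 4)

# Partial split: each child is mostly, but not entirely, one class.
ig_partial = information_gain(parent, ["yes"] * 3 + ["no"], ["yes"] + ["no"] * 3)

# Uninformative split: both children keep the parent's 50-50 mix.
ig_useless = information_gain(parent, ["yes"] * 2 + ["no"] * 2, ["yes"] * 2 + ["no"] * 2)

print(f"perfect: {ig_perfect:.4f}, partial: {ig_partial:.4f}, useless: {ig_useless:.4f}")
```

The tree would prefer the perfect split (IG of 1 bit) over the partial one (about 0.19 bits), and would gain nothing from the uninformative one.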


#Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?


- Real-World Applications

  1. Business & Finance

      - Credit Scoring: Predict whether a customer will repay a loan.

      - Customer Churn: Identify customers likely to leave a service.

  2. Healthcare

      - Disease Diagnosis: Classify patients based on symptoms or test results.

      - Treatment Decisions: Suggest treatments based on patient conditions.

  3. Marketing

      - Targeted Advertising: Segment customers for personalized ads.

      - Sales Prediction: Predict which products a customer might buy.

  4. Education

      - Student Performance Prediction: Determine students at risk of failure.

      - Admissions Decisions: Evaluate applicant success probability.

  5. Manufacturing

     - Quality Control: Detect defects in products based on features.

     - Process Optimization: Identify key factors affecting production efficiency.

- Advantages of Decision Trees

  - Easy to understand and to explain to non-technical stakeholders
  - No need for feature scaling
  - Handles non-linear relationships
  - Provides feature importances
  - Fast to train and to predict with

- Limitations of Decision Trees

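  - Overfitting: deep trees can memorize noise in the training data
  - Unstable: small changes in the data can produce a different tree
  - Biased toward features with many levels when splits are chosen by Information Gain
  - Not ideal for smooth continuous relationships, because predictions are piecewise constant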
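
The piecewise-constant limitation can be seen on a toy regression problem. This is a minimal sketch on synthetic data (a perfectly linear target chosen for illustration), assuming NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a perfectly linear target, y = 2x.
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = 2 * X.ravel()

tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
pred = tree.predict(X)

# A depth-2 tree has at most 4 leaves, so it can output at most 4 distinct values
# even though the true relationship is a smooth line.
print("Leaves:", tree.get_n_leaves())
print("Distinct predicted values:", np.unique(pred).size)
```

A deeper tree adds more steps to the staircase, but the prediction never becomes a true line.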

#Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).
#Question 6: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier using the Gini criterion
#● Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

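# Load the Iris data: 150 samples, 4 features, 3 classes
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree that uses Gini impurity as the split criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Importances sum to 1; a higher value means the feature reduced impurity more
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")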


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


#Question 7: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris data
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-pruned tree: growth stops at depth 3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Fully-grown tree: no depth limit, so it splits until the leaves are pure or cannot be split further
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print("Accuracy with max_depth=3:", accuracy_limited)
print("Accuracy with fully-grown tree:", accuracy_full)

# A positive difference means the fully-grown tree scored higher on the test set
print("\nAccuracy Difference:", round(accuracy_full - accuracy_limited, 4))


Accuracy with max_depth=3: 1.0
Accuracy with fully-grown tree: 1.0

Accuracy Difference: 0.0


#Question 8: Write a Python program to:
#● Load the Boston Housing Dataset
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances

In [5]:
# load_boston was removed in scikit-learn 1.2, so the California Housing
# dataset is used here as a substitute regression dataset.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# The target is the median house value, in units of $100,000
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully-grown regression tree (no depth limit)
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# MSE is measured in squared target units
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


#Question 9: Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#● Print the best parameters and the resulting model accuracy

In [6]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

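# Load the Iris data
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the samples; the grid search only sees the training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier(random_state=42)

# 5 x 6 = 30 parameter combinations; None means no depth limit
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 6, 10]
}

# Each combination is scored with 5-fold cross-validation on the training data
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)

# best_estimator_ is refit on the full training split with the best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)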


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Test Accuracy: 1.0


#Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in a real-world setting.


Answer 10:

---


Step-by-Step Process for Disease Prediction Using Decision Tree

1. Handle Missing Values

   - For numeric columns: replace missing values with the median or mean using SimpleImputer (the median is more robust to outliers).

   - For categorical columns: fill missing values with "MISSING" or most frequent category.

2. Encode Categorical Features

   - Use OneHotEncoder for nominal data
   - Use OrdinalEncoder if the categories have a natural order

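3. Train the Decision Tree Model

   - Split the data into training and test sets, stratified on the disease label.

   - Fit a DecisionTreeClassifier on the preprocessed training data, ideally inside a Pipeline so the same preprocessing is applied at prediction time.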

4. Tune Hyperparameters (for example, with GridSearchCV and cross-validation)

   - max_depth

   - min_samples_split

   - min_samples_leaf

   - criterion

5. Evaluate the Model

  - Accuracy, Precision, Recall, F1-score, and ROC-AUC.

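  - Confusion matrix to check false positives and false negatives. In disease screening, false negatives (missed cases) are usually the costlier error, so recall deserves particular attention.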

6. Business Value

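  - Helps detect diseases early, which can improve patient outcomes and reduce treatment costs.

  - Supports doctors in decision-making with transparent, rule-based predictions they can inspect.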
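
The steps above can be sketched end to end as a single scikit-learn pipeline. Everything about the data here is an assumption: the records are synthetic, the column names (`age`, `blood_pressure`, `smoker`, `disease`) are hypothetical, and the parameter grid is only an example. The sketch also assumes pandas is installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; every column name is hypothetical.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["age"] > 55) | (df["smoker"] == "yes")).astype(int)

# Remove some values to mimic missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Steps 1 and 2: impute missing values, then one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="MISSING")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: the tree is trained on the preprocessed features.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols], df["disease"],
    test_size=0.2, stratify=df["disease"], random_state=42,
)

# Step 4: tune hyperparameters with 5-fold cross-validation, scored by F1.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping the imputers and encoder inside the pipeline means they are fitted only on the training folds during cross-validation, which avoids leaking information from the validation data.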

