# Decision Tree

1. What is a Decision Tree, and how does it work in the context of
classification?

 Ans.  A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. Think of it as a tree-like model: it starts at the root node, which contains all the training data, and splits the data into branches by asking a question about a feature (for example, "is age greater than 40?"). Each internal node applies one such test and sends the samples down the matching branch, so the data is divided into progressively smaller subsets. The process continues until a leaf node is reached, which provides the final classification or output.

 In classification:
    i) The decision tree separates data based on features (like age, income, etc.),
    ii) Each leaf node represents a class label (for example, "spam" or "not spam"),
    iii) The algorithm keeps partitioning until all items in a branch belong to the same category or another stopping criterion is met.
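
 To make the root-to-leaf idea concrete, here is a minimal sketch using scikit-learn on the Iris data. The depth limit of 2 is only an illustrative choice to keep the printed rules short; it is not part of the question.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is illustrative)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a test at an internal node; "class:" lines are leaves
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Classifying one sample means following a single path from root to leaf
sample = iris.data[:1]
predicted_name = iris.target_names[tree.predict(sample)[0]]
print(predicted_name)
```

 The printed rules read like nested if/else statements, which is why a decision tree is easy to explain to non-technical readers.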



2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Ans.  Gini Impurity and Entropy are two common ways to measure how mixed or impure the data is at each node of a decision tree.

  i) Gini Impurity estimates how often a randomly chosen element from the node would be incorrectly classified if it were labeled according to the current class distribution. If all items at a node belong to one class, the impurity is zero (the node is pure); if classes are evenly mixed, impurity is higher.

  ii) Entropy gauges the uncertainty or disorder in a node. Higher entropy means the classes are more mixed and the label is harder to predict; lower entropy means the node is closer to pure, with most samples belonging to one class.

 Impact on Tree Splits:

   Both measures guide the decision tree in choosing where to split the data. For each candidate feature and threshold, the algorithm computes the impurity of the resulting child nodes, weighted by the number of samples in each, and prefers the split with the lowest weighted impurity (highest purity). In this way, Gini and Entropy help the algorithm select the features and split points that best separate the classes at each step. In practice the two measures usually choose similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.
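
   Both measures can be computed directly from the class proportions p_k in a node: Gini = 1 - sum(p_k^2) and Entropy = -sum(p_k * log2(p_k)). The sketch below is plain Python written for illustration; the helper names (gini, entropy, weighted_impurity) and the "spam"/"ham" labels are assumptions, not part of any library.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the classes present."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def weighted_impurity(children, measure):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * measure(child) for child in children)

pure = ["spam"] * 4
mixed = ["spam", "spam", "ham", "ham"]
print(gini(pure), entropy(pure))    # a pure node has zero impurity
print(gini(mixed), entropy(mixed))  # an evenly mixed two-class node is the worst case

# Two hypothetical candidate splits of the same four samples
split_a = [["spam", "spam"], ["ham", "ham"]]  # separates the classes
split_b = [["spam", "ham"], ["spam", "ham"]]  # separates nothing
print(weighted_impurity(split_a, gini), weighted_impurity(split_b, gini))
```

   The tree would pick split_a here, because its weighted impurity is lower under either measure.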



3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Ans.  Difference Between Pre-Pruning and Post-Pruning in Decision Trees:

  i) Pre-Pruning (Early Stopping): This technique stops the growth of the decision tree early, during training, before it becomes too complex. It uses predefined conditions to halt further splitting, such as limiting the maximum depth or requiring a minimum number of samples to split a node. The goal is to prevent overfitting by keeping the tree simpler from the start.

  ii) Post-Pruning: In contrast, post-pruning lets the tree grow fully without restrictions and then removes branches or nodes that do not significantly improve the model's accuracy. This simplification happens after the full tree is built, often using methods like cost-complexity pruning or reduced error pruning.

Practical Advantages:

  i) Pre-Pruning Advantage: It is computationally efficient since it stops building unnecessary branches early, saving training time, especially useful for large datasets.

  ii) Post-Pruning Advantage: It often results in better model accuracy and generalization because it evaluates the fully grown tree before simplifying, reducing the risk of stopping too early and underfitting.
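
  Both approaches are available in scikit-learn: constructor limits such as max_depth act as pre-pruning, and ccp_alpha applies cost-complexity post-pruning. The sketch below is illustrative only. In particular, picking the middle alpha from the pruning path is an arbitrary assumption; in practice the value would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Pre-pruning: the limits stop splitting while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=1)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its cost-complexity path
full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice: a mid-range alpha (clamped at 0 to guard against rounding)
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```

  Larger values of ccp_alpha remove more of the tree, so the number of leaves shrinks as alpha grows.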



4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Ans.  Information Gain is a concept used in decision trees to measure how well a feature (or attribute) helps to split the data into distinct classes. In simple terms, it tells us how much our uncertainty about the target class decreases if we know the value of a particular feature.

 When building a decision tree, the algorithm looks at each feature and calculates the information gain from splitting the data on that feature. The feature with the highest information gain is selected for the split at that node. This helps the tree create branches where the resulting groups are more pure (meaning most items in a group belong to the same class).

Information gain is essential because:

  i) It guides the tree to choose features that separate the classes most effectively at each step.

  ii) Features that produce the largest reduction in impurity or uncertainty (i.e., higher information gain) lead to simpler, more accurate classification paths.

  iii) By favoring splits that maximize information gain, the tree reaches homogeneous groups in fewer splits. Information gain on its own does not prevent overfitting, so it is usually combined with pruning or stopping criteria.
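
  Information gain is the parent's entropy minus the size-weighted entropy of the children: IG = H(parent) - sum(|child| / |parent| * H(child)). The following is a small hand-rolled sketch; the function names and the two example features are hypothetical and chosen only to show the comparison.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# Hypothetical feature A separates the classes perfectly
split_a = [["yes"] * 5, ["no"] * 5]
# Hypothetical feature B leaves both children almost as mixed as the parent
split_b = [["yes", "yes", "yes", "no", "no"], ["yes", "yes", "no", "no", "no"]]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"Information gain, feature A: {ig_a:.3f}")
print(f"Information gain, feature B: {ig_b:.3f}")
```

  Feature A has the higher gain, so it would be chosen for the split at this node.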


5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Ans. Common Real-World Applications of Decision Trees:

  a) Healthcare: Used for disease diagnosis by analyzing symptoms and test results to classify conditions, aiding doctors in decision-making.

  b) Finance: Employed for credit scoring, risk assessment, and fraud detection by evaluating financial histories and transaction patterns.

  c) Education: Predict student performance or dropout risk based on attendance, grades, and study habits.

  d) Marketing and Customer Segmentation: Classify customers by demographics and buying behavior for targeted marketing campaigns.

  e) Manufacturing and Quality Control: Detect product defects based on sensor or production variables.

  f) Retail and E-commerce: Build recommendation systems and manage inventory based on customer preferences and purchase data.

  g) Agriculture: Predict crop yields and manage pest control using environmental data.


Advantages of Decision Trees:

   a) Easy to Understand and Interpret: They produce human-readable rules, making them ideal for explaining decisions to stakeholders.

   b) Handles Both Numerical and Categorical Data: Versatile in terms of the types of data they can process.

   c) Nonlinear Relationships: Can model complex decision boundaries without assuming linearity.

   d) Require Little Data Preparation: They do not need normalization or scaling.

Limitations of Decision Trees:

   a) Overfitting: Trees can become overly complex, fitting noise in the training data.

   b) Instability: Small changes in data can produce very different trees.

   c) Bias Toward Features with More Levels: Features with many distinct values can dominate splits.

   d) Poor Generalization on Some Datasets: May perform poorly compared to ensemble methods like random forests.

In [1]:
# 6) Write a Python program to:
# Load the Iris Dataset
# Train a Decision Tree Classifier using the Gini criterion
# Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and test parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Initialize Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=1)

# Train the model
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')

# Print feature importances
features = iris.feature_names
importances = clf.feature_importances_
for feature, importance in zip(features, importances):
    print(f'Feature: {feature}, Importance: {importance:.4f}')


Model Accuracy: 0.96
Feature: sepal length (cm), Importance: 0.0215
Feature: sepal width (cm), Importance: 0.0215
Feature: petal length (cm), Importance: 0.0632
Feature: petal width (cm), Importance: 0.8939


In [2]:
# 7) Write a Python program to:
# Load the Iris Dataset
# Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and test parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train Decision Tree Classifier without depth limit (fully-grown tree)
clf_full = DecisionTreeClassifier(random_state=1)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Train Decision Tree Classifier with max_depth=3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=1)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

print(f"Accuracy of fully-grown tree: {accuracy_full:.2f}")
print(f"Accuracy of tree with max_depth=3: {accuracy_limited:.2f}")


Accuracy of fully-grown tree: 0.96
Accuracy of tree with max_depth=3: 0.96


In [3]:
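# 8) Write a Python program to:
# Load the Boston Housing Dataset
# Train a Decision Tree Regressor
# Print the Mean Squared Error (MSE) and feature importances
#
# Note: load_boston was removed in scikit-learn 1.2, so this solution uses the
# California Housing dataset (fetch_california_housing) instead.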


from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing Dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split dataset into training and testing parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Initialize Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=1)

# Train the model
regressor.fit(X_train, y_train)

# Predict on test data
y_pred = regressor.predict(X_test)

# Calculate and print Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.4f}')

# Print feature importances
features = housing.feature_names
importances = regressor.feature_importances_
for feature, importance in zip(features, importances):
    print(f'Feature: {feature}, Importance: {importance:.4f}')


Mean Squared Error: 0.4908
Feature: MedInc, Importance: 0.5107
Feature: HouseAge, Importance: 0.0507
Feature: AveRooms, Importance: 0.0306
Feature: AveBedrms, Importance: 0.0276
Feature: Population, Importance: 0.0267
Feature: AveOccup, Importance: 0.1384
Feature: Latitude, Importance: 0.1088
Feature: Longitude, Importance: 0.1064


In [4]:
# 9) Write a Python program to:
# Load the Iris Dataset
# Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Initialize Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=1)

# Parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 5, 10]
}

# GridSearchCV setup
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_

# Predict using best estimator
y_pred = grid_search.best_estimator_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Best Parameters: {best_params}")
print(f"Model Accuracy: {accuracy:.2f}")



Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Model Accuracy: 0.96


10. Imagine you're working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
    ● Handle the missing values
    ● Encode the categorical features
    ● Train a Decision Tree model
    ● Tune its hyperparameters
    ● Evaluate its performance
And describe what business value this model could provide in the real-world setting.


Ans.  Step-by-Step Machine Learning Workflow for Disease Prediction
Let's break down how you'd handle this healthcare prediction task:

a. Handle Missing Values:

     i) Identify missing data: Check which features have missing entries and how many.
     ii) Decide on imputation strategies:
         For numerical variables, fill missing values with techniques such as mean, median, or model-based imputation.
         For categorical variables, fill with the most frequent category or use a separate “missing” label.
     iii) Consider the impact: Evaluate if the missingness carries meaning and if advanced techniques like Multiple Imputation are needed.
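
     A minimal sketch of steps i and ii follows. The patient table and its column names are hypothetical and exist only for illustration. In a real workflow the imputer would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records (column names are illustrative only)
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "blood_pressure": [120.0, 135.0, np.nan, 110.0],
    "smoker": ["yes", np.nan, "no", "no"],
})

# Step i: count the missing entries in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Step ii, numerical columns: fill with the column median
num_cols = ["age", "blood_pressure"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Step ii, categorical column: keep missingness visible with its own label
df["smoker"] = df["smoker"].fillna("missing")

print(df)
```

     Using a separate "missing" label preserves the information that a value was absent, which can matter when the missingness itself is informative.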

b. Encode Categorical Features

     i) Identify categorical columns.
     ii) Choose encoding method based on feature type and cardinality:
           For nominal categories with no order: use One-Hot Encoding.
           For ordinal categories: use Ordinal Encoding.
           For high-cardinality features: consider target encoding or embedding.

c. Train a Decision Tree Model

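     i) Split the dataset into training and test sets to validate performance fairly. Fit the imputers and encoders from steps a and b on the training set only and then apply them to the test set, so that no test-set information leaks into preprocessing.
     ii) Initialize a Decision Tree Classifier; decision trees suit mixed data types well, but implementations such as scikit-learn's require categorical features to be encoded as numbers first.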
     iii) Train the model on the training data.

d. Tune Hyperparameters

     i) Identify important hyperparameters: max_depth, min_samples_split, min_samples_leaf, criterion (like Gini or Entropy).
     ii) Use techniques like Grid Search or Randomized Search with cross-validation to find the best parameters without overfitting.
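
     Steps c and d can be combined in one scikit-learn Pipeline, so that imputation and encoding are refitted inside every cross-validation fold. The sketch below runs on synthetic data: the columns age, glucose and sex, the rule that generates the label, and the grid values are all assumptions for demonstration, not real clinical data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (purely illustrative)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "glucose": rng.normal(100, 20, n),
    "sex": rng.choice(["F", "M"], n),
})
y = ((df["age"] > 50) & (df["glucose"] > 100)).astype(int)   # made-up label rule
df.loc[rng.choice(n, 30, replace=False), "glucose"] = np.nan  # inject missing values

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=1
)

# Preprocessing lives inside the pipeline, so it is fitted on training folds only
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "glucose"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=1)),
])

# The "tree__" prefix routes each parameter to the classifier step
param_grid = {
    "tree__max_depth": [2, 4, 6, None],
    "tree__min_samples_split": [2, 10],
    "tree__min_samples_leaf": [1, 5],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated F1: {search.best_score_:.2f}")
```

     For larger grids, RandomizedSearchCV accepts the same pipeline and samples a fixed number of parameter combinations instead of trying them all.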

e. Evaluate Performance

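     i) Use metrics suitable for classification: precision, recall, F1-score, and AUC-ROC. Accuracy alone can be misleading when the classes are imbalanced, which is common in disease data, and recall deserves particular attention because a missed diagnosis is usually the costlier error.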
     ii) Analyze confusion matrix to understand types of errors.
     iii) Consider validation with a separate test set or cross-validation for robustness.
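
     The metrics above can be computed as in the following sketch. It uses make_classification to generate an imbalanced synthetic dataset (about 10% positives), which is an assumption standing in for real patient data, and the class names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data: roughly 90% negative, 10% positive (illustrative)
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # estimated probability of the positive class

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_test, y_pred, target_names=["no disease", "disease"]))

# AUC-ROC is computed from the scores, not from the hard predictions
auc = roc_auc_score(y_test, y_score)
print(f"AUC-ROC: {auc:.2f}")
```

     In the confusion matrix, the bottom-left cell counts patients with the disease who were predicted healthy, which is the error type to watch most closely in this setting.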

Business Value:

     i) Early disease detection: The model helps identify patients at high risk quickly, aiding timely interventions.
     ii) Resource allocation: Enables the healthcare company to better target diagnostic tests and treatments, optimizing cost and care delivery.
     iii) Improved patient outcomes: Early prediction can reduce complications by allowing earlier treatment.
     iv) Data-driven decisions: Provides insights into factors affecting disease presence, supporting clinical research and policy.