# Decision Tree Assignment

1.  What is a Decision Tree, and how does it work in the context of
classification?
 - A Decision Tree is a popular supervised machine learning algorithm used for classification and regression tasks. In the context of classification, it helps to predict the category or class of an input sample based on its features.

How it works in the context of classification:

- Feature Selection and Splitting:
  - The algorithm selects the best feature (and threshold) to split the data, based on an impurity criterion such as Gini impurity or entropy. With entropy, the reduction in impurity is called Information Gain.
  - It recursively splits the dataset into subsets, and each subset becomes a new node.

- Recursive Partitioning: the process continues until one of the stopping conditions is met:
  - All instances in a node belong to the same class.
  - No features are left to split on.
  - A maximum tree depth is reached.

- Prediction:
  - To classify a new instance, start at the root and follow the decisions down the tree until a leaf node is reached, which gives the predicted class.
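
The prediction step can be illustrated with a minimal sketch. The tree below is written by hand rather than learned, and its feature names, thresholds, and class labels are illustrative assumptions only; the point is the root-to-leaf walk.

```python
# A tiny hand-written tree (hypothetical thresholds, not learned from data).
# Internal nodes hold a feature name and threshold; leaves hold a class label.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when the value is <= threshold."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))
```

Each prediction touches at most one node per level, so the cost of classifying a sample grows with the depth of the tree, not with the number of training rows.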

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
 - Both Gini Impurity and Entropy are metrics used to evaluate how "pure" or "impure" a node is during the process of splitting in a decision tree. The goal is to select the feature and threshold that results in the most homogeneous (pure) child nodes.

 At each node, the algorithm evaluates all possible splits and selects the one that produces the largest reduction in impurity (called Information Gain for Entropy or Gini Gain for Gini Impurity).
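
For class proportions p_k in a node, Gini impurity is 1 - Σ p_k² and entropy is -Σ p_k log2(p_k). The sketch below computes both and runs a simple threshold search on a single numeric feature. The helper `best_threshold` and the toy data are assumptions made for illustration; real libraries use more efficient search.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits, written as sum(p_k * log2(1 / p_k))."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def best_threshold(values, labels, impurity=gini):
    """Try midpoints between sorted distinct values.

    Returns (threshold, impurity reduction) for the best split found,
    or (None, 0.0) if no split reduces impurity.
    """
    parent = impurity(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Child impurities are weighted by the share of samples in each child
        weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
        if parent - weighted > best[1]:
            best = (t, parent - weighted)
    return best

# Hypothetical toy data: one numeric feature and a binary label
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print("Gini:", gini(labels), "Entropy:", entropy(labels))
print("Best split (Gini):", best_threshold(values, labels))
print("Best split (entropy):", best_threshold(values, labels, impurity=entropy))
```

On this data both criteria pick the same threshold, which is the usual outcome in practice: Gini and entropy rank most candidate splits the same way, and Gini is slightly cheaper because it avoids the logarithm.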

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
 - Pre-Pruning (Early Stopping)

Definition:
Pre-pruning stops tree growth early, before the tree becomes overly complex. The decision to stop is made during tree construction.

- Common Pre-Pruning Criteria:
  - Maximum tree depth (max_depth)
  - Minimum samples required to split a node (min_samples_split)
  - Minimum samples per leaf (min_samples_leaf)
  - Minimum information gain or Gini gain threshold

Practical advantage: training is faster and cheaper, because branches that would later be cut are never built.

 - Post-Pruning (e.g., Cost-Complexity Pruning)

Definition:
Post-pruning lets the tree grow fully first, and then prunes back branches that add little predictive value or contribute to overfitting.

Techniques:
  - Reduced Error Pruning
  - Cost Complexity Pruning (used in CART, based on complexity parameter α)
  - Validation set pruning (prune a branch if validation accuracy does not get worse)

Practical advantage: because the full tree is grown first, a split that looks weak on its own but enables useful splits further down is not discarded prematurely.
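
As a rough sketch of the two approaches in scikit-learn: pre-pruning is expressed through constructor limits, and post-pruning through `ccp_alpha`. The specific limits and the mid-range alpha chosen below are arbitrary assumptions for illustration; in practice alpha would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: constrain growth while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with a cost-complexity alpha
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Arbitrary mid-range alpha, picked only to show the effect
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```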

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
 - Information Gain (IG) is a measure of how well a particular feature separates the data into target classes. It quantifies the reduction in entropy (or disorder) after a dataset is split based on a feature.

 Why is it Important?

 In decision trees, Information Gain is a key criterion for choosing the feature to split a node on: ID3 uses it directly, and C4.5 uses a normalized variant, the gain ratio. A higher information gain indicates a more effective feature for creating pure child nodes.

 Choosing the feature with the highest IG at each node is a greedy strategy. It gives the purest child nodes at that step, which usually leads to a compact and accurate tree, although it does not guarantee the globally best tree.
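
Information Gain can be written as IG = H(parent) - Σ (n_j / n) · H(child_j), where H is entropy and n_j is the number of samples in child j. A minimal sketch with hypothetical toy labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits, written as sum(p_k * log2(1 / p_k))."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy labels: 4 positive and 4 negative samples
parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print("Perfect split IG:", information_gain(parent, perfect_split))
print("Useless split IG:", information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy, while a split whose children have the same class mix as the parent gains nothing.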

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
 - Real-World Applications

- Healthcare

Use: Diagnosing diseases based on symptoms and test results

Example: Predicting whether a patient has diabetes based on features like glucose level, BMI, age

- Finance & Banking

Use: Credit scoring, fraud detection

Example: Deciding whether to approve a loan based on income, credit history, and debt ratio

- Marketing

Use: Customer segmentation, churn prediction

Example: Predicting if a customer will respond to a marketing campaign

- Retail

Use: Recommender systems, inventory decisions

Example: Predicting product preferences or seasonal demand

- Education

Use: Student performance prediction

Example: Predicting dropouts based on attendance, grades, and engagement

- Manufacturing

Use: Quality control and defect detection

Example: Classifying whether a product is defective based on sensor readings

- Main Advantages

| Advantage | Explanation |
|---|---|
| Easy to Understand | Tree structure is intuitive and can be visualized easily |
| No Need for Feature Scaling | Works well with unscaled or categorical data |
| Handles Non-linear Relationships | Captures complex interactions between features |
| Fast to Train and Predict | Computationally efficient for small to medium datasets |
| Interpretable | Each decision path provides a clear explanation of the prediction process |

6. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
- Load the Iris Dataset


In [None]:
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()

# Create a DataFrame for better visualization
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['target_name'] = df['target'].apply(lambda x: iris.target_names[x])

# Display the first 5 rows
print(df.head())


- Train a Decision Tree Classifier using the Gini criterion

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Decision Tree classifier using Gini index
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))


 - Print the model’s accuracy and feature importances

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Decision Tree classifier
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
importances = clf.feature_importances_
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(importance_df.to_string(index=False))


7. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
- Load the Iris Dataset

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()

# Create a pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add the target column (numerical labels)
df['target'] = iris.target

# Add the target names (class labels)
df['target_name'] = df['target'].apply(lambda i: iris.target_names[i])

# Display the first 5 rows
print(df.head())


 - Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree with max_depth=3
clf_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Train fully-grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Output accuracies
print(f"Accuracy with max_depth=3: {accuracy_limited:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")


8.  Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

- Load the Boston Housing Dataset

In [None]:
import pandas as pd
import statsmodels.api as sm

# load_boston was removed in scikit-learn 1.2, so load the same data
# from the R "MASS" package via statsmodels (needs internet access)
boston = sm.datasets.get_rdataset("Boston", "MASS").data

# The result is already a DataFrame; column names are lowercase
# medv = median value of owner-occupied homes (the target)
df = boston

# Display first few rows
print(df.head())


- Train a Decision Tree Regressor

In [None]:
import pandas as pd
import statsmodels.api as sm
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load Boston Housing dataset from statsmodels
boston = sm.datasets.get_rdataset("Boston", "MASS").data

# Features and target
X = boston.drop("medv", axis=1)
y = boston["medv"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Decision Tree Regressor
regressor = DecisionTreeRegressor(criterion='squared_error', random_state=42)
regressor.fit(X_train, y_train)

# Predict on test data
y_pred = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")


- Print the Mean Squared Error (MSE) and feature importances

In [None]:
import pandas as pd
import statsmodels.api as sm
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset
boston = sm.datasets.get_rdataset("Boston", "MASS").data

# Features and target
X = boston.drop("medv", axis=1)
y = boston["medv"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Decision Tree Regressor
regressor = DecisionTreeRegressor(criterion='squared_error', random_state=42)
regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Print Feature Importances
importances = regressor.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(feature_importance_df.to_string(index=False))


9.  Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

- Load the Iris Dataset

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()

# Create a DataFrame with feature data
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add the target labels (numerical)
df['target'] = iris.target

# Map numerical target labels to class names (e.g., setosa, versicolor, virginica)
df['target_name'] = df['target'].apply(lambda x: iris.target_names[x])

# Display the first 5 rows
print(df.head())


- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 4, 6, 8, 10]
}

# Initialize the Decision Tree Classifier
dt = DecisionTreeClassifier(criterion='gini', random_state=42)

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model on training data
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters Found:", grid_search.best_params_)

# Best model evaluation
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on Test Set: {accuracy:.2f}")


- Print the best parameters and the resulting model accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8]
}

# Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit model
grid_search.fit(X_train, y_train)

# Get best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict on test set
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters Found:", best_params)
print(f"Model Accuracy on Test Set: {accuracy:.2f}")


10.  Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

- Step-by-Step Workflow
 - Handle Missing Values
Goal: Clean the dataset to ensure model compatibility and integrity.

Numerical features:

Use mean or median imputation:

In [None]:
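from sklearn.impute import SimpleImputer
# X is the feature DataFrame and numeric_columns the list of its numeric
# column names (both assumed to be defined earlier)
num_imputer = SimpleImputer(strategy='median')
X_num = num_imputer.fit_transform(X[numeric_columns])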


- Encode Categorical Features

 Goal: Convert categorical features into numerical form.

Low-cardinality categorical features (e.g., gender, smoker):

Use One-Hot Encoding:

In [None]:
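from sklearn.preprocessing import OneHotEncoder
# X_cat holds the categorical columns (assumed to be defined earlier)
# sparse_output replaced the sparse argument in scikit-learn 1.2;
# on older versions use sparse=False instead
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_cat_encoded = encoder.fit_transform(X_cat)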


- Train a Decision Tree Model

Goal: Fit a model to the cleaned, encoded data.

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Combine numerical and encoded categorical features
X_combined = np.hstack((X_num, X_cat_encoded))

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

# Initialize and train model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)


- Tune Hyperparameters

Goal: Optimize model for better generalization using GridSearchCV.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_


- Evaluate Model Performance

Goal: Assess how well the model predicts disease status.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = best_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# ROC AUC uses the predicted probability of the positive class (binary target assumed)
y_prob = best_model.predict_proba(X_test)[:, 1]
print("ROC AUC Score:", roc_auc_score(y_test, y_prob))
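
One caveat with running the steps separately is that fitting the imputer and encoder on the full dataset before the split lets test-set statistics leak into training. A possible way around this is to wrap all steps in a scikit-learn `Pipeline`, so each cross-validation fold fits its own preprocessing. The sketch below uses synthetic data, and the column names (`age`, `glucose`, `smoker`) and the rule that generates the label are hypothetical stand-ins, not real patient data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (hypothetical columns and label rule)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "glucose": rng.normal(100, 20, n),
    "smoker": rng.choice(["yes", "no"], n).astype(object),
})
y = (df["glucose"] + 10 * (df["smoker"] == "yes") > 105).astype(int)

# Inject some missing values into one numeric and one categorical column
df.loc[rng.choice(n, 20, replace=False), "glucose"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

# Median imputation for numeric columns; mode imputation + one-hot for categorical
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "glucose"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"tree__max_depth": [3, 5, None], "tree__min_samples_leaf": [1, 5]}
search = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test ROC AUC:", round(search.score(X_test, y_test), 3))
```

Scoring by ROC AUC rather than accuracy is a design choice here: in disease prediction the positive class is often rare, and accuracy can look high for a model that misses most positive cases.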
