**Decision Tree - Assignment**

**Question 1: What is a Decision Tree, and how does it work in the context of classification?**


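A Decision Tree is a supervised machine learning algorithm that makes predictions with a flowchart-like tree: each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf holds a prediction.

In classification, the tree is built by recursive partitioning (a "divide and conquer" approach): the dataset is split into increasingly homogeneous subsets, and each final subset (leaf) is assigned a class label, typically the majority class of the training samples that reach it. A new sample is classified by following the tests from the root down to a leaf.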
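
As a small illustration of the flowchart structure, the sketch below fits a shallow tree on the Iris dataset and prints its rules as text. The depth limit of 2 and the variable names are choices made here for readability, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each indented line is a threshold test on one feature;
# lines starting with "class:" are the leaves
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom mirrors how a sample is classified: start at the root test and follow the matching branch until a leaf is reached.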

**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

Gini Impurity and Entropy are impurity measures used to determine the best split in a decision tree node.

Gini Impurity is the probability that a randomly chosen sample in a node would be misclassified if it were labeled at random according to the node's class distribution, while Entropy measures the disorder (uncertainty) of that distribution. Both are zero for a pure node and largest when the classes are evenly mixed. At each node the tree evaluates candidate splits and selects the one that gives the greatest reduction in weighted impurity, leading to purer (less mixed) child nodes.
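
For class proportions p_i in a node, the standard definitions are Gini = 1 - Σ p_i² and Entropy = -Σ p_i log2(p_i). The sketch below computes both in plain Python; the helper names are made up for this example.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

mixed = ["a", "a", "b", "b"]   # evenly mixed two-class node
pure = ["a", "a", "a", "a"]    # node containing a single class

print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(gini(pure), entropy(pure))    # 0.0 0.0
```

The two measures rank most candidate splits the same way, so the choice between them rarely changes the resulting tree much; Gini avoids the logarithm and is slightly cheaper to compute.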

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

Pre-pruning stops a decision tree from growing by applying constraints during construction, while post-pruning grows a full tree and then removes branches afterward.

A practical advantage of pre-pruning is its efficiency in computational resources, as it avoids the cost of building a large tree that is later pruned.

A practical advantage of post-pruning is that it often yields a more accurate tree: a split that looks unhelpful early on is still grown and can be kept if the splits beneath it turn out to be useful, whereas pre-pruning would have stopped before finding them.
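
In scikit-learn, pre-pruning corresponds to growth constraints such as `max_depth` and `min_samples_leaf`, and post-pruning is available as minimal cost-complexity pruning through `ccp_alpha`. The sketch below contrasts the two on Iris. The specific constraint values and the mid-range alpha are arbitrary picks for illustration; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constraints are enforced while the tree is being grown
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then compute candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take an alpha from the middle of the path
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={model.tree_.node_count}, test accuracy={model.score(X_test, y_test):.4f}")
```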

**Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

Information Gain (IG) is a measure used in decision trees to select the best feature for splitting data by quantifying the reduction in entropy (or impurity) after the split.

It is important because, at each node, the split with the highest information gain is chosen: it is the most effective at separating the data into more homogeneous subsets. Choosing splits this way tends to produce a smaller and more accurate tree, although the procedure is greedy and does not guarantee a globally optimal tree.
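
Information gain can be written as IG = H(parent) - Σ (n_k / n) · H(child_k), where H is entropy and n_k is the number of samples in child k. The sketch below works through two toy splits; the function names and the label lists are invented for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split whose children are as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree builder would compute this quantity for every candidate split at a node and keep the one with the largest value.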

**Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

Common applications of decision trees include medical diagnosis, fraud detection, customer segmentation, and loan approval.

Key advantages are their ease of interpretation, minimal data preparation requirements, and ability to handle both numerical and categorical data.

However, they are prone to overfitting, are sensitive to small changes in the training data, and can become computationally expensive and complex for large datasets.
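
The overfitting tendency can be seen with a short experiment. The sketch below uses synthetic data with 20% of the labels flipped at random (an assumption made purely for this demonstration) and compares an unrestricted tree with a depth-limited one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y=0.2 randomizes about 20% of the labels
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The unrestricted tree keeps splitting until every training sample is fit
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# The depth-limited tree cannot memorize individual noisy samples
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(
        f"{name}: train accuracy={model.score(X_train, y_train):.3f}, "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```

The unrestricted tree reaches perfect training accuracy by memorizing the noise, so the gap between its training and test scores is the symptom to look for.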

Dataset Info:

- Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
- Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).

**Question 6: Write a Python program to:**

- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [2]:
# Train a Decision Tree Classifier using the Gini criterion
dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

In [3]:
# Print the model’s accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Print the feature importances
print("\nFeature Importances:")
for feature, importance in zip(feature_names, dt_classifier.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Model Accuracy: 1.0000

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876
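
A perfect score on a single 45-sample test split is a fairly noisy estimate. As a complementary check, the sketch below runs 5-fold cross-validation on the full Iris dataset with the same Gini-based classifier; the number of folds is an arbitrary choice made here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)

# Each fold is held out once while the model is trained on the other four
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores.round(4))
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```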


**Question 7: Write a Python program to:**
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train a Decision Tree Classifier with max_depth=3
dt_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_pruned.fit(X_train, y_train)

# Make predictions and calculate accuracy for the pruned tree
y_pred_pruned = dt_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)
print(f"Accuracy of Decision Tree with max_depth=3: {accuracy_pruned:.4f}")

# 3. Train a fully-grown Decision Tree Classifier (no max_depth limit)
dt_full = DecisionTreeClassifier(random_state=42) # No max_depth specified means fully grown
dt_full.fit(X_train, y_train)

# Make predictions and calculate accuracy for the fully-grown tree
y_pred_full = dt_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)
print(f"Accuracy of fully-grown Decision Tree: {accuracy_full:.4f}")

# Compare accuracies
if accuracy_pruned > accuracy_full:
    print("\nDecision Tree with max_depth=3 performed better.")
elif accuracy_full > accuracy_pruned:
    print("\nFully-grown Decision Tree performed better.")
else:
    print("\nBoth Decision Trees performed equally well.")

Accuracy of Decision Tree with max_depth=3: 1.0000
Accuracy of fully-grown Decision Tree: 1.0000

Both Decision Trees performed equally well.
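
Equal test accuracy does not mean the two trees are the same model. The sketch below rebuilds both trees with the same split and settings as above and compares their structure using `get_depth()` and `get_n_leaves()`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dt_pruned = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
dt_full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Compare tree size and how closely each tree fits its own training data
for name, model in [("max_depth=3", dt_pruned), ("fully grown", dt_full)]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"train accuracy={model.score(X_train, y_train):.4f}"
    )
```

When accuracy is tied, the shallower tree is usually the better choice because it is easier to interpret and less likely to have fit noise.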


**Question 8: Write a Python program to:**
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

In [7]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Boston Housing Dataset from a CSV named 'boston_housing.csv'.
# (sklearn.datasets.load_boston() was removed in scikit-learn 1.2, so the CSV route is used.)
try:
    df = pd.read_csv('boston_housing.csv')
except FileNotFoundError:
    print("boston_housing.csv not found. Creating a dummy dataset for demonstration.")
    # Create a dummy dataset resembling the Boston Housing structure for demonstration
    data = {
        'CRIM': [0.00632, 0.02731, 0.02729, 0.03237, 0.06905],
        'ZN': [18.0, 0.0, 0.0, 0.0, 0.0],
        'INDUS': [2.31, 7.07, 7.07, 2.18, 2.18],
        'CHAS': [0, 0, 0, 0, 0],
        'NOX': [0.538, 0.469, 0.469, 0.458, 0.458],
        'RM': [6.575, 6.421, 7.185, 6.998, 7.147],
        'AGE': [65.2, 78.9, 61.1, 45.8, 54.2],
        'DIS': [4.0900, 4.9671, 4.9671, 6.0622, 6.0622],
        'RAD': [1, 2, 2, 3, 3],
        'TAX': [296, 242, 242, 222, 222],
        'PTRATIO': [15.3, 17.8, 17.8, 18.7, 18.7],
        'B': [396.90, 396.90, 392.83, 394.63, 396.90],
        'LSTAT': [4.98, 9.14, 4.03, 2.94, 5.33],
        'MEDV': [24.0, 21.6, 34.7, 33.4, 36.2]
    }
    df = pd.DataFrame(data)

# Separate features (X) and target (y)
X = df.drop('MEDV', axis=1)  # 'MEDV' is the target variable (median home value)
y = df['MEDV']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Print feature importances
feature_importances = pd.Series(dt_regressor.feature_importances_, index=X.columns).sort_values(ascending=False)
print("\nFeature Importances:")
print(feature_importances)

boston_housing.csv not found. Creating a dummy dataset for demonstration.
Mean Squared Error (MSE): 5.76

Feature Importances:
AGE        0.956787
LSTAT      0.033914
DIS        0.009299
ZN         0.000000
CRIM       0.000000
NOX        0.000000
CHAS       0.000000
INDUS      0.000000
RM         0.000000
RAD        0.000000
TAX        0.000000
PTRATIO    0.000000
B          0.000000
dtype: float64

Note: these figures come from the five-row dummy fallback, not the real Boston Housing data. With test_size=0.2, the tree is trained on four rows and the MSE is computed from a single test row, so neither the MSE nor the feature importances say anything about the actual dataset. Rerun the cell with boston_housing.csv available to obtain meaningful results.


**Question 9: Write a Python program to:**
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

In [8]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Tune the Decision Tree max_depth and min_samples_split using GridSearchCV
# Define the parameter grid to search
param_grid = {
    'max_depth': [None, 3, 5, 7, 10],
    'min_samples_split': [2, 5, 10, 15]
}

# Initialize the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Perform the grid search on the training data
grid_search.fit(X_train, y_train)

# 3. Print the best parameters and the resulting model accuracy
print(f"Best parameters found: {grid_search.best_params_}")

# Get the best estimator (model)
best_dt_model = grid_search.best_estimator_

# Make predictions on the test set with the best model
y_pred = best_dt_model.predict(X_test)

# Calculate the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the best model: {accuracy:.4f}")

Best parameters found: {'max_depth': None, 'min_samples_split': 10}
Accuracy of the best model: 1.0000


**Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.**
Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance
And describe what business value this model could provide in the real-world setting.

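- Handle the missing values: First check how much data is missing in each column. Impute missing numerical values with the median (or the mean if the distribution is roughly symmetric), and missing categorical values with the mode or a dedicated "missing" category. Fit the imputer on the training split only so that no information leaks from the test data.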

- Encode the categorical features: Use one-hot encoding for nominal categories and ordinal encoding (one integer per level, in rank order; often loosely called label encoding) for ordinal categories to convert them into a numerical format the model can process.

- Train a Decision Tree model: Split the dataset into training and testing sets, stratifying on the disease label if positive cases are rare, and fit the Decision Tree model to the training data.

- Tune its hyperparameters: Use techniques like Grid Search or Randomized Search with cross-validation to find the optimal values for parameters such as max_depth and min_samples_leaf.

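- Evaluate its performance: Assess the model on the held-out testing set using precision, recall, F1-score, and the confusion matrix in addition to accuracy. In disease prediction, recall usually deserves the most attention because a missed case (false negative) tends to cost more than a false alarm, and accuracy alone can be misleading when the disease is rare.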

- Business value: The model can assist clinicians with early, data-driven disease prediction, leading to earlier intervention, improved patient outcomes, and potentially reduced healthcare costs.
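
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is hypothetical: the column names, the synthetic patient records, and the way the `disease` label is generated are invented for illustration, and the parameter grid and the choice of recall as the tuning metric are assumptions rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient data: names and values are made up for this sketch
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Synthetic target loosely tied to age and smoking status, plus random noise
risk = (df["age"] > 60).astype(int) + (df["smoker"] == "yes").astype(int)
df["disease"] = (risk + rng.integers(0, 2, n) >= 2).astype(int)

# Remove some values to mimic missing data
df.loc[rng.choice(n, 30, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "region"]

# Imputation and encoding live inside the pipeline, so they are fit on training folds only
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Tune tree hyperparameters with cross-validation, scoring on recall
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping the preprocessing inside the pipeline means GridSearchCV refits the imputers and encoder on each training fold, which avoids leaking information from validation or test rows.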