*Assignment Code: DA-AG-012*
*Decision Tree | Assignment*

Question 1: What is a Decision Tree, and how does it work in the context of classification?
- A decision tree is a supervised learning model that recursively partitions the data, using a tree-like structure of decisions and their possible consequences. In classification, each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf assigns a class label.
- A decision tree employs a divide-and-conquer strategy: a greedy search identifies the best split point at each node. Splitting is then repeated in a top-down, recursive manner until all, or the majority of, records have been classified under specific class labels.
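
As a minimal sketch of the recursive partitioning described above, the snippet below fits a shallow tree on the Iris dataset and prints its rules with scikit-learn's `export_text`. The choice of `max_depth=2` is only an assumption to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an illustrative choice)
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Each indented line is one feature test; leaf lines show the predicted class
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows the same path a sample takes from the root to a leaf.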

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
- Entropy measures the level of disorder in a set, while Gini impurity is the probability of misclassifying a randomly chosen instance if it were labeled according to the node's class distribution. Both are used in decision trees to choose node splits and in practice usually produce similar trees, though Gini is often described as favoring larger partitions.
- In decision trees, the best split at each node is found by evaluating how well each candidate division separates the data, using a criterion such as Gini impurity or entropy. Lower impurity in the child nodes means more homogeneous groups, which makes a split more attractive. For a candidate split, the impurity of each child node is computed and then averaged, with each child weighted by the fraction of the parent's samples it receives; this is the weighted entropy (or weighted Gini) of the split. The larger the decrease from the parent's entropy to this weighted value, the more information is gained.
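
The two measures can be sketched in a few lines of NumPy. The helper names (`gini`, `entropy`, `weighted_impurity`) and the toy labels are hypothetical and only illustrate the formulas; this is not scikit-learn's internal implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(children, measure):
    """Impurity of a split: each child's impurity weighted by its share of samples."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * measure(c) for c in children)

# Hypothetical toy node with two balanced classes and one candidate split
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print("Parent Gini:", gini(parent))
print("Parent entropy:", entropy(parent))
print("Weighted Gini after split:", weighted_impurity([left, right], gini))
print("Weighted entropy after split:", weighted_impurity([left, right], entropy))
```

For this toy split, Gini falls from 0.5 to 0.375 and entropy from 1.0 to about 0.811, so both criteria register the split as an improvement.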
Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
- Pre-pruning (early stopping) halts tree growth during training, before the tree has finished classifying the training set, using limits such as a maximum depth or a minimum number of samples per split. Post-pruning grows the full tree first and then trims it by removing parts that do not improve accuracy. Both techniques aim to prevent overfitting, but they take different approaches to simplifying a decision tree.
- Practical advantage of pre-pruning: training is faster and cheaper because the tree never grows too large, which is why it is generally used for larger datasets. Practical advantage of post-pruning: every split is evaluated before anything is removed, so useful deeper splits are not cut off prematurely; it is generally used for smaller datasets, where growing the full tree is affordable.
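
The contrast can be sketched with scikit-learn, assuming `max_depth`/`min_samples_leaf` as the pre-pruning limits and cost-complexity pruning (`ccp_alpha`) as the post-pruning method. The specific values, including taking the middle alpha from the pruning path, are illustrative assumptions; in practice alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Pre-pruning: growth is limited while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it with cost-complexity pruning
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice: the middle alpha on the path (clamped at 0 for safety)
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```

Comparing the leaf counts shows how much each approach simplifies the tree relative to the fully grown one.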
Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
- Information gain is the reduction in entropy achieved by splitting a node on a given feature: the parent node's entropy minus the weighted entropy of its child nodes. It is the basic criterion for deciding whether a feature should be used to split a node.
- At each node, the feature with the highest information gain is used for the split, because it separates the data most cleanly. The same selection principle applies with other criteria, such as Gini impurity for classification or the sum of squared errors for regression.
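
A worked example may help here. The sketch below computes information gain for several candidate thresholds on a single numeric feature and picks the best one. The function names and the toy data are hypothetical, chosen so that one threshold separates the classes perfectly.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, threshold):
    """Parent entropy minus the size-weighted entropy of the two children."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    mask = feature <= threshold
    left, right = labels[mask], labels[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(labels)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - children

# Hypothetical toy data: the classes separate cleanly between 3.0 and 4.0
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]

gains = {t: information_gain(x, y, t) for t in [1.5, 2.5, 3.5, 4.5, 5.5]}
best = max(gains, key=gains.get)

for t, g in gains.items():
    print(f"threshold {t}: gain = {g:.3f}")
print("Best threshold:", best)
```

The threshold 3.5 yields two pure children, so its gain equals the full parent entropy of 1 bit and it is selected as the split.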
Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
- Decision trees are most commonly used in data mining and data classification, with applications across many fields. In medical research and practice they are used for diagnostic testing, and they are also applied in genomics. In business, decision trees are used for strategy formulation, decision making, and visualizing cost-effectiveness.
- Limitations: overfitting; lack of robustness (high sensitivity to small variations in the data); poor handling of imbalanced data (splits tend to favor the majority class, leading to biased predictions and weak performance on minority classes); difficulty modeling certain feature interactions; and scalability concerns.
- Advantages: they are easy to interpret, require little or no data preparation, and are flexible enough for both classification and regression tasks.




In [None]:
"""Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV)."""


In [3]:
'''Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances'''


import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

# Load Iris dataset into a pandas DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Split the data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train Decision Tree using Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
print(f"Model Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Feature importances as a DataFrame
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(importance_df)



Model Accuracy: 1.00

Feature Importances:
             Feature  Importance
2  petal length (cm)    0.906143
3   petal width (cm)    0.077186
1   sepal width (cm)    0.016670
0  sepal length (cm)    0.000000


In [7]:
'''Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=25)
tree_limited.fit(X_train, y_train)

# Train fully grown Decision Tree
tree_full = DecisionTreeClassifier(criterion='gini', random_state=42)
tree_full.fit(X_train, y_train)

# Make predictions
y_pred_limited = tree_limited.predict(X_test)
y_pred_full = tree_full.predict(X_test)

# Calculate accuracy
accuracy_limited = accuracy_score(y_test, y_pred_limited)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print results
print(f"Decision Tree (max_depth=3) Accuracy: {accuracy_limited:.2f}")
print(f"Decision Tree (full tree) Accuracy: {accuracy_full:.2f}")



Decision Tree (max_depth=3) Accuracy: 1.00
Decision Tree (full tree) Accuracy: 1.00


In [10]:

'''Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances'''

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

# Load the Boston Housing dataset
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Display feature importances
importance_df = pd.DataFrame({
    "Feature": X.columns,
    "Importance": regressor.feature_importances_
}).sort_values(by="Importance", ascending=False)

print("\nFeature Importances:")
print(importance_df)


Mean Squared Error: 10.42

Feature Importances:
    Feature  Importance
5        RM    0.600326
12    LSTAT    0.193328
7       DIS    0.070688
0      CRIM    0.051296
4       NOX    0.027148
6       AGE    0.013617
9       TAX    0.012464
10  PTRATIO    0.011012
11        B    0.009009
2     INDUS    0.005816
1        ZN    0.003353
8       RAD    0.001941
3      CHAS    0.000002


In [12]:
'''Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Define Decision Tree model
dtree = DecisionTreeClassifier(random_state=42)

# Define parameter grid for tuning
param_grid = {'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]}

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=dtree,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1)

grid_search.fit(X_train, y_train)

# Get best model and parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Predict on test set
y_pred = best_model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print(f"Best Parameters: {best_params}")
print(f"Model Accuracy: {accuracy:.2f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.00


In [None]:
'''Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.'''

# 1. Identify Missing Values

'''Missing data is common in healthcare (e.g., lab tests not taken, incomplete forms).

Start by counting the missing values in each column:'''

# df is assumed here to be the patient DataFrame
df.isnull().sum()

# 2. Handle Missing Values

'''For numeric features: fill missing values with the median (more robust than the mean for skewed data).

For categorical features: fill missing values with the most common value, or label them as "Unknown".'''

from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

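# 3. Encode Categorical Features

'''scikit-learn's Decision Trees need numeric input, so categorical columns must be encoded first.

Nominal categories (no natural order): one-hot encoding, e.g. OneHotEncoder.

Ordinal categories (natural order): integer codes that preserve the order, e.g. OrdinalEncoder.'''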

# 4. Train a Decision Tree Model

'''Split data into training and testing sets.

Fit the model on processed data.'''

# 5. Tune Hyperparameters with GridSearchCV

'''Key hyperparameters:

max_depth: Controls tree depth

min_samples_split: Minimum samples required to split a node

min_samples_leaf: Minimum samples at leaf nodes

criterion: gini or entropy'''

# 6. Evaluate Performance

'''Compare performance on the training and test sets; a large gap between the two signals overfitting.

Metrics:

Accuracy – Overall correctness

Precision & Recall – Important in healthcare where false negatives/positives matter

F1-score – Balance between precision and recall

ROC-AUC – Probability that the model ranks a randomly chosen positive case above a randomly chosen negative one'''

# 7. Business Value

'''A well-trained Decision Tree model can:

Support doctors: Provide quick, data-driven insights to flag high-risk patients.

Improve early detection: Catch diseases early, improving patient outcomes.

Optimize resources: Focus diagnostic tests or treatments on high-probability cases.

Reduce costs: Prevent unnecessary tests and hospitalizations.

Compliance & Explainability: Decision Trees are interpretable, helping meet regulatory requirements in healthcare.'''
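
The steps outlined above can be tied together in a single scikit-learn pipeline. The following is a minimal sketch under stated assumptions: the patient data is synthetic, the column names (`age`, `blood_pressure`, `smoker`, `sex`, `disease`) are hypothetical, and recall is used as the tuning metric on the assumption that missed cases are the costlier error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data (not a real dataset)
rng = np.random.default_rng(42)
n = 400
data = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
# The label depends on age and smoking, plus noise
risk = (0.04 * (data["age"] - 55)
        + 1.0 * (data["smoker"] == "yes")
        + rng.normal(0, 0.5, n))
data["disease"] = (risk > 0.5).astype(int)
# Knock out roughly 10% of two columns to simulate missing values
data.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
data.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "sex"]

# Steps 2 and 3: impute, then one-hot encode the categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: split the data (stratified to keep class proportions)
X, y = data.drop(columns="disease"), data["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Step 5: tune hyperparameters with 5-fold cross-validation
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 6: evaluate on the held-out test set
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.2f}")
```

Keeping the imputers and the encoder inside the pipeline means they are refit on each cross-validation fold, which avoids leaking information from the validation folds into preprocessing.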



