# Decision Tree Assignment


## Question 1: What is a Decision Tree, and how does it work in the context of classification?

**Answer:**
A Decision Tree is a supervised machine learning algorithm used for classification and regression.
In classification, it splits the dataset into smaller subsets based on feature values, forming a tree-like structure.
Each internal node represents a decision based on a feature, each branch represents an outcome, and each leaf node represents a class label.
At each node, the tree selects the split that most reduces an impurity measure such as Gini Impurity or Entropy, and it continues splitting until a stopping condition is met, for example when a node is pure or a maximum depth is reached.
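To make the node, branch, and leaf roles concrete, here is a minimal sketch of a hand-built tree stored as nested dictionaries. The feature names, thresholds, and labels are illustrative assumptions, not values learned from data.

```python
# A hand-built tree: each internal node tests one feature against a threshold,
# each leaf carries a class label. All values here are illustrative only.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following one branch per decision."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

A trained classifier builds the same kind of structure automatically by choosing the feature and threshold at each node.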


## Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

**Answer:**
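Gini Impurity measures the probability that a randomly chosen element of a node would be misclassified if it were labeled at random according to the node's class distribution.
For class proportions p_k, Gini = 1 - sum(p_k^2). It is 0 for a pure node, and lower values indicate purer nodes.

Entropy measures the level of uncertainty or randomness in the class labels of a node.
For class proportions p_k, Entropy = -sum(p_k * log2(p_k)). It is 0 for a pure node, and lower entropy means higher purity.

Decision Trees evaluate each candidate split by the size-weighted average impurity of its child nodes and choose the split that minimizes it, resulting in more homogeneous child nodes.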
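The two measures can be computed directly from class counts. The sketch below uses a made-up ten-sample node and one hypothetical split of it; the helper names are assumptions for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_impurity(children, measure):
    """Average impurity of the child nodes, weighted by their sizes."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * measure(child) for child in children)


# Toy node with a 5/5 class mix and one candidate split (made-up data).
parent = ["A"] * 5 + ["B"] * 5
split = [["A"] * 4 + ["B"], ["A"] + ["B"] * 4]

print(f"{gini(parent):.3f}")                     # 0.500
print(f"{entropy(parent):.3f}")                  # 1.000
print(f"{weighted_impurity(split, gini):.3f}")   # 0.320
print(f"{weighted_impurity(split, entropy):.3f}")  # 0.722
```

Both measures drop after the split, so either criterion would rate this split as an improvement over the parent node.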


## Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

**Answer:**
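Pre-Pruning stops tree growth early by setting constraints such as maximum depth, minimum samples per split, or minimum samples per leaf.
Advantage: Reduces overfitting and training time, because the pruned branches are never built.

Post-Pruning allows the tree to grow fully and then removes branches that contribute little to predictive performance, for example through cost-complexity pruning.
Advantage: Each branch is judged after its whole subtree is known, so a useful split that follows a weak one is not discarded prematurely, which often improves generalization.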
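As a rough sketch of both approaches in scikit-learn: `max_depth` acts as a pre-pruning constraint, and `ccp_alpha` applies minimal cost-complexity pruning after the tree is grown. The choice of the middle alpha from the pruning path is an arbitrary assumption for illustration; in practice it would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth stops as soon as the depth limit is reached.
pre_pruned = DecisionTreeClassifier(max_depth=2, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with a cost-complexity penalty.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary pick, illustrative only
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, tree in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", tree.get_n_leaves(), "test accuracy:", tree.score(X_test, y_test))
```

Larger `ccp_alpha` values remove more branches, so the pruned tree never has more leaves than the fully grown one.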


## Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Answer:**
Information Gain measures the reduction in entropy after splitting the dataset on a feature: the entropy of the parent node minus the size-weighted average entropy of the child nodes.
It helps identify the feature that provides the most information about the target variable.
The feature with the highest Information Gain is selected for the split because it produces the purest child nodes.
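The calculation can be sketched on a tiny made-up dataset. This version splits a categorical feature into one child per value (ID3 style), which is an assumption for illustration; scikit-learn's trees use binary splits instead.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    children = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - children


# Made-up data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

gains = {f: information_gain(rows, labels, f) for f in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains)  # outlook has gain 1.0, windy has gain 0.0
print(best)   # outlook
```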


## Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Answer:**
Applications include medical diagnosis, fraud detection, credit risk analysis, and customer churn prediction.

Advantages:
- Easy to understand and interpret
- Handles both numerical and categorical data

Limitations:
- Prone to overfitting
- Sensitive to small changes in data


## Question 6: Decision Tree Classifier using Gini Criterion

In [17]:
import pandas as pd
from sklearn.datasets import load_iris, fetch_openml
import os

# Create the directory if it doesn't exist
os.makedirs('/mnt/data', exist_ok=True)

# ----- Iris Dataset -----
iris = load_iris(as_frame=True)
iris_df = iris.frame
iris_csv_path = "/mnt/data/iris_dataset.csv"
iris_df.to_csv(iris_csv_path, index=False)

# ----- Boston Housing Dataset (via OpenML) -----
boston = fetch_openml(name="boston", version=1, as_frame=True)
boston_df = boston.frame
boston_csv_path = "/mnt/data/boston_housing_dataset.csv"
boston_df.to_csv(boston_csv_path, index=False)

iris_csv_path, boston_csv_path

('/mnt/data/iris_dataset.csv', '/mnt/data/boston_housing_dataset.csv')

In [18]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


In [19]:
# Question 7: max_depth=3 vs Fully-Grown Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_acc = accuracy_score(y_test, limited_tree.predict(X_test))

print("Fully-Grown Tree Accuracy:", full_acc)
print("Max Depth=3 Tree Accuracy:", limited_acc)


Fully-Grown Tree Accuracy: 1.0
Max Depth=3 Tree Accuracy: 1.0


In [20]:
# Question 8: Decision Tree Regressor on Boston Housing Dataset

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Use the boston_df already loaded from fetch_openml
X = boston_df.drop('MEDV', axis=1)  # Features are all columns except 'MEDV'
y = boston_df['MEDV']              # Target is the 'MEDV' column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", regressor.feature_importances_)

Mean Squared Error: 10.416078431372549
Feature Importances: [5.12956739e-02 3.35270585e-03 5.81619171e-03 2.27940651e-06
 2.71483790e-02 6.00326256e-01 1.36170630e-02 7.06881622e-02
 1.94062297e-03 1.24638653e-02 1.10116089e-02 9.00872742e-03
 1.93328464e-01]


In [21]:
# Question 9: Hyperparameter Tuning using GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

param_grid = {
    "max_depth": [None, 2, 3, 4, 5],
    "min_samples_split": [2, 5, 10]
}

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error'  # scikit-learn maximizes scores, so MSE is negated
)

grid.fit(X_train, y_train)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

Best Parameters: {'max_depth': 2, 'min_samples_split': 2}
Mean Squared Error: 25.993190895971196
R-squared: 0.6455495710736121


## Question 10: Healthcare Decision Tree Use Case

**Answer:**
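Missing values are imputed: mean or median for numerical features, mode for categorical features.
Categorical features are encoded using one-hot or label encoding.
A Decision Tree model is trained on the cleaned data.
Hyperparameters such as max_depth and min_samples_split are tuned using GridSearchCV.
Performance is evaluated using accuracy, precision, recall, and F1-score; recall is often the priority when a missed diagnosis costs more than a false alarm.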
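The steps above could be wired together roughly as follows. This is a sketch on synthetic data: the column names (`age`, `blood_pressure`, `smoker`, `disease`), the rule that generates the target, and the parameter grid are all hypothetical stand-ins for a real patient dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and target rule).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)
# Blank out some entries to simulate missing values.
df.loc[rng.choice(n, 20, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

numeric, categorical = ["age", "blood_pressure"], ["smoker"]
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric),           # mean imputation
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),    # mode imputation
        ("encode", OneHotEncoder(handle_unknown="ignore")),     # one-hot encoding
    ]), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="disease"), df["disease"],
    test_size=0.2, random_state=42, stratify=df["disease"],
)

# Tune tree hyperparameters with 5-fold cross-validation, scored by F1.
grid = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 5], "tree__min_samples_split": [2, 10]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Keeping imputation and encoding inside the pipeline means they are refit on each cross-validation fold, which avoids leaking information from the validation data into preprocessing.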

Business Value:
The model helps in early disease detection, reduces healthcare costs, improves patient outcomes, and provides transparent decision-making support to doctors.
