QUESTION 1:
What is a Decision Tree, and how does it work in the context of classification?

ANSWER:
A Decision Tree is a supervised machine learning algorithm used for classification and regression.
In classification, it works by recursively splitting the dataset based on feature values that best separate the target classes.
Each internal node represents a feature test, each branch represents an outcome of the test, and each leaf node represents a class label.
The tree makes decisions by following a path from the root to a leaf based on feature conditions.
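
As a minimal sketch of the root-to-leaf traversal described above, the tree below is written by hand rather than learned from data. The feature names, thresholds, and the `predict` helper are hypothetical and chosen only for illustration.

```python
# A hand-built tree (hypothetical thresholds, not learned from data).
# Internal nodes test one feature; leaves carry a class label.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Follow feature tests from the root until a leaf is reached."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```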


QUESTION 2:
Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

ANSWER:
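Gini Impurity measures how often a randomly chosen element of a node would be misclassified if it were labeled at random according to the class distribution in that node.
Formula: Gini = 1 - Σ(p_i)^2, where p_i is the proportion of samples in the node that belong to class i.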

Entropy measures the level of disorder or uncertainty in the dataset.
Formula: Entropy = -Σ(p_i * log2(p_i))

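Lower values indicate purer nodes; both measures equal 0 when every sample in a node belongs to the same class.
At each node, the tree evaluates candidate splits and chooses the one that minimizes the weighted average impurity (Gini or Entropy) of the resulting child nodes, which makes the children more homogeneous.
The larger the impurity reduction relative to the parent node, the better the split. In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.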
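
The two formulas above can be checked on a few class distributions. This is a small sketch, not library code: the helper names `gini` and `entropy` and the example distributions are assumptions made for illustration.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Entropy in bits: -sum(p_i * log2(p_i)); zero-probability classes are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# From a pure node to a perfectly mixed two-class node.
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, "gini:", round(gini(probs), 3), "entropy:", round(entropy(probs), 3))
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1 bit for Entropy) at the 50/50 mix.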


QUESTION 3:
What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

ANSWER:
Pre-Pruning (early stopping) halts tree growth during training by setting constraints such as max_depth or min_samples_split.
Advantage: Reduces overfitting and training time, because the full tree is never built.

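Post-Pruning allows the tree to grow fully and then removes branches that contribute little to predictive performance, for example through cost-complexity pruning (the ccp_alpha parameter in scikit-learn).
Advantage: Pruning decisions are made with the full tree in view, so useful deep splits are not cut off prematurely; the result is typically a simpler model that generalizes better.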
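
The sketch below contrasts the two approaches on the Iris data with scikit-learn. The pre-pruning limits and the choice of the middle `ccp_alpha` from the pruning path are illustrative assumptions, not tuned values; in practice alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-pruning: growth is capped while the tree is being built.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative pick, not tuned
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```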


QUESTION 4:
What is Information Gain in Decision Trees, and why is it important for choosing the best split?

ANSWER:
Information Gain measures the reduction in entropy after a dataset is split on a feature.
Formula: Information Gain = Entropy(parent) - Σ (n_k / n) * Entropy(child_k), where n_k is the number of samples in child k and n is the number of samples in the parent.

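It is important because it gives the tree a quantitative criterion for choosing, at each node, the feature and threshold that best separate the data into pure classes.
A higher Information Gain means the child nodes are purer than the parent, so the candidate split with the highest gain is selected.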
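
As a worked example of the formula above, the sketch below compares two candidate splits of the same parent node. The labels and both splits are made up for illustration, and `entropy` and `information_gain` are hypothetical helper names.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 5, ["no"] * 5]                              # separates the classes perfectly
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]   # barely separates them

print("Gain of split A:", information_gain(parent, split_a))
print("Gain of split B:", round(information_gain(parent, split_b), 3))
```

Split A removes all uncertainty (a gain of 1 bit), while split B leaves the children almost as mixed as the parent, so a tree using this criterion would prefer split A.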


QUESTION 5:
What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

ANSWER:
Applications:
- Medical diagnosis
- Credit risk assessment
- Fraud detection
- Customer churn prediction

Advantages:
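- Easy to understand, visualize, and interpret
- Handles both numerical and categorical data (scikit-learn's implementation requires categorical features to be encoded numerically first)
- Requires minimal data preprocessing (no feature scaling or normalization)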

Limitations:
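- Prone to overfitting, especially when grown to full depth
- Sensitive to small changes in the training data (high variance)
- Often less accurate than ensemble methods such as Random Forests or Gradient Boosting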


QUESTION 6:
Python program to load Iris dataset, train Decision Tree (Gini), print accuracy and feature importances


In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

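# Load the Iris data: 150 samples, 4 features, 3 classes
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# No random_state is set on the tree, so ties between equally good splits
# may be broken differently from run to run
model = DecisionTreeClassifier(criterion="gini")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
# Importances follow the order of iris.feature_names and sum to 1
print("Feature Importances:", model.feature_importances_)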


Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]
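
The importances above are printed as a bare array in the order of `iris.feature_names` (sepal length, sepal width, petal length, petal width). The sketch below pairs each value with its feature name and sorts them. It is an illustration only: it fits on the full dataset for brevity and fixes `random_state=0`, both of which are assumptions that differ from the cell above, so the numbers will not match exactly.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(criterion="gini", random_state=0)
model.fit(iris.data, iris.target)

# Pair each importance with its feature name, most important first.
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```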


QUESTION 7:
Train Decision Tree with max_depth=3 and compare with fully grown tree


In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully grown tree (no depth limit) versus a pre-pruned tree capped at depth 3
full_tree = DecisionTreeClassifier()
limited_tree = DecisionTreeClassifier(max_depth=3)

full_tree.fit(X_train, y_train)
limited_tree.fit(X_train, y_train)

print("Full Tree Accuracy:", accuracy_score(y_test, full_tree.predict(X_test)))
print("Max Depth=3 Accuracy:", accuracy_score(y_test, limited_tree.predict(X_test)))


Full Tree Accuracy: 1.0
Max Depth=3 Accuracy: 1.0
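
Both trees reach the same test accuracy in the run above, so accuracy alone does not show what `max_depth=3` changed. A sketch that also reports tree size and training accuracy makes the difference visible; `random_state=0` and the `trees` dictionary are assumptions added here for reproducibility and convenience.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

trees = {
    "full": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
}

for name, tree in trees.items():
    tree.fit(X_train, y_train)
    print(
        name,
        "| depth:", tree.get_depth(),
        "| leaves:", tree.get_n_leaves(),
        "| train accuracy:", round(tree.score(X_train, y_train), 3),
        "| test accuracy:", round(tree.score(X_test, y_test), 3),
    )
```

On a small, well-separated dataset like Iris the shallower tree loses little or nothing on the test set while being easier to read; on noisier data the gap between training and test accuracy of the full tree is usually where overfitting shows up.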


QUESTION 8:
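Train Decision Tree Regressor on Boston Housing dataset and print MSE and feature importances
(Note: load_boston was removed in scikit-learn 1.2, so the code below uses the California Housing dataset instead.)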


In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)

Mean Squared Error: 0.48981034170457843
Feature Importances: [0.52817895 0.05195682 0.05410209 0.02902475 0.02977124 0.13098161
 0.09300318 0.08298136]
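
MSE is expressed in squared target units, which makes it hard to read directly. The California Housing target is the median house value in units of $100,000, so taking the square root gives an error on the same scale as the target. The sketch below simply converts the MSE printed above; it does not retrain the model.

```python
import math

mse = 0.48981034170457843  # MSE printed by the cell above
rmse = math.sqrt(mse)

# The target is in units of $100,000, so scale the RMSE to dollars.
print(f"RMSE: {rmse:.3f} (target units)")
print(f"RMSE: about ${rmse * 100_000:,.0f}")
```

An RMSE of roughly 0.70 corresponds to a typical prediction error of about $70,000 for an untuned, fully grown regression tree.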


QUESTION 9:
Hyperparameter tuning using GridSearchCV


In [5]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_split": [2, 5, 10]
}

grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_

print("Best Parameters:", grid.best_params_)
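# Accuracy of the refit best estimator on the held-out test set
# (the cross-validated score is available as grid.best_score_)
print("Best Accuracy:", accuracy_score(y_test, best_model.predict(X_test)))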


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Best Accuracy: 1.0
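
The "Best Accuracy" printed above is measured on the test set, which is a different quantity from the cross-validated score that GridSearchCV used to pick the parameters. The sketch below reports both side by side. It repeats the same search with `random_state=0` on the tree, an assumption added for reproducibility, so the selected parameters may differ from the run above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
# Mean accuracy over the 5 validation folds for the best parameter combination.
print("Mean CV accuracy:", round(grid.best_score_, 3))
# Accuracy of the refit best estimator on data never seen during the search.
print("Held-out test accuracy:", round(grid.score(X_test, y_test), 3))
```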


QUESTION 10:
Healthcare Decision Tree workflow and business value

ANSWER:
Step 1: Handle missing values using mean/median for numerical data and mode for categorical data.
Step 2: Encode categorical features using Label Encoding or One-Hot Encoding.
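Step 3: Split the dataset into training and testing sets. Fit the imputation and encoding from Steps 1 and 2 on the training set only, so that no test-set statistics leak into training.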
Step 4: Train a Decision Tree classifier on the processed data.
Step 5: Tune hyperparameters like max_depth and min_samples_split using GridSearchCV.
Step 6: Evaluate performance using accuracy, precision, recall, F1-score, and the confusion matrix. Recall deserves particular attention here, since a false negative means a missed diagnosis.
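
The steps above can be sketched end to end with a scikit-learn Pipeline. Everything about the data here is an assumption: the patient records are synthetic, and the column names (`age`, `blood_pressure`, `smoker`) and the labeling rule are hypothetical. The sketch wraps imputation and encoding in the pipeline so they are fitted on training folds only, and it uses F1 as the search metric, which is one reasonable choice rather than the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and labels).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["smoker"] = df["smoker"].astype(object)
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Inject missing values so that Step 1 has something to do.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Steps 1 and 2: median imputation for numeric columns,
# mode imputation plus one-hot encoding for categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 4: the classifier is the last stage of the pipeline.
pipeline = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step 3: stratified split keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

# Step 5: tune the tree's hyperparameters through the pipeline.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 6: evaluate on the held-out test set.
y_pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```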

Business Value:
The model helps in early disease detection, reduces diagnostic costs, improves treatment planning, and supports doctors in decision-making.
