# Decision Tree Assignment

## Question 1
A **Decision Tree** is a supervised machine-learning algorithm used for classification and regression. It works by recursively splitting data based on feature values to create decision rules. In classification, each leaf node represents a class label, and internal nodes represent decision conditions.
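
To make the idea of "decision rules" concrete, the sketch below fits a shallow tree on the Iris dataset and prints its learned rules as text. The depth limit and `random_state=0` are choices made here for a short, repeatable printout; they are not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Assumption: depth 2 keeps the printed rule set small enough to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented line is an internal-node condition; "class:" lines are leaves.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output top to bottom traces the path a sample takes from the root to a leaf.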

## Question 2
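**Gini Impurity** measures the probability that a randomly chosen element of a node would be misclassified if it were labeled at random according to the node's class distribution.  
**Entropy** measures the disorder or uncertainty of the class distribution in a node.  
Both are 0 for a pure node. Lower impurity in the child nodes → better split, so Decision Trees choose the split that reduces impurity the most.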
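
As a small illustration, both measures can be computed directly from the class counts of a node. The helper names `gini` and `entropy` are made up for this sketch and are not scikit-learn functions.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over classes."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # 50/50 split of two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For two classes, a 50/50 node is the worst case: Gini reaches 0.5 and entropy reaches 1 bit.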

## Question 3
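**Pre-pruning** stops tree growth early using limits set before training (e.g., max_depth, min_samples_leaf). It prevents overfitting and reduces computation.  
**Post-pruning** lets the tree grow fully, then removes branches that add little predictive value (e.g., cost-complexity pruning). It improves generalization.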
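
A minimal sketch of both strategies in scikit-learn, assuming the Iris data and the same 80/20 split used elsewhere in this notebook. Post-pruning is shown with cost-complexity pruning (`ccp_alpha`). The mid-range alpha is picked only for illustration; in practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Pre-pruning: cap the depth before training starts.
pre = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the candidate alphas
# at which subtrees would be collapsed.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Assumption: a mid-range alpha, chosen here without tuning.
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, tree in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", tree.get_n_leaves(),
          "test accuracy:", tree.score(X_test, y_test))
```

Larger values of `ccp_alpha` prune more aggressively, so the leaf count never exceeds that of the full tree.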

## Question 4
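**Information Gain** = impurity(parent) − weighted impurity(children), where each child is weighted by its share of the parent's samples.  
It helps choose the best feature to split on: the tree picks the split that maximizes the reduction in impurity. Strictly, the term refers to entropy as the impurity measure, but the same formula works with Gini impurity.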
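
The formula can be checked by hand on a toy node. The sketch below uses entropy as the impurity measure; `information_gain` and the label lists are hypothetical and exist only for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]            # 4 positives, 4 negatives

perfect = [[1, 1, 1, 1], [0, 0, 0, 0]]       # each child is pure
noisy = [[1, 1, 1, 0], [1, 0, 0, 0]]         # each child is 3-to-1

gain_perfect = information_gain(parent, perfect)
gain_noisy = information_gain(parent, noisy)
print("perfect split gain:", gain_perfect)
print("noisy split gain:  ", round(gain_noisy, 4))
```

The perfect split removes all uncertainty (gain of 1 bit), while the noisy split recovers only about 0.19 bits, so the tree would prefer the first.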

## Question 5
Applications: healthcare diagnosis, loan approval, fraud detection, recommendation systems, manufacturing defect detection.  
Advantages: simple, interpretable, needs no feature scaling, and works with both numerical and categorical data.  
Limitations: prone to overfitting, unstable (small changes in the data can produce a very different tree), and biased toward features with many categories.


In [1]:
# Question 6
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

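# Load the Iris dataset (150 samples, 4 features, 3 classes)
data = load_iris()
X, y = data.data, data.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a tree with the Gini criterion. No random_state is set on the tree,
# so ties between equally good splits (and the importances) can vary between runs.
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)

print("Accuracy:", acc)
# Importances follow the column order in data.feature_names
print("Feature Importances:", clf.feature_importances_)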

Accuracy: 1.0
Feature Importances: [0.01667014 0.         0.40593501 0.57739485]


In [2]:
# Question 7
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Fully grown tree
full = DecisionTreeClassifier()
full.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full.predict(X_test))

# max_depth=3
limited = DecisionTreeClassifier(max_depth=3)
limited.fit(X_train, y_train)
limited_acc = accuracy_score(y_test, limited.predict(X_test))

print("Full Tree Accuracy:", full_acc)
print("Max Depth=3 Accuracy:", limited_acc)

Full Tree Accuracy: 1.0
Max Depth=3 Accuracy: 1.0


In [6]:
# Question 8
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

housing = fetch_california_housing()
Xb, yb = housing.data, housing.target

Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.2, random_state=42)

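# Fit an unconstrained regression tree (no depth limit, so it can overfit)
reg = DecisionTreeRegressor()
reg.fit(Xb_train, yb_train)

# Mean squared error on the held-out test set
pred_b = reg.predict(Xb_test)
mse = mean_squared_error(yb_test, pred_b)

print("MSE:", mse)
# Importances follow the column order in housing.feature_names
print("Feature Importances:", reg.feature_importances_)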

MSE: 0.4942615214722141
Feature Importances: [0.52936799 0.05168014 0.05398948 0.02855997 0.02999578 0.13036844
 0.0933088  0.08272941]


In [4]:
# Question 9
from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

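# 5-fold cross-validated search over every parameter combination
grid = GridSearchCV(DecisionTreeClassifier(), params, cv=5)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy on the training data, not test accuracy
print("Best Accuracy:", grid.best_score_)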

Best Params: {'max_depth': 4, 'min_samples_split': 2}
Best Accuracy: 0.9416666666666668


## Question 10
**Steps for disease prediction model:**

1. **Handle Missing Values**  
   - Numerical: mean/median imputation  
   - Categorical: mode imputation  
   - Optionally use advanced techniques like KNN imputer.

2. **Encode Categorical Features**  
   - Use one-hot encoding for non-ordinal data  
   - Use ordinal encoding if a natural order exists.

3. **Train a Decision Tree Model**  
   - Fit the tree on cleaned and encoded data  
   - Use criteria like Gini/Entropy.

4. **Hyperparameter Tuning**  
   - Use GridSearchCV to tune max_depth, min_samples_split, min_samples_leaf.

5. **Evaluate Performance**  
   - Use accuracy, precision, recall, F1-score, and the confusion matrix; recall deserves extra weight when a missed diagnosis is costly.
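
The steps above can be sketched as a single scikit-learn pipeline. This is one possible arrangement, not a definitive implementation. The patient data is synthetic, the column names (`age`, `blood_pressure`, `smoker`) and the label rule are invented for illustration, and the F1 scoring choice is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all columns are hypothetical.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
# Made-up label rule so the tree has a pattern to learn.
y = ((df["age"] > 55) & (df["smoker"] == "yes")).astype(int)
# Remove some values to simulate missing data.
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y)

# Steps 1-2: impute, then one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the tree is fitted on the preprocessed features.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune the tree's size limits with 5-fold cross-validation.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = search.predict(X_test)
print("Best params:", search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Keeping imputation and encoding inside the pipeline means they are re-fitted on each training fold during the search, which avoids leaking information from the validation folds.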

**Business Value:**  
This model helps early disease detection, reduces healthcare costs, improves patient outcomes, and assists doctors in decision-making.
