Assignment Code: DA-AG-012


### Question 1: What is a Decision Tree, and how does it work in the context of classification?

Answer: A Decision Tree is a supervised learning algorithm used for both classification and regression. In classification, it predicts a class label by learning a hierarchy of simple if/else decision rules from the feature values.

### How it works:
The tree starts at the root node, splits the data on feature values at internal nodes, and outputs a class label at the leaf node.

At each node, the algorithm chooses the feature and threshold that best separate the data using impurity measures (like Gini or Entropy).
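
The split search described above can be sketched for a single feature. This is an illustrative brute-force version, not scikit-learn's actual implementation; the helper names `gini` and `best_threshold` are hypothetical. It tries the midpoint between each pair of adjacent distinct values and keeps the threshold with the lowest size-weighted Gini impurity.

```python
import numpy as np
from sklearn.datasets import load_iris


def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_threshold(x, y):
    """Return (threshold, weighted Gini) of the best split 'x <= threshold'."""
    values = np.unique(x)  # sorted distinct feature values
    candidates = (values[:-1] + values[1:]) / 2.0  # midpoints between neighbours
    best_t, best_score = None, np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        # Size-weighted average impurity of the two child nodes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


iris = load_iris()
petal_length = iris.data[:, 2]  # column 2 is petal length (cm)
threshold, weighted_gini = best_threshold(petal_length, iris.target)
print("Best threshold:", threshold)
print("Weighted Gini after split:", weighted_gini)
```

A real tree repeats this search over every feature at every node and then recurses into the two children.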

### Example:
Using the Iris dataset, a Decision Tree might first split on petal length ≤ 2.45 cm. If true, it's classified as Setosa; otherwise, the tree continues further splits to distinguish between Versicolor and Virginica.
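
To inspect the rules a fitted tree actually learned, scikit-learn's `export_text` prints them as nested conditions. The sketch below assumes a shallow tree (`max_depth=2`) trained on the full Iris data purely for readability. The root split may appear on petal length or petal width, since either feature isolates Setosa perfectly.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Render the learned decision rules as indented text
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```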

### Question 2: Explain Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
Answer: Both are used to measure how mixed the class labels are in a node.

### Gini Impurity:
$Gini = 1 - \sum p_i^2$, where $p_i$ is the probability of class $i$. Lower Gini means purer nodes.

### Entropy:
$Entropy = -\sum p_i \log_2(p_i)$. Measures the level of uncertainty.

### Impact on splits:
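The algorithm selects the split that yields the largest reduction in impurity, measured as the parent's impurity minus the size-weighted average impurity of the child nodes. With Entropy, this reduction is called Information Gain; with Gini, it is often called Gini gain or impurity decrease, although "Information Gain" is commonly used loosely for both.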

### Example:
If a node contains 50% class A and 50% class B, its Gini = 0.5 and Entropy = 1, the maximum values for two classes. A split that produces pure nodes (100% of a single class) would reduce impurity to 0.
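
The numbers in this example can be checked with a few lines of NumPy. The function names `gini` and `entropy` are illustrative helpers, not library calls; both take a vector of class probabilities.

```python
import numpy as np


def gini(probabilities):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(probabilities, dtype=float)
    return 1.0 - np.sum(p ** 2)


def entropy(probabilities):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p)) + 0.0  # "+ 0.0" turns -0.0 into 0.0


mixed = [0.5, 0.5]  # evenly mixed two-class node
pure = [1.0, 0.0]   # node containing a single class

print("Mixed node:", gini(mixed), entropy(mixed))  # Gini 0.5, entropy 1.0
print("Pure node: ", gini(pure), entropy(pure))    # both 0.0
```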

### Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
Answer:

| Type        | Description                                                                  | Advantage                       |
|-------------|------------------------------------------------------------------------------|---------------------------------|
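| Pre-Pruning | Stops tree growth early by limiting depth, minimum samples per split, etc.   | Faster training, and overfitting is curbed before deep branches are ever grown |
| Post-Pruning| Grows the full tree, then prunes branches back based on validation performance | Removes only branches that do not help, often giving a simpler tree that generalizes better |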

Example:

Pre-Pruning: `max_depth=3`, `min_samples_split=5`

Post-Pruning: cost complexity pruning (`ccp_alpha`) after building a full tree
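
A minimal sketch of both approaches in scikit-learn, reusing the Iris split used elsewhere in this notebook. The choice of `ccp_alpha` here (the middle value of the pruning path) is an arbitrary assumption for illustration; in practice it would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is constrained while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the candidate alphas
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary illustrative choice: the middle alpha on the pruning path
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, tree in [("pre-pruned", pre), ("full", full), ("post-pruned", post)]:
    print(name, "nodes:", tree.tree_.node_count,
          "test accuracy:", tree.score(X_test, y_test))
```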

### Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
Answer:
Information Gain is the decrease in impurity after a dataset is split on an attribute.

Formula:
$IG = Impurity_{parent} - \sum \left( \frac{n_{child}}{n_{parent}} \times Impurity_{child} \right)$

Why it's important:
The feature that maximizes Information Gain is chosen to split the node, ensuring the best separation of classes.

Example:
If the parent node has Gini 0.5 and the size-weighted average Gini of its child nodes is 0.2, the Information Gain is 0.5 - 0.2 = 0.3, indicating an effective split.
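
The formula can be worked through on a toy node. This sketch uses Entropy as the impurity measure and made-up labels (four samples of each class); `entropy` and `information_gain` are hypothetical helper names.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p)) + 0.0  # "+ 0.0" avoids printing -0.0


def information_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # toy node: 4 samples per class

# Imperfect split: children are [0, 0, 0, 1] and [0, 1, 1, 1]
ig_partial = information_gain(parent, [parent[[0, 1, 2, 4]], parent[[3, 5, 6, 7]]])

# Perfect split: each child holds a single class
ig_perfect = information_gain(parent, [parent[:4], parent[4:]])

print("IG (imperfect split):", ig_partial)
print("IG (perfect split):  ", ig_perfect)
```

The perfect split recovers the full parent entropy of 1 bit, so the tree would prefer it over the imperfect one.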

### Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
Answer:
### Applications:

* Healthcare: disease prediction
* Finance: loan approval, credit risk analysis
* Marketing: customer segmentation
* Operations: supply chain decisions

### Advantages:

* Easy to interpret and visualize
* No need for feature scaling
* Handles both numerical and categorical data

### Limitations:

* Prone to overfitting
* Unstable with small changes in data
* Biased toward features with many distinct values, because they offer more candidate splits

### Question 6: Load Iris Dataset, Train Decision Tree (Gini), Print Accuracy & Feature Importances

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Feature Importances:", model.feature_importances_)

Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]
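
The raw importance array is easier to read when paired with feature names. A small sketch that retrains the same model and prints the importances ranked from highest to lowest:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)

# Pair each feature name with its importance and sort in descending order
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction in the tree.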


### Question 7: Train Iris Decision Tree (max_depth=3) vs Full Tree Accuracy

In [3]:
model_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
model_limited.fit(X_train, y_train)
acc_limited = accuracy_score(y_test, model_limited.predict(X_test))

model_full = DecisionTreeClassifier(random_state=42)
model_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, model_full.predict(X_test))

print("Accuracy (max_depth=3):", acc_limited)
print("Accuracy (full tree):", acc_full)

Accuracy (max_depth=3): 1.0
Accuracy (full tree): 1.0
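
Both trees reach 1.0 on this single 45-sample test split, so the split alone cannot tell them apart. One possible way to get a steadier comparison is 5-fold cross-validation over the whole dataset, sketched below; the fold count is an assumption, not part of the original task.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv_means = {}
for label, depth in [("max_depth=3", 3), ("full tree", None)]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy across 5 stratified folds
    scores = cross_val_score(clf, X, y, cv=5)
    cv_means[label] = scores.mean()
    print(f"{label}: mean CV accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```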


### Question 8: Train Decision Tree Regressor on Boston Housing Dataset (California Housing used as a substitute)

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Note: load_boston was removed in scikit-learn 1.2, so the California housing dataset is used instead
housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("MSE:", mse)
print("Feature Importances:", regressor.feature_importances_)

MSE: 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]
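
The MSE is in squared target units, which makes it hard to interpret directly. Taking the square root gives the RMSE in the target's own units. The California housing target is the median house value in units of $100,000, so the sketch below converts the MSE printed above into an approximate dollar error.

```python
import math

mse = 0.5280096503174904  # value printed by the cell above

# RMSE is expressed in the same units as the target (here $100,000s)
rmse = math.sqrt(mse)

print(f"RMSE: {rmse:.3f}")
print(f"Typical error: about ${rmse * 100_000:,.0f}")
```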


### Question 9: GridSearchCV for Decision Tree on Iris Dataset

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load Iris dataset again to ensure X_train and y_train are from Iris
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


params = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5]
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Accuracy on Validation:", grid.best_score_)

Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Best Accuracy on Validation: 0.9333333333333333
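
The score above is the mean cross-validated accuracy on the training split. A final check on the held-out test set, which the grid search never saw, could look like the following sketch (it repeats the setup so that it runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

params = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_split": [2, 3, 4, 5],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=5)
grid.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split with the best parameters
test_accuracy = grid.best_estimator_.score(X_test, y_test)
print("Test accuracy of best model:", test_accuracy)
```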


### Question 10: End-to-End Process for Healthcare Prediction Using Decision Trees
**Step-by-Step:**
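1. **Handle Missing Values**: Use `SimpleImputer`, for example the median for numerical columns and the most frequent value for categorical columns.
2. **Encode Categorical Features**: Use `OneHotEncoder` or `OrdinalEncoder`.
3. **Train Model**: Fit a `DecisionTreeClassifier` on a training split, keeping a separate test split aside.
4. **Hyperparameter Tuning**: Use `GridSearchCV` to optimize `max_depth` and `min_samples_split` with cross-validation on the training split.
5. **Evaluation**: Report metrics such as Accuracy, Precision, Recall, and ROC-AUC on the held-out test split. Recall deserves particular attention when a missed disease case is costly.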
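
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is a hypothetical assumption: the columns `age`, `blood_pressure`, and `smoker`, the synthetic label, and the injected missing values are generated only so the example runs. A real project would load actual patient records instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient data (not real records)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "blood_pressure": rng.normal(120, 15, size=n),
    "smoker": pd.Series(rng.choice(["yes", "no"], size=n), dtype=object),
})
risk = (0.03 * (df["age"] - 50) + 0.04 * (df["blood_pressure"] - 120)
        + (df["smoker"] == "yes"))
y = (risk + rng.normal(0, 0.5, size=n) > 0.5).astype(int)

# Inject missing values so the imputers have work to do
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

numeric, categorical = ["age", "blood_pressure"], ["smoker"]

# Steps 1 and 2: impute, then one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the model sits at the end of the pipeline
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune the tree with cross-validation on the training split only
grid = GridSearchCV(
    pipe,
    {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_split": [2, 5, 10]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)

# Step 5: evaluate once on the held-out test split
test_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best parameters:", grid.best_params_)
print("Test ROC-AUC:", test_auc)
print(classification_report(y_test, grid.predict(X_test)))
```

Putting the imputers and encoder inside the pipeline means they are refit on each cross-validation fold, which avoids leaking information from validation data into preprocessing.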

**Business Value:**
- Early detection of disease
- Better patient management
- Cost-effective treatment allocation