# ***Decision Tree***

# Question 1: What is a Decision Tree, and how does it work in the context of classification?

A Decision Tree is a supervised machine learning model that represents decisions and their possible consequences as a tree-like structure. It splits data recursively based on features to form branches, with leaves representing outcomes (e.g., class labels in classification). In classification, it works by selecting the best feature at each node to split the data, maximizing separation between classes (e.g., using impurity measures like Gini). The tree is traversed from root to leaf to predict the class of new data.
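
The root-to-leaf traversal described above can be made visible by printing a small tree as text. The following is a minimal sketch, assuming scikit-learn is available; the depth limit of 2 and the choice of the Iris dataset are illustrative assumptions, not part of the answer itself.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Keep the tree very shallow so the printed rules stay readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented line is one test on a feature; lines ending in "class: k" are leaves.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Predicting a sample follows exactly one path from the root to a leaf.
first_sample = iris.data[:1]
predicted_class = clf.predict(first_sample)[0]
print("Predicted class for the first sample:", iris.target_names[predicted_class])
```

Reading the printed rules from top to bottom for a given sample reproduces the prediction by hand.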

# Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
Answer:-
- **Gini Impurity**: Measures the probability of incorrectly classifying a randomly chosen element if labeled randomly according to the class distribution. Formula: Gini(p) = 1 - Σ(p_i^2), where p_i is the probability of class i. Lower Gini indicates purer nodes.


- **Entropy**: Measures uncertainty or disorder in the data. Formula: Entropy(p) = - Σ(p_i * log2(p_i)). Zero entropy means a pure node.

They impact splits by guiding feature selection: at each node, the candidate split whose child nodes have the lowest weighted Gini or Entropy (equivalently, the highest information gain) is chosen, leading to more homogeneous child nodes. Gini is slightly cheaper to compute because it avoids the logarithm; the two criteria usually produce very similar trees, though Entropy may give slightly better splits in some cases.
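
To make the two formulas concrete, here is a small sketch in plain Python that evaluates both measures on a pure node and on a perfectly mixed two-class node. The helper names (`class_probabilities`, `gini`, `entropy`) and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))


def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i))."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print(gini(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini(mixed_node), entropy(mixed_node))  # 0.5 1.0
```

Both measures are zero for the pure node and reach their two-class maximum (0.5 for Gini, 1 bit for Entropy) at a 50/50 mix.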

# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
Answer:-
- **Pre-Pruning**: Stops tree growth early during training (e.g., limiting max_depth or min_samples_split).
- **Post-Pruning**: Grows the full tree, then removes branches (e.g., using cost complexity pruning).

Difference: Pre-Pruning is proactive (it constrains the tree while it is being built); Post-Pruning is reactive (it trims the tree after it has been built).

Advantage of Pre-Pruning: Faster training, because the full tree is never grown (e.g., for a large dataset with 45 features).

Advantage of Post-Pruning: Potentially higher accuracy, because every split is grown first and only the branches that do not help generalization are removed.
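
The two strategies can be compared side by side in scikit-learn. This is a sketch under stated assumptions: the Iris dataset, the specific `max_depth` and `min_samples_split` values, and picking the middle value of the pruning path as `ccp_alpha` are all illustrative choices. In practice the alpha would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth is limited while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its cost-complexity path.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take the middle alpha from the path.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(
        f"{name:<12} depth={model.get_depth()} "
        f"leaves={model.get_n_leaves()} "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```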

# Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

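Information Gain (IG) is the reduction in entropy (or another impurity measure) achieved by splitting on a feature. Formula: IG = Entropy(parent) - Σ((n_child / n_parent) * Entropy(child)), where each child's entropy is weighted by its share of the parent's samples. It is important for choosing the best split because it quantifies how much uncertainty a split removes: at each node the tree picks the split with the highest IG, which tends to produce purer children and a shallower, more interpretable tree (e.g., if a threshold of age = 77 yields the highest IG among all candidate splits in a dataset, the tree splits there).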
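
A short worked example may help here. The sketch below computes IG for two candidate splits of the same parent node, weighting each child's entropy by its share of the parent's samples. The function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted_child_entropy = sum(
        (len(child) / total) * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy


parent = ["yes"] * 4 + ["no"] * 4

# Candidate split 1: separates the classes perfectly.
perfect_split = [["yes"] * 4, ["no"] * 4]

# Candidate split 2: each child is still a 3-to-1 mix.
partial_split = [["yes"] * 3 + ["no"], ["yes"] + ["no"] * 3]

ig_perfect = information_gain(parent, perfect_split)
ig_partial = information_gain(parent, partial_split)

print(f"IG of perfect split: {ig_perfect:.4f}")
print(f"IG of partial split: {ig_partial:.4f}")
```

The perfect split recovers the full 1 bit of parent entropy, while the partial split gains far less, so a tree using this criterion would prefer the first.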

# Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

- **Applications**: Credit risk assessment (classify loan approval), medical diagnosis (classify disease based on symptoms), customer segmentation (classify buying behavior).
- **Advantages**: Easy to interpret (visual tree), handle non-linear data, no need for scaling.
- **Limitations**: Prone to overfitting, sensitive to small changes in data, biased toward features with more levels.

# Question 6: Write a Python program to: A. Load the Iris Dataset B. Train a Decision Tree Classifier using the Gini criterion C. Print the model's accuracy and feature importances

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A. Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# B. Train Decision Tree Classifier with Gini
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# C. Print accuracy and feature importances
print(f"My Model Accuracy: {accuracy}")
print(f"My Feature Importances: {clf.feature_importances_}")

My Model Accuracy: 1.0
My Feature Importances: [0.         0.01667014 0.90614339 0.07718647]
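
The importances above are printed as a bare array, so it is not obvious which value belongs to which feature. A possible way to label them, assuming the same split and model settings as in the cell above, is to pair each value with `iris.feature_names` and sort:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Pair each importance with its feature name and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```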


# Question 7: Write a Python program to: A. Load the Iris Dataset B. Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A. Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# B. Train with max_depth=3
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
y_pred_pruned = clf_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Train fully-grown tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Compare
print(f"Pruned Tree Accuracy (max_depth=3): {accuracy_pruned}")
print(f"Full Tree Accuracy: {accuracy_full}")

Pruned Tree Accuracy (max_depth=3): 1.0
Full Tree Accuracy: 1.0
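
Both trees reach 1.0 on this single 30-sample test split, so the comparison above cannot tell them apart. One way to get a more informative comparison is k-fold cross-validation over the whole dataset. This is a sketch, assuming 5 folds and the same two model configurations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # Each fold trains on 4/5 of the data and scores on the remaining 1/5.
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores
    print(f"{name:<12} mean accuracy={scores.mean():.3f} (std={scores.std():.3f})")
```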


# Question 8: Write a Python program to: • Load the California Housing dataset from sklearn • Train a Decision Tree Regressor • Print the Mean Squared Error (MSE) and feature importances

In [5]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing Dataset
california = fetch_california_housing()
X = california.data
y = california.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predict and MSE
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print MSE and feature importances
print(f"MSE: {mse}")
print(f"Feature Importances: {reg.feature_importances_}")

MSE: 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]


# Question 9: Write a Python program to: • Load the Iris Dataset • Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV • Print the best parameters and the resulting model accuracy

In [6]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune with GridSearchCV
param_grid = {'max_depth': [3, 5, None], 'min_samples_split': [2, 5, 10]}
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print(f"Best Parameters: {best_params}")
print(f"Model Accuracy: {accuracy}")

Best Parameters: {'max_depth': None, 'min_samples_split': 2}
Model Accuracy: 1.0


# Question 10: Imagine you're working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
A. Handle the missing values
B. Encode the categorical features
C. Train a Decision Tree model
D. Tune its hyperparameters
E. Evaluate its performance
And describe what business value this model could provide in the real-world setting.

Answer:-

**Step-by-Step Process**:
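A. **Handle Missing Values**: Identify them with `df.isnull().sum()`. For numerical features (e.g., blood pressure), impute with the mean or median; for categorical features (e.g., gender), use the mode. Drop features or rows in which more than about 50% of the values are missing, or use KNN imputation for complex cases. Fit the imputation on the training split only so that no information leaks from the test set. Reason: Imputation retains data that would otherwise be discarded and limits the bias that dropping incomplete records can introduce.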

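B. **Encode Categorical Features**: Use one-hot encoding for nominal features (e.g., symptoms: 'fever', 'cough' → binary columns) via `pd.get_dummies()`. For ordinal features (e.g., severity: 'low', 'high'), use an integer encoding that preserves the order (for example with scikit-learn's `OrdinalEncoder`). Reason: scikit-learn's Decision Trees require numeric inputs, and an order-preserving encoding lets the tree split an ordinal feature with a single threshold.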

C. **Train a Decision Tree Model**: Split data (80/20) using `train_test_split`. Train `DecisionTreeClassifier` with `fit(X_train, y_train)`. Use Gini or entropy criterion.

D. **Tune Hyperparameters**: Use `GridSearchCV` to tune `max_depth`, `min_samples_split`, `min_samples_leaf` with CV=5. Fit on train data, get best params.

E. **Evaluate Performance**: Use accuracy, F1-score (for imbalance), confusion matrix, and ROC-AUC on test data. Cross-validate for robustness.
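
Steps A to E can be sketched end to end as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient table is synthetic, and the column names (`blood_pressure`, `age`, `gender`, `severity`) and the target are hypothetical. The sketch also swaps `pd.get_dummies()` for `OneHotEncoder` so that the encoding is learned on the training folds only, and it scores the grid search with F1 as one possible choice for imbalanced classes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data, for illustration only.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "blood_pressure": rng.normal(120, 15, n),
    "age": rng.integers(20, 90, n).astype(float),
    "gender": rng.choice(["female", "male"], n),
    "severity": rng.choice(["low", "medium", "high"], n),
})
risk = 0.03 * (df["blood_pressure"] - 120) + 0.02 * (df["age"] - 55)
y = (risk + rng.normal(0, 0.5, n) > 0).astype(int)

# Simulate roughly 10% missing values in one numeric and one categorical column.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "gender"] = np.nan

# A + B: impute, then encode, separately per column type.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["blood_pressure", "age"]),
    ("nom", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["gender"]),
    ("ord", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["low", "medium", "high"]])),
    ]), ["severity"]),
])

# C: the tree is the final pipeline step, so preprocessing is refit per CV fold.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# D: tune the tree's hyperparameters with 5-fold cross-validation.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 5, 10],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# E: evaluate the refit best model on the held-out test set.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
cm = confusion_matrix(y_test, y_pred)

print("Best parameters:", search.best_params_)
print(f"F1: {f1:.3f}  ROC-AUC: {auc:.3f}")
print(cm)
```

Keeping imputation and encoding inside the pipeline means each cross-validation fold learns them from its own training portion, which avoids leaking information from validation or test data.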


**Business Value**:
The model supports earlier disease detection, which helps allocate resources (e.g., prioritizing high-risk patients), reduce costs through targeted treatment, and improve patient outcomes, which in turn strengthens the company’s reputation and revenue. Figures such as a 20% cost reduction or 15% faster recovery are illustrative targets rather than measured results and would need to be validated in practice. For a company such as XYZ, the model could also help personalize care for specific groups, e.g., 76-year-old patients.