# 📘 Decision Trees Concepts and Python Programs


## 🌲 Decision Trees - Key Concepts

### What is a Decision Tree, and how does it work?
A Decision Tree is a flowchart-like structure: each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf holds a class label (classification) or a numeric value (regression). Training splits the data recursively, choosing at each node the feature and threshold that most reduce an impurity measure such as Gini impurity or entropy.

### Impurity Measures in Decision Trees
- **Gini Impurity** and **Entropy** are used to measure the impurity or disorder of a dataset.

### Gini Impurity Formula
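\[
Gini = 1 - \sum_{i=1}^{n} p_i^2
\]
where \(p_i\) is the fraction of samples at the node that belong to class \(i\) and \(n\) is the number of classes. A pure node has a Gini impurity of 0.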
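
As a minimal sketch of the Gini formula, the helper below computes the impurity of a list of class labels. The function name `gini_impurity` and the toy labels are illustrative assumptions, not part of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a sequence of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen for this sketch: an empty node counts as pure
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure = gini_impurity(["a", "a", "a", "a"])      # single class -> 0.0
balanced = gini_impurity(["a", "a", "b", "b"])  # two equally likely classes -> 0.5
print(pure, balanced)
```

For two classes the impurity peaks at 0.5 when both are equally likely and falls to 0 as the node becomes pure.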

### Entropy Formula
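\[
Entropy = - \sum_{i=1}^{n} p_i \log_2(p_i)
\]
where \(p_i\) is the fraction of samples in class \(i\), \(n\) is the number of classes, and terms with \(p_i = 0\) are taken as 0.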
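
The entropy formula can be sketched the same way. The helper name `entropy` and the sample labels are assumptions made for illustration. Classes that do not occur never enter the sum, which matches the usual convention that a zero-probability term contributes 0.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen for this sketch
    probabilities = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probabilities)


pure = entropy(["a", "a", "a"])           # one class: no uncertainty
two_way = entropy(["a", "b", "a", "b"])   # two equally likely classes: 1 bit
three_way = entropy(["a", "b", "c"])      # three equally likely classes: log2(3) bits
print(pure, two_way, three_way)
```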

### Information Gain
- The reduction in impurity achieved by a split, with each child weighted by its share of the parent's samples:
\[
Information Gain = Entropy_{parent} - \sum_{child} \frac{|child|}{|parent|} Entropy_{child}
\]
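
A small worked example may help here. The sketch below assumes a hypothetical parent node with five "yes" and five "no" labels and compares a split that separates the classes perfectly with one that leaves both children as mixed as the parent. The helper names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
useless_split = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(gain_perfect, gain_useless)
```

The perfect split recovers the full 1 bit of parent entropy, while the second split gains nothing because each child is still a 50/50 mix.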

### Gini vs Entropy
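- Both measure impurity and usually lead to very similar trees. Gini is slightly cheaper to compute because it needs no logarithm; entropy is somewhat more sensitive to low-probability (rare) classes.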

### Pre-Pruning
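Stop growing the tree early by constraining it during training, for example by limiting the maximum depth (`max_depth`), the minimum number of samples needed to split a node (`min_samples_split`), or the minimum number of samples per leaf (`min_samples_leaf`).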

### Post-Pruning
Grow the full tree first, then remove branches that add little predictive value, for example with Cost Complexity Pruning (the `ccp_alpha` parameter in scikit-learn).

### Pre-Pruning vs Post-Pruning
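- **Pre-Pruning**: Stops splitting early. It is cheaper to train, but may stop before a useful split is found.
- **Post-Pruning**: Prunes a fully grown tree afterwards. It costs more, but each pruning decision can take the whole subtree into account.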
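
To make the contrast concrete, the sketch below fits an unpruned tree, a pre-pruned tree, and a post-pruned tree on Iris and reports their size. The specific settings (`max_depth=3`, `min_samples_leaf=5`, `ccp_alpha=0.02`) are arbitrary values picked for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Unpruned baseline: grows until every leaf is pure.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Pre-pruning: growth is constrained while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_tr, y_tr)

# Post-pruning: the full tree is grown, then weak subtrees are collapsed.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_tr, y_tr)

for name, model in [("unpruned", unpruned), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:<12} depth={model.get_depth()} leaves={model.get_n_leaves()} "
          f"test accuracy={model.score(X_te, y_te):.3f}")
```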

### Decision Tree Regressor
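Predicts continuous values: each split is chosen to minimize the variance (mean squared error) of the targets within the resulting child nodes, and each leaf predicts the mean of its training targets.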
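
The variance-minimizing split can be sketched for a single feature as follows. This is a simplified illustration with made-up data and hypothetical helper names, not how any particular library implements it: it tries every midpoint between consecutive distinct feature values and keeps the one with the lowest size-weighted child variance.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def best_threshold(xs, ys):
    """Return (threshold, weighted child variance) of the best split x <= threshold."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = len(left) / n * variance(left) + len(right) / n * variance(right)
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, score = best_threshold(xs, ys)
print(threshold, score)  # the split at 6.5 leaves both children with zero variance
```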

### Advantages and Disadvantages
✅ Easy to interpret, no need for feature scaling, can work with categorical data.  
❌ Prone to overfitting; small changes in the training data can produce a very different tree.

### Missing Values
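Handling depends on the implementation: missing values can be imputed before training, and CART-style implementations can use surrogate splits (backup splits on other features that mimic the primary split). scikit-learn does not implement surrogate splits, so imputing first is a common workaround there.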
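
One common workaround in scikit-learn is to impute before fitting. The sketch below is one possible setup on made-up toy data: it chains `SimpleImputer` (median strategy, an arbitrary choice) with a tree in a pipeline, so the same imputation is applied at prediction time.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing entries (np.nan); values are invented for illustration.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [np.nan, 1.0],
    [8.0, 9.0],
    [9.0, np.nan],
    [np.nan, 8.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# The imputer fills each missing value with its column median before the tree sees the data.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)
preds = model.predict(X)
print(preds)
```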

### Categorical Features
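Some implementations split directly on categories; others, including scikit-learn, expect numeric input, so categorical features are first converted with dummy (one-hot) or ordinal encoding.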
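
As a minimal sketch of the dummy-encoding route, the pipeline below one-hot encodes a single categorical column before fitting a tree. The colour data and labels are invented for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# One categorical feature; in this toy data only "red" items are labelled 1.
X = np.array([["red"], ["green"], ["blue"], ["red"], ["green"], ["blue"]], dtype=object)
y = np.array([1, 0, 0, 1, 0, 0])

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

preds = model.predict(np.array([["red"], ["blue"]], dtype=object))
print(preds)
```

One-hot encoding lets the tree ask "is the colour red?" as a single split, at the cost of one extra column per category.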

### Real-World Applications
Fraud detection, customer segmentation, medical diagnosis, etc.


### ✅ Train Decision Tree Classifier on Iris Dataset

In [None]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


### ✅ Gini Criterion & Feature Importance

In [None]:

clf_gini = DecisionTreeClassifier(criterion='gini')
clf_gini.fit(X_train, y_train)
print("Feature importances:", clf_gini.feature_importances_)
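
The raw importance array is easier to read when paired with feature names. The sketch below refits a Gini tree on the full Iris data (an assumption made to keep it self-contained) and prints the features in descending order of importance.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_data = load_iris()
model = DecisionTreeClassifier(criterion="gini", random_state=0)
model.fit(iris_data.data, iris_data.target)

# Pair each feature name with its importance and sort from most to least important.
ranked = sorted(
    zip(iris_data.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances are normalized, so they sum to 1 across all features.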


### ✅ Entropy Criterion & Accuracy

In [None]:

clf_entropy = DecisionTreeClassifier(criterion='entropy')
clf_entropy.fit(X_train, y_train)
print("Accuracy (Entropy):", accuracy_score(y_test, clf_entropy.predict(X_test)))


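### ✅ Decision Tree Regressor on California Housing

In [None]:

# load_boston was removed in scikit-learn 1.2, so this cell uses the
# California housing data instead (downloaded on first use).
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
# Separate names keep the Iris split above intact for the later classifier cells.
Xr_train, Xr_test, yr_train, yr_test = train_test_split(housing.data, housing.target, test_size=0.3, random_state=42)
reg = DecisionTreeRegressor(random_state=42)
reg.fit(Xr_train, yr_train)
yr_pred = reg.predict(Xr_test)
print("MSE:", mean_squared_error(yr_test, yr_pred))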


### ✅ Visualize Decision Tree

In [None]:

from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(clf, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
graph = graphviz.Source(dot_data)
graph.render("decision_tree_iris")
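
If Graphviz is not installed, a plain-text view of the tree is a possible alternative. The sketch below uses scikit-learn's `export_text` on a deliberately shallow tree; `max_depth=2` is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris_data.data, iris_data.target)

# export_text returns the split rules as an indented string, one line per node.
rules = export_text(small_tree, feature_names=list(iris_data.feature_names))
print(rules)
```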


### ✅ max_depth=3 vs Full Tree

In [None]:

clf_full = DecisionTreeClassifier().fit(X_train, y_train)
clf_limited = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("Full tree accuracy:", accuracy_score(y_test, clf_full.predict(X_test)))
print("Max depth=3 accuracy:", accuracy_score(y_test, clf_limited.predict(X_test)))


### ✅ min_samples_split=5 vs Default

In [None]:

clf_default = DecisionTreeClassifier().fit(X_train, y_train)
clf_min_samples = DecisionTreeClassifier(min_samples_split=5).fit(X_train, y_train)
print("Default accuracy:", accuracy_score(y_test, clf_default.predict(X_test)))
print("min_samples_split=5 accuracy:", accuracy_score(y_test, clf_min_samples.predict(X_test)))


### ✅ Feature Scaling Comparison

In [None]:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_scaled = DecisionTreeClassifier().fit(X_train_scaled, y_train)
# Predict on the scaled test set, since the model was trained on scaled features.
print("Scaled data accuracy:", accuracy_score(y_test, clf_scaled.predict(X_test_scaled)))
print("Unscaled data accuracy:", accuracy_score(y_test, clf_default.predict(X_test)))


### ✅ One-vs-Rest Multiclass Classification

In [None]:

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(DecisionTreeClassifier())
ovr_clf.fit(X_train, y_train)
print("OvR Accuracy:", accuracy_score(y_test, ovr_clf.predict(X_test)))


### ✅ Decision Tree Regressor with max_depth=5

In [None]:

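# Reload the regression data (California housing) so this cell does not reuse the Iris split.
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
Xr_train, Xr_test, yr_train, yr_test = train_test_split(housing.data, housing.target, test_size=0.3, random_state=42)
reg_full = DecisionTreeRegressor(random_state=42).fit(Xr_train, yr_train)
reg_limited = DecisionTreeRegressor(max_depth=5, random_state=42).fit(Xr_train, yr_train)
print("Full tree MSE:", mean_squared_error(yr_test, reg_full.predict(Xr_test)))
print("Max depth=5 MSE:", mean_squared_error(yr_test, reg_limited.predict(Xr_test)))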


### ✅ Cost Complexity Pruning

In [None]:

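# The pruning path lists the effective alphas at which subtrees are pruned away.
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Larger alphas prune more; the last alpha collapses the tree to its root.
# Test accuracy is printed for illustration only; choose alpha with a validation set or cross-validation.
for ccp_alpha in ccp_alphas:
    clf_pruned = DecisionTreeClassifier(ccp_alpha=ccp_alpha).fit(X_train, y_train)
    acc = accuracy_score(y_test, clf_pruned.predict(X_test))
    print(f"ccp_alpha={ccp_alpha:.5f}, accuracy={acc:.4f}")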
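
Picking `ccp_alpha` by test accuracy leaks information from the test set. One possible alternative, sketched below under the assumption of 5-fold cross-validation on the training split, selects the alpha with the best mean cross-validated accuracy and only then scores the final tree on the held-out data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Candidate alphas come from the pruning path of a tree grown on the training split.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
# Clip guards against tiny negative values caused by floating-point error.
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# Mean 5-fold cross-validated accuracy for each candidate alpha.
cv_means = [
    cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X_tr, y_tr, cv=5
    ).mean()
    for alpha in alphas
]
best_alpha = alphas[int(np.argmax(cv_means))]

# Refit on the whole training split with the chosen alpha, then evaluate once.
final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
test_acc = final_tree.score(X_te, y_te)
print(f"best ccp_alpha={best_alpha:.5f}, test accuracy={test_acc:.4f}")
```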
