# Decision Tree
A Decision Tree (DT) is a tree-structured model used for classification or regression.

Each internal node = feature test

Each branch = outcome of the test

Each leaf node = prediction


| Model | Task | Output |
|---|---|---|
| Decision Tree Classifier | Classification | Class label (0, 1, 2…) |
| Decision Tree Regressor | Regression | Continuous value |

How It Works

1. Select the best feature to split (based on criterion)
2. Split data into subsets
3. Repeat recursively (recursive partitioning)
4. Stop when:
   - Maximum depth reached
   - Minimum samples per leaf reached
   - Node is pure

Splitting Criteria
For Classification

Gini Impurity 

$$\text{Gini} = 1 - \sum_i p_i^2$$

Entropy / Information Gain

$$\text{Entropy} = -\sum_i p_i \log_2 p_i$$

For Regression

Variance Reduction / MSE 

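$$V = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2$$

where $\bar{y}$ is the mean target value of the $N$ samples in the node.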



Node split → reduces impurity (classification) or variance (regression)
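As a rough illustration of the three criteria above, the sketch below computes them in plain Python for a small node. The helper names (`gini`, `entropy`, `variance`, `weighted_impurity`) are made up for this example and are not sklearn functions; the gain of a split is the parent impurity minus the size-weighted impurity of the children.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def variance(values):
    """Mean squared deviation from the node mean (regression criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_impurity(left, right, measure):
    """Impurity after a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

parent = [0, 0, 1, 1]
print(gini(parent))                             # 0.5
print(entropy(parent))                          # 1.0
print(weighted_impurity([0, 0], [1, 1], gini))  # 0.0, a perfect split
```

Here the Gini gain of the perfect split is 0.5 - 0.0 = 0.5, the largest gain possible for a balanced two-class node.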

Advantages

Easy to interpret / visualize

Handles numerical & categorical features (sklearn's implementation needs categorical features encoded as numbers first)

Non-linear relationships allowed

No need for feature scaling

Disadvantages

Prone to overfitting

Sensitive to noise

Small changes → large changes in tree structure

Solve overfitting with max_depth, min_samples_split, pruning, ensemble methods

Hyperparameters

max_depth → max levels of tree

min_samples_split → min samples to split a node

min_samples_leaf → min samples at leaf node

max_features → number of features to consider for split

criterion → “gini”, “entropy” (classification) / “squared_error”, “absolute_error” (regression; named “mse” and “mae” in older sklearn versions)



Overfitting & Pruning

Trees tend to fit training data perfectly

Solve using:

max_depth

min_samples_leaf

cost complexity pruning (ccp_alpha in sklearn)

Ensemble methods (Random Forest, Gradient Boosting)
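A possible way to explore cost complexity pruning is sketched below, assuming the iris dataset as a stand-in for real data. `cost_complexity_pruning_path` returns the effective alphas at which subtrees would be pruned; refitting with each alpha shows the tree shrinking. The clipping to zero is a defensive assumption in case rounding produces a tiny negative alpha.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas of the pruning sequence, smallest first.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

leaves = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    leaves.append(tree.get_n_leaves())
    print(f"alpha={alpha:.4f} leaves={leaves[-1]} test_acc={tree.score(X_test, y_test):.3f}")
```

In practice the alpha would be chosen by cross-validation rather than by looking at the test score.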

Feature Importance

DT can provide importance of each feature:

``` print(clf.feature_importances_)```

Useful for feature selection

Q: How does a decision tree decide splits?

By maximizing information gain (classification) or reducing variance (regression).

Q: How to prevent overfitting?

Limit depth, min_samples, or prune the tree.

Q: Difference between classifier and regressor?

Classifier → discrete labels; Regressor → continuous values.

Q: Can Decision Trees handle missing values?

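Not via surrogate splits: sklearn does not implement them. Older sklearn versions need imputation before training; newer releases (1.3+) accept NaN in decision trees directly.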

Q: Are Decision Trees sensitive to feature scaling?

No scaling required.


Q1: How does a Decision Tree decide where to split?

It performs a greedy search over all features and possible thresholds. For each, it calculates the purity gain (Information Gain or Gini Gain for classification, variance reduction for regression). It selects the single feature and threshold that provides the maximum gain at that specific node.
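The greedy search described above can be sketched in a few lines of plain Python. This is a simplified illustration, not sklearn's implementation: `best_split` is a hypothetical helper that tries the midpoint between every pair of consecutive distinct values of every feature and keeps the split with the largest Gini gain.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (gain, feature_index, threshold) of the best single split, or None."""
    n, parent = len(y), gini(y)
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            child = len(left) / n * gini(left) + len(right) / n * gini(right)
            gain = parent - child
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]
print(best_split(X, y))  # (0.5, 0, 0.5): split feature 0 at 0.5
```

A full tree builder would call this function recursively on the left and right subsets until a stopping rule is met.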

Q2: How to prevent overfitting in a Decision Tree?

Pre-pruning (Early Stopping): Restrict tree growth using max_depth, min_samples_split, min_samples_leaf.

Post-pruning: Grow the full tree, then prune back branches that provide little predictive power using Cost Complexity Pruning (ccp_alpha).

Ensemble it: Use the tree as a base learner in Bagging (Random Forest) or Boosting methods, which are far more robust.

Q3: What's the difference between Gini Impurity and Entropy?

Both measure node impurity. Gini is the probability of misclassifying a randomly chosen sample from the node if it were labeled at random according to the node's class distribution. Entropy measures the informational disorder. In practice, they yield very similar results, but Gini is slightly faster to compute as it doesn't require logarithms, which is why it's often the default. Entropy might produce slightly more balanced trees.

Q4: Are Decision Trees sensitive to feature scaling?

No. The splitting rule is based on feature thresholds and ordering, not on magnitude or distance. Scaling does not change the tree's structure.
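One way to check this empirically is sketched below on synthetic data. The `scale` and `shift` arrays are arbitrary values chosen for the example; any positive per-feature scaling preserves the ordering of values, so both trees are expected to split on the same features and agree on new points (up to floating-point edge cases).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)

# Hypothetical rescaling: a positive scale and a shift per feature,
# which preserves the ordering of values within each feature.
scale = np.array([100.0, 0.5, 5.0])
shift = np.array([10.0, -3.0, 0.0])
X_scaled = X * scale + shift

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Compare which feature each node splits on, and the predictions on new points.
same_features = np.array_equal(tree_raw.tree_.feature, tree_scaled.tree_.feature)
X_new = rng.normal(size=(50, 3))
same_predictions = np.array_equal(
    tree_raw.predict(X_new), tree_scaled.predict(X_new * scale + shift)
)
print(same_features, same_predictions)
```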

Q5: Can they handle missing values?

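Older versions of sklearn do NOT natively handle missing values, so you must impute them before training. Newer releases (1.3+) let DecisionTreeClassifier and DecisionTreeRegressor accept NaN directly with the default splitter; check the documentation for your version.
The classic algorithm (CART) can also handle them via surrogate splits, which sklearn does not implement: it finds splits in other features that mimic the primary split, so data with missing values can be routed down the tree.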
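Where imputation is needed, a minimal sketch is to chain an imputer in front of the tree so the same fill values are reused at prediction time. The toy data and the median strategy are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with two missing entries.
X = np.array([[0.0, 0.0], [1.0, np.nan], [1.0, 0.0], [np.nan, 1.0]])
y = np.array([0, 1, 1, 0])

# The imputer learns per-column medians on the training data,
# then the tree is fitted on the filled-in matrix.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(max_depth=2, random_state=0),
)
model.fit(X, y)

pred = model.predict([[1.0, np.nan]])  # NaN is replaced by the learned median
print(pred)
```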

Q6: What are the pros and cons compared to Linear Models?

Pros: No need for scaling, handles non-linearity and interactions automatically, more interpretable visualizations.
Cons: Far more prone to overfitting (high variance), less stable, worse at extrapolation (vs. linear regression).

Q7: How would you handle a categorical variable with many levels (high cardinality)?

This is a weakness. A tree might overfit by giving it high importance. Solutions:

Group rare levels into an "Other" category.

Use target encoding (mean of target per category), but be cautious of leakage.

Use a model better suited for high-cardinality features (like CatBoost).




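```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label equals the first feature.
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]

# A shallow tree using Gini impurity as the split criterion.
clf = DecisionTreeClassifier(max_depth=2, criterion='gini')
clf.fit(X, y)

# Predict the class of a new sample.
print(clf.predict([[1, 0]]))
```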


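```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# One feature, five training points.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.5, 3.2, 2.8, 4.5, 5.1])

reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)

# 6 lies outside the training range: a tree cannot extrapolate,
# so it returns the value of the leaf the sample falls into.
print(reg.predict([[6]]))
```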


# Random Forest

Random Forest is an ensemble learning method that builds many decision trees and combines their results.

Classifier → majority vote (mode)

Regressor → average prediction

Many trees reduce overfitting and improve accuracy

How It Works

1. Create bootstrap samples from training data (sampling with replacement)
2. Train a decision tree on each sample
3. For each split, select a random subset of features
4. Combine predictions:
   - Classification → majority vote
   - Regression → mean value

Steps 1, 2 and 4 are called bagging (Bootstrap Aggregation); the random feature subset in step 3 is what Random Forest adds on top.
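The procedure above can be reproduced by hand with single trees, as in the sketch below. It is a teaching illustration on synthetic data, not a replacement for `RandomForestClassifier`; the number of trees (25) is arbitrary and odd so that a binary vote cannot tie.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" draws a random feature subset at every split.
    seed = int(rng.integers(0, 2**31 - 1))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across trees (binary labels 0/1).
votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (majority == y).mean())
```

Note that sklearn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.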

Why Random Forest?

Reduces overfitting of individual trees

Handles high-dimensional data

Works with numerical & categorical features (categorical features must be encoded first in sklearn)

Provides feature importance

Hyperparameters 

n_estimators → number of trees

max_depth → max depth of each tree

min_samples_split → min samples to split

min_samples_leaf → min samples at leaf

max_features → number of features considered for split

bootstrap → True (default) for bagging

Feature Importance

- Gini Importance (Mean Decrease Impurity)

Importance(f) = Σ (Weighted impurity decrease across all nodes using f)

Sum of impurity reduction when feature is used for splitting

Weighted by number of samples reaching each node

- Permutation Importance

Importance(f) = Baseline Score - Score with feature f shuffled

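Not biased toward high-cardinality features and can be computed on held-out data; still affected by correlated features, since the model can fall back on a correlated feature when one is shuffled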

Measures actual predictive power
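Both importance measures can be compared side by side, as in this sketch on synthetic data (the dataset parameters are arbitrary). `feature_importances_` gives the impurity-based values from training, while `permutation_importance` from `sklearn.inspection` is evaluated here on a held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Mean decrease in impurity (computed on training data, sums to 1).
mdi = forest.feature_importances_

# Drop in test accuracy when each feature is shuffled, averaged over repeats.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for i in range(X.shape[1]):
    print(f"feature {i}: MDI={mdi[i]:.3f} permutation={perm.importances_mean[i]:.3f}")
```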



Advantages

Reduces overfitting vs single tree

Handles high-dimensional & large datasets

Robust to outliers & noise

Can estimate feature importance

Disadvantages

Less interpretable than a single tree

Can be slow for large datasets

Large memory usage



Q: Why Random Forest over Decision Tree?
Reduces overfitting, improves accuracy using bagging.

Q: How does it combine predictions?
Classifier → majority vote, Regressor → average.

Q: How is feature importance calculated?
Mean decrease in impurity or permutation importance.

Q: Can Random Forest handle missing values?
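It depends on the sklearn version: older versions require imputation first; recent releases accept NaN natively (check the documentation for your version).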

Overfitting Control

Limit max_depth

Increase n_estimators

Use max_features < total features

Q1: Why does Random Forest work better than a single Decision Tree?
Three key mechanisms:

Bagging (Bootstrap Aggregation): Reduces variance by averaging multiple models trained on different data samples

Feature Randomness: Each split considers random subset of features → trees become decorrelated → ensemble diversity increases

Ensemble Effect: Errors from individual trees cancel out; correct predictions reinforced

Q2: What's the difference between Bagging and Random Forest?
Bagging: Builds multiple models on bootstrap samples (could be any model)
Random Forest = Bagging + Random Feature Selection

Standard bagging uses all features at each split

RF adds extra randomness by limiting features per split → further reduces correlation between trees

Q3: How do you prevent overfitting in Random Forest?
Control Tree Complexity: max_depth, min_samples_split, min_samples_leaf

Increase Number of Trees: More trees stabilize predictions (but diminishing returns)

Limit Features per Split: max_features = sqrt(n_features) or smaller

Use OOB Score: Monitor out-of-bag error during training

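Early Stopping: Stop adding trees when OOB error plateaus (sklearn has no built-in early stopping for Random Forest; trees can be added incrementally with warm_start=True)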

Q4: What is Out-of-Bag (OOB) error and why is it useful?
OOB Error: Prediction error on samples not included in a tree's bootstrap sample

Each sample is OOB for ~36.8% of trees

Provides free validation without needing separate test set

In sklearn: oob_score=True enables this
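The 36.8% figure follows from the chance that one sample is never drawn in n draws with replacement, (1 - 1/n)^n, which approaches 1/e. The sketch below checks that number and then reads the OOB accuracy from a fitted forest; the synthetic dataset is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Probability that a given sample is left out of one bootstrap sample of size n.
n = 1000
p_oob = (1 - 1 / n) ** n
print(round(p_oob, 3))  # 0.368, close to 1/e

X, y = make_classification(n_samples=n, n_features=10, random_state=0)

# oob_score=True scores each sample using only the trees that did not see it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB accuracy:", forest.oob_score_)
```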

Q5: How does Random Forest handle missing values?
It depends on the implementation:

In sklearn: Older versions require imputation first (median/mode); recent releases accept NaN directly

Surrogate splits (find similar splits using other features): A CART technique that sklearn does not implement

Smart Imputation: Can use proximity matrix from RF to impute missing values iteratively

Q6: When would you NOT use Random Forest?
Interpretability Required: Need clear decision rules

Extrapolation Needed: Predicting outside training range (regression)

Extremely High-dimensional Sparse Data: Like text data (use linear models)

Streaming/Online Learning: RF needs batch training

Memory/Time Constrained: Large forests are resource-intensive

Q7: Can Random Forest feature importance be misleading?
Yes! Important caveats:

Biased toward high-cardinality features: Continuous or many-category features get inflated importance

Correlated features: Importance splits between correlated features

Use permutation importance for more reliable measure

Always validate with domain knowledge or ablation studies

When Tuning:
Start with n_estimators=100, increase until OOB error stabilizes

Tune max_features first (often the most impactful parameter)

Use n_jobs=-1 for parallel training

Monitor OOB score for early stopping
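A sketch of this tuning loop is shown below, assuming synthetic data and an arbitrary grid of forest sizes. With `warm_start=True`, each call to `fit` keeps the existing trees and only trains the additional ones, so the OOB error can be tracked as the forest grows and the loop stopped by hand once it flattens.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=12, random_state=0)

forest = RandomForestClassifier(
    warm_start=True, oob_score=True, n_jobs=-1, random_state=0
)

oob_errors = {}
for n_trees in (50, 100, 200, 400):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)  # only the newly requested trees are trained
    oob_errors[n_trees] = 1 - forest.oob_score_
    print(n_trees, round(oob_errors[n_trees], 4))
```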

Common Pitfalls:
Too many trees without benefit (waste resources)

Forgetting to set random seed (non-reproducible results)

Relying on the default max_features without tuning (might not be optimal; the old 'auto' option has been removed from recent sklearn versions)

Ignoring OOB score as free validation

| Aspect | Decision Tree | Random Forest |
|---|---|---|
| Overfitting | High risk | Much lower risk |
| Interpretability | High (white box) | Low (black box) |
| Prediction Speed | Very fast | Slower (needs all trees) |
| Feature Importance | Yes, but unreliable | More robust |
| Handling Noise | Poor | Good |

Extremely Randomized Trees (ExtraTrees)
Even more randomness: random thresholds for splits (not best threshold)

Faster training, sometimes better performance

Introduces more bias but reduces variance further

Balanced Random Forest
For imbalanced data: each bootstrap sample is drawn so that the classes are balanced (e.g. by undersampling the majority class)

Or use class_weight='balanced' parameter

Quantile Regression Forest
Predicts full distribution, not just mean

Useful for prediction intervals

Bagging

Key Idea: Train multiple models in PARALLEL on different data subsets

Goal: Reduce VARIANCE without increasing bias

Examples: Random Forest, Bagged Trees

Boosting 

Key Idea: Train models SEQUENTIALLY, each correcting previous errors

Goal: Reduce BIAS (and eventually variance)

Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM






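```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: the label equals the first feature.
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]

# 100 shallow trees; random_state makes the bootstrap samples reproducible.
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42)
clf.fit(X, y)

print(clf.predict([[1, 0]]))
# Impurity-based importance of each feature (sums to 1).
print(clf.feature_importances_)
```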


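```python
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# One feature, five training points.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.5, 3.2, 2.8, 4.5, 5.1])

reg = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=42)
reg.fit(X, y)

# The prediction is the average over all trees; 6 is outside the
# training range, so the forest cannot extrapolate beyond seen targets.
print(reg.predict([[6]]))
print(reg.feature_importances_)
```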


# Extra Trees

Extra Trees (Extremely Randomized Trees) is an ensemble of decision trees, similar to Random Forest, but with more randomness.

Classifier → majority vote

Regressor → average prediction

Key idea:

Randomize the split thresholds as well as the feature subsets → lower variance than Random Forest, usually at the cost of slightly higher bias.

How Extra Trees Work

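Use the entire dataset for each tree (sklearn default, bootstrap=False) or, optionally, bootstrap samples

For each node, randomly select a subset of features

Draw a random threshold for each candidate feature and keep the best of these random splits (RF instead searches for the optimal threshold)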

Aggregate predictions:

Classifier → majority vote

Regressor → average

Difference from RF:
RF chooses best split; Extra Trees chooses random split → faster, more diverse trees
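The difference in split selection can be illustrated on a single feature in plain Python. This is a simplified sketch with made-up helper names, not either library implementation: the Random Forest style scans every midpoint, the Extra Trees style draws one threshold uniformly between the feature's minimum and maximum.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(xs, ys, t):
    """Weighted Gini impurity of the two children produced by x <= t."""
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # one feature
ys = [0, 0, 0, 1, 1, 1]

# Random-Forest style: evaluate every midpoint and keep the best one.
midpoints = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
best_t = min(midpoints, key=lambda t: split_impurity(xs, ys, t))

# Extra-Trees style: draw a single threshold uniformly between min and max.
rng = random.Random(0)
random_t = rng.uniform(min(xs), max(xs))

print("best threshold:", best_t, "impurity:", split_impurity(xs, ys, best_t))
print("random threshold:", round(random_t, 3),
      "impurity:", round(split_impurity(xs, ys, random_t), 3))
```

The random threshold needs one impurity evaluation instead of five here, which is where the training speed-up comes from; the price is that it is rarely the optimal cut.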

Advantages

Faster to train than Random Forest

Reduces overfitting due to extreme randomness

Handles high-dimensional data

Works with numerical & categorical features (categorical features must be encoded first in sklearn)

Disadvantages

Slightly less accurate than Random Forest if data is small

Less interpretable than a single tree

Hyperparameters 

n_estimators → number of trees

max_depth → max depth of each tree

min_samples_split → min samples to split

min_samples_leaf → min samples at leaf

max_features → number of features considered per split

bootstrap → True/False (whether each tree trains on a bootstrap sample; False by default in sklearn)


Feature Importance

Built-in, similar to Random Forest:

```print(clf.feature_importances_)```


Use Cases

High-dimensional data (genomics, text)

Fraud detection

Regression / classification tasks requiring fast training

Situations with large datasets

Q: How is Extra Trees different from Random Forest?

Extra randomness → splits chosen randomly → faster & less overfitting.

Q: Classifier vs Regressor?

Classifier → majority vote; Regressor → average prediction.

Q: When to use Extra Trees over RF?

Faster training, reduce variance on large datasets.

Q: Does it require feature scaling?

No scaling needed.

Difference between Random Forest and Extra Trees

Split selection: Random Forest chooses the best split at each node, while Extra Trees selects a random split.

Training speed: Random Forest is slower because it searches for the best split, whereas Extra Trees is faster due to random splits.

Variance: Random Forest has medium variance, while Extra Trees generally has lower variance because of the extra randomness.

Bias: Random Forest has low bias, whereas Extra Trees has slightly higher bias due to random splits.

Overfitting: Random Forest can overfit moderately, while Extra Trees is less prone to overfitting.

Accuracy: The two are usually close; Random Forest often has the edge on small datasets, where optimized split thresholds matter more.

Bias-Variance Decomposition:

Error = Bias² + Variance + Noise

Random Forest: Low Bias, Medium Variance

Extra Trees: Medium Bias, Low Variance

Why Less Variance?

Random Forest trees are correlated because they all search for "best" splits

Extra Trees trees are more diverse due to random thresholds

More diversity → better error cancellation in ensemble

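Expected Error Reduction:

For M trees, each with variance σ² and average pair-wise correlation ρ:

$$\text{Var}(\text{ensemble}) = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2$$

The second term vanishes as M grows, so the correlation term ρσ² sets the floor. Since Extra Trees typically has lower correlation between trees, its ensemble variance is typically lower:

Variance_ExtraTrees < Variance_RandomForest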
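To make the role of correlation concrete, the sketch below evaluates the standard variance formula for an average of M identically distributed estimators with variance σ² and pairwise correlation ρ, namely ρσ² + (1 − ρ)σ²/M. The function name and the ρ values are illustrative assumptions, not measurements of real forests.

```python
def ensemble_variance(sigma2, rho, m):
    """Variance of the mean of m estimators with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / m

# Lower correlation gives a lower floor, no matter how many trees are added.
for rho in (0.1, 0.5):
    values = [ensemble_variance(1.0, rho, m) for m in (1, 10, 100, 1000)]
    print(f"rho={rho}:", [f"{v:.4f}" for v in values])
```

With many trees the variance approaches ρσ², so adding trees beyond that point helps little while reducing correlation still does.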


Q1: Why is Extra Trees faster than Random Forest?
Three reasons:

No split optimization: RF evaluates multiple thresholds per feature, ET picks random threshold

Simpler computation: ET doesn't sort feature values or compute impurity for many splits

Parallel efficiency: While both are parallel, ET has less overhead per split

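Complexity (rough, per tree, assuming a balanced tree with d candidate features per split and n samples): RF: O(d⋅n⋅log² n) because candidate features are sorted at every node, vs ET: O(d⋅n⋅log n) because a random threshold needs no sorting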

Q2: When would Extra Trees perform worse than Random Forest?
Four scenarios:

Very small datasets (< 1000 samples) - RF's optimal splits matter more

Clean, deterministic data - RF can find perfect splits

Features with critical thresholds - ET might miss important cutpoints

Competition settings - RF usually achieves slightly higher accuracy with tuning


Q3: How does the bootstrap parameter affect Extra Trees?

bootstrap=True:
  - Creates diversity through data sampling
  - Enables OOB error estimates
  - Better for variance reduction

bootstrap=False (sklearn default):
  - Uses entire dataset for each tree
  - Lower bias, especially with small datasets
  - No OOB estimates available
  - Faster training (no sampling overhead)
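The two settings can be compared directly, as in this sketch on synthetic data. In sklearn's `ExtraTreesClassifier`, `bootstrap` is False unless set, and `oob_score=True` is only accepted together with `bootstrap=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default: bootstrap=False, every tree sees the full training set.
et_full = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

# bootstrap=True resamples the data per tree and makes OOB estimates available.
et_boot = ExtraTreesClassifier(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=0
).fit(X, y)

print(et_full.bootstrap, et_boot.bootstrap)
print("OOB accuracy:", et_boot.oob_score_)
```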

Q4: Can Extra Trees handle categorical features better than RF?
Sometimes, but not by design (in sklearn both models split on the encoded numeric values):

For high-cardinality categorical features, ET's random splits can be beneficial

RF might overfit to specific category thresholds

ET treats all splits equally randomly

Best practice: Use proper encoding (target encoding for high-cardinality)

Q5: How to choose between sqrt(n_features) and all features for max_features?

Use sqrt(n_features) when:
  - Many irrelevant features
  - Want stronger regularization
  - Training time is concern

Use all features (max_features=None) when:
  - Few features (< 20)
  - Most features are informative
  - Want lower bias
  - Dataset is small
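Rather than relying on the rules of thumb alone, both settings can be compared with cross-validation, as sketched below. The synthetic dataset (30 features, 5 informative) is an assumption chosen to resemble the "many irrelevant features" case; results on real data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, random_state=0
)

scores = {}
for mf in ("sqrt", None):  # None means all features are considered at each split
    model = ExtraTreesClassifier(n_estimators=100, max_features=mf, random_state=0)
    scores[mf] = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_features={mf}: mean CV accuracy = {scores[mf]:.3f}")
```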



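```python
from sklearn.ensemble import ExtraTreesClassifier

# Toy data: the label equals the first feature.
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]

# bootstrap defaults to False, so each tree is trained on all four samples.
clf = ExtraTreesClassifier(n_estimators=100, max_depth=2, random_state=42)
clf.fit(X, y)

print(clf.predict([[1, 0]]))
# Impurity-based importance of each feature (sums to 1).
print(clf.feature_importances_)
```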


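```python
from sklearn.ensemble import ExtraTreesRegressor
import numpy as np

# One feature, five training points.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.5, 3.2, 2.8, 4.5, 5.1])

reg = ExtraTreesRegressor(n_estimators=100, max_depth=3, random_state=42)
reg.fit(X, y)

# The prediction is the average over all trees; 6 is outside the
# training range, so the ensemble cannot extrapolate beyond seen targets.
print(reg.predict([[6]]))
print(reg.feature_importances_)
```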
