# Decision Trees & Random Forests — scikit-learn

**A compact course notebook (classification + regression)**

This notebook contains notes, code examples, visualizations and exercises covering Decision Trees and Random Forests in scikit-learn. It includes both **classification** and **regression** examples, hyperparameter tuning, pipelines, feature importances, evaluation, and common gotchas.

---

**Contents**

1. Quick conceptual notes
2. Setup & imports
3. Part I: Classification with Decision Trees (Iris)
4. Part II: Regression with Decision Trees (synthetic house price)
5. Part III: Random Forests (classification & regression)
6. Part IV: Hyperparameter tuning with GridSearchCV
7. Part V: Practical tips, pitfalls, and exercises



## 1 — Quick conceptual notes (short and sweet)

- **Decision tree**: a flowchart-like model that splits the feature space into regions using feature thresholds. Easy to interpret; can overfit if deep.
- **Random forest**: an ensemble of many decision trees trained on bootstrap samples and feature subsamples; reduces variance and improves generalization.
- **Advantages**: handles numeric and categorical (with encoding), captures non-linear relationships, provides feature importances, little preprocessing required for trees.
- **Disadvantages**: single trees overfit easily, forests are less interpretable and larger; random forests can be slow and memory-heavy.

Key hyperparameters:
- `DecisionTreeClassifier` / `DecisionTreeRegressor`: `max_depth`, `min_samples_split`, `min_samples_leaf`, `criterion` (`gini`/`entropy` for classification, `squared_error` for regression)
- `RandomForestClassifier` / `RandomForestRegressor`: `n_estimators`, `max_features`, plus the tree hyperparameters above; `oob_score=True` gives an out-of-bag estimate of generalization (requires `bootstrap=True`, the default)
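
To make the `criterion` options above concrete, here is a minimal sketch of how Gini impurity and entropy score a candidate split. The helper names (`gini`, `entropy`, `weighted_impurity`) and the toy labels are illustrative assumptions, not scikit-learn APIs; scikit-learn computes these quantities internally.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 (0 means the node is pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(left, right, impurity_fn):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity_fn(left) + len(right) / n * impurity_fn(right)

# Toy parent node: 4 samples of class 0 and 4 of class 1 (hypothetical data)
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A perfect split separates the classes; a useless split leaves both children mixed
perfect_left, perfect_right = parent[:4], parent[4:]
useless_left, useless_right = np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])

gain_perfect = gini(parent) - weighted_impurity(perfect_left, perfect_right, gini)
gain_useless = gini(parent) - weighted_impurity(useless_left, useless_right, gini)

print('Gini(parent):', gini(parent))
print('Entropy(parent):', entropy(parent))
print('Gini decrease, perfect split:', gain_perfect)
print('Gini decrease, useless split:', gain_useless)
```

A tree picks, at each node, the split with the largest impurity decrease, so the perfect split would be preferred here.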



## 2 — Setup & imports

Run the cell below to import required libraries and set a reproducible random seed.

In [None]:
%pip install --upgrade scikit-learn pandas numpy matplotlib seaborn jupyterlab

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_iris, make_regression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, learning_curve
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve,
    r2_score, mean_squared_error, mean_absolute_error
)

# Reproducibility: seeding NumPy's global RNG with 0 makes NumPy-generated random numbers repeatable.
# The scikit-learn estimators below also receive an explicit random_state.
np.random.seed(0)
plt.rcParams['figure.figsize'] = (8, 6)  # default figure size for matplotlib

import sklearn
print('scikit-learn version:', sklearn.__version__)


---

# Part I — Classification with Decision Trees (Iris dataset)

We will:
- Load Iris dataset
- Train a DecisionTreeClassifier
- Visualize the tree
- Evaluate performance
- Inspect feature importances

The goal: each plant comes with measured features and a known, correct label. We train a model on these feature/label pairs, then use it to predict the label of new, unseen plants from their features.



In [None]:
# Load Iris: a classic classification dataset with 3 classes of iris plants and 4 features per sample
iris = load_iris()
X = iris.data  # features
y = iris.target  # target labels
# We keep all 4 features here
feature_names = iris.feature_names
class_names = iris.target_names

# Quick EDA (exploratory data analysis) to get a feel for the dataset
print('X shape:', X.shape)
print('Classes:', class_names)

# Make a DataFrame for convenience
df_iris = pd.DataFrame(X, columns=feature_names)
df_iris['target'] = y
df_iris['target_name'] = df_iris['target'].map(lambda v: class_names[v])
df_iris.head()


In [None]:
# Train/test split: hold out 25% of the data for testing, train on the rest.
# stratify=y keeps each class's proportion the same in train and test as in the full dataset;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Fit a Decision Tree
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)

# Evaluate
y_pred = dt_clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification report:\n', classification_report(y_test, y_pred, target_names=class_names))

# Confusion matrix: rows are true classes, columns are predicted classes.
# Diagonal entries count correct predictions; off-diagonal entries count misclassifications.
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:\n', cm)

# Plot the tree, truncated to depth 3 for readability (deeper subtrees are omitted from the plot)
plt.figure(figsize=(12,8))
plot_tree(dt_clf, feature_names=feature_names, class_names=class_names, filled=True, rounded=True, max_depth=3)
plt.title('Decision Tree (truncated to depth 3)')
plt.show()

# Example output from one run:
#
# Confusion matrix:
#  [[12  0  0]
#  [ 0 12  1]
#  [ 0  3 10]]
#
# Reading it row by row (rows = true class, columns = predicted class):
# - class 0: all 12 samples were correctly classified
# - class 1: 12 samples were correctly classified, 1 was misclassified as class 2
# - class 2: 10 samples were correctly classified, 3 were misclassified as class 1

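The per-class numbers in a classification report can be recovered from a confusion matrix by hand. The sketch below uses the example matrix quoted in the cell above, hard-coded as an assumption (your own run may produce a different matrix), and computes recall, precision, and accuracy with plain NumPy.

```python
import numpy as np

# Example confusion matrix copied from the cell above (rows = true, columns = predicted)
cm = np.array([
    [12, 0, 0],
    [0, 12, 1],
    [0, 3, 10],
])

correct = np.diag(cm)                 # correctly classified samples per class
recall = correct / cm.sum(axis=1)     # share of each true class that was found
precision = correct / cm.sum(axis=0)  # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()   # overall share of correct predictions

for k in range(cm.shape[0]):
    print(f'class {k}: precision={precision[k]:.3f}, recall={recall[k]:.3f}')
print(f'accuracy={accuracy:.3f}')
```

Recall divides by row sums and precision by column sums, which is why class 1 has high recall (12 of 13 found) but lower precision (12 of 15 predictions correct).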

### Inspect feature importances and overfitting

Decision trees often overfit when unconstrained. Let's look at feature importances and compare training vs test accuracy for different `max_depth` values.

In [None]:
import pandas as pd
depths = range(1, 11)
train_scores = []
test_scores = []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=42)
    clf.fit(X_train, y_train)
    train_scores.append(clf.score(X_train, y_train))
    test_scores.append(clf.score(X_test, y_test))

plt.plot(depths, train_scores, label='train')
plt.plot(depths, test_scores, label='test')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.title('Decision Tree: train vs test accuracy by max_depth')
plt.legend();

# Feature importances
fi = pd.Series(dt_clf.feature_importances_, index=feature_names).sort_values(ascending=False)
print('Feature importances:\n', fi)

# Why do this?
# - Overfitting: a tree grown without restrictions (no max_depth, min_samples_leaf, etc.) can keep
#   splitting until it fits the training data almost perfectly. It then memorizes the training set
#   instead of learning the underlying pattern, so it generalizes poorly to unseen data. The gap
#   between the train and test curves above shows how this changes with tree complexity.
# - Feature importances: each feature's score measures how much it reduces impurity (uncertainty)
#   across the splits that use it. The higher the score, the more the tree relies on that feature.
# - Knowing which features drive the predictions helps with interpretation and can guide feature selection.
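
The impurity-based definition above can be checked by hand. This is a minimal sketch, assuming a tree fitted on the full Iris data (a different fit from `dt_clf` above), that walks the fitted tree's internal arrays (`tree_.impurity`, `tree_.children_left`, and so on), sums the weighted impurity decrease per feature, and compares the result to `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
clf_demo = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
t = clf_demo.tree_

manual_importance = np.zeros(X_demo.shape[1])
n = t.weighted_n_node_samples
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no impurity decrease
        continue
    # Impurity decrease of this split, weighted by how many samples reach each node
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right])
    manual_importance[t.feature[node]] += decrease

manual_importance /= manual_importance.sum()  # normalize so the scores sum to 1

print('manual      :', np.round(manual_importance, 4))
print('scikit-learn:', np.round(clf_demo.feature_importances_, 4))
```

Because these scores come only from the training splits, they describe what the fitted tree used, not necessarily what generalizes.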

---

# Part II — Regression with Decision Trees (Synthetic House Price)

We will:
- Create a synthetic dataset representing House Price vs SquareFeet and extra noise/features
- Fit a DecisionTreeRegressor
- Visualize predictions and residuals
- Compare to linear baseline


In [None]:
# Create a synthetic dataset (house price-like)
from sklearn.datasets import make_regression
X_reg, y_reg = make_regression(n_samples=300, n_features=3, n_informative=2, noise=30.0, random_state=42)
# Rescale the first feature to look like square footage: mean ~1000, std ~200, floored at 300
# (so most values fall roughly between 400 and 1600 sqft)
X_reg[:,0] = np.clip((X_reg[:,0] * 200) + 1000, 300, None)

# Put in a DataFrame
df_house = pd.DataFrame(X_reg, columns=['SquareFeet','Feature2','Feature3'])
df_house['Price'] = y_reg

df_house.head()


In [None]:
# Train/test split
Xr = df_house[['SquareFeet','Feature2','Feature3']].values
yr = df_house['Price'].values
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.2, random_state=42)

# Fit Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(Xr_train, yr_train)

# Predictions & evaluation
yr_pred = dt_reg.predict(Xr_test)
print('R2:', r2_score(yr_test, yr_pred))
print('MAE:', mean_absolute_error(yr_test, yr_pred))

# Plot true vs predicted
plt.scatter(yr_test, yr_pred, alpha=0.6)
plt.plot([yr_test.min(), yr_test.max()], [yr_test.min(), yr_test.max()], 'k--')
plt.xlabel('True Price')
plt.ylabel('Predicted Price')
plt.title('Decision Tree Regressor: True vs Predicted')
plt.show()

# Visualize tree (truncated depth)
plt.figure(figsize=(14,8))
plot_tree(dt_reg, feature_names=['SquareFeet','Feature2','Feature3'], filled=True, rounded=True, max_depth=3)
plt.title('Decision Tree Regressor (truncated to depth 3)')
plt.show()


### Compare to a linear regression baseline
A linear model is a useful baseline to compare against tree-based models.

In [None]:
from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(Xr_train, yr_train)

y_lin_pred = lin.predict(Xr_test)
print('Linear R2:', r2_score(yr_test, y_lin_pred))
print('Linear MAE:', mean_absolute_error(yr_test, y_lin_pred))

# Plot comparison
plt.scatter(yr_test, yr_pred, alpha=0.6, label='DecisionTree')
plt.scatter(yr_test, y_lin_pred, alpha=0.6, label='Linear')
plt.plot([yr_test.min(), yr_test.max()], [yr_test.min(), yr_test.max()], 'k--')
plt.legend()
plt.xlabel('True Price')
plt.ylabel('Predicted Price')
plt.title('Model comparison: Decision Tree vs Linear')
plt.show()
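
One reason the linear baseline is worth checking here: the target from `make_regression` is a linear combination of the features plus noise, while a tree predicts a piecewise-constant function. The sketch below illustrates that on hypothetical 1-D data (the slope, noise level, and variable names are assumptions chosen for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200).reshape(-1, 1)
y_demo = 3.0 * x_demo.ravel() + rng.normal(0, 1.0, size=200)  # linear trend + noise

tree_demo = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_demo, y_demo)
lin_demo = LinearRegression().fit(x_demo, y_demo)

# A depth-3 tree has at most 2**3 = 8 leaves, so at most 8 distinct predicted values
grid = np.linspace(0, 10, 500).reshape(-1, 1)
n_levels = len(np.unique(tree_demo.predict(grid)))
print('distinct tree predictions on the grid:', n_levels)

# Outside the training range the tree repeats its last leaf value; the line keeps going
x_far = np.array([[20.0]])
tree_far = tree_demo.predict(x_far)[0]
lin_far = lin_demo.predict(x_far)[0]
print('tree prediction at x=20:  ', tree_far)
print('linear prediction at x=20:', lin_far)
```

On genuinely non-linear data the comparison can flip, which is why both models are fitted and compared rather than assumed.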


---

# Part III — Random Forests (Classification & Regression)

Random forests reduce variance by averaging many decorrelated trees. We'll train and evaluate both classifier and regressor.
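
The "averaging" can be seen directly: for a regressor, the forest's prediction is the mean of its individual trees' predictions. A minimal sketch on separate synthetic data (the `*_demo` names and dataset settings are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=30.0, random_state=42)
rf_demo = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree lives in rf_demo.estimators_; stack their predictions (trees x samples)
per_tree = np.stack([tree.predict(X_demo) for tree in rf_demo.estimators_])

forest_pred = rf_demo.predict(X_demo)
mean_of_trees = per_tree.mean(axis=0)
spread = per_tree.std(axis=0).mean()  # how much individual trees disagree, on average

print('forest equals mean of trees:', np.allclose(forest_pred, mean_of_trees))
print('average std across trees:', spread)
```

The non-zero spread shows that individual trees disagree; averaging smooths out that disagreement, which is the variance reduction.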


In [None]:
# Random Forest Classification (Iris)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
yp_rf = rf_clf.predict(X_test)
print('RF Accuracy (Iris):', accuracy_score(y_test, yp_rf))
print('\nClassification report:\n', classification_report(y_test, yp_rf, target_names=class_names))

# Feature importances
fi_rf = pd.Series(rf_clf.feature_importances_, index=feature_names).sort_values(ascending=False)
print('\nRandom Forest feature importances:\n', fi_rf)

# OOB score example (train a new RF with oob)
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42, n_jobs=-1)
rf_oob.fit(X_train, y_train)
print('\nOOB score:', rf_oob.oob_score_)


In [None]:
# Random Forest Regression (house data)
rf_reg = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
rf_reg.fit(Xr_train, yr_train)
yrf_pred = rf_reg.predict(Xr_test)
print('RandomForest R2:', r2_score(yr_test, yrf_pred))
print('RandomForest MAE:', mean_absolute_error(yr_test, yrf_pred))

# Feature importances
fi_rf_reg = pd.Series(rf_reg.feature_importances_, index=['SquareFeet','Feature2','Feature3']).sort_values(ascending=False)
print('\nRF regressor feature importances:\n', fi_rf_reg)

# Compare predictions scatter
plt.scatter(yr_test, yrf_pred, alpha=0.6)
plt.plot([yr_test.min(), yr_test.max()], [yr_test.min(), yr_test.max()], 'k--')
plt.xlabel('True Price')
plt.ylabel('RF Predicted Price')
plt.title('Random Forest Regressor: True vs Predicted')
plt.show()


### Learning curves (to diagnose bias vs variance)
Let's plot learning curves for a Random Forest classifier on Iris to see how the training and cross-validation scores change as the training set grows.

In [None]:
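# Learning curve example (Iris with RF)
# Iris is ordered by class, so we pass shuffle=True: without it, the smallest training
# subsets are taken from the start of each fold and would contain samples from a single class.
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1, shuffle=True, random_state=42
)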
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
plt.plot(train_sizes, train_scores_mean, 'o-', color='C0', label='Training score')
plt.plot(train_sizes, test_scores_mean, 'o-', color='C1', label='Cross-validation score')
plt.xlabel('Training examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

---

# Part IV — Hyperparameter tuning (GridSearchCV) — classification example

We'll tune `max_depth` and `min_samples_leaf` for a Decision Tree classifier using grid search with cross-validation.


In [None]:
param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 6],
    'min_samples_leaf': [1, 2, 4, 6]
}

gs = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, n_jobs=-1, scoring='accuracy')
gs.fit(X_train, y_train)
print('Best params:', gs.best_params_)
print('Best CV score:', gs.best_score_)

best_dt = gs.best_estimator_
print('\nTest accuracy of best DT:', best_dt.score(X_test, y_test))
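
Beyond `best_params_`, the fitted search object keeps the score of every parameter combination in `cv_results_`. The sketch below, using a smaller hypothetical grid on the full Iris data, loads those results into a DataFrame to see how close the runner-up settings are.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
demo_grid = {'max_depth': [2, 3, None], 'min_samples_leaf': [1, 4]}  # illustrative grid

gs_demo = GridSearchCV(DecisionTreeClassifier(random_state=42), demo_grid, cv=5, scoring='accuracy')
gs_demo.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = (
    pd.DataFrame(gs_demo.cv_results_)
    [['param_max_depth', 'param_min_samples_leaf', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(results)
```

If several combinations score within one standard deviation of the best, the simpler tree is often a reasonable choice.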


### Hyperparameter tuning notes
- For Random Forests, common params to tune: `n_estimators`, `max_features`, `max_depth`, `min_samples_leaf`, `bootstrap`.
- Use `n_jobs=-1` to parallelize (if CPU/memory allows).
- Consider `RandomizedSearchCV` for large parameter spaces.
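
As a rough illustration of the last point, here is a minimal `RandomizedSearchCV` sketch for a Random Forest classifier on Iris. The parameter ranges, `n_iter`, and `cv` values are assumptions picked to keep the run short, not recommended settings.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Distributions are sampled at random; lists are sampled uniformly
param_distributions = {
    'n_estimators': randint(20, 101),
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 2, 4, 6],
    'min_samples_leaf': randint(1, 7),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,          # number of random combinations to try
    cv=3,
    scoring='accuracy',
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(Xd_train, yd_train)

print('Best params:', search.best_params_)
print('Best CV score:', search.best_score_)
print('Test accuracy:', search.best_estimator_.score(Xd_test, yd_test))
```

Unlike a grid, the cost is fixed by `n_iter` regardless of how many parameters or values are listed.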



# Part V — Practical tips, pitfalls, and exercises

**Tips & Pitfalls**

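- Trees do not need feature scaling, because splits depend only on the ordering of feature values. If a pipeline also feeds scale-sensitive models (e.g. linear or distance-based ones), scale the features for those.
- Trees overfit easily; use `max_depth`, `min_samples_leaf`, and `min_samples_split` to regularize.
- Random forests reduce variance, not bias: if every tree is biased in the same way (e.g. all too shallow), averaging won't fix it.
- `oob_score=True` gives a convenient generalization estimate for forests trained with bootstrap sampling (`bootstrap=True`, the default): each sample is scored only by trees that did not see it during training.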
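
Why out-of-bag scoring has data to work with: a bootstrap sample of size n drawn with replacement leaves out roughly 1/e (about 37%) of the original rows. A small NumPy simulation, with `n_samples` chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Draw one bootstrap sample: n_samples row indices, with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

in_bag_fraction = np.unique(bootstrap_idx).size / n_samples
oob_fraction = 1.0 - in_bag_fraction

print(f'in-bag fraction:     {in_bag_fraction:.3f}')
print(f'out-of-bag fraction: {oob_fraction:.3f}')
print(f'theoretical 1/e:     {np.exp(-1):.3f}')
```

Each tree therefore has its own held-out rows, and the forest's OOB score aggregates predictions over them.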

**Exercises**

1. Try training a `DecisionTreeClassifier(max_depth=3)` and visualize the tree fully (not truncated).
2. For the regression dataset, remove `Feature2` and `Feature3` and see how well the tree and forest do using just `SquareFeet`.
3. Use `RandomizedSearchCV` on the RandomForestRegressor to tune `n_estimators`, `max_features`, `max_depth`, and compare results vs default.
4. Implement a simple feature importance plot function and use it to compare tree vs forest importance rankings.

---

_End of notebook — you're ready to experiment!_
