<div style="text-align:left;">
  <a href="https://code213.tech/" target="_blank">
    <img src="../images/code213.PNG" alt="QWorld">
  </a>
  <p><em>prepared by Latreche Sara</em></p>
</div>

Step 1: Setting up the environment

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14)
matplotlib.rc('ytick', labelsize=14)
#matplotlib.rcParams.update({'figure.autolayout': True})
matplotlib.rcParams['figure.dpi'] = 300

In [None]:
from scipy import stats  # provides stats.spearmanr, used in the plots below
from sklearn import metrics
from sklearn.model_selection import cross_validate, KFold, cross_val_predict, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, GradientBoostingRegressor

In [None]:
# Step 1: Import the pandas library to handle tabular data.
import pandas as pd

# Step 2: Read the input feature file.
# The columns are separated by tabs, so pass sep='\t' to read_csv.
# File name: 'sel_features.csv'
sel_features = pd.read_csv('sel_features.csv', sep='\t')

# Step 3: Read the target redshift values.
# The default settings work here because the file is comma-separated.
# File name: 'sel_target.csv'
sel_target = pd.read_csv('sel_target.csv')

# Step 4: Check the shape (number of rows and columns) of the sel_features data.
# This verifies how many galaxy observations and feature columns we have.
sel_features.shape


In [None]:
# Convert the target DataFrame to a 1D NumPy array using .values.ravel()
# This flattens the single redshift column into a 1D array, which is the format scikit-learn expects for the target
# Use this for your target variable y
y = sel_target.values.ravel()


In [None]:
# Step 1: Import AdaBoostRegressor from sklearn.ensemble
# Step 2: Create an instance of AdaBoostRegressor using default parameters
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor()

# Step 3: Import cross_val_predict from sklearn.model_selection
# Step 4: Use cross_val_predict to perform 5-fold cross-validation
#         - Pass in the model, the feature data (sel_features), and the flattened target
#         - Use KFold with 5 splits, shuffle=True, and a fixed random_state for reproducibility
#           (the value 10 is arbitrary)
#         - Store the cross-validated predictions in ypred
from sklearn.model_selection import cross_val_predict, KFold
ypred = cross_val_predict(model, sel_features, sel_target.values.ravel(),
                          cv=KFold(n_splits=5, shuffle=True, random_state=10))

# Step 5: Import matplotlib.pyplot as plt
# Step 6: Set the plot size to 7x7 inches (the figure must be created before plotting)
import matplotlib.pyplot as plt
plt.figure(figsize=(7, 7))

# Step 7: Create a scatter plot of true redshift values vs predicted values
#         - Use sel_target for the x-axis and ypred for the y-axis
#         - Use a small point size (s=10) for clarity
plt.scatter(sel_target, ypred, s=10)

# Step 8: Set the x-axis and y-axis limits from 0 to 3 to focus on the most relevant redshift range
plt.xlim(0, 3)
plt.ylim(0, 3)


What it visually tells you:
It helps you quickly evaluate how well your model is performing. If the points are close to the diagonal, it means your model is accurately predicting redshifts. If they are spread out, especially far from that line, there's room for improvement.

### This is where we started wondering whether the boosting process was working


In [None]:
# Use the method get_params() on the AdaBoostRegressor model to retrieve its current parameters.
# This shows the default or modified settings such as number of estimators and learning rate.
# It helps you understand or tweak how the model behaves.
model.get_params()

Here we use AdaBoostRegressor, which by default uses shallow decision trees (max_depth=3) as weak learners.
AdaBoost builds an ensemble by sequentially training these weak learners and adaptively reweighting training samples,
focusing more on those samples that previous learners mispredicted.
This is different from naive stacking; the adaptive boosting process is key to its performance.
This approach is inspired by the scikit-learn AdaBoost example:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html
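
As a quick check of the statement about the default weak learner, the sketch below fits a default AdaBoostRegressor on a small synthetic sine curve and inspects the fitted trees. The data and the names `X_demo`, `y_demo`, and `ada_demo` are made up for illustration and are not part of the photo-z data set.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic 1D regression problem (illustrative only)
rng = np.random.RandomState(0)
X_demo = np.linspace(0, 4, 200)[:, np.newaxis]
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, X_demo.shape[0])

# Default base learner, five boosting rounds
ada_demo = AdaBoostRegressor(n_estimators=5, random_state=0).fit(X_demo, y_demo)

# Depth setting of every fitted weak learner
depths = [tree.get_params()['max_depth'] for tree in ada_demo.estimators_]

# staged_predict yields one ensemble prediction per fitted stage
n_stages = len(list(ada_demo.staged_predict(X_demo)))

print('max_depth of each weak learner:', depths)
print('weight of each weak learner:   ', ada_demo.estimator_weights_[:n_stages])
```

The weights printed on the last line are what the adaptive procedure assigns to each tree when combining them; they are generally not equal, which is one visible difference from a plain average of trees.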


#### This is what happens if max_depth = 3.


In [None]:
# Create the dataset
plt.figure(figsize=(15,10))

rng = np.random.RandomState(1)
X = np.linspace(0, 4, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

weakl = DecisionTreeRegressor(max_depth=3)

# Fit regression model, saving each "stage"

regr_1 = weakl  # a single tree, used as the one-estimator baseline
regr_2 = AdaBoostRegressor(weakl,
                          n_estimators=2, random_state=rng)

regr_3 = AdaBoostRegressor(weakl,
                          n_estimators=3, random_state=rng)

regr_4 = AdaBoostRegressor(weakl,
                          n_estimators=4, random_state=rng)

regr_10 = AdaBoostRegressor(weakl,
                          n_estimators=10, random_state=rng)

regr_100 = AdaBoostRegressor(weakl,
                          n_estimators=100, random_state=rng)


regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)
regr_4.fit(X, y)
regr_10.fit(X, y)
regr_100.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
y_3 = regr_3.predict(X)
y_4 = regr_4.predict(X)
y_10 = regr_10.predict(X)

# r2_score expects (y_true, y_pred); the score is not symmetric in its arguments
for yp in [y_1,y_2,y_3,y_4,y_10]:
    print('r2 score: ', np.round(metrics.r2_score(y,yp),3))

# Plot the results

plt.scatter(X, y, c="k", s=10,label="training samples")
plt.plot(X, y_1, "-g", label="n_estimators=1", linewidth=1)
plt.plot(X, y_2, "--r", label="n_estimators=2", linewidth=1)
plt.plot(X, y_3, "-.b", label="n_estimators=3", linewidth=1)
plt.xlabel("data")
plt.ylabel("target")
plt.title("AdaBoost Regression, max depth = 3", fontsize = 14)
plt.legend(fontsize=10);
#plt.tight_layout()
#plt.savefig("AdaBoost_3.png")

#### This is what happens if max_depth = 6


In [None]:
# Create the dataset
plt.figure(figsize=(15,10))

rng = np.random.RandomState(1)
X = np.linspace(0, 4, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

weakl = DecisionTreeRegressor(max_depth=6)

# Fit regression model, saving each "stage"
regr_1 = weakl  # a single tree, used as the one-estimator baseline
regr_2 = AdaBoostRegressor(weakl,
                          n_estimators=2, random_state=rng)

regr_3 = AdaBoostRegressor(weakl,
                          n_estimators=3, random_state=rng)

regr_4 = AdaBoostRegressor(weakl,
                          n_estimators=4, random_state=rng)

regr_10 = AdaBoostRegressor(weakl,
                          n_estimators=10, random_state=rng)

regr_100 = AdaBoostRegressor(weakl,
                          n_estimators=100, random_state=rng)


regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)
regr_4.fit(X, y)
regr_10.fit(X, y)
regr_100.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
y_3 = regr_3.predict(X)
y_4 = regr_4.predict(X)
y_10 = regr_10.predict(X)

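# r2_score expects (y_true, y_pred); the score is not symmetric in its arguments
for yp in [y_1,y_2,y_3,y_4,y_10]:
    print('r2 score: ', np.round(metrics.r2_score(y,yp),3))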

# Plot the results

plt.scatter(X, y, c="k", s=10,label="training samples")
plt.plot(X, y_1, "-g", label="n_estimators=1", linewidth=1)
plt.plot(X, y_2, "--r", label="n_estimators=2", linewidth=1)
plt.plot(X, y_3, "-.b", label="n_estimators=3", linewidth=1)
plt.xlabel("data")
plt.ylabel("target")
plt.title("AdaBoost Regression, max depth = 6", fontsize = 14)
plt.legend(fontsize=10);
#plt.tight_layout()
#plt.savefig("AdaBoost_6.png")
plt.show()

We can now go back to the photo-z determination problem.

We create a train/test split because we need to call the ".fit" method before we can use the "staged_predict" method, which lets us examine how the prediction changes at each boosting stage.
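
To illustrate what "staged_predict" returns before applying it to the redshift data, here is a minimal sketch on synthetic data. The names `X_toy`, `y_toy`, and `toy_model` are hypothetical, and the settings (10 stages, max_depth=3) are chosen only to keep the example small.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the real features and targets
rng = np.random.RandomState(1)
X_toy = rng.uniform(0, 4, size=(300, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, 300)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

toy_model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                              n_estimators=10, random_state=0)
toy_model.fit(Xtr, ytr)

# staged_predict is a generator: it yields the ensemble prediction after
# 1, 2, ..., k stages, so one pass gives the score at every stage
staged_r2 = [metrics.r2_score(yte, yp) for yp in toy_model.staged_predict(Xte)]

# The prediction of the final stage is the same as the model's usual predict()
last_stage = list(toy_model.staged_predict(Xte))[-1]

for k, score in enumerate(staged_r2, start=1):
    print(f'stage {k:2d}: test R2 = {score:.3f}')
```

Iterating over the generator once, as above, avoids rebuilding the full list of staged predictions for every stage.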

In [None]:
# Step 1: Split the dataset into training and testing sets
# - Use the train_test_split method to divide sel_features and sel_target.values.ravel() into training and test parts
# - Set test_size=0.3 to allocate 30% of data for testing and 70% for training
# - Use random_state=42 to ensure reproducibility of the split
# We do this so we can train the model on training data and later evaluate on unseen test data
# This also allows us to call ".fit" on the training data, which is required before using "staged_predict"
X_train, X_test, y_train, y_test = train_test_split(
    sel_features, sel_target.values.ravel(), test_size=0.3, random_state=42)


We begin with a very weak learner:

In [None]:
# Step 2: Create an AdaBoost regressor model
# - Use AdaBoostRegressor from sklearn.ensemble
# - Set the base estimator to be a DecisionTreeRegressor with max_depth=3 (a weak learner)
# - Set n_estimators=30 to use 30 boosting rounds (number of weak learners combined)
# This model sequentially trains up to 30 decision trees, each focusing on the samples the previous ones mispredicted
model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=30)


In [None]:
# Step 3: Train (fit) the AdaBoost model on the training data
# - Use the .fit() method on the model object
# - Pass the training features (X_train) and training targets (y_train)
# This trains the ensemble of weak learners on the training set to learn the relationship between features and target
model.fit(X_train, y_train)


We can plot the R2 score and the Spearman correlation coefficient between true and predicted values as a function of the number of stages/iterations.


## Definitions

### R² Score (Coefficient of Determination)
- **What is it?**  
  Measures how well the predicted values explain the variability of the actual data.

- **Formula:**  
  $$
  R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}
  $$
  where:  
  $y_i$ = true value,  
  $\hat{y}_i$ = predicted value,  
  $\bar{y}$ = mean of true values.

- **Interpretation:**  
  - 1 = perfect prediction  
  - 0 = model predicts as well as mean  
  - Negative = worse than mean prediction
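
The formula can be checked by hand on a few numbers. The sketch below uses four made-up values (not taken from the redshift data) and compares the manual result with `metrics.r2_score`; it also shows that swapping the two arguments changes the score, so the true values must come first.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted values, for illustration only
y_true = np.array([0.5, 1.0, 1.5, 2.0])
y_hat = np.array([0.6, 0.9, 1.4, 2.3])

ss_res = np.sum((y_true - y_hat) ** 2)          # numerator: residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # denominator: spread around the mean
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = metrics.r2_score(y_true, y_hat)    # signature is (y_true, y_pred)
r2_swapped = metrics.r2_score(y_hat, y_true)    # wrong order: a different number

print(r2_manual, r2_sklearn, r2_swapped)
```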

---

### Spearman Correlation Coefficient ($\rho$)
- **What is it?**  
  Measures the strength and direction of the monotonic relationship between two ranked variables.

- **Formula:**  
  $$
  \rho = 1 - \frac{6 \sum d_i^2}{n (n^2 - 1)}
  $$
  where:  
  $d_i = R(x_i) - R(y_i)$ is the difference between the ranks of the $i$-th pair of values,  
  $n$ = number of data points.  
  This closed form is exact only when there are no tied ranks.

- **Interpretation:**  
  - 1 = perfect positive correlation of ranks  
  - 0 = no correlation  
  - -1 = perfect negative correlation of ranks
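
A small worked example of the rank formula, again with made-up numbers that contain no ties, compared against `scipy.stats.spearmanr`:

```python
import numpy as np
from scipy import stats

# Made-up values without ties, for illustration only
x = np.array([0.2, 0.9, 1.4, 2.1, 2.8])
y_hat = np.array([0.3, 1.1, 0.8, 2.5, 2.0])

rank_x = stats.rankdata(x)      # ranks 1..n of the first variable
rank_y = stats.rankdata(y_hat)  # ranks 1..n of the second variable
d = rank_x - rank_y             # rank differences d_i
n = len(x)

rho_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
rho_scipy = stats.spearmanr(x, y_hat)[0]

print(rho_manual, rho_scipy)
```

Here two neighbouring pairs are swapped in rank, so the sum of squared rank differences is 4 and both computations give 0.8.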

---


In [None]:
n_estimators = 30

# Compute the staged predictions once instead of rebuilding the list for every stage.
# AdaBoost can stop before n_estimators, so the x-axis uses the number of stages actually produced.
staged = list(model.staged_predict(X_test))
stages = range(len(staged))

plt.plot(stages, [metrics.r2_score(y_test, yp) for yp in staged], label = 'r2 score')

plt.plot(stages, [stats.spearmanr(y_test, yp)[0] for yp in staged], label = 'Spearman r')

plt.xlabel('Iteration')

plt.ylim(0,1.0)

plt.title('Max depth = 3')
plt.legend();

### The scores don't seem to improve as we stack more estimators.

### We can now try again with a stronger base learner (max_depth = 6).


In [None]:
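# Number of boosting stages (iterations)
n_estimators = 30

# Define the AdaBoost regressor model
# Using a DecisionTreeRegressor with max_depth=6 as the base learner (stronger tree)
model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6), n_estimators=n_estimators)

# Split the data into training and testing sets (70% train, 30% test)
# Use random_state=42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    sel_features, sel_target.values.ravel(), test_size=.3, random_state=42)

# Fit the AdaBoost model on the training data
model.fit(X_train, y_train)

# Compute the test-set prediction after each boosting stage (once)
staged = list(model.staged_predict(X_test))

# For each boosting iteration (stage), calculate the R2 score on the test set predictions
r2_scores = [metrics.r2_score(y_test, yp) for yp in staged]

# For each boosting iteration, calculate the Spearman correlation between true and predicted test values
spearman_scores = [stats.spearmanr(y_test, yp)[0] for yp in staged]

# Plot the R2 score and Spearman correlation as functions of the boosting iteration number
plt.plot(range(len(r2_scores)), r2_scores, label='R2 Score')
plt.plot(range(len(spearman_scores)), spearman_scores, label='Spearman Correlation')

# Add labels and title to the plot
plt.xlabel('Iteration')
plt.title('AdaBoost Regression Performance\nBase Estimator: Decision Tree (max_depth=6)')
plt.legend()
plt.show()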


### And do the same with an even stronger base learner (max_depth = 10).

In [None]:
n_estimators = 30

# The base learner is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model= AdaBoostRegressor(DecisionTreeRegressor(max_depth=10),
                  n_estimators=n_estimators)

X_train, X_test, y_train, y_test = \
        train_test_split(sel_features,sel_target.values.ravel(), test_size=.3, random_state=42)

model.fit(X_train, y_train)

# Compute the staged predictions once; AdaBoost may stop before n_estimators
staged = list(model.staged_predict(X_test))

plt.plot(range(len(staged)), [metrics.r2_score(y_test,yp) for yp in staged], label = 'r2')

plt.plot(range(len(staged)), [stats.spearmanr(y_test,yp)[0] for yp in staged], label = 'Spearman r')

plt.xlabel('Iteration')

plt.title('Base estimator, max depth = 10')

plt.legend();

### 1. R2 Score Plot

- **What it shows:**  
  The R2 score measures how well your model's predictions match the actual data in terms of explained variance.
  - R2 = 1 means perfect prediction.
  - R2 = 0 means the model predicts no better than simply predicting the mean.
  - Negative R2 means the model performs worse than predicting the mean.

- **How to interpret:**  
  - If the R2 score increases as the number of iterations grows, the model is learning and improving its predictions.
  - If the R2 score plateaus or decreases, adding more trees doesn't improve the model and may even harm it (overfitting or fitting noise).
  - A smooth increase followed by flattening is typical: the model improves initially but eventually gains less from more estimators.

### 2. Spearman Correlation Plot

- **What it shows:**  
  The Spearman correlation measures how well the predicted values preserve the order or ranking of the actual values (regardless of exact numeric values).
  - Spearman = 1 means perfect rank agreement.
  - Spearman = 0 means no correlation in ranks.

- **How to interpret:**  
  - An increasing Spearman correlation means your model is getting better at correctly ranking the outputs, which matters when the order is more important than the exact values (e.g., prioritizing cases).
  - Like R2, a plateau or decrease suggests limited or negative returns from more estimators.

### Putting It Together

- If both R2 and Spearman increase over iterations, the model is improving both in prediction accuracy and ranking quality.
- If R2 improves but Spearman doesn't (or vice versa), the model is improving in one respect (numeric accuracy or ranking) but not the other.
- If both flatten or decline, adding more trees isn't helping and might lead to overfitting.
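
One possible way to make "plateau" concrete is to look for the earliest stage whose score is already within a small tolerance of the best score. The helper `first_stage_near_best` below is hypothetical (it is not part of scikit-learn), the tolerance of 0.01 is an arbitrary choice, and the score list is made up to show the mechanics rather than taken from any model.

```python
import numpy as np

def first_stage_near_best(scores, tol=0.01):
    """Index of the earliest stage whose score is within tol of the best score."""
    scores = np.asarray(scores, dtype=float)
    # argmax on a boolean array returns the position of the first True
    return int(np.argmax(scores >= scores.max() - tol))

# Made-up per-stage R2 values that rise and then flatten
example_r2 = [0.40, 0.62, 0.71, 0.745, 0.75, 0.748, 0.749]
plateau_start = first_stage_near_best(example_r2)

print('scores stop improving noticeably at stage index', plateau_start)
```

The same function could be applied to the list of per-stage test scores computed from "staged_predict".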

### Let's combine all of them in one figure.

In [None]:
plt.figure(figsize=(12,4))

n_estimators = 30

for i, md in enumerate([3,6,10]):

    # The base learner is passed positionally because the keyword was renamed
    # from base_estimator to estimator in scikit-learn 1.2
    model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=md),
                  n_estimators=n_estimators)

    model.fit(X_train,y_train)

    # Compute the staged predictions once; AdaBoost may stop before n_estimators
    staged = list(model.staged_predict(X_test))

    plt.subplot(1,3,i+1)

    plt.plot(range(len(staged)), [metrics.r2_score(y_test,yp) for yp in staged], label = 'r2 score', c = 'steelblue')

    plt.plot(range(len(staged)), [stats.spearmanr(y_test,yp)[0] for yp in staged], label = 'Spearman r', c = 'fuchsia')

    plt.xlabel('Iteration')

    plt.ylim(0,1.0)

    plt.title('Max depth = '+str(md)+', AdaBoost')

    if i == 2:
        plt.legend();

    plt.tight_layout()

#plt.savefig('AdaB_performance.png')

### We sort of have an answer from the third panel of the figure above, but we could also ask whether we should keep boosting (i.e., whether adding more stages is beneficial).

In [None]:
#Shall we keep boosting? (max_depth = 10)

n_estimators = 60

# The base learner is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model= AdaBoostRegressor(DecisionTreeRegressor(max_depth=10),
                  n_estimators=n_estimators)

X_train, X_test, y_train, y_test = \
        train_test_split(sel_features,sel_target.values.ravel(), test_size=.3, random_state=42)

model.fit(X_train, y_train)

# Compute the staged predictions once; AdaBoost may stop before n_estimators
staged = list(model.staged_predict(X_test))

plt.plot(range(len(staged)), [metrics.r2_score(y_test,yp) for yp in staged], label = 'r2')

plt.plot(range(len(staged)), [stats.spearmanr(y_test,yp)[0] for yp in staged], label = 'Spearman r')

plt.xlabel('Iteration')

plt.title('Base estimator, max depth = 10')

plt.legend();

### Conclusions:

1.   Stacking learners that are too weak doesn't help;
2.   There is a plateau in the boosting stages so that adding more estimators is not beneficial.


### Would this be true also for Gradient Boosted Trees algorithms?

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

The parameters depend on the particular implementation.

In the sklearn formulation, the parameters of each tree are essentially the same as those we have for Random Forests; additionally we have the "learning_rate" parameter, which dictates how much each tree contributes to the final estimator, and the "subsample" parameter, which allows one to fit each tree on a fraction (< 1.0) of the samples.
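
To get a feel for "learning_rate", the sketch below fits two GradientBoostingRegressor models that differ only in that parameter and records the training R2 after every stage. The two values (0.05 and 0.5), the 20 stages, and the names `X_toy`, `y_toy`, and `train_r2` are arbitrary choices for illustration; "subsample" is left at 1.0 so that every tree sees all samples.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor

# Toy data: the sum of two sine waves plus a little noise
rng = np.random.RandomState(1)
X_toy = np.linspace(0, 4, 100)[:, np.newaxis]
y_toy = np.sin(X_toy).ravel() + np.sin(6 * X_toy).ravel() + rng.normal(0, 0.1, X_toy.shape[0])

train_r2 = {}
for lr in (0.05, 0.5):
    gbr = GradientBoostingRegressor(max_depth=3, n_estimators=20,
                                    learning_rate=lr, subsample=1.0,
                                    random_state=0)
    gbr.fit(X_toy, y_toy)
    # One training-set R2 value per boosting stage
    train_r2[lr] = [metrics.r2_score(y_toy, yp) for yp in gbr.staged_predict(X_toy)]

for lr, scores in train_r2.items():
    print(f'learning_rate={lr}: R2 after 1 stage = {scores[0]:.3f}, '
          f'after 20 stages = {scores[-1]:.3f}')
```

With a small learning rate each tree contributes only a small correction, so more stages are needed to reach the same training fit; these are training scores only and say nothing about generalization.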

We can check how this works with a weak learner on our toy data set.

In [None]:
# Create the dataset

plt.figure(figsize=(15,12))

rng = np.random.RandomState(1)
X = np.linspace(0, 4, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

weakl = DecisionTreeRegressor(max_depth=3)

# Fit regression model
regr_1 = weakl  # a single tree, used as the one-estimator baseline
regr_2 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=2, random_state=rng)

regr_3 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=3, random_state=rng)

regr_4 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=4, random_state=rng)
regr_10 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=10, random_state=rng)

regr_100 = GradientBoostingRegressor(max_depth=3,
                          n_estimators=100, random_state=rng)


regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)
regr_4.fit(X, y)
regr_10.fit(X, y)
regr_100.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
y_3 = regr_3.predict(X)
y_4 = regr_4.predict(X)
y_10 = regr_10.predict(X)
y_100 = regr_100.predict(X)

# r2_score expects (y_true, y_pred); the score is not symmetric in its arguments
for yp in [y_1,y_2,y_3,y_4,y_10, y_100]:
    print('R2 score: ', np.round(metrics.r2_score(y,yp),3))

# Plot the results

plt.scatter(X, y, c="k", s=10,label="training samples")
plt.plot(X, y_1, "-g", label="n_estimators=1", linewidth=1)
plt.plot(X, y_3, "-.b", label="n_estimators=3", linewidth=1)
plt.plot(X, y_10, "-k", label="n_estimators=10", linewidth=1)
plt.plot(X, y_100, "-c", label="n_estimators=100", linewidth=1)
plt.xlabel("data")
plt.ylabel("target")
plt.ylim(-2.5,2.5)
plt.title("Gradient Boosting Regression, max depth = 3", fontsize = 14)
plt.legend(fontsize=14, loc = 'upper right');
#plt.tight_layout()
#plt.savefig("GradBoost_3.png")

In [None]:
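plt.figure(figsize=(12,4))

n_estimators = 30

# Loop over different max_depth values to compare their effect on model performance
for i, md in enumerate([3,6,10]):

    # Create a Gradient Boosting Regressor with current max_depth and fixed n_estimators
    model = GradientBoostingRegressor(max_depth=md, n_estimators=n_estimators)

    # Fit the model on the training data
    model.fit(X_train, y_train)

    # Test-set prediction after each boosting stage (computed once)
    staged = list(model.staged_predict(X_test))

    # Create subplot for each max_depth setting
    plt.subplot(1,3,i+1)

    # Plot R² score vs number of iterations
    plt.plot(range(len(staged)), [metrics.r2_score(y_test, yp) for yp in staged], label='r2 score')

    # Plot Spearman correlation coefficient vs number of iterations
    plt.plot(range(len(staged)), [stats.spearmanr(y_test, yp)[0] for yp in staged], label='Spearman r')

    plt.xlabel('Iteration')
    plt.title('Max depth = '+str(md)+', GBR')

    # Only add legend to the last subplot
    if i == 2:
        plt.legend()

# Uncomment to save figure to a PDF file
# plt.savefig('GBR_performance.pdf')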


In [None]:
plt.figure(figsize=(12,4))

n_estimators = 30

for i, md in enumerate([3,6,10]):

    model = GradientBoostingRegressor(max_depth=md,
                  n_estimators=n_estimators)

    model.fit(X_train,y_train)

    # Compute the staged predictions once instead of rebuilding the list for every stage
    staged = list(model.staged_predict(X_test))

    plt.subplot(1,3,i+1)

    plt.plot(range(len(staged)), [metrics.r2_score(y_test,yp) for yp in staged], label = 'r2 score', c = 'steelblue')

    plt.plot(range(len(staged)), [stats.spearmanr(y_test,yp)[0] for yp in staged], label = 'Spearman r', c = 'fuchsia')

    plt.xlabel('Iteration')

    plt.ylim(0,1.0)

    plt.title('Max depth = '+str(md)+', GBR')

    if i == 2:
        plt.legend();

    plt.tight_layout()

#plt.savefig('GBR_performance.pdf')

### Because of the different boosting process, GBT models tend to work well even with weak base learners.

We compare the performance of AdaBoost and various GBT models on the photometric redshifts problem in the next notebook (FlavorsOfBoosting).