# scikit-learn PLS Analysis with Pinard

This notebook demonstrates how to perform a complete PLS regression analysis using Pinard with scikit-learn. The workflow includes:

- Loading data
- Preprocessing with multiple transformations
- Training a PLS model
- Making predictions and evaluating performance
- Saving the model for future use

## Setup and Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.cross_decomposition import PLSRegression

from pinard import utils
from pinard import preprocessing as pp
from pinard.model_selection import train_test_split_idx

# Set random seed for reproducibility
np.random.seed(42)

## Load and Explore Data

In [None]:
# Load data (using example data files)
try:
    x_path = "../Xcal.csv"  # Adjust path if necessary
    y_path = "../Ycal.csv"  # Adjust path if necessary
    x, y = utils.load_csv(x_path, y_path, x_hdr=0, y_hdr=0, autoremove_na=True)
    print(f"Loaded data: X shape: {x.shape}, y shape: {y.shape}")
except Exception as e:
    print(f"Could not load CSV files: {e}")
    print("Generating random data instead...")
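    # Generate synthetic data if files are not available. x and y are independent
    # noise here, so expect R² near or below zero in the evaluation cells.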
    x = np.random.rand(100, 200)  # 100 samples, 200 features
    y = np.random.rand(100)      # Target values
    print(f"Generated data: X shape: {x.shape}, y shape: {y.shape}")

In [None]:
# Quick data exploration
plt.figure(figsize=(12, 4))

# Plot first sample
plt.subplot(1, 2, 1)
plt.plot(x[0])
plt.title("First Sample")
plt.xlabel("Feature Index")
plt.ylabel("Value")

# Plot target distribution
plt.subplot(1, 2, 2)
plt.hist(y, bins=20)
plt.title("Target Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")

plt.tight_layout()
plt.show()

## Split Data Into Train and Test Sets

In [None]:
# Split data into train and test sets
train_index, test_index = train_test_split_idx(x, y=y, method="random", test_size=0.25, random_state=42)
X_train, y_train = x[train_index], y[train_index]
X_test, y_test = x[test_index], y[test_index]

print(f"Training set: X shape: {X_train.shape}, y shape: {y_train.shape}")
print(f"Test set: X shape: {X_test.shape}, y shape: {y_test.shape}")
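
The split above works on row indices rather than on copies of the data. As a rough sketch of that idea (not Pinard's implementation; `random_split_idx` and the demo arrays are hypothetical names introduced here), a random index split can be written with NumPy alone:

```python
import numpy as np


def random_split_idx(n_samples, test_size=0.25, random_state=42):
    """Hypothetical helper: return disjoint (train_idx, test_idx) index arrays."""
    rng = np.random.default_rng(random_state)
    permuted = rng.permutation(n_samples)        # shuffle all row indices once
    n_test = int(round(test_size * n_samples))   # number of held-out rows
    return np.sort(permuted[n_test:]), np.sort(permuted[:n_test])


x_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.arange(20, dtype=float)

train_idx, test_idx = random_split_idx(len(x_demo))
print(x_demo[train_idx].shape, x_demo[test_idx].shape)  # (15, 2) (5, 2)
```

Returning indices keeps X and y aligned, because the same index arrays select rows from both.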

## Define Preprocessing Steps

We define three preprocessing operators. Later, each one is applied to the same scaled input and their outputs are concatenated:

In [None]:
# Define preprocessing operators
preprocessing = [
    ('id', pp.IdentityTransformer()),    # Keep original data
    ('savgol', pp.SavitzkyGolay()),      # Savitzky-Golay smoothing
    ('gaussian1', pp.Gaussian(order=1, sigma=2)),  # Gaussian filter, first derivative (order=1)
]

print(f"Number of preprocessing operators: {len(preprocessing)}")

In [None]:
# Visualize the effect of preprocessing on a sample
sample_idx = 0  # Use first sample for visualization
sample = X_train[sample_idx:sample_idx+1]

plt.figure(figsize=(12, 8))

# Original data
plt.subplot(2, 2, 1)
plt.plot(sample[0])
plt.title("Original Data")
plt.xlabel("Feature Index")
plt.ylabel("Value")

# Processed data for each operator
for i, (name, transformer) in enumerate(preprocessing):
    plt.subplot(2, 2, i+2)
    processed = transformer.fit_transform(sample)
    plt.plot(processed[0])
    plt.title(f"{name} Transformed")
    plt.xlabel("Feature Index")
    plt.ylabel("Value")

plt.tight_layout()
plt.show()
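
For readers without Pinard installed, the following sketch approximates the three operators with SciPy. It assumes the Pinard transformers behave like `scipy.signal.savgol_filter` and `scipy.ndimage.gaussian_filter1d`. The window length, polynomial order and the synthetic spectrum are illustrative choices, not Pinard defaults.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
positions = np.linspace(0.0, 1.0, 200)
# One noisy synthetic "spectrum": a single Gaussian peak plus noise
spectrum = np.exp(-((positions - 0.5) ** 2) / 0.01) + 0.02 * rng.standard_normal(200)
spectra = spectrum[np.newaxis, :]                 # shape (1, 200), samples x features

# Illustrative parameters (assumptions, not taken from Pinard)
smoothed = savgol_filter(spectra, window_length=11, polyorder=3, axis=1)
gauss_deriv = gaussian_filter1d(spectra, sigma=2, order=1, axis=1)

# Concatenating the views side by side triples the feature count
stacked = np.hstack([spectra, smoothed, gauss_deriv])
print(stacked.shape)  # (1, 600)
```

Each operator preserves the number of features, so the stacked matrix has three times as many columns as the input.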

## Create and Train PLS Pipeline

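We build the estimator in two layers. A scikit-learn pipeline chains three steps:
1. Feature scaling with MinMaxScaler
2. Preprocessing with FeatureUnion (applies each transformation to the same input and concatenates the outputs)
3. PLS regression model

A TransformedTargetRegressor then wraps the pipeline to scale the target variable during fitting and to map predictions back to the original scale.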

In [None]:
# Create a pipeline with FeatureUnion for preprocessing and PLS regression
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),          # Scale input features to [0,1]
    ('preprocessing', FeatureUnion(preprocessing)),  # Apply preprocessing operators
    ('pls', PLSRegression(n_components=10))  # PLS regression with 10 components
])

# Wrap the pipeline in TransformedTargetRegressor to scale the target variable
estimator = TransformedTargetRegressor(
    regressor=pipeline,
    transformer=MinMaxScaler()  # Scale target variable to [0,1]
)

In [None]:
# Train the model
print("Training the model...")
%time estimator.fit(X_train, y_train)
print("Model training complete.")

## Make Predictions and Evaluate Performance

In [None]:
# Make predictions
y_pred = estimator.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model performance metrics:")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
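
As a reference for how these four metrics relate, here is a small NumPy sketch that computes them from the residuals of a made-up example. The arrays and the `_demo` names are illustrative only.

```python
import numpy as np

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true_demo - y_hat_demo                 # [1, 0, -1, 0]
mae_demo = np.mean(np.abs(residuals))                # mean absolute error
mse_demo = np.mean(residuals ** 2)                   # mean squared error
rmse_demo = np.sqrt(mse_demo)                        # same units as the target

ss_res = np.sum(residuals ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_demo = 1.0 - ss_res / ss_tot                      # fraction of variance explained

print(f"MAE={mae_demo:.4f} MSE={mse_demo:.4f} RMSE={rmse_demo:.4f} R2={r2_demo:.4f}")
# MAE=0.5000 MSE=0.5000 RMSE=0.7071 R2=0.9000
```

R² becomes negative whenever the model predicts worse than the constant mean of the test targets, which is what to expect on the random fallback data.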

In [None]:
# Visualize predictions vs actual values
plt.figure(figsize=(10, 6))

# Plot predictions vs actual
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('PLS Model: Actual vs Predicted')

# Add R² value to plot
plt.text(0.05, 0.95, f'R² = {r2:.4f}', transform=plt.gca().transAxes, 
         bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))

plt.grid(True, alpha=0.3)
plt.show()

## Investigate PLS Components

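Let's look at how the number of PLS components affects performance. Note that the loop below scores each candidate on the test set, so the "optimal" value and its R² are optimistic. For an unbiased estimate, select the number of components by cross-validation on the training set and keep the test set for the final evaluation only.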

In [None]:
# Evaluate performance with different numbers of components
max_components = min(20, min(X_train.shape))
component_range = range(1, max_components + 1)
r2_scores = []

for n_components in component_range:
    # Create a new pipeline with the specified number of components.
    # Loop-local names avoid overwriting pipeline, y_pred and r2 from earlier cells.
    pipeline_k = Pipeline([
        ('scaler', MinMaxScaler()),
        ('preprocessing', FeatureUnion(preprocessing)),
        ('pls', PLSRegression(n_components=n_components))
    ])

    model_k = TransformedTargetRegressor(
        regressor=pipeline_k,
        transformer=MinMaxScaler()
    )

    # Train and evaluate
    model_k.fit(X_train, y_train)
    y_pred_k = model_k.predict(X_test)
    r2_k = r2_score(y_test, y_pred_k)
    r2_scores.append(r2_k)

    print(f"Components: {n_components}, R²: {r2_k:.4f}")

In [None]:
# Plot R² vs number of components
plt.figure(figsize=(10, 6))
plt.plot(component_range, r2_scores, 'o-')
plt.xlabel('Number of PLS Components')
plt.ylabel('R² Score')
plt.title('Model Performance vs Number of PLS Components')
plt.grid(True, alpha=0.3)

# Find and highlight the optimal number of components
optimal_components = component_range[np.argmax(r2_scores)]
max_r2 = max(r2_scores)
plt.axvline(x=optimal_components, color='r', linestyle='--', alpha=0.7)
plt.scatter([optimal_components], [max_r2], color='r', s=100, zorder=5)
plt.text(optimal_components + 0.5, max_r2, f'Optimal: {optimal_components} components\nR²: {max_r2:.4f}')

plt.show()

## Save the Model

Refit the model on the training set with the selected number of components, then save it for later use.

In [None]:
import joblib

# Train the final model with optimal number of components
final_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('preprocessing', FeatureUnion(preprocessing)),
    ('pls', PLSRegression(n_components=optimal_components))
])

final_model = TransformedTargetRegressor(
    regressor=final_pipeline,
    transformer=MinMaxScaler()
)

final_model.fit(X_train, y_train)

# Save the model
joblib.dump(final_model, 'pls_model.joblib')
print("Model saved to 'pls_model.joblib'")

## Verify the Saved Model

To ensure the saved model works correctly, let's reload it and test it.

In [None]:
# Load the saved model
loaded_model = joblib.load('pls_model.joblib')

# Make predictions with the loaded model
y_pred_loaded = loaded_model.predict(X_test)

# Verify the predictions
r2_loaded = r2_score(y_test, y_pred_loaded)
print(f"Loaded model R² score: {r2_loaded:.4f}")

## Summary

This notebook demonstrated a complete workflow for building a PLS regression model with Pinard and scikit-learn. We covered:

1. Data loading and preprocessing with multiple transformations
2. Building a pipeline with feature scaling and PLS regression
3. Training and evaluating the model
4. Optimizing the number of PLS components
5. Saving and reloading the model

The approach we used enables reproducible analysis and can be adapted to different types of spectral data.
