# Automated CV Design: Example Pipeline

This notebook demonstrates a complete workflow for turning pairwise distance data into collective variables (CVs). The steps are:
1. **Data Loading**: Using the refactored `data_access` to load H5/NPY files.
2. **Variance Filtering**: Removing low-variance features.
3. **Label Assignment**: Attaching a class label to every sample (placeholder logic, to be replaced).
4. **MPSO Feature Selection**: Projecting features using Multi-objective Particle Swarm Optimization.
5. **Dimension Reduction**: Applying Fisher's Linear Discriminant Analysis (FLDA).
6. **Visualization**: Plotting the samples in the resulting CV space.

In [None]:
import sys
import os
import importlib.util
import pandas as pd
import numpy as np

def import_module_from_path(module_name, path):
    """Helper to import modules from paths that aren't valid Python identifiers (e.g. starting with numbers)"""
    # Get absolute path
    abs_path = os.path.abspath(path)
    spec = importlib.util.spec_from_file_location(module_name, abs_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module

# Since the notebook is in 'lda/', we can import data_access directly
import data_access as da

# Import modules from numeric subdirectories
variance = import_module_from_path("variance", "2_feature_extraction/variance.py")
MPSO = import_module_from_path("MPSO", "3_feature_selection/3.5.MPSO.py")
FLDA = import_module_from_path("FLDA", "4_dimensionality_reduction/FLDA.py")

print("Modules imported successfully.")

## 1. Data Loading

We use `data_access.data_iterator` to load snapshots from our H5/NPY distance maps in chunks. It handles both formats transparently.
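
To make the chunked-loading idea concrete, here is a minimal sketch of an iterator over `.npy` files. It is not the implementation in `data_access`: the function name `iter_npy_chunks`, the `d0, d1, ...` column names, and the `source` column are assumptions made for illustration, and H5 files are left out (they would need something like `h5py`).

```python
import os
import tempfile

import numpy as np
import pandas as pd


def iter_npy_chunks(base_dir, chunk_size=5000):
    """Yield DataFrames of at most chunk_size rows from each .npy file in base_dir."""
    for name in sorted(os.listdir(base_dir)):
        if not name.endswith(".npy"):
            continue
        # mmap_mode="r" avoids reading the whole array into memory at once
        arr = np.load(os.path.join(base_dir, name), mmap_mode="r")
        cols = [f"d{i}" for i in range(arr.shape[1])]  # hypothetical column names
        for start in range(0, arr.shape[0], chunk_size):
            block = np.asarray(arr[start:start + chunk_size])
            df = pd.DataFrame(block, columns=cols)
            df["source"] = name  # hypothetical metadata column
            yield df


# Tiny demonstration on synthetic data: 6 frames, 2 distances, chunks of 4 rows
with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "rep0.npy"), np.arange(12.0).reshape(6, 2))
    chunks = list(iter_npy_chunks(tmp, chunk_size=4))
    chunk_rows = [len(c) for c in chunks]
```

Yielding chunks rather than one large frame keeps peak memory bounded by `chunk_size`, which is presumably why the pipeline passes iterators between stages.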

In [None]:
# Define your base directory for data (local or absolute)
base_dir = '../data/dist_maps'

# Initialize the data iterator
raw_data_iter = da.data_iterator(base_dir=base_dir, chunk_size=5000)

print(f"Created data iterator for: {base_dir}")

## 2. Variance Filtering

We apply a variance threshold (using the knee-point method) to filter out static or irrelevant distance pairs.
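
The sketch below shows one way a two-pass, knee-point variance filter could work. It is not the code in `variance.py`: the helper names are hypothetical, and the knee is taken here as the point on the descending variance curve farthest from the line joining its endpoints, which is only one of several common knee definitions. It assumes the chunks can be iterated twice and that the variances are not all equal.

```python
import numpy as np


def streaming_variance(chunks):
    """Pass 1: per-feature population variance accumulated chunk by chunk."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in chunks:
        x = np.asarray(x, dtype=float)
        n += x.shape[0]
        total = total + x.sum(axis=0)
        total_sq = total_sq + (x ** 2).sum(axis=0)
    mean = total / n
    # E[x^2] - E[x]^2; simple, but less stable than Welford-style updates
    return total_sq / n - mean ** 2


def knee_threshold(variances):
    """Variance at the knee of the descending-sorted variance curve."""
    v = np.sort(variances)[::-1]
    x = np.linspace(0.0, 1.0, len(v))
    y = (v - v[-1]) / (v[0] - v[-1])  # normalise to [0, 1]
    # Distance from the line y = 1 - x is proportional to |x + y - 1|
    return v[np.argmax(np.abs(x + y - 1.0))]


# Synthetic data: three mobile features and three nearly static ones
rng = np.random.default_rng(0)
scales = np.array([5.0, 4.0, 3.0, 0.1, 0.05, 0.01])
data = rng.normal(size=(2000, 6)) * scales
chunks = np.array_split(data, 4)

var = streaming_variance(chunks)            # pass 1
keep = var > knee_threshold(var)            # features strictly above the knee
filtered = [c[:, keep] for c in chunks]     # pass 2
```

Whether the feature sitting exactly at the knee is kept (`>=`) or dropped (`>`) is a convention; the sketch drops it.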

In [None]:
# Apply the two-pass variance filter
# Pass 1: Analyzes total variance across all chunks
# Pass 2: Yields filtered DataFrames
filtered_iter = variance.variance_filter_pipeline(raw_data_iter)

print("Variance filter applied. Ready for feature selection.")

## 3. Label Assignment (TODO)

Before running supervised feature selection (MPSO) or dimensionality reduction (FLDA), you MUST assign class labels to each sample.
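
As one possible starting point, the sketch below assigns classes from frame ranges. The `frame` column and the boundaries in `FRAME_EDGES` are hypothetical; substitute whatever metadata your DataFrames actually carry.

```python
import numpy as np
import pandas as pd

# Hypothetical boundaries: frames 0-99 -> class 1, 100-199 -> class 2, 200+ -> class 3
FRAME_EDGES = [100, 200]


def label_by_frame(df, frame_col="frame", edges=FRAME_EDGES):
    """Return a copy of df with an integer 'class' column starting at 1."""
    out = df.copy()
    # np.digitize returns 0 for values below edges[0], 1 for the next bin, and so on
    out["class"] = np.digitize(out[frame_col].to_numpy(), edges) + 1
    return out


demo = pd.DataFrame({"frame": [0, 99, 100, 199, 200, 350]})
labels = label_by_frame(demo)["class"].tolist()
```

Inside a generator such as `add_target_labels`, the same function could be applied to each chunk before yielding it.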

In [None]:
def add_target_labels(df_iter):
    """
    TODO: Implement logic to assign classes to each sample.
    Classes could be based on construct names, frame ranges, or other metadata.
    """
    for df in df_iter:
        # EXAMPLE FILLER: Assigning a dummy class based on replica
        # Replace this with your actual classification logic
        df['class'] = (df['replica'].astype(int) % 2) + 1  # FLDA often expects integers starting at 1
        yield df

labeled_iter = add_target_labels(filtered_iter)
print("Labels added to the pipeline (Placeholder logic used).")

## 4. MPSO Feature Selection

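MPSO searches for a projection of the features that separates your classes well, using Multi-objective Particle Swarm Optimization. Because the search is stochastic, the result is a good candidate rather than a guaranteed optimum.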
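
For intuition about the swarm update, here is a minimal sketch of a plain, single-objective, global-best PSO that searches for one projection vector maximizing a Fisher-style separation score. It is deliberately simpler than the multi-objective variant in `3.5.MPSO.py` and does not reproduce it; the function names, the score, and the coefficients (inertia 0.7, acceleration 1.5) are assumptions chosen for illustration.

```python
import numpy as np


def fisher_ratio(w, X, y):
    """Between-class over within-class scatter of the 1-D projection X @ w."""
    z = X @ w
    grand = z.mean()
    between, within = 0.0, 0.0
    for c in np.unique(y):
        zc = z[y == c]
        between += len(zc) * (zc.mean() - grand) ** 2
        within += ((zc - zc.mean()) ** 2).sum()
    return between / (within + 1e-12)


def pso_maximize(score, dim, n_particles=20, iters=50, seed=0):
    """Global-best PSO; returns the best position found and its score."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                          # personal bests
    best_val = np.array([score(p) for p in pos])
    g = best_pos[np.argmax(best_val)].copy()       # global best
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (best_pos - pos) + 1.5 * r2 * (g - pos)
        pos = pos + vel
        vals = np.array([score(p) for p in pos])
        improved = vals > best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = vals[improved]
        g = best_pos[np.argmax(best_val)].copy()
    return g, best_val.max()


# Synthetic two-class data in which only feature 0 separates the classes
rng = np.random.default_rng(1)
y = np.repeat([1, 2], 100)
X = rng.normal(size=(200, 4))
X[y == 2, 0] += 3.0


def score(w):
    return fisher_ratio(w, X, y)


w_best, best_score = pso_maximize(score, dim=4)
```

Personal bests only ever improve, so the returned score can never be worse than the best particle of the initial swarm; more iterations trade run time for a better chance of approaching the optimum.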

In [None]:
# MPSO consumes the iterator and returns a single projected DataFrame
# Adjust dims and iterations as needed for your dataset
mpso_result_df = MPSO.run_mpso_pipeline(
    labeled_iter, 
    target_col='class', 
    dims=10,        # Number of features to project into
    mpso_iters=20   # Iterations for demo efficiency
)

print(f"MPSO complete. Result shape: {mpso_result_df.shape}")
mpso_result_df.head()

## 5. FLDA Dimensionality Reduction

Finally, we apply Fisher's LDA to the selected features to obtain the linear subspace that maximizes between-class scatter relative to within-class scatter.
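
The textbook form of Fisher's LDA can be sketched in a few lines of NumPy. This is an illustration of the general technique, not the contents of `FLDA.py`; the function name `fisher_lda` and its return values are assumptions.

```python
import numpy as np


def fisher_lda(X, y, n_components):
    """Project X onto the leading eigenvectors of pinv(Sw) @ Sb."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    # Keep the real parts and sort by decreasing eigenvalue
    order = np.argsort(evals.real)[::-1][:n_components]
    W = evecs[:, order].real
    return X @ W, evals.real[order]


# Synthetic data: three classes whose means differ along two directions
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
y = np.repeat([1, 2, 3], 50)
X = rng.normal(size=(150, 4)) + means[y - 1]

Z, top_evals = fisher_lda(X, y, n_components=2)
```

For C classes the between-class scatter has rank at most C - 1, so LDA offers at most C - 1 informative axes. With the two-class placeholder labels used earlier, only the first discriminant would carry class information; requesting two components is meaningful once at least three classes are defined.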

In [None]:
# Apply FLDA
flda_iter = FLDA.run_flda(
    mpso_result_df, 
    num_eigenvector=2, 
    target_col='class'
)

# FLDA returns an iterator yielding the final transformed DataFrame
final_lda_df = next(flda_iter)

print("FLDA Transformation Complete.")
final_lda_df.head()

## 6. Visualization

Plot the samples in the final collective variable (CV) space, colored by class.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
for cls in final_lda_df['class'].unique():
    mask = final_lda_df['class'] == cls
    plt.scatter(final_lda_df.loc[mask, 'LD1'], final_lda_df.loc[mask, 'LD2'], label=f'Class {cls}', alpha=0.6)

plt.xlabel('CV 1 (LD1)')
plt.ylabel('CV 2 (LD2)')
plt.title('Final CV Space (FLDA)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()