
## Python Tools for FTIR Microplastic Analysis Software


---

### I. Data Loading & Management Tools

1.  **`ftir_loader.py`**
    * **Purpose:** Robustly load various FTIR spectral file formats.
    * **Functionality:**
        * `load_omnic(filepath)`: Loads .spa files (Thermo Fisher Scientific OMNIC).
        * `load_dpt(filepath)`: Loads .dpt (data point table) files, a two-column ASCII export typically produced by Bruker OPUS.
        * `load_jdx(filepath)`: Loads JCAMP-DX (.jdx, .dx) files (common standard).
        * `load_csv(filepath, sep=',', header=True)`: Loads simple CSV files (assuming wavenumber in first column, absorbance/transmittance in second).
        * `load_folder(folder_path, file_extension)`: Iterates and loads all spectra from a specified folder.
    * **Dependencies:** `numpy`, `pandas`, and `specio` or a comparable spectroscopy I/O library for the vendor formats (confirm that the chosen library actually supports .spa, .dpt, and .jdx before relying on it).
    * **Output:** Pandas DataFrame with wavenumbers as index and sample names as columns, or a list of `Spectrum` objects (if using a specialized library like `specio`).

2.  **`spectral_dataset.py`**
    * **Purpose:** A class to manage multiple loaded spectra, their metadata, and associated labels.
    * **Functionality:**
        * `add_spectrum(spectrum_data, metadata, label)`: Adds a single spectrum.
        * `add_spectra_from_loader(loader_output)`: Integrates output from `ftir_loader`.
        * `get_X()`: Returns spectral data (absorbance/transmittance) as a NumPy array (features for ML).
        * `get_y()`: Returns labels (e.g., polymer type) as a NumPy array.
        * `get_wavenumbers()`: Returns the common wavenumber axis.
        * `filter_by_label(label_list)`: Selects subsets of data.
        * `split_train_test(test_size=0.2, random_state=None)`: Splits data for ML.
    * **Dependencies:** `numpy`, `pandas`.
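
As a minimal sketch of the loader output described above, the following reads two-column CSV files from a folder into one DataFrame with wavenumbers as the index and sample names as columns. It covers only the CSV case; the function bodies, the use of the file stem as the sample name, and the shared-axis assumption are illustrative choices, not a fixed design.

```python
from pathlib import Path

import pandas as pd


def load_csv(filepath, sep=",", header=True):
    """Read one two-column CSV (wavenumber, intensity) into a Series."""
    frame = pd.read_csv(filepath, sep=sep, header=0 if header else None)
    wavenumbers = frame.iloc[:, 0].astype(float).to_numpy()
    intensity = frame.iloc[:, 1].astype(float).to_numpy()
    # Assumption: the file name (without extension) serves as the sample name.
    return pd.Series(intensity, index=wavenumbers, name=Path(filepath).stem)


def load_folder(folder_path, file_extension=".csv"):
    """Load every matching file into one DataFrame (index: wavenumber)."""
    files = sorted(Path(folder_path).glob(f"*{file_extension}"))
    spectra = [load_csv(path) for path in files]
    # Assumption: all files share one wavenumber axis. If they do not, the
    # outer join below introduces NaNs that must be interpolated away.
    return pd.concat(spectra, axis=1).sort_index()
```

Vendor formats such as .spa would need a dedicated reader in place of `pd.read_csv`, but could return the same Series shape so that `load_folder` stays unchanged.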

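The dataset class outlined above could start from something like the sketch below. It assumes that every spectrum already shares one wavenumber axis and, unlike the argument-free constructor used later in this notebook, takes that axis at construction time; both are simplifications made for illustration. `filter_by_label` here returns arrays rather than a new dataset object.

```python
import numpy as np


class SpectralDataset:
    """Minimal container for spectra that share one wavenumber axis."""

    def __init__(self, wavenumbers):
        # Assumption: the common axis is known up front.
        self.wavenumbers = np.asarray(wavenumbers, dtype=float)
        self._spectra, self._labels, self._metadata = [], [], []

    def add_spectrum(self, spectrum_data, metadata=None, label=None):
        spectrum = np.asarray(spectrum_data, dtype=float)
        if spectrum.shape != self.wavenumbers.shape:
            raise ValueError("spectrum does not match the wavenumber axis")
        self._spectra.append(spectrum)
        self._metadata.append(metadata or {})
        self._labels.append(label)

    def get_X(self):
        # Rows are samples, columns are wavenumbers.
        return np.vstack(self._spectra)

    def get_y(self):
        return np.asarray(self._labels)

    def get_wavenumbers(self):
        return self.wavenumbers

    def filter_by_label(self, label_list):
        keep = np.isin(self.get_y(), label_list)
        return self.get_X()[keep], self.get_y()[keep]
```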
---

### II. Preprocessing Tools (`ftir_preprocessor.py`)

This module will contain functions for each preprocessing step, designed to be chained together.

1.  **`correct_baseline(spectrum_data, method='als', poly_order=3, lam=1e6, p=0.01)`**
    * **Methods:**
        * `'als'`: Asymmetric Least Squares (recommended for robustness). Parameters: `lam` (smoothness), `p` (asymmetry).
        * `'poly'`: Polynomial fitting. Parameter: `poly_order`.
        * `'minmax'`: Simple min-max baseline (finding local minima).
    * **Dependencies:** `numpy` (`numpy.polyfit` for polynomial fits), `scipy.sparse` (for a custom ALS implementation).

2.  **`smooth_spectrum(spectrum_data, window_length=11, poly_order=3)`**
    * **Method:** Savitzky-Golay filter.
    * **Dependencies:** `scipy.signal.savgol_filter`.

3.  **`normalize_spectrum(spectrum_data, method='vector')`**
    * **Methods:**
        * `'minmax'`: Scales to [0, 1].
        * `'vector'`: Divides by L2 norm (unit vector normalization).
        * `'snv'`: Standard Normal Variate (each spectrum is centered to mean 0 and scaled to standard deviation 1).
        * `'msc'`: Multiplicative Signal Correction (requires a reference spectrum, or calculated mean).
    * **Dependencies:** `numpy`.

4.  **`apply_derivative(spectrum_data, order=1, window_length=11, poly_order=3)`**
    * **Methods:** First and Second derivative using Savitzky-Golay.
    * **Dependencies:** `scipy.signal.savgol_filter` (with `deriv=order`; `deriv` is an integer, not a boolean).

5.  **`apply_atr_correction(spectrum_data, wavenumbers, atr_crystal_ri=2.4, sample_ri=1.5)`**
    * **Purpose:** Corrects for varying penetration depth in ATR.
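    * **Dependencies:** `numpy`. Besides the two refractive indices, the penetration depth depends on the angle of incidence, which the signature above does not yet expose.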

6.  **`remove_membrane_filter(spectrum_data, filter_spectrum_ref, method='subtract')`**
    * **Methods:**
        * `'subtract'`: Simple subtraction after scaling.
        * `'ridge'`: Ridge regression-based removal (more robust, requires a set of filter spectra and sample spectra for training).
    * **Dependencies:** `numpy`, `scipy.optimize` (for optimization in scaling subtraction), `sklearn.linear_model` (for ridge).
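
The baseline, smoothing, and derivative steps listed above can be sketched as follows. The ALS routine follows the commonly used iteratively reweighted formulation with a second-difference penalty; the iteration count `n_iter` and the `delta` argument are assumptions added for illustration, and the functions operate on a single 1-D spectrum.

```python
import numpy as np
from scipy import sparse
from scipy.signal import savgol_filter
from scipy.sparse.linalg import spsolve


def als_baseline(y, lam=1e6, p=0.01, n_iter=10):
    """Estimate a smooth baseline that hugs the lower envelope of y."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-order difference operator, shape (n - 2, n).
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w, 0)
        z = spsolve((W + penalty).tocsc(), w * y)
        # Points above the current baseline (peaks) get the small weight p.
        w = np.where(y > z, p, 1.0 - p)
    return z


def smooth_spectrum(y, window_length=11, poly_order=3):
    """Savitzky-Golay smoothing of one spectrum."""
    return savgol_filter(y, window_length, poly_order)


def apply_derivative(y, order=1, window_length=11, poly_order=3, delta=1.0):
    """Savitzky-Golay derivative; delta is the spacing of the x axis."""
    return savgol_filter(y, window_length, poly_order, deriv=order, delta=delta)
```

A corrected spectrum is then `y - als_baseline(y)`. Larger `lam` values give a stiffer baseline, and smaller `p` values push the baseline further below the peaks.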

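The normalization methods and the `'subtract'` variant of membrane filter removal listed above might look like the sketch below, written for a single 1-D spectrum. The `reference` argument and the closed-form least-squares scale factor are assumptions; the outline mentions `scipy.optimize` for the scaling, which a constrained fit would need. An unconstrained least-squares scale can over-subtract where polymer bands overlap filter bands.

```python
import numpy as np


def normalize_spectrum(y, method="vector", reference=None):
    """Normalize one spectrum with the chosen method."""
    y = np.asarray(y, dtype=float)
    if method == "minmax":
        return (y - y.min()) / (y.max() - y.min())
    if method == "vector":
        return y / np.linalg.norm(y)
    if method == "snv":
        return (y - y.mean()) / y.std()
    if method == "msc":
        if reference is None:
            raise ValueError("MSC needs a reference, e.g. the dataset mean")
        # Fit y ~ offset + slope * reference, then invert that relation.
        slope, offset = np.polyfit(reference, y, deg=1)
        return (y - offset) / slope
    raise ValueError(f"unknown method: {method!r}")


def subtract_filter(y, filter_ref):
    """Subtract the least-squares scaled filter spectrum from y."""
    y = np.asarray(y, dtype=float)
    filter_ref = np.asarray(filter_ref, dtype=float)
    scale = np.dot(y, filter_ref) / np.dot(filter_ref, filter_ref)
    return y - scale * filter_ref, scale
```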
---

### III. Machine Learning Tools (`ftir_ml_models.py`)

This module will house the ML model definitions and training/prediction functionalities.

1.  **`PolymerClassifier` Class**
    * **Purpose:** Encapsulates various classification algorithms for polymer identification.
    * **Initialization:** `__init__(model_type='svm', **model_params)`
        * `model_type`: 'svm', 'random_forest', 'knn', 'mlp', 'cnn' (basic).
        * `model_params`: Keyword arguments forwarded as hyperparameters to the chosen model.
    * **Functionality:**
        * `train(X_train, y_train)`: Trains the selected ML model.
        * `predict(X_test)`: Predicts labels for new spectra.
        * `predict_proba(X_test)`: (If applicable) Returns probability estimates for each class.
        * `evaluate(X_test, y_test)`: Calculates common metrics (accuracy, precision, recall, F1-score, confusion matrix).
        * `save_model(filepath)`: Saves the trained model using `joblib` or `pickle`.
        * `load_model(filepath)`: Loads a pre-trained model.
    * **Dependencies:** `scikit-learn` (for SVM, RF, KNN, MLP, metrics), `tensorflow` or `pytorch` (for CNN if implemented).

2.  **`DimensionalityReducer` Class**
    * **Purpose:** For reducing the number of features (wavenumbers).
    * **Initialization:** `__init__(method='pca', n_components=0.95)`
        * `method`: 'pca', 'tsne' (for visualization mostly), 'umap'.
        * `n_components`: Number of components (integer) or, for PCA, the fraction of variance to retain (float between 0 and 1).
    * **Functionality:**
        * `fit(X_data)`: Fits the reducer to the data.
        * `transform(X_data)`: Transforms data to lower dimension. Not available for `'tsne'`, because scikit-learn's `TSNE` only offers `fit_transform`.
        * `fit_transform(X_data)`: Fits and transforms.
    * **Dependencies:** `sklearn.decomposition` (for PCA), `sklearn.manifold` (for t-SNE), `umap-learn` (for UMAP).

3.  **`FeatureSelector` Class (Optional, for more advanced scenarios)**
    * **Purpose:** Selects the most informative wavenumbers.
    * **Initialization:** `__init__(method='rfe', estimator=RandomForestClassifier())`
        * `method`: 'rfe' (Recursive Feature Elimination), 'select_from_model' (using feature importances).
    * **Functionality:**
        * `fit(X, y)`: Identifies relevant features.
        * `transform(X)`: Selects only the identified features.
    * **Dependencies:** `sklearn.feature_selection`.
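
A thin wrapper along the lines of the `PolymerClassifier` outline above could be sketched as follows. Only three scikit-learn model types are wired up, the SVM is preceded by a `StandardScaler`, and `evaluate` returns a dictionary; these are illustrative choices rather than requirements of the outline.

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Factories so that each instance gets a fresh, unfitted estimator.
_MODELS = {
    # probability=True enables predict_proba at the cost of slower training.
    "svm": lambda **kw: make_pipeline(StandardScaler(), SVC(probability=True, **kw)),
    "random_forest": lambda **kw: RandomForestClassifier(**kw),
    "knn": lambda **kw: KNeighborsClassifier(**kw),
}


class PolymerClassifier:
    def __init__(self, model_type="svm", **model_params):
        if model_type not in _MODELS:
            raise ValueError(f"unsupported model_type: {model_type!r}")
        self.model = _MODELS[model_type](**model_params)

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)
        return self

    def predict(self, X_test):
        return self.model.predict(X_test)

    def predict_proba(self, X_test):
        return self.model.predict_proba(X_test)

    def evaluate(self, X_test, y_test):
        y_pred = self.predict(X_test)
        return {
            "accuracy": accuracy_score(y_test, y_pred),
            "f1_macro": f1_score(y_test, y_pred, average="macro"),
            "confusion_matrix": confusion_matrix(y_test, y_pred),
        }

    def save_model(self, filepath):
        joblib.dump(self.model, filepath)

    def load_model(self, filepath):
        self.model = joblib.load(filepath)
        return self
```

Files written by `joblib` or `pickle` should only be loaded from trusted sources, since unpickling can execute arbitrary code.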

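To illustrate the two reduction routes above, the sketch below runs PCA with a variance target and recursive feature elimination on synthetic data. The data are random stand-ins for spectra, and the choices of 20 selected features, 50 trees, and `step=0.2` are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
# Synthetic stand-in for spectra: 60 samples x 200 "wavenumbers",
# generated from 3 latent factors plus a little noise.
latent = rng.normal(size=(60, 3))
loadings = rng.normal(size=(3, 200))
X = latent @ loadings + 0.01 * rng.normal(size=(60, 200))
y = (latent[:, 0] > 0).astype(int)

# A float in (0, 1) asks PCA for the fewest components whose cumulative
# explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())

# RFE refits the estimator and drops the least important features in
# rounds (20% of all features per round) until 20 remain.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=20,
    step=0.2,
)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)
```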
---

### IV. Visualization Tools (`ftir_visualizer.py`)

1.  **`plot_spectrum(wavenumbers, absorbance, title='FTIR Spectrum', label=None, show_peaks=False, peak_indices=None, filename=None)`**
    * **Purpose:** Plot single or multiple spectra.
    * **Functionality:**
        * Plots absorbance vs. wavenumber.
        * Supports overlaying multiple spectra.
        * Optionally highlights identified peaks.
        * Saves plot to file.
    * **Dependencies:** `matplotlib.pyplot`.

2.  **`plot_pca_results(pca_model, X_transformed, y_labels, title='PCA of FTIR Spectra', filename=None)`**
    * **Purpose:** Visualize results of PCA (or other dimensionality reduction).
    * **Functionality:**
        * Plots 2D or 3D scatter plots of principal components, colored by polymer type.
        * Plots explained variance ratio.
    * **Dependencies:** `matplotlib.pyplot`, `seaborn` (for nicer aesthetics).

3.  **`plot_confusion_matrix(y_true, y_pred, labels, title='Confusion Matrix', filename=None)`**
    * **Purpose:** Visualize classification model performance.
    * **Functionality:**
        * Generates a heatmap of the confusion matrix.
    * **Dependencies:** `matplotlib.pyplot`, `seaborn`, `sklearn.metrics.confusion_matrix`.
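
As one possible shape for `plot_spectrum`, the sketch below draws one or several spectra with the wavenumber axis reversed, which is the usual FTIR convention. The `show_peaks` flag from the signature above is folded into `peak_indices` here, the headless `Agg` backend is selected only so that the example runs without a display, and returning the figure and axes is an added convenience.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import matplotlib.pyplot as plt
import numpy as np


def plot_spectrum(wavenumbers, absorbance, title="FTIR Spectrum", label=None,
                  peak_indices=None, filename=None):
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    # Accept one spectrum (1-D) or several spectra as rows (2-D).
    spectra = np.atleast_2d(np.asarray(absorbance, dtype=float))
    labels = [label] if label is None or isinstance(label, str) else list(label)

    fig, ax = plt.subplots(figsize=(8, 4))
    for i, spectrum in enumerate(spectra):
        ax.plot(wavenumbers, spectrum, label=labels[i] if i < len(labels) else None)
    if peak_indices is not None:
        # Mark peaks on the first spectrum only.
        ax.plot(wavenumbers[peak_indices], spectra[0][peak_indices], "rx")

    ax.invert_xaxis()  # high wavenumbers on the left
    ax.set_xlabel("Wavenumber (cm$^{-1}$)")
    ax.set_ylabel("Absorbance")
    ax.set_title(title)
    if any(item is not None for item in labels):
        ax.legend()
    if filename is not None:
        fig.savefig(filename, dpi=150, bbox_inches="tight")
    return fig, ax
```

Peak indices could come from `scipy.signal.find_peaks`, which returns index positions suitable for `peak_indices`.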

---

### V. Spectral Library & Matching Tools (`spectral_library.py`)

1.  **`SpectralLibrary` Class**
    * **Purpose:** Stores and manages a collection of reference FTIR spectra (polymers, contaminants, etc.).
    * **Functionality:**
        * `add_reference(spectrum_data, metadata)`: Adds a new reference spectrum.
        * `load_from_csv(filepath)`: Loads a library from a structured CSV.
        * `search_by_name(polymer_name)`: Retrieves spectra by name.
        * `match_spectrum(query_spectrum, wavenumbers, method='correlation')`
            * **Methods:**
                * `'correlation'`: Pearson correlation coefficient (most common).
                * `'euclidean'`: Euclidean distance.
                * `'spectral_angle'`: Angle between the two spectra treated as vectors (the arccosine of their cosine similarity).
            * Returns ordered list of best matches (polymer name, similarity score).
    * **Dependencies:** `numpy`, `pandas`, `scipy.spatial.distance`.
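
The three matching methods above can be sketched as one ranking function. The library is represented here as a plain dictionary from polymer name to spectrum, and distances and angles are negated so that a higher score always means a better match; both are assumptions of this sketch. The query and all references must share the same wavenumber axis.

```python
import numpy as np


def match_spectrum(query, library, method="correlation", top_n=3):
    """Rank library entries (dict: name -> spectrum) against a query."""
    query = np.asarray(query, dtype=float)
    scores = []
    for name, reference in library.items():
        reference = np.asarray(reference, dtype=float)
        if method == "correlation":
            score = np.corrcoef(query, reference)[0, 1]
        elif method == "euclidean":
            score = -np.linalg.norm(query - reference)  # negated distance
        elif method == "spectral_angle":
            cosine = query @ reference / (
                np.linalg.norm(query) * np.linalg.norm(reference)
            )
            # Clip guards against rounding slightly outside [-1, 1].
            score = -np.arccos(np.clip(cosine, -1.0, 1.0))  # negated angle
        else:
            raise ValueError(f"unknown method: {method!r}")
        scores.append((name, float(score)))
    # Best match first.
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]
```

Pearson correlation ignores offset and scale differences between query and reference, whereas Euclidean distance does not, so the latter is only meaningful after consistent normalization.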

---

### VI. Main Application Logic (`main_app.py` or `gui.py`)

This would orchestrate the use of all the above modules.

* **User Interface:** Could be a command-line interface (CLI) for simple scripts or a graphical user interface (GUI) using libraries like `PyQt5`, `Tkinter`, or `Streamlit` (for web-based dashboards).
* **Workflow:**
    1.  Load raw spectra using `ftir_loader`.
    2.  Manage data with `spectral_dataset`.
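    3.  Apply selected preprocessing steps using `ftir_preprocessor`. To chain them with scikit-learn, wrap each function in a transformer first (e.g., `preprocess_pipeline = Pipeline([('baseline', corrector), ('smooth', smoother), ('normalize', normalizer)])`, where `corrector`, `smoother`, and `normalizer` are `FunctionTransformer` instances).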
    4.  Train or load `PolymerClassifier` using the preprocessed data.
    5.  Perform predictions on unknown samples.
    6.  Match spectra against `SpectralLibrary`.
    7.  Visualize results using `ftir_visualizer`.
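
The chaining idea in step 3 can be sketched with scikit-learn's `FunctionTransformer`. The `smooth` and `snv` helpers below are simplified stand-ins for the `ftir_preprocessor` functions, applied row-wise to a matrix of spectra, and the data are synthetic.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def smooth(X, window_length=11, poly_order=3):
    # Filter along axis 1 so each row (spectrum) is smoothed independently.
    return savgol_filter(X, window_length, poly_order, axis=1)


def snv(X):
    # Standard Normal Variate per spectrum.
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)


pipeline = Pipeline([
    ("smooth", FunctionTransformer(smooth, kw_args={"window_length": 11, "poly_order": 3})),
    ("normalize", FunctionTransformer(snv)),
    ("classify", SVC(C=10)),
])

# Synthetic data: two "polymers" with one band each at different positions.
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 100)
centers = np.where(np.arange(40) % 2 == 0, 0.3, 0.7)
X = np.exp(-0.5 * ((axis - centers[:, None]) / 0.05) ** 2)
X = X * rng.uniform(0.5, 2.0, size=(40, 1)) + rng.normal(0.0, 0.02, size=X.shape)
y = (centers > 0.5).astype(int)

pipeline.fit(X, y)
print(pipeline.score(X, y))  # training accuracy on the synthetic data
```

Named functions are used instead of lambdas so that the fitted pipeline can still be saved with `joblib` or `pickle`.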

---

**Example of a simple workflow using these tools:**

In [None]:
# Assumes the modules described above have been implemented as outlined.

import numpy as np
from sklearn.model_selection import train_test_split

import ftir_loader
import ftir_preprocessor as pre
from ftir_ml_models import PolymerClassifier, DimensionalityReducer
from ftir_visualizer import plot_spectrum, plot_pca_results, plot_confusion_matrix
from spectral_dataset import SpectralDataset
from spectral_library import SpectralLibrary

# --- 1. Load Data ---
# Assumes load_folder yields (spectrum, metadata, label) tuples.
spectra_raw_list = ftir_loader.load_folder('data/raw_spectra', '.spa')  # Or your specific format

dataset = SpectralDataset()
for spec_data, metadata, label in spectra_raw_list:
    dataset.add_spectrum(spec_data, metadata, label)

X_raw = dataset.get_X()
y_labels = dataset.get_y()
wavenumbers = dataset.get_wavenumbers()

# --- 2. Preprocess Data ---
# Apply the preprocessing steps in sequence
X_baseline_corrected = pre.correct_baseline(X_raw, method='als')
X_smoothed = pre.smooth_spectrum(X_baseline_corrected)
X_processed = pre.normalize_spectrum(X_smoothed, method='snv')  # Final preprocessed data for ML

# --- 3. Train ML Model ---
# Split the preprocessed arrays directly; stratify keeps the class
# proportions the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y_labels, test_size=0.2, random_state=42, stratify=y_labels
)

classifier = PolymerClassifier(model_type='svm', C=10)  # Example: SVM with C=10
classifier.train(X_train, y_train)

y_pred = classifier.predict(X_test)
metrics = classifier.evaluate(X_test, y_test)
print(metrics)
plot_confusion_matrix(y_test, y_pred, labels=np.unique(y_labels))

# --- 4. Dimensionality Reduction & Visualization ---
reducer = DimensionalityReducer(method='pca', n_components=2)
X_pca = reducer.fit_transform(X_processed)
# Assumes the reducer exposes its fitted PCA object as `pca_model`.
plot_pca_results(reducer.pca_model, X_pca, y_labels, title='PCA of Processed Spectra')

# --- 5. Spectral Library Matching (Example for a single unknown spectrum) ---
library = SpectralLibrary()
# The library must be populated before matching (e.g., from CSV or another folder),
# with references on the same wavenumber axis and preprocessing as the query.
library.load_from_csv('data/reference_library.csv')

unknown_spectrum_data = X_test[0]  # Take one spectrum from the test set as the unknown
matches = library.match_spectrum(unknown_spectrum_data, wavenumbers)
best_name, best_score = matches[0]  # (polymer name, similarity score), best match first
print(f"Unknown spectrum best match: {best_name} with score {best_score:.2f}")

# Plotting example for a single spectrum
plot_spectrum(wavenumbers, unknown_spectrum_data, title='Preprocessed Unknown Spectrum', label='Unknown Sample')