# Data acquisition and loading

This notebook loads the raw spectroscopic CSV files from the repository's `data/raw` folder and prepares a combined pandas DataFrame for downstream analysis.

Data source (on-disk): `data/raw/Plastic Washing CMSE project CSV files/ATR set 1_washed/`. Filenames encode polymer and contaminant metadata (examples: `HDPE-W_BSA 1203-67(#4)1.CSV`, `LDPE-W_BSA_OIL_CMC_STARCH 1203-72(#6)1.CSV`).
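
To make the filename convention concrete, here is a minimal sketch of how a name like the examples above could be turned into a target vector. It is not the code in `src.preprocessing.loader`; the helper names `parse_filename` and `encode_targets` are hypothetical, and the parsing rules (label is everything before the first space, parts are separated by `_`, the polymer is the text before the first `-`) are assumptions inferred from the two example filenames only.

```python
from pathlib import Path

# Label vocabularies, in the order used by the target names printed later
# in this notebook (is_HDPE ... has_STARCH).
POLYMERS = ["HDPE", "LDPE", "LLDPE", "PP"]
CONTAMINANTS = ["BSA", "OIL", "GUAR", "CMC", "STARCH"]


def parse_filename(name):
    """Hypothetical helper: return (polymer, contaminants) for one file name."""
    # Assumption: the label is everything before the first space,
    # e.g. "HDPE-W_BSA" in "HDPE-W_BSA 1203-67(#4)1.CSV".
    label = Path(name).stem.split(" ", 1)[0]
    head, *rest = label.split("_")
    # Assumption: the polymer is the text before the first "-";
    # any suffix such as "-W" is ignored here.
    polymer = head.split("-", 1)[0].upper()
    contaminants = [tok.upper() for tok in rest if tok.upper() in CONTAMINANTS]
    return polymer, contaminants


def encode_targets(name):
    """Hypothetical helper: build the 9-element 0/1 target vector."""
    polymer, contaminants = parse_filename(name)
    polymer_flags = [1.0 if polymer == p else 0.0 for p in POLYMERS]
    contaminant_flags = [1.0 if c in contaminants else 0.0 for c in CONTAMINANTS]
    return polymer_flags + contaminant_flags


for example in ["HDPE-W_BSA 1203-67(#4)1.CSV",
                "LDPE-W_BSA_OIL_CMC_STARCH 1203-72(#6)1.CSV"]:
    print(example, "->", encode_targets(example))
```

Under these assumptions, the first example maps to HDPE with BSA only, and the second to LDPE with BSA, OIL, CMC and STARCH. Filenames that follow a different pattern would need additional rules.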

In [14]:
# Use the spectra-based ML loader to prepare X (spectra) and y (one-hot targets)
# Use importlib and module object to avoid stale import errors in the notebook kernel
import importlib
import src.preprocessing.loader as loader_mod
importlib.reload(loader_mod)

try:
    data_root = repo_root / 'data' / 'raw' / 'Plastic Washing CMSE project CSV files' / 'ATR set 1_washed'
except NameError:
    from pathlib import Path
    data_root = Path.cwd() / 'data' / 'raw' / 'Plastic Washing CMSE project CSV files' / 'ATR set 1_washed'

print('Preparing spectra dataset from:', data_root)
X, y, feature_names, target_names = loader_mod.prepare_ml_dataset_spectra(data_root)
print('Spectra X shape:', X.shape)
print('Targets y shape:', y.shape)
print('Number of spectral features:', len(feature_names))
print('Target names:', target_names)

# quick preview: first spectrum and corresponding targets
if X.size:
    print('\nFirst spectrum (first 10 values):', X[0,:10])
    print('First sample targets:', y[0])


Preparing spectra dataset from: c:\Users\Mikey\Documents\GitHub\cmse492_project\data\raw\Plastic Washing CMSE project CSV files\ATR set 1_washed
Spectra X shape: (122, 1868)
Targets y shape: (122, 9)
Number of spectral features: 1868
Target names: ['is_HDPE', 'is_LDPE', 'is_LLDPE', 'is_PP', 'has_BSA', 'has_OIL', 'has_GUAR', 'has_CMC', 'has_STARCH']

First spectrum (first 10 values): [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
First sample targets: [1. 0. 0. 0. 1. 0. 0. 0. 0.]

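The printed shapes and the all-zero preview of the first spectrum suggest a few quick sanity checks before modelling. The sketch below is illustrative only: `check_dataset` and `count_all_zero_features` are hypothetical helpers, not part of the loader, and the demo arrays are synthetic stand-ins for the real `X` and `y`. It assumes that target columns whose names start with `is_` are mutually exclusive polymer labels.

```python
import numpy as np


def check_dataset(X, y, target_names):
    """Hypothetical sanity checks; returns a list of problem descriptions."""
    X = np.asarray(X)
    y = np.asarray(y)
    problems = []
    if X.shape[0] != y.shape[0]:
        problems.append("X and y have different numbers of samples")
    if y.ndim != 2 or y.shape[1] != len(target_names):
        problems.append("y columns do not match target_names")
    else:
        # Assumption: exactly one "is_*" polymer flag is set per sample.
        polymer_cols = [i for i, n in enumerate(target_names) if n.startswith("is_")]
        if polymer_cols:
            per_sample = y[:, polymer_cols].sum(axis=1)
            bad = int((per_sample != 1).sum())
            if bad:
                problems.append(f"{bad} sample(s) without exactly one polymer label")
    if not np.isfinite(X).all():
        problems.append("X contains NaN or infinite values")
    return problems


def count_all_zero_features(X):
    """Count spectral columns that are zero for every sample."""
    X = np.asarray(X)
    return int((X == 0).all(axis=0).sum())


# Synthetic demo data (not taken from the real dataset).
demo_names = ["is_HDPE", "is_LDPE", "has_BSA"]
X_demo = np.array([[0.0, 0.1, 0.2], [0.0, 0.3, 0.4]])
y_demo = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

print("Problems:", check_dataset(X_demo, y_demo, demo_names))
print("All-zero features:", count_all_zero_features(X_demo))
```

With the real arrays, the same calls would be `check_dataset(X, y, target_names)` and `count_all_zero_features(X)`. A large count of all-zero features may be worth investigating, but whether it indicates a problem depends on how the loader pads or aligns the spectra.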

In [12]:
# Reload the loader module in the notebook kernel (useful after editing source files)
import importlib
import src.preprocessing.loader as loader_mod
importlib.reload(loader_mod)
from src.preprocessing.loader import prepare_ml_dataset_spectra

try:
    data_root = repo_root / 'data' / 'raw' / 'Plastic Washing CMSE project CSV files' / 'ATR set 1_washed'
except NameError:
    from pathlib import Path
    data_root = Path.cwd() / 'data' / 'raw' / 'Plastic Washing CMSE project CSV files' / 'ATR set 1_washed'

print('Preparing spectra dataset from (reloaded):', data_root)
X, y, feature_names, target_names = prepare_ml_dataset_spectra(data_root)
print('Spectra X shape:', X.shape)
print('Targets y shape:', y.shape)
print('Number of spectral features:', len(feature_names))
print('Target names:', target_names)

if X.size:
    print('\nFirst spectrum (first 10 values):', X[0,:10])
    print('First sample targets:', y[0])
