This notebook explores models for element identification in LIBS spectra.
The model will not estimate abundance; rather, it provides a filter for subsequent fitting.
For example, a least-squares (L-S) fit of the atomic spectra of the indicated elements to the composite input spectrum yields abundance.
Note that the unequal intensities of the pure atomic spectra indicate that a non-equal weighting is needed in the L-S step.
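
The L-S step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's implementation: `fit_abundances`, `ref_spectra`, and `present` are hypothetical names, and scaling each reference spectrum to unit peak intensity is only one possible reading of the "non-equal weighting". Column scaling does not change the exact least-squares solution; it improves numerical conditioning when element intensities differ by orders of magnitude.

```python
import numpy as np

def fit_abundances(composite, ref_spectra, present):
    """Least-squares fit of the selected reference spectra to a composite spectrum.

    composite   : (n_wave,) measured spectrum
    ref_spectra : (n_wave, n_elem) pure-element spectra (assumed layout)
    present     : (n_elem,) boolean filter, e.g. the classifier output
    """
    cols = np.flatnonzero(present)
    A = ref_spectra[:, cols]
    scale = A.max(axis=0)  # per-element peak intensity (assumed weighting)
    coef_scaled, *_ = np.linalg.lstsq(A / scale, composite, rcond=None)
    abund = np.zeros(ref_spectra.shape[1])
    abund[cols] = coef_scaled / scale  # undo the column scaling
    return abund

# Synthetic check: five "elements" with very different overall intensities
rng = np.random.default_rng(0)
ref_spectra = rng.random((782, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])
true_abund = np.array([0.3, 0.0, 0.7, 0.0, 0.0])
composite = ref_spectra @ true_abund
est = fit_abundances(composite, ref_spectra, true_abund > 0)
```

In this noise-free toy case `est` recovers `true_abund`. With real spectra, a non-negative solver may be preferable, since abundances cannot be negative.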

While this may ultimately be implemented in PyTorch as a custom model (which would be a good exercise), there
are off-the-shelf tools for this task in sklearn that provide a straightforward starting point
and a benchmark for further improvements such as BP-MLL (Zhang/Zhou 2006).

In [1]:
#imports
import pickle
import numpy as np
from pathlib import Path
from matplotlib import pyplot as plt
##COMMENT OUT FOR NON-INTEL PROCESSOR
from sklearnex import patch_sklearn
patch_sklearn()
###

from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import jaccard_score

import plotly.graph_objects as go

top_dir = Path.cwd()
datapath = top_dir / 'data'

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
#data import
with open(datapath / 'training' / 'el80_pairs.pickle', 'rb') as f:
    elem_symb = pickle.load(f)
    el_index = pickle.load(f)
    fracs = pickle.load(f)
    wave = pickle.load(f)
    x_data = pickle.load(f)
    y_data = pickle.load(f)

Multilabel classification
The relevant class appears to be sklearn.multiclass.OneVsRestClassifier,
with the target y set to a (samples x classes) array of binary class indicators.
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier
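
The (samples x classes) indicator layout can be illustrated with a toy example. The element symbols and fractions below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 4 candidate elements
elem_toy = ["Fe", "Cu", "Al", "Si"]
fracs_toy = np.array([
    [0.6, 0.4, 0.0, 0.0],   # Fe + Cu
    [0.0, 0.0, 1.0, 0.0],   # pure Al
    [0.0, 0.5, 0.0, 0.5],   # Cu + Si
])

# One row per sample, one column per element; 1 means "element present"
y_toy = (fracs_toy > 0).astype(int)

# Map each indicator row back to element symbols
labels = [[elem_toy[j] for j in np.flatnonzero(row)] for row in y_toy]
```

Each column of `y_toy` is then treated by a one-vs-rest scheme as a separate binary problem (element present or absent).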

In [3]:
#prep data for multilabel classification
#x array will be the intensities (potentially scaled / transformed) by wavelength
#y array will be the binary indicators of element presence - e.g. length 80 array of 0/1s

#prediction accuracy is poor (about 5%, see results below) without a transform, using x_data as imported
eps = 1e-6
#X = x_data
X = np.log(x_data + eps) # (9560, 782)
#X = np.where(X < eps, eps, X)
#X = np.log(X)
y = (fracs > 0).astype('int') # (9560, 80)

#split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
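
One plausible reason the log transform helps is that it compresses the large dynamic range of line intensities, and kernel SVMs are sensitive to feature scale. The sketch below uses made-up intensity values to show the effect of `np.log(x + eps)`; it is an illustration, not an analysis of the actual data.

```python
import numpy as np

eps = 1e-6
# Hypothetical intensities spanning many orders of magnitude, including a zero
intensities = np.array([0.0, 1e-3, 1.0, 1e3, 1e5])

# eps keeps the log finite where the intensity is exactly zero
log_int = np.log(intensities + eps)

# Ratio of largest to smallest nonzero raw intensity vs. span after the transform
raw_ratio = intensities.max() / intensities[intensities > 0].min()
log_span = log_int.max() - log_int.min()
print(f"raw max/min ratio: {raw_ratio:.1e}, log-space span: {log_span:.1f}")
```

The choice of `eps` sets the floor of the transformed values, so it acts as a tunable parameter rather than a purely numerical safeguard.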

In [4]:
# Train classifier
#Notes on choice of estimator for labeled classification indicate LinearSVC for <100k samples, else SGDClassifier
# https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
#SVC was 1m 40s for 9,560 samples. Linear SVC took 10m 26s and did not converge
clf = OneVsRestClassifier(SVC()).fit(X_train, y_train)
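
Convergence problems with LinearSVC are often related to feature scaling. The sketch below shows one way to standardize features inside the one-vs-rest wrapper, using synthetic data. It has not been tested on the LIBS dataset, so whether it resolves the non-convergence noted above is an assumption; `X_demo`, `y_demo`, and the `max_iter` value are illustrative.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic features on very different scales (1 to 1000)
X_demo = rng.normal(size=(300, 10)) * np.logspace(0, 3, 10)
# Three binary labels, each determined by the sign of one feature
y_demo = np.column_stack(
    [X_demo[:, 0] > 0, X_demo[:, 5] > 0, X_demo[:, 9] > 0]
).astype(int)

# Standardize features, then fit one linear SVM per label
base = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
clf_demo = OneVsRestClassifier(base).fit(X_demo, y_demo)
pred_demo = clf_demo.predict(X_demo)
print(f"training label accuracy: {(pred_demo == y_demo).mean():.3f}")
```

Placing the scaler inside the pipeline ensures it is fitted only on the training data passed to `fit`.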

In [5]:
#review predictions for test set
y_pred = clf.predict(X_test)

In [6]:
#prediction accuracy, all labels exact
#a sample counts as a hit only if every one of its labels matches
hits = int(np.all(y_pred == y_test, axis=1).sum())
print(f"prediction accuracy: {hits / len(y_pred)}")
#accuracy 46.5% for log transformed x
#accuracy 5.18% for non-transformed x
#accuracy 5.6% for log.log transformed x


prediction accuracy: 0.46485355648535565


In [18]:
#Look at Jaccard scores, which give partial credit for partially correct label sets
#https://scikit-learn.org/stable/modules/model_evaluation.html#jaccard-similarity-score
samp_avg_jacc = jaccard_score(y_test, y_pred, average='samples') # Jaccard score per sample, then averaged
elem_jacc = np.round(jaccard_score(y_test, y_pred, average=None),4) # Jaccard score per element

fig = go.Figure(data=[go.Table(header=dict(values=['Element', 'Jaccard Score']),
                 cells=dict(values=[elem_symb, elem_jacc]))])
fig.update_layout(width=500, height=2000)
fig.show()
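
To make the difference between exact-match accuracy and the Jaccard score concrete, here is a small worked example in plain NumPy. The arrays are made up; the calculations mirror what `average='samples'` and `average=None` compute, namely intersection over union of the true and predicted label sets, taken per row or per column.

```python
import numpy as np

# Hypothetical labels: 3 samples, 4 classes
y_true = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 1, 1, 0]])
y_hat = np.array([[1, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 1, 0, 1]])

# Exact match: only the second sample has every label correct
exact = np.all(y_true == y_hat, axis=1).mean()

# Per-sample Jaccard: |true AND pred| / |true OR pred| along each row
inter = np.logical_and(y_true, y_hat).sum(axis=1)
union = np.logical_or(y_true, y_hat).sum(axis=1)
per_sample = inter / union          # 1/2, 1/1, 1/3
sample_avg = per_sample.mean()

# Per-class Jaccard: same ratio along each column
per_class = (np.logical_and(y_true, y_hat).sum(axis=0)
             / np.logical_or(y_true, y_hat).sum(axis=0))  # 1, 1/2, 0, 0
```

Here the exact-match accuracy is 1/3, while the sample-averaged Jaccard score is about 0.61, reflecting the partial credit for the first and third samples.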


https://scikit-learn.org/stable/auto_examples/multioutput/plot_classifier_chain_yeast.html#sphx-glr-auto-examples-multioutput-plot-classifier-chain-yeast-py

Look at another variation of multilabel classification, the classifier chain, which does not assume independence between labels
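
A minimal sketch of a classifier chain on synthetic data follows. Each classifier in the chain receives the predictions for earlier labels as extra features, so correlations between labels can be exploited. The data, the LogisticRegression base estimator, and the random chain order are assumptions for illustration and have not been evaluated on the LIBS spectra.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 10))
# Three labels with deliberate dependence: y1 depends on y0, y2 depends on y1
y0 = X_toy[:, 0] > 0
y1 = (X_toy[:, 1] + 1.5 * y0) > 0.75
y2 = (X_toy[:, 2] - 1.5 * y1) > -0.75
y_toy = np.column_stack([y0, y1, y2]).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# Base estimator passed positionally; order="random" shuffles the label order
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=0)
chain.fit(X_tr, y_tr)
y_chain = chain.predict(X_te).astype(int)

# zero_division=0 covers samples with no true and no predicted labels
chain_jacc = jaccard_score(y_te, y_chain, average="samples", zero_division=0)
print(f"sample-averaged Jaccard score: {chain_jacc:.3f}")
```

Because the result depends on the label order, the linked scikit-learn example averages an ensemble of chains with different random orders.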