# Autoencoding for clustering of spectroscopic data

---

Lecture: "Physics-augmented machine learning" @ Cyber-Physical Simulation, TU Darmstadt

Lecturer: Prof. Oliver Weeger

Assistants: Dr.-Ing. Maximilian Kannapin, Jasper O. Schommartz, Dominik K. Klein

Summer term 2025

---

Experimental data by Ho et al.: "Rapid identification of pathogenic bacteria using Raman spectroscopy and deep learning". Nature Communications 10:4927 (2019).



*Run the following cell to clone the GitHub repository in your current Google Colab environment.*

In [None]:
!git clone https://github.com/CPShub/LecturePhysicsAwareML.git

*Run the following cell to import all modules and Python files into this notebook. If you changed the Python files, run the cell again to update them in this notebook. You might need to restart your Colab session first ("Runtime / Restart session" in the header menu).*


In [None]:
import datetime

import pandas as pd
import tensorflow as tf

import LecturePhysicsAwareML.Autoencoder.data as ld
import LecturePhysicsAwareML.Autoencoder.models as lm
import LecturePhysicsAwareML.Autoencoder.plots as lp

now = datetime.datetime.now

*Run this cell instead if you are executing the notebook locally on your device.*

In [None]:
import datetime
import pandas as pd
import tensorflow as tf

import data as ld
import models as lm
import plots as lp

now = datetime.datetime.now

*If you want to clone the repository again, you have to delete it from your Google Colab files first. For this, you can run the following cell.*

In [None]:
%rm -rf LecturePhysicsAwareML

Build the full autoencoder and the encoder

In [None]:
latent_variables = 2  # number of latent dimensions
nodes = 64            # number of hidden encoder/decoder nodes
feature_number = 1000 # number of measurements per spectrum

# Build full encoder-decoder model
units = [nodes, latent_variables, nodes, feature_number]
activation = ['softplus', 'linear', 'softplus', 'linear']
autoencoder = lm.build(input_shape=feature_number, units=units, activation=activation)

# Build encoder model (for later evaluation of latent variables)
units = [nodes, latent_variables]
activation = ['softplus', 'linear']
encoder = lm.build(input_shape=feature_number, units=units, activation=activation)
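
The encoder repeats the first two layers of the autoencoder, which is what later allows its weights to be copied over. The following sketch lists the weight shapes this implies. It assumes that `lm.build` stacks plain dense layers with one kernel and one bias each, which is not confirmed here; `dense_weight_shapes` and `count_parameters` are hypothetical helpers, not part of the repository.

```python
def dense_weight_shapes(input_dim, units):
    """Return (kernel_shape, bias_shape) for each dense layer, in order."""
    shapes = []
    fan_in = input_dim
    for fan_out in units:
        shapes.append(((fan_in, fan_out), (fan_out,)))
        fan_in = fan_out
    return shapes


def count_parameters(shapes):
    """Total number of scalar weights in all kernels and biases."""
    return sum(kernel[0] * kernel[1] + bias[0] for kernel, bias in shapes)


# Same sizes as in the cell above
feature_number, nodes, latent_variables = 1000, 64, 2

autoencoder_shapes = dense_weight_shapes(
    feature_number, [nodes, latent_variables, nodes, feature_number]
)
encoder_shapes = dense_weight_shapes(feature_number, [nodes, latent_variables])

# Assumption: one kernel and one bias per layer, so the encoder corresponds
# to the first two layers (four weight tensors) of the autoencoder.
assert autoencoder_shapes[:2] == encoder_shapes

print(count_parameters(encoder_shapes))      # 64194
print(count_parameters(autoencoder_shapes))  # 129386
```

Under this assumption, roughly half of the autoencoder's parameters sit in the encoder, and every spectrum of 1000 intensities is compressed to only 2 latent values.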

Select bacteria sets to be investigated

In [None]:
# define bacteria sets to be investigated (numbers between 0 and 29)
cases = [18, 27, 0, 26]
raman_shift, intensity_spectrum, label = ld.load_data(cases)

# Create a DataFrame with label, raman_shift, and intensity
# Only use the first 5 components of raman_shift and intensity for each row
df = pd.DataFrame({
    'Bacteria class': label,
    'Raman shift': [rs[:5] for rs in raman_shift],
    'Intensity spectrum': [intens[:5] for intens in intensity_spectrum]
})
df

Define study and calibrate the autoencoder

In [None]:
# Fit encoder-decoder model
epochs = 500
h = autoencoder.fit(
    [intensity_spectrum], [intensity_spectrum], epochs=epochs, verbose=2
)

lp.plot_loss(h)
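
Because the spectra are passed as both input and target, the network is trained to reproduce each spectrum after squeezing it through the latent bottleneck. The NumPy sketch below illustrates this forward pass on synthetic data. It assumes plain dense layers with biases and a mean squared error loss; the actual architecture details and loss are set inside `lm.build` and may differ. The weights here are random placeholders, not trained values.

```python
import numpy as np


def softplus(x):
    # Numerically stable evaluation of log(1 + exp(x))
    return np.logaddexp(0.0, x)


rng = np.random.default_rng(0)
feature_number, nodes, latent_variables = 1000, 64, 2


def init(fan_in, fan_out):
    # Random placeholder kernel and zero bias (assumption, not trained values)
    return rng.normal(scale=0.05, size=(fan_in, fan_out)), np.zeros(fan_out)


W1, b1 = init(feature_number, nodes)
W2, b2 = init(nodes, latent_variables)
W3, b3 = init(latent_variables, nodes)
W4, b4 = init(nodes, feature_number)


def encode(x):
    # softplus hidden layer followed by a linear latent layer
    return softplus(x @ W1 + b1) @ W2 + b2


def decode(z):
    # softplus hidden layer followed by a linear output layer
    return softplus(z @ W3 + b3) @ W4 + b4


spectra = rng.random((8, feature_number))  # synthetic stand-in for real spectra
latent = encode(spectra)
reconstruction = decode(latent)

# Reconstruction error between input and output (assumed loss: MSE)
mse = np.mean((reconstruction - spectra) ** 2)
print(latent.shape, reconstruction.shape)  # (8, 2) (8, 1000)
```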

Visualize results

In [None]:
# Transfer weights from the encoder-decoder model to the encoder model
# for evaluation of the latent variables
encoder.set_weights(autoencoder.weights[0:4])

# Plot the latent space for every pair of latent dimensions with i > j
for i in range(latent_variables):
    for j in range(i):
        lp.plot_latent_space_ij(encoder, intensity_spectrum, label, i, j)

# Plot the spectra of each selected bacteria type
# (separate variable names keep the training data loaded above intact)
for i, case in enumerate(cases):
    case_shift, case_spectrum = ld.load_single_case(case)
    lp.plot_spectra(case_shift, case_spectrum, i, case)
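
The latent-space loop in the cell above draws one plot per pair of latent dimensions, so the number of plots grows quadratically with `latent_variables`. The small sketch below enumerates the index pairs that are visited; `latent_pairs` is a hypothetical helper introduced only for illustration.

```python
def latent_pairs(latent_variables):
    """Index pairs (i, j) with i > j, in the order the nested loops visit them."""
    return [(i, j) for i in range(latent_variables) for j in range(i)]


print(latent_pairs(2))       # [(1, 0)]
print(latent_pairs(3))       # [(1, 0), (2, 0), (2, 1)]
print(len(latent_pairs(4)))  # 6
```

With `latent_variables = 2`, as set in this notebook, a single plot of latent dimension 1 against latent dimension 0 is produced.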