# Model Training

This notebook builds and trains the two models used by the pipeline.
Each model serves a different purpose.
* For synthesizing missing data, an autoencoder is trained on the entries with complete data and then applied to the entries with missing data.
* For dimensionality reduction, a second autoencoder is trained to reconstruct the full dataset. Only its encoder part is then kept and used to project the full data into a lower-dimensional space that should still retain most of the information.
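
As a rough illustration of the second bullet, the sketch below pushes a toy batch through a stack of dense ReLU layers that narrows to a 2-dimensional bottleneck. It uses plain NumPy with random, untrained weights. The layer sizes, the `dense_relu` and `encode` helpers, and the toy data are assumptions for illustration only, not the notebook's trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, weights, bias):
    """One fully connected layer followed by a ReLU activation."""
    return np.maximum(x @ weights + bias, 0.0)

def encode(x, layer_sizes):
    """Apply a stack of dense ReLU layers with random (untrained) weights."""
    out = x
    for n_out in layer_sizes:
        n_in = out.shape[1]
        weights = rng.normal(scale=0.1, size=(n_in, n_out))
        bias = np.zeros(n_out)
        out = dense_relu(out, weights, bias)
    return out

# Toy batch: 5 rows with 40 binary features (hypothetical sizes)
toy_batch = rng.integers(0, 2, size=(5, 40)).astype(float)
# Narrow 40 -> 32 -> 16 -> 8 -> 2, ending in a 2-unit bottleneck
reduced = encode(toy_batch, layer_sizes=[32, 16, 8, 2])
print(reduced.shape)  # (5, 2)
```

Each row ends up described by two numbers instead of forty. In the trained model, the weights are chosen so that a decoder can reconstruct the original row from those two numbers as closely as possible.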

The models are meant to be trained separately from the main pipeline; the pipeline itself can be configured to load the pre-trained models and perform the corresponding tasks.

Both models are stored locally in the container. This was chosen because the project is a demo: anyone can download the repo, build the Docker container, and run the whole pipeline. In a proper deployment, remote storage such as S3 would be preferred, and the pipeline code would be modified to read from that storage.

In [1]:
# Import dependencies
import os
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
import pandas as pd

2025-08-14 19:15:33.526485: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-08-14 19:15:33.526927: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.

## 1. Autoencoder for synthesizing data

In [2]:
# Import the dataset for building/training
dataset_path = os.path.join('..', 'data', 'dataset.xlsx')
dataset = pd.read_excel(dataset_path)

In [3]:
# Bring the dataset to the appropriate format
# Only keep the entries with full data, to be used for training 
dataset_train = dataset.dropna()
# Collect the Recontact column names, to separate them from the Core columns
re_cols_lst = [col for col in dataset.columns if 'core_re' in col]
# Use the Core as X
dataset_X = dataset_train.drop(re_cols_lst, axis=1).reset_index(drop=True)
# Use the Recontact as y
dataset_y = dataset_train[re_cols_lst].reset_index(drop=True)

In [4]:
# Build the autoencoder
# Input should reflect the shape of X
input_dim = dataset_X.shape[1]
# Output should reflect the shape of y
output_dim = dataset_y.shape[1]
# Compact architecture, suited for the small dataset we have
input_layer = Input(shape=(input_dim,))
encoder = Dense(16, activation='relu')(input_layer)
encoder = Dense(8, activation='relu')(encoder)
encoder = Dense(4, activation='relu')(encoder)  # Bottleneck
decoder = Dense(8, activation='relu')(encoder)
decoder = Dense(16, activation='relu')(decoder)
decoder = Dense(output_dim, activation='sigmoid')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
# Compile the autoencoder
autoencoder.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

2025-08-14 19:20:38.075667: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


In [5]:
# Print the autoencoder architecture
# summary() prints the table itself and returns None, so it is not wrapped in print()
autoencoder.summary()


In [6]:
# Train the autoencoder
autoencoder.fit(dataset_X, dataset_y,
                epochs=100,
                batch_size=32,
                shuffle=True,
                validation_split=0.2,
                verbose=1)

Epoch 1/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - loss: 0.2497 - val_loss: 0.2454
Epoch 2/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.2436 - val_loss: 0.2360
Epoch 3/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.2320 - val_loss: 0.2137
Epoch 4/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.2070 - val_loss: 0.1798
Epoch 5/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.1750 - val_loss: 0.1571
Epoch 6/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.1556 - val_loss: 0.1488
Epoch 7/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1474 - val_loss: 0.1465
Epoch 8/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.1476 - val_loss: 0.1456
Epoch 9/100
19/19 ━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x771635759e20>
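
One way to read the first epochs of the log: if the targets are binary (an assumption here, consistent with the sigmoid output layer), a network that outputs roughly 0.5 everywhere scores an MSE of 0.25, which is about where the loss starts. The toy check below uses made-up data and plain NumPy, not the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up binary targets: 100 rows, 12 answer columns (hypothetical sizes)
y_true = rng.integers(0, 2, size=(100, 12)).astype(float)
# An untrained sigmoid output typically sits near 0.5 for every cell
y_pred = np.full_like(y_true, 0.5)

# Mean squared error, the loss used when compiling the model
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.25, since every error is exactly +/-0.5
```

A falling loss then means the model is doing better than that uninformative guess, though part of the gain may simply come from learning each column's base rate.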

At this stage, the autoencoder is trained and ready. To synthesize data with it, we split the remaining entries (those with missing values) into Core and Recontact as before. This time we keep only the Core part, which serves as the X used to predict (synthesize) the Recontact answers.

In [7]:
# Pick the entries with unanswered questions
dataset_missing = dataset[dataset.isna().any(axis=1)]
# Use Core questions (no missing data here) as X
dataset_X_missing = dataset_missing.drop(re_cols_lst, axis=1).reset_index(drop=True)
# Use trained model to synthesize the missing data
results = autoencoder.predict(dataset_X_missing)
# Create a pandas DataFrame and round to {0, 1} to display the synthesized data
dataset_y_missing = pd.DataFrame(data=results, columns=dataset_y.columns).round()

55/55 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step


In [8]:
dataset_y_missing

,core_re_q5_1,core_re_q5_2,core_re_q5_3,core_re_q5_4,core_re_q5_5,core_re_q5_6,core_re_q5_7,core_re_q5_8,core_re_q5_9,core_re_q5_10,...,core_re_q10_1,core_re_q10_2,core_re_q10_3,core_re_q10_4,core_re_q10_5,core_re_q10_6,core_re_q10_7,core_re_q10_8,core_re_q10_9,core_re_q10_10
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1746,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1747,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
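
The rounding step above amounts to thresholding the sigmoid outputs at 0.5. A minimal sketch with made-up probabilities follows; the explicit threshold and the `to_binary` helper are assumptions for illustration, not what the notebook runs. One small difference: NumPy-style rounding sends an exact 0.5 to 0 (round half to even), whereas `>=` sends it to 1.

```python
import numpy as np

def to_binary(probabilities, threshold=0.5):
    """Map sigmoid outputs in [0, 1] to hard 0/1 answers."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# Made-up model outputs for 2 respondents and 3 questions
probs = np.array([[0.10, 0.75, 0.50],
                  [0.49, 0.51, 0.99]])

print(to_binary(probs).tolist())   # [[0, 1, 1], [0, 1, 1]]
print(np.round(probs).tolist())    # [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
```

An explicit threshold also makes it easy to try a stricter or looser cut-off if the synthesized answers look too sparse or too dense.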


We save the model so that the pipeline can load it directly. In a proper deployment, S3 storage would be preferred; since this is a demo, we use a local folder that is included in the containerized application, so that anyone can download and run the full pipeline.

In [None]:
# Save the trained model
model_name = 'model_synthesize'
model_path = os.path.join('..', 'models', f'{model_name}.keras')
autoencoder.save(model_path)
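
On the pipeline side, reading the saved model back could look like the sketch below. `load_model` is the standard Keras loader; the `model_file_path` and `load_trained_model` helpers, and the idea of failing early on a missing file, are assumptions rather than the pipeline's actual code.

```python
import os

def model_file_path(model_name, models_dir=os.path.join('..', 'models')):
    """Build the path the notebook saves to: <models_dir>/<model_name>.keras"""
    return os.path.join(models_dir, f'{model_name}.keras')

def load_trained_model(model_name, models_dir=os.path.join('..', 'models')):
    """Load a saved Keras model, failing early if the file is missing."""
    path = model_file_path(model_name, models_dir)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f'No trained model at {path}; run the training notebook first'
        )
    # Imported here so the path helper can be used without loading TensorFlow
    from tensorflow.keras.models import load_model
    return load_model(path)

# Show the path that would be read for the synthesizing model
print(model_file_path('model_synthesize', models_dir='models'))
```

A call such as `load_trained_model('model_synthesize')` would then return the model, ready for `predict`.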

## 2. Encoder for dimensionality reduction

### Warning! 

The model for dimensionality reduction expects an input of a fixed length, that is, a specific number of features in the dataset.
If feature reduction options such as the variance or correlation checks are active in the pipeline, the number of features may vary with the specific settings. If the pipeline is to perform dimensionality reduction with the Encoder method, it is therefore advised not to activate the variance or correlation checks.
Similarly, if the ignore missing data option is selected, the model for dimensionality reduction must be trained and used with data that includes only the Core questions (which have no missing datapoints).
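
A guard along these lines could catch such a mismatch before prediction. It is a sketch under assumptions: `check_feature_count` is a hypothetical helper, and with a Keras model the expected width would typically be read from `model.input_shape[-1]`.

```python
import numpy as np

def check_feature_count(features, expected_dim):
    """Raise a clear error if the data width differs from the model's input width."""
    actual_dim = np.asarray(features).shape[1]
    if actual_dim != expected_dim:
        raise ValueError(
            f'Model expects {expected_dim} features but the data has {actual_dim}; '
            'check whether variance/correlation filtering changed the columns'
        )
    return actual_dim

# Toy data: 4 rows, 6 features (hypothetical sizes)
toy = np.zeros((4, 6))
print(check_feature_count(toy, expected_dim=6))  # 6

try:
    check_feature_count(toy[:, :5], expected_dim=6)  # one column was filtered out
except ValueError as err:
    print(err)
```

Failing with an explicit message is usually easier to diagnose than the shape error Keras would raise deep inside `predict`.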

In [9]:
# To continue with the training of the reduction model using the above synthesized data:
# Arrange the data for concatenation
dataset_known = pd.concat([dataset_X, dataset_y], axis=1)
dataset_synthesized = pd.concat([dataset_X_missing, dataset_y_missing], axis=1)
# Concatenate synthesized with the known data to provide full dataset
dataset_full = pd.concat([dataset_known, dataset_synthesized]).reset_index(drop=True)

In [12]:
# Choose encoding dimension
encoding_dim = 2

In [13]:
# 1. Build the autoencoder
input_dim = dataset_full.shape[1]
# Smaller architecture for CPU training
input_layer = Input(shape=(input_dim,))
encoder = Dense(32, activation='relu')(input_layer)
encoder = Dense(16, activation='relu')(encoder)
encoder = Dense(8, activation='relu')(encoder)
encoder = Dense(encoding_dim, activation='relu')(encoder)  # Bottleneck
decoder = Dense(8, activation='relu')(encoder)
decoder = Dense(16, activation='relu')(decoder)
decoder = Dense(32, activation='relu')(decoder)
decoder = Dense(input_dim, activation='sigmoid')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

In [14]:
# Print the autoencoder architecture
# summary() prints the table itself and returns None, so it is not wrapped in print()
autoencoder.summary()


In [15]:
# 2. Train the autoencoder to reconstruct its own input
autoencoder.fit(dataset_full, dataset_full,
                epochs=120,
                batch_size=32,
                shuffle=True,
                validation_split=0.2,
                verbose=1)

Epoch 1/120
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 0.2405 - val_loss: 0.1343
Epoch 2/120
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1341 - val_loss: 0.1085
Epoch 3/120
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1244 - val_loss: 0.1059
Epoch 4/120
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1214 - val_loss: 0.1048
Epoch 5/120
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1205 - val_loss: 0.1034
Epoch 6/120
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1193 - val_loss: 0.1021
Epoch 7/120
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1183 - val_loss: 0.1008
Epoch 8/120
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1175 - val_loss: 0.0996
Epoch 9/120
63/63 ━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x771634171a00>

In [16]:
# 3. Keep only the encoder (input layer up to the bottleneck) and save it
encoder_model = Model(inputs=input_layer, outputs=encoder)
model_name = 'model_reduce'
model_path = os.path.join('..', 'models', f'{model_name}.keras')
encoder_model.save(model_path)