
[Performance Issue]: Model Convergence Issue in Keras with Parallel Execution of the fit() Method? #1245

Open
AdrienDeverin opened this issue May 13, 2024 · 7 comments


@AdrienDeverin

Brief Description

I've encountered a perplexing issue while utilizing Keras and its fit() function to train a standard CNN.

To illustrate, consider the following code snippet where the model learns and converges successfully:

// Get the data
int epoch = 100;
TensorLoader tensorLoader = new TensorLoader();
(NDArray xData, NDArray yData) = tensorLoader.GetDatasFromZeroFormat(@"C:\GitHub\CSharpCode\Solutions\PNN_Master_Delta\_Datas_\Cnn_Inputs\GSCD_4+1Classes_Tensors_trigger_16_zeroF_v2_f6");

// Model initialization and setup
IModel? model = keras.models.load_model(@"C:\GitHub\CSharpCode\Solutions\PNN_Master_Delta\_Datas_\Cnn_Model\GSCD_16x90_4+1classes_trigger_n00");
IOptimizer optimizer = keras.optimizers.Adam(0.01f);
ILossFunc loss = keras.losses.BinaryCrossentropy();

// Compile (IMPORTANT LINE)
model.compile(optimizer: optimizer, loss: loss, metrics: new string[1] { "acc" });

// Parallel execution for training
Thread t = new Thread(() =>
{
    List<ICallback> callbacks = new List<ICallback> { new EarlyStoppingCallback(
                    new Tensorflow.Keras.Callbacks.CallbackParams { Model = model, Epochs = epoch, Verbose = -1},
                    monitor: "val_loss", patience: 5) }; // Early Stopping if problem 
    model.fit(xData, yData,
             batch_size: 32,
             epochs: epoch,
             validation_split: 0.1f,
             callbacks: callbacks,
             shuffle: false, verbose: -1);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

However, when model.compile() and model.fit() are executed sequentially in the same thread, the model appears to learn but fails to converge, as demonstrated below:

// xData, yData and epoch are obtained exactly as in the first snippet
IModel? model = null;
string folder = @"C:\GitHub\CSharpCode\Solutions\PNN_Master_Delta\_Datas_\Cnn_Model\GSCD_16x90_4+1classes_trigger_n00"; // a path to a saved_model

Thread t = new Thread(() =>
{
    // Loading the model
    model = keras.models.load_model(folder);
    IOptimizer optimizer = keras.optimizers.Adam(0.01f);
    ILossFunc loss = keras.losses.BinaryCrossentropy();
    model.compile(optimizer: optimizer, loss: loss, metrics: new string[1] { "acc" });

    // Model training
    List<ICallback> callbacks = new List<ICallback> { new EarlyStoppingCallback(
                    new Tensorflow.Keras.Callbacks.CallbackParams { Model = model, Epochs = epoch, Verbose = -1},
                    monitor: "val_loss", patience: 5) }; // Early Stopping if problem 
    model.fit(xData, yData,
             batch_size: 32,
             epochs: epoch,
             validation_split: 0.1f,
             callbacks: callbacks,
             shuffle: false, verbose: -1);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

Remarkably, the difference in convergence between these two examples is entirely deterministic (both runs use exactly the same dataset, of course). The logs are shown below.
Despite thorough investigation, I have failed to identify any critical difference: all variables remain consistent.

Specifically, it seems that model.compile() must not be executed in the same thread as fit() for the model to converge. I don't understand why...

Any insights or suggestions on this peculiar behavior would be greatly appreciated. The data used can be provided on request.
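
(For completeness, a minimal sketch of how the two variants could be compared with the random seed pinned, to rule out RNG state as the differentiator. This assumes tf.set_random_seed is available in the TF.NET build used here, and reuses folder, xData and yData from the snippets above.)

// Hypothetical diagnostic, not part of the original code: pin the global seed
// so that initializer/dropout randomness cannot differ between the two variants.
tf.set_random_seed(1234);

IModel? model = keras.models.load_model(folder);
model.compile(optimizer: keras.optimizers.Adam(0.01f),
              loss: keras.losses.BinaryCrossentropy(),
              metrics: new string[] { "acc" });

// With shuffle: false and weights loaded from disk, the first-epoch loss should be
// identical whether this fit() call runs here or inside a worker thread.
model.fit(xData, yData, batch_size: 32, epochs: 1,
          validation_split: 0.1f, shuffle: false, verbose: -1);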

Best regards,
DEVERIN Adrien

Device and Context

Used on CPU

Benchmark

Logs for example 1 (convergence works):

2024-05-13 16:51:59.842255: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-13 16:51:59.847909: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Data Loaded sucessfully ...
Epoch: 001/100
0076/0076 [=====] - 5ms/step - loss: 0,272631 - accuracy: 0,883721 - val_loss: 0,274186 - val_accuracy: 0,902622
Epoch: 002/100
0076/0076 [=====] - 5ms/step - loss: 0,271166 - accuracy: 0,883721 - val_loss: 0,272107 - val_accuracy: 0,902622
Epoch: 003/100
0076/0076 [=====] - 5ms/step - loss: 0,270275 - accuracy: 0,880814 - val_loss: 0,270979 - val_accuracy: 0,902622
Epoch: 004/100
0076/0076 [=====] - 6ms/step - loss: 0,269701 - accuracy: 0,880814 - val_loss: 0,270195 - val_accuracy: 0,902622
Epoch: 005/100
0076/0076 [=====] - 5ms/step - loss: 0,269153 - accuracy: 0,879568 - val_loss: 0,269510 - val_accuracy: 0,902622
Epoch: 006/100
0076/0076 [=====] - 5ms/step - loss: 0,268411 - accuracy: 0,883306 - val_loss: 0,268801 - val_accuracy: 0,902622
Epoch: 007/100
0076/0076 [=====] - 5ms/step - loss: 0,267722 - accuracy: 0,884136 - val_loss: 0,268095 - val_accuracy: 0,902622
Epoch: 008/100
0076/0076 [=====] - 5ms/step - loss: 0,267106 - accuracy: 0,885382 - val_loss: 0,267437 - val_accuracy: 0,902622
Epoch: 009/100
0076/0076 [=====] - 5ms/step - loss: 0,266548 - accuracy: 0,886628 - val_loss: 0,266843 - val_accuracy: 0,902622
Epoch: 010/100
0076/0076 [=====] - 5ms/step - loss: 0,266097 - accuracy: 0,886628 - val_loss: 0,266379 - val_accuracy: 0,902622
Epoch: 011/100
0076/0076 [=====] - 5ms/step - loss: 0,265732 - accuracy: 0,886213 - val_loss: 0,265995 - val_accuracy: 0,902622
Epoch: 012/100
0076/0076 [=====] - 6ms/step - loss: 0,265397 - accuracy: 0,883721 - val_loss: 0,265670 - val_accuracy: 0,902622
Epoch: 013/100
0076/0076 [=====] - 5ms/step - loss: 0,265122 - accuracy: 0,885797 - val_loss: 0,265376 - val_accuracy: 0,902622
Epoch: 014/100
0076/0076 [=====] - 5ms/step - loss: 0,264859 - accuracy: 0,885382 - val_loss: 0,265103 - val_accuracy: 0,902622
Epoch: 015/100
0076/0076 [=====] - 7ms/step - loss: 0,264639 - accuracy: 0,885797 - val_loss: 0,264853 - val_accuracy: 0,902622
Epoch: 016/100
0076/0076 [=====] - 6ms/step - loss: 0,264446 - accuracy: 0,886213 - val_loss: 0,264638 - val_accuracy: 0,902622
Epoch: 017/100
0076/0076 [=====] - 5ms/step - loss: 0,264189 - accuracy: 0,885797 - val_loss: 0,264377 - val_accuracy: 0,898876
Epoch: 018/100
0076/0076 [=====] - 6ms/step - loss: 0,263984 - accuracy: 0,884551 - val_loss: 0,264165 - val_accuracy: 0,898876
Epoch: 019/100
0076/0076 [=====] - 4ms/step - loss: 0,263786 - accuracy: 0,884551 - val_loss: 0,263960 - val_accuracy: 0,898876
Epoch: 020/100
0076/0076 [=====] - 4ms/step - loss: 0,263596 - accuracy: 0,885382 - val_loss: 0,263759 - val_accuracy: 0,898876

etc.

Logs for example 2 (convergence fails):

2024-05-13 16:54:19.347661: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-13 16:54:19.354123: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Data Loaded sucessfully ...
Epoch: 001/100
0076/0076 [=====] - 5ms/step - loss: 0,269264 - accuracy: 0,886213 - val_loss: 0,274375 - val_accuracy: 0,880150
Epoch: 002/100
0076/0076 [=====] - 5ms/step - loss: 0,268268 - accuracy: 0,887043 - val_loss: 0,268760 - val_accuracy: 0,880150
Epoch: 003/100
0076/0076 [=====] - 5ms/step - loss: 0,267617 - accuracy: 0,884551 - val_loss: 0,271331 - val_accuracy: 0,880150
Epoch: 004/100
0076/0076 [=====] - 7ms/step - loss: 0,266189 - accuracy: 0,886213 - val_loss: 0,273943 - val_accuracy: 0,880150
Epoch: 005/100
0076/0076 [=====] - 5ms/step - loss: 0,266029 - accuracy: 0,884136 - val_loss: 0,275886 - val_accuracy: 0,880150
Epoch: 006/100
0076/0076 [=====] - 5ms/step - loss: 0,264438 - accuracy: 0,885797 - val_loss: 0,278307 - val_accuracy: 0,876405
Epoch: 007/100
0076/0076 [=====] - 5ms/step - loss: 0,263542 - accuracy: 0,884967 - val_loss: 0,277512 - val_accuracy: 0,880150
Epoch: 008/100
0076/0076 [=====] - 5ms/step - loss: 0,264521 - accuracy: 0,884967 - val_loss: 0,280630 - val_accuracy: 0,876405
Restoring model weights from the end of the best epoch...

Alternatives

No response

@AsakusaRinne
Collaborator

Hi, the losses in the two logs seem to differ very little. In the second log, only the last step shows an increasing loss. What does the loss look like after that?

Are you using debug mode or release mode? If you are in debug mode, does the result differ under release mode?

@AdrienDeverin
Author

AdrienDeverin commented May 14, 2024 via email

@AsakusaRinne
Collaborator

I have no idea about it yet. model.compile just gathers all the information in preparation for graph building. What happens if you put the following code in the main thread

model = keras.models.load_model(folder);
IOptimizer optimizer = keras.optimizers.Adam(0.01f);
ILossFunc loss = keras.losses.BinaryCrossentropy();

but keep model.compile in the separate thread along with model.fit?
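
For clarity, a minimal sketch of that arrangement, reusing the variable names from the snippets above (untested):

// Main thread: load the model and create the optimizer/loss objects
model = keras.models.load_model(folder);
IOptimizer optimizer = keras.optimizers.Adam(0.01f);
ILossFunc loss = keras.losses.BinaryCrossentropy();

// Worker thread: compile and fit together
Thread t = new Thread(() =>
{
    model.compile(optimizer: optimizer, loss: loss, metrics: new string[] { "acc" });
    model.fit(xData, yData,
              batch_size: 32,
              epochs: epoch,
              validation_split: 0.1f,
              shuffle: false, verbose: -1);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();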

@AdrienDeverin
Author

AdrienDeverin commented May 14, 2024 via email

@AdrienDeverin
Author

AdrienDeverin commented May 14, 2024

I'll add another strange observation. I tried another example, the one from the TensorFlow.NET GitHub page, and got the opposite result!

Below is the code used:

// Load data
var ((x_train, y_train), (x_test, y_test)) = keras.datasets.cifar10.load_data();
x_train = x_train / 255.0f;

// Construct the model (could be replaced by the load_model() function; it has the same impact)
var layers = keras.layers;
var inputs = keras.Input(shape: (32, 32, 3), name: "img");
var x = layers.Conv2D(32, 3, activation: "relu").Apply(inputs);
x = layers.Conv2D(64, 3, activation: "relu").Apply(x);
var block_1_output = layers.MaxPooling2D(3).Apply(x);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_1_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
var block_2_output = layers.Add().Apply(new Tensors(x, block_1_output));
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_2_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
var block_3_output = layers.Add().Apply(new Tensors(x, block_2_output));
x = layers.Conv2D(64, 3, activation: "relu").Apply(block_3_output);
x = layers.GlobalAveragePooling2D().Apply(x);
x = layers.Dense(256, activation: "relu").Apply(x);
x = layers.Dropout(0.5f).Apply(x);
var outputs = layers.Dense(10).Apply(x);
var model = keras.Model(inputs, outputs, name: "toy_resnet");

// Compile
model.compile(optimizer: keras.optimizers.RMSprop(1e-3f), loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true), metrics: new[] { "acc" });

Thread t = new Thread(() =>
{
	model.fit(x_train[new Slice(0, 2000)], y_train[new Slice(0, 2000)],
          	batch_size: 64,
          	epochs: 30,
          	validation_split: 0.2f);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

This example, similar to example 1 in my thread, DOESN'T CONVERGE!

Conversely, this code, similar to my example 2, converges:

// Load data
var ((x_train, y_train), (x_test, y_test)) = keras.datasets.cifar10.load_data();
x_train = x_train / 255.0f;

Thread t = new Thread(() =>
{
	// Construct the model (could be replaced by the load_model() function; it has the same impact)
	var layers = keras.layers;
	var inputs = keras.Input(shape: (32, 32, 3), name: "img");
	var x = layers.Conv2D(32, 3, activation: "relu").Apply(inputs);
	x = layers.Conv2D(64, 3, activation: "relu").Apply(x);
	var block_1_output = layers.MaxPooling2D(3).Apply(x);
	x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_1_output);
	x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
	var block_2_output = layers.Add().Apply(new Tensors(x, block_1_output));
	x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_2_output);
	x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
	var block_3_output = layers.Add().Apply(new Tensors(x, block_2_output));
	x = layers.Conv2D(64, 3, activation: "relu").Apply(block_3_output);
	x = layers.GlobalAveragePooling2D().Apply(x);
	x = layers.Dense(256, activation: "relu").Apply(x);
	x = layers.Dropout(0.5f).Apply(x);
	var outputs = layers.Dense(10).Apply(x);
	var model = keras.Model(inputs, outputs, name: "toy_resnet");

	// Compile
	model.compile(optimizer: keras.optimizers.RMSprop(1e-3f), loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true), metrics: new[] { "acc" });
	model.fit(x_train[new Slice(0, 2000)], y_train[new Slice(0, 2000)],
          	batch_size: 64,
          	epochs: 30,
          	validation_split: 0.2f);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

This reveals a convergence issue opposite to the one I initially described in this issue. Furthermore, in this example, the more of the graph that is built in the same thread as the model.fit() call, the better the convergence.

To support this, consider situation 3 (a mix of 1 and 2):

// Load data
var ((x_train, y_train), (x_test, y_test)) = keras.datasets.cifar10.load_data();
x_train = x_train / 255.0f;
// Construct the model (could be replaced by the load_model() function; it has the same impact)
var layers = keras.layers;
var inputs = keras.Input(shape: (32, 32, 3), name: "img");
var x = layers.Conv2D(32, 3, activation: "relu").Apply(inputs);
x = layers.Conv2D(64, 3, activation: "relu").Apply(x);
var block_1_output = layers.MaxPooling2D(3).Apply(x);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_1_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
var block_2_output = layers.Add().Apply(new Tensors(x, block_1_output));
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_2_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);

Thread t = new Thread(() =>
{
	
	var block_3_output = layers.Add().Apply(new Tensors(x, block_2_output));
	x = layers.Conv2D(64, 3, activation: "relu").Apply(block_3_output);
	x = layers.GlobalAveragePooling2D().Apply(x);
	x = layers.Dense(256, activation: "relu").Apply(x);
	x = layers.Dropout(0.5f).Apply(x);
	var outputs = layers.Dense(10).Apply(x);
	var model = keras.Model(inputs, outputs, name: "toy_resnet");

	// Compile
	model.compile(optimizer: keras.optimizers.RMSprop(1e-3f), loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true), metrics: new[] { "acc" });
	model.fit(x_train[new Slice(0, 2000)], y_train[new Slice(0, 2000)],
          	batch_size: 64,
          	epochs: 30,
          	validation_split: 0.2f);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

This produces a better result (acc = 0.3) than code 1 (acc = 0.1) but a worse one than code 2 (acc = 0.45). The behavior is completely deterministic (confirmed over multiple runs, not just one test).

Edit: all of this may also depend on the optimizer used. When I replace RMSprop with Adam, this example converges in cases 1, 2 and 3.
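
Concretely, the only change is the optimizer passed to compile (a sketch; the 1e-3f learning rate is illustrative):

// Same model construction and fit() call as above; only the optimizer changes.
model.compile(optimizer: keras.optimizers.Adam(1e-3f), // learning rate assumed
              loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true),
              metrics: new[] { "acc" });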

I can't explain this behavior; it is the exact opposite of my original problem, but it also points to an underlying problem with the threads...

@AsakusaRinne
Collaborator

That's so weird! What's your code like after t.Start()?

@AdrienDeverin
Author

@AsakusaRinne There is literally nothing.
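
(So the main thread simply returns right after t.Start(). If the trained model were needed afterwards, the worker thread could be joined; a sketch, not part of the actual code:)

t.Start();
t.Join(); // block the calling thread until model.fit() inside the worker thread has finished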
