Segmentation fault in multithread app (v0.11.2) #380

deadman2000 · 2019-09-09T15:26:56Z

The app crashes during sess.run with the message "Segmentation fault (core dumped)" in Docker, or "Attempted to read or write protected memory. This is often an indication that other memory is corrupt." on Windows.

Stack trace on Windows

   at Tensorflow.c_api.TF_SessionRun(IntPtr session, TF_Buffer* run_options, TF_Output[] inputs, IntPtr[] input_values, Int32 ninputs, TF_Output[] outputs, IntPtr[] output_values, Int32 noutputs, IntPtr[] target_opers, Int32 ntargets, IntPtr run_metadata, IntPtr status)
   at Tensorflow.BaseSession._call_tf_sessionrun(KeyValuePair`2[] feed_dict, TF_Output[] fetch_list, List`1 target_list)
   at Tensorflow.BaseSession._do_run(List`1 target_list, List`1 fetch_list, Dictionary`2 feed_dict)
   at Tensorflow.BaseSession._run(Object fetches, FeedItem[] feed_dict)
   at Tensorflow.BaseSession.run(Tensor fetche, FeedItem[] feed_dict)

I created a small example project for testing: https://github.com/deadman2000/TensorFlowNetMultithreading

Nucs · 2019-09-09T20:52:40Z

Task.Run does not promise that a different thread will be used for every created Task; in fact, it is more likely that the tasks are queued behind 4 to 8 threads (depending on your CPU).
The default mechanism in Tensorflow.NET relies on the thread ID.
Replace Task.Run with new Thread(()=> ...).Start(); or create a long-running task using the task factory.
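
To illustrate the point about thread reuse, here is a minimal sketch using only the .NET base class library (no Tensorflow.NET calls). The class and method names are hypothetical, chosen for illustration. It counts how many distinct managed thread IDs are observed when the same work is started through Task.Run versus through dedicated threads:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Tensorflow.NET.
public static class ThreadIdDemo
{
    // Thread-pool tasks: the pool may run several tasks on the same thread,
    // so the number of distinct IDs can be anywhere between 1 and n.
    public static int DistinctIdsWithTasks(int n)
    {
        var ids = new ConcurrentBag<int>();
        var tasks = Enumerable.Range(0, n)
            .Select(_ => Task.Run(() => ids.Add(Environment.CurrentManagedThreadId)))
            .ToArray();
        Task.WaitAll(tasks);
        return ids.Distinct().Count();
    }

    // Dedicated threads: each piece of work gets its own thread.
    public static int DistinctIdsWithThreads(int n)
    {
        var ids = new ConcurrentBag<int>();
        using (var barrier = new Barrier(n))
        {
            var threads = Enumerable.Range(0, n)
                .Select(_ => new Thread(() =>
                {
                    ids.Add(Environment.CurrentManagedThreadId);
                    // Keep every thread alive until all have recorded their ID,
                    // so that no ID can be recycled from a finished thread.
                    barrier.SignalAndWait();
                }))
                .ToList();
            threads.ForEach(t => t.Start());
            threads.ForEach(t => t.Join());
        }
        return ids.Distinct().Count();
    }
}
```

Calling ThreadIdDemo.DistinctIdsWithThreads(8) yields 8 distinct IDs, while ThreadIdDemo.DistinctIdsWithTasks(8) may yield fewer, which is why any per-thread state keyed on the thread ID can end up shared between tasks.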

Also, LoadFromSavedModel() does not set the created Session as default (@Oceania2018 it should, please fix it).
Replace the LoadFromSavedModel call with the following:
_session = Session.LoadFromSavedModel(modelLocation).as_default();

Let me know if that made any difference.

Nucs · 2019-09-09T21:25:12Z

I've tested it locally and it works smoothly.
Program.cs: https://pastebin.com/upNWzmHk
Predictor.cs: https://pastebin.com/i96ZMyVV

I'll add information about it to the wiki.

Nucs · 2019-09-09T21:28:17Z

If you need to do multi-threaded unit tests in the future, you are welcome to use the MultiThreadedUnitTestExecuter I wrote for the library:
https://github.com/SciSharp/TensorFlow.NET/blob/master/test/TensorFlowNET.UnitTest/Utilities/MultiThreadedUnitTestExecuter.cs

Usage:

MultiThreadedUnitTestExecuter.Run(threadCount: 8, workload: tid => ...);

deadman2000 · 2019-09-10T07:33:49Z

@Nucs as_default does not help. I still get a segmentation fault and AccessViolationException, along with random errors:

2019-09-10 10:20:22.989570: E tensorflow/core/framework/types.cc:102] Unrecognized DataType enum value 105
Expects arg[1] to be float but unknown dtype enum (105)_ref is provided
Tensorflow.TensorflowException: Expects arg[1] to be float but unknown dtype enum (105)_ref is provided
   at Tensorflow.Status.Check(Boolean throwException)
   at Tensorflow.BaseSession._call_tf_sessionrun(KeyValuePair`2[] feed_dict, TF_Output[] fetch_list, List`1 target_list)
   at Tensorflow.BaseSession._do_run(List`1 target_list, List`1 fetch_list, Dictionary`2 feed_dict)
   at Tensorflow.BaseSession._run(Object fetches, FeedItem[] feed_dict)
   at Tensorflow.BaseSession.run(Tensor fetche, FeedItem[] feed_dict)
   at CalcEventsTFS.Predictor.Predict(Single[][] inputs)
   at CalcEventsTFS.Program.<>c.<Main>b__2_0()
Expects arg[1] to be float but bfloat16 is provided
Tensorflow.TensorflowException: Expects arg[1] to be float but bfloat16 is provided
   at Tensorflow.Status.Check(Boolean throwException)
   at Tensorflow.BaseSession._call_tf_sessionrun(KeyValuePair`2[] feed_dict, TF_Output[] fetch_list, List`1 target_list)
   at Tensorflow.BaseSession._do_run(List`1 target_list, List`1 fetch_list, Dictionary`2 feed_dict)
   at Tensorflow.BaseSession._run(Object fetches, FeedItem[] feed_dict)
   at Tensorflow.BaseSession.run(Tensor fetche, FeedItem[] feed_dict)
   at CalcEventsTFS.Predictor.Predict(Single[][] inputs)
   at CalcEventsTFS.Program.<>c.<Main>b__2_0()
Expects arg[1] to be float but resource_ref is provided
Tensorflow.TensorflowException: Expects arg[1] to be float but resource_ref is provided
   at Tensorflow.Status.Check(Boolean throwException)
   at Tensorflow.BaseSession._call_tf_sessionrun(KeyValuePair`2[] feed_dict, TF_Output[] fetch_list, List`1 target_list)
   at Tensorflow.BaseSession._do_run(List`1 target_list, List`1 fetch_list, Dictionary`2 feed_dict)
   at Tensorflow.BaseSession._run(Object fetches, FeedItem[] feed_dict)
   at Tensorflow.BaseSession.run(Tensor fetche, FeedItem[] feed_dict)
   at CalcEventsTFS.Predictor.Predict(Single[][] inputs)
   at CalcEventsTFS.Program.<>c.<Main>b__2_0()

Nucs · 2019-09-10T08:30:53Z

I've fixed it in the commit above. Until it is available via NuGet,
add a lock in the Predictor constructor as follows:

lock (Locks.ProcessWide)
    _session = Session.LoadFromSavedModel(modelLocation).as_default();

As mentioned in the wiki, due to the lack of documentation on TF's side, we don't know which APIs are not thread-safe, and it appears that c_api.TF_LoadSessionFromSavedModel() is not.

deadman2000 · 2019-09-10T12:17:31Z

With the lock, I get random errors on .NET Framework 4.6.1.
On .NET Core 2.2, I get a segmentation fault and AccessViolationException.

I found out that the problem is related to garbage collection.

Try this example:

for (int t = 0; t < THREADS_COUNT; t++)
{
    new Thread(() =>
    {
        Session sess;

        lock (Locks.ProcessWide)
            sess = Session.LoadFromSavedModel(modelLocation).as_default();

        {
            var inputs = new[] { "sp", "fuel" };

            var inp = inputs.Select(name => sess.graph.OperationByName(name).output).ToArray();
            var outp = sess.graph.OperationByName("softmax_tensor").output;

            for (var i = 0; i < 1000; i++)
            {
                {
                    var data = new float[96];
                    FeedItem[] feeds = new FeedItem[2];

                    for (int f = 0; f < 2; f++)
                        feeds[f] = new FeedItem(inp[f], new NDArray(data));

                    try
                    {
                        sess.run(outp, feeds);
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine(ex);
                    }
                }
                GC.Collect();
            }
        }
    }).Start();
}

Or use the test project: https://github.com/deadman2000/TensorFlowNetMultithreading

Nucs · 2019-09-10T13:15:59Z

Please try the following code:
https://pastebin.com/EL9Cv6FQ
It has worked for me for over 10 runs both in Debug and Release.

deadman2000 · 2019-09-10T14:30:04Z

Yes, this sample works fine. But after I replaced the loop with an infinite loop, it crashed after 1 minute in Debug and after 30 seconds in Release.

Nucs · 2019-09-10T15:48:09Z

I ran it for over 35 minutes (and it is still running) without a crash; the while(true) was placed around the sess.run scope and not the entire script.
A second run worked for 20 minutes. This time I included the OperationByName calls in the loop and didn't use the VS debugger.

This led me to believe it is a game of chance, and indeed the program eventually crashed after 10-60 seconds with the following message in the Output window:

The program '[22268] dotnet.exe' has exited with code -1073740791 (0xc0000409).

When the error is just an exit code, it indicates that the native C++ code (TensorFlow) caused the crash.

Code -1073740791 (0xc0000409):
Stack buffer overrun. This error can indicate a bug in the executed software that overruns a stack-based buffer, leading to abnormal termination of the software.

This happens even though we do not call the same Session from different threads.
So it must be something complicated that is worth the TensorFlow team's attention.

I'll try to create a memory dump and see if we can pinpoint more accurately where the problem lies.
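
The signed exit code and the hexadecimal value quoted above are the same 32-bit number. As a small sketch (the class and method names are hypothetical), the conversion can be done by reinterpreting the signed int as an unsigned value:

```csharp
using System;

// Hypothetical helper for reading Windows process exit codes.
public static class ExitCodes
{
    // Reinterprets the signed exit code as an unsigned 32-bit value
    // and formats it as an 8-digit hexadecimal status code.
    public static string ToStatusHex(int exitCode)
    {
        uint status = unchecked((uint)exitCode);
        return "0x" + status.ToString("X8");
    }
}
```

For example, ExitCodes.ToStatusHex(-1073740791) returns "0xC0000409", since 4294967296 - 1073740791 = 3221226505 = 0xC0000409.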

Oceania2018 · 2019-09-10T15:53:49Z

@Nucs Should we raise this issue with the TensorFlow team?

Nucs · 2019-09-10T16:02:27Z

Once I get my hands on a dump and research it, I'll let you know.

Nucs · 2019-09-10T16:23:35Z

Managed-Only debug indicates the exception is inside Tensorflow.

Unmanaged debug verifies it

Nucs · 2019-09-10T16:28:00Z

@Oceania2018, you can contact the TF team
or, if possible, provide me with a tensorflow.dll and .pdb file compiled with DEBUG.

Dump file: https://mega.nz/#!VVF3HKrS!r5sxOBsbccdVn93KpwS7EtxWSvtLlim2GwExDF8L9h4

deadman2000 · 2019-09-26T09:17:52Z

Please try with this code in _call_tf_sessionrun:

            var output_values = fetch_list.Select(x => IntPtr.Zero).ToArray();
            var inputs = feed_dict.Select(f => f.Key).ToArray();
            var input_values = feed_dict.Select(f => (IntPtr)f.Value).ToArray();
            var target_opers = target_list.Select(f => (IntPtr)f).ToArray();

            c_api.TF_SessionRun(_handle,
                run_options: null,
                inputs: inputs,
                input_values: input_values,
                ninputs: feed_dict.Length,
                outputs: fetch_list,
                output_values: output_values,
                noutputs: fetch_list.Length,
                target_opers: target_opers,
                ntargets: target_list.Count,
                run_metadata: IntPtr.Zero,
                status: status);

tompetk · 2020-04-05T23:25:53Z

Any update/thoughts on this one?

I see the latest suggestion by @deadman2000 is already in master, but I can still reproduce this issue.

Weird that adding a lock on sess.run() doesn't help either. I've also tried setting UsePerSessionThreads and IsolateSessionState to true, but that did not help.

Nucs · 2020-04-06T06:38:32Z

Back then, my assessment was that it is a multithreading problem within TensorFlow, so we are more or less helpless here. They still have not responded to the issue. Drop a comment there if you like.

deadman2000 · 2020-04-06T07:01:44Z

I guess the problem isn't with TensorFlow. I implemented a multithreaded application on the pure C API, and it works successfully, without crashes.
Perhaps the problem is that the GC releases the memory too early.
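
As a generic .NET illustration of the early-release hypothesis (not code from Tensorflow.NET; the names are hypothetical), the sketch below shows how GC.KeepAlive guarantees that an object stays reachable up to a given point, which is the usual way to protect a managed owner of a buffer for the duration of a native call:

```csharp
using System;

// Hypothetical demo, assuming only the .NET base class library.
public static class KeepAliveDemo
{
    // Returns whether the buffer is still alive after a forced collection.
    // Because GC.KeepAlive(owner) appears after the collection, the runtime
    // must treat 'owner' as reachable until that line.
    public static bool SurvivesCollection()
    {
        var owner = new byte[1024];          // stands in for an object owning tensor data
        var weak = new WeakReference(owner); // lets us observe collection without rooting it

        GC.Collect();                        // a native call would happen around here
        GC.WaitForPendingFinalizers();

        bool alive = weak.IsAlive;
        GC.KeepAlive(owner);                 // extends the lifetime of 'owner' to this point
        return alive;
    }
}
```

Without the GC.KeepAlive call, an optimizing JIT is allowed to consider the object dead after its last use, so a collection during the native call could reclaim it.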

tompetk · 2020-04-07T00:13:02Z

I've just built a TF 2.1 debug version locally and was able to reproduce the issue. A more detailed call stack:

I haven't had much time to investigate yet, but it also seems like we pass invalid input tensor values, probably due to GC, although I couldn't yet spot where we might dispose them. I will investigate how TensorConverter.ToTensor() works tomorrow. Any ideas appreciated :)

tompetk · 2020-04-07T10:31:24Z

So I think the issue is caused by the following usage of nd.GetData() in Tensor.Creation.cs. I guess it ends up pointing to GC-controlled memory, with no guarantee that the memory will stay at the same address after the GC runs.

        private unsafe IntPtr CreateTensorFromNDArray(NDArray nd, TF_DataType? given_dtype)
        {
            if (nd.typecode == NPTypeCode.String)
                throw new NotImplementedException("Support for NDArray of type string not implemented yet");

>>>         var arraySlice = nd.Unsafe.Storage.Shape.IsContiguous ? nd.GetData() : nd.CloneData();

Changing it to the following helped (probably with some performance degradation, which I didn't notice because of the small input dataset):

var arraySlice = nd.CloneData();

After this change I can no longer reproduce the crash, but I will keep testing this.
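
The general idea of handing native code a private copy that the GC neither moves nor frees can be sketched with plain .NET interop. This is only an illustration of the technique under that assumption; it is not what nd.CloneData() does internally, and the names are hypothetical:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, assuming only System.Runtime.InteropServices.
public static class NativeCopyDemo
{
    // Copies managed data into unmanaged memory. The caller owns the
    // returned buffer and must release it with Marshal.FreeHGlobal.
    public static IntPtr CopyToNative(float[] data)
    {
        IntPtr buffer = Marshal.AllocHGlobal(data.Length * sizeof(float));
        Marshal.Copy(data, 0, buffer, data.Length);
        return buffer;
    }

    // Reads the values back from the unmanaged buffer.
    public static float[] ReadBack(IntPtr buffer, int count)
    {
        var result = new float[count];
        Marshal.Copy(buffer, result, 0, count);
        return result;
    }
}
```

A buffer returned by CopyToNative keeps its address and contents across GC.Collect() calls, at the cost of one extra copy per input, which matches the performance trade-off mentioned above.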

Mghobadid · 2020-04-16T15:41:50Z

I have this error too with the GPU version; with the CPU version everything is fine. How can I solve this?
The error:
Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

tompetk · 2020-04-17T21:43:47Z

@Mghobadid the fix in #533 should do the trick. It is not the most efficient approach, but it seems to work at least on CPU (and I reproduced exactly the same issue on GPU, so it should be the same)...

SommerEngineering · 2020-05-31T16:47:33Z

Thanks @tompetk for the fix. I hope the PR gets merged asap and we get a new NuGet version.

Nucs closed this as completed in b251295 Sep 10, 2019

Nucs reopened this Sep 10, 2019

Nucs added a commit that referenced this issue Sep 10, 2019

MultithreadingTests.cs: Added unit-test for case #380

83b9eb8

Oceania2018 assigned Nucs Sep 10, 2019

Oceania2018 added this to Needs triage in TensorFlow.Binding via automation Sep 10, 2019

Nucs added the bug Something isn't working label Sep 10, 2019

Nucs mentioned this issue Sep 11, 2019

C API: Unexpected behaviour when multithreading calls of c_api's TF_SessionRun tensorflow/tensorflow#32449

Closed

Nucs mentioned this issue Dec 15, 2020

Exception in example #21 of TensorFlowNet.Examples #686

Closed

Nucs mentioned this issue Dec 15, 2020

Word2Vec: System.AccessViolationException SciSharp/SciSharp-Stack-Examples#26

Closed

Inzta mentioned this issue Feb 2, 2021

System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt #742

Closed
