Segmentation fault in multithread app (v0.11.2) #380

Open
deadman2000 opened this issue Sep 9, 2019 · 22 comments
Labels: bug (Something isn't working)

Comments

@deadman2000

The app crashes during sess.run with the message "Segmentation fault (core dumped)" in Docker, or "Attempted to read or write protected memory. This is often an indication that other memory is corrupt." on Windows.

Stack trace on Windows

   at Tensorflow.c_api.TF_SessionRun(IntPtr session, TF_Buffer* run_options, TF_Output[] inputs, IntPtr[] input_values, Int32 ninputs, TF_Output[] outputs, IntPtr[] output_values, Int32 noutputs, IntPtr[] target_opers, Int32 ntargets, IntPtr run_metadata, IntPtr status)
   at Tensorflow.BaseSession._call_tf_sessionrun(KeyValuePair`2[] feed_dict, TF_Output[] fetch_list, List`1 target_list)
   at Tensorflow.BaseSession._do_run(List`1 target_list, List`1 fetch_list, Dictionary`2 feed_dict)
   at Tensorflow.BaseSession._run(Object fetches, FeedItem[] feed_dict)
   at Tensorflow.BaseSession.run(Tensor fetche, FeedItem[] feed_dict)

I created a small example project to reproduce it: https://github.com/deadman2000/TensorFlowNetMultithreading

@Nucs
Member

Nucs commented Sep 9, 2019

Task.Run does not guarantee that a different thread is used for every created Task. In fact, the tasks are more likely queued behind 4 to 8 thread-pool threads (depending on your CPU).
The default mechanism in Tensorflow.NET is keyed by thread ID.
Replace Task.Run with new Thread(() => ...).Start(); or create a long-running task using the task factory.

Also, LoadFromSavedModel() does not set the created Session as the default (@Oceania2018 it should, please fix it).
Replace the LoadFromSavedModel call with the following:
_session = Session.LoadFromSavedModel(modelLocation).as_default();

Let me know if that made any difference.
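
To illustrate the point about Task.Run above, here is a minimal sketch that uses only the .NET standard library and no Tensorflow.NET calls. The class and method names are hypothetical. It counts how many distinct managed thread IDs a batch of work items observes when scheduled with Task.Run versus dedicated threads; the task-based count depends on the thread pool and is not deterministic.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Tensorflow.NET.
public static class ThreadIdentityDemo
{
    // Runs `count` work items on dedicated threads and returns
    // the number of distinct managed thread IDs that were observed.
    public static int DistinctIdsWithThreads(int count)
    {
        var ids = new ConcurrentBag<int>();
        using (var allStarted = new Barrier(count))
        {
            var threads = Enumerable.Range(0, count).Select(_ => new Thread(() =>
            {
                ids.Add(Environment.CurrentManagedThreadId);
                // Keep every thread alive until all have recorded their ID,
                // so no ID can be recycled by a later thread.
                allStarted.SignalAndWait();
            })).ToList();

            threads.ForEach(t => t.Start());
            threads.ForEach(t => t.Join());
        }
        return ids.Distinct().Count();
    }

    // Runs `count` work items through Task.Run (the thread pool) and returns
    // the number of distinct managed thread IDs that were observed.
    public static int DistinctIdsWithTasks(int count)
    {
        var ids = new ConcurrentBag<int>();
        var tasks = Enumerable.Range(0, count)
            .Select(_ => Task.Run(() => ids.Add(Environment.CurrentManagedThreadId)))
            .ToArray();
        Task.WaitAll(tasks);
        return ids.Distinct().Count();
    }
}
```

DistinctIdsWithThreads(16) returns 16 because all 16 threads are alive at the same time. DistinctIdsWithTasks(16) may return a smaller number, since pool threads are reused, which is why any state keyed by thread ID can end up shared between tasks.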

@Nucs
Member

Nucs commented Sep 9, 2019

I've tested it locally and it runs smoothly.
Program.cs: https://pastebin.com/upNWzmHk
Predictor.cs: https://pastebin.com/i96ZMyVV

I'll add information about it to the wiki.

@Nucs
Member

Nucs commented Sep 9, 2019

If you need to write multi-threaded unit tests in the future, you are welcome to use the MultiThreadedUnitTestExecuter I wrote for the library:
https://github.com/SciSharp/TensorFlow.NET/blob/master/test/TensorFlowNET.UnitTest/Utilities/MultiThreadedUnitTestExecuter.cs

Usage:

MultiThreadedUnitTestExecuter.Run(threadCount: 8, workload: tid => ...);

@deadman2000
Author

@Nucs as_default does not help. I still get a segmentation fault and AccessViolationException, along with random errors:

2019-09-10 10:20:22.989570: E tensorflow/core/framework/types.cc:102] Unrecognized DataType enum value 105
Expects arg[1] to be float but unknown dtype enum (105)_ref is provided
Tensorflow.TensorflowException: Expects arg[1] to be float but unknown dtype enum (105)_ref is provided
   at Tensorflow.Status.Check(Boolean throwException)
   at Tensorflow.BaseSession._call_tf_sessionrun(KeyValuePair`2[] feed_dict, TF_Output[] fetch_list, List`1 target_list)
   at Tensorflow.BaseSession._do_run(List`1 target_list, List`1 fetch_list, Dictionary`2 feed_dict)
   at Tensorflow.BaseSession._run(Object fetches, FeedItem[] feed_dict)
   at Tensorflow.BaseSession.run(Tensor fetche, FeedItem[] feed_dict)
   at CalcEventsTFS.Predictor.Predict(Single[][] inputs)
   at CalcEventsTFS.Program.<>c.<Main>b__2_0()
Expects arg[1] to be float but bfloat16 is provided
Tensorflow.TensorflowException: Expects arg[1] to be float but bfloat16 is provided
   at Tensorflow.Status.Check(Boolean throwException)
   at Tensorflow.BaseSession._call_tf_sessionrun(KeyValuePair`2[] feed_dict, TF_Output[] fetch_list, List`1 target_list)
   at Tensorflow.BaseSession._do_run(List`1 target_list, List`1 fetch_list, Dictionary`2 feed_dict)
   at Tensorflow.BaseSession._run(Object fetches, FeedItem[] feed_dict)
   at Tensorflow.BaseSession.run(Tensor fetche, FeedItem[] feed_dict)
   at CalcEventsTFS.Predictor.Predict(Single[][] inputs)
   at CalcEventsTFS.Program.<>c.<Main>b__2_0()
Expects arg[1] to be float but resource_ref is provided
Tensorflow.TensorflowException: Expects arg[1] to be float but resource_ref is provided
   at Tensorflow.Status.Check(Boolean throwException)
   at Tensorflow.BaseSession._call_tf_sessionrun(KeyValuePair`2[] feed_dict, TF_Output[] fetch_list, List`1 target_list)
   at Tensorflow.BaseSession._do_run(List`1 target_list, List`1 fetch_list, Dictionary`2 feed_dict)
   at Tensorflow.BaseSession._run(Object fetches, FeedItem[] feed_dict)
   at Tensorflow.BaseSession.run(Tensor fetche, FeedItem[] feed_dict)
   at CalcEventsTFS.Predictor.Predict(Single[][] inputs)
   at CalcEventsTFS.Program.<>c.<Main>b__2_0()

@Nucs closed this as completed in b251295 Sep 10, 2019
@Nucs
Member

Nucs commented Sep 10, 2019

I've fixed it in the commit above. Until it is available via NuGet,
add a lock in the Predictor constructor as follows:

lock (Locks.ProcessWide)
    _session = Session.LoadFromSavedModel(modelLocation).as_default();

As mentioned in the wiki, due to the lack of documentation from TF's side, we don't know which APIs are not thread-safe, and it appears that c_api.TF_LoadSessionFromSavedModel() is not.

@Nucs reopened this Sep 10, 2019
@deadman2000
Author

With the lock, I get random errors on .NET Framework 4.6.1.
On .NET Core 2.2 I get a segmentation fault and AccessViolationException.

I found out that the problem is related to garbage collection.

Try this example:

for (int t = 0; t < THREADS_COUNT; t++)
{
    new Thread(() =>
    {
        Session sess;

        lock (Locks.ProcessWide)
            sess = Session.LoadFromSavedModel(modelLocation).as_default();

        {
            var inputs = new[] { "sp", "fuel" };

            var inp = inputs.Select(name => sess.graph.OperationByName(name).output).ToArray();
            var outp = sess.graph.OperationByName("softmax_tensor").output;

            for (var i = 0; i < 1000; i++)
            {
                {
                    var data = new float[96];
                    FeedItem[] feeds = new FeedItem[2];

                    for (int f = 0; f < 2; f++)
                        feeds[f] = new FeedItem(inp[f], new NDArray(data));

                    try
                    {
                        sess.run(outp, feeds);
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine(ex);
                    }
                }
                GC.Collect();
            }
        }
    }).Start();
}

Or use the test project: https://github.com/deadman2000/TensorFlowNetMultithreading

@Nucs
Member

Nucs commented Sep 10, 2019

Please try the following code:
https://pastebin.com/EL9Cv6FQ
It has worked for me for over 10 runs both in Debug and Release.

@deadman2000
Author

Yes, this sample works fine. But after I replaced the for loop with an infinite loop, it crashed after 1 minute in Debug and after 30 seconds in Release.

@Oceania2018 added this to Needs triage in TensorFlow.Binding via automation Sep 10, 2019
@Nucs
Member

Nucs commented Sep 10, 2019

I ran it for over 35 minutes (and it is still running) without a crash; the while(true) was placed around the sess.run scope and not the entire script.
A second run worked for 20 minutes. This time I included the OperationByName calls in the loop and didn't use the VS debugger.

This led me to believe it is a game of chance, and indeed the program eventually crashed after 10-60 seconds with the following message in the Output window:

The program '[22268] dotnet.exe' has exited with code -1073740791 (0xc0000409).

When the only error is a bare exit code, it indicates that the crash happened in native C++ code (TensorFlow) rather than in managed code.

Code -1073740791 (0xc0000409) is STATUS_STACK_BUFFER_OVERRUN.
It means a stack buffer overrun was detected. This can indicate a bug in the executed software that corrupts the stack, leading to abnormal termination.

This happens even though we do not call the same Session from different threads,
so it must be something complicated that is worth the TensorFlow team's attention.

I'll try to create a memory dump and see if we can pinpoint where the problem lies.

@Oceania2018
Member

@Nucs Should we raise this issue with the TensorFlow team?

@Nucs
Member

Nucs commented Sep 10, 2019

Once I get my hands on a dump and analyze it, I'll let you know.

@Nucs
Member

Nucs commented Sep 10, 2019

Managed-only debugging indicates the exception is inside TensorFlow.
[screenshot]

Unmanaged debugging confirms it.
[screenshot]

@Nucs
Member

Nucs commented Sep 10, 2019

@Oceania2018, you can contact the TF team
or, if possible, provide me with a tensorflow.dll and .pdb file compiled in DEBUG.

Dump file: https://mega.nz/#!VVF3HKrS!r5sxOBsbccdVn93KpwS7EtxWSvtLlim2GwExDF8L9h4

@deadman2000
Author

Please try with this code in _call_tf_sessionrun:

var output_values = fetch_list.Select(x => IntPtr.Zero).ToArray();
var inputs = feed_dict.Select(f => f.Key).ToArray();
var input_values = feed_dict.Select(f => (IntPtr)f.Value).ToArray();
var target_opers = target_list.Select(f => (IntPtr)f).ToArray();

c_api.TF_SessionRun(_handle,
    run_options: null,
    inputs: inputs,
    input_values: input_values,
    ninputs: feed_dict.Length,
    outputs: fetch_list,
    output_values: output_values,
    noutputs: fetch_list.Length,
    target_opers: target_opers,
    ntargets: target_list.Count,
    run_metadata: IntPtr.Zero,
    status: status);

@tompetk

tompetk commented Apr 5, 2020

Any update/thoughts on this one?

I see the latest suggestion by @deadman2000 is already in master, but I can still reproduce this issue.

Strangely, adding a lock around sess.run() doesn't help either. I've also tried setting UsePerSessionThreads and IsolateSessionState to true, but that did not help.

@Nucs
Member

Nucs commented Apr 6, 2020

Back then my assessment was that it is a multithreading problem within TensorFlow, so we are kind of helpless here. They still have not responded to the issue. Drop a comment there if you like.

@deadman2000
Author

deadman2000 commented Apr 6, 2020

I guess the problem isn't with TensorFlow. I implemented a multithreaded application on the pure C API, and it works without crashes.
Perhaps the problem is that the GC releases the memory too early.
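
The early-release hypothesis can be sketched in isolation, using only the .NET standard library. NativeBuffer and KeepAliveDemo are hypothetical names, and nothing here is taken from Tensorflow.NET or NumSharp. The idea: once a method has copied a raw pointer out of a wrapper object, the runtime may treat the wrapper as unreachable and run its finalizer while native code is still using the pointer. GC.KeepAlive extends the wrapper's lifetime past the native call.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper that owns unmanaged memory and frees it in its finalizer.
public sealed class NativeBuffer
{
    public IntPtr Pointer { get; }
    public int Length { get; }

    public NativeBuffer(float[] values)
    {
        Length = values.Length;
        Pointer = Marshal.AllocHGlobal(values.Length * sizeof(float));
        Marshal.Copy(values, 0, Pointer, values.Length);
    }

    // Runs at some point after the GC decides the object is unreachable.
    ~NativeBuffer()
    {
        Marshal.FreeHGlobal(Pointer);
    }
}

public static class KeepAliveDemo
{
    public static float Sum(NativeBuffer buffer)
    {
        IntPtr raw = buffer.Pointer;
        int length = buffer.Length;

        // Stand-in for a native call that reads from the raw pointer.
        var copy = new float[length];
        Marshal.Copy(raw, copy, 0, length);

        // Without this line, `buffer` is no longer referenced after `raw` and
        // `length` were read, so it could be finalized during the call above.
        GC.KeepAlive(buffer);

        float total = 0;
        foreach (var value in copy)
            total += value;
        return total;
    }
}
```

Whether this is what actually happens in the binding is not established at this point in the thread; the sketch only shows the general failure mode and one standard guard against it.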

@tompetk

tompetk commented Apr 7, 2020

I've just built a TF 2.1 debug version locally and was able to reproduce the issue. Here is a more detailed call stack:

[screenshot: tf_access_violation_issue]
[screenshot: tf_access_violation_issue2]

I haven't had much time to investigate yet, but it also seems like we pass invalid input tensor values, probably due to GC, although I couldn't yet spot where we could be disposing them. I will investigate how TensorConverter.ToTensor() works tomorrow. Any ideas appreciated :)

@tompetk

tompetk commented Apr 7, 2020

So I think the issue is caused by the following usage of nd.GetData() in Tensor.Creation.cs. I guess the result points to GC-controlled memory, with no guarantee that it stays at the same address after the GC runs.

        private unsafe IntPtr CreateTensorFromNDArray(NDArray nd, TF_DataType? given_dtype)
        {
            if (nd.typecode == NPTypeCode.String)
                throw new NotImplementedException("Support for NDArray of type string not implemented yet");

>>>         var arraySlice = nd.Unsafe.Storage.Shape.IsContiguous ? nd.GetData() : nd.CloneData();

Changing it to the following helped (probably with some performance degradation, which I didn't notice due to the small input dataset):

var arraySlice = nd.CloneData();

After this change I can no longer reproduce the crash, but I will keep testing.
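
For readers unfamiliar with the trade-off described above, here is a minimal sketch of the two usual ways to hand managed array data to native code safely, using only the .NET standard library. NativeInteropDemo and its methods are hypothetical and do not reflect how NumSharp's GetData() or CloneData() are implemented. Pinning avoids a copy but blocks the GC from moving the array; copying costs time and memory but gives the native side a buffer the GC never touches.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helpers, not part of Tensorflow.NET or NumSharp.
public static class NativeInteropDemo
{
    // Option 1: pin the managed array for the duration of the native call,
    // so the GC cannot move it while the pointer is in use.
    public static T WithPinned<T>(float[] data, Func<IntPtr, T> nativeCall)
    {
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            return nativeCall(handle.AddrOfPinnedObject());
        }
        finally
        {
            handle.Free();
        }
    }

    // Option 2: copy the data into unmanaged memory that the GC does not manage.
    // The caller owns the result and must release it with Marshal.FreeHGlobal.
    public static IntPtr CopyToUnmanaged(float[] data)
    {
        IntPtr buffer = Marshal.AllocHGlobal(data.Length * sizeof(float));
        Marshal.Copy(data, 0, buffer, data.Length);
        return buffer;
    }
}
```

Option 2 is closest in spirit to the clone-based change above: it trades an extra copy for a stable address.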

@Mghobadid

I have this error too with the GPU version; with the CPU version everything is fine. How can I solve this?
The error:
Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

@tompetk

tompetk commented Apr 17, 2020

@Mghobadid The fix in #533 should do the trick. It is not the most efficient way, but it seems to work at least on CPU (and I reproduced exactly the same issue on GPU, so it should behave the same)...

@SommerEngineering

Thanks @tompetk for the fix. I hope the PR gets merged ASAP and we get a new NuGet version.
