Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tensorflow conversion #47

Merged
merged 46 commits into from Jul 27, 2019

Conversation

@SuperShinyEyes
Copy link
Contributor

commented Jul 15, 2019

Add a Tensorflow front-end.

@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 15, 2019

Cool!! Let me test it and I'll merge it with the master.

@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 15, 2019

Works on my machine. I'll make two changes before merging: 1. I'll try to get rid of the scatter_add custom op if possible. 2. I'll add GPU support.

@SuperShinyEyes

This comment has been minimized.

Copy link
Contributor

commented on pyrednertensorflow/shape.py in 1533a82 Jul 16, 2019

DO NOT use tf.zeros or tf.zeros_like..

Tensorflow has a very strange bug. When I run test_g_buffer.py, tf.zeros does not produce tensors with zeros but some nonzero random numbers. That's why I made

def zeros_like(
tensor: Union[np.ndarray, tf.Tensor, tf.Variable],
is_var=False,
dtype=tf.float32) -> Union[tf.Tensor, tf.Variable]:
zeros = np.zeros_like(tensor)
if is_var:
return tf.Variable(zeros, dtype=dtype)
else:
return tf.constant(zeros, dtype=dtype)

I don't know how to reproduce this bug. This may not happen on your machine. However, this workaround may be an imperfect solution for GPU computing.

@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 16, 2019

DO NOT use tf.zeros or tf.zeros_like..

Tensorflow has a very strange bug. When I run test_g_buffer.py, tf.zeros does not produce tensors with zeros but some nonzero random numbers. That's why I made

def zeros_like(
tensor: Union[np.ndarray, tf.Tensor, tf.Variable],
is_var=False,
dtype=tf.float32) -> Union[tf.Tensor, tf.Variable]:
zeros = np.zeros_like(tensor)
if is_var:
return tf.Variable(zeros, dtype=dtype)
else:
return tf.constant(zeros, dtype=dtype)

I don't know how to reproduce this bug. This may not happen on your machine. However, this workaround may be an imperfect solution for GPU computing.

It's very hard to imagine how such a commonly-used function can be buggy. I prefer to keep it as it for now until people complain. I'll make sure the g buffer test works.

@SuperShinyEyes

This comment has been minimized.

Copy link
Contributor Author

commented Jul 16, 2019

It's very hard to imagine how such a commonly-used function can be buggy. I prefer to keep it as it for now until people complain. I'll make sure the g buffer test works.

I agree it's hard to imagine. Neither our lab could believe what we saw.

Tzu-Mao Li and others added 2 commits Jul 16, 2019
@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 25, 2019

Almost done with the GPU port. Just need to rewrite all the tests/tutorials/examples.

@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 25, 2019

It seems that the eager mode scalar cache in tensorflow (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/constant_op.py#L99) sometimes get corrupted, causing the tf.zeros bug mentioned earilier. Not sure if this is a memory bug in our code or something else. I'm tracking this right now.

@SuperShinyEyes

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2019

I don't think it's a memory issue caused by the custom OPs. There were corrupted zero tensors which were never passed to C++. For example, in shape.py.

@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 26, 2019

I probably know the cause. It is caused by the custom operations. Those corrupted zero tensors share the same memory address with some tensors passed into the custom ops (due to the aforementioned scalar cache), which is then modified by the C++ code. I also know how to fix this, but it's 9pm so I'll take a short break ; )

@SuperShinyEyes

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2019

Haha you should call it a day 😄

BTW, how do you verify your hypothesis? Is there some tool to analyze memory structure?

I'm very interested to know how to debug this kind of error.

@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 26, 2019

I didn't use any tools here except pdb, gdb and printfs. I got the scalar cache thing by tracing through tensorflow, then I realize the corruption happened after the backward pass. I then tried to comment out part of the C++ code to bisect the part that causes the corruption. I also created new tf.zeros(1) to confirm that the memory addresses are reused.

@SuperShinyEyes

This comment has been minimized.

Copy link
Contributor Author

commented Jul 27, 2019

I see! After spending 1 hour of reading the source code, I finally understand the cause. Thank you! Very good tracing!

@SuperShinyEyes

This comment has been minimized.

Copy link
Contributor Author

commented Jul 27, 2019

It seems that the eager mode scalar cache in tensorflow (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/constant_op.py#L99)

Just out of curiosity, do you know why they call it scalar_cache when it's a tensor not scalar?

@BachiLi BachiLi merged commit 26b3e25 into BachiLi:master Jul 27, 2019
@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 27, 2019

It seems that the eager mode scalar cache in tensorflow (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/constant_op.py#L99)

Just out of curiosity, do you know why they call it scalar_cache when it's a tensor not scalar?

It's for tensors whose whole entries containing the same value, and "constant" has been used for something else...

@BachiLi

This comment has been minimized.

Copy link
Owner

commented Jul 27, 2019

Finally merged. I'll move on to the docker thing next week.

@SuperShinyEyes

This comment has been minimized.

Copy link
Contributor Author

commented Jul 27, 2019

This is awesome 👏👏👏 Now enjoy your weekend 🏖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.