Skip to content

Conversation

@ZetaoZhuang
Copy link
Contributor

@ZetaoZhuang ZetaoZhuang commented Sep 20, 2022

Reason for Change:

To make the AI Core concurrency safe, we need to declare the core.lock to be a pointer so in cloned core we can point to the same lock
Issue Fixed:

Requirements:

Notes:

@ZetaoZhuang ZetaoZhuang requested a review from rbtr as a code owner September 20, 2022 19:23
@ZetaoZhuang ZetaoZhuang changed the title declare the core.lock to be a pointer fix: declare the core.lock to be a pointer to make core concurrency safe Sep 20, 2022
@rbtr
Copy link
Collaborator

rbtr commented Sep 22, 2022

@timraymond that this is necessary feels like something is big wrong, wdyt?

@timraymond
Copy link
Member

@rbtr well, the patch is correct as it stands in isolation, because bd236e7 introduced a defect to zapai.(*Core).clone() where the mutex is copied. I guess what I don't understand is why the mutex exists in the first place. It's not clear what it's protecting...

@ZetaoZhuang
Copy link
Contributor Author

ZetaoZhuang commented Sep 22, 2022

@timraymond currently we are only able to have a single encoder shared between a core and its derived. so the mutex here is to protect the traceTelemetry of encoder while there is a process working on resetting/updating it

@ZetaoZhuang
Copy link
Contributor Author

@timraymond currently we are only able to have a single encoder shared between a core and its derived. so the mutex here is to protect the traceTelemetry of encoder while there is a process working on resetting/updating it

@rbtr any thoughts on this?

@rbtr
Copy link
Collaborator

rbtr commented Oct 4, 2022

@ZetaoZhuang I regret the singleton gobber, I don't think this is sustainable. I discussed this with @timraymond who had the idea that maybe we could use protobuf to encode/decode the TraceTelemetry<->[]byte which could get us away from this synchronization mess...wdyt? Could you investigate some alternatives?

@timraymond
Copy link
Member

@rbtr btw, I approved this because the code in its current state is provably broken, and this patch does provide a minimal fix. I agree that we should investigate using protobufs, but I don't think this PR should be a forcing function for that investigation. I'll open an issue with the idea so that it can be handled separately.

@ZetaoZhuang
Copy link
Contributor Author

@ZetaoZhuang I regret the singleton gobber, I don't think this is sustainable. I discussed this with @timraymond who had the idea that maybe we could use protobuf to encode/decode the TraceTelemetry<->[]byte which could get us away from this synchronization mess...wdyt? Could you investigate some alternatives?

+1 on the protobuf. we can do the investigation on protobufs separately.

@rbtr
Copy link
Collaborator

rbtr commented Oct 7, 2022

no objection

@ZetaoZhuang ZetaoZhuang enabled auto-merge (squash) October 10, 2022 20:48
@ZetaoZhuang ZetaoZhuang merged commit e9d143d into master Oct 19, 2022
@ZetaoZhuang ZetaoZhuang deleted the fix_core_lock branch October 19, 2022 01:55
rjdenney pushed a commit to rjdenney/azure-container-networking that referenced this pull request Jan 19, 2023
…afe (Azure#1626)

declare core.lock to be a pointer, so the lock in cloned core can point
to the same pointer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants