[Algorithm] Parallelize Drain pre-processing and maintain state within single process #10
@Liangshumin This is a task for the GSoC student to explore in the next week or so. FYI @kezhenxu94 By the end of the initial implementation, I'd like to see a comparison between
We currently use a single stream to hold all incoming logs with metadata. Scaling in this manner has an upper bound set by the time taken for Drain (async producer and push back to Redis) or multiprocessing masking, whichever is slower. It's scalable to different machines when using multiple Redis stream consumers, but a single stream and a single consumer should already cover a wide range of users easily once masking is distributed using multiprocessing. Changing to multi-stream will be easy and can process 10+ TB of logs every day given more cores.
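For concreteness, here is a minimal sketch of that single-stream layout using the redis-py client; the stream name `log_stream`, the group `drain_group`, and the field names are hypothetical, not the engine's actual ones:

```python
# Sketch only: single Redis stream, one consumer group. More consumers
# (or machines) can join the same group later without changing producers.
import redis

r = redis.Redis()

# Producer side: append an incoming log record with metadata to the stream.
r.xadd("log_stream", {"service": "svc-a", "log": "raw log line ..."})

# Consumer side: create the group once (it may already exist).
try:
    r.xgroup_create("log_stream", "drain_group", id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

# Read a batch of new entries for this consumer and acknowledge them.
entries = r.xreadgroup("drain_group", "consumer-1", {"log_stream": ">"}, count=100)
for stream, records in entries:
    for record_id, fields in records:
        # ... mask the log line and feed it to Drain here ...
        r.xack("log_stream", "drain_group", record_id)
```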
@wu-sheng Please help to add @Liushumin and @Fengrui-Liu to the AI team with read access to this repo. I'm not able to @ them here without it. Thank you.
I have invited them into this repo as |
@wu-sheng Thank you for the suggestion and question; I find this discussion very interesting.
I thought of this before, but I realized it could slow down the 0.1.0 milestone, and it may be a sub-optimal idea in terms of system design after all. Let me discuss my take on the point; it may be lengthy and I could be wrong due to a lack of full knowledge of LAL (I read its code a while ago), but I have considered this choice a lot. In the AIOps engine's log clustering with machine learning, there are two things we need to do to return the best results.
So to summarize the above point: LAL is useful for extracting log messages from unstructured logs, making the algorithm run faster (for example, `<timestamp, level, whatever header: msg>` -> `msg`), but masking is still needed on the AI engine side; it's a matter of more or less (hard to measure). But there's a deadly risk and complexity:
For point 2, an example is when the user-provided regex is problematic, especially when the user wrongly defines which part is the actual msg but doesn't care whether it's correct (because they don't use it in further LAL filters); then the AI engine's clustering results will be bad since it receives the wrong log msg, so this coupling onto LAL increases the risk. Finally, the LAL engine persists only the original logs after they are processed, not the parsed version based on user regex (it only has a sampling sink, not a transforming sink, if I'm right). I know that we could send the parsed version to the AI engine if and only if a parsed version exists; it would require some code changes. But personally I find maintaining an additional branching logic quite a headache.
Thank you! Fengrui-Liu is okay now. But I'm still not able to find Liangshumin in task assignment; would you mind double-checking for her?
Sorry, I didn't check my email this afternoon, is it okay now? |
I think I invited the wrong account, please check the notification again.
I just accepted the invitation |
Parallel masking will be addressed in #14; the shared-state Drain POC works, pending a serious implementation for a PR.
First draft implemented in #23
Now we are getting to the serious part: to deploy Drain in a real deployment, we will need to parallelize the most expensive operation, Masking, which is essentially individual regex substitutions over a large volume of streaming log records, taking up to 68% of the total execution time in the profiled results below (more complex regexes lead to an even higher percentage).

Background: Masking example
Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
to
<Date> <Time> LabSZ sshd[<NUM>] reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [<IP>] failed - POSSIBLE BREAK-IN ATTEMPT!
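As an illustration of this step, here is a minimal Python sketch of regex-substitution masking; the patterns below are assumptions chosen just to reproduce this one example, not the engine's actual mask set:

```python
# Sketch only: each mask is an independent regex substitution over the
# raw line, which is why this step is CPU-bound and embarrassingly parallel.
import re

MASK_PATTERNS = [
    (re.compile(r"^\w{3} +\d{1,2}"), "<Date>"),      # e.g. "Dec 10"
    (re.compile(r"\d{2}:\d{2}:\d{2}"), "<Time>"),    # e.g. "06:55:46"
    (re.compile(r"\d{1,3}(\.\d{1,3}){3}"), "<IP>"),  # IPv4 address
    (re.compile(r"(?<=\[)\d+(?=\])"), "<NUM>"),      # number in brackets, e.g. pid
]

def mask(line: str) -> str:
    for pattern, token in MASK_PATTERNS:
        line = pattern.sub(token, line)
    return line

raw = ("Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking "
       "getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed "
       "- POSSIBLE BREAK-IN ATTEMPT!")
print(mask(raw))  # close to the masked form shown above
```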
After evaluating the memory usage and running time: Drain is extremely light on memory and costly in CPU processing time, so we can divide and conquer the CPU-bound problem.
The design is: if we have N services, we deploy N Drain instances (in the future, N * LOGGING_LEVEL):
We run all the Drain algorithm tree states in the main loop and decouple all the masking to a pool of "Masking" processes.
Each process will listen to a queue of Redis-ingested logs and push the masked logs back into the `ready_to_ingest` state queue. Then the Drain instance of the corresponding service ingests them and trains the trees, pushing the final results to the Redis `ready_to_serve` state.
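To make the split concrete, here is a minimal, self-contained sketch under assumed names; `mask()` is a placeholder and the in-process queues stand in for the Redis state queues, so this is not the actual implementation:

```python
# Sketch only: a pool of masking processes feeds masked lines back to the
# main loop, which keeps one Drain tree per service in a single process.
import multiprocessing as mp

def mask(line: str) -> str:
    # Placeholder for the regex masking step sketched earlier.
    return line

def masking_worker(in_q: mp.Queue, out_q: mp.Queue) -> None:
    # CPU-bound worker: pull raw logs, apply masking, push results back.
    while True:
        item = in_q.get()
        if item is None:            # sentinel: shut the worker down
            break
        service, line = item
        out_q.put((service, mask(line)))

if __name__ == "__main__":
    raw_q = mp.Queue()              # logs read from the Redis stream
    ready_to_ingest = mp.Queue()    # masked logs for the Drain main loop

    workers = [mp.Process(target=masking_worker, args=(raw_q, ready_to_ingest))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()

    raw_q.put(("svc-a", "Dec 10 06:55:46 LabSZ sshd[24200]: ..."))

    # Main loop: the mutable Drain tree state stays here, one tree per
    # service, so it is never shared across processes.
    service, masked = ready_to_ingest.get()
    print(service, masked)

    for _ in workers:
        raw_q.put(None)
    for w in workers:
        w.join()
```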