
[Algorithm] Parallelize Drain pre-processing and maintain state within single process #10

Closed
Superskyyy opened this issue Jul 18, 2022 · 12 comments

Superskyyy commented Jul 18, 2022

Now we are getting to the serious part. To put Drain into a real deployment, we need to parallelize the most expensive operation, masking, which is essentially a set of individual regex substitutions over a large volume of streaming log records. Masking takes up to 68% of total execution time in the profiled results below (more complex regexes push the percentage even higher).

Background: Masking example

Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
to
<Date> <Time> LabSZ sshd[<NUM>] reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [<IP>] failed - POSSIBLE BREAK-IN ATTEMPT!
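For illustration, here is a minimal sketch of what such a masking pass could look like; the regex list below is illustrative only, not the project's actual rule set:

```python
import re

# Illustrative masking rules only; the real rule set would be configurable.
MASKING_RULES = [
    (re.compile(r"^\w{3}\s+\d{1,2}"), "<Date>"),        # e.g. "Dec 10"
    (re.compile(r"\d{2}:\d{2}:\d{2}"), "<Time>"),        # e.g. "06:55:46"
    (re.compile(r"\d{1,3}(?:\.\d{1,3}){3}"), "<IP>"),    # IPv4 addresses
    (re.compile(r"(?<=\[)\d+(?=\])"), "<NUM>"),          # e.g. the pid in "sshd[24200]"
]

def mask(line: str) -> str:
    """Run every masking rule over one raw log line."""
    for pattern, placeholder in MASKING_RULES:
        line = pattern.sub(placeholder, line)
    return line

raw = ("Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo "
       "for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!")
print(mask(raw))  # prints a masked line like the example above
```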
After evaluating memory usage and running time: Drain is extremely light on memory but costly in CPU time, so we can divide and conquer the CPU-bound part.

The design is: if we have N services, we deploy N Drain instances (in the future, N * LOGGING_LEVEL):

We run all the Drain algorithm tree states in the main loop and decouple all the masking to a pool of "Masking" processes.

Each masking process listens to a queue of Redis-ingested logs and pushes the masked logs back into a ready_to_ingest queue. The Drain instance of the corresponding service then ingests them, trains its tree, and pushes the final results to the Redis ready_to_serve queue.
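A minimal sketch of this layout, assuming redis-py and illustrative queue names (raw_logs:<service>, ready_to_ingest:<service>, ready_to_serve:<service>). The mask() helper is the sketch from the masking example above, and Drain() / add_log_message() are hypothetical placeholders for the project's Drain implementation, not its actual API:

```python
import multiprocessing as mp

import redis

def masking_worker(service: str) -> None:
    """Pull raw logs for one service, mask them, and hand them to the Drain process."""
    r = redis.Redis()
    while True:
        _, raw = r.blpop(f"raw_logs:{service}")              # blocking pop of one raw record
        r.rpush(f"ready_to_ingest:{service}", mask(raw.decode()))

def drain_loop(service: str) -> None:
    """The single process that owns the Drain tree state for one service."""
    r = redis.Redis()
    drain = Drain()                                           # hypothetical Drain object
    while True:
        _, masked = r.blpop(f"ready_to_ingest:{service}")
        result = drain.add_log_message(masked.decode())       # hypothetical training call
        r.rpush(f"ready_to_serve:{service}", str(result))

if __name__ == "__main__":
    service = "demo-service"
    workers = [mp.Process(target=masking_worker, args=(service,), daemon=True) for _ in range(4)]
    for w in workers:
        w.start()
    drain_loop(service)                                       # the tree state never leaves this process
```

The point of the split is that the regex-heavy masking scales with the number of worker processes, while each tree stays in exactly one process, so no locking or state sharing is needed.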

Profiled results:

total          : took  1806.86 s (100.00%), 16,892,910 samples,  106.96 ms / 1000 samples,   9,349.30 hz
mask           : took  1232.71 s ( 68.22%), 16,892,910 samples,   72.97 ms / 1000 samples,  13,703.89 hz
drain          : took   524.00 s ( 29.00%), 16,892,910 samples,   31.02 ms / 1000 samples,  32,238.18 hz
tree_search    : took   417.02 s ( 23.08%), 16,892,910 samples,   24.69 ms / 1000 samples,  40,508.18 hz
cluster_exist  : took    50.74 s (  2.81%), 16,892,827 samples,    3.00 ms / 1000 samples, 332,961.53 hz
create_cluster : took     0.00 s (  0.00%),         83 samples,    0.00 ms / 1000 samples,        N/A hz
print_tree     : took     0.00 s (  0.00%),          1 samples,    0.00 ms / 1000 samples,        N/A hz

Superskyyy added the type: feature, Algorithm, and analysis: log labels on Jul 18, 2022
Superskyyy added this to the 0.1.0 milestone on Jul 18, 2022
Superskyyy commented Jul 18, 2022

@Liangshumin This is a task for the GSOC student to explore in the next week or so. FYI @kezhenxu94

By the end of the initial implementation, I'd like to see a comparison between (a rough measurement harness is sketched after this list):

  1. The execution time and memory utilization of 10 services (1 million logs each) in one single process.
  2. The execution time and memory utilization of 10 services (1 million logs each) with multiple Drain instances (or a multi-tree Drain) and multiprocessing masking.
  3. Pure multiprocessing, with one complete Drain instance in each process. (I don't like this option, though; it's going to cause problems for the metrics algorithms.)
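A rough way to collect these numbers, assuming a Linux box (ru_maxrss is reported in kilobytes there); run_configuration is a placeholder for whichever of the three setups above is being measured:

```python
import resource
import time

def benchmark(run_configuration, label: str) -> None:
    """Measure wall-clock time and peak RSS of this process and its children."""
    start = time.perf_counter()
    run_configuration()                 # e.g. feed 10 x 1M logs through one of the setups above
    elapsed = time.perf_counter() - start
    peak_self = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print(f"{label}: {elapsed:.1f} s, peak RSS self={peak_self} KB, children={peak_children} KB")
```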

We currently use a single stream to hold all incoming logs with metadata. Scaling in this manner has an upper bound set by whichever is slower: Drain itself (async producing and pushing results back to Redis) or the multiprocessing masking.

It scales out to different machines when using multiple Redis stream consumers. But a single stream with a single consumer should already cover a wide range of users once masking is distributed across processes. Switching to multiple streams will be easy and could process 10+ TB of logs per day given more cores.
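A minimal sketch of that multi-consumer direction, assuming redis-py, a single stream named log_stream, a log field on each entry, and a consumer group named maskers (all illustrative names):

```python
import redis

r = redis.Redis()
try:
    # Create the consumer group once; each machine joins it with its own consumer name.
    r.xgroup_create("log_stream", "maskers", id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass  # group already exists

while True:
    # Entries are load-balanced across all consumers in the "maskers" group.
    entries = r.xreadgroup("maskers", "consumer-1", {"log_stream": ">"}, count=100, block=5000)
    for _stream, records in entries:
        for record_id, fields in records:
            masked = mask(fields[b"log"].decode())   # reuse the mask() sketch above
            r.xack("log_stream", "maskers", record_id)
```

Adding another machine would then just mean starting the same loop with a different consumer name in the same group.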

Superskyyy commented:
@wu-sheng Please help add @Liangshumin and @Fengrui-Liu to the AI team with read access to this repo. I'm not able to @-mention them here without it. Thank you.

wu-sheng commented:
I have invited them to this repo as read-only collaborators.

wu-sheng commented:
About the Masking operation: you mentioned using regex to turn flat log text into structured data. This is a key part of what the LAL engine does in the SkyWalking OAP. Do we have to do this again in the AI engine?

Superskyyy commented:
@wu-sheng Thank you for the suggestion and question; I find this discussion very interesting.

> About the Masking operation: you mentioned using regex to turn flat log text into structured data. This is a key part of what the LAL engine does in the SkyWalking OAP. Do we have to do this again in the AI engine?

I thought of this before, but I realized it could slow down the 0.1.0 milestone, and it may be a sub-optimal idea in terms of system design after all.
(I do agree LAL would eliminate some regex work, since users can provide prior knowledge of the log structure, so my design is not optimal either.)

Let me lay out my take on this point. It may be lengthy, and I could be wrong since I lack full knowledge of LAL (I read its code a while ago), but I considered this a lot when making the choice.

In the AIOps engine's machine-learning-based log clustering, there are two things we need to do to return the best results.

  1. Extract the actual log message from the raw log - parsing. NOTE: this is what LAL is intended to do, but the algorithm enhancement I proposed in the other issue removes the need for this step; the tradeoff is some extra masking computation in step 2 because of the additional log structure that gets ingested.
  2. Cover whatever parameters we can with the simplest knowledge possible (e.g. masking a concrete value like 173.234.31.186 into a placeholder like <IP>). This can be done on either side of the engine, but it is preferably done on the AI engine side; otherwise we would need to keep the list of regexes in sync all the time and introduce additional overhead for the OAP, which I don't want to happen.

To summarize the above: LAL is useful for extracting log messages from unstructured logs, which makes the algorithm run faster (e.g. <timestamp, level, whatever header: msg> -> <msg>), but masking is still needed on the AI engine side; it's only a matter of more or less (hard to measure).

But there is a deadly risk and added complexity here. Consider these cases:

  1. Users may not provide LAL rules. - Not a big problem; we fall back to the AI engine anyway.
  2. LAL rules are unlikely to be comprehensive and 100% accurate. - Big problem; the AI engine will not get correct data.

For point 2, an example is when the user-provided regex is problematic, especially when the user wrongly defines "what part is the actual msg" but doesn't care whether it's correct (because they don't use it in any further LAL filter). Then the AI engine's clustering results will be bad since it receives the wrong log msg; coupling to LAL increases this risk.

Finally, the LAL engine persists only the original log after processing, not the version parsed with the user's regex (it only has a sampling sink, not a transforming sink, if I'm right). I know we could send the parsed version to the AI engine if and only if a parsed version exists, but that would require some code changes, and personally I find maintaining that additional branching logic quite a headache.

Superskyyy commented:
> I have invited them to this repo as read-only collaborators.

Thank you! Fengrui-Liu is okay now.

But I'm still not able to find Liangshumin in the task assignee list; would you mind double-checking for her?

wu-sheng commented:
AFAIK, she hasn't accepted the invitation.


I don't have her DM channel, I am afraid.

Liangshumin commented:
Sorry, I didn't check my email this afternoon. Is it okay now?

wu-sheng commented:
I think I invited the wrong account; please check the notification again.

Liangshumin commented:
I just accepted the invitation

Superskyyy commented:
Parallel masking will be addressed in #14. The shared-state Drain POC works; a proper implementation is pending for the PR.

Superskyyy commented:
First draft implemented in #23.
