
[Algorithm] Parallelize Drain pre-processing and maintain state within single process #10

Closed
Superskyyy opened this issue Jul 18, 2022 · 12 comments

Superskyyy commented Jul 18, 2022

Now we are getting to the serious part. To put Drain into a real deployment, we need to parallelize the most expensive operation, masking, which is essentially a set of individual regex substitutions over a large volume of streaming log records. Masking takes up to 68% of total execution time in the profiled results below (more complex regexes push the percentage even higher).

Background: Masking example

Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
to
<Date> <Time> LabSZ sshd[<NUM>] reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [<IP>] failed - POSSIBLE BREAK-IN ATTEMPT!
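For illustration, here is a minimal sketch of what such a masking pass could look like; the regex list below is illustrative only, not the project's actual rule set:

```python
import re

# Illustrative masking rules only; the real rule set would be configurable.
MASKING_RULES = [
    (re.compile(r"^\w{3}\s+\d{1,2}"), "<Date>"),        # e.g. "Dec 10"
    (re.compile(r"\d{2}:\d{2}:\d{2}"), "<Time>"),        # e.g. "06:55:46"
    (re.compile(r"\d{1,3}(?:\.\d{1,3}){3}"), "<IP>"),    # IPv4 addresses
    (re.compile(r"(?<=\[)\d+(?=\])"), "<NUM>"),          # e.g. the pid in "sshd[24200]"
]

def mask(line: str) -> str:
    """Run every masking rule over one raw log line."""
    for pattern, placeholder in MASKING_RULES:
        line = pattern.sub(placeholder, line)
    return line

raw = ("Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo "
       "for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!")
print(mask(raw))  # prints a masked line like the example above
```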
After evaluating memory usage and running time: Drain is extremely light on memory but costly in CPU time, so we can divide and conquer the CPU-bound part.

The design is: if we have N services, we deploy N Drain instances (in the future, N * LOGGING_LEVEL):

We run all the Drain algorithm tree states in the main loop and decouple all the masking to a pool of "Masking" processes.

Each masking process listens to a queue of Redis-ingested logs and pushes the masked logs back into a ready_to_ingest queue. The Drain instance of the corresponding service then ingests them, trains its tree, and pushes the final results to the Redis ready_to_serve queue.
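A minimal sketch of this layout, assuming redis-py and illustrative queue names (raw_logs:<service>, ready_to_ingest:<service>, ready_to_serve:<service>). The mask() helper is the sketch from the masking example above, and Drain() / add_log_message() are hypothetical placeholders for the project's Drain implementation, not its actual API:

```python
import multiprocessing as mp

import redis

def masking_worker(service: str) -> None:
    """Pull raw logs for one service, mask them, and hand them to the Drain process."""
    r = redis.Redis()
    while True:
        _, raw = r.blpop(f"raw_logs:{service}")              # blocking pop of one raw record
        r.rpush(f"ready_to_ingest:{service}", mask(raw.decode()))

def drain_loop(service: str) -> None:
    """The single process that owns the Drain tree state for one service."""
    r = redis.Redis()
    drain = Drain()                                           # hypothetical Drain object
    while True:
        _, masked = r.blpop(f"ready_to_ingest:{service}")
        result = drain.add_log_message(masked.decode())       # hypothetical training call
        r.rpush(f"ready_to_serve:{service}", str(result))

if __name__ == "__main__":
    service = "demo-service"
    workers = [mp.Process(target=masking_worker, args=(service,), daemon=True) for _ in range(4)]
    for w in workers:
        w.start()
    drain_loop(service)                                       # the tree state never leaves this process
```

The point of the split is that the regex-heavy masking scales with the number of worker processes, while each tree stays in exactly one process, so no locking or state sharing is needed.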

Profiled results:

total          : took  1806.86 s (100.00%), 16,892,910 samples,  106.96 ms / 1000 samples,   9,349.30 hz
mask           : took  1232.71 s ( 68.22%), 16,892,910 samples,   72.97 ms / 1000 samples,  13,703.89 hz
drain          : took   524.00 s ( 29.00%), 16,892,910 samples,   31.02 ms / 1000 samples,  32,238.18 hz
tree_search    : took   417.02 s ( 23.08%), 16,892,910 samples,   24.69 ms / 1000 samples,  40,508.18 hz
cluster_exist  : took    50.74 s (  2.81%), 16,892,827 samples,    3.00 ms / 1000 samples, 332,961.53 hz
create_cluster : took     0.00 s (  0.00%),         83 samples,    0.00 ms / 1000 samples,        N/A hz
print_tree     : took     0.00 s (  0.00%),          1 samples,    0.00 ms / 1000 samples,        N/A hz

Superskyyy added the type: feature, Algorithm, and analysis: log labels on Jul 18, 2022
Superskyyy added this to the 0.1.0 milestone on Jul 18, 2022
Superskyyy commented Jul 18, 2022

@Liangshumin This is a task for the GSOC student to explore in the next week or so. FYI @kezhenxu94

By the end of the initial implementation, I'd like to see a comparison between (a rough measurement harness is sketched after this list):

  1. The execution time and memory utilization of 10 services (1 million logs each) in one single process.
  2. The execution time and memory utilization of 10 services (1 million logs each) with multiple Drain instances (or a multi-tree Drain) and multiprocessing masking.
  3. Pure multiprocessing, with one complete Drain instance in each process. (I don't like this option, though; it's going to cause problems for the metrics algorithms.)
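A rough way to collect these numbers, assuming a Linux box (ru_maxrss is reported in kilobytes there); run_configuration is a placeholder for whichever of the three setups above is being measured:

```python
import resource
import time

def benchmark(run_configuration, label: str) -> None:
    """Measure wall-clock time and peak RSS of this process and its children."""
    start = time.perf_counter()
    run_configuration()                 # e.g. feed 10 x 1M logs through one of the setups above
    elapsed = time.perf_counter() - start
    peak_self = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print(f"{label}: {elapsed:.1f} s, peak RSS self={peak_self} KB, children={peak_children} KB")
```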

We currently use a single stream to hold all incoming logs with metadata. Scaling in this manner has an upper bound set by whichever is slower: Drain itself (async producing and pushing results back to Redis) or the multiprocessing masking.

It scales out to different machines when using multiple Redis stream consumers. But a single stream with a single consumer should already cover a wide range of users once masking is distributed across processes. Switching to multiple streams will be easy and could process 10+ TB of logs per day given more cores.
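A minimal sketch of that multi-consumer direction, assuming redis-py, a single stream named log_stream, a log field on each entry, and a consumer group named maskers (all illustrative names):

```python
import redis

r = redis.Redis()
try:
    # Create the consumer group once; each machine joins it with its own consumer name.
    r.xgroup_create("log_stream", "maskers", id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass  # group already exists

while True:
    # Entries are load-balanced across all consumers in the "maskers" group.
    entries = r.xreadgroup("maskers", "consumer-1", {"log_stream": ">"}, count=100, block=5000)
    for _stream, records in entries:
        for record_id, fields in records:
            masked = mask(fields[b"log"].decode())   # reuse the mask() sketch above
            r.xack("log_stream", "maskers", record_id)
```

Adding another machine would then just mean starting the same loop with a different consumer name in the same group.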

Superskyyy commented:
@wu-sheng Please help add @Liangshumin and @Fengrui-Liu to the AI team with read access to this repo. I'm not able to @-mention them here without it. Thank you.

wu-sheng commented:
I have invited them to this repo as read-only collaborators.

wu-sheng commented:
About the Masking operation: you mentioned using regex to turn flat log text into structured data. This is a key part of what the LAL engine does in the SkyWalking OAP. Do we have to do this again in the AI engine?

Superskyyy commented:
@wu-sheng Thank you for the suggestion and question; I find this discussion very interesting.

> About the Masking operation: you mentioned using regex to turn flat log text into structured data. This is a key part of what the LAL engine does in the SkyWalking OAP. Do we have to do this again in the AI engine?

I thought of this before, but I realized it could slow down the 0.1.0 milestone, and it may be a sub-optimal idea in terms of system design after all.
(I do agree LAL would eliminate some regex work, since users can provide prior knowledge of the log structure, so my design is not optimal either.)

Let me lay out my take on this point. It may be lengthy, and I could be wrong since I lack full knowledge of LAL (I read its code a while ago), but I considered this a lot when making the choice.

In the AIOps engine's machine-learning-based log clustering, there are two things we need to do to return the best results.

  1. Extract the actual log message from the raw log - parsing. NOTE: this is what LAL is intended to do, but the algorithm enhancement I proposed in the other issue removes the need for this step; the tradeoff is some extra masking computation in step 2 because of the additional log structure that gets ingested.
  2. Cover whatever parameters we can with the simplest knowledge possible (e.g. masking a concrete value like 173.234.31.186 into a placeholder like <IP>). This can be done on either side of the engine, but it is preferably done on the AI engine side; otherwise we would need to keep the list of regexes in sync all the time and introduce additional overhead for the OAP, which I don't want to happen.

To summarize the above: LAL is useful for extracting log messages from unstructured logs, which makes the algorithm run faster (e.g. <timestamp, level, whatever header: msg> -> <msg>), but masking is still needed on the AI engine side; it's only a matter of more or less (hard to measure).

But there is a deadly risk and added complexity here. Consider these cases:

  1. Users may not provide LAL rules. - Not a big problem; we fall back to the AI engine anyway.
  2. LAL rules are unlikely to be comprehensive and 100% accurate. - Big problem; the AI engine will not get correct data.

For point 2, an example is when the user-provided regex is problematic, especially when the user wrongly defines "what part is the actual msg" but doesn't care whether it's correct (because they don't use it in any further LAL filter). Then the AI engine's clustering results will be bad since it receives the wrong log msg; coupling to LAL increases this risk.

Finally, the LAL engine persists only the original log after processing, not the version parsed with the user's regex (it only has a sampling sink, not a transforming sink, if I'm right). I know we could send the parsed version to the AI engine if and only if a parsed version exists, but that would require some code changes, and personally I find maintaining that additional branching logic quite a headache.

Superskyyy commented:
> I have invited them to this repo as read-only collaborators.

Thank you! Fengrui-Liu is okay now.

But I'm still not able to find Liangshumin in the task assignee list; would you mind double-checking for her?

wu-sheng commented:
AFAIK, she hasn't accepted the invitation.


I don't have her DM channel, I am afraid.

Liangshumin commented:
Sorry, I didn't check my email this afternoon. Is it okay now?

wu-sheng commented:
I think I invited the wrong account; please check the notification again.

Liangshumin commented:
I just accepted the invitation

Superskyyy commented:
Parallel masking will be addressed in #14. The shared-state Drain POC works; a proper implementation is pending for the PR.

Superskyyy commented:
First draft implemented in #23.
