This repository has been archived by the owner on Sep 17, 2021. It is now read-only.

Memory requirements and Server Out of memory exception - help #903

Closed
3 tasks done
pklanka opened this issue Jan 9, 2018 · 19 comments

Comments

@pklanka

pklanka commented Jan 9, 2018

Please make sure that you have checked the boxes:

  • Review the Quickstart guide
  • Search for both open and closed issues regarding the problem you are experiencing
  • For permissions issues (Access Denied and credential related errors), please refer to the requisite docs before submitting an issue:
    AWS, GCP, OpenStack, GitHub

Description of issue:

There is a memory leak in the monkey process; memory usage grows until our Security Monkey server stops responding. Here is what atop looks like:
[atop screenshot]

The server has 2 vCPUs and 8 GB of RAM. Is that sufficient for Security Monkey operations? Could someone please guide / help.

@pklanka
Author

pklanka commented Jan 9, 2018

I only opened this issue because the other one (#826) is closed.

@mikegrima
Contributor

Hello @pklanka. This will largely depend on your environment.

Depending on the size of the environment and the technologies you are watching, it is recommended that you spin off separate Security Monkey worker instances that are dedicated to select accounts and technologies.

Do you have an idea on which technologies being scanned are killing SM? Knowing the specific technology in question will also help us to better address the issue.

@pklanka
Author

pklanka commented Jan 10, 2018

I increased the VM size to C2xlarge. We have 20 accounts. It's difficult to say which one is killing SM: none of them return an error (except a few Access Denied errors, which are expected) when I follow the previous mail thread.

Any other way I can debug?

@pklanka
Author

pklanka commented Jan 10, 2018

This one is consuming all the memory/CPU:

nginx 9340 10.3 52.7 12015944 8355808 ? Sl Jan09 113:46 /usr/local/src/security_monkey/venv/bin/python /usr/local/src/security_monkey/venv/bin/monkey start_scheduler
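One generic way to narrow down which watcher is leaking (this is a general Python technique, not Security Monkey's API; the `run_with_memory_log` helper is hypothetical) is to log the process's peak resident set size around each unit of work:

```python
import gc
import resource


def run_with_memory_log(label, fn, *args, **kwargs):
    """Run one unit of work (e.g. a single watcher's slurp) and report memory growth.

    Note: ru_maxrss is reported in kilobytes on Linux (bytes on macOS),
    and is a high-water mark, so it only ever grows.
    """
    gc.collect()
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    result = fn(*args, **kwargs)
    gc.collect()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("%s grew max RSS by %d KB" % (label, after - before))
    return result


if __name__ == "__main__":
    # Example: wrap any callable; in practice this would be each watcher run.
    run_with_memory_log("demo", lambda: sum(range(1000)))
```

Wrapping each watcher invocation this way would show which technology's scan drives the growth.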

@mikegrima
Contributor

This may be what @zpritcha was discussing yesterday on Gitter.

We'll need to investigate if the scheduler has memory leak issues.

@markofu
Contributor

markofu commented Jan 11, 2018

@mikegrima

Depending on the size of the environment and the technologies you are watching, it is recommended that you spin off separate Security Monkey worker instances that are dedicated to select accounts and technologies.

This is the first time I've seen such a recommendation. Can you provide any more specifics? Is there a threshold of number of accounts that you have in mind?

@mikegrima
Contributor

mikegrima commented Jan 11, 2018

@markofu It's a loose recommendation that I provide when users are experiencing massive scalability issues.

Generally if the watchers are taking a very long time to describe all the resources, then it makes sense to break it up.

This will be resolved in the future when we work on more event-driven watchers.
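As a rough illustration of the split-the-workload recommendation (the `worker_main` body is a hypothetical stand-in, not Security Monkey's actual entry point), accounts can be partitioned round-robin across independent worker processes:

```python
from multiprocessing import Process


def partition(accounts, n_workers):
    """Split the account list into n_workers roughly equal round-robin chunks."""
    return [accounts[i::n_workers] for i in range(n_workers)]


def worker_main(accounts):
    # Hypothetical: each worker would run watchers for its accounts only,
    # keeping any per-account memory growth isolated to one process.
    for account in accounts:
        pass


def run_partitioned(accounts, n_workers=4):
    chunks = partition(accounts, n_workers)
    procs = [Process(target=worker_main, args=(chunk,)) for chunk in chunks if chunk]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

With 20 accounts and 4 workers, each process scans 5 accounts, and a leak or throttle in one account set no longer stalls the others.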

@mikegrima
Contributor

🤞 Fixed in #904 ??

@mikegrima
Contributor

@pklanka Please pull down the latest develop branch, re-install dependencies, and test.

The latest version should address the issues with a newer scheduler library.

@pklanka
Author

pklanka commented Jan 12, 2018

Absolutely. Will give it a spin and test it over the weekend. Many thanks for a quick fix.

@rayjanoka

rayjanoka commented Jan 13, 2018

I installed the latest and it ran the hourly scheduler OK 3 times, but then it just failed with a Throttling error.

ERROR:apscheduler.executors.default:Job "run_change_reporter (trigger: interval[1:00:00], next run at: 2018-01-13 21:33:45 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/APScheduler-3.5.0-py2.7.egg/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/scheduler.py", line 30, in run_change_reporter
    reporter.run(account, interval)
  File "/usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/reporter.py", line 57, in run
    (items, exception_map) = monitor.watcher.slurp()
  File "/usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/watchers/iam/managed_policy.py", line 78, in slurp
    attached_roles = [a.arn for a in policy.attached_roles.all()]
  File "/usr/local/lib/python2.7/dist-packages/boto3/resources/collection.py", line 83, in __iter__
    for page in self.pages():
  File "/usr/local/lib/python2.7/dist-packages/boto3/resources/collection.py", line 166, in pages
    for page in pages:
  File "/usr/local/lib/python2.7/dist-packages/botocore/paginate.py", line 255, in __iter__
    response = self._make_request(current_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/paginate.py", line 332, in _make_request
    return self._method(**current_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 317, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 615, in _make_api_call
    raise error_class(parsed_response, operation_name)
ClientError: An error occurred (Throttling) when calling the ListEntitiesForPolicy operation (reached max retries: 4): Rate exceeded
2018-01-13 20:35:05,425 DEBUG: Logging exception from scheduler-change-reporter-uncaught with location: None to the database. [in /usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/datastore.py:726]
DEBUG:security_monkey:Logging exception from scheduler-change-reporter-uncaught with location: None to the database.
2018-01-13 20:35:05,430 DEBUG: Completed logging exception to database. [in /usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/datastore.py:757]
DEBUG:security_monkey:Completed logging exception to database.

@mikegrima
Contributor

mikegrima commented Jan 14, 2018

@rayj-pgi That appears to be more of an AWS rate-limiting error. We'll need to see why cloudaux isn't handling that properly.

EDIT: I see what's happening. Looks like the managed policy watcher isn't using CloudAux. CloudAux would properly retry with backoff.
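For reference, the retry-with-backoff behavior CloudAux provides can be sketched as a small decorator (an illustrative stand-in, not CloudAux's actual implementation; matching on the exception message is a simplification of botocore's error-code check):

```python
import random
import time
from functools import wraps


def retry_on_throttle(max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a callable with exponential backoff and full jitter on AWS throttling."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Re-raise anything that isn't throttling, or the final failure.
                    if "Throttling" not in str(exc) or attempt == max_attempts - 1:
                        raise
                    # Full jitter: sleep a random time in [0, base * 2^attempt).
                    sleep(random.uniform(0, base_delay * (2 ** attempt)))
        return wrapper
    return decorator
```

Wrapping the managed policy watcher's API calls this way would absorb the occasional `Rate exceeded` error instead of failing the whole run.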

@rayjanoka

Thanks @mikegrima! That error seems to have only occurred once this weekend, so probably not a big deal. Otherwise, with this release I'm seeing the scheduler container plateau at about 3 GB of memory, compared to about 3.6 GB previously. I have 3 medium-sized AWS accounts in my SM at the moment.

@fstuck37

Hi All - still running into this issue even after redeploying with the code changes.
I'm running on a t2.xlarge with CentOS and currently have 19 AWS accounts configured.

PID VDATA VSTACK VSIZE RSIZE PSIZE VGROW RGROW SWAPSZ MEM CMD 1/19
10870 15.3G 136K 15.6G 13.6G 0K 0K 0K 0K 89% monkey

I'm not a Python expert but willing to do some troubleshooting if you can provide any guidance.
Any suggestions?

Thanks,
Fred

@mikegrima
Contributor

Hello @fstuck37! I'm currently working on #909 to help resolve additional memory issues.

@fstuck37

Hi @mikegrima - Thanks for the update - I'll keep an eye on #909.
-Fred

@mikegrima
Contributor

Please see #910

@rayjanoka

rayjanoka commented Jan 24, 2018

This (#911) looks great Mike. I'm seeing that the scheduler is able to utilize multiple cores now and my memory usage, which was normally 3G+, is going up to 1.5G while running and back down to 500M when idle. 🥦 💯

@mikegrima
Contributor

Fixed in #911

5 participants