This repository has been archived by the owner on Sep 17, 2021. It is now read-only.

Memory requirements and Server Out of memory exception - help #903

Closed
3 tasks done
pklanka opened this issue Jan 9, 2018 · 19 comments

Comments

@pklanka

pklanka commented Jan 9, 2018

Please make sure that you have checked the boxes:

  • Review the Quickstart guide
  • Search for both open and closed issues regarding the problem you are experiencing
  • For permissions issues (Access Denied and credential related errors), please refer to the requisite docs before submitting an issue:
    AWS, GCP, OpenStack, GitHub

Description of issue:

There is a memory leak in the monkey process; memory usage grows until our Security Monkey server stops responding. Here is what atop looks like:
[atop screenshot]

The server has 2 vCPUs and 8 GB of RAM. Is that sufficient for Security Monkey operations? Could someone please guide / help.

@pklanka
Author

pklanka commented Jan 9, 2018

I only opened this issue because the other one (#826) is closed.

@mikegrima
Contributor

Hello @pklanka. This will largely depend on your environment.

Depending on the size of the environment and the technologies you are watching, it is recommended that you spin off separate Security Monkey worker instances that are dedicated to select accounts and technologies.

Do you have an idea on which technologies being scanned are killing SM? Knowing the specific technology in question will also help us to better address the issue.

@pklanka
Author

pklanka commented Jan 10, 2018

I increased the VM size to C2xlarge. We have 20 accounts. It's difficult to say which one is killing SM: none of them return an error (except a few Access Denied errors, which are expected) when I follow the previous mail thread.

Any other way I can debug?

@pklanka
Author

pklanka commented Jan 10, 2018

This one is consuming all the memory/CPU:

nginx 9340 10.3 52.7 12015944 8355808 ? Sl Jan09 113:46 /usr/local/src/security_monkey/venv/bin/python /usr/local/src/security_monkey/venv/bin/monkey start_scheduler
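One generic way to narrow down which watcher is leaking (this is a general Python technique, not Security Monkey's API; the `run_with_memory_log` helper is hypothetical) is to log the process's peak resident set size around each unit of work:

```python
import gc
import resource


def run_with_memory_log(label, fn, *args, **kwargs):
    """Run one unit of work (e.g. a single watcher's slurp) and report memory growth.

    Note: ru_maxrss is reported in kilobytes on Linux (bytes on macOS),
    and is a high-water mark, so it only ever grows.
    """
    gc.collect()
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    result = fn(*args, **kwargs)
    gc.collect()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("%s grew max RSS by %d KB" % (label, after - before))
    return result


if __name__ == "__main__":
    # Example: wrap any callable; in practice this would be each watcher run.
    run_with_memory_log("demo", lambda: sum(range(1000)))
```

Wrapping each watcher invocation this way would show which technology's scan drives the growth.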

@mikegrima
Contributor

This may be what @zpritcha was discussing yesterday on Gitter.

We'll need to investigate if the scheduler has memory leak issues.

@markofu
Contributor

markofu commented Jan 11, 2018

@mikegrima

Depending on the size of the environment and the technologies you are watching, it is recommended that you spin off separate Security Monkey worker instances that are dedicated to select accounts and technologies.

This is the first time I've seen such a recommendation. Can you provide any more specifics? Is there a threshold of number of accounts that you have in mind?

@mikegrima
Contributor

mikegrima commented Jan 11, 2018

@markofu It's a loose recommendation that I provide when users are experiencing massive scalability issues.

Generally if the watchers are taking a very long time to describe all the resources, then it makes sense to break it up.

This will be resolved in the future when we work on more event-driven watchers.
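As a rough illustration of the split-the-workload recommendation (the `worker_main` body is a hypothetical stand-in, not Security Monkey's actual entry point), accounts can be partitioned round-robin across independent worker processes:

```python
from multiprocessing import Process


def partition(accounts, n_workers):
    """Split the account list into n_workers roughly equal round-robin chunks."""
    return [accounts[i::n_workers] for i in range(n_workers)]


def worker_main(accounts):
    # Hypothetical: each worker would run watchers for its accounts only,
    # keeping any per-account memory growth isolated to one process.
    for account in accounts:
        pass


def run_partitioned(accounts, n_workers=4):
    chunks = partition(accounts, n_workers)
    procs = [Process(target=worker_main, args=(chunk,)) for chunk in chunks if chunk]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

With 20 accounts and 4 workers, each process scans 5 accounts, and a leak or throttle in one account set no longer stalls the others.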

@mikegrima
Contributor

🤞 Fixed in #904 ??

@mikegrima
Contributor

@pklanka Please pull down the latest develop branch, re-install dependencies, and test.

The latest version should address the issues with a newer scheduler library.

@pklanka
Author

pklanka commented Jan 12, 2018

Absolutely. Will give it a spin and test it over the weekend. Many thanks for a quick fix.

@rayjanoka

rayjanoka commented Jan 13, 2018

I installed the latest and it ran the hourly scheduler OK 3 times, but then it just failed with a Throttling error.

ERROR:apscheduler.executors.default:Job "run_change_reporter (trigger: interval[1:00:00], next run at: 2018-01-13 21:33:45 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/APScheduler-3.5.0-py2.7.egg/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/scheduler.py", line 30, in run_change_reporter
    reporter.run(account, interval)
  File "/usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/reporter.py", line 57, in run
    (items, exception_map) = monitor.watcher.slurp()
  File "/usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/watchers/iam/managed_policy.py", line 78, in slurp
    attached_roles = [a.arn for a in policy.attached_roles.all()]
  File "/usr/local/lib/python2.7/dist-packages/boto3/resources/collection.py", line 83, in __iter__
    for page in self.pages():
  File "/usr/local/lib/python2.7/dist-packages/boto3/resources/collection.py", line 166, in pages
    for page in pages:
  File "/usr/local/lib/python2.7/dist-packages/botocore/paginate.py", line 255, in __iter__
    response = self._make_request(current_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/paginate.py", line 332, in _make_request
    return self._method(**current_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 317, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 615, in _make_api_call
    raise error_class(parsed_response, operation_name)
ClientError: An error occurred (Throttling) when calling the ListEntitiesForPolicy operation (reached max retries: 4): Rate exceeded
2018-01-13 20:35:05,425 DEBUG: Logging exception from scheduler-change-reporter-uncaught with location: None to the database. [in /usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/datastore.py:726]
DEBUG:security_monkey:Logging exception from scheduler-change-reporter-uncaught with location: None to the database.
2018-01-13 20:35:05,430 DEBUG: Completed logging exception to database. [in /usr/local/lib/python2.7/dist-packages/security_monkey-0.9.3-py2.7.egg/security_monkey/datastore.py:757]
DEBUG:security_monkey:Completed logging exception to database.

@mikegrima
Contributor

mikegrima commented Jan 14, 2018

@rayj-pgi That appears to be more of an AWS rate-limiting error. We'll need to see why cloudaux isn't handling that properly.

EDIT: I see what's happening. Looks like the managed policy watcher isn't using CloudAux. CloudAux would properly retry with backoff.
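For reference, the retry-with-backoff behavior CloudAux provides can be sketched as a small decorator (an illustrative stand-in, not CloudAux's actual implementation; matching on the exception message is a simplification of botocore's error-code check):

```python
import random
import time
from functools import wraps


def retry_on_throttle(max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a callable with exponential backoff and full jitter on AWS throttling."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Re-raise anything that isn't throttling, or the final failure.
                    if "Throttling" not in str(exc) or attempt == max_attempts - 1:
                        raise
                    # Full jitter: sleep a random time in [0, base * 2^attempt).
                    sleep(random.uniform(0, base_delay * (2 ** attempt)))
        return wrapper
    return decorator
```

Wrapping the managed policy watcher's API calls this way would absorb the occasional `Rate exceeded` error instead of failing the whole run.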

@rayjanoka

Thanks @mikegrima! That error seems to have only occurred once this weekend, so probably not a big deal. Otherwise, with this release I'm seeing the scheduler container plateau at about 3 GB of memory, compared to about 3.6 GB previously. I have 3 medium-sized AWS accounts in my SM at the moment.

@fstuck37

Hi All - still running into this issue even after redeploying with the code changes.
I'm running on a t2.xlarge with CentOS and currently have 19 AWS accounts configured.

PID VDATA VSTACK VSIZE RSIZE PSIZE VGROW RGROW SWAPSZ MEM CMD 1/19
10870 15.3G 136K 15.6G 13.6G 0K 0K 0K 0K 89% monkey

I'm not a Python expert but willing to do some troubleshooting if you can provide any guidance.
Any suggestions?

Thanks,
Fred

@mikegrima
Contributor

Hello @fstuck37! I'm currently working on #909 to help resolve additional memory issues.

@fstuck37

Hi @mikegrima - Thanks for the update - I'll keep an eye on #909.
-Fred

@mikegrima
Contributor

Please see #910

@rayjanoka

rayjanoka commented Jan 24, 2018

This (#911) looks great Mike. I'm seeing that the scheduler is able to utilize multiple cores now and my memory usage, which was normally 3G+, is going up to 1.5G while running and back down to 500M when idle. 🥦 💯

@mikegrima
Contributor

Fixed in #911

5 participants