Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updater threads crash when AssumeRole denied by SCP #123

Closed
dfkunstler opened this issue Nov 19, 2020 · 1 comment · Fixed by #124
Closed

Updater threads crash when AssumeRole denied by SCP #123

dfkunstler opened this issue Nov 19, 2020 · 1 comment · Fixed by #124

Comments

@dfkunstler
Copy link
Contributor

dfkunstler commented Nov 19, 2020

Aardvark doesn't handle exceptions in UpdateAccountThread.run() defined in file /aardvark/manage.py. Aardvark also doesn't handle exceptions from CloudAux when it attempts to get IAM client from cloudaux.aws.sts.boto3_cached_conn() in method AccountToUpdate._get_client() defined in file /aardvark/updater/__init__.py. When there is an exception due to permission being denied by SCP, the thread crashes. Each occurrence reduces the effective thread count by 1 until they're all hung, at which point killing the OS process is the only option.

This issue was discovered in a deployment where AWS accounts slated for closure are first moved into an Organizations OU ("KIA") whose SCP denies all access. With ~40 AWS accounts in KIA out of ~850 total, Aardvark would process <100 accounts over several hours before all threads were crashed. The order of the accounts in the swag database appears to determine aardvark's processing order and thus timing of the crashes, so it's possible, for example, that all threads could crash on start if all the KIA accounts were at the top of the list.

The root exception in this scenario is an AccessDenied error from botocore. The source/nature of the AccessDenied exception seems to matter. For example, if AccessDenied is due to the aardvark role missing from the target account, a message is logged and aardvark keeps on trucking. But for some reason when SCP is the culprit, Aardvark chokes. Further investigation could be done to discover the distinction, but for our purposes it's sufficient to say that root cause was an unhandled exception as described above.

Fixing this involved adding exception handling to the two methods listed in the description, and replacing one instance of a direct call from AccountToUpdate._get_arns() to cloudaux.aws.sts.boto3_cached_conn() with a local call to AccountToUpdate._get_client(). Current behavior if the call from UpdateAccountThread.run() to AccountToUpdate.update_account() returns an error (i.e., no exception), is to put account back on the queue for processing again. In the case of an exception, however, it's simpler to assume the cause is not transient and continue, thus avoiding an infinite loop. If the issue was transient then the account will be processed the next time the updater runs, without data loss.

I'd like to contribute the code fixing this issue, but obviously don't have permission to create a branch. Would you prefer that I fork the project and create pull request from that; or submit a diff; or something else? I also added some debug-level logging while troubleshooting, which I'll keep in the pull request because it's valuable for monitoring when processing hundreds of AWS accounts.

thanks!

@patricksanders
Copy link
Collaborator

Thanks for the report, and nice find! Our preference would be to fork the project and open a PR from there.

patricksanders pushed a commit that referenced this issue Dec 16, 2020
* Prevent updater threads from crashing when access is denied by SCP, as described in #123

* Fix flake8 complaints re whitespace & comments.

* Remove errant ] somehow added to vars in f strings in two log messages.

* Updating urllib3, pylint, pyyaml (and astroid due to dependency) to resolve vulnerabilities reported by safety.:
+==============================================================================+
|                                                                              |
|                               /$$$$$$            /$$                         |
|                              /$$__  $$          | $$                         |
|           /$$$$$$$  /$$$$$$ | $$  \__//$$$$$$  /$$$$$$   /$$   /$$           |
|          /$$_____/ |____  $$| $$$$   /$$__  $$|_  $$_/  | $$  | $$           |
|         |  $$$$$$   /$$$$$$$| $$_/  | $$$$$$$$  | $$    | $$  | $$           |
|          \____  $$ /$$__  $$| $$    | $$_____/  | $$ /$$| $$  | $$           |
|          /$$$$$$$/|  $$$$$$$| $$    |  $$$$$$$  |  $$$$/|  $$$$$$$           |
|         |_______/  \_______/|__/     \_______/   \___/   \____  $$           |
|                                                          /$$  | $$           |
|                                                         |  $$$$$$/           |
|  by pyup.io                                              \______/            |
|                                                                              |
+==============================================================================+
| REPORT                                                                       |
| checked 64 packages, using default DB                                        |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| pylint                     | 2.4.2     | <2.5.0                   | 38224    |
+==============================================================================+
| Pylint 2.5.0 no longer allows ``python -m pylint ...`` to import user code.  |
| Previously, it added the current working directory as the first element of   |
| ``sys.path``. This opened up a potential security hole where ``pylint``      |
| would import user level code as long as that code resided in modules having  |
| the same name as stdlib or pylint's own modules.                             |
+==============================================================================+
| pyyaml                     | 5.1.2     | <5.3.1                   | 38100    |
+==============================================================================+
| A vulnerability was discovered in the PyYAML library in versions before      |
| 5.3.1, where it is susceptible to arbitrary code execution when it processes |
| untrusted YAML files through the full_load method or with the FullLoader     |
| loader. Applications that use the library to process untrusted input may be  |
| vulnerable to this flaw. An attacker could use this flaw to execute          |
| arbitrary code on the system by abusing the python/object/new constructor.   |
| See: CVE-2020-1747.                                                          |
+==============================================================================+
| pyyaml                     | 5.1.2     | >=5.1,<=5.1.2            | 38639    |
+==============================================================================+
| CVE-2019-20477: PyYAML 5.1 through 5.1.2 has insufficient restrictions on    |
| the load and load_all functions because of a class deserialization issue,    |
| e.g., Popen is a class in the subprocess module. NOTE: this issue exists     |
| because of an incomplete fix for CVE-2017-18342.                             |
+==============================================================================+
| urllib3                    | 1.25.6    | <1.25.9                  | 38834    |
+==============================================================================+
| urllib3 before 1.25.9 allows CRLF injection if the attacker controls the     |
| HTTP request method, as demonstrated by inserting CR and LF control          |
| characters in the first argument of putrequest(). See: CVE-2020-26137.       |
| (NOTE: this is similar to CVE-2020-26116.)                                   |
+==============================================================================+
| urllib3                    | 1.25.6    | >=1.25.2,<=1.25.7        | 27519    |
+==============================================================================+
| The _encode_invalid_chars function in util/url.py in the urllib3 library     |
| 1.25.2 through 1.25.7 for Python allows a denial of service (CPU             |
| consumption) because of an inefficient algorithm. The percent_encodings      |
| array contains all matches of percent encodings. It is not deduplicated. For |
| a URL of length N, the size of percent_encodings may be up to O(N). The next |
| step (normalize existing percent-encoded bytes) also takes up to O(N) for    |
| each step, so the total time is O(N^2). If percent_encodings were            |
| deduplicated, the time to compute _encode_invalid_chars would be O(kN),      |
| where k is at most 484 ((10+6*2)^2). See: CVE-2020-7212.                     |
+==============================================================================+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants