Skip to content

Conversation

@justinc1
Copy link
Collaborator

@ddemlow had problem replacing certificate, and this happened only it server was access via VPN. Likely VPN resulted in just correct timing to raise unexpected SSL error - in python code we got "EOF occurred in violation of protocol (_ssl.c:997)".

PR catches ScaleComputingError exception, and checks if message had this particular message inside.
I didn't try to refactor the ScaleComputingError to have a dedicated exception - I'm not able to reproduce the problem, so it is hard to test any code modification. For same reason, it is hard to add any integration test for this.

The extra module.warn were added, to help debugging if anything similar happens later. I hope warn message is kind enough ("... ignore and continue") - I do not want to make anyone nervous.

@justinc1 justinc1 requested a review from domendobnikar March 15, 2023 07:12
@justinc1 justinc1 self-assigned this Mar 15, 2023
@justinc1 justinc1 force-pushed the ignore-ssl-error-during-cert-replacement branch from 2c7934d to 2c95f18 Compare March 15, 2023 07:18
)
sleep(2)
continue
except errors.ScaleComputingError as ex:
Copy link
Collaborator

@domendobnikar domendobnikar Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add this to client.py in order to avoid string-to-string comparison?
If we catch and raise it in client we can just catch it again inside the module without the string-to-string comparison.
I think the exception is called SSLEOFError.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is called that way. But I have no way to test that code will still work after change. Replacing cert in loop, and in parallel accessing /rest/v1/ping shows ConnectionRefused and ConnectionReset, but I never got SSL error.

And this is problem - we refactor client.py to intercept/reraise one extra exception, and side effect might be this fix will no longer work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

  • crafted SSL server, just to close TCP/SSL connection in middle.
  • or tcpkill
  • or - suggest something pls

Copy link
Collaborator

@domendobnikar domendobnikar Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should just remove the raise Scale error part of the client.
We've had this issue a couple of times now, where we don't exactly know which exception we are trying to handle.
And that part of the code does not seem to add much value ...

For the testing part, would it be too much effort on our part to build and kill an SSL connection?
Maybe we don't even need to setup a server and just close the SSL/VPN connection to the actual cluster as the data is being sent.

Copy link
Collaborator Author

@justinc1 justinc1 Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just remove the raise Scale error

Maybe. We would need to review all other except.

to build and kill an SSL connection

Worth to try. But I expect a blind tcpkill will most likely break opened TCP connection before or after SSL handshake, before or after full record (record as in https://msatechnosoft.in/blog/secure-socket-layer-ssl-work-accomplish-task-ssl-services/). And that SSL will not complain - it already got all bytes of previous record, and is waiting on 1st byte of next record.

Test:

# server
# openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes
openssl s_server -key key.pem -cert cert.pem -accept 44330 -www

# client
while [ 1 ]; do curl -k https://localhost:44330/ 1>out 2>/dev/null; echo RET=$?; done  # silent
while [ 1 ]; do curl -k https://localhost:44330/ 1>out ; echo RET=$?; done  # verbose

sudo tcpkill -i lo port 44330

I get error curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to localhost:44330 all the time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to reproduce that EOF occurred in violation of protocol (_ssl.c:997)>:

---
# ansi.yml 
- name: test SSL EOF
  hosts: all
  connection: local
  gather_facts: false

  tasks:
    - uri:
        url: https://127.0.0.1:44331
        validate_certs: false
# terminal 1
nc -l 127.0.0.1 44331

# terminal 2
ansible-playbook -i localhost, ansi.yml -v

# terminal 1
Ctrl+D or Ctrl+C

# terminal 2
TASK [uri] **********************************************************************************************************
fatal: [localhost]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python3"}, "changed": false, "elapsed": 2, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error EOF occurred in violation of protocol (_ssl.c:997)>", "redirected": false, "status": -1, "url": "https://127.0.0.1:44331"}

No need to interrupt SSL handshake in middle. Just do not send any byte over wire.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! 👍

@justinc1 justinc1 force-pushed the ignore-ssl-error-during-cert-replacement branch from 2c95f18 to ba2b111 Compare March 16, 2023 10:07
Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
It has MIT license, so we can include it into our repo (GPL)

Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
@justinc1 justinc1 force-pushed the ignore-ssl-error-during-cert-replacement branch from ba2b111 to 748195e Compare March 21, 2023 11:37
black and flake8

Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
@justinc1 justinc1 force-pushed the ignore-ssl-error-during-cert-replacement branch from 748195e to d932557 Compare March 21, 2023 11:48
Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
sudo is used when testing locally

Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
@justinc1 justinc1 force-pushed the ignore-ssl-error-during-cert-replacement branch from 19d8d03 to 5c27392 Compare March 21, 2023 13:34
Signed-off-by: Justin Cinkelj <justin.cinkelj@xlab.si>
@justinc1
Copy link
Collaborator Author

Refactored exception handling. We expect about 3 different ssl errors.

Integ test passed in https://github.com/ScaleComputing/HyperCoreAnsibleCollection/actions/runs/4479886132/jobs/7874449568

@justinc1 justinc1 requested a review from domendobnikar March 21, 2023 14:13
Copy link
Collaborator

@domendobnikar domendobnikar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you. 👍

@justinc1 justinc1 merged commit 8dee3f7 into main Mar 23, 2023
@justinc1 justinc1 deleted the ignore-ssl-error-during-cert-replacement branch March 23, 2023 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants