
Fix NIOSSL.NIOSSLError.uncleanShutdown #1015

Closed
finestructure opened this issue Apr 2, 2021 · 15 comments · Fixed by #1041

@finestructure

Follow-up from #1014

The processing jobs fail with:

[ INFO ] Reconciling ... [component: reconcile]
[ ERROR ] Channel error caught. [psql_connection_id: EBB34917-47DC-40F9-8EF4-207ECC928E9B, psql_error: uncleanShutdown]
[ ERROR ] Channel error caught. Closing connection. [psql_connection_id: EBB34917-47DC-40F9-8EF4-207ECC928E9B, psql_error: PSQLError(base: PostgresNIO.PSQLError.Base.connectionError(underlying: NIOSSL.NIOSSLError.uncleanShutdown))]

Interestingly, app_server does seem to be able to connect:

[ INFO ] GET /marinofelipe/swift-package-info [component: server, request-id: 858996AA-35CF-4962-9648-31D1492F13E1]
[ INFO ] POST /api/versions/78A72B11-A475-478B-9189-F3AD5B3411DC/builds [component: server, request-id: B6103D4D-7C59-44F4-BAFD-6F6C7FA543FA]
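
A rough way to check how the server side ends the TLS session (what NIOSSL reports as uncleanShutdown is the connection closing without a close_notify alert). This is just a diagnostic sketch; the hostname is a placeholder for our Azure Postgres server, and -starttls postgres needs a reasonably recent OpenSSL:

# open a TLS session against the db and watch how it is torn down on exit
openssl s_client -starttls postgres -connect <db-host>.postgres.database.azure.com:5432 </dev/null
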
@finestructure finestructure self-assigned this Apr 2, 2021
@finestructure

Manually running the analysis job from the app_server container on p3 raises the same error:

root@7b8d9ffa7080:/run# ./Run analyze --id e9475c7f-26ce-403b-b6ee-03cf76b670ae
[ INFO ] Analyzing (id: E9475C7F-26CE-403B-B6EE-03CF76B670AE) ... [component: analyze]
[ INFO ] Checkout directory: /checkouts [component: analyze]
[ INFO ] cloning https://github.com/daveverwer/LeftPad.git to /checkouts/github.com-daveverwer-leftpad [component: analyze]
[ INFO ] listing tags for package https://github.com/daveverwer/LeftPad.git [component: analyze]
[ ERROR ] Channel error caught. [psql_connection_id: 6D6EA036-784B-4C97-8665-515D2CDCD837, psql_error: uncleanShutdown]
[ ERROR ] Channel error caught. Closing connection. [psql_connection_id: 6D6EA036-784B-4C97-8665-515D2CDCD837, psql_error: PSQLError(base: PostgresNIO.PSQLError.Base.connectionError(underlying: NIOSSL.NIOSSLError.uncleanShutdown))]

@finestructure

The same happens when running from within the app_server container on p1:

root@34553a63d453:/run# ./Run analyze --id e9475c7f-26ce-403b-b6ee-03cf76b670ae
[ INFO ] Analyzing (id: E9475C7F-26CE-403B-B6EE-03CF76B670AE) ... [component: analyze]
[ INFO ] Checkout directory: /checkouts [component: analyze]
[ INFO ] pulling https://github.com/daveverwer/LeftPad.git in /checkouts/github.com-daveverwer-leftpad [component: analyze]
[ INFO ] listing tags for package https://github.com/daveverwer/LeftPad.git [component: analyze]
[ ERROR ] Channel error caught. [psql_connection_id: 89081989-EEDB-4DF8-847D-25F5556FB4D7, psql_error: uncleanShutdown]
[ ERROR ] Channel error caught. Closing connection. [psql_connection_id: 89081989-EEDB-4DF8-847D-25F5556FB4D7, psql_error: PSQLError(base: PostgresNIO.PSQLError.Base.connectionError(underlying: NIOSSL.NIOSSLError.uncleanShutdown))]

@finestructure

finestructure commented Apr 2, 2021

Seeing the same error now in the processing containers on p1 as well. Perhaps the reboot only made this work for the first connection, and we're now again not processing. So it might not be a p3-specific issue.

@finestructure

Also note: the unclean shutdown error might be a red herring and not indicative of processing failing.

In fact, running the analyze command and checking the timestamp in the db confirms that the record is updated from both p1 and p3.

However, I've also explicitly added p3's IP address to the db's allow list, which might be what made this work.

Either way, the error message clearly has nothing to do with the fact that reconciliation didn't work.
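
For reference, "checking the timestamp in the db" amounts to something like the following (connection string, table and column names are placeholders, not necessarily our actual schema):

# after running ./Run analyze from a node, the latest updated_at should move if the write went through
psql "$DATABASE_URL" -c "SELECT id, updated_at FROM packages ORDER BY updated_at DESC LIMIT 5;"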

@finestructure finestructure changed the title Processing job running on p3 can't connect to db Reconciliation job running on p3 failed to update db Apr 2, 2021
@finestructure

So to summarise:

  • the error message does not mean the container can't access the db
  • rebooting p3 might have fixed the issue in the first place - it's hard to tell now, as p3 seems to be in working order (although even that is hard to verify: running a job manually from app_server should be equivalent to running it from a dedicated container, but it's not exactly the same)
  • the IP allow-listing has no effect: allowing Azure resources is sufficient to update the db (as it should be)

I'm not sure yet how to test p3 and regain confidence for scheduling processing containers on it. What's worse, it might not be a p3 issue at all and could affect the other nodes as well. Or it might have been a one-time issue with app_reconcile on p3.
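
A rough sketch of how p3 could be exercised deliberately, assuming the swarm node names match the hostnames p1/p2/p3 and the service is named app_reconcile (both are assumptions): drain the other nodes so the scheduler has to place the task on p3, then restore them.

# from the swarm manager
docker node ls                               # node availability and status
docker service ps app_reconcile              # which node the reconcile task currently runs on
docker node update --availability drain p1   # temporarily take the other nodes out of scheduling
docker node update --availability drain p2
docker service update --force app_reconcile  # reschedule; the task should now land on p3
docker node update --availability active p1  # restore the other nodes afterwards
docker node update --availability active p2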

@finestructure

Almost all jobs are scheduled on p3 again after the latest deployment:

 docker-user@bastion  p3  ~  dps                                                                                                                                              1939ms
CONTAINER ID   IMAGE                                                        STATUS          PORTS     NAMES
56c8d238ff1e   registry.gitlab.com/finestructure/swiftpackageindex:2.23.0   Up 27 minutes             app_server.eb99bzy7x2otxqyolguy324g0.jf6v35h45j97wxnn2hljimqwu
4dcb40afd133   registry.gitlab.com/finestructure/swiftpackageindex:2.23.0   Up 27 minutes             app_ingest.1.8fmezrvg6r0x7jdvwr34h66ad
fb3c198973a8   registry.gitlab.com/finestructure/swiftpackageindex:2.23.0   Up 27 minutes             app_analyze.1.s7iqqacmtm8qvzcp5dk0dvxi5
1aaf0e8b88e8   registry.gitlab.com/finestructure/swiftpackageindex:2.23.0   Up 27 minutes             app_reconcile.1.zk9b6z1mmzc0v2tqz82rkihm3
2b589ee7252f   grafana/promtail:2.0.0                                       Up 5 days                 mon_promtail.eb99bzy7x2otxqyolguy324g0.s9ottyts2fkhnjdhuc8u7yjeu

So far everything looks to be in order but we haven't added any new packages yet. Will need to keep an eye on that.
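
Aside: instead of running dps (docker ps) on each node, the scheduling picture is also available in one go from the manager (service names as per the container names above):

docker service ps app_server app_ingest app_analyze app_reconcile \
  --filter desired-state=running \
  --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}'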

@finestructure

Working as expected. Will keep this open for one more deployment to check.

[screenshot]

@finestructure

finestructure commented Apr 9, 2021

Last night we saw increased rates of 500s, presumably caused by the app_server container on at least one of the nodes being unable to reach the db. I couldn't inspect the containers at the time and instead restarted the nodes one by one. Only after restarting the last one, p2 (I had started with p3, thinking it might be an issue with p3), did the errors go away.

Might have been coincidental, but it seems to me that there is an intermittent db connectivity issue.

[screenshot: CleanShot 2021-04-09 at 07 53 07]

If this persists, I think we need to consider moving back to a containerised db. This would fix several problems:

  • the NIOSSL.NIOSSLError.uncleanShutdown SSL errors
  • hopefully also this intermittent connectivity issue
  • being able to go back to Postgres 12, which fixes some warnings we were getting on db imports

Of course, the downside is availability, but we've not had any issues with that over months of use.
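
For context, a containerised db on the swarm would look roughly like this (service name, volume, and placement constraint are placeholders, not a decided setup):

docker service create --name spi_db \
  --constraint node.hostname==p1 \
  --mount type=volume,source=spi_db_data,target=/var/lib/postgresql/data \
  -e POSTGRES_DB=spi -e POSTGRES_USER=spi -e POSTGRES_PASSWORD=... \
  --publish 5432:5432 \
  postgres:12-alpine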

@finestructure finestructure changed the title Reconciliation job running on p3 failed to update db Fix NIOSSL.NIOSSLError.uncleanShutdown Apr 19, 2021
@finestructure

I've asked in Vapor's Discord if anyone has ideas how to fix this: https://discord.com/channels/431917998102675485/448584561845338139/833567195849424956

@finestructure

Azure support ticket opened: Case 2104190050000549 (TrackingID#2104190050000549).

@finestructure

Helpful gist to run a local Postgres with SSL in docker: https://gist.github.com/mrw34/c97bb03ea1054afb551886ffc8b63c3b

Although note that the -v volume mapping didn't work for me. I had to build an image instead:

FROM postgres:11.6-alpine
COPY server.crt /var/lib/postgresql/server.crt
COPY server.key /var/lib/postgresql/server.key
RUN chown postgres:postgres /var/lib/postgresql/server.crt
RUN chown postgres:postgres /var/lib/postgresql/server.key
# postgres refuses to start if the key file is group- or world-readable
RUN chmod 600 /var/lib/postgresql/server.key

docker run --name spi_dev -e POSTGRES_DB=spi_dev -e POSTGRES_USER=spi_dev -e POSTGRES_PASSWORD=xxx -p 6432:5432 \
        ssldb \
        -c ssl=on \
        -c ssl_cert_file=/var/lib/postgresql/server.crt \
        -c ssl_key_file=/var/lib/postgresql/server.key
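
For completeness: the image needs to be built under the name used in the run command, and the SSL connection can then be sanity-checked with psql (password matching the POSTGRES_PASSWORD above):

docker build -t ssldb .
PGPASSWORD=xxx psql "host=localhost port=6432 dbname=spi_dev user=spi_dev sslmode=require" \
  -c "SELECT ssl, version FROM pg_stat_ssl WHERE pid = pg_backend_pid();"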

@finestructure

Good news, looks like there'll be a fix incoming: vapor/postgres-nio#150 (comment)

@finestructure

Confirmed fixed on staging:

2021-04-22T07:22:04.031012013Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2021-04-22T07:22:05.586030459Z [ INFO ] Checkout directory: /checkouts [component: analyze]
2021-04-22T07:22:28.155562570Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2021-04-22T07:22:29.301597734Z [ INFO ] Checkout directory: /checkouts [component: analyze]
2021-04-22T07:22:51.742406706Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2021-04-22T07:22:52.843857835Z [ INFO ] Checkout directory: /checkouts [component: analyze]

@finestructure

finestructure commented Apr 22, 2021

And prod:

[screenshot: CleanShot 2021-04-22 at 10 59 03]

@finestructure

The spike when the first deployment (2.26.1) with the PostgresNIO fix went out (around 10:25) is from a bunch of SSL errors that stacked up as the services restarted. Notice how the follow-up deployment 2.26.2 a few minutes later doesn't cause a spike.
