
Fix NIOSSL.NIOSSLError.uncleanShutdown #1015

Closed
finestructure opened this issue Apr 2, 2021 · 15 comments · Fixed by #1041

@finestructure

Follow-up from #1014

The processing jobs fail with:

[ INFO ] Reconciling ... [component: reconcile]
[ ERROR ] Channel error caught. [psql_connection_id: EBB34917-47DC-40F9-8EF4-207ECC928E9B, psql_error: uncleanShutdown]
[ ERROR ] Channel error caught. Closing connection. [psql_connection_id: EBB34917-47DC-40F9-8EF4-207ECC928E9B, psql_error: PSQLError(base: PostgresNIO.PSQLError.Base.connectionError(underlying: NIOSSL.NIOSSLError.uncleanShutdown))]

Interestingly, app_server does seem to be able to connect:

[ INFO ] GET /marinofelipe/swift-package-info [component: server, request-id: 858996AA-35CF-4962-9648-31D1492F13E1]
[ INFO ] POST /api/versions/78A72B11-A475-478B-9189-F3AD5B3411DC/builds [component: server, request-id: B6103D4D-7C59-44F4-BAFD-6F6C7FA543FA]
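
A rough way to check how the server side ends the TLS session (what NIOSSL reports as uncleanShutdown is the connection closing without a close_notify alert). This is just a diagnostic sketch; the hostname is a placeholder for our Azure Postgres server, and -starttls postgres needs a reasonably recent OpenSSL:

# open a TLS session against the db and watch how it is torn down on exit
openssl s_client -starttls postgres -connect <db-host>.postgres.database.azure.com:5432 </dev/null
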
@finestructure finestructure self-assigned this Apr 2, 2021
@finestructure

Manually running the analysis job from the app_server container on p3 raises the same error:

root@7b8d9ffa7080:/run# ./Run analyze --id e9475c7f-26ce-403b-b6ee-03cf76b670ae
[ INFO ] Analyzing (id: E9475C7F-26CE-403B-B6EE-03CF76B670AE) ... [component: analyze]
[ INFO ] Checkout directory: /checkouts [component: analyze]
[ INFO ] cloning https://github.com/daveverwer/LeftPad.git to /checkouts/github.com-daveverwer-leftpad [component: analyze]
[ INFO ] listing tags for package https://github.com/daveverwer/LeftPad.git [component: analyze]
[ ERROR ] Channel error caught. [psql_connection_id: 6D6EA036-784B-4C97-8665-515D2CDCD837, psql_error: uncleanShutdown]
[ ERROR ] Channel error caught. Closing connection. [psql_connection_id: 6D6EA036-784B-4C97-8665-515D2CDCD837, psql_error: PSQLError(base: PostgresNIO.PSQLError.Base.connectionError(underlying: NIOSSL.NIOSSLError.uncleanShutdown))]

@finestructure

The same happens when running from within the app_server container on p1:

root@34553a63d453:/run# ./Run analyze --id e9475c7f-26ce-403b-b6ee-03cf76b670ae
[ INFO ] Analyzing (id: E9475C7F-26CE-403B-B6EE-03CF76B670AE) ... [component: analyze]
[ INFO ] Checkout directory: /checkouts [component: analyze]
[ INFO ] pulling https://github.com/daveverwer/LeftPad.git in /checkouts/github.com-daveverwer-leftpad [component: analyze]
[ INFO ] listing tags for package https://github.com/daveverwer/LeftPad.git [component: analyze]
[ ERROR ] Channel error caught. [psql_connection_id: 89081989-EEDB-4DF8-847D-25F5556FB4D7, psql_error: uncleanShutdown]
[ ERROR ] Channel error caught. Closing connection. [psql_connection_id: 89081989-EEDB-4DF8-847D-25F5556FB4D7, psql_error: PSQLError(base: PostgresNIO.PSQLError.Base.connectionError(underlying: NIOSSL.NIOSSLError.uncleanShutdown))]

@finestructure

finestructure commented Apr 2, 2021

Seeing the same error now in the processing containers on p1 as well. Perhaps the reboot only made this work for the first connection, and we're now again not processing. So it might not be a p3-specific issue.

@finestructure

Also note: the unclean shutdown error might be a red herring and not indicative of processing failing.

In fact, running the analyze command and checking the timestamp in the db confirms that the record is updated from both p1 and p3.

However, I've also explicitly added p3's IP address to the db's allow list, which might be what made this work.

Either way, the error message clearly has nothing to do with the fact that reconciliation didn't work.
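
For reference, "checking the timestamp in the db" amounts to something like the following (connection string, table and column names are placeholders, not necessarily our actual schema):

# after running ./Run analyze from a node, the latest updated_at should move if the write went through
psql "$DATABASE_URL" -c "SELECT id, updated_at FROM packages ORDER BY updated_at DESC LIMIT 5;"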

@finestructure finestructure changed the title Processing job running on p3 can't connect to db Reconciliation job running on p3 failed to update db Apr 2, 2021
@finestructure

So to summarise:

  • the error message does not mean the container can't access the db
  • rebooting p3 might have fixed the issue in the first place - it's hard to tell now, as p3 seems to be in working order (although even that is hard to verify: running a job manually from app_server should be equivalent to running it from a dedicated container, but it's not exactly the same)
  • the IP allow-listing has no effect: allowing Azure resources is sufficient to update the db (as it should be)

I'm not sure yet how to test p3 and regain confidence for scheduling processing containers on it. What's worse, it might not be a p3 issue at all and could affect the other nodes as well. Or it might have been a one-time issue with app_reconcile on p3.
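
A rough sketch of how p3 could be exercised deliberately, assuming the swarm node names match the hostnames p1/p2/p3 and the service is named app_reconcile (both are assumptions): drain the other nodes so the scheduler has to place the task on p3, then restore them.

# from the swarm manager
docker node ls                               # node availability and status
docker service ps app_reconcile              # which node the reconcile task currently runs on
docker node update --availability drain p1   # temporarily take the other nodes out of scheduling
docker node update --availability drain p2
docker service update --force app_reconcile  # reschedule; the task should now land on p3
docker node update --availability active p1  # restore the other nodes afterwards
docker node update --availability active p2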

@finestructure

Almost all jobs are scheduled on p3 again after the latest deployment:

 docker-user@bastion  p3  ~  dps                                                                                                                                              1939ms
CONTAINER ID   IMAGE                                                        STATUS          PORTS     NAMES
56c8d238ff1e   registry.gitlab.com/finestructure/swiftpackageindex:2.23.0   Up 27 minutes             app_server.eb99bzy7x2otxqyolguy324g0.jf6v35h45j97wxnn2hljimqwu
4dcb40afd133   registry.gitlab.com/finestructure/swiftpackageindex:2.23.0   Up 27 minutes             app_ingest.1.8fmezrvg6r0x7jdvwr34h66ad
fb3c198973a8   registry.gitlab.com/finestructure/swiftpackageindex:2.23.0   Up 27 minutes             app_analyze.1.s7iqqacmtm8qvzcp5dk0dvxi5
1aaf0e8b88e8   registry.gitlab.com/finestructure/swiftpackageindex:2.23.0   Up 27 minutes             app_reconcile.1.zk9b6z1mmzc0v2tqz82rkihm3
2b589ee7252f   grafana/promtail:2.0.0                                       Up 5 days                 mon_promtail.eb99bzy7x2otxqyolguy324g0.s9ottyts2fkhnjdhuc8u7yjeu

So far everything looks to be in order but we haven't added any new packages yet. Will need to keep an eye on that.
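
Aside: instead of running dps (docker ps) on each node, the scheduling picture is also available in one go from the manager (service names as per the container names above):

docker service ps app_server app_ingest app_analyze app_reconcile \
  --filter desired-state=running \
  --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}'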

@finestructure

Working as expected. Will keep this open for one more deployment to check.

[screenshot]

@finestructure

finestructure commented Apr 9, 2021

Last night we saw increased rates of 500s, presumably caused by the app_server container on at least one of the nodes being unable to reach the db. I couldn't inspect the containers at the time and instead restarted the nodes one by one. Only after restarting the last one, p2 (I had started with p3, thinking it might be an issue with p3), did the errors go away.

Might have been coincidental, but it seems to me that there is an intermittent db connectivity issue.

[screenshot: CleanShot 2021-04-09 at 07 53 07]

If this persists, I think we need to consider moving back to a containerised db. This would fix several problems:

  • the NIOSSL.NIOSSLError.uncleanShutdown SSL errors
  • hopefully also this intermittent connectivity issue
  • being able to go back to Postgres 12, which fixes some warnings we were getting on db imports

Of course, the downside is availability, but we've not had any issues with that over months of use.
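
For context, a containerised db on the swarm would look roughly like this (service name, volume, and placement constraint are placeholders, not a decided setup):

docker service create --name spi_db \
  --constraint node.hostname==p1 \
  --mount type=volume,source=spi_db_data,target=/var/lib/postgresql/data \
  -e POSTGRES_DB=spi -e POSTGRES_USER=spi -e POSTGRES_PASSWORD=... \
  --publish 5432:5432 \
  postgres:12-alpine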

@finestructure finestructure changed the title Reconciliation job running on p3 failed to update db Fix NIOSSL.NIOSSLError.uncleanShutdown Apr 19, 2021
@finestructure

I've asked in Vapor's Discord if anyone has ideas how to fix this: https://discord.com/channels/431917998102675485/448584561845338139/833567195849424956

@finestructure

Azure support ticket opened: Case 2104190050000549 (TrackingID#2104190050000549).

@finestructure

Helpful gist to run a local Postgres with SSL in docker: https://gist.github.com/mrw34/c97bb03ea1054afb551886ffc8b63c3b

Although note that the -v volume mapping didn't work for me. I had to build an image instead:

FROM postgres:11.6-alpine
COPY server.crt /var/lib/postgresql/server.crt
COPY server.key /var/lib/postgresql/server.key
RUN chown postgres:postgres /var/lib/postgresql/server.crt
RUN chown postgres:postgres /var/lib/postgresql/server.key
# postgres refuses to start if the key file is group- or world-readable
RUN chmod 600 /var/lib/postgresql/server.key

docker run --name spi_dev -e POSTGRES_DB=spi_dev -e POSTGRES_USER=spi_dev -e POSTGRES_PASSWORD=xxx -p 6432:5432 \
        ssldb \
        -c ssl=on \
        -c ssl_cert_file=/var/lib/postgresql/server.crt \
        -c ssl_key_file=/var/lib/postgresql/server.key
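
For completeness: the image needs to be built under the name used in the run command, and the SSL connection can then be sanity-checked with psql (password matching the POSTGRES_PASSWORD above):

docker build -t ssldb .
PGPASSWORD=xxx psql "host=localhost port=6432 dbname=spi_dev user=spi_dev sslmode=require" \
  -c "SELECT ssl, version FROM pg_stat_ssl WHERE pid = pg_backend_pid();"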

@finestructure

Good news, looks like there'll be a fix incoming: vapor/postgres-nio#150 (comment)

@finestructure

Confirmed fixed on staging:

2021-04-22T07:22:04.031012013Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2021-04-22T07:22:05.586030459Z [ INFO ] Checkout directory: /checkouts [component: analyze]
2021-04-22T07:22:28.155562570Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2021-04-22T07:22:29.301597734Z [ INFO ] Checkout directory: /checkouts [component: analyze]
2021-04-22T07:22:51.742406706Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2021-04-22T07:22:52.843857835Z [ INFO ] Checkout directory: /checkouts [component: analyze]

@finestructure

finestructure commented Apr 22, 2021

And prod:

[screenshot: CleanShot 2021-04-22 at 10 59 03]

@finestructure

The spike when the first deployment (2.26.1) with the PostgresNIO fix went out (around 10:25) is from a bunch of SSL errors that stacked up as the services restarted. Notice how the follow-up deployment 2.26.2 a few minutes later doesn't cause a spike.
