
Fail on multiple replicas for 1.09 #89

Closed
ssdowd opened this issue May 3, 2017 · 5 comments

@ssdowd
ssdowd commented May 3, 2017

Using the example from Kubernetes.md, I've created a deployment with 2 replicas of the cloudsql-proxy. With versions 1.05 and 1.08 it works. With version 1.09 it fails whenever replicas is greater than 1, with this message:

Error from server (BadRequest): container "cloudsqlproxy" in pod "cloudsqlproxy-3735439449-p58dv" is waiting to start: trying and failing to pull image

The Kubernetes dashboard shows this error:

Failed to pull image "b.gcr.io/cloudsql-docker/gce-proxy:1.09": failed to register layer: rename /var/lib/docker/image/overlay/layerdb/tmp/layer-289391082 /var/lib/docker/image/overlay/layerdb/sha256/305e4867f6737b619a7ab334876d503b12fa391a3d28478752575b49d6857e69: directory not empty
Error syncing pod, skipping: failed to "StartContainer" for "cloudsqlproxy" with ErrImagePull: "failed to register layer: rename /var/lib/docker/image/overlay/layerdb/tmp/layer-289391082 /var/lib/docker/image/overlay/layerdb/sha256/305e4867f6737b619a7ab334876d503b12fa391a3d28478752575b49d6857e69: directory not empty"

Either setting replicas to 1 or using version 1.08 makes it work.

Example config:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cloudsqlproxy
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: cloudsqlproxy
    spec:
      containers:
      - image: b.gcr.io/cloudsql-docker/gce-proxy:1.09
        name: cloudsqlproxy
        command:
        - /cloud_sql_proxy
        - -dir=/cloudsql
        - -instances=project:us-central1:db-test=tcp:3306
        - -credential_file=/credentials/credentials.json
        ports:
        - name: sqlpxy-prt-wp
          containerPort: 3306
        volumeMounts:
        - mountPath: /cloudsql
          name: cloudsql
        - mountPath: /credentials
          name: service-account-token
          readOnly: true
        - mountPath: /etc/ssl/certs
          name: ssl-certs
          readOnly: true
      volumes:
      - name: cloudsql
        emptyDir:
      - name: service-account-token
        secret:
          secretName: cloudsql-instance-credentials
      - name: ssl-certs
        hostPath:
          path: /etc/ssl/certs
@Carrotman42
Contributor

Carrotman42 commented May 3, 2017 via email

@ssdowd
Author

ssdowd commented May 3, 2017

Tried it both ways. Why would it work with replicas: 1 and not with replicas: 2 if the URL was incorrect?

@Carrotman42
Contributor

Carrotman42 commented May 5, 2017

I didn't have any good ideas about why it isn't working at all, so it was mostly a guess: I know that the b.gcr.io-prefixed URIs will be going away at some point, so I wanted to make sure they had nothing to do with this problem.

Your effort to narrow it down to a difference between 1.08 and 1.09 is much appreciated, especially because that leaves only a very small number of commits it could possibly be.

My guess now is that it's somehow related to setting the base container to Alpine 3.5 (added in aaacabb). @apelisse do you have any thoughts on how this could affect @ssdowd's setup? I don't have a lot of experience with Kubernetes but the pasted config looks sane to me.

Looking at the error message, I'm not sure what Docker image the sha "305e4867f6737b619a7ab334876d503b12fa391a3d28478752575b49d6857e69" is associated with, though. I pulled all of the proxy images from 1.07 through 1.09, from both b.gcr.io and gcr.io, and none of the images on my machine have a layer with that hash (although I could be missing something; someone with more Docker/Kubernetes experience may want to double-check my work).
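
For reference, that check can be reproduced roughly as follows (the image tags are the ones discussed in this thread; note that, as far as I understand Docker's layer store, the directory names under layerdb/sha256 are chain IDs rather than the diff IDs reported by docker inspect, so a missing match there doesn't prove the layer belongs to a different image):

# Pull the images mentioned in this thread
docker pull gcr.io/cloudsql-docker/gce-proxy:1.09
docker pull b.gcr.io/cloudsql-docker/gce-proxy:1.09

# List each image's layer diff IDs and look for the hash from the error message
docker inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' \
  gcr.io/cloudsql-docker/gce-proxy:1.09 | grep 305e4867 || echo "no matching diff ID"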

@apelisse
Contributor

apelisse commented May 5, 2017

Seems related to moby/moby#23184.

Here's my guess at what's going on:

  • One of the Kubernetes nodes has an incomplete/corrupted left-over of the 1.09 image
  • Increasing the number of replicas means that the new pod is scheduled on that node

Suggestion: Make sure it's always failing on the same node(s), connect to the failing node(s), manually try pulling the image (confirm it fails), delete the directory that shouldn't be there, and try again.
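
Concretely, that workaround looks something like the sketch below (the image URL, label, and layer hash are taken from this thread; deleting layerdb entries by hand is a blunt fix, so double-check the path before removing anything):

# Find which node(s) the failing replica lands on
kubectl get pods -l app=cloudsqlproxy -o wide

# On the failing node, reproduce the failed pull directly
docker pull b.gcr.io/cloudsql-docker/gce-proxy:1.09

# Remove the leftover layer metadata from the interrupted pull, then retry
sudo rm -rf /var/lib/docker/image/overlay/layerdb/sha256/305e4867f6737b619a7ab334876d503b12fa391a3d28478752575b49d6857e69
docker pull b.gcr.io/cloudsql-docker/gce-proxy:1.09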

@ssdowd
Copy link
Author

ssdowd commented May 5, 2017

@apelisse That was it - had to remove 2 directories from /var/lib/docker/image/overlay/layerdb/sha256/ to get a pull to work on 1 node. 1.09 is now working for me. Thanks.
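
A quick way to confirm the cleanup took effect (generic kubectl usage, not specific to this setup):

# Both replicas should reach Running; the NODE column shows where each one landed
kubectl rollout status deployment/cloudsqlproxy
kubectl get pods -l app=cloudsqlproxy -o wide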

@ssdowd ssdowd closed this as completed May 5, 2017
yosatak pushed a commit to yosatak/cloud-sql-proxy that referenced this issue Feb 26, 2023
Manual auth of cloud_sql_proxy to work around issues getting metadata by IP address