dind-port-forward.sh -> invalid resource name ? #149

Open
stock99 opened this issue Oct 31, 2018 · 5 comments

stock99 commented Oct 31, 2018

When I execute the script, I get an error similar to the one below:
root@ffdl2018:~/FfDL/bin# kubectl port-forward pod/$ui_pod $ui_port:8080
error: invalid resource name "pod/": [may not contain '/']

So I tried removing the pod/ prefix, thinking that the pod/ form might belong to a newer version of kubeadm-dind, but then I got a different error, shown below. Can someone help me with it?

Forwarding from 127.0.0.1:31300 -> 8080
Handling connection for 30029
E1031 14:22:28.129745 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:28 socat[11424] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:30.160553 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:30 socat[11441] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:32.191360 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:32 socat[11492] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:34.225286 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:34 socat[11493] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
creating data source...
Handling connection for 30029
set up dashboards
Handling connection for 30029
Finished

stock99 changed the title from "dind-port-forward.sh path name for pods not correct?" to "dind-port-forward.sh -> invalid resource name ?" on Oct 31, 2018

Tomcli (Contributor) commented Oct 31, 2018

Hi @stock99, it looks like the script didn't find the right pod name in your Kubernetes cluster. Can you echo your pod names with the commands below? Thanks.

ui_pod=$(kubectl get pods | grep ffdl-ui | awk '{print $1}')
restapi_pod=$(kubectl get pods | grep ffdl-restapi | awk '{print $1}')
grafana_pod=$(kubectl get pods | grep prometheus | awk '{print $1}')

echo $ui_pod
echo $restapi_pod
echo $grafana_pod
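
The lookup above yields an empty string when no pod matches, which is what produces a bare "pod/" argument. As a minimal sketch (the helper name find_pod is hypothetical and not part of the FfDL scripts), the lookup could fail loudly instead:

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# first pod whose NAME column starts with the prefix given as $1.
# Prints nothing and returns non-zero when no pod matches.
find_pod() {
  awk -v p="$1" '
    index($1, p) == 1 { print $1; found = 1; exit }
    END { exit !found }
  '
}

# Usage against a live cluster (requires kubectl):
#   ui_pod=$(kubectl get pods | find_pod ffdl-ui) || {
#     echo "no ffdl-ui pod found" >&2
#     exit 1
#   }
```

Matching on the start of the first column, rather than grepping the whole line, also avoids picking up a pod that merely mentions the name elsewhere in its row.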

Also, the pod/ format was introduced in kubectl client v1.10.0, so I would recommend updating your kubectl client to v1.10.0 or later.
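
For a script that has to work with both older and newer clients, one option is to pick the argument form from the client version. The sketch below is only an illustration: the function names are hypothetical, it assumes a major version of 1, and it takes the v1.10.0 cutoff from the comment above rather than from the kubectl release notes.

```shell
# Print the minor version from a GitVersion string such as "v1.8.15".
client_minor() {
  echo "$1" | sed -n 's/^v\{0,1\}[0-9][0-9]*\.\([0-9][0-9]*\).*/\1/p'
}

# $1 = client GitVersion, $2 = pod name.
# Prints "pod/NAME" for clients at 1.10 or later, plain "NAME" otherwise.
# Refuses an empty pod name instead of emitting a bare "pod/".
forward_target() {
  if [ -z "$2" ]; then
    echo "empty pod name" >&2
    return 1
  fi
  minor=$(client_minor "$1")
  if [ -n "$minor" ] && [ "$minor" -ge 10 ]; then
    echo "pod/$2"
  else
    echo "$2"
  fi
}

# Usage (requires kubectl; the version string is passed in by the caller):
#   kubectl port-forward "$(forward_target v1.8.15 "$ui_pod")" "$ui_port:8080"
```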

stock99 (Author) commented Nov 1, 2018

Hi Tomcli,
It looks like the kubectl that comes with the kubeadm-dind installation script isn't the latest one (it is 1.8.x). Even though I installed the latest version via snap, the installation script still seems to enforce the use of 1.8.15. Should I adjust an environment variable?

echo $ui_pod
ffdl-ui-b6cbb98f-c4zpm
echo $restapi_pod
ffdl-restapi-84bcb74478-t8df6
echo $grafana_pod
prometheus-5f85fd7695-gb568

kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.15", GitCommit:"c2bd642c70b3629223ea3b7db566a267a1e2d0df", GitTreeState:"clean", BuildDate:"2018-07-11T17:59:56Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.15", GitCommit:"c2bd642c70b3629223ea3b7db566a267a1e2d0df", GitTreeState:"clean", BuildDate:"2018-07-11T17:52:15Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

snap list
Name     Version    Rev   Tracking  Publisher     Notes
aws-cli  1.15.71    135   stable    aws✓          classic
core     16-2.35.5  5742  stable    canonical✓    core
helm     2.11.0     63    stable    snapcrafters  classic
kubectl  1.12.1     462   stable    canonical✓    classic
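
The output above shows a 1.12.1 kubectl installed via snap while `kubectl version` reports a 1.8.15 client. The thread does not confirm why, but one possibility is that another kubectl binary appears earlier on PATH. A minimal sketch for checking that (which_all is a hypothetical helper name):

```shell
# Print every regular, executable file named $1 found in the colon-separated
# directory list $2 (defaults to $PATH), in lookup order.
# The first line printed is the binary the shell would actually run.
# The body runs in a subshell so the IFS and globbing changes stay local.
which_all() (
  cmd="$1"
  IFS=:
  set -f
  for dir in ${2:-$PATH}; do
    if [ -f "$dir/$cmd" ] && [ -x "$dir/$cmd" ]; then
      echo "$dir/$cmd"
    fi
  done
)

# Usage:
#   which_all kubectl
```

If more than one path is printed, running each one with `version` would show which binary is the 1.8.15 client.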

Tomcli (Contributor) commented Nov 1, 2018

Hi @stock99, I updated the script in #150 so that it can run with K8S 1.8.x. Let me know if you encounter any new issues.

stock99 (Author) commented Nov 6, 2018

It seems to be OK now after removing 'pod/' from the script. The connection error in the opening post happened because I fat-fingered one of the export statements during the dind installation.

But then I got an error message from the test routine make test-push-data-s3 && make test-job-submit:
Getting all models ...
Handling connection for 32060
ID Name Framework Training status Submitted Completed

0 records found.
Makefile:213: recipe for target 'test-job-submit' failed
make: *** [test-job-submit] Error 1

======
Attached is the console log:
error_log.txt

chengboonrong commented Apr 2, 2019

Can anyone help? I got these error messages when running make test-job-submit:

Downloading Docker images and test training data. This may take a while.
Context "dind" modified.
error: there is no need to specify a resource type as a separate argument when passing arguments in resource/name form (e.g. 'kubectl get resource/<resource_name>' instead of 'kubectl get resource resource/<resource_name>'
Submitting example training job (tf-model)
S3 URL: http://:30381 REST URL: http://localhost:31961
Executing in etc/examples/tf-model: DLAAS_URL=http://localhost:31961 DLAAS_USERNAME=test-user DLAAS_PASSWORD=test /home/chris/FfDL/cli/bin/ffdl-linux train manifest.yml .
sed: can't read : No such file or directory
name: tf_convolutional_network_tutorial
description: Convolutional network model using tensorflow
version: "1.0"
gpus: 0
cpus: 0.5
memory: 1Gb
learners: 1

# Object stores that allow the system to retrieve training data.
data_stores:
  - id: sl-internal-os
    type: mount_cos
    training_data:
      container: tf_training_data
    training_results:
      container: tf_trained_model
    connection:
      auth_url: http://10.192.0.3:30417
      user_name: test
      password: test

framework:
  name: tensorflow
  version: "1.5.0-py3"
  command: >
    python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
      --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
      --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
      --trainingIters 2000
  # Change trainingIters to 20000 if you want your model to have over 80% Accuracy rate.

evaluation_metrics:
  type: tensorboard
  in: "$JOB_STATE_DIR/logs/tb"
  # (Eventual) Available event types: 'images', 'distributions', 'histograms', 'images'
  # 'audio', 'scalars', 'tensors', 'graph', 'meta_graph', 'run_metadata'
  #  event_types: [scalars]
/home/chris/FfDL/etc/examples/tf-model
Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31961
Handling connection for 31961
FAILED
Error 200: OK

Test job submitted. Track the status via "DLAAS_URL=http://localhost:31961 DLAAS_USERNAME=test-user DLAAS_PASSWORD=test /home/chris/FfDL/cli/bin/ffdl-linux list".
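
In the log above, the S3 URL is printed as http://:30381 with no host, and sed is invoked with an empty file name. Both are consistent with a shell variable expanding to an empty string, although the thread does not confirm the cause. As a hedged sketch, a check like the following (url_has_host is a hypothetical name, and bracketed IPv6 hosts are not handled) could stop a script before it submits a job with such a URL:

```shell
# Succeed only if the URL has a non-empty host between "://" and the
# following ":" or "/". A URL without a scheme is also rejected.
url_has_host() {
  host=$(echo "$1" | sed -n 's|^[a-zA-Z][a-zA-Z0-9+.-]*://\([^:/]*\).*|\1|p')
  [ -n "$host" ]
}

# Usage:
#   url_has_host "$s3_url" || { echo "S3 URL has no host: $s3_url" >&2; exit 1; }
```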
