
Services queue no longer working #1252

Closed
katrinarobinson2000 opened this issue Apr 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@katrinarobinson2000

Describe the bug

Whenever I try to run a task on the Services queue, it fails with the error "/usr/bin/python3.8: No module named virtualenv". I have tried adding different workers to the queue, but I get this error regardless of the worker. When I try those same workers with other queues, they work, which indicates the problem is specific to the Services queue. I have also tried the default docker image as well as different docker images that work on other queues.

To reproduce

  1. Add a worker to the Services queue
  2. Run a task on the Services queue (see the sketch below)
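
For reference, a minimal sketch of step 2, assuming the standard ClearML SDK Task.init / execute_remotely calls (the project and task names below are placeholders, not the real ones):

from clearml import Task

# Create a task and hand it off to the agent listening on the Services queue;
# on this setup that is where the virtualenv error appears.
task = Task.init(project_name="debug", task_name="services-queue-check")
task.execute_remotely(queue_name="services", exit_process=True)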

Expected behaviour

The task should have run successfully, like it does with other queues.

Environment

  • Server type: self-hosted
  • ClearML SDK Version
  • ClearML Server Version: 1.12.1
  • Python Version: 3.8
  • OS: Linux

Related Discussion

I could not find a similar thread.

katrinarobinson2000 added the bug label Apr 24, 2024
@jkhenning
Member

Hi @katrinarobinson2000, can you include the full task log? What is the docker image you're trying to run the task with?

@katrinarobinson2000
Author

I have tried multiple docker images as well as the default nvidia/cuda:11.8.0-base-ubuntu20.04 image. These images work on other queues, so I don't think the image is the problem. Full task log:

1714082982808 training-02:cpu:8 INFO task ef68556dfb1447b296f8df7162010ae9 pulled from a5f9687681084ae59b27ffd3f4b77d77 by worker training-02:cpu:8

1714082987913 training-02:cpu:8 DEBUG Running task 'ef68556dfb1447b296f8df7162010ae9'

1714082988799 training-02:cpu:8:service:ef68556dfb1447b296f8df7162010ae9 DEBUG Process failed, exit code 1
1714082988840 training-02:cpu:8:service:ef68556dfb1447b296f8df7162010ae9 DEBUG Current configuration (clearml_agent v1.5.2, location: /tmp/.clearml_agent.4wvyqfny.cfg):
----------------------
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server = http://192.168.128.212:8008
api.web_server = http://192.168.128.212:8080
api.files_server = http://192.168.128.212:8081
api.credentials.access_key = R1GO2GQ2R95KLTM5OXH3

agent.worker_id = training-02:cpu:8:service:ef68556dfb1447b296f8df7162010ae9
agent.worker_name = training-02
agent.force_git_ssh_protocol = true
agent.python_binary = 
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >\= '3.10'
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.conda_channels.3 = nvidia
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.package_manager.force_repo_requirements_txt = true
agent.package_manager.priority_packages.0 = opencv-python-headless
agent.venvs_dir = /opt/clearml/venvs-builds.8.9
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = /opt/clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /opt/clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /opt/clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /opt/clearml/pip-cache
agent.docker_apt_cache = /opt/clearml/apt-cache.8.9
agent.docker_force_pull = true
agent.default_docker.image = nvidia/cuda:11.8.0-base-ubuntu20.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /opt/clearml/sdk-cache
agent.docker_internal_mounts.apt_cache = /opt/clearml/apt-cache
agent.docker_internal_mounts.ssh_folder = /root/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /root/.ssh
agent.docker_internal_mounts.pip_cache = /opt/clearml/cache/pip-cache
agent.docker_internal_mounts.poetry_cache = /opt/cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /opt/clearml/vcs-cache
agent.docker_internal_mounts.venv_build = /opt/clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /opt/clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script = 
agent.disable_task_docker_override = false
agent.git_user = 
agent.docker_use_activated_venv = true
agent.disable_ssh_mount = false
agent.docker_install_opencv_libs = false
agent.default_python = 3.8
agent.cuda_version = 0
agent.cudnn_version = 0
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = true
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key = 
sdk.aws.s3.region = 
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true

sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false

Executing task id [ef68556dfb1447b296f8df7162010ae9]:
repository = git@gitlab.********
branch = 
version_num = eb2fa20d506be3e70092455705ec0e8e1350e816
tag = 
docker_cmd = 
entry_point = trigger_export_detection.py
working_dir = utils


[package_manager.force_repo_requirements_txt=true] Skipping requirements, using repository "requirements.txt" 

/usr/bin/python3.8: No module named virtualenv

clearml_agent: ERROR: Command '['python3.8', '-m', 'virtualenv', '/opt/clearml/venvs-builds.8.9/3.8', '--system-site-packages']' returned non-zero exit status 1.
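
For what it's worth, a quick check that should confirm the module is missing, assuming it is run with the same /usr/bin/python3.8 interpreter the agent invokes:

import importlib.util

# find_spec returns None when the module cannot be imported;
# on the failing worker this is expected to print False.
print("virtualenv available:", importlib.util.find_spec("virtualenv") is not None)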

@jkhenning
Member

The log says the worker running the task is training-02:cpu:8, not the services worker?

@katrinarobinson2000
Author

training-02:cpu:8 is a worker I assigned to the Services queue. When I start running the task, the top of the console says Hostname: training-02:cpu:8:4:service:aea1e678da314bac972abc1f4294de68, but once the task fails it changes to Hostname: training-02:cpu:8.
