
cluster cannot accept SSH connections #393

@YqI777

Description


Hi, when I use sscha to run the H3S example, I have a problem.
The sscha.Cluster module was designed for a specific workflow:

Standard workflow: the user runs the main Python script on a local workstation or on the cluster’s login node. The script submits and distributes computational tasks to remote compute nodes via ssh and scp, and manages the files.

My actual situation: I run the main Python script directly on a compute node (after submitting it with sbatch), and, for security reasons, this compute node itself has SSH service disabled.

This creates a fundamental conflict: I am executing a module that requires SSH for its operation inside an environment (my HPC compute node) that cannot accept SSH connections.
What should I do?
The output is:

```
(base) [login1 H3S]$ cat 2out.dat
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   cpu11
  Local device: mlx5_0
--------------------------------------------------------------------------
ssh: Could not resolve hostname none: Name or service not known
Error with cmd: ssh  None 'echo "/public/home/yan/qe/H3S"'

EXITSTATUS: 255; attempt = 1
THREAD 41879 EXECUTE COMMAND: ssh  None 'echo "/public/home/yan/qe/H3S"'
Traceback (most recent call last):
  File "/public/home/yan/qe/H3S/H3S_relax.py", line 126, in <module>
    my_hpc.setup_workdir()
  File "/public/home/yan/apps/anaconda3/envs/sscha/lib/python3.10/site-packages/sscha/Cluster.py", line 1400, in setup_workdir
    workdir = self.parse_string(self.workdir)
  File "/public/home/yan/apps/anaconda3/envs/sscha/lib/python3.10/site-packages/sscha/Cluster.py", line 1453, in parse_string
    status, output = self.ExecuteCMD(cmd, return_output = True, raise_error= True)
  File "/public/home/yan/apps/anaconda3/envs/sscha/lib/python3.10/site-packages/sscha/Cluster.py", line 402, in ExecuteCMD
    raise IOError("Error while communicating with the cluster. More than %d attempts failed." % (i+1))
OSError: Error while communicating with the cluster. More than 1 attempts failed.
```
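
The `ssh  None` in the traceback seems to come from `my_hpc.hostname` never being set (the `my_hpc.hostname = "login1"` line is commented out in my script below), so the module builds its ssh command with `None` as the host. For reference, the ssh-based setup from the sscha tutorials looks roughly like this; a minimal sketch only, where `user@login1` is a placeholder and not my real configuration:

```python
import sscha.Cluster

# Sketch of the ssh-based workflow the Cluster module expects (placeholder host):
my_hpc = sscha.Cluster.Cluster(mpi_cmd=r"srun -n 40")
my_hpc.hostname = "user@login1"   # every remote command becomes: ssh user@login1 '...'
my_hpc.workdir = "/public/home/yan/qe/H3S/run"
my_hpc.setup_workdir()            # the call that failed above: it runs ssh <hostname> 'echo "<workdir>"'
```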

My input.py:
```python
#my_hpc = sscha.Cluster.Cluster(mpi_cmd=r"srun -n 40",AlreadyInCluster=True)
my_hpc = sscha.Cluster.Cluster(mpi_cmd=r"srun -n 40")
#my_hpc.hostname = "login1"
my_hpc.workdir = "/public/home/yan/qe/H3S/run"
my_hpc.binary = "/public/home/apps/qe/qe-7.3.1/bin/pw.x -npool NPOOL -i PREFIX.pwi > PREFIX.pwo"
#Then we need to specify if some modules must be loaded in the submission script
my_hpc.load_modules = """##!/bin/bash
#SBATCH  --job-name=sscha
#SBATCH  --partition=cpu
#SBATCH  --nodes=1
#SBATCH  --ntasks-per-node=40
#SBATCH  --time=14-00:00:00

source /public/env/intel2021
source /public/env/openmpi-4.1.5_icc

"""
my_hpc.n_cpu = 40 # We will use 40 processors
my_hpc.n_nodes = 1 #In 1 node
my_hpc.n_pool = 4 # This is an espresso specific tool, the parallel CPU are divided in 4 pools

#We can also choose how many jobs we want to submit simultaneously, and how many configurations for each job
my_hpc.batch_size = 4
my_hpc.job_number = 8
#In this way we submit 4 jobs at a time, each one with 8 configurations (32 configurations at a time)

my_hpc.set_timeout(300) # We give 300 seconds of timeout
my_hpc.time = "00:20:00" # We can specify the time limit for each job,

my_hpc.setup_workdir()
```
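
The two commented-out lines in my script seem like the obvious things to try, sketched below; I have not verified either on this sscha version (in particular I do not know exactly what `AlreadyInCluster=True` changes, so Option B is an untested guess):

```python
import sscha.Cluster

# Option A: run this driver script from a machine that CAN ssh (e.g. the login node)
# and point the Cluster at a host that actually resolves:
my_hpc = sscha.Cluster.Cluster(mpi_cmd=r"srun -n 40")
my_hpc.hostname = "login1"   # placeholder; whatever 'ssh login1' resolves to on that machine

# Option B (untested assumption): keep the driver on the compute node and pass the
# flag that is commented out above, in case it bypasses the ssh/scp layer entirely:
# my_hpc = sscha.Cluster.Cluster(mpi_cmd=r"srun -n 40", AlreadyInCluster=True)
```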
