Profiling fix: parsec_init(NULL, NULL) #339
|
We put the pid in the name because, without it, multiple applications writing their profiles to the same directory (the most common usage, since the output directory must be on a local filesystem) will overwrite each other. |
|
OK. I will update the PR to remove the requirement that the identifier is
the same between the ranks, since I don't believe we want to re-introduce
MPI collective here.
|
|
Ah, no, wait, there is a confusion here:
- the filename does not include the PID, the filename is only based on
the basename parameter, which does not include the PID.
- what includes the PID if NULL is passed to dbp_start() is a string
within the header of the file itself (`head.hr_id` to be specific).
So, in the scenario you refer to (multiple applications running from the
same directory), the profiles would be overwritten anyway.
When we removed the mkstemp from the file creation logic, we decided that
it's better to lose that feature than to impose a dependency on MPI (or
another communication library) on the profiling system.
Without communication, I don't know how to provide a file name or hr_id
that matches between ranks and avoids overwriting existing files.
|
|
If So maybe the trick is to use the pid in the filename but remove it from the string stored into the profiling file, so instead of adding it to We can always rely on the fact that if the application was part of a parallel job, there is an RTE that will set some unique identifiers in the environment space. In the case of OMPI, we can use the PMIX_NAMESPACE environment variable to create some form of uniqueness, without relying on any communications. |
|
The beginning of `parsec_profiling_dbp_start` is
```
int parsec_profiling_dbp_start( const char *basefile, const char *hr_info )
{
[...]
rc = asprintf(&bpf_filename, "%s-%d.prof", basefile,
parsec_profiling_process_id);
[...]
file_backend_fd = open(bpf_filename, O_RDWR | O_CREAT | O_TRUNC, 00600);
```
`parsec_profiling_process_id` is a global integer that is recorded when
calling `parsec_profiling_init(int process_id);`
We don't use `hr_info` in the filename, hence the confusion.
I find the approach of changing the basename harder for the user: it's very
hard to write a script around a filename that is generated at runtime.
As a user, if I submit multiple runs, it looks easier to me to integrate a
unique identifier in my script when I pass the basename to the MCA
parameter.
But if that's not what we had decided, I will do as decided.
As you imply above, getting the unique ID in a multiple-node run will
depend on the MPI implementation or communication system...
I will update the PR to reflect a best effort approach:
- if a single rank, add the PID to the basename
- if more than one rank, check if PMIX_NAMESPACE exists, and add this to
the basename
- if more than one rank and no PMIX_NAMESPACE, use the basename as-is.
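The three cases above could be sketched as follows. This is only an illustration of the proposal, not code from the PR: the helper names `join_with_dash` and `best_effort_basename` are hypothetical, and the namespace is passed in by the caller (for example as `getenv("PMIX_NAMESPACE")`) so that the function itself needs no communication.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns a malloc'ed "<base>-<suffix>", or NULL on failure. */
static char *join_with_dash(const char *base, const char *suffix)
{
    int n = snprintf(NULL, 0, "%s-%s", base, suffix);
    if (n < 0) return NULL;
    char *out = malloc((size_t)n + 1);
    if (out != NULL) snprintf(out, (size_t)n + 1, "%s-%s", base, suffix);
    return out;
}

/* Hypothetical helper implementing the best-effort rule discussed above.
 * `pmix_ns` is expected to be getenv("PMIX_NAMESPACE") and may be NULL.
 * The caller frees the returned string. */
static char *best_effort_basename(const char *base, int nb_ranks, const char *pmix_ns)
{
    if (nb_ranks <= 1) {
        /* Single rank: the PID is enough to avoid collisions. */
        char pid[32];
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        return join_with_dash(base, pid);
    }
    if (pmix_ns != NULL && pmix_ns[0] != '\0') {
        /* Several ranks: the namespace is identical on every rank of the job. */
        return join_with_dash(base, pmix_ns);
    }
    /* Several ranks, no namespace: keep the basename unchanged. */
    size_t len = strlen(base) + 1;
    char *out = malloc(len);
    if (out != NULL) memcpy(out, base, len);
    return out;
}
```

With these assumptions, `best_effort_basename("myrun", 4, getenv("PMIX_NAMESPACE"))` yields the same string on every rank of a job, while a single-rank run gets a PID suffix.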
Alternatively, we could let the user write something like --mca
profile_filename 'myrun-${PMIX_NAMESPACE}' or --mca profile_filename
'myrun-${$}' or --mca profile_filename 'myrun-$$' and we could expand
${...}/$... into the corresponding ENV variable at runtime, as a shell
would do?
Would that be a useful compromise?
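A minimal sketch of what such a runtime expansion might look like, under stated assumptions: the function name `expand_profile_name` is hypothetical, only the `${NAME}`, `${$}` and `$$` forms are handled (not the bare `$NAME` form), unset variables expand to an empty string, and the output is truncated to the buffer size.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: copies `in` to `out`, replacing ${NAME} with the value
 * of the environment variable NAME, and $$ or ${$} with the current PID. */
static void expand_profile_name(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    char pid[32];
    char name[128];

    if (outsz == 0) return;
    snprintf(pid, sizeof(pid), "%d", (int)getpid());

    while (*in != '\0' && o + 1 < outsz) {
        const char *val = NULL;
        if (in[0] == '$' && in[1] == '$') {
            val = pid;
            in += 2;
        } else if (in[0] == '$' && in[1] == '{') {
            const char *end = strchr(in + 2, '}');
            if (end != NULL && (size_t)(end - (in + 2)) < sizeof(name)) {
                size_t n = (size_t)(end - (in + 2));
                memcpy(name, in + 2, n);
                name[n] = '\0';
                val = (strcmp(name, "$") == 0) ? pid : getenv(name);
                if (val == NULL) val = "";   /* unset variable: expand to nothing */
                in = end + 1;
            }
        }
        if (val == NULL) {
            /* Not a recognized pattern: copy the character literally. */
            out[o++] = *in++;
            continue;
        }
        while (*val != '\0' && o + 1 < outsz) out[o++] = *val++;
    }
    out[o] = '\0';
}
```

The single quotes in the command line above would keep the shell from expanding the pattern first, so that each process expands it with its own environment and PID.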
|
|
I did not say to change the profiling name provided by the user. Anyway, the basename you make reference in Sometimes it is simpler to just write the code than to make sure all details come across in a conversation. Here is what I was suggesting.
diff --git a/parsec/parsec.c b/parsec/parsec.c
index 56d1b072f..28d222f07 100644
--- a/parsec/parsec.c
+++ b/parsec/parsec.c
@@ -362,7 +362,7 @@ static void parsec_vp_init( parsec_vp_t *vp,
static int check_overlapping_binding(parsec_context_t *context);
-#define DEFAULT_APPNAME "app_name_%d"
+#define DEFAULT_APPNAME "app_name"
#define GET_INT_ARGV(CMD, ARGV, VALUE) \
do { \
@@ -442,10 +442,7 @@ parsec_context_t* parsec_init( int nb_cores, int* pargc, char** pargv[] )
fprintf(stderr, "%s: command line error (%d)\n", (*pargv)[0], ret);
}
} else {
- ret = asprintf( &parsec_app_name, DEFAULT_APPNAME, (int)getpid() );
- if (ret == -1) {
- parsec_app_name = strdup( "app_name" );
- }
+ parsec_app_name = strdup(DEFAULT_APPNAME);
}
ret = parsec_mca_cmd_line_process_args(cmd_line, &ctx_environ, &environ);
@@ -701,11 +698,17 @@ parsec_context_t* parsec_init( int nb_cores, int* pargc, char** pargv[] )
#if defined(PARSEC_PROF_TRACE)
if( (0 != strncasecmp(parsec_enable_profiling, "<none>", 6)) && (0 == parsec_profiling_init( profiling_id )) ) {
int i, l;
- char *cmdline_info = basename(parsec_app_name);
+ char *cmdline_info = NULL;
/* Use either the app name (argv[0]) or the user provided filename */
if( 0 == strncmp(parsec_enable_profiling, "<app>", 5) ) {
+ /* Specialize the profiling filename to avoid collision with other instances */
+ ret = asprintf( &cmdline_info, "%s_%d", basename(parsec_app_name), (int)getpid() );
+ if (ret < 0) {
+ cmdline_info = strdup(DEFAULT_APPNAME);
+ }
ret = parsec_profiling_dbp_start( cmdline_info, parsec_app_name );
+ free(cmdline_info);
} else {
ret = parsec_profiling_dbp_start( parsec_enable_profiling, parsec_app_name );
} |
|
Please update based on the discussion. |
When calling `parsec_init(NULL, NULL)` with profiling enabled and activated,
we would use `app_name_<PID>` as info to create the profile file. However,
when reading a distributed profile, it is assumed that the same info is
passed to all ranks. The manual hinted at this by saying that the info should
"uniquely" identify the experiment, but that was not clear enough.
As a result, parallel profile files generated when initializing parsec
with NULL would fail to load.
This patch proposes to pass the same default string (by removing the PID)
to solve this issue.
Alternative approaches could be
- to add a collective operation here to decide on a unique common
name
- to remove the check in dbp_reader.c that this info needs to match
Make the documentation more clear about parsec_profiling_dbp_start
The documentation of this function was stale: it was not updated when
we changed the file naming scheme (removed the mkstemp call and the
XXXXXX in the filename), nor when we changed how the rank of the process
is passed to the profiling system. Make it clearer that the string
passed is an identifier, and must match between calling processes.
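To illustrate why the identifier must match between processes, here is a minimal sketch of the kind of consistency check a reader could perform. It is not the actual logic of dbp_reader.c: the function name `identifiers_match` and the idea of collecting one identifier string per file are assumptions made for the example.

```c
#include <string.h>

/* Hypothetical check: returns 1 when every per-rank identifier equals the
 * first one, 0 as soon as one differs. */
static int identifiers_match(const char *const ids[], int nb_files)
{
    for (int i = 1; i < nb_files; i++) {
        if (strcmp(ids[0], ids[i]) != 0) {
            return 0;   /* e.g. "app_name_1234" vs "app_name_1235" */
        }
    }
    return 1;
}
```

Under this sketch, identifiers built from the PID differ between ranks and are rejected, whereas a shared default string such as `app_name` passes.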
Implement suggested approach to support the '<app>' profiling filename in single-rank at least
f17512f to
e743292
|
Implemented suggested changes, and rebased/squashed everything into a single commit |
Bug reported by @yu-pei
When calling `parsec_init(NULL, NULL)` with profiling enabled and activated,
we use `app_name_<PID>` as info to create the profile file. However,
when reading a distributed profile, it is assumed that the same info is
passed to all ranks. The manual hinted at this by saying that the info should
"uniquely" identify the experiment, but that was not clear enough.
As a result, parallel profile files generated when initializing parsec
with NULL would fail to load.
This patch proposes to pass the same default string (by removing the PID)
to solve this issue.
Alternative approaches could be: