Profiling fix: parsec_init(NULL, NULL) by therault · Pull Request #339 · ICLDisco/parsec

therault · 2022-03-09T18:15:29Z

Bug reported by @yu-pei

When calling parsec_init(NULL, NULL) with profiling enabled and activated,
we use app_name_<PID> as info to create the profile file. However,
when reading a distributed profile, it is assumed that the same info is
passed to all ranks. The manual hinted this through "uniquely" identify
the experiment, but that was not clear enough.

As a result, parallel profile files generated when initializing parsec
with NULL would fail to load.

This patch proposes to pass the same default string (by removing the PID)
to solve this issue.

Alternative approaches could be:

to add a collective operation here to decide on a unique common name
to remove the check in dbp_reader.c that this info needs to match

bosilca · 2022-03-09T20:09:19Z

We put the pid in the name because without the pid, multiple applications writing their profiling in the same directory (which is the most current usage as the output directory must be on a local filesystem) will overwrite each other.

therault · 2022-03-09T20:12:51Z

OK. I will update the PR to remove the requirement that the identifier is the same between the ranks, since I don't believe we want to re-introduce MPI collective here.

therault · 2022-03-09T20:23:14Z

Ah, no, wait, there is a confusion here: - the filename does not include the PID, the filename is only based on the basename parameter, which does not include the PID. - what includes the PID if NULL is passed to dbp_start() is a string within the header of the file itself (`head.hr_id` to be specific). So, in the scenario you refer to (multiple applications running from the same directory), the profiles would be overwritten anyway. When we removed the mkstemp from the file creation logic, we decided that it's better to lose that feature than impose a dependency to MPI (or another communication library) to the profiling system. Without communication, I don't know how to provide a file name or hr_id that matches between rank, and avoid overwriting existing files.

bosilca · 2022-03-09T21:00:16Z

If parsec_init(NULL, NULL), then in parsec.c line 445 we add the process pid to the parsec_app_name. In the profiling code, we use this parsec_app_name via the basename function to set the cmdline_info (which at this point will then contain the pid). This cmdline_info is then used as a base for the name of the backend file in the parsec_profiling_dbp_start call when requested (or by default in fact). I might misread the code, but my understanding is that the pid is part of the profiling filename.

So maybe the trick is to use the pid in the filename but remove it from the string stored into the profiling file, so instead of adding it to parsec_app_name early in the process, we add it just to a temporary variable that we pass to the parsec_profiling_dbp_start call.

We can always rely on the fact that if the application was part of a parallel job, there is an RTE that will set some unique identifiers in the environment space. In the case of OMPI, we can use the PMIX_NAMESPACE environment variable to create some form of uniqueness, without relying on any communications.

therault · 2022-03-09T21:16:02Z

The beginning of `parsec_profiling_dbp_start` is ``` int parsec_profiling_dbp_start( const char *basefile, const char *hr_info ) { [...] rc = asprintf(&bpf_filename, "%s-%d.prof", basefile, parsec_profiling_process_id); [...] file_backend_fd = open(bpf_filename, O_RDWR | O_CREAT | O_TRUNC, 00600); ``` parsec_profiling_process_id is a global integer that was remembered when calling `parsec_profiling_init(int process_id);` We don't use `hr_info` in the filename, hence the confusion. I find the approach of changing the basename harder to the user: it's very hard to write a script where the filename is generated. As a user, if I submit multiple runs, it looks easier to me to integrate a unique identifier in my script when I pass the basename to the MCA parameter. But if that's not what we had decided, I will do as decided. As you imply above, getting the unique ID in a multiple-node run will be MPI implementation or Communication system - dependent... I will update the PR to reflect a best effort approach: - if a single rank, add the PID to the basename - if more than one rank, check if PMIX_NAMESPACE exists, and add this to the basename - if more than one rank and no PMIX_NAMESPACE, use the basename as-is. Alternatively, we could let the user write something like --mca profile_filename 'myrun-${PMIX_NAMESPACE}' or --mca profile_filename 'myrun-${$}' or --mca profile_filename 'myrun-$$' and we could expand ${...}/$... into the corresponding ENV variable at runtime, as a shell would do? Would that be a useful compromise?

bosilca · 2022-03-10T04:22:22Z

I did not say to change the profiling name provided by the user. Anyway, the basename you make reference in parsec_profiling_dbp_start to already contains the pid, in the case where default values are used and the user did not manually specify a filename for the profile. Just run with the default parameters and with parsec_init(NULL, NULL), and look at the generated profile files (they are 'app_name_47657-0.prof', so clearly the pid is in the name).

Sometimes is simpler to just write the code instead of making sure all details pass in a conversation. Here is what I was suggesting.

diff --git a/parsec/parsec.c b/parsec/parsec.c
index 56d1b072f..28d222f07 100644
--- a/parsec/parsec.c
+++ b/parsec/parsec.c
@@ -362,7 +362,7 @@ static void parsec_vp_init( parsec_vp_t *vp,
 
 static int check_overlapping_binding(parsec_context_t *context);
 
-#define DEFAULT_APPNAME "app_name_%d"
+#define DEFAULT_APPNAME "app_name"
 
 #define GET_INT_ARGV(CMD, ARGV, VALUE) \
 do { \
@@ -442,10 +442,7 @@ parsec_context_t* parsec_init( int nb_cores, int* pargc, char** pargv[] )
             fprintf(stderr, "%s: command line error (%d)\n", (*pargv)[0], ret);
         }
     } else {
-        ret = asprintf( &parsec_app_name, DEFAULT_APPNAME, (int)getpid() );
-        if (ret == -1) {
-            parsec_app_name = strdup( "app_name" );
-        }
+        parsec_app_name = strdup(DEFAULT_APPNAME);
     }
 
     ret = parsec_mca_cmd_line_process_args(cmd_line, &ctx_environ, &environ);
@@ -701,11 +698,17 @@ parsec_context_t* parsec_init( int nb_cores, int* pargc, char** pargv[] )
 #if defined(PARSEC_PROF_TRACE)
     if( (0 != strncasecmp(parsec_enable_profiling, "<none>", 6)) && (0 == parsec_profiling_init( profiling_id )) ) {
         int i, l;
-        char *cmdline_info = basename(parsec_app_name);
+        char *cmdline_info = NULL;
 
         /* Use either the app name (argv[0]) or the user provided filename */
         if( 0 == strncmp(parsec_enable_profiling, "<app>", 5) ) {
+            /* Specialize the profiling filename to avoid collision with other instances */
+            ret = asprintf( &cmdline_info, "%s_%d", basename(parsec_app_name), (int)getpid() );
+            if (ret < 0) {
+                cmdline_info = strdup(DEFAULT_APPNAME);
+            }
             ret = parsec_profiling_dbp_start( cmdline_info, parsec_app_name );
+            free(cmdline_info);
         } else {
             ret = parsec_profiling_dbp_start( parsec_enable_profiling, parsec_app_name );
         }

bosilca · 2022-03-22T22:02:17Z

Please update based on the discussion.

When calling `parsec_init(NULL, NULL)` with profiling enabled and activated, we would use `app_name_<PID>` as info to create the profile file. However, when reading a distributed profile, it is assumed that the same info is passed to all ranks. The manual hinted this through "uniquely" identify the experiment, but that was not clear enough. As a result, parallel profile files generated when initializing parsec with NULL would fail to load. This patch proposes to pass the same default string (by removing the PID) to solve this issue. Alternative approaches could be - to add a collective operation here to decide on a unique common name - to remove the check in dbp_reader.c that this info needs to match Make the documentation more clear about parsec_profiling_dbp_start The documentation of this function was stall: it was not updated when we changed the file naming scheme (removed the mkstemp calll and the XXXXXX in the filename), and we changed how the rank of the process is passed to the profiling system. Make it more clear that the string passed is an identifier, and must match between calling processes. Implement suggested approach to support the '<app>' profiling filename in single-rank at least

therault · 2022-03-30T17:54:32Z

Implemented suggested changes, and rebased/squashed everything into a single commit

therault requested a review from bosilca as a code owner March 9, 2022 18:15

therault marked this pull request as draft March 10, 2022 17:34

therault force-pushed the profiling-parsec_init-NULL branch from f17512f to e743292 Compare March 30, 2022 17:53

bosilca approved these changes Mar 30, 2022

View reviewed changes

therault marked this pull request as ready for review March 30, 2022 20:17

bosilca merged commit 13bc4e0 into ICLDisco:master Mar 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Profiling fix: parsec_init(NULL, NULL)#339

Profiling fix: parsec_init(NULL, NULL)#339
bosilca merged 1 commit intoICLDisco:masterfrom
therault:profiling-parsec_init-NULL

therault commented Mar 9, 2022

Uh oh!

bosilca commented Mar 9, 2022

Uh oh!

therault commented Mar 9, 2022 via email •

edited by bosilca

Loading

Uh oh!

therault commented Mar 9, 2022 via email •

edited by bosilca

Loading

Uh oh!

bosilca commented Mar 9, 2022

Uh oh!

therault commented Mar 9, 2022 via email •

edited by bosilca

Loading

Uh oh!

bosilca commented Mar 10, 2022

Uh oh!

bosilca commented Mar 22, 2022

Uh oh!

therault commented Mar 30, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

therault commented Mar 9, 2022

Uh oh!

bosilca commented Mar 9, 2022

Uh oh!

therault commented Mar 9, 2022 via email • edited by bosilca Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

therault commented Mar 9, 2022 via email • edited by bosilca Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bosilca commented Mar 9, 2022

Uh oh!

therault commented Mar 9, 2022 via email • edited by bosilca Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bosilca commented Mar 10, 2022

Uh oh!

bosilca commented Mar 22, 2022

Uh oh!

therault commented Mar 30, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

therault commented Mar 9, 2022 via email •

edited by bosilca

Loading

therault commented Mar 9, 2022 via email •

edited by bosilca

Loading

therault commented Mar 9, 2022 via email •

edited by bosilca

Loading