Skip to content

Profiling fix: parsec_init(NULL, NULL)#339

Merged
bosilca merged 1 commit intoICLDisco:masterfrom
therault:profiling-parsec_init-NULL
Mar 30, 2022
Merged

Profiling fix: parsec_init(NULL, NULL)#339
bosilca merged 1 commit intoICLDisco:masterfrom
therault:profiling-parsec_init-NULL

Conversation

@therault
Copy link
Copy Markdown
Contributor

@therault therault commented Mar 9, 2022

Bug reported by @yu-pei

When calling parsec_init(NULL, NULL) with profiling enabled and activated,
we use app_name_<PID> as info to create the profile file. However,
when reading a distributed profile, it is assumed that the same info is
passed to all ranks. The manual hinted this through "uniquely" identify
the experiment, but that was not clear enough.

As a result, parallel profile files generated when initializing parsec
with NULL would fail to load.

This patch proposes to pass the same default string (by removing the PID)
to solve this issue.

Alternative approaches could be:

  • to add a collective operation here to decide on a unique common name
  • to remove the check in dbp_reader.c that this info needs to match

@therault therault requested a review from bosilca as a code owner March 9, 2022 18:15
@bosilca
Copy link
Copy Markdown
Contributor

bosilca commented Mar 9, 2022

We put the pid in the name because without the pid, multiple applications writing their profiling in the same directory (which is the most current usage as the output directory must be on a local filesystem) will overwrite each other.

@therault
Copy link
Copy Markdown
Contributor Author

therault commented Mar 9, 2022 via email

@therault
Copy link
Copy Markdown
Contributor Author

therault commented Mar 9, 2022 via email

@bosilca
Copy link
Copy Markdown
Contributor

bosilca commented Mar 9, 2022

If parsec_init(NULL, NULL), then in parsec.c line 445 we add the process pid to the parsec_app_name. In the profiling code, we use this parsec_app_name via the basename function to set the cmdline_info (which at this point will then contain the pid). This cmdline_info is then used as a base for the name of the backend file in the parsec_profiling_dbp_start call when requested (or by default in fact). I might misread the code, but my understanding is that the pid is part of the profiling filename.

So maybe the trick is to use the pid in the filename but remove it from the string stored into the profiling file, so instead of adding it to parsec_app_name early in the process, we add it just to a temporary variable that we pass to the parsec_profiling_dbp_start call.

We can always rely on the fact that if the application was part of a parallel job, there is an RTE that will set some unique identifiers in the environment space. In the case of OMPI, we can use the PMIX_NAMESPACE environment variable to create some form of uniqueness, without relying on any communications.

@therault
Copy link
Copy Markdown
Contributor Author

therault commented Mar 9, 2022 via email

@bosilca
Copy link
Copy Markdown
Contributor

bosilca commented Mar 10, 2022

I did not say to change the profiling name provided by the user. Anyway, the basename you make reference in parsec_profiling_dbp_start to already contains the pid, in the case where default values are used and the user did not manually specify a filename for the profile. Just run with the default parameters and with parsec_init(NULL, NULL), and look at the generated profile files (they are 'app_name_47657-0.prof', so clearly the pid is in the name).

Sometimes is simpler to just write the code instead of making sure all details pass in a conversation. Here is what I was suggesting.

diff --git a/parsec/parsec.c b/parsec/parsec.c
index 56d1b072f..28d222f07 100644
--- a/parsec/parsec.c
+++ b/parsec/parsec.c
@@ -362,7 +362,7 @@ static void parsec_vp_init( parsec_vp_t *vp,
 
 static int check_overlapping_binding(parsec_context_t *context);
 
-#define DEFAULT_APPNAME "app_name_%d"
+#define DEFAULT_APPNAME "app_name"
 
 #define GET_INT_ARGV(CMD, ARGV, VALUE) \
 do { \
@@ -442,10 +442,7 @@ parsec_context_t* parsec_init( int nb_cores, int* pargc, char** pargv[] )
             fprintf(stderr, "%s: command line error (%d)\n", (*pargv)[0], ret);
         }
     } else {
-        ret = asprintf( &parsec_app_name, DEFAULT_APPNAME, (int)getpid() );
-        if (ret == -1) {
-            parsec_app_name = strdup( "app_name" );
-        }
+        parsec_app_name = strdup(DEFAULT_APPNAME);
     }
 
     ret = parsec_mca_cmd_line_process_args(cmd_line, &ctx_environ, &environ);
@@ -701,11 +698,17 @@ parsec_context_t* parsec_init( int nb_cores, int* pargc, char** pargv[] )
 #if defined(PARSEC_PROF_TRACE)
     if( (0 != strncasecmp(parsec_enable_profiling, "<none>", 6)) && (0 == parsec_profiling_init( profiling_id )) ) {
         int i, l;
-        char *cmdline_info = basename(parsec_app_name);
+        char *cmdline_info = NULL;
 
         /* Use either the app name (argv[0]) or the user provided filename */
         if( 0 == strncmp(parsec_enable_profiling, "<app>", 5) ) {
+            /* Specialize the profiling filename to avoid collision with other instances */
+            ret = asprintf( &cmdline_info, "%s_%d", basename(parsec_app_name), (int)getpid() );
+            if (ret < 0) {
+                cmdline_info = strdup(DEFAULT_APPNAME);
+            }
             ret = parsec_profiling_dbp_start( cmdline_info, parsec_app_name );
+            free(cmdline_info);
         } else {
             ret = parsec_profiling_dbp_start( parsec_enable_profiling, parsec_app_name );
         }

@therault therault marked this pull request as draft March 10, 2022 17:34
@bosilca
Copy link
Copy Markdown
Contributor

bosilca commented Mar 22, 2022

Please update based on the discussion.

When calling `parsec_init(NULL, NULL)` with profiling enabled and activated,
we would use `app_name_<PID>` as info to create the profile file. However,
when reading a distributed profile, it is assumed that the same info is
passed to all ranks. The manual hinted this through "uniquely" identify
the experiment, but that was not clear enough.

As a result, parallel profile files generated when initializing parsec
with NULL would fail to load.

This patch proposes to pass the same default string (by removing the PID)
to solve this issue.

Alternative approaches could be
  - to add a collective operation here to decide on a unique common
    name
  - to remove the check in dbp_reader.c that this info needs to match

Make the documentation more clear about parsec_profiling_dbp_start

The documentation of this function was stall: it was not updated when
we changed the file naming scheme (removed the mkstemp calll and the
XXXXXX in the filename), and we changed how the rank of the process
is passed to the profiling system. Make it more clear that the string
passed is an identifier, and must match between calling processes.

Implement suggested approach to support the '<app>' profiling filename in single-rank at least
@therault therault force-pushed the profiling-parsec_init-NULL branch from f17512f to e743292 Compare March 30, 2022 17:53
@therault
Copy link
Copy Markdown
Contributor Author

Implemented suggested changes, and rebased/squashed everything into a single commit

@therault therault marked this pull request as ready for review March 30, 2022 20:17
@bosilca bosilca merged commit 13bc4e0 into ICLDisco:master Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants