PDL::Core::dog doesn't scale #421

Open
mohawk2 opened this issue Mar 27, 2023 · 17 comments

mohawk2 commented Mar 27, 2023

For this issue, the timing of perl -MPDL -e 'zeroes(X)->dog' is being discussed. This was prompted by an observation from @djerius.

X=1e4,6 takes no appreciable time; X=6,1e4 takes around 5s. The current implementation of dog does a Perl map, slicing over the highest dimension. This reveals two problems:

  • the current machinery for starting up an ndarray operation seems to be very expensive (as noted long ago by @run4flat) and needs profiling to find improvements
  • it's not trivial to make dog a PP operation, because we still don't have a Par notation that could express it helpfully, nor do we have a way to return many ndarrays

A partial solution to the second point would be to pass the input in as an OtherPar (with no Pars at all), together with an output OtherPar of [o] pdl *retval[], where retval_count gets set automatically as with any []-valued OtherPar, but used for output. The typemap would turn that into an AVREF populated with new ndarray SVs. The input version of that would also allow a PP cat operation.


mohawk2 commented Mar 27, 2023

For comparison, within perldl, with_time { my @x = 1..1e7; $_++ for @x } takes about 750ms, and without the ++ operation, it's about 400ms just for the allocation.


djerius commented Mar 27, 2023

Here's a plot of user time vs N, where the code run is zeroes(6,N)->dog.

This was run on an AMD EPYC 7763 with 128 cores and 638 GB, using 100 parallel processes. Noise might be due to clocking of cores at different speeds (3244 MHz vs. 2445 MHz).

Code to generate the data:
```sh
{ for t in $(seq 1000 1000 1000000); do echo /bin/time --format "'%E %U %S'" perl -MPDL -e "'print $t, q{ };zeroes(6,$t)->dog'" ; done; } | parallel --jobs 100 --joblog joblog |& tee timelog
```
Plotting code is here: plot-times.txt

[plot: user time vs N]


djerius commented Mar 27, 2023

I've attached a zip file with NYTProf output for N = 1_000, 10_000, and 100_000.

profile.zip.txt


mohawk2 commented Mar 27, 2023

> I've attached a zip file with NYTProf output for N = 1_000, 10_000, and 100_000.

My analysis of the 1e5 results: for 100k calls, _slice_int took 126s, and DESTROY took 349s!


mohawk2 commented Mar 27, 2023

> Here's a plot of user time vs N, where the code run is zeroes(6,N)->dog.
>
> This was run on an AMD EPYC 7763 with 128 cores and 638 GB, using 100 parallel processes. Noise might be due to clocking of cores at different speeds (3244 MHz vs. 2445 MHz).

This is really cool - how did you make it?


djerius commented Mar 27, 2023

> This is really cool - how did you make it?

For future reference, it's been added to the plot comment above.


mohawk2 commented Mar 27, 2023

> > This is really cool - how did you make it?
>
> For future reference, it's been added to the plot comment above.

Thank you. That will be at least twice as valuable if you also include the rest, i.e. specifically how you turned those times into the plot shown?


djerius commented Mar 27, 2023

> Thank you. That will be at least twice as valuable if you also include the rest, i.e. specifically how you turned those times into the plot shown?

Check out the plot_times.txt file attached to the comment.


mohawk2 commented Mar 27, 2023

Thanks! This is the code, to avoid having to download things:

```perl
#! /usr/bin/env perl

use PDL;
use Path::Tiny;
use PDL::Graphics::PGPLOT;

my ( @n, @e, @u, @s );
for ( path('timelog')->lines ) {

    chomp;
    my ( $n, $e, $u, $s ) = split /\h+/;

    # convert each [[H:]M:]S time string (or plain seconds) into hours
    for my $time ( $e, $u, $s ) {
        my @c = reverse split /:/, $time;
        $time = ( pdl(@c) * 60**sequence( 0 + @c ) )->dsumover / 3600;
    }

    push @n, $n;
    push @e, $e;
    push @u, $u;
    push @s, $s;
}

my $n = PDL->new( @n );
my $e = PDL->new( @e );
my $u = PDL->new( @u );
my $s = PDL->new( @s );

# points( $n, $e, { symbol => 2, symbolsize => 3 } );
# label_axes( 'N', 'Elapsed Time [hours]' );

points( $n, $u, { symbol => 2, symbolsize => 3 } );
label_axes( 'N', 'User Time [hours]' );

# points( $n, $s, { symbol => 2, symbolsize => 3 } );
# label_axes( 'N', 'System Time [hours]' );
```


djerius commented Mar 28, 2023

Updated the plot in the above comment, just for kicks.


djerius commented Mar 29, 2023

> Updated the plot in the above comment, just for kicks.

Final update of the plot. Now the relationship looks really odd at higher N.

Maybe there was process contention on the machine after all? I'll run it again with fewer processes and see if it changes.


mohawk2 commented Mar 29, 2023

Multiple clear-ish lines are very confusing!


djerius commented Mar 29, 2023

Rerunning as:

```sh
seq 1000 1000 1000000                                                   \
| parallel --shuf -t --jobs 25 --joblog joblog                          \
  taskset --cpu-list '{%}'                                              \
  /bin/time --format "'%E %U %S'"                                       \
  perl -MPDL -e "'print STDERR {}, q{ }; zeroes(6,{})->dog'" '2>&1'     \
| tee timelog
```

With results (as plotted via the code in the comment above):

[plot: user time vs N, rerun with taskset CPU pinning]


djerius commented Mar 31, 2023

@mohawk2 new results finished; see the plot in the comment with the revised code. Much cleaner.


mohawk2 commented Mar 31, 2023

Thank you. It does look like it might be O(n^2). Feel like making a log graph to see? :-)


mohawk2 commented Apr 1, 2023

To be comparable, with the current code:

```
pdl> with_time { @x = map sequence(10), 1..1e5 }
1957.02 ms
pdl> with_time { pdl(\@x) }
283.285 ms
pdl> with_time { cat(@x) }
1073.88 ms
```


mohawk2 commented Mar 31, 2024

Using cachegrind as noted in #451, 62% of the time for zeroes(6,1e4)->dog was taken in pdl__removetrans_children (called in DESTROY). Simply adding a count of how many children each ndarray has, so that the linked-list walking can terminate early, brought the typical time for that down from about 800ms to about 450ms.
