
Add support for parallel execution (autopep8's --jobs opt) #107

Merged
merged 10 commits into PyCQA:master from giampaolo:parallel-exec on Aug 24, 2022

Conversation

giampaolo
Contributor

@giampaolo giampaolo commented Apr 10, 2022

Hello. Today I discovered this project existed, and I immediately integrated it into my own projects. Unlike the autopep8 tool, I noticed it does not support the --jobs CLI option, so I decided to submit this PR. I ran this patched version of autoflake against the psutil code base, and it resulted in more than a 2x speedup (my laptop has 8 logical cores).

Standard:

~/svn/autoflake {parallel-exec}$ time python3 autoflake.py --expand-star-imports --remove-all-unused-imports --remove-duplicate-keys --remove-unused-variables -r /home/giampaolo/svn/psutil

real	0m2,446s
user	0m2,394s
sys	0m0,052s

Using the --jobs option:

~/svn/autoflake {parallel-exec}$ time python3 autoflake.py --jobs=0 --expand-star-imports --remove-all-unused-imports --remove-duplicate-keys --remove-unused-variables  -r /home/giampaolo/svn/psutil

real	0m0,972s
user	0m4,130s
sys	0m0,213s

About the patch: I had to turn argparse.Namespace into a dict because the multiprocessing module is not able to serialize it.
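
A minimal sketch of the kind of conversion involved (the parser and option here are illustrative, not the PR's actual code):

    import argparse

    # Flatten the parsed arguments into a plain dict so that
    # multiprocessing can pickle them and ship them to worker processes.
    parser = argparse.ArgumentParser()
    parser.add_argument("--jobs", type=int, default=0)
    namespace = parser.parse_args(["--jobs", "4"])
    args = vars(namespace)  # {'jobs': 4}: a plain dict, trivially picklable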

EDIT: fixed it. Unfortunately, test_autoflake.py reports 4 failures that I'm not sure how to fix. Hope this helps, and thanks for this great tool.

Signed-off-by: Giampaolo Rodola <g.rodola@gmail.com>
@giampaolo
Contributor Author

giampaolo commented Jun 24, 2022

Update: I set --jobs=0 as the default, meaning that autoflake will automatically spawn os.cpu_count() workers even if the --jobs option is not specified. This is the same default used by the black CLI tool.
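
A sketch of how such a default can be resolved (the helper name is hypothetical; the preference for os.cpu_count() over multiprocessing.cpu_count() is noted in a commit further down, since the latter may raise NotImplementedError):

    import os

    def resolve_jobs(jobs: int) -> int:
        # Hypothetical helper: --jobs=0 (or any value < 1) means
        # "use all logical cores".
        if jobs < 1:
            return os.cpu_count() or 1  # os.cpu_count() may return None
        return jobs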

@giampaolo
Contributor Author

giampaolo commented Jun 24, 2022

Since I'm not sure whether/when this PR will be merged, for whoever wants to use this feature, this is how you can install this exact PR with pip:

python3 -m pip install --force-reinstall git+https://github.com/PyCQA/autoflake.git@refs/pull/107/head

giampaolo added a commit to giampaolo/psutil that referenced this pull request Jun 24, 2022
...by pip-installing a PR I provided for the autoflake package which adds the
--jobs option to the tool, see PyCQA/autoflake#107

Signed-off-by: Giampaolo Rodola <g.rodola@gmail.com>
@FlorentJeannot

FlorentJeannot commented Jul 22, 2022

Hello @giampaolo

Thanks for this PR, this will be very useful for big projects!

Using macOS 12.5 (21G72) and Python 3.9.12, when I add unused imports in several files of my folder and then do the following:

autoflake -j 6 -cr --remove-all-unused-imports --exclude {folder_path} folder_to_check

I can see that it detects the unused imports, but then it hangs forever. I need to force-quit it, after which it prints the following:

...
No issues detected!
No issues detected!
No issues detected!
^CProcess SpawnPoolWorker-8:
Process SpawnPoolWorker-5:
(and 4 others)

When there are no issues in the folder, autoflake takes ~8s instead of ~36s, so that's very promising! 😄

Collaborator

@fsouza fsouza left a comment


I think this is a good idea, but I have one question about turning the args into a dict. Also, can you update your branch with the most recent changes in the master branch?

Thanks for contributing!

@giampaolo
Contributor Author

giampaolo commented Aug 24, 2022

Hello!

can you update your branch with the most recent changes in the master branch?

I'm trying, but 2645f85 messed things up for this PR. The problem with 2645f85 is that the _fix_file() function now accepts a sys.stdout object, which multiprocessing cannot serialize. When parallelizing tasks via multiprocessing, the code must be organized so that functions are only passed standard, picklable types (int, str, dict, list, etc.).
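
A minimal demonstration of the constraint (not the PR's code): pickling sys.stdout, which is what multiprocessing would have to do to hand it to a worker, fails outright:

    import pickle
    import sys

    try:
        pickle.dumps(sys.stdout)
    except TypeError as err:
        print(err)  # e.g. "cannot pickle '_io.TextIOWrapper' object"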

Any reason for using a dict instead of the regular namespace? Is it for serialization?

Correct.

@fsouza
Collaborator

fsouza commented Aug 24, 2022

I'm trying, but 2645f85 messed things up for this PR. The problem with 2645f85 is that the _fix_file() function now accepts a sys.stdout object, which multiprocessing cannot serialize. When parallelizing tasks via multiprocessing, the code must be organized so that functions are only passed standard, picklable types (int, str, dict, list, etc.).

Gotcha, we can reorganize the code. It doesn't make sense to support stdout/stdin with parallel execution anyway, so when the user passes stdin as an input we could force serial execution? And then restructure the code to ensure that we can support both paths? Like, how does autopep8 handle stdin/stdout?

@giampaolo
Contributor Author

when the user passes stdin as an input we could force serial execution

Done and pushed. Please note that the tests are green:

$ python3 test_autoflake.py 
...........................usage: autoflake [-h] [-c] [-r] [-j n] [--exclude globs] [--imports IMPORTS] [--expand-star-imports]
                 [--remove-all-unused-imports] [--ignore-init-module-imports] [--remove-duplicate-keys]
                 [--remove-unused-variables] [--version] [--quiet] [-v] [--stdin-display-name STDIN_DISPLAY_NAME] [-i | -s]
                 files [files ...]
autoflake: error: argument -s/--stdout: not allowed with argument -i/--in-place
................................................................................................
----------------------------------------------------------------------
Ran 123 tests in 3.789s

OK

...but I did not add any test for the new code path (I only tested this manually).

use os.cpu_count() instead of multiprocessing.cpu_count(): the latter may raise `NotImplementedError`
@fsouza
Collaborator

fsouza commented Aug 24, 2022

...but I did not add any test for the new code path (I only tested this manually).

The end-to-end tests should exercise it since the default behavior is to use all available CPUs.

Can you fix the pre-commit violation? Running pre-commit run -a locally should be enough.

@giampaolo
Contributor Author

giampaolo commented Aug 24, 2022

Can you fix the pre-commit violation? Running pre-commit run -a locally should be enough.

Done.

The end-to-end tests should exercise it since the default behavior is to use all available CPUs.

I put a pdb breakpoint in autoflake.py and the multiprocessing part is not exercised. Note that the logic is:

    if args["jobs"] == 1 or len(files) == 1 or "-" in files or standard_out is not None:
        # serial code
    else:
        # parallel code
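
For reference, a sketch of what the parallel branch can look like with multiprocessing.Pool (all names here are illustrative stand-ins, not the PR's actual functions):

    import multiprocessing

    def fix_one_file(name: str, args: dict) -> str:
        # Hypothetical stand-in for the per-file fixer.
        return f"checked {name} with jobs={args['jobs']}"

    def _job(payload):
        # Workers receive only picklable types: a (str, dict) tuple.
        name, args = payload
        return fix_one_file(name, args)

    def fix_files_in_parallel(files, args):
        # One process pool sized by --jobs; map each file to a worker.
        with multiprocessing.Pool(args["jobs"]) as pool:
            return pool.map(_job, [(name, args) for name in files])

    if __name__ == "__main__":  # required on platforms that spawn workers
        print(fix_files_in_parallel(["a.py", "b.py"], {"jobs": 2}))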

@fsouza
Collaborator

fsouza commented Aug 24, 2022

I put a pdb in autoflake.py and the multiprocessing part is not exercised. Note that the logic is:

Oh interesting. Does it not get exercised with test_fuzz.py either?

python test_fuzz.py *.py

@giampaolo
Contributor Author

giampaolo commented Aug 24, 2022

Mmm, no. If I read the test_fuzz.py code right, it passes one file at a time. Instead, autoflake would need to be invoked as:

python3 autoflake.py file1.py file2.py

That sort of invocation will trigger parallel execution.
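
A sketch of what such a test could look like (a subprocess-based end-to-end check run from the repo root; the file contents and flags here are illustrative, not an existing test):

    import os
    import subprocess
    import sys
    import tempfile

    # Two input files, so the parallel code path (rather than the
    # serial one) should be taken.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name in ("a.py", "b.py"):
            path = os.path.join(tmp, name)
            with open(path, "w") as f:
                f.write("import os\n")  # an unused import to flag
            paths.append(path)
        # --check reports issues without rewriting the files.
        subprocess.run([sys.executable, "autoflake.py", "--check", *paths])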

@fsouza
Collaborator

fsouza commented Aug 24, 2022

That's fair enough. We can add something later that will do it.

@fsouza fsouza merged commit 0344bb1 into PyCQA:master Aug 24, 2022
@giampaolo giampaolo deleted the parallel-exec branch August 25, 2022 06:33