
Add support for parallel execution (autopep8's --jobs opt) #107

Merged
merged 10 commits into PyCQA:master from giampaolo:parallel-exec on Aug 24, 2022

Conversation

giampaolo
Contributor

@giampaolo giampaolo commented Apr 10, 2022

Hello. Today I discovered this project existed, and I immediately integrated it into my own projects. Unlike the autopep8 tool, I noticed it does not support the --jobs CLI option, so I decided to submit this PR. I ran this patched version of autoflake against the psutil code base, and it resulted in more than a 2x speedup (my laptop has 8 logical cores).

Standard:

~/svn/autoflake {parallel-exec}$ time python3 autoflake.py --expand-star-imports --remove-all-unused-imports --remove-duplicate-keys --remove-unused-variables -r /home/giampaolo/svn/psutil

real	0m2,446s
user	0m2,394s
sys	0m0,052s

Using the --jobs option:

~/svn/autoflake {parallel-exec}$ time python3 autoflake.py --jobs=0 --expand-star-imports --remove-all-unused-imports --remove-duplicate-keys --remove-unused-variables  -r /home/giampaolo/svn/psutil

real	0m0,972s
user	0m4,130s
sys	0m0,213s

About the patch: I had to turn argparse.Namespace into a dict because the multiprocessing module is not able to serialize it.
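
A minimal sketch of the kind of conversion involved (the parser and option here are illustrative, not the PR's actual code):

    import argparse

    # Flatten the parsed arguments into a plain dict so that
    # multiprocessing can pickle them and ship them to worker processes.
    parser = argparse.ArgumentParser()
    parser.add_argument("--jobs", type=int, default=0)
    namespace = parser.parse_args(["--jobs", "4"])
    args = vars(namespace)  # {'jobs': 4}: a plain dict, trivially picklable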

EDIT: fixed it. Unfortunately, test_autoflake.py reports 4 failures that I'm not sure how to fix. Hope this helps, and thanks for this great tool.

Signed-off-by: Giampaolo Rodola <g.rodola@gmail.com>
@giampaolo
Contributor Author

giampaolo commented Jun 24, 2022

Update: I set --jobs=0 as the default, meaning that autoflake will automatically spawn os.cpu_count() workers even if the --jobs option is not specified. This is the same default used by the black CLI tool.
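
A sketch of how such a default can be resolved (the helper name is hypothetical; the preference for os.cpu_count() over multiprocessing.cpu_count() is noted in a commit further down, since the latter may raise NotImplementedError):

    import os

    def resolve_jobs(jobs: int) -> int:
        # Hypothetical helper: --jobs=0 (or any value < 1) means
        # "use all logical cores".
        if jobs < 1:
            return os.cpu_count() or 1  # os.cpu_count() may return None
        return jobs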

@giampaolo
Contributor Author

giampaolo commented Jun 24, 2022

Since I'm not sure whether/when this PR will be merged, for whoever wants to use this feature, this is how you can install this exact PR with pip:

python3 -m pip install --force-reinstall git+https://github.com/PyCQA/autoflake.git@refs/pull/107/head

giampaolo added a commit to giampaolo/psutil that referenced this pull request Jun 24, 2022
...by pip-installing a PR I provided for the autoflake package which adds the
--jobs option to the tool, see PyCQA/autoflake#107

Signed-off-by: Giampaolo Rodola <g.rodola@gmail.com>
@FlorentJeannot

FlorentJeannot commented Jul 22, 2022

Hello @giampaolo

Thanks for this PR, this will be very useful for big projects!

Using macOS 12.5 (21G72) and Python 3.9.12, when I add unused imports in several files of my folder and then do the following:

autoflake -j 6 -cr --remove-all-unused-imports --exclude {folder_path} folder_to_check

I can see that it detects the unused imports, but then it hangs forever. I need to force-quit it, after which it prints the following:

...
No issues detected!
No issues detected!
No issues detected!
^CProcess SpawnPoolWorker-8:
Process SpawnPoolWorker-5:
(and 4 others)

When there are no issues in the folder, autoflake takes ~8s instead of ~36s, so that's very promising! 😄

Collaborator

@fsouza fsouza left a comment


I think this is a good idea, but I have one question about turning the args into a dict. Also, can you update your branch with the most recent changes in the master branch?

Thanks for contributing!

@giampaolo
Contributor Author

giampaolo commented Aug 24, 2022

Hello!

can you update your branch with the most recent changes in the master branch?

I'm trying, but 2645f85 messed things up for this PR. The problem with 2645f85 is that the _fix_file() function now accepts a sys.stdout object, which multiprocessing cannot serialize. When parallelizing tasks via multiprocessing, the code must be organized so that functions are only passed standard, picklable types (int, str, dict, list, etc.).
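
A minimal demonstration of the constraint (not the PR's code): pickling sys.stdout, which is what multiprocessing would have to do to hand it to a worker, fails outright:

    import pickle
    import sys

    try:
        pickle.dumps(sys.stdout)
    except TypeError as err:
        print(err)  # e.g. "cannot pickle '_io.TextIOWrapper' object"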

Any reason for using a dict instead of the regular namespace? Is it for serialization?

Correct.

@fsouza
Collaborator

fsouza commented Aug 24, 2022

I'm trying, but 2645f85 messed things up for this PR. The problem with 2645f85 is that the _fix_file() function now accepts a sys.stdout object, which multiprocessing cannot serialize. When parallelizing tasks via multiprocessing, the code must be organized so that functions are only passed standard, picklable types (int, str, dict, list, etc.).

Gotcha, we can reorganize the code. It doesn't make sense to support stdout/stdin with parallel execution anyway, so when the user passes stdin as an input we could force serial execution? And then restructure the code to ensure that we can support both paths? Like, how does autopep8 handle stdin/stdout?

@giampaolo
Contributor Author

when the user passes stdin as an input we could force serial execution

Done and pushed. Please note that the tests are green:

$ python3 test_autoflake.py 
...........................usage: autoflake [-h] [-c] [-r] [-j n] [--exclude globs] [--imports IMPORTS] [--expand-star-imports]
                 [--remove-all-unused-imports] [--ignore-init-module-imports] [--remove-duplicate-keys]
                 [--remove-unused-variables] [--version] [--quiet] [-v] [--stdin-display-name STDIN_DISPLAY_NAME] [-i | -s]
                 files [files ...]
autoflake: error: argument -s/--stdout: not allowed with argument -i/--in-place
................................................................................................
----------------------------------------------------------------------
Ran 123 tests in 3.789s

OK

...but I did not add any test for the new code path (I only tested this manually).

use os.cpu_count() instead of multiprocessing.cpu_count(): the latter may raise `NotImplementedError`
@fsouza
Collaborator

fsouza commented Aug 24, 2022

...but I did not add any test for the new code path (I only tested this manually).

The end-to-end tests should exercise it since the default behavior is to use all available CPUs.

Can you fix the pre-commit violation? Running pre-commit run -a locally should be enough.

@giampaolo
Contributor Author

giampaolo commented Aug 24, 2022

Can you fix the pre-commit violation? Running pre-commit run -a locally should be enough.

Done.

The end-to-end tests should exercise it since the default behavior is to use all available CPUs.

I put a pdb breakpoint in autoflake.py and the multiprocessing part is not exercised. Note that the logic is:

    if args["jobs"] == 1 or len(files) == 1 or "-" in files or standard_out is not None:
        # serial code
    else:
        # parallel code
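
For reference, a sketch of what the parallel branch can look like with multiprocessing.Pool (all names here are illustrative stand-ins, not the PR's actual functions):

    import multiprocessing

    def fix_one_file(name: str, args: dict) -> str:
        # Hypothetical stand-in for the per-file fixer.
        return f"checked {name} with jobs={args['jobs']}"

    def _job(payload):
        # Workers receive only picklable types: a (str, dict) tuple.
        name, args = payload
        return fix_one_file(name, args)

    def fix_files_in_parallel(files, args):
        # One process pool sized by --jobs; map each file to a worker.
        with multiprocessing.Pool(args["jobs"]) as pool:
            return pool.map(_job, [(name, args) for name in files])

    if __name__ == "__main__":  # required on platforms that spawn workers
        print(fix_files_in_parallel(["a.py", "b.py"], {"jobs": 2}))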

@fsouza
Collaborator

fsouza commented Aug 24, 2022

I put a pdb in autoflake.py and the multiprocessing part is not exercised. Note that the logic is:

Oh interesting. Does it not get exercised with test_fuzz.py either?

python test_fuzz.py *.py

@giampaolo
Contributor Author

giampaolo commented Aug 24, 2022

Mmm, no. If I read the test_fuzz.py code right, it passes one file at a time. Instead, autoflake would need to be invoked as:

python3 autoflake.py file1.py file2.py

That sort of invocation will trigger parallel execution.
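
A sketch of what such a test could look like (a subprocess-based end-to-end check run from the repo root; the file contents and flags here are illustrative, not an existing test):

    import os
    import subprocess
    import sys
    import tempfile

    # Two input files, so the parallel code path (rather than the
    # serial one) should be taken.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name in ("a.py", "b.py"):
            path = os.path.join(tmp, name)
            with open(path, "w") as f:
                f.write("import os\n")  # an unused import to flag
            paths.append(path)
        # --check reports issues without rewriting the files.
        subprocess.run([sys.executable, "autoflake.py", "--check", *paths])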

@fsouza
Collaborator

fsouza commented Aug 24, 2022

That's fair enough. We can add something later that will do it.

@fsouza fsouza merged commit 0344bb1 into PyCQA:master Aug 24, 2022
@giampaolo giampaolo deleted the parallel-exec branch August 25, 2022 06:33