[Feature #600]: Use multiprocessing to speed up the parsing #601
Conversation
If it's not done inside the `if __name__ == "__main__":` guard, it will be re-run inside every new process on Mac/Windows
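To illustrate the point above: on Windows and macOS, new worker processes are started with the "spawn" method, which re-imports the main module, so any pool creation at module level would be re-executed in every child. A minimal, hypothetical sketch (the `parse` function is a stand-in, not the project's real parser):

```python
from concurrent.futures import ProcessPoolExecutor

def parse(chunk):
    # hypothetical stand-in for the real XML parsing work
    return len(chunk)

# Without this guard, each spawned child would re-run the pool creation
# below on re-import and the program would crash or recurse.
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(parse, ["ab", "cde"]))
    print(results)  # [2, 3]
```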
Since the processing is now async, this print might confuse the users
Thank you @AlexandraImbrisca for the implementation and sending the detailed report which reads coherently! What I stumbled across so far:
The speedup stalls somewhere from 5 cores onward. I can imagine this drop in the speed increase is caused by a) write concurrency, b) other processes running on my laptop, or c) the number of parallel processes decreasing once most of the tasks are done?
(The column
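Explanation (a) above can be illustrated with Amdahl's law: if the database writes are effectively serialized, the achievable speedup is capped no matter how many cores are added. The 20% serial fraction below is a made-up number for illustration, not a measurement of open-mastr:

```python
# Amdahl's law: speedup with n cores when a fixed fraction of the
# work (here: the serialized database writes) cannot be parallelized.
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

for n in (1, 2, 4, 5, 8, 16):
    print(n, round(amdahl_speedup(0.2, n), 2))
# with a 20% serial share, speedup reaches ~2.78 at 5 cores
# but only 4.0 even at 16 cores, i.e. it visibly flattens out
```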
Thanks a lot for the detailed review and suggestions @nesnoj!
Hey @AlexandraImbrisca !
Sounds good to me.
An alternative way could be to create separate SQLite DBs and finally merge them. Dunno if this is a viable option..
It terminates :(
Sounds good to me as well!
Instead of "timeout", we can use "connect_timeout" which works for both SQLite and PostgreSQL
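One way to sketch a per-backend connection timeout. Note that the underlying DBAPI keywords actually differ per driver (the stdlib `sqlite3` driver calls it `timeout`, the time to wait on a locked database, while psycopg2's `connect()` takes `connect_timeout`), so a small helper mapping is one option; this is an assumption about how the setting would be wired up, not the project's actual code:

```python
import sqlite3

def connect_args_for(dialect: str, timeout_s: int) -> dict:
    # assumption: only the two backends from the discussion are in play
    if dialect == "sqlite":
        return {"timeout": timeout_s}
    return {"connect_timeout": timeout_s}  # e.g. postgresql via psycopg2

# the sqlite variant can be exercised directly with the stdlib driver
conn = sqlite3.connect(":memory:", **connect_args_for("sqlite", 5))
conn.execute("CREATE TABLE t (x)")
conn.close()
```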
Awesome, thanks a lot both! A few updates from my side:
About merging the DBs: that might work, but it might get quite messy with many processes (i.e., we could end up with 10+ temporary DBs) and we have to make sure that we clean everything up eventually 🤔 Using temporary tables performed better than I expected (source)
Thx for the quick update!
I'll get back to this later
The column issue seems to be solved, but now I keep getting an error in PostgreSQL with the privileges, see below for the full log. The user has all privileges for the DB (superuser) and the tables are created, but no data is written. I think it is not related to the actual privileges but to the implementation, but I wasn't able to track it down further right now.
Great that you already did some testing in the past! The write-temp-and-merge strategy was just a quick thought; it probably comes with other consequences I cannot estimate and also requires more testing. I'm also fine with the current implementation but open for discussion ;). Click here for full postgres traceback
Thanks a bunch for finding this bug! I was using an unauthenticated database and I didn't realise that this could be an issue. The connection_url obfuscates the password, so I updated the code to properly set the password. Could you please try again and let me know if you see the same issue?
These two small things needed a fix; I patched them.
Now it works fine with psql, thank you!
Thank you for spotting the issues and fixing them! If you have any other suggestions, please let me know
Is this the version now that should be merged to develop and released afterwards? If yes, I would start with the comparison of the two databases:
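The comparison step mentioned above could start from per-table row counts of the two databases. A hypothetical helper (table names and paths are illustrative, not from the actual project; shown for SQLite via the stdlib driver):

```python
import sqlite3

def row_counts(db_path: str) -> dict:
    """Return {table_name: row_count} for every table in the database."""
    con = sqlite3.connect(db_path)
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    counts = {t: con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
              for t in tables}
    con.close()
    return counts

def diff_counts(a: dict, b: dict) -> dict:
    """Tables whose row counts differ between the two databases."""
    return {t: (a.get(t), b.get(t))
            for t in set(a) | set(b) if a.get(t) != b.get(t)}
```

Matching row counts are of course only a first check; spot-checking actual rows would still be needed.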
Yes, I think this is the final version (unless we find any other bugs/suggestions). If you can help with testing, that would be great! I will also test a bit more
Did you test on Windows? Without setting os.environ, my program immediately crashes: `concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.`
Could you please try again and let me know if any other error is printed? I unfortunately don't have my own Windows system and I only tested the previous version, before adding the os.environ variables. I'll try accessing Windows today and test the code again
I just tested on Windows 11 and I had no issues. I tried with WSL 2.0 and similarly, the program runs correctly. I tested without setting os.environ as well as with setting each of the fields.
Just saw it now, I'll work on this hopefully within this or next week.
@AlexandraImbrisca

```python
from open_mastr import Mastr
import os

os.environ["NUMBER_OF_PROCESSES"] = "1"
db = Mastr()
db.download(date="existing", data="solar")
```

throws this error. It refers to: `\xml_download\utils_write_to_database.py", line 67, in write_mastr_xml_to_database future.result()` I'm using python 3.11.8 in a conda env on Windows.
Oh, could you please try again with the following snippet? This condition is necessary to ensure that the program doesn't attempt to recreate the pool in every new process. It's already part of main.py. Without this condition, multiprocessing will always break on Windows/macOS AFAIU
This solved the issue. However, I'm sure that many users don't have an `if __name__ == "__main__":` check in their code. Is there a way to only use multiprocessing if `os.environ["NUMBER_OF_PROCESSES"] = "SomeNumber"` is given? Because otherwise our version update would break their code.
In my tests above it always worked without checking the top-level code environment by
Yes, I guess this is a problem appearing only on Windows.
Great idea! I updated the code to not use multiprocessing unless one of the two options is set (`USE_RECOMMENDED_NUMBER_OF_PROCESSES` / `NUMBER_OF_PROCESSES`). I also added a note to the documentation about `if __name__ == "__main__"`.
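The opt-in logic described above could look roughly like this. The function names and the fallback policy are assumptions for illustration, not the project's actual implementation; the two environment variable names are the ones from the discussion:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def resolve_number_of_processes() -> int:
    # opt-in: multiprocessing only when one of the two variables is set
    if os.environ.get("USE_RECOMMENDED_NUMBER_OF_PROCESSES"):
        return max(1, (os.cpu_count() or 1) - 1)  # leave one core free
    return int(os.environ.get("NUMBER_OF_PROCESSES", "1"))

def run(tasks, worker):
    n = resolve_number_of_processes()
    if n <= 1:
        # sequential fallback: safe even without a __main__ guard,
        # so existing user scripts keep working after the update
        return [worker(t) for t in tasks]
    with ProcessPoolExecutor(max_workers=n) as pool:
        return list(pool.map(worker, tasks))
```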
@FlorianK13 @nesnoj small reminder if you can review these changes again! 🙏🏻
A simple

```python
db = Mastr()
db.download(date="existing", data="wind")
```

on Windows now works again 👍
FlorianK13 left a comment
Sounds great @FlorianK13, thank you! I fixed the ruff linter warnings.
Follow-up to the previous PR: