
Conversation

@AlexandraImbrisca
Contributor

Follow up to the previous PR:

  • Using a ProcessPoolExecutor with 3 worker processes speeds up the execution time significantly (see the sketch below)
  • Depending on the operating system and technical specifications, we obtain a time decrease between 49.68% and 70.03% relative to the previously optimized algorithm. In combination with the other improvements, this adds up to a 76.56% decrease from the initial, non-optimized implementation
  • Leveraging the standard multiprocessing functionality and carefully ordering the files leads to a safe optimisation across all tested environments
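
A minimal sketch of the approach, for illustration: the names process_xml_file and write_mastr_xml_to_database appear in the tracebacks further down, but the signatures and bodies here are assumptions, not the PR's actual code.

from concurrent.futures import ProcessPoolExecutor

def process_xml_file(xml_file: str) -> str:
    # In the real implementation each worker parses one XML file and
    # writes the records to the database; here we only echo the name
    return f"processed {xml_file}"

def write_mastr_xml_to_database(files: list[str], number_of_processes: int = 3) -> None:
    with ProcessPoolExecutor(max_workers=number_of_processes) as executor:
        futures = [executor.submit(process_xml_file, f) for f in files]
        for future in futures:
            # result() re-raises worker exceptions in the parent process,
            # which is how the errors discussed below surface
            print(future.result())

if __name__ == "__main__":  # required on Windows/macOS, see the comments below
    write_mastr_xml_to_database(["EinheitenSolar_1.xml", "EinheitenSolar_2.xml"])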

If it's not done inside the "if __name__ == "__main__"" guard, it will be re-executed inside every new process on macOS/Windows
Since the processing is now async, this print might confuse users
@nesnoj
Collaborator

nesnoj commented Jan 27, 2025

Thank you @AlexandraImbrisca for the implementation and sending the detailed report which reads coherently!
Is this PR ready for review?

What I stumbled across so far:

  • The CPU count is hard-coded but should be configurable. Ideally via CLI, but we do not have one, so maybe an environment variable could do the job. I'm also not sure whether >1 is an appropriate default; multiprocessing could instead be promoted as an opt-in. To keep in mind: the base process uses ~100 MB and each worker process about 1 GB. A standard office PC is equipped with ~8 GB, so 3 processes might be ok. Alternatively, we could set a default of 1 and add a message like "Your system supports multiple CPU cores, you can increase the processing speed by setting env var ..."
    What do you think @AlexandraImbrisca @FlorianK13 ?
  • I tested with different CPU counts (Ryzen 7, Linux, SQLite DB, only "solar"). Concerning the processing speed increase, my results are somewhat in line with yours:
Cores   Time in s (SQLite)
3       394.8
4       316.6
5       292.4
6       280.8
8       279.5
10      280.6
12      276.4

The speed gain stalls somewhere from 5 cores onward. I can imagine this drop is caused by a) the writing concurrency, b) other running processes on my laptop, or c) the number of parallel processes decreasing once most of the tasks are done?

  • Could you please explain why you chose the parameters in create_efficient_engine() like this? And are they optimal for any number of CPUs?
  • With 12 cores I occasionally(!) get the following error message:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) duplicate column name: InAnspruchGenommeneAckerflaeche
[SQL: ALTER TABLE solar_extended ADD "InAnspruchGenommeneAckerflaeche" VARCHAR NULL;]

(The column InAnspruchGenommeneAckerflaeche does not exist in our data model which isn't a problem - it is automatically added but there seems to be an issue with that in the mp)

  • PostgreSQL is crashing here: sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) invalid dsn: invalid connection option "timeout". The timeout argument in connect_args seems not to be understood by Postgres (see the sketch after this list).
  • The docs need to be updated
  • Changelog entry is missing
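
On the timeout bullet above, a sketch of the likely dialect mismatch (the engine URLs are illustrative): sqlite3 calls the option "timeout", while psycopg2 calls it "connect_timeout", so a single shared connect_args dict breaks PostgreSQL.

from sqlalchemy import create_engine

sqlite_engine = create_engine(
    "sqlite:///open-mastr.db",
    connect_args={"timeout": 30},          # sqlite3: seconds to wait on a locked DB
)
postgres_engine = create_engine(
    "postgresql+psycopg2://user:secret@localhost:5432/mastr",
    connect_args={"connect_timeout": 30},  # psycopg2: connection timeout in seconds
)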

@AlexandraImbrisca
Contributor Author

AlexandraImbrisca commented Jan 27, 2025

Thanks a lot for the detailed review and suggestions @nesnoj!

  • Default number of processes: I like the suggestion of keeping only 1 and adding a note! Since we are just introducing this feature, it might be helpful to make people aware of it and ask them to report any potential issues. How about we add that explanatory message and a link to the issues page to report any possible bugs / negative experiences?
  • Thanks for testing! From what I have read, the general suggestion is to use CPU_count - 1, so I totally agree that scaling the number of cores relative to the system makes sense. I think the performance stalls because of SQLite (SQLite is not designed for write concurrency) :(
  • The choice of parameters: sure, I'll leave some comments!
  • Error when using 12 cores: oh, interesting! OOC, is the exception caught or does the program terminate?
  • PostgreSQL: I unfortunately tested mostly on SQLite :( I'll find a solution for this bug and test a bit more on PostgreSQL
  • Docs & changelog updates: sure thing! I'll create another commit for these updates

@nesnoj
Collaborator

nesnoj commented Jan 28, 2025

Hey @AlexandraImbrisca !

  • Default number of processes: I like the suggestion of keeping only 1 and adding a note! Since we are just introducing this feature, it might be helpful to make people aware of it and ask them to report any potential issues. How about we add that explanatory message and a link to the issues page to report any possible bugs / negative experiences?

Sounds good to me.
What do you think @FlorianK13 ?

I think the performance stalls because of sqlite (sqlite is not designed for write concurrency) :(

An alternative way could be to create separate SQLite DBs and finally merge them (rough sketch below). Dunno if this is a viable option..
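
A rough sketch of what that could look like, assuming each worker wrote the same table schema into its own SQLite file; all names are illustrative.

import sqlite3

def merge_worker_db(main_db: str, worker_db: str, table: str) -> None:
    # assumes `table` already exists with the same schema in both files
    conn = sqlite3.connect(main_db)
    try:
        conn.execute("ATTACH DATABASE ? AS worker", (worker_db,))
        conn.execute(f"INSERT INTO {table} SELECT * FROM worker.{table}")
        conn.commit()  # close the implicit transaction before detaching
        conn.execute("DETACH DATABASE worker")
    finally:
        conn.close()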

  • Error when using 12 cores: oh, interestingly! OOC, is the exception caught or does the program terminate?

It terminates :(

@FlorianK13
Member

  • Default number of processes: I like the suggestion of keeping only 1 and adding a note! Since we are just introducing this feature, it might be helpful to make people aware of it and ask them to report any potential issues. How about we add that explanatory message and a link to the issues page to report any possible bugs / negative experiences?

Sounds good to me as well!

@AlexandraImbrisca AlexandraImbrisca changed the title Use multiprocessing to speed up the parsing [Feature #600]: Use multiprocessing to speed up the parsing Jan 28, 2025
@AlexandraImbrisca
Contributor Author

Awesome, thanks a lot both! A few updates from my side:

  • I introduced 2 new environment variables: one for using the recommended number of processes, one to set a custom number of processes. I think that 2 variables are necessary since people might not be aware of what number of processes would perform best, but it would be nice to allow them to customize it
  • @nesnoj I think the "duplicate column name" exception occurs because of a race condition (i.e., 2 processes trying to add the same column at the same time). Please correct me if I'm wrong, but I think we can safely ignore this error since once the missing columns have been introduced, we have reached our goal 🤔 I added some more error handling (see the sketch below). Could you please let me know if you are still able to reproduce this issue?
  • I fixed the PostgreSQL issue and generally tested more for PostgreSQL
  • I updated the documentation and added a message to promote this feature

About merging the DBs: that might work, but it might get quite messy with many processes (i.e., we could end up with 10+ temporary DBs) and we have to make sure that we clean everything up eventually 🤔 Using temporary tables performed better than I expected (source)
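
A minimal sketch of the error handling I mean, with illustrative names; the ALTER TABLE statement is the one from the error above, and the matched message is the SQLite wording.

from sqlalchemy import text
from sqlalchemy.exc import OperationalError

def add_missing_column(engine, table: str, column: str) -> None:
    try:
        with engine.begin() as conn:
            conn.execute(text(f'ALTER TABLE {table} ADD "{column}" VARCHAR NULL'))
    except OperationalError as err:
        # two workers raced to add the same column: the loser can ignore
        # the error because the column now exists
        if "duplicate column name" not in str(err):
            raise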

@nesnoj
Collaborator

nesnoj commented Jan 29, 2025

Thx for the quick update!

  • I introduced 2 new environment variables: one for using the recommended number of processes, one to set up a custom number of processes. I think that 2 variables are necessary since people might not be aware of what number of processes would perform the best, but it would be nice to allow them to customize it

I'll get back to this later

  • @nesnoj I think the "duplicate column name" exception occurs because of a race condition (i.e., 2 processes trying to add the same column at the same time). Please correct me if I'm wrong, but I think we can safely ignore this error since once we have introduced the missing columns, we reached our purpose 🤔 I added some more error handling. Could you please let me know if you are still able to reproduce this issue?
  • I fixed the PostgreSQL issue and generally tested more for PostgreSQL

The column issue seems to be solved, but now I keep getting an error in PostgreSQL with the privileges, see below for the full log. The user has all privileges for the DB (superuser) and the tables are created, but no data is written. I think it is not related to the actual privileges but to the implementation, but I wasn't able to track it further down right now.
Does it work properly at your end?

  • I updated the documentation and added a message to promote this feature

About merging the DBs: that might work, but it might get quite messy with many processes (i.e., we could end up with 10+ temporary DBs) and we have to make sure that we clean everything up eventually 🤔 Using temporary tables performed better than I expected (source)

Great that you already did some testing in the past! The write-temp-and-merge strategy was just a quick thought, it probably comes with other consequences I cannot estimate and also requires more testing. I'm also fine with the current implementation but open for discussion ;).

Full postgres traceback:
Processing file 'AnlagenEegSolar_48.xml'...
Processing file 'EinheitenSolar_48.xml'...

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 146, in __init__
    self._dbapi_connection = engine.raw_connection()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3298, in raw_connection
    return self.pool.connect()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 449, in connect
    return _ConnectionFairy._checkout(self)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1263, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 712, in checkout
    rec = pool._do_get()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 179, in _do_get
    with util.safe_reraise():
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 177, in _do_get
    return self._create_connection()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 390, in _create_connection
    return _ConnectionRecord(self)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 674, in __init__
    self.__connect()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 900, in __connect
    with util.safe_reraise():
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 896, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 646, in connect
    return dialect.connect(*cargs, **cparams)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 622, in connect
    return self.loaded_dbapi.connect(*cargs, **cparams)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "mastr"
connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "mastr"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/nesnoj/git-repos/OpenEnergyPlatform/open-MaStR/open-MaStR_546_parsing_speed/open_mastr/xml_download/utils_write_to_database.py", line 103, in process_xml_file
    create_database_table(engine, xml_table_name)
  File "/home/nesnoj/git-repos/OpenEnergyPlatform/open-MaStR/open-MaStR_546_parsing_speed/open_mastr/xml_download/utils_write_to_database.py", line 215, in create_database_table
    orm_class.__table__.drop(engine, checkfirst=True)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/sql/schema.py", line 1299, in drop
    bind._run_ddl_visitor(ddl.SchemaDropper, self, checkfirst=checkfirst)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3248, in _run_ddl_visitor
    with self.begin() as conn:
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3238, in begin
    with self.connect() as conn:
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3274, in connect
    return self._connection_cls(self)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 148, in __init__
    Connection._handle_dbapi_exception_noconnection(
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2439, in _handle_dbapi_exception_noconnection
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 146, in __init__
    self._dbapi_connection = engine.raw_connection()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3298, in raw_connection
    return self.pool.connect()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 449, in connect
    return _ConnectionFairy._checkout(self)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1263, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 712, in checkout
    rec = pool._do_get()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 179, in _do_get
    with util.safe_reraise():
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 177, in _do_get
    return self._create_connection()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 390, in _create_connection
    return _ConnectionRecord(self)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 674, in __init__
    self.__connect()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 900, in __connect
    with util.safe_reraise():
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 896, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 646, in connect
    return dialect.connect(*cargs, **cparams)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 622, in connect
    return self.loaded_dbapi.connect(*cargs, **cparams)
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "mastr"
connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "mastr"

(Background on this error at: https://sqlalche.me/e/20/e3q8)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nesnoj/git-repos/OpenEnergyPlatform/open-MaStR/open-MaStR_546_parsing_speed/testing.py", line 17, in <module>
    db.download(data="solar")# solar
  File "/home/nesnoj/git-repos/OpenEnergyPlatform/open-MaStR/open-MaStR_546_parsing_speed/open_mastr/mastr.py", line 244, in download
    write_mastr_xml_to_database(
  File "/home/nesnoj/git-repos/OpenEnergyPlatform/open-MaStR/open-MaStR_546_parsing_speed/open_mastr/xml_download/utils_write_to_database.py", line 65, in write_mastr_xml_to_database
    future.result()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/nesnoj/miniconda3/envs/py310_open_mastr_546/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "mastr"
connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "mastr"

(Background on this error at: https://sqlalche.me/e/20/e3q8)

Process finished with exit code 1

@AlexandraImbrisca
Contributor Author

Thanks a bunch for finding this bug! I was using an unauthenticated database and I didn't realise that this could be an issue. The connection_url obfuscates the password, so I updated the code to properly set the password (small demo below). Could you please try again and let me know if you see the same issue?
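
For illustration, the masking pitfall in a few lines; the URL and credentials are made up:

from sqlalchemy.engine import make_url

url = make_url("postgresql+psycopg2://mastr:secret@localhost:5432/mastr")
print(repr(url))                                  # password rendered as ***
print(url.render_as_string(hide_password=False))  # full DSN, safe to pass to workers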

@nesnoj nesnoj left a comment
Collaborator

These two small things needed a fix, I patched..
Now it works fine with psql, thank you!

@AlexandraImbrisca
Contributor Author

Thank you for spotting the issues and fixing them! If you have any other suggestions, please let me know

@nesnoj nesnoj requested a review from FlorianK13 January 31, 2025 21:35
@FlorianK13
Member

Is this the version now that should be merged to develop and released afterwards? If yes, I would start with the comparison of the two databases:

  • downloaded with this branch
  • downloaded with open-mastr from pypi

@AlexandraImbrisca
Contributor Author

Yes, I think this is the final version (unless we find any other bugs/suggestions). If you can help testing, that would be great! I will also test a bit more

@FlorianK13
Member

Did you test on Windows? Without setting os.environ, my program immediately crashes:

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

@AlexandraImbrisca
Contributor Author

Could you please try again and let me know if any other error is being printed? I unfortunately don't have my own Windows system and I only tested the previous version before adding the os.environ variables. I'll try accessing Windows today and test the code again

@AlexandraImbrisca
Contributor Author

I just tested on Windows 11 and had no issues. I tried with WSL 2.0 and, similarly, the program runs correctly. I tested without setting os.environ as well as with setting each of the fields.

@FlorianK13
Member

Just saw it now, I'll work on this hopefully within this or next week.

@FlorianK13
Member

FlorianK13 commented Feb 24, 2025

@AlexandraImbrisca
Running this script:

from open_mastr import Mastr
import os

os.environ["NUMBER_OF_PROCESSES"] = "1"
db = Mastr()
db.download(date="existing", data="solar")

throws this error:

    concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

    RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

It refers to:

\xml_download\utils_write_to_database.py", line 67, in write_mastr_xml_to_database future.result()
\utils_write_to_database.py", line 63, in write_mastr_xml_to_database
futures = [
\utils_write_to_database.py", line 64, in
executor.submit(process_xml_file, *item) for item in interleaved_files

I'm using python 3.11.8 in a conda env on Windows.

@AlexandraImbrisca
Contributor Author

Oh, could you please try again with the following snippet?

from open_mastr import Mastr
import os

os.environ["NUMBER_OF_PROCESSES"] = "1"
db = Mastr()

if __name__ == "__main__":
    db.download(date="existing", data="solar")

This condition is necessary to ensure that the program doesn't attempt to recreate the pool in every new process. It's already part of main.py. Without this condition, multiprocessing will always break on Windows/macOS AFAIU

@FlorianK13
Member

FlorianK13 commented Feb 24, 2025

This solved the issue. However, I'm sure that many users don't have an if __name__ == "__main__": guard in their code. Is there a way to only use multiprocessing if os.environ["NUMBER_OF_PROCESSES"] = "SomeNumber" is given?

Because otherwise our version update would break their code.

@nesnoj
Collaborator

nesnoj commented Feb 24, 2025

In my tests above it always worked without guarding the top-level code with __name__ == '__main__' (Linux) 🤔

@FlorianK13
Member

Yes, I guess this is a problem appearing only on Windows.
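
That matches the multiprocessing start-method defaults; a quick way to check, for illustration:

import multiprocessing as mp

if __name__ == "__main__":
    # Linux defaults to fork (children inherit the parent's state), while
    # Windows and macOS (Python 3.8+) default to spawn, which re-imports
    # the main module and therefore needs the __main__ guard
    print(mp.get_start_method())  # 'fork' on Linux, 'spawn' on Windows/macOS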

@AlexandraImbrisca
Contributor Author

Great idea! I updated the code to not use multiprocessing unless one of the 2 options is set (USE_RECOMMENDED_NUMBER_OF_PROCESSES / NUMBER_OF_PROCESSES; see the sketch below). I also added a note to the documentation about if __name__ == "__main__"
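
A minimal sketch of the gating, assuming an illustrative helper name; the two environment variable names are the ones introduced in this PR.

import os

def resolve_number_of_processes() -> int:
    if os.environ.get("USE_RECOMMENDED_NUMBER_OF_PROCESSES"):
        # rule of thumb discussed above: leave one core for the system
        return max(1, (os.cpu_count() or 1) - 1)
    return int(os.environ.get("NUMBER_OF_PROCESSES", "1"))

if resolve_number_of_processes() == 1:
    print("sequential path: no process pool, no __main__ guard required")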

@AlexandraImbrisca
Contributor Author

@FlorianK13 @nesnoj small reminder if you can review these changes again! 🙏🏻

@nesnoj
Collaborator

nesnoj commented Mar 6, 2025

@FlorianK13

@FlorianK13
Member

A simple

db = Mastr()
db.download(date="existing", data="wind")

on windows now works again 👍

@FlorianK13 FlorianK13 left a comment
Member

Only two small changes. After that I suggest we can merge this branch and start with #602 on the development branch. Do you agree @nesnoj ?

@nesnoj
Collaborator

nesnoj commented Mar 10, 2025

Only two small changes. After that I suggest we can merge this branch and start with #602 on the development branch. Do you agree @nesnoj ?

yes, go for it!

@AlexandraImbrisca
Contributor Author

Sounds great @FlorianK13, thank you! I fixed the ruff linter warnings

@FlorianK13 FlorianK13 merged commit e277d4f into OpenEnergyPlatform:develop Mar 11, 2025
0 of 9 checks passed
@nesnoj nesnoj mentioned this pull request Apr 19, 2025