
Refactoring parsers #22

Merged
merged 126 commits into from Mar 31, 2022

Conversation

@simonbowly (Member)
@maliheha and I are working on this on my fork. These updates aim to separate the parsers for the different sections of the log into submodules, so things are easier to unit-test and modify as we go.

@mattmilten no need to review for now, we'll let you know when it's in a more complete state.

@ronaldvdv we noticed you're also adding some tests and example data - can we fold this into the refactored code? Just want to make sure we aren't duplicating work and writing tests that may eventually clash with one another. Happy to discuss and coordinate together.

maliheha and others added 30 commits December 27, 2021 15:58
Added temporary tests asserting against current parser outputs
from log files in the data directory
Separate parsing components for norel and nodelog sections.
Replace main code sections with the new parsers.
Use common interface and helper functions
maliheha and others added 12 commits February 15, 2022 22:57
Update tox config and package data specs
Move loaders for parameter defaults and descriptions to
dedicated grblogtools.parameters package
merged_logs argument is not needed, we always collect multiple
logs from a file and use a LogNumber column to distinguish them
Avoid trying to create Log/ModelFile/Model columns when
ModelFilePath is not available.
Co-authored-by: Maliheh Aramon <maliheha@users.noreply.github.com>
@simonbowly (Member Author)

@mattmilten we are pretty close to done with refactoring in this PR :) There are some very minor behaviour changes which we've documented in the changelog. To try to avoid regressions, the previous code is still in the repo, and tests/test_regression.py tests the refactored code against the v1.3.2 code (we should delete both of these before finalising).

When you have a chance to review could you let us know your thoughts? Probably a few things could still be tidied up but we're quite happy with the structure.

Thanks @maliheha for your great work on this so far, and for putting together the summary of the new structure below:

  • In the new design, parsers form the innermost layer, each responsible for parsing a specific section of the log file.
  • The parsers can be found in src/grblogtools/parsers/. Each parser has a main method parse() that takes a single line as input and returns True/False depending on whether the line is matched by a pattern associated with the parser.
  • Each parser has two other methods, get_summary() and get_progress() (where applicable), which return the parsed summary or the detailed progress.
  • The next layer is the SingleLogParser, which is responsible for parsing a single log run.
  • The SingleLogParser (see src/grblogtools/parsers/single_log.py) also has a main method parse(), which takes a line as input and returns True/False depending on whether the line is matched. It keeps two internal variables, current_parser and future_parsers, updated as it sees more lines. It initializes current_parser to the header parser and future_parsers to the list of remaining parsers. If a line is not matched by current_parser, the SingleLogParser checks whether the line should be handed to any of the remaining parsers. As soon as one of the future_parsers returns True, current_parser and future_parsers are updated.
  • The final (API) layer (see src/grblogtools/api.py) also has a method named parse(), which takes strings of log file patterns as individual arguments and returns a ParseResult object. The summary() and progress() methods of the ParseResult class return the summary dict and search progress information as before.
  • The main method of the ParseResult class is parse(), which takes the path to a single log file (the file can include multiple runs) and uses SingleLogParser objects to parse each run.
  • The API also includes a legacy function named get_dataframe(logfiles, timelines=False, prettyparams=False); i.e. the API is unchanged apart from dropping merged_logs (merged logs are now handled automatically).
  • A usage example of the API can be found in the header of the api.py module.
  • Unit tests can be found in tests/, with separate tests for the individual parsers and the API.
  • The file test_refactor_regression.py contains regression tests which ensure equivalent outputs from the current grblogtools API and the newly designed API.
  • The tests run against the current test data in the data folder and the newly added test data in tests/assets.
  • The plotting API is left untouched; there is no plan to refactor it as part of this project.
  • Extending the current API to tuner, multi-objective, concurrent, and distributed optimization logs is the next step; these extensions can be incorporated directly into the main grblogtools repo once the current code base is reviewed and, hopefully, merged.
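The layering described above can be roughly sketched as follows. This is a simplified illustration, not the actual grblogtools code: the SectionParser class, its regex patterns, and the field names are invented for the example; only the parse()/get_summary() interface and the current_parser/future_parsers hand-off mirror the description.

```python
import re


class SectionParser:
    """Minimal stand-in for one section parser exposing the
    parse()/get_summary() interface described above."""

    def __init__(self, pattern: str):
        self._pattern = re.compile(pattern)
        self._summary = {}

    def parse(self, line: str) -> bool:
        """Return True if the line matches this parser's pattern."""
        match = self._pattern.search(line)
        if match:
            self._summary.update(match.groupdict())
            return True
        return False

    def get_summary(self) -> dict:
        return self._summary


class SingleLogParser:
    """Sketch of the current_parser/future_parsers hand-off for one run."""

    def __init__(self, parsers):
        # The header parser starts as current; the rest wait in order.
        self.current_parser = parsers[0]
        self.future_parsers = list(parsers[1:])

    def parse(self, line: str) -> bool:
        # First offer the line to the current parser.
        if self.current_parser.parse(line):
            return True
        # Otherwise, check whether a later section has started.
        for i, parser in enumerate(self.future_parsers):
            if parser.parse(line):
                self.current_parser = parser
                self.future_parsers = self.future_parsers[i + 1:]
                return True
        return False
```

A caller feeds the log line by line; once a future parser claims a line, it becomes the current parser and everything before it in the list is discarded, so sections are matched in order.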

@mattmilten self-requested a review March 10, 2022 14:55
@mattmilten (Member) left a comment:
Excellent work! One question: would it make sense to define a base class for the different parsers? They all implement the same function anyway but the classes do not seem to be connected by a common base class.

Files with review comments (resolved): src/grblogtools/__init__.py (outdated), src/grblogtools/cli.py (outdated), src/grblogtools/parsers/nodelog.py (outdated), CONTRIBUTING.md, CITATION.cff, README.md
@simonbowly (Member Author)

> Excellent work! One question: would it make sense to define a base class for the different parsers? They all implement the same function anyway but the classes do not seem to be connected by a common base class.

Thanks! I don't think abstract base classes really add much here. There's a common API between classes but the abstract class doesn't add any functionality, so I'm inclined to just leave it as is.
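For reference, if the shared interface ever needs to be stated explicitly without coupling the classes through inheritance, a structural type via typing.Protocol is one lightweight option. This is a sketch only; Parser, HeaderLikeParser, and feed are invented names for illustration, not part of the PR.

```python
from typing import Protocol


class Parser(Protocol):
    """Structural type for the shared parser interface. Classes never
    inherit from it; they only need to provide matching methods."""

    def parse(self, line: str) -> bool: ...

    def get_summary(self) -> dict: ...


class HeaderLikeParser:
    # No inheritance needed: satisfies Parser structurally.
    def parse(self, line: str) -> bool:
        return line.startswith("Gurobi")

    def get_summary(self) -> dict:
        return {}


def feed(parser: Parser, lines: list) -> int:
    """Count lines matched by any object implementing Parser."""
    return sum(parser.parse(line) for line in lines)
```

A type checker such as mypy would flag any class passed to feed() that is missing one of the two methods, which gives the interface guarantee of a base class without the extra machinery.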

Click 8.1.0 broke something in the black formatter.
Temporarily fix to <= 8.0.4
Update author list and bump version
Remove -m (merged logs) option from cli.
Update cli to use new API.
Run api tests against top level import.
Termination regexes could return None values, which then
failed in type conversion.
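The None-value issue mentioned in the last commit can be sketched like this. The pattern and helper below are hypothetical, not the actual grblogtools regexes: an optional regex group that did not participate in the match yields None in groupdict(), and converting it blindly (e.g. float(None)) raises a TypeError, so the conversion has to guard against it.

```python
import re

# Hypothetical termination pattern with an optional gap group;
# the real grblogtools regexes differ.
TERMINATION = re.compile(
    r"Best objective (?P<ObjVal>[^,]+), best bound (?P<ObjBound>[^,]+)"
    r"(?:, gap (?P<Gap>\S+))?"
)


def parse_termination(line: str) -> dict:
    match = TERMINATION.match(line)
    if match is None:
        return {}
    result = {}
    for key, value in match.groupdict().items():
        # Guard: optional groups can be None, and float(None) raises.
        if value is None:
            continue
        try:
            result[key] = float(value.rstrip("%"))
        except ValueError:
            result[key] = value
    return result
```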
(excerpt of the test code under review)

        has seen a proper log start line."""

        parser = HeaderParser()
        parse_lines(parser, ["Presolved: 390 rows, 316 columns, 1803 nonzeros"])
I have seen this happen in the dev log when we have node presolve output too. Unless I am overlooking something, in a typical log the header parser should not be interrupted here, because we should already be in the presolve parser when reaching this line. That said, thank you for the change; it is definitely safer to guard against such patterns.

@simonbowly marked this pull request as ready for review March 31, 2022 00:54
@simonbowly changed the title WIP refactoring parsers Refactoring parsers Mar 31, 2022
@simonbowly
Copy link
Member Author

@mattmilten this is ready to go!

@mattmilten merged commit 6a7783e into Gurobi:master Mar 31, 2022