Fortran scanner misses comments in some cases #4454

Open
mwichmann opened this issue Dec 29, 2023 · 1 comment

mwichmann commented Dec 29, 2023

This is to document, in an issue, a limitation of the Fortran Scanner. The scanner itself does describe the limitation in comments, but not why it exists.

The scanner uses regexes to find USE and INCLUDE statements in Fortran source. These regexes cannot handle a line that combines multiple semicolon-separated statements with comment marks. These lines from the unit test (SCons/Scanner/FortranTests.py) all give the wrong result:

!     USE modia ; use modib  # expect nothing, get modib
      USE mod14 !; USE modi  # expect mod14, get both
      USE mod15!;USE modi  # expect mod15, get both
      USE mod16  !  ;  USE  modi  # expect mod16, get both
; USE modi  # expect nothing (??), get modi

The regex considers the semicolon as the start of a new bit of text to scan, so in each case, what comes before has no effect. The regexes are applied in multiline mode, FWIW. It doesn't matter where the comment mark is or how much whitespace there is, the comment appears to "end" at the semicolon. The fifth example should (apparently) be considered a syntax error (and thus ignored?), but as scanning starts on the blank after the semicolon, there is no complaint.
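The behavior can be reproduced with a much smaller pattern. The sketch below uses a simplified stand-in for the USE regex (an assumption for illustration, not the actual pattern from the scanner) that, like the real one, treats either the start of a line or a semicolon as the start of a statement:

```python
import re

# Simplified stand-in for the scanner's USE pattern (an assumption, not the
# real SCons regex): a statement starts at the beginning of a line or after ";".
USE_RE = re.compile(r"(?:^|;)\s*USE\s+(\w+)", re.IGNORECASE | re.MULTILINE)

SAMPLES = [
    "!     USE modia ; use modib",
    "      USE mod14 !; USE modi",
    "      USE mod15!;USE modi",
    "      USE mod16  !  ;  USE  modi",
    "; USE modi",
]


def find_modules(line):
    """Return the module names the simplified pattern finds in one line."""
    return USE_RE.findall(line)


for sample in SAMPLES:
    print(repr(sample), "->", find_modules(sample))
```

Nothing in the pattern knows about "!", so a semicolon inside a comment is as good a starting point as any other.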

Python's re module allows only fixed-width look-behind patterns, so there is no legal way to express "semicolon if not preceded by ! and possibly some other stuff"; trying to do so produces an error from the re module. There also doesn't seem to be a simple way to express "if the line begins with a comment, don't do anything more with it". It's not hard to write a bit of regex that says ignore from a character until the line ending, but interleaving that with the already fairly complex regex in use is something else.
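For reference, the look-behind restriction is easy to see in isolation. In this small sketch the helper name `compiles` is made up for the example; the two patterns are a fixed-width look-behind, which is accepted, and a variable-width one, which is rejected:

```python
import re


def compiles(pattern):
    """Return True if the stdlib re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        print(f"{pattern!r} rejected: {exc}")
        return False


# ";" not immediately preceded by "!": fixed width, accepted.
fixed_ok = compiles(r"(?<!!);")

# ";" not preceded by "!" plus arbitrary text: variable width, rejected.
variable_ok = compiles(r"(?<!!.*);")
```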

From some digging around the internet, it seems possible to encode this without a look-behind, at the cost of a considerably more complex regex pattern. That might be something to explore, but we seem to lack that level of expertise. The non-stdlib regex module can reportedly do look-behind with a variable-length pattern; a simple attempt to code it that way gave no error, but didn't help.
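One candidate for the "no look-behind" route is to let the pattern match, and throw away, the things that should be skipped. This is only a sketch under assumptions: the USE part is a simplified stand-in rather than the scanner's real pattern, the name `TOKEN_RE` is invented here, and it covers only free-form "!" comments and character literals that do not continue across lines. Comments and literals are listed as alternatives ahead of the USE alternative, so they consume their text before a semicolon inside them can be seen:

```python
import re

# Alternatives are tried left to right at each position. Literals and comments
# match without a capture group, so only the USE alternative yields a name.
TOKEN_RE = re.compile(
    r"""
      '[^'\n]*'                 # character literal delimited by apostrophes
    | "[^"\n]*"                 # character literal delimited by quotes
    | ![^\n]*                   # comment: consume through end of line
    | (?:^|;)\s*USE\s+(\w+)     # simplified USE statement (assumption)
    """,
    re.IGNORECASE | re.MULTILINE | re.VERBOSE,
)


def find_modules(text):
    """Return module names from USE statements outside comments and literals."""
    return [m.group(1) for m in TOKEN_RE.finditer(text) if m.group(1)]


print(find_modules("      USE mod14 !; USE modi"))
print(find_modules("!     USE modia ; use modib"))
```

The cost is that callers must filter out the matches with an empty group, which changes how the pattern can be used compared to a plain `findall`.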

Probably we should find a place to document this and suggest that the easiest workaround, if it causes problems (it's not clear there's a real-world problem here, just a broken test when a change was made in the scanner module), is to "not do that" - so instead of:

      USE mod14 !; USE modi

do:

      USE mod14
!     USE modi

One suggestion from Discord was to pre-process the file being scanned to remove comments; it's not clear how easy that is. Can comment marks appear in a line inside some other construct such that they are not considered comment indicators?

@mwichmann
Collaborator Author

To follow up on the final paragraph of the issue: in order to strip comments, we'd need to understand the syntax involved. The complication, as always, is context: a "!" inside a string (though Fortran does not use that terminology) should not be considered a comment indicator.

From the standard:

6.3.2.3 Free form commentary
The character “!” initiates a comment except where it appears within a character context. The comment extends to the end of the line. If the first nonblank character on a line is an “!”, the line is a comment line. Lines containing only blanks or containing no characters are also comment lines. Comments may appear anywhere in a program unit and may precede the first statement of a program unit or follow the last statement of a program unit. Comments have no effect on the interpretation of the program unit.

6.3.3.2 Fixed form commentary
The character “!” initiates a comment except where it appears within a character context or in character position 6. The comment extends to the end of the line. If the first nonblank character on a line is an “!” in any character position other than character position 6, the line is a comment line. Lines beginning with a “C” or “*” in character position 1 and lines containing only blanks are also comment lines. Comments may appear anywhere in a program unit and may precede the first statement of the program unit or follow the last statement of a program unit.

Note that the existing regex does not attempt to recognize the fixed form "C or * in column 1".

and

3.21 character context
within a character literal constant (7.4.4) or within a character string edit descriptor (13.3.2)
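Read together, the free-form rule and the definition of character context suggest a small state machine rather than a regex. The following is a rough sketch, not a proposal for the scanner: the function name is hypothetical, it handles free form only, and it assumes character literals do not continue across lines. A doubled delimiter inside a literal (as in 'it''s') works out because the literal simply closes and reopens:

```python
def strip_free_form_comment(line):
    """Return the line with any trailing "!" comment removed.

    Assumptions: free-form source, one physical line at a time, and no
    character literal continued from a previous line.
    """
    quote = None  # delimiter of the character literal we are inside, if any
    for index, char in enumerate(line):
        if quote is not None:
            if char == quote:
                quote = None  # literal closed (a doubled quote reopens it)
        elif char in "'\"":
            quote = char  # entering a character context
        elif char == "!":
            return line[:index]  # comment runs to the end of the line
    return line


print(repr(strip_free_form_comment("      USE mod14 !; USE modi")))
print(repr(strip_free_form_comment("      print *, 'wow!' ! a real comment")))
```

Fixed form would additionally need the column 6 exception and the "C" or "*" in column 1 rule, neither of which is attempted here.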

And for the scanner issue, which involves inline statement termination, here's the wording:

6.3.2.5 Free form statement termination
If a statement is not continued, a comment or the end of the line terminates the statement.
A statement may alternatively be terminated by a “;” character that appears other than in a character context or in a comment. The “;” is not part of the statement. After a “;” terminator, another statement may appear on the same line, or begin on that line and be continued. A sequence consisting only of zero or more blanks and one or more “;” terminators, in any order, is equivalent to a single “;” terminator.

6.3.3.4 Fixed form statement termination
If a statement is not continued, a comment or the end of the line terminates the statement.
A statement may alternatively be terminated by a “;” character that appears other than in a character context, in a comment, or in character position 6. The “;” is not part of the statement. After a “;” terminator, another statement may begin on the same line, or begin on that line and be continued. A “;” shall not appear as the first nonblank character on an initial line. A sequence consisting only of zero or more blanks and one or more “;” terminators, in any order, is equivalent to a single “;” terminator.
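Taken literally, the termination rules say a ";" only counts outside a character context and outside a comment. A minimal sketch of splitting one free-form line that way, then looking for USE in each piece, might look like this. All names are hypothetical, the USE pattern is a simplified stand-in rather than the scanner's real one, and continuation lines and fixed-form rules are ignored:

```python
import re

# Simplified USE matcher applied to one statement at a time (an assumption).
USE_RE = re.compile(r"USE\s+(\w+)", re.IGNORECASE)


def split_statements(line):
    """Split a free-form line on ";" outside literals; stop at a "!" comment."""
    statements, current, quote = [], [], None
    for char in line:
        if quote is not None:
            current.append(char)
            if char == quote:
                quote = None  # leaving the character context
        elif char in "'\"":
            quote = char
            current.append(char)
        elif char == "!":
            break  # the rest of the line is a comment
        elif char == ";":
            statements.append("".join(current))
            current = []
        else:
            current.append(char)
    statements.append("".join(current))
    # Runs of blanks and ";" collapse to one terminator: drop empty pieces.
    return [s.strip() for s in statements if s.strip()]


def used_modules(line):
    """Return module names from USE statements on one line."""
    matches = (USE_RE.match(stmt) for stmt in split_statements(line))
    return [m.group(1) for m in matches if m]


print(used_modules("      USE mod14 !; USE modi"))
print(used_modules("      USE moda ; USE modb"))
```

This sketch still reports modi for the "; USE modi" line, since it does not enforce the fixed-form rule that ";" shall not be the first nonblank character on an initial line.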
