Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some utils into LexAccessor and StyleContext #54

Closed
zufuliu opened this issue Jan 14, 2022 · 11 comments
Closed

Add some utils into LexAccessor and StyleContext #54

zufuliu opened this issue Jan 14, 2022 · 11 comments
Labels
lexlib The utility library used by lexers

Comments

@zufuliu
Copy link
Contributor

zufuliu commented Jan 14, 2022

lexlib-0114.zip

code extract from https://github.com/zufuliu/notepad2/blob/main/scintilla/lexlib/LexAccessor.h and https://github.com/zufuliu/notepad2/blob/main/scintilla/lexlib/StyleContext.h
LexAccessor added GetCharacterAndWidth() and StyleAtEx() (to get cache style without call Flush()).
StyleContext added SeekTo(), Rewind(), BackTo() and Advance(). GetDocNextChar() and GetLineNextChar() (currently only skip ASCII whitespaces) may also useful for writing lexers.

@zufuliu
Copy link
Contributor Author

zufuliu commented Feb 8, 2022

BackTo() need updated to following to avoid unneeded flush.

void BackTo(Sci_PositionU startPos) {
	assert(startPos <= styler.GetStartSegment());
	if (startPos < styler.GetStartSegment()) {
		styler.Flush();
		styler.StartAt(startPos);
		styler.StartSegment(startPos);
	}
	SeekTo(startPos);
}

These function are useful to write advanced lexers with try parse rewind on failure approach, e.g. my enhanced lexer for GitHub Flavored Markdown, GitLab Flavored Markdown and Pandoc’s Markdown (though it still needs further work to handle indented code block).
https://github.com/zufuliu/notepad2/blob/main/scintilla/lexers/LexMarkdown.cxx

@nyamatongwe
Copy link
Member

StyleAtEx is worthwhile but should have a name that explains the behavioural difference. Uncertain about name - maybe StyleAtReadingBuffer or BufferedStyleAt.

GetCharacterAndWidth could be useful in lexers. It might be nicer to return a struct (like Document::ExtractCharacter) or tuple instead of using an optional pointer argument. IDocument doesn't do this due to potential for ABI issues between separately built executables.

The implementation of SeekTo (and thus the other methods that call it) doesn't appear to set all the fields of StyleContext - atLineStart and chPrev is zeroed instead of setting to the previous character. Possibly others. There would be additional issues if seeking between lines - maybe this should be detected (throw?) or documented as a limitation.

IDocument::LineEnd returns the position before the line end characters so MatchLineEnd in the branch isn't behaving the same as in the main Lexilla repository. The point of MatchLineEnd is to switch styles for both single and multi-byte line ends and to eliminate situations where '\r' and '\n' are in different styles.

@zufuliu
Copy link
Contributor Author

zufuliu commented Feb 12, 2022

for StyleAtEx , what about CachedStyleAt?

It should documented SeekTo, BackTo and Rewind is intended for seeking on same line, multiline seeking is complex (line state changes).

chPrev is not needed in most time:

  1. forward seeking is intended to skip some characters, which can limit the length to ensure chPrev is set after next Forward() call;
  2. rewind means something failed, the lexer should call Forward() after call Rewind() or BackTo() to avoid infinite loop, which will set chPrev. if a lexer need re-parse text with another syntax which needs chPrev, following changes will help.

chPrev can only be set after change Document::GetCharacterAndWidth() to something like following to get character before position (the change it self might be useful):

if (pWidth && *pWidth < 0) {
	position = MovePositionOutsideChar(position, -1, false);
}

nyamatongwe pushed a commit that referenced this issue Feb 17, 2022
@nyamatongwe
Copy link
Member

Pondered the name a bit and went for BufferStyleAt as the buffer is being written into only from the lexer. CachedStyleAt would be more appropriate if a block of styles was loaded from the document.

@nyamatongwe
Copy link
Member

The proposed LexAccessor::GetCharacterAndWidth encourages mutable state. Returning a struct enables more use of const.

const FullCharacter ch = styler.CharacterAt(nextPos);
const auto [ch, width] = styler.CharacterAt(nextPos);
--- a/lexlib/LexAccessor.h
+++ b/lexlib/LexAccessor.h
@@ -12,6 +12,11 @@ namespace Lexilla {
 
 enum class EncodingType { eightBit, unicode, dbcs };
 
+struct FullCharacter {
+	int character = 0;
+	Sci_Position width = 1;
+};
+
 class LexAccessor {
 private:
 	Scintilla::IDocument *pAccess;
@@ -80,6 +85,11 @@ public:
 	Scintilla::IDocument *MultiByteAccess() const noexcept {
 		return pAccess;
 	}
+	FullCharacter CharacterAt(Sci_Position position) const {
+		FullCharacter fullCharacter;
+		fullCharacter.character = pAccess->GetCharacterAndWidth(position, &fullCharacter.width);
+		return fullCharacter;
+	}
 	/** Safe version of operator[], returning a defined value for invalid position. */
 	char SafeGetCharAt(Sci_Position position, char chDefault=' ') {
 		if (position < startPos || position >= endPos) {

@zufuliu
Copy link
Contributor Author

zufuliu commented Feb 19, 2022

OK then. Maybe CharacterAfter(Sci_Position position) and CharacterBefore(Sci_Position position) can also be added (with wraps GetRelativePosition and CharacterAt to avoid my proposed changes to Document::GetCharacterAndWidth()).

@zufuliu
Copy link
Contributor Author

zufuliu commented Feb 25, 2022

StyleContext-0225.zip

Updated SeekTo() to set atLineStart and chPrev; Updated BackTo() to support multiline rewind (which is need in some cases like implement Markdown multiline link text). Added comment for these methods.

@zufuliu
Copy link
Contributor Author

zufuliu commented Feb 25, 2022

There also needs better names for these method, may be rename SeekTo to MoveTo and rename BackTo to SeekTo is better, as the later can be used for multiline forward seek.
StyleContext-0225-rename.zip

@nyamatongwe nyamatongwe added the lexlib The utility library used by lexers label Feb 27, 2022
@nyamatongwe
Copy link
Member

SIZE_MAX requires cstdint or stdint.h. It often works because other headers include cstdint but it may break unusual systems.

The StyleContext constructor isn't simple, isn't performance-sensitive and would be better out-of-line in the implementation file, hiding the implementation.

Linked is a patch that does that. The patch does some of the same things as the above patches. It also makes fields const where possible, makes lineDocEnd private, initializes with field assignments where trivial, initializes using the initializer list when dependent on data.
54StyleContext.patch

nyamatongwe added a commit that referenced this issue Mar 10, 2022
… for fields

that do not change, initialize in field declaration or initializer list instead of constructor.
@zufuliu zufuliu closed this as not planned Won't fix, can't repro, duplicate, stale Sep 21, 2022
@nyamatongwe
Copy link
Member

The seeking methods are the main outstanding feature of this proposal. I think something similar could be useful but its difficult to understand when and how to use them (particularly the single line / multiple line split) so its less likely lexer developers will incorporate these techniques. Perhaps there is a subset that is strongly motivated and could be presented as a simple technique to use. An approach could be to implement the feature inside a lexer (possibly with a StyleContext subclass) then refine it until it appears finished then move it up into StyleContext.

@zufuliu
Copy link
Contributor Author

zufuliu commented Sep 21, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lexlib The utility library used by lexers
Projects
None yet
Development

No branches or pull requests

2 participants