Support for UTF8 characters in input string
myell0w committed May 2, 2013
1 parent de7f594 commit 4e649d4
Showing 3 changed files with 961 additions and 770 deletions.
14 changes: 14 additions & 0 deletions grammar/markdown.grammar
@@ -1,6 +1,19 @@
%option case-insensitive
%option reentrant

/*
Match each UTF-8 encoded character as a single token instead of byte by byte.
http://stackoverflow.com/questions/10252777/making-lex-to-read-utf-8-doesnt-work?lq=1
*/
u2a [\xC2-\xDF][\x80-\xBF]
u2b \xE0[\xA0-\xBF][\x80-\xBF]
u3a [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
u3b \xED[\x80-\x9F][\x80-\xBF]
u4a \xF0[\x90-\xBF][\x80-\xBF]{2}
u4b [\xF1-\xF3][\x80-\xBF]{3}
u4c \xF4[\x80-\x8F][\x80-\xBF]{2}
utf_8 {u2a}|{u2b}|{u3a}|{u3b}|{u4a}|{u4b}|{u4c}

h [0-9a-f]
nonascii [\200-\377]
unicode \\{h}{1,6}[ \t\r\n\f]?
@@ -51,6 +64,7 @@ nl \n|\r\n|\r|\f
[ ]{2,}$ {markdownConsume(yytext, MARKDOWNNEWLINE, yyscanner);}
[\n]{2,} {markdownConsume(yytext, MARKDOWNPARAGRAPH, yyscanner);}
[ \n\t\f] {markdownConsume(yytext, MARKDOWNUNKNOWN, yyscanner);}
{utf_8} {markdownConsume(yytext, MARKDOWNUNKNOWN, yyscanner);}
. {markdownConsume(yytext, MARKDOWNUNKNOWN, yyscanner);}

%%
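
To see what the new `{utf_8}` rule buys, the definitions `u2a` through `u4c` can be mirrored as a byte-level regular expression outside of flex. The following is only a sketch in Python, not part of this commit: the names `UTF8_CHAR` and `tokenize` are hypothetical, and the single-byte fallback only loosely stands in for the grammar's `.` rule.

```python
import re

# Byte-level translation of the flex definitions u2a..u4c from the diff.
# UTF8_CHAR is a hypothetical name used for illustration only.
UTF8_CHAR = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"              # u2a: 2-byte sequences (no overlong C0/C1)
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # u2b: 3-byte, lead E0, no overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # u3a: remaining 3-byte leads except ED
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # u3b: lead ED, excludes UTF-16 surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # u4a: 4-byte, lead F0, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # u4b: 4-byte, leads F1..F3
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"      # u4c: 4-byte, capped at U+10FFFF
)


def tokenize(data: bytes) -> list:
    """Split bytes into tokens: one per multi-byte UTF-8 character,
    otherwise one per byte (assumed stand-in for the `.` fallback rule)."""
    tokens = []
    pos = 0
    while pos < len(data):
        match = UTF8_CHAR.match(data, pos)
        end = match.end() if match else pos + 1
        tokens.append(data[pos:end])
        pos = end
    return tokens


if __name__ == "__main__":
    sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
    for token in tokenize(sample):
        print(token)
```

Each multi-byte character comes out as one token, while ASCII bytes and invalid bytes fall through one byte at a time. In flex the same effect follows from longest-match: `{utf_8}` matches more bytes than `.`, so it wins for any valid multi-byte sequence.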