Strings generated by default should have more usual characters #99

Closed
vi opened this issue Sep 28, 2015 · 20 comments

Comments

@vi
Contributor

vi commented Sep 28, 2015

Typically generated strings look like this:

򒽹򪬛򘌟󥫹񕘼򟶹񍀝󿡥񰫩􂒡𻘒򅇪򍸈򁑙򟪆󴥏􅲴󩇇Ք𡫗󼺆򺩪𾺗𵅟𻎚񦎅񳜿󔈐ᘤ񜾊񺥿򍔛󆬭𙚹񃃣񥊆򇆗򂤇􀑬󨍏󉨫򺌑񽙕񷗮󻐉𿞊񹩵򖬊񣁇󘅛󖛉𼄠󄚷򽕎

񡰎𣒌󹣵

􅘤󀞶񼮳𮕇򀁕񗎰򀇦򹖻񚂫󚾪򃢛ൾ𬫤񘖤􇾷򇍤񮅈弅򥅽񨭴󶈡񫯭𧚼䝠Ᏼ􌷞󻃃񛑓󵎍𯭜􄼧򛷺񚂛񊪧󯱩񩴝𛭑𨀒񡥍󛢘񺐢𵑌眻󖙞񅉊򛻨񆷮󸧖󵔸񎽔򡚥񂼮򆼤񄓭񷬗􂭡򼠀񥉑󓊢򡽍􀟞񮆀񿫇񿌝򒋏񜎀񢝒򋯼񹌈􃔪򈟙󟬠󀧸󀉖𠥤񌡭𧆼񦛼񶰆񂠑𻗛򙄒

򁂺󶼀𰷹𾏉򮘭򈝥󓞤򫚁󙖘󆖙𰻟񇫙𴋅𠈼󹌀凾𖙟𠓕𾭭𗢐񥂡񇫘򫯝󨃖僧󰲺򘳑𛉏񡴶􏹊򔌇ᘼ􎜅󇆯󈤟󷃚􎞎񦖄봄򬩘񎩭򓲜󍥊𰀈􍆔񥝖򅈳𿴛􌆗񍏰񛯚􏡩򵆠񝭆􍘣򅕜򞙄󬍦񜟩󐮐򊽛󘉏񔀊񎇝󞏏񥍇󕎎𑝩񡨭򋡠򪙃󆂵𫟯񩽗𕡫򞠇𤷸򙺛򄎩򍮂򒹣䴒𻉜󸅩􎬙𷆊񲎀񪨌򌀀􄻖񇋒󹜅

񒢩򊟗𥁿򌅞񬕧𞎂񮕄򞿮ҙ󐢉󠢜񭈑󕓴󅸛򕎅𳞇񪉩񞢜򭧘󛦦󎪦󱏚홀񍋧篬󦲫򈧹񍔿񚣪򛍾𡚙󫬴򽉫񈝗񔁯󹕩񦟠󼝹𦼄󹆲񜋑񠇒񇽖􋤋񤔇󐂚񼛍􄼀󮀯𔯀󦃂񵒵򄯎󩎅򓊕򯠎񌬆롇񘺊񵺾𕧨񚾓򍚬󷈮󠢧򦂼𹥞󮾛򼢁􀏜򸬟󮃣򪾂򬁷񯕞񋓝󞭤󰥔񐮃󲏫󻞻򻬒􄑎򉗭򄫛󨶇󔽟򹑹򼹨򨵖񰊅񻣊

񕂍􌥩񵉗񶗖𬉞魊񞦹󙤵𦪀𒮡󄮽𕗠񥽕􂥦􈲹󙁡򞞌𵄐𳎩뚲󄬥򶰂񷷢򝬉񞝟򃹪腎󑍭򒋀򔲙󉺿񲥓𝹯󩬑

񾧍󝫮򎿛窯񕓨􈲶񽶛򩪣򁹡􂏔򛘎򘇀𠷺񎂦񭭸񽎊𐊊𨺿򆊺𗞩􌽂񁸶輬񄃸񅙜񐢳񔧏񈙄𝐳𪉛

𥹠𼞳󾢙𔅿􊭷񾣐򎕊񭳖򕸌󎀛򶃣𜿉􀢛󶄿𲊊񪟛񈘉񲻦񭩂󳄒𾺲𯆊񏸆񌼠񭳒񷞈򁓎􈋖𚜸󿒲򁔀𰤫򫥑򞼲𥐸񥾥󓮿򳖱𗋕󳉋􉠨𖍳󷕝򝘊򏩰󁺱񍰉崳󀁈𒯻򿮥𯙇񥻂󓿭򫙎񄦽񠷎𞸁񵳳󿧻􂯓󹗄򇪾󽘝󉵷򗚫񸐄󘁳񏶌񥵾󧒟򮫰𖪄񞼗􏾢𶕘򊮴𗰼򄌋񺿭򣘸

򵑆򢀍򯪂񏁾𧔥󼖈򬿚󰖓􁯌򁆙񎠤񬴞𝒲󹩝򾜣񰢝󭙯񠵛񹘦󒳼񡫉󺲢񘾥񿔷󾣎埙󊽑򼪐򊰭񠮺禹󾣣𼵝򒹐񓪊󫨬򫄒󦉂񇫄󬆿򻃃񯟶󜚓𷗚𲦂󻔴񮏂򶰎񥊢񰵪񢔀򁻢喒󢷅󟽄񚐚񓼈򭶸􅇆󭏂𜯃𨖉󑶧򪌿𨌽򄔖𱢼򩱼򫮴

𻥻򖿄󎰁𔱷𑒂𼟊􇼅𲂄􈔜󉄘􋫢𖔄񋺆𬎽򛖢򠬔􃸫􀵆𮘜󛥺􂶐򵄤𺐅񀕛񟥠􏽑󹫠򡧄󒸡򰎜򊎰񥑌񇇣𩔒򰅈򵄯񓣏󓢍򑛛򉨸𾓊𷅣𵼤𦌘򓀞𽣟򞴍

񒚞񙶴󫃔񀹮񣍯𵘣󾈣򐜴󄌿𦶞󯛫󄊁񳍮񚶨󡷣𸫂𞳊񏜪󜮶𺾆㮄񆡐󓶭﯈󊎊󿧺􏋔󤍄󁾒򣎦񛦋򯠖􍓙񜠔򇺱𓆦񟣤󝢧򱁕񗽳󶱮󶊼󿓊򾚓򵵆򼫏󄈬򜒕󱦡񘗏򮑏󔼋򂄸󺺘𨛽󚲁񶗜󖿭󣠯򆱩񰊫憎񳥒񼷵𠗆񍻠󽖾𿂆󑓓񎚯򦉠𘊲

𳸈򋸴󂩗񽰋񄨪񏉙򀤵󕳷𚱲򵚑󓶵񀐡󈁡򮖵񜬃񫡵񠬳񬶊𡱭񅼺򝙾񞬁ꈺ𹖉𵌛ロ񒁭􉫒󋟑񔊢򽲓񿨈񑜰񇿴򢣖􍟣򰮍󿘱󄡣󹉚𬥿񋘙𔭦򡎑

󰺿󶄔𬆷󱂌󉬞󾪢񖖖񮊕񝧃򾽼񨤳𸾂𣑌󸕖󹴌𬜄𑼦􌷱򩔤𖌸񥺼񗎱񗱐󲣑𴒨󴹒𹒪󋊌󨴖򙀁򁿵񫘕󐇯򂯲񵕌񁢜󹤨򙢄񨓩򄠑񻨸𻍦󦩒񬪸󻴴񜀛򝇩񼗕򣚉󵏛󣹥򉭞򦚎񀺲񆛜󮟏񟒫𲹇򋼡򷧘񿩬󎭝򪆀𛑛񻉞󝫨𡞱㆒

򼋡󯻭􁤛񎔝󸴭󜈰񱚱픦󎍆𢐻򷟝񞮵蘪󖣢𫭪𲊺󿜚񈅗򿧙󿴞񫌱衯򉕭򊞴қ󴙠𱜯񙐰򍊌񮨬󪾪򮝂𾨆񱘾񡠇팢򗡗򽑶򴸟𙅡򎋎𐠮󶖉򄡏󟛿􏠰򟿢񚨬憲򭫉􉌞񘮝󗕖񡏤𮺀񫐀𔚈񷋭𙪢򇬣򛡡񗚡򓛏򘌶𦷴񶵍򮵔숥𲟪

񘿘𛀦򵊻󠑜󼾐𙈨󹲃𸯕𜆦淅󍵮𰂘񛦑󛫌󝌲󔕇󹙮񧃷񧈵񠶄򙹫󓰆򒈐􃎢邮򣻖񩹖𓋱𥥫񩘆

It fails (or takes a long time) to find simple cases, such as when something starts with a space, a special character, or the like.

I think normal characters should be preferred to deeply Unicode things.

I think by default there should be tiers of character classes:

  1. Bug-prone characters: whitespace, punctuation, control characters, maybe up to 5 selected Unicode oddities like zero-width space;
  2. [a-zA-Z0-9_];
  3. Unicode characters in the first plane;
  4. Everything else.

The character generator could then aim for, for example, 10% bug-prone chars, 40% [a-zA-Z0-9_], 40% basic Unicode characters, and 10% everything else.
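The tiered proposal above can be sketched roughly as follows. This is only an illustration, not quickcheck's implementation: the `SplitMix64` helper, the `BUG_PRONE` list, the function names, and the choice of U+0080..=U+FFFF for "first plane" are all assumptions made for the example, and a real implementation would draw from quickcheck's own generator instead.

```rust
/// Tiny self-contained PRNG (SplitMix64), used here only to avoid external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in 0..n (slight modulo bias, acceptable for a sketch).
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Tier 1 (assumed selection): whitespace, punctuation, control characters
/// and a few tricky Unicode characters (zero-width space, BOM, RTL override).
const BUG_PRONE: &[char] = &[
    ' ', '\t', '\n', '\r', '\0', '"', '\'', '\\', '/', '.', ',', ';',
    '\u{200B}', '\u{FEFF}', '\u{202E}',
];

/// Tier 2: [a-zA-Z0-9_].
const WORD: &[u8] = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_";

/// Uniform scalar value in lo..=hi; surrogates are rejected and redrawn.
fn char_in_range(rng: &mut SplitMix64, lo: u32, hi: u32) -> char {
    loop {
        let v = lo + rng.below((hi - lo + 1) as u64) as u32;
        if let Some(c) = char::from_u32(v) {
            return c;
        }
    }
}

/// Picks a tier with weights 10/40/40/10, then a character within that tier.
fn gen_char(rng: &mut SplitMix64) -> char {
    match rng.below(100) {
        0..=9 => BUG_PRONE[rng.below(BUG_PRONE.len() as u64) as usize],
        10..=49 => WORD[rng.below(WORD.len() as u64) as usize] as char,
        50..=89 => char_in_range(rng, 0x80, 0xFFFF), // rest of the first plane
        _ => char_in_range(rng, 0x1_0000, 0x10_FFFF), // everything else
    }
}

fn gen_string(rng: &mut SplitMix64, len: usize) -> String {
    (0..len).map(|_| gen_char(rng)).collect()
}
```

Calling `gen_string(&mut SplitMix64(42), 20)` yields a 20-character string in which every code point can still appear, but word characters and bug-prone characters dominate.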

@BurntSushi
Owner

This isn't a bad idea. It used to be ASCII only. But clearly, that is also an extreme. I often find myself defining new types that restrict the set of possible values to make debugging easier.

See also: #77

@shepmaster
Contributor

Copying my idea from #77:

I could conceive of a string with only ASCII as being "smaller" than one with ASCII + ASCII punctuation. Then it could "grow" to include more common Unicode, then "grow" towards uncommon.
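One way to picture this "smaller to larger" ordering is a rank per character, which a shrinker could use to propose simpler strings. The sketch below is hypothetical: the `rank`, `string_rank` and `simplify_once` names, the exact rank boundaries (in particular where control characters fall) and the use of `'a'` as the replacement are assumptions, not anything quickcheck defines.

```rust
/// Hypothetical complexity rank: lower means easier to read in a failure report.
fn rank(c: char) -> u8 {
    if c.is_ascii_alphanumeric() {
        0 // plain ASCII letters and digits
    } else if c.is_ascii_punctuation() || c == ' ' {
        1 // ASCII punctuation and space
    } else if c.is_ascii() {
        2 // remaining ASCII, i.e. control characters (placement is an assumption)
    } else if (c as u32) <= 0xFFFF {
        3 // "more common" Unicode: the first plane
    } else {
        4 // "uncommon" Unicode: everything above the first plane
    }
}

/// A string is as complex as its most complex character.
fn string_rank(s: &str) -> u8 {
    s.chars().map(rank).max().unwrap_or(0)
}

/// One shrink candidate: replace every character of the highest rank with 'a'.
/// Returns None when the string is already at the lowest rank.
fn simplify_once(s: &str) -> Option<String> {
    let top = string_rank(s);
    if top == 0 {
        return None;
    }
    Some(s.chars().map(|c| if rank(c) == top { 'a' } else { c }).collect())
}
```

A shrinker built on this would keep a candidate only if the property still fails on it, so a failure that truly depends on an uncommon character would stay at the higher rank.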

@vi
Contributor Author

vi commented Dec 8, 2015

@shepmaster, Maybe first uncommon/tricky (zero-width things, BOM, left-to-right), then numerous common?

Each Unicode codepoint has equal weight (time required to test with it), but unequal usefulness (probability that this codepoint catches some bug).

@shepmaster
Contributor

It's an interesting question: what is the most useful order to iterate through test cases? I'd think most people using quickcheck would want it to find things that they haven't thought of (at least it's true for me!). However, once something is found, we want it reduced to something that we can wrap our brains around.

I think that "simple ASCII" will often be the easy-to-understand group of characters. The problem is going to be that different usages of Strings will have different "tricky" bits. Perhaps your area of code is more likely to have issues with BOMs, but mine with control characters. I'd doubt there's One True Order.

@vi
Contributor Author

vi commented Dec 9, 2015

But I expect that a problem will rarely come up with exactly, for example, the character U+12345 CUNEIFORM SIGN URU TIMES KI (𒍅) and not with other high-plane characters. Yet including all high-plane characters significantly enlarges the testing space, and they outnumber the more useful characters. So for "lesser" strings you can keep just one high-plane character.

Imagine the table:

| Option | Weight | Usefulness notes |
|---|---|---|
| Don't include high-plane characters in "easy set" | smallest | Won't catch respective problems |
| Include just one high-plane character in "easy set" | small | Likely to catch the problem with such characters |
| Include all high-plane characters in "easy set" | big | Only slightly more chance to catch such problems compared to the previous row |

Small character classes (control characters, whitespace) should be included entirely.
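As a rough illustration of such an "easy set", the sketch below includes the small classes entirely and lets a single representative stand in for the huge high-plane class. The `easy_set` name and the exact membership are assumptions for the example; U+12345 is simply the character used in this thread.

```rust
/// Hypothetical "easy set": small classes in full, one stand-in for the big class.
fn easy_set() -> Vec<char> {
    let mut set: Vec<char> = Vec::new();

    // Small classes are included entirely: ASCII control characters and space
    // (U+0000..=U+0020), plus DEL.
    set.extend((0u8..=0x20).map(char::from));
    set.push('\u{7F}');

    // Ordinary letters and digits.
    set.extend(('a'..='z').chain('A'..='Z').chain('0'..='9'));

    // One representative for all high-plane characters.
    set.push('\u{12345}');

    set
}
```

With these choices the set has 97 members, of which exactly one lies above the first plane, so a bug triggered by "any high-plane character" is still reachable without that class swamping the rest.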

@BurntSushi
Owner

I somewhat feel like the obvious behavior for String is the current behavior: any Unicode codepoint is fair game.

With that said, there's no reason why quickcheck couldn't define a few other newtypes around String that correspond to useful subsets of Unicode. (If we go that route, I would prefer to keep the number of such types in quickcheck proper very small.)

@vi
Contributor Author

vi commented Dec 9, 2015

What does "fair game" mean? In my idea any codepoint can appear, but the probabilities should differ drastically.

A useful subset may fail to find a problem even if run for long enough.

| Option | Speed | Immediate results | Long-term results |
|---|---|---|---|
| All codepoints regularly distributed (current) | slow | few | all |
| "Useful subset" | fast | moderate | moderate |
| All codepoints, but not regularly distributed (proposed) | medium | moderate | all |

For example, if a function breaks only when fed a string with three spaces in a row, I expect that to be found fast. If a function breaks only when fed three 𒍅s in a row, that is expected to be found more slowly (because a space is much more likely to trigger bugs than some arbitrary character).
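To put rough numbers on this, the sketch below computes the chance that three given consecutive positions all hold one specific character. The uniform figure follows from the number of Unicode scalar values; the weighted figures assume, purely for illustration, a 10% bug-prone tier shared by 15 characters and a 10% tier spread over all high-plane code points. The function names are hypothetical.

```rust
/// Number of Unicode scalar values: 0x110000 code points minus 2048 surrogates.
const SCALAR_VALUES: f64 = 1_112_064.0;

/// Code points above the first plane (U+10000..=U+10FFFF).
const HIGH_PLANE: f64 = 1_048_576.0;

/// Probability that `run` consecutive positions all hold one specific character,
/// given that character's per-position probability.
fn p_run(p_char: f64, run: i32) -> f64 {
    p_char.powi(run)
}

/// Current behavior: every scalar value is equally likely.
fn uniform_p() -> f64 {
    1.0 / SCALAR_VALUES
}

/// Assumed weighting: a space is one of 15 characters sharing a 10% tier.
fn weighted_space_p() -> f64 {
    0.10 / 15.0
}

/// Assumed weighting: one specific high-plane character within a 10% tier.
fn weighted_high_plane_p() -> f64 {
    0.10 / HIGH_PLANE
}
```

Under these assumptions `p_run(uniform_p(), 3)` is on the order of 7e-19, while `p_run(weighted_space_p(), 3)` is on the order of 3e-7. Three 𒍅s in a row become somewhat less likely than under the uniform scheme, which matches the expectation that such a case is found more slowly.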

@BurntSushi
Owner

@vi Ah, I see, I misunderstood. I think I'm fine with a smarter impl of String.

@vi
Contributor Author

vi commented Dec 9, 2015

@BurntSushi, Maybe a smarter impl of char? Would you be OK if an arbitrary char were not regularly distributed and favored some characters?

@BurntSushi
Owner

I think that might be OK.

@vi
Contributor Author

vi commented Dec 9, 2015

Can such logic also be applied to u32 and friends (making values like 0, 1, 2, -1, 0x80000000 more likely)?

Probably [0,0,0,0] can trigger more bugs than [1582149423,1582149423,1582149423,1582149423].
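A minimal sketch of this idea for `u32`, assuming a 20% chance of drawing from a short list of boundary values and a uniform draw otherwise. The `INTERESTING` list, the 20% figure, the `gen_u32` name and the `SplitMix64` helper are illustrative assumptions, not quickcheck's behavior.

```rust
/// Tiny self-contained PRNG (SplitMix64), used here only to avoid external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in 0..n (slight modulo bias, acceptable for a sketch).
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Boundary values from the comment above; u32::MAX is the bit pattern of -1.
const INTERESTING: [u32; 5] = [0, 1, 2, u32::MAX, 0x8000_0000];

/// With probability 1/5 returns a boundary value, otherwise a uniform u32.
fn gen_u32(rng: &mut SplitMix64) -> u32 {
    if rng.below(5) == 0 {
        INTERESTING[rng.below(INTERESTING.len() as u64) as usize]
    } else {
        rng.next_u64() as u32 // truncation keeps the low 32 bits
    }
}
```

With this weighting a four-element vector is all zeros roughly once in 390,000 draws (0.04 to the fourth power), whereas a uniform generator would essentially never produce `[0,0,0,0]`.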

@BurntSushi
Owner

Sounds like a good idea to me!

@BurntSushi
Owner

I wonder if it'd be worth looking at what other ports of quickcheck do. Does the Haskell quickcheck do anything fancy like this? If not, did they consider it?

@vi
Contributor Author

vi commented Dec 9, 2015

Asked on IRC.

http://haddock.stackage.org/lts-3.17/QuickCheck-2.8.1/src/Test-QuickCheck-Arbitrary.html#line-471

kadoban> _Vi: So it basically only ever picks characters between 0 and 255, and it's biased towards 0 to 128

@FranklinChen

This is a known crappy Haskell QuickCheck default that has bitten many people. The standard workaround is http://hackage.haskell.org/package/quickcheck-unicode

@vi
Contributor Author

vi commented Dec 10, 2015

Shall I try submitting a pull request for this, making it generate chars a bit like the aforementioned quickcheck-unicode (but with some emphasis on whitespace, special characters and specific tricky Unicode characters)?

@BurntSushi
Owner

@vi That would be lovely!

vi added a commit to vi/quickcheck that referenced this issue Dec 13, 2015
That's how generated strings typically look now:

O���[.?

}'셥-�91(]ª!ñ�·��	#* "9ô�£´�:؀{乸0%㯓9똁⁔Rz릉¤tó£±�? (]>�
                                                <܏nf)*ᖯ'��ñ��°6>¦ó¤�¡匈�#$'`맽ô���c￸HX)�[r莅3*A ð¹�§7]

	G_媣<ꉟต8~^i7䱄釱fh)+��{G�0�

ﵽ❔K/5‴9[꤅X1J[M&4[؜¥"

⇉Ɩ©�42폨ĒUñ�¸�5.`'O§)⁣�-���*ñ·�¼‌r ؅
 '@/@�骲6!ñ�§��,&E؀ 
e?!�܏fó � ó¶±¬V�_ (]>el󯣿o+狪*="⁅
     ￸ñ���肖<{ó¿¿½\+巤

{T��*ô�¿½⁆?ó¿¿½ ꡯ칵쫨C}1<ʼn��*���..#ñ��º& J:,j=؂‹3“褙`}j¬ñ���+‌‐󾬲¦bO©￰S�ñ¡��~~�.�ª
=㍃�&f�E&Q@ð¾�±R�笹
⁁�D

6')�m9m�)�sqT�3H㹵0￸35蹈\>^鯅ñ��»ó��­�ó¨£�؅�‰쩻8 ⁋0�N\WGô�¡�¥�®��5UWñª���1钟[!�X��+<󿬹難"4​�ó�®³ᔵ"ó¬�®!G

揟’O�1'ñ�¿��+髾@$Zvó�¹�䵃�;ð»�¸�h뢚ស᜼9Yó¿¿¾_L蛇�AjpⰚ�㤩

©揪)ò�®�-d�A){¥攝剟>~ó���؃="

«ó¿¿½1賬‟z⁉�VOô�¿¾�2I!mô�´¿N4;,ñ�¾»i>-\B��)裉᷈�f륯  +ाX~9[u 樴m‿ñ°��!=�=�C[	ط57_£=‴�`⁧5�_�4}⁃‥�у灼	¥1:�>ð�©»

<$)>@, -"♄f<��ð¶¾�
BurntSushi added a commit that referenced this issue Jan 27, 2016
@BurntSushi
Owner

Done with PR #116 in commit faed60d. Thanks @vi!

@vi
Contributor Author

vi commented Jan 28, 2016

QuickCheck's string generator's motto should be "I love characters you hate".

@BurntSushi
Owner

@vi Haha, I like it!
