Strings generated by default should have more usual characters #99

vi · 2015-09-28T02:10:17Z

Strings typically generated look like this:

򒽹򪬛򘌟󥫹񕘼򟶹񍀝󿡥񰫩􂒡𻘒򅇪򍸈򁑙򟪆󴥏􅲴󩇇Ք𡫗󼺆򺩪𾺗𵅟𻎚񦎅񳜿󔈐ᘤ񜾊񺥿򍔛󆬭𙚹񃃣񥊆򇆗򂤇􀑬󨍏󉨫򺌑񽙕񷗮󻐉𿞊񹩵򖬊񣁇󘅛󖛉𼄠󄚷򽕎

񡰎𣒌󹣵

􅘤󀞶񼮳𮕇򀁕񗎰򀇦򹖻񚂫󚾪򃢛ൾ𬫤񘖤􇾷򇍤񮅈弅򥅽񨭴󶈡񫯭𧚼䝠Ᏼ􌷞󻃃񛑓󵎍𯭜􄼧򛷺񚂛񊪧󯱩񩴝𛭑𨀒񡥍󛢘񺐢𵑌眻󖙞񅉊򛻨񆷮󸧖󵔸񎽔򡚥񂼮򆼤񄓭񷬗􂭡򼠀񥉑󓊢򡽍􀟞񮆀񿫇񿌝򒋏񜎀񢝒򋯼񹌈􃔪򈟙󟬠󀧸󀉖𠥤񌡭𧆼񦛼񶰆񂠑𻗛򙄒

򁂺󶼀𰷹𾏉򮘭򈝥󓞤򫚁󙖘󆖙𰻟񇫙𴋅𠈼󹌀凾𖙟𠓕𾭭𗢐񥂡񇫘򫯝󨃖僧󰲺򘳑𛉏񡴶􏹊򔌇ᘼ􎜅󇆯󈤟󷃚􎞎񦖄봄򬩘񎩭򓲜󍥊𰀈􍆔񥝖򅈳𿴛􌆗񍏰񛯚􏡩򵆠񝭆􍘣򅕜򞙄󬍦񜟩󐮐򊽛󘉏񔀊񎇝󞏏񥍇󕎎𑝩񡨭򋡠򪙃󆂵𫟯񩽗𕡫򞠇𤷸򙺛򄎩򍮂򒹣䴒𻉜󸅩􎬙𷆊񲎀񪨌򌀀􄻖񇋒󹜅

񒢩򊟗𥁿򌅞񬕧𞎂񮕄򞿮ҙ󐢉󠢜񭈑󕓴󅸛򕎅𳞇񪉩񞢜򭧘󛦦󎪦󱏚홀񍋧篬󦲫򈧹񍔿񚣪򛍾𡚙󫬴򽉫񈝗񔁯󹕩񦟠󼝹𦼄󹆲񜋑񠇒񇽖􋤋񤔇󐂚񼛍􄼀󮀯𔯀󦃂񵒵򄯎󩎅򓊕򯠎񌬆롇񘺊񵺾𕧨񚾓򍚬󷈮󠢧򦂼𹥞󮾛򼢁􀏜򸬟󮃣򪾂򬁷񯕞񋓝󞭤󰥔񐮃󲏫󻞻򻬒􄑎򉗭򄫛󨶇󔽟򹑹򼹨򨵖񰊅񻣊

񕂍􌥩񵉗񶗖𬉞魊񞦹󙤵𦪀𒮡󄮽𕗠񥽕􂥦􈲹󙁡򞞌𵄐𳎩뚲󄬥򶰂񷷢򝬉񞝟򃹪腎󑍭򒋀򔲙󉺿񲥓𝹯󩬑

񾧍󝫮򎿛窯񕓨􈲶񽶛򩪣򁹡􂏔򛘎򘇀𠷺񎂦񭭸񽎊𐊊𨺿򆊺𗞩􌽂񁸶輬񄃸񅙜񐢳񔧏񈙄𝐳𪉛

𥹠𼞳󾢙𔅿􊭷񾣐򎕊񭳖򕸌󎀛򶃣𜿉􀢛󶄿𲊊񪟛񈘉񲻦񭩂󳄒𾺲𯆊񏸆񌼠񭳒񷞈򁓎􈋖𚜸󿒲򁔀𰤫򫥑򞼲𥐸񥾥󓮿򳖱𗋕󳉋􉠨𖍳󷕝򝘊򏩰󁺱񍰉崳󀁈𒯻򿮥𯙇񥻂󓿭򫙎񄦽񠷎𞸁񵳳󿧻􂯓󹗄򇪾󽘝󉵷򗚫񸐄󘁳񏶌񥵾󧒟򮫰𖪄񞼗􏾢𶕘򊮴𗰼򄌋񺿭򣘸

򵑆򢀍򯪂񏁾𧔥󼖈򬿚󰖓􁯌򁆙񎠤񬴞𝒲󹩝򾜣񰢝󭙯񠵛񹘦󒳼񡫉󺲢񘾥񿔷󾣎埙󊽑򼪐򊰭񠮺禹󾣣𼵝򒹐񓪊󫨬򫄒󦉂񇫄󬆿򻃃񯟶󜚓𷗚𲦂󻔴񮏂򶰎񥊢񰵪񢔀򁻢喒󢷅󟽄񚐚񓼈򭶸􅇆󭏂𜯃𨖉󑶧򪌿𨌽򄔖𱢼򩱼򫮴

𻥻򖿄󎰁𔱷𑒂𼟊􇼅𲂄􈔜󉄘􋫢𖔄񋺆𬎽򛖢򠬔􃸫􀵆𮘜󛥺􂶐򵄤𺐅񀕛񟥠􏽑󹫠򡧄󒸡򰎜򊎰񥑌񇇣𩔒򰅈򵄯񓣏󓢍򑛛򉨸𾓊𷅣𵼤𦌘򓀞𽣟򞴍

񒚞񙶴󫃔񀹮񣍯𵘣󾈣򐜴󄌿𦶞󯛫󄊁񳍮񚶨󡷣𸫂𞳊񏜪󜮶𺾆㮄񆡐󓶭﯈󊎊󿧺􏋔󤍄󁾒򣎦񛦋򯠖􍓙񜠔򇺱𓆦񟣤󝢧򱁕񗽳󶱮󶊼󿓊򾚓򵵆򼫏󄈬򜒕󱦡񘗏򮑏󔼋򂄸󺺘𨛽󚲁񶗜󖿭󣠯򆱩񰊫憎񳥒񼷵𠗆񍻠󽖾𿂆󑓓񎚯򦉠𘊲

𳸈򋸴󂩗񽰋񄨪񏉙򀤵󕳷𚱲򵚑󓶵񀐡󈁡򮖵񜬃񫡵񠬳񬶊𡱭񅼺򝙾񞬁ꈺ𹖉𵌛ロ񒁭􉫒󋟑񔊢򽲓񿨈񑜰񇿴򢣖􍟣򰮍󿘱󄡣󹉚𬥿񋘙𔭦򡎑

󰺿󶄔𬆷󱂌󉬞󾪢񖖖񮊕񝧃򾽼񨤳𸾂𣑌󸕖󹴌𬜄𑼦􌷱򩔤𖌸񥺼񗎱񗱐󲣑𴒨󴹒𹒪󋊌󨴖򙀁򁿵񫘕󐇯򂯲񵕌񁢜󹤨򙢄񨓩򄠑񻨸𻍦󦩒񬪸󻴴񜀛򝇩񼗕򣚉󵏛󣹥򉭞򦚎񀺲񆛜󮟏񟒫𲹇򋼡򷧘񿩬󎭝򪆀𛑛񻉞󝫨𡞱㆒

򼋡󯻭􁤛񎔝󸴭󜈰񱚱픦󎍆𢐻򷟝񞮵蘪󖣢𫭪𲊺󿜚񈅗򿧙󿴞񫌱衯򉕭򊞴қ󴙠𱜯񙐰򍊌񮨬󪾪򮝂𾨆񱘾񡠇팢򗡗򽑶򴸟𙅡򎋎𐠮󶖉򄡏󟛿􏠰򟿢񚨬憲򭫉􉌞񘮝󗕖񡏤𮺀񫐀𔚈񷋭𙪢򇬣򛡡񗚡򓛏򘌶𦷴񶵍򮵔숥𲟪

񘿘𛀦򵊻󠑜󼾐𙈨󹲃𸯕𜆦淅󍵮𰂘񛦑󛫌󝌲󔕇󹙮񧃷񧈵񠶄򙹫󓰆򒈐􃎢邮򣻖񩹖𓋱𥥫񩘆

It fails (or takes a long time) to find simple cases, such as when something starts with a space, a special character, or the like.

I think normal characters should be preferred to deeply Unicode things.

I think by default there should be tiers of characters classes:

Bug-prone characters: whitespace, punctuation, control characters, maybe up to 5 selected Unicode oddities like zero-width space;
[a-zA-Z0-9_];
Unicode characters in the first plane;
Everything else.

And the character generator could aim for, for example, 10% bug-prone chars, 40% [a-zA-Z0-9_], 40% basic Unicode characters, and 10% everything else.
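
A minimal sketch of the tiered generator described above, assuming the example 10/40/40/10 split. Everything here is hypothetical and is not quickcheck's actual code: the `XorShift64` helper only stands in for a real random source so the sketch needs no external crates, and the `BUG_PRONE` list is just one guess at what "bug-prone" could contain.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// It stands in for whatever random source the library really uses.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must never start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform value in `0..n`; the small modulo bias is ignored here.
    fn below(&mut self, n: u32) -> u32 {
        ((self.next_u64() >> 32) % n as u64) as u32
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    BugProne,
    Word,
    BasicPlane,
    Other,
}

/// Assumed "bug-prone" set: whitespace, controls, some punctuation,
/// and a few hand-picked Unicode oddities.
const BUG_PRONE: &[char] = &[
    ' ', '\t', '\n', '\r', '\0', '\u{7f}', '"', '\'', '\\', '/', '%', '<', '>', '&',
    '\u{200B}', // zero-width space
    '\u{FEFF}', // byte order mark
    '\u{202E}', // right-to-left override
];

/// The characters matched by [a-zA-Z0-9_].
const WORD: &[u8] = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_";

/// Map a roll in `0..100` onto the example weights 10% / 40% / 40% / 10%.
fn pick_tier(roll: u32) -> Tier {
    match roll {
        0..=9 => Tier::BugProne,
        10..=49 => Tier::Word,
        50..=89 => Tier::BasicPlane,
        _ => Tier::Other,
    }
}

fn biased_char(rng: &mut XorShift64) -> char {
    match pick_tier(rng.below(100)) {
        Tier::BugProne => BUG_PRONE[rng.below(BUG_PRONE.len() as u32) as usize],
        Tier::Word => WORD[rng.below(WORD.len() as u32) as usize] as char,
        Tier::BasicPlane => loop {
            // Retry on surrogates (U+D800..=U+DFFF), which are not valid `char`s.
            if let Some(c) = char::from_u32(rng.below(0x1_0000)) {
                break c;
            }
        },
        Tier::Other => loop {
            // Planes 1..=16: U+10000..=U+10FFFF.
            if let Some(c) = char::from_u32(0x1_0000 + rng.below(0x10_0000)) {
                break c;
            }
        },
    }
}

fn biased_string(rng: &mut XorShift64, len: usize) -> String {
    (0..len).map(|_| biased_char(rng)).collect()
}
```

Calling `biased_string(&mut XorShift64::new(seed), 80)` should give a string where roughly four characters in ten match [a-zA-Z0-9_], while any codepoint can still show up eventually.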


BurntSushi · 2015-09-28T02:25:09Z

This isn't a bad idea. It used to be ASCII only. But clearly, that is also an extreme. I often find myself defining new types that restrict the set of possible values to make debugging easier.

See also: #77

shepmaster · 2015-12-08T17:58:25Z

Copying my idea from #77:

I could conceive of a string with only ASCII as being "smaller" than one with ASCII + ASCII punctuation. Then it could "grow" to include more common Unicode, then "grow" towards uncommon.
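
One way to picture that "smaller/larger" ordering is a rank per character, with a string ranked by its worst character. This is only a sketch under my own assumptions: the four ranks, the names `char_rank`, `string_rank` and `shrink_candidates`, and the representative characters are all made up for illustration and are not how quickcheck shrinks today.

```rust
/// Hypothetical "simplicity" rank: lower means easier to read in a failure report.
fn char_rank(c: char) -> u8 {
    if c.is_ascii_alphanumeric() {
        0 // plain ASCII letters and digits
    } else if c.is_ascii() {
        1 // ASCII punctuation, whitespace and control characters
    } else if (c as u32) < 0x1_0000 {
        2 // the rest of the Basic Multilingual Plane
    } else {
        3 // everything in the higher planes
    }
}

/// A string is as "large" as its most complex character; empty strings rank 0.
fn string_rank(s: &str) -> u8 {
    s.chars().map(char_rank).max().unwrap_or(0)
}

/// Replacement characters a shrinker could try: one representative
/// from every rank strictly below the rank of `c`.
fn shrink_candidates(c: char) -> Vec<char> {
    const REPRESENTATIVES: [char; 3] = ['a', ' ', '\u{e9}'];
    REPRESENTATIVES
        .iter()
        .copied()
        .filter(|&r| char_rank(r) < char_rank(c))
        .collect()
}
```

With this ranking a failing input containing U+12345 would first be retried with 'a', then ' ', then 'é' in its place, so the reported counterexample drifts towards plain ASCII whenever the bug allows it.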

vi · 2015-12-08T18:40:17Z

@shepmaster, Maybe first uncommon/tricky (zero-width things, BOM, left-to-right), then numerous common?

Each Unicode codepoint has equal weight (time required to test with it), but unequal usefulness (probability that this codepoint catches some bug).

shepmaster · 2015-12-08T22:32:19Z

It's an interesting thing - what is the most useful order to iterate through test cases? I'd think most people using quickcheck would want it to find things that they haven't thought of (at least it's true for me!). However, once something is found, we want it to reduce it to something that we can wrap our brains around.

I think that "simple ASCII" will often be the easy-to-understand group of characters. The problem is going to be that different usages of Strings will have different "tricky" bits. Perhaps your area of code is more likely to have issues with BOMs, but mine with control characters. I'd doubt there's One True Order.

vi · 2015-12-09T08:57:51Z

But I expect that a problem will rarely come up with, for example, character U+12345 CUNEIFORM SIGN URU TIMES KI (𒍅) exactly (and not with other high-plane characters). Yet including all high-plane characters significantly increases the testing space, and they outnumber the more useful characters. So for "lesser" strings you can leave just one high-plane character.

Imagine the table:

Option	Weight	Usefulness notes
Don't include high-plane characters in "easy set"	smallest	Won't catch respective problems
Include just one high-plane character in "easy set"	small	Likely to catch the problem with such characters
Include all high-plane characters in "easy set"	big	Only slightly more chance to catch such problems compared to the previous row

Small character classes (control characters, whitespace) should be included entirely.

BurntSushi · 2015-12-09T11:44:54Z

I somewhat feel like the obvious behavior for String is the current behavior: any Unicode codepoint is fair game.

With that said, there's no reason why quickcheck couldn't define a few other newtypes around String that correspond to useful subsets of Unicode. (If we go that route, I would prefer to keep the number of such types in quickcheck proper very small.)

vi · 2015-12-09T13:08:42Z

What does "fair game" mean? In my idea any codepoint can appear, but the probabilities should be drastically different.

A useful subset may fail to find a problem even if run for long enough.

Option	Speed	Immediate results	Long-term results
All codepoints regularly distributed (current)	slow	few	all
"Useful subset"	fast	moderate	moderate
All codepoints, but not regularly distributed (proposed)	medium	moderate	all

For example, if a function breaks only when fed a string with three spaces in a row, I expect that to be found fast. If a function breaks only when fed three 𒍅s in a row, that is expected to be found more slowly (because a space is a much more common source of bugs than some arbitrary character).
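
To put rough numbers on that, here is a small back-of-the-envelope sketch. The figures are assumptions for illustration only: a bug-prone tier that gets 10% of draws spread over 20 characters, and a high-plane tier that gets 10% of draws spread over all 0x100000 high-plane codepoints. Only the count of Unicode scalar values (0x110000 codepoints minus 0x800 surrogates) is a fixed fact.

```rust
/// Number of Unicode scalar values: 0x110000 code points minus 0x800 surrogates.
const SCALAR_VALUES: f64 = 1_112_064.0;

/// Probability that `run` independent draws all produce one specific character,
/// given the per-draw probability `p_char` of that character.
fn p_run(p_char: f64, run: i32) -> f64 {
    p_char.powi(run)
}

/// Per-draw probability of one specific character inside a tier that gets
/// `tier_weight` of all draws and holds `tier_size` equally likely characters.
fn p_in_tier(tier_weight: f64, tier_size: f64) -> f64 {
    tier_weight / tier_size
}

/// Three spaces in a row at a fixed position, every scalar value equally likely.
fn uniform_three_spaces() -> f64 {
    p_run(1.0 / SCALAR_VALUES, 3)
}

/// Three spaces in a row with the assumed 10% bug-prone tier of 20 characters.
fn biased_three_spaces() -> f64 {
    p_run(p_in_tier(0.10, 20.0), 3)
}

/// Three copies of one specific high-plane character with the assumed 10% tier.
fn biased_three_rare() -> f64 {
    p_run(p_in_tier(0.10, 1_048_576.0), 3)
}
```

Under those assumptions three spaces at a given position come out at roughly 1e-7 per attempt instead of roughly 7e-19 with uniform sampling, while three copies of one particular high-plane character become somewhat less likely than before but stay possible.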

BurntSushi · 2015-12-09T13:28:32Z

@vi Ah, I see, I misunderstood. I think I'm fine with a smarter impl of String.

vi · 2015-12-09T14:29:56Z

@BurntSushi, Maybe a smarter impl of char? Would you be OK if arbitrary char were not uniformly distributed and preferred some characters?

BurntSushi · 2015-12-09T14:32:14Z

I think that might be OK.

vi · 2015-12-09T16:29:18Z

Can such logic also be applied to u32 and friends (making things like 0, 1, 2, -1, 0x80000000 more likely)?

Probably [0,0,0,0] can trigger more bugs than [1582149423,1582149423,1582149423,1582149423].
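
A rough sketch of what that could look like for u32, assuming one draw in four comes from a short list of boundary values. The 25% figure, the `INTERESTING` list and the `biased_u32` name are placeholders of mine, and the random source is passed in as a closure so the sketch does not depend on any particular RNG API.

```rust
/// Assumed "interesting" values: small numbers, -1 reinterpreted as u32,
/// and the sign-bit boundary mentioned above.
const INTERESTING: [u32; 5] = [0, 1, 2, u32::MAX, 0x8000_0000];

/// Draw a u32 that is biased towards `INTERESTING`.
/// `next` is any source of uniformly distributed u32 values.
fn biased_u32(mut next: impl FnMut() -> u32) -> u32 {
    let roll = next() % 100;
    if roll < 25 {
        // About one draw in four: pick a boundary value.
        INTERESTING[(next() as usize) % INTERESTING.len()]
    } else {
        // Otherwise fall back to a plain uniform value.
        next()
    }
}
```

A vector built from such draws is far more likely to contain several zeros or repeated boundary values than one built from uniform draws, which is the [0,0,0,0] case above.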

BurntSushi · 2015-12-09T17:03:20Z

Sounds like a good idea to me!

BurntSushi · 2015-12-09T17:03:57Z

I wonder if it'd be worth looking at what other ports of quickcheck do. Does the Haskell quickcheck do anything fancy like this? If not, did they consider it?

vi · 2015-12-09T17:29:56Z

Asked on IRC.

http://haddock.stackage.org/lts-3.17/QuickCheck-2.8.1/src/Test-QuickCheck-Arbitrary.html#line-471

kadoban> _Vi: So it basically only ever picks characters between 0 and 255, and it's biased towards 0 to 128

FranklinChen · 2015-12-10T00:24:15Z

This is a known crappy Haskell QuickCheck default that has bitten many people. The standard workaround is http://hackage.haskell.org/package/quickcheck-unicode

vi · 2015-12-10T00:40:52Z

Shall I try submitting a pull request for this, making it generate chars a bit like the aforementioned quickcheck-unicode (but with some emphasis on whitespace, special characters and specific tricky Unicode characters)?

BurntSushi · 2015-12-10T12:01:04Z

@vi That would be lovely!

That's how generated strings typically looks now: O�ò»¹��[.? }'셥-�91(]ª!ñ�·�� #* "9ô�£´�:؀{乸0%㯓9똁⁔Rz릉¤tó£±�? (]>� <܏nf)*ᖯ'��ñ��°6>¦ó¤�¡匈�#$'`맽ô��c￸HX)�[r莅3*A ð¹�§7] G_媣<ꉟต8~^i7䱄釱fh)+��{G�0� ﵽ❔K/5‴9[꤅X1J[⁯M&4[؜¥￻" ⇉Ɩ©�42폨ĒUñ�¸�5.`'O§)⁣�-��*ñ·�¼‌r ؅ '@/@�骲6!ñ�§��,&E؀ e?!�܏fó � ó¶±¬V�_ (]>eló¯£¿o+狪*="⁅ ￸ñ��肖<{ó¿¿½\+巤 {T��*ô�¿½⁆?ó¿¿½ ꡯ칵쫨C}1<ŉ��*��..#ñ��º& J:,j=؂‹3“褙`}j¬ñ��+‌‐ó¾¬²¦bO©￰S�ñ¡��~~�.�ª =㍃�&f�E&Q@ð¾�±R�笹 ⁁�D 6')�m9m⁪�)�sqT�3H㹵0￸35蹈\>^鯅ñ��»ó��ó¨£�؅�‰쩻8 ⁋0�N⁯\WGô�¡�¥�®��5UWñª��1钟[!�X��+<ó¿¬¹難"4�ó�®³ᔵ"ó¬�®!G 揟’O�1'ñ�¿��+髾@$Zvó�¹�䵃�;ð»�¸�h뢚ស᜼9Yó¿¿¾_L蛇�AjpⰚ�㤩 ©揪)ò�®�-d�A){¥攝剟>~ó��؃=" «ó¿¿½1賬‟z⁉�VOô�¿¾�2I!mô�´¿N4;,ñ�¾»i>-\B��)裉᷈�f륯 +ाX~9[u 樴m‿ñ°��!=�=�C[ ط57_£=‴�`⁧5�_�4}⁃‥�у灼 ¥1:�>ð�©» <$)>@, -"♄f<��ð¶¾�

Bias generated `char`s (#99)

BurntSushi · 2016-01-27T21:48:53Z

Done with PR #116 in commit faed60d. Thanks @vi!

vi · 2016-01-28T02:37:21Z

QuickCheck's string generator's motto should be "I love characters you hate".

BurntSushi · 2016-01-28T11:39:02Z

@vi Haha, I like it!

BurntSushi added the enhancement label Sep 28, 2015

BurntSushi added a commit that referenced this issue Jan 27, 2016

Merge pull request #116 from vi/bias_arbitrary_char

faed60d

Bias generated `char`s (#99)

BurntSushi closed this as completed Jan 27, 2016

vi mentioned this issue Jan 31, 2016

The random selection doesn't appear to be very random #119

Closed
