
Strings in MoarVM

Abstract

MoarVM implements strings using NFG (Normalization Form Grapheme).

NFG is an extension of Unicode NFC (Normalization Form C, canonical composition).

Strings have a fixed-width representation of either 32-bit signed or 8-bit signed integers. Negative numbers represent graphemes which contain more than one codepoint (Synthetic graphemes).

💡
Remember, all input text is normalized by default.
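As a rough illustration of the fixed-width signed representation described in this abstract, the sketch below checks whether a grapheme value denotes a Synthetic and whether a run of graphemes would fit in 8-bit signed storage. The names `sketch_is_synthetic` and `sketch_fits_in_8` are hypothetical and are not MoarVM functions; the rule MoarVM really uses to pick a storage type may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a negative grapheme value stands for a Synthetic. */
static int sketch_is_synthetic(int32_t g) {
    return g < 0;
}

/* Hypothetical helper: returns 1 if every grapheme fits in a signed 8-bit
 * slot, 0 otherwise. An assumption for illustration only. */
static int sketch_fits_in_8(const int32_t *graphs, size_t n) {
    assert(graphs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (graphs[i] < INT8_MIN || graphs[i] > INT8_MAX)
            return 0;
    }
    return 1;
}
```

With these helpers, a run such as `{'M', 'o', 'a', 'r'}` fits in 8 bits, while a run containing U+1F4A1 needs the 32-bit form.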

MVMString

Strings are represented by the MVMString struct (source). A string’s length is stored as a 32-bit unsigned integer, so the maximum number of graphemes allowed in a string is 2³² - 1 (4,294,967,295).

For a given string MVMString *string, string->body.storage_type can be one of the following types:

MVMString types:
  1. MVM_STRING_GRAPHEME_32

  2. MVM_STRING_GRAPHEME_ASCII

  3. MVM_STRING_GRAPHEME_8

  4. MVM_STRING_STRAND

Type                        Storage                    Notes
MVM_STRING_GRAPHEME_32      32-bit signed              Can contain any Synthetic
MVM_STRING_GRAPHEME_ASCII   8-bit signed               Can contain the CRLF Synthetic
MVM_STRING_GRAPHEME_8       8-bit signed
MVM_STRING_STRAND           References other strings   Created by: concatenation, substring ops & string repeat op
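The flat storage types in the table can be pictured as a tagged union. The sketch below is a simplified stand-in, not MoarVM's actual definitions. Every name beginning with `Sketch` or `sketch_` is hypothetical, and strands are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags mirroring the flat storage types in the table. */
typedef enum {
    SKETCH_GRAPHEME_32,
    SKETCH_GRAPHEME_ASCII,
    SKETCH_GRAPHEME_8
} SketchStorageType;

/* Hypothetical flat string: a tag, a length, and one of two blob widths. */
typedef struct {
    SketchStorageType storage_type;
    uint32_t          num_graphs;   /* length in graphemes, 32-bit unsigned */
    union {
        const int32_t *blob_32;     /* 32-bit signed graphemes */
        const int8_t  *blob_8;      /* 8-bit signed, used by ASCII and 8 */
    } storage;
} SketchFlatString;

/* Read the grapheme at `index`, widened to 32 bits whatever the storage. */
static int32_t sketch_grapheme_at(const SketchFlatString *s, uint32_t index) {
    assert(index < s->num_graphs);
    switch (s->storage_type) {
    case SKETCH_GRAPHEME_32:
        return s->storage.blob_32[index];
    case SKETCH_GRAPHEME_ASCII:
    case SKETCH_GRAPHEME_8:
        return (int32_t)s->storage.blob_8[index];
    }
    return 0; /* not reached for a valid storage_type */
}
```

Callers then always work with 32-bit signed grapheme values, whichever width the string happens to use internally.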

Strands

A Strand is a type of MVMString which, instead of being a flat string with contiguous data, contains references to other strings. Strands are created during concatenation, substring, or repeat operations. When two flat strings are concatenated, a Strand with references to both string a and string b is created. If string a and string b were Strands themselves, the references of string a and the references of string b are copied one after another into the new Strand.
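The concatenation behaviour described above can be sketched as follows. This is a minimal model under stated assumptions, not MoarVM's implementation. All `Sketch`/`sketch_` names are hypothetical, all flat data is 32-bit, and strand-count limits are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct SketchString SketchString;

/* One strand: a window [start, end) into a flat string. */
typedef struct {
    const SketchString *blob;
    uint32_t start, end;
} SketchStrand;

struct SketchString {
    uint32_t       num_graphs;
    uint16_t       num_strands;  /* 0 means the string is flat */
    const int32_t *flat;         /* grapheme data when flat */
    SketchStrand  *strands;      /* strand list otherwise */
};

/* Copy s's strands into out; a flat string contributes one whole-string
 * strand. Returns the number of strands written. */
static uint16_t sketch_copy_strands(const SketchString *s, SketchStrand *out) {
    if (s->num_strands == 0) {
        out[0].blob = s;
        out[0].start = 0;
        out[0].end = s->num_graphs;
        return 1;
    }
    for (uint16_t i = 0; i < s->num_strands; i++)
        out[i] = s->strands[i];
    return s->num_strands;
}

/* Concatenate without copying grapheme data. Returns NULL on allocation
 * failure. The result never points at a or b when they are strand strings. */
static SketchString *sketch_concat(const SketchString *a, const SketchString *b) {
    size_t n = (size_t)(a->num_strands ? a->num_strands : 1)
             + (size_t)(b->num_strands ? b->num_strands : 1);
    SketchString *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->strands = malloc(n * sizeof *r->strands);
    if (r->strands == NULL) {
        free(r);
        return NULL;
    }
    uint16_t used = sketch_copy_strands(a, r->strands);
    used = (uint16_t)(used + sketch_copy_strands(b, r->strands + used));
    r->num_strands = used;
    r->num_graphs = a->num_graphs + b->num_graphs;
    r->flat = NULL;
    return r;
}

/* Look up one grapheme by walking the strands until the index falls in one. */
static int32_t sketch_get(const SketchString *s, uint32_t index) {
    assert(index < s->num_graphs);
    if (s->num_strands == 0)
        return s->flat[index];
    for (uint16_t i = 0; i < s->num_strands; i++) {
        uint32_t len = s->strands[i].end - s->strands[i].start;
        if (index < len)
            return s->strands[i].blob->flat[s->strands[i].start + index];
        index -= len;
    }
    return 0; /* not reached when num_graphs matches the strands */
}
```

In this sketch, concatenating a two-strand string with a flat string yields three strands that all point directly at flat data, so a lookup never has to follow more than one level of reference.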

Grapheme Segmentation

Grapheme segmentation (deciding which codepoints are a part of which grapheme) follows Unicode’s Text Segmentation rules for Grapheme Clusters, Technical Report 29 [TR29].

Synthetics

Synthetics are graphemes which contain multiple codepoints. In MoarVM these are stored and accessed using a trie, while the data itself records the base character separately from the combiners, which are stored in an array. We also store whether or not it is a UTF8-C8 synthetic. The struct’s source is in src/strings/nfg.h.

ℹ️
Currently the maximum number of combiners in a synthetic is 1024. MoarVM will throw an exception if you attempt to create a grapheme with more than 1024 codepoints in it. (source)

A synthetic’s codepoints are stored in a single array, and the base character is identified by storing its index in that array. This layout exists for compatibility with Prepend characters.

Prepend

Before Unicode 9.0, base characters were always the first codepoint in the grapheme. The Prepend property was added in Unicode 9.0, and it does the opposite of the Extend property: codepoints with the Prepend property combine with the codepoint which comes immediately afterward. MoarVM supports both segmenting such graphemes and getting the base codepoint out of a synthetic that starts with one or more Prepend codepoint(s).
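To make the base-index idea concrete, here is a minimal sketch of a synthetic that keeps all its codepoints in one array and records where the base character sits. The struct and function names are hypothetical and do not match the real struct in src/strings/nfg.h. U+0600 is used as an example of a codepoint with the Prepend property.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical synthetic: every codepoint in order, plus the base position. */
typedef struct {
    const uint32_t *codes;       /* all codepoints of the grapheme */
    uint32_t        num_codes;
    uint32_t        base_index;  /* index of the base character in codes */
} SketchSynthetic;

/* Return the base codepoint, which is not necessarily codes[0]. */
static uint32_t sketch_base(const SketchSynthetic *s) {
    assert(s->base_index < s->num_codes);
    return s->codes[s->base_index];
}
```

For a grapheme such as `{0x0078, 0x0301}` (a base followed by a combining mark) the base index is 0, while for `{0x0600, 0x0031}` (a Prepend codepoint followed by a digit) it is 1, so `sketch_base` returns 0x0031.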

Normalization

MoarVM normalizes all input text into NFG form. This can cause the data to change as normalization takes place. Developers and users may be used to systems which treat strings as "bags of bytes" and do not ensure they are valid Unicode (or any other encoding for that matter). MoarVM goes beyond ensuring correct Unicode and also ensures correct normalization in NFC form.
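A small example of why normalization can change the data: under NFC, the two codepoints U+0065 (e) and U+0301 (combining acute accent) compose into the single codepoint U+00E9 (é). The sketch below uses a hypothetical two-entry table to show that one effect. A real normalizer, MoarVM's included, works from the full Unicode data and also handles decomposition and reordering, none of which is modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, tiny composition table, for illustration only. */
typedef struct {
    uint32_t base, mark, composed;
} SketchPair;

static const SketchPair sketch_pairs[] = {
    { 0x0065, 0x0301, 0x00E9 },  /* e + combining acute accent -> é */
    { 0x0061, 0x0308, 0x00E4 },  /* a + combining diaeresis    -> ä */
};

/* Compose adjacent (base, mark) pairs in place; returns the new length. */
static size_t sketch_compose(uint32_t *cp, size_t n) {
    assert(cp != NULL || n == 0);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        int merged = 0;
        if (out > 0) {
            size_t count = sizeof sketch_pairs / sizeof sketch_pairs[0];
            for (size_t k = 0; k < count; k++) {
                if (cp[out - 1] == sketch_pairs[k].base &&
                    cp[i] == sketch_pairs[k].mark) {
                    cp[out - 1] = sketch_pairs[k].composed;
                    merged = 1;
                    break;
                }
            }
        }
        if (!merged)
            cp[out++] = cp[i];
    }
    return out;
}
```

Feeding in the five codepoints `c a f e U+0301` gives back four codepoints ending in U+00E9, which is why a round trip through a normalizing system does not always reproduce the original byte sequence.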

Glossary

MVMString

The C type used to represent strings

NFG

Normalization Form Grapheme. Similar to NFC except graphemes which contain multiple codepoints are stored in Synthetic graphemes.

NFC

Normalization Form C (canonical composition)

Grapheme

Short for Grapheme Cluster. See [TR29] for more information.

Synthetic

In MoarVM, a special representation used to store a grapheme containing more than one codepoint in the same space as a standard codepoint. Internally stored using negative numbers in the C string data array.

References