Commit 77c83fc

Make this just load all the different pieces of Str & Chr support
1 parent cd6963d commit 77c83fc

50 files changed, with 204 additions and 10848 deletions

.travis.yml

Lines changed: 5 additions & 3 deletions
@@ -27,9 +27,11 @@ git:

  ## uncomment the following lines to override the default test script
  script:
- - julia -e 'Pkg.clone(pwd()); Pkg.test("Strs"; coverage=true)'
+ - if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
+ - julia -e 'VERSION < v"0.7.0-DEV" || (using Pkg); p="https://github.com/JuliaString"; s=".jl.git"; l="_Entities"; for n in ("StrTables", "LightXML", "JSON", "Format", "PCRE2", "InternedStrings") ; Pkg.add(n) ; end ; t=("LaTeX","Emoji","HTML","Unicode"); for n in t; Pkg.add("$n$l"); end; for n in ("APITools", "StrAPI", "CharSetEncodings", "Chars", "StrBase", "StrRegex", "StrLiterals", "StrFormat", "StrEntities") ; Pkg.clone("$p/$n$s"); end ; Pkg.clone(pwd()); for n in t; Pkg.build("$n$l"); end; Pkg.test("Strs"; coverage=true)'
+ #- julia -e 'Pkg.clone(pwd()); Pkg.test("Strs"; coverage=true)'
  after_success:
  # push coverage results to Coveralls
- - julia -e 'cd(Pkg.dir("Strs")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
+ - julia -e 'VERSION < v"0.7.0-DEV" || (using Pkg); cd(Pkg.dir("Strs")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
  # push coverage results to Codecov
- - julia -e 'cd(Pkg.dir("Strs")); Pkg.add("Coverage"); using Coverage; Codecov.submit(Codecov.process_folder())'
+ - julia -e 'VERSION < v"0.7.0-DEV" || (using Pkg); cd(Pkg.dir("Strs")); Pkg.add("Coverage"); using Coverage; Codecov.submit(Codecov.process_folder())'
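
The new `script:` one-liner in `.travis.yml` is dense, so here is a reading aid that expands only its name-building logic into plain Julia, without calling `Pkg`. The variable names (`org`, `registered`, `entity_tables`, `unregistered`, and so on) are mine, not the commit's, and the "registered" versus "unregistered" split is only my reading of which names the script passes to `Pkg.add` and which to `Pkg.clone`.

```julia
# Reading aid only: rebuilds the package names and URLs that the CI one-liner
# hands to Pkg.add / Pkg.clone / Pkg.build, without touching the package manager.
org           = "https://github.com/JuliaString"   # `p` in the one-liner
suffix        = ".jl.git"                          # `s`
entity_suffix = "_Entities"                        # `l`

# Installed by name with Pkg.add
registered = ["StrTables", "LightXML", "JSON", "Format", "PCRE2", "InternedStrings"]

# Entity tables (`t`): added, and later built, as "<name>_Entities"
entity_tables = ["LaTeX", "Emoji", "HTML", "Unicode"]

# Cloned by URL from the JuliaString organization with Pkg.clone
unregistered = ["APITools", "StrAPI", "CharSetEncodings", "Chars", "StrBase",
                "StrRegex", "StrLiterals", "StrFormat", "StrEntities"]

added  = vcat(registered, ["$n$entity_suffix" for n in entity_tables])
cloned = ["$org/$n$suffix" for n in unregistered]
built  = ["$n$entity_suffix" for n in entity_tables]

foreach(println, cloned)
```

Printing `cloned` makes it easy to check that every URL points at a repository in the JuliaString organization before the CI job tries to fetch it.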

IDEAS.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+ Ideas:
+ Have substring support built in
+
+ Have types using both a UniStr type along with UTF-8 and/or UTF-16 versions, and take advantage of
+ knowing the real length of the string from the UniStr version, and always reading the string,
+ except for times when a UTF-8 (or UTF-16) version is needed, using the O(1) version.
+
+ Have extra parameter(s?), for types to hold things like: UnitRange{UInt16,UInt32,UInt64},
+ for substrings, UInt64 for caching hash value, Str{T}, where T could be Raw*, UTF8, or UTF16,
+ to cache either unchanged input data and/or validated UTF8/UTF16, and possibly even one of those with a substring type!
+
+ Add types _UCS2 and _UTF32, for the internal versions of those encodings, which are guaranteed to
+ have at least one character of that type, which allows for O(1) == comparisons between strings of
+ different types. [done!]
+
+ Having a mixed string type, for large strings where you may have occasional BMP or non-BMP characters, but the major portion is ASCII/Latin1.
+ -> Keep a table of offsets and types, 2 bits per 64 character chunk for encoded type of chunk,
+ offset table, 16-bits, 32-bits, or 64-bits?
+ 16-bits could handle 16+5-2 = 2^19 characters, max cost 16K for 512K character string,
+ 32-bits could handle up to 32GB character strings
+
+ Have a function that given a Vector{UInt8}, BinaryStr, UTF8Str, String, etc.
+ returns a vector of substring'ed UniStr, which only has to make copies if some of the lines
+ are _Latin, _UCS2, _UTF32.
+ (This could be done as well for input of Vector{UInt16}, RawWordStr)
+
+ To save space, and for better performance, lines with different types are pooled together,
+ and lines with ASCII can point to the original, except if the input is Vector{UInt8},
+ in which case they are also added to the pooled buffer for the lines.
+ In addition, depending on the percentage of characters still pointing into the original string
+ (again, only if it's not a Vector, which must be copied), it may decide to go ahead and copy
+ all of the ASCII substrings into the pool.
+ Note: for good substring performance, some of the operations that are optimized to work 8 bytes
+ (or more) at a time, will need to deal with masking the initial chunk, not just the final chunk.
+
+
+ New ideas:
+ Have a single concrete "UniStr" type, which uses bits in the trailing "nul" byte of the String
+ representation, to store the following information:
+
+ NotValidated, Invalid, NoASCII, Latin, BMP, UTF32, Hash present, Short
+
+ 2 bits: 00 -> Valid, 01 -> Invalid, 10 -> NotValidated, 11 means?
+ 1 bit: 0 -> Some ASCII, 1 -> no ASCII (bit flipped from others so that 0 -> ASCIIStr)
+ 1 bit: 0 -> No Latin1, 1 -> some Latin1
+ 1 bit: 0 -> ByteWise, 1 -> WordWise
+ 1 bit: 0/1 Hash present
+ 1 bit: 0/1 Short
+ 1 bit: ?
+
+ Extra byte for wordwise:
+ 1 bit: 0 -> None > 0x7ff, 1 -> some > 0x7ff (for UTF8 conversions?)
+ 1 bit: 0 -> No BMP, 1 -> some BMP (0x800-0xd7ff,0xe000-0xffff)
+ 1 bit: 0 -> Only BMP, 1 -> some non-BMP (0x10000-0x10ffff)
+ Have at least 5 bits for other information.
+
+ So: ASCIIStr would be: Valid, All ASCII, ... i.e. 0 + short/hash bits
+ _LatinStr would be: Valid, maybe no ascii, Latin1, no bmp, no non-bmp
+ _UCS2Str would be: Valid, maybe no ascii, maybe Latin1, some bmp, no non-bmp
+ _UTF32Str would be: Valid, maybe no ascii, maybe Latin1, maybe BMP, some non-bmp
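
The flag byte described under "New ideas" in `IDEAS.md` can be pictured with a few bit masks. This is a minimal sketch under loud assumptions. The notes name the fields but never assign bit positions, so the layout below (packed low to high in the order the notes list them) is hypothetical. So is every identifier in it. The `11` validity state and the final spare bit are left unassigned, as in the notes.

```julia
# Hypothetical layout: IDEAS.md lists the fields but does not fix bit positions.
# Here they are packed low-to-high in the order the notes give them.
const VALIDITY_MASK = 0x03  # 2 bits: 00 valid, 01 invalid, 10 not validated (11 unassigned)
const VALID         = 0x00
const INVALID       = 0x01
const NOT_VALIDATED = 0x02
const NO_ASCII      = 0x04  # 0 -> some ASCII, 1 -> no ASCII
const SOME_LATIN1   = 0x08  # 0 -> no Latin1, 1 -> some Latin1
const WORDWISE      = 0x10  # 0 -> bytewise, 1 -> wordwise
const HASH_PRESENT  = 0x20  # 1 -> a cached hash is stored
const SHORT         = 0x40  # 1 -> short string

# Extract the two validity bits.
validity(flags::UInt8) = flags & VALIDITY_MASK
is_valid_flags(flags::UInt8) = validity(flags) == VALID

# "ASCIIStr would be ... 0 + short/hash bits": every content bit is clear,
# while the hash and short bits are free to take either value.
const CONTENT_BITS = VALIDITY_MASK | NO_ASCII | SOME_LATIN1 | WORDWISE
is_ascii_flags(flags::UInt8) = (flags & CONTENT_BITS) == 0x00

example = HASH_PRESENT | SHORT
println(is_ascii_flags(example))
```

Flipping the sense of the ASCII bit, as the notes suggest, is what makes an all-zero content field mean "valid, pure ASCII", so the most common case needs only one masked comparison.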

README.md

Lines changed: 52 additions & 2 deletions
@@ -6,13 +6,35 @@

  [![codecov.io](http://codecov.io/github/JuliaString/Strs.jl/coverage.svg?branch=master)](http://codecov.io/github/JuliaString/Strs.jl?branch=master)

+ [![](https://twitter.com/twitter/statuses/995385258332901377)]
+
  The `Strs` package is now working on both the release version (v0.6.2) and the latest master (v0.7.0-DEV).
+
  It represents an attempt to give Julia better string handling than possible with Base `String` and `Char`.

  I am now trying to make sure that all of the functionality in String and Char is implemented for
  Str and Chr, and to start optimizing the functions (although they are already substantially faster)

- I also am working on implementing full Regex support (although some changes might be needed in Base to make it work with the `r"..."` regex string macro and `Regex` type, because there are some fields missing that would be needed to handle arbitrary abstract string types).
+ Strs.jl is now a container for a number of different packages from [JuliaString.org](https://juliastring.org)
+
+ * [StrAPI](https://github.com/JuliaString/StrAPI.jl): Common API for string/character functionality
+ * [CharSetEncodings](https://github.com/JuliaString/CharSetEncodings.jl): Basic types/support for Character Sets, Encodings, and Character Set Encodings
+ * [Chars](https://github.com/JuliaString/Chars.jl): `Chr{CharSet,CodeUnitType}` type and support
+ * [StrBase](https://github.com/JuliaString/StrBase.jl): `Str{CSE, Hash, SubSet, Cache}` type
+ * [PCRE2](https://github.com/JuliaString/PCRE2.jl): `PCRE2` library support
+ * [StrRegex](https://github.com/JuliaString/StrRegex.jl): `Regex` support for all string types
+ * [StrLiterals](https://github.com/JuliaString/StrLiterals.jl): Extensible string literal support
+ * [Format](https://github.com/JuliaString/Format.jl): Python/C style formatting (based on [Formatting](https://github.com/JuliaIO/Formatting.jl))
+ * [StrFormat](https://github.com/JuliaString/StrFormat.jl): formatting extensions for literals
+ * [StrTables](https://github.com/JuliaString/StrTables.jl): low-level support for entity tables
+ * [HTML_Entities](https://github.com/JuliaString/HTML_Entities.jl)
+ * [LaTeX_Entities](https://github.com/JuliaString/LaTeX_Entities.jl)
+ * [Emoji_Entities](https://github.com/JuliaString/Emoji_Entities.jl)
+ * [Unicode_Entities](https://github.com/JuliaString/Unicode_Entities.jl)
+ * [StrEntities](https://github.com/JuliaString/StrEntities.jl): Entity extensions for literals
+ * [InternedStrings](https://github.com/JuliaString/InternedStrings.jl): save space by interning strings (by @oxinabox!)
+
+ The new package [APITools](https://github.com/JuliaString/APITools.jl) is used to set up a consistent and easy to use API for most of the cooperating packages, without having to worry too much about imports, exports, using, and what functions are part of a public API, and which ones are part of the internal development API for other packages to extend.

  I would very much appreciate any constructive criticism, help implementing some of the ideas, ideas on how to make it perform better, bikeshedding on names and API, etc.
  Also, I'd love contributions of benchmark code and/or samples for different use cases of strings,
@@ -85,12 +107,13 @@ There is a new API that I am working on for indexing and searching, (however the

  Also there are more readable function names that always separate words with `_`, and avoid hard to understand abbreviations:

- * `is*` -> `is_*` (for `ascii`, `digit`, `space`, `alpha`, `numeric`, `valid`,
+ * `is*` -> `is_*` (for `ascii`, `digit`, `space`, `numeric`, `valid`,
  `defined`, `empty`, `assigned`)
  * `iscntrl` -> `is_control`
  * `isgraph` -> `is_graphic`
  * `isprint` -> `is_printable`
  * `ispunct` -> `is_punctuation`
+ * `isalpha` -> `is_letter`
  * `isalnum` -> `is_alphanumeric`
  * `isgraphemebreak` -> `is_grapheme_break`
  * `isgraphemebreak!` -> `is_grapheme_break!`
@@ -100,3 +123,30 @@ Also there are more readable function names that always separate words with `_`,
  * `uppercasefirst` -> `uppercase_first`
  * `startswith` -> `starts_with`
  * `endswith` -> `ends_with`
+
+ * In addition, I've added `is_alphabetic`
+
+ ## Kudos
+
+ Nobody is an island, and to achieve great things, one must stand on the shoulders of giants.
+
+ I would like to thank some of those giants in particular:
+
+ * The four co-creators of Julia: [Jeff Bezanson](https://github.com/JeffBezanson), [Viral B. Shah](https://github.com/ViralBShah), [Alan Edelman](https://github.com/alanedelman), and [Stefan Karpinski](https://github.com/StefanKarpinski); without their uncompromising greediness, none of this would be possible.
+
+ * [Tom Breloff](https://github.com/tbreloff), for showing how an ecosystem could be created in Julia, i.e. "Build it, and they will come", for providing some nice code in this [PR](https://github.com/JuliaIO/Formatting.jl/pull/10) (which I shamelessly pirated in order to create [Format](https://github.com/JuliaString/Format.jl)), and for good advice at JuliaCon.
+ * [Ismael Venegas Castelló](https://twitter.com/SalchiPapa1337) for encouraging me to [tweet](https://twitter.com/GandalfSoftware) about Julia starting at the 2015 JuliaCon, for good advice, and being a great guy in general.
+ * [Chris Rackauckas](https://github.com/ChrisRackauckas), simply a star in Julia now, great guy, great advice, and great blogs about stuff that's usually way over my head. Julia is incredibly lucky to have him.
+ * [Jacob Quinn](https://github.com/quinnj), for collaborating & discussions early on in [Strings](https://github.com/quinnj/Strings.jl) on ideas for better string support in Julia, as well as a lot of hard work on things dear to me, such as databases and importing/exporting data: [SQLite](https://github.com/JuliaDatabases/SQLite.jl), [ODBC](https://github.com/JuliaDatabases/ODBC.jl), [CSV](https://github.com/JuliaData/CSV.jl), [WeakRefStrings](https://github.com/JuliaData/WeakRefStrings.jl), [DataStreams](https://github.com/JuliaData/DataStreams.jl), [Feather](https://github.com/JuliaData/Feather.jl), [JSON2](https://github.com/quinnj/JSON2.jl)
+ * [Milan Bouchet-Valat](https://github.com/nalimilan), for discussions on string handling and encoding in [StringEncodings](https://github.com/nalimilan/StringEncodings.jl)
+ * [Tim Holy](https://github.com/timholy) for the famous "Holy" Trait Trick, which I use extensively in the Str* packages, and for the work along with [Matt Bauman](https://github.com/mbauman) on making Julia arrays general and extensible while still performing well, and hence very useful in my work.
+ * [Steven G. Johnson]() for illuminating me on how one could create a whole package in very few lines of code when I first started learning Julia, see [DecFP](https://github.com/stevengj/DecFP.jl)
+ * [Tony Kelman](https://github.com/tkelman), for very thorough reviews of my PRs; I learned a great deal from his (and other Julians') comments, including that I didn't have to code in C anymore to get the performance I desired.
+
+ * [Lyndon White](https://github.com/oxinabox), I've already "appropriated" :grinning: his very nice [InternedStrings](https://github.com/JuliaString/InternedStrings.jl) into this package, I'm really lucky to have gotten him to join the organization!
+ * [Bogumił Kamiński](https://github.com/bkamins) who has been doing a great job testing and reviewing `Strs` (as well as doing the same for the string/character support in Julia Base), as well as input into the design. (Also very glad to have co-opted him to become a member of the org)
+ * Last but not least, Julia mathematical artist (and blogger!) extraordinaire, [Cormullion](https://github.com/cormullion), creator of our wonderful logo!
+
+ Also thanks to anybody who's answered my (sometimes stupid :grinning:) questions on [Gitter](https://gitter.im/JuliaLang/julia) and [Discourse](https://discourse.julialang.org/)
+
+ Kudos to all of the other contributors to Julia and the ever expanding Julia ecosystem!
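
The renaming scheme in the README hunks above (`is*` -> `is_*`, `startswith` -> `starts_with`, and so on) can be illustrated with thin stand-ins that forward to Base. This is only a sketch of the naming convention, not how Strs implements it: the package provides its own methods for `Str` and `Chr`, while the definitions below are hypothetical aliases written against Julia 1.x Base names.

```julia
# Illustration only: Strs supplies its own implementations for Str and Chr.
# These stand-ins forward to Base (Julia 1.x names) to show the naming scheme.
const starts_with     = startswith
const ends_with       = endswith
const uppercase_first = uppercasefirst

is_digit(c::AbstractChar)       = isdigit(c)
is_control(c::AbstractChar)     = iscntrl(c)
is_punctuation(c::AbstractChar) = ispunct(c)
is_printable(c::AbstractChar)   = isprint(c)
is_letter(c::AbstractChar)      = isletter(c)  # Base's `isalpha` became `isletter` in Julia 0.7

println(starts_with("Strs.jl", "Str"))
println(uppercase_first("kudos"))
```

Spelling out whole words separated by `_` costs a few characters per call but removes the need to remember C-library abbreviations such as `cntrl` and `punct`.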

REQUIRE

Lines changed: 4 additions & 1 deletion
@@ -1,2 +1,5 @@
  julia 0.6
- CharSetEncodings
+ StrRegex
+ StrFormat
+ StrEntities
+ InternedStrings
