(*
    Copyright © 2011 MLstate

    This file is part of OPA.

    OPA is free software: you can redistribute it and/or modify it under the
    terms of the GNU Affero General Public License, version 3, as published by
    the Free Software Foundation.

    OPA is distributed in the hope that it will be useful, but WITHOUT ANY
    WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
    FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for
    more details.

    You should have received a copy of the GNU Affero General Public License
    along with OPA. If not, see <http://www.gnu.org/licenses/>.
*)
(**
   Server side implementation of Cactutf.
   Encoding of Unicode characters.
   @author Corentin Gallet
   @author Rudy Sicard
   @author Mathieu Barbin (documentation)
*)

(**
   CactUTF is a `light version' of Camomile, the popular OCaml library often
   used for its full support of Unicode. Sadly, there are two problems with
   it:
   - Camomile is big. Really big.
   - Camomile has to be installed.

   So, here comes Cactutf, with only the functions needed.
   It's a plain translation of EucalyptUTF into OCaml.

   You may note that this implementation only uses up to 4 bytes for one
   Unicode character, where Camomile goes as far as 6. That's because
   they go further than RFC 3629.
*)

(**
   {6 Type aliases}
*)

(**
   There are at least 3 different kinds of int being manipulated, so these
   type aliases try to reduce confusion.
*)

(**
   The representation of a Unicode character
*)
type unicode = int

(**
   An index counted in Unicode characters, independent of the implementation
*)
type unicode_index = int

(**
   An index counted in bytes
*)
type bytes_index = int

(**
   {6 Indexation and length}
*)

(**
   For one Unicode code, return the number of bytes needed for
   its representation in UTF-8.
   @raise Lenbytes if the code is invalid
*)
val lenbytes : unicode -> bytes_index
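
(**
   A minimal sketch of the size rule documented for [lenbytes], following the
   ranges of RFC 3629. [utf8_size] is a hypothetical name that is not part of
   this module, and it raises [Invalid_argument] where Cactutf raises
   [Lenbytes]. The actual implementation may differ, for instance in how it
   treats surrogate code points.
   {[
     let utf8_size (code : int) : int =
       if code < 0 then invalid_arg "utf8_size"
       else if code < 0x80 then 1       (* ASCII range *)
       else if code < 0x800 then 2
       else if code < 0x10000 then 3
       else if code < 0x110000 then 4   (* RFC 3629 stops at U+10FFFF *)
       else invalid_arg "utf8_size"
   ]}
*)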

(**
   [Cactutf.length_until string pos]
   Returns the number of Unicode characters encoded in the string
   up to the position [pos], given in bytes.
*)
val length_until : string -> bytes_index -> unicode_index

(**
   Returns the number of Unicode characters encoded in the string.
   This returns the same result as [Cactutf.length_until s (String.length s)]
*)
val length : string -> unicode_index
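
(**
   One common way to count the characters of a well-formed UTF-8 string is to
   count every byte that is not a continuation byte (binary 10xxxxxx). The
   sketch below illustrates that idea only. [count_code_points] is a
   hypothetical name, and it is an assumption that [length] agrees with it;
   the two may well differ on malformed input.
   {[
     let count_code_points (s : string) : int =
       let n = ref 0 in
       (* a byte starts a character unless its two top bits are 10 *)
       String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
       !n
   ]}
*)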

(**
   Return the index in bytes of the n-th Unicode character.
*)
val nth : string -> unicode_index -> bytes_index

(**
   [Cactutf.next string pos]
   Return the byte index of the next Unicode character.
   <!> Silently returns [pos+1] in case of error.
*)
val next : string -> bytes_index -> bytes_index

(**
   {6 Unicode}
*)

(**
   A Unicode character can be encoded with 1 up to 4 bytes.
   These functions build the Unicode code from its bytes.
*)

val one_byte : int -> unicode
val two_bytes : int -> int -> unicode
val three_bytes : int -> int -> int -> unicode
val four_bytes : int -> int -> int -> int -> unicode
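
(**
   The bit arithmetic behind such decoders can be sketched as follows, using
   the UTF-8 layout of RFC 3629. The names [decode2], [decode3] and [decode4]
   are hypothetical. It is an assumption that the functions above take raw
   bytes and mask them in this way; they may expect validated input instead.
   {[
     (* 110xxxxx 10xxxxxx *)
     let decode2 b0 b1 = ((b0 land 0x1F) lsl 6) lor (b1 land 0x3F)

     (* 1110xxxx 10xxxxxx 10xxxxxx *)
     let decode3 b0 b1 b2 =
       ((b0 land 0x0F) lsl 12) lor ((b1 land 0x3F) lsl 6) lor (b2 land 0x3F)

     (* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx *)
     let decode4 b0 b1 b2 b3 =
       ((b0 land 0x07) lsl 18) lor ((b1 land 0x3F) lsl 12)
       lor ((b2 land 0x3F) lsl 6) lor (b3 land 0x3F)
   ]}
   For instance, the bytes 0xC3 0xA9 decode to 0xE9 with [decode2].
*)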

(**
   {6 Access}
*)

(**
   Return the Unicode code of the n-th Unicode character.
*)
val get : string -> unicode_index -> unicode

(**
   Return the Unicode code of the character found at the given byte index
   (and not the n-th character).
   A lot faster than [get], but only usable when you hold a byte index
   rather than a character position.
*)
val look : string -> bytes_index -> unicode
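
(**
   Since [look] and [next] both work on byte indices, they can be combined to
   walk a string without recomputing character positions. The sketch below
   takes them as arguments so that it stands on its own;
   [fold_code_points] is a hypothetical helper, not part of this module. It
   assumes that [next s i] always returns an index greater than [i], which
   the documented [pos+1] fallback suggests but does not guarantee.
   {[
     let fold_code_points ~look ~next f acc s =
       let len = String.length s in
       let rec go acc i =
         if i >= len then acc
         else go (f acc (look s i)) (next s i)
       in
       go acc 0
   ]}
   A possible use, collecting the codes of [s] in order:
   {[
     let codes s =
       List.rev
         (fold_code_points ~look:Cactutf.look ~next:Cactutf.next
            (fun acc c -> c :: acc) [] s)
   ]}
*)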

(**
   {6 Allocation}
*)

(**
   Build a new string from a character.
*)
val cons : unicode -> string

(**
   {6 Extraction, Transformation}
*)

(**
   <!> Beware: the length given to [sub] is a length in bytes,
   not a number of Unicode characters.
*)

val sub : string -> bytes_index -> int -> string

val sub_opt : string -> bytes_index -> int -> string option

(**
   Uppercase the string.
*)
val uppercase : string -> string

(**
   Lowercase the string.
*)
val lowercase : string -> string


(**
   {6 Deprecated}
*)

(**
   FIXME: undocumented, incorrect, dirty, not following guidelines.
   This exception should not be exported, and should not escape this module.
*)
exception Lenbytes of int