Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Newer
Older
100644 704 lines (549 sloc) 21.078 kB
511dc44 initial import
Laurent Sansonetti authored
1 # scanf for Ruby
2 #
3 # $Release Version: 1.1.2 $
8f21162 @richkilmer bring lib up to r22701 (ruby 1.9.1_0 tag). there are build issues us…
richkilmer authored
4 # $Revision: 19094 $
5 # $Id: scanf.rb 19094 2008-09-03 12:54:13Z dblack $
6 # $Author: dblack $
511dc44 initial import
Laurent Sansonetti authored
7 #
8 # A product of the Austin Ruby Codefest (Austin, Texas, August 2002)
9
10 =begin
11
12 =scanf for Ruby
13
14 ==Description
15
16 scanf for Ruby is an implementation of the C function scanf(3),
17 modified as necessary for Ruby compatibility.
18
19 The methods provided are String#scanf, IO#scanf, and
20 Kernel#scanf. Kernel#scanf is a wrapper around STDIN.scanf. IO#scanf
21 can be used on any IO stream, including file handles and sockets.
22 scanf can be called either with or without a block.
23
24 scanf for Ruby scans an input string or stream according to a
25 <b>format</b>, as described below ("Conversions"), and returns an
26 array of matches between the format and the input. The format is
27 defined in a string, and is similar (though not identical) to the
28 formats used in Kernel#printf and Kernel#sprintf.
29
30 The format may contain <b>conversion specifiers</b>, which tell scanf
31 what form (type) each particular matched substring should be converted
32 to (e.g., decimal integer, floating point number, literal string,
33 etc.) The matches and conversions take place from left to right, and
34 the conversions themselves are returned as an array.
35
36 The format string may also contain characters other than those in the
37 conversion specifiers. White space (blanks, tabs, or newlines) in the
38 format string matches any amount of white space, including none, in
39 the input. Everything else matches only itself.
40
41 Scanning stops, and scanf returns, when any input character fails to
42 match the specifications in the format string, or when input is
43 exhausted, or when everything in the format string has been
44 matched. All matches found up to the stopping point are returned in
45 the return array (or yielded to the block, if a block was given).
46
47
48 ==Basic usage
49
50 require 'scanf.rb'
51
52 # String#scanf and IO#scanf take a single argument (a format string)
53 array = aString.scanf("%d%s")
54 array = anIO.scanf("%d%s")
55
56 # Kernel#scanf reads from STDIN
57 array = scanf("%d%s")
58
59 ==Block usage
60
61 When called with a block, scanf keeps scanning the input, cycling back
62 to the beginning of the format string, and yields a new array of
63 conversions to the block every time the format string is matched
64 (including partial matches, but not including complete failures). The
65 actual return value of scanf when called with a block is an array
4b368bd @Watson1978 stdlib: Deleted the whitespece of end of line, and fixed the petty typo.
Watson1978 authored
66 containing the results of all the executions of the block.
511dc44 initial import
Laurent Sansonetti authored
67
68 str = "123 abc 456 def 789 ghi"
69 str.scanf("%d%s") { |num,str| [ num * 2, str.upcase ] }
70 # => [[246, "ABC"], [912, "DEF"], [1578, "GHI"]]
71
72 ==Conversions
73
74 The single argument to scanf is a format string, which generally
75 includes one or more conversion specifiers. Conversion specifiers
76 begin with the percent character ('%') and include information about
77 what scanf should next scan for (string, decimal number, single
78 character, etc.).
79
80 There may be an optional maximum field width, expressed as a decimal
81 integer, between the % and the conversion. If no width is given, a
82 default of `infinity' is used (with the exception of the %c specifier;
83 see below). Otherwise, given a field width of <em>n</em> for a given
84 conversion, at most <em>n</em> characters are scanned in processing
85 that conversion. Before conversion begins, most conversions skip
86 white space in the input string; this white space is not counted
87 against the field width.
88
89 The following conversions are available. (See the files EXAMPLES
90 and <tt>tests/scanftests.rb</tt> for examples.)
91
92 [%]
93 Matches a literal `%'. That is, `%%' in the format string matches a
94 single input `%' character. No conversion is done, and the resulting
95 '%' is not included in the return array.
96
97 [d]
98 Matches an optionally signed decimal integer.
99
100 [u]
101 Same as d.
102
4b368bd @Watson1978 stdlib: Deleted the whitespece of end of line, and fixed the petty typo.
Watson1978 authored
103 [i]
511dc44 initial import
Laurent Sansonetti authored
104 Matches an optionally signed integer. The integer is read in base
105 16 if it begins with `0x' or `0X', in base 8 if it begins with `0',
106 and in base 10 other- wise. Only characters that correspond to the
107 base are recognized.
108
109 [o]
110 Matches an optionally signed octal integer.
111
112 [x,X]
113 Matches an optionally signed hexadecimal integer,
114
115 [f,g,e,E]
116 Matches an optionally signed floating-point number.
117
118 [s]
119 Matches a sequence of non-white-space character. The input string stops at
120 white space or at the maximum field width, whichever occurs first.
121
122 [c]
123 Matches a single character, or a sequence of <em>n</em> characters if a
124 field width of <em>n</em> is specified. The usual skip of leading white
125 space is suppressed. To skip white space first, use an explicit space in
126 the format.
127
128 [<tt>[</tt>]
129 Matches a nonempty sequence of characters from the specified set
130 of accepted characters. The usual skip of leading white space is
131 suppressed. This bracketed sub-expression is interpreted exactly like a
132 character class in a Ruby regular expression. (In fact, it is placed as-is
133 in a regular expression.) The matching against the input string ends with
134 the appearance of a character not in (or, with a circumflex, in) the set,
135 or when the field width runs out, whichever comes first.
136
137 ===Assignment suppression
138
139 To require that a particular match occur, but without including the result
140 in the return array, place the <b>assignment suppression flag</b>, which is
141 the star character ('*'), immediately after the leading '%' of a format
142 specifier (just before the field width, if any).
143
144 ==Examples
145
146 See the files <tt>EXAMPLES</tt> and <tt>tests/scanftests.rb</tt>.
147
148 ==scanf for Ruby compared with scanf in C
149
150 scanf for Ruby is based on the C function scanf(3), but with modifications,
151 dictated mainly by the underlying differences between the languages.
152
153 ===Unimplemented flags and specifiers
154
155 * The only flag implemented in scanf for Ruby is '<tt>*</tt>' (ignore
156 upcoming conversion). Many of the flags available in C versions of scanf(4)
157 have to do with the type of upcoming pointer arguments, and are literally
158 meaningless in Ruby.
159
160 * The <tt>n</tt> specifier (store number of characters consumed so far in
161 next pointer) is not implemented.
162
163 * The <tt>p</tt> specifier (match a pointer value) is not implemented.
164
165 ===Altered specifiers
166
167 [o,u,x,X]
168 In scanf for Ruby, all of these specifiers scan for an optionally signed
169 integer, rather than for an unsigned integer like their C counterparts.
170
171 ===Return values
172
173 scanf for Ruby returns an array of successful conversions, whereas
174 scanf(3) returns the number of conversions successfully
175 completed. (See below for more details on scanf for Ruby's return
176 values.)
177
178 ==Return values
179
180 Without a block, scanf returns an array containing all the conversions
181 it has found. If none are found, scanf will return an empty array. An
182 unsuccesful match is never ignored, but rather always signals the end
183 of the scanning operation. If the first unsuccessful match takes place
184 after one or more successful matches have already taken place, the
185 returned array will contain the results of those successful matches.
186
187 With a block scanf returns a 'map'-like array of transformations from
188 the block -- that is, an array reflecting what the block did with each
189 yielded result from the iterative scanf operation. (See "Block
190 usage", above.)
191
192 ==Test suite
193
194 scanf for Ruby includes a suite of unit tests (requiring the
195 <tt>TestUnit</tt> package), which can be run with the command <tt>ruby
196 tests/scanftests.rb</tt> or the command <tt>make test</tt>.
197
198 ==Current limitations and bugs
199
200 When using IO#scanf under Windows, make sure you open your files in
201 binary mode:
202
203 File.open("filename", "rb")
204
205 so that scanf can keep track of characters correctly.
206
207 Support for character classes is reasonably complete (since it
208 essentially piggy-backs on Ruby's regular expression handling of
209 character classes), but users are advised that character class testing
210 has not been exhaustive, and that they should exercise some caution
211 in using any of the more complex and/or arcane character class
212 idioms.
213
214
215 ==Technical notes
216
217 ===Rationale behind scanf for Ruby
218
219 The impetus for a scanf implementation in Ruby comes chiefly from the fact
220 that existing pattern matching operations, such as Regexp#match and
221 String#scan, return all results as strings, which have to be converted to
222 integers or floats explicitly in cases where what's ultimately wanted are
223 integer or float values.
224
225 ===Design of scanf for Ruby
226
227 scanf for Ruby is essentially a <format string>-to-<regular
228 expression> converter.
229
230 When scanf is called, a FormatString object is generated from the
231 format string ("%d%s...") argument. The FormatString object breaks the
232 format string down into atoms ("%d", "%5f", "blah", etc.), and from
233 each atom it creates a FormatSpecifier object, which it
234 saves.
235
236 Each FormatSpecifier has a regular expression fragment and a "handler"
237 associated with it. For example, the regular expression fragment
238 associated with the format "%d" is "([-+]?\d+)", and the handler
239 associated with it is a wrapper around String#to_i. scanf itself calls
240 FormatString#match, passing in the input string. FormatString#match
241 iterates through its FormatSpecifiers; for each one, it matches the
242 corresponding regular expression fragment against the string. If
243 there's a match, it sends the matched string to the handler associated
244 with the FormatSpecifier.
245
246 Thus, to follow up the "%d" example: if "123" occurs in the input
247 string when a FormatSpecifier consisting of "%d" is reached, the "123"
248 will be matched against "([-+]?\d+)", and the matched string will be
249 rendered into an integer by a call to to_i.
250
251 The rendered match is then saved to an accumulator array, and the
252 input string is reduced to the post-match substring. Thus the string
253 is "eaten" from the left as the FormatSpecifiers are applied in
254 sequence. (This is done to a duplicate string; the original string is
255 not altered.)
256
257 As soon as a regular expression fragment fails to match the string, or
258 when the FormatString object runs out of FormatSpecifiers, scanning
259 stops and results accumulated so far are returned in an array.
260
261 ==License and copyright
262
263 Copyright:: (c) 2002-2003 David Alan Black
264 License:: Distributed on the same licensing terms as Ruby itself
265
266 ==Warranty disclaimer
267
268 This software is provided "as is" and without any express or implied
269 warranties, including, without limitation, the implied warranties of
270 merchantibility and fitness for a particular purpose.
271
272 ==Credits and acknowledgements
273
274 scanf for Ruby was developed as the major activity of the Austin
275 Ruby Codefest (Austin, Texas, August 2002).
276
277 Principal author:: David Alan Black (mailto:dblack@superlink.net)
278 Co-author:: Hal Fulton (mailto:hal9000@hypermetrics.com)
279 Project contributors:: Nolan Darilek, Jason Johnston
280
281 Thanks to Hal Fulton for hosting the Codefest.
282
4b368bd @Watson1978 stdlib: Deleted the whitespece of end of line, and fixed the petty typo.
Watson1978 authored
283 Thanks to Matz for suggestions about the class design.
511dc44 initial import
Laurent Sansonetti authored
284
285 Thanks to Gavin Sinclair for some feedback on the documentation.
286
287 The text for parts of this document, especially the Description and
288 Conversions sections, above, were adapted from the Linux Programmer's
289 Manual manpage for scanf(3), dated 1995-11-01.
290
291 ==Bugs and bug reports
292
293 scanf for Ruby is based on something of an amalgam of C scanf
294 implementations and documentation, rather than on a single canonical
295 description. Suggestions for features and behaviors which appear in
296 other scanfs, and would be meaningful in Ruby, are welcome, as are
297 reports of suspicious behaviors and/or bugs. (Please see "Credits and
298 acknowledgements", above, for email addresses.)
299
300 =end
301
302 module Scanf
303
304 class FormatSpecifier
305
306 attr_reader :re_string, :matched_string, :conversion, :matched
307
308 private
309
310 def skip; /^\s*%\*/.match(@spec_string); end
311
312 def extract_float(s); s.to_f if s &&! skip; end
313 def extract_decimal(s); s.to_i if s &&! skip; end
314 def extract_hex(s); s.hex if s &&! skip; end
315 def extract_octal(s); s.oct if s &&! skip; end
316 def extract_integer(s); Integer(s) if s &&! skip; end
317 def extract_plain(s); s unless skip; end
318
319 def nil_proc(s); nil; end
320
321 public
322
323 def to_s
324 @spec_string
325 end
326
327 def count_space?
8f21162 @richkilmer bring lib up to r22701 (ruby 1.9.1_0 tag). there are build issues us…
richkilmer authored
328 /(?:\A|\S)%\*?\d*c|%\d*\[/.match(@spec_string)
511dc44 initial import
Laurent Sansonetti authored
329 end
330
331 def initialize(str)
332 @spec_string = str
333 h = '[A-Fa-f0-9]'
334
4b368bd @Watson1978 stdlib: Deleted the whitespece of end of line, and fixed the petty typo.
Watson1978 authored
335 @re_string, @handler =
511dc44 initial import
Laurent Sansonetti authored
336 case @spec_string
337
338 # %[[:...:]]
339 when /%\*?(\[\[:[a-z]+:\]\])/
340 [ "(#{$1}+)", :extract_plain ]
341
342 # %5[[:...:]]
343 when /%\*?(\d+)(\[\[:[a-z]+:\]\])/
344 [ "(#{$2}{1,#{$1}})", :extract_plain ]
345
346 # %[...]
347 when /%\*?\[([^\]]*)\]/
348 yes = $1
349 if /^\^/.match(yes) then no = yes[1..-1] else no = '^' + yes end
350 [ "([#{yes}]+)(?=[#{no}]|\\z)", :extract_plain ]
351
352 # %5[...]
353 when /%\*?(\d+)\[([^\]]*)\]/
354 yes = $2
355 w = $1
356 [ "([#{yes}]{1,#{w}})", :extract_plain ]
357
358 # %i
359 when /%\*?i/
8f21162 @richkilmer bring lib up to r22701 (ruby 1.9.1_0 tag). there are build issues us…
richkilmer authored
360 [ "([-+]?(?:(?:0[0-7]+)|(?:0[Xx]#{h}+)|(?:[1-9]\\d*)))", :extract_integer ]
511dc44 initial import
Laurent Sansonetti authored
361
362 # %5i
363 when /%\*?(\d+)i/
364 n = $1.to_i
365 s = "("
366 if n > 1 then s += "[1-9]\\d{1,#{n-1}}|" end
367 if n > 1 then s += "0[0-7]{1,#{n-1}}|" end
368 if n > 2 then s += "[-+]0[0-7]{1,#{n-2}}|" end
369 if n > 2 then s += "[-+][1-9]\\d{1,#{n-2}}|" end
370 if n > 2 then s += "0[Xx]#{h}{1,#{n-2}}|" end
371 if n > 3 then s += "[-+]0[Xx]#{h}{1,#{n-3}}|" end
372 s += "\\d"
373 s += ")"
374 [ s, :extract_integer ]
375
376 # %d, %u
377 when /%\*?[du]/
378 [ '([-+]?\d+)', :extract_decimal ]
379
380 # %5d, %5u
381 when /%\*?(\d+)[du]/
382 n = $1.to_i
383 s = "("
384 if n > 1 then s += "[-+]\\d{1,#{n-1}}|" end
385 s += "\\d{1,#{$1}})"
386 [ s, :extract_decimal ]
387
388 # %x
389 when /%\*?[Xx]/
390 [ "([-+]?(?:0[Xx])?#{h}+)", :extract_hex ]
391
392 # %5x
393 when /%\*?(\d+)[Xx]/
394 n = $1.to_i
395 s = "("
396 if n > 3 then s += "[-+]0[Xx]#{h}{1,#{n-3}}|" end
397 if n > 2 then s += "0[Xx]#{h}{1,#{n-2}}|" end
398 if n > 1 then s += "[-+]#{h}{1,#{n-1}}|" end
399 s += "#{h}{1,#{n}}"
400 s += ")"
401 [ s, :extract_hex ]
402
403 # %o
404 when /%\*?o/
405 [ '([-+]?[0-7]+)', :extract_octal ]
406
407 # %5o
408 when /%\*?(\d+)o/
409 [ "([-+][0-7]{1,#{$1.to_i-1}}|[0-7]{1,#{$1}})", :extract_octal ]
410
411 # %f
412 when /%\*?f/
413 [ '([-+]?((\d+(?>(?=[^\d.]|$)))|(\d*(\.(\d*([eE][-+]?\d+)?)))))', :extract_float ]
414
415 # %5f
416 when /%\*?(\d+)f/
417 [ "(\\S{1,#{$1}})", :extract_float ]
418
419 # %5s
420 when /%\*?(\d+)s/
421 [ "(\\S{1,#{$1}})", :extract_plain ]
422
423 # %s
424 when /%\*?s/
425 [ '(\S+)', :extract_plain ]
426
427 # %c
428 when /\s%\*?c/
429 [ "\\s*(.)", :extract_plain ]
430
431 # %c
432 when /%\*?c/
433 [ "(.)", :extract_plain ]
434
435 # %5c (whitespace issues are handled by the count_*_space? methods)
436 when /%\*?(\d+)c/
437 [ "(.{1,#{$1}})", :extract_plain ]
438
439 # %%
440 when /%%/
441 [ '(\s*%)', :nil_proc ]
442
443 # literal characters
444 else
445 [ "(#{Regexp.escape(@spec_string)})", :nil_proc ]
446 end
447
448 @re_string = '\A' + @re_string
449 end
450
451 def to_re
452 Regexp.new(@re_string,Regexp::MULTILINE)
453 end
454
455 def match(str)
456 @matched = false
457 s = str.dup
458 s.sub!(/\A\s+/,'') unless count_space?
459 res = to_re.match(s)
460 if res
461 @conversion = send(@handler, res[1])
462 @matched_string = @conversion.to_s
463 @matched = true
464 end
465 res
466 end
467
468 def letter
469 @spec_string[/%\*?\d*([a-z\[])/, 1]
470 end
471
472 def width
473 w = @spec_string[/%\*?(\d+)/, 1]
474 w && w.to_i
475 end
476
477 def mid_match?
478 return false unless @matched
479 cc_no_width = letter == '[' &&! width
480 c_or_cc_width = (letter == 'c' || letter == '[') && width
481 width_left = c_or_cc_width && (matched_string.size < width)
482
483 return width_left || cc_no_width
484 end
4b368bd @Watson1978 stdlib: Deleted the whitespece of end of line, and fixed the petty typo.
Watson1978 authored
485
511dc44 initial import
Laurent Sansonetti authored
486 end
487
488 class FormatString
489
490 attr_reader :string_left, :last_spec_tried,
491 :last_match_tried, :matched_count, :space
492
493 SPECIFIERS = 'diuXxofeEgsc'
494 REGEX = /
495 # possible space, followed by...
496 (?:\s*
497 # percent sign, followed by...
498 %
499 # another percent sign, or...
500 (?:%|
501 # optional assignment suppression flag
502 \*?
503 # optional maximum field width
504 \d*
505 # named character class, ...
506 (?:\[\[:\w+:\]\]|
507 # traditional character class, or...
508 \[[^\]]*\]|
509 # specifier letter.
510 [#{SPECIFIERS}])))|
511 # or miscellaneous characters
512 [^%\s]+/ix
513
514 def initialize(str)
515 @specs = []
516 @i = 1
517 s = str.to_s
518 return unless /\S/.match(s)
519 @space = true if /\s\z/.match(s)
520 @specs.replace s.scan(REGEX).map {|spec| FormatSpecifier.new(spec) }
521 end
522
523 def to_s
524 @specs.join('')
525 end
526
527 def prune(n=matched_count)
528 n.times { @specs.shift }
529 end
530
531 def spec_count
532 @specs.size
533 end
534
535 def last_spec
536 @i == spec_count - 1
537 end
538
539 def match(str)
540 accum = []
541 @string_left = str
542 @matched_count = 0
543
544 @specs.each_with_index do |spec,i|
545 @i=i
546 @last_spec_tried = spec
547 @last_match_tried = spec.match(@string_left)
548 break unless @last_match_tried
549 @matched_count += 1
550
551 accum << spec.conversion
552
553 @string_left = @last_match_tried.post_match
554 break if @string_left.empty?
555 end
556 return accum.compact
557 end
558 end
559 end
560
561 class IO
562
563 # The trick here is doing a match where you grab one *line*
564 # of input at a time. The linebreak may or may not occur
565 # at the boundary where the string matches a format specifier.
566 # And if it does, some rule about whitespace may or may not
567 # be in effect...
568 #
569 # That's why this is much more elaborate than the string
570 # version.
571 #
572 # For each line:
573 # Match succeeds (non-emptily)
574 # and the last attempted spec/string sub-match succeeded:
575 #
576 # could the last spec keep matching?
577 # yes: save interim results and continue (next line)
578 #
579 # The last attempted spec/string did not match:
580 #
581 # are we on the next-to-last spec in the string?
582 # yes:
583 # is fmt_string.string_left all spaces?
584 # yes: does current spec care about input space?
585 # yes: fatal failure
586 # no: save interim results and continue
587 # no: continue [this state could be analyzed further]
588 #
589 #
590
591 def scanf(str,&b)
592 return block_scanf(str,&b) if b
593 return [] unless str.size > 0
594
595 start_position = pos rescue 0
596 matched_so_far = 0
597 source_buffer = ""
598 result_buffer = []
599 final_result = []
600
601 fstr = Scanf::FormatString.new(str)
602
603 loop do
604 if eof || (tty? &&! fstr.match(source_buffer))
605 final_result.concat(result_buffer)
606 break
607 end
608
609 source_buffer << gets
610
611 current_match = fstr.match(source_buffer)
612
613 spec = fstr.last_spec_tried
614
615 if spec.matched
616 if spec.mid_match?
617 result_buffer.replace(current_match)
618 next
619 end
620
621 elsif (fstr.matched_count == fstr.spec_count - 1)
622 if /\A\s*\z/.match(fstr.string_left)
623 break if spec.count_space?
624 result_buffer.replace(current_match)
625 next
626 end
627 end
628
629 final_result.concat(current_match)
630
631 matched_so_far += source_buffer.size
632 source_buffer.replace(fstr.string_left)
633 matched_so_far -= source_buffer.size
634 break if fstr.last_spec
635 fstr.prune
636 end
637 seek(start_position + matched_so_far, IO::SEEK_SET) rescue Errno::ESPIPE
638 soak_up_spaces if fstr.last_spec && fstr.space
639
640 return final_result
641 end
642
643 private
644
645 def soak_up_spaces
646 c = getc
647 ungetc(c) if c
648 until eof ||! c || /\S/.match(c.chr)
649 c = getc
650 end
651 ungetc(c) if (c && /\S/.match(c.chr))
652 end
653
654 def block_scanf(str)
655 final = []
656 # Sub-ideal, since another FS gets created in scanf.
657 # But used here to determine the number of specifiers.
658 fstr = Scanf::FormatString.new(str)
659 last_spec = fstr.last_spec
660 begin
661 current = scanf(str)
662 break if current.empty?
663 final.push(yield(current))
664 end until eof || fstr.last_spec_tried == last_spec
665 return final
666 end
667 end
668
669 class String
670
671 def scanf(fstr,&b)
672 if b
673 block_scanf(fstr,&b)
674 else
4b368bd @Watson1978 stdlib: Deleted the whitespece of end of line, and fixed the petty typo.
Watson1978 authored
675 fs =
511dc44 initial import
Laurent Sansonetti authored
676 if fstr.is_a? Scanf::FormatString
4b368bd @Watson1978 stdlib: Deleted the whitespece of end of line, and fixed the petty typo.
Watson1978 authored
677 fstr
678 else
511dc44 initial import
Laurent Sansonetti authored
679 Scanf::FormatString.new(fstr)
680 end
681 fs.match(self)
682 end
683 end
684
685 def block_scanf(fstr,&b)
686 fs = Scanf::FormatString.new(fstr)
687 str = self.dup
688 final = []
689 begin
690 current = str.scanf(fs)
691 final.push(yield(current)) unless current.empty?
692 str = fs.string_left
693 end until current.empty? || str.empty?
694 return final
695 end
696 end
697
698 module Kernel
699 private
700 def scanf(fs,&b)
701 STDIN.scanf(fs,&b)
702 end
703 end
Something went wrong with that request. Please try again.