== Text Processing and File Management ==
=== A Job Scripting Languages Are Built For ===
Ruby fills a lot of the same roles that languages such as Perl and Python
do. Because of this, you can expect to find first-rate support for text
processing and file management. Whether it's parsing a text file with
some regular expressions or building some *nix-style filter applications,
Ruby can help make life easier.
However, many of Ruby's I/O facilities are tersely documented at best. It is
also relatively hard to find good resources that show you general strategies
for attacking common text processing tasks. This chapter aims to expose
you to some good tricks that you can use to simplify your text processing
needs, as well as sharpen your skills when it comes to interacting with
and managing files on your system.
As in other chapters, we'll start off by looking at some real open source
code, this time a simple parser for an Adobe Font Metrics file. This example
will expose you to text processing in a practical setting. We'll then follow up
with a number of detailed sections that look at different practices that
will help you master basic I/O skills. Armed with these techniques, you'll
be able to take on all sorts of text processing and file management tasks with
ease.
=== Line Based File Processing with State Tracking ===
Processing a text document line by line does not mean that we're limited
to extracting content in a uniform way, treating each line identically.
Some files have more structure than that, but can still benefit from
being processed linearly. We're now going to look over a small parser that
illustrates this general idea by selecting different ways to extract our
data based on what section of a file we are in.
The code in this section was written by James Edward Gray II as part of Prawn's
support for Adobe Font Metrics. Though the example itself is domain specific,
we won't get hung up on the particular details of this parser. Instead, we'll
be taking a look at the general approach used to build a state-aware parser
that operates on an efficient line-by-line basis. Along the way, you'll pick
up some basic I/O tips and tricks, as well as see the important role regular
expressions often play in this sort of task.
Before we take a look at the actual parser, let's glance at the sort
of data we're dealing with. Adobe Font Metrics files are essentially font
glyph measurements and specifications, so they tend to look a bit like a
configuration file of sorts. Some of these entries are simple key-value
pairs, such as:
...............................................................................
CapHeight 718
XHeight 523
Ascender 718
Descender -207
...............................................................................
Others are organized sets of values within a section, as in the following
example:
---------------------------------------------------------------------------------
StartCharMetrics 315
C 32 ; WX 278 ; N space ; B 0 0 0 0 ;
C 33 ; WX 278 ; N exclam ; B 90 0 187 718 ;
C 34 ; WX 355 ; N quotedbl ; B 70 463 285 718 ;
C 35 ; WX 556 ; N numbersign ; B 28 0 529 688 ;
C 36 ; WX 556 ; N dollar ; B 32 -115 520 775 ;
....
EndCharMetrics
---------------------------------------------------------------------------------
Sections can be nested within each other, making things more interesting.
The data across the file does not fit a uniform format, as each section
represents a different sort of thing. However, we can come up with patterns
to parse data in each section we're interested in, because they are consistent
within their sections. We are also interested in only a subset of the
sections, so we can safely ignore the rest. This is the essence of the
task we need to accomplish, and if you notice, it's a fairly abstract
pattern that we can reuse. Many documents with a simple section-based
structure can be handled using the approach we show here.
The code that follows is essentially a simple finite state machine that keeps
track of what section the current line appears in. It attempts to parse
the opening or closing of a section first, and then uses this information
to determine a parsing strategy for the current line. The sections that
we're not interested in parsing, we simply skip.
We end up with a very straightforward solution. The whole parser is
reduced to a simple iteration over each line of the file which
manages a stack of nested sections, while determining if and how to
parse the current line.
We'll look at the parts in more detail in just a moment, but here is the
whole AFM parser that extracts all the information we need to properly render
Adobe fonts in Prawn:
...............................................................................
def parse_afm(file_name)
section = []
File.foreach(file_name) do |line|
case line
when /^Start(\w+)/
section.push $1
next
when /^End(\w+)/
section.pop
next
end
case section
when ["FontMetrics", "CharMetrics"]
next unless line =~ /^CH?\s/
name = line[/\bN\s+(\.?\w+)\s*;/, 1]
@glyph_widths[name] = line[/\bWX\s+(\d+)\s*;/, 1].to_i
@bounding_boxes[name] = line[/\bB\s+([^;]+);/, 1].to_s.rstrip
when ["FontMetrics", "KernData", "KernPairs"]
next unless line =~ /^KPX\s+(\.?\w+)\s+(\.?\w+)\s+(-?\d+)/
@kern_pairs[[$1, $2]] = $3.to_i
when ["FontMetrics", "KernData", "TrackKern"], ["FontMetrics", "Composites"]
next
else
parse_generic_afm_attribute(line)
end
end
end
...............................................................................
You could try to understand the particular details if you'd like, but it's
also fine to black-box the expressions used here so that you can get
a sense of the overall structure of the parser. Here's what the code
looks like if we do that for all but the patterns which determine the
section nesting:
...............................................................................
def parse_afm(file_name)
section = []
File.foreach(file_name) do |line|
case line
when /^Start(\w+)/
section.push $1
next
when /^End(\w+)/
section.pop
next
end
case section
when ["FontMetrics", "CharMetrics"]
parse_char_metrics(line)
when ["FontMetrics", "KernData", "KernPairs"]
parse_kern_pairs(line)
when ["FontMetrics", "KernData", "TrackKern"], ["FontMetrics", "Composites"]
next
else
parse_generic_afm_attribute(line)
end
end
end
...............................................................................
With these simplifications, it's very clear that we're looking at an ordinary
finite state machine which is acting upon the lines of the file. It
also makes it easier to notice what's actually going on.
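Note that the iteration itself is driven by +File.foreach+, which yields one
line at a time rather than slurping the whole file into memory the way
+File.read+ does. A tiny self-contained sketch of the difference (the file
here is a throwaway temp file, not an actual AFM document):

```ruby
require "tempfile"

# build a small throwaway file so the example is self-contained
file = Tempfile.new("sample")
file.write("alpha\nbeta\ngamma\n")
file.close

# File.read pulls the entire contents into a single string...
whole = File.read(file.path)

# ...while File.foreach streams the file one line at a time,
# keeping memory usage flat no matter how large the file grows
lines = []
File.foreach(file.path) { |line| lines << line.chomp }

file.unlink
```

For a file measured in gigabytes, the streaming version is the difference
between a script that runs and one that exhausts memory.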
The first case statement is just a simple way to track which section
we're currently looking at, updating the stack as necessary as we move
in and out of sections:
...............................................................................
case line
when /^Start(\w+)/
section.push $1
next
when /^End(\w+)/
section.pop
next
end
...............................................................................
If we find a section beginning or end, we skip to the next line as we know
there is nothing else to parse. Otherwise, we know that we have to do some
real work, which is done in the second case statement:
...............................................................................
case section
when ["FontMetrics", "CharMetrics"]
next unless line =~ /^CH?\s/
name = line[/\bN\s+(\.?\w+)\s*;/, 1]
@glyph_widths[name] = line[/\bWX\s+(\d+)\s*;/, 1].to_i
@bounding_boxes[name] = line[/\bB\s+([^;]+);/, 1].to_s.rstrip
when ["FontMetrics", "KernData", "KernPairs"]
next unless line =~ /^KPX\s+(\.?\w+)\s+(\.?\w+)\s+(-?\d+)/
@kern_pairs[[$1, $2]] = $3.to_i
when ["FontMetrics", "KernData", "TrackKern"], ["FontMetrics", "Composites"]
next
else
parse_generic_afm_attribute(line)
end
...............................................................................
Here, we've got four different ways to handle our line of text. In the first
two cases, we process the lines we need to as we walk through the section,
extracting the bits of information we need and ignoring the extraneous
information we're not interested in.
In the third case, we identify certain sections to skip and simply resume
processing the next line if we are currently within that section.
Finally, if the other cases fail to match, our fallback is to
assume we're dealing with a simple key-value pair, which is handled by a
private helper method in Prawn. Since it does not provide anything different
to look at than the first two sections of this case statement, we can
safely ignore how it works without missing anything important.
However, you might have noticed something interesting: the first
case and the second case use two different ways of extracting values. The code
which processes +CharMetrics+ uses +String#[]+, whereas the code handling
+KernPairs+ uses Perl-style global match variables. The reason for this is
largely convenience. The following two lines of code are equivalent:
...............................................................................
name = line[/\bN\s+(\.?\w+)\s*;/, 1]
name = line =~ /\bN\s+(\.?\w+)\s*;/ && $1
...............................................................................
There are still other ways to handle your captured matches
(such as +MatchData+ via +String#match+), but we'll get into those later. For
now, it's simply worth knowing that when you're trying to extract a single
matched capture, +String#[]+ does the job well, but if you need to deal with
more than one, you need to use another approach. We see this clearly in
the second case:
...............................................................................
next unless line =~ /^KPX\s+(\.?\w+)\s+(\.?\w+)\s+(-?\d+)/
@kern_pairs[[$1, $2]] = $3.to_i
...............................................................................
This code is a bit clever, as the line that assigns the values to +@kern_pairs+
only gets executed when there is a successful match. When the match fails,
it will return +nil+, causing the parser to skip to the next line for
processing.
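To make the comparison concrete, here is a small sketch showing three common
ways to pull out a single capture, including the +String#match+ approach we'll
revisit later. The sample line and variable names are purely illustrative:

```ruby
line = "C 32 ; WX 278 ; N space ; B 0 0 0 0 ;"

# String#[] with a capture index: most compact for a single capture
name1 = line[/\bN\s+(\.?\w+)\s*;/, 1]

# =~ with Perl-style globals: handy when you need several captures
name2 = line =~ /\bN\s+(\.?\w+)\s*;/ && $1

# String#match returns a MatchData object, or nil when the match fails
md    = line.match(/\bN\s+(\.?\w+)\s*;/)
name3 = md && md[1]
```

All three produce the same value here; which one you reach for is mostly a
matter of how many captures you need and how you want to handle failure.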
We could continue studying this example, but we'd then be delving into the
specifics and those details aren't important for remembering this simple
general pattern.
When dealing with a structured document that can be processed by discrete
rules for each section, the general approach is simple and does not typically
require pulling the entire document into memory or doing multiple passes
through the data.
Instead, you can do the following:
* Identify the beginning and end markers of sections with a pattern.
* If sections are nested, maintain a stack which you update before further
processing of each line.
* Break up your extraction code into different cases and select the right
one based on the current section you are in.
* When a line cannot be processed, skip to the next one as soon as possible,
using the +next+ keyword.
* Maintain state as you normally would, processing whatever data you need.
By following these basic guidelines, you can avoid overthinking your problem,
while still saving clock cycles and keeping your memory footprint low.
Although the code here solves a particular problem, it can easily be adapted
to fit a wide range of basic document processing needs.
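The guidelines above can be sketched as a minimal, generic skeleton. The
section names (+Config+, +Ignored+) and the key-value extraction rule are
hypothetical stand-ins for your own rules, not part of Prawn:

```ruby
# A minimal, generic version of the section-tracking parser.
# Section names and the handling code are illustrative only.
def parse_sections(lines)
  section = []
  results = Hash.new { |h, k| h[k] = [] }
  lines.each do |line|
    # first, update the section stack when we cross a boundary
    case line
    when /^Start(\w+)/ then section.push($1); next
    when /^End(\w+)/   then section.pop;      next
    end

    # then, pick an extraction strategy based on the current section
    case section
    when ["Config"]
      results["Config"] << $~.captures if line =~ /^(\w+)\s+(\S+)/
    else
      next # skip sections we don't care about
    end
  end
  results
end

doc = <<~SAMPLE
  StartConfig
  Width 100
  Height 200
  EndConfig
  StartIgnored
  Width 999
  EndIgnored
SAMPLE

parsed = parse_sections(doc.lines)
```

Swap in your own boundary patterns and per-section rules, and the same shell
handles most simple section-structured documents.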
This introduction has hopefully provided a taste of what text processing in
Ruby is all about. The rest of the chapter will provide many more tips and
tricks, with a greater focus on the particular topics. Feel free to jump
around to the things that interest you most, but I'm hoping all of the
sections have something interesting to offer to even seasoned Rubyists.
=== Regular Expressions ===
At the time of writing this chapter, I was spending some time watching the
Dow Jones Industrial Average, as the world was in the middle of a major
financial meltdown. If you're wondering what this has to do with Ruby
or Regular Expressions, take a quick look at the following code:
...............................................................................
require "open-uri"
loop do
puts( open("http://finance.google.com/finance?cid=983582").read[
/<span class="\w+" id="ref_983582_c">([+-]?\d+\.\d+)/m, 1] )
sleep(30)
end
...............................................................................
In just a couple of lines, I was able to throw together a script that would
poll Google Finance and pull down the current average price of the Dow. This
sort of "find a needle in the haystack" extraction is what regular expressions
are all about.
Of course, the art of constructing regular expressions is often veiled in
mystery. Even simple patterns such as this one might make some folks feel a bit uneasy:
...............................................................................
/<span class="\w+" id="ref_983582_c">([+-]?\d+\.\d+)/m
...............................................................................
This expression is simple by comparison to some other examples we can show,
but it still makes use of a number of regular expression concepts. All in
one line, we can see the use of character classes (both general and special),
escapes, quantifiers, groups, and a switch that enables multi-line matching.
Patterns are dense because they are written in a special syntax which acts
as a sort of domain language for matching and extracting text. The reason
why it may be considered daunting is that this language is made up of so
few special characters:
...............................................................................
\ [ ] . ^ $ ? * + { } | ( )
...............................................................................
At their heart, regular expressions are nothing more than a facility for
find and replace operations. This concept is so familiar that anyone who
has used a word processor has a strong grasp on it. Using a regex, you
can easily replace all instances of the word "Mitten" with "Kitten", just
like your favorite text editor or word processor can:
...............................................................................
some_string.gsub(/\bMitten\b/,"Kitten")
...............................................................................
Many programmers get this far and stop. They learn to use regex as if it
were a necessary evil rather than an essential technique. We can do
better than that. In this section, we'll look at a few guidelines for
how to write effective patterns that do what they're supposed to without
getting too convoluted. I'm assuming you've done your homework and are
at least familiar with Regex basics as well as Ruby's pattern syntax. If
that's not the case, pick up your favorite language reference and take a few
minutes to review the fundamentals.
So long as you can comfortably read the first example in this section, you're
ready to move on. If you can convince yourself that writing regular
expressions is actually much easier than people tend to think it is, the tips
and tricks to follow shouldn't cause you to break a sweat.
==== Don't Work Too Hard ====
Despite being such a compact format, it's relatively easy to write bloated
patterns if you don't consciously remember to keep things clean and tight.
We'll now take a look at a couple sources of extra fat and how to trim them
down.
Alternation is a very powerful regex tool. It allows you to match one of
a series of potential sequences. For example, if you want to match the name
"James Gray" but also match "James gray", "james Gray", and "james gray",
the following code will do the trick:
...............................................................................
>> ["James Gray", "James gray", "james gray", "james Gray"].all? { |e|
?> e.match(/(James|james) (Gray|gray)/) }
=> true
...............................................................................
However, you don't need to work so hard. You're really talking about
possible alternations of simply two characters, not two full words. You could
write this far more efficiently using a character class:
...............................................................................
>> ["James Gray", "James gray", "james gray", "james Gray"].all? { |e|
?> e.match(/[Jj]ames [Gg]ray/) }
=> true
...............................................................................
This makes your pattern clearer and also will result in a much better
optimization in Ruby's regex engine. So in addition to looking better,
this code is actually faster.
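If you want to convince yourself that the two styles accept exactly the same
strings, a quick sanity check along these lines will do (the name list is
arbitrary):

```ruby
names = ["James Gray", "James gray", "james Gray", "james gray", "Jim Gray"]

bloated = /(James|james) (Gray|gray)/
concise = /[Jj]ames [Gg]ray/

# both patterns accept and reject exactly the same strings
agree = names.all? { |n| !!bloated.match(n) == !!concise.match(n) }
```

The character-class version simply says the same thing with less machinery,
which is exactly why the regex engine can handle it more efficiently.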
In a similar vein, it is unnecessary to use explicit character classes
when a shortcut will do. To match a four digit number, we could write:
...............................................................................
/[0-9][0-9][0-9][0-9]/
...............................................................................
Which can of course be cleaned up a bit using repetitions:
...............................................................................
/[0-9]{4}/
...............................................................................
However, we can do even better by using the special class built in for this:
...............................................................................
/\d{4}/
...............................................................................
It pays to learn what shortcuts are available to you. Here's a quick list
for further study, if you're not already familiar with them:
...............................................................................
. \s \S \w \W \d \D
...............................................................................
Each one of the above corresponds to a literal character class that is more
verbose when written out. Using shortcuts increases clarity and decreases
the chance of bugs creeping in via ill-defined patterns. Though they may seem
a bit terse at first, you'll be able to sight-read them with ease over time.
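Each shortcut stands in for a longhand class; for example, +\d+ is +[0-9]+ and
+\w+ is +[A-Za-z0-9_]+ (for ASCII input). A quick check of those two
correspondences:

```ruby
# \d and [0-9] pick out the same run of digits
digit_short = "year: 2024"[/\d{4}/]
digit_long  = "year: 2024"[/[0-9]{4}/]

# \w and [A-Za-z0-9_] pick out the same run of word characters
word_short = "foo_bar99!"[/\w+/]
word_long  = "foo_bar99!"[/[A-Za-z0-9_]+/]
```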
==== Anchors are your friends ====
One way to match my name in a string is to write the following simple pattern:
...............................................................................
string =~ /Gregory Brown/
...............................................................................
However, consider the following:
...............................................................................
>> "matched" if "Mr. Gregory Browne".match(/Gregory Brown/)
=> "matched"
...............................................................................
Oftentimes, we mean "match this phrase", but we write "match this sequence
of characters". The solution is to make use of anchors to clarify what we
mean.
Sometimes we want to match only if a string starts with a phrase:
...............................................................................
>> phrases = ["Mr. Gregory Browne", "Mr. Gregory Brown is cool",
"Gregory Brown is cool", "Gregory Brown"]
>> phrases.grep /\AGregory Brown\b/
=> ["Gregory Brown is cool", "Gregory Brown"]
...............................................................................
Other times we want to ensure that the string contains the phrase:
...............................................................................
>> phrases.grep /\bGregory Brown\b/
=> ["Mr. Gregory Brown is cool", "Gregory Brown is cool", "Gregory Brown"]
...............................................................................
And finally, sometimes we want to ensure the string contains an exact phrase:
...............................................................................
>> phrases.grep /\AGregory Brown\z/
=> ["Gregory Brown"]
...............................................................................
Although I am using English names and phrases here for simplicity, this can
of course be generalized to encompass any sort of matching pattern. You could
be verifying that a sequence of numbers fit a certain form, or something
equally abstract. The key thing to take away from this is that when you use
anchors, you're being much more explicit about how you expect your pattern to
match, which in most cases means that you'll have a better chance of catching
problems faster, and an easier time remembering what your pattern was supposed
to do.
An interesting thing to note about anchors is that they don't actually match
characters. Instead, they match between characters to allow you to assert
certain expectations about your strings. So when you use something like +\b+,
you are actually matching between one of +\w\W+, +\W\w+, +\A+, or +\z+. In English,
that means that you're transitioning from a word character to a non-word
character, from a non-word character to a word character, or you're matching
the beginning or end of the string. If you review the use of +\b+ in the examples above,
it should now be very clear how anchors work.
The full list of available anchors in Ruby is +\A+, +\Z+, +\z+, +^+, +$+, and
+\b+. Each has its merits, so be sure to read up on them.
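The difference between the string anchors and the line anchors is easiest to
see against a multiline string. A quick sketch:

```ruby
text = "first line\nGregory Brown\nlast line"

# ^ and $ match at line boundaries within the string...
line_match   = text[/^Gregory Brown$/]

# ...while \A and \z anchor to the whole string, so this fails
string_match = text[/\AGregory Brown\z/]

# \Z is like \z, except it tolerates a single trailing newline
newline_ok = "Gregory Brown\n"[/\AGregory Brown\Z/]
```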
==== Use caution when working with quantifiers ====
One of the most common anti-patterns I picked up when first learning regular
expressions was to make use of +.*+ everywhere. Though this may seem innocent,
it is similar to my bad habit of using +rm -Rf+ on the command line all the
time instead of just +rm+. Both can result in catastrophe when used
incorrectly.
But maybe you're not as crazy as I am. Instead, maybe you've been writing
innocent things like +/(\d*)Foo/+ to match any number of digits prepended to
the word Foo. For some cases, this works great:
...............................................................................
>> "1234Foo"[/(\d*)Foo/,1]
=> "1234"
...............................................................................
But does this surprise you?
...............................................................................
>> "xFoo"[/(\d*)Foo/,1]
=> ""
...............................................................................
It may not, but then again it may. It's relatively common to forget that +*+
always matches. At a first glance, the following code seems fine:
...............................................................................
if num = string[/(\d*)Foo/,1]
Integer(num)
end
...............................................................................
However, since the match will capture an empty string in its failure case,
this code will break. The solution is simple. If you really mean "at least
one", use + instead.
...............................................................................
if num = string[/(\d+)Foo/,1]
Integer(num)
end
...............................................................................
Though more experienced folks might not easily be trapped by something so
simple, there are more subtle variants. For example, if we intend to match
only "Greg" or "Gregory", the following code doesn't quite work:
...............................................................................
>> "Gregory"[/Greg(ory)?/]
=> "Gregory"
>> "Greg"[/Greg(ory)?/]
=> "Greg"
>> "Gregor"[/Greg(ory)?/]
=> "Greg"
...............................................................................
Even if the pattern looks close to what we want, we can see the results
don't fit. The following modifications remedy the issue:
...............................................................................
>> "Gregory"[/\bGreg(ory)?\b/]
=> "Gregory"
>> "Greg"[/\bGreg(ory)?\b/]
=> "Greg"
>> "Gregor"[/\bGreg(ory)?\b/]
=> nil
...............................................................................
Notice that the pattern now properly matches Greg or Gregory, but no other
words. The key thing to take away here is that unbounded zero-matching
quantifiers are tautologies. They can never fail to match, so you need
to be sure to account for that.
A final gotcha about quantifiers is that they are greedy by default.
This means they'll try to consume as much of the string as possible before
matching. The following is an example of a greedy match:
...............................................................................
>> "# x # y # z #"[/#(.*)#/,1]
=> " x # y # z "
...............................................................................
As you can see, this matches everything between the first and last +#+ character.
But sometimes, we want processing to happen from the left and end as soon
as we have a match. To do this, we append a +?+ to our repetition:
...............................................................................
>> "# x # y # z #"[/#(.*?)#/,1]
=> " x "
...............................................................................
All quantifiers can be made non-greedy this way. Remembering this will save a lot of
headaches in the long run.
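For instance, even a bounded repetition accepts the lazy modifier:

```ruby
digits = "123456789"

greedy = digits[/\d{2,5}/]   # consumes as many digits as the bound allows
lazy   = digits[/\d{2,5}?/]  # stops as soon as the minimum of two is met
```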
Though our treatment of regular expressions has been by no means
comprehensive, these few basic tips will really carry you a long way.
The key things to remember are:
* Regular Expressions are nothing more than a special language
for find and replace operations, built upon simple logical constructs.
* There are lots of shortcuts built in for common regular expression
operations, so be sure to make use of special character classes and
other simplifications when you can.
* Anchors provide a way to set up some expectation about where in a string
you want to look for a match. These help with both optimization and
pattern correctness.
* Quantifiers such as +*+ and +?+ will always match, so they should not be
used without sufficient boundaries.
* Quantifiers are greedy by default, and can be made non-greedy via +?+.
By following these guidelines, you'll write clearer, more accurate, and
faster regular expressions. As a result, it'll be a whole lot easier to
revisit them when you run into them in your own old code a few months down
the line.
A final note on regular expressions is that sometimes we are seduced by their
power and overlook other solutions that may be more robust for certain needs.
In both the stock ticker and AFM parsing examples, we were working within
the realm where regular expressions are a quick, easy, and fine way to go.
However, as documents take on more complex structures, and your needs move
from extracting some values to attempting to fully parse a document, you
will probably need to look to other techniques that involve full blown
parsers such as Treetop, Ghostwheel, or Racc. These libraries can solve
problems regular expressions can't solve, and if you find yourself with
data that's hard to map a regex to, it's worth looking at these alternative
solutions.
Of course, your mileage will vary based on the problem at hand, so don't be
afraid of trying a regex based solution first before pulling out the big guns.
=== Working With Files ===
There are a whole slew of options for doing various file management tasks in
Ruby. Because of this, it can be difficult to decide what the best approach
for a given task might be. In this section, we'll cover two key tasks while
looking at three of Ruby's standard libraries.
We'll start by showing how to use the +Pathname+ and +FileUtils+ libraries to
traverse your file system using a clean cross-platform approach that rivals
the power of popular *nix shells without sacrificing compatibility. We'll
then move on to show how to use +Tempfile+ to automate handling of temporary
file resources within your scripts. These practical tips will help you
write platform-agnostic Ruby code that'll work out of the box on more
systems, while still managing to make your job easier.
==== Using Pathname and FileUtils ====
If you are using Ruby to write administrative scripts, it's nearly inevitable
that you've needed to do some file management along the way. It may be quite
tempting to drop down to the shell to do things like move and rename
directories, search for files in a complex directory structure, and do other
common tasks that involve ferrying files around from one place to the other.
However, Ruby provides some great tools to avoid this sort of thing.
The +Pathname+ and +FileUtils+ standard libraries provide virtually everything
you need for file management. The best way to demonstrate their capabilities
is by example, so we'll now take a look at some code and then break it down
piece by piece.
To illustrate +Pathname+, we can take a look at a small tool I've built for
doing local installations of libraries found on GitHub. This script,
called 'mooch', essentially looks up and clones a git repository, puts it
in a convenient place within your project (a 'vendor/' directory), and
optionally sets up a stub file that will include your vendored packages
into the loadpath upon requiring it. Sample usage looks something like
this:
...............................................................................
$ mooch init lib/my_project
$ mooch sandal/prawn 0.2.3
$ mooch ruport/ruport 1.6.1
...............................................................................
Then, we can see the following will work without loading rubygems:
...............................................................................
>> require "lib/my_project/dependencies"
=> true
>> require "prawn"
=> true
>> require "ruport"
=> true
>> Prawn::VERSION
=> "0.2.3"
>> Ruport::VERSION
=> "1.6.1"
...............................................................................
Although this script is pretty useful, it's not what we're here to talk
about. Instead, let's focus on how this sort of thing is built,
since it shows a practical example of using +Pathname+ to manipulate files and
folders. I'll start by showing you the whole script, and then we'll walk
through it part by part:
...............................................................................
#!/usr/bin/env ruby
require "pathname"
WORKING_DIR = Pathname.getwd
LOADER = %Q{
require "pathname"
Pathname.glob("#{WORKING_DIR}/vendor/*/*/") do |dir|
lib = dir + "lib"
$LOAD_PATH.push(lib.directory? ? lib : dir)
end
}
if ARGV[0] == "init"
lib = Pathname.new(ARGV[1])
lib.mkpath
(lib + 'dependencies.rb').open("w") do |file|
file.write LOADER
end
else
vendor = Pathname.new("vendor")
vendor.mkpath
Dir.chdir(vendor.realpath)
system("git clone git://github.com/#{ARGV[0]}.git #{ARGV[0]}")
if ARGV[1]
Dir.chdir(ARGV[0])
system("git checkout #{ARGV[1]}")
end
end
...............................................................................
As you can see, it's not a ton of code, even though it does a lot. Let's
shine the spotlight on the interesting +Pathname+ bits.
...............................................................................
WORKING_DIR = Pathname.getwd
...............................................................................
Here we are simply assigning the initial working directory to a constant. We
use this to build up the code for the 'dependencies.rb' stub script that can
be generated via +mooch init+. Here we're just doing quick and dirty code
generation, and you can see the full stub as stored in +LOADER+:
...............................................................................
LOADER = %Q{
require "pathname"
Pathname.glob("#{WORKING_DIR}/vendor/*/*/") do |dir|
lib = dir + "lib"
$LOAD_PATH.push(lib.directory? ? lib : dir)
end
}
...............................................................................
This script does something fun. It looks in the working directory that
+mooch init+ was run in for a folder called 'vendor', and then looks for
folders two levels deep fitting the GitHub convention of username/project. We
then use a glob to traverse the directory structure, in search of folders
to add to the loadpath. The code will check to see if each project has a
'lib' folder within it (as is the common Ruby convention), but will add the
project folder itself to the loadpath if it is not present.
Here we notice a few of +Pathname+'s niceties. You can see we can construct
new paths by just adding new strings to them, as shown here:
...............................................................................
lib = dir + "lib"
...............................................................................
In addition to this, we can check to see if the path we've created actually
points to a directory on the filesystem, via a simple +Pathname#directory?+
call. This makes traversal downright easy, as you can see in the preceding
code.
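These pieces are easy to try in isolation. The following sketch rebuilds the
same traversal against a throwaway directory tree; the 'withlib' and 'nolib'
project names are made up for illustration:

```ruby
require "pathname"
require "tmpdir"

# Build a scratch vendor/ tree; all names here are hypothetical
root = Pathname.new(Dir.mktmpdir)
(root + "vendor/someuser/withlib/lib").mkpath
(root + "vendor/someuser/nolib").mkpath

entries = []
Pathname.glob("#{root}/vendor/*/*/") do |dir|
  lib = dir + "lib"                        # appending a string yields a new Pathname
  entries.push(lib.directory? ? lib : dir)
end

# One entry ends in withlib/lib; the other is the nolib project folder itself
entries.each { |path| puts path }
```

The ternary on +Pathname#directory?+ is what lets the loader cope with
projects that don't follow the 'lib/' convention.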
This simple stub may be a bit dense, but once you get the hang of +Pathname+,
you can see that it's quite powerful. Let's look at a couple more tricks,
focusing this time on the code that actually writes this snippet to file:
...............................................................................
lib = Pathname.new(ARGV[1])
lib.mkpath
(lib + 'dependencies.rb').open("w") do |file|
file.write LOADER
end
...............................................................................
Before, we showed an invocation that looked like this:
...............................................................................
$ mooch init lib/my_project
...............................................................................
Here, +ARGV[1]+ is 'lib/my_project'. So, in the preceding code, you can see
we're building up a relative path to our current working directory and
then creating a folder structure. A very cool thing about +Pathname+ is that
+Pathname#mkpath+ works much like +mkdir -p+ on *nix: it will create any
necessary intermediate directories as needed, and won't complain if the
structure already exists, both of which are exactly what we want here.
Once we build up the directories, we need to create our 'dependencies.rb' file
and populate it with the string in +LOADER+. We can see here that Pathname
provides shortcuts that work in a similar fashion to +File.open()+.
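Both tricks are easy to reproduce on their own. Here's a minimal sketch that
builds a nested directory and writes a stub file into it; the path and the
stub contents are made up for the example:

```ruby
require "pathname"
require "tmpdir"

base = Pathname.new(Dir.mktmpdir)
lib  = base + "lib/my_project"

lib.mkpath    # like mkdir -p: creates lib/ and lib/my_project/ as needed
lib.mkpath    # calling it again on an existing structure is harmless

stub = lib + "dependencies.rb"
stub.open("w") do |file|    # Pathname#open works just like File.open
  file.write "# generated stub\n"
end

puts stub.read
```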
In the code that actually downloads and vendors libraries from GitHub,
we see the same techniques in use yet again, this time mixed in with some
shell commands and +Dir.chdir+. Since this doesn't introduce anything new,
we can skip over the details.
Before we move on to discussing temporary files, we'll take a quick look
at +FileUtils+. The purpose of this module is to provide a UNIX-like interface
to file manipulation tasks, and a quick look at its method list will show
that it does a good job of this:
...............................................................................
cd(dir, options)
cd(dir, options) {|dir| .... }
pwd()
mkdir(dir, options)
mkdir(list, options)
mkdir_p(dir, options)
mkdir_p(list, options)
rmdir(dir, options)
rmdir(list, options)
ln(old, new, options)
ln(list, destdir, options)
ln_s(old, new, options)
ln_s(list, destdir, options)
ln_sf(src, dest, options)
cp(src, dest, options)
cp(list, dir, options)
cp_r(src, dest, options)
cp_r(list, dir, options)
mv(src, dest, options)
mv(list, dir, options)
rm(list, options)
rm_r(list, options)
rm_rf(list, options)
install(src, dest, mode = <src's>, options)
chmod(mode, list, options)
chmod_R(mode, list, options)
chown(user, group, list, options)
chown_R(user, group, list, options)
touch(list, options)
...............................................................................
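Most of these methods accept options such as +:verbose+ (echo the equivalent
shell command) and +:noop+ (describe the operation without touching the disk).
Here's a small sketch against a throwaway directory; the filenames are made
up for the example:

```ruby
require "fileutils"
require "tmpdir"

dir = Dir.mktmpdir
src = File.join(dir, "a.txt")
File.open(src, "w") { |f| f << "hello" }

# :noop previews the operation without performing it
FileUtils.cp(src, File.join(dir, "b.txt"), :noop => true)
puts File.exist?(File.join(dir, "b.txt"))   # => false

FileUtils.mkdir_p(File.join(dir, "deeply/nested"))   # like mkdir -p
FileUtils.cp(src, File.join(dir, "deeply/nested"))
FileUtils.rm_rf(File.join(dir, "deeply"))            # recursive, forced removal
puts File.exist?(File.join(dir, "deeply"))  # => false
```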
You'll see a bit more of +FileUtils+ later on in the chapter when we talk about
atomic saves. But before we jump into advanced file management techniques,
let's take a quick look at another important foundational tool, the tempfile
standard library.
==== Using the tempfile Standard Library ====
Producing temporary files is a common need in many applications. Whether
you need to store some data on disk to keep it out of memory until it
is needed again, or you want to serve up a file but don't need to keep it
lurking around after your process has terminated, odds are you'll run into
this problem sooner or later.
It's quite tempting to roll our own +Tempfile+ support, which might look
something like the following code:
...............................................................................
File.open("/tmp/foo.txt","w") do |file|
file << some_data
end
# Then in some later code
File.foreach("/tmp/foo.txt") do |line|
# do something with data
end
# Then finally
require "fileutils"
FileUtils.rm("/tmp/foo.txt")
...............................................................................
This code works, but it has some drawbacks. First, it assumes that you're
on a *nix system with a '/tmp' directory. Second, we don't do anything to
avoid file collisions, so if another application is using '/tmp/foo.txt',
this code will overwrite it. Finally, we need to remove the file explicitly,
or risk leaving a bunch of trash around.
Luckily, Ruby has a standard library that helps us get around these issues.
Using it, our example then looks like this:
...............................................................................
require "tempfile"
temp = Tempfile.new("foo.txt")
temp << some_data
# then in some later code
temp.rewind
temp.each do |line|
# do something with data
end
# Then finally
temp.close
...............................................................................
Let's take a look at what's going on in a little more detail, to really get
a sense of what the +tempfile+ library is doing for us.
==== Automatic Temporary Directory Handling ====
The code looks somewhat similar to our original example, as we're still
essentially working with an IO object. However, the approach is different.
+Tempfile+ opens a file handle to a file stored in whatever your system's
temporary directory is. We can inspect this value, and even change it if we
need to. Here's what it looks like on two of my systems:
...............................................................................
>> Dir.tmpdir
=> "/var/folders/yH/yHvUeP-oFYamIyTmRPPoKE+++TI/-Tmp-"
>> Dir.tmpdir
=> "/tmp"
...............................................................................
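We can verify that a fresh handle really does land under this directory;
here's a quick sketch (the exact generated filename will differ on every run):

```ruby
require "tempfile"
require "tmpdir"

temp = Tempfile.new("demo")
puts temp.path                           # lives somewhere under Dir.tmpdir
puts temp.path.start_with?(Dir.tmpdir)   # => true
temp.close!                              # close and delete the file immediately
```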
Usually, it's best to go with whatever this value is because it is where Ruby
thinks your temp files should go. However, in the cases where we want to
control this ourselves, it is simple to do so, as shown in the following:
...............................................................................
temp = Tempfile.new("foo.txt", "path/to/my/tmpdir")
...............................................................................
==== Collision Avoidance ====
When you create a temporary file with +Tempfile.new+, you aren't actually
specifying an exact filename. Instead, the filename you specify is used
as a base name and then gets a unique identifier appended to it. This
prevents one temp file from accidentally overwriting another. Here's a
trivial example that shows what's going on under the hood:
...............................................................................