%% This file is a portion of the source for Revised Edition 1.1 of
%% Operating Systems and Middleware: Supporting Controlled
%% Interaction, Copyright 2011 by Max Hailperin. This work is
%% licensed under the Creative Commons Attribution-ShareAlike 3.0
%% Unported License. To view a copy of this license, visit
%% http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to
%% Creative Commons, 171 Second Street, Suite 300, San Francisco,
%% California, 94105, USA.
\newcommand{\persistenceChapterTitle}{Files and Other Persistent Storage}
\chapter{\persistenceChapterTitle}
\label{persistence-chapter}
\section{Introduction}
In this chapter, you will study two different kinds of service, each of
which can be provided by either an operating system or middleware, and
each of which can take several different forms. \vocabindex{Persistence}{persistence} services
provide for the retention of data for periods of time that are long
enough to include system crashes, power failures, and similar
disruptions. \vocabindex{Access}{access} services provide application
programs the means to operate on objects that
are identified by name or by other attributes, such as a portion of the
contents. In principle, these two kinds of service are independent of
one another: persistent objects can be identified by numeric
address rather than by name, and naming can be applied to
non-persistent objects. However, persistence and access services are
often provided in concert, as with named files stored on disk.
Therefore, I am addressing both in a single chapter.
Any kind of object that stores data can be persistent,
whether the object is as simple as a
sequence of bytes or as complex as an application-specific object,
such as the representation of a retirement portfolio in a benefits
management system. In contemporary mainstream systems, the three most
common forms of persistent storage are as follows:
\begin{itemize}
\item
A \vocab{file}, which is an array of bytes that can be modified in length, as
well as read and written at any numerically specified position.
(Historically, the word has had other meanings, but this definition
has become dominant.) File storage is normally provided by operating
systems and will serve as my primary example in this chapter.
\item
A \vocab{table}, which in a relational database system is a multiset of
\vocabs{row} (also known as \vocabs{tuple} or \vocabs{record}). Each row provides an
appropriately typed value for each of the table's \vocabs{column}.
For example, a table of chapters might have a title column, which holds
character strings, and a number column, which holds integers. Then an
individual row within that table might contain the title {\tt
"\persistenceChapterTitle"} and the number {\tt \ref{persistence-chapter}}.
Database storage is normally provided by middleware, rather than by an
operating system.
\item
A \foldvocab{persistent}{object}, which is
an application-specific object of the sort associated with
object-oriented programming languages. For example, Java objects
can be made persistent. Persistent objects
are normally supported by middleware using one of the previous two
types of persistent storage. Unfortunately, there are many
competing approaches to supporting persistent objects; even the Java
API does not yet have a single standardized approach. Therefore, I will not
discuss persistent objects any further.
\end{itemize}
Access services can also take a variety of forms, from a single
directory of unique names to the sort of sophisticated full-text
search familiar from the web. I will concentrate on two access options
that are popular in operating systems and middleware:
\begin{itemize}
\item
Hierarchical \vocabyies{director} map names into objects, each of
which can be a subdirectory, thereby forming a tree of directories (or
nested file folders). In some variants, objects can be accessible
through multiple names, either directly (multiple names refer to one
object) or indirectly (one name refers to another name, which refers
to an object). Operating systems generally use hierarchical
directories to provide access to files.
\item
\vocabindex{Indexes}{index} provide access to those objects that contain specified data.
For example, an index on a table of orders could be used to find those rows
that describe orders placed by a particular customer. Relational database
middleware commonly uses indexes to provide access to rows. Files
can also be indexed for fast searching.
\end{itemize}
\section{Disk Storage Technology}\label{disk-storage-technology-section}
A disk drive stores fixed-sized blocks of data known as
\vocabs{sector}; a typical sector size is 512 bytes. The interface
between a contemporary disk drive and a computer is conceptually quite
simple, essentially just a large array of sectors. Just like in any
array, the sectors are consecutively numbered, from 0 up to a maximum
that depends on the capacity of the drive. The processor can ask the
disk controller to perform two basic operations:
\begin{itemize}
\item
The processor can request that the controller \vocab{write} data from a specified address in the
main memory (RAM) to a specified sector number on the
disk. For reasons you will see in the remainder of this section, the write request can also
specify a number of consecutive sectors to be transferred.
\item
The processor can request that the controller \vocab{read} data from a specified sector number to a specified
address in the main memory. Again, the
read request can specify that multiple consecutive sectors be
transferred.
\end{itemize}
This view of the disk drive as one large array of sectors suffices for
writing correct software, but not for writing software that performs
well. Because some disk accesses involve far more mechanical movement
than others, the access time can vary substantially. In particular,
contemporary disk drives can sustain data transfer rates measured in
tens of megabytes per second if accessed optimally, but only tens of
kilobytes per second if accessed randomly. To understand what the
software needs to do to avoid this performance penalty of three orders
of magnitude, it helps to look inside the black box at the internal
structure of a disk drive, as in Figure~\ref{photo-8-1}.
\begin{figure}
\centerline{\includegraphics{hail_f0801}}
%\centerline{\epsfbox{photo-8-1.eps}}
\caption{In this photo of an opened disk drive, a stack of four
platters is visible at the top, with a head arm
extending into the platter
area. The part you can see is the topmost layer of the head arm,
holding the head for the top surface of the top platter. Similar
layers are stacked below it; for example, the next layer down has
heads for the bottom of the top platter and the top of the second
platter.
%% The following attribution line must be kept. Note that Seagate
%% explicitly allows inclusion of the photo in this Creative Commons
%% licensed book, provided that the attribution is kept, but is not
%% granting permission for the photo to be distributed on its own,
%% separate fom the book.
Photo copyright by and reprinted by courtesy of Seagate Technology LLC.}
\label{photo-8-1}
\end{figure}
A disk drive contains a stack of platters mounted on a common spindle
and spinning at a fixed rotational speed, such as 10,000 revolutions
per minute. Data is recorded onto the surface of the platters and
read back off using heads, one recording and playback head per
surface. The heads are supported by an arm that can pivot so as to
position the heads nearer or further from the central spindle;
Figure~\ref{photo-8-1} shows the relationship between the platters,
the spindle, and the head arm.
If the
arm is left in a fixed position, the rotation of the disks causes a
circular region of each disk surface to pass under the corresponding
head. This circular region of a single disk surface is known as a
\vocab{track}; each track is divided into hundreds of sectors. The
collection of tracks, one per disk surface, accessible at a particular
position of the head arm is called a \vocab{cylinder}.
Only one head can be active at any time. Depending on which head is active
and the position of the head arm, a single track's worth of sectors can
be read or written.
In order to access more data than can fit in a single track, there are
two options. A \vocab{head switch} changes the active head and
thereby provides access to another track in
the same cylinder. A \vocab{seek} moves the arm to a position closer or further
from the spindle in order to provide access to another cylinder. As it happens, the head switch time on a modern
drive is quite similar to the time needed to seek to an adjacent
cylinder---a fraction of a millisecond---so the distinction is not important; henceforth, I'll talk
only about seeking.
The seek time is larger for tracks that are further apart, but not
proportionately so, for some of the same reasons that the duration of an
automobile trip is not proportional to its distance. Just as an
automobile needs to accelerate at the beginning of a trip, decelerate
at the end of the trip, and then painstakingly pull into a parking
spot, so too a disk arm needs to accelerate, decelerate, and home in
on the exact position of the destination track. The net result of
this is that seeking to a track tens of thousands of cylinders away
may take 5 milliseconds, only ten times as long as seeking to an adjoining track.
Seeking to an adjoining track already takes long enough that tens of
kilobytes of data could have been transferred were the drive not busy
seeking.
Even the ten-fold speed ratio between short and long seeks is
misleading, however, because accessing a sector involves more than
just seeking to that sector's track. Once the appropriate track is
spinning under the active head, the disk controller needs to wait for
the appropriate sector to come around to the head's position, a delay
known as \foldvocab{rotational}{latency}. Because the time a disk
takes to complete one revolution is comparable to the time taken to
seek across tens of thousands of cylinders, the rotational latency can
bring the total access time for a random sector on an adjoining track
to within a small factor of the access time for a sector on a distant
track.
Once an access is underway, additional sectors can be read or written
at high speed as they pass under the head assembly. Even a request
for multiple sectors that happens to cross a track boundary will pay
only the penalty of seek time and not the larger penalty of rotational latency,
because the first sector of the track is positioned so that it passes under the head just
after the seek completes.
If the software accesses a large number of consecutive
sectors, there are two advantages to doing so in a single
large request rather than several smaller requests. One advantage is reduced overhead in the interface
between computer and disk. The other difference, which is more significant, is
that issuing separate requests may cost additional disk revolutions,
particularly for writes; missing the timing for a sector means waiting
a full revolution for it to come around again. (Reads are less of a
problem because disk drives contain on-board RAM, known as \vocab{cache buffers}, to hold data that has
passed under the active head, so that it can be accessed without
waiting for it to come around again. Disks can use the cache buffers for writes
as well, but only if the software is designed to tolerate some loss of
recently written data upon system crash.)
Thus, the secrets to attaining a disk drive's full potential are
locality, locality, and locality:
\begin{itemize}
\item
Accessing a sector with a similar identifying number to the most recently accessed one
will generally be faster than accessing a sector with a greatly
different number.
\item
Accessing consecutive sectors will generally be faster than accessing
sectors that have even small differences in sector number.
\item
Accessing consecutive sectors in one large request will be faster
than accessing them in several smaller requests.
\end{itemize}
You should keep these performance issues related to locality in mind when
considering topics such as how disk space is allocated.
There is one other performance issue, less directly related to
locality, which I will only briefly mention here. (Seeing how it
influences software design would be interesting, but beyond the level
of this book.) The software should not wait for the disk drive to
complete each request before issuing the next request, which may be
from a different thread. Disk drives are
capable of queuing up multiple requests and then handling them in whichever
order best utilizes the mechanical components. For example, if
several accesses to the same track are queued, the disk drive can
perform them in the order the sectors happen to pass under the head.
Throughout this chapter, I will focus on systems
that employ a single disk drive, for the sake of simplicity. Using
multiple drives to divide or replicate data raises interesting
trade-offs of reliability and performance; the notes section at the
end of the chapter suggests some readings if you want to explore this
area.
\section{POSIX File API}\label{posix-file-api-section}
All UNIX-like systems (including Linux and Mac OS~X) support a rather
complicated set of procedures for operating on \index{file I/O}files, which has
evolved over the decades, eventually becoming part of the POSIX
standard. For most everyday purposes, programmers can and should
ignore this API, instead using one of the cleaner, higher-level APIs
built on top of it, such as those included in the Java and C$++$
standards. Nonetheless, I will introduce the POSIX API here,
because in many important systems, it forms the interface between the
operating system kernel and software running in user-level application
processes, even if the latter is encapsulated in libraries.
\subsection{File Descriptors}
Files are referred to in two different ways: by character-string
\vocabs{pathname}
(such as {\tt microshell.c} or {\tt /etc/passwd}) and by integer
\foldvocabs{file}{descriptor} (such as 0, 1, or 17).
A pathname is a name of a file, optionally including a sequence of
directories used to reach it. A file descriptor, on the other hand,
provides no information about the file's name or location; it is just
a featureless integer.
Many operations
require file descriptors; in particular, to read data from a file or
write data into a file requires a file descriptor. If a process
happens to have inherited a file descriptor when it was forked from
its parent (or happens to have received the file descriptor in a
message from another process), then it can read or write the file
without ever knowing a name for it. Otherwise, the process can use
the \index{open@\verb"|open"|}\verb|open| procedure to obtain a file
descriptor for a named file. When the process is done
with the file descriptor, it can
\index{close@\verb"|close"|}\verb|close| it. (When a process
terminates, the operating system automatically closes any remaining
open file descriptors.)
File descriptors can refer not only to open files, but also to other
sources and destinations for input and output, such as the keyboard
and display screen. Some procedures will work only for regular files,
whereas others work equally well for hardware devices, network
communication ports, and so forth. I will flag some places where these
distinctions matter; however, my primary focus will be on regular files
in persistent storage.
By convention, all processes inherit at least three file descriptors
from their parent.
These file descriptors, known as the \foldvocab{standard}{input},
\foldvocab{standard}{output}, and \foldvocab{standard}{error output},
are numbered 0, 1, and 2, respectively. Rather than
remembering the numbers, you should use the symbolic names defined in
\verb|unistd.h|, namely, \index{STDIN_FILENO@\verb"|STDIN_FILENO"|}\verb|STDIN_FILENO|, \index{STDOUT_FILENO@\verb"|STDOUT_FILENO"|}\verb|STDOUT_FILENO|, and
\index{STDERR_FILENO@\verb"|STDERR_FILENO"|}\verb|STDERR_FILENO|.
When you run a program from a shell and don't make special
arrangements, standard input generally is your keyboard, while the
standard output and error output are both directed to the shell's
window on your display screen. You can \foldindex{standard I/O}{redirection}redirect the standard input or
output to a file by using the shell's \verb|<| and \verb|>|
notations. For example, the shell command
\begin{verbatim}
ps l >my-processes
\end{verbatim}
runs the \verb|ps| program with the \verb|l| option to generate a
list of processes, as you saw in Chapter~\ref{processes-chapter}. However, rather
than displaying the list on your screen, this command puts the list
into a file called \verb|my-processes|. The \verb|ps| program doesn't
need to know anything about this change; it writes its output
to the standard output in either case. Only the shell needs to do
something different, namely, closing the preexisting standard output
and opening the file in its place before executing the
\verb|ps| program.
If the \verb|ps| program has any error messages to
report, it outputs them to the standard error output, which remains
connected to your display screen. That way, the error messages aren't
hidden in the \verb|my-processes| file.
Figure~\ref{file-processes-source} contains a program illustrating how
the shell would operate in the preceding example, with a child process
closing its inherited standard output and then opening \verb|my-processes| before executing \verb|ps|.
\begin{figure}
\begin{verbatim}
#include <unistd.h>
#include <stdio.h>
#include <iostream>
#include <fcntl.h>
#include <sys/wait.h>
#include <sys/stat.h>
using namespace std;
int main(){
  pid_t returnedValue = fork();
  if(returnedValue < 0){
    perror("error forking");
    return -1;
  } else if (returnedValue == 0){
    if(close(STDOUT_FILENO) < 0){
      perror("error closing standard output");
      return -1;
    }
    // When there is no error, open returns the smallest file
    // descriptor not already in use by this process, so having
    // closed STDOUT_FILENO, the open should reuse that number.
    if(open("my-processes", O_WRONLY | O_CREAT | O_TRUNC,
            S_IRUSR | S_IWUSR) < 0){
      perror("error opening my-processes");
      return -1;
    }
    execlp("ps", "ps", "l", NULL); // ps with option letter l
    perror("error executing ps");
    return -1;
  } else {
    if(waitpid(returnedValue, 0, 0) < 0){
      perror("error waiting for child");
      return -1;
    }
    cout << "Note the parent still has the old standard output."
         << endl;
  }
}
\end{verbatim}
\caption{This C$++$ program, {\tt file-processes.cpp}, illustrates how
the shell runs the command {\tt ps l >my-processes}. After
forking, the child process closes the inherited standard output and
in its place opens {\tt my-processes} before executing {\tt ps}.}
\label{file-processes-source}
\end{figure}
The most complicated procedure call is the one to \verb|open|. The
first argument is the name of the file to open. Because this
character string does not contain any slash characters ({\tt /}), the
file is found in the process's current directory. (Every process has a
current \foldvocab{working}{directory}, which can be changed using the \index{chdir@\verb"|chdir"|}\verb|chdir|
procedure.) If the name contained one or more slashes, such as {\tt
alpha/beta/gamma} or {\tt /etc/passwd}, then the operating system would
traverse one or more directories to find the file to open. In particular,
{\tt alpha/beta/gamma} would start with the current directory, look for
subdirectory {\tt alpha}, look in {\tt alpha} for {\tt beta},
and finally look in {\tt beta} for the file {\tt gamma}. Because {\tt
/etc/passwd} starts with a slash, the search for this file would begin
by looking in the root directory for {\tt etc} and then in that directory
for {\tt passwd}. In
Section~\ref{directory-indexing-section}, I will discuss file naming
further, including related aspects of the POSIX API, such as how a
file can be given an additional name or have a name removed.
The second argument to \index{open@\verb"|open"|}\verb|open| specifies the particular way in
which the file should be opened. Here, the \verb|O_WRONLY| indicates
the file should be opened for writing only (as opposed to
\verb|O_RDONLY| or \verb|O_RDWR|), the \verb|O_CREAT| indicates that
the file should be created if it doesn't already exist (rather than
signaling an error), and the \verb|O_TRUNC| indicates that the file
should be truncated to zero length before writing; that is, all the old
data (if any) should be thrown out. Because the \verb|O_CREAT| option
is specified, the third argument to \verb|open| is needed; it
specifies the access permissions that should be given to the file, if
it is created. In this case, the access permissions are read and
write for the owning user only, that is, \verb|rw-------|.
Even setting aside \verb|open| and \verb|close|, not all operations
on files involve reading or writing the contents of the file. Some
operate on the \foldvocabs{metadata}{attribute}---attributes
describing a file---such as the access
permissions, time of last modification, or owner. A variety of
procedures, such as \index{chmod@\verb"|chmod"|}\verb|chmod|, \index{utime@\verb"|utime"|}\verb|utime|, and \index{chown@\verb"|chown"|}\verb|chown|,
allow these attributes to be set; I won't detail them. I will,
however, illustrate one procedure that allows the attributes of a file
to be retrieved. The C$++$ program in Figure~\ref{fstater-source} uses
the \index{fstat@\verb"|fstat"|}\verb|fstat| procedure to retrieve information about its standard
input. It then reports just a few of the attributes from the larger
package of information.
\begin{figure}
\begin{verbatim}
#include <unistd.h>
#include <time.h>
#include <sys/stat.h>
#include <stdio.h>
#include <iostream>
using namespace std;
int main(){
  struct stat info;
  if(fstat(STDIN_FILENO, &info) < 0){
    perror("Error getting info about standard input");
    return -1;
  }
  cout << "Standard input is owned by user number "
       << info.st_uid << endl;
  cout << "and was last modified " << ctime(&info.st_mtime);
  if(S_ISREG(info.st_mode)){
    cout << "It is a " << info.st_size << "-byte file." << endl;
  } else {
    cout << "It is not a regular file." << endl;
  }
  return 0;
}
\end{verbatim}
\caption{This C$++$ program, {\tt fstater.cpp}, describes its standard
input, using information retrieved using {\tt fstat}. That
information includes the owner, last modification time, and whether
the standard input is from a regular file. In the latter case, the
size of the file is also available.}
\label{fstater-source}
\end{figure}
After printing the owner and modification time stamp, the program
checks whether the standard input is from a regular file, as it would
be if the shell had been told to redirect standard input, using \verb|<|.
Only in this case does the program print out the file's size, because
the concept of size doesn't make any sense for the stream of input
coming from the keyboard, for example. If this program is compiled in
a file called \verb|fstater|, then the shell command
\begin{verbatim}
./fstater </etc/passwd
\end{verbatim}
would give you information about the \verb|/etc/passwd| file, which
you could verify using the command \verb|ls -ln /etc/passwd|.
Moving on to actually reading or
writing the contents of a file, the low-level POSIX API provides three
different choices, outlined here:
\[\label{file-access-styles-table}\begin{tabular}{l||l|l}
&\bf Explicit Positions&\bf Sequential\\\hline\hline
\bf Memory Mapped&\texttt{mmap}&---\\\hline
\bf External&\texttt{pread}/\texttt{pwrite}&\texttt{read}/\texttt{write}
\end{tabular}\]
A file (or a portion thereof) can be mapped into
the process's address space using the \index{mmap@\verb"|mmap"|}\verb|mmap| procedure, allowing
normal memory loads and stores to do the reading and writing. Alternatively, the
file can be left outside the address space, and individual portions
explicitly read or written using procedures that copy from the file
into memory or from memory into the file. One version of these
procedures (\index{pread@\verb"|pread"|}\verb|pread| and \index{pwrite@\verb"|pwrite"|}\verb|pwrite|) needs to be told what
position within the file to read or write, whereas the other version
(\index{read@\verb"|read"|}\verb|read| and \index{write@\verb"|write"|}\verb|write|) operates sequentially, with each
operation implicitly using the portion of the file immediately after
the preceding operation. I'll discuss all three possibilities at
least briefly, because each has its virtues. Because \verb|mmap| is the
simplest procedure, I will start with it.
\subsection{Mapping Files Into Virtual Memory}
The use of \index{mmap@\verb"|mmap"|}\verb|mmap| is illustrated by the C$++$ program in
Figures \ref{cpmm-source1} and \ref{cpmm-source2}, which copies the contents of one file to
another. The program expects to be given the names of the input and
output files as \verb|argv[1]| and \verb|argv[2]|, respectively. It
uses the \verb|open| procedure to translate these into integer file
descriptors, \verb|fd_in| and \verb|fd_out|. By using \verb|fstat|
(as in Figure~\ref{fstater-source}), it finds the size of the input
file. This size (\verb|info.st_size|) plays three roles. One is that
the program makes the output file the same size, using
\index{ftruncate@\verb"|ftruncate"|}\verb|ftruncate|. (Despite its name,
\verb|ftruncate| does not necessarily make a file shorter; it sets the
file's size, whether by truncating it or by padding it out with extra
bytes that all have the value zero.) Another use of the input file's size is for the two calls to
\verb|mmap|, which map the input and output files into virtual memory,
with read-only and write-only protections, respectively. The returned
values, \verb|addr_in| and \verb|addr_out|, are the virtual addresses
at which the two files start in the process's address space. The
third use of the input file size is to tell the library procedure
\index{memcpy@\verb"|memcpy"|}\verb|memcpy| how many bytes to copy from \verb|addr_in| to
\verb|addr_out|. The \verb|memcpy| procedure is a loop
that executes load and store instructions to copy from one place in virtual memory to
another. (This loop could be written explicitly in C$++$, but would be less
clear and likely less efficient as well, because the library routine is
very carefully tuned for speed.)
\begin{figure}
\begin{verbatim}
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
using namespace std;
int main(int argc, char *argv[]){
  if(argc != 3){
    cerr << "Usage: " << argv[0] << " infile outfile" << endl;
    return -1;
  }
  int fd_in = open(argv[1], O_RDONLY);
  if(fd_in < 0){
    perror(argv[1]);
    return -1;
  }
  struct stat info;
  if(fstat(fd_in, &info) < 0){
    perror("Error stating input file");
    return -1;
  }
  void *addr_in =
    mmap(0, info.st_size, PROT_READ, MAP_SHARED, fd_in, 0);
  if(addr_in == MAP_FAILED){
    perror("Error mapping input file");
    return -1;
  }
\end{verbatim}
\caption{This is the first portion of {\tt cpmm.cpp}, a C$++$ program using virtual
memory mapping to copy a file. The program is continued in the next figure.}
\label{cpmm-source1}
\end{figure}
\begin{figure}
\begin{verbatim}
  int fd_out =
    open(argv[2], O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
  if(fd_out < 0){
    perror(argv[2]);
    return -1;
  }
  if(ftruncate(fd_out, info.st_size) < 0){
    perror("Error setting output file size");
    return -1;
  }
  void *addr_out =
    mmap(0, info.st_size, PROT_WRITE, MAP_SHARED, fd_out, 0);
  if(addr_out == MAP_FAILED){
    perror("Error mapping output file");
    return -1;
  }
  memcpy(addr_out, addr_in, info.st_size);
  return 0;
}
\end{verbatim}
\caption{This is the second portion of {\tt cpmm.cpp}, a C$++$ program using virtual
memory mapping to copy a file. The program is continued from the previous figure.}
\label{cpmm-source2}
\end{figure}
Of course, I haven't explained all the arguments to \verb|mmap|, or
many other details. My intent here is not to provide comprehensive
documentation for these API procedures, nor to provide a complete
tutorial. Instead, the example should suffice to give you some feel
for file I/O using \verb|mmap|; files are opened, then mapped into the virtual
address space, and then accessed as any other memory would be, for example,
using \verb|memcpy|.
The underlying idea behind virtual memory-based file access (using
\verb|mmap|) is that files are arrays of bytes, just like
regions of virtual address space; thus, file access can be treated as
virtual memory access. The next style of file I/O to consider
accepts half of this argument (that files are arrays of bytes) but
rejects the other half (that they should therefore be treated the same
as memory). In Section~\ref{sequential-io-section}, you will see a third style of I/O, which largely
rejects even the first premise.
\subsection{Reading and Writing Files at Specified Positions}
Although convenient, accessing files as virtual memory is not without
disadvantages. In particular, writing files using \index{mmap@\verb"|mmap"|}\verb|mmap| raises
three problems:
\begin{itemize}
\item
The process has no easy way to control the time at which its updates
are made persistent. Specifically, there is no simple way for the
process to ensure that a data structure is written to persistent storage only after
it is in a consistent state, rather than in the middle of a series of
related updates.
\item
A process can write a file only if it has read permission as well as
write permission, because all page faults implicitly read from the
file, even if the page faults occur in the course of writing data into
the file's portion of virtual memory.
\item
Mapping a file into a range of addresses presumes you know how big
the file is. That isn't well suited to situations in which you don't
know in advance how much data will be written.
\end{itemize}
For these and other reasons, some programmers prefer to leave files
separate from the virtual memory address space and use procedures in
the POSIX API that explicitly copy data from a file into memory or
from memory into a file. The \index{pread@\verb"|pread"|}\verb|pread| and \index{pwrite@\verb"|pwrite"|}\verb|pwrite|
procedures take as arguments a file descriptor, a virtual address in
memory, a number of bytes to copy, and a position within the file.
Each procedure copies bytes starting from the specified position in the file and
the specified address in memory---\verb|pread| from the file to the
memory and \verb|pwrite| from the memory to the file. These
procedures are somewhat tricky to use correctly, because they may copy
fewer bytes than requested and because they may signal error
conditions that go away upon retrying the operation. Therefore, they
always need to be put in carefully designed loops.
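To give a flavor of what such a loop involves, here is a sketch. The \verb|pread_fully| helper is my own hypothetical illustration, not part of the POSIX API, and I am assuming that an interrupted call (signaled by \verb|EINTR|) is the error condition worth retrying:

```cpp
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

// Hypothetical helper: read exactly len bytes starting at position
// pos, looping because pread may transfer fewer bytes than requested.
// Returns the number of bytes read, which is less than len only if
// the end of the file intervenes; returns -1 on error.
ssize_t pread_fully(int fd, void *buf, size_t len, off_t pos){
  size_t done = 0;
  while(done < len){
    ssize_t n = pread(fd, (char *) buf + done, len - done, pos + done);
    if(n < 0){
      if(errno == EINTR)
        continue;  // interrupted by a signal; retry the operation
      return -1;   // a genuine error
    }
    if(n == 0)
      break;       // end of file reached
    done += n;
  }
  return done;
}
```

An analogous loop is needed for \verb|pwrite|, which likewise may transfer fewer bytes than requested.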
\subsection{Sequential Reading and Writing}\label{sequential-io-section}
Both \index{mmap@\verb"|mmap"|}\verb|mmap| and the \index{pread@\verb"|pread"|}\verb|pread|/\index{pwrite@\verb"|pwrite"|}\verb|pwrite| pair rely on the
ability to access arbitrary positions within a file; that is, they
treat the file as an array of bytes. As such, neither interface will
work for other sources of input and destinations for output, such as
keyboards and network connections. Instead, one needs to use a
sequential style of I/O, where each read or write operation takes
place not at a specified position, but wherever the last one
left off.
Sequential I/O is also quite convenient for many purposes, even when
used with files. For example, suppose you give the following command
in a shell:
\begin{verbatim}
(ls; ps) > information
\end{verbatim}
This opens the file named \verb|information| for writing as the
standard output and then runs two programs in succession: \verb|ls|
to list the files in the current directory and \verb|ps| to list
processes. The net result is that \verb|information| contains both
listings, one after the other. The \verb|ps| command does not need to
take any special steps to direct its output to the position in the
file immediately after where \verb|ls| stopped. Instead, by using the
sequential I/O features of the POSIX API, each of the two processes
naturally winds up writing each byte of output to the position after
the previously written byte, whether that previous byte was written by
the same process or not.
A process can perform sequential I/O using the \index{read@\verb"|read"|}\verb|read| and
\index{write@\verb"|write"|}\verb|write| procedures, which are identical to \verb|pread| and
\verb|pwrite|, except that they do not take an argument specifying the
position within the file. Instead, each is implicitly directed to
read or write at the current \foldvocab{file}{offset} and to
update that file offset.
The file offset is a position for reading and writing that is maintained by the
operating system.
For special files such as keyboard input, sequential input is
intrinsic, without needing an explicit file offset. For regular
files in persistent storage, however, the file offset is a numeric
position within the file (of the same kind \verb|pread| and
\verb|pwrite| take as arguments) that the operating
system keeps track of behind the scenes. Whenever a file is opened, the operating
system creates an
\foldvocab{open}{file description}, a capability-like structure that includes the file
offset, normally initialized to 0. Any file descriptors descended
from that same call to \verb|open| share the same open file
description. For example, in the previous example of
\verb|ls| and \verb|ps| writing to the \verb|information| file, each
of the two processes has its own file descriptor, but they are
referring to the same open file description, and hence share the same
file offset. If a process independently calls \verb|open| on the same
file, however, it will get a separate file offset.
A process implicitly increases the file offset whenever it does a
\verb|read| or \verb|write| of length more than zero. It can also
explicitly change the file offset using the
\index{lseek@\verb"|lseek"|}\verb|lseek| procedure. The \verb|lseek|
procedure can set the file offset anywhere within the file (for a
regular file). As such, a process can use the combination of
\verb|lseek| and \verb|read| or
\verb|write| to simulate \verb|pread| or \verb|pwrite|. However,
this simulation is prone to races if multiple threads or processes
share the same open file description, unless they use some synchronization
mechanism, such as a mutex.
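Such a simulation might be sketched as follows; this is my own illustration, and as just noted, it is not equivalent to \verb|pread| when the open file description is shared:

```cpp
#include <unistd.h>
#include <sys/types.h>

// A sketch of simulating pread using lseek followed by read. If
// another thread or process shares the same open file description,
// its reads or writes could move the file offset between the two
// calls, which is the race described in the text.
ssize_t simulated_pread(int fd, void *buf, size_t len, off_t pos){
  if(lseek(fd, pos, SEEK_SET) < 0)
    return -1;
  return read(fd, buf, len);  // may still transfer fewer than len bytes
}
```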
Normally \verb|lseek| is used only infrequently, with sequential
access predominating. For example, a process may read a whole file
sequentially, using \verb|read|, and then use \verb|lseek| to set the
file offset back to the beginning to read the file a second time. The conceptual model is
based on a tape drive, where ordinary reads and writes progress
sequentially through the tape, but rewinding or skipping forward are
also possible.
The \verb|read| and \verb|write| procedures share the same difficulty
as \verb|pread| and \verb|pwrite|:
the necessity of looping until all bytes have been
transferred. It is much easier to use the I/O facilities
defined in the standard libraries of higher-level programming
languages, such as Java or C$++$. Behind the scenes, these
libraries are using \verb|read| and \verb|write| and doing the
looping (and other details) for you.
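For example, the file-copying task of \verb|cpmm.cpp| could instead be expressed with the C$++$ standard library, which performs the underlying \verb|read| and \verb|write| calls, including all the necessary looping. A minimal sketch, with error handling simplified:

```cpp
#include <fstream>

// Copy one file to another using the C++ standard library's stream
// buffers. The library issues the underlying read and write system
// calls and does the looping; this code never sees a partial transfer.
bool copy_file_streams(const char *in_name, const char *out_name){
  std::ifstream in(in_name, std::ios::binary);
  std::ofstream out(out_name, std::ios::binary);
  if(!in || !out)
    return false;
  out << in.rdbuf();  // transfers the entire contents
  return bool(out);
}
```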
\section{Disk Space Allocation}\label{disk-allocation-section}
A file system is analogous to a virtual
memory system, in that each uses a level of indirection to map objects
into storage locations. In virtual memory, the mapping is from
virtual addresses within address spaces to physical addresses within
memory. In a file system, the mapping is from positions within files
to locations in persistent storage. For efficiency, the mapping is done at a coarse
granularity, several kilobytes at a time. In virtual memory, each
page is mapped into a page frame; in a file system, each block of a
file is mapped into a storage block. (You will see that blocks are
typically several kilobytes in size, spanning multiple sectors.)
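The coarse granularity of the mapping can be made concrete: a byte position within a file determines a file block number and an offset within that block, by division. A small sketch of the arithmetic (the block size is passed as a parameter, since it varies between file systems):

```cpp
#include <utility>

// Given a byte position within a file and the file system's block
// size, compute which block of the file the byte falls in and the
// byte's offset within that block.
std::pair<unsigned long, unsigned long>
locate_in_file(unsigned long position, unsigned long block_size){
  return std::make_pair(position / block_size, position % block_size);
}
```

For instance, with 4-KB blocks, byte position 9000 falls in the file's block 2, at offset 808 within that block.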
When discussing virtual memory, I remarked that the operating system
was free to assign any unused page frame of physical memory to hold
each page of virtual memory. However, although any allocation policy
would be correct, some might cause cache memory to perform better.
Persistent storage faces a similar allocation problem, but the
performance issues are considerably more pronounced if the persistent
storage hardware is a disk drive, as I will assume in this section. A file system
has the freedom to store data in any
otherwise unused disk block. The choices it makes determine how
accesses to files translate into accesses to disk. You have already seen
that the pattern of disk access can make a huge performance difference
(three orders of magnitude). Thus, I will examine allocation
policies here more closely than I examined placement policies in
Chapter~\ref{vm-chapter}.
Before I get into allocation policies themselves and their
embodiment in allocation mechanisms, I will
look at the key objectives for allocation: minimizing wasted space
and time. As you will see in Sections \ref{fragmentation-section} and
\ref{locality-section}, these goals can be
expressed as minimizing fragmentation and maximizing locality.
\subsection{Fragmentation}\label{fragmentation-section}
The word \vocab{fragmentation} is used in two different senses.
First, consider the definition I will \emph{not} be using. For
some authors, fragmentation refers to the degree to which a file is
stored in multiple noncontiguous regions of the disk. A file that is
stored in a single contiguous sequence of disk blocks (called an
\vocab{extent}) is not fragmented at all, by this definition.
A file stored in two separate extents would be
slightly fragmented. If the file's blocks are individually scattered
across the disk, then the file is maximally fragmented, by this
definition. A
\vocab{defragmentation} program moves files' blocks around on disk so
as to leave each file in a single extent. To allow future allocations
to be non-fragmented, the defragmentation program also arranges the
files so that the free space on the disk is clustered together.
The contiguity and sequentiality issues mentioned in the preceding
paragraph are important for speed of access; I will discuss them in
Section~\ref{locality-section} under the broader heading of locality. However,
I will not refer to them as fragmentation, because I will use
another definition that is well established in the
operating systems field. By this alternative definition,
fragmentation concerns space efficiency. A highly fragmented disk is
one in which a large proportion of the storage capacity is unavailable
for allocation to files. I will explain in the remainder of this
subsection the phenomena that cause space to be unusable.
One source of waste is that space is allocated only in integer
multiples of some file system block size. For example, a file system
might allocate space only in units of 4~KB. A file that is too big to
fit in a single 4-KB unit will be allocated 8~KB of space---even if it
is only a single byte larger than 4~KB. The unused space in the last
file block is called \foldvocab{internal}{fragmentation}. The
amount of internal fragmentation depends not only on the desired file sizes,
but also on the file system block size. As an analogy, consider
parallel parking in an area where individual parking spaces are marked
with painted lines, and where drivers actually respect those lines.
The amount of wasted space depends on the cars being parked, but it
also depends on how far apart the lines are painted. Larger parking
spaces will generally result in more wasted space.
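The arithmetic behind internal fragmentation is simple: round the file size up to the next multiple of the block size, and the excess over the actual size is the wasted space. A sketch of the computation (my own illustration, using the 4-KB block size from the text as the example value):

```cpp
// Space allocated to a file: its size rounded up to the next multiple
// of the file system block size.
unsigned long allocated_space(unsigned long file_size,
                              unsigned long block_size){
  unsigned long blocks = (file_size + block_size - 1) / block_size;
  return blocks * block_size;
}

// Internal fragmentation: the allocated but unused space in the
// file's last block.
unsigned long internal_fragmentation(unsigned long file_size,
                                     unsigned long block_size){
  return allocated_space(file_size, block_size) - file_size;
}
```

For example, a file one byte larger than 4~KB is allocated 8~KB, wasting 4095 bytes; the same rounding also shows why 14-KB and 15-KB files receive identical 16-KB allocations.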
The file system block size is always some multiple of the underlying
disk drive's sector size; no file system ever subdivides the space
within a single disk sector. Generally the file system blocks span
several consecutive disk sectors; for example, eight disk sectors of 512
bytes each might be grouped into each 4-KB file system block. Larger
file system blocks cause more internal fragmentation, but are
advantageous from other perspectives. In particular, you will see that
a larger block size tends to reduce
external fragmentation.
Additionally, a larger block size implies that there are fewer blocks
to keep track of, which reduces bookkeeping overhead.
Once a space allocation request has been rounded up to the next
multiple of the block size, the operating system must locate the
appropriate number of unused blocks. In order to read or write the
file as quickly as possible, the blocks should be in a single
consecutive extent. For the moment, I will consider this to be an
absolute requirement. Later, I will consider relaxing it.
Continuing with my earlier example, suppose you need space for a file
that is just one byte larger than 4~KB and hence has been rounded up
to two 4-KB blocks. The new requirement of contiguity means that you
are looking for somewhere on the disk where two consecutive 4-KB blocks
are free. Perhaps you are out of luck. Maybe the disk is only half
full, but the half that is full consists of every even-numbered file
system block with all the odd-numbered ones available for use. This
situation, where there is lots of space available but not enough
grouped together in any one place, is \foldvocab{external}{fragmentation}. So
long as you insist on contiguous allocation, external fragmentation is
another cause of wasted space: blocks that are free for use, but are
too scattered to be usable.
On the surface, it appears that external fragmentation would result
only from very strange circumstances. My example, in which every
second file system block is occupied, would certainly fit that
description. To start with, it implies that you allocated lots of
small files and now suddenly want to allocate a larger file. Second,
it implies that you either were really dumb in choosing where those
small files went (skipping every other block), or had
phenomenally bad luck in the user's choice of which files to delete.
However, external fragmentation can occur from much more
plausible circumstances. In particular, you can wind up with only
small gaps of space available even if all the allocations have been
for much larger amounts of space and even if the previous allocations
were done without leaving silly gaps for no reason.
For a small scenario that illustrates the phenomenon, consider a disk
that has room for only 14 file system blocks. Suppose you start by
allocating three four-block files. At this point, the space
allocation might look as follows:
\[\begin{tikzpicture}[scale=.035]
\draw (0,0) rectangle (40,20);
\draw (40,0) rectangle (80,20);
\draw (80,0) rectangle (120,20);
\draw (120,0) rectangle (140,20);
\draw (20,10) node{file1};
\draw (60,10) node{file2};
\draw (100,10) node{file3};
\draw (0,-7) node{0};
\draw (40,-7) node{4};
\draw (80,-7) node{8};
\draw (120,-7) node{12};
\draw (140,-7) node{14};
\end{tikzpicture}\]
Suppose file2 is now deleted, resulting in a four-block gap, with
another two blocks free at the end of the disk:
\[\begin{tikzpicture}[scale=.035]
\draw (0,0) rectangle (40,20);
\draw (40,0) rectangle (80,20);
\draw (80,0) rectangle (120,20);
\draw (120,0) rectangle (140,20);
\draw (20,10) node{file1};
\draw (100,10) node{file3};
\draw (0,-7) node{0};
\draw (40,-7) node{4};
\draw (80,-7) node{8};
\draw (120,-7) node{12};
\draw (140,-7) node{14};
\end{tikzpicture}\]
If, at this point, a three-block file (file4) is created, it can go
into the four-block gap, leaving one block unused:
\[\begin{tikzpicture}[scale=.035]
\draw (0,0) rectangle (40,20);
\draw (40,0) rectangle (70,20);
\draw (70,0) rectangle (80,20);
\draw (80,0) rectangle (120,20);
\draw (120,0) rectangle (140,20);
\draw (20,10) node{file1};
\draw (100,10) node{file3};
\draw (55,10) node{file4};
\draw (0,-7) node{0};
\draw (40,-7) node{4};
\draw (70,-7) node{7};
\draw (80,-7) node{8};
\draw (120,-7) node{12};
\draw (140,-7) node{14};
\end{tikzpicture}\]
Now there are three unused blocks, but there is no way to
satisfy another three-block allocation request, because the three
unused blocks are broken up, with one block between files 4 and 3, and
two more blocks at the end of the disk.
Notice that you wound up with a one-block gap not because a one-block
file was created and later deleted (or because of stupid allocation),
but because a four-block file was replaced by a three-block file. The
resulting gap is the difference in the file sizes. This means that
even if a disk is used exclusively for storing large files, it may
still wind up with small gaps, which cannot hold any large files.
This is the fundamental problem of external fragmentation.
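The small scenario above can be replayed in code. The following toy first-fit contiguous allocator is my own illustration, not any real file system's algorithm; it tracks a 14-block disk as a vector of used/free flags:

```cpp
#include <vector>

// Toy first-fit allocator over a disk of used (true) and free (false)
// blocks. Returns the starting block of a free extent of the
// requested length, marking it used, or -1 if no such contiguous
// extent exists.
int allocate_extent(std::vector<bool> &disk, int length){
  int run_start = 0, run_length = 0;
  for(int i = 0; i < (int) disk.size(); i++){
    if(disk[i]){
      run_length = 0;  // the current free run is broken
    } else {
      if(run_length == 0)
        run_start = i;
      if(++run_length == length){
        for(int j = run_start; j < run_start + length; j++)
          disk[j] = true;  // claim the extent
        return run_start;
      }
    }
  }
  return -1;  // external fragmentation (or a genuinely full disk)
}

void free_extent(std::vector<bool> &disk, int start, int length){
  for(int j = start; j < start + length; j++)
    disk[j] = false;
}
```

Allocating three four-block files, deleting the second, and then allocating a three-block file reproduces the figures: three blocks remain free, but a further three-block request fails because no three of them are contiguous.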
Returning to the parallel parking analogy, consider an area where no
parking spaces are marked on the pavement, leaving drivers to allocate
their own spaces. Even if they are courteous enough not to leave any
pointless gaps, small gaps will arise as cars of varying sizes come
and go. A large car may vacate a space, which is then taken by a
smaller car. The result is a gap equal to the difference in car
sizes, too small for even the smallest cars to use. If this situation
happens repeatedly at different spots along a block, there may be enough
total wasted space to accommodate a car, but not all in one place.
Earlier, I mentioned that increasing the file system block size,
which increases internal fragmentation, decreases external
fragmentation. The reason for this is that with a larger block size,
there is less variability in the amount of space being allocated.
Files that might have different sizes when rounded up to the next
kilobyte (say, 14~KB and 15~KB) may have the same size when rounded to
the next multiple of 4~KB (in this case, 16~KB and 16~KB). Reduced
variability reduces external fragmentation; in the extreme case, no
external fragmentation at all occurs if the files are all allocated
the same amount of space.
Suppose you relax the requirement that a file be allocated a single
extent of the disk. Using file metadata, it is possible to
store different blocks of the file in different locations, much as a
virtual memory address space can be scattered throughout physical
memory. Does this mean that external fragmentation is a nonissue?
No, because for performance reasons, you will still want to allocate
the file contiguously as much as possible. Therefore, external
fragmentation will simply change from being a space-efficiency issue
(free space that cannot be used) to a time-efficiency issue (free
space that cannot be used without file access becoming slower). This
gets us into the next topic, locality.
\subsection{Locality}\label{locality-section}
Recall that disks provide their fastest performance when asked to
access a large number of consecutive sectors in a single request at a
location near that of the previous access. Most file system
designers have interpreted these conditions for fast access as implying the following
locality guidelines for space allocation:
\begin{enumerate}
\item The space allocated for each file should be broken into as few