Skip to content

Latest commit

 

History

History
160 lines (127 loc) · 11.5 KB

File metadata and controls

160 lines (127 loc) · 11.5 KB

Chapter 14: The Block I/O Layer

Block devices are hardware devices distinguished by the random access of fixed-size chunks of data. The fixed-size chunks of data are called blocks.

The other basic type of device is a character device. Character devices, or char devices, are accessed as a stream of sequential data, one byte after another.

Anatomy of a Block Device

The smallest addressable unit on a block device is a sector. Sectors come in various powers of two, but 512 bytes is the most common size. The sector size is a physical property of the device, and the sector is the fundamental unit of all block devices.

block is the smallest logically addressable unit. The block is an abstraction of the filesystem, filesystems can be accessed only in multiples of a block.

Buffers and Buffer Heads

When a block is stored in memory, it is stored in a buffer. Each buffer is associated with exactly one block. The buffer serves as the object that represents a disk block in memory. Because the kernel requires some associated control information to accompany the data, each buffer is associated with a descriptor. This descriptor is called a buffer head and is of type struct buffer_head. The buffer_head structure holds all the information that the kernel needs to manipulate buffers and is defined in <linux/buffer_head.h>.

struct buffer_head {
    unsigned long b_state;             /* buffer state flags */
    struct buffer_head *b_this_page;   /* list of page’s buffers */
    struct page *b_page;               /* associated page */
    sector_t b_blocknr;                /* starting block number */
    size_t b_size;                     /* size of mapping */
    char *b_data;                      /* pointer to data within the page */
    struct block_device *b_bdev;       /* associated block device */
    bh_end_io_t *b_end_io;             /* I/O completion */
    void *b_private;                   /* reserved for b_end_io */
    struct list_head b_assoc_buffers;  /* associated mappings */
    struct address_space *b_assoc_map; /* associated address space */
    atomic_t b_count;                  /* use count */
};

The b_state field specifies the state of this particular buffer. It can be one or more of the flags in the following table. The legal flags are stored in the bh_state_bits enumeration, which is defined in <linux/buffer_head.h>.

Status Flag Meaning
BH_Uptodate Buffer contains valid data.
BH_Dirty Buffer is dirty. (The contents of the buffer are newer than the contents of the block on disk and therefore the buffer must eventually be written back to disk.)
BH_Lock Buffer is undergoing disk I/O and is locked to prevent concurrent access.
BH_Req Buffer is involved in an I/O request.
BH_Mapped Buffer is a valid buffer mapped to an on-disk block.
BH_New Buffer is newly mapped via get_block() and not yet accessed.
BH_Async_Read Buffer is undergoing asynchronous read I/O via end_buffer_async_read().
BH_Async_Write Buffer is undergoing asynchronous write I/O via end_buffer_async_write().
BH_Delay Buffer does not yet have an associated on-disk block (delayed allocation).
BH_Boundary Buffer forms the boundary of contiguous blocks; the next block is discontinuous.
BH_Write_EIO Buffer incurred an I/O error on write.
BH_Ordered Ordered write.
BH_Eopnotsupp Buffer incurred a "not supported" error.
BH_Unwritten Space for the buffer has been allocated on disk but the actual data has not yet been written out.
BH_Quiet Suppress errors for this buffer.

The bh_state_bits enumeration also contains a BH_PrivateStart flag. This is not a valid state flag but instead corresponds to the first usable bit of which other code can make use. All bit values equal to and greater than BH_PrivateStart are not used by the block I/O layer proper, so these bits are safe to use by individual drivers who want to store information in the b_state field.

The b_count field is the buffer’s usage count. The value is incremented and decremented by two inline functions defined in <linux/buffer_head.h>:

static inline void get_bh(struct buffer_head *bh)
{
        atomic_inc(&bh->b_count);
}

static inline void put_bh(struct buffer_head *bh)
{
        smp_mb__before_atomic_dec();
        atomic_dec(&bh->b_count);
}

Before manipulating a buffer head, you must increment its reference count via get_bh() to ensure that the buffer head is not deallocated out from under you. When finished with the buffer head, decrement the reference count via put_bh().

The physical block on disk to which a given buffer corresponds is the b_blocknr-th logical block on the block device described by b_bdev. The physical page in memory to which a given buffer corresponds is the page pointed to by b_page. More specifically, b_data is a pointer directly to the block (that exists somewhere in b_page), which is b_size bytes in length. Therefore, the block is located in memory starting at address b_data and ending at address (b_data + b_size).

The purpose of a buffer head is to describe this mapping between the on-disk block and the physical in-memory buffer. Acting as a descriptor of this buffer-to-block mapping is the data structure’s only role in the kernel.

The bio Structure

The bio structure is the basic container for block I/O within the kernel is the bio structure. Defined in <linux/bio.h>, this structure represents block I/O operations that are in flight as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. Vector I/O such as this is called scatter-gather I/O.

struct bio {
    sector_t bi_sector;               /* associated sector on disk */
    struct bio *bi_next;              /* list of requests */
    struct block_device *bi_bdev;     /* associated block device */
    unsigned long bi_flags;           /* status and command flags */
    unsigned long bi_rw;              /* read or write? */
    unsigned short bi_vcnt;           /* number of bio_vecs off */
    unsigned short bi_idx;            /* current index in bi_io_vec */
    unsigned short bi_phys_segments;  /* number of segments */
    unsigned int bi_size;             /* I/O count */
    unsigned int bi_seg_front_size;   /* size of first segment */
    unsigned int bi_seg_back_size;    /* size of last segment */
    unsigned int bi_max_vecs;         /* maximum bio_vecs possible */
    unsigned int bi_comp_cpu;         /* completion CPU */
    atomic_t bi_cnt;                  /* usage counter */
    struct bio_vec *bi_io_vec;        /* bio_vec list */
    bio_end_io_t *bi_end_io;          /* I/O completion method */
    void *bi_private;                 /* owner-private method */
    bio_destructor_t *bi_destructor;  /* destructor method */
    struct bio_vec bi_inline_vecs[0]; /* inline bio vectors */
};

The primary purpose of a bio structure is to represent an in-flight block I/O operation. To this end, the majority of the fields in the structure are housekeeping related. The most important fields are bi_io_vec, bi_vcnt, and bi_idx.

I/O vectors

The bi_io_vec field points to an array of bio_vec structures, each of which is used as a list of individual segments in this specific block I/O operation. The entire array of these vectors describes the entire buffer.

Each bio_vec is treated as a vector of the form <page, offset, len>, which describes a specific segment: the physical page on which it lies, the location of the block as an offset into the page, the length of the block starting from the given offset.

The bio_vec structure is defined in <linux/bio.h>:

struct bio_vec {
    /* pointer to the physical page on which this buffer resides */
    struct page *bv_page;

    /* the length in bytes of this buffer */
    unsigned int bv_len;

    /* the byte offset within the page where the buffer resides */
    unsigned int bv_offset;
};

In each given block I/O operation, there are bi_vcnt vectors in the bio_vec array starting with bi_io_vec. As the block I/O operation is carried out, the bi_idx field is used to point to the current index into the array.

Each block I/O request is represented by a bio structure. Each request is composed of one or more blocks, which are stored in an array of bio_vec structures. These structures act as vectors and describe each segment’s location in a physical page in memory. The first segment in the I/O operation is pointed to by b_io_vec. Each additional segment follows after the first, for a total of bi_vcnt segments in the list. As the block I/O layer submits segments in the request, the bi_idx field is updated to point to the current segment.

The bi_idx field is used to point to the current bio_vec in the list, which helps the block I/O layer keep track of partially completed block I/O operations. More importantly, it allows the splitting of bio structures.

The bio structure maintains a usage count in the bi_cnt field. When this field reaches zero, the structure is destroyed and the backing memory is freed.

void bio_get(struct bio *bio)
void bio_put(struct bio *bio)

The bi_private field is a private field for the owner of the structure. As a rule, you can read or write this field only if you allocated the bio structure.

Request Queues

Block devices maintain request queues to store their pending block I/O requests. The request queue is represented by the request_queue structure and is defined in <linux/blkdev.h>. The request queue contains a doubly linked list of requests and associated control information. Requests are added to the queue by higher-level code in the kernel, such as filesystems. As long as the request queue is nonempty, the block device driver associated with the queue grabs the request from the head of the queue and submits it to its associated block device. Each item in the queue’s request list is a single request, of type struct request (also defined in <linux/blkdev.h>).

Each request can be composed of more than one bio structure because individual requests can operate on multiple consecutive disk blocks. Note that although the blocks on the disk must be adjacent, the blocks in memory need not be; each bio structure can describe multiple segments and the request can be composed of multiple bio structures.

I/O Schedulers

The subsystem of the kernel that performs these operations is called the I/O scheduler. The I/O scheduler divides the resource of disk I/O among the pending block I/O requests in the system.

The Job of an I/O Scheduler

An I/O scheduler works by managing a block device's request queue. It decides the order of requests in the queue and at what time each request is dispatched to the block device. It manages the request queue with the goal of reducing seeks, which results in greater global throughput.

I/O scheduler perform two primary actions to minimize seeks:

  • merging: the coalescing of two or more requests into one.
  • sorting: the entire request queue is kept sorted, sectorwise.

I/O Scheduler Selection

There are four different I/O schedulers in the 2.6 kernel. Each of these I/O schedulers can be enabled and built into the kernel. By default, block devices use the Complete Fair Queuing I/O scheduler. This can be overridden via the boot-time option elevator=foo on the kernel command line, where foo is a valid and enabled I/O Scheduler.

Parameter I/O Scheduler
as Anticipatory
cfq Complete Fair Queuing
deadline Deadline
noop Noop