
Approach for gluster brick creation from disks. #49

Closed

nnDarshan opened this issue Oct 25, 2016 · 8 comments

@nnDarshan commented Oct 25, 2016

This issue is to arrive at a suitable approach for creating gluster bricks from disks. To that end, a few of the available tools (pyudev, blivet, pyudevDAG, storaged and udisks) will be investigated and the most suitable one will be picked. The current approach used by the oVirt project to create bricks will also be reviewed and its strong points considered.

@mbukatov

Do we plan to evaluate storaged or udisks for this as well?

@nnDarshan

The steps and best practices to be followed while creating a gluster brick, so as to get optimal performance from the bricks, are as follows. These are taken from the Gluster admin guide. The tool that we use and the approach that we take must follow these guidelines.

Summary of steps involved in creating a gluster brick from a disk

LVM layer:

  1. Physical Volume creation:

    $ pvcreate --dataalignment <alignment_value> <disk>

    where alignment_value:
    - For JBODs: 256k
    - For H/W RAID: RAID stripe unit size * number of data disks (the number of data disks depends on the RAID type)

    (A worked example covering all four LVM steps follows step 4.)

  2. Volume Group creation:

    • For RAID disks:

      $ vgcreate --physicalextentsize <extent_size> VOLGROUP <physical_volume>

      where extent_size = RAID stripe unit size * number of data disks (the number of data disks depends on the RAID type)

    • For JBODs:

      $ vgcreate VOLGROUP <physical_volume>

  3. Thin Pool creation:

    $ lvcreate --thinpool VOLGROUP/thin_pool --size <pool_size> --chunksize <chunk_size> --poolmetadatasize <meta_size> --zero n

    Where:
    - meta_size: 16 GiB recommended; if that is a concern, at least 0.5% of pool_size
    - chunk_size:
      i. For JBOD: use a thin pool chunk size of 256 KiB.
      ii. For RAID 6: RAID stripe unit size * number of data disks; the result must be between 1 MiB and 2 MiB (preferably close to 1 MiB)
      iii. For RAID 10: use a thin pool chunk size of 256 KiB

    NOTE: if multiple bricks are needed on a single H/W device, create multiple thin pools from a single VG.

  4. Thin LV creation:

    $ lvcreate --thin --name <LV_name> --virtualsize <LV_size> VOLGROUP/thin_pool
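
    A worked example of the four steps above, assuming a hypothetical 12-disk H/W RAID 6 LUN at /dev/sdb with a 128 KiB stripe unit (so 10 data disks, and 128 KiB * 10 = 1280 KiB); the device, names and sizes are illustrative, not from the original discussion:

    # alignment = extent size = chunk size = 128 KiB * 10 data disks = 1280 KiB
    $ pvcreate --dataalignment 1280k /dev/sdb
    $ vgcreate --physicalextentsize 1280k gfs_vg /dev/sdb
    # 1280 KiB chunks fall inside the recommended 1-2 MiB window for RAID 6
    $ lvcreate --thinpool gfs_vg/gfs_pool --size 1t --chunksize 1280k --poolmetadatasize 16g --zero n
    $ lvcreate --thin --name gfs_lv --virtualsize 1t gfs_vg/gfs_pool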

XFS Layer:

  1. Formatting filesystem on the disk:
    • XFS Inode Size: 512 bytes
    • XFS RAID Alignment:
      • For RAID 6: su = RAID stripe unit size, sw = number of data disks.

        Example :
        $ mkfs.xfs <other_options> -d su=128k,sw=10 <device_name>

      • For RAID 10 and JBODs: this can be omitted; the default is fine

    • Logical Block Size for the Directory:
      • For all types:
        the default is 4k; for better performance use a larger value like 8192. Use "-n size=" to set this.

        Example :
        $ mkfs.xfs -f -i size=512 -n size=8192 -d su=128k,sw=10 <logical volume>
        (sample output) meta-data=/dev/mapper/gluster-brick1 isize=512 agcount=32, agsize=37748736 blks

  2. Mounting the filesystem
    • Allocation Strategy: the default is inode32, but inode64 is recommended;
      set it by using "-o inode64" during mount
    • Access Time:
      If the application does not need to update the access time on files, then the file system should always be mounted with the noatime mount option
      Example :
      $ mount -t xfs -o inode64,noatime <logical volume> <mount point>
    • Allocation groups: the default is fine
    • Percentage of space allocated to inodes:
      If the workload consists of very small files (average file size less than 10 KB), it is recommended to set the maxpct value to 10 while formatting the file system (see the example below)
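
    For the small-file case, a hedged example of setting maxpct at format time and making the recommended mount options persistent; the device path, mount point and fstab entry are assumptions:

    # reserve up to 10% of space for inodes (small-file workload)
    $ mkfs.xfs -f -i size=512,maxpct=10 -n size=8192 /dev/mapper/gfs_vg-gfs_lv

    # assumed /etc/fstab entry for the brick mount
    /dev/mapper/gfs_vg-gfs_lv  /bricks/brick1  xfs  inode64,noatime  0 0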

@nnDarshan

Current approach taken by oVirt:

  • Uses blivet for creating bricks from disks. A somewhat older version is used, since gluster has to be supported on older versions of the operating system.
  • It can read the list of available disks, but RAID-specific details are not available in the version of blivet used by oVirt.
  • Some of the best practices depend on the RAID type and RAID attributes (as mentioned in the comment above), so a user who wants to create a brick out of a RAID volume has to provide the RAID-specific details manually.
  • After obtaining the RAID-specific details from the user, it goes ahead with brick creation without any further user intervention. All the steps mentioned above are automated using blivet.
  • Once the bricks are created, they can be used in the volume-create flow. The user is shown the list of all
    bricks available in the cluster and can choose bricks to create a gluster volume out of them (see the
    example below).
  • oVirt supports distribute, replicate and distributed-replicate volume types. For replicated and
    distributed-replicated volumes, oVirt makes sure that replica pairs are located on different machines; if
    sufficient bricks are not available on different machines, it warns the user, who can override the warning.
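
A sketch of the final step of that flow, creating a replicated gluster volume from two such bricks; the hostnames and brick paths are assumptions:

  # assumed: each host has a brick filesystem mounted at /bricks/brick1
  $ gluster volume create gfs_vol replica 2 host1:/bricks/brick1/brick host2:/bricks/brick1/brick
  $ gluster volume start gfs_vol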

@nnDarshan

Summary of analysis of a few tools:

  • pyudev and pyudevDAG: These provide details of all the disks available, and some of the software RAID (md RAID) specific details, but not everything needed. Operations like PV creation, LV creation, VG creation etc. cannot be performed using these tools.
  • udisks: This does not have support for LVM related operations.
  • storaged and udisks2: These provide details of the various storage disks available, and some of the software RAID specific details. They can be used to create a PV, create a VG etc., but some additional options cannot be passed while creating them (per the referenced docs). For gluster brick creation, the PV, VG, thin pool etc. need to be created with additional options for better performance.
  • blivet: This provides device details of the disks available on the node; software RAID (Linux md RAID) details are also available. It can also be used to actually provision the brick by creating the PV, VG, thin pool and thin LV with options that follow the best practices.

NOTE: None of these can detect hardware RAID volume details, and a good number of customers are likely using hardware RAID. RAID-specific details like RAID level, stripe count and disk count are required to provision the bricks according to the performance best practices (illustrated below).
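
To illustrate the gap: a hardware RAID controller exposes the whole array as one plain block device, so generic tools see only an ordinary disk. A hypothetical example (device name and size are assumptions):

  # a 12-disk hardware RAID 6 LUN still shows up as a plain disk
  $ lsblk -o NAME,TYPE,SIZE /dev/sdb
  NAME TYPE  SIZE
  sdb  disk   10T
  # no RAID level, stripe unit or data-disk count is visible here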

@nnDarshan

Conclusion:

Considering the analysis above, blivet seems the most appropriate tool for our use case. Blivet can be used to get the list of all disks available on the node and to provision the bricks. But when the user is using RAID, blivet will not be able to get the RAID-specific details; the user will have to provide details like:

  • RAID level
  • RAID stripe size
  • number of data disks

With these details provided by the user, we can use blivet to provision the bricks as per the best practices.

Also, blivet is already used by oVirt for the same purpose, and there is no major concern about using blivet for brick provisioning. We can reuse some of the work from there, which in turn will speed up our development.

@nnDarshan

nnDarshan commented Oct 28, 2016

Requirements of a tool that can be used to provision gluster bricks:

  • The tool must be able to provide the list of disk devices available on the machine, with all
    the basic information like size, type, hierarchy etc.

  • It must provide an API for the following LVM related operations:

    - Create an LVM physical volume with a data_alignment option.

    - Create an LVM volume group with an extent_size option.

    - Create an LVM thin pool with chunk_size and metadata_size options.

    - Create an LVM thin LV.

  • It must provide an API to format a filesystem on a device with the following options:

    - Inode size

    - RAID alignment (if the underlying device is a RAID device)

    - Logical block size for the directory

  • It must provide an API to mount the filesystem with options like:

    - Allocation strategy

    - Access time

    - Allocation groups

    - Inode space percentage

  • It would be good if the tool could recognize the RAID devices (both hardware and software) available on the
    machine and give information about these devices, like:

    - RAID level

    - RAID stripe count

    - Number of data disks etc.
    

@nnDarshan

UPDATE: we had a discussion with the blivet team; the summary of the discussion is as follows:
Their suggestion was to plug libstoragemgmt into blivet, so that blivet will be able to discover h/w RAID details and provision the disks to be used as gluster bricks. libstoragemgmt has the capability to get hardware RAID specific details (see the sketch below).
They said they would provide some help with the blivet-libstoragemgmt integration.
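
As a rough illustration of what libstoragemgmt exposes, a hedged sketch using its lsmcli client against the bundled simulator plugin; the exact command names and output fields should be verified against the installed version:

  # list storage systems via the simulator plugin (no real hardware needed)
  $ lsmcli -u sim:// list --type SYSTEMS
  # list the volumes (LUNs) exposed by a controller
  $ lsmcli -u sim:// list --type VOLUMES
  # query RAID details of a volume: RAID type, strip size, disk count etc.
  $ lsmcli -u sim:// volume-raid-info --vol <volume_id>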

@nnDarshan

We are exploring the possibility of getting the brick provisioning feature into gdeploy and consuming that feature from tendrl (a sketch of such a gdeploy configuration follows).
Here is an issue raised against gdeploy to track this: gluster/gdeploy#257
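
For reference, a hedged sketch of what a gdeploy backend-setup configuration for this could look like; the section and key names follow gdeploy's config format, but the hosts, device and path values are assumptions:

  [hosts]
  10.0.0.1
  10.0.0.2

  # assumed: /dev/sdb on each host becomes one brick
  [backend-setup]
  devices=/dev/sdb
  vgs=gfs_vg
  pools=gfs_pool
  lvs=gfs_lv
  mountpoints=/bricks/brick1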

r0h4n closed this as completed Jan 24, 2018