## How to view the contents of HDF5 files

First, import the relevant modules and load data.

In [2]:
import h5py
import allel

In [3]:
callset = h5py.File('../data/phase1/hdf5/ag1000g.phase1.ar3.pass.h5', mode='r')


Now choose a chromosome arm to work with. To see all available arms, call the .keys() method on the file object. As I understand it, the keys name the groups within an HDF5 file, which are arranged much like a hierarchical directory structure. In this file the top level is the chromosome arm. Here I'll use 3L.

In [4]:
sorted(callset.keys())

['2L', '2R', '3L', '3R', 'X']

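Note: the same output can be generated without the .keys() method, because iterating over an h5py file or group yields its keys, so sorted(callset) gives the same list. Calling .keys() just makes the intent explicit; Alastair used it originally when talking about VCFs.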
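The equivalence can be checked on a small throwaway file. This is a minimal sketch: the in-memory file and its group names are made up for illustration and only mirror the top level of the real callset.

```python
import h5py

# Build a tiny in-memory HDF5 file (nothing is written to disk).
# The file name and group names are illustrative stand-ins, not the real data.
demo = h5py.File('demo_keys.h5', mode='w', driver='core', backing_store=False)
for arm in ['2L', '2R', '3L', '3R', 'X']:
    demo.create_group(arm)

# Iterating over a file or group yields its member names,
# so sorted() returns the same list with or without .keys().
with_keys = sorted(demo.keys())
without_keys = sorted(demo)
print(with_keys == without_keys)

demo.close()
```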

In [5]:
chrom = '3L'

Now that we've chosen a chromosome arm, let's see what the next level of the hierarchy beneath it contains. To do this, we index into the callset using square brackets [].

In [6]:
sorted(callset[chrom])

['calldata', 'samples', 'variants']

This reveals that there are three categories of data in these files: 'variants', 'calldata', and 'samples'. To explore the contents of one of these, add another pair of square brackets after the first, and repeat for each further level you want to access. This differs from the style used for VCF files, where the levels are joined with '/' inside a single pair of square brackets (h5py also accepts that form, e.g. callset['3L/variants']). For example, the variant data is as follows:

In [7]:
sorted(callset[chrom]['variants'])

['ABHet',
 'ABHom',
 'AC',
 'AF',
 'ALT',
 'AN',
 'ANN',
 'Accessible',
 'BaseCounts',
 'BaseQRankSum',
 'CHROM',
 'Coverage',
 'CoverageMQ0',
 'DP',
 'DS',
 'Dels',
 'FILTER_FS',
 'FILTER_HRun',
 'FILTER_HighCoverage',
 'FILTER_HighMQ0',
 'FILTER_LowCoverage',
 'FILTER_LowMQ',
 'FILTER_LowQual',
 'FILTER_NoCoverage',
 'FILTER_PASS',
 'FILTER_QD',
 'FILTER_ReadPosRankSum',
 'FILTER_RefN',
 'FILTER_RepeatDUST',
 'FS',
 'HRun',
 'HW',
 'HaplotypeScore',
 'HighCoverage',
 'HighMQ0',
 'InbreedingCoeff',
 'LOF',
 'LowCoverage',
 'LowMQ',
 'LowPairing',
 'MLEAC',
 'MLEAF',
 'MQ',
 'MQ0',
 'MQRankSum',
 'NDA',
 'NMD',
 'NoCoverage',
 'OND',
 'POS',
 'QD',
 'QUAL',
 'REF',
 'RPA',
 'RU',
 'ReadPosRankSum',
 'RefMasked',
 'RefN',
 'RepeatDUST',
 'RepeatMasker',
 'RepeatTRF',
 'STR',
 'VariantType',
 'is_snp',
 'num_alleles',
 'svlen']

As you can see, this is where the 'Accessible' data is hidden away! To confirm that we can access it, use the following:

In [8]:
callset[chrom]['variants']['Accessible']

<HDF5 dataset "Accessible": shape (9643193,), type "|b1">
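A dataset handle like this one describes the data without loading it. The sketch below uses a made-up in-memory file rather than the real callset (the values are invented, only the names mimic the layout above). It shows that the chained lookup and a single '/'-separated path reach the same dataset, and that slicing is what actually reads the values into a NumPy array.

```python
import h5py
import numpy as np

# Illustrative in-memory file; names mimic the callset layout, values are invented.
demo = h5py.File('demo_access.h5', mode='w', driver='core', backing_store=False)
demo.create_dataset('3L/variants/Accessible',
                    data=np.array([True, False, True, True], dtype=bool))

chrom = '3L'
chained = demo[chrom]['variants']['Accessible']   # one bracket per level
by_path = demo[chrom + '/variants/Accessible']    # one '/'-separated path

# Both lookups point at the same object in the file.
same_dataset = chained.name == by_path.name

# shape and dtype are available without reading any values.
shape, dtype = chained.shape, chained.dtype

# Slicing with [:] loads the whole dataset into memory as a NumPy array.
accessible = chained[:]
print(same_dataset, shape, dtype, accessible)

demo.close()
```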

From here, we should be able to filter out all the variants that are not accessible, although I'm not sure how yet! Note that if you're working from a single-chromosome file (e.g. 3L) rather than the main .h5, you still need to specify the chromosome arm first, as that is still the first level of the data structure.
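One possible approach to filtering out inaccessible variants, offered as a sketch rather than a recipe tested on this dataset: load 'Accessible' as a boolean NumPy array and use it as a mask on any other per-variant array, such as 'POS'. The file below is an invented in-memory stand-in for the callset, and the position values are made up.

```python
import h5py
import numpy as np

# Invented stand-in for the callset: five variants on arm 3L.
demo = h5py.File('demo_filter.h5', mode='w', driver='core', backing_store=False)
variants = demo.create_group('3L/variants')
variants.create_dataset('POS', data=np.array([101, 205, 340, 512, 777]))
variants.create_dataset('Accessible',
                        data=np.array([True, False, True, True, False]))

chrom = '3L'
pos = demo[chrom]['variants']['POS'][:]
is_accessible = demo[chrom]['variants']['Accessible'][:]

# Boolean indexing keeps only the entries where the mask is True.
pos_accessible = pos[is_accessible]

# Count how many variants the mask removes.
n_removed = int((~is_accessible).sum())
print(pos_accessible, n_removed)

demo.close()
```

With the real callset, the same mask should apply to any array under 'variants' that has one entry per variant. The arrays under 'calldata' are much larger, so loading them whole with [:] may not be practical.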