Python-based Illumina methylation array preprocessing software.
marcmaxson v1.2.0 feature/geometa tested and ready to release (#38)
Latest commit fe16f3e Nov 14, 2019
Type Name Latest commit message Commit time
.circleci Minor bug fixes, and coveralls stopped working (#35) Nov 5, 2019
_sphinx_static version 1.0 release Jun 25, 2019
docs
methylprep v1.2.0 feature/geometa tested and ready to release (#38) Nov 14, 2019
tests v1.2.0 feature/geometa tested and ready to release (#38) Nov 14, 2019
.editorconfig
.gitignore
LICENSE version 1.0 release Jun 25, 2019
Makefile
Pipfile v1.1 release (#23) Sep 5, 2019
Pipfile.lock version 1.0 release Jun 25, 2019
README.md
appveyor.yml Renamed repo and all internal references (#14) Aug 22, 2019
conda-env.yml Renamed repo and all internal references (#14) Aug 22, 2019
conf.py Renamed repo and all internal references (#14) Aug 22, 2019
coverage.html.js Renamed repo and all internal references (#14) Aug 22, 2019
index.rst Minor bug fixes, and coveralls stopped working (#35) Nov 5, 2019
junit.xml Renamed repo and all internal references (#14) Aug 22, 2019
requirements.txt Minor bug fixes, and coveralls stopped working (#35) Nov 5, 2019
setup.cfg version bump for release 1.0.5 (#11) Aug 6, 2019
setup.py v1.2.0 feature/geometa tested and ready to release (#38) Nov 14, 2019

README.md

methylprep is a Python package for processing Illumina methylation array data. View on ReadTheDocs.

Readthedocs License: MIT CircleCI Build status Codacy Badge Coverage Status

methylprep Package

The methylprep package contains both high-level APIs for processing data from local files and low-level functionality allowing you to customize the flow of data and how it is processed.

Installation

methylprep maintains configuration files for your Python package manager of choice: conda, pipenv, and pip.

pip install methylprep

High-Level Processing

The primary methylprep API provides methods for the most common data processing and file retrieval functionality.

run_pipeline

Run the complete methylation processing pipeline for the given project directory, optionally exporting the results to file.

Returns: A collection of SampleDataContainer objects, one for each processed sample

from methylprep import run_pipeline

data_containers = run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_names=None)
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| data_dir | str, Path | REQUIRED | Base directory of the sample sheet and associated IDAT files |
| array_type | str | None | Code of the array type being processed. Possible values are custom, 27k, 450k, epic, and epic+. If not provided, the package will attempt to determine the array type from the number of probes in the raw data. This may not work if the batch contains samples from different array types. The data download function attempts to split different arrays into separate batches for processing to accommodate this. |
| manifest_filepath | str, Path | None | File path for the array's manifest file. If not provided, this file will be downloaded from a Life Epigenetics archive. |
| no_sample_sheet | bool | None | Pass "--no_sample_sheet" on the command line to trigger sample sheet auto-generation. Sample names will be based on idat filenames. Useful for public GEO data sets that lack sample sheets. |
| sample_sheet_filepath | str, Path | None | File path of the project's sample sheet. If not provided, the package will try to find one based on the supplied data directory path. |
| sample_name | str, list | None | List of sample names to process (sample_names in the Python call above). On the CLI, use the format -n sample1 sample2 sample3. If provided, only the specified samples will be processed. Otherwise all samples found in the sample sheet will be processed. |
| export | bool | False | Add flag to export the processed data to CSV. |
| betas | bool | False | Add flag to output a pickled dataframe of beta values (samples x probes). |
| m_value | bool | False | Add flag to output a pickled dataframe of M-values (samples x probes). |
| batch_size | int | None | Optional: splits the batch into smaller sets for processing. Useful when processing hundreds of samples that can't fit into memory. Produces multiple output files. The package also uses this to process batches that come from different array types. |

Note: By default, if run_pipeline is called as a function in a script, a list of SampleDataContainer objects is returned.
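
As a minimal sketch of calling run_pipeline from a script, the wrapper below forwards only arguments documented in the table above. The helper name process_project is hypothetical and not part of methylprep. The import is deferred so the function can be defined on a machine where methylprep is not installed.

```python
from pathlib import Path


def process_project(data_dir, sample_names=None, batch_size=None, export=False):
    """Run the methylprep pipeline on one project directory.

    Hypothetical convenience wrapper; it only forwards documented arguments.
    """
    # Deferred import: methylprep is only needed when the function is called.
    from methylprep import run_pipeline

    return run_pipeline(
        str(Path(data_dir).expanduser()),  # expand "~" before handing the path over
        array_type=None,                   # None lets the package autodetect the array type
        export=export,
        sample_names=sample_names,
        batch_size=batch_size,
    )
```

A call such as process_project("~/my_project", batch_size=50) (the path is a placeholder) would then process the project in batches of at most 50 samples.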

methylprep Command Line Interface (CLI)

methylprep provides a command line interface (CLI) so the package can be used directly in bash/batchfile scripts as part of building your custom processing pipeline.

All invocations of the methylprep CLI will provide contextual help, supplying the possible arguments and/or options available based on the invoked command. If you specify verbose logging (-v), the package will emit log output at DEBUG level and above.

>>> python -m methylprep

usage: methylprep [-h] [-v] {process,sample_sheet} ...

Utility to process methylation data from Illumina IDAT files

positional arguments:
  {process,sample_sheet}
    process             process help
    sample_sheet        sample sheet help

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Enable verbose logging

Commands

The methylprep CLI provides three top-level commands:

  • process to process methylation data
  • download to download and process public data sets in NIH GEO or ArrayExpress collections. Provide the public Accession ID and it will handle the rest.
  • sample_sheet to find/read/validate a sample sheet and output its contents

process

Process the methylation data for a group of samples listed in a single sample sheet.

If you do not provide the file path for the project's sample_sheet the module will try to find one based on the supplied data directory path. You must supply either the name of the array being processed or the file path for the array's manifest file. If you only specify the array type, the array's manifest file will be downloaded from a Life Epigenetics archive.

>>> python -m methylprep process

usage: methylprep idat [-h] -d DATA_DIR [-a {custom,27k,450k,epic,epic+}]
                       [-m MANIFEST] [-s SAMPLE_SHEET] [--no_sample_sheet]
                       [-n [SAMPLE_NAME [SAMPLE_NAME ...]]] [-e] [-b]
                       [--m_value] [--batch_size BATCH_SIZE]

Process Illumina IDAT files

optional arguments:
  -h, --help            show this help message and exit
  -d DATA_DIR, --data_dir DATA_DIR
                        Base directory of the sample sheet and associated IDAT
                        files. If IDAT files are in nested directories, this
                        will discover them.
  -a {custom,27k,450k,epic,epic+}, --array_type {custom,27k,450k,epic,epic+}
                        Type of array being processed. If omitted, this will
                        autodetect it.
  -m MANIFEST, --manifest MANIFEST
                        File path of the array manifest file. If omitted, this
                        will download the appropriate file from `s3`.
  -s SAMPLE_SHEET, --sample_sheet SAMPLE_SHEET
                        File path of the sample sheet. If omitted, this will
                        discover it. There must be only one CSV file in the
                        data_dir for discovery to work.
  --no_sample_sheet     If your dataset lacks a sample sheet csv file, specify
                        --no_sample_sheet to have it create one on the fly.
                        This will read .idat file names and ensure processing
                        works. If there is a matrix file, it will add in
                        sample names too.
  -n [SAMPLE_NAME [SAMPLE_NAME ...]], --sample_name [SAMPLE_NAME [SAMPLE_NAME ...]]
                        Sample(s) to process. You can pass multiple sample
                        names with multiple -n params.
  -e, --no_export       Default is to export data to csv in same folder where
                        IDAT file resides. Pass in --no_export to suppress
                        this.
  -b, --betas           If passed, output returns a dataframe of beta values
                        for samples x probes. Local file beta_values.npy is
                        also created.
  --m_value             If passed, output returns a dataframe of M-values for
                        samples x probes. Local file m_values.npy is also
                        created.
  --batch_size BATCH_SIZE
                        If specified, samples will be processed and saved in
                        batches no greater than the specified batch size
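
The options in the help text above can also be assembled programmatically. The following is a sketch only: build_process_command is a hypothetical helper, and it covers just a subset of the documented flags. It returns an argument list suitable for subprocess.run.

```python
import sys

# Array type codes as listed in the help text above.
ARRAY_TYPES = ("custom", "27k", "450k", "epic", "epic+")


def build_process_command(data_dir, array_type=None, sample_names=None,
                          betas=False, batch_size=None, no_sample_sheet=False):
    """Return the argv list for `python -m methylprep process` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "methylprep", "process", "-d", str(data_dir)]
    if array_type is not None:
        if array_type not in ARRAY_TYPES:
            raise ValueError(f"unknown array type: {array_type!r}")
        cmd += ["-a", array_type]
    if no_sample_sheet:
        cmd.append("--no_sample_sheet")
    if sample_names:
        # One -n flag followed by all names, as in "-n sample1 sample2 sample3".
        cmd += ["-n", *sample_names]
    if betas:
        cmd.append("--betas")
    if batch_size:
        cmd += ["--batch_size", str(batch_size)]
    return cmd
```

Passing the result to subprocess.run(cmd, check=True) would launch the CLI. Building a list instead of a shell string avoids quoting problems with paths that contain spaces.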

download

There are thousands of publicly accessible DNA methylation data sets available via the GEO (US NCBI NIH) https://www.ncbi.nlm.nih.gov/geo/ and ArrayExpress (UK) https://www.ebi.ac.uk/arrayexpress/ websites. This command makes it easy to import them and build a reference library of methylation data.

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| -d , --data_dir | str | [required path] | Path to where the data series will be saved. Folder must exist already. |
| -i ID, --id ID | str | [required ID] | The dataset's reference ID (starts with GSE for GEO or E-MTAB- for ArrayExpress) |
| -l LIST, --list LIST | multiple strings | optional | List of series IDs (can be either GEO or ArrayExpress), for partial downloading |
| -o, --dict_only | True | pass flag only | If passed, will only create dictionaries and not process any samples |
| -b BATCH_SIZE, --batch_size BATCH_SIZE | int | optional | Number of samples to process at a time, 100 by default. Set to 0 to process everything as one batch. The resulting file structure is the same regardless of this number. Most machines cannot process more than 200 samples in memory at once, so this option lets you set a limit that suits your machine. |
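
The sketch below shows one way to check the inputs before invoking download. Both function names are hypothetical. It assumes GEO series IDs carry the GSE prefix (as in the GSE133062 example elsewhere in this README) and that ArrayExpress IDs start with E-MTAB-. It also enforces the requirement that the target folder already exists.

```python
import sys
from pathlib import Path


def classify_accession(accession):
    """Guess the source repository from the ID prefix (assumed prefixes, see above)."""
    acc = accession.strip().upper()
    if acc.startswith("GSE"):
        return "GEO"
    if acc.startswith("E-MTAB-"):
        return "ArrayExpress"
    raise ValueError(f"unrecognized accession ID: {accession!r}")


def build_download_command(data_dir, accession, batch_size=100):
    """Return the argv list for `python -m methylprep download` (hypothetical helper)."""
    if not Path(data_dir).is_dir():
        # The download command expects the target folder to exist already.
        raise FileNotFoundError(f"data_dir does not exist: {data_dir}")
    classify_accession(accession)  # raises ValueError for unknown prefixes
    return [sys.executable, "-m", "methylprep", "download",
            "-d", str(data_dir), "-i", accession,
            "--batch_size", str(batch_size)]
```

The returned list can be passed to subprocess.run(cmd, check=True) once the folder has been created.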

sample_sheet

Find and parse the sample sheet in a given directory and emit the details of each sample. This is not required for actually processing data.

>>> python -m methylprep sample_sheet

usage: methylprep sample_sheet [-h] -d DATA_DIR

Process Illumina sample sheet file

optional arguments:
  -h, --help            show this help message and exit
  -d, --data_dir        Base directory of the sample sheet and associated IDAT
                        files
  -c, --create          If specified, this creates a sample sheet from idats
                        instead of parsing an existing sample sheet. The
                        output file will be called "samplesheet.csv".
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        If creating a sample sheet, you can provide an
                        optional output filename (CSV).                        

Example of creating a sample sheet

~/methylprep$ python -m methylprep -v sample_sheet -d ~/GSE133062/GSE133062 --create
INFO:methylprep.files.sample_sheets:[!] Created sample sheet: ~/GSE133062/GSE133062/samplesheet.csv with 70 GSM_IDs
INFO:methylprep.files.sample_sheets:Searching for sample_sheet in ~/GSE133062/GSE133062
INFO:methylprep.files.sample_sheets:Found sample sheet file: ~/GSE133062/GSE133062/samplesheet.csv
INFO:methylprep.files.sample_sheets:Parsing sample_sheet
200861170112_R01C01
200882160083_R03C01
200861170067_R02C01
200498360027_R04C01
200498360027_R08C01
200861170067_R01C01
200861170072_R05C01
200498360027_R06C01
200861170072_R01C01
200861170067_R03C01
200882160070_R02C01
...
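
Each sample line in the output above appears to follow Illumina's Sentrix ID plus Sentrix position naming (for example 200861170112_R01C01). Under that assumption, a small parser can split the two parts. SentrixSample and parse_sentrix_name are hypothetical names and not part of methylprep.

```python
import re
from typing import NamedTuple


class SentrixSample(NamedTuple):
    sentrix_id: str
    sentrix_position: str


# Assumed pattern: digits, an underscore, then a row/column code such as R01C01.
_SENTRIX_RE = re.compile(r"^(\d+)_(R\d{2}C\d{2})$")


def parse_sentrix_name(name):
    """Split a name like '200861170112_R01C01' into chip ID and position."""
    match = _SENTRIX_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a Sentrix-style sample name: {name!r}")
    return SentrixSample(*match.groups())
```

Lines that do not match, such as the INFO log lines, raise ValueError, so a caller can skip them when filtering the command output.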

download

The CLI now includes a download option. Supply the GEO ID or ArrayExpress ID and it will locate the files, download the idats, process them, and build a dataframe of the associated meta data. This dataframe format should be compatible with methylcheck and methylize.

optional arguments:
| Argument | Type | Description |
| --- | --- | --- |
| -h, --help | | show this help message and exit |
| -d DATA_DIR, --data_dir DATA_DIR | path | (required) Directory to download series to |
| -i ID, --id ID | string | Unique ID of the series (either GEO or ArrayExpress ID) |
| -l LIST, --list LIST | multiple string arguments | List of series IDs (can be either GEO or ArrayExpress) |
| -o, --dict_only | no args | If passed, will only create dictionaries and not process any samples |
| -b BATCH_SIZE, --batch_size BATCH_SIZE | number | Number of samples to process at a time, 100 by default |
  • When processing large batches of raw .idat files, specify --batch_size to break the processing up into smaller batches so the computer's memory won't overload. This is off by default when using process but is ON when using download and set to batch_size of 100.
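
To illustrate what batching means here, the sketch below splits a list of sample names into groups no larger than batch_size, treating None or 0 as "everything in one batch". This is an illustration of the idea only, not methylprep's internal implementation, and split_into_batches is a hypothetical name.

```python
def split_into_batches(sample_names, batch_size=None):
    """Return a list of batches, each holding at most `batch_size` names."""
    names = list(sample_names)
    if not batch_size:
        # None or 0: process everything as a single batch.
        return [names] if names else []
    if batch_size < 0:
        raise ValueError("batch_size must not be negative")
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]


# 250 samples with the download default of 100 give batches of 100, 100, and 50.
batches = split_into_batches([f"sample_{i}" for i in range(250)], batch_size=100)
print([len(batch) for batch in batches])
```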

Low-Level Processing

These are some functions that you can use within methylprep. run_pipeline calls them for you as needed.

get_sample_sheet

Find and parse the sample sheet for the provided project directory path.

Returns: A SampleSheet object containing the parsed sample information from the project's sample sheet file

from methylprep import get_sample_sheet

sample_sheet = get_sample_sheet(dir_path, filepath=None)
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| dir_path | str, Path | - | Base directory of the sample sheet and associated IDAT files |
| filepath | str, Path | None | File path of the project's sample sheet. If not provided, the package will try to find one based on the supplied data directory path. |

get_manifest

Find and parse the manifest file for the processed array type.

Returns: A Manifest object containing the parsed probe information for the processed array type

from methylprep import get_manifest

manifest = get_manifest(raw_datasets, array_type=None, manifest_filepath=None)
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| raw_datasets | RawDataset collection | - | Collection of RawDataset objects containing probe information from the raw IDAT files. |
| array_type | str | None | Code of the array type being processed. Possible values are custom, 450k, epic, and epic+. If not provided, the package will attempt to determine the array type based on the provided RawDataset objects. |
| manifest_filepath | str, Path | None | File path for the array's manifest file. If not provided, this file will be downloaded from a Life Epigenetics archive. |

get_raw_datasets

Find and parse the IDAT files for samples within a project's sample sheet.

Returns: A collection of RawDataset objects for each sample's IDAT file pair.

from methylprep import get_raw_datasets

raw_datasets = get_raw_datasets(sample_sheet, sample_names=None)
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| sample_sheet | SampleSheet | - | A SampleSheet instance from a valid project sample sheet file. |
| sample_names | str collection | None | List of sample names to process. If provided, only those samples specified will be processed. Otherwise all samples found in the sample sheet will be processed. |
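
The three low-level functions above can be chained in the order their inputs suggest: sample sheet first, then raw datasets, then the manifest. The following is a sketch using only the call signatures shown in this section. load_project is a hypothetical helper name, and the import is deferred so the function can be defined without methylprep installed.

```python
def load_project(dir_path, sample_names=None):
    """Load the sample sheet, raw datasets, and manifest for one project.

    Hypothetical helper that chains the documented low-level functions.
    """
    # Deferred import: methylprep is only needed when the function is called.
    from methylprep import get_manifest, get_raw_datasets, get_sample_sheet

    # 1. Find and parse the sample sheet in the project directory.
    sample_sheet = get_sample_sheet(dir_path, filepath=None)
    # 2. Read the IDAT file pairs for the requested samples (all if None).
    raw_datasets = get_raw_datasets(sample_sheet, sample_names=sample_names)
    # 3. Let the package determine the array type and fetch the manifest.
    manifest = get_manifest(raw_datasets, array_type=None, manifest_filepath=None)
    return sample_sheet, raw_datasets, manifest
```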