Skip to content

Commit

Permalink
read_geo improved (#33)
Browse files Browse the repository at this point in the history
* v0.5.9 qc_plot bug fixes -99
* github org name change
* read_geo v3
* v0.6 read_geo improved
  • Loading branch information
Marc Maxmeister committed Jun 11, 2020
1 parent ef4122d commit d0e4216
Show file tree
Hide file tree
Showing 15 changed files with 1,846 additions and 27 deletions.
41 changes: 29 additions & 12 deletions README.md
@@ -1,17 +1,17 @@
methylcheck is a Python-based package for filtering and visualizing Illumina methylation array data. The focus is on quality control.

[![Readthedocs](https://readthedocs.com/projects/life-epigenetics-methylcheck/badge/?version=latest)](https://life-epigenetics-methylcheck.readthedocs-hosted.com/en/latest/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![CircleCI](https://circleci.com/gh/LifeEGX/methylcheck.svg?style=shield&circle-token=58a514d3924fcfe0287c109d2323b7f697956ec9)](https://circleci.com/gh/LifeEGX/methylcheck) [![Build status](https://ci.appveyor.com/api/projects/status/j15lpvjg1q9u2y17?svg=true)](https://ci.appveyor.com/project/life_epigenetics/methQC) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/aedf5c223e39415180ff35153b2bad89)](https://www.codacy.com?utm_source=github.com&utm_medium=referral&utm_content=LifeEGX/methylcheck&utm_campaign=Badge_Grade)
[![Coverage Status](https://coveralls.io/repos/github/LifeEGX/methylcheck/badge.svg?t=OVL45Q)](https://coveralls.io/github/LifeEGX/methylcheck) ![PyPI-Downloads](https://img.shields.io/pypi/dm/methylcheck.svg?label=pypi%20downloads&logo=PyPI&logoColor=white)
[![Readthedocs](https://readthedocs.com/projects/life-epigenetics-methylcheck/badge/?version=latest)](https://life-epigenetics-methylcheck.readthedocs-hosted.com/en/latest/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![CircleCI](https://circleci.com/gh/FOXOBioScience/methylcheck.svg?style=shield&circle-token=58a514d3924fcfe0287c109d2323b7f697956ec9)](https://circleci.com/gh/FOXOBioScience/methylcheck) [![Build status](https://ci.appveyor.com/api/projects/status/j15lpvjg1q9u2y17?svg=true)](https://ci.appveyor.com/project/life_epigenetics/methQC) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/aedf5c223e39415180ff35153b2bad89)](https://www.codacy.com?utm_source=github.com&utm_medium=referral&utm_content=FOXOBioScience/methylcheck&utm_campaign=Badge_Grade)
[![Coverage Status](https://coveralls.io/repos/github/FOXOBioScience/methylcheck/badge.svg?t=OVL45Q)](https://coveralls.io/github/FOXOBioScience/methylcheck) ![PyPI-Downloads](https://img.shields.io/pypi/dm/methylcheck.svg?label=pypi%20downloads&logo=PyPI&logoColor=white)

![methylprep snapshots](https://raw.githubusercontent.com/LifeEGX/methylcheck/master/docs/methylcheck_overview.png "methylcheck snapshots")
![methylprep snapshots](https://raw.githubusercontent.com/FOXOBioScience/methylcheck/master/docs/methylcheck_overview.png "methylcheck snapshots")

![methylprep snapshots](https://raw.githubusercontent.com/LifeEGX/methylcheck/master/docs/methylcheck_overview.png "methylcheck snapshots")
![methylprep snapshots](https://raw.githubusercontent.com/FOXOBioScience/methylcheck/master/docs/methylcheck_overview.png "methylcheck snapshots")

## methylcheck Package

This package contains high-level APIs for filtering processed data from local files. 'High-level' means that the details are abstracted away, and functions are designed to work with a minimum of knowledge and specification required. But you can always override the "smart" defaults with custom settings if things don't work. Before starting you must first download processed data from the NIH GEO database or process a set of `idat` files with `methylprep`. Refer to [methylprep](https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/index.html) for instructions on this step.

![methylprep functions](https://raw.githubusercontent.com/LifeEGX/methylcheck/dev/docs/methylcheck_functions.png)
![methylprep functions](https://raw.githubusercontent.com/FOXOBioScience/methylcheck/master/docs/methylcheck_functions.png)

## Installation

Expand All @@ -31,7 +31,7 @@ mydata = pandas.read_pickle('beta_values.pkl')
If you processed a large batch of samples using the `batch_size` option in `methylprep process`, there's a convenience function in `methylcheck` (methylcheck.load) that will load and combine a bunch of output files in the same folder:

```python
import methylize
import methycheck
df = methylcheck.load('<path to folder with methylprep output>')
# or
df,meta = methylcheck.load_both('<path to folder with methylprep output>')
Expand All @@ -41,24 +41,39 @@ This conveniently loads a dataframe of all meta data associated with the samples

For more, check out our [examples of loading data into `methylcheck`](https://life-epigenetics-methylcheck.readthedocs-hosted.com/en/latest/docs/demo_qc_functions.html)

### GEO
### GEO (`idat`)

Alternatively, you can import public GEO datasets directly, if they are processed data containing either probe `beta` values for samples or methylated/unmethylated signal intensities. If you have `idat` files, process them first with `methylprep`, or use the `methylprep download -i <GEO_ID>` option to download and process public data.
Alternatively, you can download and process public GEO datasets. If you have a gzip of `idat` files, process them first with `methylprep`, or use the `methylprep download -i <GEO_ID>` option to download and process public data.

In general, the best way to import data is to use `methylprep` and run
```python
run_pipeline(data_folder, betas=True)
import methylprep
methylprep.run_pipeline(data_folder, betas=True)

# or from the command line:
python -m methylprep process -d <filepath to idats> --all
```

collect the `beta_values.pkl` file it returns/saves to disk, and load that in a Jupyter notebook. From there, each data transformation is a single line of code using Pandas DataFrames. `methylcheck` will keep track of the data format/structures for you, and you can visualize the effect of each filter as you go. You can also export images of your charts for publication.
collect the `beta_values.pkl` file it returns/saves to disk, and load that in a Jupyter notebook.

### GEO (processed data in `csv, txt, xlsx` formats)
If idats are not available on GEO, you can imported the processed tabular data using `methylcheck.read_geo`. This will convert methylation/unmethylated signal intensities to beta values by default, returning a Pandas dataframe with samples in columns and probes in rows. As of version 0.6, it recognizes six different ways authors have organized their data, but does not handle these cases yet:

- Combining two files containing methylated and unmethylated values for a set of samples
- Reading GSM12345-tbl-1.txt type files (found in GSExxxx_family.tar.gz packages)

To obtain probe `beta` values for samples or methylated/unmethylated signal intensities, download the file locally and run:
```python
import methylcheck
df = methylcheck.read_geo(filepath)
```
If you include `verbose=True` it will explain what it is doing. You can also use `test_only=True` to ensure the parser will work without loading the whole file (which can be several gigabytes in size). Test mode will only return the first 200 probes, but should parse the header and detect the file structure if it can.

From there, each data transformation is a single line of code using Pandas DataFrames. `methylcheck` will keep track of the data format/structures for you, and you can visualize the effect of each filter as you go. You can also export images of your charts for publication.

Refer to the Jupyter notebooks on readthedocs for examples of filtering probes from a batch of samples, removing outlier samples, and generating plots of data.

## Quality Control (QC)
## Quality Control (QC) reports

The simplest way to generate a battery of plots about your data is to run this function in a Jupyter notebook:

Expand All @@ -67,6 +82,8 @@ import methylcheck
methylcheck.run_qc('<path to your methylprep processed files>')
```

There is also a `methylcheck.qc_report.ReportPDF` class that allows you to build your own QC report and save it to PDF.

## Other functions

`methylcheck` provides functions to
Expand All @@ -80,4 +97,4 @@ methylcheck.run_qc('<path to your methylprep processed files>')

## Authors

Parts of this package were ported from `minfi`, an `R` package, and extended/developed by the team at Foxo Bioscience, who maintains it. You can write to `info@LifeEgx.com` to give feedback, ask for help, or suggest improvements. For bugs, report issues on our [github repo](https://github.com/lifeEGX/methylcheck) page.
Parts of this package were ported from `minfi`, an `R` package, and extended/developed by the team at Foxo Bioscience, who maintains it. You can write to `info@FOXOBioScience.com` to give feedback, ask for help, or suggest improvements. For bugs, report issues on our [github repo](https://github.com/FOXOBioScience/methylcheck) page.
3 changes: 2 additions & 1 deletion methylcheck/__init__.py
Expand Up @@ -28,7 +28,7 @@

from .qc_report import run_pipeline
from .load_processed import load, load_both, container_to_pkl
from .read_geo_processed import read_geo
from .read_geo_processed import read_geo, detect_header_pattern
from .version import __version__

getLogger(__name__).addHandler(NullHandler())
Expand All @@ -42,6 +42,7 @@
'cumulative_sum_beta_distribution',
'container_to_pkl',
'detect_array',
'detect_header_pattern',
'drop_nan_probes',
'exclude_probes',
'exclude_sex_control_probes',
Expand Down
10 changes: 10 additions & 0 deletions methylcheck/load_processed.py
Expand Up @@ -84,6 +84,16 @@ def load(filepath='.', format='beta_value', file_stem='', verbose=False, silent=
for reading csvs processed using R's sesame package. It has a different format (Probe_ID, ind_beta, ind_negs, ind_poob) per sample.
Only those probes that pass the p-value cutoff will be included.
Science on p-value cutoff:
This function defaults to a p-value cutoff of 0.05, which is typical for scientific tests.
There is currently no consensus on what percent of a sample's probes can fail. For example,
if a sample has 860,000 probes and 5% of them fail, should you reject the whole sample from the batch?
For large batch industrial scale testing, the authors assign some limit, like 5%, 10%, 20%, 30%, etc as a cutoff. And methylcheck's run_qc() function defaults to 10 percent.
But the academics we spoke to don't automatically throw out any samples. Because it depends.
Cancer samples have lots of anueploidy (an abnormal number of chromosomes in a haploid set) and lost chromosomes, so one would expect no signal for these CpG sites.
So those researchers wouldn't throw out samples unless most of the sample fails.
People are working on deriving a calibration curve from public GEO data as a guide, and give a frame of reference, but none exist yet. And public data rarely includes failed samples.
TODO:
- BUG: custom fields cannot auto detect the -pval- column and this isn't supplied in kwargs
- DONE: meth_df deal with batches of files
Expand Down

0 comments on commit d0e4216

Please sign in to comment.