v0.5 (report pdf) (#28)
* p-value probe detection work in progress

* hdbscan clustering functions

* more QC methods testing

* v0.4.0 more tests, smart about df orientation

* removing unused function

* bug

* added read_geo() for processed datafiles, and unit tests for it. Works with txt,csv,xlsx,pkl files

* read_geo() docs

* unit test read_geo

* circleci broke with coverall

* bug fix unit read_geo

* debugging unit test for read_geo

* circleci read_geo unit tests

* circleci read_geo unit tests

* v.0.4.0 re-organized files
debugged filters.list_problem_probes:
updated the docs to have correct spelling for refs/reasons. these were wrong from the original code I inherited.
it now produces an error if you specify criteria that aren’t recognized
also added a function that lets you see more detail on the probes and reasons/pubs criteria
added unit tests for this.

* genome studio QC functions,
improved .load function (but not consolidated through methyl-suite yet)
function .assign() for manually categorizing samples
unit testing on the predict.sex function
get_sex() prediction

* improved QC control probe plots; fixed bug in detecting duplicates and run_pipeline

* debugging circleci

* debugging circleci

* debugging circleci

* debugging circleci

* debugging circleci

* debugging circleci

* debugging circleci

* debugging circleci

* debugging circleci

* debugging circleci

* v0.4.0 consolidated data loading for functions and uses fastest option

* support for and unit tests on .load from batched samples

* improved docs homepage

* readthedocs improved autodocs

* Add files via upload

* Update README.md

* Add files via upload

* better docs and smarter load_both()

* added sesame and poobah option to .load

* updated beta_mds_plot() to handle NaN probes and impute from adj probes

* working on a QC report, and add return_fig option to most functions

* updated and improved ReportPDF with tables and appendix formatting

* v0.5 adds kwargs to functions for silent processing returning figure objects, and a report_pdf class that can run QC and generate a PDF report. __version__ too

* debugging pipenv; breaking circleci

* removed local virtual env and updated pipfile lock

* fixed setup bug
Marc Maxmeister committed Apr 23, 2020
1 parent 71b4515 commit 819b00f
Showing 20 changed files with 2,637 additions and 334 deletions.
7 changes: 5 additions & 2 deletions Pipfile
@@ -21,12 +21,15 @@ nbsphinx = "*"
xlrd = ">=1.0.0"

[packages]
tqdm = "*"
numpy = "==1.16.3"
pandas = "==0.24.2"
statsmodels = "==0.9.0"
scipy = "==1.2.1"
methylcheck = {editable = true,extras = ["socks"], path = "."}
seaborn = "*"
matplotlib = "*"
tqdm = "*"
scikit-learn = "*"
# methylcheck = {editable = true,extras = ["dev"],path = "."}

[requires]
python_version = "3.7"
1,559 changes: 1,559 additions & 0 deletions Pipfile.lock

Large diffs are not rendered by default.

21 changes: 14 additions & 7 deletions README.md
@@ -1,13 +1,17 @@
methylcheck is a Python-based package for filtering and visualizing Illumina methylation array data. The focus is on quality control.

[![Readthedocs](https://readthedocs.com/projects/life-epigenetics-methylcheck/badge/?version=latest)](https://life-epigenetics-methylcheck.readthedocs-hosted.com/en/latest/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![CircleCI](https://circleci.com/gh/LifeEGX/methylcheck.svg?style=shield&circle-token=58a514d3924fcfe0287c109d2323b7f697956ec9)](https://circleci.com/gh/LifeEGX/methylcheck) [![Build status](https://ci.appveyor.com/api/projects/status/j15lpvjg1q9u2y17?svg=true)](https://ci.appveyor.com/project/life_epigenetics/methQC) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/aedf5c223e39415180ff35153b2bad89)](https://www.codacy.com?utm_source=github.com&utm_medium=referral&utm_content=LifeEGX/methylcheck&utm_campaign=Badge_Grade)
[![Coverage Status](https://coveralls.io/repos/github/LifeEGX/methylcheck/badge.svg?t=OVL45Q)](https://coveralls.io/github/LifeEGX/methylcheck)
[![Coverage Status](https://coveralls.io/repos/github/LifeEGX/methylcheck/badge.svg?t=OVL45Q)](https://coveralls.io/github/LifeEGX/methylcheck) ![PyPI-Downloads](https://img.shields.io/pypi/dm/methylcheck.svg?label=pypi%20downloads&logo=PyPI&logoColor=white)

![methylprep snapshots](https://raw.githubusercontent.com/LifeEGX/methylcheck/master/docs/methylcheck_overview.png "methylcheck snapshots")

![methylprep snapshots](https://raw.githubusercontent.com/LifeEGX/methylcheck/master/docs/methylcheck_overview.png "methylcheck snapshots")

## methylcheck Package

This package contains high-level APIs for filtering processed data from local files. 'High-level' means that the details are abstracted away, and functions are designed to work with a minimum of knowledge and specification required. But you can always override the "smart" defaults with custom settings if things don't work. Before starting you must first download processed data from the NIH GEO database or process a set of `idat` files with `methylprep`. Refer to methylprep for instructions on this step.
This package contains high-level APIs for filtering processed data from local files. 'High-level' means that the details are abstracted away, and functions are designed to work with a minimum of knowledge and specification required. But you can always override the "smart" defaults with custom settings if things don't work. Before starting you must first download processed data from the NIH GEO database or process a set of `idat` files with `methylprep`. Refer to [methylprep](https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/index.html) for instructions on this step.

![methylprep functions](https://raw.githubusercontent.com/LifeEGX/methylcheck/dev/docs/methylcheck_functions.png)

## Installation

@@ -24,17 +28,19 @@ Load your data in a Jupyter Notebook like this:
mydata = pandas.read_pickle('beta_values.pkl')
```

If you processed a large batch of samples using the `batch_size` option in `methylprep process`, there's a convenience function in `methylize` (methylize.load) that will load and combine a bunch of output files in the same folder:
If you processed a large batch of samples using the `batch_size` option in `methylprep process`, there's a convenience function in `methylcheck` (methylcheck.load) that will load and combine a bunch of output files in the same folder:

```python
import methylize
df = methylize.load('<path to folder with methylprep output>')
df = methylcheck.load('<path to folder with methylprep output>')
# or
df,meta = methylize.load_both('<path to folder with methylprep output>')
df,meta = methylcheck.load_both('<path to folder with methylprep output>')
```

This conveniently loads a dataframe of all meta data associated with the samples, if you are using public GEO data. Some analysis functions require specifying which samples are part of a treatment group (vs control) and the `meta` dataframe object can be used for this.

For more, check out our [examples of loading data into `methylcheck`](https://life-epigenetics-methylcheck.readthedocs-hosted.com/en/latest/docs/demo_qc_functions.html)

### GEO

Alternatively, you can import public GEO datasets directly, if they are processed data containing either probe `beta` values for samples or methylated/unmethylated signal intensities. If you have `idat` files, process them first with `methylprep`, or use the `methylprep download -i <GEO_ID>` option to download and process public data.
@@ -44,10 +50,11 @@ In general, the best way to import data is to use `methylprep` and run
run_pipeline(data_folder, betas=True)

# or from the command line:
`python -m methylprep process -d <filepath to idats> --all`
python -m methylprep process -d <filepath to idats> --all
```

collect the beta_values.pkl file it returns/saves to disk, and load that in a Jupyter notebook. From there, each data transformation is a single line of code using Pandas DataFrames. `methylcheck` will keep track of the data format/structures for you, and you can visualize the effect of each filter as you go. You can also export images of your charts for publication.
collect the `beta_values.pkl` file it returns/saves to disk, and load that in a Jupyter notebook. From there, each data transformation is a single line of code using Pandas DataFrames. `methylcheck` will keep track of the data format/structures for you, and you can visualize the effect of each filter as you go. You can also export images of your charts for publication.


Refer to the Jupyter notebooks on readthedocs for examples of filtering probes from a batch of samples, removing outlier samples, and generating plots of data.

3 changes: 2 additions & 1 deletion conf.py
@@ -42,7 +42,8 @@
    'sphinx.ext.autodoc',
    'sphinxcontrib.apidoc',
    'm2r',
    'nbsphinx'
    'nbsphinx',
    'sphinx.ext.autosummary'
]

# instead of CLI "sphinx-autodoc . _build/html" you write this
Binary file added docs/methylcheck_functions.png
72 changes: 58 additions & 14 deletions docs/source/methylcheck.rst
@@ -1,26 +1,70 @@
methylcheck package
===================

methylcheck.cli module
----------------------
.. autosummary::
    methylcheck.cli
    methylcheck.run_pipeline
    methylcheck.run_qc

.. automodule:: methylcheck.cli
    methylcheck.read_geo
    methylcheck.load
    methylcheck.load_both

    methylcheck.qc_signal_intensity
    methylcheck.plot_M_vs_U
    methylcheck.plot_controls
    methylcheck.plot_beta_by_type

    methylcheck.probes
    methylcheck.list_problem_probes
    methylcheck.exclude_probes
    methylcheck.exclude_sex_control_probes
    methylcheck.drop_nan_probes

    methylcheck.samples
    methylcheck.sample_plot
    methylcheck.beta_density_plot
    methylcheck.mean_beta_plot
    methylcheck.mean_beta_compare
    methylcheck.beta_mds_plot
    methylcheck.combine_mds
    methylcheck.cumulative_sum_beta_distribution

    methylcheck.predict
    methylcheck.get_sex
    methylcheck.assign

methylcheck pipeline functions
-----------------------------------------------

.. automodule:: methylcheck.qc_report
    :members:
    :undoc-members:
    :show-inheritance:


methylcheck.probes module
-------------------------

.. automodule:: methylcheck.probes
    :members:
    :undoc-members:
    :show-inheritance:

methylcheck.filters module

methylcheck.samples module
--------------------------

.. automodule:: methylcheck.filters
    :members:
    :undoc-members:
    :show-inheritance:
.. automodule:: methylcheck.samples
    :members:
    :undoc-members:
    :show-inheritance:

methylcheck.postprocessQC module
--------------------------------

.. automodule:: methylcheck.postprocessQC
    :members:
    :undoc-members:
    :show-inheritance:
methylcheck.predict module
--------------------------

.. automodule:: methylcheck.predict
    :members:
    :undoc-members:
    :show-inheritance:
2 changes: 1 addition & 1 deletion index.rst
@@ -10,14 +10,14 @@ methylcheck documentation
:name: Help Files

docs/demo-methylprep-to-methylcheck-example.ipynb
functions <docs/source/methylcheck>
docs/demo_read_geo_processed.ipynb
docs/filtering_probes.ipynb
docs/methylprep_methylcheck_example.ipynb
docs/demo_qc_functions.ipynb
docs/another-methylcheck-qc-example.ipynb
docs/demo_using_matched_meta_data.ipynb
methylprep package <https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/>
docs/source/methylcheck
methylize (analysis) package <https://life-epigenetics-methylize.readthedocs-hosted.com/en/latest/>


11 changes: 1 addition & 10 deletions methylcheck/__init__.py
@@ -27,18 +27,9 @@
)

from .qc_report import run_pipeline

'''
try:
    import methylprep
    load = methylprep.load
    load_both = methylprep.load_both
    del methylprep
except ImportError as error:
    pass # these functions are not available otherwise.
'''
from .load_processed import load, load_both, container_to_pkl
from .read_geo_processed import read_geo
from .version import __version__

getLogger(__name__).addHandler(NullHandler())

48 changes: 39 additions & 9 deletions methylcheck/cli.py
@@ -17,22 +17,52 @@ def error(self, message):
self.exit(status=2)


def detect_array(df):
    """Determines array type using number of probes columns in df. Returns array string."""
def detect_array(df, returns='name'):
    """Determines array type using number of probes columns in df. Returns array string.
    Note: this is different from methylprep.models.arrays.ArrayType.from_probe_count, which looks at idat files.
    returns (name | filepath)
        default is 'name' -- returns a string
        if 'filepath', this also returns the filepath to the array, using ArrayType and
        methylprep.files.manifests ARRAY_TYPE_MANIFEST_FILENAMES."""

    if returns == 'filepath':
        # get manifest data from .methylprep_manifest_files
        try:
            from methylprep.files.manifests import MANIFEST_DIR_PATH, ARRAY_TYPE_MANIFEST_FILENAMES, Manifest
            from methylprep.models.arrays import ArrayType
        except ImportError:
            raise ImportError("this function requires `methylprep` be installed (to read manifest array files).")

        def get_filename(array_name):
            ARRAY_FILENAME = {
                '27k': 'hm27.hg19.manifest.csv.gz',
                '450k': 'HumanMethylation450_15017482_v1-2.CoreColumns.csv.gz',
                'epic': 'MethylationEPIC_v-1-0_B4.CoreColumns.csv.gz',
                'epic+': 'CombinedManifestEPIC.manifest.CoreColumns.csv.gz',
                'mouse': 'LEGX_B1_manifest_mouse_v1_min.csv.gz',
            }
            man_path = Path(MANIFEST_DIR_PATH).expanduser()
            man_filename = ARRAY_FILENAME[array_name]
            man_filepath = Path(man_path, man_filename)
            return man_filepath

    # shape: should be wide, with more columns than rows. The larger dimension is the probe count.
    if df.shape[0] > df.shape[1]:
        # WARNING: this will need to be transposed later.
        col_count = (df.shape[0]) #does the index count in shape? assuming it doesn't.
    else:
        col_count = (df.shape[1])
    if 26000 <= col_count <= 28000:
        return '27k'
    if 440000 <= col_count <= 490000: #485512
        return '450k'
    if 869001 <= col_count <= 869335: # 52650 <= col_count <= 53000:
        return 'EPIC+'
    if 860000 <= col_count <= 869000: #1050000 <= col_count <= 1053000: actual: 865860
        return 'EPIC'
        return '27k' if returns == 'name' else (ArrayType('27k'), get_filename('27k'))
    elif 440000 <= col_count <= 490000: #485512
        return '450k' if returns == 'name' else (ArrayType('450k'), get_filename('450k'))
    elif 869001 <= col_count <= 869335: # 52650 <= col_count <= 53000:
        return 'EPIC+' if returns == 'name' else (ArrayType('epic+'), get_filename('epic+'))
    elif 860000 <= col_count <= 869000: #1050000 <= col_count <= 1053000: actual: 865860
        return 'EPIC' if returns == 'name' else (ArrayType('epic'), get_filename('epic'))
    elif 250000 <= col_count <= 270000: #actual count: 262812
        return 'mouse' if returns == 'name' else (ArrayType('mouse'), get_filename('mouse'))
    else:
        raise ValueError('Unsupported Illumina array type. Your data file contains {0} rows for probes.'.format(col_count))
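
Reviewer sketch (not part of the commit): the probe-count lookup in the new `detect_array` can be tried in isolation, without `methylprep` or a real DataFrame. The ranges and labels below are copied from the hunk above; `guess_array_type` and `PROBE_COUNT_RANGES` are hypothetical names used only for illustration, and the function takes a plain `(rows, columns)` shape tuple in place of `df.shape`. It covers only the `returns='name'` behaviour.

```python
# Probe-count ranges copied from the detect_array() hunk; the labels are the
# strings that function returns when returns='name'.
PROBE_COUNT_RANGES = [
    (26000, 28000, '27k'),
    (250000, 270000, 'mouse'),
    (440000, 490000, '450k'),
    (860000, 869000, 'EPIC'),
    (869001, 869335, 'EPIC+'),
]


def guess_array_type(shape):
    """Hypothetical helper: map a (rows, columns) shape to an array name."""
    # The larger dimension is treated as the probe count, so both
    # orientations (samples x probes and probes x samples) are accepted.
    probe_count = max(shape)
    for low, high, name in PROBE_COUNT_RANGES:
        if low <= probe_count <= high:
            return name
    raise ValueError(
        'Unsupported Illumina array type. Found {0} probes.'.format(probe_count))


# Example shapes built from the probe counts noted in the diff comments.
print(guess_array_type((12, 485512)))   # wide frame, 450k-sized
print(guess_array_type((865860, 8)))    # tall frame, EPIC-sized
```

Keeping the ranges in one table makes it harder for a new array type to be added in one branch and missed in another, which is roughly what the `elif` chain in the commit has to guard against by hand.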
