v0.5 (report pdf) (#28)

* p-value probe detection work in progress
* hdbscan clustering functions
* more QC methods testing
* v0.4.0 more tests, smart about df orientation
* removing unused function
* bug
* added read_geo() for processed datafiles, and unit tests for it. Works with txt,csv,xlsx,pkl files
* read_geo() docs
* unit test read_geo
* circleci broke with coverall
* bug fix unit read_geo
* debugging unit test for read_geo
* circleci read_geo unit tests
* circleci read_geo unit tests
* v.0.4.0 re-organized files debugged filters.list_problem_probes: updated the docs to have correct spelling for refs/reasons. these were wrong from the original code I inherited. it now produces an error if you specify criteria that aren’t recognized also added a function that lets you see more detail on the probes and reasons/pubs criteria added unit tests for this.
* genome studio QC functions, improved .load function (but not consolidated through methyl-suite yet) function .assign() for manually categorizing samples unit testing on the predict.sex function get_sex() prediction
* improved QC control probe plots; fixed bug in detecting duplicates and run_pipeline
* debugging circleci
* debugging circleci
* debugging circleci
* debugging circleci
* debugging circleci
* debugging circleci
* debugging circleci
* debugging circleci
* debugging circleci
* debugging circleci
* v0.4.0 consolidated data loading for functions and uses fastest option
* support for and unit tests on .load from batched samples
* improved docs homepage
* readthedocs improved autodocs
* Add files via upload
* Update README.md
* Add files via upload
* better docs and smarter load_both()
* added sesame and poobah option to .load
* updated beta_mds_plot() to handle NaN probes and impute from adj probes
* working on a QC report, and add return_fig option to most functions
* updated and improved ReportPDF with tables and appendix formatting
* v0.5 adds kwargs to functions for silent processing returning figure objects, and a report_pdf class that can run QC and generate a PDF report. __version__ too
* debugging pipenv; breaking circleci
* removed local virtual env and updated pipfile lock
* fixed setup bug
FoxoTech · Apr 23, 2020 · 819b00f
1 parent 71b4515
commit 819b00f
Showing 20 changed files with 2,637 additions and 334 deletions.
diff --git a/Pipfile b/Pipfile
@@ -21,12 +21,15 @@ nbsphinx = "*"
 xlrd = ">=1.0.0"
 
 [packages]
-tqdm = "*"
 numpy = "==1.16.3"
 pandas = "==0.24.2"
 statsmodels = "==0.9.0"
 scipy = "==1.2.1"
-methylcheck = {editable = true,extras = ["socks"], path = "."}
+seaborn = "*"
+matplotlib = "*"
+tqdm = "*"
+scikit-learn = "*"
+# methylcheck = {editable = true,extras = ["dev"],path = "."}
 
 [requires]
 python_version = "3.7"
diff --git a/Pipfile.lock b/Pipfile.lock
diff --git a/README.md b/README.md
@@ -1,13 +1,17 @@
 methylcheck is a Python-based package for filtering and visualizing Illumina methylation array data. The focus is on quality control.
 
 [![Readthedocs](https://readthedocs.com/projects/life-epigenetics-methylcheck/badge/?version=latest)](https://life-epigenetics-methylcheck.readthedocs-hosted.com/en/latest/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![CircleCI](https://circleci.com/gh/LifeEGX/methylcheck.svg?style=shield&circle-token=58a514d3924fcfe0287c109d2323b7f697956ec9)](https://circleci.com/gh/LifeEGX/methylcheck) [![Build status](https://ci.appveyor.com/api/projects/status/j15lpvjg1q9u2y17?svg=true)](https://ci.appveyor.com/project/life_epigenetics/methQC) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/aedf5c223e39415180ff35153b2bad89)](https://www.codacy.com?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=LifeEGX/methylcheck&amp;utm_campaign=Badge_Grade)
-[![Coverage Status](https://coveralls.io/repos/github/LifeEGX/methylcheck/badge.svg?t=OVL45Q)](https://coveralls.io/github/LifeEGX/methylcheck)
+[![Coverage Status](https://coveralls.io/repos/github/LifeEGX/methylcheck/badge.svg?t=OVL45Q)](https://coveralls.io/github/LifeEGX/methylcheck) ![PyPI-Downloads](https://img.shields.io/pypi/dm/methylcheck.svg?label=pypi%20downloads&logo=PyPI&logoColor=white)
+
+![methylprep snapshots](https://raw.githubusercontent.com/LifeEGX/methylcheck/master/docs/methylcheck_overview.png "methylcheck snapshots")
 
 ![methylprep snapshots](https://raw.githubusercontent.com/LifeEGX/methylcheck/master/docs/methylcheck_overview.png "methylcheck snapshots")
 
 ## methylcheck Package
 
-This package contains high-level APIs for filtering processed data from local files. 'High-level' means that the details are abstracted away, and functions are designed to work with a minimum of knowledge and specification required. But you can always override the "smart" defaults with custom settings if things don't work. Before starting you must first download processed data from the NIH GEO database or process a set of `idat` files with `methylprep`. Refer to methylprep for instructions on this step.
+This package contains high-level APIs for filtering processed data from local files. 'High-level' means that the details are abstracted away, and functions are designed to work with a minimum of knowledge and specification required. But you can always override the "smart" defaults with custom settings if things don't work. Before starting you must first download processed data from the NIH GEO database or process a set of `idat` files with `methylprep`. Refer to [methylprep](https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/index.html) for instructions on this step.
+
+![methylprep functions](https://raw.githubusercontent.com/LifeEGX/methylcheck/dev/docs/methylcheck_functions.png)
 
 ## Installation
 
@@ -24,17 +28,19 @@ Load your data in a Jupyter Notebook like this:
 mydata = pandas.read_pickle('beta_values.pkl')
 ```
 
-If you processed a large batch of samples using the `batch_size` option in `methylprep process`, there's a convenience function in `methylize` (methylize.load) that will load and combine a bunch of output files in the same folder:
+If you processed a large batch of samples using the `batch_size` option in `methylprep process`, there's a convenience function in `methylcheck` (methylcheck.load) that will load and combine a bunch of output files in the same folder:
 
 ```python
-import methylize
+import methylcheck
-df = methylize.load('<path to folder with methylprep output>')
+df = methylcheck.load('<path to folder with methylprep output>')
 # or
-df,meta = methylize.load_both('<path to folder with methylprep output>')
+df,meta = methylcheck.load_both('<path to folder with methylprep output>')
 ```
 
 This conveniently loads a dataframe of all meta data associated with the samples, if you are using public GEO data. Some analysis functions require specifying which samples are part of a treatment group (vs control) and the `meta` dataframe object can be used for this.
 
+For more, check out our [examples of loading data into `methylcheck`](https://life-epigenetics-methylcheck.readthedocs-hosted.com/en/latest/docs/demo_qc_functions.html)
+
 ### GEO
 
 Alternatively, you can import public GEO datasets directly, if they are processed data containing either probe `beta` values for samples or methylated/unmethylated signal intensities. If you have `idat` files, process them first with `methylprep`, or use the `methylprep download -i <GEO_ID>` option to download and process public data.
@@ -44,10 +50,11 @@ In general, the best way to import data is to use `methylprep` and run
 run_pipeline(data_folder, betas=True)
 
 # or from the command line:
-`python -m methylprep process -d <filepath to idats> --all`
+python -m methylprep process -d <filepath to idats> --all
 ```
 
-collect the beta_values.pkl file it returns/saves to disk, and load that in a Jupyter notebook. From there, each data transformation is a single line of code using Pandas DataFrames. `methylcheck` will keep track of the data format/structures for you, and you can visualize the effect of each filter as you go. You can also export images of your charts for publication.
+collect the `beta_values.pkl` file it returns/saves to disk, and load that in a Jupyter notebook. From there, each data transformation is a single line of code using Pandas DataFrames. `methylcheck` will keep track of the data format/structures for you, and you can visualize the effect of each filter as you go. You can also export images of your charts for publication.
+
 
 Refer to the Jupyter notebooks on readthedocs for examples of filtering probes from a batch of samples, removing outlier samples, and generating plots of data.
 

diff --git a/conf.py b/conf.py
@@ -42,7 +42,8 @@
     'sphinx.ext.autodoc',
     'sphinxcontrib.apidoc',
     'm2r',
-    'nbsphinx'
+    'nbsphinx',
+    'sphinx.ext.autosummary'
 ]
 
 # instead of CLI "sphinx-autodoc . _build/html" you write this

diff --git a/docs/methylcheck_functions.png b/docs/methylcheck_functions.png
diff --git a/docs/source/methylcheck.rst b/docs/source/methylcheck.rst
@@ -1,26 +1,70 @@
 methylcheck package
 ===================
 
-methylcheck.cli module
-----------------------
+.. autosummary::
+   methylcheck.cli
+   methylcheck.run_pipeline
+   methylcheck.run_qc
 
-.. automodule:: methylcheck.cli
+   methylcheck.read_geo
+   methylcheck.load
+   methylcheck.load_both
+
+   methylcheck.qc_signal_intensity
+   methylcheck.plot_M_vs_U
+   methylcheck.plot_controls
+   methylcheck.plot_beta_by_type
+
+   methylcheck.probes
+   methylcheck.list_problem_probes
+   methylcheck.exclude_probes
+   methylcheck.exclude_sex_control_probes
+   methylcheck.drop_nan_probes
+
+   methylcheck.samples
+   methylcheck.sample_plot
+   methylcheck.beta_density_plot
+   methylcheck.mean_beta_plot
+   methylcheck.mean_beta_compare
+   methylcheck.beta_mds_plot
+   methylcheck.combine_mds
+   methylcheck.cumulative_sum_beta_distribution
+
+   methylcheck.predict
+   methylcheck.get_sex
+   methylcheck.assign
+
+methylcheck pipeline functions
+-----------------------------------------------
+
+.. automodule:: methylcheck.qc_report
+  :members:
+  :undoc-members:
+  :show-inheritance:
+
+
+methylcheck.probes module
+-------------------------
+
+.. automodule:: methylcheck.probes
    :members:
    :undoc-members:
    :show-inheritance:
 
-methylcheck.filters module
+
+methylcheck.samples module
 --------------------------
 
-.. automodule:: methylcheck.filters
-   :members:
-   :undoc-members:
-   :show-inheritance:
+.. automodule:: methylcheck.samples
+  :members:
+  :undoc-members:
+  :show-inheritance:
 
-methylcheck.postprocessQC module
---------------------------------
 
-.. automodule:: methylcheck.postprocessQC
-   :members:
-   :undoc-members:
-   :show-inheritance:
+methylcheck.predict module
+--------------------------
+
+.. automodule:: methylcheck.predict
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/index.rst b/index.rst
@@ -10,14 +10,14 @@ methylcheck documentation
    :name: Help Files
 
    docs/demo-methylprep-to-methylcheck-example.ipynb
+   functions <docs/source/methylcheck>
    docs/demo_read_geo_processed.ipynb
    docs/filtering_probes.ipynb
    docs/methylprep_methylcheck_example.ipynb
    docs/demo_qc_functions.ipynb
    docs/another-methylcheck-qc-example.ipynb
    docs/demo_using_matched_meta_data.ipynb
    methylprep package <https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/>
-   docs/source/methylcheck
    methylize (analysis) package <https://life-epigenetics-methylize.readthedocs-hosted.com/en/latest/>
 
 

diff --git a/methylcheck/__init__.py b/methylcheck/__init__.py
@@ -27,18 +27,9 @@
     )
 
 from .qc_report import run_pipeline
-
-'''
-try:
-    import methylprep
-    load = methylprep.load
-    load_both = methylprep.load_both
-    del methylprep
-except ImportError as error:
-    pass # these functions are not available otherwise.
-'''
 from .load_processed import load, load_both, container_to_pkl
 from .read_geo_processed import read_geo
+from .version import __version__
 
 getLogger(__name__).addHandler(NullHandler())
 

diff --git a/methylcheck/cli.py b/methylcheck/cli.py
@@ -17,22 +17,52 @@ def error(self, message):
         self.exit(status=2)
 
 
-def detect_array(df):
-    """Determines array type using number of probes columns in df. Returns array string."""
+def detect_array(df, returns='name'):
+    """Determines array type using number of probes columns in df. Returns array string.
+    Note: this is different from methylprep.models.arrays.ArrayType.from_probe_count, which looks at idat files.
+
+    returns (name | filepath)
+        default is 'name' -- returns a string
+        if 'filepath', returns a tuple of (ArrayType, filepath to the array manifest file), using ArrayType and
+        methylprep.files.manifests ARRAY_TYPE_MANIFEST_FILENAMES."""
+
+    if returns == 'filepath':
+        # get manifest data from .methylprep_manifest_files
+        try:
+            from methylprep.files.manifests import MANIFEST_DIR_PATH, ARRAY_TYPE_MANIFEST_FILENAMES, Manifest
+            from methylprep.models.arrays import ArrayType
+        except ImportError:
+            raise ImportError("this function requires `methylprep` be installed (to read manifest array files).")
+
+        def get_filename(array_name):
+            ARRAY_FILENAME = {
+                '27k': 'hm27.hg19.manifest.csv.gz',
+                '450k': 'HumanMethylation450_15017482_v1-2.CoreColumns.csv.gz',
+                'epic': 'MethylationEPIC_v-1-0_B4.CoreColumns.csv.gz',
+                'epic+': 'CombinedManifestEPIC.manifest.CoreColumns.csv.gz',
+                'mouse': 'LEGX_B1_manifest_mouse_v1_min.csv.gz',
+            }
+            man_path = Path(MANIFEST_DIR_PATH).expanduser()
+            man_filename = ARRAY_FILENAME[array_name]
+            man_filepath = Path(man_path, man_filename)
+            return man_filepath
+
     # shape: should be wide, with more columns than rows. The larger dimension is the probe count.
     if df.shape[0] > df.shape[1]:
         # WARNING: this will need to be transposed later.
         col_count = (df.shape[0]) #does the index count in shape? assuming it doesn't.
     else:
         col_count = (df.shape[1])
     if 26000 <= col_count <= 28000:
-        return '27k'
-    if 440000 <= col_count <= 490000: #485512
-        return '450k'
-    if 869001 <= col_count <= 869335: # 52650 <= col_count <= 53000:
-        return 'EPIC+'
-    if 860000 <= col_count <= 869000: #1050000 <= col_count <= 1053000: actual: 865860
-        return 'EPIC'
+        return '27k' if returns == 'name' else (ArrayType('27k'), get_filename('27k'))
+    elif 440000 <= col_count <= 490000: #485512
+        return '450k' if returns == 'name' else (ArrayType('450k'), get_filename('450k'))
+    elif 869001 <= col_count <= 869335: # 52650 <= col_count <= 53000:
+        return 'EPIC+' if returns == 'name' else (ArrayType('epic+'), get_filename('epic+'))
+    elif 860000 <= col_count <= 869000: #1050000 <= col_count <= 1053000: actual: 865860
+        return 'EPIC' if returns == 'name' else (ArrayType('epic'), get_filename('epic'))
+    elif 250000 <= col_count <= 270000: #actual count: 262812
+        return 'mouse' if returns == 'name' else (ArrayType('mouse'), get_filename('mouse'))
     else:
         raise ValueError('Unsupported Illumina array type. Your data file contains {0} rows for probes.'.format(col_count))
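
The lookup that the `detect_array()` hunk above performs can be sketched without pandas or methylprep. This is a minimal sketch, not the package's implementation: `classify_shape` and `manifest_path` are hypothetical helper names, the probe-count ranges and manifest filenames are copied from the diff, and the default manifest directory `~/.methylprep_manifest_files` is an assumption based on the code comment (the real value comes from methylprep's `MANIFEST_DIR_PATH`).

```python
from pathlib import Path

# Probe-count ranges copied from the detect_array() hunk, in the same order.
PROBE_COUNT_RANGES = [
    ((26000, 28000), '27k'),
    ((440000, 490000), '450k'),
    ((869001, 869335), 'EPIC+'),
    ((860000, 869000), 'EPIC'),
    ((250000, 270000), 'mouse'),
]

# Manifest filenames copied from the get_filename() helper in the hunk.
MANIFEST_FILENAMES = {
    '27k': 'hm27.hg19.manifest.csv.gz',
    '450k': 'HumanMethylation450_15017482_v1-2.CoreColumns.csv.gz',
    'epic': 'MethylationEPIC_v-1-0_B4.CoreColumns.csv.gz',
    'epic+': 'CombinedManifestEPIC.manifest.CoreColumns.csv.gz',
    'mouse': 'LEGX_B1_manifest_mouse_v1_min.csv.gz',
}


def classify_shape(shape):
    """Hypothetical helper: map a (rows, columns) shape to an array name.

    The larger dimension is treated as the probe count, so the result
    does not depend on whether probes are in rows or columns.
    """
    probe_count = max(shape)
    for (low, high), name in PROBE_COUNT_RANGES:
        if low <= probe_count <= high:
            return name
    raise ValueError(
        'Unsupported Illumina array type: {0} probes.'.format(probe_count))


def manifest_path(array_name, manifest_dir='~/.methylprep_manifest_files'):
    """Hypothetical helper: build the manifest path for an array name.

    manifest_dir is an assumed default, not read from methylprep.
    """
    filename = MANIFEST_FILENAMES[array_name.lower()]
    return Path(manifest_dir).expanduser() / filename
```

Because the larger dimension is always taken as the probe count, `classify_shape((865860, 12))` and `classify_shape((12, 865860))` both fall in the EPIC range, which is why the diff's comment about transposing later does not affect detection.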