ncbi_counts

Download NCBI-generated RNA-seq count data by specifying Series accession number(s) and regular expressions that match Sample attributes.

If you just need a count matrix for all samples (GSM) in a series (GSE), you do not need this library. It may be useful when you need a count matrix for each GSE that contains only the samples you specify as the control group and the treatment group.

Installation

From PyPI:

pip install ncbi-counts

Usage

python -m ncbi_counts [-h] [-n NORM] [-a ANNOT_VER] [-k [KEEP_ANNOT ...]] [-s SRC_DIR] [-o OUTPUT] [-q] [-S SEP] [-y GSM_YAML] [-c] FILE

Options

positional arguments:
  FILE                  Path to input file (.yaml, .yml) which represents each GSE accession number(s) which contains a sequence of maps with two keys: 'control' and 'treatment'. Each of these maps further contains key(s) (e.g., 'title', 'characteristics_ch1').

options:
  -h, --help            show this help message and exit
  -n NORM, --norm-type NORM
                        Normalization type of counts (choices: None, fpkm, tpm, default: None)
  -a ANNOT_VER, --annot-ver ANNOT_VER
                        Annotation version of counts (default: GRCh38.p13)
  -k [KEEP_ANNOT ...], --keep-annot [KEEP_ANNOT ...]
                        Annotation column(s) to keep (choices: Symbol, Description, Synonyms, GeneType, EnsemblGeneID, Status, ChrAcc, ChrStart, ChrStop, Orientation, Length, GOFunctionID, GOProcessID, GOComponentID, GOFunction, GOProcess, GOComponent, default: None)
  -s SRC_DIR, --src-dir SRC_DIR
                        A directory to save the source obtained from NCBI (default: ./)
  -o OUTPUT, --output OUTPUT
                        A directory to save the count matrix (or matrices) (default: ./)
  -q, --silent          If True, suppress warnings (default: False)
  -S SEP, --sep SEP     Separator between group and GSM in column (default: -)
  -y GSM_YAML, --yaml GSM_YAML
                        Path to save YAML file which contains GSMs (default: None)
  -c, --cleanup         If True, remove source files (default: False)

Command-line Example

To create a mock vs. CoV2 comparison pair for each tissue from GSE164073, prepare the following YAML file (the words beginning with "!!" are type hints and can be omitted):

Note

The acceptable keys for Sample attributes (such as 'title' and 'characteristics_ch1') are listed in the Sample Attributes table or the SOFT download section of the SOFT submission instructions page. Use a value from the 'Label' column of the table as a key in the YAML file, with the leading '!Sample_' removed.
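As a minimal sketch of the label-to-key conversion described above, the snippet below strips the '!Sample_' prefix from a few SOFT labels. The helper name `label_to_key` is hypothetical and is not part of ncbi_counts.

```python
PREFIX = "!Sample_"

def label_to_key(label):
    """Drop the leading '!Sample_' so the label can be used as a YAML key."""
    if label.startswith(PREFIX):
        return label[len(PREFIX):]
    return label  # already a bare key

# SOFT labels for the attributes used in this README.
labels = ["!Sample_title", "!Sample_characteristics_ch1", "!Sample_geo_accession"]
keys = [label_to_key(label) for label in labels]
print(keys)
```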

If you want a comprehensive list of attributes for all samples in a series, the GEOparse library is useful.

 import GEOparse
 GEOparse.get_GEO("GSExxxxx").phenotype_data

GSE164073: !!seq
- control: !!map
    title: !!str Cornea
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Cornea
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Limbus
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Limbus
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Sclera
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Sclera
    characteristics_ch1: !!str SARS-CoV-2

Alternatively, to specify the GSMs directly, prepare the following YAML file:

GSE164073: !!seq
- control: !!map
    geo_accession: !!str ^GSM4996084$|^GSM4996085$|^GSM4996086$
  treatment: !!map
    geo_accession: !!str ^GSM4996087$|^GSM4996088$|^GSM4996089$
- control: !!map
    geo_accession: !!str ^GSM4996090$|^GSM4996091$|^GSM4996092$
  treatment: !!map
    geo_accession: !!str ^GSM4996093$|^GSM4996094$|^GSM4996095$
- control: !!map
    geo_accession: !!str ^GSM4996096$|^GSM4996097$|^GSM4996098$
  treatment: !!map
    geo_accession: !!str ^GSM4996099$|^GSM4996100$|^GSM4996101$

Then run the following command (the "Symbol" column is kept in this example):

python -m ncbi_counts sample_regex.yaml -k Symbol -c

This produces the following files:

GSE164073-1.tsv

GeneID	Symbol	control-GSM4996084	control-GSM4996085	control-GSM4996086	treatment-GSM4996088	treatment-GSM4996087	treatment-GSM4996089
1	A1BG	144	197	157	156	133	122
2	A2M	254	276	262	178	153	178
3	A2MP1	1	0	2	0	0	0
9	NAT1	97	133	103	83	93	88
...	...	...	...	...	...	...	...

GSE164073-2.tsv

GeneID	Symbol	control-GSM4996092	control-GSM4996091	control-GSM4996090	treatment-GSM4996095	treatment-GSM4996094	treatment-GSM4996093
1	A1BG	175	167	203	143	145	145
2	A2M	261	158	427	215	145	169
3	A2MP1	0	0	0	0	0	2
9	NAT1	122	100	133	90	78	80
...	...	...	...	...	...	...	...

GSE164073-3.tsv

GeneID	Symbol	control-GSM4996098	control-GSM4996097	control-GSM4996096	treatment-GSM4996099	treatment-GSM4996100	treatment-GSM4996101
1	A1BG	158	115	140	136	124	145
2	A2M	3337	2261	2536	1524	1288	1807
3	A2MP1	0	0	0	0	0	0
9	NAT1	83	64	68	65	52	79
...	...	...	...	...	...	...	...
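The column names in these files join the group name and the GSM with the separator set by --sep (default: -). The sketch below shows one way to split the columns back into groups and total the counts per group, using only the standard library and the first two rows of GSE164073-1.tsv shown above. The helper `columns_by_group` is hypothetical and is not part of ncbi_counts.

```python
import csv
import io

# First rows of GSE164073-1.tsv as shown above (tab-separated).
tsv_text = (
    "GeneID\tSymbol\tcontrol-GSM4996084\tcontrol-GSM4996085\tcontrol-GSM4996086\t"
    "treatment-GSM4996088\ttreatment-GSM4996087\ttreatment-GSM4996089\n"
    "1\tA1BG\t144\t197\t157\t156\t133\t122\n"
    "2\tA2M\t254\t276\t262\t178\t153\t178\n"
)

SEP = "-"  # default value of --sep

def columns_by_group(header, sep=SEP):
    """Map each group name to its sample columns; annotation columns are skipped."""
    groups = {}
    for name in header:
        group, found, gsm = name.partition(sep)
        if found and gsm.startswith("GSM"):
            groups.setdefault(group, []).append(name)
    return groups

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
groups = columns_by_group(rows[0].keys())

# Total count per group for each gene symbol.
totals = {
    row["Symbol"]: {g: sum(int(row[c]) for c in cols) for g, cols in groups.items()}
    for row in rows
}
print(totals["A1BG"])  # {'control': 498, 'treatment': 411}
```

For a real file, replace the in-memory text with the opened .tsv file. If pandas is available, `pandas.read_csv(path, sep="\t")` reads the same layout.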

If you don't need the source files downloaded from NCBI, remove them, for example with the -c (--cleanup) option used in the command above.

Example in Python

To get the output as a pandas DataFrame, please refer to the following code:

from ncbi_counts import Series

series = Series(
    "GSE164073",
    [
        {
            "control": {"title": "Cornea", "characteristics_ch1": "mock"},
            "treatment": {"title": "Cornea", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"title": "Limbus", "characteristics_ch1": "mock"},
            "treatment": {"title": "Limbus", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"geo_accession": "^GSM499609[6-8]$"},
            "treatment": {"geo_accession": "^GSM4996099$|^GSM4996100$|^GSM4996101$"},
        },
    ],
    keep_annot=["Symbol"],
    save_to=None,
)
series.generate_pair_matrix()
# series.cleanup()  # remove source files
series.pair_count_list[0]  # Corresponds to GSE164073-1.tsv
series.pair_count_list[1]  # Corresponds to GSE164073-2.tsv
series.pair_count_list[2]  # Corresponds to GSE164073-3.tsv
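
The third pair above selects its control group with the character class ^GSM499609[6-8]$, while the YAML example spells out the same samples as an alternation. The check below, a sketch using Python's re module, suggests that the two styles select the same accessions. It assumes the library applies the patterns with search-style matching, which this README does not state. "GSM49960960" is a made-up accession, included only to show the effect of the ^ and $ anchors.

```python
import re

# Control GSMs of the third pair, taken from the examples above.
control_gsms = ["GSM4996096", "GSM4996097", "GSM4996098"]
# Accessions that must not match; the last one is made up for illustration.
other_gsms = ["GSM4996095", "GSM4996099", "GSM49960960"]

explicit = re.compile(r"^GSM4996096$|^GSM4996097$|^GSM4996098$")
compact = re.compile(r"^GSM499609[6-8]$")

def matches(pattern, accessions):
    """Return the accessions in which the pattern finds a match."""
    return [gsm for gsm in accessions if pattern.search(gsm)]

explicit_hits = matches(explicit, control_gsms + other_gsms)
compact_hits = matches(compact, control_gsms + other_gsms)
print(explicit_hits == compact_hits == control_gsms)  # True
```

Without the anchors, a pattern such as GSM4996096 would also match the longer made-up accession, so anchoring each alternative is the safer habit.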

License

ncbi_counts is released under an MIT license.
