# Download Data

Download the data used by the analysis notebooks.

## Download Cool-Seq-Tool Data

The following cells download the Cool-Seq-Tool data files used throughout the analysis.

In [1]:
import requests
from pathlib import Path

In [2]:
MANUSCRIPT_S3_URL = "https://nch-igm-wagner-lab-public.s3.us-east-2.amazonaws.com/variation-normalizer-manuscript"

In [3]:
def download_s3(url: str, outfile_path: Path) -> None:
    """Download an object from a public S3 bucket.

    :param url: URL for file in S3 bucket
    :param outfile_path: Path where file should be saved
    """
    # Stream the response so large files are never held fully in memory;
    # the timeout keeps a stalled connection from hanging the notebook.
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(outfile_path, "wb") as h:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    h.write(chunk)

In [4]:
path = "data"
Path(path).mkdir(exist_ok=True)

for fn in [
    "LRG_RefSeqGene_20231114",
    "MANE.GRCh38.v1.3.summary.txt",
    "transcript_mapping.tsv",
]:
    url = f"{MANUSCRIPT_S3_URL}/cool-seq-tool/{fn}"
    outfile_path = Path(f"{path}/{fn}")
    download_s3(url, outfile_path)
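
Re-running the cell above fetches every file again, even when it is already on disk. The sketch below shows one way to skip files that are already present. `download_if_missing` is a hypothetical helper, not part of the original notebook, and it takes the download function as an argument so that it can wrap `download_s3` without modifying it. It assumes that a non-empty file on disk is a complete download, which may not hold if an earlier run was interrupted.

```python
from pathlib import Path
from typing import Callable


def download_if_missing(
    url: str,
    outfile_path: Path,
    downloader: Callable[[str, Path], None],
) -> bool:
    """Call ``downloader`` only when ``outfile_path`` is absent or empty.

    Hypothetical helper (an assumption, not part of the original notebook).

    :param url: URL passed through to ``downloader``
    :param outfile_path: Path where the file should be saved
    :param downloader: Function with the same signature as ``download_s3``
    :return: True if a download was performed, False if it was skipped
    """
    # Assumption: any non-empty file is treated as a complete download.
    if outfile_path.is_file() and outfile_path.stat().st_size > 0:
        return False
    # Create the destination directory (and any parents) if needed.
    outfile_path.parent.mkdir(parents=True, exist_ok=True)
    downloader(url, outfile_path)
    return True
```

In the loop above, this would be called as `download_if_missing(url, outfile_path, download_s3)` in place of the direct `download_s3` call.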

## Download data needed for CNV analyses

This section downloads the underlying data for the analyses of CNVs in ClinVar and of how a real-world data set of microarray CNVs matches those variants. It also downloads the intermediate output files generated while running these CNV analyses, so that you can skip the long-running computations in the matching analysis.

In [5]:
path = "cnvs/cnv_data"
Path(path).mkdir(parents=True, exist_ok=True)

for fn in [
    "ClinVar-CNVs-normalized.csv",
    "MANE.GRCh38.v1.1.ensembl_genomic.gff.gz",
    "match-scoring-results.csv.gz",
    "NCH-microarray-CNVs-cleaned.csv",
    "NCH-microarray-CNVs.csv",
    "NCH-normalizer-results.json",
]:
    url = f"{MANUSCRIPT_S3_URL}/cnv_data/{fn}"
    outfile_path = Path(f"{path}/{fn}")
    download_s3(url, outfile_path)
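
Since the downstream CNV notebooks read these files, it may be worth confirming that all of them arrived before moving on. The sketch below is one possible check. `find_missing_files` is a hypothetical helper, not part of the original notebook, and it treats a zero-byte file as missing, which is an assumption about what a failed download looks like.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_files(directory: Path, filenames: Iterable[str]) -> List[str]:
    """Return the names in ``filenames`` that are absent or empty in ``directory``.

    Hypothetical helper (an assumption, not part of the original notebook).
    """
    missing = []
    for fn in filenames:
        candidate = directory / fn
        # Assumption: a zero-byte file indicates a failed or partial download.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(fn)
    return missing
```

For example, `find_missing_files(Path("cnvs/cnv_data"), ["NCH-microarray-CNVs.csv", "NCH-normalizer-results.json"])` returns an empty list when both files are present and non-empty.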

## ClinVar normalized variant data

If you have not already run the notebooks in the ClinVar analysis directory, you will need to download the normalized ClinVar variants. You can do this by running `clinvar_variation_analysis.ipynb` in the `clinvar` directory, or by running the following cell:

In [None]:
path = "clinvar"
Path(path).mkdir(exist_ok=True)

url = f"{MANUSCRIPT_S3_URL}/output-variation_identity-vrs-1.3.ndjson.gz"
outfile_path = Path(f"{path}/output-variation_identity-vrs-1.3.ndjson.gz")
download_s3(url, outfile_path)
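
The downloaded file is gzip-compressed NDJSON, so it can be read one record at a time without decompressing it to disk. The sketch below is a minimal reader that assumes one JSON value per line and UTF-8 encoding. `iter_ndjson_gz` is a hypothetical helper, not part of the original notebook, and it makes no assumption about the fields inside each record.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Iterator


def iter_ndjson_gz(path: Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-blank line of a gzipped NDJSON file.

    Hypothetical helper (an assumption, not part of the original notebook).
    """
    # "rt" opens the gzip stream in text mode so lines are decoded as str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Because the function is a generator, something like `next(iter_ndjson_gz(outfile_path))` inspects the first record without loading the rest of the file.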