# Retrieving HG38 epigenomic files
This notebook shows how the metadata of the epigenomic files is retrieved.

In [1]:
from glob import glob
import pandas as pd
import compress_json
from encodeproject import biosamples, accessions, biosample, download_urls

We restrict the query to files that are aligned to the [GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/) assembly, currently have status [released](https://www.encodeproject.org/help/getting-started/status-terms/#FileStatuses), have replication type [isogenic](https://www.encodeproject.org/data-standards/terms/) (there are biological replicates), and are stored in the [bigWig](https://genome.ucsc.edu/goldenPath/help/bigWig.html#:~:text=The%20bigWig%20format%20is%20useful,in%20an%20indexed%20binary%20format.&text=Wiggle%20data%20must%20be%20continuous%20and%20consist%20of%20equally%20sized%20elements.) file format.

In [2]:
parameters = dict(
    assembly="GRCh38",
    replication_type="isogenic",
    file_format="bigWig",
    status="released",
    use_multiprocessing=False
)
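
The shared filters are later passed to each query with `**parameters`, next to the assay-specific arguments. The sketch below illustrates only that keyword-unpacking pattern: `describe_query` is a hypothetical stand-in that returns the filters it receives, not a function of `encodeproject`.

```python
# Shared filters, copied from the cell above.
parameters = dict(
    assembly="GRCh38",
    replication_type="isogenic",
    file_format="bigWig",
    status="released",
    use_multiprocessing=False,
)


def describe_query(**filters):
    """Hypothetical stand-in for a query function: returns the filters it received."""
    return dict(sorted(filters.items()))


# Assay-specific filters are passed alongside the shared ones.
query = describe_query(
    min_biological_replicates=2,
    output_type="fold change over control",
    **parameters,
)

for name, value in query.items():
    print(f"{name} = {value!r}")
```

Python raises a `TypeError` if a key in `parameters` is also passed explicitly, so a shared filter cannot be silently overridden by one of the assay-specific calls.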

We append each dataset to the following list as we obtain it.

In [3]:
all_datasets = []

### Retrieving ChIP-seq

In [None]:
samples = biosamples(
    accessions=accessions(compress_json.load("hg38_encode_queries/chipseq.json")),
    min_biological_replicates=2,
    output_type="fold change over control",
    **parameters
)
all_datasets.append(samples)
samples

### Retrieving DNase-seq

In [None]:
samples = biosamples(
    accessions=accessions(compress_json.load("hg38_encode_queries/dnaseseq.json")),
    organism=None,
    **parameters
)
samples["organism"] = "human"

all_datasets.append(samples)
samples

### Retrieving WGBS

In [None]:
samples = biosamples(
    accessions=accessions(compress_json.load("hg38_encode_queries/wgbs.json")),
    organism=None,
    **parameters,
)
# The file version was checked manually to be 3; it is not available in the metadata.
samples["encode_version"] = 3
samples["organism"] = "human"
all_datasets.append(samples)
samples

### Retrieving ATAC-seq

In [None]:
samples = biosamples(
    accessions=accessions(compress_json.load("hg38_encode_queries/atacseq.json")),
    organism=None,
    min_biological_replicates=2,
    output_type="fold change over control",
    **parameters
)
samples["organism"] = "human"

all_datasets.append(samples)
samples

## Combining all datasets

In [None]:
combined = pd.concat(all_datasets)
combined
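
Two properties of `pd.concat` are worth keeping in mind when stacking per-assay tables. The toy frames below are invented for illustration and do not reflect the real columns returned by `biosamples`. By default, the original row index of each frame is kept, and a column missing from one frame is filled with `NaN`.

```python
import pandas as pd

# Toy stand-ins for two per-assay tables (all values are made up).
chipseq_toy = pd.DataFrame({
    "accession": ["A1", "A2"],
    "target": ["CTCF", "H3K27ac"],
    "organism": ["human", "human"],
})
dnase_toy = pd.DataFrame({
    "accession": ["B1"],
    "organism": ["human"],
})

# Default behaviour: each frame keeps its own index, so labels can repeat.
combined_toy = pd.concat([chipseq_toy, dnase_toy])
print(combined_toy.index.tolist())

# The DNase-like frame has no "target" column, so its rows get NaN there.
print(combined_toy["target"].isna().tolist())

# ignore_index=True builds a fresh 0..n-1 index instead.
reindexed = pd.concat([chipseq_toy, dnase_toy], ignore_index=True)
print(reindexed.index.tolist())
```

Repeated index labels are harmless for the grouping step that follows, since it ends with `reset_index()`, but they matter if rows are selected by label before that.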

### Keeping only the latest ENCODE version of each file

In [None]:
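# Cast the replicates to strings so that the column can be used as a grouping key.
combined["string_biological_replicates"] = combined["biological_replicates"].astype(str)
# Sort by version so that last() picks the entry with the highest encode_version per group.
filtered_combined = combined.sort_values("encode_version").groupby([
    "target",
    "cell_line",
    "assay_title",
    "institute_name",
    "string_biological_replicates"
]).last().reset_index()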

filtered_combined.to_csv("epigenomic_dataset/epigenomes_metadata/hg38.csv", index=False)
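
The sort-then-`groupby(...).last()` pattern can be checked on a small made-up table. The sketch below uses invented values and only a subset of the grouping keys. It assumes that `biological_replicates` holds lists, which would explain the string conversion, since lists are unhashable and cannot serve as group keys. It also shows that `groupby` drops rows whose key is missing unless `dropna=False` is passed. Whether that affects the real data depends on how `target` is filled for assays without a target, which is not shown here.

```python
import pandas as pd

# Toy metadata (all values invented): two versions of one file, plus one
# row without a target.
toy = pd.DataFrame({
    "target": ["CTCF", "CTCF", None],
    "cell_line": ["K562", "K562", "HepG2"],
    "biological_replicates": [[1, 2], [1, 2], [1]],
    "encode_version": [3, 4, 4],
    "accession": ["old", "new", "no_target"],
})

# Lists become strings such as "[1, 2]", which are valid group keys.
toy["string_biological_replicates"] = toy["biological_replicates"].astype(str)
keys = ["target", "cell_line", "string_biological_replicates"]

# After sorting by version, last() keeps the highest version in each group.
latest = toy.sort_values("encode_version").groupby(keys).last().reset_index()
print(latest[["target", "cell_line", "encode_version", "accession"]])

# With the default dropna=True the row with a missing target is gone;
# dropna=False keeps it as its own group.
latest_keep_na = (
    toy.sort_values("encode_version")
    .groupby(keys, dropna=False)
    .last()
    .reset_index()
)
print(len(latest), len(latest_keep_na))
```

Note that `last()` returns the last non-null value of each column within a group, so a result row can mix values from different source rows when the most recent entry has missing fields.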

In [None]:
filtered_combined