# Standard Operating Procedure (SOP) for Updating ENCODE Data Tracks in FILER
## 1. Introduction
* ENCODE (Encyclopedia of DNA Elements) data tracks in FILER are periodically updated. This SOP outlines the step-by-step process for collecting, processing, and integrating newly released ENCODE data tracks into FILER.
* For consistency, only bigBed files are downloaded from ENCODE; they are then converted to BED format using `bigBedToBed`.
### 1.1 Purpose
The purpose of this SOP is to ensure consistency, accuracy, and efficiency in updating ENCODE data tracks within FILER.
### 1.2 Revision History

- **Version 1.0:** Initial draft version. 2024 Update.

## 2. Data Collection
### 2.1. bigBed Tracks from ENCODE
To begin the update process, select all the bigBed tracks available in the ENCODE experiment matrix, using the selection criteria below:

- Biosample: Organism --> Homo sapiens
- Quality: Status --> released, archived, revoked
- Analysis: Available file types --> All bigBed files
- Analysis: Genome Assembly --> hg19, GRCh38

The following URL is pre-selected to meet the above criteria on the ENCODE data portal:

[ENCODE BigBed Files](https://www.encodeproject.org/search/?type=Experiment&control_type!=*&status=released&perturbed=false&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&status=archived&status=revoked&files.file_type=bigBed+narrowPeak&files.file_type=bigBed+broadPeak&files.file_type=bigBed+bed3%2B&files.file_type=bigBed+bedRnaElements&files.file_type=bigBed+tss_peak&files.file_type=bigBed+bedMethyl&files.file_type=bigBed+bed9%2B&files.file_type=bigBed+idr_peak&files.file_type=bigBed+bed9&files.file_type=bigBed+bedLogR&files.file_type=bigBed+bed12&files.file_type=bigBed+bed6%2B&files.file_type=bigBed+bedExonScore&files.file_type=bigBed+peptideMapping&files.file_type=bigBed+bed3&files.file_type=bigBed+modPepMap&files.file_type=bigBed+pepMap)

(Verify the selection: Ensure that all bigBed files are selected under `'Analysis - Available file types'`)

#### 2.1.1 ENCODE File status

On the ENCODE portal, statuses indicate the different processing phases of a file.

[ENCODE status terminology](https://www.encodeproject.org/help/getting-started/status-terms/#FileStatuses)

- Released: These files have been released to the public.
- Archived: These files are considered superfluous or outdated. There is nothing wrong with them, but they are considered extra to the experiment.
- Revoked: These files were deemed erroneous or significantly below standards after they had been released.

In FILER, we adhere to the same rules as ENCODE. We only include ENCODE files with 'Released' status.

The status information for 'archived' and 'revoked' files is collected for downstream use, specifically for removing these files from FILER.

### 2.2. Download metadata
Using the **ENCODE batch download** option, get the `files.txt` file, which lists the download URLs for all selected files.
The first line of `files.txt` is the URL used to download the metadata file, which holds the experimental metadata and a download link for each file.

The following command downloads the `metadata.tsv` file, which contains metadata describing the assay and the files for all the bigBed tracks:

```bash
head -n 1 files.txt | xargs -n 1 curl -O -J -L
```

[ENCODE Metadata schema](https://www.encodeproject.org/help/batch-download/) 

![ENCODE batch download](/mnt/ebs/jackal/FILER2/FILER2-production/ENCODE/ENCODE_batch_download.png)


### 2.3. Metadata filtering

To identify newly released ENCODE tracks that are not yet added to FILER, compare the list of bigBed tracks using the `File accession` column from `metadata.tsv` with the existing ENCODE tracks in FILER.

This generates a list of all bigBed files released since the last FILER update.
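The comparison above can be sketched in Python as follows. This is a minimal illustration, not the FILER team's actual script: it assumes that the accessions already in FILER are available as a plain-text list with one accession per line (the file name `filer_encode_accessions.txt` in the usage note is hypothetical) and that `metadata.tsv` is tab-delimited with a `File accession` column.

```python
import csv


def read_accessions(path):
    """Read one accession per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def find_new_rows(metadata_path, existing):
    """Return the metadata.tsv rows whose File accession is not yet in FILER."""
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row for row in reader if row["File accession"] not in existing]
```

For example, `find_new_rows("metadata.tsv", read_accessions("filer_encode_accessions.txt"))` would return the rows for tracks released since the last update.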

#### Filtering Criteria

Apply the following filtering criteria to refine the list of tracks to be updated:

- **File Status:** Exclude tracks with `Archived` and `Revoked` file status.
- **Output Type:** Exclude tracks in data output types `FDR cut rate` and `peaks and background as input for IDR`.
- **File Format:** Exclude tracks in `bedMethyl` format.
- **Assay:** Check for new assay types.

These filtering criteria ensure that only relevant tracks are selected for further processing and integration into FILER. 

**Discussion with FILER Team**

Discuss and finalize the filtering criteria during the FILER team meeting.

#### 2024 Update

The following assays are excluded after discussing their use cases in FILER:

- BruChase-seq
- Bru-seq
- BruUV-seq
- polyA minus RNA-seq
- polyA plus RNA-seq
- small RNA-seq
- total RNA-seq
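The filtering criteria and the 2024 assay exclusions above could be applied to the rows of `metadata.tsv` roughly as follows. This is a sketch under stated assumptions: each row is a dictionary keyed by column name, the column names `Output type`, `File format`, and `Assay` match the header exactly, and the status column is called `File status` (check the capitalization against the actual header). The helper names `keep_track` and `unseen_assays` are illustrative only.

```python
EXCLUDED_STATUSES = {"archived", "revoked"}
EXCLUDED_OUTPUT_TYPES = {"FDR cut rate", "peaks and background as input for IDR"}
EXCLUDED_ASSAYS = {
    "BruChase-seq", "Bru-seq", "BruUV-seq", "polyA minus RNA-seq",
    "polyA plus RNA-seq", "small RNA-seq", "total RNA-seq",
}


def keep_track(row, status_col="File status"):
    """Return True if a metadata.tsv row passes every exclusion rule."""
    # status_col is an assumption; pass the real column name if it differs.
    if row.get(status_col, "").strip().lower() in EXCLUDED_STATUSES:
        return False
    if row.get("Output type", "") in EXCLUDED_OUTPUT_TYPES:
        return False
    if "bedmethyl" in row.get("File format", "").lower():
        return False
    return row.get("Assay", "") not in EXCLUDED_ASSAYS


def unseen_assays(rows, known_assays):
    """Return assay names not in the known set, so they can be reviewed manually."""
    return {row.get("Assay", "") for row in rows} - set(known_assays) - {""}
```

Reporting unseen assays separately, instead of dropping them, keeps the decision about new assay types with the FILER team.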

### 2.4. File download

With the final list of bigBed files ready, filter the `files.txt` obtained in section 2.2 by **file accession** to keep only the URLs of the selected bigBed files, and save them to `filtered_bigBed_files.txt`.

To download the bigBed files, use the following command:

```bash
xargs -L 1 curl -O -J -L < filtered_bigBed_files.txt
```

This command downloads every file listed in `filtered_bigBed_files.txt`.
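One way to produce `filtered_bigBed_files.txt` is sketched below. It assumes that each download URL in `files.txt` contains the file accession in the form `/files/<File accession>/` and that file accessions look like `ENCFF` followed by six characters; both are assumptions about the URL layout and should be checked against the actual `files.txt`. The function name `filter_urls` is illustrative.

```python
import re

# Assumed URL layout: .../files/ENCFFxxxxxx/@@download/ENCFFxxxxxx.bigBed
ACCESSION_RE = re.compile(r"/files/(ENCFF[0-9A-Z]{6})/")


def filter_urls(lines, wanted):
    """Keep only the URLs whose file accession is in the wanted set."""
    kept = []
    for line in lines:
        url = line.strip()
        match = ACCESSION_RE.search(url)
        if match and match.group(1) in wanted:
            kept.append(url)
    return kept
```

The first line of `files.txt` (the metadata URL) has no file accession in its path, so it is dropped automatically. The returned URLs can be written one per line to `filtered_bigBed_files.txt`.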

## 3. Metadata Collection

Collect the following metadata from `metadata.tsv` for each track. This information will be used to construct FILER metadata:

- File format
- Output type
- File assembly
- Experiment accession
- Assay
- Biosample term id
- Biosample term name
- Biosample type
- Experiment target
- Biological replicate(s)
- Technical replicate(s)
- File download URL
- Experiment date released
- Project
- File analysis title

This metadata collection process ensures that all necessary information is gathered for each track, facilitating the construction of accurate and comprehensive metadata entries in FILER.
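As an illustration, the columns listed above could be pulled from `metadata.tsv` like this. The sketch assumes that the column names in the file header match the list exactly; raising an error on a missing column is a design choice made here so that a change in the ENCODE schema is noticed early. The names `FILER_SOURCE_COLUMNS` and `collect_columns` are hypothetical.

```python
import csv

FILER_SOURCE_COLUMNS = [
    "File format", "Output type", "File assembly", "Experiment accession",
    "Assay", "Biosample term id", "Biosample term name", "Biosample type",
    "Experiment target", "Biological replicate(s)", "Technical replicate(s)",
    "File download URL", "Experiment date released", "Project",
    "File analysis title",
]


def collect_columns(metadata_path, columns=FILER_SOURCE_COLUMNS):
    """Return one dict per track, restricted to the requested columns."""
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = [c for c in columns if c not in (reader.fieldnames or [])]
        if missing:
            # A missing column usually means the ENCODE schema has changed.
            raise KeyError(f"Columns missing from metadata file: {missing}")
        return [{c: row[c] for c in columns} for row in reader]
```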


### 3.1 ENCODE metadata schema (2024)

## 4. JSON File Download

These JSON files contain additional metadata that are not provided in `metadata.tsv`. They will be used to enrich the metadata collection for a comprehensive representation of each ENCODE track.

### 4.1. Downloading JSON Files for BigBed Tracks

For each bigBed track, download the corresponding JSON file by providing the `File accession` list as input to the following command:

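```bash
# The output directory must exist before the redirects can write into it.
mkdir -p track
cat bigBed.track.list | parallel --jobs 12 'curl -L -H "Accept: application/json" https://www.encodeproject.org/files/{}/ > track/{}.json'
```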

This command uses parallel processing to efficiently download JSON files for multiple tracks simultaneously. Each JSON file contains metadata and additional information about the corresponding bigBed track.

### 4.2. Downloading JSON Files for Experiments

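Additionally, download JSON files for experiments by providing the `Experiment accession` list (from `metadata.tsv`) as input to the following command. Experiment records are served under the `/experiments/` path on the ENCODE portal:

```bash
# The output directory must exist before the redirects can write into it.
mkdir -p experiment
cat experiment.list | parallel --jobs 12 'curl -L -H "Accept: application/json" https://www.encodeproject.org/experiments/{}/ > experiment/{}.json'
```

These files contain experiment-level metadata.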

## 5. Additional Metadata Collection

### 5.1. Metadata Collection from Individual Tracks JSON

Extract the following metadata records from bigBed track JSON files:

- `.date_created`: File release date.
- `.submitted_file_name`: Original file name.

Use the script below to collect the metadata:

```python
#!/usr/bin/env python3

"""
Extract ENCODE bigBed track metadata from pre-downloaded JSON files.

The date written as Release_date is the release date of the individual
track (bigBed file); it fills the Release date column in FILER metadata.
"""

import json
import os
import sys


def extract_info(file):
    """Return (file accession, submitted file base name, date) for one track JSON."""
    with open(file, 'r') as f:
        data = json.load(f)
    submitted_file_name = data.get("submitted_file_name", "")
    name = os.path.basename(submitted_file_name)
    # Keep only the date part (YYYY-MM-DD) of the ISO timestamp.
    date = data.get("date_created", "").split("T")[0]
    # Each JSON file is named <File accession>.json, so the stem is the accession.
    return os.path.splitext(os.path.basename(file))[0], name, date


def main(directory):
    output_file = "bigBed_track_metadata.txt"
    with open(output_file, 'w') as out_f:
        # Sort the listing so the output order is reproducible.
        for file in sorted(os.listdir(directory)):
            if file.endswith('.json'):
                file_path = os.path.join(directory, file)
                file_base, name, date = extract_info(file_path)
                out_f.write(f"{file_base}\tSubmitted_track_name={name}\tRelease_date={date}\n")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python script.py <directory>")
        sys.exit(1)

    directory = sys.argv[1]
    main(directory)
```

### 5.2. Metadata Collection from Experiment JSON

Extract the following metadata records from experiment JSON files:

- `.biosample_summary`: Biosample summary.
- `.biosample_ontology`: System category.
- `.lab`: Lab.
- `.life_stage_age`: Life stage.

Use the script below to collect the metadata:

```python
#!/usr/bin/env python3

"""
Extract ENCODE experiment metadata from pre-downloaded JSON files.

Use the 'Experiment accession' in the metadata to map experiments to
individual tracks.
"""

import json
import os
import sys


def extract_info(file):
    """Return (lab, biosample summary, system slims, life stage ages) for one experiment JSON."""
    with open(file, 'r') as f:
        data = json.load(f)
    lab = data.get("lab", {}).get("title", "")
    bio = data.get("biosample_summary", "")
    system = ",".join(data.get("biosample_ontology", {}).get("system_slims", []))
    # Collect life_stage_age from the listed controls, skipping empty values
    # so that the joined string never consists of bare commas.
    ages = [control.get("life_stage_age", "") for control in data.get("possible_controls", [])]
    life_stage_ages = ",".join(age for age in ages if age)
    return lab, bio, system, life_stage_ages


def main(directory):
    output_file = "experiment_metadata.txt"
    with open(output_file, 'w') as out_f:
        # Sort the listing so the output order is reproducible.
        for file in sorted(os.listdir(directory)):
            if file.endswith('.json'):
                file_path = os.path.join(directory, file)
                lab, bio, system, life_stage_ages = extract_info(file_path)

                # The JSON file is named <Experiment accession>.json.
                file_base = os.path.splitext(file)[0]

                # Build the key-value pairs for the second column.
                key_value_pairs = ""
                if bio:
                    key_value_pairs += f"Biosample_summary={bio};"
                if lab:
                    key_value_pairs += f"Lab={lab};"
                if system:
                    key_value_pairs += f"System={system};"

                # Write a row only if at least one value is non-empty.
                if key_value_pairs or life_stage_ages:
                    out_f.write(f"{file_base}\t{key_value_pairs}\t{life_stage_ages}\n")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python script.py <directory>")
        sys.exit(1)

    directory = sys.argv[1]
    main(directory)
```

### 5.3. Organizing Collected Metadata for FILER

Collected metadata are organized and stored in the following columns in FILER metadata:

- **Biosample summary**: Extracted from experiment JSON files.
- **Lab**: Extracted from experiment JSON files.
- **System category**: Extracted from experiment JSON files.
- **Original file name**: Extracted from individual track JSON files.
- **Experiment date released**: Retrieved from ENCODE metadata.
- **File analysis title**: Retrieved from ENCODE metadata.

These metadata categories are stored as key-value pairs under the `Track Description` column in FILER metadata.

Additionally, the following metadata categories are stored in specific columns:

- **Life stage**: Stored under the `Life stage` column in FILER metadata.
- **File release date**: Stored under the `Release date` column in FILER metadata.
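The key-value layout of the `Track Description` column could be assembled as in the sketch below. It follows the `key=value;` pattern used by the experiment script in section 5.2; the key names used for the remaining fields (for example `Original_file_name`) are hypothetical and should be replaced by the keys FILER actually uses.

```python
def build_track_description(fields):
    """Join the non-empty fields as 'key=value;' pairs, in insertion order."""
    return "".join(f"{key}={value};" for key, value in fields.items() if value)


# Illustrative values only; empty fields are skipped.
example = build_track_description({
    "Biosample_summary": "Homo sapiens K562",
    "Lab": "",
    "System": "immune system",
    "Original_file_name": "example.bigBed",  # hypothetical key name
})
print(example)
```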

## 6. Removing Deprecated Files from FILER

To ensure that FILER contains only active files, the `File status` of each existing ENCODE track needs to be verified.

### 6.1. Metadata Filtering

Using the `metadata.tsv` file downloaded in section 2.2, drop the new files so that metadata remains only for files that already exist in FILER. Then check the `File status` column and select the archived and revoked tracks.

The files in the resulting list must be removed from FILER.
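The selection described above might look like the following sketch. It assumes that the accessions currently in FILER are available as a Python set and that the status column in `metadata.tsv` is named `File status` (check the capitalization against the actual header). The function name `deprecated_in_filer` is illustrative.

```python
import csv

DEPRECATED_STATUSES = {"archived", "revoked"}


def deprecated_in_filer(metadata_path, filer_accessions, status_col="File status"):
    """Map each FILER accession that is now archived or revoked to its status."""
    deprecated = {}
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            accession = row["File accession"]
            status = row[status_col].strip().lower()
            # Only files already in FILER are candidates for removal.
            if accession in filer_accessions and status in DEPRECATED_STATUSES:
                deprecated[accession] = status
    return deprecated
```

Keeping the status next to each accession makes it easier to record why a track was removed.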

## 7. Data Processing

The bigBed files downloaded in section 2.4 need to be processed into a BED-like (BED3+) format.


### 7.1. Processing steps:

1. Convert bigBed to BED format using the `bigBedToBed` tool. Discard files that fail to convert.
2. Add headers to each track based on their file format. Some ENCODE data formats and their headers are as follows:

    - **narrowPeak**: chrom;chromStart;chromEnd;name;score;strand;signalValue;pValue;qValue;peak
    - **broadPeak**: chrom;chromStart;chromEnd;name;score;strand;signalValue;pValue;qValue
    - **bed5**: chrom;chromStart;chromEnd;name;score
    - **bedLogR**: chrom;chromStart;chromEnd;name;score;strand;thickStart;thickEnd;reserved;logR
    - **bed6+2 miRNA**: chrom;chromStart;chromEnd;name;score;strand;miRNAname;expressionValue
    
    **Note:** qValue/pValue-based filtering is a future processing step.

    If encountering a new data format, find the corresponding column names from ENCODE.

3. Sort and compress the BED files.

These processing steps ensure that the bigBed files are converted to the appropriate BED format and compressed for efficient storage and analysis.
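The first two steps and the sorting could be sketched in Python as follows. Several details here are assumptions, not confirmed FILER conventions: the header is written as a single tab-separated line prefixed with `#`, records are sorted by chromosome name and then numerically by start and end, and the compression step is left to an external tool such as `bgzip`. Only two of the formats listed above are included to keep the example short.

```python
import subprocess

# Column names copied from the list above; extend for other formats.
HEADERS = {
    "narrowPeak": "chrom;chromStart;chromEnd;name;score;strand;signalValue;pValue;qValue;peak",
    "bed5": "chrom;chromStart;chromEnd;name;score",
}


def convert_bigbed(bigbed_path, bed_path):
    """Run bigBedToBed; return False on a non-zero exit so the file can be discarded."""
    result = subprocess.run(["bigBedToBed", bigbed_path, bed_path])
    return result.returncode == 0


def sort_key(record):
    """Order records by chromosome name, then numeric start and end."""
    chrom, start, end = record.split("\t")[:3]
    return chrom, int(start), int(end)


def add_header_and_sort(bed_path, out_path, file_format):
    """Write a header line (assumed '#'-prefixed) followed by the sorted records."""
    header = "#" + HEADERS[file_format].replace(";", "\t")
    with open(bed_path, encoding="utf-8") as handle:
        records = sorted((line.rstrip("\n") for line in handle if line.strip()), key=sort_key)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(header + "\n")
        for record in records:
            out.write(record + "\n")
```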

## 8. Quality Check (QC)

### 8.1 Data quality

Ensure data quality with the following steps:

1. Check for empty files.
2. Verify header and contents for each file format.

These QC steps ensure the integrity and reliability of the processed data.
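A minimal sketch of these two checks is shown below. It assumes that header and comment lines start with `#`, that fields are tab-separated, and that the expected number of columns for the file format is known (for example 10 for narrowPeak). The function name `check_bed` is illustrative.

```python
import gzip


def check_bed(path, expected_columns):
    """Return a list of problems found in a BED file (plain or gzip-compressed)."""
    opener = gzip.open if path.endswith(".gz") else open
    problems = []
    n_records = 0
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle, 1):
            line = line.rstrip("\n")
            # Skip blank lines and header/comment lines.
            if not line or line.startswith("#"):
                continue
            n_records += 1
            n_fields = len(line.split("\t"))
            if n_fields != expected_columns:
                problems.append(f"line {line_no}: {n_fields} fields, expected {expected_columns}")
    if n_records == 0:
        problems.append("file has no records")
    return problems
```

An empty list means the file passed both checks.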

### 8.2 Metadata QC

The metadata QC checks verify the accuracy and integrity of the metadata and confirm that it meets the required standards.

Run the metadata QC pipeline after completing the FILER metadata generation.

Below is the QC script which performs various quality control (QC) checks on the metadata. These checks include:

- Verifying the correct end-of-line character expected on UNIX
- Checking whether the header matches a predefined header
- Evaluating Giggle indexes
- Consistency of Giggle index and annotation files
- Updating any life stages that need correction
- Inspecting the metadata for any blank fields
- Checking the URL download prefix matches the URL link
- Verifying that fields have consistent capitalization patterns
- Detecting and resolving any double slashes in fields
- Ensuring the uniqueness of the Identifier field
- Verifying the uniqueness of the Processed file md5 field
- Checking that all rows have the correct number of fields
- Verifying the accessibility and validity of provided Processed File URLs
- Ensuring the correctness and matching of file sizes and md5sums.

```bash
#!/bin/bash

######################################################################
# FILER Metadata Quality Check (QC) Pipeline
######################################################################

input_md=$1
project_name=${2:-"undefined"}
break_on_error=${3:-"true"}

if [ -z "$input_md" ]; then
    echo "Usage: $0 <metadata.tsv> [project_name] [break_on_error]"
    exit 1
fi

if [ "$break_on_error" == "true" ]; then
    set -e
fi

if [ "$project_name" == "undefined" ]; then
    project_name=$(basename "$(dirname "$input_md")")
fi

# Directory that holds the individual QC scripts.
script_dir="/mnt/data/jeffrey/FILER/scripts/metadata"
md_name="${script_dir}/checking/${project_name}_master_project_metadata.tsv"

## Save a temporary working version
echo -e "----------\nSaving working version to ./checking:"
cp "$input_md" "$md_name"

## Check for correct end-of-line character
echo -e "\n----------\nChecking whether provided metadata has the correct end-of-line character expected on UNIX:"
bash "${script_dir}/check_unix_eol.sh" "$md_name"

## Check the header
echo -e "----------\nChecking whether header matches ./metadata_header.txt exactly:"
bash "${script_dir}/check_header.sh" "$md_name"

## Use check files
echo -e "\n----------\nRun check_files to evaluate Giggle indexes:"
bash "${script_dir}/check_files_alt.sh" "$md_name"

## Check file ages
echo -e "\n----------\nChecking that giggle index is older than all contained files:"
bash "${script_dir}/check_file_ages.sh" "$md_name"

## Check the life stages
echo -e "\n----------\nChecking whether any life stages need to be fixed:"
bash "${script_dir}/update_life_stages.sh" "$md_name"

## Check for empty elements
echo -e "\n----------\nChecking to see if any fields are blank:"
bash "${script_dir}/inspect_metadata_for_empty.sh" "$md_name"

# The two checks inside this block are currently disabled.
if [ 0 = 1 ]; then

## Check for correct number of columns
echo -e "\n----------\nChecking to see if column number compatible with declared format:"
bash "${script_dir}/check_columns_alt.sh" "$md_name"

## Check output dirs
echo -e "\n----------\nChecking that each output dir only contains one output type:"
bash "${script_dir}/check_dirs_and_outputtypes_alt.sh" "$md_name"

fi

## Check download prefixes
echo -e "\n----------\nChecking that URL download prefix matches URL link (will print if mismatch):"
bash "${script_dir}/check_url_and_target_dir_match_alt.sh" "$md_name"

## Check for capitalization issues
echo -e "\n----------\nChecking whether any fields have varying capitalization patterns:"
bash "${script_dir}/check_capitalization_alt.sh" "$md_name"

## Check for double-slashes
echo -e "\n----------\nChecking whether any field has a double slash in it, ignoring https:// and http:// :"
bash "${script_dir}/detect_double_slashes.sh" "$md_name"

## Check that Identifier is always unique
echo -e "\n----------\nChecking whether Identifier field is always unique:"
bash "${script_dir}/check_unique_identifiers.sh" "$md_name"

## Check that md5sum is always unique
echo -e "\n----------\nChecking whether Processed file md5 field is always unique:"
bash "${script_dir}/check_unique_md5.sh" "$md_name"

## Check that column number is always correct
echo -e "\n----------\nChecking whether all rows have the right number of fields:"
bash "${script_dir}/check_cols_and_empty_rows.sh" "$md_name"

## Check download availability
echo -e "\n----------\nChecking whether provided Processed File URLS are actually accessible and valid"
python "${script_dir}/check_download_links.py" "$md_name"

## Check sizes etc.
echo -e "\n----------\nChecking whether the file sizes and md5sums are correct/match expected."
bash "${script_dir}/check_md5_and_filesize.sh" "$md_name"
```

## 9. Data Organization and Indexing

Organize the processed data following FILER's data structure guidelines. Ensure that the processed BED files are structured appropriately according to the specified directory hierarchy and naming conventions.

After organization, index the processed BED files using [tabix](https://www.htslib.org/doc/tabix.html) and [giggle](https://github.com/ryanlayer/giggle). Indexing facilitates efficient data retrieval and querying, enhancing the accessibility and usability of the dataset.
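The indexing commands could be assembled and run from Python as sketched below. The `tabix -p bed` invocation is the standard BED preset and expects bgzip-compressed, position-sorted input. The `giggle index` flags follow the example in the giggle README and should be confirmed against the usage output of the installed version; the directory names in the usage note are hypothetical.

```python
import subprocess


def tabix_command(bed_gz):
    """Command to build a tabix index for one bgzip-compressed BED file."""
    return ["tabix", "-p", "bed", bed_gz]


def giggle_command(input_glob, index_dir):
    """Command to build a giggle index over all files matching input_glob.

    Flags follow the giggle README example; verify them locally before use.
    """
    return ["giggle", "index", "-i", input_glob, "-o", index_dir, "-f", "-s"]


def run_all(commands, dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

For example, `run_all([tabix_command("ENCFF569TJO.bed.gz"), giggle_command("bed_dir/*.bed.gz", "giggle_index")])` prints both commands without running them.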


## 10. Processed BED files

Processed ENCODE data in FILER: 

**ENCFF569TJO.bed.gz**: TF-ChIP-seq, narrowPeak, hg38

**ENCFF002TMO.bed.gz**: DNase-seq, bed5, hg38

## 11. FILER metadata

## 12. Validate FILER metadata

Verify if the ENCODE metadata is updated for existing FILER ENCODE files. ENCODE may update and/or add more metadata for their files. This involves checking the metadata for files that have already been processed and incorporated into FILER.

Use the downloaded JSON files to compare and ensure the metadata aligns with the current ENCODE standards. This validation ensures that all previously processed ENCODE files maintain consistency with the latest metadata updates from ENCODE.
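The comparison could be sketched as follows, assuming the values previously recorded in FILER are available as a dictionary that uses the same field names as the ENCODE JSON (a hypothetical arrangement; in practice a mapping from FILER columns to JSON fields would be needed). The function names are illustrative.

```python
import json


def load_json(path):
    """Load one pre-downloaded ENCODE JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def changed_fields(recorded, current, fields):
    """Return {field: (recorded value, current value)} for every field that differs."""
    return {
        field: (recorded.get(field), current.get(field))
        for field in fields
        if recorded.get(field) != current.get(field)
    }
```

A non-empty result for a track indicates that ENCODE has changed or added metadata since the track was incorporated into FILER.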

**TODO:** Add a step for cell type mapping. Refer to the cell-type/tissue dictionary and the mapping script.