# FACTORBOOK data processing SOP

## 1. Introduction

Factorbook is a transcription factor (TF)-centric repository of all ENCODE ChIP-seq datasets on TF-binding regions, together with the analysis results derived from these data.

#### 1.1 Factorbook motifs:
Motifs were identified by applying MEME to the top 500 IDR-thresholded ChIP-seq peaks from more than 3,000 ENCODE ChIP-seq experiments. Five motifs were identified per experiment and then filtered for quality using peak centrality and enrichment metrics. In total, 6,069 motifs are available.

[Factorbook motif catalog](https://downloads.wenglab.org/factorbook-download/complete-factorbook-catalog.meme.gz)

[Associated metadata](https://storage.googleapis.com/gcp.wenglab.org/factorbook_chipseq_meme_motifs.tsv)

#### 1.2 Factorbook motif annotation for FILER

Factorbook motifs are downloaded and processed to include overlaps with known motifs from the HOCOMOCO and JASPAR databases.

Tomtom was used to compare Factorbook MEME motifs against the HOCOMOCO and JASPAR catalogs. The Tomtom results were filtered for matches in which the query is contained in the target or vice versa. From these filtered results, the matches with the lowest q-value were selected as the most confident overlaps between Factorbook motifs and the known motifs in HOCOMOCO and JASPAR.
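
The filtering rule described above can be sketched in Python as follows. This is a minimal illustration, not the code in step4-run_tomtom.py: it assumes that "part of" means a plain substring match between the `Query_consensus` and `Target_consensus` columns of Tomtom's TSV output, and it ignores match orientation. The first row is taken from the tomtom_mapping.txt sample in section 3.4; the two `HYPOTHETICAL_*` rows are made up for the demonstration.

```python
from typing import Dict, List


def is_contained(query: str, target: str) -> bool:
    """True if one consensus is a substring of the other (assumed rule)."""
    return query in target or target in query


def best_matches(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Keep, per query motif, the contained match with the lowest q-value."""
    best: Dict[str, Dict[str, str]] = {}
    for row in rows:
        if not is_contained(row["Query_consensus"], row["Target_consensus"]):
            continue  # drop matches where neither consensus contains the other
        qid = row["Query_ID"]
        if qid not in best or float(row["q-value"]) < float(best[qid]["q-value"]):
            best[qid] = row
    return best


rows = [
    # Row from the sample output in section 3.4
    {"Query_ID": "ENCSR796VBR_TGAGTCAY", "Target_ID": "FOSB.H12CORE.0.P.B",
     "q-value": "7.94831e-05", "Query_consensus": "TGAGTCAT",
     "Target_consensus": "ATGAGTCAT"},
    # Hypothetical: lower q-value, but neither consensus contains the other
    {"Query_ID": "ENCSR796VBR_TGAGTCAY", "Target_ID": "HYPOTHETICAL_A",
     "q-value": "1e-06", "Query_consensus": "TGAGTCAT",
     "Target_consensus": "GGGGCCCC"},
    # Hypothetical: contained, but with a higher q-value
    {"Query_ID": "ENCSR796VBR_TGAGTCAY", "Target_ID": "HYPOTHETICAL_B",
     "q-value": "0.01", "Query_consensus": "TGAGTCAT",
     "Target_consensus": "TTGAGTCATT"},
]

selected = best_matches(rows)
print(selected["ENCSR796VBR_TGAGTCAY"]["Target_ID"])
```

Running the same function once on the HOCOMOCO results and once on the JASPAR results would give one best match per database for each query.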

The purpose of this SOP is to document the steps and processes involved in processing and annotating Factorbook motifs.

#### 1.3 Data processing workflow

![title](/mnt/ebs/jackal/FILER2/FILER2-production/FACTORBOOK/FB.png)



## 2. Requirements

### 2.1 ENCODE metadata:

ENCODE metadata is needed to construct the download URL for each Factorbook motif. A Factorbook motif ID (e.g., ENCFF674XTY_CMTBCTGGGARTTGTAG) is formed by combining the ENCODE BED file ID with the motif's consensus sequence.
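
As a small illustration of this ID scheme, the sketch below joins and splits a motif ID. The helper names `make_motif_id` and `split_motif_id` are hypothetical, and the underscore separator is inferred from the example ID above. The download URL itself is not constructed here because its pattern is not described in this SOP.

```python
from typing import Tuple


def make_motif_id(bed_accession: str, consensus: str) -> str:
    """Join an ENCODE BED file accession and a consensus sequence."""
    return f"{bed_accession}_{consensus}"


def split_motif_id(motif_id: str) -> Tuple[str, str]:
    """Split a motif ID back into (BED accession, consensus sequence)."""
    bed_accession, consensus = motif_id.split("_", 1)
    return bed_accession, consensus


motif_id = make_motif_id("ENCFF674XTY", "CMTBCTGGGARTTGTAG")
print(motif_id)
print(split_motif_id(motif_id))
```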

#### 2.1.1 Metadata download:
Select all the bed and bigBed tracks available in the ENCODE experiment matrix using the following selection criteria:

- Biosample: Organism --> Homo sapiens
- Quality: Status --> released, archived, revoked
- Analysis: Available file types --> All bed and bigBed file types
- Analysis: Genome Assembly --> hg19, GRCh38

The following URL is pre-selected to meet the above criteria on the ENCODE data portal:

[ENCODE metadata](https://www.encodeproject.org/search/?type=Experiment&status=released&perturbed=false&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&perturbed=true&status=archived&status=revoked&files.file_type=bed+narrowPeak&files.file_type=bigBed+narrowPeak&files.file_type=bed+idr_ranked_peak&files.file_type=bed+broadPeak&files.file_type=bed+bed3%2B&files.file_type=bigBed+broadPeak&files.file_type=bigBed+bed3%2B&files.file_type=bed+bedRnaElements&files.file_type=bigBed+bedRnaElements&files.file_type=bed+tss_peak&files.file_type=bigBed+tss_peak&files.file_type=bed+bedMethyl&files.file_type=bigBed+bedMethyl&files.file_type=bed+bed9%2B&files.file_type=bed+bedGraph&files.file_type=bigInteract&files.file_type=bed+idr_peak&files.file_type=bigBed+idr_peak&files.file_type=bigBed+bed9%2B&files.file_type=bed+bed9&files.file_type=bigBed+bed9&files.file_type=bed+bedLogR&files.file_type=bigBed+bedLogR&files.file_type=bed+bed12&files.file_type=bigBed+bed12&files.file_type=bigBed+bed6%2B&files.file_type=bed+bed6%2B&files.file_type=bed+bedExonScore&files.file_type=bigBed+bedExonScore&files.file_type=bed+peptideMapping&files.file_type=bigBed+peptideMapping&files.file_type=bed+bed3&files.file_type=bigBed+bed3&files.file_type=bed+modPepMap&files.file_type=bed+pepMap&files.file_type=bigBed+modPepMap&files.file_type=bigBed+pepMap)

### 2.2 Tomtom: Motif comparison tool

Tomtom compares one or more motifs against a database of known motifs (e.g., JASPAR). 

[MEME suite V5.5.5](https://meme-suite.org/meme/doc/download.html)

### 2.3 Samtools

[Samtools V1.17](https://www.htslib.org/)

## 3. FILER data processing pipeline

![title](/mnt/ebs/jackal/FILER2/FILER2-production/FACTORBOOK/fb_pipeline.png)

### 3.1 Collect ENCODE experiment IDs from Factorbook motif catalog metadata

Script: step1-get_ENCODE_experiments.sh

Input: Factorbook motif metadata - factorbook_chipseq_meme_motifs.tsv

| Animal       | Biosample | Target | Identifier     | Consensus                 | Sample Type | Information                                                                                                                                                  |
|--------------|-----------|--------|----------------|---------------------------|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Homo sapiens | GM12878   | NFATC3 | ENCSR437GBJ     | AAAGAGGAASTGAAA            | cell line   | "inconsistent platforms, mild to moderate bottlenecking, moderate library complexity, mild to moderate bottlenecking, mild to moderate bottlenecking"          |
| Homo sapiens | GM12878   | NFATC3 | ENCSR437GBJ     | CCTCCTAGCCCTRACACACAGCTGGG | cell line   | "inconsistent platforms, mild to moderate bottlenecking, moderate library complexity, mild to moderate bottlenecking, mild to moderate bottlenecking"          |
| Homo sapiens | HepG2     | TIGD3  | ENCSR725QZQ     | AWANAAWHAMHTTWMWGACHAGAAAKRTTR  | cell line   | "mild to moderate bottlenecking, mild to moderate bottlenecking, mild to moderate bottlenecking, mild to moderate bottlenecking, borderline replicate concordance, missing genetic modification reagents" |
| Homo sapiens | HepG2     | TIGD3  | ENCSR725QZQ     | WGGTGCTGAAA                | cell line   | "mild to moderate bottlenecking, mild to moderate bottlenecking, mild to moderate bottlenecking, mild to moderate bottlenecking, borderline replicate concordance, missing genetic modification reagents" |
| Homo sapiens | HepG2     | MED13  | ENCSR269DQN     | RAAGATCAAAG                | cell line   | missing genetic modification reagents                                                                                                                         |
| Homo sapiens | HepG2     | MED13  | ENCSR269DQN     | CAGCACCRYGGACAGCGCC        | cell line   | missing genetic modification reagents                                                                                                                         |
| Homo sapiens | HepG2     | MED13  | ENCSR269DQN     | CCCCBCCCBCNCCYBBSCCCC      | cell line   | missing genetic modification reagents                                                                                                                         |


Output: ENCODE_experimentIDs.txt

- ENCSR000AHD
- ENCSR000AHF
- ENCSR000AIB
- ENCSR000AIC
- ENCSR000AID
- ENCSR000AIE
- ENCSR000AIG
- ENCSR000AIH

### 3.2 Process ENCODE metadata to map IDR thresholded peak BED file IDs for all ChIP-seq experiments

Script: step2-bedtobigBed_mapping.py

Input: 
1. ENCODE metadata from section 2.1
2. ENCODE experiment IDs from section 3.1
3. Factorbook motif metadata from section 1.1

Output: bedtobigBed_mapping.txt

| Animal       | Biosample | Target | Experiment_ID | Experiment_ID | bigBed_Accession | BED_Accession | Consensus                        | Sample_Type | BED_Consensus                        |
|--------------|-----------|--------|---------------|---------------|------------------|---------------|-----------------------------------|-------------|--------------------------------------|
| Homo sapiens | GM12878   | NFATC3 | ENCSR437GBJ   | ENCSR437GBJ   | ENCFF340KVJ      | ENCFF002XEC   | AAAGAGGAASTGAAA                  | cell line   | ENCFF002XEC_AAAGAGGAASTGAAA         |
| Homo sapiens | GM12878   | NFATC3 | ENCSR437GBJ   | ENCSR437GBJ   | ENCFF340KVJ      | ENCFF002XEC   | CCTCCTAGCCCTRACACACAGCTGGG        | cell line   | ENCFF002XEC_CCTCCTAGCCCTRACACACAGCTGGG |
| Homo sapiens | GM12878   | NFATC3 | ENCSR437GBJ   | ENCSR437GBJ   | ENCFF340KVJ      | ENCFF002XEC   | SCBSBCMSBBCCRVCCHSCSVKCCYKSVCC    | cell line   | ENCFF002XEC_SCBSBCMSBBCCRVCCHSCSVKCCYKSVCC |
| Homo sapiens | GM12878   | NFATC3 | ENCSR437GBJ   | ENCSR437GBJ   | ENCFF340KVJ      | ENCFF002XEC   | GCSYGGGGCAAGTGACCGTGCGTGTAAAGG    | cell line   | ENCFF002XEC_GCSYGGGGCAAGTGACCGTGCGTGTAAAGG |
| Homo sapiens | HepG2     | TIGD3  | ENCSR725QZQ   | ENCSR725QZQ   | ENCFF491KVL      | ENCFF003AMQ   | WGGTGCTGAAA                      | cell line   | ENCFF003AMQ_WGGTGCTGAAA             |
| Homo sapiens | HepG2     | TIGD3  | ENCSR725QZQ   | ENCSR725QZQ   | ENCFF491KVL      | ENCFF003AMQ   | WTCAGCACCANGGACAGCDCC            | cell line   | ENCFF003AMQ_WTCAGCACCANGGACAGCDCC   |
| Homo sapiens | HepG2     | TIGD3  | ENCSR725QZQ   | ENCSR725QZQ   | ENCFF491KVL      | ENCFF003AMQ   | CCTYYYYCYBCCTBCTCTYYC            | cell line   | ENCFF003AMQ_CCTYYYYCYBCCTBCTCTYYC   |


### 3.3 Download Factorbook motifs

Script: step3-download_facrotbook_motifs.py

Input: bedtobigBed_mapping.txt from section 3.2

Output: binding sites for the motif ENCFF675PHY_STGATGCAABC in BED format

| Chromosome | Start      | End        | Strand | p-value  |
|------------|------------|------------|--------|--------|
| chr11      | 3057416    | 3057427    | +      | 0.0529 |
| chr6       | 6655191    | 6655202    | +      | 0.0529 |
| chr2       | 218745386  | 218745397  | +      | 0.0529 |
| chr19      | 33374069   | 33374080   | +      | 0.0529 |
| chr1       | 220046302  | 220046313  | +      | 0.0529 |
| chr10      | 72272976   | 72272987   | +      | 0.0529 |
| chr17      | 41689493   | 41689504   | -      | 0.0529 |


### 3.4 Run Tomtom to compare Factorbook motifs to HOCOMOCO and JASPAR known motifs and process the results

Script: step4-run_tomtom.py

Input: 
1. complete-factorbook-catalog.meme from section 1.1
2. HOCOMOCO v12 Core collection - H12CORE_meme_format.meme 
3. HOCOMOCO v12 core metadata - H12CORE_motifs.tsv
4. JASPAR 2022 Core collection - JASPAR2022_CORE_non-redundant_v2.meme 

Output: tomtom_mapping.txt

| query_ID                     | HOCOMOCO_ID           | HOCOMOCO_q-value | HOCOMOCO_query_consensus | HOCOMOCO_Target_consensus | HOCOMOCO_TF_name | HOCOMOCO_species | JASPAR_ID   | JASPAR_q-value | JASPAR_query_consensus | JASPAR_Target_consensus | JASPAR_TF_name | JASPAR_species        |
|------------------------------|-----------------------|------------------|--------------------------|---------------------------|------------------|------------------|-------------|----------------|------------------------|-------------------------|----------------|-----------------------|
| ENCSR000FAL_GCYGCYGCYGCCKCC   | ZN519.H12CORE.0.P.C   | 2.35385e-07      | GCCGCCGCCGCCGCC           | GCCGCCGCCGCCGCCGCCCC       | ZN519            | ZN519_HUMAN      | MA2022.1    | 1.06006e-07    | GCCGCCGCCGCCGCC        | CCGCCGCCGCCGCCGCC        | LOB            | Arabidopsis thaliana   |
| ENCSR000BUU_TTCAGCACCAYGGACA  | REST.H12CORE.0.P.B    | 4.04185e-11      | TTCAGCACCATGGACA          | TTCAGCACCATGGACAGCGCCC     | REST             | REST_HUMAN       | MA0138.2    | 2.80469e-11    | TTCAGCACCATGGACA        | TTCAGCACCATGGACAGCGCC    | REST           | Homo sapiens          |
| ENCSR796VBR_TGAGTCAY          | FOSB.H12CORE.0.P.B    | 7.94831e-05      | TGAGTCAT                  | ATGAGTCAT                  | FOSB             | FOSB_HUMAN       | MA1988.1    | 0.000972578    | TGAGTCAT                | TATGAGTCATC             | Atf3           | Mus musculus          |
| ENCSR000BKE_GTCACGTG          | TFE3.H12CORE.0.PSM.A  | 5.0608e-05       | GTCACGTG                  | GGTCACGTGAT                | TFE3             | TFE3_HUMAN       | MA0871.2    | 0.00134658     | GTCACGTG                | GGTCACGTGGG             | TFEC           | Homo sapiens          |
| ENCSR000BTE_RTGACTCA          | ATF3.H12CORE.0.P.B    | 3.51441e-05      | ATGACTCA                  | GGATGACTCA                 | ATF3             | ATF3_HUMAN       | MA0099.3    | 1.65038e-05    | ATGACTCA                | GATGACTCAT             | FOS::JUN       | Homo sapiens          |
| ENCSR705HGT_YKCYSATTGGCYR     | no_match              | NA               | NA                        | NA                         | NA               | NA               | MA1644.1    | 6.0952e-06     | CTCTGATTGGCTG           | TCTGATTGGCT             | NFYC           | Homo sapiens          |


### 3.5 Process Factorbook motifs - Annotating intervals and filtering by p-value

Script: step5-motif_processing.py

Input:
1. Factorbook motifs from section 3.3
2. bedtobigBed_mapping.txt from section 3.2
3. tomtom_mapping.txt from section 3.4

Output:
1. fb_motifs_filtered - annotated intervals filtered for p-value < 1e-4.
2. fb_motifs_processed - annotated intervals

| #chrom | chromStart | chromEnd | hocomoco_jaspar_TF_name | FB_score | strand | binding_sequence | FB_pwm_ID                | FB_consensus_sequence | FB_ChIP-Seq_target | HOCOMOCO_pwm_ID          | HOCOMOCO_Target_consensus | HOCOMOCO_q-value | JASPAR_pwm_ID | JASPAR_Target_consensus | JASPAR_q-value | ENCODE_exp_bigBed_bed_IDs                      |
|--------|------------|----------|-------------------------|----------|--------|------------------|--------------------------|-----------------------|--------------------|--------------------------|--------------------------|-----------------|---------------|------------------------|----------------|------------------------------------------------|
| chr1   | 115809     | 115817   | GATA1,GATA1::TAL1        | 0.178    | -      | CTTATCTG          | ENCFF321NXM_CTTATCWS      | CTTATCWS              | NCOR1              | GATA1.H12CORE.1.PSM.A     | CCTTATCTGC               | 4.6956e-05      | MA0140.2      | CTTATCTGTGAGGAGCAG      | 0.0155726      | ENCSR798ILC-ENCFF866HRM-ENCFF321NXM            |
| chr1   | 1109904    | 1109912  | GATA1,GATA1::TAL1        | 0.185    | +      | CTTATCAG          | ENCFF321NXM_CTTATCWS      | CTTATCWS              | NCOR1              | GATA1.H12CORE.1.PSM.A     | CCTTATCTGC               | 4.6956e-05      | MA0140.2      | CTTATCTGTGAGGAGCAG      | 0.0155726      | ENCSR798ILC-ENCFF866HRM-ENCFF321NXM            |
| chr1   | 1883636    | 1883644  | GATA1,GATA1::TAL1        | 0.178    | -      | CTTATCTG          | ENCFF321NXM_CTTATCWS      | CTTATCWS              | NCOR1              | GATA1.H12CORE.1.PSM.A     | CCTTATCTGC               | 4.6956e-05      | MA0140.2      | CTTATCTGTGAGGAGCAG      | 0.0155726      | ENCSR798ILC-ENCFF866HRM-ENCFF321NXM            |
| chr1   | 2585948    | 2585956  | GATA1,GATA1::TAL1        | 0.185    | +      | CTTATCTC          | ENCFF321NXM_CTTATCWS      | CTTATCWS              | NCOR1              | GATA1.H12CORE.1.PSM.A     | CCTTATCTGC               | 4.6956e-05      | MA0140.2      | CTTATCTGTGAGGAGCAG      | 0.0155726      | ENCSR798ILC-ENCFF866HRM-ENCFF321NXM            |


| COLUMN                    | DESCRIPTION                                                                                                                                             |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| chrom                     | Chromosome in chrXXX notation (e.g., chr1, chr2)                                                                                                        |
| chromStart                | Start position of the binding site. 0-based coordinate system                                                                                            |
| chromEnd                  | End position of the binding site.                                                                                                                        |
| hocomoco_jaspar_TF_name    | Transcription factor names from HOCOMOCO and JASPAR databases. If both are available, they are comma-separated                                           |
| FB_score                  | FIMO p-value                                                                                                                                            |
| strand                    | Strand (+ for forward, - for reverse)                                                                                                                    |
| binding_sequence          | Strand specific nucleotide sequence that binds to the transcription factor, extracted from the reference genome using the binding site coordinates (reverse complemented for '-' strand) |
| FB_pwm_ID                 | Factorbook motif Position Weight Matrix (PWM) ID                                                                                                         |
| FB_consensus_sequence      | Consensus sequence representing the most common nucleotides at each position within the Factorbook motif                                                 |
| FB_ChIP-Seq_target         | Target of the ChIP-Seq experiment                                                                                                                        |
| HOCOMOCO_pwm_ID            | HOCOMOCO motif Position Weight Matrix (PWM) ID                                                                                                           |
| HOCOMOCO_Target_consensus  | Consensus sequence from the HOCOMOCO database representing the target site of the transcription factor                                                    |
| HOCOMOCO_q-value           | Q-value indicating the significance of the match between Factorbook motif and known motifs in the HOCOMOCO database.                                      |
| JASPAR_pwm_ID              | JASPAR motif Position Weight Matrix (PWM) ID                                                                                                             |
| JASPAR_Target_consensus    | Consensus sequence from the JASPAR database representing the target site of the transcription factor                                                      |
| JASPAR_q-value             | Q-value indicating the significance of the match between Factorbook motif and known motifs in the JASPAR database.                                        |
| ENCODE_exp_bigBed_bed_IDs  | ENCODE experiment ID, bigBed Accession, and BED Accession (hyphen-separated)                                                                             |
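
Two of the operations described in this section can be sketched in Python: reverse-complementing a sequence for '-' strand sites, and keeping only intervals with p-value < 1e-4. This is a minimal illustration, not the code in step5-motif_processing.py. It assumes each interval is a dictionary keyed by the column names above and that `FB_score` holds the p-value. The two demo records are hypothetical.

```python
from typing import Dict, List

# Translation table for complementing DNA bases (N stays N)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_specific_sequence(forward_seq: str, strand: str) -> str:
    """Sequence as read on the given strand from the forward-strand sequence."""
    return reverse_complement(forward_seq) if strand == "-" else forward_seq


def filter_by_pvalue(records: List[Dict[str, str]],
                     threshold: float = 1e-4) -> List[Dict[str, str]]:
    """Keep records whose FB_score (p-value) is strictly below the threshold."""
    return [r for r in records if float(r["FB_score"]) < threshold]


# Hypothetical records for demonstration only
records = [
    {"#chrom": "chr1", "chromStart": "100", "chromEnd": "108",
     "FB_score": "5e-05", "strand": "-"},
    {"#chrom": "chr1", "chromStart": "200", "chromEnd": "208",
     "FB_score": "0.02", "strand": "+"},
]

kept = filter_by_pvalue(records)
print(len(kept))
print(strand_specific_sequence("CAGATAAG", "-"))
```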


### 3.6 Factorbook motif probability matrix

Script: step6-pwm-proecssing.py

Input: complete-factorbook-catalog.meme.gz from section 1.1

Output: ENCSR444LIN_CTTTGAAR.meme
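
One way to produce a per-motif file from the catalog is to copy the MEME header plus the block of a single motif. The sketch below is an assumption about how this could be done, not the code in step6-pwm-proecssing.py. It assumes the catalog is in MEME text format, that each motif starts with a `MOTIF <ID>` line, and that the ID is the first token after `MOTIF`. The toy catalog, its matrix values, and the `HYPOTHETICAL_MOTIF` entry are placeholders rather than real Factorbook data.

```python
def extract_motif(catalog_text: str, motif_id: str) -> str:
    """Return MEME text holding the catalog header and one motif block."""
    header, block = [], []
    in_header, in_block = True, False
    for line in catalog_text.splitlines():
        if line.startswith("MOTIF"):
            in_header = False
            fields = line.split()
            # A new MOTIF line starts or ends the block of interest
            in_block = len(fields) > 1 and fields[1] == motif_id
        if in_header:
            header.append(line)
        elif in_block:
            block.append(line)
    if not block:
        raise KeyError(f"motif not found: {motif_id}")
    return "\n".join(header + block) + "\n"


# Toy catalog with placeholder matrices (not real Factorbook values)
SAMPLE_CATALOG = """MEME version 4

ALPHABET= ACGT

strands: + -

MOTIF ENCSR444LIN_CTTTGAAR
letter-probability matrix: alength= 4 w= 2 nsites= 10 E= 0
0.0 1.0 0.0 0.0
0.0 0.0 0.0 1.0

MOTIF HYPOTHETICAL_MOTIF
letter-probability matrix: alength= 4 w= 1 nsites= 10 E= 0
0.25 0.25 0.25 0.25
"""

single = extract_motif(SAMPLE_CATALOG, "ENCSR444LIN_CTTTGAAR")
print(single)
```

For the real catalog, the text can be read with `gzip.open("complete-factorbook-catalog.meme.gz", "rt").read()` and the returned string written to `<motif ID>.meme`.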