# FACTORBOOK data processing SOP

## 1. Introduction

Factorbook is a transcription factor (TF)-centric repository of all ENCODE ChIP-seq datasets on TF-binding regions, together with extensive analysis results for these data.

### 1.1 Factorbook motifs:
Motifs were identified by applying MEME to the top 500 IDR-thresholded ChIP-seq peaks from each of more than 3,000 ENCODE ChIP-seq experiments. Five motifs were identified per experiment. These motifs were then filtered for quality using peak centrality and enrichment metrics. In total, 6,069 motifs are available.

[Factorbook motif catalog](https://downloads.wenglab.org/factorbook-download/complete-factorbook-catalog.meme.gz)

[Associated metadata](https://storage.googleapis.com/gcp.wenglab.org/factorbook_chipseq_meme_motifs.tsv)

### 1.2 Motif annotation:

Factorbook motifs are downloaded and processed to include overlaps with known motifs from the HOCOMOCO and JASPAR databases.

Tomtom was used to compare Factorbook MEME motifs against the HOCOMOCO and JASPAR catalogs. Tomtom results were filtered for matches where the query is contained within the target or vice versa. From these filtered results, the match with the lowest q-value was selected for each Factorbook motif as the most confident overlap with the known motifs in HOCOMOCO and JASPAR.
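The filtering logic described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual script: it assumes the standard `tomtom.tsv` column names (`Query_ID`, `Target_ID`, `q-value`, `Overlap`, `Query_consensus`, `Target_consensus`), and it interprets "query is part of the target or vice versa" as the aligned overlap covering the whole shorter motif, which is an assumption. The function name `best_matches` is hypothetical.

```python
import csv

def best_matches(lines):
    """Return {query_id: row} with the lowest q-value containment match per query.

    `lines` is any iterable of text lines from a Tomtom TSV result file.
    """
    # Tomtom appends blank and '#'-prefixed footer lines; skip them.
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    best = {}
    for row in csv.DictReader(data_lines, delimiter="\t"):
        shorter = min(len(row["Query_consensus"]), len(row["Target_consensus"]))
        # Assumption: containment means the overlap spans the shorter motif.
        if int(row["Overlap"]) < shorter:
            continue
        q_value = float(row["q-value"])
        current = best.get(row["Query_ID"])
        if current is None or q_value < float(current["q-value"]):
            best[row["Query_ID"]] = row
    return best
```

Passing an open file handle, for example `best_matches(open("tomtom.tsv"))`, would work the same way as passing a list of lines.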

The purpose of this SOP is to document the steps and processes involved in processing and annotating Factorbook motifs.



## 2. Requirements

### 2.1 ENCODE metadata:

ENCODE metadata is needed to construct the download URL for each Factorbook motif. A Factorbook motif ID (e.g., ENCFF674XTY_CMTBCTGGGARTTGTAG) is generated by joining the ENCODE BED file ID and the motif consensus sequence with an underscore.
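As a small illustration of this ID scheme, the sketch below builds a motif ID and splits it back into its two parts. The helper names `make_motif_id` and `split_motif_id` are hypothetical and are not part of the pipeline scripts.

```python
def make_motif_id(bed_file_id, consensus):
    """Join an ENCODE BED file ID and a consensus sequence with an underscore."""
    return f"{bed_file_id}_{consensus}"

def split_motif_id(motif_id):
    """Split a motif ID at the first underscore into (BED file ID, consensus)."""
    bed_file_id, consensus = motif_id.split("_", 1)
    return bed_file_id, consensus
```

For the example ID above, `split_motif_id("ENCFF674XTY_CMTBCTGGGARTTGTAG")` gives the BED file ID `ENCFF674XTY` and the consensus `CMTBCTGGGARTTGTAG`.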

#### 2.1.1 metadata download:
Select all BED and bigBed tracks available in the ENCODE experiment matrix, using the following selection criteria:

- Biosample: Organism --> Homo sapiens
- Quality: Status --> released, archived, revoked
- Analysis: Available file types --> All bed and bigBed file types
- Analysis: Genome Assembly --> hg19, GRCh38

The following URL opens the ENCODE data portal with the above criteria already applied:

[ENCODE metadata](https://www.encodeproject.org/search/?type=Experiment&status=released&perturbed=false&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&perturbed=true&status=archived&status=revoked&files.file_type=bed+narrowPeak&files.file_type=bigBed+narrowPeak&files.file_type=bed+idr_ranked_peak&files.file_type=bed+broadPeak&files.file_type=bed+bed3%2B&files.file_type=bigBed+broadPeak&files.file_type=bigBed+bed3%2B&files.file_type=bed+bedRnaElements&files.file_type=bigBed+bedRnaElements&files.file_type=bed+tss_peak&files.file_type=bigBed+tss_peak&files.file_type=bed+bedMethyl&files.file_type=bigBed+bedMethyl&files.file_type=bed+bed9%2B&files.file_type=bed+bedGraph&files.file_type=bigInteract&files.file_type=bed+idr_peak&files.file_type=bigBed+idr_peak&files.file_type=bigBed+bed9%2B&files.file_type=bed+bed9&files.file_type=bigBed+bed9&files.file_type=bed+bedLogR&files.file_type=bigBed+bedLogR&files.file_type=bed+bed12&files.file_type=bigBed+bed12&files.file_type=bigBed+bed6%2B&files.file_type=bed+bed6%2B&files.file_type=bed+bedExonScore&files.file_type=bigBed+bedExonScore&files.file_type=bed+peptideMapping&files.file_type=bigBed+peptideMapping&files.file_type=bed+bed3&files.file_type=bigBed+bed3&files.file_type=bed+modPepMap&files.file_type=bed+pepMap&files.file_type=bigBed+modPepMap&files.file_type=bigBed+pepMap)
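For readers who want to see how such a search URL is assembled, here is a minimal sketch using Python's standard `urllib.parse.urlencode`. It includes only a few of the file types from the full URL above, for brevity; the variable names are illustrative.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.encodeproject.org/search/"

# Repeated keys are expressed as repeated (key, value) pairs.
params = [
    ("type", "Experiment"),
    ("replicates.library.biosample.donor.organism.scientific_name", "Homo sapiens"),
    ("status", "released"),
    ("status", "archived"),
    ("status", "revoked"),
    # A subset of the bed/bigBed file types listed in the full URL.
    ("files.file_type", "bed narrowPeak"),
    ("files.file_type", "bigBed narrowPeak"),
    ("files.file_type", "bed bed3+"),
]

# urlencode encodes spaces as '+' and a literal '+' as '%2B',
# matching the encoding seen in the portal URL.
search_url = BASE_URL + "?" + urlencode(params)
```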

### 2.2 Tomtom: motif comparison tool

Tomtom compares one or more motifs against a database of known motifs (e.g., JASPAR). 

[MEME suite V5.5.5](https://meme-suite.org/meme/doc/download.html)
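A Tomtom run can be wrapped from Python roughly as shown below. This is a sketch under assumptions: the function names and any file names passed in are hypothetical, and no thresholds or other options used by the actual pipeline are implied. `-oc` is Tomtom's option for writing results to an output directory that may be overwritten.

```python
import shutil
import subprocess

def build_tomtom_cmd(query_meme, target_dbs, out_dir):
    """Build the command line: tomtom -oc <out_dir> <query> <target> [<target> ...]."""
    return ["tomtom", "-oc", out_dir, query_meme, *target_dbs]

def run_tomtom(query_meme, target_dbs, out_dir):
    """Run Tomtom if it is on PATH; raises if the tool is missing or fails."""
    if shutil.which("tomtom") is None:
        raise RuntimeError("tomtom executable not found on PATH")
    subprocess.run(build_tomtom_cmd(query_meme, target_dbs, out_dir), check=True)
```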

### 2.3 Samtools

[Samtools V1.17](https://www.htslib.org/)

## 3. FILER data processing pipeline

### 3.1 Collect ENCODE experiment IDs from Factorbook motif catalog metadata

Script: step1-get_ENCODE_experiments.sh

Input: factorbook motif metadata - factorbook_chipseq_meme_motifs.tsv

Output: ENCODE_experimentIDs.txt
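
The script itself is not reproduced in this SOP. As a rough Python equivalent, the sketch below collects unique experiment accessions from the metadata lines. It assumes that experiment IDs appear in the metadata as text in the ENCODE accession pattern `ENCSR` followed by three digits and three uppercase letters; the function name is hypothetical, and no particular column layout is assumed.

```python
import re

# ENCODE experiment accession pattern, e.g. ENCSR000AAA.
EXPERIMENT_RE = re.compile(r"(?<![A-Z0-9])ENCSR\d{3}[A-Z]{3}(?![A-Z0-9])")

def collect_experiment_ids(lines):
    """Return the sorted unique experiment accessions found in the given lines."""
    ids = set()
    for line in lines:
        ids.update(EXPERIMENT_RE.findall(line))
    return sorted(ids)
```

Writing one ID per line of the returned list would produce a file in the shape of `ENCODE_experimentIDs.txt`.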

### 3.2 Process ENCODE metadata to map IDR-thresholded peak BED file IDs for all ChIP-seq experiments

Script: step2-bedtobigBed_mapping.py

Input: 
1. ENCODE metadata
2. ENCODE experiment IDs from step 3.1
3. factorbook motifs metadata

Output: bedtobigBed_mapping.txt
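
The script is not reproduced here, so the following is only one plausible sketch of such a mapping, not the actual implementation. It assumes the ENCODE metadata rows carry the columns `File accession`, `File format`, `Output type`, `Experiment accession`, and `Derived from`, and that each bigBed file lists its source BED file in `Derived from`. All of these are assumptions, as is the function name.

```python
import re

# ENCODE file accession pattern, e.g. ENCFF674XTY.
FILE_ID_RE = re.compile(r"ENCFF\d{3}[A-Z]{3}")

def map_bed_to_bigbed(metadata_rows, experiment_ids):
    """Return {bed_file_id: [bigBed_file_id, ...]} for IDR-thresholded peaks.

    `metadata_rows` is an iterable of dicts (e.g. from csv.DictReader);
    `experiment_ids` is a set of experiment accessions to keep.
    """
    mapping = {}
    for row in metadata_rows:
        if row["Experiment accession"] not in experiment_ids:
            continue
        if "IDR thresholded peaks" not in row["Output type"]:
            continue
        if not row["File format"].startswith("bigBed"):
            continue
        # Assumption: 'Derived from' holds paths such as /files/ENCFF000AAA/.
        for bed_id in FILE_ID_RE.findall(row["Derived from"]):
            mapping.setdefault(bed_id, []).append(row["File accession"])
    return mapping
```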