# HiCPro Pipeline Setup  
notebook by Frank Grenn

HiCPro by Nicolas Servant  
    [HiCPro paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0831-x)  
    [HiCPro github](https://github.com/nservant/HiC-Pro)

In [None]:
#these paths and files need to already exist
HICPRO_PATH="/path/to/hicpro/2.11.1" #hicpro installation
WRKDIR="/path/to/all/output/folders" #folder that will contain a folder for each sample's HiC-Pro output
ANNOTATION_PATH="{}/annotation".format(WRKDIR) #path to folder that will contain files created in step 1
CONFIG_FILE="{}/config.txt".format(ANNOTATION_PATH)
CONFIG_FILE2="{}/config2.txt".format(ANNOTATION_PATH)
REF_GENOME_FASTA="/path/to/a/Homo_sapiens_assembly38.fasta" #use an absolute path (later scripts cd before reading it); pulled down from gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta

#other variables
SAMPLE="35233" #identifier for the sample. A folder with this name will be made in WRKDIR
RESTRICTION_ENZYME="dpnii" #dpnii=dpni=mboi=sau3ai
REF_GENOME="hg38"

#these folders will be created by code in this notebook
SAMPLE_PATH="{}/{}".format(WRKDIR,SAMPLE)
SPLIT_DATA_PATH="{}/splits".format(SAMPLE_PATH)
SCRIPT_PATH="{}/scripts".format(SAMPLE_PATH)
LOG_PATH="{}/logs".format(SAMPLE_PATH)

#### after everything is run directories should look something like this:  
starting with `WRKDIR`
```
|-HiCPro
|   |-annotation
|   |   |-config.txt
|   |   |-config2.txt
|   |-SAMPLE1
|   |   |-splits
|   |   |-scripts
|   |   |-logs
|   |   |-bowtie_results
|   |   |   |-bwt2
|   |   |   |-bwt2_global
|   |   |   |-bwt2_local
|   |   |-hic_results
|   |   |   |-data
|   |   |   |-matrix
|   |   |   |-pic
|   |   |   |-stats
|   |   |-rawdata
|   |-SAMPLE2
|   |   |-splits
|   |   |-scripts
|   |   |-logs
|   |   |-bowtie_results
|   |   |   |-bwt2
|   |   |   |-bwt2_global
|   |   |   |-bwt2_local
|   |   |-hic_results
|   |   |   |-data
|   |   |   |-matrix
|   |   |   |-pic
|   |   |   |-stats
|   |   |-rawdata
        
```
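The paths listed as needing to exist can be checked before any step is run. The sketch below is one way to do that; `missing_paths` is a hypothetical helper name (not part of HiC-Pro), and it only tests that each path exists, not what it contains.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Placeholder list: replace with HICPRO_PATH, WRKDIR, ANNOTATION_PATH
# and REF_GENOME_FASTA from the variables cell.
required = [os.getcwd(), "/no/such/dir/hicpro_placeholder"]
for p in missing_paths(required):
    print("missing:", p)
```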

# (1) Setup HiCPro Files

## (a) generate BED file of restriction fragments after digestion

placed in the annotation folder

In [None]:
print("{}/bin/utils/digest_genome.py -r {} -o {}/{}_{}.bed {}".format(HICPRO_PATH, RESTRICTION_ENZYME,ANNOTATION_PATH, REF_GENOME, RESTRICTION_ENZYME, REF_GENOME_FASTA))

optional removal of anything but chr1-22, X and Y

In [None]:
print("mv {}/{}_{}.bed {}/{}_{}_full.bed".format(ANNOTATION_PATH, REF_GENOME, RESTRICTION_ENZYME,ANNOTATION_PATH, REF_GENOME, RESTRICTION_ENZYME))
print("grep \
-e chr1$'\\t' -e chr2$'\\t' -e chr3$'\\t' -e chr4$'\\t' -e chr5$'\\t' \
-e chr6$'\\t' -e chr7$'\\t' -e chr8$'\\t' -e chr9$'\\t' -e chr10$'\\t' \
-e chr11$'\\t' -e chr12$'\\t' -e chr13$'\\t' -e chr14$'\\t' -e chr15$'\\t' \
-e chr16$'\\t' -e chr17$'\\t' -e chr18$'\\t' -e chr19$'\\t' -e chr20$'\\t' \
-e chr21$'\\t' -e chr22$'\\t' -e chrX$'\\t' -e chrY$'\\t' \
{}/{}_{}_full.bed > {}/{}_{}.bed ".format(ANNOTATION_PATH, REF_GENOME, RESTRICTION_ENZYME,ANNOTATION_PATH, REF_GENOME, RESTRICTION_ENZYME))




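compare line counts: the first command counts the lines kept, the second counts the lines removed from the full file; together they should add up to the line count of the full file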

In [None]:
print("wc -l {}/{}_{}.bed".format(ANNOTATION_PATH, REF_GENOME, RESTRICTION_ENZYME))
print("grep -v \
-e chr1$'\\t' -e chr2$'\\t' -e chr3$'\\t' -e chr4$'\\t' -e chr5$'\\t' \
-e chr6$'\\t' -e chr7$'\\t' -e chr8$'\\t' -e chr9$'\\t' -e chr10$'\\t' \
-e chr11$'\\t' -e chr12$'\\t' -e chr13$'\\t' -e chr14$'\\t' -e chr15$'\\t' \
-e chr16$'\\t' -e chr17$'\\t' -e chr18$'\\t' -e chr19$'\\t' -e chr20$'\\t' \
-e chr21$'\\t' -e chr22$'\\t' -e chrX$'\\t' -e chrY$'\\t' \
{}/{}_{}_full.bed | wc -l".format(ANNOTATION_PATH, REF_GENOME, RESTRICTION_ENZYME))

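The same filter and line-count check can also be sketched in plain Python. This is an illustration rather than a replacement for the `grep` commands: it matches the first tab-delimited column exactly instead of searching for a pattern anywhere in the line, and the example lines are invented. `CANONICAL` and `split_canonical` are hypothetical names.

```python
# Chromosome names to keep: chr1..chr22, chrX, chrY.
CANONICAL = {"chr{}".format(i) for i in range(1, 23)} | {"chrX", "chrY"}

def split_canonical(lines):
    """Split tab-delimited lines into (kept, dropped) by their first column."""
    kept, dropped = [], []
    for line in lines:
        chrom = line.split("\t", 1)[0]
        (kept if chrom in CANONICAL else dropped).append(line)
    return kept, dropped

# Invented example lines in the layout of a restriction fragment BED file.
example = [
    "chr1\t0\t11159\tHIC_chr1_1\t0\t+\n",
    "chr10\t0\t5000\tHIC_chr10_1\t0\t+\n",
    "chrUn_KI270302v1\t0\t2274\tHIC_chrUn_KI270302v1_1\t0\t+\n",
]
kept, dropped = split_canonical(example)
# kept + dropped should equal the number of lines in the full file
print(len(kept), len(dropped), len(example))
```

Because only the first column is compared, the same function would work on the two-column chromosome size file from step (b).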
## (b) generate chromosome size file

uses the `faidx` command from the python pyfaidx library

In [None]:
print("module load python/3.7")
print("cd {}".format(ANNOTATION_PATH))
print("faidx {} -i chromsizes -o sizes.genome".format(REF_GENOME_FASTA))

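The chromosome size file is a tab-delimited table of sequence name and sequence length. As a rough illustration of what it contains, the sketch below computes the same two columns for a toy FASTA in plain Python. It is not how pyfaidx is implemented, and `chrom_sizes` and the toy records are made up for this example.

```python
def chrom_sizes(fasta_lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # use the header text up to the first whitespace as the name
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length

toy = [">chr1 toy record\n", "ACGT\n", "AC\n", ">chr2\n", "GGG\n"]
for name, length in chrom_sizes(toy):
    print("{}\t{}".format(name, length))
```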
optional removal of anything but chr1-22, X and Y

In [None]:
print("mv {}/sizes.genome {}/sizes.genome.full".format(ANNOTATION_PATH, ANNOTATION_PATH))
print("grep \
-e chr1$'\\t' -e chr2$'\\t' -e chr3$'\\t' -e chr4$'\\t' -e chr5$'\\t' \
-e chr6$'\\t' -e chr7$'\\t' -e chr8$'\\t' -e chr9$'\\t' -e chr10$'\\t' \
-e chr11$'\\t' -e chr12$'\\t' -e chr13$'\\t' -e chr14$'\\t' -e chr15$'\\t' \
-e chr16$'\\t' -e chr17$'\\t' -e chr18$'\\t' -e chr19$'\\t' -e chr20$'\\t' \
-e chr21$'\\t' -e chr22$'\\t' -e chrX$'\\t' -e chrY$'\\t' \
{}/sizes.genome.full > {}/sizes.genome ".format(ANNOTATION_PATH, ANNOTATION_PATH))




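compare line counts: the first command counts the lines kept, the second counts the lines removed from the full file; together they should add up to the line count of the full file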

In [None]:
print("wc -l {}/sizes.genome".format(ANNOTATION_PATH))
print("grep -v \
-e chr1$'\\t' -e chr2$'\\t' -e chr3$'\\t' -e chr4$'\\t' -e chr5$'\\t' \
-e chr6$'\\t' -e chr7$'\\t' -e chr8$'\\t' -e chr9$'\\t' -e chr10$'\\t' \
-e chr11$'\\t' -e chr12$'\\t' -e chr13$'\\t' -e chr14$'\\t' -e chr15$'\\t' \
-e chr16$'\\t' -e chr17$'\\t' -e chr18$'\\t' -e chr19$'\\t' -e chr20$'\\t' \
-e chr21$'\\t' -e chr22$'\\t' -e chrX$'\\t' -e chrY$'\\t' \
{}/sizes.genome.full | wc -l".format(ANNOTATION_PATH))

## (c) index reference genome

create script to index reference genome files using bowtie2

In [None]:
import os
os.makedirs(SCRIPT_PATH, exist_ok=True) #SCRIPT_PATH is otherwise not created until step 2(a)
with open("{}/bowtie2_index_genome.sh".format(SCRIPT_PATH), "w") as text_file:
    print("#!/bin/bash \n\
module load bowtie \n\
cd {} \n\
bowtie2-build {} genome \n\
echo 'done'".format(ANNOTATION_PATH, REF_GENOME_FASTA), file = text_file)

In [None]:
print("cd {}".format(SCRIPT_PATH))
print("sbatch --mem=100g --cpus-per-task=10 --mail-type=ALL --time=24:00:00 bowtie2_index_genome.sh")

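Once the indexing job has finished, the annotation folder should hold six bowtie2 index files with the `genome` basename (`genome.1.bt2` through `genome.4.bt2`, plus `genome.rev.1.bt2` and `genome.rev.2.bt2`, or the `.bt2l` equivalents for a large index). A minimal check, assuming the hypothetical helper name `missing_bowtie2_index_files`:

```python
import os

def missing_bowtie2_index_files(index_dir, basename="genome"):
    """Return expected bowtie2 index files that are absent from index_dir."""
    parts = ["1", "2", "3", "4", "rev.1", "rev.2"]
    missing = []
    for part in parts:
        small = os.path.join(index_dir, "{}.{}.bt2".format(basename, part))
        large = small + "l"  # large indexes use the .bt2l extension
        if not (os.path.exists(small) or os.path.exists(large)):
            missing.append(small)
    return missing

# Replace "." with ANNOTATION_PATH once the sbatch job has finished.
print(missing_bowtie2_index_files("."))
```

An empty list means all six files were found.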
## (d) Setup config file  
save it in the paths stored in the `CONFIG_FILE` and `CONFIG_FILE2` variables  
`CONFIG_FILE` will be used for jobs that run on the split fastqs  
`CONFIG_FILE2` will be used for jobs that run on the results of merging the mapped split fastqs  
the only difference between the two files is the `JOB_MEM` field  
(ex: set to `5G` in `CONFIG_FILE` and `100G` in `CONFIG_FILE2`).
 - `CONFIG_FILE` job memory ends up being `JOB_MEM` * `N_CPU`
 - `CONFIG_FILE2` job memory ends up being `JOB_MEM`, so it needs to be set higher than the `JOB_MEM` in `CONFIG_FILE`

https://nservant.github.io/HiC-Pro/MANUAL.html#setting-the-configuration-file

example config file here: /usr/local/apps/hicpro/config_example.txt
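Since the two config files differ only in `JOB_MEM`, the second can be derived from the first. The sketch below assumes the `KEY = value` line layout of the HiC-Pro config file; `set_config_value` is a hypothetical helper, the two-line config excerpt is invented, and `N_CPU = 10` is an assumed value used only to show the memory arithmetic.

```python
import re

def set_config_value(config_text, key, value):
    """Return config_text with the `key = ...` line set to `value`."""
    pattern = re.compile(r"^{}\s*=.*$".format(re.escape(key)), re.MULTILINE)
    if not pattern.search(config_text):
        raise KeyError(key)
    return pattern.sub(lambda m: "{} = {}".format(key, value), config_text)

# Invented excerpt of CONFIG_FILE; a real file contains many more fields.
config_txt = "N_CPU = 10\nJOB_MEM = 5G\n"
config2_txt = set_config_value(config_txt, "JOB_MEM", "100G")
print(config2_txt)

# Memory arithmetic from the list above, with the assumed N_CPU of 10:
n_cpu, job_mem_gb = 10, 5
print("CONFIG_FILE job memory: {}G".format(n_cpu * job_mem_gb))
```

With these example values the split-fastq jobs would request 50G in total, which is why the 100G in `CONFIG_FILE2` is the larger of the two.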

---
# (2) Per Sample Run

## (a) Make sample directories  
```
|-WRKDIR  
|   |-annotation  
|   |-SAMPLE  
|   |   |-scripts  
|   |   |-splits  
|   |   |-logs  
```

In [None]:
%%bash -s "$SAMPLE_PATH" "$SPLIT_DATA_PATH" "$SCRIPT_PATH"  "$LOG_PATH"
SAMPLE_PATH=${1}
SPLIT_DATA_PATH=${2}
SCRIPT_PATH=${3}
LOG_PATH=${4}

# -p: no error if a folder already exists (safe to re-run)
mkdir -p "${SAMPLE_PATH}"
mkdir -p "${SPLIT_DATA_PATH}"
mkdir -p "${SCRIPT_PATH}"
mkdir -p "${LOG_PATH}"

##### path to R1 and R2 fastq file for sample  
these are the files that will be split into smaller fastqs in the next step


In [None]:
#replace with the real fastq paths (the original .format(SAMPLE) call had no effect without a {} placeholder)
R1_PATH = "/path/to/sample/R1.fastq.gz"
R2_PATH = "/path/to/sample/R2.fastq.gz"


## (b) Split fastq read files  
this splits the R1 and R2 fastq files into many smaller files,  
so the mapping can be run on each smaller file in parallel, which speeds up the mapping process

In [None]:

#write one split script per read file and print the matching sbatch command
for read, fastq_path in (("R1", R1_PATH), ("R2", R2_PATH)):
    with open("{}/split_{}_{}.sh".format(SCRIPT_PATH, SAMPLE, read), "w") as text_file:
        #mkdir -p so the second job does not fail on the folder made by the first
        print("#!/bin/bash \n\
module load hicpro \n\
cd {} \n\
mkdir -p {}/{} \n\
{}/bin/utils/split_reads.py -r {}/{} {} \n\
echo 'done'".format(LOG_PATH, SPLIT_DATA_PATH, SAMPLE, HICPRO_PATH, SPLIT_DATA_PATH, SAMPLE, fastq_path), file = text_file)
    print("sbatch --mem=100g --cpus-per-task=10 --mail-type=ALL --time=10:00:00 --error={}/split_{}_{}.err --output={}/split_{}_{}.out {}/split_{}_{}.sh".format(LOG_PATH, SAMPLE, read, LOG_PATH, SAMPLE, read, SCRIPT_PATH, SAMPLE, read))

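To make the idea of splitting concrete, the sketch below cuts a FASTQ stream into chunks of a fixed number of reads (one FASTQ record is four lines). It is an illustration only, not the implementation of HiC-Pro's `split_reads.py`; `chunk_fastq` and the toy reads are made up. R1 and R2 need to be split with the same chunk size so that chunk N of R1 still pairs with chunk N of R2.

```python
from itertools import islice

def chunk_fastq(lines, reads_per_chunk):
    """Yield lists of FASTQ lines, each holding up to reads_per_chunk reads."""
    it = iter(lines)
    lines_per_chunk = 4 * reads_per_chunk  # one FASTQ record is 4 lines
    while True:
        chunk = list(islice(it, lines_per_chunk))
        if not chunk:
            break
        yield chunk

# Five invented reads, split into chunks of two reads each.
toy = []
for i in range(5):
    toy += ["@read{}\n".format(i), "ACGT\n", "+\n", "IIII\n"]

chunks = list(chunk_fastq(toy, reads_per_chunk=2))
print([len(c) // 4 for c in chunks])
```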
## (c) Run HiC-Pro in Stepwise Mode

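Every stepwise run in this section uses the same command shape with a different step name, input folder and config file. The sketch below collects the combinations used in this notebook in one place; `hicpro_command` is a hypothetical helper, the sample path is a placeholder, and the config files are shown by file name only.

```python
def hicpro_command(step, input_dir, output_dir, config, parallel):
    """Build the HiC-Pro command line used for one stepwise run."""
    cmd = ["HiC-Pro", "-i", input_dir, "-o", output_dir, "-c", config]
    if parallel:
        cmd.append("-p")  # parallel mode: HiC-Pro writes scripts to submit
    cmd += ["-s", step]
    return " ".join(cmd)

SAMPLE_PATH = "/path/to/all/output/folders/35233"  # placeholder
# (step, input folder, config file, run with -p?) as used in this notebook
STEPS = [
    ("mapping", SAMPLE_PATH + "/splits", "config.txt", True),
    ("proc_hic", SAMPLE_PATH + "/bowtie_results/bwt2", "config.txt", True),
    ("quality_checks", SAMPLE_PATH + "/hic_results/data", "config2.txt", True),
    ("merge_persample", SAMPLE_PATH + "/hic_results/data", "config2.txt", False),
    ("build_contact_maps", SAMPLE_PATH + "/hic_results/data", "config2.txt", False),
    ("ice_norm", SAMPLE_PATH + "/hic_results/matrix", "config2.txt", False),
]
for step, input_dir, config, parallel in STEPS:
    print(hicpro_command(step, input_dir, SAMPLE_PATH, config, parallel))
```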
### (i) Stepwise Mapping  
this will create a `HiCPro_step1_hicpro.sh` script in `SAMPLE_PATH` to submit with `sbatch`

In [None]:
%%bash -s "$LOG_PATH" "$SPLIT_DATA_PATH" "$SAMPLE_PATH" "$CONFIG_FILE"
LOG_PATH=${1}
SPLIT_DATA_PATH=${2}
SAMPLE_PATH=${3}
CONFIG_FILE=${4}
module load hicpro
cd ${LOG_PATH}
HiC-Pro -i ${SPLIT_DATA_PATH} -o ${SAMPLE_PATH} -c ${CONFIG_FILE} -p -s mapping

In [None]:
print("cd {}".format(SAMPLE_PATH))
print("sbatch {}/HiCPro_step1_hicpro.sh".format(SAMPLE_PATH))

### (ii) Stepwise Hi-C filtering (proc_hic)  
takes in the files created in `bowtie_results/bwt2` from the previous mapping step  
this will overwrite the previous `HiCPro_step1_hicpro.sh` with new code to run

In [None]:
%%bash -s "$LOG_PATH" "$SAMPLE_PATH" "$CONFIG_FILE"
LOG_PATH=${1}
SAMPLE_PATH=${2}
CONFIG_FILE=${3}
module load hicpro
cd ${LOG_PATH}
HiC-Pro -i ${SAMPLE_PATH}/bowtie_results/bwt2 -o ${SAMPLE_PATH} -c ${CONFIG_FILE} -p -s proc_hic

In [None]:
print("cd {}".format(SAMPLE_PATH))
print("sbatch {}/HiCPro_step1_hicpro.sh".format(SAMPLE_PATH))

### (iii) Quality Control 
takes in the files created in `hic_results/data` from the previous filtering step  
this will create a `HiCPro_step2_hicpro.sh` with new code to run  
(may fail on the plot creation step)

In [None]:
%%bash -s "$LOG_PATH" "$SAMPLE_PATH" "$CONFIG_FILE2"
LOG_PATH=${1}
SAMPLE_PATH=${2}
CONFIG_FILE2=${3}
module load hicpro
cd ${LOG_PATH}
HiC-Pro -i ${SAMPLE_PATH}/hic_results/data -o ${SAMPLE_PATH} -c ${CONFIG_FILE2} -p -s quality_checks

In [None]:
print("cd {}".format(SAMPLE_PATH))
print("sbatch {}/HiCPro_step2_hicpro.sh".format(SAMPLE_PATH))

### (iv) Merge Per Sample  
takes in `.validPairs` files in `hic_results/data`  
do not run this step with the parallel `-p` option  
this will create a `.allValidPairs` file in `hic_results/data`


In [None]:
with open ("{}/hicpro_merge_persample.sh".format(SCRIPT_PATH), "w") as text_file:
    print("#!/bin/bash \n\
module load hicpro \n\
cd {}/hic_results/data \n\
HiC-Pro -i {}/hic_results/data -o {} -c {} -s merge_persample \n\
echo 'done'".format(SAMPLE_PATH,SAMPLE_PATH,SAMPLE_PATH,CONFIG_FILE2), file = text_file)
    text_file.close()

In [None]:
print("cd {}".format(SAMPLE_PATH))
print("sbatch --cpus-per-task=10 --mem=100g --mail-type=ALL --error={}/hicpro_merge_persample.err --output={}/hicpro_merge_persample.out {}/hicpro_merge_persample.sh".format(LOG_PATH,LOG_PATH, SCRIPT_PATH))

### (v) Build Contact Maps
takes in the `.allValidPairs` file created in `hic_results/data` by the previous merge step  
this will create a `matrix` folder in the `hic_results` folder with the contact matrices at different resolutions

In [None]:
with open ("{}/hicpro_build_contact_maps.sh".format(SCRIPT_PATH), "w") as text_file:
    print("#!/bin/bash \n\
module load hicpro \n\
cd {}/hic_results/data \n\
HiC-Pro -i {}/hic_results/data -o {} -c {} -s build_contact_maps \n\
echo 'done'".format(SAMPLE_PATH,SAMPLE_PATH,SAMPLE_PATH,CONFIG_FILE2), file = text_file)
    text_file.close()

In [None]:
print("cd {}".format(SAMPLE_PATH))
print("sbatch --cpus-per-task=2 --mem=20g --mail-type=ALL --error={}/hicpro_build_contact_maps.err --output={}/hicpro_build_contact_maps.out {}/hicpro_build_contact_maps.sh".format(LOG_PATH,LOG_PATH, SCRIPT_PATH))

### (vi) ICE Normalization  
takes in matrix files in `hic_results/matrix`  
this will create an `iced` folder in the `hic_results/matrix` folder with normalized contact maps

In [None]:
with open ("{}/hicpro_ice_norm.sh".format(SCRIPT_PATH), "w") as text_file:
    print("#!/bin/bash \n\
module load hicpro \n\
cd {}/hic_results/matrix \n\
HiC-Pro -i {}/hic_results/matrix -o {} -c {} -s ice_norm \n\
echo 'done'".format(SAMPLE_PATH,SAMPLE_PATH,SAMPLE_PATH,CONFIG_FILE2), file = text_file)
    text_file.close()

In [None]:
print("cd {}".format(SAMPLE_PATH))
print("sbatch --cpus-per-task=2 --mem=20g --mail-type=ALL --error={}/hicpro_ice_norm.err --output={}/hicpro_ice_norm.out {}/hicpro_ice_norm.sh".format(LOG_PATH,LOG_PATH, SCRIPT_PATH))