### Defining orthologous boundaries between _D. melanogaster_ and _D. triauraria_  

__Versions__  
HiCExplorer: 2.2  
gcc: 5.4  
halLiftover: 2.1  
bedtools: 2.29.0  
GNU Awk: 4.0.2    
join (GNU coreutils): 8.22  
sort (GNU coreutils): 8.22

RepeatMasker: 4.0.9  
Perl: 5.26.2  
Perl-text-soundex: 3.05   
RM_Blast: 2.9.0  
TRF: 4.0.9  
RepeatDatabase: RepBase

Cactus: (no number, original?)  
Python: 2.7  
Singularity: 2.7.12  

#### Cactus:
Align the two soft-masked genomes and produce a `.hal` alignment file.

Input files:

In [None]:
Dtriauroria /3_muller_o.fasta.masked
Dmelanogaster /dmel-all-chromosome-r6.22_extract_ABCDEF.fasta.masked
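
The two lines above look like the genome entries of the Cactus seqFile (`dmel_dtri/dmel_dtri.txt` in the script below). As a rough sketch, such a file could be assembled as follows. The Newick tree on the first line is an assumption, since the original file is not shown; the paths are copied as they appear above.

```shell
# Hypothetical reconstruction of a two-genome Cactus seqFile.
# Assumption: the tree line is a guess; the original dmel_dtri.txt is not shown.
seqfile=$(mktemp)
cat > "$seqfile" <<'EOF'
(Dmelanogaster,Dtriauroria);
Dtriauroria /3_muller_o.fasta.masked
Dmelanogaster /dmel-all-chromosome-r6.22_extract_ABCDEF.fasta.masked
EOF

# Genome names listed after the tree line. These are the same names that
# are later passed to halLiftover.
genomes=$(awk 'NR > 1 { print $1 }' "$seqfile" | sort | paste -sd, -)
echo "$genomes"
rm -f "$seqfile"
```

The genome names must match between the tree line, the path lines, and every later halLiftover call, which is why the `Dtriauroria` spelling is kept as-is throughout.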

In [None]:
#!/bin/bash
#SBATCH -p ellison_1
#SBATCH -c 4
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=200G
#SBATCH -M amarel
#SBATCH -t 32:00:00

USER=    # set to your cluster user name before running; the scratch paths below depend on it

srun hostname

srun date
ml
module load singularity python/2.7.12
ml
rm -rf /scratch/$USER/singularity_temp/*


cd /scratch/nt365/oarc/cactus


source cactus_oarc/bin/activate
which cactus

toil clean /scratch/$USER/singularity_temp/jobStore

export SINGULARITY_CACHEDIR=/scratch/$USER/singularity_temp
export SINGULARITY_TMPDIR=/scratch/$USER/singularity_temp
export TOIL_WORKDIR=/scratch/$USER/singularity_temp

env | grep singularity_temp
pip list  #print the current python packages

cactus /scratch/$USER/singularity_temp/jobStore dmel_dtri/dmel_dtri.txt dmel_dtri/dmel_dtri.hal --binariesMode singularity
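
Because the script runs `rm -rf /scratch/$USER/singularity_temp/*`, an empty `USER` would expand to `/scratch//singularity_temp/*`. A minimal guard could look like the sketch below; `scratch_dir` and the user name `demo_user` are hypothetical and not part of the original script.

```shell
# Hypothetical helper (not part of the original script): build the scratch
# path only when a user name is supplied, so cleanup never runs on a bare
# /scratch//singularity_temp path.
scratch_dir() {
    local user="$1"
    if [ -z "$user" ]; then
        echo "error: user name is empty; refusing to build a scratch path" >&2
        return 1
    fi
    printf '/scratch/%s/singularity_temp\n' "$user"
}

# Example with a made-up user name.
demo_dir=$(scratch_dir "demo_user")
echo "$demo_dir"

# An empty name is rejected instead of silently producing a path.
if scratch_dir "" 2>/dev/null; then
    guard_status="accepted"
else
    guard_status="rejected"
fi
echo "$guard_status"
```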


#### RepeatMasker:
Soft-mask repetitive regions (`-xsmall`) in each FASTA file. The resulting `.masked` files are the inputs to Cactus above.

In [None]:
#!/bin/bash

#SBATCH --partition=main             # Partition (job queue)
#SBATCH --requeue                    # Return job to the queue if preempted
#SBATCH --job-name=repmask          # Assign a short name to your job
#SBATCH --nodes=1                    # Number of nodes you require
#SBATCH --ntasks=1                  # Total # of tasks across all nodes
#SBATCH --cpus-per-task=28            # Cores per task (>1 if multithread tasks)
#SBATCH --mem=150G                 # Real memory (RAM) required (here 150 GB)
#SBATCH --time=72:00:00              # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out     # STDOUT output file
#SBATCH --error=slurm.%N.%j.err      # STDERR output file (optional)
#SBATCH --export=ALL                 # Export your current env to the job env
#SBATCH --array=1-3

# Each array task (1-3) soft-masks one input file, e.g. 3_muller_o.fasta
perl /RepeatMasker/RepeatMasker "${SLURM_ARRAY_TASK_ID}_muller_o.fasta" -species drosophila -xsmall
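
With `-xsmall`, RepeatMasker writes repeats in lowercase instead of replacing them with N. One quick sanity check on a `.masked` file is the fraction of lowercase bases. The sketch below runs on a toy FASTA; `masked_fraction` and the sequence are invented for illustration.

```shell
# Hypothetical helper: fraction of sequence characters that are lowercase
# (soft-masked) in a FASTA file. Header lines starting with ">" are skipped.
masked_fraction() {
    awk '!/^>/ {
             total += length($0)
             seq = $0
             lower += gsub(/[acgtn]/, "", seq)   # gsub returns the number of matches
         }
         END {
             if (total > 0) printf "%.3f\n", lower / total
             else print "NA"
         }' "$1"
}

# Toy input: 20 bases, 6 of them lowercase.
toy_fasta=$(mktemp)
printf '>toy_muller_o\nACGTacgtAC\nggNNACGTAC\n' > "$toy_fasta"

fraction=$(masked_fraction "$toy_fasta")
echo "$fraction"
rm -f "$toy_fasta"
```

On the toy input this prints `0.300`. A value of zero on a real `.masked` file would suggest that masking did not run or that hard masking was used instead.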

#### Pipeline to identify orthologous boundaries:

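Use the bedtools `merge` utility to keep only TAD boundaries identified by HiCExplorer in both replicates: boundaries wider than 10,000 bp are dropped, the remaining calls from all replicates are merged, and merged intervals built from exactly two calls are retained.  
These are **high confidence** boundaries.  
Then use awk to extract the central 5,000 bp window (midpoint ± 2,500 bp) of each boundary.  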

In [None]:
cat dup_*/dtri_boundaries.bed | sort -k1,1 -k2,2n | awk '$3-$2<=10000' | bedtools merge -i - -c 1 -o count | awk '$4==2' | awk '{print $1"\t"$2"\t"$3"\tDTRI_BOUNDARY_"NR}' > ./dmel_dtri_merge/dtri_boundaries_merge
cat dup_*/dmel_boundaries.bed | sort -k1,1 -k2,2n | awk '$3-$2<=10000' | bedtools merge -i - -c 1 -o count | awk '$4==2' | awk '{print $1"\t"$2"\t"$3"\tDMEL_BOUNDARY_"NR}' > ./dmel_dtri_merge/dmel_boundaries_merge
awk -F "\t" '{b=int(($2+$3)/2);c=b+2500;d=b-2500;print $1"\t" d"\t" c"\t" $4}' ./dmel_dtri_merge/dtri_boundaries_merge > ./dmel_dtri_merge/dtri_merge_mid5000
awk -F "\t" '{b=int(($2+$3)/2);c=b+2500;d=b-2500;print $1"\t" d"\t" c"\t" $4}' ./dmel_dtri_merge/dmel_boundaries_merge > ./dmel_dtri_merge/dmel_merge_mid5000
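
To make the filtering logic concrete, the sketch below walks toy boundary calls from two replicates through the same steps using only `sort` and `awk`. The awk merge is a simplified stand-in for `bedtools merge -c 1 -o count`, not the real tool, and the coordinates and the `DEMO_BOUNDARY_` prefix are invented.

```shell
# Toy boundary calls from two replicates (chrom, start, end); all values invented.
rep1=$(printf 'chr2L\t10000\t16000\nchr2L\t50000\t55000\nchr2L\t100000\t130000\n')
rep2=$(printf 'chr2L\t12000\t18000\nchr2L\t80000\t85000\n')

high_conf=$(printf '%s\n%s\n' "$rep1" "$rep2" |
    sort -k1,1 -k2,2n |
    awk '$3 - $2 <= 10000' |
    awk -v OFS='\t' '
        # Simplified stand-in for "bedtools merge -c 1 -o count" on sorted input.
        NR > 1 && $1 == chrom && $2 + 0 <= cur_end {
            if ($3 + 0 > cur_end) cur_end = $3 + 0
            n++
            next
        }
        NR > 1 { print chrom, start, cur_end, n }
        { chrom = $1; start = $2 + 0; cur_end = $3 + 0; n = 1 }
        END { if (NR > 0) print chrom, start, cur_end, n }' |
    awk -v OFS='\t' '$4 == 2 { print $1, $2, $3, "DEMO_BOUNDARY_" ++k }')
echo "$high_conf"

# Central 5,000 bp window around the midpoint of each high confidence boundary.
mid_window=$(printf '%s\n' "$high_conf" |
    awk -F'\t' -v OFS='\t' '{ mid = int(($2 + $3) / 2); print $1, mid - 2500, mid + 2500, $4 }')
echo "$mid_window"
```

Here the 30,000 bp call is dropped by the width filter, the two overlapping calls merge into `chr2L 10000 18000` with a count of 2, and the resulting window is `chr2L 11500 16500`. The two calls seen in only one replicate are discarded at the count filter.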

Use the same bedtools `merge` pipeline to keep TAD boundaries identified by HiCExplorer in only one replicate (merged intervals built from a single call).  
These are **low confidence** boundaries.  

In [None]:
cat dup_*/dmel_boundaries.bed | sort -k1,1 -k2,2n | awk '$3-$2<=10000' | bedtools merge -i - -c 1 -o count | awk '$4==1' | awk '{print $1"\t"$2"\t"$3"\tDMEL_BOUNDARY_"NR}' > ./dmel_dtri_merge/dmel_boundaries_lc
cat dup_*/dtri_boundaries.bed | sort -k1,1 -k2,2n | awk '$3-$2<=10000' | bedtools merge -i - -c 1 -o count | awk '$4==1' | awk '{print $1"\t"$2"\t"$3"\tDTRI_BOUNDARY_"NR}' > ./dmel_dtri_merge/dtri_boundaries_lc
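
The `$4==2` and `$4==1` filters split merged intervals by how many input records were collapsed into each. An interval built from three or more records (which could happen, for example, if adjacent calls within one replicate touch) matches neither filter, so it may be worth tallying the count column before filtering. A sketch on an invented merged table:

```shell
# Invented output of the merge step: chrom, start, end, number of merged records.
merged=$(printf 'chr2L\t10000\t18000\t2\nchr2L\t50000\t55000\t1\nchr2L\t80000\t85000\t1\nchr2R\t20000\t31000\t3\n')

# Tally merged intervals by support: "support<TAB>intervals", sorted by support.
tally=$(printf '%s\n' "$merged" |
    awk -F'\t' -v OFS='\t' '{ seen[$4]++ } END { for (s in seen) print s, seen[s] }' |
    sort -k1,1n)
echo "$tally"

# Intervals that neither the ==2 nor the ==1 filter would keep.
unassigned=$(printf '%s\n' "$merged" | awk -F'\t' '$4 > 2' | wc -l | tr -d ' ')
echo "$unassigned"
```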

Use halLiftover to map the high confidence TAD boundaries identified by HiCExplorer in dmel (original species) to their corresponding coordinates in dtri (target species).

Use bedtools merge to collapse lifted-over intervals that lie within 5,000 bp of each other.
Use awk to remove lifted-over features shorter than 500 bp. Because 500 bp is 1/10 of the average TAD boundary size (5,000 bp), we consider features shorter than 500 bp likely artifacts.

In [None]:
#!/bin/bash

#SBATCH --partition=main    # Partition (job queue)
#SBATCH --job-name=halsnps         # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1                # Number of compute nodes
#SBATCH --ntasks=1               # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28       # Threads per process (or per core)
#SBATCH --export=ALL             # Export your current environment settings to the job environment
#SBATCH --time=12:00:00
#SBATCH --mem=100G
#SBATCH --output=/slurm-%A_%a.out

OF=dmel_to_dtri_boundaries/dmel_to_dtri_fdr_ft_then_merge
IF=dmel_dtri_merge/dmel_merge_mid5000


module load gcc

halLiftover dmel_dtri.hal Dmelanogaster $IF  Dtriauroria $OF
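# Merge lifted-over fragments within 5,000 bp of each other; column 4 lists the distinct source boundary names
sort -k1,1 -k2,2n "$OF" | bedtools merge -i - -d 5000 -c 4 -o distinct > "${OF}_merge.bed"
# Drop merged features shorter than 500 bp, then turn the first comma in the name column into a tab
awk -F "\t" -v OFS="\t" '$3-$2>=500 {print $1,$2,$3,$4}' "${OF}_merge.bed" | sed -r 's/,/\t/' > "${OF}_merge_500bp.bed"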
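
The size filter and name split at the end of this script can be tried on toy merged intervals without bedtools. The sketch uses awk's `sub()` as a portable stand-in for the `sed -r 's/,/\t/'` step, and the rows are invented.

```shell
# Invented merged liftover intervals: chrom, start, end, comma-joined source names.
merged=$(printf 'chrX\t1000\t1300\tDMEL_BOUNDARY_7\nchrX\t20000\t26000\tDMEL_BOUNDARY_8,DMEL_BOUNDARY_9\nchrX\t40000\t40500\tDMEL_BOUNDARY_10\n')

# Keep intervals of at least 500 bp, then turn the first comma in the name
# column into a tab so a second source boundary lands in column 5.
filtered=$(printf '%s\n' "$merged" |
    awk -F'\t' -v OFS='\t' '$3 - $2 >= 500 { sub(/,/, "\t", $4); print $1, $2, $3, $4 }')
echo "$filtered"

kept=$(printf '%s\n' "$filtered" | wc -l | tr -d ' ')
echo "$kept"
```

The 300 bp interval is dropped, the 500 bp interval is kept (the threshold is inclusive), and the row with two source names ends up with five columns while the other keeps four.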

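Run halLiftover in the reverse direction to map the high confidence dtri (original species) TAD boundaries to their corresponding coordinates in dmel (target species). Merging and size filtering are the same as above.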

In [None]:
#!/bin/bash

#SBATCH --partition=main    # Partition (job queue)
#SBATCH --job-name=halsnps         # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1                # Number of compute nodes
#SBATCH --ntasks=1               # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28       # Threads per process (or per core)
#SBATCH --export=ALL             # Export your current environment settings to the job environment
#SBATCH --time=12:00:00
#SBATCH --mem=100G
#SBATCH --output=slurm-%A_%a.out

OF=/dtri_to_dmel_boundaries/dtri_to_dmel_fdr_ft_then_merge
IF=/dmel_dtri_merge/dtri_merge_mid5000


module load gcc

halLiftover /dmel_dtri.hal Dtriauroria $IF  Dmelanogaster $OF 
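# Merge lifted-over fragments within 5,000 bp of each other; column 4 lists the distinct source boundary names
sort -k1,1 -k2,2n "$OF" | bedtools merge -i - -d 5000 -c 4 -o distinct > "${OF}_merge.bed"
# Drop merged features shorter than 500 bp, then turn the first comma in the name column into a tab
awk -F "\t" -v OFS="\t" '$3-$2>=500 {print $1,$2,$3,$4}' "${OF}_merge.bed" | sed -r 's/,/\t/' > "${OF}_merge_500bp.bed"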
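
The two liftover scripts differ only in which genome is the source and which is the target. As a sketch, a dry-run helper can print the halLiftover call for a given direction without running it. `print_liftover_cmd` and the output file names are hypothetical; the argument order follows the calls in the scripts above.

```shell
# Hypothetical helper: print (do not run) the halLiftover call for one direction.
# Argument order mirrors the scripts above: HAL file, source genome, source BED,
# target genome, output BED.
print_liftover_cmd() {
    local hal="$1" src="$2" src_bed="$3" tgt="$4" out_bed="$5"
    printf 'halLiftover %s %s %s %s %s\n' "$hal" "$src" "$src_bed" "$tgt" "$out_bed"
}

# Output file names here are made up for the example.
forward=$(print_liftover_cmd dmel_dtri.hal Dmelanogaster dmel_merge_mid5000 Dtriauroria dmel_to_dtri.bed)
reverse=$(print_liftover_cmd dmel_dtri.hal Dtriauroria dtri_merge_mid5000 Dmelanogaster dtri_to_dmel.bed)
echo "$forward"
echo "$reverse"
```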

The following commands are run on the liftover output in each direction (_D. triauraria_ to _D. melanogaster_ and _D. melanogaster_ to _D. triauraria_):
1. Pad the merged liftover output to five columns (`NA` where only one boundary name is present) for input to bedtools.
2. Count how many unique boundaries in the original species have corresponding coordinates in the target species.
3. Find the lifted-over boundaries that lie within 5,000 bp of a high confidence boundary in the target species.
4. Count the unique original-species boundaries among the matches from step 3.
5. Find the lifted-over boundaries that lie within 5,000 bp of a low confidence boundary in the target species.
6. Count the unique original-species boundaries among the matches from step 5.

Orthologous boundaries = number of high confidence matches + number of low confidence matches.

In [None]:
1. awk -F'\t' '!$5{ $5="NA" }1 {print $1"\t" $2"\t" $3"\t" $4"\t"$5}' dmel_to_dtri_fdr_ft_then_merge_merge_500bp.bed > dmel_to_dtri_fdr_ft_then_merge_merge_500bp_f.bed
2. cut -f 4,5 dmel_to_dtri_fdr_ft_then_merge_merge_500bp_f.bed | tr "\t" "\n" | tr "," "\n" | sort | uniq | grep -c DMEL
3. bedtools closest -d -a dmel_to_dtri_fdr_ft_then_merge_merge_500bp_f.bed -b /projects/genetics/ellison_lab/nicole/fdr_boundary_domain_files/dmel_dtri_merge/dtri_merge_mid5000 | awk '$10<=5000' > conserved_dmel_to_dtri_boundaries 
4. cut -f 4,5 conserved_dmel_to_dtri_boundaries  | tr "\t" "\n" | tr "," "\n" | sort | uniq | grep -c DMEL
5. bedtools closest -d -a dmel_to_dtri_fdr_ft_then_merge_merge_500bp_f.bed -b /projects/genetics/ellison_lab/nicole/fdr_boundary_domain_files/dmel_dtri_merge/dtri_boundaries_lc | awk '$10<=5000' > nonconserved_dmel_to_dtri_boundaries 
6. cut -f 4,5 nonconserved_dmel_to_dtri_boundaries | tr "\t" "\n" | tr "," "\n" | sort | uniq | grep -c DMEL
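
Steps 1 and 2 can be tried on invented rows without bedtools. The sketch pads a missing fifth column with `NA` (using an explicit empty-string test rather than the `!$5` pattern above) and then counts unique source boundary names with the same `cut | tr | sort | uniq | grep -c` chain.

```shell
# Invented filtered liftover rows; the second and third rows have no fifth column.
rows=$(printf 'chrX\t20000\t26000\tDMEL_BOUNDARY_8\tDMEL_BOUNDARY_9\nchrX\t40000\t40500\tDMEL_BOUNDARY_10\nchrX\t70000\t71000\tDMEL_BOUNDARY_8\n')

# Step 1: pad a missing fifth column with "NA" so every row has five fields.
padded=$(printf '%s\n' "$rows" |
    awk -F'\t' -v OFS='\t' '{ if ($5 == "") $5 = "NA"; print $1, $2, $3, $4, $5 }')
echo "$padded"

# Step 2: count unique source boundary names across columns 4 and 5.
unique_sources=$(printf '%s\n' "$padded" |
    cut -f 4,5 | tr '\t' '\n' | tr ',' '\n' | sort | uniq | grep -c DMEL)
echo "$unique_sources"
```

`DMEL_BOUNDARY_8` appears in two lifted-over intervals but is counted once, so the toy result is 3 unique source boundaries.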

In [None]:
1. awk -F'\t' '!$5{ $5="NA" }1 {print $1"\t" $2"\t" $3"\t" $4"\t"$5}' dtri_to_dmel_fdr_ft_then_merge_merge_500bp.bed > dtri_to_dmel_fdr_ft_then_merge_merge_500bp_f.bed
2. cut -f 4,5 dtri_to_dmel_fdr_ft_then_merge_merge_500bp_f.bed | tr "\t" "\n" | tr "," "\n" | sort | uniq | grep -c DTRI
3. bedtools closest -d -a dtri_to_dmel_fdr_ft_then_merge_merge_500bp_f.bed -b /projects/genetics/ellison_lab/nicole/fdr_boundary_domain_files/dmel_dtri_merge/dmel_merge_mid5000 | awk '$10<=5000' > conserved_dtri_to_dmel_boundaries
4. cut -f 4,5 conserved_dtri_to_dmel_boundaries | tr "\t" "\n" | tr "," "\n" | sort | uniq | grep -c DTRI
5. bedtools closest -d -a dtri_to_dmel_fdr_ft_then_merge_merge_500bp_f.bed -b /projects/genetics/ellison_lab/nicole/fdr_boundary_domain_files/dmel_dtri_merge/dmel_boundaries_lc | awk '$10<=5000' > nonconserved_dtri_to_dmel_boundaries
6. cut -f 4,5 nonconserved_dtri_to_dmel_boundaries| tr "\t" "\n" | tr "," "\n" | sort | uniq | grep -c DTRI
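
The `$10<=5000` filter works because the lifted-over file has five columns and the boundary file has four, so the distance appended by `bedtools closest -d` lands in column 10. The sketch below applies the filter and the final sum to hand-written rows shaped like that output; the rows are invented, not real bedtools output, and `count_near` is a hypothetical helper.

```shell
# Hand-written rows shaped like "bedtools closest -d" output for a 5-column -a
# file and a 4-column -b file, so the distance sits in column 10. Values invented.
hc_hits=$(printf 'chr2L\t100\t900\tDTRI_BOUNDARY_1\tNA\tchr2L\t500\t5500\tDMEL_BOUNDARY_4\t0\nchr2L\t60000\t61000\tDTRI_BOUNDARY_2\tNA\tchr2L\t90000\t95000\tDMEL_BOUNDARY_5\t29001\n')
lc_hits=$(printf 'chr2L\t60000\t61000\tDTRI_BOUNDARY_2\tNA\tchr2L\t62000\t67000\tDMEL_BOUNDARY_3\t1001\n')

# Count unique lifted-over boundaries within 5,000 bp of a boundary in the target.
count_near() {
    printf '%s\n' "$1" | awk -F'\t' '$10 <= 5000' |
        cut -f 4,5 | tr '\t' '\n' | tr ',' '\n' | sort -u | grep -c DTRI
}

hc_count=$(count_near "$hc_hits")
lc_count=$(count_near "$lc_hits")
orthologous=$((hc_count + lc_count))
echo "$hc_count $lc_count $orthologous"
```

In this toy case one boundary matches a high confidence boundary and one matches a low confidence boundary, giving 2 orthologous boundaries.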