### Defining orthologous domains between _D. melanogaster_ and _D. triauraria_  

__Versions__  
HiCExplorer: 2.2     
gcc: 5.4  
halLiftover: 2.1  
bedtools: 2.29.0  
GNU Awk: 4.0.2    
join (GNU coreutils): 8.22  
sort (GNU coreutils): 8.22

First, use bedtools intersect to find the TAD domains identified in both replicates for Dmel and Dtri. Two domains count as the same domain when both their start and their stop coordinates differ by no more than 5 kb (5000 bp). Average the starts and the stops of each intersecting pair. These are **high confidence** domains. Reassign domain numbers to these high confidence domains.   
Sort the list of high confidence domains and report their sizes. The number of lines in the intersected file is the number of high confidence domains.   
Concatenate and sort all of the domains from both replicates. 

In [None]:
bedtools intersect -wa -wb -a ./dup_1/dtri_domains.bed -b ./dup_2/dtri_domains.bed | awk '(($12-$3>=0 && $12-$3<=5000) || ($3-$12>=0 && $3-$12<=5000)) && (($11-$2>=0 && $11-$2<=5000) || ($2-$11>=0 && $2-$11<=5000))' | awk '{print  $1"\t"($2+$11)/2"\t"($3+$12)/2"\t""DTRI_DOMAINS_"NR}' >  ./dmel_dtri_merge/dtri_domains_int
sort -k1,1 -k2,2n ./dmel_dtri_merge/dtri_domains_int > ./dmel_dtri_merge/dtri_domains_int_sorted
wc -l ./dmel_dtri_merge/dtri_domains_int
cat ./dmel_dtri_merge/dtri_domains_int | awk '{print $4"\t"$3-$2}' | sort -k1b,1 > ./dmel_dtri_merge/dtri_domains_int_sizes
cat ./dup_*/dtri_domains.bed |sort -k1,1 -k2,2n > ./dmel_dtri_merge/dtri_domains_cat_sorted

bedtools intersect -wa -wb -a ./dup_1/dmel_domains.bed -b ./dup_2/dmel_domains.bed | awk '(($12-$3>=0 && $12-$3<=5000) || ($3-$12>=0 && $3-$12<=5000)) && (($11-$2>=0 && $11-$2<=5000) || ($2-$11>=0 && $2-$11<=5000))' | awk '{print  $1"\t"($2+$11)/2"\t"($3+$12)/2"\t""DMEL_DOMAINS_"NR}' >  ./dmel_dtri_merge/dmel_domains_int
sort -k1,1 -k2,2n ./dmel_dtri_merge/dmel_domains_int > ./dmel_dtri_merge/dmel_domains_int_sorted
cat ./dmel_dtri_merge/dmel_domains_int | awk '{print $4"\t"$3-$2}' | sort -k1b,1 > ./dmel_dtri_merge/dmel_domains_int_sizes
cat ./dup_*/dmel_domains.bed |sort -k1,1 -k2,2n > ./dmel_dtri_merge/dmel_domains_cat_sorted
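
The 5 kb boundary test and the averaging step above can be checked on toy data without running bedtools. The sketch below assumes 9-column domain files, so that replicate 2 starts at column 10 of the `-wa -wb` output. The coordinates, the `TOY_DOMAINS_` prefix and the function names are invented for illustration. It uses an `abs()` helper in place of the paired inequalities, and wraps the averages in `int()` as a precaution so that a half-integer average cannot be printed in scientific notation.

```shell
# Toy stand-in for `bedtools intersect -wa -wb` output (18 columns, invented values).
toy_intersect() {
  printf 'chr2L\t10000\t60000\tID_1\t0.1\t.\t10000\t60000\t0,0,0\tchr2L\t15000\t60000\tID_1\t0.2\t.\t15000\t60000\t0,0,0\n'
  printf 'chr2L\t100000\t200000\tID_2\t0.1\t.\t100000\t200000\t0,0,0\tchr2L\t110000\t200000\tID_2\t0.2\t.\t110000\t200000\t0,0,0\n'
  printf 'chr3R\t5000\t45000\tID_3\t0.1\t.\t5000\t45000\t0,0,0\tchr3R\t5000\t40000\tID_3\t0.2\t.\t5000\t40000\t0,0,0\n'
}

# Keep pairs whose starts and ends both agree within `tol` bp, then average them.
# $1 = prefix for the new domain IDs (hypothetical naming).
high_confidence() {
  awk -v tol=5000 -v prefix="$1" '
    function abs(x) { return x < 0 ? -x : x }
    BEGIN { OFS = "\t" }
    abs($11 - $2) <= tol && abs($12 - $3) <= tol {
      n++
      print $1, int(($2 + $11) / 2), int(($3 + $12) / 2), prefix n
    }'
}

toy_intersect | high_confidence TOY_DOMAINS_
```

Only the first and third toy pairs pass, because the second pair has starts that are 10 kb apart.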


Lift over each domain individually. Copy the file _d*_domains_int_ into a separate folder and split it into files that each contain one line (one domain).    
Then remove the copied domains file (_dmel_domains_int_ or _dtri_domains_int_) from the folder with the individual domain files. 

In [None]:
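mkdir dmel_to_dtri_domains
cp ./dmel_dtri_merge/dmel_domains_int ./dmel_to_dtri_domains/
# split and rename must run inside the folder that holds the copied file
cd dmel_to_dtri_domains
# one file per domain (q00, q01, ...), then strip the "q" prefix and the leading zero
split -l 1 --numeric-suffixes dmel_domains_int q
rename q0 "" *
rename q "" *
rm dmel_domains_int
cd ..

mkdir dtri_to_dmel_domains
cp ./dmel_dtri_merge/dtri_domains_int ./dtri_to_dmel_domains/
cd dtri_to_dmel_domains
split -l 1 --numeric-suffixes dtri_domains_int q
rename q0 "" *
rename q "" *
rm dtri_domains_int
cd ..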
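
As a hedged alternative to `split` followed by `rename`, the one-domain-per-file step can be sketched with awk alone. This may be useful because `rename` exists in two incompatible implementations (util-linux and Perl). The temporary directory, the toy domains and the choice to name each file by its line number are assumptions made for this example and do not reproduce the file names produced above.

```shell
# Work in a throwaway directory; everything here is toy data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'chr2L\t12500\t60000\tTOY_DOMAINS_1\nchr3R\t5000\t42500\tTOY_DOMAINS_2\nchrX\t100\t9000\tTOY_DOMAINS_3\n' \
  > "$workdir/toy_domains_int"

mkdir "$workdir/per_domain"

# Write line N of the input to a file called N, closing each file as we go
# so that very long domain lists do not exhaust open file handles.
awk -v dir="$workdir/per_domain" '{ f = dir "/" NR; print > f; close(f) }' \
  "$workdir/toy_domains_int"

ls "$workdir/per_domain" | wc -l
```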

Liftover steps:
1. Run halLiftover on each individual domain. 
2. Merge the lifted over fragments that are within 20 kb of each other. 
3. Remove lifted over features that are shorter than 5000 bp, since this is 1/10th of the expected size of a domain. 
4. Use the bedtools groupby utility to calculate the total size of each lifted over domain. 

In [None]:
#!/bin/bash

#SBATCH --partition=main    # Partition (job queue)
#SBATCH --job-name=halsnps         # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1                # Number of compute nodes
#SBATCH --ntasks=1               # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28       # Threads per process (or per core)
#SBATCH --export=ALL             # Export your current environment settings to the job environment
#SBATCH --time=12:00:00
#SBATCH --mem=100G
#SBATCH --output=slurm-%A_%a.out



module load gcc
for file in /dmel_to_dtri_domains/* 
do 
	halLiftover dmel_dtri.hal Dmelanogaster $file Dtriauroria $file\.bed
done
for file in /dmel_to_dtri_domains/*.bed
do 
	sort -k1,1 -k2,2n $file | bedtools merge -i - -d 20000 -c 4 -o distinct > $file\_merge 
done

for file in /dmel_to_dtri_domains/*.bed_merge 
do 
	awk '$3-$2 >= 5000' $file > $file\_5000
done

#sometimes this needs to be run on the command line
for file in /dmel_to_dtri_domains/*.bed_merge_5000
do  
	awk '{print $0"\t"$3-$2}' $file | bedtools groupby -g 4 -c 5 -o sum  > $file\_group 
done



In [None]:
#!/bin/bash

#SBATCH --partition=main    # Partition (job queue)
#SBATCH --job-name=halsnps         # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1                # Number of compute nodes
#SBATCH --ntasks=1               # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28       # Threads per process (or per core)
#SBATCH --export=ALL             # Export your current environment settings to the job environment
#SBATCH --time=12:00:00
#SBATCH --mem=100G
#SBATCH --output=slurm-%A_%a.out


module load gcc
for file in /dtri_to_dmel_domains/* 
do 
	halLiftover dmel_dtri.hal Dtriauroria $file Dmelanogaster $file\.bed
done
for file in /dtri_to_dmel_domains/*.bed
do 
	sort -k1,1 -k2,2n $file | bedtools merge -i - -d 20000 -c 4 -o distinct > $file\_merge 
done

for file in /dtri_to_dmel_domains/*.bed_merge 
do 
	awk '$3-$2 >= 5000' $file > $file\_5000
done

for file in /dtri_to_dmel_domains/*.bed_merge_5000
do  
	awk '{print $0"\t"$3-$2}' $file | bedtools groupby -g 4 -c 5 -o sum  > $file\_group 
done
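
To see what the merge, size filter and size sum do to a single domain, the three post-liftover steps can be traced on toy fragments. The sketch below uses awk as an approximate stand-in for `bedtools merge -d` and `bedtools groupby`, so it runs without bedtools. It is not the code used in the jobs above, and all coordinates and function names are invented.

```shell
# Toy lifted over fragments for one domain: chrom, start, end, domain ID.
toy_liftover() {
  printf 'chrA\t1000\t9000\tTOY_DOMAINS_1\n'
  printf 'chrA\t15000\t30000\tTOY_DOMAINS_1\n'   # 6 kb gap: merged with the first fragment
  printf 'chrA\t80000\t83000\tTOY_DOMAINS_1\n'   # 50 kb gap and only 3 kb long: dropped later
  printf 'chrB\t500\t7500\tTOY_DOMAINS_1\n'      # different scaffold, 7 kb long: kept
}

# Approximation of `bedtools merge -d D` for input that carries a single ID.
merge_within() {
  sort -k1,1 -k2,2n | awk -v d="$1" 'BEGIN { OFS = "\t" }
    NR > 1 && $1 == chrom && $2 - end <= d { if ($3 > end) end = $3; next }
    NR > 1 { print chrom, start, end, id }
    { chrom = $1; start = $2; end = $3; id = $4 }
    END { if (NR > 0) print chrom, start, end, id }'
}

# Drop merged features shorter than the given length.
keep_min_size() { awk -v min="$1" '$3 - $2 >= min'; }

# Sum the remaining feature lengths per domain ID (what groupby -o sum reports).
lifted_size() {
  awk 'BEGIN { OFS = "\t" } { sum[$4] += $3 - $2 } END { for (id in sum) print id, sum[id] }'
}

toy_liftover | merge_within 20000 | keep_min_size 5000 | lifted_size
```

Here the domain lifts over as two surviving pieces (29 kb and 7 kb), so its lifted over size is 36 kb even though it is not contiguous in the target.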



Run the following in each domain folder. (*Example shown for D. melanogaster to D. triauraria liftover.*)
1. Combine all the lifted over domains into one file. Count the number of unique domains lifted over. 
2. Combine the bedtools groupby output for each lifted over domain into one file.
3. Combine the liftover target genomic coordinates with the original domain coordinates.  
4. Add the size of the original domain to the file from step 3.
5. Add the size of the original domain to the file from step 2, then join the result with the file from step 4.
6. Filter out lifted over domains that are less than half of the original domain size or more than 50% larger than the original domain. These are **truncated** and **expanded domains**.
7. Print the truncated and expanded domains to a file and calculate the number of unique lifted over domains that are truncated or expanded. 
8. Find the unique lifted over domains that are **contiguous** in the target, i.e. that lift over as a single fragment. 
9. Join the list of contiguous domains with their coordinate information. 
10. Reorder the columns and sort the list of contiguous domains for input to bedtools.

In [None]:
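# 1. combine the lifted over domains and count the unique domain IDs
cat *.bed_merge_5000 | sort -k4b,4 > cat.bed_merge_5000
awk '{print $4}' cat.bed_merge_5000 | sort | uniq | wc -l
# 2. combine the groupby output
cat *.bed_merge_5000_group | sort -k1b,1 > cat.bed_merge_5000_group
# 3. join needs both inputs sorted on the join field (column 4 = domain ID)
sort -k4b,4 /dmel_dtri_merge/dmel_domains_int_sorted | join -1 4 -2 4 cat.bed_merge_5000 - > dtri_lo_dmel_coords
# 4. the IDs are DMEL_DOMAINS_*, so the original sizes come from the Dmel sizes file
join -1 1 -2 1 /dmel_dtri_merge/dmel_domains_int_sizes dtri_lo_dmel_coords > dtri_lo_dmel_coords_size1
# 5. output columns: ID, original size, lifted size, lifted chr/start/end, original chr/start/end
join -1 1 /dmel_dtri_merge/dmel_domains_int_sizes cat.bed_merge_5000_group > bed_merge_group_5000_sizes
join -1 1 -2 1 bed_merge_group_5000_sizes dtri_lo_dmel_coords_size1 | awk '{print $1"\t"$2"\t"$3"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10}'> dtri_lo_dmel_coords_size2
# 6. keep lifted sizes between 50% and 150% of the original size
awk '$3>=($2-(0.5 * $2)) && $3<=($2+(0.5 * $2))' dtri_lo_dmel_coords_size2 > dtri_lo_dmel_coords_te_filt
# 7. truncated and expanded domains (strict complement of step 6); count unique IDs
awk '$3<($2-(0.5 * $2)) || $3>($2+(0.5 * $2))' dtri_lo_dmel_coords_size2 > dtri_lo_dmel_tande
awk '{print $1}' dtri_lo_dmel_tande | sort | uniq | wc -l
# 8. contiguous domains are the IDs that occur exactly once
awk '{print $1}' dtri_lo_dmel_coords_te_filt | sort | uniq -c > dtri_lo_dmel_coords_te_filt_counts
awk '$1 == 1 {print $2}' dtri_lo_dmel_coords_te_filt_counts | sort -k1b,1 > dtri_lo_dmel_continuous
# 9.
join -1 1 -2 1 dtri_lo_dmel_continuous dtri_lo_dmel_coords_te_filt > dtri_lo_dmel_coords_continuous
# 10. reorder to lifted chr/start/end, original chr/start/end, ID, then sort by position
awk '{print $4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$1}' dtri_lo_dmel_coords_continuous | sort -k1,1 -k2,2n > dtri_lo_dmel_coords_continuous_sorted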
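
The size and fragment-count rules in steps 6 to 8 can be summarised as a single classification. The sketch below applies them to a toy table with one row per lifted over fragment and three columns: domain ID, original size, total lifted over size. The IDs, the sizes, the function names and the label `split` (used here for domains that pass the size filter but lift over as more than one fragment) are assumptions for illustration.

```shell
# One row per lifted over fragment: ID, original size, total lifted over size.
toy_sizes() {
  printf 'TOY_1\t100000\t95000\n'    # one fragment, similar size
  printf 'TOY_2\t100000\t40000\n'    # less than half of the original size
  printf 'TOY_3\t100000\t160000\n'   # more than 50% larger than the original
  printf 'TOY_4\t100000\t90000\n'    # similar size, but two fragments
  printf 'TOY_4\t100000\t90000\n'
}

# Label each domain ID once, using the 50%-150% window and the fragment count.
classify() {
  awk 'BEGIN { OFS = "\t" }
    { n[$1]++; orig[$1] = $2; lift[$1] = $3 }
    END {
      for (id in n) {
        if (lift[id] < 0.5 * orig[id]) label = "truncated"
        else if (lift[id] > 1.5 * orig[id]) label = "expanded"
        else if (n[id] == 1) label = "contiguous"
        else label = "split"
        print id, label
      }
    }' | sort -k1,1
}

toy_sizes | classify
```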

Run the following bedtools intersect commands for both liftover directions to:
1. Count the lifted over contiguous domains that HiCExplorer also identifies in at least one replicate of the target species. These are **orthologous domains**.
2. Count the lifted over contiguous domains that HiCExplorer identifies in both replicates of the target species. 
3. Find the lifted over contiguous domains that HiCExplorer identifies in at least one replicate of the target species and whose boundaries lie within 5 kb of the TAD boundaries in the target species.
4. Count the orthologous domains with the same boundaries.

In [None]:
#Dmel to Dtri
# 1.
bedtools intersect -wa -wb -f 0.9 -r -a /dmel_to_dtri_domains/dtri_lo_dmel_coords_continuous_sorted -b /dmel_dtri_merge/dtri_domains_cat_sorted | cut -f 1-3 | sort | uniq | wc -l
# 2.
bedtools intersect -wa -wb -f 0.9 -r -a /dmel_to_dtri_domains/dtri_lo_dmel_coords_continuous_sorted -b /dmel_dtri_merge/dtri_domains_int_sorted | cut -f 1-3 | sort | uniq | wc -l
# 3.
bedtools intersect -wa -wb -f 0.9 -r -a /dmel_to_dtri_domains/dtri_lo_dmel_coords_continuous_sorted -b /dmel_dtri_merge/dtri_domains_cat_sorted | awk '(($10-$3>=0 && $10-$3<=5000) || ($3-$10>=0 && $3-$10<=5000)) && (($9-$2>=0 && $9-$2<=5000) || ($2-$9>=0 && $2-$9<=5000))' > /dmel_to_dtri_domains/dtri_lo_dmel_bothreps_restends
# 4.
cut -f 1-3 /dmel_to_dtri_domains/dtri_lo_dmel_bothreps_restends | sort | uniq | wc -l
#Dtri to Dmel
# 1.
bedtools intersect -wa -wb -f 0.9 -r -a /dtri_to_dmel_domains/dmel_lo_dtri_coords_continuous_sorted -b /dmel_dtri_merge/dmel_domains_cat_sorted | cut -f 1-3 | sort | uniq | wc -l
# 2.
bedtools intersect -wa -wb -f 0.9 -r -a /dtri_to_dmel_domains/dmel_lo_dtri_coords_continuous_sorted -b /dmel_dtri_merge/dmel_domains_int_sorted | cut -f 1-3 | sort | uniq | wc -l
# 3.
bedtools intersect -wa -wb -f 0.9 -r -a /dtri_to_dmel_domains/dmel_lo_dtri_coords_continuous_sorted -b /dmel_dtri_merge/dmel_domains_cat_sorted | awk '(($10-$3>=0 && $10-$3<=5000) || ($3-$10>=0 && $3-$10<=5000)) && (($9-$2>=0 && $9-$2<=5000) || ($2-$9>=0 && $2-$9<=5000))' > /dtri_to_dmel_domains/dmel_lo_dtri_bothreps_restends
# 4. write the unique lifted over and original coordinates of the final orthologous domains
awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' /dtri_to_dmel_domains/dmel_lo_dtri_bothreps_restends | sort | uniq > /dtri_to_dmel_domains/dmel_lo_dtri_ortholgous_final
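
The `-f 0.9 -r` options used above require a reciprocal overlap: the shared interval must cover at least 90% of the lifted over domain and at least 90% of the HiCExplorer domain. The arithmetic for one pair of intervals on the same chromosome can be sketched as follows. The function name, its argument order and the coordinates are made up for this example.

```shell
# Arguments: startA endA startB endB min_fraction
# Prints "pass" when the overlap covers at least min_fraction of BOTH intervals.
reciprocal_overlap() {
  awk -v a1="$1" -v a2="$2" -v b1="$3" -v b2="$4" -v f="$5" 'BEGIN {
    a1 += 0; a2 += 0; b1 += 0; b2 += 0; f += 0   # force numeric comparison
    lo = (a1 > b1) ? a1 : b1                     # overlap start
    hi = (a2 < b2) ? a2 : b2                     # overlap end
    ov = hi - lo
    if (ov > 0 && ov >= f * (a2 - a1) && ov >= f * (b2 - b1)) print "pass"
    else print "fail"
  }'
}

# 100 kb lifted over domain vs. a 96 kb domain nested inside it
reciprocal_overlap 10000 110000 12000 108000 0.9
# same lifted over domain vs. a 190 kb domain that contains it
reciprocal_overlap 10000 110000 10000 200000 0.9
```

The second pair fails even though the lifted over domain is fully covered, because the overlap is only about 53% of the larger domain. This is why a lifted over domain nested inside a much larger TAD is not counted as orthologous.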

Run the following to identify split domains. (Example shown for the _D. tri_ to _D. mel_ analysis.)

In [None]:
cat dmel_lo_dtri_coords_continuous dmel_lo_dtri_tande | sort -k1b,1 > dmel_lo_dtri_cont_plus_tande
join -1 1 -2 1 dmel_lo_dtri_coords dmel_lo_dtri_cont_plus_tande -v 1 | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7}' > split

bedtools groupby -i split -g 5,6,7  -c 2,3,4 -o collapse > split_groupby
sed 's/,/\t/g' split_groupby > split_groupby_tab
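
The `join -v 1` call above acts as an anti-join: it keeps the lines of the first file whose ID does not appear in the second file. A minimal sketch on toy files shows the behaviour. The file names, IDs and coordinates are invented, and both files are assumed to be sorted on their first field.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# All lifted over fragments: ID followed by lifted over coordinates.
printf 'TOY_1 chrA 100 200\nTOY_2 chrA 300 400\nTOY_2 chrB 10 90\nTOY_3 chrA 500 900\n' \
  > "$workdir/all_fragments"

# IDs already classified as contiguous, truncated or expanded.
printf 'TOY_1\nTOY_3\n' > "$workdir/classified_ids"

# Print only the fragments whose ID is absent from the classified list.
unclassified() { LC_ALL=C join -v 1 "$1" "$2"; }

unclassified "$workdir/all_fragments" "$workdir/classified_ids"
```

Both remaining lines belong to `TOY_2`, the one domain that lifted over as two fragments.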

Run the following to identify domains split by lineage-specific boundaries

In [None]:
#Dmel to dtri
cat dtri_to_dmel_boundaries/dtri_to_dmel_dmel_non_boundary_coord_with_bound_id dmel_to_dtri_boundaries/dmel_to_dtri_dmel_nonconserved_coords_with_bound_id | sort -k1,1b -k2,2n > all_linspec_bounds_dmelc
#split by lineage-specific boundary
bedtools intersect -u -a dmel_to_dtri_domains/dtri_lo_dmel_nonorth_continuous_reord -b all_linspec_bounds_dmelc | wc -l
#add the remainder that are not split by a lineage-specific boundary to the truncated/expanded category
bedtools intersect -v -a dmel_to_dtri_domains/dtri_lo_dmel_nonorth_continuous_reord -b all_linspec_bounds_dmelc | wc -l
    
#Dtri to dmel
cat dtri_to_dmel_boundaries/dtri_to_dmel_dtri_nonconserved_coords_with_bound_id dmel_to_dtri_boundaries/dmel_to_dtri_dtri_non_boundary_coord_with_bound_id > all_linspec_bounds_dtric
#split by lineage-specific boundary
bedtools intersect -u -a dtri_to_dmel_domains/dtri_to_dmel_nonorth_dtric -b all_linspec_bounds_dtric | wc -l
#add the remainder that are not split by a lineage-specific boundary to the truncated/expanded category
bedtools intersect -v -a dtri_to_dmel_domains/dtri_to_dmel_nonorth_dtric -b all_linspec_bounds_dtric | wc -l

Run the following to identify the remainder of the truncated/expanded domains, which are caused by insertions and deletions. For the final truncated/expanded count, add the count from below to the count from **step 7** above. (Step shown for the _D. mel_ to _D. tri_ analysis.) 

In [None]:
bedtools intersect -v -wa -wb -a dmel_to_dtri_domains/dtri_lo_dmel_nonorth_continuous_reord -b all_linspec_bounds_dmelc > dmel_to_dtri_nonorth_nonls
wc -l dmel_to_dtri_nonorth_nonls