### Chromatin State
###### Pipeline to compare chromatin state between orthologous and non-orthologous genes in two species.

_Versions_  
Bedtools: 2.29.0  
R: 3.6.1   
GNU Awk 4.0.2  
GNU Coreutils 8.22   
Python: 3.7.3  

In [None]:
# Total base pairs of orthologous genes that overlap any of the five chromatin states
bedtools intersect -wo -a dtri_lo_dmel_0.9r_final.GENES.bed -b Kc_chromatin_states_r6_muller.bed | sort -k8,8 | bedtools groupby -g 8 -c 9 -o sum | cut -f 2 | ./sum.pl
# Base pairs and fraction of orthologous gene sequence in each chromatin state; the denominator should match the total printed by the previous command
bedtools intersect -wo -a dtri_lo_dmel_0.9r_final.GENES.bed -b Kc_chromatin_states_r6_muller.bed | sort -k8,8 | bedtools groupby -g 8 -c 9 -o sum | awk '{print $1"\t"$2"\t"$2/13180916}'
# Total base pairs of non-orthologous genes that overlap any of the five chromatin states
bedtools intersect -wo -a dtri_lo_dmel_NOT_0.9r_final.GENES.bed -b Kc_chromatin_states_r6_muller.bed | sort -k8,8 | bedtools groupby -g 8 -c 9 -o sum | cut -f 2 | ./sum.pl
# Base pairs and fraction of non-orthologous gene sequence in each chromatin state; the denominator should match the total printed by the previous command
bedtools intersect -wo -a dtri_lo_dmel_NOT_0.9r_final.GENES.bed -b Kc_chromatin_states_r6_muller.bed | sort -k8,8 | bedtools groupby -g 8 -c 9 -o sum | awk '{print $1"\t"$2"\t"$2/45589307}' 
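
The per-state sums and fractions above can also be computed in one pass, which avoids hard-coding the denominator. The following is a minimal sketch, not part of the original pipeline: the function name `state_fractions` and the example rows are hypothetical, and it assumes the `bedtools intersect -wo` output has the state label in column 8 and the overlap length in column 9, as the `groupby -g 8 -c 9` calls imply.

```python
from collections import defaultdict


def state_fractions(intersect_lines, state_col=8, overlap_col=9):
    """Sum overlap bp per chromatin state and return {state: (bp, fraction)}.

    Column numbers are 1-based to match the cut/groupby calls in the shell
    pipeline; the defaults assume 4-column -a and -b BED files.
    """
    bp_per_state = defaultdict(int)
    for line in intersect_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < max(state_col, overlap_col):
            continue  # skip blank or truncated lines
        bp_per_state[fields[state_col - 1]] += int(fields[overlap_col - 1])
    total = sum(bp_per_state.values())
    return {s: (bp, bp / total) for s, bp in sorted(bp_per_state.items())}


# Made-up rows (not real data) in `bedtools intersect -wo` layout
example = [
    "2L\t100\t600\tgeneA\t2L\t0\t400\tBLACK\t300",
    "2L\t100\t600\tgeneA\t2L\t400\t900\tYELLOW\t200",
    "2L\t700\t1200\tgeneB\t2L\t400\t900\tYELLOW\t200",
    "2L\t700\t1200\tgeneB\t2L\t900\t2000\tBLACK\t300",
]
for state, (bp, frac) in state_fractions(example).items():
    print(f"{state}\t{bp}\t{frac:.3f}")
```

In practice the lines would come from the saved output of the `bedtools intersect -wo` command rather than from an in-memory list.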

Fisher's exact tests for state differences between orthologous and non-orthologous domains.

In [None]:
import scipy.stats as stats
#BLACK
oddsratio, pvalue = stats.fisher_exact([[719,1653-719],[2156,7263-2156]],alternative='greater')

#BLUE
oddsratio, pvalue = stats.fisher_exact([[247,1653-247],[861,7263-861]],alternative='greater')

#GREEN
oddsratio, pvalue = stats.fisher_exact([[20,1653-20],[357,7263-357]],alternative='less')

#RED
oddsratio, pvalue = stats.fisher_exact([[116,1653-116],[526,7263-526]],alternative='less')

#YELLOW
oddsratio, pvalue = stats.fisher_exact([[551,1653-551],[3373,7263-3373]],alternative='less')
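
Each call above overwrites `oddsratio` and `pvalue` without displaying them, and all five share the same table layout: `[[in_state, total - in_state]]` for each of the two groups. As a rough illustration of that layout, the sketch below rebuilds the tables from the counts in the cell above and prints the sample odds ratio `(a*d)/(b*c)`, which is the first value `fisher_exact` returns for a 2x2 table by default. The helper names `two_by_two` and `sample_odds_ratio` are hypothetical, and the p-values still need `scipy`.

```python
def two_by_two(in_state_1, total_1, in_state_2, total_2):
    """Build the 2x2 table [[in, out], [in, out]] for the two groups."""
    return [[in_state_1, total_1 - in_state_1],
            [in_state_2, total_2 - in_state_2]]


def sample_odds_ratio(table):
    """Sample odds ratio (a*d)/(b*c); undefined if b or c is zero."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Counts copied from the cell above: (first-row count, second-row count, alternative)
ROW1_TOTAL, ROW2_TOTAL = 1653, 7263
counts = {
    "BLACK": (719, 2156, "greater"),
    "BLUE": (247, 861, "greater"),
    "GREEN": (20, 357, "less"),
    "RED": (116, 526, "less"),
    "YELLOW": (551, 3373, "less"),
}

for state, (n1, n2, alternative) in counts.items():
    table = two_by_two(n1, ROW1_TOTAL, n2, ROW2_TOTAL)
    print(state, alternative, table, sample_odds_ratio(table))
```

Passing each `table` and `alternative` to `stats.fisher_exact` inside the loop would reproduce the five tests while keeping every result.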

Chromatin state of fixed vs. polymorphic breakpoints

In [None]:
git clone https://github.com/mahulchak/dspr-asm.git

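# Keep INV records whose column 9 is at least 10000, then write both 1-bp breakpoints of each inversion with its length in column 4
gzcat dspr-asm/variants-raw/*.gz | grep INV | awk '$9>=10000' | cut -f 1-3 | tr "." "\t" | cut -f 2-4 | awk '{print $1"\t"$2"\t"$2+1"\t"$3-$2"\n"$1"\t"$3"\t"$3+1"\t"$3-$2}' | sort -k1,1 -k2,2n > t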

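# Count polymorphic breakpoints per chromatin state and report each count as a fraction of 292.
# Note: intersect reads -a from the file t, so the merged intervals arriving on stdin are not used.
cat t | sort -k1,1 -k2,2n | bedtools merge -i - -d 10 | bedtools intersect -wo -a t -b Kc_chromatin_states_r6.bed | cut -f 1,2,3,8 | uniq | cut -f 4 | sort | uniq -c | awk '{print $2"\t"$1"\t"$1/292}'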

bedtools intersect -wa -wb -a dmel_breakpoints_no_boundaries -b Kc_chromatin_states_r6.bed | cut -f 7 | sort | uniq -c | awk '{print $2"\t"$1"\t"$1/843}' | sort -k1b,1
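
The `awk` step that writes `t` turns each inversion interval into two 1-bp breakpoint records, each carrying the inversion length in the fourth column. A minimal Python sketch of that transformation follows; the function name `inversion_breakpoints` and the example coordinates are made up for illustration.

```python
def inversion_breakpoints(chrom, start, end):
    """Return the two 1-bp breakpoint records for one inversion.

    Each record is (chrom, pos, pos + 1, inversion_length), mirroring the
    four tab-separated columns that the awk command prints.
    """
    length = end - start
    return [(chrom, start, start + 1, length),
            (chrom, end, end + 1, length)]


# Made-up inversions (not real data): (chrom, start, end)
inversions = [("3R", 500000, 540000), ("2L", 1000, 21000)]

records = []
for chrom, start, end in inversions:
    records.extend(inversion_breakpoints(chrom, start, end))

# Same ordering as `sort -k1,1 -k2,2n`: by chromosome, then numeric start
records.sort(key=lambda r: (r[0], r[1]))
for rec in records:
    print("\t".join(str(x) for x in rec))
```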

Fisher's exact tests for the polymorphic vs. fixed breakpoint analysis

In [None]:
#BLACK
oddsratio, pvalue = stats.fisher_exact([[129,292-129],[157,843-157]],alternative='greater')
#oddsratio = 3.458012582548552	pvalue = 5.1175117049543467e-17

#BLUE
oddsratio, pvalue = stats.fisher_exact([[47,292-47],[67,843-67]],alternative='greater')
#oddsratio = 2.221870240633567	pvalue = 9.714349618584786e-05

#GREEN
oddsratio, pvalue = stats.fisher_exact([[4,292-4],[23,843-23]],alternative='less')
#oddsratio = 0.49516908212560384	pvalue = 0.13580764850077057

#RED
oddsratio, pvalue = stats.fisher_exact([[37,292-37],[62,843-62]],alternative='greater')
#oddsratio = 1.8277672359266288	pvalue = 0.004981927432859326

#YELLOW
oddsratio, pvalue = stats.fisher_exact([[75,292-75],[534,843-534]],alternative='less')
#oddsratio = 0.1999948221405271	pvalue = 2.3515576001110368e-29
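
For readers who want to see what the one-sided p-values above measure: with all table margins fixed, the top-left count follows a hypergeometric distribution, and `alternative='greater'` (or `'less'`) sums the probabilities of counts at least (or at most) as large as the observed one. The sketch below implements that textbook definition with the standard library only. It is an illustration rather than scipy's implementation, the names `comb` and `fisher_one_sided` are hypothetical, and small numerical differences from scipy are possible.

```python
from fractions import Fraction
from math import factorial


def comb(n, k):
    """Binomial coefficient; written out because math.comb needs Python 3.8+."""
    if k < 0 or k > n:
        return 0
    return factorial(n) // (factorial(k) * factorial(n - k))


def fisher_one_sided(table, alternative):
    """One-sided Fisher's exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)   # smallest possible top-left count
    highest = min(row1, col1)          # largest possible top-left count
    if alternative == "greater":
        ks = range(a, highest + 1)
    elif alternative == "less":
        ks = range(lowest, a + 1)
    else:
        raise ValueError("alternative must be 'greater' or 'less'")
    # Hypergeometric tail: sum of C(row1, k) * C(n - row1, col1 - k) / C(n, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in ks)
    return float(Fraction(tail, comb(n, col1)))


# BLACK and GREEN tables from the cell above
print(fisher_one_sided([[129, 292 - 129], [157, 843 - 157]], "greater"))
print(fisher_one_sided([[4, 292 - 4], [23, 843 - 23]], "less"))
```

The printed values should be close to the p-values recorded in the comments above for BLACK and GREEN.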

Orthologous and non-orthologous differentially expressed (DE) genes per chromatin state

First, create the file dmel_breakpoints_no_boundaries:

In [None]:
bedtools intersect -v -a dmel_breakpoints -b dmel_merge_mid5000 dmel_lc > dmel_breakpoints_no_boundaries

Then run the script below: color_DE_exclude_boundaries.sh

In [None]:
flank=10000   # maximum distance (bp) from a breakpoint for a gene to count as a breakpoint gene
pval=0.05     # cutoff applied to column 7 of the DESeq output

for color in YELLOW RED BLACK BLUE GREEN
do
    # Get all genes of a color, then split them into those near breakpoints versus the rest
    grep "$color" genes.color | cut -f 1 > temp
    grep -f temp dmel-all-r6.21.genes.MullerIDs.bed | sort -k1,1 -k2,2n | bedtools closest -d -a - -b dmel_breakpoints_no_boundaries | awk -v f="$flank" '$10<=f' | cut -f 4 | sort | uniq > temp2
    # IDs that appear only once across both lists, i.e. genes of this color that are not near a breakpoint
    cat temp temp2 | sort | uniq -u > tempu

    # Get the number of breakpoint genes that are DE
    bde=$(grep -f temp2 ../expression/deseq_output_all.NEW.csv | tr "," "\t" | grep -v NA | awk -v p="$pval" '$7<=p' | wc -l)
    btot=$(grep -f temp2 ../expression/deseq_output_all.NEW.csv | tr "," "\t" | grep -v NA | wc -l)
    echo $bde $btot $color | awk '{print "Breakpoint genes: "$1"/"$2": "$1/$2"\t"$3}'

    # Get the number of other genes that are DE
    ode=$(grep -f tempu ../expression/deseq_output_all.NEW.csv | tr "," "\t" | grep -v NA | awk -v p="$pval" '$7<=p' | wc -l)
    otot=$(grep -f tempu ../expression/deseq_output_all.NEW.csv | tr "," "\t" | grep -v NA | wc -l)
    echo $ode $otot $color | awk '{print "Other genes: "$1"/"$2": "$1/$2"\t"$3}'
done
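
The logic of the script above (split the genes of one color by distance to the nearest breakpoint, then count how many tested genes in each set pass the cutoff) can be sketched in Python as follows. This is an illustration under assumptions, not a replacement for the script: the function names, gene IDs, and values are made up, `None` stands in for `NA`, and the tested value is assumed to be the one in column 7 of the DESeq output. Unlike the `awk` filter, the sketch treats a negative distance, which `bedtools closest -d` reports when no feature is found on the same chromosome, as "not near a breakpoint".

```python
def split_by_breakpoint(distance_by_gene, flank=10000):
    """Return (near, other) gene sets based on distance to the closest breakpoint."""
    near = {g for g, d in distance_by_gene.items() if 0 <= d <= flank}
    return near, set(distance_by_gene) - near


def de_counts(gene_ids, value_by_gene, pval=0.05):
    """Return (number DE, number tested) for the given genes.

    Genes whose value is missing (None) are excluded from both counts.
    """
    tested = [g for g in gene_ids if value_by_gene.get(g) is not None]
    n_de = sum(1 for g in tested if value_by_gene[g] <= pval)
    return n_de, len(tested)


# Made-up example data (hypothetical gene IDs, distances, and test values)
distance = {"g1": 0, "g2": 5000, "g3": 10000, "g4": 25000, "g5": -1}
value = {"g1": 0.01, "g2": 0.2, "g3": None, "g4": 0.04, "g5": 0.5}

near, other = split_by_breakpoint(distance, flank=10000)
for label, genes in (("Breakpoint genes", near), ("Other genes", other)):
    n_de, n_tested = de_counts(genes, value, pval=0.05)
    print(f"{label}: {n_de}/{n_tested}")
```

Returning the counts rather than only printing the ratio makes it straightforward to feed them into the Fisher's exact tests.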

Fisher's exact tests for DE genes in orthologous vs. non-orthologous TADs, separated by chromatin state

In [None]:
#BLACK
oddsratio, pvalue = stats.fisher_exact([[20,99-20],[77,497-77]],alternative='greater')
#oddsratio = 1.380897583429229	pvalue = 0.1561514805893438

#BLUE
oddsratio, pvalue = stats.fisher_exact([[17,99-17],[48,497-48]],alternative='greater')
#oddsratio = 1.9392784552845528	pvalue = 0.02616904723438471

#GREEN
oddsratio, pvalue = stats.fisher_exact([[1,99-1],[11,497-11]],alternative='less')
#oddsratio = 0.45083487940630795	pvalue = 0.3808229702397943

#RED
oddsratio, pvalue = stats.fisher_exact([[10,99-10],[31,497-31]],alternative='greater')
#oddsratio = 1.689017760057992	pvalue = 0.12315396696148169

#YELLOW
oddsratio, pvalue = stats.fisher_exact([[51,99-51],[330,497-330]],alternative='less')
#oddsratio = 0.537689393939394	pvalue = 0.0038106562955318468