Chrom-Lasso is a tool to analyze Hi-C data for identifying cis chromatin interactions (interactions occur within chromosome).
The following folders are code, input files, test data, and tutorial for Chrom-Lasso.
- Code: This folder contains codes for Chrom-Lasso, the usage of them is shown in the "Tutorial" folder.
- Prepare_Input_File: This folder contains input files needed by Chrom-Lasso to do analysis, mainly are domain files and cutting site files for Mouse and Human, The domain files are constant for specific species, but the genomic coordinate in the file can be changed with genome version by tool "LiftOver". The cutting site files can be produced by tool "OligoMatch" with reference genome and cutting sequence for the restriction endonuclease.
- Test_Data:
This folder contains "sortChr" simple example files for Mouse and Human for testing Chrom-Lasso, the "sortChr" file are preprocessed file from raw "fastq" sequencing files,
the preprocessing of "fastq" files can be done by
Juicer
(https://github.com/aidenlab/juicer). TheJuicer
generates "merged_nodups" file, and you can use the Shell scripts in tutorial that sorts the file by chromosome order to generate "sortChr" file for further analysis. - Tutorial: This folder contains the analysis pipeline for Mouse and Human.
The compile of Chrom-Lasso Cpp code needs: gcc (>=4.8) and boost_1_51. And it also needs R (>=3.0) to run polynomial regression and lasso regression. Users can use "makefile" in Code folder for compiling Cpp code.
Chrom-Lasso needs 3 input files prepared according to the Hi-C experimental design.
The cutting site file contains the cutting sites of restriction enzyme used in Hi-C experiments. Each line stands for a cutting site locus on the genome.
This bedfile can be generated by an open source tool OligoMatch
.
The domain file contains the genomic regions identified as TADs. Each line stands for a TAD identified.
The raw domain files are downloaded from: (Dixon, J., Selvaraj, S., Yue, F. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions.
Nature 485, 376–380 (2012)). The raw version for Mouse genome is mm9 and for Human genome is hg18. Users can transform the genome version via tool "LiftOver".
Users should first generate merged_nodups.txt file by Juicer
(https://github.com/aidenlab/juicer) from raw Hi-C sequencing fastq files.
Here, we use Mouse_merged_nodups.txt file as example to generate sortChr file
awk '{if($1==0){strand1=1;}else if($1==16) {strand1=0;} if($5==0){strand2=1;}else if($5==16) {strand2=0;} print $2"\t"$3"\t"strand1"\t"$6"\t"$7"\t"strand2"\t0";}' Mouse_merged_nodups.txt > Mouse.formatted
for chr in {1..19} X;do awk '{if($1=="'$chr'") print $0;}' Mouse.formatted;done > Mouse.sortChr
The sortChr file contains the paired end sequencing information of Hi-C data.
/Code/2_Arrange_Domain/HiC_mixturePLD_singleThread -g mm10 -w 51 500 10000 26 1 -d /Prepare_Input_File/Domain_File/total.domain.mm10 -c
/Prepare_Input_File/Cutting_Site_File/MboI.mm10.bed Mouse.sortChr > "Mouse.sortChr_summary"
-g: genome version (mm9, mm10, hg18, hg19)
-w: read size, distance to cutting site threshold, model bin size, number of nodes, cores per node
-d: path to domain file
-c: path to cutting site file
For each chromosome, this step generates 3 files, "chr_csInterChromTotalMap", "chr_domainCSinterFreq", "chr_domainSitesMap".
chr_csInterChromTotalMap: this file contains inter-chromosomal hybrid fragments for evaluating bias.
chr_domainCSinterFreq: this file contains cutting site contact frequency for each domain.
chr_domainSitesMap: this file contains cutting site loci on this chromosome.
for chr in chr{1..19} chrX;do awk 'function abs(x){return ((x < 0.0) ? -x : x)}BEGIN{i=0;}{if($2!=0){map[i]=$1;++i;}}END{for(j=0;j<i;++j){for(k=j+1;k<i;++k){dis=abs(map[j]-map[k]);if(dis<100000){dist[int(dis/100)]++;}else{break;}}}for(i=0;i<1000;++i) print i"\t"dist[i];}' "$chr"_csInterChromTotalMap;done | awk '{map[$1]+=$2;}END{for(i=0;i<1000;++i) print i"\t"map[i];}' > csDistanceDistr
cat Mouse.sortChr | awk 'function abs(x){return ((x < 0.0) ? -x : x)}{dis=abs($2-$5);if($1==$4 && dis<1000000){map[int(dis/100)]++;}}END{for(i=0;i<10000;++i) print i"\t"map[i];}' > empericalDist.bin100
cat Mouse.sortChr | awk 'function abs(x){return ((x < 0.0) ? -x : x)}BEGIN{nonloop=0;}{if($3!=$6 && (($2<$5 && $3==1) || ($2>$5 && $3==0))){nonloop++;}else{dis=abs($2-$5);if($1==$4 && dis<1000000){map[int(dis/100)]++;}}}END{for(i=0;i<10000;++i) {print i"\t"map[i];} print i"\t"nonloop}' > bin100.withNonloop
Rscript /Code/3_Model_Distribution/empericalDist.r
This step generates the "PolyCoef" file used for estimating data background.
"PolyCoef" file contains parameters used in the 7th-degree polynomial regression.
for chr in {1..19} X;
do
mkdir chr$chr
cd chr$chr
/Code/4_Find_IntraDomain_Interaction/findIntraDomainInteraction /Output_path/chr$chr"_csInterChromTotalMap" /Output_path/chr$chr"_domainSitesMap" /Output_path/chr$chr"_domainCSinterFreq" chr$chr"_out" /Output_path/PolyCoef \-13.95 1.854 > chr$chr"_out_summary"
cd ../
done;
Here, -13.95
and 1.854
were defult parameters used to estimate background model. Users can also get these 2 parameters following the instruction below:
/Code/4_Find_IntraDomain_Interaction/calEffBetas/calEffBetas chr1_csInterChromTotalMap chr1_domainSitesMap chr1_domainCSinterFreq chr1_out PolyCoef > chr1_out_summary
The above step generating 3 files: chr1_out_freq
, chr1_out_effvec
, and chr1_out_offset
.
Then run following R script
:
freq<-scan("chr1_out_freq")
offVec<-scan("chr1_out_offset")
effVec<-scan("chr1_out_effvec")
glmobj<-glm(freq~effVec,family="poisson",offset=offVec)
print(glmobj)
The 2 parameters will be estimated based on data from selected chromosome, users can also select another chromosome to do this.
For each chromosome, this step generates an independent folder containing "regionData" and "distMatrix" files for each domain on this chromosome.
"regionData" and "distMatrix" files containing frequency and genomic distance information for interactions that need lasso regression to select.
for chr in {1..19} X;
do
cd chr$chr
domainNum=`tail -n1 chr"$chr"_out_debug | awk '{print $2}'`
for (( c=0; c<=$domainNum; c++ ));
do
Rscript /Code/5_Lasso_Determine_Center/nnlasso.HiC.r regionData_$c distMatrix_$c > nnlassoOut_$c
done;
cd ../
done;
In each chromosome folder, this step generates "nnlassoOut" file for each domain, which contains the coefficients resulting from lasso regression.
for chr in {1..19} X;
do
cd chr$chr
/Code/6_Test_Distribution/outputTestPosPvalue /Output_path/chr$chr"_csInterChromTotalMap" /Output_path/chr$chr"_domainSitesMap" /Output_path/chr$chr"_domainCSinterFreq" chr$chr"_posP" /Output_path/Mouse.polyCoef \-13.95 1.854
cd ../
done;
In each chromosome folder, this step generates "oneCol" and "testPosP" files for each domain, which contain the information for independent interacting center.
for chr in {1..19} X
do
cd chr$chr
domainNum=`tail -n1 chr"$chr"_out_debug | awk '{print $2}'`
for (( c=0; c<=$domainNum; c++ ))
do
testNum1=`cat nnlassoOut_"$c" | tail -n2 | head -1 | awk '{print $2}'`;testNum2=`cat regionData_"$c" | tail -n2 | head -1 | awk '{print $2}'`;testNum3=`wc -l regionData_"$c" | awk '{print $1;}'`
if [ "$testNum1" = "$testNum2" ] || [ "$testNum3" == 0 ]
then
echo $chr.$c.match
else
echo $chr.$c.notMatch
fi
done
cd ../
done > domainNnlassoResults
for chr in {1..19} X
do
echo $chr
cd chr$chr
domainNum=`tail -n1 chr"$chr"_out_debug | awk '{print $2}'`
for (( c=0; c<=$domainNum; c++ ))
do
testNum1=`cat nnlassoOut_"$c" | tail -n2 | head -1 | awk '{print $2}'`;testNum2=`cat testPosP_"$c" | awk '{if(NF==7) {m=$3;}}END{print m;}'`;testNum3=`wc -l regionData_"$c" | awk '{print $1;}'`
if [ "$testNum1" = "$testNum2" ] || [ "$testNum3" == 0 ]
then
echo $chr.$c.match
else
echo $chr.$c.notMatch
fi
done
cd ../
done > domainTestPos
for chr in {1..19} X
do
cd chr$chr
domainNum=`tail -n1 chr"$chr"_out_debug | awk '{print $2}'`
for (( c=0; c<=$domainNum; c++ ))
do
cat testPosP_"$c" | awk '{if(NF==7) {printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%.10f\t%.10f\n",$1,$2,$3,$4,$5,$6,$7,$6,$7);}else{print $0;}}' > temp1
cat nnlassoOut_"$c" | awk '{if($1!~/^\[/) {printf("%.10f",$1);for(i=2;i<=NF;i++){printf("\t%.10f",$i);}printf("\n");}else{print $0;}}' > temp2
awk 'FNR==NR{if(NF==9){i=$2;j=1;beta[i]=$6;pval[i]=$7}else{if(NF==3){map[i,j]=$0;}j++;} next;}{if($1~/^\[/){i=$2;}else{if(i in pval){for(j=1;j<=NF;j++) {if(((i,j) in map) && $j!="Error!"){print map[i,j]"\t"beta[i]"\t"pval[i]"\t"$j;}}}}}' temp1 temp2 | sort -k1,1n -k2,2n -u | gzip > sigTest_"$c".all.gz
done
cd ../
done
for chr in {1..19} X
do
echo $chr
cd chr$chr
domainNum=`tail -n1 chr"$chr"_out_debug | awk '{print $2}'`
for (( c=0; c<=$domainNum; c++ ))
do
testNum1=`cat nnlassoOut_"$c" | tail -n2 | head -1 | awk '{print $2+1}'`;testNum2=`zcat sigTest_"$c".all.gz | wc -l | awk '{print $1;}'`;testNum3=`wc -l regionData_"$c" | awk '{print $1;}'`
if [ "$testNum1" = "$testNum2" ] || [ "$testNum3" == 0 ]
then
echo $chr.$c.match
else
echo $chr.$c.notMatch
fi
done
cd ../
done > domainSigTest.all
for chr in {1..19} X;do zcat chr$chr/sigTest_*.all.gz | awk '{print "'$chr'\t"$0;}';cat chr$chr/oneCol_* | awk '{print "'$chr'\t"$1"\t"$2"\t0\t"$3"\t"$4"\t"$3;}'; done | sort -k1,1 -k2,2n -k3,3n > all_interactions
This step provides users with a summary of detected interactions.
Each line stands for a detected interaction.
Column 1: chromosome
Colume 2: interacting end1
Column 3: interacting end2
Column 4: users can ignore this column.
Column 5: beta coefficients for testing distribution, which can be used to infer the relative proportion of cells with this interaction. The bigger this value, the larger the proportion.
Column 6: p value
Column 7: beta coefficients for lasso regression. if column 5 is not equal to column 7, this means that this interaction comes from lasso regression. Users can grap interactions from lasso regression seperately and select them based on column 7.
for chr in {1..19} X;
do
cd chr$chr
/Code/7_Present_Significance/randomSamplingForFDR /Output_path/chr$chr"_csInterChromTotalMap" /Output_path/chr$chr"_domainSitesMap" /Output_path/chr$chr"_domainCSinterFreq" chr$chr /Output_path/Mouse.polyCoef \-13.95 1.854
cd ../
done;
cat chr*/randomSamples* > randomSamples.combined
Rscript /Code/7_Present_Significance/fdrFromRandomSamples.r
This step generates "randomSamples.combined.fdr" and "randomSamples.combined.posFdr" files to estimate the FDR level for interactions.
"randomSamples.combined.fdr" contains FDR level for all randomly pickig loci pairs.
"randomSamples.combined.posFdr" contains FDR level for randomly picking loci pairs with beta coefficients above 0.
Each line stands for a randomly picking loci pair.
Column 5: p value multiple correction based on FDR
Colume 6: p value multiple correction based on BY
Get significance level, for FDR<0.05:
awk '{if($5<0.05 && $5>0.0499) print $0;}' randomSamples.combined.fdr | sort -k5,5n | tail -n1 | awk '{print $4}' > FDR_0.05
if FDR_0.05=0.00000587 Select interactions with FDR<0.05
awk '{if($6<0.00000587) print $0;}' all_interactions > interactions_fdr0.05
- For cutting site input file, please make sure this file is derived from the
same genome version
with the one you use for mapping. - When using Shell scripts in the tutorial for sorting chromosome and do
for
circulation, please make sure thetotal number of chromosomes
is changed befor running the code. - Chrom-Lasso focuses on detecting long-range chromatin interactions with the distance between interacting loci over 20,000bp, but this parameter can be changed in the
Cpp code
to satisfy the needs of study. This parameter can be changed in/Code/4_Find_IntraDomain_Interaction/findIntraDomainInteraction.cpp
,/Code/6_Test_Distribution/outputTestPosPvalue.cpp
, and/Code/7_Present_Significance/randomSamplingForFDR.cpp
.
- When testing for the reads distribution surrounding potential interacting loci pair, Chrom-Lasso defines the testing range by parameter "NEIGHBDIS" in
cpp code
, when this parameter is 5, it defines a 11 cutting site window centered by the potential loci, and you can change this parameter in Cpp code to satisfy the needs of study. This parameter can be changed in/Code/4_Find_IntraDomain_Interaction/findIntraDomainInteraction.cpp
,/Code/6_Test_Distribution/outputTestPosPvalue.cpp
, and/Code/7_Present_Significance/randomSamplingForFDR.cpp
.
- When testing for the reads distribution surrounding potential interacting loci pair, Chrom-Lasso defines the reads frequency for testing by parameter "TESTFREQTHRES" in
cpp code
, and you can change this parameter in Cpp code to satisfy the needs of study. This parameter can be changed in/Code/4_Find_IntraDomain_Interaction/findIntraDomainInteraction.cpp
, and/Code/6_Test_Distribution/outputTestPosPvalue.cpp
.
- If users want to add some new genome version for analysis,
please edit in/Code/2_Arrange_Domain/HiC_mixturePLD_main.cpp
. Users should add anelse if
to tell the length of chromosomes for this genome version as below and re-compile.