-
Notifications
You must be signed in to change notification settings - Fork 2
TomConlin/Jaspar_FA
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
from Fanconi gene sets spreadsheet https://docs.google.com/spreadsheets/d/1yX-5sfrC3vrahf4_k7-5rl4Oqzm853ollIMmUo1PTc0 gene set 5 with GH issue NCATS-Tangerine/ncats-ingest#21 FANCA, FANCB, FANCC, FANCE, FANCF, FANCG, FANCL, FANCM, FANCD2, FANCI, UBE2T FANCD1 (BRCA2), FANCJ, FANCN, FANCO, FANCP, FANCQ, FANCR, FANCS, FANCV, FANCU FAAP100, FAAP24, FAAP20, FAAP16 (MHF1), FAAP10 (MHF2) ---------------------------------------------- Core "complex" FANCA, FANCB, FANCC, FANCE, FANCF, FANCG, FANCL, FANCM FANCD2, FANCI, UBE2T Jaspar provides bed files for transcription factors which locates them on hg19 http://jaspar.genereg.net/ http://jaspar.genereg.net/html/DOWNLOAD/ curl http://jaspar.genereg.net/html/DOWNLOAD/README > README.download_jaspar http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/ we would be interested in the vertebrate set wget http://jaspar.genereg.net/html/DOWNLOAD/bed_files/MA*.bed no joy it is not liking wget, resetting by peer. curl gets through with the names here. curl -s http://jaspar.genereg.net/html/DOWNLOAD/bed_files/ | sed 's|.*<a href="MA.*bed">\(MA[0-9]*\.[0-9]\.bed\)</a>.*|\1|g' | grep "^MA" > jaspar_tx_bed.list wc -l < jaspar_tx_bed.list 141 for ma in $(< jaspar_tx_bed.list); do \ curl -s http://jaspar.genereg.net/html/DOWNLOAD/bed_files/${ma} > ${ma}; done head MA0007.2.bed chr8 10345716 10345731 hg19_chr8:10345717-10345731(+) . + chr10 122413006 122413021 hg19_chr10:122413007-122413021(-) . - chr5 136960820 136960835 hg19_chr5:136960821-136960835(-) . - chr20 44144378 44144393 hg19_chr20:44144379-44144393(+) . + chr2 216953753 216953768 hg19_chr2:216953754-216953768(+) . + chr13 83635183 83635198 hg19_chr13:83635184-83635198(-) . - chr2 19570730 19570745 hg19_chr2:19570731-19570745(+) . + chr15 85874605 85874620 hg19_chr15:85874606-85874620(-) . - chr12 62553253 62553268 hg19_chr12:62553254-62553268(+) . + chr3 185376610 185376625 hg19_chr3:185376611-185376625(+) . + chromosome/location and direction/sense distance upstream transcription factor distance upstream transcription factor binding sites looks like 1.1M instances of 141 motifs total ---------------------------------------------------------- where: chr. bp & strand are these FA-5 genes locations on hg_19? get all genes from ucsd table browser? genome.ucsc.edu/cgi-bin/hgTables?hgsid=586694461_3slK1ieAYc4t0GvHwBU8LvW8sCYj&clade=mammal&org=Human&db=hg19&hgta_group=genes&hgta_track=geneid&hgta_table=0&hgta_regionType=genome&hgta_outputType=bed&hgta_outFileName=hg19_geneid.bed.gz looks like they offer some kind of fasta file for 1k,2k & 5k upstream regions 8M,16M & 40M respectively http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/upstream1000.fa.gz the deflines give refseq, chromosome region & strand zcat upstream1000.fa.gz | grep "^>" | head >NM_001308203_up_1000_chr1_66998252_f chr1:66998252-66999251 >NM_032291_up_1000_chr1_66998639_f chr1:66998639-66999638 >NM_001145277_up_1000_chr1_16766167_f chr1:16766167-16767166 >NM_001145278_up_1000_chr1_16766167_f chr1:16766167-16767166 >NM_018090_up_1000_chr1_16766167_f chr1:16766167-16767166 but strand may not matter in this case. i.e transcription factor may need not be on the same strand as the gene it is regulating. (update: decided it may matter wrt the sense of the gene and ordering of motifs) a bed file would look like chrx start_bp stop_bp refseq . +/- zcat upstream1000.fa.gz | grep "^>" |awk 'BEGIN{s["f"]="+";s["r"]="-"}{split(substr($1,2),b,"_");split($2,a,":");split(a[2],c,"-");\ print a[1] "\t" c[1] "\t" c[2] "\t" b[1] "_" b[2] "\t.\t" s[b[7]]}'> upstream1000.bed these transcription factors would have an IRI like: http://fantom.gsc.riken.jp/5/sstar/JASPAR_motif:MA0007.1 I expect the names in a bed file do not need to be unique so I will put the motif file name back in the bed file some lines are incomplete awk 'NF==6' upstream1000.bed > upstream1000_clean.bed awk -F'\t' 'NF==6{$4=substr(FILENAME,1,8);print}' MA[0-9]*.bed |sort -k1,1 -k2,2n > MA_all.bed note there are chr I and chr II about 2,045 or the upstream1000.bed are not parsing not all files have 6 columns chr1_gl000191_random or chr4_ctg9_hap1 retry awk 'BEGIN{OFS="\t"}/^chr[0-9XY]*/{$4=substr(FILENAME,1,8);$5=".";if($6=="")$6=".";print}' MA[0-9]*.bed |sort -k1,1 -k2,2n > MA_all.bed bedtools intersect -wb -a upstream1000_clean.bed -b MA_all.bed > MA_up1k.bed wc -l < MA_up1k.bed 103,832 that will do. -------------------------------------------------------------------------------- wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/upstream2000.fa.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/upstream5000.fa.gz zcat data/upstream1000.fa.gz | grep "^>" > data/upstream2000.defline zcat data/upstream2000.fa.gz | grep "^>" > data/upstream2000.defline zcat data/upstream5000.fa.gz | grep "^>" > data/upstream5000.defline awk 'BEGIN{s["f"]="+";s["r"]="-";s[""]="."}{split(substr($1,2),b,"_");split($2,a,":");split(a[2],c,"-");\ print a[1] "\t" c[1] "\t" c[2] "\t" b[1] "_" b[2] "\t.\t" s[b[7]]}' data/upstream1000.defline | | awk '/^chr[0-9XY]*\t/&&NF==6' > data/upstream1000.bed awk 'BEGIN{s["f"]="+";s["r"]="-";s[""]="."}{split(substr($1,2),b,"_");split($2,a,":");split(a[2],c,"-");\ print a[1] "\t" c[1] "\t" c[2] "\t" b[1] "_" b[2] "\t.\t" s[b[7]]}' data/upstream2000.defline | | awk '/^chr[0-9XY]*\t/&&NF==6' > data/upstream2000.bed awk 'BEGIN{s["f"]="+";s["r"]="-";s[""]="."}{split(substr($1,2),b,"_");split($2,a,":");split(a[2],c,"-");\ print a[1] "\t" c[1] "\t" c[2] "\t" b[1] "_" b[2] "\t.\t" s[b[7]]}' dataupstream5000.defline | | awk '/^chr[0-9XY]*\t/&&NF==6' > data/upstream5000.bed ###################################################################### ###################################################################### ###################################################################### Update: leaving the rest for reference but forking from this point to 'README.FA_genes_take2' https://github.com/TomConlin/Jaspar_FA/blob/master/README.FA_genes_take2 which gets to a better place quicker (hindsight being 20/20 and all) ###################################################################### ###################################################################### ###################################################################### bedtools intersect -wb -a data/upstream1000.bed -b MA_all.bed > MA_up1k.bed bedtools intersect -wb -a data/upstream2000.bed -b MA_all.bed > MA_up2k.bed bedtools intersect -wb -a data/upstream5000.bed -b MA_all.bed > MA_up5k.bed # some counts # transcription_factor v.s gene pairs wc -l < MA_up1k.bed 103832 wc -l < MA_up2k.bed 133534 wc -l < MA_up5k.bed 206316 # unique genes with any transcription factors n-K bp upstream cut -f4 MA_up1k.bed | sort -u | wc -l 31253 cut -f4 MA_up2k.bed | sort -u | wc -l 34937 cut -f4 MA_up5k.bed | sort -u | wc -l 40072 how to represent ... 1k part_of 2k part_of 5k there are only 141 tx factors, so there is going to have to be duplication gene1 <up1k> <tx_1> . gene1 <up1k> <tx_2> . gene1 <up5k> <tx_1> . To get at how many tx_n within 1,2,5 kbp of gene_123? uniq -c refseq_motif_1k.tab | awk '{print $2 "\t" $3 "\t" $1}' > refseq_motif_1k_count.tab uniq -c refseq_motif_2k.tab | awk '{print $2 "\t" $3 "\t" $1}' > refseq_motif_2k_count.tab uniq -c refseq_motif_5k.tab | awk '{print $2 "\t" $3 "\t" $1}' > refseq_motif_5k_count.tab what genes have the same set of tx factors as this gene? SO has Thing + sequence_variant+ structural_variant+ feature_variant+ intergenic_variant+ upstream_gene_variant- 2KB_upstream_variant- 5KB_upstream_variant -------------------------------------------------------------------------------- # filter down to nested sets of mocut -f4 MA_up1k.bed | sort -u | wc -ltif in # 0-1k, 1-2k, 2-5k buckets upstream of gene cut -f4,10 MA_up1k.bed | sort -k1b,1 -k2b,2 > refseq_motif_1k.tab cut -f4,10 MA_up2k.bed | sort -k1b,1 -k2b,2 > refseq_motif_2k.tab cut -f4,10 MA_up5k.bed | sort -k1b,1 -k2b,2 > refseq_motif_5k.tab # 0-1k bucket as is. # 1-2k bucket is the diff of second and first comm -13 refseq_motif_1k.tab refseq_motif_2k.tab > refseq_motif_1-2k.tab # 2-5k bucket is the third minus second comm -13 refseq_motif_2k.tab refseq_motif_5k.tab > refseq_motif_1-2-5k.tab wc -l refseq_motif_1k.tab refseq_motif_1-2k.tab refseq_motif_2-5k.tab 103832 refseq_motif_1k.tab 29702 refseq_motif_1-2k.tab 72782 refseq_motif_2-5k.tab 206316 total note: the 5k file also showed 206,316 mappings (check) update: None of the ontology I find is supporting nested intervals so will just use each of the three collections as is. cut -f2 refseq_motif_5k.tab | sort | uniq -c| sort -nr | wc -l We capture 131 distinct motif, the most common being cut -f2 refseq_motif_5k.tab | sort | uniq -c| sort -nr 11184 MA0162.2 10580 MA0079.3 10467 MA0466.1 8163 MA0154.2 7106 MA0506.1 6361 MA0524.1 5649 MA0473.1 5504 MA0502.1 5229 MA0496.1 5067 MA0495.1 5007 MA0060.2 curl -s http://jaspar.genereg.data/pfm_vertebrates.txtnet/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_vertebrates.txt\ > data/pfm_vertebrates.txt grep "^>" data/pfm_vertebrates.txt | tr -d '>' > pfm_vertebrates.defline wc -l < pfm_vertebrates.defline 519 why so many and only 131 get through? cut -f2 refseq_motif_5k.tab | sort -u > motif_hg19.txt join motif_hg19.txt pfm_vertebrates.defline | wc -l 74 these are tagged with gene symbols, at least one and sometimes several (not all with human nomenclature) sample: MA0514.1 Sox3 MA0515.1 Sox6 MA0516.1 SP2 MA0517.1 STAT1::STAT2 MA0518.1 Stat4 MA0519.1 Stat5a::Stat5b MA0520.1 Stat6 MA0521.1 Tcf12 MA0523.1 TCF7L2 MA0526.1 USF2 MA0527.1 ZBTB33 MA0528.1 ZNF263 Not noticing any FA genes in this set w/annotation Still do not know what it means MA0514.1 attached here to Sox3, is also associated with 120 other genes in hg19 which is all fine and good but attaching Sox3 to all of them seems pointless. -------------------------------------------------------------------------- try another tact curl -s http://jaspar.genereg.net/html/DOWNLOAD/TFFM/TFFM_table.csv > data/TFFM_table.csv headers: TFFM_ID,PFM_ID,TF_name,ChIP-seq_filename,log_p_1st_order,log_p_detailed TFFM ID - the unique ID of the TFFM including version number PFM ID - the JASPAR matrix ID of the profile used initially to train the TFFM TF Name - name of the transcription factor ChIP-seq file - name of the ChIP-seq peak data file used to train the TFFM 1st-order log(p) - log(p-value) of the 1st-order TFFM model Detailed log(p) - log(p-value) of the detailed TFFM model wc -l jspr_tfname.tab 131 jspr_tfname.tab well at least it has the same number of rows as there are motifs used in hg19. join motif_hg19.txt jspr_tfname.tab | wc -l 67 hmmm not helping for example MA0058.3 has the name MAX (human symbol) not found in our subset. indicating 5kbp may not be far enough or it is not upstream or MAX is not annotated on hg19 or ... at this point we have a bunch (131) motifs about half are given a "name of the transcription factor" which is a gene symbol which is sometimes from human nomenclature. And a list of motifs with names including some with human symbols which do not occur in the 5KB region upstream of a gene on hg19. (there are some stray contigs from that assembly which are not included here since the chromosome names are un matched) The modeling for transcription factor binding sites in SO does not include any concepts of frequency within nested upstream buckets. made a dot straw man model jaspar_target_model.gv digraph{ rankdir=LR "BNODE:gene_motif_bucket" -> "gene_motif_bucket" [label="rdfs:label"] "BNODE:gene_motif_bucket" -> "SO:TF_binding_site" [label="rdf:type"] //"BNODE:gene_motif_bucket" -> "SO:five_prime_flanking_region" [label="rdf:type"] "BNODE:gene_motif_bucket" -> "5 (how many of these motifs)" [label="rdf:value"] "BNODE:gene_motif_bucket" -> "http://...JASPAR:motif" [label="OIO:hasdbxref"] "BNODE:gene_motif_bucket" -> "1000 (bucket size in base pair)" [label="GENO:has_extent"] "NCBIGene_123" -> "BNODE:gene_motif_bucket" [label="SO:five_prime_flanking_region"] } I will not be surprised if I can't use "SO:five_prime_flanking_region" as a predicate (don't know its type) and instead need to type the bnode with it but them I am at loss (or tired of looking) for how to attach to the gene. It might also be sweet to replace the bnode with an IRI to a genome browser centered on the bucket with the motifs lit up although I think it would have to be an internal viewer -------------------------------------------------------------------------------- what are the (ncbi)gene IDs for this gene set cut -f1 refseq_motif_5k_count.tab| sort -u > refseq.list $ wc -l < refseq.list 40,072 get a mapping file of all genes from http://www.genenames.org/cgi-bin/download (check external entrezid & refseq) head data/entrez_refseq_from_genenames.txt Entrez Gene ID(supplied by NCBI) RefSeq(supplied by NCBI) 26985 NM_001320263 ... awk '$1 && $2{print $2 "\t" $1}' data/entrez_refseq_from_genenames.txt | wc -l 38,467 not a covering set. try their (HGNC) curated set awk '$1 && $2{print $2 "\t" $1}' data/entrez_refseq_entrezex_refseqex_from_genenames.txt | wc -l 39,864 closer. try combining both sets awk -F"\t" '$1 && $2{print $2 "\t" $1}\ $3 && $4{print $4 "\t" $3}' data/entrez_refseq_entrezex_refseqex_from_genenames.txt\ | sort -u | sort -k1b,1 > refseq_entrez.tab wc -l < refseq_entrez.tab 46,906 plausible. # (the extra sort on the second file is to get this join to work) join refseq.list refseq_entrez.tab > hg19_refseq_entrez_mapping.txt wc -l < hg19_refseq_entrez_mapping.txt 21,869 only about half the refseq are finding NCBIGene IDs which is a good start but I would rather get them all in one shot. What kinds of RefSeqs are included? cut -f1 -d'_' refseq_entrez.tab | sort | uniq -c 1 NC 12948 NG 26186 NM 6522 NR 70 NT 562 XM 602 XR 13 YP all kinds. not finding any species/assembly specific listing. time for overkill. ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz (500M) zcat data/gene2refseq.gz | head -1| tr '\t' '\n' | grep -n . 1:#tax_id 2:GeneID 3:status 4:RNA_nucleotide_accession.version 5:RNA_nucleotide_gi 6:protein_accession.version 7:protein_gi 8:genomic_nucleotide_accession.version 9:genomic_nucleotide_gi 10:start_position_on_the_genomic_accession 11:end_position_on_the_genomic_accession 12:orientation 13:assembly 14:mature_peptide_accession.version 15:mature_peptide_gi 16:Symbol zcat data/gene2refseq.gz | grep "^9606" | cut -f2,4,6,8 | awk -F'\t' '{for(i=2;i<5;i++){if($i!="-"){print substr($i,1,index($i,".")-1) "\t" $1}}}' | sort -u | sort -k1b,1 > m wc -l < refseq_entrez.tab 728,070 join refseq.list refseq_entrez.tab > hg19_refseq_entrez_mapping.txt wc -l < hg19_refseq_entrez_mapping.txt 40,072 perfect. awk 'NR==FNR{a[$1]=$2}NR!=FNR{print "NCBIGene:" a[$1]"\tJASPAR:"$2"\t"substr(FILENAME,14,2)"\t"$3}' hg19_refseq_entrez_mapping.txt refseq_motif_?k_count.tab | sort -u > GENE_JASPAR_bucket_count.tab wc -l < GENE_JASPAR_bucket_count.tab 387,518 wc -l < GENE_JASPAR_bucket_count.tab 179,789 that seems a bit high wc -l refseq_motif_?k_count.tab 92295 refseq_motif_1k_count.tab 117954 refseq_motif_2k_count.tab 177269 refseq_motif_5k_count.tab 387518 total okay, I guess it looks on track. it is genes (40k) * avg_#_of_motifs_in_upstream_bucket which looks to be under five, cut -f 1-3 GENE_JASPAR_bucket_count.tab |sort -u | wc -l 177,981 why the difference of couple K? shouldn't the counts all collapse per gene-motif-bucket. cut -f1,2 refseq_motif_1k_count.tab | sort -u | wc -l 92295 cut -f1,2 refseq_motif_2k_count.tab | sort -u | wc -l 117954 cut -f1,2 refseq_motif_5k_count.tab | sort -u | wc -l 177269 within each bucket, adding the count does not change gene+motif uniqueness 177,269 / 40,072 = 4.423762228 motifs per gene in the 5K bucket 131 choose 5 is about 40 billion 131 choose 4.4 about 2.3 billion 131 choose 4 is about 300 million 131 choose 3 is about 2.2 million 131 choose 2 is 17,161 So a couple of matching motifs may not be something to write home about but more than that gets interesting quickly. ------------------------------------------------------------ Converting the tab file into RDF will look like: for line in GENE_JASPAR_bucket_count.tab; (ncbigene, motif, bucket, count) = line bnode = '_:' + sha1sum(ncbigene + motif + bucket)[1,21] spo(ncbigene, 'SO:five_prime_flanking_region', bnode) spo(bnode, 'rdf:type', 'SO:TF_binding_site') spo(bnode, 'rdfs:label', ncbigene + ' ' + motif + ' ' + 'bucket') spo(bnode, 'rdf:value', count) spo(bnode, 'GENO:has_extent', bucket) spo(bnode, 'OIO:has_xbxref', motif) ------------------------------------------------------- Still need to get specific for the human FA set zcat data/gene2refseq.gz | grep "^9606" |cut -f2,16| sort -u > ncbigene_symbol.tab wc -l < ncbigene_symbol.tab 55,506 grep -f FA_symbol.list ncbigene_symbol.tab | sort -k2 675 BRCA2 80233 FAAP100 199990 FAAP20 91442 FAAP24 2175 FANCA 2187 FANCB 2176 FANCC 2177 FANCD2 115795 FANCD2OS 100421239 FANCD2P1 101929530 FANCD2P2 2178 FANCE 2188 FANCF 2189 FANCG 55215 FANCI 55120 FANCL 57697 FANCM 29089 UBE2T the three in the middle have suffixes appended [OS P1 P2] respectively and should likely be ignored in favor of "NCBIGene:2177 FANCD2" leaves 14 symbols not found. MHF1 MHF2 FAAP10 FAAP16 FANCD1 FANCJ FANCN FANCO FANCP FANCQ FANCR FANCS FANCU FANCV Maybe a job for myGene.info foo=FAAP16 curl -s "http://mygene.info/v3/query?species=human&q=alias:${foo}" | jq . { "max_score": 9.465674, "took": 2, "total": 1, "hits": [ { "_id": "10459", "_score": 9.465674, "entrezgene": 10459, "name": "mitotic arrest deficient 2 like 2", "symbol": "MAD2L2", "taxid": 9606 } ] } good enough. for foo in $(cat FA_alias); do echo ${foo}; curl -s "http://mygene.info/v3/query?species=human&fields=symbol,entrezgene&q=alias:${foo}" |\ jq '[.hits[].entrezgene, .hits[].symbol]'; done | tr -d '\n,"[' | tr ']' '\n' MHF1 378708 CENPS MHF2 201254 CENPX FAAP10 201254 CENPX FAAP16 378708 CENPS FANCD1 675 BRCA2 FANCJ 83990 BRIP1 FANCN 79728 PALB2 FANCO 5889 RAD51C FANCP 84464 SLX4 FANCQ 2072 ERCC4 FANCR 5888 RAD51 FANCS 672 BRCA1 FANCU 7516 XRCC2 FANCV 10459 MAD2L2 ... | awk '{print $2,$3}' | sort -u -k2 ------------------------------------------ which gives us an actionable set of 26 NCBIGene to query with. 672 BRCA1 675 BRCA2 2072 ERCC4 2175 FANCA 2176 FANCC 2177 FANCD2 2178 FANCE 2187 FANCB 2188 FANCF 2189 FANCG 5888 RAD51 5889 RAD51C 7516 XRCC2 10459 MAD2L2 29089 UBE2T 55120 FANCL 55215 FANCI 57697 FANCM 79728 PALB2 80233 FAAP100 83990 BRIP1 84464 SLX4 91442 FAAP24 199990 FAAP20 201254 CENPX 378708 CENPS ----------------------------------------------------------- time to build a data store to query these against. Best may be a dipper ingest (started) but that is wasteful if it wont be added to monarch or worse cannot usefully answer the question at hand The specific question asked would be: Which other genes have the most similar set(s) of upstream motifs as this set? How do we characterize this sets motif repertoire? join FA_NCBIGene_symbol.txt GENE_JASPAR_bucket_count.tab ... irritating sort issues sort -k1b,1 FA_NCBIGene_symbol.txt > foo sort -k1b,1 GENE_JASPAR_bucket_count.tab > bar join foo bar > FA_NCBIGene_sym_JAS_bkt_count.txt wc -l < FA_NCBIGene_sym_JAS_bkt_count.txt 370 26 * 4.2 = 114.4 * three buckets ~> 343.2 nothing disturbing there. grep 1k FA_NCBIGene_sym_JAS_bkt_count.txt | wc -l 95 grep 2k FA_NCBIGene_sym_JAS_bkt_count.txt | wc -l 119 grep 5k FA_NCBIGene_sym_JAS_bkt_count.txt | wc -l 156 grep 1k FA_NCBIGene_sym_JAS_bkt_count.txt | cut -f1 -d ' '| sort -u | wc -l 24 two genes in our FA-set do not have any motifs reported in the first 1k and one (NCBIGene:2072, ERCC4 alias FANCQ) has none within 5K with an uniform distribution the remaining genes would be averaging close to four different motifs. how many different motifs in each bucket for this FA-gene set? grep 1k FA_NCBIGene_sym_JAS_bkt_count.txt | cut -f3 -d ' '| sort -u | wc -l 35 grep 2k FA_NCBIGene_sym_JAS_bkt_count.txt | cut -f3 -d ' '| sort -u | wc -l 41 grep 5k FA_NCBIGene_sym_JAS_bkt_count.txt | cut -f3 -d ' '| sort -u | wc -l 51 more than doubling the size of the bucket less than doubles the number of motif flavors... suggests some focus What are the most popular favors of motifs within this FA-gene set? grep 1k FA_NCBIGene_sym_JAS_bkt_count.txt | cut -f3 -d ' '| sort | uniq -c | sort -nr | head -20 7 JASPAR:MA0502.1 7 JASPAR:MA0076.2 6 JASPAR:MA0079.3 5 JASPAR:MA0471.1 5 JASPAR:MA0470.1 5 JASPAR:MA0162.2 4 JASPAR:MA0524.1 4 JASPAR:MA0506.1 4 JASPAR:MA0475.1 4 JASPAR:MA0144.2 3 JASPAR:MA0154.2 3 JASPAR:MA0093.2 3 JASPAR:MA0080.3 3 JASPAR:MA0060.2 grep 1k FA_NCBIGene_sym_JAS_bkt_count.txt | cut -f3 -d ' '| sort | uniq -c | sort -nr | head -20 9 JASPAR:MA0502.1 7 JASPAR:MA0079.3 7 JASPAR:MA0076.2 6 JASPAR:MA0524.1 6 JASPAR:MA0471.1 6 JASPAR:MA0470.1 5 JASPAR:MA0162.2 4 JASPAR:MA0506.1 4 JASPAR:MA0475.1 4 JASPAR:MA0154.2 4 JASPAR:MA0144.2 3 JASPAR:MA0526.1 3 JASPAR:MA0516.1 3 JASPAR:MA0510.1 3 JASPAR:MA0473.1 3 JASPAR:MA0095.2 3 JASPAR:MA0093.2 3 JASPAR:MA0080.3 grep 5k FA_NCBIGene_sym_JAS_bkt_count.txt | cut -f3 -d ' '| sort | uniq -c | sort -nr | head -30 9 JASPAR:MA0502.1 9 JASPAR:MA0466.1 7 JASPAR:MA0079.3 7 JASPAR:MA0076.2 6 JASPAR:MA0471.1 6 JASPAR:MA0470.1 6 JASPAR:MA0060.2 5 JASPAR:MA0526.1 5 JASPAR:MA0524.1 5 JASPAR:MA0162.2 5 JASPAR:MA0144.2 4 JASPAR:MA0506.1 4 JASPAR:MA0496.1 4 JASPAR:MA0495.1 4 JASPAR:MA0475.1 4 JASPAR:MA0154.2 4 JASPAR:MA0095.2 4 JASPAR:MA0093.2 4 JASPAR:MA0080.3 3 JASPAR:MA0516.1 3 JASPAR:MA0510.1 3 JASPAR:MA0491.1 3 JASPAR:MA0490.1 3 JASPAR:MA0473.1 3 JASPAR:MA0102.3 Motifs shared by more than two genes in the FA-gene set none cover more than a third. want to see these. array[gene,motif]=count ./make_fagene_motif_table.awk FA_NCBIGene_symbol.txt GENE_JASPAR_bucket_count.tab NCBIGene:10459 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 NCBIGene:199990 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 1 1 0 0 NCBIGene:201254 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 NCBIGene:2175 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 NCBIGene:2176 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 NCBIGene:2177 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NCBIGene:2178 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 1 0 1 2 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 NCBIGene:2187 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NCBIGene:2188 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 NCBIGene:2189 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NCBIGene:29089 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 NCBIGene:55120 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NCBIGene:55215 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NCBIGene:57697 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 NCBIGene:5888 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 NCBIGene:5889 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 NCBIGene:672 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 NCBIGene:675 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 NCBIGene:7516 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 NCBIGene:79728 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 NCBIGene:80233 0 0 1 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NCBIGene:83990 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 NCBIGene:84464 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 4 0 0 0 0 0 0 0 NCBIGene:91442 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 oy, maybe not. consider another hundred zeros on the end & its a very sparse matrix hmmm maybe a ML classification approach. 140 columns 40,000 rows, an average of 4 non-zero per row training classification on just 25 rows strikes me as inadequate and I am not aware of a means to increase the training set before hand since that is the goal of the exercise. An approach to consider is; cluster all, then pool the clusters containing members of our FA-gene set. Hopefully choosing a viable cluster strategy won't take forever. Maybe start with something brain dead e.g. have at least three motifs in common before turning to black boxes. Will take the order of genes in 'ncbigene_symbol.tab' as row labels make an ordered sequence for column labels cut -f2 refseq_motif_5k_count.tab | sort -u > motif.list I wonder how many genes will have identical motif signatures. I wonder if what I will need here is some concept of nested sets. I can get the identicals using digests but that wont help nesting. bitvectors would do it if we replace the count with existence (too bad there are more than 128 flavors) ./make_gene_motif_table.awk GENE_JASPAR_bucket_count.tab -rw-r--r-- 1 tomc tomc 14M Apr 12 15:26 upstream_bucket_1k.matrix -rw-r--r-- 1 tomc tomc 14M Apr 12 15:26 upstream_bucket_2k.matrix -rw-r--r-- 1 tomc tomc 14M Apr 12 15:26 upstream_bucket_5k.matrix 14M. hmm maybe a sparser format is called for but with these we can do a sideways sums (Hamming weight) and maybe an (big) integer representation ./bits_of_ham.awk upstream_bucket_1k.matrix > upstream_bucket_1k_ham_bit.tab head upstream_bucket_1k_ham_bit.tab 0 0 0 0 0 0 0 0 0 0 0 0 2 18014398509482016 4 604462909807340357156864 0 0 3 17592186569728 grep "-" upstream_bucket_1k.matrix no negative number reported which would indicate over/under flow of bit vector there is a 24 digit number there and a 31 digit ... I think awk is handling arbitrary precision. (maybe uses bc lib under the hood) 2^131 2722258935367507707706996859454145691648 only 41 decimal digits max ./bits_of_ham.awk upstream_bucket_2k.matrix > upstream_bucket_2k_ham_bit.tab ./bits_of_ham.awk upstream_bucket_5k.matrix > upstream_bucket_5k_ham_bit.tab So the ones we might be able to say anything significant about are likely to have more two motifs, how many identical collections of more than two motifs are there? awk '$1 > 2{a[$2]++}END{print length(a) ": gene [sets] with more than 2 motifs"}' upstream_bucket_1k_ham_bit.tab 3351: gene [sets] with more than 2 motifs awk '$1 > 2{a[$2]++}END{print length(a) ": gene [sets] with more than 2 motifs"}' upstream_bucket_2k_ham_bit.tab 4715: gene [sets] with more than 2 motifs awk '$1 > 2{a[$2]++}END{print length(a) ": gene [sets] with more than 2 motifs"}' upstream_bucket_5k_ham_bit.tab 7672: gene [sets] with more than 2 motifs of course some could be singleton sets, but roughly 10% of the genes can hope to play. awk '$1 > 2{print $2}' upstream_bucket_1k_ham_bit.tab | wc -l 6885 awk '$1 > 2{print $2}' upstream_bucket_2k_ham_bit.tab | wc -l 8551 awk '$1 > 2{print $2}' upstream_bucket_5k_ham_bit.tab | wc -l 11753 yes, nearly half are singletons but they may still be or contain subsets. awk '$1 > 2{print $2}' upstream_bucket_1k_ham_bit.tab | sort | uniq -c |sort -nr | head 176 10141204801825835211973625643008 135 40564819207303340847894502572032 132 2535301200456458802993406410752 104 9904124777192849513780346880 79 10141214473232392129007023292416 69 2535310871863015720026804060160 64 9903520314283042199192993792 60 604462909807314587353088 56 9671406556917033397649408 51 9671406556917050577518592 awk '$1 > 2{print $2}' upstream_bucket_2k_ham_bit.tab | sort | uniq -c |sort -nr | head 171 40564819207303340847894502572032 171 10141204801825835211973625643008 140 2535301200456458802993406410752 83 9904124777192849513780346880 64 2535310871863015720026804060160 62 10141214473232392129007023292416 56 604462909807314587353088 55 40564828878709897764927900221440 52 9903520314283042199192993792 48 9671406556917033397649408 awk '$1 > 2{print $2}' upstream_bucket_5k_ham_bit.tab | sort | uniq -c |sort -nr | head 177 40564819207303340847894502572032 171 10141204801825835211973625643008 157 2535301200456458802993406410752 59 43100120407759799650887908982784 54 2535310871863015720026804060160 53 10141214473232392129007023292416 51 9904124777192849513780346880 49 5070602400912917605986812821504 43 10141205406288745019288212996096 42 14167099448608935641088 We do get many genes with identical motif sets of more than two! For sub setting this scheme does have the nice property that the numeric signature of the superset has to be strictly larger than the numeric signature of the subset, and that (superset_sig AND subset_sig) == subset_sig --------------------------------------------------------- We have: <gene> <upstream> <motif> (and attributes) we want: <gene> <upstream> <motif_set> and <motif_set_1> <subset_of> <motif_set_2> * where the sets have more than two elements will also need <motif> <element_of> <motif_set> awk '$1 > 2{ms[$2]=ms[$2] "\t" NR}END{for(s in ms){split(substr(ms[s],2),a,"\t");for(g in a){print a[g] "\t" s}}}' upstream_bucket_1k_ham_bit.tab > gene_mset_1k.tab head gene_mset_1k.tab 5855 17179936768 13975 17179936768 3646 4611686018460958720 3647 4611686018460958720 5614 4611686018460958720 8829 3096224743817224 15579 43110034203943549417435086979072 6787 154742504911235518675550208 8313 139586437152 12344 11408855402496804485637867569152 wc -l < gene_mset_1k.tab 6885 awk '$1 > 2{ms[$2]=ms[$2] "\t" NR}END{for(s in ms){split(ms[s],a,"\t");if(length(a)>1){for(g in a){print g "\t" s}}}}' upstream_bucket_2k_ham_bit.tab > gene_mset_2k.tab awk '$1 > 2{ms[$2]=ms[$2] "\t" NR}END{for(s in ms){split(ms[s],a,"\t");if(length(a)>1){for(g in a){print g "\t" s}}}}' upstream_bucket_5k_ham_bit.tab > gene_mset_5k.tab wc -l gene_mset_?k.tab 6885 gene_mset_1k.tab 8551 gene_mset_2k.tab 11753 gene_mset_5k.tab that is the <gene> <upstream> <motif_set>. (* for motif sets of more than 2 flavors) ------------------------------------------------------ find subsets awk '$1 > 2{print $2}' upstream_bucket_1k_ham_bit.tab | sort -nru |\ awk '{a[NR]=$1}END{for(i=1;i<NR;i++){for(j=i+1;j<=NR;j++){\ if(a[j] == and(a[i],a[j])){print a[i] "\t" a[j]}}}}' > mset_smset_1k.tab head mset_smset_1k.tab 4613937835420977152 2251816993556480 4613937818287210496 2251799826268160 4613937818287210496 46137344 4612530462718432256 844442110005248 4612530462718432256 844442110002176 4612530462718432256 844442110001152 4612530462718432256 844427077615616 4612530462718432256 844424930136064 4612530462718432256 562967133294592 4612530462718432256 281492156583936 wc -l < mset_smset_1k.tab 948 plausible. awk '$1 > 2{print $2}' upstream_bucket_2k_ham_bit.tab | sort -nru |\ awk '{a[NR]=$1}END{for(i=1;i<NR;i++){for(j=i+1;j<=NR;j++){\ if(a[j] == and(a[i],a[j])){print a[i] "\t" a[j]}}}}' > mset_smset_2k.tab awk '$1 > 2{print $2}' upstream_bucket_5k_ham_bit.tab | sort -nru |\ awk '{a[NR]=$1}END{for(i=1;i<NR;i++){for(j=i+1;j<=NR;j++){\ if(a[j] == and(a[i],a[j])){print a[i] "\t" a[j]}}}}' > mset_smset_5k.tab wc -l < mset_smset_2k.tab 1247 wc -l < mset_smset_5k.tab 1531 which covers explicit subset relations ------------------------------------------------------ Subsets seem useful here because a collection of TFs working on one gene has potential to also work on another. But this tells me nothing about what they do not have in common. We have sets.. could do pairwise Jaccard across all genes, |intersection(A,B)| / |union(A,B)| but that will have a lot of useless information (40,000^2-40,000)/2 = 799,980,00 is about 800 million and there are three region sizes [1k,2k,5k] so in the order of 2 billion pairwise metrics. oy could generate then filter but would rather not. motif membership in a set (maybe irrelevant...) should finish this set. -- will think about it first. sets are already connected to genes and genes are connected to motifs and in general, which motifs are in a set is not what anyone has asked about ... especiallly unrelated to a gene. Will model for a while ------------------------------------------------------------------------------- so here is the plan superset | gene->up_region->motif_set-->motif | subset the answers will mainly happen between gene and motif_set what motif_set does my_gene have? what other genes have my_gene's motif_set? what other genes include my_gene's motif_set? what other genes have some of my_gene's motif_set? I can not avoid the motif_sets of < 3 because it is inconsistent, although I will not include them in super/sub sets. if there are less than three flavors of motifs make a tabbed list of genes with the set of flavors then turn around and make a gene to flavor pair from the list awk '$1 < 3 {ms[$2]=ms[$2] "\t" NR}END{for(s in ms){split(substr(ms[s],2),a,"\t");for(g in a){print a[g] "\t" s}}}' upstream_bucket_1k_ham_bit.tab > gene_msetlet_1k.tab cut -f2 gene_mset_1k.tab | sort -u | wc -l 3351 cut -f2 gene_mset_2k.tab | sort -u | wc -l 4715 cut -f2 gene_mset_5k.tab | sort -u | wc -l 7672 ------------------------------- cut -f2 gene_msetlet_1k.tab | sort -u | wc -l 673 cut -f2 gene_msetlet_2k.tab | sort -u | wc -l 748 cut -f2 gene_msetlet_5k.tab | sort -u | wc -l 749 -------------------------------------- what are the basic counts? there are ~131 motifs there are 7672 + 749 == 8,421 motif sets (in the 5k regions) there are 1,531 motif subsets (in the 5k regions) there are (131^2)-131 == 17,030 di-motifs possible (partial orders) there are ~ 50k upstream regions ################################################################################ ################################################################################ ################################################################################ I have abstracted away (at least) several important pieces of information * how many of which flavor motif, copy number * and order of flavors (recover partial order with di-motifs?) * distance between motifs (too close or too far may matter) Dan is arguing to consider order even if it is expensive. A very short paper suggesting order _does_ matter. http://klab.tch.harvard.edu/publications/PDFs/gk2023.pdf see if we can come up with an efficient partial order signature cut -f 4,6,8-10 MA_up1k.bed |awk '{print $1,int(($3+$4)/($2 2)),$5}'|sort -k1,1 -k2n,2 | head NM_000016 76189990 MA0471.1 NM_000017 121163496 MA0506.1 NM_000018 7122523 MA0473.1 NM_000021 73603082 MA0162.2 NM_000021 73603107 MA0162.2 NM_000021 73603142 MA0475.1 NM_000022 -43280924 MA0154.2 NM_000022 -43280470 MA0144.2 NM_000022 -43280469 MA0473.1 NM_000024 148205952 MA0154.2 note: we only care about the order within a region UPDATE NOTE: with respect to the sense of the gene. hence the negative locations which will sort as intended cut -f 4,6,8-10 MA_up1k.bed |awk '{print $1,int(($3+$4)/($2 2)),$5}'| sort -k1,1 -k2n,2 |\ awk '$1!=r{r=$1;m=$3} ($1==r)&&(m!=$3){print r"\t" m "\t" $3; m=$3}' | head NM_000021 MA0162.2 MA0475.1 NM_000022 MA0473.1 MA0144.2 NM_000022 MA0144.2 MA0154.2 NM_000024 MA0154.2 MA0502.1 NM_000026 MA0079.3 MA0516.1 NM_000030 MA0466.1 MA0524.1 NM_000031 MA0162.2 MA0079.3 NM_000031 MA0079.3 MA0162.2 NM_000031 MA0162.2 MA0506.1 NM_000033 MA0473.1 MA0527.1 dimotifs (64,377 for 22,262 genes 1k region) half the genes have no dimotifs (there are other motif datasets out there with more, over 600) cut -f 4,6,8-10 MA_up1k.bed |awk '{print $1,int(($3+$4)/($2 2)),$5}' | sort -k1,1 -k2n,2 |awk '$1!=r{r=$1;m=$3} ($1==r)&&(m!=$3){print r"\t" m "\t" $3; m=$3}' | sort -u > refseq_dimotif_1k.tab cut -f 4,6,8-10 MA_up2k.bed |awk '{print $1,int(($3+$4)/($2 2)),$5}' | sort -k1,1 -k2n,2 |awk '$1!=r{r=$1;m=$3} ($1==r)&&(m!=$3){print r"\t" m "\t" $3; m=$3}' | sort -u > refseq_dimotif_2k.tab cut -f 4,6,8-10 MA_up5k.bed |awk '{print $1,int(($3+$4)/($2 2)),$5}' | sort -k1,1 -k2n,2 |awk '$1!=r{r=$1;m=$3} ($1==r)&&(m!=$3){print r"\t" m "\t" $3; m=$3}' | sort -u > refseq_dimotif_5k.tab # the unique is because the same dimotif can show up multiple times in a region wc -l refseq_dimotif_?k.tab 64,377 refseq_dimotif_1k.tab 88,059 refseq_dimotif_2k.tab 148,651 refseq_dimotif_5k.tab 301087 total cut -f1 refseq_dimotif_5k.tab | uniq -c | sort -nr | head 28 NM_025195 27 NM_001198806 26 NM_130898 26 NM_003364 26 NM_001287430 26 NM_001287429 26 NM_001255980 26 NM_001255978 26 NM_001039848 25 NM_015125 wow. was not expecting that some genes have pretty long sequences of distinct motif pairs have to think about how to represent so many an a reasonable way. maybe this will be best expressed as an off line precalculation what are the most popular dimotifs? cut -f2,3 refseq_dimotif_5k.tab |sort | uniq -c | sort -nr | head 2322 MA0495.1 MA0496.1 2112 MA0105.2 MA0105.3 1951 MA0488.1 MA0492.1 1774 MA0093.2 MA0526.1 1447 MA0060.2 MA0502.1 1402 MA0502.1 MA0060.2 1173 MA0476.1 MA0491.1 1042 MA0076.2 MA0475.1 956 MA0470.1 MA0471.1 902 MA0496.1 MA0495.1 egrep -c "MA0496.1.MA0495.1$" < refseq_dimotif_5k.tab 902 v.s. 2322 egrep -c "MA0105.3.MA0105.2$" < refseq_dimotif_5k.tab 171 v.s. 2112 egrep -c "MA0492.1.MA0488.1$" < refseq_dimotif_5k.tab 387 v.s. 1951 top three dimotifs are significantly more common than the reverse order. that is a result. ---------------------------------- SO. Would like the Jaccard but way too many, and most would be useless remember that was 2 billion pairwise gene calculations involving an average of 9 motifs each. not good. And there were at most only 8,421 distinct motif sets ( 8,421^2- 8,421)/2 == 35,452,410 35.5 Million still too many but perhaps post process filtering will help. here only looking at motif sets with more than two flavors cut -f2 gene_mset_1k.tab| sort -u | wc -l 3351 cut -f2 gene_mset_2k.tab| sort -u | wc -l 4715 cut -f2 gene_mset_5k.tab| sort -u | wc -l 7672 time \ cut -f2 gene_mset_1k.tab| sort -nru |grep -v "^$" | ./motif_set_jaccard.awk | sort -k1nr,1 -k2nr,2 -k3nr,3 > motif_set_jaccard_1k.tab real 0m5.536s user 0m5.540s sys 0m0.032s head motif_set_jaccard_1k.tab 9223372062625103872 4899916411792523264 0.25 9223372062625103872 4622945034709237760 0.2 9223372062625103872 4620693243485487104 0.5 9223372062625103872 4613937835454496768 0.2 9223372062625103872 4613937835420977152 0.166667 9223372062625103872 4612530462718432256 0.111111 9223372062625103872 4612530460570948608 0.125 9223372062625103872 4612266560567476224 0.142857 9223372062625103872 4612248985594232832 0.2 9223372062625103872 4611967510583967744 0.25 wc -l < motif_set_jaccard_1k.tab 99,345 (non-zero out of 5,612,925 possible 1.77% ) cut -f2 gene_mset_2k.tab| sort -nru |grep -v "^$" | ./motif_set_jaccard.awk | sort -k1nr,1 -k2nr,2 -k3nr,3 > motif_set_jaccard_2k.tab time cut -f2 gene_mset_5k.tab| sort -nru |grep -v "^$" | ./motif_set_jaccard.awk | sort -k1nr,1 -k2nr,2 -k3nr,3 > motif_set_jaccard_5k.tab 25 seconds wc -l motif_set_jaccard_?k.tab 99345 motif_set_jaccard_1k.tab 155315 motif_set_jaccard_2k.tab 254021 motif_set_jaccard_5k.tab 508681 total half a million hints, but they get multiplied as they are applied 2X as they apply to each motif_set in the pair times the number of genes which have the motif_set this amounts to jaccard between the set motifs per gene choosing a cutoff cut -f3 motif_set_jaccard_k.tab | sort -nr | uniq -c |otsu.awk 0.833333 all buckets (1k,2k & 5k) return 0.8333 awk '$3 >= 0.833333 {count++}END{print count}' motif_set_jaccard_?k.tab a few precious hundred with that threshold. How about, better than 50/50 awk '$3 >= 0.5 {count++}END{print count}' motif_set_jaccard_?k.tab 23153 more like under ten thousand (per bucket) so on average maybe one in four genes have a (weak) recommendation dropping down to a third awk '$3 >= 0.33333 {count++}END{print count}' motif_set_jaccard_?k.tab 82,529 I think looking at what quality of associations are being generated will have to inform where to set the cutoff (if any). ---------------------------------------------------------------- I really like the idea of the dimotifs. maybe what we want is a jaccard on the sets of edges (dimotifs) for each pairwise motif set. The jaccard on edges is fundamentally different than the jaccard over nodes because it goes back to depending explicitly on each region and and cannot rely on the motif sets because order matters. Could include a distance between adjacent motifs, which would introduce multi edges (one per weight) Could also include self edges which convey repeats cut -f 4,8-10 MA_up1k.bed|awk '{print $1,int(($2+$3)/2),$4}'|sort -k1,1 -k2n,2|\ awk '($1==r){print r"\t" m "\t" $3 "\t" $2-l;l=$2;m=$3}$1!=r{r=$1;l=$2;m=$3}' | head NM_000021 MA0162.2 MA0162.2 25 NM_000021 MA0162.2 MA0475.1 35 NM_000022 MA0473.1 MA0144.2 1 NM_000022 MA0144.2 MA0154.2 454 NM_000024 MA0154.2 MA0154.2 43 NM_000024 MA0154.2 MA0502.1 110 NM_000026 MA0079.3 MA0516.1 2 NM_000030 MA0466.1 MA0524.1 199 NM_000030 MA0524.1 MA0524.1 57 NM_000031 MA0162.2 MA0079.3 49 overlapping motifs / nearly overlapping. I had not been expecting that. overlapping most likely means only one of them can be in play at the same time short distances must be ORs and larger distances are/could be ANDs so the OR should not be its own edge but split into probabilistic edges with adjacent upstream/downstream neighbors (in the future) NM_000024 MA0154.2 MA0154.2 43 and NM_000030 MA0524.1 MA0524.1 57 are (tandem) repeat which should reinforce the signal NM_000022 MA0473.1 MA0144.2 1 and NM_000026 MA0079.3 MA0516.1 2 must be overlaps which means maybe we can derive some measure of closeness between motif flavors (hopefully someone somewhere has already done this) cut -f 4,8-10 MA_up1k.bed|awk '{print $1,int(($2+$3)/2),$4}'|sort -k1,1 -k2n,2|\ awk '($1==r){print r"\t" m "\t" $3 "\t" $2-l;l=$2;m=$3}$1!=r{r=$1;l=$2;m=$3}'\ > refseq_m1_m2_dist_1k.tab cut -f 4,8-10 MA_up2k.bed|awk '{print $1,int(($2+$3)/2),$4}'|sort -k1,1 -k2n,2|\ awk '($1==r){print r"\t" m "\t" $3 "\t" $2-l;l=$2;m=$3}$1!=r{r=$1;l=$2;m=$3}' \ > refseq_m1_m2_dist_2k.tab cut -f 4,8-10 MA_up5k.bed|awk '{print $1,int(($2+$3)/2),$4}'|sort -k1,1 -k2n,2|\ awk '($1==r){print r"\t" m "\t" $3 "\t" $2-l;l=$2;m=$3}$1!=r{r=$1;l=$2;m=$3}' \ > refseq_m1_m2_dist_5k.tab wc -l refseq_m1_m2_dist_?k.tab 72579 refseq_m1_m2_dist_1k.tab 98597 refseq_m1_m2_dist_2k.tab 166244 refseq_m1_m2_dist_5k.tab 337420 total dimotifs cut -f1 refseq_m1_m2_dist_1k.tab | sort -u | wc -l 23,263 72,579 / 23,263 ~ 3.12 dimotif / gene (with a dimotif which is just over half) cut -f1 refseq_m1_m2_dist_2k.tab | sort -u | wc -l 27,227 98,597 / 27,227 ~ 3.621295038 dimotif / gene cut -f1 refseq_m1_m2_dist_5k.tab | sort -u | wc -l 34,466 166244/ 34,466 ~4.823420182 dimotif / gene Again that is a a lot with (N^2-N)/2, too many?* 1k) 270,571,953 * 3.621295038 ~ 979,820,870 2k) 370,641,151 * 3.621295038 ~ 1,342,200,961 5k) 593,935,345 * 4.823420182 ~ 2,864,799,730 ~ 5,186,821,559 comparisons but the intersection of these edge sets may winnow down to zero even more quickly than the node sets did dropping distance and parallel edges begin with, keeping self edges I do not have a nice numeric representation of the dimotifs. important to have when considering in the order of 5 billion comparisons ... There are always only two, which means we can represent each pair with a couple of bytes dimotif = (index_of(m1)<<8) & index_of(m2) results need to be a metric between gene regions since there are so many potential pairs I think it is safe to start with a cutoff where we need to have at least half the dimotifs in common to be related time \ cut -f1-3 refseq_m1_m2_dist_1k.tab| sort -u | \ ./dimotif_jaccard.awk motif.list - > refseq_pair_dimotif_jaccard_1k.tab real 42m32.958s user 42m32.844s sys 0m0.116s wc -l < refseq_pair_dimotif_jaccard_1k.tab 169,668 averages to each gene gets four recommendations not that it is how it will work out cut -f3 refseq_pair_dimotif_jaccard_1k.tab | sort -n |uniq -c | sort -nr 151349 1 13599 0.666667 2430 0.6 1121 0.75 279 0.571429 247 0.8 195 0.833333 79 0.555556 76 0.714286 62 0.625 49 0.857143 37 0.777778 36 0.7 27 0.888889 18 0.545455 15 0.909091 14 0.875 8 0.583333 8 0.538462 6 0.642857 4 0.9 3 0.692308 2 0.636364 2 0.615385 1 0.928571 head of the dimotif jaccard score distribution the genes are not ordered pairwise in any particular way but lets see who is in the "in" crowd cut -f1 refseq_pair_dimotif_jaccard_1k.tab | sort |uniq -c | sort -nr | head 209 NM_000142 208 NM_000218 207 NM_000874 206 NM_000922 205 NM_000941 204 NM_001001669 203 NM_001001872 202 NM_001001928 201 NM_001002020 200 NM_001003938 cut -f2 refseq_pair_dimotif_jaccard_1k.tab | sort |uniq -c | sort -nr | head 209 NM_207585 208 NM_207584 207 NM_207578 206 NM_203314 205 NM_198686 204 NM_198428 203 NM_198088 202 NM_198087 201 NM_178844 200 NM_178568 time \ cut -f1-3 refseq_m1_m2_dist_2k.tab| sort -u | \ ./dimotif_jaccard.awk motif.list - > refseq_pair_dimotif_jaccard_2k.tab real 77m14.509s user 77m14.372s sys 0m0.116s time \ cut -f1-3 refseq_m1_m2_dist_5k.tab| sort -u | \ ./dimotif_jaccard.awk motif.list - > refseq_pair_dimotif_jaccard_5k.tab real 181m43.807s user 181m42.324s sys 0m0.228s wc -l refseq_pair_dimotif_jaccard_1k.tab 169,668 refseq_pair_dimotif_jaccard_1k.tab wc -l refseq_pair_dimotif_jaccard_2k.tab 136,123 refseq_pair_dimotif_jaccard_2k.tab wc -l refseq_pair_dimotif_jaccard_5k.tab 108,856 refseq_pair_dimotif_jaccard_5k.tab there is an interesting outcome including more dimotifs means fewer make the grade ############################################################ [ belay this next part.... went with jaccard over sets of graph parts nodes|edge without ever assembling a graph ] My current thinking is surprise surprise these are graphs (per gene) and by choosing a similarity metric (and a threshold) we can construct a graph of related genes based on how related their graphs are look at the big one grep "^NM_025195" refseq_dimotif_5k.tab | \ awk -F'\t' 'BEGIN{print "digraph{rankdir=LR"}{print "\"" $2 "\" -> \"" $3 "\""}END{print "}"}' > NM_025195_forest.gv xdot NM_025195_forest.gv clearly not a DAG. (so there can't be trees, so I should not call it a forest) ####################################################### Turn the data into RDF. Use the graphviz model as a template sed 's|//.*||g' jaspar_target_model.gv| \ awk -F'"' '/.* -> .*/ {print "# <" $2 "><" $6 "><" $4 ">"}' jaspar_target_model.gv # <NCBIGene:123><SO:adjacent_to><BNODE:gene1_upstream_region> # <BNODE:gene1_upstream_region><rdfs:label><gene1_upstream_region> # <BNODE:gene1_upstream_region><GENO:has_extent><1000 (region extent bp)> # <BNODE:gene1_upstream_region><rdf:type><SO:five_prime_flanking_region> # <BNODE:gene1_upstream_region><rdf:comment><Candidate SO:TF_binding_sites> # <BNODE:motif_set><rdfs:label><motif_set_sig> # <BNODE:motif_set><rdf:type><SIO:collection> # <BNODE:motif_set><rdf:value><3 (flavors)> # <BNODE:motif_set><OIO:hasdbxref><http:JASPAR:motif> # <BNODE:motif_set><OIO:subset><BNODE:motif_set> # <BNODE:motif_set><RO:has member><JASPAR:motif> # <JASPAR:motif><rdf:type><SO:nucleotide_motif> # <JASPAR:motif><OIO:hasdbxref><http:JASPAR:motif> # <JASPAR:motif><rdfs:label><motif> # <BNODE:gene1_upstream_region><RO:member of><BNODE:pairwise similarity> # <BNODE:gene2_upstream_region><RO:member of><BNODE:pairwise similarity> # <BNODE:pairwise similarity><rdfs:label><gene1_region gene2_region> # <BNODE:pairwise similarity><SWO:Similarity score><0.73> # <BNODE:pairwise similarity><rdf:type><SWO:Jaccard’s index> # from the bottom - motifs - motifset - motif * motifset - sub_motifset * motifset - motifset - jaccard * gene(region) - motifset * gene(region) - gene(region) - jaccard notes: two arguments: one for less one for more answer is - It depends. - We can answer the related gene questions without ever going near the motifs, which is an argument not to include then. we could reduce motifs to a xref from a motifset or skip if no one cares. - We need to store a linerizied representation of the dimotif poset* for each gene to know what evidance supports a gene-gene association and to compare across species. (* not strictly posets because can't guarentee that a->b & b->a don't occur within the same set, do seen evidance of strong preference for just one) A good reason to avoid preserving/pubilishing details at this time is: My compact set representations are adhoc. The index into a lexicaly sorted list is a fragile construct that will break on: - adding a motif - retracting a motif - renaming a motif Can also argue that "motifset - motifset - jaccard" is both superceded by "gene(region) - gene(region) - jaccard" and effectivly replicated with the explicit subset/superset. which if so leaves # <NCBIGene_123><SO:adjacent_to><BNODE:gene_upstream_region> # <BNODE:gene_upstream_region><rdfs:label><gene_upstream_region> # <BNODE:gene_upstream_region><GENO:has_extent><1000 (region extent bp)> # <BNODE:gene_upstream_region><rdf:type><SO:five_prime_flanking_region> # <BNODE:gene_upstream_region><rdf:comment><Candidate SO:TF_binding_sites> # <BNODE:motif_set><rdfs:label><motif_set_sig> # <BNODE:motif_set><rdf:type><SIO:collection> # <BNODE:motif_set><rdf:value><3 (flavors)> # <BNODE:motif_set><OIO:hasdbxref><http:JASPAR:motif> # <BNODE:motif_set><OIO:subset><BNODE:motif_set> # <BNODE:motif_set><RO:has member><JASPAR:motif> # <JASPAR:motif><rdf:type><SO:nucleotide_motif> # <JASPAR:motif><OIO:hasdbxref><http:JASPAR:motif> # <JASPAR:motif><rdfs:label><motif> # <BNODE:gene_upstream_region><RO:member of><BNODE:gene_jaccard_value> # <BNODE:gene2_upstream_region><RO:member of><BNODE:gene_jaccard_value> # <BNODE:gene_jaccard_value><rdfs:label><gene1_region gene2_region> # <BNODE:gene_jaccard_value><SWO:Similarity score><0.73> # <BNODE:gene_jaccard_value><rdf:type><SWO:Jaccard’s index> * motifset - sub_motifset (inc. xref to motif?) * gene(region) - motifset * gene(region) - gene(region) - jaccard If I keep going I will talk myself out of subsets as well ... # starting with the motif sets cut -f2 gene_mset_?k.tab | sort -u | wc -l 10,321 motif sets of more than two motifs cut -f2 gene_mset*_?k.tab | sort -u | wc -l 11,018 motifsets (including sets with fewer than three) (adds less than 700 of the 8,646 possible sets < 3) cut -f1 gene_mset*_?k.tab | sort -u | wc -l 55,507 gene regions ######################################################################## I do not seem to have ever generated an explicit mapping of which motif names are members of which motif-sets. The motif-sets are themselves an opaque record but humes will need something readable. (expecting around 50k rows) note leaving motif-subsets independent of the three [1k,2k,5k] regions here. cut -f2 gene_mset*_?k.tab | sort -u > motifset_all.list awk 'FNR==NR{m[i]=$NR}\ FNR!=NR{for(j=1;j<i;j++)\ if(and($1,(2^j)))\ print $1 "\t" m[j]}'\ motif.list motifset_all.list > motifset_motif.tab ## this is failing miserably, only returning 7,030 mappings awk 'BEGIN{i=1}FNR==NR{m[i]=$1;i++}FNR!=NR{for(j=1;j<i;j++)if(and(rshift($1,j),1)){print $1 "\t" m[j]}}' motif.list motifset_over.list > motifset_motif.tab wc -l < motifset_motif.tab 13,556 better, but wtf? back to the source sort -u upstream_bucket_?k.matrix | grep "1" > motifset_all.matrix wc -l < motifset_all.matrix 18970 awk 'NF!=131{print NF}' motifset_all.matrix all ok awk 'BEGIN{i=1}FNR==NR{m[i]=$1;i++}FNR!=NR{delete(a);b=c=0;for(i=1;i<132;i++){if($i){b=or(b,(2^i));a[c++]=m[i]}}for(x in a) print b "\t" a[x]}' motif.list motifset_all.matrix | sort -u > motifset_motif.tab wc -l < motifset_motif.tab 66752 ok that is closer to what I was expecting. but hints b = or(b,2^i) ends up dfferent than b+=2^i which are equilivent in the theoretical world which means I should redo or at least check all the subsets to be sure cut -f1 motifset_motif.tab | sort -u | wc -l 7611 I am just not understanding ... should step away from the keyboard. (but won't) there are 40k genes with some motif in an upstream region as returned by the intersection of bed files. I have three interval sizes so the maximum number is distinct motif sets is 120k but not only are the the three intervals overlapping and prone it have sets common set we saw that more than doubling the extent less than doubled the number of new motifs. So 40k < 67k < 120k is a very belivable number of distinct motif sets. the matrix files all are guarenteed to have 131 columns and every row is guarenteed to have at least one non-zero and every row pattern is guarenteed to be uniqie therefore every motif set is represented with at least one motif and all 67k should occur in the output, not just 7.6k the number of rows emmited should be 67k * the avg number if bits set/row . stopping and doing a take two.
About
exploring JASPAR dataset and model ingest for Translator FA questions
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published