# Searching for motifs

## Using MEME to find motifs

[MEME](https://meme-suite.org/meme/) must be installed to use this functionality. `find_motifs` searches for motifs upstream of all genes in an iModulon. The `gene_table` must contain the columns `accession` and `operon` for this function to work (see `notebooks/gene_annotation.ipynb`).

`find_motifs` supports many of the command-line options for MEME:

* `outdir`: Directory for output files
* `palindrome`: If True, limit search to palindromic motifs (default: False)
* `nmotifs`: Number of motifs to search for (default: 5)
* `upstream`: Number of base pairs upstream of the first gene in the operon to include in the motif search (default: 500)
* `downstream`: Number of base pairs downstream of the start of the first gene in the operon to include in the motif search (default: 100)
* `verbose`: Show steps in verbose output (default: True)
* `force`: Force execution of MEME even if output already exists (default: False)
* `evt`: E-value threshold (default: 0.001)
* `cores`: Number of cores to use (default: 8)
* `minw`: Minimum motif width in basepairs (default: 6)
* `maxw`: Maximum motif width in basepairs (default: 40)
* `minsites`: Minimum number of sites required for a motif. Default is the number of operons divided by 3.
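
To make the `upstream` and `downstream` options concrete, here is a minimal sketch of how a search window around an operon start could be cut from a genome sequence. It is not pymodulon's implementation: `promoter_window` and `reverse_complement` are hypothetical helpers, positions are assumed to be 0-based, `start` is assumed to be the first transcribed base of the first gene on its own strand, and the genome is treated as linear (no wrap-around for circular chromosomes).

```python
# Hypothetical helpers for illustration only; not part of pymodulon.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def promoter_window(genome: str, start: int, strand: str,
                    upstream: int = 500, downstream: int = 100) -> str:
    """Extract `upstream` bases before and `downstream` bases from `start`.

    Assumptions: `start` is a 0-based index of the first base of the first
    gene in the operon, and the genome is linear (windows are clipped at
    the sequence ends rather than wrapped).
    """
    if strand == "+":
        lo = max(0, start - upstream)
        hi = min(len(genome), start + downstream)
        return genome[lo:hi]
    if strand == "-":
        # On the minus strand, "upstream" lies at higher coordinates.
        lo = max(0, start - downstream + 1)
        hi = min(len(genome), start + upstream + 1)
        return reverse_complement(genome[lo:hi])
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


toy_genome = "AAAACCCCGGGGTTTT"
plus_window = promoter_window(toy_genome, 8, "+", upstream=4, downstream=2)
minus_window = promoter_window(toy_genome, 7, "-", upstream=4, downstream=2)
```

Both windows are read 5' to 3' relative to the gene, so the upstream bases always come first regardless of strand.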

In [1]:
from pymodulon.motif import find_motifs, compare_motifs
from pymodulon.data.example_data import load_ecoli_data, ecoli_fasta
ica_data = load_ecoli_data()

In [2]:
motifs = find_motifs(ica_data, 'ArgR', ecoli_fasta)

Finding motifs for 7 sequences
Found 1 motif across 7 sites


`find_motifs` returns a `MotifInfo` object:

In [3]:
motifs

<MotifInfo with 1 motif across 7 sites>

The `MotifInfo` object contains the following properties:

* `motifs`: Information about the motifs (e.g. E-value, width, consensus sequence)

In [7]:
motifs.motifs

| motif | e_value | sites | width | consensus | motif_frac |
|---|---|---|---|---|---|
| MEME-1 | 1.2e-16 | 7 | 37 | TGAATWWAWATKCAMTWWWTATGMATAAWWATTCANT | 1.0 |
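
The consensus sequence uses IUPAC degenerate nucleotide codes (for example, `W` stands for A or T, and `N` for any base). As a rough illustration of how to read it, the sketch below counts how many positions of a candidate site fall outside the consensus. `count_mismatches` is a hypothetical helper, not a pymodulon function, and a mismatch count is only a crude proxy for the position-specific scores that MEME actually uses.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(consensus: str, site: str) -> int:
    """Count positions where `site` has a base the consensus code forbids."""
    if len(consensus) != len(site):
        raise ValueError("consensus and site must have the same length")
    return sum(base not in IUPAC[code] for code, base in zip(consensus, site))


consensus = "TGAATWWAWATKCAMTWWWTATGMATAAWWATTCANT"
argF_site = "TGAATTAAAATTCACTTTATATGTGTAATTATTCATT"  # argF site from this example
n_mismatches = count_mismatches(consensus, argF_site)
```

A site does not need to match the consensus at every position to be reported; the consensus only summarizes the most common bases across all sites.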


* `sites`: Information about predicted binding sites (e.g. location, p-value)

In [9]:
motifs.sites

| motif | operon | rel_position | pvalue | site_seq | genes | locus_tags | start_pos | strand |
|---|---|---|---|---|---|---|---|---|
| MEME-1 | argA | 453 | 2.36e-13 | AGAATAAAAATACACTAATTTCGAATAATCATGCAAA | argA | b2818 | 2948741 | + |
| MEME-1 | argCBH | 375 | 7.76e-14 | TCAATATTCATGCAGTATTTATGAATAAAAATACACT | argC,argB,argH | b3958,b3959,b3960 | 4154500 | + |
| MEME-1 | argD | 132 | 4.31e-14 | TGAAATTATAACCACAAAATATGCATAAAAAATCACT | argD | b3359 | 3490080 | - |
| MEME-1 | argF | 129 | 2.74e-17 | TGAATTAAAATTCACTTTATATGTGTAATTATTCATT | argF | b0273 | 290205 | - |
| MEME-1 | argG | 412 | 2.48e-14 | TAAATGAAAACTCATTTATTTTGCATAAAAATTCAGT | argG | b3172 | 3318136 | + |
| MEME-1 | argI | 127 | 8.81e-17 | TGAATTAAAATTCAATTTATATGGATGATTATTCATT | argI | b4254 | 4478211 | - |
| MEME-1 | artJ | 171 | 2.66e-13 | TGAATTTATATGCAATAAACATGATTAAATAATTTAA | artJ | b0860 | 900475 | - |


* `cmd`: The original command used to run MEME

In [11]:
motifs.cmd

'meme motifs/ArgR.fasta -oc motifs/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 2'
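
The command above shows how the `find_motifs` options map onto MEME flags. The following sketch rebuilds that exact string from the default option values. `build_meme_command` is a hypothetical helper written for this illustration and is not how pymodulon necessarily assembles the command; in particular, rounding the default `minsites` down with a floor of 2 is an assumption that happens to agree with the 7-sequence ArgR example.

```python
def build_meme_command(fasta, outdir, n_seqs, palindrome=False, nmotifs=5,
                       evt=0.001, cores=8, minw=6, maxw=40, minsites=None):
    """Assemble a MEME argument list (hypothetical helper, for illustration)."""
    if minsites is None:
        # Assumption: "number of operons divided by 3", rounded down,
        # but never fewer than 2 sites.
        minsites = max(2, n_seqs // 3)
    cmd = [
        "meme", fasta,
        "-oc", outdir,          # output directory, overwritten if present
        "-dna",                 # sequences use the DNA alphabet
        "-mod", "zoops",        # zero or one occurrence per sequence
        "-p", str(cores),
        "-nmotifs", str(nmotifs),
        "-evt", str(evt),
        "-minw", str(minw),
        "-maxw", str(maxw),
        "-allw",
        "-minsites", str(minsites),
    ]
    if palindrome:
        cmd.append("-pal")      # restrict the search to palindromic motifs
    return cmd


cmd_string = " ".join(
    build_meme_command("motifs/ArgR.fasta", "motifs/ArgR", n_seqs=7)
)
```

Keeping the arguments as a list rather than a single string makes it straightforward to pass them to `subprocess.run` without shell quoting issues.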

* `file`: Path to the MEME output file, relative to the notebook that ran `find_motifs`

In [12]:
motifs.file

'motifs/ArgR/meme.txt'

This `MotifInfo` object is automatically stored in the `motif_info` dictionary of the `IcaData` object, keyed by iModulon name. It persists after saving and re-loading the `IcaData` object.

In [14]:
ica_data.motif_info

{'ArgR': <MotifInfo with 1 motif across 7 sites>}

## Using TOMTOM to compare motifs against external databases

Once you have a motif from MEME, you can use [TOMTOM](https://meme-suite.org/meme/tools/tomtom) to compare your motif against external databases. The `compare_motifs` function makes this process simple.

The `MotifInfo` object returned by `find_motifs` contains the MEME file location, which is the primary input for `compare_motifs`.

In [15]:
compare_motifs(ica_data.motif_info['ArgR'])

|  | motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MEME-1 | argR2 | argR2 | 9.891939999999999e-38 | 5.84613e-35 | 1.14464e-34 | 37 | 2 | - | dpinteract |
| 1 | MEME-1 | ArgR_26-5 | ArgR_26-5 | 7.28565e-14 | 4.30582e-11 | 4.21527e-11 | 26 | 0 | - | SwissRegulon_e_coli |
| 2 | MEME-1 | argR | argR | 1.24533e-10 | 7.35988e-08 | 2.88204e-08 | 18 | -19 | - | dpinteract |
| 3 | MEME-1 | ArgR | MX000116 | 3.70711e-07 | 0.00021909 | 7.14942e-05 | 14 | 0 | - | prodoric |
| 4 | MEME-1 | AhrC | MX000043 | 5.80323e-07 | 0.000342971 | 8.39396e-05 | 16 | -20 | - | prodoric |


This information is saved in the `matches` attribute of the ArgR `MotifInfo` object. As with the motifs themselves, it is retained when you save the `IcaData` object to JSON.

In [16]:
ica_data.motif_info['ArgR'].matches

|  | motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MEME-1 | argR2 | argR2 | 9.891939999999999e-38 | 5.84613e-35 | 1.14464e-34 | 37 | 2 | - | dpinteract |
| 1 | MEME-1 | ArgR_26-5 | ArgR_26-5 | 7.28565e-14 | 4.30582e-11 | 4.21527e-11 | 26 | 0 | - | SwissRegulon_e_coli |
| 2 | MEME-1 | argR | argR | 1.24533e-10 | 7.35988e-08 | 2.88204e-08 | 18 | -19 | - | dpinteract |
| 3 | MEME-1 | ArgR | MX000116 | 3.70711e-07 | 0.00021909 | 7.14942e-05 | 14 | 0 | - | prodoric |
| 4 | MEME-1 | AhrC | MX000043 | 5.80323e-07 | 0.000342971 | 8.39396e-05 | 16 | -20 | - | prodoric |
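
Once the matches are available, a common next step is to keep only the strongest hits and see which target each database ranks first. The sketch below does this in plain Python on rows copied from the table above, so it runs without pymodulon or TOMTOM. The helper names and the `q_cutoff` of `1e-5` are illustrative assumptions, not recommended settings; on the real `matches` table the same selection could be done with ordinary DataFrame filtering.

```python
# Rows copied from the matches table above (only the columns used here).
matches = [
    {"target": "argR2", "qvalue": 1.14464e-34, "database": "dpinteract"},
    {"target": "ArgR_26-5", "qvalue": 4.21527e-11, "database": "SwissRegulon_e_coli"},
    {"target": "argR", "qvalue": 2.88204e-08, "database": "dpinteract"},
    {"target": "ArgR", "qvalue": 7.14942e-05, "database": "prodoric"},
    {"target": "AhrC", "qvalue": 8.39396e-05, "database": "prodoric"},
]


def significant_matches(rows, q_cutoff=1e-5):
    """Keep matches whose q-value is below the (illustrative) cutoff."""
    return [row for row in rows if row["qvalue"] < q_cutoff]


def best_per_database(rows):
    """Return the lowest-q-value match for each database."""
    best = {}
    for row in rows:
        db = row["database"]
        if db not in best or row["qvalue"] < best[db]["qvalue"]:
            best[db] = row
    return best


strong = significant_matches(matches)
top_hits = {db: row["target"] for db, row in best_per_database(matches).items()}
```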


By default, `compare_motifs` compares your motifs against five prokaryotic databases. You can instead pass the path to a local motif file, or limit the search to one of these databases:

* `'prodoric'`
* `'collectf'`
* `'dpinteract'`
* `'regtransbase'`
* `'SwissRegulon_e_coli'`

To re-run `compare_motifs` with different arguments, set `force=True`:

In [17]:
compare_motifs(ica_data.motif_info['ArgR'], motif_db='prodoric', force=True)

|  | motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MEME-1 | ArgR | MX000116 | 4.20503e-07 | 8.5e-05 | 7.6e-05 | 14 | 0 | - | prodoric |
| 1 | MEME-1 | AhrC | MX000043 | 5.76078e-07 | 0.000116 | 7.6e-05 | 16 | -20 | - | prodoric |
