# Demo: Snekmer Motif

**Motif** is a pipeline for identifying functionally and structurally relevant amino acid motifs using feature selection.

Motif uses the output of Model and a set of user-supplied protein families to train support vector machines for in- or out-of-family classification and find which kmers are the most informative.

In this notebook, we will demonstrate how to use Snekmer Motif to find the kmers most indicative of membership in 3 small families.



## Getting started with Snekmer Motif

### Setup

First, install Snekmer using the instructions in the [user installation guide](https://github.com/PNNL-CompBio/Snekmer/).

Before running Snekmer, verify that your input files are in an **_input_** directory at the same level as the **_config.yaml_** file. The assumed directory structure is illustrated below.

    ├── input
    │   ├── A.fasta
    │   ├── B.fasta
    │   ├── C.fasta
    │   ├── D.fasta
    │   ├── etc.
    │   ├── background
    │   │   ├── E.fasta
    │   │   ├── F.fasta
    │   │   ├── G.fasta
    │   │   ├── H.fasta
    │   │   └── etc.
    └── config.yaml
    

(Note: Snekmer automatically creates the **_output_** directory when creating output files, so there is no need to create this folder in advance. Additionally, inclusion of background sequences is optional, but is illustrated above for interested users.)
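The layout above can be checked programmatically before launching a run. The following is a minimal sketch, not part of Snekmer: the helper name `check_layout` is hypothetical, and the extension list is assumed from the default `input_file_exts` value printed later in this notebook.

```python
from pathlib import Path

# Assumed from the default `input_file_exts` in the tutorial config.
SEQUENCE_EXTS = ("fasta", "fna", "faa", "fa")


def check_layout(project_dir, exts=SEQUENCE_EXTS):
    """Return a list of problems with the assumed Snekmer directory layout."""
    root = Path(project_dir)
    problems = []

    # config.yaml must sit at the top level, next to the input directory
    if not (root / "config.yaml").is_file():
        problems.append("config.yaml not found")

    input_dir = root / "input"
    if not input_dir.is_dir():
        problems.append("input directory not found")
        return problems

    def is_sequence_file(path):
        # Accept optionally gzipped files, e.g. A.fasta or A.fasta.gz
        name = path.name[:-3] if path.name.endswith(".gz") else path.name
        return path.is_file() and name.endswith(tuple(f".{e}" for e in exts))

    if not any(is_sequence_file(p) for p in input_dir.iterdir()):
        problems.append("no sequence files found in input")

    return problems
```

Calling `check_layout(".")` from the directory containing **_config.yaml_** returns an empty list when the layout matches the tree above. The optional **_background_** directory is deliberately not checked.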

To ensure that Snekmer is available in the Jupyter notebook, either activate the `snekmer` environment before opening this notebook, or use a utility such as the [IPython Kernel](https://ipython.readthedocs.io/en/stable/install/kernel_install.html) to access the environment as a kernel.

### Notes on Using Snekmer

Snekmer assumes that the user will primarily process input files using the command line. For more detailed instructions, refer to the [README](https://github.com/PNNL-CompBio/Snekmer).

The basic process for running Snekmer Motif is as follows:

1. Verify that your file directory structure is correct and that the top-level directory contains a **_config.yaml_** file.
   - A **_config.yaml_** template has been included in the Snekmer codebase at **_resources/config.yaml_**.
2. Modify the **_config.yaml_** with the desired parameters.
3. Use the command line to navigate to the directory containing both the **_config.yaml_** file and **_input_** directory.
4. If some families serve only as background (that is, you do not want to identify functionally relevant kmers from them), run `snekmer model`, then move the FASTA files containing those families to the **_background_** directory. You may skip this step if you are performing feature selection on all input families.
5. Run `snekmer motif`.
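The file move in step 4 can also be scripted. The sketch below is illustrative only: `move_to_background` is a hypothetical helper, not a Snekmer function, and it assumes that the **_background_** directory sits inside **_input_** (as in the tree shown under Setup) and that files are named `{family}.{ext}`.

```python
import shutil
from pathlib import Path


def move_to_background(project_dir, family_names,
                       exts=("fasta", "fna", "faa", "fa")):
    """Move the FASTA files of the named families into input/background."""
    input_dir = Path(project_dir) / "input"
    background_dir = input_dir / "background"
    background_dir.mkdir(exist_ok=True)

    moved = []
    for name in family_names:
        for ext in exts:
            src = input_dir / f"{name}.{ext}"
            if src.is_file():
                shutil.move(str(src), str(background_dir / src.name))
                moved.append(src.name)
    # Return the names that were actually moved so the caller can verify them
    return sorted(moved)
```

For example, `move_to_background(".", ["E", "F"])` would be run after `snekmer model` has finished and before `snekmer motif` is started.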

## Running Snekmer Motif

If you have not already done so, install Snekmer using the instructions in the [user installation guide](https://snekmer.readthedocs.io/en/latest/getting_started/install.html).

To ensure that the tutorial runs correctly, activate the conda environment containing your Snekmer installation and run the notebook from the environment.

If you haven't yet run the [Snekmer tutorial](https://snekmer.readthedocs.io/en/latest/tutorial/index.html), you'll need to do so now. This runs Motif (and Model) on the demo example files and produces all output files. The tutorial uses the included default configuration parameters to guide the analysis, but the user can modify these parameters if a different configuration set is desired. The tutorial command line instructions are copied below:

```bash
conda activate snekmer
cd resources/tutorial/motif_tutorial_files
./run_motif_demo.sh
```

Finally, we will initialize some parameters and parse filenames for this demo notebook.

    # imports: standard library first, then third-party packages
    import glob
    import os
    from itertools import product

    import pandas as pd
    import yaml

    # load config file
    with open(os.path.join("..", "..", "resources", "motif_config.yaml"), "r") as configfile:
        config = yaml.safe_load(configfile)

    print(config)

Output:

    {'k': 3, 'alphabet': 1, 'input_file_exts': ['fasta', 'fna', 'faa', 'fa'], 'input_file_regex': '.*', 'nested_output': False, 'score': {'scaler': True, 'scaler_kwargs': {'n': 0.25}, 'labels': 'None', 'lname': 'None'}, 'cluster': {'method': 'agglomerative-jaccard', 'params': {'n_clusters': 'None', 'linkage': 'average', 'distance_threshold': 0.92, 'compute_full_tree': True}, 'cluster_plots': False, 'min_rep': None, 'max_rep': None, 'save_matrix': True, 'dist_thresh': 100}, 'model': {'cv': 5, 'random_state': 'None'}, 'model_dir': 'output/model/', 'basis_dir': 'output/kmerize/', 'score_dir': 'output/score/', 'motif': {'n': 2000}}


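    def strip_gz(path):
        # Remove a trailing ".gz" suffix only. str.rstrip(".gz") would instead
        # strip any trailing ".", "g", or "z" characters.
        return path[:-3] if path.endswith(".gz") else path

    # collect input files whose (un-gzipped) name ends in a configured extension
    filenames = sorted(
        [
            strip_gz(fa)
            for fa, ext in product(
                glob.glob(os.path.join("motif_tutorial_files", "input", "*")),
                config["input_file_exts"],
            )
            if strip_gz(fa).endswith(f".{ext}")
        ]
    )

    # family names are the file names without directory or extension
    families = sorted([os.path.splitext(os.path.basename(f))[0] for f in filenames])

    print(families)

Output:

    ['TIGR03149', 'nxrA']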


## Snekmer Motif output

Snekmer Motif ranks kmers by p-value and support vector machine (SVM) weight after recursive feature elimination. The p-value is calculated as the number of rescoring iterations in which a kmer's SVM weight exceeded its weight on the correctly labeled input data, divided by the total number of rescoring iterations. Output files containing this data are stored in the **motif**/**p_values** directory. The other subdirectories within the **motif** directory are used for intermediate files that may be used to resume long workflows if execution is interrupted for any reason.

The primary output file for each family is named `{input file name}.csv` and has five columns:

* **kmer**: The recoded amino acid sequence evaluated as a feature.
* **real score**: The normalized weight learned for the kmer by an SVM trained for one-vs-all classification of the input family against a background of all other input families and any provided background sequences, after performing recursive feature elimination.
* **false positives**: The number of rescoring iterations where an SVM learned a higher weight for a given kmer than that learned on the input data.
* **n**: The number of rescoring iterations performed. This should be the same for each kmer and match the value of **n** set under **motif** in **_config.yaml_**.
* **p**: The p-value calculated for each kmer.
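The relationship between the last three columns can be verified directly. The sketch below recomputes **p** as **false positives** divided by **n**, following the definition given above. The function name `empirical_p` is hypothetical, and the example rows are copied from the nxrA results shown later in this notebook.

```python
def empirical_p(false_positives, n):
    """p-value as the fraction of rescoring iterations that beat the real score."""
    if n <= 0:
        raise ValueError("n must be a positive number of rescoring iterations")
    return false_positives / n


# (kmer, false positives, n, reported p) taken from the nxrA output table
reported = [
    ("FDN", 0, 2000, 0.0000),
    ("PKC", 3, 2000, 0.0015),
    ("DDF", 4, 2000, 0.0020),
    ("CKA", 23, 2000, 0.0115),
]

for kmer, false_positives, n, p in reported:
    # Compare with a small tolerance because the CSV values are rounded
    assert abs(empirical_p(false_positives, n) - p) < 1e-9, kmer
```

With `n = 2000`, the smallest nonzero p-value that can be observed is 1/2000 = 0.0005, so a reported p of 0 means only that no rescoring iteration exceeded the real score.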

The Snekmer tutorial generates 3 output files, one for each family. We'll load and parse the results for nxrA and TIGR03149:

    # read motif results for each family
    p_value_dir = os.path.join("motif_tutorial_files", "output", "motif", "p_values")
    results1 = pd.read_csv(os.path.join(p_value_dir, "nxrA.csv"))
    results2 = pd.read_csv(os.path.join(p_value_dir, "TIGR03149.csv"))

    # show the 15 top-ranked kmers for each family
    results1[:15], results2[:15]

Output:

    (   kmer  real score  false positives     n       p
     0   FDN    1.000000                0  2000  0.0000
     1   FDP    1.000000                0  2000  0.0000
     2   PNP    1.000000                0  2000  0.0000
     3   PDF    1.000000                0  2000  0.0000
     4   FFA    1.000000                0  2000  0.0000
     5   FND    1.000000                0  2000  0.0000
     6   FDA    1.000000                0  2000  0.0000
     7   FFF    1.000000                0  2000  0.0000
     8   PKF    1.000000                0  2000  0.0000
     9   KPP    1.000000                0  2000  0.0000
     10  FAF    1.000000                0  2000  0.0000
     11  PKC    0.639511                3  2000  0.0015
     12  KCC    0.639511                3  2000  0.0015
     13  DDF    0.580741                4  2000  0.0020
     14  CKA    0.360489               23  2000  0.0115,
        kmer  real score  false positives     n       p
     0   NPF    0.941680               19  2000  0.0095
     1   AFD    0.709216               28  2000  0.0140
     2   NFF

As seen above, kmers are ordered first by p-value and then by their score on real data. Kmers that contribute to function are expected to be significant, while those not involved in function, folding, or regulation will typically be insignificant.

Because the families used in this tutorial already have PROSITE patterns associated with them, we can check whether the top kmers match our expectations. Of the top 15 kmers identified for nxrA, 11 match PS00551, 4 match PS00490, and 13 match PS00932; of the top 15 for TIGR03149, 12 match PS00198.
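The ordering described above (ascending p-value, ties broken by descending real score) can be reproduced on any subset of rows. The sketch below uses only the standard library and a few rows copied from the nxrA table. The function names and the cutoff `alpha=0.01` are illustrative assumptions, not values used by Snekmer.

```python
# A few rows from the nxrA results, deliberately out of order
rows = [
    {"kmer": "CKA", "real score": 0.360489, "p": 0.0115},
    {"kmer": "PKC", "real score": 0.639511, "p": 0.0015},
    {"kmer": "FDN", "real score": 1.000000, "p": 0.0000},
    {"kmer": "DDF", "real score": 0.580741, "p": 0.0020},
]


def rank_kmers(rows):
    """Sort by ascending p-value, then by descending real score."""
    return sorted(rows, key=lambda r: (r["p"], -r["real score"]))


def significant_kmers(rows, alpha=0.01):
    """Return ranked kmers whose p-value is at or below a chosen cutoff."""
    return [r["kmer"] for r in rank_kmers(rows) if r["p"] <= alpha]


ranked = [r["kmer"] for r in rank_kmers(rows)]
selected = significant_kmers(rows)
```

On the loaded DataFrames, the same ordering should be obtainable with `results1.sort_values(["p", "real score"], ascending=[True, False])`.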