# iModulonDB Tutorial

This notebook will help you add information to your ica_data object and generate the files needed for your own iModulonDB pages.

Follow the advice [here](https://github.com/SBRG/modulytics/wiki/Adding-a-Private-Project) to get started, and learn to use pymodulon with the initial Tutorial if needed.

Please contact krychel@eng.ucsd.edu with any questions!

In [None]:
from pymodulon.core import IcaData
from pymodulon.io import *
from pymodulon.imodulondb import *
import pandas as pd

## Step 1: Load Your Data

More details on this can be found in the main tutorial.

All that is absolutely required is the following:

- M matrix (part of initial ICA output)
- A matrix (part of initial ICA output)
- X matrix (log_tpm.csv) -- initial data, preferably not normalized to the base condition.
- Sample_table containing:
    - Either a column labeled 'sample' or an index (leftmost column) that serves the purpose of sample labels. These labels must be unique to each sample (no duplicates in the entire table), and they must match the columns of A and X.
    - A column labeled 'project' -- this is the largest grouping of samples; they may be downloaded from the same original paper or contain a set of samples/experiments with a common theme. The contents of this column will appear on the X axis when zoomed all the way out in the activity bar graphs on iModulonDB.
    - A column labeled 'condition' -- this groups samples such that all biological replicates have the same condition name. Samples in the same project with matching conditions will be grouped into a single bar in the activity bar graphs. If you'd like to re-use condition names across different projects (for example, giving each project a 'control' condition), that is okay.
    
Additional information will only improve your version of the site!
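The minimum requirements above can be checked before going further. The following is a minimal sketch, not part of pymodulon: `check_sample_table` is a hypothetical helper, and the toy tables stand in for your own `ica_data.sample_table`, `ica_data.A`, and `ica_data.X`.

```python
import pandas as pd


def check_sample_table(sample_table, A, X):
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []
    # Sample labels come from a 'sample' column if present, otherwise the index.
    if "sample" in sample_table.columns:
        labels = pd.Index(sample_table["sample"])
    else:
        labels = sample_table.index
    if labels.has_duplicates:
        problems.append("sample labels are not unique")
    # The labels must match the columns of both A and X.
    for name, matrix in (("A", A), ("X", X)):
        if set(labels) != set(matrix.columns):
            problems.append(f"sample labels do not match the columns of {name}")
    for column in ("project", "condition"):
        if column not in sample_table.columns:
            problems.append(f"missing required column: {column}")
    return problems


# Toy data: two replicates, with the 'condition' column deliberately left out.
samples = ["ctrl_1", "ctrl_2"]
toy_table = pd.DataFrame({"project": ["demo", "demo"]}, index=samples)
toy_A = pd.DataFrame([[0.1, 0.2]], index=[0], columns=samples)
toy_X = pd.DataFrame([[5.0, 5.1]], index=["b0001"], columns=samples)

problems = check_sample_table(toy_table, toy_A, toy_X)
print(problems)
```

With your own data, an empty list suggests the sample table meets the minimum described above; it does not replace the compatibility check in step 4.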

In [None]:
ica_data = load_json_model('../example_data/example.json')

## Step 2: Fill out some metadata

When you load an ica_data model, it fills out default metadata for you. Therefore you can technically skip this step.

### 2.a: Splash Table (Dictionary)

This table provides the basic elements shown on [the splash page](http://imodulondb.org) and the names of the directories in which all data will be stored.

First, look at the default one:

In [None]:
ica_data.splash_table

The first three entries are shown on the splash page directly. The last two are important folder names. Typical organism folder names are in the form 'g_species'. Choose a name for your dataset that will be unique; it is likely the only one for your organism, so you could choose 'modulome' if that is the project you are working on.
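As a small illustration of the 'g_species' convention, the sketch below builds a folder name from a genus and species. `organism_folder_name` is a hypothetical helper written for this example, not a pymodulon function.

```python
def organism_folder_name(genus, species):
    """Build a 'g_species' style folder name from a genus and a species."""
    # First letter of the genus, an underscore, then the species, all lowercase.
    return f"{genus.strip()[0]}_{species.strip()}".lower()


folder = organism_folder_name("Escherichia", "coli")
print(folder)
```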

Let's update this table with your information.

In [None]:
# FILL IN THESE DETAILS

# <i> tags italicize the species name
# So far on the site, large titles are all simply species names.
ica_data.splash_table['large_title'] = '<i>M. favoritespeciesus</i>'

# subtitle provides additional information about the dataset's origins
ica_data.splash_table['subtitle'] = 'Modulome Dataset'
ica_data.splash_table['author'] = 'Rychel, et al.'

# these will correspond to actual folders, which will be generated for you.
ica_data.splash_table['organism_folder'] = 'm_favoritespeciesus'
ica_data.splash_table['dataset_folder'] = 'modulome'

# make sure the output is desired
ica_data.splash_table

### 2.b: Dataset Table (Dictionary)

This table provides the basic elements shown in the [dataset page](https://imodulondb.org/dataset.html?organism=e_coli&dataset=precise1). The numerical elements are computed for you in the default.

If desired, you can add your own entries to this with whatever names you'd like. They can even include links if you obey the [rules for making HTML links](https://www.w3schools.com/html/html_links.asp). For example, the published datasets include a 'Publication' entry.
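One way to keep link entries well-formed is to build them with a small helper. This is only a sketch: `html_link` is a hypothetical function, and the example.com URL is a placeholder. Note that the URL should include its scheme (https://), since a bare domain in an href is treated as a relative link.

```python
from html import escape


def html_link(url, text):
    """Return an HTML anchor string for use as a dataset_table value."""
    # Escape special characters so the attribute and the label stay valid HTML.
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


link = html_link("https://example.com/paper", "Example Paper")
print(link)
```

The result could then be assigned to an entry such as ica_data.dataset_table['Publication']. Because the label is escaped, this helper is not suitable for labels that intentionally contain tags like <i>.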

In [None]:
ica_data.dataset_table

In [None]:
# FILL IN THESE DETAILS, TOO

# This title can combine some information from the splash table
ica_data.dataset_table['Title'] = 'M. favoritespeciesus Modulome'
ica_data.dataset_table['Organism'] = '<i>M. favoritespeciesus</i>'
ica_data.dataset_table['Strain'] = 'IMDB1000'

# additional rows if desired
ica_data.dataset_table['Useful Additional Link'] = '<a href="https://google.com">Google</a>'

# check output
ica_data.dataset_table

## Step 3: Add Links (Optional)

If your organism has a database like EcoCyc, SubtiWiki, or AureoWiki, you may be interested in adding links to it.

iModulonDB has two kinds of links:

- gene_links: 
    - Associate gene **loci** from the **index of ica_data.M** to websites
    - On the [gene pages](https://imodulondb.org/gene.html?organism=e_coli&dataset=precise1&gene_id=b0002), these links have the name of the database that you are linking to. The name of the database needs to be saved as a string in ica_data.link_database.
    - For programming reasons, the default value of gene_links is all genes as keys, each pointing to the value np.nan.
    
- tf_links:
    - Associate **TF names** from **ica_data.trn.regulator and ica_data.imodulon_table.regulator** to websites
    - On the [iModulon pages](https://imodulondb.org/iModulon.html?organism=e_coli&dataset=precise1&k=43), the name of the regulator becomes a link. You never have to state the name of this database, and it doesn't have to be the same one used for gene_links (i.e. ica_data.link_database does NOT have to be the one you use).
    - For programming reasons, the default value is an empty dictionary.
    
Note that gene_links uses gene loci since not every gene has a readable name, but tf_links uses readable names since not all regulators have associated gene loci. This could create some confusion between the two types of links; for example, the gene b2741 in E. coli encodes the TF rpoS:

- ica_data.gene_links\['b2741'\] = 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10510 '
- ica_data.tf_links\['rpoS'\] = 'http://regulondb.ccg.unam.mx/sigmulon?term=ECK125110294&organism=ECK12&format=jsp&type=sigmulon '

In this example, ica_data.link_database = 'EcoCyc' since that is where the gene links point to.

You are free to skip this step, or you can use any of the following strategies to generate your links:

#### Strategy 1: Look at the URLs in your database.

Search for genes in your database. You might notice that it always uses the same URL, but enters your search term in part of that URL. You could then take the pieces of that URL and use those to generate all needed URLs. Below is an example from *Pseudomonas aeruginosa*:

In [None]:
prefix = 'https://www.pseudomonas.com/primarySequenceFeature/list?strain_ids=107&term=Pseudomonas+aeruginosa+PAO1+%28Reference%29&c1=name&v1='
suffix = '&e1=1&assembly=complete'

temp_gene_links = dict()
for gene in ica_data.M.index:
    temp_gene_links[gene] = prefix + gene + suffix
    
temp_tf_links = dict()
for tf in ica_data.trn.regulator.unique():
    temp_tf_links[tf] = prefix + tf + suffix
    
# Then, test a few of the links and uncomment these lines

#ica_data.gene_links = temp_gene_links
#ica_data.tf_links = temp_tf_links

You may need to add a suffix in addition to a prefix. You may also have to deal with databases that use different locus abbreviations than the ones you have; this can be remedied by finding a mapping file between the two naming systems.
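When the database uses different identifiers, the mapping can be applied while the URLs are built. The sketch below assumes you have already loaded the mapping into a dictionary. `build_links`, the locus-to-ID pairs, and the example.org URL are all hypothetical and for illustration only.

```python
from urllib.parse import quote


def build_links(loci, prefix, suffix="", id_map=None):
    """Map each locus to prefix + database ID + suffix.

    Loci missing from id_map are returned separately rather than given
    a guessed URL.
    """
    links, unmapped = {}, []
    for locus in loci:
        db_id = locus if id_map is None else id_map.get(locus)
        if db_id is None:
            unmapped.append(locus)
            continue
        # Percent-encode the ID so unusual characters do not break the URL.
        links[locus] = prefix + quote(str(db_id), safe="") + suffix
    return links, unmapped


# Hypothetical locus tags and database IDs.
id_map = {"b0001": "GENE-1", "b0002": "GENE-2"}
links, unmapped = build_links(
    ["b0001", "b0002", "b0003"],
    prefix="https://example.org/gene?id=",
    suffix="&view=full",
    id_map=id_map,
)
print(links)
print(unmapped)
```

Keeping the unmapped loci in a separate list makes it easy to see which genes still need a manual link or can be left without one.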

#### Strategy 2: Make a file

If you are more proficient in Excel than Python, have found a gene <-> link mapping file elsewhere, or already used Strategy 1 to generate a file, you can make sure it has only two columns (gene and link) and no headers, then simply read it in as shown below.

This may be especially useful if you have no database, but want to keep a file of gene_links or tf_links with specific papers instead of db entries.

In [None]:
# ica_data.gene_links = 'gene_links.csv'
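A file in the two-column, headerless shape described above can be produced from a dictionary of links with the standard csv module. This is a sketch with hypothetical helper names; the temporary directory is used only so the example does not overwrite a real file.

```python
import csv
import os
import tempfile


def write_link_file(links, path):
    """Write gene,link rows with no header row."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(links.items())


def read_link_file(path):
    """Read a two-column, headerless CSV back into a dictionary."""
    with open(path, newline="") as handle:
        return {gene: link for gene, link in csv.reader(handle)}


toy_links = {"b2741": "https://ecocyc.org/gene?orgid=ECOLI&id=EG10510"}
path = os.path.join(tempfile.mkdtemp(), "gene_links.csv")
write_link_file(toy_links, path)

# Reading the file back is a quick way to confirm its shape.
round_trip = read_link_file(path)
print(round_trip)
```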

#### Strategy 3: Individual modifications

You can also just tell the model the links one at a time. This can be useful if you generate all links with another strategy, but want to override it in a few cases.

In [None]:
# ica_data.gene_links['b2741'] = 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10510'

#### If using gene links, don't forget to update ica_data.link_database

If you don't, the default value is 'External Database'. If you have no links, nothing will appear so this is irrelevant.

In [None]:
# ica_data.link_database = 'EcoCyc'

ica_data.link_database

## Step 4: Check for compatibility

Though the code cannot check for all edge cases and concerns, you can use the function 'imodulondb_compatibility()' to see what information you are missing, mostly through a check of the columns of your DataFrames. 

#### 4.a: Modify column names

First, let's see what we are missing:

In [None]:
imodulondb_compatibility(ica_data)

Now, let's see if we actually DO have the information that the function believes is missing -- it is likely that some of the names of the columns are simply not what iModulonDB is expecting. If so, we simply rename it to what is expected.

If you actually DO NOT have some of the information, that is okay -- after this step, we'll fill in the blanks with default values as needed.
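Columns that differ from the expected name only in capitalization or spacing can be spotted automatically. The sketch below is a hypothetical helper, not part of pymodulon. 'COG' and 'gene_product' are expected names mentioned in this notebook, while the existing names are toy values. It will not catch true synonyms (such as 'regulator' versus 'TF'), which still need a manual look.

```python
def suggest_renames(existing, expected):
    """Suggest {existing_name: expected_name} for near-identical names."""

    def normalize(name):
        # Ignore case and treat spaces like underscores.
        return str(name).lower().replace(" ", "_")

    lookup = {normalize(name): name for name in existing}
    suggestions = {}
    for target in expected:
        match = lookup.get(normalize(target))
        if match is not None and match != target:
            suggestions[match] = target
    return suggestions


# With real data, pass something like ica_data.gene_table.columns as `existing`.
suggestions = suggest_renames(
    ["gene_name", "cog", "Gene Product"],
    ["COG", "gene_product"],
)
print(suggestions)
```

The returned dictionary has the same shape as the rename_dict passed to DataFrame.rename in the cells that follow.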

In [None]:
ica_data.imodulon_table

In [None]:
# We do have regulators. Rename that column to 'TF'.
# We also have the n_genes information, stored as 'regulon_size'. Rename that column to 'n_genes'.
rename_dict = {'regulator': 'TF', # this is more important than the 'Regulator' column
               'regulon_size': 'n_genes'} 
ica_data.imodulon_table = ica_data.imodulon_table.rename(rename_dict, axis = 1)

# We also have names. They were in the index, but the code wants them in the main dataframe as well.
# If you would rather not add this column, the code will do so automatically.
ica_data.imodulon_table['name'] = ica_data.imodulon_table.index

# We don't have a separate Regulator column. Don't worry about that!

In [None]:
ica_data.gene_table

In [None]:
# we have the "missing" cog column:
rename_dict = {'cog': 'COG'}
ica_data.gene_table = ica_data.gene_table.rename(rename_dict, axis = 1)

# we don't have the gene_product column. This is a nice-to-have, but you can ignore it.

In [None]:
ica_data.sample_table

In [None]:
# Nothing to change for the sample table -- the 0th column is the correct one to use.
# However, you can optionally suppress a warning about this from later on if you name the index:
ica_data.sample_table.index.name = 'sample'

# Also, there is an error in the existing 'Biological Replicates' column.
# This caused a "cannot set a row with mismatched columns" error in export (step 5).
# The easiest way to fix it is to have the code re-compute this column in step 4b.
# to make that happen, delete the existing one now.
ica_data.sample_table = ica_data.sample_table.drop('Biological Replicates', axis = 1)

#### 4.b: Fill in the unknowns (Optional)

This step is already integrated in step 5 -- feel free to skip it. However, it might be a useful demonstration.

Even if you only have the minimum (M, A, X, and sample_table) data, calling the compatibility function with inplace=True will fill in enough blanks to generate the dataset. In this example, we need the code to fix the things that we ignored during step 4a.

In general, you may want to deepcopy the data before doing this. While the default columns shouldn't be an issue, the function requires that all iModulon names be numbers, which you may not want. Under the hood in step 5, we automatically deepcopy your data before changing anything. If you want access to the default values and columns, however, you may prefer to do this without the deepcopy.

In [None]:
from copy import deepcopy

new_data = deepcopy(ica_data)

imodulondb_compatibility(new_data, inplace = True)

Let's call it again just to see what the output is now that the new elements are written:

In [None]:
imodulondb_compatibility(new_data)

A very minimal output is good!

## Step 5: Export

You are now ready to export! This process may take some time (up to around 1 hr for PRECISE2, but significantly shorter depending on the number of genes your organism has).

The export function needs a path to export to. This should be the path to where you cloned the iModulonDB repository.

The function will first deepcopy your data and then call the compatibility function from above with inplace = True. Then it will generate all files that don't require iteration, then all iModulon files, then all gene files.

As it goes, it will print anything you may want to be aware of. This is where issues with the TRN may become apparent.

In [None]:
imodulondb_export(ica_data, '../../iModulonDB/Personal')

## Step 6: Update TRN, other gene info if desired

In the output above, you will notice snippets such as this one:

> TF has no associated expression profile: cecR

>If cecR is not a gene, this behavior is expected.

>If it is a gene, use consistent naming between the TRN and gene_table.

If desired, you could look up each of these to find alternative names that may be used, then rename either the TRN (all instances of the name in the regulator column) or the gene_table (the gene_name column in the row with the appropriate locus) so that they all match. This will not need to be done for every case; for example, 'ile-tRNA' is an isoleucine-bound tRNA molecule, which does not have an associated gene.

After doing this, you would need to rerun the export function (step 5).
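The regulators that need a look can also be listed directly by comparing names. The sketch below is a hypothetical helper working on plain lists of names; with real data you could pass ica_data.trn.regulator and ica_data.gene_table.gene_name. The toy names mirror the cecR/ybiH example in this step.

```python
def regulators_without_gene(regulators, gene_names):
    """Return sorted regulator names that are absent from the gene names."""
    return sorted(set(regulators) - set(gene_names))


# Toy values: cecR is listed in the gene table under its older name, ybiH.
toy_regulators = ["cecR", "rpoS", "ile-tRNA", "rpoS"]
toy_gene_names = ["ybiH", "rpoS"]

unmatched = regulators_without_gene(toy_regulators, toy_gene_names)
print(unmatched)
```

Each name in the result is either a non-gene regulator, which can be ignored, or a naming mismatch to fix as shown in the following cells.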

In [None]:
# according to EcoCyc, the b-number for cecR is b0796 and its alternative name is 'ybiH'

ica_data.gene_table.loc['b0796']

In [None]:
# To remedy this mismatch, I will replace 'ybiH' with 'cecR' in the gene_table.

ica_data.gene_table.loc['b0796', 'gene_name'] = 'cecR'

# repeat this process for all unmatched genes.

# re-running step 5 would then correct this error.

## Step 7: Paste HTML

This step only needs to occur once! Time to add the link to the splash page.

1. Navigate to your iModulonDB/organisms/m_favoritespeciesus/modulome folder.
2. Open the file 'html_for_splash.html' in a text editor
3. Copy the entire contents to your clipboard
4. Navigate back to the main repository folder
5. Open 'index.html' in a text editor
6. Scroll down to the comment near line 250 about "NEW DATASETS"
7. Paste in the appropriate location. You may want to adjust the indentations.
8. Save the file
9. Enjoy your new website!

Traveling to localhost:8000 in your browser should show you the site with a shiny new button for your dataset. Click it to see your new iModulons.

## Moving Forward

Any time you make a change that you want reflected on your site, simply repeat step 5! If changes don't immediately appear on the browser, try a force refresh (Ctrl+Shift+R in Google Chrome).

## Don't forget to save your file so that step 5 is all that is required in the future!

In [None]:
# save_to_json(ica_data, '../example_data/example.json')