# iModulonDB Tutorial

This notebook will help you add information to your dataset and generate the files needed for iModulonDB.

Follow the advice [here](https://github.com/SBRG/modulytics/wiki/Adding-a-Private-Project) to get started, and learn to use pymodulon with the initial Tutorial if needed.

Please contact krychel@eng.ucsd.edu with any questions!

In [1]:
from pymodulon.core import IcaData
from pymodulon.io import *
from pymodulon.imodulondb import *
import pandas as pd

## Step 1: Load Your Data

More details on this can be found in the main tutorial.

All that is absolutely required is the following:

- M matrix (part of initial ICA output)
- A matrix (part of initial ICA output)
- X matrix (log_tpm.csv) -- initial data, preferably not normalized to the base condition.
- Sample_table containing:
    - Either a column labeled 'sample' or an index (leftmost column) that holds the sample labels. Each label must be unique (no duplicates in the entire table), and the labels must match the columns of A and X.
    - A column labeled 'project' -- this is the largest grouping of samples; they may be downloaded from the same original paper or contain a set of samples/experiments with a common theme. The contents of this column will appear on the X axis when zoomed all the way out in the activity bar graphs on iModulonDB.
    - A column labeled 'condition' -- this groups samples such that all biological replicates have the same condition name. Samples in the same project with matching conditions will be grouped into a single bar in the activity bar graphs. If you'd like to re-use condition names across different projects (for example, giving each project a 'control' condition), that is okay.
    
Additional information will only improve your version of the site!
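The sample table requirements listed above can be checked before loading anything into pymodulon. The following is a minimal sketch using only pandas; `check_sample_table` is a hypothetical helper (not part of pymodulon), and the toy tables are made up for illustration.

```python
import pandas as pd


def check_sample_table(sample_table, A, X):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Sample labels come from a 'sample' column if present, else from the index.
    if "sample" in sample_table.columns:
        labels = sample_table["sample"]
    else:
        labels = sample_table.index.to_series()

    if labels.duplicated().any():
        problems.append("sample labels are not unique")

    for col in ("project", "condition"):
        if col not in sample_table.columns:
            problems.append(f"missing required column: {col}")

    # Samples are the columns of both A and X.
    label_set = set(labels)
    if label_set != set(A.columns):
        problems.append("sample labels do not match the columns of A")
    if label_set != set(X.columns):
        problems.append("sample labels do not match the columns of X")

    return problems


# Toy data (made up for illustration)
samples = ["s1", "s2", "s3"]
sample_table = pd.DataFrame(
    {"project": ["demo"] * 3, "condition": ["control", "control", "heat"]},
    index=samples,
)
A = pd.DataFrame(0.0, index=["iM_0", "iM_1"], columns=samples)
X = pd.DataFrame(0.0, index=["gene_a", "gene_b"], columns=samples)

problems = check_sample_table(sample_table, A, X)
print(problems)
```

An empty list means the table satisfies the three requirements; any message in the list points at the requirement to fix.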

In [3]:
ica_data = load_json_model('../example_data/example.json')

## Step 2: Fill out some metadata

When you load an ica_data model, it fills out default metadata for you. Therefore you can technically skip this step.

### 2.a: Splash Table (Dictionary)

This table provides the basic elements shown on [the splash page](http://imodulondb.org) and the names of the directories in which all data will be stored.

First, look at the default one:

In [5]:
ica_data.splash_table

{'large_title': 'New Dataset',
 'subtitle': 'Unpublished study',
 'author': 'Pymodulon User',
 'organism_folder': 'new_org',
 'dataset_folder': 'new_dataset'}

The first three entries are shown on the splash page directly. The last two are important folder names. Typical organism folder names are in the form 'g_species'. Choose a name for your dataset that will be unique; it is likely the only one for your organism, so you could choose 'modulome' if that is the project you are working on.

Let's update this table with your information.

In [8]:
# FILL IN THESE DETAILS

# <i> tags italicize the species name
# So far on the site, large titles are all simply species names.
ica_data.splash_table['large_title'] = '<i>M. favoritespeciesus</i>'

# subtitle provides additional information about the dataset's origins
ica_data.splash_table['subtitle'] = 'Modulome Dataset'
ica_data.splash_table['author'] = 'Rychel, et al.'

# these will correspond to actual folders, which will be generated for you.
ica_data.splash_table['organism_folder'] = 'm_favoritespeciesus'
ica_data.splash_table['dataset_folder'] = 'modulome'

# make sure the output is desired
ica_data.splash_table

{'large_title': '<i>M. favoritespeciesus</i>',
 'subtitle': 'Modulome Dataset',
 'author': 'Rychel, et al.',
 'organism_folder': 'm_favoritespeciesus',
 'dataset_folder': 'modulome'}
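If you would rather derive the 'g_species' organism folder name than type it by hand, something like the sketch below could work. `organism_folder_name` is a hypothetical helper, not a pymodulon function, and it assumes the species is given as a two-word binomial name.

```python
def organism_folder_name(species):
    """Turn a binomial name such as 'Escherichia coli' into 'e_coli'.

    Hypothetical helper: uses the first letter of the genus, an underscore,
    and the species epithet, all in lowercase.
    """
    genus, epithet = species.split()[:2]
    return f"{genus[0]}_{epithet}".lower()


print(organism_folder_name("Escherichia coli"))
print(organism_folder_name("Bacillus subtilis"))
```

The result could then be assigned to `ica_data.splash_table['organism_folder']`.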

### 2.b: Dataset Table (Dictionary)

This table provides the basic elements shown on the [dataset page](https://imodulondb.org/dataset.html?organism=e_coli&dataset=precise1). The numerical entries are computed for you by default.

If desired, you can add your own entries to this with whatever names you'd like. They can even include links if you obey the [rules for making HTML links](https://www.w3schools.com/html/html_links.asp). For example, the published datasets include a 'Publication' entry.
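A small helper can keep hand-written HTML links consistent. The sketch below is an assumption-laden illustration: `html_link` is a hypothetical function (not part of pymodulon), and the URL and link text are placeholders. It insists on an absolute URL because an `href` without `http://` or `https://` is resolved relative to the page that displays it.

```python
from html import escape


def html_link(url, text):
    """Build an HTML anchor string (hypothetical helper, not part of pymodulon)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("use an absolute URL starting with http:// or https://")
    # Escape both parts so quotes or angle brackets cannot break the tag.
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


# Placeholder URL and text for illustration
publication = html_link("https://example.org/my-paper", "My paper (2021)")
print(publication)
```

The returned string could be stored as a new entry, for example under a 'Publication' key in `ica_data.dataset_table`.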

In [9]:
ica_data.dataset_table

Title                             New Dataset
Organism                         New Organism
Strain                         Unknown Strain
Number of Samples                         278
Number of Unique Conditions               163
Number of Genes                          3923
Number of iModulons                        92
dtype: object

In [11]:
# FILL IN THESE DETAILS, TOO

# This title can combine some information from the splash table
ica_data.dataset_table['Title'] = 'M. favoritespeciesus Modulome'
ica_data.dataset_table['Organism'] = '<i>M. favoritespeciesus</i>'
ica_data.dataset_table['Strain'] = 'IMDB1000'

# additional rows if desired
# use a full URL (with https://) so the link is not treated as relative to the site
ica_data.dataset_table['Useful Additional Link'] = '<a href="https://google.com">Google</a>'

# check output
ica_data.dataset_table

Title                                    M. favoritespeciesus Modulome
Organism                                   <i>M. favoritespeciesus</i>
Strain                                                        IMDB1000
Number of Samples                                                  278
Number of Unique Conditions                                        163
Number of Genes                                                   3923
Number of iModulons                                                 92
Useful Additional Link         <a href="https://google.com">Google</a>
dtype: object

## Step 3: Add Links (Optional)

If your organism has a database like EcoCyc, SubtiWiki, or AureoWiki, you may be interested in adding links to it.

iModulonDB has two kinds of links:

- gene_links: 
    - Associate gene **loci** from the **index of ica_data.M** to websites
    - On the [gene pages](https://imodulondb.org/gene.html?organism=e_coli&dataset=precise1&gene_id=b0002), these links have the name of the database that you are linking to. The name of the database needs to be saved as a string in ica_data.link_database.
    - For programming reasons, the default value of gene_links is all genes as keys, each pointing to the value np.nan.
    
- tf_links:
    - Associate **TF names** from **ica_data.TRN.regulator and ica_data.imodulon_table.regulator** to websites
    - On the [iModulon pages](https://imodulondb.org/iModulon.html?organism=e_coli&dataset=precise1&k=43), the name of the regulator becomes a link. You never have to state the name of this database, and it doesn't have to be the same one used for gene_links (i.e. ica_data.link_database does NOT have to be the one you use).
    - For programming reasons, the default value is an empty dictionary.
    
Note that gene_links uses gene loci since not every gene has a readable name, but tf_links uses readable names since not all regulators have associated gene loci. This could create some confusion between the two types of links; for example, the gene b2741 in E. coli encodes the TF rpoS:

- ica_data.gene_links\['b2741'\] = 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10510'
- ica_data.tf_links\['rpoS'\] = 'http://regulondb.ccg.unam.mx/sigmulon?term=ECK125110294&organism=ECK12&format=jsp&type=sigmulon'

In this example, ica_data.link_database = 'EcoCyc' since that is where the gene links point to.
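To make the two default values concrete, here is a sketch that uses plain dictionaries as stand-ins for `gene_links` and `tf_links`. It does not touch a real `ica_data` object, and the gene loci are just examples. It shows how the `np.nan` placeholders can be told apart from genes that already have a link.

```python
import numpy as np
import pandas as pd

# Plain-dict stand-ins for the defaults described above (example loci only)
gene_links = {"b0002": np.nan, "b2741": np.nan}  # every gene present, no link yet
tf_links = {}                                    # empty until you add regulators

# Add one gene link, keyed by locus
gene_links["b2741"] = "https://ecocyc.org/gene?orgid=ECOLI&id=EG10510"

# Keep only the genes whose value is an actual link rather than NaN
linked = {gene: url for gene, url in gene_links.items() if pd.notna(url)}
print(f"{len(linked)} of {len(gene_links)} genes have a link")
```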

You are free to skip this step, or you can use any of the following strategies to generate your links:

#### Strategy 1: Look at the URLs in your database.

Search for your gene in your database. You might notice that it always uses the same URL, but enters your search term in part of that URL. You could then take the pieces of that URL and use those to generate all needed URLs:

In [15]:
prefix = 'http://subtiwiki.uni-goettingen.de/v3/gene/search/exact/'

temp_gene_links = dict()
for gene in ica_data.M.index:
    temp_gene_links[gene] = prefix + gene
    
# Then, test a few of the links and uncomment this line
# ica_data.gene_links = temp_gene_links

You may need to add a suffix. You may also have to deal with databases that use different loci abbreviations than the ones you have - this can be remedied by finding a mapping file between the two naming systems.
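Both adjustments, a suffix and a locus mapping, can be folded into one small function. This is only a sketch: `build_links` is a hypothetical helper, and the loci, mapping, and URL pieces below are made up. Loci that have no entry in the mapping are skipped rather than guessed.

```python
def build_links(loci, prefix, suffix="", locus_map=None):
    """Map each locus in `loci` to prefix + database locus + suffix.

    locus_map translates your loci into the database's naming system;
    loci without a translation are left out of the result.
    """
    links = {}
    for locus in loci:
        db_locus = locus if locus_map is None else locus_map.get(locus)
        if db_locus is None:
            continue  # no known translation for this locus
        links[locus] = f"{prefix}{db_locus}{suffix}"
    return links


# Made-up loci, mapping, and URL pieces for illustration
old_to_new = {"OLD_0001": "NEW_00010", "OLD_0002": "NEW_00020"}
links = build_links(
    ["OLD_0001", "OLD_0002", "OLD_0003"],
    prefix="https://example.org/gene/",
    suffix="/summary",
    locus_map=old_to_new,
)
print(links)
```

The keys stay in your own naming system (the index of `ica_data.M`), while the URLs use the database's names.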

#### Strategy 2: Make a file

If you are more proficient in Excel than Python, have found a gene <-> link mapping file elsewhere, or have already used Strategy 1 to generate a file, make sure it has only two columns (gene and link) and no headers, then simply read it in as shown below.

This may be especially useful if you have no database, but want to keep a file of gene_links or tf_links with specific papers instead of db entries.

In [16]:
# ica_data.gene_links = 'gene_links.csv'
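If your links already live in a dictionary, a two-column, headerless file of the kind described above can be written with pandas. This sketch uses placeholder URLs and a temporary directory, and it reads the file back only to confirm its shape.

```python
import os
import tempfile

import pandas as pd

# Placeholder links for illustration
links = {
    "b0002": "https://example.org/gene/b0002",
    "b2741": "https://example.org/gene/b2741",
}

# Write gene,link rows with no header line
path = os.path.join(tempfile.mkdtemp(), "gene_links.csv")
pd.Series(links).to_csv(path, header=False)

# Read it back to confirm there are exactly two columns and no header
check = pd.read_csv(path, header=None, names=["gene", "link"])
print(check)
```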

#### Strategy 3: Individual modifications

You can also give the model links one at a time. This can be useful if you generate all links with another strategy but want to override them in a few cases.

In [18]:
# ica_data.gene_links['b2741'] = 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10510'
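When there are more than a handful of exceptions, the overrides can be kept in their own dictionary and merged over the generated links. The sketch below works on plain dictionaries with a placeholder URL pattern. It also lists override keys that are missing from the generated set, since those are often typos.

```python
# Links generated in bulk (placeholder URL pattern for illustration)
generated = {
    locus: "https://example.org/gene/" + locus for locus in ["b0002", "b2741"]
}

# Hand-curated exceptions; these take priority over the generated links
overrides = {"b2741": "https://ecocyc.org/gene?orgid=ECOLI&id=EG10510"}

# Later entries win, so overrides replace generated links for the same locus
merged = {**generated, **overrides}

# Override keys that match no generated locus deserve a second look
unknown = [locus for locus in overrides if locus not in generated]

print(merged)
print("unmatched overrides:", unknown)
```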

#### If using gene links, don't forget to update ica_data.link_database

If you don't, the default value is 'External Database'. If you have no gene links, nothing will appear on the gene pages, so this setting has no effect.

In [19]:
# ica_data.link_database = 'EcoCyc'

ica_data.link_database

'External Database'

## Step 4: Check for compatibility