# Using the Data Modules

The data modules provide a way to access the underlying data and transform it to facilitate analysis. This includes:

- Data retrieval using SQL
- Creation of mutation matrices
- Creation of vectors of dependent variables

In [2]:
%matplotlib inline

import init
import microbepy.common
from microbepy.common import constants as cn
from microbepy.common import util
from microbepy.common import isolate
from microbepy.statistics.mutation_differential import MutationDifferential
from microbepy.common.range_constraint import RangeConstraint
from microbepy.common.study_context import nextStudyContext
from microbepy.plot.util_plot import PlotParms
from microbepy.correlation.mutation_collection import MutationCollection
from microbepy.plot.mutation_plot import MutationIsolatePlot, MutationLinePlot

import copy
import numpy as np
import pandas as pd

## Data Model

Microbepy assumes that data are organized in terms of:

- Isolate: This is the microbial community (a generalization of the usual definition of isolate)
- Mutation: Changes to the genome
- Culture: Phenotype information obtained from a culture

An isolate is described in terms of the following:

- Evolutionary line (often, just line)
- Transfer time (time at which the isolate was obtained)
- End point dilution (abbreviated EPD) or None if not applicable
- Clone (an index of same genome organisms on a plate) or None if not applicable
- Species or None if multiple species
- Experiment (just CI in these data)

For example, the isolate HA3.152.10.01.D.CI has evolutionary line HA3, transfer 152, EPD 10, clone 1, species DVH (coded as D), and experiment CI.
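To make the field layout concrete, here is a minimal sketch that splits an isolate key into its six components. It is an independent illustration, not microbepy's own parser (the `isolate` module imported above is the place to look for that). The names `IsolateKey` and `parse_isolate` are hypothetical, and the sketch assumes that transfer, EPD, and clone are always integers and that `*` marks a field that is None.

```python
from typing import NamedTuple, Optional


class IsolateKey(NamedTuple):
    """Hypothetical container for the six fields of an isolate key."""
    line: Optional[str]
    transfer: Optional[int]
    epd: Optional[int]
    clone: Optional[int]
    species: Optional[str]
    experiment: Optional[str]


def _field(text, convert=str):
    """Return None for the '*' placeholder, otherwise the converted value."""
    return None if text == "*" else convert(text)


def parse_isolate(key: str) -> IsolateKey:
    """Split a dot-separated isolate key into named fields."""
    parts = key.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields in {key!r}")
    line, transfer, epd, clone, species, experiment = parts
    return IsolateKey(
        _field(line),
        _field(transfer, int),   # assumption: transfer is numeric
        _field(epd, int),        # assumption: EPD is numeric
        _field(clone, int),      # assumption: clone is numeric
        _field(species),
        _field(experiment),
    )


print(parse_isolate("HA3.152.10.01.D.CI"))
print(parse_isolate("HR2.152.01.*.*.*"))
```

The second call shows an end point dilution: clone, species, and experiment all come back as None.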

A mutation is specified by the affected gene (if applicable), its position in the genome, the nucleotides in the reference genome, and the changed nucleotides. For example, DVU2451.2555217.CA.C is a mutation in the DVU gene DVU2451 at position 2555217 that changes the nucleotides CA to C.
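A mutation key can be taken apart in the same way. The sketch below is illustrative only: `MutationKey` and `parse_mutation` are hypothetical names, not part of microbepy, and the code assumes a gene field is present (how a mutation without an affected gene is encoded is not covered here).

```python
from typing import NamedTuple


class MutationKey(NamedTuple):
    """Hypothetical container for the four fields of a mutation key."""
    gene: str
    position: int
    reference: str   # nucleotides in the reference genome
    changed: str     # nucleotides after the mutation

    @property
    def length_change(self) -> int:
        """Net change in nucleotide count (negative means a net loss)."""
        return len(self.changed) - len(self.reference)


def parse_mutation(key: str) -> MutationKey:
    """Split a mutation key on its last three dots."""
    # rsplit keeps any dots that might occur inside the gene field intact.
    gene, position, reference, changed = key.rsplit(".", 3)
    return MutationKey(gene, int(position), reference, changed)


mutation = parse_mutation("DVU2451.2555217.CA.C")
print(mutation, mutation.length_change)
```

For the example key, CA becomes C, so `length_change` reports a net loss of one nucleotide.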

The culture is a string that uniquely identifies each single or paired incubation of microbes.

These data are combined into a single table called ``genotype_phenotype``. Its keys are the isolate (``key_isolate``), mutation (``key_mutation``), and culture (``key_culture``). The table ``genotype`` contains only the information related to isolates and mutations. Details of the columns in these tables can be found in ``microbepy/common/constants.py``.

The database file is specified in ``config.py``, which is located in the ``.microbepy`` directory in the user's home directory.

## SQL Access to Data

The function ``util.readSQL`` queries the data repository and returns a dataframe with the columns specified in the query.

In [4]:
sql_cmd = "select key_isolate, key_mutation, key_culture from genotype_phenotype where transfer = 152"
df = util.readSQL(sql_cmd)
df.head()

,key_isolate,key_mutation,key_culture
0,HR2.152.01.*.*.*,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C127
1,HR2.152.01.*.*.*,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C128
2,HR2.152.01.*.*.*,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C129
3,HR2.152.01.*.*.*,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C130
4,HR2.152.01.*.*.*,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C278


Note that many of the results are for end point dilutions rather than isolates, since their clone and species are "\*" (None). To obtain true isolates, the query can be restricted to rows that have a species.

In [5]:
sql_cmd = """
select key_isolate, key_mutation, key_culture from genotype_phenotype
     where transfer = 152 and species in ('D', 'M')
"""
df = util.readSQL(sql_cmd)
df.head()

,key_isolate,key_mutation,key_culture
0,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C647
1,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C648
2,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C649
3,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C168
4,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C169
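
The effect of the species filter can be tried without the microbepy database. The sketch below uses Python's standard `sqlite3` module and an in-memory table with two rows copied from the outputs above. It is a toy stand-in, not the real schema: the column set is reduced, and storing a missing species as NULL is an assumption (the real table may store a placeholder string instead).

```python
import sqlite3

# Toy stand-in for the genotype_phenotype table (reduced, assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table genotype_phenotype (
        key_isolate text, key_mutation text, key_culture text,
        transfer integer, species text
    )
""")
rows = [
    # End point dilution: no species (assumed to be stored as NULL).
    ("HR2.152.01.*.*.*", "DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC", "C127", 152, None),
    # True isolate: species D.
    ("HR2.152.05.01.D.CI", "DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC", "C647", 152, "D"),
]
conn.executemany("insert into genotype_phenotype values (?, ?, ?, ?, ?)", rows)

sql_cmd = """
select key_isolate, key_culture from genotype_phenotype
     where transfer = 152 and species in ('D', 'M')
"""
true_isolates = conn.execute(sql_cmd).fetchall()
print(true_isolates)
conn.close()
```

Only the row with a species survives the filter. In SQL, a NULL never satisfies an `in` list, and a placeholder string such as `*` would not match `'D'` or `'M'` either, so the end point dilution is dropped under both encodings.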


To obtain phenotype information, we add the columns ``rate`` and ``yield`` to the query.

In [7]:
sql_cmd = """
select key_isolate, key_mutation, key_culture, rate, yield from genotype_phenotype
     where transfer = 152 and species in ('D', 'M')
"""
df = util.readSQL(sql_cmd)
df.head()

,key_isolate,key_mutation,key_culture,rate,yield
0,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C647,0.025899,0.426375
1,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C648,0.028403,0.438955
2,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C649,0.026304,0.441246
3,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C168,0.030093,0.51952
4,HR2.152.05.01.D.CI,DVU0001.412.CCCCCCTCGCAGCCCCC.CCCCCC,C169,0.029688,0.525941


## Mutation Matrix