# Polycube robustness

In [1]:
import robustness
import altair as alt
import pandas as pd
import utils

Let's start by calculating $N_g$, the set of genotype neighbours one point mutation away from a given genotype $g$. Depending on the allowed number of colors, the input size and the dimensionality, the neighbourhood will differ in size, as we shall see later.

As an initial example, these are the 2D mutations of the genotype for a 2x2 square, if we allow only one color and one cube type:

In [2]:
', '.join(robustness.enumerateMutations('040087000000', maxColor=1, maxCubes=1, dim=2))

'840087000000, 000087000000, 048487000000, 040487000000, 040003000000, 040007000000, 040087840000, 040087040000, 040087008400, 040087000400, 040087000084, 040087000004'
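To see what a single point mutation changes, the listed neighbours can be compared with the original genotype chunk by chunk. The sketch below only does string processing on the output above. It assumes that each pair of hex digits forms one unit of the genotype (plausibly one cube face, although the encoding is not spelled out here), and the helper names `to_chunks` and `changed_positions` are made up for illustration.

```python
from collections import Counter

original = '040087000000'
# Neighbours copied from the output of enumerateMutations above.
mutants = ('840087000000, 000087000000, 048487000000, 040487000000, '
           '040003000000, 040007000000, 040087840000, 040087040000, '
           '040087008400, 040087000400, 040087000084, 040087000004').split(', ')


def to_chunks(hex_genotype):
    """Split a hex genotype into two-character chunks (assumed unit size)."""
    return [hex_genotype[i:i + 2] for i in range(0, len(hex_genotype), 2)]


def changed_positions(a, b):
    """Return the indices of the chunks that differ between two genotypes."""
    return [i for i, (x, y) in enumerate(zip(to_chunks(a), to_chunks(b))) if x != y]


changes = [changed_positions(original, m) for m in mutants]
# Count how many neighbours modify each chunk position.
per_position = Counter(c[0] for c in changes)

print(len(mutants), "neighbours")
print(dict(per_position))
```

Under this reading, every neighbour differs from the original in exactly one chunk, and each of the six chunk positions is hit by two neighbours.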

## Genotype robustness

We define the genotype robustness $\rho_g$ of a genotype $g$ as:
$$ \rho_g = \frac{1}{\left | N_g \right |}\sum_{n \in N_g} \begin{cases}
 1 & \text{ if } p(n)=p(g) \\ 
 0 & \text{ otherwise } 
\end{cases} $$
where $N_g$ is the set of 1-mutant neighbours of $g$ and $p(g)$ is the phenotype assembled from genotype $g$.

Let's calculate the robustness for the same genotype as before, this time in the default 3D with a maximum of 3 colours and 2 cube types:

In [3]:
robustness.calcGenotypeRobustness('040087000000', 3, 2)

0.7592592592592593
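The definition of $\rho_g$ can also be written out directly. The following is a minimal sketch on a toy genotype space, not the implementation in `robustness.py`: genotypes are strings over `'a'` and `'b'`, a point mutation flips one character, and the phenotype is simply the first character. The names `genotype_robustness`, `flip_neighbours` and `first_char` are hypothetical.

```python
def genotype_robustness(g, neighbours, phenotype):
    """Fraction of 1-mutant neighbours of g that share g's phenotype."""
    ns = list(neighbours(g))
    if not ns:
        raise ValueError("genotype has no neighbours")
    target = phenotype(g)
    # Each neutral neighbour contributes 1, every other neighbour 0.
    return sum(phenotype(n) == target for n in ns) / len(ns)


def flip_neighbours(g):
    """Toy neighbourhood: flip one character between 'a' and 'b'."""
    swap = {'a': 'b', 'b': 'a'}
    return [g[:i] + swap[c] + g[i + 1:] for i, c in enumerate(g)]


def first_char(g):
    """Toy phenotype: only the first character matters."""
    return g[0]


rho = genotype_robustness('aab', flip_neighbours, first_char)
print(rho)
```

Here `'aab'` has the neighbours `'bab'`, `'abb'` and `'aaa'`. Only the first one changes the toy phenotype, so two of the three mutations are neutral.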

If we allow a larger space of colours and cube types, the robustness will be higher, since there is more room for neutral mutations:

In [4]:
genotypeRobustnessData = []
for maxCol in range(1,10):
    for maxCubes in range(1,10):
        genotypeRobustnessData.append({
            'maxCol': maxCol,
            'maxCubes': maxCubes,
            'robustness': robustness.calcGenotypeRobustness('040087000000', maxCol, maxCubes)
        })
alt.Chart(pd.DataFrame(data=genotypeRobustnessData)).mark_rect().encode(
    alt.X('maxCol:O', title="Allowed colors"),
    alt.Y('maxCubes:O', title="Allowed cube types"),
    alt.Color('robustness', scale=alt.Scale(domain=(0,1))),
)

## Phenotype robustness

With the genotype robustness defined, we can define the phenotype robustness $\rho_p$ as the average genotype robustness over all genotypes encoding the given phenotype:
$$ \rho_p = \frac{1}{\left | P \right |} \sum_{g \in P}\rho_g $$
where $P$ is the set of all genotypes with phenotype $p$.
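As a rough illustration of this average, the sketch below computes $\rho_p$ from a collection of genotypes and a function returning $\rho_g$, optionally on a random subsample. It is a sketch under stated assumptions rather than the code behind `robustness.calcPhenotypeRobustness`; the function name, the `sample_size` and `seed` parameters, and the toy values are all invented for the example.

```python
import random
from statistics import mean


def phenotype_robustness(genotypes, genotype_robustness, sample_size=None, seed=0):
    """Mean genotype robustness over (a sample of) the genotypes of one phenotype."""
    genotypes = list(genotypes)
    if not genotypes:
        raise ValueError("phenotype has no genotypes")
    if sample_size is not None and sample_size < len(genotypes):
        # Estimate the mean from a reproducible random subsample.
        genotypes = random.Random(seed).sample(genotypes, sample_size)
    return mean(genotype_robustness(g) for g in genotypes)


# Toy data: three genotypes with made-up robustness values.
toy_rho = {'g1': 0.5, 'g2': 0.7, 'g3': 0.6}
rho_p = phenotype_robustness(toy_rho, toy_rho.get)
print(round(rho_p, 3))
```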

To decide how many genotypes need to be sampled, we plot the distribution of the genotype robustness values for a few key phenotypes. How large is the variance? Is the distribution Gaussian, or is it skewed in some way? (If the variance is low, fewer genotypes need to be sampled to get a good estimate of the mean.)

In [2]:
phenos = utils.loadPhenos('../cpp/out_new/small3d/phenos')

In [5]:
phenos[1]

{'count': 61,
 'freq': 6.1e-07,
 'compl': 4,
 'rule': '88128c00930500000e00000000000c000000858e00000008000000000000',
 'genotypes': ['db20746b8b414e9c0387fe670365b028bbe8c63915472ee207c35483aff2',
  '0528e5826c359578245ed0e31fa368af857b5e4426261490ccde9dbd4a4f',
  '60962fb6837a08108ba4ee0e5967fd1df01d9017f45b30698e2dd283377d',
  'e3b86c621f146e8a5287014a4310d21a2f728b9d933c5b9931d2c6468b88',
  '17be90c9e65afd2cac04ed56211f5cce2b4f708bebaba4deb9a37066c9ea',
  '5798d87305751b23cbe5486a8a2fdb0093b23adb30619a00c0e8d010358f',
  '6cc222964d4bba34f7b5c02a85fb38afeb62891dbd1a5371ed5a90a1a91a',
  '5d57f6ae59d647dddbcd0936227fa83e362c364a0728193ee4c218271735',
  'ec251e3c2dccd580b20eb70c12288c89baebba81b533c29ee8cdc66f45bf',
  'f41e4eed42fb9412f02d0cfeca4008e141b3039ae9cf78b2d60f001a5572',
  'd332ae799d4e222c0194da9542fae1041a0dc7a5088c8ba9062f047e2611',
  '833b30d448e86e7220b75b75a1804e9f62ac1f51b360c3e7357cffec064f',
  '8ff2657def051d55e859c6da3957564c0cc092e819b34232194b12148409',
  '415b7cb7

In [7]:
# Generating Data
data = []
for phenoName, phenoIdx in [('a',1), ('b',131), ('c',151)]:
    p = phenos[phenoIdx]
    for genotype in p['genotypes']:
        data.append({'r': robustness.calcGenotypeRobustness(genotype), 'pheno': phenoName})
source = pd.DataFrame(data=data)

In [11]:
base = alt.Chart(source)

base.mark_bar(opacity=0.5).encode(
    x=alt.X('r:Q', bin=True, axis=None),
    y=alt.Y('count()', stack=None),
    color="pheno:O"
) + base.mark_rule(color='red').encode(
    x='mean(r):Q',
    size=alt.value(5),
    color="pheno:O"
)

In [15]:
alt.Chart(source).transform_joinaggregate(
    total='count(*)'
).transform_calculate(
    pct='1 / datum.total'
).mark_bar(opacity=0.5).encode(
    alt.X('r:Q', bin=True),
    alt.Y('sum(pct):Q', axis=alt.Axis(format='%')),
    color="pheno:O"
)

In [99]:
for pheno in ['a', 'b', 'c']:
    df = source.loc[source['pheno'] == pheno]
    # Select the numeric column explicitly; df also contains the string column 'pheno'
    print("{}: mean: {:.2}, variance: {:.2}".format(pheno, df['r'].mean(), df['r'].var()))

a: mean: 0.57, variance: 0.00074
b: mean: 0.58, variance: 0.00083
c: mean: 0.57, variance: 0.00095
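These variances can be turned into a rough sample-size estimate. Assuming the sampled genotypes are independent, the standard error of the mean is $\sqrt{\sigma^2 / n}$, so reaching a target standard error takes about $\sigma^2 / \text{SE}^2$ samples. The sketch below applies this to the variances printed above; the target of 0.01 and the function names are arbitrary choices for illustration.

```python
import math

# Sample variances copied from the output above.
variances = {'a': 0.00074, 'b': 0.00083, 'c': 0.00095}


def standard_error(variance, n):
    """Standard error of the mean for n independent samples."""
    return math.sqrt(variance / n)


def samples_needed(variance, target_se):
    """Smallest n for which the standard error is at most target_se."""
    return math.ceil(variance / target_se ** 2)


needed = {pheno: samples_needed(v, 0.01) for pheno, v in variances.items()}
print(needed)
```

With these numbers, around ten genotypes per phenotype already give a standard error of about 0.01, which supports sampling rather than enumerating every genotype.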


For a large dataset this will take a while, so it is better to calculate the phenotype robustness separately by running `python robustness.py` and saving the pickled data. In this case, we load a random sample of 100 phenotypes:

In [5]:
import pickle  # not imported in the first cell
#phenotypeRobustnessData = robustness.calcPhenotypeRobustness(path='../cpp/out/3d', sampleSize=100)
with open('../cpp/out/3d/robustness.p', "rb") as f:
    phenotypeRobustnessData = pickle.load(f)
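The compute-once, load-afterwards pattern used here can be wrapped in a small helper. This is a generic sketch, not part of `robustness.py`: `load_or_compute` and `expensive` are hypothetical names, and the example writes to a temporary directory rather than to the project paths.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled result from path, or compute and pickle it if missing."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive():
    """Stand-in for a slow calculation; records each invocation."""
    calls.append(1)
    return {'robustness': [0.5, 0.6]}


cache_path = os.path.join(tempfile.mkdtemp(), 'robustness.p')
first = load_or_compute(cache_path, expensive)   # computes and saves
second = load_or_compute(cache_path, expensive)  # loads from disk
print(len(calls))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.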

In [6]:
alt.Chart(phenotypeRobustnessData).transform_calculate(
        url='https://akodiat.github.io/polycubes?hexRule=' + alt.datum.rule
    ).mark_circle(size=60).encode(
        alt.X('frequency', scale=alt.Scale(type='log'), title="Frequency"),
        alt.Y('robustness', title="Phenotype Robustness"),
        href='url:N',
        tooltip=['rule']
    ).interactive()