## Comments on Filtering 

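Consider a reference sequence of length $L$ and a pool of other sequences to compare it against. In a genome, there are $N \approx 4\times 10^6$ such sequences. Assuming each base is drawn independently and uniformly from the four nucleotides, a candidate that differs from the reference at exactly $n$ positions is described by two quantities: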
\begin{align*}
    \binom{L}{n} &\quad \text{number of ways to place $n$ differences among $L$ positions}\\
    (1/4)^{L - n}(3/4)^n &\quad \text{probability that a given placement is observed}
\end{align*}
Hence we define $E(n, L)$, the expected number of sequences with exactly $n$ differences from a query of length $L$:
$$
E(n, L) = N \left(\frac{\binom{L}{n}\, 3^n}{4^L} \right)
$$
We want to filter so that $E(n, L)$ stays below a threshold $\tau = 0.1$, that is, so that fewer than $0.1$ chance matches with $n$ differences are expected per query.
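
As a rough illustration of the criterion above, the following sketch evaluates $E(n, L)$ directly and finds the largest number of differences that still passes the threshold for a given query length. The function names `expected_hits` and `max_mismatches` are hypothetical and are not part of the `entropies` module; the values of $N$ and $\tau$ are the ones stated in the text.

```python
from math import comb

N = 4_000_000  # approximate number of candidate sequences (N from the text)
TAU = 0.1      # filtering threshold (tau from the text)


def expected_hits(n, L, N=N):
    """E(n, L): expected number of sequences with exactly n differences."""
    return N * comb(L, n) * 3**n / 4**L


def max_mismatches(L, tau=TAU, N=N):
    """Largest n such that E(k, L) < tau for every k <= n.

    Returns None if even an exact match (n = 0) is expected too often.
    """
    best = None
    for n in range(L + 1):
        if expected_hits(n, L, N) >= tau:
            break
        best = n
    return best


if __name__ == "__main__":
    for L in (12, 13, 20, 30):
        print(L, max_mismatches(L))
```

Under these assumptions, a query of length 12 fails the filter even for exact matches, because $N / 4^{12} \approx 0.24 > \tau$, while a query of length 20 tolerates up to two differences.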

In [16]:

from entropies import produce_seqs, plot_entropy

GENE = 'fdoH'

seqs = produce_seqs(GENE)

plot_entropy(seqs, GENE)




In [22]:

import importlib
import entropies

# Reload the module so that edits made to it since the first import take effect
importlib.reload(entropies)
from entropies import produce_seqs, frequency_table

table = frequency_table(seqs)

print(table)

[[49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 49.03393545 49.03393545 49.03393545
  49.03393545 49.03393545 49.03393545 