# Setup

In [1]:
import os
import re

import numpy as np
from scipy.stats import entropy 
from Bio import SeqIO

DIR = r'c://downloads'

# Q1

## A

At position 1, the nucleotides are A, A, T, A, meaning we have A with probability 0.75 and T with probability 0.25, so the entropy is $H_1 = 0.75 \cdot \log_2(\frac{1}{0.75}) + 0.25 \cdot \log_2(\frac{1}{0.25}) \approx 0.81$.

Likewise, at the other positions:
* $H_2 = 4 \cdot 0.25 \cdot \log_2(\frac{1}{0.25}) = \log_2(4) = 2$
* $H_3 = 2 \cdot 0.25 \cdot \log_2(\frac{1}{0.25}) + 0.5 \cdot \log_2(\frac{1}{0.5}) = 0.5 \cdot 2 + 0.5 \cdot 1 = 1.5$
* $H_4 = 2 \cdot 0.5 \cdot \log_2(\frac{1}{0.5}) = \log_2(2) = 1$
* $H_5 = H_3 = 1.5$
* $H_6 = 1 \cdot \log_2(\frac{1}{1}) = 0$
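
The per-position values above can be checked numerically. The following is a minimal sketch: the helper name `shannon_entropy_bits` is introduced here for illustration only, and the probability vectors are the ones read off in the calculations above (position 5 is assumed to have the same distribution as position 3, since only $H_5 = H_3$ is stated).

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution (zero-probability terms are skipped)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Nucleotide probabilities per position, as used in the calculations above.
# Position 5 is an assumption: only H_5 = H_3 is given.
position_probs = {
    1: [0.75, 0.25],
    2: [0.25, 0.25, 0.25, 0.25],
    3: [0.25, 0.25, 0.5],
    4: [0.5, 0.5],
    5: [0.25, 0.25, 0.5],
    6: [1.0],
}

entropies = {pos: shannon_entropy_bits(p) for pos, p in position_probs.items()}
for pos, h in entropies.items():
    print('H_%d = %.2f bits' % (pos, h))
```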

## B

According to the above calculations, position 2 has the highest entropy (2) and position 6 has the lowest (0). That makes intuitive sense: position 2 indeed exhibits the most uncertainty (each of the 4 nucleotides occurs with equal frequency), and position 6 exhibits the least uncertainty (T occurs 100% of the time).

## C

i) Positions -9, -8, 1, 2 and 4 appear to be least important (as they are the shortest in the displayed logo, meaning they have minimal information / maximal entropy).

ii) Anything other than C would be somewhat surprising, with A and G completely unexpected. 

# Q2

## A

The genome of the K-12 strain of E. coli is found at: https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3.

In [2]:
genome_record, = SeqIO.parse(os.path.join(DIR, 'sequence.fasta'), 'fasta')
genome_seq = genome_record.seq
print('The genome is %d nt long.' % len(genome_seq))

The genome is 4641652 nt long.


## B

According to [UniProt](https://www.uniprot.org/uniprot/P0A7C2), the LexA repressor "represses a number of genes involved in the response to DNA damage (SOS response), including recA and lexA. Binds to the 16 bp palindromic sequence 5'-CTGTATATATATACAG-3'. In the presence of single-stranded DNA, RecA interacts with LexA causing an autocatalytic cleavage which disrupts the DNA-binding part of LexA, leading to derepression of the SOS regulon and eventually DNA repair. Implicated in hydroxy radical-mediated cell death induced by hydroxyurea treatment (PubMed:20005847). The SOS response controls an apoptotic-like death (ALD) induced (in the absence of the mazE-mazF toxin-antitoxin module) in response to DNA damaging agents that is mediated by RecA and LexA (PubMed:22412352)."

## C

In [3]:
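# 20-nt pattern: '.' matches any nucleotide, '[XY]' matches one of two nucleotides.
LEXA_PATTERN = re.compile(r'..CTG[TG][AT].[AG]...[AT]..CAG..')

laxA_pattern_matches = []

# Scan both strands; the '-' strand is the reverse complement of the record.
for strand in ['+', '-']:

    if strand == '+':
        strand_seq = genome_seq
    else:
        strand_seq = genome_seq.reverse_complement()

    # finditer reports non-overlapping matches only. Coordinates are 0-based and
    # relative to the scanned strand (for '-', to the reverse-complemented sequence).
    for match in LEXA_PATTERN.finditer(str(strand_seq)):
        laxA_pattern_matches.append((strand, match.start(), match.end(), match.group()))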
        
print('Found %d matches to the regex pattern of lexA repressor:' % len(laxA_pattern_matches))

for strand, start, end, seq in laxA_pattern_matches:
    print('%d-%d (%s): %s' % (start, end, strand, seq))

Found 281 matches to the regex pattern of lexA repressor:
24770-24790 (+): GCCTGTTTGGCCTGGCAGAC
50434-50454 (+): TACTGTTTATCTTCCCAGCG
57820-57840 (+): TGCTGTATGTCATTGCAGAA
58611-58631 (+): TTCTGGTCAGATAAACAGAC
110445-110465 (+): GGCTGGATAAAGAACCAGAA
133583-133603 (+): TGCTGGAAGCCGATGCAGAT
159485-159505 (+): TACTGTTGGTTCAATCAGAT
174854-174874 (+): CGCTGTAAAGATTTTCAGAC
217280-217300 (+): CGCTGTACAGTTTCTCAGCA
246516-246536 (+): TGCTGTATATTTATTCAGCT
250863-250883 (+): CACTGTATACTTTACCAGTG
258824-258844 (+): GTCTGTTCGATAAAGCAGGC
306490-306510 (+): AACTGGTGAATTTTCCAGCC
307127-307147 (+): GCCTGGATAGAGAGTCAGAC
352886-352906 (+): GGCTGGAGAAACAGCCAGAG
370655-370675 (+): GGCTGGATAAGGTGCCAGTT
482257-482277 (+): CGCTGGTCAACATATCAGGG
503429-503449 (+): GACTGTAAAACGATGCAGCC
602292-602312 (+): TGCTGGTGGGAATGGCAGAG
607639-607659 (+): CACTGTATAAATAAACAGCT
632910-632930 (+): TTCTGGTTGCGGAGCCAGCA
657093-657113 (+): GCCTGGATGCGCTTTCAGTT
657238-657258 (+): TACTGGTAACCGACACAGCA
713849-713869 (+): TTCTGGAAGTG

The constraints of the written regex are:
* In 6 positions, an exact nucleotide match is required, resulting in a match probability of 1/4 in each of these positions.
* In 4 positions, one of two possible nucleotides is required, resulting in a match probability of 2/4 in each of these positions.

Overall, a random sequence of the appropriate length would match the regex with probability of $(\frac{1}{4})^6 \cdot (\frac{1}{2})^4 = \frac{1}{2^{16}} = \frac{1}{65,536}$.

The analyzed bacterial genome is 4,641,652 nt long, meaning 9,283,304 nt on the two strands together. Each of these nucleotides is a potential start site for a sequence that could match the lexA repressor regex pattern. Therefore, if the genomic sequence were completely random (with all four nucleotides equally likely at every position), we would expect $\frac{9,283,304}{65,536} \approx 142$ matches.

The number of matches actually observed (281) is about twice the number we would expect at random.
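
The arithmetic above can be reproduced with a short sketch. It assumes a uniform background (each nucleotide with probability 1/4) and treats the genome as linear, as the search above does. The helper `match_probability` and the tokenizing regex are introduced here for illustration and only handle the simple constructs used in this pattern (`.`, single letters and `[..]` classes). Unlike the estimate above, it excludes the last 19 positions of each strand as start sites, which changes the result only negligibly.

```python
import re
from fractions import Fraction

PATTERN = r'..CTG[TG][AT].[AG]...[AT]..CAG..'
GENOME_LENGTH = 4641652  # nt, as printed above

# Split the pattern into per-position tokens: '[XY]' classes, single letters, or '.'.
TOKEN_RE = re.compile(r'\[[ACGT]+\]|[ACGT]|\.')

def match_probability(pattern):
    """Probability that a uniformly random sequence matches the pattern."""
    prob = Fraction(1)
    for token in TOKEN_RE.findall(pattern):
        if token == '.':
            allowed = 4                      # any nucleotide
        elif token.startswith('['):
            allowed = len(set(token[1:-1]))  # one of the listed nucleotides
        else:
            allowed = 1                      # exact nucleotide
        prob *= Fraction(allowed, 4)
    return prob

motif_length = len(TOKEN_RE.findall(PATTERN))
p_match = match_probability(PATTERN)

# Valid start sites on both strands of a linear sequence.
start_sites = 2 * (GENOME_LENGTH - motif_length + 1)
expected_matches = float(start_sites * p_match)

print('Motif length: %d' % motif_length)
print('Match probability: %s' % p_match)
print('Expected matches at random: %.1f' % expected_matches)
```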

## D

In [4]:
NT_OPTIONS = 'ACGT'

nt_to_index = {nt: i for i, nt in enumerate(NT_OPTIONS)}
motif_counts = np.zeros((4, 20))

for _, _, _, seq in laxA_pattern_matches:
    for i, nt in enumerate(seq):
        motif_counts[nt_to_index[nt], i] += 1
        
motif_freqs = motif_counts / len(laxA_pattern_matches)
motif_percents = 100 * motif_freqs

print('According to the %d extracted sequences, the frequency of the 4 nucleotides (%s) at each position are:' % \
        (len(laxA_pattern_matches), ', '.join(NT_OPTIONS)))
print(np.array_repr(motif_percents.astype(int), max_line_width = 150))

According to the 281 extracted sequences, the frequency of the 4 nucleotides (A, C, G, T) at each position are:
array([[ 22,  25,   0,   0,   0,   0,  48,  19,  51,  24,  23,  22,  53,  23,  17,   0, 100,   0,  28,  27],
       [ 22,  12, 100,   0,   0,   0,   0,  20,   0,  24,  28,  29,   0,  19,  34, 100,   0,   0,  24,  27],
       [ 19,  35,   0,   0, 100,  63,   0,  20,  48,  23,  19,  26,   0,  30,  29,   0,   0, 100,  26,  18],
       [ 35,  25,   0, 100,   0,  36,  51,  39,   0,  27,  28,  20,  46,  25,  19,   0,   0,   0,  20,  26]])


## E

In [5]:
entropy_per_position = entropy(motif_freqs)
print('Entropy per position:')
print(np.array_repr(entropy_per_position, precision = 1, max_line_width = 150))

Entropy per position:
array([1.4, 1.3, 0. , 0. , 0. , 0.7, 0.7, 1.3, 0.7, 1.4, 1.4, 1.4, 0.7, 1.4, 1.3, 0. , 0. , 0. , 1.4, 1.4])


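Unsurprisingly, the entropy is low at positions constrained by the regex. Note that `scipy.stats.entropy` uses the natural logarithm by default, so the values above are in nats rather than bits. At unconstrained positions, the entropy is in the range 1.3-1.4 nats (close to the maximum of $\ln 4 \approx 1.39$ nats, i.e. 2 bits). At positions constrained to two possible nucleotides, the entropy is about half that (~0.7 nats $\approx \ln 2$, i.e. 1 bit), and at positions constrained to one specific nucleotide the entropy is of course zero.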
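
Because `scipy.stats.entropy` defaults to the natural logarithm, the printed values are not directly comparable to the base-2 entropies computed by hand in Q1. The sketch below converts between the two units using only the standard library; the helper name `nats_to_bits` is introduced here for illustration.

```python
import math

def nats_to_bits(h_nats):
    """Convert an entropy value from nats (natural log) to bits (log base 2)."""
    return h_nats / math.log(2)

# Reference values for the three kinds of positions in the regex.
uniform_four_nats = math.log(4)  # unconstrained position, 4 equally likely nucleotides
uniform_two_nats = math.log(2)   # position constrained to 2 equally likely nucleotides

print('Unconstrained: %.2f nats = %.2f bits' % (uniform_four_nats, nats_to_bits(uniform_four_nats)))
print('Two options:   %.2f nats = %.2f bits' % (uniform_two_nats, nats_to_bits(uniform_two_nats)))
```

Alternatively, calling `entropy(motif_freqs, base=2)` would make SciPy report bits directly.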

## F

On the one hand, we did find twice as many sequences as expected at random, but this could also be explained by other factors (e.g. we know that the background frequencies of nucleotides in the genome depend on GC-content and are never completely random, regardless of the lexA repressor motif). On the other hand, the recovered frequencies at each of the 20 positions don't look very similar to the logo we started with at positions not constrained by the regex itself, and the entropy looks pretty much like what we would expect from random sequences constrained by the provided regex pattern. Overall, I tend to believe that most of the recovered sequences are probably random.

To settle the issue, we could simply take the sequences from which the lexA motif logo was created in the first place and check to what extent they match the 281 sequences we found.

# Q3

A) The p-value is defined as the probability of observing data at least as extreme as the data actually observed, given that the null hypothesis is true. In this case, the null hypothesis is that the coin isn't biased, so a p-value of 0.1 means that the 20 coin flips are somewhat unlikely if the coin isn't biased (i.e. for a fair coin, only 1 in 10 sets of experiments would appear at least as unbalanced as the observed coin flips). A p-value of 0.1 is generally considered above the significance cutoff, so normally we wouldn't reject the null hypothesis in that case.

B) A one-tailed statistical test is used when we seek to reject the null hypothesis in a specific direction (e.g. we want to test whether a coin tends towards more heads than tails, but not the other way around). A two-tailed test is used when we don't know, a priori, in which direction we might reject the null hypothesis, and we want to test it in both directions. Two-tailed tests are the default, and one should use a one-tailed statistical test only if there is a good reason to expect, or be interested in, a deviation from the null hypothesis in only one direction. In the case of coin flips, a coin might be biased in either direction (i.e. towards heads or towards tails), so a two-tailed statistical test should be used.
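
To make the one-tailed versus two-tailed distinction concrete, here is a minimal sketch of an exact binomial test for a fair coin. The outcome of 14 heads in 20 flips is a hypothetical example (the actual counts are not given here), and the function name `binomial_p_values` is introduced for illustration.

```python
from math import comb

def binomial_p_values(n_flips, n_heads):
    """Exact one-tailed (too many heads) and two-tailed p-values under a fair coin."""
    total = 2 ** n_flips
    # One-tailed: probability of at least n_heads heads.
    one_tailed = sum(comb(n_flips, k) for k in range(n_heads, n_flips + 1)) / total
    # Two-tailed: probability of any outcome at least as far from n_flips / 2.
    deviation = abs(n_heads - n_flips / 2)
    two_tailed = sum(
        comb(n_flips, k)
        for k in range(n_flips + 1)
        if abs(k - n_flips / 2) >= deviation
    ) / total
    return one_tailed, two_tailed

# Hypothetical outcome: 14 heads out of 20 flips.
one_tailed, two_tailed = binomial_p_values(20, 14)
print('One-tailed p-value: %.4f' % one_tailed)
print('Two-tailed p-value: %.4f' % two_tailed)
```

For a fair coin the null distribution is symmetric, so the two-tailed p-value is exactly twice the one-tailed one. The same data can therefore fall on different sides of a fixed significance threshold depending on which test is chosen, which is why the choice should be made before looking at the data.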

C) No. Nothing is certain in the realm of statistics. While a p-value of 0.05 is good evidence against the null hypothesis (i.e. good evidence for a biased coin), it could also occur due to chance alone (in 1 out of every 20 sets of experiments, in expectation). In the case of coins, we might not have strong feelings about whether they should be biased or not, but sometimes we might have good reasons to think that the null hypothesis is true, regardless of our own experiment (for example, if it is a widely accepted theory). In such cases, it might be appropriate not to reject the null hypothesis even for very low p-values (e.g. if we get a p-value of 0.0001, we might still think it's more probable that we have witnessed an unlikely 1-in-10,000 event than that the theory we tested is false).

D) Yes. Significance thresholds are, in the end, arbitrary. The lower the p-value, the less likely the data are given the null hypothesis, so the "reductio ad absurdum" argument becomes stronger, driving us more strongly to reject the null hypothesis (it can also be shown mathematically that the more unlikely the data given the null hypothesis, the lower the probability of the null hypothesis). In the case of coin flipping, the lower the p-value, the lower the probability that the coin is unbiased, meaning more evidence for a biased coin.

E) It is virtually impossible to obtain such a low p-value due to chance alone. Assuming that the flipping protocol is reliable and that the statistical test used to derive the p-value is appropriate (and doesn't make any unjustified assumptions) - then we can, with **very** high certainty, determine that the coin is indeed biased.

F) No! It is a common mistake to confuse significance (which is very strong in this case) with effect size. A very significant result does not imply a strong effect. Importantly, the more samples we have, the more significant the p-value will be for a given effect size. With a sufficiently high number of samples, even a very weak effect will produce a very significant p-value. In our case, we know that the new ECBI protocols entail a very high number of coin flips, meaning very large sample sizes.
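
The dependence of significance on sample size can be illustrated with a small sketch. It uses the normal approximation to the binomial rather than an exact test, and the observed heads fraction of 0.51 and the sample sizes are hypothetical values chosen for illustration.

```python
import math

def two_sided_p_value(heads_fraction, n_flips):
    """Two-sided p-value for a fair coin, using the normal approximation to the binomial."""
    standard_error = math.sqrt(0.25 / n_flips)  # std. dev. of the heads fraction under H0
    z = (heads_fraction - 0.5) / standard_error
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))

HEADS_FRACTION = 0.51  # hypothetical: the same weak effect in every experiment
sample_sizes = [100, 10000, 1000000]

p_values = [two_sided_p_value(HEADS_FRACTION, n) for n in sample_sizes]
for n, p in zip(sample_sizes, p_values):
    print('n = %7d flips: p = %.3g' % (n, p))
```

The effect size (51% heads) is identical in all three cases; only the number of flips changes, yet the p-value goes from clearly non-significant to extremely small.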

G) A high p-value doesn't prove that there is no effect, even though it is evidence of that (potentially weak evidence). In general, there's no way to prove that a specific model is true. Scientific theories become increasingly established the more they resist attempts to refute them, but they are never proven beyond all doubt. It is always possible that a coin is in fact biased, even if we see no evidence for that (it could be, for example, that the bias is so weak that we would need a much larger sample size to detect it, no matter how big our sample already is). However, if we flip a coin many times and still get no significant difference between the numbers of heads and tails, it does constitute evidence in favor of the coin being fair. At the very least, we could probably rule out a strong effect size (which, for a reasonable statistical test, would give us a significant result with very high probability). "Absence of evidence" is in fact "evidence of absence", so a high p-value should strengthen our belief, even if only by a tiny bit, that the coin is indeed fair [*].

[*] More generally, if observing something strengthens our belief in some theory, then failing to observe it must, by the laws of probability, weaken our belief in that theory (at least by some amount, small as it may be). In the case of p-values, if we accept a low p-value as evidence against the null hypothesis (and we should), then it means we must also accept high p-values as evidence in favor of the null hypothesis.
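
The footnote's claim follows from the law of total probability. As a short sketch, let $H$ denote the theory and $E$ the observation, and assume $0 < P(E) < 1$ (these symbols are introduced here for illustration):

$$P(H) = P(H \mid E) \cdot P(E) + P(H \mid \neg E) \cdot (1 - P(E))$$

If observing $E$ strengthens our belief, i.e. $P(H \mid E) > P(H)$, then

$$P(H \mid \neg E) \cdot (1 - P(E)) = P(H) - P(H \mid E) \cdot P(E) < P(H) - P(H) \cdot P(E) = P(H) \cdot (1 - P(E))$$

and dividing both sides by $1 - P(E) > 0$ gives $P(H \mid \neg E) < P(H)$. In other words, $P(H)$ is a weighted average of the two conditional probabilities, so if one of them lies above it, the other must lie below it.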