# <span style="color:red">Before you turn this in, make sure everything runs as expected.</span>

1. **RESTART THE KERNEL** – in the menubar, select Kernel$\rightarrow$Restart
2. **RUN ALL CELLS** – in the menubar, select Cell$\rightarrow$Run All
3. **VALIDATE THE NOTEBOOK** – in the menubar, click the Validate button

## <span style="color:blue">How to Answer Questions</span>

### <span style="color:blue">Python code answers</span>

Enter your answer any place that says
```python
# Enter your code here
```
<span style="color:red">**AND delete the text.**</span>
```python
raise NotImplementedError # No Answer - remove if you provide an answer
```

### <span style="color:blue">Written answers</span>

Enter your answer any place that says
```
YOUR ANSWER HERE.
```

In [73]:
ANUID = "U7522927"

# Part 1 -- PSSM algorithms

The notebook is worth 8%.

There are extension questions in this notebook worth 1%. (See the [README-FIRST](README-FIRST.ipynb) notebook to understand how these work.)

**You are expected to use numpy. You are not allowed to use anything other than Python's builtins.**

## Background

We obtain position weight matrices (henceforth PWMs) from JASPAR. These matrices are oriented so that rows correspond to nucleotides and columns correspond to positions along the sequence; the cells are counts. For instance, the Python index `[1][3]` of the following matrix equals `10`, which is the count for `C` at position 3.

```
>MA0108.2	TBP
A  [ 61  16 352   3 354 268 360 222 155  56  83  82  82  68  77 ]
C  [145  46   0  10   0   0   3   2  44 135 147 127 118 107 101 ]
G  [152  18   2   2   5   0  20  44 157 150 128 128 128 139 140 ]
T  [ 31 309  35 374  30 121   6 121  33  48  31  52  61  75  71 ]
```

**NOTE:** I expect your solutions to use the same orientation and ordering.
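
As a quick check of the orientation described above, the TBP counts can be loaded into a numpy array and indexed directly. This is only an illustrative sketch; the variable name `tbp_counts` is an assumption and is not used elsewhere in the notebook.

```python
import numpy

# Counts copied from the MA0108.2 (TBP) matrix above; rows are A, C, G, T.
tbp_counts = numpy.array([
    [ 61,  16, 352,   3, 354, 268, 360, 222, 155,  56,  83,  82,  82,  68,  77],
    [145,  46,   0,  10,   0,   0,   3,   2,  44, 135, 147, 127, 118, 107, 101],
    [152,  18,   2,   2,   5,   0,  20,  44, 157, 150, 128, 128, 128, 139, 140],
    [ 31, 309,  35, 374,  30, 121,   6, 121,  33,  48,  31,  52,  61,  75,  71],
])

print(tbp_counts.shape)   # (4, 15): 4 bases by 15 positions
print(tbp_counts[1][3])   # 10: count of C (row 1) at position 3
```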

## Q1 -- convert weight matrix to a position specific probability matrix

Complete the `pwm_to_ppm()` function below.

- The `pwm` argument is a numpy array OR a list of lists
- The `pseudocount` argument is a float
- It returns a numpy array with the same shape as the input array

The function should
- convert a PWM into a position specific probability matrix (PPM)
- support a list, or numpy array as input
- correct for zeros in the PWM via the optional argument `pseudocount`
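
To make the pseudocount step concrete, here is a minimal sketch on a single column. It assumes the convention of adding the pseudocount to every cell before normalising, which is one common choice; the value `0.5` is purely illustrative and not prescribed by the assignment.

```python
import numpy

# Position 2 of the TBP matrix above (A, C, G, T); C was never observed.
column_counts = numpy.array([352, 0, 2, 35], dtype=float)
pseudocount = 0.5  # illustrative value (assumption)

# Add the pseudocount to every cell, then divide by the column total.
adjusted = column_counts + pseudocount
column_probs = adjusted / adjusted.sum()

print(column_probs)        # every entry is now strictly positive
print(column_probs.sum())  # the column still sums to 1 (up to rounding)
```

Without the pseudocount, the probability for `C` would be exactly zero and its log-odds score would be negative infinity.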

In [3]:
import numpy

In [4]:
def pwm_to_ppm(pwm, pseudocount=0.0):
    # Accept a list of lists or a numpy array; always work on a float copy
    pwm = numpy.array(pwm, dtype=float)

    # Pseudo-counts -- handling zero counts
    result = pwm + pseudocount

    # Divide each column (position) by its total so the column sums to 1
    return result / result.sum(axis=0)

In [5]:
# This part worth 0.1

"""Q1 correct function name"""
from gutils import check

check.allowed_modules(allowed=["numpy"])
check.expected_variables_exist(["pwm_to_ppm"], locals(), callables=["pwm_to_ppm"])

In [6]:
# This part worth 0.1

# A test!
"""Q1 pwm_to_ppm does not fail when given either list or numpy array"""

'Q1 pwm_to_ppm does not fail when given either list or numpy array'

In [7]:
# This part worth 0.1

# A test!
"""Q1 pwm_to_ppm takes pseudo-count int or float"""

'Q1 pwm_to_ppm takes pseudo-count int or float'

In [8]:
# This part worth 0.1

# A test!
"""Q1 pwm_to_ppm returns numpy array"""

'Q1 pwm_to_ppm returns numpy array'

In [9]:
# This part worth 0.3

# A test!
"""Q1 pwm_to_ppm columns sum to 1"""

'Q1 pwm_to_ppm columns sum to 1'

In [10]:
# This part worth 0.3

# A test!
"""Q1 pwm_to_ppm produces correct values"""

'Q1 pwm_to_ppm produces correct values'

## Q2 -- PSSM

Complete the `ppm_to_pssm()` function below.

- it takes a single argument, a PPM as a numpy array
- it returns a numpy array
- it uses log2
- it uses a background distribution of equally frequent bases
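
The log-odds transform implied by these requirements can be sketched on a few hand-picked probabilities. The names `background` and `example_probs` are illustrative assumptions.

```python
import numpy

background = 0.25  # equally frequent bases
example_probs = numpy.array([0.5, 0.25, 0.125])

# log2 of the ratio between observed and background probability
log_odds = numpy.log2(example_probs / background)

print(log_odds)  # values 1, 0 and -1
```

A base twice as likely as background scores 1, a base at background frequency scores 0, and a base half as likely scores -1.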

In [11]:
def ppm_to_pssm(ppm):
    # Log-odds against a uniform background (0.25 per base), using log2
    pssm = numpy.log2(ppm) - numpy.log2(0.25)

    # NOTE: returned transposed (rows = positions, columns = bases),
    # which is the orientation score_indexed_seq() indexes below
    return pssm.transpose()


In [12]:
# This part worth 0.1

"""Q2 correct function name"""
check.allowed_modules(allowed=["numpy"])
check.expected_variables_exist(["ppm_to_pssm"], locals(), callables=["ppm_to_pssm"])

In [13]:
# This part worth 0.1

# A test!
"""Q2 ppm_to_pssm does not fail"""

'Q2 ppm_to_pssm does not fail'

In [14]:
# This part worth 0.1

# A test!
"""Q2 ppm_to_pssm produces correct data type"""

'Q2 ppm_to_pssm produces correct data type'

In [15]:
# This part worth 0.4

# A test!
"""Q2 ppm_to_pssm produces correct values"""

'Q2 ppm_to_pssm produces correct values'

## Q3 - convert a sequence to indices suitable for scoring

In a real analysis, our data are strings, but our PSSMs are arrays. For computational efficiency, we need to convert our sequences into a data structure that simplifies applying a PSSM.

You will complete the `seq_to_indices()` function, which converts a sequence into a list (or array) of integers.

The function:

- takes a single argument, a DNA sequence as a string
- assumes the base order is alphabetical
- returns the indices as a list

If given the sequence `"ACAAGT"` it should return the list `[0, 1, 0, 0, 2, 3]` (or the array equivalent).
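
One reason integer indices are convenient is that numpy can then look up all the relevant cells in a single indexing operation. The sketch below uses a made-up 4 x 3 matrix in the Background orientation (rows = bases, columns = positions); `toy_matrix` is hypothetical, and if a matrix is stored transposed the two index arrays simply swap places.

```python
import numpy

# Toy matrix: rows = A, C, G, T; columns = positions 0..2 (values are arbitrary).
toy_matrix = numpy.arange(12).reshape(4, 3)

indices = [0, 1, 2]                     # the sequence "ACG" as base indices
positions = numpy.arange(len(indices))  # 0, 1, 2

# Pick the cell for each (base, position) pair, then add them up.
picked = toy_matrix[indices, positions]
total = picked.sum()

print(picked)  # [0 4 8]
print(total)   # 12
```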

In [16]:
def seq_to_indices(seq: str) -> list:
    base = ['A', 'C', 'G', 'T']
    L = []
    for s in seq:
        L.append(base.index(s))
    
    return L


In [17]:
# This part worth 0.1

"""Q3 correct function name and is callable"""
check.allowed_modules(allowed=["numpy"])
check.expected_variables_exist(
    ["seq_to_indices"], locals(), callables=["seq_to_indices"]
)

In [18]:
# This part worth 0.8

# A test!
"""Q3 seq_to_indices works"""

'Q3 seq_to_indices works'

## Q4 - score a sequence

- write a function that takes a PSSM (as a numpy array), AND a sequence converted into indices and returns a score for every possible position
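
"Every possible position" can be read as every start position at which the whole motif still fits inside the sequence. A small sketch of that enumeration, assuming a hypothetical motif length of 4:

```python
seq = "ACAAGT"
motif_length = 4  # hypothetical motif width, for illustration only

# A motif of width w fits at len(seq) - w + 1 start positions.
n_windows = len(seq) - motif_length + 1
windows = [seq[start:start + motif_length] for start in range(n_windows)]

print(n_windows)  # 3
print(windows)    # ['ACAA', 'CAAG', 'AAGT']
```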

In [19]:
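def score_indexed_seq(pssm, indexed_seq):
    # pssm is oriented as returned by ppm_to_pssm(): rows = positions, columns = bases
    motif_length = pssm.shape[0]
    scores = []

    # One score per window start; a sequence shorter than the motif gives no scores
    for start in range(len(indexed_seq) - motif_length + 1):
        window = indexed_seq[start:start + motif_length]
        scores.append(sum(pssm[i, base] for i, base in enumerate(window)))

    return scores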

In [20]:
# This part worth 0.1

"""Q4 correct function name and is callable"""
check.allowed_modules(allowed=["numpy"])
check.expected_variables_exist(
    ["score_indexed_seq"], locals(), callables=["score_indexed_seq"]
)

In [21]:
# This part worth 2.4

# A test!
"""Q4 correct function name and returns correct value"""

'Q4 correct function name and returns correct value'

## Q5 - calculate a score

---------------

**If you don't think you can write the Python code** to do this question, you can do the computation by hand and assign the values you got to the indicated variable names.

**If you do think you can write the Python code** then do that!

---------------

Consider the following made-up PWM:

```
A  [ 61  16 ]
C  [145  46 ]
G  [152  18 ]
T  [ 31 309 ]
```

Construct the pssm as a numpy array to at least 2 decimal places (only if doing by hand) and assign it to the variable `my_pssm` in the cell below.
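
For anyone doing this by hand, the calculation for a single cell looks like the sketch below. It covers `A` at the first position only, uses no pseudocount, and assumes the uniform background of 0.25 from Q2.

```python
import numpy

# Counts for A, C, G, T at the first position of the made-up PWM.
first_column = numpy.array([61, 145, 152, 31], dtype=float)

prob_a = first_column[0] / first_column.sum()  # 61 / 389
score_a = numpy.log2(prob_a / 0.25)            # log-odds against background

print(round(float(prob_a), 2), round(float(score_a), 2))  # 0.16 -0.67
```

The remaining seven cells follow the same two steps.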

In [22]:
import numpy
pwm = [
    [ 61,  16 ],
    [145,  46 ],
    [152,  18 ],
    [ 31, 309 ]
    ]
output = pwm_to_ppm(pwm)
my_pssm = ppm_to_pssm(output)


In [23]:
# This part worth 0.3

# A test!
"""Q5 my_pssm rounded to 2 places is right"""


'Q5 my_pssm rounded to 2 places is right'

Apply that PSSM to the following two sequences, calculating the score for each possible position. Assign the scores for:

- `seq1` to `seq1_scores`
- `seq2` to `seq2_scores`

In [24]:
seq1 = "AGG"
seq2 = "TAA"

In [25]:
seq1_scores = []
seq2_scores = []
seq1_scores = score_indexed_seq(my_pssm, seq_to_indices(seq1))
seq2_scores = score_indexed_seq(my_pssm, seq_to_indices(seq2))

In [26]:
# This part worth 0.5

# A test!
"""Q5 my_pssm scores approx correct"""


'Q5 my_pssm scores approx correct'

## Q6 - find matching positions from a collection of sequences

- complete `findall_matching_positions()`

It takes:
- a 2D numpy array of *log-odds* scores
- a float value of a cutoff

It returns a numpy array whose values are, for each position, the number of sequences that had a score >= the cutoff
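
The counting step can be sketched with a boolean comparison followed by a sum. The layout is an assumption (rows are sequences, columns are positions), and the `toy_scores` values are made up.

```python
import numpy

# Assumption: each row is one sequence, each column is one position.
toy_scores = numpy.array([
    [ 1.2, -0.5,  3.0],
    [ 0.4,  2.1,  2.5],
    [-1.0,  2.0,  0.1],
])
cutoff = 2.0

# True where a score meets the cutoff; summing down the rows counts sequences.
counts = (toy_scores >= cutoff).sum(axis=0)

print(counts)  # [0 2 2]
```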

In [27]:
def findall_matching_positions(scores, cutoff):
    # Assumes rows are sequences and columns are positions:
    # count, per position, the sequences whose score is >= cutoff
    scores = numpy.asarray(scores)
    return (scores >= cutoff).sum(axis=0)

In [28]:
# This part worth 0.1

"""Q6 func present"""
check.allowed_modules(allowed=["numpy"])
check.expected_variables_exist(
    ["findall_matching_positions"], locals(), callables=["findall_matching_positions"]
)

In [29]:
# This part worth 2

# A test!
"""Q6 gets seqs right"""

'Q6 gets seqs right'

## Q6 -- extension question

*Worth 1%*

- modify your functions for sequence indexing and scoring so they handle the case where a sequence has non-canonical characters, e.g. "GCNTTATA"
- your scoring function should return a score for that sequence that is the sum across all positions except those with these invalid characters
- provide one or two small sequences that can be scored using your PSSM as a demonstration

**ANSWER CODE MUST FUNCTION CORRECTLY** -- no partial marks.
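
A first step towards this is locating the characters that fall outside the canonical alphabet. A minimal sketch, assuming only `A`, `C`, `G` and `T` count as valid:

```python
seq = "GCNTTATA"

# Positions whose character is not one of the four canonical bases.
invalid_positions = [i for i, ch in enumerate(seq) if ch not in "ACGT"]

print(invalid_positions)  # [2]
```

Those positions can then be left out of the sum when scoring.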

In [33]:
def seq_to_indices_upgrade(seq):
    # Non-canonical characters (e.g. "N") map to -1 so the scorer can skip them
    base = ['A', 'C', 'G', 'T']
    return [base.index(s) if s in base else -1 for s in seq]

def score_indexed_seq_upgrade(pssm, indexed_seq):
    # pssm is oriented as returned by ppm_to_pssm(): rows = positions, columns = bases
    motif_length = pssm.shape[0]
    scores = []

    for start in range(len(indexed_seq) - motif_length + 1):
        window = indexed_seq[start:start + motif_length]
        # Sum across motif positions, leaving out invalid characters (index -1)
        scores.append(sum(pssm[i, b] for i, b in enumerate(window) if b >= 0))

    return scores

#test data
demo_seqs = ["GCNT", "NCNT"]

pssm_test = numpy.array([
[-0.66851029, 0.57385055, 0.64164064, -1.63374487],
[-2.56663068, -1.07186599, -2.40157143, 1.6627708 ],
[ 1.99445472, -7.6110248, -7.6110248, -7.6110248 ],
[-4.80366988, -3.21870737, -5.2890967, 1.93779711]])

# One list of scores per demonstration sequence
final_output = [
    score_indexed_seq_upgrade(pssm_test, seq_to_indices_upgrade(seq))
    for seq in demo_seqs
]


In [34]:
"""Q6 - bonus question"""

'Q6 - bonus question'