# Practice 5: Motif Discovery with PFM and PSSM

In this exercise, you will learn how to work with sequence motifs using Anderson promoter sequences. The tasks are designed to walk you through the construction of a Position Frequency Matrix (PFM), a Position Probability Matrix (PPM), and finally a Position-Specific Scoring Matrix (PSSM).

You will then scan a real bacterial genome for high-scoring matches.

---

## 🧠 Biological Background

Transcription factors often bind to short, conserved motifs in the DNA. One way to represent these motifs is with **position-specific matrices** derived from multiple aligned sequences.

- **PFM (Position Frequency Matrix):** counts of A/C/G/T at each position.
- **PPM (Position Probability Matrix):** normalized frequencies (sums to 1 per column).
- **PSSM (Position-Specific Scoring Matrix):** log-odds score of each base relative to background.
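
Written out as formulas, the three matrices relate roughly as follows. The notation is assumed for illustration: $n_{k,j}$ is the count of nucleotide $k$ at position $j$, $N$ is the number of aligned sequences, and $b_k$ is the background frequency of nucleotide $k$. The base-2 logarithm is also an assumption; any base works as long as it is used consistently.

$$
\mathrm{PFM}_{k,j} = n_{k,j}, \qquad
\mathrm{PPM}_{k,j} = \frac{n_{k,j}}{N}, \qquad
\mathrm{PSSM}_{k,j} = \log_2 \frac{\mathrm{PPM}_{k,j}}{b_k}
$$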

---

We will use a set of Anderson promoter sequences to build these matrices.


# PSSM Matrices

!pip install biotite biopython

from Bio.Seq import Seq
import numpy as np
import matplotlib.pyplot as plt
import biotite.sequence as seq
import biotite.sequence.align as align
import biotite.sequence.graphics as graphics

# The list of Anderson promoters
seqs = [Seq("ttgacagctagctcagtcctaggtataatgctagc"),
        Seq("ttgacagctagctcagtcctaggtataatgctagc"),
        Seq("tttacagctagctcagtcctaggtattatgctagc"),
        Seq("ttgacagctagctcagtcctaggtactgtgctagc"),
        Seq("ctgatagctagctcagtcctagggattatgctagc"),
        Seq("ttgacagctagctcagtcctaggtattgtgctagc"),
        Seq("tttacggctagctcagtcctaggtactatgctagc"),
        Seq("tttacggctagctcagtcctaggtatagtgctagc"),
        Seq("tttacggctagctcagccctaggtattatgctagc"),
        Seq("ctgacagctagctcagtcctaggtataatgctagc"),
        Seq("tttacagctagctcagtcctagggactgtgctagc"),
        Seq("tttacggctagctcagtcctaggtacaatgctagc"),
        Seq("ttgacggctagctcagtcctaggtatagtgctagc"),
        Seq("ctgatagctagctcagtcctagggattatgctagc"),
        Seq("ctgatggctagctcagtcctagggattatgctagc"),
        Seq("tttatggctagctcagtcctaggtacaatgctagc"),
        Seq("tttatagctagctcagcccttggtacaatgctagc"),
        Seq("ttgacagctagctcagtcctagggactatgctagc"),
        Seq("ttgacagctagctcagtcctagggattgtgctagc"),
        Seq("ttgacggctagctcagtcctaggtattgtgctagc")]



## Task 1: Build a Position Frequency Matrix and PPM

Using the provided Anderson promoter sequences:

1. Build a **Position Frequency Matrix (PFM)**.
   - Rows: A, C, G, T.
   - Columns: one per position in the motif (all sequences should have the same length).

2. Convert the PFM into a **Position Probability Matrix (PPM)** by normalizing each column.

📝 **Hints:**
- You can assume all sequences are the same length.
- Use `numpy` for matrix manipulation.

Example PFM:

|   | 0 | 1 | 2 |
|---|---|---|---|
| A | 2 | 1 | 7 |
| C | 3 | 6 | 0 |
| G | 3 | 0 | 2 |
| T | 2 | 3 | 1 |
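
A matrix like the one above could be built along the following lines. This is only a minimal sketch on made-up toy sequences, not the Anderson set; the names `BASES`, `build_pfm`, and `pfm_to_ppm` are hypothetical helpers introduced here for illustration.

```python
import numpy as np

BASES = "ACGT"  # row order of the matrices (assumed convention)

def build_pfm(sequences):
    """Count each base at each position; rows follow BASES, columns are positions."""
    length = len(sequences[0])
    pfm = np.zeros((len(BASES), length), dtype=int)
    for s in sequences:
        # str() lets this accept plain strings as well as Bio.Seq objects
        for pos, base in enumerate(str(s).upper()):
            pfm[BASES.index(base), pos] += 1
    return pfm

def pfm_to_ppm(pfm):
    """Divide each column by its total so that every column sums to 1."""
    return pfm / pfm.sum(axis=0, keepdims=True)

# Made-up toy sequences, only to show the shape of the result
toy = ["ACGT", "ACGA", "TCGA", "ACTA"]
pfm = build_pfm(toy)
ppm = pfm_to_ppm(pfm)
print(pfm)
print(ppm.sum(axis=0))
```

Calling `build_pfm(seqs)` on the promoter list should work the same way, since each sequence is converted with `str()` and upper-cased first.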


# Your code goes here
#
#

🖼️ **(Optional Bonus)**: Plot the PPM as a sequence logo!

You can use:
```python
graphics.plot_sequence_logo(plt.gca(), profile)
```
Here `profile` is a `biotite.sequence.SequenceProfile`, not a raw array: you'll need to build one from your count matrix (one row per position, one column per symbol).


## Task 2

A *Lactobacillus* genome is provided in the `practice5_files` folder. Let's try to find the 10 top-scoring positions in it according to your PPM.
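
One possible way to scan is to slide a window of motif width along the genome and score each window as the product of the PPM probabilities of its bases. The sketch below uses a made-up toy PPM and a toy "genome" string, because the name of the genome file is not given here; `scan_ppm` is a hypothetical helper, and skipping windows that contain non-ACGT symbols is an assumed design choice.

```python
import numpy as np

BASES = "ACGT"  # row order of the PPM (assumed convention)

def scan_ppm(genome, ppm, top_n=10):
    """Score every window as the product of per-position probabilities."""
    width = ppm.shape[1]
    index = {b: i for i, b in enumerate(BASES)}
    cols = np.arange(width)
    hits = []
    for start in range(len(genome) - width + 1):
        window = genome[start:start + width].upper()
        if any(b not in index for b in window):
            continue  # skip windows with N or other ambiguous symbols
        rows = [index[b] for b in window]
        hits.append((float(ppm[rows, cols].prod()), start))
    # Highest score first; ties keep their genomic order
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:top_n]

# Toy PPM (3 positions) and toy genome, both made up for illustration
toy_ppm = np.array([[0.75, 0.0, 0.0],
                    [0.0,  1.0, 0.0],
                    [0.0,  0.0, 0.75],
                    [0.25, 0.0, 0.25]])
toy_genome = "GGACGTTACTCCG"
top_hits = scan_ppm(toy_genome, toy_ppm, top_n=3)
print(top_hits)
```

Note that a single zero in the PPM forces the whole product to zero, so with an unsmoothed matrix most windows end up with exactly the same score of 0.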

# Your code goes here
#
#

## Task 3

Since the previous step did not give any meaningful results (a single zero probability in the PPM makes the score of the whole window zero), let's add pseudocounts to the model: just add a probability of 0.01 to every cell of the PPM.
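
A minimal sketch of this smoothing step is shown below. The helper name `add_pseudocount` is hypothetical, and the optional renormalization (so that each column sums to 1 again) is an assumption that goes beyond the task statement, which only asks for adding 0.01.

```python
import numpy as np

def add_pseudocount(ppm, pseudo=0.01, renormalize=True):
    """Add a small constant to every cell so that no probability is zero."""
    smoothed = ppm + pseudo
    if renormalize:
        # Assumed extra step: rescale columns so they sum to 1 again
        smoothed = smoothed / smoothed.sum(axis=0, keepdims=True)
    return smoothed

# Made-up toy PPM with several zero cells
toy_ppm = np.array([[0.75, 0.0],
                    [0.0,  1.0],
                    [0.0,  0.0],
                    [0.25, 0.0]])
smoothed = add_pseudocount(toy_ppm)
print(smoothed.min() > 0)
print(smoothed.sum(axis=0))
```

With `renormalize=False` the function does literally what the task describes; the columns then sum to 1.04 rather than 1.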

# Your code goes here
#
#

## Task 4

OK, now we have some top-scoring positions; however, the numbers are neither impressive nor meaningful.

To get meaningful results, let's switch to a log-odds Position Weight Matrix (PWM):
![correct](practice5_files/PPM.png)

Here `b_k` is the frequency of nucleotide `k` in the background model (based on the overall nucleotide frequencies).
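
The log-odds conversion could look roughly like the sketch below. It assumes a base-2 logarithm, a background estimated from the nucleotide composition of the scanned sequence, and a PPM that already contains no zeros; `background_frequencies`, `ppm_to_pwm`, and `score_window` are hypothetical helper names, and the data are toy values.

```python
import numpy as np

BASES = "ACGT"  # row order of the matrices (assumed convention)

def background_frequencies(genome):
    """Estimate b_k as the fraction of each base among A, C, G and T."""
    genome = genome.upper()
    counts = np.array([genome.count(b) for b in BASES], dtype=float)
    return counts / counts.sum()

def ppm_to_pwm(ppm, background):
    """Log2 odds of each cell against the background; the PPM must have no zeros."""
    return np.log2(ppm / background[:, np.newaxis])

def score_window(window, pwm):
    """Log-odds scores are summed over positions, not multiplied."""
    rows = [BASES.index(b) for b in window.upper()]
    return float(pwm[rows, np.arange(pwm.shape[1])].sum())

# Made-up toy PPM (2 positions, no zeros) and toy genome
toy_ppm = np.array([[0.7, 0.1],
                    [0.1, 0.7],
                    [0.1, 0.1],
                    [0.1, 0.1]])
toy_genome = "ACGTACGTAACC"
bg = background_frequencies(toy_genome)
pwm = ppm_to_pwm(toy_ppm, bg)
print(score_window("AC", pwm), score_window("GT", pwm))
```

A positive score means the window looks more like the motif than like the background, and a negative score means the opposite, which makes the numbers easier to interpret than raw probability products.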

# Your code goes here
#
#

## 🧪 Bonus Task: Extend the Motif Scanner

Can you improve your scanner to:
- Report overlapping hits?
- Score reverse complements?
This simulates a more realistic binding site detection tool!
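
As a starting point for the bonus, the sketch below scores every window on both strands. Stepping the window by one position already reports overlapping hits. The helper names and the toy PWM are made up for illustration, and reporting the minus-strand hit at the start coordinate of the forward window is an assumed convention.

```python
import numpy as np

BASES = "ACGT"  # row order of the PWM (assumed convention)
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def scan_both_strands(genome, pwm, top_n=10):
    """Score each window and its reverse complement with a log-odds PWM."""
    width = pwm.shape[1]
    cols = np.arange(width)
    genome = genome.upper()
    hits = []
    for start in range(len(genome) - width + 1):
        window = genome[start:start + width]
        if set(window) - set(BASES):
            continue  # skip windows with ambiguous symbols
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            rows = [BASES.index(b) for b in site]
            hits.append((float(pwm[rows, cols].sum()), start, strand))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:top_n]

# Toy PWM that favours "ACG"; the values are made up
toy_pwm = np.array([[ 1.0, -1.0, -1.0],
                    [-1.0,  1.0, -1.0],
                    [-1.0, -1.0,  1.0],
                    [-1.0, -1.0, -1.0]])
hits = scan_both_strands("TTCGTAACGA", toy_pwm, top_n=2)
print(hits)
```

In the toy genome, "ACG" occurs on the forward strand and "CGT" (its reverse complement) marks a site on the opposite strand, so both are reported with the same score.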
