# INFO-F208: project 3, PSSM profiles
The goal of this project is to make a position-specific scoring matrix (*PSSM*) from aligned sequences of amino acids of the WW domain. The sequences that will be used come from human proteins only.

A PSSM is a compact representation of multiple aligned sequences and is commonly used to represent protein domains, such as SH3, PDZ, or WW in our case. A PSSM can be used to align an unknown sequence against a whole domain in a single pass of the alignment algorithm, in order to check whether the sequence belongs to the domain.
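
As a rough illustration of that single pass, the sketch below slides a query over a PSSM without gaps and reports the best-scoring offset. The function name `best_window`, the toy matrix and the gapless simplification are assumptions made for this example; they are not part of the project code.

```python
def best_window(pssm, query):
    """Return (offset, score) of the best gapless placement of the PSSM on query.

    `pssm` is a list of dicts, one per column, mapping amino acid -> score.
    Returns None when the query is shorter than the PSSM.
    """
    width = len(pssm)
    best = None
    for offset in range(len(query) - width + 1):
        window = query[offset:offset + width]
        # Sum the position-specific score of each residue in the window;
        # residues missing from a column count as 0 in this toy example.
        score = sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))
        if best is None or score > best[1]:
            best = (offset, score)
    return best


# Toy 3-column PSSM with made-up scores (not taken from the WW domain).
toy_pssm = [
    {"W": 6.0, "A": -1.0},
    {"P": 4.0, "A": -1.0},
    {"Y": 5.0, "A": -1.0},
]
print(best_window(toy_pssm, "AAWPYAA"))
```

A real profile alignment would also handle gaps; the point here is only that each residue is scored against a column rather than against every aligned sequence.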

## Data
As explained in the statement, I retrieved the sequences of the WW domain from the SMART database as a `.fasta` file and aligned them with two online aligners, MUSCLE and CLUSTAL Omega. The alignments have been saved into `msaresults-MUSCLE.fasta` and `msaresults-CLUSTAL.fasta`.

In order to calculate the PSSM, one needs $p(a)$, i.e. the probability of the amino acid $a$ in the random pattern. This information has been gathered from the Swiss-Prot database statistics page, and is encoded in the global dictionary `p` in the following Python code.
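
Since $p(a)$ ends up in a denominator, it can be worth checking the table before using it. The helper below is a hypothetical sanity check (the name `check_background` and the tolerance are assumptions), shown on a uniform toy distribution rather than on the Swiss-Prot values.

```python
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def check_background(p, tolerance=0.01):
    """Return True if p covers the 20 amino acids, is positive, and sums to about 1."""
    covers_all = set(p) == set(AMINO_ACIDS)
    all_positive = all(value > 0 for value in p.values())
    sums_to_one = abs(sum(p.values()) - 1.0) <= tolerance
    return covers_all and all_positive and sums_to_one


# Toy background: every amino acid equally likely.
uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}
print(check_background(uniform))
```

The tolerance allows for rounding in published frequencies, which rarely sum to exactly 1.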

## Choices in the implementation
A few decisions have to be made in this project regarding the details of the implementation.
### Pseudo-counts
The goal of pseudo-counts is to avoid errors when taking logarithms while computing the PSSM: when the frequency $f_{u,a}$ is equal to zero in the formula
$$m_{u,a} = \log\frac{f_{u,a}}{p(a)}$$
the result is undefined ($\log(0)\notin \mathbb{R}$).  
The solution is to use pseudo-counts and change the formula to
$$m_{u,a} = \log\frac{\alpha f_{u,a} + \beta p(a)}{(\alpha + \beta)p(a)}$$
so that, with positive $\alpha$ and $\beta$, the fraction is never zero.  
I chose $\alpha$ as the number of sequences without a gap at position $u$. Consequently, $\alpha$ is a function of $u$, so a value of $\alpha$ must be computed for each column.  
I also chose $\beta = \sqrt{N_{seq}}$, where $N_{seq}$ is the total number of sequences in the alignment.
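
To make the effect of the pseudo-counts concrete, here is a minimal sketch of the score for a single cell. The function name `pssm_score` and the example column are hypothetical; the frequency is taken as the count divided by the total number of sequences, which is how the notebook code computes it.

```python
from math import log, sqrt


def pssm_score(count, n_no_gap, n_seq, p_a, base=2):
    """Score m_{u,a} with pseudo-counts.

    count     number of times amino acid a occurs in column u
    n_no_gap  alpha: number of sequences without a gap in column u
    n_seq     total number of sequences (beta = sqrt(n_seq))
    p_a       background probability p(a)
    """
    alpha = n_no_gap
    beta = sqrt(n_seq)
    # Assumption: f_{u,a} = count / total number of sequences.
    f = count / n_seq
    return log((alpha * f + beta * p_a) / ((alpha + beta) * p_a), base)


# Hypothetical column: 100 sequences, no gaps, amino acid never observed.
print(pssm_score(0, 100, 100, 0.05))
# Same column, amino acid observed in every sequence.
print(pssm_score(100, 100, 100, 0.05))
```

With a zero count the score stays finite and negative (here $\log_2(\beta / (\alpha + \beta))$), instead of raising a math domain error.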
### The logarithm base
The base of the logarithm is not explicitly given in the slides, so I assumed base 2. Nonetheless, this can easily be changed by setting the global variable `logBase` at the beginning of the Python code.
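
Changing the base only rescales every score by the same constant, so the ranking of amino acids within a column is unaffected. The sketch below illustrates this with a hypothetical helper `convert_base`; it is not part of the project code.

```python
from math import e, isclose, log


def convert_base(score, from_base, to_base):
    """Convert a log-odds score from one logarithm base to another."""
    # log_b(x) = log_a(x) * log_b(a)
    return score * log(from_base, to_base)


odds_ratio = 8.0
bits = log(odds_ratio, 2)          # score in base 2
nats = convert_base(bits, 2, e)    # same score in base e
print(isclose(nats, log(odds_ratio)))
```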

### Output of the code
The code below outputs the PSSM profile of the MUSCLE alignment, but one can get the CLUSTAL PSSM as well by commenting and uncommenting the appropriate lines in the `main` function.  
The output is wide, so it is not very convenient to read in Jupyter. The solution is to copy-paste the output into a text editor, so that the PSSM can be easily inspected.
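
An alternative to copy-pasting would be to print the matrix in blocks of a few columns. The sketch below shows one way to do this on a list of per-column dictionaries; the helper `format_in_blocks`, its parameters and the toy matrix are assumptions for illustration and are not used by the code in this notebook.

```python
def format_in_blocks(matrix, amino_acids, block=10, cell_width=6):
    """Format a PSSM (list of dicts, one per column) in blocks of columns."""
    lines = []
    for start in range(0, len(matrix), block):
        columns = range(start, min(start + block, len(matrix)))
        # Header row with the column indices of this block.
        header = "  | " + " | ".join(str(i).rjust(cell_width) for i in columns)
        lines.append(header)
        for aa in amino_acids:
            cells = (f"{matrix[i][aa]:.2f}".rjust(cell_width) for i in columns)
            lines.append(aa + " | " + " | ".join(cells))
        lines.append("")  # blank line between blocks
    return "\n".join(lines)


# Toy matrix with 25 columns and two amino acids (made-up scores).
toy = [{"A": 0.5 * i, "W": -0.25 * i} for i in range(25)]
print(format_in_blocks(toy, "AW", block=10))
```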

## Code

from math import log
from copy import deepcopy

gap = "-"
aminoAcids = "ARNDCEQGHILKMFPSTWYV"

# The following probabilities have been gathered from the Swiss-Prot database
# statistics page
p={ "E" : 0.0674, "H" : 0.0227, "S" : 0.066,  "K" : 0.0582, "Y" : 0.0292,
    "T" : 0.0535, "W" : 0.0109, "L" : 0.0965, "R" : 0.0553, "M" : 0.0241,
    "Q" : 0.0393, "A" : 0.0826, "V" : 0.0687, "F" : 0.0386, "C" : 0.0137,
    "N" : 0.0406, "P" : 0.0472, "I" : 0.0593, "D" : 0.0546, "G" : 0.0708}

logBase = 2

class Pssm:
    def __init__(self, filename):
        self.load(filename)
    
    def load(self, filename):
        sequences = []
        
        with open(filename) as file:
            for line in file:
                line = line.strip()
                # Skip blank lines and FASTA header lines
                if line and not line.startswith(">"):
                    sequences.append(line.upper())
        
        self.numberColumns = len(sequences[0])
        matrix = [{aa : 0 for aa in aminoAcids} for _ in range(self.numberColumns)]
        beta = len(sequences) ** 0.5
        self.matrix = deepcopy(matrix)
        
        for sequence in sequences:
            for i, aa in enumerate(sequence):
                if aa in aminoAcids:
                    matrix[i][aa] += 1
                    
        
        for i, column in enumerate(matrix):
            alpha = sum(column.values())
            for aa, count in column.items():
                frequency = count / len(sequences)
                probability = (alpha * frequency + beta * p[aa]) / (alpha + beta)
                self.matrix[i][aa] = log(probability / p[aa], logBase)
    
    def __getitem__(self, key):
        # pssm[column, aa] passes a single (column, aa) tuple as the key
        column, aa = key
        return self.matrix[column][aa]
    
    def __str__(self):
        res = "  | "
        cellWidth = 5
        for i in range(self.numberColumns):
            res += str(i).rjust(cellWidth) + " | "
        res += "\n  " + "+-------" * self.numberColumns + "\n"
        for aa in aminoAcids:
            res += aa + " | "
            for column in self.matrix:
                res += str(column[aa])[:cellWidth].rjust(cellWidth) + " | "
            res += "\n"
        res += "\n"
        return res

def main():
    muscle = "msaresults-MUSCLE.fasta"
    clustal = "msaresults-CLUSTAL.fasta"
    pssm = Pssm(muscle)
    #pssm = Pssm(clustal)
    print(pssm)
    
if __name__ == "__main__":
    main()


Output (truncated here because of its width): a table with one column per alignment position, indexed 0 to 59, and one row per amino acid. The first row begins `A | -1.56 | -1.38 | -2.33 | …`.

## Results analysis
The constructed PSSMs can be found in `MUSCLE-PSSM.txt` and `CLUSTAL-PSSM.txt`; they are not shown here because of their excessive width. We will compare the PSSM files with the weblogos constructed from the alignment fasta files.

Of course, these comparisons are not formal and do not prove that this project works for all domain alignments. The aim is only to show that it works for this case, which is sufficient because this project targets only the WW domain.

### MUSCLE Weblogo
![MUSCLE-logo](MUSCLE-logo.png)
Comparing the largest letters on the weblogo with the highest values in `MUSCLE-PSSM.txt` shows that they match. For example:
* A big W on the weblogo at index 5, and a value of 6.407 for W in the PSSM at this index
* D slightly bigger than T at index 10, and 3.013 for D and 2.333 for T in the PSSM at this index
* A big P on the weblogo at index 57, and a value of 4.290 for P in the PSSM at this index
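
The manual search for high values could also be done programmatically. The sketch below lists, for a matrix stored as one dictionary per column, the best-scoring amino acid of each column when it reaches a threshold. The function `top_residues`, the threshold and the toy scores are assumptions for illustration.

```python
def top_residues(matrix, threshold):
    """Return (column, amino acid, score) for each column whose best score reaches threshold."""
    hits = []
    for index, column in enumerate(matrix):
        # Best-scoring amino acid of this column.
        aa, score = max(column.items(), key=lambda item: item[1])
        if score >= threshold:
            hits.append((index, aa, score))
    return hits


# Toy 3-column matrix with made-up scores.
toy = [
    {"W": 5.0, "A": -1.5},
    {"W": -2.0, "A": 0.5},
    {"P": 4.0, "A": -1.0},
]
print(top_residues(toy, threshold=3.0))
```
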
### CLUSTAL Weblogo
![CLUSTAL-logo](CLUSTAL-logo.png)
Comparing the largest letters on the weblogo with the highest values in `CLUSTAL-PSSM.txt` shows that they match. For example:
* A big W on the weblogo at index 5, and a value of 6.407 for W in the PSSM at this index
* A big G on the weblogo at index 14 and a value of 3.455 for G in the PSSM at this index
* A big W on the weblogo at index 53 and a value of 6.127 for W in the PSSM at this index
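### Conserved positions in the Weblogos
There are a few amino acids that are very well conserved in all sequences of the domain. Where two indices are given, the first is the index in the CLUSTAL alignment and the second is the index in the MUSCLE alignment: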
* Amino acid P at index 2
* Amino acid W at index 5
* Amino acid Y at index 17/18
* Amino acid Y at index 18/19
* Amino acid H at index 21/47
* Amino acid W at index 53/54
* Amino acid P at index 56/57

Note that the most prominent amino acids can be found in both alignments, but their exact indices differ slightly because the two aligners place gaps differently. The most noticeable difference is for an amino acid H that is before a big gap in the CLUSTAL alignment, and right after the gap in the MUSCLE alignment.
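
The index shift can be reproduced on a toy case. In the sketch below, the same made-up sequence is gapped in two different ways, and a hypothetical helper `column_of_residue` reports the alignment column of a given residue in each; neither the sequences nor the helper come from the project data.

```python
def column_of_residue(aligned, residue_number, gap="-"):
    """Return the alignment column holding the n-th (0-based) residue of a sequence."""
    seen = -1
    for column, symbol in enumerate(aligned):
        if symbol != gap:
            seen += 1
            if seen == residue_number:
                return column
    raise IndexError("sequence has fewer residues than requested")


# Same made-up sequence "WHPY", gapped in two different ways.
alignment_a = "WH----PY"   # H placed before the gap
alignment_b = "W----HPY"   # H placed after the gap
print(column_of_residue(alignment_a, 1))
print(column_of_residue(alignment_b, 1))
```

The residue is the same in both cases; only the column it is assigned to changes, which is why the same conserved H can appear at very different indices.
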
### HMM logo
Here is the HMM logo for the WW family:  

![HMM-logo](HMM-logo.png)

We can see that the conserved positions are the same as in the weblogos, although scaled differently. Gaps are shown in the HMM logo with vertical red lines, and by the last two lines of numbers below the image. The gap information in the HMM logo roughly matches the gaps observed in the weblogos.