In [26]:
from aminoacid import AminoAcid
from sequence import Sequence, loadFasta
from align import Align, Aligned
from math import sqrt, log

# Project 3 - PSSM Profiles and alignment

## Domain

A protein **domain** is a part of a protein that has its own structure and function, independent of the rest of the protein. Although a domain plays a similar role in every protein that contains it, its amino acid sequence is not necessarily identical from one protein to the next. The copies do, however, share enough amino acids to be identified and located by means of *sequence alignment*. Our goal in this project is to locate a given domain within a given sequence, provided it occurs one or more times in that sequence. One such domain is the WW domain, which is found in multiple species and sometimes occurs several times in a single protein. In this project we will use WW domain sequences from human proteins to test our implementations: these have been saved to the file `to-be-aligned.fasta`. These domain sequences and many others can be found in the [SMART](http://smart.embl.de) database.

## MSA

*Multiple Sequence Alignment* (MSA) lets us align several sequences together and therefore observe how well amino acids are conserved at each position. This is of great interest in the study of domains, since it reveals which positions matter most when trying to align new sequences to known domains. There are of course other applications, as well as numerous methods for computing an MSA, but we won't go into them since we won't be implementing any in this project. Two online tools were used to align all the sequences from `to-be-aligned.fasta` together: [MUSCLE](http://www.ebi.ac.uk/Tools/msa/muscle/) and [CLUSTAL Omega](http://www.ebi.ac.uk/Tools/msa/clustalo/). The resulting alignments can be found in the files `msaresults-MUSCLE.fasta` and `msaresults-CLUSTALO.fasta`, respectively.
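To make the idea of conservation at a position concrete, here is a minimal sketch that reports, for each column of an alignment, the most common residue and its share among the non-gap residues. The toy alignment and the helper name `column_conservation` are made up for illustration; they are not taken from the FASTA files above, and plain strings stand in for the project's `Sequence` class.

```python
from collections import Counter

def column_conservation(aligned, gap="-"):
    """For each column, return (most common residue, its share among non-gap residues)."""
    length = len(aligned[0])
    assert all(len(s) == length for s in aligned), "aligned sequences must have equal length"
    result = []
    for col in range(length):
        # Collect the residues of this column, ignoring gaps
        residues = [s[col] for s in aligned if s[col] != gap]
        if not residues:
            result.append((gap, 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(residues)))
    return result

# Hypothetical toy alignment (not real WW domain data)
toy_msa = ["LPAGW", "LPPGW", "LP-GW", "MPEGW"]
conservation = column_conservation(toy_msa)
for index, (residue, share) in enumerate(conservation):
    print(index, residue, round(share, 2))
```

Columns where the share is close to 1 are the well-conserved positions; the third column of the toy alignment, where every sequence differs, is not.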

## PSSM

A *profile* is a way to represent multiple aligned sequences and their similarities or patterns. A *Position Specific Scoring Matrix* (PSSM) is a kind of profile stored as a matrix, where each **column** corresponds to one column of the aligned sequences and each **row** corresponds to one amino acid. Each **cell** therefore refers to one alignment column and one amino acid: the value it contains is the frequency of that amino acid within that column. A sequence can be aligned against a PSSM, and the result indicates whether the sequence contains a subsequence similar to the sequences the PSSM represents.
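The row/column layout described above can be sketched with plain strings and a dictionary. This is only an illustration under assumptions: the toy alignment and the function name `frequency_profile` are hypothetical, and the frequencies are divided by the total number of sequences, gapped ones included.

```python
def frequency_profile(aligned, gap="-"):
    """Map each residue (row) to its list of per-column frequencies (columns)."""
    length = len(aligned[0])
    assert all(len(s) == length for s in aligned), "aligned sequences must have equal length"
    alphabet = sorted({c for s in aligned for c in s if c != gap})
    counts = {residue: [0] * length for residue in alphabet}
    for s in aligned:
        for col, residue in enumerate(s):
            if residue != gap:
                counts[residue][col] += 1
    # Turn counts into frequencies relative to the number of sequences
    total = len(aligned)
    return {residue: [n / total for n in row] for residue, row in counts.items()}

# Hypothetical toy alignment (not real WW domain data)
toy_msa = ["LPAGW", "LPPGW", "LP-GW", "MPEGW"]
profile = frequency_profile(toy_msa)
for residue, row in profile.items():
    print(residue, row)
```

Because of the gap in the third sequence, the frequencies of the third column add up to 0.75 rather than 1.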

### Implementation

The eponymous class `PSSM` contains the frequency matrix as well as the metadata that allows us to use it efficiently as a scoring system when aligning sequences with it (more on that later). Here is the class, along with a dictionary of overall amino acid frequencies within the UniProt database (also used to compute scores):

In [27]:
# AA frequencies for complete UniProt database
# from http://web.expasy.org/docs/relnotes/relstat.html, "AMINO ACID COMPOSITION"
uniprob = {
	AminoAcid("Ala") : .0826,
	AminoAcid("Gln") : .0393,
	AminoAcid("Leu") : .0965,
	AminoAcid("Ser") : .0660,
	AminoAcid("Arg") : .0553,
	AminoAcid("Glu") : .0674,
	AminoAcid("Lys") : .0582,
	AminoAcid("Thr") : .0535,
	AminoAcid("Asn") : .0406,
	AminoAcid("Gly") : .0708,
	AminoAcid("Met") : .0241,
	AminoAcid("Trp") : .0109,
	AminoAcid("Asp") : .0546,
	AminoAcid("His") : .0227,
	AminoAcid("Phe") : .0386,
	AminoAcid("Tyr") : .0292,
	AminoAcid("Cys") : .0137,
	AminoAcid("Ile") : .0593,
	AminoAcid("Pro") : .0472,
	AminoAcid("Val") : .0687,
	
}


class PSSM:
	"""
	Position Specific Score Matrix.
	Creates a profile for a series of aligned sequences, and gives a score to each AA substitution in a given column.
	"""
	def __init__(self, description=""):
		self.description=description
		self.seqCount = 0 #total number of sequences
		self.size = None #all sequences have the same size
		self.aaDistribution = None #amino acid distribution
		self.aaCount = None
		self.gapPenalties = None
		
	
	def add(self, sequence):
		#check sequence size
		if self.size is None:
			self.size = len(sequence)
			self.aaDistribution = [{} for i in range(self.size)]
			self.aaCount = [0 for i in range(self.size)]
			self.gapPenalties = [0 for i in range(self.size + 1)]
		
		assert(len(sequence) == self.size)
			
		#update amino acid count for each column
		for index in range(self.size):
			if not sequence[index].isGap():
				self.aaCount[index] += 1
				#count this amino acid, starting from 0 if it has not been seen in this column yet
				column = self.aaDistribution[index]
				column[sequence[index]] = column.get(sequence[index], 0) + 1
		
		#increase sequence count
		self.seqCount += 1
		
	def getDescription(self):
		return self.description
		
	def getScore(self, aminoAcid, columnIndex):
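		#pseudocount weights: alpha for the observed column frequency, beta for the random probability
		alpha = self.aaCount[columnIndex] - 1
		beta = sqrt(self.seqCount)
		alphaplusbeta = alpha + beta

		#random probability of amino acid (0.001 for amino acids missing from uniprob)
		p_aa = uniprob.get(aminoAcid, 0.001)
		
		#observed frequency of amino acid in this column (0 if it was never seen there)
		f_aa = self.aaDistribution[columnIndex].get(aminoAcid, 0) / self.seqCount
			
		#evolutionary probability of amino acid: weighted mix of observed and random frequencies
		q_aa = (alpha * f_aa + beta * p_aa) / alphaplusbeta
		
		#log-odds score
		return log(q_aa / p_aa)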
	
	
	def getGapPenalty(self, columnIndex):
		return self.gapPenalties[columnIndex]
	
	
	def setGapPenalty(self, penalty, columnIndex=None):
		if columnIndex is None:
			#sets the first self.size entries; the extra trailing entry keeps its initial value of 0
			for i in range(self.size):
				self.gapPenalties[i] = penalty
		else:
			self.gapPenalties[columnIndex] = penalty
	
	def __len__(self):
		return self.size
	
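	def __repr__(self):
		#__repr__ must return a string: one line per column, listing "AA: count(score)"
		lines = []
		for i in range(self.size):
			cells = ["{}: {}({})".format(key, count, self.getScore(key, i)) for key, count in self.aaDistribution[i].items()]
			lines.append(", ".join(cells))
		return "\n".join(lines)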

### Scoring

Suppose we want to align a sequence to a PSSM: it is the PSSM that determines the scores for the alignment (based on the frequencies gathered from previously aligned sequences). Sure, PSSM has "Scoring Matrix" in its name, but we haven't talked about scores yet (that is, unless you've read the code above). At this point you might think: "oh, but I know what a scoring matrix is, it gives a score for each pair of amino acids, that's what this is". Well, that was the case when we were aligning two sequences together, since we assumed that the positions of the amino acids didn't matter. The goal now, however, is to align one sequence with a whole set of already aligned sequences (represented by a PSSM) that often don't have the same amino acid at the same position. The frequency and range of amino acids can vary from one column of the PSSM to the next, so a score is looked up by amino acid and column index, rather than by amino acid and "other" amino acid.
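To illustrate why the column index matters, here is a deliberately simplified sketch that slides an ungapped window along a sequence and sums position-specific scores. It is not the algorithm used by the `Align` class: the score table, the `default` score for unlisted amino acids, and the function name `best_ungapped_hit` are all assumptions made up for this example.

```python
def best_ungapped_hit(sequence, column_scores, default=-1.0):
    """Return (start, score) of the best-scoring ungapped window of the profile's width."""
    width = len(column_scores)
    best = None
    for start in range(len(sequence) - width + 1):
        # The score of each residue depends on the profile column it is placed in
        score = sum(column_scores[col].get(sequence[start + col], default)
                    for col in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical 3-column score table: one dictionary of scores per profile column
toy_scores = [{"G": 2.0}, {"W": 4.0}, {"E": 1.0, "D": 0.5}]
hit = best_ungapped_hit("AAGWDPGWE", toy_scores)
print(hit)
```

In this toy table, a `W` earns 4.0 in the second column but only the default score anywhere else, which is exactly the kind of information a pairwise scoring matrix cannot express.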

So how is that score calculated, you ask? Well, exactly the same way we did for the BLOSUM matrices: with the **log-odds ratio** $\log \frac{q_{a,c}}{p_a}$, where $q_{a,c}$ is the evolutionary probability of amino acid $a$ being located at column $c$, and $p_a$ is the random probability of amino acid $a$ (which doesn't depend on the column). What differs from the previous project is the way these terms are calculated:
* The **random probability** is based on the frequencies of amino acids in the whole UniProt database, which can be found [here](http://web.expasy.org/docs/relnotes/relstat.html).
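* The **evolutionary probability** combines the frequency $f_{a,c}$ of amino acid $a$ observed in column $c$ with the random probability $p_a$, using pseudocounts: $q_{a,c} = \frac{\alpha f_{a,c} + \beta p_a}{\alpha + \beta}$. In the `getScore` method above, $\alpha$ is the number of non-gap amino acids in column $c$ minus one, and $\beta$ is the square root of the number of sequences.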
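As a rough numeric illustration of the log-odds score, the sketch below repeats the arithmetic of the `getScore` method with plain numbers instead of `AminoAcid` objects. The function name `log_odds_score` and the example counts (9 sequences, a column with 9 non-gap residues) are assumptions chosen for the example; the Trp frequency is the one listed in `uniprob`.

```python
from math import sqrt, log

def log_odds_score(aa_count, column_count, seq_count, p_aa):
    """Log-odds score of an amino acid in one column, with pseudocounts.

    aa_count     -- occurrences of the amino acid in the column
    column_count -- non-gap amino acids in the column
    seq_count    -- number of sequences in the profile
    p_aa         -- random (background) probability of the amino acid
    """
    alpha = column_count - 1
    beta = sqrt(seq_count)
    f_aa = aa_count / seq_count
    q_aa = (alpha * f_aa + beta * p_aa) / (alpha + beta)
    return log(q_aa / p_aa)

P_TRP = 0.0109  # Trp frequency from the uniprob dictionary

# Trp seen in all 9 sequences of a fully occupied column: strongly positive score
score_conserved = log_odds_score(9, 9, 9, P_TRP)
# Trp never seen in that column: negative score
score_unseen = log_odds_score(0, 9, 9, P_TRP)
print(score_conserved, score_unseen)
```

When the amino acid was never observed in the column, $f_{a,c} = 0$ and the score reduces to $\log \frac{\beta}{\alpha + \beta}$, which no longer depends on $p_a$; the $\beta$ term is what keeps the logarithm defined in that case.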

Protein [D6C652](http://www.uniprot.org/uniprot/D6C652) (Transcriptional coactivator YAP1-A)
![title](D6C652-Domains.PNG)

Protein [P46935](http://www.uniprot.org/uniprot/P46935) (E3 ubiquitin-protein ligase NEDD4)
![title](P46935-Domains.PNG)

In [28]:
for aligned in (r"msaresults-MUSCLE.fasta", r"msaresults-OMEGA.fasta"):
	print("\n >>> Creating PSSM from file {} ...".format(aligned), end="")
	pssm = PSSM("WW domain")
	for seq in loadFasta(aligned):
		pssm.add(seq)
	pssm.setGapPenalty(4)
	print(" done\n\n")
	
	
	al = Align(pssm)
	for toalign in loadFasta(r"test.fasta"):
		#use a distinct name so the outer loop variable "aligned" is not overwritten
		for result in al.multiAlign(toalign):
			print(result)


 >>> Creating PSSM from file msaresults-MUSCLE.fasta ... done


---------- Multi-Seq. Alignment ----------
Size       : 59
Type       : local
Score      : 25.92
Gaps       : 28

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	28 Gaps, 31 AAs (positions 142 to 173)

142
-LPPGWEMAKT-PS-GQR-YFLN------------------------HIDQTTTWQDPR


---------- Multi-Seq. Alignment ----------
Size       : 60
Type       : local-suboptimal(1)
Score      : 23.73
Gaps       : 28

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	28 Gaps, 32 AAs (positions 200 to 232)

200
-LPDGWEQALTPEGEA---YFIN------------------------HKNKSTSWLDPRL


---------- Multi-Seq. Alignment ----------
Size       : 5
Type       : local-suboptimal(2)
Score      : 11.06
Gaps       : 1

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopu