In [8]:
%matplotlib inline

import matplotlib
import numpy
import matplotlib.pyplot as pyplot
from math import log

from aminoacid import AminoAcid

# Project 4 - Secondary structure prediction

Proteins have four levels of structure:
* The **primary** structure is the sequence of amino acids that makes up the protein.
* The **secondary** structure refers to particular shapes that sub-sequences of the protein tend to form, due to *hydrogen bonds*. The most common among these are *alpha helices* and *beta sheets*.
* The **tertiary** structure is how the whole protein is "folded" (i.e. its 3D structure). The folding is driven largely by hydrophobic interactions, and the resulting shape is stabilized by other interactions.
* The **quaternary** structure is particular to *multimers*, proteins that are made up of multiple subunits. It describes how these subunits are arranged together.

The goal of this project is to predict the secondary structure of a protein from its primary structure. This is useful in the context of multiple sequence alignment, since proteins that perform the same function are likely to have similar secondary structures as well as related primary structures. Fortunately, secondary structures can be observed experimentally with several techniques, which lets us train and verify our prediction system on real-world data. Furthermore, the folding of a protein into its stable secondary and tertiary structures is largely determined by its amino acid sequence, which is why a prediction based on the primary structure alone is possible.

## DSSP definition

DSSP stands for *Define Secondary Structure of Proteins* and is a standard for translating the atomic 3D arrangement of a protein into secondary structures. DSSP distinguishes eight types of secondary structure and assigns one to each amino acid of a protein by examining its spatial coordinates. We won't implement DSSP itself, but we need to parse `.dssp` files to extract the secondary structure information used to train and verify our prediction system. Here is a class that parses such files:

In [9]:
class DSSP:
	def __init__(self, filePath):
		
		#Columns of interest (their start/end indices are located while parsing)
		self.columns = ("RESIDUE", "AA", "STRUCTURE")
		self.residues = []
		
		#Metadata
		self.identifier = ""
		self.protein = ""
		self.organism = ""
		
		#Parsing
		with open(filePath, 'r') as dsspFile:
			columnIndex = {col : (0, 0) for col in self.columns}
			lineIsData = False
			
			for line in dsspFile.readlines():
				
				if lineIsData:
					data = []
					for column in self.columns:
						start, end = columnIndex[column]
						data.append(line[start:end])
					
					self.residues.append(data)
					
				else:
					#startswith avoids an IndexError on blank lines
					if line.strip().startswith("#"):
						lineIsData = True
						for column in self.columns:
							startIndex = line.find(column)
							endIndex = startIndex + len(column)
							endIndex = endIndex + (len(line[endIndex:]) - len(line[endIndex:].lstrip())) - 1
							columnIndex[column] = (startIndex, endIndex)
					elif line.startswith("HEADER"):
						self.identifier = line.split()[-2]
						
					elif line.startswith("COMPND"):
						self.protein = line.split(":")[1].split(";")[0].strip()
						self.protein = " ".join(self.protein.split())
						
					elif line.startswith("SOURCE"):
						self.organism = line.split(":")[1].split(";")[0].strip()
						self.organism = " ".join(self.organism.split())
						
	
	def __repr__(self):
		res = []
		for values in self.residues:
			res.append(str(values))
		return "\n".join(res)
		
	
	def getSequenceStructure(self, chain):
		structs = {"H":"H","G":"H","I":"H","E":"E","B":"E","T":"T","C":"C","S":"C"," ":"C"}
		sequence = []
		structure = []
		
		for residue in self.residues:
			if residue[0][-1] == chain:
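				#DSSP writes disulfide-bridged cysteines as lowercase letters
				aminoacid = residue[1][0]
				sequence.append("C" if aminoacid.islower() else aminoacid)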
				structure.append(structs[residue[2][0]])
		
		return "".join(sequence), "".join(structure)

Note that we won't use all eight structures; instead, we group them into four classes:
* **Helix** (H) groups 3-, 4- and 5-turn helices
* **Sheet** (E) groups parallel/antiparallel $\beta$-sheets and isolated $\beta$-bridges
* **Turn** (T) is the hydrogen-bonded turn
* **Coil** (C) groups coils (no structure) and bends
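
As a small illustration of this grouping, the sketch below reduces an eight-state DSSP string to the four classes. The name `reduce_states` is hypothetical and not part of the project code; the mapping mirrors the `structs` dictionary used in `getSequenceStructure`.

```python
# Hypothetical helper (not in the project code): collapse DSSP's eight states
# into the four classes used in this project.
EIGHT_TO_FOUR = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # sheets and isolated bridges
    "T": "T",                       # hydrogen-bonded turn
    "S": "C", "C": "C", " ": "C",   # bend and coil (DSSP writes coil as a blank)
}

def reduce_states(dssp_states):
    """Return the four-class string for an eight-state DSSP string."""
    return "".join(EIGHT_TO_FOUR[s] for s in dssp_states)

print(reduce_states("GGGHHHH  EEBTTS"))  # prints HHHHHHHCCEEETTC
```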


## GOR prediction

GOR stands for *Garnier-Osguthorpe-Robson* and is a secondary structure prediction method based on information theory. It has had several releases, each improving the prediction accuracy, but we will focus only on GOR III here. This version combines two kinds of information (individual and pair-wise) to issue a prediction, both estimated from known sequence-structure pairs parsed from a *training dataset*; directional information is listed as well because pair-wise information replaces it. In the following formulas, $R_j$ is the residue (amino acid) at index $j$ whose structure is being predicted, $S_j$ is one of the structures, $n-S_j$ represents all of the structures except $S_j$, $f_{c_1,...c_k}$ is the frequency with which conditions $c_1$ through $c_k$ are all met within the training dataset, and $I(\Delta S, ...) = I(S, ...) - I(n-S, ...)$ is the information difference between the predictions concerning $S$ and $n-S$.
* **Individual** information concerns only the amino acid at position $j$: $$I(\Delta S_j, R_j) = \log{\left( \frac{f_{S_j,R_j}}{f_{n-S_j,R_j}} \right)} + \log{\left( \frac{f_{n-S_j}}{f_{S_j}} \right)}$$
* **Directional** information was introduced in version 2 and concerns the amino acids surrounding position $j$, from $j-n$ to $j+n$: $$I(\Delta S_j, R_{j+m}) = \log{\left( \frac{f_{S_j,R_{j+m}}}{f_{n-S_j,R_{j+m}}} \right)} + \log{\left( \frac{f_{n-S_j}}{f_{S_j}} \right)}$$
* **Pair-wise** information has replaced directional information since version 3 and concerns the pairs $(R_j, R_{j+m}) \forall m \in [-n, -1] \cup [1, n]$: $$I(\Delta S_j, R_{j+m} | R_j) = \log{\left( \frac{f_{S_j,R_{j+m},R_j}}{f_{n-S_j,R_{j+m},R_j}} \right)} + \log{\left( \frac{f_{n-S_j,R_j}}{f_{S_j,R_j}} \right)}$$

Overall, the formula applied for the GOR III prediction is: $$I(\Delta S_j, R_{j-n} ... R_{j+n}) = I(\Delta S_j, R_j) + \sum_{m=-n, m \neq 0}^{m=n}{I(\Delta S_j, R_{j+m} | R_j)}$$
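
To make the individual term concrete, here is a small worked example. The counts are invented purely for illustration and are not taken from the dataset; raw counts can stand in for frequencies because the normalization cancels in each ratio. The function name `individual_information` is hypothetical.

```python
from math import log

# Toy counts, invented for illustration only (not from the training dataset).
total_residues = 1000      # all residues seen in training
helix_residues = 300       # residues whose structure is H
alanine_in_helix = 60      # alanines observed in H
alanine_elsewhere = 40     # alanines observed in E, T or C

def individual_information(f_s_r, f_ns_r, f_s, f_ns):
    """I(dS; R) = log(f_{S,R} / f_{n-S,R}) + log(f_{n-S} / f_S)."""
    return log(f_s_r / f_ns_r) + log(f_ns / f_s)

info = individual_information(
    alanine_in_helix, alanine_elsewhere,
    helix_residues, total_residues - helix_residues,
)
print(round(info, 3))  # prints 1.253
```

A positive value means that, in this toy dataset, alanine is found in helices more often than the overall share of helices would suggest.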

Here is an implementation of the algorithm, which we can train on known sequences and then use to predict the structure of other sequences:

In [10]:
class GOR3:
	"""
	Implements the GOR III secondary structure prediction algorithm.
	Objects of this class must be trained with known examples of sequences and their structure
	before being able to predict the structures of new sequences.
	"""

	def __init__(self):
		self.structures = "HETC"
		self.neighbourOffset = 8
		
		self.trainings = 0 #Number of trainings (one per AA)
		self.strucCount = {s:0 for s in self.structures}
		self.pairCount = {(s, a):0 for s in self.structures for a in AminoAcid.getAllNames()}
		#Triplet counts are pooled over all neighbour offsets m (the key has no offset)
		self.tripletCount = {}
		for s in self.structures:
			for a in AminoAcid.getAllNames():
				for na in AminoAcid.getAllNames(): #Neighbour AA
					self.tripletCount[(s, a, na)] = 0
		
		
	def train(self, sequence, structure):
		"""
		Trains the system with a known example of a sequence and its structure.
		"""
		self.trainings += len(sequence)
		
		for index in range(len(sequence)):
			curAminoacid = sequence[index]
			curStructure = structure[index]
			self.strucCount[curStructure] += 1
			self.pairCount[(curStructure, curAminoacid)] += 1
			
			for neiAminoacid in self.neighbourValues(sequence, index):
				self.tripletCount[(curStructure, curAminoacid, neiAminoacid)] += 1
	
	
	def predict(self, sequence):
		"""
		Returns the predicted structure of 'sequence', based on received training.
		"""
		structure = [] #Result: predicted structure
		
		#Predict structures for each aminoacid in sequence
		for index in range(len(sequence)):
			curAminoacid = sequence[index]
			
			#First possible structure
			predStructure = self.structures[0]
			predScore = self.__getScore(sequence, index, predStructure)
			
			#Other structures
			for curStructure in self.structures[1:]:
				curScore = self.__getScore(sequence, index, curStructure)
				
				#Remember structure that gives best score
				if curScore > predScore:
					predStructure = curStructure
					predScore = curScore
					
			structure.append(predStructure)
		return "".join(structure)
		
	
	def __getScore(self, sequence, index, struct):
		"""
		Returns I(deltaS, R) as defined by the GOR III algorithm.
		"""
		aminoacid = sequence[index]
		scoreTerms = []
		
		score = self.pairCount[(struct, aminoacid)]
		score /= sum(self.pairCount[(otherStruct, aminoacid)] for otherStruct in self.getStructures(struct))
		scoreTerms.append(score)
		
		score = (self.trainings - self.strucCount[struct]) / self.strucCount[struct]
		scoreTerms.append(score)
		
		for neiAminoacid in self.neighbourValues(sequence, index):
			score = self.tripletCount[(struct, aminoacid, neiAminoacid)]
			score /= sum(self.tripletCount[(otherStruct, aminoacid, neiAminoacid)] for otherStruct in self.getStructures(struct))
			scoreTerms.append(score)
			
			score = sum(self.pairCount[(otherStruct, aminoacid)] for otherStruct in self.getStructures(struct))
			score /= self.pairCount[(struct, aminoacid)]
			scoreTerms.append(score)
			
		return sum(log(s) for s in scoreTerms)
			
			
		
	def neighbourOffsets(self):
		for offset in range(-self.neighbourOffset, self.neighbourOffset+1):
			if offset!=0:
				yield offset
	
	
	def neighbourValues(self, sequence, index):
		for offset in self.neighbourOffsets():
			neiIndex = index + offset
			if neiIndex >= 0 and neiIndex < len(sequence):
				yield sequence[neiIndex]
	
	
	def getStructures(self, exclude=None):
		for s in self.structures:
			if s != exclude:
				yield s

## Results

### Parsing DSSP files

In [11]:
with open(r"dataset/CATH_info.txt") as infoFile:
	with open(r"dataset/CATH_info-PARSED.txt", 'w') as outFile:
		for line in infoFile.readlines():
			d = DSSP(r"dataset/dssp/" + line[0:4] + ".dssp")
			description = "> " + d.identifier + "|" + d.protein + "|" + d.organism
			seq, struct = d.getSequenceStructure(line[4])
			
			outFile.writelines(l + "\n" for l in [description,seq,struct])
		
with open(r"dataset/CATH_info_test.txt") as infoFile:
	with open(r"dataset/CATH_info_test-PARSED.txt", 'w') as outFile:
		for line in infoFile.readlines():
			d = DSSP(r"dataset/dssp_test/" + line[0:4] + ".dssp")
			description = "> " + d.identifier + "|" + d.protein + "|" + d.organism
			seq, struct = d.getSequenceStructure(line[4])
			
			outFile.writelines(l + "\n" for l in [description,seq,struct])

### Training GOR III

In [12]:
gor3Pred = GOR3()

with open(r"dataset/CATH_info-PARSED.txt") as inFile:
	index = 0
	sequence = ""
	for line in inFile.readlines():
		line = line.strip().upper()
		if not (line=="" or line[0]==">"):
			#Line is a sequence
			if index % 2 == 0:
				sequence = line
			#Line is a structure
			else:
				gor3Pred.train(sequence, line)
			index += 1
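
The parsed files store each chain as three lines (description, sequence, structure). The sketch below shows one possible way to iterate over such records as tuples instead of tracking a line index; `readRecords` is a hypothetical helper, and it assumes that every record is complete and that blank lines can be ignored.

```python
def readRecords(lines):
    """
    Hypothetical helper: yield (description, sequence, structure) tuples
    from lines laid out as in the '-PARSED.txt' files.
    """
    description, sequence = None, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: remember its description
            description, sequence = line[1:].strip(), None
        elif sequence is None:
            # First line after the description is the sequence
            sequence = line
        else:
            # Second line after the description is the structure
            yield description, sequence, line
            sequence = None

example = [
    "> 1ABC|EXAMPLE PROTEIN|EXAMPLE ORGANISM\n",   # made-up record
    "MKVLA\n",
    "CHHHC\n",
]
for description, sequence, structure in readRecords(example):
    print(description, sequence, structure)
```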

### Predicting structures

In [13]:
with open(r"dataset/CATH_info_test-PARSED.txt") as inFile:
	index = 0
	sequence = ""
	for line in inFile.readlines():
		line = line.strip().upper()
		if not (line=="" or line[0]==">"):
			#Line is a sequence
			if index % 2 == 0:
				sequence = line
			#Line is a structure
			else:
				structure = line
				prediction = gor3Pred.predict(sequence)
				inter = [":" if s1==s2 else " " for s1,s2 in zip(structure, prediction)]
				
				print("-------- PREDICTION --------")
				chunk = 80
				for start in range(0, len(structure), chunk):
					#Exclusive end of the chunk; avoids repeating a residue in the next chunk
					stop = min(start + chunk, len(structure))
					print(structure[start:stop])
					print("".join(inter[start:stop]))
					print(prediction[start:stop])
					print()
				print("Accuracy:", inter.count(":")/len(inter))
				print()
				
			index += 1

-------- PREDICTION --------
CCCTTTCCTTCCCCHHHHHHHHHHHHHHCTTTEEEEEEEECTTCCEEEEEEECCCCCCCCEEEEEECCCTTCHHHHHHHHH
    :            :::::::::::         : :      ::::::        :::: :   :     ::: ::
ETTETEETEEEEETETTHHHHHHHHHHHHHHHHHHHHETEEEHTETEEEEEETETTTTETEEEETETEHTETEEEHHHTHH

HHHHHHHHHTTTCHHHHHHHHHCEEEEECCCCHHHHHHHHHTCTTCCCCCCCCCCCCCCCCCHHHCCCCCTTCCCCECCTT
:  :      ::       : :  :::   :            :                           :       : 
HEEHTEETTETTTTTECEEHTHHTEEETEECTTTTEEECTTHTTHEETEHTEEETTEEETETETTTETHTETTETETTTTE

TCTTECCCCTTCCHHHHHHHHHHHHHCCEEEEEEEEECCCEEEECCCCCCCCCTTHHHHHHHHHHHHHHHHHHHCCCCEEE
  : :     :                 : :::       ::::         ::: :: ::::::::::::      :: 
ETTEEETHEETEEEEETTEETEETTETTETEEEHHHHHTTEEEETETEEEETETTHEHHTHHHHHHHHHHHHEETETEEET

EEHHHHCCCCCCCHHHHHHHTTCCEEEEEEECCCCCCHHHCCHHHHHHHHHHHHHHHHHHHHHHHHC
 :         :         :  ::::: :  :        :::::  ::::::  :         
TEEEEEEEEEECTEETEEETETETEEEEETETTCTTHTEEETHHHHHCCHHHHHHTEHEEETEEETT

Accuracy: 0.3094462540716
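
The overall accuracy hides how well each class is predicted. The sketch below computes a per-class accuracy, that is, the fraction of residues observed in each class that were predicted correctly. `perClassAccuracy` is a hypothetical helper, and the two short strings are made up for the example; they are not taken from the test set.

```python
def perClassAccuracy(observed, predicted, classes="HETC"):
    """
    Hypothetical helper: for each class, the fraction of residues observed
    in that class whose predicted class matches. Classes that never occur
    in 'observed' are omitted from the result.
    """
    result = {}
    for c in classes:
        total = sum(1 for o in observed if o == c)
        if total:
            correct = sum(1 for o, p in zip(observed, predicted) if o == c and p == c)
            result[c] = correct / total
    return result

observed  = "CCHHHHEEEC"   # made-up observed structure
predicted = "CHHHHTEEHC"   # made-up prediction
print(perClassAccuracy(observed, predicted))
```

In this toy case three of the four helix residues are recovered, while turns do not appear in the observed string and are therefore left out of the result.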