## A feed-forward Neural Network for secondary structure prediction
This notebook uses a Neural Network based on code from 
[Stevens and Boucher (2014, Python Programming for Biology CUP)](https://www.amazon.co.uk/Python-Programming-Biology-Bioinformatics-Beyond/dp/0521720095). The aim is to predict the secondary structure of a protein from its sequence. 
A predictive network is trained using the known secondary structure of k-mers of 5 amino acids taken from a set of PDB structures. Three secondary structure states are defined: H, C, and S. H and S are helix and strand respectively, while C is coil, which covers a range of structures without a regular H-bonding pattern. In practice, secondary structure prediction has many uses, for instance in helping to identify functional domains ([Drozdetskiy et al., 2015](https://doi.org/10.1093/nar/gkv332)), and can easily be achieved using the JPRED4 server http://www.compbio.dundee.ac.uk/jpred4

Secondary structure is much more complicated than indicated by the simple classification of this data - full details are available from the analysis of the H-bonding arrangements. The program DSSP (https://swift.cmbi.umcn.nl/gv/dssp/DSSP_3.html) is a well-tested approach to this problem. This produces a description of the secondary structure in a known protein structure. For historical reasons DSSP uses E for Strands. A related DSSR program gives RNA secondary structure.

It is useful to be able to predict the secondary structure of a protein for which there is only sequence available. One approach would be to align it with homologous sequences where the structure is known. The approach here is to use the sequence in the neighbourhood of a residue as a basis for a neural network prediction. 

The network here is a simple three-layer feed-forward one. The number of nodes in the hidden layer can be chosen by the programmer, but the number of input and output nodes is fixed by the sizes of the input and output data vectors.


In [1]:
##Run this cell to import numpy 
from numpy import tanh, ones, append, array, zeros, random, sum

The neural network function takes input data for the first layer of network nodes, applies the first weighted connections to pass the signal to the hidden layer of nodes, then applies the second weights to produce the output.

During training the same function is called with weights that are not yet optimized, so its output is not yet a reliable prediction. After training the function gives the network's predictions, and it takes its name from that.

The weightsIn values define the strength of connection between the input nodes and the hidden nodes. Similarly weightsOut define the strengths of connection between the hidden and the output nodes. 

The weights are given as matrices with the rows indexing the nodes in one layer and the columns indexing the nodes in the other layer.

The signalIn vector is the input features plus an extra value of 1.0. This additional value is called the bias node, which is used to tune the baseline response of the network. The baseline is the level of response without a meaningful signal.

The value of the bias node itself stays fixed at 1.0; it is the weights on its connections that are set during training, adapting to the values in the input data. This means the input data don't need to be pre-prepared with a mean of zero.
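
As a small illustration of the bias node, the sketch below appends 1.0 to an all-zero feature vector and passes it through a single node. The weight values are made up for the example; the point is only that, with no signal in the features, the response is set by the bias weight alone.

```python
from numpy import append, array, tanh

features = array([0.0, 0.0])          # an all-zero input carries no signal
signal_in = append(features, 1.0)     # bias node appended as the last element

# Hypothetical weights for one hidden node: two feature weights plus a bias weight
weights = array([0.8, -0.3, 0.5])

baseline = tanh((signal_in * weights).sum())
print(signal_in)   # [0. 0. 1.]
print(baseline)    # equal to tanh(0.5), the contribution of the bias weight alone
```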

In [2]:
## Run this cell to define the function
def neuralNetPredict(inputVec, weightsIn, weightsOut):
    """ uses the current weights in a neural network
    to make a prediction from an input vector
    all input and output are numpy data structures""" 
    signalIn = append(inputVec, 1.0) # input layer

    prod = signalIn * weightsIn.T
    sums = sum(prod, axis=1)
    signalHid = tanh(sums)    # hidden    layer

    prod = signalHid * weightsOut.T
    sums = sum(prod, axis=1)
    signalOut = tanh(sums)    # output    layer

    return signalIn, signalHid, signalOut
 

The .T attribute gives the transpose of a matrix - that is the matrix with the columns turned into rows and the rows turned into columns. 

Transposing puts the incoming weights of each hidden node into one row, so that multiplying by the input signal applies that signal to all the hidden nodes at once.
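
The following sketch, using small random weights chosen only for illustration, shows what the transpose-and-sum step computes: one weighted sum per hidden node, which is the same result as a vector-matrix product.

```python
from numpy import allclose, append, array, random, sum

random.seed(0)                              # fixed seed so the sketch is repeatable
weights_in = random.random((3, 2)) - 0.5    # (2 features + bias) x 2 hidden nodes
signal_in = append(array([1.0, 0.0]), 1.0)

# Broadcasting: each row of weights_in.T holds the incoming weights of one hidden node
per_node = sum(signal_in * weights_in.T, axis=1)

# The same weighted sums written as a vector-matrix product
as_dot = signal_in.dot(weights_in)

print(per_node.shape)                # (2,)
print(allclose(per_node, as_dot))    # True
```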

The network applies the hyperbolic tangent function (tanh) to get the signal output from all the nodes in a layer. Hyperbolic tan is a sigmoidal function that varies from -1 to 1, so unlike ordinary tan it does not run off to infinity. 

<img src="https://mathworld.wolfram.com/images/interactive/TanhReal.gif" width=300></img> 

The tanh function defines the output of that node given an input or set of inputs. As a nod to the output of neurones, which depends on an activation level across their cell membrane, the output function is called the *Activation* function.

In operation only the signalOut from the output layer is of interest. But during training the response signals from the other layers are also needed to adjust the weighting scheme.

### Training 

The weighting scheme (and gain) will be optimized by using a training dataset. 

The training data will be pairs of an input feature vector and a known output vector. The order of the data will be randomly shuffled to avoid bias. The number of hidden nodes and the number of optimization cycles both need to be specified. 

For each training example, the 'error' between the output signal of the network and the known training set output is used to adjust the network weights; the squared errors are also summed over each cycle to monitor progress. The difference is combined with the *gradient* in the signal values, calculated from the tanh activation function (conveniently, the gradient of tanh is 1-sig<sup>2</sup>, or 1 - {sig x sig}, where sig is the output signal that tanh gives). 
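
The gradient identity can be checked numerically. This sketch compares 1 - sig x sig with a central finite-difference estimate of the slope of tanh at a few sample points; the step size and tolerance are arbitrary choices for the example.

```python
from numpy import allclose, linspace, tanh

x = linspace(-3.0, 3.0, 13)
sig = tanh(x)                       # output signal of a node

analytic = 1.0 - sig * sig          # gradient expressed through the output itself
h = 1e-6
numeric = (tanh(x + h) - tanh(x - h)) / (2 * h)   # central finite difference

print(allclose(analytic, numeric, atol=1e-6))     # True
```

Because the gradient depends only on the output signal, the training function can compute it from sigOut and sigHid without re-evaluating tanh.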

Early in training, large differences can make the network go haywire, so the speed of weight changing is damped down by 'rate' and 'momentum' multipliers (usually the default values of 0.5 and 0.2 are good enough). 

More damping would mean that many more cycles would be needed for the weights to converge. 
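
To make the role of the two multipliers concrete, here is a minimal sketch of the update rule applied to two weights. The sequence of raw changes is invented for the example; in the training function they come from the gradient, the error, and the incoming signal.

```python
from numpy import array, zeros

rate, momentum = 0.5, 0.2            # the default values used for training
weights = zeros(2)
previous_change = zeros(2)

# Hypothetical raw changes for two weights over two successive updates
for change in (array([1.0, -1.0]), array([1.0, 1.0])):
    # current change scaled by rate, plus a fraction of the previous change
    weights += rate * change + momentum * previous_change
    previous_change = change

print(weights)   # [ 1.2 -0.2]
```

The second weight receives opposite changes in the two updates, so the momentum term partly cancels the new change, which is how it smooths out abrupt reversals.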

The training will work back from the value of the error to adjust the weighting scheme of the network. This is called *back propagation*. 

The use of the gradient is crucial as it means initial adjustments will be large but then finer adjustments will be made as the optimum is approached.

In [3]:
## Run this cell to define the function
def neuralNetTrain(trainData, numHid, steps=100, rate=0.5, momentum=0.2):
    """ uses training data to set the weights in a simple
    neural network, number of hidden nodes is specified"""
    numInp = len(trainData[0][0])
    numOut = len(trainData[0][1])
    numInp += 1
    minError = None

    sigInp = ones(numInp)
    sigHid = ones(numHid)
    sigOut = ones(numOut)

    wInp = random.random((numInp, numHid))-0.5
    wOut = random.random((numHid, numOut))-0.5
    bestWeightMatrices = (wInp, wOut)

    cInp = zeros((numInp, numHid))
    cOut = zeros((numHid, numOut))

    for x, (inputs, knownOut) in enumerate(trainData):
        trainData[x] = (array(inputs), array(knownOut))
 
    for step in range(steps):  
        random.shuffle(trainData) # Important to avoid bias
        error = 0.0
 
        for inputs, knownOut in trainData:
            sigIn, sigHid, sigOut = neuralNetPredict(inputs, wInp, wOut)

            diff = knownOut - sigOut
            error += sum(diff * diff)

            gradient = ones(numOut) - (sigOut*sigOut)
            outAdjust = gradient * diff 

            diff = sum(outAdjust * wOut, axis=1)
            gradient = ones(numHid) - (sigHid*sigHid)
            hidAdjust = gradient * diff 

            # update output 
            change = outAdjust * sigHid.reshape(numHid, 1)
            wOut += (rate * change) + (momentum * cOut)
            cOut = change
 
            # update input 
            change = hidAdjust * sigIn.reshape(numInp, 1)
            wInp += (rate * change) + (momentum * cInp)
            cInp = change
 
        if (minError is None) or (error < minError):
            minError = error
            bestWeightMatrices = (wInp.copy(), wOut.copy())
            print("Step: %d Error: %f" % (step, error))
    
    return bestWeightMatrices

### Testing the functions
Simple data to test a network can be binary input vectors with the desired output being an 'exclusive OR' (EOR) response. This responds True if exactly one input is true, but False if both are true together or both are false. 
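
For reference, the exclusive OR truth table can be generated with Python's `^` operator, laid out in the same [inputs, output] form that the training function expects. The name xor_table is used only for this sketch.

```python
# Build the exclusive-OR truth table as [inputs, output] pairs
xor_table = [[[a, b], [a ^ b]] for a in (0, 1) for b in (0, 1)]

for inputs, output in xor_table:
    print(inputs, '->', output)
```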

In [4]:
##Run this cell to define the test data
testEORdata = [[[0,0], [0]],
               [[0,1], [1]], 
               [[1,0], [1]], 
               [[1,1], [0]]]

The network test uses two hidden nodes - in real use several values would be tried to find the best performance.
Run the cell below to see if the training converges.

In [5]:
## Run this cell to train the network
wMatrixIn, wMatrixOut = neuralNetTrain(testEORdata, 2, 1000)

Step: 0 Error: 1.996738
Step: 1 Error: 1.440903
Step: 2 Error: 1.265039
Step: 4 Error: 1.169860
Step: 22 Error: 1.118598
Step: 28 Error: 1.083179
Step: 32 Error: 0.994186
Step: 35 Error: 0.975448
Step: 37 Error: 0.935896
Step: 38 Error: 0.900493
Step: 39 Error: 0.893845
Step: 40 Error: 0.861554
Step: 44 Error: 0.823851
Step: 45 Error: 0.821889
Step: 51 Error: 0.817290
Step: 53 Error: 0.796468
Step: 56 Error: 0.789788
Step: 59 Error: 0.750791
Step: 62 Error: 0.731600
Step: 65 Error: 0.711318
Step: 66 Error: 0.706699
Step: 70 Error: 0.677616
Step: 71 Error: 0.638384
Step: 73 Error: 0.589361
Step: 80 Error: 0.492021
Step: 83 Error: 0.469774
Step: 85 Error: 0.442459
Step: 86 Error: 0.298744
Step: 92 Error: 0.226552
Step: 95 Error: 0.166673
Step: 106 Error: 0.137360
Step: 107 Error: 0.087597
Step: 113 Error: 0.067742
Step: 116 Error: 0.059237
Step: 125 Error: 0.057300
Step: 134 Error: 0.054089
Step: 137 Error: 0.050076
Step: 140 Error: 0.045679
Step: 141 Error: 0.037000
Step: 149 Error: 0.0

Here quite good convergence has occurred. A step is only printed when the error reaches a new minimum, so the gaps in the step numbers are cycles in which the error did not improve on the best value so far. Perhaps you can see that the initial steps give large changes in the error while later on the changes become smaller and smaller. This is owing to the effect of the gradient calculation. The changes in the actual weights are not printed, but will follow the same trend.

The output weight matrices can then be run on test data for evaluation. Test data should be inputs with known output but which were not included in the training set. 

Obviously it is not possible to give any new data for the EOR function as the training set covered all possible responses!

But the trained network should be able to do a reasonable job on the training set.
Run the following cell to compare the output of the network with the actual values of the training set outputs.

In [11]:
##Run this cell to test the network
for inputs, knownOut in testEORdata:
    sIn, sHid, sOut =    neuralNetPredict(array(inputs), wMatrixIn, wMatrixOut)
    print('input', inputs, ' should have output ', knownOut, 'actual output {:.3f}'.format(sOut[0]))

input [0 0]  should have output  [0] actual output -0.000
input [1 1]  should have output  [0] actual output 0.014
input [0 1]  should have output  [1] actual output 0.984
input [1 0]  should have output  [1] actual output 0.984


### Simple feature vectors for sequence data
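A simple numbering scheme is used to convert the sequence alphabet to a numeric form for the input vector. For proteins, each of the 20 one-letter codes is given an index from 0 to 19, and each residue is then represented by a block of 20 values in which only the position at that index is set to 1.0.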

k-mers with k=5 are used. Only the output for the middle residue is required but the network will use the neighbours to predict the secondary structure of the middle one.  

Although static k-mers are used for training, in practice a prediction over a moving 5-mer window could be implemented. 
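
A moving window of this kind might be sketched as below. The function name five_mer_windows is hypothetical; it simply yields each full window together with the index of its central residue, skipping the two residues at either end that have no complete neighbourhood.

```python
def five_mer_windows(sequence, k=5):
    """Yield (centre_index, k-mer) for every full window in the sequence."""
    half = k // 2
    for centre in range(half, len(sequence) - half):
        yield centre, sequence[centre - half:centre + half + 1]

for centre, kmer in five_mer_windows('ADTLLIL'):
    print(centre, kmer)
# 2 ADTLL
# 3 DTLLI
# 4 TLLIL
```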

The possible outputs are coded in the same way for the more restricted alphabet of H, C, and S.

In [12]:
##Run this cell to define the dictionaries for the vectors
aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'
aaIndexDict = {}
for i, aa in enumerate(aminoAcids):
    aaIndexDict[aa] = i

ssIndexDict = {}
ssCodes = 'HCS'
for i, code in enumerate(ssCodes):
    ssIndexDict[code] = i

Here is a very limited set of training data. It shows the raw format which is a 5-mer string and the secondary structure that was observed for the central residue of this in at least one PDB structure. 

The actual structure is a simplified output from the DSSP program mentioned in the introduction. DSSP actually distinguishes more structures than the three here - for example there are other kinds of helix. But these complications are not dealt with.

In [27]:
##Run this cell to define the training set
seqSecStrucData = [('ADTLL','S'),
                   ('DTLLI','S'),
                   ('TLLIL','S'),
                   ('LLILG','S'),
                   ('LILGD','S'),
                   ('ILGDS','S'),
                   ('LGDSL','C'),
                   ('GDSLS','H'),
                   ('DSLSA','H'),
                   ('SLSAG','H'),
                   ('LSAGY','H'),
                   ('SAGYR','C'),
                   ('AGYRM','C'),
                   ('GYRMS','C'),
                   ('YRMSA','C'),
                   ('RMSAS','C')]

The training data has to be converted to the numerical code. Here is a function to do that.

In [14]:
##Run this cell to define the function
def convertSeqToVector(seq, indexDict):
    """converts a one-letter sequence to numerical
    coding for neural network calculations"""   
    numLetters = len(indexDict)
    vector = [0.0] * len(seq) * numLetters

    for pos, letter in enumerate(seq):
        index = pos * numLetters + indexDict[letter]    
        vector[index] = 1.0

    return vector


The training data is prepared with this.
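
To see the layout of the vectors, the sketch below restates the same encoding in a compact helper (called one_hot here only so that the example runs on its own) and applies it to the three-letter secondary structure alphabet, where the result is short enough to read.

```python
def one_hot(seq, index_dict):
    """Same layout as convertSeqToVector: one block of len(index_dict) slots per letter."""
    n = len(index_dict)
    vector = [0.0] * (len(seq) * n)
    for pos, letter in enumerate(seq):
        vector[pos * n + index_dict[letter]] = 1.0
    return vector

toy_index = {'H': 0, 'C': 1, 'S': 2}      # the three secondary structure codes
print(one_hot('S', toy_index))            # [0.0, 0.0, 1.0]
print(one_hot('HC', toy_index))           # [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```

With the 20-letter amino acid alphabet, a 5-mer therefore becomes a vector of 5 x 20 = 100 values containing exactly five 1.0 entries.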

In [15]:
##Run this cell to create the training data
trainingData = []
for seq, ss in seqSecStrucData:
    inputVec = convertSeqToVector(seq, aaIndexDict)
    outputVec = convertSeqToVector(ss, ssIndexDict)
    trainingData.append((inputVec, outputVec))


And then the network is trained. Here there are 3 hidden nodes specified.

In [16]:
wMatrixIn, wMatrixOut = neuralNetTrain(trainingData, 3, 1000)

Step: 0 Error: 19.738710
Step: 1 Error: 16.156241
Step: 2 Error: 12.451463
Step: 3 Error: 8.126981
Step: 6 Error: 7.768595
Step: 7 Error: 6.606724
Step: 8 Error: 1.961691
Step: 11 Error: 1.001516
Step: 13 Error: 0.793545
Step: 15 Error: 0.525485
Step: 16 Error: 0.500134
Step: 17 Error: 0.460263
Step: 18 Error: 0.387424
Step: 25 Error: 0.309206
Step: 26 Error: 0.239182
Step: 27 Error: 0.137848
Step: 31 Error: 0.124080
Step: 33 Error: 0.108376
Step: 48 Error: 0.076003
Step: 53 Error: 0.072243
Step: 54 Error: 0.049964
Step: 55 Error: 0.049776
Step: 59 Error: 0.032444
Step: 103 Error: 0.021625
Step: 134 Error: 0.016265
Step: 135 Error: 0.011543
Step: 164 Error: 0.009952
Step: 168 Error: 0.009446
Step: 221 Error: 0.007421
Step: 224 Error: 0.006656
Step: 225 Error: 0.004523
Step: 299 Error: 0.004444
Step: 311 Error: 0.004356
Step: 315 Error: 0.004092
Step: 317 Error: 0.004017
Step: 320 Error: 0.003200
Step: 349 Error: 0.002690
Step: 377 Error: 0.002686
Step: 378 Error: 0.002642
Step: 379 Err

You will see that this training has converged nicely. The only problem is that it was for a very restricted set of sequence data. 

There are 3 x 20 = 60 theoretical combinations of amino acid residues with secondary structure states. But for particular residues some of these are favoured and others disfavoured. 

For each residue there are 20^4 = 160 000 different contexts that it could possibly occur in, although some of the resulting 5-mers are actually quite rare in structured proteins. 
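
The counts quoted above follow from simple powers of the alphabet size, as this short check shows. The variable names are for illustration only.

```python
n_residues = 20    # amino acid alphabet
n_states = 3       # H, C and S

print(n_residues * n_states)      # 60 residue/state combinations
print(n_residues ** 4)            # 160000 contexts around a fixed central residue
print(n_residues ** 5)            # 3200000 possible 5-mers in total
```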

All the same, it would be good to have a larger training set. It is better to have some rare examples of residue state combinations although, of course, the network will not have to predict them frequently.

One thing to remember is to retain some test examples - where the answer is known but which are not in the training set. 
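
One way to hold back test examples is to shuffle the data and split off a fraction before training. The helper below is a sketch under assumptions: the name split_train_test, the 20% fraction, and the fixed seed are all arbitrary choices, not part of the original code.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=1):
    """Shuffle a copy of the examples and split off a held-out test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

examples = [('ADTLL', 'S'), ('GDSLS', 'H'), ('SAGYR', 'C'),
            ('LGDSL', 'C'), ('DSLSA', 'H')]
train, test = split_train_test(examples)
print(len(train), len(test))   # 4 1
```

Only the training part would then be converted to vectors and passed to the training function, with the test part kept aside for evaluation.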

In its current, poorly-trained state, the network is still able to make reasonable predictions, but only if the test sequence is clearly related to examples that it has seen. testSecStrucSeq here is very similar to examples in the seqSecStrucData training set.

In [17]:
##Run this cell to test for a new 5-mer
testSecStrucSeq = 'DLLSA'
testSecStrucVec = convertSeqToVector(testSecStrucSeq, aaIndexDict)
testSecStrucArray = array( [testSecStrucVec,] )

sIn, sHid, sOut =    neuralNetPredict(testSecStrucArray, wMatrixIn, wMatrixOut)
index = sOut.argmax()
print("Test prediction: %s" % ssCodes[index])

Test prediction: H


### Making a prediction of the secondary structure for PDB 6aam

Let's use the recent PDB structure 6AAM, "Non-receptor tyrosine-protein kinase TYK2" (https://www.ebi.ac.uk/pdbe/entry/pdb/6aam/), as an example.

In [18]:
# 6aam sequence from https://www.ebi.ac.uk/pdbe/entry/pdb/6aam/protein/1
sequence_6aam = ('GPGDPTVFHKRYLKKIRDLGEGHFGKVSLYCYDPTNDGTGEMVAVKALKADAGP'
                 'QHRSGWKQEIDILRTLYHEHIIKYKGCCEDAGAASLQLVMEYVPLGSLRDYLPR'
                 'HSIGLAQLLLFAQQICEGMAYLHAQHYIHRNLAARNVLLDNDRLVKIGDFGLAK'
                 'AVPEGHEYYRVREDGDSPVFWYAPECLKEYKFYYASDVWSFGVTLYELLTHCDS'
                 'SQSPPTKFLELIGLAQGQMTVLRLTELLERGERLPRPDKCPAEVYHLMKNCWET'
                 'EASFRPTFENLIPILKTVHEKYQGQAPS')
print(len(sequence_6aam))

298


In [19]:
def predict_for_sequence(sequence):
    prediction = []
    for ires, residue in enumerate(sequence):
        if ires < 2 or ires > len(sequence) - 3:
            this_prediction = '.'
        else:
            fivemer = sequence[ires-2:ires+3]
            testSecStrucVec = convertSeqToVector(fivemer, aaIndexDict)
            testSecStrucArray = array( [testSecStrucVec,] )
            sIn, sHid, sOut =    neuralNetPredict(testSecStrucArray, wMatrixIn, wMatrixOut)
            index = sOut.argmax()
            this_prediction = ssCodes[index]
        prediction.append(this_prediction)
    return ''.join(prediction)

initial_predict_6aam = predict_for_sequence(sequence_6aam)
print(initial_predict_6aam)
print(len(initial_predict_6aam))


..SCSSSSCCCCSSCCSSSSCSSCSCCHHHCCCSSSCSSCCHCSCCSSSHCHCCSSCHCCCCSSSSSSSSSCCSCSCSCCCCHCHCCCSSSSSCCSCSSSHHHCSSSCCHHSSSSSSSSSCSSSCCCCSSSCCCCCCCSCSCCCSSSSSCSSSSCSSHHCHCCCSSSCCCCCCCSSSCHHSHCCCCSSSHCCCCCCCCHHCHHSCSSSSSSSSCCHHHHHSSSSSSSSSSHCSCSSSSSSSSSSSCCCSCSCSSHHCCSCSCCSCSCCCSCCCCCCSCCSSSSSSSSSCSCSCSCH..
298


In [20]:
# from https://cdn.rcsb.org/etl/kabschSander/ss_dis.txt.gz
dssp_result_for_6aam  = """>6AAM:A:secstr
      B  GGGEEEEEE       EEEEEEE TT     EEEEEEE      TTHHHHHHHHHHHHHH   TTB
  EEEEEEEGGGTEEEEEEE  TT BHHHHGGGS   HHHHHHHHHHHHHHHHHHHHTTEE S  SGGGEEEEET
TEEEE   TT EE                 GGG  HHHHHH    HHHHHHHHHHHHHHHHTTT GGGSHHHHHH
HHH S  TT HHHHHHHHHHTT      TT  HHHHHHHHHHT SSGGGS  HHHHHHHHHHHHHHHH     
"""
dssp_result_for_6aam = dssp_result_for_6aam.splitlines()
dssp_result_for_6aam.pop(0)
dssp_result_for_6aam = ''.join(dssp_result_for_6aam)
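# need to convert DSSP codes to the 3 categories helix (H), strand (S), coil (C).
# Mapping applied by the translation table below:
#   H and I (alpha and pi helices) go to H
#   E (strand) and B (bridge) go to S
#   blank, G (3-10 helix) and T (turn) go to C
# Note: 'C' in the table is not a DSSP code, and DSSP's own 'S' (bend) is
# not in the table, so bends pass through as 'S' and are counted as strands.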
translation = str.maketrans('HCIEB GT', 'HHHSSCCC')
dssp_result_for_6aam = dssp_result_for_6aam.translate(translation)
print(len(dssp_result_for_6aam))
print(dssp_result_for_6aam)

298
CCCCCCSCCCCCSSSSSSCCCCCCCSSSSSSSCCCCCCCCSSSSSSSCCCCCCCCHHHHHHHHHHHHHHCCCCCSCCSSSSSSSCCCCSSSSSSSCCCCCSHHHHCCCSCCCHHHHHHHHHHHHHHHHHHHHCCSSCSCCSCCCSSSSSCCSSSSCCCCCCSSCCCCCCCCCCCCCCCCCCCCCCHHHHHHCCCCHHHHHHHHHHHHHHHHCCCCCCCSHHHHHHHHHCSCCCCCHHHHHHHHHHCCCCCCCCCCCCHHHHHHHHHHCCSSCCCSCCHHHHHHHHHHHHHHHHCCCCC
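
For comparison, a commonly used reduction of DSSP codes treats H, G and I as helix, E and B as strand, and everything else (including the bend code S) as coil. The sketch below implements that convention with a dictionary lookup; the names DSSP_TO_3_STATE and reduce_dssp are hypothetical, and because the mapping differs from the translation table used in the cell above, the resulting string and the accuracy figures would not be the same as those shown in this notebook.

```python
# Conventional reduction: H, G, I -> helix; E, B -> strand; everything else -> coil
DSSP_TO_3_STATE = {'H': 'H', 'G': 'H', 'I': 'H', 'E': 'S', 'B': 'S'}

def reduce_dssp(dssp_string):
    """Map a DSSP string to H/S/C codes; unlisted codes (blank, T, S, ...) become C."""
    return ''.join(DSSP_TO_3_STATE.get(code, 'C') for code in dssp_string)

print(reduce_dssp(' B  GGGEEEE TTS HHHH'))   # CSCCHHHSSSSCCCCCHHHH
```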


In [21]:
def highlight_line(first_seq, second_seq):
    """
    For the two sequences, returns a line where matching letters are
    highlighted with | except where the letter is a gap (-)
    """
    joins = ['|' if a == b and a != '-' else ' ' for a, b in zip(first_seq, second_seq)]
    return ''.join(joins)


def print_alignment(seq_a, seq_b):
    len_split = 50
    n_splits = len(seq_a)//len_split + 1
    for i_split in range(n_splits):
        start = len_split*i_split
        end = start + len_split
        part_a = seq_a[start:end]
        part_b = seq_b[start:end]
        print(part_a)
        print(highlight_line(part_a, part_b))
        print(part_b)
        print()

print_alignment(dssp_result_for_6aam, initial_predict_6aam)

CCCCCCSCCCCCSSSSSSCCCCCCCSSSSSSSCCCCCCCCSSSSSSSCCC
   |  | ||||||  ||  |  |        |   |  |   |  |   
..SCSSSSCCCCSSCCSSSSCSSCSCCHHHCCCSSSCSSCCHCSCCSSSH

CCCCCHHHHHHHHHHHHHHCCCCCSCCSSSSSSSCCCCSSSSSSSCCCCC
| ||   |             ||   ||       ||||||||   |   
CHCCSSCHCCCCSSSSSSSSSCCSCSCSCCCCHCHCCCSSSSSCCSCSSS

SHHHHCCCSCCCHHHHHHHHHHHHHHHHHHHHCCSSCSCCSCCCSSSSSC
 ||    |                        ||  |  |||||||||||
HHHCSSSCCHHSSSSSSSSSCSSSCCCCSSSCCCCCCCSCSCCCSSSSSC

CSSSSCCCCCCSSCCCCCCCCCCCCCCCCCCCCCCHHHHHHCCCCHHHHH
 |||     |   |   |||||||   |    |||    | ||||   ||
SSSSCSSHHCHCCCSSSCCCCCCCSSSCHHSHCCCCSSSHCCCCCCCCHH

HHHHHHHHHHHCCCCCCCSHHHHHHHHHCSCCCCCHHHHHHHHHHCCCCC
 ||          ||    |         | | |           ||| |
CHHSCSSSSSSSSCCHHHHHSSSSSSSSSSHCSCSSSSSSSSSSSCCCSC

CCCCCCCHHHHHHHHHHCCSSCCCSCCHHHHHHHHHHHHHHHHCCCCC
 |    |          ||| ||| |                  |   
SCSSHHCCSCSCCSCSCCCSCCCCCCSCCSSSSSSSSSCSCSCSCH..



In [22]:
def success(seq_a, seq_b):
    same = 0
    different = 0
    for let_a, let_b in zip(seq_a, seq_b):
        if let_a == let_b:
            same += 1
        else:
            different += 1
    return same/(same + different)
print('percentage correct predictions: {:.1f}%'.format(100*success(dssp_result_for_6aam, initial_predict_6aam)))

percentage correct predictions: 32.2%
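
A single percentage hides which states are being predicted well. As a possible refinement, the sketch below breaks the score down by observed state. The function name per_state_accuracy is hypothetical and the two short strings are made up for the example, not real DSSP output.

```python
from collections import Counter

def per_state_accuracy(observed, predicted):
    """Fraction of correctly predicted residues for each observed state."""
    totals, hits = Counter(), Counter()
    for obs, pred in zip(observed, predicted):
        totals[obs] += 1
        if obs == pred:
            hits[obs] += 1
    return {state: hits[state] / totals[state] for state in totals}

observed = 'CCHHHHSSCC'     # made-up strings for illustration
predicted = 'CCHHCCSCCC'
print(per_state_accuracy(observed, predicted))
# {'C': 1.0, 'H': 0.5, 'S': 0.5}
```

A network that predicts coil almost everywhere would score well on C and poorly on H and S, which the overall percentage alone would not reveal.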


### Importing a larger training set
A much larger training set is provided as a comma separated file called:
"PDB_protein_secondary_5mers.csv". 

    from csv import reader #may help here
Can you create a training set vector data structure from this and use it to train the network?
If you would like some examples of test data, here are some examples from the recently-determined PDB 6aam.pdb

    S: EMVAV, KVSKY, YKGCC
    H: LAQLL, ICEGM, ASVDW
    C: ERLPR, GDFGL, YKFYY


In [23]:
# answer
from csv import reader
with open('PDB_protein_secondary_5mers.csv') as csv_file:
    csv_reader = reader(csv_file, delimiter=',')
    data = list(csv_reader)
print('have read', len(data), 'lines from csv file')
print('first 5 lines', data[:5])

have read 26242 lines from csv file
first 5 lines [['MGKMY', 'S'], ['YGIPQ', 'C'], ['KMWTY', 'H'], ['YRLRK', 'H'], ['NSVSV', 'S']]


In [43]:
##Run this cell to create the training data
training_set = seqSecStrucData + data[:200]
trainingData = []
for seq, ss in training_set:
    inputVec = convertSeqToVector(seq, aaIndexDict)
    outputVec = convertSeqToVector(ss, ssIndexDict)
    trainingData.append((inputVec, outputVec))

In [48]:
wMatrixIn, wMatrixOut = neuralNetTrain(trainingData, 4, 3000)

Step: 0 Error: 250.602497
Step: 15 Error: 248.776633
Step: 16 Error: 238.671113
Step: 17 Error: 233.067781
Step: 18 Error: 219.627317
Step: 21 Error: 188.787130
Step: 32 Error: 188.786146
Step: 33 Error: 180.054240
Step: 37 Error: 173.220385
Step: 38 Error: 160.614172
Step: 44 Error: 159.688608
Step: 46 Error: 152.873979
Step: 71 Error: 148.366355
Step: 79 Error: 140.795987
Step: 109 Error: 138.691307
Step: 125 Error: 135.592580
Step: 128 Error: 134.971765
Step: 133 Error: 129.121363
Step: 155 Error: 127.287103
Step: 184 Error: 123.494047
Step: 193 Error: 119.126797
Step: 237 Error: 118.153780
Step: 243 Error: 117.131650
Step: 356 Error: 114.343184
Step: 521 Error: 111.513306
Step: 622 Error: 110.264015
Step: 1331 Error: 106.479068


In [49]:
##Run this cell to test for a new 5-mer
testSecStrucSeq = 'DLLSA'
testSecStrucVec = convertSeqToVector(testSecStrucSeq, aaIndexDict)
testSecStrucArray = array( [testSecStrucVec,] )

sIn, sHid, sOut =    neuralNetPredict(testSecStrucArray, wMatrixIn, wMatrixOut)
index = sOut.argmax()
print("Test prediction: %s" % ssCodes[index])

Test prediction: C


In [50]:
def predict_for_sequence(sequence):
    prediction = []
    for ires, residue in enumerate(sequence):
        if ires < 2 or ires > len(sequence) - 3:
            this_prediction = '.'
        else:
            fivemer = sequence[ires-2:ires+3]
            testSecStrucVec = convertSeqToVector(fivemer, aaIndexDict)
            testSecStrucArray = array( [testSecStrucVec,] )
            sIn, sHid, sOut =    neuralNetPredict(testSecStrucArray, wMatrixIn, wMatrixOut)
            index = sOut.argmax()
            this_prediction = ssCodes[index]
        prediction.append(this_prediction)
    return ''.join(prediction)

second_predict_6aam = predict_for_sequence(sequence_6aam)
print(second_predict_6aam)
print('percentage correct predictions: {:.1f}%'.format(100*success(dssp_result_for_6aam, second_predict_6aam)))

..CCCCCCHHCHHHHHCCCCCCCCCCCHHHCCCCCCCCCCCHHCCHHHHCCCCCCCCCHCCHCHCCCCHHHCCHHCCCHCCCCCCCCCHCHCCCCCCCCCCCCCCCCCCHCCCCCCHCHHCCCCCCHCCCHCCCHCCCHCHCCCCCHCCCCCHCCCCCCCHCHCCCCCCHCCHCCCCCCCCCHCCCCCHHCHHCHCCCHCCCCCCCCCCHHCCHCHCCCCCCCHHHCCCCCCCCCCCCHCHCHHCCCCCCCCCCCCCCCHCCHCHCHCCHCHCCCCCCHHCCCCCCCHCHHCCCCC..
percentage correct predictions: 47.3%
