## A feed-forward Neural Network for secondary structure prediction
This notebook looks at a Neural Network based on code from 
[Stevens and Boucher (2014, Python Programming for Biology CUP)](https://www.amazon.co.uk/Python-Programming-Biology-Bioinformatics-Beyond/dp/0521720095). The aim is to predict the secondary structure of a protein from its sequence. 
A predictive network is trained using data on the known secondary structure of k-mers of 5 amino acids taken from a set of PDB structures. Three secondary structure states are defined: H, C, and S. H and S are helix and strand respectively, while C is for coil, which covers a range of structures without a regular H-bonding pattern. In practice, secondary structure prediction has many uses, for instance in helping to identify functional domains ([Drozdetskiy et al., 2015](https://doi.org/10.1093/nar/gkv332)), and can easily be achieved using the JPRED4 server http://www.compbio.dundee.ac.uk/jpred4

Secondary structure is much more complicated than indicated by the simple classification of this data - full details are available from the analysis of the H-bonding arrangements. The program DSSP (https://swift.cmbi.umcn.nl/gv/dssp/DSSP_3.html) is a well-tested approach to this problem. This produces a description of the secondary structure in a known protein structure. For historical reasons DSSP uses E for Strands. A related DSSR program gives RNA secondary structure.

It is useful to be able to predict the secondary structure of a protein for which there is only sequence available. One approach would be to align it with homologous sequences where the structure is known. The approach here is to use the sequence in the neighbourhood of a residue as a basis for a neural network prediction. 

The network here is a simple three-layer feed-forward one. The number of nodes in the hidden layer can be chosen by the programmer, but the numbers of input and output nodes are fixed by the sizes of the input and output data vectors.


In [1]:
##Run this cell to import numpy 
from numpy import tanh, ones, append, array, zeros, random, sum

The neural network function takes input data for the first layer of network nodes, applies the first set of weighted connections to pass the signal to the hidden layer of nodes, then applies the second set of weights to produce the output. 

The output may not be optimal, because the same function is also called during training while the weights are still being adjusted. After training, however, the function gives predictions, and it takes its name from that. 

The weightsIn values define the strength of connection between the input nodes and the hidden nodes. Similarly weightsOut define the strengths of connection between the hidden and the output nodes. 

The weights are given as matrices with the rows indexing the nodes in a layer and the columns indexing the nodes in the other layer. 

The signalIn vector is the input features and an extra value of 1.0. This additional value is called the bias node which is used to tune the baseline response of the network. The baseline is the level without a meaningful signal. 

The weights attached to the bias node are set during training to adapt to the values in the input data. This means the input data don't need to be pre-prepared with a mean of zero.
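As a rough illustration of the shapes involved, the sketch below builds an input signal with its bias value and two random weight matrices. The sizes are assumptions chosen to match the 5-mer example used later (5 positions x 20 amino acids, 3 hidden nodes, 3 output states), and the snake_case names are only for this example.

```python
import numpy as np

# Assumed sizes: a 5-mer encoded over 20 amino acids,
# 3 hidden nodes and 3 output states (H, C, S).
num_features = 5 * 20
num_hidden = 3
num_out = 3

features = np.zeros(num_features)
signal_in = np.append(features, 1.0)   # bias value of 1.0 added at the end

# One row per input node (including the bias), one column per hidden node.
weights_in = np.random.random((num_features + 1, num_hidden)) - 0.5
# One row per hidden node, one column per output node.
weights_out = np.random.random((num_hidden, num_out)) - 0.5

print(signal_in.shape)    # (101,)
print(weights_in.shape)   # (101, 3)
print(weights_out.shape)  # (3, 3)
```

The extra row in `weights_in` holds the weights from the bias node, which is why the training function adds one to the number of inputs.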

In [2]:
## Run this cell to define the function
def neuralNetPredict(inputVec, weightsIn, weightsOut):
    """ uses the current weights in a neural network
    to make a prediction from an input vector
    all input and output are numpy data structures""" 
    signalIn = append(inputVec, 1.0) # input layer

    prod = signalIn * weightsIn.T
    sums = sum(prod, axis=1)
    signalHid = tanh(sums)    # hidden    layer

    prod = signalHid * weightsOut.T
    sums = sum(prod, axis=1)
    signalOut = tanh(sums)    # output    layer

    return signalIn, signalHid, signalOut
 

Note that the numpy `.T` attribute gives the transpose of a matrix - that is, the matrix with the columns turned into rows and the rows turned into columns. 

This is used so that the input signal gets applied to all the hidden nodes.
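The multiply-then-sum pattern in the function is equivalent to a matrix-vector product. The small check below uses made-up numbers (two features plus the bias value, two hidden nodes) to show that both forms give the same weighted sums.

```python
import numpy as np

signal_in = np.array([0.0, 1.0, 1.0])        # two features plus the bias value
weights_in = np.array([[0.1, -0.2],
                       [0.3,  0.4],
                       [-0.5, 0.6]])         # shape (3 inputs, 2 hidden nodes)

# Form used in neuralNetPredict: broadcast the signal across the
# transposed weights, then sum along each row (one row per hidden node).
prod = signal_in * weights_in.T              # shape (2, 3)
sums_broadcast = prod.sum(axis=1)

# Equivalent matrix-vector product.
sums_matmul = signal_in @ weights_in

print(sums_broadcast)
print(np.allclose(sums_broadcast, sums_matmul))
```

Either form could be used. The broadcast version makes it explicit that every hidden node receives the whole input signal.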

The network applies the hyperbolic tangent function (tanh) to get the signal output from all the nodes in a layer. Hyperbolic tan is a sigmoidal function that varies from -1 to 1, so it is much better behaved than ordinary tan, which runs off to infinity. 

<img src="https://mathworld.wolfram.com/images/interactive/TanhReal.gif" width=300></img> 

The tanh function defines the output of that node given an input or set of inputs. As a nod to the output of neurones, which depends on an activation level across their cell membrane, the output function is called the *Activation* function.

In operation only the signalOut from the output layer is of interest. But during training the response signals from the other layers are also needed to adjust the weighting scheme.

### Training 

The weighting scheme (and gain) will be optimized by using a training dataset. 

Each item of training data is an input feature vector paired with a known output vector. The order of the data is randomly shuffled on every cycle to avoid bias. The number of hidden nodes and the number of optimization cycles need to be specified. 

For each training example the difference (the 'error') between the output signal of the network and the known training set output is used to adjust the network weights. The difference is combined with the *gradient* in the signal values - calculated from the tanh activation function (conveniently, if sig is the tanh output of a node, the gradient is 1-sig<sup>2</sup> or 1 - {sig x sig}). The squared differences are summed over each cycle to give the error that is reported. 
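The gradient shortcut can be checked numerically. The sketch below compares 1 - sig<sup>2</sup>, computed from the tanh output, with a finite-difference estimate of the slope. The sample points and step size are arbitrary choices for illustration.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 13)   # arbitrary sample of node inputs
sig = np.tanh(x)                 # node outputs, always between -1 and 1

# Gradient computed from the output alone, as in the training code.
analytic = 1.0 - sig * sig

# Central finite-difference estimate of d tanh(x) / dx.
h = 1e-6
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)

max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)
```

The gap should be tiny, which is why the training code never needs to evaluate the derivative separately: the output signal is enough.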

Early in training, large differences can make the network go haywire, so the speed of weight changing is damped down by 'rate' and 'momentum' multipliers (usually the default values of 0.5 and 0.2 are good enough). 

More damping would mean that many more cycles would be needed for the weights to converge. 

The training will work back from the value of the error to adjust the weighting scheme of the network. This is called *back propagation*. 

The use of the gradient is crucial as it means initial adjustments will be large but then finer adjustments will be made as the optimum is approached.
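To make the update rule concrete, here is a minimal sketch on a toy 'network' with a single weight, a single input of 1.0 and a target output of 0.5. These values are assumptions picked for illustration. It applies the same rate-and-momentum rule as the training function below, where the momentum term reuses the previous example's change.

```python
import numpy as np

rate, momentum = 0.5, 0.2       # same defaults as neuralNetTrain
x, target = 1.0, 0.5            # toy input and desired output (assumed values)
w = 0.0                         # single weight, starting from zero
prev_change = 0.0

for step in range(50):
    out = np.tanh(w * x)
    diff = target - out                 # error for this example
    gradient = 1.0 - out * out          # slope of tanh at the output
    change = gradient * diff * x        # back-propagated adjustment
    w += rate * change + momentum * prev_change
    prev_change = change

print(w, np.tanh(w * x))
```

The early steps move the weight a long way and the later ones barely change it, which is the behaviour described above.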

In [3]:
## Run this cell to define the neuralNetTrain function
def neuralNetTrain(trainData, numHid, steps=100, rate=0.5, momentum=0.2, wInp=None, wOut=None):
    """ uses training data to set the weights in a simple
    neural network, number of hidden nodes is specified"""
    numInp = len(trainData[0][0])
    numOut = len(trainData[0][1])
    numInp += 1
    minError = None

    sigInp = ones(numInp)
    sigHid = ones(numHid)
    sigOut = ones(numOut)
    
    if wInp is None:
        wInp = random.random((numInp, numHid))-0.5
    
    if wOut is None:
        wOut = random.random((numHid, numOut))-0.5
    bestWeightMatrices = (wInp, wOut)

    cInp = zeros((numInp, numHid))
    cOut = zeros((numHid, numOut))

    for x, (inputs, knownOut) in enumerate(trainData):
        trainData[x] = (array(inputs), array(knownOut))
 
    for step in range(steps):  
        random.shuffle(trainData) # Important to avoid bias
        error = 0.0
 
        for inputs, knownOut in trainData:
            sigIn, sigHid, sigOut = neuralNetPredict(inputs, wInp, wOut)

            diff = knownOut - sigOut
            error += sum(diff * diff)

            gradient = ones(numOut) - (sigOut*sigOut)
            outAdjust = gradient * diff 

            diff = sum(outAdjust * wOut, axis=1)
            gradient = ones(numHid) - (sigHid*sigHid)
            hidAdjust = gradient * diff 

            # update output 
            change = outAdjust * sigHid.reshape(numHid, 1)
            wOut += (rate * change) + (momentum * cOut)
            cOut = change
 
            # update input 
            change = hidAdjust * sigIn.reshape(numInp, 1)
            wInp += (rate * change) + (momentum * cInp)
            cInp = change
 
        if (minError is None) or (error < minError):
            minError = error
            bestWeightMatrices = (wInp.copy(), wOut.copy())
            print("Step: %d Error: %f" % (step, error))
    
    return bestWeightMatrices

### Testing the functions
Simple data to test a network can be binary input vectors with the desired output being an 'exclusive OR' (EOR) response https://en.wikipedia.org/wiki/Exclusive_or. This responds True if exactly one input is True, but False if both are True together (or both are False). 
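The truth table can be generated with Python's bitwise `^` operator. The sketch below builds it in the same nested-list layout as the test data in the next cell. The name `xor_table` is only used for this example.

```python
# Build [[inputs], [output]] pairs for every combination of two binary inputs.
xor_table = [[[a, b], [a ^ b]] for a in (0, 1) for b in (0, 1)]

for inputs, output in xor_table:
    print(inputs, '->', output)
```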

In [4]:
##Run this cell to define the test data
testEORdata = [[[0,0], [0]],
               [[0,1], [1]], 
               [[1,0], [1]], 
               [[1,1], [0]]]

The network test uses two hidden nodes - in real use several values would be tried to find the best performance.
Run the cell below to see if the training converges.

In [5]:
## Run this cell to train the network
wMatrixIn, wMatrixOut = neuralNetTrain(testEORdata, 2, 1000)

Step: 0 Error: 2.117528
Step: 1 Error: 1.392060
Step: 2 Error: 1.365615
Step: 3 Error: 1.280868
Step: 4 Error: 1.239094
Step: 5 Error: 1.096176
Step: 6 Error: 0.970600
Step: 10 Error: 0.909827
Step: 11 Error: 0.901503
Step: 13 Error: 0.899322
Step: 14 Error: 0.842626
Step: 17 Error: 0.829305
Step: 18 Error: 0.776297
Step: 22 Error: 0.750190
Step: 24 Error: 0.709475
Step: 29 Error: 0.678282
Step: 31 Error: 0.643846
Step: 33 Error: 0.635477
Step: 34 Error: 0.627824
Step: 35 Error: 0.570631
Step: 39 Error: 0.511134
Step: 40 Error: 0.505995
Step: 41 Error: 0.457139
Step: 42 Error: 0.331631
Step: 48 Error: 0.329306
Step: 50 Error: 0.205766
Step: 55 Error: 0.203626
Step: 56 Error: 0.118081
Step: 68 Error: 0.111239
Step: 70 Error: 0.088335
Step: 77 Error: 0.079027
Step: 82 Error: 0.077963
Step: 84 Error: 0.057786
Step: 92 Error: 0.042400
Step: 100 Error: 0.041747
Step: 102 Error: 0.037621
Step: 108 Error: 0.029122
Step: 116 Error: 0.028673
Step: 119 Error: 0.027827
Step: 125 Error: 0.023189
S

Here quite good convergence has occurred and there is no oscillation. Perhaps you can see that the initial steps are giving large changes in the error while later on there are smaller and smaller changes. This is owing to the effect of the gradient calculation. The changes in the actual weights are not printed, but will follow the same trend.

The output weight matrices can then be run on test data for evaluation. Test data should be inputs with known output but which were not included in the training set. 

Obviously it is not possible to give any new data for the EOR function as the training set covered all possible responses!

But the trained network should be able to do a reasonable job on the training set.
Run the following cell to compare the output of the network with the actual values of the training set outputs.

In [6]:
##Run this cell to test the network
for inputs, knownOut in testEORdata:
    sIn, sHid, sOut =    neuralNetPredict(array(inputs), wMatrixIn, wMatrixOut)
    print('input', inputs, ' should have output ', knownOut, 'actual output {:.3f}'.format(sOut[0]))

input [0 1]  should have output  [1] actual output 0.984
input [1 1]  should have output  [0] actual output 0.006
input [1 0]  should have output  [1] actual output 0.984
input [0 0]  should have output  [0] actual output -0.002


### Simple feature vectors for sequence data
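A simple indexing scheme is used to convert the sequence alphabet to a numeric form for the input vector. For proteins each residue gets an index from 0 to 19, taken from its position in the list of one-letter codes. That index then selects which element in a block of 20 values is set to 1.0.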

k-mers with k=5 are used. Only the output for the middle residue is required but the network will use the neighbours to predict the secondary structure of the middle one.  

Although static k-mers are used for training, in practice a prediction with a moving 5-mer window could be implemented. 
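A moving window could be sketched as below. The helper name `sliding_kmers` is hypothetical and the code assumes an odd k, so that each window has a well-defined central residue.

```python
def sliding_kmers(sequence, k=5):
    """Yield (centre_index, k-mer) for every full window in the sequence."""
    half = k // 2
    for centre in range(half, len(sequence) - half):
        yield centre, sequence[centre - half:centre + half + 1]

# First eight residues of the 6AAM sequence used later in the notebook.
windows = list(sliding_kmers('GPGDPTVF'))
print(windows)
# [(2, 'GPGDP'), (3, 'PGDPT'), (4, 'GDPTV'), (5, 'DPTVF')]
```

The first and last two residues never sit at the centre of a full window, so they get no prediction under this scheme.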

The possible outputs are also coded as integers for the more restricted alphabet of H, C, and S.

In [7]:
##Run this cell to define the dictionaries for the vectors
aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'
aaIndexDict = {}
for i, aa in enumerate(aminoAcids):
        aaIndexDict[aa] = i

ssIndexDict = {}
ssCodes = 'HCS'
for i, code in enumerate(ssCodes):
        ssIndexDict[code] = i

Here is a very limited set of training data. It shows the raw format which is a 5-mer string and the secondary structure that was observed for the central residue of this in at least one PDB structure. 

The actual structure is a simplified output from the DSSP program mentioned in the introduction. DSSP actually distinguishes more structures than the three here - for example there are other kinds of helix. But these complications are not dealt with.

In [8]:
##Run this cell to define the training set
small_training_set = [('ADTLL','S'),
                      ('DTLLI','S'),
                      ('TLLIL','S'),
                      ('LLILG','S'),
                      ('LILGD','S'),
                      ('ILGDS','S'),
                      ('LGDSL','C'),
                      ('GDSLS','H'),
                      ('DSLSA','H'),
                      ('SLSAG','H'),
                      ('LSAGY','H'),
                      ('SAGYR','C'),
                      ('AGYRM','C'),
                      ('GYRMS','C'),
                      ('YRMSA','C'),
                      ('RMSAS','C')]

The training data has to be converted to the numerical code. Here is a function to do that.

In [9]:
##Run this cell to define the function
def convertSeqToVector(seq, indexDict):
    """converts a one-letter sequence to numerical
    coding for neural network calculations"""   
    numLetters = len(indexDict)
    vector = [0.0] * len(seq) * numLetters

    for pos, letter in enumerate(seq):
        index = pos * numLetters + indexDict[letter]    
        vector[index] = 1.0

    return vector


The training data is prepared with this.
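To see what the encoding looks like, the self-contained sketch below repeats the same layout with a hypothetical helper `one_hot_demo` and reports which positions are switched on for the first training 5-mer.

```python
amino_acids = 'ACDEFGHIKLMNPQRSTVWY'
index_of = {aa: i for i, aa in enumerate(amino_acids)}

def one_hot_demo(seq, index_dict):
    """Same layout as convertSeqToVector: one block of
    len(index_dict) slots per letter, with a single 1.0 in each block."""
    n = len(index_dict)
    vector = [0.0] * (len(seq) * n)
    for pos, letter in enumerate(seq):
        vector[pos * n + index_dict[letter]] = 1.0
    return vector

vec = one_hot_demo('ADTLL', index_of)
hot = [i for i, v in enumerate(vec) if v == 1.0]
print(len(vec), hot)
# 100 [0, 22, 56, 69, 89]
```

Each 5-mer therefore becomes a vector of 100 values with exactly five set to 1.0, and each secondary structure code becomes a vector of 3 values with one set to 1.0.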

In [10]:
##Run this cell to create the training data
small_training_vector = []
for seq, ss in small_training_set:
 
        inputVec = convertSeqToVector(seq, aaIndexDict)
        outputVec = convertSeqToVector(ss, ssIndexDict)

        small_training_vector.append( (inputVec, outputVec) )


And then the network is trained. Here there are 3 hidden nodes specified.

In [11]:
wMatrixIn, wMatrixOut = neuralNetTrain(small_training_vector, 3, 1000)

Step: 0 Error: 19.635992
Step: 1 Error: 11.546471
Step: 2 Error: 11.348317
Step: 3 Error: 7.996287
Step: 4 Error: 5.794609
Step: 5 Error: 5.535160
Step: 6 Error: 2.135310
Step: 7 Error: 1.445940
Step: 9 Error: 1.253434
Step: 10 Error: 0.705480
Step: 11 Error: 0.675753
Step: 13 Error: 0.404313
Step: 18 Error: 0.396442
Step: 19 Error: 0.256821
Step: 27 Error: 0.219662
Step: 34 Error: 0.154109
Step: 37 Error: 0.116712
Step: 51 Error: 0.108171
Step: 67 Error: 0.103497
Step: 69 Error: 0.048243
Step: 83 Error: 0.045614
Step: 85 Error: 0.043765
Step: 88 Error: 0.030492
Step: 97 Error: 0.023526
Step: 100 Error: 0.016722
Step: 173 Error: 0.012254
Step: 174 Error: 0.009497
Step: 232 Error: 0.007960
Step: 237 Error: 0.005484
Step: 238 Error: 0.004232
Step: 322 Error: 0.004194
Step: 328 Error: 0.003167
Step: 380 Error: 0.002579
Step: 408 Error: 0.001786
Step: 522 Error: 0.001459
Step: 579 Error: 0.001129
Step: 874 Error: 0.001025
Step: 879 Error: 0.000980
Step: 886 Error: 0.000974
Step: 894 Error:

You will see that this training has converged nicely. The only problem is that it was for a very restricted set of sequence data. 

There are 3 x 20 = 60 theoretical combinations of amino acid residues with secondary structure states. But for particular residues some of these are favoured and others disfavoured. 

For each residue there will be 20^4 = 160 000 different contexts that it could possibly occur in, although some of the resulting 5-mers are actually quite rare in structured proteins. 
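The counts quoted above are simple powers of the alphabet size and can be checked directly. The variable names are only for this example.

```python
n_amino_acids = 20
n_states = 3

residue_state_pairs = n_amino_acids * n_states   # residue with H, C or S
contexts_per_residue = n_amino_acids ** 4        # four neighbours in a 5-mer
possible_5mers = n_amino_acids ** 5              # every distinct 5-mer

print(residue_state_pairs)     # 60
print(contexts_per_residue)    # 160000
print(possible_5mers)          # 3200000
```

The 16 examples in the small training set cover only a minute fraction of the 3.2 million possible 5-mers.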

All the same, it would be good to have a larger training set. It is better to have some rare examples of residue state combinations although, of course, the network will not have to predict them frequently.

One thing to remember is to retain some test examples - where the answer is known but which are not in the training set. 
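One way to hold back test examples is to shuffle the labelled 5-mers and slice off a fraction before training. This is a minimal sketch. The helper name `split_train_test`, the 20% fraction and the fixed seed are all assumptions for illustration.

```python
from random import Random   # avoids shadowing numpy's random imported earlier

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Return (train, test) lists with no example in both."""
    shuffled = list(examples)            # copy, so the caller's list is untouched
    Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

examples = [('ADTLL', 'S'), ('GDSLS', 'H'), ('SAGYR', 'C'),
            ('DSLSA', 'H'), ('LGDSL', 'C')]
train, test = split_train_test(examples)
print(len(train), 'training examples,', len(test), 'test examples')
```

Because overlapping 5-mers from the same protein share most of their residues, a random split like this may still leave the test set closely related to the training set. Holding back whole proteins would be a stricter test.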

In its current, poorly-trained state, the network is still able to make reasonable predictions, but only if the test sequence is clearly related to examples that it has seen. The `test_expect_h` sequence used below ('DLLSA') is very similar to 'DSLSA' in the `small_training_set` training set.

In [15]:
# run this cell to define predict_seq
def predict_seq(seq, w_matrix_in, w_matrix_out):
    """
    returns a prediction either 'H', 'S' or 'C' for the input sequence
    """
    vector_seq = convertSeqToVector(seq, indexDict=aaIndexDict)
    array_seq = array([vector_seq,])
    sIn, sHid, sOut =    neuralNetPredict(array_seq, w_matrix_in, w_matrix_out)
    index = sOut.argmax()
    return ssCodes[index]

In [16]:
correct = 0
total = 0
for test_5_mer, code in small_training_set:
    predict = predict_seq(test_5_mer, wMatrixIn, wMatrixOut)
    if predict == code:
        correct += 1
    total += 1
    print(test_5_mer, 'input ', code, ' predict ', predict)
print('success "prediction" for training data is {:.1f}%'.format(100.*correct/total))

ADTLL input  S  predict  S
DTLLI input  S  predict  S
TLLIL input  S  predict  S
LLILG input  S  predict  S
LILGD input  S  predict  S
ILGDS input  S  predict  S
LGDSL input  C  predict  C
GDSLS input  H  predict  H
DSLSA input  H  predict  H
SLSAG input  H  predict  H
LSAGY input  H  predict  H
SAGYR input  C  predict  C
AGYRM input  C  predict  C
GYRMS input  C  predict  C
YRMSA input  C  predict  C
RMSAS input  C  predict  C
success "prediction" for training data is 100.0%


In [20]:
test_expect_h = 'DLLSA'
print('prediction for', test_expect_h, 'is', predict_seq(test_expect_h, wMatrixIn, wMatrixOut))

prediction for DLLSA is H


In [18]:
# test data from PDB structure 6aam, divided into 5-mers
test_data_6aam = [('DPTVF', 'C'), ('HKRYL', 'C'), ('KKIRD', 'S'), ('LGEGH', 'C'), 
                  ('FGKVS', 'S'), ('LYCYD', 'S'), ('PTNDG', 'C'), ('TGEMV', 'S'), 
                  ('AVKAL', 'S'), ('KADAG', 'C'), ('PQHRS', 'H'), ('GWKQE', 'H'), 
                  ('IDILR', 'H'), ('TLYHE', 'C'), ('HIIKY', 'C'), ('KGCCE', 'S'), 
                  ('DAGAA', 'C'), ('SLQLV', 'S'), ('MEYVP', 'C'), ('LGSLR', 'S'), 
                  ('DYLPR', 'C'), ('HSIGL', 'C'), ('AQLLL', 'H'), ('FAQQI', 'H'), 
                  ('CEGMA', 'H'), ('YLHAQ', 'H'), ('HYIHR', 'S'), ('NLAAR', 'S'), 
                  ('NVLLD', 'S'), ('NDRLV', 'C'), ('KIGDF', 'C'), ('GLAKA', 'C'), 
                  ('VPEGH', 'C'), ('EYYRV', 'C'), ('REDGD', 'C'), ('SPVFW', 'C'), 
                  ('YAPEC', 'H'), ('LKEYK', 'H'), ('FYYAS', 'H'), ('DVWSF', 'H'), 
                  ('GVTLY', 'H'), ('ELLTH', 'H'), ('CDSSQ', 'C'), ('SPPTK', 'H'), 
                  ('FLELI', 'H'), ('GLAQG', 'C'), ('QMTVL', 'H'), ('RLTEL', 'H'), 
                  ('LERGE', 'C'), ('RLPRP', 'C'), ('DKCPA', 'C'), ('EVYHL', 'H'), 
                  ('MKNCW', 'H'), ('ETEAS', 'S'), ('FRPTF', 'C'), ('ENLIP', 'H'), 
                  ('ILKTV', 'H'), ('HEKYQ', 'H'), ('GQAPS', 'C')]

In [21]:
correct = 0
for test_5_mer, code in test_data_6aam:
    predict = predict_seq(test_5_mer, wMatrixIn, wMatrixOut)
    if predict == code:
        correct += 1
print('success prediction for 6AAM test data is {:.1f}%'.format(100.*correct/len(test_data_6aam)))


success prediction for test data is 27.1%


# making a prediction of the secondary structure for PDB 6aam

Let's use the recent PDB structure 6AAM, "Non-receptor tyrosine-protein kinase TYK2" (https://www.ebi.ac.uk/pdbe/entry/pdb/6aam/), as an example.

In [None]:
# 6aam sequence from https://www.ebi.ac.uk/pdbe/entry/pdb/6aam/protein/1
sequence_6aam = ('GPGDPTVFHKRYLKKIRDLGEGHFGKVSLYCYDPTNDGTGEMVAVKALKADAGP'
                 'QHRSGWKQEIDILRTLYHEHIIKYKGCCEDAGAASLQLVMEYVPLGSLRDYLPR'
                 'HSIGLAQLLLFAQQICEGMAYLHAQHYIHRNLAARNVLLDNDRLVKIGDFGLAK'
                 'AVPEGHEYYRVREDGDSPVFWYAPECLKEYKFYYASDVWSFGVTLYELLTHCDS'
                 'SQSPPTKFLELIGLAQGQMTVLRLTELLERGERLPRPDKCPAEVYHLMKNCWET'
                 'EASFRPTFENLIPILKTVHEKYQGQAPS')
print(len(sequence_6aam))

In [None]:
# from https://cdn.rcsb.org/etl/kabschSander/ss_dis.txt.gz
dssp_result_for_6aam  = """>6AAM:A:secstr
      B  GGGEEEEEE       EEEEEEE TT     EEEEEEE      TTHHHHHHHHHHHHHH   TTB
  EEEEEEEGGGTEEEEEEE  TT BHHHHGGGS   HHHHHHHHHHHHHHHHHHHHTTEE S  SGGGEEEEET
TEEEE   TT EE                 GGG  HHHHHH    HHHHHHHHHHHHHHHHTTT GGGSHHHHHH
HHH S  TT HHHHHHHHHHTT      TT  HHHHHHHHHHT SSGGGS  HHHHHHHHHHHHHHHH     
"""
dssp_result_for_6aam = dssp_result_for_6aam.splitlines()
dssp_result_for_6aam.pop(0)
dssp_result_for_6aam = ''.join(dssp_result_for_6aam)
# need to convert DSSP codes to the 3 categories helix, strand, coil.
# Use mapping 
# helices H, G, I go to H
# strands E & bridges B go to S
# everything else (blank, T and the DSSP bend code S) goes to C
translation = str.maketrans('HGIEB TS', 'HHHSSCCC')
dssp_result_for_6aam = dssp_result_for_6aam.translate(translation)
print(len(dssp_result_for_6aam))
print(dssp_result_for_6aam)

In [None]:
test_data_from_6aam = []
for ires, dssp in enumerate(dssp_result_for_6aam):
    if ires>1 and ires%5 == 0:
        fivemer = sequence_6aam[ires-2:ires+3]
        test_data_from_6aam.append((fivemer, dssp))
print(test_data_from_6aam)  

In [None]:
def predict_for_sequence(sequence):
    prediction = []
    for ires, residue in enumerate(sequence):
        if ires < 2 or ires > len(sequence) - 3:
            this_prediction = '.'
        else:
            fivemer = sequence[ires-2:ires+3]
            testSecStrucVec = convertSeqToVector(fivemer, aaIndexDict)
            testSecStrucArray = array( [testSecStrucVec,] )
            sIn, sHid, sOut =    neuralNetPredict(testSecStrucArray, wMatrixIn, wMatrixOut)
            index = sOut.argmax()
            this_prediction = ssCodes[index]
        prediction.append(this_prediction)
    return ''.join(prediction)

initial_predict_6aam = predict_for_sequence(sequence_6aam)
print(initial_predict_6aam)
print(len(initial_predict_6aam))


In [None]:
def highlight_line(first_seq, second_seq):
    """ 
    for the two sequences returns a line where matching letters are 
    highlighted with | except if the letter are a gap
    """
    joins = ['|' if a == b and a != '-' else ' ' for a, b in zip(first_seq, second_seq)]
    return ''.join(joins)
def print_alignment(seq_a, seq_b):
    len_split = 50
    n_splits = len(seq_a)//len_split + 1
    for i_split in range(n_splits):
        start = len_split*i_split
        end = start + len_split
        part_a = seq_a[start:end]
        part_b = seq_b[start:end]
        print(part_a)
        print(highlight_line(part_a, part_b))
        print(part_b)
        print()

print_alignment(dssp_result_for_6aam, initial_predict_6aam)

In [None]:
def success(seq_a, seq_b):
    same = 0
    different = 0
    for let_a, let_b in zip(seq_a, seq_b):
        if let_a == let_b:
            same += 1
        else:
            different += 1
    return same/(same + different)
print('percentage correct predictions: {:.1f}%'.format(100*success(dssp_result_for_6aam, initial_predict_6aam)))

### Importing a larger training set
A much larger training set is provided as a comma separated file called:
"PDB_protein_secondary_5mers.csv". 

    from csv import reader #may help here
Can you create a training set vector data structure from this and use it to train the network?
If you would like some examples of test data, here are some examples from the recently-determined PDB 6aam.pdb

    S: EMVAV, KVSKY, YKGCC
    H: LAQLL, ICEGM, ASVDW
    C: ERLPR, GDFGL, YKFYY


In [22]:
# answer
from csv import reader
with open('PDB_protein_secondary_5mers.csv') as csv_file:
    csv_reader = reader(csv_file, delimiter=',')
    data = list(csv_reader)
print('have read', len(data), 'lines from csv file')
print('first 5 lines', data[:5])

have read 26242 lines from csv file
first 5 lines [['MGKMY', 'S'], ['YGIPQ', 'C'], ['KMWTY', 'H'], ['YRLRK', 'H'], ['NSVSV', 'S']]


In [33]:
training_set = small_training_set + data[:2000]
##Run this cell to create the training data
training_vector = []
for seq, ss in training_set:
        inputVec = convertSeqToVector(seq, aaIndexDict)
        outputVec = convertSeqToVector(ss, ssIndexDict)
        training_vector.append( (inputVec, outputVec) )

In [37]:
wMatrixIn, wMatrixOut = neuralNetTrain(training_vector, 3, 1000, wInp=wMatrixIn, wOut=wMatrixOut)
#
#wMatrixIn, wMatrixOut = neuralNetTrain(training_vector, 3, 1000, wInp=wMatrixIn, wOut=wMatrixOut)

Step: 0 Error: 2227.156243
Step: 4 Error: 2213.111653
Step: 13 Error: 2198.422190
Step: 24 Error: 2191.190179
Step: 32 Error: 2187.640857
Step: 51 Error: 2185.208241
Step: 56 Error: 2170.163848
Step: 69 Error: 2162.817387
Step: 94 Error: 2157.230469
Step: 102 Error: 2156.136502
Step: 277 Error: 2140.591012
Step: 798 Error: 2126.489703
Step: 898 Error: 2095.296813


In [35]:
correct = 0
total = 0
for test_5_mer, code in small_training_set:
    predict = predict_seq(test_5_mer, wMatrixIn, wMatrixOut)
    if predict == code:
        correct += 1
    total += 1
    print(test_5_mer, 'input ', code, ' predict ', predict)
print('success "prediction" for training data is {:.1f}%'.format(100.*correct/total))

ADTLL input  S  predict  H
DTLLI input  S  predict  S
TLLIL input  S  predict  H
LLILG input  S  predict  H
LILGD input  S  predict  S
ILGDS input  S  predict  S
LGDSL input  C  predict  S
GDSLS input  H  predict  S
DSLSA input  H  predict  H
SLSAG input  H  predict  S
LSAGY input  H  predict  H
SAGYR input  C  predict  S
AGYRM input  C  predict  S
GYRMS input  C  predict  S
YRMSA input  C  predict  S
RMSAS input  C  predict  C
success "prediction" for training data is 37.5%


In [36]:
correct = 0
for test_5_mer, code in test_data_6aam:
    predict = predict_seq(test_5_mer, wMatrixIn, wMatrixOut)
    if predict == code:
        correct += 1
print('success prediction for 6AAM test data is {:.1f}%'.format(100.*correct/len(test_data_6aam)))

success prediction for 6AAM test data is 25.4%


# Maybe try 3-mer rather than 5mer

In [None]:
##Run this cell to define the training set
small_training_set = [('ADTLL','S'),
                      ('DTLLI','S'),
                      ('TLLIL','S'),
                      ('LLILG','S'),
                      ('LILGD','S'),
                      ('ILGDS','S'),
                      ('LGDSL','C'),
                      ('GDSLS','H'),
                      ('DSLSA','H'),
                      ('SLSAG','H'),
                      ('LSAGY','H'),
                      ('SAGYR','C'),
                      ('AGYRM','C'),
                      ('GYRMS','C'),
                      ('YRMSA','C'),
                      ('RMSAS','C')]
small_training_set_3 = []
for k, v in small_training_set:
    small_training_set_3.append((k[1:4],v))
print(small_training_set_3)

In [None]:
##Run this cell to create the training data
small_training_vector = []
for seq, ss in small_training_set_3:
 
        inputVec = convertSeqToVector(seq, aaIndexDict)
        outputVec = convertSeqToVector(ss, ssIndexDict)

        small_training_vector.append( (inputVec, outputVec) )


In [None]:
wMatrixIn, wMatrixOut = neuralNetTrain(small_training_vector, 3, 1000)

In [None]:
# run this cell to define predict_3_mer
def predict_3_mer(seq):
    """
    returns a prediction either 'H', 'S' or 'C' for the input sequence of 3 amino acids
    """
    global wMatrixIn  # produced by previous training
    global wMatrixOut
    vector_seq = convertSeqToVector(seq, indexDict=aaIndexDict)
    array_seq = array([vector_seq,])
    sIn, sHid, sOut =    neuralNetPredict(array_seq, wMatrixIn, wMatrixOut)
    index = sOut.argmax()
    return ssCodes[index]

In [None]:
correct = 0
total = 0
for test_3_mer, code in small_training_set_3:
    predict = predict_3_mer(test_3_mer)
    if predict == code:
        correct += 1
    total += 1
    print(test_3_mer, 'input ', code, ' predict ', predict)
print('success "prediction" for training data is {:.1f}%'.format(100.*correct/total))

In [None]:
training_set = small_training_set + data[:100]
##Run this cell to create the training data
training_set_3 = [(k[1:4], v) for (k, v) in training_set]
training_vector = []
for seq, ss in training_set_3:
        inputVec = convertSeqToVector(seq, aaIndexDict)
        outputVec = convertSeqToVector(ss, ssIndexDict)
        training_vector.append( (inputVec, outputVec) )

In [None]:
wMatrixIn, wMatrixOut = neuralNetTrain(training_vector, 4, 3000)

In [None]:
correct = 0
total = 0
for test_3_mer, code in training_set_3:
    #print(test_3_mer, code)
    predict = predict_3_mer(test_3_mer)
    if predict == code:
        correct += 1
    total += 1
print('success "prediction" for training data is {:.1f}%'.format(100.*correct/total))

In [None]:
correct = 0
for test_5_mer, code in test_data_6aam:
    test_3_mer = test_5_mer[1:4]
    predict = predict_3_mer(test_3_mer)
    if predict == code:
        correct += 1
print('success prediction for 6AAM test data is {:.1f}%'.format(100.*correct/len(test_data_6aam)))


# lookup closest

Idea: look up the 'closest' value in the large dataset.
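The notebook does not say how 'closest' should be measured. The sketch below assumes Hamming distance (the number of mismatching positions) purely for illustration. The helper names and the three labelled 5-mers are also just examples, taken from the small training set.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
    return mismatches

def closest_lookup(query, labelled):
    """Return the (k-mer, label) pair with the fewest mismatches to query.
    Ties go to whichever pair comes first in the list."""
    return min(labelled, key=lambda item: hamming(query, item[0]))

labelled = [('DSLSA', 'H'), ('LILGD', 'S'), ('SAGYR', 'C')]
match, label = closest_lookup('DLLSA', labelled)
print(match, label)   # DSLSA H
```

A scan like this costs one comparison per labelled 5-mer for every query, so for the full CSV a dictionary of exact matches (as started in the cell below) would be a cheaper first step.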





In [None]:
exact_match_5 = {}
match_central_3 = {}
for k, v in data:
    exact_match_5[k] = v
    match_central_3[k[1:4]] = v
for k, v in data