# ProtVec: Amino Acid Embedding Representation of Proteins for Function Classification

## Objectives
1. Extract features from amino acid sequences for machine learning
2. Use features to predict protein family and other structural properties

## Abstract
This project attempts to reproduce the results from [Asgari 2015](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287) and to expand it to phage sequences and their protein families. Currently, Asgari's classification of protein families can be reproduced with his using his [trained embedding.](https://github.com/ehsanasgari/Deep-Proteomics). However, his results cannot be reproduced with current attempts to train using the skip-gram negative sampling method detailed in [this tutorial.](http://adventuresinmachinelearning.com/word2vec-keras-tutorial/) Training samples have been attempted with the SwissProt database. 

## Introduction

Predicting protein function with machine learning methods require informative features that is extracted from data. A natural language processing (NLP) technique, known as Word2Vec is used to represent a word by its context with a vector that encodes for the probability a context would occur for a word. These vectors are effective at representing meanings of words since words with similar meanings would have similar contexts. For example, the word cat and kitten would have similar contexts that they are used in since they have very similar meanings. These words would thus have very similar vectors. 


## Methods
1. Preprocessing
  1. Load dataset containing protein amino acid sequences and Asgari's embedding
  2. [Convert sequences to three lists of non-overlapping 3-mer words](https://www.researchgate.net/profile/Mohammad_Mofrad/publication/283644387/figure/fig4/AS:341292040114179@1458381771303/Protein-sequence-splitting-In-order-to-prepare-the-training-data-each-protein-sequence.png) 
  3. Convert 3-mers to numerical encoding using kmer indicies from Asgari's embedding (row dimension)
  4. Generate skipgrams with [Keras function](https://keras.io/preprocessing/sequence/)  
        Output: [target word, context word](http://mccormickml.com/assets/word2vec/training_data.png), label  
        Label refers to true or false target/context pairing generated for the negative sampling technique             
2. Training embedding
    1. Create negative sampling skipgram model with Keras [using technique from this tutorial](http://adventuresinmachinelearning.com/word2vec-keras-tutorial/)
3. Generate ProtVecs from embedding for a given protein sequence
    1. Break protein sequence to list of kmers
    2. Convert kmers to vectors by taking the dot product of its one hot vector with the embedding 
    3. Sum up all vectors for all kmers for a single vector representation for a protein (length 100)        
4. Classify protein function with ProtVec features (results currently not working, refer to R script)
    1. Use protvecs as training features
    2. Use pfam as labels
    3. For a given pfam classification, perform binary classification with all of its positive samples and randomly sample an equal amount of negative samples
    4. Train SVM model 
    
    
## Resources 
1. Intuition behind Word2Vec http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
2. Tutorial followed for implementation of skip-gram negative sampling (includes code) http://adventuresinmachinelearning.com/word2vec-keras-tutorial/
3. Introduction to protein function prediction
http://biofunctionprediction.org/cafa-targets/Introduction_to_protein_prediction.pdf

## Author
Mike Huang  
huangjmike@gmail.com

In [1]:
import pandas as pd
import numpy as np
from keras.preprocessing.sequence import skipgrams, pad_sequences, make_sampling_table
from keras.preprocessing.text import hashing_trick
from keras.layers import Embedding, Input, Reshape, Dense, merge
from keras.models import Sequential, Model
from sklearn.manifold import TSNE
from joblib import Parallel, delayed
import multiprocessing

import csv


#Load Ehsan Asgari's embeddings
#Source: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287
#Embedding: https://github.com/ehsanasgari/Deep-Proteomics
ehsanEmbed =  []
with open("protVec_100d_3grams.csv") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    for line in tsvreader:
        ehsanEmbed.append(line[0].split('\t'))
threemers = [vec[0] for vec in ehsanEmbed]
embeddingMat = [[float(n) for n in vec[1:]] for vec in ehsanEmbed]
threemersidx = {} #generate word to index translation dictionary. Use for kmersdict function arguments.
for i, kmer in enumerate(threemers):
    threemersidx[kmer] = i

    
#Load NCBI Phage processed dataset - 38420 sequences
#table = pd.read_csv("90filter.ncbi.statis_phage_gene.csv", index_col=0)
#Remove entries without vector representation due to amino acid sequence not reaching threshold length
#table = table[table['Protein'].apply(lambda x: type(x)!=float)]
#table[:10]

#Load Second NCBI Phage processed dataset - 99520 sequences
#cherry = pd.read_csv("CherryProteins.csv")
#cherry = cherry.loc[cherry['Component'] != 'HYP'] #Filter hypothetical sequences
#cherry  = cherry.loc[cherry['Component'] != 'UNS'] #Filter unsorted sequences
#cherryseqs = pd.read_csv("cherryaaseqs.csv")

#Load SwissProt 2015 data
swissprot = pd.read_csv("family_classification_metadata.tab", sep='\t')
swissprot['Sequence'] = pd.read_csv("family_classification_sequences.tab", sep='\t')

#Create non-redundant, concatenated sequences list between two NCBI datasets for training
#seqsunique = [seq[0] for seq in cherryseqs['Protein'].values if seq not in table['Protein'].values]
#uniqueinds = [i for i in range(len(cherryseqs)) if cherryseqs['Protein'].iloc[i] not in table['Protein'].values]
#cherryunq = cherryseqs.iloc[uniqueinds]
#cherryunq = cherryunq.append(table[['Function','Protein']])
cherryunq = pd.read_csv("cherryall.csv")

#Set parameters
vocabsize = len(threemersidx)
window_size = 25
num_cores = multiprocessing.cpu_count() #For parallel computing

#Path to model weights
weightspath = '38420sample10000000epochsAsgari.hdf'

Using TensorFlow backend.


### Ehsan's embedding trained with SwissProt 2015
The embedding has dimensions of 9048 x 100. 9048 represents one for each 3-mer. 100 is the size of the vector representation for each 3-mer. The matrix is a lookup table to get the vector for a given 3-mer. 3-mers not in the table are represented by unk.

![](http://mccormickml.com/assets/word2vec/word2vec_weight_matrix_lookup_table.png)

### Data preprocessing

Let's create the three lists of non-overlapping 3mers as described in the paper.

<img src="https://www.researchgate.net/profile/Mohammad_Mofrad/publication/283644387/figure/fig4/AS:341292040114179@1458381771303/Protein-sequence-splitting-In-order-to-prepare-the-training-data-each-protein-sequence.png" width="40%">

Next, encode each 3-mer to its row index in the embedding. 

In [4]:
#Convert sequences to three lists of non overlapping 3mers 
def kmerlists(seq):
    kmer0 = []
    kmer1 = []
    kmer2 = []
    for i in range(0,len(seq)-2,3):
        if len(seq[i:i+3]) == 3:
            kmer0.append(seq[i:i+3])
        i+=1
        if len(seq[i:i+3]) == 3:
            kmer1.append(seq[i:i+3])
        i+=1
        if len(seq[i:i+3]) == 3:
            kmer2.append(seq[i:i+3])
    return [kmer0,kmer1,kmer2]

#Same as kmerlists function but outputs an index number assigned to each kmer. Index number is from Asgari's embedding
def kmersindex(seqs, kmersdict=threemersidx):
    kmers = []
    for i in range(len(seqs)):
        kmers.append(kmerlists(seqs[i]))
    kmers = np.array(kmers).flatten().flatten(order='F')
    kmersindex = []
    for seq in kmers:
        temp = []
        for kmer in seq:
            try:
                temp.append(kmersdict[kmer])
            except:
                temp.append(kmersdict['<unk>'])
        kmersindex.append(temp)
    return kmersindex

sampling_table = make_sampling_table(vocabsize)
def generateskipgramshelper(kmersindicies): 
    couples, labels = skipgrams(kmersindicies, vocabsize, window_size=window_size, sampling_table=sampling_table)
    if len(couples)==0: 
        couples, labels = skipgrams(kmersindicies, vocabsize, window_size=window_size, sampling_table=sampling_table)
    if len(couples)==0:
        couples, labels = skipgrams(kmersindicies, vocabsize, window_size=window_size, sampling_table=sampling_table)
    else:
        word_target, word_context = zip(*couples)
        return word_target, word_context, labels
    
def generateskipgrams(seqs,kmersdict=threemersidx):
    kmersidx = kmersindex(seqs,kmersdict)
    return Parallel(n_jobs=num_cores)(delayed(generateskipgramshelper)(kmers) for kmers in kmersidx)


In [20]:
print("Sample sequence")
print(swissprot['Sequence'].iloc[0])
print("")
print("Convert sequence to list of kmers")
print(kmerlists(swissprot['Sequence'].iloc[0]))
print("")
print("Convert kmers to their index on the embedding")
print(kmersindex(swissprot['Sequence'].iloc[:1]))
print("")
testskipgrams = generateskipgrams(swissprot['Sequence'].iloc[:1])
print("Sample skipgram input:")
print("Word Target:", testskipgrams[0][0][0])
print("Word Context:", testskipgrams[0][1][0])
print("Label:", testskipgrams[0][2][0])

Sample sequence
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL

Convert sequence to list of kmers
[['MAF', 'SAE', 'DVL', 'KEY', 'DRR', 'RRM', 'EAL', 'LLS', 'LYY', 'PND', 'RKL', 'LDY', 'KEW', 'SPP', 'RVQ', 'VEC', 'PKA', 'PVE', 'WNN', 'PPS', 'EKG', 'LIV', 'GHF', 'SGI', 'KYK', 'GEK', 'AQA', 'SEV', 'DVN', 'KMC', 'CWV', 'SKF', 'KDA', 'MRR', 'YQG', 'IQT', 'CKI', 'PGK', 'VLS', 'DLD', 'AKI', 'KAY', 'NLT', 'VEG', 'VEG', 'FVR', 'YSR', 'VTK', 'QHV', 'AAF', 'LKE', 'LRH', 'SKQ', 'YEN', 'VNL', 'IHY', 'ILT', 'DKR', 'VDI', 'QHL', 'EKD', 'LVK', 'DFK', 'ALV', 'ESA', 'HRM', 'RQG', 'HMI', 'NVK', 'YIL', 'YQL', 'LKK', 'HGH', 'GPD', 'GPD', 'ILT', 'VKT', 'GSK', 'GVL', 'YDD', 'SFR', 'KIY', 'TDL', 'GWK', 'FTP'], ['AFS', 'AED', 'VLK', 'EYD', 'RRR', 'RME', 'ALL', 'LSL', 'YYP', 'NDR', 'KLL', 'DYK', 'EWS', 'P

In [3]:
# create some input variables
input_target = Input((1,))
input_context = Input((1,))
vector_dim = 100

embedding = Embedding(vocabsize, vector_dim, input_length=1, name='embedding')
embedding.build((None,))
embedding.set_weights(np.array([embeddingMat])) #Load Asgari's embedding as initial weights

target = embedding(input_target)
target = Reshape((vector_dim, 1))(target)
context = embedding(input_context)
context = Reshape((vector_dim, 1))(context)

# setup a cosine similarity operation which will be output in a secondary model
similarity = merge([target, context], mode='cos', dot_axes=0)

# now perform the dot product operation to get a similarity measure
dot_product = merge([target, context], mode='dot', dot_axes=1)
dot_product = Reshape((1,))(dot_product)
# add the sigmoid output layer
output = Dense(1, activation='sigmoid')(dot_product)

# create the primary training model
model = Model(input=[input_target, input_context], output=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop')

# create a secondary validation model to run our similarity checks during training
validation_model = Model(input=[input_target, input_context], output=similarity)

#model.load_weights(weightspath)
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1)             0                                            
____________________________________________________________________________________________________
input_2 (InputLayer)             (None, 1)             0                                            
____________________________________________________________________________________________________
embedding (Embedding)            (None, 1, 100)        904800      input_1[0][0]                    
                                                                   input_2[0][0]                    
____________________________________________________________________________________________________
reshape_1 (Reshape)              (None, 100, 1)        0           embedding[0][0]         

  from ipykernel import kernelapp as app
  name=name)


In [4]:
reverse_dictionary = threemers
vocab_size = vocabsize
class SimilarityCallback:
    def run_sim(self):
        for i in range(valid_size):
            valid_word = reverse_dictionary[valid_examples[i]]
            top_k = 8  # number of nearest neighbors
            sim = self._get_sim(valid_examples[i])
            nearest = (-sim).argsort()[1:top_k + 1]
            log_str = 'Nearest to %s:' % valid_word
            for k in range(top_k):
                close_word = reverse_dictionary[nearest[k]]
                log_str = '%s %s,' % (log_str, close_word)
            print(log_str)

    @staticmethod
    def _get_sim(valid_word_idx):
        sim = np.zeros((vocab_size,))
        in_arr1 = np.zeros((1,))
        in_arr2 = np.zeros((1,))
        for i in range(vocab_size):
            in_arr1[0,] = valid_word_idx
            in_arr2[0,] = i
            out = validation_model.predict_on_batch([in_arr1, in_arr2])
            sim[i] = out
        return sim
sim_cb = SimilarityCallback()

## Training embedding with SwissProt 2015, sequential sampling

In [None]:
epochs = 1
samplesize = len(swissprot)
valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
chunklength = 1000
#ite=0
losses = 0
#print("Loading part",ite+1,"/",int(samplesize/chunklength), "of data, length of ",samplesize)
#kmerskipgrams=generateskipgrams(swissprot['Sequence'].iloc[:chunklength-1].values,threemersidx)

arr_1 = np.zeros((1,))
arr_2 = np.zeros((1,))
arr_3 = np.zeros((1,))

for epoch in range(epochs):
    for cnt in range(samplesize):
        idx = cnt-chunklength*ite
        if type(kmerskipgrams[idx]) == tuple:
            for idx2 in range(len(kmerskipgrams[idx][0])):
                arr_1[0,] = kmerskipgrams[idx][0][idx2]
                arr_2[0,] = kmerskipgrams[idx][1][idx2]
                arr_3[0,] = kmerskipgrams[idx][2][idx2]
                loss = model.train_on_batch([arr_1, arr_2], arr_3)
                #losses += loss
                if idx2 % 1000 == 0:
                    #print("Iteration {}, loss={}".format(cnt, loss), "Average loss in last 1000 samples", losses/1000)
                    print("Iteration {}, loss={}".format(cnt, loss))
                    losses = 0
        if cnt % 1000 == 0:
            sim_cb.run_sim()
        if cnt % chunklength == 0 and cnt != 0:
            ite+=1
            print("Loading ",ite+1,"/",int(samplesize/chunklength),"part of data")
            del kmerskipgrams
            kmerskipgrams=generateskipgrams(swissprot['Sequence'].iloc[ite*chunklength:(ite+1)*chunklength-1].values,threemersidx)    
        if cnt % samplesize/2 == 0:
            model.save_weights('swissprothalf.hdf')
model.save_weights('swissprot.hdf')


In [32]:
#embeddingweights = model.layers[2].get_weights()[0]

def protvec(kmersdict, seq, embeddingweights=embeddingMat):
    #Convert seq to three lists of kmers 
    kmerlist = kmerlists(seq) 
    kmerlist = [j for i in kmerlist for j in i]
    #Convert center kmers to their vector representations
    kmersvec = [0]*100
    for kmer in kmerlist:
        try:
            kmersvec = np.add(kmersvec,embeddingweights[kmersdict[kmer]])
        except:
            kmersvec = np.add(kmersvec,embeddingweights[kmersdict['<unk>']])
    return kmersvec

def formatprotvecs(protvecs):
    #Format protvecs for classifier inputs by transposing the matrix
    protfeatures = []
    for i in range(100):
        protfeatures.append([vec[i] for vec in protvecs])
    protfeatures = np.array(protfeatures).reshape(len(protvecs),len(protfeatures))
    return protfeatures

def formatprotvecsnormalized(protvecs):
    #Formatted protvecs with feature normalization
    protfeatures = []
    for i in range(100):
        tempvec = [vec[i] for vec in protvecs]
        mean = np.mean(tempvec)
        var = np.var(tempvec)
        protfeatures.append([(vec[i]-mean)/var for vec in protvecs])
    protfeatures = np.array(protfeatures).reshape(len(protvecs),len(protfeatures))
    return protfeatures

def sequences2protvecsCSV(filename, seqs, kmersdict=threemersidx, embeddingweights=embeddingMat):
    #Convert a list of sequences to protvecs and save protvecs to a csv file
    #ARGUMENTS;
    #filename: string, name of csv file to save to, i.e. "sampleprotvecs.csv"
    #seqs: list, list of amino acid sequences
    #kmersdict: dict to look up index of kmer on embedding, default: Asgari's embedding index
    #embeddingweights: 2D list or np.array, embedding vectors, default: Asgari's embedding vectors

    swissprotvecs = Parallel(n_jobs=num_cores)(delayed(protvec)(kmersdict, seq, embeddingweights) for seq in seqs)
    swissprotvecsdf = pd.DataFrame(formatprotvecs(swissprotvecs))
    swissprotvecsdf.to_csv(filename, index=False)
    return swissprotvecsdf

In [33]:
sequences2protvecsCSV("testprotvecs.csv", swissprot['Sequence'][:5])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-15.906601,-15.786575,-12.858975,-20.429255,-25.811314,-1.425858,-0.273261,1.218515,-2.768131,-1.975148,...,-2.353896,1.683457,-2.855346,-4.20035,-6.863898,-8.97545,0.160262,-3.139883,-12.887104,-15.625758
1,-1.501441,-2.576011,0.572682,3.01789,3.397684,-8.050783,-7.461454,-4.145822,-1.219607,-9.140709,...,0.279602,4.074085,3.937202,-5.439727,0.134528,-1.72766,-6.800861,-4.274481,-0.687948,-6.683564
2,4.571641,6.08794,-0.093153,1.788396,3.143507,0.587536,4.343174,3.281648,4.594139,6.544705,...,-5.428951,-8.151144,-4.075138,-3.999045,-6.826098,-19.353062,-13.091738,-14.026761,-31.865056,-33.461093
3,1.186221,15.03131,4.09964,-7.429306,0.673389,12.435832,28.053988,13.094559,18.319831,23.192327,...,3.338261,7.016894,-0.624116,0.052345,3.364417,-5.378887,-2.588111,-7.852132,-9.4,-9.117914
4,-11.274578,-12.735979,-9.380683,-24.705562,-24.733077,-0.129001,-6.032588,2.454767,6.513942,1.185852,...,-4.999941,-8.337236,-1.482302,2.712613,-1.113474,18.634332,28.020647,13.635002,25.419577,33.273547


## Classification of Protein Function Category

In [None]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report,log_loss
from sklearn.model_selection import cross_val_score
from scipy.ndimage.measurements import center_of_mass, label
from skimage.measure import regionprops
from sklearn.cross_validation import ShuffleSplit
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
from scipy.stats import percentileofscore

In [None]:
lb = LabelBinarizer()
binlab=lb.fit_transform(table['labels'])

In [None]:
n_splits=10
kfold=StratifiedKFold(n_splits=n_splits, shuffle=True)

models=[RandomForestClassifier(),
        GradientBoostingClassifier(),]
name=["Random Forest", "Gradient Boosting"]

predictedmodels={}

for nm, clf in zip(name[:-1], models[:-1]):
    print(nm)
    predicted=[]
    for train,test in kfold.split(featuretable,binlab[:,0]):
    scores=cross_val_score(clf,featuretable, binlab, cv=StratifiedKFold(n_splits=n_splits, shuffle=True), n_jobs=-1, scoring='neg_log_loss')
    print("Cross-validated logloss",-np.mean(scores))
    print("---------------------------------------")



In [None]:
clf.fit(featuretable, binlab)

In [None]:
featuretable = pd.DataFrame(np.array(protfeatures).reshape(9173,100))

In [None]:
#Binary classification of single function
intsamples = table.loc[table['labels']=='INF']
intsamples['binarylabel'] = [1]*len(intsamples)
nonint = table.loc[table['labels'] != 'INF']
nonint['binarylabel'] = [0]*len(nonint)
intsamples = intsamples.append(nonint.sample(frac=len(intsamples)/len(nonint)))
intsamples = intsamples.sample(frac=1)

In [None]:
features = formatprotvecs(intsamples['ProtVecs'].values)


In [None]:
models=[LogisticRegression(C=0.1),
        RandomForestClassifier(),
        GradientBoostingClassifier(),
    SVC(C=0.02,kernel='rbf', probability=True)]
name=["Logistic Regression","Random Forest", "Gradient Boosting","SVM rbf kernel"]

predictedmodels={}

for nm, clf in zip(name, models):
    print(nm)
    scores=cross_val_score(clf,features, intsamples['binarylabel'], cv=StratifiedKFold(n_splits=n_splits, shuffle=True), n_jobs=-1, scoring='neg_log_loss')
    print("Cross-validated logloss",-np.mean(scores))
    print("---------------------------------------")
    



In [None]:
import matplotlib.pyplot as plt

def BinaryClassification(x,y):
    n_splits=10
    kfold=StratifiedKFold(n_splits=n_splits, shuffle=True)
    models=[LogisticRegression(C=0.1),
            RandomForestClassifier(),
            GradientBoostingClassifier(),
        SVC(C=1,kernel='rbf')]
    name=["Logistic Regression", "Random Forest", "Gradient Boosting", "SVM with rbf kernel"]

    predictedmodels={}

    for nm, clf in zip(name[-1:], models[-1:]):
        print(nm)
        predicted=[]
        labelcv=[]
        for train,test in kfold.split(x, y):
            clf.fit(x[train],y[train])
            predicted.append(clf.predict(x[test]))
            labelcv.append(y[test])
        #scores=cross_val_score(clf,x, y, cv=StratifiedKFold(n_splits=n_splits, shuffle=True), n_jobs=-1, scoring='neg_log_loss')
        predicted=np.concatenate(np.array(predicted),axis=0)
        labelcv=np.concatenate(np.array(labelcv),axis=0)
        predictedmodels[nm]=predicted
        #roc=roc_curve(labelcv,predicted)
        #print("Average precision score:", average_precision_score(labelcv,predicted))
        #print("Area under curve:", auc(roc[0],roc[1]))
        #plt.plot(roc[0],roc[1])
        #print(-scores)
        print(classification_report(labelcv,predicted))
        print(confusion_matrix(labelcv,predicted))
        print("Cross-validated logloss",-np.mean(scores))
        print("---------------------------------------")
        #plt.plot(rocrandom[0],rocrandom[1])
    #plt.title('ROC')
    #plt.ylabel('TPrate')
    #plt.xlabel('FPrate')
    #plt.legend(name)
    #plt.savefig("clfroccomparison.png",dpi=300)
    #plt.show()

In [None]:
#Set up accuracy, sensitivty, specificity as evaluation scorews
#Normalize features
#Try Asgari 2015 initial weights
#Try regularization


## Replicate Asgari 2015 Protein family classification results

In [5]:
swissprot = pd.read_csv("family_classification_metadata.tab", sep='\t')
swissprot['Sequence'] = pd.read_csv("family_classification_sequences.tab", sep='\t')

In [None]:
swissprot.loc[swissprot['FamilyDescription'] == '50S ribosome-binding GTPase']

In [None]:
del swissprotseq

In [None]:
concordance = [table.iloc[i] for i in range(len(table)) if table['Protein'].iloc[i] in swissprot['Sequence'].values]

In [None]:
results = Parallel(n_jobs=num_cores)(delayed(generateskipgrams)(kmers) for kmers in kmersindex[20000:])
word_target = []
word_context = []
labels = []
for sample in results:
    if type(sample) == tuple:
        word_target += sample[0]
        word_context += sample[1]
        labels += sample[2]
del results

In [None]:
#Sample 50S ribosome-binding GTPase and equal amount of negative cases
def SampleBinaryClassification(table, function,ProtVecs):
    pos = table.loc[table['FamilyDescription'] == function]
    neg = table.loc[table['FamilyDescription'] != function]
    pos['binarylabel'] = np.ones(len(pos), dtype=bool)
    neg = neg.sample(frac=len(pos)/len(neg))
    neg['binarylabel'] = np.zeros(len(neg), dtype=bool)
    pos = pos.append(neg)
    pos = pos.sample(frac=1)
    #print("Generating ProtVecs")
    #ProtVecs = Parallel(n_jobs=num_cores)(delayed(protvec)(len(threemers), threemersidx, seq, embeddingMat) for seq in pos['Sequence'])
    #pos['ProtVecs'] = ProtVecs
    features = formatprotvecs(ProtVecs)
    BinaryClassification(features,pfambinary['binarylabel'].values)
    return pos


In [None]:
#ProtVecs = Parallel(n_jobs=num_cores)(delayed(protvec)(threemersidx, seq, embeddingMat) for seq in pfambinary['Sequence'])

In [None]:
SampleBinaryClassification(swissprot,'50S ribosome-binding GTPase',famclass.iloc[pfambinary.index].values)

In [None]:
features = formatprotvecsnormalized(famclass.iloc[pfambinary.index].values)
#labels = LabelBinarizer().fit_transform(pfambinary['binarylabel'].values)
BinaryClassification(features,pfambinary['binarylabel'].values)

In [None]:
features = formatprotvecsnormalized(famclass.iloc[pfambinary.index].values)
#labels = LabelBinarizer().fit_transform(pfambinary['binarylabel'].values)
BinaryClassification(features,pfambinary['binarylabel'].values)

In [None]:
labels = pfambinary['binarylabel'].values
def fit_model(X, y, clf):
    cv_sets = ShuffleSplit(X.shape[0], n_iter = 5, test_size = 0.20, random_state = 42)
    params = {'C':np.arange(10,100),
             'gamma':np.arange(1e-2,1e-1)}
    grid = GridSearchCV(clf, params, cv=cv_sets, n_jobs=-1)
    grid = grid.fit(X, y)
    return grid.best_params_, grid.best_score_, grid.best_estimator_

best_params, best_score, optimal_svm=fit_model(features,labels,SVC())

print("The best parameters are %s with a score of %0.2f"
      % (best_params, best_score))
print(optimal_svm)

name=["Optimized SVM"]
print(name)
#scores=cross_val_score(optimal_gb,inputfeatures[featurelist], malignantlabel, cv=5, scoring='neg_log_loss')
#print(-scores)
#print("Cross-validated logloss",-np.mean(scores))
print("---------------------------------------")
clf=optimal_svm
clf.fit(features[train],labels[train])
print(classification_report(labels[test],clf.predict(features[test])))
print(confusion_matrix(labels[test],clf.predict(features[test])))
#roc=roc_curve(Ytest,clf.predict_proba(Xtest[featurelist])[:,1])
#print(clf.feature_importances)
#ROC curve
#plt.plot(roc[0],roc[1], alpha=0.5)
#plt.plot(rocrandom[0],rocrandom[1])

#scores=cross_val_score(GradientBoostingClassifier(),inputfeatures[featurelist], malignantlabel, cv=5, scoring='neg_log_loss')
#print(classification_report(Ytest,model.predict(Xtest[featurelist])))
#print(-scores)
#print("Cross-validated logloss",-np.mean(scores))
#print("---------------------------------------")
#clf=SVC()
#clf.fit(Xtrain[featurelist],Ytrain)
#roc=roc_curve(Ytest,clf.predict_proba(Xtest[featurelist])[:,1])

In [None]:
#Collect cherry annotated dataset
#Import into R and use bioconductor to translate DNAseqs to AASeqs
#Get protvecs for each AAseq
#Load into classifier to determine prediction rate for each category