## Combine BLASTX results with the results from the GO annotation

The goal here is to take the BLASTX output (from Swissprot and TrEMBL) and combine it with the GO annotation details in one big dataframe.

After merging the two dataframes, I write the merged table back to the folder.

I will be working with the output from the BLASTX query against Swissprot and TrEMBL, `denovo_annotateSwissProt.tab`, and with the output from the GO annotation, `uniprotKB.tab`.

In [1]:
import pandas as pd
import glob

In [3]:
# Read in the tab-delimited GO annotation file

go = pd.read_csv('uniprotKB.tab', sep = '\t')

In [11]:
go.head(n=5) # take a look at the top rows

Unnamed: 0,Entry,Entry name,Status,Protein names,Gene names,Organism,Length,Gene ontology (biological process),Gene ontology (cellular component),Gene ontology (GO),Gene ontology (molecular function),Gene ontology IDs
0,P64264,Y1310_MYCBO,reviewed,Uncharacterized GMC-type oxidoreductase Mb1310...,BQ2027_MB1310,Mycobacterium bovis (strain ATCC BAA-935 / AF2...,528,,,flavin adenine dinucleotide binding [GO:005066...,flavin adenine dinucleotide binding [GO:005066...,GO:0016614; GO:0050660
1,Q9KWW9,FLAB3_TREMA,reviewed,Flagellar filament 30.7 kDa core protein (Flag...,flaB3,Treponema maltophilum,285,bacterial-type flagellum-dependent cell motili...,bacterial-type flagellum filament [GO:0009420]...,bacterial-type flagellum filament [GO:0009420]...,structural molecule activity [GO:0005198],GO:0005198; GO:0009420; GO:0055040; GO:0071973
2,Q9BWF2,TRAIP_HUMAN,reviewed,E3 ubiquitin-protein ligase TRAIP (EC 2.3.2.27...,TRAIP RNF206 TRIP,Homo sapiens (Human),469,apoptotic process [GO:0006915]; negative regul...,nucleolus [GO:0005730]; perinuclear region of ...,nucleolus [GO:0005730]; perinuclear region of ...,metal ion binding [GO:0046872]; ubiquitin prot...,GO:0005730; GO:0006915; GO:0007165; GO:0010804...
3,Q680P8,RS29_ARATH,reviewed,40S ribosomal protein S29,RPS29A At3g43980 T15B3.120; RPS29B At3g44010 T...,Arabidopsis thaliana (Mouse-ear cress),56,translation [GO:0006412],cytosol [GO:0005829]; cytosolic small ribosoma...,cytosol [GO:0005829]; cytosolic small ribosoma...,structural constituent of ribosome [GO:0003735...,GO:0003735; GO:0005829; GO:0005886; GO:0006412...
4,P9WPH9,ECCA1_MYCTU,reviewed,ESX-1 secretion system protein EccA1 (ESX cons...,eccA1 Rv3868 MTV027.03,Mycobacterium tuberculosis (strain ATCC 25618 ...,573,growth of symbiont in host [GO:0044117],cytoplasm [GO:0005737]; plasma membrane [GO:00...,cytoplasm [GO:0005737]; plasma membrane [GO:00...,ATPase activity [GO:0016887]; ATP binding [GO:...,GO:0005524; GO:0005737; GO:0005886; GO:0016887...


In [None]:
# How many rows are in the GO data frame?
len(go.index)

In [22]:
# Read in the swissprot and trembl data frame. The file is tab delimited, and some fields are also pipe delimited.
# Use a regular expression after the 'sep' argument to specify multiple delimiters. I got a warning, so I had to include engine='python'.
# Specify that there is no header and enter the column names manually.

spt = pd.read_csv('denovo_annotateSwissProt.tab', sep = r'[\t|]', engine='python', header=None, names=['contig', 'db', 'entry', 'entryname', 'percent-id', 'align-length', 'mismatch', 'gap-opens', 'q.start', 'q.end', 's.start', 's.end', 'eval', 'bitscore'])
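
The separator `[\t|]` is a regular-expression character class that matches either a tab or a pipe. The sketch below shows its effect on a single line. It assumes the raw file is BLAST tabular output whose subject ID looks like `sp|P64264|Y1310_MYCBO`; the sample line is reconstructed from the values in the first row of `spt` and is a guess at the raw layout, not a copy of the file.

```python
import io
import pandas as pd

# Assumed raw layout: tab-delimited BLAST hit with a pipe-delimited subject ID.
# The line is rebuilt from the values in the first row of spt (hypothetical raw form).
raw = ("TRINITY_DN88414_c0_g1_i1\tsp|P64264|Y1310_MYCBO\t32.258\t155\t81\t8"
       "\t434\t18\t379\t525\t8.02e-07\t51.6\n")

cols = ['contig', 'db', 'entry', 'entryname', 'percent-id', 'align-length',
        'mismatch', 'gap-opens', 'q.start', 'q.end', 's.start', 's.end',
        'eval', 'bitscore']

# Splitting on tab OR pipe turns the 12 tab-separated fields into 14 columns.
demo = pd.read_csv(io.StringIO(raw), sep=r'[\t|]', engine='python',
                   header=None, names=cols)
print(demo.loc[0, ['db', 'entry', 'entryname']].tolist())
```

One caveat with this approach: a pipe anywhere else in a line would also be treated as a delimiter and shift the columns, so it is worth checking that every row ends up with 14 fields.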


In [23]:
# Take a look at the top rows
spt.head(n=5)

Unnamed: 0,contig,db,entry,entryname,percent-id,align-length,mismatch,gap-opens,q.start,q.end,s.start,s.end,eval,bitscore
0,TRINITY_DN88414_c0_g1_i1,sp,P64264,Y1310_MYCBO,32.258,155,81,8,434,18,379,525,8.02e-07,51.6
1,TRINITY_DN88485_c0_g1_i1,sp,Q9KWW9,FLAB3_TREMA,76.056,71,17,0,213,1,26,96,5.8899999999999995e-30,109.0
2,TRINITY_DN88449_c0_g1_i1,sp,Q9BWF2,TRAIP_HUMAN,31.122,196,109,10,641,87,7,187,6.35e-15,76.3
3,TRINITY_DN88424_c0_g1_i1,sp,Q680P8,RS29_ARATH,80.0,35,7,0,214,110,22,56,5.33e-14,62.8
4,TRINITY_DN88445_c0_g1_i1,sp,P9WPH9,ECCA1_MYCTU,61.728,81,29,2,2,238,370,450,2.2e-28,93.6


In [24]:
len(spt.index) #take a look at the length of this dataframe.

68659

In [26]:
# Left-merge the two data frames so that every row of the swissprot/trembl table is kept

sptgo = spt.merge(go, how='left', left_on='entryname', right_on='Entry name')
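
A left merge keeps every row of the left table, but the row count only stays the same if each key appears at most once in the right table. Here is a minimal sketch of how that could be checked, using toy stand-ins for `spt` and `go` (the values are hypothetical; only the key column names match the real tables). `validate='many_to_one'` makes pandas raise an error if `Entry name` is duplicated on the right, and `indicator=True` adds a `_merge` column that marks hits with no GO record.

```python
import pandas as pd

# Toy stand-ins for spt and go (hypothetical values, same key column names).
spt_demo = pd.DataFrame({
    'contig': ['c1', 'c2', 'c3'],
    'entryname': ['A_HUMAN', 'A_HUMAN', 'B_MOUSE'],
})
go_demo = pd.DataFrame({
    'Entry name': ['A_HUMAN'],
    'Gene ontology IDs': ['GO:0005730'],
})

merged = spt_demo.merge(
    go_demo, how='left', left_on='entryname', right_on='Entry name',
    validate='many_to_one',  # raises MergeError if 'Entry name' repeats in go_demo
    indicator=True,          # adds '_merge': 'both' or 'left_only'
)

# Rows of the left table that found no matching GO record.
n_unmatched = (merged['_merge'] == 'left_only').sum()
print(len(merged), n_unmatched)
```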

In [27]:
sptgo.head(n=3)

Unnamed: 0,contig,db,entry,entryname,percent-id,align-length,mismatch,gap-opens,q.start,q.end,...,Status,Protein names,Gene names,Organism,Length,Gene ontology (biological process),Gene ontology (cellular component),Gene ontology (GO),Gene ontology (molecular function),Gene ontology IDs
0,TRINITY_DN88414_c0_g1_i1,sp,P64264,Y1310_MYCBO,32.258,155,81,8,434,18,...,reviewed,Uncharacterized GMC-type oxidoreductase Mb1310...,BQ2027_MB1310,Mycobacterium bovis (strain ATCC BAA-935 / AF2...,528.0,,,flavin adenine dinucleotide binding [GO:005066...,flavin adenine dinucleotide binding [GO:005066...,GO:0016614; GO:0050660
1,TRINITY_DN88485_c0_g1_i1,sp,Q9KWW9,FLAB3_TREMA,76.056,71,17,0,213,1,...,reviewed,Flagellar filament 30.7 kDa core protein (Flag...,flaB3,Treponema maltophilum,285.0,bacterial-type flagellum-dependent cell motili...,bacterial-type flagellum filament [GO:0009420]...,bacterial-type flagellum filament [GO:0009420]...,structural molecule activity [GO:0005198],GO:0005198; GO:0009420; GO:0055040; GO:0071973
2,TRINITY_DN88449_c0_g1_i1,sp,Q9BWF2,TRAIP_HUMAN,31.122,196,109,10,641,87,...,reviewed,E3 ubiquitin-protein ligase TRAIP (EC 2.3.2.27...,TRAIP RNF206 TRIP,Homo sapiens (Human),469.0,apoptotic process [GO:0006915]; negative regul...,nucleolus [GO:0005730]; perinuclear region of ...,nucleolus [GO:0005730]; perinuclear region of ...,metal ion binding [GO:0046872]; ubiquitin prot...,GO:0005730; GO:0006915; GO:0007165; GO:0010804...


In [28]:
len(sptgo.index)

68659

In [31]:
# Check to make sure all the column names are there.
for col in sptgo.columns: 
    print(col)

contig
db
entry
entryname
percent-id
align-length
mismatch
gap-opens
q.start
q.end
s.start
s.end
eval
bitscore
Entry
Entry name
Status
Protein names
Gene names
Organism
Length
Gene ontology (biological process)
Gene ontology (cellular component)
Gene ontology (GO)
Gene ontology (molecular function)
Gene ontology IDs


In [34]:
sptgo.to_csv('BLASTX_and_GO_merged.tab', sep = "\t") # Write out a tab delimited file containing the merged information
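
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. If that column is not wanted, `index=False` leaves it out. A small sketch with a toy dataframe (hypothetical values) that writes to a string instead of a file:

```python
import io
import pandas as pd

# Toy data frame standing in for sptgo (hypothetical values).
demo = pd.DataFrame({'contig': ['c1', 'c2'], 'bitscore': [51.6, 109.0]})

# With no path, to_csv returns the text it would have written.
with_index = demo.to_csv(sep='\t')                  # default: index is written
without_index = demo.to_csv(sep='\t', index=False)  # index is left out

back_with = pd.read_csv(io.StringIO(with_index), sep='\t')
back_without = pd.read_csv(io.StringIO(without_index), sep='\t')

print(list(back_with.columns))
print(list(back_without.columns))
```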