## `blastx` and Uniprot file merging

### based on this script from EIMD 2019: https://github.com/eimd-2019/project-EWD-transcriptomics/blob/master/scripts/2019-07-10-blastx-Uniprot-File-Merging.ipynb

#### In this notebook, I'll merge the `blast` output from SR for infected and uninfected _C. bairdi_ with information from the Uniprot-Swissprot database.

## Step 0. Set working directory

In [1]:
pwd

u'/Volumes/toaster/grace/gitrepos/project-crab/notebooks'

In [2]:
wd = '/Volumes/toaster/grace/gitrepos/project-crab/analyses/'

In [3]:
cd {wd}

/Volumes/toaster/grace/gitrepos/project-crab/analyses


In [4]:
ls -F

091419-crab-blastx-output.csv         _blast-sep3.tab
1111-infected-blastout-d12-d26.csv    _blast-sort-crab.tab
1111-infected-d12-d26.csv             _blast-sort.tab
1111-uninfected-blastout-d12-d26.csv  _intermediate.file
1111-uninfected-d12-d26.csv           crab-Blastquery-GOslim.sorted
304428_L1/                            crab-GOslim-count.csv
304428_L2/                            crab-GOslim.csv
329774_L1/                            crab-blastx-sp-full.tab
329774_L2/                            crab-blastx-sp.tab
329775_L1/                            crab-forRevigo.tab
329775_L2/                            crab-stress-genes.tab
329776_L1/                            crab-stress-uniprotID.tab
329776_L2/                            hemat-Blastquery-GOslim.sorted
329777_L1/                            hemat-GOslim-count.csv
329777_L2/                            he

## Step 1. Format `blast` output

### Step 1a. Infected

In [5]:
!head -2 1111-infected-blastout-d12-d26.csv

﻿target_id,up_ID,evalue
TRINITY_DN30057_c0_g1_i1,sp|Q641I1|F135B_XENLA,4.19E-25


In [6]:
#convert commas to tabs
!tr ',' '\t' < 1111-infected-blastout-d12-d26.csv \
> inf-nocommas.csv

In [7]:
!head -2 inf-nocommas.csv

﻿target_id	up_ID	evalue
TRINITY_DN30057_c0_g1_i1	sp|Q641I1|F135B_XENLA	4.19E-25


In [8]:
#convert pipes to tabs
!tr '|' '\t' < inf-nocommas.csv \
> infected-blast-sep.tab

In [9]:
!head -2 infected-blast-sep.tab

﻿target_id	up_ID	evalue
TRINITY_DN30057_c0_g1_i1	sp	Q641I1	F135B_XENLA	4.19E-25


In [10]:
#remove column names
!awk NR\>1 infected-blast-sep.tab > infected-blast-sep2.tab

In [11]:
!head -2 infected-blast-sep2.tab

TRINITY_DN30057_c0_g1_i1	sp	Q641I1	F135B_XENLA	4.19E-25
TRINITY_DN49483_c0_g1_i1	sp	Q6VH22	IF172_MOUSE	1.09E-41


In [12]:
#reduce the number of columns, sort
!awk -v OFS='\t' '{print $3, $1, $5}' < infected-blast-sep2.tab | sort \
> infected-blast-sort.tab

In [13]:
!head -2 infected-blast-sort.tab

A0A0R4IBK5	TRINITY_DN56324_c0_g1_i1	2.73E-27
A0CDD4	TRINITY_DN7741_c2_g1_i4	1.11E-119


In [14]:
!wc infected-blast-sort.tab
!echo "infected transcripts"

    2384    7152   97896 infected-blast-sort.tab
infected transcripts
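
The `tr`/`awk`/`sort` steps above can also be sketched as a single Python function. This is a minimal sketch, assuming Python 3 and the three-column layout shown in the `head` output (`target_id`, `up_ID`, `evalue`). The name `format_blast_rows` is hypothetical and not part of the original notebook.

```python
import csv
import io


def format_blast_rows(csv_text):
    """Return sorted (accession, target_id, evalue) tuples from blast CSV text."""
    # Drop a leading byte-order mark, which appears before `target_id`
    # in the `head` output above.
    csv_text = csv_text.lstrip("\ufeff")
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row (same effect as `awk NR>1`)
    rows = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        target_id, up_id, evalue = row
        # up_ID looks like "sp|Q641I1|F135B_XENLA"; keep the middle field.
        _db, accession, _entry_name = up_id.split("|")
        rows.append((accession, target_id, evalue))
    # Tuple sorting orders by accession first, in code-point order,
    # which can differ from a locale-aware shell `sort`.
    return sorted(rows)


# The two records shown in the `head` output above, in CSV form.
example = (
    "\ufefftarget_id,up_ID,evalue\n"
    "TRINITY_DN30057_c0_g1_i1,sp|Q641I1|F135B_XENLA,4.19E-25\n"
    "TRINITY_DN49483_c0_g1_i1,sp|Q6VH22|IF172_MOUSE,1.09E-41\n"
)
for row in format_blast_rows(example):
    print("\t".join(row))
```

Stripping the byte-order mark here is optional for this pipeline, since the header row is discarded anyway, but it matters if the header is ever used for column names.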


### Step 1b. Uninfected

In [15]:
!head -2 1111-uninfected-blastout-d12-d26.csv

﻿target_id,up_ID,evalue
TRINITY_DN1627_c0_g1_i1,sp|Q08ER8|ZN543_HUMAN,1.42E-28


In [16]:
#convert commas to tabs
!tr ',' '\t' < 1111-uninfected-blastout-d12-d26.csv \
> uninf-nocommas.csv

In [17]:
!head -2 uninf-nocommas.csv

﻿target_id	up_ID	evalue
TRINITY_DN1627_c0_g1_i1	sp|Q08ER8|ZN543_HUMAN	1.42E-28


In [18]:
#convert pipes to tabs
!tr '|' '\t' < uninf-nocommas.csv \
> uninfected-blast-sep.tab

In [19]:
!head -2 uninfected-blast-sep.tab

﻿target_id	up_ID	evalue
TRINITY_DN1627_c0_g1_i1	sp	Q08ER8	ZN543_HUMAN	1.42E-28


In [20]:
#remove column names
!awk NR\>1 uninfected-blast-sep.tab > uninfected-blast-sep2.tab

In [21]:
!head -2 uninfected-blast-sep2.tab

TRINITY_DN1627_c0_g1_i1	sp	Q08ER8	ZN543_HUMAN	1.42E-28
TRINITY_DN14986_c0_g1_i1	sp	G5E8P0	GCP6_MOUSE	7.94E-23


In [22]:
#reduce the number of columns, sort
!awk -v OFS='\t' '{print $3, $1, $5}' < uninfected-blast-sep2.tab | sort \
> uninfected-blast-sort.tab

In [23]:
!head -2 uninfected-blast-sort.tab

G5E8P0	TRINITY_DN14986_c0_g1_i1	7.94E-23
P33527	TRINITY_DN7707_c0_g1_i8	0


In [24]:
!wc uninfected-blast-sort.tab
!echo "uninfected transcripts"

      27      81    1063 uninfected-blast-sort.tab
uninfected transcripts


## Step 2. Format Uniprot-Swissprot database

#### The uniprot annotation file was downloaded from [this link](https://www.uniprot.org/uniprot/?query=reviewed:yes) on 2019-11-11. The following information was included as separate columns: 
- Entry (Uniprot Accession code)
- Protein Names
- Gene ontology (biological process)
- Gene ontology (cellular component)
- Gene ontology (molecular function)
- Gene ontology IDs
- Status (reviewed or not reviewed)
- Organism 

In [25]:
!ls

091419-crab-blastx-output.csv        crab-GOslim-count.csv
1111-infected-blastout-d12-d26.csv   crab-GOslim.csv
1111-infected-d12-d26.csv            crab-blastx-sp-full.tab
1111-uninfected-blastout-d12-d26.csv crab-blastx-sp.tab
1111-uninfected-d12-d26.csv          crab-forRevigo.tab
304428_L1                            crab-stress-genes.tab
304428_L2                            crab-stress-uniprotID.tab
329774_L1                            hemat-Blastquery-GOslim.sorted
329774_L2                            hemat-GOslim-count.csv
329775_L1                            hemat-blastx-sp-full.tab
329775_L2                            hemat-blastx-sp.tab
329776_L1                            hemat-forRevigo.tab
329776_L2                            inf-nocommas.csv
329777_L1                            infected-Blastquery-GOslim.sorted
329777_L2                            inf

In [51]:
!head -n2 uniprot-reviewed-yes.tab

Entry name Protein names Gene ontology (biological process) Gene ontology (cellular component) Gene ontology (molecular function) Gene ontology IDs Status Organism
B7NR61 Pantothenate kinase (EC 2.7.1.33) (Pantothenic acid kinase) coenzyme A biosynthetic process [GO:0015937] cytoplasm [GO:0005737] ATP binding [GO:0005524]; pantothenate kinase activity [GO:0004594] GO:0004594; GO:0005524; GO:0005737; GO:0015937 reviewed Escherichia coli O7:K1 (strain IAI39 / ExPEC)


In [39]:
#sort the file by the first column (-k 1) which is the Uniprot Entry (Uniprot Accession code)
!sort uniprot-reviewed-yes-2.tab -k 1 > uniprot-SP-GO-sorted.tab

In [49]:
!head -10 uniprot-SP-GO-sorted.tab

001R_FRG3G Putative transcription factor 001R regulation of viral transcription [GO:0046782] GO:0046782 reviewed Frog virus 3 (isolate Goorha) (FV-3)
002L_FRG3G Uncharacterized protein 002L host cell membrane [GO:0033644]; integral component of membrane [GO:0016021] GO:0016021; GO:0033644 reviewed Frog virus 3 (isolate Goorha) (FV-3)
002R_IIV3 Uncharacterized protein 002R reviewed Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)
003L_IIV3 Uncharacterized protein 003L reviewed Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)
003R_FRG3G Uncharacterized protein 3R reviewed Frog virus 3 (isolate Goorha) (FV-3)
004R_FRG3G Uncharacterized protein 004R host cell membrane [GO:0033644]; integral component of membrane [GO:0016021] GO:0016021; GO:0033644 reviewed Frog virus 3 (isolate Goorha) (FV-3)
005L_IIV3 Uncharacterized protein 005L reviewed Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)
005R_FRG3G Uncharacterized protein 005R

In [41]:
#count the whitespace-separated fields in the first line for reference (add -F'\t' to count tab-delimited columns instead)
!awk '{print NF; exit}' uniprot-SP-GO-sorted.tab

18
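
Without `-F`, `awk` splits on any run of spaces or tabs, so the count reflects words rather than tab-delimited columns. The sketch below illustrates the difference in Python 3 using the first record of the sorted file. The tab positions, including the two empty GO fields, are reconstructed from the column list in Step 2 and are an assumption, since the rendered output does not show where the tabs fall.

```python
# First record of uniprot-SP-GO-sorted.tab, with tab positions
# reconstructed from the Step 2 column list (an assumption).
line = "\t".join([
    "001R_FRG3G",
    "Putative transcription factor 001R",
    "regulation of viral transcription [GO:0046782]",
    "",  # Gene ontology (cellular component): assumed empty
    "",  # Gene ontology (molecular function): assumed empty
    "GO:0046782",
    "reviewed",
    "Frog virus 3 (isolate Goorha) (FV-3)",
])

# str.split() with no argument splits on runs of whitespace,
# like awk's default field splitting.
whitespace_fields = len(line.split())
# Splitting on "\t" keeps empty fields, like `awk -F'\t'`.
tab_columns = len(line.split("\t"))

print(whitespace_fields, tab_columns)  # prints: 18 8
```

Under these assumptions the record has 8 tab-delimited columns, matching the eight columns listed in Step 2, while the whitespace count is 18.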


## Step 3. Join `blast` output with Uniprot annotation file

### Step 3a. Infected

In [45]:
#join the first column in the first file with the first column in the second file
#the files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
infected-blast-sort.tab \
uniprot-SP-GO-sorted.tab \
> infected-blast-annot.tab
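
For reference, the `join` step keeps only rows whose first field appears in both files and writes the key, then the remaining fields of the first file, then the remaining fields of the second. A minimal Python 3 sketch of that behaviour follows. The function name `join_on_first_column` and the example rows (`ACC_A`, `ACC_B`, `placeholder protein`) are hypothetical stand-ins, not real data, and the sketch assumes each accession appears only once in the annotation file.

```python
def join_on_first_column(blast_lines, annot_lines):
    """Inner-join two lists of tab-delimited lines on their first field."""
    # Index the annotation rows by their first field.
    # Assumes keys are unique; a repeated key keeps only the last row.
    annotations = {}
    for line in annot_lines:
        key, _, rest = line.partition("\t")
        annotations[key] = rest

    joined = []
    for line in blast_lines:
        key, _, rest = line.partition("\t")
        if key in annotations:  # rows without a match are dropped
            joined.append("\t".join([key, rest, annotations[key]]))
    return joined


# Placeholder rows for illustration only.
blast = ["ACC_A\ttranscript_1\t1e-20", "ACC_B\ttranscript_2\t3e-05"]
annot = ["ACC_A\tplaceholder protein\treviewed"]

for row in join_on_first_column(blast, annot):
    print(row)
```

Unlike `join`, this approach does not require either input to be sorted, at the cost of holding the annotation table in memory.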