## `blastx` and Uniprot file merging

### based on this script from EIMD 2019: https://github.com/eimd-2019/project-EWD-transcriptomics/blob/master/scripts/2019-07-10-blastx-Uniprot-File-Merging.ipynb

#### In this notebook, I'll merge the `blast` output from SR for infected and uninfected _C. bairdi_ with information from the Uniprot-Swissprot database.

## Step 0. Set working directory

In [1]:
pwd

u'/Volumes/toaster/grace/gitrepos/project-crab/notebooks'

In [2]:
wd = '/Volumes/toaster/grace/gitrepos/project-crab/analyses/'

In [3]:
cd {wd}

/Volumes/toaster/grace/gitrepos/project-crab/analyses


In [4]:
ls -F

091419-crab-blastx-output.csv         _blast-sep3.tab
1111-infected-blastout-d12-d26.csv    _blast-sort-crab.tab
1111-infected-d12-d26.csv             _blast-sort.tab
1111-uninfected-blastout-d12-d26.csv  _intermediate.file
1111-uninfected-d12-d26.csv           crab-Blastquery-GOslim.sorted
304428_L1/                            crab-GOslim-count.csv
304428_L2/                            crab-GOslim.csv
329774_L1/                            crab-blastx-sp-full.tab
329774_L2/                            crab-blastx-sp.tab
329775_L1/                            crab-forRevigo.tab
329775_L2/                            crab-stress-genes.tab
329776_L1/                            crab-stress-uniprotID.tab
329776_L2/                            hemat-Blastquery-GOslim.sorted
329777_L1/                            hemat-GOslim-count.csv
329777_L2/                            he

## Step 1. Format `blast` output

### Step 1a. Infected

In [5]:
!head -2 1111-infected-blastout-d12-d26.csv

﻿target_id,up_ID,evalue
TRINITY_DN30057_c0_g1_i1,sp|Q641I1|F135B_XENLA,4.19E-25


In [6]:
#convert commas to tabs
!tr ',' '\t' < 1111-infected-blastout-d12-d26.csv \
> inf-nocommas.csv

In [7]:
!head -2 inf-nocommas.csv

﻿target_id	up_ID	evalue
TRINITY_DN30057_c0_g1_i1	sp|Q641I1|F135B_XENLA	4.19E-25


In [8]:
#convert pipes to tabs
!tr '|' '\t' < inf-nocommas.csv \
> infected-blast-sep.tab

In [9]:
!head -2 infected-blast-sep.tab

﻿target_id	up_ID	evalue
TRINITY_DN30057_c0_g1_i1	sp	Q641I1	F135B_XENLA	4.19E-25


In [10]:
#remove column names
!awk NR\>1 infected-blast-sep.tab > infected-blast-sep2.tab

In [11]:
!head -2 infected-blast-sep2.tab

TRINITY_DN30057_c0_g1_i1	sp	Q641I1	F135B_XENLA	4.19E-25
TRINITY_DN49483_c0_g1_i1	sp	Q6VH22	IF172_MOUSE	1.09E-41


In [12]:
#reduce the number of columns, sort
!awk -v OFS='\t' '{print $3, $1, $5}' < infected-blast-sep2.tab | sort \
> infected-blast-sort.tab

In [13]:
!head -2 infected-blast-sort.tab

A0A0R4IBK5	TRINITY_DN56324_c0_g1_i1	2.73E-27
A0CDD4	TRINITY_DN7741_c2_g1_i4	1.11E-119


In [14]:
!wc infected-blast-sort.tab
!echo "infected transcripts"

    2384    7152   97896 infected-blast-sort.tab
infected transcripts
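
Cells 6 through 12 above can also be collapsed into one pipeline, which avoids the intermediate files and can be reused for the uninfected table in Step 1b. The following is a minimal sketch, not the code that was run for this notebook: `format_blast` is a hypothetical helper name, the toy CSV reuses the two rows printed above, and `LC_ALL=C` is an added assumption meant to keep the sort order consistent with a later `join`.

```shell
# format_blast is a hypothetical helper, not part of the original notebook.
# Usage: format_blast input.csv output.tab
format_blast() {
    # drop the header line, turn commas and pipes into tabs,
    # keep accession, transcript ID and e-value, then sort
    tail -n +2 "$1" \
    | tr ',|' '\t\t' \
    | awk -F'\t' -v OFS='\t' '{print $3, $1, $5}' \
    | LC_ALL=C sort > "$2"
}

# toy input built from the two rows shown in the cells above
tmpdir=$(mktemp -d)
printf '%s\n' \
    'target_id,up_ID,evalue' \
    'TRINITY_DN30057_c0_g1_i1,sp|Q641I1|F135B_XENLA,4.19E-25' \
    'TRINITY_DN49483_c0_g1_i1,sp|Q6VH22|IF172_MOUSE,1.09E-41' \
    > "$tmpdir/toy-blastout.csv"

format_blast "$tmpdir/toy-blastout.csv" "$tmpdir/toy-blast-sort.tab"
cat "$tmpdir/toy-blast-sort.tab"
```

The same call with the uninfected CSV as input would produce the equivalent sorted table for Step 1b.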


### Step 1b. Uninfected

In [15]:
!head -2 1111-uninfected-blastout-d12-d26.csv

﻿target_id,up_ID,evalue
TRINITY_DN1627_c0_g1_i1,sp|Q08ER8|ZN543_HUMAN,1.42E-28


In [16]:
#convert commas to tabs
!tr ',' '\t' < 1111-uninfected-blastout-d12-d26.csv \
> uninf-nocommas.csv

In [17]:
!head -2 uninf-nocommas.csv

﻿target_id	up_ID	evalue
TRINITY_DN1627_c0_g1_i1	sp|Q08ER8|ZN543_HUMAN	1.42E-28


In [18]:
#convert pipes to tabs
!tr '|' '\t' < uninf-nocommas.csv \
> uninfected-blast-sep.tab

In [19]:
!head -2 uninfected-blast-sep.tab

﻿target_id	up_ID	evalue
TRINITY_DN1627_c0_g1_i1	sp	Q08ER8	ZN543_HUMAN	1.42E-28


In [20]:
#remove column names
!awk NR\>1 uninfected-blast-sep.tab > uninfected-blast-sep2.tab

In [21]:
!head -2 uninfected-blast-sep2.tab

TRINITY_DN1627_c0_g1_i1	sp	Q08ER8	ZN543_HUMAN	1.42E-28
TRINITY_DN14986_c0_g1_i1	sp	G5E8P0	GCP6_MOUSE	7.94E-23


In [22]:
#reduce the number of columns, sort
!awk -v OFS='\t' '{print $3, $1, $5}' < uninfected-blast-sep2.tab | sort \
> uninfected-blast-sort.tab

In [23]:
!head -2 uninfected-blast-sort.tab

G5E8P0	TRINITY_DN14986_c0_g1_i1	7.94E-23
P33527	TRINITY_DN7707_c0_g1_i8	0


In [24]:
!wc uninfected-blast-sort.tab
!echo "uninfected transcripts"

      27      81    1063 uninfected-blast-sort.tab
uninfected transcripts


## Step 2. Format Uniprot-Swissprot database

#### The uniprot annotation file was downloaded from [this link](https://www.uniprot.org/uniprot/?query=reviewed:yes) on 2019-11-11. The following information was included as separate columns: 
- Entry (Uniprot Accession code)
- Protein Names
- Gene ontology (biological process)
- Gene ontology (cellular component)
- Gene ontology (molecular function)
- Gene ontology IDs
- Status (reviewed or not reviewed)
- Organism 

In [25]:
!ls

091419-crab-blastx-output.csv        crab-GOslim-count.csv
1111-infected-blastout-d12-d26.csv   crab-GOslim.csv
1111-infected-d12-d26.csv            crab-blastx-sp-full.tab
1111-uninfected-blastout-d12-d26.csv crab-blastx-sp.tab
1111-uninfected-d12-d26.csv          crab-forRevigo.tab
304428_L1                            crab-stress-genes.tab
304428_L2                            crab-stress-uniprotID.tab
329774_L1                            hemat-Blastquery-GOslim.sorted
329774_L2                            hemat-GOslim-count.csv
329775_L1                            hemat-blastx-sp-full.tab
329775_L2                            hemat-blastx-sp.tab
329776_L1                            hemat-forRevigo.tab
329776_L2                            inf-nocommas.csv
329777_L1                            infected-Blastquery-GOslim.sorted
329777_L2                            inf

In [52]:
!head -n2 uniprot-reviewed-yes.tab

Entry	Protein names	Gene ontology (biological process)	Gene ontology (cellular component)	Gene ontology (molecular function)	Gene ontology IDs	Status	Organism
B7NR61	Pantothenate kinase (EC 2.7.1.33) (Pantothenic acid kinase)	coenzyme A biosynthetic process [GO:0015937]	cytoplasm [GO:0005737]	ATP binding [GO:0005524]; pantothenate kinase activity [GO:0004594]	GO:0004594; GO:0005524; GO:0005737; GO:0015937	reviewed	Escherichia coli O7:K1 (strain IAI39 / ExPEC)


In [54]:
#sort the file by the first column (-k 1), which is the Uniprot Entry (Uniprot Accession code)
!sort uniprot-reviewed-yes.tab -k 1 > uniprot-SP-GO-sorted.tab

In [59]:
!head -2 uniprot-SP-GO-sorted.tab

A0A023GPI8	Lectin alpha chain (CboL) [Cleaved into: Lectin beta chain; Lectin gamma chain]			mannose binding [GO:0005537]; metal ion binding [GO:0046872]	GO:0005537; GO:0046872	reviewed	Canavalia boliviana
A0A023GPJ0	Immunity protein CdiI					reviewed	Enterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCDC 279-56)


In [56]:
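#count the whitespace-separated fields on the first line for reference
#(awk splits on any whitespace by default, so this is larger than the number of tab-delimited columns)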
!awk '{print NF; exit}' uniprot-SP-GO-sorted.tab

25
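
The 25 above is a count of whitespace-separated words on the first line, because `awk` splits on any run of blanks unless a field separator is given. A small sketch of the difference, using a toy header that is an abbreviated stand-in for the real eight-column file (the file name is an assumption for illustration):

```shell
tmpdir=$(mktemp -d)

# toy header with four tab-delimited columns, one of which contains a space
printf 'Entry\tProtein names\tStatus\tOrganism\n' > "$tmpdir/toy-header.tab"

# default splitting counts words
words=$(awk '{print NF; exit}' "$tmpdir/toy-header.tab")

# -F'\t' counts tab-delimited columns
cols=$(awk -F'\t' '{print NF; exit}' "$tmpdir/toy-header.tab")

echo "whitespace-separated fields: $words"
echo "tab-delimited columns: $cols"
```

On this toy header the first count is 5 and the second is 4.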


## Step 3. Join `blast` output with Uniprot annotation file

### Step 3a. Infected

In [64]:
#join the first column in the first file with the first column in the second file
#the files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
infected-blast-sort.tab \
uniprot-SP-GO-sorted.tab \
> infected-blast-annot.tab

In [67]:
!head -10 infected-blast-annot.tab

A0A0R4IBK5	TRINITY_DN56324_c0_g1_i1	2.73E-27	E3 ubiquitin-protein ligase rnf213-alpha (EC 2.3.2.27) (EC 3.6.4.-) (Mysterin-A) (Mysterin-alpha) (RING finger protein 213-A) (RING finger protein 213-alpha) (RING-type E3 ubiquitin transferase rnf213-alpha)	blood circulation [GO:0008015]; sprouting angiogenesis [GO:0002040]; ubiquitin-dependent protein catabolic process [GO:0006511]	cytosol [GO:0005829]	ATPase activity [GO:0016887]; metal ion binding [GO:0046872]; ubiquitin-protein transferase activity [GO:0004842]	GO:0002040; GO:0004842; GO:0005829; GO:0006511; GO:0008015; GO:0016887; GO:0046872	reviewed	Danio rerio (Zebrafish) (Brachydanio rerio)
A0CDD4	TRINITY_DN7741_c2_g1_i4	1.11E-119	Cilia- and flagella-associated protein 20 (Bug22p)	cilium assembly [GO:0060271]; positive regulation of cell motility [GO:2000147]; positive regulation of feeding behavior [GO:2000253]; regulation of cilium beat frequency involved in ciliary motility [GO:0060296]	ciliary basal body [GO:0036064]; cilium 

In [66]:
!tail -2 infected-blast-annot.tab

Q9ZVX5	TRINITY_DN17360_c0_g1_i3	2.36E-22	Pentatricopeptide repeat-containing protein At2g16880					reviewed	Arabidopsis thaliana (Mouse-ear cress)
Q9ZWG1	TRINITY_DN10665_c0_g1_i7	1.13E-44	Mitochondrial uncoupling protein 2 (AtPUMP2)	amino acid transmembrane transport [GO:0003333]; mitochondrial transmembrane transport [GO:1990542]	Golgi apparatus [GO:0005794]; integral component of membrane [GO:0016021]; mitochondrial inner membrane [GO:0005743]; plasma membrane [GO:0005886]	amino acid transmembrane transporter activity [GO:0015171]; oxidative phosphorylation uncoupler activity [GO:0017077]	GO:0003333; GO:0005743; GO:0005794; GO:0005886; GO:0015171; GO:0016021; GO:0017077; GO:1990542	reviewed	Arabidopsis thaliana (Mouse-ear cress)


In [None]:
!wc -l infected-blast-annot.tab
!echo "annotated infected transcripts"
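
Because `join` silently drops lines that have no partner in the other file, it can be useful to list the `blast` hits that found no Uniprot entry. A minimal sketch using `join -v 1`, on toy files abbreviated from the rows shown above (the three-column Uniprot stand-in and the toy file names are assumptions for illustration):

```shell
tmpdir=$(mktemp -d)
TAB=$(printf '\t')

# toy blast table: accession, transcript ID, e-value
printf '%s\t%s\t%s\n' \
    'A0A0R4IBK5' 'TRINITY_DN56324_c0_g1_i1' '2.73E-27' \
    'A0CDD4' 'TRINITY_DN7741_c2_g1_i4' '1.11E-119' \
    > "$tmpdir/toy-blast-sort.tab"

# toy annotation table: only A0CDD4 is present
printf '%s\t%s\t%s\n' \
    'A0CDD4' 'Cilia- and flagella-associated protein 20 (Bug22p)' 'reviewed' \
    > "$tmpdir/toy-uniprot-sorted.tab"

# lines whose accession appears in both files
LC_ALL=C join -t "$TAB" -1 1 -2 1 \
    "$tmpdir/toy-blast-sort.tab" "$tmpdir/toy-uniprot-sorted.tab" \
    > "$tmpdir/toy-annot.tab"

# lines from the blast table with no match in the annotation table
LC_ALL=C join -t "$TAB" -v 1 \
    "$tmpdir/toy-blast-sort.tab" "$tmpdir/toy-uniprot-sorted.tab" \
    > "$tmpdir/toy-unmatched.tab"

echo "annotated: $(wc -l < "$tmpdir/toy-annot.tab" | tr -d ' ')"
echo "unmatched: $(wc -l < "$tmpdir/toy-unmatched.tab" | tr -d ' ')"
```

Comparing the two counts with the line count of the sorted `blast` table gives a quick check that no hits were lost unexpectedly.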

### Step 3b. Uninfected

In [None]:
#join the first column in the first file with the first column in the second file
#the files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
uninfected-blast-sort.tab \
uniprot-SP-GO-sorted.tab \
> uninfected-blast-annot.tab

In [None]:
!head -10 uninfected-blast-annot.tab

In [None]:
!wc -l uninfected-blast-annot.tab
!echo "annotated uninfected transcripts"

## Step 4. Isolate gene IDs

Notes from the EIMD script: `blast` was performed using isoform data, so each line in the annotated file is identified by an isoform ID (e.g. TRINITY_DN416168_c0_g1_i1). Gene IDs carry the same contig and gene information but drop the isoform suffix (e.g. TRINITY_DN416168_c0_g1). Differential expression analyses will be conducted in `edgeR` at the gene level, so the annotation files need gene IDs.

### Step 4a. Infected

In [None]:
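# Isolate the isoform ID column (column 2) with cut
# flip order of characters with rev
# drop the first three characters of the reversed ID (the trailing "_iN") with cut -c; this assumes a single-digit isoform number
# flip order of characters back with rev
# the gene IDs are added to the annotated table as a new column with paste in a later cell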

!cut -f2 infected-blast-annot.tab \
| rev \
| cut -c 4- \
| rev \
> infected-blast-annot-geneIDOnly.tab

In [None]:
!head infected-blast-annot-geneIDOnly.tab

In [None]:
#line count should match the line count of infected-blast-annot.tab
!wc -l infected-blast-annot-geneIDOnly.tab

In [None]:
!paste infected-blast-annot-geneIDOnly.tab infected-blast-annot.tab \
> infected-blast-annot-withGeneID.tab

In [None]:
!head -2 infected-blast-annot-withGeneID.tab
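
The `cut`/`rev`/`paste` steps above trim a fixed three characters, which fits IDs ending in a single-digit isoform such as `_i1`. As a hedged alternative (not the notebook's method), the isoform suffix can be removed with a pattern so that multi-digit suffixes are handled too, and the gene ID prepended in the same pass. In the toy input below, the second row is made up (placeholder accession, hypothetical `_i12` isoform) purely to illustrate the multi-digit case.

```shell
tmpdir=$(mktemp -d)

# row 1 is abbreviated from the annotated table above;
# row 2 is hypothetical and only illustrates a two-digit isoform number
printf '%s\t%s\t%s\n' \
    'A0A0R4IBK5' 'TRINITY_DN56324_c0_g1_i1' '2.73E-27' \
    'EXAMPLE' 'TRINITY_DN416168_c0_g1_i12' '1E-10' \
    > "$tmpdir/toy-annot.tab"

# copy column 2, strip the trailing _i<digits>, and print it as a new first column
awk -F'\t' -v OFS='\t' '{
    gene = $2
    sub(/_i[0-9]+$/, "", gene)
    print gene, $0
}' "$tmpdir/toy-annot.tab" > "$tmpdir/toy-annot-withGeneID.tab"

# show gene ID next to the isoform ID it was derived from
cut -f1,3 "$tmpdir/toy-annot-withGeneID.tab"
```

Because the gene ID is written in the same pass as the original line, there is no separate gene-ID file whose line count has to be checked against the annotated table.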

### Step 4b. Uninfected

In [None]:
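# Isolate the isoform ID column (column 2) with cut
# flip order of characters with rev
# drop the first three characters of the reversed ID (the trailing "_iN") with cut -c; this assumes a single-digit isoform number
# flip order of characters back with rev
# the gene IDs are added to the annotated table as a new column with paste in a later cell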

!cut -f2 uninfected-blast-annot.tab \
| rev \
| cut -c 4- \
| rev \
> uninfected-blast-annot-geneIDOnly.tab

In [None]:
!head uninfected-blast-annot-geneIDOnly.tab

In [None]:
#line count should match the line count of uninfected-blast-annot.tab
!wc -l uninfected-blast-annot-geneIDOnly.tab

In [None]:
!paste uninfected-blast-annot-geneIDOnly.tab uninfected-blast-annot.tab \
> uninfected-blast-annot-withGeneID.tab

In [None]:
!head -2 uninfected-blast-annot-withGeneID.tab