## Redoing a shell script from inside Python
During Week 2 Day 4 you developed a script like the one below, which builds a multiple sequence alignment from BLAST hits.  
(Note that you cannot run this code here, because it cannot handle command line arguments inside a notebook.)  
We are going to do the same thing from inside Python. The parts for which we needed `sed` in the shell script could now be done in Python code.

In [None]:
#!/bin/bash
if [ $1 ]; then
    BLASTDATABASE=$1
else
    echo which BLAST database do you want to use?
    read BLASTDATABASE
fi     

if [ $2 ]; then
    INPUTFILE=$2
else
    echo which file contains the query sequence?
    read INPUTFILE
fi     

#remove the files from previous runs, if any
if [ -f blasthits.fasta ]; then
    rm blasthits.*
fi

BLASTHITS=`blastp -query $INPUTFILE -db $BLASTDATABASE -evalue 1e-3 -outfmt "6 sacc"`

for BLASTHIT in $BLASTHITS; do 
    blastdbcmd -db $BLASTDATABASE -entry $BLASTHIT 
done > blasthits.fasta

sed -i 's/>.*[|]/>/' blasthits.fasta

clustalo --in blasthits.fasta --outfmt=clustal > blasthits.clustal

#send any output that jalview produces to the trash (/dev/null)
jalview -open blasthits.clustal -colour clustal -nodisplay -png blasthits.png -imgMAP blasthits.html >& /dev/null

firefox blasthits.html

## Preparations
We need the module `sys` to obtain command line arguments. This does not work inside a notebook, but we include it anyway for the final script.  
Furthermore, we use `subprocess.check_call()` and `subprocess.check_output()` as replacements for the `os.popen()` used in the book.

In [2]:
# First import all necessary modules
import sys
from subprocess import check_call, check_output
# TODO: If you need more, add them here and then re-execute this box

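As in the previous notebook, we add our own `check_output_ext()`. Unlike `check_output()`, it returns whatever output was produced even when the command exits with an error.  
This is admittedly more complex code than we are teaching in this course. 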

In [3]:
import subprocess

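def check_output_ext(command, shell=False):
    try:
        # Normal case: the command succeeds and we get its output
        output = subprocess.check_output(command, shell=shell)
    except subprocess.CalledProcessError as e:
        # The command failed: keep what it printed before failing
        output = e.output
    return output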
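
To see what the wrapper buys us, the sketch below runs a command that prints something and then exits with an error. The command is a hypothetical stand-in (a one-line Python program started through `sys.executable`), chosen only because it behaves the same on every machine. The code is written so that it runs under both Python 2 and Python 3.

```python
import subprocess
import sys

def check_output_ext(command, shell=False):
    # Same helper as in the notebook: keep the output even if the command fails
    try:
        output = subprocess.check_output(command, shell=shell)
    except subprocess.CalledProcessError as e:
        output = e.output
    return output

# Hypothetical failing command: writes 'partial' and exits with status 1
failing_command = [sys.executable, '-c',
                   'import sys; sys.stdout.write("partial"); sys.exit(1)']

# check_output() raises an exception for this command ...
try:
    subprocess.check_output(failing_command)
    raised = False
except subprocess.CalledProcessError:
    raised = True

# ... whereas check_output_ext() hands back what was printed before the failure
result = check_output_ext(failing_command)
print(raised)
print(result)
```

Note that only standard output is collected this way; error messages written to standard error are not part of `e.output`.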

In [9]:
# Put all fixed filenames (and other constants) here
all_outputs = 'blasthits.*' # used by rm
# TODO: Add more when needed, and then re-execute this box

## Getting input file names
The first two pieces of the shell script check command line arguments and ask the user for missing arguments.
```
if [ $1 ]
then
    BLASTDATABASE=$1
else
    echo which BLAST database do you want to use?
    read BLASTDATABASE
fi

if [ $2 ]
then
    INPUTFILE=$2
else
    echo which file contains the query sequence?
    read INPUTFILE
fi
```

We could do that in Python like this:

In [7]:
if len(sys.argv) > 1: # will not work properly inside notebook...
    blast_database = sys.argv[1]
else:
    blast_database = raw_input('which BLAST database do you want to use? ')
    
if len(sys.argv) > 2: # will not work properly inside notebook...
    input_filename = sys.argv[2]
else:
    input_filename = raw_input('which file contains the query sequence? ')
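
Both blocks follow the same pattern: take the argument if it is there, otherwise ask. As a sketch, that pattern could be captured in one small function. The name `get_argument()` and its parameter `ask` are hypothetical, introduced here only for illustration; in a real script you would pass `sys.argv` and `raw_input` (or `input` in Python 3). Here a canned answer is used so that the example runs without a keyboard.

```python
def get_argument(argv, position, question, ask):
    """Return argv[position] if it was given, otherwise ask the user."""
    if len(argv) > position:
        return argv[position]
    return ask(question)

def canned_answer(question):
    # Hypothetical stand-in for raw_input(): always gives the same answer
    return 'proteinX.fasta'

# Pretend the script was started with one argument only
argv = ['blast_script.py', 'sp_human']

blast_database = get_argument(
    argv, 1, 'which BLAST database do you want to use? ', canned_answer)
input_filename = get_argument(
    argv, 2, 'which file contains the query sequence? ', canned_answer)

print(blast_database)   # taken from argv
print(input_filename)   # not in argv, so the question was "asked"
```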

In [8]:
# these command line arguments are valid for the notebook, not for our script...
print blast_database
print input_filename

-f
/run/user/17170083/jupyter/kernel-bba31f70-9f22-4fb0-9352-4a8061177104.json


In [10]:
# Inside the notebook ask for input anyway
# Don't include this code in a real Python script
blast_database = raw_input('which BLAST database do you want to use? ')
input_filename = raw_input('which file contains the query sequence? ')

which BLAST database do you want to use? sp_human
which file contains the query sequence? proteinX.fasta


In [11]:
# TODO: Add code to see what we've got as database
print blast_database

sp_human


In [12]:
# TODO: Add code to see what we've got as input file
print input_filename

proteinX.fasta


## Removing old output
In the shell script:
```
if [ -f blasthits.fasta ]; then
    rm blasthits.*
fi
```
We can do that from Python as well, using symbolic names where possible.

In [13]:
# TODO: fill the dots
command = 'ls %s' % all_outputs
output = check_output_ext(command, shell=True)
print '--- start of output ---\n' + output + '--- end of output ---'

--- start of output ---
--- end of output ---


In [14]:
if len(output) > 0:
    print 'something to be removed'
else:
    print 'nothing to do'

nothing to do


In [15]:
if len(output) > 0:
    command = 'rm %s' % all_outputs
    check_call(command, shell=True)

Run the appropriate cell above to check that the outputs have indeed been removed.
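
Listing and removing files can also be done without calling `ls` and `rm` at all, using the standard modules `glob` and `os`. The following is a minimal sketch of that alternative, not the approach used in this notebook. The function name `remove_outputs()` is hypothetical, and the demonstration works in a temporary scratch directory so that no real results are touched.

```python
import glob
import os
import tempfile

def remove_outputs(pattern):
    """Delete all files matching pattern; return the names that were removed."""
    removed = []
    for filename in glob.glob(pattern):
        os.remove(filename)
        removed.append(filename)
    return removed

# Create a scratch directory with two old outputs and one unrelated file
scratch_dir = tempfile.mkdtemp()
for name in ['blasthits.fasta', 'blasthits.clustal', 'other.txt']:
    open(os.path.join(scratch_dir, name), 'w').close()

removed = remove_outputs(os.path.join(scratch_dir, 'blasthits.*'))
remaining = os.listdir(scratch_dir)
print(len(removed))
print(remaining)
```

If nothing matches the pattern, `glob.glob()` returns an empty list and the loop simply does nothing, so no separate check is needed.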

## Obtaining the BLAST hits
In the shell script:
```
BLASTHITS=`blastp -query $INPUTFILE -db $BLASTDATABASE -evalue 1e-3 -outfmt "6 sacc"`
```

We pass the command to `check_output()`, but let Python substitute file names.

In [16]:
blast_query_command = 'blastp -query %s -db %s -evalue 1e-3 -outfmt "6 sacc"'
# TODO: Fill in variable parts of command
command = blast_query_command % (input_filename,blast_database)
output = check_output(command, shell=True)
print len(output)
print output[:300]

28
P01308
F8WCM5
P01344
P05019
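
With `shell=True` the whole command is one string that the shell splits into words again, so a file name containing a space would break it. `check_output()` also accepts a list of arguments, which avoids the shell altogether. A minimal sketch, in which the function name `blastp_arguments()` and the file name with a space are hypothetical, and the options are the ones from the shell script:

```python
def blastp_arguments(query_file, database):
    """Build the blastp command as a list, one element per argument."""
    return ['blastp',
            '-query', query_file,
            '-db', database,
            '-evalue', '1e-3',
            '-outfmt', '6 sacc']   # already one argument, so no extra quotes

# A file name with a space stays one single argument
arguments = blastp_arguments('my protein.fasta', 'sp_human')
print(arguments)

# With BLAST installed, this list could be passed on directly:
#     output = check_output(arguments)
```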



In [17]:
# turn the output into a list
blast_hits = output.strip().split('\n')
print blast_hits
# print the elements of the list
for hit in blast_hits:
    print hit

['P01308', 'F8WCM5', 'P01344', 'P05019']
P01308
F8WCM5
P01344
P05019
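
One detail to keep in mind: when BLAST finds no hits at all, `output` is an empty string, and `''.strip().split('\n')` gives a list holding one empty string rather than an empty list. Calling `split()` without an argument splits on any whitespace and avoids that. A small sketch, with the output shown above typed in as a string:

```python
# The BLAST output from above, typed in by hand
sample_output = 'P01308\nF8WCM5\nP01344\nP05019\n'

# split() without arguments ignores the trailing newline by itself
hits = sample_output.split()
print(hits)

# The two approaches differ when there is no output at all
no_hits_newline = ''.strip().split('\n')   # list with one empty string
no_hits_plain = ''.split()                 # empty list
print(len(no_hits_newline))
print(len(no_hits_plain))
```

With the empty list, a loop such as `for hit in blast_hits:` is skipped entirely instead of running once with an empty accession.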


## Combining the BLAST hits
In the shell script:
```
for BLASTHIT in $BLASTHITS; do 
    blastdbcmd -db $BLASTDATABASE -entry $BLASTHIT 
done > blasthits.fasta
```
Now we do the loop in Python, and write all results to the output file.  
First let's try it without writing to file (simulating).

In [22]:
for hit in blast_hits:
    blast_command = 'blastdbcmd -db %s -entry "%s"'
    # TODO: Fill in variable parts of command
    command = blast_command % (blast_database,hit)
    output = check_output(command, shell=True)
    print output

>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens GN=INS PE=1 SV=1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPL
ALEGSLQKRGIVEQCCTSICSLYQLENYCN

>sp|F8WCM5|INSR2_HUMAN Insulin, isoform 2 OS=Homo sapiens GN=INS-IGF2 PE=2 SV=1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQASALSLSSSTSTWPEGLD
ATARAPPALVVTANIGQAGGSSSRQFRQRALGTSDSPVLFIHCPGAAGTAQGLEYRGRRVTTELVWEEVDSSPQPQGSES
LPAQPPAQPAPQPEPQQAREPSPEVSCCGLWPRRPQRSQN

>sp|P01344|IGF2_HUMAN Insulin-like growth factor II OS=Homo sapiens GN=IGF2 PE=1 SV=1
MGIPMGKSMLVLLTFLAFASCCIAAYRPSETLCGGELVDTLQFVCGDRGFYFSRPASRVSRRSRGIVEECCFRSCDLALL
ETYCATPAKSERDVSTPPTVLPDNFPRYPVGKFFQYDTWKQSTQRLRRGLPALLRARRGHVLAKELEAFREAKRHRPLIA
LPTQDPAHGGAPPEMASNRK

>sp|P05019|IGF1_HUMAN Insulin-like growth factor I OS=Homo sapiens GN=IGF1 PE=1 SV=1
MGKISSLPTQLFKCCFCDFLKVKMHTMSSSHLFYLALCLLTFTSSATAGPETLCGAELVDALQFVCGDRGFYFNKPTGYG
SSSRRAPQTGIVDECCFRSCDLRRLEMYCAPLKPAKSARSVRAQRHTDMPKTQKYQPPSTNKNTKSQRRKGWPKTHPGGE
QKEGTEASLQIRGKKKEQRREIGS

We get an empty line after each sequence. Since `print` adds a newline of its own, this means that each `output` already ends in a newline.  
So we can write them to the FASTA file as they are.

In [38]:
fasta_filename = 'blasthits.fasta'
fasta_file = open(fasta_filename, 'w')
for hit in blast_hits:
    blast_command = 'blastdbcmd -db %s -entry "%s"'
    # TODO: Fill in variable parts of command
    command = blast_command % (blast_database,hit)
    output = check_output(command, shell=True)
    fasta_file.write(output)
fasta_file.close()
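
If one of the `blastdbcmd` calls fails halfway through the loop, `fasta_file.close()` is never reached. A `with` statement closes the file in every case. The sketch below shows the idea; the function name `write_records()` is hypothetical, and the records are made-up stand-ins for the output of `blastdbcmd`, so that the example runs without BLAST.

```python
import os
import tempfile

def write_records(filename, records):
    """Write all records to filename; the file is closed even after an error."""
    with open(filename, 'w') as fasta_file:
        for record in records:
            fasta_file.write(record)

# Made-up records; each ends in a newline, like the blastdbcmd output
records = ['>seq1\nMALW\n', '>seq2\nMGIP\n']

# Write to a scratch location so existing results are not overwritten
scratch_filename = os.path.join(tempfile.mkdtemp(), 'blasthits.fasta')
write_records(scratch_filename, records)

with open(scratch_filename) as fasta_file:
    written = fasta_file.read()
print(written)
```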

In [39]:
# Use check_output() to look into the fasta_file.
# hint: use command cat
command = 'cat %s' % (fasta_filename)
output = check_output(command, shell=True)
print output

>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens GN=INS PE=1 SV=1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPL
ALEGSLQKRGIVEQCCTSICSLYQLENYCN
>sp|F8WCM5|INSR2_HUMAN Insulin, isoform 2 OS=Homo sapiens GN=INS-IGF2 PE=2 SV=1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQASALSLSSSTSTWPEGLD
ATARAPPALVVTANIGQAGGSSSRQFRQRALGTSDSPVLFIHCPGAAGTAQGLEYRGRRVTTELVWEEVDSSPQPQGSES
LPAQPPAQPAPQPEPQQAREPSPEVSCCGLWPRRPQRSQN
>sp|P01344|IGF2_HUMAN Insulin-like growth factor II OS=Homo sapiens GN=IGF2 PE=1 SV=1
MGIPMGKSMLVLLTFLAFASCCIAAYRPSETLCGGELVDTLQFVCGDRGFYFSRPASRVSRRSRGIVEECCFRSCDLALL
ETYCATPAKSERDVSTPPTVLPDNFPRYPVGKFFQYDTWKQSTQRLRRGLPALLRARRGHVLAKELEAFREAKRHRPLIA
LPTQDPAHGGAPPEMASNRK
>sp|P05019|IGF1_HUMAN Insulin-like growth factor I OS=Homo sapiens GN=IGF1 PE=1 SV=1
MGKISSLPTQLFKCCFCDFLKVKMHTMSSSHLFYLALCLLTFTSSATAGPETLCGAELVDALQFVCGDRGFYFNKPTGYG
SSSRRAPQTGIVDECCFRSCDLRRLEMYCAPLKPAKSARSVRAQRHTDMPKTQKYQPPSTNKNTKSQRRKGWPKTHPGGE
QKEGTEASLQIRGKKKEQRREIGSRNA

## Reformatting blasthits.fasta
In the shell script:
```
sed -i 's/>.*[|]/>/' blasthits.fasta
```
If we want to do that in Python, we had better combine it with writing the FASTA file (at that moment we have all the text, piece by piece, in memory anyway).  
For now, just call `sed` with `check_call()`.

In [40]:
sed_command = "sed -i 's/>.*[|]/>/' %s"
# TODO: Fill in variable parts of command
command = sed_command % (fasta_filename)
check_call(command, shell=True)

0
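
For comparison, here is a sketch of how the same substitution could be done in Python with the `re` module, using the same regular expression as the `sed` command. The function name `shorten_headers()` is hypothetical. Like `sed` without the `g` flag, it replaces at most one match per line.

```python
import re

def shorten_headers(fasta_text):
    """Replace everything from '>' up to the last '|' on a line by '>'."""
    lines = fasta_text.splitlines(True)   # True: keep the line endings
    new_lines = [re.sub(r'>.*[|]', '>', line, count=1) for line in lines]
    return ''.join(new_lines)

# A shortened record in the format that blastdbcmd produced above
record = ('>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens GN=INS PE=1 SV=1\n'
          'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQ\n')

short_record = shorten_headers(record)
print(short_record)
```

Applied to each `output` just before `fasta_file.write()`, this would make the separate `sed` step unnecessary.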

## Finalizing the script
In the shell script:
```
clustalo --in blasthits.fasta --outfmt=clustal > blasthits.clustal

#send any output that jalview produces to the trash (/dev/null)
jalview -open blasthits.clustal -colour clustal -nodisplay -png blasthits.png -imgMAP blasthits.html >& /dev/null

firefox blasthits.html
```
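`check_call()` does not capture the output of the program it runs, so nothing of it shows up in the value we get back. The program still writes to the same place as Python itself, though; to really send the output to the trash, as the shell script does, it has to be redirected explicitly.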
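
In a stand-alone script it is safest to discard the output explicitly, the way `>& /dev/null` does in the shell. The sketch below does this with the `stdout` and `stderr` parameters of `check_call()`. The noisy command is a hypothetical stand-in for `jalview` (a one-line Python program), used so that the example runs anywhere.

```python
import os
import subprocess
import sys

# Hypothetical noisy command standing in for jalview
noisy_command = [sys.executable, '-c', 'print("lots of messages")']

# os.devnull is the name of the "trash" file, /dev/null on Linux
with open(os.devnull, 'w') as trash:
    status = subprocess.check_call(noisy_command,
                                   stdout=trash,              # discard normal output
                                   stderr=subprocess.STDOUT)  # and error messages too
print(status)
```

When the command is passed as a string with `shell=True`, appending `> /dev/null 2>&1` to the string has the same effect.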

In [44]:
# TODO: Define names for further files
# Challenge: Derive these names from the fasta_filename
# Easy: Just define constants for those names
main_name = fasta_filename.split('.')[0]   # 'blasthits.fasta' -> 'blasthits'
clustal_filename = main_name + '.clustal'
png_filename = main_name + '.png'
html_filename = main_name + '.html'
print png_filename, html_filename

blasthits.png blasthits.html
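
Splitting on `'.'` works for `blasthits.fasta`, but it cuts the name at the first dot, so a name such as `my.hits.fasta` would be reduced to `my`. The standard function `os.path.splitext()` removes only the last extension. A minimal sketch, in which the function name `derived_filenames()` and the dotted file name are hypothetical:

```python
import os.path

def derived_filenames(fasta_filename):
    """Return the clustal, png and html file names belonging to a fasta file."""
    main_name = os.path.splitext(fasta_filename)[0]   # drop the last extension only
    return (main_name + '.clustal',
            main_name + '.png',
            main_name + '.html')

clustal_filename, png_filename, html_filename = derived_filenames('blasthits.fasta')
print(clustal_filename)

# A name with more than one dot keeps everything before the last dot
dotted = derived_filenames('my.hits.fasta')
print(dotted[0])
```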


In [43]:
clustalo_command = 'clustalo --in %s --outfmt=clustal > %s'
# TODO: Fill in variable parts of command
command = clustalo_command % (fasta_filename,clustal_filename)
check_call(command, shell=True)

0

In [45]:
jal_command = 'jalview -open %s -colour clustal -nodisplay -png %s -imgMAP %s'
# TODO: Fill in variable parts of command
command = jal_command % (clustal_filename,png_filename,html_filename)
check_call(command, shell=True)

0

In [46]:
firefox_command = 'firefox %s'
# TODO: Fill in variable parts of command
command = firefox_command % html_filename
check_call(command, shell=True)

0