# Using Python for bioinformatics, and with bioinformatics packages

## Python versus Bash for building toolchains/pipelines

### The Bash approach

- In Bash/Shell it's straightforward to build toolchains using pipes and any software found in your `$PATH`
- This is useful for passing output sequentially through specialist programs or data-manipulation operations

In [28]:
# This environment runs on a Python interpreter, so it doesn't directly run shell commands.
# But since it uses an IPython interpreter, prefixing with "!" allows shell commands to be used, e.g.:
!echo "Hi"

# Toolchains in Bash
# Bash/Shell has the pipe character "|", which allows quite complex toolchains to be built up. Programs in $PATH can be called directly, e.g.:
!echo -n "1234" | wc -m   # Count the characters printed by echo. -n suppresses the trailing newline, so only the characters of the string are counted.

# Side note: shell output can be captured in a Python variable, but it arrives as an IPython SList rather than a plain string
var = !echo "1234" # Python mixed with Shell
print(type(var)) # Not a built-in Python type, but an SList (a list subclass provided by IPython, one element per output line)
str_var = var[0] # The first element of the SList ("1234") is already a Python string
print(type(str_var)) # Evaluates as a Python string, good!
print(str_var[0]) # Indexing works as expected and prints the first character

Hi
4
<class 'IPython.utils.text.SList'>
<class 'str'>
1


### The Python equivalent of pipes

- In Python, instead of pipes we might use the "subprocess" module
- This allows external commands to be run, but it can be pretty wordy compared to a simple `echo -n "1234" | wc -m` (see the example below)
- So in pipeline development it can make sense to use Shell over Python where that is easier to write and read, and to bring in Python for more advanced data manipulations and operations
- Keep in mind that the workflow manager Nextflow lets you combine the two: each process code block can start with a shebang that sets the interpreter (if none is given, the block runs as a shell script)
- Alternatively, you can keep a Shell code block in Nextflow and call Python from it as an external programme, pointing at a script file, e.g. `python scripts/function_x.py`

In [57]:
# The subprocess module
import subprocess

# Starts command 1. stdout=subprocess.PIPE captures its output instead of printing it to the terminal, so it can be accessed programmatically
process1 = subprocess.Popen(['echo', '-n', '1234'], stdout=subprocess.PIPE) # process1 is an instance of the Popen class; the command starts running as soon as it is created.

# Starts command 2, which reads its input from a pipe and writes its output to another pipe
process2 = subprocess.Popen(['wc', '-m'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# Passes the output of command 1 to command 2
output1, _ = process1.communicate() # .communicate() waits for the process to finish and returns a (stdout, stderr) tuple. stderr is None here because it was not redirected, so we discard it with the underscore.
output2, _ = process2.communicate(input=output1) # Sends the captured stdout as input, and waits for completion

# Final output
out = output2.decode().strip() # decode() converts the bytes object to a string, strip() removes leading/trailing whitespace
print(output2, "# This is a bytes object")
print(out)

b'4\n' # This is a bytes object
4
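
The cell above collects all of the first command's output in Python before handing it to the second command. As a minimal sketch of a closer match to a shell pipe, the first process's stdout can be connected directly to the second process's stdin. The variable names (`producer`, `consumer`, `char_count`) are illustrative, and the sketch assumes `echo` and `wc` are available as external programs, as in the cell above.

```python
import subprocess

# First command: write "1234" (no trailing newline) into a pipe.
producer = subprocess.Popen(["echo", "-n", "1234"], stdout=subprocess.PIPE)

# Second command: read straight from the first command's stdout.
# text=True makes communicate() return str instead of bytes.
consumer = subprocess.Popen(
    ["wc", "-m"], stdin=producer.stdout, stdout=subprocess.PIPE, text=True
)

# Close our copy of the pipe so only the consumer holds the read end.
producer.stdout.close()

count_text, _ = consumer.communicate()  # Wait for wc and collect its output
producer.wait()                         # Reap the first process as well

char_count = int(count_text.strip())
print(char_count)
```

With this layout the data streams between the two programs without passing through a Python variable, which matters more for large files than for a four-character string.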


### A Python alternative to subprocess

- The "sh" package is a subprocess replacement

- Programs can be called as if they were functions

- This can be used for any executable on your system, including those installed in your Conda env

- Only works on Unix-like operating systems, as it relies on Unix system calls and runs the real programs (not Python reimplementations)

- It is already installed for this notebook; the command used was: `conda install -c conda-forge sh`

In [56]:
import sh

# Call the find program as if it were a function of the sh module
print(sh.find(".","-name","*ipynb"))

# Bedtools, an installed package in the Conda env
print(sh.bedtools("-version"))

# Python exception handling can be built in, for example for a file that doesn't exist
# (GNU ls exits with status 2 for a missing file, which sh raises as ErrorReturnCode_2)
try:
    sh.ls("./non-existent_file")
except sh.ErrorReturnCode_2:
    print("This is not the file you're looking for")

./python-bioinformatics.ipynb
./python-machine-learning.ipynb
./python-data-manipulation-and-graphics.ipynb
./python-fundamentals.ipynb

bedtools v2.30.0

This is not the file you're looking for
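
For comparison, the same try/except pattern can be sketched with the standard library alone: `subprocess.run` with `check=True` raises `CalledProcessError` when a command exits with a non-zero status. The helper name `list_path` is hypothetical and only serves this illustration.

```python
import subprocess

def list_path(path):
    """Return the output of `ls path`, or None if ls exits with a non-zero status."""
    try:
        result = subprocess.run(
            ["ls", path], check=True, capture_output=True, text=True
        )
    except subprocess.CalledProcessError as err:
        # err.returncode holds the exit status reported by ls
        print(f"ls failed with exit code {err.returncode}")
        return None
    return result.stdout

missing = list_path("./non-existent_file")  # Expected to fail, as in the sh example
present = list_path(".")                    # The current directory should exist
```

Unlike `sh`, this catches every non-zero exit status in one exception class, so the specific code has to be read from `err.returncode` if it matters.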


## Bioinformatics packages built for Python

### Some popular bioinformatics packages have Python wrappers for direct integration (e.g., Pybedtools)

- To use command-line programs from Python, we need to install them and call them via sh or subprocess
- If a Python wrapper or reimplementation is available, it can be used directly; we will use Pybedtools, which wraps BEDTools, as an example
- Pybedtools is in the Conda environment already, installed using: `conda install -c bioconda pybedtools`
- Again, if you are already familiar with Bash, the Python interfaces can feel clunky at first

In [58]:
import pybedtools

a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')

print("A")
a.head()
print()
print("B")
b.head()

# Intersect a with b
a_and_b = a.intersect(b)
print()
print("A and B")
a_and_b.head()


A
chr1	1	100	feature1	0	+
chr1	100	200	feature2	0	+
chr1	150	500	feature3	0	-
chr1	900	950	feature4	0	+

B
chr1	155	200	feature5	0	-
chr1	800	901	feature6	0	+

A and B
chr1	155	200	feature2	0	+
chr1	155	200	feature3	0	-
chr1	900	901	feature4	0	+

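
To make the "A and B" result concrete, the sketch below recomputes the same overlaps in plain Python. It only illustrates what the intersection reports for these six intervals (each A feature clipped to its overlap with a B feature, using half-open BED coordinates); it is not how bedtools or Pybedtools implement it. The function name `intersect_intervals` is hypothetical, and the tuples are copied by hand from the output above.

```python
# Intervals copied from the A and B output: (chrom, start, end, name, score, strand)
a_features = [
    ("chr1", 1, 100, "feature1", "0", "+"),
    ("chr1", 100, 200, "feature2", "0", "+"),
    ("chr1", 150, 500, "feature3", "0", "-"),
    ("chr1", 900, 950, "feature4", "0", "+"),
]
b_features = [
    ("chr1", 155, 200, "feature5", "0", "-"),
    ("chr1", 800, 901, "feature6", "0", "+"),
]

def intersect_intervals(a_rows, b_rows):
    """Hypothetical helper: clip each A feature to its overlap with each B feature."""
    overlaps = []
    for chrom, a_start, a_end, name, score, strand in a_rows:
        for b_chrom, b_start, b_end, *_ in b_rows:
            if chrom != b_chrom:
                continue
            start = max(a_start, b_start)
            end = min(a_end, b_end)
            # BED intervals are half-open, so touching intervals do not overlap
            if start < end:
                overlaps.append((chrom, start, end, name, score, strand))
    return overlaps

a_and_b_rows = intersect_intervals(a_features, b_features)
for row in a_and_b_rows:
    print("\t".join(str(field) for field in row))
```

The nested loop is quadratic in the number of features, which is fine for a handful of intervals but is exactly the kind of work that is better left to BEDTools on real data.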

### Biopython

- Info

- The Conda environment contains a Biopython install (the command used was: `conda install -c conda-forge biopython`)
