<small><small><i>
Introduction to Python for Bioinformatics - available at https://github.com/kipkurui/Python4Bioinformatics.
</i></small></small>

## Files, Scripting and Modules

So far, we have been writing all our Python code in Jupyter notebooks. However, to use that code as part of a pipeline, you need to write scripts. In addition, the data you need to analyse is usually stored in a file, which you must read into Python before processing.


### Reading Files

So far we have been working with data held in memory. In bioinformatics, you will often need to read data from a file or write output to a file. For both, we use the `open` function.

In [1]:
myfile = open("../Files/test.txt", "w")
myfile.write("My first file written from Python \n")
myfile.write("---------------------------------\n")
myfile.write("Hello, world!\n")
myfile.close()

The **mode** in which you open the file determines whether you write to it (`w`), read from it (`r`) or append to it (`a`). Note that `w` overwrites any existing content.
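The difference between the modes is easiest to see side by side. The following is a minimal sketch, assuming a throwaway file in a temporary directory (the name `modes_demo.txt` is only illustrative and not part of the course files):

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

handle = open(path, "w")   # "w" creates the file (or empties an existing one)
handle.write("first line\n")
handle.close()

handle = open(path, "a")   # "a" keeps existing content and adds to the end
handle.write("second line\n")
handle.close()

handle = open(path, "r")   # "r" opens the file for reading only
content_after_append = handle.read()
handle.close()

handle = open(path, "w")   # opening with "w" again discards the old content
handle.write("replaced\n")
handle.close()

handle = open(path, "r")
content_after_rewrite = handle.read()
handle.close()

print(content_after_append)
print(content_after_rewrite)
```

After the append step the file holds both lines. After the second `w` step only the `replaced` line remains.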

Opening a file creates what we call a **file handle**, an object with methods for manipulating the file. In our case, `myfile` has methods to write to and close the file. Closing the file flushes any buffered data to disk and releases the handle.

Alternatively, you can open the file in a `with` statement, which closes the file automatically when the block ends.

In [2]:
with open("../Files/test.txt", "w") as myfile:
    myfile.write("My first file written from Python \n")
    myfile.write("---------------------------------\n")
    myfile.write("Hello, world!\n")
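To confirm that the `with` block really closes the file, you can inspect the handle's `closed` attribute. This is a small sketch using a hypothetical scratch file rather than the course's `../Files/test.txt`:

```python
import os
import tempfile

# Hypothetical scratch file so the course files are left untouched
path = os.path.join(tempfile.mkdtemp(), "closed_demo.txt")

with open(path, "w") as myfile:
    myfile.write("Hello, world!\n")
    inside = myfile.closed    # still open while we are inside the block

outside = myfile.closed       # closed automatically once the block ends

print(inside, outside)
```

The handle is open inside the block and closed after it, even though `close()` is never called explicitly.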

Let's check what else we can do with `open`.

In [3]:
?open

Signature: open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
Docstring:
Open file and return a stream.  Raise IOError upon failure.

file is either a text or byte string giving the name (and the path
if the file isn't in the current working directory) of the file to
be opened or an integer file descriptor of the file to be
wrapped. (If a file descriptor is given, it is closed when the
returned I/O object is closed, unless closefd is set to False.)

mode is an optional string that specifies the mode in which the file
is opened. It defaults to 'r' which means open for 

#### Fetching file from the web
Download this [file](https://www.uniprot.org/docs/humchrx.txt), which we will use to explore file reading in Python.

In [3]:
import urllib.request

url = "https://www.uniprot.org/docs/humchrx.txt"
destination_filename = "humchrx.txt"
urllib.request.urlretrieve(url, destination_filename)

('humchrx.txt', <http.client.HTTPMessage at 0x7ff450524e20>)
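Downloads can fail, for example when there is no network connection. The sketch below wraps `urlretrieve` in a hypothetical helper, `fetch_if_missing`, which skips the download when the file already exists and reports a failure instead of stopping the notebook. The helper name and its return convention are assumptions made for illustration.

```python
import os
import urllib.error
import urllib.request


def fetch_if_missing(url, destination_filename):
    """Download url to destination_filename unless the file already exists.

    Returns the destination path on success, or None if the download failed.
    """
    if os.path.exists(destination_filename):
        # Nothing to do: reuse the local copy
        return destination_filename
    try:
        urllib.request.urlretrieve(url, destination_filename)
    except urllib.error.URLError as err:
        # Raised for network problems such as a failed name lookup
        print("Download failed:", err.reason)
        return None
    return destination_filename


# Example call (not run here because it needs network access):
# fetch_if_missing("https://www.uniprot.org/docs/humchrx.txt", "humchrx.txt")
```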

#### Reading a file line-at-a-time

We can read the file line by line using `readline`. Each call returns the next line, and an empty string once the end of the file is reached. This approach suits large files that may not fit in memory.

In [5]:
humchrx = open('../Files/humchrx.txt', 'r')
line = humchrx.readline()
print(line)

----------------------------------------------------------------------------



In [6]:
humchrx.close()

In [7]:
with open('../Files/test.txt', 'r') as myfile:
    while True:
        line = myfile.readline()
        if len(line) == 0: # If there are no more lines
            break
        print(line)
    

My first file written from Python 

---------------------------------

Hello, world!
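The blank lines in the output above appear because each line read from the file keeps its trailing newline, and `print` adds another. A common alternative to the `while` loop is to iterate over the file handle directly and strip the newline. This is a sketch using a hypothetical scratch file:

```python
import os
import tempfile

# Hypothetical scratch file with two lines
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as handle:
    handle.write("Hello, world!\nSecond line\n")

stripped_lines = []
with open(path) as handle:
    for line in handle:               # the handle yields one line at a time
        clean = line.rstrip("\n")     # drop the trailing newline
        stripped_lines.append(clean)
        print(clean)
```

Like `readline`, this reads one line at a time, so it also works for files that are too large to hold in memory.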



### Read the whole file

If the file is small, or your computer has enough memory, you can read the whole file into memory as a list of lines using `readlines`.

In [8]:
with open('../Files/test.txt', 'r') as myfile:
    lines = myfile.readlines()
    for line in lines:
        print(line)

My first file written from Python 

---------------------------------

Hello, world!



Alternatively, read the whole file as a single string using `read`.

In [9]:
with open('../Files/test.txt', 'r') as myfile:
    whole_file = myfile.read()
    print(whole_file)

My first file written from Python 
---------------------------------
Hello, world!
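The two approaches return different types. The short sketch below, again using a hypothetical scratch file, compares them and shows `str.splitlines` as one way to get lines without trailing newlines:

```python
import os
import tempfile

# Hypothetical scratch file with two lines
path = os.path.join(tempfile.mkdtemp(), "whole_demo.txt")
with open(path, "w") as handle:
    handle.write("line one\nline two\n")

with open(path) as handle:
    as_list = handle.readlines()      # list of strings, newlines kept

with open(path) as handle:
    as_string = handle.read()         # one string holding the whole file

clean_lines = as_string.splitlines()  # list of strings, newlines removed

print(as_list)
print(repr(as_string))
print(clean_lines)
```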



### Exercise 1

Write a function that reads the file (`humchrx.txt`) and writes a clean list of gene names to another file (`gene_names.txt`).

In [82]:
def get_genes(infile, outfile):
    """
    Function to extract a list of genes and write to file
    """
    gene_list = []
    with open(infile) as gene:
        tag = False  # becomes True once the table header has been seen
        for line in gene:
            if line.startswith('name'):
                tag = True
            if tag:
                items = line.split()
                if len(items) > 0:
                    # The first column holds the gene name
                    gene_list.append(items[0])
    # Drop the first entry (the header line) and the last seven entries
    # (assumed here to be the file footer)
    gene_list = gene_list[1:-7]
    with open(outfile, 'w') as out:
        for i in gene_list:
            out.write(i + '\n')

In [87]:
import genelist

### Scripts and Modules

A script is a file containing Python definitions and statements for performing some analysis. Scripts are known as modules when they are intended for use in other Python programs. Many modules come with Python as part of the standard library.
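As a brief illustration, the sketch below imports two standard-library modules, `math` and `collections`, and uses one name from each. The example sequence is made up for illustration.

```python
import math                        # import a whole module
from collections import Counter   # import a single name from a module

# Access a function through the module name
root = math.sqrt(16)
print(root)

# Count how often each base occurs in a made-up sequence
base_counts = Counter("ACGTACGA")
print(base_counts["A"])
```

With `import math` you reach functions through the module name (`math.sqrt`), while `from collections import Counter` makes `Counter` available directly.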

You can get a list of available modules using `help('modules')` and explore them.

In [10]:
help('modules')


Please wait a moment while I gather a list of all available modules...





Bio                 autoreload          jupyter_console     select
BioSQL              backcall            jupyter_core        selectors
IPython             base64              jupyterlab          send2trash
PyQt5               bdb                 jupyterlab_launcher seqtools
__future__          binascii            keyword             setuptools
_ast                binhex              kiwisolver          shelve
_asyncio            bisect              lib2to3             shlex
_bisect             bleach              linecache           shutil
_blake2             builtins            locale              signal
_bootlocale         bz2                 logging             simplegeneric
_bz2                cProfile            lxml                sip
_codecs             calendar            lzma                sipconfig
_codecs_cn          certifi             macpath             sipdistutils
_codecs_hk          cgi                 macurl2path         site
_codecs_iso2022     cgitb              



### Writing your own modules

All we need to do to create our own modules is to save our script as a file with a `.py` extension. Suppose, for example, this script is saved as a file named `seqtools.py`.

```python
def remove_at(pos, seq):
    return seq[:pos] + seq[pos+1:]
```
    
We can import the module as:

In [2]:
import dnatools

In [3]:
dnatools.percentGC("ADTAFTAFTA")

There is an invalid base 'D' at position 2
There is an invalid base 'F' at position 5
There is an invalid base 'F' at position 8


In [11]:
dnatools.dnacheck("ACGAgTVHTGATA")

There is an invalid base 'V' at position 7
There is an invalid base 'H' at position 8


False

In [11]:
import seqtools

In [12]:
s = "A string!"
seqtools.remove_at(4,s)

'A sting!'

In [16]:
'23,000,'.replace(',','')

'23000'

In [91]:
import genelist

In [94]:
from genelist import get_genes

ImportError: cannot import name 'get_genes'

Modules are useful when you want to analyse large datasets on an HPC cluster or build your own library of handy functions.

#### Running scripts

Once you have put your commands into a `.py` file, you can execute it on the command line by invoking the Python interpreter: `python script.py`.
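A file can serve as both a module and a script. A common Python convention, sketched below, is to put the code that should run only from the command line under an `if __name__ == "__main__":` guard. The function `percent_gc` is a hypothetical helper written for this illustration. It is not the course's `dnatools.percentGC`.

```python
def percent_gc(seq):
    """Return the GC percentage of a DNA sequence (hypothetical helper)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def main():
    # Runs only when the file is executed as a script
    print(percent_gc("ACGTGGCA"))


if __name__ == "__main__":
    # __name__ is "__main__" when run with `python script.py`,
    # but holds the module name when the file is imported
    main()
```

Importing such a file gives access to `percent_gc` without triggering the `print` call.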

### Exercise 2

1. Convert the function you wrote in Exercise 1 into a Python module. Then, import the module and use the function to read the `humchrx.txt` file and create a gene list file.
2. Create a stand-alone script that does all the above.


### Script that takes command line arguments
So far, our scripts do only one thing. You have to edit the script whenever you have a new gene file to analyse or want a different name for the output file.

#### sys.argv
`sys.argv` is a Python list containing the command-line arguments passed to the script; the first element is the script name. Let's add the following to a script named `sysargv.py` and run it on the command line.

```python
import sys
print("This is the name of the script: ", sys.argv[0])
print("Number of arguments: ", len(sys.argv))
print("The arguments are: " , str(sys.argv))
```

In [5]:
!python sysargv.py

This is the name of the script:  sysargv.py
Number of arguments:  1
The arguments are:  ['sysargv.py']
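Because `sys.argv` is an ordinary list, a script can check its length before using the arguments. The sketch below uses a hypothetical helper, `parse_args`, that expects exactly an input and an output file name. The helper and the file names in the demo call are assumptions for illustration.

```python
import sys


def parse_args(argv):
    """Return (infile, outfile) from an argv-style list, or None if malformed.

    argv is expected to look like sys.argv: script name first, then arguments.
    """
    if len(argv) != 3:
        print("Usage: python", argv[0], "<infile> <outfile>")
        return None
    return argv[1], argv[2]


# Demo with an explicit list; in a real script you would pass sys.argv
print(parse_args(["genelist.py", "in.txt", "out.txt"]))
print(parse_args(["genelist.py"]))
```

Passing the list as a parameter, rather than reading `sys.argv` inside the function, makes the helper easy to try out in a notebook.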


In [96]:
!python genelist.py ../Files/humchrx.txt ../Files/command_out.txt

In [1]:
import quiz

#quiz.

(1, 2, 3, 4, 5, 6)
(7, 8, 9, 10, 11, 12)
Kenyan
My circle radius is 7cm while perimetr is 43.988cm and area is 201.088 square cm
His circle radius is 8cm while perimeter is 50.272cm
42
25
5


URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

### Exercise 3

- Using the same concept, convert the script from the previous exercise to take command-line arguments (the input and output files).