# Acquire files in a specific format

If you need to download a file and perform some standard operations on it, you can do so with the `Acquire` object.

The `Acquire` object facilitates the acquisition and preprocessing of files.
The currently supported functions are:

 * Acquisition:
   * curl
   * wget
   * lftp
   * local
   * touch
   * merge
  
 * Processing:
   * Compression
     * unzip
     * gunzip
     * bunzip
     * untar
     * gzip
     * bzip
     * bgzip
   
   * Commands:
     * cat
     * ls
     * call
     * cmd
     * func
    
   * Processing
     * sort
     * tabix
    
 * Renaming
   * finalize
   
The usage of the `Acquire` object always starts with an acquisition command, followed by some processing commands, followed by the `finalize` command.

    biu.utils.Acquire().curl(url).unzip().finalize(finalLocation)

The `Acquire` object is lazily evaluated: acquisition and processing are only performed when the `acquire` command is applied to it.

    biu.utils.Acquire().curl(url).unzip().finalize(finalLocation).acquire()
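
To illustrate the lazy-evaluation idea, here is a minimal, self-contained sketch in plain Python. It is not how `biu` is implemented; the `LazyChain` class and its `step` method are hypothetical names used only for illustration. Steps are recorded when they are chained and run only when `acquire()` is called.

```python
class LazyChain:
    """Hypothetical stand-in for a lazily evaluated pipeline (not biu's code)."""

    def __init__(self):
        self.steps = []   # recorded (name, function) pairs, not yet executed
        self.log = []     # names of the steps that have actually run

    def step(self, name, func):
        # Record the step and return self so that calls can be chained.
        self.steps.append((name, func))
        return self

    def acquire(self, value=None):
        # Only now are the recorded steps executed, in order.
        for name, func in self.steps:
            value = func(value)
            self.log.append(name)
        return value


chain = LazyChain().step("download", lambda _: "raw data").step("upper", str.upper)
print(chain.log)          # nothing has run yet
result = chain.acquire()
print(result, chain.log)
```

Building the chain is cheap, so you can construct and inspect a pipeline before deciding to run it.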

In [1]:
import biu

## The `Acquire` object
The `Acquire` object lets you chain commands one after another. These commands are listed above and exemplified below. To construct an `Acquire` object, simply instantiate the `biu.utils.Acquire` class. Pass `redo=True` to redo every step in the pipeline you create, and use the `where` argument to specify where the files should be downloaded to.

In [2]:
myAcquire = biu.utils.Acquire(redo=True)
print(myAcquire)

Acquire object.
 Re-do steps: yes
 Current steps:
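
The effect of a `redo` flag can be pictured with a small sketch. It assumes that a step is skipped when its output file already exists, unless a re-run is requested; whether `biu` decides this in exactly the same way is an assumption, and `run_step` is a hypothetical helper rather than part of the library.

```python
import os
import tempfile


def run_step(out_path, produce, redo=False):
    """Run `produce` only if the output is missing or a re-run is requested.

    Hypothetical helper: returns True if the step ran, False if it was skipped.
    """
    if os.path.exists(out_path) and not redo:
        return False          # reuse the existing output
    with open(out_path, "w") as handle:
        handle.write(produce())
    return True


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "step_output.txt")

first = run_step(target, lambda: "data\n")               # output missing: runs
second = run_step(target, lambda: "data\n")              # output present: skipped
third = run_step(target, lambda: "data\n", redo=True)    # forced re-run
print(first, second, third)
```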



## Acquisition

### curl

In [3]:
ao = biu.utils.Acquire().curl("http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-i686.zip")\
                        .unzip("plink-1.07-i686/test.ped")\
                        .call("cat")

print(ao)

ao.acquire()

Acquire object.
 Re-do steps: no
 Current steps:
  * curl
  * unzip
  * call



D: curl -L  'http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-i686.zip' > '/home/tgehrmann/repos/BIU/docs/biu/downloads/8f52ad05b4c2ba036683cfedcdedb328eb8c837c'
D: unzip -o -d '/home/tgehrmann/repos/BIU/docs/biu/downloads/8f52ad05b4c2ba036683cfedcdedb328eb8c837c.unzipped' '/home/tgehrmann/repos/BIU/docs/biu/downloads/8f52ad05b4c2ba036683cfedcdedb328eb8c837c'


1 1 0 0 1  1  A A  G T
2 1 0 0 1  1  A C  T G
3 1 0 0 1  1  C C  G G
4 1 0 0 1  2  A C  T T
5 1 0 0 1  2  C C  G T
6 1 0 0 1  2  C C  T T



D: cat '/home/tgehrmann/repos/BIU/docs/biu/downloads/8f52ad05b4c2ba036683cfedcdedb328eb8c837c.unzipped/plink-1.07-i686/test.ped'


'/home/tgehrmann/repos/BIU/docs/biu/downloads/8f52ad05b4c2ba036683cfedcdedb328eb8c837c.unzipped/plink-1.07-i686/test.ped'

### wget

In [4]:
ao = biu.utils.Acquire().wget("http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-i686.zip")\
                        .call('unzip -l %s | head').acquire()

Archive:  /home/tgehrmann/repos/BIU/docs/biu/downloads/8f52ad05b4c2ba036683cfedcdedb328eb8c837c
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2011-06-21 03:20   plink-1.07-i686/
  4589621  2011-06-21 03:20   plink-1.07-i686/plink
  1799865  2011-06-21 03:20   plink-1.07-i686/gPLINK.jar
      138  2007-07-27 16:51   plink-1.07-i686/test.ped
       23  2007-07-27 16:51   plink-1.07-i686/test.map
     1287  2007-07-27 16:51   plink-1.07-i686/README.txt
    15365  2007-07-27 16:51   plink-1.07-i686/COPYING.txt



D: unzip -l /home/tgehrmann/repos/BIU/docs/biu/downloads/8f52ad05b4c2ba036683cfedcdedb328eb8c837c | head
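
The debug lines in the two examples above suggest how `call` builds its shell command: `call("cat")` ran `cat` followed by the file name, while `call('unzip -l %s | head')` put the file name where `%s` appears. The sketch below mimics that behaviour in plain Python. `build_command` is a hypothetical helper, not part of `biu`, and the quoting via `shlex.quote` is an assumption.

```python
import shlex


def build_command(template, path):
    """Fill `path` into `template`, or append it when there is no `%s`."""
    quoted = shlex.quote(path)   # protect spaces and shell metacharacters
    if "%s" in template:
        return template.replace("%s", quoted)
    return f"{template} {quoted}"


print(build_command("cat", "/tmp/my file.ped"))
print(build_command("unzip -l %s | head", "/tmp/archive.zip"))
```

The `%s` form is useful when the file name has to appear in the middle of a command, for example before a pipe.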


### ftp

In [6]:
print(biu.formats.Fasta(biu.utils.Acquire().ftp('ftp.wormbase.org', '/pub/wormbase/releases/WS264/species/c_japonica/PRJNA12591/c_japonica.PRJNA12591.WS264.pseudogenic_transcripts.fa.gz').gunzip().acquire()))
      

error_perm: 500 Unknown command.

### lftp

You can use `lftp` for connections over other protocols, such as SFTP.

In [None]:
biu.utils.Acquire().lftp("sftp-cancer.sanger.ac.uk",
                        "cosmic/grch38/cosmic/v84/VCF/CosmicCodingMuts.vcf.gz",
                        username="t.gehrmann@lumc.nl", password="Cosmic_password1").gunzip().acquire()

### local

There are two ways to make use of a local file. One is more or less a shortcut for the other.

In [None]:
biu.utils.Acquire().local('/etc/group').call("head -n3").acquire()

Or you can pass the file name directly as a parameter to the `Acquire` constructor:

In [None]:
biu.utils.Acquire('/etc/group').call("head -n3").acquire()

### touch

You can also create a new, empty file if that is all you need.

In [None]:
biu.utils.Acquire().touch().acquire()

### merge

You can merge the outputs of multiple `Acquire` pipelines into one file, for example with `cat`.

Currently available methods:
  * cat
  * zcat

In [None]:
ao1 = biu.utils.Acquire().curl("http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-i686.zip")\
                        .unzip("plink-1.07-i686/test.ped")
    
ao2 = biu.utils.Acquire().curl("http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-i686.zip")\
                        .unzip("plink-1.07-i686/test.map")
    
merged = biu.utils.Acquire().merge([ao1, ao2], method='cat').call("wc -l").acquire()
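
As a rough picture of what a `cat`-style merge does, the following sketch concatenates files in order using only the Python standard library. `merge_files` is a hypothetical helper and the file contents are made up for the example; it is not the code `biu` runs.

```python
import os
import shutil
import tempfile


def merge_files(paths, out_path):
    """Concatenate the files in `paths`, in order, into `out_path`."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path


workdir = tempfile.mkdtemp()
parts = []
for name, text in [("a.txt", "1 1 0 0\n"), ("b.txt", "2 1 0 0\n")]:
    path = os.path.join(workdir, name)
    with open(path, "w") as handle:
        handle.write(text)
    parts.append(path)

merged_path = merge_files(parts, os.path.join(workdir, "merged.txt"))
with open(merged_path) as handle:
    merged_text = handle.read()
print(merged_text)
```

A `zcat`-style merge would follow the same pattern, opening each input with `gzip.open` instead of `open`.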

## Processing

### Compression

#### unzip

Unzip a zip file. You can optionally select a specific file from the extracted directory for further processing (otherwise a link to the directory is maintained).

In [None]:
biu.utils.Acquire().curl("http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-i686.zip")\
                   .unzip("plink-1.07-i686/test.ped")\
                   .call("cat").acquire()

In [None]:
biu.utils.Acquire().curl("http://zzz.bwh.harvard.edu/plink/dist/plink-1.07-i686.zip")\
                   .unzip()\
                   .call("ls").acquire()
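
The difference between extracting everything and selecting one member can be reproduced with Python's `zipfile` module. This is only a sketch under assumptions: the archive is built locally instead of downloaded, the member names are made up, and the `.unzipped` suffix simply mirrors the directory name seen in the debug output earlier.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "example.zip")

# Build a small archive so the example does not need a download.
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("pkg/test.ped", "1 1 0 0 1 1 A A G T\n")
    zf.writestr("pkg/test.map", "1 snp1 0 1\n")

out_dir = archive + ".unzipped"
with zipfile.ZipFile(archive) as zf:
    zf.extractall(out_dir)                    # comparable to a plain unzip()
    members = sorted(zf.namelist())

# Picking one member corresponds to passing a file name to unzip().
selected = os.path.join(out_dir, "pkg/test.ped")
with open(selected) as handle:
    first_line = handle.readline()
print(members, first_line)
```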

#### gunzip

gunzip a file.

In [None]:
biu.utils.Acquire().curl("http://geneontology.org/gene-associations/goa_human.gaf.gz")\
                   .gunzip()\
                   .call("head -n3").acquire()
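
For reference, the `gzip` and `gunzip` steps correspond to a compress and decompress round trip, which can be sketched with Python's `gzip` module. The file names here are made up, and this is not the code `biu` runs.

```python
import gzip
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
plain = os.path.join(workdir, "data.txt")
packed = plain + ".gz"
restored = os.path.join(workdir, "restored.txt")

with open(plain, "w") as handle:
    handle.write("line 1\nline 2\n")

# Compress (what a gzip step does) ...
with open(plain, "rb") as src, gzip.open(packed, "wb") as dst:
    shutil.copyfileobj(src, dst)

# ... and decompress again (what a gunzip step does).
with gzip.open(packed, "rb") as src, open(restored, "wb") as dst:
    shutil.copyfileobj(src, dst)

with open(restored) as handle:
    restored_text = handle.read()
print(restored_text == "line 1\nline 2\n")
```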


#### untar

Untar a file. You can optionally select a specific file from the extracted directory for further processing (otherwise a link to the directory is maintained).

In [None]:
biu.utils.Acquire().curl("https://github.com/thiesgehrmann/proteny/archive/0.1.tar.gz")\
                   .gunzip()\
                   .untar()\
                   .call("ls").acquire()

In [None]:
biu.utils.Acquire().curl("https://github.com/thiesgehrmann/proteny/archive/0.1.tar.gz")\
                   .gunzip()\
                   .untar("proteny-0.1/Snakefile")\
                   .call("head -n5").acquire()
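
Selecting a single member of a tar archive can be sketched with Python's `tarfile` module. The archive below is created locally with a made-up member name, so the example runs without a download; it illustrates the idea and is not `biu`'s implementation.

```python
import io
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "example.tar")

# Build a small uncompressed tar archive in place of a download.
payload = b"rule all:\n    input: 'out.txt'\n"
with tarfile.open(archive, "w") as tf:
    info = tarfile.TarInfo(name="proj-0.1/Snakefile")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

with tarfile.open(archive) as tf:
    names = tf.getnames()
    # Read one member directly, similar to untar("proj-0.1/Snakefile").
    member = tf.extractfile("proj-0.1/Snakefile")
    content = member.read()
print(names, content[:9])
```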

#### gzip
gzip a file.

In [None]:
biu.utils.Acquire().curl("https://github.com/thiesgehrmann/proteny/archive/0.1.tar.gz")\
                   .gunzip()\
                   .untar("proteny-0.1/Snakefile")\
                   .gzip().acquire()

#### bgzip

bgzip a file.

In [None]:
biu.utils.Acquire().curl("https://github.com/thiesgehrmann/proteny/archive/0.1.tar.gz")\
                   .gunzip()\
                   .untar("proteny-0.1/Snakefile")\
                   .bgzip().acquire()

### Commands

#### cat

#### ls

#### call

#### cmd

#### func

You can call an arbitrary Python function on the file to do some processing. The function must take two parameters: the input file name and the output file name. It must also return a success or failure state.

In [None]:
def myfunc(inFile, outFile):
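
A function with that two-parameter shape could look like the sketch below. The name `uppercase_copy` is hypothetical, and the return convention (0 for success, 1 for failure, as with shell exit codes) is an assumption; check the library for the value it actually expects.

```python
import os
import tempfile


def uppercase_copy(in_file, out_file):
    """Write an upper-cased copy of `in_file` to `out_file`.

    Returns 0 on success and 1 on failure (an assumed convention).
    """
    try:
        with open(in_file) as src, open(out_file, "w") as dst:
            for line in src:
                dst.write(line.upper())
    except OSError:
        return 1
    return 0


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "in.txt")
target = os.path.join(workdir, "out.txt")
with open(source, "w") as handle:
    handle.write("acgt\n")

status = uppercase_copy(source, target)
missing = uppercase_copy(os.path.join(workdir, "absent.txt"), target)
print(status, missing)
```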

### Processing

#### sort
Sort a file. By default no parameters are passed, but you can provide parameters to sort the file the way you want (these are `sort` command-line options).

In [None]:
biu.utils.Acquire().curl("ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz")\
                   .gunzip()\
                   .sort("-t $'\\t' -k19,19V -k 20,21n")\
                   .call("head")\
                   .acquire()
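
In the options above, `-t $'\t'` sets the tab character as the field separator, `-k19,19V` sorts on column 19 in version order (a GNU `sort` extension), and `-k 20,21n` sorts numerically on columns 20 to 21, the same columns the tabix example further down treats as sequence name, start and end. The sketch below approximates that ordering in Python on a made-up three-column table; `version_key` is a hypothetical helper and does not reproduce every detail of GNU version sorting.

```python
import re


def version_key(text):
    """Split text into digit and non-digit runs so that '2' sorts before '10'."""
    return [(0, int(part)) if part.isdigit() else (1, part)
            for part in re.findall(r"\d+|\D+", text)]


rows = [
    "10\t500\tvarA",
    "2\t300\tvarB",
    "X\t100\tvarC",
    "2\t50\tvarD",
]

# Sort by the first column in version order, then numerically by the second.
ordered = sorted(
    rows,
    key=lambda row: (version_key(row.split("\t")[0]), int(row.split("\t")[1])),
)
for row in ordered:
    print(row)
```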

#### tabix
Use tabix to generate an index for a file.

In [None]:
biu.utils.Acquire().curl("ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz")\
                   .gunzip()\
                   .sort("-t $'\\t' -k19,19V -k 20,21n")\
                   .cmd("awk -F $'\\t' 'BEGIN {OFS = FS} { if($19 != \"na\"){ print $0}}'")\
                   .bgzip()\
                   .tabix(seq=19, start=20, end=21)\
                   .acquire()
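
The `cmd` step in this pipeline uses `awk` to drop every row whose column 19 equals `na` before the file is compressed and indexed. The following sketch expresses the same filter in Python on made-up rows; `keep_row` and `make_line` are hypothetical helpers for illustration only.

```python
def keep_row(line, column=19, missing="na"):
    """Return True unless the 1-based `column` of a tab-separated line equals `missing`."""
    fields = line.rstrip("\n").split("\t")
    # Like awk, treat a line with too few fields as not matching `missing`.
    return len(fields) < column or fields[column - 1] != missing


def make_line(chrom):
    """Build a 21-column tab-separated line with `chrom` in column 19."""
    fields = ["."] * 21
    fields[18] = chrom
    return "\t".join(fields) + "\n"


lines = [make_line("1"), make_line("na"), make_line("X")]
kept = [line for line in lines if keep_row(line)]
print(len(kept))
```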

## Finalize

## Constructing multiple processes

In [None]:
bgzip = biu.utils.Acquire().curl("ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz")\
                   .gunzip()\
                   .sort("-t $'\\t' -k19,19V -k 20,21n")\
                   .cmd("awk -F $'\\t' 'BEGIN {OFS = FS} { if($19 != \"na\"){ print $0}}'")\
                   .bgzip()

tbi = bgzip.tabix(seq=19, start=20, end=21)

In [None]:
print(bgzip)

In [None]:
print(tbi)
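
One way to picture why `bgzip` and `tbi` can coexist as separate pipelines is a chain in which adding a step returns a new object and leaves the original untouched. The sketch below shows that pattern with a hypothetical `Chain` class; whether `biu` copies or shares its steps internally is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    """Hypothetical immutable pipeline: each added step yields a new object."""
    steps: tuple = ()

    def then(self, name):
        # Return a new Chain; the current one keeps its own steps.
        return Chain(self.steps + (name,))


base = Chain().then("curl").then("gunzip").then("sort").then("cmd").then("bgzip")
index = base.then("tabix")

print(base.steps)
print(index.steps)
```

With this design, a shared prefix is written once and each branch only adds the steps that differ.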