# Full bacmapping pipeline

This Jupyter notebook contains the full pipeline to create the bacmapping database and some examples of how to use the bacmapping functions.

Note: Running the main pipeline can take around 8 hours using 16 cores.

In [None]:
import bacmapping as bmap
import matplotlib.pyplot as plt
import os
from Bio import Entrez

## Main pipeline

The following pipeline downloads all the necessary files from the FTP server:
-   download can be set to False if the sequences are already available locally
-   onlyType and vtype determine whether and how clones are filtered; by default only BACs are downloaded and mapped
-   email is the email address sent to the NIH server when you download sequences

This function creates two folders in the working directory, details and sequences:

-   details contains gff files from CloneDB, as well as alternative presentations of the same data in the folders reordered and repaired
    -   reordered contains clones split by the sequence ID they are contained in
    -   repaired contains clones with their "attributes" split up for simple digestion
-   sequences contains all the sequences related to the clones in fasta format
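
Before moving on to the mapping step, it can be handy to confirm that the download produced the folders described above. The following is a minimal sketch, not part of bacmapping: the helper `missing_dirs` is hypothetical, and the nesting of reordered and repaired under details is an assumption based on the description above.

```python
from pathlib import Path

# Folder names come from the description above. Placing reordered/ and
# repaired/ inside details/ is an assumption, not confirmed by the library.
EXPECTED_DIRS = [
    Path("details"),
    Path("details") / "reordered",
    Path("details") / "repaired",
    Path("sequences"),
]


def missing_dirs(workdir="."):
    """Return the expected folders (as POSIX-style strings) absent from workdir."""
    root = Path(workdir)
    return [d.as_posix() for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Check the current working directory and report the result.
missing = missing_dirs()
print("All folders present" if not missing else f"Missing: {missing}")
```

An empty list means every expected folder exists; anything else suggests the download step should be rerun with download set to True.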




In [None]:
bmap.getNewClones(download = False, email='') # Remember to put your email in to let NIH know who uses these resources!

The following functions generate the database locally in a folder called maps. mapSequencedClones saves all the maps of clones that are insert-sequenced into the folder sequenced in maps. mapPlacedClones saves all the maps of clones that are end-sequenced into the folder placed in maps.
-   cpustouse determines the number of cores to use when running multiprocessing
-   chunk_size determines the number of lines to read into pandas at once; larger is faster but requires more memory
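
To make the chunk_size trade-off concrete, here is a small plain-Python sketch of reading a file a fixed number of lines at a time. It only illustrates the idea; bacmapping itself reads chunks into pandas, and the helper `iter_chunks` and the toy rows below are made up for this example.

```python
import io
from itertools import islice


def iter_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size lines from an open text handle."""
    while True:
        chunk = list(islice(handle, chunk_size))
        if not chunk:
            break
        yield chunk


# Toy input standing in for a large clone table (made-up rows).
demo = io.StringIO("".join(f"clone{i}\t{i * 100}\n" for i in range(10)))

# Only one chunk is held in memory at a time, so peak memory grows with
# chunk_size while the number of read rounds shrinks.
sizes = [len(chunk) for chunk in iter_chunks(demo, chunk_size=4)]
print(sizes)  # [4, 4, 2]
```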

In [None]:
bmap.mapSequencedClones(cpustouse=16) 
bmap.mapPlacedClones(cpustouse=16, chunk_size=1000)

## Functions for statistics

In [None]:
bmap.countPlacedBACs()
bmap.getCoverage()
bmap.getAverageLength()
bmap.getSequencedClonesStats()

Output files:

- countPlacedBACs counts the number of BACs in each end-sequenced library and saves this to counts.csv
- getCoverage determines the number of bases per chromosome which are included in the inserts of end-sequenced BACs in each library and saves this to coverage.csv
- getAverageLength finds the average length of clones in each end-sequenced library and saves this to averagelength.csv
- getSequencedClonesStats gets both the average length and number of clones for each library of insert-sequenced clones
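
As a rough illustration of what the coverage and average-length statistics measure, the sketch below computes both from a list of clone placements on a single chromosome. It is an assumption-laden toy, not the library's implementation: the helpers `covered_bases` and `average_length`, the half-open interval convention, and the coordinates are all invented for this example.

```python
def covered_bases(intervals):
    """Count bases covered by at least one half-open (start, end) interval."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps or touches the current block: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def average_length(intervals):
    """Mean insert length over a non-empty list of (start, end) intervals."""
    return sum(end - start for start, end in intervals) / len(intervals)


# Made-up clone placements on one chromosome.
clones = [(100, 300), (250, 400), (1000, 1200)]
print(covered_bases(clones))  # 500
print(average_length(clones))
```

Overlapping clones are merged before counting, so a base covered by several clones contributes to coverage only once, whereas the average length treats each clone independently.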

## Generating the pairs database
Generate the database of all of the clone pairs which have overlapping ends produced by linearization.
-   cpustouse determines the number of cores to use when running multiprocessing 
-   longestoverlap is the longest acceptable overlap between the ends of different linearized BACs
-   shortestoverlap is the shortest acceptable overlap between the ends of different linearized BACs
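
The two overlap parameters act as a window on how much two linearized clones may share. The sketch below shows only that window test on plain coordinate intervals; the actual makePairs logic also has to account for where the enzymes cut, which is not modeled here. The helpers `overlap_length` and `is_acceptable_pair` and the coordinates are hypothetical.

```python
def overlap_length(a, b):
    """Length of the overlap between two (start, end) intervals, 0 if none."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_acceptable_pair(a, b, longestoverlap=200, shortestoverlap=20):
    """True if the overlap lies within [shortestoverlap, longestoverlap]."""
    return shortestoverlap <= overlap_length(a, b) <= longestoverlap


# Made-up linearized fragments on the same chromosome.
left = (0, 1000)
print(is_acceptable_pair(left, (950, 2000)))  # True: 50 bases shared
print(is_acceptable_pair(left, (990, 2000)))  # False: only 10 bases shared
print(is_acceptable_pair(left, (500, 2000)))  # False: 500 bases shared
```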

In [None]:
bmap.makePairs(cpustouse=16, longestoverlap=200, shortestoverlap=20)

## Functions to explore the library

### getRestrictionMap

Given the name of a BAC and an enzyme, returns the cut locations.
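
For readers unfamiliar with restriction maps, the following toy sketch shows what "cut locations" means for a short sequence. It is not how bacmapping computes its maps: the helper `cut_positions` is hypothetical, it scans only the given strand of a linear sequence, and the 1-based "first base after the cut" convention is an assumption chosen for this example. EcoRI (site GAATTC, cutting after the G) is used because its site is well known.

```python
def cut_positions(sequence, site, cut_offset):
    """Return 1-based positions of the first base after each cut.

    cut_offset is the number of bases of the site that lie before the cut.
    """
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + cut_offset + 1)
        # Continue one base further so overlapping sites are also found.
        start = sequence.find(site, start + 1)
    return positions


# EcoRI recognizes GAATTC and cuts between G and A (G^AATTC).
toy = "ttGAATTCaaaaGAATTCgg"
print(cut_positions(toy, "GAATTC", cut_offset=1))  # [4, 14]
```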

In [None]:
name = "RP11-168H2"
enzyme = "SgrDI"
rmap = bmap.getRestrictionMap(name, enzyme)
rmap

### getMaps

Given the name of a BAC, returns a dataframe containing all the restriction maps related to that BAC.

In [None]:
name = "RP11-168H2"
maps = bmap.getMaps(name)
maps

### getRightIsoschizomer

Given an enzyme name, returns the name and Bio.Restriction class of the isoschizomer that is in the database. In the example below, name is a string holding the enzyme name and libraryenzyme is the Bio.Restriction class of that enzyme.

In [None]:
testenzyme = "BsaI"
name, libraryenzyme = bmap.getRightIsoschizomer(testenzyme)
print(name)

### drawMap

Draws a map for a given BAC and enzyme.

In [None]:
name = "RP11-168H2"
enzyme = "DrdI"
rmap = bmap.drawMap(name, enzyme)

### getSequenceFromName

Given the name of a BAC, tries to return the sequence of that insert.

In [None]:
name = "RP11-168H2"
seq = bmap.getSequenceFromName(name)
print(seq)

### getSequenceFromLoc

Given a chromosome and start and end locations, returns the sequence of that region.

In [None]:
chrom = 2
start = 100000
end = 105000
seq = bmap.getSequenceFromLoc(chrom,start,end)
print(seq)

### getMapsFromLoc

Given a chromosome and start and end locations, returns all the maps in that region.

In [None]:
chrom = 2
start = 100000
end = 500000
maps = bmap.getMapsFromLoc(chrom,start,end)
maps

### findPairsFromName

Given a BAC name and the acceptable overlap range, finds BACs that could pair with it, i.e. those with an acceptable overlap and suitable restriction sites. This returns a dataframe where each row is a pair of BACs, including details of which enzymes are used and how they cut.

In [None]:
name = "RP11-168H2"
longestoverlap=200
shortestoverlap=20
pairs = bmap.findPairsFromName(name, longestoverlap, shortestoverlap)
pairs

### findOverlappingBACs

Given a BAC name, returns a dataframe with details for all the BACs which overlap the BAC.

In [None]:
name = 'RP11-168H2'
bacs = bmap.findOverlappingBACs(name)
bacs