# bacmapping

This Jupyter notebook contains some examples of how to use the bacmapping functions.

In [None]:
import bacmapping as bmap
import matplotlib.pyplot as plt
import os
from Bio import Entrez

## An example dataset
The following commands are run on the example dataset provided here. To keep the download small, it contains only the details and sequence for chromosome 19 of the RP11 BACs. Running the commands will still take a few minutes.
-   Instead of running getNewClones, which would download all the material, create an empty folder and place the folders details and sequence inside it
-   Run the following commands from inside that folder

To begin, the code below downloads the sequence needed for the rest of the notebook. This step is not part of the example itself; we just need some sequence!

In [None]:
email = "e" # Remember to give NIH your email!

acc = 'NC_000019.10' # the accession we'll be downloading, chromosome 19 from the human genome
wd = os.getcwd() # get current directory
seqdir = os.path.join(wd, 'sequences') # the directory to save the sequence in
os.makedirs(seqdir, exist_ok=True) # make the directory; don't fail if it already exists
fpath = os.path.join(seqdir, acc + '.fasta') # the path to save our new sequence in
Entrez.email = email # You are required to provide an email
net_handle = Entrez.efetch(db="nucleotide", id=acc, rettype="fasta", retmode="text") #This does the heavy lifting, using Biopython to access and download the single file
out_handle = open(fpath, "w") #opening a new path to write the file into
out_handle.write(net_handle.read()) # writing the new sequence file
out_handle.close()
net_handle.close() # close everything out
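
A quick sanity check after a download like this is to read back the header and sequence length of the saved file. The helper below is a minimal sketch, not part of bacmapping: fasta_summary is a hypothetical name, and the tiny demo file stands in for the real chromosome so the snippet runs without network access.

```python
import os
import tempfile


def fasta_summary(path):
    """Return (header, sequence_length) for the first record in a FASTA file."""
    header = None
    length = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    break  # a second record starts here, stop reading
                header = line[1:]
            elif header is not None:
                length += len(line)
    return header, length


# A small made-up file standing in for the downloaded sequence
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "w") as out:
    out.write(">demo_record example\nACGTACGT\nACGT\n")

header, length = fasta_summary(demo_path)
print(header, length)
```

Calling fasta_summary(fpath) on the real download should report a header that begins with the accession and a length in the tens of millions of bases.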

Map the 1956 BACs provided in the example dataset by running mapPlacedClones.

In [None]:
bmap.mapPlacedClones(cpustouse=8, chunk_size=1000)

The statistics on this dataset can be computed by running the following commands, which are described below in the "Functions for statistics" section. They save 4 csv files with the results of this analysis.

In [None]:
bmap.countPlacedBACs()
bmap.getCoverage(chunk_size=1000)
bmap.getAverageLength()
bmap.getSequencedClonesStats()

Next, create a library of BAC pairs which are linearized to produce overlapping ends. We'll set longestoverlap, the longest acceptable overlap in the overlapping end, to 500 and shortestoverlap, the shortest acceptable overlap, to 0. This means that we'll also include BACs which are linearized at the same site. This code will produce a file in pairs detailing all the possible pairs.

In [None]:
bmap.makePairs(cpustouse=8,longestoverlap=500,shortestoverlap=0)

Finally, let's explore one set of maps produced in the library. We'll return all the maps for one BAC which is included in the library and then get an image of the produced map.

In [None]:
name = 'RP11-1055H23'
enzyme = 'FspI'
maps = bmap.getMaps(name)
#print(maps)
rmap = bmap.getRestrictionMap(name,enzyme)
#print(rmap)
plt = bmap.drawMap(name, enzyme)
plt.show()

-   maps from bmap.getMaps(name) is a series of all the restriction maps for RP11-1055H23
-   rmap from bmap.getRestrictionMap(name,enzyme) is just the cut locations of FspI in RP11-1055H23
-   plt is a visual representation of rmap
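
Cut locations like those in rmap are often easier to reason about as fragment lengths. The sketch below assumes the cut locations are available as a plain list of integer positions along a linear insert of known length; the actual return type of getRestrictionMap may differ, and fragment_lengths and the example numbers are made up for illustration.

```python
def fragment_lengths(cut_positions, insert_length):
    """Lengths of the fragments left after cutting a linear insert at each position."""
    cuts = sorted(set(cut_positions))
    edges = [0] + cuts + [insert_length]
    # Each fragment spans the gap between two neighbouring edges
    return [right - left for left, right in zip(edges, edges[1:])]


example_cuts = [1200, 45000, 90500]  # made-up cut locations
fragments = fragment_lengths(example_cuts, 150000)
print(fragments)  # [1200, 43800, 45500, 59500]
```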

## Main pipeline

The following function downloads all the necessary files from the FTP server
-   download can be set to false if the sequence is already available locally
-   onlyType and vtype determine whether clones should be filtered and how so, automatically set to only download and map BACs
-   email is the email sent to the NIH server when you download sequence

This function creates two folders in the working directory, details and sequences

-   details contains gff files from CloneDB, as well as various ways of presenting the data in the folders reordered and repaired
    -   reordered contains clones split by the sequence ID they are contained in
    -   repaired contains clones with their "attributes" split up for simple digestion
-   sequences contains all the sequences related to the clones in fasta format




In [None]:
bmap.getNewClones(download = False, email='') # Remember to put your email in to let NIH know who uses these resources!

The following functions generate the database locally in a folder called maps. mapSequencedClones saves all the maps of clones that are insert-sequenced into the folder sequenced in maps. mapPlacedClones saves all the maps of clones that are end-sequenced into the folder placed in maps.
-   cpustouse determines the number of cores to use when running multiprocessing 
-   chunk_size determines the number of lines to read into pandas at once, larger is faster but requires more memory

In [None]:
bmap.mapSequencedClones(cpustouse=8, chunk_size=1000) 
bmap.mapPlacedClones(cpustouse=8, chunk_size=1000)
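
The memory trade-off behind chunk_size can be illustrated without bacmapping. The sketch below uses only the standard library to read a tab-separated file a fixed number of rows at a time; pandas offers the same idea through the chunksize argument of read_csv. The read_in_chunks helper and the demo rows are hypothetical, and this is not necessarily how bacmapping reads its files internally.

```python
import csv
import io
from itertools import islice


def read_in_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size parsed rows from a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Seven made-up rows: clone name, start, end
demo = io.StringIO(
    "".join(f"clone{i}\t{i * 100}\t{i * 100 + 50}\n" for i in range(7))
)
chunk_sizes = [len(chunk) for chunk in read_in_chunks(demo, chunk_size=3)]
print(chunk_sizes)  # [3, 3, 1]
```

Only one chunk is held in memory at a time, so a smaller chunk_size lowers peak memory at the cost of more iterations.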

Note: Running the main pipeline can take around 8 hours using 16 cores.

## Functions for statistics

In [None]:
bmap.countPlacedBACs()
bmap.getCoverage(chunk_size=1000)
bmap.getAverageLength()
bmap.getSequencedClonesStats()

Output files:

- countPlacedBACs counts the number of BACs in each end-sequenced library and saves this to counts.csv
- getCoverage determines the number of bases per chromosome which are included in the inserts of end-sequenced BACs in each library and saves this to coverage.csv
- getAverageLength finds the average length of clones in each end-sequenced library and saves this to averagelength.csv
- getSequencedClonesStats gets both the average length and number of clones for each library of insert-sequenced clones
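
To make the kind of summary these functions produce more concrete, here is a minimal sketch that counts clones and averages their lengths per library. It assumes the library is the part of the clone name before the first hyphen (for example RP11 in RP11-168H2) and that each clone has a start and end coordinate; library_stats and the demo clones are made up, and bacmapping's own calculation may differ.

```python
from collections import defaultdict


def library_stats(clones):
    """Map each library to (clone_count, mean_length) from (name, start, end) tuples."""
    lengths = defaultdict(list)
    for name, start, end in clones:
        library = name.split("-", 1)[0]  # assumed naming convention
        lengths[library].append(end - start)
    return {lib: (len(v), sum(v) / len(v)) for lib, v in lengths.items()}


demo_clones = [
    ("RP11-1A1", 1000, 161000),
    ("RP11-2B2", 50000, 230000),
    ("CTD-3C3", 0, 120000),
]
stats = library_stats(demo_clones)
print(stats)
```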

## Generating the pairs database
Generate the database of all of the clone pairs which have overlapping ends produced by linearization.
-   cpustouse determines the number of cores to use when running multiprocessing 
-   longestoverlap is the longest acceptable overlap between the ends of different linearized BACs
-   shortestoverlap is the shortest acceptable overlap between the ends of different linearized BACs

In [None]:
bmap.makePairs(cpustouse=8, longestoverlap=200, shortestoverlap=20)
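
The two overlap limits act as a simple filter on candidate pairs. The sketch below assumes the overlap is the number of bases shared between the linearized end of one BAC and the start of the next, and that both limits are inclusive; acceptable_pair and the coordinates are hypothetical and only illustrate the filter, not the makePairs implementation.

```python
def acceptable_pair(first_end, second_start, longestoverlap=200, shortestoverlap=20):
    """True if the overlap between two linearized ends falls within the limits."""
    overlap = first_end - second_start  # bases shared by the two ends
    return shortestoverlap <= overlap <= longestoverlap


print(acceptable_pair(100150, 100000))  # overlap of 150 is within 20..200
print(acceptable_pair(100600, 100000))  # overlap of 600 is too long
print(acceptable_pair(100000, 100000))  # overlap of 0 is too short by default
print(acceptable_pair(100000, 100000, shortestoverlap=0))  # same site allowed
```

Setting shortestoverlap to 0, as in the example dataset above, is what admits pairs that are linearized at the same site.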

## Functions to explore the library

### getRestrictionMap

Given the name of a BAC and an enzyme, returns the cut locations.

In [None]:
name = "RP11-168H2"
enzyme = "SgrDI"
rmap = bmap.getRestrictionMap(name, enzyme)

### getMaps

Given the name of a BAC, returns a dataframe containing all the restriction maps related to that BAC.

In [None]:
name = "RP11-168H2"
maps = bmap.getMaps(name)

### getRightIsoschizomer

Given an enzyme name, returns the name and Bio.Restriction class of the isoschizomer which is in the database. name is a string of the enzyme name, libraryenzyme is the Bio.Restriction class of the enzyme.

In [None]:
testenzyme = "SgrDI"
name , libraryenzyme = bmap.getRightIsoschizomer(testenzyme)

### drawMap

Draws a map for a given BAC and enzyme.

In [None]:
name = "RP11-168H2"
enzyme =  "SgrDI"
rmap = bmap.drawMap(name, enzyme)

### getSequenceFromName

Given the name of a BAC, tries to return the sequence of that insert.

In [None]:
name = "RP11-168H2"
seq = bmap.getSequenceFromName(name)

### getSequenceFromLoc

Given a chromosome, start and end location, returns the sequence of that region.

In [None]:
chrom = 2
start = 100000
end = 500000
seq = bmap.getSequenceFromLoc(chrom,start,end)

### getMapsFromLoc

Given a chromosome, start and end location, returns all the maps in that region.

In [None]:
chrom = 2
start = 100000
end = 500000
maps = bmap.getMapsFromLoc(chrom,start,end)
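
A region lookup of this kind comes down to an interval overlap test. The sketch below assumes that a clone counts as being in the region when its placement overlaps the requested range at all; whether getMapsFromLoc requires partial or full containment is not stated here, and clones_in_region and the placements are made up for illustration.

```python
def clones_in_region(placements, start, end):
    """Names of clones whose [clone_start, clone_end] interval overlaps [start, end]."""
    return [
        name
        for name, clone_start, clone_end in placements
        if clone_start <= end and clone_end >= start
    ]


demo_placements = [
    ("cloneA", 50000, 210000),
    ("cloneB", 480000, 650000),
    ("cloneC", 700000, 860000),
]
hits = clones_in_region(demo_placements, 100000, 500000)
print(hits)  # ['cloneA', 'cloneB']
```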

### findPairsFromName

Given the name of a specific BAC as well as overlap and other details, finds possible BACs with acceptable overlap and restriction sites. This returns a dataframe where each line is a pair of BACs, including details of what enzymes are used and how they cut.

In [None]:
name = "RP11-168H2"
pairs = bmap.findPairsFromName(name)