# Full bacmapping pipeline

This Jupyter notebook contains the full pipeline to create the bacmapping database, along with examples of how to use the bacmapping functions.

Note: Running the main pipeline can take around 8 hours using 16 cores.

In [1]:
import bacmapping as bmap
import matplotlib.pyplot as plt
import os
from Bio import Entrez

## Main pipeline

The following function downloads all the necessary files from the FTP server:
-   download can be set to False if the sequences are already available locally
-   onlyType and vtype determine whether and how clones are filtered; by default only BACs are downloaded and mapped
-   email is the address sent to the NIH server when you download sequences

This function creates two folders in the working directory, details and sequences:

-   details contains GFF files from CloneDB, as well as other ways of presenting the data in the folders reordered and repaired
    -   reordered contains clones split by the sequence ID they are contained in
    -   repaired contains clones with their "attributes" split up for simple digestion
-   sequences contains all the sequences related to the clones in fasta format
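
As a rough illustration of what splitting up the "attributes" could look like, the sketch below breaks a GFF attributes string into key/value pairs. `split_attributes` is a hypothetical helper written for this example, not a bacmapping function, and the example string is only the visible prefix of an attributes value from the findOverlappingBACs output later in this notebook.

```python
def split_attributes(attributes: str) -> dict:
    """Split a GFF-style attributes string (key=value;key=value) into a dict."""
    fields = {}
    for item in attributes.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# Visible prefix of one attributes value; the rest is truncated in the output.
example = "ID=41420696;Name=CTD-2126N18;concordant=TRUE"
print(split_attributes(example))
```

Values are kept as strings here; whether the repaired files convert them to other types is not shown in this notebook.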




In [None]:
bmap.getNewClones(download = False, email='') # Remember to put your email in to let NIH know who uses these resources!

The following functions generate the database locally in a folder called maps. mapSequencedClones saves all the maps of clones that are insert-sequenced into the folder sequenced in maps. mapPlacedClones saves all the maps of clones that are end-sequenced into the folder placed in maps.
-   cpustouse determines the number of cores to use when running multiprocessing 
-   chunk_size determines the number of lines read into pandas at once; larger values are faster but require more memory
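
The trade-off behind chunk_size can be illustrated with a small standard-library sketch: rows are consumed in fixed-size batches, so only one batch is held in memory at a time. `iter_chunks` is a hypothetical helper for this example; it does not reproduce how bacmapping or pandas read the files.

```python
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Made-up row labels standing in for lines of a clone table.
rows = [f"clone_{i}" for i in range(2500)]
sizes = [len(chunk) for chunk in iter_chunks(rows, 1000)]
print(sizes)  # [1000, 1000, 500]
```

A larger chunk size means fewer iterations but a bigger list in memory for each one.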

In [None]:
bmap.mapSequencedClones(cpustouse=16) 
bmap.mapPlacedClones(cpustouse=16, chunk_size=1000)

## Functions for statistics

In [None]:
bmap.countPlacedBACs()
bmap.getCoverage()
bmap.getAverageLength()
bmap.getSequencedClonesStats()

Output files:

- countPlacedBACs counts the number of BACs in each end-sequenced library and saves this to counts.csv
- getCoverage determines the number of bases per chromosome which are included in the inserts of end-sequenced BACs in each library and saves this to coverage.csv
- getAverageLength finds the average length of clones in each end-sequenced library and saves this to averagelength.csv
- getSequencedClonesStats gets both the average length and number of clones for each library of insert-sequenced clones
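
To make the three statistics concrete, here is a minimal sketch that computes them for a handful of placements. It is not the bacmapping implementation. The helper names are hypothetical, clone length is assumed to be `end - start`, and the three placements are taken from the getMapsFromLoc output later in this notebook.

```python
from collections import defaultdict

# (library, start, end) for three chromosome 2 placements.
clones = [
    ("CH17", 211480, 259222),
    ("CH17", 137307, 348602),
    ("CH17", 349314, 562283),
]


def count_by_library(clones):
    """Number of clones per library."""
    counts = defaultdict(int)
    for library, _, _ in clones:
        counts[library] += 1
    return dict(counts)


def average_length(clones):
    """Mean insert length, assuming length = end - start."""
    lengths = [end - start for _, start, end in clones]
    return sum(lengths) / len(lengths)


def covered_bases(clones):
    """Bases covered by at least one insert: merge overlapping intervals, then sum."""
    total = 0
    current_start = current_end = None
    for _, start, end in sorted(clones, key=lambda c: c[1]):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


print(count_by_library(clones))
print(average_length(clones))
print(covered_bases(clones))
```

Coverage counts each base once even when several clones span it, which is why the first two intervals collapse into one before summing.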

## Generating the pairs database
Generate the database of all of the clone pairs which have overlapping ends produced by linearization.
-   cpustouse determines the number of cores to use when running multiprocessing 
-   longestoverlap is the longest acceptable overlap between the ends of different linearized BACs
-   shortestoverlap is the shortest acceptable overlap between the ends of different linearized BACs

In [None]:
bmap.makePairs(cpustouse=16, longestoverlap=200, shortestoverlap=20)

## Functions to explore the library

### getRestrictionMap

Given the name of a BAC and an enzyme, returns the cut locations.

In [7]:
name = "RP11-168H2"
enzyme = "SgrDI"
rmap = bmap.getRestrictionMap(name, enzyme)
rmap

['138495']
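
Cut locations like the one above can be turned into expected fragment sizes. The sketch below assumes that positions are offsets from the start of the insert and that the insert is treated as a linear molecule with the vector backbone ignored. Neither assumption is confirmed by this notebook, and `fragment_lengths` is a hypothetical helper.

```python
def fragment_lengths(cuts, insert_length):
    """Return fragment lengths for a linear insert cut at the given positions."""
    positions = sorted(int(c) for c in cuts)  # cut sites may arrive as strings
    edges = [0] + positions + [insert_length]
    return [right - left for left, right in zip(edges, edges[1:])]


# End - Start for RP11-168H2, taken from the getMaps output in this notebook.
insert_length = 17316564 - 17160175
print(fragment_lengths(["138495"], insert_length))  # [138495, 17894]
```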

### getMaps

Given the name of a BAC, returns a dataframe containing all the restriction maps related to that BAC.

In [6]:
name = "RP11-168H2"
maps = bmap.getMaps(name)
maps

Unnamed: 0,Name,Library,Chrom,Start,End,Accession,AloI,BstSNI,BsaWI,CaiI,...,BseCI,Cfr13I,DrdI,HindII,MspJI,PfoI,Ecl136II,HapII,Hpy188III,NmeAIII
1546,RP11-168H2,RP11,7,17160175,17316564,NC_000007.14,"[6033, 6065, 18616, 18648, 38332, 38364, ...","[4919, 62698, 79310, 133543, 143861, 1496...","[19084, 19481, 74282, 87293, 94727, 11635...",overflow,...,[125298],overflow,"[10781, 49805, 65493, 90731, 99864]",overflow,overflow,"[2915, 3242, 4739, 4881, 13777, 15281, 1...","[18361, 23607, 24710, 29045, 30324, 35191...",overflow,overflow,"[5058, 5240, 8934, 40168, 48355, 48537, ..."
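
Each enzyme column in the output above holds either a list of cut positions or the literal value overflow. A minimal sketch for normalising such cells follows. It assumes the lists may be stored as strings, and it treats overflow as "no usable list", since this notebook does not define what overflow means. `parse_cuts` is a hypothetical helper.

```python
import ast


def parse_cuts(cell):
    """Convert one enzyme cell into a list of ints, or None for 'overflow'."""
    if isinstance(cell, str):
        text = cell.strip()
        if text == "overflow":
            return None
        cell = ast.literal_eval(text)  # safely parse "[1, 2, 3]"
    return [int(position) for position in cell]


print(parse_cuts("[10781, 49805, 65493, 90731, 99864]"))
print(parse_cuts("overflow"))
```

`ast.literal_eval` is used instead of `eval` so that only Python literals are accepted.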


### getRightIsoschizomer

Given an enzyme name, returns the name and Bio.Restriction class of the isoschizomer that is present in the database. In the example below, name is the enzyme name as a string and libraryenzyme is the corresponding Bio.Restriction class.

In [None]:
testenzyme = "BsaI"
name, libraryenzyme = bmap.getRightIsoschizomer(testenzyme)
print(name)

### drawMap

Draws a map for a given BAC and enzyme.

In [None]:
name = "RP11-168H2"
enzyme = "DrdI"
rmap = bmap.drawMap(name, enzyme)

### getSequenceFromName

Given the name of a BAC, tries to return the sequence of that insert.

In [5]:
name = "RP11-168H2"
seq = bmap.getSequenceFromName(name)
print(seq)

SeqRecord(seq=Seq('TTCTGTAACTGATTAGGTTTCCCTTTTCTAATTGGCTGCTAGACAGCTAAGAAC...AGC'), id='NC_000007.14', name='NC_000007.14', description='NC_000007.14 Homo sapiens chromosome 7, GRCh38.p14 Primary Assembly', dbxrefs=[])

### getSequenceFromLoc

Given a chromosome, start and end location, returns the sequence of that region.

In [None]:
chrom = 2
start = 100000
end = 105000
seq = bmap.getSequenceFromLoc(chrom,start,end)
print(seq)

### getMapsFromLoc

Given a chromosome, start and end location, returns all the maps in that region.

In [4]:
chrom = 2
start = 100000
end = 500000
maps = bmap.getMapsFromLoc(chrom,start,end)
maps

Unnamed: 0,Name,Library,Chrom,Start,End,Accession,AloI,BstSNI,BsaWI,CaiI,...,BseCI,Cfr13I,DrdI,HindII,MspJI,PfoI,Ecl136II,HapII,Hpy188III,NmeAIII
0,CH17-453L11,CH17,2,211480,259222,NC_000002.12,"[11818, 11819, 11850, 11851, 19470, 19502, 224...","[5910, 7346]","[22058, 27957]","[512, 4420, 7955, 8043, 8148, 8193, 9111, 9849...",...,[9515],overflow,"[234, 1087, 6803, 24859, 32337]","[2208, 4013, 4300, 5176, 5574, 8503, 8772, 898...",overflow,"[4646, 8051, 9229, 9317, 10337, 13410, 13981, ...","[162, 4755, 6566, 11330, 14569, 17781, 23469, ...","[116, 2783, 2968, 2992, 3016, 3040, 3064, 3088...",overflow,"[2572, 2913, 3100, 17430, 18040, 18222, 21823,..."
1,CH17-127P2,CH17,2,137307,348602,NC_000002.12,"[23472, 23504, 49048, 49080, 53574, 53606, 678...","[80083, 81519, 132045, 137034, 154226, 155565,...","[2608, 3108, 11565, 16559, 22986, 23043, 25677...",overflow,...,"[11509, 39155, 83688, 168175, 184477, 188327, ...",overflow,"[5484, 23939, 29485, 46588, 74407, 75260, 8097...",overflow,overflow,overflow,overflow,overflow,overflow,"[5295, 5372, 34228, 34410, 54382, 54566, 58310..."
2,CH17-129E9,CH17,2,349314,562283,NC_000002.12,overflow,"[21373, 31287, 39382, 39432, 39482, 39540, 396...","[335, 7198, 7249, 15357, 16375, 26558, 43496, ...",overflow,...,"[18324, 73641, 101543, 146892]",overflow,"[4241, 7668, 8153, 23217, 29085, 62178, 62535,...",overflow,overflow,overflow,overflow,overflow,overflow,"[1487, 13785, 25628, 32938, 36537, 45490, 4567..."
3,CH17-145K12,CH17,2,119478,327125,NC_000002.12,"[4675, 4707, 10665, 10697, 41301, 41333, 66877...","[15275, 97912, 99348, 149874, 154863, 172055, ...","[8687, 10350, 14920, 20437, 20937, 29394, 3438...",overflow,...,"[29338, 56984, 101517, 186004, 202306, 206156,...",overflow,"[23313, 41768, 47314, 64417, 92236, 93089, 988...",overflow,overflow,overflow,"[18225, 27407, 33944, 34529, 34687, 41013, 411...",overflow,overflow,"[6785, 6876, 7046, 7145, 11116, 15933, 23124, ..."
4,CH17-193H11,CH17,2,107882,327963,NC_000002.12,"[16271, 16303, 22261, 22293, 52897, 52929, 784...","[3290, 8964, 26871, 109508, 110944, 161470, 16...","[10045, 20283, 21946, 26516, 32033, 32533, 409...",overflow,...,"[40934, 68580, 113113, 197600, 213902, 217752,...",overflow,"[7889, 8355, 9054, 34909, 53364, 58910, 76013,...",overflow,overflow,overflow,"[4086, 6876, 29821, 39003, 45540, 46125, 46283...",overflow,overflow,"[4269, 8484, 8665, 10364, 18381, 18472, 18642,..."
5,CH17-193H12,CH17,2,107884,327963,NC_000002.12,"[16269, 16301, 22259, 22291, 52895, 52927, 784...","[3288, 8962, 26869, 109506, 110942, 161468, 16...","[10043, 20281, 21944, 26514, 32031, 32531, 409...",overflow,...,"[40932, 68578, 113111, 197598, 213900, 217750,...",overflow,"[7887, 8353, 9052, 34907, 53362, 58908, 76011,...",overflow,overflow,overflow,"[4084, 6874, 29819, 39001, 45538, 46123, 46281...",overflow,overflow,"[4267, 8482, 8663, 10362, 18379, 18470, 18640,..."
6,CH17-215N22,CH17,2,412954,609655,NC_000002.12,"[814, 846, 3543, 3575, 28861, 28893, 43397, 43...","[4520, 25705, 26493, 58951, 86317]",overflow,overflow,...,"[10001, 37903, 83252, 166844]",overflow,"[1129, 35342, 36407, 37723, 42159, 43214, 4509...",overflow,overflow,overflow,overflow,overflow,overflow,"[3096, 10850, 25436, 25843, 30118, 31772, 3924..."
7,CH17-221K5,CH17,2,25007,252975,NC_000002.12,"[9057, 9089, 10405, 10437, 10537, 10569, 12515...","[47022, 48100, 54561, 62827, 86165, 91839, 109...","[8243, 9040, 9957, 18083, 21087, 21531, 39667,...",overflow,...,"[9114, 68661, 71295, 123809, 151455, 195988]",overflow,"[10396, 13290, 32595, 51839, 57255, 58108, 694...",overflow,overflow,overflow,overflow,overflow,overflow,"[10103, 12984, 13167, 20718, 35161, 36158, 362..."
8,CH17-253H16,CH17,2,343729,585294,NC_000002.12,overflow,"[26958, 36872, 44967, 45017, 45067, 45125, 452...","[5920, 12783, 12834, 20942, 21960, 32143, 4908...",overflow,...,"[23909, 79226, 107128, 152477, 236069]",overflow,"[9826, 13253, 13738, 28802, 34670, 67763, 6812...",overflow,overflow,overflow,overflow,overflow,overflow,"[7072, 19370, 31213, 38523, 42122, 51075, 5125..."
9,CH17-274K18,CH17,2,61138,252867,NC_000002.12,"[32617, 32649, 39803, 39835, 63015, 63047, 690...","[10891, 11969, 18430, 26696, 50034, 55708, 736...","[3536, 3605, 8432, 15685, 31100, 56789, 67027,...",overflow,...,"[32530, 35164, 87678, 115324, 159857]",overflow,"[15708, 21124, 21977, 33321, 43311, 45164, 546...",overflow,overflow,overflow,"[923, 11335, 11826, 12214, 12840, 15900, 18011...",overflow,overflow,"[71, 253, 691, 1109, 11517, 29698, 29836, 3846..."
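
Several rows above start before 100000 or end after 500000, which suggests that clones are selected when they overlap the region rather than when they lie entirely inside it. A minimal sketch of that kind of test follows. `overlaps` is a hypothetical helper, not the library's code, and the placements are copied from the output above.

```python
def overlaps(start, end, region_start, region_end):
    """True if [start, end] and [region_start, region_end] share at least one base."""
    return start <= region_end and end >= region_start


placements = {
    "CH17-453L11": (211480, 259222),  # fully inside the region
    "CH17-221K5": (25007, 252975),    # starts before the region
    "CH17-215N22": (412954, 609655),  # ends after the region
}
region = (100000, 500000)
hits = [name for name, (s, e) in placements.items() if overlaps(s, e, *region)]
print(hits)
```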


### findPairsFromName

Given a BAC name as well as overlap and other details, finds possible BACs with acceptable overlap and restriction sites. This returns a dataframe where each line is a pair of BACs, including details of what enzymes are used and how they cut.

In [3]:
name = "RP11-168H2"
longestoverlap=200
shortestoverlap=20
pairs = bmap.findPairsFromName(name, longestoverlap, shortestoverlap)
pairs

Unnamed: 0,Name1,Start1,End1,Enzyme1,Site1,Name2,Star2t,End2,Enzyme2,Site2
0,RP11-168H2,17160175,17316564,SrfI,17298938,RP11-746H13,17199498,17357438,SacII,17298856
1,RP11-168H2,17160175,17316564,SrfI,17298938,RP11-471P5,17255137,17448215,SacII,17298856
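
One way to read the rows above is to take the distance between Site1 and Site2 as the overlap between the two linearized ends. That reading is an assumption, as is treating both limits as inclusive; the library's exact definition is not shown in this notebook. Both helpers below are hypothetical.

```python
def site_distance(site1, site2):
    """Distance in bases between two cut sites on the same chromosome."""
    return abs(site1 - site2)


def acceptable(site1, site2, shortestoverlap, longestoverlap):
    """Check the distance against the overlap limits (assumed inclusive)."""
    return shortestoverlap <= site_distance(site1, site2) <= longestoverlap


# Site1 (SrfI) and Site2 (SacII) from the first row above.
print(site_distance(17298938, 17298856))        # 82
print(acceptable(17298938, 17298856, 20, 200))  # True
```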


### findOverlappingBACs

Given a BAC name, returns a dataframe with details for all the BACs which overlap the BAC.

In [2]:
name = 'RP11-168H2'
bacs = bmap.findOverlappingBACs(name)
bacs

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,attributes,ID,...,unique,placement-method,assm_unit_name,assm_unit_acc,assm_name,assm_acc,Library,overlapstart,overlapend,overlaplength
974,NC_000007.14,NCBI,clone_insert,17166232,17246601,.,+,.,ID=41420696;Name=CTD-2126N18;concordant=TRUE;u...,41420696,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CTD,17160175,17246601,86426
975,NC_000007.14,NCBI,clone_insert,17166424,17280916,.,+,.,ID=41474384;Name=CTD-2126N24;concordant=TRUE;u...,41474384,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CTD,17160175,17280916,120741
2199,NC_000007.14,NCBI,clone_insert,17030980,17178112,.,+,.,ID=41508888;Name=CTD-2322J11;concordant=TRUE;u...,41508888,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CTD,17160175,17178112,17937
2293,NC_000007.14,NCBI,clone_insert,17081880,17178248,.,+,.,ID=41501000;Name=CTD-2329I7;concordant=TRUE;un...,41501000,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CTD,17160175,17178248,18073
2359,NC_000007.14,NCBI,clone_insert,17189436,17280948,.,+,.,ID=41455051;Name=CTD-2335O7;concordant=TRUE;un...,41455051,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CTD,17160175,17280948,120773
2783,NC_000007.14,NCBI,clone_insert,17105522,17237054,.,+,.,ID=41478336;Name=CTD-2380I21;concordant=TRUE;u...,41478336,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CTD,17160175,17237054,76879
4199,NC_000007.14,NCBI,clone_insert,17096636,17275874,.,+,.,ID=41506832;Name=CTD-3093C15;concordant=TRUE;u...,41506832,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CTD,17160175,17275874,115699
4585,NC_000007.14,NCBI,clone_insert,17254735,17267104,.,+,.,ID=41424799;Name=CTD-3177I22;concordant=TRUE;u...,41424799,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CTD,17160175,17267104,106929
6456,NC_000007.14,NCBI,clone_insert,16970517,17189675,.,+,.,ID=48950582;Name=CH17-186K21;concordant=TRUE;u...,48950582,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CH17,17160175,17189675,29500
6674,NC_000007.14,NCBI,clone_insert,17027660,17222323,.,+,.,ID=48944648;Name=CH17-1M2;concordant=TRUE;uniq...,48944648,...,True,end-seq,Primary Assembly,GCF_000001305.15,GRCh38.p12,GCF_000001405.38,CH17,17160175,17222323,62148
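
For reference, a conventional interval intersection is sketched below. `intersection` is a hypothetical helper. For the CTD-2322J11 row it reproduces the overlapstart, overlapend and overlaplength values shown above, but it is not necessarily how the library fills these columns for every row.

```python
def intersection(a_start, a_end, b_start, b_end):
    """Return (start, end, length) of the shared interval, or None if disjoint."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    if end <= start:
        return None
    return start, end, end - start


query = (17160175, 17316564)  # RP11-168H2 placement from the getMaps output
print(intersection(*query, 17030980, 17178112))  # (17160175, 17178112, 17937)
```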
