# Hello!

## This is a small interactive environment for running DeepMAP algorithms. This one handles the data generation for training our CNN models.

## First, let us import everything necessary to start our data simulation.


In [1]:
from generate_trainingdata import CallTraceGeneration

## Now that the imports are in place, we need to provide the simulation parameters. The simulation parameters file is a JSON file containing the physically relevant parameters, such as the labelling efficiency, the stretch factor on the substrate, the amount of data to generate, and so on.

## The majority of parameters are given in pixels, since this is the unit in which the maps are extracted from the microscope. To convert between basepairs and pixels, we can use the following formula:

##              bp = px*PixelSize/StretchFactor/0.34 (0.34 nm/bp is the length of a single basepair)

## To convert between pixels and nanometers, we have:

##              nm = px*PixelSize

## An example of the JSON is given below. These parameters generate fragments 316 pixels (42.3 kb) long, sampled along the genome with a stride of 3 pixels (400 bp), which amounts to a 10-fold sampling of the genome in total.
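The two conversion formulas above can be sketched as small helper functions. This is only an illustration: the function names are hypothetical and not part of the DeepMAP code, and the pixel size and stretch factor are the example values used in this notebook. Results computed this way may differ slightly from the rounded figures quoted in the text.

```python
BP_LENGTH_NM = 0.34  # length of a single basepair in nm


def px_to_nm(px, pixel_size_nm):
    """Convert a length in pixels to nanometers."""
    return px * pixel_size_nm


def px_to_bp(px, pixel_size_nm, stretch_factor):
    """Convert a length in pixels to basepairs."""
    return px * pixel_size_nm / stretch_factor / BP_LENGTH_NM


def bp_to_px(bp, pixel_size_nm, stretch_factor):
    """Convert a length in basepairs to pixels (inverse of px_to_bp)."""
    return bp * BP_LENGTH_NM * stretch_factor / pixel_size_nm


# Example values taken from the parameters used in this notebook.
pixel_size = 78.6  # nm per pixel
stretch = 1.75

fragment_bp = px_to_bp(316, pixel_size, stretch)
stride_bp = px_to_bp(3, pixel_size, stretch)
print(f"fragment: {fragment_bp / 1000:.1f} kb, stride: {stride_bp:.0f} bp")
```

Keeping both directions of the conversion in one place makes it easier to express parameters such as the shuffled-region lengths in basepairs first and translate them to pixels afterwards.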



In [4]:
sim_params = {
    "Wavelength": 586,                   #-> Emission wavelength of the fluorophore
    "NA": 1.49,                          #-> NA of the objective
    "FragmentSize": 316,                 #-> Fragment size in pixels (number of points)
    "PixelSize": 78.6,                   #-> Calibrated pixel size of the instrument (nm per pixel)
    "Enzyme": "TaqI",                    #-> Restriction/methyltransferase enzyme used for labelling
    "NumTransformations": 5,             #-> Number of simulations. This is the total number of times that a single genome is sampled.
    "StretchingFactor": [1.75],          #-> Stretch factor for the simulations. Default is 1.75
    "LowerBoundEffLabelingRate": 0.75,   #-> Lower bound labelling rate. Labelling rate is uniformly distributed between lower and upper bound
    "UpperBoundEffLabelingRate": 0.9,    #-> Upper bound labelling rate
    "step": 3,                           #-> Sampling step across genome (stride) in number of pixels
    "PixelShift": 0.5,                   #-> Random shift in dye position (+- PixelShift)
    "NoiseAmp": [0.2],                   #-> Relative SNR
    "LocalNormWindow": 10000,            #-> Local normalization window in kb. Set to 0 in order to disable.
    "Min#ShuffledFrags": 0,              #-> Minimal number of shuffled regions within the simulated trace
    "Max#ShuffledFrags": 4,              #-> Maximal number of shuffled regions within the simulated trace
    "MinLengthShuffledFrags": 35,        #-> Minimal length of shuffled regions in pixels
    "MaxLengthShuffledFrags": 55,        #-> Maximal length of shuffled regions in pixels
    "FPR": 0.7,                          #-> False positive rate /kb
    "FPR2": 0.2,                         #-> Double false positive rate
    "Random-min": 1.2,                   #-> Minimal number of dyes/kb for random genome
    "Random-max": 6.8                    #-> Maximal number of dyes/kb for random genome
}
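Since the parameters are described as living in a JSON file, a minimal sketch of writing and reading such a file is given here. It uses only Python's standard `json` module and a stand-in subset of the parameters. The file name `sim_params.json` is hypothetical, and whether `CallTraceGeneration` can read a file path directly is not shown in this notebook, so the sketch only produces a plain dictionary. Note that JSON has no comment syntax, so the `#->` annotations cannot be kept in the file itself.

```python
import json
import tempfile
from pathlib import Path

# Stand-in subset of the simulation parameters (values from this notebook).
example_params = {
    "FragmentSize": 316,
    "PixelSize": 78.6,
    "StretchingFactor": [1.75],
    "step": 3,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    params_path = Path(tmp_dir) / "sim_params.json"

    # Write the parameters to disk as JSON ...
    params_path.write_text(json.dumps(example_params, indent=4))

    # ... and read them back into a plain Python dictionary.
    loaded_params = json.loads(params_path.read_text())

print(loaded_params)
```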

## Let us also provide the necessary paths for storage of the simulation data. This is the path where we will store our training and validation datasets. 


In [5]:
sim_params['Save_path'] = r'Data\simulated\Train' #PATH TO TRAIN

## Lastly, let us also provide the folder containing the FASTA files. Let us take, for example, Escherichia coli:

In [6]:
sim_params['Genomes'] = r'RefGenomes'
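Before starting a long simulation it can help to check which reference genomes are actually present in the folder. The helper below is a hypothetical sketch and not part of the DeepMAP code. It assumes the genomes use the `.fasta` extension, as in the file name that appears in the output further down; which extensions the package itself accepts is not shown here. A temporary folder stands in for `RefGenomes` so that the example is self-contained.

```python
import tempfile
from pathlib import Path


def list_fasta_files(folder):
    """Return the sorted names of all *.fasta files directly inside `folder`."""
    return sorted(p.name for p in Path(folder).glob("*.fasta"))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create one dummy genome file and one unrelated file for the demonstration.
    (Path(tmp_dir) / "NC_000913.3.fasta").write_text(">demo\nACGT\n")
    (Path(tmp_dir) / "notes.txt").write_text("not a genome\n")

    found = list_fasta_files(tmp_dir)

print(found)
```

In the notebook itself, the same helper could be called on the folder stored in `sim_params['Genomes']`.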

## The data is stored in .npz format; however, if needed for plotting, it can also be saved as CSV. To enable the CSV output, set the ConcatToCsv parameter in the simulation parameters to True. For now, let's not concatenate them.

In [7]:
sim_params['ConcatToCsv'] = False #SET TO True TO ALSO SAVE AS CSV

## The training and validation data have to be generated separately. We start with the training data, using the parameters defined above.

In [8]:
CallTraceGeneration(sim_params)

Generating simulated data from genome:NC_000913.3.fasta


 36%|███▌      | 4132/11594 [00:11<00:20, 363.00it/s]

## To simulate the validation data, we need to first switch the saving folder. Also, we need to change the amount of generated data to ~10% of the total training data. Then, we can call the trace generation again.

In [8]:
sim_params['Save_path'] = r'Data\simulated\Val' #PATH TO VAL
sim_params['NumTransformations'] = 5 #~10% of the training data
CallTraceGeneration(sim_params)

Generating simulated data from genome:NC_000913.3.fasta


100%|██████████| 11594/11594 [00:31<00:00, 365.76it/s]
100%|██████████| 11594/11594 [00:29<00:00, 388.00it/s]
100%|██████████| 11594/11594 [00:31<00:00, 373.72it/s]
100%|██████████| 11594/11594 [00:31<00:00, 367.06it/s]
100%|██████████| 11594/11594 [00:31<00:00, 372.50it/s]


Generating random simulated fragments


100%|██████████| 11594/11594 [00:07<00:00, 1499.54it/s]
100%|██████████| 11594/11594 [00:07<00:00, 1506.36it/s]
100%|██████████| 11594/11594 [00:07<00:00, 1492.14it/s]
100%|██████████| 11594/11594 [00:07<00:00, 1527.01it/s]
100%|██████████| 11594/11594 [00:08<00:00, 1406.88it/s]
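The switch from training to validation settings described above can also be done on a copy of the parameters, so that the training configuration stays untouched. The helper `make_validation_params` below is a hypothetical sketch, not part of the DeepMAP code. The training value of 50 transformations is purely illustrative, and the 10% fraction follows the rule of thumb given in the text.

```python
import copy


def make_validation_params(train_params, save_path, fraction=0.1):
    """Return a copy of `train_params` adjusted for validation data."""
    # Deep copy, so that nested lists are not shared with the original.
    val_params = copy.deepcopy(train_params)
    val_params["Save_path"] = save_path
    # Scale the number of transformations, but always keep at least one.
    scaled = round(train_params["NumTransformations"] * fraction)
    val_params["NumTransformations"] = max(1, scaled)
    return val_params


# Illustrative stand-in for the training parameters (50 is an assumed value).
train_params = {
    "Save_path": r"Data\simulated\Train",
    "NumTransformations": 50,
    "NoiseAmp": [0.2],
}

val_params = make_validation_params(train_params, r"Data\simulated\Val")
print(val_params["Save_path"], val_params["NumTransformations"])
```

Working on a copy avoids accidentally reusing the validation folder or sample count if the training cell is run again later.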


## We are all done! We can now proceed to training our networks with the data that we generated.