# Preprocessing for bulk data 

The bulk sample you want to deconvolute with _MethylBERT_ must also be preprocessed, using the `finetune_data_generate` function.

In [1]:
from methylbert.data import finetune_data_generate as fdg

f_bam = "../test/data/bulk.bam"
f_dmr = "../test/data/dmrs.csv"
f_ref = "../../../genome/hg19.fa"
out_dir = "tmp/"

fdg.finetune_data_generate(
    input_file=f_bam,
    f_dmr=f_dmr,
    f_ref=f_ref,
    output_dir=out_dir,
    n_mers=3,  # 3-mer DNA sequences
    n_cores=20
)

DMRs sorted by areaStat
     chr      start          end  length  nCG  meanMethy1  meanMethy2  \
1  chr10  134597480  134602875.0    5396  670    0.861029    0.140400   
0   chr7    1268957    1277884.0    8928  753    0.793278    0.129747   
2   chr4    1395812    1402597.0    6786  663    0.831162    0.185272   
5  chr16   54962053   54967980.0    5928  546    0.783631    0.096095   
9  chr18   76736906   76741580.0    4675  510    0.829475    0.104403   

   diff.Methy     areaStat  abs_areaStat  abs_diff.Methy ctype  dmr_id  
1    0.720629  6144.089331   6144.089331        0.720629     T       0  
0    0.663531  5722.091790   5722.091790        0.663531     T       1  
2    0.645891  4941.410089   4941.410089        0.645891     T       2  
5    0.687536  4714.551799   4714.551799        0.687536     T       3  
9    0.725072  4684.608381   4684.608381        0.725072     T       4  
Number of DMRs to extract sequence reads: 20
Fine-tuning data generated:                           
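
To make the `n_mers=3` setting concrete, here is a minimal sketch of splitting a DNA sequence into overlapping 3-mers. It assumes a stride of one base, which is what the `dna_seq` column of the preprocessed output appears to show. `to_kmers` is a hypothetical helper written for illustration, not part of the MethylBERT API.

```python
def to_kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1), space-separated.

    Hypothetical helper for illustration only; MethylBERT's own
    tokenisation may differ in detail.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n yields n - k + 1 overlapping k-mers.
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


tokens = to_kmers("GTGGAGTGCC", k=3)
print(tokens)
```

A sequence shorter than `k` produces an empty string, so such reads would contribute no tokens under this sketch.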

`finetune_data_generate` writes a new file, `data.csv`, to the output directory (`tmp/` here); it holds the preprocessed bulk data.

In [4]:
ls tmp/

data.csv  dmrs.csv  test_seq.csv  train_seq.csv


Because no cell-type information is available for the bulk sample, the `ctype` column contains only `NaN` values.

In [3]:
import pandas as pd
pd.read_csv("tmp/data.csv", sep="\t").head()

Unnamed: 0,name,flag,ref_name,ref_pos,map_quality,cigar,next_ref_name,next_ref_pos,length,seq,...,NM,XM,XR,PG,RG,dna_seq,methyl_seq,dmr_ctype,dmr_label,ctype
0,SRR10166000.9089788_9089788_length=151,147,chr10,131767360,42,151M,=,131767187,-324,GTGGAGTGTCGTTGCGTAGTCGGGAGTCGGGAGTAGAATAGTTTGG...,...,49,........xZ.x..Z.x..xZ.....xZ.....x....x..hx......,GA,MarkDuplicates-287B47C6,diffuse_large_B_cell_lymphoma_test_8,GTG TGG GGA GAG AGT GTG TGC GCC CCG CGC GCT CT...,2222222212222122222122222212222222222222222222...,T,5,
1,SRR10165998.65829390_65829390_length=150,163,chr4,20254248,23,151M,=,20254343,244,GGGGATTCTACCTTTACCATCAAATATCTACCGCGAAACTACGACT...,...,35,H..............h......xh.h...x..Z.Zx.h..x.Zx.....,GA,MarkDuplicates-3DAAB091,diffuse_large_B_cell_lymphoma_test_8,GTT TTT TTC TCT CTT TTC TCT CTA TAC ACC CCT CT...,2222222222222222222222222222221212222222122222...,T,19,
2,SRR10165467.85837758_85837758_length=151,99,chr4,1401206,40,151M,=,1401285,227,AAAATGAGAGATTGTTTGTTTTTTTTAATTTGTTTTTAAAAGGGGG...,...,40,...........x..h....hhh.h....hxz.hhhhh............,CT,MarkDuplicates-36E4BA78,Bcell_noncancer_test_8,AAA AAA AAT ATG TGA GAG AGA GAG AGA GAC ACT CT...,2222222222222222222222222222202222222222222222...,T,2,
3,SRR10165995.16747267_16747267_length=149,83,chr2,176945656,40,149M,=,176945572,-233,AAATAACTTAATCTACTTCTCTCCGACCAAACCCAACCCCAAATAC...,...,35,x...hh...hh.............Z.....h.........z.h......,CT,MarkDuplicates-74536757,diffuse_large_B_cell_lymphoma_test_8,GAA AAT ATG TGG GGC GCT CTT TTG TGG GGT GTC TC...,2222222222222222222222122222222222222202222222...,T,12,
4,SRR10165995.46034072_46034072_length=151,99,chr4,20253524,40,151M,=,20253771,398,TCGGATTTGGTGTTATTTATTTGGGAAGCGTCCGGACGGCGGAGCT...,...,2,.Z...h......................Z.hXZ...Z..Z....H....,CT,MarkDuplicates-74536757,diffuse_large_B_cell_lymphoma_test_8,TCG CGG GGA GAC ACT CTT TTG TGG GGT GTG TGT GT...,1222222222222222222222222221222122212212222222...,T,19,
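
A quick sanity check on the preprocessed table can confirm that `ctype` is empty and show how many reads fall into each DMR. The sketch below is one possible way to do this with pandas; `summarise_bulk` is a hypothetical helper, and the small `toy` frame is made-up stand-in data that only mimics the `dmr_label` and `ctype` columns shown above.

```python
import pandas as pd


def summarise_bulk(df):
    """Return (all_ctype_missing, reads_per_dmr) for a preprocessed table.

    Hypothetical helper for illustration; it relies only on the
    `ctype` and `dmr_label` columns.
    """
    # True only if every entry of the ctype column is NaN.
    all_missing = bool(df["ctype"].isna().all())
    # Number of reads assigned to each DMR, ordered by DMR label.
    reads_per_dmr = df["dmr_label"].value_counts().sort_index()
    return all_missing, reads_per_dmr


# Toy stand-in data, not real MethylBERT output.
toy = pd.DataFrame({
    "dmr_label": [5, 19, 2, 12, 19],
    "ctype": [float("nan")] * 5,
})

all_missing, reads_per_dmr = summarise_bulk(toy)
print(all_missing)
print(reads_per_dmr)
```

To run the same check on the real file, replace `toy` with `pd.read_csv("tmp/data.csv", sep="\t")`.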
