# Preprocessing training data for _MethylBERT_ fine-tuning

_MethylBERT_ fine-tuning needs DNA methylation data from tumour (T) and normal (N) samples as training data. List the sample files with their annotations in a tab-delimited file, one sample per line.

In [1]:
cat ../test/data/bam_list.txt

../test/data/T_sample.bam	T
../test/data/N_sample.bam	N
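
A list in this layout can also be written programmatically. The following is a minimal sketch using Python's standard `csv` module; the helper name `write_bam_list` and the output file name `bam_list_example.txt` are hypothetical and not part of _MethylBERT_, and the check on the label assumes that only `T` and `N` are used, as in the example above.

```python
import csv


def write_bam_list(path, samples):
    """Write (bam_path, label) pairs as a tab-delimited file, one sample per line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for bam_path, label in samples:
            # Assumption: labels are restricted to tumour (T) and normal (N)
            if label not in ("T", "N"):
                raise ValueError(f"label must be 'T' or 'N', got {label!r}")
            writer.writerow([bam_path, label])


# Hypothetical usage with the two example samples
write_bam_list(
    "bam_list_example.txt",
    [("../test/data/T_sample.bam", "T"), ("../test/data/N_sample.bam", "N")],
)
```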



As described in the [data preparation](https://github.com/hanyangii/methylbert/blob/main/tutorials/01_Data_Preparation.md) tutorial, DMRs and the reference genome should be prepared in the required format. 

_MethylBERT_ provides the `finetune_data_generate` function to preprocess the given tumour and normal data.

In [2]:
from methylbert.data import finetune_data_generate as fdg

f_bam_file_list = "../test/data/bam_list.txt"
f_dmr = "../test/data/dmrs.csv"
f_ref = "../../../genome/hg19.fa"
out_dir = "tmp/"

fdg.finetune_data_generate(
    sc_dataset=f_bam_file_list,
    f_dmr=f_dmr,
    f_ref=f_ref,
    output_dir=out_dir,
    split_ratio=0.8,  # Fraction of the data used for training; the rest is kept for evaluation
    n_mers=3,  # 3-mer DNA sequences
    n_cores=20,  # Number of CPU cores used for preprocessing
)

DMRs sorted by areaStat
     chr      start          end  length  nCG  meanMethy1  meanMethy2  \
1  chr10  134597480  134602875.0    5396  670    0.861029    0.140400   
0   chr7    1268957    1277884.0    8928  753    0.793278    0.129747   
2   chr4    1395812    1402597.0    6786  663    0.831162    0.185272   
5  chr16   54962053   54967980.0    5928  546    0.783631    0.096095   
9  chr18   76736906   76741580.0    4675  510    0.829475    0.104403   

   diff.Methy     areaStat  abs_areaStat  abs_diff.Methy ctype  dmr_id  
1    0.720629  6144.089331   6144.089331        0.720629     T       0  
0    0.663531  5722.091790   5722.091790        0.663531     T       1  
2    0.645891  4941.410089   4941.410089        0.645891     T       2  
5    0.687536  4714.551799   4714.551799        0.687536     T       3  
9    0.725072  4684.608381   4684.608381        0.725072     T       4  
Number of DMRs to extract sequence reads: 20
../test/data/T_sample.bam processing (T)...
../test/da

After preprocessing, the output directory contains three files:
1. `dmrs.csv`: the selected DMRs (when the number of DMRs is given), with a `dmr_label` column
2. `train_seq.csv`: the preprocessed training data
3. `test_seq.csv`: the preprocessed evaluation data (20% of the given data, because `split_ratio=0.8`)
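
One way to check how the data were divided is to count the rows in both output files. The sketch below assumes that each row after the header is one read; the helpers `count_rows` and `train_fraction` are hypothetical, and the counts in the example are made up. The observed fraction on real data may not be exactly 0.8.

```python
import csv


def count_rows(path):
    """Count data rows (excluding the header) in a tab-delimited file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def train_fraction(n_train, n_test):
    """Return the fraction of all rows that are in the training file."""
    total = n_train + n_test
    if total == 0:
        raise ValueError("both files are empty")
    return n_train / total


# Made-up counts for illustration
print(train_fraction(800, 200))
```

With real output, the counts would come from `count_rows("tmp/train_seq.csv")` and `count_rows("tmp/test_seq.csv")`.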

In [22]:
ls tmp/

dmrs.csv  test_seq.csv  train_seq.csv


Each preprocessed file is a tab-delimited `.csv` file in which each column holds one field of the given BAM/SAM file. In addition, the columns `dmr_ctype`, `dmr_label` and `ctype` are added:
1. `dmr_ctype`: the specific cell type for each DMR
2. `dmr_label`: the DMR label, used by the read classifier fully-connected network in _MethylBERT_
3. `ctype`: the cell type of the read (as indicated in the input file)
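
These columns make it easy to summarise how many reads fall into each DMR and cell type. The following sketch counts rows per (`dmr_label`, `ctype`) pair using only the standard library; `count_by_label_and_ctype` is a hypothetical helper, and the small in-memory table is illustrative and contains only the three additional columns.

```python
import csv
import io
from collections import Counter


def count_by_label_and_ctype(handle):
    """Count rows per (dmr_label, ctype) pair in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter((row["dmr_label"], row["ctype"]) for row in reader)


# Tiny illustrative table; a real file has many more columns and rows
example = io.StringIO(
    "dmr_ctype\tdmr_label\tctype\n"
    "T\t12\tN\n"
    "T\t11\tT\n"
    "T\t1\tT\n"
    "T\t12\tN\n"
    "T\t1\tN\n"
)
counts = count_by_label_and_ctype(example)
for (label, ctype), n in sorted(counts.items()):
    print(label, ctype, n)
```

For a real file, open it with `open("tmp/test_seq.csv", newline="")` and pass the handle to the same function. Note that the labels are read as strings here.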

In [11]:
import pandas as pd
pd.read_csv("tmp/test_seq.csv", sep='\t').head()

Unnamed: 0,name,flag,ref_name,ref_pos,map_quality,cigar,next_ref_name,next_ref_pos,length,seq,...,PG,XG,NM,XM,XR,dna_seq,methyl_seq,dmr_ctype,dmr_label,ctype
0,SRR10165464.6790597_6790597_length=151,83,chr2,176943541,40,151M,=,176943475,-217,AATTAACAATTTTCATCATAATCTACACATTATTAACATCAAACTT...,...,MarkDuplicates,GA,37,h...hh........z.........x..........h.............,CT,GAT ATT TTG TGG GGC GCA CAA AAT ATT TTT TTT TT...,2222222222220222222222222222222222222222222222...,T,12,N
1,SRR10165994.18752987_18752987_length=149,163,chr7,157486616,40,149M,=,157486650,183,AGGCACGCGACCACCCTAAACCTCGAACAAAACTAAAAAAACGCAA...,...,MarkDuplicates,GA,51,..Z...Z.Zx.......xhh....Zx...xhh...hhhhh..Z..x...,GA,CCG CGC GCA CAC ACG CGC GCG CGG GGC GCC CCA CA...,1222121222222222222222122222222222222222122222...,T,11,T
2,SRR10165994.2935274_2935274_length=150,83,chr7,1270222,42,150M,=,1269981,-391,ACGAACATTAAAACGCACGGAACCGCCGCGACGCGGACTCGCTCTT...,...,MarkDuplicates,GA,27,h.Z.h....hhh..Z...ZX.h..Z..Z.Zx.Z.ZX....Z........,CT,GCG CGA GAG AGC GCA CAT ATT TTG TGG GGG GGA GA...,1222222222221222122222122121221212222212222222...,T,1,T
3,SRR10165464.56090327_56090327_length=151,163,chr2,176949511,42,149M,=,176949602,242,AGGATTTCTTACTACATAACCACAAAAATACATTAAACCCACACCT...,...,MarkDuplicates,GA,36,h.Z.......h....z.hh..z.zx.hh.h....hhh...z.z......,GA,GCG CGC GCT CTT TTT TTC TCT CTT TTG TGC GCT CT...,1222222222222022222020222222222222222202022222...,T,12,N
4,SRR10165464.47924911_47924911_length=150,147,chr7,1272480,42,151M,=,1272378,-253,AATTATTGGGAGTTTGATGTTGATAAGTAAAGTGTTGGAGTGTGGG...,...,MarkDuplicates,CT,31,......z.....h...................z.xz......z......,GA,AAT ATT TTA TAT ATC TCG CGG GGG GGA GAG AGC GC...,2222202222222222222222222222222022022222202220...,T,1,N
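
The `dna_seq` values above are space-separated 3-mers in which consecutive tokens overlap by two bases, consistent with `n_mers=3`. The following is a minimal sketch of such a tokenisation, not _MethylBERT_'s own implementation; `to_kmers` is a hypothetical helper and the input string is chosen to reproduce the first tokens of the first row.

```python
def to_kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers with a stride of one base."""
    if k < 1:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(" ".join(to_kmers("GATTGGCA")))
# GAT ATT TTG TGG GGC GCA
```

A sequence of length `n` yields `n - k + 1` tokens, so a sequence shorter than `k` yields none.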
