Extract features file structure #30

Open
RahelehSalehi opened this issue Mar 15, 2023 · 15 comments

@RahelehSalehi

Hi, I used the ccsmeth extract command to extract features. How is the resulting TSV file structured when I open it in Python?
Could you please give some information about the file structure?
Thank you so much.

@PengNi
Owner

PengNi commented Mar 17, 2023

Hi @RahelehSalehi, the features TSV file is in the following format; each row represents the features of one CpG site:

chrom, position_in_chrom, strand, read_id, position_in_read,
seq_of_fwd_kmer, no_of_fwd_subreads, ipd_mean_of_fwd_kmer, ipd_std_of_fwd_kmer(deprecated), pw_mean_of_fwd_kmer, pw_std_of_fwd_kmer(deprecated), qual_of_fwd_kmer, mapq_of_fwd_kmer,
seq_of_rev_kmer, no_of_rev_subreads, ipd_mean_of_rev_kmer, ipd_std_of_rev_kmer(deprecated), pw_mean_of_rev_kmer, pw_std_of_rev_kmer(deprecated), qual_of_rev_kmer, mapq_of_rev_kmer,
methy_label

Best,
Peng
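
For readers who want to load such a file in Python, here is a minimal sketch using only the standard library. It assumes the file is tab-separated with no header row, which is not stated in the thread and should be checked against your own output. The column names come from the list above; the last two reverse-strand columns are named `qual_of_rev_kmer` and `mapq_of_rev_kmer` here on the assumption that they mirror the forward-strand ones. `read_features` is a hypothetical helper, not part of ccsmeth, and the demo row is a placeholder rather than real data.

```python
import csv
import io

# Column names from the list in the thread. Assumption: the file is
# tab-separated and has no header row (verify against your own file).
COLUMNS = [
    "chrom", "position_in_chrom", "strand", "read_id", "position_in_read",
    "seq_of_fwd_kmer", "no_of_fwd_subreads", "ipd_mean_of_fwd_kmer",
    "ipd_std_of_fwd_kmer", "pw_mean_of_fwd_kmer", "pw_std_of_fwd_kmer",
    "qual_of_fwd_kmer", "mapq_of_fwd_kmer",
    "seq_of_rev_kmer", "no_of_rev_subreads", "ipd_mean_of_rev_kmer",
    "ipd_std_of_rev_kmer", "pw_mean_of_rev_kmer", "pw_std_of_rev_kmer",
    "qual_of_rev_kmer", "mapq_of_rev_kmer",
    "methy_label",
]


def read_features(handle):
    """Yield one dict per row (one CpG site), keyed by COLUMNS."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} fields, got {len(row)}"
            )
        yield dict(zip(COLUMNS, row))


# Placeholder row (v0 .. v21), only to show the mechanics.
demo_line = "\t".join(f"v{i}" for i in range(len(COLUMNS))) + "\n"
rows = list(read_features(io.StringIO(demo_line)))
print(rows[0]["chrom"], rows[0]["methy_label"])
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` object. All values are returned as strings, so any numeric parsing is left to the caller.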

@RahelehSalehi
Author

Thank you so much for your response. Regarding the extracted features: when the signals are normalized, what is the difference between the 'zscore', 'min-max', 'min-mean', and 'mad' normalization methods? Also, in the features I extracted from my data, some of the mean IPD values are negative. Do you know why this happens?

@PengNi
Owner

PengNi commented Mar 23, 2023

@RahelehSalehi, it is because of the zscore normalization. If you check the zscore formula, you will see that values below the mean become negative after normalization. The related code is here: https://github.com/PengNi/ccsmeth/blob/master/ccsmeth/extract_features.py#L169.

Best,
Peng
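
To illustrate why negative values appear, the following sketch applies the textbook z-score formula, (x - mean) / std, next to textbook min-max scaling for contrast. These are generic definitions and not ccsmeth's code; details such as population versus sample standard deviation are assumptions here, and the input numbers are made up.

```python
from statistics import mean, pstdev


def zscore(values):
    """Textbook z-score: subtract the mean, divide by the (population) std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Textbook min-max scaling to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ipd = [2.0, 4.0, 6.0]      # made-up raw values, all positive
z = zscore(ipd)
print(z)                   # the value below the mean comes out negative
print(min_max(ipd))        # min-max output stays within [0, 1]
```

Any input value below the mean yields a negative z-score even when all raw values are positive, which matches the explanation above.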

@olaraym

olaraym commented Mar 28, 2023

@PengNi I find this interesting. I am currently exploring this tool for my data, and it has been a bit technical for me as I have little bioinformatics background. I have used your trained model for calling modifications and for extracting the features. Could you please explain how I use the output of ccsmeth extract in the deep neural network of ccsmeth, as described in your paper on arXiv?

@olaraym

olaraym commented Mar 28, 2023

I also want to know if it will be worthwhile to train my own model

@PengNi
Owner

PengNi commented Mar 28, 2023

@olaraym, hi, you can just follow the steps in the quick start to call modifications and frequencies. To train a new model, please check the ccsmeth train or ccsmeth trainm commands. If your data is non-human, it is worth a try.

@olaraym

olaraym commented Mar 28, 2023

@PengNi thank you very much for your response, I appreciate it. My data is non-human and I will definitely try it out.

@RahelehSalehi
Author

Hi PengNi,
Thank you so much for sharing your code with us. I extracted the features from my data with the extract_features code. The features are:
chrom, position_in_chrom, strand, read_id, position_in_read,
seq_of_fwd_kmer, no_of_fwd_subreads, ipd_mean_of_fwd_kmer, ipd_std_of_fwd_kmer(deprecated), pw_mean_of_fwd_kmer, pw_std_of_fwd_kmer(deprecated), qual_of_fwd_kmer, mapq_of_fwd_kmer,
seq_of_rev_kmer, no_of_rev_subreads, ipd_mean_of_rev_kmer, ipd_std_of_rev_kmer(deprecated), pw_mean_of_rev_kmer, pw_std_of_rev_kmer(deprecated), qual_of_rev_kmer, mapq_of_rev_kmer,
methy_label
Could you please tell me what strand means? When is it positive and when is it negative?
Thanks

@PengNi
Owner

PengNi commented Jul 14, 2023

Hi @RahelehSalehi, if the read is mapped to the reverse strand of the reference (SAM FLAG 0x10), then strand is -.

@RahelehSalehi
Author

Could you please explain a little more? Suppose we have a sequence that involves both a forward strand and a reverse strand. If the value is -, is the strand in the feature list the reverse strand?

@PengNi
Owner

PengNi commented Jul 14, 2023

> Could you please explain a little more? Suppose we have a sequence that involves both a forward strand and a reverse strand. If the value is -, is the strand in the feature list the reverse strand?

The strand value is based on whether 0x10 is set in the FLAG field of an alignment segment. Ref: https://samtools.github.io/hts-specs/SAMv1.pdf
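
The FLAG check described above can be sketched as a single bitwise test. `strand_from_flag` is a hypothetical helper written for illustration, not a function from ccsmeth; only the meaning of bit 0x10 (read mapped to the reverse strand) comes from the SAM specification.

```python
REVERSE_FLAG = 0x10  # SAM FLAG bit: read is reverse-complemented


def strand_from_flag(flag: int) -> str:
    """Return '-' if bit 0x10 is set in the SAM FLAG, otherwise '+'."""
    return "-" if flag & REVERSE_FLAG else "+"


for flag in (0, 16, 83, 99):
    print(flag, strand_from_flag(flag))
```

The FLAG is a bit field, so values such as 83 (1 + 2 + 16 + 64) also have 0x10 set, while 99 (1 + 2 + 32 + 64) does not.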

@RahelehSalehi
Author

Hi Peng,
I have a question about feature extraction. There are two parameters, mapq and identity. Could you please explain them a bit? I'd like to know when I should change the defaults.
Thank you so much.
Best Raheleh

@PengNi
Owner

PengNi commented Oct 25, 2023

Hi Raheleh,

mapq and identity are used to remove low-quality reads. They represent the mapping quality and the identity of an alignment (read-to-reference alignment), respectively. The defaults of the two parameters are 1 and 0.0, respectively, which generally keeps all the reads for feature extraction.

Best,
Peng

@RahelehSalehi
Author

Hi Peng,
Thanks a lot for your response. Could you please explain the difference between setting mapq to 1 and setting it to 0? Is it mandatory to set mapq to 1 if I want to align my dataset? Could you also explain what each of the following settings does?
1. mapq = 0, identity = 0
2. mapq = 0, identity = 1
3. mapq = 1, identity = 0
4. mapq = 1, identity = 1
Thank you so much.

@PengNi
Owner

PengNi commented Oct 25, 2023

Hi Raheleh, mapq is an integer in the range 0-255 (see https://samtools.github.io/hts-specs/SAMv1.pdf); identity is a decimal in the range 0-1 (see https://www.differencebetween.com/difference-between-similarity-and-identity-in-sequence-alignment/). For both parameters, higher values mean higher read quality, so raising either threshold removes more reads and may lead to better predictions, since only the high-quality reads remain.

Best,
Peng
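
As a rough illustration of how the two thresholds interact, the sketch below filters a few made-up alignments under the four settings asked about earlier. It assumes a read is kept when both its mapq and its identity are greater than or equal to the thresholds; whether ccsmeth uses an inclusive comparison is not stated in the thread. `Alignment` and `kept_ids` are hypothetical names, not part of ccsmeth.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    read_id: str
    mapq: int        # mapping quality, 0-255
    identity: float  # alignment identity, 0-1


def kept_ids(alignments, min_mapq=1, min_identity=0.0):
    """Return ids of reads passing both thresholds (assumed inclusive)."""
    return [
        a.read_id
        for a in alignments
        if a.mapq >= min_mapq and a.identity >= min_identity
    ]


# Made-up example alignments, not real data.
reads = [
    Alignment("r1", 0, 0.80),
    Alignment("r2", 20, 0.95),
    Alignment("r3", 60, 1.00),
]

for min_mapq, min_identity in [(0, 0.0), (0, 1.0), (1, 0.0), (1, 1.0)]:
    print(min_mapq, min_identity, kept_ids(reads, min_mapq, min_identity))
```

Under this assumption, an identity threshold of 1 keeps only perfectly matching alignments, while raising mapq from 0 to 1 only drops reads whose mapping quality is 0.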

3 participants