Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

methyl_summary generation problem #65

Closed
PraddyumnaR opened this issue Jul 9, 2024 · 30 comments
Closed

methyl_summary generation problem #65

PraddyumnaR opened this issue Jul 9, 2024 · 30 comments

Comments

@PraddyumnaR
Copy link

Methylframe <- generate_methylframe(

  • methyl_bed_list = methyl_bed,
  • Sample_count = 0,
  • Methyl_call_type = "Dorado",
  • filter_NAs = 1,
  • max_read_depth = 100,
  • gene_info = TRUE,
  • gene_coordinate_file = Geneco,
  • Gene_column = "gene.ID", # Pass the column name as a string
  • target_info = FALSE,
  • File_prefix = "Sample"
  • )
    Creating the Megaframe
    |--------------------------------------------------|
    |==================================================|
    Error in [.data.frame(Methyl_bed, , c(1, 2, 3, 4, 5, 6)) :
    undefined columns selected

This is the error I am getting after submiting the DeepSignal3 output bed file as input here what should be done to resolve it

@decibel-tom
Copy link
Collaborator

Can you submit your bed file so I can try to reproduce the error?

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Jul 11, 2024

I have solved the problem of methyl_frame geneartion but now other problem I am facing is this

methyl_summary <- create_methyl_summary(dmr_obj,
control = 'C',
treated = 'T',
additional_summary_cols = list(
c('sd', 'Group')
))
[1] "Output is in a different order. Try running again."
Error in [<-.data.frame(*tmp*, , ncol(Output_Frame) + 1, value = c(-50, :
replacement has 2887333 rows, data has 3076303
In addition: Warning messages:
1: In GenePercentPlant$Zeroth_pos == dmr_obj$ZoomFrame_filtered$Zeroth_pos :
longer object length is not a multiple of shorter object length
2: In GenePercentPlant$Gene == dmr_obj$ZoomFrame_filtered$Gene :
longer object length is not a multiple of shorter object length
head(dmr_obj$ZoomFrame_filtered$Zeroth_pos)
[1] 15 16 22 23 36 42
head(dmr_obj$ZoomFrame_filtered$Gene)
[1] "Chr2" "Chr2" "Chr2" "Chr2" "Chr2" "Chr2"
head(dmr_obj$LongPercent)
A tibble: 6 × 9
Groups: Gene, Zeroth_pos, Plant, Position, CX, Strand, Group [6]
Gene Zeroth_pos Plant Position CX Strand Group Chromosome Percent

1 Chr2 15 S1 15 CHH + C Chr2 50
2 Chr2 15 S2 15 CHH + T Chr2 0
3 Chr2 16 S1 16 CHH + C Chr2 17
4 Chr2 16 S2 16 CHH + T Chr2 0
5 Chr2 22 S1 22 CHH + C Chr2 33
6 Chr2 22 S2 22 CHH + T Chr2 0

length(dmr_obj$ZoomFrame_filtered$Zeroth_pos)
[1] 3076303
length(dmr_obj$ZoomFrame_filtered$Gene)
[1] 3076303
length(dmr_obj$LongPercent$Zeroth_pos)
[1] 5915114
length(dmr_obj$LongPercent$Gene)
[1] 5915114

can you tell me how should I resolve this and what might be the issue.I have added most of the information needed to resolve this issue hope you reply soon. Also I have used Dorado with 5mC model

@PraddyumnaR PraddyumnaR changed the title methyl_frame generation problem methyl_summary generation problem Jul 11, 2024
@PraddyumnaR
Copy link
Author

can anyone help me solve this issue

@decibel-tom
Copy link
Collaborator

Can you share either your R environment or attach your dmr_object here?

@decibel-tom
Copy link
Collaborator

I think it's going to be difficult for me to import this representation of your dmr_obj into my code to try to reproduce the error. Can you save your r environment and attach it here as a .Rdata file?

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Jul 20, 2024

sorry that is confidential can you suggest me a solution by the above data or if you need any other data to get to solution I can provide you. What I can tell you is I followed the example data for demo run for which I got the results without facing any problems and I have edited my input file in same way as the example file were but I am facing this problem now.

For experimental data

sapply(dmr_obj, length)
ZoomFrame_filtered LongPercent LonUnMeth
13 9 8
LongMeth experimental_design_df Inputpers
9 6 3

For example data in package

sapply(dmr_obj, length)
ZoomFrame_filtered LongPercent LonUnMeth
22 9 8
LongMeth experimental_design_df Inputpers
9 6 6

could this be making the problem ?

@decibel-tom
Copy link
Collaborator

decibel-tom commented Jul 22, 2024

I appreciate the additional information! So I'm curious what your fix was to the very first issue you raised? Our pipeline was set up for an older version of deepsignal and I'm curious if the output file is drastically different from what the code expects. Our latest improvements have been geared toward the use of Dorado since we have found that works better for us. It does appear like in your input data you have multiple rows for each position? That is the most likely culprit that I can think of with what's been provided since the older version of deepsignal / dorado output produces one row per position in the genome

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Jul 23, 2024 via email

@PraddyumnaR
Copy link
Author

As for the earlier issue I raised about methylframe generation it was because I got the output from deepsignal3 had that format which I later adjusted as per the example files you provided after doing that the issue was resolved.

@jcolicchio-soundag
Copy link
Contributor

The issue is indeed because there are multiple rows for the same position. This leads to "LongMeth" and "LongPercent" not being the appropriate lengths since they get created by using a grouping function. If you remove duplicate positions from your methylframe it should work.

Best,
Jack

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Jul 26, 2024 via email

@jcolicchio-soundag
Copy link
Contributor

Just confirmed, there are still rows with the same chromosome and positions:

nrow(ZoomFrame_filtered)
[1] 3076303
distinct(ZoomFrame_filtered, Chromosome, Position, .keep_all = FALSE)->Zoom_RemoveDuplicates
nrow(Zoom_RemoveDuplicates)
[1] 2887333

If you run this distinct command first to remove duplicates (on the megaframe) your commands should work.

@PraddyumnaR
Copy link
Author

Can you write the whole script which you used to do this because I am still getting errors. And if you were able to do complete analysis please provide the complete script on my email ID praddyumnapdm@bicpu.edu.in

@PraddyumnaR
Copy link
Author

Hey Jack, can you share with me the R commands you used for running those bed files I gave you and how you removed duplicates from it. you can write those here or can mail me on above given mail ID, Thankyou.

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Aug 2, 2024

After removing multiple entries for chromosome and positions this is what I am getting can you help me solve this

methyl_summary <- create_methyl_summary(dmr_obj,
control = 'C',
treated = 'T',
additional_summary_cols = list(
c('sd', 'Group')
))
[1] "Number of columns in methyl_summary is correct"
methyl_summary <- find_DMR(methyl_summary, dmr_obj, fixed = c('Group'),
random = c('Individual'), reads_threshold = 5,
control = 'C', model = 'beta-binomial',
analysis_type = 'individual')
cbind(Meth, UnMeth) ~ (1 | Individual) + Group
<environment: 0x55d7b284a510>
|======================================================================| 100%
There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: In finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old) :
Model convergence problem; . See vignette('troubleshooting'), help('diagnose')
2: In finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old) :
Model convergence problem; . See vignette('troubleshooting'), help('diagnose')
3: In finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old) :
Model convergence problem; . See vignette('troubleshooting'), help('diagnose')
4: In finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old) :
Model convergence problem; . See vignette('troubleshooting'), help('diagnose')
5: In finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old) :
Model convergence problem; . See vignette('troubleshooting'), help('diagnose')

changepoint_cols = find_changepoint_col_options(methyl_summary)
character(0)
methyl_summary <- find_DMR(methyl_summary, dmr_obj, fixed = c('Group'),
random = c('Individual'), reads_threshold = 3,
control = 'C', model = 'binomial',
analysis_type = 'group')
cbind(Meth, UnMeth) ~ (1 | Individual) + Group
<environment: 0x55d745b5b8d8>
boundary (singular) fit: see help('isSingular')
[1] "37 No bobyqa Converge, trying Nelder"
[1] "37 No Converge"
[1] "78 No bobyqa Converge, trying Nelder"
[1] "78 No Converge"
[1] "80 No bobyqa Converge, trying Nelder"
[1] "80 No Converge"
[1] "95 No bobyqa Converge, trying Nelder"
[1] "95 No Converge"
[1] "172 No bobyqa Converge, trying Nelder"
[1] "172 No Converge"
[1] "174 No bobyqa Converge, trying Nelder"
[1] "174 No Converge"

@jcolicchio-soundag
Copy link
Contributor

jcolicchio-soundag commented Aug 2, 2024 via email

@PraddyumnaR
Copy link
Author

Thankyou for the insight Jack I am in process of generating rest of my files hope it works without any problem since methyl summary generation was successful.

@PraddyumnaR
Copy link
Author

Now that I have 3 samples in each Control and treated group still I am getting this error what could be the reason. Methyl summary was generated successfully but now this problem is there.

methyl_summary <- create_methyl_summary(dmr_obj,

  •                                     control = 'C', 
    
  •                                     treated = 'T',
    
  •                                     additional_summary_cols = list(
    
  •                                       c('sd', 'Group')
    
  •                                       ))
    

[1] "Number of columns in methyl_summary is correct"

methyl_summary <- find_DMR(methyl_summary, dmr_obj, fixed = c('Group'),

  •                        random = c('Individual'), reads_threshold = 3, 
    
  •                        control = 'C', model = 'binomial', 
    
  •                        analysis_type = 'group')
    

cbind(Meth, UnMeth) ~ (1 | Individual) + Group
<environment: 0x560204c00df8>
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')

@PraddyumnaR PraddyumnaR reopened this Aug 16, 2024
@PraddyumnaR
Copy link
Author

PraddyumnaR commented Aug 17, 2024 via email

@decibel-tom
Copy link
Collaborator

What is the error you're facing here? The output looks correct to me

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Aug 21, 2024

boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular') is this supposed to happen while generating the output because it is taking too long for the output to generate even when I have given bed file of 2 chromosomes ?

@PraddyumnaR
Copy link
Author

It is running for more than a day now is there any way that we can speed up this process ? I have provided you with the bed files if you can try running it on your end and figure out something it would be great.

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Aug 22, 2024

I am running two analysis from yesterday one on the bed file with entries on chromosome 1 & 2 and another on only chromosome 1 now the line count in the megaframe generated for those two sets are 3649274 for chr 1&2 and 1936024 for Chr1 how much time will it take I have to do the analysis for 12 chromosomes so with this speed it will take 24 days to complete the analysis. Is it possible to speed up the process of this Group DMR analysis. Hope you reply soon.

@decibel-tom
Copy link
Collaborator

For larger analysis we definitely recommend splitting the data into smaller chunks and running on that. For example with whole genome analysis we recommend splitting the data by chromosome or more across multiple machines to run in parallel. Unfortunately we don't have immediate plans to speed up the code on the back end as that would require a major overhaul of the code, but it is something we hope to get to in the next year

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Aug 24, 2024

Okay if that's the case then i think I will have to wait for it be done chromosome by chromosome. Also I completed analysis on one chromosome file but while doing that I faced yet another issue with following command:

changepoint_cols = find_changepoint_col_options(methyl_summary)
[1] "Z_GroupT_small"
target_genes <- unique(dmr_obj$ZoomFrame_filtered$Gene)
methyl_summary <- changepoint_analysis(methyl_summary,
CG_penalty = 5,
CHG_penalty = 7,
CHH_penalty = 6,
target_genes = c(),
save_plots = TRUE,
z_col = "Z_GroupT_small",
whole_genome = TRUE)
Running whole genome changepoint analysis on chromosomes: Chr1
|======================================================================| 100%
Warning message:
The x argument of as_tibble.matrix() must have unique column names if .name_repair is omitted as of tibble 2.0.0.
ℹ Using compatibility .name_repair.
ℹ The deprecated feature was likely used in the sounDMR package.
Please report the issue to the authors.
This warning is displayed once every 8 hours.
Call lifecycle::last_lifecycle_warnings() to see where this warning was generated.

So here what happened is it did changepoint analysis successfully but the volcano and line plots it was supossed to save were not saved also I tried replacing TRUE with T, the volcano plots were generated on R Graphics device but were not saved to the directory. whereas when a month before I ran analysis on demo data it worked perfectly fine. what can be the issue here?

@decibel-tom
Copy link
Collaborator

Do you have any target genes that you're working with? The issue with the plots is that we found the line plots took too much RAM to produce for the entire genome, crashing a lot of our machines. So any data with > 100,000 rows the plots will not be displayed. Are you running this on the data split by chromosome? If that's the case I would recommend setting whole_genome = FALSE (even if your analysis is for whole genome) as the data will then be split gene by gene with a new plot being created and saved for each gene. If you do want to save the volcano plot for the whole genome I would recommend saving the plots on the R Graphics device. This is something that can be changed in a future version of sounDMR, but I don't have a good timeline on when that will be.

@PraddyumnaR
Copy link
Author

PraddyumnaR commented Aug 29, 2024

I was not using the whole genome file instead I ran the analysis on just Chr1 data also i tried it with Whole_genome = FALSE but still it didn't save the plots line plots were not even visible on R Graphics Device only volcano plot was generated. Apart from this I
had a query regarding the output of analysis where two files are generated methyl_summary with headers:
"Chromosome" "Gene" "Position" "Strand" "CX" "Zeroth_pos" "S4_Z" "S4_MethChange" "S4_RD" "S5_Z" "S5_MethChange" "S5_RD" "S6_Z" "S6_MethChange" "S6_RD" "S1" "S2" "S3" "S4" "S5" "S6" "Treat_V_Control" "Control" "Treated" "C_sd" "T_sd" "Z_GroupT_small" "Estimate_GroupT_small" "StdErr_GroupT_small" "MethRegion_Z_GroupT_small" "MethRegionLength_Z_GroupT_small" "MethCytosines_Z_GroupT_small" "MethGroup_Z_GroupT_small"
here S4, S5 & S6 was my treatment group.

and DMR_region_summary with headers:
"cp_group","Start","Stop","Gene","CX","Count","MethRegion_Z_GroupT_small","MethRegionLength_Z_GroupT_small","Treat_V_Control","Control","Estimate_GroupT_small","dmr_score","dmr_score2","dmr_score_Percentile","dmr_score2_Percentile"

Can I know the interpretation of these headers, it would be very helpful if I get the context of what these column entries signify.
Also I would like to Thankyou for the support throughout the issues.

@PraddyumnaR
Copy link
Author

Also can you tell me should we consider sample wise methylchange in our case S4_MethChange", S5_MethChange" and S6_MethChange" or should we consider Treat_V_Control as we are performing group wise analysis which one is significant.

@jcolicchio-soundag
Copy link
Contributor

Sample wise shifts in methylation are the "MethChange" columns, group wise shifts in methylation are Treat_V_Control.

As far as the other columns, they refer to dmr regions, or groups of regions with similar shifts in methylation. That file refers to these different regions and includes summary stats for each one. the "dmr_score" column refer to a scoring metric described in our publication, while the other columns reflect the control methylation, shifts in methylation, T statistic, size of the region, and its start and stop. I will look to add better documentation for this as soon as i can.

@PraddyumnaR
Copy link
Author

Thank you for the reply. Hope you put out a detailed description of output in future which will be helpful for interpretation of users. also if you can update the code such that the experimental design starter is generated in the way it is supposed to be by taking input from user in R-environment itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants