Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

help undestanding output #103

Open
RichardCorbett opened this issue Apr 1, 2020 · 13 comments
Open

help undestanding output #103

RichardCorbett opened this issue Apr 1, 2020 · 13 comments

Comments

@RichardCorbett
Copy link

RichardCorbett commented Apr 1, 2020

Hi dropseqpipers,

I have downsampled some single cell data to test for the quality of the results using the same number of reads per cell. I have given each of my 100 cells 10,000 reads in the fastq files.

After analysis I see that only have about ~3000 reads per cell that are used for the anlysis and I am trying to diagnose where the reads might have been filtered out. One possibility is that cutadapt is filtering reads for not having the correct structure, or perhaps too much adapter or similar.

Here is the cutadapt multiqc file:

Sample	r_processed	r_with_adapters	bp_processed	quality_trimmed	bp_written	percent_trimmed
sample1_R2	950971	584570	118871375	12147260	77382403	34.90240774955283
sample2_R2	1058522	655457	132315250	12524937	90626084	31.507453600397533
sample3_R2	929245	557125	116155625	8035848	74496049	35.865310870653055
sample4_R2	958210	563527	119776250	9272229	83291408	30.46083175921771

I'm not very familiar with the output of cutadapt. Does the above indicate that there are many reads that are being filtered out?

@Hoohm
Copy link
Owner

Hoohm commented Apr 1, 2020

Hey @RichardCorbett
I would rather see the log file we create that is called clean_qc or even better, the yield plot if you got it.

@RichardCorbett
Copy link
Author

Thanks @Hoohm,

I don't seem to have either of those:

find . ./samples ./samples/sample1 ./samples/sample1/top_barcodes.csv ./samples/sample1/barcodes.csv ./samples/sample1/trimmed_repaired_R1.fastq.gz ./samples/sample1/trimmed_repaired_R2.fastq.gz ./samples/sample1/empty_barcode_mapping.pkl ./samples/sample1/barcode_ref.pkl ./samples/sample1/barcode_ext_ref.pkl ./samples/sample1/Log.final.out ./samples/sample1/Log.out ./samples/sample1/_STARtmp find: ./samples/sample1/_STARtmp': Permission denied
./samples/sample1/Log.progress.out
./samples/sample1/read
./samples/sample1/read/barcodes.tsv
./samples/sample1/read/genes.tsv
./samples/sample1/read/matrix.mtx
./samples/sample1/barcode_mapping_counts.pkl
./samples/sample1/SJ.out.tab
./samples/sample1/Unmapped.out.mate1.gz
./samples/sample1/final.bam
./samples/sample1/umi
./samples/sample1/umi/expression.long
./samples/sample1/umi/barcodes.tsv
./samples/sample1/umi/genes.tsv
./samples/sample1/umi/matrix.mtx
./logs
./logs/cutadapt
./logs/cutadapt/sample1_R2.qc.txt
./logs/cutadapt/sample1_R1.qc.txt
./logs/cutadapt/sample1.clean_qc.csv
./logs/fastqc
./logs/fastqc/sample1_R1_fastqc.html
./logs/fastqc/sample1_R1_fastqc.zip
./logs/fastqc/sample1_R2_fastqc.html
./logs/fastqc/sample1_R2_fastqc.zip
./logs/bbmap
./logs/bbmap/sample1_repair.txt
./logs/dropseq_tools
./logs/dropseq_tools/sample1_beadSubstitutionSummary.txt
./logs/dropseq_tools/sample1_beadSubstitutionReport.txt
./logs/dropseq_tools/sample1_synthesis_stats_summary.txt
./logs/dropseq_tools/sample1_hist_out_cell.txt
./logs/dropseq_tools/sample1_rna_metrics.txt
./reports
./reports/fastqc_barcodes_data
./reports/fastqc_barcodes_data/multiqc_sources.txt
./reports/fastqc_barcodes_data/multiqc_data.json
./reports/fastqc_barcodes_data/multiqc_general_stats.txt
./reports/fastqc_barcodes_data/multiqc_fastqc.txt
./reports/fastqc_barcodes_data/multiqc.log
./reports/fastqc_reads_data
./reports/fastqc_reads_data/multiqc_sources.txt
./reports/fastqc_reads_data/multiqc_data.json
./reports/fastqc_reads_data/multiqc_general_stats.txt
./reports/fastqc_reads_data/multiqc_fastqc.txt
./reports/fastqc_reads_data/multiqc.log
./reports/fastqc_barcodes.html
./reports/fastqc_reads.html
./reports/RNA_filtering_data
./reports/RNA_filtering_data/multiqc_sources.txt
./reports/RNA_filtering_data/multiqc_data.json
./reports/RNA_filtering_data/multiqc_cutadapt.txt
./reports/RNA_filtering_data/multiqc_general_stats.txt
./reports/RNA_filtering_data/multiqc.log
./reports/RNA_filtering.html
./reports/barcode_filtering_data
./reports/barcode_filtering_data/multiqc_sources.txt
./reports/barcode_filtering_data/multiqc_data.json
./reports/barcode_filtering_data/multiqc_cutadapt.txt
./reports/barcode_filtering_data/multiqc_general_stats.txt
./reports/barcode_filtering_data/multiqc.log
./reports/barcode_filtering.html
./reports/star_data
./reports/star_data/multiqc_sources.txt
./reports/star_data/multiqc_star.txt
./reports/star_data/multiqc_data.json
./reports/star_data/multiqc_general_stats.txt
./reports/star_data/multiqc.log
./reports/star.html
./plots
./plots/rna_metrics
./plots/knee_plots
./plots/knee_plots/sample1_knee_plot.pdf
./summary
./summary/read
./summary/read/barcodes.tsv
./summary/read/genes.tsv
./summary/read/matrix.mtx
./summary/umi
./summary/umi/barcodes.tsv
./summary/umi/genes.tsv
./summary/umi/matrix.mtx
`

@Hoohm
Copy link
Owner

Hoohm commented Apr 1, 2020 via email

@RichardCorbett
Copy link
Author

The plots folder is there, I think that some plots failed, perhaps due to an R library problem on the server. I do have the plots for the same data pre-downsampled (ie. not just 100 cells at 10K reads, but all the data for all the cells:

image

clean_qc file:

Adapter,Sequence,Pair,Count
Illumina_Universal,AGATCGGAAGAG,R1,5
PrefixNX/1,AGATGTGTATAAGAGACAG,R1,0
Trans1,TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG,R1,0
Trans1_rc,CTGTCTCTTATACACATCTGACGCTGCCGACGA,R1,697
Trans2,GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG,R1,0
Trans2_rc,CTGTCTCTTATACACATCTCCGAGCCCACGAGAC,R1,62525
polyA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,R1,6129
polyT,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,R1,780065
polyC,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC,R1,5133
polyG,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG,R1,434
drop-seq,GTACTCTGCGTTGATACCACTGCTTCCGCGGACAGGC,R1,0
Nextera,CTGTCTCTTATACACATCT,R1,104
Illumina_Universal,AGATCGGAAGAG,R2,843
PrefixNX/1,AGATGTGTATAAGAGACAG,R2,627
Trans1,TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG,R2,44
Trans1_rc,CTGTCTCTTATACACATCTGACGCTGCCGACGA,R2,815
Trans2,GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG,R2,31
Trans2_rc,CTGTCTCTTATACACATCTCCGAGCCCACGAGAC,R2,25
polyA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,R2,505027
polyT,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,R2,2125
polyC,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC,R2,7866
polyG,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG,R2,3347
drop-seq,GTACTCTGCGTTGATACCACTGCTTCCGCGGACAGGC,R2,63726
Nextera,CTGTCTCTTATACACATCT,R2,94

thanks!

@RichardCorbett
Copy link
Author

Hi @Hoohm,
Do the plot or data above help you understand how we're only seeing about 3500 reads per cell in our read-based matrix.mtx file? If I understand the plot correctly I expected to see about twice the number that we're seeing. Any thoughts would be appreciated.

@Hoohm
Copy link
Owner

Hoohm commented Apr 3, 2020 via email

@RichardCorbett
Copy link
Author

Star results...
image

@Hoohm
Copy link
Owner

Hoohm commented Apr 4, 2020 via email

@RichardCorbett
Copy link
Author

The 3000 is the total reads per cell. I also have the UMI counts which average down around 1000 counts per cell.

@Hoohm
Copy link
Owner

Hoohm commented Apr 4, 2020

From the cleanqc I see that around 800'000 of your R1 has been trimmed, that means your left with around 200'000 reads after trimming. From there, you loose around 20% so you end up with 150'000 so, assuming similar distribution of your data for all the cells, this would be around 1500 reads per cell.

@RichardCorbett
Copy link
Author

Great. Would it be correct to interpret this as the reads have lots of polyA/T included that don't contain enough 3' RNA to be useful?

@RichardCorbett
Copy link
Author

I just heard that our libraries for this test were 100-150bp shorter than intended which seems to be concordant with the interpretation that many of our read pairs did not contain significant mRNA signal adjacent to the polyA sequence as shown by these lines from the clean_qc file:

polyT,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,R1,780065
polyA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,R2,505027

Does this seem to make sense to you?

thanks
RIchard

@Hoohm
Copy link
Owner

Hoohm commented Jun 24, 2020

Yes, seems to be inline with my interpretation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants