Yesterday, we looked at a scRNA-seq dataset of <i style="color:#EB1960">Danio rerio</i> cells [@zebrafish-data].  We spent a lot of time understanding how the dataset was created, and ended with a bit of a mystery: <i style="color:#C0CF96">why did the cell counts not match up between our datasets?</i>  I don't have an answer to that mystery, unfortunately - but in my experience this mismatch happens in a lot of papers.  We might as well continue with the analysis, rather than getting hung up on a minor anomaly.

I realized since yesterday that there was another gene counts file (the filtered TPMs file).  I downloaded it along with the "experiment metadata" files; I suspect these may be useful, since I know that the TPM values predate quality control:

> For the <b style="color:#EB1960">Smart-seq2 protocol</b> transcript per million (TPM) values reported by Salmon were used for the quality control (QC). Wells with fewer than 900 expressed genes (TPM > 1) or having more than either 60% of ERCC or 45% of mitochondrial content were annotated as <strong style="color:#C0CF96">poor quality cells</strong>. As a result, 322 cells failed QC and 542 single cells were selected for the further study.
>
> -- <cite><b style="color:#EB1960">Quality Control of Single-Cell Data; Materials and Methods Section;</b> @zebrafish-data </cite>
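
To make the quoted thresholds concrete, here is a minimal sketch of that kind of filter on a tiny invented TPM matrix.  The 60% ERCC and 45% mitochondrial cutoffs come from the quote; everything else is an assumption for illustration: the toy values, the <span style="color:#757575">ERCC-</span> and <span style="color:#757575">mt-</span> row-name prefixes, computing the fractions from TPM sums, counting spike-ins as expressed genes, and a toy minimum of 4 expressed genes standing in for the paper's 900.  It is not the authors' actual pipeline.

```r
# Toy TPM matrix: rows are genes, columns are wells (all values invented).
gene.names <- c(paste0("gene", 1:6), "mt-co1", "ERCC-00002")
toy.tpm <- matrix(
    c(5, 3, 2, 4, 6, 2, 10, 20,   # well_A: looks healthy
      0, 0, 1, 0, 0, 0, 5, 900,   # well_B: almost all spike-in
      2, 3, 0, 0, 2, 4, 600, 10), # well_C: almost all mitochondrial
    nrow = length(gene.names),
    dimnames = list(gene.names, c("well_A", "well_B", "well_C"))
)

# Assumed naming scheme for spike-ins and mitochondrial genes.
is.ercc <- grepl("^ERCC-", rownames(toy.tpm))
is.mito <- grepl("^mt-", rownames(toy.tpm))

# Per-well summaries: expressed genes (TPM > 1) and content fractions.
n.expressed <- colSums(toy.tpm > 1)
frac.ercc <- colSums(toy.tpm[is.ercc, , drop = FALSE]) / colSums(toy.tpm)
frac.mito <- colSums(toy.tpm[is.mito, , drop = FALSE]) / colSums(toy.tpm)

min.genes <- 4  # toy stand-in for the paper's 900
poor.quality <- n.expressed < min.genes | frac.ercc > 0.60 | frac.mito > 0.45
poor.quality
```

On the real data, the same column-wise summaries would be computed from the TPM matrix, assuming its row names can be matched to spike-in and mitochondrial genes.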

<details>
    <summary style="color:#C0CF96">
        <b>Can I mix R and Python in the same notebook?</b>
    </summary>
    Yes!  I sometimes use <b style="color:#EB1960">SOS Kernel</b>, which allows me to swap between kernels with ease.  You can change kernels (R vs Python) using <span style="color:#757575">%use</span> magic commands; I've done this before, but the syntax highlighting often gets messed up, so I've chosen not to do that here.
</details>

In [None]:
raw.counts <- Matrix::readMM('./localdata/E-MTAB-7117.expression_tpm.mtx')
dim(raw.counts)



#### <span style="color:#C0CF96">Does the metadata file help us resolve the cell count mystery?</span>

What does this extra metadata file tell us?  Let's see:

In [None]:
meta.data <- read.table("./localdata/E-MTAB-7117.sdrf.txt", sep='\t', header = TRUE)
head(meta.data)
dim(meta.data)
length(which(meta.data['Characteristics.single.cell.quality.'] != "not OK"))

Unnamed: 0_level_0,Source.Name,Comment.ENA_SAMPLE.,Comment.BioSD_SAMPLE.,Characteristics.organism.,Characteristics.strain.,Characteristics.age.,Unit.time.unit.,Characteristics.developmental.stage.,Characteristics.sex.,Characteristics.genotype.,⋯,Comment.ENA_EXPERIMENT.,Scan.Name,Comment.SUBMITTED_FILE_NAME.,Comment.ENA_RUN.,Comment.FASTQ_URI.,Comment.SPOT_LENGTH.,Comment.READ_INDEX_1_BASE_COORD.,Factor.Value.genotype.,Factor.Value.organism.part.,Factor.Value.single.cell.identifier.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>
1,CD4_gill_A1,ERS2634491,SAMEA4814592,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737040,SLX-10875.N701_S513.C9FTNANXX.s_5.r_1.fq.gz,SLX-10875.N701_S513.C9FTNANXX.s_5.r_1.fq.gz,ERR2723271,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723271/ERR2723271_1.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A1
2,CD4_gill_A1,ERS2634491,SAMEA4814592,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737040,SLX-10875.N701_S513.C9FTNANXX.s_5.r_2.fq.gz,SLX-10875.N701_S513.C9FTNANXX.s_5.r_2.fq.gz,ERR2723271,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723271/ERR2723271_2.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A1
3,CD4_gill_A10,ERS2634691,SAMEA4814793,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737240,SLX-10875.N712_S513.C9FTNANXX.s_5.r_1.fq.gz,SLX-10875.N712_S513.C9FTNANXX.s_5.r_1.fq.gz,ERR2723471,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723471/ERR2723471_1.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A10
4,CD4_gill_A10,ERS2634691,SAMEA4814793,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737240,SLX-10875.N712_S513.C9FTNANXX.s_5.r_2.fq.gz,SLX-10875.N712_S513.C9FTNANXX.s_5.r_2.fq.gz,ERR2723471,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723471/ERR2723471_2.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A10
5,CD4_gill_A11,ERS2635221,SAMEA4815323,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737770,SLX-10875.N714_S513.C9FTNANXX.s_5.r_1.fq.gz,SLX-10875.N714_S513.C9FTNANXX.s_5.r_1.fq.gz,ERR2724001,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2724001/ERR2724001_1.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A11
6,CD4_gill_A11,ERS2635221,SAMEA4815323,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737770,SLX-10875.N714_S513.C9FTNANXX.s_5.r_2.fq.gz,SLX-10875.N714_S513.C9FTNANXX.s_5.r_2.fq.gz,ERR2724001,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2724001/ERR2724001_2.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A11


At first glance, <b style="color:#C0CF96">no</b> - it does not.  The metadata has 2112 rows, and removing poor-quality cells leaves us with 1952.

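But it does seem like each cell shows up on more than one row (in the head() output above, every Source.Name appears twice, once for each FASTQ file of the read pair), so let's filter by unique values.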
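
Before filtering, here is a small sketch of how one might confirm that rows are repeated per cell.  It uses an invented three-cell data frame as a stand-in for the real metadata; the column name quality and all values are hypothetical.

```r
# Toy stand-in for meta.data: three cells, two rows each (invented values).
cells <- rep(c("cell_1", "cell_2", "cell_3"), each = 2)
toy.meta <- data.frame(
    Source.Name = cells,
    Scan.Name = paste0(cells, c(".r_1.fq.gz", ".r_2.fq.gz")),
    quality = rep(c("OK", "not OK", "OK"), each = 2)
)

# How many rows does each cell occupy?
rows.per.cell <- table(toy.meta$Source.Name)
table(rows.per.cell)  # tally of cells by number of rows

# Keep one row per cell, then count the cells that passed QC.
one.row.per.cell <- toy.meta[!duplicated(toy.meta$Source.Name), ]
sum(one.row.per.cell$quality != "not OK")
```

If every cell occupies exactly two rows, then counting unique names should halve the row counts.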

In [None]:
dim(unique(meta.data['Source.Name']))
length(
    unique(
        meta.data[
            which(meta.data['Characteristics.single.cell.quality.'] != "not OK"),
            "Source.Name"
        ]
    )
)

Gee whiz, that 976 number returns...  We can re-use the list of 10 extra cells calculated yesterday to investigate them:

In [None]:
discrepancies <- c(
    "ERR2723217", "ERR2723218", "ERR2723236", "ERR2723237", "ERR2723551",
    "ERR2723578", "ERR2723789", "ERR2723794", "ERR2723974", "ERR2723990"
)
# Keep only the rows whose ENA run accession is one of the discrepancies
extra.cells <- meta.data[
    meta.data[["Comment.ENA_RUN."]] %in% discrepancies,
]

In [None]:
#| output: false
# Output hidden to reduce spam
extra.cells[1:10]
extra.cells[11:20]
extra.cells[21:30]
extra.cells[31:40]
extra.cells[41:55]

Unnamed: 0_level_0,Source.Name,Comment.ENA_SAMPLE.,Comment.BioSD_SAMPLE.,Characteristics.organism.,Characteristics.strain.,Characteristics.age.,Unit.time.unit.,Characteristics.developmental.stage.,Characteristics.sex.,Characteristics.genotype.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>
247,CD4_gut_D4,ERS2634771,SAMEA4814873,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry)
248,CD4_gut_D4,ERS2634771,SAMEA4814873,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry)
375,CD4_kidney_C5,ERS2634798,SAMEA4814900,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry)
376,CD4_kidney_C5,ERS2634798,SAMEA4814900,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry)
1053,LCK_gut_G8,ERS2634437,SAMEA4814538,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1054,LCK_gut_G8,ERS2634437,SAMEA4814538,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1055,LCK_gut_G9,ERS2634438,SAMEA4814539,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1056,LCK_gut_G9,ERS2634438,SAMEA4814539,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1305,LCK_thymus_B2,ERS2635210,SAMEA4815312,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1306,LCK_thymus_B2,ERS2635210,SAMEA4815312,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)


Unnamed: 0_level_0,Characteristics.organism.part.,Characteristics.cell.type.,Characteristics.phenotype.,Characteristics.individual.,Characteristics.well.information.,Characteristics.single.cell.quality.,Characteristics.cluster.,Comment.spike.in.,Comment.spike.in.dilution.,Material.Type
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
247,gut,blood cell,mCherry positive cell,2,single cell,OK,unknown,ERCC,1:10,cell
248,gut,blood cell,mCherry positive cell,2,single cell,OK,unknown,ERCC,1:10,cell
375,kidney,blood cell,mCherry positive cell,2,single cell,OK,unknown,ERCC,1:10,cell
376,kidney,blood cell,mCherry positive cell,2,single cell,OK,unknown,ERCC,1:10,cell
1053,gut,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1054,gut,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1055,gut,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1056,gut,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1305,thymus,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1306,thymus,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell


Unnamed: 0_level_0,Protocol.REF,Performer,Protocol.REF.1,Performer.1,Protocol.REF.2,Performer.2,Extract.Name,Material.Type.1,Comment.single.cell.isolation.,Comment.library.construction.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
247,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,CD4_gut_D4,RNA,FACS,smart-seq2
248,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,CD4_gut_D4,RNA,FACS,smart-seq2
375,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,CD4_kidney_C5,RNA,FACS,smart-seq2
376,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,CD4_kidney_C5,RNA,FACS,smart-seq2
1053,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_gut_G8,RNA,FACS,smart-seq2
1054,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_gut_G8,RNA,FACS,smart-seq2
1055,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_gut_G9,RNA,FACS,smart-seq2
1056,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_gut_G9,RNA,FACS,smart-seq2
1305,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_thymus_B2,RNA,FACS,smart-seq2
1306,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_thymus_B2,RNA,FACS,smart-seq2


Unnamed: 0_level_0,Comment.input.molecule.,Comment.primer.,Comment.end.bias.,Comment.LIBRARY_LAYOUT.,Comment.LIBRARY_SELECTION.,Comment.LIBRARY_SOURCE.,Comment.LIBRARY_STRAND.,Comment.LIBRARY_STRATEGY.,Comment.NOMINAL_LENGTH.,Comment.NOMINAL_SDEV.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>
247,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
248,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
375,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
376,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1053,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1054,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1055,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1056,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1305,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1306,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20


Unnamed: 0_level_0,Comment.ORIENTATION.,Protocol.REF.3,Performer.3,Assay.Name,Technology.Type,Comment.ENA_EXPERIMENT.,Scan.Name,Comment.SUBMITTED_FILE_NAME.,Comment.ENA_RUN.,Comment.FASTQ_URI.,Comment.SPOT_LENGTH.,Comment.READ_INDEX_1_BASE_COORD.,Factor.Value.genotype.,Factor.Value.organism.part.,Factor.Value.single.cell.identifier.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>
247,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,CD4_gut_D4,sequencing assay,ERX2737320,SLX-12114.i720_i506.HKKG2BBXX.s_1.r_1.fq.gz,SLX-12114.i720_i506.HKKG2BBXX.s_1.r_1.fq.gz,ERR2723551,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723551/ERR2723551_1.fastq.gz,300,151,Tg(cd4-1:mCherry),gut,CD4_gut_D4
248,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,CD4_gut_D4,sequencing assay,ERX2737320,SLX-12114.i720_i506.HKKG2BBXX.s_1.r_2.fq.gz,SLX-12114.i720_i506.HKKG2BBXX.s_1.r_2.fq.gz,ERR2723551,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723551/ERR2723551_2.fastq.gz,300,151,Tg(cd4-1:mCherry),gut,CD4_gut_D4
375,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,CD4_kidney_C5,sequencing assay,ERX2737347,SLX-10874.N705_S505.C9FTNANXX.s_4.r_1.fq.gz,SLX-10874.N705_S505.C9FTNANXX.s_4.r_1.fq.gz,ERR2723578,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/008/ERR2723578/ERR2723578_1.fastq.gz,250,126,Tg(cd4-1:mCherry),kidney,CD4_kidney_C5
376,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,CD4_kidney_C5,sequencing assay,ERX2737347,SLX-10874.N705_S505.C9FTNANXX.s_4.r_2.fq.gz,SLX-10874.N705_S505.C9FTNANXX.s_4.r_2.fq.gz,ERR2723578,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/008/ERR2723578/ERR2723578_2.fastq.gz,250,126,Tg(cd4-1:mCherry),kidney,CD4_kidney_C5
1053,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_gut_G8,sequencing assay,ERX2736986,SLX-12119.i710_i521.HKCTNBBXX.s_1.r_1.fq.gz,SLX-12119.i710_i521.HKCTNBBXX.s_1.r_1.fq.gz,ERR2723217,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/007/ERR2723217/ERR2723217_1.fastq.gz,300,151,Tg(lck:EGFP),gut,LCK_gut_G8
1054,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_gut_G8,sequencing assay,ERX2736986,SLX-12119.i710_i521.HKCTNBBXX.s_1.r_2.fq.gz,SLX-12119.i710_i521.HKCTNBBXX.s_1.r_2.fq.gz,ERR2723217,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/007/ERR2723217/ERR2723217_2.fastq.gz,300,151,Tg(lck:EGFP),gut,LCK_gut_G8
1055,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_gut_G9,sequencing assay,ERX2736987,SLX-12119.i711_i521.HKCTNBBXX.s_1.r_1.fq.gz,SLX-12119.i711_i521.HKCTNBBXX.s_1.r_1.fq.gz,ERR2723218,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/008/ERR2723218/ERR2723218_1.fastq.gz,300,151,Tg(lck:EGFP),gut,LCK_gut_G9
1056,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_gut_G9,sequencing assay,ERX2736987,SLX-12119.i711_i521.HKCTNBBXX.s_1.r_2.fq.gz,SLX-12119.i711_i521.HKCTNBBXX.s_1.r_2.fq.gz,ERR2723218,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/008/ERR2723218/ERR2723218_2.fastq.gz,300,151,Tg(lck:EGFP),gut,LCK_gut_G9
1305,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_thymus_B2,sequencing assay,ERX2737759,SLX-12119.i702_i503.HKCTNBBXX.s_1.r_1.fq.gz,SLX-12119.i702_i503.HKCTNBBXX.s_1.r_1.fq.gz,ERR2723990,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/000/ERR2723990/ERR2723990_1.fastq.gz,300,151,Tg(lck:EGFP),thymus,LCK_thymus_B2
1306,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_thymus_B2,sequencing assay,ERX2737759,SLX-12119.i702_i503.HKCTNBBXX.s_1.r_2.fq.gz,SLX-12119.i702_i503.HKCTNBBXX.s_1.r_2.fq.gz,ERR2723990,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/000/ERR2723990/ERR2723990_2.fastq.gz,300,151,Tg(lck:EGFP),thymus,LCK_thymus_B2


But, as before, there don't really seem to be any distinguishing features.  So the mystery continues!
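
Eyeballing wide tables is error-prone, so here is one way the comparison could be made more systematic: for each column, check whether the flagged rows take values that never occur among the other rows.  This is only a sketch on an invented data frame; the columns run, organ, quality, and batch are hypothetical, and the toy data is deliberately rigged so that one column (batch) does stand out, which is not what the real metadata appears to show.

```r
# Toy metadata (all values invented); run_1 and run_3 play the "extra" cells.
toy.meta <- data.frame(
    run = paste0("run_", 1:8),
    organ = c("gut", "gut", "kidney", "thymus", "gut", "kidney", "thymus", "gill"),
    quality = rep("OK", 8),
    batch = c("B", "A", "B", "A", "A", "A", "A", "A")
)
flagged <- toy.meta$run %in% c("run_1", "run_3")

# A column is "distinguishing" here if none of the values seen in the
# flagged rows ever occurs in the remaining rows.
candidate.columns <- c("organ", "quality", "batch")
distinguishing <- sapply(candidate.columns, function(column) {
    flagged.values <- unique(toy.meta[[column]][flagged])
    other.values <- unique(toy.meta[[column]][!flagged])
    length(intersect(flagged.values, other.values)) == 0
})
distinguishing

# Cross-tabulate any single column against the flag for a closer look.
table(value = toy.meta$organ, flagged = flagged)
```

A stricter or looser definition of "distinguishing" is possible; this one only catches columns where the flagged rows are cleanly separated from everything else.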

# Quality Control

### References

::: {#refs}
:::
