In [None]:
#| echo: false
# Run this to allow the use of R
# Cheekily, rpy2 doesn't work with the Matrix package of R,
# so I actually ran all the R using a document-wide R kernel,
# this is just for syntax highlighting...
%load_ext rpy2.ipython

Yesterday, we looked at a scRNA-seq dataset of <i style="color:#EB1960">Danio rerio</i> cells [@zebrafish-data].  We spent a lot of time understanding how the dataset was created, and ended with a bit of a mystery: <i style="color:#C0CF96">why did the cell counts not match up between our datasets?</i>  I don't have an answer to that mystery, unfortunately - but in my experience this mismatch happens in a lot of papers.  We might as well continue with the analysis, rather than getting hung up on a minor anomaly.

Since yesterday, I've realized that there is another gene counts file (the filtered TPMs file).  I downloaded it along with the "experiment metadata" files; I suspect these may be useful, as I know that the TPM values were pre-quality-control:

> For the <b style="color:#EB1960">Smart-seq2 protocol</b> transcript per million (TPM) values reported by Salmon were used for the quality control (QC). Wells with fewer than 900 expressed genes (TPM > 1) or having more than either 60% of ERCC or 45% of mitochondrial content were annotated as <strong style="color:#C0CF96">poor quality cells</strong>. As a result, 322 cells failed QC and 542 single cells were selected for the further study.
>
> -- <cite><b style="color:#EB1960">Quality Control of Single-Cell Data; Materials and Methods Section;</b> @zebrafish-data </cite>

<details>
    <summary style="color:#C0CF96">
        <b>Can I mix R and Python in the same notebook?</b>
    </summary>
    Yes!  I sometimes use <b style="color:#EB1960">SOS Kernel</b>, which allows me to swap between kernels with ease.  You can change kernels (R vs Python) using <span style="color:#757575">%use</span> magic commands; I've done this before, but the syntax highlighting often gets messed up, so I've chosen not to do that here.
</details>

In [None]:
raw.counts <- Matrix::readMM('./localdata/E-MTAB-7117.expression_tpm.mtx')
dim(raw.counts)

<details>
    <summary style="color:#C0CF96"><b>Does the metadata file help us resolve the cell count mystery?</b></summary>

What does this extra metadata file tell us?  Let's see:

In [None]:
meta.data <- read.table("./localdata/E-MTAB-7117.sdrf.txt", sep='\t', header = TRUE)
head(meta.data)
dim(meta.data)
length(which(meta.data['Characteristics.single.cell.quality.'] != "not OK"))

Unnamed: 0_level_0,Source.Name,Comment.ENA_SAMPLE.,Comment.BioSD_SAMPLE.,Characteristics.organism.,Characteristics.strain.,Characteristics.age.,Unit.time.unit.,Characteristics.developmental.stage.,Characteristics.sex.,Characteristics.genotype.,⋯,Comment.ENA_EXPERIMENT.,Scan.Name,Comment.SUBMITTED_FILE_NAME.,Comment.ENA_RUN.,Comment.FASTQ_URI.,Comment.SPOT_LENGTH.,Comment.READ_INDEX_1_BASE_COORD.,Factor.Value.genotype.,Factor.Value.organism.part.,Factor.Value.single.cell.identifier.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>
1,CD4_gill_A1,ERS2634491,SAMEA4814592,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737040,SLX-10875.N701_S513.C9FTNANXX.s_5.r_1.fq.gz,SLX-10875.N701_S513.C9FTNANXX.s_5.r_1.fq.gz,ERR2723271,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723271/ERR2723271_1.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A1
2,CD4_gill_A1,ERS2634491,SAMEA4814592,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737040,SLX-10875.N701_S513.C9FTNANXX.s_5.r_2.fq.gz,SLX-10875.N701_S513.C9FTNANXX.s_5.r_2.fq.gz,ERR2723271,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723271/ERR2723271_2.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A1
3,CD4_gill_A10,ERS2634691,SAMEA4814793,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737240,SLX-10875.N712_S513.C9FTNANXX.s_5.r_1.fq.gz,SLX-10875.N712_S513.C9FTNANXX.s_5.r_1.fq.gz,ERR2723471,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723471/ERR2723471_1.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A10
4,CD4_gill_A10,ERS2634691,SAMEA4814793,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737240,SLX-10875.N712_S513.C9FTNANXX.s_5.r_2.fq.gz,SLX-10875.N712_S513.C9FTNANXX.s_5.r_2.fq.gz,ERR2723471,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723471/ERR2723471_2.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A10
5,CD4_gill_A11,ERS2635221,SAMEA4815323,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737770,SLX-10875.N714_S513.C9FTNANXX.s_5.r_1.fq.gz,SLX-10875.N714_S513.C9FTNANXX.s_5.r_1.fq.gz,ERR2724001,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2724001/ERR2724001_1.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A11
6,CD4_gill_A11,ERS2635221,SAMEA4815323,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry),⋯,ERX2737770,SLX-10875.N714_S513.C9FTNANXX.s_5.r_2.fq.gz,SLX-10875.N714_S513.C9FTNANXX.s_5.r_2.fq.gz,ERR2724001,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2724001/ERR2724001_2.fastq.gz,250,126,Tg(cd4-1:mCherry),gill,CD4_gill_A11


At first glance, <b style="color:#C0CF96">no</b> - it does not.  The metadata has 2112 rows, and removing poor-quality cells leaves us with 1952.

But it does seem like each cell appears on more than one row (the sequencing is paired-end, and there is one row per FASTQ file), so let's filter by unique values.

In [None]:
dim(unique(meta.data['Source.Name']))
length(
    unique(
        meta.data[
            which(meta.data['Characteristics.single.cell.quality.'] != "not OK"),
            "Source.Name"
        ]
    )
)

Gee whiz, that 976 number returns...  We can re-use the list of 10 extra cells calculated yesterday to investigate them:

In [None]:
discrepancies <- c(
    "ERR2723217", "ERR2723218", "ERR2723236", "ERR2723237", "ERR2723551",
    "ERR2723578", "ERR2723789", "ERR2723794", "ERR2723974", "ERR2723990"
)
extra.cells <- meta.data[
    which(sapply(meta.data["Comment.ENA_RUN."], `%in%`, discrepancies)),
]
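For anyone following along in Python rather than R, here is a rough pandas equivalent of the two steps above: counting distinct cells, then pulling out the rows for a list of run accessions.  The data frame is a tiny made-up stand-in for the SDRF file (only the column names are taken from it) and the accession values are invented, so treat this as a sketch rather than a reproduction.

```python
import pandas as pd

# Made-up stand-in for the SDRF metadata: one row per FASTQ file, so a
# paired-end cell appears twice. Only the column names come from the real file.
meta = pd.DataFrame({
    "Source.Name": ["cell_A", "cell_A", "cell_B", "cell_B", "cell_C", "cell_C"],
    "Characteristics.single.cell.quality.": ["OK", "OK", "not OK", "not OK", "OK", "OK"],
    "Comment.ENA_RUN.": ["RUN1", "RUN1", "RUN2", "RUN2", "RUN3", "RUN3"],
})

# Number of distinct cells, regardless of how many rows each one occupies
n_cells = meta["Source.Name"].nunique()

# Number of distinct cells that were not flagged as poor quality
passed = meta["Characteristics.single.cell.quality."] != "not OK"
n_passed = meta.loc[passed, "Source.Name"].nunique()

# Rows whose run accession is in a list of interest (hypothetical accessions)
wanted_runs = ["RUN3", "RUN1"]
extra = meta[meta["Comment.ENA_RUN."].isin(wanted_runs)]

print(n_cells, n_passed, len(extra))
```

`isin` plays the role of R's `%in%` here: it returns a boolean mask over the rows, which keeps the original row order rather than the order of `wanted_runs`.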

In [None]:
#| output: false
# Output hidden to reduce spam
extra.cells[1:10]
extra.cells[11:20]
extra.cells[21:30]
extra.cells[31:40]
extra.cells[41:55]

Unnamed: 0_level_0,Source.Name,Comment.ENA_SAMPLE.,Comment.BioSD_SAMPLE.,Characteristics.organism.,Characteristics.strain.,Characteristics.age.,Unit.time.unit.,Characteristics.developmental.stage.,Characteristics.sex.,Characteristics.genotype.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>
247,CD4_gut_D4,ERS2634771,SAMEA4814873,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry)
248,CD4_gut_D4,ERS2634771,SAMEA4814873,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry)
375,CD4_kidney_C5,ERS2634798,SAMEA4814900,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry)
376,CD4_kidney_C5,ERS2634798,SAMEA4814900,Danio rerio,AB,6,month,adult,male,Tg(cd4-1:mCherry)
1053,LCK_gut_G8,ERS2634437,SAMEA4814538,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1054,LCK_gut_G8,ERS2634437,SAMEA4814538,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1055,LCK_gut_G9,ERS2634438,SAMEA4814539,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1056,LCK_gut_G9,ERS2634438,SAMEA4814539,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1305,LCK_thymus_B2,ERS2635210,SAMEA4815312,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)
1306,LCK_thymus_B2,ERS2635210,SAMEA4815312,Danio rerio,AB,6,month,adult,male,Tg(lck:EGFP)


Unnamed: 0_level_0,Characteristics.organism.part.,Characteristics.cell.type.,Characteristics.phenotype.,Characteristics.individual.,Characteristics.well.information.,Characteristics.single.cell.quality.,Characteristics.cluster.,Comment.spike.in.,Comment.spike.in.dilution.,Material.Type
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
247,gut,blood cell,mCherry positive cell,2,single cell,OK,unknown,ERCC,1:10,cell
248,gut,blood cell,mCherry positive cell,2,single cell,OK,unknown,ERCC,1:10,cell
375,kidney,blood cell,mCherry positive cell,2,single cell,OK,unknown,ERCC,1:10,cell
376,kidney,blood cell,mCherry positive cell,2,single cell,OK,unknown,ERCC,1:10,cell
1053,gut,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1054,gut,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1055,gut,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1056,gut,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1305,thymus,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell
1306,thymus,blood cell,EGFP positive cell,1,single cell,OK,unknown,ERCC,1:10,cell


Unnamed: 0_level_0,Protocol.REF,Performer,Protocol.REF.1,Performer.1,Protocol.REF.2,Performer.2,Extract.Name,Material.Type.1,Comment.single.cell.isolation.,Comment.library.construction.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
247,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,CD4_gut_D4,RNA,FACS,smart-seq2
248,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,CD4_gut_D4,RNA,FACS,smart-seq2
375,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,CD4_kidney_C5,RNA,FACS,smart-seq2
376,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,CD4_kidney_C5,RNA,FACS,smart-seq2
1053,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_gut_G8,RNA,FACS,smart-seq2
1054,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_gut_G8,RNA,FACS,smart-seq2
1055,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_gut_G9,RNA,FACS,smart-seq2
1056,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_gut_G9,RNA,FACS,smart-seq2
1305,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_thymus_B2,RNA,FACS,smart-seq2
1306,P-MTAB-77662,Paulina Strzelecka,P-MTAB-77663,Paulina Strzelecka,P-MTAB-77664,Paulina Strzelecka,LCK_thymus_B2,RNA,FACS,smart-seq2


Unnamed: 0_level_0,Comment.input.molecule.,Comment.primer.,Comment.end.bias.,Comment.LIBRARY_LAYOUT.,Comment.LIBRARY_SELECTION.,Comment.LIBRARY_SOURCE.,Comment.LIBRARY_STRAND.,Comment.LIBRARY_STRATEGY.,Comment.NOMINAL_LENGTH.,Comment.NOMINAL_SDEV.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>
247,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
248,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
375,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
376,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1053,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1054,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1055,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1056,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1305,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20
1306,polyA RNA,oligo-dT,none,PAIRED,Oligo-dT,TRANSCRIPTOMIC SINGLE CELL,not applicable,RNA-Seq,400,20


Unnamed: 0_level_0,Comment.ORIENTATION.,Protocol.REF.3,Performer.3,Assay.Name,Technology.Type,Comment.ENA_EXPERIMENT.,Scan.Name,Comment.SUBMITTED_FILE_NAME.,Comment.ENA_RUN.,Comment.FASTQ_URI.,Comment.SPOT_LENGTH.,Comment.READ_INDEX_1_BASE_COORD.,Factor.Value.genotype.,Factor.Value.organism.part.,Factor.Value.single.cell.identifier.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>
247,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,CD4_gut_D4,sequencing assay,ERX2737320,SLX-12114.i720_i506.HKKG2BBXX.s_1.r_1.fq.gz,SLX-12114.i720_i506.HKKG2BBXX.s_1.r_1.fq.gz,ERR2723551,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723551/ERR2723551_1.fastq.gz,300,151,Tg(cd4-1:mCherry),gut,CD4_gut_D4
248,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,CD4_gut_D4,sequencing assay,ERX2737320,SLX-12114.i720_i506.HKKG2BBXX.s_1.r_2.fq.gz,SLX-12114.i720_i506.HKKG2BBXX.s_1.r_2.fq.gz,ERR2723551,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/001/ERR2723551/ERR2723551_2.fastq.gz,300,151,Tg(cd4-1:mCherry),gut,CD4_gut_D4
375,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,CD4_kidney_C5,sequencing assay,ERX2737347,SLX-10874.N705_S505.C9FTNANXX.s_4.r_1.fq.gz,SLX-10874.N705_S505.C9FTNANXX.s_4.r_1.fq.gz,ERR2723578,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/008/ERR2723578/ERR2723578_1.fastq.gz,250,126,Tg(cd4-1:mCherry),kidney,CD4_kidney_C5
376,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,CD4_kidney_C5,sequencing assay,ERX2737347,SLX-10874.N705_S505.C9FTNANXX.s_4.r_2.fq.gz,SLX-10874.N705_S505.C9FTNANXX.s_4.r_2.fq.gz,ERR2723578,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/008/ERR2723578/ERR2723578_2.fastq.gz,250,126,Tg(cd4-1:mCherry),kidney,CD4_kidney_C5
1053,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_gut_G8,sequencing assay,ERX2736986,SLX-12119.i710_i521.HKCTNBBXX.s_1.r_1.fq.gz,SLX-12119.i710_i521.HKCTNBBXX.s_1.r_1.fq.gz,ERR2723217,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/007/ERR2723217/ERR2723217_1.fastq.gz,300,151,Tg(lck:EGFP),gut,LCK_gut_G8
1054,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_gut_G8,sequencing assay,ERX2736986,SLX-12119.i710_i521.HKCTNBBXX.s_1.r_2.fq.gz,SLX-12119.i710_i521.HKCTNBBXX.s_1.r_2.fq.gz,ERR2723217,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/007/ERR2723217/ERR2723217_2.fastq.gz,300,151,Tg(lck:EGFP),gut,LCK_gut_G8
1055,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_gut_G9,sequencing assay,ERX2736987,SLX-12119.i711_i521.HKCTNBBXX.s_1.r_1.fq.gz,SLX-12119.i711_i521.HKCTNBBXX.s_1.r_1.fq.gz,ERR2723218,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/008/ERR2723218/ERR2723218_1.fastq.gz,300,151,Tg(lck:EGFP),gut,LCK_gut_G9
1056,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_gut_G9,sequencing assay,ERX2736987,SLX-12119.i711_i521.HKCTNBBXX.s_1.r_2.fq.gz,SLX-12119.i711_i521.HKCTNBBXX.s_1.r_2.fq.gz,ERR2723218,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/008/ERR2723218/ERR2723218_2.fastq.gz,300,151,Tg(lck:EGFP),gut,LCK_gut_G9
1305,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_thymus_B2,sequencing assay,ERX2737759,SLX-12119.i702_i503.HKCTNBBXX.s_1.r_1.fq.gz,SLX-12119.i702_i503.HKCTNBBXX.s_1.r_1.fq.gz,ERR2723990,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/000/ERR2723990/ERR2723990_1.fastq.gz,300,151,Tg(lck:EGFP),thymus,LCK_thymus_B2
1306,5-3-5-3,P-MTAB-77665,Cambridge Institute Genomics Core Facility,LCK_thymus_B2,sequencing assay,ERX2737759,SLX-12119.i702_i503.HKCTNBBXX.s_1.r_2.fq.gz,SLX-12119.i702_i503.HKCTNBBXX.s_1.r_2.fq.gz,ERR2723990,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR272/000/ERR2723990/ERR2723990_2.fastq.gz,300,151,Tg(lck:EGFP),thymus,LCK_thymus_B2


But, as before, there don't really seem to be any distinguishing features.  So the mystery continues!

</details>

# Quality Control

The <b style="color:#757575">raw.counts</b> matrix is a <b style="color:#A6A440">Sparse Matrix</b> with 21,797 genes and 966 cells - the values are expressed in TPM.  Our goal is to filter out the low quality cells.

> For the <b style="color:#EB1960">Smart-seq2 protocol</b> transcript per million (TPM) values reported by Salmon were used for the quality control (QC). Wells with fewer than 900 expressed genes (TPM > 1) or having more than either 60% of ERCC or 45% of mitochondrial content were annotated as <strong style="color:#C0CF96">poor quality cells</strong>. As a result, 322 cells failed QC and 542 single cells were selected for the further study.
>
> -- <cite><b style="color:#EB1960">Quality Control of Single-Cell Data; Materials and Methods Section;</b> @zebrafish-data </cite>

At the time, I didn't think we could find the mitochondrial genes (or the ERCC spike-ins) with the data we had.  I've since gone back and added what I found to the dropdowns below.

<details>
    <summary style="color:#C0CF96"><b>Getting mitochondrial genes</b></summary>

I used Python since, chronologically, I did this after I gave up on R.  The work to create the `mito-genes` csv was done in the blog post [Exploring Ensembl](./004_005_ensemble.html).

In [None]:
import pandas as pd

# Load the list of mitochondrial gene IDs created in the Ensembl post
mito_genes = pd.read_csv("./localdata/mito-genes.csv")
mito_genes["gene_id"].head()

0    ENSDARG00000083480
1    ENSDARG00000082753
2    ENSDARG00000081443
3    ENSDARG00000080337
4    ENSDARG00000083046
Name: gene_id, dtype: object


Now we want to calculate, for each cell, the percentage of its total expression (TPM) that comes from the genes in `mito_genes`, which should not be too hard!  Left as an exercise to the reader.
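For anyone who wants a head start, here is a minimal sketch of that calculation on a toy matrix.  It assumes you have an array of gene IDs in the same order as the matrix rows; `row_gene_ids` and the IDs in it are hypothetical names I made up for illustration, and the numbers are invented.

```python
import numpy as np
from scipy import sparse

# Toy genes-by-cells TPM matrix (4 genes, 3 cells); values are made up.
tpm = sparse.csr_array(np.array([
    [10.0,  0.0, 5.0],
    [30.0, 20.0, 0.0],
    [ 0.0, 60.0, 5.0],
    [60.0, 20.0, 0.0],
]))

# Hypothetical gene IDs labelling the rows of `tpm`, in row order.
row_gene_ids = np.array(["gene_a", "mt_gene_1", "gene_b", "mt_gene_2"])
# Stand-in for mito_genes["gene_id"].
mito_ids = ["mt_gene_1", "mt_gene_2"]

# Boolean mask over rows: True where the gene is mitochondrial.
is_mito = np.isin(row_gene_ids, mito_ids)

# Column sums: total expression per cell, and the mitochondrial part of it.
total_per_cell = np.asarray(tpm.sum(axis=0)).ravel()
mito_per_cell = np.asarray(tpm[is_mito, :].sum(axis=0)).ravel()

# Percentage of each cell's expression that is mitochondrial.
# Cells with zero total expression are reported as 0 instead of dividing by zero.
pct_mito = np.divide(
    100.0 * mito_per_cell,
    total_per_cell,
    out=np.zeros_like(total_per_cell),
    where=total_per_cell > 0,
)
print(pct_mito)
```

Cells whose percentage exceeds the paper's 45% cutoff would then be the ones flagged as poor quality on the mitochondrial criterion.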

</details>

<details>
    <summary style="color:#C0CF96"><b>Getting ERCC genes</b></summary>

I could not figure out how to do this; I think they have already been removed from the dataset that we have.

</details>

But yesterday we noticed that we could get the 542 further-study cells by looking at the metadata and only grabbing the cells which were eventually assigned a cluster by the researchers.

In [None]:
# Calculate the cells we want to study
sample.data <- read.table("./localdata/ExpDesign-E-MTAB-7117.tsv", sep='\t', header=TRUE)
filtered <- sample.data[which(
    sample.data['Sample.Characteristic.cluster.'] != "unknown"
),]
cells.to.study <- filtered$Assay
length(cells.to.study)

In [None]:
# Get which assay (cell) each column of the matrix corresponds to
# (the variable says "row", but these are the matrix's column labels)
assay.to.row.map <- read.table(
    './localdata/E-MTAB-7117.expression_tpm.mtx_cols'
)$V1
length(assay.to.row.map)

In [None]:
# Subset the raw.counts matrix
# Keep only the columns (cells) that we want to study
raw.counts.filtered <- raw.counts[,
    which(assay.to.row.map %in% cells.to.study)
]
dim(raw.counts.filtered)
# Save the matrix, for posterity
Matrix::writeMM(raw.counts.filtered, "./localdata/542_cells_21797_genes.tpm.mtx")

NULL
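The same column subsetting can be sketched in Python with a boolean mask.  The matrix, the column labels, and the names `column_assays` and `cells_to_study` below are all made up for illustration; in practice the labels would come from the `.mtx_cols` file.

```python
import numpy as np
from scipy import sparse

# Toy genes-by-cells matrix: 2 genes, 4 cells (made-up values).
counts = sparse.csc_array(np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 4.0],
]))

# Hypothetical column labels, in the same order as the matrix columns.
column_assays = np.array(["run_1", "run_2", "run_3", "run_4"])
# Hypothetical assays that were assigned a cluster.
cells_to_study = ["run_4", "run_2"]

# True for every column whose assay is in the list of cells to study.
keep = np.isin(column_assays, cells_to_study)

subset = counts[:, keep]
kept_names = column_assays[keep]
print(subset.shape, kept_names)
```

Note that the mask keeps columns in matrix order, not in the order of `cells_to_study`, so it is worth carrying `kept_names` along with the subset matrix.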

Now we need to perform quality control on the genes:

> For each of the <strong style="color:#C0CF96">542 single cells</strong>, counts reported by <b style="color:#EB1960">Salmon</b> were transformed into normalised counts per million (CPM) and used for the further analysis. This was performed by <strong style="color:#C0CF96">dividing the number of counts for each gene with the total number of counts for each cell and by multiplying the resulting number by a factor of 1,000,000</strong>. Genes that were expressed in less than 1% of cells (e.g. 5 single cells with CPM > 1) <strong style="color:#C0CF96">were filtered out</strong>. In the final step we ended up using 16,059 genes across the 542 single cells. The <b style="color:#EB1960">scran R package</b> (version 1.6.7) @scran was then used to <b style="color:#537FBF">normalise</b> the data and remove differences due to the library size or capture efficiency and sequencing depth.
>
> -- <cite><b style="color:#EB1960">Downstream Analysis of Smart-seq2 Data; Materials and Methods Section;</b> @zebrafish-data </cite>

This was harder than expected, as I am less than fluent in R, and its sparse matrix operations weren't as analogous to dense matrix operations as I would have liked.  First, let's take a look at the sparse datatype we have.  I found the blog post by @dgTMatrix useful, but ultimately got fed up with R and decided to switch to Python.

<details><summary><span class="false-starts" style="font-weight:bold;">False Start</span>: <b style="color:#C0CF96">Trying to use R</b></summary>

In [None]:
str(raw.counts.filtered)

Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:1156509] 4 10 19 36 56 57 58 59 62 69 ...
  ..@ j       : int [1:1156509] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ Dim     : int [1:2] 21797 542
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:1156509] 0.593 145.541 361.733 0.178 0.107 ...
  ..@ factors : list()


`(i, j)` gives us the (zero-based) row and column of each entry, with the value being the corresponding entry in `x`.  The `T` in `dgTMatrix` stands for triplet, because it's essentially just a list of triplets `(i, j, x)`.  Another common format is `CsparseMatrix`, although the explanation is more complicated.  It's explained well by @dgTMatrix; I'm just in a rush to get to actual data analysis!

In [None]:
raw.counts.csparse <- str(as(raw.counts.filtered, "CsparseMatrix"))

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:1156509] 4 10 19 36 56 57 58 59 62 69 ...
  ..@ p       : int [1:543] 0 1610 3532 5491 7585 9273 11182 13157 14623 15552 ...
  ..@ Dim     : int [1:2] 21797 542
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:1156509] 0.593 145.541 361.733 0.178 0.107 ...
  ..@ factors : list()
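To make the difference between the two layouts a bit more concrete, here is a small sketch using SciPy's equivalents (`coo_array` for the triplet form, `csc_array` for the compressed column form) on a made-up 3x3 matrix.  SciPy's `indices` and `indptr` correspond to the `i` and `p` slots shown above.

```python
import numpy as np
from scipy import sparse

# Made-up 3x3 matrix with four non-zero entries.
dense = np.array([
    [0.0, 5.0, 0.0],
    [7.0, 0.0, 0.0],
    [0.0, 8.0, 9.0],
])

# Triplet (COO) form: parallel arrays of row index, column index and value.
coo = sparse.coo_array(dense)
triplets = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

# Compressed sparse column form: the per-entry column index is replaced by
# a pointer array with one entry per column, plus one.
csc = sparse.csc_array(dense)
indptr = csc.indptr.tolist()

# The entries of column j live in positions indptr[j]:indptr[j + 1].
start, stop = csc.indptr[1], csc.indptr[2]
col1_rows = csc.indices[start:stop].tolist()
col1_vals = csc.data[start:stop].tolist()

print(triplets)
print(indptr, col1_rows, col1_vals)
```

This is why the `p` slot above has 543 entries for 542 columns: consecutive pairs of pointers mark where each column's entries start and stop.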


Now, we want to transform the data into <b style="color:#C0CF96">normalized counts per million</b>.

In [None]:
# This will throw an error
colSums(raw.counts.csparse)

ERROR: Error in colSums(raw.counts.csparse): 'x' must be an array of at least two dimensions


<details><summary style="color:#C0CF96;font-weight:bold;">But we know from the docs that it should work for sparse matrices</summary>

![What happens when we run `?colSums` to get the docs](r-help.png)

Fun fact: for some reason running the R help command will mess with Quarto's html, breaking your webpage!

</details>


In [None]:
# We can try to force it to use the Matrix overload, to no avail.
Matrix::colSums(raw.counts.csparse)

ERROR: Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be an array of at least two dimensions


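And it's about here that I get fed up with R.  (In hindsight, the culprit is the assignment above: `str()` prints a summary and returns `NULL`, so `raw.counts.csparse` ended up as `NULL` instead of a sparse matrix, which is why both calls complain that `x` is not an array.  Assigning the result of `as(...)` first and calling `str()` on it afterwards should avoid this.)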

</details>

Let's load our `raw.counts.filtered` matrix in Python:

In [None]:
from scipy.io import mmread
from scipy import sparse
import numpy as np
import pandas as pd

In [None]:
raw_counts_filtered = sparse.csc_array(mmread(
    './localdata/542_cells_21797_genes.tpm.mtx'
))
raw_counts_filtered

<21797x542 sparse array of type '<class 'numpy.float64'>'
	with 1156509 stored elements in Compressed Sparse Column format>

And now we'll copy the process they used to rescale their data: divide each value by its cell's total and multiply by 1,000,000 to get counts per million (CPM).

In [None]:
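# Column sums: despite the name, this is the total expression per cell
genes_per_cell = raw_counts_filtered.sum(axis=0)
# Drop cells with no expression at all, so we never divide by zero
raw_counts_nonzero = raw_counts_filtered[:, genes_per_cell > 0]
genes_per_cell = genes_per_cell[genes_per_cell > 0]
# CPM: divide by each cell's total, then multiply by 1,000,000
raw_counts_nonzero = raw_counts_nonzero / genes_per_cell * 1e6
raw_counts_normed = sparse.csr_array(raw_counts_nonzero)
raw_counts_normed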

<21797x542 sparse array of type '<class 'numpy.float64'>'
	with 1156509 stored elements in Compressed Sparse Row format>
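As a quick sanity check on the rescaling, here is the same idea on a made-up 3x2 matrix, where the arithmetic is easy to follow by hand.  I've used `multiply` with a row of per-cell scale factors here, which is one of several equivalent ways to write it; after scaling, every column should sum to one million.

```python
import numpy as np
from scipy import sparse

# Made-up genes-by-cells matrix: 3 genes, 2 cells.
tpm = sparse.csc_array(np.array([
    [2.0, 0.0],
    [6.0, 1.0],
    [0.0, 3.0],
]))

# Total per cell (column sums): 8 for the first cell, 4 for the second.
totals = np.asarray(tpm.sum(axis=0)).ravel()

# Scale every column by 1e6 / total; the result stays sparse.
cpm = tpm.multiply(1e6 / totals).tocsr()

col_sums = np.asarray(cpm.sum(axis=0)).ravel()
print(cpm.toarray())
print(col_sums)
```

For the first cell, a value of 2 out of a total of 8 becomes 2 / 8 * 1,000,000 = 250,000 CPM.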

Now we need to filter out genes that were expressed in less than 1% of cells.  They're a bit more strict than that: a gene has to have a CPM greater than 1 in a cell to count as being "expressed" there.

In [None]:
# A gene counts as "expressed" in a cell only when its CPM is greater than 1
is_expressed = raw_counts_normed > 1
# For each gene (row), count the cells in which it is expressed
number_of_cells_gene_appears_in = np.asarray(is_expressed.sum(axis=1)).ravel()
# 1% of 542 cells is 5.42, so keep genes expressed in more than 5 cells
raw_counts_genes_filtered = raw_counts_normed[
    number_of_cells_gene_appears_in > 5,
    :
].copy()

In [None]:
raw_counts_genes_filtered

<17832x542 sparse array of type '<class 'numpy.float64'>'
	with 1141288 stored elements in Compressed Sparse Row format>

We end up with 17,832 genes, more than the 16,059 reported in the paper, and I'm not sure why.  I can't find a count of the pre-filtered genes anywhere in their paper, though.  Just like how their dataset contains more cells than reported, it may contain more genes than reported!
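For reference, here is how I read the cutoff, plus the same filter on a toy matrix.  The 1% arithmetic uses the 542 cells from the paper; the matrix values and the toy cutoff of 2 cells are made up, and this is my interpretation of their description rather than their actual code.

```python
import math

import numpy as np
from scipy import sparse

# 1% of 542 cells is 5.42, so a gene needs to be expressed in at least
# 6 cells, which is the same as "more than 5".
n_cells = 542
min_cells_required = math.ceil(0.01 * n_cells)

# Made-up CPM matrix: 3 genes, 4 cells.
cpm = sparse.csr_array(np.array([
    [0.5, 2.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 0.0],
    [9.0, 4.0, 2.0, 1.5],
]))

# Count, per gene, the cells where CPM is strictly greater than 1.
cells_expressing = np.asarray((cpm > 1).sum(axis=1)).ravel()

# Toy cutoff: keep genes expressed in at least 2 of the 4 cells.
keep = cells_expressing >= 2
filtered = cpm[keep, :]

print(min_cells_required, cells_expressing, filtered.shape)
```

Note that the second gene has a CPM of exactly 1 in one cell, which does not count as expressed under the strict "CPM > 1" rule.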

# Analysis

> In order to identify the <b style="color:#A6A440">highly variable genes (HVGs)</b> we utilised the <b style="color:#537FBF">Brennecke Method</b> [@brennecke]. We inferred the noise model from the <b style="color:#A6A440">ERCCs</b> and <b style="color:#C0CF96">selected genes that vary higher than 20% percentage of variation</b>. This was performed by using the “<b style="color:#757575">BrenneckeGetVariableGenes</b>” command of <b style="color:#EB1960">M3Drop v1.4.0 R package</b> setting fdr equal to 0.01 and minimum percentage of variance due to biological factors (minBiolDisp) equal to 0.2. <b style="color:#C0CF96">In total, 3,374 were annotated as HVGs.</b>
>
> -- <cite><b style="color:#EB1960">Downstream Analysis of Smart-seq2 Data; Materials and Methods Section;</b> @zebrafish-data </cite>

The `BrenneckeGetVariableGenes` method is described in the <b style="color:#EB1960">M3Drop documentation</b> as follows:

![`BrenneckeGetVariableGenes`](./m3drop.png)

Thus we can see that we need the ERCCs to deduce the HVGs.  However, I can't find any data on the ERCC spike-ins, so this is where we stop.  Sorry to disappoint.
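For what it's worth, the raw ingredients of this kind of analysis can still be computed without spike-ins.  As I understand it, the Brennecke method compares each gene's squared coefficient of variation (CV²) against its mean expression, using the spike-ins to fit the technical noise trend.  The sketch below only computes the mean and CV² per gene on randomly generated data; it does not fit a noise model or call any HVGs, and it is not what `BrenneckeGetVariableGenes` does internally.

```python
import numpy as np

# Randomly generated stand-in for normalized expression: 5 genes, 50 cells.
rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=50.0, size=(5, 50))

# Per-gene mean and (sample) variance across cells.
mean = expr.mean(axis=1)
var = expr.var(axis=1, ddof=1)

# Squared coefficient of variation: variance relative to the squared mean.
cv2 = var / mean**2

for gene_index, (m, c) in enumerate(zip(mean, cv2)):
    print(f"gene {gene_index}: mean={m:.1f}, CV^2={c:.3f}")
```

Without the ERCC counts there is nothing to fit the technical noise against, which is exactly the part we are missing.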

### References

::: {#refs}
:::