In this walkthrough, we're going to start playing around with real scRNA-seq data.  I had no end goal in mind when starting this post; let's see what happens!

# The Dataset

We'll use the [Single Cell Expression Atlas](https://www.ebi.ac.uk/gxa/sc/home) to find a suitable dataset to explore.  The website has a nice design, and there are a lot of species to choose from.

![Species Options](./images/cell-expression-atlas-species.png)

I chose to look at <i style="color:#EB1960">Danio rerio</i> (<b style="color:#EB1960">Zebrafish</b>) because it was the animal with the most experiments that I hadn't really heard about before^[Apparently it does seem to be one of the 'useful' animals in biology - you learn new things every day!  Shoutout to Sam's blog for introducing me to being able to use footnotes.].  I didn't want to look at plants/fungi/protists because they might require additional considerations I'm not aware of (especially protists).

<details>
    <summary style="color:#C0CF95"><b>&lt;i&gt; vs &lt;em&gt;</b></summary>
    <p>This is very tangential, but in trying to learn the right way to represent scientific names such as <i style="color:#EB1960">Danio rerio</i> I stumbled onto a debate about <span style="color:#757575">&lt;i&gt;</span> vs <span style="color:#757575">&lt;em&gt;</span> (HTML tags to represent italics), and analogously about <span style="color:#757575">&lt;b&gt;</span> vs <span style="color:#757575">&lt;strong&gt;</span> (to represent bold text).  In my investigation into this debate, I've learned that <span style="color:#757575">&lt;i&gt;</span>/<span style="color:#757575">&lt;b&gt;</span> are to be used when there is no semantic emphasis on the words, whereas <span style="color:#757575">&lt;em&gt;</span>/<span style="color:#757575">&lt;strong&gt;</span> carry semantic emphasis [@b-vs-strong].  This unfortunately means I've been using them wrong 😅 - since I've just been <span style="color:#757575">&lt;strong&gt;</span>-ing all my bolds even though I basically never bold for emphasis these days.  (I'm not a very emphatic person).</p>
</details>

I chose the dataset from the paper "Single-cell transcriptional analysis reveals innate lymphoid cell (ILC)-like cells in zebrafish" [@zebrafish-data], because it was one of the most recent but also not exceedingly large (<1000 cells).

## Experimental Design

Before downloading the data, we want to check if it's actually useful to us - is it mRNA, what type of cells are they, were specific genes targeted, etc.  Since the previous blog posts were talking about the wetlab generation of the data, let's do a deep dive into what they did.  The paper [@zebrafish-data] is freely available online, and the pertinent information will be contained in the <b style="color:#EB1960">Materials and Methods</b> section.  The paper is the source of two datasets on the <b style="color:#EB1960">Single Cell Expression Atlas</b> so we will have to keep that in mind when reading about the methods.

### Sample Selection

> The aim of this study was to characterised innate and adaptive <strong style="color:#A6A440">lymphocytes</strong> in zebrafish in steady state and following the immune challenge, using scRNA-seq. <strong style="color:#C0CF95">Multiple zebrafish</strong>, either in <strong style="color:#A6A440">steady state</strong> or <strong style="color:#537FBF">exposed to immune challenge</strong>, were used to collect cells for sequencing.
>
> -- <cite><b style="color:#EB1960">Study Design Subsection; Materials and Methods Section;</b> @zebrafish-data </cite>

A <b style="color:#A6A440">lymphocyte</b> is a specific type of cell that is part of your immune system; B, T, and NK cells are all <b style="color:#A6A440">lymphocytes</b>.  We can also see that the experiment was done on multiple zebrafish, rather than just one - and that some were "exposed to immune challenge", which I assume means they were made sick to try to trigger interesting processes in their immune system.

The paper goes into more depth on this immune challenge (it involves <i style="color:#EB1960">Vibrio anguillarum</i>, a "fish pathogen" [details: @vibrio-anguillarum]); however, this does not seem to be relevant for our dataset.  The paper seems to refer to our dataset as the <b style="color:#EB1960">Smart-seq2 experiment</b>, whereas the other dataset is the <b style="color:#EB1960">10x experiment</b>.  The immune challenge was only applied to the <b style="color:#EB1960">10x experiment</b>.  We can double-check this by comparing the sample characteristics of [our dataset](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7117/experiment-design) and [the other dataset](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7159/experiment-design) on the <b style="color:#EB1960">Single Cell Expression Atlas</b>, noting that <b style="color:#757575">infect</b> is not one of the experimental variables for our dataset.

A triple-check, if the above was not convincing enough^[It wasn't really, at least for me - I like to be sure!], can be found when we read the whole paper:

> To allow easy retrieval of sequencing data from zebrafish innate and adaptive lymphocytes we generated a cloud repository (https://www.sanger.ac.uk/science/tools/lymphocytes/ lymphocytes/) with transcriptional profiles of over 14,000 single cells collected from healthy and immune challenged zebrafish using 10x genomics and Smart-seq2 methodology <strong style="color:#C0CF96">(please see Explanatory Note in Supplementary Material)</strong>.
>
> -- <cite><b style="color:#EB1960">"Single cell atlas of innate and adaptive lymphocytes in zebrafish" section;</b> @zebrafish-data </cite>

We can then track down this explanatory note^[I found it on [NIH](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6258902/); warning - you'll have to download a 16MB file to read it!] to see what it has to say:

> Available datasets includes <strong style="color:#C0CF96"><b style="color:#EB1960">Smart-seq2 data</b> from kidney, thymus, spleen, guts and gills of healthy, unstimulated <b style="color:#A6A440">wild-type</b> zebrafish</strong> as well as <strong style="color:#C0CF96"><b style="color:#EB1960">10x datasets</b> from gut of unstimulated zebrafish both <b style="color:#A6A440">wild-type</b> and <b style="color:#A6A440">rag1<sup>-/-</sup> mutant</b> and <b style="color:#537FBF">immune-challenged</b> (V. anguillarum- or A. simplex-injected) <b style="color:#A6A440">rag1<sup>-/-</sup> mutant</b></strong>.
>
> -- <cite><b style="color:#EB1960">Supplementary Material; Explanatory Note</b> @zebrafish-data </cite>

We know the <b style="color:#EB1960">10x experiment</b> contains over 10,000 cells, and experienced <b style="color:#537FBF">immune challenges</b>.  Ours doesn't.  Our dataset should be from multiple body parts (corroborated by the sample characteristics of [our dataset](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7117/experiment-design)) and the <b style="color:#EB1960">10x dataset</b> should only be from the gut (corroborated by the sample characteristics of [the other dataset](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7159/experiment-design)).

The above quote also points out that all our zebrafish were <b style="color:#A6A440">wild-type</b>.  Interestingly, the sample characteristics of [our dataset](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7117/experiment-design) do list multiple genotypes, including <b style="color:#A6A440">rag1<sup>-/-</sup> mutants</b>.  The quote seems to imply that <b style="color:#A6A440">rag1<sup>-/-</sup></b> is not "<b style="color:#A6A440">wild-type</b>", as does the rest of the paper.  However, the rest of the paper does explicitly say that the <b style="color:#EB1960">Smart-seq2 data</b> contains <b style="color:#A6A440">rag1<sup>-/-</sup> mutants</b>, so I assume this was either an oversight by the authors or a minor misinterpretation on my part^[I'm leaning towards misinterpretation, as I don't see why wild populations would be incapable of having a <b style="color:#A6A440">rag1<sup>-/-</sup> mutation</b> - unless maybe the <b style="color:#A6A440">rag1<sup>-/-</sup> mutants</b> used in the study were specifically bred in the lab?].  Either way, I'm satisfied with this understanding; we can move on to the tissue preparation.

<details>
    <summary style="color:#C0CF95"><b>About rag1<sup>-/-</sup> mutants</b></summary>
    <p>
        Well, maybe not move on just yet; I'm just a bit curious as to what a <b style="color:#A6A440">rag1<sup>-/-</sup> mutant</b> is!  According to @medline-rag1, the rag1 gene is involved in <b style="color:#537FBF">VDJ-recombination</b> (something that was briefly mentioned, but not expanded on, in a <a href="./004_scRNA2.html">prior blogpost</a>).  The name stands for "recombination activating gene 1", and when the gene is absent it can really mess up your immune system, as VDJ-recombination is essential for B and T cells to adapt to new pathogens.  For papers specifically on this mutation's effects in zebrafish, see @rag1-zebrafish and @rag1-zebrafish-2.
    </p>
</details>

### Tissue Preparation

> Kidneys from heterozygote transgenic zebrafish either <b style="color:#A6A440">wild-type</b> or <b style="color:#A6A440">rag1-/- mutant</b>, were dissected and processed as previously described [@tissue-preparation].  The guts, spleens, gills and thymuses were dissected and placed in ice cold PBS/5% foetal bovine serum.
>
> -- <cite><b style="color:#EB1960">FACS Sorting Subsection; Materials and Methods Section;</b> @zebrafish-data </cite>

Which gives us another paper to read!  Also, I never thought I'd read the phrase "ice cold fetal bovine serum"...

> A single kidney from heterozygote transgenic or wild-type fish was dissected and placed in ice cold PBS/5% foetal bovine serum.
>
> -- <cite><b style="color:#EB1960">Single-Cell Sorting Subsection; Methods Section;</b> @tissue-preparation </cite>

Despite being given another paper to read, it seems that the same preparation was used for all organs: dissect, then place in ice cold PBS/5% foetal bovine serum.  That reduces the amount of work for us!

### Cell Isolation

They used <b style="color:#537FBF">FACS (Fluorescence-Activated Cell Sorting)</b>.  There's a brief overview of it available from @FACS.

> For [the] <b style="color:#EB1960">Smart-seq2 experiment</b> individual cells were index sorted into 96 well plates using a <b style="color:#EB1960">BD Influx Index Sorter</b>
>
> -- <cite><b style="color:#EB1960">FACS Sorting Subsection; Materials and Methods Section;</b> @zebrafish-data </cite>

<details>
    <summary style="color:#C0CF95"><b>96</b></summary>
    <p>
        Why does the number 96 keep popping up???  I'll never know.
    </p>
</details>

### Library Preparation and Sequencing

> The <b style="color:#EB1960">Smart-seq2 protocol</b> was used for <b style="color:#537FBF">whole transcriptome amplification</b> and <b style="color:#537FBF">library preparation</b> as previously described. Generated libraries were sequenced in <b style="color:#A6A440">pair-end mode</b> on [a] <b style="color:#EB1960">Hi-Seq4000 platform</b>.
>
> -- <cite><b style="color:#EB1960">Plate-Based Single-Cell RNA processing Subsection; Materials and Methods Section;</b> @zebrafish-data </cite>

This is short and straight to the point.  They say a bit more about sequencing later:

> For the samples that were processed using the <b style="color:#EB1960">Smart-seq2 protocol</b>, the reads were aligned to the zebrafish reference genome (<b style="color:#EB1960">Ensemble BioMart version 89</b>) combined with the sequences for EGFP, mCherry, mhc2dab and ERCC spike-ins. <b style="color:#EB1960">Salmon v0.8.2</b> [@salmon] was used for both alignment and quantification of reads with the default paired-end parameters, while library type was set to inward (I) relative orientation (reads face each other) with unstranded (U) protocol (parameter –l IU).
>
> -- <cite><b style="color:#EB1960">Alignment and Quantification of Single-Cell RNA-Sequencing Data; Materials and Methods Section;</b> @zebrafish-data </cite>

## Acquiring and Exploring the Dataset

The page for this dataset on the <b style="color:#EB1960">Single Cell Expression Atlas</b> contains a [download tab](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7117/downloads).  We want the <strong style="color:#C0CF96">raw counts</strong> matrix (we don't want to use the normalized counts, since we intend to recreate their analysis and that includes normalizing the counts ourselves!).

The file we're interested in, after downloading and unzipping, is `E-MTAB-7117.aggregated_filtered_counts.mtx`.  (The other files are important too; they're the row and column names!)  We can read the file in using <b style="color:#EB1960">R</b>.

In [None]:
raw.counts <- Matrix::readMM('./localdata/E-MTAB-7117.aggregated_filtered_counts.mtx')
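
Since the matrix file itself carries no labels, here's a minimal sketch of how the row and column names could be attached.  The helper `read.labelled.counts` is my own hypothetical name, and I'm assuming the companion files hold one name per line in their first column; the demo below uses a made-up toy matrix written to temporary files rather than the real data.

```r
library(Matrix)

# Hypothetical helper (not part of any package): read a Matrix Market file
# and label it with the names stored in its companion files.  The first
# column of each companion file is assumed to hold the name.
read.labelled.counts <- function(mtx.path, rows.path, cols.path) {
    counts <- readMM(mtx.path)
    row.ids <- as.character(read.table(rows.path)$V1)
    col.ids <- as.character(read.table(cols.path)$V1)
    # Fail loudly if the label files don't match the matrix shape
    stopifnot(length(row.ids) == nrow(counts), length(col.ids) == ncol(counts))
    dimnames(counts) <- list(row.ids, col.ids)
    counts
}

# Toy demonstration with made-up genes and cells, written to temp files
toy <- Matrix(c(0, 3, 1.5, 0, 0, 2), nrow = 2, sparse = TRUE)
mtx.path <- tempfile(fileext = ".mtx")
rows.path <- tempfile()
cols.path <- tempfile()
writeMM(toy, mtx.path)
writeLines(c("geneA", "geneB"), rows.path)
writeLines(c("cell1", "cell2", "cell3"), cols.path)

labelled <- read.labelled.counts(mtx.path, rows.path, cols.path)
print(labelled)
```

For the real dataset, the same call would presumably take the `.mtx` path plus its `.mtx_rows` and `.mtx_cols` neighbours (the `_rows` name is my assumption, by analogy with the `_cols` file).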

Let's take a peek at this matrix to see what it's like.

In [None]:
print(dim(raw.counts))
raw.counts[1:10, 1:10]
max(raw.counts)

[1] 21797   966


10 x 10 sparse Matrix of class "dgTMatrix"
                                          
 [1,] .   .         .      . .   . . . . .
 [2,] .   .         .      . .   . . . . .
 [3,] .   .         .      . .   . . . . .
 [4,] .   .        48.000 77 .   . . . . .
 [5,] 1 680.8777 1151.611  1 .   . . . . .
 [6,] .   .         .      . .   . . . . .
 [7,] .   .         .      . .   . . . . .
 [8,] .   .         .      . . 111 . . . .
 [9,] .   .         .      . .   . . . . .
[10,] .   .         .      . .   . . . . .

We can see that the matrix is 21,797 by 966 - thus, there are 21,797 genes in the dataset and each column corresponds to a specific cell.

Strangely, <strong style="color:#C0CF96">some values seem to be noninteger</strong>?  Let's consult the paper!

> For each of the <strong style="color:#C0CF96">542 single cells</strong>, counts reported by <b style="color:#EB1960">Salmon</b> were transformed into normalised counts per million (CPM) and used for the further analysis. This was performed by <strong style="color:#C0CF96">dividing the number of counts for each gene with the total number of counts for each cell and by multiplying the resulting number by a factor of 1,000,000</strong>. Genes that were expressed in less than 1% of cells (e.g. 5 single cells with CPM > 1) <strong style="color:#C0CF96">were filtered out</strong>. In the final step we ended up using 16,059 genes across the 542 single cells. The <b style="color:#EB1960">scran R package</b> (version 1.6.7) @scran was then used to <b style="color:#537FBF">normalise</b> the data and remove differences due to the library size or capture efficiency and sequencing depth.
>
> -- <cite><b style="color:#EB1960">Downstream Analysis of Smart-seq2 Data; Materials and Methods Section;</b> @zebrafish-data </cite>

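This explains the non-integer values; there's division involved in producing them.  (Salmon's estimated counts can also be fractional in their own right, since reads that map ambiguously get split across transcripts, so even a "raw" matrix needn't be integer-valued.)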
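
To make the CPM step from the quote concrete, here's a small sketch on a made-up 3-gene by 2-cell matrix.  The function names `to.cpm` and `filter.genes`, and the `min.cells` parameter, are my own illustrative choices rather than anything from the paper's code.

```r
# Toy count matrix: 3 genes (rows) x 2 cells (columns); values are made up.
toy.counts <- matrix(
    c(10, 0, 20,
       5, 5, 90),
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

# Counts per million: divide each cell (column) by its total count,
# then multiply by 1,000,000.
to.cpm <- function(counts) {
    sweep(counts, 2, colSums(counts), "/") * 1e6
}

# Keep genes with CPM > 1 in at least `min.cells` cells.
# (`min.cells` is a hypothetical parameter; the paper's example is 5.)
filter.genes <- function(cpm, min.cells) {
    cpm[rowSums(cpm > 1) >= min.cells, , drop = FALSE]
}

toy.cpm <- to.cpm(toy.counts)
print(toy.cpm)                      # geneA in cell1 is 10/30 * 1e6, a non-integer
toy.filtered <- filter.genes(toy.cpm, min.cells = 2)
print(rownames(toy.filtered))       # geneB is only expressed in one cell
```

Every column of `toy.cpm` sums to one million, which is a handy sanity check when applying the same idea to a real matrix.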

However, we've found a new discrepancy!  Why are there 966 cells in the dataset, when we should only have 542?  Let's read a bit more of the paper:

> For the <b style="color:#EB1960">Smart-seq2 protocol</b> transcript per million (TPM) values reported by Salmon were used for the quality control (QC). Wells with fewer than 900 expressed genes (TPM > 1) or having more than either 60% of ERCC or 45% of mitochondrial content were annotated as <strong style="color:#C0CF96">poor quality cells</strong>. As a result, 322 cells failed QC and 542 single cells were selected for the further study.
>
> -- <cite><b style="color:#EB1960">Quality Control of Single-Cell Data; Materials and Methods Section;</b> @zebrafish-data </cite>
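
The QC rule in this quote can be sketched as a small function.  This is only my reading of it: I'm assuming "content" means the fraction of a cell's total TPM, that spike-ins count towards the expressed-gene tally, and the names `flag.poor.quality`, `is.ercc` and `is.mito` are hypothetical.  The toy data is made up, with the gene threshold lowered to 2 so that a 4-row matrix can trigger it.

```r
# Flag poor-quality cells using the thresholds quoted from the paper.
# tpm:     features x cells matrix of TPM values
# is.ercc: logical vector marking ERCC spike-in rows (assumed input)
# is.mito: logical vector marking mitochondrial rows (assumed input)
flag.poor.quality <- function(tpm, is.ercc, is.mito,
                              min.genes = 900,
                              max.ercc = 0.60,
                              max.mito = 0.45) {
    n.expressed <- colSums(tpm > 1)           # features with TPM > 1
    total <- colSums(tpm)
    ercc.frac <- colSums(tpm[is.ercc, , drop = FALSE]) / total
    mito.frac <- colSums(tpm[is.mito, , drop = FALSE]) / total
    n.expressed < min.genes | ercc.frac > max.ercc | mito.frac > max.mito
}

# Made-up toy data: 2 genes, 1 spike-in, 1 mitochondrial gene; 4 cells
toy.tpm <- matrix(
    c(50, 40,  5,   5,     # cell1: fine
      10, 10, 70,  10,     # cell2: 70% ERCC
      30,  0, 10,  60,     # cell3: 60% mitochondrial
      99,  0,  0.5, 0.5),  # cell4: only one feature with TPM > 1
    nrow = 4,
    dimnames = list(c("gene1", "gene2", "ERCC-1", "mt-1"),
                    paste0("cell", 1:4))
)

poor <- flag.poor.quality(
    toy.tpm,
    is.ercc = c(FALSE, FALSE, TRUE, FALSE),
    is.mito = c(FALSE, FALSE, FALSE, TRUE),
    min.genes = 2  # lowered from 900 for the toy example
)
print(poor)
```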

So 322 cells failed quality control - but that still only brings us up to 864 cells, 102 fewer than the 966 in our matrix.  Let's download the [experiment design table](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7117/experiment-design) from the <b style="color:#EB1960">Single Cell Expression Atlas</b>; we'll need it later anyway (it contains non-gene info about the samples), and it contains info on sample quality via the fields "well information" and "single cell quality".  The name of the file once downloaded is `ExpDesign-E-MTAB-7117.tsv`.

In [None]:
sample.data <- read.table("./localdata/ExpDesign-E-MTAB-7117.tsv", sep='\t', header=TRUE)
head(sample.data)

Unnamed: 0_level_0,Assay,Sample.Characteristic.organism.,Sample.Characteristic.Ontology.Term.organism.,Sample.Characteristic.strain.,Sample.Characteristic.Ontology.Term.strain.,Sample.Characteristic.age.,Sample.Characteristic.Ontology.Term.age.,Sample.Characteristic.developmental.stage.,Sample.Characteristic.Ontology.Term.developmental.stage.,Sample.Characteristic.sex.,⋯,Sample.Characteristic.single.cell.quality.,Sample.Characteristic.Ontology.Term.single.cell.quality.,Sample.Characteristic.cluster.,Sample.Characteristic.Ontology.Term.cluster.,Factor.Value.genotype.,Factor.Value.Ontology.Term.genotype.,Factor.Value.organism.part.,Factor.Value.Ontology.Term.organism.part.,Factor.Value.single.cell.identifier.,Factor.Value.Ontology.Term.single.cell.identifier.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<lgl>,<chr>,<chr>,<chr>,⋯,<chr>,<lgl>,<chr>,<lgl>,<chr>,<lgl>,<chr>,<chr>,<chr>,<lgl>
1,ERR2722968,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male,⋯,OK,,1,,Tg(lck:EGFP),,kidney,,LCK_kidney_A12,
2,ERR2722969,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male,⋯,OK,,1,,Tg(lck:EGFP),,kidney,,LCK_kidney_A4,
3,ERR2722970,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male,⋯,OK,,1,,Tg(lck:EGFP),,kidney,,LCK_kidney_B1,
4,ERR2722971,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male,⋯,OK,,1,,Tg(lck:EGFP),,kidney,,LCK_kidney_B2,
5,ERR2722972,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male,⋯,OK,,1,,Tg(lck:EGFP),,kidney,,LCK_kidney_C10,
6,ERR2722973,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male,⋯,OK,,1,,Tg(lck:EGFP),,kidney,,LCK_kidney_C11,


In [None]:
#| echo: false
#| output: false
colnames(sample.data)

In [None]:
dim(sample.data)

In [None]:
# Count the rows flagged as good-quality wells containing a single cell
dim(sample.data[which(
    sample.data['Sample.Characteristic.single.cell.quality.'] == "OK"
    & sample.data['Sample.Characteristic.well.information.'] == "single cell"
),])

In [None]:
# Keep only the cells that were assigned to a cluster
filtered <- sample.data[which(
    sample.data['Sample.Characteristic.cluster.'] != "unknown"
),]

In [None]:
dim(filtered)

The mystery keeps compounding!  For some reason, the experiment design table starts with 90 more rows (cells) than the count matrix!  We can filter out the obviously bad entries, but that still leaves 10 extra rows.  There's a "cluster" field which we can use to recover the post-QC value of 542, although I'm fairly certain these clusters were assigned during the analysis rather than as part of the data creation, which would explain why exactly 542 cells have one.

Since we know the names of all the samples in our experiment design table, we can compare them with the list of column names for our gene count matrix (each column is a cell) to hopefully shed some light on what's going wrong:

In [None]:
# Grab the lists of assays (cells) in the experiment design table
# and in the gene count matrix, for comparison
experiment.assays <- sample.data$Assay
gene.assays <- read.table(
    './localdata/E-MTAB-7117.aggregated_filtered_counts.mtx_cols'
)$V1

In [None]:
# Get the list of cells that appear in the experiment design table
# but not in the gene count matrix
discrepancies <- setdiff(experiment.assays, gene.assays)
length(discrepancies)
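
As a sanity check on a comparison like this, it can help to look in both directions and to rule out duplicated IDs, since `setdiff` silently drops duplicates.  Here's a minimal sketch with made-up IDs; `compare.ids` is a hypothetical helper name of my own.

```r
# Compare two vectors of sample IDs in both directions.
# `compare.ids` is a hypothetical helper, not part of any package.
compare.ids <- function(design.ids, matrix.ids) {
    list(
        only.in.design = setdiff(design.ids, matrix.ids),
        only.in.matrix = setdiff(matrix.ids, design.ids),
        duplicated.in.design = unique(design.ids[duplicated(design.ids)])
    )
}

# Made-up IDs purely for illustration
toy.design <- c("cellA", "cellB", "cellC", "cellD", "cellB")
toy.matrix <- c("cellA", "cellC")

toy.report <- compare.ids(toy.design, toy.matrix)
print(toy.report)
```

If `only.in.matrix` comes back empty on the real data, then every column of the count matrix at least has a matching row in the experiment design table.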

In [None]:
# Now get the experiment info of the 10 of these cells that
# aren't filtered out by the "single cell quality" check
extra.cells <- sample.data[experiment.assays %in% discrepancies, ]
extra.cells <- extra.cells[
    which(
        extra.cells['Sample.Characteristic.single.cell.quality.'] == "OK"
    ),
]

In [None]:
#| output: false
# Show all columns.  There's probably a better way,
# but oh well...
# (I've hidden the outputs of this code block on the blog,
# because it's just a bunch of ugly, unenlightening tables.)
extra.cells[,1:10]
extra.cells[,11:20]
extra.cells[,21:30]
extra.cells[,31:33]

Unnamed: 0_level_0,Assay,Sample.Characteristic.organism.,Sample.Characteristic.Ontology.Term.organism.,Sample.Characteristic.strain.,Sample.Characteristic.Ontology.Term.strain.,Sample.Characteristic.age.,Sample.Characteristic.Ontology.Term.age.,Sample.Characteristic.developmental.stage.,Sample.Characteristic.Ontology.Term.developmental.stage.,Sample.Characteristic.sex.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<lgl>,<chr>,<chr>,<chr>
250,ERR2723217,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male
251,ERR2723218,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male
269,ERR2723236,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male
270,ERR2723237,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male
584,ERR2723551,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male
611,ERR2723578,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male
822,ERR2723789,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,10 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female
827,ERR2723794,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,10 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female
1007,ERR2723974,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,4 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female
1023,ERR2723990,Danio rerio,http://purl.obolibrary.org/obo/NCBITaxon_7955,AB,,6 month,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,male


Unnamed: 0_level_0,Sample.Characteristic.Ontology.Term.sex.,Sample.Characteristic.genotype.,Sample.Characteristic.Ontology.Term.genotype.,Sample.Characteristic.organism.part.,Sample.Characteristic.Ontology.Term.organism.part.,Sample.Characteristic.cell.type.,Sample.Characteristic.Ontology.Term.cell.type.,Sample.Characteristic.phenotype.,Sample.Characteristic.Ontology.Term.phenotype.,Sample.Characteristic.individual.
Unnamed: 0_level_1,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<int>
250,http://purl.obolibrary.org/obo/PATO_0000384,Tg(lck:EGFP),,gut,http://purl.obolibrary.org/obo/UBERON_0001007,blood cell,http://purl.obolibrary.org/obo/CL_0000081,EGFP positive cell,,1
251,http://purl.obolibrary.org/obo/PATO_0000384,Tg(lck:EGFP),,gut,http://purl.obolibrary.org/obo/UBERON_0001007,blood cell,http://purl.obolibrary.org/obo/CL_0000081,EGFP positive cell,,1
269,http://purl.obolibrary.org/obo/PATO_0000384,Tg(lck:EGFP),,thymus,,blood cell,http://purl.obolibrary.org/obo/CL_0000081,EGFP positive cell,,1
270,http://purl.obolibrary.org/obo/PATO_0000384,Tg(lck:EGFP),,thymus,,blood cell,http://purl.obolibrary.org/obo/CL_0000081,EGFP positive cell,,1
584,http://purl.obolibrary.org/obo/PATO_0000384,Tg(cd4-1:mCherry),,gut,http://purl.obolibrary.org/obo/UBERON_0001007,blood cell,http://purl.obolibrary.org/obo/CL_0000081,mCherry positive cell,,2
611,http://purl.obolibrary.org/obo/PATO_0000384,Tg(cd4-1:mCherry),,kidney,,blood cell,http://purl.obolibrary.org/obo/CL_0000081,mCherry positive cell,,2
822,http://purl.obolibrary.org/obo/PATO_0000383,"Tg(mhc2dab:GFP, cd45:dsRed)",,kidney,,blood cell,http://purl.obolibrary.org/obo/CL_0000081,GFP and dsRed positive cell,,3
827,http://purl.obolibrary.org/obo/PATO_0000383,"Tg(mhc2dab:GFP, cd45:dsRed)",,kidney,,blood cell,http://purl.obolibrary.org/obo/CL_0000081,GFP and dsRed positive cell,,3
1007,http://purl.obolibrary.org/obo/PATO_0000383,Tg(lck:EGFP); Rag1 -/-,,kidney,,blood cell,http://purl.obolibrary.org/obo/CL_0000081,EGFP positive cell,,4
1023,http://purl.obolibrary.org/obo/PATO_0000384,Tg(lck:EGFP),,thymus,,blood cell,http://purl.obolibrary.org/obo/CL_0000081,EGFP positive cell,,1


Unnamed: 0_level_0,Sample.Characteristic.Ontology.Term.individual.,Sample.Characteristic.well.information.,Sample.Characteristic.Ontology.Term.well.information.,Sample.Characteristic.single.cell.quality.,Sample.Characteristic.Ontology.Term.single.cell.quality.,Sample.Characteristic.cluster.,Sample.Characteristic.Ontology.Term.cluster.,Factor.Value.genotype.,Factor.Value.Ontology.Term.genotype.,Factor.Value.organism.part.
Unnamed: 0_level_1,<lgl>,<chr>,<lgl>,<chr>,<lgl>,<chr>,<lgl>,<chr>,<lgl>,<chr>
250,,single cell,,OK,,unknown,,Tg(lck:EGFP),,gut
251,,single cell,,OK,,unknown,,Tg(lck:EGFP),,gut
269,,single cell,,OK,,unknown,,Tg(lck:EGFP),,thymus
270,,single cell,,OK,,unknown,,Tg(lck:EGFP),,thymus
584,,single cell,,OK,,unknown,,Tg(cd4-1:mCherry),,gut
611,,single cell,,OK,,unknown,,Tg(cd4-1:mCherry),,kidney
822,,single cell,,OK,,unknown,,"Tg(mhc2dab:GFP, cd45:dsRed)",,kidney
827,,single cell,,OK,,unknown,,"Tg(mhc2dab:GFP, cd45:dsRed)",,kidney
1007,,single cell,,OK,,unknown,,Tg(lck:EGFP); Rag1 -/-,,kidney
1023,,single cell,,OK,,unknown,,Tg(lck:EGFP),,thymus


Unnamed: 0_level_0,Factor.Value.Ontology.Term.organism.part.,Factor.Value.single.cell.identifier.,Factor.Value.Ontology.Term.single.cell.identifier.
Unnamed: 0_level_1,<chr>,<chr>,<lgl>
250,http://purl.obolibrary.org/obo/UBERON_0001007,LCK_gut_G8,
251,http://purl.obolibrary.org/obo/UBERON_0001007,LCK_gut_G9,
269,,LCK_thymus_C4,
270,,LCK_thymus_C5,
584,http://purl.obolibrary.org/obo/UBERON_0001007,CD4_gut_D4,
611,,CD4_kidney_C5,
822,,mhc2dab_kidney_pl2_A2,
827,,mhc2dab_kidney_pl2_D1,
1007,,Rag1_kidney_H3,
1023,,LCK_thymus_B2,


The above contains the 10 outstanding cells.  Frustratingly, there doesn't seem to be any reason I can find as to why these cells were excluded at this stage!

<b>To summarize</b>:

*  The website reports 966 cells
*  The gene count table also reports 966 cells
*  The experiment table reports 1056 cells
*  Filtering the experiment table yields 976 cells
*  The paper reports 864 cells pre-quality control
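
The bookkeeping behind this summary is easy to lose track of, so here's the arithmetic written out, using only the numbers quoted in this post (the vector names are my own):

```r
# Cell counts from each source, as reported in this post
cell.counts <- c(
    atlas.website = 966,
    count.matrix = 966,
    experiment.table = 1056,
    experiment.table.filtered = 976,
    paper.pre.qc = 322 + 542,   # failed QC + passed QC
    paper.post.qc = 542
)

# Difference of each source from the count matrix we downloaded
count.diffs <- cell.counts - cell.counts[["count.matrix"]]
print(count.diffs)
```

The positive entries (90 and 10) are the extra rows in the experiment table before and after filtering; the negative ones show how far the paper's totals fall short of our matrix.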

This isn't an uncommon occurrence; in fact, every time I've read a paper like this (which, admittedly, is only 3ish times) I've been unable to get a consistent answer on how many cells were actually used!  <strong style="color:#C0CF96">It's a major nuisance</strong>.

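We'll pick up where we left off in the next blog post.  While annoying, this discrepancy isn't fatal - we can perform our own quality control on the 966 cells that actually appear in the count matrix (the 10 leftover cells from the filtered experiment table have no counts to work with).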

### References

::: {#refs}
:::
