I've seen a few "<b style="color:#A6A440">protocols</b>" pop up, such as <b style="color:#A6A440">Smart-seq2</b> and <b style="color:#A6A440">10x Chromium</b>.  This blog aims to answer the following questions:

*  What are protocols
*  What effect do they have on downstream analysis
*  Which are the common protocols

I am indebted to @protocols for much of my understanding.

# What is a <b style="color:#A6A440">protocol</b>?

An algorithm for biologists: a recorded sequence of steps to follow in the lab to produce scRNA-seq data.

# Details about <b style="color:#A6A440">protocols</b>

> One type of technical variable is the <b style="color:#A6A440">sensitivity</b> of a scRNA-seq method (i.e., <b style="color:#C0CF96">the probability to capture and convert a particular mRNA transcript present in a single cell into a cDNA molecule present in the library</b>). Another variable of interest is the <b style="color:#A6A440">accuracy</b> (i.e.,  <b style="color:#C0CF96">how well the read quantification corresponds to the actual concentration of mRNAs</b>), and a third type is the <b style="color:#A6A440">precision with which this amplification occurs</b> (i.e., the technical variation of the quantification). The combination of <b style="color:#A6A440">sensitivity</b>, <b style="color:#A6A440">precision</b>, and number of cells analyzed determines the power to detect relative differences in expression levels.
>
> -- @protocols

The difference between <b style="color:#A6A440">sensitivity</b> and <b style="color:#A6A440">accuracy</b> is a bit confusing.  I <strong style="color:#757575">think</strong> that <b style="color:#A6A440">sensitivity</b> here is on a per-transcript level (how likely it is to be captured) whereas the <b style="color:#A6A440">accuracy</b> is about the relative proportions of the transcripts.  If you had 5 instances of <b style="color:#EB1960">mt-co1</b> and 10 instances of <b style="color:#EB1960">mt-cyb</b> in your results but had 10 <b style="color:#EB1960">mt-co1</b>/20 <b style="color:#EB1960">mt-cyb</b> in reality, the sensitivity would be 50% but the accuracy would be 100%.  I'm less sure what they mean by <b style="color:#A6A440">precision</b>, but going by the quote's own gloss (the technical variation of the quantification), my reading is that it describes how much the counts would vary if the same sample were measured repeatedly.
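
Here's a minimal sketch of that toy example, just to make the distinction concrete.  The function names are my own, and "accuracy" is reduced here to "are the relative proportions preserved?", which is my reading of the quote rather than a formal definition.

```python
# Toy counts from the example above: molecules really in the cell vs. molecules observed.
true_counts = {"mt-co1": 10, "mt-cyb": 20}
observed_counts = {"mt-co1": 5, "mt-cyb": 10}


def capture_rate(true, observed):
    """Fraction of the true molecules that were observed, per gene (a stand-in for sensitivity)."""
    return {gene: observed[gene] / true[gene] for gene in true}


def proportions(counts):
    """Relative abundance of each gene within one set of counts."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


rates = capture_rate(true_counts, observed_counts)
true_props = proportions(true_counts)
observed_props = proportions(observed_counts)

# The proportions match even though half of the molecules were lost.
proportions_preserved = all(
    abs(true_props[gene] - observed_props[gene]) < 1e-9 for gene in true_counts
)

print(rates)
print(proportions_preserved)
```

Half of each gene's molecules are lost, yet the ratio between the two genes is unchanged, which is why a method can have low sensitivity and still be accurate.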

<details>
    <summary><b class="sidequests">Sidequest</b><b style="color:#C0CF96">: What is <b style="color:#EB1960">mt-co1</b>?</b></summary>
    I've been trying to "<b style="color:#C0CF96">think more like a bioinformatician</b>".  Now, I don't really think that bioinformaticians spend much time learning about random genes, but I think the tools I have to use to learn about them will be tools that bioinformaticians use and hence learning about random genes will still be useful for me.  <b style="color:#EB1960">mt-co1</b> is a random gene, which is why I picked it out here.  I specifically picked it because it's present in zebrafish (<i style="color:#EB1960">Danio rerio</i>), which is what I was reading about a lot lately.  However, it is also present in humans and mice.  In fact it's probably present in a lot of places; this was just what came up on the first page of my <b style="color:#EB1960">Ensembl</b> search.
    
*  For zebrafish, the <b style="color:#A6A440">Ensembl ID</b> is <b style="color:#757575">ENSDARG00000063905</b>
*  For humans, the <b style="color:#A6A440">Ensembl ID</b> is <b style="color:#757575">ENSG00000198804</b>
*  For mice, the <b style="color:#A6A440">Ensembl ID</b> is <b style="color:#757575">ENSMUSG00000064351</b>
    
In general it has quite a lot of <b style="color:#A6A440">orthologs</b> (genes in similar species sharing a common ancestor).  However, there are also quite a lot of species (98) for which <b style="color:#EB1960">Ensembl</b> reports no <b style="color:#A6A440">orthologs</b>, such as <i style="color:#EB1960">Balaenoptera musculus</i>, the blue whale.  Unfortunately, it's not clear to me whether <b style="color:#EB1960">Ensembl</b> reports no orthologs because it is confident none exist, or because it does not have the data; there are 314 species in <b style="color:#EB1960">Ensembl</b> as far as I can tell, 200 of which have <b style="color:#A6A440">orthologs</b> and 98 of which don't.  Presumably the remaining 16 are unsure?  But it'd be weird that such a high percentage of species, spread out over many domains in the tree of life, lack a mitochondrial gene that is very widespread.  I'm not sure.

![Human vs zebrafish mitochondrial dna near <b style="color:#EB1960">mt-co1</b>](./human-zebra-m1-co1.png)

We can see that this region seems to be <em style="color:#C0CF96">highly</em> conserved between species; all the protein-coding genes are in the same order along the chromosome.  Zebrafish and humans are just about as far away as you can get among vertebrates so if it's conserved amongst them, we'd expect it to be conserved amongst much closer relatives as well.
    
![<b style="color:#EB1960">Ensembl</b> genome for the blue whale](./whale-genome.png)
    
The above graphic indicates that <b style="color:#EB1960">Ensembl</b> does not have the mitochondrial DNA for the blue whale (or else we'd expect an "MT" chromosome like in humans).  This seals the deal for me that a lack of <b style="color:#A6A440">orthologs</b> in <b style="color:#EB1960">Ensembl</b> does not denote confidence in a lack of <b style="color:#A6A440">orthologs</b> irl, merely that no orthologs have been found.

Despite my current fascination with zebrafish, I'll mostly talk about its role in humans (but again, it's probably quite similar).
    
To continue, <b style="color:#EB1960">mt-co1</b> stands for "mitochondrially encoded cytochrome c oxidase I".
    
![Table of possible gene variants](./variants.png)
    
The above table is quite interesting; it's a list of known variants of the gene that have occurred.  I'll go through column-by-column to explain what they are.
    
* <b style="color:#537FBF">Variant ID</b>: The unique identifier of this gene variant.
* <b style="color:#537FBF">Chr: bp</b>: The location, in terms of base pairs, along the chromosome.
* <b style="color:#537FBF">Alleles</b>: The base pair before the slash is the default, the ones after the slash are possible mutations.
* <b style="color:#537FBF">Global MAF</b>: Minor allele frequency.  "The frequency of the second most common allele in the global population, defined in human by the 1000 Genomes Project phase 3."
* <b style="color:#537FBF">Class</b>: The type of variant it is (SNP or SNV for example)
* <b style="color:#537FBF">Source</b>: The database this is from
* <b style="color:#537FBF">Evidence</b>: Evidence for the existence of this variant
* <b style="color:#537FBF">Clinical Significance</b>: Clinical significance
* <b style="color:#537FBF">Consequence Type</b>: What will happen when the cell/mitochondria try to transcribe this DNA?  How will it affect the mRNA and the protein?
* <b style="color:#537FBF">AA</b>: The resulting amino acid (before the slash is original, after is the variant - it's possible for synonymous variants to exist in which case there's no slash).
* <b style="color:#537FBF">AA coord</b>: Not sure, but I think this is its location in the chain of amino acids in the generated protein.
* <b style="color:#537FBF">SIFT</b>: Score of likelihood for whether we predict the function of the generated protein to change.
* <b style="color:#537FBF">Poly-Phen</b>: (Short for "Polymorphism Phenotyping".)  Another score of likelihood of predicted function change, calculated by looking at different characteristics.
* <b style="color:#537FBF">CADD</b>: Scores the "deleteriousness" of SNPs
* <b style="color:#537FBF">REVEL</b>: Predicts "pathogenicity" of SNVs
* <b style="color:#537FBF">Meta LR</b>: Same as above
* <b style="color:#537FBF">Mutation Assessor</b>: Similar to <b style="color:#537FBF">SIFT</b> and <b style="color:#537FBF">Poly-Phen</b>.  For SNVs.
* <b style="color:#537FBF">Transcript</b>: How the gene manifests in RNA
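
To make the slash notation in the <b style="color:#537FBF">Alleles</b> and <b style="color:#537FBF">AA</b> columns concrete, here's a small sketch of how one might split those cells.  The function names and the example values are made up for illustration; they are not taken from the table above or from any Ensembl tool.

```python
def parse_alleles(field):
    """Split an 'Alleles' cell such as 'A/G/T' into (reference, [alternatives])."""
    reference, *alternatives = field.split("/")
    return reference, alternatives


def parse_amino_acids(field):
    """Split an 'AA' cell; a cell without a slash is treated as a synonymous variant."""
    if "/" not in field:
        return field, []
    original, *variants = field.split("/")
    return original, variants


# Hypothetical cells, not real variants.
print(parse_alleles("A/G/T"))
print(parse_amino_acids("A/T"))
print(parse_amino_acids("L"))
```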
    
![We can look at more details by clicking on the variant ID!  But this is for another time.](./variant-info.png)
    
![And more details about the transcript!  But this is also for another time.](./transcript-info.png)
    
The stuff about clinical significance might be quite useful to look at for future projects.
    
We could look into ontology terms and pathways as well, but I've spent quite a long time on this <b class="sidequests">sidequest</b>.
    
</details>

<details>
    <summary><b class="sidequests">Sidequest</b><b style="color:#C0CF96">: What is <b style="color:#EB1960">mt-cyb</b>?</b></summary>
<b style="color:#EB1960">mt-cyb</b>, or <b style="color:#757575">ENSG00000198727</b>, or even "mitochondrially encoded cytochrome b", is another mitochondrial gene.  It is involved in "respiratory electron transport", metabolism, and <b style="color:#A6A440">the citric acid cycle</b>, but has no more known pathways in <b style="color:#EB1960">Ensembl</b>.  All of the aforementioned pathways were also shared by <b style="color:#EB1960">mt-co1</b>, but <b style="color:#EB1960">mt-co1</b> had many more pathways overall.
    
![The phenotypes page on <b style="color:#EB1960">Ensembl</b>](./phenotypes.png)
    
We can see that variants in this gene can be related to certain diseases.  Based on the GO terms, it is an "integral component of membrane".  That's all for now, folks!
</details>

![A graphical overview from @protocols.  Not my own work.](./protocol-compare.jpg)

There's a lot to unpack in this image.  I'll only pick out a few of the steps.

## <b style="color:#537FBF">Reverse Transcription</b>

The creation of cDNA from RNA (called so because DNA->RNA is <b style="color:#537FBF">transcription</b>).  cDNA is more stable than RNA, so it's easier to work with.
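
As a rough sketch of what that means at the sequence level: the first cDNA strand is the DNA reverse complement of the RNA template.  This ignores primers, enzymes, and everything else that happens in the lab; the function name and example sequence are my own.

```python
# Base-pairing rules for copying RNA into DNA (U in RNA pairs with A in DNA).
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna):
    """Return the first-strand cDNA (written 5'->3') for an RNA written 5'->3'."""
    return "".join(RNA_TO_DNA[base] for base in reversed(rna))


rna = "AUGGCCUAA"  # made-up example sequence
print(reverse_transcribe(rna))
```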

## <b style="color:#537FBF">2nd Strand Synthesis</b>

The synthesis of the second strand of cDNA or RNA (😉).  (Since <b style="color:#537FBF">reverse transcription</b> is just creating one strand of cDNA)
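
In sequence terms, a simplified way to picture this is taking the reverse complement of the first cDNA strand.  The sketch below assumes an idealised full-length copy, and the example strand is made up.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def second_strand(first_strand):
    """Return the strand complementary to `first_strand`, both written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(first_strand))


first = "TTAGGCCAT"  # made-up first-strand cDNA
print(second_strand(first))
```

Note that the second strand has the same sequence as the original RNA would have, just with T in place of U.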

## <b style="color:#537FBF">PCR</b>

<b style="color:#537FBF">PCR</b> stands for <b style="color:#537FBF">Polymerase Chain Reaction</b>.

> Sometimes called "molecular photocopying," the <b style="color:#537FBF">polymerase chain reaction (PCR)</b> is a fast and inexpensive technique used to "<b style="color:#537FBF">amplify</b>" - copy - small segments of DNA.
>
> -- <b style="color:#EB1960">@pcr</b>
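
The "photocopying" is exponential: in the ideal case every cycle doubles the number of copies.  Here's a back-of-the-envelope sketch; the `efficiency` parameter is my own simplification (1.0 means perfect doubling every cycle) and not something taken from the quoted source.

```python
def pcr_copies(starting_copies, cycles, efficiency=1.0):
    """Expected number of copies after `cycles` rounds of PCR.

    Each cycle multiplies the copy number by (1 + efficiency).
    """
    return starting_copies * (1 + efficiency) ** cycles


ideal = pcr_copies(1, 10)
less_efficient = pcr_copies(1, 10, efficiency=0.9)
print(ideal, less_efficient)
```

Because the growth compounds, a small difference in per-cycle efficiency between two molecules turns into a large difference in final copy number, which is one way amplification can distort quantification.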

## <b style="color:#537FBF">IVT</b>

<b style="color:#537FBF">IVT</b> stands for <b style="color:#537FBF">In Vitro Transcription</b>, and is a method of <b style="color:#537FBF">amplification</b>.

> <b style="color:#537FBF">In vitro transcription</b> is a simple procedure that allows for template-directed synthesis of RNA molecules of any sequence from short <b style="color:#A6A440">oligonucleotides</b> to those of several kilobases in μg to mg quantities. It is based on the engineering of a template that includes a <b style="color:#A6A440">bacteriophage promoter sequence</b> (e.g. from the T7 coliphage) upstream of the sequence of interest followed by transcription using the corresponding RNA polymerase.
>
> -- <b style="color:#EB1960">@ivt</b>

## <b style="color:#537FBF">RNA fragmentation</b>

> After poly(A) + selection or rRNA depletion, RNA samples are typically subject to RNA fragmentation to a certain size range before RT. <strong style="color:#C0CF96">This is necessary because of the size limitation of most current sequencing platforms</strong>, e.g., <600 bp on Illumina sequencers.
>
> -- <b style="color:#EB1960"> Fragmentation Subsection; General Aspects OF RNA-Seq Section; @rna-seq-methods </b>

Fragmentation is done when there is a limited <b style="color:#A6A440">read length</b> in your <b style="color:#A6A440">flow cell</b>.  If we did not fragment the RNA, then we would need a large read length to read the base pairs in the middle of a long RNA strand.  <b style="color:#537FBF">Fragmentation</b> is not necessary if your read length is longer than the largest RNA molecule (or unlimited).  <b style="color:#537FBF">Fragmentation</b> puts extra strain on downstream analysis, because we need to reconstruct the original sequence from the fragments.  Furthermore, <b style="color:#537FBF">fragmentation</b> isn't random; the techniques used to break up the RNA may preferentially cause breaks in some places compared to others.  @rna-seq-methods discusses the biases of different <b style="color:#537FBF">fragmentation</b> methods.
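
Here's a toy sketch of the idea: cut a long sequence into pieces no longer than a maximum length.  Note the big assumption: break points here are uniformly random, which is exactly what real fragmentation is not.  The function name, the seed, and the length limit are all made up for illustration.

```python
import random


def fragment(sequence, max_length, seed=0):
    """Cut `sequence` into consecutive pieces of random length between 1 and `max_length`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    fragments = []
    position = 0
    while position < len(sequence):
        size = rng.randint(1, max_length)
        fragments.append(sequence[position:position + size])
        position += size
    return fragments


rna = "ACGU" * 50  # a made-up 200-base molecule
pieces = fragment(rna, max_length=30)
print(len(pieces), max(len(piece) for piece in pieces))
```

In this sketch the pieces stay in order, so joining them back together is trivial; in a real experiment the order is lost, which is why downstream reconstruction is extra work.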

<b style="color:#537FBF">Fragmentation</b> can also be used on the cDNA, but it can be harder to automate.  However, <b style="color:#537FBF">tagmentation</b> is done on cDNA.

## <b style="color:#537FBF">Tagmentation</b>

> Recent development in using transposon-based, so-called <b style="color:#537FBF">tagmentation</b> method has made it simple to <b style="color:#537FBF">fragment cDNA</b> and <strong style="color:#537FBF">add adapter sequences at the same time</strong>.  In this method, an active variant of the Tn5 transposase mediates the <b style="color:#537FBF">fragmentation</b> of double-stranded DNA and <b style="color:#537FBF">ligates</b> adapter <b style="color:#A6A440">oligonucleotides</b> at both ends in a quick reaction (~5 min).  However, it is notable that Tn5 and other enzyme-based cDNA fragmentation methods require a precise enzyme:DNA ratio, <strong style="color:#C0CF96">making method optimization less straightforward than RNA fragmentation</strong>. Consequently, fragmenting RNA is currently still the most frequently used approach in RNA-Seq library preparation.
>
> -- <b style="color:#EB1960"> Fragmentation Subsection; General Aspects OF RNA-Seq Section; @rna-seq-methods </b>

<b style="color:#537FBF">Ligation</b> is the joining of two nucleic acid fragments through the action of an enzyme - at least, according to Wikipedia.  It seems that <b style="color:#537FBF">tagmentation</b> is just the combination of <b style="color:#537FBF">fragmentation</b> and tagging with the adapter <b style="color:#A6A440">oligonucleotides</b>.  These two steps used to need to be done sequentially.
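
A very simplified picture of "fragment and tag in one step" might look like the sketch below.  The adapter strings are obviously fake placeholders, the cut sites are uniformly random, and none of the real Tn5 chemistry is modelled; it's only meant to show that every fragment comes out with an adapter on both ends.

```python
import random

# Placeholder adapters, NOT real adapter sequences.
ADAPTER_LEFT = "<A>"
ADAPTER_RIGHT = "<B>"


def tagment(cdna, n_cuts, seed=0):
    """Cut `cdna` at `n_cuts` random sites and flank every fragment with adapters."""
    rng = random.Random(seed)
    cut_sites = sorted(rng.sample(range(1, len(cdna)), n_cuts))
    edges = [0] + cut_sites + [len(cdna)]
    fragments = [cdna[start:end] for start, end in zip(edges, edges[1:])]
    return [ADAPTER_LEFT + frag + ADAPTER_RIGHT for frag in fragments]


cdna = "ACGT" * 10  # made-up cDNA
tagged = tagment(cdna, n_cuts=3)
print(tagged)
```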

<details>
    <summary>
        <b style="color:#C0CF96">Why do we want to tag the cDNA with adapter</b> <b style="color:#A6A440">oligonucleotides</b>?
    </summary>
    The tags are necessary to attach <b style="color:#A6A440">primers</b> to your cDNA [@tag-why].  This is described well during the following explanation of <b style="color:#537FBF">Index PCR</b>:
    
> A single Illumina <b style="color:#A6A440">flow cell</b> can sequence multiple samples as long as the expected reads will not saturate (exceed)  the capacity of the flow cell. However, since the samples are pooled and loaded as one into the flow cell, <b style="color:#C0CF96">there must be a mechanism to which to distinguish sequences from one sample to another</b>. This is accomplished via a process called <b style="color:#537FBF">Index PCR</b>. Here, custom <b style="color:#A6A440">oligonucleotides</b> called <b style="color:#A6A440">index primers</b> are used to <b style="color:#537FBF">amplify</b> and ‘barcode’ the fragments. Each <b style="color:#A6A440">index primer</b> contains the following:
> * <b style="color:#A6A440">Read (1 or 2) sequencing primer</b>: <b style="color:#C0CF96">a segment complementary to the ‘tag’ introduced during tagmentation</b>,
> * <b style="color:#A6A440">Indexing sequence</b>: a unique DNA sequence for identification / barcoding of samples
> * <b style="color:#A6A440">Sequencing Anchor (P5/P7)</b>: sequences complementary to the oligos in the flow cell. <b style="color:#C0CF96">These allow the index PCR products (libraries) to bind to the Illumina flow cell for sequencing</b>. 
>
> -- <b style="color:#EB1960">STEP 4: LIBRARY PREPARATION: AMPLIFICATION; @tag-why</b>

For an explanation on all the details here, see the section on <b style="color:#537FBF">Index PCR</b>; essentially, the tags are necessary to hook an <b style="color:#A6A440">index primer</b> into your cDNA, which is then necessary to hook it up with the <b style="color:#A6A440">oligos</b> in your <b style="color:#A6A440">flow cell</b>.
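
To see why the indexing sequence matters, here's a toy sketch of sorting pooled reads back into samples.  The index sequences and sample names are invented, and for simplicity I pretend the index sits at the start of each read; as far as I understand, on a real instrument the index is read separately, so treat this purely as an illustration of the barcoding idea.

```python
# Hypothetical index sequences, one per sample.
SAMPLE_INDEXES = {"ACGT": "sample_1", "TGCA": "sample_2"}
INDEX_LENGTH = 4


def demultiplex(reads):
    """Group reads by sample, assuming each read starts with its index sequence."""
    by_sample = {}
    for read in reads:
        index, insert = read[:INDEX_LENGTH], read[INDEX_LENGTH:]
        sample = SAMPLE_INDEXES.get(index, "unassigned")
        by_sample.setdefault(sample, []).append(insert)
    return by_sample


pooled_reads = ["ACGTGGGG", "TGCATTTT", "ACGTCCCC", "NNNNAAAA"]
print(demultiplex(pooled_reads))
```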
    
</details>

## <b style="color:#A6A440">Primers</b>

> A <b style="color:#A6A440">primer</b>, as related to genomics, is a short single-stranded DNA fragment used in certain laboratory techniques, such as the <b style="color:#537FBF">polymerase chain reaction (PCR)</b>. In the <b style="color:#537FBF">PCR method</b>, a pair of <b style="color:#A6A440">primers</b> hybridizes with the sample DNA and defines the region that will be <b style="color:#537FBF">amplified</b>, resulting in millions and millions of copies in a very short timeframe. <b style="color:#A6A440">Primers</b> are also used in DNA sequencing and other experimental processes.
>
> -- <b style="color:#EB1960">@primers</b>

The above quote is, I think, sufficient.  You can see <b style="color:#A6A440">primers</b> in action by reading through the section on <b style="color:#537FBF">PCR</b>, for example.

## 3' Enrichment

TODO: THIS

## <b style="color:#537FBF">Adapter Ligation</b>

> In a standard RNA-Seq library <b style="color:#A6A440">protocol</b>, cDNAs of a desired size generated from <b style="color:#537FBF">RT</b> of fragmented RNAs with random hexamer <b style="color:#A6A440">primers</b> or from fragmented <b style="color:#A6A440">full-length cDNAs</b> are <b style="color:#537FBF">ligated</b> to DNA adapters before amplification and sequencing. While simple, <b style="color:#C0CF96">this approach loses the information about which DNA strand corresponds to the sense strand of RNA</b>. Lack of strand specificity would make it difficult to identify <b style="color:#A6A440">antisense</b> and novel RNA species and cause inaccurate measurement of sense RNA expression. Several methods have been developed to capture the directionality of RNA in cDNA libraries.
>
> -- <b style="color:#EB1960"> Adapters and Directionality Subsection; General Aspects OF RNA-Seq Section; @rna-seq-methods </b>

<b style="color:#537FBF">RT</b> stands for <b style="color:#537FBF">Reverse Transcription</b>, described earlier.  <b style="color:#A6A440">Antisense RNA</b> is RNA that is complementary to mRNA.  <b style="color:#A6A440">Adapters</b> are necessary for sequencing, but I don't know why at the moment (TODO: THIS)

## UMI vs Full Length

TODO: THIS

::: References
:::