# scRNA-seq Data

I'm heavily indebted to Morgan for help in understanding this topic.  I'm still learning about it, so don't take my word as gospel.

## Library Preparation

<strong style="color:#537FBF">Library preparation</strong> is the collection of the bits of data that you're actually interested in (these bits would be your 'library').  The methods I've seen do this by attaching something to the mRNA molecules that we can take advantage of later.  For scRNA-seq we are interested in the <strong style="color:#A6A440">transcriptome</strong> (this is the set of RNA molecules in the cell).  <strong style="color:#EB1960">@illumina-ebook</strong> lists a few library preparation techniques for the transcriptome:

<hr/>

*  Full-length RNA-seq
*  mRNA end-tag amplification
*  Targeted panels
   *  Targeted towards measuring specific things
*  IR-seq
   *  Specifically used for B and T cells.

<hr/>

These are important for me to understand so I'll definitely be delving into this topic deeper in the future.

## Sequencing

We've now isolated our cells and grabbed/tagged the bits that we're interested in.  Now, we want to actually be able to learn about the mRNA that we've measured.  To do this, we need to sequence it. 

### Flow Cells

One important concept for sequencing is the <strong style="color:#A6A440">Flow Cell</strong>.  Intuitively, the flow cell is a sticky plate that will grab the molecules we prepared from the mRNA.  <strong style="color:#EB1960">@illumina-ngs</strong> defines them as the following:

> <strong style="color:#A6A440">Flow Cell</strong>: A glass slide with one, two, or eight physically separated lanes, depending on the instrument platform. Each lane is coated with a lawn of surface bound, adapter-complementary <strong style="color:#A6A440">oligos</strong>. A single library or a pool of up to 96 <strong style="color:#537FBF">multiplexed</strong> libraries can be run per lane, depending on application parameters.
>
> -- <cite><strong style="color:#EB1960">Illumina handbook on next generation sequencing</strong></cite> [@illumina-ngs]



An <strong style="color:#A6A440">oligo</strong> is a short strand of synthetic DNA [@oligo].  These are the "sticky bits".  <strong style="color:#537FBF">Multiplexing</strong> is the process of sequencing multiple samples at a time; this can be useful as sequencing produces a lot of data, more than may be necessary for a single project on its own [@ngs-considerations].  This can be done by attaching molecular barcodes to the prepared cDNA to indicate the original sample from which the cDNA hails.
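To make the barcode idea concrete, here is a minimal Python sketch of undoing the multiplexing afterwards (often called demultiplexing).  It assumes, purely for simplicity, that each read starts with its sample barcode; the barcode sequences, sample names, and the `demultiplex` helper are all made up for illustration and don't correspond to any real kit or tool.

```python
from collections import defaultdict

# Hypothetical sample barcodes; real kits define their own sequences.
SAMPLE_BARCODES = {
    "ACGT": "sample_A",
    "TGCA": "sample_B",
}
BARCODE_LEN = 4


def demultiplex(reads):
    """Group reads by the sample barcode found at the start of each read."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        # Reads whose barcode we don't recognise are kept aside, not dropped.
        sample = SAMPLE_BARCODES.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


pooled = ["ACGTGGGCCC", "TGCATTTAAA", "ACGTCCCGGG", "NNNNAAAAAA"]
demuxed = demultiplex(pooled)
```

This sketch only accepts exact barcode matches, so a single sequencing error in the barcode sends the read to `undetermined`.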

#### Sequencing by Synthesis

The flow cell definition given by Illumina is specifically tuned to their flow cells - this may have been obvious due to the use of specific numbers like "96".

<details>
    <summary style="color:#c0cf95"><strong>Tangential Observation</strong></summary>
    Surprisingly, though, the competitor company <strong style="color:#EB1960">Oxford Nanopore</strong> [also uses this number](https://store.nanoporetech.com/uk/rapid-barcoding-kit-1.html).  I assume this is due to some chemical feature of the barcodes, but it probably isn't as simple as that since I know you can have more than 96 UMIs, which I thought would be a similar process...
</details>
<br/>
<strong style="color:#EB1960">Illumina</strong> uses a <strong style="color:#537FBF">Sequencing by Synthesis</strong> approach; we'll look at another approach, <strong style="color:#537FBF">Nanopore Sequencing</strong>, later.  During <strong style="color:#EB1960">Illumina's</strong> library preparation phase they convert the RNA to cDNA.

![Example of a patterned flow cell](./images/flow_cell.jpeg)

The above diagram is that of a 2D slice of a <strong style="color:#A6A440">Patterned Flow Cell</strong>.  These differ from <strong style="color:#A6A440">Nonpatterned Flow Cells</strong> in the use of <strong style="color:#A6A440">nanowells</strong> (little valleys in the cell) to ideally keep fragments of DNA from binding to nearby oligos.  Patterned flow cells are a more recent innovation, and for a brief overview of the differences you can read @patterned-vs-nonpatterned and @pvn-cons.

The cDNA created during library preparation is added onto the flow cell.  The strands of cDNA (these become our <strong style="color:#A6A440">reads</strong> once sequenced) will bind to the oligos, and then start multiplying (<strong style="color:#537FBF">"bridge amplification"</strong>) so that duplicates of each strand will be bound to nearby oligos.  For patterned flow cells, this should mean that each nanowell is full of copies of the same strand - for nonpatterned flow cells the clusters are more chaotic.

After you've prepared your flow cell, you'll plug it into a sequencer to read the bases.  These use chemical tricks to cause each base in a strand to let out a flash of a specific color.  The order and color of these flashes tell the sequencer what the sequence of bases is.  After doing this, you'll have raw data on the contents of each read, likely in the `bcl` file format (`bcl` is the output for <strong style="color:#EB1960">Illumina</strong> sequencers).
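As a toy illustration of turning flashes into a sequence, the Python sketch below picks the brightest color channel at each cycle.  The four-channel layout, the channel-to-base order, and the intensity numbers are all assumptions made for the example; real instruments differ in how many channels they use and also attach quality scores to each call.

```python
# Assumed channel order; the mapping from color channel to base is hypothetical.
CHANNELS = "ACGT"


def call_bases(intensities):
    """Call one base per cycle by picking the brightest channel."""
    calls = []
    for cycle in intensities:
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        calls.append(CHANNELS[brightest])
    return "".join(calls)


# One tuple per sequencing cycle: made-up intensities for the A, C, G, T channels.
cluster_signal = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.0, 0.8, 0.2),
    (0.0, 0.1, 0.1, 0.7),
    (0.2, 0.9, 0.1, 0.0),
]
sequence = call_bases(cluster_signal)  # "AGTC"
```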

#### Nanopore Sequencing

While most of this blog has focused on <strong style="color:#EB1960">Illumina</strong>, <strong style="color:#EB1960">Oxford Nanopore</strong> is definitely worth talking about.  They use <strong style="color:#537FBF">Nanopore Sequencing</strong> instead of <strong style="color:#537FBF">Sequencing by Synthesis</strong>.  A brief comparison of the two companies is given by @illumina-vs-nanopore.

<details>
    <summary style="color:#C0CF95"><strong>¿cDNA?</strong></summary>
    It is possible to directly use RNA, instead of cDNA, depending on what is done at the library preparation stage.  I'll assume we use cDNA here.  It would make an interesting blog post, I think, to delve into the intricacies as to what is going on and why we should care about cDNA vs RNA.
</details>
<br/>

Nanopore sequencing is a newer method; instead of sticking to an oligo, the cDNA passes through a <strong style="color:#A6A440">nanopore</strong>.  To do this, the cDNA has a <strong style="color:#A6A440">motor protein</strong> attached to it during the library preparation stage.  The act of passing through the nanopore creates a detectable electrical signal that is dependent on the bases in the cDNA.  By measuring this signal, we can sequence the molecule.

![Nanopore sequencing.  The cDNA will pass through the nanopore as it gets sequenced.  The tether will grab on to the motor protein to accomplish this.  [@nanopore]](./images/nanopore.jpeg)

One interesting advantage of nanopore sequencing is the ability to detect <strong style="color:#A6A440">modified bases</strong> (which I didn't know were a thing!) [@modified-bases].  Basically, the structure that makes up your G, T, A, C, U nucleic acid bases can actually get modified in certain ways, turning them into different sub-molecules that may affect gene expression.

<strong style="color:#EB1960">Oxford Nanopore</strong> outputs `FAST5` files, a type of `HDF5` file.  These contain the raw electrical signals the nanopores measured.  One can then perform <strong style="color:#537FBF">basecalling</strong> to determine the sequence of bases corresponding to the signals.  Due to the aforementioned detectability of modified bases, the raw electrical signals are valuable in their own right: they hold information that is lost if you only look at the end product of everything converted to G/T/A/C/U [@fast5].

#### Measures and Options

Flow cells have two key measures (besides data quality): the number of reads and the <strong style="color:#A6A440">read length</strong>.  Read counts can be in the hundreds of millions.  Read lengths may be much smaller; the flow cells paired with the <strong style="color:#EB1960">Illumina NextSeq 550</strong> are limited to reads of roughly 150 base pairs.  <strong style="color:#EB1960">Oxford Nanopore</strong> (and in general any nanopore method) has no fixed upper limit on read length, at least in principle.  Longer reads are desirable from the perspective of reconstructing larger sequences, as it is easier to determine if two segments have a significant overlap.
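To see why overlaps matter for reconstruction, here is a small Python sketch that finds the longest suffix of one read matching the prefix of another and stitches the two together.  The function names and the `min_overlap` threshold are my own choices for illustration; real assemblers also have to tolerate sequencing errors, which this exact-match version does not.

```python
def suffix_prefix_overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads on their overlap, or return None if there isn't one."""
    overlap = suffix_prefix_overlap(left, right, min_overlap)
    if overlap == 0:
        return None
    return left + right[overlap:]


merged = merge_reads("ATGGCGTAC", "GTACCTTGA")  # "ATGGCGTACCTTGA"
```

With short reads, a short overlap like this could easily occur by chance; longer reads allow a higher `min_overlap`, which makes accidental matches less likely.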

![My theory as to why it's called "depth"](./images/depth.jpeg)

An important concept is <strong style="color:#A6A440">coverage</strong>, or the number of reads associated with a gene.  Higher coverage is better as it allows us to piece together the gene more accurately and weed out incorrect bases.  This is also called <strong style="color:#A6A440">sequencing depth</strong>.  Deep sequencing is important for detecting rare transcripts.  The depth you need can be affected by the steps done during library preparation and by the cells themselves; immune cells, for example, undergo <strong style="color:#537FBF">VDJ recombination</strong> - in short, each cell has its own unique markers that need to be accounted for.  [@ngs-considerations]

For single-cell applications, depth refers to the number of reads per cell instead of the reads-per-base measure used for bulk sequencing.  When cell populations are more homogeneous, the depth should be larger as false overlaps will be more likely.
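The two notions of depth can be sketched in a few lines of Python.  This assumes each read has already been aligned to a `(start, end)` interval on a reference region and tagged with a cell label; the interval convention, the cell names, and both helper functions are hypothetical.

```python
from collections import Counter


def per_base_depth(read_intervals, region_length):
    """Count how many reads cover each position of a reference region.

    Each interval is (start, end), 0-based with an exclusive end.
    """
    depth = [0] * region_length
    for start, end in read_intervals:
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth


def reads_per_cell(cell_labels):
    """Single-cell style depth: how many reads were assigned to each cell."""
    return Counter(cell_labels)


depth = per_base_depth([(0, 4), (2, 6), (3, 8)], region_length=8)
mean_depth = sum(depth) / len(depth)
cells = reads_per_cell(["cell1", "cell2", "cell1", "cell1"])
```

Here `depth` is the bulk-style measure (reads stacked over each base), while `cells` is the per-cell count.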

Finally, there's the decision of whether to use <strong style="color:#A6A440">paired-end</strong> or <strong style="color:#A6A440">single-end</strong> sequencing.  DNA and RNA are linear structures with two different ends - the 5' and 3' ends ("5-prime" and "3-prime").  By reading from both ends, we can better find errors as well as increase the read length (intuitively: instead of reading n bases in one direction, you can read 2n bases in total, with n in each direction).  These advantages decrease when you have <strong style="color:#A6A440">UMIs (unique molecular identifiers)</strong> or other features you can take advantage of (such as VDJs), but that is out of scope for this blog post.  [@ngs-considerations]  The advantages are not completely nullified, as single-end sequencing cannot detect certain types of errors (such as "indel" errors).  [@illumina-ebook]
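Here is a rough Python sketch of how reading from both ends helps catch errors.  It assumes the second read comes from the opposite strand (so it needs to be reverse-complemented before comparison) and that we already know how many bases the two reads overlap by; the function names and the example fragment are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_mismatches(read1, read2, overlap):
    """Positions (within the overlap) where the two mates disagree."""
    mate = reverse_complement(read2)
    tail = read1[-overlap:]   # end of the forward read
    head = mate[:overlap]     # start of the flipped reverse read
    return [i for i, (a, b) in enumerate(zip(tail, head)) if a != b]


# Made-up 12-base fragment ATGGCGTACCTT, read 8 bases from each end,
# so the two reads overlap by 8 + 8 - 12 = 4 bases.
clean = overlap_mismatches("ATGGCGTA", "AAGGTACG", overlap=4)
noisy = overlap_mismatches("ATGGCGTA", "AAGGTACC", overlap=4)
```

A disagreement in the overlap tells you that one of the two reads has an error at that position, although it doesn't tell you which one without further information such as quality scores.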

### References

::: {#refs}
:::
