# COPILOT tutorial - toy data

### The toy data was preprocessed into gene-by-cell matrices using scKB. For details on scKB and access to the toy data and supplementary data, please visit https://github.com/Hsu-Che-Wei/scKB . Note that the toy data contains only 1 million reads. Because of this shallow sequencing depth, the COPILOT parameters min.UMI.low.quality and min.UMI.high.quality should be lowered accordingly.

In [1]:
rm(list=ls())
# Set the working directory to the location of the folder named after the sample.
# The folder contains spliced.mtx, unspliced.mtx, the barcode and gene ID files, and the JSON files produced by scKB that document the sequencing stats.
setwd("/scratch/AG_Ohler/CheWei/scKB")
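Before running COPILOT, it can help to confirm that the sample folder is where the working directory expects it. The sketch below is only an illustration: `check_sample_dir` is a hypothetical helper (not part of COPILOT or scKB), it checks only the two matrix files named above, and it is demonstrated on a temporary folder rather than on the real toy data.

```r
# Hypothetical helper (not part of COPILOT): report which expected files
# are present in a sample folder. Only the two matrix file names mentioned
# in this tutorial are checked; other scKB outputs are not assumed here.
check_sample_dir <- function(sample.dir,
                             expected = c("spliced.mtx", "unspliced.mtx")) {
  if (!dir.exists(sample.dir)) {
    stop("Sample folder not found: ", sample.dir)
  }
  present <- file.exists(file.path(sample.dir, expected))
  setNames(present, expected)
}

# Demonstration on a temporary folder that contains only spliced.mtx
demo.dir <- file.path(tempdir(), "col0_toy_demo")
dir.create(demo.dir, showWarnings = FALSE)
invisible(file.create(file.path(demo.dir, "spliced.mtx")))

status <- check_sample_dir(demo.dir)
print(status)
```

On the real data, the call would be `check_sample_dir("col0_toy")` after `setwd()`, assuming the folder is named after the sample as described above.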

In [2]:
# Load libraries
suppressMessages(library(tidyverse))
suppressMessages(library(COPILOT))

In [3]:
sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /fast/home/c/chsu/anaconda3/envs/seu314/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.utf-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.utf-8        LC_COLLATE=en_US.utf-8    
 [5] LC_MONETARY=en_US.utf-8    LC_MESSAGES=en_US.utf-8   
 [7] LC_PAPER=en_US.utf-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.utf-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] COPILOT_0.1.0   forcats_0.5.0   stringr_1.4.0   dplyr_1.0.0    
 [5] purrr_0.3.4     readr_1.3.1     tidyr_1.1.0     tibble_3.0.1   
 [9] ggplot2_3.3.1   tidyverse_1.3.0

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                uuid_0.1-4                 
  [3] backpo

In [4]:
# Load unwanted genes (protoplasting-induced genes) as a character vector of gene IDs
pp.genes <- as.character(read.table("./supp_data/Protoplasting_DEgene_FC2_list.txt", header = FALSE)$V1)
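A gene list read this way can be sanity-checked before it is passed to `unwanted.genes`. The following is a minimal sketch that uses a small stand-in file with made-up entries instead of the real supplementary list; the ID pattern assumes Arabidopsis AGI locus identifiers such as `AT1G01010`.

```r
# Write a small stand-in gene list to a temporary file.
# These IDs are illustrative only, not taken from the real list.
gene.file <- tempfile(fileext = ".txt")
writeLines(c("AT1G01010", "AT1G01020", "AT1G01010", "AT2G01008"), gene.file)

# Read the first column as a character vector, as in the cell above
genes <- as.character(
  read.table(gene.file, header = FALSE, stringsAsFactors = FALSE)$V1
)

# Count and drop duplicated entries
n.dup <- sum(duplicated(genes))
genes <- unique(genes)

# Check that every entry looks like an Arabidopsis AGI locus ID
looks.like.agi <- grepl("^AT[1-5CM]G[0-9]{5}$", genes)

cat("duplicates removed:", n.dup, "\n")
cat("entries with unexpected format:", sum(!looks.like.agi), "\n")
```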

In [5]:
# Run copilot. Note that do.annotation only supports root samples of Arabidopsis thaliana and Oryza sativa.
copilot(sample.name = "col0_toy", species.name = "Arabidopsis thaliana", transcriptome.name = "TAIR10", sample.stats = NULL, mt.pattern = "ATMG", 
        mt.threshold = 5, cp.pattern = "ATCG", remove.doublet = FALSE, do.seurat = FALSE, do.annotation = FALSE, unwanted.genes = pp.genes, dir_to_color_scheme = "./supp_data/color_scheme_at.RData", 
        dir_to_bulk = "./supp_data/Root_bulk_arabidopsis_curated.RD", min.UMI.low.quality = 1, min.UMI.high.quality = 3)


Attaching package: ‘Matrix’


The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack


Loading required package: SingleCellExperiment

Loading required package: SummarizedExperiment

Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


The following object is masked from ‘package:Matrix’:

    which


The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, c

[1] "threshold cell number: 806"
[1] "removed cells: 806"
[1] "iteration: 1"
[1] "removed cells: 377"


Iteration finished



# copilot

# Description

Single-cell RNA-seq preprocessing tool for gene-by-cell matrices of UMI counts. It is recommended to use the raw spliced and unspliced count matrices produced by the scKB pipeline as the input of copilot.

# Usage

copilot(
  sample.name,
  spliced.mtx = NULL,
  unspliced.mtx = NULL,
  total.mtx = NULL,
  filtered.mtx.output.dir = NULL,
  species.name = "Not Provided",
  transcriptome.name = "Not Provided",
  sample.stats = NULL,
  mt.pattern,
  mt.threshold = 5,
  cp.pattern = NULL,
  top.percent = 1,
  filtering.ratio = 1,
  estimate.doublet.rate = TRUE,
  doublet.rate = NULL,
  remove.doublet = TRUE,
  do.seurat = TRUE,
  do.annotation = FALSE,
  unwanted.genes = NULL,
  HVG = FALSE,
  HVGN = 200,
  dir_to_bulk = NULL,
  dir_to_color_scheme = NULL,
  clustering_alg = 3,
  res = 0.5,
  min.UMI.low.quality = 100,
  min.UMI.high.quality = 300,
  legend.position = c(0.8, 0.8)
)

# Arguments

sample.name:
User-defined sample name (character). If you follow the scKB pipeline to produce the raw count matrices, it should match the name of the directory that contains the spliced and unspliced matrices.

spliced.mtx:
Gene by cell matrix of spliced counts, which should have column and row names. Default is NULL.

unspliced.mtx:
Gene by cell matrix of unspliced counts, which should have column and row names. Default is NULL.

total.mtx:
Gene by cell matrix of total counts, which should have column and row names. Default is NULL.
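To make the expected shape concrete, here is a minimal sketch of gene-by-cell matrices with row names (genes) and column names (cell barcodes). The gene IDs, barcodes, and counts are made up, and treating the total matrix as the element-wise sum of spliced and unspliced counts is an assumption for illustration. Whether `copilot` expects a base or a sparse matrix is not stated here, so the sketch uses base R only.

```r
set.seed(1)

# Made-up gene IDs (rows) and cell barcodes (columns)
genes    <- c("AT1G01010", "AT1G01020", "ATMG00010", "ATCG00020")
barcodes <- c("AAACCTGA", "AAACGGGT", "AAAGATGC")

# Helper that simulates a small gene-by-cell count matrix
make_counts <- function() {
  matrix(rpois(length(genes) * length(barcodes), lambda = 2),
         nrow = length(genes),
         dimnames = list(genes, barcodes))
}

spliced   <- make_counts()
unspliced <- make_counts()

# Both matrices must describe the same genes and cells in the same order
stopifnot(identical(dimnames(spliced), dimnames(unspliced)))

# Assumption: total counts are the sum of spliced and unspliced counts
total <- spliced + unspliced
print(total)
```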

filtered.mtx.output.dir:
Output directory for quality filtered matrices. Default is NULL.

species.name:
Species name (character). Default is "Not Provided".

transcriptome.name:
Name of the transcriptome annotation (character; e.g. TAIR10 for Arabidopsis). Default is "Not Provided".
                                        
sample.stats:
Meta data of the sample in data.frame format. Default is NULL.
                                        
mt.pattern:
Pattern of mitochondrial gene names/ids (character; e.g. "ATMG") or list of mitochondrial genes (character vector). This argument is required to run copilot.
                                        
mt.threshold:
Threshold of mitochondrial expression percentage (numeric). A cell is treated as a dying cell if its mitochondrial expression percentage is higher than this threshold. Default is 5.
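The mitochondrial percentage described above can be sketched on a toy matrix as follows. This is an illustration under stated assumptions, not COPILOT's internal code: `resolve_genes` is a hypothetical helper that accepts either a single pattern or a vector of gene IDs, and the percentage is taken as mitochondrial UMIs over all UMIs per cell.

```r
# Toy gene-by-cell matrix. matrix() fills column by column, so each row of
# numbers below is one cell (cell1, cell2, cell3) across the four genes.
counts <- matrix(
  c(50, 30,  2, 1,
    40, 45, 10, 0,
    60, 20,  0, 3),
  nrow = 4,
  dimnames = list(c("AT1G01010", "AT1G01020", "ATMG00010", "ATCG00020"),
                  c("cell1", "cell2", "cell3"))
)

# Hypothetical helper: a single string is treated as a pattern,
# a longer vector as an explicit list of gene IDs
resolve_genes <- function(x, all.genes) {
  if (length(x) == 1) grep(x, all.genes, value = TRUE) else intersect(x, all.genes)
}

mt.genes <- resolve_genes("ATMG", rownames(counts))

# Percentage of each cell's UMIs that come from mitochondrial genes
percent.mt <- 100 * colSums(counts[mt.genes, , drop = FALSE]) / colSums(counts)

# Flag cells above the threshold (5, the documented default)
mt.threshold <- 5
dying <- percent.mt > mt.threshold
print(round(percent.mt, 2))
print(names(which(dying)))
```

In this toy matrix only `cell2` exceeds the 5% threshold, since 10 of its 95 UMIs come from the single mitochondrial gene.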
                                        
cp.pattern:
Pattern of chloroplast gene names/ids (character; e.g. "ATCG") or list of chloroplast genes (character vector). Default is NULL.
                                        
top.percent:
Percentage of cells with the highest numbers of UMIs to be filtered out (numeric). Default is 1.
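One way to picture this filter is as a quantile cutoff on per-cell UMI totals. The sketch below uses simulated counts and `quantile()` to drop the top 1% of cells. How COPILOT computes the cutoff internally is not documented here, so treat this as an assumption-laden illustration.

```r
set.seed(42)

# Simulated per-cell UMI totals for 200 cells (made-up data)
umi <- setNames(round(rlnorm(200, meanlog = 6, sdlog = 0.8)),
                paste0("cell", 1:200))

# Drop cells whose UMI total lies above the (100 - top.percent)th percentile
top.percent <- 1
cutoff <- unname(quantile(umi, probs = 1 - top.percent / 100))
keep <- umi <= cutoff

cat("cutoff:", cutoff, "\n")
cat("cells removed:", sum(!keep), "of", length(umi), "\n")
```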
                                        
filtering.ratio:
Metric that controls the stringency of cell filtering (numeric; lenient: 1; strict: 0; moderate: 0 < filtering.ratio < 1). Default is 1.
                                        
estimate.doublet.rate:
Whether or not to estimate doublet rate according to 10X Genomics' estimation (boolean). Default is TRUE.

doublet.rate:
User specified doublet rate (numeric). Default is NULL.

remove.doublet:
Whether or not to remove doublets after quality filtering of genes and cells (boolean). Default is TRUE.

do.seurat:
Whether or not to perform normalization, PCA, UMAP and clustering using Seurat and output a Seurat object (boolean). Default is TRUE.
                                        
do.annotation:
Whether or not to do annotation (boolean). COPILOT only supports annotation of root samples from Arabidopsis thaliana and Oryza sativa. Default is FALSE.
                                        
unwanted.genes:
Gene IDs/names of unwanted genes (character vector; e.g. cell cycle related genes, organelle genes, etc.). Default is NULL.
                                        
HVG:
Whether or not to select highly variable genes (boolean). Default is FALSE.
                                        
HVGN:
Number of highly variable genes selected (numeric). Default is 200.
                                        
dir_to_bulk:
Path to the reference expression profile file used for annotation (character). Default is NULL.

dir_to_color_scheme:
Path to the color scheme file used for annotation (character). Default is NULL.
                                        
clustering_alg:
Algorithm for clustering (numeric; 1 = original Louvain algorithm; 2 = Louvain algorithm with multilevel refinement; 3 = SLM algorithm; 4 = Leiden algorithm, which requires the leidenalg Python package). Default is 3.

res:
Resolution used for clustering (numeric). Default is 0.5.

min.UMI.low.quality:
Minimum number of UMIs for a barcode to be considered a cell (numeric). Default is 100.

min.UMI.high.quality:
Minimum number of UMIs for a cell to be considered a high-quality cell (numeric). Default is 300.
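The effect of the two UMI thresholds can be sketched as a three-way split of barcodes by UMI count. `classify_barcodes` and its labels are hypothetical and cover only the thresholds themselves. This does not reproduce COPILOT's full filtering procedure. The example contrasts the documented defaults (100 and 300) with the values used for the shallow toy data in the tutorial (1 and 3).

```r
# Made-up UMI totals for five barcodes
umi <- c(bc1 = 0, bc2 = 2, bc3 = 3, bc4 = 150, bc5 = 400)

# Hypothetical helper: label each barcode from the two thresholds.
# Intervals are closed on the left, so a barcode with exactly `low` UMIs
# counts as a cell and one with exactly `high` UMIs as high quality.
classify_barcodes <- function(umi, low, high) {
  stopifnot(low < high)
  calls <- cut(umi, breaks = c(-Inf, low, high, Inf), right = FALSE,
               labels = c("background", "low quality", "high quality"))
  setNames(as.character(calls), names(umi))
}

# Documented defaults
default.calls <- classify_barcodes(umi, low = 100, high = 300)

# Values used for the shallow toy data in the tutorial
toy.calls <- classify_barcodes(umi, low = 1, high = 3)

print(default.calls)
print(toy.calls)
```

With the defaults, the barcodes with 2 and 3 UMIs fall below the cell threshold, which is why shallowly sequenced data calls for lower values.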

legend.position:
x and y position of the legend on the UMI histogram plot (numeric vector of length 2). Default is c(0.8, 0.8).