# Tutorial Introduction


[SCRuB](https://korem-lab.github.io/SCRuB/) is a tool designed to help researchers address the common issue of contamination in microbial studies. This package provides an easy-to-use framework for applying SCRuB to your projects. All you need to get started is a feature table describing both your samples and negative controls, and a metadata file describing each sample.

In this tutorial we use _SCRuB_ to decontaminate a dataset comparing the plasma samples of cancer and control subjects, published in [Poore et al.](https://www.nature.com/articles/s41586-020-2095-1). The data can be downloaded from the following links:

* **Table** (table.qza) | [download](https://github.com/korem-lab/q2-SCRuB/q2-SCRuB/tutorials/data/table.qza)
* **Sample Metadata** (metadata.tsv) | [download](https://github.com/korem-lab/q2-SCRuB/q2-SCRuB/tutorials/data/plasma_metadata.tsv)


**Note**: This tutorial assumes you have installed [QIIME2](https://qiime2.org/) using one of the procedures in the [install documents](https://docs.qiime2.org/2020.2/install/). It also assumes you have installed [SCRuB](https://korem-lab.github.io/SCRuB/).

First, we will make a tutorial directory named `plasma-data`, then download the data above and move the files into it:

```bash
mkdir plasma-data
```
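
The download step can also be scripted. The following is a minimal sketch using only the Python standard library; `fetch_to_dir` is a hypothetical helper (not part of SCRuB or QIIME2), and the URLs to pass in are the download links listed above. The optional `filename` argument is there because the later cells read `plasma-data/metadata.tsv`, while the linked metadata file has a different name.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_to_dir(url, dest_dir="plasma-data", filename=None):
    """Download `url` into `dest_dir` and return the local path.

    Hypothetical helper for this tutorial, not part of SCRuB or QIIME2.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    # Fall back to the last path component of the URL as the file name.
    name = filename or Path(urlparse(url).path).name
    target = dest / name
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Usage (commented out so nothing is fetched when the cell is run blindly);
# substitute the download links listed above for the two variables:
# fetch_to_dir(table_url)
# fetch_to_dir(metadata_url, filename="metadata.tsv")
```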

Next, we will load the feature table and the metadata into pandas DataFrames.


In [2]:
import os
import warnings
import pandas as pd
import numpy as np
# hide pandas Future/Deprecation Warning(s) for tutorial
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.simplefilter(action='ignore', category=FutureWarning)

# import table
table = pd.read_csv('plasma-data/table.tsv', sep='\t', index_col=0)
# import metadata
metadata = pd.read_csv('plasma-data/metadata.tsv', sep='\t', index_col=0)
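
Before going further it can help to confirm that the table and the metadata describe the same samples. The sketch below uses made-up toy data and a hypothetical helper, `align_samples`; it assumes samples are the rows of both DataFrames, so transpose the table first if yours is oriented the other way.

```python
import pandas as pd

# Toy stand-ins for the tutorial's `table` and `metadata` (values made up).
toy_table = pd.DataFrame(
    {"G1": [10, 0, 3], "G2": [5, 7, 0]},
    index=["s1", "s2", "blank1"],
)
toy_metadata = pd.DataFrame(
    {"is_control": [False, False, True, False]},
    index=["s1", "s2", "blank1", "s3"],
)


def align_samples(table, metadata):
    """Keep only samples present in both inputs, in the table's order."""
    shared = table.index[table.index.isin(metadata.index)]
    n_without_metadata = len(table.index) - len(shared)
    n_without_counts = int((~metadata.index.isin(table.index)).sum())
    if n_without_metadata or n_without_counts:
        print(f"{n_without_metadata} samples lack metadata, "
              f"{n_without_counts} metadata rows lack counts")
    return table.loc[shared], metadata.loc[shared]


aligned_table, aligned_metadata = align_samples(toy_table, toy_metadata)
```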


Next, we will demonstrate how to run SCRuB on this dataset. We first import the `SCRuB` function and then go through the inputs it requires:

In [3]:
from q2_SCRuB import SCRuB

To run SCRuB we only need a single command. The inputs are:

1. `table`
    - The table is of type `FeatureTable[Frequency]` which is a table where the rows are features (e.g. ASVs/microbes), the columns are samples, and the entries are the number of sequences for each sample-feature pair.
2. `metadata`
    - This is a QIIME2 formatted [metadata](https://docs.qiime2.org/2020.2/tutorials/metadata/) (e.g. tsv format) where the rows are samples matched to the (1) table and the columns are different sample data (e.g. time point).  
3. ( _Optional_ ) `control_idx_column`
    - This is the name of the column in the (2) metadata that indicates which samples should be treated as negative controls. If not specified, SCRuB identifies negative controls by searching for a metadata column named 'empo_2' or 'qiita_empo_2' and selecting the entries that contain the keyword 'negative'.
4. ( _Optional_ ) `sample_type_column`
    - This is the name of the column in the (2) metadata that indicates the sample type, which specifies the groupings of negative controls SCRuB should use for decontamination. Default is 'sample_type'
5. ( _Optional_ ) `well_location_column`
    - This is the name of the column in the (2) metadata that indicates the well of each sample, which SCRuB uses to track well-to-well leakage. Default is 'well_id'
6. ( _Optional_ ) `control_order`
    - Specifies the order in which the negative control types from `sample_type_column` are processed. By default, SCRuB uses the order in which the samples appear in the metadata table.
7. `output-dir`
    - The desired location of the output. We will cover each output independently below.
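
The default lookup described for `control_idx_column` can be pictured with a few lines of pandas. This is only a sketch of the behaviour as described in the list above, not SCRuB's actual code: `find_default_controls` is a hypothetical helper, the toy metadata is made up, and the case-insensitive match is an assumption.

```python
import pandas as pd


def find_default_controls(metadata):
    """Return a boolean Series marking rows whose EMPO column mentions 'negative'.

    Illustrative only: mirrors the default described above, not SCRuB's code.
    """
    for column in ("empo_2", "qiita_empo_2"):
        if column in metadata.columns:
            values = metadata[column].astype(str)
            # Assumption: match the keyword regardless of capitalisation.
            return values.str.contains("negative", case=False)
    raise ValueError("no 'empo_2' or 'qiita_empo_2' column found; "
                     "pass control_idx_column explicitly")


# Made-up metadata for demonstration.
toy = pd.DataFrame(
    {"empo_2": ["Animal", "Negative", "Animal"]},
    index=["s1", "blank1", "s2"],
)
is_control = find_default_controls(toy)
```

If neither column exists, the sketch raises an error rather than guessing, which is a reminder to pass `control_idx_column` explicitly in that case.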

In this tutorial our `control_idx_column` is `is_control`, our `sample_type_column` is `sample_type`, and our `well_location_column` is `well_id`. Now we are ready to SCRuB away the contamination:

In [5]:
metadata.sample_type.unique()

array(['plasma', 'control blank DNA extraction',
       'control blank library prep', 'bacteria monoculture'], dtype=object)
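
The list passed as `control_order` can be derived from the metadata rather than typed by hand. The sketch below uses made-up metadata with the same column names as this tutorial (`sample_type`, `is_control`) and assumes `is_control` holds booleans; it lists the control types in order of first appearance.

```python
import pandas as pd

# Made-up metadata with the same column names as the tutorial's.
toy_metadata = pd.DataFrame(
    {
        "sample_type": ["plasma", "control blank DNA extraction",
                        "plasma", "control blank library prep",
                        "control blank DNA extraction"],
        "is_control": [False, True, False, True, True],
    },
    index=["s1", "b1", "s2", "b2", "b3"],
)

# Number of samples of each type, split by control status.
counts = toy_metadata.groupby(["sample_type", "is_control"]).size()

# Control types in order of first appearance, usable as `control_order`.
control_order = (
    toy_metadata.loc[toy_metadata["is_control"], "sample_type"]
    .drop_duplicates()
    .tolist()
)
print(control_order)
# ['control blank DNA extraction', 'control blank library prep']
```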

In [11]:
scrubbed = SCRuB(table,
                 metadata,
                 'is_control',   # metadata column where True denotes the negative controls
                 'sample_type',  # metadata column denoting the sample type
                 'well_id',      # metadata column giving each sample's well location, in 'A11', 'B10' format
                 ['control blank DNA extraction', 'control blank library prep']  # order in which the control types are processed
                 )

Running SCRuB on Qiime2!
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_SCRuB.R --samples_counts_path /var/folders/60/0byq_5yx2jbgs0cn6s5s2y7r0000gn/T/tmp_zf1g6gm/samples.csv --sample_metadata_path /var/folders/60/0byq_5yx2jbgs0cn6s5s2y7r0000gn/T/tmp_zf1g6gm/metadata.csv --control_order control blank DNA extraction,control blank library prep --output_path /var/folders/60/0byq_5yx2jbgs0cn6s5s2y7r0000gn/T/tmp_zf1g6gm/scrubbed.csv

R version 4.1.3 (2022-03-10) 


Loading required package: torch
Loading required package: glmnet
Loading required package: Matrix
Loaded glmnet 4.1-4
Loading required package: tidyverse
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand() masks Matrix::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ tidyr::pack()   masks Matrix::pack()
✖ tidyr::unpack() masks Matrix::unpack()
Loading required package: magrittr

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

    set_names

The following object is masked from ‘package:tidyr’:

    extract

Loading required package: rlang

Attaching package: ‘rlang’

The following object is masked from ‘package:magrittr’:

    set_names

The fo

SCRuB: 0.0.1 
1) Loading datas
2) Decontaminating 
[1] "Incorporating the well metadata to track well-to-well leakage!"
[1] "SCRuBbing away contamination in the control blank DNA extraction controls..."
[1] "SCRuBbing away contamination in the control blank library prep controls..."
3) Write output
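
The well-to-well leakage step reported in the log relies on well labels in the 'A11', 'B10' format. A quick way to check that format before running SCRuB is sketched below. `parse_well` and `are_adjacent` are hypothetical helpers written for this tutorial; they only validate labels and illustrate plate geometry, and say nothing about how SCRuB itself models leakage.

```python
import re

# One row letter followed by a one- or two-digit column number, e.g. 'A11'.
WELL_PATTERN = re.compile(r"^([A-Za-z])(\d{1,2})$")


def parse_well(well_id):
    """Convert a well label such as 'A11' to zero-based (row, column)."""
    match = WELL_PATTERN.match(well_id.strip())
    if match is None:
        raise ValueError(f"unrecognised well label: {well_id!r}")
    row_letter, column_number = match.groups()
    row = ord(row_letter.upper()) - ord("A")
    column = int(column_number) - 1
    if column < 0:
        raise ValueError(f"column numbers start at 1: {well_id!r}")
    return row, column


def are_adjacent(well_a, well_b):
    """True when two wells are direct neighbours, diagonals included."""
    (r1, c1), (r2, c2) = parse_well(well_a), parse_well(well_b)
    return (r1, c1) != (r2, c2) and abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1


print(parse_well("A11"))           # (0, 10)
print(are_adjacent("A11", "B10"))  # True
```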


In [12]:
scrubbed.head()

Unnamed: 0,G000002415,G000002495,G000002525,G000002655,G000002715,G000002725,G000002765,G000002825,G000002845,G000002855,...,G900091435,G900101525,G900113745,G900120125,G900156315,G900156805,G900169525,G900187165,G900248245,G900639865
12667.X2457964,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12667.X3004746,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12691.PC33,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12692.Control.80,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12667.X2062971,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
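
One simple numeric comparison between the raw and decontaminated tables is the share of reads removed from each sample. The sketch below uses made-up counts and a hypothetical helper, `fraction_removed`; it assumes both tables have samples as rows and features as columns, as in the `scrubbed.head()` output above.

```python
import pandas as pd

# Made-up counts; rows are samples and columns are features.
raw = pd.DataFrame(
    {"G1": [10, 4], "G2": [10, 0], "G3": [0, 6]},
    index=["s1", "s2"],
)
cleaned = pd.DataFrame(
    {"G1": [10, 4], "G2": [5, 0], "G3": [0, 1]},
    index=["s1", "s2"],
)


def fraction_removed(raw_counts, cleaned_counts):
    """Per-sample share of reads that decontamination removed."""
    # Restrict to samples present in both tables, in the cleaned table's order.
    shared = cleaned_counts.index[cleaned_counts.index.isin(raw_counts.index)]
    before = raw_counts.loc[shared].sum(axis=1)
    after = cleaned_counts.loc[shared].sum(axis=1)
    return (before - after) / before


removed = fraction_removed(raw, cleaned)
print(removed)
```

Samples with zero reads in the raw table would produce a division by zero here, so they may need to be filtered out first.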


Now we can compare the raw data and SCRuB's output.

In [13]:
import os
import warnings
import qiime2 as q2
# hide pandas Future/Deprecation Warning(s) for tutorial
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.simplefilter(action='ignore', category=FutureWarning)

# import table
table = q2.Artifact.load('plasma-data/table.qza')
# import metadata
metadata = q2.Metadata.load('plasma-data/metadata.tsv')
# import SCRuB output
scrubbed = q2.Artifact.load('results/scrubbed.qza')

In [14]:
from qiime2.plugins.deicode.actions import rpca
from qiime2.plugins.emperor.actions import (plot, biplot)

In [15]:
# run RPCA and plot with emperor
rpca_biplot, rpca_distance = rpca(table)
rpca_biplot_emperor = biplot(rpca_biplot, metadata)
# make directory to store results
output_path = 'results'
os.makedirs(output_path, exist_ok=True)

# now we can save the plots
rpca_biplot_emperor.visualization.save(os.path.join(output_path, 'Raw-RPCA-biplot.qzv'))


'results/Raw-RPCA-biplot.qzv'


Now we can visualize the samples via RPCA:

![image.png](results/Raw-RPCA.png)




For comparison, we can observe the samples decontaminated by SCRuB:

In [16]:
# run RPCA and plot with emperor
# `scrubbed` is already the loaded Artifact, so it is passed to rpca directly
rpca_biplot, rpca_distance = rpca(scrubbed)
rpca_biplot_emperor = biplot(rpca_biplot, metadata)

# save the plots
rpca_biplot_emperor.visualization.save(os.path.join(output_path, 'SCRuBbed-RPCA-biplot.qzv'))

![image.png](results/SCRuBbed-RPCA.png)