# Cell Ranger 3.1.0 pipeline with Feature Barcoding (5k PBMC dataset)

In this notebook we walk through the Cell Ranger pipeline, starting with the input files provided for the [5k PBMC dataset](https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.1.0/5k_pbmc_protein_v3) and recreating the output files.

In [1]:
import os
import pandas as pd

## 1. Download input files

In [1]:
# FASTQ Files
!curl -O http://cf.10xgenomics.com/samples/cell-exp/3.1.0/5k_pbmc_protein_v3/5k_pbmc_protein_v3_fastqs.tar

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.9G  100 13.9G    0     0  45.5M      0  0:05:14  0:05:14 --:--:-- 50.3M


In [3]:
# untar
!tar -xvf 5k_pbmc_protein_v3_fastqs.tar

5k_pbmc_protein_v3_fastqs/
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/5k_pbmc_protein_v3_gex_S1_L001_I1_001.fastq.gz
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/5k_pbmc_protein_v3_gex_S1_L001_R1_001.fastq.gz
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/5k_pbmc_protein_v3_gex_S1_L001_R2_001.fastq.gz
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/5k_pbmc_protein_v3_gex_S1_L002_I1_001.fastq.gz
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/5k_pbmc_protein_v3_gex_S1_L002_R1_001.fastq.gz
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/5k_pbmc_protein_v3_gex_S1_L002_R2_001.fastq.gz
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_antibody_fastqs/
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_antibody_fastqs/5k_pbmc_protein_v3_antibody_S2_L001_I1_001.fastq.gz
5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_antibody_fastqs/5k_pbmc_protein_v3_antibody_S2_L001_R1_001.fastq.gz
5k_pbmc_pr

In [4]:
# Feature Reference CSV
!curl -O http://cf.10xgenomics.com/samples/cell-exp/3.1.0/5k_pbmc_protein_v3/5k_pbmc_protein_v3_feature_ref.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2689  100  2689    0     0   8075      0 --:--:-- --:--:-- --:--:--  8075
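
Before passing the Feature Reference CSV to `cellranger count`, it can be worth a quick sanity check. The sketch below is only an illustration: the helper `check_feature_reference` is hypothetical, the column list is an assumption taken from the 10x Feature Reference documentation, and the toy row uses placeholder values rather than content from the downloaded file.

```python
import csv
import io

# Column names assumed from the 10x Feature Reference CSV documentation.
REQUIRED_COLUMNS = ("id", "name", "read", "pattern", "sequence", "feature_type")


def check_feature_reference(handle):
    """Return a list of problems found in a feature reference CSV (hypothetical helper)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    seen_ids = set()
    for row in reader:
        # Assumption: each feature id is expected to be unique.
        if row["id"] in seen_ids:
            problems.append(f"duplicated feature id: {row['id']}")
        seen_ids.add(row["id"])
    return problems


# Toy stand-in for the real file; every value below is a placeholder.
toy_csv = io.StringIO(
    "id,name,read,pattern,sequence,feature_type\n"
    "FEATURE_A,FEATURE_A_example,R2,^(BC),ACGTACGTACGTACG,Antibody Capture\n"
)
toy_problems = check_feature_reference(toy_csv)
print(toy_problems)
```

For the downloaded file, the same helper could be called as `check_feature_reference(open("5k_pbmc_protein_v3_feature_ref.csv", newline=""))`.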


## 2. Run [cellranger count](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis) with feature barcoding

### 2.1 Create <font color='green'> Libraries CSV </font> file

From https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis:

_The Libraries CSV File declares the input FASTQ data for the libraries that make up a Feature Barcoding experiment. This will include one library containing Single Cell Gene Expression reads, and one or more libraries containing Feature Barcoding reads. To use `cellranger count` in Feature Barcoding mode, you must create a Libraries CSV File and pass it with the <font color='grey'> --libraries </font> flag. The following table describes what the content should be in the Libraries CSV File._

| Column name | Description |
| --- | --- |
| fastqs | A fully qualified path to the __directory__ containing the demultiplexed FASTQ files for this sample. Analogous to the <font color='grey'> --fastqs </font> arg to `cellranger count`. This field does not accept comma-delimited paths. If you have multiple sets of fastqs for this library, add an additional row, and use the same `library_type` value. |
| sample | Same as the <font color='grey'> --sample </font> arg to `cellranger count`. Sample name assigned in the bcl2fastq sample sheet. |
| library_type | The FASTQ data will be interpreted using the rows from the feature reference file that have a ‘feature_type’ that matches this `library_type`. This field is case-sensitive, and must match a valid library type as described in the Library / Feature Types section. Must be `Gene Expression` for the gene expression libraries. Must be one of `Custom`, `Antibody Capture`, or `CRISPR Guide Capture` for Feature Barcoding libraries. |
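
The rules in the table can be turned into a small pre-flight check. This is a minimal sketch, not part of Cell Ranger: `check_libraries` is a hypothetical helper, the example rows are made up, and only the constraints stated in the table are checked (required columns, a fully qualified `fastqs` path without commas, and a case-sensitive `library_type`).

```python
import os

REQUIRED_COLUMNS = ("fastqs", "sample", "library_type")
# Library types listed in the table above; matching is case-sensitive.
VALID_LIBRARY_TYPES = {
    "Gene Expression",
    "Custom",
    "Antibody Capture",
    "CRISPR Guide Capture",
}


def check_libraries(rows):
    """Return a list of problems for Libraries CSV rows given as dicts (hypothetical helper)."""
    problems = []
    for index, row in enumerate(rows):
        missing = [column for column in REQUIRED_COLUMNS if column not in row]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
            continue
        if "," in row["fastqs"]:
            problems.append(f"row {index}: fastqs must be a single path, not a comma-delimited list")
        elif not os.path.isabs(row["fastqs"]):
            problems.append(f"row {index}: fastqs is not a fully qualified path")
        if row["library_type"] not in VALID_LIBRARY_TYPES:
            problems.append(f"row {index}: unknown library_type {row['library_type']!r}")
    return problems


# Made-up rows: the second one has a relative path and a wrongly cased library type.
example_rows = [
    {"fastqs": os.path.join(os.getcwd(), "gex_fastqs"),
     "sample": "gex", "library_type": "Gene Expression"},
    {"fastqs": "antibody_fastqs",
     "sample": "antibody", "library_type": "antibody capture"},
]
example_problems = check_libraries(example_rows)
for problem in example_problems:
    print(problem)
```

If the table is held in a pandas DataFrame, its rows can be passed as `df.to_dict("records")`.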

* Set absolute paths

In [2]:
cwd = os.getcwd()
path_gex = os.path.join(cwd, "5k_pbmc_protein_v3_fastqs", "5k_pbmc_protein_v3_gex_fastqs")
path_antibody = os.path.join(cwd, "5k_pbmc_protein_v3_fastqs", "5k_pbmc_protein_v3_antibody_fastqs")

In [None]:
# Sample names are inferred from the FASTQ file names, which follow the format
# <sample_name>_S<n>_L<lane>_<R1|R2|I1>_001.fastq.gz
libraries = pd.DataFrame({"fastqs":[path_gex, path_antibody], 
                          "sample":["5k_pbmc_protein_v3_gex",
                                    "5k_pbmc_protein_v3_antibody"],
                          "library_type":["Gene Expression", "Antibody Capture"]})
libraries.head()

In [4]:
libraries.to_csv("5k_pbmc_protein_v3_libraries.csv", index=False)
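
The sample names used in the Libraries CSV can also be derived from the FASTQ file names instead of being typed by hand. The following is a sketch under the assumption that all files follow the naming pattern seen in the tar listing earlier (`<sample>_S<n>_L<lane>_<read>_001.fastq.gz`); the regular expression and the helpers `sample_name` and `sample_names` are illustrative, not part of Cell Ranger.

```python
import re

# Assumed naming pattern, e.g. 5k_pbmc_protein_v3_gex_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L\d{3}_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def sample_name(fastq_file_name):
    """Return the sample prefix of a FASTQ file name, or None if it does not match."""
    match = FASTQ_NAME.match(fastq_file_name)
    return match.group("sample") if match else None


def sample_names(file_names):
    """Return the sorted, de-duplicated sample names found in a directory listing."""
    found = {sample_name(name) for name in file_names}
    found.discard(None)
    return sorted(found)


# File names taken from the tar listing earlier in this notebook.
listing = [
    "5k_pbmc_protein_v3_gex_S1_L001_I1_001.fastq.gz",
    "5k_pbmc_protein_v3_gex_S1_L001_R1_001.fastq.gz",
    "5k_pbmc_protein_v3_antibody_S2_L001_R1_001.fastq.gz",
]
names = sample_names(listing)
print(names)
```

With the paths defined above, `sample_names(os.listdir(path_gex))` should yield the single gene expression sample name.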

### 2.2 Run [cellranger count](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count#args)

* make sure to enter the correct path for the reference transcriptome
* pass in the <font color='green'> Libraries CSV </font> and <font color='green'> Feature Reference CSV </font> to enable feature barcoding
* here we set <font color='grey'> --expect-cells </font> to 5000 since 5000 PBMCs were sequenced

In [None]:
!cellranger count --id=5k_pbmc_protein_v3 \
                   --transcriptome=../refdata-cellranger-GRCh38-3.0.0 \
                   --libraries=5k_pbmc_protein_v3_libraries.csv \
                   --feature-ref=5k_pbmc_protein_v3_feature_ref.csv \
                   --expect-cells=5000
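
If the run should be launched from Python rather than through the `!` shell escape, the same command can be assembled as an argument list. This is a minimal sketch: `build_count_command` is a hypothetical helper, and it only uses the flags already shown in the cell above.

```python
import shlex


def build_count_command(run_id, transcriptome, libraries_csv, feature_ref_csv, expect_cells):
    """Assemble the `cellranger count` argument list used in this notebook (hypothetical helper)."""
    return [
        "cellranger", "count",
        f"--id={run_id}",
        f"--transcriptome={transcriptome}",
        f"--libraries={libraries_csv}",
        f"--feature-ref={feature_ref_csv}",
        f"--expect-cells={expect_cells}",
    ]


command = build_count_command(
    run_id="5k_pbmc_protein_v3",
    transcriptome="../refdata-cellranger-GRCh38-3.0.0",
    libraries_csv="5k_pbmc_protein_v3_libraries.csv",
    feature_ref_csv="5k_pbmc_protein_v3_feature_ref.csv",
    expect_cells=5000,
)
# Shell-quoted form, handy for logging what is about to be run.
printable = shlex.join(command)
print(printable)
```

The list can then be handed to `subprocess.run(command, check=True)`, which raises an exception if the pipeline exits with a non-zero status.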

The [outputs](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/overview) will now be available in the `5k_pbmc_protein_v3/outs` directory. The resulting summary can be compared to the [summary](http://cf.10xgenomics.com/samples/cell-exp/3.1.0/5k_pbmc_protein_v3/5k_pbmc_protein_v3_web_summary.html) available on the 10x website.

In [None]:
os.listdir("5k_pbmc_protein_v3/outs")

## 3. Convert matrix to csv format

From https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices:

_Cell Ranger represents the feature-barcode matrix using sparse formats (only the nonzero entries are stored) in order to minimize file size. All of our programs, and many other programs for gene expression analysis, support sparse formats._

_However, certain programs (e.g. Excel) only support dense formats (where every row-column entry is explicitly stored, even if it's a zero). You can convert a feature-barcode matrix to dense CSV format using the cellranger mat2csv command. This command takes two arguments - an input matrix generated by Cell Ranger (either an H5 file or a MEX directory), and an output path for the dense CSV._
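
To make the sparse/dense distinction concrete, here is a toy sketch in plain Python. It is not how Cell Ranger stores its matrices or how `mat2csv` is implemented; the `(row, column, value)` triplet layout and the helper `to_dense` are illustrative assumptions.

```python
def to_dense(n_rows, n_cols, entries):
    """Expand (row, column, value) triplets into a dense list-of-lists matrix."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in entries:
        dense[row][col] = value
    return dense


# Toy matrix with 3 features (rows) and 2 barcodes (columns); only 2 entries are nonzero.
sparse_entries = [(0, 1, 3), (2, 0, 7)]
dense = to_dense(3, 2, sparse_entries)

stored_sparse = len(sparse_entries)          # values kept in the sparse form
stored_dense = sum(len(row) for row in dense)  # values kept in the dense form
print(dense)
print(stored_sparse, stored_dense)
```

The dense form always stores rows × columns values, which is why the dense CSV can be much larger than the sparse file when most counts are zero.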

In [None]:
#convert matrix from h5 to csv
!cellranger mat2csv 5k_pbmc_protein_v3/outs/filtered_feature_bc_matrix.h5 filtered_feature_bc_matrix.csv
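
Since the dense CSV can be large, it may help to check its dimensions by streaming through it rather than loading it whole. The sketch below assumes the layout is one header row of barcodes followed by one row per feature, with the feature identifier in the first column; that layout and the helper `dense_csv_shape` are assumptions to verify against the actual file.

```python
import csv


def dense_csv_shape(path):
    """Return (n_features, n_barcodes) for a dense matrix CSV without loading it into memory."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Assumption: the first header cell labels the feature column, the rest are barcodes.
        n_barcodes = len(header) - 1
        n_features = sum(1 for _ in reader)
    return n_features, n_barcodes
```

For the file written above, this would be called as `dense_csv_shape("filtered_feature_bc_matrix.csv")`. Once the size looks manageable, `pd.read_csv("filtered_feature_bc_matrix.csv", index_col=0)` loads it as a table indexed by feature.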