This repository contains CellTag workflow, written in Python.
You may clone this repository and run the jupyter notebook file (celltagging_workflow.ipynb
) to go through CellTag workflow.
Paper: The article can be found here.
GEO dataset: Here is the link to the GEO DataSet (GSE99915) which contains all of the sequencing data we generated.
CellTag Libraries: The CellTag libraries are available from AddGene here. This link also contains whitelists for each CellTag library as well as sequence information for each library. The CellTag libraries available at AddGene are labeled CellTag-V1, CellTag-V2, and CellTag-V3. These labels correspond to CellTagMEF, CellTagD3, and CellTagD13 respectively.
From our GEO DataSet we will select one timepoint to use as an example of our CellTag processing and analysis. We will use data collected from Timecourse 1 at Day 15. The link to the SRA for this sample is here. We are using the data from Day 15 because we should be able to call clones using each of the CellTag versions (CellTagMEF, CellTagDay3, and CellTagDay13)
- Download BAM file for samples at day 15
- Extract celltag reads from BAM file
- Parse reads to extract required information I've also added the extracted file in the repository.
# #bash
# wget https://sra-pub-src-1.s3.amazonaws.com/SRR7347033/hf1.d15.possorted_genome_bam.bam.1
# samtools view hf1.d15.possorted_genome_bam.bam | grep -P 'GGT[ACTG]{8}GAATTC' > v1.celltag.reads.out
# samtools view hf1.d15.possorted_genome_bam.bam | grep -P 'GTGATG[ACTG]{8}GAATTC' > v2.celltag.reads.out
# samtools view hf1.d15.possorted_genome_bam.bam | grep -P 'TGTACG[ACTG]{8}GAATTC' > v3.celltag.reads.out
# ./scripts/celltag.parse.reads.10x.sh -v tagregex="CCGGT([ACTG]{8})GAATTC" v1.celltag.reads.out > v1.celltag.parsed.tsv
# ./scripts/celltag.parse.reads.10x.sh -v tagregex="GTGATG([ACTG]{8})GAATTC" v2.celltag.reads.out > v2.celltag.parsed.tsv
# ./scripts/celltag.parse.reads.10x.sh -v tagregex="TGTACG([ACTG]{8})GAATTC" v3.celltag.reads.out > v3.celltag.parsed.tsv
# Rscript ./scripts/matrix.count.celltags.R ./cell.barcodes/hf1.d15.barcodes.tsv v1.celltag.parsed.tsv hf1.d15.v1
# Rscript ./scripts/matrix.count.celltags.R ./cell.barcodes/hf1.d15.barcodes.tsv v2.celltag.parsed.tsv hf1.d15.v2
# Rscript ./scripts/matrix.count.celltags.R ./cell.barcodes/hf1.d15.barcodes.tsv v3.celltag.parsed.tsv hf1.d15.v3
import pyreadr #!pip install pyradr
import numpy as np
import pandas as pd
import celltagging_utils as ct
mef = pyreadr.read_r("./celltag_matrix/hf1.d15.v1.celltag.matrix.Rds")[None]
d3 = pyreadr.read_r("./celltag_matrix/hf1.d15.v2.celltag.matrix.Rds")[None]
d13 = pyreadr.read_r("./celltag_matrix/hf1.d15.v3.celltag.matrix.Rds")[None]
mef.set_index('Cell.BC',inplace=True)
d3.set_index('Cell.BC',inplace=True)
d13.set_index('Cell.BC',inplace=True)
mef.shape, d3.shape, d13.shape
# ((3812, 6319), (3812, 8246), (3812, 4630))
mef.head()
AAAAAAGA | AAAAAAGC | AAAAAATA | AAAAACTC | AAAAACTG | AAAAAGAC | AAAAAGCC | AAAAAGCG | AAAAAGGG | AAAACTAA | ... | TTTTTCGG | TTTTTCTA | TTTTTGAT | TTTTTGCA | TTTTTGTT | TTTTTTAT | TTTTTTCC | TTTTTTCT | TTTTTTGG | TTTTTTTT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cell.BC | |||||||||||||||||||||
AAACCTGAGTATGACA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
AAACCTGCAGCCTATA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
AAACCTGGTAAGTAGT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
AAACCTGTCACAACGT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
AAACCTGTCCGCGCAA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 6319 columns
ct.get_stats(mef)
Cell_UMI_Counts | CellTags_per_Cell | CellTag_UMI_Counts | Cells_per_CellTag | |
---|---|---|---|---|
count | 3812.000000 | 3812.000000 | 6319.000000 | 6319.000000 |
mean | 54.328437 | 5.917629 | 32.774173 | 3.569869 |
std | 96.041066 | 6.887981 | 379.364072 | 16.430824 |
min | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
25% | 2.000000 | 2.000000 | 1.000000 | 1.000000 |
50% | 20.000000 | 4.000000 | 1.000000 | 1.000000 |
75% | 64.000000 | 8.000000 | 4.000000 | 2.000000 |
max | 1549.000000 | 84.000000 | 20226.000000 | 569.000000 |
Convert the CellTag UMI count matrices into binary matrices and any Cell Barcode/CellTag pair with a UMI count less than a cutoff will be disregarded
mef_bin = ct.single_cell_data_binarization(mef, 2)
d3_bin = ct.single_cell_data_binarization(d3, 2)
d13_bin = ct.single_cell_data_binarization(d13, 2)
‘Whitelisting’ is performed to remove PCR and sequencing artifacts that are not corrected in the previous step. This whitelisting consists of filtering out CellTags that are not detected from sequencing of the original complex CellTag library.
mef_filt = ct.single_cell_data_whitelist(mef_bin, "./whitelist/V1.CellTag.Whitelist.csv")
d3_filt = ct.single_cell_data_whitelist(d3_bin, "./whitelist/V2.CellTag.Whitelist.csv")
d13_filt = ct.single_cell_data_whitelist(d13_bin, "./whitelist/V3.CellTag.Whitelist.csv")
print(mef_filt.shape, d3_filt.shape, d13_filt.shape)
# (3812, 3256) (3812, 2537) (3812, 1981)
Cells with >20 CellTags (likely to correspond to cell multiplets) and less than two unique CellTags per cell are filtered out.
mef_filt = ct.metric_based_filtering(mef_filt, 20, "less")
d3_filt = ct.metric_based_filtering(d3_filt, 20, "less")
d13_filt = ct.metric_based_filtering(d13_filt, 20, "less")
mef_filt = ct.metric_based_filtering(mef_filt, 2, "greater")
d3_filt = ct.metric_based_filtering(d3_filt, 2, "greater")
d13_filt = ct.metric_based_filtering(d13_filt, 2, "greater")
print(mef_filt.shape, d3_filt.shape, d13_filt.shape)
# (1852, 3256) (1733, 2537) (717, 1981)
Clone calling is performed where Jaccard coefficient scores were calculated to assess the similarity of CellTag expression signatures in all cells in a pairwise manner, thereby identifying clonally related cells.
mef_sim = ct.jaccard_analysis(mef_filt, id='mef')
d3_sim = ct.jaccard_analysis(d3_filt, id='d3')
d13_sim = ct.jaccard_analysis(d13_filt, id='d13')
Cells with a given threshold for similarity (here 0.7) will be get the same clone index. Clones with one cell will be disregarded.
mef_clones, mef_clone_size = ct.clone_calling(mef_sim, "./hf1.d15.v1.clones.csv", 0.7)
d3_clones, d3_clone_size = ct.clone_calling(d3_sim, "./hf1.d15.v2.clones.csv", 0.7)
d13_clones, d13_clone_size = ct.clone_calling(d13_sim, "./hf1.d15.v13.clones.csv", 0.7)
print(mef_clones.head())
print(mef_clone_size.head())
# clone_id cell_barcode
# 0 1 CTAGCCTAGTGTCCAT
# 1 1 AGCTTGAAGTACACCT
# 2 1 CTGAAGTAGAGTTGGC
# 3 1 CAGATCACAACTTGAC
# 4 1 ATCCACCTCGAATGCT
#
# Clone_ID Frequency
# 0 2 228
# 1 4 202
# 2 16 65
# 3 13 56
# 4 8 53
mef_clones.rename(columns={mef_clones.columns[0]: "CellTagV1"}, inplace=True)
d3_clones.rename(columns={d3_clones.columns[0]: "CellTagV2"}, inplace=True)
d13_clones.rename(columns={d13_clones.columns[0]: "CellTagV3"}, inplace=True)
clone_cells = pd.concat([mef_clones.cell_barcode, d3_clones.cell_barcode, d13_clones.cell_barcode]).unique()
celltag_data = pd.DataFrame(index=clone_cells, columns=["CellTagV1", "CellTagV2", "CellTagV3"])
celltag_data.loc[mef_clones['cell_barcode'], "CellTagV1"] = mef_clones["CellTagV1"].values
celltag_data.loc[d3_clones['cell_barcode'], "CellTagV2"] = d3_clones["CellTagV2"].values
celltag_data.loc[d13_clones['cell_barcode'], "CellTagV3"] = d13_clones["CellTagV3"].values
celltag_data.index = celltag_data.index.map(lambda x: x + "-1")
celltag_data
CellTagV1 | CellTagV2 | CellTagV3 | |
---|---|---|---|
CTAGCCTAGTGTCCAT-1 | 1 | NaN | NaN |
AGCTTGAAGTACACCT-1 | 1 | NaN | NaN |
CTGAAGTAGAGTTGGC-1 | 1 | NaN | NaN |
CAGATCACAACTTGAC-1 | 1 | NaN | NaN |
ATCCACCTCGAATGCT-1 | 1 | NaN | NaN |
... | ... | ... | ... |
GGAGCAAGTTACCGAT-1 | NaN | NaN | 45 |
TCTCATAGTCCGTTAA-1 | NaN | NaN | 48 |
TGACAACCAAAGGAAG-1 | NaN | NaN | 48 |
TGCGCAGAGATATGCA-1 | NaN | NaN | 50 |
TTAGGACGTAAGTTCC-1 | NaN | NaN | 50 |
1957 rows × 3 columns
- CellTagV1 Clones: 166 (1402 Cells)
- CellTagV2 Clones: 197 (1174 Cells)
- CellTagV3 Clones: 50 (100 Cells)
- Cells in 1 Clone: 1268
- Cells in 2 Clones: 659 (1318 Nodes)
- Cells in 3 Clones: 30 (90 Nodes)
- Unique Cells: 1957
- Nodes Count: 3089 -> 413 Clones + 2676 Cells -> (166+197+50) + (1268+1318+90)
(1268 + 1318 + 90) = (1402 + 1174 + 100) = 2676
link_list = ct.convert_cell_tag_matrix_to_link_list(celltag_data)
all_nodes = ct.get_nodes_from_link_list(link_list)
# Preprocessing data..
# Cells that have CellTagV1: 1402
# Cells that have CellTagV2: 1174
# Cells that have CellTagV3: 100
# find connection between [celltag -> cells]...
# find hidden links [CellTagV2 -> CellTagV3], or [CellTagV1 -> CellTagV3]...
# find hidden links [CellTagV1 -> CellTagV2]...
# finished
ref_nodes = ["CellTagV1_95"]
sub_links, sub_nodes = ct.get_subnet(ref_nodes, link_list, all_nodes)
print("Links:", sub_links.shape[0], "\nNodes:",sub_nodes.shape[0])
# Links: 25
# Nodes: 15
G = nx.Graph()
G.add_nodes_from(sub_nodes['nodes'])
edges = [(row['source'], row['target']) for index, row in sub_links.iterrows()]
G.add_edges_from(edges)
# Define the layout of the graph
layout = go.Layout(
title="Network",
showlegend=False,
hovermode='closest',
width=800,
height=600
)
plotly_graph = nx.spring_layout(G)
# Create node positions
node_x = []
node_y = []
for node in plotly_graph:
x, y = plotly_graph[node]
node_x.append(x)
node_y.append(y)
color_mapping = {
'cell': 'red',
'clone': 'blue',
}
node_colors = [color_mapping.get(category, 'gray') for category in sub_nodes['type']]
node_trace = go.Scatter(
x=node_x,
y=node_y,
mode='markers+text', # Add 'text' mode to display labels
#"hoverinfo='text',
text=sub_nodes['nodes'], # Set the labels to node names
textposition='bottom center', # Adjust the label position
marker=dict(
showscale=False,
color=node_colors,
size=10
)
)
edge_x = []
edge_y = []
for edge in G.edges():
x0, y0 = plotly_graph[edge[0]]
x1, y1 = plotly_graph[edge[1]]
edge_x.append(x0)
edge_x.append(x1)
edge_x.append(None)
edge_y.append(y0)
edge_y.append(y1)
edge_y.append(None)
edge_trace = go.Scatter(
x=edge_x,
y=edge_y,
line=dict(width=0.5, color='#888'),
hoverinfo='none',
mode='lines'
)
fig = go.Figure(data=[edge_trace, node_trace], layout=layout)
iplot(fig)
celltag_data[celltag_data['CellTagV1']==95]
CellTagV1 | CellTagV2 | CellTagV3 | |
---|---|---|---|
CCGTTCAGTTTGTTTC-1 | 95 | 147 | 26 |
CTCCTAGTCATATCGG-1 | 95 | e | e |
CGGACACCACCGAATT-1 | 95 | 109 | e |
CTACACCGTAGGGACT-1 | 95 | 109 | e |
CGTAGGCAGCGATAGC-1 | 95 | 147 | e |
GAAACTCTCAATACCG-1 | 95 | 109 | e |
GCATGATGTACCCAAT-1 | 95 | e | e |
TTGCCGTAGTACACCT-1 | 95 | 109 | e |
ATCGAGTTCCCTAATT-1 | 95 | 109 | e |
CGACCTTTCTACCTGC-1 | 95 | 147 | 26 |
TTGAACGCATCAGTAC-1 | 95 | 147 | e |
TAGACCACAATGACCT-1 | 95 | 147 | e |