<a href="https://colab.research.google.com/github/Droslj/scATAC-seq-complete-/blob/Google-colab/scATAC_seq_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

scATAC-seq analysis, based on the Galaxy scATAC-seq tutorials: scATAC-seq preprocessing (1) and the standard scATAC-seq processing pipeline with SnapATAC2 (2).
The AnnData objects were created in Galaxy using a customized Galaxy workflow with SnapATAC2 and then imported here.
(1) https://usegalaxy.eu/training-material/topics/single-cell/tutorials/scatac-preprocessing-tenx/tutorial.html#mapping-reads-to-a-reference-genome, (2) https://usegalaxy.eu/training-material/topics/single-cell/tutorials/scatac-standard-processing-snapatac2/tutorial.html
Data taken from the following NCBI study:
Metabolic adaptation pilots the differentiation of human hematopoietic cells (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1015713)
Import AnnData objects for two samples, SRR26046013 (cells treated with the AOA inhibitor) and SRR26046019 (untreated cells)
Perform the following steps:
(1) Import matrices
(2) Compute fragment size distribution
(3) Compute TSS enrichment
(4) Filter cells by fragment count and TSSe
(5) Create a cell-by-bin matrix based on 500-bp bins across the whole genome
(6) Perform feature selection
(7) Perform doublet removal
(8) Perform dimensionality reduction (spectral)
(9) Perform clustering (k-nearest-neighbor graph, UMAP, Leiden)
(10) Create a cell by gene matrix
(11) Concatenate matrices using Inner join
(12) Remove batch effects

In [1]:
!pip install -q condacolab

In [2]:
import condacolab

In [3]:
condacolab.install()

✨🍰✨ Everything looks OK!


In [4]:
!conda --version

conda 23.11.0


In [5]:
!which conda

/usr/local/bin/conda


In [6]:
!conda config --add channels conda-forge



In [7]:
!conda config --add channels bioconda



In [None]:
!conda install snapatac2 -q

Channels:
 - bioconda
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... 

In [None]:
!pip show snapatac2

In [None]:
import snapatac2 as snap

In [None]:
!pip install umap-learn



In [None]:
import umap.umap_ as umap


In [None]:
from umap import UMAP



In [None]:
!pip install scanpy -q

In [None]:
import scanpy as sc

In [None]:
!pip show scanpy

In [None]:
import numpy as np

In [None]:
import anndata as ad

In [None]:
import matplotlib.pyplot as plt

In [None]:
import plotly.subplots as sp
import plotly.graph_objects as go

Import the AnnData objects from Google Drive: one sample treated with an energy-metabolism inhibitor (AOA) and one untreated

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
SRR26046013_DM_AOA_INH = sc.read_h5ad('/content/drive/MyDrive/Colab Notebooks/SRR26046013_Annotated_data_matrix.h5ad')

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
SRR26046019_DM_UT = sc.read_h5ad('/content/drive/MyDrive/Colab Notebooks/SRR26046019_Annotated_data_matrix.h5ad')

In [None]:
SRR26046019_DM_UT

Compute fragment size distributions

In [None]:
# Create a subplot figure
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Plot for SRR26046013_DM_AOA_INH
frag_size_distr = snap.metrics.frag_size_distr(SRR26046013_DM_AOA_INH, inplace=False)
axes[0].plot(frag_size_distr)
axes[0].set_title('Treated w/AOA')

# Plot for SRR26046019_DM_UT
frag_size_distr4 = snap.metrics.frag_size_distr(SRR26046019_DM_UT, inplace=False)
axes[1].plot(frag_size_distr4)
axes[1].set_title('Untreated')

plt.tight_layout()
plt.show()
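As a rough illustration of what the curves above represent, the sketch below tallies fragment lengths from a few made-up (start, end) intervals. The function name `fragment_size_histogram`, the `max_size` cutoff and the toy coordinates are hypothetical. Pooling oversized fragments at index 0 follows the convention described in the SnapATAC2 documentation, but treat that as an assumption rather than a description of its internals.

```python
def fragment_size_histogram(fragments, max_size=1000):
    """Count fragments per length; index i holds the number of fragments of length i.

    `fragments` is an iterable of (start, end) pairs in half-open coordinates.
    Fragments longer than max_size (or with non-positive length) are pooled at index 0.
    """
    counts = [0] * (max_size + 1)
    for start, end in fragments:
        size = end - start
        counts[size if 0 < size <= max_size else 0] += 1
    return counts


# Toy fragments (hypothetical coordinates): two of length 150, one of 180, one oversized.
toy_fragments = [(100, 250), (400, 550), (1000, 1180), (50, 3000)]
hist = fragment_size_histogram(toy_fragments, max_size=1000)
print(hist[150], hist[180], hist[0])
```

In real scATAC-seq data the resulting curve typically shows a nucleosome-free peak below roughly 150 bp followed by periodic mono- and di-nucleosome peaks.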

Compute and plot TSSe

In [None]:
# Compute TSSe metrics
# Get genome annotation
gene_anno = snap.genome.hg38

In [None]:
snap.metrics.tsse(SRR26046013_DM_AOA_INH, gene_anno)

In [None]:
snap.metrics.tsse(SRR26046019_DM_UT, gene_anno)

In [None]:
# Generate TSSE plots
TSSE1plot = snap.pl.tsse(SRR26046013_DM_AOA_INH, show = False)
TSSE2plot = snap.pl.tsse(SRR26046019_DM_UT,show = False)

# Create a subplot figure
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=('Treated w/AOA', 'Untreated'))

# Add the plots to the subplot figure
fig.add_trace(TSSE1plot.data[0], row=1, col=1)
fig.add_trace(TSSE2plot.data[0], row=1, col=2)

# Update layout and set X-axis to logarithmic scale
fig.update_layout(height=400, width=800, title_text="TSS enrichment")
fig.update_xaxes(type="log", row=1, col=1)
fig.update_xaxes(type="log", row=1, col=2)

fig.show()
fig.write_image("TSSE_plots.png", height=1080, width=1920)
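To make the TSSe score more concrete, here is a minimal sketch of a simplified, ENCODE-style calculation: the aggregate insertion signal at the TSS divided by the mean signal in the outer flanks of the window. The function `tss_enrichment`, the `edge` width and the toy profile are assumptions for illustration. SnapATAC2's own window sizes, smoothing and per-cell handling may differ.

```python
def tss_enrichment(profile, edge=100):
    """Return the center signal divided by the mean flank signal.

    `profile[i]` is the number of Tn5 insertions at offset i, aggregated over
    all TSSs, with the TSS itself at the middle index of the list.
    """
    if len(profile) < 2 * edge + 1:
        raise ValueError("profile is shorter than the two flanks plus the center")
    flanks = profile[:edge] + profile[-edge:]
    background = sum(flanks) / len(flanks)
    if background == 0:
        return float("nan")  # no background signal, score is undefined
    return profile[len(profile) // 2] / background


# Toy profile: flat background of 2 insertions per base, peak of 30 at the TSS.
toy_profile = [2] * 2000 + [30] + [2] * 2000
score = tss_enrichment(toy_profile)
print(score)
```

A higher score means insertions are concentrated at promoters relative to background, which is why low-TSSe barcodes are usually treated as low-quality cells.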

In [None]:
gene_anno


Filter cells by fragment count and TSSe, with thresholds chosen from the plots above

In [None]:
snap.pp.filter_cells(SRR26046013_DM_AOA_INH, min_counts=3000, min_tsse=15, max_counts=70000)

In [None]:
snap.pp.filter_cells(SRR26046019_DM_UT, min_counts=6000, min_tsse=13, max_counts=90000)
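The thresholds above combine a fragment-count window with a minimum TSSe. The sketch below expresses that rule as a plain predicate so candidate cutoffs can be tried on a few values before filtering in place. The function `passes_qc` and the toy cells are hypothetical, and whether SnapATAC2 treats the boundaries as inclusive is an assumption here.

```python
def passes_qc(n_fragment, tsse, min_counts, min_tsse, max_counts):
    """True if a cell lies inside the count window and meets the TSSe minimum."""
    return min_counts <= n_fragment <= max_counts and tsse >= min_tsse


# Toy cells as (n_fragment, tsse) pairs; values are made up.
toy_cells = [(2500, 20.0), (5000, 16.5), (40000, 12.0), (80000, 25.0)]

# Apply the treated-sample thresholds used above.
kept = [c for c in toy_cells
        if passes_qc(*c, min_counts=3000, min_tsse=15, max_counts=70000)]
print(len(kept), "of", len(toy_cells), "toy cells pass")
```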

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
SRR26046019_DM_UT

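# Write the filtered objects to file
SRR26046013_DM_AOA_INH.write_h5ad('SRR26046013_DM_AOA_INH_filt.h5ad')
SRR26046019_DM_UT.write_h5ad('SRR26046019_DM_UT_filt.h5ad')

# Save back to Galaxy
# Note: put() is assumed to be the helper from Galaxy's interactive Jupyter environment; it is not defined in Colab
put('SRR26046013_DM_AOA_INH_filt.h5ad')
put('SRR26046019_DM_UT_filt.h5ad')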

Create cell by bin matrix containing insertion counts across genome-wide 500-bp bins

In [None]:
snap.pp.add_tile_matrix(SRR26046013_DM_AOA_INH)

In [None]:
snap.pp.add_tile_matrix(SRR26046019_DM_UT)
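Conceptually, a tile matrix assigns every insertion to a fixed-width genomic bin and counts insertions per bin. The sketch below shows that binning for a single cell with 500-bp bins. `tile_counts` and the toy positions are hypothetical, and SnapATAC2 builds the full sparse cell-by-bin matrix internally, so this is only an illustration of the idea.

```python
from collections import Counter


def tile_counts(insertions, bin_size=500):
    """Count insertions per (chromosome, bin_start) tile for one cell."""
    counts = Counter()
    for chrom, pos in insertions:
        # Integer division maps a position to the start of its bin.
        counts[(chrom, (pos // bin_size) * bin_size)] += 1
    return counts


# Toy insertion sites (hypothetical): two fall in chr1:0-500, one in chr1:500-1000.
toy_insertions = [("chr1", 10), ("chr1", 499), ("chr1", 500), ("chr2", 1234)]
tiles = tile_counts(toy_insertions)
print(dict(tiles))
```

Stacking one such count vector per cell gives the cell-by-bin matrix used for feature selection and spectral embedding.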

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
SRR26046019_DM_UT

Perform feature selection

In [None]:
snap.pp.select_features(SRR26046013_DM_AOA_INH, n_features = 250000)

In [None]:
snap.pp.select_features(SRR26046019_DM_UT, n_features = 250000)

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
#sc.write(adata=SRR26046013_DM_AOA_INH,filename="SRR26046013_Annotated_data_matrix_P1.h5ad",compression='gzip')

Doublet removal

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
#Apply a scrublet algorithm to identify potential doublets
snap.pp.scrublet(SRR26046013_DM_AOA_INH)

In [None]:
snap.pp.scrublet(SRR26046019_DM_UT)

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
SRR26046019_DM_UT

In [None]:
#Filter doublets
snap.pp.filter_doublets(SRR26046013_DM_AOA_INH)

In [None]:
snap.pp.filter_doublets(SRR26046019_DM_UT)

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
SRR26046019_DM_UT

Dimension reduction

In [None]:
snap.tl.spectral(SRR26046013_DM_AOA_INH)

In [None]:
snap.tl.spectral(SRR26046019_DM_UT)

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
SRR26046019_DM_UT

In [None]:
snap.tl.umap(SRR26046013_DM_AOA_INH)

In [None]:
snap.tl.umap(SRR26046019_DM_UT)

In [None]:
SRR26046013_DM_AOA_INH

In [None]:
SRR26046019_DM_UT

Clustering analysis

In [None]:
#(1)Calculate knn graph, (2) use Leiden community detection, (3) plot UMAP (returns plotly object)

In [None]:
snap.pp.knn(SRR26046013_DM_AOA_INH)
snap.tl.leiden(SRR26046013_DM_AOA_INH)
SRR26046013_DM_AOA_INH_UMAP = snap.pl.umap(SRR26046013_DM_AOA_INH, color = 'leiden', interactive = False, show = False)

In [None]:
snap.pp.knn(SRR26046019_DM_UT)
snap.tl.leiden(SRR26046019_DM_UT)
SRR26046019_DM_UT_UMAP = snap.pl.umap(SRR26046019_DM_UT, color='leiden', interactive=False, show=False)

In [None]:
# Create a subplot figure
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=('Treated w/AOA', 'Untreated'))

# Add all traces from each UMAP plot to the subplot figure
for trace in SRR26046013_DM_AOA_INH_UMAP.data:
    trace.showlegend = False  # Hide legend for individual traces
    fig.add_trace(trace, row=1, col=1)

for trace in SRR26046019_DM_UT_UMAP.data:
    fig.add_trace(trace, row=1, col=2)  # Show legend only for the last subplot

# Update layout with legend title and position
fig.update_layout(height=600, width=1200, title_text="UMAP Clustering", legend_title_text='Clusters', legend=dict(x=1.05, y=1, traceorder='normal', font=dict(family='sans-serif', size=12, color='black'), bordercolor='Black', borderwidth=1))

fig.show()
fig.write_image("Cluster_plots.png", height=1080, width=1920)

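# `!export` only changes a throwaway subshell, so set the variable in the kernel process instead
import os
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"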

#Write structures to file
SRR26046013_DM_AOA_INH.write_h5ad("SRR26046013_PI.h5ad")

SRR26046019_DM_UT.write_h5ad("SRR26046019_PI.h5ad")

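# Save the objects back to Galaxy after doublet removal, dimension reduction and clustering
# Note: put() is assumed to be the helper from Galaxy's interactive Jupyter environment; it is not defined in Colab
put('SRR26046013_PI.h5ad')

put('SRR26046019_PI.h5ad')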

Create a cell-by-gene activity matrix

In [None]:
#sc.write(SRR26046013_DM_AOA_INH,"/SRR26046013_DM_AOA_INH.h5ad")

In [None]:
SRR26046019_DM_UT

In [None]:
SRR26046013_DM_AOA_INH_GM = snap.pp.make_gene_matrix(SRR26046013_DM_AOA_INH, snap.genome.hg38)

In [None]:
SRR26046019_DM_UT_GM = snap.pp.make_gene_matrix(SRR26046019_DM_UT, snap.genome.hg38)

In [None]:
SRR26046013_DM_AOA_INH_GM

In [None]:
SRR26046019_DM_UT_GM

In [None]:
SRR26046013_DM_AOA_INH

Copy other annotations from original AD object

In [None]:
# Transfer `obs` annotations
SRR26046013_DM_AOA_INH_GM.obs = SRR26046013_DM_AOA_INH.obs

In [None]:
# Transfer `uns` data
SRR26046013_DM_AOA_INH_GM.uns = SRR26046013_DM_AOA_INH.uns

In [None]:
# Transfer `obsm` matrices
SRR26046013_DM_AOA_INH_GM.obsm = SRR26046013_DM_AOA_INH.obsm

In [None]:
# Transfer `obsp` matrices
SRR26046013_DM_AOA_INH_GM.obsp = SRR26046013_DM_AOA_INH.obsp

In [None]:
# Transfer `obs` annotations
SRR26046019_DM_UT_GM.obs = SRR26046019_DM_UT.obs

In [None]:
# Transfer `uns` data
SRR26046019_DM_UT_GM.uns = SRR26046019_DM_UT.uns

In [None]:
# Transfer `obsm` matrices
SRR26046019_DM_UT_GM.obsm = SRR26046019_DM_UT.obsm

In [None]:
# Transfer `obsp` matrices
SRR26046019_DM_UT_GM.obsp = SRR26046019_DM_UT.obsp
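The eight assignments above repeat the same pattern for both samples, and they share the annotation containers between the tile-based and gene-based objects rather than duplicating them. A small helper could do the transfer in one call and copy each container. `copy_annotations` is a hypothetical name, and the stand-in objects below only mimic the attribute layout of AnnData so the sketch runs without data. With real AnnData objects the same call is expected to work because `obs`, `uns`, `obsm` and `obsp` all provide `.copy()`, but that is untested here.

```python
from types import SimpleNamespace


def copy_annotations(src, dst, fields=("obs", "uns", "obsm", "obsp")):
    """Copy the named attributes from src to dst, duplicating them when possible."""
    for name in fields:
        value = getattr(src, name)
        setattr(dst, name, value.copy() if hasattr(value, "copy") else value)
    return dst


# Stand-ins for the tile-based and gene-based objects (toy content only).
tile_obj = SimpleNamespace(obs={"leiden": ["0", "1"]}, uns={"note": "toy"}, obsm={}, obsp={})
gene_obj = SimpleNamespace(obs={}, uns={}, obsm={}, obsp={})
copy_annotations(tile_obj, gene_obj)
print(gene_obj.obs)
```

Copying avoids later edits to one object silently changing the other, at the cost of some extra memory.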

In [None]:
SRR26046013_DM_AOA_INH_GM

In [None]:
SRR26046019_DM_UT_GM

Concatenate Data matrices

In [None]:
#Use inner join
adata_concat = ad.concat([SRR26046013_DM_AOA_INH_GM, SRR26046019_DM_UT_GM], label = 'Treatment', keys = ['Treated w/AOA', 'Untreated'], join='inner')
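With `join='inner'`, only the variables (here, genes) present in every input object survive concatenation. The sketch below shows that intersection on plain lists. `inner_join_vars` and the gene lists are hypothetical, and keeping the order of the first input is an assumption about how the result is arranged.

```python
def inner_join_vars(*var_name_lists):
    """Return the names present in every list, in the order of the first list."""
    shared = set(var_name_lists[0]).intersection(*var_name_lists[1:])
    return [name for name in var_name_lists[0] if name in shared]


# Toy gene lists for the two samples (made up for illustration).
treated_genes = ["GATA1", "SPI1", "CD34", "HBB"]
untreated_genes = ["CD34", "GATA1", "MPO"]
shared_genes = inner_join_vars(treated_genes, untreated_genes)
print(shared_genes)
```

An outer join would instead keep the union of genes and fill the missing entries, which can introduce missing values that need handling before testing.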

In [None]:
# Count NaNs in the concatenated matrix (X is sparse, so check its stored values)
print(np.isnan(adata_concat.X.data).sum())

Differential accessibility analysis

In [None]:
import numpy as np

In [None]:
from scipy import stats

In [None]:
# Split the concatenated AnnData object (adata_concat) by treatment
condition1 = adata_concat[adata_concat.obs['Treatment'] == 'Treated w/AOA']
condition2 = adata_concat[adata_concat.obs['Treatment'] == 'Untreated']

In [None]:
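# Perform a t-test for each gene in the cell-by-gene matrix
pvals = []
for gene in adata_concat.var_names:
    # Flatten the (n_cells, 1) column to 1-D so ttest_ind returns a scalar p-value
    treated = condition1[:, gene].X.toarray().ravel()
    untreated = condition2[:, gene].X.toarray().ravel()
    _, pval = stats.ttest_ind(treated, untreated)
    pvals.append(pval)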

In [None]:
# Adjust p-values for multiple testing (e.g., using Benjamini-Hochberg)
from statsmodels.stats.multitest import multipletests
_, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
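For readers unfamiliar with the Benjamini-Hochberg step, the following dependency-free sketch shows the adjustment that `method='fdr_bh'` is understood to perform: each p-value is scaled by the number of tests divided by its rank, and monotonicity is then enforced from the largest p-value downward. The function `benjamini_hochberg` and the toy p-values are illustrative only. The notebook itself should keep using `multipletests`.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted value
    # never exceeds the adjusted value of any larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_pvals = [0.01, 0.04, 0.03, 0.005]
toy_adj = benjamini_hochberg(toy_pvals)
print([round(p, 3) for p in toy_adj])
```

Note that genes with no signal in either group can yield NaN p-values from the t-test. Those would need to be dropped or set to 1 before adjustment.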

In [None]:
# Add p-values to AnnData object
adata_concat.var['pvals'] = pvals
adata_concat.var['pvals_adj'] = pvals_adj

Filter significant features: retain only the features that are significantly differentially accessible after multiple-testing correction. Because the test runs on the cell-by-gene matrix, each feature here is a gene activity score rather than a genomic region.

In [None]:
significant_dars = adata_concat.var[adata_concat.var['pvals_adj'] < 0.05]

Visualize results: plot the differentially accessible features using the plotting functions in scanpy or another visualization library.

In [None]:
sc.pl.heatmap(adata_concat, var_names=significant_dars.index, groupby='Treatment')

Interpret Results: Interpret the biological significance of the differentially accessible regions. This might involve looking at the genes associated with these regions and understanding their roles in the biological conditions you are studying.