<a href="https://colab.research.google.com/github/Droslj/scATAC-seq-complete-/blob/Google-colab/scATAC_seq_(1)_DA_PyDESeq2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

scATAC-seq analysis, based on the Galaxy scATAC-seq tutorials: scATAC-seq preprocessing (1) and the standard scATAC-seq processing pipeline (2).
AnnData objects were created in Galaxy using a customized Galaxy workflow with SnapATAC2 and then imported into this notebook.
(1) https://usegalaxy.eu/training-material/topics/single-cell/tutorials/scatac-preprocessing-tenx/tutorial.html#mapping-reads-to-a-reference-genome, (2) https://usegalaxy.eu/training-material/topics/single-cell/tutorials/scatac-standard-processing-snapatac2/tutorial.html
Data taken from the following NCBI study:
Metabolic adaptation pilots the differentiation of human hematopoietic cells (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1015713)
Import preprocessed AnnData objects for four samples: SRR26046013 (cells treated with AOA inhibitor), SRR26046015 (cells treated with DON inhibitor), SRR26046017 (cells treated with DG inhibitor), and SRR26046019 (untreated cells).
The following steps were performed in the preprocessing:
(1) Import matrices
(2) Compute fragment size distribution
(3) Compute TSS enrichment
(4) Filter cell counts based on TSSe
(5) Create a cell-by-bin matrix based on 500 bp wide bins across the whole genome
(6) Perform feature selection
(7) Perform doublet removal
(8) Perform dimensionality reduction (spectral)
(9) Perform clustering (neighborhood graph, UMAP, Leiden)
(10) Create a cell-by-gene matrix
(11) Concatenate matrices using an inner join
(12) Remove batch effects
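
Step (5) can be illustrated with a small, library-free sketch. SnapATAC2 performs the tiling internally; the function name `bin_counts`, the toy fragments, and the rule of assigning each fragment to the bin that contains its start coordinate are assumptions made for illustration only and are not SnapATAC2's actual counting strategy.

```python
from collections import Counter, defaultdict

BIN_SIZE = 500  # bin width in bp, as in step (5)


def bin_counts(fragments, bin_size=BIN_SIZE):
    """Count fragments per (chromosome, bin index) for each cell barcode.

    `fragments` is an iterable of (chrom, start, end, barcode) tuples.
    Assumption: a fragment is assigned to the bin containing its start.
    """
    counts = defaultdict(Counter)
    for chrom, start, _end, barcode in fragments:
        counts[barcode][(chrom, start // bin_size)] += 1
    return counts


# Hypothetical toy fragments (not taken from the dataset)
toy_fragments = [
    ("chr1", 120, 310, "AAAC"),
    ("chr1", 480, 700, "AAAC"),
    ("chr1", 510, 690, "AAAC"),
    ("chr2", 1000, 1200, "TTTG"),
]
matrix = bin_counts(toy_fragments)
```

Each barcode maps to a sparse row of bin counts, which is the same shape of information a cell-by-bin matrix stores.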

In [1]:
!pip install -q condacolab

In [2]:
import condacolab

In [3]:
condacolab.install()

✨🍰✨ Everything looks OK!


In [4]:
!conda --version

conda 23.11.0


In [5]:
!which conda

/usr/local/bin/conda


In [6]:
!conda config --add channels conda-forge



In [7]:
!conda config --add channels bioconda



In [8]:
!pip install snapatac2 -q

In [9]:
!pip show snapatac2

Name: snapatac2
Version: 2.8.0
Summary: SnapATAC2: Single-cell epigenomics analysis pipeline
Home-page: https://github.com/
Author: Kai Zhang <kai@kzhang.org>
Author-email: Kai Zhang <zhangkai33@westlake.edu.cn>
License: MIT
Location: /usr/local/lib/python3.10/site-packages
Requires: anndata, igraph, kaleido, macs3, multiprocess, natsort, numpy, pandas, plotly, polars, pooch, pyarrow, pyfaidx, rustworkx, scikit-learn, scipy, tqdm, typeguard
Required-by: 


In [10]:
import snapatac2 as snap

In [11]:
!pip install umap-learn



In [12]:
import umap.umap_ as umap


In [13]:
from umap import UMAP

In [14]:
!pip install scanpy -q

In [15]:
import scanpy as sc

In [16]:
!pip show scanpy

Name: scanpy
Version: 1.10.4
Summary: Single-Cell Analysis in Python.
Home-page: 
Author: Alex Wolf, Philipp Angerer, Fidel Ramirez, Isaac Virshup, Sergei Rybakov, Gokcen Eraslan, Tom White, Malte Luecken, Davide Cittaro, Tobias Callies, Marius Lange, Andrés R. Muñoz-Rojas
Author-email: 
License: 
Location: /usr/local/lib/python3.10/site-packages
Requires: anndata, h5py, joblib, legacy-api-wrap, matplotlib, natsort, networkx, numba, numpy, packaging, pandas, patsy, pynndescent, scikit-learn, scipy, seaborn, session-info, statsmodels, tqdm, umap-learn
Required-by: 


In [17]:
!pip install pydeseq2 -q # Install PyDESeq2

In [18]:
import pydeseq2

In [19]:
!pip show pydeseq2

Name: pydeseq2
Version: 0.4.12
Summary: A python implementation of DESeq2.
Home-page: 
Author: Boris Muzellec, Maria Telenczuk, Vincent Cabelli and Mathieu Andreux
Author-email: boris.muzellec@owkin.com
License: MIT
Location: /usr/local/lib/python3.10/site-packages
Requires: anndata, matplotlib, numpy, pandas, scikit-learn, scipy
Required-by: 


In [20]:
import numpy as np

In [21]:
import anndata as ad

In [22]:
import matplotlib.pyplot as plt

In [23]:
import seaborn as sns

In [24]:
import plotly.subplots as sp
import plotly.graph_objects as go

In [25]:
from scipy import stats

In [26]:
import pandas as pd

# Import AnnData objects from Google Drive (three samples treated with energy metabolism inhibitors and one untreated; only the AOA-treated and untreated samples are loaded below)

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
SRR26046013_DM_AOA_INH = sc.read_h5ad('/content/drive/MyDrive/Colab Notebooks/SRR26046013_Annotated_data_matrix.h5ad')

In [29]:
SRR26046013_DM_AOA_INH

AnnData object with n_obs × n_vars = 13546 × 0
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'

In [30]:
SRR26046019_DM_UT = sc.read_h5ad('/content/drive/MyDrive/Colab Notebooks/SRR26046019_Annotated_data_matrix.h5ad')

In [31]:
SRR26046019_DM_UT

AnnData object with n_obs × n_vars = 10448 × 0
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'

# Perform DESeq2 test (PyDESeq2)

In [36]:
gene_anno = '/content/drive/MyDrive/Colab Notebooks/gencode.v41.basic.annotation.gff3.gz'

# Create cell-by-gene matrix (replace with cell-by-bin if appropriate)
gene_matrix_1 = snap.pp.make_gene_matrix(SRR26046013_DM_AOA_INH, gene_anno=gene_anno)
gene_matrix_2 = snap.pp.make_gene_matrix(SRR26046019_DM_UT, gene_anno=gene_anno)

In [47]:
# Check for non-zero values
if gene_matrix_2.X.nnz > 0:
    print("The matrix contains non-zero values.")
else:
    print("The matrix is empty (all zeros).")

The matrix contains non-zero values.


In [45]:
gene_matrix_1.X

<Compressed Sparse Row sparse matrix of dtype 'uint32'
	with 73519275 stored elements and shape (13546, 60606)>

In [38]:
gene_matrix_2

AnnData object with n_obs × n_vars = 10448 × 60606
    obs: 'n_fragment', 'frac_dup', 'frac_mito'

In [39]:
# Extract count matrices and convert to pandas DataFrames
count_matrix_1 = gene_matrix_1.X.toarray()
count_matrix_2 = gene_matrix_2.X.toarray()
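
Calling `.toarray()` turns a sparse matrix into a dense one, so it is worth estimating the memory cost first. The sketch below uses only the shape, dtype, and stored-element count reported for `gene_matrix_1.X` above; the variable names are illustrative.

```python
n_cells, n_genes = 13546, 60606  # shape reported for gene_matrix_1.X
n_stored = 73_519_275            # stored elements reported for the sparse matrix
bytes_per_value = 4              # uint32

# Fraction of entries that are actually stored in the sparse matrix
density = n_stored / (n_cells * n_genes)

# Memory needed once every entry, including zeros, is materialized
dense_gib = n_cells * n_genes * bytes_per_value / 2**30

print(f"density: {density:.1%}, dense size: {dense_gib:.2f} GiB")
```

If the dense size approaches the available RAM, one option is to subset the features before densifying.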

In [41]:
count_df_1 = pd.DataFrame(count_matrix_1, index=gene_matrix_1.obs_names, columns=gene_matrix_1.var_names)
count_df_2 = pd.DataFrame(count_matrix_2, index=gene_matrix_2.obs_names, columns=gene_matrix_2.var_names)

In [42]:
count_df_1.index = count_df_1.index + '_1'
count_df_2.index = count_df_2.index + '_2'

In [43]:
count_df_1

barcode,DDX11L1,WASH7P,MIR6859-1,MIR1302-2HG,MIR1302-2,FAM138A,OR4G4P,OR4G11P,OR4F5,ENSG00000238009,...,MT-ND4,MT-TH,MT-TS2,MT-TL2,MT-ND5,MT-ND6,MT-TE,MT-CYB,MT-TT,MT-TP
AAAAAAAAAAAAAAAA_1,0,0,0,0,0,0,0,0,0,0,...,4,2,2,2,2,6,4,4,4,2
AAACAACGAACGAGCA_1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,1
AAACAACGAAGAGGCT_1,0,0,0,0,0,0,0,0,0,0,...,9,8,9,9,14,8,8,11,8,2
AAACAACGAAGTCGGA_1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,4,4,1,0,4
AAACAACGACGCACTG_1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,3,3,1,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TTTGGGATGTCCAGCG_1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,1,2,2,0
TTTGGGATGTGTTCCC_1,0,0,0,0,0,0,0,0,0,0,...,88,61,58,58,126,91,74,129,86,18
TTTGGGATGTTAGCTG_1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,1,1,0
TTTGGGATGTTGCGTA_1,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,2,3,3,3,3,0


In [None]:
#Add metadata information
metadata_df_1 = pd.DataFrame(index=count_df_1.index)
metadata_df_1['Treatment'] = ['Treated w/AOA'] * len(gene_matrix_1.obs_names)

metadata_df_2 = pd.DataFrame(index=count_df_2.index)
metadata_df_2['Treatment'] = ['Untreated'] * len(gene_matrix_2.obs_names)

In [None]:
# Combine the two scATAC-seq count matrices
count_df = pd.concat([count_df_1, count_df_2], axis=0)
metadata_df = pd.concat([metadata_df_1, metadata_df_2], axis=0)
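
The barcode suffixes matter because the same barcode sequence can occur in both samples, and the combined count and metadata tables must share one unique index. A minimal sketch with made-up toy counts (the gene names, barcodes, and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy counts: two samples that share one barcode ("AAAC")
sample_1 = pd.DataFrame({"GENE_A": [1, 0], "GENE_B": [2, 3]}, index=["AAAC", "AAAG"])
sample_2 = pd.DataFrame({"GENE_A": [0, 4], "GENE_B": [5, 1]}, index=["AAAC", "TTTG"])

# Suffix barcodes per sample so the combined index stays unique
sample_1.index = sample_1.index + "_1"
sample_2.index = sample_2.index + "_2"

# join="inner" keeps only the genes present in both samples
counts = pd.concat([sample_1, sample_2], axis=0, join="inner")
metadata = pd.DataFrame(
    {"Treatment": ["Treated w/AOA"] * len(sample_1) + ["Untreated"] * len(sample_2)},
    index=counts.index,
)

# Sanity checks worth running before building the DESeq2 dataset
assert counts.index.is_unique
assert counts.index.equals(metadata.index)
```

Without the suffixes, the barcode `AAAC` would appear twice in the index and the first check would fail.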

In [None]:
count_df

In [None]:
metadata_df

In [None]:
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

In [None]:
dds = DeseqDataSet(
    counts=count_df,
    metadata=metadata_df,
    design_factors="Treatment",
    refit_cooks=True,
)

In [None]:
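# Perform differential accessibility analysis
dds.deseq2()  # fit size factors, dispersions, and log fold changes
# DeseqDataSet has no get_results(); the Wald test results come from DeseqStats
stat_res = DeseqStats(dds, contrast=["Treatment", "Treated w/AOA", "Untreated"])
stat_res.summary()
results_df = stat_res.results_df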
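
PyDESeq2 exposes test results as a pandas DataFrame (the `results_df` attribute of a `DeseqStats` object after `summary()` has run), with columns that include `log2FoldChange` and `padj`. The sketch below shows one way to filter such a table. The values are made up, and both cutoffs are assumptions rather than settings used in this analysis.

```python
import pandas as pd

# Hypothetical results table with PyDESeq2-style column names (values are made up)
results_df = pd.DataFrame(
    {
        "baseMean": [120.0, 15.0, 300.0, 8.0],
        "log2FoldChange": [1.8, -0.2, -2.1, 0.9],
        "padj": [0.001, 0.60, 0.0004, 0.20],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

PADJ_CUTOFF = 0.05  # assumed significance threshold
LFC_CUTOFF = 1.0    # assumed minimum absolute log2 fold change

# Keep features that pass both cutoffs, most significant first
significant = results_df[
    (results_df["padj"] < PADJ_CUTOFF)
    & (results_df["log2FoldChange"].abs() > LFC_CUTOFF)
].sort_values("padj")

# The sign is relative to the contrast: positive means higher in the tested group
more_accessible = significant.index[significant["log2FoldChange"] > 0].tolist()
less_accessible = significant.index[significant["log2FoldChange"] < 0].tolist()
```

Which group counts as "tested" depends on the contrast passed to `DeseqStats`, so the direction should be checked against that contrast before interpreting the lists.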

In [None]:
# Assuming adata is your AnnData object containing the combined count data,
# with a 'Treatment' column in adata.obs:

# Filter mitochondrial genes before highly variable gene selection
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata_filtered = adata[:, ~adata.var['mt']].copy()  # copy so the subset is not a view

# Calculate highly variable features (select the top 5000).
# Note: flavor='seurat' expects log-normalized data; flavor='seurat_v3' expects raw counts.
sc.pp.highly_variable_genes(adata_filtered, flavor='seurat', batch_key="Treatment", n_top_genes=5000)

# Subset the data to include only the selected features
adata_subset = adata_filtered[:, adata_filtered.var['highly_variable']]

# Split by treatment, then extract the count matrices as pandas DataFrames
adata_aoa = adata_subset[adata_subset.obs['Treatment'] == 'Treated w/AOA', :]
adata_ut = adata_subset[adata_subset.obs['Treatment'] == 'Untreated', :]

count_df_1 = pd.DataFrame(adata_aoa.X.toarray(), index=adata_aoa.obs_names, columns=adata_subset.var_names)
count_df_2 = pd.DataFrame(adata_ut.X.toarray(), index=adata_ut.obs_names, columns=adata_subset.var_names)

# Continue with the rest of your code... (creating count_df, metadata_df, and running DESeq2)