<a href="https://colab.research.google.com/github/Droslj/scATAC-seq-complete-/blob/Google-colab/scATAC_seq_(1)_DA_scVI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

scATAC-seq analysis, based on the Galaxy scATAC-seq tutorials: scATAC-seq preprocessing (1) and the standard scATAC-seq processing pipeline with SnapATAC2 (2).
AnnData (AD) objects were created in Galaxy using a customized Galaxy workflow with SnapATAC2 and then imported into this notebook.
(1) https://usegalaxy.eu/training-material/topics/single-cell/tutorials/scatac-preprocessing-tenx/tutorial.html#mapping-reads-to-a-reference-genome, (2) https://usegalaxy.eu/training-material/topics/single-cell/tutorials/scatac-standard-processing-snapatac2/tutorial.html
Data taken from the following NCBI study:
Metabolic adaptation pilots the differentiation of human hematopoietic cells (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1015713)
Import the preprocessed AnnData objects for four samples: SRR26046013 (cells treated with AOA inhibitor), SRR26046015 (cells treated with DON inhibitor), SRR26046017 (cells treated with DG inhibitor), and SRR26046019 (untreated cells).
The following steps were performed during preprocessing:
(1) Import matrices
(2) Compute fragment size distribution
(3) Compute TSS enrichment
(4) Filter cell counts based on TSSe
(5) Create a cell-by-bin matrix using 500 bp wide bins across the whole genome
(6) Perform feature selection
(7) Perform doublet removal
(8) Perform dimensionality reduction (spectral)
(9) Perform clustering (neighborhood graph, UMAP, Leiden)
(10) Create a cell-by-gene matrix
(11) Concatenate matrices using Inner join
(12) Remove batch effects
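
To make step (5) concrete, here is a minimal pure-Python sketch of how fragments can be assigned to 500 bp genomic bins to build per-cell bin counts. It is an illustration only: the fragment tuples and barcodes are hypothetical toy data, and assigning each fragment by its midpoint is a simplifying assumption, not necessarily the counting strategy SnapATAC2 uses when it builds its tile matrix.

```python
from collections import Counter

BIN_SIZE = 500  # bin width in bp, matching step (5)


def bin_fragments(fragments, bin_size=BIN_SIZE):
    """Count fragments per (barcode, chrom, bin_start).

    Each fragment is assigned to the bin containing its midpoint
    (a simplification chosen for this sketch).
    """
    counts = Counter()
    for chrom, start, end, barcode in fragments:
        midpoint = (start + end) // 2
        bin_start = (midpoint // bin_size) * bin_size
        counts[(barcode, chrom, bin_start)] += 1
    return counts


# Hypothetical fragments: (chrom, start, end, barcode)
toy_fragments = [
    ("chr1", 100, 250, "AAAC"),
    ("chr1", 480, 560, "AAAC"),
    ("chr1", 510, 700, "AAAC"),
    ("chr1", 120, 300, "TTTG"),
]

tile_counts = bin_fragments(toy_fragments)
for key, value in sorted(tile_counts.items()):
    print(key, value)
```

Each key in `tile_counts` corresponds to one nonzero entry of a sparse cell-by-bin matrix, which is why the real matrix is stored in a sparse format.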

In [1]:
!pip install snapatac2 -q

In [2]:
!pip show snapatac2

Name: snapatac2
Version: 2.8.0
Summary: SnapATAC2: Single-cell epigenomics analysis pipeline
Home-page: https://github.com/
Author: Kai Zhang <kai@kzhang.org>
Author-email: Kai Zhang <zhangkai33@westlake.edu.cn>
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: anndata, igraph, kaleido, macs3, multiprocess, natsort, numpy, pandas, plotly, polars, pooch, pyarrow, pyfaidx, rustworkx, scikit-learn, scipy, tqdm, typeguard
Required-by: 


In [3]:
import snapatac2 as snap

In [4]:
!pip install umap-learn



In [5]:
import umap.umap_ as umap


In [6]:
from umap import UMAP

In [7]:
!pip install scanpy -q

In [8]:
import scanpy as sc

In [9]:
!pip show scanpy

Name: scanpy
Version: 1.10.4
Summary: Single-Cell Analysis in Python.
Home-page: https://scanpy.org
Author: Alex Wolf, Philipp Angerer, Fidel Ramirez, Isaac Virshup, Sergei Rybakov, Gokcen Eraslan, Tom White, Malte Luecken, Davide Cittaro, Tobias Callies, Marius Lange, Andrés R. Muñoz-Rojas
Author-email: 
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: anndata, h5py, joblib, legacy-api-wrap, matplotlib, natsort, networkx, numba, numpy, packaging, pandas, patsy, pynndescent, scikit-learn, scipy, seaborn, session-info, statsmodels, tqdm, umap-learn
Required-by: 


In [46]:
!pip install tqdm



In [12]:
!pip show pydeseq2

Name: pydeseq2
Version: 0.4.12
Summary: A python implementation of DESeq2.
Home-page: 
Author: Boris Muzellec, Maria Telenczuk, Vincent Cabelli and Mathieu Andreux
Author-email: boris.muzellec@owkin.com
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: anndata, matplotlib, numpy, pandas, scikit-learn, scipy
Required-by: 


In [13]:
import numpy as np

In [14]:
import anndata as ad

In [15]:
import matplotlib.pyplot as plt

In [16]:
import seaborn as sns

In [17]:
import plotly.subplots as sp
import plotly.graph_objects as go

In [18]:
from scipy import stats

In [19]:
import pandas as pd

# Import data from Google Drive: three samples treated with energy metabolism inhibitors and one untreated

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
SRR26046013_DM_AOA_INH = sc.read_h5ad('/content/drive/MyDrive/Colab Notebooks/SRR26046013_Annotated_data_matrix.h5ad')

In [22]:
SRR26046013_DM_AOA_INH

AnnData object with n_obs × n_vars = 13546 × 0
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'

In [23]:
SRR26046019_DM_UT = sc.read_h5ad('/content/drive/MyDrive/Colab Notebooks/SRR26046019_Annotated_data_matrix.h5ad')

In [24]:
SRR26046019_DM_UT

AnnData object with n_obs × n_vars = 10448 × 0
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'

# Preprocess AD objects

In [56]:
gene_anno = '/content/drive/MyDrive/Colab Notebooks/gencode.v41.basic.annotation.gff3.gz'

# Create cell-by-gene matrix (replace with cell-by-bin if appropriate)
gene_matrix_1 = snap.pp.make_gene_matrix(SRR26046013_DM_AOA_INH, gene_anno=gene_anno)
gene_matrix_2 = snap.pp.make_gene_matrix(SRR26046019_DM_UT, gene_anno=gene_anno)

In [57]:
gene_matrix_1.X

<13546x60606 sparse matrix of type '<class 'numpy.uint32'>'
	with 73519275 stored elements in Compressed Sparse Row format>

In [58]:
#Check for suitable format
print(gene_matrix_1.X.dtype)  # Should output: int32, int64, or uint32

uint32


In [59]:
#Check that var contains gene names
print(gene_matrix_1.var_names)  # Should show gene names/IDs

Index(['DDX11L1', 'WASH7P', 'MIR6859-1', 'MIR1302-2HG', 'MIR1302-2', 'FAM138A',
       'OR4G4P', 'OR4G11P', 'OR4F5', 'ENSG00000238009',
       ...
       'MT-ND4', 'MT-TH', 'MT-TS2', 'MT-TL2', 'MT-ND5', 'MT-ND6', 'MT-TE',
       'MT-CYB', 'MT-TT', 'MT-TP'],
      dtype='object', length=60606)


# Add experiment index to cell barcode

In [60]:
gene_matrix_1.obs.index = gene_matrix_1.obs.index + '_1'

In [61]:
print(gene_matrix_1.obs.head())  # Shows a preview of cell-level info

                    n_fragment  frac_dup  frac_mito
AAAAAAAAAAAAAAAA_1         453  0.002119   0.038217
AAACAACGAACGAGCA_1       23759  0.415605   0.000252
AAACAACGAAGAGGCT_1       15377  0.403037   0.002336
AAACAACGAAGTCGGA_1       19907  0.421316   0.001004
AAACAACGACGCACTG_1         203  0.458115   0.019324


In [62]:
gene_matrix_2.obs.index = gene_matrix_2.obs.index + '_2'

In [63]:
print(gene_matrix_2.obs.head())

                    n_fragment  frac_dup  frac_mito
AAAAAAAAAAAAAAAA_2         361  0.000000   0.008242
AAACAACGATAAGTAG_2       26068  0.285284   0.001800
AAACAACGATAGGTTC_2       12427  0.335221   0.001527
AAACAACGATCTATCT_2       25665  0.346743   0.000973
AAACAACGATGCGTGC_2       17103  0.337616   0.000760
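
The reason for suffixing barcodes can be shown with a small standalone sketch. The same barcode string can occur in more than one sample (for example `AAAAAAAAAAAAAAAA` appears in both previews above), so without a per-sample suffix the combined index would contain duplicates. The helper name `tag_barcodes` is hypothetical and used only for illustration.

```python
def tag_barcodes(barcodes, sample_index):
    """Append a sample index to each barcode so it stays unique after merging."""
    return [f"{barcode}_{sample_index}" for barcode in barcodes]


# Barcodes taken from the two previews above
sample_1 = ["AAAAAAAAAAAAAAAA", "AAACAACGAACGAGCA"]
sample_2 = ["AAAAAAAAAAAAAAAA", "AAACAACGATAAGTAG"]

# Barcodes that would collide if the samples were merged as-is
shared_before = set(sample_1) & set(sample_2)

tagged = tag_barcodes(sample_1, 1) + tag_barcodes(sample_2, 2)
has_duplicates = len(tagged) != len(set(tagged))

print("shared before tagging:", shared_before)
print("duplicates after tagging:", has_duplicates)
```

A quick check of this kind on the real `obs` indices before concatenation can catch collisions early.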


# Combine AD objects

In [64]:
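# Combine AD objects, adding Treatment as the batch key.
# NOTE: gene_matrix_1_subset and gene_matrix_2_subset are not defined in the cells shown above;
# they must be created from gene_matrix_1 and gene_matrix_2 before this cell is run.
adata_combined = gene_matrix_1_subset.concatenate(gene_matrix_2_subset, batch_key="Treatment", index_unique=None)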

In [65]:
adata_combined

AnnData object with n_obs × n_vars = 23994 × 3974
    obs: 'n_fragment', 'frac_dup', 'frac_mito', 'Treatment'
    var: 'mt', 'highly_variable'
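
Step (11) of the preprocessing list describes concatenation with an inner join, which keeps only the features present in every object. The sketch below shows that idea on plain lists of feature names. The function name `inner_join_features` is hypothetical, the toy lists reuse gene names from the `var_names` output above, and this is not how AnnData implements the join internally.

```python
def inner_join_features(*feature_lists):
    """Return features present in every list, in the order of the first list."""
    shared = set(feature_lists[0]).intersection(*feature_lists[1:])
    return [feature for feature in feature_lists[0] if feature in shared]


# Toy feature lists for two objects
features_a = ["DDX11L1", "WASH7P", "OR4F5", "MT-ND4"]
features_b = ["WASH7P", "MT-ND4", "FAM138A"]

joined = inner_join_features(features_a, features_b)
print(joined)
```

Features that occur in only one object are dropped, so the combined object can have fewer variables than either input.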

# Filter mitochondrial genes

In [66]:
#Filter mitochondrial genes
adata_combined.var['mt'] = adata_combined.var_names.str.startswith('MT-')
adata_filtered = adata_combined[:, ~adata_combined.var['mt']]

In [67]:
# Check for mito genes
print(adata_filtered.var[adata_filtered.var['mt']])

# Get the number of mitochondrial genes:
print(adata_filtered.var['mt'].sum())

Empty DataFrame
Columns: [mt, highly_variable]
Index: []
0
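
As a standalone illustration of the prefix-based mask used above, the following sketch applies the same `MT-` test to a small list of names. It assumes human gene symbols where mitochondrial genes carry the `MT-` prefix. `MTOR` is added here purely as an illustrative non-mitochondrial gene to show why the hyphen in the prefix matters.

```python
# Toy gene list; MTOR is included only to show that it is not matched
gene_names = ["DDX11L1", "WASH7P", "MT-ND4", "MT-CYB", "MTOR"]

# Boolean mask: True for names starting with the mitochondrial prefix
is_mito = [name.startswith("MT-") for name in gene_names]

# Keep only the genes where the mask is False
kept = [name for name, mito in zip(gene_names, is_mito) if not mito]

print("mitochondrial genes:", sum(is_mito))
print("kept:", kept)
```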


In [68]:
adata_filtered

View of AnnData object with n_obs × n_vars = 23994 × 3974
    obs: 'n_fragment', 'frac_dup', 'frac_mito', 'Treatment'
    var: 'mt', 'highly_variable'

# Remove batch effects

In [47]:
from tqdm.notebook import tqdm

In [48]:
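import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# DESeq2 analysis to profile. `dds` is assumed to be a pydeseq2 DeseqDataSet
# created in an earlier cell that is not shown here.
dds.deseq2()
# NOTE: confirm that get_results() exists in the installed pydeseq2 version;
# results are usually obtained through a DeseqStats object instead.
results = dds.get_results()

profiler.disable()
# Use a distinct name so the `stats` module imported from scipy is not overwritten
profile_stats = pstats.Stats(profiler).sort_stats('cumulative')
profile_stats.print_stats()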

Fitting size factors...
  self.fit_size_factors()
Fitting dispersions...
... done in 29.28 seconds.

Fitting MAP dispersions...
... done in 24.73 seconds.



KeyboardInterrupt: 
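
Because the run above was interrupted before the profiler was disabled, it can help to wrap the profiled call in `try`/`finally` so the statistics are still collected. The sketch below uses only the standard library. `slow_step` is a hypothetical stand-in for the long-running analysis, and the report is limited to the top 10 entries to keep the output readable.

```python
import cProfile
import io
import pstats


def slow_step(n=200_000):
    """Stand-in for a long-running analysis step."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
try:
    result = slow_step()
finally:
    # Runs even if the cell is interrupted, so the profile is not lost
    profiler.disable()

# Write the report to a buffer and show only the 10 most expensive entries
buffer = io.StringIO()
profile_stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
profile_stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

Naming the result `profile_stats` instead of `stats` avoids overwriting the `stats` module imported from `scipy` earlier in the notebook.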