In [1]:
import os
import sys
import dill
import pickle
import warnings
import pandas as pd
import pyranges as pr  # used explicitly below as `pr.PyRanges`
from pycisTopic.qc import *
from matplotlib.pyplot import rcParams
from pycisTopic.clust_vis import *
from pycisTopic.lda_models import *
from pycisTopic.diff_features import *
from pycisTopic.cistopic_class import *
from pycisTopic.topic_binarization import *
from pycisTopic.iterative_peak_calling import *
from pycistarget.utils import region_names_to_coordinates
from scenicplus.wrappers.run_pycistarget import run_pycistarget

# suppress FutureWarning messages
warnings.simplefilter(action='ignore', category=FutureWarning)

_stderr = sys.stderr                                                         
null = open(os.devnull,'wb')


### **Data Prep Parameters**

Specify additional information and, importantly, the location of the ATAC fragments file, which is the main input to `pycisTopic`.

In [2]:
save_prefix = 'seaad_mtg' # this takes the format '{StudyName}_{ThreeLetterAcronymForBrainRegion}'

load_cistopic_object = True
load_topic_models = True

exclude_cells = None
include_cells = None

cell_type_column = 'Subclass' # 'Supertype (non-expanded)', 'Subclass'

subclass = {'seaad_mtg': {'excitatory': ['L5 IT', 'L2/3 IT', 'L4 IT', 'L6 IT', 'L6 IT Car3', 'L5/6 NP', 'L6b', 'L6 CT', 'L5 ET'],
                          'inhibitory': ['Pvalb', 'Sst', 'Lamp5 Lhx6', 'Vip', 'Lamp5', 'Sncg', 'Chandelier', 'Sst Chodl', 'Pax6'],
                          'astrocyte': ['Astrocyte'],
                          'microglia': ['Microglia-PVM'],
                          'opc': ['OPC'],
                          'oligodendrocyte': ['Oligodendrocyte'],
                          'endothelial': ['Endothelial'],
                           'vlmc': ['VLMC'],
                         },

            'seaad_pfc': {'excitatory': ['L5 IT', 'L2/3 IT', 'L4 IT', 'L6 IT', 'L6 IT Car3', 'L5/6 NP', 'L6b', 'L6 CT', 'L5 ET'],
                          'inhibitory': ['Pvalb', 'Sst', 'Lamp5 Lhx6', 'Vip', 'Lamp5', 'Sncg', 'Chandelier', 'Sst Chodl', 'Pax6'],
                          'astrocyte': ['Astrocyte'],
                          'microglia': ['Microglia-PVM'],
                          'opc': ['OPC'],
                          'oligodendrocyte': ['Oligodendrocyte'],
                          'endothelial': ['Endothelial'],
                           'vlmc': ['VLMC'],
                         },
            
            'gazestani_pfc': {'excitatory': ['FEZF2', 'RORB', 'THEMIS', 'LINC00507', 'CTGF'],
                              'inhibitory': ['PVALB', 'SST', 'LHX6', 'LAMP5', 'VIP', 'NDNF'],
                              'astrocyte': ['WIF1', 'CHI3L1', 'PTCSC3', 'GRIA1'],
                              'microglia': ['Myeloid', 'CX3CR1', 'GPNMB', 'Prolif', 'IRM', 'Macrophage', 'CRM'],
                              'opc': ['PBX3', 'ANKRD55', 'BRCA2', 'OPC', 'PLP1'],
                              'oligodendrocyte': ['GRIK2', 'BACE2', 'PLXDC2', 'SLC38A1'],
                              'endothelial': ['ABCC9', 'TMEM45B', 'GPR126', 'C7', 'HRK', 'IGFBP6', 'DOCK8', 'G0S2', 'APLN', 'COL8A1'],
                        }
            }

subject_ids_for_study = {'leng_sfg': 'PatientID',
                        'leng_etc': 'PatientID',
                        'seaad_mtg': 'Donor ID', 
                        'seaad_pfc': 'Donor ID', 
                        'gazestani_pfc': 'individualID'}

subject_id = subject_ids_for_study[save_prefix]     # column holding the donor/subject identifier for the selected study (see mapping above)
metadata = f'../data/raw/{save_prefix}/{save_prefix}_metadata.csv' # Metadata location
meta = pd.read_csv(metadata, encoding_errors='ignore')

region_name = save_prefix.split('_')[-1].upper()
save_dir = f'/media/tadeoye/Volume1/SEA-AD/{region_name}/ATACseq/results'
tmp_dir = f'/media/tadeoye/Volume1/SEA-AD/{region_name}/ATACseq/temp_files'

os.makedirs(save_dir, exist_ok=True)  # create the results directory if it does not exist yet

We have previously completed all the essential scATAC-seq preprocessing steps. Specifically, we:

1. Generated a set of **`consensus peaks`**.
2. Performed **`quality control steps`**, retaining only cell barcodes that passed QC metrics in both the scRNA-seq and scATAC-seq assays.
3. Conducted **`topic modeling`**.
4. **`Inferred candidate enhancer regions`** by binarizing the region-topic probabilities and differentially accessible regions (DARs) per cell type.

Next, we will perform **`motif enrichment analysis`** on these candidate enhancer regions using the Python package, [pycistarget](https://pycistarget.readthedocs.io/en/latest/). For this, a precomputed motif-score database is needed. A sample-specific database can be generated by scoring the consensus peaks with motifs, or a general pre-scored database can also be used. 

**`Here, we use a pre-scored database`**


## **Motif enrichment analysis using pycistarget**

After having identified candidate enhancer regions we will use [pycistarget](https://pycistarget.readthedocs.io/en/latest/) to find which motifs are enriched in these regions. 

### **Cistarget databases**

In order to run pycistarget one needs a precomputed database containing motif scores for genomic regions.

For this analysis we will use a **`custom cistarget database`** from **`/scripts/eGRN_custom_cistarget_db.ipynb`**. 

As an alternative, one can use the **`precomputed database`** from **[screen regions](https://screen.encodeproject.org/)**.

In addition to the motif database, we also need a **`motif-to-TF annotation`** database. This is available at [https://resources.aertslab.org/cistarget/](https://resources.aertslab.org/cistarget/).

Load candidate enhancer regions identified in `/scripts/eGRN_enhancers.ipynb`.

In [3]:
region_bin_topics_otsu = pickle.load(open(os.path.join(save_dir, f'candidate_enhancers/{cell_type_column}_region_bin_topics_otsu.pkl'), 'rb'))
region_bin_topics_top3k = pickle.load(open(os.path.join(save_dir, f'candidate_enhancers/{cell_type_column}_region_bin_topics_top3k.pkl'), 'rb'))
markers = pickle.load(open(os.path.join(save_dir, f'candidate_enhancers/{cell_type_column}_markers_dict.pkl'), 'rb'))
markers_dict = {key.replace("/", "_").replace(" ", "_"): markers[key] for key in markers.keys()}

Convert to dictionary of pyranges objects.

In [4]:
region_sets = {}
region_sets['topics_otsu'] = {}
region_sets['topics_top_3'] = {}
region_sets['DARs'] = {}

for topic in region_bin_topics_otsu.keys():
    regions = region_bin_topics_otsu[topic].index[region_bin_topics_otsu[topic].index.str.startswith('chr')] #only keep regions on known chromosomes
    region_sets['topics_otsu'][topic] = pr.PyRanges(region_names_to_coordinates(regions))
for topic in region_bin_topics_top3k.keys():
    regions = region_bin_topics_top3k[topic].index[region_bin_topics_top3k[topic].index.str.startswith('chr')] #only keep regions on known chromosomes
    region_sets['topics_top_3'][topic] = pr.PyRanges(region_names_to_coordinates(regions))
for DAR in markers_dict.keys():
    regions = markers_dict[DAR].index[markers_dict[DAR].index.str.startswith('chr')] #only keep regions on known chromosomes
    region_sets['DARs'][DAR] = pr.PyRanges(region_names_to_coordinates(regions))
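
The conversion above relies on region names of the form `chr:start-end`. As a minimal sketch of that parsing step, the snippet below splits such names into `Chromosome`, `Start` and `End` columns with plain pandas. `parse_region_names` is a hypothetical stand-in written for illustration, not the `pycistarget` implementation, and the toy region names are made up.

```python
import pandas as pd

def parse_region_names(region_names):
    """Split 'chr:start-end' strings into Chromosome/Start/End columns.

    Hypothetical helper for illustration only; the notebook itself uses
    pycistarget.utils.region_names_to_coordinates.
    """
    names = pd.Index(region_names)
    parts = names.str.extract(r'^(?P<Chromosome>[^:]+):(?P<Start>\d+)-(?P<End>\d+)$')
    parts.index = names
    parts = parts.dropna()  # drop names that do not match the expected pattern
    return parts.astype({'Start': 'int64', 'End': 'int64'})

# Toy region names (assumed); the last one sits on an unplaced scaffold.
toy_regions = ['chr1:100-600', 'chr2:5000-5500', 'GL000195.1:10-510']

# Same filter as in the cell above: only keep regions on known chromosomes.
known = [name for name in toy_regions if name.startswith('chr')]
coords = parse_region_names(known)
print(coords)
```

A table with these three columns is the shape that `pr.PyRanges` accepts, which is why the filter on the `chr` prefix is applied before the conversion.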

In [5]:
for key in region_sets.keys():
    print(f'{key}: {region_sets[key].keys()}')

topics_otsu: dict_keys(['Topic1', 'Topic2', 'Topic3', 'Topic4', 'Topic5', 'Topic6', 'Topic7', 'Topic8', 'Topic9', 'Topic10', 'Topic11', 'Topic12', 'Topic13', 'Topic14', 'Topic15', 'Topic16'])
topics_top_3: dict_keys(['Topic1', 'Topic2', 'Topic3', 'Topic4', 'Topic5', 'Topic6', 'Topic7', 'Topic8', 'Topic9', 'Topic10', 'Topic11', 'Topic12', 'Topic13', 'Topic14', 'Topic15', 'Topic16'])
DARs: dict_keys(['Astrocyte', 'Chandelier', 'Endothelial', 'L2_3_IT', 'L4_IT', 'L5_ET', 'L5_IT', 'L5_6_NP', 'L6_CT', 'L6_IT', 'L6_IT_Car3', 'L6b', 'Lamp5', 'Lamp5_Lhx6', 'Microglia-PVM', 'OPC', 'Oligodendrocyte', 'Pax6', 'Pvalb', 'Sncg', 'Sst', 'Sst_Chodl', 'VLMC', 'Vip'])


Define rankings, score and motif annotation database.

The rankings database is used for the cisTarget analysis, and the scores database is used for the DEM analysis. For more information, see the [pycistarget Read the Docs page](https://pycistarget.readthedocs.io/en/latest/).
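
To make the distinction between the two databases concrete, here is a toy sketch of how a motif-by-region score matrix relates to a rankings matrix. The motif names, region names and values are invented, and the layout is an assumption for illustration; it is not read from the actual `.feather` files.

```python
import pandas as pd

# Toy scores: one row per motif, one column per region (values are made up).
scores = pd.DataFrame(
    [[0.9, 0.1, 0.4],
     [0.2, 0.8, 0.3]],
    index=['motif_A', 'motif_B'],
    columns=['chr1:100-600', 'chr1:900-1400', 'chr2:5000-5500'],
)

# Rank regions within each motif: the best-scoring region gets rank 0.
rankings = scores.rank(axis=1, ascending=False, method='first').astype(int) - 1

print(scores)
print(rankings)
```

The rankings keep only the order of regions per motif, while the scores keep the magnitudes, which is what a differential test between foreground and background regions needs.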


In [6]:
db_fpath = f"/media/tadeoye/Volume1/SEA-AD/{region_name}/ATACseq/results/motif_collection"

In [7]:
rankings_db = os.path.join(db_fpath, f'{save_prefix}_1kb_bg_with_mask.regions_vs_motifs.rankings.feather')
scores_db =  os.path.join(db_fpath, f'{save_prefix}_1kb_bg_with_mask.regions_vs_motifs.scores.feather')
motif_annotation = os.path.join(db_fpath, 'v10nr_clust_public/snapshots/motifs-v10-nr.hgnc-m0.00001-o0.0.tbl')
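
Loading the database takes several minutes, so it can be worth confirming that the files exist before starting a run. The snippet below is a small sketch of such a check; `missing_paths` is a hypothetical helper, and the demonstration uses a temporary file rather than the real database paths.

```python
import os
import tempfile

def missing_paths(paths):
    """Return the subset of `paths` that are not existing files (hypothetical helper)."""
    return [path for path in paths if not os.path.isfile(path)]

# Demonstration with one file that exists and one that does not.
with tempfile.TemporaryDirectory() as demo_dir:
    existing = os.path.join(demo_dir, 'demo.rankings.feather')
    with open(existing, 'wb') as handle:
        handle.write(b'')
    absent = os.path.join(demo_dir, 'demo.scores.feather')
    missing = missing_paths([existing, absent])

print([os.path.basename(path) for path in missing])
```

In this notebook the same check could be applied as `missing_paths([rankings_db, scores_db, motif_annotation])`, and the run aborted if the returned list is not empty.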

Next we will run pycistarget using the `run_pycistarget` wrapper function.

This function runs cisTarget-based and DEM-based motif enrichment analysis on the region sets and, when `run_without_promoters = True`, repeats the analysis with promoter regions excluded.


In [8]:
os.makedirs(os.path.join(save_dir, 'motifs'), exist_ok=True)  # create the output directory if needed

In [10]:
tmp_dir = '/media/tadeoye/Volume1/temp_files'  # overrides the tmp_dir defined earlier

run_pycistarget(
            region_sets = region_sets,
            species = 'homo_sapiens',
            save_path = os.path.join(save_dir, 'motifs'),
            ctx_db_path = rankings_db,
            dem_db_path = scores_db,
            path_to_motif_annotations = motif_annotation,
            run_without_promoters = True,
            n_cpu = 40,
            _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
            annotation_version = 'v10nr_clust',
            )

2024-09-13 15:36:50,602 pycisTarget_wrapper INFO     /media/tadeoye/Volume1/SEA-AD/MTG/ATACseq/results/motifs folder already exists.
2024-09-13 15:36:51,609 pycisTarget_wrapper INFO     Loading cisTarget database for topics_otsu
2024-09-13 15:36:51,610 cisTarget    INFO     Reading cisTarget database
2024-09-13 15:41:04,725 pycisTarget_wrapper INFO     Running cisTarget for topics_otsu


2024-09-13 15:41:24,698	INFO worker.py:1724 -- Started a local Ray instance.


(ctx_internal_ray pid=187444) 2024-09-13 15:41:39,795 cisTarget    INFO     Running cisTarget for Topic1 which has 48980 regions
(ctx_internal_ray pid=187452) 2024-09-13 15:41:50,444 cisTarget    INFO     Running cisTarget for Topic2 which has 567029 regions
(ctx_internal_ray pid=187458) 2024-09-13 15:42:04,433 cisTarget    INFO     Running cisTarget for Topic3 which has 41887 regions
(ctx_internal_ray pid=187460) 2024-09-13 15:42:18,791 cisTarget    INFO     Running cisTarget for Topic4 which has 28604 regions
(ctx_internal_ray pid=187444) 2024-09-13 15:43:07,538 cisTarget    INFO     Annotating motifs for Topic1
(ctx_internal_ray pid=187444) 2024-09-13 15:43:20,104 cisTarget    INFO     Getting cistromes for Topic1
(ctx_internal_ray pid=187458) 2024-09-13 15:43:25,237 cisTarget    INFO     Annotating motifs for Topic3
(ctx_internal_ray pid=187458) 2024-09-13 15:43:35,554 cisTarget    INFO     Getting cistromes fo

Below we show the motifs found for **`topic 7 (specific to Microglia-PVM)`** using DEM.

In [12]:
menr = dill.load(open(os.path.join(save_dir, 'motifs/menr.pkl'), 'rb'))

In [20]:
menr['DEM_topics_otsu_All'].DEM_results('Topic7')

| Motif | Contrast | Log2FC | Adjusted_pval | Mean_fg | Mean_bg | Motif_hit_thr | Motif_hits |
|---|---|---|---|---|---|---|---|
| `kznf__ZBTB26_Schmitges2016_ChIP-seq` | Topic7 | 3.535538 | 0.022834 | 0.154457 | 0.01332 | 3.0 | 329 |
| `taipale_cyt_meth__E2F2_GCGCGCGCGYW_eDBD_repr` | Topic7 | 2.867175 | 0.003339 | 0.174543 | 0.023922 | 3.0 | 384 |
| `taipale_tf_pairs__ERF_HES7_NNCACGTGNNNNNCCGGAANN_CAP_repr` | Topic7 | 2.488631 | 0.004032 | 0.172184 | 0.030679 | 3.0 | 286 |
| `transfac_pro__M06111` | Topic7 | 2.444333 | 0.016571 | 0.15603 | 0.028668 | 3.0 | 297 |
| `transfac_pro__M05893` | Topic7 | 2.358857 | 0.00782 | 0.186116 | 0.036282 | 3.0 | 402 |
| `taipale_cyt_meth__ZNF385D_NCGTCGCGACGN_eDBD_meth` | Topic7 | 2.20971 | 0.015077 | 0.097039 | 0.020978 | 3.0 | 104 |
| `kznf__ZNF783_Imbeault2017_OM_MEME` | Topic7 | 2.056323 | 0.0 | 0.355976 | 0.085587 | 3.0 | 916 |
| `metacluster_144.8` | Topic7 | 2.008942 | 5e-06 | 0.25057 | 0.062255 | 3.0 | 356 |
| `metacluster_55.3` | Topic7 | 2.007238 | 0.0 | 0.366136 | 0.091076 | 3.0 | 761 |
| `kznf__ZBTB14_Schmitges2016_RCADE` | Topic7 | 1.994344 | 6e-06 | 0.30661 | 0.076954 | 3.0 | 695 |
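
A results table like this one is often narrowed down by effect size and significance before interpretation. The sketch below applies such a filter with pandas to four rows copied from the output above. `filter_motifs` and its thresholds (`min_log2fc=2.0`, `max_adj_pval=0.01`) are illustrative assumptions, not values used by `pycistarget`.

```python
import pandas as pd

# Four rows copied from the DEM output shown above for Topic7.
dem = pd.DataFrame(
    {
        'Log2FC': [3.535538, 2.867175, 2.056323, 1.994344],
        'Adjusted_pval': [0.022834, 0.003339, 0.0, 6e-06],
        'Motif_hits': [329, 384, 916, 695],
    },
    index=[
        'kznf__ZBTB26_Schmitges2016_ChIP-seq',
        'taipale_cyt_meth__E2F2_GCGCGCGCGYW_eDBD_repr',
        'kznf__ZNF783_Imbeault2017_OM_MEME',
        'kznf__ZBTB14_Schmitges2016_RCADE',
    ],
)

def filter_motifs(df, min_log2fc=2.0, max_adj_pval=0.01):
    """Keep motifs passing both thresholds, sorted by Log2FC (hypothetical helper)."""
    keep = (df['Log2FC'] >= min_log2fc) & (df['Adjusted_pval'] <= max_adj_pval)
    return df[keep].sort_values('Log2FC', ascending=False)

selected = filter_motifs(dem)
print(selected)
```

With these example thresholds the ZBTB26 motif drops out on its adjusted p-value and the ZBTB14 motif on its fold change, which shows how sensitive the selection is to the chosen cut-offs.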


We have now completed all the steps necessary for starting the SCENIC+ analysis 😅. In particular, in this notebook we looked for enriched motifs in candidate enhancer regions.

In the next section, we will combine all these analyses and run **`SCENIC+`**.