# ASAP CRN Prototype workflow 

Skeleton for developing a unified workflow for _all_ ASAP CRN single-cell (or single-nucleus) RNA sequencing data.
Based on:
* PD_MFG_snRNAseq_with_redos.Rmd from Team Lee ***sequence alignment AND processing***
* Harmony-RNA-Workflow [snakefile repo](https://github.com/shahrozeabbas/Harmony-RNA-Workflow) ***processing only***
  
28 July 2023
Andy Henrie


Note that the 


## TOOLS / PACKAGES


## STEPS

### housekeeping
- define names
- list of fastq
- collect meta-data
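
The "list of fastq" step could be sketched as below. This is only an illustration: it assumes the FASTQ files follow the bcl2fastq-style naming `<sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz`, and the directory layout and sample names in the toy tree are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming convention: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_S\d+_L\d{3}_[RI][12]_001\.fastq\.gz$")


def list_fastq_samples(fastq_root):
    """Group FASTQ files under `fastq_root` by the sample name in the file name."""
    samples = {}
    for path in sorted(Path(fastq_root).rglob("*.fastq.gz")):
        match = FASTQ_PATTERN.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        samples.setdefault(match.group("sample"), []).append(path)
    return samples


# Toy directory tree standing in for the real FASTQ location (hypothetical names).
toy_files = [
    "HIP/HIP_HC_0001_S1_L001_R1_001.fastq.gz",
    "HIP/HIP_HC_0001_S1_L001_R2_001.fastq.gz",
    "SN/SN_PD_0002_S3_L002_R1_001.fastq.gz",
    "SN/notes.txt",
]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in toy_files:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    fastq_samples = list_fastq_samples(root)

sample_names = sorted(fastq_samples)
print(sample_names)
```

The resulting sample names can then be compared against the metadata tables to catch samples that have FASTQs but no metadata, or the reverse.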


### alignment 
- cellranger

### preprocessing + basic QC
- convert to data object (e.g. Scanpy, Seurat)
- add metadata
    - from metadata
    - calculated
        - doublets
        - mt
        - rb
    - QC
        - mt
        - doublets
        - counts
        - features
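
The calculated QC metrics listed above (counts, features, mt, rb) can be illustrated on a toy matrix. This is a minimal sketch with plain NumPy, not the workflow's implementation: the gene-name prefixes (`MT-`, `RPS`, `RPL`) assume human gene symbols, the thresholds are hypothetical, and doublet detection is not covered.

```python
import numpy as np

# Toy count matrix: 3 cells x 5 genes (rows are cells); values invented for illustration.
gene_names = np.array(["MT-CO1", "MT-ND1", "RPS6", "RPL13", "SNCA"])
counts = np.array([
    [5, 5, 10, 10, 70],
    [40, 10, 0, 0, 50],
    [0, 0, 0, 0, 0],
])

# Flag mitochondrial and ribosomal genes by symbol prefix (assumes human gene symbols).
is_mt = np.char.startswith(gene_names, "MT-")
is_rb = np.char.startswith(gene_names, "RPS") | np.char.startswith(gene_names, "RPL")

total_counts = counts.sum(axis=1)       # "counts": total UMIs per cell
n_features = (counts > 0).sum(axis=1)   # "features": genes detected per cell


def percent_of_total(mask):
    """Percentage of each cell's counts that fall in the masked genes."""
    subset = counts[:, mask].sum(axis=1)
    # Cells with zero total counts get 0 instead of a division-by-zero result.
    return np.divide(100.0 * subset, total_counts,
                     out=np.zeros(len(total_counts)), where=total_counts > 0)


pct_mt = percent_of_total(is_mt)
pct_rb = percent_of_total(is_rb)

# Hypothetical thresholds, for illustration only.
MIN_FEATURES = 2
MAX_PCT_MT = 20.0
keep = (n_features >= MIN_FEATURES) & (pct_mt <= MAX_PCT_MT)
print(pct_mt, pct_rb, keep)
```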

### processing: cross-team harmonization, QC, and typing
- QC
- Aggregate
    - normalize
    - identify highly variable genes
    - scale
    - assign
        - cell cycle scores
    - harmonize (LATER)
        - concatenate
        - RunHarmony
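
To make the normalize, highly-variable-gene, and scale steps concrete, here is a sketch on simulated counts using plain NumPy. It is a simplification under stated assumptions: the target sum, the number of genes kept, and the "rank by variance" selection are illustrative choices, not the workflow's settings. In practice the corresponding Scanpy calls (`sc.pp.normalize_total`, `sc.pp.log1p`, `sc.pp.highly_variable_genes`, `sc.pp.scale`) or their Seurat equivalents would likely be used. Cell cycle scoring and Harmony are not covered.

```python
import numpy as np

# Simulated counts: 200 cells x 50 genes, each gene with its own Poisson rate.
rng = np.random.default_rng(0)
gene_rates = rng.uniform(0.5, 5.0, size=50)
counts = rng.poisson(lam=gene_rates, size=(200, 50)).astype(float)

TARGET_SUM = 1e4   # illustrative choice
N_TOP_GENES = 10   # illustrative choice

# normalize: scale every cell to the same total count, then log-transform
cell_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)  # avoid dividing by zero
normalized = np.log1p(counts / cell_totals * TARGET_SUM)

# identify highly variable genes: here simply the genes with the largest variance
gene_variance = normalized.var(axis=0)
hvg_index = np.sort(np.argsort(gene_variance)[::-1][:N_TOP_GENES])

# scale: z-score each selected gene across cells
hvg = normalized[:, hvg_index]
scaled = (hvg - hvg.mean(axis=0)) / hvg.std(axis=0)
print(scaled.shape)
```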


# HOUSEKEEPING:  set up samples.csv

We need to make a samples.csv for running 

In [1]:
import pandas as pd

from pathlib import Path
import scanpy as sc
import scvi
import torch

Global seed set to 0
  from .autonotebook import tqdm as notebook_tqdm
  jax.tree_util.register_keypaths(data_clz, keypaths)
  jax.tree_util.register_keypaths(data_clz, keypaths)


In [2]:

metadata_path = Path.home() / ("Projects/ASAP/team-lee/metadata")
HIP_covar = pd.read_csv(f"{metadata_path}/HIP/covar.csv")
HIP_cases = pd.read_csv(f"{metadata_path}/HIP/PD_ASAP_Sample_batch_information_banner_cases.csv").dropna(axis=0,how='all')
HIP_control = pd.read_csv(f"{metadata_path}/HIP/PD_ASAP_Sample_batch_information_banner_controls.csv")

MFG_covar = pd.read_csv(f"{metadata_path}/MFG/covar.csv") # includes 'PMI' ?
MFG_cases = pd.read_csv(f"{metadata_path}/MFG/PD_ASAP_Sample_batch_information_banner_cases.csv").dropna(axis=0,how='all')
MFG_control = pd.read_csv(f"{metadata_path}/MFG/PD_ASAP_Sample_batch_information_banner_controls.csv")


SN_covar = pd.read_csv(f"{metadata_path}/SN/covar.csv")
SN_cases = pd.read_csv(f"{metadata_path}/SN/PD_ASAP_Sample_batch_information_banner_cases.csv").dropna(axis=0,how='all')
SN_control = pd.read_csv(f"{metadata_path}/SN/PD_ASAP_Sample_batch_information_banner_controls.csv")

### Hippocampus samples

In [3]:
HIP_cases["GROUPcv"]="PD"
HIP_control["GROUPcv"]="HC"
HIP_meta = pd.concat([HIP_cases, HIP_control], axis=0, ignore_index=True)

In [4]:
HIP_meta['MERGE_ID'] = "HIP_" + HIP_meta['GROUPcv'] +"_" + HIP_meta['CaseID'].str.replace('-','')
HIP_covar['MERGE_ID'] = HIP_covar['COUNT_ID']
# the fastqs follow the COUNT_ID naming convention instead of SEQ_ID
HIP_covar['SEQ_ID'] = HIP_covar['COUNT_ID']


HIP_TABLE = pd.merge(HIP_covar, HIP_meta, on='MERGE_ID', how='inner')
HIP_TABLE.columns

HIP_TABLE['subdir']="HIP"
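
Because the tables are joined with `how='inner'`, any sample whose `MERGE_ID` does not match is dropped silently. A possible check is sketched below on toy data (the IDs are invented): an outer merge with `indicator=True` lists the IDs found on only one side.

```python
import pandas as pd

# Toy stand-ins for the covariate and case/control tables (IDs are hypothetical).
covar = pd.DataFrame({"COUNT_ID": ["HIP_PD_0001", "HIP_HC_0002", "HIP_HC_0003"]})
meta = pd.DataFrame({
    "CaseID": ["00-01", "00-02", "00-04"],
    "GROUPcv": ["PD", "HC", "HC"],
})

# Same MERGE_ID construction as in the cell above.
meta["MERGE_ID"] = "HIP_" + meta["GROUPcv"] + "_" + meta["CaseID"].str.replace("-", "", regex=False)
covar["MERGE_ID"] = covar["COUNT_ID"]

# An outer merge keeps unmatched rows; the `_merge` column says which side each row came from.
check = pd.merge(covar, meta, on="MERGE_ID", how="outer", indicator=True)
only_in_covar = check.loc[check["_merge"] == "left_only", "MERGE_ID"].tolist()
only_in_meta = check.loc[check["_merge"] == "right_only", "MERGE_ID"].tolist()
print("covar only:", only_in_covar)
print("meta only:", only_in_meta)
```

If both lists are empty, the inner merge loses nothing.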

### medial frontal gyrus samples

In [5]:
MFG_cases["GROUPcv"]="PD"
MFG_control["GROUPcv"]="HC"
MFG_meta = pd.concat([MFG_cases, MFG_control], axis=0, ignore_index=True)

# make a MERGE_ID column because the formatting is inconsistent
MFG_meta['MERGE_ID'] = "MFG_" + MFG_meta['GROUPcv'] +"_" + MFG_meta['CaseID'].str.replace('-','')
MFG_covar['MERGE_ID'] = MFG_covar['SAMPLE']
# the fastqs are in SEQ_ID 

MFG_TABLE = pd.merge(MFG_covar, MFG_meta, on='MERGE_ID', how='inner')
MFG_TABLE['subdir']="MFG"

### Substantia Nigra samples

In [6]:
SN_covar.head()

Unnamed: 0,BRAIN_REGION,SAMPLE,ID,SEQ_ID,BATCH,GROUP,DIAGNOSIS,SEX,AGE,DIAGNOSIS_AGE,DIAGNOSIS_YEARS,DISEASE,ALZHEIMERS_BRAAK,DEMENTIA,DEMENTIA_AGE,DEMENTIA_YEARS,WEIGHT_USED,TOTAL_WEIGHT
0,Substantia_nigra,SN_HC_1939,HC_19-39,SN_1939_HC,BATCH_5,HC,0,2,93,,,LBS,II,0,,,27,27
1,Substantia_nigra,SN_HC_0602,MCI_06-02,SN_0602_HC,BATCH_5,HC,0,1,84,81.0,3.0,LBS,I,0,,,25,46
2,Substantia_nigra,SN_PD_0413,PD_04-13,SN_0413_PD,BATCH_5,PD,1,1,72,62.0,10.0,PD,I,0,,,33,33
3,Substantia_nigra,SN_PD_1317,PD_13-17,SN_1317_PD,BATCH_5,PD,1,1,83,70.0,13.0,PD,IV,0,,,30,30
4,Substantia_nigra,SN_PD_1973,PD_19-73,SN_1973_PD,BATCH_5,PD,1,1,84,70.0,14.0,PD,IV,1,79.0,5.0,40,65


In [7]:
SN_cases["GROUPcv"]="PD"
SN_control["GROUPcv"]="HC"
SN_meta = pd.concat([SN_cases, SN_control], axis=0, ignore_index=True)

SN_meta['MERGE_ID'] = "SN_" + SN_meta['GROUPcv'] +"_" + SN_meta['CaseID'].str.replace('-','')
SN_covar['MERGE_ID'] = SN_covar['SAMPLE']

SN_TABLE = pd.merge(SN_covar, SN_meta, on='MERGE_ID', how='inner')
SN_TABLE['subdir']="SN"

### concatenate SN, MFG, and HIP tables into one 'all_samples' table

In [8]:
all_samples = pd.concat([HIP_TABLE, MFG_TABLE, SN_TABLE], axis=0, ignore_index=True)
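
The per-region tables do not necessarily share all columns, so a few sanity checks after concatenation may be useful. The sketch below uses toy tables (all values invented) to show three checks: samples per region, duplicated `SEQ_ID` values, and values that are missing because a column was absent in one region.

```python
import pandas as pd

# Toy per-region tables; the SN table deliberately lacks a BATCH column.
hip = pd.DataFrame({"SEQ_ID": ["HIP_A", "HIP_B"], "BATCH": ["BATCH_1", "BATCH_1"], "subdir": "HIP"})
mfg = pd.DataFrame({"SEQ_ID": ["MFG_A", "MFG_A"], "BATCH": ["BATCH_2", "BATCH_3"], "subdir": "MFG"})
sn = pd.DataFrame({"SEQ_ID": ["SN_A"], "subdir": "SN"})

toy_all = pd.concat([hip, mfg, sn], axis=0, ignore_index=True)

# 1. how many samples came from each region
samples_per_region = toy_all["subdir"].value_counts().to_dict()

# 2. SEQ_ID values that occur more than once
dup_mask = toy_all["SEQ_ID"].duplicated(keep=False)
duplicated_ids = toy_all.loc[dup_mask, "SEQ_ID"].unique().tolist()

# 3. rows where BATCH is missing (concat fills absent columns with NaN)
missing_batch = int(toy_all["BATCH"].isna().sum())

print(samples_per_region, duplicated_ids, missing_batch)
```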



In [9]:

all_samples.to_csv("team-Lee-all_samples.csv")

In [27]:
samples_csv = pd.DataFrame()
samples_csv['sample'] = all_samples['SEQ_ID']
samples_csv['batch'] = all_samples['BATCH']
samples_csv['subdir'] = all_samples['subdir']

samples_csv.to_csv("team-Lee-samples.csv")

# ALIGNMENT

> SOURCE: [gs://asap-raw-data-team-lee/scripts/PD_MFG_snRNAseq_with_redos.Rmd](gs://asap-raw-data-team-lee/scripts/PD_MFG_snRNAseq_with_redos.Rmd)


## Cellranger

Transfer sequencing data from genomics to rawdata  
cellranger-6.0.1 to align and count  
performed mRNA alignment for standard single cell counts  
also performed mRNA and premRNA (introns) counts for trajectory analysis

Need to run `cellranger count` for each sample, e.g.

HIP/HIP_HC_2067,HIP/HIP_PD_0413,HIP/HIP_PD_1504,HIP/HIP_HC_2062,HIP/HIP_PD_2038...
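
A list like the one above could be derived from the samples table rather than typed by hand. A minimal sketch, assuming the table has the `sample` and `subdir` columns written to `team-Lee-samples.csv`, and using only the first three sample names shown above:

```python
import pandas as pd

# Small stand-in for the samples table (first three samples from the list above).
samples = pd.DataFrame({
    "sample": ["HIP_HC_2067", "HIP_PD_0413", "HIP_PD_1504"],
    "subdir": ["HIP", "HIP", "HIP"],
})

# Build "<subdir>/<sample>" for every row and join into one comma-separated string.
sample_paths = (samples["subdir"] + "/" + samples["sample"]).tolist()
joined = ",".join(sample_paths)
print(joined)
```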


In [12]:
!pwd

/Users/ergonyc/Projects/ASAP/harmonized-wf-dev/examples


In [None]:
%%bash
#---------- cellranger
#----- mRNA alignment and counting
# used for single cell
# don't use --include-introns

#---------- cellranger_mRNA_premRNA
#----- mRNA and premRNA alignment and counting
# used for single nuclei
module load cellranger/cellranger-6.0.1

# TEAM_LEE and SAMPLE are placeholders; substitute the run id and sample name
cellranger count \
  --id=TEAM_LEE \
  --transcriptome=refdata-gex-GRCh38-2020-A \
  --fastqs=../data/team-lee/fastq/ \
  --sample=SAMPLE \
  --localcores=4
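
Since `cellranger count` has to be run once per sample, the commands could be generated from the samples table. The sketch below only builds the command lines and does not execute anything. It reuses the flags from the cell above; the helper name `build_cellranger_count`, the use of the sample name as `--id`, and the assumption that FASTQs live under `<fastq_root>/<subdir>` are all illustrative guesses rather than the established layout.

```python
import shlex


def build_cellranger_count(sample, subdir,
                           fastq_root="../data/team-lee/fastq",
                           transcriptome="refdata-gex-GRCh38-2020-A",
                           localcores=4):
    """Return the argument list for one `cellranger count` run (hypothetical helper)."""
    return [
        "cellranger", "count",
        f"--id={sample}",                     # assumption: one output directory per sample
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastq_root}/{subdir}",    # assumption: FASTQs grouped by brain region
        f"--sample={sample}",
        f"--localcores={localcores}",
    ]


# (sample, subdir) pairs as they would come from the samples table
sample_rows = [("HIP_HC_2067", "HIP"), ("HIP_PD_0413", "HIP")]
commands = [build_cellranger_count(sample, subdir) for sample, subdir in sample_rows]

for command in commands:
    print(shlex.join(command))
```

Keeping each command as an argument list makes it straightforward to pass to `subprocess.run` later, or to write out as a job script.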

# PRE-PROCESSING

- convert to Seurat object
- qc features

In [13]:
## convert to Seurat object
data_path = Path.home() / ("Projects/ASAP/team-lee")
metadata_path = data_path / "metadata"
SAMPLE = pd.read_csv(f"{metadata_path}/SAMPLE.tsv",delimiter="\t")

SUBJECT = pd.read_csv(f"{metadata_path}/SUBJECT.tsv",delimiter="\t")
CLINPATH = pd.read_csv(f"{metadata_path}/CLINPATH.csv",delimiter=",")
STUDY = pd.read_csv(f"{metadata_path}/STUDY.tsv",delimiter="\t")
PROTOCOL = pd.read_csv(f"{metadata_path}/PROTOCOL.tsv",delimiter="\t")

# read count matrix
adata_path = data_path / "cellranger_counts/SN"

# SN_2061_HC_count_filtered_feature_bc_matrix.h5
# SN_2061_HC_count_molecule_info.h5
# SN_2061_HC_count_raw_feature_bc_matrix.h5

raw_counts = adata_path / "SN_2061_HC_count_raw_feature_bc_matrix.h5"
filtered_counts = adata_path / "SN_2061_HC_count_filtered_feature_bc_matrix.h5"
molec_info_counts = adata_path / "SN_2061_HC_count_molecule_info.h5"



In [15]:
raw_adata = sc.read_10x_h5(raw_counts, genome='GRCh38')
filt_adata = sc.read_10x_h5(filtered_counts, genome='GRCh38')

  utils.warn_names_duplicates("var")
  utils.warn_names_duplicates("var")


In [17]:
raw_adata

AnnData object with n_obs × n_vars = 514945 × 36601
    var: 'gene_ids', 'feature_types', 'genome'

In [18]:
raw_adata.var_names_make_unique()
raw_adata

AnnData object with n_obs × n_vars = 514945 × 36601
    var: 'gene_ids', 'feature_types', 'genome'
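
The duplicate-name warning above is what `var_names_make_unique()` addresses. As I understand the anndata documentation, duplicates are made unique by appending a number to repeated names. The function below is a simplified stand-in written for illustration, not anndata's implementation: the first occurrence keeps its name, later ones get `-1`, `-2`, and so on, and collisions with names that already carry such a suffix are not handled.

```python
from collections import Counter


def make_names_unique(names, join="-"):
    """Simplified illustration: suffix repeated names with -1, -2, ... (not anndata's code)."""
    seen = Counter()
    unique = []
    for name in names:
        if seen[name] == 0:
            unique.append(name)  # first occurrence keeps its original name
        else:
            unique.append(f"{name}{join}{seen[name]}")
        seen[name] += 1
    return unique


unique_names = make_names_unique(["GENE_A", "GENE_B", "GENE_A", "GENE_A"])
print(unique_names)
```

Note that the call changes only the names, which is why the `n_obs × n_vars` shape printed above is unchanged.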

In [23]:
raw_adata[:10,:10]

View of AnnData object with n_obs × n_vars = 10 × 10
    var: 'gene_ids', 'feature_types', 'genome'

In [24]:
filt_adata

AnnData object with n_obs × n_vars = 6004 × 36601
    var: 'gene_ids', 'feature_types', 'genome'

In [25]:
# 'gene_ids' is a column of .var, not an attribute of the AnnData object
filt_adata.var['gene_ids']

In [26]:
seurat_path = data_path / "seurat_objects/SN"


filtered_original = seurat_path / "seurat_filtered_original.rds"
filtered = seurat_path / "seurat_filtered.rds"
combined = seurat_path / "seurat_combined.rds"

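# NOTE: scanpy has no `read_seurat` function, so the call below fails.
# The Seurat .rds objects would need to be converted to a format scanpy can read first.
filt = sc.read_seurat(filtered_original)

AttributeError: module 'scanpy' has no attribute 'read_seurat'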