# 🧠 Kamath *et al.* (2022) Dopamine‑Neuron snRNA‑seq Workflow

## 📖 Introduction & Data Sources  
* **Paper**: Kamath T, Abdulraouf A, Burris SJ, Langlieb J, *et al.* (2022) *Nat Neurosci* 25:588‑595  
* **Title**: *Single‑cell genomic profiling of human dopamine neurons identifies a population that selectively degenerates in Parkinson’s disease*  
* **Collection** (cellxgene): <https://cellxgene.cziscience.com/collections/b0f0b447-ac37-45b0-b1bf-5c0b7d871120>  
* **Scope**: 8 broad brain cell types — initial raw objects per cell type.

### Workflow Outputs
| File | Description |
| --- | --- |
| `Kamath_2022_combined_raw.h5ad` | All 8 cell‑type raw objects concatenated |
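| `Kamath_2022_normal_expr_gene_withPos.h5ad` | Healthy-donor cells, restricted to genes with genomic coordinates |
| `Kamath_2022_normal_pc.h5ad` | Protein‑coding subset of the healthy-donor object |
| `Kamath_2022_normal_expr_gene_withPos.cov` | scDRS covariate file (tab-separated) |
| `Kamath_2022_normal_allgene_list.csv`, `Kamath_2022_normal_pcgene_list.csv` | Gene lists (all retained genes; protein-coding genes) |
| `Kamath_2022_gene_type_distribution.png` | Bar plot of gene types among retained genes |
| `Kamath_2022_celltypes_levels.txt` | Cell‑type hierarchy (read from `DATA_DIR` as an input in the code below) |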

*Absolute paths are **not** stored; edit `DATA_DIR` and `OUTPUT_DIR` variables to match your environment.*


## 🔧 Environment Setup

In [None]:
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

## 📂 Define Paths

In [None]:
DATA_DIR   = Path('data/PD_Macosko')   # contains per‑cell‑type .h5ad files + metadata txt
OUTPUT_DIR = Path('output/PD_Macosko')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

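# NOTE: seven files are listed here, while the introduction mentions 8 broad cell types;
# add the remaining file name if your download includes it.
CELLTYPE_FILES = [
    'Astrocytes.h5ad', 'Endothelial.h5ad', 'Non_DA_Neurons.h5ad',
    'OPC.h5ad', 'DA_Neurons.h5ad', 'Microglia.h5ad', 'Oligodendrocytes.h5ad'
]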
CELLTYPE_FILES = [DATA_DIR/f for f in CELLTYPE_FILES]

META_LEVELS_TXT = DATA_DIR/'Kamath_2022_celltypes_levels.txt'
GENE_MATRIX    = Path('data')/'geneMatrix.tsv.gz'   # 56k gene coords
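
Before loading anything, it can help to confirm that every expected input file is present. The sketch below is an optional check and not part of the original workflow: `find_missing` is a hypothetical helper name, and the demonstration runs against a temporary directory rather than the real `DATA_DIR`.

```python
from pathlib import Path
import tempfile

def find_missing(paths):
    """Return the paths that do not exist on disk, in input order."""
    return [Path(p) for p in paths if not Path(p).exists()]

# Demonstration on a throwaway directory (a stand-in for DATA_DIR).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / 'Astrocytes.h5ad').touch()   # pretend this file was downloaded
    expected = [tmp / 'Astrocytes.h5ad', tmp / 'Microglia.h5ad']
    missing = find_missing(expected)
    missing_names = [p.name for p in missing]

print(missing_names)
```

In the notebook, the same helper could be called as `find_missing(CELLTYPE_FILES + [META_LEVELS_TXT, GENE_MATRIX])`, and a non-empty result treated as a reason to stop before the loading step.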

## 🧩 Load & Concatenate Cell‑Type Objects

In [None]:
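datasets = []
for fp in CELLTYPE_FILES:
    adata = sc.read_h5ad(fp)
    adata.obs['broad_cell_type'] = fp.stem  # label cells by source file name
    datasets.append(adata)
# Outer join keeps the union of genes; for sparse X, genes absent from a file are zero-filled.
# `label`/`keys` record the source file of each cell in obs['batch'].
combined = ad.concat(datasets, join='outer', label='batch', keys=[p.stem for p in CELLTYPE_FILES])
combined.write(OUTPUT_DIR/'Kamath_2022_combined_raw.h5ad')
print(combined)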

## 🗂️ Merge Author‑Defined Cell‑Type Levels

In [None]:
meta_levels = pd.read_csv(META_LEVELS_TXT, sep='\t')
merge_keys = ['author_cell_type', 'broad_cell_type', 'cell_type']
# Merge on a plain DataFrame copy, then assign back; a left merge keeps the cell order.
obs = combined.obs.rename_axis('cell_id').reset_index()
obs = obs.merge(meta_levels, on=merge_keys, how='left')
combined.obs = obs.set_index('cell_id').rename_axis(None)
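
A left merge silently leaves missing values for cells whose key combination is absent from the levels table. The sketch below shows one way to surface those cells, using toy data only: the barcodes, labels, and the `level1` column are made up for illustration and are not taken from the real metadata file.

```python
import pandas as pd

# Toy stand-in for combined.obs: index = cell barcodes (values are made up).
obs = pd.DataFrame(
    {'cell_type': ['astrocyte', 'microglial cell', 'astrocyte'],
     'author_cell_type': ['Astro_A', 'MG_B', 'Astro_X']},
    index=['c1', 'c2', 'c3'],
)
# Toy stand-in for the levels table; 'level1' is a hypothetical column name.
levels = pd.DataFrame(
    {'cell_type': ['astrocyte', 'microglial cell'],
     'author_cell_type': ['Astro_A', 'MG_B'],
     'level1': ['glia', 'glia']}
)

merged = (
    obs.rename_axis('cell_id').reset_index()
       .merge(levels, on=['cell_type', 'author_cell_type'],
              how='left',
              validate='many_to_one',   # raise if the levels table has duplicate keys
              indicator=True)           # adds a '_merge' column describing each row
       .set_index('cell_id')
)

# Cells with no matching row in the levels table.
unmatched = merged.index[merged['_merge'] == 'left_only'].tolist()
print(unmatched)
```

`validate='many_to_one'` guards against duplicated key rows in the levels table, which would otherwise duplicate cells and break the one-row-per-cell assumption of `obs`.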

## 👤 Parse Age & Recode Sex

In [None]:
combined.obs['age'] = pd.to_numeric(combined.obs['development_stage'].str.extract(r'(\d+)')[0], errors='coerce')
combined.obs['sex'] = combined.obs['sex'].map({'female':0,'male':1})
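
The two recodings above can be checked on a few toy values. The strings below are illustrative stand-ins for `development_stage` and `sex` labels, not values copied from the dataset. Anything that does not match (here `'unknown'`) becomes a missing value rather than raising an error.

```python
import pandas as pd

# Illustrative labels only; the real columns may contain other strings.
stage = pd.Series(['49-year-old human stage', '82-year-old human stage', 'unknown'])
sex_raw = pd.Series(['female', 'male', 'unknown'])

# First run of digits becomes the age; non-matching labels become NaN.
age = pd.to_numeric(stage.str.extract(r'(\d+)')[0], errors='coerce')
# Labels outside the mapping become NaN.
sex = sex_raw.map({'female': 0, 'male': 1})

n_missing_age = int(age.isna().sum())
n_missing_sex = int(sex.isna().sum())
print(n_missing_age, n_missing_sex)
```

Counting the missing values right after recoding makes it easy to notice unexpected labels before they reach the covariate file.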

## 🩺 Subset to Healthy Donors

In [None]:
sce = combined[combined.obs['disease']=='normal'].copy()
print(sce)

## 🧬 Filter Genes with Coordinates

In [None]:
gene_coords = pd.read_csv(GENE_MATRIX, sep='\t', compression='infer')
# Genes present in both the coordinate table and the expression matrix
valid_genes = pd.Index(gene_coords['Gene'].astype(str)).intersection(sce.var_names)
sce = sce[:, sce.var_names.isin(valid_genes)].copy()
print('Genes after filter:', sce.n_vars)
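
The gene filter relies on a set intersection between the coordinate table and the expression matrix. A minimal sketch with placeholder gene names (`GENE_A` and so on are invented for the example) shows the pattern. Note that `intersection` is a method of `pandas.Index`, so the column is wrapped in an `Index` first.

```python
import pandas as pd

var_names = pd.Index(['GENE_A', 'GENE_B', 'GENE_C'])   # stand-in for sce.var_names
gene_col = pd.Series(['GENE_C', 'GENE_A', 'GENE_Z'])   # stand-in for gene_coords['Gene']

# Genes found in both sources.
valid = pd.Index(gene_col.astype(str)).intersection(var_names)

# Boolean mask in var_names order, usable for column subsetting.
keep_mask = var_names.isin(valid)
print(sorted(valid), keep_mask.tolist())
```

Building a boolean mask from `var_names` keeps the genes in their original matrix order, regardless of the order returned by the intersection.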

## 💾 Save Expression + Position AnnData

In [None]:
sce.write(OUTPUT_DIR/'Kamath_2022_normal_expr_gene_withPos.h5ad')

## 📊 Generate scDRS Covariate File

In [None]:
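cov = pd.DataFrame(index=sce.obs_names)
cov['const'] = 1
# .sum(axis=1) on a sparse matrix returns an (n, 1) matrix; flatten it to 1-D first.
cov['n_genes'] = np.asarray((sce.X > 0).sum(axis=1)).ravel()
cov['sex'] = sce.obs['sex']
cov['age'] = sce.obs['age']
# One indicator column per donor. Together these sum to `const`, so the design is
# collinear; drop one donor column if your downstream tool needs a full-rank matrix.
for donor in sorted(sce.obs['donor_id'].unique()):
    cov[f'donor_{donor}'] = (sce.obs['donor_id'] == donor).astype(int)
cov.to_csv(OUTPUT_DIR/'Kamath_2022_normal_expr_gene_withPos.cov', sep='\t')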
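
The layout of the covariate table can be illustrated on a small dense matrix. This is a sketch under stated assumptions, not the notebook's own code: the counts, donor IDs, and ages are invented, and `pd.get_dummies` is used here as an alternative to the explicit donor loop.

```python
import numpy as np
import pandas as pd

# Toy count matrix: 3 cells x 3 genes (values are made up).
X = np.array([[0, 2, 1],
              [3, 0, 0],
              [1, 1, 1]])
obs = pd.DataFrame(
    {'donor_id': ['d1', 'd2', 'd1'], 'sex': [0, 1, 0], 'age': [70, 65, 70]},
    index=['c1', 'c2', 'c3'],
)

cov = pd.DataFrame({'const': 1}, index=obs.index)
cov['n_genes'] = np.asarray((X > 0).sum(axis=1)).ravel()   # genes detected per cell
cov = cov.join(obs[['sex', 'age']])

# One 0/1 indicator column per donor, named donor_<id>.
donor_cols = pd.get_dummies(obs['donor_id'], prefix='donor', dtype=int)
cov = cov.join(donor_cols)

# The donor indicators sum to the constant column, so these columns are not full rank.
rank = np.linalg.matrix_rank(cov[['const', 'donor_d1', 'donor_d2']].to_numpy())
print(cov)
print(rank)
```

If a downstream tool requires a full-rank design, `pd.get_dummies(..., drop_first=True)` is one way to omit a reference donor. Whether that is needed depends on how the tool handles collinear covariates.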

## 🧩 Subset to Protein‑Coding Genes

In [None]:
pc_genes = gene_coords[gene_coords['gene_type']=='protein_coding']['Gene']
subset_pc = sce[:, sce.var_names.isin(pc_genes)].copy()
subset_pc.write(OUTPUT_DIR/'Kamath_2022_normal_pc.h5ad')
print(subset_pc)

## 🗃️ Export Gene Lists

In [None]:
pd.Series(sce.var_names,name='Gene').to_csv(OUTPUT_DIR/'Kamath_2022_normal_allgene_list.csv', index=False)
pd.Series(subset_pc.var_names,name='Gene').to_csv(OUTPUT_DIR/'Kamath_2022_normal_pcgene_list.csv', index=False)

## 📈 Plot Gene‑Type Distribution

In [None]:
gene_type_counts = gene_coords[gene_coords['Gene'].isin(sce.var_names)]['gene_type'].value_counts()
ax = gene_type_counts.plot(kind='bar', figsize=(8,6))
ax.set_ylabel('Number of genes')
ax.set_xlabel('Gene type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig(OUTPUT_DIR/'Kamath_2022_gene_type_distribution.png')
plt.show()