# 🫀 CARE Heart snRNA‑seq Processing Workflow
This notebook demonstrates a path‑agnostic workflow for preprocessing CARE Heart snRNA‑seq data.


## 📖 Introduction & Data Sources

**Dataset:** CARE heart single‑nucleus RNA‑seq (snRNA‑seq)  
*Original portal:* <http://ns104190.ip-147-135-44.us/data_CARE_portal/snATAC/ucsc_browser/>

**Raw files obtained from HPC storage**

```
./Human_sc/processed/CARE
└── CARE_snRNA_Heart.h5ad                # Pre‑assembled AnnData (GRCh38)
```

**Key processing outputs (relative to the same folder)**

| Purpose | Output file |
|---|---|
| Expression w/ genomic positions | `CARE_snRNA_Heart_expr_gene_withPos.h5ad` |
| Protein‑coding gene subset | `CARE_snRNA_Heart_pc.h5ad` |
| scDRS covariate file | `CARE_snRNA_Heart_expr_gene_withPos.cov` |
| Gene‑type distribution plot | `CARE_snRNA_Heart_gene_type_distribution.png` |
| Gene lists | `CARE_snRNA_Heart_allgene_list.csv`, `CARE_snRNA_Heart_pcgene_list.csv` |
| Unique cell‑type list | `CARE_snRNA_Heart_unique_celltype.csv` |

Adjust `DATA_DIR` and `OUTPUT_DIR` in the "Define Input / Output Paths" cell to match your environment. On the QRISdata cluster, point them at the corresponding absolute paths (`/QRISdata/...`).


## 🔧 Environment Setup

In [None]:
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

## 📂 Define Input / Output Paths

In [None]:
# Edit these paths as needed
DATA_DIR   = Path('data/CARE')
OUTPUT_DIR = Path('output/CARE')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

expr_path        = DATA_DIR / 'exprMatrix.tsv.gz'
umap_path        = DATA_DIR / 'UMAP_coordinates.coords.tsv.gz'
meta_path        = DATA_DIR / 'meta.tsv'
gene_coord_path  = DATA_DIR.parent / 'geneMatrix.tsv.gz'

## 📑 Load Expression Matrix, UMAP & Metadata

In [None]:
expr_matrix = pd.read_csv(expr_path, sep='\t', index_col=0, compression='infer')
umap_coords = pd.read_csv(umap_path, sep='\t', index_col=0, compression='infer')
metadata    = pd.read_csv(meta_path, sep='\t', index_col=0)
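
Before building the AnnData object, it can help to confirm that the three tables describe the same cells. The sketch below is a minimal check on toy data; `check_barcode_overlap` and the barcodes `c1` to `c3` are hypothetical names introduced for illustration, and it assumes cells are the columns of the expression matrix and the row index of the metadata and UMAP tables, as the loading code above implies.

```python
import pandas as pd


def check_barcode_overlap(expr_cols, meta_index, umap_index):
    """Summarise how many cell barcodes are shared across the three inputs."""
    expr_set, meta_set, umap_set = set(expr_cols), set(meta_index), set(umap_index)
    shared = expr_set & meta_set & umap_set
    return {
        'n_expr': len(expr_set),
        'n_shared': len(shared),
        'missing_in_meta': sorted(expr_set - meta_set),
        'missing_in_umap': sorted(expr_set - umap_set),
    }


# Toy stand-ins for expr_matrix, metadata and umap_coords (hypothetical barcodes)
toy_expr = pd.DataFrame([[1, 0, 2], [0, 3, 0]],
                        index=['GeneA', 'GeneB'], columns=['c1', 'c2', 'c3'])
toy_meta = pd.DataFrame({'celltype': ['CM', 'FB', 'EC']}, index=['c1', 'c2', 'c3'])
toy_umap = pd.DataFrame({'x': [0.1, 0.2], 'y': [1.0, 2.0]}, index=['c1', 'c2'])

report = check_barcode_overlap(toy_expr.columns, toy_meta.index, toy_umap.index)
print(report)
```

On the real data the call would be `check_barcode_overlap(expr_matrix.columns, metadata.index, umap_coords.index)`. Any barcode listed as missing will end up with `NaN` values once the tables are reindexed to a common cell order.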

## 🧬 Build AnnData Object

In [None]:
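adata = ad.AnnData(expr_matrix.T)
# Align metadata and UMAP rows to the cell order in adata before attaching them;
# assigning an unaligned table would relabel cells without reordering adata.X
adata.obs = metadata.reindex(adata.obs_names)
umap_coords_aligned = umap_coords.reindex(adata.obs_names)
adata.obsm['X_umap'] = umap_coords_aligned.values
print(adata)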

## 🔁 Convert Gene Symbols → Ensembl IDs

In [None]:
biomart_df = pd.read_csv(gene_coord_path, sep='\t', compression='infer')
biomart_df['ensgid'] = biomart_df['ensgid'].astype(str)
adata.var['symbol_gene_name'] = adata.var_names
# Series.map needs a unique lookup index, so keep the first Ensembl ID per symbol
symbol_to_ensg = biomart_df.drop_duplicates('gene_name').set_index('gene_name')['ensgid']
adata.var['ensgid'] = adata.var['symbol_gene_name'].map(symbol_to_ensg)
print('Missing ENSG IDs:', adata.var['ensgid'].isna().sum())
# Subset the AnnData object itself; dropping rows from adata.var in place
# would leave it out of sync with adata.X
adata = adata[:, adata.var['ensgid'].notna().to_numpy()].copy()
adata.var = adata.var.set_index('ensgid')
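
The symbol-to-Ensembl step can lose genes in two ways: symbols absent from the annotation table, and symbols that appear more than once in it. The following sketch illustrates both cases on toy data. The table `toy_annot` and the placeholder IDs such as `ENSG_TTN` are invented for illustration and are not real Ensembl identifiers; only the column names `gene_name` and `ensgid` come from the notebook.

```python
import pandas as pd

# Hypothetical annotation table with the same columns the notebook reads
toy_annot = pd.DataFrame({
    'gene_name': ['TTN', 'MYH7', 'MYH7', 'NPPA'],
    'ensgid': ['ENSG_TTN', 'ENSG_MYH7_a', 'ENSG_MYH7_b', 'ENSG_NPPA'],
})
symbols = pd.Series(['TTN', 'MYH7', 'UNKNOWN1'])

# Mapping through a Series with a duplicated index can fail,
# so keep the first ID listed for each symbol
lookup = toy_annot.drop_duplicates('gene_name').set_index('gene_name')['ensgid']
mapped = symbols.map(lookup)

# Symbols with no match become NaN and would be dropped downstream
n_missing = int(mapped.isna().sum())

# Symbols listed more than once in the annotation table
dup_rows = toy_annot['gene_name'].duplicated(keep=False)
ambiguous = toy_annot.loc[dup_rows, 'gene_name'].unique().tolist()

print(mapped.tolist(), n_missing, ambiguous)
```

Keeping the first ID per symbol is only one possible policy. Reviewing the `ambiguous` list before choosing may be worthwhile if those genes matter for the downstream analysis.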

## 🕵️‍♀️ Remove Low‑Frequency Cell Types (<20 cells)

In [None]:
adata.obs['celltype'] = adata.obs['celltype'].str.replace(' ', '_', regex=False)
cell_counts = adata.obs['celltype'].value_counts()
cell_counts.to_csv(OUTPUT_DIR / 'celltype_counts.txt', sep='\t')
keep_types = cell_counts[cell_counts >= 20].index
adata = adata[adata.obs['celltype'].isin(keep_types)].copy()
print(f'Retained {adata.n_obs} cells across {adata.obs["celltype"].nunique()} cell types.')
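
The filtering rule above can be checked in isolation on a toy label vector. This is a minimal sketch; `filter_rare_labels` is a hypothetical helper and the label counts are made up, with only the threshold of 20 cells taken from the notebook.

```python
import pandas as pd


def filter_rare_labels(labels, min_cells=20):
    """Return a boolean mask that keeps labels observed in at least min_cells cells."""
    counts = labels.value_counts()
    keep = counts[counts >= min_cells].index
    return labels.isin(keep)


# Toy labels: two well-represented types and one rare type
toy_labels = pd.Series(['Cardiomyocyte'] * 25 + ['Fibroblast'] * 20 + ['Mast_cell'] * 3)
mask = filter_rare_labels(toy_labels, min_cells=20)
kept = toy_labels[mask]

print(int(mask.sum()), sorted(kept.unique()))
```

Because the comparison is `>=`, a cell type with exactly 20 cells is retained, which matches the notebook's `cell_counts >= 20` condition.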

## 💾 Save Filtered AnnData

In [None]:
adata.write(OUTPUT_DIR / 'CARE_snRNA_Heart_expr_gene_withPos.h5ad')

## 📊 Generate Covariate File for scDRS

In [None]:
df_cov = pd.DataFrame(index=adata.obs.index)
df_cov['const'] = 1
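# np.asarray(...).ravel() yields a 1-D vector for both dense and sparse adata.X
df_cov['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()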
for samp in sorted(adata.obs['Sample'].unique()):
    df_cov[f'donor_{samp}'] = (adata.obs['Sample'] == samp).astype(int)
df_cov.to_csv(OUTPUT_DIR / 'CARE_snRNA_Heart_expr_gene_withPos.cov', sep='\t')
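
One property of this design is worth knowing: a constant column plus one indicator per donor is linearly dependent, because the donor indicators sum to the constant. Whether this matters depends on how the downstream tool fits the covariates, which this notebook does not state. The sketch below uses made-up toy data to show the dependency and one common workaround (dropping one donor column via `pd.get_dummies(..., drop_first=True)`). It is an optional check, not a required change.

```python
import numpy as np
import pandas as pd

# Toy observations: six cells from three hypothetical donors
toy_obs = pd.DataFrame({'Sample': ['D1', 'D1', 'D2', 'D3', 'D3', 'D2']},
                       index=[f'cell{i}' for i in range(6)])
toy_X = np.array([[1, 0, 2], [0, 0, 3], [4, 1, 0],
                  [0, 0, 0], [2, 2, 2], [0, 5, 0]])

cov = pd.DataFrame(index=toy_obs.index)
cov['const'] = 1
cov['n_genes'] = (toy_X > 0).sum(axis=1)

# One indicator per donor (as in the notebook) versus dropping the first donor
full = pd.concat([cov, pd.get_dummies(toy_obs['Sample'], prefix='donor', dtype=int)], axis=1)
reduced = pd.concat(
    [cov, pd.get_dummies(toy_obs['Sample'], prefix='donor', drop_first=True, dtype=int)],
    axis=1,
)

rank_full = np.linalg.matrix_rank(full.to_numpy(dtype=float))
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy(dtype=float))
print(full.shape[1], rank_full, reduced.shape[1], rank_reduced)
```

In the toy case the full design has more columns than its rank, while the reduced design is full rank.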

## 🧩 Subset to Protein‑Coding Genes

In [None]:
protein_coding = biomart_df[biomart_df['gene_type'] == 'protein_coding']['ensgid']
adata_pc = adata[:, adata.var.index.isin(protein_coding)].copy()
adata_pc.write(OUTPUT_DIR / 'CARE_snRNA_Heart_pc.h5ad')
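
A quick count of how many genes survive the protein-coding filter can catch annotation mismatches early. The sketch below reproduces the `isin` mask on toy data; the IDs `ENSG_A` to `ENSG_X` are placeholders rather than real Ensembl identifiers, and only the column names `ensgid` and `gene_type` are taken from the notebook.

```python
import pandas as pd

# Hypothetical annotation table and gene index
toy_annot = pd.DataFrame({
    'ensgid': ['ENSG_A', 'ENSG_B', 'ENSG_C', 'ENSG_D'],
    'gene_type': ['protein_coding', 'lncRNA', 'protein_coding', 'miRNA'],
})
var_index = pd.Index(['ENSG_A', 'ENSG_B', 'ENSG_D', 'ENSG_X'])

# Same selection logic as the notebook: IDs annotated as protein_coding
pc_ids = toy_annot.loc[toy_annot['gene_type'] == 'protein_coding', 'ensgid']
pc_mask = var_index.isin(pc_ids)

summary = {
    'n_genes': len(var_index),
    'n_protein_coding': int(pc_mask.sum()),
    'n_unannotated': int((~var_index.isin(toy_annot['ensgid'])).sum()),
}
print(summary)
```

On the real data, `var_index` would be `adata.var.index` and `toy_annot` would be `biomart_df`.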

## 📈 Plot Gene Type Distribution

In [None]:
gene_type_counts = biomart_df[biomart_df['ensgid'].isin(adata.var.index)]['gene_type'].value_counts()
ax = gene_type_counts.plot(kind='bar', figsize=(8,6))
ax.set_ylabel('Number of genes')
ax.set_xlabel('Gene type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'CARE_snRNA_Heart_gene_type_distribution.png')
plt.show()

## 🗂️ Export Gene Lists

In [None]:
pd.Series(adata.var.index, name='ensgid').to_csv(OUTPUT_DIR / 'CARE_snRNA_Heart_allgene_list.csv', index=False)
pd.Series(adata_pc.var.index, name='ensgid').to_csv(OUTPUT_DIR / 'CARE_snRNA_Heart_pcgene_list.csv', index=False)