# 🟫 Cheng et al. (2018) Human Epidermis scRNA‑seq Workflow

## 📖 Overview  
* **Paper**: Cheng JB *et al.* (2018) *Cell Reports* — “Transcriptional Programming of Normal and Inflamed Human Epidermis at Single‑Cell Resolution”.  
* **Dataset portal**: <https://cells.ucsc.edu/?ds=human-epidermis>  
* **Raw data accession**: EGAS00001002927  
* **Cells profiled**: 92 889 epidermal cells from 12 skin samples (9 normal, 3 inflamed).

### Outputs (this workflow)
| File | Description |
| --- | --- |
| `Cheng_2018_Cell_Reports_updated.h5ad` | AnnData with raw counts & cleaned labels |
| `Cheng_2018_Cell_Reports_pc.h5ad` | Protein‑coding subset |
| `Cheng_2018_Cell_Reports.cov`, `Cheng_2018_Cell_Reports_pc.cov` | scDRS covariates (all genes / protein‑coding) |
| `Cheng_2018_Cell_Reports.cell_type_counts.txt` | Cell‑type counts |


## 🔧 Environment Setup

In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
from pathlib import Path
from scipy.sparse import issparse

## 📂 Define Paths

In [None]:
DATA_DIR   = Path('data/Cheng_2018')
OUTPUT_DIR = Path('output/Cheng_2018')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

H5AD_RAW   = DATA_DIR/'Cheng_2018_Cell_Reports.h5ad'
EXPR_TSV   = DATA_DIR/'exprMatrix.tsv.gz'
UMAP_TSV   = DATA_DIR/'umap_hm.coords.tsv.gz'
META_TSV   = DATA_DIR/'meta.tsv'
GENE_MATRIX= Path('data')/'geneMatrix.tsv.gz'

H5AD_UPD   = OUTPUT_DIR/'Cheng_2018_Cell_Reports_updated.h5ad'
PC_H5AD    = OUTPUT_DIR/'Cheng_2018_Cell_Reports_pc.h5ad'
COV_ALL    = OUTPUT_DIR/'Cheng_2018_Cell_Reports.cov'
COV_PC     = OUTPUT_DIR/'Cheng_2018_Cell_Reports_pc.cov'
COUNTS_TXT = OUTPUT_DIR/'Cheng_2018_Cell_Reports.cell_type_counts.txt'
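
Before running the heavier cells, it can help to confirm that every input file is present. The sketch below is not part of the original workflow: `missing_inputs` is a hypothetical helper, and the demo runs on a temporary directory so that it works anywhere.

```python
from pathlib import Path
import tempfile


def missing_inputs(paths):
    """Return the paths that do not exist on disk (hypothetical helper)."""
    return [Path(p) for p in paths if not Path(p).exists()]


# Demo on a throwaway directory: one file is created, one is not.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / 'meta.tsv'
    present.write_text('cell\tdonor_id\n')
    absent = Path(tmp) / 'exprMatrix.tsv.gz'
    missing_names = [p.name for p in missing_inputs([present, absent])]

print(missing_names)
```

In this notebook the equivalent call would be `missing_inputs([H5AD_RAW, EXPR_TSV, UMAP_TSV, META_TSV, GENE_MATRIX])`; an empty list means all inputs were found.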

## 🧬 Load Base AnnData

In [None]:
sce = sc.read_h5ad(H5AD_RAW)
print(sce)

## 🔄 Replace Expression Matrix with Raw Counts + Align

In [None]:
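expr = pd.read_csv(EXPR_TSV, sep='\t', index_col=0)  # genes x cells
expr.index = expr.index.str.split('|').str[0]  # keep ENSG only
assert expr.index.is_unique, 'Duplicate gene IDs after stripping symbols!'
meta = pd.read_csv(META_TSV, sep='\t', index_col=0)  # loaded for inspection; not used in later cells

# Sanity check cell IDs
assert set(expr.columns) == set(sce.obs_names), 'Cell IDs mismatch!'
expr = expr[sce.obs_names]  # same cell order as the AnnData
common_genes = expr.index.intersection(sce.var_names)
expr = expr.loc[common_genes]
# Copy so that sce is a real AnnData rather than a view before X is reassigned
sce = sce[:, common_genes].copy()

sce.X = expr.T.values  # cells x genes, dense
print('Replaced matrix with', expr.shape[0], 'genes.')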
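
The alignment logic in this cell can be checked on a toy table before it is applied to the full matrix. The sketch below is a minimal illustration that uses only pandas; the gene and cell IDs are made up, and `obs_names` / `var_names` stand in for `sce.obs_names` / `sce.var_names`.

```python
import pandas as pd

# Toy genes x cells table shaped like exprMatrix.tsv.gz (IDs are invented).
expr_toy = pd.DataFrame(
    [[1, 0, 3], [0, 2, 0], [5, 5, 5]],
    index=['ENSG01|A', 'ENSG02|B', 'ENSG03|C'],
    columns=['cell_b', 'cell_a', 'cell_c'],
)
expr_toy.index = expr_toy.index.str.split('|').str[0]  # keep the ID before '|'

# Stand-ins for sce.obs_names and sce.var_names.
obs_names = pd.Index(['cell_a', 'cell_b', 'cell_c'])
var_names = pd.Index(['ENSG03', 'ENSG01', 'ENSG99'])

common = expr_toy.index.intersection(var_names)  # genes present in both
aligned = expr_toy.loc[common, obs_names]        # genes x cells, reordered
X = aligned.T.to_numpy()                         # cells x genes

print(list(common))
print(X)
```

The row order of `X` follows `obs_names` and the column order follows `common`, which is why the AnnData must be subset with the same `common` index before `X` is assigned.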

### Update UMAP Coordinates

In [None]:
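umap = pd.read_csv(UMAP_TSV, sep='\t', index_col=0)
umap_aligned = umap.reindex(sce.obs_names)  # cells absent from the file become NaN rows
print(int(umap_aligned.isna().any(axis=1).sum()), 'cells without UMAP coordinates')
sce.obsm['X_umap'] = umap_aligned.values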
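
One property of `reindex` is worth keeping in mind here: cells that are missing from the coordinate file do not raise an error, they silently become NaN rows. A small sketch with invented cell IDs and coordinates shows the behaviour and a simple way to count the affected cells.

```python
import numpy as np
import pandas as pd

# Toy coordinate table; cell_c is deliberately absent.
umap_toy = pd.DataFrame(
    {'x': [0.1, 0.2], 'y': [1.0, 2.0]},
    index=['cell_a', 'cell_b'],
)
obs_names = pd.Index(['cell_b', 'cell_a', 'cell_c'])  # stand-in for sce.obs_names

aligned = umap_toy.reindex(obs_names)                 # rows follow obs_names
n_missing = int(aligned.isna().any(axis=1).sum())     # cells with no coordinates
coords = aligned.to_numpy(dtype=float)

print(n_missing)
print(coords)
```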

## 🧹 Clean Cell‑Type Labels

In [None]:
# Replace spaces, hyphens and commas with underscores (literal matches, no regex)
sce.obs['cell_type'] = (
    sce.obs['cell_type']
      .str.replace(' ', '_', regex=False)
      .str.replace('-', '_', regex=False)
      .str.replace(',', '_', regex=False))
print(sce.obs['cell_type'].unique()[:10])
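
The three chained replacements can also be written as a single character-class regex. The sketch below compares both forms on invented labels (they are not the dataset's actual cell types) to show that they give the same result.

```python
import pandas as pd

# Invented labels, chosen to contain a space, a hyphen and a comma.
labels = pd.Series(['Basal 1', 'WNT1-high', 'Spinous, late'])

# Same three literal replacements as in the cell above.
chained = (
    labels
    .str.replace(' ', '_', regex=False)
    .str.replace('-', '_', regex=False)
    .str.replace(',', '_', regex=False)
)

# Equivalent single pass: any one of space, comma or hyphen becomes '_'.
single = labels.str.replace(r'[ ,\-]', '_', regex=True)

print(chained.tolist())
```

Note that a comma followed by a space yields a double underscore in both versions; using the pattern `r'[ ,\-]+'` would collapse such runs into one underscore, at the cost of differing from the labels produced above.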

### Save Updated AnnData & Cell‑Type Counts

In [None]:
sce.write(H5AD_UPD)
ct_counts = sce.obs['cell_type'].value_counts()
ct_counts.to_csv(COUNTS_TXT, sep='\t', header=True)
ct_counts.head()
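
To confirm that the counts file can be read back as written, a round trip through an in-memory buffer is enough. This is a sketch on invented labels; the value column is read by position because its header name differs between pandas versions.

```python
import io
import pandas as pd

# Invented labels standing in for sce.obs['cell_type'].
cell_type = pd.Series(
    ['Basal', 'Basal', 'Spinous', 'Basal', 'Granular'], name='cell_type'
)
counts = cell_type.value_counts()  # sorted by frequency, most common first

# Write as TSV to a buffer instead of a file, then read it back.
buf = io.StringIO()
counts.to_csv(buf, sep='\t', header=True)
buf.seek(0)
restored = pd.read_csv(buf, sep='\t', index_col=0)

restored_counts = restored.iloc[:, 0].to_dict()
print(restored_counts)
```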

## 📊 Covariate File (All Genes)

In [None]:
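cov = pd.DataFrame(index=sce.obs.index)
cov['const'] = 1
# np.asarray(...).ravel() yields a 1-D vector whether X is dense or sparse
cov['n_genes'] = np.asarray((sce.X > 0).sum(axis=1)).ravel()
# One indicator column per donor; 'fore8' is left out as the reference level
for donor in sorted(sce.obs['donor_id'].unique()):
    if donor != 'fore8':
        cov[f'donor_{donor}'] = (sce.obs['donor_id'] == donor).astype(int)

cov.to_csv(COV_ALL, sep='\t')
print('Saved', COV_ALL.name)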
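
The covariate table is a constant column, a per-cell count of detected genes, and donor indicator columns with one donor left out as the reference level, presumably so that the indicators are not collinear with `const`. The sketch below rebuilds that layout on toy data, using `pd.get_dummies` in place of the explicit loop and a sparse matrix to show why the row sums are flattened. The donor names other than `fore8` and all matrix values are invented.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy cells x genes count matrix (4 cells, 3 genes), stored sparse.
X = csr_matrix(np.array([[1, 0, 2], [0, 0, 4], [3, 1, 1], [0, 0, 0]]))
obs = pd.DataFrame(
    {'donor_id': ['donorA', 'fore8', 'donorB', 'fore8']},  # donorA/donorB are made up
    index=['c1', 'c2', 'c3', 'c4'],
)

cov = pd.DataFrame(index=obs.index)
cov['const'] = 1
# On a sparse matrix, sum(axis=1) returns an (n, 1) matrix; flatten it to 1-D.
cov['n_genes'] = np.asarray((X > 0).sum(axis=1)).ravel()

# Indicator columns for every donor, then drop the reference level.
reference = 'fore8'
dummies = pd.get_dummies(obs['donor_id'], prefix='donor', dtype=int)
cov = cov.join(dummies.drop(columns=f'donor_{reference}'))

print(cov)
```

Cells from the reference donor have zeros in every `donor_*` column, so their effect is absorbed by `const`.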

## 🧩 Protein‑Coding Subset & Covariate

In [None]:
gene_coords = pd.read_csv(GENE_MATRIX, sep='\t', compression='infer')
# IDs in the 'Gene' column must use the same namespace as sce.var_names
pc_set = set(gene_coords.loc[gene_coords['gene_type'] == 'protein_coding', 'Gene'])
subset_pc = sce[:, sce.var_names.isin(pc_set)].copy()
subset_pc.write(PC_H5AD)

cov_pc = pd.DataFrame(index=subset_pc.obs.index)
cov_pc['const'] = 1
# n_genes is recomputed on the protein-coding subset; 1-D for dense or sparse X
cov_pc['n_genes'] = np.asarray((subset_pc.X > 0).sum(axis=1)).ravel()
# One indicator column per donor; 'fore8' is left out as the reference level
for donor in sorted(subset_pc.obs['donor_id'].unique()):
    if donor != 'fore8':
        cov_pc[f'donor_{donor}'] = (subset_pc.obs['donor_id'] == donor).astype(int)

cov_pc.to_csv(COV_PC, sep='\t')
print('Saved', PC_H5AD.name, 'and', COV_PC.name)
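
The protein-coding filter depends on the annotation table and the AnnData using the same gene identifiers. The sketch below reproduces the filter on a toy annotation table (all IDs and gene types are invented; only the column names `Gene` and `gene_type` come from the cell above) and reports the fraction of genes kept as a quick sanity check.

```python
import pandas as pd

# Toy annotation table with the two columns used above.
gene_table = pd.DataFrame({
    'Gene': ['ENSG01', 'ENSG02', 'ENSG03', 'ENSG04'],
    'gene_type': ['protein_coding', 'lncRNA', 'protein_coding', 'pseudogene'],
})
var_names = pd.Index(['ENSG03', 'ENSG02', 'ENSG01', 'ENSG05'])  # stand-in for sce.var_names

pc_set = set(gene_table.loc[gene_table['gene_type'] == 'protein_coding', 'Gene'])
keep = var_names.isin(pc_set)        # boolean mask in var_names order
kept = var_names[keep]
fraction_kept = float(keep.mean())

print(list(kept), fraction_kept)
```

A `fraction_kept` close to zero on the real data would suggest an identifier mismatch, for example gene symbols in one table and Ensembl IDs in the other, rather than a genuine lack of protein-coding genes.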