# 🦠 Smillie *et al.* (2019) Colon scRNA‑seq Workflow

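## 📖 Overview  

This notebook prepares the healthy-donor subset of the Smillie *et al.* (2019) colon dataset for downstream scDRS (and, optionally, CELLEX) analysis. It loads the CELLxGENE `.h5ad` file, sanitizes the cell-type labels, writes a cell-type count table, builds scDRS covariate files, and derives a protein-coding subset.
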
## 📖 Data Sources  
| Resource | Link | Notes |
| --- | --- | --- |
| Smillie *et al.* (2019), *Cell*: “Intra‑ and Inter‑cellular Rewiring of the Human Colon during Ulcerative Colitis” | DOI: 10.1016/j.cell.2019.06.029 | Healthy samples only |
| **Download portal** | <https://cellxgene.cziscience.com/collections/33d19f34-87f5-455b-8ca5-9023a2e5453d> | CZ CELLxGENE collection page |
| GRCh38 gene‑coordinate file | `geneMatrix.tsv.gz` | 56 778 genes with start–end positions |

### 📦 Outputs Produced
| File | Description |
| --- | --- |
| `2019_Smillie_normal_cellxgene.h5ad` | Cleaned AnnData (all genes, normalized) |
| `2019_Smillie_normal_cellxgene_pc.h5ad` | Protein‑coding subset |
| `*.cov` | scDRS covariates (all / pc) |
| `*.cell_type_counts.txt` | Cell‑type abundance table |
| *(Optional)* CELLEX outputs | Saved in `cellex_out/` when run |

---

## 🔧 Environment Setup

In [None]:
import scanpy as sc
import pandas as pd
from pathlib import Path

## 📂 Define Paths

In [None]:
DATA_DIR   = Path('data/Smillie_2019')
OUTPUT_DIR = Path('output/Smillie_2019')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

RAW_H5AD   = DATA_DIR/'2019_Smillie_normal_cellxgene.h5ad'
GENE_MATRIX = Path('data')/'geneMatrix.tsv.gz'

OUT_H5AD   = OUTPUT_DIR/'2019_Smillie_normal_cellxgene.h5ad'
PC_H5AD    = OUTPUT_DIR/'2019_Smillie_normal_cellxgene_pc.h5ad'
COV_ALL    = OUTPUT_DIR/'2019_Smillie_normal_cellxgene.cov'
COV_PC     = OUTPUT_DIR/'2019_Smillie_normal_cellxgene_pc.cov'
COUNTS_TXT = OUTPUT_DIR/'2019_Smillie_normal_cellxgene.cell_type_counts.txt'
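
Before loading anything, it can help to fail early if an input file is missing. The sketch below is one way to do that; `require_files` is a hypothetical helper and is not part of the original notebook.

```python
from pathlib import Path

def require_files(*paths):
    """Return the given paths as Path objects, raising if any is missing."""
    resolved = [Path(p) for p in paths]
    missing = [str(p) for p in resolved if not p.is_file()]
    if missing:
        raise FileNotFoundError('Missing input file(s): ' + ', '.join(missing))
    return resolved
```

In this notebook it would be called as `require_files(RAW_H5AD, GENE_MATRIX)` right after the paths are defined.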

## 🧬 Load AnnData

In [None]:
sce = sc.read_h5ad(RAW_H5AD)
print(sce)
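
The later cells assume that `sce.obs` contains the columns `cell_type`, `donor_id`, `CellType`, and `cell_type_ontology_term_id`. A minimal sketch of a check for those columns follows; `missing_obs_columns` is a hypothetical helper, and the toy table stands in for `sce.obs` with made-up values.

```python
import pandas as pd

# Columns that the later cells of this notebook read from `sce.obs`.
REQUIRED_OBS = ['cell_type', 'donor_id', 'CellType', 'cell_type_ontology_term_id']

def missing_obs_columns(obs, required=REQUIRED_OBS):
    """Return the required column names that are absent from `obs`."""
    return [col for col in required if col not in obs.columns]

# Toy stand-in for `sce.obs` (values are made up for illustration).
toy_obs = pd.DataFrame({'cell_type': ['enterocyte', 'B cell'],
                        'donor_id': ['N10', 'N11']})
print(missing_obs_columns(toy_obs))
# ['CellType', 'cell_type_ontology_term_id']
```

Calling `missing_obs_columns(sce.obs)` on the real object should return an empty list if the file matches what the notebook expects.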

## 🧹 Clean Cell‑Type Labels

In [None]:
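# Replace spaces, hyphens, and commas with underscores so the labels are
# safe to use in file names and column names downstream.
# regex=False treats each pattern as a literal string in every pandas version.
sce.obs['cell_type'] = (
    sce.obs['cell_type']
      .str.replace(' ', '_', regex=False)
      .str.replace('-', '_', regex=False)
      .str.replace(',', '_', regex=False))
print('Example cell types:', sce.obs['cell_type'].unique()[:10])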
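
The chained replacements above map each separator character to its own underscore, so a label containing `, ` ends up with a double underscore. If single underscores are preferred, one possible variant uses a regular expression. This sketch is an alternative, not the notebook's method; `sanitize_label` is a hypothetical helper.

```python
import re

def sanitize_label(label):
    """Replace each run of spaces, hyphens, or commas with a single underscore."""
    return re.sub(r'[ ,\-]+', '_', label.strip())

print(sanitize_label('CD8-positive, alpha-beta T cell'))
# CD8_positive_alpha_beta_T_cell
```

Switching to this variant changes the label strings, so any downstream file that keys on the original cleaned labels would need to be regenerated.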

### View Unique Cell‑Type Mapping

In [None]:
unique_combo = sce.obs[['CellType','cell_type','cell_type_ontology_term_id']].drop_duplicates().reset_index(drop=True)
unique_combo.head()

### Save Cleaned AnnData & Cell‑Type Counts

In [None]:
sce.write(OUT_H5AD)
ct_counts = sce.obs['cell_type'].value_counts()
ct_counts.to_csv(COUNTS_TXT, sep='\t', header=True)
ct_counts.head()

## 📊 Covariate File (All Genes)

In [None]:
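import numpy as np

# Covariates: intercept, number of genes detected per cell, donor indicators.
cov = pd.DataFrame(index=sce.obs.index)
cov['const'] = 1
# On a sparse matrix, .sum(axis=1) returns an (n_cells, 1) matrix; flatten it to 1-D.
cov['n_genes'] = np.asarray((sce.X > 0).sum(axis=1)).ravel()
# Donor N10 is the reference level: omitting its indicator keeps the donor
# columns from being collinear with `const`.
for donor in sorted(sce.obs['donor_id'].unique()):
    if donor != 'N10':
        cov[f'donor_{donor}'] = (sce.obs['donor_id'] == donor).astype(int)

cov.to_csv(COV_ALL, sep='\t')
print('Cov file saved:', COV_ALL.name)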
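
The donor loop above is reference-level dummy coding: every donor except one gets a 0/1 column, which keeps the design matrix full rank alongside the intercept. The sketch below reproduces the idea on toy data and checks the rank. `donor_covariates` is a hypothetical helper, and the donor IDs other than `N10` are made up.

```python
import numpy as np
import pandas as pd

def donor_covariates(donor_ids, reference):
    """Intercept plus one 0/1 indicator per donor, omitting `reference`."""
    donor_ids = pd.Series(donor_ids).astype(str)
    cov = pd.DataFrame({'const': 1}, index=donor_ids.index)
    for donor in sorted(donor_ids.unique()):
        if donor != reference:
            cov[f'donor_{donor}'] = (donor_ids == donor).astype(int)
    return cov

toy = donor_covariates(['N10', 'N11', 'N11', 'N12'], reference='N10')
# Full column rank means no covariate is a linear combination of the others.
assert np.linalg.matrix_rank(toy.to_numpy(dtype=float)) == toy.shape[1]
print(list(toy.columns))
# ['const', 'donor_N11', 'donor_N12']
```

If an indicator for the reference donor were added as well, the donor columns would sum to `const` and the matrix would lose a rank.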

## 🧩 Protein‑Coding Subset & Covariate

In [None]:
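import numpy as np

# Keep only genes annotated as protein-coding in the GRCh38 coordinate file.
# The `Gene` column must use the same identifiers as `sce.var_names`.
gene_coords = pd.read_csv(GENE_MATRIX, sep='\t', compression='infer')
pc_set = set(gene_coords.loc[gene_coords['gene_type'] == 'protein_coding', 'Gene'])
subset_pc = sce[:, sce.var_names.isin(pc_set)].copy()
subset_pc.write(PC_H5AD)

# Rebuild the covariates on the subset; n_genes now counts protein-coding genes only.
cov_pc = pd.DataFrame(index=subset_pc.obs.index)
cov_pc['const'] = 1
# Flatten the (n_cells, 1) result that a sparse matrix returns.
cov_pc['n_genes'] = np.asarray((subset_pc.X > 0).sum(axis=1)).ravel()
# Donor N10 is again the reference level.
for donor in sorted(subset_pc.obs['donor_id'].unique()):
    if donor != 'N10':
        cov_pc[f'donor_{donor}'] = (subset_pc.obs['donor_id'] == donor).astype(int)

cov_pc.to_csv(COV_PC, sep='\t')
print('Protein-coding cov file saved:', COV_PC.name)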
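
The protein-coding filter silently returns an empty or tiny subset if the identifiers in the annotation file do not match `sce.var_names`. For example, CELLxGENE files often index genes by Ensembl ID while an annotation table may list gene symbols. A minimal sketch of an overlap check follows; `overlap_report` is a hypothetical helper, and the toy inputs are for illustration only.

```python
def overlap_report(var_names, gene_set):
    """Return (n_matched, fraction_matched) of `var_names` found in `gene_set`."""
    names = list(var_names)
    matched = sum(name in gene_set for name in names)
    return matched, (matched / len(names) if names else 0.0)

# Toy example: Ensembl IDs in the data versus gene symbols in the annotation.
n, frac = overlap_report(['ENSG00000141510', 'ENSG00000012048'], {'TP53', 'BRCA1'})
print(n, frac)  # 0 0.0: the identifiers do not match, so the subset would be empty
```

Running `overlap_report(sce.var_names, pc_set)` before subsetting gives a quick indication of whether the two identifier schemes agree.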