# 🧑🏻‍⚕️ Guilliams *et al.* (2022) Human Liver Cell Atlas Workflow

## 📖 Overview  
* **Paper**: Guilliams *et al.* (2022) — Human Liver Cell Atlas  
* **Download**: <https://livercellatlas.org/datasets_human.php> &nbsp;|&nbsp; <https://cellxgene.cziscience.com/collections/74e10dc4-cbb2-4605-a189-8a1cd8e44d8c>  
* **Dataset**: `human_liver_atlas_Guilliams_2022_cell.h5ad` (normalized counts)

### Outputs
| File | Description |
| --- | --- |
| `human_liver_atlas_Guilliams_2022_cell.h5ad` | Cleaned AnnData (all genes) |
| `human_liver_atlas_Guilliams_2022_cell_pc.h5ad` | Protein‑coding subset |
| `*.cov` | scDRS covariates (all genes / protein‑coding) |
| `*.cell_type_counts.txt` | Cell‑type counts |
| `cellex_out/` | CELLEX scores (CSV files) |


## 🔧 Environment Setup

In [None]:
import scanpy as sc
import pandas as pd
from pathlib import Path

## 📂 Define Paths

In [None]:
DATA_DIR   = Path('data/Guilliams_2022')
OUTPUT_DIR = Path('output/Guilliams_2022')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

RAW_H5AD   = DATA_DIR/'human_liver_atlas_Guilliams_2022_cell.h5ad'
GENE_MATRIX= Path('data')/'geneMatrix.tsv.gz'

OUT_H5AD   = OUTPUT_DIR/'human_liver_atlas_Guilliams_2022_cell.h5ad'
PC_H5AD    = OUTPUT_DIR/'human_liver_atlas_Guilliams_2022_cell_pc.h5ad'
COV_ALL    = OUTPUT_DIR/'human_liver_atlas_Guilliams_2022_cell.cov'
COV_PC     = OUTPUT_DIR/'human_liver_atlas_Guilliams_2022_cell_pc.cov'
COUNTS_TXT = OUTPUT_DIR/'human_liver_atlas_Guilliams_2022_cell.cell_type_counts.txt'
CELLEX_DIR = OUTPUT_DIR/'cellex_out'
CELLEX_DIR.mkdir(exist_ok=True)

## 🧬 Load AnnData & Clean Cell‑Type Labels

In [None]:
sce = sc.read_h5ad(RAW_H5AD)
# Clean cell_type labels
sce.obs['cell_type'] = (
    sce.obs['cell_type']
      .str.replace(' ', '_', regex=False)
      .str.replace('-', '_', regex=False)
      .str.replace(',', '_', regex=False))
print('Example cell types:', sce.obs['cell_type'].unique()[:10])
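
The same cleanup can be written as a single character-class substitution. The sketch below uses only the standard library and made-up labels; `clean_label` is a hypothetical helper, not part of the workflow. It replaces each space, hyphen, or comma individually, so a comma followed by a space yields two underscores, just as the chained replacements above do.

```python
import re

def clean_label(label: str) -> str:
    """Replace every space, comma, or hyphen with an underscore."""
    return re.sub(r"[ ,-]", "_", label)

# Made-up labels for illustration; the real ones come from sce.obs['cell_type'].
labels = ["Kupffer cells", "T-cells", "NK, NKT cells"]
cleaned = [clean_label(x) for x in labels]
print(cleaned)
```

On a pandas column, the equivalent would be `.str.replace(r'[ ,-]', '_', regex=True)`.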

### Save Cell‑Type Counts & Cleaned AnnData

In [None]:
sce.write(OUT_H5AD)
ct_counts = sce.obs['cell_type'].value_counts()
ct_counts.to_csv(COUNTS_TXT, sep='\t', header=True)
ct_counts.head()

## 📊 Covariate File (All Genes)

In [None]:
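import numpy as np

cov = pd.DataFrame(index=sce.obs.index)
cov['const'] = 1
# Genes detected per cell; np.asarray(...).ravel() flattens the (n_cells, 1) result a sparse X returns
cov['n_genes'] = np.asarray((sce.X > 0).sum(axis=1)).ravel()
# One indicator column per donor; H02 is omitted as the reference level
# so the indicators are not collinear with 'const'
for donor in sorted(sce.obs['donor_id'].unique()):
    if donor != 'H02':
        cov[f'donor_{donor}'] = (sce.obs['donor_id'] == donor).astype(int)
cov.to_csv(COV_ALL, sep='\t')
print('Saved', COV_ALL.name)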
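
Because the covariate table is built twice (all genes and protein‑coding), the logic could be factored into one function. This is a minimal sketch, assuming pandas is available; `donor_covariates` is a hypothetical helper, and the barcodes, gene counts, and the donor IDs `H06` and `H10` are invented for the toy input.

```python
import pandas as pd

def donor_covariates(donor_ids: pd.Series, n_genes, reference: str) -> pd.DataFrame:
    """Build const + n_genes + donor indicator columns, omitting `reference`."""
    cov = pd.DataFrame(index=donor_ids.index)
    cov["const"] = 1
    cov["n_genes"] = n_genes
    # One 0/1 column per donor, named donor_<id>
    dummies = pd.get_dummies(donor_ids.astype(str), prefix="donor", dtype=int)
    # Dropping the reference donor raises KeyError if it is absent,
    # which surfaces a wrong reference ID early.
    dummies = dummies.drop(columns=f"donor_{reference}")
    return cov.join(dummies)

# Toy input; values are made up.
donors = pd.Series(["H02", "H06", "H02", "H10"], index=["c1", "c2", "c3", "c4"])
toy_cov = donor_covariates(donors, n_genes=[1200, 950, 1800, 1430], reference="H02")
print(toy_cov)
```

In the notebook this would be called with `sce.obs['donor_id']` and the per-cell gene counts, once for each AnnData object.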

## 🧩 Protein‑Coding Subset & Covariate

In [None]:
import numpy as np

# Gene annotation table; keep the names of protein-coding genes only
gene_coords = pd.read_csv(GENE_MATRIX, sep='\t', compression='infer')
pc_set = set(gene_coords.loc[gene_coords['gene_type'] == 'protein_coding', 'Gene'])
subset_pc = sce[:, sce.var_names.isin(pc_set)].copy()
subset_pc.write(PC_H5AD)

cov_pc = pd.DataFrame(index=subset_pc.obs.index)
cov_pc['const'] = 1
# Genes detected per cell within the protein-coding subset
cov_pc['n_genes'] = np.asarray((subset_pc.X > 0).sum(axis=1)).ravel()
# Donor indicators; H02 is omitted as the reference level
for donor in sorted(subset_pc.obs['donor_id'].unique()):
    if donor != 'H02':
        cov_pc[f'donor_{donor}'] = (subset_pc.obs['donor_id'] == donor).astype(int)
cov_pc.to_csv(COV_PC, sep='\t')
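
The `isin` filter matches gene identifiers exactly, so if `var_names` and the `Gene` column used different naming schemes the subset would silently come out empty or very small. The sketch below shows one way to report the overlap before subsetting; `overlap_report` is a hypothetical helper and the gene lists are a toy example.

```python
def overlap_report(var_names, pc_genes):
    """Count how many dataset genes appear in the protein-coding set."""
    var = set(var_names)
    matched = var & set(pc_genes)
    fraction = len(matched) / len(var) if var else 0.0
    return {"n_var": len(var), "n_matched": len(matched), "fraction": fraction}

# Toy example: two of the four dataset genes are in the protein-coding set.
report = overlap_report(
    var_names=["ALB", "APOA1", "MALAT1", "XIST"],
    pc_genes={"ALB", "APOA1", "TP53"},
)
print(report)
```

In the notebook this could be called as `overlap_report(sce.var_names, pc_set)`; a fraction near zero would suggest an identifier mismatch rather than a real absence of protein‑coding genes.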

## 🔬 Run CELLEX (All Genes & PC)

In [None]:
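# Requires the cellex package and may take a long time; uncomment to run.
# Note: a.to_df() builds a dense cells x genes DataFrame, so memory use can be high.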
# import cellex
# import numpy as np
# from cellex import ESObject
# def run_cellex(a, prefix):
#     df = a.to_df().T
#     eso = ESObject(data=df, annotation=a.obs['cell_type'], normalize=False, verbose=True)
#     eso.compute(verbose=True)
#     eso.save_as_csv(keys=['all'], path=CELLEX_DIR, file_prefix=prefix, verbose=True)
# run_cellex(sce, 'human_liver_atlas_Guilliams_2022_cell')
# run_cellex(subset_pc, 'human_liver_atlas_Guilliams_2022_cell_pc')