# 🪄 Fasolino *et al.* (2022) Pancreas snRNA‑seq Workflow

## 📖 Introduction & Data Sources  
* **Paper**: Fasolino M, *et al.* (2022) *Nat Metab*, “Single-cell multi-omics analysis of human pancreatic islets reveals novel cellular states in type 1 diabetes”  
* **Portal**: <https://cellxgene.cziscience.com/> (collection identifier as per study)  
* **Scope**: Human pancreas single‑nucleus RNA‑seq; healthy samples only.

### Workflow Outputs
| File | Description |
| --- | --- |
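| `Fasolino_2022_expr_gene_withPos.h5ad` | Expression matrix restricted to genes with genomic positions and to cell types with ≥ 20 cells |
| `Fasolino_2022_pc.h5ad` | Protein‑coding subset of the file above |
| `Fasolino_2022_expr_gene_withPos.cov` | scDRS covariates (tab‑separated) |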
| Gene lists | `*_allgene_list.csv`, `*_pcgene_list.csv` |
| Cell‑type hierarchy | `Fasolino_2022_celltypes_levels.txt` |

All paths below are built from two variables, `DATA_DIR` and `OUTPUT_DIR`, which default to relative paths.  
Edit them to point at your dataset locations.


## 🔧 Environment Setup

In [None]:
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

## 📂 Define Input / Output Paths

In [None]:
DATA_DIR   = Path('data/Fasolino')          # raw .h5ad + metadata
OUTPUT_DIR = Path('output/Fasolino')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

RAW_H5AD       = DATA_DIR/'Fasolino_2022_raw.h5ad'   # rename as appropriate
GENE_MATRIX    = Path('data')/'geneMatrix.tsv.gz'     # GRCh38 coords

## 🧬 Load Raw AnnData

In [None]:
adata = sc.read_h5ad(RAW_H5AD)
print(adata)
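
The scope stated in the introduction is healthy samples only, yet no cell in this notebook applies that filter, so the raw file is presumably already restricted. If it is not, a filter along these lines could be run right after loading. This is a sketch on toy metadata: the `disease` column and the `'normal'` label follow the CELLxGENE schema convention and are assumptions about this file, and `healthy_mask` is a hypothetical helper.

```python
import pandas as pd

def healthy_mask(obs: pd.DataFrame, col: str = "disease", healthy: str = "normal") -> pd.Series:
    """Boolean mask of cells whose `col` equals the healthy label (both names are assumptions)."""
    if col not in obs.columns:
        raise KeyError(f"{col!r} not found in obs; available columns: {list(obs.columns)}")
    return obs[col].astype(str) == healthy

# Toy stand-in for adata.obs
toy_obs = pd.DataFrame(
    {"disease": ["normal", "type 1 diabetes mellitus", "normal"]},
    index=["cell1", "cell2", "cell3"],
)
mask = healthy_mask(toy_obs)
print(int(mask.sum()), "of", len(mask), "cells are labelled healthy")

# On the real object this would be applied as:
# adata = adata[healthy_mask(adata.obs).values].copy()
```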

## 🔁 Verify Gene IDs (Assumed ENSG)

In [None]:
# If symbols present, add mapping code here; else continue
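
The cell above only assumes that `var_names` hold Ensembl gene IDs. One way to test that assumption before filtering is to measure what fraction of identifiers match the `ENSG` pattern. The helpers `ensembl_fraction` and `strip_version` are hypothetical, and stripping version suffixes is only needed if the coordinate table uses unversioned IDs, which is an assumption here.

```python
import pandas as pd

# ENSG followed by 11 digits, with an optional ".<version>" suffix
ENSG_PATTERN = r"^ENSG\d{11}(\.\d+)?$"

def ensembl_fraction(ids) -> float:
    """Fraction of identifiers that look like human Ensembl gene IDs."""
    ids = pd.Index(ids).astype(str)
    if len(ids) == 0:
        return 0.0
    return float(ids.str.match(ENSG_PATTERN).mean())

def strip_version(ids) -> pd.Index:
    """Remove a trailing '.<digits>' version suffix, leaving other names untouched."""
    return pd.Index(ids).astype(str).str.replace(r"\.\d+$", "", regex=True)

toy_ids = ["ENSG00000254647", "ENSG00000115263.15", "INS"]
frac = ensembl_fraction(toy_ids)
stripped = strip_version(toy_ids)
print(f"{frac:.2f} of IDs match the ENSG pattern")

# On the real object: ensembl_fraction(adata.var_names)
```

A fraction well below 1 suggests that the index holds gene symbols and that a symbol-to-ID mapping step is needed before the coordinate filter.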

## 🧬 Keep Genes with Genomic Coordinates

In [None]:
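# Gene coordinate table; the 'Gene' column is expected to hold the same IDs as adata.var_names
gene_coords = pd.read_csv(GENE_MATRIX, sep='\t', compression='infer')
# A pandas Series has no .intersection(), so wrap the IDs in an Index first
valid_genes = pd.Index(gene_coords['Gene'].astype(str)).intersection(adata.var_names)
adata = adata[:, adata.var_names.isin(valid_genes)].copy()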
print('Genes after coord filter:', adata.n_vars)
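
It can help to see which genes the coordinate filter removes, not just how many remain. The sketch below reproduces the filter on toy identifiers; only the `Gene` column name comes from this notebook, and the IDs are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the coordinate table and adata.var_names
toy_coords = pd.DataFrame({"Gene": ["ENSG01", "ENSG02", "ENSG03"]})
toy_var_names = pd.Index(["ENSG02", "ENSG03", "ENSG99"])

coord_ids = pd.Index(toy_coords["Gene"].astype(str)).unique()
kept = toy_var_names.intersection(coord_ids)    # genes with coordinates
dropped = toy_var_names.difference(coord_ids)   # genes that would be removed

print(f"kept {len(kept)} genes, dropped {len(dropped)}: {list(dropped)}")
```

A large `dropped` set usually points to an identifier mismatch (symbols versus Ensembl IDs, or versioned versus unversioned IDs) rather than to genes that truly lack coordinates.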

## 🧹 Filter Cell Types with < 20 Cells

In [None]:
label_col = 'cell_type'  # adjust based on .obs column name
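# Replace whitespace, commas and hyphens with underscores (raw string avoids invalid-escape warnings)
adata.obs[label_col] = adata.obs[label_col].str.replace(r'[\s,\-]', '_', regex=True)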
keep = adata.obs[label_col].value_counts()[lambda s: s>=20].index
adata = adata[adata.obs[label_col].isin(keep)].copy()
print('Cells retained:', adata.n_obs)
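
The label clean-up replaces each matched character separately, so a label such as `T cell, CD8-positive` ends up with a double underscore. The toy example below shows that behaviour next to an alternative pattern that collapses runs of separators, then applies the minimum-count filter. The labels and the threshold of 2 are illustrative only; the notebook itself uses 20.

```python
import pandas as pd

labels = pd.Series(["alpha cell"] * 3 + ["T cell, CD8-positive"] * 2 + ["beta cell"])

# Same pattern as the notebook: one underscore per matched character
clean = labels.str.replace(r"[\s,\-]", "_", regex=True)
# Alternative (not what the notebook does): collapse consecutive separators
collapsed = labels.str.replace(r"[\s,\-]+", "_", regex=True)

MIN_CELLS = 2  # toy threshold
counts = clean.value_counts()
keep = counts[counts >= MIN_CELLS].index
filtered = clean[clean.isin(keep)]

print(clean.iloc[3], "vs", collapsed.iloc[3])
print(sorted(filtered.unique()))
```

Whichever pattern is used, the resulting labels must match the names used in any downstream cell-type hierarchy file.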

## 💾 Save Expression + Gene‑Position AnnData

In [None]:
adata.write(OUTPUT_DIR/'Fasolino_2022_expr_gene_withPos.h5ad')

## 📊 Generate scDRS Covariate File

In [None]:
cov = pd.DataFrame(index=adata.obs_names)
cov['const']=1
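# Number of detected genes per cell; np.asarray(...).ravel() gives a flat vector for sparse or dense X
cov['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()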
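# Assumes a 'donor' column in .obs. The first donor serves as the reference level:
# with 'const' present, an indicator for every donor would be linearly dependent.
for donor in sorted(adata.obs['donor'].unique())[1:]:
    cov[f'donor_{donor}'] = (adata.obs['donor'] == donor).astype(int)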

cov.to_csv(OUTPUT_DIR/'Fasolino_2022_expr_gene_withPos.cov', sep='\t')
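
A constant column together with one indicator per donor level is linearly dependent, which can make covariate regression ill-posed. The sketch below shows a rank check on toy data that could be applied to `cov` before it is written. `is_full_rank` is a hypothetical helper, and the donor names and gene counts are made up.

```python
import numpy as np
import pandas as pd

def is_full_rank(cov: pd.DataFrame) -> bool:
    """True if the covariate columns are linearly independent."""
    m = cov.to_numpy(dtype=float)
    return bool(np.linalg.matrix_rank(m) == m.shape[1])

donors = pd.Series(["D1", "D1", "D2", "D2", "D3", "D3"], name="donor")

# const + an indicator for every donor + n_genes
full = pd.get_dummies(donors, prefix="donor").astype(int)
full.insert(0, "const", 1)
full["n_genes"] = [900, 1100, 1000, 1200, 950, 1300]

# Same matrix with one donor dropped as the reference level
reduced = full.drop(columns="donor_D1")

print("all donor indicators:", is_full_rank(full))
print("one donor dropped:   ", is_full_rank(reduced))
```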

## 🧩 Subset to Protein‑Coding Genes

In [None]:
# Select protein-coding gene IDs from the coordinate table ('gene_type' column)
pc_genes = gene_coords.loc[gene_coords['gene_type'] == 'protein_coding', 'Gene'].astype(str)
adata_pc = adata[:, adata.var_names.isin(pc_genes)].copy()
adata_pc.write(OUTPUT_DIR/'Fasolino_2022_pc.h5ad')

## 🗃️ Export Gene Lists

In [None]:
pd.Series(adata.var_names, name='Gene').to_csv(OUTPUT_DIR/'Fasolino_2022_allgene_list.csv', index=False)
pd.Series(adata_pc.var_names, name='Gene').to_csv(OUTPUT_DIR/'Fasolino_2022_pcgene_list.csv', index=False)

## 🗂️ Export Cell‑Type Hierarchy (Optional)

In [None]:
# adata.obs[['cell_type','broad_cell_type']].drop_duplicates().to_csv(...)
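
The commented line above leaves the export call unspecified. A minimal sketch of one way to produce the two-level table follows, run on toy metadata. The tab separator, the column order, and the temporary output directory are assumptions; the column names come from the comment above and the file name from the outputs table.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for adata.obs
toy_obs = pd.DataFrame({
    "cell_type": ["alpha_cell", "alpha_cell", "beta_cell", "acinar_cell"],
    "broad_cell_type": ["endocrine", "endocrine", "endocrine", "exocrine"],
})

# One row per (broad, fine) pair, sorted for a stable file
levels = (
    toy_obs[["broad_cell_type", "cell_type"]]
    .drop_duplicates()
    .sort_values(["broad_cell_type", "cell_type"])
    .reset_index(drop=True)
)

out_dir = Path(tempfile.mkdtemp())  # stand-in for OUTPUT_DIR
out_path = out_dir / "Fasolino_2022_celltypes_levels.txt"
levels.to_csv(out_path, sep="\t", index=False)
print(out_path.read_text())
```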