# 🦴 Fetal Bone Marrow scRNA‑seq Processing Workflow
This notebook preprocesses the **Jardine et al. (2021) Nature** fetal bone‐marrow dataset to produce analysis‑ready AnnData objects and companion files.

## 📖 Introduction & Data Sources

* **Paper**: Jardine *et al.* (2021) *Nature* — “Blood and immune development in human fetal bone marrow and Down syndrome”  
* **Portal**: <https://developmental.cellatlas.io/fetal-bone-marrow>  
* **Assays**: 10x 3′, 10x 5′, 10x VDJ‑TCR, 10x VDJ‑BCR  
* **Cells**: 103 228   |   **Genes**: 33 712 (before filtering)

### Outputs Generated

| Purpose | Output file |
| --- | --- |
| Expression + genomic positions | `Human_fetal_BM10x_expr_gene_withPos.h5ad` |
| Protein‑coding subset | `Human_fetal_BM10x_pc.h5ad` |
| scDRS covariate file | `Human_fetal_BM10x_expr_gene_withPos.cov` |
| Gene‑type distribution plot | `Human_fetal_BM10x_gene_type_distribution.png` |
| Gene lists (all / pc) | `Human_fetal_BM10x_allgene_list.csv`, `Human_fetal_BM10x_pcgene_list.csv` |
| Unique cell‑label pairs | `Human_fetal_BM10x_unique_cell_label_pairs.csv` |
| Cell counts per label (before the <20‑cell filter) | `cell_labels_counts.txt` |

> **Path Policy**: All filesystem references below are **relative paths** built from `DATA_DIR` and `OUTPUT_DIR`.  
> Adjust these two variables to suit your environment; no absolute locations are hard-coded.


## 🔧 Environment Setup

In [None]:
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

## 📂 Define Input / Output Paths

In [None]:
# Edit these directories as needed
DATA_DIR   = Path('data/FBM')           # Holds raw .h5ad
OUTPUT_DIR = Path('output/FBM')         # Will store processed files
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

RAW_H5AD          = DATA_DIR / 'Human_fetal_BM10x.h5ad'
GENE_MATRIX_PATH  = Path('data') / 'geneMatrix.tsv.gz'   # 56 778 genes with coordinates
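
Before loading anything, it can help to confirm that both input files are where the paths say they are. The following is a minimal sketch; `missing_inputs` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]

# Same relative paths as defined in the cell above
RAW_H5AD = Path('data/FBM') / 'Human_fetal_BM10x.h5ad'
GENE_MATRIX_PATH = Path('data') / 'geneMatrix.tsv.gz'

missing = missing_inputs([RAW_H5AD, GENE_MATRIX_PATH])
if missing:
    print('Missing input files:', ', '.join(str(p) for p in missing))
```

Reporting all missing files at once avoids fixing one path only to fail on the next.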

## 📑 Load Raw AnnData

In [None]:
sce = sc.read_h5ad(RAW_H5AD)
print(sce)

## 🔁 Convert Gene IDs → Ensembl (ENSG)

In [None]:
# Gene symbols are assumed to be the current var index, with Ensembl IDs
# stored in the var column 'gene_ids-1'; keep the symbols as a backup column
sce.var['symbol_gene_name'] = sce.var.index

# Replace the index with the Ensembl IDs from 'gene_ids-1'
sce.var.index = sce.var['gene_ids-1'].astype(str)
# Drop the now‑redundant column
sce.var.drop(columns=['gene_ids-1'], inplace=True)
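
Replacing the index assumes every gene has exactly one Ensembl ID. The sketch below checks that assumption on a toy stand-in for `sce.var`; the table contents and the choice to drop problem rows are illustrative assumptions, not something the original notebook does.

```python
import pandas as pd

# Toy stand-in for `sce.var`: symbols in the index, Ensembl IDs in 'gene_ids-1'
var = pd.DataFrame(
    {'gene_ids-1': ['ENSG00000141510', 'ENSG00000012048', 'ENSG00000141510', None]},
    index=['TP53', 'BRCA1', 'TP53_dup', 'NOVEL1'],
)

ids = var['gene_ids-1']
n_missing = int(ids.isna().sum())
n_duplicated = int(ids.dropna().duplicated().sum())
print(f'missing IDs: {n_missing} | duplicated IDs: {n_duplicated}')

# Keep the first occurrence of each non-missing ID, then swap the index
ok = ids.notna() & ~ids.duplicated(keep='first')
var_clean = var.loc[ok].copy()
var_clean['symbol_gene_name'] = var_clean.index
var_clean.index = var_clean['gene_ids-1'].astype(str)
var_clean = var_clean.drop(columns=['gene_ids-1'])
print(var_clean)
```

Checking before calling `.astype(str)` matters because the cast may turn missing values into ordinary-looking strings that are harder to spot afterwards.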

## 🧬 Keep Genes with Genomic Coordinates

In [None]:
gene_coords = pd.read_csv(GENE_MATRIX_PATH, sep='\t', compression='infer')
# Index.intersection returns the Ensembl IDs present in both tables
common_genes = pd.Index(gene_coords['Gene'].astype(str)).intersection(sce.var_names)
print(f'Genes before: {sce.n_vars} | after coordinate filter: {common_genes.size}')
adata = sce[:, sce.var_names.isin(common_genes)].copy()
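
The coordinate filter boils down to an overlap between two sets of IDs followed by a boolean mask over the genes. A toy sketch of that logic, using made-up placeholder IDs and assuming the coordinate table has a `Gene` column as above:

```python
import pandas as pd

# Toy coordinate table and toy var_names (placeholder IDs, not real genes)
gene_coords = pd.DataFrame({
    'Gene': ['ENSG01', 'ENSG02', 'ENSG03'],
    'gene_type': ['protein_coding', 'lncRNA', 'protein_coding'],
})
var_names = pd.Index(['ENSG02', 'ENSG03', 'ENSG99'])

# Overlap of the two ID sets, computed on pandas Index objects
common = pd.Index(gene_coords['Gene'].astype(str)).intersection(var_names)

# Boolean mask in var_names order, suitable for column-wise subsetting
mask = var_names.isin(common)
print(f'{mask.sum()} of {len(var_names)} genes have coordinates')
```

The mask keeps the original gene order of the expression matrix, which is why the subsetting step uses `isin` rather than indexing by the intersection directly.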

## 🕵️‍♀️ Remove Low‑Frequency Cell Types (<20 cells)

In [None]:
adata.obs['cell.labels'] = adata.obs['cell.labels'].str.replace(' ', '_', regex=False)
cell_counts = adata.obs['cell.labels'].value_counts()
cell_counts.to_csv(OUTPUT_DIR/'cell_labels_counts.txt', sep='\t')
keep_types = cell_counts[cell_counts >= 20].index
adata = adata[adata.obs['cell.labels'].isin(keep_types)].copy()
print(f'Retained {adata.n_obs} cells • {adata.obs["cell.labels"].nunique()} cell types')
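
To see what the threshold does, the same steps can be run on a toy `obs` table. The label names and counts below are invented for illustration, and `MIN_CELLS` is a hypothetical name for the cut-off of 20 used in this notebook.

```python
import pandas as pd

# Toy per-cell annotations: two well-populated types and one rare type
obs = pd.DataFrame(
    {'cell.labels': ['HSC'] * 25 + ['pro B'] * 22 + ['rare type'] * 3}
)
MIN_CELLS = 20  # hypothetical constant for the notebook's threshold

# Same normalisation as above: spaces become underscores
obs['cell.labels'] = obs['cell.labels'].str.replace(' ', '_', regex=False)

counts = obs['cell.labels'].value_counts()
keep_types = counts[counts >= MIN_CELLS].index
dropped = counts[counts < MIN_CELLS]

obs_kept = obs[obs['cell.labels'].isin(keep_types)]
print(f'kept {len(obs_kept)} cells; dropped types: {dropped.to_dict()}')
```

Listing the dropped types alongside the retained count makes it easier to judge whether the threshold removes anything of biological interest.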

## 💾 Save Expression + Gene‑Position AnnData

In [None]:
adata.write(OUTPUT_DIR/'Human_fetal_BM10x_expr_gene_withPos.h5ad')

## 📊 Generate scDRS Covariate File (.cov)

In [None]:
cov = pd.DataFrame(index=adata.obs_names)
cov['const'] = 1
# Flatten to 1-D: summing a sparse matrix returns an (n_cells, 1) matrix
cov['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
for donor in sorted(adata.obs['orig.ident'].unique()):
    cov[f'donor_{donor}'] = (adata.obs['orig.ident']==donor).astype(int)

cov.to_csv(OUTPUT_DIR/'Human_fetal_BM10x_expr_gene_withPos.cov', sep='\t')
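
The covariate table has one row per cell: a constant, the number of detected genes, and one indicator column per donor. The sketch below rebuilds it on toy data; the donor names and counts are invented, and `pd.get_dummies` is used as an alternative to the explicit loop.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: a 4 cells x 3 genes count matrix and per-cell donor labels
X = np.array([[0, 2, 1],
              [0, 0, 5],
              [3, 1, 1],
              [0, 0, 0]])
obs = pd.DataFrame({'orig.ident': ['F21', 'F21', 'F30', 'F30']},
                   index=['c1', 'c2', 'c3', 'c4'])

cov = pd.DataFrame(index=obs.index)
cov['const'] = 1
# np.asarray(...).ravel() yields a flat vector for dense input and also for
# the (n, 1) matrix that a scipy sparse row sum returns
cov['n_genes'] = np.asarray((X > 0).sum(axis=1)).ravel()

# One 0/1 indicator column per donor, named donor_<id>
donors = pd.get_dummies(obs['orig.ident'], prefix='donor', dtype=int)
cov = cov.join(donors)
print(cov)
```

Note that `const` plus a full set of donor indicators is linearly dependent, since the indicators sum to one in every row. Whether one donor column should be dropped depends on how the downstream tool handles rank-deficient covariates, which is worth checking before use.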

## 🧩 Subset to Protein‑Coding Genes

In [None]:
pc_genes = gene_coords[gene_coords['gene_type']=='protein_coding']['Gene']
adata_pc = adata[:, adata.var_names.isin(pc_genes)].copy()
adata_pc.write(OUTPUT_DIR/'Human_fetal_BM10x_pc.h5ad')
print(adata_pc)
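
The protein-coding subset is a lookup of `gene_type` in the coordinate table followed by a mask over the retained genes. A toy sketch with placeholder IDs, assuming the `Gene` and `gene_type` columns used above:

```python
import pandas as pd

# Toy coordinate table (placeholder IDs and types)
gene_coords = pd.DataFrame({
    'Gene': ['ENSG_A', 'ENSG_B', 'ENSG_C', 'ENSG_D'],
    'gene_type': ['protein_coding', 'lncRNA', 'protein_coding', 'pseudogene'],
})
var_names = pd.Index(['ENSG_A', 'ENSG_B', 'ENSG_C'])  # genes kept so far

# IDs annotated as protein-coding, then a mask in var_names order
pc_genes = gene_coords.loc[gene_coords['gene_type'] == 'protein_coding', 'Gene']
pc_mask = var_names.isin(pc_genes)
print(f'{pc_mask.sum()} of {len(var_names)} genes are protein-coding')
```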

## 📈 Plot Gene‑Type Distribution

In [None]:
gene_type_counts = gene_coords[gene_coords['Gene'].isin(adata.var_names)]['gene_type'].value_counts()
ax = gene_type_counts.plot(kind='bar', figsize=(8,6))
ax.set_ylabel('Number of genes')
ax.set_xlabel('Gene type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig(OUTPUT_DIR/'Human_fetal_BM10x_gene_type_distribution.png')
plt.show()

## 🗂️ Export Unique Cell‑Label Pairs

In [None]:
labels_df = adata.obs[['cell.labels','broad_fig1_cell.labels']].drop_duplicates()
labels_df.to_csv(OUTPUT_DIR/'Human_fetal_BM10x_unique_cell_label_pairs.csv',index=False)
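
A quick consistency check on the exported pairs is whether every fine label maps to exactly one broad label. The sketch below runs that check on a toy `obs` table; the label values are invented, and the check itself is an addition, not part of the original notebook.

```python
import pandas as pd

# Toy annotations with the two label columns used above
obs = pd.DataFrame({
    'cell.labels': ['HSC', 'HSC', 'pro_B', 'pre_B', 'pro_B'],
    'broad_fig1_cell.labels': ['HSC_MPP', 'HSC_MPP', 'B_lineage',
                               'B_lineage', 'B_lineage'],
})

pairs = obs[['cell.labels', 'broad_fig1_cell.labels']].drop_duplicates()

# Fine labels that appear with more than one broad label
n_broad = pairs.groupby('cell.labels')['broad_fig1_cell.labels'].nunique()
ambiguous = n_broad[n_broad > 1].index.tolist()
print(f'{len(pairs)} unique pairs; ambiguous fine labels: {ambiguous}')
```

An empty `ambiguous` list means the pairs file can be used as a one-to-one lookup from fine to broad labels.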

## 🗃️ Export Gene Lists

In [None]:
pd.Series(adata.var_names,name='ensgid').to_csv(OUTPUT_DIR/'Human_fetal_BM10x_allgene_list.csv',index=False)
pd.Series(adata_pc.var_names,name='ensgid').to_csv(OUTPUT_DIR/'Human_fetal_BM10x_pcgene_list.csv',index=False)
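
Because the protein-coding list is derived from the full list, reading both files back and confirming the subset relation is a cheap sanity check. A minimal sketch with placeholder IDs and temporary file names (both assumptions for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy gene lists standing in for adata.var_names and adata_pc.var_names
all_genes = pd.Index(['ENSG_A', 'ENSG_B', 'ENSG_C'])
pc_genes = pd.Index(['ENSG_A', 'ENSG_C'])

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Same export pattern as above: one column named 'ensgid', no index
    pd.Series(all_genes, name='ensgid').to_csv(out / 'all.csv', index=False)
    pd.Series(pc_genes, name='ensgid').to_csv(out / 'pc.csv', index=False)

    all_back = pd.read_csv(out / 'all.csv')['ensgid']
    pc_back = pd.read_csv(out / 'pc.csv')['ensgid']

# Every protein-coding gene should also be in the full list
is_subset = set(pc_back) <= set(all_back)
print(f'pc list is a subset of the full list: {is_subset}')
```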