**This notebook provides a sample curation workflow of a Visium dataset towards CELLxGENE standards starting with Space Ranger outputs**
* [Read Space Ranger outputs into AnnData and curate uns](#read_vis_uns)
* [Add fullres image from file](#full_res)
* [Curate var](#var)
* [Fill in barcodes not included from the Space Ranger outputs](#fill_bar)
* [Revise any in_tissue:1 spots that have all 0s to in_tissue:0](#rev_intiss)
* [Curate obs](#obs)
* [Add cell label metadata](#cell_labels)
* [Add normalized layer](#norm_data)
* [Add non-spatial embeddings](#umap)
* [QA by plotting](#qa_plot)
* [Write to .h5ad](#write)

The example is from [He et al 2022](https://doi.org/10.1016/j.cell.2022.11.005)\
The Space Ranger outputs `6332STDY10289523.tar.gz` & fullres image `V10S24-031_D1.jpg` can be downloaded from [E-MTAB-11265](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11265)\
`6332STDY10289523.220627.h5ad`, which provides the cell2location proportions, normalized data, and non-spatial embeddings, can be downloaded from the pcw19 Dataset at the [Fetal Lung portal](https://fetal-lung.cellgeni.sanger.ac.uk/visium.html)

In [None]:
import anndata as ad
import json
import numpy as np
import os
import pandas as pd
import scanpy as sc
import squidpy as sq
from PIL import Image
from scipy import sparse

**Read Space Ranger outputs into AnnData** <a id="read_vis_uns"></a> \
Specify the Space Ranger output folder that contains at least these files...
- raw_feature_bc_matrix.h5
- spatial/
  - scalefactors_json.json
  - tissue_hires_image.png
  - tissue_lowres_image.png (will be removed but is required for read.visium)
  - tissue_positions_list.csv / tissue_positions.csv

In [None]:
sr_outs = '6332STDY10289523/outs'

Read the spot positions from tissue_positions_list.csv or tissue_positions.csv, whichever is present\
Space Ranger v2.0 onwards writes tissue_positions.csv, which includes a header row, while earlier versions write the headerless tissue_positions_list.csv

REQUIRED to include background spots, so must specify `raw_feature_bc_matrix.h5`

In [None]:
from h5py import File

count_file = f'{sr_outs}/raw_feature_bc_matrix.h5'
if os.path.exists(count_file):
    adata = sc.read_10x_h5(count_file)

    #read the library ID and run metadata from the HDF5 file attributes
    with File(count_file, mode='r') as f:
        attrs = dict(f.attrs)
    library_id = str(attrs['library_ids'][0], 'utf-8')

    #optional
    adata.uns['spatial_metadata'] = {
        'chemistry_description': attrs['chemistry_description'],
        'software_version': attrs['software_version']
    }

else: #if only mtx is available
    count_file = f'{sr_outs}/raw_feature_bc_matrix'
    adata = sc.read_10x_mtx(count_file)
    library_id = 'my library id'

hires_path = f'{sr_outs}/spatial/tissue_hires_image.png'
hires_np = np.asarray(Image.open(hires_path))

with open(f'{sr_outs}/spatial/scalefactors_json.json') as f:
    scfs = json.load(f)

#no trailing comma here, otherwise the title is stored as a tuple
adata.uns['title'] = library_id
adata.uns['spatial'] = {
    'is_single': True,
    library_id: {
        'images': {
            'hires': hires_np
        },
        'scalefactors': {
            'spot_diameter_fullres': scfs['spot_diameter_fullres'],
            'tissue_hires_scalef': scfs['tissue_hires_scalef']
        }
    }
}

if os.path.exists(sr_outs + '/spatial/tissue_positions_list.csv'):
    tpl = pd.read_csv(f'{sr_outs}/spatial/tissue_positions_list.csv', index_col=0, names=['in_tissue','array_row','array_col','pxl_row_in_fullres','pxl_col_in_fullres'])
else:
    tpl = pd.read_csv(f'{sr_outs}/spatial/tissue_positions.csv', index_col=0)
adata.obs = adata.obs.merge(tpl, left_index=True, right_index=True, how='left').set_index(adata.obs.index)

adata.obsm['spatial'] = adata.obs[['pxl_col_in_fullres', 'pxl_row_in_fullres']].to_numpy()
adata.obs.drop(columns=['pxl_col_in_fullres', 'pxl_row_in_fullres'], inplace=True)
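
The two position-file layouts can be read into one consistent table. The sketch below uses small in-memory stand-ins for the files; the helper name `read_positions` and the toy barcodes and pixel values are illustrative assumptions, not part of Space Ranger or squidpy.

```python
import io

import pandas as pd

# Column names shared by both layouts; the first column holds the barcode
POSITION_COLS = ['barcode', 'in_tissue', 'array_row', 'array_col',
                 'pxl_row_in_fullres', 'pxl_col_in_fullres']


def read_positions(handle, has_header):
    """Hypothetical helper: read either layout into a barcode-indexed table."""
    if has_header:
        # tissue_positions.csv (Space Ranger v2.0 onwards) carries a header row
        df = pd.read_csv(handle, index_col=0)
    else:
        # tissue_positions_list.csv has no header, so names are supplied
        df = pd.read_csv(handle, header=None, names=POSITION_COLS, index_col=0)
    df.index.name = 'barcode'
    return df


# Toy stand-ins for the two files (made-up values)
rows = 'AAAC-1,1,0,0,100,200\nAAAG-1,0,1,1,150,250\n'
old_style = io.StringIO(rows)
new_style = io.StringIO(','.join(POSITION_COLS) + '\n' + rows)

tpl_old = read_positions(old_style, has_header=False)
tpl_new = read_positions(new_style, has_header=True)

# Spatial coordinates are stored as (x, y), i.e. (column pixel, row pixel)
spatial = tpl_new[['pxl_col_in_fullres', 'pxl_row_in_fullres']].to_numpy()
```

Both calls should yield the same table, and `spatial` puts the column pixel first, matching the order used for `adata.obsm['spatial']` above.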

**PREFERRED to include fullres image** <a id="full_res"></a> \
This is the image input to Space Ranger, not an output\
Specify image file

In [None]:
fullres_path = 'V10S24-031_D1.jpg'

Read the image in as a numpy array and slot it in `uns`

In [None]:
#check for .ome.tif first, since it would otherwise match the generic 'tif' extension
#.ome.tif examples - https://www.heartcellatlas.org/
if fullres_path.endswith('.ome.tif'):
    from pyometiff import OMETIFFReader

    reader = OMETIFFReader(fpath=fullres_path)
    fullres_np, metadata, xml_metadata = reader.read()

    #may need to transpose the image if it's an invalid shape
    fullres_np = np.transpose(fullres_np, (1,2,0))

    #OPTIONAL to store image metadata in the dataset
    adata.uns['fullres_xml_metadata'] = xml_metadata

elif fullres_path.split('.')[-1] in ['tif','tiff','jpg']:
    #some of the fullres images require expanding the limit
    Image.MAX_IMAGE_PIXELS = 699408640
    fullres_np = np.asarray(Image.open(fullres_path))

#may need to rotate the image to align with embeddings
#k is the number of times to rotate the image 90 degrees counter-clockwise
#fullres_np  = np.rot90(fullres_np, k=3)

adata.uns['spatial'][library_id]['images']['fullres'] = fullres_np
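
The transpose and rotation steps mentioned in the comments above can be checked on a small array before applying them to a large image. This is a minimal sketch on a made-up channel-first array; the shapes are illustrative only.

```python
import numpy as np

# Toy channel-first image: 3 channels, 4 rows, 6 columns
chw = np.arange(3 * 4 * 6, dtype=np.uint8).reshape(3, 4, 6)

# Move channels last so the array is (rows, columns, channels)
hwc = np.transpose(chw, (1, 2, 0))

# One 90-degree counter-clockwise turn swaps the row and column axes
rotated = np.rot90(hwc, k=1)

# Four turns return the original image
full_turn = np.rot90(hwc, k=4)
```

Checking `rotated.shape` against the expected orientation is a quick way to pick `k` before plotting.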

**Curate `var` to meet CELLxGENE standards** <a id="var"></a> \
Ensembl gene IDs are required to be in the var index

In [None]:
adata.var.set_index('gene_ids', inplace=True)
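
A quick sanity check after re-indexing can catch duplicated or non-Ensembl identifiers early. The sketch below works on a toy table shaped like `adata.var`; the regular expression is an assumed pattern for Ensembl gene IDs and is not taken from the CELLxGENE schema.

```python
import pandas as pd

# Toy var table with gene symbols as the index and Ensembl IDs in a column
var = pd.DataFrame(
    {'gene_ids': ['ENSG00000141510', 'ENSG00000012048'],
     'feature_types': ['Gene Expression', 'Gene Expression']},
    index=['TP53', 'BRCA1'],
)

# Same move as the cell above: Ensembl IDs become the index
var = var.set_index('gene_ids')

# Assumed pattern: ENS, optional species letters, G, then 11 digits
looks_like_ensembl = var.index.str.match(r'^ENS[A-Z]*G\d{11}$')

ids_unique = var.index.is_unique
all_ensembl = bool(looks_like_ensembl.all())
```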

**Fill any missing barcodes into the matrix with all `0` counts** <a id="fill_bar"></a>\
Space Ranger will not output spots that have zero reads mapped

In [None]:
if adata.obs.shape[0] < 4992:
    #reuse the positions table read above, so both position-file formats are handled
    missing_barcodes = tpl[~tpl.index.isin(adata.obs.index)]
    empty_matrix = sparse.csr_matrix((missing_barcodes.shape[0], adata.var.shape[0]), dtype=np.float32)
    missing_adata = ad.AnnData(empty_matrix, var=adata.var, obs=missing_barcodes[['in_tissue','array_row','array_col']])
    comb_adata = ad.concat([adata, missing_adata], uns_merge='first', merge='first')
    #obsm is rebuilt by hand: existing coordinates first, then the missing barcodes
    comb_adata.obsm['spatial'] = np.concatenate((adata.obsm['spatial'], missing_barcodes[['pxl_col_in_fullres','pxl_row_in_fullres']].values))
    adata = comb_adata
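
The padding idea can be seen in isolation on a toy matrix. This sketch uses only pandas and scipy, with made-up barcodes, and is not a substitute for the AnnData-based cell above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy count matrix: 2 detected barcodes x 3 genes
counts = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]], dtype=np.float32))
detected = pd.Index(['AAAC-1', 'AAAG-1'])

# Full barcode list, as it would come from the positions file (made-up values)
all_barcodes = pd.Index(['AAAC-1', 'AAAG-1', 'AAAT-1', 'AACC-1'])

# Barcodes present in the positions file but absent from the matrix
missing = all_barcodes.difference(detected)

# All-zero sparse rows, one per missing barcode
padding = sparse.csr_matrix((len(missing), counts.shape[1]), dtype=np.float32)

# Stack detected rows on top of the padding and extend the barcode index to match
padded = sparse.vstack([counts, padding], format='csr')
padded_barcodes = detected.append(missing)
```

Because the padding rows store no values, the padded matrix takes essentially no extra memory for the added spots.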

**Revise misannotated in_tissue observations** <a id="rev_intiss"></a> \
Occasionally, in_tissue:1 observations have all `0` counts, indicating that they are not truly in tissue

In [None]:
sum_df = adata.obs.copy()
#sum each row on the sparse matrix directly, avoiding a dense copy of X
sum_df['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
to_revise = sum_df[(sum_df['total_counts'] == 0) & (sum_df['in_tissue'] != 0)]
if not to_revise.empty:
    adata.obs.loc[to_revise.index, 'in_tissue'] = 0
    print(to_revise.shape[0],'obs revised to in_tissue:0')
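
The revision logic can be tried on a toy example first. The values below are made up; the second spot is flagged as in tissue but carries no counts.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy data: the second spot is flagged in_tissue but has no counts
X = sparse.csr_matrix(np.array([[4, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=np.float32))
obs = pd.DataFrame({'in_tissue': [1, 1, 0]}, index=['AAAC-1', 'AAAG-1', 'AAAT-1'])

# Row sums computed on the sparse matrix
total_counts = np.asarray(X.sum(axis=1)).ravel()

# Spots marked in tissue that have zero total counts
to_revise = obs.index[(total_counts == 0) & (obs['in_tissue'] != 0).to_numpy()]
obs.loc[to_revise, 'in_tissue'] = 0
```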

**Curate `obs` metadata to meet CELLxGENE Standards** <a id="obs"></a>

In [None]:
#REQUIRED - will be the same for all Visium V1 Datasets
adata.obs['suspension_type'] = 'na'
adata.obs['assay_ontology_term_id'] = 'EFO:0010961'

#REQUIRED - most likely the same value for all obs, update based on the given donor/sample
adata.obs['donor_id'] = 'HDBR15773'
adata.obs['organism_ontology_term_id'] = 'NCBITaxon:9606' #NCBITaxon:9606 for human, NCBITaxon:10090 for mouse
adata.obs['sex_ontology_term_id'] = 'PATO:0000384' #PATO:0000383 for female, PATO:0000384 for male
adata.obs['development_stage_ontology_term_id'] = 'HsapDv:0000056' #HsapDv or MmusDv term
adata.obs['self_reported_ethnicity_ontology_term_id'] = 'HANCESTRO:0022' #HANCESTRO term, 'na' for mouse
adata.obs['disease_ontology_term_id'] = 'PATO:0000461' #PATO:0000461 for normal, MONDO term for disease
adata.obs['tissue_type'] = 'tissue' #tissue, organoid
adata.obs['tissue_ontology_term_id'] = 'UBERON:0002048' #UBERON term

**Curate cell type metadata** <a id="cell_labels"></a>

Define a mapping of population names to CL terms

In [None]:
cl_map = {
    'Adventitial fibro': 'CL:4028006', #alveolar adventitial fibroblast
    'Alveolar fibro': 'CL:4028006', #alveolar adventitial fibroblast
    'AT1': 'CL:0002062', #pulmonary alveolar type 1 cell
    'AT2': 'CL:0002063', #pulmonary alveolar type 2 cell
    'ASPN+ chondrocyte': 'CL:0000138', #chondrocyte
    'Interm chondrocyte': 'CL:0000138', #chondrocyte
    'Myofibro 2': 'CL:0000186', #myofibroblast cell
    'Ciliated': 'CL:0000064', #ciliated cell
    'MUC16+ ciliated': 'CL:0000064', #ciliated cell
    'Late airway SMC': 'CL:0000192', #smooth muscle cell
    'Vascular SMC 2': 'CL:0000359', #vascular associated smooth muscle cell
    'Late airway progenitor': 'CL:0011026', #progenitor cell
    'Mid fibro': 'CL:0000057', #fibroblast
    'Mid Schwann': 'CL:0002573', #Schwann cell
    'Proximal secretory 2': 'CL:0000151', #secretory cell
    'Late tip': 'CL:0000423', #tip cell
    'Club': 'CL:0000158', #club cell
    'KCNIP4+ neuron': 'CL:0000540', #neuron
    'SST+ neuron': 'CL:0000540', #neuron
    'SCG3+ lymphatic endothelial': 'CL:0002138', #endothelial cell of lymphatic vessel
    'Deuterosomal': 'CL:4033044', #deuterosomal cell
    'Proximal basal': 'CL:0000646', #basal cell
    'Late basal': 'CL:0000646' #basal cell
}

This example contains cell metadata, including cell2location outputs stored in the obs of an AnnData object\
Specify the .h5ad file & load

In [None]:
final_mx = '6332STDY10289523.220627.h5ad'
final_adata = sc.read_h5ad(final_mx)
final_adata.var.set_index('gene_ids', inplace=True)

One possible way to curate cell_type from such outputs is to identify the cell label with the highest abundance score,\
and then map that to a CL term, per CELLxGENE standards

In [None]:
#update the obs index values to match the Space Ranger outputs (<barcode>-1), if needed
final_adata.obs.index = [i[0] for i in final_adata.obs.index.str.split('_')]

#OPTIONAL merge over all of final_adata.obs to adata
#drop columns already present in adata.obs so the merge does not duplicate them
dup_cols = [c for c in final_adata.obs.columns if c in adata.obs.columns]
final_adata.obs.drop(columns=dup_cols, inplace=True)
adata.obs = adata.obs.merge(final_adata.obs, left_index=True, right_index=True, how='left').set_index(adata.obs.index)

#define a prefix to identify all of the columns with abundance metrics
prefix = 'q05cell_abundance_w_sf_'

#name the column that will store the max cell label
max_field = 'annotation'

#extract the max cell label
adata.obs[max_field] = adata.obs[[c for c in final_adata.obs.columns if c.startswith(prefix)]].idxmax(axis='columns')
adata.obs[max_field] = adata.obs[max_field].str.replace(prefix, '')

#map the cell labels to CL terms
adata.obs['cell_type_ontology_term_id'] = adata.obs[max_field].map(cl_map).fillna('unknown')
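
The label extraction and mapping can be walked through on a toy table. The abundance values, barcodes, and the `Other` label are made up; the two CL terms come from the mapping above. Rows without any score are set aside before `idxmax`, since recent pandas versions warn on all-NA rows.

```python
import pandas as pd

prefix = 'q05cell_abundance_w_sf_'
nan = float('nan')

# Toy abundance table; the last spot has no scores (e.g. a filtered-out barcode)
obs = pd.DataFrame(
    {prefix + 'AT1': [0.2, 3.1, 0.1, nan],
     prefix + 'Club': [1.5, 0.4, 0.2, nan],
     prefix + 'Other': [0.1, 0.2, 2.0, nan]},
    index=['AAAC-1', 'AAAG-1', 'AAAT-1', 'AACC-1'],
)

# Subset of the mapping defined earlier; 'Other' is deliberately left unmapped
cl_map = {'AT1': 'CL:0002062', 'Club': 'CL:0000158'}

# Keep only rows with at least one score, then take the highest-scoring column
cols = [c for c in obs.columns if c.startswith(prefix)]
scored = obs[cols].dropna(how='all')
top_label = scored.idxmax(axis='columns').str.replace(prefix, '', regex=False)

# Align back to all spots; unscored spots and unmapped labels become 'unknown'
obs['annotation'] = top_label.reindex(obs.index)
obs['cell_type_ontology_term_id'] = obs['annotation'].map(cl_map).fillna('unknown')
```

Taking the single highest score is only one possible rule; a spot with several similar scores may deserve a broader term.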

**OPTIONAL to add normalized data layer** <a id="norm_data"></a> \
Fills in all `0`s for barcodes filtered out of the normalized data (usually in_tissue:0 observations),\
so this may not be appropriate depending on the normalization/scaling of final layer

In [None]:
#add filtered-out barcodes to the normalized AnnData
barcodes_add = [e for e in adata.obs.index if e not in final_adata.obs.index]
new_obs=pd.DataFrame(index=barcodes_add)
empty_matrix = sparse.csr_matrix((len(barcodes_add), final_adata.var.shape[0]))
missing_adata = ad.AnnData(empty_matrix, var=final_adata.var, obs=new_obs)
final_adata = ad.concat([final_adata, missing_adata], join='outer')

#add filtered-out features to the normalized AnnData
genes_add = [e for e in adata.var.index if e not in final_adata.var.index]
all_genes = final_adata.var.index.to_list() + genes_add
new_var = pd.DataFrame(index=all_genes)
new_matrix = sparse.csr_matrix((final_adata.X.data, final_adata.X.indices, final_adata.X.indptr), shape = adata.shape)
final_adata = ad.AnnData(X=new_matrix, obs=final_adata.obs, var=new_var, obsm=final_adata.obsm)

#sort the normalized AnnData to match the order of the raw AnnData
final_adata = final_adata[adata.obs.index.to_list(), :]
final_adata = final_adata[:, adata.var.index.to_list()]

#set the raw counts to the .raw slot and normalized to .X
adata.raw = adata
adata.X = final_adata.X

#flag features that are measured in the raw layer but were filtered out of the normalized layer
adata.var['feature_is_filtered'] = adata.var.index.isin(genes_add)
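
The step that adds filtered-out features relies on a property of the CSR layout: reusing `data`, `indices`, and `indptr` with a wider shape appends empty columns on the right. A minimal sketch with placeholder gene IDs:

```python
import numpy as np
from scipy import sparse

# Toy normalized matrix: 2 spots x 2 retained genes
norm = sparse.csr_matrix(np.array([[0.5, 0.0], [0.0, 1.2]]))
retained_genes = ['ENSG_A', 'ENSG_B']   # placeholder IDs
filtered_genes = ['ENSG_C']             # measured in raw, dropped before normalization

# Same CSR arrays, wider shape: the new columns hold no stored values
n_cols = len(retained_genes) + len(filtered_genes)
widened = sparse.csr_matrix(
    (norm.data, norm.indices, norm.indptr), shape=(norm.shape[0], n_cols)
)
all_genes = retained_genes + filtered_genes

# Reorder columns to a target gene order (here standing in for the raw matrix order)
raw_order = ['ENSG_C', 'ENSG_A', 'ENSG_B']
col_idx = [all_genes.index(g) for g in raw_order]
aligned = widened[:, col_idx]
```

This only works when the added features are appended after the existing ones, which is why the gene list is built as retained genes followed by filtered genes before any reordering.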

**OPTIONAL to add non-spatial embeddings** <a id="umap"></a> \
Filtered-out barcodes will have null values in each embedding as a result of the steps above

In [None]:
for k in final_adata.obsm:
    if 'spatial' not in k:
        adata.obsm[k] = final_adata.obsm[k]
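
How null rows arise for filtered-out barcodes can be shown with a barcode-indexed table. This is a pandas-only sketch with made-up coordinates, not the AnnData route used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy UMAP coordinates for the two barcodes kept in the processed object
umap = pd.DataFrame([[0.1, 1.0], [2.5, -0.3]], index=['AAAC-1', 'AAAT-1'])

# Barcode order of the curated object, including a filtered-out barcode
all_barcodes = ['AAAC-1', 'AAAG-1', 'AAAT-1']

# Reindexing aligns rows by barcode and leaves NaN where no embedding exists
umap_full = umap.reindex(all_barcodes).to_numpy()
```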

**QA by plotting with the hires image and fullres image, if present**  <a id="qa_plot"></a>

In [None]:
sq.pl.spatial_scatter(
    adata, library_id=library_id, figsize=(12,4),
    color='in_tissue'
)

sq.pl.spatial_scatter(
    adata, library_id=library_id, figsize=(12,4),
    color=max_field ,legend_fontsize=10
)
del adata.uns[max_field + '_colors']

if 'fullres' in adata.uns['spatial'][library_id]['images']:
    sq.pl.spatial_scatter(
        adata, library_id=library_id, figsize=(12,4),
        color='in_tissue', img_res_key='fullres', scale_factor=1.0
        )

    sq.pl.spatial_scatter(
        adata, library_id=library_id, figsize=(12,4),
        color=max_field, img_res_key='fullres', scale_factor=1.0,
        legend_fontsize=10
        )
    del adata.uns[max_field + '_colors']

**Write to file**  <a id="write"></a>

In [None]:
adata.write(filename=library_id + '_curated.h5ad', compression='gzip')