### Notebook to subset HLCA to a more manageable size using `geosketch`
- **Developed by**: Carlos Talavera-López Ph.D
- **Institute of Computational Biology - Computational Health Centre - Helmholtz Munich**
- v221206

### Import required modules

In [None]:
import anndata
import numpy as np
import pandas as pd
import scanpy as sc
from geosketch import gs

### Set up working environment

In [None]:
sc.settings.verbosity = 3
sc.logging.print_versions()
sc.settings.set_figure_params(dpi = 160, color_map = 'RdPu', dpi_save = 300, vector_friendly = True, format = 'svg', fontsize = 8)

### Read in HLCA object

In [None]:
hlca_raw = sc.read_h5ad('/home/cartalop/data/carlos/single_cell/lung/hlca/LCA_v2_Carlos.h5ad')
hlca_raw

### Add counts into `adata.X`

In [None]:
hlca = anndata.AnnData(X = hlca_raw.layers['counts'], var = hlca_raw.var, obs = hlca_raw.obs, varm = hlca_raw.varm, obsm = hlca_raw.obsm)
hlca
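
Before relying on `layers['counts']` as raw data, it can help to confirm that the matrix really holds non-negative integers. The following is a minimal sketch, not part of the original notebook: `looks_like_raw_counts` and its `n_check` parameter are hypothetical names, and the check assumes the matrix is either a dense NumPy array or a row-sliceable sparse matrix (for example CSR).

```python
import numpy as np

def looks_like_raw_counts(X, n_check=1000):
    """Return True if the first `n_check` rows hold only non-negative integers.

    Hypothetical helper: works on dense arrays and on row-sliceable
    sparse matrices (detected here by the presence of `.tocsr`).
    """
    block = X[:n_check]
    if hasattr(block, "tocsr"):
        # Sparse matrix: only the explicitly stored values need checking
        values = np.asarray(block.data).ravel()
    else:
        values = np.asarray(block).ravel()
    if values.size == 0:
        return False
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))

# Toy data standing in for a counts layer and a log-normalised matrix
toy_counts = np.array([[0, 3, 1], [2, 0, 0]], dtype=np.float32)
toy_lognorm = np.log1p(toy_counts)

print(looks_like_raw_counts(toy_counts))
print(looks_like_raw_counts(toy_lognorm))
```

In the notebook this could be called as `looks_like_raw_counts(hlca.X)` right after building `hlca`.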

### Generate subset using `geosketch`

In [None]:
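N = 100000
# Make cell names unique first, because the subset below is selected by name
hlca.obs_names_make_unique()
# gs returns the positional indices of N cells sampled without replacement from PCA space
sketch_index = gs(hlca.obsm['X_pca'], N, replace = False)
# Copy the selection so that later edits act on a real object, not a view of `hlca`
hlca_subset = hlca[hlca.obs_names[sketch_index]].copy()
hlca_subset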
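
A quick sanity check on the sketch indices can catch problems before the subset is written to disk. The sketch below is an illustration only: `validate_sketch_index` is a hypothetical helper, and the toy indices are drawn with NumPy's random generator as a stand-in for the output of `gs`, not produced by `geosketch` itself.

```python
import numpy as np

def validate_sketch_index(sketch_index, n_obs, n_expected):
    """Check that indices are unique, in range, and of the expected length.

    Hypothetical helper; returns the indices as a NumPy array on success
    and raises ValueError otherwise.
    """
    idx = np.asarray(sketch_index)
    if idx.size == 0:
        raise ValueError("sketch index is empty")
    if idx.shape != (n_expected,):
        raise ValueError(f"expected {n_expected} indices, got shape {idx.shape}")
    if np.unique(idx).size != idx.size:
        raise ValueError("sketch index contains duplicates")
    if idx.min() < 0 or idx.max() >= n_obs:
        raise ValueError("sketch index falls outside the range of cells")
    return idx

# Toy stand-in for the output of gs(): 10 unique positions out of 50 cells
rng = np.random.default_rng(0)
toy_index = rng.choice(50, size=10, replace=False).tolist()

checked = validate_sketch_index(toy_index, n_obs=50, n_expected=10)
print(checked.size)
```

In the notebook the equivalent call would be `validate_sketch_index(sketch_index, hlca.n_obs, N)`.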

### Clean up object

- Remove unnecessary fields in `adata.obs` and `adata.var`
- Remove `adata.obsm` and `adata.varm`

In [None]:
hlca_subset.obs = hlca_subset.obs[['sample', 'ann_level_4', 'sequencing_platform', 'subject_ID']]
hlca_subset.var = hlca_subset.var[['gene_symbols', 'gene_ids']]
hlca_subset
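
Selecting columns by name fails with a `KeyError` if any of them is absent, so it can be useful to list missing names first. This is a small sketch under stated assumptions: `missing_columns` is a hypothetical helper, and the available column names below are toy values, not the real contents of the HLCA object.

```python
def missing_columns(available, wanted):
    """Return the names in `wanted` that are absent from `available`, in order.

    Hypothetical helper; `available` can be any iterable of column names.
    """
    available = set(available)
    return [name for name in wanted if name not in available]

# Toy column names standing in for `hlca_subset.obs.columns`
toy_obs_columns = ["sample", "ann_level_4", "subject_ID", "n_genes"]
wanted_obs = ["sample", "ann_level_4", "sequencing_platform", "subject_ID"]

print(missing_columns(toy_obs_columns, wanted_obs))
```

In the notebook the same check could be run as `missing_columns(hlca_subset.obs.columns, [...])` and `missing_columns(hlca_subset.var.columns, [...])` before the subsetting cell.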

In [None]:
del hlca_subset.obsm
del hlca_subset.varm
hlca_subset


### Export object

In [None]:
hlca_subset.write('/home/cartalop/data/carlos/single_cell/lung/hlca/HLCA_raw_100K_subset.h5ad')