**This notebook provides code samples for manipulating AnnData objects during CELLxGENE curation.\
It is not intended to be used as a single coherent workflow.**

## Table of Contents
* **CELLxGENE Revision**
 * [Remove CELLxGENE portal fields](#revision)
* **Subset**
 * [Subset the matrix](#subset)
* **obsm**
 * [Convert x,y columns to embeddings](#set-embed)
* **layers**
 * [Move a layer](#mv-layer)
 * [Delete a layer](#del-layer)
* **uns**
 * [Set a field](#set-uns)
 * [Delete a field](#del-uns)
* **obs**
 * [Remove columns](#del-obs)
 * [Rename columns](#rn-obs)
 * [Replace values](#rp-obs)
 * [Set a column with the same value](#set-obs)
 * [Fill null values in a specific column](#fillna-obs)
 * [Convert numeric field to categorical](#cat-obs)
 * [Alter the values in a column using a function](#typo-obs)
 * [Add a column mapped from another - dictionary](#add-dict-obs)
 * [Add a column mapped from another - Google Sheet](#add-gs-obs)
* **var**
 * [Remove columns](#del-var)
 * [Set a column with the same value](#set-var)
 * [Add a column mapped from another - function](#typo-var)
 * [Set a column as the index](#index-var)
 * [Map in Ensembl IDs based on symbols and reference annotation](#id-map-var)
 * [Fill var with filtered features that are in raw.var](#fill-filt-var)

# Revising an existing CELLxGENE Dataset <a class="anchor" id="revision"></a>
**Remove fields that are filled in by the portal upon submission**

In [None]:
portal_obs = [
    'assay',
    'cell_type',
    'development_stage',
    'disease',
    'self_reported_ethnicity',
    'organism',
    'sex',
    'tissue'
]

portal_var = [
    'feature_name',
    'feature_reference',
    'feature_biotype'
]

adata.obs.drop(columns=portal_obs, inplace=True)
adata.var.drop(columns=portal_var, inplace=True)


if adata.raw:
    remove_raw_var = [p for p in portal_var if p in adata.raw.var]
    if remove_raw_var:
        adata.raw.var.drop(columns=remove_raw_var, inplace=True)
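
The `drop` calls above raise a `KeyError` if any listed column is absent. A minimal sketch of a more tolerant variant follows; the small DataFrame is a hypothetical stand-in for `adata.obs`, and only some of the portal fields are present in it.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs; 'organism' is deliberately missing
obs = pd.DataFrame({
    'assay': ["10x 3' v3", "10x 3' v3"],
    'cell_type': ['T cell', 'B cell'],
    'donor_id': ['P21', 'P22'],
})

portal_obs = ['assay', 'cell_type', 'organism']

# errors='ignore' skips labels that are not in the frame instead of raising
obs = obs.drop(columns=portal_obs, errors='ignore')
print(obs.columns.tolist())
```

The same `errors='ignore'` argument should work on `adata.obs` and `adata.var` directly, since both are pandas DataFrames.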

# Subset matrix  <a class="anchor" id="subset"></a>

In [None]:
#give me a csv with cell IDs & embeddings
#obs_to_keep = cell ID column
#set the embeddings in obsm

In [None]:
obs_to_keep = [i for i in adatasm.obs.index if not i.endswith('-1-1')]
len(obs_to_keep)

In [None]:
adatasm = adatasm[obs_to_keep, : ]
adatasm
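
The suffix filter above can also be written as a vectorized boolean mask. This is a sketch on a hypothetical index of four barcodes, standing in for `adatasm.obs.index`.

```python
import pandas as pd

# Hypothetical cell barcodes; two carry the '-1-1' suffix to be excluded
obs = pd.DataFrame(index=['AAAC-1', 'AAAG-1-1', 'AACT-1', 'AAGT-1-1'])

# True for every cell whose ID does NOT end with '-1-1'
keep_mask = ~obs.index.str.endswith('-1-1')
obs_to_keep = obs.index[keep_mask].tolist()
print(obs_to_keep)
```

Either the mask or the resulting list of IDs can then be used for row selection in the same way as `obs_to_keep` above.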

# obsm

**add spatial embeddings based on two columns in obs** <a class="anchor" id="set-embed"></a>

In [None]:
adata.obsm['X_spatial'] = adata.obs[['xcoord','ycoord']].to_numpy()
adata.obs.drop(columns=['xcoord','ycoord'], inplace=True)
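
Before assigning coordinates to `obsm`, it can help to confirm that the array is numeric, has one row per cell, and contains no missing values. The checks below are a sketch on a hypothetical stand-in for `adata.obs`; they are suggested sanity checks, not requirements taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical obs table with two coordinate columns
obs = pd.DataFrame(
    {'xcoord': [1.0, 2.5, 3.0], 'ycoord': [0.5, 1.5, 2.0], 'cluster': ['a', 'b', 'a']},
    index=['c1', 'c2', 'c3'],
)

# Convert to a float array; this raises if a column holds non-numeric values
coords = obs[['xcoord', 'ycoord']].to_numpy(dtype=float)

# One row per cell, two columns, and no missing coordinates
assert coords.shape == (len(obs), 2)
assert not np.isnan(coords).any()
```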

# layers

**move a layer to the raw slot** <a class="anchor" id="mv-layer"></a>

In [None]:
import anndata as ad

raw_adata = ad.AnnData(adata.layers['counts'], var=adata.var)
adata.raw = raw_adata

**delete a layer** <a class="anchor" id="del-layer"></a>

In [None]:
del adata.layers['counts']

# uns

**define a field in uns** <a class="anchor" id="set-uns"></a>

In [None]:
adata.uns['schema_version'] = '3.0.0'
adata.uns['default_embedding'] = 'X_umap'

**remove a field from uns** <a class="anchor" id="del-uns"></a>

In [None]:
del adata.uns['X_normalization']

# obs

**Remove columns**  <a class="anchor" id="del-obs"></a>

In [None]:
obs_remove = [
    'author_tissue',
    'Assay',
    'method',
    'donor_age'
]

obs_remove = [o for o in obs_remove if o in adata.obs.columns]
adata.obs.drop(columns=obs_remove, inplace=True)
if obs_remove:
    print('removed: ' + ','.join(obs_remove))

**change column names**  <a class="anchor" id="rn-obs"></a>

In [None]:
rename_me = {
    'cell_type': 'author_cell_type',
    'ethnicity_ontology_id': 'self_reported_ethnicity_ontology_term_id',
    'disease_ontology_id': 'disease_ontology_term_id'
}

adata.obs.rename(columns=rename_me, inplace=True)

**replace specified values in specified columns** <a class="anchor" id="rp-obs"></a>

In [None]:
replace_me = {
    'organism_ontology_term_id':{'human': 'NCBITaxon:9606', 'mouse': 'NCBITaxon:10090'},
    'assay_ontology_term_id': {'EFO:0030003': 'EFO:0009899'}
}

adata.obs.replace(replace_me,inplace=True)

**set a column with all the same values**  <a class="anchor" id="set-obs"></a>

In [None]:
adata.obs['is_primary_data'] = True
adata.obs['suspension_type'] = 'nucleus'

**fill null values of a specific column with a specified value**  <a class="anchor" id="fillna-obs"></a>

In [None]:
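# the new value must be registered as a category before it can be filled in;
# cat.add_categories returns a new Series (its inplace option was removed in pandas 2.0)
adata.obs['sex_ontology_term_id'] = adata.obs['sex_ontology_term_id'].cat.add_categories('unknown')
adata.obs.fillna({'sex_ontology_term_id': 'unknown'}, inplace=True)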
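
Filling a categorical column with a value that is not among its categories raises an error, which is why the category is added first. A minimal sketch on a hypothetical Series standing in for the `sex_ontology_term_id` column:

```python
import pandas as pd

# Hypothetical categorical column with one missing value
sex = pd.Series(['PATO:0000383', None, 'PATO:0000384'], dtype='category')

# Register 'unknown' as a category, then fill the nulls with it
sex = sex.cat.add_categories('unknown')
sex = sex.fillna('unknown')
print(sex.tolist())
```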

**Convert a numeric (gradient) field to categorical** <a class="anchor" id="cat-obs"></a>

In [None]:
adata.obs['cluster_id'] = adata.obs['cluster_id'].map(str)
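
Mapping to `str` yields string values rather than an explicit categorical dtype. If an explicit categorical is wanted, one option is to chain `astype` calls, sketched here on a hypothetical stand-in for `adata.obs`:

```python
import pandas as pd

# Hypothetical numeric cluster labels
obs = pd.DataFrame({'cluster_id': [0, 1, 1, 2]})

# Convert to strings first so the categories are labels, not numbers
obs['cluster_id'] = obs['cluster_id'].astype(str).astype('category')
print(obs['cluster_id'].cat.categories.tolist())
```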

**adjust the values in a specific column in a standard way with a function** <a class="anchor" id="typo-obs"></a>

In [None]:
def fix_typo(x):
    return x.replace('_',':')


adata.obs['development_stage_ontology_term_id'] = adata.obs['development_stage_ontology_term_id'].apply(fix_typo)

**Add a new column mapped from another - with a dictionary** <a class="anchor" id="add-dict-obs"></a>

In [None]:
donor_map = {
    'KL001': 'P21',
    'KL002': 'P22',
    'KL003': 'P23'
}

adata.obs['donor_id'] = adata.obs['sample'].map(donor_map)
adata.obs[['donor_id','sample']].value_counts(dropna=False)
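
Values missing from the dictionary become `NaN` after `map`, which is what the `value_counts(dropna=False)` call above helps reveal. A sketch that lists the unmapped source values explicitly, using hypothetical data in which `KL004` has no entry in `donor_map`:

```python
import pandas as pd

# Hypothetical obs table; 'KL004' is not covered by the mapping
obs = pd.DataFrame({'sample': ['KL001', 'KL002', 'KL004']})
donor_map = {'KL001': 'P21', 'KL002': 'P22', 'KL003': 'P23'}

obs['donor_id'] = obs['sample'].map(donor_map)

# Source values that did not receive a donor_id
unmapped = sorted(obs.loc[obs['donor_id'].isna(), 'sample'].unique())
print(unmapped)
```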

**Add a new column mapped from another - with Google Sheet** <a class="anchor" id="add-gs-obs"></a>\
**Step 1:** get the values to map from

In [None]:
for k in adata.obs['author_cell_type'].unique():
    print(k)

**Step 2:** set up a dataframe with the mapping from a Google Sheet\
*Google Sheet permissions must be set so that anyone with the link is a Viewer*

In [None]:
import pandas as pd

sheet_id = '15oG8v5BS6HMPqCehYQcujMZUq9PgQNpo8osKhO7yA5o'
tab_name = 'Sheet1'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={tab_name}'
ct_df = pd.read_csv(url)
ct_df

**Step 3:** merge the dataframe into obs\
*`how='left'` is critical to ensure obs order is retained\
`set_index` is critical to ensure the index is retained*

In [None]:
adata.obs = adata.obs.merge(ct_df, on='author_cell_type',how='left').set_index(adata.obs.index)
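
Merging on a column returns a frame with a fresh integer index, which is why the original index is set again afterwards. The sketch below shows this on hypothetical data. It also passes `validate='many_to_one'`, an optional pandas argument that raises if the mapping table lists the same `author_cell_type` twice, which would otherwise duplicate rows. The column name `cell_type_ontology_term_id` in the mapping table is an assumption about the sheet's contents.

```python
import pandas as pd

# Hypothetical obs table indexed by cell ID
obs = pd.DataFrame(
    {'author_cell_type': ['T', 'B', 'T']},
    index=['cell1', 'cell2', 'cell3'],
)

# Hypothetical mapping table, as it might be read from the sheet
ct_df = pd.DataFrame({
    'author_cell_type': ['B', 'T'],
    'cell_type_ontology_term_id': ['CL:0000236', 'CL:0000084'],
})

# how='left' keeps every obs row in its original order
merged = obs.merge(ct_df, on='author_cell_type', how='left', validate='many_to_one')

# The merge result has a 0..n-1 index, so restore the cell IDs
merged = merged.set_index(obs.index)
print(merged)
```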

# var

**Remove columns** <a class="anchor" id="del-var"></a>

In [None]:
adata.var.drop(columns=['gene_symbols'], inplace=True)

**set a column with all the same values** <a class="anchor" id="set-var"></a>

In [None]:
adata.var['feature_is_filtered'] = False

**Add a new column mapped from another - with function** <a class="anchor" id="typo-var"></a>

In [None]:
adata.var['gene_id'] = adata.var['ensembl_version'].apply(lambda x: x.split('.')[0])
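
The same version-stripping can be done with the vectorized `.str` accessor. Because the result is often used as the index next, the sketch also checks that the stripped IDs are unique. The frame is a hypothetical stand-in for `adata.var`.

```python
import pandas as pd

# Hypothetical versioned Ensembl IDs
var = pd.DataFrame({'ensembl_version': ['ENSG00000141510.18', 'ENSG00000012048.23']})

# Keep only the part before the first '.'
var['gene_id'] = var['ensembl_version'].str.split('.').str[0]

# Duplicates here would cause problems if gene_id becomes the index
ids_are_unique = var['gene_id'].is_unique
print(var['gene_id'].tolist(), ids_are_unique)
```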

**Set a column as the index** <a class="anchor" id="index-var"></a>

In [None]:
adata.var.set_index('gene_id', inplace=True)

**Map Ensembl IDs from symbols using a reference annotation** <a class="anchor" id="id-map-var"></a>

**If CellRanger may have been used for alignment, check against the default CellRanger references for matches in order to inform symbol-to-ID mapping**

In [None]:
CR_12 = 'refdata-cellranger-GRCh38-1_2_0_genes_gtf.tsv'
CR_30 = 'refdata-cellranger-GRCh38-3_0_0_genes_gtf.tsv'
CR_2020 = 'refdata-gex-GRCh38-2020-A_genes_gtf.tsv'
CR_hg19 = 'refdata-cellranger-hg19-1_2_0_genes_gtf.tsv'
for v in [CR_12,CR_30,CR_2020,CR_hg19]:
    map_df = pd.read_csv(v, sep='\t')
    print(v)
    print(adata.var.merge(map_df,left_index=True,right_on='gene_symbols',how='inner').shape[0])
    print('----------')

In [None]:
var_mapping_file = CR_12

**Fill in the mapping file to use to map symbols to Ensembl IDs**<br>
*Expecting a .tsv with columns `gene_symbols` & `gene_ids`*

In [None]:
#make a tsv from a gtf
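
One possible way to build such a TSV from a GTF is sketched below. It assumes the GTF has `gene` feature rows whose attribute column contains `gene_id "..."` and `gene_name "..."` entries; how the referenced reference `.tsv` files were actually produced is not stated here. The helper `gtf_to_rows` and the three-line GTF excerpt are hypothetical.

```python
import csv
import io
import re

# Hypothetical GTF excerpt; a real file would be opened from disk instead
gtf_text = (
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_name "DDX11L1";\n'
    'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1";\n'
    'chr1\tHAVANA\tgene\t14404\t29570\t.\t-\t.\t'
    'gene_id "ENSG00000227232"; gene_name "WASH7P";\n'
)


def gtf_to_rows(handle):
    """Return (gene_symbol, gene_id) pairs from the 'gene' rows of a GTF."""
    rows = []
    for line in handle:
        if line.startswith('#'):
            continue  # skip header comments
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'gene':
            continue  # only gene-level records are needed
        gene_id = re.search(r'gene_id "([^"]+)"', fields[8])
        gene_name = re.search(r'gene_name "([^"]+)"', fields[8])
        if gene_id and gene_name:
            rows.append((gene_name.group(1), gene_id.group(1)))
    return rows


rows = gtf_to_rows(io.StringIO(gtf_text))

# Write the two expected columns as tab-separated text
out = io.StringIO()
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
writer.writerow(['gene_symbols', 'gene_ids'])
writer.writerows(rows)
tsv_text = out.getvalue()
print(tsv_text)
```

To produce a file, the `io.StringIO` objects can be replaced with `open(...)` handles for the GTF and the output `.tsv`.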

In [None]:
var_mapping_file = 'refdata-cellranger-GRCh38-3_0_0_genes_gtf.tsv'

**View which features are not mapped by this file**<br>
*Check for typos or other alterations to the symbols that can be fixed*<br>
*It is common to see many ending in `.1` or `-1`, resulting from duplicated symbols in the reference*

In [None]:
var_map_df = pd.read_csv(var_mapping_file, sep='\t')
adata.var[~adata.var.index.isin(var_map_df['gene_symbols'])]

**Map the Ensembl IDs**

In [None]:
adata.var = adata.var.merge(var_map_df,left_index=True,right_on='gene_symbols',how='left').set_index(adata.var.index)

**Filter out genes that don't appear in the approved annotation**

**Create the list of approved IDs to filter on**<br>
*For the initial run, download the 4 `genes_*.csv` files from https://github.com/chanzuckerberg/single-cell-curation/tree/main/cellxgene_schema_cli/cellxgene_schema/ontology_files*<br>
*After that, if `genes_approved.csv` is available locally, the 4 `genes_*.csv` files won't be necessary*

In [None]:
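import os

import pandas as pd

ref_files = [
    'genes_ercc.csv',
    'genes_homo_sapiens.csv',
    'genes_mus_musculus.csv',
    'genes_sars_cov_2.csv'
]

if not os.path.exists('genes_approved.csv'):
    # read every reference file, then combine them in one step
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    frames = [
        pd.read_csv(f, names=['feature_id','symb','num','length'],dtype='str',index_col=False)
        for f in ref_files
    ]
    ids = pd.concat(frames)
    ids.to_csv('genes_approved.csv', index=False)
    # the individual files are no longer needed once the combined file is written
    for f in ref_files:
        os.remove(f)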

approved = pd.read_csv('genes_approved.csv',dtype='str')

In [None]:
# keep only the features whose ID appears in the approved list, in their current order
var_to_keep = adata.var.index[adata.var.index.isin(approved['feature_id'])].tolist()
adata = adata[:, var_to_keep]

**Repeat most of the same steps for `raw.var`, if it exists**

In [None]:
raw_adata = ad.AnnData(adata.raw.X, var=adata.raw.var, obs=adata.obs)

raw_adata.var = raw_adata.var.merge(var_map_df,left_index=True,right_on='gene_symbols',how='left').set_index(raw_adata.var.index)

raw_adata = raw_adata[:, var_to_keep]
adata.raw = raw_adata

adata.raw.var

**Fill genes that are present in raw but not in X** <a class="anchor" id="fill-filt-var"></a>

In [None]:
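from scipy import sparse

# genes that are in raw but were filtered out of X
genes_add = [e for e in adata.raw.var.index if e not in adata.var.index]

# widen X to the raw shape; the added columns are all zeros (assumes adata.X is a CSR matrix)
new_matrix = sparse.csr_matrix((adata.X.data, adata.X.indices, adata.X.indptr), shape = adata.raw.shape)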

all_genes = adata.var.index.to_list()
all_genes.extend(genes_add)

new_var = pd.DataFrame(index=all_genes)
new_var = pd.merge(new_var, adata.var, left_index=True, right_index=True, how='left')
new_var.loc[genes_add, 'feature_is_filtered'] = True
new_adata = ad.AnnData(X=new_matrix, obs=adata.obs, var=new_var, uns=adata.uns, obsm=adata.obsm, raw = adata.raw)
new_adata = new_adata[:,adata.raw.var.index.to_list()]
adata = new_adata

for c in ['feature_is_filtered']:
    adata.var[c] = adata.var[c].astype('bool')
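
The padding step above relies on a property of the CSR format: reusing the same `data`, `indices`, and `indptr` arrays with a wider `shape` appends all-zero columns on the right. The sketch below shows this on a small hypothetical matrix, then reorders the columns to match a raw gene order, as the final subsetting step does. The gene names are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical filtered matrix: 3 cells x 2 genes ('g1', 'g3')
X = sparse.csr_matrix(np.array([[1, 0], [0, 2], [3, 4]]))

# Same nonzero entries, wider shape: two all-zero columns are appended
n_raw_genes = 4
padded = sparse.csr_matrix((X.data, X.indices, X.indptr), shape=(X.shape[0], n_raw_genes))

# Column order after padding: genes in X first, then genes found only in raw
filtered_order = ['g1', 'g3'] + ['g2', 'g4']
raw_order = ['g1', 'g2', 'g3', 'g4']

# Reorder the columns so they follow the raw gene order
col_idx = [filtered_order.index(g) for g in raw_order]
reordered = padded[:, col_idx]
print(reordered.toarray())
```

In this toy case the columns for `g2` and `g4` are entirely zero, which corresponds to the features flagged with `feature_is_filtered = True`.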