**This notebook provides code samples for manipulating AnnData objects during CELLxGENE curation.\
It is not intended to be run as a single coherent workflow.**

## Table of Contents
* **CELLxGENE Revision**
 * [Remove CELLxGENE portal fields](#revision)
* **Matrix**
 * [Convert matrix to sparse](#sparsity)
 * [Convert raw matrix to sparse](#sparsity-raw)
 * [Subset the matrix](#subset)
* **obsm**
 * [Convert x,y columns to embeddings](#set-embed)
* **layers**
 * [Move a layer](#mv-layer)
 * [Delete a layer](#del-layer)
* **uns**
 * [Set a field](#set-uns)
 * [Delete a field](#del-uns)
* **obs**
 * [Remove columns](#del-obs)
 * [Rename columns](#rn-obs)
 * [Replace values](#rp-obs)
 * [Set a column with the same value](#set-obs)
 * [Fill null values in a specific column](#fillna-obs)
 * [Convert numeric field to categorical](#cat-obs)
 * [Alter the values in a column using a function](#typo-obs)
 * [Map HsapDv terms from human ages in specific years](#yr-hsapdv)
 * [Add a column mapped from another - dictionary](#add-dict-obs)
 * [Add a column mapped from another - Google Sheet](#add-gs-obs)
 * [Create a csv from barcode list](#create-csv)
* **var**
 * [Remove columns](#del-var)
 * [Set a column with the same value](#set-var)
 * [Add a column mapped from another - function](#typo-var)
 * [Set a column as the index](#index-var)
 * [Map in Ensembl IDs based on symbols and reference annotation](#id-map-var)
 * [Curate raw.var](#raw-var)
 * [Fill var with filtered features that are in raw.var](#fill-filt-var)

# Revising existing CELLxGENE Dataset <a class="anchor" id="revision"></a>
**Remove fields that are filled in by the portal upon submission**

In [None]:
portal_obs = [
    'assay',
    'cell_type',
    'development_stage',
    'disease',
    'self_reported_ethnicity',
    'organism',
    'sex',
    'tissue'
]

portal_var = [
    'feature_name',
    'feature_reference',
    'feature_biotype'
]

portal_uns = [
    'schema_version'
]

adata.obs.drop(columns=portal_obs, inplace=True)
adata.var.drop(columns=portal_var, inplace=True)

for p in portal_uns:
    del adata.uns[p]

if adata.raw:
    adata.raw.var.drop(columns=portal_var, inplace=True)

# Matrix 

**Convert a matrix to sparse** <a class="anchor" id="sparsity"></a>

In [None]:
adata.X = sparse.csr_matrix(adata.X)
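
Before converting, it can help to check how sparse the matrix actually is. The following is a minimal sketch on a toy NumPy array standing in for `adata.X`; the names `dense`, `zero_fraction` and `csr` are illustrative and not part of the notebook.

```python
import numpy as np
from scipy import sparse

# Toy dense matrix standing in for adata.X (3 cells x 4 genes)
dense = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
])

# Fraction of entries that are zero
zero_fraction = 1 - np.count_nonzero(dense) / dense.size

# Convert to CSR; only the non-zero values are stored
csr = sparse.csr_matrix(dense)

print(sparse.issparse(csr), csr.nnz, zero_fraction)
```

The conversion is lossless: `csr.toarray()` gives back the original values.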

**Convert a matrix to sparse - raw layer** <a class="anchor" id="sparsity-raw"></a>

In [None]:
raw_adata = ad.AnnData(sparse.csr_matrix(adata.raw.X), var=adata.raw.var, obs=adata.obs)
adata.raw = raw_adata
del raw_adata

**Subset a matrix for select observations** <a class="anchor" id="subset"></a>

In [None]:
embed_file = 'HumanNonNeuronal_clusterfile.txt'
embed_df = pd.read_csv(embed_file, sep='\t', skiprows=[1])
obs_to_keep = embed_df['NAME']

adatasm = adata[obs_to_keep, : ]
adatasm

# obsm

**Add spatial embeddings based on two columns in obs** <a class="anchor" id="set-embed"></a>

In [None]:
adata.obsm['X_spatial'] = adata.obs[['xcoord','ycoord']].to_numpy()
adata.obs.drop(columns=['xcoord','ycoord'], inplace=True)
sc.pl.embedding(adata, basis='X_spatial', color=['cell_type_ontology_term_id'])

# layers

**Move a layer to the raw slot** <a class="anchor" id="mv-layer"></a>

In [None]:
raw_adata = ad.AnnData(adata.layers['counts'], dtype=adata.layers['counts'].dtype, var=adata.var)
adata.raw = raw_adata

**Delete a layer** <a class="anchor" id="del-layer"></a>

In [None]:
del adata.layers['counts']

# uns

**Define a field in uns** <a class="anchor" id="set-uns"></a>

In [None]:
adata.uns['default_embedding'] = 'X_umap'

**Remove a field from uns** <a class="anchor" id="del-uns"></a>

In [None]:
del adata.uns['X_normalization']

# obs

**Remove columns**  <a class="anchor" id="del-obs"></a>

In [None]:
obs_remove = [
    'author_tissue',
    'Assay',
    'method',
    'donor_age'
]

obs_remove = [o for o in obs_remove if o in adata.obs.columns]
adata.obs.drop(columns=obs_remove, inplace=True)
if obs_remove:
    print('removed: ' + ','.join(obs_remove))

**Change column names**  <a class="anchor" id="rn-obs"></a>

In [None]:
rename_me = {
    'cell_type': 'author_cell_type',
    'ethnicity_ontology_id': 'self_reported_ethnicity_ontology_term_id',
    'disease_ontology_id': 'disease_ontology_term_id'
}

adata.obs.rename(columns=rename_me, inplace=True)
adata.obs.columns

**Replace specified values in specified columns** <a class="anchor" id="rp-obs"></a>

In [None]:
replace_me = {
    'organism_ontology_term_id':{'human': 'NCBITaxon:9606', 'mouse': 'NCBITaxon:10090'},
    'assay_ontology_term_id': {'EFO:0030003': 'EFO:0009899'}
}

adata.obs.replace(replace_me,inplace=True)
adata.obs[['organism_ontology_term_id','assay_ontology_term_id']].value_counts()

**Set a column with all the same values**  <a class="anchor" id="set-obs"></a>

In [None]:
adata.obs['is_primary_data'] = True
adata.obs['suspension_type'] = 'nucleus'
adata.obs[['is_primary_data','suspension_type']].value_counts()

**Fill null values of a specific column with a specified value**  <a class="anchor" id="fillna-obs"></a>

In [None]:
if 'unknown' not in adata.obs['sex_ontology_term_id'].unique():
    adata.obs['sex_ontology_term_id'] = adata.obs['sex_ontology_term_id'].cat.add_categories('unknown')
adata.obs.fillna({'sex_ontology_term_id': 'unknown'}, inplace=True)
adata.obs['sex_ontology_term_id'].value_counts()
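
The `add_categories` step matters because a categorical column rejects fill values that are not already among its categories. A minimal sketch on a toy dataframe standing in for `adata.obs` (the two PATO terms are example values only):

```python
import pandas as pd

# Toy categorical column with a missing value, standing in for adata.obs
obs = pd.DataFrame({
    'sex_ontology_term_id': pd.Categorical(['PATO:0000383', None, 'PATO:0000384'])
})

col = obs['sex_ontology_term_id']

# Filling with a value that is not yet a category raises an error,
# so register 'unknown' as a category first
if 'unknown' not in col.cat.categories:
    col = col.cat.add_categories('unknown')

obs['sex_ontology_term_id'] = col.fillna('unknown')
print(obs['sex_ontology_term_id'].tolist())
```

This sketch checks `cat.categories` rather than `unique()`, which also covers the case where `'unknown'` is a registered category that no row currently uses.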

**Convert a numeric field to categorical** <a class="anchor" id="cat-obs"></a>

In [None]:
adata.obs['cluster_id'] = adata.obs['cluster_id'].astype(str).astype('category')
adata.obs['cluster_id'].unique()

**Adjust the values in a specific column in a standard way with a function** <a class="anchor" id="typo-obs"></a>

In [None]:
def fix_typo(x):
    return x.replace('_',':')


adata.obs['development_stage_ontology_term_id'] = adata.obs['development_stage_ontology_term_id'].apply(fix_typo)
adata.obs['development_stage_ontology_term_id'].unique()

**Use OLS to map HsapDv terms from human ages in specific years** <a class="anchor" id="yr-hsapdv"></a>

In [None]:
url = 'http://www.ebi.ac.uk/ols4/api/ontologies/hsapdv/terms?size=500'
r = requests.get(url).json()
yr_specific = {t['label']: t['obo_id'] for t in r['_embedded']['terms'] if t['label'].endswith('-year-old human stage')}

adata.obs['development_stage_ontology_term_id'] = adata.obs['age'].apply(lambda x: yr_specific[x + '-year-old human stage'])
adata.obs[['age','development_stage_ontology_term_id']].value_counts(dropna=False)

**Add a new column mapped from another - with dictionary** <a class="anchor" id="add-dict-obs"></a>

In [None]:
donor_map = {
    'KL001': 'P21',
    'KL002': 'P22',
    'KL003': 'P23'
}

adata.obs['donor_id'] = adata.obs['sample'].map(donor_map)
adata.obs[['donor_id','sample']].value_counts(dropna=False)
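
Values missing from the dictionary become `NaN` after `.map()`, so it is worth listing them explicitly. A minimal sketch on a toy dataframe; `'KL004'` is a hypothetical sample added here to show the unmapped case.

```python
import pandas as pd

# Toy obs table; 'KL004' is a made-up sample that has no entry in the map
obs = pd.DataFrame({'sample': ['KL001', 'KL002', 'KL004']})

donor_map = {
    'KL001': 'P21',
    'KL002': 'P22',
    'KL003': 'P23'
}

obs['donor_id'] = obs['sample'].map(donor_map)

# Samples that did not receive a donor_id
unmapped = sorted(obs.loc[obs['donor_id'].isna(), 'sample'].unique())
print(unmapped)
```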

**Add a new column mapped from another - with Google Sheet** <a class="anchor" id="add-gs-obs"></a>\
**Step 1:** get the values to map from

In [None]:
for k in adata.obs['donor_id'].unique():
    print(k)

**Step 2:** set up a dataframe with the mapping from a Google Sheet\
*Google Sheet sharing must be set so that anyone with the link is a Viewer*

In [None]:
sheet_id = '15oG8v5BS6HMPqCehYQcujMZUq9PgQNpo8osKhO7yA5o'
tab_name = 'donor table'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={quote(tab_name)}'
donor_meta = pd.read_csv(url)[['donor_id','sex_ontology_term_id','development_stage_ontology_term_id']]
donor_meta

**Step 3:** merge the dataframe into obs\
*`how='left'` is critical to ensure obs order is retained\
`set_index` is critical to ensure the index is retained*

In [None]:
adata.obs = adata.obs.merge(donor_meta, on='donor_id',how='left').set_index(adata.obs.index)
adata.obs[donor_meta.columns].value_counts()
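
The reason both `how='left'` and `set_index` are needed can be seen on a toy example. This is a sketch with made-up barcodes and donors, not the notebook's data:

```python
import pandas as pd

# Toy obs table indexed by barcode, standing in for adata.obs
obs = pd.DataFrame(
    {'donor_id': ['D2', 'D1', 'D2']},
    index=['AAAC-1', 'CCCG-1', 'GGGT-1'],
)

# Toy donor-level metadata, standing in for the Google Sheet
donor_meta = pd.DataFrame({
    'donor_id': ['D1', 'D2'],
    'sex_ontology_term_id': ['PATO:0000383', 'PATO:0000384'],
})

# Merging on a column returns a fresh 0..n-1 index, dropping the barcodes
merged = obs.merge(donor_meta, on='donor_id', how='left')

# how='left' keeps the row order of obs, so the original index can be reattached
restored = merged.set_index(obs.index)
print(restored)
```

If the sheet lists a donor more than once, the merge adds rows and `set_index` fails with a length mismatch. Passing `validate='many_to_one'` to `merge` surfaces that problem earlier.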

**See what the cell barcode prefixes/suffixes are**
<br>Note: This will need to be adapted depending on what the prefix/suffix looks like</br>

In [None]:
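# Split on the barcode so that what remains is the prefix or the suffix
pattern = r"[ACGT]*-1$"  # barcode at the end: the prefix is left in column 0
col = 0
#pattern = r"^[AGCT]*"  # barcode at the start: the suffix is left in column 1
#col = 1
adata_index_split = adata.obs.index.to_series().str.split(pat=pattern, regex=True, expand=True)
adata_index_split.iloc[:, col].value_counts(dropna=False)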
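
An alternative to splitting is to capture the prefix, barcode and suffix in one pass with named groups. This is a sketch that assumes 16-base barcodes; the cell IDs are made up for illustration.

```python
import pandas as pd

# Toy cell IDs with a sample prefix in front of a 16-base barcode
index = pd.Series([
    'S1_AAACCCAAGAAACACT-1',
    'S1_AAACCCAAGAAACCAT-1',
    'S2_AAACCCAAGAAACACT-1',
])

# Capture whatever precedes the barcode, the barcode itself, and whatever follows
parts = index.str.extract(r'^(?P<prefix>.*?)(?P<barcode>[ACGT]{16})(?P<suffix>.*)$')

prefix_counts = parts['prefix'].value_counts().to_dict()
print(prefix_counts)
```

IDs that contain no 16-base barcode come back as `NaN` in every column, which makes unexpected formats easy to spot.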

**Create csv from barcode lists** <a class="anchor" id="create-csv"></a>

In [None]:
# Used to generate the summary column of the 10x barcode csv table
import re

def summarize(v2, v3, m, b):
    # Only classify strings that contain a 16-base barcode
    if re.match(r".*[ACTG]{16}.*", b):
        if v2 + v3 + m > 1:
            return 'multiple'
        elif (v2 == 1) and (v3 == 0) and (m == 0):
            return '3pv2_5pv1_5pv2'
        elif (v2 == 0) and (v3 == 1) and (m == 0):
            return '3pv3'
        elif (v2 == 0) and (v3 == 0) and (m == 1):
            return 'multiome'
    # Anything else (no barcode, or no list membership) is left unlabelled
    return None

In [None]:
#Code in sample notebook consumes the barcodes across all 3 lists and compiles them into a table
ref_dir = '../cellxgene_resources/ref_files/'
v2_file = '737K-august-2016.txt'
v3_file = '3M-february-2018.txt'
multiome_file = '737K-arc-v1.txt'

# create dataframes of ref files
v2_df = pd.read_csv(ref_dir + v2_file, names=['barcode'])
v2_df['3pv2_5pv1_5pv2'] = 1

v3_df = pd.read_csv(ref_dir + v3_file, names=['barcode'])
v3_df['3pv3'] = 1

multiome_df = pd.read_csv(ref_dir + multiome_file, names=['barcode'])
multiome_df['multiome'] = 1

# merge ref dfs
barcode_table_df = v2_df.merge(v3_df,on='barcode',how='outer').merge(multiome_df,on='barcode',how='outer')
barcode_table_df.fillna(0, inplace=True)
barcode_table_df['summary'] = barcode_table_df.apply(lambda x: summarize(x['3pv2_5pv1_5pv2'],x['3pv3'],x['multiome'],x['barcode']), axis=1)
barcode_table_df.to_csv('10X_barcode_table.csv.gz', sep=',', index=False, compression='gzip')
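
The merge-and-summarize logic can be checked on a few made-up barcodes before running it on the full reference lists. The sketch below uses a vectorized `idxmax`/`where` in place of the row-wise `apply`, and it skips the 16-base barcode check; the 4-base barcodes are toy values.

```python
import pandas as pd

# Toy barcode lists standing in for the three 10x reference files
v2_df = pd.DataFrame({'barcode': ['AAAA', 'CCCC'], '3pv2_5pv1_5pv2': 1})
v3_df = pd.DataFrame({'barcode': ['CCCC', 'GGGG'], '3pv3': 1})
multiome_df = pd.DataFrame({'barcode': ['TTTT'], 'multiome': 1})

# Outer merges keep every barcode that appears in any list
table = v2_df.merge(v3_df, on='barcode', how='outer').merge(multiome_df, on='barcode', how='outer')

# Missing memberships become 0
kits = ['3pv2_5pv1_5pv2', '3pv3', 'multiome']
table[kits] = table[kits].fillna(0)

# Number of lists each barcode belongs to
hits = table[kits].sum(axis=1)

# idxmax names the matching list; barcodes found in several lists are 'multiple'
table['summary'] = table[kits].idxmax(axis=1).where(hits == 1, 'multiple')

summary = table.set_index('barcode')['summary'].to_dict()
print(summary)
```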

In [None]:
# add summary column with the corresponding ref list
# columns are accessed by position: 0 = barcode, 1-3 = the three ref list indicators
for index,row in barcode_results_df.iterrows():
    if (row.iloc[1]+row.iloc[2]+row.iloc[3]) > 1:
        barcode_results_df.loc[index, 'summary'] = 'multiple'

    elif (row.iloc[1]==1) and (row.iloc[2]==0) and (row.iloc[3]==0):
        barcode_results_df.loc[index, 'summary'] = '3pv2_5pv1_5pv2'

    elif (row.iloc[1]==0) and (row.iloc[2]==1) and (row.iloc[3]==0):
        barcode_results_df.loc[index, 'summary'] = '3pv3'

    elif (row.iloc[1]==0) and (row.iloc[2]==0) and (row.iloc[3]==1):
        barcode_results_df.loc[index, 'summary'] = 'multiome'

    else:
        print('Error, check row conditions')

# write barcode_results_df to csv for 10x barcode checker in curation_qa notebook
barcode_results_df.to_csv('10X_barcode_results.csv', sep=',', index=False)

# var

**Remove columns** <a class="anchor" id="del-var"></a>

In [None]:
adata.var.drop(columns=['gene_symbols'], inplace=True)
adata.var.columns

**Set a column with all the same values** <a class="anchor" id="set-var"></a>

In [None]:
adata.var['feature_is_filtered'] = False
adata.var['feature_is_filtered'].value_counts()

**Add a new column mapped from another - with function** <a class="anchor" id="typo-var"></a>

In [None]:
adata.var['gene_id'] = adata.var['ensembl_version'].apply(lambda x: x.split('.')[0])
adata.var
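
Stripping the version suffix can collapse two versioned IDs into the same gene ID, which matters if that column is later set as the index. A minimal sketch on toy values (the version numbers are illustrative only):

```python
import pandas as pd

# Toy var table; the first and third rows share a gene ID once the version is dropped
var = pd.DataFrame({
    'ensembl_version': ['ENSG00000141510.18', 'ENSG00000012048.23', 'ENSG00000141510']
})

# Keep everything before the first '.'
var['gene_id'] = var['ensembl_version'].str.split('.').str[0]

# Check for duplicates before using gene_id as an index
has_dups = bool(var['gene_id'].duplicated().any())
print(var['gene_id'].tolist(), has_dups)
```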

**Set a column as the index** <a class="anchor" id="index-var"></a>

In [None]:
adata.var.set_index('gene_id', inplace=True)
adata.var

**Map Ensembl IDs from symbols using a reference annotation** <a class="anchor" id="id-map-var"></a>

**If CellRanger may have been used for alignment, check against the default CellRanger references for matches in order to inform symbol-to-ID mapping**<br>
*Each tsv file was compiled from the gtfs distributed with CellRanger and is stored in this repo*

In [None]:
ref_dir = 'ref_files/'
CR_12 = 'refdata-cellranger-GRCh38-1_2_0_genes_gtf.tsv'
CR_30 = 'refdata-cellranger-GRCh38-3_0_0_genes_gtf.tsv'
CR_2020 = 'refdata-gex-GRCh38-2020-A_genes_gtf.tsv'
CR_hg19 = 'refdata-cellranger-hg19-1_2_0_genes_gtf.tsv'
results = []
for v in [CR_12,CR_30,CR_2020,CR_hg19]:
    map_df = pd.read_csv(ref_dir + v, sep='\t')
    results.append({
        'ref': v,
        'matched': adata.var.merge(map_df,left_index=True,right_on='gene_symbols',how='inner').shape[0]
    })
df = pd.DataFrame(results).set_index('ref')
df['unmatched'] = df['matched'].apply(lambda x: adata.var.shape[0] - x)
df.sort_values('unmatched', inplace=True)
df
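
Note that counting rows of an inner merge counts a symbol once per reference row it matches, so a symbol with two IDs in the reference is counted twice. The sketch below, on made-up symbols and IDs, contrasts that with counting distinct matched symbols via `isin`:

```python
import pandas as pd

# Toy var table indexed by symbol, standing in for adata.var
var = pd.DataFrame(
    {'feature_is_filtered': False},
    index=pd.Index(['GENE_A', 'GENE_B', 'GENE_C']),
)

# Toy reference in which GENE_B maps to two IDs
map_df = pd.DataFrame({
    'gene_ids': ['ID_1', 'ID_2', 'ID_3'],
    'gene_symbols': ['GENE_A', 'GENE_B', 'GENE_B'],
})

# Row count of the inner merge: GENE_B contributes two rows
merged_rows = var.merge(map_df, left_index=True, right_on='gene_symbols', how='inner').shape[0]

# Number of var symbols present in the reference at all
matched_symbols = int(var.index.isin(map_df['gene_symbols']).sum())

print(merged_rows, matched_symbols)
```

For ranking references against each other the difference is usually small, but it can make the `unmatched` count appear lower than it is.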

**If one of the CellRanger references looks like a good match, you can set it as the map file for use downstream, demonstrated further below**

In [None]:
var_mapping_file = ref_dir + CR_12

**IN PROGRESS: If a reference other than one of the default CellRanger reference files was used, a map file can be created from the annotation file for use in curation**

**If a GENCODE/ENSEMBL reference was used, parse the annotation file**

In [None]:
# Extracts gene_id and gene_symbol from gtf

def extract_gene_id_and_symb(dir, gtf):
    '''
    Input: directory containing the gtf file, and the gtf file name
    Output: dataframe of the gene_ids and gene_symbols
    '''
    gtf = pd.read_table(dir + gtf, header=None, comment='#')
    row_8 = gtf[8]  # attribute column of the gtf
    pattern1 = ';'
    pattern2 = r'(ENSG\d{11})'
    pattern3 = r'([\w\d\-]+)'
    split1 = row_8.str.split(pattern1, expand=True)
    gene_ids = split1[0].str.split(pattern2, expand=True)[1]
    gene_sym = split1[4].str.split(pattern3, expand=True)[3]
    gtf_df = pd.DataFrame({'gene_ids':gene_ids, 'gene_symbols':gene_sym})
    gtf_df.drop_duplicates(inplace=True)  # drop rows where both gene_id and gene_symbol are duplicated
    return gtf_df
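
Because the function above relies on the gene symbol sitting at a fixed position in the attribute field, it may need adjusting for annotation files that order their attributes differently. One possible alternative is to parse the attributes by key, sketched here on a single made-up GENCODE-style attribute string; `parse_attributes` is a hypothetical helper, not part of this repo.

```python
import re

# One attribute field (column 9) from a GENCODE-style gene record; toy example
attributes = 'gene_id "ENSG00000141510.18"; gene_type "protein_coding"; gene_name "TP53"; level 2;'


def parse_attributes(field):
    """Return a dict of key -> value for the quoted attributes in a GTF attribute string."""
    return dict(re.findall(r'(\S+) "([^"]*)"', field))


attrs = parse_attributes(attributes)
gene_id = attrs['gene_id'].split('.')[0]  # drop the version suffix
gene_symbol = attrs['gene_name']
print(gene_id, gene_symbol)
```

Unquoted attributes such as `level 2` are skipped, and a key that appears more than once keeps only its last value.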

In [None]:
# Finds gene_symbols with multiple gene_ids and assigns 'multiple' as the gene_id of each such gene_symbol
def assign_multiple(gtf_df):
    '''
    Input: gtf pandas dataframe with gene_ids and gene_symbols columns
    Output: de-duplicated dataframe indexed by gene_ids
    '''
    dups = gtf_df[gtf_df.duplicated(subset='gene_symbols',keep=False)].copy()
    dups['gene_ids'] = 'multiple'
    dups = dups.drop_duplicates()  # only keep one instance of the gene_symbol
    gene_df_wo_dups = gtf_df.drop_duplicates(subset='gene_symbols',keep=False)
    gene_df_multiple = pd.concat([gene_df_wo_dups, dups])  # all non-duplicated gene_symbols plus gene_symbols with multiple Ensembl IDs
    gene_df_multiple.set_index('gene_ids', inplace=True)
    return gene_df_multiple
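
The intended result of the de-duplication can be illustrated on a toy table. This sketch reaches the same outcome with a `groupby`/`transform` rather than the approach above; the IDs and symbols are made up.

```python
import pandas as pd

# Toy table in which GENE_B maps to two different IDs
gtf_df = pd.DataFrame({
    'gene_ids': ['ID_1', 'ID_2', 'ID_3'],
    'gene_symbols': ['GENE_A', 'GENE_B', 'GENE_B'],
})

# Number of distinct IDs seen for each row's symbol
n_ids = gtf_df.groupby('gene_symbols')['gene_ids'].transform('nunique')

# Overwrite ambiguous IDs with 'multiple', then keep one row per symbol
dedup = gtf_df.assign(gene_ids=gtf_df['gene_ids'].where(n_ids == 1, 'multiple'))
dedup = dedup.drop_duplicates().set_index('gene_ids')

result = dedup['gene_symbols'].to_dict()
print(result)
```

When several symbols are ambiguous, the index label `'multiple'` repeats, so the resulting index is not unique.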

In [None]:
# can include multiple gtf files (list of gtf files)
gtf_list = ['gtf','gtf']

In [None]:
# Extract and de-duplicate gene annotations and write them to a tsv file
gtf_dir = '/path/to/gtf_files/'  # placeholder: directory holding the gtf files
out_dir = '/path/to/cxg_curation/ref_files/'
for file in gtf_list:
    gtf_df = extract_gene_id_and_symb(gtf_dir, file)
    gene_id_symbol = assign_multiple(gtf_df)

    # write dataframe to a gzipped tsv file in the ref_files folder
    tsv_file = file.replace('.','_') + '.tsv.gz'
    gene_id_symbol.to_csv(out_dir + tsv_file, sep='\t', compression='gzip')

**View what features are not mapped in this**<br>
*Check for typos or other alterations to the symbols that can be fixed*<br>
*Common to see many ending in `.1` or `-1` resulting from duplicated symbols in the reference*

In [None]:
var_map_df = pd.read_csv(var_mapping_file, sep='\t')
adata.var[~adata.var.index.isin(var_map_df['gene_symbols'])]

**Map the Ensembl IDs & set them to the index**

In [None]:
adata.var = adata.var.merge(var_map_df,left_index=True,right_on='gene_symbols',how='left').set_index(adata.var.index)
adata.var.set_index('gene_ids', inplace=True)
adata.var

**Filter out genes that don't appear in the approved annotation**

**Create the list of approved IDs to filter on**<br>
*For the initial run, download the 4 genes_ csv files from https://github.com/chanzuckerberg/single-cell-curation/tree/main/cellxgene_schema_cli/cellxgene_schema/ontology_files*<br>
*After that, if the `genes_approved.csv` is available locally, then the 4 genes_ files won't be necessary*

In [None]:
ref_dir = 'ref_files/'
ref_files = [
    'genes_ercc.csv',
    'genes_homo_sapiens.csv',
    'genes_mus_musculus.csv',
    'genes_sars_cov_2.csv'
]

if not os.path.exists(ref_dir + 'genes_approved.csv'):
    dfs = []
    for f in ref_files:
        df = pd.read_csv(ref_dir + f, names=['feature_id','symb','num','length'],dtype='str',index_col=False)
        dfs.append(df)
        os.remove(ref_dir + f)  # the per-organism file is no longer needed once read
    ids = pd.concat(dfs)
    ids.to_csv(ref_dir + 'genes_approved.csv', index=False)

approved = pd.read_csv(ref_dir + 'genes_approved.csv',dtype='str')

In [None]:
adata.var.reset_index(inplace=True)
var_to_keep = adata.var[adata.var['gene_ids'].isin(approved['feature_id'])].index
adata = adata[:, var_to_keep]
adata.var.set_index('gene_ids', inplace=True)
adata.var

**Repeat many of the same steps for `raw.var`, if it exists** <a class="anchor" id="raw-var"></a>

In [None]:
raw_adata = ad.AnnData(adata.raw.X, var=adata.raw.var, obs=adata.obs)

raw_adata.var = raw_adata.var.merge(var_map_df,left_index=True,right_on='gene_symbols',how='left').set_index(raw_adata.var.index)

raw_adata.var.reset_index(inplace=True)
var_to_keep = raw_adata.var[raw_adata.var['gene_ids'].isin(approved['feature_id'])].index
raw_adata = raw_adata[:, var_to_keep]
raw_adata.var.set_index('gene_ids', inplace=True)

adata.raw = raw_adata
adata.raw.var

**Fill genes that are present in raw but not in X** <a class="anchor" id="fill-filt-var"></a><br>
*Ensure the matrix is CSR-formatted prior to using this*

In [None]:
genes_add = [e for e in adata.raw.var.index if e not in adata.var.index]
new_matrix = sparse.csr_matrix((adata.X.data, adata.X.indices, adata.X.indptr), shape = adata.raw.shape)
all_genes = adata.var.index.to_list()
all_genes.extend(genes_add)
new_var = pd.DataFrame(index=all_genes)
new_var = pd.merge(new_var, adata.var, left_index=True, right_index=True, how='left')
new_var.loc[genes_add, 'feature_is_filtered'] = True
new_adata = ad.AnnData(X=new_matrix, dtype=new_matrix.dtype, obs=adata.obs, var=new_var, uns=adata.uns, obsm=adata.obsm, raw = adata.raw)
new_adata = new_adata[:,adata.raw.var.index.to_list()]
new_adata.var.loc[adata.var.index, 'feature_is_filtered'] = False
new_adata.var['feature_is_filtered'] = new_adata.var['feature_is_filtered'].astype('bool')

adata = new_adata

adata.var['feature_is_filtered'].value_counts()
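
The CSR requirement comes from how the matrix is widened: in CSR, `indices` holds column positions and `indptr` has one entry per row boundary, so the same three arrays stay valid when more columns are declared. A minimal sketch on a toy matrix:

```python
import numpy as np
from scipy import sparse

# Toy filtered matrix: 2 cells x 2 genes
X = sparse.csr_matrix(np.array([[1, 0], [0, 5]]))

# Reuse the same data/indices/indptr but declare 4 columns;
# the two extra columns hold no stored values, so they read as zeros
wide = sparse.csr_matrix((X.data, X.indices, X.indptr), shape=(2, 4))

print(wide.toarray())
```

With a CSC matrix the same trick would not work as written, because `indptr` there has one entry per column boundary and would need to be extended.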