# Data Analysis of ad_worm_aging file


In this notebook we evaluate the contents of [cds_baseline.h5ad](https://zenodo.org/record/7296547/files/cds_baseline.h5ad).

The data in the ad_worm_aging file was created for the paper
[Whole-body gene expression atlas of an adult metazoan](https://www.biorxiv.org/content/10.1101/2022.11.06.515345v1).



In [None]:
# Run this cell to download the data
# If you already have the data, SKIP this step

!wget -P ./input_data https://zenodo.org/record/7296547/files/cds_baseline.h5ad
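
If `wget` is not available, the same download-unless-present step can be sketched in plain Python. The helper name `download_if_missing` is hypothetical and not part of the original notebook; it assumes the file should land in `./input_data` under its original name.

```python
import os
import urllib.request

# Hypothetical helper, shown only as a sketch of the "skip if already downloaded" step.
def download_if_missing(url, dest_dir="./input_data"):
    """Download url into dest_dir unless the file already exists; return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, os.path.basename(url))
    if os.path.exists(dest_path):
        # The file is already present, so no network access is needed
        print(f"Found {dest_path}, skipping download")
        return dest_path
    urllib.request.urlretrieve(url, dest_path)
    return dest_path

# Usage (uncomment to run; this performs a network download):
# download_if_missing("https://zenodo.org/record/7296547/files/cds_baseline.h5ad")
```

Checking for the file first makes the cell safe to re-run without fetching the data again.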

In [None]:
import os
output_dir = "./output_data"
# Create the output directory if it does not already exist
os.makedirs(output_dir, exist_ok=True)


## [Annotated data](https://anndata.readthedocs.io/en/latest/)

The diagram below shows the overall structure of an AnnData file and how its contents are laid out.


In [None]:
%%html
<h2>Annotated Data Structure</h2>
<img src="https://anndata.readthedocs.io/en/latest/_images/anndata_schema.svg" width=400/>

In [None]:
# Check the version of anndata we are using
import anndata as ad
ad.__version__

In [None]:
# Load the h5ad file
input_dir = './input_data'
# read_h5ad is the current reader; ad.read is a deprecated alias for it
adult_metazoan = ad.read_h5ad(f"{input_dir}/cds_baseline.h5ad")
adult_metazoan

In [None]:
adult_metazoan.uns['cds_version']

## Evaluation of observations data

In [None]:
# Let's take a look at observations
obs_df = adult_metazoan.obs
obs_df

### Questions on the cluster naming convention

What naming or numbering convention is used for the cell type/cluster names (`annotate_name`), for example the `41` in `41_2:marginal`? An initial hypothesis is that this is the order in which the UMAP algorithm discovered the clusters.

What do the suffixes `_0`, `_1`, `_2` after the cluster number mean, for example the `_2` in `41_2:marginal`? An initial hypothesis is that clusters were defined first and a later refinement pass split some of them into additional sub-clusters.
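
One way to probe these questions is to split each name into its parts. The sketch below assumes every name follows the `<cluster>_<subcluster>:<label>` layout seen in `41_2:marginal`. That layout is a guess based on the single example above, and the helper `split_annotate_name` is hypothetical.

```python
import re

# Assumed layout: "<cluster>_<subcluster>:<label>", e.g. "41_2:marginal"
NAME_PATTERN = re.compile(r"^(?P<cluster>\d+)_(?P<subcluster>\d+):(?P<label>.+)$")

def split_annotate_name(name):
    """Return (cluster, subcluster, label), or None if the name does not fit the assumed layout."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return int(match["cluster"]), int(match["subcluster"]), match["label"]

print(split_annotate_name("41_2:marginal"))  # (41, 2, 'marginal')
```

Applied to the whole column, for example with `obs_df['annotate_name'].map(split_annotate_name)`, this would show how many sub-clusters each cluster number has and which names, if any, do not fit the assumed layout.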



In [None]:
# Let's confirm that the number of distinct cell types matches the paper
# FROM PAPER: "Identification of over 163 distinct C. elegans cell types and subtypes "
# Note: this cell counts the assigned_cell_type column, not annotate_name

# Yes, we see 163 unique cell types
cell_types = obs_df['assigned_cell_type'].unique()
print(f"Cell types = {len(cell_types)}")

print("Cell Types")
# Print the cell type names in sorted order, one per line
print(*sorted(cell_types), sep='\n')


In [None]:

# Count and list the distinct cell type groups
cell_type_groups = obs_df['cell_type_group'].unique()
print(f"Cell type Groups = {len(cell_type_groups)}")

print("Cell Type Groups")
print(*sorted(cell_type_groups), sep='\n')


## Evaluation of Var Data

In [None]:
#var: 'id', 'gene_short_name', 'num_cells_expressed', 'use_for_ordering'
var_df = adult_metazoan.var
var_df

## Evaluation of the X Data

In [None]:
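import pandas as pd
from scipy.sparse import issparse

# Convert X to a dense DataFrame for inspection.
# Caution: densifying a large sparse matrix can use a lot of memory.
X = adult_metazoan.X
x_df = pd.DataFrame(data=X.toarray() if issparse(X) else X)
x_df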

## Evaluation of obsm: 'X_umap', 'scvi'

In [None]:
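# The embedding is read from the 'UMAP' key here;
# run adult_metazoan.obsm.keys() to list the keys actually present in the file
X_umap = adult_metazoan.obsm['UMAP']
print(type(X_umap))
print(X_umap.shape)
print(X_umap)
print(X_umap.T)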

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Plot every cell in grey to see the overall shape of the embedding
plt.rcParams['figure.dpi'] = 500
plt.scatter(X_umap.T[0], X_umap.T[1], c='grey', s=.008)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the dataset', fontsize=18);

In [None]:
# Assign each cell a plotting category and give each category a color
import matplotlib.pyplot as plt

def assign_category(row, groups):
    """Return the first entry of groups contained in the row's cell_type_group, else 'other'."""
    for group in groups:
        if group in row['cell_type_group']:
            return group
    return 'other'

cell_type_group_list = list(obs_df['cell_type_group'].unique())

obs_df['category'] = obs_df.apply(lambda row: assign_category(row, cell_type_group_list), axis=1)

# Spread the groups evenly across the gist_rainbow colormap
cm = plt.get_cmap('gist_rainbow')
colors = {}
for index, group in enumerate(cell_type_group_list):
    colors[group] = cm(index / len(cell_type_group_list))

# Override two groups with fixed colors
colors['Unassigned'] = '#7f7f7f'
colors['Hypodermis'] = '#a65728'
colors

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd
%matplotlib inline

# Map the categories from above to the UMAP 
X_umap = adult_metazoan.obsm['UMAP']
X_umap_df = pd.DataFrame(X_umap, columns = ['X','Y'])

# Add the category to the X_umap_df
obs2_df = obs_df.reset_index(drop=True)
X_umap_df = X_umap_df.join(obs2_df['category'])



# Plot the UMAP, coloring each cell by its category
plt.rcParams['figure.dpi'] = 500
plt.scatter(X_umap_df['X'], X_umap_df['Y'], c=X_umap_df['category'].map(colors), s=.008)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the dataset', fontsize=18)
plt.yticks([])
plt.xticks([])

# Build the legend from the color dictionary, one patch per cell group
patches = [mpatches.Patch(color=color, label=group) for group, color in colors.items()]
plt.rcParams["legend.fontsize"] = 5
legend = plt.legend(handles=patches)
legend.set_title('Cell Group')

output_dir = './output_data'
file_name = 'umap_top_15_cell_categories.png'
plt.savefig(f'{output_dir}/{file_name}')

In [None]:
X_umap_df['category']