In [1]:
# Transfer Scanorama Leiden labels from the HVG AnnData to the all-gene AnnData

In [43]:
import scanpy as sc
import anndata as ad
import pandas as pd
from scipy.sparse import csr_matrix, vstack
import random
import numpy as np
from scipy.io import mmread, mmwrite
from sklearn.neighbors import NearestNeighbors
import plotly.graph_objects as go
import igraph
import seaborn as sns
import matplotlib.pyplot as plt

This code loads the Scanorama-integrated AnnData object from the H5AD file "scanorama_integrated_leiden_hvg.h5ad" into the variable hvg_adata. This object holds the integrated highly variable gene (HVG) data together with the Leiden clustering results that are transferred later in the notebook.

In [44]:
hvg_adata = ad.read_h5ad('data/scanorama_integrated_leiden_hvg.h5ad')

This code removes the last two characters from each entry of the observation (cell) index of hvg_adata. The slice assumes that every cell name ends in a two-character suffix, such as a tag appended during dataset merging; stripping it lets the names be matched against the cell names in the original raw data.

In [48]:
# Strip the last two characters of the index
hvg_adata.obs.index = hvg_adata.obs.index.str[:-2]
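Slicing off a fixed number of characters gives the intended result only if every cell name carries a two-character suffix. The following is a minimal sketch on a toy pandas index, not the notebook's data: the cell names and the "-0"/"-1" suffix pattern are assumptions made for illustration. It compares the unconditional slice with a pattern-based strip and checks that the names stay unique afterwards.

```python
import pandas as pd

# Toy cell names; the "-0"/"-1" suffixes are an assumption for illustration.
obs_names = pd.Index(["AAAC-0", "AAAG-0", "AAAC_x-1", "TTTG-1"])

# Equivalent of the notebook's slice: drop the last two characters unconditionally.
sliced = obs_names.str[:-2]

# More defensive variant: remove only a trailing "-<single digit>" suffix.
stripped = obs_names.str.replace(r"-\d$", "", regex=True)

# Both approaches agree on this toy data.
assert list(sliced) == list(stripped)

# Stripping can create duplicate names, so check uniqueness before relying on the index.
assert stripped.is_unique, "duplicate cell names after stripping suffixes"
print(list(stripped))
```

If the uniqueness check fails on real data, two cells from different batches share a barcode and the suffix is needed to tell them apart.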

This code loads the AnnData object from the H5AD file "original_raw.h5ad" into the variable ag_adata. This is the original raw data containing all genes, which will receive the clustering labels.

In [49]:
ag_adata = ad.read_h5ad('data/original_raw.h5ad')

This code filters both ag_adata and hvg_adata on the number of unique molecular identifiers (UMIs) stored in the "n.umi" column. It keeps cells whose "n.umi" is between 250 and 10,000 (inclusive), and it also keeps every cell whose "origin" is "Cao", regardless of UMI count. The filtered datasets are stored in ag_adata_fil and hvg_adata_fil.

In [50]:
ag_adata_fil = ag_adata[((ag_adata.obs["n.umi"] >= 250) & (ag_adata.obs["n.umi"] <= 10000)) | (ag_adata.obs["origin"] == "Cao")]
hvg_adata_fil = hvg_adata[((hvg_adata.obs["n.umi"] >= 250) & (hvg_adata.obs["n.umi"] <= 10000)) | (hvg_adata.obs["origin"] == "Cao")]
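Because the same rule is applied to two objects, it can help to build the mask in one place. The sketch below shows the filter logic on a toy pandas DataFrame standing in for .obs; the helper name umi_filter_mask, its parameters, and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def umi_filter_mask(obs, lo=250, hi=10_000, exempt_origin="Cao"):
    """Boolean mask: n.umi within [lo, hi], or origin equal to exempt_origin."""
    in_range = obs["n.umi"].between(lo, hi)  # inclusive on both ends
    exempt = obs["origin"] == exempt_origin
    return in_range | exempt

# Toy stand-in for adata.obs (values invented for illustration).
obs = pd.DataFrame(
    {
        "n.umi": [100, 250, 5000, 10000, 20000, 50],
        "origin": ["A", "A", "B", "B", "B", "Cao"],
    },
    index=[f"cell{i}" for i in range(6)],
)

mask = umi_filter_mask(obs)
kept = obs[mask]
print(kept.index.tolist())
```

Here cell0 and cell4 are dropped for falling outside the UMI range, while cell5 is kept despite its low count because its origin is "Cao".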


This code extracts the cell indices from hvg_adata_fil and stores them in indices_to_keep. It then subsets ag_adata_fil by those indices, so that both datasets contain the same cells in the same order as hvg_adata_fil.

In [51]:
indices_to_keep = hvg_adata_fil.obs.index
# Subset ag_adata_fil to keep only the cells in hvg_adata_fil (in hvg_adata_fil's order)
ag_adata_fil = ag_adata_fil[indices_to_keep]
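Label-based subsetting both selects and reorders rows, and it fails if a requested name is absent. The sketch below illustrates this with a toy pandas DataFrame as a stand-in for the AnnData object; the names full and reference_index are hypothetical. It adds an explicit guard that reports missing cells before subsetting.

```python
import pandas as pd

# Toy stand-in for the all-gene object's .obs (values invented for illustration).
full = pd.DataFrame({"n.umi": [300, 400, 500, 600]}, index=["c1", "c2", "c3", "c4"])

# Cell names of the reference object, in the reference object's order.
reference_index = pd.Index(["c3", "c1"])

# Guard: every requested label must exist in the object being subset.
missing = reference_index.difference(full.index)
assert missing.empty, f"cells missing from the full object: {list(missing)}"

# Selecting by label returns the rows in the order of reference_index.
subset = full.loc[reference_index]
print(subset.index.tolist())
```

The result follows the order of the reference index, not the order of the object being subset.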

This code checks whether the observation (cell) indices of ag_adata_fil and hvg_adata_fil are identical, using the .equals() method. It returns True only if both indices contain the same names in the same order, and False otherwise.

In [60]:
ag_adata_fil.obs.index.equals(hvg_adata_fil.obs.index)

True
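The order sensitivity of .equals() is what makes this check meaningful for the positional copy that follows. A small sketch on toy pandas indices (names invented) shows the difference between "same cells" and "same cells in the same order":

```python
import pandas as pd

a = pd.Index(["c1", "c2", "c3"])
b = pd.Index(["c3", "c1", "c2"])  # same cells, different order

# Order-sensitive comparison, as used in the notebook.
same_order = a.equals(b)

# Order-insensitive comparison: sort both sides first.
same_cells = a.sort_values().equals(b.sort_values())

print(same_order, same_cells)
```

In a script rather than a notebook, the check could be written as an assert so that a mismatch stops the run instead of only displaying False.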

This code transfers the Leiden clustering results and other relevant data from hvg_adata_fil to ag_adata_fil. Specifically, it copies the "leiden" column from hvg_adata_fil.obs, and the .obsm, .uns, and .obsp attributes from hvg_adata_fil to ag_adata_fil, so that both datasets carry the same clustering and metadata. The "leiden" values are copied positionally (via .values), so this step relies on the two indices being identical, as verified in the previous cell.

In [61]:
ag_adata_fil.obs["leiden"] = hvg_adata_fil.obs["leiden"].values
ag_adata_fil.obsm = hvg_adata_fil.obsm
ag_adata_fil.uns = hvg_adata_fil.uns
ag_adata_fil.obsp = hvg_adata_fil.obsp
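Copying with .values ignores cell names and is correct only when both objects list their cells in the same order, which the earlier .equals() check confirmed for this notebook. The sketch below, on toy pandas DataFrames with invented names and labels, shows what would happen if the order differed, and contrasts it with an index-aligned assignment.

```python
import pandas as pd

# Toy source with cluster labels, and a target holding the same cells in another order.
source = pd.DataFrame({"leiden": ["0", "1", "2"]}, index=["c1", "c2", "c3"])
target = pd.DataFrame(index=["c3", "c1", "c2"])

# Positional copy: ignores the index, so labels land on the wrong cells here.
target["leiden_positional"] = source["leiden"].values

# Index-aligned copy: pandas matches rows by cell name.
target["leiden_aligned"] = source["leiden"]

print(target)
```

When the indices are already identical, both forms give the same result; the aligned form simply does not depend on that precondition.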


This code saves the modified AnnData object ag_adata_fil to an H5AD file named "scanorama_full_leiden_v1.h5ad". This file now contains the updated data with the transferred Leiden clustering results and metadata, ready for future analysis.

In [62]:
ag_adata_fil.write_h5ad("data/scanorama_full_leiden_v1.h5ad")