# How to find, load and process snRNA-seq data

In [1]:
#import libraries
import wget
import pandas as pd
import numpy as np
import scanpy as sc
import anndata

Gene network analysis is a method for identifying sub-networks (modules) of correlated genes, which are likely to be co-expressed.
This can help to identify modules of genes that contribute to disease.
In this example, we will cover how to create a pairwise correlation matrix of genes, as well as how to associate them with disease.
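
As a rough illustration of what a pairwise gene correlation matrix looks like, the sketch below builds one from a small made-up expression table. The gene names (`geneA` to `geneD`) and all values are hypothetical and are not taken from the dataset used in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 6 cells (rows) x 4 hypothetical genes (columns).
rng = np.random.default_rng(0)
base = rng.normal(size=6)
expression = pd.DataFrame(
    {
        "geneA": base,
        "geneB": 2.0 * base + 1.0,     # rises and falls together with geneA
        "geneC": -base,                # moves in the opposite direction to geneA
        "geneD": rng.normal(size=6),   # independent noise
    }
)

# DataFrame.corr computes the Pearson correlation between every pair of columns,
# giving a symmetric genes x genes matrix with ones on the diagonal.
gene_corr = expression.corr(method="pearson")
print(gene_corr.round(2))
```

Genes that move together (here `geneA` and `geneB`) have a correlation close to 1, and genes that move in opposite directions (here `geneA` and `geneC`) have a correlation close to -1.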

First we will cover how to find, load and process the snRNA-seq data.

# Find a Dataset

For this tutorial, we will use a freely available, open-access dataset generated from human peripheral blood mononuclear cells (PBMCs) from patients with clonal hematopoiesis and from controls.
This dataset is available from the cellxgene portal at https://cellxgene.cziscience.com/collections/0aab20b3-c30c-4606-bd2e-d20dae739c45 under the title "Multiomic Profiling of Human Clonal Hematopoiesis Reveals Genotype and Cell-Specific Inflammatory Pathway Activation". The associated paper has the same title and is available at: https://ashpublications.org/bloodadvances/article/8/14/3665/515374/Multiomic-profiling-of-human-clonal-hematopoiesis
scRNA-seq was performed on samples from patients with clonal hematopoiesis and from controls.
This dataset was chosen because it is compatible with the purpose of the pipeline.
This data will be available in the data/test/ directory.
The dataset is stored in h5ad format.
By the end of this section, we will have loaded and explored the dataset.


# Download a Dataset

Start by downloading the dataset from the original portal.
Note that this step does not have to be completed. To save time, the filtered dataset has already been placed in the GitHub repository within /dataset.

In [None]:
# URL of the dataset
url = "https://datasets.cellxgene.cziscience.com/6094cddd-de51-4891-8841-43e25120c336.h5ad"

# Destination path where the dataset will be saved
destination_path = "/ReCoDE-Gene-Network-Analysis/data/data/pbmc.h5ad"

# Download the dataset
wget.download(url, destination_path)

#Alternatively, the dataset can be found in the directory stated in the next cell.
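
Because the file is large, it can be useful to skip the download when the file is already on disk. The sketch below is one possible way to do that. It uses the standard library's `urlretrieve` in place of `wget`, and the helper name `download_if_missing` and its `downloader` argument are hypothetical, introduced here only for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, destination, downloader=urlretrieve):
    """Download `url` to `destination` unless the file already exists.

    Returns the destination path and a flag that is True only when a
    download actually took place. `downloader` is any callable taking
    (url, path); it defaults to urllib's urlretrieve.
    """
    destination = Path(destination)
    if destination.exists():
        return destination, False
    # Create the parent folder first so the download has somewhere to land.
    destination.parent.mkdir(parents=True, exist_ok=True)
    downloader(url, str(destination))
    return destination, True
```

Calling `download_if_missing(url, destination_path)` with the two variables defined in the cell above would then fetch the file only on the first run.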

In [None]:
# Load the dataset
# Please be aware that you will have to download the dataset yourself before it can be loaded
pbmc = sc.read("/ReCoDE-Gene-Network-Analysis/data/data/pbmc.h5ad")

In [4]:
#inspect the loaded data
pbmc

AnnData object with n_obs × n_vars = 66985 × 36263
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

As can be seen, there are 66985 cells within the dataset. For these exercises, we will filter the dataset further to focus on one cell type, which also reduces its size and makes it easier to work with.

In [5]:
pbmc.obs['cell_type']

0002_AAACCCACAAGTCCCG-1                      CD4-positive, alpha-beta T cell
0002_AAACCCAGTAGTCGTT-1    CD16-positive, CD56-dim natural killer cell, h...
0002_AAACCCATCTACACAG-1                                       dendritic cell
0002_AAACGAAAGAATTTGG-1                      CD4-positive, alpha-beta T cell
0002_AAACGCTAGCGACTGA-1                               CD14-positive monocyte
                                                 ...                        
079_TTTGGAGTCAGAGTGG-1                                CD14-positive monocyte
079_TTTGGAGTCGACATAC-1                                CD14-positive monocyte
079_TTTGGTTAGGTTATAG-1     CD16-positive, CD56-dim natural killer cell, h...
079_TTTGGTTCACACCAGC-1                                   natural killer cell
079_TTTGTTGGTTGTTGCA-1                       CD4-positive, alpha-beta T cell
Name: cell_type, Length: 66985, dtype: category
Categories (9, object): ['platelet', 'B cell', 'dendritic cell', 'natural killer cell', ..., 'CD8-positiv

As can be seen there are many different cell types contained within this dataset. We shall focus on B cells for the purposes of our exercises.
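
To see how many cells belong to each type before choosing one, `value_counts` can be used on the `cell_type` column. The sketch below uses a small stand-in Series: the labels mirror those in the output above, but the counts are made up.

```python
import pandas as pd

# Toy stand-in for pbmc.obs['cell_type']; the six entries are invented.
cell_type = pd.Series(
    [
        "B cell",
        "CD14-positive monocyte",
        "B cell",
        "natural killer cell",
        "CD14-positive monocyte",
        "B cell",
    ],
    dtype="category",
    name="cell_type",
)

# Number of cells per cell type, most frequent first.
counts = cell_type.value_counts()
print(counts)

# Number of cells that a boolean filter on 'B cell' would keep.
n_b_cells = int((cell_type == "B cell").sum())
print(n_b_cells)
```

On the real object, the same call would be `pbmc.obs['cell_type'].value_counts()`.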

In [6]:
# Filter the AnnData object for B cells
Bcell = pbmc[pbmc.obs['cell_type'] == 'B cell']

In [7]:
Bcell

View of AnnData object with n_obs × n_vars = 3540 × 36263
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

In [8]:
#Check if the gene names are in the correct format of gene symbols and not Ensembl IDs which are also common.
Bcell.var

Unnamed: 0,feature_is_filtered,feature_name,feature_reference,feature_biotype,feature_length
ENSG00000243485,False,MIR1302-2HG,NCBITaxon:9606,gene,1021
ENSG00000237613,False,FAM138A,NCBITaxon:9606,gene,1219
ENSG00000186092,False,OR4F5,NCBITaxon:9606,gene,2618
ENSG00000238009,False,ENSG00000238009.6,NCBITaxon:9606,gene,3726
ENSG00000239945,False,ENSG00000239945.1,NCBITaxon:9606,gene,1319
...,...,...,...,...,...
ENSG00000277836,False,ENSG00000277836.1,NCBITaxon:9606,gene,288
ENSG00000278633,False,ENSG00000278633.1,NCBITaxon:9606,gene,2404
ENSG00000276017,False,ENSG00000276017.1,NCBITaxon:9606,gene,2404
ENSG00000278817,False,ENSG00000278817.1,NCBITaxon:9606,gene,1213


As can be seen from the gene features dataframe, the rows are currently indexed by Ensembl gene IDs.
However, these are not intuitive to interpret: you would need to look up each Ensembl ID to identify that gene's name and function.
The second column, feature_name, shows that the original authors have already mapped the Ensembl IDs to gene symbols.

# Process the Dataset to Correct Format for Analysis

In [9]:
#Let's go ahead and map the values in the feature_name column to the rownames of the dataframe:
# Set the "feature_name" column as the index (row names)
Bcell.var.set_index("feature_name", drop = False, inplace=True)


Note that not all Ensembl IDs map to a gene symbol, as can be seen near the top of the dataframe, where feature_name still holds an Ensembl ID (for example ENSG00000238009.6).
Since these genes have no symbol, we shall remove them from the dataframe, as they would be difficult to interpret in subsequent analyses.

In [10]:
# Keep only genes whose name does not start with "ENSG", i.e. drop genes without a gene symbol.
# Build a boolean mask that is True for the genes to keep
filter_genes = ~Bcell.var.index.str.startswith("ENSG")  # Exclude genes starting with "ENSG"

# Filter genes based on the condition
Bcell = Bcell[:, filter_genes]
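
The same prefix filter can be tried on a small table to check that it behaves as expected. The sketch below uses four rows copied from the gene table shown earlier and filters on the `feature_name` column directly, so it does not rely on the index having been changed. The variable names `has_symbol` and `var_with_symbols` are hypothetical.

```python
import pandas as pd

# Four genes from the table above: three have a symbol, one only an Ensembl ID.
var = pd.DataFrame(
    {"feature_name": ["MIR1302-2HG", "FAM138A", "ENSG00000238009.6", "OR4F5"]},
    index=["ENSG00000243485", "ENSG00000237613", "ENSG00000238009", "ENSG00000186092"],
)

# True where feature_name is a gene symbol rather than an Ensembl ID.
has_symbol = ~var["feature_name"].str.startswith("ENSG")

# Keep only the rows with a gene symbol.
var_with_symbols = var.loc[has_symbol]
print(var_with_symbols)
```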


In [11]:
Bcell

View of AnnData object with n_obs × n_vars = 3540 × 25198
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

In [12]:
Bcell.var

Unnamed: 0,feature_is_filtered,feature_name,feature_reference,feature_biotype,feature_length
ENSG00000243485,False,MIR1302-2HG,NCBITaxon:9606,gene,1021
ENSG00000237613,False,FAM138A,NCBITaxon:9606,gene,1219
ENSG00000186092,False,OR4F5,NCBITaxon:9606,gene,2618
ENSG00000284733,False,OR4F29,NCBITaxon:9606,gene,939
ENSG00000284662,False,OR4F16,NCBITaxon:9606,gene,939
...,...,...,...,...,...
ENSG00000223641,False,TTTY17C,NCBITaxon:9606,gene,776
ENSG00000228786,False,SEPTIN14P23,NCBITaxon:9606,gene,1192
ENSG00000172288,False,CDY1,NCBITaxon:9606,gene,2670
ENSG00000231141,False,TTTY3,NCBITaxon:9606,gene,344


As can be seen, the number of genes has been reduced from 36263 to 25198, because any rows without a gene symbol have been removed. However, the var table shown above is still indexed by Ensembl IDs, so let's set var_names to the gene symbols, as they are easier to work with.

In [None]:
# Update var_names with feature names from var DataFrame
Bcell.var_names = Bcell.var['feature_name']
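
Gene symbols are not guaranteed to be unique, so it may be worth checking for duplicates once they become the index. The sketch below shows such a check on a made-up list of symbols in which one entry has been duplicated on purpose; whether the real dataset contains duplicates is not something this example establishes.

```python
import pandas as pd

# Hypothetical symbols; "OR4F29" is repeated deliberately for the demonstration.
symbols = pd.Index(["OR4F5", "OR4F29", "TTTY3", "OR4F29"], name="feature_name")

# is_unique is False as soon as any label occurs more than once.
print(symbols.is_unique)

# List each symbol that occurs more than once.
duplicated_symbols = symbols[symbols.duplicated(keep=False)].unique().tolist()
print(duplicated_symbols)
```

If duplicates were found in an AnnData object, `var_names_make_unique()` is the usual way to disambiguate them.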

In [14]:
Bcell.var

feature_name,feature_is_filtered,feature_name,feature_reference,feature_biotype,feature_length
MIR1302-2HG,False,MIR1302-2HG,NCBITaxon:9606,gene,1021
FAM138A,False,FAM138A,NCBITaxon:9606,gene,1219
OR4F5,False,OR4F5,NCBITaxon:9606,gene,2618
OR4F29,False,OR4F29,NCBITaxon:9606,gene,939
OR4F16,False,OR4F16,NCBITaxon:9606,gene,939
...,...,...,...,...,...
TTTY17C,False,TTTY17C,NCBITaxon:9606,gene,776
SEPTIN14P23,False,SEPTIN14P23,NCBITaxon:9606,gene,1192
CDY1,False,CDY1,NCBITaxon:9606,gene,2670
TTTY3,False,TTTY3,NCBITaxon:9606,gene,344


We also need to identify the highly variable genes.

Calculating highly variable genes on gene expression data that has not been log-transformed or normalised appropriately can lead to issues, including the presence of infinity values.
Log transformation is a common preprocessing step for scRNA-seq data, especially when dealing with count data, to stabilise the variance and make the data more amenable to downstream analysis. 
It helps to mitigate the impact of high expression values and reduce the influence of technical noise.
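
A small numerical example may help to show what the transformation does. The sketch below applies NumPy's `log1p` to made-up counts for a single hypothetical gene and compares it with a plain logarithm, which is undefined at zero.

```python
import numpy as np

# Made-up counts for one hypothetical gene across five cells.
counts = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# A plain log is undefined at zero and yields -inf there.
with np.errstate(divide="ignore"):
    plain_log = np.log(counts)

# log1p computes log(1 + x), so zero counts map to exactly 0.
logged = np.log1p(counts)

print("plain log :", plain_log)
print("log1p     :", logged.round(2))
print("raw range :", counts.max() - counts.min())
print("log range :", round(float(logged.max() - logged.min()), 2))
```

The spread between the smallest and largest value shrinks considerably after the transformation, which is how high expression values lose some of their influence.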

In [None]:
# Log-transform the gene expression data (log1p computes log(1 + x))
sc.pp.log1p(Bcell)

In [16]:
# Calculate highly variable genes
sc.pp.highly_variable_genes(Bcell, n_top_genes = 1000)

In [17]:
Bcell

AnnData object with n_obs × n_vars = 3540 × 25198
    obs: 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample', 'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP', 'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'scType_celltype', 'pANN', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'citation', 'default_embedding', 'schema_

In [18]:
Bcell.var

feature_name,feature_is_filtered,feature_name,feature_reference,feature_biotype,feature_length,highly_variable,means,dispersions,dispersions_norm
MIR1302-2HG,False,MIR1302-2HG,NCBITaxon:9606,gene,1021,False,1.000000e-12,,
FAM138A,False,FAM138A,NCBITaxon:9606,gene,1219,False,1.000000e-12,,
OR4F5,False,OR4F5,NCBITaxon:9606,gene,2618,False,1.000000e-12,,
OR4F29,False,OR4F29,NCBITaxon:9606,gene,939,False,1.000000e-12,,
OR4F16,False,OR4F16,NCBITaxon:9606,gene,939,False,1.000000e-12,,
...,...,...,...,...,...,...,...,...,...
TTTY17C,False,TTTY17C,NCBITaxon:9606,gene,776,False,1.000000e-12,,
SEPTIN14P23,False,SEPTIN14P23,NCBITaxon:9606,gene,1192,False,4.027802e-04,0.354964,-1.686367
CDY1,False,CDY1,NCBITaxon:9606,gene,2670,False,1.000000e-12,,
TTTY3,False,TTTY3,NCBITaxon:9606,gene,344,False,1.000000e-12,,
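
The new `highly_variable` column is a boolean flag, so it can be used directly as a mask. The sketch below shows this on a toy version of the table: `geneX` and `geneY` are hypothetical genes, and all numbers are made up apart from the pattern, visible in the output above, that genes with a mean of 1e-12 have empty dispersion values.

```python
import numpy as np
import pandas as pd

# Toy version of the var table after sc.pp.highly_variable_genes.
var = pd.DataFrame(
    {
        "highly_variable": [False, True, False, True],
        "means": [1e-12, 0.8, 4.0e-4, 1.5],
        "dispersions_norm": [np.nan, 2.1, -1.7, 1.3],
    },
    index=pd.Index(["OR4F5", "geneX", "SEPTIN14P23", "geneY"], name="feature_name"),
)

# Names of the genes flagged as highly variable.
hvg_names = var.index[var["highly_variable"]].tolist()
print(hvg_names)

# Number of genes with no dispersion estimate.
n_without_dispersion = int(var["dispersions_norm"].isna().sum())
print(n_without_dispersion)
```

On the AnnData object itself, the same mask can be used to subset the genes, for example `Bcell[:, Bcell.var['highly_variable']]`.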


In [19]:
# Let's save the filtered object
Bcell.write_h5ad('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_filtered.h5ad')

# Process Associated Metadata

We will now explore the associated metadata.

In [20]:
Bcell.obs.columns

Index(['nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'HTO_maxID',
       'HTO_secondID', 'HTO_margin', 'HTO_classification.global', 'sample',
       'donor_id', 'CHIP', 'LANE', 'ProjectID', 'MUTATION', 'MUTATION.GROUP',
       'sex_ontology_term_id', 'HTOID', 'percent.mt', 'nCount_SCT',
       'nFeature_SCT', 'scType_celltype', 'pANN',
       'development_stage_ontology_term_id', 'cell_type_ontology_term_id',
       'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id',
       'suspension_type', 'is_primary_data', 'tissue_type',
       'tissue_ontology_term_id', 'organism_ontology_term_id',
       'disease_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism',
       'sex', 'tissue', 'self_reported_ethnicity', 'development_stage',
       'observation_joinid'],
      dtype='object')

As can be seen, this dataset contains 3540 cells and 25198 genes.
It also has relevant metadata in the obs section, such as MUTATION. 
The metadata may need to be encoded into the correct format for subsequent analyses, so let's have a look at the current format.

In [21]:
Bcell.obs

Unnamed: 0,nCount_RNA,nFeature_RNA,nCount_HTO,nFeature_HTO,HTO_maxID,HTO_secondID,HTO_margin,HTO_classification.global,sample,donor_id,...,disease_ontology_term_id,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid
0002_AAAGGGCAGCAGCACA-1,1192.0,629,97.0,2,sample-2,sample-5,3.146440,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,FrFs19`Dsw
0002_AACAACCAGGGTTAGC-1,849.0,548,21.0,3,sample-2,sample-5,1.314667,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,MMer^rOrRY
0002_AACCCAAAGGGCCTCT-1,2492.0,1188,110.0,4,sample-2,sample-6,2.556420,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,^dC2N0DTU|
0002_AACGAAACACAAAGTA-1,1060.0,608,20.0,4,sample-2,sample-3,0.705259,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,F>Ad_32l$>
0002_AAGCGTTTCTTGGGCG-1,1270.0,716,237.0,4,sample-2,sample-5,3.121787,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,+@dOztSS*d
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,1925.0,1054,43.0,3,sample-5,sample-6,2.155876,Singlet,sample-5,CH-21-079,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,78-year-old human stage,GMZ)5R6Eh*
079_TGAATCGAGATTCGAA-1,2026.0,1097,89.0,4,sample-5,sample-4,2.725727,Singlet,sample-5,CH-21-079,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,78-year-old human stage,lxd{TRji23
079_TGCGATAAGGTAGATT-1,1594.0,933,37.0,2,sample-5,sample-2,1.818129,Singlet,sample-5,CH-21-079,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,78-year-old human stage,KN2ItXPkR4
079_TGCTCGTAGGGTTGCA-1,1840.0,1101,74.0,3,sample-5,sample-1,2.510466,Singlet,sample-5,CH-21-079,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,78-year-old human stage,+VU%_s11(N


Let's create a separate dataframe with the metadata, as this will be needed for the correlation analysis.

In [22]:
# Create a copy of the metadata so as not to alter the original AnnData object.
metadata = Bcell.obs.copy()
metadata

Unnamed: 0,nCount_RNA,nFeature_RNA,nCount_HTO,nFeature_HTO,HTO_maxID,HTO_secondID,HTO_margin,HTO_classification.global,sample,donor_id,...,disease_ontology_term_id,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid
0002_AAAGGGCAGCAGCACA-1,1192.0,629,97.0,2,sample-2,sample-5,3.146440,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,FrFs19`Dsw
0002_AACAACCAGGGTTAGC-1,849.0,548,21.0,3,sample-2,sample-5,1.314667,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,MMer^rOrRY
0002_AACCCAAAGGGCCTCT-1,2492.0,1188,110.0,4,sample-2,sample-6,2.556420,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,^dC2N0DTU|
0002_AACGAAACACAAAGTA-1,1060.0,608,20.0,4,sample-2,sample-3,0.705259,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,F>Ad_32l$>
0002_AAGCGTTTCTTGGGCG-1,1270.0,716,237.0,4,sample-2,sample-5,3.121787,Singlet,sample-2,CH-20-002,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,68-year-old human stage,+@dOztSS*d
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,1925.0,1054,43.0,3,sample-5,sample-6,2.155876,Singlet,sample-5,CH-21-079,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,78-year-old human stage,GMZ)5R6Eh*
079_TGAATCGAGATTCGAA-1,2026.0,1097,89.0,4,sample-5,sample-4,2.725727,Singlet,sample-5,CH-21-079,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,78-year-old human stage,lxd{TRji23
079_TGCGATAAGGTAGATT-1,1594.0,933,37.0,2,sample-5,sample-2,1.818129,Singlet,sample-5,CH-21-079,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,78-year-old human stage,KN2ItXPkR4
079_TGCTCGTAGGGTTGCA-1,1840.0,1101,74.0,3,sample-5,sample-1,2.510466,Singlet,sample-5,CH-21-079,...,MONDO:0100542,B cell,10x 3' v3,clonal hematopoiesis,Homo sapiens,male,blood,European,78-year-old human stage,+VU%_s11(N


Many of these columns are not needed for the analysis.

In [23]:
# Let's remove these columns
columns_to_remove = [
    'nCount_HTO', 'nFeature_HTO', 'HTO_maxID',
    'HTO_secondID', 'HTO_margin', 'HTO_classification.global',
    'sample', 'sex_ontology_term_id', 'assay_ontology_term_id',
    'suspension_type', 'is_primary_data', 'tissue_ontology_term_id',
    'organism_ontology_term_id', 'disease_ontology_term_id', 'assay',
    'organism', 'self_reported_ethnicity', 'observation_joinid',
    'CHIP', 'LANE', 'ProjectID', 'HTOID',
    'nCount_SCT', 'nFeature_SCT', 'pANN',
    'development_stage_ontology_term_id', 'cell_type_ontology_term_id',
    'self_reported_ethnicity_ontology_term_id',
]

In [24]:
metadata.drop(columns=columns_to_remove, inplace = True) #Set inplace=True to modify the DataFrame in place. If you set inplace=False or omit it, the drop() method will return a new DataFrame with the specified columns removed, leaving the original DataFrame unchanged.
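
The difference between the two modes of `drop()` can be seen on a tiny DataFrame. In the sketch below the column names are borrowed from the metadata, but the values are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"nCount_RNA": [1192.0, 849.0], "nCount_HTO": [97.0, 21.0], "LANE": ["a", "b"]}
)

# Without inplace=True, drop() returns a new DataFrame and leaves df untouched.
trimmed = df.drop(columns=["nCount_HTO", "LANE"])
print(list(trimmed.columns))  # columns of the new DataFrame
print(list(df.columns))       # df still has all three columns

# With inplace=True, df itself is modified and the call returns None.
result = df.drop(columns=["nCount_HTO"], inplace=True)
print(result)
print(list(df.columns))
```

Assigning the returned DataFrame to a new name, as with `trimmed` here, keeps the original available in case a column turns out to be needed later.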

In [25]:
metadata

Unnamed: 0,nCount_RNA,nFeature_RNA,donor_id,MUTATION,MUTATION.GROUP,percent.mt,scType_celltype,tissue_type,cell_type,disease,sex,tissue,development_stage
0002_AAAGGGCAGCAGCACA-1,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,3.803975,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage
0002_AACAACCAGGGTTAGC-1,849.0,548,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.969349,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage
0002_AACCCAAAGGGCCTCT-1,2492.0,1188,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,4.029404,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage
0002_AACGAAACACAAAGTA-1,1060.0,608,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.138810,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage
0002_AAGCGTTTCTTGGGCG-1,1270.0,716,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,13.945409,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage
...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,1925.0,1054,CH-21-079,DNMT3A M880V (5%),DNMT3A,4.876033,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage
079_TGAATCGAGATTCGAA-1,2026.0,1097,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.510031,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage
079_TGCGATAAGGTAGATT-1,1594.0,933,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.495584,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage
079_TGCTCGTAGGGTTGCA-1,1840.0,1101,CH-21-079,DNMT3A M880V (5%),DNMT3A,6.130157,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage


Inspecting the metadata dataframe shows that some columns contain numerical data and others contain character strings.
The string columns need to be encoded numerically so that they can be used in correlations.
Let's first identify the unique labels within each column.

In [26]:
metadata['sex'].unique()

['male', 'female']
Categories (2, object): ['female', 'male']

Looks like both male and female patients are included within this dataset. This will need to be numerically encoded so that it can be correlated against in downstream analysis.

In [27]:
metadata['male'] = metadata['sex'].apply(lambda x: 1 if x == "male" else 0)
metadata['female'] = metadata['sex'].apply(lambda x: 1 if x == "female" else 0)
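
An equivalent encoding can be written without `apply`, by comparing the whole column at once and casting the resulting booleans to integers. The sketch below shows this on a three-row stand-in for the `sex` column.

```python
import pandas as pd

# Toy stand-in for metadata['sex'].
sex = pd.Series(["male", "female", "male"], dtype="category", name="sex")

# The comparison yields True/False per row; astype(int) turns that into 1/0.
male = (sex == "male").astype(int)
female = (sex == "female").astype(int)

print(male.tolist())
print(female.tolist())
```

With only two categories and no missing values, the two indicator columns are exact complements of each other, so they carry the same information for a correlation, just with opposite sign.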

In [28]:
metadata

Unnamed: 0,nCount_RNA,nFeature_RNA,donor_id,MUTATION,MUTATION.GROUP,percent.mt,scType_celltype,tissue_type,cell_type,disease,sex,tissue,development_stage,male,female
0002_AAAGGGCAGCAGCACA-1,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,3.803975,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0
0002_AACAACCAGGGTTAGC-1,849.0,548,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.969349,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0
0002_AACCCAAAGGGCCTCT-1,2492.0,1188,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,4.029404,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0
0002_AACGAAACACAAAGTA-1,1060.0,608,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.138810,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0
0002_AAGCGTTTCTTGGGCG-1,1270.0,716,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,13.945409,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,1925.0,1054,CH-21-079,DNMT3A M880V (5%),DNMT3A,4.876033,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage,1,0
079_TGAATCGAGATTCGAA-1,2026.0,1097,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.510031,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage,1,0
079_TGCGATAAGGTAGATT-1,1594.0,933,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.495584,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage,1,0
079_TGCTCGTAGGGTTGCA-1,1840.0,1101,CH-21-079,DNMT3A M880V (5%),DNMT3A,6.130157,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage,1,0


Now let's have a look at the disease variable.

In [29]:
metadata['disease'].unique()

['clonal hematopoiesis', 'normal']
Categories (2, object): ['normal', 'clonal hematopoiesis']

In [30]:
#The disease column can be encoded into a binary variable. 
metadata['CH'] = metadata['disease'].apply(lambda x: 1 if x == "clonal hematopoiesis" else 0)
metadata['normal'] = metadata['disease'].apply(lambda x: 1 if x == "normal" else 0)

In [31]:
metadata

Unnamed: 0,nCount_RNA,nFeature_RNA,donor_id,MUTATION,MUTATION.GROUP,percent.mt,scType_celltype,tissue_type,cell_type,disease,sex,tissue,development_stage,male,female,CH,normal
0002_AAAGGGCAGCAGCACA-1,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,3.803975,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0,1,0
0002_AACAACCAGGGTTAGC-1,849.0,548,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.969349,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0,1,0
0002_AACCCAAAGGGCCTCT-1,2492.0,1188,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,4.029404,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0,1,0
0002_AACGAAACACAAAGTA-1,1060.0,608,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.138810,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0,1,0
0002_AAGCGTTTCTTGGGCG-1,1270.0,716,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,13.945409,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68-year-old human stage,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,1925.0,1054,CH-21-079,DNMT3A M880V (5%),DNMT3A,4.876033,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage,1,0,1,0
079_TGAATCGAGATTCGAA-1,2026.0,1097,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.510031,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage,1,0,1,0
079_TGCGATAAGGTAGATT-1,1594.0,933,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.495584,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage,1,0,1,0
079_TGCTCGTAGGGTTGCA-1,1840.0,1101,CH-21-079,DNMT3A M880V (5%),DNMT3A,6.130157,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78-year-old human stage,1,0,1,0


Now let's sort out the development_stage column.

In [32]:
print(metadata['development_stage'].cat.categories)

Index(['39-year-old human stage', '48-year-old human stage',
       '50-year-old human stage', '58-year-old human stage',
       '60-year-old human stage', '61-year-old human stage',
       '65-year-old human stage', '67-year-old human stage',
       '68-year-old human stage', '70-year-old human stage',
       '71-year-old human stage', '73-year-old human stage',
       '74-year-old human stage', '77-year-old human stage',
       '78-year-old human stage', '80-year-old human stage',
       '81-year-old human stage', '83-year-old human stage',
       '85-year-old human stage', '89-year-old human stage',
       '91-year-old human stage'],
      dtype='object')


In [33]:
# There are 21 categories, each an age in years. Let's encode them numerically
# Recode development_stage
development_stage_mapping = {
    '39-year-old human stage': 39,
    '48-year-old human stage': 48,
    '50-year-old human stage': 50,
    '58-year-old human stage': 58,
    '60-year-old human stage': 60,
    '61-year-old human stage': 61,
    '65-year-old human stage': 65,
    '67-year-old human stage': 67,
    '68-year-old human stage': 68,
    '70-year-old human stage': 70,
    '71-year-old human stage': 71,
    '73-year-old human stage': 73,
    '74-year-old human stage': 74,
    '77-year-old human stage': 77,
    '78-year-old human stage': 78,
    '80-year-old human stage': 80,
    '81-year-old human stage': 81,
    '83-year-old human stage': 83,
    '85-year-old human stage': 85,
    '89-year-old human stage': 89,
    '91-year-old human stage': 91    
}
metadata['development_stage'] = metadata['development_stage'].map(development_stage_mapping)
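
Since every label follows the same "N-year-old human stage" pattern, the age could also be extracted with a regular expression instead of a hand-written dictionary. The sketch below is one way to do that, shown on three of the labels listed above. It assumes every label matches the pattern; a label that did not would produce a missing value and make the final conversion to integers fail.

```python
import pandas as pd

# Three of the development_stage labels from the output above.
stage = pd.Series(
    ["68-year-old human stage", "78-year-old human stage", "39-year-old human stage"],
    dtype="category",
    name="development_stage",
)

# Capture the leading digits of each label and convert them to integers.
age = (
    stage.astype(str)
    .str.extract(r"^(\d+)-year-old", expand=False)
    .astype(int)
)
print(age.tolist())
```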

In [34]:
metadata

Unnamed: 0,nCount_RNA,nFeature_RNA,donor_id,MUTATION,MUTATION.GROUP,percent.mt,scType_celltype,tissue_type,cell_type,disease,sex,tissue,development_stage,male,female,CH,normal
0002_AAAGGGCAGCAGCACA-1,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,3.803975,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0
0002_AACAACCAGGGTTAGC-1,849.0,548,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.969349,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0
0002_AACCCAAAGGGCCTCT-1,2492.0,1188,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,4.029404,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0
0002_AACGAAACACAAAGTA-1,1060.0,608,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.138810,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0
0002_AAGCGTTTCTTGGGCG-1,1270.0,716,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,13.945409,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,1925.0,1054,CH-21-079,DNMT3A M880V (5%),DNMT3A,4.876033,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78,1,0,1,0
079_TGAATCGAGATTCGAA-1,2026.0,1097,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.510031,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78,1,0,1,0
079_TGCGATAAGGTAGATT-1,1594.0,933,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.495584,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78,1,0,1,0
079_TGCTCGTAGGGTTGCA-1,1840.0,1101,CH-21-079,DNMT3A M880V (5%),DNMT3A,6.130157,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78,1,0,1,0


In [35]:
metadata['MUTATION.GROUP'].unique()

['DNMT3A', 'none', 'TET2']
Categories (3, object): ['DNMT3A', 'TET2', 'none']

In [36]:
metadata['DNMT3A'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "DNMT3A" else 0)
metadata['TET2'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "TET2" else 0)
metadata['NoMutation'] = metadata['MUTATION.GROUP'].apply(lambda x: 1 if x == "none" else 0)
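
For a column with more than two categories, `pd.get_dummies` can create all the indicator columns in one call. The sketch below shows this on a four-row stand-in for `MUTATION.GROUP`, and renames the `none` column to `NoMutation` to match the names used in the cell above.

```python
import pandas as pd

# Toy stand-in for metadata['MUTATION.GROUP'].
group = pd.Series(["DNMT3A", "none", "TET2", "DNMT3A"], name="MUTATION.GROUP")

# One 0/1 column per category; dtype=int avoids boolean output.
indicators = pd.get_dummies(group, dtype=int).rename(columns={"none": "NoMutation"})
print(indicators)
```

Each row has exactly one 1 across the three columns. The result could be attached to the metadata with `pd.concat([metadata, indicators], axis=1)`, provided both share the same index.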

In [37]:
metadata

Unnamed: 0,nCount_RNA,nFeature_RNA,donor_id,MUTATION,MUTATION.GROUP,percent.mt,scType_celltype,tissue_type,cell_type,disease,sex,tissue,development_stage,male,female,CH,normal,DNMT3A,TET2,NoMutation
0002_AAAGGGCAGCAGCACA-1,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,3.803975,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0,1,0,0
0002_AACAACCAGGGTTAGC-1,849.0,548,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.969349,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0,1,0,0
0002_AACCCAAAGGGCCTCT-1,2492.0,1188,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,4.029404,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0,1,0,0
0002_AACGAAACACAAAGTA-1,1060.0,608,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,7.138810,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0,1,0,0
0002_AAGCGTTTCTTGGGCG-1,1270.0,716,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",DNMT3A,13.945409,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,68,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,1925.0,1054,CH-21-079,DNMT3A M880V (5%),DNMT3A,4.876033,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78,1,0,1,0,1,0,0
079_TGAATCGAGATTCGAA-1,2026.0,1097,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.510031,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78,1,0,1,0,1,0,0
079_TGCGATAAGGTAGATT-1,1594.0,933,CH-21-079,DNMT3A M880V (5%),DNMT3A,5.495584,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78,1,0,1,0,1,0,0
079_TGCTCGTAGGGTTGCA-1,1840.0,1101,CH-21-079,DNMT3A M880V (5%),DNMT3A,6.130157,Naive B cells,tissue,B cell,clonal hematopoiesis,male,blood,78,1,0,1,0,1,0,0


In [38]:
# Drop unnecessary columns
metadata = metadata.drop(['disease', 'MUTATION.GROUP', 'sex'], axis=1)
metadata

Unnamed: 0,nCount_RNA,nFeature_RNA,donor_id,MUTATION,percent.mt,scType_celltype,tissue_type,cell_type,tissue,development_stage,male,female,CH,normal,DNMT3A,TET2,NoMutation
0002_AAAGGGCAGCAGCACA-1,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",3.803975,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0
0002_AACAACCAGGGTTAGC-1,849.0,548,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",7.969349,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0
0002_AACCCAAAGGGCCTCT-1,2492.0,1188,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",4.029404,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0
0002_AACGAAACACAAAGTA-1,1060.0,608,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",7.138810,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0
0002_AAGCGTTTCTTGGGCG-1,1270.0,716,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",13.945409,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,1925.0,1054,CH-21-079,DNMT3A M880V (5%),4.876033,Naive B cells,tissue,B cell,blood,78,1,0,1,0,1,0,0
079_TGAATCGAGATTCGAA-1,2026.0,1097,CH-21-079,DNMT3A M880V (5%),5.510031,Naive B cells,tissue,B cell,blood,78,1,0,1,0,1,0,0
079_TGCGATAAGGTAGATT-1,1594.0,933,CH-21-079,DNMT3A M880V (5%),5.495584,Naive B cells,tissue,B cell,blood,78,1,0,1,0,1,0,0
079_TGCTCGTAGGGTTGCA-1,1840.0,1101,CH-21-079,DNMT3A M880V (5%),6.130157,Naive B cells,tissue,B cell,blood,78,1,0,1,0,1,0,0


In [39]:
#Save the metadata dataframe
metadata.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_metadata.csv', index = True)

In [None]:
metadata = pd.read_csv('data/Bcell_metadata.csv', index_col = 0)

Single-cell data naturally contain many cells from the same donor.
Correlating gene expression across individual cells would therefore mix within-donor and between-donor correlations, because cells from the same donor are not independent observations.
For this reason, the single-cell data must first be pseudobulked, that is, aggregated to one expression profile per donor, before continuing with the analysis.
Pseudobulking speeds up the computation and, more importantly, removes the effect of within-sample correlation.
It can also help to mitigate issues commonly found in single-cell data, such as dropouts and the resulting high proportion of zero counts.
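
To make the idea concrete, here is a minimal sketch on made-up data (the donors, genes and values are hypothetical). Five cells from two donors are collapsed into one row per donor, and the fraction of zero entries is compared before and after aggregation.

```python
import pandas as pd

# Hypothetical toy expression table: five cells from two donors, two genes
cells = pd.DataFrame(
    {
        "donor_id": ["D1", "D1", "D1", "D2", "D2"],
        "geneA": [1.0, 0.0, 2.0, 0.0, 3.0],
        "geneB": [0.0, 0.0, 1.0, 4.0, 0.0],
    },
    index=["c1", "c2", "c3", "c4", "c5"],
)

# Pseudobulking: collapse all cells of a donor into one profile by summing
pseudobulk = cells.groupby("donor_id").sum()
print(pseudobulk)

# Fraction of zero entries before and after aggregation
zeros_before = (cells[["geneA", "geneB"]] == 0).to_numpy().mean()
zeros_after = (pseudobulk == 0).to_numpy().mean()
print(zeros_before, zeros_after)
```

In this toy case half of the cell-level entries are zero, while the donor-level table contains none.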

# Pseudobulk the Metadata

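First, we reduce the metadata dataframe to one row per donor, since the expression data will be aggregated to the donor level.
Donor-level columns (mutation, age, sex, disease status) are identical for every cell from a donor, so keeping the first cell of each donor is sufficient. Note that cell-level columns such as nCount_RNA, nFeature_RNA and percent.mt then describe only that first cell, not the donor as a whole.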

In [42]:
# Convert row names to a column named 'cell_id'
metadata['cell_id'] = metadata.index

In [43]:
# Group by 'donor_id' and select the first row of each group
rows = metadata.groupby('donor_id').first().reset_index()

In [44]:
rows

Unnamed: 0,donor_id,nCount_RNA,nFeature_RNA,MUTATION,percent.mt,scType_celltype,tissue_type,cell_type,tissue,development_stage,male,female,CH,normal,DNMT3A,TET2,NoMutation,cell_id
0,CH-20-001,2490.0,1403,DNMT3A R882C,6.119578,Naive B cells,tissue,B cell,blood,60,1,0,1,0,1,0,0,001_AAAGAACGTTCTCAGA-1
1,CH-20-002,1192.0,629,"DNMT3A R729W (4%), DNMT3A R736C (2%)",3.803975,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0,0002_AAAGGGCAGCAGCACA-1
2,CH-20-004,1833.0,985,"TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95...",5.335196,Naive B cells,tissue,B cell,blood,85,1,0,1,0,0,1,0,004_AACCTGATCTTTGATC-1
3,CH-20-005,1966.0,886,TET2 V1900F (2%),5.314136,Naive B cells,tissue,B cell,blood,58,0,1,1,0,0,1,0,005_AACAACCAGAGCTGAC-1
4,CH-21-002,1912.0,938,none,5.657238,Naive B cells,tissue,B cell,blood,48,0,1,0,1,0,0,1,002_AAAGGTACACATTGTG-1
5,CH-21-006,1356.0,709,DNMT3A R882H (13%),5.211849,Naive B cells,tissue,B cell,blood,67,0,1,1,0,1,0,0,006_AACGAAACAGAGTTCT-1
6,CH-21-008,1117.0,575,none,8.398348,Naive B cells,tissue,B cell,blood,70,0,1,0,1,0,0,1,008_AACAGGGTCTTCTCAA-1
7,CH-21-013,1321.0,816,none,4.663212,Naive B cells,tissue,B cell,blood,73,1,0,0,1,0,0,1,013_AACCAACAGGTAGCCA-1
8,CH-21-014,1064.0,623,"SRSF2 P95R (40%), TET2 L957Ifs*15 (51%)",4.146577,Naive B cells,tissue,B cell,blood,74,1,0,1,0,0,1,0,014_AAAGTCCGTTTGACAC-1
9,CH-21-017,1880.0,953,"DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27...",6.519922,Naive B cells,tissue,B cell,blood,65,1,0,1,0,1,0,0,017_AAACGAAAGGCGAACT-1


In [45]:
# Extract row indices corresponding to the first cell from each donor
row_list = [metadata.index.get_loc(cell_id) for cell_id in rows['cell_id']]

In [46]:
row_list

[100,
 0,
 187,
 276,
 145,
 421,
 473,
 663,
 842,
 915,
 1070,
 1589,
 1651,
 1700,
 1849,
 1992,
 2409,
 2954,
 3028,
 3228,
 3304,
 3337,
 3353,
 3506]

In [47]:
# Select the rows (the first cell of each donor) from the DataFrame
metadata2 = metadata.iloc[row_list, :].copy()

In [48]:
metadata2

Unnamed: 0,nCount_RNA,nFeature_RNA,donor_id,MUTATION,percent.mt,scType_celltype,tissue_type,cell_type,tissue,development_stage,male,female,CH,normal,DNMT3A,TET2,NoMutation,cell_id
001_AAAGAACGTTCTCAGA-1,2490.0,1403,CH-20-001,DNMT3A R882C,6.119578,Naive B cells,tissue,B cell,blood,60,1,0,1,0,1,0,0,001_AAAGAACGTTCTCAGA-1
0002_AAAGGGCAGCAGCACA-1,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",3.803975,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0,0002_AAAGGGCAGCAGCACA-1
004_AACCTGATCTTTGATC-1,1833.0,985,CH-20-004,"TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95...",5.335196,Naive B cells,tissue,B cell,blood,85,1,0,1,0,0,1,0,004_AACCTGATCTTTGATC-1
005_AACAACCAGAGCTGAC-1,1966.0,886,CH-20-005,TET2 V1900F (2%),5.314136,Naive B cells,tissue,B cell,blood,58,0,1,1,0,0,1,0,005_AACAACCAGAGCTGAC-1
002_AAAGGTACACATTGTG-1,1912.0,938,CH-21-002,none,5.657238,Naive B cells,tissue,B cell,blood,48,0,1,0,1,0,0,1,002_AAAGGTACACATTGTG-1
006_AACGAAACAGAGTTCT-1,1356.0,709,CH-21-006,DNMT3A R882H (13%),5.211849,Naive B cells,tissue,B cell,blood,67,0,1,1,0,1,0,0,006_AACGAAACAGAGTTCT-1
008_AACAGGGTCTTCTCAA-1,1117.0,575,CH-21-008,none,8.398348,Naive B cells,tissue,B cell,blood,70,0,1,0,1,0,0,1,008_AACAGGGTCTTCTCAA-1
013_AACCAACAGGTAGCCA-1,1321.0,816,CH-21-013,none,4.663212,Naive B cells,tissue,B cell,blood,73,1,0,0,1,0,0,1,013_AACCAACAGGTAGCCA-1
014_AAAGTCCGTTTGACAC-1,1064.0,623,CH-21-014,"SRSF2 P95R (40%), TET2 L957Ifs*15 (51%)",4.146577,Naive B cells,tissue,B cell,blood,74,1,0,1,0,0,1,0,014_AAAGTCCGTTTGACAC-1
017_AAACGAAAGGCGAACT-1,1880.0,953,CH-21-017,"DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27...",6.519922,Naive B cells,tissue,B cell,blood,65,1,0,1,0,1,0,0,017_AAACGAAAGGCGAACT-1


In [49]:
metadata2.set_index('donor_id', inplace = True, drop = False)

In [50]:
metadata2

donor_id,nCount_RNA,nFeature_RNA,donor_id,MUTATION,percent.mt,scType_celltype,tissue_type,cell_type,tissue,development_stage,male,female,CH,normal,DNMT3A,TET2,NoMutation,cell_id
CH-20-001,2490.0,1403,CH-20-001,DNMT3A R882C,6.119578,Naive B cells,tissue,B cell,blood,60,1,0,1,0,1,0,0,001_AAAGAACGTTCTCAGA-1
CH-20-002,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",3.803975,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0,0002_AAAGGGCAGCAGCACA-1
CH-20-004,1833.0,985,CH-20-004,"TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95...",5.335196,Naive B cells,tissue,B cell,blood,85,1,0,1,0,0,1,0,004_AACCTGATCTTTGATC-1
CH-20-005,1966.0,886,CH-20-005,TET2 V1900F (2%),5.314136,Naive B cells,tissue,B cell,blood,58,0,1,1,0,0,1,0,005_AACAACCAGAGCTGAC-1
CH-21-002,1912.0,938,CH-21-002,none,5.657238,Naive B cells,tissue,B cell,blood,48,0,1,0,1,0,0,1,002_AAAGGTACACATTGTG-1
CH-21-006,1356.0,709,CH-21-006,DNMT3A R882H (13%),5.211849,Naive B cells,tissue,B cell,blood,67,0,1,1,0,1,0,0,006_AACGAAACAGAGTTCT-1
CH-21-008,1117.0,575,CH-21-008,none,8.398348,Naive B cells,tissue,B cell,blood,70,0,1,0,1,0,0,1,008_AACAGGGTCTTCTCAA-1
CH-21-013,1321.0,816,CH-21-013,none,4.663212,Naive B cells,tissue,B cell,blood,73,1,0,0,1,0,0,1,013_AACCAACAGGTAGCCA-1
CH-21-014,1064.0,623,CH-21-014,"SRSF2 P95R (40%), TET2 L957Ifs*15 (51%)",4.146577,Naive B cells,tissue,B cell,blood,74,1,0,1,0,0,1,0,014_AAAGTCCGTTTGACAC-1
CH-21-017,1880.0,953,CH-21-017,"DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27...",6.519922,Naive B cells,tissue,B cell,blood,65,1,0,1,0,1,0,0,017_AAACGAAAGGCGAACT-1


In [51]:
#Remove the cell_id column
metadata2.drop(columns = 'cell_id', inplace = True)

In [52]:
metadata2

donor_id,nCount_RNA,nFeature_RNA,donor_id,MUTATION,percent.mt,scType_celltype,tissue_type,cell_type,tissue,development_stage,male,female,CH,normal,DNMT3A,TET2,NoMutation
CH-20-001,2490.0,1403,CH-20-001,DNMT3A R882C,6.119578,Naive B cells,tissue,B cell,blood,60,1,0,1,0,1,0,0
CH-20-002,1192.0,629,CH-20-002,"DNMT3A R729W (4%), DNMT3A R736C (2%)",3.803975,Naive B cells,tissue,B cell,blood,68,1,0,1,0,1,0,0
CH-20-004,1833.0,985,CH-20-004,"TET2 R1516X (30%), TET2 Q659X (29%), SRSF2 P95...",5.335196,Naive B cells,tissue,B cell,blood,85,1,0,1,0,0,1,0
CH-20-005,1966.0,886,CH-20-005,TET2 V1900F (2%),5.314136,Naive B cells,tissue,B cell,blood,58,0,1,1,0,0,1,0
CH-21-002,1912.0,938,CH-21-002,none,5.657238,Naive B cells,tissue,B cell,blood,48,0,1,0,1,0,0,1
CH-21-006,1356.0,709,CH-21-006,DNMT3A R882H (13%),5.211849,Naive B cells,tissue,B cell,blood,67,0,1,1,0,1,0,0
CH-21-008,1117.0,575,CH-21-008,none,8.398348,Naive B cells,tissue,B cell,blood,70,0,1,0,1,0,0,1
CH-21-013,1321.0,816,CH-21-013,none,4.663212,Naive B cells,tissue,B cell,blood,73,1,0,0,1,0,0,1
CH-21-014,1064.0,623,CH-21-014,"SRSF2 P95R (40%), TET2 L957Ifs*15 (51%)",4.146577,Naive B cells,tissue,B cell,blood,74,1,0,1,0,0,1,0
CH-21-017,1880.0,953,CH-21-017,"DNMT3A R882H (20%), IDH2 R140Q (10%), TP53 R27...",6.519922,Naive B cells,tissue,B cell,blood,65,1,0,1,0,1,0,0


In [53]:
#Save the metadata
metadata2.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_metadata_pseudobulk.csv', index = True)

The metadata dataframe for the pseudobulk is now complete.
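
The same one-row-per-donor table can be built more compactly. The sketch below is one possible alternative, shown on made-up data (`meta` and its values are hypothetical): `drop_duplicates` keeps the first cell of each donor, and `set_index` then indexes the result by donor. It assumes, as above, that the first cell of a donor is representative of its donor-level columns.

```python
import pandas as pd

# Toy per-cell metadata (illustrative values); the index holds cell barcodes
meta = pd.DataFrame(
    {
        "donor_id": ["D2", "D2", "D1", "D1", "D3"],
        "development_stage": [68, 68, 60, 60, 85],
        "CH": [1, 1, 1, 1, 0],
    },
    index=["c1", "c2", "c3", "c4", "c5"],
)

# Keep the first cell of each donor, then index the result by donor
per_donor = (
    meta.drop_duplicates(subset="donor_id", keep="first")
    .set_index("donor_id", drop=False)
    .sort_index()
)
print(per_donor)
```

Sorting the index gives the same donor order that `groupby('donor_id')` produces, which keeps the metadata aligned with the aggregated expression table built later.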

# Pseudobulk the Corresponding Data

Let's proceed to aggregate the gene expression data.
This involves summing, for each donor, the expression values of each gene across all of that donor's cells.

First, the gene expression matrix needs to be extracted from our AnnData object (Bcell).

Single-cell expression data is usually stored as a sparse matrix, so it must first be converted to a dense matrix before it can be turned into a dataframe.
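
As a small illustration of this conversion, the sketch below builds a hypothetical sparse matrix with SciPy and turns it into a labelled dataframe. The matrix `X`, the cell names and the gene names are all made up; `X` merely stands in for the expression matrix of an AnnData object.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical 3 cells x 4 genes matrix, mostly zeros
dense = np.array(
    [
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 2.0],
        [3.0, 0.0, 0.0, 0.0],
    ]
)
X = sparse.csr_matrix(dense)  # stands in for the expression matrix

# Only convert when the matrix is actually sparse
X_dense = X.toarray() if sparse.issparse(X) else np.asarray(X)

expr = pd.DataFrame(
    X_dense,
    index=["cell1", "cell2", "cell3"],
    columns=["g1", "g2", "g3", "g4"],
)
print(expr)
print(f"stored non-zero values: {X.nnz} of {X.shape[0] * X.shape[1]}")
```

A dense copy stores every zero explicitly, so for large datasets it can be worth filtering genes or cells before converting.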

In [54]:
# Convert the sparse matrix to a dense NumPy array
dense_matrix = Bcell.X.toarray()

In [55]:
datExpr = pd.DataFrame(dense_matrix, index=Bcell.obs_names, columns=Bcell.var_names)

In [56]:
datExpr

feature_name,MIR1302-2HG,FAM138A,OR4F5,OR4F29,OR4F16,LINC01409,FAM87B,LINC01128,LINC00115,FAM41C,...,BPY2B,DAZ3,DAZ4,BPY2C,TTTY4C,TTTY17C,SEPTIN14P23,CDY1,TTTY3,MAFIP
0002_AAAGGGCAGCAGCACA-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0002_AACAACCAGGGTTAGC-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0002_AACCCAAAGGGCCTCT-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0002_AACGAAACACAAAGTA-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0002_AAGCGTTTCTTGGGCG-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
079_TGAATCGAGATTCGAA-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
079_TGCGATAAGGTAGATT-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
079_TGCTCGTAGGGTTGCA-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [57]:
# Save the single-cell expression dataframe
datExpr.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_singlecell.csv', index = True)

Highly variable genes (HVGs) are the genes whose expression varies most across cells, and they tend to carry the most informative signal, so they will be used to filter the expression matrix further.
Restricting the matrix to HVGs also reduces the dimensionality of the data, which makes downstream analyses more computationally efficient.
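
The filtering step amounts to selecting columns with a boolean flag. The sketch below shows the pattern on made-up data; `expr` and `var` are hypothetical stand-ins for the expression dataframe and the per-gene annotation table, and the `highly_variable` values are invented.

```python
import pandas as pd

# Toy expression table (cells x genes) and a per-gene annotation table,
# mimicking the layout of the expression dataframe and an AnnData .var frame
expr = pd.DataFrame(
    [[0.0, 1.2, 0.0], [0.5, 0.0, 0.0], [0.0, 2.1, 0.0]],
    index=["cell1", "cell2", "cell3"],
    columns=["g1", "g2", "g3"],
)
var = pd.DataFrame({"highly_variable": [True, True, False]}, index=expr.columns)

# Names of the genes flagged as highly variable
hvg = var.index[var["highly_variable"]]

# Keep only those columns
expr_hvg = expr.loc[:, hvg]
print(expr_hvg.shape)
```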

In [58]:
hvg = Bcell.var_names[Bcell.var['highly_variable']]
hvg

CategoricalIndex(['ISG15', 'LINC01342', 'TTLL10-AS1', 'TNFRSF18', 'CALML6',
                  'CHD5', 'ICMT-DT', 'MIR34AHG', 'RBP7', 'MTOR-AS1',
                  ...
                  'FRMPD3', 'TSC22D3', 'KLHL13', 'AKAP14', 'RHOXF1-AS1',
                  'TMEM255A', 'SMIM10L2B-AS1', 'IL9R_ENSG00000124334', 'DDX3Y',
                  'EIF1AY'],
                 categories=['A1BG', 'A1BG-AS1', 'A1CF', 'A2M', 'A2M-AS1', 'A2ML1', 'A2ML1-AS1', 'A2ML1-AS2', ...], ordered=False, dtype='category', name='feature_name', length=1000)

In [59]:
datExpr = datExpr.loc[:,hvg]
datExpr

feature_name,ISG15,LINC01342,TTLL10-AS1,TNFRSF18,CALML6,CHD5,ICMT-DT,MIR34AHG,RBP7,MTOR-AS1,...,FRMPD3,TSC22D3,KLHL13,AKAP14,RHOXF1-AS1,TMEM255A,SMIM10L2B-AS1,IL9R_ENSG00000124334,DDX3Y,EIF1AY
0002_AAAGGGCAGCAGCACA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.513502,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
0002_AACAACCAGGGTTAGC-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.583828,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
0002_AACCCAAAGGGCCTCT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.344490,0.0,0.0,0.0,0.0,0.0,0.0,1.271980,0.960117
0002_AACGAAACACAAAGTA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.207486,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
0002_AAGCGTTTCTTGGGCG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.435951,0.0,0.0,0.0,0.0,0.0,0.0,1.157864,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.582282,0.0,0.0,0.0,0.0,0.0,0.0,1.038052,0.000000
079_TGAATCGAGATTCGAA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.445902,0.0,0.0,0.0,0.0,0.0,0.0,1.022813,0.000000
079_TGCGATAAGGTAGATT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.282646,0.0,0.0,0.0,0.0,0.0,0.0,1.282646,0.000000
079_TGCTCGTAGGGTTGCA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.245300,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000


Next, add the donor_id column to the gene expression dataframe, so that we know which cell came from which donor.

In [60]:
# Reset the index of 'datExpr' DataFrame to make the row names (cell names) a column
datExpr_donor = datExpr.reset_index()

In [61]:
datExpr_donor

feature_name,index,ISG15,LINC01342,TTLL10-AS1,TNFRSF18,CALML6,CHD5,ICMT-DT,MIR34AHG,RBP7,...,FRMPD3,TSC22D3,KLHL13,AKAP14,RHOXF1-AS1,TMEM255A,SMIM10L2B-AS1,IL9R_ENSG00000124334,DDX3Y,EIF1AY
0,0002_AAAGGGCAGCAGCACA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.513502,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
1,0002_AACAACCAGGGTTAGC-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.583828,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
2,0002_AACCCAAAGGGCCTCT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.344490,0.0,0.0,0.0,0.0,0.0,0.0,1.271980,0.960117
3,0002_AACGAAACACAAAGTA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.207486,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
4,0002_AAGCGTTTCTTGGGCG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.435951,0.0,0.0,0.0,0.0,0.0,0.0,1.157864,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3535,079_TCTCCGAAGCTATCTG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.582282,0.0,0.0,0.0,0.0,0.0,0.0,1.038052,0.000000
3536,079_TGAATCGAGATTCGAA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.445902,0.0,0.0,0.0,0.0,0.0,0.0,1.022813,0.000000
3537,079_TGCGATAAGGTAGATT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.282646,0.0,0.0,0.0,0.0,0.0,0.0,1.282646,0.000000
3538,079_TGCTCGTAGGGTTGCA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.245300,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000


In [62]:
# Merge 'datExpr_donor' with 'metadata', matching the 'index' column to the 'cell_id' column
datExpr_donor = pd.merge(datExpr_donor, metadata[['cell_id', 'donor_id']], left_on='index', right_on='cell_id', how='left')

In [63]:
datExpr_donor

Unnamed: 0,index,ISG15,LINC01342,TTLL10-AS1,TNFRSF18,CALML6,CHD5,ICMT-DT,MIR34AHG,RBP7,...,KLHL13,AKAP14,RHOXF1-AS1,TMEM255A,SMIM10L2B-AS1,IL9R_ENSG00000124334,DDX3Y,EIF1AY,cell_id,donor_id
0,0002_AAAGGGCAGCAGCACA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0002_AAAGGGCAGCAGCACA-1,CH-20-002
1,0002_AACAACCAGGGTTAGC-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0002_AACAACCAGGGTTAGC-1,CH-20-002
2,0002_AACCCAAAGGGCCTCT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.271980,0.960117,0002_AACCCAAAGGGCCTCT-1,CH-20-002
3,0002_AACGAAACACAAAGTA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0002_AACGAAACACAAAGTA-1,CH-20-002
4,0002_AAGCGTTTCTTGGGCG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.157864,0.000000,0002_AAGCGTTTCTTGGGCG-1,CH-20-002
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3535,079_TCTCCGAAGCTATCTG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.038052,0.000000,079_TCTCCGAAGCTATCTG-1,CH-21-079
3536,079_TGAATCGAGATTCGAA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.022813,0.000000,079_TGAATCGAGATTCGAA-1,CH-21-079
3537,079_TGCGATAAGGTAGATT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.282646,0.000000,079_TGCGATAAGGTAGATT-1,CH-21-079
3538,079_TGCTCGTAGGGTTGCA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,079_TGCTCGTAGGGTTGCA-1,CH-21-079


In [64]:
# Set the cell names as the index again
datExpr_donor.set_index('index', inplace=True)


In [65]:
datExpr_donor

index,ISG15,LINC01342,TTLL10-AS1,TNFRSF18,CALML6,CHD5,ICMT-DT,MIR34AHG,RBP7,MTOR-AS1,...,KLHL13,AKAP14,RHOXF1-AS1,TMEM255A,SMIM10L2B-AS1,IL9R_ENSG00000124334,DDX3Y,EIF1AY,cell_id,donor_id
0002_AAAGGGCAGCAGCACA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0002_AAAGGGCAGCAGCACA-1,CH-20-002
0002_AACAACCAGGGTTAGC-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0002_AACAACCAGGGTTAGC-1,CH-20-002
0002_AACCCAAAGGGCCTCT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.271980,0.960117,0002_AACCCAAAGGGCCTCT-1,CH-20-002
0002_AACGAAACACAAAGTA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0002_AACGAAACACAAAGTA-1,CH-20-002
0002_AAGCGTTTCTTGGGCG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.157864,0.000000,0002_AAGCGTTTCTTGGGCG-1,CH-20-002
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.038052,0.000000,079_TCTCCGAAGCTATCTG-1,CH-21-079
079_TGAATCGAGATTCGAA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.022813,0.000000,079_TGAATCGAGATTCGAA-1,CH-21-079
079_TGCGATAAGGTAGATT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.282646,0.000000,079_TGCGATAAGGTAGATT-1,CH-21-079
079_TGCTCGTAGGGTTGCA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,079_TGCTCGTAGGGTTGCA-1,CH-21-079


In [66]:
# Remove the 'cell_id' column if needed
datExpr_donor.drop(columns=['cell_id'], inplace=True)

In [67]:
datExpr_donor

index,ISG15,LINC01342,TTLL10-AS1,TNFRSF18,CALML6,CHD5,ICMT-DT,MIR34AHG,RBP7,MTOR-AS1,...,TSC22D3,KLHL13,AKAP14,RHOXF1-AS1,TMEM255A,SMIM10L2B-AS1,IL9R_ENSG00000124334,DDX3Y,EIF1AY,donor_id
0002_AAAGGGCAGCAGCACA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.513502,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,CH-20-002
0002_AACAACCAGGGTTAGC-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.583828,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,CH-20-002
0002_AACCCAAAGGGCCTCT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.344490,0.0,0.0,0.0,0.0,0.0,0.0,1.271980,0.960117,CH-20-002
0002_AACGAAACACAAAGTA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.207486,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,CH-20-002
0002_AAGCGTTTCTTGGGCG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.435951,0.0,0.0,0.0,0.0,0.0,0.0,1.157864,0.000000,CH-20-002
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
079_TCTCCGAAGCTATCTG-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.582282,0.0,0.0,0.0,0.0,0.0,0.0,1.038052,0.000000,CH-21-079
079_TGAATCGAGATTCGAA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.445902,0.0,0.0,0.0,0.0,0.0,0.0,1.022813,0.000000,CH-21-079
079_TGCGATAAGGTAGATT-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.282646,0.0,0.0,0.0,0.0,0.0,0.0,1.282646,0.000000,CH-21-079
079_TGCTCGTAGGGTTGCA-1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.245300,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,CH-21-079


In [68]:
#Save the expression matrix with donor_id
datExpr_donor.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_donorid_singlecell.csv', index = True)
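
The reset_index, merge, set_index and drop steps above attach donor_id to each cell. Because the expression and metadata dataframes are indexed by the same cell barcodes, a join on the index is one possible shortcut. The sketch below illustrates this on made-up data (`expr`, `meta` and their values are hypothetical).

```python
import pandas as pd

# Toy expression and metadata tables indexed by the same cell barcodes
expr = pd.DataFrame(
    {"g1": [0.0, 1.0, 2.0], "g2": [3.0, 0.0, 0.0]},
    index=["c1", "c2", "c3"],
)
meta = pd.DataFrame(
    {"donor_id": ["D1", "D2", "D1"]},
    index=["c3", "c1", "c2"],  # deliberately in a different order
)

# join aligns on the index, so the row order in `meta` does not matter
expr_donor = expr.join(meta["donor_id"], how="left")

# Guard against cells that have no metadata entry
assert expr_donor["donor_id"].notna().all()
print(expr_donor)
```

The explicit check for missing donor IDs is worth keeping in either approach, since a left join silently fills unmatched cells with NaN.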

Now that each cell in the gene expression dataframe is labelled with its donor, the data can be aggregated for pseudobulking.

In [69]:
# Aggregate expression by donor ID (summing the values)
pseudobulk_df = datExpr_donor.groupby('donor_id').sum()
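
This tutorial sums the values per donor. Because donors contribute different numbers of cells, summed values grow with the cell count, and a per-cell mean is one common alternative. The sketch below compares the two on made-up data (all names and values are hypothetical). It is offered as a point of comparison, not as the method used in this tutorial.

```python
import pandas as pd

# Toy data: donor D1 contributes three cells, donor D2 only one
expr_donor = pd.DataFrame(
    {
        "g1": [1.0, 1.0, 1.0, 1.0],
        "g2": [2.0, 2.0, 2.0, 2.0],
        "donor_id": ["D1", "D1", "D1", "D2"],
    },
    index=["c1", "c2", "c3", "c4"],
)

grouped = expr_donor.groupby("donor_id")

pb_sum = grouped.sum()    # totals grow with the number of cells per donor
pb_mean = grouped.mean()  # average per cell, independent of cell count
n_cells = grouped.size()  # cells contributed by each donor

print(pb_sum)
print(pb_mean)
print(n_cells)
```

Here every cell has identical expression, yet the summed profile of D1 is three times that of D2, while the means agree. Recording `n_cells` alongside the pseudobulk table makes it possible to check later whether cell count drives any of the correlations.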

In [70]:
pseudobulk_df

donor_id,ISG15,LINC01342,TTLL10-AS1,TNFRSF18,CALML6,CHD5,ICMT-DT,MIR34AHG,RBP7,MTOR-AS1,...,FRMPD3,TSC22D3,KLHL13,AKAP14,RHOXF1-AS1,TMEM255A,SMIM10L2B-AS1,IL9R_ENSG00000124334,DDX3Y,EIF1AY
CH-20-001,6.380902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,53.239479,0.0,0.0,0.0,0.0,0.0,0.0,21.632603,17.641195
CH-20-002,12.60675,2.33599,0.0,0.0,0.0,0.0,1.089918,0.0,1.158743,1.173824,...,0.0,112.643967,0.0,0.0,0.0,0.0,0.0,0.0,45.432411,22.809191
CH-20-004,12.30251,0.0,0.0,21.512184,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,42.873409,0.0,0.0,0.0,0.0,0.0,0.0,15.570595,20.173725
CH-20-005,18.603716,1.16925,1.232658,4.97588,0.0,0.0,0.0,0.0,0.0,0.0,...,1.112746,190.337738,0.0,0.0,1.191559,0.0,0.0,0.0,6.931139,1.071742
CH-21-002,13.705297,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,44.942261,0.0,0.0,0.0,0.0,0.0,1.323198,0.0,0.0
CH-21-006,4.377715,0.0,0.0,23.782143,0.0,0.0,0.0,0.0,1.023552,0.0,...,0.0,12.741602,0.0,0.0,0.0,0.0,0.0,0.0,5.349793,12.981407
CH-21-008,18.058025,0.0,0.0,44.614342,0.0,1.201673,0.0,0.0,0.0,0.0,...,0.0,76.893723,0.0,0.0,0.0,0.0,0.0,1.08036,1.188176,2.377049
CH-21-013,21.395964,0.0,0.0,30.42651,0.0,0.0,1.235703,0.0,0.0,0.0,...,0.0,54.458328,0.0,0.0,0.0,0.0,1.117969,1.236817,23.543072,53.25042
CH-21-014,13.436963,0.0,0.0,11.067089,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,32.2486,0.0,0.0,0.0,0.0,0.0,0.0,17.70928,21.63619
CH-21-017,22.916807,0.0,0.0,9.076924,0.0,0.0,0.0,0.0,2.478934,0.0,...,0.0,188.600067,0.0,0.0,0.0,0.0,0.0,0.0,49.114601,40.454937


In [71]:
#Save the pseudobulk expression matrix with donor_id
pseudobulk_df.to_csv('/ReCoDE-Gene-Network-Analysis/data/data/Bcell_datExpr_pseudobulk.csv', index = True)

We now have the pseudobulked expression data and the corresponding metadata dataframe, both with one row per donor, ready to start the correlation network analysis.
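
Before correlating expression with donor traits, it is worth confirming that the two tables describe the same donors in the same order. The sketch below shows one way to check and enforce this on made-up data; `pseudobulk` and `donor_meta` are hypothetical stand-ins for the two dataframes saved above.

```python
import pandas as pd

# Toy pseudobulk expression and donor metadata, both indexed by donor
pseudobulk = pd.DataFrame(
    {"g1": [3.0, 1.0, 2.0]}, index=pd.Index(["D1", "D2", "D3"], name="donor_id")
)
donor_meta = pd.DataFrame(
    {"CH": [0, 1, 1]}, index=pd.Index(["D3", "D1", "D2"], name="donor_id")
)

# Same set of donors in both tables?
same_donors = set(pseudobulk.index) == set(donor_meta.index)

# Reorder the metadata so that row i describes the same donor in both tables
donor_meta = donor_meta.loc[pseudobulk.index]
aligned = pseudobulk.index.equals(donor_meta.index)
print(same_donors, aligned)
```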