# Import relevant packages

Run: *conda activate inflamSpectra*

In [1]:
#import packages
import numpy as np
import json 
import scanpy as sc
from collections import OrderedDict
import scipy 
import pandas as pd
import matplotlib.pyplot as plt

#spectra imports 
import Spectra as spc
from Spectra import Spectra_util as spc_tl
from Spectra import K_est as kst
from Spectra import default_gene_sets

#KnowledgeBase imports
import cytopus as cp
import session_info

# Load a human immunology knowledge base
This is an immunology knowledge base with the following criteria:
1. Genes within a gene set define a cellular process at the transcript level.
2. Gene sets represent cellular processes at the single-cell level.
3. Gene sets can be specific to a defined cell type.

Within this resource, 150 cellular processes apply to all leukocytes, and 31 apply to individual cell types. Of all 231 gene sets, 97 were gene sets newly curated from the literature, 14 used data from perturbation experiments, 11 were adopted from the literature with modifications, and 123 were taken from the literature and external databases without changes. Gene sets correspond to: 

A. cellular identities (n = 50) 

B. cellular processes
- homeostasis (n = 9)
- stress response (n = 3)
- cell death and autophagy (n = 18)
- proliferation (n = 6)
- signaling (n = 12)
- metabolism (n = 90)
- immune function (n = 22)
- immune cell responses to external stimuli (n = 18)
- hemostasis/coagulation (n = 3)

We designed the gene sets for cellular processes to have comparable size (median n = 20 genes per gene set) and relatively little overlap (median pairwise overlap coefficient of 40%) to enable dissection of a large number of cellular processes and to avoid effects driven by gene set size.
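
For reference, the overlap coefficient of two gene sets A and B is commonly defined as the size of their intersection divided by the size of the smaller set. The sketch below assumes this standard definition is the one meant here; the helper name `overlap_coefficient` and the toy gene sets are hypothetical and not part of Cytopus or Spectra.

```python
from itertools import combinations
from statistics import median


def overlap_coefficient(set_a, set_b):
    """Return |A & B| / min(|A|, |B|), or 0.0 if either set is empty."""
    a, b = set(set_a), set(set_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Toy gene sets (hypothetical, for illustration only)
toy_gene_sets = {
    "set_1": ["gene_a", "gene_b", "gene_c", "gene_d"],
    "set_2": ["gene_c", "gene_d"],
    "set_3": ["gene_e", "gene_f"],
}

# Overlap coefficient for every pair of gene sets, then the median
pairwise = [
    overlap_coefficient(toy_gene_sets[x], toy_gene_sets[y])
    for x, y in combinations(toy_gene_sets, 2)
]
median_overlap = median(pairwise)
print(f"Median pairwise overlap coefficient: {median_overlap:.2f}")
```

The same loop could be run over the gene sets of any cell type in a gene set dictionary to check how much they overlap.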


To retrieve the information: 
- ‘celltypes’ retrieves a list of available cell types.
- ‘processes’ generates a dictionary of all ‘cellular processes’ gene sets.
- ‘identities’ generates a dictionary of all ‘cellular identity’ gene sets.

The get_celltype_processes method retrieves cell-type-specific ‘cellular processes’ based on a user-provided list of cell types at the desired granularity (generally all cell types contained in the data).

In [5]:
import cytopus as cp
G = cp.knowledge_base.KnowledgeBase()

KnowledgeBase object containing 92 cell types and 201 cellular processes



In [None]:
cp.knowledge_base

In [None]:
#list of all cell types in KnowledgeBase
G.celltypes
G.plot_celltypes()

In [None]:
G.identities

# Load gene_set_dictionary

First we need to load a nested dictionary containing global and cell-type-specific gene sets. In this gene set annotation dictionary, the keys are the cell types (str) and the values are dictionaries with gene set names as keys (str) and gene sets as values (lists of gene names/IDs that match the gene names/IDs in adata.var_names).

For example:

```
gene_set_dictionary = {'celltype_1':{'gene_set_1':['gene_a', 'gene_b', 'gene_c'], 'gene_set_2':['gene_c','gene_a','gene_e','gene_f']},

'celltype_2':{'gene_set_1':['gene_a', 'gene_b', 'gene_c'], 'gene_set_3':['gene_a', 'gene_e','gene_f','gene_d']},

'celltype_3':{},

'global':{'gene_set_4':['gene_m','gene_n']}} #the global key must be supplied

```

*Note that one key in the dictionary must be 'global' with the corresponding value being a dictionary of gene sets which apply to all cells*
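
Before passing a dictionary on, it can help to check that it follows this layout. The following is a minimal sketch of such a check; `validate_gene_set_dictionary` is a hypothetical helper (not a Spectra or Cytopus function), and it assumes gene names/IDs are stored as strings.

```python
def validate_gene_set_dictionary(gene_set_dictionary):
    """Raise ValueError if the nested dictionary does not follow the expected layout."""
    # The 'global' key is mandatory
    if "global" not in gene_set_dictionary:
        raise ValueError("the 'global' key must be supplied")
    for cell_type, gene_sets in gene_set_dictionary.items():
        # Every cell type maps to a (possibly empty) dictionary of gene sets
        if not isinstance(gene_sets, dict):
            raise ValueError(f"value for '{cell_type}' must be a dictionary of gene sets")
        for name, genes in gene_sets.items():
            # Assumption: gene names/IDs are strings stored in a list or tuple
            is_sequence = isinstance(genes, (list, tuple))
            if not is_sequence or not all(isinstance(g, str) for g in genes):
                raise ValueError(f"gene set '{name}' in '{cell_type}' must be a list of gene names")
    return True


# Toy dictionary in the format described above
example_dictionary = {
    "celltype_1": {"gene_set_1": ["gene_a", "gene_b", "gene_c"]},
    "celltype_3": {},
    "global": {"gene_set_4": ["gene_m", "gene_n"]},
}
print(validate_gene_set_dictionary(example_dictionary))
```

A check of this kind does not verify that the genes are present in adata.var_names; that comparison would need the AnnData object itself.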

Spectra will use this dictionary to align factors to the input gene sets. Gene sets which apply to only one cell type in the data should be included in the dictionary of that cell type. If a gene set applies to all cell types in the data, it should be included in the dictionary for 'global'.

If a gene set applies to more than one cell type but not all cell types in the data there are two options: 
1) Include this gene set in each cell type dictionary which will likely result in a separate factor for this gene set in each cell type.
2) Include this gene set in the 'global' dictionary which will likely result in one factor for this gene set in all cell types.

We give additional guidance on the advantages and disadvantages of either approach in the Supplementary Methods of the Spectra paper: https://doi.org/10.1101/2022.12.20.521311
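
The two options can be sketched with a toy dictionary as follows. All names here (`shared_name`, `shared_genes`, the cell type keys) are hypothetical placeholders, and the sketch only shows where the gene set is placed, not how Spectra treats it.

```python
import copy

# Hypothetical gene set that applies to celltype_1 and celltype_2 but not celltype_3
shared_name = "gene_set_shared"
shared_genes = ["gene_a", "gene_b", "gene_c"]

base = {"celltype_1": {}, "celltype_2": {}, "celltype_3": {}, "global": {}}

# Option 1: one copy of the gene set in each relevant cell type dictionary
option_1 = copy.deepcopy(base)
for cell_type in ["celltype_1", "celltype_2"]:
    option_1[cell_type][shared_name] = list(shared_genes)

# Option 2: a single copy of the gene set under 'global'
option_2 = copy.deepcopy(base)
option_2["global"][shared_name] = list(shared_genes)

print(option_1)
print(option_2)
```

Copying the gene list for each cell type avoids several entries sharing one list object, which matters if a gene set is later extended in place.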

**Loaded vs custom gene set dictionary**

We can either create our own gene set dictionary or make use of the human immunology knowledge base selecting the cell types of interest. We can also use external databases but they have to be provided in the format described above. 

Here we will use the Cytopus KnowledgeBase and expand it with the default gene sets from Spectra: spc.default_gene_sets.load()

In [None]:
celltype_of_interest = ['mono', 'Mac', 
                        "p-DC", "cDC", 
                        "gdT", "MAIT", "NK", "ILC",
                        "CD4-T", "CD8-T", "T-naive", "TCM", "TEM", "TRM", "TSCM",
                        'B', ]

In [None]:
celltype_of_interest = ['Mac',
                        "p-DC", "cDC", 
                        "gdT", "MAIT", "NK", "ILC",
                        "CD4-T", "CD8-T", "T-naive", 
                        'B', ]
global_celltypes = ['all-cells','leukocyte']
G.get_celltype_processes(celltype_of_interest, global_celltypes=global_celltypes, get_children=True, get_parents=False)

In [None]:
gene_set_dict = G.celltype_process_dict

In [None]:
# load the default gene set dictionary from the Spectra paper:
default_gs_dict = spc.default_gene_sets.load()

## Assess the created dictionary: Compare KnowledgeBase and Default

**DEFAULT GENE SET: Number of gene sets per cell type**

In [None]:
# Initialize a dictionary to store the counts
key_counts = {}

# Loop through the keys and calculate the number of keys within each key
total_count = 0
for key, value in default_gs_dict.items():
    if isinstance(value, dict):
        key_counts[key] = len(value)
        total_count += len(value)
    else:
        key_counts[key] = 0

# Print the number of keys within each key
for key, count in key_counts.items():
    print(f"{key}: {count}")

print(f"Total: {total_count}")

**CYTOPUS GENE SET: Number of gene sets per cell type**

In [None]:
# Initialize a dictionary to store the counts
key_counts = {}

# Loop through the keys and calculate the number of keys within each key
total_count = 0
for key, value in gene_set_dict.items():
    if isinstance(value, dict):
        key_counts[key] = len(value)
        total_count += len(value)
    else:
        key_counts[key] = 0

# Print the number of keys within each key
for key, count in key_counts.items():
    print(f"{key}: {count}")

print(f"Total: {total_count}")
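
The two counting cells above repeat the same loop. A minimal sketch of a reusable helper is shown below; `count_gene_sets` and the toy dictionaries are hypothetical and stand in for `default_gs_dict` and `gene_set_dict`.

```python
def count_gene_sets(gene_set_dictionary):
    """Return {cell type: number of gene sets}; non-dictionary values count as 0."""
    return {
        cell_type: len(gene_sets) if isinstance(gene_sets, dict) else 0
        for cell_type, gene_sets in gene_set_dictionary.items()
    }


# Toy stand-ins for the two dictionaries being compared
toy_default = {"global": {"gs_1": ["gene_a"], "gs_2": ["gene_b"]}, "CD4_T": {"gs_3": ["gene_c"]}}
toy_cytopus = {"global": {"gs_1": ["gene_a"]}, "CD4-T": {}}

for label, dictionary in [("default", toy_default), ("cytopus", toy_cytopus)]:
    counts = count_gene_sets(dictionary)
    for cell_type, count in counts.items():
        print(f"{label} | {cell_type}: {count}")
    print(f"{label} | Total: {sum(counts.values())}")
```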

## Edit the dictionary to contain as many gene sets as possible together with the default gene sets

**Edit CD4 gene sets**

In [None]:
# Add missing CD4 gene sets
cell_type_to_update = 'CD4_T'
cell_type_in_gene_set = 'CD4-T'

if cell_type_to_update in default_gs_dict:
    for gene_set_name, genes in default_gs_dict[cell_type_to_update].items():
        if gene_set_name not in gene_set_dict[cell_type_in_gene_set]:
            if gene_set_name.startswith('TNK'):
                # Create a new cell type 'TNK' and include the gene set
                if 'TNK' not in gene_set_dict:
                    gene_set_dict['TNK'] = {}
                gene_set_dict['TNK'][gene_set_name] = genes
                print(f"Including {gene_set_name} for {cell_type_to_update} in 'TNK' cell type.")
                # Include also this gene set into CD4
                gene_set_dict[cell_type_in_gene_set][gene_set_name] = genes
                print(f"Also including {gene_set_name} for {cell_type_to_update}.")
            else:
                # Include the new gene set
                gene_set_dict[cell_type_in_gene_set][gene_set_name] = genes
                print(f"Including {gene_set_name} for {cell_type_to_update}.")
                
        else:
            # Check if the gene set already exists with 100% match
            if set(genes) == set(gene_set_dict[cell_type_in_gene_set][gene_set_name]):
                print(f"Skipping {gene_set_name} in {cell_type_to_update} - Already exists with a 100% match.")
            else:
                # Include missing genes
                missing_genes = set(genes) - set(gene_set_dict[cell_type_in_gene_set][gene_set_name])
                gene_set_dict[cell_type_in_gene_set][gene_set_name].extend(list(missing_genes))
                print(f"Including missing genes in {gene_set_name} for {cell_type_to_update}: {list(missing_genes)}")
# Modify the gene set name
print("Modifying gene set names")
old_gene_set_names = list(gene_set_dict[cell_type_in_gene_set].keys())
for gene_set_name in old_gene_set_names:
    split_name = gene_set_name.split('_')
    if len(split_name) > 1:
        new_name = 'CD4T_' + '_'.join(split_name[1:]).replace('-', '_')
        print(f"{gene_set_name} --> {new_name}")
        # pop() removes the old key and returns its genes, so the gene set is kept even if the name is unchanged
        gene_set_dict[cell_type_in_gene_set][new_name] = gene_set_dict[cell_type_in_gene_set].pop(gene_set_name)
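
The renaming step replaces everything before the first underscore with a new prefix and turns hyphens in the remainder into underscores. Two different old names can therefore map to the same new name, in which case one gene set silently overwrites the other. A minimal sketch of a variant that raises on such collisions follows; `rename_gene_sets` and the toy names are hypothetical.

```python
def rename_gene_sets(gene_sets, new_prefix):
    """Return a new dictionary with the text before the first '_' replaced by new_prefix."""
    renamed = {}
    for old_name, genes in gene_sets.items():
        _, separator, rest = old_name.partition("_")
        if separator:
            # Hyphens are replaced only in the part after the original prefix
            new_name = f"{new_prefix}_{rest.replace('-', '_')}"
        else:
            # Names without an underscore are left unchanged
            new_name = old_name
        if new_name in renamed:
            raise ValueError(f"'{old_name}' collides with existing name '{new_name}'")
        renamed[new_name] = genes
    return renamed


# Toy gene set names (hypothetical)
toy_sets = {
    "CD4-T_gene-set_1": ["gene_a"],
    "TNK_gene-set_2": ["gene_b"],
    "unprefixed": ["gene_c"],
}
renamed_sets = rename_gene_sets(toy_sets, "CD4T")
print(list(renamed_sets))
```

Returning a new dictionary instead of editing in place also avoids changing a dictionary while its keys are being processed.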

**Edit CD8 gene sets**

In [None]:
# Add missing CD8 gene sets; those starting with 'TNK' are also stored under a separate 'TNK' cell type
cell_type_to_update = 'CD8_T'
cell_type_in_gene_set = 'CD8-T'

if cell_type_to_update in default_gs_dict:
    for gene_set_name, genes in default_gs_dict[cell_type_to_update].items():
        if gene_set_name not in gene_set_dict[cell_type_in_gene_set]:
            if gene_set_name.startswith('TNK'):
                # Create a new cell type 'TNK' and include the gene set
                if 'TNK' not in gene_set_dict:
                    gene_set_dict['TNK'] = {}
                gene_set_dict['TNK'][gene_set_name] = genes
                print(f"Including {gene_set_name} for {cell_type_to_update} in 'TNK' cell type.")
                # Include also this gene set into CD8
                gene_set_dict[cell_type_in_gene_set][gene_set_name] = genes
                print(f"Also including {gene_set_name} for {cell_type_to_update}.")
            else:
                # Include the new gene set
                gene_set_dict[cell_type_in_gene_set][gene_set_name] = genes
                print(f"Including {gene_set_name} for {cell_type_to_update}.")
                
        else:
            # Check if the gene set already exists with 100% match
            if set(genes) == set(gene_set_dict[cell_type_in_gene_set][gene_set_name]):
                print(f"Skipping {gene_set_name} in {cell_type_to_update} - Already exists with a 100% match.")
            else:
                # Include missing genes
                missing_genes = set(genes) - set(gene_set_dict[cell_type_in_gene_set][gene_set_name])
                gene_set_dict[cell_type_in_gene_set][gene_set_name].extend(list(missing_genes))
                print(f"Including missing genes in {gene_set_name} for {cell_type_to_update}: {list(missing_genes)}")
# Modify the gene set name
print("Modifying gene set names")
old_gene_set_names = list(gene_set_dict[cell_type_in_gene_set].keys())
for gene_set_name in old_gene_set_names:
    split_name = gene_set_name.split('_')
    if len(split_name) > 1:
        new_name = 'CD8T_' + '_'.join(split_name[1:]).replace('-', '_')
        print(f"{gene_set_name} --> {new_name}")
        # pop() removes the old key and returns its genes, so the gene set is kept even if the name is unchanged
        gene_set_dict[cell_type_in_gene_set][new_name] = gene_set_dict[cell_type_in_gene_set].pop(gene_set_name)

**Edit DC gene sets**

In [None]:
# Add missing DC gene sets
cell_type_to_update = 'DC'
cell_type_in_gene_set = 'cDC'

if cell_type_to_update in default_gs_dict:
    for gene_set_name, genes in default_gs_dict[cell_type_to_update].items():
        if gene_set_name not in gene_set_dict[cell_type_in_gene_set]:
            # Include the new gene set
            gene_set_dict[cell_type_in_gene_set][gene_set_name] = genes
            print(f"Including {gene_set_name} for {cell_type_to_update}.")
        else:
            # Check if the gene set already exists with 100% match
            if set(genes) == set(gene_set_dict[cell_type_in_gene_set][gene_set_name]):
                print(f"Skipping {gene_set_name} in {cell_type_to_update} - Already exists with a 100% match.")
            else:
                # Include missing genes
                missing_genes = set(genes) - set(gene_set_dict[cell_type_in_gene_set][gene_set_name])
                gene_set_dict[cell_type_in_gene_set][gene_set_name].extend(list(missing_genes))
                print(f"Including missing genes in {gene_set_name} for {cell_type_to_update}: {list(missing_genes)}")
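
The CD4, CD8 and DC cells share the same merge logic: add gene sets that are missing, and extend gene sets that exist in both dictionaries with the genes they lack. A minimal sketch of that logic as one function is given below. `merge_gene_sets` is a hypothetical helper, it leaves out the 'TNK' special case, and unlike the cells above it reports a gene set as unchanged whenever no genes are missing.

```python
def merge_gene_sets(target, source):
    """Merge the gene sets of `source` into `target` in place and return a summary."""
    summary = {"added": [], "extended": [], "unchanged": []}
    for name, genes in source.items():
        if name not in target:
            # Copy the list so target and source do not share one list object
            target[name] = list(genes)
            summary["added"].append(name)
            continue
        existing = set(target[name])
        # dict.fromkeys keeps the source order and drops duplicate genes
        missing = [g for g in dict.fromkeys(genes) if g not in existing]
        if missing:
            target[name].extend(missing)
            summary["extended"].append(name)
        else:
            summary["unchanged"].append(name)
    return summary


# Toy dictionaries standing in for one cell type of each gene set dictionary
toy_target = {"set_1": ["gene_a", "gene_b"], "set_2": ["gene_c"]}
toy_source = {"set_1": ["gene_b", "gene_x"], "set_2": ["gene_c"], "set_3": ["gene_y"]}

merge_summary = merge_gene_sets(toy_target, toy_source)
print(merge_summary)
print(toy_target)
```

With such a helper, each cell would reduce to one call per cell type pair, for example with the inner dictionaries of `gene_set_dict` and `default_gs_dict` as `target` and `source`.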


**Edit Treg gene sets**

In [None]:
gene_set_dict['Tregs'] = {}
gene_set_dict["Tregs"]["Tregs_FoxP3_stabilization"] = gene_set_dict["CD4-T"]["CD4T_FoxP3_stabilization"]

## Imported dictionary

**Imported dictionary: KnowledgeBase + Default**

In [None]:
# Dictionary to store DataFrames for each cell type
dfs_by_cell_type = {}

# Iterate through the gene_set_dict for each cell type
for cell_type, gene_sets in gene_set_dict.items():
    # Check if the cell type has any gene sets
    if not gene_sets:
        print(f"No gene sets available for {cell_type}")
        continue

    # Flatten the gene sets of the current cell type into columns, skipping empty gene sets
    flattened_data = {
        f"{cell_type}_{gene_set}": list(genes)
        for gene_set, genes in gene_sets.items()
        if genes
    }

    # Pad shorter columns with empty strings so that all columns have the same length
    max_length = max((len(genes) for genes in flattened_data.values()), default=0)
    for column_name, genes in flattened_data.items():
        flattened_data[column_name] = genes + [''] * (max_length - len(genes))

    # Create a DataFrame from the flattened data
    df = pd.DataFrame(flattened_data)

    # Remove empty columns
    df = df.dropna(axis=1, how='all')

    # Store the DataFrame in the dictionary with the cell type as the key
    dfs_by_cell_type[cell_type] = df

In [None]:
path = "/scratch_isilon/groups/singlecell/shared/projects/Inflammation-PBMCs-Atlas/03_downstream_analysis/02_gene_universe_definition/data/CYTOPUS_default_geneSets/"
for key, df in dfs_by_cell_type.items():
    # Define the file name
    file_name = f"{path}/{key}_gene_sets.xlsx"
    # Write the DataFrame to an Excel file
    df.to_excel(file_name, index=False)

**Number of gene sets per cell type**

In [None]:
# Initialize a dictionary to store the counts
key_counts = {}

# Loop through the keys and calculate the number of keys within each key
total_count = 0
for key, value in gene_set_dict.items():  # dfs_by_cell_type holds DataFrames, so every count would be 0
    if isinstance(value, dict):
        key_counts[key] = len(value)
        total_count += len(value)
    else:
        key_counts[key] = 0

# Print the number of keys within each key
for key, count in key_counts.items():
    print(f"{key}: {count}")

print(f"Total: {total_count}")

**Number of genes per cell type**

In [None]:
# Initialize a dictionary to store the counts
element_counts = {}

# Loop through the keys and calculate the number of elements
total_count = 0
for key, value in gene_set_dict.items():
    if isinstance(value, list):
        element_counts[key] = len(value)
        total_count += len(value)
    elif isinstance(value, dict):
        sub_count = sum(len(sub_value) for sub_value in value.values())
        element_counts[key] = sub_count
        total_count += sub_count
    else:
        element_counts[key] = 0

# Print the element counts for each key
for key, count in element_counts.items():
    print(f"{key}: {count}")

# Print the total count
print(f"Total: {total_count}")

# Save the combined gene set dictionary

In [None]:
import pickle

In [None]:
with open('/scratch_isilon/groups/singlecell/shared/projects/Inflammation-PBMCs-Atlas/03_downstream_analysis/02_gene_universe_definition/data/CYTOPUS_default_geneSets_dict.pickle', 'wb') as f:
    pickle.dump(gene_set_dict, f, pickle.HIGHEST_PROTOCOL)
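
To confirm that the dictionary survives the round trip, the pickle can be read back and compared with the original. The sketch below uses a toy dictionary and a temporary directory so that it does not touch the project path; the file name is hypothetical.

```python
import os
import pickle
import tempfile

# Toy stand-in for gene_set_dict
toy_gene_set_dict = {"global": {"gene_set_4": ["gene_m", "gene_n"]}, "celltype_1": {}}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "toy_geneSets_dict.pickle")

    # Write the dictionary with the highest available pickle protocol
    with open(pickle_path, "wb") as f:
        pickle.dump(toy_gene_set_dict, f, pickle.HIGHEST_PROTOCOL)

    # Read it back to check that nothing was lost
    with open(pickle_path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded == toy_gene_set_dict)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.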

In [None]:
session_info.show()