# Pseudobulk eQTL phenotype QC and normalization
This notebook is based on Nick's code and still needs to be generalized for wider use.

## Input
The input is pseudobulk eQTL phenotype data stored as Seurat rds objects. This notebook uses the following original phenotype files:
- Ast: `/with_projids/Astrocytes.rds`
- Immune_cells: `/with_projids/Immune_cells.rds`
- Exc: `/with_projids/Excitatory_neurons_set1.rds`, `/with_projids/Excitatory_neurons_set2.rds`, `/with_projids/Excitatory_neurons_set3.rds`
- Inh: `/with_projids/Inhibitory_neurons.rds`
- Oli: `/with_projids/Oligodendrocytes.rds`
- OPC: `/with_projids/OPCs.rds`

For Ast, Inh, Oli and OPC, each cell type has its own Seurat object. The input is a tab-separated txt file with no header that lists the cell type name in the 1st column and the rds path in the 2nd column. Use the first version, the `seuratagg` workflow.

For Immune_cells and some other cell types, a single combined rds object holds multiple cell types or subtypes, and only one or some of them are needed. The input txt file lists the name of each required cell type (subtype) in the 1st column and the rds path in the 2nd column. Use the second version, the `subtypeagg` workflow.

For Exc, the data were split into multiple Seurat objects, which are aggregated together. Use the third version, the `neuronsagg` workflow.
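
As an illustration of the list file described above, the following sketch writes a two-column, tab-separated file without a header and reads it back into two parallel lists (names and paths). It is only a sketch: the temporary location of `celltype.txt` and the use of the standard `csv` module are assumptions for illustration (the workflows themselves read the file with `pandas.read_csv`).

```python
import csv
import os
import tempfile

# Rows of the list file: cell type name, then path to the Seurat rds file.
rows = [
    ("Ast", "/with_projids/Astrocytes.rds"),
    ("Inh", "/with_projids/Inhibitory_neurons.rds"),
    ("Oli", "/with_projids/Oligodendrocytes.rds"),
    ("OPC", "/with_projids/OPCs.rds"),
]

# Write the list file to a temporary directory (location chosen for illustration only).
list_file = os.path.join(tempfile.mkdtemp(), "celltype.txt")
with open(list_file, "w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Read it back: no header line, one (cell type, rds path) pair per row.
with open(list_file, newline="") as handle:
    parsed = list(csv.reader(handle, delimiter="\t"))

celltypes = [row[0] for row in parsed]
rds_paths = [row[1] for row in parsed]
print(celltypes)
print(rds_paths)
```

For the `subtypeagg` workflow the layout is the same, except that the 1st column holds the subtype name and the same rds path may be repeated on several rows.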

`FIXME: All of the sos workflows are based on projid. Because projid values are purely numeric, the R code adds a prefix 'g' to the aggregated column names, which then has to be removed. The code should be optimized to use sampleid so that this step can be avoided.`
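
One possible way to make the prefix removal safer is to strip the leading `g` only when the rest of the name is purely numeric. The sketch below is written in Python for illustration and is not part of the workflows; the function name `strip_g_prefix` and the example column names are hypothetical.

```python
import re

def strip_g_prefix(name):
    """Remove a leading 'g' only when the rest of the name is all digits."""
    return re.sub(r"^g(?=\d+$)", "", name)

# Hypothetical column names: two prefixed projids and two names that must stay unchanged.
columns = ["g10101327", "g20500815", "gene_like_id", "SM-ABC12"]
print([strip_g_prefix(c) for c in columns])
# → ['10101327', '20500815', 'gene_like_id', 'SM-ABC12']
```

By contrast, an unconditional `^g` replacement would also shorten a name such as `gene_like_id`, which matters once non-numeric sample IDs are used.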

## Steps
- Count cells by sample: calculate the number of cells for each sample (`projid`) from the metadata. These counts are used to filter samples.
- Aggregate expression data: create pseudobulk data by summing the raw counts per sample, which improves the signal-to-noise ratio for downstream analysis.
- Filter samples: exclude samples with fewer than 10 cells of the cell type, so that each pseudobulk profile is based on enough cells.
- Gene filtering: use `filterByExpr()` to retain genes with sufficient expression across samples.
- Normalization: compute TMM normalization factors to adjust for composition effects, making counts comparable between samples.
- Voom transformation: transform the counts to log2 counts per million (log2CPM) with `voom()`; only the log2CPM matrix is kept.
- Filter by expression: remove genes whose mean log2CPM is not above 2.0.
- Quantile normalization: for each gene, rank the log2CPM values across samples and map the ranks to standard normal quantiles (a rank-based inverse normal transformation), so that every gene follows the same distribution across samples.
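
The last step can be sketched for a single gene as follows. This is a minimal illustration in Python using only the standard library, not the code used by the workflows (which do this in R with `rank()` and `qnorm()`); the function names and the example values are hypothetical. Ties receive the average of their ranks, and each rank `r` among `n` samples is mapped to the standard normal quantile of `r / (n + 1)`.

```python
from statistics import NormalDist

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based ranks in the block
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def inverse_normal_transform(values):
    """Map the ranks of one gene across samples to standard normal quantiles."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in average_ranks(values)]

# Hypothetical log2CPM values of one gene in five samples (two are tied).
gene_logcpm = [5.1, 3.2, 4.4, 3.2, 6.0]
transformed = inverse_normal_transform(gene_logcpm)
print(average_ranks(gene_logcpm))
print([round(x, 3) for x in transformed])
```

Because only the ranks are used, the transformed values keep the ordering of the samples within a gene but no longer carry the original expression scale.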

## Output

The output is a `{name}.{celltype}.normalized.log2cpm.tsv` file for each cell type. The 1st column, `id`, holds the gene name, and the remaining columns are the projids.

## Global parameter settings

In [None]:
[global]
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number commands to run per job
parameter: job_size = 5
# Wall clock time expected
parameter: walltime = "20h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 2

# example:
sos run pipeline/pseudobulk_pheno_aggregation.ipynb seuratagg \
    --name snuc_pseudo_bulk \
    --seurat_rds /home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/celltype.txt \
    --cwd /home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/data_after_aggregation/ \
    --container /home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/container/seurat.sif \
    --mem 40G -J 50 -c /home/al4225/project/quantile_qtl/csg.yml -q csg

## Seurat rds aggregation
### The first version: for Seurat objects that each hold one cell type
The input is a txt file with the cell type name in the first column and the Seurat rds file path in the second column. The output is a normalized, aggregated log2CPM tsv file for each cell type.


In [None]:
[seuratagg]
import pandas as pd
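# tab-separated list file: cell type name in the 1st column, Seurat rds path in the 2nd column
parameter: seurat_rds = path()


# one (cell type, rds file) pair per row of the list file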
rds_result = pd.read_csv(seurat_rds, sep = "\t", header=None)
print(rds_result)
input_inv = rds_result.values.tolist()
tissue_id_inv = [x[0] for x in input_inv]
file_inv = [x[1] for x in input_inv]
print("\ntissue ID List:")
print(tissue_id_inv)
print("\nFile List:")
print(file_inv)
print("Length of tissue_id_inv:", len(tissue_id_inv))
print("Length of file_inv:", len(file_inv))

input: file_inv, group_by = 1, group_with = "tissue_id_inv"
output: normalized_log2cpm = f'{cwd:a}/{name}.{_tissue_id_inv}.normalized.log2cpm.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
library(Seurat)
library(edgeR)
library(limma)

#loading separate seurat objects, each of a specific celltype
seu = readRDS(${_input:r})

#keep cell counts for sample filtering (the sample ID must be in the metadata under 'projid')
cellcounts=table(seu@meta.data$projid)
seu=SetIdent(seu,value="projid")

#creation of the raw count pseudobulk
expr=AggregateExpression(seu,group.by="projid",slot="counts")$RNA

# delete the g prefix of colname: only for projids version.
colnames(expr) <- gsub("^g", "", colnames(expr))

#filtering out samples with fewer than 10 cells in a celltype
sampnames=names(cellcounts[cellcounts>9])
expr=expr[,sampnames]


#filter low expression genes
y <- DGEList(counts = expr)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=F]


#counts per million
y <- calcNormFactors(y, method = "TMM")
v <- voom(y, plot=F)
logcpm <- v$E

# remove genes if mean log2CPM < 2.0
mean_logcpm <- apply(logcpm, 1, mean)
logcpm <- logcpm[mean_logcpm > 2.0,]

logcpm <- as.data.frame(logcpm)
logcpm$id <- rownames(logcpm)
rownames(logcpm) <- NULL  #the rownames are now in the id column
logcpm <- logcpm[, c("id", setdiff(names(logcpm), "id"))]

# convert log2CPM to matrix
logcpm_id <- logcpm$id
logcpm <- as.matrix(logcpm[, colnames(logcpm) != "id"])
rownames(logcpm) <- logcpm_id

# quantile normalization: per-gene rank-based inverse normal transformation across samples
logcpm <- t(apply(logcpm, 1, rank, ties.method = "average"))
logcpm <- qnorm(logcpm / (ncol(logcpm) + 1))

# export
df <- data.frame(id = rownames(logcpm), logcpm, check.names = F)
write.table(df, file="${_output['normalized_log2cpm']}", sep="\t", quote = F, row.names = F)
cat("the normalized aggregated pseudo_bulk_eqtl tsv are saved")

### The second version: for a Seurat object with multiple cell types or subtypes
If the Seurat object holds multiple cell types or subtypes (recorded in the metadata), this workflow subsets the cell type of interest from the rest of the cells before running the pseudobulk aggregation.

Check the column names of the Seurat object's metadata to make sure the cell type column name is correct. This code uses the `predicted.id` column for the subtype names.
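
The subsetting and the subsequent per-sample cell count can be sketched as below. This is a simplified Python illustration on plain tuples rather than a Seurat object; the projids, the subtype labels `Mic.1` and `Mic.2`, and the function name `samples_passing` are hypothetical.

```python
from collections import Counter

MIN_CELLS = 10  # samples with fewer cells of the chosen subtype are dropped

# Hypothetical per-cell metadata: (projid, predicted.id), one tuple per cell.
cells = (
    [("10101327", "Mic.1")] * 12
    + [("10101327", "Mic.2")] * 30
    + [("20500815", "Mic.1")] * 4
    + [("20500815", "Mic.2")] * 15
)

def samples_passing(cells, subtype, min_cells=MIN_CELLS):
    """Keep only cells of one subtype, count them per sample, and apply the cell filter."""
    counts = Counter(projid for projid, label in cells if label == subtype)
    return sorted(projid for projid, n in counts.items() if n >= min_cells)

print(samples_passing(cells, "Mic.1"))  # → ['10101327']
print(samples_passing(cells, "Mic.2"))  # → ['10101327', '20500815']
```

The example shows that the same sample can pass the cell count filter for one subtype and fail it for another, because the counts are taken after subsetting.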


# example:
sos run pipeline/pseudobulk_pheno_aggregation.ipynb subtypeagg \
    --name snuc_pseudo_bulk \
    --seurat_rds /mnt/vast/hpc/homes/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/data_before_aggregate/subtype/mic_subtypes.txt \
    --cwd /mnt/vast/hpc/homes/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/phenodata_quantnorm_nofill0/subtype/ \
    --container /home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/container/seurat.sif \
    --mem 60G -J 50

In [None]:
[subtypeagg]
import pandas as pd
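# tab-separated list file: subtype name in the 1st column, Seurat rds path in the 2nd column
parameter: seurat_rds = path()


# one (subtype, rds file) pair per row of the list file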
rds_result = pd.read_csv(seurat_rds, sep = "\t", header=None)
print(rds_result)
input_inv = rds_result.values.tolist()
tissue_id_inv = [x[0] for x in input_inv]
file_inv = [x[1] for x in input_inv]
print("\ntissue ID List:")
print(tissue_id_inv)
print("\nFile List:")
print(file_inv)
print("Length of tissue_id_inv:", len(tissue_id_inv))
print("Length of file_inv:", len(file_inv))

input: file_inv, group_by = 1, group_with = "tissue_id_inv"
output: normalized_log2cpm = f'{cwd:a}/{name}.{_tissue_id_inv}.normalized.log2cpm.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
library(Seurat)
library(edgeR)
library(limma)

seu_all=readRDS(${_input:r})
seu_all=SetIdent(seu_all,value="predicted.id") #this is the subtype/celltype column name.
ct='${_tissue_id_inv}'
seu=subset(seu_all,idents=ct)

#keep cell counts for sample filtering (the sample ID must be in the metadata under 'projid')
cellcounts=table(seu@meta.data$projid)
seu=SetIdent(seu,value="projid")

#creation of the raw count pseudobulk
expr=AggregateExpression(seu,group.by="projid",slot="counts")$RNA

# delete the g prefix of colname
colnames(expr) <- gsub("^g", "", colnames(expr))

#filtering out samples with fewer than 10 cells in a celltype
sampnames=names(cellcounts[cellcounts>9])
expr=expr[,sampnames]

#filter low expression genes
y <- DGEList(counts = expr)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=F]

#counts per million
y <- calcNormFactors(y, method = "TMM")

v <- voom(y, plot=F)
logcpm <- v$E

# remove genes if mean log2CPM < 2.0
mean_logcpm <- apply(logcpm, 1, mean)
logcpm <- logcpm[mean_logcpm > 2.0,]

logcpm <- as.data.frame(logcpm)
logcpm$id <- rownames(logcpm)
rownames(logcpm) <- NULL
logcpm <- logcpm[, c("id", setdiff(names(logcpm), "id"))]

# convert log2CPM to matrix
logcpm_id <- logcpm$id
logcpm <- as.matrix(logcpm[, colnames(logcpm) != "id"])
rownames(logcpm) <- logcpm_id

# quantile normalization: per-gene rank-based inverse normal transformation across samples
logcpm <- t(apply(logcpm, 1, rank, ties.method = "average"))
logcpm <- qnorm(logcpm / (ncol(logcpm) + 1))

# export
df <- data.frame(id = rownames(logcpm), logcpm, check.names = F)
write.table(df, file="${_output['normalized_log2cpm']}", sep="\t", quote = F, row.names = F)

cat("the normalized aggregated pseudo_bulk_eqtl tsv are saved")

### The third version: for a cell type split across multiple Seurat objects
For example, neurons were split into two or more files, so they are handled separately. This is an example of how to aggregate the 3 Exc files together.

# example:
sos run pipeline/pseudobulk_pheno_aggregation.ipynb neuronsagg \
    --name snuc_pseudo_bulk \
    --cwd /home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/data_after_aggregation/exc/ \
    --container /home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/container/seurat.sif \
    --mem 100G -J 10

`FIXME: This sos workflow is not generalized (the rds paths are hard-coded in the R code), so the code should be optimized.`
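
A generalized merge could follow the logic sketched below: restrict all files to their common genes, sum the counts of samples that occur in more than one file, and sum the cell counts per sample before applying the 10-cell filter. This is a Python sketch on nested dictionaries, not the R code of the workflow. Whether the same projid occurs in several of the Exc files is not stated here, so the sketch assumes that it can; all gene names, sample names and counts are hypothetical.

```python
from collections import Counter

MIN_CELLS = 10  # same threshold as the sample filter used in this notebook

def merge_pseudobulk(matrices):
    """Merge {gene: {sample: count}} matrices on their common genes,
    summing the counts of samples that occur in more than one matrix."""
    common_genes = set(matrices[0])
    for matrix in matrices[1:]:
        common_genes &= set(matrix)
    merged = {}
    for gene in sorted(common_genes):
        totals = Counter()
        for matrix in matrices:
            totals.update(matrix[gene])  # adds counts sample by sample
        merged[gene] = dict(totals)
    return merged

# Hypothetical pseudobulk counts from two files; sample s1 occurs in both.
set1 = {"GENE_A": {"s1": 5, "s2": 1}, "GENE_B": {"s1": 2, "s2": 0}}
set2 = {"GENE_A": {"s1": 3, "s3": 7}, "GENE_C": {"s1": 1, "s3": 1}}
# Hypothetical cell counts per sample in each file.
cells1 = Counter({"s1": 6, "s2": 12})
cells2 = Counter({"s1": 5, "s3": 20})

merged = merge_pseudobulk([set1, set2])
cell_counts = cells1 + cells2  # per-sample totals across files
kept = sorted(s for s, n in cell_counts.items() if n >= MIN_CELLS)
print(merged)  # → {'GENE_A': {'s1': 8, 's2': 1, 's3': 7}}
print(kept)    # → ['s1', 's2', 's3']
```

In this example sample `s1` has fewer than 10 cells in each file but 11 in total, so it passes the filter only when the cell counts are summed across files.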

In [None]:
[neuronsagg]
import pandas as pd
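# Cell type label for the output file name; this step has no grouped input, so `_tissue_id_inv` is not defined here
parameter: celltype = "Exc"
output: normalized_log2cpm = f'{cwd:a}/{name}.{celltype}.normalized.log2cpm.tsv'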
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
library(Seurat)
library(edgeR)
library(limma)

# Neurons were split into three files, so handled separately
seu1 = readRDS("/home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/data_before_aggregate/Excitatory_neurons_set1.rds")
cellcounts1 = table(seu1@meta.data$projid)
seu1 = SetIdent(seu1, value = "projid")
expr1 = AggregateExpression(seu1, group.by = "projid", slot = "counts")$RNA

seu2 = readRDS("/home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/data_before_aggregate/Excitatory_neurons_set2.rds")
cellcounts2 = table(seu2@meta.data$projid)
seu2 = SetIdent(seu2, value = "projid")
expr2 = AggregateExpression(seu2, group.by = "projid", slot = "counts")$RNA

seu3 = readRDS("/home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/data_before_aggregate/Excitatory_neurons_set3.rds")
cellcounts3 = table(seu3@meta.data$projid)
seu3 = SetIdent(seu3, value = "projid")
expr3 = AggregateExpression(seu3, group.by = "projid", slot = "counts")$RNA

ct = "neuron"
genes1 = rownames(expr1)
genes2 = rownames(expr2)
genes3 = rownames(expr3)

# Find common genes among all three sets
common_genes = Reduce(intersect, list(genes1, genes2, genes3))

# Filter the expression matrices to keep only common genes
expr1 = expr1[common_genes, ]
expr2 = expr2[common_genes, ]
expr3 = expr3[common_genes, ]

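# Combine the expression matrices horizontally (by columns)
expr = cbind(expr1, expr2, expr3)

# delete the g prefix of colname
colnames(expr) <- gsub("^g", "", colnames(expr))

# If a projid occurs in more than one set, sum its columns so that each sample appears once
# (apart from column order, this changes nothing when the sets hold disjoint samples)
expr <- t(rowsum(t(as.matrix(expr)), group = colnames(expr)))

# Remove unnecessary objects from memory
rm(expr1, expr2, expr3, seu1, seu2, seu3)

# Combine cell counts from all three sets, summing per projid
cellcounts = c(cellcounts1, cellcounts2, cellcounts3)
cellcounts = tapply(cellcounts, names(cellcounts), sum)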

#filtering out samples with fewer than 10 cells in a celltype
sampnames=names(cellcounts[cellcounts>9])
expr=expr[,sampnames]

#filter low expression genes
y <- DGEList(counts = expr)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=F]


#counts per million
y <- calcNormFactors(y, method = "TMM")
v <- voom(y, plot=F)
logcpm <- v$E

# remove genes if mean log2CPM < 2.0
mean_logcpm <- apply(logcpm, 1, mean)
logcpm <- logcpm[mean_logcpm > 2.0,]

logcpm <- as.data.frame(logcpm)
logcpm$id <- rownames(logcpm)
rownames(logcpm) <- NULL
logcpm <- logcpm[, c("id", setdiff(names(logcpm), "id"))]

# convert log2CPM to matrix
logcpm_id <- logcpm$id
logcpm <- as.matrix(logcpm[, colnames(logcpm) != "id"])
rownames(logcpm) <- logcpm_id

# quantile normalization: per-gene rank-based inverse normal transformation across samples
logcpm <- t(apply(logcpm, 1, rank, ties.method = "average"))
logcpm <- qnorm(logcpm / (ncol(logcpm) + 1))

# export
df <- data.frame(id = rownames(logcpm), logcpm, check.names = F)
write.table(df, file="${_output['normalized_log2cpm']}", sep="\t", quote = F, row.names = F)

cat("the normalized aggregated pseudo_bulk_eqtl tsv are saved")

### Supplement: code to zero-pad projid to length 8
The example uses projid instead of sampleid, and each projid must be left-padded with zeros to 8 digits to match the projid--sample list. The column names of the tsv file are therefore processed after aggregation with the following Python code.

In [None]:
# Python code. The example uses projid instead of sampleid, and each projid is zero-padded
# to 8 digits to match the projid--sample list, so the column names of the tsv file are
# processed after aggregation.
# The log2CPM tsv file: 1st col is id, the rest are projids.
import csv
import pandas as pd

def pad_column_names(df, pad_length=8):
    """Left-pad purely numeric column names (projids) with zeros to pad_length."""
    # zfill leaves names that already have pad_length or more characters unchanged
    new_cols = [df.columns[0]] + [
        col.zfill(pad_length) if col.isdigit() else col
        for col in df.columns[1:]
    ]
    df.columns = new_cols
    return df

file_path = '/The input path/phenodata_quantnorm_nofill0/snuc_pseudo_bulk.Ast.normalized.log2cpm.tsv'
df = pd.read_csv(file_path, sep='\t')

df = pad_column_names(df)

output_file_path = 'The output path/For your tsv data'
# QUOTE_NONE writes the fields without quotes (quotechar='' is rejected by recent Python versions)
df.to_csv(output_file_path, sep='\t', index=False, quoting=csv.QUOTE_NONE)
