# Distribution of gene expression levels

The goal is to estimate quantiles that characterize how a population of promoter sequences affects gene expression. Vaishnav et al. (1) recently trained a state-of-the-art transformer model to predict the expression level of a particular gene induced by a promoter sequence. They used the model's predictions to study the effects of promoters, for example by assessing how quantiles of predicted expression levels differ across populations of promoters. This notebook shows how the predictions used by Vaishnav et al. can be leveraged to estimate quantiles of the gene expression levels induced by native yeast promoters with higher statistical power.

1. E. D. Vaishnav, C. G. de Boer, J. Molinet, M. Yassour, L. Fan, X. Adiconis, D. A. Thompson, J. Z. Levin, F. A. Cubillos, A. Regev, The evolution, evolvability and engineering of gene regulatory DNA. Nature 603(7901), 455–463 (2022).
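
To make the idea of a prediction-assisted quantile concrete, here is a minimal sketch of one common construction: a "rectified" empirical CDF that evaluates the model's predictions on the unlabeled samples and corrects them with the prediction error observed on the labeled samples. This is an assumption about the general technique, not a description of what `analyze_dataset` does internally. The helper names `rectified_cdf` and `pp_quantile` are hypothetical, and the data are synthetic (deliberately biased predictions), not the gene expression dataset.

```python
import numpy as np


def rectified_cdf(grid, y_lab, yhat_lab, yhat_unlab):
    """Estimate P(Y <= t) for each t in grid from predictions plus a label-based correction."""
    t = grid[:, None]
    # CDF of the predictions on the (large) unlabeled set.
    imputed = (yhat_unlab[None, :] <= t).mean(axis=1)
    # Average discrepancy between label and prediction indicators on the labeled set.
    rectifier = (
        (y_lab[None, :] <= t).astype(float) - (yhat_lab[None, :] <= t).astype(float)
    ).mean(axis=1)
    return imputed + rectifier


def pp_quantile(q, y_lab, yhat_lab, yhat_unlab, num_grid=2000):
    """Return the grid point whose rectified CDF value is closest to q."""
    values = np.concatenate([y_lab, yhat_lab, yhat_unlab])
    grid = np.linspace(values.min(), values.max(), num_grid)
    cdf = rectified_cdf(grid, y_lab, yhat_lab, yhat_unlab)
    return grid[np.argmin(np.abs(cdf - q))]


# Synthetic stand-in data (assumption): true median is 6.0, predictions are shifted by +0.5.
rng = np.random.default_rng(0)
n_total, n_lab = 6000, 600
y = rng.normal(6.0, 2.0, size=n_total)
yhat = y + 0.5 + rng.normal(0.0, 0.5, size=n_total)
y_lab, yhat_lab = y[:n_lab], yhat[:n_lab]
yhat_unlab = yhat[n_lab:]

theta_pp = pp_quantile(0.5, y_lab, yhat_lab, yhat_unlab)
theta_naive = np.median(yhat_unlab)  # uses predictions only, so it inherits their bias
print("prediction-assisted median:", theta_pp)
print("predictions-only median:   ", theta_naive)
```

In this toy setup the predictions-only median lands near 6.5 because of the injected bias, while the rectified estimate should sit close to the true median of 6.0. The grid search mirrors the `grid` argument built in the notebook cell, although the actual interval construction there may differ.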

### Import necessary packages

In [1]:
import numpy as np
from datasets import load_dataset
from FL_cpp_method import analyze_dataset, plot_cpp

In [2]:
# Example call
dataset_name = "gene_expression"
data = load_dataset('../data/', dataset_name)
Y_total = data["Y"]
Yhat_total = data["Yhat"]

alpha = 0.05

method = "quantile"

dataset_dist = 'IID'
# dataset_dist = 'Non-IID'

grid = np.concatenate([Y_total, Yhat_total], axis=0)
grid = np.linspace(grid.min(), grid.max(), 5000)

# num_ratio = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
num_ratio = [1, 1, 1, 1, 1]  # balanced data sizes across nodes
# num_ratio = [1,2,3,3,1]  # unbalanced data sizes across nodes

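# Compute the true value from the labels, the CPP interval on each node,
# the interval on the combined data, and the averaged CPP interval after FL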
true_theta, cpp_intervals, ppi_ci_combined, mean_cpp = analyze_dataset(alpha, None, Y_total, Yhat_total, dataset_dist,
                                                                            num_ratio, method, grid)
# Plot the results
file_name = dataset_dist + '-' + dataset_name + '.pdf'
xlim = [2, 13.7]
ylim = [0, 1.0]
title = "0.5-quantile gene expression"
plot_cpp(true_theta, cpp_intervals, ppi_ci_combined, mean_cpp, file_name, xlim, ylim, title)

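Group: 1
Labeled sample size: 1223
Unlabeled sample size: 11007
Group: 2
Labeled sample size: 1223
Unlabeled sample size: 11007
Group: 3
Labeled sample size: 1223
Unlabeled sample size: 11007
Group: 4
Labeled sample size: 1223
Unlabeled sample size: 11007
Group: 5
Labeled sample size: 1223
Unlabeled sample size: 11007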
imputed var: [1.9761907e-09 1.9761907e-09 1.9761907e-09 ... 1.9761907e-09 1.9761907e-09
 1.9761907e-09]
rectifier var [1.87777747e-07 2.14498693e-07 2.14498693e-07 ... 1.87777747e-07
 1.87777747e-07 1.87777747e-07]
Labeled sample size: 6115
Unlabeled sample size: 55035

Final results:
True theta: 5.650311615722635
CPP intervals: [array([5.37570439, 5.97431801]), array([5.0390398 , 5.97375651]), array([5.11503672, 6.06751791]), array([5.521116  , 6.05694165]), array([5.31098184, 6.32432634])]
Confidence interval on combined data: [5.49365368 5.82598794]
Confidence interval after federated aggregation: [5.49522748 5.82504729]
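
As a quick sanity check on the printed results, the following sketch copies the reported intervals by hand and compares their widths and coverage of the true value. The variable names and the `width` and `covers` helpers are illustrative assumptions and are not part of `FL_cpp_method`; only the numbers come from the output above.

```python
import numpy as np

# Values copied from the printed output above.
true_theta = 5.650311615722635
node_intervals = np.array([
    [5.37570439, 5.97431801],
    [5.0390398, 5.97375651],
    [5.11503672, 6.06751791],
    [5.521116, 6.05694165],
    [5.31098184, 6.32432634],
])
combined_ci = np.array([5.49365368, 5.82598794])
federated_ci = np.array([5.49522748, 5.82504729])


def width(ci):
    """Upper endpoint minus lower endpoint (works for one interval or a stack)."""
    return ci[..., 1] - ci[..., 0]


def covers(ci, theta):
    """True where the interval contains theta."""
    return (ci[..., 0] <= theta) & (theta <= ci[..., 1])


print("per-node widths:", np.round(width(node_intervals), 3))
print("combined width: ", round(float(width(combined_ci)), 3))
print("federated width:", round(float(width(federated_ci)), 3))
print("all nodes cover true theta:", bool(covers(node_intervals, true_theta).all()))
print("federated covers true theta:", bool(covers(federated_ci, true_theta)))
```

For this run, every per-node interval contains the true value, and the federated interval is narrower than any single node's interval while nearly matching the interval computed on the combined data.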
