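# GenOS Genomics: Sequence Analysis Demo

This notebook demonstrates how to use the models for DNA sequence analysis and for predicting the effects of genetic variants.

## Features
- DNA sequence embedding generation
- Genetic variant effect prediction
- KEGG pathway analysis
- Interactive visualization of results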


## Environment Check
This project must be run on a machine with a GPU.

In [1]:
# Check the environment
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.7.1
CUDA available: True
GPU device: NVIDIA A40


## Environment Setup

The project includes a requirements.txt file. Install the dependencies with:
```
pip install -r requirements.txt
```

Select the GPU to use. GPU 0 is the default here.

In [13]:
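# Set the CUDA environment variable in the kernel process. A `!export` would run
# in a short-lived subshell and would not persist for later cells.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"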
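
The training script is launched later as a child process, so the GPU selection has to be visible to children of the notebook kernel. A variable exported through a `!` shell command lives only in that temporary shell, whereas one set through `os.environ` is inherited by every subprocess started afterwards. The following is a minimal sketch that checks this, assuming GPU 0 as in the text above; `child_value` is a name introduced here for illustration.

```python
import os
import subprocess
import sys

# Set the variable in the current (kernel) process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Start a child Python process and have it report what it sees.
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print(f"Child process sees CUDA_VISIBLE_DEVICES={child_value}")
```

For the current process, set the variable before the first CUDA call, because libraries generally read it once when CUDA is initialized.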

## Parameter Selection
Choose the model parameters to use for pathway analysis and disease prediction.

In [None]:
import ipywidgets as widgets
from IPython.display import display
import subprocess

text_model = widgets.Dropdown(options=[("Qwen1B","model_weights/Qwen/Qwen3-1___7B"),("Qwen4B","model_weights/Qwen/Qwen3-4B")],description="text model")

dna_model = widgets.Dropdown(options=[("Genos-1b","model_weights/onehot_mix_1b_128k364B_cpt_8k298B_cpt_1m140B_cpt_8k200B_stage1_1_1004"),("Genos-10b","model_weights/onehot_mix_10b_12L_1M140B_cpt_8k298B_cpt_32k128k200B_32k200B_8k200B_stage1_1_211_1009"),
                                      ("hyenadna_1M","hyenadna-large-1m-seqlen"),("NT","model_weights/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species"),("Evo2-1B","evo2_1b_base")],
                            description="DNA model")

dataset_type = widgets.Dropdown(options=[("kegg","kegg"),("kegg_hard","kegg_hard")],description="dataset")
max_epochs = widgets.IntText(value=5,description="max_epochs")
max_length_dna = widgets.IntText(value=1024,description="max_length_dna")
max_length_text = widgets.IntText(value=8192,description="max_length_text")
gradient_accumulation_steps = widgets.IntText(value=8,description="gradient_accumulation_steps")
btn = widgets.Button(description="start")
display(widgets.VBox([text_model,dna_model,dataset_type,max_epochs,max_length_dna,max_length_text,gradient_accumulation_steps,btn]))
import os

dna_is_evo2 = False
cache_dir = "model_weights"
dna_embedding_layer = "blocks.20.mlp.l3"
def on_button_clicked(b):
    global dna_is_evo2,cache_dir
    # Recompute on every click so an earlier Evo2 selection does not persist.
    dna_is_evo2 = dna_model.value == "evo2_1b_base"
    if dna_is_evo2:
        cache_dir = "model_weights/arcinstitute/evo2_1b_base/evo2_1b_base.pt"
    else:
        cache_dir = "model_weights"
    # Fill the ###placeholder### tokens of the template script.
    with open("sh_user.sh","r") as f:
        content = f.read()
    content = content.replace("###cache_dir###",cache_dir)
    content = content.replace("###text_model_name###",text_model.value)
    content = content.replace("###dna_model_name###",dna_model.value)
    content = content.replace("###dna_embedding_layer###",dna_embedding_layer)
    content = content.replace("###dataset_type###",dataset_type.value)
    content = content.replace("###max_epochs###",str(max_epochs.value))
    content = content.replace("###max_length_dna###",str(max_length_dna.value))
    content = content.replace("###max_length_text###",str(max_length_text.value))
    content = content.replace("###gradient_accumulation_steps###",str(gradient_accumulation_steps.value))
    content = content.replace("###dna_is_evo2###",str(dna_is_evo2))
    with open("sh_temp.sh","w") as f:
        f.write(content)
    # Start training in the background. The shell already redirects all output
    # to the log file, so no pipes are attached to the process.
    os.makedirs("logs", exist_ok=True)
    command = "nohup bash sh_temp.sh > logs/log.log 2>&1 &"
    subprocess.Popen(command, shell=True)


btn.on_click(on_button_clicked)

VBox(children=(Dropdown(description='text model', options=(('Qwen1B', 'model_weights/Qwen/Qwen3-1___7B'), ('Qw…
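
The cell above builds the run script by replacing `###name###` tokens in `sh_user.sh` with the widget values. A chained `str.replace` silently leaves a token in place if its name is misspelled or missing. The sketch below shows one way to do the same substitution while failing loudly in that case. `render_template` and the example command line are hypothetical and are not part of the project; the real contents of `sh_user.sh` are not shown in this notebook.

```python
import re

# Matches tokens of the form ###name###, where name is letters, digits, or "_".
PLACEHOLDER = re.compile(r"###(\w+)###")

def render_template(template, values):
    """Replace every ###name### token with str(values[name]).

    Raises KeyError if the template contains a token with no supplied value.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder ###{name}###")
        return str(values[name])
    return PLACEHOLDER.sub(substitute, template)

# Hypothetical template line, used only to illustrate the substitution.
example = "python train.py --max_epochs ###max_epochs### --dataset_type ###dataset_type###"
rendered = render_template(example, {"max_epochs": 5, "dataset_type": "kegg"})
print(rendered)
```

Using one regular-expression pass also keeps the list of parameters in a single dictionary, which may be easier to extend than a sequence of `replace` calls.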

View the log

In [5]:
# Show the last 50 lines of the training log (Linux/macOS)
import subprocess
result = subprocess.run(['tail', '-50', 'logs/log.log'], 
                      capture_output=True, text=True)
print(result.stdout)
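
The cell above depends on the `tail` command, which is only available on Linux and macOS. On systems without it, the same result can be obtained in pure Python. This is a minimal sketch; `tail_lines` is a helper name introduced here, and it returns an empty list when the log file does not exist yet (for example, before training has started).

```python
from collections import deque
from pathlib import Path

def tail_lines(path, n=50):
    """Return the last n lines of a text file, or [] if the file is missing."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open("r", encoding="utf-8", errors="replace") as f:
        # A bounded deque keeps only the most recent n lines while streaming.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]

for line in tail_lines("logs/log.log"):
    print(line)
```

Re-running this cell while training is in progress shows the latest lines each time.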



