<a href="https://colab.research.google.com/github/Seung-heon-Baek/KSBi-BIML-2025/blob/main/%5BBIML2025%5DAdvanced_Deep_Learning_Models_for_Biomedical_Research_I_Practice_%EB%B0%B0%ED%8F%AC%EC%9A%A9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BIML 2025 Advanced Deep Learning Models for Biomedical Research I
- Korea University College of Medicine: Prof. 전민지; teaching assistants 이지호, 심우종

### Paper used in this practice session
- Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. 2023. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006 (2023).

### Objective: hands-on practice with the DNABERT-2 model and fine-tuning

1. Data loading and preprocessing
2. Building the model
- Embedding Module
- Transformer Module
- Interaction Module
- Decoder Module
3. Defining substructures of SMILES and protein amino acid sequences
4. Defining the loss function and optimizer
5. Training
6. Predicting drug-target interactions
7. Visualizing the interaction map

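## Instructions for the practice session
The parts between "<font color='45A07A'>## Code start ##</font>" and "<font color='45A07A'>## Code end ##</font>" are the ones you need to write yourself.

**First, download the dataset used in this practice session.**

https://drive.google.com/drive/folders/10Ho8GbbljAN5NyUhWzFFtWQhIDOkU41t

Download the dataset from the link above and move it into the project directory you will use.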


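## 0-1. Instructions for Colab users

**These instructions are for those who want to run the notebook in Colab rather than locally (on a personal laptop).**

**If you are running locally, skip to "Instructions for local users".**

**First, configure the runtime to use a GPU.**

Next to the `T4 RAM Disk` icon, open `Additional connection options` -> `Change runtime type`, set the hardware accelerator to `T4 GPU`, and save.

**Now connect Google Drive to Colab.**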

In [2]:
## Google Drive mount

from google.colab import auth
auth.authenticate_user()

from google.colab import drive
drive.mount('/content/drive', force_remount=False)

Mounted at /content/drive


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Package load

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import transformers
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
from transformers.models.bert.configuration_bert import BertConfig

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

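The `transformers` package from Hugging Face provides the `AutoTokenizer` and `AutoModel` classes used below, which download a pretrained tokenizer and model from the Hugging Face Hub by name.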

In [None]:
## Fix the random seeds for reproducibility.
torch.manual_seed(1)
np.random.seed(1)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

print('GPU available: {}'.format(use_cuda))

GPU available: True
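
The seeding cell above covers PyTorch and NumPy. A minimal sketch of a reusable helper that also seeds Python's built-in `random` module is shown below; the name `set_seed` is an assumption made for illustration and is not part of the original notebook. Note that seeding alone may not make every GPU operation bit-for-bit reproducible.

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch (CPU and any visible GPUs)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Safe without a GPU: the call is silently ignored when CUDA is unavailable.
    torch.cuda.manual_seed_all(seed)


# Re-seeding with the same value reproduces the same random draws.
set_seed(1)
first = torch.rand(3)
set_seed(1)
second = torch.rand(3)
print(torch.equal(first, second))
```

Calling the helper once at the top of the notebook keeps all seeding in one place.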


## 2. Loading DNABERT-2

In [None]:
from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M")
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

Some weights of BertModel were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
config

BertConfig {
  "_name_or_path": "zhihan1996/DNABERT-2-117M",
  "alibi_starting_size": 512,
  "attention_probs_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "zhihan1996/DNABERT-2-117M--configuration_bert.BertConfig",
    "AutoModel": "zhihan1996/DNABERT-2-117M--bert_layers.BertModel",
    "AutoModelForMaskedLM": "zhihan1996/DNABERT-2-117M--bert_layers.BertForMaskedLM",
    "AutoModelForSequenceClassification": "zhihan1996/DNABERT-2-117M--bert_layers.BertForSequenceClassification"
  },
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.47.1",
  "type_vocab_size": 2,
  "use_cache"

In [None]:
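# Use a tiny non-zero attention dropout instead of 0.0.
# Assumption: with a non-zero value, the remote DNABERT-2 code takes its plain
# PyTorch attention path rather than the Triton flash-attention kernel.
config.attention_probs_dropout_prob = 1e-6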

In [None]:
tokenizer

PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M', vocab_size=4096, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [None]:
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True, config=config)

Some weights of BertModel were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(4096, 768, padding_idx=0)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertUnpadAttention(
          (self): BertUnpadSelfAttention(
            (dropout): Dropout(p=1e-06, inplace=False)
            (Wqkv): Linear(in_features=768, out_features=2304, bias=True)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (mlp): BertGatedLinearUnitMLP(
          (gated_layers): Linear(in_features=768, out_features=6144, bias=False)
          (act): GELU(approximate='none')


In [None]:
model.eval()  # disable dropout so the extracted embeddings are deterministic
model = model.to(device)

## 3. Getting embeddings with DNABERT-2

In [None]:
DNA = pd.read_csv('/content/drive/MyDrive/[2025]BIML/data/DNA.csv')

In [None]:
DNA

Unnamed: 0,sequence,label
0,AAAAAGCCTGTGAAGCACAGAGAGCAGCCAGCCAGAGCTGATGCTC...,1
1,ACCTGCTAACAATTAAGGCCTCCAGGTCTACCCTGCAGCTGGGCCT...,0
2,ACTGACCATGTGCATCCTCACTGATACCAGTCTTGCCACAGTGTGC...,0
3,ATATTACTCAACCGCCTAACAGAACAAAAGCATTCTTGGCTTGATC...,0
4,CAACCATCCTACTTGCTCGTGGGCTAGCTGCGGGCGCGTCGCGAGC...,0
5,TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCA...,0
6,CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTT...,1
7,ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACA...,0
8,TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGT...,0
9,GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGC...,1


In [None]:
def get_embeddings(sequence, model, tokenizer, device):
    # Tokenize the sequence
    tokens = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=512)
    tokens = {key: val.to(device) for key, val in tokens.items()}

    # Run the model without tracking gradients
    with torch.no_grad():
        outputs = model(**tokens)
    hidden_states = outputs[0]  # hidden states of the last layer: (1, seq_len, 768)

    # Mean pooling: average over the token dimension
    embedding_mean = torch.mean(hidden_states[0], dim=0)

    # Max pooling: element-wise maximum over the token dimension
    embedding_max = torch.max(hidden_states[0], dim=0)[0]

    return embedding_mean, embedding_max  # return both pooled embeddings
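
`get_embeddings` processes one sequence at a time, so `hidden_states[0]` contains no padding. If several sequences were tokenized together with `padding=True`, a plain mean over the token dimension would also average the padded positions. A minimal sketch of padding-aware mean pooling follows; `masked_mean_pool` is a hypothetical helper, and the toy tensors stand in for real model outputs and the tokenizer's `attention_mask`.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, hidden)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts


# Toy tensors: the first sequence has one padded position holding junk values.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
                       [[2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # rows: [2., 2.] and [4., 6.]
```

The padded position in the first sequence does not affect its pooled vector, which a plain `torch.mean` over `dim=1` would not guarantee.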

In [None]:
# Compute embeddings for every sequence
mean_embeddings = []
max_embeddings = []

for seq in DNA['sequence']:
    mean_emb, max_emb = get_embeddings(seq, model, tokenizer, device)
    mean_embeddings.append(mean_emb.cpu().numpy())
    max_embeddings.append(max_emb.cpu().numpy())

In [None]:
mean_embeddings_array = np.array(mean_embeddings)
print(mean_embeddings_array.shape)

(30, 768)


## 4. Classification with the generated embeddings

In [None]:
class MLPClassifier(nn.Module):
    """Two-layer MLP that maps an embedding to class logits."""

    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so no softmax layer is needed here.
        x = self.fc2(x)
        return x
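
The classifier's `forward` returns raw scores (logits) rather than softmax probabilities, which is the input `nn.CrossEntropyLoss` expects. The short check below, on made-up logits, illustrates that cross-entropy on logits equals negative log-likelihood on log-softmax outputs, and that softmax is only needed when probabilities are wanted.

```python
import torch
import torch.nn.functional as F

# Made-up logits for two samples and two classes (illustration only).
logits = torch.tensor([[2.0, 0.5],
                       [0.1, 1.2]])
labels = torch.tensor([0, 1])

# cross_entropy applies log-softmax internally, so it takes raw logits.
loss_from_logits = F.cross_entropy(logits, labels)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss_from_logits, loss_manual))

# Softmax converts logits to probabilities without changing the argmax.
probs = F.softmax(logits, dim=1)
print(probs.argmax(dim=1), logits.argmax(dim=1))
```

Applying softmax inside `forward` and then passing the result to `nn.CrossEntropyLoss` would normalize twice, which is why the model outputs logits.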

In [None]:
# Store the embeddings in the DataFrame
DNA['embedding_mean'] = mean_embeddings
DNA['embedding_max'] = max_embeddings

In [None]:
# Build the feature and label tensors.
# Stacking into a single NumPy array first avoids the slow-conversion warning
# that torch.tensor raises for a list of arrays.
X = torch.tensor(np.stack(DNA['embedding_mean'].to_list()), dtype=torch.float32)
y = torch.tensor(DNA['label'].values, dtype=torch.long)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
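
With only 30 sequences, a random 80/20 split can leave the test set with a class ratio quite different from the full dataset. One option, sketched below on hypothetical stand-in data (not the labels in `DNA.csv`), is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal across the two subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 30 samples, 10 of them positive.
X_demo = np.arange(30).reshape(30, 1)
y_demo = np.array([1] * 10 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The 1:2 positive-to-negative ratio is preserved in both subsets.
print(len(y_tr), int(y_tr.sum()))  # 24 8
print(len(y_te), int(y_te.sum()))  # 6 2
```

In the notebook this would correspond to passing `stratify=y` in the split above.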


In [None]:
# Hyperparameters
input_size = 768
hidden_size = 128
num_classes = 2
num_epochs = 20
learning_rate = 0.001

# Initialize the model, loss, and optimizer
model = MLPClassifier(input_size, hidden_size, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop (full batch: every epoch uses all training samples at once)
for epoch in range(num_epochs):
    model.train()
    inputs = X_train.to(device)
    labels = y_train.to(device)

    # Forward
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

Epoch [1/20], Loss: 0.6913
Epoch [2/20], Loss: 0.6630
Epoch [3/20], Loss: 0.6414
Epoch [4/20], Loss: 0.6245
Epoch [5/20], Loss: 0.6103
Epoch [6/20], Loss: 0.5980
Epoch [7/20], Loss: 0.5871
Epoch [8/20], Loss: 0.5767
Epoch [9/20], Loss: 0.5659
Epoch [10/20], Loss: 0.5539
Epoch [11/20], Loss: 0.5407
Epoch [12/20], Loss: 0.5264
Epoch [13/20], Loss: 0.5111
Epoch [14/20], Loss: 0.4955
Epoch [15/20], Loss: 0.4797
Epoch [16/20], Loss: 0.4638
Epoch [17/20], Loss: 0.4481
Epoch [18/20], Loss: 0.4324
Epoch [19/20], Loss: 0.4169
Epoch [20/20], Loss: 0.4009
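
The loop above feeds the whole training set in a single batch, which is fine for a few dozen embeddings. For larger datasets, a mini-batch version built on `TensorDataset` and `DataLoader` is the usual pattern. The sketch below uses synthetic tensors in place of `X_train` and `y_train`, and the names `clf`, `loader`, and `epoch_losses` are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins for X_train / y_train: 24 samples, 768 features.
X_demo = torch.randn(24, 768)
y_demo = torch.randint(0, 2, (24,))

loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=8, shuffle=True)
clf = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)

epoch_losses = []
for epoch in range(3):
    clf.train()
    running = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = criterion(clf(inputs), labels)
        loss.backward()
        optimizer.step()
        running += loss.item() * inputs.size(0)  # weight each batch by its size
    epoch_losses.append(running / len(loader.dataset))  # mean loss per sample

print(len(loader), epoch_losses)
```

Each epoch now performs one optimizer step per batch (three here) instead of a single step.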


In [None]:
# Evaluate the model
model.eval()
with torch.no_grad():
    inputs = X_test.to(device)
    labels = y_test.to(device)
    outputs = model(inputs)
    _, predicted = torch.max(outputs, 1)

# Compute accuracy
accuracy = accuracy_score(labels.cpu(), predicted.cpu())
print(f"Test Accuracy: {accuracy:.4f}")

Test Accuracy: 0.6667
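
With 30 sequences and `test_size=0.2`, the test set holds 6 samples, so 0.6667 corresponds to 4 correct predictions out of 6, and a single sample shifts the accuracy by about 0.17. A useful reference point for such a small test set is the accuracy of always predicting the most frequent training label. The sketch below computes that baseline; `majority_baseline_accuracy` is a hypothetical helper, and the label lists are invented for illustration, not taken from `DNA.csv`.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


# Invented labels for illustration only.
train_labels = [0] * 16 + [1] * 8
test_labels = [0, 0, 0, 1, 1, 1]

baseline = majority_baseline_accuracy(train_labels, test_labels)
print(f"Majority-class baseline accuracy: {baseline:.4f}")  # 0.5000
```

In the notebook, the same function could be called with `y_train.tolist()` and `y_test.tolist()` to see how far the classifier is from this baseline.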
