Last update: 2024. 02. 29.

## :star: GenET predict module - DeepPrime

The GenET predict module provides functions that make it easy to use deep learning models that predict gRNA efficiency for CRISPR-Cas systems.
Among these, DeepPrime is a model that predicts the efficiency of prime editing gRNAs (pegRNAs).
DeepPrime is used through the following classes:

- DeepPrime
- DeepPrimeGuideRNA
- DeepPrimeOff

Use the example code below to try out the features you need!

### 💾 Install GenET
---
To use the features below, first install `genet` in your current environment by running the following commands in a terminal.

```bash
$ conda create -n genet python=3.9
$ conda activate genet
$ pip install genet
```

In addition, the `genet.predict` module requires PyTorch and ViennaRNA, which must be installed separately.

```bash
$ pip install torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
$ conda install viennarna
```

### 🔥 DeepPrime
---
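The `DeepPrime` class automatically designs every pegRNA that can be made for a target sequence and calculates the predicted efficiency of each one.
First, prepare the target sequences that carry the intended edit (the unedited and the prime-edited sequence); all possible pegRNAs are then designed from them.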
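
Before designing pegRNAs, it can help to confirm that the two input sequences really encode the intended edit. The sketch below is not part of GenET: `find_substitutions` is a hypothetical helper that compares an unedited and an edited sequence of equal length and reports every mismatching position (0-based). The sequences are the ones used in this tutorial.

```python
def find_substitutions(seq_wt: str, seq_ed: str) -> list:
    """Return (0-based index, unedited base, edited base) for every mismatch."""
    if len(seq_wt) != len(seq_ed):
        raise ValueError("A substitution requires sequences of equal length.")
    pairs = zip(seq_wt.upper(), seq_ed.upper())
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


seq_wt = 'ATGACAATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGAAACTGAGAAGAACTATAACCTGCAAATGTCAACTGAAACCTTAAAGTGAGTATTTAATTGAGCTGAAGT'
seq_ed = 'ATGACAATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGAAACTGAGACGAACTATAACCTGCAAATGTCAACTGAAACCTTAAAGTGAGTATTTAATTGAGCTGAAGT'

diffs = find_substitutions(seq_wt, seq_ed)
print(diffs)
```

For these inputs the helper reports a single A-to-C mismatch, which is consistent with `edit_type='sub'` and `edit_len=1`.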

While designing each pegRNA, DeepPrime also computes the biofeatures required by the DeepPrime model (Tm, GC counts, MFE, ...). To inspect the full data including these biofeatures, use `.features`, which returns a `pd.DataFrame`.

In [1]:
from genet.predict import DeepPrime

seq_wt   = 'ATGACAATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGAAACTGAGAAGAACTATAACCTGCAAATGTCAACTGAAACCTTAAAGTGAGTATTTAATTGAGCTGAAGT'
seq_ed   = 'ATGACAATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGAAACTGAGACGAACTATAACCTGCAAATGTCAACTGAAACCTTAAAGTGAGTATTTAATTGAGCTGAAGT'

pegrna = DeepPrime('SampleName', seq_wt, seq_ed, edit_type='sub', edit_len=1)

# check designed pegRNAs and biofeatures
pegrna.features.head()

Unnamed: 0,ID,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target,...,deltaTm_Tm4-Tm2,GC_count_PBS,GC_count_RTT,GC_count_RT-PBS,GC_contents_PBS,GC_contents_RTT,GC_contents_RT-PBS,MFE_RT-PBS-polyT,MFE_Spacer,DeepSpCas9_score
0,SampleName,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGG,7,35,42,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...,...,-340.104846,5,16,21,71.428571,45.714286,50.0,-10.4,-0.6,45.967537
1,SampleName,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGG,8,35,43,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...,...,-340.104846,6,16,22,75.0,45.714286,51.162791,-10.4,-0.6,45.967537
2,SampleName,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGT,9,35,44,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...,...,-340.104846,6,16,22,66.666667,45.714286,50.0,-10.4,-0.6,45.967537
3,SampleName,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGTG,10,35,45,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...,...,-340.104846,7,16,23,70.0,45.714286,51.111111,-10.4,-0.6,45.967537
4,SampleName,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGTGT,11,35,46,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...,...,-340.104846,7,16,23,63.636364,45.714286,50.0,-10.4,-0.6,45.967537
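
Two of the biofeature families in this table, `GC_count_*` and `GC_contents_*`, can be reproduced by hand, which is a quick way to sanity-check a row. The sketch below assumes that the count is the number of G and C bases, that the content is that count as a percentage of the sequence length, and that `RT-PBS` is the RTT followed by the PBS. These assumptions agree with the values printed above but are inferred from the output, not taken from the GenET source. `gc_count` and `gc_content` are hypothetical helper names.

```python
def gc_count(seq: str) -> int:
    """Number of G and C bases in the sequence."""
    seq = seq.upper()
    return seq.count('G') + seq.count('C')


def gc_content(seq: str) -> float:
    """GC bases as a percentage of the sequence length."""
    return 100.0 * gc_count(seq) / len(seq)


# Row 0 of the table above: RTT_len = 35, PBS_len = 7
rt_pbs = 'CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGG'
rtt_len, pbs_len = 35, 7
rtt, pbs = rt_pbs[:rtt_len], rt_pbs[rtt_len:]

for name, seq in [('PBS', pbs), ('RTT', rtt), ('RT-PBS', rt_pbs)]:
    print(name, gc_count(seq), round(gc_content(seq), 6))
```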


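To predict the editing efficiency of each pegRNA, use `.predict`.
Each deep learning model needs the final files of the trained model (.pt). To keep the package lightweight, these are not included when genet is first installed.
Instead, the first time you load a model, or call a function that uses it, genet automatically downloads the model files into the directory where genet is installed.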

The example below predicts pegRNA efficiency when PE2max is used in the HEK293T cell line.
No model download message appears here because the DeepPrime-FT: HEK293T-PE2max model had already been installed before this tutorial was written.

In [2]:
pe2max_output = pegrna.predict(pe_system='PE2max', cell_type='HEK293T')

pe2max_output.head()

Unnamed: 0,ID,PE2max_score,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target
0,SampleName,0.904387,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGG,7,35,42,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
1,SampleName,2.375938,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGG,8,35,43,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
2,SampleName,2.61238,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGT,9,35,44,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
3,SampleName,3.641537,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGTG,10,35,45,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
4,SampleName,3.768321,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGTGT,11,35,46,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
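
A common next step is to rank the designed pegRNAs by predicted efficiency. On the DataFrame itself this is a sort on the score column, for example `pe2max_output.sort_values('PE2max_score', ascending=False)`. The dependency-free sketch below shows the same idea using only the five rows printed above, copied into plain tuples of `(PBS_len, RTT_len, PE2max_score)`; the full output contains more designs than these.

```python
# (PBS_len, RTT_len, PE2max_score) for the five rows shown by .head()
rows = [
    (7, 35, 0.904387),
    (8, 35, 2.375938),
    (9, 35, 2.61238),
    (10, 35, 3.641537),
    (11, 35, 3.768321),
]

# Highest predicted efficiency first
ranked = sorted(rows, key=lambda r: r[2], reverse=True)

best_pbs_len, best_rtt_len, best_score = ranked[0]
print(f"best of these rows: PBS_len={best_pbs_len}, RTT_len={best_rtt_len}, score={best_score}")
```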


To predict efficiencies for PE4max-e (PE2max + MMR knockdown + epegRNA system) instead of PE2max, adjust the options as follows.

For the other available options, see the [documentation](https://goosang-yu.github.io/genet/1_Predict/4_predict_pe/#predicting-efficiencies-of-existing-pegrnas).

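Because DeepPrime-FT: HEK293T-PE4max-e is being loaded for the first time here, the model files are downloaded before the function runs. Once downloaded, the model files do not need to be downloaded again in the same (virtual) environment. Download speed depends on your computer and network connection.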

In [3]:
pe4max_e_output = pegrna.predict(pe_system='PE4max-e', cell_type='HEK293T')

pe4max_e_output.head()

The model DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428 is not installed. Download checkpoint files.



Downloading: 0KB [00:00, ?KB/s]


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/__init__.py


Downloading: 1KB [00:00, 1319.79KB/s]


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/dp_mean.csv


Downloading: 1KB [00:00, 947.44KB/s]


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/dp_std.csv


Downloading: 337KB [00:00, 1345.73KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_0.pt


Downloading: 337KB [00:00, 1128.57KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_1.pt


Downloading: 337KB [00:00, 1305.80KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_2.pt


Downloading: 337KB [00:00, 1082.29KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_3.pt


Downloading: 337KB [00:00, 1216.90KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_4.pt


Downloading: 337KB [00:00, 849.48KB/s]                          


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_5.pt


Downloading: 337KB [00:00, 1284.55KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_6.pt


Downloading: 337KB [00:00, 1332.07KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_7.pt


Downloading: 337KB [00:00, 1259.86KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_8.pt


Downloading: 337KB [00:00, 1281.58KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_9.pt


Downloading: 337KB [00:00, 1289.06KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_10.pt


Downloading: 337KB [00:00, 1343.60KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_11.pt


Downloading: 337KB [00:00, 1453.29KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_12.pt


Downloading: 337KB [00:00, 1253.20KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_13.pt


Downloading: 337KB [00:00, 1153.94KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_14.pt


Downloading: 337KB [00:00, 1248.96KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_15.pt


Downloading: 337KB [00:00, 987.23KB/s]                          


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_16.pt


Downloading: 337KB [00:00, 845.92KB/s]                          


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_17.pt


Downloading: 337KB [00:00, 1457.00KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_18.pt


Downloading: 337KB [00:00, 1406.65KB/s]                         


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/final_model_19.pt


Downloading: 1KB [00:00, 1400.90KB/s]


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/mean.csv


Downloading: 1KB [00:00, 2040.03KB/s]


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/std.csv


Downloading: 1KB [00:00, 1161.21KB/s]


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/mean_231124.csv


Downloading: 1KB [00:00, 3429.52KB/s]


File downloaded successfully: /home/gsyu/projects/genet/genet/models/DeepPrime/DP_variant_293T_PE4max_epegRNA_Opti_220428/std_231124.csv


Unnamed: 0,ID,PE4max-e_score,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target
0,SampleName,2.095974,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGG,7,35,42,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
1,SampleName,4.202025,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGG,8,35,43,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
2,SampleName,5.702226,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGT,9,35,44,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
3,SampleName,5.970952,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGTG,10,35,45,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
4,SampleName,6.580553,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGTGT,11,35,46,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...
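
Because both calls score the same designed pegRNAs, the two outputs can be compared design by design. The sketch below pairs the five rows printed in this tutorial by `PBS_len` (they share the same spacer and RTT) and prints the difference between the two predicted scores. It only uses values copied from the outputs above; with the full DataFrames, a merge on the shared design columns would achieve the same pairing.

```python
# Predicted scores for the five rows shown by .head(), keyed by PBS_len
pe2max = {7: 0.904387, 8: 2.375938, 9: 2.61238, 10: 3.641537, 11: 3.768321}
pe4max_e = {7: 2.095974, 8: 4.202025, 9: 5.702226, 10: 5.970952, 11: 6.580553}

differences = {}
for pbs_len in sorted(pe2max):
    differences[pbs_len] = pe4max_e[pbs_len] - pe2max[pbs_len]
    print(
        f"PBS_len={pbs_len}: PE2max={pe2max[pbs_len]:.2f}, "
        f"PE4max-e={pe4max_e[pbs_len]:.2f}, difference={differences[pbs_len]:+.2f}"
    )
```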


### 🔥 DeepPrimeGuideRNA
---
The `DeepPrimeGuideRNA` class predicts the prime editing efficiency for a pegRNA and target that have already been chosen. Import it as follows.

In [4]:
from genet.predict import DeepPrimeGuideRNA

The target sequence must be a DNA sequence of exactly 74 nt, with the following positions fixed:
- Position 5-24: Protospacer
- Position 25-27: PAM
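
The fixed layout above can be checked before constructing `DeepPrimeGuideRNA`. The sketch below is not part of GenET: `split_target` is a hypothetical helper that slices a 74-nt target according to the stated positions (converted to 0-based Python slices) and rejects inputs with the wrong length or non-DNA characters.

```python
def split_target(target: str):
    """Split a 74-nt target into (upstream, protospacer, PAM, downstream)."""
    target = target.upper()
    if len(target) != 74:
        raise ValueError(f"Target must be exactly 74 nt, got {len(target)}.")
    if set(target) - set('ACGT'):
        raise ValueError("Target must contain only A, C, G and T.")
    upstream = target[:4]        # positions 1-4
    protospacer = target[4:24]   # positions 5-24
    pam = target[24:27]          # positions 25-27
    downstream = target[27:]     # positions 28-74
    return upstream, protospacer, pam, downstream


target = 'ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGAAACTGAGAAGAACTATAACCTGCAAATG'
upstream, protospacer, pam, downstream = split_target(target)
print(protospacer, pam)
```

For the tutorial's target, the protospacer slice equals the `Spacer` value reported in the feature table and the PAM slice is `TGG`.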

In addition, the exact edit length, position, and type must be provided. The example below shows how to use them.

As with `DeepPrime`, `.features` returns a DataFrame containing the feature information for the given pegRNA.

In [6]:
target    = 'ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGAAACTGAGAAGAACTATAACCTGCAAATG'
pbs       = 'GGCAAGGGTGT'
rtt       = 'CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAA'
edit_len  = 1
edit_pos  = 34
edit_type = 'sub'

pegrna = DeepPrimeGuideRNA('pegRNA_test', target=target, pbs=pbs, rtt=rtt, 
                           edit_len=edit_len, edit_pos=edit_pos, edit_type=edit_type)

pegrna.features

Unnamed: 0,ID,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target,...,deltaTm_Tm4-Tm2,GC_count_PBS,GC_count_RTT,GC_count_RT-PBS,GC_contents_PBS,GC_contents_RTT,GC_contents_RT-PBS,MFE_RT-PBS-polyT,MFE_Spacer,DeepSpCas9_score
0,pegRNA_test,AAGACAACACCCTTGCCTTG,CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAAGGCAAGGGTGT,11,35,46,34,1,1,ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGA...,...,-340.104846,7,16,23,63.636364,45.714286,50.0,-10.4,-0.6,45.967541
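
The inputs in this example can be related to each other and to the `RT-PBS` column. In the row above, `RT-PBS` equals the RTT followed by the PBS, the PBS is the reverse complement of the 11 nt of the target that end at the nick site (3 nt upstream of the PAM, so target positions 11-21), and `edit_pos` counts from the first base after the nick. These relations are read off this one example, not taken from the GenET source, and `reverse_complement` is a hypothetical helper.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans('ACGT', 'TGCA'))[::-1]


target = 'ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGAAACTGAGAAGAACTATAACCTGCAAATG'
pbs = 'GGCAAGGGTGT'
rtt = 'CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAA'

# 0-based index of the first base after the nick:
# 4 nt upstream + 17 nt of protospacer (nick is 3 nt before the PAM)
nick = 4 + 17

# The PBS pairs with the target bases that end at the nick
pbs_site = target[nick - len(pbs):nick]
print(reverse_complement(pbs_site) == pbs)

# The RTT encodes the edited sequence downstream of the nick
edited = reverse_complement(rtt)
unedited = target[nick:nick + len(rtt)]
edit_positions = [i + 1 for i, (a, b) in enumerate(zip(unedited, edited)) if a != b]
print(edit_positions)

# RT-PBS as listed in the feature table
rt_pbs = rtt + pbs
print(rt_pbs)
```

Here the only mismatch between the unedited target and the RTT-encoded sequence falls at position 34, which matches `edit_pos`.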


For `DeepPrimeGuideRNA`, the prediction is obtained with `.predict`. Note that, unlike `DeepPrime.predict`, `DeepPrimeGuideRNA.predict` returns a single value.

In [7]:
pe2max_score = pegrna.predict('PE2max')

pe2max_score

3.768320083618164

If you have many predefined pegRNAs, you can process them one by one in a for-loop, but this can be much slower because every computation and all data input/output are performed separately for each pegRNA.

The example below takes a large set (> 11K) of pre-designed pegRNAs and computes the pegRNA features for each of them.

In [8]:
import pandas as pd
from genet.predict import DeepPrimeGuideRNA
from tqdm import tqdm

df_pegRNAs = pd.read_csv('test_datasets/CML_VUS_DeepPrimeGuideRNA_input.csv')
df_pegRNAs.head()

Unnamed: 0,Sample_ID,Target,PBS,RTT,Edit_len,Edit_pos,Edit_type
0,1_ABL1_ex4_pos1T_A_rank1,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,AGACAGGCAAG,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACAG,1,30,sub
1,2_ABL1_ex4_pos1T_A_rank2,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,AGACAGGCAA,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACAG,1,30,sub
2,3_ABL1_ex4_pos1T_A_rank3,CTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTT...,GAGACAGGCAA,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACA,1,29,sub
3,4_ABL1_ex4_pos1T_C_rank1,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,AGACAGGCAA,AGAGCTGAGGGAAACAGGGAACAGCCTTCAGCCCACAG,1,30,sub
4,5_ABL1_ex4_pos1T_C_rank2,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,AGACAGGCA,AGAGCTGAGGGAAACAGGGAACAGCCTTCAGCCCACAG,1,30,sub
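
Since every row is passed straight to `DeepPrimeGuideRNA`, a malformed row is only discovered when the loop reaches it. A light pre-check can catch the obvious problems first. The sketch below is an assumption about what is worth checking, based on the requirements stated in this tutorial (all seven columns present, a 74-nt target, DNA letters only). `check_row` is a hypothetical helper, `good` reuses the single-pegRNA example from this tutorial, and `bad` is a deliberately broken copy of it.

```python
REQUIRED = ('Sample_ID', 'Target', 'PBS', 'RTT', 'Edit_len', 'Edit_pos', 'Edit_type')


def check_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row passed these checks."""
    problems = [f"missing column: {col}" for col in REQUIRED if col not in row]
    if problems:
        return problems
    if len(row['Target']) != 74:
        problems.append(f"Target is {len(row['Target'])} nt, expected 74")
    for col in ('Target', 'PBS', 'RTT'):
        if set(row[col].upper()) - set('ACGT'):
            problems.append(f"{col} contains non-ACGT characters")
    return problems


good = {
    'Sample_ID': 'pegRNA_test',
    'Target': 'ATAAAAGACAACACCCTTGCCTTGTGGAGTTTTCAAAGCTCCCAGAAACTGAGAAGAACTATAACCTGCAAATG',
    'PBS': 'GGCAAGGGTGT',
    'RTT': 'CGTCTCAGTTTCTGGGAGCTTTGAAAACTCCACAA',
    'Edit_len': 1,
    'Edit_pos': 34,
    'Edit_type': 'sub',
}
# Broken on purpose: target one base short, ambiguous base in the PBS
bad = dict(good, Target=good['Target'][:-1], PBS='GGCAAGGNTGT')

print(check_row(good))
print(check_row(bad))
```

With a DataFrame, `df_pegRNAs.to_dict('records')` yields one such dict per row.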


In [10]:
list_features = []

for i in tqdm(df_pegRNAs.index, total=len(df_pegRNAs.index), ncols=70, desc='Predicting'):

    info = df_pegRNAs.loc[i]

    pegrna = DeepPrimeGuideRNA(
        sID      =info['Sample_ID'],
        target   =info['Target'],
        pbs      =info['PBS'],
        rtt      =info['RTT'],
        edit_len =info['Edit_len'],
        edit_pos =info['Edit_pos'],
        edit_type=info['Edit_type'],
    )

    list_features.append(pegrna.features)

df_features = pd.concat(list_features, axis=0)

df_features

Unnamed: 0,ID,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target,...,deltaTm_Tm4-Tm2,GC_count_PBS,GC_count_RTT,GC_count_RT-PBS,GC_contents_PBS,GC_contents_RTT,GC_contents_RT-PBS,MFE_RT-PBS-polyT,MFE_Spacer,DeepSpCas9_score
0,1_ABL1_ex4_pos1T_A_rank1,TTTGAGCTTGCCTGTCTCTG,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCAAG,11,38,49,30,1,8,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,...,-506.568335,6,21,27,54.545455,55.263158,55.102041,-10.4,-2.0,28.749527
1,2_ABL1_ex4_pos1T_A_rank2,TTTGAGCTTGCCTGTCTCTG,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCAA,10,38,48,30,1,8,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,...,-506.568335,5,21,26,50.000000,55.263158,54.166667,-8.2,-2.0,28.749527
2,3_ABL1_ex4_pos1T_A_rank3,TTGAGCTTGCCTGTCTCTGT,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCAA,11,37,48,29,1,8,CTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTT...,...,-506.474568,6,20,26,54.545455,54.054054,54.166667,-8.2,-2.0,42.639225
3,4_ABL1_ex4_pos1T_C_rank1,TTTGAGCTTGCCTGTCTCTG,AGAGCTGAGGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCAA,10,38,48,30,1,8,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,...,-506.568335,5,22,27,50.000000,57.894737,56.250000,-11.7,-2.0,28.749527
4,5_ABL1_ex4_pos1T_C_rank2,TTTGAGCTTGCCTGTCTCTG,AGAGCTGAGGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCA,9,38,47,30,1,8,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,...,-506.568335,5,22,27,55.555556,57.894737,57.446809,-11.7,-2.0,28.749527
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11479,11480_ABL1_ex9_pos100A_G_rank2,CAGGAATCCAGTATCTCAGA,ATGGGTACCTTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-483.811117,5,9,14,50.000000,50.000000,50.000000,-5.0,-1.3,55.851360
11480,11481_ABL1_ex9_pos100A_G_rank3,CCACTGCAGGTACCCCGGGA,TCCAGGAATCCAGTATCTCAGACGGTAAGGTACCCATCCCGGGGTACC,9,39,48,11,1,28,CACCCCACTGCAGGTACCCCGGGATGGGTACTTTACCGTCTGAGAT...,...,-502.446051,7,20,27,77.777778,51.282051,56.250000,-14.6,-0.2,43.093304
11481,11482_ABL1_ex9_pos100A_T_rank1,CAGGAATCCAGTATCTCAGA,ATGGGTACATTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-483.811117,5,8,13,50.000000,44.444444,46.428571,-2.8,-1.3,55.851360
11482,11483_ABL1_ex9_pos100A_T_rank2,CAGGAATCCAGTATCTCAGA,GGGTACATTACCGTCTGAGATACTGG,10,16,26,10,1,6,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-320.805261,5,8,13,50.000000,50.000000,50.000000,-2.5,-1.3,55.851360


### 🔥 DeepPrimeOff
---
#### The `DeepPrimeOff` class is a pipeline that finds off-target candidates for given pegRNAs and predicts the off-target activity for each target.
Import it as follows.

In [1]:
import pandas as pd
from genet.predict import DeepPrimeOff

#### `DeepPrimeOff` requires the designed pegRNAs together with the feature data for each pegRNA.
This data must have the same format as the DataFrame produced by `DeepPrime` or `DeepPrimeGuideRNA`. Here we reuse `df_features` from the `DeepPrimeGuideRNA` example above.

In [2]:
df_features.head()

Unnamed: 0,ID,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target,...,deltaTm_Tm4-Tm2,GC_count_PBS,GC_count_RTT,GC_count_RT-PBS,GC_contents_PBS,GC_contents_RTT,GC_contents_RT-PBS,MFE_RT-PBS-polyT,MFE_Spacer,DeepSpCas9_score
0,1_ABL1_ex4_pos1T_A_rank1,TTTGAGCTTGCCTGTCTCTG,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCAAG,11,38,49,30,1,8,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,...,-506.568335,6,21,27,54.545455,55.263158,55.102041,-10.4,-2.0,28.749527
1,2_ABL1_ex4_pos1T_A_rank2,TTTGAGCTTGCCTGTCTCTG,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCAA,10,38,48,30,1,8,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,...,-506.568335,5,21,26,50.0,55.263158,54.166667,-8.2,-2.0,28.749527
2,3_ABL1_ex4_pos1T_A_rank3,TTGAGCTTGCCTGTCTCTGT,AGAGCTGATGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCAA,11,37,48,29,1,8,CTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTT...,...,-506.474568,6,20,26,54.545455,54.054054,54.166667,-8.2,-2.0,42.639225
3,4_ABL1_ex4_pos1T_C_rank1,TTTGAGCTTGCCTGTCTCTG,AGAGCTGAGGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCAA,10,38,48,30,1,8,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,...,-506.568335,5,22,27,50.0,57.894737,56.25,-11.7,-2.0,28.749527
4,5_ABL1_ex4_pos1T_C_rank2,TTTGAGCTTGCCTGTCTCTG,AGAGCTGAGGGAAACAGGGAACAGCCTTCAGCCCACAGAGACAGGCA,9,38,47,30,1,8,GCTCTTTGAGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGT...,...,-506.568335,5,22,27,55.555556,57.894737,57.446809,-11.7,-2.0,28.749527


#### DeepPrimeOff works through two main methods.
- setup: checks that the pegRNA features, Cas-OFFinder results, and reference genome FASTA file needed to run DeepPrimeOff are ready, and performs the required data preprocessing.
- predict: runs `DeepPrimeOff` on the prepared data and computes the predictions.

First, create a `DeepPrimeOff` object and run `setup` as shown below. `setup` completes the data preprocessing and returns the DataFrame that will be used as the input to DeepPrime-Off.

In [3]:
dp_off = DeepPrimeOff()

dp_off.setup(
    features=df_features,
    cas_offinder_result='CML_exon_OFFinder_output.txt',
    ref_genome='Homo sapiens', 
    download_fasta=True,
    custom_genome=None, 
)

                                                                                                 

Unnamed: 0,ID,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target,...,deltaTm_Tm4-Tm2,GC_count_PBS,GC_count_RTT,GC_count_RT-PBS,GC_contents_PBS,GC_contents_RTT,GC_contents_RT-PBS,MFE_RT-PBS-polyT,MFE_Spacer,DeepSpCas9_score
0,48_ABL1_ex4_pos6C_A_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTAGATCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,27,1,13,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,20,27,70.0,50.000000,54.000000,-10.7,-5.9,42.905888
1,66_ABL1_ex4_pos8C_A_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTATAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,20,27,70.0,50.000000,54.000000,-14.3,-5.9,42.905888
2,69_ABL1_ex4_pos8C_G_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTACAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,21,28,70.0,52.500000,56.000000,-14.3,-5.9,42.905888
3,72_ABL1_ex4_pos8C_T_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTAAAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,20,27,70.0,50.000000,54.000000,-14.3,-5.9,42.905888
4,96_ABL1_ex4_pos11C_G_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACCTAGAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,32,1,8,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,21,28,70.0,52.500000,56.000000,-15.5,-5.9,42.905888
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174555,11476_ABL1_ex9_pos100A_C_rank1,CAGGAATCCAGTATCTCAGA,ATGGGTACGTTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-483.811117,5,9,14,50.0,50.000000,50.000000,-3.3,-1.3,55.851360
174556,11479_ABL1_ex9_pos100A_G_rank1,CAGGAATCCAGTATCTCAGA,GGGTACCTTACCGTCTGAGATACTGG,10,16,26,10,1,6,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-320.805261,5,9,14,50.0,56.250000,53.846154,-3.6,-1.3,55.851360
174557,11480_ABL1_ex9_pos100A_G_rank2,CAGGAATCCAGTATCTCAGA,ATGGGTACCTTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-483.811117,5,9,14,50.0,50.000000,50.000000,-5.0,-1.3,55.851360
174558,11482_ABL1_ex9_pos100A_T_rank1,CAGGAATCCAGTATCTCAGA,ATGGGTACATTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-483.811117,5,8,13,50.0,44.444444,46.428571,-2.8,-1.3,55.851360


The DataFrame created by `setup` can also be accessed through `.features`.

In [4]:
dp_off.features

Unnamed: 0,ID,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target,...,deltaTm_Tm4-Tm2,GC_count_PBS,GC_count_RTT,GC_count_RT-PBS,GC_contents_PBS,GC_contents_RTT,GC_contents_RT-PBS,MFE_RT-PBS-polyT,MFE_Spacer,DeepSpCas9_score
0,48_ABL1_ex4_pos6C_A_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTAGATCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,27,1,13,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,20,27,70.0,50.000000,54.000000,-10.7,-5.9,42.905888
1,66_ABL1_ex4_pos8C_A_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTATAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,20,27,70.0,50.000000,54.000000,-14.3,-5.9,42.905888
2,69_ABL1_ex4_pos8C_G_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTACAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,21,28,70.0,52.500000,56.000000,-14.3,-5.9,42.905888
3,72_ABL1_ex4_pos8C_T_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTAAAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,20,27,70.0,50.000000,54.000000,-14.3,-5.9,42.905888
4,96_ABL1_ex4_pos11C_G_rank3,TGCCTGTCTCTGTGGGCTGA,GAGGAGACCTAGAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,32,1,8,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,...,-345.644952,7,21,28,70.0,52.500000,56.000000,-15.5,-5.9,42.905888
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174555,11476_ABL1_ex9_pos100A_C_rank1,CAGGAATCCAGTATCTCAGA,ATGGGTACGTTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-483.811117,5,9,14,50.0,50.000000,50.000000,-3.3,-1.3,55.851360
174556,11479_ABL1_ex9_pos100A_G_rank1,CAGGAATCCAGTATCTCAGA,GGGTACCTTACCGTCTGAGATACTGG,10,16,26,10,1,6,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-320.805261,5,9,14,50.0,56.250000,53.846154,-3.6,-1.3,55.851360
174557,11480_ABL1_ex9_pos100A_G_rank2,CAGGAATCCAGTATCTCAGA,ATGGGTACCTTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-483.811117,5,9,14,50.0,50.000000,50.000000,-5.0,-1.3,55.851360
174558,11482_ABL1_ex9_pos100A_T_rank1,CAGGAATCCAGTATCTCAGA,ATGGGTACATTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,...,-483.811117,5,8,13,50.0,44.444444,46.428571,-2.8,-1.3,55.851360


#### predict can be run once setup has completed.
Internally, the `DeepPrimeOff` object uses the features DataFrame created by setup as the input to the DeepPrime-Off model. If setup did not complete properly, the model cannot find the input it needs and raises an error.

In [5]:
dp_off_results = dp_off.predict()
dp_off_results

                                                                                  

Unnamed: 0,ID,DeepPrime-Off_score,Spacer,RT-PBS,PBS_len,RTT_len,RT-PBS_len,Edit_pos,Edit_len,RHA_len,Target,Off-target,Off-context,Location,Position,Strand,MM_num
0,48_ABL1_ex4_pos6C_A_rank3,0.0,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTAGATCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,27,1,13,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,TtCCTGTCTgTGTGGGCTGATGG,TTATTTCCTGTCTGTGTGGGCTGATGGTCCTTCAATCATTGAAGTC...,1 dna:chromosome chromosome:GRCh38:1:1:2489564...,166234971,+,2
1,66_ABL1_ex4_pos8C_A_rank3,0.0,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTATAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,TtCCTGTCTgTGTGGGCTGATGG,TTATTTCCTGTCTGTGTGGGCTGATGGTCCTTCAATCATTGAAGTC...,1 dna:chromosome chromosome:GRCh38:1:1:2489564...,166234971,+,2
2,69_ABL1_ex4_pos8C_G_rank3,0.0,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTACAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,TtCCTGTCTgTGTGGGCTGATGG,TTATTTCCTGTCTGTGTGGGCTGATGGTCCTTCAATCATTGAAGTC...,1 dna:chromosome chromosome:GRCh38:1:1:2489564...,166234971,+,2
3,72_ABL1_ex4_pos8C_T_rank3,0.0,TGCCTGTCTCTGTGGGCTGA,GAGGAGACGTAAAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,29,1,11,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,TtCCTGTCTgTGTGGGCTGATGG,TTATTTCCTGTCTGTGTGGGCTGATGGTCCTTCAATCATTGAAGTC...,1 dna:chromosome chromosome:GRCh38:1:1:2489564...,166234971,+,2
4,96_ABL1_ex4_pos11C_G_rank3,0.0,TGCCTGTCTCTGTGGGCTGA,GAGGAGACCTAGAGCTGAAGGAAACAGGGAACAGCCTTCAGCCCAC...,10,40,50,32,1,8,AGCTTGCCTGTCTCTGTGGGCTGAAGGCTGTTCCCTGTTTCCTTCA...,TtCCTGTCTgTGTGGGCTGATGG,TTATTTCCTGTCTGTGTGGGCTGATGGTCCTTCAATCATTGAAGTC...,1 dna:chromosome chromosome:GRCh38:1:1:2489564...,166234971,+,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174555,11476_ABL1_ex9_pos100A_C_rank1,0.0,CAGGAATCCAGTATCTCAGA,ATGGGTACGTTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,gAGGAgcCCAGTATCTCAGATGG,AGTGAGAGGAGCCCAGTATCTCAGATGGAAATGCAGAAATCACCTG...,Y dna:chromosome chromosome:GRCh38:Y:2781480:5...,26339822,-,3
174556,11479_ABL1_ex9_pos100A_G_rank1,0.0,CAGGAATCCAGTATCTCAGA,GGGTACCTTACCGTCTGAGATACTGG,10,16,26,10,1,6,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,gAGGAgcCCAGTATCTCAGATGG,AGTGAGAGGAGCCCAGTATCTCAGATGGAAATGCAGAAATCACCTG...,Y dna:chromosome chromosome:GRCh38:Y:2781480:5...,26339822,-,3
174557,11480_ABL1_ex9_pos100A_G_rank2,0.0,CAGGAATCCAGTATCTCAGA,ATGGGTACCTTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,gAGGAgcCCAGTATCTCAGATGG,AGTGAGAGGAGCCCAGTATCTCAGATGGAAATGCAGAAATCACCTG...,Y dna:chromosome chromosome:GRCh38:Y:2781480:5...,26339822,-,3
174558,11482_ABL1_ex9_pos100A_T_rank1,0.0,CAGGAATCCAGTATCTCAGA,ATGGGTACATTACCGTCTGAGATACTGG,10,18,28,10,1,8,GTTCCAGGAATCCAGTATCTCAGACGGTAAAGTACCCATCCCGGGG...,gAGGAgcCCAGTATCTCAGATGG,AGTGAGAGGAGCCCAGTATCTCAGATGGAAATGCAGAAATCACCTG...,Y dna:chromosome chromosome:GRCh38:Y:2781480:5...,26339822,-,3
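
In the `Off-target` column, lowercase letters mark the positions where the off-target site differs from the spacer, and their number agrees with `MM_num` in the rows shown above. The sketch below recounts the mismatches for two of those rows in both ways. `count_mismatches` and `count_lowercase` are hypothetical helpers, and treating the first 20 nt of the 23-nt site as the protospacer (followed by a 3-nt PAM) is an assumption based on the layout seen in this output.

```python
def count_mismatches(spacer: str, off_target: str) -> int:
    """Count protospacer positions where the off-target site differs from the spacer."""
    site = off_target[:len(spacer)].upper()
    return sum(1 for a, b in zip(spacer.upper(), site) if a != b)


def count_lowercase(off_target: str) -> int:
    """Count the lowercase (mismatch-marked) letters in the site."""
    return sum(1 for ch in off_target if ch.islower())


# (Spacer, Off-target, MM_num) copied from the first and last rows above
examples = [
    ('TGCCTGTCTCTGTGGGCTGA', 'TtCCTGTCTgTGTGGGCTGATGG', 2),
    ('CAGGAATCCAGTATCTCAGA', 'gAGGAgcCCAGTATCTCAGATGG', 3),
]

for spacer, off_target, mm_num in examples:
    print(count_mismatches(spacer, off_target), count_lowercase(off_target), mm_num)
```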
