# **Supplementary Code 4**
This notebook was used to analyze NGS reads that contain the intended prime edit together with a synonymous mutation marker. For more detail, see the Methods and Supplementary Information.

Lead contact: Hyoungbum Henry Kim (hkim1@gmail.com)

Technical contact: Goosang Yu (gsyu93@gmail.com), Yusang Jung (ys.jung@yuhs.ac)

## Directory tree

📦Working directory  
 ┣ 📂data  
 ┃ ┣ 📂NGS_FASTQ_files  
 ┃ ┣ 📂NGS_frequency_table  
 ┃ ┃ ┣ 📜C4Bosutinib791.txt  
 ┃ ┃ ┣ 📜C4Control797.txt  
 ┃ ┃ ┗ 📜...  
 ┃ ┣ 📂read_counts  
 ┃  
 ┣ 📂src  
 ┃ ┣ 📜Alignment.py  
 ┃ ┣ 📜VarCalling.py  
 ┃  
 ┣ 📂variants_info  
 ┃ ┣ 📜ex4_info.csv  
 ┃ ┣ 📜ex4-1_output_template.csv  
 ┃ ┣ 📜ex4-2_output_template.csv  
 ┃ ┣ 📜ex5_info.csv  
 ┃ ┣ 📜ex6_info.csv  
 ┃ ┣ 📜ex7_info.csv  
 ┃ ┣ 📜ex8_info.csv  
 ┃ ┣ 📜ex9_info.csv  
 ┃ ┣ 📜invivo_ex4_info.csv  
 ┃ ┗ 📜invivo_ex9_info.csv  
 ┃  
 ┗ 📜SuppleCode4.ipynb (this file)  
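
Before running the notebook, it can help to confirm that the working directory matches the tree above. The following is a minimal sketch, not part of the original pipeline; `EXPECTED_PATHS` and `find_missing_paths` are hypothetical names, and the list covers only a subset of the files shown in the tree.

```python
from pathlib import Path

# Paths taken from the directory tree above (a subset, chosen for illustration).
EXPECTED_PATHS = [
    "data/NGS_FASTQ_files",
    "data/NGS_frequency_table",
    "data/read_counts",
    "src/Alignment.py",
    "src/VarCalling.py",
    "variants_info/invivo_ex4_info.csv",
    "variants_info/invivo_ex9_info.csv",
]


def find_missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


# An empty list means every listed path was found.
print(find_missing_paths("."))
```

If `data/read_counts` is reported as missing, it probably needs to be created before Step 3, since the notebook writes `Stats_*.csv` files into it.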

# Requirements
- CRISPResso2 (>= 2.x.x)
- pandas
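
A quick way to check both requirements from Python is sketched below. It assumes that CRISPResso2 exposes a command-line executable named `CRISPResso` on the `PATH`; `check_requirements` is a hypothetical helper and does not verify version numbers.

```python
import importlib.util
import shutil


def check_requirements():
    """Report whether pandas is importable and a CRISPResso executable is on PATH."""
    return {
        "pandas": importlib.util.find_spec("pandas") is not None,
        # Assumption: the CRISPResso2 installation provides a "CRISPResso" command.
        "CRISPResso": shutil.which("CRISPResso") is not None,
    }


print(check_requirements())
```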

# Step 0: Import packages

In [4]:
import sys, os
import subprocess
import pandas as pd
from glob import glob

from src.Alignment import ABL1VUS
from src.VarCalling import *

## In vivo xenograft model screening analysis
Figure X. XX

### Step 1: Download NGS FASTQ files

### Step 2: Align NGS reads to reference variant sequences

In [6]:
list_samples = [
    'C9Bosutinib789',

    'C9Bosutinib790',

    'C4Bosutinib791',
    'C9Bosutinib791',

    'C4Ponatinib793',

    'C4Ponatinib794',
    'C9Ponatinib794',

    'C4Ponatinib795',
    'C9Ponatinib795',

    'C4Ponatinib796',
    'C9Ponatinib796',

    'C4Control797',
    'C9Control797',

    'C4Control798',
    'C9Control798',

    'C9Control800',
    'C4Control800',

    'K562-PE4k_background_ex4-2',
    'K562-PE4k_background_ex9',
]
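
The sample IDs appear to encode the exon in their second character (`C4...` or `C9...`), followed by a condition name and a numeric suffix, while the background samples carry the exon after `_background_ex`. The sketch below parses both styles under that assumption; the field names (`exon`, `treatment`, `suffix`) and the function `parse_sample_id` are hypothetical and only reflect how the IDs look in the list above.

```python
import re

# Assumed pattern: "C" + one exon digit + condition name + numeric suffix.
_SAMPLE_RE = re.compile(r"^C(?P<exon>\d)(?P<treatment>[A-Za-z]+)(?P<suffix>\d+)$")
# Assumed pattern for background samples: "..._background_ex<exon>[-<part>]".
_BACKGROUND_RE = re.compile(r"_background_ex(?P<exon>\d+)(?:-\d+)?$")


def parse_sample_id(sample_id):
    """Split a sample ID into exon number, treatment label and numeric suffix."""
    match = _SAMPLE_RE.match(sample_id)
    if match:
        return {
            "exon": int(match["exon"]),
            "treatment": match["treatment"],
            "suffix": match["suffix"],
        }
    match = _BACKGROUND_RE.search(sample_id)
    if match:
        return {"exon": int(match["exon"]), "treatment": "background", "suffix": None}
    raise ValueError(f"Unrecognized sample ID: {sample_id!r}")


print(parse_sample_id("C4Bosutinib791"))
print(parse_sample_id("K562-PE4k_background_ex4-2"))
```

Parsing the exon explicitly avoids relying on character position alone, which would return `5` for the `K562-...` IDs.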

In [None]:
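data_dir = 'data/NGS_FASTQ_files'

for sample_id in list_samples:

    # Sort the matches so the order is reproducible
    # (assumes the R1 file name sorts before the R2 file name)
    files = sorted(glob(f'{data_dir}/{sample_id}*.fq.gz'))
    r1, r2 = files[0], files[1]

    # The second character of IDs such as 'C4Bosutinib791' is the exon number.
    # Note: this does not hold for the 'K562-PE4k_background_*' IDs.
    exon_num = sample_id[1]

    abl_e8 = ABL1VUS(sample_id, r1, r2, exon=f'invivo_exon{exon_num}').run(out_dir='data')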
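
The alignment loop expects exactly two FASTQ files per sample. A small helper can make that expectation explicit and fail early when a file is missing. This is a sketch only: `find_read_pair` is a hypothetical function, and it assumes that the R1 file name sorts alphabetically before the R2 file name (for example `_1.fq.gz` before `_2.fq.gz`).

```python
from pathlib import Path


def find_read_pair(data_dir, sample_id, suffix="*.fq.gz"):
    """Return (r1, r2) paths for one sample, ordered by file name."""
    files = sorted(Path(data_dir).glob(f"{sample_id}{suffix}"))
    if len(files) != 2:
        raise FileNotFoundError(
            f"Expected 2 FASTQ files for {sample_id!r} in {data_dir!r}, found {len(files)}"
        )
    # Assumption: after sorting, the first file is R1 and the second is R2.
    return str(files[0]), str(files[1])
```

With such a helper, the loop body could call `r1, r2 = find_read_pair(data_dir, sample_id)` instead of indexing the glob result directly.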



### Step 3: Variant read counts

In [7]:
sample_id = 'K562-PE4k_background_ex4-2'
exon_num = 4

df_count = make_count_file(freq_table=f'data/NGS_frequency_table/{sample_id}.txt',
                            var_ref=f'variants_info/invivo_ex{exon_num}_info.csv')

df_count.to_csv(f'data/read_counts/Stats_{sample_id}.csv', index=False)


[Info] Start - K562-PE4k_background_ex4-2
[Info] Length of variants: 1581


[Info] Read counting: K562-PE4k_background_ex4-2: 100%|██████████| 36367/36367 [00:03<00:00, 10515.97it/s]
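
The log above suggests that `make_count_file` walks through the rows of the frequency table (36,367 here) and matches them against the reference variants (1,581 here). Its implementation lives in `src/VarCalling.py` and is not shown in this notebook. The sketch below illustrates one plausible approach, exact-match counting, under the assumption that the frequency table can be read as (sequence, read count) pairs; `count_variant_reads` and its input format are hypothetical.

```python
def count_variant_reads(freq_rows, variant_refs):
    """Sum read counts for every sequence that exactly matches a reference variant.

    freq_rows    : iterable of (sequence, n_reads) pairs (assumed input format)
    variant_refs : dict mapping a variant label to its reference sequence
    """
    # Reverse lookup so that each observed sequence is matched in constant time.
    seq_to_label = {seq: label for label, seq in variant_refs.items()}
    counts = {label: 0 for label in variant_refs}

    for seq, n_reads in freq_rows:
        label = seq_to_label.get(seq)
        if label is not None:
            counts[label] += n_reads
    return counts


refs = {"var_A": "ACGT", "var_B": "ACGA"}
rows = [("ACGT", 5), ("TTTT", 2), ("ACGT", 1)]
print(count_variant_reads(rows, refs))
```

Sequences without a matching reference are ignored in this sketch, and variants that are never observed keep a count of zero.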


In [8]:
sample_id = 'K562-PE4k_background_ex9'
exon_num = 9

df_count = make_count_file(freq_table=f'data/NGS_frequency_table/{sample_id}.txt',
                            var_ref=f'variants_info/invivo_ex{exon_num}_info.csv')

df_count.to_csv(f'data/read_counts/Stats_{sample_id}.csv', index=False)


[Info] Start - K562-PE4k_background_ex9
[Info] Length of variants: 799


[Info] Read counting: K562-PE4k_background_ex9: 100%|██████████| 8383/8383 [00:00<00:00, 10080.98it/s]


In [5]:
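for sample_id in list_samples:

    # Background samples were already counted in the two cells above
    if sample_id.startswith('K562'):
        continue

    # The second character of IDs such as 'C4Bosutinib791' is the exon number
    exon_num = sample_id[1]

    df_count = make_count_file(freq_table=f'data/NGS_frequency_table/{sample_id}.txt',
                               var_ref=f'variants_info/invivo_ex{exon_num}_info.csv')

    df_count.to_csv(f'data/read_counts/Stats_{sample_id}.csv', index=False)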



[Info] Start - C9Bosutinib789
[Info] Length of variants: 799


[Info] Read counting: C9Bosutinib789: 100%|██████████| 52886/52886 [00:05<00:00, 10495.71it/s]



[Info] Start - C9Bosutinib790
[Info] Length of variants: 799


[Info] Read counting: C9Bosutinib790: 100%|██████████| 42050/42050 [00:04<00:00, 10342.63it/s]



[Info] Start - C4Bosutinib791
[Info] Length of variants: 1581


[Info] Read counting: C4Bosutinib791: 100%|██████████| 284021/284021 [00:27<00:00, 10354.46it/s]



[Info] Start - C9Bosutinib791
[Info] Length of variants: 799


[Info] Read counting: C9Bosutinib791: 100%|██████████| 55855/55855 [00:05<00:00, 9366.39it/s] 



[Info] Start - C4Ponatinib793
[Info] Length of variants: 1581


[Info] Read counting: C4Ponatinib793: 100%|██████████| 327517/327517 [00:31<00:00, 10282.92it/s]



[Info] Start - C4Ponatinib794
[Info] Length of variants: 1581


[Info] Read counting: C4Ponatinib794: 100%|██████████| 294338/294338 [00:28<00:00, 10295.50it/s]



[Info] Start - C9Ponatinib794
[Info] Length of variants: 799


[Info] Read counting: C9Ponatinib794: 100%|██████████| 77998/77998 [00:07<00:00, 10404.56it/s]



[Info] Start - C4Ponatinib795
[Info] Length of variants: 1581


[Info] Read counting: C4Ponatinib795: 100%|██████████| 324937/324937 [00:31<00:00, 10341.72it/s]



[Info] Start - C9Ponatinib795
[Info] Length of variants: 799


[Info] Read counting: C9Ponatinib795: 100%|██████████| 76429/76429 [00:07<00:00, 10410.99it/s]



[Info] Start - C4Ponatinib796
[Info] Length of variants: 1581


[Info] Read counting: C4Ponatinib796: 100%|██████████| 374900/374900 [00:36<00:00, 10310.30it/s]



[Info] Start - C9Ponatinib796
[Info] Length of variants: 799


[Info] Read counting: C9Ponatinib796: 100%|██████████| 77316/77316 [00:07<00:00, 10387.44it/s]



[Info] Start - C4Control797
[Info] Length of variants: 1581


[Info] Read counting: C4Control797: 100%|██████████| 306136/306136 [00:29<00:00, 10375.86it/s]



[Info] Start - C9Control797
[Info] Length of variants: 799


[Info] Read counting: C9Control797: 100%|██████████| 70761/70761 [00:06<00:00, 10396.05it/s]



[Info] Start - C4Control798
[Info] Length of variants: 1581


[Info] Read counting: C4Control798: 100%|██████████| 300013/300013 [00:29<00:00, 10337.80it/s]



[Info] Start - C9Control798
[Info] Length of variants: 799


[Info] Read counting: C9Control798: 100%|██████████| 76289/76289 [00:07<00:00, 10521.16it/s]



[Info] Start - C9Control800
[Info] Length of variants: 799


[Info] Read counting: C9Control800: 100%|██████████| 60038/60038 [00:05<00:00, 10393.67it/s]



[Info] Start - C4Control800
[Info] Length of variants: 1581


[Info] Read counting: C4Control800: 100%|██████████| 307297/307297 [00:29<00:00, 10348.52it/s]


### Step 4: Odds ratio and p-value calculation

In [1]:
from src.VarCalling import read_statistics

In [2]:
var_sample = 'data/read_counts/Stats_C4Control800.csv'
background = 'data/read_counts/Stats_K562-PE4k_background_ex4-2.csv'
df_orpv = read_statistics(var_sample, background)

df_orpv

Analysis: Stats_C4Control800


| | RefSeq | Label | RefRead | AA_var | SNV_var | count | Edited_WT_count | RPM | UE_SynPE_count | UE_WT_count | OR | pvalue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CATCACCACGCTCCATTATCAGGCCCCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216Q | ABL1_ex4_pos103C_A | 3 | 1018497 | 2.058072 | 0 | 219921 | 0.863711 | 1.00000 |
| 1 | CATCACCACGCTCCATTATCAAGCTCCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216Q | ABL1_ex4_pos103C_A | 1 | 1018497 | 0.686024 | 1 | 219921 | 0.215928 | 0.32363 |
| 2 | CATCACCACGCTCCATTATCGAGCACCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216R | ABL1_ex4_pos103C_G | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 3 | CATCACCACGCTCCATTATCGAGCTCCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216R | ABL1_ex4_pos103C_G | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 4 | CATCACCACGCTCCATTATCTAGCTCCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216L | ABL1_ex4_pos103C_T | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1030 | CATCACCACGCTCCATTATCCAGCCCCAAAGCGCAACAAGCCCACT... | SynPE | R2 | Intron | ABL1_ex4_pos283G_A | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 1031 | CATCACCACGCTCCATTATCCAGCCCCAAAGCGCAACAAGCCCACT... | SynPE | R2 | Intron | ABL1_ex4_pos283G_C | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 1032 | CATCACCACGCTCCATTATCCAGCCCCAAAGCGCAACAAGCCCACT... | SynPE | R2 | Intron | ABL1_ex4_pos283G_C | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 1033 | CATCACCACGCTCCATTATCCAGCCCCAAAGCGCAACAAGCCCACT... | SynPE | R2 | Intron | ABL1_ex4_pos283G_T | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
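
The `OR` and `pvalue` columns in the rows shown above are numerically consistent with an odds ratio computed after adding a pseudocount of 1 to each cell, and with a two-sided Fisher's exact test on the raw counts. This is an inference from the displayed values, not a description of the actual `read_statistics` code in `src/VarCalling.py`. The sketch below reproduces those numbers using only the standard library. The function names are hypothetical, and the column interpretation is an assumption: `count` and `Edited_WT_count` are taken as variant and wild-type reads in the sample, `UE_SynPE_count` and `UE_WT_count` as the same quantities in the unedited background.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def odds_ratio_pseudocount(var_count, wt_count, bg_var_count, bg_wt_count):
    """Odds ratio with +1 added to every cell (assumed from the displayed values)."""
    return ((var_count + 1) / (wt_count + 1)) / ((bg_var_count + 1) / (bg_wt_count + 1))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = _log_comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of observing k in the top-left cell.
        return exp(_log_comb(row1, k) + _log_comb(row2, col1 - k) - log_denom)

    p_obs = prob(a)
    k_min, k_max = max(0, col1 - row2), min(col1, row1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    total = 0.0
    for k in range(k_min, k_max + 1):
        p_k = prob(k)
        if p_k <= p_obs * (1 + 1e-7):
            total += p_k
    return min(1.0, total)


# Row 1 of the table above: count=1, Edited_WT_count=1018497,
# UE_SynPE_count=1, UE_WT_count=219921
print(round(odds_ratio_pseudocount(1, 1018497, 1, 219921), 6))
print(round(fisher_exact_two_sided(1, 1018497, 1, 219921), 5))
```

Where SciPy is available, `scipy.stats.fisher_exact` provides the same two-sided test; the pure-Python version is used here only to keep the example free of extra dependencies.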
