# **Supplementary Code 4**
This notebook was used for analysis of NGS reads containing intended prime-editing and synonymous mutation marker. For more detail, please read Methods and Supplementary Information. 

Lead contact: Hyoungbum Henry Kim (hkim1@gmail.com)

Technical contact: Goosang Yu (gsyu93@gmail.com), Yusang Jung (ys.jung@yuhs.ac)

## Directory tree

📦Working directory  
 ┣ 📂data  
 ┃ ┣ 📂NGS_FASTQ_files  
 ┃ ┣ 📂NGS_frequency_table  
 ┃ ┃ ┣ 📜C4Bosutinib791.txt  
 ┃ ┃ ┣ 📜C4Control797.txt  
 ┃ ┃ ┗ 📜...  
 ┃ ┣ 📂read_counts  
 ┃ ┣ 📂statistics  
 ┃  
 ┣ 📂src  
 ┃ ┣ 📜Alignment.py  
 ┃ ┣ 📜VarCalling.py  
 ┃  
 ┣ 📂variants_info  
 ┃ ┣ 📜ex4_info.csv  
 ┃ ┣ 📜ex5_info.csv  
 ┃ ┣ 📜ex6_info.csv  
 ┃ ┣ 📜ex7_info.csv  
 ┃ ┣ 📜ex8_info.csv  
 ┃ ┣ 📜ex9_info.csv  
 ┃ ┣ 📜invivo_ex4_info.csv  
 ┃ ┗ 📜invivo_ex9_info.csv  
 ┃  
 ┗ 📜SuppleCode4.ipynb (this file)  

# Requirements
- CRISPResso2 (>= 2.x.x)
- pandas

## Variants calling and make read count file
After running CRISPResso, generate the read count file. This is the process of creating the foundational file for all analyses.

In [1]:
import os
import pandas as pd
from tqdm import tqdm
from glob import glob

from src.Alignment import ABL1VUS
from src.VarCalling import make_count_file, read_statistics, combine_data, VariantFilter, VariantScore, Normalizer

In [3]:
# Make count files from frequency table

freq_tables = glob('data/frequency_table/*.txt')

for f in freq_tables:

    n_sample = os.path.basename(f).replace('.txt', '')
    exon_num = n_sample.split('Exon')[1][0]
    ref_info = f'variants_info/ex{exon_num}_info.csv'

    
    df_cnt = make_count_file(f, ref_info)
    df_cnt.to_csv(f'data/read_counts/Count_{n_sample}.csv', index=False)

[Info] Read counting: K562PE2_unedit_Exon5: 100%|██████████| 36438/36438 [00:03<00:00, 10306.87it/s]
[Info] Read counting: K562PE4K_HTS_Exon4_Rep1_Asciminib: 100%|██████████| 1126274/1126274 [01:52<00:00, 9996.74it/s] 
[Info] Read counting: K562PE4K_HTS_Exon4_Rep1_Bosutinib: 100%|██████████| 1310545/1310545 [02:08<00:00, 10160.25it/s]
[Info] Read counting: K562PE4K_HTS_Exon4_Rep1_Dasatinib: 100%|██████████| 1096266/1096266 [01:48<00:00, 10090.48it/s]
[Info] Read counting: K562PE4K_HTS_Exon4_Rep1_DMSO: 100%|██████████| 1231441/1231441 [02:02<00:00, 10064.29it/s]
[Info] Read counting: K562PE4K_HTS_Exon4_Rep1_Imatinib: 100%|██████████| 1290582/1290582 [02:08<00:00, 10016.11it/s]
[Info] Read counting: K562PE4K_HTS_Exon4_Rep1_Nilotinib: 100%|██████████| 1059960/1059960 [01:47<00:00, 9905.26it/s] 
[Info] Read counting: K562PE4K_HTS_Exon4_Rep1_Ponatinib: 100%|██████████| 964463/964463 [01:36<00:00, 9955.14it/s] 
[Info] Read counting: K562PE4K_HTS_Exon4_Rep2_Asciminib: 100%|██████████| 1532951

## Significance analysis
Compare the prime-edited sample with the unedited sample to calculate the odds ratio and Fisher's exact test p-value for each variant.

In [2]:
# Calculate statistics with read count for each variants

dict_samples = {
    'KCLPE4K_DoseControl_Exon5_Rep1': {
        'test'  : 'Count_KCLPE4K_DoseControl_Exon5_Rep1_DMSO',
        'unedit': 'Count_KCLPE4K_unedit_Exon5',
    },
    'KCLPE4K_DoseControl_Exon5_Rep2': {
        'test'  : 'Count_KCLPE4K_DoseControl_Exon5_Rep2_DMSO',
        'unedit': 'Count_KCLPE4K_unedit_Exon5',
    },
}

for sample in dict_samples:
    
    test_file       = 'data/read_counts/' + dict_samples[sample]['test'] + '.csv'
    background_file = 'data/read_counts/' + dict_samples[sample]['unedit'] + '.csv'

    df_stats = read_statistics(test_file, background_file)

    df_stats.to_csv(f'data/statistics/Stat_{sample}_DMSO.csv', index=False)

Analysis: Count_KCLPE4K_DoseControl_Exon5_Rep1_DMSO
Analysis: Count_KCLPE4K_DoseControl_Exon5_Rep2_DMSO


## DMSO vs TKI response analysis
Analysis for resistance to drugs

- Test: Making variants using Prime editing for 20 days, followed by 10 days of TKI treatment.
- Control: Making variants using Prime editing for 20 days, followed by 10 days of DMSO treatment.

In [2]:
# KCL screening
# Filtering variants and calculate normalized log 2 fold changes

list_sample = [
    ['KCLPE4K_Dose0.1_Exon5_Rep1_Dasatinib',  'KCLPE4K_Dose0.1_Exon5_Rep2_Dasatinib',],
    ['KCLPE4K_Dose0.2_Exon5_Rep1_Ponatinib',  'KCLPE4K_Dose0.2_Exon5_Rep2_Ponatinib',],
    ['KCLPE4K_Dose0.4_Exon5_Rep1_Dasatinib',  'KCLPE4K_Dose0.4_Exon5_Rep2_Dasatinib',],
    ['KCLPE4K_Dose0.7_Exon5_Rep1_Ponatinib',  'KCLPE4K_Dose0.7_Exon5_Rep2_Ponatinib',],
    ['KCLPE4K_Dose100_Exon5_Rep1_Imatinib',   'KCLPE4K_Dose100_Exon5_Rep2_Imatinib',],
    ['KCLPE4K_Dose10_Exon5_Rep1_Asciminib',   'KCLPE4K_Dose10_Exon5_Rep2_Asciminib',],
    ['KCLPE4K_Dose2.5_Exon5_Rep1_Bosutinib',  'KCLPE4K_Dose2.5_Exon5_Rep2_Bosutinib',],
    ['KCLPE4K_Dose2.5_Exon5_Rep1_Nilotinib',  'KCLPE4K_Dose2.5_Exon5_Rep2_Nilotinib',],
    ['KCLPE4K_Dose50_Exon5_Rep1_Imatinib',    'KCLPE4K_Dose50_Exon5_Rep2_Imatinib',],
    ['KCLPE4K_Dose5_Exon5_Rep1_Asciminib',    'KCLPE4K_Dose5_Exon5_Rep2_Asciminib',],
    ['KCLPE4K_Dose7.5_Exon5_Rep1_Bosutinib',  'KCLPE4K_Dose7.5_Exon5_Rep2_Bosutinib',],
    ['KCLPE4K_Dose8_Exon5_Rep1_Nilotinib',    'KCLPE4K_Dose8_Exon5_Rep2_Nilotinib',],
    ]


lws_frac = 0.15

for samples in list_sample:

    r1 = samples[0]
    r2 = samples[1]

    test_r1 = f'data/read_counts/Count_{r1}.csv'
    test_r2 = f'data/read_counts/Count_{r2}.csv'

    control_r1 = f'data/statistics/Stat_KCLPE4K_DoseControl_Exon5_Rep1_DMSO.csv'
    control_r2 = f'data/statistics/Stat_KCLPE4K_DoseControl_Exon5_Rep2_DMSO.csv'
    
    df_rep1, df_rep2 = VariantFilter(test_r1, test_r2, control_r1, control_r2).filter(OR_cutoff=2, p_cutoff=0.05, rpm_cutoff=10)

    normal = Normalizer()

    # LOWESS regression normalization
    df_nor1 = normal.lowess(df_rep1, frac=lws_frac)
    df_nor2 = normal.lowess(df_rep2, frac=lws_frac)

    df_nor1.to_csv(f'data/statistics/Filtered_{r1}.csv', index=False)
    df_nor2.to_csv(f'data/statistics/Filtered_{r2}.csv', index=False)

    rep_1 = f'data/statistics/Filtered_{r1}.csv'
    rep_2 = f'data/statistics/Filtered_{r2}.csv'

    # Score calculation 함수 불러오기
    score = VariantScore()

    adjus_LFC = score.calculate(rep_1, rep_2, var_type='SNV')
    res_score = score.calculate(rep_1, rep_2, var_type='AA')

    n_sample = r1.replace('Rep1_', '')

    adjus_LFC.to_csv(f'data/adjusted_LFC/AdjustedLFC_{n_sample}.csv')
    res_score.to_csv(f'data/resistance_score/ResistanceScore_{n_sample}.csv')