# **Supplementary Code 4**
This notebook was used to analyze NGS reads that contain both the intended prime edit and the synonymous mutation marker. For more detail, see the Methods and Supplementary Information.

Lead contact: Hyoungbum Henry Kim (hkim1@gmail.com)

Technical contact: Goosang Yu (gsyu93@gmail.com), Yusang Jung (ys.jung@yuhs.ac)

## Directory tree

📦Working directory  
 ┣ 📂data  
 ┃ ┣ 📂NGS_FASTQ_files  
 ┃ ┣ 📂NGS_frequency_table  
 ┃ ┃ ┣ 📜C4Bosutinib791.txt  
 ┃ ┃ ┣ 📜C4Control797.txt  
 ┃ ┃ ┗ 📜...  
 ┃ ┣ 📂read_counts  
 ┃  
 ┣ 📂src  
 ┃ ┣ 📜Alignment.py  
 ┃ ┣ 📜VarCalling.py  
 ┃  
 ┣ 📂variants_info  
 ┃ ┣ 📜ex4_info.csv  
 ┃ ┣ 📜ex4-1_output_template.csv  
 ┃ ┣ 📜ex4-2_output_template.csv  
 ┃ ┣ 📜ex5_info.csv  
 ┃ ┣ 📜ex6_info.csv  
 ┃ ┣ 📜ex7_info.csv  
 ┃ ┣ 📜ex8_info.csv  
 ┃ ┣ 📜ex9_info.csv  
 ┃ ┣ 📜invivo_ex4_info.csv  
 ┃ ┗ 📜invivo_ex9_info.csv  
 ┃  
 ┗ 📜SuppleCode4.ipynb (this file)  

# Requirements
- CRISPResso2 (>= 2.x.x)
- pandas

# Packages for analysis

In [4]:
import os
import pandas as pd
from tqdm import tqdm
from glob import glob

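## Analysis 1: Variant calling and read count file generation
After running CRISPResso, generate the read count file. This file is the basis for all downstream analyses. Afterwards, compute the odds ratio and the Fisher's exact test p-value for each variant and use them for filtering.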

In [1]:
from src.Alignment import ABL1VUS
from src.VarCalling import read_statistics

In [2]:
var_sample = 'data/read_counts/Stats_C4Control800.csv'
background = 'data/read_counts/Stats_K562-PE4k_background_ex4-2.csv'
df_orpv = read_statistics(var_sample, background)

df_orpv

Analysis: Stats_C4Control800


| Unnamed: 0 | RefSeq | Label | RefRead | AA_var | SNV_var | count | Edited_WT_count | RPM | UE_SynPE_count | UE_WT_count | OR | pvalue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CATCACCACGCTCCATTATCAGGCCCCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216Q | ABL1_ex4_pos103C_A | 3 | 1018497 | 2.058072 | 0 | 219921 | 0.863711 | 1.00000 |
| 1 | CATCACCACGCTCCATTATCAAGCTCCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216Q | ABL1_ex4_pos103C_A | 1 | 1018497 | 0.686024 | 1 | 219921 | 0.215928 | 0.32363 |
| 2 | CATCACCACGCTCCATTATCGAGCACCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216R | ABL1_ex4_pos103C_G | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 3 | CATCACCACGCTCCATTATCGAGCTCCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216R | ABL1_ex4_pos103C_G | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 4 | CATCACCACGCTCCATTATCTAGCTCCAAAGCGCAACAAGCCCACT... | SynPE | Both | P216L | ABL1_ex4_pos103C_T | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1030 | CATCACCACGCTCCATTATCCAGCCCCAAAGCGCAACAAGCCCACT... | SynPE | R2 | Intron | ABL1_ex4_pos283G_A | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 1031 | CATCACCACGCTCCATTATCCAGCCCCAAAGCGCAACAAGCCCACT... | SynPE | R2 | Intron | ABL1_ex4_pos283G_C | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 1032 | CATCACCACGCTCCATTATCCAGCCCCAAAGCGCAACAAGCCCACT... | SynPE | R2 | Intron | ABL1_ex4_pos283G_C | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
| 1033 | CATCACCACGCTCCATTATCCAGCCCCAAAGCGCAACAAGCCCACT... | SynPE | R2 | Intron | ABL1_ex4_pos283G_T | 0 | 1018497 | 0.000000 | 0 | 219921 | 0.215928 | 1.00000 |
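
To make the `OR` and `pvalue` columns in the output above easier to interpret, the following is a minimal, standard-library sketch of an odds ratio and a two-sided Fisher's exact test on a 2x2 table `[[count, Edited_WT_count], [UE_SynPE_count, UE_WT_count]]`. It is not the code of `read_statistics`; the function names are hypothetical. The pseudocount of 1 added to every cell for the odds ratio is an assumption inferred from the displayed values, which it reproduces for the rows shown.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    lo, hi = max(0, col1 - row2), min(col1, row1)
    log_denom = _log_comb(total, col1)

    def log_prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denom

    log_obs = log_prob(a)
    # Sum every table that is at most as probable as the observed one
    pvalue = sum(
        exp(lp)
        for lp in (log_prob(x) for x in range(lo, hi + 1))
        if lp <= log_obs + 1e-7
    )
    return min(1.0, pvalue)


def odds_ratio_with_pseudocount(a, b, c, d, pseudo=1):
    """Odds ratio with a pseudocount on every cell (assumed, not confirmed)."""
    return ((a + pseudo) / (b + pseudo)) / ((c + pseudo) / (d + pseudo))


# Row 1 of the output above: count=1, Edited_WT_count=1018497,
# UE_SynPE_count=1, UE_WT_count=219921
p_row1 = fisher_exact_two_sided(1, 1018497, 1, 219921)
or_row1 = odds_ratio_with_pseudocount(1, 1018497, 1, 219921)
print(f"OR={or_row1:.6f}, pvalue={p_row1:.5f}")
```

Under these assumptions the p-value is computed on the raw counts, while the pseudocount only affects the odds ratio; this keeps the odds ratio defined when a variant has zero reads in the unedited background.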


## Analysis 2: Read pattern analysis
Code for analyzing the mutations found in aligned reads, covering not only the intended prime edit but also mutations that appear in other patterns.

In [None]:
from src.VarCalling import ReadPatternAnalyzer

In [None]:
rpa = ReadPatternAnalyzer()
files = glob('data/NGS_frequency_table/alignd_*')

dict_summary = {}

for freq_table in tqdm(files, total = len(files),
                       desc = 'Read Pattern Analyzer', ## text shown in front of the progress bar
                       ncols = 100,                     ## width of the progress output
                       ascii = ' =',                   ## bar characters; the first one must be a space for this to work
                      ):
    
    sample   = os.path.basename(freq_table).replace('.txt', '')
    exon_num = int(sample[4])
    ref_info = f'variants_info/ex{exon_num}_info.csv'

    df_pattern = rpa.run(freq_table, ref_info)
    os.makedirs('result', exist_ok=True)  # make sure the output directory exists
    df_pattern.to_csv(f'result/{sample}_patterns.csv', index=False)

    dict_temp   = {}

    for idx in tqdm(df_pattern.index, total=len(df_pattern.index),
                    desc='Summarize mut classes',
                    ncols=100,
                    leave=False,
                    ):
        
        data = df_pattern.loc[idx]

        # Accumulate the read count of this row into its mutation class
        dict_temp[data.mut_class] = dict_temp.get(data.mut_class, 0) + data['#Reads']

    dict_summary[sample] = dict_temp

df_summary = pd.DataFrame.from_dict(dict_summary, orient='index')
df_summary
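
The per-sample tally built in the loop above can be sketched without pandas as follows. This is a toy illustration only: the sample names, the mutation class labels, and the read counts are hypothetical placeholders, and `summarize_mut_classes` is not part of `src.VarCalling`.

```python
from collections import Counter


def summarize_mut_classes(rows):
    """Sum the '#Reads' value of each row into its 'mut_class' bucket."""
    totals = Counter()
    for row in rows:
        totals[row["mut_class"]] += row["#Reads"]
    return dict(totals)


# Hypothetical pattern rows for two samples (placeholder values)
toy_patterns = {
    "sample_A": [
        {"mut_class": "intended", "#Reads": 120},
        {"mut_class": "substitution", "#Reads": 7},
        {"mut_class": "intended", "#Reads": 30},
    ],
    "sample_B": [
        {"mut_class": "deletion", "#Reads": 4},
        {"mut_class": "intended", "#Reads": 90},
    ],
}

# One dictionary of class totals per sample, like dict_summary in the loop
dict_summary = {
    sample: summarize_mut_classes(rows) for sample, rows in toy_patterns.items()
}
print(dict_summary)
```

When the rows are already in a DataFrame, `df_pattern.groupby('mut_class')['#Reads'].sum()` yields the same per-class totals in a single call.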
    

## Analysis 3: Single clones
Code for analyzing the variants identified as hits after isolating them as single clonal cells. The goal is to check the editing efficiency of each single clone and to estimate the copy number of the variant.

In [None]:
from src.VarCalling import single_clone_var_freq
from src.constant import single_clones_refseq

# Load variants type and corresponding FASTA file path info
df_ref = pd.read_csv('variants_info/variants_single_clone_info.csv')
df_ref

In [None]:
# Alignment using CRISPResso

for idx in df_ref.index:
    data = df_ref.loc[idx]

    sample_id = data['name']
    r1        = f'data/NGS_FASTQ_files/{data.file1}'
    r2        = f'data/NGS_FASTQ_files/{data.file2}'
    exon      = data.ref_seq

    aligner = ABL1VUS(sample_id, r1, r2, exon=exon)
    aligner.run(out_dir='./', save_plot=True)

In [None]:
list_df = []

for idx in df_ref.index:
    data = df_ref.loc[idx]

    variant_id = data['Sample']
    sample_id  = data['name']
    exon_num   = data['ref_seq']

    freq_table = list(glob(f'CRISPResso_on_{sample_id}/*Alleles_frequency_table_*.txt'))[0]

    # variants reference sequences
    wt_seq        = single_clones_refseq[variant_id]['WT']
    edit_seq      = single_clones_refseq[variant_id]['Edited']
    intended_only = single_clones_refseq[variant_id]['Intended_only']

    df_count = single_clone_var_freq(sample_id=sample_id,
                                    freq_table=freq_table,
                                    wt_seq=wt_seq,
                                    edit_seq=edit_seq,
                                    intended_only=intended_only,
                                    )

    list_df.append(df_count)

df_merge = pd.concat(list_df)
df_merge
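
As a rough illustration of what a per-clone read count could look like, the sketch below assigns each allele to `Edited`, `Intended_only`, `WT`, or `Other` by substring matching and reports the read fractions. This is an assumption about the general idea, not the implementation of `single_clone_var_freq`; the helper names, the short sequences, and the read counts are all hypothetical.

```python
def classify_read(seq, wt_seq, edit_seq, intended_only):
    """Assign a read to the first reference window it contains."""
    if edit_seq in seq:
        return "Edited"          # intended edit plus synonymous marker
    if intended_only in seq:
        return "Intended_only"   # intended edit without the marker
    if wt_seq in seq:
        return "WT"
    return "Other"


def allele_fractions(alleles, wt_seq, edit_seq, intended_only):
    """Return the fraction of reads in each class.

    `alleles` is an iterable of (aligned_sequence, n_reads) pairs.
    """
    counts = {"WT": 0, "Edited": 0, "Intended_only": 0, "Other": 0}
    for seq, n_reads in alleles:
        counts[classify_read(seq, wt_seq, edit_seq, intended_only)] += n_reads
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Hypothetical reference windows and alleles (placeholder values)
toy_alleles = [
    ("TTCCAGCTGG", 50),  # contains the WT window
    ("TTCAAGCCGG", 30),  # contains the edited window
    ("TTCAAGCTGG", 10),  # intended edit only
    ("TTCGGGCTGG", 10),  # matches none of the windows
]
fractions = allele_fractions(
    toy_alleles, wt_seq="CCAGCT", edit_seq="CAAGCC", intended_only="CAAGCT"
)
print(fractions)
```

Checking the edited window before the intended-only window matters only when one sequence can contain the other; the order is kept explicit here so that each read falls into exactly one class.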
