**This script processes the ATGnet network and an output from the iSNP pipeline to find the intersection between them**

Requirements: 
1) ATGnet output file in tsv format 
2) iSNP pipeline output for patient population of interest e.g. CD or UC 


In [7]:
atg_net = [] 
uni_to_gene_map = {} #maps UniProt IDs to gene names
with open('/Users/rb1425/Documents/PHD/isnp_atg/RB_ISNP_ATG/input_files/ATGnet_core_dirreg.tsv') as atgnet_file:
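    # atgnet_file.readline()  # uncomment to skip a header row, if the file has one
    for line in atgnet_file:
        line = line.strip().split('\t')
        # Columns 0-1: source/target UniProt IDs; columns 2-3: their gene names
        source_uni = line[0]
        target_uni = line[1]

        # Record the gene name the first time each UniProt ID is seen
        if source_uni not in uni_to_gene_map:
            uni_to_gene_map[source_uni] = line[2]
        if target_uni not in uni_to_gene_map:
            uni_to_gene_map[target_uni] = line[3]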

        if source_uni not in atg_net:
            atg_net.append(source_uni)
        if target_uni not in atg_net:
            atg_net.append(target_uni)
print(atg_net)

len(atg_net)


['A6NCE7', 'Q9H0Y0', 'O75143', 'Q8TDY2', 'Q9BSB4', 'O95166', 'Q9BXW4', 'O94817', 'Q9H1Y0', 'Q9H0R8', 'Q9Y4P1', 'O95352', 'Q9NT62', 'Q9GZQ8', 'Q9H492', 'P60520', 'O75385', 'Q13501', 'Q2TAZ0', 'Q14457', 'Q8NEB9', 'Q676U5', 'Q5MNZ9', 'Q9P2Y5', 'Q6ZNE5', 'Q7Z6L1', 'Q96BY7', 'Q99570', 'Q9BQS8', 'Q9C0C7', 'Q8IYT8', 'Q8WYN0', 'Q9Y4P8', 'Q8NAA4', 'Q7Z3C6', 'P10415', 'P42345', 'O14641', 'Q96GC9', 'Q8N122', 'P09429', 'P63167', 'Q96FJ2', 'Q07817', 'Q05397', 'P51149', 'Q9Y6P5', 'Q14596', 'P58004', 'Q86VP1', 'P54619', 'Q9UHD2', 'Q00535', 'P12931', 'Q00610', 'P54646', 'P17612', 'P17252', 'Q16620', 'P11142', 'Q8NER1', 'Q96A56', 'Q3MII6', 'P00533', 'O15519', 'Q9H3D4', 'P60484', 'P21580', 'Q9HC29', 'P18146', 'Q13158', 'P04629', 'Q16644', 'Q92622', 'Q12778', 'Q00613', 'O15350', 'Q09472', 'Q01543', 'O43524', 'P42574', 'P55957', 'Q96LC9', 'Q9Y478', 'P07384', 'Q12983', 'P16220', 'Q92934', 'Q99933', 'P17275', 'P17655', 'P99999', 'Q16611', 'P27361', 'P04637', 'P07900', 'Q96PG8', 'Q16539', 'Q13794', 'P35638',

3357

Find the simple overlap between AutophagyNet and the SNP-affected proteins from the iSNP output. Note that the current logic does not preserve how the protein may be affected by the SNP (e.g. gain or loss of function in an upstream transcription factor). A later iteration may benefit from a more nuanced approach.

Currently we include any protein that appears in AutophagyNet and in either the source or target column of the iSNP output file.
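The inclusion rule above can be illustrated on toy data. This is a minimal sketch, not the notebook's own cell: the IDs, the `toy_network` and `toy_rows` names, and the `overlap` helper are all hypothetical, and the `'HSA-'` filter used in the real cell is left out. It uses a set for membership tests while keeping first-seen order in a list.

```python
# Toy data: these IDs and rows are placeholders, not taken from the real files.
toy_network = {"P00001", "P00002", "P00003"}  # stand-in for the AutophagyNet node IDs
toy_rows = [
    ("P00001", "P99999"),  # source is in the network
    ("P88888", "P00002"),  # target is in the network
    ("P77777", "P66666"),  # neither is in the network
    ("P00001", "P00002"),  # both are in the network, but already recorded
]


def overlap(network_ids, rows):
    """Return network proteins seen as source or target, in first-seen order."""
    seen = set()
    ordered = []
    for source, target in rows:
        for protein in (source, target):
            if protein in network_ids and protein not in seen:
                seen.add(protein)
                ordered.append(protein)
    return ordered


toy_affected = overlap(toy_network, toy_rows)
print(toy_affected)
```

Set membership tests avoid rescanning a list for every row, which may matter for a network with a few thousand nodes and a large iSNP file.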

In [8]:
# Create a list of affected proteins based on the iSNP output
affected_atg_proteins = []
with open('/Users/rb1425/Documents/PHD/isnp_atg/RB_ISNP_ATG/input_files/CD_iSNP_Leuven_test.txt') as isnp_file:
    isnp_file.readline()
    for line in isnp_file:
        line = line.strip().split('\t')
        source = line[1]
        target = line[2]
        snp = line[3]
        
        # Note: rows whose source ID contains 'HSA-' (miRNAs) are skipped for both
        # the source and the target check (check if this is correct)
        if source in atg_net and source not in affected_atg_proteins and 'HSA-' not in source:
            affected_atg_proteins.append(source)
        if target in atg_net and target not in affected_atg_proteins and 'HSA-' not in source:
            affected_atg_proteins.append(target)
            
print(affected_atg_proteins)
len(affected_atg_proteins)

# Map UniProt IDs to gene names for affected proteins
affected_atg_genes = []
for protein in affected_atg_proteins:
    affected_atg_genes.append(uni_to_gene_map[protein])

print(affected_atg_genes)
len(affected_atg_genes)

['Q9BT67', 'Q8WWW0', 'P49137', 'Q9NNW5', 'Q2TAL8', 'P14316', 'Q9UKA4', 'P84022', 'Q13131', 'P68036', 'P62072', 'Q13422', 'Q9UKD1', 'Q8NHV9', 'Q9UNG2', 'P40763', 'Q14938', 'P03372', 'Q5XPI4', 'O75030', 'P10276', 'Q14592', 'Q9H3D4', 'Q86YR5', 'Q99856', 'Q16665', 'P10244', 'Q13163', 'O15027', 'P19484', 'P10275', 'Q02556', 'Q01543', 'P22736', 'Q14186', 'Q92963', 'Q8NI51', 'Q01196', 'Q9UIQ6', 'Q8ND76', 'P50548', 'P04198', 'Q92731', 'Q16236', 'Q9BZE0', 'P16220', 'P17275', 'O43186', 'Q676U5', 'P39880', 'P11473', 'P50222', 'P42229', 'P17861', 'Q8WWA0', 'Q5SXM2', 'Q07869', 'Q01094', 'Q12800', 'P26927', 'P18146', 'Q9UH92', 'O15350', 'Q15796', 'O95747', 'Q03112', 'Q05925', 'O00330', 'Q9NU22', 'Q01860', 'P42224', 'P56179', 'P04637', 'P05412', 'Q14653', 'Q9NQB0', 'Q9UIH9', 'Q70SY1', 'Q12772', 'P08651', 'P27540', 'P51692', 'Q5VTD9']
['NDFIP1', 'RASSF5', 'MAPKAPK2', 'WDR6', 'QRICH1', 'IRF2', 'AKAP11', 'SMAD3', 'PRKAA1', 'UBE2L3', 'TIMM10', 'IKZF1', 'GMEB2', 'RHOXF1', 'TNFSF18', 'STAT3', 'NFIX', 'ESR1

83

**Filtering and Output File Writing**

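We generate two patient data matrices for the affected proteins:
1) A per-protein version, keyed by gene name, holding the patient columns of an iSNP line where the ATG network protein appears (if a protein appears on several lines, the last one is kept)
2) An "all-lines" version, which keeps every matching iSNP line; a line is included twice when the ATG network proteins appear in both its source and target columns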
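The difference between the two matrices can be seen on a small example. This is a sketch on made-up rows, with hypothetical names (`toy_affected`, `toy_rows`, `per_protein`, `all_rows`); it mirrors the dictionary-versus-list logic described above rather than reading the real iSNP file.

```python
# Toy rows: (source, target, patient values); all values are placeholders.
toy_affected = {"A", "B"}
toy_rows = [
    ("A", "X", ["0", "1"]),
    ("A", "B", ["1", "1"]),  # both columns match
    ("Y", "B", ["1", "0"]),
]

per_protein = {}  # one entry per protein; later rows overwrite earlier ones
all_rows = []     # one entry per match; a row matching twice is stored twice

for source, target, patients in toy_rows:
    for protein in (source, target):
        if protein in toy_affected:
            per_protein[protein] = "\t".join(patients)
            all_rows.append((source, target, patients))

print(per_protein)
print(len(all_rows))
```

Here `per_protein` ends with two entries while `all_rows` holds four, because the second row matches on both its source and its target.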

In [9]:
patientdict = {}
all_line = []
with open('/Users/rb1425/Documents/PHD/isnp_atg/RB_ISNP_ATG/input_files/CD_iSNP_Leuven_test.txt') as isnp_file:
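    # Header processing: patient columns are assumed to start at index 6;
    # check the file format before running
    header = isnp_file.readline()
    header = header.split('\t')[6:]
    header = '\t'.join(header)  # the header row's trailing newline is retained
    for line in isnp_file:
        line = line.strip().split('\t')
        # NOTE: the previous cell reads source/target/SNP from columns 1-3,
        # whereas this cell uses columns 0-2; confirm which matches the file
        source = line[0]
        target = line[1]
        snp = line[2]
        patients = line[6:]

        # patientdict is keyed by gene name, so a later row for the same gene
        # overwrites an earlier one; all_line keeps every matching row
        if source in affected_atg_proteins:
            patientdict[uni_to_gene_map[source]] = '\t'.join(patients)
            all_line.append(line)

        if target in affected_atg_proteins:
            patientdict[uni_to_gene_map[target]] = '\t'.join(patients)
            all_line.append(line)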


Write the data to TSV files

In [10]:

# Write patientdict to a new TSV file
with open('CD_affected_atg_patients.tsv', 'w', newline='') as f:
    f.write('Affected protein\t' + header)
    for key, value in patientdict.items():
        f.write("{}\t{}\n".format(key, value))

In [11]:
# Write all lines to a new TSV file
with open('CD_all_lines.tsv', 'w', newline='') as f:
    for line in all_line:
        # Each entry is a list of fields; join with tabs rather than writing the list repr
        f.write("{}\n".format('\t'.join(line)))
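As an alternative to joining fields by hand, the standard-library `csv` module can write tab-separated rows. The sketch below is an assumption about how that might look, not part of the original notebook: the rows are placeholders, and an in-memory `io.StringIO` buffer stands in for the output file.

```python
import csv
import io

# Placeholder rows; each row is a list of string fields, as produced by split('\t').
toy_lines = [
    ["P00001", "P00002", "rs0000001", "0", "1"],
    ["P00003", "P00004", "rs0000002", "1", "1"],
]

buffer = io.StringIO()  # in-memory stand-in for an output file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(toy_lines)

tsv_text = buffer.getvalue()
print(tsv_text)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''`, which the cells above already do.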