### This file extends the pseudo sequences for HLA-II alpha chains

The tricky part of this file is the set of position adjustments needed to verify the alignments. The final findings, which describe how to reconstruct the old pseudo sequences and how to construct the extended ones, are summarized in the file

    t4_summary.md

We already have pseudo sequences on 15 positions for HLA-II alpha chains, contained in the file

    ../../data/intermediate_data/pseudosequence_2016_all_X.dat

We want to extend these positions to also cover the additional positions obtained from

    t2_check_additional_pos_contacts.log.ipynb

First, we need to check whether we can get a unique pseudo sequence for each HLA-II alpha chain on the additional positions.

If so, we then keep only those positions that show diversity in amino acids.

In [1]:
import numpy as np
import pandas as pd

from collections import Counter
from collections import defaultdict

Materials:

Full sequence files: 

    ../../data/intermediate_data/HLA_TCR_contact/DRA_prot.alfas

    ../../data/intermediate_data/HLA_TCR_contact/DPA1_prot.alfas

    ../../data/intermediate_data/HLA_TCR_contact/DQA1_prot.alfas

The 15 positions for HLA-II alleles from NetMHCIIpan-3.0 (these need further adjustment, following the findings summarized in step70_summary.txt):

9, 11, 22, 24, 31, 52, 53, 58, 59, 61, 65, 66, 68, 72, 73

18 additional positions (0-indexed) from t2_check_additional_pos_contacts.log.ipynb:

4, 28, 35, 39, 45, 46, 47, 50, 51, 53, 56, 58, 60, 63, 65, 67, 71, 72

Pseudo sequences on 15 positions for HLA-II alpha chains, contained in the file

    ../../data/intermediate_data/pseudosequence_2016_all_X.dat

In [3]:
HLA_2_pseudo = pd.read_csv("../../data/intermediate_data/pseudosequence_2016_all_X.dat", sep = "\t", header = None)
HLA_2_pseudo.shape
# (5636, 2)
HLA_2_pseudo.columns = ["HLA", "seq"]
HLA_2_pseudo[:6]

Unnamed: 0,HLA,seq
0,DRB1_0101,QEFFIASGAAVDAIMWLFLECYDLQRATYHVGFT
1,DRB1_0102,QEFFIASGAAVDAIMWLFLECYDLQRATYHAVFT
2,DRB1_0103,QEFFIASGAAVDAIMWLFLECYDIDEATYHVGFT
3,DRB1_0104,QEFFIASGAAVDAIMWLFLECYDLQRANYHVVFT
4,DRB1_0105,QEFFIASGAAVDAIMWLFLECYDLQRATYHVGFT
5,DRB1_0106,QEFFIASGAAVDAIMWLFLECYDLQAATYHVVFT


In [28]:
# build pseudo sequence dictionary

# from the exploration in t3_explore_II.ipynb, 
# the only HLA-II pair with two rows in this table is 'HLA-DPA10103-DPB10601';
# it is not among the HLA-II pairs we consider from the HLA v2 table of DeWitt_2018,
# so ignore this issue for now and let the later row's pseudo seq overwrite the earlier one

HLA_2_pseudo_dict = defaultdict(str)

for hla, seq in zip(HLA_2_pseudo.HLA.tolist(), HLA_2_pseudo.seq.tolist()):
    # keep only the first 15 residues, i.e. the alpha chain positions
    HLA_2_pseudo_dict[hla] = seq[:15]

len(HLA_2_pseudo_dict)

5635
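
The table has 5636 rows but the dictionary ends up with 5635 keys, which matches the single duplicated name mentioned in the comments. A minimal sketch of how such conflicts could be listed before the later row overwrites the earlier one is shown here. The helper `find_conflicting_names` and the placeholder sequences are assumptions for illustration, not part of the original notebook.

```python
from collections import defaultdict


def find_conflicting_names(rows):
    """Return {name: [seq, ...]} for every name that appears in more than one row.

    `rows` is an iterable of (name, seq) tuples, in file order.
    """
    seqs_by_name = defaultdict(list)
    for name, seq in rows:
        seqs_by_name[name].append(seq)
    return {name: seqs for name, seqs in seqs_by_name.items() if len(seqs) > 1}


# hypothetical rows with placeholder sequences, only to show the behavior
rows = [
    ("DRB1_0101", "AAA"),
    ("HLA-DPA10103-DPB10601", "BBB"),
    ("HLA-DPA10103-DPB10601", "CCC"),
]

conflicts = find_conflicting_names(rows)
# plain dict construction keeps the last value seen for a repeated key,
# which mirrors what the loop above does
last_wins = dict(rows)
print(conflicts)
print(last_wins["HLA-DPA10103-DPB10601"])
```

Listing the conflicts explicitly makes it easy to confirm that none of the duplicated names belongs to the HLA-II pairs used later.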

In [67]:
# the original 15 positions from NetMHC-II-pan-3.0 paper
fifteen = [9, 11, 22, 24, 31, 52, 53, 58, 59, 61, 65, 66, 68, 72, 73]

In [11]:
# get HLA-II pair names from HLA_v2_features
HLA_v2_features_row_names = pd.read_csv("../../data/intermediate_data/DeWitt_2018/HLA_v2_features_row_names.txt", 
                                        sep = " ", header = None)
HLA_v2_features_row_names.columns = ["feature", "hla"]
HLA_v2_features_row_names.shape
# (215, 2)
HLA_v2_features_row_names[:6]

Unnamed: 0,feature,hla
0,feature:,HLA-DPAB*02:01_04:01
1,feature:,HLA-DQAB*05:05_06:04
2,feature:,HLA-B*08:01
3,feature:,HLA-A*24:02
4,feature:,HLA-A*24:03
5,feature:,HLA-B*38:02


In [12]:
HLA_II_v2_pairs = [hla for hla in HLA_v2_features_row_names.hla.tolist() if hla[:7] in ["HLA-DPA", "HLA-DQA", "HLA-DRD", "HLA-DRB"]]

In [14]:
HLA_II_v2_5DRDQ = [item for item in HLA_II_v2_pairs if len(item.split("_")) > 2]
HLA_II_v2_5DRDQ

['HLA-DRDQ*10:01_01:05_05:01',
 'HLA-DRDQ*03:01_05:01_02:01',
 'HLA-DRDQ*13:01_01:03_06:03',
 'HLA-DRDQ*15:01_01:02_06:02',
 'HLA-DRDQ*09:01_03:02_03:03']

In [16]:
HLA_II_v2_5DRDQ_DRB = ["HLA-DRB1*" + item[9:].split("_")[0] for item in HLA_II_v2_5DRDQ]
HLA_II_v2_5DRDQ_DQAB = ["HLA-DQAB*" + "_".join(item[9:].split("_")[1:]) for item in HLA_II_v2_5DRDQ]
HLA_II_v2_5DRDQ_DQAB

['HLA-DQAB*01:05_05:01',
 'HLA-DQAB*05:01_02:01',
 'HLA-DQAB*01:03_06:03',
 'HLA-DQAB*01:02_06:02',
 'HLA-DQAB*03:02_03:03']
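
The slicing with `item[9:]` relies on the prefix `HLA-DRDQ*` being exactly 9 characters long. A small sketch of the same split written as a function with explicit checks follows. The name `split_drdq` is hypothetical, and the assumed name layout (`DRB1_DQA1_DQB1` after the prefix) is inferred from the outputs above.

```python
def split_drdq(name):
    """Split 'HLA-DRDQ*<DRB1>_<DQA1>_<DQB1>' into a DRB1 name and a DQAB pair name."""
    prefix = "HLA-DRDQ*"
    if not name.startswith(prefix):
        raise ValueError("not a DRDQ haplotype name: " + name)
    fields = name[len(prefix):].split("_")
    if len(fields) != 3:
        raise ValueError("expected three allele fields in: " + name)
    drb1, dqa1, dqb1 = fields
    return "HLA-DRB1*" + drb1, "HLA-DQAB*" + dqa1 + "_" + dqb1


print(split_drdq("HLA-DRDQ*10:01_01:05_05:01"))
# ('HLA-DRB1*10:01', 'HLA-DQAB*01:05_05:01')
```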

In [23]:
HLA_II_complete = list(set(HLA_II_v2_pairs + HLA_II_v2_5DRDQ_DRB + HLA_II_v2_5DRDQ_DQAB) - set(HLA_II_v2_5DRDQ))
len(HLA_II_complete)
# 135
HLA_II_complete[:6]

['HLA-DRB1*16:01',
 'HLA-DRB1*11:02',
 'HLA-DQAB*01:02_03:02',
 'HLA-DRB1*13:01',
 'HLA-DRB1*03:01',
 'HLA-DRB1*11:04']

In [25]:
# reconstruct pseudo sequences on the 15 positions of alpha chain

# map each HLA-II pair to the corresponding alpha chain
# load the corresponding pseudo sequences and get the first 15
# reconstruct the pseudo sequences based on position adjustments
#  -- load the three full sequences files
#  -- the lists of DQAs with deletion
#  -- write functions to reconstruct pseudo sequences
# check whether for each HLA-II alpha chain, the pseudo sequences on the 18 additional positions
#  are the same too
# if so, move on and write out a file of the alpha chain names and their corresponding pseudo 
#  sequences

In [27]:
#HLA_II_complete

In [43]:
# this first dictionary holds the corresponding alpha chain of each HLA-II pair
HLA_II_alpha_dict = defaultdict(str)
# this second dictionary holds the translation of each HLA-II pair into the names used in "../../data/intermediate_data/pseudosequence_2016_all_X.dat"
trans_hla_II_dict = defaultdict(str)
# based on the existing pseudo sequence dictionary, build one for alpha chain and possible
# corresponding set of pseudo sequences -- expected to be set of len 1 if no bug
HLA_II_alpha_set_pseudo_dict = defaultdict(set)

# separate the HLA-II pairs into two alleles each
# translate them into the names in file "../../data/intermediate_data/pseudosequence_2016_all_X.dat"
for item in HLA_II_complete:
    if item[:8] == "HLA-DQAB":
        item_1 = "DQA1" + "*" + item[9:].split("_")[0]
        item_2 = "DQB1" + "*" + item[9:].split("_")[1]
        HLA_II_alpha_dict[item] = item_1
        trans_hla_II_dict[item] = "HLA-" + item_1.replace("*", "").replace(":", "") + "-" + item_2.replace("*", "").replace(":", "")
        HLA_II_alpha_set_pseudo_dict[item_1].add(HLA_2_pseudo_dict[trans_hla_II_dict[item]][:15])
    elif item[:8] == "HLA-DPAB":
        item_1 = "DPA1" + "*" + item[9:].split("_")[0]
        item_2 = "DPB1" + "*" + item[9:].split("_")[1]
        HLA_II_alpha_dict[item] = item_1
        trans_hla_II_dict[item] = "HLA-" + item_1.replace("*", "").replace(":", "") + "-" + item_2.replace("*", "").replace(":", "")
        HLA_II_alpha_set_pseudo_dict[item_1].add(HLA_2_pseudo_dict[trans_hla_II_dict[item]][:15])
    elif item[:8] == "HLA-DRB1":
        item_1 = "DRA"
        item_2 = "DRB1" + "*" + item[9:]
        HLA_II_alpha_dict[item] = item_1
        trans_hla_II_dict[item] = 'DRB1_' + item[9:].replace(":", "")
        HLA_II_alpha_set_pseudo_dict[item_1].add(HLA_2_pseudo_dict[trans_hla_II_dict[item]][:15])
    else:
        print("error found, first eight letters exception")
        print(item)
        break
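
The name translation in the loop above can also be read as a single function. This is only a sketch: `to_pseudo_table_name` is a hypothetical helper, and the target formats (`HLA-DQA1xxxx-DQB1xxxx`, `HLA-DPA1xxxx-DPB1xxxx`, `DRB1_xxxx`) are taken from the string operations in the cell, not from any documentation of the table.

```python
def to_pseudo_table_name(pair):
    """Translate an HLA-II name from the feature table into the pseudo sequence table key."""
    if not pair.startswith("HLA-") or "*" not in pair:
        raise ValueError("unexpected HLA-II name: " + pair)
    locus, alleles = pair[len("HLA-"):].split("*", 1)
    if locus in ("DQAB", "DPAB"):
        gene = locus[:2]  # "DQ" or "DP"
        alpha, beta = alleles.split("_")
        return "HLA-{g}A1{a}-{g}B1{b}".format(
            g=gene, a=alpha.replace(":", ""), b=beta.replace(":", "")
        )
    if locus == "DRB1":
        return "DRB1_" + alleles.replace(":", "")
    raise ValueError("unexpected locus in: " + pair)


print(to_pseudo_table_name("HLA-DQAB*05:05_06:04"))  # HLA-DQA10505-DQB10604
print(to_pseudo_table_name("HLA-DRB1*01:01"))        # DRB1_0101
```

Raising on an unknown locus, rather than printing and breaking, keeps a malformed name from passing unnoticed.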

In [44]:
set(HLA_II_alpha_dict.values())
list_HLA_II_alpha = list(set(HLA_II_alpha_dict.values()))
list_HLA_II_alpha.sort()
list_HLA_II_alpha

['DPA1*01:03',
 'DPA1*02:01',
 'DPA1*02:02',
 'DQA1*01:01',
 'DQA1*01:02',
 'DQA1*01:03',
 'DQA1*01:04',
 'DQA1*01:05',
 'DQA1*02:01',
 'DQA1*03:01',
 'DQA1*03:02',
 'DQA1*03:03',
 'DQA1*04:01',
 'DQA1*05:01',
 'DQA1*05:05',
 'DQA1*06:01',
 'DRA']

In [46]:
[len(value) for value in HLA_II_alpha_set_pseudo_dict.values()]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
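
One caveat with this check: because the lookup dictionary is a `defaultdict(str)`, a translated name that is missing from the table silently yields an empty string, and a set containing only `""` also has length 1. A minimal sketch of a stricter check is given below. The helper `collapse_unique` is an assumption for illustration, not part of the original notebook.

```python
def collapse_unique(candidates):
    """Turn {allele: set_of_seqs} into {allele: seq}, rejecting empty or ambiguous sets."""
    collapsed = {}
    for allele, seqs in candidates.items():
        non_empty = {seq for seq in seqs if seq}
        if len(seqs) != 1 or len(non_empty) != 1:
            raise ValueError(
                "expected exactly one non-empty pseudo sequence for {}, got {}".format(
                    allele, sorted(seqs)
                )
            )
        collapsed[allele] = next(iter(non_empty))
    return collapsed


print(collapse_unique({"DRA": {"QEFFIASGAAVDAIM"}}))
```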

In [55]:
# use HLA_II_alpha_set_pseudo_dict to get a dict with pseudo seq as value
HLA_II_alpha_pseudo_dict = defaultdict(str)

for key in HLA_II_alpha_set_pseudo_dict:
    HLA_II_alpha_pseudo_dict[key] = list(HLA_II_alpha_set_pseudo_dict[key])[0]

len(HLA_II_alpha_pseudo_dict)

17

In [56]:
# below are the adjustments needed in order to reconstruct the pseudo sequence on the
# original 15 positions:
# (0) DPA, DQA are processed by subtracting 4 from the position indexes, 
#        with the first aa replaced by the one before it;
# (1) If a DQA falls into the list extra_modify_DQAs, positions 
#        52 & 53 (1-indexed under NetMHC_II_pan-3.0) need to be modified;
# (2) If there are multiple pseudo seq candidates, get rid of those starting with 'X'

Load the full sequence files

Load DRA sequences

In [58]:
DRA_prot = pd.read_csv("../../data/intermediate_data/HLA_TCR_contact/DRA_prot.alfas", 
                       sep = " ", header = None)
DRA_prot.shape

name_ind = list(range(int(DRA_prot.shape[0]/2)))
names = DRA_prot.loc[[2 * ind for ind in name_ind]]
names = names.iloc[:, 0].tolist()
names = [name.replace(">", "") for name in names]
seqs = DRA_prot.loc[[2 * ind + 1 for ind in name_ind]]
seqs = seqs.iloc[:, 0].tolist()

DRA_seqs = pd.DataFrame(list(zip(names, seqs)), 
                            columns =['name', 'seq']) 
DRA_seqs.shape

DRA_seq_unique = list(set(DRA_seqs.seq.tolist()))[0]
DRA_seq_unique

'HVIIQ-AEFYLNPDQSGEFMFDFDGDEIFHVDMAKKETVWRLEEFGRFASFEAQGALANIAVDKANLEIMTKRSN'

Load DPA sequences

In [59]:
DPA1_prot = pd.read_csv("../../data/intermediate_data/HLA_TCR_contact/DPA1_prot.alfas", 
                        sep = " ", header = None)
DPA1_prot.shape

name_ind = list(range(int(DPA1_prot.shape[0]/2)))
names = DPA1_prot.loc[[2 * ind for ind in name_ind]]
names = names.iloc[:, 0].tolist()
names = [name.replace(">", "") for name in names]
seqs = DPA1_prot.loc[[2 * ind + 1 for ind in name_ind]]
seqs = seqs.iloc[:, 0].tolist()

DPA1_seqs = pd.DataFrame(list(zip(names, seqs)), 
                            columns =['name', 'seq']) 
DPA1_seqs.shape

DPA1_seqs['short'] = [":".join(name.split(":")[:2]) for name in DPA1_seqs.name.tolist()]
DPA1_seq_dict = defaultdict(set)
for short, seq in zip(DPA1_seqs.short.tolist(), DPA1_seqs.seq.tolist()):
    DPA1_seq_dict[short].add(seq)

len(DPA1_seq_dict)

24

Load DQA sequences

In [60]:
DQA1_prot = pd.read_csv("../../data/intermediate_data/HLA_TCR_contact/DQA1_prot.alfas", 
                        sep = " ", header = None)
DQA1_prot.shape

name_ind = list(range(int(DQA1_prot.shape[0]/2)))
names = DQA1_prot.loc[[2 * ind for ind in name_ind]]
names = names.iloc[:, 0].tolist()
names = [name.replace(">", "") for name in names]
seqs = DQA1_prot.loc[[2 * ind + 1 for ind in name_ind]]
seqs = seqs.iloc[:, 0].tolist()

DQA1_seqs = pd.DataFrame(list(zip(names, seqs)), 
                            columns =['name', 'seq']) 
DQA1_seqs.shape

DQA1_seqs['short'] = [":".join(name.split(":")[:2]) for name in DQA1_seqs.name.tolist()]

DQA1_seq_dict = defaultdict(set)
for short, seq in zip(DQA1_seqs.short.tolist(), DQA1_seqs.seq.tolist()):
    DQA1_seq_dict[short].add(seq)

len(DQA1_seq_dict)

35
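
The three loading cells repeat the same steps: pair every header line with the sequence line after it, then group sequences by the first two colon-separated fields of the allele name. A sketch of those steps as reusable functions is shown here, using only the standard library. The function names and the example headers are hypothetical, and the sketch assumes, as the cells above do, that each record occupies exactly two lines.

```python
from collections import defaultdict


def parse_two_line_records(lines):
    """Pair each '>name' header line with the sequence line that follows it."""
    lines = [line.strip() for line in lines if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    records = []
    for header, seq in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError("expected a header line, got: " + header)
        records.append((header[1:], seq))
    return records


def group_by_two_fields(records):
    """Group sequences by the first two colon-separated fields of the allele name."""
    grouped = defaultdict(set)
    for name, seq in records:
        short = ":".join(name.split(":")[:2])
        grouped[short].add(seq)
    return grouped


# hypothetical headers and placeholder sequences
example = [">DQA1*01:01:01", "ACD", ">DQA1*01:01:02", "ACD", ">DQA1*01:02:01", "ACE"]
grouped = group_by_two_fields(parse_two_line_records(example))
print(dict(grouped))
```

With real files, the lines could come from `open(path)`; checking the header prefix guards against a file whose sequences wrap over several lines.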

In [187]:
list(DQA1_seq_dict['DQA1*05:04'])[0][31:]

'DLGRKETVWCLPV-LRQFRFDPQFALTNIAVLKHNLNSLIKRSN'

In [76]:
# set of DQAs that need additional modifications
extra_modify_DQAs = ['DQA1*05:04', 'DQA1*05:07', 'DQA1*05:01', 'DQA1*05:06', 'DQA1*05:03', \
                    'DQA1*05:02', 'DQA1*05:08', 'DQA1*05:05', 'DQA1*05:11', 'DQA1*05:10', \
                    'DQA1*05:09', 'DQA1*04:02', 'DQA1*04:04', 'DQA1*06:01', 'DQA1*06:02', \
                    'DQA1*02:01', 'DQA1*04:01']

In [82]:
# reconstruct a pseudo sequence dictionary from full sequence data

# DRA_seq_unique


HLA_II_alpha_pseudo_rec_dict = defaultdict(str)


adjust_pos_alpha = [pos for pos in fifteen]
adjust_pos_alpha[0] = adjust_pos_alpha[0] - 1

def get_a_half_modify_15(lookup_dict, allele, adjust_pos_alpha, modify_flag):
    # all full sequences recorded under this two-field allele name
    seq_candids = list(lookup_dict[allele])
    if modify_flag:
        # special DQAs: read position 52 from position 53 of the full sequence,
        # then overwrite position 53 (index 6 of the pseudo sequence) with 'X'
        modify_adjust_pos_alpha = [pos for pos in adjust_pos_alpha]
        modify_adjust_pos_alpha[5] = 53
        seq_pseudos_modify = list(set(["".join([item[ind-4] for ind in modify_adjust_pos_alpha]) for item in seq_candids]))
        seq_pseudos = [item[:6] + 'X' + item[7:] for item in seq_pseudos_modify]
    else:
        seq_pseudos = list(set(["".join([item[ind-4] for ind in adjust_pos_alpha]) for item in seq_candids]))
    # drop candidates starting with 'X'; succeed only if exactly one pseudo sequence remains
    seq_pseudo_noX = [seq for seq in seq_pseudos if seq[0] != 'X']
    if len(seq_pseudo_noX) == 1:
        return True, seq_pseudo_noX[0]
    else:
        return False, ""
    
    
for item in list_HLA_II_alpha:
    if item[:3] == "DQA":
        modify_flag = (item in extra_modify_DQAs)
        sub_flag, seq = get_a_half_modify_15(DQA1_seq_dict, item, adjust_pos_alpha, modify_flag)
        if not sub_flag:
            print("seq_pseudo_noX len is not 1")
            print("item = ", item)
            break
        HLA_II_alpha_pseudo_rec_dict[item] = seq     
    elif item[:3] == "DPA":
        sub_flag, seq = get_a_half_modify_15(DPA1_seq_dict, item, adjust_pos_alpha, False)
        if not sub_flag:
            print("seq_pseudo_noX len is not 1")
            print("item = ", item)
            break
        HLA_II_alpha_pseudo_rec_dict[item] = seq     
    elif item[:3] == "DRA":
        seq = 'QEFFIASGAAVDAIM'
        HLA_II_alpha_pseudo_rec_dict[item] = seq 
    else:
        print("error found, first three letters exception")
        print(item)
        break

In [84]:
# now it is verified that this way of doing the adjustment gets pseudo sequences
# same as that from file "../../data/intermediate_data/pseudosequence_2016_all_X.dat"
HLA_II_alpha_pseudo_rec_dict == HLA_II_alpha_pseudo_dict

True

In [81]:
"".join([DRA_seq_unique[ind-4] for ind in adjust_pos_alpha]) == 'QEFFIASGAAVDAIM'

True

In [78]:
adjust_pos_alpha = [pos for pos in fifteen]
adjust_pos_alpha[0] = adjust_pos_alpha[0] - 1

In [77]:
list_HLA_II_alpha

['DPA1*01:03',
 'DPA1*02:01',
 'DPA1*02:02',
 'DQA1*01:01',
 'DQA1*01:02',
 'DQA1*01:03',
 'DQA1*01:04',
 'DQA1*01:05',
 'DQA1*02:01',
 'DQA1*03:01',
 'DQA1*03:02',
 'DQA1*03:03',
 'DQA1*04:01',
 'DQA1*05:01',
 'DQA1*05:05',
 'DQA1*06:01',
 'DRA']

In [72]:
fifteen

[9, 11, 22, 24, 31, 52, 53, 58, 59, 61, 65, 66, 68, 72, 73]

In [70]:
# all DRA and DPA sequences have a deletion at position 5 (0-indexed)
DRA_seq_unique
# the DRA sequence works with just the -4 shift, plus moving the first position 
# one step to the left

'HVIIQ-AEFYLNPDQSGEFMFDFDGDEIFHVDMAKKETVWRLEEFGRFASFEAQGALANIAVDKANLEIMTKRSN'

### Summary of the adjustments needed

(0) In general, the 15 positions

9, 11, 22, 24, 31, 52, 53, 58, 59, 61, 65, 66, 68, 72, 73

given in NetMHCIIpan-3.0 for HLA-II alpha chains need to be reduced by 4 in order to index into the sequences given in the full sequence files:

    DRA_prot.alfas
    DPA1_prot.alfas
    DQA1_prot.alfas

(1) On top of that, the first of these 15 positions (pos 9) needs to be reduced by an additional 1. One guess is that this is related to the deletion at 0-indexed position 5, visible for example in the unique full sequence 'HVIIQ-AEFYLNPDQSGEFMFDFDGDEIFHVDMAKKETVWRLEEFGRFASFEAQGALANIAVDKANLEIMTKRSN' for DRA alleles: the alignment in NetMHCIIpan-3.0 might have ignored this deletion, which would make the relative distance between the first position and the second shorter by 1.

(2) Besides these, an additional modification is needed if the alpha chain falls into the following set of DQAs:

'DQA1*05:04', 'DQA1*05:07', 'DQA1*05:01', 'DQA1*05:06', 'DQA1*05:03', 'DQA1*05:02', 'DQA1*05:08', 'DQA1*05:05', 'DQA1*05:11', 'DQA1*05:10', 'DQA1*05:09', 'DQA1*04:02', 'DQA1*04:04', 'DQA1*06:01', 'DQA1*06:02', 'DQA1*02:01', 'DQA1*04:01'

Most of these DQAs come from Fig. 2 of the NetMHCIIpan-3.0 paper. 'DQA1*04:01' is not mentioned in the paper, but it was found to behave the same way when adjusting positions for pseudo sequence matching, so it is added to this list. According to the paper, the sequences of these HLA-II alpha chains have a deletion at position 53 (under the indexing system of the NetMHCIIpan-3.0 paper). However, the full sequences from DQA1_prot.alfas show no deletion there, but instead a deletion at position 48 (under the same indexing system).

For example, under the NetMHCIIpan-3.0 alignment, DQA1*05:04 has the sequence (starting from position 35)

DLGRKETVWCLPVLRQFR-FDPQFALTNIAVLKHNLN...

vs. the sequence

DLGRKETVWCLPV-LRQFRFDPQFALTNIAVLKHNLN...

given by the full sequence info in

    DQA1_prot.alfas

Thus, the alignment of NetMHCIIpan-3.0 does not treat position 48 as a deletion, but it does treat position 53 as one. So, in order to construct pseudo sequences in the same way as in

    pseudosequence_2016_all_X.dat

for these special DQAs, one additional adjustment is needed: for positions 48, 49, 50, 51, 52, take the amino acids at positions 49, 50, 51, 52, 53 (under the indexing system of NetMHCIIpan-3.0) of the full sequences in

    DQA1_prot.alfas

instead, and for position 53 write an "X". For the other DQAs that do not fall into this special set, the DPAs and the DRA, this adjustment should not be made.

(3) If one HLA-II alpha allele has multiple candidate pseudo sequences, ignore those starting with "X" and keep only the others.
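
Rules (0) to (2) can be condensed into one lookup function. The following is a minimal sketch, not the notebook's own implementation: `extract_pseudo` and its parameter names are hypothetical, positions are given in NetMHCIIpan-3.0 numbering with the first position already moved from 9 to 8, and the DQA example pads the fragment shown earlier (which starts at 0-indexed position 31) with placeholder characters.

```python
SHIFTED_POSITIONS = range(48, 53)  # positions 48..52 read one residue to the right
MASKED_POSITION = 53               # written as 'X' for the special DQAs


def extract_pseudo(full_seq, positions, dqa_deletion=False, offset=4):
    """Read residues at NetMHCIIpan-3.0 positions from an aligned full sequence."""
    residues = []
    for pos in positions:
        if dqa_deletion and pos == MASKED_POSITION:
            residues.append("X")
            continue
        if dqa_deletion and pos in SHIFTED_POSITIONS:
            pos += 1
        residues.append(full_seq[pos - offset])
    return "".join(residues)


fifteen = [9, 11, 22, 24, 31, 52, 53, 58, 59, 61, 65, 66, 68, 72, 73]
adjusted = [fifteen[0] - 1] + fifteen[1:]  # rule (1): 9 becomes 8

dra = "HVIIQ-AEFYLNPDQSGEFMFDFDGDEIFHVDMAKKETVWRLEEFGRFASFEAQGALANIAVDKANLEIMTKRSN"
print(extract_pseudo(dra, adjusted))  # QEFFIASGAAVDAIM

# DQA1*05:04 fragment from index 31 onward, padded with placeholders in front
dqa_fragment = "DLGRKETVWCLPV-LRQFRFDPQFALTNIAVLKHNLNSLIKRSN"
dqa_padded = "?" * 31 + dqa_fragment
print(extract_pseudo(dqa_padded, [52, 53, 58], dqa_deletion=True))  # RXF
```

Keeping the shift and the mask inside one function means the same rule applies to any position list, instead of hard-coding list indexes for each list length.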

In [104]:
# -----------------------------------
# the content in this cell was the reasoning when trying to figure out the adjustments needed for alignments. 
# the markdown cell above is a neater summary
# -----------------------------------


# If we use the current full sequence of DRA as base line for the alignment,  

# on one hand, 
# the relative distance between the positions from NetMHC-II-pan-3.0 seems to 
# be ignoring the deletion at position 5 (0-indexed), and then adding 5 to the indexes of all positions

# on the other hand, 
# since all other positions match perfectly after shifting by -4 except 52 & 53 when DQA has a deletion 
# on position 53 according to the alignment in NetMHC-II-pan-3.0 
# but on position 48 (44 under 0-indexing, 44+4 = 48) according to "../data/HLA_TCR_contact/DQA1_prot.alfas",
# the pseudo sequences from NetMHC-II-pan-3.0 write 'X' for position 53 and use the aa on position 53 
# instead of that on position 52.
# In this sense, compared to DQA1_prot.alfas, NetMHC-II-pan-3.0 ignores the deletion at position 48,
# pushes all aas from position 49 to 53 to the left by one position, and uses 'X' for position 53

# Thus, if DQA falls into those needing additional modifications, 
# compared with the alignment in NetMHC-II-pan-3.0, the sequences in DQA1_prot.alfas ignore the deletion
# on position 53, consider pos 48 as a deletion, push the aas on pos 48 to 52 to the right by one pos
# and use '-' for position 48

# If DQA needs additional modification, for getting aas for the additional positions, 
# 49, 50 & 51 (45, 46 & 47 under 0-indexing) need to be changed to 50, 51 & 52
# But, if DQA does not fall into those needing additional modifications, 
# it is fine to just use the same processing as on the DRA sequence.


# since from the original 15 positions given by NetMHC-II-pan-3.0, 
# we need to modify the first position index by subtracting by 1,
# and this will change the original 9 to 8. 
# As a result, we can ignore the first position index 8 
# (4 under 0-indexing) in the 18 additional one


# for other positions except the first one in the original 15 positions 
# there does not seem to be a problem
# one guess is it is due to the deletion on position 5(0-indexed in DRA_seq_unique)

In [105]:
#DQA1_seq_dict

In [None]:
# 18 additional positions (index adjusted) for HLA-II alpha chain:
eighteen = [4, 28, 35, 39, 45, 46, 47, 50, 51, 53, 56, 58, 60, 63, 65, 67, 71, 72]
# add 4 to all of them to put them on the same scale as the original fifteen
up_eighteen = [ind + 4 for ind in eighteen]
# ignore the first pos 8 (4 under 0-indexing), since it is the same as the first position
# of the original fifteen (9) moved 1 position to the left
up_seventeen = up_eighteen[1:]
up_seventeen
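
As a quick sanity check of the conversion above, the arithmetic can be replayed on its own. This sketch only restates the lists from the notebook. The two assertions are added here for illustration and confirm that the dropped position coincides with the adjusted first position of the original fifteen, and that none of the remaining 17 overlaps the original positions.

```python
eighteen = [4, 28, 35, 39, 45, 46, 47, 50, 51, 53, 56, 58, 60, 63, 65, 67, 71, 72]
fifteen = [9, 11, 22, 24, 31, 52, 53, 58, 59, 61, 65, 66, 68, 72, 73]

# shift the 0-indexed positions onto the NetMHCIIpan-3.0 scale
up_eighteen = [ind + 4 for ind in eighteen]
up_seventeen = up_eighteen[1:]

# the dropped position equals the first original position moved one step left
assert up_eighteen[0] == fifteen[0] - 1
# the remaining 17 positions are all new relative to the adjusted original 15
adjusted_fifteen = [fifteen[0] - 1] + fifteen[1:]
assert not set(up_seventeen) & set(adjusted_fifteen)

print(up_seventeen)
# [32, 39, 43, 49, 50, 51, 54, 55, 57, 60, 62, 64, 67, 69, 71, 75, 76]
```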

In [108]:
# now let's check whether for each HLA-II alpha chain, we can find a unique pseudo 
# sequence for it based on these 17 positions


flag_HLA_II_on_17 = []
HLA_II_alpha_pseudo_17_dict = defaultdict(str)


adjust_pos_up_17 = [pos for pos in up_seventeen]

def get_a_half_modify_17(lookup_dict, allele, adjust_pos_up_17, modify_flag):
    seq_candids = list(lookup_dict[allele])
    if modify_flag:
        modify_adjust_pos_up_17 = [pos for pos in adjust_pos_up_17]
        modify_adjust_pos_up_17[3:6] = [pos+1 for pos in adjust_pos_up_17[3:6]]
        seq_pseudos = list(set(["".join([item[ind-4] for ind in modify_adjust_pos_up_17]) for item in seq_candids])) 
    else:
        seq_pseudos = list(set(["".join([item[ind-4] for ind in adjust_pos_up_17]) for item in seq_candids]))
    seq_pseudo_noX = [seq for seq in seq_pseudos if seq[0]!= 'X']
    if len(seq_pseudo_noX) == 1:
        return True, seq_pseudo_noX[0]
    else:
        return False, ""
    
    
for item in list_HLA_II_alpha:
    if item[:3] == "DQA":
        modify_flag = (item in extra_modify_DQAs)
        sub_flag, seq = get_a_half_modify_17(DQA1_seq_dict, item, adjust_pos_up_17, modify_flag)
        flag_HLA_II_on_17 += [sub_flag]
        if not sub_flag:
            print("seq_pseudo_noX len is not 1")
            print("item = ", item)
            break
        HLA_II_alpha_pseudo_17_dict[item] = seq     
    elif item[:3] == "DPA":
        sub_flag, seq = get_a_half_modify_17(DPA1_seq_dict, item, adjust_pos_up_17, False)
        flag_HLA_II_on_17 += [sub_flag]
        if not sub_flag:
            print("seq_pseudo_noX len is not 1")
            print("item = ", item)
            break
        HLA_II_alpha_pseudo_17_dict[item] = seq     
    elif item[:3] == "DRA":
        flag_HLA_II_on_17 += [True]
        seq = "".join([DRA_seq_unique[ind-4] for ind in adjust_pos_up_17])
        HLA_II_alpha_pseudo_17_dict[item] = seq 
    else:
        print("error found, first three letters exception")
        print(item)
        break

In [115]:
sum(flag_HLA_II_on_17)

17

In [113]:
#[len(value) for value in HLA_II_alpha_pseudo_17_dict.values()]

In [121]:
# now we look into how many out of the 17 positions have diversity in terms of amino acids
# among the 17 different HLA-II alpha chains

nunique_17 = []

for i in range(17):
    cur_aas = ''
    for value in list(HLA_II_alpha_pseudo_17_dict.values()):
        cur_aas += value[i]
    nunique_17 += [len(set(cur_aas))]

nunique_17

[1, 1, 1, 4, 3, 3, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1]

In [126]:
# 7 out of 17 positions have diversity
add_HLA_II_7 = [up_seventeen[i] for i in range(17) if nunique_17[i] > 1]
add_HLA_II_7

[49, 50, 51, 55, 67, 71, 75]
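
The per-position diversity count can also be written column-wise with `zip`, which avoids indexing by a fixed length. A minimal sketch follows, applied to the three DPA1 pseudo sequences on 22 positions listed later in this file. The helper names are hypothetical, and the sketch assumes all sequences have equal length.

```python
def unique_counts_per_column(seqs):
    """Number of distinct residues at each position across equal-length sequences."""
    seqs = list(seqs)
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("all sequences must have the same length")
    return [len(set(column)) for column in zip(*seqs)]


def variable_columns(seqs):
    """Indexes of the positions where more than one residue is observed."""
    return [i for i, n in enumerate(unique_counts_per_column(seqs)) if n > 1]


dpa_seqs = [
    "YAFFMGQAFSEGGAILNNNTLQ",  # DPA1*01:03
    "YAFFQGRAFSEGGAILNNNTLQ",  # DPA1*02:01
    "YMFFQGRAFSEGGAILNNNTLQ",  # DPA1*02:02
]
print(variable_columns(dpa_seqs))  # [1, 4, 6]
```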

Combine the 7 additional positions with the original 15 to make a set of 22 positions

In [143]:
adjust_pos_alpha = [pos for pos in fifteen]
# note that the adjustment on the first index is applied here
adjust_pos_alpha[0] = adjust_pos_alpha[0] - 1
extended_HLA_II_alpha = adjust_pos_alpha + add_HLA_II_7
extended_HLA_II_alpha.sort()
#extended_HLA_II_alpha
# [8,11,22,24,31,49,50,51,52,53,55,58,59,61,65,66,67,68,71,72,73,75]

In [149]:
# now move on to get pseudo sequences based on these 22 positions:

flag_HLA_II_on_22 = []
HLA_II_alpha_pseudo_22_dict = defaultdict(str)


adjust_pos_up_22 = [pos for pos in extended_HLA_II_alpha]

def get_a_half_modify_22(lookup_dict, allele, adjust_pos_up_22, modify_flag):
    seq_candids = list(lookup_dict[allele])
    if modify_flag:
        modify_adjust_pos_up_22 = [pos+1 if pos in [49,50,51,52] else pos for pos in adjust_pos_up_22]
        seq_pseudos_modify = list(set(["".join([item[ind-4] for ind in modify_adjust_pos_up_22]) for item in seq_candids])) 
        seq_pseudos = [item[:9] + 'X' + item[10:] for item in seq_pseudos_modify]   
    else:
        seq_pseudos = list(set(["".join([item[ind-4] for ind in adjust_pos_up_22]) for item in seq_candids]))
    seq_pseudo_noX = [seq for seq in seq_pseudos if seq[0]!= 'X']
    if len(seq_pseudo_noX) == 1:
        return True, seq_pseudo_noX[0]
    else:
        return False, ""
    
    
for item in list_HLA_II_alpha:
    if item[:3] == "DQA":
        modify_flag = (item in extra_modify_DQAs)
        sub_flag, seq = get_a_half_modify_22(DQA1_seq_dict, item, adjust_pos_up_22, modify_flag)
        flag_HLA_II_on_22 += [sub_flag]
        if not sub_flag:
            print("seq_pseudo_noX len is not 1")
            print("item = ", item)
            break
        HLA_II_alpha_pseudo_22_dict[item] = seq     
    elif item[:3] == "DPA":
        sub_flag, seq = get_a_half_modify_22(DPA1_seq_dict, item, adjust_pos_up_22, False)
        flag_HLA_II_on_22 += [sub_flag]
        if not sub_flag:
            print("seq_pseudo_noX len is not 1")
            print("item = ", item)
            break
        HLA_II_alpha_pseudo_22_dict[item] = seq     
    elif item[:3] == "DRA":
        flag_HLA_II_on_22 += [True]
        seq = "".join([DRA_seq_unique[ind-4] for ind in adjust_pos_up_22])
        HLA_II_alpha_pseudo_22_dict[item] = seq 
    else:
        print("error found, first three letters exception")
        print(item)
        break

In [150]:
sum(flag_HLA_II_on_22)/len(flag_HLA_II_on_22)

1.0

In [151]:
HLA_II_alpha_pseudo_22_dict

defaultdict(str,
            {'DPA1*01:03': 'YAFFMGQAFSEGGAILNNNTLQ',
             'DPA1*02:01': 'YAFFQGRAFSEGGAILNNNTLQ',
             'DPA1*02:02': 'YMFFQGRAFSEGGAILNNNTLQ',
             'DQA1*01:01': 'CNYHESKFGGDGARVAKHNIMK',
             'DQA1*01:02': 'CNYHQSKFGGDGARVAKHNIMK',
             'DQA1*01:03': 'CNFHQSKFGGDGARVAKHNIMK',
             'DQA1*01:04': 'CNYHESKFGGDGARVAKHNIMK',
             'DQA1*01:05': 'CNYHESKFGGDGARVAKHNIMK',
             'DQA1*02:01': 'YNFHEHRLRXDFATVLKHNILK',
             'DQA1*03:01': 'YNYHERRFRRDFATVLKHNIVK',
             'DQA1*03:02': 'YNYHERRFRRDFATVLKHNIVK',
             'DQA1*03:03': 'YNYHERRFRRDFATVLKHNIVK',
             'DQA1*04:01': 'YNYHQRQFRXDFATVTKHNILK',
             'DQA1*05:01': 'YNYHQRQFRXDFATVLKHNSLK',
             'DQA1*05:05': 'YNYHQRQFRXDFATVLKHNSLK',
             'DQA1*06:01': 'YNFHQRQFRXDFATVTKHNILK',
             'DRA': 'QEFFIGRFASEGAAVDKAEIMK'})

In [178]:
HLA_II_alpha_pseudo_22_value = [HLA_II_alpha_pseudo_22_dict[key] for key in list_HLA_II_alpha]

df_HLA_II_alpha_22 = pd.DataFrame(list(zip(list_HLA_II_alpha, HLA_II_alpha_pseudo_22_value)), \
                                 columns = ["allele", "seq"])

In [181]:
# write the dictionary out in the format of a table
df_HLA_II_alpha_22.to_csv("../../data/intermediate_data/t4_HLA_II_v2_alpha_pseudo_22_dict.csv", 
                          index = False)

In [153]:
# go and check a few HLA-II alpha chains

In [177]:
HLA_II_alpha_pseudo_dict

defaultdict(str,
            {'DPA1*01:03': 'YAFFMFSGGAILNTL',
             'DPA1*02:01': 'YAFFQFSGGAILNTL',
             'DPA1*02:02': 'YMFFQFSGGAILNTL',
             'DQA1*01:01': 'CNYHEGGGARVAHIM',
             'DQA1*01:02': 'CNYHQGGGARVAHIM',
             'DQA1*01:03': 'CNFHQGGGARVAHIM',
             'DQA1*01:04': 'CNYHEGGGARVAHIM',
             'DQA1*01:05': 'CNYHEGGGARVAHIM',
             'DQA1*02:01': 'YNFHERXFATVLHIL',
             'DQA1*03:01': 'YNYHERRFATVLHIV',
             'DQA1*03:02': 'YNYHERRFATVLHIV',
             'DQA1*03:03': 'YNYHERRFATVLHIV',
             'DQA1*04:01': 'YNYHQRXFATVTHIL',
             'DQA1*05:01': 'YNYHQRXFATVLHSL',
             'DQA1*05:05': 'YNYHQRXFATVLHSL',
             'DQA1*06:01': 'YNFHQRXFATVTHIL',
             'DRA': 'QEFFIASGAAVDAIM'})

In [172]:
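#'DRA'
# adjust_pos_up_22 starts at 8 while fifteen still starts at 9, so the first residue
# is skipped and only the remaining 14 residues are compared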
''.join(['QEFFIGRFASEGAAVDKAEIMK'[i] for i in range(22) if adjust_pos_up_22[i] in fifteen]) == 'EFFIASGAAVDAIM'

True

In [173]:
# 'DPA1*01:03'
''.join(['YAFFMGQAFSEGGAILNNNTLQ'[i] for i in range(22) if adjust_pos_up_22[i] in fifteen]) == 'AFFMFSGGAILNTL'

True

In [174]:
# 'DQA1*01:01'
''.join(['CNYHESKFGGDGARVAKHNIMK'[i] for i in range(22) if adjust_pos_up_22[i] in fifteen]) == 'NYHEGGGARVAHIM'

True

In [176]:
# 'DQA1*05:01'
''.join(['YNYHQRQFRXDFATVLKHNSLK'[i] for i in range(22) if adjust_pos_up_22[i] in fifteen]) == 'NYHQRXFATVLHSL'

True
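
The four spot checks compare 14 residues because `fifteen` still holds 9 as its first entry. A sketch of the same projection using the adjusted position list, so that all 15 residues are recovered, is shown below. The function `project` is a hypothetical helper, and the sequences are copied from the two dictionaries printed above.

```python
def project(seq, seq_positions, keep_positions):
    """Keep the residues of `seq` whose position is listed in `keep_positions`."""
    if len(seq) != len(seq_positions):
        raise ValueError("sequence and position list must have the same length")
    keep = set(keep_positions)
    return "".join(aa for aa, pos in zip(seq, seq_positions) if pos in keep)


positions_22 = [8, 11, 22, 24, 31, 49, 50, 51, 52, 53, 55,
                58, 59, 61, 65, 66, 67, 68, 71, 72, 73, 75]
# original 15 positions with the first one already moved from 9 to 8
positions_15 = [8, 11, 22, 24, 31, 52, 53, 58, 59, 61, 65, 66, 68, 72, 73]

print(project("QEFFIGRFASEGAAVDKAEIMK", positions_22, positions_15))  # DRA: QEFFIASGAAVDAIM
print(project("YNYHQRQFRXDFATVLKHNSLK", positions_22, positions_15))  # DQA1*05:01: YNYHQRXFATVLHSL
```

Applied to every key of the 22-position dictionary, this would compare against the full 15-position dictionary rather than a few hand-picked alleles.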