This file takes as input the associated (TCR, HLA) pairs file 

    ../../data/intermediate_data/t12_HLA_associated_TCR_v_alleles.csv

and writes out five files:

one for HLA-I alleles

    ../../data/intermediate_data/t12_HLA_I_associated_TCR_v_alleles.csv, shape (6423, 3)

one for HLA-II alleles in the HLA v2 format of DeWitt_2018 (125 simple alleles, each a DRB1, DPAB, or DQAB, plus 5 haplotypes)

    ../../data/intermediate_data/t12_HLA_II_associated_TCR_v_alleles.csv, shape (14159, 3)

one for HLA-II alleles, like the one above except that each of the 5 haplotypes is split into a DRB1 allele and a DQAB pair (10 HLA-II entries in total)

    ../../data/intermediate_data/t12_HLA_II_expand_associated_TCR_v_alleles.csv, shape (17281, 3)

one for HLA-II alleles in the HLA v2 format, involving only the 125 simple alleles (DRB1, DPAB, or DQAB) and excluding the 5 haplotypes

    ../../data/intermediate_data/t12_HLA_II_no_haplotype_associated_TCR_v_alleles.csv, shape (11037, 3)

and one for only the 5 haplotypes, separated into 10 HLA-II entries

    ../../data/intermediate_data/t12_HLA_II_5haplotypes_to_10pairs_associated_TCR_v_alleles.csv, shape (6244, 3)


In [1]:
import numpy as np
import pandas as pd

from collections import Counter
from collections import defaultdict

In [2]:
df_asso = pd.read_csv("../../data/intermediate_data/t12_HLA_associated_TCR_v_alleles.csv", header = 0)
df_asso.shape

(20582, 3)

In [3]:
df_asso[:6]

Unnamed: 0,tcr,hla_allele,association_pvalue
0,"TCRBV04-02*01,CASSQGGQSYEQYF",HLA-B*07:02,7.159545e-23
1,"TCRBV04-02*01,CASSQGGQSYEQYF",HLA-C*07:02,6.043353999999999e-19
2,"TCRBV04-02*01,CASSQGGQSYEQYF",HLA-DRDQ*15:01_01:02_06:02,1.48414e-08
3,"TCRBV02-01*01,CASSELPNTEAFF",HLA-DPAB*01:03_04:01,1.194622e-07
4,"TCRBV05-01*01,CASSWDKSYEQYF",HLA-A*03:01,7.44668e-08
5,"TCRBV06-06*01,CASNPRGGGSNQPQHF",HLA-B*07:02,9.795153e-08


In [4]:
df_asso.nunique()

tcr                   14945
hla_allele              215
association_pvalue     8756
dtype: int64

In [5]:
hla_short = [hla[:5] for hla in df_asso.hla_allele.tolist()]
df_asso['short'] = hla_short

In [6]:
df_asso_HLA_I = df_asso[df_asso["short"].isin(['HLA-A', 'HLA-B', 'HLA-C'])]
df_asso_HLA_I.shape

(6423, 4)

In [7]:
df_asso_HLA_I.nunique()

tcr                   4877
hla_allele              85
association_pvalue    3574
short                    3
dtype: int64

In [8]:
Counter(df_asso_HLA_I.short)

Counter({'HLA-B': 3357, 'HLA-C': 1653, 'HLA-A': 1413})
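
The split above relies only on the first five characters of each allele name. A minimal sketch of that logic on the allele names shown in the preview of `df_asso`; the helper `hla_class` and the set `CLASS_I_PREFIXES` are hypothetical names introduced here for illustration and are not part of the notebook:

```python
from collections import Counter

# Allele names copied from the preview of df_asso above
alleles = [
    "HLA-B*07:02",
    "HLA-C*07:02",
    "HLA-DRDQ*15:01_01:02_06:02",
    "HLA-DPAB*01:03_04:01",
    "HLA-A*03:01",
    "HLA-B*07:02",
]

# Same prefixes as the isin() filter used for HLA-I above
CLASS_I_PREFIXES = {"HLA-A", "HLA-B", "HLA-C"}


def hla_class(allele):
    """Return 'I' or 'II' based on the first five characters (hypothetical helper)."""
    prefix = allele[:5]
    if prefix in CLASS_I_PREFIXES:
        return "I"
    if prefix == "HLA-D":
        return "II"
    # Fail loudly on any name that matches neither pattern
    raise ValueError(f"unexpected HLA name: {allele}")


class_counts = Counter(hla_class(a) for a in alleles)
print(class_counts)
```

Raising on an unrecognized prefix is a design choice of this sketch: it would surface any allele name that neither filter in the notebook would pick up.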

In [9]:
df_asso_HLA_I_pure = df_asso_HLA_I.drop(['short'], axis=1)
df_asso_HLA_I_pure[:6]

Unnamed: 0,tcr,hla_allele,association_pvalue
0,"TCRBV04-02*01,CASSQGGQSYEQYF",HLA-B*07:02,7.159545e-23
1,"TCRBV04-02*01,CASSQGGQSYEQYF",HLA-C*07:02,6.043353999999999e-19
4,"TCRBV05-01*01,CASSWDKSYEQYF",HLA-A*03:01,7.44668e-08
5,"TCRBV06-06*01,CASNPRGGGSNQPQHF",HLA-B*07:02,9.795153e-08
6,"TCRBV06-06*01,CASNPRGGGSNQPQHF",HLA-C*07:02,4.896567e-07
8,"TCRBV06-01*01,CASSEAPGPTNEKLFF",HLA-A*03:01,4.462154e-09


Write out the subset related to HLA-I alleles.

In [40]:
df_asso_HLA_I_pure.to_csv("../../data/intermediate_data/t12_HLA_I_associated_TCR_v_alleles.csv", 
                          index = False)

Move on to HLA-II alleles.

In [10]:
df_asso_HLA_II = df_asso[df_asso["short"] == 'HLA-D']
df_asso_HLA_II.shape

(14159, 4)

In [11]:
df_asso_HLA_II.nunique()

tcr                   11216
hla_allele              130
association_pvalue     5358
short                     1
dtype: int64

Write out the subset related to HLA-II alleles in the v2 format of DeWitt_2018.

This version involves 125 simple HLA-II (DRB1, DPAB, or DQAB) and 5 haplotypes.

In [53]:
df_asso_HLA_II_pure = df_asso_HLA_II.drop(['short'], axis=1)

df_asso_HLA_II_pure.to_csv("../../data/intermediate_data/t12_HLA_II_associated_TCR_v_alleles.csv", 
                           index = False)

Next, split each row that corresponds to a haplotype into two rows: one for the DRB1 allele and one for the DQAB pair.

In [13]:
haplotypes = ['HLA-DRDQ*15:01_01:02_06:02', 'HLA-DRDQ*03:01_05:01_02:01', \
              'HLA-DRDQ*13:01_01:03_06:03', 'HLA-DRDQ*10:01_01:05_05:01', \
              'HLA-DRDQ*09:01_03:02_03:03']

In [14]:
df_asso_HLA_II_haplotypes = df_asso_HLA_II_pure[df_asso_HLA_II_pure['hla_allele'].isin(haplotypes)]
df_asso_HLA_II_haplotypes.shape

(3122, 3)

In [17]:
df_asso_HLA_II_haplotypes[:6]

Unnamed: 0,tcr,hla_allele,association_pvalue
2,"TCRBV04-02*01,CASSQGGQSYEQYF",HLA-DRDQ*15:01_01:02_06:02,1.48414e-08
7,"TCRBV06-06*01,CASNPRGGGSNQPQHF",HLA-DRDQ*15:01_01:02_06:02,3.184825e-09
9,"TCRBV06-05*01,CASSYSPGGDTEAFF",HLA-DRDQ*15:01_01:02_06:02,4.527401e-07
13,"TCRBV05-01*01,CASSLEGRYEQYF",HLA-DRDQ*15:01_01:02_06:02,8.487745e-12
16,"TCRBV06-05*01,CASSYSRPSSGNTIYF",HLA-DRDQ*15:01_01:02_06:02,2.505062e-31
19,"TCRBV05-01*01,CASSLGGFPDTQYF",HLA-DRDQ*15:01_01:02_06:02,1.490235e-17


In [18]:
# split each haplotype row into two rows: one DRB1 allele and one DQAB pair
tcr_hap_vec = []
hla_hap_vec = []
pval_hap_vec = []

for tcr, hla, pval in zip(df_asso_HLA_II_haplotypes.tcr.tolist(),
                          df_asso_HLA_II_haplotypes.hla_allele.tolist(),
                          df_asso_HLA_II_haplotypes.association_pvalue.tolist()):
    # hla[9:] drops the "HLA-DRDQ*" prefix, leaving e.g. "15:01_01:02_06:02"
    fields = hla[9:].split("_")
    DRB_name = "HLA-DRB1*" + fields[0]
    DQAB_name = "HLA-DQAB*" + "_".join(fields[1:])
    # both new rows keep the TCR and p-value of the original haplotype row
    tcr_hap_vec += [tcr, tcr]
    hla_hap_vec += [DRB_name, DQAB_name]
    pval_hap_vec += [pval, pval]

In [19]:
len(pval_hap_vec)

6244
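
The same expansion could also be written without an explicit loop. The following is a sketch of one alternative, not the notebook's method: it maps each haplotype name to a two-element list and lets `DataFrame.explode` duplicate the other columns. The helper `split_haplotype` and the frame `toy` are names assumed for illustration; the rows are copied from the haplotype preview above.

```python
import pandas as pd


def split_haplotype(hla):
    """Split e.g. 'HLA-DRDQ*15:01_01:02_06:02' into DRB1 and DQAB names.

    Hypothetical helper; splitting on '*' is equivalent to hla[9:] for
    names that start with 'HLA-DRDQ*'.
    """
    fields = hla.split("*", 1)[1].split("_")
    return ["HLA-DRB1*" + fields[0], "HLA-DQAB*" + "_".join(fields[1:])]


# Two rows copied from the preview of df_asso_HLA_II_haplotypes
toy = pd.DataFrame({
    "tcr": ["TCRBV04-02*01,CASSQGGQSYEQYF", "TCRBV06-06*01,CASNPRGGGSNQPQHF"],
    "hla_allele": ["HLA-DRDQ*15:01_01:02_06:02"] * 2,
    "association_pvalue": [1.48414e-08, 3.184825e-09],
})

# Each haplotype becomes a list of two names; explode() turns every list
# element into its own row and repeats tcr and association_pvalue.
expanded = (
    toy.assign(hla_allele=toy["hla_allele"].map(split_haplotype))
       .explode("hla_allele")
       .reset_index(drop=True)
)
print(expanded)
```

As in the loop above, both rows derived from a haplotype carry the p-value of the original haplotype association.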

In [20]:
df_asso_HLA_II_pure_pairs = pd.DataFrame(list(zip(tcr_hap_vec, hla_hap_vec, pval_hap_vec)), \
                                         columns = ['tcr', 'hla_allele', 'association_pvalue'])

Write out the associated (TCR, HLA) pairs for the 5 haplotypes only, separated into 10 HLA-II entries (a DRB1 allele and a DQAB pair per haplotype).

In [26]:
df_asso_HLA_II_pure_pairs.to_csv(\
        "../../data/intermediate_data/t12_HLA_II_5haplotypes_to_10pairs_associated_TCR_v_alleles.csv", index = False)

In [28]:
# get the subset related to the 125 simple HLA-II (one of DRB1, DPAB, DQAB)
df_asso_HLA_II_no_haplotype = df_asso_HLA_II_pure[~df_asso_HLA_II_pure['hla_allele'].isin(haplotypes)]
df_asso_HLA_II_no_haplotype.shape

(11037, 3)

In [29]:
df_asso_HLA_II_no_haplotype[:6]

Unnamed: 0,tcr,hla_allele,association_pvalue
3,"TCRBV02-01*01,CASSELPNTEAFF",HLA-DPAB*01:03_04:01,1.194622e-07
17,"TCRBV06-05*01,CASSYSRPSSGNTIYF",HLA-DQAB*01:03_06:02,1.164818e-07
28,"TCRBV18-01*01,CASSPGSNQPQHF",HLA-DRB1*04:01,5.88045e-07
33,"TCRBV05-01*01,CASSWGQGTGELFF",HLA-DQAB*02:01_03:03,1.885378e-06
34,"TCRBV05-01*01,CASSWGQGTGELFF",HLA-DRB1*07:01,3.831379e-14
45,"TCRBV30-01*01,CAWSRDSGSGNTIYF",HLA-DQAB*01:02_05:01,8.209775e-06


Write out a version related to HLA-II alleles in the v2 format of DeWitt_2018.

This version involves 125 simple HLA-II (DRB1, DPAB, or DQAB) only, excluding the five haplotypes. 

In [32]:
df_asso_HLA_II_no_haplotype.to_csv(\
        "../../data/intermediate_data/t12_HLA_II_no_haplotype_associated_TCR_v_alleles.csv", 
                                   index = False)

In [97]:
# stack df_asso_HLA_II_no_haplotype and the expanded results for haplotypes

tcr_expand_vec = df_asso_HLA_II_no_haplotype.tcr.tolist() + tcr_hap_vec
hla_expand_vec = df_asso_HLA_II_no_haplotype.hla_allele.tolist() + hla_hap_vec
pval_expand_vec = df_asso_HLA_II_no_haplotype.association_pvalue.tolist() + pval_hap_vec

len(pval_expand_vec)

17281

In [100]:
df_asso_HLA_II_expand = \
     pd.DataFrame(list(zip(tcr_expand_vec, hla_expand_vec, pval_expand_vec)), \
                 columns = ['tcr', 'hla_allele', 'association_pvalue'])

df_asso_HLA_II_expand.shape

(17281, 3)
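
The row counts reported in the cell outputs of this notebook fit together, which can be verified with a few lines of arithmetic. The variable names below are chosen for this sketch only; the numbers are copied from the outputs above.

```python
# Row counts copied from the cell outputs in this notebook
n_total = 20582           # df_asso
n_hla_i = 6423            # df_asso_HLA_I
n_hla_ii = 14159          # df_asso_HLA_II
n_haplotype_rows = 3122   # df_asso_HLA_II_haplotypes
n_no_haplotype = 11037    # df_asso_HLA_II_no_haplotype
n_pairs = 6244            # rows after splitting each haplotype row in two
n_expand = 17281          # df_asso_HLA_II_expand

checks = {
    "HLA-I + HLA-II = all rows": n_hla_i + n_hla_ii == n_total,
    "haplotype + simple = HLA-II": n_haplotype_rows + n_no_haplotype == n_hla_ii,
    "each haplotype row yields two rows": 2 * n_haplotype_rows == n_pairs,
    "simple + split haplotypes = expanded": n_no_haplotype + n_pairs == n_expand,
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```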

Write out the expanded version related to HLA-II alleles in the v2 format of DeWitt_2018.

This version involves 125 simple HLA-II (DRB1, DPAB, or DQAB) and 10 that are expanded from the 5 haplotypes.

In [107]:
df_asso_HLA_II_expand.to_csv("../../data/intermediate_data/t12_HLA_II_expand_associated_TCR_v_alleles.csv", index = False)

In [32]:
max(df_asso_HLA_II.association_pvalue)

8.4118895897811e-06

In [33]:
max(df_asso_HLA_I.association_pvalue)

8.39782635654917e-06

In [34]:
min(df_asso_HLA_II.association_pvalue)

6.787789201340051e-67

In [35]:
min(df_asso_HLA_I.association_pvalue)

3.68171810511809e-90
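
The four min/max cells above could be condensed into a single grouped summary. A minimal sketch on toy rows copied from the first preview of `df_asso`; the column `hla_class` and the frame `toy` are names introduced here for illustration, and the toy values are not the full-data extremes reported above.

```python
import pandas as pd

# Rows copied from the first preview of df_asso
toy = pd.DataFrame({
    "hla_allele": [
        "HLA-B*07:02",
        "HLA-C*07:02",
        "HLA-DRDQ*15:01_01:02_06:02",
        "HLA-DPAB*01:03_04:01",
        "HLA-A*03:01",
        "HLA-B*07:02",
    ],
    "association_pvalue": [
        7.159545e-23,
        6.043354e-19,
        1.48414e-08,
        1.194622e-07,
        7.44668e-08,
        9.795153e-08,
    ],
})

# Label each row by HLA class using the same five-character prefix as above
toy["hla_class"] = toy["hla_allele"].str[:5].map(
    lambda prefix: "II" if prefix == "HLA-D" else "I"
)

# One row per class, with the smallest and largest p-value in that class
summary = toy.groupby("hla_class")["association_pvalue"].agg(["min", "max"])
print(summary)
```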