# Resolving issue with aa2genome mapping


We are experiencing issues with finding the correct genomic location for a given amino acid position. This needs to be resolved now.

**Steps:**
1. Read the PLIP output, which lists all interactions: PDB ID, chain ID, residue number, residue type.
2. Join this table with the pdb2Ensembl mapping file.
3. Look up the detailed mapping via the REST API.
4. From the returned mapped region, work out the precise genomic location of the interacting residue.
5. Extract the overlapping triplet from the genome.
6. Validate the translation.
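
Steps 4 and 5 come down to coordinate arithmetic. The sketch below is one possible way to do it, under assumptions that the source does not confirm: the mapped segment begins on a codon boundary, coordinates are 1-based and inclusive, and `strand` is 1 or -1. The function name `codon_genomic_range` and its parameters are hypothetical. The numbers in the usage lines come from one mapped segment that appears later in this notebook and only illustrate the arithmetic.

```python
def codon_genomic_range(residue_no, segment_residue_start,
                        genome_start, genome_end, strand):
    """Return (low, high) genomic coordinates of the codon for residue_no.

    Assumes the segment starts on a codon boundary and that
    genome_start <= genome_end (1-based, inclusive).
    """
    offset = (residue_no - segment_residue_start) * 3
    if offset < 0:
        raise ValueError('residue lies before the mapped segment')
    if strand == 1:
        # Forward strand: the first residue sits at the low end.
        low = genome_start + offset
        high = low + 2
    elif strand == -1:
        # Reverse strand: the first residue sits at the high end.
        high = genome_end - offset
        low = high - 2
    else:
        raise ValueError('strand must be 1 or -1')
    if low < genome_start or high > genome_end:
        raise ValueError('codon falls outside the mapped segment')
    return low, high


# Same segment, counted from either end.
forward = codon_genomic_range(115, 107, 27787826, 27787973, 1)
reverse = codon_genomic_range(115, 107, 27787826, 27787973, -1)
```

The two calls return different triplets for the same residue, so the strand of the gene decides which end of the segment the offset has to be counted from.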



In [5]:
import requests 
import pandas as pd
from pyspark.sql.types import ArrayType, StringType, IntegerType
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)



In [25]:
# Reading sifts datafile:
gene_mapping = (
    spark.read.csv('data/pdb_chain_ensembl.csv.gz', sep=',', header=True, comment='#')
    .filter(f.col('GENE_ID').startswith('ENSG0'))
    .select(
        f.col('PDB').alias('pdb_structure_id'),
        f.col('CHAIN').alias('prot_chain_id'),
        f.col('GENE_ID').alias('target_id'),
        f.col('SP_PRIMARY').alias('uniprot_id')
    )
    .distinct()
    .persist()
)

gene_mapping.show()

+----------------+-------------+---------------+----------+
|pdb_structure_id|prot_chain_id|      target_id|uniprot_id|
+----------------+-------------+---------------+----------+
|            1a02|            F|ENSG00000170345|    P01100|
|            1a1e|            A|ENSG00000197122|    P12931|
|            1awz|            A|ENSG00000214274|    P03950|
|            1bqq|            M|ENSG00000157227|    P50281|
|            1c45|            A|ENSG00000090382|    P61626|
|            1cj6|            A|ENSG00000090382|    P61626|
|            1cz8|            V|ENSG00000112715|    P15692|
|            1d0j|            C|ENSG00000127191|    Q12933|
|            1ds5|            G|ENSG00000224774|    P67870|
|            1duz|            A|ENSG00000206503|    P04439|
|            1fmk|            A|ENSG00000197122|    P12931|
|            1ft8|            C|ENSG00000162231|    Q9UBU9|
|            1gbu|            C|ENSG00000206172|    P69905|
|            1gk4|            E|ENSG0000

In [28]:
interacting_residues = (
    spark.read.csv('/Users/dsuveges/project_data/marine/plip_output.csv', sep=',', header=True)
    .join(gene_mapping, on=['pdb_structure_id', 'prot_chain_id'], how='left')
    .persist()
)

interacting_residues.show(1, vertical=True, truncate=False)

-RECORD 0------------------------------
 pdb_structure_id    | 1d2s            
 prot_chain_id       | A               
 compound_id         | DHT             
 interaction_type    | hbond           
 prot_residue_number | 65              
 prot_residue_type   | ASP             
 target_id           | ENSG00000129214 
 uniprot_id          | P04278          
only showing top 1 row



## Can the chain -> UniProt assignments be ambiguous?

- Take the mapping file, group by PDB/chain, and collect the list of gene IDs.
- Observe the output.

In [33]:
ambigous_chain_mapping = (
    gene_mapping
    .groupBy(['pdb_structure_id', 'prot_chain_id'])
    .agg(f.collect_set(f.col('target_id')).alias('gene_ids'))
    .withColumn('gene_counts', f.size(f.col('gene_ids')))
    .filter(f.col('gene_counts')  > 1)
    .orderBy('gene_counts', ascending=False)
    .persist()
)

ambigous_chain_mapping.show(1, truncate=False, vertical=True)

-RECORD 0---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 pdb_structure_id | 1b6u                                                                                                                                                                                                                                                                                                                                                                  
 prot_chain_id    | A                                                                                                                                                                                                                             

In [35]:
ambigous_chain_mapping = (
    gene_mapping
    .groupBy(['pdb_structure_id', 'prot_chain_id'])
    .agg(f.collect_set(f.col('uniprot_id')).alias('uniprot_ids'))
    .withColumn('gene_counts', f.size(f.col('uniprot_ids')))
    .filter(f.col('gene_counts')  > 1)
    .orderBy('gene_counts', ascending=False)
    .persist()
)

ambigous_chain_mapping.show(1, truncate=False, vertical=True)

-RECORD 0------------------------------------
 pdb_structure_id | 4ono                     
 prot_chain_id    | A                        
 uniprot_ids      | [P61769, P29017, P29016] 
 gene_counts      | 3                        
only showing top 1 row



- 10.9k structure/chain to UniProt mappings are ambiguous: more than one Ensembl gene ID is linked.
- For the structure `4ono`, the following UniProt IDs are identified: P61769, P29017, P29016
- Fetch these protein sequences and align them with [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/)

In [37]:
%%bash

# These are the mapped uniprot identifiers:
for uniprotid in P61769 P29017 P29016; do 
    curl -s https://www.uniprot.org/uniprot/${uniprotid}.fasta
    echo
done

# Fetching sequence from pdb:
curl -s https://www.ebi.ac.uk/pdbe/entry/pdb/4ono/fasta

>sp|P61769|B2MG_HUMAN Beta-2-microglobulin OS=Homo sapiens OX=9606 GN=B2M PE=1 SV=1
MSRSVALAVLALLSLSGLEAIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLL
KNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM

>sp|P29017|CD1C_HUMAN T-cell surface glycoprotein CD1c OS=Homo sapiens OX=9606 GN=CD1C PE=1 SV=2
MLFLQFLLLALLLPGGDNADASQEHVSFHVIQIFSFVNQSWARGQGSGWLDELQTHGWDS
ESGTIIFLHNWSKGNFSNEELSDLELLFRFYLFGLTREIQDHASQDYSKYPFEVQVKAGC
ELHSGKSPEGFFQVAFNGLDLLSFQNTTWVPSPGCGSLAQSVCHLLNHQYEGVTETVYNL
IRSTCPRFLLGLLDAGKMYVHRQVRPEAWLSSRPSLGSGQLLLVCHASGFYPKPVWVTWM
RNEQEQLGTKHGDILPNADGTWYLQVILEVASEEPAGLSCRVRHSSLGGQDIILYWGHHF
SMNWIALVVIVPLVILIVLVLWFKKHCSYQDIL

>sp|P29016|CD1B_HUMAN T-cell surface glycoprotein CD1b OS=Homo sapiens OX=9606 GN=CD1B PE=1 SV=1
MLLLPFQLLAVLFPGGNSEHAFQGPTSFHVIQTSSFTNSTWAQTQGSGWLDDLQIHGWDS
DSGTAIFLKPWSKGNFSDKEVAELEEIFRVYIFGFAREVQDFAGDFQMKYPFEIQGIAGC
ELHSGGAIVSFLRGALGGLDFLSVKNASCVPSPEGGSRAQKFCALIIQYQGIMETVRILL
YETCPRYLLGVLNAGKADLQRQVKPEAWLSSGPSPGPGRLQLVCHVSGFYPKPVWVMWMR
GEQEQQGTQLGDILPN

This is what the alignment shows:

```
sp|P29016|CD1B_HUMAN      ------------------------------------------------------------
sp|P29017|CD1C_HUMAN      ------------------------------------------------------------
sp|P61769|B2MG_HUMAN      MSRSVALAVLALLSLSGLEAIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLL
pdb|4ono|A                -------------------PIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLL
                                                                                      

sp|P29016|CD1B_HUMAN      --------------------------------------------MLL-L-----------
sp|P29017|CD1C_HUMAN      --------------------------------------------MLFLQ--FLL------
sp|P61769|B2MG_HUMAN      KNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM-
pdb|4ono|A                KNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDMG
                                                                      : :             

sp|P29016|CD1B_HUMAN      PFQLLAVLFPGGNSEHAFQGPTSFHVIQTSSFTNSTWAQTQGSGWLDDLQIHGWDSDSGT
sp|P29017|CD1C_HUMAN      ----LALLLPGGDNADASQEHVSFHVIQIFSFVNQSWARGQGSGWLDELQTHGWDSESGT
sp|P61769|B2MG_HUMAN      ------------------------------------------------------------
pdb|4ono|A                GGGSGGSGSGGGSSADASQEHVSFHVIQIFSFVNQSWARGQGSGWLDELQTHGWDSESGT
                                                                                      

sp|P29016|CD1B_HUMAN      AIFLKPWSKGNFSDKEVAELEEIFRVYIFGFAREVQDFAGDFQMKYPFEIQGIAGCELHS
sp|P29017|CD1C_HUMAN      IIFLHNWSKGNFSNEELSDLELLFRFYLFGLTREIQDHASQDYSKYPFEVQVKAGCELHS
sp|P61769|B2MG_HUMAN      ------------------------------------------------------------
pdb|4ono|A                IIFLHQWSKGQFSNEELSDLELLFRFYLFGLTREIQDHASQDYSKYPFEVQVKAGCELHS
                                                                                      

sp|P29016|CD1B_HUMAN      GGAIVSFLRGALGGLDFLSVKNASCVPSPEGGSRAQKFCALI-IQYQGIMETVRILLYET
sp|P29017|CD1C_HUMAN      GKSPEGFFQVAFNGLDLLSFQNTTWVPSPGCGSLAQSVCHLLNHQYEGVTETVYNLIRST
sp|P61769|B2MG_HUMAN      ------------------------------------------------------------
pdb|4ono|A                GGSPEGFFQVAFNGLDLLSFQQTTWVPSPGCGSLAQSVCHLLNHQYEGVTETVYNLIRST
                                                                                      

sp|P29016|CD1B_HUMAN      CPRYLLGVLNAGKADLQRQVKPEAWLSSGPSPGPGRLQLVCHVSGFYPKPVWVMWMRGEQ
sp|P29017|CD1C_HUMAN      CPRFLLGLLDAGKMYVHRQVRPEAWLSSRPSLGSGQLLLVCHASGFYPKPVWVTWMRNEQ
sp|P61769|B2MG_HUMAN      ------------------------------------------------------------
pdb|4ono|A                CPRFLLGLLDAGKMYVHRQVKPEAWLSSGPSPGPGRLQLVCHVSGFYPKPVWVMWMRGEQ
                                                                                      

sp|P29016|CD1B_HUMAN      EQQGTQLGDILPNANWTWYLRATLDVADGEAAGLSCRVKHSSLEGQDIILYWRNPTSIGS
sp|P29017|CD1C_HUMAN      EQLGTKHGDILPNADGTWYLQVILEVASEEPAGLSCRVRHSSLGGQDIILYWGHHFSMNW
sp|P61769|B2MG_HUMAN      ------------------------------------------------------------
pdb|4ono|A                EQQGTQLGDILPNAQGTWYLRATLDVADGEAAGLSCRVKHSSLEGQDIILYWHH------
                                                                                      

sp|P29016|CD1B_HUMAN      IVLAIIVPSLLLLLCLALWYMRRRSYQNIP
sp|P29017|CD1C_HUMAN      IALVVIVPLV-ILIVLVLWFKKHCSYQDIL
sp|P61769|B2MG_HUMAN      ------------------------------
pdb|4ono|A                ------------------------------
```

Although the agreement across the full alignment is very poor, all three UniProt entries are returned for this structure.

Pairwise alignments of each UniProt entry against the sequence of the structure follow:

- P29016 (T-cell surface glycoprotein CD1b) vs pdb: 

```
           200       210       220       230       240       250   
sp|P29 AGKADLQRQVKPEAWLSSGPSPGPGRLQLVCHVSGFYPKPVWVMWMRGEQEQQGTQLGDI
       :::  ..:::::::::::::::::::::::::::::::::::::::::::::::::::::
pdb|4o AGKMYVHRQVKPEAWLSSGPSPGPGRLQLVCHVSGFYPKPVWVMWMRGEQEQQGTQLGDI
             300       310       320       330       340       350 

           260       270       280       290       
sp|P29 LPNANWTWYLRATLDVADGEAAGLSCRVKHSSLEGQDIILYWRN
       ::::. ::::::::::::::::::::::::::::::::::::..
pdb|4o LPNAQGTWYLRATLDVADGEAAGLSCRVKHSSLEGQDIILYWHH
             360       370       380       390     
```

- P29017 (T-cell surface glycoprotein CD1c) vs pdb:

```
           20        30        40        50        60        70    
sp|P29 GGDNADASQEHVSFHVIQIFSFVNQSWARGQGSGWLDELQTHGWDSESGTIIFLHNWSKG
       ::..:::::::::::::::::::::::::::::::::::::::::::::::::::.::::
pdb|4o GGSSADASQEHVSFHVIQIFSFVNQSWARGQGSGWLDELQTHGWDSESGTIIFLHQWSKG
             120       130       140       150       160       170 

           80        90       100       110       120       130    
sp|P29 NFSNEELSDLELLFRFYLFGLTREIQDHASQDYSKYPFEVQVKAGCELHSGKSPEGFFQV
       .:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::
pdb|4o QFSNEELSDLELLFRFYLFGLTREIQDHASQDYSKYPFEVQVKAGCELHSGGSPEGFFQV
             180       190       200       210       220       230 

          140       150       160       170       180       190    
sp|P29 AFNGLDLLSFQNTTWVPSPGCGSLAQSVCHLLNHQYEGVTETVYNLIRSTCPRFLLGLLD
       :::::::::::.::::::::::::::::::::::::::::::::::::::::::::::::
pdb|4o AFNGLDLLSFQQTTWVPSPGCGSLAQSVCHLLNHQYEGVTETVYNLIRSTCPRFLLGLLD
             240       250       260       270       280       290 

          200       210       220       230       240       250    
sp|P29 AGKMYVHRQVRPEAWLSSRPSLGSGQLLLVCHASGFYPKPVWVTWMRNEQEQLGTKHGDI
       ::::::::::.::::::: :: : :.: ::::.:::::::::: :::.:::: ::. :::
pdb|4o AGKMYVHRQVKPEAWLSSGPSPGPGRLQLVCHVSGFYPKPVWVMWMRGEQEQQGTQLGDI
             300       310       320       330       340       350 
```

- P61769 (Beta-2-microglobulin) vs pdb

```
sp|P61 IQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDW
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
pdb|4o IQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDW
              10        20        30        40        50        60 

               90       100       110         
sp|P61 SFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM
       :::::::::::::::::::::::::::::::::::::::
pdb|4o SFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM
              70        80        90       100
```
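
To put a number on how well each UniProt entry matches the chain, identity can be computed over the aligned columns. This is a minimal sketch, assuming two equal-length aligned strings with `-` marking gaps; `percent_identity` is a hypothetical helper and not part of any alignment tool.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError('aligned sequences must have equal length')
    compared = [(a, b) for a, b in zip(aligned_a, aligned_b)
                if a != '-' and b != '-']
    if not compared:
        return 0.0
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# Last block of the P61769 vs 4ono alignment shown above.
b2m = 'SFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM'
pdb = 'SFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM'
identity = percent_identity(b2m, pdb)
```

Gapped columns are ignored, so a short perfect match and a long perfect match both score 100; the number of compared columns is worth reporting alongside the percentage.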

**Conclusion:** We cannot assume a single Ensembl gene ID per chain, so the lookup has to stay agnostic about the gene.

**Proposed steps:**

1. Look up SIFTS via the API.
2. Look through all the mappings.

In [130]:
import pandas as pd
import requests
from functools import reduce

"""
 pdb_structure_id    | 1d2s            
 prot_chain_id       | A               
 compound_id         | DHT             
 interaction_type    | hbond           
 prot_residue_number | 65              
 prot_residue_type   | ASP             
 target_id           | ENSG00000129214 
 uniprot_id          | P04278 
"""

# From PLIP output:
chain_id = 'A'
pdb_id = '3e7g'
residue_no = 115


# Fetching data from PDB API:
URL = f'https://www.ebi.ac.uk/pdbe/api/mappings/ensembl/{pdb_id}'
data = requests.get(URL).json()

all_mappings = (
    # Extract mappings for all genes:
    pd.DataFrame(reduce(lambda x, y: x + y['mappings'], data[pdb_id]['Ensembl'].values(), []))
    
    # Extract protein position:
    .assign(
        author_residue_start = lambda df: df.start.apply(lambda x: x['author_residue_number']),
        author_residue_end = lambda df: df.end.apply(lambda x: x['author_residue_number']),
        residue_start = lambda df: df.start.apply(lambda x: x['residue_number']),
        residue_end = lambda df: df.end.apply(lambda x: x['residue_number']),
    )
    
    # Selecting columns:
    [['chain_id', 'accession', 'genome_start', 'genome_end', 'author_residue_start', 
      'author_residue_end', 'residue_start', 'residue_end', 'unp_start', 'unp_end', 'translation_id']]
    
    # Dropping isoforms:
    .assign(
        accession = lambda df: df.accession.str.replace(r'-\d+', '', regex=True)
    )
    .drop_duplicates()
    
    # Adding position:
    .assign(residue_no=residue_no)
       
    # Filter for chain ID:
    .query('chain_id == @chain_id')
    
    # Filter for position match:
#     .query('(author_residue_start <= residue_no and author_residue_end >= residue_no) or (residue_start <= residue_no and residue_end >= residue_no)')
    
    # Calculate genomic position (get_position is defined in a later cell, which must be run first):
    .assign(genomic_pos = lambda df: df.apply(get_position, axis=1))
)
all_mappings

| | chain_id | accession | genome_start | genome_end | author_residue_start | author_residue_end | residue_start | residue_end | unp_start | unp_end | translation_id | residue_no | genomic_pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | P35228 | 27788979 | 27789051 | | 106.0 | 1 | 25 | 82 | 106 | ENSP00000327251 | 115 | |
| 1 | A | P35228 | 27788979 | 27789051 | | 106.0 | 1 | 25 | 82 | 106 | ENSP00000482291 | 115 | |
| 2 | A | P35228 | 27787826 | 27787973 | 107.0 | 156.0 | 26 | 75 | 107 | 156 | ENSP00000327251 | 115 | 27787850.0 |
| 3 | A | P35228 | 27787826 | 27787973 | 107.0 | 156.0 | 26 | 75 | 107 | 156 | ENSP00000482291 | 115 | 27787850.0 |
| 4 | A | P35228 | 27783106 | 27783268 | 156.0 | 210.0 | 75 | 129 | 156 | 210 | ENSP00000327251 | 115 | 27782983.0 |
| 5 | A | P35228 | 27783106 | 27783268 | 156.0 | 210.0 | 75 | 129 | 156 | 210 | ENSP00000482291 | 115 | 27782983.0 |
| 6 | A | P35228 | 27782106 | 27782196 | 211.0 | 241.0 | 130 | 160 | 211 | 241 | ENSP00000327251 | 115 | 27781818.0 |
| 7 | A | P35228 | 27782106 | 27782196 | 211.0 | 241.0 | 130 | 160 | 211 | 241 | ENSP00000482291 | 115 | 27781818.0 |
| 8 | A | P35228 | 27781177 | 27781243 | 241.0 | 263.0 | 160 | 182 | 241 | 263 | ENSP00000482291 | 115 | 27780799.0 |
| 9 | A | P35228 | 27781177 | 27781318 | 241.0 | 288.0 | 160 | 207 | 241 | 288 | ENSP00000327251 | 115 | 27780799.0 |
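
In the output above, `genomic_pos` is filled in for every segment, including segments whose residue range does not contain residue 115. The sketch below keeps only the segments that contain the residue, in the spirit of the commented-out `.query(...)` filter. It uses plain dictionaries rather than a DataFrame to stay self-contained. `segments_containing` is a hypothetical helper, and whether PLIP residue numbers correspond to the `author_residue_*` or the `residue_*` numbering is an assumption that still has to be checked.

```python
def segments_containing(segments, residue_no,
                        start_key='author_residue_start',
                        end_key='author_residue_end'):
    """Keep segments whose [start, end] residue range contains residue_no.

    Segments with a missing (None) boundary are skipped.
    """
    kept = []
    for segment in segments:
        start, end = segment.get(start_key), segment.get(end_key)
        if start is None or end is None:
            continue
        if start <= residue_no <= end:
            kept.append(segment)
    return kept


# Three of the segments shown above (one translation each).
segments = [
    {'genome_start': 27788979, 'genome_end': 27789051,
     'author_residue_start': None, 'author_residue_end': 106},
    {'genome_start': 27787826, 'genome_end': 27787973,
     'author_residue_start': 107, 'author_residue_end': 156},
    {'genome_start': 27783106, 'genome_end': 27783268,
     'author_residue_start': 156, 'author_residue_end': 210},
]
matching = segments_containing(segments, 115)
```

Only the segment covering residues 107 to 156 survives for residue 115, so the position calculation would run on one segment instead of all of them.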


In [88]:
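def get_position(row: pd.Series) -> float:
    """Genomic position of the first base of the codon for row['residue_no'].

    The offset is counted from genome_start, which assumes a forward-strand
    gene and a segment that starts on a codon boundary.
    """
    aa_in_range = row['residue_no'] - row['author_residue_start']
    base_in_range = aa_in_range * 3
    return base_in_range + row['genome_start']

def get_chr(accession: str) -> str:
    # TODO: not implemented yet.
    raise NotImplementedError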

In [90]:
# test_df = pd.read_clipboard()
test_df.head()

| | chainId | accession | genome_start | genome_end | author_residue_start | author_residue_end | pdb_struct_id | pdbCompoundId | chr | intType | resType | resNb | residue_start_genomic | start | end | seq_region_name | strand | id | calculated_aa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | P35228 | 27787826 | 27787973 | 107 | 156 | 3e7g | ZN | 17 | metal_complex | CYS | 115 | 27787850 | 27756766.0 | 27800529.0 | 17 | -1.0 | ENSG00000007171 | SER |
| 1 | A | P35228 | 27787826 | 27787973 | 107 | 156 | 3e7g | ZN | 17 | metal_complex | CYS | 110 | 27787835 | 27756766.0 | 27800529.0 | 17 | -1.0 | ENSG00000007171 | LEU |
| 2 | A | P35228 | 27779056 | 27779230 | 335 | 393 | 3e7g | AT2 | 17 | hydroph_interaction | PRO | 350 | 27779101 | 27756766.0 | 27800529.0 | 17 | -1.0 | ENSG00000007171 | LYS |
| 3 | A | P35228 | 27779056 | 27779230 | 335 | 393 | 3e7g | AT2 | 17 | hbond | GLU | 377 | 27779182 | 27756766.0 | 27800529.0 | 17 | -1.0 | ENSG00000007171 | LEU |
| 4 | A | P35228 | 27779056 | 27779230 | 335 | 393 | 3e7g | AT2 | 17 | hydroph_interaction | PHE | 369 | 27779158 | 27756766.0 | 27800529.0 | 17 | -1.0 | ENSG00000007171 | LEU |
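
In the table above, `calculated_aa` differs from `resType` in all five rows, which is exactly what the validation step is meant to catch. A minimal sketch of that check follows, assuming the codon is already given on the coding strand and that the standard genetic code applies. `translate_codon` and `codon_matches_residue` are hypothetical names, not functions from this notebook or from Biopython.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Three-letter residue names as reported by PLIP, mapped to one-letter codes.
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}


def translate_codon(codon: str) -> str:
    """Translate one DNA codon; raises KeyError for anything but A, C, G, T."""
    return CODON_TABLE[codon.upper()]


def codon_matches_residue(codon: str, residue_type: str) -> bool:
    """True if the codon encodes the given three-letter residue type."""
    return translate_codon(codon) == THREE_TO_ONE.get(residue_type.upper())


cys_ok = codon_matches_residue('TGC', 'CYS')
cys_bad = codon_matches_residue('AGC', 'CYS')
```

A check like this could be applied row by row once the triplet has been fetched, flagging every residue whose codon does not translate to the residue type reported by PLIP.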


In [133]:
from Bio.Seq import Seq

def get_res_sequence(row) -> bool:
    """Fetch the genomic region of a mapped segment, translate it and look for the target motif."""
    target = 'KSC'  # residues 113-115; Cys115 is the residue of interest
    # Hard-coded for the 3e7g example: chromosome 17, reverse strand.
    chromosome = 17
    strand = -1
    region_start = row['genome_start']
    region_end = row['genome_end']
    URL = f'https://rest.ensembl.org/sequence/region/human/{chromosome}:{region_start}..{region_end}:{strand}?content-type=text/plain'
    dna = requests.get(URL).text

    # Translation starts at the first returned base, so it is only in frame
    # if the segment begins on a codon boundary.
    protein = str(Seq(dna).translate())
    return target in protein

def get_protein(row) -> bool:
    """Fetch the protein segment from Ensembl and look for the target motif."""
    target = 'KSC'
    translation_id = row['translation_id']
    # UniProt coordinates are used as positions in the Ensembl translation.
    start = row['unp_start']
    end = row['unp_end']
    URL = f'https://rest.ensembl.org/sequence/id/{translation_id}?content-type=text/plain&start={start}&end={end}'
    seq = requests.get(URL).text
    print(seq)
    return target in seq

all_mappings.apply(get_protein, axis=1)




PRHVRIKNWGSGMTFQDTLHHKAKG
PRHVRIKNWGSGMTFQDTLHHKAKG
ILTCRSKSCLGSIMTPKSLTRGPRDKPTPPDELLPQAIEFVNQYYGSFKE
ILTCRSKSCLGSIMTPKSLTRGPRDKPTPPDELLPQAIEFVNQYYGSFKE
EAKIEEHLARVEAVTKEIETTGTYQLTGDELIFATKQAWRNAPRCIGRIQWSNLQ
EAKIEEHLARVEAVTKEIETTGTYQLTGDELIFATKQAWRNAPRCIGRIQWSNLQ
VFDARSCSTAREMFEHICRHVRYSTNNGNIR
VFDARSCSTAREMFEHICRHVRYSTNNGNIR
RSAITVFPQRSDGKHDFRVWNAQ
RSAITVFPQRSDGKHDFRVWNAQLIRYAGYQMPDGSIRGDPANVEFTQ
L
LCIDLGWKPKYGRFDVVPLVLQANGRDPELFEIPPDLVLEVAMEHPK
CIDLGWKP
N
NGRDPELFEIPPDLVLEVAMEHPK
KYEWFRELELKWYALPAVANMLLEVGGLEFPGCPFNGWYMGTEIGVRDFCDVQRYNILE
KYEWFRELELKWYALPAVANMLLEVGGLEFPGCPFNGWYMGTEIGVRDFCDVQRYNILE
EVGRRMGLETHKLASLWKDQAVVEINIAVLHSFQ
EVGRRMGLETHKLASLWKDQAVVEINIAVLHSFQ
KQNVTIMDHHSAAESFMKYMQNEYRSRGGCPADWIWLVPPMSGSITPVFHQEMLNYVLSPFYYYQ
KQNVTIMDHHSAAESFMKYMQNEYRSRGGCPADWIWLVPPMSGSITPVFHQEMLNYVLSPFYYYQ
VEAWKTHVWQDEK
VEAWKTHVWQDEK


0     False
1     False
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
dtype: bool

In [129]:
all_mappings.apply(get_res_sequence, axis=1)

0     False
2     False
4     False
6     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
17    False
19    False
21    False
dtype: bool
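
Every segment returns `False` above. One possible reason, which is an assumption and not verified here, is that a segment does not always begin on a codon boundary, so translating from its first base can be out of frame. The sketch below translates a sequence in all three forward frames so that a motif such as `KSC` can be searched for in each. The helper names are hypothetical and the example sequence is made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return ''.join(CODON_TABLE.get(dna[i:i + 3], 'X')
                   for i in range(0, usable, 3))


def translate_three_frames(dna: str) -> list:
    """Translations starting at offsets 0, 1 and 2."""
    return [translate(dna[frame:]) for frame in range(3)]


def frames_with_motif(dna: str, motif: str) -> list:
    """Offsets (0, 1, 2) whose translation contains the motif."""
    return [frame
            for frame, protein in enumerate(translate_three_frames(dna))
            if motif in protein]


# Made-up example: one extra leading base shifts KSC into frame 1.
example = 'G' + 'AAGAGCTGC'
hits = frames_with_motif(example, 'KSC')
```

If the motif shows up in frame 1 or 2 for a real segment, the segment starts mid-codon and the offset calculation has to account for that before a triplet is extracted.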