# Resolving issue with aa2genome mapping


We are experiencing issues with finding the correct genomic location for a given amino acid position. This needs to be resolved now.

**Steps:**
1. Read the PLIP output, which lists all interactions: PDB ID, chain ID, residue number, residue type.
2. Join this table with the pdb2Ensembl mapping file.
3. Look up the detailed mapping via the REST API.
4. Based on the returned mapped region, determine the precise genomic location of the interacting residue.
5. Extract the overlapping triplet from the genome.
6. Validate the translation.
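
Steps 4 to 6 can be sketched in plain Python. This is a minimal illustration and not the notebook's implementation. The helper names (`codon_genomic_positions`, `codon_from_genome`, `residue_matches`), the `cds_segments` input format, and the toy genome are all assumptions made for the example. It assumes the coding segments are given as 1-based inclusive genomic intervals in transcript order, and that the standard genetic code applies.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codon_genomic_positions(residue_number, cds_segments, strand):
    """Return the three 1-based genomic positions coding for a residue.

    cds_segments (hypothetical format): (start, end) genomic intervals,
    1-based and inclusive with start <= end, listed in transcript order.
    strand: 1 for the forward strand, -1 for the reverse strand.
    """
    first = 3 * (residue_number - 1)  # 0-based offset of the codon in the CDS
    positions = []
    offset = 0  # number of CDS bases covered by the segments seen so far
    for start, end in cds_segments:
        length = end - start + 1
        for cds_index in range(first, first + 3):
            if offset <= cds_index < offset + length:
                within = cds_index - offset
                positions.append(start + within if strand == 1 else end - within)
        offset += length
    if len(positions) != 3:
        raise ValueError("residue lies outside the provided coding segments")
    return positions


def codon_from_genome(genome, positions, strand):
    """Read the codon in transcript sense from a forward-strand sequence."""
    bases = "".join(genome[p - 1] for p in positions)
    return bases if strand == 1 else bases.translate(COMPLEMENT)


def residue_matches(codon, residue_type):
    """Check that the codon translates to the three-letter residue from PLIP."""
    return CODON_TABLE[codon] == THREE_TO_ONE[residue_type.upper()]


# Toy forward-strand example: the codon of residue 2 is split by an intron.
toy_genome = "CCATGGAGTCAGTTAACC"
toy_segments = [(3, 7), (13, 16)]
toy_positions = codon_genomic_positions(2, toy_segments, strand=1)
toy_codon = codon_from_genome(toy_genome, toy_positions, strand=1)
print(toy_positions, toy_codon, residue_matches(toy_codon, "ASP"))
```

The codon is returned as three separate positions rather than a start/end pair because a triplet can span an exon boundary, in which case its bases are not contiguous in the genome.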



In [5]:
import requests 
import pandas as pd
from pyspark.sql.types import ArrayType, StringType, IntegerType
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)



In [25]:
# Reading sifts datafile:
gene_mapping = (
    spark.read.csv('data/pdb_chain_ensembl.csv.gz', sep=',', header=True, comment='#')
    .filter(f.col('GENE_ID').startswith('ENSG0'))
    .select(
        f.col('PDB').alias('pdb_structure_id'),
        f.col('CHAIN').alias('prot_chain_id'),
        f.col('GENE_ID').alias('target_id'),
        f.col('SP_PRIMARY').alias('uniprot_id')
    )
    .distinct()
    .persist()
)

gene_mapping.show()

+----------------+-------------+---------------+----------+
|pdb_structure_id|prot_chain_id|      target_id|uniprot_id|
+----------------+-------------+---------------+----------+
|            1a02|            F|ENSG00000170345|    P01100|
|            1a1e|            A|ENSG00000197122|    P12931|
|            1awz|            A|ENSG00000214274|    P03950|
|            1bqq|            M|ENSG00000157227|    P50281|
|            1c45|            A|ENSG00000090382|    P61626|
|            1cj6|            A|ENSG00000090382|    P61626|
|            1cz8|            V|ENSG00000112715|    P15692|
|            1d0j|            C|ENSG00000127191|    Q12933|
|            1ds5|            G|ENSG00000224774|    P67870|
|            1duz|            A|ENSG00000206503|    P04439|
|            1fmk|            A|ENSG00000197122|    P12931|
|            1ft8|            C|ENSG00000162231|    Q9UBU9|
|            1gbu|            C|ENSG00000206172|    P69905|
|            1gk4|            E|ENSG0000

In [28]:
interacting_residues = (
    spark.read.csv('/Users/dsuveges/project_data/marine/plip_output.csv', sep=',', header=True)
    .join(gene_mapping, on=['pdb_structure_id', 'prot_chain_id'], how='left')
    .persist()
)

interacting_residues.show(1, vertical=True, truncate=False)

-RECORD 0------------------------------
 pdb_structure_id    | 1d2s            
 prot_chain_id       | A               
 compound_id         | DHT             
 interaction_type    | hbond           
 prot_residue_number | 65              
 prot_residue_type   | ASP             
 target_id           | ENSG00000129214 
 uniprot_id          | P04278          
only showing top 1 row



## Can the chain -> UniProt assignments be ambiguous?

- Take mapping file, group by pdb/chain, collect list of gene ids.
- Observe the output.

In [33]:
ambigous_chain_mapping = (
    gene_mapping
    .groupBy(['pdb_structure_id', 'prot_chain_id'])
    .agg(f.collect_set(f.col('target_id')).alias('gene_ids'))
    .withColumn('gene_counts', f.size(f.col('gene_ids')))
    .filter(f.col('gene_counts')  > 1)
    .orderBy('gene_counts', ascending=False)
    .persist()
)

ambigous_chain_mapping.show(1, truncate=False, vertical=True)

-RECORD 0----------------
 pdb_structure_id | 1b6u
 prot_chain_id    | A

In [35]:
ambigous_chain_mapping = (
    gene_mapping
    .groupBy(['pdb_structure_id', 'prot_chain_id'])
    .agg(f.collect_set(f.col('uniprot_id')).alias('uniprot_ids'))
    .withColumn('gene_counts', f.size(f.col('uniprot_ids')))  # despite the name, this counts UniProt IDs
    .filter(f.col('gene_counts')  > 1)
    .orderBy('gene_counts', ascending=False)
    .persist()
)

ambigous_chain_mapping.show(1, truncate=False, vertical=True)

-RECORD 0------------------------------------
 pdb_structure_id | 4ono                     
 prot_chain_id    | A                        
 uniprot_ids      | [P61769, P29017, P29016] 
 gene_counts      | 3                        
only showing top 1 row



- 10.9k structure/chain mappings to UniProt are ambiguous: more than one Ensembl gene ID is linked.
- For the structure `4ono`, the following UniProt IDs are identified: P61769, P29017, P29016.
- Fetch these protein sequences and align them with [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/).

In [36]:
%%bash

for uniprotid in P61769 P29017 P29016; do 
    curl -s https://www.uniprot.org/uniprot/${uniprotid}.fasta
    echo
done

>sp|P61769|B2MG_HUMAN Beta-2-microglobulin OS=Homo sapiens OX=9606 GN=B2M PE=1 SV=1
MSRSVALAVLALLSLSGLEAIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLL
KNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM

>sp|P29017|CD1C_HUMAN T-cell surface glycoprotein CD1c OS=Homo sapiens OX=9606 GN=CD1C PE=1 SV=2
MLFLQFLLLALLLPGGDNADASQEHVSFHVIQIFSFVNQSWARGQGSGWLDELQTHGWDS
ESGTIIFLHNWSKGNFSNEELSDLELLFRFYLFGLTREIQDHASQDYSKYPFEVQVKAGC
ELHSGKSPEGFFQVAFNGLDLLSFQNTTWVPSPGCGSLAQSVCHLLNHQYEGVTETVYNL
IRSTCPRFLLGLLDAGKMYVHRQVRPEAWLSSRPSLGSGQLLLVCHASGFYPKPVWVTWM
RNEQEQLGTKHGDILPNADGTWYLQVILEVASEEPAGLSCRVRHSSLGGQDIILYWGHHF
SMNWIALVVIVPLVILIVLVLWFKKHCSYQDIL

>sp|P29016|CD1B_HUMAN T-cell surface glycoprotein CD1b OS=Homo sapiens OX=9606 GN=CD1B PE=1 SV=1
MLLLPFQLLAVLFPGGNSEHAFQGPTSFHVIQTSSFTNSTWAQTQGSGWLDDLQIHGWDS
DSGTAIFLKPWSKGNFSDKEVAELEEIFRVYIFGFAREVQDFAGDFQMKYPFEIQGIAGC
ELHSGGAIVSFLRGALGGLDFLSVKNASCVPSPEGGSRAQKFCALIIQYQGIMETVRILL
YETCPRYLLGVLNAGKADLQRQVKPEAWLSSGPSPGPGRLQLVCHVSGFYPKPVWVMWMR
GEQEQQGTQLGDILPN

This is what the alignment shows:

```
sp|P61769|B2MG_HUMAN      ------------------------------------------------------------
sp|P29017|CD1C_HUMAN      MLFLQFLLLALLLPGGDNADASQEHVSFHVIQIFSFVNQSWARGQGSGWLDELQTHGWDS
sp|P29016|CD1B_HUMAN      MLLLPFQLLAVLFPGGNSEHAFQGPTSFHVIQTSSFTNSTWAQTQGSGWLDDLQIHGWDS
                                                                                      

sp|P61769|B2MG_HUMAN      ------------------------------------------------------------
sp|P29017|CD1C_HUMAN      ESGTIIFLHNWSKGNFSNEELSDLELLFRFYLFGLTREIQDHASQDYSKYPFEVQVKAGC
sp|P29016|CD1B_HUMAN      DSGTAIFLKPWSKGNFSDKEVAELEEIFRVYIFGFAREVQDFAGDFQMKYPFEIQGIAGC
                                                                                      

sp|P61769|B2MG_HUMAN      -----------------------------------------------------------M
sp|P29017|CD1C_HUMAN      ELHSGKSPEGFFQVAFNGLDLLSFQNTTWVPSPGCGSLAQSVCHLLNHQYEGVTETVYNL
sp|P29016|CD1B_HUMAN      ELHSGGAIVSFLRGALGGLDFLSVKNASCVPSPEGGSRAQKFCALI-IQYQGIMETVRIL
                                                                                     :

sp|P61769|B2MG_HUMAN      SR-SVALAVLALLSLSGLEAIQRTPKIQ-VYSRHPAENGKSNFLNCYVSGFHPSDIEVDL
sp|P29017|CD1C_HUMAN      IRSTCPRFLLGLLD-AGKMYVHRQVRPEAWLSSRPSLGSGQLLLVCHASGFYPKPVWVTW
sp|P29016|CD1B_HUMAN      LYETCPRYLLGVLN-AGKADLQRQVKPEAWLSSGPSPGPGRLQLVCHVSGFYPKPVWVMW
                             :    :*.:*. :*   ::*  : :   *  *: .     * *:.***:*. : *  

sp|P61769|B2MG_HUMAN      LKNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVK-WDRD
sp|P29017|CD1C_HUMAN      MRNEQEQLGTKHGDILPNADGTWYLQVILEVASEEPAGLSCRVRHSSLGGQDIILYWGHH
sp|P29016|CD1B_HUMAN      MRGEQEQQGTQLGDILPNANWTWYLRATLDVADGEAAGLSCRVKHSSLEGQDIILYWRNP
                          ::. :.   .: .*:  . : ::**    :.:  *    :***.* :*   .*:  * . 

sp|P61769|B2MG_HUMAN      M----------------------------------
sp|P29017|CD1C_HUMAN      FSMNWIALVVIVPLV-ILIVLVLWFKKHCSYQDIL
sp|P29016|CD1B_HUMAN      TSIGSIVLAIIVPSLLLLLCLALWYMRRRSYQNIP
```

Although the three sequences align very poorly, all three are returned for this structure.
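
To put a rough number on how well the sequences agree, pairwise identity can be computed from aligned rows. The sketch below is an illustration only: `percent_identity` is a hypothetical helper, and it assumes identity is counted over columns where neither sequence has a gap. The rows are copied from a single 60-column block of the alignment above, so the printed values describe that block and not the full-length alignment.

```python
from itertools import combinations


def percent_identity(row_a, row_b):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# One block of the Clustal Omega output shown above.
aligned_block = {
    "B2MG_HUMAN": "LKNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVK-WDRD",
    "CD1C_HUMAN": "MRNEQEQLGTKHGDILPNADGTWYLQVILEVASEEPAGLSCRVRHSSLGGQDIILYWGHH",
    "CD1B_HUMAN": "MRGEQEQQGTQLGDILPNANWTWYLRATLDVADGEAAGLSCRVKHSSLEGQDIILYWRNP",
}

for name_a, name_b in combinations(aligned_block, 2):
    identity = percent_identity(aligned_block[name_a], aligned_block[name_b])
    print(f"{name_a} vs {name_b}: {identity:.1f}% identity")
```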

The proti