# Pattern search

- Build a motif representing the the binding region of the **Tumor necrosis factor receptor superfamily member 11A (TNFRSF11A)** (UniProt Q9Y6Q6).

- Search functionally similar human proteins 

- Identify binding partners

In [1]:
from Bio import SeqIO
import re
import numpy as np
import copy

### SLiM pattern definition 

Build a generic motif representing the TRAF6 binding region from a MSA

The MSA has been previously generated using BLAST 
and aligning the (complete) UniProt sequence against SwissProt (SP) and
Non-Redundant (NR) databases.

The position of the interacting region is known in advance, 
identified directly from the PDB (1LB5). 

- PDB positions 602-608 (...MPTEDEY...)
- UniProt positions 343-349 


In [21]:
# For each position of the fragment get the aminoacid distribution
seq_records = list(SeqIO.parse("data/Q9Y6Q6_blast_msa.txt", "fasta"))
seqs = []
for record in seq_records:
    fragment = record.seq[342:349]  # Extract the fragment corresponding to the interaction region in the query
    if set(fragment) != set("-"):  # Skip fully gapped slices
        seqs.append(list(fragment))

In [22]:
# Calculate the pattern by looking at the amino acid distribution for each position
seqs = np.array(seqs).T
for i, row in enumerate(seqs):
    unique, counts = np.unique(row, return_counts=True)
    print(i, sorted(tuple(zip(unique, counts)), key=lambda k: k[1], reverse=True))


0 [('V', 178), ('I', 128), ('M', 66), ('G', 17), ('S', 7), ('-', 4), ('Q', 4), ('L', 2), ('D', 1), ('T', 1)]
1 [('P', 395), ('D', 4), ('V', 3), ('-', 2), ('G', 2), ('S', 2)]
2 [('T', 253), ('M', 136), ('A', 4), ('I', 4), ('-', 3), ('R', 3), ('S', 3), ('G', 1), ('Q', 1)]
3 [('E', 399), ('A', 3), ('S', 3), ('-', 1), ('D', 1), ('G', 1)]
4 [('D', 391), ('N', 9), ('E', 4), ('R', 3), ('S', 1)]
5 [('E', 397), ('T', 3), ('G', 2), ('V', 2), ('A', 1), ('P', 1), ('Q', 1), ('S', 1)]
6 [('Y', 395), ('S', 6), ('R', 3), ('E', 2), ('P', 2)]


In [10]:
# Find the pattern against the Human proteome
matches = []
seq_records = list(SeqIO.parse("data/human_up000005640.fasta", "fasta"))
for record in seq_records:
#     res = re.findall("MPTEDEY", str(record.seq))  # PDB fragment
#     res = re.findall("[VIM]P[TM]EDEY", str(record.seq))  # Ambiguous chars
#     res = re.findall(".P.EDEY", str(record.seq))  # Gaps
    res = re.findall("..P.E..[FYWHDE].", str(record.seq))  # The ELM pattern
    if res:
        matches.append((record.name, res))
        
print("Number matches", len(matches))
print(matches[:10])


Number matches 6507
[('sp|Q8NB16|MLKL_HUMAN', ['LSPQELEDV', 'DCPSELREI']), ('sp|O94851|MICA2_HUMAN', ['LEPPEDQEN']), ('sp|Q8NCK7|MOT11_HUMAN', ['TPPPETGEL']), ('sp|Q96H12|MSD3_HUMAN', ['SPPEEEPEY']), ('sp|P18669|PGAM1_HUMAN', ['PPPMEPDHP']), ('sp|Q5H9R7|PP6R3_HUMAN', ['ASPFENTEN', 'DLPDEVRER', 'DVPMETTHG']), ('sp|Q8WVI7|PPR1C_HUMAN', ['HNPPEIDDK']), ('sp|Q16557|PSG3_HUMAN', ['LYPREDMEA', 'LNPRENKDV', 'NPPAEYSWT']), ('sp|Q6VN20|RBP10_HUMAN', ['QTPGEIVDA']), ('sp|Q9H2M9|RBGPR_HUMAN', ['NEPQEPEEE', 'ENPDEPKEG', 'KDPEEARFF'])]


# Exercise
Filter ELM matches retaining only those falling inside instrisically disordered regions, as they are defined in MobiDB

MobiDB output is provided in *data/mobidb_lite_human.tsv*
The (disordered) regions column (column no. 3) has the format
```
<start>..<end>,<start>..<end>, ...
```
```
Q13362  prediction-disorder-mobidb_lite 476..509        0.065   34      524
Q9Y3L3  prediction-disorder-mobidb_lite 1..23,160..182,496..701 0.359   252     701
P78314  prediction-disorder-mobidb_lite 160..316,333..451       0.492   276     561
Q7L8J4  prediction-disorder-mobidb_lite 1..58,273..332,362..393 0.382   150     393
```

In [29]:
# Parse MobiDB-lite output
disorder = {}
with open("data/mobidb_lite_human.tsv") as f:
    for line in f:
        uniprot_id, _, regions = line.strip().split("\t")[:3]
        for region in regions.split(","):
            start, end = region.split("..")
            disorder.setdefault(uniprot_id, []).append((int(start), int(end)))

In [38]:
# Find the pattern against the Human proteome
matches = []
seq_records = list(SeqIO.parse("data/human_up000005640.fasta", "fasta"))
for record in seq_records:
    uniprot_id = record.name.split("|")[1]
    if disorder.get(uniprot_id):
        # Use search because it return a match object including start, end positions
        res = re.search("..P.E..[FYWHDE].", str(record.seq))  # The ELM pattern
        if res:
            for start, end in disorder[uniprot_id]:
                if res.start() >= start and end <= res.end():
                    matches.append((record.name, res))
                    break
        
print("Number matches", len(matches))
print(matches[:10])

Number matches 2490
[('sp|Q8NCK7|MOT11_HUMAN', <re.Match object; span=(440, 449), match='TPPPETGEL'>), ('sp|Q6VN20|RBP10_HUMAN', <re.Match object; span=(206, 215), match='QTPGEIVDA'>), ('sp|Q1RMZ1|SAMTR_HUMAN', <re.Match object; span=(231, 240), match='SLPGELFHV'>), ('sp|P35498|SCN1A_HUMAN', <re.Match object; span=(73, 82), match='SEPLEDLDP'>), ('sp|A3KN83|SBNO1_HUMAN', <re.Match object; span=(225, 234), match='DEPEEEDEE'>), ('sp|Q9Y6H5|SNCAP_HUMAN', <re.Match object; span=(306, 315), match='QGPEERSEY'>), ('sp|Q9UNP4|SIAT9_HUMAN', <re.Match object; span=(27, 36), match='AMPSEYTYV'>), ('sp|Q9BUA3|SPNDC_HUMAN', <re.Match object; span=(256, 265), match='AAPAEVRHF'>), ('sp|P28370|SMCA1_HUMAN', <re.Match object; span=(534, 543), match='QTPHEERED'>), ('sp|P32745|SSR3_HUMAN', <re.Match object; span=(283, 292), match='PLPEEPAFF'>)]


### Binding surface pattern
From the PDB is possible to identify some residues concetrated in a specific
strand of the beta sheet which are interacting with the SLiM 
- PDB positions 466-473
- UniProt positions?

Looking at the PDB website chain B is the short peptide while chain A is the structured domain (TNF receptor-associated factor 6 (TRAF 6)) which map to Q9Y4K3.

The positions of the binding surface in the PDB coincide with UniProt 466-473.  

In [23]:
# For each position of the fragment get the aminoacid distribution
seq_records = list(SeqIO.parse("data/Q9Y4K3_blast_msa.txt", "fasta"))
seqs = []
for record in seq_records:
    fragment = record.seq[465:473]  # Extract the fragment corresponding to the interaction region in the query
    if set(fragment) != set("-"):  # Skip fully gapped slices
        seqs.append(list(fragment))

In [24]:
# Calculate the pattern by looking at the amino acid distribution for each position
seqs = np.array(seqs).T
for i, row in enumerate(seqs):
    unique, counts = np.unique(row, return_counts=True)
    print(i, sorted(tuple(zip(unique, counts)), key=lambda k: k[1], reverse=True))


0 [('R', 499)]
1 [('N', 499)]
2 [('P', 499)]
3 [('K', 499)]
4 [('G', 499)]
5 [('F', 495), ('-', 3), ('S', 1)]
6 [('G', 495), ('-', 3), ('Y', 1)]
7 [('Y', 495), ('-', 3), ('R', 1)]


In [28]:
# Find the pattern against the Human proteome
matches = []
seq_records = list(SeqIO.parse("data/human_up000005640.fasta", "fasta"))
for record in seq_records:
#     res = re.findall("RNPKGFGY", str(record.seq))  # PDB region
    res = re.findall("PK.F.Y", str(record.seq))  # PDB contacts
    if res:
        matches.append((record.name, res))
        
print("Number matches", len(matches))
print(matches[:10])


Number matches 31
[('sp|Q5VT52|RPRD2_HUMAN', ['FIPKSFNY']), ('sp|Q9HB90|RRAGC_HUMAN', ['SFPKDFGY']), ('sp|Q9HCX4|TRPC7_HUMAN', ['PSPKSFYY']), ('sp|Q96DY7|MTBP_HUMAN', ['ILPKVFHY']), ('sp|Q86U42|PABP2_HUMAN', ['GHPKGFAY']), ('sp|A6NM11|L37A2_HUMAN', ['PEPKSFNY']), ('sp|Q6R327|RICTR_HUMAN', ['SIPKGFSY']), ('sp|Q86XF7|ZN575_HUMAN', ['DCPKAFSY', 'DCPKSFCY']), ('sp|P28068|DMB_HUMAN', ['GTPKDFTY']), ('sp|Q6ZSU1|C2G1P_HUMAN', ['KDPKYFRY'])]
