# Problem 16: Finding a Protein Motif

To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
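The shorthand maps closely onto regular-expression character classes: `[XY]` is already valid regex, and `{X}` corresponds to the negated class `[^X]`. The following is a minimal sketch of that translation; the helper name `motif_to_regex` is hypothetical and not part of the original problem or solution.

```python
import re

def motif_to_regex(motif: str) -> str:
    """Translate motif shorthand into a regex string (hypothetical helper)."""
    # [XY] needs no change; {X} ("anything except X") becomes [^X].
    return re.sub(r"\{([A-Z]+)\}", r"[^\1]", motif)

pattern = motif_to_regex("N{P}[ST]{P}")
print(pattern)                              # N[^P][ST][^P]
print(bool(re.fullmatch(pattern, "NASA")))  # True
print(bool(re.fullmatch(pattern, "NPSA")))  # False: P follows N
```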

To see the complete description and features of a protein in the UniProt database, insert its access ID (`uniprot_id`) into the following URL:

http://www.uniprot.org/uniprot/uniprot_id

Alternatively, you can obtain the protein sequence in FASTA format from

http://www.uniprot.org/uniprot/uniprot_id.fasta

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.
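A FASTA record consists of a header line beginning with `>` followed by one or more lines of sequence. As a rough sketch of how such a record can be reduced to a single protein string without touching the network, the snippet below parses a made-up record; the function name `parse_fasta` and the example record are assumptions for illustration, not real UniProt data.

```python
from typing import Tuple

def parse_fasta(text: str) -> Tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0].lstrip(">")
    # Sequence lines are wrapped in the file; join them back together.
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence

# Made-up record for illustration only; not real UniProt data.
example = ">sp|X00000|DEMO_EXAMPLE hypothetical protein\nMKNASAT\nNPTQ\n"
header, sequence = parse_fasta(example)
print(header)    # sp|X00000|DEMO_EXAMPLE hypothetical protein
print(sequence)  # MKNASATNPTQ
```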

Given: At most 15 UniProt Protein Database access IDs.

Return: For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.
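Two details of the expected output are easy to miss: locations are 1-based, and matches may overlap (the sample output for P07204_TRBM_HUMAN lists both 115 and 116). One way to handle both, sketched here with a regex lookahead rather than the loop used in the solution further down, is shown below; `motif_positions` is a hypothetical name and the test string is invented.

```python
import re

# A zero-width lookahead consumes no characters, so matches that start at
# adjacent positions are both reported.
MOTIF = re.compile(r"(?=N[^P][ST][^P])")

def motif_positions(sequence: str) -> list:
    """Return 1-based start positions of all (possibly overlapping) matches."""
    return [m.start() + 1 for m in MOTIF.finditer(sequence)]

# Invented sequence: "NNTS" starts at position 2 and "NTSA" at position 3.
print(motif_positions("ANNTSA"))  # [2, 3]
```

A plain `re.finditer(r"N[^P][ST][^P]", ...)` without the lookahead would report only the first of the two overlapping matches in this example.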

Sample Dataset
>A2Z669 \
B5ZC00 \
P07204_TRBM_HUMAN \
P20840_SAG1_YEAST

Sample Output
>B5ZC00 \
85 118 142 306 395 \
P07204_TRBM_HUMAN \
47 115 116 382 409 \
P20840_SAG1_YEAST \
79 109 135 248 306 348 364 402 485 501 614

In [26]:
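import requests

def getSequence(uniprotID):
    """Download the FASTA record for a UniProt access ID and return the bare sequence."""
    target_url = "https://rest.uniprot.org/uniprot/{uniprotID}.fasta".format(uniprotID = uniprotID)
    data = requests.get(target_url)
    data.raise_for_status()  # fail loudly on an HTTP error instead of parsing an error page
    # Drop the FASTA header (first line) and join the remaining lines into one string.
    return data.text.split('\n',1)[1].replace('\n',"")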
    
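def findNGlycosylationMotif(sequence):
    """Return the 1-based start positions of every N{P}[ST]{P} match, overlaps included."""
    locations = []
    # The motif spans four residues, so the last valid start index is len(sequence) - 4.
    # (Allowing a start at len(sequence) - 3 would raise IndexError on sequence[i+3].)
    for i in range(len(sequence) - 3):
        if (sequence[i] == 'N'
                and sequence[i+1] != 'P'
                and sequence[i+2] in ('S', 'T')
                and sequence[i+3] != 'P'):
            locations.append(i+1)  # report 1-based positions
    return locations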
            
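def findMotives(uniprotIds):
    """Print each ID that has at least one motif match, followed by its locations."""
    for prot in uniprotIds.split("\n"):
        # IDs such as P07204_TRBM_HUMAN carry an entry-name suffix; only the part
        # before the first underscore is used to build the download URL.
        proteinID = prot.split("_")[0]
        locations = findNGlycosylationMotif(getSequence(proteinID))
        if locations:
            print(prot)
            print(" ".join(map(str, locations)))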

In [30]:
testInput = """A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST"""

In [31]:
findMotives(testInput)

B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614


In [28]:
actualInput = """P81428_FA10_TROCA
P01044_KNH1_BOVIN
A1TJ10
P40225_TPO_HUMAN
Q5WFN0
P00750_UROT_HUMAN
P25174
B3PYU7
B4U0J5
Q00001_RHGA_ASPAC
Q03745
P00740_FA9_HUMAN
P02749_APOH_HUMAN
A5A3H2"""

In [29]:
findMotives(actualInput)

P81428_FA10_TROCA
254
P01044_KNH1_BOVIN
47 87 168 169 197 204
A1TJ10
86 251
P40225_TPO_HUMAN
197 206 234 255 340 348
P00750_UROT_HUMAN
152 219 483
P25174
17 32 56 97 116 132 151 178 183 198 325 670
Q00001_RHGA_ASPAC
50 235 317
Q03745
272 438 506 509 550 1043 1096
P00740_FA9_HUMAN
203 213
P02749_APOH_HUMAN
162 183 193 253
