## Matching a Spectrum to a Protein

# Searching the Protein Database

Many proteins have already been identified for a wide variety of organisms. Accordingly, there are a large number of protein databases available, and so the first step after creating a mass spectrum for an unidentified protein is to search through these databases for a known protein with a highly similar spectrum. In this manner, many similar proteins found in different species have been identified, which aids researchers in determining protein function.

In “Comparing Spectra with the Spectral Convolution”, we introduced the spectral convolution and used it to measure the similarity of simplified spectra. In this problem, we extend this idea to find the database protein most similar to a spectrum taken from an unknown protein. Our plan is to use the spectral convolution to find, for each database protein, the largest number of masses it shares with our candidate protein after shifting, and then to select the database protein with the largest such number.
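As a minimal sketch of this plan, the convolution can be computed as the multiset of all pairwise differences; the most frequent difference then gives the largest number of masses shared after a shift. The function name `max_multiplicity` and the toy integer spectra are assumptions made for illustration, and rounding to five decimal places is an assumed tolerance for comparing floating-point masses.

```python
from collections import Counter


def max_multiplicity(spectrum_a, spectrum_b, ndigits=5):
    """Return (count, shift) for the most frequent difference a - b."""
    # Spectral convolution: every pairwise difference, rounded so that
    # nearly equal floating-point values fall into the same bucket.
    differences = Counter(
        round(a - b, ndigits) for a in spectrum_a for b in spectrum_b
    )
    shift, count = differences.most_common(1)[0]
    return count, shift


# Toy example: adding 2 to each of 1, 3, 6 gives 3, 5, 8.
print(max_multiplicity([3, 5, 8], [1, 3, 6]))  # (3, 2)
```

Here the shift 2 occurs three times (3 - 1, 5 - 3, 8 - 6), so the two toy spectra share three masses after shifting.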

# Problem

The complete spectrum of a weighted string s is the multiset S[s] containing the weights of every prefix and suffix of s.

Given: A positive integer n followed by a collection of n protein strings s₁, s₂, ..., sₙ and a multiset R of positive numbers (corresponding to the complete spectrum of some unknown protein string).

Return: The maximum multiplicity of R⊖S[sₖ] taken over all strings sₖ, followed by the string sₖ for which this maximum multiplicity occurs (you may output any such value if multiple solutions exist).
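The definition of the complete spectrum can be sketched as follows. The helper name `complete_spectrum` and the toy weight table are hypothetical. The sketch also assumes that the whole string, which is both a prefix and a suffix, is counted once.

```python
def complete_spectrum(s, weights):
    """Return the sorted weights of every prefix and suffix of s."""
    spectrum = []
    for i in range(1, len(s)):
        # Splitting s at position i gives a proper prefix and a proper suffix.
        spectrum.append(sum(weights[c] for c in s[:i]))
        spectrum.append(sum(weights[c] for c in s[i:]))
    # Assumption: the full string contributes its weight a single time.
    spectrum.append(sum(weights[c] for c in s))
    return sorted(spectrum)


# Hypothetical integer weights, chosen only to keep the arithmetic readable.
toy_weights = {"A": 1, "B": 2, "C": 4}
print(complete_spectrum("ABC", toy_weights))  # [1, 3, 4, 6, 7]
```

For a real protein string, the weight table would hold the monoisotopic mass of each amino acid instead of the toy values.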

Sample Dataset

4
GSDMQS
VWICN
IASWMQS
PVSMGAD
445.17838
115.02694
186.07931
314.13789
317.1198
215.09061

Sample Output

3
IASWMQS
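To see where the 3 in the sample output comes from, the check below takes three masses from R and three suffixes of IASWMQS and computes the difference for each pair. It is a sketch rather than a full solution: the variable names are illustrative, the residue masses are those used in the solution code of this notebook, and the five-decimal rounding is an assumed tolerance.

```python
masses = {"M": 131.04049, "Q": 128.05858, "S": 87.03203}

# Suffixes S, QS, and MQS of IASWMQS, each paired with one mass from R.
pairs = [
    (186.07931, "S"),
    (314.13789, "QS"),
    (445.17838, "MQS"),
]

shifts = [
    round(r - sum(masses[c] for c in suffix), 5) for r, suffix in pairs
]
print(shifts)
```

All three differences come out to 99.04728, so this shift has multiplicity at least 3 in R⊖S[IASWMQS], matching the sample output.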

In [None]:
# https://github.com/fedeoliv/Rosalind-Problems/blob/master/prsm.py
import operator  # the operator module supplies arithmetic operations (add, sub) as functions
from collections import Counter


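def get_largest_multiplicity(s1, s2):
    # Minkowski difference s1 ⊖ s2: keep the sign of each difference,
    # because the shifts x and -x are distinct elements of the multiset.
    diff = get_values(s1, s2, oper=operator.sub)
    multiplicity = Counter(diff)  # Counter tallies how often each element occurs, stored as a dict

    # most_common(1)[0] is (value, count); reverse it to (count, value)
    return multiplicity.most_common(1)[0][::-1]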


def get_values(set1, set2, oper=operator.add):
    return [round(oper(s1, s2), 5) for s1 in set1 for s2 in set2]


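def get_substrings(string):
    # Yield every proper prefix and every suffix of the string. For i == 0,
    # string[-0:] is the whole string, so the full string is yielded exactly once.
    for i in range(len(string)):
        if i != len(string) - 1:
            yield string[:i + 1]
        yield string[-i:]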


def protein_weight(protein):
    mass_dict = {'A': 71.03711, 'C': 103.00919, 'D': 115.02694, 'E': 129.04259, 'F': 147.06841,
                 'G': 57.02146, 'H': 137.05891, 'I': 113.08406, 'K': 128.09496, 'L': 113.08406,
                 'M': 131.04049, 'N': 114.04293, 'P': 97.05276, 'Q': 128.05858, 'R': 156.10111,
                 'S': 87.03203, 'T': 101.04768, 'V': 99.06841, 'W': 186.07931, 'Y': 163.06333}

    return sum(mass_dict[p] for p in protein)


def get_max_multiplicity(proteins, rset):
    output = [0, None]
    for protein in proteins:
        pset = [protein_weight(p) for p in get_substrings(protein)]
        most, _ = get_largest_multiplicity(rset, pset)

        if most >= output[0]:
            output[0] = most
            output[1] = protein
    return output


if __name__ == "__main__":
    with open('rosalind_prsm.txt') as dataset:
        n = int(dataset.readline().rstrip())
        proteins = [dataset.readline().rstrip() for i in range(n)]
        rset = [round(float(r.rstrip()), 5) for r in dataset.readlines()]
    print("\n".join(str(x) for x in get_max_multiplicity(proteins, rset)))

