In addition to the 20 proteinogenic amino acids, NRPs contain non-proteinogenic amino acids, which expand the number of possible building blocks for antibiotic peptides from 20 to over 100. Enlarging the amino acid alphabet spells trouble for our current approach to cyclopeptide sequencing. The correct peptide now must “compete” with many more incorrect ones for a place on the leaderboard, increasing the chance that the correct peptide will be cut along the way.

Because so many non-proteinogenic amino acids exist, bioinformaticians often assume that any integer between 57 and 200 may represent the mass of an amino acid; the “lightest” amino acid, Gly, has mass 57 Da, and most amino acids have masses smaller than 200 Da. This assumption has a pitfall: the dipeptide Gly-Gly has mass 114 Da, so we may wrongly conclude that 114 Da is the mass of a single amino acid. More generally, the mass of any dipeptide lighter than 200 Da can be mistaken for the mass of an amino acid.
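To make the dipeptide pitfall concrete, the following sketch lists every two-residue mass that falls inside the 57 to 200 Da window. It assumes the usual integer mass table of the standard amino acids, which is not given in these notes; the names `STANDARD_MASSES` and `dipeptide_masses_in_range` are hypothetical and chosen only for illustration.

```python
from itertools import combinations_with_replacement

# Assumption: integer masses of the standard amino acids (18 distinct values,
# because I/L share 113 and K/Q share 128). This table is not part of the notes.
STANDARD_MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                   128, 129, 131, 137, 147, 156, 163, 186]

def dipeptide_masses_in_range(masses, low=57, high=200):
    """Return the sorted distinct masses of 2-residue peptides within [low, high]."""
    return sorted({a + b
                   for a, b in combinations_with_replacement(masses, 2)
                   if low <= a + b <= high})

# Every value here could be misread as the mass of a single amino acid.
ambiguous = dipeptide_masses_in_range(STANDARD_MASSES)

# Dipeptide masses that coincide exactly with a standard amino acid mass,
# e.g. Gly-Gly (57 + 57) equals the mass of Asn (114).
collisions = sorted(set(ambiguous) & set(STANDARD_MASSES))
print(collisions)
```

Even with only the standard alphabet, several dipeptide masses land exactly on an amino acid mass, so a mass in the window is not by itself evidence of a single residue.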

Since LEADERBOARDCYCLOPEPTIDESEQUENCING fails to identify the correct peptide even with only 10% false and missing masses, our stated aim from the previous section is now even more important: we must determine the amino acid composition of a peptide from its spectrum so that we can run LEADERBOARDCYCLOPEPTIDESEQUENCING on this smaller alphabet of amino acids.

One way to determine the amino acid composition of a peptide from its experimental
spectrum would be to take the smallest masses present in the spectrum (between 57 and
200 Da). However, even if only a single amino acid mass is missing, then this approach
will fail to reconstruct the peptide’s amino acid composition.
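A small illustration of why the "smallest masses" approach is fragile. The peptide used here (cyclic NQEL, with integer masses 114-128-129-113) is an assumed example that does not come from the notebook's data, and `masses_in_amino_acid_range` is a hypothetical helper name.

```python
def masses_in_amino_acid_range(spectrum, low=57, high=200):
    """Naive composition guess: every spectrum mass inside [low, high]."""
    return sorted(m for m in spectrum if low <= m <= high)

# Assumed example: theoretical cyclospectrum of the cyclic peptide NQEL.
theoretical = [0, 113, 114, 128, 129, 227, 242, 242, 257,
               355, 356, 370, 371, 484]

# Simulate an experimental spectrum in which the single mass 113 is missing.
experimental = [m for m in theoretical if m != 113]

complete_guess = masses_in_amino_acid_range(theoretical)
damaged_guess = masses_in_amino_acid_range(experimental)
print(complete_guess)  # all four residues are found
print(damaged_guess)   # the residue with mass 113 is lost
```

With the complete spectrum the guess contains all four residues, but one missing peak is enough to drop a residue from the guessed composition.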

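The convolution of a spectrum is the collection of all positive differences between pairs of its masses. We now should feel confident about using the most frequently appearing integers in the convolution as a guess for the amino acid composition of an unknown peptide.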

The most meaningful differences are those between the mass of a k-mer and the mass of a (k-1)-mer, because they give the masses of the amino acids that make up the peptide. It can also make sense to subtract the mass of a (k-l)-mer, l > 2, from the mass of a k-mer: if the spectrum contains a false mass for some k-mer, such differences can recover the true mass, which can then be used to compute amino acid masses. The same idea can be used to look for false masses in the spectrum.
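The sketch below shows how pairwise differences can recover an amino acid mass that is absent from the spectrum itself. It uses an assumed example that is not part of the notebook's data: the cyclospectrum of cyclic NQEL (114-128-129-113) with the mass 113 removed. The function name `spectral_convolution` is hypothetical and is independent of the notebook's own `Convolution`.

```python
from collections import Counter

def spectral_convolution(spectrum):
    """Count every positive difference between two masses of the spectrum."""
    masses = sorted(spectrum)
    return Counter(b - a
                   for i, a in enumerate(masses)
                   for b in masses[i + 1:]
                   if b > a)

# Assumed example: cyclospectrum of NQEL with the mass 113 missing.
experimental = [0, 114, 128, 129, 227, 242, 242, 257,
                355, 356, 370, 371, 484]

conv = spectral_convolution(experimental)

# Restrict to plausible amino acid masses and take the four most frequent.
in_range = Counter({m: c for m, c in conv.items() if 57 <= m <= 200})
top_four = sorted(m for m, _ in in_range.most_common(4))
print(conv[113], top_four)
```

Although 113 never appears as a peak, it occurs 7 times as a difference (for instance 227 - 114 and 484 - 371), so it still ranks among the four most frequent values in the 57 to 200 window.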

In [118]:
import numpy as np

In [119]:
def PeptideIntegerMass(peptide):
  return sum(peptide) #return sum of all integer masses of peptide

In [120]:
def ParentMass(spectrum):
  return max(spectrum) #we assume that the biggest mass in spectrum is the mass of peptide

In [121]:
def Expand(candidate_linear_peptides,aminoacids_integer_masses):
  return [(candidate_linear_peptides[i] + [aminoacid_integer_mass]) for i in range(len(candidate_linear_peptides)) for aminoacid_integer_mass in aminoacids_integer_masses]

In [122]:
def FindAllKmers(peptide_integer_masses,k):
  kmers_list = []
  i = 0
  while i + k - 1 <= len(peptide_integer_masses)-1:
    kmers_list.append(peptide_integer_masses[i:i+k])
    i = i + 1
  return kmers_list

In [123]:
def FindAllKmersCyclicPpetide(peptide_integer_masses,k):
  kmers_list = []
  i = 0
  while i + k - 1 <= len(peptide_integer_masses)-1:
    kmers_list.append(peptide_integer_masses[i:i+k])
    i = i + 1
  while i <= len(peptide_integer_masses) - 1:
    kmers_list.append(peptide_integer_masses[i:len(peptide_integer_masses)] + peptide_integer_masses[0:k-len(peptide_integer_masses[i:len(peptide_integer_masses)])])
    i = i + 1
  return kmers_list

In [124]:
def GenerateAllSubpeptidesIntegerMassesLinearPeptide(peptide_integer_masses):
  subpeptides_integer_masses = []
  for subpeptide_length in range(len(peptide_integer_masses)):
    for subpeptide_integer_mass in FindAllKmers(peptide_integer_masses,subpeptide_length+1):
      subpeptides_integer_masses.append(sum(subpeptide_integer_mass))
  return subpeptides_integer_masses

In [125]:
def GenerateAllSubpeptidesIntegerMassesCyclicPeptide(peptide_integer_masses):
  subpeptides_integer_masses = []
  for subpeptide_length in range(len(peptide_integer_masses)):
    for subpeptide_integer_mass in FindAllKmersCyclicPpetide(peptide_integer_masses,subpeptide_length+1):
      subpeptides_integer_masses.append(sum(subpeptide_integer_mass))
  return subpeptides_integer_masses

In [126]:
from collections import Counter

In [127]:
def LinearPeptideScoring(spectrum,linear_peptide_spectrum): #linear_peptide_spectrum is the peptide (list of amino acid masses), scored as a linear peptide against spectrum
  theoretical_counter_dict = Counter(GenerateAllSubpeptidesIntegerMassesLinearPeptide(linear_peptide_spectrum))
  spectrum_counter_dict = Counter(spectrum)
  #a mass occurring a times in the theoretical spectrum and b times in the experimental spectrum contributes min(a,b);
  #Counter & Counter keeps exactly that minimum count for every shared mass
  return sum((theoretical_counter_dict & spectrum_counter_dict).values())

In [128]:
def CyclicPeptideScoring(spectrum,linear_peptide_spectrum): #linear_peptide_spectrum is the peptide (list of amino acid masses), scored as a cyclic peptide against spectrum
  theoretical_counter_dict = Counter(GenerateAllSubpeptidesIntegerMassesCyclicPeptide(linear_peptide_spectrum))
  spectrum_counter_dict = Counter(spectrum)
  #a mass occurring a times in the theoretical spectrum and b times in the experimental spectrum contributes min(a,b);
  #the parent mass is generated once per rotation, but the minimum caps it at its count in the experimental spectrum
  return sum((theoretical_counter_dict & spectrum_counter_dict).values())

In [129]:
def ListToString(linear_peptide):
  string = ''
  for aminoacid_integer_mass in linear_peptide:
    string = string + str(aminoacid_integer_mass) + '-'
  return string[0:len(string)-1]

In [130]:
def StringToList(linear_peptide_string):
  linear_peptide_list = linear_peptide_string.split('-')
  for i in range(len(linear_peptide_list)):
    linear_peptide_list[i] = int(linear_peptide_list[i])
  return linear_peptide_list

In [131]:
def Trim(leaderboard,spectrum,N): #peptides in leaderboard are scored as linear peptides
  leaderboard_scores_dict = {}
  for linear_peptide in leaderboard:
    leaderboard_scores_dict.update({ListToString(linear_peptide):LinearPeptideScoring(spectrum,linear_peptide)})
  sorted_linear_peptides = sorted(leaderboard_scores_dict.keys(), key=leaderboard_scores_dict.get, reverse=True)
  top_n_peptides = sorted_linear_peptides[0:N] #last index is N-1, next index is N
  i = N
  for i in range(N,len(sorted_linear_peptides)):
    if leaderboard_scores_dict[sorted_linear_peptides[i]] == leaderboard_scores_dict[top_n_peptides[len(top_n_peptides)-1]]:
      top_n_peptides.append(sorted_linear_peptides[i])
    else:
      break
  for i in range(len(top_n_peptides)):
    top_n_peptides[i] = StringToList(top_n_peptides[i])
  return top_n_peptides

In [132]:
def LeaderboardCycloPeptideSequencing(spectrum,N,aminoacids_integer_masses):
  aminoacids_integer_masses = [int(mass) for mass in aminoacids_integer_masses] #plain Python ints, even if the alphabet holds NumPy integers
  leaderboard = [[]]
  leader_peptide = []
  leader_score = CyclicPeptideScoring(spectrum,leader_peptide)
  cyclic_peptide_mass = ParentMass(spectrum)
  while len(leaderboard) > 0:
    leaderboard = Expand(leaderboard,aminoacids_integer_masses) #branching
    kept_peptides = []
    for peptide in leaderboard:
      peptide_mass = PeptideIntegerMass(peptide)
      if peptide_mass > cyclic_peptide_mass: #drop peptides heavier than the cyclic peptide
        continue
      if peptide_mass == cyclic_peptide_mass: #candidate answer: score it as a cyclic peptide against the spectrum
        peptide_score = CyclicPeptideScoring(spectrum,peptide)
        if peptide_score > leader_score: #keep the highest-scoring peptide seen so far
          leader_peptide = peptide
          leader_score = peptide_score
      kept_peptides.append(peptide)
    leaderboard = Trim(kept_peptides,spectrum,N) #bounding
  return leader_peptide

In [133]:
def PrintPeptide(peptide):
  string_to_print = ''
  for aminoacid_integer_mass in peptide:
    string_to_print = string_to_print + str(aminoacid_integer_mass) + '-'
  print(string_to_print[0:len(string_to_print)-1])

In [170]:
def Convolution(spectrum):
  positive_differences = []
  i = 0
  while i <= len(spectrum) - 1:
    for aminoacid_integer_mass in spectrum[0:i]:
      if spectrum[i] - aminoacid_integer_mass > 0:
        positive_differences.append(spectrum[i] - aminoacid_integer_mass)
    i = i + 1
  positive_differences_counter_dict = Counter(positive_differences)
  positive_differences_sorted = sorted(positive_differences_counter_dict,key=positive_differences_counter_dict.get,reverse=True)
  return [positive_differences_counter_dict,positive_differences_sorted]

In [306]:
def SelectMMostFrequent(counter_dict,convolution,M):
  m_most_frequent = []
  convolution = np.array(convolution)
  convolution = convolution[np.logical_and(convolution >= 57, convolution <= 200)]
  m_most_frequent.extend(list(convolution[0:M]))
  m_most_frequent.extend([convolution[i] for i in range(M,len(convolution)) if counter_dict[convolution[i]] == counter_dict[m_most_frequent[M-1]]])
  return m_most_frequent

Given an experimental spectrum, we first compute its convolution. We then select the M most frequent elements between 57 and 200 in the convolution to form an extended alphabet of amino acid masses. In order to be fair, we should include the top M elements of the convolution “with ties”. Finally, we run LEADERBOARDCYCLOPEPTIDESEQUENCING, where amino acid masses are restricted to this alphabet.
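As a worked illustration of the "with ties" rule, here is a minimal pure-Python sketch. The helper name `top_m_with_ties` and the difference counts are made up for the example; this is not the notebook's `SelectMMostFrequent`, only one way the rule could be written.

```python
from collections import Counter

def top_m_with_ties(counts, m, low=57, high=200):
    """Return the m most frequent masses in [low, high], plus any masses
    tied with the m-th one. `counts` maps a mass to its frequency."""
    ranked = sorted(((mass, c) for mass, c in counts.items()
                     if low <= mass <= high),
                    key=lambda pair: pair[1], reverse=True)
    if m <= 0 or not ranked:
        return []
    # Frequency of the m-th ranked mass (or of the last one if fewer exist).
    cutoff = ranked[min(m, len(ranked)) - 1][1]
    return [mass for mass, c in ranked if c >= cutoff]

# Hypothetical convolution counts; 15 and 227 lie outside the mass window.
counts = Counter({113: 7, 114: 7, 128: 8, 129: 6, 98: 2, 99: 2, 15: 9, 227: 5})

alphabet_m2 = top_m_with_ties(counts, 2)
alphabet_m5 = top_m_with_ties(counts, 5)
print(alphabet_m2)
print(alphabet_m5)
```

With M = 2 the alphabet has three masses, because 113 and 114 are tied for second place; with M = 5 both 98 and 99 are kept for the same reason. Cutting strictly at M would make the result depend on an arbitrary ordering of equally frequent masses.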

In [289]:
M = 20

In [290]:
N = 60

In [291]:
spectrum = [57, 57, 71, 99, 129, 137, 170, 186, 194, 208, 228, 265, 285, 299, 307, 323, 356, 364, 394, 422, 493]

In [292]:
spectrum = sorted(spectrum)

In [293]:
counter_dict,convolution = Convolution(spectrum)

In [None]:
aminoacids_integer_masses = SelectMMostFrequent(counter_dict,convolution,M)

In [296]:
PrintPeptide(LeaderboardCycloPeptideSequencing(spectrum,N,aminoacids_integer_masses))

99-71-137-57-57-72


In [311]:
M = 17

In [312]:
N = 384

In [313]:
with open('/content/rosalind_ba4i.txt') as task_file:
  spectrum = [line.rstrip() for line in task_file]

In [314]:
spectrum = spectrum[0]

In [315]:
spectrum = spectrum.split(' ')

In [316]:
for i in range(len(spectrum)):
  spectrum[i] = int(spectrum[i])

In [318]:
spectrum = sorted(spectrum)

In [319]:
counter_dict,convolution = Convolution(spectrum)

In [320]:
aminoacids_integer_masses = SelectMMostFrequent(counter_dict,convolution,M)

In [323]:
PrintPeptide(LeaderboardCycloPeptideSequencing(spectrum,N,aminoacids_integer_masses))

163-103-163-128-114-57-129-87-163-131-114-97-186-103-156
