<a href="https://colab.research.google.com/github/MonikaSpakova/MS_Spectra/blob/main/Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2 - MS Spectra

This exercise works with the `synthetic_search_results.xml` file.

The file provides the following fields:
* recordId - unique record identifier of the chemical substance in the library;
* InChIKey - hashed International Chemical Identifier (InChI) of the chemical substance;
* spectrumCollectionId - unique identifier of the spectrum tree collection in the library;
* spectrumId - unique spectrum identifier in the library;
* metadata - experimental conditions under which the spectrum was acquired;
* precursorPeak - information about the precursor ion (m/z, accuracy and intensity);
* hits - candidates returned by the spectral library search, with match scores;
* peaks - information about the ions produced (m/z, accuracy and intensity).

## Library

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/Colab Notebooks/Data

/content/drive/MyDrive/Colab Notebooks/Data


In [3]:
# Directory checker
from os import getcwd, listdir
from os.path import isfile

# XML interaction
import lxml
from lxml import etree as et
import xml.dom.minidom

# HTML printout
from IPython.display import display, HTML

# XML printout
from bs4 import BeautifulSoup

# String 
from io import StringIO

from tabulate import tabulate
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

In [4]:
if not isfile("synthetic_search_results.xml"):
  raise RuntimeError(f"Please load the synthetic_search_results.xml file before proceeding.\n"
                     f"load the file in current dir: '{getcwd()}' "
                     f"with following content:\n{listdir()}")

Read the contents of the XML file into the variable `data`.

In [5]:
with open('synthetic_search_results.xml', 'r') as f:
    data = f.read()

Define helper functions that apply an XSLT stylesheet from Python and publish the result as HTML or XML.

In [6]:
def transform_xml(xml, xsl):
  # xsl may be a path to a stylesheet file or the stylesheet source as a string
  if isfile(xsl):
    xslt_doc = et.parse(xsl)
  else:
    xslt_doc = et.XML(xsl)
  
  xslt_transformer = et.XSLT(xslt_doc)

  # Parse the source document and apply the stylesheet to it
  source_doc = et.parse(xml)
  output_doc = xslt_transformer(source_doc)
  return output_doc

def publish_html(output_doc, file=None):
  if file:
    if file.endswith(".html"):
      output_doc.write(file)
  else:
    display(HTML(str(output_doc)))


def publish_xml(output_doc, file=None):
  if file:
    if file.endswith(".xml"):
      output_doc.write(file)
  else:
    data = BeautifulSoup(et.tostring(output_doc), "xml")
    print(data.prettify())

def save_xsl(s, file=None):
  if file.endswith(".xsl"):
    if hasattr(s, 'write'):
      xslt_doc = s
    else:
      xslt_doc = et.XML(s).getroottree()
    xslt_doc.write(file)
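
The core of these helpers can be tried on an in-memory document. This is a minimal sketch, assuming `lxml` is installed; the two-element toy document and the text-output stylesheet are made up for illustration and are not part of the assignment data.

```python
from lxml import etree as et

# Toy source document (hypothetical): two empty <spectra> elements
xml_doc = et.XML("<library><spectra/><spectra/></library>")

# Stylesheet that writes the number of <spectra> elements as plain text
xsl_doc = et.XML('''\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="count(//spectra)"/>
  </xsl:template>
</xsl:stylesheet>''')

transform = et.XSLT(xsl_doc)   # compile the stylesheet once
result = transform(xml_doc)    # apply it to the source document
spectra_count = str(result).strip()
print(spectra_count)
```

Compiling the stylesheet once and reusing the transformer is convenient when the same stylesheet is applied to several documents.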

# Runtime XSL

In [7]:
s = '''\
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
<xsl:output method="html"/>

<xsl:template match="/">
<html>
<head></head>
<body>
    <xsl:apply-templates/>
</body>
</html>
</xsl:template>

<xsl:template match="*">
<div>
<h1>Number of spectra:
    <xsl:value-of select="count(//spectra)"/>
</h1>
</div>

<div>
<h1>Number of hits:
    <span ><xsl:value-of select="count(//hits)"/></span>
</h1>
</div>

<!-- A hit is correct when its InChIKey_hits equals the InChIKey found three levels above the hit -->
<div>
<h1>Correct hits:
    <xsl:value-of select="count(//hits[InChIKey_hits = ../../../InChIKey])"/>
</h1>
</div>

<div>
<h1>Incorrect hits:
    <xsl:value-of select="count(//hits[not(InChIKey_hits = ../../../InChIKey)])"/>
</h1>
</div>
</xsl:template>

</xsl:stylesheet>
'''

# Run HTML

In [8]:
output_doc = transform_xml("synthetic_search_results.xml", s)
publish_html(output_doc)

# Save

In [9]:
save_xsl(s, "synthetic_search_results_html.xsl")

# Python method approach

# I. Read XML file into the selected workspace

Parsing our xml file

In [10]:
tree = et.parse('synthetic_search_results.xml')
root = tree.getroot()

# II.  Make statistics on compounds and hits

## A.   Number of spectra



In [11]:
number_spectra = len(root.xpath(".//spectra"))
print("Number of spectra: ",number_spectra)

Number of spectra:  1900


## B. Total number of hits and two columns for a count of correct, incorrect



In [12]:
number_hits = len(root.xpath(".//hits"))
print("Total number of hits: ",number_hits)

Total number of hits:  5636


In [13]:
count_correct = 0     # count how many correct hits have been in data
check = 0             # verification of correct loop creation   
count_incorrect = 0   # incorrect hit can be calculated from the total minus the correct one, but we want to check again if we have the correct loop

for hit in root.findall('.//InChIKey_hits'):
    InChIKey = (hit.xpath("../../../../InChIKey"))[0].text
    
    if hit.text == InChIKey:
      count_correct+=1
    else:
      count_incorrect+=1

    check+=1

In [14]:
mydata = [check, count_correct, check-count_correct]
mydata, count_incorrect

([5636, 2833, 2803], 2803)

In [15]:
print(tabulate([mydata], ['Number of hits','Correct hits','Incorrect hits']))

  Number of hits    Correct hits    Incorrect hits
----------------  --------------  ----------------
            5636            2833              2803
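
The same count can also be obtained by walking down from each compound instead of climbing up from each hit with `../../../../InChIKey`. The sketch below uses the standard-library `xml.etree.ElementTree`; the toy document, the `<library>` root and the `<collection>` wrapper are assumptions made for illustration, and the element name `records` is taken from the XSLT attempt above.

```python
import xml.etree.ElementTree as ET

# Toy document; the nesting and the <collection> wrapper are hypothetical
TOY_XML = """<library>
  <records>
    <InChIKey>42</InChIKey>
    <collection>
      <spectra>
        <hits><InChIKey_hits>42</InChIKey_hits></hits>
        <hits><InChIKey_hits>13</InChIKey_hits></hits>
      </spectra>
    </collection>
  </records>
</library>"""

def count_hits(root):
    """Return (correct, incorrect) hit counts over all compounds."""
    correct = incorrect = 0
    for record in root.iter("records"):
        key = record.findtext("InChIKey")
        # iter() visits every <hits> element below this compound
        for hit in record.iter("hits"):
            if hit.findtext("InChIKey_hits") == key:
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect

toy_counts = count_hits(ET.fromstring(TOY_XML))
print(toy_counts)  # (1, 1)
```

Walking downwards does not depend on how many levels separate a hit from its compound, which makes it less sensitive to changes in nesting.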


## C.  An average score of correct and incorrect candidates

In [16]:
def Average(lst):
    mean = 0
    if len(lst) > 0:
      array = [float(numeric_string) for numeric_string in lst]
      mean = np.mean(array)
    else:
      mean = np.nan

    return mean

In [17]:
columns = ["InChIKey","Cosine","Denver","Nist","Cosine","Denver","Nist"]
rows = []        # one row of average scores per compound
correct = 0
incorrect = 0

for key in root.findall('.//InChIKey'):
    # Scores of the correct and incorrect hits of this compound
    cosineMatch_correct, denverMatch_correct, nistMatch_correct = ([] for i in range(3))
    cosineMatch_incorrect, denverMatch_incorrect, nistMatch_incorrect = ([] for i in range(3))

    for hit in key.xpath("..//hits"):
      if key.text == hit.xpath("./InChIKey_hits")[0].text:
        correct += 1
        cosineMatch_correct.append(hit.xpath("./cosineMatch")[0].text)
        denverMatch_correct.append(hit.xpath("./denverMatch")[0].text)
        nistMatch_correct.append(hit.xpath("./nistMatch")[0].text)
      else:
        incorrect += 1
        cosineMatch_incorrect.append(hit.xpath("./cosineMatch")[0].text)
        denverMatch_incorrect.append(hit.xpath("./denverMatch")[0].text)
        nistMatch_incorrect.append(hit.xpath("./nistMatch")[0].text)

    rows.append([int(key.text),Average(cosineMatch_correct),Average(denverMatch_correct),Average(nistMatch_correct),
         Average(cosineMatch_incorrect),Average(denverMatch_incorrect),Average(nistMatch_incorrect)])

# Build the frame once; DataFrame.append was removed in pandas 2.0
average_array = pd.DataFrame(rows, columns=columns)

In [18]:
correct, incorrect, correct+incorrect

(2833, 2803, 5636)

In [19]:
header=[["InChIKey","Correct","Correct","Correct","Incorrect","Incorrect","Incorrect"],
        ["InChIKey","Cosine","Denver","Nist","Cosine","Denver","Nist"]]
average_array.columns=header
average_array = average_array.set_index(['InChIKey'])

In [20]:
average_array[25:40]

| InChIKey | Correct Cosine | Correct Denver | Correct Nist | Incorrect Cosine | Incorrect Denver | Incorrect Nist |
| --- | --- | --- | --- | --- | --- | --- |
| (92,) | 0.173391 | 0.297721 | 0.059576 | NaN | NaN | NaN |
| (65,) | 0.172869 | 0.256271 | 0.065262 | 0.10562 | 0.185461 | 0.041757 |
| (73,) | 0.04179 | 0.234856 | 0.02396 | 0.174395 | 0.341834 | 0.12849 |
| (91,) | 0.132414 | 0.429023 | 0.102745 | 0.267406 | 0.432996 | 0.224045 |
| (29,) | 0.368176 | 0.460361 | 0.129833 | NaN | NaN | NaN |
| (93,) | 0.235099 | 0.3918 | 0.148842 | 0.133242 | 0.188718 | 0.061297 |
| (68,) | 0.117847 | 0.350058 | 0.087012 | NaN | NaN | NaN |
| (14,) | 0.202385 | 0.486434 | 0.146767 | 0.17693 | 0.432116 | 0.139543 |
| (67,) | 0.390733 | 0.632089 | 0.27595 | 0.066273 | 0.310608 | 0.061676 |
| (86,) | 0.471155 | 0.284365 | 0.214287 | 0.106152 | 0.44861 | 0.016967 |


# III. Make statistics on spectra

## A.   Total number of peaks



In [21]:
number_peaks = len(root.xpath(".//peaks"))
print("Number of peaks: ",number_peaks)

Number of peaks:  301525


In [22]:
# Count the peaks of each spectrum; rows are collected first because
# DataFrame.append was removed in pandas 2.0
rows = []
for i, spectrum in enumerate(root.findall('.//spectra'), start=1):
    rows.append([i, len(spectrum.xpath(".//peaks"))])
numbers_of_peaks = pd.DataFrame(rows, columns=["spectrum","peaks"])

In [178]:
fig = go.Figure()
fig.add_trace(go.Bar(x=numbers_of_peaks['spectrum'][:500],y=numbers_of_peaks['peaks'][:500],name='peaks',marker_color='rgb(55, 83, 109)'))
fig.update_layout(title='Peaks per spectrum',
                  xaxis=dict(title='Spectrum'),
                  yaxis=dict(title='Peaks'))
fig.show()

## B. Number of peaks for three levels of abundance (high, middle, low)

In [165]:
columns = ["spectrum","spectrum_id","peaks","low",'middle','high']
rows = []
spectrum_count = 0
low_array,middle_array,high_array = ([] for i in range(3))

for spectrum in root.xpath('.//spectra'):
  peaks,low,middle,high = (0 for i in range(4))
  spectrum_id = spectrum.xpath('./spectrumId')[0].text

  spectrum_count += 1
  for p in spectrum.xpath('./peaks'):
    peaks += 1
    abundance = float(p.xpath('./abundance')[0].text)
    if abundance <= 0.03:      # low: 0 - 0.03
      low += 1
      low_array.append(1)
    elif abundance <= 0.2:     # middle: 0.03 - 0.2
      middle += 1
      middle_array.append(1)
    else:                      # high: 0.2 - 1
      high += 1
      high_array.append(1)

  rows.append([spectrum_count,spectrum_id,peaks,low,middle,high])

# Build the frame once; DataFrame.append was removed in pandas 2.0
all_peaks = pd.DataFrame(rows, columns=columns)

In [25]:
len(low_array), len(middle_array), len(high_array)

(298610, 2349, 566)
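
The three-level split can also be written as a lookup against a sorted list of thresholds. The following is a sketch using only the standard library; the helper names `abundance_level` and `count_levels` are hypothetical, and the thresholds 0.03 and 0.2 are the ones used in the cell above.

```python
from bisect import bisect_left
from collections import Counter

LEVELS = ("low", "middle", "high")
THRESHOLDS = (0.03, 0.2)  # inclusive upper bounds of "low" and "middle"

def abundance_level(abundance):
    """Map one abundance value to its level name."""
    # bisect_left keeps a value equal to a threshold in the lower level,
    # which matches the <= comparisons of the if/elif chain
    return LEVELS[bisect_left(THRESHOLDS, abundance)]

def count_levels(abundances):
    """Count how many abundance values fall into each level."""
    counts = Counter(abundance_level(a) for a in abundances)
    return {level: counts.get(level, 0) for level in LEVELS}

level_counts = count_levels([0.01, 0.03, 0.1, 0.2, 0.9])
print(level_counts)  # {'low': 2, 'middle': 2, 'high': 1}
```

Adding a fourth level would only require extending `LEVELS` and `THRESHOLDS`, not another `elif` branch.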

In [26]:
all_peaks.set_index('spectrum')

| spectrum | spectrum_id | peaks | low | middle | high |
| --- | --- | --- | --- | --- | --- |
| 1 | 1295 | 12 | 12 | 0 | 0 |
| 2 | 155 | 9 | 9 | 0 | 0 |
| 3 | 1623 | 12 | 12 | 0 | 0 |
| 4 | 364 | 39 | 39 | 0 | 0 |
| 5 | 457 | 45 | 45 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1896 | 4 | 62 | 62 | 0 | 0 |
| 1897 | 402 | 46 | 46 | 0 | 0 |
| 1898 | 525 | 50 | 50 | 0 | 0 |
| 1899 | 760 | 63 | 57 | 6 | 0 |


## C. Number of peaks for bins of m/z such that each bin has a fixed range of masses (lowest and highest bound comes from all peaks in the dataset, inner thresholds for bins are user-defined)

In [27]:
# Collect m/z and accuracy of every peak in the dataset
accuracy_a = []
mass_a = []

for spectrum in root.xpath('.//spectra'):
  for p in spectrum.xpath('./peaks'):
    accuracy = p.xpath('./accuracy')[0].text
    accuracy_a.append(float(accuracy))
    mass = p.xpath('./mass')[0].text
    mass_a.append(float(mass))

In [179]:
fig = px.scatter(x=mass_a, y=accuracy_a, labels={
                     "x": "Mass",
                     "y": "Accuracy",
                 })
fig.show()

In [29]:
rows = []
spectrum_count = 0

for spectrum in root.xpath('.//spectra'):
  little = 0
  medium = 0
  large = 0

  spectrum_count += 1
  for p in spectrum.xpath('./peaks'):
    mass = float(p.xpath('./mass')[0].text)
    if mass <= 0.3:      # little: m/z up to 0.3
      little += 1
    elif mass <= 0.6:    # medium: m/z above 0.3 and up to 0.6
      medium += 1
    else:                # large: m/z above 0.6
      large += 1

  rows.append([spectrum_count,little,medium,large])

# Build the frame once; DataFrame.append was removed in pandas 2.0
mass_array = pd.DataFrame(rows, columns=["spectrum","little","medium","large"])

In [30]:
mass_array.set_index('spectrum')

| spectrum | little | medium | large |
| --- | --- | --- | --- |
| 1 | 6 | 3 | 3 |
| 2 | 3 | 3 | 3 |
| 3 | 6 | 3 | 3 |
| 4 | 12 | 9 | 18 |
| 5 | 21 | 12 | 12 |
| ... | ... | ... | ... |
| 1896 | 62 | 0 | 0 |
| 1897 | 46 | 0 | 0 |
| 1898 | 48 | 2 | 0 |
| 1899 | 24 | 33 | 6 |
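
The binning described in the heading of this part (outer bounds taken from the data, inner thresholds chosen by the user) might be sketched with NumPy as follows. The function name `count_mz_bins` is hypothetical, and the inner thresholds are assumed to lie inside the range of the observed masses.

```python
import numpy as np

def count_mz_bins(masses, inner_thresholds):
    """Count peaks per m/z bin; returns (edges, counts)."""
    masses = np.asarray(masses, dtype=float)
    inner = np.sort(np.asarray(inner_thresholds, dtype=float))
    # right=True places a value equal to a threshold in the lower bin,
    # the same convention as the <= comparisons used above
    idx = np.digitize(masses, inner, right=True)
    counts = np.bincount(idx, minlength=len(inner) + 1)
    # The lowest and highest bounds come from the data itself
    edges = np.concatenate(([masses.min()], inner, [masses.max()]))
    return edges, counts

edges, counts = count_mz_bins([0.1, 0.2, 0.3, 0.45, 0.7, 0.9], [0.3, 0.6])
print(edges.tolist())   # [0.1, 0.3, 0.6, 0.9]
print(counts.tolist())  # [3, 1, 2]
```

Passing all masses of one spectrum gives the per-spectrum counts; passing the masses of the whole dataset gives the global distribution.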


In [166]:
all_peaks = all_peaks.merge(mass_array, left_on='spectrum', right_on='spectrum')
all_peaks.set_index('spectrum')

| spectrum | spectrum_id | peaks | low | middle | high | little | medium | large |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1295 | 12 | 12 | 0 | 0 | 6 | 3 | 3 |
| 2 | 155 | 9 | 9 | 0 | 0 | 3 | 3 | 3 |
| 3 | 1623 | 12 | 12 | 0 | 0 | 6 | 3 | 3 |
| 4 | 364 | 39 | 39 | 0 | 0 | 12 | 9 | 18 |
| 5 | 457 | 45 | 45 | 0 | 0 | 21 | 12 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1896 | 4 | 62 | 62 | 0 | 0 | 62 | 0 | 0 |
| 1897 | 402 | 46 | 46 | 0 | 0 | 46 | 0 | 0 |
| 1898 | 525 | 50 | 50 | 0 | 0 | 48 | 2 | 0 |
| 1899 | 760 | 63 | 57 | 6 | 0 | 24 | 33 | 6 |


## D. Bold font for m/z bin with the highest number of peaks

In [32]:
all_peaks = all_peaks.astype('int')

In [33]:
# Largest m/z-bin count of each spectrum, then the spectrum holding the overall maximum
bin_counts = all_peaks[['little','medium','large']]
highest_n_peaks = bin_counts.max(axis=1).idxmax()

# idxmax returns an index label, so select the row with .loc
row_highest = all_peaks.loc[highest_n_peaks]

In [34]:
row_highest

spectrum       1221
spectrum_id    1783
peaks          2320
low            2320
middle            0
high              0
little          368
medium         1472
large           480
Name: 1220, dtype: int64

In [105]:
spectrum_accuracy = []
spectrum_mass = []
spectrum_a = pd.DataFrame(columns=["mass","abundance"])

for spectrum in root.xpath('.//spectra'):
  spectrum_id = spectrum.xpath('./spectrumId')[0].text
  if int(spectrum_id) == row_highest['spectrum_id']:
    for pp in spectrum.xpath('./peaks'):
      accuracy = pp.xpath('./accuracy')[0].text
      mass = pp.xpath('./mass')[0].text
      abundance = pp.xpath('./abundance')[0].text
      spectrum_accuracy.append(float(accuracy)*100)
      spectrum_mass.append(float(mass))
      # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
      spectrum_a = pd.concat([spectrum_a, pd.DataFrame([[mass,abundance]],
                                              columns=["mass","abundance"])])

In [None]:
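def create_stem_data(x, y, baseline=0.):
    '''Build line-plot coordinates that draw each (x, y) point as a vertical stem.

    Every point is repeated three times as baseline, value, baseline,
    so the returned arrays are three times the original length.
    '''
    x = np.repeat(np.asarray(x, dtype=float), 3)
    y = np.repeat(np.asarray(y, dtype=float), 3)
    y[::3] = y[2::3] = baseline
    return x, y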
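
A variant of the stem helper, shown here only as a sketch: instead of returning to the baseline after each point, it ends each stem with a missing value, so that a line trace does not draw a segment along the baseline between neighbouring stems. This relies on Plotly line traces leaving a gap at missing values unless `connectgaps` is enabled. The name `stem_coordinates` is hypothetical.

```python
import numpy as np

def stem_coordinates(x, y, baseline=0.0):
    """Return x and y arrays tracing baseline -> value -> gap for every point."""
    x = np.repeat(np.asarray(x, dtype=float), 3)
    y = np.repeat(np.asarray(y, dtype=float), 3)
    y[0::3] = baseline   # each stem starts on the baseline
    y[2::3] = np.nan     # gap, so consecutive stems are not joined
    return x, y

xs, ys = stem_coordinates([0.2, 0.5], [3.0, 1.0])
print(xs.tolist())
print(ys.tolist())
```

The returned arrays can be passed to `px.line(x=xs, y=ys)` in the same way as the output of `create_stem_data`.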

In [108]:
x,y = create_stem_data(spectrum_mass,spectrum_accuracy)
fig1 = px.line(x=x, y=y)

fig2 = px.line(x=[0.3,0.3],y=[0,0.0003])
fig2.update_traces(line_color='red')

fig3 = px.line(x=[0.6,0.6],y=[0,0.0003])
fig3.update_traces(line_color='red')

fig4 = go.Figure(data=fig2.data + fig3.data + fig1.data)
fig4.update_xaxes(title_text="m/z")
fig4.update_yaxes(title_text="Accuracy [%]")
fig4.show()

In [93]:
spectrum_a = spectrum_a.astype(float)
spectrum_a[:10]

|  | mass | abundance |
| --- | --- | --- |
| 0 | 0.775263 | 1.587729e-07 |
| 0 | 0.719728 | 1.359793e-07 |
| 0 | 0.313244 | 1.497763e-07 |
| 0 | 0.70452 | 1.126595e-07 |
| 0 | 0.699524 | 1.086877e-07 |
| 0 | 0.699434 | 1.265111e-07 |
| 0 | 0.310528 | 7.253586e-08 |
| 0 | 0.694346 | 1.852536e-07 |
| 0 | 0.689443 | 1.467523e-07 |
| 0 | 0.689351 | 2.179989e-07 |


In [96]:
spectrum_a.sort_values(by=['mass'],inplace=True)
fig1 = px.line(x=spectrum_a['mass'], y=spectrum_a['abundance']*100, markers=True)

fig4 = go.Figure(data=fig1.data)
fig4.update_xaxes(title_text="m/z")
fig4.update_yaxes(title_text="Abundance [%]")
fig4.show()

In [102]:
x,y = create_stem_data(spectrum_a['mass'],spectrum_a['abundance']*100)
fig1 = px.line(x=x, y=y)

fig4 = go.Figure(data=fig1.data)
fig4.update_xaxes(title_text="m/z")
fig4.update_yaxes(title_text="Abundance [%]")
fig4.show()

## E. Find the most common m/z location for m/z rounded to whole numbers

In [149]:
# Candidate m/z values on a 0.01 grid between 0 and 1
array_number = list(np.linspace(0, 1, num=101))

df_mass = pd.DataFrame(columns=["spectrum","peaks","mass_max"])

spectra = 0

for spectrum in root.xpath('.//spectra'):
  array_number_pd = pd.DataFrame(columns=array_number)
  spectra += 1
  peaks = 0
  for pp in spectrum.xpath('./peaks'):
    peaks += 1
    mass = pp.xpath('./mass')[0].text
    for r in range(len(array_number)):
      if r/100 == round(float(mass), 2):
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        array_number_pd = pd.concat([array_number_pd, pd.DataFrame([[1]], columns=[r/100])])
  # The column with the most entries is the most common rounded m/z of this spectrum
  mass_max = array_number_pd.count(axis=0).idxmax()
  df_mass = pd.concat([df_mass, pd.DataFrame([[spectra,peaks,mass_max]], columns=["spectrum","peaks","mass_max"])])
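
The most common rounded m/z of one spectrum can also be found with a counter, which avoids the inner loop over all grid values. This is a sketch using only the standard library; the function name `most_common_mz` is hypothetical, the input is assumed to be non-empty, and ties may be resolved differently than in the cell above.

```python
from collections import Counter

def most_common_mz(masses, decimals=2):
    """Return (rounded m/z, count) for the most frequent rounded mass."""
    rounded = [round(float(m), decimals) for m in masses]
    # most_common(1) returns a list holding the single top (value, count) pair
    value, count = Counter(rounded).most_common(1)[0]
    return value, count

top_mz = most_common_mz(["0.181", "0.179", "0.28", "0.184"])
print(top_mz)  # (0.18, 3)
```

The masses are accepted as strings because that is how they are read from the XML elements.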

In [150]:
df_mass

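|  | spectrum | peaks | mass_max |
| --- | --- | --- | --- |
| 0 | 1 | 12 | 0.18 |
| 0 | 2 | 9 | 0.28 |
| 0 | 3 | 12 | 0.18 |
| 0 | 4 | 39 | 0.28 |
| 0 | 5 | 45 | 0.17 |
| ... | ... | ... | ... |
| 0 | 1896 | 62 | 0.14 |
| 0 | 1897 | 46 | 0.13 |
| 0 | 1898 | 50 | 0.13 |
| 0 | 1899 | 63 | 0.14 |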


In [153]:
df_mass = df_mass.drop(columns=['peaks'])
df_mass = df_mass.set_index('spectrum')
df_mass

| spectrum | mass_max |
| --- | --- |
| 1 | 0.18 |
| 2 | 0.28 |
| 3 | 0.18 |
| 4 | 0.28 |
| 5 | 0.17 |
| ... | ... |
| 1896 | 0.14 |
| 1897 | 0.13 |
| 1898 | 0.13 |
| 1899 | 0.14 |


In [167]:
all_peaks = all_peaks.merge(df_mass, left_on='spectrum', right_on='spectrum')
all_peaks.set_index('spectrum')

| spectrum | spectrum_id | peaks | low | middle | high | little | medium | large | mass_max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1295 | 12 | 12 | 0 | 0 | 6 | 3 | 3 | 0.18 |
| 2 | 155 | 9 | 9 | 0 | 0 | 3 | 3 | 3 | 0.28 |
| 3 | 1623 | 12 | 12 | 0 | 0 | 6 | 3 | 3 | 0.18 |
| 4 | 364 | 39 | 39 | 0 | 0 | 12 | 9 | 18 | 0.28 |
| 5 | 457 | 45 | 45 | 0 | 0 | 21 | 12 | 12 | 0.17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1896 | 4 | 62 | 62 | 0 | 0 | 62 | 0 | 0 | 0.14 |
| 1897 | 402 | 46 | 46 | 0 | 0 | 46 | 0 | 0 | 0.13 |
| 1898 | 525 | 50 | 50 | 0 | 0 | 48 | 2 | 0 | 0.13 |
| 1899 | 760 | 63 | 57 | 6 | 0 | 24 | 33 | 6 | 0.14 |


In [170]:
all_peaks

| spectrum | spectrum_id | peaks | low | middle | high | little | medium | large | mass_max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1295 | 12 | 12 | 0 | 0 | 6 | 3 | 3 | 0.18 |
| 2 | 155 | 9 | 9 | 0 | 0 | 3 | 3 | 3 | 0.28 |
| 3 | 1623 | 12 | 12 | 0 | 0 | 6 | 3 | 3 | 0.18 |
| 4 | 364 | 39 | 39 | 0 | 0 | 12 | 9 | 18 | 0.28 |
| 5 | 457 | 45 | 45 | 0 | 0 | 21 | 12 | 12 | 0.17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1896 | 4 | 62 | 62 | 0 | 0 | 62 | 0 | 0 | 0.14 |
| 1897 | 402 | 46 | 46 | 0 | 0 | 46 | 0 | 0 | 0.13 |
| 1898 | 525 | 50 | 50 | 0 | 0 | 48 | 2 | 0 | 0.13 |
| 1899 | 760 | 63 | 57 | 6 | 0 | 24 | 33 | 6 | 0.14 |


In [173]:
all_peaks = all_peaks.set_index('spectrum')
header=[["Spectrum_id","Peaks","Abundance","Abundance","Abundance","Accuracy","Accuracy","Accuracy","Mass_max"],
        ["Spectrum_id","peaks","low","middle","high","little","medium","large","mass_max"]]
all_peaks.columns=header

In [174]:
all_peaks

| spectrum | Spectrum_id | Peaks | Abundance low | Abundance middle | Abundance high | Accuracy little | Accuracy medium | Accuracy large | Mass_max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1295 | 12 | 12 | 0 | 0 | 6 | 3 | 3 | 0.18 |
| 2 | 155 | 9 | 9 | 0 | 0 | 3 | 3 | 3 | 0.28 |
| 3 | 1623 | 12 | 12 | 0 | 0 | 6 | 3 | 3 | 0.18 |
| 4 | 364 | 39 | 39 | 0 | 0 | 12 | 9 | 18 | 0.28 |
| 5 | 457 | 45 | 45 | 0 | 0 | 21 | 12 | 12 | 0.17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1896 | 4 | 62 | 62 | 0 | 0 | 62 | 0 | 0 | 0.14 |
| 1897 | 402 | 46 | 46 | 0 | 0 | 46 | 0 | 0 | 0.13 |
| 1898 | 525 | 50 | 50 | 0 | 0 | 48 | 2 | 0 | 0.13 |
| 1899 | 760 | 63 | 57 | 6 | 0 | 24 | 33 | 6 | 0.14 |


## F. Use m/z accuracy to render an updated number of peaks in m/z bins such that peaks bordering the bins are counted as +0.5 in the bin into which they extend through their accuracy

In [129]:
print("We tried, but could not work out how to tell when a line already lies beyond the boundary.")

We tried, but could not work out how to tell when a line already lies beyond the boundary.
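
One possible reading of this task, given only as a sketch under stated assumptions: each peak covers the interval from `mass - accuracy` to `mass + accuracy`, it counts as 1 in the bin that contains its mass, and it adds 0.5 to every other bin that the interval reaches. Whether the accuracy in the data is an absolute m/z tolerance is an assumption, and the function names are hypothetical. The thresholds 0.3 and 0.6 are the ones used in part C.

```python
from bisect import bisect_left

def bin_index(value, thresholds):
    """Index of the bin holding value; a value equal to a threshold stays in the lower bin."""
    return bisect_left(thresholds, value)

def count_with_borders(peaks, thresholds):
    """peaks: iterable of (mass, accuracy) pairs; returns one count per bin."""
    counts = [0.0] * (len(thresholds) + 1)
    for mass, accuracy in peaks:
        home = bin_index(mass, thresholds)
        counts[home] += 1.0
        # Bins reached by the lower and upper end of the accuracy interval
        lowest = bin_index(mass - accuracy, thresholds)
        highest = bin_index(mass + accuracy, thresholds)
        for neighbour in range(lowest, highest + 1):
            if neighbour != home:
                counts[neighbour] += 0.5
    return counts

toy_peaks = [(0.1, 0.01), (0.29, 0.02), (0.61, 0.02), (0.9, 0.01)]
border_counts = count_with_borders(toy_peaks, (0.3, 0.6))
print(border_counts)  # [2.0, 1.0, 2.0]
```

In the toy data the peaks at 0.29 and 0.61 each reach into the middle bin, which therefore receives 0.5 twice although no peak has its mass there.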


# IV. Make a table of compounds where each row represents one compound, and columns will provide statistics retrieved in II.

In [176]:
average_array

| InChIKey | Correct Cosine | Correct Denver | Correct Nist | Incorrect Cosine | Incorrect Denver | Incorrect Nist |
| --- | --- | --- | --- | --- | --- | --- |
| (33,) | 0.298146 | 0.528074 | 0.250037 | 0.217256 | 0.358166 | 0.146416 |
| (24,) | 0.180585 | 0.285644 | 0.053103 | NaN | NaN | NaN |
| (55,) | 0.368408 | 0.456467 | 0.196585 | NaN | NaN | NaN |
| (38,) | 0.367797 | 0.339367 | 0.159534 | NaN | NaN | NaN |
| (42,) | 0.106304 | 0.387258 | 0.079581 | 0.214979 | 0.412693 | 0.150852 |
| ... | ... | ... | ... | ... | ... | ... |
| (36,) | 0.210824 | 0.359136 | 0.122979 | 0.465168 | 0.622386 | 0.421077 |
| (4,) | 0.176631 | 0.476380 | 0.086309 | 0.058014 | 0.189462 | 0.037051 |
| (26,) | 0.582605 | 0.358903 | 0.248479 | 0.287575 | 0.365823 | 0.252293 |
| (19,) | 0.239618 | 0.403607 | 0.166001 | 0.203720 | 0.423415 | 0.151057 |


# V. Make a table of spectra where each row represents one spectrum, and columns will contain statistics retrieved in III.

In [175]:
all_peaks

| spectrum | Spectrum_id | Peaks | Abundance low | Abundance middle | Abundance high | Accuracy little | Accuracy medium | Accuracy large | Mass_max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1295 | 12 | 12 | 0 | 0 | 6 | 3 | 3 | 0.18 |
| 2 | 155 | 9 | 9 | 0 | 0 | 3 | 3 | 3 | 0.28 |
| 3 | 1623 | 12 | 12 | 0 | 0 | 6 | 3 | 3 | 0.18 |
| 4 | 364 | 39 | 39 | 0 | 0 | 12 | 9 | 18 | 0.28 |
| 5 | 457 | 45 | 45 | 0 | 0 | 21 | 12 | 12 | 0.17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1896 | 4 | 62 | 62 | 0 | 0 | 62 | 0 | 0 | 0.14 |
| 1897 | 402 | 46 | 46 | 0 | 0 | 46 | 0 | 0 | 0.13 |
| 1898 | 525 | 50 | 50 | 0 | 0 | 48 | 2 | 0 | 0.13 |
| 1899 | 760 | 63 | 57 | 6 | 0 | 24 | 33 | 6 | 0.14 |


# VI. Shed some light on the relationship between normalized collision energy and peaks distribution

# VII. Discuss important patterns in data