<a href="https://colab.research.google.com/github/NICE-MSI/NPL-Academy/blob/main/NPL_NiCE-MSI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using Mass Spectrometry Imaging to Map Molecules**

In this notebook, you will be able to investigate different mean spectra from some tissues of interest. You will overplot the spectra of the different tissues, study the intensity ratios of some compounds, as well as obtain a list of significant ions which drive the difference between two types of cancer tumours. You can use the "Supporting_Material_python.pdf" notes to help you navigate through this part of your research.

First, we import the Python packages needed to read and plot the data (numpy, matplotlib and pandas):

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

We clone the repository containing the mass spectrometry data that we are going to study in this notebook. An image of the tissues we are studying is shown in Figure 1 of the notes.

In [None]:
!git clone https://github.com/NICE-MSI/NPL-Academy.git

We use "pandas" to read the file with the mean spectra of the different tissues. 
We print the file on the next cell to see its structure.

In [None]:
df = pd.read_csv("NPL-Academy/spectra.csv")  # read data files
print(df)  #print data file

As you can see, there are 7 columns in the file. The first column corresponds to the m/z values (X-axis of the spectrum). Columns 2 to 7 correspond to the intensities of the spectra for the different tissues (Y-axis).

Note: You can save your figures as follows (more help in the supporting material):
- "Uncomment" the last line of the cell (remove the #).
- Comment out plt.show() by putting a # in front of it.
- Rename "filename" to the name you want the figure to have.

Your plots will be saved in the "outputs" folder.
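
The saving step can be sketched on its own, using a made-up spectrum rather than the tissue data. The folder name `example_outputs` and the file name `example_spectrum.png` are placeholders chosen for this illustration. `os.makedirs` is included on the assumption that the target folder might not exist yet. The figure is saved before anything else is done with it, because with many setups a figure saved after `plt.show()` comes out blank.

```python
import os

import numpy as np
import matplotlib.pyplot as plt

# Made-up spectrum with a single peak, only for illustration (not the tissue data).
mz = np.linspace(300, 320, 200)
intensity = np.exp(-((mz - 310.0) ** 2) / 0.05)

out_dir = "example_outputs"          # placeholder folder name
os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
out_path = os.path.join(out_dir, "example_spectrum.png")

fig, ax = plt.subplots()
ax.plot(mz, intensity, color="blue", label="example spectrum")
ax.set_xlabel("m/z")
ax.set_ylabel("intensity")
ax.legend()
fig.savefig(out_path, dpi=150)  # write the PNG file
plt.close(fig)                  # release the figure once it is saved
```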

In [None]:
plt.plot(df["m/z"],df["A_APCKRAS"], color='blue', label='tissue A-APCKRAS')
plt.plot(df["m/z"],df["D_APCKRAS"], color='red', label='tissue D-APCKRAS')
plt.legend()
plt.show()
#plt.savefig("NPL-Academy/outputs/filename.png")


- Can you overplot all the spectra in "spectra.csv"? Remember to include the labels on your plot.

You can customize your plot in many different ways (colors, linestyles, linewidth,...). If you want to look at all the options you can check here:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html 

You can also zoom in on specific areas of the spectrum for better visualisation. For example:

In [None]:
plt.plot(df["m/z"],df["A_APCKRAS"], color='blue', label='tissue A-APCKRAS')
plt.xlim((300,320))
plt.legend()
plt.show()
#plt.savefig("NPL-Academy/outputs/filename.png")

**NOISE DETERMINATION**

One of the first problems when analysing MS data is to distinguish signal from noise. Noise is a random, meaningless signal, caused mainly by the electronics within the instrument detector, that appears as a series of small peaks alongside the real data peaks. MS datasets are normally very large, so it is important to keep only the compounds of interest and remove these unimportant peaks (noise). In this example, we estimate the noise level in a basic way, by calculating the standard deviation of the mean spectrum.

In the next cell we obtain the standard deviation of two of the tissues (A and D) by using the "std" function in the numpy package (np.std).
- Can you obtain the standard deviation for the rest of the tissues?

In [None]:
print('standard deviation for tissue A =', np.std(df["A_APCKRAS"]))
print('standard deviation for tissue D =',np.std(df["D_APCKRAS"]))
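
The same calculation can be applied to every intensity column at once. The sketch below uses a small made-up table (the column names `tissue_1` and `tissue_2` are hypothetical); with the real data you would start from `df` instead. One detail to be aware of: the pandas `.std()` method defaults to `ddof=1`, whereas `np.std` defaults to `ddof=0`, so `ddof=0` is passed explicitly here to match the cell above.

```python
import numpy as np
import pandas as pd

# Made-up table with the same layout as spectra.csv: one m/z column plus intensities.
toy = pd.DataFrame({
    "m/z": [100.0, 101.0, 102.0, 103.0],
    "tissue_1": [1.0, 2.0, 3.0, 4.0],    # hypothetical column name
    "tissue_2": [0.0, 10.0, 0.0, 10.0],  # hypothetical column name
})

# Drop the m/z axis so that only the intensity columns remain.
intensities = toy.drop(columns="m/z")

# Standard deviation of each column; ddof=0 matches the default of np.std.
stds = intensities.std(ddof=0)
print(stds)
```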

We can use this standard deviation as a threshold to separate signal from noise.
In the cell below we plot the noise in red and the signal in blue for one of the tissues.
- Do you think this noise level is correct? 

In [None]:
threshold = np.std(df["A_APCKRAS"])  # noise threshold: standard deviation of the spectrum
plt.plot(df["m/z"],df["A_APCKRAS"].where(df["A_APCKRAS"]<threshold), color='red', label='noise A-APCKRAS')
plt.plot(df["m/z"],df["A_APCKRAS"].where(df["A_APCKRAS"]>=threshold), color='blue', label='signal A-APCKRAS')
plt.legend()
plt.show()
#plt.savefig("NPL-Academy/outputs/filename.png")

Let's zoom in on one specific area for a better view.

In [None]:
threshold = np.std(df["A_APCKRAS"])  # noise threshold: standard deviation of the spectrum

plt.plot(df["m/z"],df["A_APCKRAS"].where(df["A_APCKRAS"]<threshold), color='red', label='noise A-APCKRAS')
plt.plot(df["m/z"],df["A_APCKRAS"].where(df["A_APCKRAS"]>=threshold), color='blue', label='signal A-APCKRAS')
plt.xlim((310,320))
plt.ylim((-1E5,6E5))
plt.legend()
plt.show()
#plt.savefig("NPL-Academy/outputs/filename.png")

- Can you lower the noise level for this spectrum? Use the cell above to determine the noise level that you think works best in this case.
- What happens in other areas of this spectrum? Can you use the same noise threshold for the whole spectrum?
- Can you compare the noise thresholds of the spectra for the different tissues? Can you use the same threshold for all the spectra to remove noise properly?
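
One reason the standard deviation can overestimate the noise is that the real peaks themselves inflate it. A commonly used alternative, which is not part of this notebook, is the median absolute deviation (MAD), scaled by 1.4826 so that it matches the standard deviation for Gaussian noise. The sketch below compares the two on a synthetic spectrum, assuming zero-mean Gaussian noise of known size plus a few tall peaks. Real spectra will not follow this model exactly, so treat it as an illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic spectrum: Gaussian noise with a known standard deviation of 1.0 ...
noise_sigma = 1.0
spectrum = rng.normal(0.0, noise_sigma, size=5000)
# ... plus 20 tall "real" peaks, one every 250 points.
spectrum[::250] += 200.0

# Plain standard deviation: pulled upwards by the tall peaks.
std_estimate = np.std(spectrum)

# MAD-based estimate: the median is barely affected by a few tall peaks.
median = np.median(spectrum)
mad = np.median(np.abs(spectrum - median))
mad_estimate = 1.4826 * mad  # scale factor for Gaussian noise

print("true noise level   =", noise_sigma)
print("standard deviation =", std_estimate)
print("MAD-based estimate =", mad_estimate)
```

In this synthetic case the MAD-based value stays close to the true noise level, while the standard deviation is several times larger.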

**INTENSITY RATIOS**

To study which compounds are more relevant in each type of tissue (i.e. APC and APCKRAS), we can study the intensity ratios between peaks of the mean spectra of these tissues.
We provide you here with a list of 80 common peaks among the tissues. This file has 7 columns, the m/z value and the intensities for each of the 6 tissues we are working with.

In [None]:
peaks = pd.read_csv("NPL-Academy/top80_peaks.csv")  # read data files
print(peaks)  #print data file

To study the intensity ratio of the compounds for the APC vs APCKRAS tissues, we calculate the mean of all APCKRAS tissues and the mean of all APC tissues. Then we can calculate their ratios.
In the next cell we create a new column in the table called "mean APCKRAS".
- Can you add another column called "mean APC"? 

Note: Python numbers the columns of the table starting from 0, so the first column is column 0. You can see the names of the columns using print(peaks.columns)

In [None]:
peaks['mean APCKRAS'] = peaks.iloc[:, [1,5,6]].mean(axis=1)

print(peaks)
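
Selecting columns by position works, but it is easy to pick the wrong index. A sketch of an alternative is to list each position next to its name and then select columns by name. The table below is made up, and the column name `B_APC` is hypothetical. Note that testing for the text "APC" anywhere in the name would also match the APCKRAS columns, so the sketch checks how the name ends instead.

```python
import pandas as pd

# Made-up table with a layout similar to top80_peaks.csv.
toy_peaks = pd.DataFrame({
    "m/z": [100.0, 200.0],
    "A_APCKRAS": [10.0, 40.0],
    "B_APC": [1.0, 2.0],        # hypothetical column name
    "C_APC": [3.0, 4.0],
    "D_APCKRAS": [30.0, 20.0],
})

# Show which position (starting at 0) belongs to which column name.
for position, name in enumerate(toy_peaks.columns):
    print(position, name)

# Select columns by the ending of their name rather than by position.
apckras_cols = [c for c in toy_peaks.columns if c.endswith("_APCKRAS")]
apc_cols = [c for c in toy_peaks.columns if c.endswith("_APC")]
print(apckras_cols, apc_cols)

# Row-wise mean over the selected columns.
toy_peaks["mean APCKRAS"] = toy_peaks[apckras_cols].mean(axis=1)
print(toy_peaks)
```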

Once you have these two new columns, we create a new column called 'ratio'.  

In [None]:
peaks['ratio'] = peaks['mean APCKRAS']/peaks['mean APC']

We can select the 5 ions with the largest APCKRAS/APC ratio, i.e. those that are relatively most intense in APCKRAS.

In [None]:

print(peaks.nlargest(5,'ratio'))

- Can you identify the top 5 ions whose ratio is greater in APC?
- You can study the single ion images in the supporting material folder.
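
Ratios are not symmetric: an ion that is 4 times more intense in APCKRAS has a ratio of 4, while one that is 4 times more intense in APC has a ratio of 0.25. Taking the base-2 logarithm puts both directions on the same scale (+2 and -2). The sketch below shows this on a made-up table; the intensities are invented and the extra column name `log2 ratio` is chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up mean intensities for four ions.
toy = pd.DataFrame({
    "m/z": [100.0, 200.0, 300.0, 400.0],
    "mean APCKRAS": [8.0, 1.0, 4.0, 2.0],
    "mean APC": [1.0, 4.0, 4.0, 1.0],
})

toy["ratio"] = toy["mean APCKRAS"] / toy["mean APC"]
toy["log2 ratio"] = np.log2(toy["ratio"])  # > 0: higher in APCKRAS, < 0: higher in APC

# Largest ratios: relatively more intense in APCKRAS.
higher_in_apckras = toy.nlargest(2, "ratio")
# Smallest ratios: relatively more intense in APC.
higher_in_apc = toy.nsmallest(2, "ratio")

print(toy)
print(higher_in_apckras)
print(higher_in_apc)
```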

**T-TEST**

The previous method to find relevant compounds is not very robust. Instead, scientists normally use the so-called t-test (among others).
The t-test gives us information about how similar or different two samples are. In this instance, we are going to focus on the tumour areas of the tissues. In the supplementary material (Figure 4), you can find an image where the tumour regions are plotted in yellow for the A_APCKRAS, D_APCKRAS, C_APC and G_APC tissues. We performed a t-test analysis between the APCKRAS and APC tumours across all pixels.
The t-test provides two different parameters, the t-value and the p-value.

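The t-value is the ratio of the difference between the two groups to the variation within the groups. Its sign only tells you which group has the higher mean.
- Larger t-values (further from 0, positive or negative) = more difference between groups.
- Smaller t-values (closer to 0) = more similarity between groups.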
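
The ratio described above can be written out with numpy. The sketch below uses made-up numbers and the Welch form of the t-value, which is an assumption: the notes do not say which variant was used to build the results table. The helper name `welch_t` is chosen only for this illustration. The two comparisons have the same difference between group means, but the second has much more spread within each group, so its t-value is closer to 0.

```python
import numpy as np

def welch_t(x, y):
    """t-value: difference between group means divided by its standard error."""
    mean_difference = x.mean() - y.mean()
    standard_error = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return mean_difference / standard_error

# Two tight groups whose means differ by 10.
tight_a = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
tight_b = tight_a + 10.0

# Same difference in means, but far more spread within each group.
wide_a = np.array([1.0, 21.0, 11.0, 31.0, -9.0])
wide_b = wide_a + 10.0

t_tight = welch_t(tight_a, tight_b)
t_wide = welch_t(wide_a, wide_b)

print("t-value, tight groups:", t_tight)
print("t-value, wide groups: ", t_wide)
```

Both t-values are negative here only because the first group has the lower mean.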

The p-value from a t-test is the probability of seeing a difference at least as large as the one in your sample data purely by chance, if there were really no difference between the two groups. P-values can vary from 0% to 100% and are usually written as a decimal (for example, a p-value of 5% is 0.05). Low p-values indicate that your result is unlikely to be due to chance alone. For example, a p-value of 0.01 means that, if the two groups were really the same, a difference this large would appear in only 1% of experiments; these compounds will therefore be of high significance for our data analysis.
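
To make the idea of "by chance" concrete, the sketch below estimates a p-value by shuffling. This is a permutation test, not the t-test used for the results table, and it runs on synthetic data. The group labels are shuffled many times, and we count how often the shuffled groups differ by at least as much as the original ones. The group sizes, means and number of shuffles are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic intensities for one ion in two groups of 30 pixels each.
group_a = rng.normal(10.0, 1.0, size=30)
group_b = rng.normal(12.0, 1.0, size=30)

# Observed difference between the group means.
observed = abs(group_a.mean() - group_b.mean())

# Shuffle the pooled values: this simulates "no real difference between groups".
pooled = np.concatenate([group_a, group_b])
n_permutations = 2000
count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    difference = abs(shuffled[:30].mean() - shuffled[30:].mean())
    if difference >= observed:
        count += 1

# Fraction of shuffles at least as extreme as the observation
# (the +1 terms keep the estimate from being exactly zero).
p_value = (count + 1) / (n_permutations + 1)
print("estimated p-value:", p_value)
```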

We have prepared a table for you with the t-test results of the comparison between the APCKRAS and the APC tumours. The table is called "t_test_tumours.csv". 
- Can you load the table with pandas?
- Can you read the table?

In [None]:

t_test = None  # replace None with the code that loads the table
print() #print the table

From this table:
- Can you identify the 5 t-values that indicate the biggest differences between the groups? (Hint: you can use one of the two functions, "nsmallest" or "nlargest")
- What are the p-values of these ions?
- Will using a ratio provide different results to using the t-test?
- Find the images for these ions. Can you comment on them? Which ions do you think are more relevant for further analysis?
- Select the two most relevant ions.
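
Because a strongly negative t-value indicates as big a difference as a strongly positive one, it can help to rank ions by the absolute t-value. The sketch below does this on a made-up table. The column names `t_value` and `p_value` are hypothetical and may differ from those in "t_test_tumours.csv", so check the real names with `print(t_test.columns)` first.

```python
import pandas as pd

# Made-up t-test results for four ions (hypothetical column names and values).
toy_t = pd.DataFrame({
    "m/z": [100.0, 200.0, 300.0, 400.0],
    "t_value": [2.0, -15.0, 0.5, 9.0],
    "p_value": [0.05, 1e-8, 0.6, 1e-5],
})

# Rank by how far the t-value is from 0, ignoring its sign.
order = toy_t["t_value"].abs().nlargest(2).index
top = toy_t.loc[order]
print(top)
```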

In [None]:
print() #print the top-5 t-values and their p-values. 
# Copy the images folder to Google Drive (Drive must already be mounted at /content/drive).
!cp -r "NPL-Academy/outputs/images" "/content/drive/My Drive/"

Now that you have the 2 most significant ions in this sample, you can research what these ions are in the Human Metabolome Database (HMDB).