<a href="https://colab.research.google.com/github/TillVollmer5/mass_spectroscopy/blob/main/MS_Service_Blank_Subtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BACKGROUND SUBTRACTION SCRIPT
Till Vollmer, April 2024

*This code was created with the help of OpenAI ChatGPT 3.5, both during writing and during debugging.*


This script removes chromatographic signals from sample data based on a second file that contains the data from the analysis of the solvent (the blank). Compounds related to the solvent or to column contamination are thereby removed from the sample data.

It is important to consider, especially for environmental samples, that compounds found in the environment, in particular common pollutants, could be removed from the dataset even though their peak intensity is drastically higher in the sample than in the blank. If this is problematic, the code needs to be adapted to include a parameter in the exclusion_list function that accounts for this.
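One possible adaptation for the caveat above is sketched here. It is only an illustration under assumptions: the intensity column name `"Area"` and the `min_ratio` parameter are hypothetical (the real export may name its intensity column differently), and the toy data is invented. A sample signal is excluded only if it matches a blank signal and is less than `min_ratio` times as intense as the strongest matching blank signal.

```python
import pandas as pd

def exclusion_list_with_ratio(sample_df, blank_df,
                              retention_time_threshold=1,
                              intensity_col="Area",  # hypothetical column name
                              min_ratio=10):         # hypothetical cut-off
    """Return sample indices that match a blank signal and are NOT
    at least `min_ratio` times more intense than that blank signal."""
    excluded = []
    for idx, row in sample_df.iterrows():
        # Blank rows with the same formula and a close retention time
        same_formula = blank_df["Formula (mol ion)"] == row["Formula (mol ion)"]
        close_rt = (blank_df["Retention Time"] - row["Retention Time"]).abs() <= retention_time_threshold
        matches = blank_df[same_formula & close_rt]
        if matches.empty:
            continue
        # Keep the sample signal if it clearly exceeds the strongest matching blank signal
        if row[intensity_col] < min_ratio * matches[intensity_col].max():
            excluded.append(idx)
    return excluded

# Toy data, invented for illustration only
sample = pd.DataFrame({
    "Formula (mol ion)": ["C6H6", "C7H8", "C8H10"],
    "Retention Time": [5.0, 7.0, 9.0],
    "Area": [1000.0, 50000.0, 300.0],
})
blank = pd.DataFrame({
    "Formula (mol ion)": ["C6H6", "C7H8"],
    "Retention Time": [5.2, 7.1],
    "Area": [900.0, 1000.0],
})

print(exclusion_list_with_ratio(sample, blank))
```

In this toy case only the first row is excluded: the second row also matches the blank, but its intensity is 50 times higher and it is therefore kept. A suitable value for `min_ratio` depends on the samples and would have to be chosen by the analyst.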

The input format is based on the peak list exported from the Thermo deconvolution plugin of TraceFinder. More information on peak deconvolution, identification, the library match process, and the data export is given in the document ...........

The following sections are introduced one by one. To use this code, the file names and the paths to the files need to be known and adapted in the code; each segment begins with a short introduction.

**The backup script should not be manipulated! Always work on a copy of the code!**

For proper record keeping, the input and output folders should contain all the comma-separated value (CSV) files that are used, so that they serve as an archive.


The following section imports all the necessary libraries. This should not be changed, or the code will not work.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from google.colab import drive
import pandas as pd

drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


The following code defines several functions with different purposes.

1.   The *combine_df* function combines two data frames, which is useful when two files containing background signals should be merged before the background subtraction. The function can be applied to two data frames with the following syntax:

```
sample_combined = combine_df(sample_1, sample_2)
```


2.   The *exclusion_list* function creates a list of the compounds found in both the sample and the background data. It identifies these hits by comparing the chemical formulas of the compounds, and confirms each hit by checking whether the retention times of the two chromatographic signals fall within a given threshold of each other. The function can be applied to the data frames with the following syntax:


```
exclusion_df = exclusion_list(sample_df, background_df, retention_time_threshold=n)
```
The abbreviation "df" stands for data frame, a data structure from the pandas library that is used throughout this code. If no retention time threshold is defined, a threshold of 1 minute is used.

3. The *remove_rows* function removes the compounds identified in the exclusion list from the original sample. A copy is created and the original data frame is not altered. The function can be applied with the following statement:


```
Corrected_sample = remove_rows(sample_df, exclusion_df)
```
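The matching rule described in steps 2 and 3 can be illustrated on a toy dataset. The following is a minimal sketch, not the notebook's own implementation: it uses a pandas merge instead of nested loops, and the data values are invented. Only the column names `"Formula (mol ion)"` and `"Retention Time"` are taken from the code of this notebook.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Formula (mol ion)": ["C6H6", "C7H8", "C7H8"],
    "Retention Time": [5.0, 7.0, 12.0],
})
background_df = pd.DataFrame({
    "Formula (mol ion)": ["C6H6", "C7H8"],
    "Retention Time": [5.3, 7.2],
})

# Pair every sample row with every background row of the same formula;
# reset_index() keeps the original sample row label in a column named "index"
pairs = sample_df.reset_index().merge(
    background_df, on="Formula (mol ion)", suffixes=("_sample", "_blank")
)

# A pair is a hit when the retention times differ by no more than the threshold
threshold = 1
rt_difference = (pairs["Retention Time_sample"] - pairs["Retention Time_blank"]).abs()
hits = pairs[rt_difference <= threshold]
exclusion = sorted(hits["index"].unique())

# Remove the hits; sample_df itself is left unchanged
corrected = sample_df.drop(exclusion)
print(corrected)
```

The first two sample rows are removed because a background signal with the same formula elutes within 1 minute. The third row has the same formula as a background compound but elutes almost 5 minutes later, so it is kept.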





In [None]:
def combine_df(df1, df2):
    combined_df = pd.concat([df1, df2], ignore_index=True)
    return combined_df

def exclusion_list(sample_df, blank_df, retention_time_threshold=1):
    """Return the index labels of sample rows that also occur in the blank.

    A sample row counts as a hit when a blank row has the same formula and
    a retention time within `retention_time_threshold` (minutes).
    """
    excluded_indices = []

    for index_sample, row_sample in sample_df.iterrows():
        for index_blank, row_blank in blank_df.iterrows():
            if row_sample["Formula (mol ion)"] == row_blank["Formula (mol ion)"]:
                # Match only when the retention times agree within the threshold
                if abs(row_sample["Retention Time"] - row_blank["Retention Time"]) <= retention_time_threshold:
                    excluded_indices.append(index_sample)
                    break

    return excluded_indices

def remove_rows(df, exclusion_list):
    df_copy = df.copy()

    df_copy = df_copy.drop(exclusion_list)

    return df_copy

def filter_df(df):
    """Drop rows with a low match quality (HRF Score < 85 or SI < 600)."""
    rows_to_remove = []

    for index, row in df.iterrows():
        if row['HRF Score'] < 85 or row['SI'] < 600:
            rows_to_remove.append(index)

    filtered_df = df.drop(rows_to_remove)

    return filtered_df

def rm_no_RI(df):
    """Drop rows without a retention index (Calculated RI equal to 0)."""
    rows_to_remove = []

    for index, row in df.iterrows():
        if row['Calculated RI'] == 0:
            rows_to_remove.append(index)

    filtered_df = df.drop(rows_to_remove)

    return filtered_df
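
The two row-wise filters above can also be written with boolean masks, which is usually shorter and faster on large peak lists. This is a sketch of an alternative, not a replacement for the functions in this notebook: the function names and the parameters `min_hrf` and `min_si` are hypothetical, and the toy data is invented. It assumes the columns contain no missing values, for example because they were filled with 0 beforehand.

```python
import pandas as pd

def filter_df_vectorized(df, min_hrf=85, min_si=600):
    # Keep only rows that meet both quality thresholds
    return df[(df["HRF Score"] >= min_hrf) & (df["SI"] >= min_si)]

def rm_no_RI_vectorized(df):
    # Keep only rows with a non-zero retention index
    return df[df["Calculated RI"] != 0]

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "HRF Score": [90, 80, 95, 99],
    "SI": [700, 800, 500, 650],
    "Calculated RI": [1200, 1100, 900, 0],
})

kept = rm_no_RI_vectorized(filter_df_vectorized(toy))
print(kept)
```

Only the first row survives both filters: the second fails the HRF Score threshold, the third fails the SI threshold, and the fourth has no retention index.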

In [None]:
# `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0;
# `on_bad_lines='skip'` is the equivalent setting.
#Blank_EtOAc = pd.read_csv('/content/drive/My Drive/Reusser_alkene_screening/EtOAC.csv', on_bad_lines='skip')
DCM = pd.read_csv('/content/drive/My Drive/Reusser_alkene_screening/DCM_new2.csv', on_bad_lines='skip')
ER_0789a_n = pd.read_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0789a_n2.csv', on_bad_lines='skip')
ER_0788a_n = pd.read_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0788a_n2.csv', on_bad_lines='skip')
ER_0788b_n = pd.read_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0788b_n2.csv', on_bad_lines='skip')
ER_0787a_n = pd.read_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0787a_n2.csv', on_bad_lines='skip')
ER_0787b_n = pd.read_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0787b_n2.csv', on_bad_lines='skip')
ER_0792a_n = pd.read_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0792a_n2.csv', on_bad_lines='skip')
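
Because every file name is typed by hand, it is easy for a variable to end up pointing at the wrong file. A possible way to avoid this is to derive each path from the sample name and collect the data frames in a dictionary. The sketch below is hypothetical: the helper `load_csvs` and its `suffix` parameter are not part of this notebook, and temporary files stand in for the Drive folder so that the example runs anywhere. It assumes pandas 1.3 or newer for `on_bad_lines`.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_csvs(folder, names, suffix="_n2.csv"):
    """Read <folder>/<name><suffix> for each name into a dict of data frames."""
    folder = Path(folder)
    return {
        name: pd.read_csv(folder / f"{name}{suffix}", on_bad_lines="skip")
        for name in names
    }

# Demonstration with throw-away files; in the notebook, the folder would be
# the Drive directory that holds the exported peak lists.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ER_0787a", "ER_0787b"]:
        toy = pd.DataFrame({"Retention Time": [1.0, 2.0]})
        toy.to_csv(Path(tmp) / f"{name}_n2.csv", index=False)
    samples = load_csvs(tmp, ["ER_0787a", "ER_0787b"])

print({name: df.shape for name, df in samples.items()})
```

With this layout, later steps can loop over `samples.items()` instead of repeating one line per sample.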





In [None]:
#Blank_EtOAc.fillna(0, inplace=True)
DCM.fillna(0, inplace=True)
ER_0789a_n.fillna(0, inplace=True)
ER_0788a_n.fillna(0, inplace=True)
ER_0788b_n.fillna(0, inplace=True)
ER_0787a_n.fillna(0, inplace=True)
ER_0787b_n.fillna(0, inplace=True)
ER_0792a_n.fillna(0, inplace=True)

In [None]:
#print(Blank_EtOAc.shape)
print(DCM.shape)
print(ER_0789a_n.shape)
print(ER_0788a_n.shape)
print(ER_0788b_n.shape)
print(ER_0787a_n.shape)
print(ER_0787b_n.shape)
print(ER_0792a_n.shape)

(74, 21)
(210, 21)
(208, 21)
(223, 21)
(122, 21)
(122, 21)
(203, 21)


In [None]:
#Blank_data_df = combine_df(Blank_EtOAc, Blank_DCM)

ER_0789a_n_el = exclusion_list(ER_0789a_n, DCM)
ER_0789a_n_rr= remove_rows(ER_0789a_n, ER_0789a_n_el)

ER_0788a_n_el = exclusion_list(ER_0788a_n, DCM)
ER_0788a_n_rr = remove_rows(ER_0788a_n, ER_0788a_n_el)

ER_0788b_n_el = exclusion_list(ER_0788b_n, DCM)
ER_0788b_n_rr = remove_rows(ER_0788b_n, ER_0788b_n_el)

ER_0787a_n_el = exclusion_list(ER_0787a_n, DCM)
ER_0787a_n_rr = remove_rows(ER_0787a_n, ER_0787a_n_el)

ER_0787b_n_el = exclusion_list(ER_0787b_n, DCM)
ER_0787b_n_rr = remove_rows(ER_0787b_n, ER_0787b_n_el)

ER_0792a_n_el = exclusion_list(ER_0792a_n, DCM)
ER_0792a_n_rr = remove_rows(ER_0792a_n, ER_0792a_n_el)

In [None]:
print(ER_0789a_n_rr.shape)
print(ER_0788a_n_rr.shape)
print(ER_0788b_n_rr.shape)
print(ER_0787a_n_rr.shape)
print(ER_0787b_n_rr.shape)
print(ER_0792a_n_rr.shape)

(84, 21)
(81, 21)
(87, 21)
(64, 21)
(64, 21)
(107, 21)


In [None]:
ER_0789a_n_filtered = filter_df(ER_0789a_n_rr)
ER_0788a_n_filtered = filter_df(ER_0788a_n_rr)
ER_0788b_n_filtered = filter_df(ER_0788b_n_rr)
ER_0787a_n_filtered = filter_df(ER_0787a_n_rr)
ER_0787b_n_filtered = filter_df(ER_0787b_n_rr)
ER_0792a_n_filtered = filter_df(ER_0792a_n_rr)

In [None]:
print(ER_0789a_n_filtered.shape)
print(ER_0788a_n_filtered.shape)
print(ER_0788b_n_filtered.shape)
print(ER_0787a_n_filtered.shape)
print(ER_0787b_n_filtered.shape)
print(ER_0792a_n_filtered.shape)

(43, 21)
(46, 21)
(49, 21)
(41, 21)
(41, 21)
(68, 21)


In [None]:
ER_0789a_n_filtered.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0789a_n2_filtered.csv')
ER_0788a_n_filtered.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0788a_n2_filtered.csv')
ER_0788b_n_filtered.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0788b_n2_filtered.csv')
ER_0787a_n_filtered.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0787a_n2_filtered.csv')
ER_0787b_n_filtered.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0787b_n2_filtered.csv')
ER_0792a_n_filtered.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0792a_n2_filtered.csv')
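
Note that `to_csv` writes the row index as an unnamed first column by default. Since rows were dropped earlier, this column holds the original row labels, and it appears as `Unnamed: 0` when the file is read again. If that column is not wanted, `index=False` suppresses it. The sketch below uses in-memory buffers and invented data instead of files on Drive.

```python
import io

import pandas as pd

# Toy data frame whose index has gaps, as after dropping rows
df = pd.DataFrame({"Retention Time": [5.0, 7.0]}, index=[3, 8])

with_index = io.StringIO()
df.to_csv(with_index)                   # default: index written as first column

without_index = io.StringIO()
df.to_csv(without_index, index=False)   # index omitted

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

print(cols_with)
print(cols_without)
```

Keeping the index can be useful for tracing a row back to the original export; dropping it keeps the column count of the output equal to that of the input.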

In [None]:
ER_0789a_n_RIR = rm_no_RI(ER_0789a_n_filtered)
ER_0788a_n_RIR = rm_no_RI(ER_0788a_n_filtered)
ER_0788b_n_RIR = rm_no_RI(ER_0788b_n_filtered)
ER_0787a_n_RIR = rm_no_RI(ER_0787a_n_filtered)
ER_0787b_n_RIR = rm_no_RI(ER_0787b_n_filtered)
ER_0792a_n_RIR = rm_no_RI(ER_0792a_n_filtered)

In [None]:
print(ER_0789a_n_RIR.shape)
print(ER_0788a_n_RIR.shape)
print(ER_0788b_n_RIR.shape)
print(ER_0787a_n_RIR.shape)
print(ER_0787b_n_RIR.shape)
print(ER_0792a_n_RIR.shape)

(38, 21)
(36, 21)
(35, 21)
(33, 21)
(33, 21)
(57, 21)


In [None]:
ER_0789a_n_RIR.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0789a_n2_RIR.csv')
ER_0788a_n_RIR.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0788a_n2_RIR.csv')
ER_0788b_n_RIR.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0788b_n2_RIR.csv')
ER_0787a_n_RIR.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0787a_n2_RIR.csv')
ER_0787b_n_RIR.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0787b_n2_RIR.csv')
ER_0792a_n_RIR.to_csv('/content/drive/My Drive/Reusser_alkene_screening/ER_0792a_n2_RIR.csv')