<font size=4> <i>Script in progress </i></font>

This notebook assumes that the user has already worked through the Stats Notebook (Performing basic uni- and multivariate statistical analysis of untargeted metabolomics data) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Functional-Metabolomics-Lab/Statistical-analysis-of-non-targeted-LC-MSMS-data/blob/main/Stats_Untargeted_Metabolomics.ipynb). Here, we load the necessary files and perform batch correction.

# Setting a Working Directory

In [1]:
Directory <- normalizePath(readline("Enter the path of the folder with input files: "),"/",mustWork=FALSE)
setwd(Directory)

Enter the path of the folder with input files:  C:\Users\abzer\OneDrive\Documents\GitHub\Statistical-analysis-of-non-targeted-LC-MSMS-data\data


In [3]:
file_names <- list.files('.') #list all the files in the working directory (mentioned by 'dot symbol')
print(file_names)

[1] "20221102_SD_BeachSurvey_batchFile.xml"                
[2] "20221125_Metadata_SD_Beaches_with_injection_order.txt"
[3] "2023-03-02_Ft_md_merged.csv"                          
[4] "DB_analog_result_FBMN.tsv"                            
[5] "SD_BeachSurvey_GapFilled_quant.csv"                   


In [4]:
ft_merged <- read.csv(file_names[3], header = TRUE, check.names = FALSE) # 3rd file in the list above: merged feature table and metadata

# <font color ='darkblue'> 1. Batch Correction</font>
<a id="batch_corr"></a>

<p style='text-align: justify;'> A 'Batch' is a group of samples processed and analyzed under the same experimental and instrumental conditions within the same short time period. In general, if we have more samples than the tray can hold, we measure them in multiple batches or groups. When arranging samples in a batch for measurement, it is advised to include QCs, blanks, and controls in addition to the samples of interest, and to ensure biological diversity within each batch (Wehrens et al., 2016). To merge data from these different batches, we must look for batch effects, both between the batches and within each batch, and correct them. <br> <b>Prior to batch correction, however, we should evaluate the severity of the batch effect. When it is small, it is best not to perform batch correction, as this may result in an incorrect estimation of the biological variance in the data. Instead, we should treat the statistical results with caution (Nygaard et al., 2016). For more details, please read the manuscript</b>.</p>
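One possible way to gauge the severity of a batch effect is sketched below. It is an illustration and not a step taken from the manuscript. It runs a PCA on simulated data and asks how much of the first principal component is explained by batch membership. The data, the size of the shift and the use of R² as a summary are all assumptions made for this example.

```r
# Simulated data (hypothetical): 20 samples x 5 features in two batches
set.seed(1)
n_per_batch <- 10
batch <- factor(rep(c("B1", "B2"), each = n_per_batch))
x <- matrix(rnorm(2 * n_per_batch * 5, mean = 10, sd = 1), ncol = 5)
x[batch == "B2", ] <- x[batch == "B2", ] + 3 # additive shift applied to batch B2 only

# PCA on centered and scaled features
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of the variance in PC1 scores that is explained by batch membership
r2_pc1 <- summary(lm(pca$x[, 1] ~ batch))$r.squared
r2_pc1
```

An R² close to 1 means the samples separate by batch along PC1. A value close to 0 suggests that batch contributes little to the main axis of variation. Where to draw the line between the two is a judgment call for the analyst.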

## 1.1. Inter-batch correction:
<a id="inter_batch"></a>

<p style='text-align: justify;'> In this tutorial, the test dataset was used to evaluate the chemical impacts of a significant rain event that occurred in northern San Diego, California (USA) during the Winter of 2017/2018. Despite the presence of an "ATTRIBUTE_Batch" column in the metadata, the 3 groups listed there are not considered batches because of their distinct collection conditions. The "ATTRIBUTE_time_run" column indicates that the seawater samples were collected and measured at different times: Dec 2017, Jan 2018 (after rainfall), and Oct 2018, respectively. Therefore, searching for inter-batch effects is not meaningful here, since these are 3 distinct groups that were not measured at the same time. </p>

<font color="red">If the user is dealing with different batches, they can perform inter-batch correction using the following steps:</font>
    
1. Calculate the <b>overall mean</b> of each feature 
2. Calculate the mean of each feature for each batch, referred to as &rarr; <b>Batchwise feature-mean</b>
3. The feature intensities in each batch are then divided by the <b>batchwise feature-mean</b> and multiplied by the <b>overall mean</b>.
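
The three steps above can be tried on a small toy table before running them on the real data. The sketch below uses base R only. The table `toy`, its values and the helper names are hypothetical and exist only for illustration.

```r
# Toy data (hypothetical): 6 samples, 2 features, 2 batches
toy <- data.frame(
  batch = rep(c("A", "B"), each = 3),
  f1 = c(10, 12, 14, 20, 24, 28),
  f2 = c(5, 5, 8, 2, 3, 4),
  stringsAsFactors = FALSE
)
feats <- c("f1", "f2")

# Step 1: overall mean of each feature
overall_mean <- colMeans(toy[, feats])

# Step 2: batchwise feature-mean (one row per batch)
batch_mean <- aggregate(toy[, feats], by = list(batch = toy$batch), FUN = mean)
rownames(batch_mean) <- batch_mean$batch

# Step 3: divide each sample by the feature-means of its own batch,
# then multiply each feature by its overall mean
bm_per_sample <- as.matrix(batch_mean[toy$batch, feats])
corrected <- sweep(as.matrix(toy[, feats]) / bm_per_sample, 2, overall_mean, "*")

# After correction, each batch has the same feature means as the whole dataset
aggregate(corrected, by = list(batch = toy$batch), FUN = mean)
```

The correction rescales every batch to a common mean per feature, so any true difference between batch means is removed as well. This is why the method only makes sense when the batches are expected to be comparable.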

In [None]:
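library(dplyr)  # provides %>%, select(), group_by(), group_split(), bind_rows()
library(tibble) # provides column_to_rownames()

# selecting only the filename & batch info column along with all feature intensity columns
ft_merged2 <- ft_merged %>% select(`filename`,`ATTRIBUTE_Batch`,starts_with("X")) 
head(ft_merged2,n=2)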

Now, we can continue with the batch correction steps. <br>

<b> Step 1: Calculate the overall mean of each feature: </b>

In [None]:
fm <- as.data.frame(rbind(colMeans(ft_merged2[,-(1:2)]))) #getting the columnwise mean for ft_merged2 except its 1st 2 columns
head(fm)

<b>Step 2: Getting batchwise feature-mean</b>

In [None]:
bm <- ft_merged2[,-1] %>%  # excluding the filename column, as we only need the batchwise mean values
    group_by(`ATTRIBUTE_Batch`) %>%  # grouping the rows by batch
    summarise_all(mean) %>% # getting the column-wise mean within each batch
    column_to_rownames('ATTRIBUTE_Batch') %>% # using the batch labels as row names
    as.data.frame() # storing it as a dataframe

bm

We can also split the 'ft_merged2' dataframe into batchwise dataframes using the 'group_split' function. It returns a list in which each element is a batch-specific dataframe.

In [None]:
batch_df <- ft_merged2 %>%
    group_split(`ATTRIBUTE_Batch`) %>% # group_split splits the data by batch and stores each batch as an individual dataframe inside a list
    lapply(function(x) { # lapply applies the function below to each element of that list
        column_to_rownames(x,'filename') # making "filename" the rownames of each dataframe within the list
    })

sapply(batch_df, dim) # gives the dimensions of each list element, one column per element

The above output shows that there are 3 dataframes inside the 'batch_df' list, each with 62 rows and 11218 columns. Let's look at the 1st dataframe inside batch_df. It contains the 'Batch 1' data.

In [None]:
head(batch_df[[1]],n=2) 

<b>Step 3: Correcting for inter-batch effect:</b> <br>
Here, the feature intensities of each batch in the batch list are divided by the <b>batchwise feature-mean</b> and multiplied by the <b>overall mean</b>.

In [None]:
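ib <- list() # creating an empty list for storing inter-batch corrected data

# row i of bm corresponds to batch_df[[i]], as both are ordered by ATTRIBUTE_Batch
# 1 is added to both means, which avoids division by zero when a batchwise feature-mean is 0
for (i in seq_along(batch_df)){
    ib[[i]] <- sweep((batch_df[[i]][,-1]), 2, as.numeric(bm[i,]+1), "/") # dividing each batch dataframe (without its batch column) by the batchwise feature-mean
    ib[[i]] <- sweep(ib[[i]], 2, as.numeric(fm+1), "*") # multiplying by overall mean
}

ib <- bind_rows(ib) # binding all the list elements together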

In [None]:
head(ib,n=2)
dim(ib)

## 1.2. Intra-batch correction:
<a id="intra_batch"></a>

<p style='text-align: justify;'> <b>Pooled QC samples or internal standards are needed to correct for intra-batch effects</b>. Without pooled QCs, one can still look for an intra-batch effect by visualizing housekeeping features across the injection order or run time. For the test data used here, some components typically found in DOM samples, as mentioned in the study by <a href="https://doi.org/10.1016/j.chemosphere.2020.129450">Petras et al</a>, are dibutyl phthalate, pheophorbide A and tryptophan. We will look at the feature 'tryptophan' to see whether any intensity drift is observed along the run time or injection order. However, housekeeping features can only reveal the effect; correcting it requires QCs. Without QCs, it is better to leave the effect uncorrected and treat the data cautiously during statistical analyses. Normalizing the data, in general, accounts for batch effects to a certain extent.<font color=red> {insert ref} </font> </p>

- <font color = "red"> Say how you can perform intra-batch usually with QCs?</font>

<b> Visualizing within-batch effect using housekeeping features:</b> <br>
Since the annotations are combined with the column names, we can search the column names of ft_merged for 'tryptophan':

In [None]:
print(grep("tryptophan",colnames(ft_merged),ignore.case=TRUE,value=TRUE))

The molecular weight of tryptophan is 204.22 g/mol. Here, we see tryptophan peaks (~205 m/z, consistent with the protonated ion [M+H]<sup>+</sup>) at 2 different RTs. The column names follow the pattern libraryID_RT_mz_Annotation. We will choose the feature with the higher retention time (row ID 7683) to plot against the injection order.

In [None]:
grep('X7683',colnames(ft_merged)) # gets the column number of tryptophan; check that exactly one number is returned

<font color="red">For your own dataset, the column numbers can vary. Change the number accordingly in the cell below at `y=ft_merged[,1088]` and `limits=c(-1e6,max(ft_merged[,1088]))` </font>
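
To avoid copying a column number by hand, the column can also be looked up by name and reused. The sketch below shows the idea on a toy table. The column names, including the `_` separator after the row ID, are hypothetical and should be adapted to the names in your own table.

```r
# Toy table (hypothetical column names)
toy <- data.frame(
  filename = c("s1", "s2", "s3"),
  X12_100.1_0.5_NA = c(1, 2, 3),
  X7683_205.097_3.2_Tryptophan = c(4e5, 5e5, 6e5),
  check.names = FALSE
)

# Anchor the pattern so that 'X7683' does not also match e.g. 'X76830'
hits <- grep("^X7683_", colnames(toy))
stopifnot(length(hits) == 1) # stop early if the feature is missing or ambiguous

trp_col <- colnames(toy)[hits] # store the column name instead of its position
y_max <- max(toy[[trp_col]])   # the same name can be reused, e.g. for axis limits
```

Storing the name rather than the position keeps later cells valid even when columns are added or reordered.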

In [None]:
library(ggplot2) # for plotting

ggplot(ft_merged, 
       aes(x=`ATTRIBUTE_Injection_order`, 
           y=ft_merged[,1088])) + #paste the number from the previous cell output
geom_point(size=2.5, alpha=0.9, 
           aes(color=as.factor(`ATTRIBUTE_Batch`), 
               shape = `ATTRIBUTE_Sample.Type`)) +
geom_smooth(method = 'lm',na.rm = TRUE) +  # to add a linear trend line
scale_y_continuous(labels = scales::scientific,
                   limits=c(-1e6,max(ft_merged[,1088]))) + #paste the number from the previous cell output
labs(y = 'Intensity', color = 'Batch', shape = 'Sample type') # readable axis and legend titles

Since we do not observe a large drift in the intensity, we do not need to correct for an intra-batch effect. When QCs are available and an intensity drift is observed, a within-batch correction can be performed.
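
A commonly used approach for within-batch correction with pooled QCs is to fit a smooth curve (for example LOESS) to the QC intensities along the injection order and divide every injection by the fitted value. The sketch below illustrates this on one simulated feature. It is an illustration under stated assumptions and not the procedure of this notebook. The data, the QC spacing, and the `span` and `degree` settings are all hypothetical choices.

```r
# Simulated feature (hypothetical): 40 injections with a downward drift
set.seed(42)
n <- 40
injection_order <- seq_len(n)
is_qc <- injection_order %% 5 == 1 | injection_order == n # a QC every 5th injection, plus the last one
drift <- 1 - 0.01 * injection_order
intensity <- 1e6 * drift * exp(rnorm(n, sd = 0.02))

# Fit the drift curve on the QC injections only
qc <- data.frame(inj = injection_order[is_qc], intensity = intensity[is_qc])
fit <- loess(intensity ~ inj, data = qc, span = 0.75, degree = 1)

# Predict the drift at every injection. The QCs bracket the run,
# so no prediction falls outside the fitted range
predicted <- predict(fit, newdata = data.frame(inj = injection_order))

# Divide by the fitted drift and rescale to the median QC intensity
corrected <- intensity / predicted * median(qc$intensity)

# Compare the trend along the injection order before and after correction
c(before = cor(intensity, injection_order), after = cor(corrected, injection_order))
```

In a real dataset the fit is repeated for every feature and every batch, and QCs are needed at the start and the end of the run so that all samples lie inside the fitted range.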

<font color ="red"> Acceptable range for intensity drift?  </font>