<a href="https://colab.research.google.com/github/2AMissinou/Metabolomics-Filtering/blob/main/A2M_Metabolic_Features_Filtering_v1_3_rp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<font color='dodgerblue'>Metabolic Features Clean-up and Multivariate Analysis</font>**

---

Authors: Anani Amegan Missinou (anani.a.missinou@gmail.com) <br>
Input file format: .csv files or .txt files <br>
Outputs: .csv files  <br>
Dependencies: ggplot2, dplyr, ecodist, vegan, svglite, tidyverse

## Run rmagic by executing the command `%load_ext rpy2.ipython`
Start rmagic by executing `%load_ext rpy2.ipython` in a cell. After that, add the cell magic `%%R` as the first line of every cell in which you want to use R.




In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
#installing and calling the necessary packages:
install.packages("ggplot2")
install.packages("dplyr")
install.packages("ecodist") #for PCoA using Bray Curtis distance
install.packages("vegan") #for PermANOVA
install.packages("svglite") # for saving ggplots as svg files
install.packages("tidyverse")

In [None]:
%%R
# library() stops with an error if a package is missing, instead of silently returning FALSE
library("ggplot2")
library("dplyr")
library("ecodist")
library("vegan")
library("svglite")
library("tidyverse")

## Recipes for loading and saving data from external sources

You can upload/import files from your local file system, mount Google Drive and read the data from there, or pull the data files directly from the GitHub page:



### Upload/Import files from your local file system

In [None]:

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

### Write and download a file to your local file system

In [None]:

from google.colab import files

with open('example.txt', 'w') as f:
  f.write('some content')

files.download('example.txt')

### Mount Google Drive and read the input data:


In [None]:
# Mount Google drive directory 
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

## Setting a local working directory and creating an automatic result directory:
If you are working with Jupyter Notebook, you can simply copy the folder path from your local computer and paste it at the prompt of the next cell (or into `setwd()`). It will be set as your working directory. <br>
For example: D:\User\Project\Test_Data <br>
<br>
For Google Colab, upload the necessary files into a new folder using the 'Files' icon on the left and set that folder as the working directory. All output files will be saved there as well, and you need to download them to your local computer at the end.

In [81]:
%%R
# setting the current directory as the working directory
#Directory <- normalizePath(readline("Enter the path of the folder with input files: "),"/",mustWork=FALSE)
#setwd(Directory)
setwd("/content/drive/MyDrive/Data Science Training/CMFI_Seminar_Multivariate_Statistics")


In [78]:
%%R
getwd()

[1] "/content/drive/MyDrive/Data Science Training/CMFI_Seminar_Multivariate_Statistics"


In [80]:
%%R
# Getting all the files in the folder
dirs <- dir(path=paste(getwd(), sep=""), full.names=TRUE, recursive=TRUE)
folders <- unique(dirname(dirs))
files <- list.files(folders, full.names=TRUE)
files_1 <- basename((files))
files_2 <- dirname((files))
# Creating a Result folder
dir.create(path=paste(files_2[[1]], "_Results", sep=""), showWarnings = TRUE)
fName <-paste(files_2[[1]], "_Results", sep="")

print(files_1)

[1] "Normalised_Quant_table.csv"                                       
[2] "Quant_Table_filled_with_MinValue_3766.csv"                        
[3] "20220716_Xenobiotic_metabolism_gapfilled_quant_Bsub.csv"          
[4] "20220716_Xenobiotic_metabolism_gapfilled_quant_Bsub.mgf"          
[5] "20220716_Xenobiotic_Metabolism_metadata_Bsub.txt"                 
[6] "20220716_Xenobiotic_metabolism_non_gapfilled_quant_Bsub_quant.csv"
[7] "20220716_Xenobiotic_metabolism_non_gapfilled_quant_Bsub.mgf"      
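
The cell above names the result folder after the input folder plus a `_Results` suffix. The same convention can be sketched on the Python side of the notebook as follows. This is only an illustration: `make_results_dir` and the `Test_Data` folder are hypothetical names, and the demonstration runs in a temporary directory rather than in your Drive.

```python
from pathlib import Path
import tempfile


def make_results_dir(input_dir):
    """Create '<input_dir>_Results' next to input_dir and return its path."""
    input_dir = Path(input_dir)
    results = input_dir.with_name(input_dir.name + "_Results")
    # exist_ok=True makes repeated runs harmless
    results.mkdir(exist_ok=True)
    return results


# Demonstration in a throwaway temporary directory
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "Test_Data"  # hypothetical input folder
    data_dir.mkdir()
    out_dir = make_results_dir(data_dir)
    print(out_dir.name)  # Test_Data_Results
```

Unlike `dir.create(..., showWarnings = TRUE)` in the R cell, this sketch stays silent when the folder already exists.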


**<font color='orange'> In the following line, enter the required file ID numbers separated by commas. For example as: 1,2,3 </font>**

In [None]:
%%R
input <- as.double(unlist(strsplit(readline("Specify the file index of gapfilled & non-gapfilled feature-file, metadata:"), split=",")))

# Get the extension of each file (e.g. "csv")
pattern <- tools::file_ext(files_1)

# Comma-separated for .csv files, tab-separated otherwise (e.g. .txt)
# Note: row names are not set here; they are assigned in a later cell
ft <- read.csv(files_1[input[1]], sep = ifelse(pattern[input[1]] != "csv", "\t", ","), header = TRUE, check.names = FALSE)
nft <- read.csv(files_1[input[2]], sep = ifelse(pattern[input[2]] != "csv", "\t", ","), header = TRUE, check.names = FALSE)
md <- read.csv(files_1[input[3]], sep = ifelse(pattern[input[3]] != "csv", "\t", ","), header = TRUE, check.names = FALSE)
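
The separator rule used above (comma for `.csv`, tab for anything else) can be sketched in Python for a quick check of a file before handing it to R. This is a minimal sketch, not the notebook's method: `delimiter_for` and `read_table` are hypothetical helper names, the sample text is invented for illustration, and the extension comparison is made case-insensitive here.

```python
import csv
import io
from pathlib import Path


def delimiter_for(filename):
    """Return ',' for .csv files and a tab for anything else (e.g. .txt)."""
    return "," if Path(filename).suffix.lower() == ".csv" else "\t"


def read_table(filename, text):
    """Parse delimited text into a list of dict rows keyed by the header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter_for(filename))
    return list(reader)


# Invented two-line metadata table, tab-separated because of the .txt extension
rows = read_table("metadata.txt", "filename\tATTRIBUTE_replicates\nA.mzML\t1\n")
print(rows[0]["filename"])  # A.mzML
```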

In [93]:
%%R
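# nft_url, ft_url and md_url must already be defined (they are set in the cell that lists the GitHub URLs)
nft <- read.csv(nft_url, header = TRUE, check.names = FALSE)
ft <- read.csv(ft_url, header = TRUE, check.names = FALSE)
md <- read.csv(md_url, header = TRUE, check.names = FALSE, sep = '\t')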

Check that the data have been read correctly!

In [None]:
%%R
head(nft)
dim(nft)

In [None]:
%%R 
head(ft)
dim(ft)

In [None]:
%%R
head(md)
dim(md)

Next, we bring the feature table and the metadata into a matching format, so that the row names of the metadata are the same as the column names of the feature table. Both are the file names, and they must be identical because from now on we select columns of the feature table based on the metadata. This lets the user filter the data easily via the metadata. You can also work on the feature table directly without metadata, with some extra coding, but having metadata greatly improves the user experience.
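
Because everything downstream depends on the file names agreeing between the two tables, a quick comparison can catch mismatches early. The sketch below uses plain Python sets. `compare_names` is a hypothetical helper and the sample names are invented for illustration.

```python
def compare_names(feature_columns, metadata_filenames):
    """Report sample names that appear in only one of the two tables."""
    ft_names = set(feature_columns)
    md_names = set(metadata_filenames)
    return {
        "only_in_feature_table": sorted(ft_names - md_names),
        "only_in_metadata": sorted(md_names - ft_names),
    }


# Invented names for illustration
feature_columns = ["A.mzML", "B.mzML", "C.mzML"]
metadata_filenames = ["A.mzML", "B.mzML", "D.mzML"]

report = compare_names(feature_columns, metadata_filenames)
print(report)
```

Two empty lists in the report mean that every feature-table column has a metadata row and vice versa.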

In [55]:
%%R
## Reading the input data using URL (from GitHub):
# Alternatively, we can also directly pull the data files from our GitHub page.
# Note: read.csv() needs a raw-file URL (raw.githubusercontent.com/...); a github.com/.../tree/... link returns an HTML page.

## Non-gap filled
nft_url <- 'https://github.com/2AMissinou/Metabolomics-Filtering/tree/main/CMFI_Seminar_Multivariate_Statistics-main/Test_Data/20220716_Xenobiotic_metabolism_non_gapfilled_quant_Bsub_quant.csv'
## Gap filled
ft_url <- 'https://github.com/2AMissinou/Metabolomics-Filtering/tree/main/CMFI_Seminar_Multivariate_Statistics-main/Test_Data/20220716_Xenobiotic_metabolism_gapfilled_quant_Bsub.csv'
## Metadata
md_url <- 'https://github.com/2AMissinou/Metabolomics-Filtering/tree/main/CMFI_Seminar_Multivariate_Statistics-main/Test_Data/20220716_Xenobiotic_Metabolism_metadata_Bsub.txt'
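
GitHub serves file contents from `raw.githubusercontent.com`, whereas `github.com/<user>/<repo>/tree/...` or `.../blob/...` links open the web page. The sketch below rewrites one form into the other. `to_raw_github_url` is a hypothetical helper that assumes the branch name contains no slash, and it has not been checked that the files are actually served at the resulting addresses.

```python
from urllib.parse import urlsplit


def to_raw_github_url(url):
    """Rewrite a github.com 'blob' or 'tree' file link into a raw.githubusercontent.com link."""
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        return url  # leave anything else untouched
    segments = parts.path.strip("/").split("/")
    # Expected layout: user / repo / (blob|tree) / branch / path...
    if len(segments) < 5 or segments[2] not in ("blob", "tree"):
        return url
    user, repo, _, branch, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, branch, *path])


md_page_url = (
    "https://github.com/2AMissinou/Metabolomics-Filtering/tree/main/"
    "CMFI_Seminar_Multivariate_Statistics-main/Test_Data/"
    "20220716_Xenobiotic_Metabolism_metadata_Bsub.txt"
)
print(to_raw_github_url(md_page_url))
```

The printed address is the kind of URL that `read.csv()` can read directly.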

In [None]:
%%R
#Removing Peak area extensions
colnames(ft) <- gsub(' Peak area','',colnames(ft))
colnames(nft) <- gsub(' Peak area','',colnames(nft))
md$filename<- gsub(' Peak area','',md$filename)

#Removing any columns that contain only NA values from the ft, nft and md tables
ft <- ft[,colSums(is.na(ft))<nrow(ft)]
nft <- nft[,colSums(is.na(nft))<nrow(nft)]
md <- md[,colSums(is.na(md))<nrow(md)]

#Changing the row names of the files
rownames(md) <- md$filename
md <- md[,-1]
rownames(ft) <- paste(ft$'row ID',round(ft$'row m/z',digits = 3),round(ft$'row retention time',digits = 3), sep = '_')
rownames(nft) <- paste(nft$'row ID',round(nft$'row m/z',digits = 3),round(nft$'row retention time',digits = 3), sep = '_')

#Picking only the files with column names containing 'mzML'
ft <- ft[,grep('mzML',colnames(ft))]
nft <- nft[,grep('mzML',colnames(nft))]

# Converting replicate attributes into factors (categorical data)
md$ATTRIBUTE_replicates <- as.factor(md$ATTRIBUTE_replicates)
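
To make the effect of these clean-up steps concrete, here is a small Python sketch of the same three transformations on one invented row: building the row name from ID, m/z and retention time, keeping only the `mzML` columns, and stripping the ` Peak area` suffix. `clean_feature_table` is a hypothetical helper and the values are made up. It mirrors the R cell only in outline.

```python
def clean_feature_table(rows):
    """rows: list of dicts keyed by the raw column names of a quant table."""
    cleaned = {}
    for row in rows:
        # Row name: ID, m/z and retention time (rounded to 3 decimals), joined by '_'
        name = "_".join([
            str(row["row ID"]),
            str(round(float(row["row m/z"]), 3)),
            str(round(float(row["row retention time"]), 3)),
        ])
        # Keep only sample columns (those containing 'mzML'), without ' Peak area'
        cleaned[name] = {
            col.replace(" Peak area", ""): float(val)
            for col, val in row.items()
            if "mzML" in col
        }
    return cleaned


# Invented single-feature table for illustration
rows = [{
    "row ID": "1",
    "row m/z": "150.12345",
    "row retention time": "2.5",
    "A.mzML Peak area": "1000",
    "B.mzML Peak area": "0",
}]
table = clean_feature_table(rows)
print(table)
```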

In [None]:
%%R
# setting the current directory as the working directory
Directory <- normalizePath(readline("Enter the path of the folder with input files: "),"/",mustWork=FALSE)
setwd(Directory)

## Recommended format: 'D:/Parts/DATA/Metabolomics/Metabolomics_tools/CMFI Mass Spec Seminar Series/#14 - Feature-Table Data Clean-Up and Multivariate Analysis'