# Preprocessing Step
---

This notebook carries out the preprocessing steps for the metabolomics data:
- Imputation
- Normalization
- Log2 Transformation

## Input

### Libraries

In [1]:
# To use RCall for the first time, one needs to set
# the location of the R home directory and rebuild RCall.
using Pkg # required for Pkg.build below
firstTimeRCall = false
if firstTimeRCall
    ENV["R_HOME"] = "C:/PROGRA~1/R/R-42~1.1" # from R.home() in R
    Pkg.build("RCall")
end

In [2]:
using CSV, DataFrames, Missings#, CategoricalArrays
using StatsBase, Statistics#, MultivariateStats
using FreqTables#, Plots, StatsPlots
using RCall 

### Ext. Functions

In [3]:
include(joinpath(@__DIR__,"..","..","src","preprocessing.jl" ));
include(joinpath(@__DIR__,"..","..","src","wrangling_utils.jl" ));

### Load data

#### Reference file

In [4]:
# Get reference metabolite file
fileRef = joinpath(@__DIR__,"..","..","data","processed","refMeta.csv");
dfRef = CSV.read(fileRef, DataFrame);
print_df_size(dfRef)

The dataframe contains 588 rows and 22 columns


#### Metabolite signatures

In [5]:
# Get negative metabolite file
fileMetabo = realpath(joinpath(@__DIR__,"..","..","data","processed","Metabo_old.csv"));
dfMetabo = CSV.read(fileMetabo, DataFrame);
println("The negative metabolite dataset contains $(size(dfMetabo, 1)) samples and $(size(dfMetabo, 2)-1) metabolites.")

The negative metabolite dataset contains 26 samples and 590 metabolites.


## Imputation

Check if imputation is needed:

In [6]:
summary_variables_missing(dfMetabo)

No missing data.
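
The check above relies on `summary_variables_missing` from the project's `wrangling_utils.jl`, whose implementation is not shown here. As a rough illustration of what such a check involves, the sketch below counts `missing` entries per column using only Base Julia. The toy columns `MT1` and `MT2` are made up for the example and are not taken from the dataset.

```julia
# Toy columns standing in for metabolite variables (hypothetical data).
cols = (MT1 = [1.0, missing, 3.0], MT2 = [0.5, 0.7, 0.9])

# Count the missing entries in each column; `map` over a NamedTuple
# returns a NamedTuple with the same keys.
n_missing = map(col -> count(ismissing, col), cols)

# Imputation is only needed if at least one column has a missing value.
needs_imputation = any(>(0), values(n_missing))

println(n_missing)
println("Imputation needed: ", needs_imputation)
```

If every count is zero, as reported for this dataset, the imputation step can be skipped.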


## Normalization
----

### Probabilistic Quotient Normalization

> 1. Perform an integral normalization (typically a constant integral of 100 is used).
> 2. Choose/calculate the reference spectrum (the best approach is the calculation of the median spectrum of control samples).
> 3. Calculate the quotients of all variables of interest of the test spectrum with those of the reference spectrum.
> 4. Calculate the median of these quotients.
> 5. Divide all variables of the test spectrum by this median.
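
The five steps above can be sketched in a few lines of Julia. This is a minimal illustration, not the `pqnorm` function from `preprocessing.jl` (whose source is not shown here): the name `pqn_sketch`, the `total` keyword, and the choice of the median over all samples as the reference spectrum are assumptions made for the example. It expects a samples-by-metabolites matrix.

```julia
using Statistics

# Hypothetical helper illustrating the PQN steps; not the notebook's `pqnorm`.
function pqn_sketch(X::AbstractMatrix{<:Real}; total = 100.0)
    # Step 1: integral normalization, each sample (row) sums to `total`.
    Xint = X ./ sum(X, dims = 2) .* total

    # Step 2: reference spectrum. Assumption: median over all samples,
    # since no control group is identified in this sketch.
    ref = vec(median(Xint, dims = 1))
    keep = ref .> 0 # avoid dividing by a zero reference value

    Xout = similar(Xint, Float64)
    for i in axes(Xint, 1)
        # Step 3: quotients of the sample against the reference spectrum.
        q = Xint[i, keep] ./ ref[keep]
        # Steps 4 and 5: divide the sample by the median quotient.
        Xout[i, :] = Xint[i, :] ./ median(q)
    end
    return Xout
end

# Three toy samples that differ only by a dilution factor.
X = [1.0 2.0 3.0; 2.0 4.0 6.0; 4.0 8.0 12.0]
Xn = pqn_sketch(X)
```

Because the toy samples differ only by dilution, all rows of `Xn` end up identical, which is the effect the normalization is meant to achieve.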


In [7]:
df = pqnorm(dfMetabo, startCol = 2);
first(df, 3)

Row,SampleID,MT10001,MT10002,MT10003,MT10004,MT10005,MT10006,MT10007,MT10008,MT10009,MT10010,MT10011,MT10012,MT10013,MT10014,MT10015,MT10016,MT10017,MT10018,MT10019,MT10020,MT10021,MT10022,MT10023,MT10024,MT10025,MT10026,MT10027,MT10028,MT10029,MT10030,MT10031,MT10032,MT10033,MT10034,MT10035,MT10036,MT10037,MT10038,MT10039,MT10040,MT10041,MT10042,MT10043,MT10044,MT10045,MT10046,MT10047,MT10048,MT10049,MT10050,MT10051,MT10052,MT10053,MT10054,MT10055,MT10056,MT10057,MT10058,MT10059,MT10060,MT10061,MT10062,MT10063,MT10064,MT10065,MT10066,MT10067,MT10068,MT10069,MT10070,MT10071,MT10072,MT10073,MT10074,MT10075,MT10076,MT10077,MT10078,MT10079,MT10080,MT10081,MT10082,MT10083,MT10084,MT10085,MT10086,MT10087,MT10088,MT10089,MT10090,MT10091,MT10092,MT10093,MT10094,MT10095,MT10096,MT10097,MT10098,MT10099,⋯
Unnamed: 0_level_1,String7,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,⋯
1,ID_8358,0.927116,0.1112,0.191489,0.170838,0.0250247,0.0727422,1.96025,3.3038,0.0637297,1.1639,5.43317,28.2027,4.68091,0.3055,0.120578,0.00178243,0.00125641,0.000122501,0.000648381,0.00279648,0.00109979,0.00141082,0.00076746,0.00223166,0.000723136,0.00244708,0.00253656,0.00325909,0.0247078,0.138797,0.00658138,0.0119929,0.0133137,0.00450831,0.00227563,0.00820304,0.0363347,0.00632083,0.00285019,0.00238772,0.00155109,0.0036002,0.000397416,0.00103141,0.000720601,0.00415347,0.0,0.05688,0.0362284,0.00138894,0.000155439,0.00328304,0.000878607,0.0156385,0.0231606,0.000579967,0.000435904,0.0,0.0,0.000433841,0.00163825,0.00908995,0.00619936,0.0225298,0.0333433,0.000340793,0.000277757,0.000187578,0.00072999,0.000338608,0.000477639,0.000476282,0.000209851,0.000477837,0.00100457,0.000692627,0.000126519,0.000106107,0.000423064,0.000702681,0.000589148,0.000602564,0.0123866,0.130576,0.0670281,0.00310687,0.0194128,0.0296686,0.0249781,0.0998122,0.0104715,0.295451,0.0739307,0.0496915,0.0470791,0.021989,0.0581613,0.0896231,0.404854,⋯
2,ID_8370,1.74833,0.234566,0.361085,0.404351,0.0167417,0.0856783,4.70527,7.7702,0.140981,2.01535,13.2623,101.145,9.29093,1.44748,0.417837,0.00187567,0.000975269,0.000178255,0.00115092,0.00396501,0.00111231,0.00136917,0.000968185,0.00255134,0.00101505,0.00213384,0.00281101,0.00334522,0.0877113,2.14875,0.0772441,0.122669,0.198582,0.136642,0.0111269,0.0278811,0.151556,0.0141703,0.00515175,0.0216337,0.0146715,0.0720535,0.0184738,0.011065,0.0042721,0.049927,0.00292398,0.129614,0.0476138,0.0158283,0.00137668,0.002616,0.00278814,0.0324019,0.0228627,0.00111322,0.00185592,0.00100824,0.000218267,0.00451509,0.00181229,0.027524,0.00830781,0.0787198,0.0975405,0.000321746,0.00024442,0.000113678,0.000620492,0.000227495,0.000418338,0.000457429,0.000238084,0.000657814,0.0022815,0.000998544,4.5685e-05,0.00013472,0.000460724,0.00031772,0.000586251,0.000812436,0.0416201,0.215073,0.142543,0.00676283,0.0332415,0.0494194,0.0880554,0.20306,0.0186122,0.734117,0.104191,0.0603695,0.172245,0.0991541,0.0490871,0.157919,1.09033,⋯
3,ID_8378,1.32454,0.122387,0.192543,0.263061,0.0121213,0.0801073,2.75957,8.09386,0.0742277,0.694728,9.62645,51.2608,6.31774,0.616001,0.156156,0.000927477,0.00028754,0.000108537,0.00177058,0.0100534,0.00254842,0.000746835,0.000979749,0.00212162,0.000198859,0.000645667,0.00220916,0.00660744,0.0827329,0.179732,0.0130542,0.0369821,0.035005,0.00479483,0.0110716,0.0238696,0.127575,0.00384548,0.00321389,0.0148919,0.00855081,0.00665483,0.00240562,0.0147389,0.0112803,0.0046398,0.00520706,0.141871,0.0417757,0.0196537,0.00211242,0.00341629,0.00924601,0.0349223,0.0228917,0.000691376,0.000809838,0.000222498,0.000160269,0.00636141,0.000954845,0.0215574,0.00525607,0.0993853,0.0613583,0.000371016,0.0004856,0.000113592,0.00100663,0.000308091,0.000376678,0.000299875,0.000642527,0.00118469,0.00536642,0.00179496,0.000228609,0.00021822,0.000984546,0.000492161,0.00057662,0.000932829,0.0480428,0.243249,0.172623,0.0065151,0.0257212,0.0560553,0.0967311,0.35906,0.0253458,0.912677,0.123346,0.0565454,0.119349,0.0467782,0.0646726,0.197627,1.37777,⋯


## Transformation
---

A simple and widely used transformation to make data more symmetric and homoscedastic is the log-transformation.

In [8]:
df = log2tx(df, startCol = 2);
first(df, 2)

Row,SampleID,MT10001,MT10002,MT10003,MT10004,MT10005,MT10006,MT10007,MT10008,MT10009,MT10010,MT10011,MT10012,MT10013,MT10014,MT10015,MT10016,MT10017,MT10018,MT10019,MT10020,MT10021,MT10022,MT10023,MT10024,MT10025,MT10026,MT10027,MT10028,MT10029,MT10030,MT10031,MT10032,MT10033,MT10034,MT10035,MT10036,MT10037,MT10038,MT10039,MT10040,MT10041,MT10042,MT10043,MT10044,MT10045,MT10046,MT10047,MT10048,MT10049,MT10050,MT10051,MT10052,MT10053,MT10054,MT10055,MT10056,MT10057,MT10058,MT10059,MT10060,MT10061,MT10062,MT10063,MT10064,MT10065,MT10066,MT10067,MT10068,MT10069,MT10070,MT10071,MT10072,MT10073,MT10074,MT10075,MT10076,MT10077,MT10078,MT10079,MT10080,MT10081,MT10082,MT10083,MT10084,MT10085,MT10086,MT10087,MT10088,MT10089,MT10090,MT10091,MT10092,MT10093,MT10094,MT10095,MT10096,MT10097,MT10098,MT10099,⋯
Unnamed: 0_level_1,String7,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,⋯
1,ID_8358,0.946443,0.152119,0.252765,0.227541,0.0356586,0.101303,1.56572,2.10561,0.0891316,1.11363,2.68553,4.86803,2.50612,0.384602,0.164243,0.00256921,0.00181148,0.00017672,0.000935113,0.00402884,0.00158579,0.00203395,0.00110679,0.00321602,0.00104289,0.00352608,0.00365485,0.00469423,0.0352125,0.18751,0.00946381,0.0171992,0.0190809,0.0064895,0.00327931,0.0117862,0.0514901,0.00909033,0.0041061,0.00344064,0.00223602,0.00518467,0.000573236,0.00148724,0.00103923,0.00597979,0.0,0.0798116,0.051342,0.00200243,0.000224234,0.00472867,0.00126701,0.022387,0.0330327,0.000836473,0.000628739,0.0,0.0,0.000625764,0.00236156,0.0130548,0.00891618,0.032143,0.0473196,0.000491576,0.000400663,0.000270593,0.00105277,0.000488426,0.000688923,0.000686966,0.00030272,0.000689208,0.00144857,0.000998904,0.000182517,0.000153072,0.000610223,0.0010134,0.000849711,0.000869054,0.0177603,0.177058,0.0935982,0.00447531,0.0277384,0.0421801,0.0355931,0.137257,0.0150287,0.373455,0.102901,0.0699654,0.0663704,0.0313797,0.0815595,0.123829,0.490421,⋯
2,ID_8370,1.45856,0.304004,0.444757,0.489904,0.0239532,0.118597,2.5123,3.13261,0.190275,1.59232,3.83413,6.67447,3.3633,1.29129,0.503692,0.00270348,0.00140633,0.000257144,0.00165947,0.00570899,0.00160384,0.00197395,0.00139612,0.00367612,0.00146367,0.0030752,0.00404974,0.00481807,0.121296,1.65478,0.107345,0.166932,0.261329,0.184777,0.0159641,0.0396733,0.203584,0.0202999,0.00741332,0.030878,0.0210128,0.100377,0.0264089,0.0158757,0.0061502,0.0702891,0.00421226,0.17583,0.067107,0.0226566,0.00198477,0.00376916,0.00401684,0.0460047,0.0326125,0.00160514,0.00267505,0.00145385,0.000314858,0.00649923,0.00261222,0.039172,0.0119361,0.10932,0.134274,0.000464106,0.000352581,0.000163994,0.000894903,0.000328168,0.000603408,0.00065978,0.000343442,0.000948713,0.00328776,0.00143988,6.5908e-05,0.000194346,0.000664531,0.0004583,0.000845534,0.00117162,0.0588292,0.281043,0.192249,0.00972386,0.0471775,0.0695914,0.121752,0.266708,0.0266049,0.794202,0.14299,0.0845671,0.229274,0.136394,0.0691345,0.211535,1.06373,⋯
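
The transformed values shown above are consistent with a shifted logarithm, log2(x + 1), rather than a plain log2(x): zeros stay at zero, and the normalized value 0.927116 maps to about 0.946443. Whether `log2tx` is implemented exactly this way is an assumption, since its source is not shown here; the helper `log2_shift` below is hypothetical.

```julia
# Hypothetical shifted log2 transform; the +1 offset keeps zeros finite.
log2_shift(x::Real) = log2(x + 1)

# First three distinct kinds of values from sample ID_8358 after normalization,
# including a zero entry.
normalized = [0.927116, 0.1112, 0.0]
transformed = log2_shift.(normalized)

println(round.(transformed, digits = 4))
```

The offset avoids `-Inf` for metabolites with zero intensity, at the cost of compressing values well below 1 more than a plain log would.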


## Save pretreatments

In [9]:
fileMeta = joinpath(@__DIR__,"..","..","data","processed","nl2_Meta_old.csv");
df |> CSV.write(fileMeta)

"C:\\git\\gregfa\\Metabolomic\\PANSTEATITISstudyST001052\\notebooks\\preprocessing\\..\\..\\data\\processed\\nl2_Meta_old.csv"

In [10]:
versioninfo()

Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 4 × Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores


In [11]:
R"""
sessionInfo()
"""

RObject{VecSxp}
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.1
