# Preprocessing Step
---

This notebook carries out the preprocessing steps for the metabolomics data:
- Imputation
- Normalization
- Log2 Transformation

## Input

### Libraries

In [1]:
# To use RCall for the first time, one needs to set
# the location of the R home directory.
firstTimeRCall = false
if firstTimeRCall
    using Pkg
    # Pick the R home directory according to the operating system.
    if Sys.iswindows()
        ENV["R_HOME"] = "C:/PROGRA~1/R/R-43~1.1" # from R.home() in R
    else
        ENV["R_HOME"] = "/usr/lib/R"
    end
    Pkg.build("RCall")
end

In [2]:
using CSV, DataFrames, Missings#, CategoricalArrays
using StatsBase, Statistics#, MultivariateStats
using FreqTables#, Plots, StatsPlots
using RCall 

### External Functions

In [3]:
include(joinpath(@__DIR__,"..","..","src","preprocessing.jl" ));
include(joinpath(@__DIR__,"..","..","src","wrangling_utils.jl" ));

### Load data

#### Reference file

In [4]:
# Get reference metabolite file
fileRef = joinpath(@__DIR__,"..","..","data","processed","refMeta.csv");
dfRef = CSV.read(fileRef, DataFrame);
print_df_size(dfRef)

The dataframe contains 588 rows and 22 columns


#### Metabolite signatures

In [5]:
# Get negative metabolite file
fileMetabo = realpath(joinpath(@__DIR__,"..","..","data","processed","Metabo.csv"));
dfMetabo = CSV.read(fileMetabo, DataFrame);
println("The negative metabolite dataset contains $(size(dfMetabo, 1)) samples and $(size(dfMetabo, 2)-1) metabolites.")

The negative metabolite dataset contains 44 samples and 590 metabolites.


## Imputation

Check if imputation is needed:

In [6]:
summary_variables_missing(dfMetabo)

No missing data.
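
As a rough illustration of what such a check involves, the sketch below counts missing entries per column using only Base Julia. It is not the `summary_variables_missing` function from `wrangling_utils.jl`; the toy columns `MT1` and `MT2` and the variable names are hypothetical.

```julia
# Hypothetical stand-in for a table (column name => values); not the notebook's data.
cols = (MT1 = [1.0, 2.0, missing], MT2 = [0.5, 0.7, 0.9])

# Count the missing entries in each column.
n_missing = map(col -> count(ismissing, col), cols)

# Imputation would only be needed if at least one column has missing entries.
needs_imputation = any(>(0), values(n_missing))
```

With a real `DataFrame`, the same idea would apply column by column.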


## Normalization
---

### Probabilistic Quotient Normalization

> 1. Perform an integral normalization (typically a constant
integral of 100 is used).
> 2. Choose/calculate the reference spectrum (the best approach
is the calculation of the median spectrum of control samples).
> 3. Calculate the quotients of all variables of interest of the test
spectrum with those of the reference spectrum.
> 4. Calculate the median of these quotients.
> 5. Divide all variables of the test spectrum by this median.
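
The five steps quoted above can be sketched in plain Julia as follows. This is a minimal illustration, not the `pqnorm` function loaded from `preprocessing.jl`: the name `pqn_sketch`, the samples-in-rows layout, and the use of the median over all samples (rather than over control samples) as the reference are assumptions.

```julia
using Statistics

# Hypothetical helper: rows are samples, columns are metabolites.
function pqn_sketch(X::AbstractMatrix{<:Real}; total = 100.0)
    # Step 1: integral normalization, scaling each sample (row) to a constant sum.
    Xn = X ./ sum(X, dims = 2) .* total
    # Step 2: reference spectrum as the per-metabolite median across samples
    # (assumption: all samples are used because no control group is defined here).
    ref = vec(median(Xn, dims = 1))
    keep = ref .> 0            # avoid dividing by a zero reference value
    out = similar(Xn, Float64)
    for i in axes(Xn, 1)
        # Step 3: quotients of the sample against the reference spectrum.
        q = Xn[i, keep] ./ ref[keep]
        # Steps 4 and 5: divide the whole sample by the median quotient.
        out[i, :] = Xn[i, :] ./ median(q)
    end
    return out
end

# Three samples with the same profile at different dilutions.
X = [1.0 2.0 3.0; 2.0 4.0 6.0; 3.0 6.0 9.0]
pqn_sketch(X)  # every row ends up with the same values
```

Because the three toy samples differ only by a dilution factor, the normalization maps them to one common profile, which is the effect the method is designed to achieve. A sample whose median quotient is zero would need separate handling, which this sketch does not attempt.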


In [7]:
df = pqnorm(dfMetabo, startCol = 2);
first(df, 3)

Row,SampleID,MT10001,MT10002,MT10003,MT10004,MT10005,MT10006,MT10007,MT10008,MT10009,MT10010,MT10011,MT10012,MT10013,MT10014,MT10015,MT10016,MT10017,MT10018,MT10019,MT10020,MT10021,MT10022,MT10023,MT10024,MT10025,MT10026,MT10027,MT10028,MT10029,MT10030,MT10031,MT10032,MT10033,MT10034,MT10035,MT10036,MT10037,MT10038,MT10039,MT10040,MT10041,MT10042,MT10043,MT10044,MT10045,MT10046,MT10047,MT10048,MT10049,MT10050,MT10051,MT10052,MT10053,MT10054,MT10055,MT10056,MT10057,MT10058,MT10059,MT10060,MT10061,MT10062,MT10063,MT10064,MT10065,MT10066,MT10067,MT10068,MT10069,MT10070,MT10071,MT10072,MT10073,MT10074,MT10075,MT10076,MT10077,MT10078,MT10079,MT10080,MT10081,MT10082,MT10083,MT10084,MT10085,MT10086,MT10087,MT10088,MT10089,MT10090,MT10091,MT10092,MT10093,MT10094,MT10095,MT10096,MT10097,MT10098,MT10099,⋯
Unnamed: 0_level_1,String7,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,⋯
1,ID_8358,0.793441,0.0951669,0.163879,0.146206,0.0214165,0.062254,1.67761,2.82745,0.0545409,0.996084,4.6498,24.1363,4.006,0.261452,0.103193,0.00152543,0.00107526,0.000104838,0.000554895,0.00239327,0.000941218,0.00120741,0.000656805,0.00190989,0.000618871,0.00209425,0.00217083,0.00278919,0.0211453,0.118785,0.00563245,0.0102638,0.0113941,0.00385829,0.00194752,0.0070203,0.0310959,0.00540947,0.00243924,0.00204345,0.00132745,0.00308111,0.000340115,0.000882698,0.000616702,0.00355461,0.0,0.0486789,0.0310049,0.00118868,0.000133028,0.00280968,0.000751927,0.0133837,0.0198213,0.000496345,0.000373054,0.0,0.0,0.000371288,0.00140204,0.00777933,0.00530552,0.0192814,0.0285357,0.000291656,0.000237709,0.000160532,0.000624738,0.000289787,0.000408772,0.00040761,0.000179594,0.000408941,0.000859732,0.000592762,0.000108277,9.0808e-05,0.000362065,0.000601366,0.000504203,0.000515684,0.0106006,0.111749,0.0573638,0.00265891,0.0166138,0.0253909,0.0213767,0.085421,0.0089617,0.252852,0.0632711,0.0425268,0.0402911,0.0188186,0.0497754,0.0767009,0.346481,⋯
2,ID_8363,2.01264,0.250191,0.316022,0.35231,0.0090836,0.0907338,4.37769,5.98482,0.0674948,1.74248,10.7952,87.0324,9.68257,1.51665,0.294706,0.00164067,0.00014284,0.000140007,0.0010802,0.00413117,0.00128056,0.000886071,0.000837439,0.0020082,0.000285707,0.000556351,0.0023939,0.00420345,0.0941167,0.350541,0.00948015,0.0304021,0.0432942,0.0151774,0.00583587,0.0260187,0.113409,0.00766083,0.00410709,0.0182598,0.0153228,0.0074324,0.00258356,0.0144681,0.0103423,0.00579154,0.00394326,0.0684608,0.0245162,0.0221871,0.00215692,0.00170014,0.00399767,0.0175665,0.0188937,0.000847784,0.000630506,0.000909391,0.000275056,0.00732291,0.00106693,0.0178259,0.00765984,0.0544446,0.0748196,0.000330248,0.000302167,0.000147507,0.000612601,0.00025397,0.000455773,0.000488404,0.000233846,0.00082982,0.00272523,0.00120081,0.000150019,0.000119674,0.000539743,0.000506577,0.000681608,0.000905559,0.0350433,0.160083,0.0706453,0.00657695,0.0249749,0.0448561,0.0795643,0.205603,0.0163875,0.730722,0.12638,0.0703859,0.0917934,0.0525502,0.0721954,0.155468,0.739741,⋯
3,ID_8370,1.53178,0.205512,0.31636,0.354267,0.014668,0.075066,4.12247,6.80776,0.123519,1.76572,11.6196,88.6166,8.14013,1.26819,0.366083,0.00164334,0.00085447,0.000156176,0.00100836,0.00347389,0.000974539,0.00119958,0.000848264,0.00223533,0.000889327,0.00186954,0.00246283,0.00293087,0.0768472,1.8826,0.0676765,0.107475,0.173985,0.119717,0.00974872,0.0244277,0.132784,0.0124151,0.00451364,0.0189541,0.0128543,0.0631288,0.0161856,0.00969446,0.00374294,0.043743,0.00256181,0.11356,0.0417162,0.0138678,0.00120616,0.00229198,0.0024428,0.0283885,0.0200309,0.000975333,0.00162604,0.000883355,0.000191232,0.00395584,0.00158782,0.0241148,0.00727879,0.0689694,0.0854589,0.000281894,0.000214146,9.95979e-05,0.000543636,0.000199317,0.000366522,0.000400771,0.000208595,0.000576335,0.00199891,0.000874862,4.00263e-05,0.000118033,0.000403658,0.000278367,0.000513637,0.000711806,0.0364649,0.188434,0.124887,0.00592517,0.0291241,0.0432982,0.0771487,0.177908,0.0163068,0.643188,0.0912859,0.052892,0.15091,0.0868726,0.0430071,0.138359,0.955284,⋯


## Transformation
---

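A simple and widely used transformation to make data more symmetric and homoscedastic is the log-transformation. Here a log2 transformation is applied; the output below appears consistent with log2(1 + x), since zero intensities remain 0.0 and 0.793441 maps to 0.84273.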
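
A minimal sketch of an offset log2 transformation is shown below. It is not the `log2tx` function from `preprocessing.jl`; the helper name `log2p1` and the offset of 1 are assumptions, chosen because they reproduce the values printed in this notebook.

```julia
# Hypothetical helper: log2 with an offset of 1 so that zero intensities map to zero.
log2p1(x::Real) = log2(1 + x)

# Values taken from the first normalized row shown in this notebook.
vals = [0.0, 0.793441, 24.1363]
log2p1.(vals)  # approximately [0.0, 0.84273, 4.6517]
```

The offset keeps the transformation defined for zero intensities, which a plain `log2` would send to `-Inf`.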

In [8]:
df = log2tx(df, startCol = 2);
first(df, 2)

Row,SampleID,MT10001,MT10002,MT10003,MT10004,MT10005,MT10006,MT10007,MT10008,MT10009,MT10010,MT10011,MT10012,MT10013,MT10014,MT10015,MT10016,MT10017,MT10018,MT10019,MT10020,MT10021,MT10022,MT10023,MT10024,MT10025,MT10026,MT10027,MT10028,MT10029,MT10030,MT10031,MT10032,MT10033,MT10034,MT10035,MT10036,MT10037,MT10038,MT10039,MT10040,MT10041,MT10042,MT10043,MT10044,MT10045,MT10046,MT10047,MT10048,MT10049,MT10050,MT10051,MT10052,MT10053,MT10054,MT10055,MT10056,MT10057,MT10058,MT10059,MT10060,MT10061,MT10062,MT10063,MT10064,MT10065,MT10066,MT10067,MT10068,MT10069,MT10070,MT10071,MT10072,MT10073,MT10074,MT10075,MT10076,MT10077,MT10078,MT10079,MT10080,MT10081,MT10082,MT10083,MT10084,MT10085,MT10086,MT10087,MT10088,MT10089,MT10090,MT10091,MT10092,MT10093,MT10094,MT10095,MT10096,MT10097,MT10098,MT10099,⋯
Unnamed: 0_level_1,String7,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,⋯
1,ID_8358,0.84273,0.131151,0.218941,0.196866,0.0305713,0.0871288,1.42095,1.93638,0.0766151,0.997172,2.4982,4.6517,2.32366,0.335085,0.141685,0.00219905,0.00155043,0.000151242,0.000800323,0.00344864,0.00135725,0.00174087,0.000947259,0.00275276,0.000892567,0.00301821,0.00312845,0.00401834,0.0301882,0.161932,0.00810311,0.014732,0.0163452,0.00555562,0.00280694,0.0100928,0.0441785,0.00778318,0.00351479,0.00294506,0.00191384,0.00443827,0.000490599,0.0012729,0.000889439,0.00511913,0.0,0.0685729,0.0440511,0.00171388,0.000191905,0.00404783,0.00108439,0.0191806,0.0283163,0.000715897,0.000538103,0.0,0.0,0.000535556,0.0020213,0.0111798,0.00763401,0.0275524,0.0405919,0.000420709,0.000342901,0.000231581,0.000901025,0.000418013,0.000589612,0.000587937,0.000259076,0.000589856,0.0012398,0.000854921,0.000156202,0.000131002,0.000522255,0.000867328,0.000727228,0.000743783,0.015213,0.152831,0.0804718,0.0038309,0.0237718,0.036174,0.030515,0.118255,0.0128714,0.325216,0.0885095,0.0600845,0.0569872,0.0268972,0.0700807,0.106618,0.429194,⋯
2,ID_8363,1.59103,0.322149,0.396184,0.435426,0.0130457,0.125299,2.42699,2.80422,0.094229,1.45548,3.56013,6.45996,3.41719,1.3315,0.372624,0.00236505,0.00020606,0.000201974,0.00155755,0.00594775,0.00184627,0.00127776,0.00120766,0.00289431,0.00041213,0.000802422,0.00344954,0.00605158,0.129767,0.433537,0.0136125,0.0432074,0.061146,0.0217318,0.00839491,0.0370571,0.154983,0.0110101,0.00591314,0.0261058,0.0219385,0.010683,0.00372248,0.0207235,0.0148442,0.00833133,0.00567774,0.0955339,0.0349427,0.0316593,0.00310843,0.00245069,0.00575593,0.025123,0.0270036,0.00122258,0.000909341,0.00131138,0.000396768,0.0105262,0.00153844,0.0254909,0.0110087,0.0764834,0.104094,0.000476369,0.000435869,0.000212791,0.000883526,0.000366354,0.000657392,0.000704446,0.000337329,0.00119668,0.00392633,0.00173137,0.000216416,0.000172643,0.000778474,0.000730652,0.000983018,0.00130585,0.0496911,0.214228,0.0984806,0.00945747,0.0355886,0.0633042,0.110449,0.269755,0.0234505,0.791374,0.171693,0.0981311,0.1267,0.0738891,0.100568,0.208477,0.798872,⋯


## Save pretreatments

In [9]:
fileMeta = joinpath(@__DIR__,"..","..","data","processed","nl2_Meta.csv");
df |> CSV.write(fileMeta)

"C:\\git\\gregfa\\Metabolomic\\PANSTEATITISstudy\\notebooks\\preprocessing\\..\\..\\data\\processed\\nl2_Meta.csv"

In [10]:
versioninfo()

Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 4 × Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores


In [11]:
R"""
sessionInfo()
"""

RObject{VecSxp}
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.1
