# Wrangling SPIROMICS
---

This notebook carries out the wrangling process for the SPIROMICS metabolomics data.

## Input

### Libraries

In [1]:
# To use RCall for the first time, one needs to set
# the location of the R home directory.
firstTimeRCall = false
if firstTimeRCall 
    using Pkg
    ENV["R_HOME"] = "C:/PROGRA~1/R/R-42~1.1" # from R.home() in R
    Pkg.build("RCall")
end     

In [2]:
using DataFrames, CSV, Missings
using FreqTables #, CategoricalArrays
using Statistics

### External functions

In [3]:
include(joinpath(@__DIR__,"..","..","src","wrangle_utils.jl" ));

### Load data ST001639: SPIROMICS

#### Participants

In [4]:
fileIndividuals = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","SPIROMICS_ClinicalCovariates.csv"))
dfIndividuals = CSV.read(fileIndividuals, DataFrame;  delim = ',', missingstring = "NA");
first(dfIndividuals, 3)

Row,SUBJID,SAMPLE_NAME,SITE,GOLD_STAGE_COPD_SEVERITY,GENDER,AGE_DERV_01,RACE,BMI_CM01,CURRENT_SMOKER_V1,SMOKING_PACK_YEARS01,POST_FEV1FVC_DERV,PCT_POST_FEV1_V1,V1_PERCENT_EMPHYSEMA_TOTAL,COPD
Unnamed: 0_level_1,String15,String15,String3,Int64,String7,Int64,String31,Float64,Int64?,Float64,Float64,Float64,Float64?,Int64?
1,CU100084,NJHC-01517,CU,2,Female,64,"Non-Hispanic, White",25.3,0,100.0,0.505651,65.4168,11.56,1
2,CU100103,NJHC-01709,CU,0,Female,59,Black/African American,35.9,0,50.0,0.856898,113.274,missing,0
3,CU100139,NJHC-01708,CU,0,Male,52,Black/African American,39.8,1,25.0,0.781953,102.662,0.09,0


In [5]:
unique(dfIndividuals.GOLD_STAGE_COPD_SEVERITY)

4-element Vector{Int64}:
 2
 0
 3
 4

#### Metabolites References

The metabolite references are split into 4 files:
- neg_metadata.txt
- polar_metadata.txt
- pos_early_metadata.txt
- pos_late_metadata.txt

In [6]:
# Reference metabolomics
vRefMetaFiles = repeat(["path"],4)

vRefMetaFiles[1] = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS",
                                  "neg_metadata.txt"))
vRefMetaFiles[2] = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS",
                                  "polar_metadata.txt"))
vRefMetaFiles[3] = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS",
                                  "pos_early_metadata.txt"))
# Missing 1st column name: metabolite_name, in pos_polar_metadata.txt   
vRefMetaFiles[4] = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS",
                                  "pos_late_metadata_revised20220829.txt"))#"pos_polar_metadata.txt"))
prepend(vRefMetaFiles[4], "metabolite_name	") # prepend "metabolite_name  "
vRefMetaFiles[4] = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS",
                                  "new_pos_late_metadata_revised20220829.txt"))#"new_pos_polar_metadata.txt"))

# Generate dataframe
dfRefMetabo = CSV.read(vRefMetaFiles[1], DataFrame;  delim = '	');

for i in 2:length(vRefMetaFiles)
    dfRefMetabo = vcat(dfRefMetabo, CSV.read(vRefMetaFiles[i], DataFrame;  delim = '	'));
end
# TODO: remove the temporary file created by `prepend`
# rm(vRefMetaFiles[4])
first(dfRefMetabo, 1)

Row,metabolite_name,SUPER.PATHWAY,SUB.PATHWAY,PLATFORM,RI,MASS,PUBCHEM,CAS,KEGG,HMDB
Unnamed: 0_level_1,String,String,String,String15,Float64,Float64,String15,String31,String31,String31
1,(14 or 15)-methylpalmitate (a17:0 or i17:0),Lipid,"Fatty Acid, Branched",LC/MS Neg,5695.0,269.249,,,,


In [7]:
names(dfRefMetabo)

10-element Vector{String}:
 "metabolite_name"
 "SUPER.PATHWAY"
 "SUB.PATHWAY"
 "PLATFORM"
 "RI"
 "MASS"
 "PUBCHEM"
 "CAS"
 "KEGG"
 "HMDB"

#### Negative

**Note:** *The metabolomics files do not contain a column header for the metabolite names; we need to add a column name.*
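
The `prepend` helper used in the cells below comes from `wrangle_utils.jl`, which is not shown in this notebook. The following is only a minimal sketch of what such a helper might do, assuming it writes a copy of the file named `new_<filename>` with the given string inserted at the very start. The name `prepend_sketch` and the toy file are hypothetical.

```julia
# Hypothetical stand-in for `prepend` from wrangle_utils.jl (actual implementation not shown).
# Writes a copy of `path` named "new_<basename>" in the same directory, with
# `prefix` inserted at the very start of the file, and returns the new path.
function prepend_sketch(path::AbstractString, prefix::AbstractString)
    newpath = joinpath(dirname(path), "new_" * basename(path))
    open(newpath, "w") do io
        write(io, prefix)
        write(io, read(path, String))
    end
    return newpath
end

# Toy file whose header row lacks the name of the first column
dir = mktempdir()
raw = joinpath(dir, "neg.txt")
write(raw, "S1\tS2\nmetA\t1.0\t2.0\n")

newfile = prepend_sketch(raw, "metabolite_name\t")
header = split(readline(newfile), '\t')
println(header)
```

With the header completed, every data row has as many fields as the header, so the file can be parsed as a regular tab-delimited table.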

In [8]:
fileNegMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","neg.txt"))
prepend(fileNegMetabo, "metabolite_name	")
fileNegMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","new_neg.txt"))
dfNegMetabo = CSV.read(fileNegMetabo, DataFrame;  delim = '	', missingstring = "NA");
rm(fileNegMetabo);
first(dfNegMetabo, 3)

Row,metabolite_name,NJHC-01750,NJHC-01642,NJHC-01498,NJHC-01614,NJHC-01591,NJHC-01819,NJHC-01503,NJHC-01474,NJHC-02018,NJHC-01638,NJHC-01980,NJHC-01558,NJHC-01689,NJHC-01660,NJHC-01419,NJHC-02027,NJHC-01877,NJHC-01956,NJHC-01683,NJHC-01847,NJHC-01499,NJHC-01835,NJHC-01624,NJHC-01788,NJHC-02034,NJHC-01954,NJHC-01987,NJHC-01488,NJHC-01981,NJHC-01961,NJHC-01862,NJHC-01525,NJHC-01572,NJHC-01695,NJHC-01526,NJHC-01595,NJHC-01783,NJHC-01500,NJHC-01449,NJHC-01436,NJHC-01554,NJHC-01626,NJHC-02037,NJHC-01601,NJHC-01963,NJHC-01468,NJHC-01957,NJHC-01496,NJHC-01780,NJHC-01475,NJHC-01742,NJHC-02045,NJHC-01824,NJHC-01476,NJHC-01922,NJHC-01596,NJHC-01477,NJHC-01721,NJHC-01825,NJHC-01511,NJHC-01615,NJHC-02041,NJHC-01900,NJHC-01429,NJHC-01481,NJHC-01727,NJHC-02019,NJHC-01712,NJHC-01982,NJHC-01714,NJHC-01734,NJHC-01866,NJHC-01690,NJHC-01655,NJHC-01650,NJHC-01602,NJHC-02010,NJHC-01751,NJHC-01644,NJHC-01809,NJHC-01691,NJHC-01992,NJHC-01867,NJHC-01842,NJHC-01793,NJHC-01917,NJHC-01918,NJHC-01518,NJHC-01702,NJHC-01735,NJHC-01814,NJHC-01587,NJHC-01858,NJHC-01713,NJHC-01848,NJHC-01708,NJHC-01546,NJHC-01684,NJHC-01863,⋯
Unnamed: 0_level_1,String,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,⋯
1,(14 or 15)-methylpalmitate (a17:0 or i17:0),20496700.0,11442100.0,11389100.0,10590200.0,5172570.0,11996100.0,14187500.0,13166300.0,17803600.0,15855600.0,9059150.0,18701400.0,7308480.0,14118500.0,12060600.0,10925000.0,14161100.0,34807500.0,8316190.0,11437800.0,8849460.0,11456100.0,9813130.0,4490320.0,11782600.0,8169980.0,9107680.0,18388800.0,5979250.0,8376200.0,12574100.0,15199600.0,7003600.0,11931200.0,7733200.0,10532700.0,15264500.0,12743500.0,3610550.0,18368800.0,12421800.0,10325500.0,14161300.0,12801900.0,24263200.0,29711000.0,9879410.0,4939550.0,15632600.0,15748300.0,4452490.0,18444200.0,15516100.0,15913800.0,29242900.0,6693680.0,21066400.0,5271040.0,16018700.0,2.27276e6,13178000.0,11677300.0,10171900.0,37465900.0,12821700.0,5224340.0,13219800.0,5116020.0,10568800.0,6609600.0,12209100.0,4540590.0,12131400.0,34498400.0,33519700.0,9338420.0,64171700.0,32421400.0,10329100.0,19119700.0,13603400.0,14944700.0,715058.0,14160500.0,10225900.0,24830200.0,10171100.0,12958400.0,13701000.0,28702400.0,13235600.0,17116600.0,22198500.0,20810700.0,3185700.0,11304200.0,40262300.0,9757380.0,11468400.0,⋯
2,(16 or 17)-methylstearate (a19:0 or i19:0),1139110.0,890218.0,887839.0,1129570.0,652188.0,1267060.0,1010100.0,917362.0,1161490.0,1141700.0,758177.0,1524220.0,860058.0,1186700.0,662986.0,986238.0,1083300.0,1769100.0,679138.0,1117140.0,785624.0,1109060.0,817624.0,570630.0,705474.0,813963.0,821974.0,1136210.0,723787.0,1080120.0,854601.0,1032470.0,650186.0,1185140.0,603164.0,694159.0,787944.0,820689.0,461979.0,1418960.0,1253790.0,792101.0,926252.0,1058740.0,1836680.0,1773850.0,647049.0,477353.0,2092410.0,1041340.0,386810.0,1059640.0,1567450.0,1079320.0,2011470.0,656937.0,1229010.0,539318.0,1085730.0,2.3795e5,748687.0,1247130.0,1160560.0,2337230.0,1021250.0,501623.0,853685.0,575013.0,1293120.0,519488.0,1110390.0,583797.0,1044560.0,1930280.0,1904610.0,860670.0,6640810.0,2670440.0,939385.0,1054460.0,1481970.0,1623550.0,89914.1,1099250.0,909107.0,1684300.0,868922.0,798356.0,1121790.0,2649700.0,1406480.0,1356070.0,1375850.0,1593630.0,309484.0,1250550.0,2003690.0,714214.0,1013380.0,⋯
3,(2 or 3)-decenoate (10:1n7 or n8),735271.0,573408.0,478221.0,831653.0,424957.0,366849.0,772576.0,671110.0,842351.0,492891.0,997662.0,412606.0,229687.0,849860.0,396484.0,513706.0,866661.0,312482.0,571370.0,615049.0,734072.0,555020.0,334507.0,328510.0,601541.0,671678.0,611565.0,868623.0,403950.0,251689.0,488289.0,423079.0,1030490.0,329790.0,239451.0,530935.0,562734.0,409415.0,315612.0,843952.0,710676.0,169526.0,331077.0,271452.0,771898.0,825801.0,263974.0,179869.0,239368.0,565668.0,205317.0,1431740.0,593630.0,361520.0,919626.0,283271.0,946983.0,366039.0,446434.0,missing,543478.0,361116.0,467747.0,1557630.0,699915.0,179142.0,334744.0,379072.0,595196.0,363304.0,731671.0,227602.0,445869.0,638185.0,1319800.0,327630.0,825472.0,870799.0,444775.0,1044000.0,562608.0,1074310.0,1404470.0,279291.0,1025520.0,725448.0,1135180.0,1173620.0,400665.0,1010160.0,472403.0,722747.0,886712.0,1255250.0,92918.1,375869.0,502020.0,238873.0,399688.0,⋯


#### Polar

In [9]:
filePolarMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","polar.txt"))
prepend(filePolarMetabo, "metabolite_name	")
filePolarMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","new_polar.txt"))
dfPolarMetabo = CSV.read(filePolarMetabo, DataFrame;  delim = '	', missingstring = "NA");
rm(filePolarMetabo);
first(dfPolarMetabo, 3)

Row,metabolite_name,NJHC-01750,NJHC-01642,NJHC-01498,NJHC-01614,NJHC-01591,NJHC-01819,NJHC-01503,NJHC-01474,NJHC-02018,NJHC-01638,NJHC-01980,NJHC-01558,NJHC-01689,NJHC-01660,NJHC-01419,NJHC-02027,NJHC-01877,NJHC-01956,NJHC-01683,NJHC-01847,NJHC-01499,NJHC-01835,NJHC-01624,NJHC-01788,NJHC-02034,NJHC-01954,NJHC-01987,NJHC-01488,NJHC-01981,NJHC-01961,NJHC-01862,NJHC-01525,NJHC-01572,NJHC-01695,NJHC-01526,NJHC-01595,NJHC-01783,NJHC-01500,NJHC-01449,NJHC-01436,NJHC-01554,NJHC-01626,NJHC-02037,NJHC-01601,NJHC-01963,NJHC-01468,NJHC-01957,NJHC-01496,NJHC-01780,NJHC-01475,NJHC-01742,NJHC-02045,NJHC-01824,NJHC-01476,NJHC-01922,NJHC-01596,NJHC-01477,NJHC-01721,NJHC-01825,NJHC-01511,NJHC-01615,NJHC-02041,NJHC-01900,NJHC-01429,NJHC-01481,NJHC-01727,NJHC-02019,NJHC-01712,NJHC-01982,NJHC-01714,NJHC-01734,NJHC-01866,NJHC-01690,NJHC-01655,NJHC-01650,NJHC-01602,NJHC-02010,NJHC-01751,NJHC-01644,NJHC-01809,NJHC-01691,NJHC-01992,NJHC-01867,NJHC-01842,NJHC-01793,NJHC-01917,NJHC-01918,NJHC-01518,NJHC-01702,NJHC-01735,NJHC-01814,NJHC-01587,NJHC-01858,NJHC-01713,NJHC-01848,NJHC-01708,NJHC-01546,NJHC-01684,NJHC-01863,⋯
Unnamed: 0_level_1,String,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,⋯
1,"1,6-anhydroglucose",missing,9.29474e5,2.06381e5,3.35506e5,4.29031e5,2.57645e5,missing,8.27625e5,562816.0,4.59934e5,9.7461e5,3.33062e5,6.97769e6,2.15727e5,6.17225e5,3.98595e5,5.91158e5,1.15237e5,1.07949e5,2.4747e6,1.38512e6,4.07972e5,1.89541e6,3.45335e5,4.39639e5,3.38128e5,225458.0,1.01216e6,2.58777e5,703230.0,1.67271e6,2.93784e5,6.26738e5,5.94113e5,3.71473e6,4.62359e5,4.18209e5,91992.1,8.98713e5,3.552e5,3.9023e6,4.55479e6,9.89711e5,missing,1.749e5,1.28825e5,8.93151e5,missing,9.78708e5,8.33946e5,2.37247e6,2.45723e5,2.1841e6,3.96077e5,5.58487e5,7.22791e5,1.38926e5,3.20968e5,1.6518e6,6.03797e6,missing,2.44667e6,8.71531e5,2.76587e5,missing,missing,2.92488e5,8.91329e5,1.76244e6,5.24718e5,1.40339e5,4.36642e5,3.42341e5,2.47597e5,2.74116e5,1.19939e5,1.75291e5,5.19755e5,3.26051e5,2.04354e5,2.14616e6,7.04057e5,158870000.0,1.08557e6,5.45389e5,missing,1.13757e6,5.93073e5,5.75359e5,3.07595e5,3.16316e5,missing,4.78068e5,2.11117e5,2.65191e6,1.56563e5,6.11326e5,1.96252e5,6.49898e5,⋯
2,"2,3-dihydroxyisovalerate",2.54964e5,2.25658e5,3.16226e5,4.9566e5,1.25176e5,1.16234e5,1.25729e5,3.58954e5,2.18205e5,5.27613e6,2.93558e5,3.67141e5,1.98116e5,1.61752e5,87329.0,4.25065e5,1.75222e5,8.33208e5,630459.0,2.28936e5,2.34653e5,missing,missing,1.01476e5,1.62089e5,2.10798e5,1.06451e6,1.12709e5,4.80242e5,2.3184e5,1.41269e6,2.9957e6,1.13289e5,367571.0,missing,51873.5,1.90045e6,2.31866e5,2.80902e5,3.21315e6,1.75141e6,3.3388e5,9.05364e5,1.99062e5,6.81961e6,2.76428e6,236918.0,1.36172e6,8.74725e5,3.02785e5,1.31277e6,2.37911e5,missing,54414.8,3.45843e5,1.14306e5,2.3153e5,3.21021e5,4.3315e5,3.83082e5,4.80588e5,3.64311e5,2.97309e5,1.00711e6,43357.3,1.37622e5,2.10496e5,6.32033e6,2.71839e5,4.4184e5,1.38677e6,142827.0,missing,85247.2,71135.2,missing,missing,9.00506e6,missing,missing,4.91385e6,77368.4,79178400.0,1.16198e6,4.21665e5,missing,1.15252e5,70244.1,1.11018e6,3.29128e6,1.28105e5,132578.0,missing,2.00994e6,5.40189e6,57753.0,3.54066e5,7.42361e5,4.85621e6,⋯
3,"2,4,6-trihydroxybenzoate",missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,3812720.0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,⋯


#### Positive early

In [10]:
filePosEarlyMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","pos_early.txt"))
prepend(filePosEarlyMetabo, "metabolite_name	")
filePosEarlyMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","new_pos_early.txt"))
dfPosEarlyMetabo = CSV.read(filePosEarlyMetabo, DataFrame;  delim = '	', missingstring = "NA");
rm(filePosEarlyMetabo)
first(dfPosEarlyMetabo, 3)

Row,metabolite_name,NJHC-01750,NJHC-01642,NJHC-01498,NJHC-01614,NJHC-01591,NJHC-01819,NJHC-01503,NJHC-01474,NJHC-02018,NJHC-01638,NJHC-01980,NJHC-01558,NJHC-01689,NJHC-01660,NJHC-01419,NJHC-02027,NJHC-01877,NJHC-01956,NJHC-01683,NJHC-01847,NJHC-01499,NJHC-01835,NJHC-01624,NJHC-01788,NJHC-02034,NJHC-01954,NJHC-01987,NJHC-01488,NJHC-01981,NJHC-01961,NJHC-01862,NJHC-01525,NJHC-01572,NJHC-01695,NJHC-01526,NJHC-01595,NJHC-01783,NJHC-01500,NJHC-01449,NJHC-01436,NJHC-01554,NJHC-01626,NJHC-02037,NJHC-01601,NJHC-01963,NJHC-01468,NJHC-01957,NJHC-01496,NJHC-01780,NJHC-01475,NJHC-01742,NJHC-02045,NJHC-01824,NJHC-01476,NJHC-01922,NJHC-01596,NJHC-01477,NJHC-01721,NJHC-01825,NJHC-01511,NJHC-01615,NJHC-02041,NJHC-01900,NJHC-01429,NJHC-01481,NJHC-01727,NJHC-02019,NJHC-01712,NJHC-01982,NJHC-01714,NJHC-01734,NJHC-01866,NJHC-01690,NJHC-01655,NJHC-01650,NJHC-01602,NJHC-02010,NJHC-01751,NJHC-01644,NJHC-01809,NJHC-01691,NJHC-01992,NJHC-01867,NJHC-01842,NJHC-01793,NJHC-01917,NJHC-01918,NJHC-01518,NJHC-01702,NJHC-01735,NJHC-01814,NJHC-01587,NJHC-01858,NJHC-01713,NJHC-01848,NJHC-01708,NJHC-01546,NJHC-01684,NJHC-01863,⋯
Unnamed: 0_level_1,String,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,⋯
1,(2-butoxyethoxy)acetic acid,missing,missing,missing,missing,missing,missing,949767.0,missing,1163120.0,missing,1183600.0,missing,missing,missing,72623.5,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,1168460.0,255219.0,missing,4397240.0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,1155750.0,missing,missing,169120.0,105078.0,missing,3.0053e5,missing,382737.0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,589784.0,missing,missing,missing,missing,missing,missing,missing,914859.0,missing,1870530.0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,477088.0,missing,missing,223564.0,⋯
2,(R)-3-hydroxybutyrylcarnitine,1.86193e6,3.39531e6,6.48398e5,4.83569e6,8.05834e5,9.12263e6,4645220.0,2.63045e6,1296830.0,1.83508e6,1293730.0,3.57467e6,8.08166e5,6.29753e6,1565930.0,2.48744e6,3.72404e6,5.19554e6,1.09376e6,3.85654e5,1.3313e6,2.12037e6,missing,7.20039e5,2.35867e6,670331.0,2044670.0,1.51139e6,1012740.0,1.13657e6,2.17752e6,1.47757e6,3.05927e6,4.8921e6,8.67714e5,7.18352e5,8.18912e5,3.40021e6,7.03703e5,3.00834e6,1.12634e6,2.79669e6,1.664e6,6.08884e5,3.41361e6,3.00974e6,8.6269e5,2.02432e6,2.43637e6,4308310.0,2.08189e6,1.21956e6,5205100.0,3172880.0,2.69927e6,missing,2.99313e6,2868180.0,missing,missing,3.75663e6,4.23368e6,3.15263e6,5.99724e6,1.09532e6,6.5606e5,1.16126e6,4.79851e6,1.11678e7,1.43682e6,5.19377e6,missing,1540940.0,1.31569e7,1.03087e7,5.06521e6,4.50332e6,2.44661e6,1.61013e6,3.78282e6,1893690.0,4.44759e6,2908690.0,1.78602e6,1.86279e6,1.67995e7,2.4507e6,2.6297e6,1.65452e6,5.81066e6,1.04056e6,9.07086e6,1.38013e6,7.1502e6,missing,1140820.0,4.38609e6,2.65214e6,3145310.0,⋯
3,(S)-3-hydroxybutyrylcarnitine,1.3686e6,2.81463e6,1.22605e6,2.42628e6,5.98297e5,2.16304e6,2516030.0,1.57787e6,1964270.0,1.2477e6,762779.0,1.5857e6,8.5838e5,5.23189e6,1484690.0,3.30528e6,1.48918e6,1.25809e6,7.16164e5,4.31805e5,1.05998e6,1.17947e6,8.25478e5,8.17377e5,3.30526e6,1723220.0,902774.0,1.62987e6,1343560.0,7.47322e5,1.21564e6,1.41414e6,1.35201e6,5.23233e6,8.86263e5,6.08306e5,1.49733e6,3.07491e6,1.05862e6,1.22747e6,2.1314e6,1.57001e6,1.29164e6,6.23262e5,1.54754e6,1.94303e6,5.82541e5,9.96247e5,4.81268e5,2028520.0,4.00573e6,2.04486e6,5232380.0,1162390.0,2.17109e6,1.40892e6,1.54988e6,1393380.0,7.30786e5,1.48136e6,2.87503e6,1.79362e6,5.18701e6,2.15503e6,2.35065e6,5.96632e5,715214.0,1.26135e6,2.59932e6,1.15037e6,1.05204e6,4.222e5,1915390.0,8.74849e6,6.63815e6,1.5258e6,1.08465e6,1.51711e6,1.49739e6,1.55697e6,1585500.0,1.98057e6,12169000.0,897618.0,8.57009e5,7.7664e6,1.44167e6,6.13472e6,1.06899e6,1.7065e6,9.36622e5,3.01859e6,2.10791e6,2.3241e6,5.79691e5,610489.0,3.6607e6,2.54956e6,2252130.0,⋯


#### Positive late

In [11]:
filePosLateMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","pos_late.txt"))
prepend(filePosLateMetabo, "metabolite_name	")
filePosLateMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","SPIROMICS","new_pos_late.txt"))
dfPosLateMetabo = CSV.read(filePosLateMetabo, DataFrame;  delim = '	', missingstring = "NA");
rm(filePosLateMetabo)
first(dfPosLateMetabo, 3)

Row,metabolite_name,NJHC-01750,NJHC-01642,NJHC-01498,NJHC-01614,NJHC-01591,NJHC-01819,NJHC-01503,NJHC-01474,NJHC-02018,NJHC-01638,NJHC-01980,NJHC-01558,NJHC-01689,NJHC-01660,NJHC-01419,NJHC-02027,NJHC-01877,NJHC-01956,NJHC-01683,NJHC-01847,NJHC-01499,NJHC-01835,NJHC-01624,NJHC-01788,NJHC-02034,NJHC-01954,NJHC-01987,NJHC-01488,NJHC-01981,NJHC-01961,NJHC-01862,NJHC-01525,NJHC-01572,NJHC-01695,NJHC-01526,NJHC-01595,NJHC-01783,NJHC-01500,NJHC-01449,NJHC-01436,NJHC-01554,NJHC-01626,NJHC-02037,NJHC-01601,NJHC-01963,NJHC-01468,NJHC-01957,NJHC-01496,NJHC-01780,NJHC-01475,NJHC-01742,NJHC-02045,NJHC-01824,NJHC-01476,NJHC-01922,NJHC-01596,NJHC-01477,NJHC-01721,NJHC-01825,NJHC-01511,NJHC-01615,NJHC-02041,NJHC-01900,NJHC-01429,NJHC-01481,NJHC-01727,NJHC-02019,NJHC-01712,NJHC-01982,NJHC-01714,NJHC-01734,NJHC-01866,NJHC-01690,NJHC-01655,NJHC-01650,NJHC-01602,NJHC-02010,NJHC-01751,NJHC-01644,NJHC-01809,NJHC-01691,NJHC-01992,NJHC-01867,NJHC-01842,NJHC-01793,NJHC-01917,NJHC-01918,NJHC-01518,NJHC-01702,NJHC-01735,NJHC-01814,NJHC-01587,NJHC-01858,NJHC-01713,NJHC-01848,NJHC-01708,NJHC-01546,NJHC-01684,NJHC-01863,⋯
Unnamed: 0_level_1,String,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,⋯
1,"1,2-dilinoleoyl-GPC (18:2/18:2)",83568900.0,131306000.0,106189000.0,113068000.0,120376000.0,39827500.0,166163000.0,125710000.0,89061800.0,87495800.0,141083000.0,64671200.0,90234200.0,94639100.0,87024400.0,154663000.0,80608500.0,50113900.0,108296000.0,208461000.0,104895000.0,70615800.0,69857000.0,59534900.0,66244200.0,148552000.0,150581000.0,173074000.0,173926000.0,140051000.0,130224000.0,162944000.0,103421000.0,133324000.0,110092000.0,107996000.0,94751400.0,82236800.0,70801300.0,126952000.0,179759000.0,112226000.0,215030000.0,120185000.0,94417000.0,74576500.0,99256600.0,74613500.0,107516000.0,147631000.0,51442000.0,130825000.0,132504000.0,109794000.0,147514000.0,130124000.0,140696000.0,40787700.0,161328000.0,223300000.0,132877000.0,123160000.0,81344700.0,131285000.0,110012000.0,80806400.0,170297000.0,165427000.0,89871800.0,86346000.0,99223000.0,108939000.0,125040000.0,69014500.0,89426300.0,87944100.0,99176400.0,67557400.0,101397000.0,68174000.0,133963000.0,106085000.0,4.25157e5,136650000.0,121375000.0,60059700.0,40356900.0,80924400.0,256750000.0,103554000.0,126027000.0,85780500.0,150228000.0,86356000.0,121484000.0,43178200.0,105889000.0,110500000.0,124835000.0,⋯
2,"1,2-dipalmitoyl-GPC (16:0/16:0)",50122600.0,34108600.0,55483200.0,57364100.0,45457500.0,43725900.0,48680500.0,49544700.0,44005000.0,54628700.0,41113000.0,58067400.0,56527500.0,49947100.0,41304800.0,44552300.0,41162800.0,43878000.0,45518800.0,52845700.0,49745000.0,39508300.0,24457200.0,47872600.0,37624900.0,46241200.0,58235800.0,43193700.0,29673700.0,46442400.0,48016100.0,64180900.0,42320500.0,41577100.0,37653800.0,51578900.0,47819300.0,31023000.0,35647500.0,52666500.0,53101900.0,28743800.0,46728600.0,42317200.0,50495600.0,34770000.0,41658300.0,37663100.0,49019600.0,46891100.0,43087200.0,44735500.0,41922300.0,39222900.0,51807800.0,38695100.0,40172100.0,44948400.0,55712700.0,58179600.0,47481300.0,39844100.0,34494900.0,42536000.0,43977300.0,33145700.0,47730700.0,51783400.0,57926100.0,40877300.0,47628500.0,40965600.0,37291200.0,43742800.0,48380900.0,54802000.0,63713300.0,43343500.0,39977600.0,42643800.0,40897100.0,33193100.0,3.92537e5,42794100.0,42049100.0,38207500.0,36773100.0,36632200.0,59405900.0,50502700.0,56309400.0,45457700.0,46642900.0,45816100.0,47707200.0,31997200.0,46114300.0,37694200.0,45684800.0,⋯
3,1-(1-enyl-oleoyl)-GPE (P-18:1)*,151080.0,196074.0,218692.0,243610.0,271046.0,187949.0,220511.0,324843.0,321703.0,232695.0,341219.0,194003.0,206422.0,233976.0,185176.0,461638.0,203727.0,239951.0,170680.0,309233.0,350558.0,248292.0,226788.0,252793.0,305358.0,372572.0,242363.0,253276.0,185604.0,227297.0,224785.0,186274.0,161626.0,232263.0,195902.0,209656.0,190572.0,221708.0,155314.0,304843.0,253001.0,232930.0,423742.0,235753.0,413075.0,162660.0,133571.0,203074.0,168609.0,169196.0,193460.0,234023.0,381814.0,137499.0,248687.0,370114.0,201024.0,152715.0,253461.0,207414.0,288182.0,171174.0,197421.0,239112.0,176083.0,206577.0,336987.0,376333.0,167625.0,111292.0,279434.0,136441.0,292068.0,142734.0,196436.0,297726.0,263436.0,390001.0,352582.0,115907.0,151765.0,167409.0,missing,207235.0,222859.0,224884.0,166231.0,360544.0,204460.0,321533.0,352976.0,323274.0,263417.0,265196.0,339804.0,78270.1,186316.0,137515.0,360853.0,⋯


#### Check that each dataframe contains the same column names (sample names)

In [12]:
names(dfNegMetabo) == names(dfPolarMetabo) == names(dfPosEarlyMetabo) == names(dfPosLateMetabo)

true
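
The chained `==` above is `true` only when the sample names match in the same order. If it ever returned `false`, it could help to distinguish a reordering from a genuinely different set of samples. A minimal sketch with toy name vectors (the variables below are illustrative, not the notebook's dataframes):

```julia
# Toy column-name vectors standing in for names(dfNegMetabo), names(dfPolarMetabo), ...
colsNeg   = ["metabolite_name", "S1", "S2", "S3"]
colsPolar = ["metabolite_name", "S1", "S2", "S3"]
colsPos   = ["metabolite_name", "S1", "S3", "S2"]   # same samples, different order

allCols = [colsNeg, colsPolar, colsPos]

# Same names in the same order across all tables?
sameOrder = all(==(first(allCols)), allCols)

# Same set of names, regardless of order?
sameSet = all(c -> Set(c) == Set(first(allCols)), allCols)

println("same order: ", sameOrder, ", same set: ", sameSet)
```

When only the order differs, the columns can be realigned before the tables are stacked; when the sets differ, `setdiff` on the name vectors shows which samples are missing.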

## ST001639: SPIROMICS exploration

### Individuals

In [13]:
println("The participants dataset contains $(size(dfIndividuals, 1)) individuals and $(size(dfIndividuals, 2)) covariates.")

The participants dataset contains 447 individuals and 14 covariates.


In [14]:
describe(dfIndividuals)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,Type
1,SUBJID,,CU100084,,WF125434,0,String15
2,SAMPLE_NAME,,NJHC-01397,,NJHC-02046,0,String15
3,SITE,,CU,,WF,0,String3
4,GOLD_STAGE_COPD_SEVERITY,1.28412,0,2.0,4,0,Int64
5,GENDER,,Female,,Male,0,String7
6,AGE_DERV_01,63.5168,40,65.0,80,0,Int64
7,RACE,,Black/African American,,"Non-Hispanic, White",0,String31
8,BMI_CM01,28.4996,14.5,28.3,39.8,0,Float64
9,CURRENT_SMOKER_V1,0.362613,0,0.0,1,3,"Union{Missing, Int64}"
10,SMOKING_PACK_YEARS01,50.7088,20.0,44.0,400.0,0,Float64


Note: `CURRENT_SMOKER_V1` (current smoking status), `V1_PERCENT_EMPHYSEMA_TOTAL` (percent emphysema), and `COPD` contain `NA`.

Count the `NA` values per column:

In [15]:
vMissing = map(eachcol(dfIndividuals)) do col
               sum(ismissing.(col))
           end
idxColMiss = findall(vMissing .!= 0)
for i in idxColMiss
    println("$(names(dfIndividuals)[i]) contains $(vMissing[i]) missing values.")
end

CURRENT_SMOKER_V1 contains 3 missing values.
V1_PERCENT_EMPHYSEMA_TOTAL contains 63 missing values.
COPD contains 11 missing values.
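
The same per-column count can be written with `count(ismissing, col)`, which avoids allocating the intermediate Boolean vector created by `ismissing.(col)`. A minimal sketch on toy columns held in a `NamedTuple` (the values are made up for illustration):

```julia
# Toy columns standing in for eachcol(dfIndividuals)
cols = (CurrentSmoker    = [0, 1, missing, 1],
        PercentEmphysema = [11.56, missing, 0.09, missing],
        Age              = [64, 59, 52, 70])

# Number of missing values per column, keyed by column name
nMissing = map(col -> count(ismissing, col), cols)

for (name, n) in pairs(nMissing)
    n > 0 && println("$name contains $n missing values.")
end
```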


#### Change variable names: 

In [16]:
rename!(dfIndividuals, Dict(:SUBJID => "SampleName", :SAMPLE_NAME => "SampleID",
                            :SITE => "Site", :GOLD_STAGE_COPD_SEVERITY => "FinalGold",
                            :GENDER => "Sex", :AGE_DERV_01 => "Age", :RACE => "race",
                            :SMOKING_PACK_YEARS01 => "SmokingPackYears", :BMI_CM01 => "BMI",
                            :POST_FEV1FVC_DERV => "FEV1_FVC", :PCT_POST_FEV1_V1 => "FEV1pp",
                            :CURRENT_SMOKER_V1 => "CurrentSmoker",
                            :V1_PERCENT_EMPHYSEMA_TOTAL => "PercentEmphysema"));

Check the levels of each categorical variable, *i.e.* sex, race, smoking status, and COPD case status.

In [17]:
vCovariateNames = [:Sex, :race, :CurrentSmoker, :COPD]
vUniqueCat = map(eachcol(dfIndividuals[:, vCovariateNames])) do col
                 join(unique(col), ",", " and ")
             end
for i in 1:length(vUniqueCat)
    println("$(string(vCovariateNames[i])) variable contains: $(vUniqueCat[i]) values.")
end

Sex variable contains: Female and Male values.
race variable contains: Non-Hispanic, White and Black/African American values.
CurrentSmoker variable contains: 0,1 and missing values.
COPD variable contains: 1,0 and missing values.


Convert:
- `race` into `NHW`, where the value *1* corresponds to non-Hispanic White and *0* otherwise.
- `CurrentSmoker` needs no conversion: it is already coded with *1* for Current Smoker and *0* for Former Smoker.

In [18]:
# Non-Hispanic White
vNHW = zeros(Int, size(dfIndividuals, 1));
idxNHW = findall(dfIndividuals.race .== "Non-Hispanic, White");
vNHW[idxNHW] .= 1
dfIndividuals.NHW = (vNHW);
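
The indicator built in the cell above can also be obtained with a single broadcast comparison. A minimal sketch on a toy vector (the values are illustrative only):

```julia
# Toy race labels standing in for dfIndividuals.race
race = ["Non-Hispanic, White", "Black/African American", "Non-Hispanic, White"]

# 1 for non-Hispanic White, 0 otherwise
vNHW = Int.(race .== "Non-Hispanic, White")
println(vNHW)
```

This relies on the column having no missing values, which holds for `race` here. With missing entries, `.==` would propagate `missing` and the conversion to `Int` would fail.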

In [19]:
# Drop `race`
select!(dfIndividuals, Not([:race]));

Get the demographics of the SPIROMICS cohort by sex.

In [20]:
# Group by sex
gdf = groupby(dfIndividuals, :Sex);

# Get mean values
mymean(X) = mean(skipmissing(X))
df1a = combine(gdf, [:Age, :BMI, :SmokingPackYears, :PercentEmphysema] .=> mymean)
df1a[:,2:end] = round.((df1a[:,2:end]); digits = 1)
rename!(df1a, Dict(:Age_mymean => "Age", :BMI_mymean => "BMI",
                   :SmokingPackYears_mymean => "SmokingPackYears", :PercentEmphysema_mymean => "PercentEmphysema"));

# Get standard deviation values
mystd(X) = std(skipmissing(X))
df1b = combine(gdf, [:Age, :BMI, :SmokingPackYears, :PercentEmphysema] .=> mystd)
df1b[:,2:end] = round.((df1b[:,2:end]); digits = 1)
rename!(df1b, Dict(:Age_mystd => "Age", :BMI_mystd => "BMI",
                   :SmokingPackYears_mystd => "SmokingPackYears", :PercentEmphysema_mystd => "PercentEmphysema"));

# Join mean and standard deviation values
dfDem1 = string.(df1a[:,2:end]).*repeat(["("], size(df1a,1),size(df1a,2)-1).* 
         string.(df1b[:,2:end]).*repeat([")"], size(df1a,1),size(df1a,2)-1);
insertcols!(dfDem1, 1, :Sex => df1a.Sex, :Participants => combine(gdf, nrow)[:,2])

# Get sum values
sum_skipmissing(x)= sum(skipmissing(x));
df2a = combine(gdf, [:NHW, :CurrentSmoker, :COPD] .=> sum_skipmissing)

# Get the number of participants, excluding those with missing values
nrow_skipmissing(x)= length(collect(skipmissing(x)));
dfParticipantSkipmissing = combine(gdf, [:NHW, :CurrentSmoker, :COPD] .=> nrow_skipmissing)


# Get percentage values
df2b = round.((df2a[:,2:end]./ Matrix(dfParticipantSkipmissing[:, 2:end])).*100, digits = 1)

# Join sum and percentage values
dfDem2 = string.(df2a[:,2:end]).*repeat(["("], size(df2a,1),size(df2a,2)-1).* 
         string.(df2b[:,1:end]).*repeat([")"], size(df2a,1),size(df2a,2)-1)
insertcols!(dfDem2, 1, :Sex => df2a.Sex)

rename!(dfDem2, Dict(:NHW_sum_skipmissing => "NHW", :CurrentSmoker_sum_skipmissing => "CurrentSmoker",
                   :COPD_sum_skipmissing => "COPD"))
# Join demographics dataframes
dfDem = leftjoin(dfDem1, dfDem2, on = :Sex )

# Pivot table
dfDem = permutedims(dfDem, 1, "Variable")

Row,Variable,Female,Male
Unnamed: 0_level_1,String,Any,Any
1,Participants,214,233
2,Age,63.0(8.8),64.0(8.0)
3,BMI,28.4(5.6),28.6(4.9)
4,SmokingPackYears,45.5(21.3),55.5(38.1)
5,PercentEmphysema,5.0(8.8),5.3(9.1)
6,NHW,163(76.2),191(82.0)
7,CurrentSmoker,70(32.9),91(39.4)
8,COPD,102(49.5),140(60.9)


The demographic table is identical to the one in the article "*Metabolomic Profiling Reveals Sex Specific Associations with
Chronic Obstructive Pulmonary Disease and Emphysema*" (2021).
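
Each cell of the demographic table is either a mean with its standard deviation or a count with its percentage, both computed after dropping missing values. As a compact illustration of that formatting, here is a minimal sketch on toy vectors. The helper names `mean_sd` and `count_pct` are hypothetical and are not part of the notebook or of `wrangle_utils.jl`.

```julia
using Statistics

# "mean(sd)" for a continuous covariate, ignoring missing values
function mean_sd(x; digits = 1)
    v = collect(skipmissing(x))
    return string(round(mean(v); digits = digits), "(",
                  round(std(v); digits = digits), ")")
end

# "count(percent)" for a 0/1 covariate; the denominator excludes missing values
function count_pct(x; digits = 1)
    v = collect(skipmissing(x))
    return string(sum(v), "(", round(100 * sum(v) / length(v); digits = digits), ")")
end

emphysema = [11.5, missing, 0.5, 3.0]   # toy values
smoker    = [0, 1, missing, 1]          # toy values

println(mean_sd(emphysema))
println(count_pct(smoker))
```

Using the non-missing count as the denominator matches the `nrow_skipmissing` step above, which is why the percentages are not taken over all participants.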

#### Save processed individuals dataset:

In [21]:
first(dfIndividuals)

Row,SampleName,SampleID,Site,FinalGold,Sex,Age,BMI,CurrentSmoker,SmokingPackYears,FEV1_FVC,FEV1pp,PercentEmphysema,COPD,NHW
Unnamed: 0_level_1,String15,String15,String3,Int64,String7,Int64,Float64,Int64?,Float64,Float64,Float64,Float64?,Int64?,Int64
1,CU100084,NJHC-01517,CU,2,Female,64,25.3,0,100.0,0.505651,65.4168,11.56,1,1


In [22]:
fileIndividuals = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","SPIROMICS_ClinicalCovariates.csv");
dfIndividuals |> CSV.write(fileIndividuals);

### Metabolomics References

#### Create dataframe with pathways

In [23]:
names(dfRefMetabo)

10-element Vector{String}:
 "metabolite_name"
 "SUPER.PATHWAY"
 "SUB.PATHWAY"
 "PLATFORM"
 "RI"
 "MASS"
 "PUBCHEM"
 "CAS"
 "KEGG"
 "HMDB"

Since the [PUBCHEM ID list](https://www.metabolomicsworkbench.org/rest/study/study_id/ST001639/metabolites/text) is incomplete, we generate a pseudo `CompID` instead: first sort by metabolite name, then assign sequential IDs.

In [24]:
sort!(dfRefMetabo, :metabolite_name)
insertcols!(dfRefMetabo, 2, :CompID => replace.(string.(collect(1:size(dfRefMetabo, 1)).+99990000), "9999"=>"comp"));
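
The same ID pattern can also be built with `lpad`, which avoids the numeric offset and the string replacement. The sketch below uses hypothetical metabolite names and only Base Julia; for fewer than 9999 rows it should yield the same `comp0001`, `comp0002`, ... sequence as the cell above.

```julia
# Hypothetical metabolite names, unsorted
names_metabo = ["xylose", "cotinine", "zolpidem"]

# Sort first so that IDs follow the alphabetical order of the names
sorted_names = sort(names_metabo)

# Build IDs of the form "comp0001", "comp0002", ...
comp_ids = "comp" .* lpad.(1:length(sorted_names), 4, '0')

println(collect(zip(sorted_names, comp_ids)))
```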

In [25]:
first(dfRefMetabo, 3)

Row,metabolite_name,CompID,SUPER.PATHWAY,SUB.PATHWAY,PLATFORM,RI,MASS,PUBCHEM,CAS,KEGG,HMDB
Unnamed: 0_level_1,String,String,String,String,String15,Float64,Float64,String15,String31,String31,String31
1,(14 or 15)-methylpalmitate (a17:0 or i17:0),comp0001,Lipid,"Fatty Acid, Branched",LC/MS Neg,5695.0,269.249,,,,
2,(16 or 17)-methylstearate (a19:0 or i19:0),comp0002,Lipid,"Fatty Acid, Branched",LC/MS Neg,5993.0,297.28,3083779.0,2724-59-6,,HMDB37397
3,(2 or 3)-decenoate (10:1n7 or n8),comp0003,Lipid,Medium Chain Fatty Acid,LC/MS Neg,4990.0,169.123,,,,


Keep only the `metabolite_name`, `CompID`, `SUPER.PATHWAY` and `SUB.PATHWAY` columns.

In [26]:
first(dfRefMetabo[:, Symbol.(["metabolite_name", "CompID", "SUPER.PATHWAY", "SUB.PATHWAY"])], 3)

Row,metabolite_name,CompID,SUPER.PATHWAY,SUB.PATHWAY
Unnamed: 0_level_1,String,String,String,String
1,(14 or 15)-methylpalmitate (a17:0 or i17:0),comp0001,Lipid,"Fatty Acid, Branched"
2,(16 or 17)-methylstearate (a19:0 or i19:0),comp0002,Lipid,"Fatty Acid, Branched"
3,(2 or 3)-decenoate (10:1n7 or n8),comp0003,Lipid,Medium Chain Fatty Acid


In [27]:
last(dfRefMetabo[:, Symbol.(["metabolite_name", "CompID", "SUPER.PATHWAY", "SUB.PATHWAY"])], 3)

Row,metabolite_name,CompID,SUPER.PATHWAY,SUB.PATHWAY
Unnamed: 0_level_1,String,String,String,String
1,ximenoylcarnitine (C26:1)*,comp1172,Lipid,"Fatty Acid Metabolism (Acyl Carnitine, Monounsaturated)"
2,xylose,comp1173,Carbohydrate,Pentose Metabolism
3,zolpidem,comp1174,Xenobiotics,Drug - Psychoactive


In [28]:
# Select variables of interest and rename accordingly
rename!(dfRefMetabo, Dict(Symbol("SUB.PATHWAY") => "SubPathway", Symbol("SUPER.PATHWAY") => "SuperPathway")) 
select!(dfRefMetabo, [:metabolite_name, :CompID, :SubPathway, :SuperPathway]);

In [29]:
# Create two new variables, SubClassID and SuperClassID, that 
# encode the pathways

# Group by Super Pathway
gdf = groupby(dfRefMetabo, :SuperPathway);

nTotalSub = length(unique(dfRefMetabo.SubPathway))
vInit = repeat(["NA"], nTotalSub);
dfNewRef = DataFrame(SubPathway = vInit, SubClassID = vInit,
                     SuperPathway = vInit, SuperClassID = vInit);

In [30]:
# Generate pathway ID references for the metabolites
idxStart = 1

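# Note: the range stops at length(gdf)-1, so the last group is not coded
# and its rows in dfNewRef keep the "NA" placeholder
for i in 1:(length(gdf)-1)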
    vSub = unique(gdf[i].SubPathway)
    nSub = length(vSub)
    
    idxEnd = idxStart + nSub - 1
    
    dfNewRef.SubPathway[idxStart:idxEnd] = vSub;
    dfNewRef.SubClassID[idxStart:idxEnd] = uppercase(gdf[i].SuperPathway[1][1:3]).*string.(collect(1:nSub));
    dfNewRef.SuperPathway[idxStart:idxEnd] .= gdf[i].SuperPathway[1];
    dfNewRef.SuperClassID[idxStart:idxEnd] .= uppercase(gdf[i].SuperPathway[1][1:3]);
    
    idxStart = idxEnd + 1
end

In [31]:
# Initialize vectors
nMeta = size(dfRefMetabo, 1);
vSubClass = repeat(["NA"], nMeta);
vSupClass = repeat(["NA"], nMeta);

for i in 1:length(dfNewRef.SubPathway)
    idx = findall(dfRefMetabo.SubPathway.== dfNewRef.SubPathway[i])
    vSubClass[idx] .= dfNewRef.SubClassID[i]
    vSupClass[idx] .= dfNewRef.SuperClassID[i]
end
dfRefMetabo.SubClassID = vSubClass; 
dfRefMetabo.SuperClassID = vSupClass;


# Zero-pad SubClassID when the ID number is less than 10, so that IDs sort correctly.
idxSub2Change = findall(length.(dfRefMetabo.SubClassID) .== 4)
for i in 1:length(idxSub2Change) 
    dfRefMetabo.SubClassID[idxSub2Change[i]] = dfRefMetabo.SubClassID[idxSub2Change[i]][1:3]*"0"*dfRefMetabo.SubClassID[idxSub2Change[i]][4]
end
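
As an illustration of the coding scheme, the sketch below assigns a three-letter super-pathway prefix plus a zero-padded counter to each sub pathway in a single pass. It is not the notebook's implementation: the input pairs and the names `pathways`, `sub_ids` and `counters` are hypothetical, and only Base Julia is assumed.

```julia
# Hypothetical (sub pathway, super pathway) pairs
pathways = [("Tobacco Metabolite", "Xenobiotics"),
            ("Pentose Metabolism", "Carbohydrate"),
            ("Drug - Psychoactive", "Xenobiotics")]

sub_ids  = Dict{String,String}()  # sub pathway => SubClassID
counters = Dict{String,Int}()     # super-pathway prefix => running count

for (sub, sup) in pathways
    haskey(sub_ids, sub) && continue      # code each sub pathway only once
    prefix = uppercase(first(sup, 3))     # e.g. "XEN" for "Xenobiotics"
    counters[prefix] = get(counters, prefix, 0) + 1
    # lpad zero-pads the counter to two digits, e.g. 1 becomes "01"
    sub_ids[sub] = prefix * lpad(counters[prefix], 2, '0')
end

println(sub_ids)
```

Padding at creation time makes the later fix-up step for IDs below 10 unnecessary, at the cost of fixing the width in advance.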

#### Check cotinine biochemicals

Cotinine levels will be imputed differently if more than 20% of their values are missing.

In [32]:
# check for cotinine
idxCotinine = findall(occursin.(r"(?i)cotinine", dfRefMetabo.metabolite_name))
dfRefMetabo[idxCotinine, :]

Row,metabolite_name,CompID,SubPathway,SuperPathway,SubClassID,SuperClassID
Unnamed: 0_level_1,String,String,String,String,String,String
1,3-hydroxycotinine glucuronide,comp0216,Tobacco Metabolite,Xenobiotics,XEN09,XEN
2,cotinine,comp0588,Tobacco Metabolite,Xenobiotics,XEN09,XEN
3,cotinine N-oxide,comp0589,Tobacco Metabolite,Xenobiotics,XEN09,XEN
4,hydroxycotinine,comp0808,Tobacco Metabolite,Xenobiotics,XEN09,XEN
5,norcotinine,comp0926,Tobacco Metabolite,Xenobiotics,XEN09,XEN
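
The 20% rule mentioned above amounts to comparing the fraction of missing values against a threshold. A minimal sketch follows, with made-up cotinine levels and hypothetical variable names; the imputation step itself is not shown in this notebook.

```julia
# Hypothetical cotinine levels across ten samples, with missing values
cotinine = [12.1, missing, 0.4, missing, 8.7, 3.3, missing, 5.0, 1.2, 9.9]

# Fraction of missing values
frac_missing = count(ismissing, cotinine) / length(cotinine)

# Flag the metabolite for separate imputation when more than 20% is missing
needs_special_imputation = frac_missing > 0.20

println("missing fraction = ", frac_missing, ", special imputation = ", needs_special_imputation)
```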


#### Explore frequency table

In [33]:
freqtable(dfRefMetabo.SuperPathway)

9-element Named Vector{Int64}
Dim1                              │ 
──────────────────────────────────┼────
Amino Acid                        │ 228
Carbohydrate                      │  25
Cofactors and Vitamins            │  43
Energy                            │  11
Lipid                             │ 435
Nucleotide                        │  41
Partially Characterized Molecules │  30
Peptide                           │  43
Xenobiotics                       │ 318

In [34]:
idxLipid = findall(dfRefMetabo.SuperPathway .== "Lipid")
freqtable(dfRefMetabo[idxLipid, :SubPathway]); #  |> show;

#### Save processed metabolite reference dataset:

In [35]:
fileRef = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","refMeta.csv");
dfRefMetabo |> CSV.write(fileRef);

### Negative

**Notes:** *The metabolomics files use the metabolite name to identify each metabolite. We need to use an ID similar to the compound ID from the COPDGene cohort. Since compound IDs are not available, we use the workbench metabolite IDs.*

Filter the `dfNegMetabo` samples according to the individuals dataframe `dfIndividuals`:

#### Keep complete cases

In [36]:
dfNegMetabo = keepComplete(dfNegMetabo, dfIndividuals, dfRefMetabo; sampleCol = :SampleID, metaCol = :metabolite_name );
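
`keepComplete` is defined in `wrangle_utils.jl` and its implementation is not shown here. The sketch below illustrates only one plausible reading of the filtering step, namely keeping samples that have clinical data and metabolites that have a reference entry. All names and data are hypothetical, and the real function may do more or behave differently.

```julia
# Hypothetical data: rows are metabolites, columns are samples
sample_ids  = ["S1", "S2", "S3"]
metabolites = ["cotinine", "xylose", "unknown-1"]
levels = [1.0 2.0 3.0;
          4.0 5.0 6.0;
          7.0 8.0 9.0]

# Hypothetical participant and reference lists
participants = ["S1", "S3"]
reference    = ["cotinine", "xylose"]

# Keep only samples with clinical data and metabolites with a reference entry
keep_cols = in.(sample_ids, Ref(participants))
keep_rows = in.(metabolites, Ref(reference))
filtered  = levels[keep_rows, keep_cols]

println(size(filtered))
```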

#### Save the filtered negative metabolite levels dataset:

In [37]:
fileNeg = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","negMeta.csv");
dfNegMetabo |> CSV.write(fileNeg);

In [38]:
println("The negative metabolite dataset contains $(size(dfNegMetabo, 2)-1) samples and $(size(dfNegMetabo, 1)) metabolites.")

The negative metabolite dataset contains 372 samples and 588 metabolites.


### Polar

Filter the `dfPolarMetabo` samples according to the individuals dataframe `dfIndividuals`:

#### Keep complete cases

In [39]:
dfPolarMetabo = keepComplete(dfPolarMetabo, dfIndividuals, dfRefMetabo; sampleCol = :SampleID, metaCol = :metabolite_name );

#### Save the filtered polar metabolite levels dataset:

In [40]:
filePolar = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","polarMeta.csv");
dfPolarMetabo |> CSV.write(filePolar);

In [41]:
println("The polar metabolite dataset contains $(size(dfPolarMetabo, 2)-1) samples and $(size(dfPolarMetabo, 1)) metabolites.")

The polar metabolite dataset contains 372 samples and 96 metabolites.


### Positive Early

Filter the `dfPosEarlyMetabo` samples according to the individuals dataframe `dfIndividuals`:

[Polar molecules elute earlier and nonpolar molecules later.](https://www.sciencedirect.com/topics/immunology-and-microbiology/metabolome-analysis)

#### Keep complete cases

In [42]:
dfPosEarlyMetabo = keepComplete(dfPosEarlyMetabo, dfIndividuals, dfRefMetabo; sampleCol = :SampleID, metaCol = :metabolite_name );

#### Save the filtered positive early metabolite levels dataset:

In [43]:
filePosEarly = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","posEarlyMeta.csv");
dfPosEarlyMetabo |> CSV.write(filePosEarly);

In [44]:
println("The positive early metabolite dataset contains $(size(dfPosEarlyMetabo, 2)-1) samples and $(size(dfPosEarlyMetabo, 1)) metabolites.")

The positive early metabolite dataset contains 372 samples and 258 metabolites.


### Positive Late

Filter the `dfPosLateMetabo` samples according to the individuals dataframe `dfIndividuals`:

[Polar molecules elute earlier and nonpolar molecules later.](https://www.sciencedirect.com/topics/immunology-and-microbiology/metabolome-analysis)

#### Keep complete cases

In [45]:
dfPosLateMetabo = keepComplete(dfPosLateMetabo, dfIndividuals, dfRefMetabo; sampleCol = :SampleID, metaCol = :metabolite_name );

#### Save the filtered positive late metabolite levels dataset:

In [46]:
filePosLate = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","posLateMeta.csv");
dfPosLateMetabo |> CSV.write(filePosLate);

In [47]:
println("The positive late metabolite dataset contains $(size(dfPosLateMetabo, 2)-1) samples and $(size(dfPosLateMetabo, 1)) metabolites.")

The positive late metabolite dataset contains 372 samples and 232 metabolites.


In [48]:
names(dfRefMetabo)

6-element Vector{String}:
 "metabolite_name"
 "CompID"
 "SubPathway"
 "SuperPathway"
 "SubClassID"
 "SuperClassID"

In [49]:
# innerjoin(dfRefMetabo[:, [:metabolite_name, :CompID]], dfPosLateMetabo, on = :metabolite_name);

In [50]:
# dfPosLateMetabo.metabolite_name ⊆ dfRefMetabo.metabolite_name

In [51]:
# fileList = joinpath(@__DIR__,"..","..","data","processed","SPIROMICS","list.csv");
# DataFrame(Name = dfPosLateMetabo.metabolite_name) |> CSV.write(fileList);