# Wrangling COPDGene
---

This notebook carries out the wrangling process for the COPDGene metabolomics data.

## Input

### Libraries

In [1]:
# To use RCall for the first time, one needs to set
# the location of the R home directory.
firstTimeRCall = false
if firstTimeRCall 
    using Pkg
    ENV["R_HOME"] = "C:/PROGRA~1/R/R-42~1.1" # from R.home() in R
    Pkg.build("RCall")
end       

In [2]:
using DataFrames, CSV, Missings
using FreqTables #, CategoricalArrays
using Statistics

### External functions

In [3]:
include(joinpath(@__DIR__,"..","..","src","wrangle_utils.jl" ));

### Load data ST001443: COPDGene

#### Participants

In [4]:
fileIndividuals = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","COPDGene_ClinicalCovariates.csv"))
dfIndividuals = CSV.read(fileIndividuals, DataFrame;  delim = ',', missingstring = "NA");
first(dfIndividuals, 3)

Row,sid,sample_name,ccenter,finalgold_visit,gender,age_visit,race,BMI,smoking_status,ATS_PackYears,FEV1_FVC_utah,FEV1pp_utah,Insp_LAA950_total_Thirona,COPD
Unnamed: 0_level_1,String7,String15,String3,String7,String7,Float64,String31,Float64,String15,Float64?,Float64,Float64,Float64?,Int64
1,10010J,NJHC-00611,NJC,GOLD 2,Female,73.5,White,27.51,Former smoker,30.7,0.62,51.8,2.42326,1
2,10031R,NJHC-00004,NJC,GOLD 2,Male,66.3,White,22.77,Former smoker,46.9,0.48,58.5,34.7749,1
3,10032T,NJHC-00006,NJC,GOLD 2,Female,66.3,White,31.78,Former smoker,40.0,0.61,63.2,6.45937,1


#### Metabolites References

In [5]:
# Reference metabolomics
fileRefMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene",
                                  "COPDGene_2018_Metabolon_metabolite_metadata.txt"))
prepend(fileRefMetabo, "MetaID	")
fileRefMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene",
                                  "new_COPDGene_2018_Metabolon_metabolite_metadata.txt"))
dfRefMetabo = CSV.read(fileRefMetabo, DataFrame;  delim = '	');
rm(fileRefMetabo);
names(dfRefMetabo)

14-element Vector{String}:
 "MetaID"
 "PATHWAY.SORTORDER"
 "BIOCHEMICAL"
 "SUPER.PATHWAY"
 "SUB.PATHWAY"
 "COMP.ID"
 "PLATFORM"
 "CHEMICAL.ID"
 "RI"
 "MASS"
 "PUBCHEM"
 "CAS"
 "KEGG"
 "Group.HMDB"

Note: we prepended the `MetaID` variable name to the file header.
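The `prepend` helper comes from `wrangle_utils.jl`, which is not shown in this notebook. As a rough sketch only, assuming it writes a copy of the file with a `new_` prefix and places the given string before the first line, a hypothetical equivalent (`prepend_sketch` is our name, not the project's) might look like this:

```julia
# Hypothetical stand-in for `prepend` from wrangle_utils.jl (assumed behavior).
# Writes `new_<basename>` next to `path`, with `prefix` inserted before the
# first line, and returns the path of the new file.
function prepend_sketch(path::AbstractString, prefix::AbstractString)
    newpath = joinpath(dirname(path), "new_" * basename(path))
    open(newpath, "w") do io
        write(io, prefix)              # e.g. "MetaID\t"
        write(io, read(path, String))  # original content, unchanged
    end
    return newpath
end
```

Writing to a separate file leaves the raw data untouched, which would explain why the notebook deletes the `new_` file after reading it.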

#### Negative

In [6]:
fileNegMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","neg.txt"))
prepend(fileNegMetabo, "MetaID	")
fileNegMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","new_neg.txt"))
dfNegMetabo = CSV.read(fileNegMetabo, DataFrame;  delim = '	', missingstring = "NA");
rm(fileNegMetabo);
first(dfNegMetabo, 3)

Row,MetaID,NJHC-00001,NJHC-00002,NJHC-00003,NJHC-00004,NJHC-00005,NJHC-00006,NJHC-00007,NJHC-00008,NJHC-00009,NJHC-00010,NJHC-00011,NJHC-00012,NJHC-00013,NJHC-00014,NJHC-00015,NJHC-00016,NJHC-00017,NJHC-00018,NJHC-00019,NJHC-00020,NJHC-00021,NJHC-00022,NJHC-00023,NJHC-00024,NJHC-00025,NJHC-00026,NJHC-00027,NJHC-00028,NJHC-00029,NJHC-00030,NJHC-00031,NJHC-00032,NJHC-00033,NJHC-00034,NJHC-00035,NJHC-00036,NJHC-00037,NJHC-00038,NJHC-00039,NJHC-00040,NJHC-00041,NJHC-00042,NJHC-00043,NJHC-00044,NJHC-00045,NJHC-00046,NJHC-00047,NJHC-00048,NJHC-00049,NJHC-00050,NJHC-00051,NJHC-00052,NJHC-00053,NJHC-00054,NJHC-00055,NJHC-00056,NJHC-00057,NJHC-00058,NJHC-00059,NJHC-00060,NJHC-00061,NJHC-00062,NJHC-00063,NJHC-00064,NJHC-00065,NJHC-00066,NJHC-00067,NJHC-00068,NJHC-00069,NJHC-00070,NJHC-00071,NJHC-00072,NJHC-00073,NJHC-00074,NJHC-00075,NJHC-00076,NJHC-00077,NJHC-00078,NJHC-00079,NJHC-00080,NJHC-00081,NJHC-00082,NJHC-00083,NJHC-00084,NJHC-00085,NJHC-00086,NJHC-00087,NJHC-00088,NJHC-00089,NJHC-00090,NJHC-00091,NJHC-00092,NJHC-00093,NJHC-00094,NJHC-00095,NJHC-00096,NJHC-00097,NJHC-00098,NJHC-00099,⋯
Unnamed: 0_level_1,String15,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,269.2487_5695,14645215,12250810,39229008,28395564,14379270,44211984,16467170,16227623,15019852,20799962,29706168,25535100,9586948,22517184,56898536,56296492,21833808,4937165,4735985,26270196,14496632,23382212,20510984,24311378,10608024,8444278,30804274,50677392,24867992,47104156,15458106,12371223,24581490,54990800,41763248,11207133,11251844,31312040,24617468,6255557,33717760,15356847,22476164,43162528,37742300,14580738,13791452,32409300,73718752,18193894,7775309,34075144,missing,36290540,46649920,20734692,54342244,17637528,12731912,14650401,8045201,49065896,70326936,37275224,40644532,31332060,14306159,10877687,37521144,18283848,32888236,16612095,17097952,13372071,17342564,24599240,12965600,37617440,17376328,17566122,29175032,58959268,26356610,32699770,12893517,32939658,31451954,28670286,109673288,46019052,77713864,3815924,46630068,23065780,10092947,33254436,35778704,8737568,10300475,⋯
2,343.2279_5565,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,1731933,missing,missing,missing,missing,missing,missing,missing,71028,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,⋯
3,297.2799_5993,1256341,1258159,2752392,2152307,1613780,2938257,1345865,1718685,1114043,2216420,1658358,2601791,708494,1234370,3538976,3594802,1902523,358802,350889,2084579,1500283,2049429,1873334,1653137,868423,574230,2219986,4564478,2201262,3313321,1324218,1457579,1613993,3229421,2772336,1334369,1109977,2655990,2206343,531979,3026148,1053006,1581943,2627046,2199166,1173941,884745,2350687,5922540,1451981,551181,1966912,1279290,3202938,4261504,1594989,3104491,1739338,1533419,1179091,950056,1894281,4146206,2776697,3783060,2589925,1519708,976180,2853361,1617084,2296187,1212603,1350967,820901,1597577,1924371,1015980,2376861,1594693,1805737,2258904,3682072,1385506,2438157,945808,2237800,3319533,1862662,6247282,2859824,4160534,412303,4035304,1628071,789484,2536151,3433497,602695,692577,⋯


#### Polar

In [7]:
filePolarMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","polar.txt"))
prepend(filePolarMetabo, "MetaID	")
filePolarMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","new_polar.txt"))
dfPolarMetabo = CSV.read(filePolarMetabo, DataFrame;  delim = '	', missingstring = "NA");
rm(filePolarMetabo)
first(dfPolarMetabo, 3)

Row,MetaID,NJHC-00001,NJHC-00002,NJHC-00003,NJHC-00004,NJHC-00005,NJHC-00006,NJHC-00007,NJHC-00008,NJHC-00009,NJHC-00010,NJHC-00011,NJHC-00012,NJHC-00013,NJHC-00014,NJHC-00015,NJHC-00016,NJHC-00017,NJHC-00018,NJHC-00019,NJHC-00020,NJHC-00021,NJHC-00022,NJHC-00023,NJHC-00024,NJHC-00025,NJHC-00026,NJHC-00027,NJHC-00028,NJHC-00029,NJHC-00030,NJHC-00031,NJHC-00032,NJHC-00033,NJHC-00034,NJHC-00035,NJHC-00036,NJHC-00037,NJHC-00038,NJHC-00039,NJHC-00040,NJHC-00041,NJHC-00042,NJHC-00043,NJHC-00044,NJHC-00045,NJHC-00046,NJHC-00047,NJHC-00048,NJHC-00049,NJHC-00050,NJHC-00051,NJHC-00052,NJHC-00053,NJHC-00054,NJHC-00055,NJHC-00056,NJHC-00057,NJHC-00058,NJHC-00059,NJHC-00060,NJHC-00061,NJHC-00062,NJHC-00063,NJHC-00064,NJHC-00065,NJHC-00066,NJHC-00067,NJHC-00068,NJHC-00069,NJHC-00070,NJHC-00071,NJHC-00072,NJHC-00073,NJHC-00074,NJHC-00075,NJHC-00076,NJHC-00077,NJHC-00078,NJHC-00079,NJHC-00080,NJHC-00081,NJHC-00082,NJHC-00083,NJHC-00084,NJHC-00085,NJHC-00086,NJHC-00087,NJHC-00088,NJHC-00089,NJHC-00090,NJHC-00091,NJHC-00092,NJHC-00093,NJHC-00094,NJHC-00095,NJHC-00096,NJHC-00097,NJHC-00098,NJHC-00099,⋯
Unnamed: 0_level_1,String15,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,433.2361_1441,missing,326600,126566,123600,missing,247818,287269,133694,434214,missing,missing,missing,134642,101917,201565,116473,missing,329892,46100,326370,missing,missing,288999,115541,111305,missing,171749,missing,missing,247354,148914,missing,missing,197342,59198,missing,missing,missing,96352,missing,236958,191626,246432,missing,missing,178408,194808,270157,127399,81454,missing,76665,125059,missing,missing,missing,356791,missing,261574,334505,102593,300143,167815,157939,missing,missing,217513,81645,274555,202424,328963,missing,missing,368319,184594,392782,missing,261283,missing,missing,316225,238190,missing,183508,233763,231260,missing,643960,240419,missing,282076,118321,184894,83335,90453,216949,missing,180211,227246,⋯
2,775.5495_717,1440248,1640319,1337974,1517128,1709690,1874165,1064377,1071848,1208501,402557,1377381,1689266,1094087,1536406,1146902,831478,1129601,2322230,766925,1630264,679538,746599,2075359,1758424,1914608,1328717,445355,missing,missing,2084987,2429211,1445526,1289925,1292265,1701691,1083087,1564160,684996,1766544,911707,1496068,1717481,1567899,1171521,1129677,828596,1176926,684392,1425785,1128527,968586,1463532,1558936,1513032,1115085,1393903,2469374,3075845,1857654,2663706,807132,1925923,1181188,1430127,775682,missing,2416556,1113628,missing,1923774,1486848,886710,1223696,799828,1726128,1105866,1406247,1374518,missing,534570,945098,1215217,1466567,1583335,930374,1000379,1123045,1462342,1502086,772041,2740831,1905257,1004507,1878403,1590567,1744088,1340926,1258187,888849,⋯
3,133.0506_1050,2167189,1214899,1373803,1592804,23720756,5269142,12589752,304496,21832552,3069386,4698901,748256,2427009,1409256,1279911,1556201,2760309,9187509,665941,8052345,9773742,20700264,2364283,385884,12128009,7834677,2112325,30471740,410276,2325184,1614019,3075491,3051263,2226974,2169912,4212791,2979874,4177573,865972,1518107,559868,17542154,664807,1959789,642851,575460,1407774,448434,402369,524342,25054822,5517349,331075,1027204,653230,4171048,7728164,482026,891489,2040487,806325,752715,2112943,2583820,11209799,4948818,2238888,826711,37305808,3859254,18654720,1423232,4990944,880440,7916349,530435,17939516,2284153,799094,1060799,4262011,240382,3853051,19024338,3620280,799929,2324520,985853,1380306,4519879,38526692,1249767,993059,1993449,6433002,959239,7556338,3945870,4484914,⋯


#### Positive early

In [8]:
filePosEarlyMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","pos.early.txt"))
prepend(filePosEarlyMetabo, "MetaID	")
filePosEarlyMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","new_pos.early.txt"))
dfPosEarlyMetabo = CSV.read(filePosEarlyMetabo, DataFrame;  delim = '	', missingstring = "NA");
rm(filePosEarlyMetabo)
first(dfPosEarlyMetabo, 3)

Row,MetaID,NJHC-00001,NJHC-00002,NJHC-00003,NJHC-00004,NJHC-00005,NJHC-00006,NJHC-00007,NJHC-00008,NJHC-00009,NJHC-00010,NJHC-00011,NJHC-00012,NJHC-00013,NJHC-00014,NJHC-00015,NJHC-00016,NJHC-00017,NJHC-00018,NJHC-00019,NJHC-00020,NJHC-00021,NJHC-00022,NJHC-00023,NJHC-00024,NJHC-00025,NJHC-00026,NJHC-00027,NJHC-00028,NJHC-00029,NJHC-00030,NJHC-00031,NJHC-00032,NJHC-00033,NJHC-00034,NJHC-00035,NJHC-00036,NJHC-00037,NJHC-00038,NJHC-00039,NJHC-00040,NJHC-00041,NJHC-00042,NJHC-00043,NJHC-00044,NJHC-00045,NJHC-00046,NJHC-00047,NJHC-00048,NJHC-00049,NJHC-00050,NJHC-00051,NJHC-00052,NJHC-00053,NJHC-00054,NJHC-00055,NJHC-00056,NJHC-00057,NJHC-00058,NJHC-00059,NJHC-00060,NJHC-00061,NJHC-00062,NJHC-00063,NJHC-00064,NJHC-00065,NJHC-00066,NJHC-00067,NJHC-00068,NJHC-00069,NJHC-00070,NJHC-00071,NJHC-00072,NJHC-00073,NJHC-00074,NJHC-00075,NJHC-00076,NJHC-00077,NJHC-00078,NJHC-00079,NJHC-00080,NJHC-00081,NJHC-00082,NJHC-00083,NJHC-00084,NJHC-00085,NJHC-00086,NJHC-00087,NJHC-00088,NJHC-00089,NJHC-00090,NJHC-00091,NJHC-00092,NJHC-00093,NJHC-00094,NJHC-00095,NJHC-00096,NJHC-00097,NJHC-00098,NJHC-00099,⋯
Unnamed: 0_level_1,String15,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,188.1757_3080,793227,1770394,1430642,2445188,1352542,1242461,891665,1941765,1291548,1623837,1657936,1374223,1337304,1011403,932392,902579,1174360,1319021,1205614,1081763,861894,1449003,1121965,801706,1574607,1133707,1652380,1188456,1236348,1208507,1598872,1730266,1126269,1003769,1126152,1365437,1670362,1058412,7732594,1080015,2385221,2659932,1853872,1158756,2292402,890057,1528213,925075,1799599,1809332,1637063,1092128,760553,1109472,1520698,2560428,1082520,1259786,2111913,844178,1162338,991584,1013220,3241765,1456829,796184,3594731,1049115,2069177,708283,1116494,1144380,1468320,1069235,1089684,1220299,1851184,1885986,833046,1535202,754274,1529354,904178,1445751,1247665,1351257,665770,1034151,646405,939193,1873184,725065,1248889,637990,1690145,2133413,860650,1210666,2719874,⋯
2,248.1493_2400,461775,179888,missing,3038025,230445,2330353,248330,2632842,1079150,1545945,1012518,2233259,960041,missing,2488886,1641560,418654,428338,1033915,1698211,missing,467664,367257,421205,994879,360077,1229797,9831420,2818473,6316247,1645663,204239,2423231,3044397,758751,150725,639901,923675,1583599,378485,2869015,423382,1526372,5318977,1254324,881225,1430398,504972,32971338,949383,527144,1847084,1067885,1156942,9884318,3207956,1741312,4193614,496343,339029,1809917,2341200,3339613,2144605,518009,932406,1471912,709821,2622923,missing,1823360,1576520,2121698,645931,1384635,347802,833046,931433,1612893,1926985,878334,1266652,1520313,1872390,763380,965917,884999,589202,513388,580412,2591889,missing,1647143,1178688,343777,1414836,4490215,372223,764813,⋯
3,248.1493_2340,1390596,611668,496154,4441903,353392,2384500,625752,3337998,1418258,2200334,2782306,5343757,1477881,519289,2051196,1388364,1138314,938979,1585871,1512594,336672,682899,791622,667230,1487537,803812,3288377,6269711,2483073,3499257,5315271,549214,4121382,2885857,1035489,1047563,1871062,1429121,1619981,640583,5402182,679690,1546138,2660045,2315135,1313943,2282608,663139,4155649,1046672,1537490,1967470,1468699,590178,2790818,2075505,1508166,3245969,908224,620381,2316498,3019875,2519827,1578515,570656,724598,1867997,780593,2749093,536534,1782048,2465048,7228401,978946,1141596,801871,1376799,1372089,1064361,1877337,907323,1348349,2283921,2125402,702625,974513,818867,860186,918701,986190,1789063,missing,1146048,1224373,726830,1484911,5783758,1363509,1640889,⋯


#### Positive late

In [9]:
filePosLateMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","pos.late.txt"))
prepend(filePosLateMetabo, "MetaID	")
filePosLateMetabo = realpath(joinpath(@__DIR__,"..","..","data","raw","COPDGene","new_pos.late.txt"))
dfPosLateMetabo = CSV.read(filePosLateMetabo, DataFrame;  delim = '	', missingstring = "NA");
rm(filePosLateMetabo)
first(dfPosLateMetabo, 3)

Row,MetaID,NJHC-00001,NJHC-00002,NJHC-00003,NJHC-00004,NJHC-00005,NJHC-00006,NJHC-00007,NJHC-00008,NJHC-00009,NJHC-00010,NJHC-00011,NJHC-00012,NJHC-00013,NJHC-00014,NJHC-00015,NJHC-00016,NJHC-00017,NJHC-00018,NJHC-00019,NJHC-00020,NJHC-00021,NJHC-00022,NJHC-00023,NJHC-00024,NJHC-00025,NJHC-00026,NJHC-00027,NJHC-00028,NJHC-00029,NJHC-00030,NJHC-00031,NJHC-00032,NJHC-00033,NJHC-00034,NJHC-00035,NJHC-00036,NJHC-00037,NJHC-00038,NJHC-00039,NJHC-00040,NJHC-00041,NJHC-00042,NJHC-00043,NJHC-00044,NJHC-00045,NJHC-00046,NJHC-00047,NJHC-00048,NJHC-00049,NJHC-00050,NJHC-00051,NJHC-00052,NJHC-00053,NJHC-00054,NJHC-00055,NJHC-00056,NJHC-00057,NJHC-00058,NJHC-00059,NJHC-00060,NJHC-00061,NJHC-00062,NJHC-00063,NJHC-00064,NJHC-00065,NJHC-00066,NJHC-00067,NJHC-00068,NJHC-00069,NJHC-00070,NJHC-00071,NJHC-00072,NJHC-00073,NJHC-00074,NJHC-00075,NJHC-00076,NJHC-00077,NJHC-00078,NJHC-00079,NJHC-00080,NJHC-00081,NJHC-00082,NJHC-00083,NJHC-00084,NJHC-00085,NJHC-00086,NJHC-00087,NJHC-00088,NJHC-00089,NJHC-00090,NJHC-00091,NJHC-00092,NJHC-00093,NJHC-00094,NJHC-00095,NJHC-00096,NJHC-00097,NJHC-00098,NJHC-00099,⋯
Unnamed: 0_level_1,String15,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,464.3136_1566,missing,121381,missing,missing,92179,104779,missing,50217,110039,missing,73924,missing,175276,128087,117437,missing,missing,114760,missing,459649,54499,missing,301693,155369,212201,missing,109254,missing,missing,missing,missing,missing,missing,127431,missing,missing,60451,190424,139360,missing,missing,74256,107344,missing,76346,86359,missing,123261,91532,162242,missing,165316,192507,256216,103320,missing,missing,missing,missing,158541,missing,missing,86365,missing,missing,96547,144052,199334,missing,124861,71314,missing,missing,142584,133625,167252,missing,53907,missing,102870,108300,137366,missing,missing,86474,55021,75595,101706,missing,missing,296060,242269,missing,missing,220422,368764,missing,128469,125418,⋯
2,766.5745_2154,17943620,21228332,8985557,15501571,12890254,7011481,14887996,19450976,14642928,10345833,14922770,10754756,14962296,11014464,10154187,17511932,12248975,17318294,8548914,16482611,11722406,14439493,14717870,15232025,7873695,19496312,11960726,13892088,9441790,8235068,8724123,7686200,20299270,14137565,10259851,11693458,11345628,21627166,12100156,16308241,15695811,19154932,6919993,13560092,14827416,9370055,11107664,11835805,8846538,19685776,14099944,15093120,13611479,18335448,14405798,14238537,15580987,26864086,12691393,9516388,16754992,12915410,15053803,10403349,12041722,13868993,13180920,8377655,14654857,13482244,14297997,20679836,10153148,15551559,20174558,11130874,12987015,7846916,9427329,17394820,7426529,20721118,13369279,12803387,10361984,15565525,9525466,9925866,16193221,14410869,16280032,13262165,15846166,8310026,13384255,14642830,12786342,8531727,9936921,⋯
3,724.5276_2270,8595852,10131860,5964946,5198212,6963317,8483393,9278738,12119313,11478797,6407578,7209232,7322834,5644429,11286119,7818366,5407538,4148380,13945082,3817160,10590262,5836108,9719604,24440482,7294398,7460534,10446229,5020924,10461462,5018773,4023266,6923729,5602483,9040959,7672750,8123296,7060455,5775747,15351736,12968388,6748840,6414935,9184627,7089467,7729486,7591673,6707180,6478604,6818768,4438010,7826705,5944516,21495578,14497137,10294155,10006150,5413952,7153786,31849424,7750888,4805131,6228875,8701815,14265001,6590102,5796190,9175119,6711890,6369578,8174174,8116214,11163536,6316493,5144205,8522233,11635584,5650288,4969183,4933958,4922543,7944928,6668362,13894920,6906466,13630501,6493652,7446530,5947701,5533975,4832431,8604270,13047888,6962092,8217220,5608664,4432561,8444706,4308160,6812751,7799625,⋯


## ST001443: COPDGene exploration

### Individuals

In [10]:
println("The participants dataset contains $(size(dfIndividuals, 1)) individuals and $(size(dfIndividuals, 2)) covariates.")

The participants dataset contains 839 individuals and 14 covariates.


In [11]:
describe(dfIndividuals)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,Type
1,sid,,10010J,,25581F,0,String7
2,sample_name,,NJHC-00001,,NJHC-01138,0,String15
3,ccenter,,NJC,,UIA,0,String3
4,finalgold_visit,,GOLD 0,,GOLD 4,0,String7
5,gender,,Female,,Male,0,String7
6,age_visit,67.3555,50.0,67.7,87.4,0,Float64
7,race,,Black or African American,,White,0,String31
8,BMI,28.8765,14.14,28.05,58.58,0,Float64
9,smoking_status,,Current smoker,,Former smoker,0,String15
10,ATS_PackYears,44.914,10.0,40.0,180.0,1,"Union{Missing, Float64}"


Note: `ATS_PackYears` (smoking history in pack-years) and `Insp_LAA950_total_Thirona` (percent emphysema) contain missing values (`NA` in the raw file).

Count the missing values in each column:

In [12]:
vMissing = map(eachcol(dfIndividuals)) do col
               sum(ismissing.(col))
           end
idxColMiss = findall(vMissing .!= 0)
for i in idxColMiss
    println("$(names(dfIndividuals)[i]) contains $(vMissing[i]) missing values.")
end

ATS_PackYears contains 1 missing values.
Insp_LAA950_total_Thirona contains 54 missing values.


#### Change variable names: 

In [13]:
rename!(dfIndividuals, Dict(:sid => "SampleName", :sample_name => "SampleID",
                            :ccenter => "Site", :finalgold_visit => "FinalGold",
                            :FEV1_FVC_utah => "FEV1_FVC", :FEV1pp_utah => "FEV1pp",
                            :gender => "Sex", :age_visit => "Age", 
                            :ATS_PackYears => "SmokingPackYears",
                            :Insp_LAA950_total_Thirona => "PercentEmphysema"));

Check the levels of each categorical variable, *i.e.*, sex, race, smoking status, GOLD stage, and COPD case status.

In [14]:
vCovariateNames = [:Sex, :race, :smoking_status, :FinalGold, :COPD]
vUniqueCat = map(eachcol(dfIndividuals[:, vCovariateNames])) do col
                 join(unique(col), ",", " and ")
             end
for i in 1:length(vUniqueCat)
    println("$(string(vCovariateNames[i])) variable contains: $(vUniqueCat[i]) values.")
end

Sex variable contains: Female and Male values.
race variable contains: White and Black or African American values.
smoking_status variable contains: Former smoker and Current smoker values.
FinalGold variable contains: GOLD 2,GOLD 4,GOLD 0 and GOLD 3 values.
COPD variable contains: 1 and 0 values.


Recode the following variables:
- `race` becomes `NHW`, where the value *1* corresponds to non-Hispanic White and *0* otherwise.
- `smoking_status` becomes `CurrentSmoker`, where the value *1* corresponds to current smoker and *0* to former smoker.
- `FinalGold` values are reduced to the stage digit: "0", "2", "3", or "4".

In [15]:
# Non-Hispanic White
vNHW = zeros(Int, size(dfIndividuals, 1));
idxNHW = findall(dfIndividuals.race .== "White");
vNHW[idxNHW] .= 1
dfIndividuals.NHW = (vNHW);

# GOLD index
dfIndividuals.FinalGold = [match(r"\d+", s).match for s in dfIndividuals.FinalGold];

# Current Smokers
vCurrentSmoker = zeros(Int, size(dfIndividuals, 1));
idxCurrentSmoker = findall(dfIndividuals.smoking_status .== "Current smoker");
vCurrentSmoker[idxCurrentSmoker] .= 1
dfIndividuals.CurrentSmoker = vCurrentSmoker;

# Drop `race` and `smoking_status`
select!(dfIndividuals, Not([:race, :smoking_status]));
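The same recoding can be written more compactly with broadcast comparisons. The sketch below uses toy vectors rather than the real columns, so the values are illustrative only:

```julia
# Toy vectors standing in for the `race`, `smoking_status`, and
# `FinalGold` columns (illustrative values only).
race    = ["White", "Black or African American", "White"]
smoking = ["Former smoker", "Current smoker", "Former smoker"]
gold    = ["GOLD 2", "GOLD 0", "GOLD 4"]

# Broadcast comparisons give Bool vectors; Int.() converts them to 0/1.
nhw            = Int.(race .== "White")
current_smoker = Int.(smoking .== "Current smoker")

# Keep only the digits of each GOLD label.
gold_stage = [match(r"\d+", s).match for s in gold]
```

This avoids the intermediate `zeros` and `findall` steps while producing the same 0/1 coding.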

Summarize the demographics of the COPDGene cohort by sex.

In [16]:
# Group by sex
gdf = groupby(dfIndividuals, :Sex);

# Get mean values
mymean(X) = mean(skipmissing(X))
df1a = combine(gdf, [:Age, :BMI, :SmokingPackYears, :PercentEmphysema] .=> mymean)
df1a[:,2:end] = round.((df1a[:,2:end]); digits = 1)
rename!(df1a, Dict(:Age_mymean => "Age", :BMI_mymean => "BMI",
                   :SmokingPackYears_mymean => "SmokingPackYears", :PercentEmphysema_mymean => "PercentEmphysema"));

# Get standard deviation values
mystd(X) = std(skipmissing(X))
df1b = combine(gdf, [:Age, :BMI, :SmokingPackYears, :PercentEmphysema] .=> mystd)
df1b[:,2:end] = round.((df1b[:,2:end]); digits = 1)
rename!(df1b, Dict(:Age_mystd => "Age", :BMI_mystd => "BMI",
                   :SmokingPackYears_mystd => "SmokingPackYears", :PercentEmphysema_mystd => "PercentEmphysema"));

# Join mean and standard deviation values
dfDem1 = string.(df1a[:,2:end]).*repeat(["("], size(df1a,1),size(df1a,2)-1).* 
         string.(df1b[:,2:end]).*repeat([")"], size(df1a,1),size(df1a,2)-1);
insertcols!(dfDem1, 1, :Sex => df1a.Sex, :Participants => combine(gdf, nrow)[:,2])

# Get sum values
df2a = combine(gdf, [:NHW, :CurrentSmoker, :COPD] .=> sum)

# Get percentage values
df2b = round.((df2a[:,2:end]./ dfDem1.Participants).*100, digits = 1)

# Join sum and percentage values
dfDem2 = string.(df2a[:,2:end]).*repeat(["("], size(df2a,1),size(df2a,2)-1).* 
         string.(df2b[:,1:end]).*repeat([")"], size(df2a,1),size(df2a,2)-1)
insertcols!(dfDem2, 1, :Sex => df2a.Sex)

rename!(dfDem2, Dict(:NHW_sum => "NHW", :CurrentSmoker_sum => "CurrentSmoker",
                   :COPD_sum => "COPD"))
# Join demographics dataframes
dfDem = leftjoin(dfDem1, dfDem2, on = :Sex )

# Pivot table
dfDem = permutedims(dfDem, 1, "Variable")

Row,Variable,Female,Male
Unnamed: 0_level_1,String,Any,Any
1,Participants,405,434
2,Age,66.1(8.8),68.5(8.4)
3,BMI,28.6(6.7),29.1(5.6)
4,SmokingPackYears,39.4(20.5),50.1(27.1)
5,PercentEmphysema,6.3(10.2),9.0(11.3)
6,NHW,370(91.4),399(91.9)
7,CurrentSmoker,111(27.4),88(20.3)
8,COPD,167(41.2),224(51.6)


The demographic table matches the one reported in the article "*Metabolomic Profiling Reveals Sex Specific Associations with Chronic Obstructive Pulmonary Disease and Emphysema*" (2021).
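The continuous rows of the table use a `mean(sd)` format with one decimal. A minimal sketch of that formatting step, using a hypothetical helper `mean_sd` and toy data, could be:

```julia
using Statistics

# Format a numeric vector as "mean(sd)", ignoring missing values and
# rounding both statistics to one decimal, as in the table above.
function mean_sd(x)
    vals = collect(skipmissing(x))
    m = round(mean(vals); digits = 1)
    s = round(std(vals); digits = 1)
    return string(m, "(", s, ")")
end

mean_sd([1.0, 2.0, 3.0, missing])  # "2.0(1.0)"
```

A helper like this could be passed to `combine` directly, which would remove the need to build and join separate mean and standard deviation data frames.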

#### Save processed individuals dataset:

In [17]:
first(dfIndividuals)

Row,SampleName,SampleID,Site,FinalGold,Sex,Age,BMI,SmokingPackYears,FEV1_FVC,FEV1pp,PercentEmphysema,COPD,NHW,CurrentSmoker
Unnamed: 0_level_1,String7,String15,String3,SubStrin…,String7,Float64,Float64,Float64?,Float64,Float64,Float64?,Int64,Int64,Int64
1,10010J,NJHC-00611,NJC,2,Female,73.5,27.51,30.7,0.62,51.8,2.42326,1,1,0


In [18]:
fileIndividuals = joinpath(@__DIR__,"..","..","data","processed","COPDGene","COPDGene_ClinicalCovariates.csv");
dfIndividuals |> CSV.write(fileIndividuals);

### Metabolomics References

#### Create dataframe with pathways

Keep ID, biochemical, comp ID, super pathways and sub pathways.

In [19]:
first(dfRefMetabo[:, Symbol.(["MetaID", "BIOCHEMICAL", "COMP.ID"])], 3)

Row,MetaID,BIOCHEMICAL,COMP.ID
Unnamed: 0_level_1,String15,String,Int64
1,269.2487_5695,(14 or 15)-methylpalmitate (a17:0 or i17:0),38768
2,343.2279_5565,(15:2)-anacardic acid,41397
3,297.2799_5993,(16 or 17)-methylstearate (a19:0 or i19:0),38296


In [20]:
last(dfRefMetabo[:, Symbol.(["MetaID", "BIOCHEMICAL", "COMP.ID"])], 3)

Row,MetaID,BIOCHEMICAL,COMP.ID
Unnamed: 0_level_1,String15,String,Int64
1,175.0826_1730.3,X - 25422,62719
2,261.1345_2595,X - 25451,62821
3,401.2182_4038,X - 25452,62822


In [21]:
# Select variables of interest and rename accordingly
rename!(dfRefMetabo, Dict(:BIOCHEMICAL => "Biochemical", Symbol("COMP.ID") => "CompID", 
                         Symbol("SUB.PATHWAY") => "SubPathway", Symbol("SUPER.PATHWAY") => "SuperPathway")) 
select!(dfRefMetabo, [:MetaID, :Biochemical, :CompID, :SubPathway, :SuperPathway]);

In [22]:
# Create two new variables, SubClassID and SuperClassID, that
# encode the pathways

# Group by Super Pathway
gdf = groupby(dfRefMetabo, :SuperPathway);

nTotalSub = length(unique(dfRefMetabo.SubPathway))
vInit = repeat(["NA"], nTotalSub);
dfNewRef = DataFrame(SubPathway = vInit, SubClassID = vInit,
                     SuperPathway = vInit, SuperClassID = vInit);

In [23]:
# Generate pathway ID references for the metabolites
idxStart = 1

for i in 1:(length(gdf)-1)
    vSub = sort(unique(gdf[i].SubPathway))
    nSub = length(vSub)
    
    idxEnd = idxStart + nSub - 1
    
    dfNewRef.SubPathway[idxStart:idxEnd] = vSub;
    dfNewRef.SubClassID[idxStart:idxEnd] = uppercase(gdf[i].SuperPathway[1][1:3]).*string.(collect(1:nSub));
    dfNewRef.SuperPathway[idxStart:idxEnd] .= gdf[i].SuperPathway[1];
    dfNewRef.SuperClassID[idxStart:idxEnd] .= uppercase(gdf[i].SuperPathway[1][1:3]);
    
    idxStart = idxEnd + 1
end
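The coding scheme used in the loop above can be illustrated in isolation. The sketch below uses a hypothetical helper, `subclass_ids`, and made-up sub-pathway lists; it only mirrors the naming rule (three-letter uppercase prefix plus alphabetical rank), not the bookkeeping around `dfNewRef`:

```julia
# Build sub-pathway codes for one super pathway: the first three letters of
# the super pathway name, uppercased, followed by the rank of the
# sub-pathway in alphabetical order.
function subclass_ids(superpathway::AbstractString, subpathways)
    subs = sort(unique(subpathways))
    prefix = uppercase(first(superpathway, 3))
    return Dict(sub => prefix * string(i) for (i, sub) in enumerate(subs))
end

# Illustrative input only; duplicates are collapsed before ranking.
ids = subclass_ids("Xenobiotics", ["Tobacco Metabolite", "Chemical", "Chemical"])
```

Here `ids["Chemical"]` is `"XEN1"` and `ids["Tobacco Metabolite"]` is `"XEN2"`.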

In [24]:
# Initialize vectors
nMeta = size(dfRefMetabo, 1);
vClass = repeat(["NA"], nMeta);
vSupClass = repeat(["NA"], nMeta);

for i in 1:length(dfNewRef.SubPathway)
    idx = findall(dfRefMetabo.SubPathway.== dfNewRef.SubPathway[i])
    vClass[idx] .= dfNewRef.SubClassID[i]
    vSupClass[idx] .= dfNewRef.SuperClassID[i]
end
dfRefMetabo.SubClassID = vClass; 
dfRefMetabo.SuperClassID = vSupClass;
dfRefMetabo.CompID = "comp".*string.(dfRefMetabo.CompID);

# Zero-pad single-digit sub-class IDs (e.g., XEN5 becomes XEN05) so that they sort correctly.
idxSub2Change = findall(length.(dfRefMetabo.SubClassID) .== 4)
for i in 1:length(idxSub2Change) 
    dfRefMetabo.SubClassID[idxSub2Change[i]] = dfRefMetabo.SubClassID[idxSub2Change[i]][1:3]*"0"*dfRefMetabo.SubClassID[idxSub2Change[i]][4]
end
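Zero-padding of this kind can also be done with `lpad` from Julia's Base library. A small sketch, assuming a three-character prefix as in the IDs built above (`pad_subclass` is a hypothetical name):

```julia
# Zero-pad the numeric part of a code such as "XEN5" to two digits.
# Assumes a three-character ASCII prefix, as in the IDs built above.
pad_subclass(id::AbstractString) = first(id, 3) * lpad(id[4:end], 2, '0')

padded = pad_subclass.(["XEN10", "XEN2", "XEN16"])
```

After padding, a plain `sort(padded)` orders the codes numerically within each prefix.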

In [25]:
dfRefMetabo[2,Not([4,5])]

Row,MetaID,Biochemical,CompID,SubClassID,SuperClassID
Unnamed: 0_level_1,String15,String,String,String,String
2,343.2279_5565,(15:2)-anacardic acid,comp41397,XEN15,XEN


#### Check cotinine biochemicals

Cotinine levels will be imputed differently if more than 20% of their values are missing.

In [26]:
# check for cotinine
idxCotinine = findall(occursin.(r"(?i)cotinine", dfRefMetabo.Biochemical))
dfRefMetabo[idxCotinine, :]

Row,MetaID,Biochemical,CompID,SubPathway,SuperPathway,SubClassID,SuperClassID
Unnamed: 0_level_1,String15,String,String,String,String,String,String
1,367.1147_1745,3-hydroxycotinine glucuronide,comp43470,Tobacco Metabolite,Xenobiotics,XEN16,XEN
2,177.1022_2213,cotinine,comp553,Tobacco Metabolite,Xenobiotics,XEN16,XEN
3,193.0972_2030,hydroxycotinine,comp38661,Tobacco Metabolite,Xenobiotics,XEN16,XEN
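The 20% criterion mentioned above could be checked per metabolite as follows. This is only a sketch on a toy vector; `missing_fraction` is a hypothetical helper, and the imputation itself is not part of this notebook:

```julia
# Fraction of missing values in one metabolite's measurements.
missing_fraction(v) = count(ismissing, v) / length(v)

# Toy abundance vector with 2 of 5 values missing (illustrative only).
abundances = [1200, missing, 950, missing, 1100]

# true when more than 20% of the values are missing
exceeds_threshold = missing_fraction(abundances) > 0.20
```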


#### Explore frequency table

In [27]:
freqtable(dfRefMetabo.SuperPathway)

10-element Named Vector{Int64}
Dim1                              │ 
──────────────────────────────────┼────
Amino Acid                        │ 205
Carbohydrate                      │  25
Cofactors and Vitamins            │  38
Energy                            │  11
Lipid                             │ 431
NA                                │ 336
Nucleotide                        │  35
Partially Characterized Molecules │  10
Peptide                           │  40
Xenobiotics                       │ 261

In [28]:
idxLipid = findall(dfRefMetabo.SuperPathway .== "Lipid")
freqtable(dfRefMetabo[idxLipid, :SubPathway]); #  |> show;

#### Save processed metabolite reference dataset:

In [29]:
fileRef = joinpath(@__DIR__,"..","..","data","processed","COPDGene","refMeta.csv");
dfRefMetabo |> CSV.write(fileRef);

### Negative

Filter the samples of `dfNegMetabo` according to the individuals dataframe `dfIndividuals`.

#### Keep complete cases

In [30]:
dfNegMetabo = keepComplete(dfNegMetabo, dfIndividuals, dfRefMetabo; sampleCol=  :SampleID);
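`keepComplete` is defined in `wrangle_utils.jl` and is not shown here. The sample counts printed in this notebook (784 samples from 839 participants) are consistent with dropping participants who have a missing covariate, so one plausible reading is sketched below with DataFrames.jl. `keep_complete_sketch` is a hypothetical name and its logic is an assumption; the real helper also takes `dfRefMetabo` and may do more, such as adding a `CompID` column.

```julia
using DataFrames

# Hypothetical sketch (not the project's implementation): keep only the
# sample columns whose participant has no missing covariate.
# `sampleCol` names the column of `dfInd` holding the sample IDs.
function keep_complete_sketch(dfMetabo::DataFrame, dfInd::DataFrame;
                              sampleCol = :SampleID)
    # Sample IDs of participants with complete covariate data
    completeIDs = dropmissing(dfInd)[!, sampleCol]
    # Metabolite columns that belong to those participants, in file order
    keep = [n for n in names(dfMetabo) if n in completeIDs]
    return select(dfMetabo, vcat("MetaID", keep))
end
```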

#### Save filtered sample negative metabolites levels dataset:

In [31]:
fileNeg = joinpath(@__DIR__,"..","..","data","processed","COPDGene","negMeta.csv");
dfNegMetabo |> CSV.write(fileNeg);

In [32]:
println("The negative metabolite dataset contains $(size(dfNegMetabo, 2)-1) samples and $(size(dfNegMetabo, 1)) metabolites.")

The negative metabolite dataset contains 784 samples and 739 metabolites.


### Polar

Filter the samples of `dfPolarMetabo` according to the individuals dataframe `dfIndividuals`.

#### Keep complete cases

In [33]:
dfPolarMetabo = keepComplete(dfPolarMetabo, dfIndividuals, dfRefMetabo; sampleCol=  :SampleID);

#### Save filtered sample polar metabolites levels dataset:

In [34]:
filePolar = joinpath(@__DIR__,"..","..","data","processed","COPDGene","polarMeta.csv");
dfPolarMetabo |> CSV.write(filePolar);

In [35]:
println("The polar metabolite dataset contains $(size(dfPolarMetabo, 2)-1) samples and $(size(dfPolarMetabo, 1)) metabolites.")

The polar metabolite dataset contains 784 samples and 83 metabolites.


In [36]:
# Check the names of the polar biochemicals; they seem to be negative polar metabolites
idxPol = findall(x -> [x] ⊆ dfPolarMetabo.CompID, dfRefMetabo.CompID);
dfRefMetabo.Biochemical[idxPol];

The polar dataset contains negative polar metabolites.
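The membership lookup used for this check can be expressed with `in` and a `Set`. A small sketch with made-up IDs:

```julia
# Toy IDs (illustrative only).
refCompID   = ["comp1", "comp2", "comp3", "comp4"]
polarCompID = ["comp4", "comp2"]

# `in(Set(...))` builds a predicate with constant-time lookups;
# `findall` returns the positions in `refCompID` that match.
idx = findall(in(Set(polarCompID)), refCompID)
```

Here `idx` is `[2, 4]`, the positions of `comp2` and `comp4` in the reference vector.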

### Positive Early

Filter the samples of `dfPosEarlyMetabo` according to the individuals dataframe `dfIndividuals`.

[Polar molecules elute earlier and nonpolar molecules later.](https://www.sciencedirect.com/topics/immunology-and-microbiology/metabolome-analysis)

#### Keep complete cases

In [37]:
dfPosEarlyMetabo = keepComplete(dfPosEarlyMetabo, dfIndividuals, dfRefMetabo; sampleCol=  :SampleID);

#### Save filtered sample positive early metabolites levels dataset:

In [38]:
filePosEarly = joinpath(@__DIR__,"..","..","data","processed","COPDGene","posEarlyMeta.csv");
dfPosEarlyMetabo |> CSV.write(filePosEarly);

In [39]:
println("The positive early metabolite dataset contains $(size(dfPosEarlyMetabo, 2)-1) samples and $(size(dfPosEarlyMetabo, 1)) metabolites.")

The positive early metabolite dataset contains 784 samples and 319 metabolites.


### Positive Late

Filter the samples of `dfPosLateMetabo` according to the individuals dataframe `dfIndividuals`.

[Polar molecules elute earlier and nonpolar molecules later.](https://www.sciencedirect.com/topics/immunology-and-microbiology/metabolome-analysis)

#### Keep complete cases

In [40]:
dfPosLateMetabo = keepComplete(dfPosLateMetabo, dfIndividuals, dfRefMetabo; sampleCol=  :SampleID);

#### Save filtered sample positive late metabolites levels dataset:

In [41]:
filePosLate = joinpath(@__DIR__,"..","..","data","processed","COPDGene","posLateMeta.csv");
dfPosLateMetabo |> CSV.write(filePosLate);

In [42]:
println("The positive late metabolite dataset contains $(size(dfPosLateMetabo, 2)-1) samples and $(size(dfPosLateMetabo, 1)) metabolites.")

The positive late metabolite dataset contains 784 samples and 251 metabolites.
