# Wrangling PANSTEATITIS study ST001059
---

This notebook carries out the wrangling process for the [LIVER study ST001059 lipidomics data](https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001059&StudyType=MS&ResultType=1) [1].

## Libraries

In [1]:
# To use RCall for the first time, one needs to set
# the location of the R home directory.
firstTimeRCall = false
if firstTimeRCall 
    using Pkg
    ENV["R_HOME"] = "C:/PROGRA~1/R/R-42~1.1" # from R.home() in R
    Pkg.build("RCall")
end     

In [5]:
using DataFrames, CSV
using FreqTables #, CategoricalArrays
using StatsBase
using Conda, RCall, PyCall
using MetabolomicsWorkbenchAPI

┌ Info: Precompiling MetabolomicsWorkbenchAPI [19b29032-9db8-4baa-af7b-2b362e62b3d7]
└ @ Base loading.jl:1662


## External functions

In [82]:
include(joinpath(@__DIR__,"..","..","src","wrangling_utils.jl" ));
include(joinpath(@__DIR__,"..","..","src","demog.jl" ));

## Load data ST001059

In [7]:
ST  = "ST001059";

## Extract clinical covariates

Use the Julia API to get the sample data from the [metabolomics workbench](https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001052).

In [64]:
# get clinical covariates
dfIndividuals =  fetch_samples(ST);
print_df_size(dfIndividuals)

The dataframe contains 31 rows and 17 columns


List of the covariate names: 

In [65]:
names(dfIndividuals)

17-element Vector{String}:
 "Sample ID"
 "Group"
 "Gender"
 "Date Captured"
 "Annuli"
 "Age"
 "WEIGHT (KG)"
 "LENGTH (CM)"
 "TG (CM)"
 "VET SCORE (Adipose)"
 "TOTAL PROTEIN (g/100mL)"
 "PCV Color"
 "PCV"
 "Histology Adipose"
 "Histology Liver"
 "Histology Swim Bladder"
 "NECROPSY NOTES:"

In [66]:
println("From the study description, $(ST) has $(fetch_total_subjects(ST)) subjects.")

From the study description, ST001059 has 31 subjects.


The clinical covariates dataframe contains 2 extra rows. We need to indicate which values correspond to `missing` data. In our case, every "-" is replaced by `missing`.

In [67]:
# assign missing value to "-"
dfIndividuals = ifelse.(dfIndividuals .== "-", missing, dfIndividuals);

Check the number of missing values per column.

In [68]:
print_variables_missing(dfIndividuals)

VET SCORE (Adipose) contains 10 missing values.
TOTAL PROTEIN (g/100mL) contains 1 missing values.
PCV Color contains 1 missing values.
PCV contains 1 missing values.
Histology Adipose contains 4 missing values.
Histology Swim Bladder contains 5 missing values.
NECROPSY NOTES: contains 5 missing values.
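
As a minimal sketch of these two steps on toy data (the values below are made up and not taken from the study), `replace` turns the "-" placeholder into `missing` and `count(ismissing, ...)` gives the per-column tally. It uses only Base Julia, with a `NamedTuple` of vectors standing in for the dataframe; how `print_variables_missing` is actually implemented is not shown in this notebook.

```julia
# Toy stand-in for the clinical table: a NamedTuple of column vectors.
# All values are illustrative and not taken from ST001059.
toy = (
    SampleID = ["8363", "8371", "8373"],
    VetScore = ["3.0", "-", "0.5"],
    PCV      = ["-", "-", "31"],
)

# Replace the "-" placeholder by `missing` in every column.
cleaned = map(col -> replace(col, "-" => missing), toy)

# Count the missing values per column.
nmissing = map(col -> count(ismissing, col), cleaned)

# Report only the columns that contain at least one missing value.
for (name, n) in pairs(nmissing)
    n > 0 && println(name, " contains ", n, " missing values.")
end
```
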


**Note:** In `VET SCORE (Adipose)`, the character "-" actually indicates a score of 0, that is, absence of pansteatitis.

### Clinical dictionary

In [69]:
fileClinicalDict = joinpath(@__DIR__,"..","..","data","processed", "ClinicalDataDictionary_ST001059.csv");
# Write one "variable,description" record per line;
# the variable names match the column names of the clinical table.
open(fileClinicalDict,"w") do io
   println(io,"Group,Indicates sex and disease status.\n",
        "Gender,Sex.\n",
        "Age,Years\n",
        "WEIGHT (KG),Kilogram\n",
        "LENGTH (CM),Centimeter\n",
        "Annuli,Number of opaque zones on fish scales.\n",
        # "TG (CM),???\n",
        "VET SCORE (Adipose),Veterinarian score where vet score < 1 indicates healthy tilapia and vet score ≥ 1 indicates pansteatitis-affected tilapia.\n",
        "TOTAL PROTEIN (g/100mL),Total protein measurement in grams per deciliter.\n",
        "PCV Color,Pigmentation visually observed.\n",
        "PCV,Pigmentation concentration volume.\n",
        "Histology Adipose,Histological examination score of the adipose tissue.\n",
        "Histology Liver,Histological examination score of the liver tissue.\n",
        "Histology Swim Bladder,Histological examination score of the swim bladder tissue."
    )
end
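
The dictionary file is a plain two-column CSV. A minimal sketch of the same idea, assuming the entries are kept as a vector of `variable => description` pairs (a hypothetical layout, not the one used above), writes one record per line so that a missing line break cannot merge two entries:

```julia
# Hypothetical dictionary entries (variable name => description).
entries = [
    "Gender" => "Sex.",
    "Age"    => "Years",
    "Annuli" => "Number of opaque zones on fish scales.",
]

# Write one "variable,description" record per line into an in-memory buffer;
# the same loop works with a file opened via open(path, "w") do io ... end.
io = IOBuffer()
for (variable, description) in entries
    println(io, variable, ",", description)
end
dictText = String(take!(io))
print(dictText)
```

Descriptions that contain a comma would need quoting, which `CSV.write` handles automatically when the entries are stored in a dataframe.
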

### Independent variables

Select variables of interest:

In [71]:
select!(dfIndividuals, Symbol.(["Sample ID",
                                "Group",
                                "Gender",
                                "Annuli",
                                "Age",
                                "WEIGHT (KG)",
                                "LENGTH (CM)",
                                "VET SCORE (Adipose)",
]));

Rename variables if needed:

In [72]:
rename!(dfIndividuals, Dict(Symbol("Sample ID") => "SampleID",
                            Symbol("WEIGHT (KG)") => "Weight",
                            Symbol("LENGTH (CM)") => "Length",
                            Symbol("VET SCORE (Adipose)",) => "VetScore",
));

Filter incomplete cases:

In [74]:
# Replace the missings in VetScore by 0
dfIndividuals.VetScore = coalesce.(dfIndividuals.VetScore, 0);

In [76]:
# filter complete cases
idxComplete = findall(completecases(dfIndividuals))
dfIndividuals = dfIndividuals[idxComplete, :]

# add a prefix to the ID samples
dfIndividuals.SampleID = "ID_".*string.(dfIndividuals.SampleID);

first(dfIndividuals, 5)

| Row | SampleID | Group | Gender | Annuli | Age | Weight | Length | VetScore |
|---|---|---|---|---|---|---|---|---|
| | String | String | String | String | String | String | String | Any |
| 1 | ID_ID_8363 | FD | F | 12 | 13 | 1.5 | 40.0 | 3.0 |
| 2 | ID_ID_8371 | FD | F | 11 | 12 | 1.4 | 41.0 | 3.0 |
| 3 | ID_ID_8373 | FD | F | 12 | 13 | 1.9 | 43.5 | 0.5 |
| 4 | ID_ID_8376 | FD | F | 11 | 12 | 1.6 | 42.5 | 1.0 |
| 5 | ID_ID_8385 | FD | F | 12 | 13 | 1.6 | 42.0 | 3.0 |
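
The two filtering steps can be illustrated on toy vectors. This is a sketch in Base Julia only, with made-up values: `coalesce` fills the missing vet scores with 0, and the boolean mask reproduces what `completecases` computes for a dataframe with these two columns.

```julia
# Toy columns with missing values (illustrative only).
vetScore = [3.0, missing, 0.5, missing]
weight   = [1.5, 1.4, missing, 1.6]

# A missing vet score is interpreted as 0, i.e. absence of pansteatitis.
vetScore = coalesce.(vetScore, 0.0)

# A row is a complete case when none of its columns is missing.
isComplete  = .!(ismissing.(vetScore) .| ismissing.(weight))
idxComplete = findall(isComplete)
```

Here the third row is dropped because its weight is missing, while the rows with a filled-in vet score are kept.
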


The `Group` variable encodes both sex and disease status:

In [77]:
unique(dfIndividuals.Group)

4-element Vector{String}:
 "FD"
 "FH"
 "MD"
 "MH"

Let us redefine `Group` so that it holds only the disease status, and spell out the `Gender` labels:

In [78]:
# insertcols!(dfIndividuals, 3, :GroupStatus => occursin.("D", dfIndividuals.Group));
idxDiseased = findall(occursin.("D", dfIndividuals.Group)) ;
idxHealthy = findall(occursin.("H", dfIndividuals.Group));

dfIndividuals.Group[idxDiseased] .= "Diseased";
dfIndividuals.Group[idxHealthy] .= "Healthy";

idxMale = findall(occursin.("M", dfIndividuals.Gender)) ;
idxFemale = findall(occursin.("F", dfIndividuals.Gender));

dfIndividuals.Gender[idxMale] .= "Male";
dfIndividuals.Gender[idxFemale] .= "Female";
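
An alternative way to decode the two-letter group codes is a lookup table. The sketch below assumes the first letter is the sex and the second the disease status, as in the four codes listed above; `decodegroup` is a hypothetical helper, not part of the project sources.

```julia
# Lookup tables for the two letters of the group code (sex, then status).
sexLabel    = Dict('F' => "Female", 'M' => "Male")
statusLabel = Dict('D' => "Diseased", 'H' => "Healthy")

# Hypothetical helper: decode a code such as "FD" into (sex, status).
decodegroup(code::AbstractString) = (sexLabel[first(code)], statusLabel[last(code)])

codes   = ["FD", "FH", "MD", "MH"]
decoded = decodegroup.(codes)
```

Unlike `occursin`, the lookup depends on the position of each letter, so an unexpected code raises a `KeyError` instead of being silently mislabelled.
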

In [79]:
first(dfIndividuals, 5)

| Row | SampleID | Group | Gender | Annuli | Age | Weight | Length | VetScore |
|---|---|---|---|---|---|---|---|---|
| | String | String | String | String | String | String | String | Any |
| 1 | ID_ID_8363 | Diseased | Female | 12 | 13 | 1.5 | 40.0 | 3.0 |
| 2 | ID_ID_8371 | Diseased | Female | 11 | 12 | 1.4 | 41.0 | 3.0 |
| 3 | ID_ID_8373 | Diseased | Female | 12 | 13 | 1.9 | 43.5 | 0.5 |
| 4 | ID_ID_8376 | Diseased | Female | 11 | 12 | 1.6 | 42.5 | 1.0 |
| 5 | ID_ID_8385 | Diseased | Female | 12 | 13 | 1.6 | 42.0 | 3.0 |


#### Save processed individuals dataset:

In [80]:
fileIndividuals = joinpath(@__DIR__,"..","..","data","processed","ST001059_ClinicalCovariates.csv");
dfIndividuals |> CSV.write(fileIndividuals);

### Demography

In [83]:
dfDemog = getDemographicST001059("ST001059")

| Row | Clinical Features | Count/ mean(SD) |
|---|---|---|
| | Any | Any |
| 1 | Group | |
| 2 | Diseased | 16 |
| 3 | Healthy | 5 |
| 4 | Gender | |
| 5 | Female | 8 |
| 6 | Male | 13 |
| 7 | Annuli | 8.9(2.9) |
| 8 | Age | 9.9(2.9) |
| 9 | WEIGHT (KG) | 1.65(0.21) |
| 10 | LENGTH (CM) | 42.29(1.71) |


In [84]:
fileDemog = joinpath(@__DIR__,"..","..","data","processed","Demog_ST001059.csv");
dfDemog |> CSV.write(fileDemog);

## Extract Metabolite references

In [121]:
# get metabolite references
dfRef =  fetch_metabolites(ST);
print_df_size(dfRef)

The dataframe contains 962 rows and 11 columns


List the name of available properties:

In [122]:
names(dfRef)

11-element Vector{String}:
 "Metabolite"
 "quantified m/z"
 "rtimes"
 "ID_Ranked (LipidMatch annotation/rank)"
 "ID_Ranked (LipidSearch annotation/rank)"
 "Class_At_Max_Intensity"
 "Adduct_At_Max_Intensity"
 "(LipidMatch Normalizer output)"
 "IS_Species"
 "IS_Adduct"
 "Neg & Pos Confirmed"

In [123]:
first(dfRef, 5)

| Row | Metabolite | quantified m/z | rtimes | ID_Ranked (LipidMatch annotation/rank) |
|---|---|---|---|---|
| | String | String | String | String |
| 1 | AcCa(18:1) | 426.3569463 | 1.480561111 | |
| 2 | AcCa(13:0) | 358.2943419 | 0.861755556 | |
| 3 | AcCa(18:0) | 428.3726719 | 1.962311852 | |
| 4 | Cer(d18:1_24:1) | 646.6138781 | 12.31576593 | |
| 5 | Cer(d18:1_20:0) | 594.5808668 | 10.18453939 | |


Create a metabolite ID and keep only name and ID:

In [124]:
dfRef.MetaboliteID = "MT" .* string.(10000 .+ collect(1:size(dfRef, 1)));
select!(dfRef, [:Metabolite, :MetaboliteID]);
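
The ID construction relies on broadcasting: adding 10000 to the row index gives every ID the same width. A small sketch with a toy size `n` (the real table uses the number of rows of `dfRef`):

```julia
# Build sequential IDs "MT10001", "MT10002", ... for n metabolites.
n = 5  # toy size, for illustration only
metaboliteID = "MT" .* string.(10000 .+ (1:n))
# → ["MT10001", "MT10002", "MT10003", "MT10004", "MT10005"]
```

With this offset, the IDs keep a fixed length of seven characters as long as there are fewer than 90000 metabolites, so they also sort correctly as strings.
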

### Get Classification information

To get the classification information, we use the package `MetabolomicsWorkbenchAPI.jl`.

In [125]:
dfClassification = fetch_properties(dfRef.Metabolite);
# insertcols!(dfClassification, 1, :metabolite_name => dfRef.metabolite_name);
first(dfClassification, 3)

| Row | exactmass | formula | main_class | refmet_name | sub_class | super_class |
|---|---|---|---|---|---|---|
| | String? | String? | String? | String? | String? | String? |
| 1 | 425.3505 | C25H47NO4 | Fatty esters | CAR 18:1 | Acyl carnitines | Fatty Acyls |
| 2 | 357.2879 | C20H39NO4 | Fatty esters | CAR 13:0 | Acyl carnitines | Fatty Acyls |
| 3 | 427.3662 | C25H49NO4 | Fatty esters | CAR 18:0 | Acyl carnitines | Fatty Acyls |


In [126]:
idxmissing = findall(ismissing.(dfClassification.main_class))
dfRef.Metabolite[idxmissing]

247-element Vector{String}:
 "Cer(d18:0+pO_14:0)"
 "Cer(d18:0+pO_18:0)"
 "Cer(d18:0+pO_20:0)"
 "Cer(d18:0+pO_22:0)"
 "CerG1(d18:0+pO_18:0)"
 "CerG1(d18:0+pO_22:0+O)"
 "CerG1(d18:0+pO_20:0)"
 "CerG1(d18:0+pO_24:1)"
 "CerG1(d18:0+pO_22:0)"
 "CerG1(d18:1_22:0+O)"
 "CerG1(d18:0+pO_14:0)"
 "CerG1(d18:0+pO_22:1)"
 "CerG1(d18:0+pO_16:0)"
 ⋮
 "plasmanyl-TG(O-16:0_18:0_18:1)"
 "plasmanyl-TG(O-16:0_16:1_22:6)"
 "plasmanyl-TG(O-18:1_16:0_16:0)"
 "plasmanyl-TG(O-16:0_22:6_22:6)"
 "plasmanyl-TG(O-16:0_18:3_22:5)"
 "plasmanyl-TG(O-16:0_18:1_22:6)"
 "plasmanyl-TG(O-16:0_16:0_22:6)"
 "plasmanyl-TG(O-16:0_22:5_22:6)"
 "plasmanyl-TG(O-16:0_18:0_22:6)"
 "SM(d18:0_24:3)"
 "SM(d18:0+pO_16:0)"
 "SM(d20:0+pO_16:0)"

In [127]:
size(dfRef)

(962, 2)

To use all the lipids, especially the triglycerides, we need to rewrite their names in a more standardized form so that their properties can be retrieved.
- The *Ox* prefix means oxidized, as in *OxTG(16:0_20:5_20:3(OH))*. The *OH* indicates that it is a [TG hydroperoxide](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6550225/) [2].
- The *P-* and *O-* prefixes indicate that the lipid is a plasmalogen (plasmenyl) or a plasmanyl, respectively, as in *plasmenyl-TG(P-20:1_15:0_16:0)* and *plasmanyl-TG(O-20:0_18:0_18:4)* [3].
- The non-oxidized lipids that contain oxygen are described in [4].


In [128]:
dfRefOriginal = copy(dfRef);

In [171]:
dfRef = copy(dfRefOriginal);

In [172]:
dfRef.StandardizedName = copy(dfRef.Metabolite);

In [173]:
"""
    standardizename(df::DataFrame, colname::String, matchstring, newstring="")

Take a dataframe `df` that contains the columns `Metabolite` and
`StandardizedName`, and create the boolean column `colname` if it
does not exist yet. An entry of `colname` is set to `true` when the
metabolite name contains `matchstring`. In addition, `matchstring` is
replaced by `newstring` (default `""`) in the standardized name of
the matching rows.
"""
function standardizename(df::DataFrame, colname::String, matchstring, newstring="")
    # create the indicator column on first use
    if colname ∉ names(df)
        df[!, colname] = fill(false, nrow(df));
    end
    # rows whose original name contains the pattern
    # (use the argument `df`, not the global `dfRef`)
    idx = findall(occursin.(matchstring, df.Metabolite));
    df[idx, colname] .= true;
    # standardize name 
    df.StandardizedName[idx] .= replace.(df.StandardizedName[idx], matchstring=>newstring); 
    return df
end

standardizename

In [175]:
newcolname = ["ProteinBound", "Hexosyl", 
              "CeramideAP", "CeramideAS", "CeramideNS", "CeramideNP",   "CeramideNDS",
              "Oxidized", "OH", "O", "O₂", "O₃",
              "O₄", "Plasmanyl", "Plasmalogen", "O₂", "O₂",
              "O₃", "O₄", "O₆", "Plasmanyl", "Plasmalogen", 
              "CHO", "Ke", "Ke_OH", "O", 
              "IgnoreCol",  
             ] 
rmvstring = ["+pO_", "CerG1", 
             "_AP", "_AS", "_NS", "_NP", "_NDS",
             "Ox", "(OH)", "+O", "(OO)", "(OOO)",
             "(OOOO)", "O-", "P-", "+OO", "+2O", 
             "+3O", "+4O", "+6O", r"(?i)plasmanyl-", r"(?i)plasmenyl-", 
             "(CHO)", "(Ke)", "(Ke,OH)", "O", 
             "_",
            ]
newstring = vcat( "/", "HexCer", repeat([""], 24), "/")
for i in 1:length(newcolname)
    dfRef = standardizename(dfRef, newcolname[i], rmvstring[i], newstring[i]);
end    
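
To see what the rules do to a single name, the sketch below applies a subset of the pattern and replacement pairs from the lists above, in the same order, to three names from this dataset. `applyrules` is a hypothetical helper written for this illustration; it does not build the indicator columns.

```julia
# A few of the (pattern => replacement) rules from the lists above, same order.
# A tuple keeps the concrete type of each pair (String or Regex pattern).
rules = (
    "+pO_"            => "/",       # protein-bound ceramide
    "CerG1"           => "HexCer",  # hexosylceramide
    "Ox"              => "",        # oxidized
    "(OH)"            => "",
    "O-"              => "",        # plasmanyl
    r"(?i)plasmanyl-" => "",
    "_"               => "/",       # chain separator
)

# Hypothetical helper: apply the rules one after the other to a single name.
function applyrules(name::AbstractString, rules)
    for rule in rules
        name = replace(name, rule)
    end
    return name
end

applyrules("CerG1(d18:0+pO_18:0)", rules)            # "HexCer(d18:0/18:0)"
applyrules("OxTG(16:0_20:5_20:3(OH))", rules)        # "TG(16:0/20:5/20:3)"
applyrules("plasmanyl-TG(O-16:0_18:0_18:1)", rules)  # "TG(16:0/18:0/18:1)"
```

The order matters: "+pO_" has to be handled before the generic "_" rule, otherwise the separator would already have been rewritten.
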

In [176]:
dfClassification = fetch_properties(dfRef.StandardizedName);
insertcols!(dfClassification, 1, :Metabolite => dfRef.Metabolite);
first(dfClassification, 3)
idxmissing = findall(ismissing.(dfClassification.main_class))
dfRef.StandardizedName[idxmissing]

18-element Vector{String}:
 "DMPE(16:0/22:5)"
 "DMPE(16:0/20:5)"
 "DMPE(16:0/22:6)"
 "DMPE(16:0/18:1)"
 "GlcCer(d18:1/27:4)"
 "GlcCer(d18:1/23:1)"
 "GlcCer(d18:1/24:2)"
 "LdMePE(18:0)"
 "LdMePE(16:0)"
 "MMPE(17:1/22:5)"
 "AHFA(44:11)"
 "PC(22:5/16:1(CH))"
 "TG(16:0/16:0/6:0(CH))"
 "TG(18:0/18:4/20:4(H))"
 "TG(16:1/18:1/18:2(H))"
 "TG(18:1/18:1/18:2(H))"
 "TG(16:0/16:0/10:2(CH))"
 "SM(d18:0/24:3)"

- We need to replace "+pO_" or "+p_" by "/": the PO-type ceramides are protein-bound ceramides [7].
- The "CerG1" ceramides are hexosylceramides and need to be renamed "HexCer" [6].
- The "GlcCer" ceramides are glucosylceramides.
- The "OAHFA"s are (O-acyl) ω-hydroxy fatty acids.

At this stage, only *DMPE(16:0_22:6)* can be processed by [goslin](https://apps.lifs-tools.org/goslin/).
We will filter out *PMe(16:0/18:1)* (phosphatidylmethanol) [5] and *ZyE(22:5)*.


In [33]:
dfRef = leftjoin(dfRef, dfClassification, on = :Metabolite); size(dfRef)
# filter
deleteat!(dfRef, idxmissing[[2,3]]);

### Use GOSLIN

In [34]:
R"""
suppressMessages(library('rgoslin'))
suppressMessages(library('tidyverse'));
"""

RObject{StrSxp}
 [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
 [7] "tibble"    "ggplot2"   "tidyverse" "rgoslin"   "stats"     "graphics" 
[13] "grDevices" "utils"     "datasets"  "methods"   "base"     


In [35]:
@rput dfRef;

In [36]:
R"""
# check validity
dfRef$Valid <- suppressWarnings(sapply(dfRef$StandardizedName, isValidLipidName))
if (sum(dfRef$Valid) == dim(dfRef)[1]) {
    cat("All valid.")
} else {
    print("Check invalid names.")
}
""";

All valid.

In [37]:
dfRef.Total_C = zeros(Int, size(dfRef,1));
dfRef.Total_DB = zeros(Int, size(dfRef,1));
dfRef.Class = repeat(["NA"], size(dfRef,1));

In [38]:
for i in 1:size(dfRef, 1)
    @rput i;
    R"""
    #rsltGoslin <- as_tibble(parseLipidNames(dfRef$StandardizedName[i]))[, c("Original Name", "Total C", "Total DB", "Lipid Maps Main Class")];
    rsltGoslin <- as_tibble(parseLipidNames(dfRef$StandardizedName[i]))[, c("Original.Name", "Total.C", "Total.DB", "Lipid.Maps.Main.Class")];
    """
    @rget rsltGoslin
    # dfRef.Total_C[i] = parse(Int, rsltGoslin."Total C"[1])
    # dfRef.Total_DB[i] = parse(Int, rsltGoslin."Total DB"[1])
    # dfRef.Class[i] = rsltGoslin."Lipid Maps Main Class"[1]
        
    dfRef.Total_C[i] = rsltGoslin."Total_C"[1]
    dfRef.Total_DB[i] = rsltGoslin."Total_DB"[1]
    dfRef.Class[i] = rsltGoslin."Lipid_Maps_Main_Class"[1]
end
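
To make the meaning of `Total_C` and `Total_DB` concrete, here is a deliberately simplified sketch in plain Julia that sums the carbons and double bonds over the `C:DB` chain tokens of a name. It only illustrates the quantities; it is not how rgoslin parses names, and `chaintotals` is a hypothetical helper that ignores everything except the numeric chain tokens.

```julia
# Simplified illustration only: rgoslin handles far more notations than this regex.
# Hypothetical helper summing carbons and double bonds over "C:DB" chain tokens.
function chaintotals(name::AbstractString)
    totalC, totalDB = 0, 0
    for m in eachmatch(r"(\d+):(\d+)", name)
        totalC  += parse(Int, m.captures[1])
        totalDB += parse(Int, m.captures[2])
    end
    return (Total_C = totalC, Total_DB = totalDB)
end

chaintotals("TG(16:0/18:0/18:1)")   # (Total_C = 52, Total_DB = 1)
chaintotals("Cer(d18:1/24:1)")      # (Total_C = 42, Total_DB = 2)
```
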

In [39]:
names(dfRef)

22-element Vector{String}:
 "Metabolite"
 "MetaboliteID"
 "StandardizedName"
 "Oxidized"
 "OH"
 "O"
 "O₂"
 "O₃"
 "O₄"
 "Plasmanyl"
 "Plasmalogen"
 "CHO"
 "Ke"
 "exactmass"
 "formula"
 "main_class"
 "refmet_name"
 "sub_class"
 "super_class"
 "Total_C"
 "Total_DB"
 "Class"

#### Save metabolites reference dataset:

In [40]:
fileMetaboRef = joinpath(@__DIR__,"..","..","data","processed","refMeta.csv");
dfRef |> CSV.write(fileMetaboRef);

## Extract Metabolites dataset 

In [41]:
dfMetabo = fetch_data(ST);

In [42]:
# rename sample ID with the "ID_" prefix
vHeader = names(dfMetabo);
vHeader[2:end] .= "ID_".*vHeader[2:end];
rename!(dfMetabo, Symbol.(vHeader));
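
Because a notebook cell can be run more than once, it may help to make the prefixing step idempotent. A minimal sketch, assuming a hypothetical helper `addprefix` that leaves already-prefixed names untouched:

```julia
# Hypothetical helper: add the prefix only when it is not already there.
addprefix(name::AbstractString, prefix::AbstractString = "ID_") =
    startswith(name, prefix) ? String(name) : prefix * name

header = ["Metabolite", "8358", "8363"]
header[2:end] .= addprefix.(header[2:end])   # first pass adds the prefix
header[2:end] .= addprefix.(header[2:end])   # second pass changes nothing
# header is now ["Metabolite", "ID_8358", "ID_8363"]
```

The same helper could be applied to the `SampleID` column of the clinical table so that the sample names in both tables stay aligned.
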

In [43]:
first(dfMetabo, 5)

| Row | Metabolite | ID_8358 | ID_8363 | ID_8370 | ID_8371 | ID_8373 | ID_8376 |
|---|---|---|---|---|---|---|---|
| | String | String | String | String | String | String | String |
| 1 | CE(18:1) | 1005.004533 | 656.2196475 | 442.532242 | 835.7863174 | 694.9938797 | 1113.26766 |
| 2 | CE(18:2) | 120.5422615 | 81.57473155 | 59.37249758 | 95.26536816 | 111.2627008 | 104.0151326 |
| 3 | CE(18:3) | 207.5760181 | 103.0388756 | 91.39644092 | 147.2090585 | 134.3577342 | 163.0631881 |
| 4 | CE(18:4) | 185.1900835 | 114.8704705 | 102.3479244 | 214.2161794 | 105.0702351 | 210.1147451 |
| 5 | CE(20:1) | 27.12701689 | 2.961702936 | 4.237597992 | 31.33287865 | 7.637112491 | 17.99623889 |


Replace `Metabolite` name information with `MetaboliteID` values:

In [44]:
dfMetaboAll = leftjoin(select(dfRefOriginal, [:Metabolite, :MetaboliteID]), dfMetabo, on = [:Metabolite]);
select!(dfMetaboAll, Not([:Metabolite]));

Select only the samples that are present in the filtered clinical dataset, `dfIndividuals`:

In [45]:
select!(dfMetaboAll, vcat([:MetaboliteID], Symbol.(dfIndividuals.SampleID)));
size(dfMetaboAll)

(590, 52)

#### Save metabolites levels dataset:

In [46]:
fileMetabo = joinpath(@__DIR__,"..","..","data","processed","Metabo.csv");
dfMetaboAll = permutedims(dfMetaboAll, 1, :SampleID);
dfMetaboAll |> CSV.write(fileMetabo);
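
The call to `permutedims` turns the metabolite-by-sample table into a sample-by-metabolite table before saving. The sketch below shows the same reshaping on a plain matrix with toy values (not study data); with a dataframe, the first column additionally supplies the new column names.

```julia
# Toy data: 2 metabolites (rows) measured in 3 samples (columns).
metaboliteID = ["MT10001", "MT10002"]
sampleID     = ["ID_8358", "ID_8363", "ID_8370"]
levels       = [1005.0 656.2 442.5;
                 120.5  81.6  59.4]

# Transpose so that each row is a sample and each column a metabolite.
levelsBySample = permutedims(levels)
size(levelsBySample)   # (3, 2)
```
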

## References

[1] Koelmel, J. P., Ulmer, C. Z., Fogelson, S., Jones, C. M., Botha, H., Bangma, J. T., Guillette, T. C., Luus-Powell, W. J., Sara, J. R., Smit, W. J., Albert, K., Miller, H. A., Guillette, M. P., Olsen, B. C., Cochran, J. A., Garrett, T. J., Yost, R. A., & Bowden, J. A. (2019). Lipidomics for wildlife disease etiology and biomarker discovery: a case study of pansteatitis outbreak in South Africa. Metabolomics : Official journal of the Metabolomic Society, 15(3), 38. https://doi.org/10.1007/s11306-019-1490-9    

[2] Kato, S., Shimizu, N., Hanzawa, Y., Otoki, Y., Ito, J., Kimura, F., Takekoshi, S., Sakaino, M., Sano, T., Eitsuka, T., Miyazawa, T., & Nakagawa, K. (2018). Determination of triacylglycerol oxidation mechanisms in canola oil using liquid chromatography-tandem mass spectrometry. NPJ science of food, 2, 1. https://doi.org/10.1038/s41538-017-0009-x    

[3] Koelmel, J. P., Ulmer, C. Z., Jones, C. M., Yost, R. A., & Bowden, J. A. (2017). Common cases of improper lipid annotation using high-resolution tandem mass spectrometry data and corresponding limitations in biological interpretation. Biochimica et biophysica acta. Molecular and cell biology of lipids, 1862(8), 766–770. https://doi.org/10.1016/j.bbalip.2017.02.016    

[4] Riewe, D., Wiebach, J., & Altmann, T. (2017). Structure Annotation and Quantification of Wheat Seed Oxidized Lipids by High-Resolution LC-MS/MS. Plant physiology, 175(2), 600–618. https://doi.org/10.1104/pp.17.00470    

[5] Koelmel, J. P., Jones, C. M., Ulmer, C. Z., Garrett, T. J., Yost, R. A., Schock, T. B., & Bowden, J. A. (2018). Examining heat treatment for stabilization of the lipidome. Bioanalysis, 10(5), 291–305. https://doi.org/10.4155/bio-2017-0209 

[6] Chauhan, M. Z., Phillips, P. H., Chacko, J. G., Warner, D. B., Pelaez, D., & Bhattacharya, S. K. (2023). Temporal alterations of sphingolipids in optic nerves after indirect traumatic optic neuropathy. Ophthalmology Science, 3(1), 100217. https://doi.org/10.1016/j.xops.2022.100217

[7] Suzuki, M., Ohno, Y., & Kihara, A. (2022). Whole picture of human stratum corneum ceramides, including the chain-length diversity of long-chain bases. Journal of Lipid Research, 63(7), 100235. https://doi.org/10.1016/j.jlr.2022.100235


In [47]:
dfIndividuals = fetch_samples(ST);

# dfIndividuals = ifelse.(dfIndividuals .== "-", "0", dfIndividuals)
dfIndividuals = ifelse.(dfIndividuals .== "-", missing, dfIndividuals);

select!(dfIndividuals, Symbol.(["Sample ID",
                                "Group",
                                "Gender",
                                "Annuli",
                                "Age",
                                "WEIGHT (KG)",
                                "LENGTH (CM)",
]));

# filter complete cases
idxComplete = findall(completecases(dfIndividuals))
dfIndividuals = dfIndividuals[idxComplete, :]

insertcols!(dfIndividuals, 3, :GroupStatus => occursin.("D", dfIndividuals.Group));

# Categorical variables: one count table per variable
catVar = ["GroupStatus", "Gender"]
vDFcat = Vector{DataFrame}(undef, length(catVar));

for i in 1:length(catVar)
    dfCat = combine(groupby(dfIndividuals, [Symbol(catVar[i])]), nrow => :Count)
    sort!(dfCat, [Symbol(catVar[i])])
    dfCat = DataFrame(vcat([names(dfCat)[1] " "], Matrix(dfCat)), ["Clinical Features", "Count/ mean(SD)"])
    vDFcat[i] = dfCat
end

# Continuous variables: one "mean(SD)" row per variable
contVar = ["Annuli", "Age", "WEIGHT (KG)", "LENGTH (CM)"]
vDFcont = Vector{DataFrame}(undef, length(contVar));

for i in 1:length(contVar)
    # keep the non-missing values and convert them to numbers
    vVar = dfIndividuals[:, Symbol(contVar[i])];
    idxNotMiss = findall(.!ismissing.(vVar));
    vVar = parse.(Float32, string.(vVar[idxNotMiss]))

    # calculate mean
    myMean = vVar |> mean |> x->round(x, digits = 2)
    # calculate SD
    myStd = vVar |> std |> x->round(x, digits = 2)

    dfCont = DataFrame([contVar[i] string(myMean,"(", myStd, ")")], ["Clinical Features", "Count/ mean(SD)"])
    vDFcont[i] = dfCont
end
vDF = vcat(vDFcat, vDFcont)

dfDem = reduce(vcat, vDF)

| Row | Clinical Features | Count/ mean(SD) |
|---|---|---|
| | Any | Any |
| 1 | GroupStatus | |
| 2 | 0 | 21 |
| 3 | 1 | 30 |
| 4 | Gender | |
| 5 | F | 23 |
| 6 | M | 28 |
| 7 | Annuli | 7.76(2.96) |
| 8 | Age | 8.76(2.96) |
| 9 | WEIGHT (KG) | 1.64(0.31) |
| 10 | LENGTH (CM) | 41.98(2.61) |
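
The "mean(SD)" entries for the continuous variables can be reproduced with a small helper. This is a sketch with toy values, assuming a hypothetical function `meansd` and `Float64` parsing (the cell above uses `Float32`); it skips missing entries, converts the strings to numbers, and rounds both statistics to two digits.

```julia
using Statistics  # standard library: mean, std

# Hypothetical helper: summarise string-encoded measurements as "mean(SD)".
function meansd(values; digits = 2)
    x = parse.(Float64, string.(collect(skipmissing(values))))
    return string(round(mean(x); digits = digits), "(", round(std(x); digits = digits), ")")
end

weights = ["1.5", "1.4", missing, "1.9", "1.6", "1.6"]
meansd(weights)   # "1.6(0.19)"
```
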


In [48]:
findall(occursin.("D", dfIndividuals.Group)) 

30-element Vector{Int64}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
  ⋮
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38