<div style="max-width:1200px"><img src="../_resources/mgnify_banner.png" width="100%"></div>

<img src="../_resources/mgnify_logo.png" width="200px">

# **CODARFE**, a powerful tool designed for *sparse compositional microbiome-predictors selection and prediction of continuous environmental factors*.

Please cite our related article (given at the end of this notebook) when using this tool in your research or publications. Proper citation helps support ongoing development and maintenance.

For any questions or further information, please refer to the documentation or contact murilobarbosa@alunos.utfpr.edu.br & paschoal@utfpr.edu.br.

CODARFE is also available in 4 other formats: https://github.com/alerpaschoal/CODARFE/

If CODARFE is unable to generalize to your data, visit the GitHub repository for suggestions on how to improve the results.

This is an interactive code notebook (a Jupyter Notebook).
To run this code, click into each cell and press the ▶ button in the top toolbar, or press `shift+enter`.

---

<div style="max-width:1200px"><img src="../_resources/flow-codarfe-mgnify.png" width="100%"></div>

In [None]:
import pandas as pd
import numpy as np
import sys
sys.path.append("./lib/")
from Collect_Data_and_Metadata_MGnify_V2 import get_data_and_metadata_from_project # This lib helps us to download the projects directly from MGnify
from CODARFE import CODARFE

# To begin, use an accession number to obtain the 16S table and the metadata associated with a project.

In [None]:
accession = "MGYS00003750" # Accession number here
data,meta,version = get_data_and_metadata_from_project(accession)

In [None]:
data.head()

# Select the metadata of interest

In [None]:
meta.head()

In [None]:
#Check the distribution

target = "temperature" # select the metadata of your interest

print(f"Min:  {meta[target].min()}")
print(f"Mean: {meta[target].mean()}")
print(f"Max:  {meta[target].max()}")
print(f"Std:  {meta[target].std()}")
print(f"Total number of columns: {len(data.columns)}")
print(f"Total number of samples with target: {meta[target].count()}")
meta[target].hist()

# Create the CODARFE instance

In [None]:
# Split between testing and training to simulate a prediction

from sklearn.model_selection import train_test_split
indx = [sample_id for sample_id in meta.index if sample_id in data.index]  # Keep only the samples present in both tables
data = data.loc[indx]
meta = meta.loc[indx]
X_train, X_test, y_train, y_test = train_test_split(data, meta, test_size=0.3, random_state=42) # split 70% train 30% test
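
The index-matching step above can be illustrated on a toy example. This is a minimal sketch with hypothetical sample IDs and values (`toy_data`, `toy_meta`, and `shared` are illustrative names, not part of CODARFE); it shows that only samples present in both tables are kept, in the order of the metadata index.

```python
import pandas as pd

# Hypothetical toy tables: "s2" has no metadata, and the row order differs.
toy_data = pd.DataFrame({"taxon_a": [1, 2, 3]}, index=["s1", "s2", "s3"])
toy_meta = pd.DataFrame({"temperature": [20.5, 18.0]}, index=["s3", "s1"])

# Keep only the sample IDs found in both tables, following the metadata order.
shared = [sample_id for sample_id in toy_meta.index if sample_id in toy_data.index]
toy_data_aligned = toy_data.loc[shared]
toy_meta_aligned = toy_meta.loc[shared]

print(toy_data_aligned)
print(toy_meta_aligned)
```

After this step, both tables have the same rows in the same order, which is what `train_test_split` needs to keep each sample paired with its metadata.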

In [None]:
# Create the instance

codarfe = CODARFE(data=X_train, # The data
                  metaData = y_train, # The metadata
                  metaData_Target= target) # The name of the column of interest

# Create the model

In [None]:
# The default parameters usually do not need to be changed

codarfe.CreateModel(write_results = False, # Flag that defines whether the results will be written to disk
                    path_out = '', # Path where the results will be written
                    name_append = '', # Name to append to the end of the result files, e.g. name_append = "temperature", so you can save multiple results in the same folder
                    rLowVar = True, # Flag to remove low-variance columns from the dataset; set to False to test ALL the columns (this may slow the process)
                    applyAbunRel = True, # Flag to apply the relative abundance transformation; set to False when the data is already transformed (not raw reads)
                    allow_transform_high_variation = True, # Flag to allow the target to be transformed if it has a high variance
                    percentage_cols_2_remove = 1, # Percentage of columns to remove per iteration
                    n_Kfold_CV = 10, # Number of folds in the CV step inside the RFE
                    weightR2 = 1.0, # Controls how closely the model fits the training data
                    weightProbF = 0.5, # Controls the statistical significance
                    weightBIC = 1.0, # Controls the number of predictors selected
                    weightRMSE = 1.5, # Controls the over-fitting
                    n_max_iter_huber = 100) # Controls the adjustment of the Huber regression function inside the RFE
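
For readers unfamiliar with the relative abundance transformation that `applyAbunRel` refers to, the sketch below shows what such a transformation generally does: each sample's counts are divided by that sample's total. This is an illustration on hypothetical toy counts, not CODARFE's internal code, and `counts` and `rel_abundance` are illustrative names.

```python
import numpy as np

# Hypothetical raw read counts: rows are samples, columns are taxa.
counts = np.array([[10, 30, 60],
                   [5, 5, 90]], dtype=float)

# Divide each row by its total so that every sample sums to 1.
rel_abundance = counts / counts.sum(axis=1, keepdims=True)

print(rel_abundance)
```

This is why the flag should be set to `False` when the table is already transformed: applying the transformation is only meaningful on raw reads.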

# Save the instance for future predictions

In [None]:
# The instance is saved as a .foda file
codarfe.Save_Instance(path_out = "<path/to/folder>",
                      name_append = "<name_of_the_model>")

# See the selected taxa

In [None]:
codarfe.selected_taxa

# Check the error

In [None]:
codarfe.Plot_HoldOut_Validation(n_repetitions = 100, # Number of dots in the image
                                test_size=20, # The size of the split for train/test
                                saveImg=False,
                                path_out='',
                                name_append=''
                                )
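
The general idea of a repeated hold-out validation can be sketched as follows. This is an assumption-laden illustration, not CODARFE's implementation: it uses toy data, ordinary least squares as a stand-in model, and a hypothetical helper `repeated_holdout_mae` whose `test_fraction` argument is expressed as a fraction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 samples, 3 predictors, linear target plus noise.
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

def repeated_holdout_mae(X, y, n_repetitions=100, test_fraction=0.2, rng=rng):
    """Return one test-set MAE per random train/test split."""
    n_test = max(1, int(round(len(y) * test_fraction)))
    maes = []
    for _ in range(n_repetitions):
        order = rng.permutation(len(y))
        test_idx, train_idx = order[:n_test], order[n_test:]
        # Fit ordinary least squares with an intercept column (stand-in model).
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Evaluate on the held-out samples only.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        maes.append(np.mean(np.abs(A_test @ coef - y[test_idx])))
    return np.array(maes)

holdout_maes = repeated_holdout_mae(X, y)
print(f"Mean MAE over {len(holdout_maes)} splits: {holdout_maes.mean():.3f}")
```

Each repetition yields one error value, so the spread of these values indicates how sensitive the model is to the particular samples held out.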

# Check the correlation and strength of each selected feature

In [None]:
codarfe.Plot_Relevant_Predictors(n_max_features=100, # Maximum amount of predictors to display
                                 saveImg=False,
                                 path_out = "",
                                 name_append= "")

# Create the Heat map with the CLR-transformation of the selected taxa

In [None]:
codarfe.Plot_Heatmap(saveImg=False,
                     path_out="",
                     name_append="")
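
As background for the heat map, the centered log-ratio (CLR) transformation can be sketched as below. This is a minimal illustration, not CODARFE's internal code: the function name `clr` and the pseudocount of 1 used to avoid `log(0)` are assumptions, and CODARFE may handle zeros differently.

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform applied to each row (sample)."""
    x = np.asarray(counts, dtype=float) + pseudocount  # assumed zero handling
    log_x = np.log(x)
    # Subtracting the row mean of the logs equals dividing by the geometric mean.
    return log_x - log_x.mean(axis=1, keepdims=True)

# Hypothetical toy counts: rows are samples, columns are taxa.
toy = np.array([[0, 9, 99],
                [4, 4, 4]])
clr_values = clr(toy)
print(clr_values)
```

By construction each transformed row sums to zero, and a sample with identical counts for every taxon maps to all zeros.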

# Predict the new data

In [None]:
pred,tot_not_found = codarfe.Predict(new = X_test,
                                     applyAbunRel = True,
                                     writeResults = False,
                                     path_out = '',
                                     name_append = ''
                                     )
print("Total of taxa not found by the imputation method: ",tot_not_found)

# Verify the prediction's true error

In [None]:
real = y_test[target]
import matplotlib.pyplot as plt

plt.figure()

plt.scatter(real,pred) # Plot the dots
plt.plot(real,real,color="orange") # plot the perfect line
mae = np.mean([abs(r - p) for r, p in zip(real, pred) if not np.isnan(r)])  # Calculate the Mean Absolute Error, skipping samples without a real value
plt.text(x = np.min(real),y=np.max(pred),s="MAE: "+str(round(mae,2)),fontweight="bold")
plt.xlabel("Real")
plt.ylabel("Prediction")
plt.title("Prediction x Real",fontweight="bold")
plt.show()
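
The Mean Absolute Error used in the plot can also be computed with a small vectorized helper. This is a sketch on hypothetical toy values; `mean_absolute_error`, `toy_real`, and `toy_pred` are illustrative names, and skipping pairs where either value is missing is an assumption that goes slightly beyond the cell above, which only checks the real value.

```python
import numpy as np

def mean_absolute_error(real, pred):
    """Mean absolute difference over pairs where both values are present."""
    real = np.asarray(real, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mask = ~np.isnan(real) & ~np.isnan(pred)  # ignore pairs with a missing value
    return float(np.mean(np.abs(real[mask] - pred[mask])))

# Hypothetical values: the third sample has no real measurement.
toy_real = [20.0, 22.0, np.nan, 25.0]
toy_pred = [21.0, 20.0, 23.0, 25.0]
toy_mae = mean_absolute_error(toy_real, toy_pred)
print(f"MAE: {toy_mae:.2f}")
```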

# Citation Requirement for Using CODARFE

Thank you for using **CODARFE**. If you find this tool useful for your research or any other work, we kindly request that you cite our article in your publications and presentations. Proper citation is essential for acknowledging the effort involved in developing and maintaining this tool.

**Citation Information:**

*Article Title:* **CODARFE: Unlocking the prediction of continuous environmental variables based on microbiome**  

*Authors:* Murilo Caminotto Barbosa \*, Joao Fernando Marques da Silva, Leonardo Cardoso Alves, Robert D. Finn, Alexandre R Paschoal \*

*DOI:* https://doi.org/10.1101/2024.07.18.604052

Corresponding authors' e-mail addresses: murilobarbosa@alunos.utfpr.edu.br & paschoal@utfpr.edu.br

BibTeX:

~~~
@article{barbosa2024codarfe,
  title={CODARFE: Unlocking the prediction of continuous environmental variables based on microbiome},
  author={Barbosa, Murilo Caminotto and da Silva, Joao Fernando Marques and Alves, Leonardo Cardoso and Finn, Robert D and Paschoal, Alexandre R},
  journal={bioRxiv},
  pages={2024--07},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
~~~

**Important Note:**

By using **CODARFE**, you agree to cite the above-mentioned article in any publication, presentation, or other work that makes use of this tool. Proper citation helps us to track the impact and usage of our work, and supports the continued development and maintenance of the tool.

Thank you for your cooperation and support.
