# Machine learning challenges for genomics
### 2022 Amphora Health's Data Challenge Internship

The objective of the 2022 Amphora Health’s Data Challenge Internship is to explore the genomic information that exists in our cells and develop machine learning models that can predict the risk of developing or acquiring a given disease.

The goal of this challenge is to test the candidate’s ability to deliver a fully functional computational genomics product, which includes training a machine learning model, evaluating its accuracy, and testing it with new input from patients.

**Task 1.** Your first task is to merge all the files into a single table to construct a merged genotype file for several individuals. This table will become your training dataset.

**Task 2.** Read the ancestries file and extract the column called “Superpopulation Code”. Augment your merged file to include this new column. Each ancestry will be a target vector for the model. In other models, this target vector can be the presence or absence of disease (e.g. diabetes or not, cancer or not, etc.).

**Task 3.** Split your database into 80% for training and 20% for testing. You will have to do a 10-fold cross-validation on the training set.

**Task 4.** Train a machine learning model for a binary target (one per ancestry). For example, one model predicts whether or not a participant has African ancestry, another does the same for Asian ancestry, and a third for European ancestry.

**Task 5.** Evaluate the prediction performance of each model with the area under the ROC curve (AUC), using 10-fold cross-validation.
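Tasks 3 to 5 can be sketched with the standard library alone, which keeps the mechanics visible. This is only an illustration under stated assumptions, not the code used in the project notebooks: the function names, the toy labels, and the superpopulation codes `AFR`, `EUR` and `EAS` are all hypothetical, and the split is a plain shuffle without stratification.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out `test_fraction` of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def binary_target(labels, positive_code):
    """One-vs-rest target: 1 if the sample has `positive_code`, else 0."""
    return [1 if label == positive_code else 0 for label in labels]

def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy labels standing in for the "Superpopulation Code" column (hypothetical values).
labels = ["AFR", "EUR", "EAS", "AFR", "EUR"] * 20

train_idx, test_idx = train_test_split_indices(len(labels))  # 80% / 20%
folds = list(k_fold(train_idx, k=10))                         # 10 folds on the training set
y_afr = binary_target(labels, "AFR")                          # target for the "African ancestry or not" model
```

In each fold a model would be fitted on the `train` indices and `roc_auc` computed from its scores on the `validation` indices. The 20% test set stays untouched until the final check. With unbalanced ancestries, a stratified split is usually a safer choice than the plain shuffle shown here.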

This notebook contains the code needed to run all the Jupyter notebooks in the project, using the `papermill` package. Each notebook accomplishes a different task from the ones defined above. They are organized as follows:

- `src/main.ipynb`: This notebook contains the code to run all the notebooks in the project.
- `src/0_preprocessing.ipynb`: This notebook merges all the files into a single table to construct a merged genotype file for several individuals. This table is used to build the training datasets.
- `src/1_data-augmentation.ipynb`: This notebook reads the ancestries file, extracts the column called “Superpopulation Code”, and adds it to the merged file. Each ancestry is a target vector for the model. In other models, this target vector can be the presence or absence of disease (e.g. diabetes or not, cancer or not, etc.).
- `src/2_split-and-training.ipynb`: This notebook splits the dataset into 80% for training and 20% for testing, applies the final formatting to the data before training, and then trains and saves each of the models.
- `src/3_evaluate-model.ipynb`: This notebook handles the evaluation and plots the results for a better understanding of the performance of each model.
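The merge and augmentation steps handled by the first two notebooks can be illustrated with a small sketch. The layout of the input files is not described here, so everything below is an assumption: per-individual genotypes held as `{variant_id: genotype}` dictionaries, a hypothetical `Sample` column as the join key, and made-up variant IDs and codes.

```python
import csv
import io

def merge_genotypes(per_sample):
    """Build one row per individual with one column per variant.

    `per_sample` maps a sample ID to a {variant_id: genotype} dict.
    Variants missing for an individual are filled with None.
    """
    variants = sorted({v for calls in per_sample.values() for v in calls})
    return [
        {"Sample": sample, **{v: calls.get(v) for v in variants}}
        for sample, calls in sorted(per_sample.items())
    ]

def add_superpopulation(rows, ancestry_csv):
    """Attach the "Superpopulation Code" column, matching rows on sample ID."""
    reader = csv.DictReader(io.StringIO(ancestry_csv))
    code_by_sample = {r["Sample"]: r["Superpopulation Code"] for r in reader}
    return [
        {**row, "Superpopulation Code": code_by_sample.get(row["Sample"])}
        for row in rows
    ]

# Hypothetical toy data: genotypes coded as 0/1/2 alternate-allele counts.
per_sample = {
    "S1": {"rs1": 0, "rs2": 2},
    "S2": {"rs1": 1, "rs3": 1},
}
ancestry_csv = "Sample,Superpopulation Code\nS1,AFR\nS2,EUR\n"

table = add_superpopulation(merge_genotypes(per_sample), ancestry_csv)
```

Each row of `table` then holds one individual, one column per variant, and the ancestry label that later becomes the target. How missing genotypes should be handled (dropped or imputed) is a separate decision that this sketch leaves open.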

In [1]:
import papermill as pm
import glob
import numpy as np

In [2]:
# Collect the numbered pipeline notebooks in the current directory and sort them by name
pipeline_files = glob.glob('[0-9]*.ipynb')
pipeline_files = np.sort(pipeline_files)

In [3]:
pipeline_files

array(['0_preprocessing.ipynb', '1_data-augmentation.ipynb',
       '2_model-training.ipynb', '3_evaluate-model.ipynb'], dtype='<U25')

In [None]:
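import os
# papermill writes the executed copy here, so make sure the folder exists first
os.makedirs("tmp", exist_ok=True)
for i, node in enumerate(pipeline_files):
    print("--- Pipeline {}/{}".format(i+1, len(pipeline_files)))
    print("Running notebook: ", node)
    # Each run overwrites the same temporary output notebook
    pm.execute_notebook(node, "tmp/papermill_tmp.ipynb")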
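
Sorting the file names as strings gives the right execution order while every prefix is a single digit. If the pipeline ever grew past ten notebooks, `10_...` would sort before `2_...`. A minimal sketch of ordering by the numeric prefix instead is shown below. The helper names and the `10_report.ipynb` file are hypothetical and are not part of this project.

```python
import re

def numeric_prefix(filename):
    """Return the leading integer of a name like '10_report.ipynb'."""
    match = re.match(r"(\d+)", filename)
    if match is None:
        raise ValueError(f"no numeric prefix in {filename!r}")
    return int(match.group(1))

def order_pipeline(filenames):
    """Sort notebooks by numeric prefix, then by name to break ties."""
    return sorted(filenames, key=lambda name: (numeric_prefix(name), name))

# Hypothetical file names, used only to show the difference in ordering.
names = ["10_report.ipynb", "2_model-training.ipynb", "0_preprocessing.ipynb"]
ordered = order_pipeline(names)
```

Here `ordered` keeps `10_report.ipynb` last, whereas a plain `sorted(names)` would place it second.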