# Machine learning challenges for genomics
### 2022 Amphora Health's Data Challenge Internship

The objective of the 2022 Amphora Health’s Data Challenge Internship is to explore the genomic information that exists in our cells and develop machine learning models that can predict the risk of developing or acquiring a given disease.

The goal of this challenge is to test the candidate’s ability to deliver a fully functional computational genomics product, which includes training a machine learning model, evaluating its accuracy, and testing it with new input from patients.

This data challenge is an integral part of Amphora Health's recruitment process and is designed to identify the top candidates. Thank you for taking the time to participate and for your interest in Amphora Health. We hope that by the end of this challenge you will also have learned something new.

## 1. Instructions
**Data input.** You will be provided with a single ZIP file (ah-challenge-internship2022.zip) consisting of 2,504 CSV files, and a coordination TSV file listing all patient IDs and their calculated genetic ancestry (e.g. European, Asian, African).

Inside each patient file, you will see three columns:

1. The first column gives the chromosome number of a DNA location, followed by a colon and the specific position within that chromosome (e.g. 21:65214 means chromosome 21, position 65214). Remember that the human genome consists of 23 pairs of chromosomes, and each chromosome has a different length.

2. The second column shows the nucleotide that is expected at this DNA location. Remember that a nucleotide is one of the molecules that build the strands of the DNA double helix. These molecules are adenine (A), guanine (G), cytosine (C), and thymine (T). A DNA strand is represented as a sequence of these letters.

3. The third column is a binary label representing the presence (1) or absence (0) of the expected nucleotide in this location.
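
The three-column layout described above could be parsed along the following lines. This is a minimal sketch that assumes comma-separated rows with no header line; `parse_locus`, `read_patient_rows`, and the sample rows are hypothetical names and values made up for illustration, and the real files may differ.

```python
import csv
import io


def parse_locus(locus):
    """Split a 'chromosome:position' string such as '21:65214'."""
    chromosome, position = locus.split(":")
    # Keep the chromosome as text so non-numeric names would also work.
    return chromosome, int(position)


def read_patient_rows(handle):
    """Yield (chromosome, position, expected_nucleotide, present) tuples."""
    for locus, nucleotide, label in csv.reader(handle):
        chromosome, position = parse_locus(locus)
        yield chromosome, position, nucleotide, int(label)


# Made-up sample content standing in for one patient file.
sample = io.StringIO("21:65214,A,1\n21:65300,G,0\n")
rows = list(read_patient_rows(sample))
```

If the real files carry a header row or use a different delimiter, the `csv.reader` call would need to be adjusted accordingly.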

**Task 1.** Your first task is to merge all the files into a single table, constructing a merged genotype file for several individuals. This will become your training dataset.
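
One possible shape for the merged table is one row per patient and one column per DNA location, holding the binary label. The sketch below uses pandas and two in-memory stand-ins for patient files. The helper `merge_genotypes`, the patient IDs, and the column names `locus`, `expected`, and `present` are assumptions for illustration, and the sketch presumes each location appears at most once per file.

```python
import io

import pandas as pd


def merge_genotypes(patient_files):
    """Build a patients x loci table from a {patient_id: path or file-like} mapping."""
    columns = {}
    for patient_id, source in patient_files.items():
        frame = pd.read_csv(
            source, header=None, names=["locus", "expected", "present"]
        )
        # Keep only the binary label, indexed by DNA location.
        columns[patient_id] = frame.set_index("locus")["present"]
    # Series are aligned on locus; transpose so that each row is one patient.
    return pd.DataFrame(columns).T


# Made-up stand-ins for two patient files.
demo_files = {
    "P1": io.StringIO("21:65214,A,1\n21:65300,G,0\n"),
    "P2": io.StringIO("21:65214,A,0\n21:65300,G,1\n"),
}
genotypes = merge_genotypes(demo_files)
```

With real data, the mapping would typically be built from the CSV file names, and locations missing from a patient's file would show up as `NaN` and need an explicit fill policy.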

**Task 2.** Read the ancestries file and extract the column called “Superpopulation Code”. Augment your merged file to include this new column. This ancestry column will be your target vector for the model. In other models, the target vector could be the presence or absence of a disease (e.g. diabetes or not, cancer or not).
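
Attaching the ancestry column can be sketched as an index join in pandas. Only the column name “Superpopulation Code” comes from the task description. The ID column name `Sample name`, the patient IDs, and the ancestry codes below are placeholders and should be replaced with whatever the provided TSV actually contains.

```python
import io

import pandas as pd

# Small stand-in for the merged genotype table (rows: patients, columns: loci).
genotypes = pd.DataFrame(
    {"21:65214": [1, 0], "21:65300": [0, 1]}, index=["P1", "P2"]
)

# Made-up ancestry table; "Sample name" is an assumed ID column name.
ancestry_tsv = io.StringIO(
    "Sample name\tSuperpopulation Code\nP1\tAFR\nP2\tEUR\n"
)
ancestries = pd.read_csv(ancestry_tsv, sep="\t", index_col="Sample name")

# An inner join keeps only patients present in both tables.
dataset = genotypes.join(ancestries["Superpopulation Code"], how="inner")
```

An inner join silently drops patients without an ancestry entry, so comparing `len(dataset)` with `len(genotypes)` is a cheap sanity check.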

**Task 3.** Split your dataset into 80% for training and 20% for testing. Perform 10-fold cross-validation on the training set.
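
A minimal sketch of the split, assuming scikit-learn is available. The data here is synthetic and the class labels are placeholders. Stratifying the split and shuffling the folds are choices made for this example, not requirements stated in the task.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))       # synthetic binary genotype matrix
y = np.array(["A", "B", "C", "D"] * 50)      # placeholder ancestry labels

# 80% training, 20% testing, keeping class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 10 folds defined on the training portion only.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in folds.split(X_train, y_train)]
```

The held-out 20% is never seen by the folds, so it remains available for a final check once model selection is finished.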

**Task 4.** Train one machine learning model per ancestry, each with a binary target. For example, one model predicts whether or not a participant has African ancestry, another does the same for Asian ancestry, and a third for European ancestry.
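
The one-model-per-ancestry setup can be sketched by turning the label column into one binary vector per code. Logistic regression is used here purely as an example classifier, since the task does not prescribe a model, and the features and labels are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
labels = np.array(["AFR", "EAS", "EUR"] * 40)          # placeholder codes
X = rng.integers(0, 2, size=(120, 6)).astype(float)    # synthetic features

# Make the first three columns informative, one per placeholder ancestry.
for col, code in enumerate(["AFR", "EAS", "EUR"]):
    X[:, col] = labels == code

models = {}
for code in np.unique(labels):
    # Binary target: 1 if the participant has this ancestry, 0 otherwise.
    y_binary = (labels == code).astype(int)
    models[str(code)] = LogisticRegression(max_iter=1000).fit(X, y_binary)
```

Each entry in `models` answers a single yes/no question, so `models["AFR"].predict_proba(X)[:, 1]` would give a per-participant score for that ancestry alone.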

**Task 5.** Evaluate each model's predictive performance using the area under the curve (AUC), estimated with 10-fold cross-validation.
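
Cross-validated AUC for a single binary model can be sketched with scikit-learn's `cross_val_score` and the `roc_auc` scorer. This assumes AUC refers to the area under the ROC curve. The data is synthetic and the classifier is again only an example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
y = np.array([0, 1] * 100)          # synthetic binary target for one ancestry
X = rng.normal(size=(200, 5))
X[:, 0] += 2.0 * y                  # make one feature informative

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
mean_auc = auc_scores.mean()
```

Reporting the spread of `auc_scores` alongside `mean_auc` shows how stable the estimate is across folds. The same loop would be repeated for each ancestry model.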

In [1]:
import papermill as pm
import glob
import numpy as np

In [2]:
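# Collect the numbered pipeline notebooks (0_..., 1_..., ...) in the current directory.
pipeline_files = glob.glob('[0-9]*.ipynb')
# Lexicographic sort; matches the intended run order while the prefixes are single digits.
pipeline_files = np.sort(pipeline_files)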

In [3]:
pipeline_files

array(['0_preprocessing.ipynb', '1_data-augmentation.ipynb',
       '2_model-training.ipynb', '3_evaluate-model.ipynb'], dtype='<U25')

In [None]:
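import os
os.makedirs("tmp", exist_ok=True)  # make sure the output directory exists
for i, node in enumerate(pipeline_files, start=1):
    print("--- Pipeline {}/{}".format(i, len(pipeline_files)))
    print("Running notebook:", node)
    # Each run overwrites the same scratch output notebook.
    pm.execute_notebook(node, "tmp/papermill_tmp.ipynb")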