# Task 3. Splitting database into training and testing sets.
Split your database into 80% for training and 20% for testing. You will have to do a 10-fold cross-validation on the training set.

---

We import the required libraries and load the files needed to process the data. This notebook uses the dataset produced by the second merging approach from the previous notebook. To better understand the data structure, we print the unique values of the `Superpopulation name` and `Superpopulation code` columns.

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from utils import *

In [2]:
master_df = pd.read_csv(parameters['master_augmented_2'], sep=',')
master_df.head()

Unnamed: 0,patient,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,...,22:49363742,22:49578486,22:49651708,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664,superpopulation_code
0,NA19625,0,0,1,1,1,0,0,1,1,...,0,0,1,1,0,1,0,1,0,AFR
1,NA19835,0,0,1,1,1,0,0,0,0,...,0,0,0,1,1,1,0,1,0,AFR
2,NA19900,0,0,1,0,1,0,0,0,0,...,1,0,0,1,1,1,0,0,0,AFR
3,NA19917,0,0,1,0,1,0,0,0,0,...,0,0,0,1,0,1,0,1,0,AFR
4,NA19703,0,1,1,0,1,0,0,1,1,...,0,0,1,1,0,1,0,0,0,AFR


In [3]:
tsv = pd.read_csv(parameters['tsv_file'], sep='\t')
tsv['Superpopulation name'].unique()

array(['African Ancestry', 'South Asian Ancestry',
       'South Asia (SGDP),South Asian Ancestry', 'European Ancestry',
       'European Ancestry,West Eurasia (SGDP)', 'American Ancestry',
       'East Asian Ancestry', 'African Ancestry,Africa (SGDP)',
       'East Asia (SGDP),East Asian Ancestry'], dtype=object)

Our project will train three different models:

- African ancestry (AFR)
- European ancestry (EUR)
- Asian ancestry (SAS and EAS)

We will not train a model for the American ancestry (AMR). However, AMR individuals stay in the dataset as negative examples for the other three models, which expands the training data and includes a more diverse set of individuals.

In [4]:
tsv['Superpopulation code'].unique()

array(['AFR', 'SAS', 'EUR', 'AMR', 'EAS'], dtype=object)

We drop the `patient` column since it is not used in training and we know that each row represents a different individual. We also generate 5 new columns by applying the `pd.get_dummies()` function to the `superpopulation_code` column. These new columns will be the target variables for each model.
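As a toy illustration of the encoding described above, the sketch below builds a five-row dataframe (hypothetical values, not the real dataset), one-hot encodes the superpopulation code, and derives a combined SAS/EAS target. The `snp_1` column and the `toy` names are assumptions made only for this example.

```python
import pandas as pd

# Toy data: one individual per superpopulation (hypothetical values)
toy = pd.DataFrame({
    "snp_1": [0, 1, 1, 0, 1],
    "superpopulation_code": ["AFR", "AMR", "EAS", "EUR", "SAS"],
})

# One indicator column per code; dtype=int keeps 0/1 values across pandas versions
toy_dummies = pd.get_dummies(toy, columns=["superpopulation_code"], dtype=int)

# Combined Asian target: 1 if the individual is SAS or EAS, 0 otherwise
asian_cols = ["superpopulation_code_SAS", "superpopulation_code_EAS"]
toy_dummies["superpopulation_code_SAS_EAS"] = (
    toy_dummies[asian_cols].any(axis=1).astype(int)
)

print(toy_dummies)
```

In this toy frame the AMR individual has a 0 in the AFR, EUR and SAS_EAS targets, so it acts as a negative example for all three models.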

In [5]:
# Drop 'patient' column 
master_df.drop(columns=['patient'], inplace=True)
master_df.head()

Unnamed: 0,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,5:1721485,...,22:49363742,22:49578486,22:49651708,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664,superpopulation_code
0,0,0,1,1,1,0,0,1,1,1,...,0,0,1,1,0,1,0,1,0,AFR
1,0,0,1,1,1,0,0,0,0,1,...,0,0,0,1,1,1,0,1,0,AFR
2,0,0,1,0,1,0,0,0,0,0,...,1,0,0,1,1,1,0,0,0,AFR
3,0,0,1,0,1,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,AFR
4,0,1,1,0,1,0,0,1,1,1,...,0,0,1,1,0,1,0,0,0,AFR


In [6]:
# Get the dummies for the superpopulation code column
master_with_dummies = pd.get_dummies(master_df, columns=['superpopulation_code'])
master_with_dummies.head()

Unnamed: 0,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,5:1721485,...,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664,superpopulation_code_AFR,superpopulation_code_AMR,superpopulation_code_EAS,superpopulation_code_EUR,superpopulation_code_SAS
0,0,0,1,1,1,0,0,1,1,1,...,0,1,0,1,0,1,0,0,0,0
1,0,0,1,1,1,0,0,0,0,1,...,1,1,0,1,0,1,0,0,0,0
2,0,0,1,0,1,0,0,0,0,0,...,1,1,0,0,0,1,0,0,0,0
3,0,0,1,0,1,0,0,0,0,0,...,0,1,0,1,0,1,0,0,0,0
4,0,1,1,0,1,0,0,1,1,1,...,0,1,0,0,0,1,0,0,0,0


In [7]:
# Combine the SAS and EAS columns: 1 if either SAS or EAS is 1, otherwise 0
asian_columns = ['superpopulation_code_SAS', 'superpopulation_code_EAS']
# Vectorized row-wise "or" instead of looping over every row
master_with_dummies['superpopulation_code_SAS_EAS'] = master_with_dummies[asian_columns].any(axis=1).astype(int)

In [8]:
# Drop the columns for SAS and EAS
master_with_dummies.drop(columns=['superpopulation_code_SAS', 'superpopulation_code_EAS'], inplace=True)
# Convert to int for the SAS_EAS column
master_with_dummies['superpopulation_code_SAS_EAS'] = master_with_dummies['superpopulation_code_SAS_EAS'].astype(int)
master_with_dummies.head()

Unnamed: 0,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,5:1721485,...,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664,superpopulation_code_AFR,superpopulation_code_AMR,superpopulation_code_EUR,superpopulation_code_SAS_EAS
0,0,0,1,1,1,0,0,1,1,1,...,1,0,1,0,1,0,1,0,0,0
1,0,0,1,1,1,0,0,0,0,1,...,1,1,1,0,1,0,1,0,0,0
2,0,0,1,0,1,0,0,0,0,0,...,1,1,1,0,0,0,1,0,0,0
3,0,0,1,0,1,0,0,0,0,0,...,1,0,1,0,1,0,1,0,0,0
4,0,1,1,0,1,0,0,1,1,1,...,1,0,1,0,0,0,1,0,0,0


In the next cell we create three new dataframes, one per model. Each keeps all the feature columns and only the target column of its own superpopulation code.

In [9]:
# Dataframe for AFR where the other superpopulation columns are dropped
afr_df = master_with_dummies.drop(columns=['superpopulation_code_AMR', 'superpopulation_code_EUR', 'superpopulation_code_SAS_EAS'])

# Dataframe for EUR where the other superpopulation columns are dropped
eur_df = master_with_dummies.drop(columns=['superpopulation_code_AMR', 'superpopulation_code_AFR', 'superpopulation_code_SAS_EAS'])

# Dataframe for SAS_EAS where the other superpopulation columns are dropped
sas_eas_df = master_with_dummies.drop(columns=['superpopulation_code_AMR', 'superpopulation_code_AFR', 'superpopulation_code_EUR'])

# Print the shapes of the dataframes
print(f"""
    AFR: {afr_df.shape}
    EUR: {eur_df.shape}
    SAS_EAS: {sas_eas_df.shape}
""")


    AFR: (2504, 10029)
    EUR: (2504, 10029)
    SAS_EAS: (2504, 10029)



In [10]:
# Save the dataframes to csv files
afr_df.to_csv(parameters['afr_df'], sep=',', index=False)
eur_df.to_csv(parameters['eur_df'], sep=',', index=False)
sas_eas_df.to_csv(parameters['sas_eas_df'], sep=',', index=False)

## Data splitting

Using the generated dataframes, we split the data into training and testing sets with the `train_test_split` function from the `sklearn.model_selection` module, keeping 80% for training and 20% for testing. Because the same split is repeated for every model, it is wrapped in the `split_dataframe` function defined in `utils.py`.
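The body of `split_dataframe` is not shown in this notebook. A minimal sketch of what such a helper could look like is given below. The name `split_dataframe_sketch`, the `random_state` value, and the use of `stratify` are assumptions for illustration, and the real helper in `utils.py` may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataframe_sketch(df, target_column, test_size=0.2, random_state=42):
    """Hypothetical stand-in for utils.split_dataframe: 80/20 split."""
    X = df.drop(columns=[target_column])  # feature columns
    y = df[target_column]                 # binary target
    # stratify=y keeps the class ratio similar in both subsets
    # (an assumption; the original helper may not stratify)
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )


# Toy example: 10 individuals, 2 features, balanced binary target
toy = pd.DataFrame({
    "snp_1": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "snp_2": [1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    "target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})
X_train, X_test, y_train, y_test = split_dataframe_sketch(toy, "target")
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

With a fractional `test_size`, scikit-learn rounds the test set size up, which is consistent with the 2504 rows of this dataset being split into 2003 training and 501 testing rows.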

### AFR model splitting

In [11]:
X_afr_train, X_afr_test, y_afr_train, y_afr_test = split_dataframe(afr_df, 'superpopulation_code_AFR')
print(f"""
    X_afr_train: {X_afr_train.shape}, X_afr_test: {X_afr_test.shape}\n
    y_afr_train: {y_afr_train.shape}, y_afr_test: {y_afr_test.shape}
    """)


    X_afr_train: (2003, 10028), X_afr_test: (501, 10028)

    y_afr_train: (2003,), y_afr_test: (501,)
    


### EUR model splitting

In [12]:
X_eur_train, X_eur_test, y_eur_train, y_eur_test = split_dataframe(eur_df, 'superpopulation_code_EUR')
print(f"""
    X_eur_train: {X_eur_train.shape}, X_eur_test: {X_eur_test.shape}\n
    y_eur_train: {y_eur_train.shape}, y_eur_test: {y_eur_test.shape}
    """)


    X_eur_train: (2003, 10028), X_eur_test: (501, 10028)

    y_eur_train: (2003,), y_eur_test: (501,)
    


### SAS and EAS model splitting

In [13]:
X_sas_eas_train, X_sas_eas_test, y_sas_eas_train, y_sas_eas_test = split_dataframe(sas_eas_df, 'superpopulation_code_SAS_EAS')
print(f"""
    X_sas_eas_train: {X_sas_eas_train.shape}, X_sas_eas_test: {X_sas_eas_test.shape}\n
    y_sas_eas_train: {y_sas_eas_train.shape}, y_sas_eas_test: {y_sas_eas_test.shape}
    """)


    X_sas_eas_train: (2003, 10028), X_sas_eas_test: (501, 10028)

    y_sas_eas_train: (2003,), y_sas_eas_test: (501,)
    


---

# Task 4. Training a model for each ancestry

Train a machine learning model for a binary target (one per ancestry). For example, one model predicts whether a participant has African ancestry or not, another does the same for Asian ancestry, and a third for European ancestry.

The function `generate_model` comes from `utils.py`, where we create a `DecisionTreeClassifier()` model from the `sklearn.tree` module. We use the `cross_val_score` function from the `sklearn.model_selection` module to get a first estimate of the model performance.

I selected the `DecisionTreeClassifier()` model because it is a simple and fast model suitable for classification problems. I set `max_depth` to 20 to limit overfitting and computing time.
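Since `generate_model` lives in `utils.py` and its body is not shown here, the following is only a sketch of how such a helper might combine the pieces described above: a `DecisionTreeClassifier` with `max_depth=20`, 10-fold cross-validation on the training set, and saving the fitted model. The function name `generate_model_sketch`, the `random_state`, the `scoring` choice, and the use of `pickle` (suggested only by the `.pkl` extension of the saved files) are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def generate_model_sketch(X_train, y_train, model_path, max_depth=20, cv=10):
    """Hypothetical stand-in for utils.generate_model."""
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    # 10-fold cross-validation on the training set only
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
    # Refit on the full training set and persist the fitted model
    model.fit(X_train, y_train)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model


# Toy data: 100 individuals, 5 binary features; the target copies feature 0
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 5))
y_toy = X_toy[:, 0]
toy_path = os.path.join(tempfile.gettempdir(), "toy_model.pkl")
toy_model = generate_model_sketch(X_toy, y_toy, toy_path)
```

Cross-validation scores are computed on the training data only, so the held-out 20% test set is still untouched when `accuracy_score` is evaluated in the cells below.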

### AFR model training

In [14]:
afr_model = generate_model(X_afr_train, y_afr_train, parameters['afr_model'])
# Print the accuracy of the model
print(f"""
    Accuracy of the AFR model: {accuracy_score(y_afr_test, afr_model.predict(X_afr_test))}\n
    Model saved to: {parameters['afr_model']}
""")


    Accuracy of the AFR model: 0.9720558882235529

    Model saved to: ../data/output/models/afr_model.pkl



### EUR model training

In [15]:
eur_model = generate_model(X_eur_train, y_eur_train, parameters['eur_model'])
# Print the accuracy of the model
print(f"""
    Accuracy of the EUR model: {accuracy_score(y_eur_test, eur_model.predict(X_eur_test))}\n
    Model saved to: {parameters['eur_model']}
""")


    Accuracy of the EUR model: 0.9461077844311377

    Model saved to: ../data/output/models/eur_model.pkl



### SAS and EAS model training

In [16]:
sas_eas_model = generate_model(X_sas_eas_train, y_sas_eas_train, parameters['sas_eas_model'])
# Print the accuracy of the model
print(f"""
    Accuracy of the SAS_EAS model: {accuracy_score(y_sas_eas_test, sas_eas_model.predict(X_sas_eas_test))}\n
    Model saved to: {parameters['sas_eas_model']}
""")


    Accuracy of the SAS_EAS model: 0.9520958083832335

    Model saved to: ../data/output/models/sas_eas_model.pkl

