# Task 2. Data Augmentation
Read from the ancestries file and extract the column called `Superpopulation code`. Augment your merged file to include this new column. Each ancestry will be your target vector for the model. In other models, the target vector could be the presence or absence of a disease (e.g. diabetes or not, cancer or not).

---

We import the libraries and load the files needed to process the data. We also check the unique values of the `Superpopulation code` column, since it will serve as our target vector.

In [1]:
import pandas as pd
import numpy as np

from utils import *

In [2]:
tsv = pd.read_csv(parameters['tsv_file'], sep='\t')
tsv.head()

Unnamed: 0,Sample name,Sex,Biosample ID,Population code,Population name,Superpopulation code,Superpopulation name,Population elastic ID,Data collections
0,NA19625,female,SAME123655,ASW,African Ancestry SW,AFR,African Ancestry,ASW,"1000 Genomes on GRCh38,1000 Genomes 30x on GRC..."
1,NA19835,female,SAME125029,ASW,African Ancestry SW,AFR,African Ancestry,ASW,"1000 Genomes on GRCh38,1000 Genomes 30x on GRC..."
2,NA19900,male,SAME125050,ASW,African Ancestry SW,AFR,African Ancestry,ASW,"1000 Genomes on GRCh38,1000 Genomes 30x on GRC..."
3,NA19917,female,SAME125272,ASW,African Ancestry SW,AFR,African Ancestry,ASW,"1000 Genomes on GRCh38,1000 Genomes 30x on GRC..."
4,NA19703,male,SAME124230,ASW,African Ancestry SW,AFR,African Ancestry,ASW,"1000 Genomes on GRCh38,1000 Genomes 30x on GRC..."


In [3]:
tsv['Superpopulation code'].unique()

array(['AFR', 'SAS', 'EUR', 'AMR', 'EAS'], dtype=object)
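Since this column becomes the target vector, it can help to see how many samples fall into each class, not only which classes exist. The following is a minimal sketch on made-up data; `tsv_demo` and its sample names are hypothetical stand-ins for the real ancestry table.

```python
import pandas as pd

# Toy ancestry table; sample names and counts are invented for illustration
tsv_demo = pd.DataFrame({
    'Sample name': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
    'Superpopulation code': ['AFR', 'AFR', 'AFR', 'EUR', 'EUR', 'EAS'],
})

# Number of samples per class, largest class first
class_counts = tsv_demo['Superpopulation code'].value_counts()

# The same counts expressed as a fraction of all samples
class_share = tsv_demo['Superpopulation code'].value_counts(normalize=True)

print(class_counts.to_dict())
print(class_share.round(2).to_dict())
```

On the real data the same two calls would be applied to `tsv['Superpopulation code']`.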

In [4]:
master_df = pd.read_csv(parameters['master_file'], sep=',')
master_df.head()

Unnamed: 0,chr:location,alternative,NA19625,NA19835,NA19900,NA19917,NA19703,NA20274,NA20351,NA20356,...,NA19117,NA19129,NA19131,NA19256,NA19198,NA19201,NA19206,NA19213,NA19225,NA19143
0,5:195139,T,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5:336952,C,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,5:389603,C,1,1,1,1,1,0,1,1,...,1,1,1,1,1,1,1,0,1,1
3,5:851582,A,1,1,0,0,0,1,0,1,...,1,0,1,1,1,1,1,0,0,1
4,5:1144802,C,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [5]:
len(master_df)

10028

In the next cells we create a copy of the `master_df` dataframe and drop its first two columns (`chr:location` and `alternative`) so that only the sample-name columns remain. We then match each sample name in the ancestry file against the column names of `population_df` and fill the matching column with that sample's `Superpopulation code`.

In [6]:
population_df = master_df.copy()
population_df.drop(population_df.columns[0:2], axis=1, inplace=True)
# Note: replacing NaN with NaN leaves the data unchanged
population_df.fillna(np.nan, inplace=True)

population_df.head()

Unnamed: 0,NA19625,NA19835,NA19900,NA19917,NA19703,NA20274,NA20351,NA20356,NA19707,NA20298,...,NA19117,NA19129,NA19131,NA19256,NA19198,NA19201,NA19206,NA19213,NA19225,NA19143
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,0,1,1,1,1,...,1,1,1,1,1,1,1,0,1,1
3,1,1,0,0,0,1,0,1,1,1,...,1,0,1,1,1,1,1,0,0,1
4,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [7]:
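# Look up each sample's superpopulation code by sample name
code_by_sample = dict(zip(tsv['Sample name'], tsv['Superpopulation code']))
for col in population_df.columns:
    if col in code_by_sample:
        # Overwrite the whole column with that sample's code
        population_df[col] = code_by_sample[col]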

In [8]:
population_df.head()

Unnamed: 0,NA19625,NA19835,NA19900,NA19917,NA19703,NA20274,NA20351,NA20356,NA19707,NA20298,...,NA19117,NA19129,NA19131,NA19256,NA19198,NA19201,NA19206,NA19213,NA19225,NA19143
0,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
1,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
2,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
3,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
4,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR


In [9]:
population_df.shape

(10028, 2504)

Now every sample column in `population_df` holds that sample's superpopulation code, repeated on each row.
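A quick sanity check at this point is to confirm that every sample column was matched and that each labelled column contains a single code. The sketch below uses toy data; `tsv_demo`, `population_demo`, and the sample names are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-ins for the ancestry table and the labelled dataframe
tsv_demo = pd.DataFrame({
    'Sample name': ['S1', 'S2', 'S3'],
    'Superpopulation code': ['AFR', 'EUR', 'EAS'],
})
population_demo = pd.DataFrame({
    'S1': ['AFR', 'AFR'],
    'S2': ['EUR', 'EUR'],
    'S4': [0, 1],  # no entry in tsv_demo, so it keeps its genotype values
})

# Sample columns that have no matching row in the ancestry table
known_samples = set(tsv_demo['Sample name'])
unmatched = [col for col in population_demo.columns if col not in known_samples]

# Distinct values per column; a correctly labelled column has exactly one
distinct_per_column = population_demo.nunique()

print(unmatched)
print(distinct_per_column.to_dict())
```

An empty `unmatched` list and a count of one for every column would indicate that the matching covered all samples.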

In [10]:
# Add a '_SP' string to the end of the column names
population_df.columns = population_df.columns + '_SP'
population_df.head()

Unnamed: 0,NA19625_SP,NA19835_SP,NA19900_SP,NA19917_SP,NA19703_SP,NA20274_SP,NA20351_SP,NA20356_SP,NA19707_SP,NA20298_SP,...,NA19117_SP,NA19129_SP,NA19131_SP,NA19256_SP,NA19198_SP,NA19201_SP,NA19206_SP,NA19213_SP,NA19225_SP,NA19143_SP
0,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
1,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
2,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
3,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
4,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR


We appended `_SP` to each sample name so that the superpopulation code columns can be told apart from the genotype columns. Next, we concatenate the `master_df` and `population_df` dataframes column-wise.

In [11]:
# Concat the population_df with the master_df
master_merged = pd.concat([master_df, population_df], axis=1)
master_merged.head()

Unnamed: 0,chr:location,alternative,NA19625,NA19835,NA19900,NA19917,NA19703,NA20274,NA20351,NA20356,...,NA19117_SP,NA19129_SP,NA19131_SP,NA19256_SP,NA19198_SP,NA19201_SP,NA19206_SP,NA19213_SP,NA19225_SP,NA19143_SP
0,5:195139,T,0,0,0,0,0,0,0,0,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
1,5:336952,C,0,0,0,0,1,0,1,0,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
2,5:389603,C,1,1,1,1,1,0,1,1,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
3,5:851582,A,1,1,0,0,0,1,0,1,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR
4,5:1144802,C,1,1,1,1,1,1,1,1,...,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR,AFR


In [12]:
master_merged.shape

(10028, 5010)
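The shape follows from the inputs: `pd.concat(..., axis=1)` places the frames side by side, so the column counts add up (2506 + 2504 = 5010) while the row count stays at 10028. A small sketch of the same behaviour on toy frames, where `left_demo` and `right_demo` are hypothetical names:

```python
import pandas as pd

# Toy genotype frame and toy label frame that share the same row index
left_demo = pd.DataFrame({'chr:location': ['5:1', '5:2'], 'S1': [0, 1]})
right_demo = pd.DataFrame({'S1_SP': ['AFR', 'AFR']})

# axis=1 aligns rows on the index and appends the columns of each frame
merged_demo = pd.concat([left_demo, right_demo], axis=1)

print(merged_demo.shape)
print(merged_demo.columns.tolist())
```

Because the alignment is on the row index, the two frames need matching indices for the rows to line up as intended.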

In [13]:
master_merged.to_csv(parameters['master_augmented'], sep=',', index=False)

---

## Another approach

Repeating the superpopulation code in a separate column for every sample is wasteful: it nearly doubles the number of columns and stores the same value on all 10028 rows. A tidier layout has one row per patient, so we take a different approach. Here we use the `.transpose()` method to swap the rows and columns of the dataframe. This yields a new dataframe with the `chr:location` values as columns and the patients as rows.

In [14]:
master_copy = master_df.copy(deep=True)
master_copy = master_copy.transpose()
master_copy.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,10018,10019,10020,10021,10022,10023,10024,10025,10026,10027
chr:location,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,5:1721485,...,22:49335230,22:49363742,22:49578486,22:49651708,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664
alternative,T,C,C,A,C,C,G,T,T,C,...,T,A,T,A,G,A,G,A,A,C
NA19625,0,0,1,1,1,0,0,1,1,1,...,1,0,0,1,1,0,1,0,1,0
NA19835,0,0,1,1,1,0,0,0,0,1,...,1,0,0,0,1,1,1,0,1,0
NA19900,0,0,1,0,1,0,0,0,0,0,...,1,1,0,0,1,1,1,0,0,0


In [15]:
# Promote the first row (chr:location) to column headers, then drop that row
master_copy.columns = master_copy.iloc[0]
master_copy.drop(master_copy.index[0], inplace=True)
master_copy.head()

chr:location,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,5:1721485,...,22:49335230,22:49363742,22:49578486,22:49651708,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664
alternative,T,C,C,A,C,C,G,T,T,C,...,T,A,T,A,G,A,G,A,A,C
NA19625,0,0,1,1,1,0,0,1,1,1,...,1,0,0,1,1,0,1,0,1,0
NA19835,0,0,1,1,1,0,0,0,0,1,...,1,0,0,0,1,1,1,0,1,0
NA19900,0,0,1,0,1,0,0,0,0,0,...,1,1,0,0,1,1,1,0,0,0
NA19917,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,1,0


In [16]:
# Drop the 'alternative' row so that only patient rows remain
master_copy.drop(master_copy.index[0], inplace=True)
master_copy.head()

chr:location,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,5:1721485,...,22:49335230,22:49363742,22:49578486,22:49651708,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664
NA19625,0,0,1,1,1,0,0,1,1,1,...,1,0,0,1,1,0,1,0,1,0
NA19835,0,0,1,1,1,0,0,0,0,1,...,1,0,0,0,1,1,1,0,1,0
NA19900,0,0,1,0,1,0,0,0,0,0,...,1,1,0,0,1,1,1,0,0,0
NA19917,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,1,0
NA19703,0,1,1,0,1,0,0,1,1,1,...,0,0,0,1,1,0,1,0,0,0


In [17]:
# Move the sample names from the index into a regular column
master_copy.reset_index(inplace=True)
master_copy.head()

chr:location,index,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,...,22:49335230,22:49363742,22:49578486,22:49651708,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664
0,NA19625,0,0,1,1,1,0,0,1,1,...,1,0,0,1,1,0,1,0,1,0
1,NA19835,0,0,1,1,1,0,0,0,0,...,1,0,0,0,1,1,1,0,1,0
2,NA19900,0,0,1,0,1,0,0,0,0,...,1,1,0,0,1,1,1,0,0,0
3,NA19917,0,0,1,0,1,0,0,0,0,...,0,0,0,0,1,0,1,0,1,0
4,NA19703,0,1,1,0,1,0,0,1,1,...,0,0,0,1,1,0,1,0,0,0


In [18]:
master_copy.rename(columns={'index': 'patient'}, inplace=True)
master_copy.columns.name = None
master_copy.head()

Unnamed: 0,patient,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,...,22:49335230,22:49363742,22:49578486,22:49651708,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664
0,NA19625,0,0,1,1,1,0,0,1,1,...,1,0,0,1,1,0,1,0,1,0
1,NA19835,0,0,1,1,1,0,0,0,0,...,1,0,0,0,1,1,1,0,1,0
2,NA19900,0,0,1,0,1,0,0,0,0,...,1,1,0,0,1,1,1,0,0,0
3,NA19917,0,0,1,0,1,0,0,0,0,...,0,0,0,0,1,0,1,0,1,0
4,NA19703,0,1,1,0,1,0,0,1,1,...,0,0,0,1,1,0,1,0,0,0
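The reshaping steps above can also be sketched as one chain. This version drops `alternative` before transposing, which keeps the genotype values numeric; transposing a frame that mixes strings and integers produces `object` columns. The data and the names `master_demo` and `samples_demo` are hypothetical.

```python
import pandas as pd

# Toy version of master_df: one row per variant, one column per sample
master_demo = pd.DataFrame({
    'chr:location': ['5:195139', '5:336952'],
    'alternative': ['T', 'C'],
    'NA19625': [0, 0],
    'NA19835': [0, 1],
})

# Drop the allele column, use the variant position as row label, then swap axes
samples_demo = master_demo.drop(columns='alternative').set_index('chr:location').T

# Clear the leftover 'chr:location' label on the column axis
samples_demo.columns.name = None

# Name the index 'patient' and turn it into a regular column
samples_demo = samples_demo.rename_axis('patient').reset_index()

print(samples_demo)
```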


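Finally, we add the superpopulation code as a column to the new dataframe, looking it up by sample name so that the result does not depend on the two files listing the samples in the same order.

In [19]:
# Map each patient to its superpopulation code by sample name
code_by_sample = dict(zip(tsv['Sample name'], tsv['Superpopulation code']))
master_copy['superpopulation_code'] = master_copy['patient'].map(code_by_sample)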
master_copy.head()

Unnamed: 0,patient,5:195139,5:336952,5:389603,5:851582,5:1144802,5:1167618,5:1175892,5:1398007,5:1447860,...,22:49363742,22:49578486,22:49651708,22:49666841,22:49797810,22:49855674,22:50021013,22:50258751,22:50458664,superpopulation_code
0,NA19625,0,0,1,1,1,0,0,1,1,...,0,0,1,1,0,1,0,1,0,AFR
1,NA19835,0,0,1,1,1,0,0,0,0,...,0,0,0,1,1,1,0,1,0,AFR
2,NA19900,0,0,1,0,1,0,0,0,0,...,1,0,0,1,1,1,0,0,0,AFR
3,NA19917,0,0,1,0,1,0,0,0,0,...,0,0,0,1,0,1,0,1,0,AFR
4,NA19703,0,1,1,0,1,0,0,1,1,...,0,0,1,1,0,1,0,0,0,AFR
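A key-based join is another way to attach labels by sample name, and it can also flag duplicated sample names or patients without a label. The sketch below uses toy data in a deliberately shuffled order; `samples_demo`, `ancestry_demo`, and the sample names are hypothetical.

```python
import pandas as pd

# Toy patient table, listed in a different order than the ancestry table
samples_demo = pd.DataFrame({'patient': ['S2', 'S1', 'S3'], '5:195139': [0, 1, 0]})
ancestry_demo = pd.DataFrame({
    'Sample name': ['S1', 'S2', 'S3', 'S9'],
    'Superpopulation code': ['AFR', 'EUR', 'EAS', 'SAS'],
})

# Left join on the sample name; validate raises if a sample name is duplicated
labelled_demo = (
    samples_demo
    .merge(ancestry_demo, left_on='patient', right_on='Sample name',
           how='left', validate='many_to_one')
    .drop(columns='Sample name')
    .rename(columns={'Superpopulation code': 'superpopulation_code'})
)

# Patients that did not find a label end up as NaN
n_missing = int(labelled_demo['superpopulation_code'].isna().sum())

print(labelled_demo)
print(n_missing)
```

With `how='left'` the patient rows keep their original order, and extra samples in the ancestry table (such as `S9` here) are ignored.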


In [20]:
master_copy.to_csv(parameters['master_augmented_2'], sep=',', index=False)
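With one row per patient, the augmented table can be split into a feature matrix and the target vector described in the task. This is a minimal sketch on toy data, not the modelling step itself; `augmented_demo`, `X`, and `y_codes` are hypothetical names.

```python
import pandas as pd

# Toy version of the augmented table: genotypes plus the ancestry label
augmented_demo = pd.DataFrame({
    'patient': ['S1', 'S2', 'S3'],
    '5:195139': [0, 1, 0],
    '5:336952': [1, 1, 0],
    'superpopulation_code': ['AFR', 'EUR', 'AFR'],
})

# Features: every variant column, without the identifier and the label
X = augmented_demo.drop(columns=['patient', 'superpopulation_code'])

# Target: integer class codes, plus the class names in order of first appearance
y_codes, classes = pd.factorize(augmented_demo['superpopulation_code'])

print(X.shape)
print(list(y_codes), list(classes))
```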