# Preparation of sample metadata

For the scope of this study, a "sample" is defined as an individual from whom genomic data has been collected.

Sample metadata, stored in the `Metadata/samples.csv` file, records two things for each sample: its African ethnolinguistic classification (also called the subpopulation group) and the dataset to which it belongs, namely [gnomAD 1000 Genomes or HGDP](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/).

Sample metadata was prepared for analysis by:
1. Selecting and renaming features of interest, namely the sample name and ethnolinguistic classification.
2. Removing irrelevant data.
3. Adding a feature, namely the African continental region in which each ethnolinguistic subpopulation resides.

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

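# Read the project root from the environment (populated from .env by load_dotenv).
PROJECT_ROOT = os.getenv("PROJECT_ROOT")
# On first run, work from the Notebooks folder and make project modules importable.
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)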

import Utils.constants as constants
import Utils.functions as functions
import pandas as pd
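
`os.getenv` returns `None` when a variable is unset, in which case the path concatenation in the setup cell fails with a `TypeError` that does not name the missing variable. A minimal sketch of a guard is shown below; `require_env` is a hypothetical helper for illustration and is not part of the project's `Utils` package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper, shown for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; define it in the .env file."
        )
    return value


# Usage sketch: fall back to a placeholder so the example runs on its own.
os.environ.setdefault("PROJECT_ROOT", os.getcwd())
project_root = require_env("PROJECT_ROOT")
```

Failing early this way keeps the error close to its cause rather than surfacing later as a path or import problem.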

Import sample metadata from the `Metadata/samples.csv` file.

In [2]:
sample_metadata = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "samples.csv",
    )
).replace({"SUB": constants.SUBPOP_RENAME})
sample_metadata

|     | sample_name | dataset | SUPER | SUB |
| --- | --- | --- | --- | --- |
| 0 | HG02461 | GnomAD | AFR | Mandinka |
| 1 | HG02462 | GnomAD | AFR | Mandinka |
| 2 | HG02464 | GnomAD | AFR | Mandinka |
| 3 | HG02465 | GnomAD | AFR | Mandinka |
| 4 | HG02561 | GnomAD | AFR | Mandinka |
| ... | ... | ... | ... | ... |
| 605 | NA19474 | GnomAD | AFR | Luhya |
| 606 | NA19475 | GnomAD | AFR | Luhya |
| 607 | SS6004470 | GnomAD | AFR | Mandenka |
| 608 | SS6004473 | GnomAD | AFR | San |
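
The `.replace({"SUB": ...})` call in the import cell uses the nested-dictionary form of `DataFrame.replace`, which limits the replacement to the named column. The sketch below illustrates this on made-up rows; the mapping is hypothetical, since the contents of `constants.SUBPOP_RENAME` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical rename mapping, for illustration only.
subpop_rename = {"BantuSouthAfrica": "Bantu South Africa"}

# Made-up rows; sample names here are placeholders, not real samples.
toy = pd.DataFrame(
    {
        "sample_name": ["S1", "S2"],
        "SUB": ["BantuSouthAfrica", "San"],
    }
)

# Nested dict form: {column: {old_value: new_value}}.
# Values without an entry in the mapping are left unchanged.
renamed = toy.replace({"SUB": subpop_rename})
```

Only the `SUB` column is touched, so a value that happens to match a key in another column would not be altered.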


## Feature selection

Select the sample name and subpopulation features for further analysis. Rename features if necessary.

The different African ethnolinguistic classifications/subpopulation groups are: 

* Mandinka
* Esan
* Mende
* Mbuti Pygmy
* Biaka Pygmy
* Mandenka
* Yoruba (HGDP and 1000G)
* San
* Bantu South Africa
* Luhya

In [4]:
sample_subpopulations = sample_metadata[["SUB", "sample_name"]].rename(
    columns={"sample_name": "SAMPLE_NAME"}
)
sample_subpopulations

|     | SUB | SAMPLE_NAME |
| --- | --- | --- |
| 0 | Mandinka | HG02461 |
| 1 | Mandinka | HG02462 |
| 2 | Mandinka | HG02464 |
| 3 | Mandinka | HG02465 |
| 4 | Mandinka | HG02561 |
| ... | ... | ... |
| 605 | Luhya | NA19474 |
| 606 | Luhya | NA19475 |
| 607 | Mandenka | SS6004470 |
| 608 | San | SS6004473 |


## Add a feature

Add the regional classification of each sample, that is, the African continental region associated with its subpopulation.

The different regional groupings are: 

* SA: Southern Africa
* WA: Western Africa
* CA: Central Africa
* EA: Eastern Africa

In [8]:
sample_subpopulations["REG"] = sample_subpopulations["SUB"].map(
    constants.REGIONAL_CLASSIFICATION
)

sample_subpopulations.head(5)

|     | SUB | SAMPLE_NAME | REG |
| --- | --- | --- | --- |
| 0 | Mandinka | HG02461 | WA |
| 1 | Mandinka | HG02462 | WA |
| 2 | Mandinka | HG02464 | WA |
| 3 | Mandinka | HG02465 | WA |
| 4 | Mandinka | HG02561 | WA |
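
`Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, so a subpopulation missing from the mapping would silently lose its region. A minimal sketch of a check follows; the mapping is a hypothetical stand-in for `constants.REGIONAL_CLASSIFICATION` and deliberately omits one group to show the failure mode.

```python
import pandas as pd

# Hypothetical mapping that deliberately omits "San".
regional_classification = {"Mandinka": "WA", "Luhya": "EA"}

toy = pd.DataFrame({"SUB": ["Mandinka", "Luhya", "San"]})
toy["REG"] = toy["SUB"].map(regional_classification)

# Collect subpopulations that did not receive a region code.
unmapped = sorted(toy.loc[toy["REG"].isna(), "SUB"].unique())
if unmapped:
    print(f"Subpopulations without a region: {unmapped}")
```

Running a check like this before saving makes a gap in the mapping visible instead of leaving empty `REG` cells in the output file.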


## Save sample population data to a CSV file

In [9]:
sample_subpopulations.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Sample_populations.csv",
    ),
    index=False,
)
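
As an optional sanity check, the saved file can be read back to confirm its columns. The sketch below does this with a single row taken from the output above and writes to a temporary directory (an assumption made so the example does not touch the project's `Data/Processed` folder).

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"SUB": ["Mandinka"], "SAMPLE_NAME": ["HG02461"], "REG": ["WA"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Sample_populations.csv")
    # index=False keeps the row index out of the file.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame has exactly the three saved columns and no index column.
columns_match = list(reloaded.columns) == ["SUB", "SAMPLE_NAME", "REG"]
```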