# Preparation of sample data

In this study, a "sample" is an individual from whom genomic data have been collected.

The `Metadata/samples.csv` file contains the following information on each sample:
* The unique name of the sample.
* The dataset to which the sample belongs, namely [GnomAD 1000 Genomes or HGDP](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/).
* The African ethnolinguistic classification of the sample, also known as its subpopulation group. The subpopulation groups are:

    * Mandinka
    * Esan
    * Mende
    * Mbuti Pygmy
    * Biaka Pygmy
    * Mandenka
    * Yoruba (HGDP and 1000G)
    * San
    * Bantu South Africa
    * Luhya
* The superpopulation group to which the sample belongs. In this case, all samples belong to the African superpopulation group.
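
The properties listed above can be checked with a few lines of pandas. The following is a minimal sketch, assuming the column names that appear in the notebook output (`sample_name`, `dataset`, `SUPER`, `SUB`) and the `AFR` superpopulation code. It reads two example rows from an in-memory string rather than the real `Metadata/samples.csv` file.

```python
import io

import pandas as pd

# Two rows copied from the notebook output; the real data live in Metadata/samples.csv.
csv_text = (
    "sample_name,dataset,SUPER,SUB\n"
    "HG02461,GnomAD,AFR,Mandinka\n"
    "HG02462,GnomAD,AFR,Mandinka\n"
)
samples = pd.read_csv(io.StringIO(csv_text))

# Columns the description above implies; extra columns are tolerated.
expected_columns = {"sample_name", "dataset", "SUPER", "SUB"}
missing_columns = expected_columns - set(samples.columns)

# Sample names should be unique, and every sample should be in the African superpopulation.
names_unique = samples["sample_name"].is_unique
all_african = bool((samples["SUPER"] == "AFR").all())
```

Running the same three checks against the full file would flag a missing column, a duplicated sample name, or a non-African sample before any downstream analysis.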

## Imports

Set up the notebook: load environment variables, set the working directory, and import the project utilities.

In [3]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

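PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT is None:
    # os.getenv returns None when the variable is missing, which would break the path handling below.
    raise RuntimeError("PROJECT_ROOT is not set; define it, e.g. in the .env file.")

# Work from the Notebooks directory and make the Utils package importable.
if PROJECT_ROOT not in sys.path:
    os.chdir(os.path.join(PROJECT_ROOT, "Notebooks"))
    sys.path.append(PROJECT_ROOT)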

import Utils.constants as constants
import Utils.functions as functions
import pandas as pd

## Data loading

Import sample metadata from the `Metadata/samples.csv` file.

In [4]:
sample_metadata = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "samples.csv",
    )
).replace({"SUB": constants.SUBPOP_RENAME})

sample_metadata.head(2)

|   | sample_name | dataset | SUPER | SUB |
|---|---|---|---|---|
| 0 | HG02461 | GnomAD | AFR | Mandinka |
| 1 | HG02462 | GnomAD | AFR | Mandinka |
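
The loading step passes a nested dictionary to `DataFrame.replace`, which limits the replacement to the `SUB` column. The contents of `constants.SUBPOP_RENAME` are not shown in this notebook, so the mapping and the raw label in the sketch below are hypothetical and only illustrate the mechanism.

```python
import pandas as pd

# Hypothetical stand-in for constants.SUBPOP_RENAME; the real mapping is not shown here.
subpop_rename = {"Bantu S Africa": "Bantu South Africa"}

demo = pd.DataFrame(
    {
        "sample_name": ["S1", "S2"],
        "dataset": ["HGDP", "GnomAD"],
        "SUB": ["Bantu S Africa", "Mandinka"],
    }
)

# The outer key restricts the replacement to the SUB column;
# labels without an entry in the mapping are left unchanged.
renamed = demo.replace({"SUB": subpop_rename})
```

Because the replacement is scoped to one column, an identical string in `sample_name` or `dataset` would not be altered.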


## Feature selection

Select the sample name and subpopulation columns for further analysis, and rename `sample_name` to `SAMPLE_NAME` so that all column names are uppercase.

In [5]:
sample_subpopulations = sample_metadata[["SUB", "sample_name"]].rename(
    columns={"sample_name": "SAMPLE_NAME"}
)

Add a new `REG` feature containing the regional classification of each sample.
The regional groupings are:

* SA: Southern Africa
* WA: Western Africa
* CA: Central Africa
* EA: Eastern Africa

In [6]:
sample_subpopulations["REG"] = sample_subpopulations["SUB"].map(
    constants.REGIONAL_CLASSIFICATION
)
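
`Series.map` with a dictionary returns `NaN` for any label that has no key, so a subpopulation missing from the mapping would silently end up without a region. The sketch below shows one way to detect that. The dictionary is an illustrative stand-in for `constants.REGIONAL_CLASSIFICATION`: only the Mandinka entry is confirmed by the notebook output, the other entries are assumptions, and `"Unlisted group"` is a made-up label.

```python
import pandas as pd

# Illustrative stand-in for constants.REGIONAL_CLASSIFICATION (assumed contents).
regional_classification = {"Mandinka": "WA", "San": "SA", "Luhya": "EA"}

demo = pd.DataFrame(
    {
        "SUB": ["Mandinka", "San", "Unlisted group"],
        "SAMPLE_NAME": ["S1", "S2", "S3"],
    }
)
demo["REG"] = demo["SUB"].map(regional_classification)

# Labels absent from the dictionary map to NaN; collect them for review.
unmapped = sorted(demo.loc[demo["REG"].isna(), "SUB"].unique())
```

An empty `unmapped` list indicates that every subpopulation received a regional code.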

## View and save prepared data

In [7]:
sample_subpopulations.head(2)

|   | SUB | SAMPLE_NAME | REG |
|---|---|---|---|
| 0 | Mandinka | HG02461 | WA |
| 1 | Mandinka | HG02462 | WA |


In [8]:
sample_subpopulations.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Sample_populations.csv",
    ),
    index=False,
)
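
Writing with `index=False` omits the row index, so the saved file holds only the `SUB`, `SAMPLE_NAME` and `REG` columns. A minimal round-trip sketch follows, assuming the two rows shown in the output above and writing to a temporary directory instead of `Data/Processed`.

```python
import os
import tempfile

import pandas as pd

# Two rows matching the notebook output; used here in place of the full table.
prepared = pd.DataFrame(
    {
        "SUB": ["Mandinka", "Mandinka"],
        "SAMPLE_NAME": ["HG02461", "HG02462"],
        "REG": ["WA", "WA"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "Sample_populations.csv")
    # index=False keeps the row index out of the file.
    prepared.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded table should have the same columns and values as the original.
round_trip_ok = reloaded.to_dict("list") == prepared.to_dict("list")
```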