In [70]:
from pandas import read_csv

In [71]:
from os import chdir, getcwd
# chdir("..")
# chdir("..")
getcwd()

'c:\\Users\\frdgr\\OneDrive\\Documents\\University of Pretoria\\Population-Structure-Workflow'

# Generate `ind2pop` assignments

This notebook describes the creation of `ind2pop` input data, a type of input based on `.fam` files produced by PLINK-1.9 and up.

> To keep things simple, we will not cover infrastructure provisioning here. The `.fam` file will be 'provided' via `Snakemake` provisioning through a separate rule (`plinkPed`) to generate the ped file required to correctly order our labels.

## Data import

We need to import the input datasets required for this formatting operation. These include:
- *`samples.csv`* _(Our reference which describes our known population labels)_
- *`results/{wildcards.cluster_assignment}/Population_Structure/fetchPedLables.pop`* _(Pedigree information for the relevant cluster as declared in `samples.csv`)_

In [72]:
samples = read_csv(snakemake.input.samples, index_col="sample_name")
samples

NameError: name 'snakemake' is not defined
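
The `NameError` above appears because the `snakemake` object only exists when the notebook is executed by Snakemake. For interactive runs, one possible workaround is a stand-in object, sketched below with the standard library. The file paths and the wildcard value are illustrative assumptions, not values taken from this workflow.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the object Snakemake injects at run time.
# Only define it when the real object is absent, so a genuine
# Snakemake run is never overridden.
if "snakemake" not in globals():
    snakemake = SimpleNamespace(
        # Assumed example paths; replace with real files when testing.
        input=SimpleNamespace(samples="samples.csv", ped="example.ped"),
        # Assumed example value; must match a column in samples.csv.
        wildcards=SimpleNamespace(cluster_assignment="SUPER"),
    )

print(snakemake.input.samples)
```

The stand-in only mirrors the three attributes this notebook reads (`input.samples`, `input.ped`, and `wildcards.cluster_assignment`).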

Here we import the `.ped` file generated in a separate step.

> For performance reasons, we do not want to import more than the first `IID` column, since we only need it for ordering purposes. To do this, we can use the `usecols` argument of the `read_*` functions.

In [None]:
pedLabels = read_csv(snakemake.input.ped, usecols=[0], index_col=["ID"], names=["ID"], sep=" ")
pedLabels

HGDP00747
HGDP00082
HGDP00735
HGDP01229
HGDP01416
...
HG01077
HG02262
HG03686
HG04141
HG03857
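
To illustrate how `usecols` restricts parsing to a single column, here is a minimal sketch on made-up, space-delimited rows held in memory. The sample IDs and genotype values are invented for illustration, and the sketch keeps the column as ordinary data rather than as the index.

```python
from io import StringIO

from pandas import read_csv

# Made-up, space-delimited rows in the style of a `.ped` file.
ped_text = (
    "S1 S1 0 0 1 -9 A A G G\n"
    "S2 S2 0 0 2 -9 A C G T\n"
)

# usecols=[0] keeps only the first column of each row;
# names=["ID"] labels that single column, and header=None
# states that the file has no header row.
labels = read_csv(
    StringIO(ped_text),
    sep=" ",
    header=None,
    usecols=[0],
    names=["ID"],
)

print(labels["ID"].tolist())
```

Because the remaining columns are never materialised, the resulting frame stays small even when each row carries many genotype fields.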


## Generate `ind2pop`

We have a reference for a given sample's cluster assignment. We also have a `Series` containing the required `sample_name` order. To combine these, we can use the Pandas `.merge()` method to left-merge the sample assignment columns in `samples.csv` onto the `Series`, using the `Series` as the index to set the order.

> We also need to call the `.fillna("-")` method, as Admixture-1.3.0 expects a `-` for samples with unknown assignments.

In [None]:
output = pedLabels.merge(samples, how="left", right_index=True, left_index=True).fillna("-")
output

ID,dataset,SUPER,SUB
HGDP00747,HGDP,East_Asia,Japanese
HGDP00082,HGDP,Central_South_Asia,Balochi
HGDP00735,HGDP,Middle_East,Palestinian
HGDP01229,HGDP,East_Asia,Mongola
HGDP01416,HGDP,Africa,BantuKenya
...,...,...,...
HG01077,1000 Genomes,AMR,PUR
HG02262,1000 Genomes,AMR,PEL
HG03686,1000 Genomes,SAS,STU
HG04141,1000 Genomes,SAS,BEB
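
The ordering behaviour of the left-merge can be checked on a small, made-up example. The sample IDs and labels below are invented, and `S3` is deliberately left out of the reference table to show how `.fillna("-")` handles an unknown assignment.

```python
from pandas import DataFrame, Index

# Made-up ordering, standing in for the IDs read from the `.ped` file.
order = DataFrame(index=Index(["S2", "S3", "S1"], name="ID"))

# Made-up reference table; "S3" is deliberately missing.
reference = DataFrame(
    {"SUPER": ["AFR", "EUR"], "SUB": ["YRI", "GBR"]},
    index=Index(["S1", "S2"], name="sample_name"),
)

# A left-merge on the indexes keeps the row order of the left frame;
# samples without a match get missing values, which fillna replaces.
merged = order.merge(
    reference, how="left", left_index=True, right_index=True
).fillna("-")

print(merged)
```

The rows come back in the order `S2`, `S3`, `S1`, with `-` in every column for `S3`.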


## Output

Here, we select the column named by the `cluster_assignment` wildcard from our merged sample-assignment `DataFrame` and save it to the corresponding `.pop` file for Admixture-1.3.0.

In [None]:
output[[snakemake.wildcards.cluster_assignment]].to_csv(f"results/{snakemake.wildcards.cluster_assignment}/Population_Structure/fetchPedLables.pop", index=False, header=False)
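
As a rough check of what ends up on disk, the sketch below writes one column of a made-up frame to a temporary file and reads it back. The column name `SUPER` stands in for `snakemake.wildcards.cluster_assignment`, and the file name `example.pop` is an assumption for illustration.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from pandas import DataFrame

# Made-up assignments, already ordered and filled with "-".
assignments = DataFrame(
    {"SUPER": ["EUR", "-", "AFR"], "SUB": ["GBR", "-", "YRI"]},
    index=["S2", "S3", "S1"],
)

# Stands in for snakemake.wildcards.cluster_assignment.
column = "SUPER"

with TemporaryDirectory() as tmp:
    pop_path = Path(tmp) / "example.pop"
    # index=False and header=False leave one bare label per line.
    assignments[[column]].to_csv(pop_path, index=False, header=False)
    pop_lines = pop_path.read_text().splitlines()

print(pop_lines)
```

Dropping both the index and the header means the file holds exactly one label per sample, in the same order as the merged frame.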