# 01: Training Data Exploratory Data Analysis (EDA)

This notebook performs an exploratory data analysis of the training data for the vector taxon classifier. The goal is to understand the structure and characteristics of the data before proceeding with model training.

The analysis will cover:
1. **Sample Metadata**: Inspecting the metadata of the samples available for training.
2. **SNP Genotypes**: Analysing the structure, quality, and characteristics of the SNP genotype data.
3. **Genomic Positions**: Investigating the distribution of SNPs across the genome.

In [1]:
import malariagen_data
import pandas as pd
import numpy as np

In [2]:
ag3 = malariagen_data.Ag3()
df = ag3.sample_metadata()

                                     

In [2]:
df

                                     

In [3]:
df = df.query("aim_species in ['gambiae', 'coluzzii']").dropna(subset=["sample_id", "aim_species"])

In [4]:
samples_df = pd.concat([
    df[df["aim_species"] == "gambiae"].sample(5, random_state=42),
    df[df["aim_species"] == "coluzzii"].sample(5, random_state=42)
]).set_index("sample_id")
sample_ids = samples_df.index.tolist()

In [5]:
region = "2L:100000-200000"
ds = ag3.snp_calls(region=region, sample_query=f"sample_id in {sample_ids}")

                                 

In [26]:
print(ds)
print(ds.dims)
print(ds['call_genotype'].dims)

<xarray.Dataset> Size: 16MB
Dimensions:                             (variants: 83226, alleles: 4,
                                         samples: 10, ploidy: 2)
Coordinates:
    variant_position                    (variants) int32 333kB dask.array<chunksize=(83226,), meta=np.ndarray>
    variant_contig                      (variants) uint8 83kB dask.array<chunksize=(83226,), meta=np.ndarray>
    sample_id                           (samples) <U36 1kB dask.array<chunksize=(10,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele                      (variants, alleles) |S1 333kB dask.array<chunksize=(83226, 4), meta=np.ndarray>
    variant_filter_pass_gamb_colu_arab  (variants) bool 83kB dask.array<chunksize=(83226,), meta=np.ndarray>
    variant_filter_pass_gamb_colu       (variants) bool 83kB dask.array<chunksize=(83226,), meta=np.ndarray>
    variant_filter_pass_arab            (variants) bool 83kB dask.array<chunks

In [27]:
sample_id_to_use = sample_ids[0]
sample_index = np.where(ds['sample_id'].values == sample_id_to_use)[0][0]

In [28]:
# Genotype calls for one sample: shape (variants, ploidy); -1 marks a missing allele call
gt_single = ds['call_genotype'][:, sample_index, :].values

In [71]:
gt_single

array([[-1, -1],
       [-1, -1],
       [-1, -1],
       ...,
       [-1, -1],
       [-1, -1],
       [-1, -1]], dtype=int8)

In [33]:
# Count each distinct diploid genotype; axis=0 treats every row (allele pair) as one unit
unique_genotypes, counts = np.unique(gt_single, axis=0, return_counts=True)

In [34]:
for genotype, count in zip(unique_genotypes, counts):
    print(f"Genotype {genotype}: {count} sites")

Genotype [-1 -1]: 17108 sites
Genotype [0 0]: 64986 sites
Genotype [0 1]: 226 sites
Genotype [0 2]: 244 sites
Genotype [0 3]: 164 sites
Genotype [1 1]: 152 sites
Genotype [1 2]: 4 sites
Genotype [1 3]: 2 sites
Genotype [2 2]: 176 sites
Genotype [2 3]: 1 sites
Genotype [3 3]: 163 sites
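
The counts above show that 17108 of the 83226 sites (roughly 21%) have a missing call for this sample. A per-class summary makes that kind of check easier to repeat for other samples. The sketch below is one possible way to do it; `summarise_calls` is a hypothetical helper, and the small array is a synthetic stand-in for `gt_single`, not real data.

```python
import numpy as np

# Synthetic stand-in for gt_single: shape (n_variants, ploidy); -1 marks a missing call.
gt = np.array(
    [[-1, -1], [0, 0], [0, 0], [0, 1], [1, 1], [0, 2], [2, 3], [0, 0]],
    dtype=np.int8,
)

def summarise_calls(gt):
    """Hypothetical helper: count missing, hom-ref, het, and hom-alt calls."""
    a0, a1 = gt[:, 0], gt[:, 1]
    missing = (a0 < 0) | (a1 < 0)
    hom_ref = ~missing & (a0 == 0) & (a1 == 0)
    het = ~missing & (a0 != a1)                   # any two different alleles
    hom_alt = ~missing & (a0 == a1) & (a0 > 0)    # two copies of the same alt allele
    return {
        "missing": int(missing.sum()),
        "hom_ref": int(hom_ref.sum()),
        "het": int(het.sum()),
        "hom_alt": int(hom_alt.sum()),
    }

summary = summarise_calls(gt)
missing_fraction = summary["missing"] / len(gt)
print(summary, f"missing fraction: {missing_fraction:.3f}")
```

In the notebook, the same function could be called on `gt_single` directly. Note that `het` here includes multiallelic heterozygotes such as `[2 3]`.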


In [74]:
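# Keep sites where this sample's call uses only missing (-1), ref (0), or first-alt (1) alleles.
# This is a per-sample filter; it does not show that a site is biallelic across all samples.
biallelic_mask = np.isin(gt_single, [-1, 0, 1]).all(axis=1)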

In [75]:
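gt_single_biallelic = gt_single[biallelic_mask]
# Assumption: variant_allele is never defined in the cells shown; load and decode it to match the '<U1' output
variant_allele = ds['variant_allele'].values.astype('U1')
variant_allele_biallelic = variant_allele[biallelic_mask, :2]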

In [78]:
unique_biallelic_genotypes, counts_biallelic = np.unique(
    gt_single_biallelic, axis=0, return_counts=True
)
for genotype, count in zip(unique_biallelic_genotypes, counts_biallelic):
    print(f"Biallelic genotype {genotype}: {count} sites")

Biallelic genotype [-1 -1]: 17108 sites
Biallelic genotype [0 0]: 64986 sites
Biallelic genotype [0 1]: 226 sites
Biallelic genotype [1 1]: 152 sites


In [82]:
gt_single_biallelic

array([[-1, -1],
       [-1, -1],
       [-1, -1],
       ...,
       [-1, -1],
       [-1, -1],
       [-1, -1]], dtype=int8)

In [83]:
variant_allele_biallelic

array([['C', 'A'],
       ['G', 'A'],
       ['C', 'A'],
       ...,
       ['G', 'A'],
       ['C', 'A'],
       ['G', 'A']], dtype='<U1')

In [84]:
variant_position_biallelic = ds['variant_position'].values[biallelic_mask]

In [85]:
def encode_diploid(gt_array):
    """Encode biallelic diploid calls as alt-allele counts (0, 1, 2); anything else stays NaN."""
    g0, g1 = gt_array[:, 0], gt_array[:, 1]
    # Start from NaN so missing calls (-1) and unmatched genotypes remain NaN
    encoded = np.full(len(gt_array), np.nan)
    mask_hom_ref = (g0 == 0) & (g1 == 0)
    mask_het = ((g0 == 0) & (g1 == 1)) | ((g0 == 1) & (g1 == 0))
    mask_hom_alt = (g0 == 1) & (g1 == 1)
    encoded[mask_hom_ref] = 0
    encoded[mask_het] = 1
    encoded[mask_hom_alt] = 2
    return encoded


In [86]:
encoded_genotype = encode_diploid(gt_single_biallelic)


In [87]:
encoded_genotype

array([nan, nan, nan, ..., nan, nan, nan])
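
The truncated display shows only missing calls at both ends of the region, so it says little about whether the encoding behaves as intended. A small hand-built array covering each case is a quicker check. The sketch below uses `encode_alt_count`, a hypothetical vectorised helper written for this example; it is intended to give the same result as `encode_diploid` on inputs like these, but it is not part of the original analysis.

```python
import numpy as np

def encode_alt_count(gt):
    """Hypothetical helper: alt-allele count per site, NaN where the call is not biallelic."""
    gt = np.asarray(gt)
    # A call is usable only if both alleles are ref (0) or first alt (1)
    valid = np.isin(gt, [0, 1]).all(axis=1)
    # For valid calls the row sum equals the number of alt alleles
    return np.where(valid, gt.sum(axis=1), np.nan)

# One row per case: hom-ref, het (both orders), hom-alt, missing, multiallelic
demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [-1, -1], [0, 2]], dtype=np.int8)
encoded_demo = encode_alt_count(demo)
print(encoded_demo)
```

Summing the two alleles works here because, once calls are restricted to alleles 0 and 1, the row sum is exactly the alt-allele count.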

In [88]:
df_spatial = pd.DataFrame({
    "variant_position": variant_position_biallelic,
    "encoded_genotype": encoded_genotype
})


In [89]:
df_spatial.head()

   variant_position  encoded_genotype
0            100000               NaN
1            100001               NaN
2            100002               NaN
3            100003               NaN
4            100004               NaN
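
With position and encoded genotype in one table, a natural next step towards the genomic-position analysis is to summarise calls in fixed-width windows. The sketch below shows one way to do that with a pandas `groupby`. The data are synthetic stand-ins for `df_spatial`, and `window_size` is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_spatial (positions and genotype values are made up).
rng = np.random.default_rng(0)
positions = np.arange(100_000, 100_100)
encoded = rng.choice(
    [0.0, 1.0, 2.0, np.nan], size=positions.size, p=[0.7, 0.05, 0.05, 0.2]
)
demo_df = pd.DataFrame({"variant_position": positions, "encoded_genotype": encoded})

window_size = 25  # hypothetical window width in base pairs
# Label each site with the start coordinate of the window it falls in
window_start = (demo_df["variant_position"] // window_size) * window_size

per_window = demo_df.groupby(window_start)["encoded_genotype"].agg(
    n_sites="size",         # all sites in the window, including missing calls
    n_called="count",       # sites with a non-missing encoded genotype
    mean_alt_count="mean",  # NaN values are skipped by default
)
per_window["missing_fraction"] = 1 - per_window["n_called"] / per_window["n_sites"]
print(per_window)
```

For the real 100 kb region, a much larger window (for example a few kilobases) would likely be more informative than the 25 bp used here.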
