<a href="https://colab.research.google.com/github/Aksinhaa/ColabFold/blob/main/NGS_collab_basic_pop_gen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Population genetics is the study of how genetic variation is distributed within populations and how it changes over time. It connects evolution, genomics, and statistics, helping us understand how forces like mutation, migration, non-random mating, selection, and genetic drift shape genetic diversity.

Through the conversion of genomic data into population-genetic formats, the use of PCA to explore major structure, the interpretation of clustering patterns, and the modeling of ancestry with ADMIXTURE, learners acquire a solid understanding of the relationships between populations, the emergence of genetic differentiation, and the ways in which patterns of variation reflect evolutionary history.  When combined, these approaches offer a framework for distinguishing between different population groups, identifying admixture events, and exposing subtle structure that might not be apparent from raw sequencing data alone.

This section therefore provides a simple, beginner-friendly overview of some of the tools used in population genetics analysis. It introduces each tool in clear, accessible language so that anyone, regardless of prior experience, can understand what the tools do and why they are important.
It also prepares learners to interpret results correctly, troubleshoot problems more effectively, and apply these methods to their own datasets in the future.


This section covers:

a) Converting filtered VCF files into formats suitable for statistical analyses (e.g., `.bed`, `.ped`) using tools like PLINK

b) Performing Principal Component Analysis (PCA) to visualize population structure and sample clustering

c) Interpreting population-level genetic structure from the PCA plots

d) Running ADMIXTURE to explore population structure and ancestral relationships




### Prerequisites and Setup

Before starting, make sure you have a basic understanding of SNPs, VCF files, and simple command-line usage. No local installation is required because all steps will run inside **Google Colab**.

This notebook uses several population genetics tools (VCFtools, PLINK, ADMIXTURE, and RTG-tools). To keep the workflow clean and reproducible, we install everything inside a dedicated **conda environment**.

When you run the setup cell, the notebook will:

1. Install Miniconda
2. Create a population genetics environment
3. Install all required tools

In [None]:
# Miniconda installation and environment setup for Colab NGS Workshop

# Download and install Miniconda (skip if already installed)
!wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
!bash miniconda.sh -b -p /usr/local/miniconda

import sys, os
sys.path.append('/usr/local/miniconda/lib/python3.8/site-packages')
os.environ['PATH'] = "/usr/local/miniconda/bin:" + os.environ['PATH']

# Accept ToS for main and R conda channels
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create the pop_gen environment and install the bioinformatics tools into it
# (-y answers the confirmation prompt so the cell does not wait for input)
!conda create -y -n pop_gen -c bioconda -c conda-forge plink admixture pca rtg-tools



First, let's list all conda environments to confirm `pop_gen` is present.

In [None]:
%%bash
conda env list

Next, we will list the packages installed in the `pop_gen` environment to ensure all the tools are there.

In [None]:
%%bash
conda list -n pop_gen

---

## Brief Introduction to PCA (Principal Component Analysis)
First we will perform **Principal Component Analysis (PCA)**, a statistical method used to explore population structure. It reduces high-dimensional genotype data into a few key components that capture most of the genetic variation.

In population genetics, PCA helps us:

* Visualize genetic differences between individuals or populations
* Detect clusters corresponding to ancestry or population groups
* Identify outliers or mislabelled samples

Essentially, PCA transforms raw genotype data into a **2D or 3D plot**, showing how individuals relate genetically.
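
To make the idea concrete, here is a minimal sketch in plain Python that finds the first principal component of a toy genotype matrix by power iteration. The genotype values, the function name `first_pc`, and the iteration count are illustrative assumptions; PLINK uses its own, far more efficient implementation on real data.

```python
import math

def first_pc(genotypes, n_iter=200):
    """Return PC1 scores (one per individual) for a matrix of 0/1/2 allele counts."""
    n, m = len(genotypes), len(genotypes[0])
    # Center each SNP (column) on its mean allele count
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    # Power iteration on X^T X: repeatedly apply the matrix and renormalize
    v = [float(j + 1) for j in range(m)]  # arbitrary non-uniform start vector
    for _ in range(n_iter):
        scores = [sum(xi[j] * v[j] for j in range(m)) for xi in x]          # X v
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]  # X^T (X v)
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm == 0:
            raise ValueError("no variance along the start direction; try another start vector")
        v = [wj / norm for wj in w]

    # Project each individual onto the leading direction
    return [sum(xi[j] * v[j] for j in range(m)) for xi in x]

# Made-up toy data: rows are individuals, columns are SNPs.
# The first three rows imitate one population, the last three another.
toy_genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
]

scores = first_pc(toy_genotypes)
print([round(s, 2) for s in scores])
```

The two toy populations end up with PC1 scores of opposite sign, which is the same separation a PCA scatter plot shows visually. The overall sign of a principal component is arbitrary, so only the relative positions matter.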

---

## From VCF to PLINK Format

VCF files store SNP and genotype information, but PLINK's downstream analyses (and ADMIXTURE) work with PLINK's own file formats. We therefore convert the VCF into **PLINK binary format**, which consists of three files:

| File   | Purpose                                             |
| ------ | --------------------------------------------------- |
| `.bed` | Binary genotype data                                |
| `.bim` | SNP information (chromosome, position, alleles)     |
| `.fam` | Individual/sample information (IDs, sex, phenotype) |

**Workflow:**

1. Start with the filtered **VCF file**
2. Use PLINK to convert it into `.bed`, `.bim`, and `.fam` files
3. Perform PCA on the `.bed` dataset using PLINK

This step allows us to run PCA efficiently and generate plots of population structure.
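
As a small illustration of what the `.fam` file holds, the sketch below splits one line into the six standard fields listed in the table above. The function name `parse_fam_line` and the sample line are hypothetical and not taken from the workshop dataset.

```python
def parse_fam_line(line):
    """Split one whitespace-delimited .fam line into its six standard fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    keys = ["fid", "iid", "father", "mother", "sex", "phenotype"]
    return dict(zip(keys, fields))

# Hypothetical sample line: family ID, individual ID, father, mother, sex, phenotype
record = parse_fam_line("POP1 sample01 0 0 1 -9")
print(record["fid"], record["iid"], record["phenotype"])
```

All values are kept as strings here because IDs are free text; convert `sex` or `phenotype` to numbers only if your analysis needs them.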

---

# Example: Converting VCF to PLINK Format

We often start population genetics analysis by converting VCF files into **PLINK binary format** (`.bed`, `.bim`, `.fam`) for downstream analyses like PCA and ADMIXTURE.

Below is an example of how to run PLINK in a cell that starts with the `%%bash` magic, which lets us execute Bash commands directly in Colab.

```bash
%%bash
# Make `conda activate` available in this shell, then activate the environment containing PLINK
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate pop_gen

# Convert a VCF file to PLINK format
# Replace 'input_file.vcf' with your VCF filename
# Replace 'output_prefix' with your desired output prefix
plink --vcf input_file.vcf \
      --make-bed \
      --double-id \
      --allow-extra-chr \
      --out output_prefix
```
---



#  Creating a Project Directory & Preparing for PCA

To keep our analysis **organized and reproducible**, we will first create a separate directory to store all required files.

---

For this notebook, we will use a ready-to-use dataset hosted on **Zenodo**, which contains the VCF files required for the PCA analysis.
This ensures that everyone can **reproduce the workflow** without needing to generate raw sequencing data.

In [None]:
%%bash
mkdir -p pca

wget -P pca \
https://zenodo.org/records/15263700/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.fam \
https://zenodo.org/records/15263700/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.recode.vcf.gz \
https://zenodo.org/records/15263700/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.bed \
https://zenodo.org/records/15263700/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.bim


At this stage, we already have the PLINK binary files (`.bed`, `.bim`, `.fam`) that we downloaded in the previous step. These files contain all the genotype and sample information in a format suitable for PLINK.

By running PCA with PLINK, we generate two key output files:

1. `.eigenvec` → Contains the principal component scores (eigenvectors) for each individual. This file is used for plotting and visualizing genetic relationships.

2. `.eigenval` → Contains the eigenvalues corresponding to each principal component, which indicate how much genetic variation is explained by each component.

In short, this step transforms our genotype data into a form that allows us to visualize population structure and relatedness across individuals.

In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

# Activate the conda environment with PLINK
conda activate pop_gen

# Perform PCA on PLINK binary files
plink --bfile pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB \
      --pca 5 \
      --out pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB

## Install R and ggplot2 in Conda

Install R-base and the `ggplot2` R package into your existing `pop_gen` conda environment. This will allow us to use R for plotting within the notebook to visualise the PCA plots.


In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

conda activate pop_gen

# Install r-ggplot2 from the specified channels (r-base is pulled in as a dependency)
conda install -n pop_gen -c conda-forge -c bioconda r-ggplot2 -y







## Create R Script for PCA Plotting
Next, we save the R code for PCA plotting to a script file. The `%%writefile` magic command creates a new file named `pca_plot.R` in the `pca/` directory containing the code in the cell.


In [None]:
%%writefile pca/pca_plot.R
# Load libraries
library(ggplot2)

fam <- read.table(
  "pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.fam",
  header = FALSE
)

eigenvec <- read.table(
  "pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.eigenvec",
  header = FALSE
)

eigenval <- read.table(
  "pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.eigenval",
  header = FALSE
)

colnames(eigenvec) <- c("FID", "IID", paste0("PC", 1:5))

# Assign region/group from fam file
eigenvec$Region <- fam$V2

var_explained <- (eigenval / sum(eigenval)) * 100
pc1_var <- round(var_explained[1, 1], 2)
pc2_var <- round(var_explained[2, 1], 2)

# Custom color palette

region_colors <- c(
  "#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
  "#FF7F00", "#A65628", "#F781BF", "#999999"
)


# PCA Plot

p <- ggplot(eigenvec, aes(x = PC1, y = PC2, color = Region)) +
  geom_point(size = 3, alpha = 0.85) +
  scale_color_manual(values = region_colors) +
  labs(
    title = "PCA Plot (PC1 vs PC2)",
    x = paste0("PC1 (", pc1_var, "%)"),
    y = paste0("PC2 (", pc2_var, "%)"),
    color = "Region"
  ) +
  theme_bw(base_size = 14) +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold"),
    legend.position = "right"
  )


# Save output

ggsave("pca/pca_plot.png", plot = p, dpi = 300, width = 8, height = 6)
ggsave("pca/pca_plot.pdf", plot = p, width = 8, height = 6)



#  **Explanation of `pca_plot.R`**

---

## **1. Load required package**

```r
library(ggplot2)
```

You are using **ggplot2** to make the PCA scatterplot.

---

## **2. Load input files**

### **FAM file**

```r
fam <- read.table("pca/...noZSB.fam", header = FALSE)
```

* A PLINK `.fam` file contains sample metadata:
  **FID, IID, father, mother, sex, phenotype**
* You will later use `fam$V2` (the IID column) as a **region/group label**.

### **Eigenvector file**

```r
eigenvec <- read.table("pca/...noZSB.eigenvec", header = FALSE)
```

* Created from PLINK PCA (`plink --pca`)
* First two columns = FID, IID
* Remaining columns = PC1, PC2, PC3, …
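
The same column layout can be read in a few lines of Python. This is a minimal sketch that assumes a headerless, whitespace-delimited `.eigenvec` file with five PCs (matching `--pca 5`); the function name `parse_eigenvec` and the two sample lines are made up for illustration.

```python
def parse_eigenvec(lines, n_pcs=5):
    """Turn headerless .eigenvec lines into dicts with FID, IID, PC1..PCn."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2 + n_pcs:
            raise ValueError(f"expected {2 + n_pcs} columns, got {len(fields)}")
        record = {"FID": fields[0], "IID": fields[1]}
        # Remaining columns are the PC scores, in order
        for i, value in enumerate(fields[2:], start=1):
            record[f"PC{i}"] = float(value)
        records.append(record)
    return records

# Made-up example lines in the assumed layout
example_lines = [
    "POP1 s1 0.12 -0.03 0.01 0.00 0.02",
    "POP2 s2 -0.15 0.04 -0.02 0.01 0.00",
]
records = parse_eigenvec(example_lines)
print(records[0]["IID"], records[0]["PC1"], records[0]["PC2"])
```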

### **Eigenvalue file**

```r
eigenval <- read.table("pca/...noZSB.eigenval", header = FALSE)
```

* Contains eigenvalues
* Needed to compute **variance explained** by each PC.

---

## **3. Name the PCA columns**

```r
colnames(eigenvec) <- c("FID", "IID", paste0("PC", 1:5))
```

Renames columns for readability:

* PC1
* PC2
* PC3
* PC4
* PC5

(This assumes PLINK was run with `--pca 5`, as above; with a different number of PCs, the number of names will not match the number of columns.)

---

## **4. Add region / group labels**

```r
eigenvec$Region <- fam$V2
```

* This assumes the **2nd column** of `.fam` (the IID) can serve as a **population/region** label.
* The script uses this column to color points in the PCA, so if every IID is unique, each sample is treated as its own group.

---

## **5. Compute the % variance explained**

```r
var_explained <- (eigenval / sum(eigenval)) * 100
pc1_var <- round(var_explained[1, 1], 2)
pc2_var <- round(var_explained[2, 1], 2)
```

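* `eigenval` is a column of numbers (eigenvalues)
* Dividing each by the sum gives the proportion of variance
* Multiply by 100 → percentage
* Extract PC1 and PC2 values for axis labels
* Because PLINK was run with `--pca 5`, the `.eigenval` file holds only the first five eigenvalues, so these percentages are relative to the variance captured by those five PCs, not to the total genetic variance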

Example:
PC1 = 21.37%
PC2 = 12.89%
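
The same arithmetic, sketched in Python with made-up eigenvalues (the function name `percent_variance` is hypothetical). Note that the denominator is the sum of whatever eigenvalues are passed in, so the percentages are relative to those components only.

```python
def percent_variance(eigenvalues):
    """Return each eigenvalue as a percentage of the sum of all supplied eigenvalues."""
    total = sum(eigenvalues)
    return [round(100 * ev / total, 2) for ev in eigenvalues]

# Made-up eigenvalues, not taken from the workshop dataset
eigenvalues = [4.0, 3.0, 2.0, 1.0]
pct = percent_variance(eigenvalues)
print(pct)  # [40.0, 30.0, 20.0, 10.0]
```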

---

## **6. Define region colors**

```r
region_colors <- c("#E41A1C", "#377EB8", ...)
```

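This custom palette supports up to 8 different regions. With more than 8 distinct labels, `scale_color_manual()` stops with an "Insufficient values in manual scale" error, so extend the palette if your data has more groups.

---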

## **7. Create PCA plot**

```r
p <- ggplot(eigenvec, aes(x = PC1, y = PC2, color = Region)) +
  geom_point(size = 3, alpha = 0.85) +
  scale_color_manual(values = region_colors) +
  labs(
    title = "PCA Plot (PC1 vs PC2)",
    x = paste0("PC1 (", pc1_var, "%)"),
    y = paste0("PC2 (", pc2_var, "%)"),
    color = "Region"
  ) +
  theme_bw(base_size = 14) +
  ...
```
* Points colored by **population/region**
* Semi-transparent points (`alpha = 0.85`)
* Axis labels show **variance explained**
* Clean white background (`theme_bw`)
* Customized legend position and font sizes

---

## **8. Save the outputs**

```r
ggsave("pca/pca_plot.png", plot = p, dpi = 300, width = 8, height = 6)
ggsave("pca/pca_plot.pdf", plot = p, width = 8, height = 6)
```



Now that the R script is created, we will execute it using `Rscript` within the `pop_gen` conda environment to generate the PCA plots.



In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

# Activate the conda environment with R and ggplot2
conda activate pop_gen

# Execute the R script to generate PCA plots
Rscript pca/pca_plot.R


To visualise the graph, we can run the following command:

In [None]:
from IPython.display import Image, display

# Display the generated PCA plot
display(Image('pca/pca_plot.png'))



#  **Understanding ADMIXTURE Analysis**

ADMIXTURE is one of the most widely used tools in population genetics for exploring **ancestral composition** within individuals or populations. It helps us answer questions such as:

* *How many genetic clusters (K) exist in the dataset?*
* *What proportion of each individual’s genome comes from different ancestral sources?*
* *How similar or different are populations based on shared ancestry?*

---

## **What Does ADMIXTURE Do?**

ADMIXTURE takes your **PLINK binary files** (`.bed`, `.bim`, `.fam`) and estimates **ancestry proportions** for each individual assuming a chosen number of clusters **K**.

For example:

* **K = 2** → You assume two ancestral populations
* **K = 3** → Three ancestral populations, and so on

It outputs two important files:

### **1. `.Q` file**

* Contains *individual-level ancestry proportions*.
* Each row = one sample
* Each column = ancestry component (e.g., Cluster 1, Cluster 2, …)

### **2. `.P` file**

* Contains *allele frequencies* associated with each cluster.
* Used mainly for deeper population structure interpretation.
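
To illustrate the `.Q` layout, the sketch below reads a few rows, checks that each row's proportions sum to roughly 1, and reports the largest component per individual. The function names, the tolerance, and the example rows are assumptions chosen for illustration only.

```python
def parse_q_lines(lines, tol=1e-3):
    """Parse whitespace-delimited .Q rows into lists of ancestry proportions."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        props = [float(x) for x in line.split()]
        # Ancestry proportions for one individual should add up to about 1
        if abs(sum(props) - 1.0) > tol:
            raise ValueError(f"row does not sum to 1: {line!r}")
        rows.append(props)
    return rows

def dominant_component(props):
    """Return the 1-based index of the largest ancestry proportion."""
    return max(range(len(props)), key=lambda k: props[k]) + 1

# Made-up rows for K = 3 (one row per individual, one column per cluster)
q_lines = ["0.90 0.05 0.05", "0.10 0.80 0.10", "0.33 0.33 0.34"]
q_rows = parse_q_lines(q_lines)
print([dominant_component(row) for row in q_rows])  # [1, 2, 3]
```

The cluster numbering in a `.Q` file is arbitrary, so a "dominant component" only becomes meaningful once it is compared with sample labels or geography.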

---




##  **Prerequisite and Setup**

You must provide PLINK binary files:

```
XXXXXXXXXX.bed
XXXXXXXXXX.bim
XXXXXXXXXX.fam
```

In this notebook:

1. We already have `.bed/.bim/.fam` files prepared.
2. We will test several values of **K** (2 to 6).
3. We will record the **cross-validation (CV) error** for each K to identify the optimal model.
4. Finally, we will visualize the `.Q` file as a **barplot** showing ancestry proportions for each individual.

This helps us understand genetic clustering among our samples, and how much ancestral mixing has occurred.

---

# Script Overview

The command block below:

* Activates the `pop_gen` conda environment (which contains ADMIXTURE, PLINK, R, and ggplot2)
* Moves into the `pca/` directory where the PLINK files are stored
* Creates a new folder called `admixture/` for storing all output files
* Runs ADMIXTURE for K = 2 to 6
* Saves log files (`logK.out`) inside the `admixture/` directory
* Moves the `.Q` and `.P` files into the same directory for easy access

In [None]:
%%bash
# Initialize Conda
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate pop_gen

cd pca

# Create admixture directory
mkdir -p admixture

PREFIX="machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB"

# Run ADMIXTURE with cross-validation (--cv) for K=2..6 and save outputs
for K in {2..6}
do
    admixture --cv ${PREFIX}.bed $K -j4 | tee admixture/log${K}.out
    mv ${PREFIX}.${K}.Q admixture/
    mv ${PREFIX}.${K}.P admixture/
done



Before plotting, we install the `tidyr` and `reshape2` R packages into the `pop_gen` conda environment. `tidyr` is a dependency of the `admixture_plot.R` script that we create in `pca/admixture/` in the next step to visualise the ADMIXTURE results.

In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

# Activate the conda environment
conda activate pop_gen

# Install r-tidyr and r-reshape2 from conda-forge channel
conda install -n pop_gen -c conda-forge r-tidyr r-reshape2 -y

## Create ADMIXTURE Plotting R Script

Save the provided R code into a new script file named `admixture_plot.R` within the `pca/admixture` directory. This script will load necessary libraries, read the ADMIXTURE .Q file and the .fam file, process the data, and generate a grouped bar plot for K=3.


In [None]:
%%writefile pca/admixture/admixture_plot.R

# Load libraries
library(ggplot2)
library(tidyr)
library(dplyr)

# Define the base filename prefix used for PLINK and ADMIXTURE outputs
PREFIX <- "machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB"

# Define full paths for the .Q and .fam files for K=3
q_file_k3 <- paste0("pca/admixture/", PREFIX, ".3.Q")
fam_file <- paste0("pca/", PREFIX, ".fam")

# Check if files exist
if (!file.exists(q_file_k3)) stop("Cannot find .Q file for K=3: ", q_file_k3)
if (!file.exists(fam_file)) stop("Cannot find .fam file: ", fam_file)

# Read FAM file to get individual IDs and group information
fam <- read.table(fam_file, header = FALSE)
# Create unique individual IDs by combining FID (V1) and IID (V2)
unique_ind_ids <- paste(fam$V1, fam$V2, sep="_")

# Read ADMIXTURE Q file for K=3
q3 <- read.table(q_file_k3, header = FALSE)

# Check if the number of individuals matches
if (nrow(q3) != length(unique_ind_ids)) {
  stop("Number of individuals in Q file does not match .fam file.")
}

# Assign unique IDs to the Q matrix
q3$ID <- unique_ind_ids

# Assign column names for ancestry proportions
colnames(q3)[1:(ncol(q3)-1)] <- paste0("Anc", 1:(ncol(q3)-1))

# Extract group info from sample ID (assuming group is after the last underscore in the ID)
q3$Group <- gsub(".*_([^_]+)$", "\\1", q3$ID) # Extracts the last part after an underscore

# Reshape data for ggplot
q3_long <- q3 %>%
  pivot_longer(
    cols = starts_with("Anc"),
    names_to = "Ancestry",
    values_to = "Proportion"
  )

# Sort individuals by group and then by ID for consistent plotting
q3_long <- q3_long %>%
  arrange(Group, ID) %>%
  mutate(ID = factor(ID, levels = unique(ID))) # Ensure ID is a factor with sorted levels

# Plot
p <- ggplot(q3_long, aes(x = ID, y = Proportion, fill = Ancestry)) +
  geom_bar(stat = "identity", width = 1) +
  facet_grid(~Group, scales = "free_x", space = "free_x") + # Facet by group
  theme_minimal() +
  labs(x = "Individuals", y = "Ancestry Proportion", title = "ADMIXTURE Plot (K=3)") +
  theme(
    axis.text.x = element_blank(), # Hide x-axis text as it would be too crowded
    axis.ticks.x = element_blank(),
    panel.spacing = unit(0.5, "lines"),
    strip.text.x = element_text(angle = 0, face = "bold"), # Keep group labels readable
    legend.position = "right"
  ) +
  scale_fill_brewer(palette = "Set1") # ColorBrewer qualitative palette (Set1 is not colorblind-safe; "Dark2" or "Set2" are alternatives)

# Save as PNG to the correct directory
ggsave("pca/admixture/admixture_K3_grouped.png", plot = p, width = 12, height = 6, dpi = 300)

message("ADMIXTURE plot for K=3 saved as pca/admixture/admixture_K3_grouped.png")


#  **Explanation of the admixture_plot.R script**

---

## **1. Construct full paths to the input files**

```r
q_file_k3 <- paste0("pca/admixture/", PREFIX, ".3.Q")
fam_file <- paste0("pca/", PREFIX, ".fam")
```

* `.3.Q` = ADMIXTURE output for **K = 3**
* `.fam` = PLINK sample metadata
  The script now knows exactly where the files are located.

---

## **2. Check that the files actually exist**

```r
if (!file.exists(q_file_k3)) stop("Cannot find .Q file for K=3")
if (!file.exists(fam_file)) stop("Cannot find .fam file")
```

Stops the script immediately if a file is missing, avoiding downstream errors.

---

## **3. Load the FAM file**

```r
fam <- read.table(fam_file, header = FALSE)
```

The `.fam` file contains:

```
FID IID father mother sex phenotype
```

You only need FID and IID to identify individuals.

---

## **4. Create unique IDs for each individual**

```r
unique_ind_ids <- paste(fam$V1, fam$V2, sep="_")
```

Each ID becomes `FID_IID`.
This guarantees a unique identifier for every sample and makes it easy to match them with the ADMIXTURE Q data.

---

## **5. Read the ADMIXTURE Q file**

```r
q3 <- read.table(q_file_k3, header = FALSE)
```

Each row represents one individual, and each column is the ancestry proportion for one ancestral population (K = 3 → 3 columns).

---

## **6. Make sure FAM and Q files match**

```r
if (nrow(q3) != length(unique_ind_ids)) {
  stop("Number of individuals in Q file does not match .fam file.")
}
```

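If the numbers don't match, something is wrong, e.g., the `.Q` file was produced from a different set of PLINK files. This check compares row counts only; it relies on ADMIXTURE writing the `.Q` rows in the same order as the `.fam` file.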

---

## **7. Add the unique individual IDs to the Q dataframe**

```r
q3$ID <- unique_ind_ids
```

Now ancestry values and sample IDs are in the same table.

---

## **8. Rename ancestry columns**

```r
colnames(q3)[1:(ncol(q3)-1)] <- paste0("Anc", 1:(ncol(q3)-1))
```

Creates column names:

* Anc1
* Anc2
* Anc3

---

## **9. Extract “group” information from sample names**

```r
q3$Group <- gsub(".*_([^_]+)$", "\\1", q3$ID)
```

This takes the *last part* of the ID.
Example:
`Tiger_IND` → Group = `IND`
This allows individuals to be grouped by population in the plot.
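
The same regular expression behaves identically in Python, which makes it easy to check what it does with different ID shapes. This is a small sketch with a hypothetical helper name `extract_group`; apart from `Tiger_IND`, the example IDs are invented.

```python
import re

def extract_group(sample_id):
    """Return the text after the last underscore, or the ID unchanged if there is none."""
    return re.sub(r".*_([^_]+)$", r"\1", sample_id)

print(extract_group("Tiger_IND"))     # IND
print(extract_group("A_B_C"))         # C
print(extract_group("NoUnderscore"))  # NoUnderscore
```

The last case matters in practice: if an ID contains no underscore, the pattern does not match and the whole ID is returned, so every such sample would become its own "group" in the plot.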

---

## **10. Convert from wide format to long format**

```r
q3_long <- q3 %>%
  pivot_longer(
    cols = starts_with("Anc"),
    names_to = "Ancestry",
    values_to = "Proportion"
  )
```

Wide format:

```
ID  Anc1  Anc2  Anc3
```

Long format:

```
ID  Ancestry  Proportion
```

This format is required for stacked bar plots.
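
For readers less familiar with `pivot_longer`, the reshaping can be sketched in plain Python. The function name `to_long` and the two example individuals are made up; the column names follow the `Anc1`..`Anc3` convention used in the script.

```python
def to_long(rows, id_key="ID"):
    """Turn one wide row per individual into one long row per (individual, ancestry)."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith("Anc"):  # same selection as starts_with("Anc")
                long_rows.append(
                    {id_key: row[id_key], "Ancestry": key, "Proportion": value}
                )
    return long_rows

# Made-up wide table: two individuals, three ancestry columns
wide_rows = [
    {"ID": "ind1", "Anc1": 0.7, "Anc2": 0.2, "Anc3": 0.1},
    {"ID": "ind2", "Anc1": 0.1, "Anc2": 0.1, "Anc3": 0.8},
]
long_rows = to_long(wide_rows)
for r in long_rows:
    print(r["ID"], r["Ancestry"], r["Proportion"])
```

Two individuals with three ancestry columns become six long rows, one per bar segment in the stacked plot.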

---

## **11. Sort individuals by group and ID**

```r
q3_long <- q3_long %>%
  arrange(Group, ID) %>%
  mutate(ID = factor(ID, levels = unique(ID)))
```

Ensures that:

* individuals appear grouped together,
* bars are plotted in a consistent and meaningful order.

---

## **12. Create the ADMIXTURE plot**

```r
p <- ggplot(q3_long, aes(x = ID, y = Proportion, fill = Ancestry)) +
  geom_bar(stat = "identity", width = 1) +
  facet_grid(~Group, scales = "free_x", space = "free_x") +
  theme_minimal() +
  ...
```

### What this does:

* **Each bar = one individual**
* **Stacked sections = ancestry proportions (Anc1–Anc3)**
* **facet_grid(~Group)** groups individuals by population
* **x-axis labels removed** (too crowded)
* **color palette = Set1** (a ColorBrewer qualitative palette; it is not colorblind-safe, so consider `Dark2` or `Set2` if that matters)

This produces a clean, population-structured ancestry barplot.

---

## **13. Save the plot**

```r
ggsave("pca/admixture/admixture_K3_grouped.png", plot = p, width = 12, height = 6, dpi = 300)
```

Saves the final figure as a high-resolution PNG.

---



## Execute ADMIXTURE Plotting R Script

Run the newly created `admixture_plot.R` script using `Rscript` within the activated `pop_gen` conda environment. This will generate the ADMIXTURE plot as a PNG image.


In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

# Activate the conda environment with R and tidyr
conda activate pop_gen

# Execute the R script to generate ADMIXTURE plots
Rscript pca/admixture/admixture_plot.R

To visualise the graph, we can run the following command:

In [None]:
from IPython.display import Image, display

# Display the generated faceted ADMIXTURE plot
display(Image('pca/admixture/admixture_K3_grouped.png'))

# PCA & ADMIXTURE Workshop – Tutorial Questions

Below are task-based questions that you can try to solve:

---

## **1. Compare PCA Clustering With ADMIXTURE Ancestry Components**

Generate a PCA plot (PC1 vs PC2) **and** an ADMIXTURE plot for **K = 4**.
Then answer:

* Do the genetic clusters in PCA correspond to the ancestry components in ADMIXTURE?
* Are individuals with mixed ancestry (ADMIXTURE) positioned between clusters in PCA?
* Identify at least one individual who appears admixed—describe where they fall on the PCA.

---

## **2. Create a PC2 vs PC3 Plot and Interpret Structure**

Modify your PCA script to plot **PC2 vs PC3**, then compare this plot to the standard PC1 vs PC2.

Questions:

* Do PC2 and PC3 reveal additional structure not visible in PC1 vs PC2?
* Does cluster separation improve or worsen on PC2–PC3?
* Can you connect any PCA patterns to ADMIXTURE components?

---

## **3. Identify Outliers and Verify Them Across Both Methods**

From your PCA (PC1 vs PC2), identify any individuals that appear separated from the main clusters.

Then:

* Check whether these individuals show unusual ancestry proportions in ADMIXTURE.
* Are they true biological outliers, hybrids, or possibly mislabeled samples?

---

## **4. Determine the Optimal K and Compare It to PCA**

Run ADMIXTURE for **K = 2–6** and record the cross-validation (CV) values.

Then answer:

* Which value of K best fits the data?
* Does the optimal K align with the number of clusters observed in PCA?
* How does increasing K affect interpretation of population structure?


# Answer key — Questions 1–4

# 1) Compare PCA clustering with ADMIXTURE (PC1 vs PC2, ADMIXTURE K=4)

**How to produce figures**

* PCA (PLINK):

```bash
plink --bfile prefix --pca 10 --out prefix
# produces prefix.eigenvec and prefix.eigenval
```

* ADMIXTURE:

```bash
admixture --cv prefix.bed 4 | tee log4.txt
# produces prefix.4.Q
```

* Plot using your R scripts (PCA: PC1 vs PC2; ADMIXTURE: prefix.4.Q).

**What to look for**

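* **Concordant clusters**: If PCA clusters correspond to ADMIXTURE components, then groups of individuals forming a tight cluster on the PC1–PC2 scatter should also show a single dominant ancestry color in the K=4 barplot.
* **Admixed individuals**: Samples with mixed ancestry proportions in ADMIXTURE typically fall between the corresponding clusters on the PCA plot rather than inside one of them.

---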

# 2) Create a PC2 vs PC3 plot and interpret structure

**How to produce the plot**

* In R: swap axes or change columns used in ggplot:

```r
p <- ggplot(eigenvec, aes(x = PC2, y = PC3, color = Region)) + geom_point()
ggsave("pca/pca_PC2_PC3.png", p, width=7, height=5)
```
**How to interpret relative to ADMIXTURE**

* If PC2–PC3 exposes a split that corresponds to a subcomponent in ADMIXTURE (e.g., Anc2 splits into two subgroups when K increases), that supports real substructure.
* If PC2–PC3 shows separation not mirrored by ADMIXTURE K=3, try a higher value of K.

---

# 3) Identify outliers and verify them across both methods

**How to identify outliers in PCA**

* Visual inspection: find points isolated from main clusters.
* Numeric rule-of-thumb: flag samples whose PC1 or PC2 score lies more than 3 standard deviations from the mean.

R example to flag:

```r
pc_mean <- colMeans(eigenvec[, c("PC1","PC2")])
pc_sd <- apply(eigenvec[, c("PC1","PC2")], 2, sd)
outliers <- eigenvec[
  abs(eigenvec$PC1 - pc_mean["PC1"]) > 3*pc_sd["PC1"] |
  abs(eigenvec$PC2 - pc_mean["PC2"]) > 3*pc_sd["PC2"], ]
```
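
The same rule of thumb can be sketched in Python for readers who prefer it. The function name `flag_outliers`, the sample IDs, and the PC values are invented for illustration; `statistics.stdev` is the sample standard deviation, matching R's `sd()`.

```python
from statistics import mean, stdev

def flag_outliers(samples, n_sd=3.0):
    """Return IDs whose PC1 or PC2 lies more than n_sd standard deviations from the mean.

    samples: dict mapping sample ID -> (PC1, PC2)
    """
    flagged = set()
    for axis in (0, 1):  # 0 = PC1, 1 = PC2
        values = [pcs[axis] for pcs in samples.values()]
        mu, sd = mean(values), stdev(values)
        for sample_id, pcs in samples.items():
            if sd > 0 and abs(pcs[axis] - mu) > n_sd * sd:
                flagged.add(sample_id)
    return sorted(flagged)

# Made-up scores: 20 tightly clustered samples plus one far away on PC1
samples = {
    f"s{i:02d}": (0.01 if i % 2 else -0.01, -0.01 if i % 2 else 0.01)
    for i in range(20)
}
samples["s_out"] = (1.0, 0.0)

print(flag_outliers(samples))  # ['s_out']
```

With very few samples, no point can exceed 3 standard deviations no matter how extreme it is, so treat this as a screening aid alongside visual inspection rather than a strict test.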

**Verify them in ADMIXTURE**

* Check the `prefix.3.Q` rows for those sample IDs:

  * Are they extreme in ancestry fractions? (e.g., ~100% one component) — might be reference/population-specific samples.
  * Are they mixed (~50/50)? — could be recent admixed individuals.
  * Are they unusual compared to their labeled population? — might be mislabeling or migration.
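
These checks can be turned into a rough first-pass label per `.Q` row. The sketch below is only an illustration: the function name `classify_ancestry` and both cutoffs (0.99 and 0.8) are arbitrary assumptions, not values recommended by ADMIXTURE, and should be adapted to your data.

```python
def classify_ancestry(props, pure_cutoff=0.99, admixed_cutoff=0.8):
    """Label one individual's ancestry proportions using two hypothetical cutoffs."""
    top = max(props)
    if top >= pure_cutoff:
        return "single-source"         # essentially one component
    if top < admixed_cutoff:
        return "admixed"               # no component clearly dominates
    return "mostly single-source"      # dominated by one component, minor contributions

# Made-up K = 3 rows
print(classify_ancestry([1.0, 0.0, 0.0]))  # single-source
print(classify_ancestry([0.5, 0.5, 0.0]))  # admixed
print(classify_ancestry([0.9, 0.1, 0.0]))  # mostly single-source
```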

**Re-run PCA without outliers**

1. Remove outlier(s) from the genotype dataset (create new PLINK files) and re-run `plink --pca`.
2. Compare variance explained and cluster tightness.

* If clusters become tighter and PC variance shifts, outliers influenced the original PCA.

---

# 4) Determine optimal K (K = 2–6) and compare to PCA

**How to run ADMIXTURE with CV and record results**

```bash
for K in 2 3 4 5 6; do
  admixture --cv prefix.bed $K 2>&1 | tee log${K}.txt
done
# Extract CV values
grep -h "CV error" log*.txt
```

* Record CV error for each K. The K with the *lowest CV error* is typically the best-supported model (but consider biological significance too).
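
Once the logs are collected, picking the lowest CV error can be automated. This is a minimal sketch that assumes each log contains a line of the form `CV error (K=3): 0.41`; the function names are hypothetical and the numbers are illustrative values, not results from the workshop dataset.

```python
import re

# Assumed log line format: "CV error (K=3): 0.41"
CV_PATTERN = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def parse_cv_errors(log_text):
    """Return a dict mapping K -> CV error for every matching line in the text."""
    return {int(k): float(err) for k, err in CV_PATTERN.findall(log_text)}

def best_k(cv_errors):
    """Return the K with the lowest CV error."""
    return min(cv_errors, key=cv_errors.get)

# Illustrative log excerpt (concatenated output of several runs)
log_text = """\
CV error (K=2): 0.58
CV error (K=3): 0.41
CV error (K=4): 0.43
CV error (K=5): 0.46
"""

cv_errors = parse_cv_errors(log_text)
print(cv_errors)
print("Lowest CV error at K =", best_k(cv_errors))
```

In a notebook, `log_text` could be built by reading and concatenating the `log*.txt` files; the lowest CV error is a starting point, to be weighed against biological plausibility as discussed below.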

**What to report and interpret**

* **Optimal K**: State which K had lowest CV.

  * *Example (illustrative values):* “CV error: K2=0.58, K3=0.41, K4=0.43, K5=0.46 → K=3 lowest → choose K=3.”
* **Compare to PCA**:

  * If PCA shows N well-separated clusters, optimal K should correspond roughly to N.

* **Effect of increasing K**:

  * A higher K may reveal substructure but can also produce tiny components that are noise or population-specific drift. Watch for components that only appear in 1–2 samples.


**How to justify biological choice of K**

* Use CV as an objective metric.
* Inspect barplots across K values: prefer K that yields interpretable, geographically meaningful components rather than many tiny components.
* Cross-check with PCA: number and separation of PCA clusters, and whether additional PCs explain meaningful variation.


---