<a href="https://colab.research.google.com/github/Aksinhaa/ColabFold/blob/main/NGS_collab_basic_pop_gen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Population genetics is the study of how genetic variation is distributed within populations and how it changes over time. It connects evolution, genomics, and statistics, helping us understand how forces like mutation, migration, selection, and genetic drift shape genetic diversity.

This section provides a simple, beginner-friendly overview of some of the tools used in population genetics analysis. It introduces each tool in clear, accessible language so that anyone regardless of prior experience can understand what the tools do and why they are important.
It also prepares learners to interpret results correctly, troubleshoot problems more effectively, and apply these methods to their own datasets in the future.


This section covers:

a) Converting filtered VCF files into formats suitable for statistical analyses (e.g., `.bed`, `.ped`) using tools like PLINK

b) Performing Principal Component Analysis (PCA) to visualize population structure and sample clustering

c) Interpreting population-level genetic structure from the PCA plots

d) Running ADMIXTURE to explore population structure and ancestral relationships

e) Estimating heterozygosity using RTG-tools



### Prerequisites and Setup

Before starting, make sure you have a basic understanding of SNPs, VCF files, and simple command-line usage. No local installation is required because all steps will run inside **Google Colab**.

This notebook uses several population genetics tools (PLINK, ADMIXTURE, and RTG-tools). To keep the workflow clean and reproducible, we install everything inside a dedicated **conda environment**.

When you run the setup cell, the notebook will:

1. Install Miniconda
2. Create a population genetics environment
3. Install all required tools

In [None]:
# Miniconda installation and environment setup for Colab NGS Workshop

# Download and install Miniconda (skip if already installed)
!wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
!bash miniconda.sh -b -p /usr/local/miniconda

import sys, os
sys.path.append('/usr/local/miniconda/lib/python3.8/site-packages')
os.environ['PATH'] = "/usr/local/miniconda/bin:" + os.environ['PATH']

# Accept ToS for main and R conda channels
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Install necessary bioinformatics tools into the environment
# -y answers the confirmation prompt, which cannot be typed in a Colab cell
!conda create -y -n pop_gen -c bioconda -c conda-forge plink admixture pca rtg-tools



First, let's list all conda environments to confirm `pop_gen` is present.

In [None]:
!conda env list

Next, we will list the packages installed in the `pop_gen` environment to ensure all the tools are there.

In [None]:
!conda list -n pop_gen

---

## Brief Introduction to PCA (Principal Component Analysis)
We begin with **Principal Component Analysis (PCA)**, a statistical method used to explore population structure. It reduces high-dimensional genotype data into a few key components that capture most of the genetic variation.

In population genetics, PCA helps us:

* Visualize genetic differences between individuals or populations
* Detect clusters corresponding to ancestry or population groups
* Identify outliers or mislabelled samples

Essentially, PCA transforms raw genotype data into a **2D or 3D plot**, showing how individuals relate genetically.
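
To make the idea concrete, here is a minimal sketch of PCA on a toy genotype matrix, written in plain Python. The matrix is hypothetical (six individuals, five SNPs, coded as alternate-allele counts), and the power-iteration routine is only an illustration of the principle; PLINK uses its own, far more efficient implementation on a standardized relationship matrix.

```python
import math

# Hypothetical genotype matrix: rows = individuals, columns = SNPs,
# values = number of alternate alleles (0, 1 or 2).
genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [2, 2, 2, 0, 1],
]


def center_columns(matrix):
    """Subtract each SNP's mean so every column has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def first_pc_scores(matrix, iterations=200):
    """Approximate PC1 (one score per individual) by power iteration."""
    x = center_columns(matrix)
    n = len(x)
    # Individual-by-individual similarity matrix (X times X transposed).
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero start
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    return vec


pc1 = first_pc_scores(genotypes)
for index, score in enumerate(pc1, start=1):
    print(f"individual {index}: PC1 = {score:+.3f}")
```

In this toy example the first three and the last three individuals receive PC1 scores of opposite sign, which is the kind of separation a PCA plot makes visible.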

---

## From VCF to PLINK Format

VCF files contain SNP and genotype information but cannot be used directly for PCA in PLINK. Therefore, we need to convert VCF files into **PLINK format**, which consists of three files:

| File   | Purpose                                             |
| ------ | --------------------------------------------------- |
| `.bed` | Binary genotype data                                |
| `.bim` | SNP information (chromosome, position, alleles)     |
| `.fam` | Individual/sample information (IDs, sex, phenotype) |

**Workflow:**

1. Start with the filtered **VCF file**
2. Use PLINK to convert it into `.bed`, `.bim`, and `.fam` files
3. Perform PCA on the `.bed` dataset using PLINK

This step allows us to run PCA efficiently and generate plots of population structure.
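
The `.fam` and `.bim` files are plain whitespace-separated text with six columns each, so they are easy to inspect. The sketch below parses one line of each; the sample lines, IDs, and class names are hypothetical and only illustrate the column layout.

```python
from typing import NamedTuple


class FamRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str        # 1 = male, 2 = female, 0 = unknown
    phenotype: str


class BimRecord(NamedTuple):
    chromosome: str
    variant_id: str
    genetic_distance: float  # in centimorgans; often 0 when unknown
    position: int            # base-pair coordinate
    allele1: str
    allele2: str


def parse_fam_line(line):
    """Split one .fam line into its six fields."""
    return FamRecord(*line.split())


def parse_bim_line(line):
    """Split one .bim line and convert the numeric columns."""
    chrom, variant_id, cm, bp, a1, a2 = line.split()
    return BimRecord(chrom, variant_id, float(cm), int(bp), a1, a2)


# Hypothetical example lines, not taken from the workshop dataset.
fam = parse_fam_line("BEN01 BEN01 0 0 0 -9")
bim = parse_bim_line("E2 E2:12345 0 12345 A G")
print(fam.individual_id, bim.chromosome, bim.position)
```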

---

# Example: Converting VCF to PLINK Format

We often start population genetics analysis by converting VCF files into **PLINK binary format** (`.bed`, `.bim`, `.fam`) for downstream analyses like PCA and ADMIXTURE.

Below is an example of how to run PLINK from a `%%bash` cell, which lets us execute Bash commands directly in Colab.

```bash
%%bash
# Initialize conda for this shell, then activate the environment containing PLINK
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate pop_gen

# Convert a VCF file to PLINK format
# Replace 'input_file.vcf' with your VCF filename
# Replace 'output_prefix' with your desired output prefix
plink --vcf input_file.vcf \
      --make-bed \
      --double-id \
      --allow-extra-chr \
      --out output_prefix
```
---



#  Creating a Project Directory & Preparing for PCA

To keep our analysis **organized and reproducible**, we will first create a separate directory to store all required files.

---

## Downloading Example Data

For this notebook, we will use a ready-to-use dataset hosted on **Zenodo**, which contains the VCF files required for the PCA analysis.
This ensures that everyone can **reproduce the workflow** without needing to generate raw sequencing data.

In [None]:
# Create the directory if it doesn't exist
!mkdir -p pca

# Download the VCF file into the created directory
!wget -P pca \
https://zenodo.org/records/15263700/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.fam \
https://zenodo.org/records/15263700/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.recode.vcf.gz \
https://zenodo.org/records/15263700/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.bed \
https://zenodo.org/records/15263700/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.bim

At this stage, we already have the PLINK binary files (`.bed`, `.bim`, `.fam`) from the download step above, so no VCF conversion is needed here. These files contain all the genotype and sample information in a format suitable for PLINK.

By running PCA with PLINK, we generate two key output files:

1. eigenvec → Contains the principal component scores (eigenvectors) for each individual. This file is used for plotting and visualizing genetic relationships.

2. eigenval → Contains the eigenvalues corresponding to each principal component, which indicate how much genetic variation is explained by each component.

In short, this step transforms our genotype data into a form that allows us to visualize population structure and relatedness across individuals.
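
As a rough illustration of how the eigenvalues are used, the sketch below converts a list of eigenvalues into percentage shares, which are often added to PCA axis labels. The values are hypothetical. Note also that with `--pca 5` only the top five eigenvalues are written, so the shares here are relative to those five components, not to the total genetic variance.

```python
def parse_eigenval(text):
    """Read one eigenvalue per line, as in a PLINK .eigenval file."""
    return [float(line) for line in text.splitlines() if line.strip()]


def variance_shares(eigenvalues):
    """Percentage of the summed eigenvalues carried by each component."""
    total = sum(eigenvalues)
    return [100 * value / total for value in eigenvalues]


# Hypothetical file content; in the notebook this would be read from
# the .eigenval file that PLINK writes next to the .eigenvec file.
eigenval_text = "4.0\n3.0\n2.0\n0.5\n0.5\n"

shares = variance_shares(parse_eigenval(eigenval_text))
for index, share in enumerate(shares, start=1):
    print(f"PC{index}: {share:.1f}% of the top-5 total")
```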

In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

# Activate the conda environment with PLINK
conda activate pop_gen

# Perform PCA on PLINK binary files
plink --bfile pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB \
      --pca 5 \
      --out pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB

## Install R and ggplot2 in Conda

Install R-base and the `ggplot2` R package into your existing `pop_gen` conda environment. This will allow us to use R for plotting within the notebook to visualise the PCA plots.


In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

conda activate pop_gen

# Install r-ggplot2 (r-base is pulled in as a dependency)
conda install -n pop_gen -c conda-forge -c bioconda r-ggplot2 -y



Next, list the packages in the `pop_gen` environment again to confirm that R (`r-base`) and `r-ggplot2` were installed.



In [None]:
!conda list -n pop_gen

## Create R Script for PCA Plotting
We use the `%%writefile` magic command to save the R code below as a new file named `pca_plot.R` in the `pca/` directory.


In [None]:
%%writefile pca/pca_plot.R
# Load ggplot2
library(ggplot2)

# Read FAM file from PCA directory
fam <- read.table(
  "pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.fam",
  header = FALSE
)

# Read eigenvec file from PCA directory
eigenvec <- read.table(
  "pca/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB.eigenvec",
  header = FALSE
)

# Assign column names to eigenvec
colnames(eigenvec) <- c("FID", "IID", paste0("PC", 1:5))

# Add region info from FAM (column 2)
eigenvec$Region <- fam$V2

# PCA Plot (PC1 vs PC2)
p <- ggplot(eigenvec, aes(x = PC1, y = PC2, color = Region)) +
  geom_point(size = 3, alpha = 0.8) +
  theme_minimal() +
  labs(
    title = "PCA Plot (PC1 vs PC2)",
    x = "PC1",
    y = "PC2",
    color = "Region"
  )

# Save outputs
ggsave("pca/pca_plot.png", plot = p, dpi = 300)
ggsave("pca/pca_plot.pdf", plot = p)


Now that the R script is created, we will execute it using `Rscript` within the `pop_gen` conda environment to generate the PCA plots.



In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

# Activate the conda environment with R and ggplot2
conda activate pop_gen

# Execute the R script to generate PCA plots
Rscript pca/pca_plot.R




#  **Understanding ADMIXTURE Analysis**

ADMIXTURE is one of the most widely used tools in population genetics for exploring **ancestral composition** within individuals or populations. It helps us answer questions such as:

* *How many genetic clusters (K) exist in the dataset?*
* *What proportion of each individual’s genome comes from different ancestral sources?*
* *How similar or different are populations based on shared ancestry?*

---

## **What Does ADMIXTURE Do?**

ADMIXTURE takes your **PLINK binary files** (`.bed`, `.bim`, `.fam`) and estimates **ancestry proportions** for each individual assuming a chosen number of clusters **K**.

For example:

* **K = 2** → You assume two ancestral populations
* **K = 3** → Three ancestral populations, and so on

It outputs two important files:

### **1. `.Q` file**

* Contains *individual-level ancestry proportions*.
* Each row = one sample
* Each column = ancestry component (e.g., Cluster 1, Cluster 2, …)

### **2. `.P` file**

* Contains *allele frequencies* associated with each cluster.
* Used mainly for deeper population structure interpretation.
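
To show what a `.Q` file looks like in practice, here is a small sketch that parses Q-style rows, checks that each individual's proportions sum to roughly 1, and reports the largest component. The numbers are hypothetical, for K = 3 and three individuals.

```python
def parse_q(text):
    """Parse .Q-style text: one row per individual, one column per cluster."""
    return [[float(value) for value in line.split()]
            for line in text.splitlines() if line.strip()]


def dominant_cluster(row):
    """Return the 1-based index of the largest ancestry proportion."""
    return max(range(len(row)), key=row.__getitem__) + 1


# Hypothetical content of a K=3 .Q file (not real results).
q_text = """\
0.90 0.05 0.05
0.10 0.80 0.10
0.33 0.33 0.34
"""

q_rows = parse_q(q_text)
for index, row in enumerate(q_rows, start=1):
    # Proportions for one individual should add up to about 1.
    assert abs(sum(row) - 1.0) < 1e-6
    print(f"individual {index}: mostly cluster {dominant_cluster(row)}")
```

An individual such as the third one, with nearly equal proportions, would be read as admixed under this model and not as belonging clearly to any single cluster.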

---




##  **Prerequisite and Setup**

You must provide PLINK binary files:

```
XXX.bed
XXX.bim
XXX.fam
```

In this notebook:

1. We already have `.bed/.bim/.fam` files prepared.
2. We will test several values of **K** (here, 2 to 4).
3. We will record the **cross-validation (CV) error** for each K to identify the optimal model.
4. Finally, we will visualize the `.Q` file as a **barplot** showing ancestry proportions for each individual.

This helps us understand genetic clustering among our samples, and how much ancestral mixing has occurred.
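
One way to compare values of K is to collect the CV error that ADMIXTURE prints when it is run with the `--cv` option and pick the K with the lowest error. The sketch below assumes log lines of the form `CV error (K=3): 0.52`; the log snippets and error values are hypothetical.

```python
import re

# Pattern for the CV error line printed by ADMIXTURE when run with --cv.
CV_PATTERN = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_texts):
    """Map each K to its CV error, given the text of one log per run."""
    errors = {}
    for text in log_texts:
        match = CV_PATTERN.search(text)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return errors


# Hypothetical log excerpts; in the notebook these would be read from
# the log files saved for each K.
logs = [
    "Summary:\nCV error (K=2): 0.61\n",
    "Summary:\nCV error (K=3): 0.52\n",
    "Summary:\nCV error (K=4): 0.58\n",
]

cv_errors = parse_cv_errors(logs)
best_k = min(cv_errors, key=cv_errors.get)
print(cv_errors, "-> lowest CV error at K =", best_k)
```

The lowest CV error is a useful guide and not a definitive answer; nearby values of K are usually worth plotting as well.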

---

# Script Overview

The command block below:

* Activates the `pop_gen` conda environment (which contains ADMIXTURE, PLINK, R, and ggplot2)
* Moves into the `pca/` directory where the PLINK files are stored
* Creates a new folder called `admixture/` for storing all output files
* Runs ADMIXTURE for K = 2, 3, and 4
* Saves log files (`logK.out`) inside the `admixture/` directory
* Moves the `.Q` and `.P` files into the same directory for easy access

In [None]:
%%bash
# Initialize Conda
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate pop_gen

cd pca

# Create admixture directory
mkdir -p admixture

PREFIX="machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB"

# Run ADMIXTURE for K=2..4 and save outputs
for K in {2..4}
do
    # --cv adds the cross-validation error to the log for each K
    admixture --cv ${PREFIX}.bed $K -j4 | tee admixture/log${K}.out
    mv ${PREFIX}.${K}.Q admixture/
    mv ${PREFIX}.${K}.P admixture/
done



Before plotting the ADMIXTURE results, we install the `tidyr` R package into the `pop_gen` conda environment. The plotting script `admixture_plot.R`, created in the next step, depends on it.

In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

# Activate the conda environment
conda activate pop_gen

# Install r-tidyr from conda-forge channel
conda install -n pop_gen -c conda-forge r-tidyr -y

## Create ADMIXTURE Plotting R Script

Save the R code below into a new script file named `admixture_plot.R` within the `pca/admixture` directory. This script loads the necessary libraries, reads the ADMIXTURE `.Q` file and the `.fam` file, processes the data, and generates a stacked bar plot for K=3, faceted by region.


In [None]:
%%writefile pca/admixture/admixture_plot.R
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)

# Define the prefix for ADMIXTURE output files
PREFIX <- "machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_noIndels_missing_mm0.6_meandepth95percentile_noZSB"

# Load the .Q file for K=3
q_file <- paste0("pca/admixture/", PREFIX, ".3.Q")
q_data <- read.table(q_file, header = FALSE)

# Load the .fam file to get individual IDs and group information
fam_file <- paste0("pca/", PREFIX, ".fam")
fam_data <- read.table(fam_file, header = FALSE, colClasses = c("character", "character", rep("NULL", 4)))
colnames(fam_data) <- c("Family_ID", "Individual_ID")

# Combine sample IDs with ancestry proportions (.fam and .Q rows are in the same sample order)
admixture_data <- cbind(fam_data, q_data)

# Rename Q columns for K=3
colnames(admixture_data)[3:5] <- paste0("Cluster_", 1:3)

# Reshape data for ggplot2 (long format)
admixture_long <- admixture_data %>%
  pivot_longer(
    cols = starts_with("Cluster_"),
    names_to = "Ancestry_Component",
    values_to = "Proportion"
  ) %>%
  # Add a 'Region' column based on the existing 'Individual_ID' (fam$V2)
  mutate(Region = Individual_ID) # Assuming Individual_ID here represents the region/grouping variable

# Order individuals first by 'Region' and then by their total ancestry proportion for one component (e.g., Cluster_1)
admixture_long <- admixture_long %>%
  group_by(Individual_ID) %>%
  mutate(Total_Prop_Cluster1 = sum(Proportion[Ancestry_Component == "Cluster_1"])) %>%
  ungroup() %>%
  arrange(Region, Total_Prop_Cluster1) %>%
  mutate(Individual_ID = factor(Individual_ID, levels = unique(Individual_ID)))


# Create the ADMIXTURE bar plot for K=3 with facet_grid
p <- ggplot(admixture_long, aes(x = Individual_ID, y = Proportion, fill = Ancestry_Component)) +
  geom_bar(stat = "identity", width = 1) +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "ADMIXTURE Plot (K=3)",
       x = "Individual",
       y = "Ancestry Proportion",
       fill = "Ancestry Component") +
  theme_minimal() +
  theme(axis.text.x = element_blank(), # Hide individual labels as they might be too many
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0.5),
        strip.text = element_text(size = 8)) + # Adjust facet label size
  facet_grid(~ Region, scales = "free_x", space = "free_x") # Facet by Region

# Save the plot as PNG and PDF
ggsave("pca/admixture/admixture_plot_k3_faceted.png", plot = p, width = 14, height = 7, dpi = 300)
ggsave("pca/admixture/admixture_plot_k3_faceted.pdf", plot = p, width = 14, height = 7)


## Execute ADMIXTURE Plotting R Script

Run the newly created `admixture_plot.R` script using `Rscript` within the activated `pop_gen` conda environment. This will generate the ADMIXTURE plot as a PNG image and a PDF file.


In [None]:
%%bash
# Initialize Conda for the current shell session
source /usr/local/miniconda/etc/profile.d/conda.sh

# Activate the conda environment with R and tidyr
conda activate pop_gen

# Execute the R script to generate ADMIXTURE plots
Rscript pca/admixture/admixture_plot.R

To visualise the plot, run the following cell:

In [None]:
from IPython.display import Image, display

# Display the generated faceted ADMIXTURE plot
display(Image('pca/admixture/admixture_plot_k3_faceted.png'))