
### Introduction to Variant Filtering

High-throughput sequencing technologies generate millions of genetic variants across genomes, but not all detected variants are biologically meaningful or technically reliable. Raw variant calls often contain sequencing errors, low-confidence genotypes, and missing data that can severely bias downstream analyses. Each filtering step here, from removing likely errors to retaining only high-confidence SNPs, helps you build a more accurate dataset for downstream applications such as population genetics.
The filters used in this analysis to produce the filtered VCF file are described below.

1: *passOnly*- Retained only variant sites that passed all internal quality filters applied by the variant caller (Strelka in this case). During variant calling, Strelka marks each variant as PASS or as failing, based on multiple internal metrics such as base quality, mapping quality, strand bias, and read position bias.

2: *biallelicOnly*- Kept only biallelic variants (i.e., sites with exactly two alleles). Most population genetic software assumes biallelic markers, so this filter keeps the data compatible with downstream tools and simpler to interpret.

3: *rmvIndels*- Removed insertions and deletions (INDELs), keeping only SNPs (single nucleotide polymorphisms). Indels have higher sequencing and alignment error rates than SNPs. Removing indels ensures that the dataset is composed of high-confidence point mutations.

4: *minMAF0Pt05*- Retained only variants with minor allele frequency (MAF) ≥ 0.05, meaning the less common allele must account for at least 5% of the allele copies observed across the genotyped individuals. Very rare variants (MAF < 5%) are more likely to be sequencing errors, and they also provide little power in statistical tests.

5: *chr_E2*- Restricted to a specific chromosome/region of interest (e.g., chr_E2). This enables targeted analysis of specific genomic regions.

6: *minDP3*- Required a minimum depth (DP) of 3 per genotype to avoid false positives due to low coverage, because low-depth genotypes are highly error-prone and vulnerable to random sequencing noise.

7: *minQ30*- Ensured that each site has a minimum site quality (QUAL) score of 30, a Phred-scaled value corresponding to roughly a 1-in-1,000 chance that the variant call is wrong. This improves the accuracy of allele frequency estimates and the reliability of genome-wide scans.

8: *minGQ30*- Filtered genotypes to retain only those with a Genotype Quality (GQ) ≥ 30, removing uncertain genotype calls. This helps prevent the misclassification of homozygous vs heterozygous states.

9: *hwe_0.05*- Removed variants deviating significantly from Hardy-Weinberg Equilibrium (p < 0.05), which could indicate genotyping errors or population substructure.

10: *imiss_0.6*- Filtered out individuals with >60% missing genotypes to maintain data quality. Highly incomplete samples can distort PCA and admixture results.

11: *miss_0.6*- Removed variants missing in >60% of individuals. This ensures each retained SNP is well represented across individuals.

12: *mid95percentile*- Likely indicates the removal of extreme outliers in genotype depth or quality, retaining the middle 95% of the data. This removes the lowest 2.5% of sites, which are under-covered and unreliable, and the highest 2.5%, which often fall in multi-mapped, repetitive, or duplicated regions.
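
As a rough guide, most of the tags above correspond to a single `vcftools` option. The sketch below collects one possible mapping in a Python dictionary and joins the options for a chosen set of tags. Only `--minQ`, `--minGQ`, `--hwe`, and `--remove-indels` appear in the commands used in this notebook; the other entries, the dictionary `FILTER_FLAGS`, and the helper `build_vcftools_args` are assumptions made for illustration, not the commands that produced the input file. The *imiss_0.6* and *mid95percentile* tags are left out because they need more than one option.

```python
# Hypothetical mapping from filter tags to vcftools options (illustration only).
FILTER_FLAGS = {
    "passOnly": ["--remove-filtered-all"],
    "biallelicOnly": ["--min-alleles", "2", "--max-alleles", "2"],
    "rmvIndels": ["--remove-indels"],
    "minMAF0Pt05": ["--maf", "0.05"],
    # Assumes the contig is literally named "chr_E2" in the VCF.
    "chr_E2": ["--chr", "chr_E2"],
    "minDP3": ["--minDP", "3"],
    "minQ30": ["--minQ", "30"],
    "minGQ30": ["--minGQ", "30"],
    "hwe_0.05": ["--hwe", "0.05"],
    # --max-missing is the minimum fraction of non-missing genotypes
    # (1 = no missing data allowed), so "missing in >60%" becomes 0.4.
    "miss_0.6": ["--max-missing", "0.4"],
}


def build_vcftools_args(tags):
    """Concatenate the vcftools options for the given filter tags, in order."""
    args = []
    for tag in tags:
        args.extend(FILTER_FLAGS[tag])
    return args


print(" ".join(build_vcftools_args(["minQ30", "minGQ30", "hwe_0.05"])))
# prints: --minQ 30 --minGQ 30 --hwe 0.05
```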





### Prerequisites and setup:
This variant-filtering workflow runs on the command line through conda. The cells below install Miniconda and download the required input VCF file from Zenodo into your working directory. If you run the workflow outside Colab, make sure Miniconda is already installed and working, and that the VCF file has already been downloaded.
The setup involves creating and activating the required conda environments for vcftools, ggplot2, and base R, and installing these packages from the bioconda and conda-forge channels.

First, we'll download and install Miniconda. We use `wget` to download the installer script and then execute it using `bash`.

In [None]:
# Miniconda installation and environment setup for Colab NGS Workshop

# Download and install Miniconda (skip if already installed)
!wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
!bash miniconda.sh -b -p /usr/local/miniconda

import os

# Put conda's executables on PATH so the `!conda ...` commands below can find them.
# The tools are run through `conda run`, so Colab's own Python path is left untouched.
os.environ['PATH'] = "/usr/local/miniconda/bin:" + os.environ['PATH']

# Accept ToS for main and R conda channels
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create the vcf_filter environment with vcftools; -y skips the confirmation prompt
!conda create -y -n vcf_filter -c conda-forge -c bioconda vcftools


Now let's list all conda environments to confirm that `vcf_filter` is present.

In [None]:
!conda env list

Next, we will list the packages installed in the `vcf_filter` environment to ensure `vcftools` is there.

In [None]:
!conda list -n vcf_filter
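
If you prefer a programmatic check over reading the listing by eye, the same information is available as JSON through `conda list --json`. The sketch below is one way to look up a package in that output; the helper names `package_in_listing` and `env_package_version` are hypothetical, and the second helper assumes `conda` is on `PATH`.

```python
import json
import subprocess


def package_in_listing(listing_json, package):
    """Return the version of `package` in a `conda list --json` dump, or None."""
    for entry in json.loads(listing_json):
        if entry.get("name") == package:
            return entry.get("version")
    return None


def env_package_version(env, package):
    """Ask conda for the packages in `env` and look up `package` (needs conda on PATH)."""
    result = subprocess.run(
        ["conda", "list", "-n", env, "--json"],
        capture_output=True, text=True, check=True,
    )
    return package_in_listing(result.stdout, package)


# Example call in the notebook:
# print(env_package_version("vcf_filter", "vcftools"))
```

A return value of `None` would mean the package is missing from the environment and the `conda create` step should be rerun.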

In [None]:
# Create the directory if it doesn't exist
!mkdir -p vcf_file

# Download the VCF file into the created directory
!wget -P vcf_file https://zenodo.org/records/15173226/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3.recode.vcf.gz
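
Before filtering, it can help to record how many samples and variant records the downloaded file contains, so that the effect of each later step can be compared against a baseline. The sketch below reads a gzip-compressed VCF with the standard library only; the function name `summarize_vcf` is hypothetical, and the sketch assumes a standard VCF layout with nine fixed columns before the sample columns.

```python
import gzip


def summarize_vcf(path):
    """Count samples and variant records in a gzip-compressed VCF file."""
    n_samples = 0
    n_records = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Columns after the 9 fixed VCF fields are sample names
                n_samples = len(line.rstrip("\n").split("\t")) - 9
            elif line.strip():
                n_records += 1
    return n_samples, n_records


# Example call in the notebook, with the path of the downloaded file:
# n_samples, n_records = summarize_vcf("vcf_file/<downloaded file>.vcf.gz")
```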

We'll start the filtering by applying site quality (`--minQ`), genotype quality (`--minGQ`), and Hardy-Weinberg equilibrium (`--hwe`) filters to the VCF file. The output file name will reflect these new filters.

In [None]:
# Define the input and output filenames
input_vcf_gz = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3.recode.vcf.gz"
output_prefix_step1 = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05"

# Apply site quality, genotype quality, and HWE filters
!conda run -n vcf_filter vcftools --gzvcf {input_vcf_gz} \
--minQ 30 --minGQ 30 --hwe 0.05 --out {output_prefix_step1} --recode
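
The two quality options act at different levels: `--minQ` drops whole sites whose QUAL is below the threshold, while `--minGQ` excludes individual genotypes, which are then treated as missing. The sketch below illustrates that distinction on a single VCF data line. It is a conceptual illustration, not a reimplementation of `vcftools`: the function name `apply_quality_filters` is hypothetical, treating a missing QUAL or GQ value (`.`) as failing is an assumption of this sketch, and the exact text `vcftools` writes for a masked genotype may differ.

```python
def apply_quality_filters(vcf_line, min_q=30.0, min_gq=30.0):
    """Return the filtered line, or None if the site fails the QUAL threshold."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    # Site-level filter: the whole record is dropped.
    # Assumption: a missing QUAL (".") is treated as failing.
    if qual == "." or float(qual) < min_q:
        return None
    format_keys = fields[8].split(":")
    if "GQ" not in format_keys:
        return "\t".join(fields)  # nothing to check per genotype
    gq_index = format_keys.index("GQ")
    for i in range(9, len(fields)):
        values = fields[i].split(":")
        gq = values[gq_index] if gq_index < len(values) else "."
        # Genotype-level filter: only this sample's call becomes missing.
        if gq == "." or float(gq) < min_gq:
            values[0] = "./."  # GT is the first FORMAT field
            fields[i] = ":".join(values)
    return "\t".join(fields)


line = "chr_E2\t100\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t0/1:40\t1/1:12"
print(apply_quality_filters(line))
```

In this example the site is kept because QUAL is 50, the first sample keeps its `0/1` call (GQ 40), and the second sample's call is set to missing (GQ 12). This is why genotype-level filters increase missingness, which the later missingness steps then have to deal with.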

Next, we will remove indels (insertions and deletions) from the filtered VCF file. The `--remove-indels` flag ensures that only SNP (Single Nucleotide Polymorphism) sites are retained. The output file name will again be updated to reflect this filter.

In [None]:
# Define the input (output from step 1) and output filenames
input_vcf_step1 = output_prefix_step1 + ".recode.vcf"
output_prefix_step2 = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05"

# Remove Indels
!conda run -n vcf_filter vcftools --vcf {input_vcf_step1} --remove-indels \
--out {output_prefix_step2} --recode
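
For `--remove-indels`, an indel is any variant that changes the length of the REF allele. A minimal sketch of that rule, using a hypothetical helper `is_indel` that takes the REF and ALT columns of a VCF record, could look like this. Symbolic alleles such as `<DEL>` or `*` are not handled specially here; under this simple length comparison they would count as indels whenever their text length differs from REF.

```python
def is_indel(ref, alt):
    """True if any ALT allele differs in length from the REF allele."""
    return any(len(allele) != len(ref) for allele in alt.split(","))


print(is_indel("A", "G"))    # SNP: same length
print(is_indel("A", "AT"))   # insertion
print(is_indel("AT", "A"))   # deletion
```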

Next, we will calculate the missingness of each individual. This command does not filter anything yet; it generates an output file with a `.imiss` extension, which contains information about the fraction of missing sites for each individual.

In [None]:
# Define the input (output from step 2)
input_vcf_step2 = output_prefix_step2 + ".recode.vcf"

# Calculate per-individual missingness (a report only; no data are removed here)
# The output will be named {output_prefix_step2}.imiss
!conda run -n vcf_filter vcftools --vcf {input_vcf_step2} --missing-indv --out {output_prefix_step2}

The command above will produce a file named `machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05.imiss`. You can inspect this file to see the fraction of missing sites for each individual in the `F_MISS` column. This information is crucial for identifying individuals with a high proportion of missing data, which might be excluded from further analysis if the missingness exceeds a certain threshold.
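
One way to inspect the `.imiss` file from Python is sketched below. It assumes the usual tab-delimited layout written by `vcftools`, with a header row containing the `INDV` and `F_MISS` columns; the function name `individuals_above` is hypothetical.

```python
import csv


def individuals_above(imiss_path, threshold=0.6):
    """Return (INDV, F_MISS) pairs whose missing fraction exceeds threshold, worst first."""
    with open(imiss_path, newline="") as handle:
        # DictReader consumes the header row, so it is never compared as data
        reader = csv.DictReader(handle, delimiter="\t")
        rows = [(row["INDV"], float(row["F_MISS"])) for row in reader]
    flagged = [row for row in rows if row[1] > threshold]
    return sorted(flagged, key=lambda row: row[1], reverse=True)


# Example call in the notebook:
# for indv, f_miss in individuals_above(output_prefix_step2 + ".imiss"):
#     print(indv, f_miss)
```

Looking at the flagged individuals before removing them is a useful sanity check: if many samples exceed the threshold, the upstream genotype filters may be too strict for this dataset.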

Now, we will remove individuals that have a missing data proportion greater than 60%. This involves using `awk` to parse the `.imiss` file, identify individuals with `F_MISS` (fraction of missing data, which is the 5th column) greater than 0.6, and then passing these individual IDs to `vcftools` using the `--remove` flag.

In [None]:
# Define the input VCF from the previous step and the base name for the .imiss file
input_vcf_step2 = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05.recode.vcf"
imiss_file = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05.imiss"
output_prefix_step3 = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_imiss_0.6"

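# Create a temporary file to store individuals to remove.
# NR > 1 skips the header row, which would otherwise be compared as a string and written out as "INDV".
temp_remove_list = "vcf_file/individuals_to_remove.txt"
!awk 'NR > 1 && $5 > 0.6 {{print $1}}' {imiss_file} > {temp_remove_list}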

# Remove individuals with missing proportion > 60% by passing the temporary file
!conda run -n vcf_filter vcftools --vcf {input_vcf_step2} \
--remove {temp_remove_list} --recode --out {output_prefix_step3}

# Clean up the temporary file (optional)
!rm {temp_remove_list}

After this step, a new VCF file will be generated (prefixed with `machali_Aligned...imiss_0.6`) that excludes individuals with more than 60% missing data. This helps to improve the quality of downstream analyses by ensuring that only individuals with a sufficient amount of genotyped data are included.