<a href="https://colab.research.google.com/github/Aksinhaa/Pony/blob/main/NGS_collab_snp_identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Variant calling is the process of identifying genetic variants, such as single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels), from next-generation sequencing (NGS) data. It involves comparing an individual's sequencing reads to a reference genome to detect positions where they differ, such as single nucleotide variants (SNVs). In this tutorial, we focus specifically on detecting SNPs and indels.

In this step:

a) Call variants (single nucleotide polymorphisms and indels) using tools such as bcftools or Strelka

b) Produce a VCF file (Variant Call Format) with detailed information about each variant

c) Optionally apply variant filtering to remove false positives or low-confidence variants

Why it's important: This is the core step of population genetics, identifying the genetic differences across samples.


For this tutorial we will use Strelka, a tool for germline and somatic variant calling.

1: Germline calling: uses a haplotype-based model to accurately detect inherited variants.

(A haplotype is a set of DNA variants inherited together on the same chromosome copy.)

2: Somatic calling: identifies genetic mutations that arise in somatic (non-germline) cells. These mutations are not inherited from parents and are not passed on to offspring.

Workflow Execution: Strelka2 runs in two steps. Configuration specifies the input data and variant calling options; execution specifies how the workflow is run, such as the number of parallel jobs.


The first step is to download the Miniconda installer for Linux using `wget`.


In [None]:
%%bash
# Download and install Miniconda
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p /usr/local/miniconda

# Add conda to PATH and initialize shell
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh

# Accept Terms of Service for required conda channels
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create environment and install Strelka + Samtools
conda create -y -n strelka -c bioconda strelka samtools

This code block sets up a Conda environment and installs `strelka` for variant calling.

1.  **Downloading and Installing Miniconda**:
    *   `wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh`: This command downloads the latest Linux installer for Miniconda3. The `-q` flag enables quiet mode (no progress output), and `-O miniconda.sh` saves the downloaded file as `miniconda.sh`.
    *   `bash miniconda.sh -b -p /usr/local/miniconda`: This command executes the Miniconda installer script. The `-b` flag enables batch mode (the installation proceeds automatically without asking the user for any input), and `-p /usr/local/miniconda` sets the installation prefix (location) to `/usr/local/miniconda`.

2.  **Setting up the PATH and Conda Shell Functions**:
    *   `export PATH="/usr/local/miniconda/bin:$PATH"`: Puts the Miniconda `bin` directory at the front of `PATH`, so that Conda commands and tools installed by Conda are accessible from the command line.
    *   `source /usr/local/miniconda/etc/profile.d/conda.sh`: Loads Conda's shell functions into the current bash session so that commands such as `conda activate` work. Because each `%%bash` cell runs in a fresh shell, these two lines are repeated at the top of later cells.
    *   `conda tos accept --override-channels --channel <URL>`: Accepts the Terms of Service for the Anaconda channels `pkgs/main` and `pkgs/r`, which Conda requires before installing packages from them.

3.  **Installing `strelka` into the Environment**:
    *   `conda create -y -n strelka -c bioconda strelka samtools`: This is the core installation step. It creates the `strelka` environment and installs the `strelka` and `samtools` packages into it.
        *   `-y`: Answers yes to the confirmation prompt.
        *   `-n strelka`: Names the target environment.
        *   `-c bioconda`: Adds the `bioconda` channel as a source for packages.


 ***Verifying `strelka` Installation***:
      `bash -c "source /usr/local/miniconda/bin/activate strelka && configureStrelkaGermlineWorkflow.py"`: This command activates the `strelka` environment within a new bash shell and runs Strelka's configuration script, confirming that `strelka` is installed and found on the `PATH`. The cell below takes a different route and lists the packages in the environment with `conda list -n strelka`.
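As a complementary check, the sketch below tests whether the expected executables are on the `PATH` without running them. The helper name `check_tool` is hypothetical, and the tool names are assumed from what the installation cell installs; run it after activating the `strelka` environment.

```shell
# Report whether a tool is on the PATH; returns non-zero if it is missing.
# check_tool is an illustrative helper, not part of Strelka or Conda.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools the installation cell is expected to provide (non-fatal if absent)
for tool in configureStrelkaGermlineWorkflow.py samtools; do
    check_tool "$tool" || true
done
```

A `missing:` line usually means the environment was not activated in the current cell rather than that the installation failed.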

In [3]:
%%bash
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh

conda list -n strelka

# packages in environment at /usr/local/miniconda/envs/strelka:
#
# Name                     Version          Build            Channel
_libgcc_mutex              0.1              main
_openmp_mutex              5.1              1_gnu
ca-certificates            2025.12.2        h06a4308_0
certifi                    2020.6.20        pyhd3eb1b0_3
libffi                     3.4.4            h6a678d5_1
libgcc                     15.2.0           h69a1729_7
libgcc-ng                  15.2.0           h166f726_7
libgomp                    15.2.0           h4751f2c_7
libstdcxx                  15.2.0           h39759b7_7
libstdcxx-ng               15.2.0           hc03a8fd_7
libxcb                     1.17.0           h9b100fa_0
libzlib                    1.3.1            hb25bd0a_0
ncurses                    6.5              h7934f7d_0
pip                        19.3.1           py27_0
pthread-stubs              0.3              h0ce48e5_1
python                     2.7.18           h42bf7aa_



For the variant calling step, we will download the aligned, sorted, and duplicate-marked BAM file (generated during the mapping step) from the Zenodo repository, so that it is ready for use in downstream variant identification. The cell below downloads the `reference.tar.gz` archive and extracts it into the `reference` folder.


In [None]:
%%bash
# Make directory for reference
mkdir -p reference

# Download and extract directly into the reference folder
wget -q "https://zenodo.org/records/17878528/files/reference.tar.gz?download=1" -O reference/ref.tar.gz
tar -xzf reference/ref.tar.gz -C reference

# List extracted files
ls -F reference
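Archives sometimes unpack into a subdirectory, in which case a top-level glob such as `reference/*.fna` finds nothing. The following is a minimal sketch of a more defensive lookup; `first_with_ext` is a hypothetical helper, and the `.fna` extension is assumed from the indexing step later in this notebook.

```shell
# Print the first file under a directory with the given extension,
# searching subdirectories too; prints nothing if none is found.
first_with_ext() {
    find "$1" -type f -name "*.$2" 2>/dev/null | sort | head -n 1
}

fna_file=$(first_with_ext reference fna)
if [ -n "$fna_file" ]; then
    echo "Reference FASTA: $fna_file"
else
    echo "No .fna file found under reference/"
fi
```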

Once we have our required BAM files, we will start the step for variant calling. In this analysis we use the Strelka germline workflow, which is initiated through the configureStrelkaGermlineWorkflow.py script.

This germline configuration file contains:

1: Information about the input BAM files (aligned, sorted, and duplicate-marked reads).

2: The reference genome path, which Strelka uses to compare the aligned reads against the reference and identify variant sites.

3: Workflow parameters, such as filtering settings, runtime options, and rules for how Strelka processes sequencing data.

4: Module definitions, specifying the order and structure of the processing steps used to detect SNVs and small indels.

5: Paths to output directories, where Strelka will write variant call results, logs, and intermediate files.

In short, this germline configuration is used by Strelka to build and execute the variant calling workflow. It ensures that the pipeline runs consistently with the correct inputs, reference genome, and settings needed for accurate germline variant detection.

In [None]:
%%bash
# Activate the conda env that provides samtools (installed alongside Strelka above)
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka

# Automatically find the .fna file inside reference folder
fna_file=$(ls reference/*.fna)

# Index the reference FASTA
samtools faidx "$fna_file"

echo "Indexed file: $fna_file"
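With the reference indexed, the configuration and execution steps described earlier could look like the sketch below. The BAM path and run directory name are hypothetical placeholders, and the choice of two local jobs is an assumption; the block only runs Strelka when the script, the BAM file, and the reference are all present.

```shell
# Hypothetical locations; replace with your own paths.
bam_file="mapping/sample.markdup.bam"   # assumed BAM path (needs a .bai index next to it)
ref_fasta=$(ls reference/*.fna 2>/dev/null | head -n 1)
run_dir="strelka_germline"              # assumed output directory name

# Make conda available if it is installed where this notebook put it
if [ -f /usr/local/miniconda/etc/profile.d/conda.sh ]; then
    . /usr/local/miniconda/etc/profile.d/conda.sh
    conda activate strelka
fi

if command -v configureStrelkaGermlineWorkflow.py >/dev/null 2>&1 \
   && [ -f "$bam_file" ] && [ -n "$ref_fasta" ]; then
    # Step 1: configuration (input data and calling options)
    configureStrelkaGermlineWorkflow.py \
        --bam "$bam_file" \
        --referenceFasta "$ref_fasta" \
        --runDir "$run_dir"

    # Step 2: execution (run locally with 2 parallel jobs)
    "$run_dir/runWorkflow.py" -m local -j 2

    # Strelka writes the germline calls to this file
    ls -lh "$run_dir/results/variants/variants.vcf.gz"
else
    echo "Strelka, the BAM file, or the reference FASTA is missing; nothing was run."
fi
```

Keeping configuration and execution separate means the same configured run directory can be executed with different resource settings.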





After running the Strelka germline variant calling workflow, the pipeline produces a set of output files, most importantly the **VCF (Variant Call Format) files**. These files contain the list of single nucleotide variants (SNVs) and small insertions/deletions (indels) identified from the input BAM files.
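To make the SNV/indel distinction concrete, here is a small sketch that classifies records by the lengths of their REF and ALT alleles. The VCF content is made up for illustration, `classify_variants` is a hypothetical helper, and the logic assumes a single ALT allele per record.

```shell
# Classify VCF records as SNV or indel from the REF ($4) and ALT ($5) columns.
# Assumes one ALT allele per record (no comma-separated ALT values).
classify_variants() {
    awk -F'\t' '!/^#/ {
        kind = (length($4) == 1 && length($5) == 1) ? "SNV" : "indel"
        print $1 ":" $2, $4 ">" $5, kind
    }' "$1"
}

# Toy VCF with made-up records, for illustration only
toy_vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$toy_vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> "$toy_vcf"
printf 'chr1\t250\t.\tCT\tC\t30\tPASS\t.\n' >> "$toy_vcf"

classify_variants "$toy_vcf"
# prints:
# chr1:100 A>G SNV
# chr1:250 CT>C indel
```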

Once the VCF files are generated, they represent the final variant calls produced by Strelka, and they can now be used for downstream analyses. Typical post-processing steps may include:

* **Quality assessment** of the variants by inspecting Strelka’s FILTER and quality score annotations.
* **Merging or comparing VCFs** (if multiple samples were processed) to evaluate shared or unique variants.
* **Functional annotation** using external tools (e.g., VEP, SnpEff) to predict gene impacts, consequences, or biological relevance.
* **Filtering based on depth, genotype quality, or allele frequency**, depending on the research goals.
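As one illustration of the filtering step, the sketch below keeps header lines and records whose FILTER column is `PASS` and whose QUAL meets a threshold. The records and the threshold of 20 are invented for the example, and `filter_pass` is a hypothetical helper; on real data a dedicated tool such as `bcftools view -f PASS` would be the more usual choice.

```shell
# Keep header lines plus records with FILTER ($7) == PASS and QUAL ($6) >= min_qual.
filter_pass() {
    awk -F'\t' -v min_qual="$2" \
        '/^#/ || ($7 == "PASS" && $6 + 0 >= min_qual + 0)' "$1"
}

# Toy VCF with made-up records, for illustration only
toy_vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$toy_vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> "$toy_vcf"
printf 'chr1\t250\t.\tCT\tC\t12\tLowGQX\t.\n' >> "$toy_vcf"
printf 'chr1\t400\t.\tT\tC\t8\tPASS\t.\n' >> "$toy_vcf"

# Only the header and the chr1:100 record survive a QUAL threshold of 20
filter_pass "$toy_vcf" 20
```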

In summary, the generation of VCF files marks the completion of the variant calling workflow, providing a structured dataset of all detected variants. These files form the foundation for interpretation, annotation, and further biological or population-level analyses.
