<a href="https://colab.research.google.com/github/Aksinhaa/Pony/blob/main/NGS_collab_snp_identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Variant calling is the process of identifying genetic variants, such as single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels) from next-generation sequencing (NGS) data. It involves comparing sequencing reads from an individual to a reference genome to detect differences, like single nucleotide variants (SNVs). In this tutorial, we will focus specifically on detecting SNPs and indels.

In this step:

a) Call variants (Single Nucleotide Polymorphisms and Indels) using tools like bcftools or strelka

b) Produce a VCF file (Variant Call Format) with detailed information about each variant

c) Optionally apply variant filtering to remove false positives or low-confidence variants

Why it's important: This is the core step of population genetics—identifying the genetic differences across sample.


For this tutorial we are going to use Strelka, a tool utilised for germline and somatic variant calling.

1: Germline Calling: Utilizes haplotype-based model to accurately detect inherited variants.

(Haplotype is a set of DNA variants inherited together on the same chromosome copy)

2: Somatic Calling: identifying genetic mutations that arise in somatic (non-germline) cells. These mutations are not inherited from parents and do not get passed on to offspring.

Workflow Execution: Strelka2 can be run in two steps: configuration (specifying input data) and execution (specifying parameters).


The first step is to download the Miniconda installer for Linux using `wget`.


In [7]:
%%bash
# Download and install Miniconda
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p /usr/local/miniconda

# Add conda to PATH and initialize shell
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh

# Accept Terms of Service for required conda channels
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create environment and install Strelka + Samtools
conda create -y -n strelka -c bioconda strelka samtools

accepted Terms of Service for https://repo.anaconda.com/pkgs/main
accepted Terms of Service for https://repo.anaconda.com/pkgs/r
Jupyter detected...
2 channel Terms of Service accepted
Channels:
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | done
Solving environment: - done

## Package Plan ##

  environment location: /usr/local/miniconda/envs/strelka

  added / updated specs:
    - samtools
    - strelka


The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main 
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 
  bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 
  c-ares             pkgs/main/linux-64::c-ares-1.34.5-hef5626c_0 
  ca-certificates    pkgs/main/linux-64::ca-certificates-2025.12.2-h06a4308_0 
  certifi            pkgs/main/noarch::certifi-2020.6.20-pyhd3eb1b0_3 
  htslib  

ERROR: File or directory already exists: '/usr/local/miniconda'
If you want to update an existing installation, use the -u option.


    current version: 25.9.1
    latest version: 25.11.0

Please update conda by running

    $ conda update -n base -c defaults conda




This code block explains the setup of a Conda environment and installation of,`strelka` for variant calling.

1.  **Downloading and Installing Miniconda**:
    *   `!wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh`: This command downloads the latest Linux installer for Miniconda3. The `-q` flag ensures quiet mode (no progress bar), and `-O miniconda.sh` saves the downloaded file as `miniconda.sh`.
    *   `!bash miniconda.sh -b -p /usr/local/miniconda`: This command executes the Miniconda installer script. The `-b` flag enables batch mode (installation process will proceed automatically without asking the user for any input), and `-p /usr/local/miniconda` specifies the installation prefix (location) as `/usr/local/miniconda`.

2.  **Setting up Python Path and Environment Variables**:
    *   `import sys, os`: Imports the `sys` and `os` modules for system-specific parameters and operating system interfaces.
    *   `sys.path.append('/usr/local/miniconda/lib/python3.8/site-packages')`: Adds the Miniconda's Python 3.8 site-packages directory to the Python system path, allowing Python to find packages installed via Conda.
    *   `os.environ['PATH'] = "/usr/local/miniconda/bin:" + os.environ['PATH']`: This ensures that Conda commands and tools installed by Conda are accessible from the command line.

3.  **Installing `strelka` into the Environment**:
    *   `!conda create -n strelka -c bioconda strelka`: This is the core installation step. It installs the `strelka`tool package into the `strelka` environment.
        *   `-n strelka_env`: Specifies the target environment.
        *   `-c bioconda`: Specifies `bioconda` channel as sources for packages


 ***Verifying `strelka` Installation***:
      `!bash -c "source /usr/local/miniconda/bin/activate strelka && configureStrelkaGermlineWorkflow.py"`: This command activates the `strelka_env` environment within a new bash shell and then confirm that `strelka` is installed and functioning correctly.

In [8]:
%%bash
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh

conda list -n strelka

# packages in environment at /usr/local/miniconda/envs/strelka:
#
# Name                     Version          Build            Channel
_libgcc_mutex              0.1              main
_openmp_mutex              5.1              1_gnu
bzip2                      1.0.8            h5eee18b_6
c-ares                     1.34.5           hef5626c_0
ca-certificates            2025.12.2        h06a4308_0
certifi                    2020.6.20        pyhd3eb1b0_3
htslib                     1.13             h9093b5e_0       bioconda
krb5                       1.19.2           hac12032_0
libcurl                    7.87.0           h91b91d3_0
libdeflate                 1.22             h5eee18b_0
libedit                    3.1.20210714     h7f8727e_0
libev                      4.33             h7f8727e_1
libffi                     3.3              he6710b0_2
libgcc                     15.2.0           h69a1729_7
libgcc-ng                  15.2.0           h166f726_7
libgomp                    15.2.0 



For the variant calling step, we will download the BAM file which is aligned, sorted, and duplicate-marked file (generated during the mapping step), from the Zenodo repository. This will ensure that they are ready for use in downstream variant identification.


In [12]:
%%bash
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka

which configureStrelkaGermlineWorkflow.py

/usr/local/miniconda/envs/strelka/bin/configureStrelkaGermlineWorkflow.py


In [10]:
%%bash
rm -rf reference

# URL for reference genome tar.gz file
reference_tar="https://zenodo.org/records/17878528/files/reference.tar.gz?download=1"

# Download the tar.gz file
wget -q "$reference_tar" -O reference.tar.gz

# Extract the reference folder
tar -xzvf reference.tar.gz

# List extracted contents
ls -F reference/

reference/
reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.ann
reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.amb
reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.sa
reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.bwt
reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.pac
reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna
reference/.ipynb_checkpoints/
GCA_021130815.1_PanTigT.MC.v3_genomic.fna
GCA_021130815.1_PanTigT.MC.v3_genomic.fna.amb
GCA_021130815.1_PanTigT.MC.v3_genomic.fna.ann
GCA_021130815.1_PanTigT.MC.v3_genomic.fna.bwt
GCA_021130815.1_PanTigT.MC.v3_genomic.fna.pac
GCA_021130815.1_PanTigT.MC.v3_genomic.fna.sa


Once we have our required BAM files, we will start the step for variant calling. In this analysis we use the Strelka germline workflow, which is initiated through the configureStrelkaGermlineWorkflow.py script.

This germline configuration file contains:

1: Information about the input BAM files (aligned, sorted, and duplicate-marked reads).

2: The reference genome path, which Strelka uses for aligning sequences and identifying variant sites.

3: Workflow parameters, such as filtering settings, runtime options, and rules for how Strelka processes sequencing data.

4: Module definitions, specifying the order and structure of the processing steps used to detect SNVs and small indels.

5: Paths to output directories, where Strelka will write variant call results, logs, and intermediate files.

In short, this germline configuration is used by Strelka to build and execute the variant calling workflow. It ensures that the pipeline runs consistently with the correct inputs, reference genome, and settings needed for accurate germline variant detection.

In [19]:
%%bash
# Create bam_files directory if it doesn't exist
mkdir -p bam_files

# Move the .fna file from reference/ to bam_files/
mv reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna bam_files/

# List contents of bam_files to confirm
ls -F bam_files

BEN_CI16.bam
BEN_CI16.bam.bai
BEN_NW10.bam
BEN_NW10.bam.bai
BEN_SI18.bam
BEN_SI18.bam.bai
GCA_021130815.1_PanTigT.MC.v3_genomic.fna


In [20]:
%%bash
# Activate conda environment that contains samtools
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka   # or the environment where samtools is installed

# Move into bam_files directory
cd bam_files

# Index the reference FASTA
samtools faidx GCA_021130815.1_PanTigT.MC.v3_genomic.fna

echo "Indexed reference file: $fna_file"
ls -F


Indexed reference file: 
BEN_CI16.bam
BEN_CI16.bam.bai
BEN_NW10.bam
BEN_NW10.bam.bai
BEN_SI18.bam
BEN_SI18.bam.bai
GCA_021130815.1_PanTigT.MC.v3_genomic.fna
GCA_021130815.1_PanTigT.MC.v3_genomic.fna.fai


In [11]:
%%bash
# Create directory for BAM files
mkdir -p bam_files

# Download all BAM and BAI files into bam_files/
wget -q "https://zenodo.org/records/17895817/files/BEN_CI16_aligned_reads_deduplicated.bam?download=1"     -O bam_files/BEN_CI16.bam
wget -q "https://zenodo.org/records/17895817/files/BEN_CI16_aligned_reads_deduplicated.bam.bai?download=1" -O bam_files/BEN_CI16.bam.bai

wget -q "https://zenodo.org/records/17895817/files/BEN_NW10_aligned_reads_deduplicated.bam?download=1"     -O bam_files/BEN_NW10.bam
wget -q "https://zenodo.org/records/17895817/files/BEN_NW10_aligned_reads_deduplicated.bam.bai?download=1" -O bam_files/BEN_NW10.bam.bai

wget -q "https://zenodo.org/records/17895817/files/BEN_SI18_aligned_reads_deduplicated.bam?download=1"     -O bam_files/BEN_SI18.bam
wget -q "https://zenodo.org/records/17895817/files/BEN_SI18_aligned_reads_deduplicated.bam.bai?download=1" -O bam_files/BEN_SI18.bam.bai

# List downloaded files
ls -lh bam_files


total 843M
-rw-r--r-- 1 root root 272M Dec 11 15:46 BEN_CI16.bam
-rw-r--r-- 1 root root 3.3M Dec 11 15:47 BEN_CI16.bam.bai
-rw-r--r-- 1 root root 259M Dec 11 15:49 BEN_NW10.bam
-rw-r--r-- 1 root root 3.3M Dec 11 15:49 BEN_NW10.bam.bai
-rw-r--r-- 1 root root 303M Dec 11 15:54 BEN_SI18.bam
-rw-r--r-- 1 root root 3.1M Dec 11 15:54 BEN_SI18.bam.bai





After running the Strelka germline variant calling workflow, the pipeline produces a set of output files, most importantly the **VCF (Variant Call Format) files**. These files contain the list of single nucleotide variants (SNVs) and small insertions/deletions (indels) identified from the input BAM files.

Once the VCF files are generated, they represent the final variant calls produced by Strelka, and they can now be used for downstream analyses. Typical post-processing steps may include:

* **Quality assessment** of the variants by inspecting Strelka’s FILTER and quality score annotations.
* **Merging or comparing VCFs** (if multiple samples were processed) to evaluate shared or unique variants.
* **Functional annotation** using external tools (e.g., VEP, SnpEff) to predict gene impacts, consequences, or biological relevance.
* **Filtering based on depth, genotype quality, or allele frequency**, depending on the research goals.

In summary, the generation of VCF files marks the completion of the variant calling workflow, providing a structured dataset of all detected variants. These files form the foundation for interpretation, annotation, and further biological or population-level analyses.


In [21]:
%%bash
# Activate environment
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka

# Create the run directory manually (recommended for clarity)
mkdir -p strelka_run

# Configure Strelka workflow
configureStrelkaGermlineWorkflow.py \
    --bam bam_files/BEN_CI16.bam \
    --bam bam_files/BEN_NW10.bam \
    --bam bam_files/BEN_SI18.bam \
    --referenceFasta bam_files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna \
    --runDir strelka_run


Successfully created workflow run script.
To execute the workflow, run the following script and set appropriate options:

/content/strelka_run/runWorkflow.py


In [None]:
%%bash
# Activate environment
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka

strelka_run/runWorkflow.py -m local -j 8