
First, we download and install Miniconda, a minimal installer for conda. We fetch the installer script with `wget`, make it executable, and run it in batch mode (`-b`) with `/usr/local/conda` as the install prefix (`-p`).

In [None]:
# Install Miniconda
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!./Miniconda3-latest-Linux-x86_64.sh -b -p /usr/local/conda
!rm Miniconda3-latest-Linux-x86_64.sh

# Add conda to PATH for current and future shell commands in this session
import os
os.environ['PATH'] += ':/usr/local/conda/bin'

# Optionally expose conda's base site-packages to this kernel.
# The path assumes the base environment uses Python 3.10; adjust the version to match the installed Miniconda.
import sys
sys.path.append("/usr/local/conda/lib/python3.10/site-packages")

# Verify conda installation
!conda --version

In [None]:
# Accept Conda Terms of Service
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create the environment
!conda create -n vcf_filter python=3.9 -y

# Each `!` line runs in its own subshell, so sourcing conda.sh and activating
# the environment only work together when they share a single shell invocation.
# The activation lasts for this one line only; later cells are unaffected.
!bash -c "source /usr/local/conda/etc/profile.d/conda.sh && conda activate vcf_filter && python --version"

A standalone `!conda activate` has no lasting effect in Colab because each `!` command runs in its own subshell, and the activation disappears when that subshell exits. To run commands inside the `vcf_filter` environment reliably, we use `conda run -n vcf_filter`, which executes a single command in the named environment.
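
The subshell behaviour can be reproduced without conda. The following is a minimal sketch, assuming a POSIX `sh` is available (as it is on Colab); `run_in_fresh_shell` and `DEMO_ENV` are hypothetical names used only for illustration.

```python
import subprocess

def run_in_fresh_shell(command: str) -> str:
    """Run `command` in a new shell process and return its trimmed stdout."""
    result = subprocess.run(
        ["sh", "-c", command], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

# State set in one shell process is gone by the time the next one starts,
# which mirrors what happens to a standalone `!conda activate`.
run_in_fresh_shell("export DEMO_ENV=vcf_filter")
separate = run_in_fresh_shell('echo "${DEMO_ENV:-unset}"')

# Chaining both steps in a single shell invocation keeps the state.
chained = run_in_fresh_shell(
    'export DEMO_ENV=vcf_filter; echo "${DEMO_ENV:-unset}"'
)

print("separate shells:", separate)
print("single shell:", chained)
```

`conda run` sidesteps the problem in a similar way: it sets up the environment and runs the command within one invocation.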

In [None]:
# Verify the Python interpreter path within the 'vcf_filter' environment
# This command will execute 'python -c "import sys; print(sys.prefix)"' inside the vcf_filter environment
!conda run -n vcf_filter python -c "import sys; print(sys.prefix)"

# List all conda environments to confirm 'vcf_filter' exists and is recognized
!conda env list

The `!conda run` command should print a path like `/usr/local/conda/envs/vcf_filter`, confirming that the command executed inside the new environment. `!conda env list` should include `vcf_filter` in its listing. The asterisk in that listing marks the currently active environment, so it will not appear next to `vcf_filter` here because no activation persists between cells. In Colab, `conda run` remains the reliable way to execute a command in a specific environment.

In [None]:
# Install vcftools into the vcf_filter environment.
# Bioconda packages expect conda-forge to be listed ahead of bioconda.
!conda install -n vcf_filter -c conda-forge -c bioconda vcftools -y

In [None]:
# Create the directory if it doesn't exist
!mkdir -p vcf_file

# Download the VCF file into the created directory
!wget -P vcf_file https://zenodo.org/records/15173226/files/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3.recode.vcf.gz

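Next, we apply three filters to the VCF file: a minimum site quality (`--minQ`, based on the QUAL column), a minimum genotype quality (`--minGQ`, which excludes genotypes below the threshold), and a Hardy-Weinberg equilibrium test (`--hwe`, which excludes sites whose p-value falls below the threshold). The output file name records these new filters.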

In [None]:
# Define the input and output filenames
input_vcf_gz = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3.recode.vcf.gz"
output_prefix_step1 = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_noIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05"

# Apply site quality, genotype quality, and HWE filters
!conda run -n vcf_filter vcftools --gzvcf {input_vcf_gz} \
--minQ 30 --minGQ 30 --hwe 0.05 --out {output_prefix_step1} --recode
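
To see how many sites a filtering step removed, it can help to count the data lines before and after. The helper below is a sketch rather than part of vcftools; `count_vcf_records` is a hypothetical name, and it assumes that compressed files end in `.gz`.

```python
import gzip

def count_vcf_records(path: str) -> int:
    """Count data lines (non-blank lines not starting with '#') in a VCF."""
    # Assumption: a ".gz" suffix means gzip/bgzip compression.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

For example, `count_vcf_records(input_vcf_gz)` and `count_vcf_records(output_prefix_step1 + ".recode.vcf")` give the site counts before and after this step.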

Next, we remove indels (insertions and deletions) from the filtered VCF file. The `--remove-indels` flag excludes every site that contains an indel, meaning any variant that alters the length of the REF allele, so that SNP (Single Nucleotide Polymorphism) sites are retained. The output file name is again updated to reflect this filter (`noIndels` becomes `rmvIndels`).

In [None]:
# Define the input (output from step 1) and output filenames
input_vcf_step1 = output_prefix_step1 + ".recode.vcf"
output_prefix_step2 = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05"

# Remove Indels
!conda run -n vcf_filter vcftools --vcf {input_vcf_step1} --remove-indels \
--out {output_prefix_step2} --recode
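
The length-based definition of an indel can be written down in a few lines. This is a simplified sketch of the rule, not vcftools' own code: `is_indel` is a hypothetical helper, and it ignores symbolic alleles such as `<DEL>`.

```python
def is_indel(ref: str, alt_field: str) -> bool:
    """Return True if any ALT allele differs in length from REF.

    `alt_field` is the raw ALT column, with multiple alleles separated
    by commas. Symbolic alleles are not handled (simplifying assumption).
    """
    return any(len(allele) != len(ref) for allele in alt_field.split(","))

print(is_indel("A", "G"))     # same length: a SNP
print(is_indel("A", "AT"))    # ALT longer than REF: an insertion
print(is_indel("ATG", "A"))   # ALT shorter than REF: a deletion
```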

Next, we compute per-individual missingness. The `--missing-indv` option does not filter anything; it writes a report with a `.imiss` extension that contains the fraction of missing sites for each individual.

In [None]:
# Define the input (output from step 2)
input_vcf_step2 = output_prefix_step2 + ".recode.vcf"

# Compute per-individual missingness (a report only; no data is removed)
# The output will be named {output_prefix_step2}.imiss
!conda run -n vcf_filter vcftools --vcf {input_vcf_step2} --missing-indv --out {output_prefix_step2}

The command above writes a file named `machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05.imiss` to the `vcf_file` directory. Its `F_MISS` column gives the fraction of missing sites for each individual. Use it to identify individuals with a high proportion of missing data, which are candidates for exclusion when their missingness exceeds a chosen threshold.
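
One way to inspect the report from Python is sketched below. It assumes the usual tab-delimited `.imiss` layout with a header row containing `INDV` and `F_MISS`; `read_f_miss` and `rank_by_missingness` are hypothetical helper names.

```python
import csv

def read_f_miss(imiss_path: str) -> dict:
    """Map each individual ID to its F_MISS value from a .imiss file."""
    with open(imiss_path, newline="") as handle:
        # DictReader consumes the header row, so columns are addressed
        # by name rather than by position.
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["INDV"]: float(row["F_MISS"]) for row in reader}

def rank_by_missingness(f_miss: dict) -> list:
    """Return (individual, F_MISS) pairs, most missing first."""
    return sorted(f_miss.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_by_missingness(read_f_miss(output_prefix_step2 + ".imiss"))` lists the individuals with the most missing data first.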

Now we remove individuals with more than 60% missing data. `awk` reads the `.imiss` file, selects individuals whose `F_MISS` value (the fraction of missing data, in the 5th column) exceeds 0.6, and writes their IDs to a list that `vcftools` consumes through the `--remove` flag.

In [None]:
# Define the input VCF from the previous step and the base name for the .imiss file
input_vcf_step2 = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05.recode.vcf"
imiss_file = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05.imiss"
output_prefix_step3 = "vcf_file/machali_Aligned_rangeWideMerge_strelka_update2_BENGAL_mac3_passOnly_biallelicOnly_rmvIndels_minMAF0Pt05_chr_E2_minDP3_minQ30_minGQ30_hwe_0.05_imiss_0.6"

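# Create a temporary file to store individuals to remove.
# NR > 1 skips the header row, whose "F_MISS" label would otherwise be
# compared as a string and counted as greater than 0.6.
temp_remove_list = "vcf_file/individuals_to_remove.txt"
!awk 'NR > 1 && $5 > 0.6 {{print $1}}' {imiss_file} > {temp_remove_list}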

# Remove individuals with missing proportion > 60% by passing the temporary file
!conda run -n vcf_filter vcftools --vcf {input_vcf_step2} \
--remove {temp_remove_list} --recode --out {output_prefix_step3}

# Clean up the temporary file (optional)
!rm {temp_remove_list}

This step writes a new VCF file, whose name starts with `machali_Aligned...imiss_0.6`, that excludes individuals with more than 60% missing data. Dropping these individuals improves downstream analyses because every remaining sample has enough genotyped sites to be informative.
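
To confirm which individuals were dropped, the sample names in the two VCF headers can be compared. This is a minimal sketch, assuming a standard `#CHROM` header line with nine fixed columns before the samples; `vcf_sample_names` is a hypothetical helper name.

```python
import gzip

def vcf_sample_names(path: str) -> list:
    """Return the sample IDs listed on the #CHROM header line of a VCF."""
    # Assumption: a ".gz" suffix means gzip/bgzip compression.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns (CHROM through FORMAT) are fixed;
                # every column after them is a sample.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError(f"no #CHROM header line found in {path}")
```

For example, `set(vcf_sample_names(input_vcf_step2)) - set(vcf_sample_names(output_prefix_step3 + ".recode.vcf"))` gives the individuals removed by the missingness cutoff.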