<a href="https://colab.research.google.com/github/Aksinhaa/ColabFold/blob/main/NGS_collab_Linux_advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

This notebook provides a comprehensive guide and analysis of sequencing datasets. We will cover various tasks, including data download, understanding reference index files, exploring the structure of FASTQ files, and performing several quality control and analysis steps on sequencing reads. Each section is designed to walk you through the process, from basic file operations to more advanced data interrogation.



# Sequencing Dataset Instructions

**Task 1** : Downloads\
Download the datasets from https://zenodo.org/records/14258052 .\
FASTQ reads are in `.gz` format. Original genome files are in `.fna` (FASTA) format.

### Downloading the FASTQ files

The FASTQ files used in this notebook are downloaded directly from the Zenodo repository. Zenodo file URLs typically follow the pattern `https://zenodo.org/records/{record_id}/files/{filename}`.

**Task 2** : Reference Index Sidecar Files

*   `.fna.amb`: stores information about ambiguous bases (e.g., N) in the reference genome.
*   `.fna.ann`: stores information about sequence names, lengths, and other metadata.
*   `.fna.bwt`: contains the Burrows–Wheeler transformed sequence.
*   `.fna.pac`: contains packed sequence data.
*   `.fna.sa`: suffix array index, used to locate sequence positions.
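
The five extensions listed above match the files that `bwa index` typically writes next to a reference FASTA. Before mapping, it can help to confirm that none of them is missing. The following is a minimal sketch under that assumption; the helper name `missing_index_files` and the placeholder file `demo_ref.fna` are hypothetical and only serve the illustration.

```shell
# List the index sidecar files that are missing for a reference FASTA.
# Hypothetical helper; the extension list is taken from the bullets above.
missing_index_files() {
  ref="$1"
  for ext in amb ann bwt pac sa; do
    [ -e "$ref.$ext" ] || echo "$ref.$ext"
  done
}

# Demo on empty placeholder files in a scratch directory (not real index data).
demo_dir=$(mktemp -d)
touch "$demo_dir/demo_ref.fna" "$demo_dir/demo_ref.fna.amb" "$demo_dir/demo_ref.fna.ann"
missing_index_files "$demo_dir/demo_ref.fna"   # lists the .bwt, .pac and .sa paths
```

An empty output means all five sidecar files are present. In a notebook, put `%%bash` on the first line of the cell.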






**Task 3**: Structure of a FASTQ File

FASTQ files are obtained after demultiplexing Illumina sequencing data. Providers usually deliver them as `.fastq.gz` for compression. \
Each record has four lines:


```
@SRR15369215.126490887  # sequence identifier
GGACCTTCTGTCATTTCACTCCTTCTGAAGTAAGGAGTGAAGTAAACACGAAGTAAACACGACAGGTTAGTCCTATTCCTTCAAGCAGGAGTACAGAAAAGAATGCAAATTCTGGGTTCTAGCCCAGCTTTTACTCCTATGGTTCTATTT  # sequence
+  #separator
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAFJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJJJFFFJJJJJJJJJJJ  # base call quality scores (ASCII characters)
```
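
Each character on the quality line encodes one Phred score. Assuming Phred+33 encoding (the usual convention for current Illumina output, though the notebook does not state it), the score is the character's ASCII code minus 33. The sketch below decodes a quality string with `awk`; the function name `decode_phred` is hypothetical.

```shell
# Decode a FASTQ quality string into Phred scores, assuming Phred+33 encoding.
decode_phred() {
  printf '%s\n' "$1" | awk '
    # Build a character-to-ASCII-code lookup for the printable range.
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    {
      out = ""
      for (j = 1; j <= length($0); j++) {
        q = ord[substr($0, j, 1)] - 33
        out = (j == 1) ? q "" : out " " q
      }
      print out
    }'
}

# First six quality characters of the example record above.
decode_phred "AAFFFJ"   # prints: 32 32 37 37 37 41
```

Under this assumption, `J` corresponds to Q41 and `A` to Q32, so the example read is of high quality throughout.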



In [None]:
%%bash
# Download the files (the %%bash magic must be the first line of the cell)
wget https://zenodo.org/records/14258052/files/BEN_CI16_sub_1.fq.gz
wget https://zenodo.org/records/14258052/files/BEN_CI16_sub_2.fq.gz



In [None]:
%%bash
# Decompress all the downloaded files

gunzip *fq.gz
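
Decompression is not strictly required for counting reads. Plain `gunzip` replaces each `.gz` file with its decompressed copy, so the approach below applies to files that are still compressed. This is a minimal sketch, assuming `gzip` is installed; the function name `count_reads_gz` and the two-read demo file are hypothetical, made-up data.

```shell
# Count reads in a gzip-compressed FASTQ without writing the decompressed file.
count_reads_gz() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Demo on a tiny two-read file built in a scratch directory (made-up reads).
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nFFFF\n@r2\nACGTA\n+\nFFFFF\n' | gzip > "$demo_dir/demo.fq.gz"
count_reads_gz "$demo_dir/demo.fq.gz"   # prints: 2
```

Streaming with `gzip -dc` keeps disk usage low, which matters for full-size sequencing runs.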

View the first 20 lines of a FASTQ file. How many sequences are present? What does each line represent?

In [None]:
%%bash
head -n 20 BEN_CI16_sub_1.fq

# Question 1

Count the number of sequencing reads in a FASTQ file.

In [None]:
%%bash
awk 'NR % 4 == 1' BEN_CI16_sub_1.fq | wc -l
# awk 'NR % 4 == 1': keeps lines whose line number (NR) modulo 4 is 1, which are the header lines (starting with @).
#wc -l: Counts the number of lines, effectively counting the number of reads.
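
The modulo-4 approach relies on every record occupying exactly four lines. Counting lines that start with `@` is not a safe shortcut, because `@` is also a valid quality character (Q31 under Phred+33), so a quality line can begin with it. The sketch below checks the four-line layout before trusting the count; `check_fastq` and the demo file are hypothetical.

```shell
# Check that a FASTQ file follows the four-line layout the modulo-4 counts rely on.
check_fastq() {
  awk '
    NR % 4 == 1 && $0 !~ /^@/  { bad++ }            # header must start with @
    NR % 4 == 3 && $0 !~ /^\+/ { bad++ }            # separator must start with +
    NR % 4 == 2 { seqlen = length($0) }             # remember the sequence length
    NR % 4 == 0 && length($0) != seqlen { bad++ }   # quality must match that length
    END {
      if (NR % 4 != 0) bad++
      if (bad) { print "malformed"; exit 1 }
      printf "%d reads, layout OK\n", NR / 4
    }' "$1"
}

# Demo: the first quality line deliberately starts with @ (made-up reads).
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\n@FFF\n@r2\nACGTA\n+\nFFFFF\n' > "$demo_dir/demo.fq"
check_fastq "$demo_dir/demo.fq"      # prints: 2 reads, layout OK
grep -c '^@' "$demo_dir/demo.fq"     # prints: 3, an overcount
```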



### Count the number of reads in all FASTQ files

In [None]:
%%bash

#bulk analysis
for file in *.fq; do
  read_count=$(( $(wc -l < "$file") / 4 ))
  echo "$(basename "$file"): $read_count reads"
done

# Question 2

How many reads are shorter than 150 bp?

In [None]:
%%bash
# Question 2 solution
awk 'NR % 4 == 2 && length($0) < 150' BEN_CI16_sub_1.fq | wc -l




### How many reads are shorter than 150 bp in all the FASTQ files:

In [None]:
%%bash

#bulk analysis
for file in *.fq; do
  reads=$(awk 'NR % 4 == 2 && length($0) < 150' "$file" | wc -l)
  echo "$(basename "$file"): $reads reads shorter than 150 bp"
done


# Question 3

How many reads are longer than 150 bp?

In [None]:
%%bash
# Question 3 solution
awk 'NR % 4 == 2 && length($0) > 150' BEN_CI16_sub_1.fq | wc -l



### How many reads are longer than 150 bp in all the FASTQ files:

In [None]:
%%bash

#bulk analysis
for file in *.fq; do
  reads=$(awk 'NR % 4 == 2 && length($0) > 150' "$file" | wc -l)
  echo "$(basename "$file"): $reads reads longer than 150 bp"
done

# Question 4

How many reads are not equal to 150 bp?

In [None]:
%%bash
# Question 4 solution
awk 'NR % 4 == 2 && length($0) != 150' BEN_CI16_sub_1.fq | wc -l
# NR % 4 == 2: selects the sequence line of each record.
# length($0) != 150: keeps only sequences whose length is not 150, so wc -l counts them.



### How many reads are not equal to 150 bp in all the FASTQ files:

In [None]:
%%bash

#bulk analysis
for file in *.fq; do
  count=$(awk 'NR % 4 == 2 && length($0) != 150' "$file" | wc -l)
  echo "$(basename "$file"): $count reads not equal to 150bp"
done
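
Questions 2 to 4 each read the file once per comparison. The three counts can also be collected in a single pass with the threshold passed as a parameter, so the same helper works for other read lengths. This is a sketch only; `length_classes` and the three-read demo file are hypothetical.

```shell
# Classify read lengths against a threshold in a single pass over the file.
# Usage: length_classes FILE THRESHOLD
length_classes() {
  awk -v t="$2" '
    NR % 4 == 2 {
      n = length($0)
      if (n < t) shorter++
      else if (n > t) longer++
      else equal++
    }
    END { printf "shorter=%d equal=%d longer=%d\n", shorter, equal, longer }' "$1"
}

# Demo on made-up reads of length 4, 5 and 6 with a threshold of 5.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nFFFF\n@r2\nACGTA\n+\nFFFFF\n@r3\nACGTAC\n+\nFFFFFF\n' > "$demo_dir/demo.fq"
length_classes "$demo_dir/demo.fq" 5   # prints: shorter=1 equal=1 longer=1
```

For the dataset in this notebook the call would be `length_classes BEN_CI16_sub_1.fq 150`. The three numbers should sum to the total read count from Question 1, which gives a quick consistency check.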

# Question 5

How many reads are contaminated with Illumina adapters?



```
# Adapter sequence: CTGTCTCTTATACACATCT
```



In [None]:
%%bash
count=$(awk 'NR % 4 == 2' BEN_CI16_sub_1.fq | grep -c 'CTGTCTCTTATACACATCT')
echo "$count"

### To check how many reads are contaminated with Illumina adapters in all the FASTQ files:

In [None]:
%%bash

#bulk analysis
for file in *.fq; do
    count=$(awk 'NR % 4 == 2' "$file" | grep -c 'CTGTCTCTTATACACATCT')
    echo "$file: $count"
done
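
A raw count is easier to interpret next to the total number of reads. The sketch below reports both in one pass. It only detects exact, full-length copies of the adapter, so reads where the adapter is truncated at the 3' end or contains a sequencing error are missed and the result is best read as a lower bound. The name `count_adapter_reads` and the demo reads are hypothetical.

```shell
# Report how many reads contain an exact copy of the adapter, out of all reads.
# Usage: count_adapter_reads FILE ADAPTER
count_adapter_reads() {
  awk -v a="$2" '
    NR % 4 == 2 { total++; if (index($0, a) > 0) hits++ }
    END { printf "%d of %d reads contain %s\n", hits, total, a }' "$1"
}

# Demo on three made-up reads; the first and third contain the adapter.
demo_dir=$(mktemp -d)
{
  printf '@r1\nAACTGTCTCTTATACACATCTGG\n+\nFFFFFFFFFFFFFFFFFFFFFFF\n'
  printf '@r2\nACGTACGT\n+\nFFFFFFFF\n'
  printf '@r3\nCTGTCTCTTATACACATCT\n+\nFFFFFFFFFFFFFFFFFFF\n'
} > "$demo_dir/demo.fq"
count_adapter_reads "$demo_dir/demo.fq" CTGTCTCTTATACACATCT
# prints: 2 of 3 reads contain CTGTCTCTTATACACATCT
```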

**Tutorial Questions**
1) Download the following files and, in the empty code cell provided, count the reads that are longer than, shorter than, and equal to 100 bp.

https://zenodo.org/records/14258052/files/BEN_NW10_sub_1.fq.gz

https://zenodo.org/records/14258052/files/BEN_NW10_sub_2.fq.gz


https://zenodo.org/records/14258052/files/BEN_SI18_sub_1.fq.gz

https://zenodo.org/records/14258052/files/BEN_SI18_sub_2.fq.gz




























# Answer the following questions in the text cell below:

1) Why does counting reads in a FASTQ file involve dividing the total number of lines by 4?
2) How does sequencing quality typically vary along the length of a read, and why?
3) What biological or technical factors contribute to low-quality base calls?
4) Conceptually, how does insert size constrain correct mapping locations?
5) If a dataset has many bases with Phred scores below Q20, how would this affect downstream mapping?
