
# **This hands-on workshop is designed to introduce students to the basic analysis pipeline for Next-Generation Sequencing (NGS) reads, specifically from the Illumina platform, for population genetics analysis.**
We will focus on whole genome resequencing data (WGS), though the same pipeline may also be applied to RAD-seq or amplicon-seq data (with minor modifications) if a reference genome is available.

By the end of the workshop, participants will understand how to:



*   Work with raw sequencing data (FASTQ files)
*   Perform quality control and preprocessing
*   Map reads to a reference genome
*   Call and filter genetic variants (SNPs/indels)

# Pre-requisites & Setup
Before joining the workshop, please make sure you are familiar with the Unix/Linux commands and tools that we will primarily be using.
Go through the command descriptions listed below to get a good overview of the command language we will be using in this workshop.

This Linux cheat sheet covers the most important commands, from basic file handling to remote access and downloads.
1. `ls` : List files and directories.\
Examples:\
`ls -l`
Displays files and directories with detailed information.\
`ls -a`
Shows all files and directories, including hidden ones.\
`ls -lh`
Displays file sizes in a human-readable format.

2. `cd` : Change directory.\
`cd /path/to/directory`
Changes the current directory to the specified path.

3. `pwd` : Print current working directory.

4. `mkdir` : Create a new directory.

5. `rm` : Remove files or directories.\
`rm -r` Deletes a directory and its contents.

6. `mv` : Move or rename a file.

7. `cp`: Copy files and directories.

8. `cat` : View the contents of a file.

9. `head` : View the first few lines of a file.

10. `tail` : View the last few lines of a file.

11. `ln` : Create links between files.

12. `find` : Search for files and directories.

13. `chmod`: Change file permissions.

14. `tar`: Create or extract archive files.

15. `gzip`: Compress files.

16. `zip`: Create compressed `zip` archives.

17. `ssh`: Securely connect to a remote server.\
`ssh user@hostname`
Initiates an SSH connection to the specified hostname.

18. `scp` : Securely copy files between hosts.\
`scp file.txt user@hostname:/path/to/destination`
Securely copies “file.txt” to the specified remote host.

19. `wget` : Download files from the web.\
`wget http://example.com/file.txt`
Downloads “file.txt” from the specified URL.
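
To see several of these commands working together, here is a minimal sketch that runs entirely inside a scratch directory. The names `linux_basics_demo`, `raw`, `notes.txt` and `notes_backup.txt` are made up for this illustration and are not part of the workshop data.

```shell
# Work inside a scratch directory so nothing else is touched.
# All file and directory names below are hypothetical examples.
mkdir -p linux_basics_demo/raw
cd linux_basics_demo
pwd                                  # print the directory we are now in

# Create a small three-line file, then copy and rename it.
printf 'line1\nline2\nline3\n' > raw/notes.txt
cp raw/notes.txt notes_copy.txt      # copy the file
mv notes_copy.txt notes_backup.txt   # rename the copy

ls -lh                               # detailed listing with readable sizes
head -n 2 raw/notes.txt              # first two lines
tail -n 1 raw/notes.txt              # last line

# Bundle the "raw" directory into a gzip-compressed archive, then list its contents.
tar -czf raw.tar.gz raw
tar -tzf raw.tar.gz

cd ..                                # return to where we started
```

When you are done experimenting, `rm -r linux_basics_demo` removes the scratch directory and everything in it.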

# To get the most out of this workshop, we recommend checking out the following resource:
How does Illumina sequencing work?\
[Check out this brief and engaging video to understand the sequencing technology behind your data](https://youtu.be/fCd6B5HRaZ8?si=GPKriSYOUKEPGFoo)



# Question 1

Create the following table in a file named test_table.txt using the terminal

```
1  2  3  4  5
6  7  8  9  10
11  12  13  14
15  16  17  18
```


In [None]:
%%bash
# Question 1 solution (%%bash has to be the first line of the cell)
touch test_table.txt # creates an empty text file in the directory (optional: the redirection below also creates it)
{
    printf "%2d  %2d  %2d  %2d  %2d\n" 1 2 3 4 5
    printf "%2d  %2d  %2d  %2d  %2d\n" 6 7 8 9 10
    printf "%2d  %2d  %2d  %2d\n" 11 12 13 14
    printf "%2d  %2d  %2d  %2d\n" 15 16 17 18
} > test_table.txt # ">" redirects the printed table into test_table.txt, overwriting any previous content

cat test_table.txt # displays the content of the file

 1   2   3   4   5
 6   7   8   9  10
11  12  13  14
15  16  17  18
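
An alternative way to write a small table, not the one used in the solution above, is a here-document, which copies the text between the two `EOF` markers into the file exactly as typed. This is only a sketch; the file name `test_table_heredoc.txt` is a hypothetical name chosen so that `test_table.txt` is left untouched.

```shell
# Write the table exactly as typed, using a here-document.
# Quoting 'EOF' prevents the shell from expanding anything inside the block.
cat > test_table_heredoc.txt <<'EOF'
1  2  3  4  5
6  7  8  9  10
11  12  13  14
15  16  17  18
EOF

cat test_table_heredoc.txt   # display the file to check the result
```

Unlike the `printf "%2d"` approach, a here-document does no padding, so single-digit numbers are not right-aligned.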


# Question 2

Create a directory named genomics_workshop and move the file test_table.txt into it



In [None]:
%%bash
mkdir -p genomics_workshop #creates new directory or folder named "genomics_workshop" only if it doesn't exist

# Check if test_table.txt exists in the current directory before attempting to move it
if [ -f "test_table.txt" ]; then
    mv test_table.txt genomics_workshop #moves the test_table.txt file into the genomics_workshop folder
fi

# Question 3

Count the number of lines in test_table.txt

In [None]:
%%bash
# Question 3 solution
wc -l genomics_workshop/test_table.txt # wc = word count; -l counts lines; the path is given relative to the current directory

4 genomics_workshop/test_table.txt


# Question 4

Count the number of columns per row in test_table.txt

In [None]:
%%bash
# Question 4 solution
awk '{print NF}' genomics_workshop/test_table.txt # awk reads the input line by line; {print NF} runs for each line and prints NF, the number of whitespace-separated fields on that line

5
5
4
4


# Question 5

Count the number of times "1" appears in test_table.txt

In [None]:
%%bash
# Question 5 solution
grep -o '1' genomics_workshop/test_table.txt | wc -l # -o prints every match on its own line, so wc -l counts occurrences

11
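
The `-o` flag is what makes this count occurrences. Without it, `grep -c` counts matching lines, so a line containing several "1"s is only counted once. A minimal sketch of the difference, using a throwaway file (`demo_counts.txt` is a hypothetical name):

```shell
# Three lines: the first contains "1" three times, the second none, the third once.
printf '11 12\n3 4\n1 5\n' > demo_counts.txt

lines_with_1=$(grep -c '1' demo_counts.txt)          # lines containing at least one "1"
occurrences=$(grep -o '1' demo_counts.txt | wc -l)   # every individual "1"

echo "lines containing 1: $lines_with_1"
echo "occurrences of 1: $occurrences"
```

Here two lines contain a "1", but the character appears four times in total.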


# Question 6

Count the number of words/strings containing "1" in test_table.txt

In [None]:
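%%bash
# Question 6 solution
grep -o '\S*1\S*' genomics_workshop/test_table.txt | wc -l # \S* matches any run of non-whitespace characters, so each match is one whole word containing "1"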


10


**Tutorial Questions**

1) From an ecological genomics perspective, why is learning the command line (and not just GUIs) important?
2) What are the main limitations of amplicon-seq for population genomics compared with WGS?
3) What does the pipe operator "|" do?
4) What is “sequencing-by-synthesis” (SBS)?
5) What role do fluorescent tags on nucleotides play in Illumina sequencing?
