## Exploring FastQ files

In the FastQC notebook we saw the standard FastQ file format.

For review: a FASTQ file (file extension `.fastq`) contains anywhere from hundreds of thousands to millions of individual sequence reads. Each sequence read is represented by 4 lines in the file. Here is an example:

```bash
@HWI-ST330:304:H045HADXX:1:1101:1111:61397
CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
+
@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################
```

Line |Description 
-----|----- 
1|Always begins with ‘@’, followed by information about the read (e.g. its direction, the machine it was sequenced on, etc.)
2|The actual DNA sequence
3|Always begins with a ‘+’ and sometimes repeats the information from line 1
4|A string of characters that represent the quality scores; must have the same number of characters as line 2
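
The rule that line 4 must be as long as line 2 can be checked mechanically. The following is a minimal sketch, assuming a made-up one-record file (`example.fastq` and its contents are invented for illustration) and using `awk`, a tool not otherwise covered in this notebook:

```bash
# Write a tiny one-record FASTQ file (made-up data) to a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '@read1 example' 'CACTTGTAAG' '+' '@?@DDDDDDH' > "$tmpdir/example.fastq"

# NR is the current line number. Line 2 of each record has NR % 4 == 2,
# and line 4 of each record has NR % 4 == 0.
mismatches=$(awk 'NR % 4 == 2 { seq_len = length($0) }
                  NR % 4 == 0 && length($0) != seq_len { bad++ }
                  END { print bad + 0 }' "$tmpdir/example.fastq")

echo "records with mismatched lengths: $mismatches"
```

A count of 0 means every quality string has the same length as its sequence.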

## Viewing our FastQ files

We can take a look at our own FastQ files using commands in bash. Let's look in the `concat_fastq` folder.

In [None]:
#ls lists the contents of a directory - our files are in the concat_fastq folder

ls concat_fastq

You will notice that all the read files end in `.gz`. The `.gz` file extension denotes a compressed file. Files can be compressed (zipped) using the `gzip` command. File compression allows us to "shrink" large files so that they are easier to manage. This is helpful for sharing and downloading files, and it takes up less disk space. However, when a file is compressed we are unable to view its contents without first decompressing it. One way we can look at uncompressed files is to use the `cat` command.

Let's use `cat` to see the contents of `README.md` (a type of text file describing the content of this folder).

In [None]:
cat concat_fastq/README.md

## Previewing compressed files with `zcat`, `|`, and `head`

Unfortunately, `cat` does not work on compressed files. One command we can use to look at parts of a gzipped file without unzipping the entire file to disk is `zcat`.
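
To make the relationship between `gzip` and `zcat` concrete, here is a small sketch that compresses a throwaway file and reads it back. The file name `demo.txt` and its contents are invented for illustration, and the exact compressed size will vary between systems:

```bash
tmpdir=$(mktemp -d)

# Create a repetitive text file; repetitive data compresses well.
for i in $(seq 1 1000); do echo "ACGTACGTACGTACGTACGT"; done > "$tmpdir/demo.txt"
original_size=$(wc -c < "$tmpdir/demo.txt")

# -c writes the compressed data to stdout, so the original file is kept.
gzip -c "$tmpdir/demo.txt" > "$tmpdir/demo.txt.gz"
compressed_size=$(wc -c < "$tmpdir/demo.txt.gz")
echo "original: $original_size bytes, compressed: $compressed_size bytes"

# zcat decompresses to stdout; cmp -s confirms the text matches the original.
round_trip=$(zcat "$tmpdir/demo.txt.gz" | cmp -s - "$tmpdir/demo.txt" && echo OK)
echo "round trip: $round_trip"
```

The compressed file is much smaller, and `zcat` recovers the original text without creating an uncompressed copy on disk.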

The result of the `zcat` command is the same as using `cat` to view the contents of an uncompressed file. However, when dealing with large files we may just want to view a small subsection of the file. For instance, if we want to view the first few lines of our file (to check what data is inside) we can use `zcat` along with another command, `head`, which shows us the first few lines of a file. Since every FASTQ sequence read is exactly 4 lines long, we can call `head` with the `-n` (number of lines) flag to view an individual FASTQ record (i.e., 4 lines) or any multiple of 4.

## Piping

When we use two commands together (i.e., send the output of one command to the input of another command) we use what's called a pipe, `|`. In the example below the output of `zcat` is piped into the `head` command with the `-n4` flag to print the top 4 lines.

In [None]:
zcat concat_fastq/100_spolyrhiza_reads.fastq.gz | head -n4 

Now we can view the first sequence in the file, with its header information and the quality of its bases.
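
Because each record is 4 lines, the number passed to `head -n` can be computed from the number of records wanted. This sketch assumes a made-up three-record file (`three_reads.fastq.gz` is a hypothetical name, not one of the course files):

```bash
tmpdir=$(mktemp -d)

# Three made-up records, 4 lines each, compressed like the course files.
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTGGCCAA' '+' 'HHHHHHHH' \
  '@read3' 'GATTACAG' '+' 'GGGGGGGG' | gzip -c > "$tmpdir/three_reads.fastq.gz"

# Ask for the first 2 records, i.e. 2 * 4 = 8 lines.
records=2
first_two=$(zcat "$tmpdir/three_reads.fastq.gz" | head -n $((records * 4)))
echo "$first_two"
```

Changing `records` to another value previews that many complete records without cutting one in half.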

## BLAST to identify a sequence

We can view our sequence, but what if we want to identify the sequence? To do this we can run a BLAST (Basic Local Alignment Search Tool) search. BLAST takes a query sequence and compares it against a database of sequences. The results are the sequences that the query sequence aligns to, along with scores that assess each alignment. We can run a BLAST search against sequences in the NCBI database using the online BLASTn tool.

However, BLAST requires sequences to be in FASTA format. FASTA-formatted files start with a header line that always begins with `>`, followed by a line that contains a sequence.

Since we know the first line in our fastq file is a header beginning with `@` and the second line is the sequence, we can replace the `@` with a `>` using `sed`. `sed` is a very useful command for searching, finding and replacing, or inserting/deleting text within files. In the following command, `s/@/>/g` directs `sed` to substitute (s) every `@` with `>`, globally (g), on each line it reads. Because `head -n2` passes along only the header and sequence lines, only the header is changed here; on a whole FASTQ file this pattern would also alter any `@` characters in the quality lines.
 

In [None]:
zcat concat_fastq/100_spolyrhiza_reads.fastq.gz | head -n2 | sed 's/@/>/g'

Next we can copy and paste the above sequence into the "Query Sequence" box in the BLAST tool on [NCBI](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) and click "BLAST".

Browse through the hits to confirm which species the sequence aligns to best.

![ra ndom](https://blog.addgene.org/hs-fs/hub/306096/file-893840440-png/4_14_to_6_14/BLAST_screenshots/blastn_align2_cropped.png?width=674&name=blastn_align2_cropped.png)

Follow the instructions in the recording of the November meetup video to BLAST the sequence and see which species this DNA matches.
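
The `sed` substitution used earlier is fine for a two-line preview. To convert more than one record, the header and sequence lines have to be picked out by position so that quality lines are left alone. One possible sketch uses `awk` (not covered in this notebook) on a made-up two-record file; `two_reads.fastq.gz` and its reads are invented for illustration:

```bash
tmpdir=$(mktemp -d)

# Two made-up records; the first quality line deliberately starts with '@'.
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' '@IIIIIII' \
  '@read2' 'TTGGCCAA' '+' 'HHHHHHHH' | gzip -c > "$tmpdir/two_reads.fastq.gz"

# Line 1 of each record: print '>' plus everything after the leading '@'.
# Line 2 of each record: print the sequence unchanged.
# Lines 3 and 4 ('+' and quality) are skipped.
fasta=$(zcat "$tmpdir/two_reads.fastq.gz" |
  awk 'NR % 4 == 1 { print ">" substr($0, 2) }
       NR % 4 == 2 { print }')

echo "$fasta"
```

The result contains one `>` header and one sequence line per read, which is the layout BLAST expects.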

## Counting the number of sequences

Other useful command line tools include `grep` and `wc` (word count). `grep` uses regular expressions to search for patterns in a file. `wc` allows us to count the number of lines, words, or characters in a text file. Since we know that every sequence has a header that starts with `@`, we can use `grep` and `wc` to count how many sequences are in our fastq file.



First, let's see how `grep` works. What if we wanted to find DNA that matches the following sequence:
"GGTTTTTATGAATTGACCATTTTGGATGTAGCGTTGTCTTAAAGGATTCAGAAGTATTGGTTCTTTAAAAGATGTCTGAGCAACTAAATGATATGTCATATTGGATATGTAATGTTTTGAATTTAATAATATTTTAAGT"

We can use `grep` to do this. Remember since our file is compressed, we will have to `zcat` and `|` too:

In [None]:
zcat concat_fastq/100_spolyrhiza_reads.fastq.gz | grep "GGTTTTTATGAATTGACCATTTTGGATGTAGCGTTGTCTTAAAGGATTCAGAAGTATTGGTTCTTTAAAAGATGTCTGAGCAACTAAATGATATGTCATATTGGATATGTAATGTTTTGAATTTAATAATATTTTAAGT"

`grep` returns the one line in our file that matches our input sequence. However, we don't really know which FASTQ record this is. We can tell `grep` to also return the line before the match, since line 1 of a FASTQ record has the name and line 2 is the DNA sequence. We will add the `-B1` flag, where `-B` (capital B) means "return lines before the match" and `1` means that we want just one line before (we could give it any number n to ask for n lines before). Note that lowercase `-b` is a different option that prints byte offsets.

In [None]:
zcat concat_fastq/100_spolyrhiza_reads.fastq.gz | grep -B1 "GGTTTTTATGAATTGACCATTTTGGATGTAGCGTTGTCTTAAAGGATTCAGAAGTATTGGTTCTTTAAAAGATGTCTGAGCAACTAAATGATATGTCATATTGGATATGTAATGTTTTGAATTTAATAATATTTTAAGT"
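
`grep` also has an `-A` option for lines after a match, so combining "one line before" with "two lines after" returns the complete 4-line record. The sketch below assumes a made-up two-record file (`two_reads.fastq.gz` is a hypothetical name) and a short search string chosen for illustration:

```bash
tmpdir=$(mktemp -d)

# Two made-up records, compressed like the course files.
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTGGCCAA' '+' 'HHHHHHHH' | gzip -c > "$tmpdir/two_reads.fastq.gz"

# -B1 adds the header line before the match;
# -A2 adds the '+' line and the quality line after it.
record=$(zcat "$tmpdir/two_reads.fastq.gz" | grep -B1 -A2 'TTGGCC')

echo "$record"
```

Only the second record matches, so the output is its header, sequence, `+` line, and quality string.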

Now, let's search for lines that begin with `@`. The `^` means "match a line that begins with" the string that follows.

In [None]:
zcat concat_fastq/100_spolyrhiza_reads.fastq.gz | grep ^@ 

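This result gives us the header information from every FASTQ record. Be aware that a quality line can also begin with `@` (as in the example record at the top of this notebook), so on some files this pattern can match more lines than there are records. How many records are there? We can use the word count command `wc` to find out. Since we are adding another command, we will use another `|`.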

In [None]:
zcat concat_fastq/100_spolyrhiza_reads.fastq.gz | grep ^@ | wc 

`wc` returns the number of lines, "words", and "characters". We can use the `-l` flag to print only the number of lines (in case we forget which is which).

In [None]:
zcat concat_fastq/100_spolyrhiza_reads.fastq.gz | grep ^@ | wc -l
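
As a cross-check on the `grep` count, the total number of lines can be divided by 4, since every record is exactly 4 lines. This sketch uses a made-up two-record file (`two_reads.fastq.gz` is a hypothetical name) in which one quality line starts with `@`, to show how the two counts can differ:

```bash
tmpdir=$(mktemp -d)

# Two made-up records; the first quality line deliberately starts with '@'.
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' '@IIIIIII' \
  '@read2' 'TTGGCCAA' '+' 'HHHHHHHH' | gzip -c > "$tmpdir/two_reads.fastq.gz"

# Count lines that begin with '@' (grep -c prints the number of matching lines).
header_like=$(zcat "$tmpdir/two_reads.fastq.gz" | grep -c '^@')

# Count all lines and divide by 4 lines per record.
total_lines=$(zcat "$tmpdir/two_reads.fastq.gz" | wc -l)
records=$((total_lines / 4))

echo "lines starting with @: $header_like"
echo "records (lines / 4):   $records"
```

When the two numbers agree, the `grep` count can be trusted; when they disagree, the line count divided by 4 is the number of records.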

## Challenge

How many sequences are in the `spolyrhiza_reads.fastq.gz` file? Use the commands above to form the correct sequence of pipes and commands. Be sure to correctly specify the path to the `spolyrhiza_reads.fastq.gz` file.

