# Streams and Redirection

### Use streams to avoid loading large amounts of data into memory

Example of printing contents to standard output stream (STDOUT)

![Image from linux-training.be](http://linux-training.be/funhtml/images/bash_ioredirection_keyboard_display.png)

### STDIN, STDERR, STDOUT

![Image from ryanstutorials.net/linuxtutorial](https://ryanstutorials.net/linuxtutorial/img/streams.png)

### Redirection

To redirect STDOUT to a file, we use the following operators:
- `>` redirects standard output to a file, overwriting any existing content
- `>>` appends standard output to the end of a file
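
The difference between the two operators can be seen in a minimal sketch like the one below. The scratch directory and the file name `notes.txt` are made up for illustration.

```bash
# Work in a scratch directory so no real files are overwritten
workdir=$(mktemp -d)
cd "$workdir"

echo "first line"  > notes.txt    # creates notes.txt
echo "second line" > notes.txt    # overwrites it: "first line" is gone
echo "third line" >> notes.txt    # appends to the end of the file

cat notes.txt
```

After these commands, `notes.txt` should contain only `second line` and `third line`.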

# Creating project directory

Before we get started with today's hands-on workshop, let's create a project directory.

In [2]:
# Go into our home directory
cd ~
# View contents in home directory
# The contents in our directory will go to standard output
ls

AnacondaProjects	GitHub			Public
Applications		Library			Screenshots
Desktop			Movies			anaconda3
Documents		Music			workshop
Downloads		OneDrive
Dropbox			Pictures


In [4]:
# If we make a mistake when typing our commands, we will see a standard error output
ljfeaifje

bash: ljfeaifje: command not found


: 127
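
The `: 127` shown after the error message appears to be the exit status of the failed command, as reported by the notebook. In a regular terminal you can inspect it yourself through the special variable `$?`. A small sketch, where `no_such_command_xyz` is a made-up name used only to trigger the error:

```bash
status=0
# The error message goes to STDERR; it is discarded here with 2> /dev/null.
# The part after || only runs when the command fails, and $? then holds
# the exit status of that failed command.
no_such_command_xyz 2> /dev/null || status=$?

echo "exit status: $status"
```

The shell uses exit status 127 for "command not found"; a status of 0 means success.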

In [5]:
# Make a directory called "workshop"
mkdir workshop
# See if directory was created
# Use long listing option (-l) and sort by modification time, newest first (-t)
ls -lt

mkdir: workshop: File exists
total 0
drwx------+ 28 chaochih  staff   896 Jan 11 12:36 Downloads
drwx------+  6 chaochih  staff   192 Jan 10 18:09 Desktop
drwxr-xr-x  10 chaochih  staff   320 Jan 10 18:09 Screenshots
lrwxr-xr-x   1 chaochih  staff    30 Jan 10 11:59 GitHub -> /Users/chaochih/Dropbox/GitHub
drwx------@ 43 chaochih  staff  1376 Jan 10 11:05 Dropbox
drwx------@ 70 chaochih  staff  2240 Jan 10 11:02 Library
drwx------@ 25 chaochih  staff   800 Jan  9 20:48 OneDrive
drwx------+  8 chaochih  staff   256 Jan  7 23:44 Pictures
drwxr-xr-x   7 chaochih  staff   224 Jan  7 21:42 workshop
drwxr-xr-x  24 chaochih  staff   768 Jan  7 13:24 anaconda3
drwxr-xr-x   2 chaochih  staff    64 Jan  6 15:18 AnacondaProjects
drwx------+  8 chaochih  staff   256 Jan  5 19:36 Music
drwx------   5 chaochih  staff   160 Jan  5 10:55 Applications
drwx------+  7 chaochih  staff   224 Jan  5 10:55 Movies
drwx------+  5 chaochih  staff   160 Jan  5 10:53 Documents
drwxr-xr-x+  5 chaochih  staff   160
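
The `mkdir: workshop: File exists` message above is written to standard error because the directory already existed. If you would rather have the command succeed quietly in that case, one option is `mkdir -p`. A sketch using a scratch location and a made-up directory name (`demo_project`):

```bash
workdir=$(mktemp -d)
cd "$workdir"

mkdir -p demo_project   # creates the directory
mkdir -p demo_project   # already exists: no error message this time

ls -l
```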

In [6]:
# To get help on a command, pull up its manual page,
# which contains detailed documentation
# The general syntax is: "man" followed by the name of the command
man ls


LS(1)                     BSD General Commands Manual                    LS(1)

NAME
     ls -- list directory contents

SYNOPSIS
     ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]

DESCRIPTION
     For each operand that names a file of a type other than directory, ls
     displays its name as well as any requested, associated information.  For
     each operand that names a file of type directory, ls displays the names
     of files contained within that directory, as well as any requested, asso-
     ciated information.

     If no operands are given, the contents of the current directory are dis-
     played.  If more than one operand is given, non-directory operands are
     displayed first; directory and non-directory operands are sorted sepa-
     rately and in lexicographical order.

     The following options are available:

     -@      Display extended attribute keys and sizes in long (-l) output.

     -1      (The numeric digit ``one''.)  Force output to be one e

     future, then the year of the last modification is displayed in place of
     the hour and minute fields.

     If the owner or group names are not a known user or group name, or the -n
     option is given, the numeric ID's are displayed.

     If the file is a character special or block special file, the major and
     minor device numbers for the file are displayed in the size field.  If
     the file is a symbolic link, the pathname of the linked-to file is pre-
     ceded by ``->''.

     The file mode printed under the -l option consists of the entry type,
     owner permissions, and group permissions.  The entry type character
     describes the type of file, as follows:

           b     Block special file.
           c     Character special file.
           d     Directory.
           l     Symbolic link.
           s     Socket link.
           p     FIFO.
           -     Regular file.

     The next three fields are three characters each: owner permissions, group
     p

     specification.

LEGACY DESCRIPTION
     In legacy mode, the -f option does not turn on the -a option and the -g,
     -n, and -o options do not turn on the -l option.

     Also, the -o option causes the file flags to be included in a long (-l)
     output; there is no -O option.

     When -H is specified (and not overridden by -L or -P) and a file argument
     is a symlink that resolves to a non-directory file, the output will
     reflect the nature of the link, rather than that of the file.  In legacy
     operation, the output will describe the file.

     For more information about legacy mode, see compat(5).

SEE ALSO
     chflags(1), chmod(1), sort(1), xterm(1), compat(5), termcap(5),
     symlink(7), sticky(8)

STANDARDS
     The ls utility conforms to IEEE Std 1003.1-2001 (``POSIX.1'').

HISTORY
     An ls command appeared in Version 1 AT&T UNIX.

BUGS
     To maintain backward compatibility, the relationships between the many
     options are quite complex.

BSD       

In [7]:
# Now, let's go into our new directory
cd workshop

In [8]:
# And check where we are
pwd

/Users/chaochih/workshop


# Example 1: Combine two files and redirect to a single file

In this example, we will combine two FASTA files (`tb1-protein.fasta` and `tga1-protein.fasta`). FASTA files contain nucleotide sequences or peptide/protein sequences with nucleotides/amino acids represented by single letter codes.

First, we will download our files using the command "wget"

    Note: by default, wget downloads the file into your current working directory.
    Use the -P option to specify a different output directory,
    e.g. wget -P /path/to/output_directory https://link.to.yourfile.com/filename.txt
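
When a file with the same name already exists, wget does not overwrite it; it saves a second copy with a `.1` suffix, as the download log below shows. One way to avoid that is the `-nc` (no-clobber) option, combined here with `-P`. The helper name `fetch_into` and the directory `fasta_files` are hypothetical, chosen only for this sketch:

```bash
# fetch_into DIR URL
# Download URL into DIR, skipping the download if the file is already there
fetch_into() {
    mkdir -p "$1"
    wget -nc -P "$1" "$2"
}

# Example call (requires network access):
# fetch_into fasta_files https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-03-remedial-unix/tb1-protein.fasta
```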

In [9]:
# File 1 is tb1-protein.fasta
wget https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-03-remedial-unix/tb1-protein.fasta

# File 2 is tga1-protein.fasta
wget https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-03-remedial-unix/tga1-protein.fasta

--2018-01-11 19:41:03--  https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-03-remedial-unix/tb1-protein.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.184.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.184.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 262 [text/plain]
Saving to: ‘tb1-protein.fasta.1’


2018-01-11 19:41:04 (9.99 MB/s) - ‘tb1-protein.fasta.1’ saved [353]

--2018-01-11 19:41:04--  https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-03-remedial-unix/tga1-protein.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.184.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.184.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 147 [text/plain]
Saving to: ‘tga1-protein.fasta.1’


2018-01-11 19:41:04 (6.68 MB/s) - ‘tga1-protein.fasta.1’ saved [152]



In [10]:
ls

combined_proteins.fasta	tb1-protein.fasta	tga1-protein.fasta.1
dir_listing.stderr	tb1-protein.fasta.1
dir_listing.txt		tga1-protein.fasta


In [11]:
# View contents of tb1-protein.fasta
head tb1-protein.fasta

>teosinte-branched-1 protein
LGVPSVKHMFPFCDSSSPMDLPLYQQLQLSPSSPKTDQSSSFYCYPCSPP
FAAADASFPLSYQIGSAAAADATPPQAVINSPDLPVQALMDHAPAPATEL
GACASGAEGSGASLDRAAAAARKDRHSKICTAGGMRDRRMRLSLDVARKF
FALQDMLGFDKASKTVQWLLNTSKSAIQEIMADDASSECVEDGSSSLSVD
GKHNPAEQLGGGGDQKPKGNCRGEGKKPAKASKAAATPKPPRKSANNAHQ
VPDKETRAKARERARERTKEKHRMRWVKLASAIDVEAAAASVPSDRPSSN
NLSHHSSLSMNMPCAAA


In [12]:
# View contents of tga1-protein.fasta
head tga1-protein.fasta

>teosinte-glume-architecture-1 protein
DSDCALSLLSAPANSSGIDVSRMVRPTEHVPMAQQPVVPGLQFGSASWFP
RPQASTGGSFVPSCPAAVEGEQQLNAVLGPNDSEVSMNYGGMFHVGGGSG
GGEGSSDGGT


In [13]:
# Concatenate two files with "cat" command and print to standard output
cat tb1-protein.fasta tga1-protein.fasta

>teosinte-branched-1 protein
LGVPSVKHMFPFCDSSSPMDLPLYQQLQLSPSSPKTDQSSSFYCYPCSPP
FAAADASFPLSYQIGSAAAADATPPQAVINSPDLPVQALMDHAPAPATEL
GACASGAEGSGASLDRAAAAARKDRHSKICTAGGMRDRRMRLSLDVARKF
FALQDMLGFDKASKTVQWLLNTSKSAIQEIMADDASSECVEDGSSSLSVD
GKHNPAEQLGGGGDQKPKGNCRGEGKKPAKASKAAATPKPPRKSANNAHQ
VPDKETRAKARERARERTKEKHRMRWVKLASAIDVEAAAASVPSDRPSSN
NLSHHSSLSMNMPCAAA
>teosinte-glume-architecture-1 protein
DSDCALSLLSAPANSSGIDVSRMVRPTEHVPMAQQPVVPGLQFGSASWFP
RPQASTGGSFVPSCPAAVEGEQQLNAVLGPNDSEVSMNYGGMFHVGGGSG
GGEGSSDGGT


In [14]:
# Now we will redirect (save) our output into a file
cat tb1-protein.fasta tga1-protein.fasta > combined_proteins.fasta

In [15]:
# Let's see if our file was created
ls -lt

total 56
-rw-r--r--  1 chaochih  staff  505 Jan 11 19:46 combined_proteins.fasta
-rw-r--r--@ 1 chaochih  staff  152 Jan 11 19:41 tga1-protein.fasta.1
-rw-r--r--@ 1 chaochih  staff  353 Jan 11 19:41 tb1-protein.fasta.1
-rw-r--r--  1 chaochih  staff   66 Jan  7 21:42 dir_listing.txt
-rw-r--r--  1 chaochih  staff   42 Jan  7 21:42 dir_listing.stderr
-rw-r--r--@ 1 chaochih  staff  152 Jan  7 21:23 tga1-protein.fasta
-rw-r--r--@ 1 chaochih  staff  353 Jan  7 21:23 tb1-protein.fasta
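
A quick sanity check after concatenating is that the sizes add up: in the listing above, 353 + 152 = 505 bytes. The same check can be scripted with `wc -c`, which counts bytes. The two tiny FASTA files below are made-up stand-ins for the real ones:

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Two toy FASTA files (hypothetical content)
printf '>seqA\nMKV\n' > a.fasta
printf '>seqB\nGGT\n' > b.fasta

cat a.fasta b.fasta > combined.fasta

# Reading from STDIN (<) makes wc print only the number, not the filename
size_a=$(wc -c < a.fasta)
size_b=$(wc -c < b.fasta)
size_combined=$(wc -c < combined.fasta)

# $(( )) does the arithmetic and strips any padding wc adds around the number
echo "expected: $((size_a + size_b)) bytes, found: $((size_combined)) bytes"
```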


# Example 2: Redirect standard output and standard error of ls -l to separate files

For this example, we will use `tb1-protein.fasta` and a non-existent FASTA file, `leaf.fasta`.

In [16]:
# We are going to try to list a file that doesn't exist, which will return an error message
ls -l tb1-protein.fasta leaf.fasta

ls: leaf.fasta: No such file or directory
-rw-r--r--@ 1 chaochih  staff  353 Jan  7 21:23 tb1-protein.fasta


: 1

In [18]:
# Now we will redirect (save) the error message to a file
# We will redirect the listing of tb1-protein.fasta to dir_listing.txt
# and the standard error message to another file called dir_listing.stderr
ls -l tb1-protein.fasta leaf.fasta > dir_listing.txt 2> dir_listing.stderr

: 1

In [19]:
# Let's see if we've created our files
ls -lt

total 56
-rw-r--r--  1 chaochih  staff   66 Jan 11 19:47 dir_listing.txt
-rw-r--r--  1 chaochih  staff   42 Jan 11 19:47 dir_listing.stderr
-rw-r--r--  1 chaochih  staff  505 Jan 11 19:46 combined_proteins.fasta
-rw-r--r--@ 1 chaochih  staff  152 Jan 11 19:41 tga1-protein.fasta.1
-rw-r--r--@ 1 chaochih  staff  353 Jan 11 19:41 tb1-protein.fasta.1
-rw-r--r--@ 1 chaochih  staff  152 Jan  7 21:23 tga1-protein.fasta
-rw-r--r--@ 1 chaochih  staff  353 Jan  7 21:23 tb1-protein.fasta


In [20]:
# Let's view our dir_listing.txt file
head dir_listing.txt

-rw-r--r--@ 1 chaochih  staff  353 Jan  7 21:23 tb1-protein.fasta


In [21]:
# Let's view our dir_listing.stderr file
head dir_listing.stderr

ls: leaf.fasta: No such file or directory
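
Sometimes you want both streams in the same file instead of two separate ones. A sketch of that variation, using made-up file names in a scratch directory:

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch present.txt    # missing.txt is deliberately never created

# 2>&1 sends STDERR (stream 2) to wherever STDOUT (stream 1) currently points.
# It must come after the > redirection for both streams to reach the file.
ls present.txt missing.txt > all_output.txt 2>&1 || echo "ls exited with status $?"

cat all_output.txt
```

Redirecting a stream to `/dev/null` instead of a file (for example `2> /dev/null`) discards it entirely.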


# Piping

Piping is useful when you want to perform multiple steps on your data but only need to save the final output, not the intermediate files. The output of one command is streamed directly into the next, which helps when you are working with large datasets and do not have enough disk space to store the intermediate files.

![Image from web.cse.ohio-state.edu](http://web.cse.ohio-state.edu/~mamrak.1/CIS762/unix_pipes.gif)
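
As a minimal sketch of the idea, the pipeline below chains three commands with `|`. No intermediate file is written; each command reads the previous command's STDOUT as its STDIN. The input values are made up:

```bash
# Count how many distinct chromosome names appear in a list.
# printf stands in for a real input file here.
n_unique=$(printf 'chr2\nchr1\nchr2\nchr3\n' | sort | uniq | wc -l)

# $(( )) strips the padding some versions of wc add around the number
echo "distinct chromosomes: $((n_unique))"
```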

# Example 3: Extracting chromosome 1, sorting by start position, and redirecting to file using pipes

For this example we will be using toy data I've already downloaded. We will be working with a file called `test.bed`.

### Quick description of BED file

A BED file describes chromosome intervals and is a standard file format in bioinformatics. A BED file always has at least 3 columns: chromosome, start position, and end position. Additional optional columns (such as a feature name) may follow.

Example from Vince Buffalo's Chapter 11:

```bash
1	216596556	216596738
1	216595194	216595882
1	216591856	216592021
1	216538295	216538427
1	216500933	216500996
```
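
Because BED start positions are 0-based and end positions are exclusive, the length of an interval is simply end minus start. A small sketch using `awk` on the first two rows of the example above (the variable name `lengths` is arbitrary):

```bash
# Print end - start for each tab-separated BED row
lengths=$(printf '1\t216596556\t216596738\n1\t216595194\t216595882\n' |
    awk -F'\t' '{ print $3 - $2 }')

echo "$lengths"
```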

In [22]:
# First, we will go into our toy_data directory
# Note: you may have to change the path here
cd /Users/chaochih/GitHub/Bash_Demo/toy_data

In [23]:
# Make sure we are in the correct directory
pwd

/Users/chaochih/GitHub/Bash_Demo/toy_data


In [24]:
# See contents in directory
ls

Mus_musculus.GRCm38.75.dna.chromosome.8.fa.gz
Mus_musculus.GRCm38.75_chr1.bed
Mus_musculus.GRCm38.75_chr1.gtf
Mus_musculus.GRCm38.75_chr1_bed.csv
NA12891_CEU_sample.vcf.gz
chroms.txt
contam.fastq
contaminated.fastq
egfr_flank.fasta
example.bed
example2.bed
gene-1.bed
gene-2.bed
genotypes.txt
mm10_snp137_chr1_trunc.bed.gz
mm_GRCm38.75_protein_coding_genes.gtf
mm_gene_names.txt
tb1-protein.fasta
test.bed
test_chr1_only.bed
test_chr1_only_sorted.txt
test_chr1_only_unsorted.txt
tga1-protein.fasta
untreated1_chr4.fq


In [25]:
# First, let's take a look at our file
head test.bed

chrom	start	end
chr1	26	39
chr1	32	47
chr3	11	28
chr1	40	49
chr3	16	27
chr1	9	28
chr2	35	54
chr1	10	19


In [26]:
# We want to extract only rows that are from chromosome 1 from the file
grep "chr1" test.bed

chr1	26	39
chr1	32	47
chr1	40	49
chr1	9	28
chr1	10	19
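
One caveat: `grep "chr1"` matches any line that contains the text `chr1`, so in a file with more chromosomes it would also return `chr10`, `chr11`, and so on. For `test.bed` this makes no difference, but a stricter alternative is to compare the first column exactly with `awk`. A sketch on a made-up file (`toy.bed`):

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'chrom\tstart\tend\nchr1\t26\t39\nchr10\t5\t9\nchr1\t9\t28\n' > toy.bed

# grep -c counts matching lines; the chr10 row is counted too
loose=$(grep -c "chr1" toy.bed)

# awk prints only rows whose first column is exactly "chr1"
exact=$(awk -F'\t' '$1 == "chr1"' toy.bed | wc -l)

echo "grep matched $loose rows, awk matched $((exact)) rows"
```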


In [27]:
# After we extract chromosome 1, we will sort the data by column 2, which
# is the start positions of the interval
# The -k option in sort allows us to sort by only column 2
# The -n option sorts numerically
grep "chr1" test.bed | sort -k2,2 -n

chr1	9	28
chr1	10	19
chr1	26	39
chr1	32	47
chr1	40	49
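
The `-n` option matters because `sort` compares text by default, so `10` would come before `9`. A quick sketch with made-up numbers shows the difference:

```bash
# paste -sd' ' - joins the sorted lines into one line for easy comparison
text_order=$(printf '9\n10\n26\n' | sort | paste -sd' ' -)
numeric_order=$(printf '9\n10\n26\n' | sort -n | paste -sd' ' -)

echo "default sort: $text_order"
echo "sort -n:      $numeric_order"
```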


In [28]:
# Now, let's save our sorted subset to a file
grep "chr1" test.bed | sort -k2,2 -n > test_chr1_only.bed

In [29]:
# Let's see if we've created our file
# We will list our files from most recent to oldest, then
# use head with the -n option to output only the first 2 lines
ls -lt | head -n 2

total 271400
-rw-r--r--@ 1 chaochih  staff        54 Jan 11 19:48 test_chr1_only.bed


# Example 3 (continued): Extract everything that is not chromosome 1

Now we will do the inverse of what we just did above. We will pull out all rows that are NOT chromosome 1.

In [30]:
# Use the -v option in grep to invert the match
grep -v "chr1" test.bed

chrom	start	end
chr3	11	28
chr3	16	27
chr2	35	54
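
Notice that the header line (`chrom start end`) is kept, because it does not contain `chr1` either. If you want only data rows, `grep` accepts several patterns with `-e`. A sketch on a made-up copy of the data (`toy.bed`):

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'chrom\tstart\tend\nchr1\t26\t39\nchr3\t11\t28\nchr2\t35\t54\n' > toy.bed

# -v drops every line that matches ANY of the -e patterns:
# lines containing "chr1" and lines starting with "chrom" (the header)
not_chr1=$(grep -v -e "chr1" -e "^chrom" toy.bed)

echo "$not_chr1"
```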


# More fun with streams

![Image from Wikipedia page on tees](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS6GxQjYRvRw2voB6XCi-oXQ2Jsv8fUbB7joqPlaqrY0IHXSmuHBQ)

`tee` copies its standard input to a file and also passes it along to standard output. This lets you write a single pipeline that still saves intermediate results to files.

For this example, we'll continue to use `test.bed` and run a command similar to the one in Example 3.


In [31]:
grep "chr1" test.bed | tee test_chr1_only_unsorted.txt | sort -k2,2 -n | tee test_chr1_only_sorted.txt | head

chr1	9	28
chr1	10	19
chr1	26	39
chr1	32	47
chr1	40	49


In [32]:
# Let's see which files we've created
ls -lt | head -n 3

total 271400
-rw-r--r--@ 1 chaochih  staff        54 Jan 11 19:50 test_chr1_only_sorted.txt
-rw-r--r--@ 1 chaochih  staff        54 Jan 11 19:50 test_chr1_only_unsorted.txt


In [33]:
# Now let's take a look at what's inside these two files
head test_chr1_only_unsorted.txt

chr1	26	39
chr1	32	47
chr1	40	49
chr1	9	28
chr1	10	19


In [34]:
head test_chr1_only_sorted.txt

chr1	9	28
chr1	10	19
chr1	26	39
chr1	32	47
chr1	40	49


# Example 4: Looking at differences between data

In [38]:
# List all files that end in .bed
# The * is a wildcard that matches any sequence of characters
ls *.bed

Mus_musculus.GRCm38.75_chr1.bed	gene-2.bed
example.bed			test.bed
example2.bed			test_chr1_only.bed
gene-1.bed


#### We will be comparing `gene-1.bed` and `gene-2.bed` files

We will be using the `diff` command to view the differences between these two files. Remember, `diff` compares files line by line, so its output is only meaningful if both files are sorted in the same order.

In [40]:
# To show only the differences between the two files
# Use the diff command without any flags
diff gene-1.bed gene-2.bed

3a4
> 1	6222341	6228319	GENE00000025907
5d5
< 1	6230003	6230005	GENE00000025907
11c11
< 1	6230003	6230073	GENE00000025907
---
> 1	6230133	6230191	GENE00000025907
15d14
< 1	6214645	6214957	GENE00000025907
18d16
< 1	6230003	6230073	GENE00000025907
21d18
< 1	6238262	6238464	GENE00000025907


: 1
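
To see why sort order matters, the sketch below compares two made-up files that contain the same lines in a different order. `diff` reports differences until both files are sorted. `diff` exits with status 0 when the files are identical and 1 when they differ, which is what the `: 1` after the output above appears to reflect.

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'b\na\nc\n' > one.txt
printf 'a\nb\nc\n' > two.txt

# Same lines, different order: diff reports differences
if diff one.txt two.txt > /dev/null; then
    unsorted_result="identical"
else
    unsorted_result="different"
fi

# Sort both files first, then compare again
sort one.txt > one.sorted.txt
sort two.txt > two.sorted.txt
if diff one.sorted.txt two.sorted.txt > /dev/null; then
    sorted_result="identical"
else
    sorted_result="different"
fi

echo "unsorted: $unsorted_result, sorted: $sorted_result"
```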

In [42]:
# If you want to look at them side by side, use diff -y
# (diff -u instead prints a unified diff with surrounding context lines)
# We will use diff -y in this case
diff -y gene-1.bed gene-2.bed

1	6206197	6206270	GENE00000025907				1	6206197	6206270	GENE00000025907
1	6223599	6223745	GENE00000025907				1	6223599	6223745	GENE00000025907
1	6227940	6228049	GENE00000025907				1	6227940	6228049	GENE00000025907
							      >	1	6222341	6228319	GENE00000025907
1	6229959	6230073	GENE00000025907				1	6229959	6230073	GENE00000025907
1	6230003	6230005	GENE00000025907			      <
1	6233961	6234087	GENE00000025907				1	6233961	6234087	GENE00000025907
1	6234229	6234311	GENE00000025907				1	6234229	6234311	GENE00000025907
1	6206227	6206270	GENE00000025907				1	6206227	6206270	GENE00000025907
1	6227940	6228049	GENE00000025907				1	6227940	6228049	GENE00000025907
1	6229959	6230073	GENE00000025907				1	6229959	6230073	GENE00000025907
1	6230003	6230073	GENE00000025907			      |	1	6230133	6230191	GENE00000025907
1	6233961	6234087	GENE00000025907				1	6233961	6234087	GENE00000025907
1	6234229	6234399	GENE00000025907				1	6234229	6234399	GENE00000025907
1	6238262	6238384	GENE00000025907				1	6238262	6

: 1

# Example 5: Data compression

When working with many large files, it's not uncommon to run out of storage space. One solution is to compress your files so that they take up less disk space and transfer faster.

Many UNIX/Linux command line tools have an equivalent command to handle compressed files. For example:

- `cat` for gzipped files: `zcat`
- `grep` for gzipped files: `zgrep`
- `diff` for gzipped files: `zdiff`
- `less` for gzipped files: `zless`
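
A short sketch of working with a compressed file without writing a decompressed copy to disk. The file `toy.bed` is made up, and `zgrep` is assumed to be installed alongside `gzip`, which is usually the case:

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'chr1\t26\t39\nchr2\t35\t54\n' > toy.bed

gzip toy.bed               # replaces toy.bed with toy.bed.gz
ls

# Search inside the compressed file directly
zgrep "chr2" toy.bed.gz

# Or decompress to STDOUT and pipe the stream into any other tool
first_row=$(gzip -cd toy.bed.gz | head -n 1)
echo "$first_row"
```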

In [53]:
# First we'll quickly take a look at the man page on gzip
man gzip


GZIP(1)                   BSD General Commands Manual                  GZIP(1)

NAME
     gzip -- compression/decompression tool using Lempel-Ziv coding (LZ77)

SYNOPSIS
     gzip [-cdfhkLlNnqrtVv] [-S suffix] file [file [...]]
     gunzip [-cfhkLNqrtVv] [-S suffix] file [file [...]]
     zcat [-fhV] file [file [...]]

DESCRIPTION
     The gzip program compresses and decompresses files using Lempel-Ziv cod-
     ing (LZ77).  If no files are specified, gzip will compress from standard
     input, or decompress to standard output.  When in compression mode, each
     file will be replaced with another file with the suffix, set by the -S
     suffix option, added, if possible.

     In decompression mode, each file will be checked for existence, as will
     the file with the suffix added.  Each file argument must contain a sepa-
     rate complete archive; when multiple files are indicated, each is decom-
     pressed in turn.

     In the case of gzcat the resulting data is then concat

In [52]:
# First let's take a look at our example data to see which file is taking up the most room
# We will use the du command
du -h *

 36M	Mus_musculus.GRCm38.75.dna.chromosome.8.fa.gz
1.6M	Mus_musculus.GRCm38.75_chr1.bed
 25M	Mus_musculus.GRCm38.75_chr1.gtf
1.6M	Mus_musculus.GRCm38.75_chr1_bed.csv
4.0K	NA12891_CEU_sample.vcf.gz
4.0K	chroms.txt
4.0K	contam.fastq
4.0K	contaminated.fastq
4.0K	egfr_flank.fasta
4.0K	example.bed
4.0K	example2.bed
4.0K	gene-1.bed
4.0K	gene-2.bed
4.0K	genotypes.txt
 34M	mm10_snp137_chr1_trunc.bed.gz
188K	mm_GRCm38.75_protein_coding_genes.gtf
 16K	mm_gene_names.txt
4.0K	tb1-protein.fasta
4.0K	test.bed
4.0K	test_chr1_only.bed
4.0K	test_chr1_only_sorted.txt
4.0K	test_chr1_only_unsorted.txt
4.0K	tga1-protein.fasta
 34M	untreated1_chr4.fq


In [54]:
# We will unzip the Mus_musculus.GRCm38.75.dna.chromosome.8.fa.gz file
# and save it as a new file
# We will use the -c option to output to standard out
# and use the -d option to decompress the file
gzip -cd Mus_musculus.GRCm38.75.dna.chromosome.8.fa.gz > Mus_musculus.GRCm38.75.dna.chromosome.8.fa

In [56]:
# Now let's take a look at what files have been created
# Since we don't want to see every file in the directory, let's list
# only files that have .fa in the filename
ls *.fa*

Mus_musculus.GRCm38.75.dna.chromosome.8.fa
Mus_musculus.GRCm38.75.dna.chromosome.8.fa.gz
contam.fastq
contaminated.fastq
egfr_flank.fasta
tb1-protein.fasta
tga1-protein.fasta


In [61]:
# Now that we've successfully unzipped our file, let's take a look at it
head -n 5 Mus_musculus.GRCm38.75.dna.chromosome.8.fa

>8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN


# Data compression flavors

There are different flavors of data compression, and each flavor uses its own commands to compress and decompress files, although the tools behave similarly.

Some of the most common ones you'll encounter are:

| gzip      | zip      | bzip2     |
| --------  | -------- | --------  |
| .gz       | .zip     | .bz2      |
| .tgz      | .tar.zip | .tar.bz2  |
| .tar.gz   |          | .tbz      |
|           |          | .tbz2     |

# Example 6: tar and gzipped files

If you work in a Linux environment, chances are you'll be downloading and using tools developed by other researchers/labs, institutes, etc.

Often, these tools are tarred and gzipped and have a `.tgz` or `.tar.gz` extension. In these cases, you can use the following command to unpack them:

```bash
# -x extracts the contents
# -z filters archive through gzip
# -v verbosely lists files
# -f use archive file
tar -xzvf file.tar.gz
```

Let's take a look at some examples below.

#### First, we'll create a tar archive:

`tar -czvf name-of-archive.tgz /path/to/directory-or-file`

- `-c` creates the archive
- `-z` compresses the archive with gzip
- `-v` displays progress in the terminal while creating the archive
- `-f` lets you specify the filename of the archive

In [71]:
# Note: you may have to change the path here
tar -czvf my_bed_files.tgz /Users/chaochih/GitHub/Bash_Demo/toy_data/*.bed

tar: Removing leading '/' from member names
a Users/chaochih/GitHub/Bash_Demo/toy_data/Mus_musculus.GRCm38.75_chr1.bed
a Users/chaochih/GitHub/Bash_Demo/toy_data/example.bed
a Users/chaochih/GitHub/Bash_Demo/toy_data/example2.bed
a Users/chaochih/GitHub/Bash_Demo/toy_data/gene-1.bed
a Users/chaochih/GitHub/Bash_Demo/toy_data/gene-2.bed
a Users/chaochih/GitHub/Bash_Demo/toy_data/test.bed
a Users/chaochih/GitHub/Bash_Demo/toy_data/test_chr1_only.bed


In [73]:
# Now let's see if the archive was created
ls *.tgz

my_bed_files.tgz


#### To list all files in an archive without unpacking it, use:

In [74]:
tar -tvf my_bed_files.tgz

-rw-r--r--  0 chaochih staff     239 Jan  9 20:59 /Users/chaochih/GitHub/Bash_Demo/toy_data/._Mus_musculus.GRCm38.75_chr1.bed
-rw-r--r--  0 chaochih staff 1698545 Jan  9 20:59 Users/chaochih/GitHub/Bash_Demo/toy_data/Mus_musculus.GRCm38.75_chr1.bed
-rw-r--r--  0 chaochih staff     239 Jan  9 20:59 /Users/chaochih/GitHub/Bash_Demo/toy_data/._example.bed
-rw-r--r--  0 chaochih staff      87 Jan  9 20:59 Users/chaochih/GitHub/Bash_Demo/toy_data/example.bed
-rw-r--r--  0 chaochih staff     239 Jan  9 20:59 /Users/chaochih/GitHub/Bash_Demo/toy_data/._example2.bed
-rw-r--r--  0 chaochih staff      91 Jan  9 20:59 Users/chaochih/GitHub/Bash_Demo/toy_data/example2.bed
-rw-r--r--  0 chaochih staff     239 Jan  9 20:59 /Users/chaochih/GitHub/Bash_Demo/toy_data/._gene-1.bed
-rw-r--r--  0 chaochih staff     748 Jan  9 20:59 Users/chaochih/GitHub/Bash_Demo/toy_data/gene-1.bed
-rw-r--r--  0 chaochih staff     239 Jan  9 20:59 /Users/chaochih/GitHub/Bash_Demo/toy_data/._gene-2.bed
-rw-r--r--  0 chaoc
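
The `Removing leading '/'` message above appears because the archive was created from an absolute path, so the archive stores the full `Users/chaochih/...` directory structure. If you prefer shorter paths inside the archive, one option is to archive a relative path and to choose the extraction directory with `-C`. A sketch with made-up directory and file names:

```bash
workdir=$(mktemp -d)
cd "$workdir"
mkdir beds
printf 'chr1\t26\t39\n' > beds/a.bed
printf 'chr2\t35\t54\n' > beds/b.bed

# Create the archive from a relative path
tar -czf beds.tgz beds

# List the contents without unpacking
tar -tzf beds.tgz

# Unpack into a different directory with -C (the directory must already exist)
mkdir unpacked
tar -xzf beds.tgz -C unpacked
ls unpacked/beds
```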

# Exercise 1: Count the number of sequences in Morex, a six-row malting barley

To better explore .fa/.fasta (FASTA) files, let's count the number of sequences in `temp_Morex_Reference.fasta`. To count the number of lines, use `wc -l`, but...

### Remember:

Each sequence in a `.fa`/`.fasta` file starts with a description line that begins with ">". This is followed by one or more lines of sequence.

Example `.fa`/`.fasta` file:

```bash
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
```

So, to get the number of sequences, **not** the number of lines, you will need to count the number of ">" description lines.

In [None]:
# First, use "head" and "tail" commands to see what the beginning and end
# of the file look like.

In [69]:
# Enter code for this exercise here
# There are multiple ways to complete this exercise

