# Streams and Redirection

### Use streams to avoid loading large amounts of data into memory

Example of printing contents to the standard output stream (STDOUT):

![Image from linux-training.be](http://linux-training.be/funhtml/images/bash_ioredirection_keyboard_display.png)

### STDIN, STDERR, STDOUT

![Image from ryanstutorials.net/linuxtutorial](https://ryanstutorials.net/linuxtutorial/img/streams.png)

### Redirection

To redirect STDOUT to a file, we use the following operators:
- `>` redirects standard output to a file, overwriting any existing content
- `>>` appends standard output to the end of a file
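
The difference between the two operators is easiest to see side by side. The following is a minimal sketch; the temporary directory and the file name `notes.txt` are made up for illustration and are not part of the workshop files.

```shell
# Work in a throwaway directory so nothing in your own folders is touched
demo_dir=$(mktemp -d)

# ">" creates the file, or overwrites it if it already exists
echo "first line" > "$demo_dir/notes.txt"
echo "second line" > "$demo_dir/notes.txt"
cat "$demo_dir/notes.txt"    # only "second line" remains

# ">>" appends to the end of the file instead
echo "third line" >> "$demo_dir/notes.txt"
cat "$demo_dir/notes.txt"    # "second line" followed by "third line"
```

Because `>` silently replaces existing content, it is worth double-checking the target file name before running a redirect.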

# Creating project directory

Before we get started with today's hands-on workshop, let's create a project directory.

In [None]:
# Go into our home directory
cd ~
# View contents in home directory
# The contents in our directory will go to standard output
ls

In [None]:
# If we mistype a command, the shell prints an error message to standard error (STDERR)
l

In [None]:
# Make a directory called "workshop"
mkdir workshop
# See if directory was created
# Use the long listing option (-l) and sort by modification time, newest first (-t)
ls -lt

In [None]:
# To get help on a command,
# pull up its manual page, which has detailed documentation
# The general syntax is: "man" followed by the command name
man ls

In [None]:
# Now, let's go into our new directory
cd workshop

In [None]:
# And check where we are
pwd

# Example 1: Combine two files and redirect to a single file

In this example, we will combine two FASTA files (`tb1-protein.fasta` and `tga1-protein.fasta`). FASTA files contain nucleotide or peptide/protein sequences, with each nucleotide or amino acid represented by a single-letter code.

In [None]:
# First, we will download our files using the command "wget"
#    Note: wget by default downloads the file into your current working directory
#    There is an option (-P) to specify the output directory
#    e.g. wget -P /path/to/output_directory https://link.to.yourfile.com/filename.txt

# File 1 is tb1-protein.fasta
wget https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-03-remedial-unix/tb1-protein.fasta
# File 2 is tga1-protein.fasta
wget https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-03-remedial-unix/tga1-protein.fasta

In [None]:
ls

In [None]:
# View contents of tb1-protein.fasta
head tb1-protein.fasta

In [None]:
# View contents of tga1-protein.fasta
head tga1-protein.fasta

In [None]:
# Concatenate two files with "cat" command and print to standard output
cat tb1-protein.fasta tga1-protein.fasta

In [None]:
# Now we will redirect (save) our output into a file
cat tb1-protein.fasta tga1-protein.fasta > combined_proteins.fasta

In [None]:
# Let's see if our file was created
ls -lt

# Example 2: Redirect standard error of ls -l to a file

For this example, we will use `tb1-protein.fasta` and a non-existent FASTA file, `leaf.fasta`.

In [None]:
# We are going to try to list a file that doesn't exist, which will return an error message
ls -l tb1-protein.fasta leaf.fasta

In [None]:
# Now we will redirect (save) the error message to a file
# We will redirect the listing of tb1-protein.fasta to dir_listing.txt
# and the standard error message to another file called dir_listing.stderr
ls -l tb1-protein.fasta leaf.fasta > dir_listing.txt 2> dir_listing.stderr

In [None]:
# Let's see if we've created our files
ls -lt

In [None]:
# Let's view our dir_listing.txt file
head dir_listing.txt

In [None]:
# Let's view our dir_listing.stderr file
head dir_listing.stderr
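
The same idea can be tried without any of the workshop files. This is a minimal sketch: `report` is a hypothetical helper function defined only for this illustration, and the file names are made up. It also shows `2>&1`, which sends STDERR to wherever STDOUT is currently pointing so that both streams land in one file.

```shell
demo_dir=$(mktemp -d)

# Hypothetical helper that writes one line to each stream
report() {
    echo "this goes to standard output"
    echo "this goes to standard error" >&2
}

# ">" captures STDOUT (file descriptor 1); "2>" captures STDERR (file descriptor 2)
report > "$demo_dir/out.txt" 2> "$demo_dir/err.txt"

# "2>&1" points STDERR at the same place as STDOUT,
# so both lines end up in the same file
report > "$demo_dir/both.txt" 2>&1

cat "$demo_dir/out.txt"
cat "$demo_dir/err.txt"
cat "$demo_dir/both.txt"
```

Order matters with `2>&1`: it has to come after the `>` redirect, because it copies whatever STDOUT points to at that moment.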

# Piping

Piping is great when you have multiple steps you want to perform on your data but only want to save the final output, not the intermediate files. A pipe (`|`) connects the standard output of one command directly to the standard input of the next. This is helpful when you are working with large datasets and do not have enough disk space to store the intermediate files.

![Image from web.cse.ohio-state.edu](http://web.cse.ohio-state.edu/~mamrak.1/CIS762/unix_pipes.gif)
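
To make the contrast with intermediate files concrete, here is a minimal sketch that does the same two-step job both ways. The file `fruit.txt` and its contents are invented for this illustration.

```shell
demo_dir=$(mktemp -d)
printf 'apple\nberry\nbanana\n' > "$demo_dir/fruit.txt"

# Without a pipe: the intermediate result is written to disk first
grep "a" "$demo_dir/fruit.txt" > "$demo_dir/matches.txt"
count_from_file=$(wc -l < "$demo_dir/matches.txt")

# With a pipe: grep's STDOUT becomes wc's STDIN, and nothing extra is saved
count_from_pipe=$(grep "a" "$demo_dir/fruit.txt" | wc -l)

echo "via file: $count_from_file, via pipe: $count_from_pipe"
```

Both approaches count the same lines; the piped version simply never creates `matches.txt`.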

# Example 3: Extracting chromosome 1, sorting by start position, and redirecting to file using pipes

For this example we will be using toy data I've already downloaded. We will be working with a file called `test.bed`.

In [1]:
# First, we will go into our toy_data directory
# Note: you may have to change the path here
cd /Users/chaochih/GitHub/Bash_Demo/toy_data

In [2]:
# Make sure we are in the correct directory
pwd

/Users/chaochih/GitHub/Bash_Demo/toy_data


In [3]:
# See contents in directory
ls

Mus_musculus.GRCm38.75.dna.chromosome.8.fa.gz
Mus_musculus.GRCm38.75_chr1.bed
Mus_musculus.GRCm38.75_chr1.gtf
Mus_musculus.GRCm38.75_chr1_bed.csv
NA12891_CEU_sample.vcf.gz
chroms.txt
contam.fastq
contaminated.fastq
egfr_flank.fasta
example.bed
example2.bed
gene-1.bed
gene-2.bed
genotypes.txt
mm10_snp137_chr1_trunc.bed.gz
mm_GRCm38.75_protein_coding_genes.gtf
mm_gene_names.txt
tb1-protein.fasta
test.bed
tga1-protein.fasta
untreated1_chr4.fq


In [4]:
# First, let's take a look at our file
head test.bed

chrom	start	end
chr1	26	39
chr1	32	47
chr3	11	28
chr1	40	49
chr3	16	27
chr1	9	28
chr2	35	54
chr1	10	19


In [5]:
# We want to extract only rows that are from chromosome 1 from the file
grep "chr1" test.bed

chr1	26	39
chr1	32	47
chr1	40	49
chr1	9	28
chr1	10	19
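
One caveat: `grep "chr1"` matches the text `chr1` anywhere on a line, so in a file that also contains `chr10` or `chr11` those rows would be returned as well. Our `test.bed` has no such rows, so the result above is correct. The sketch below uses a made-up file, `toy.bed`, to show the difference, and uses `awk` to compare the first column exactly.

```shell
demo_dir=$(mktemp -d)
# Hypothetical tab-separated file with chrom, start, end columns
printf 'chr1\t26\t39\nchr10\t5\t12\nchr11\t7\t20\nchr2\t35\t54\n' > "$demo_dir/toy.bed"

# Substring match: also returns the chr10 and chr11 rows
grep "chr1" "$demo_dir/toy.bed"

# Exact match on column 1: returns only the chr1 row
awk '$1 == "chr1"' "$demo_dir/toy.bed"
```

`grep -w "chr1"` is another option: it requires the match to be a whole word, although it still searches every column rather than only the first.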


In [9]:
# After we extract chromosome 1, we will sort the data by column 2, which
# is the start positions of the interval
# The -k option in sort allows us to sort by only column 2
# The -n option sorts numerically
grep "chr1" test.bed | sort -k2,2 -n

chr1	9	28
chr1	10	19
chr1	26	39
chr1	32	47
chr1	40	49
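
The `-n` option matters here. Without it, `sort` compares the values as text, character by character, so `10` comes before `9`. A quick sketch using the five start positions from the chr1 rows:

```shell
# Default sort compares text, so "10" sorts before "26" and "9" comes last
printf '26\n32\n40\n9\n10\n' | sort       # 10 26 32 40 9

# -n compares the values as numbers
printf '26\n32\n40\n9\n10\n' | sort -n    # 9 10 26 32 40
```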


In [10]:
# Now, let's save our sorted subset to a file
grep "chr1" test.bed | sort -k2,2 -n > test_chr1_only.bed

In [15]:
# Let's see if we've created our file
# We will list our files from most recent to oldest, then
# use head's -n option to output only the first 2 lines
ls -lt | head -n 2

total 271384
-rw-r--r--  1 chaochih  staff        54 Jan  7 22:12 test_chr1_only.bed


# Example 4: Extract everything that is not chromosome 1

Now we will do the inverse of what we just did above. We will pull out all rows that are NOT chromosome 1.

In [16]:
# Use the -v option in grep to invert the match
grep -v "chr1" test.bed

chrom	start	end
chr3	11	28
chr3	16	27
chr2	35	54


# More fun with streams

![Image from Wikipedia page on tees](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS6GxQjYRvRw2voB6XCi-oXQ2Jsv8fUbB7joqPlaqrY0IHXSmuHBQ)

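`tee` is a neat command if you want to write a single line of code but still save intermediate output to files. It copies its standard input to a file and also passes it along to standard output, so the pipeline can continue.

For this example, we'll continue to use `test.bed` and run a command similar to the one in Example 3.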

In [17]:
grep "chr1" test.bed | tee test_chr1_only_unsorted.txt | sort -k2,2 -n | tee test_chr1_only_sorted.txt | head

chr1	9	28
chr1	10	19
chr1	26	39
chr1	32	47
chr1	40	49


In [19]:
# Let's see which files we've created
ls -lt | head -n 3

total 271400
-rw-r--r--  1 chaochih  staff        54 Jan  7 22:23 test_chr1_only_sorted.txt
-rw-r--r--  1 chaochih  staff        54 Jan  7 22:23 test_chr1_only_unsorted.txt


In [20]:
# Now let's take a look at what's inside these two files
head test_chr1_only_unsorted.txt

chr1	26	39
chr1	32	47
chr1	40	49
chr1	9	28
chr1	10	19


In [21]:
head test_chr1_only_sorted.txt

chr1	9	28
chr1	10	19
chr1	26	39
chr1	32	47
chr1	40	49
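
If you do not have the toy data at hand, the same pattern can be tried on generated input. This is a minimal sketch with made-up file names; each `tee` saves a snapshot of the data at that point in the pipeline while still passing it to the next command.

```shell
demo_dir=$(mktemp -d)

# Save the data before and after sorting, then show only the largest value
printf '3\n1\n2\n' \
    | tee "$demo_dir/unsorted.txt" \
    | sort -n \
    | tee "$demo_dir/sorted.txt" \
    | tail -n 1

cat "$demo_dir/unsorted.txt"    # original order: 3 1 2
cat "$demo_dir/sorted.txt"      # sorted order: 1 2 3
```

By default `tee` overwrites its output file, just like `>`; `tee -a` appends instead, like `>>`.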
