PIPES and FILTERS


I. Wildcards

Let's look at what is inside the molecules folder and then move into it.

ls molecules
cd molecules

We see Protein Data Bank (.pdb) files for several molecules. We want to obtain more information about these files. The wc (word count) command reports how many lines, words and characters a file contains.

wc cubane.pdb

If you only need the number of lines in a file, you can pass the -l flag to wc. The --help option lists the other flags.

wc -l cubane.pdb
wc --help

Instead of issuing the command for each file one by one, or using a loop to go through each file, we can use wildcards.

wc *.pdb

Think of the * wildcard as an internal loop that runs over every file and picks the ones whose names match. * stands for zero or more characters.

Another wildcard you can use is ?. In contrast to *, ? matches exactly one character.

wc p?tane.pdb
wc p??tane.pdb
wc p*tane.pdb

Let's move to the folder north-pacific-gyre/2012-07-03

cd ../north-pacific-gyre/2012-07-03
ls -al

There are files ending with A, B and Z and we want to select the ones ending with A or B

ls *[AB].txt
ls *[ABZ].txt

Square brackets match any single character listed inside them, which gives us an "or" option.
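The three wildcards can be compared side by side. The sketch below is only an illustration: it works in a temporary directory, and the file names it creates are made up for the example rather than taken from the lesson data.

```bash
# Work in a throwaway directory so no real files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch pentane.pdb propane.pdb octane.pdb
touch NENE01729A.txt NENE01729B.txt NENE01751Z.txt   # hypothetical names

star_matches=$(echo p*.pdb)            # * matches zero or more characters
question_matches=$(echo p??tane.pdb)   # each ? matches exactly one character
bracket_matches=$(echo *[AB].txt)      # [AB] matches a single A or a single B

echo "$star_matches"
echo "$question_matches"
echo "$bracket_matches"
```

Using echo instead of wc or ls is a convenient way to preview what a pattern expands to before running a real command on it.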

II. Sorting and Redirecting

Let's go back to the molecules folder, count the lines in every .pdb file there, and store the result in a file.

cd ../../molecules
pwd
wc -l *.pdb
wc -l *.pdb > lengths.txt

Did we create a file?

ls lengths.txt
ls --help
ls -al lengths.txt

What is in the "lengths.txt" file? We can use the cat (concatenate) command to print the contents of the file. cat prints the whole file to the screen at once; if we want to see the data page by page, we can use less. In less you move forward one screen with the spacebar, go back with b, and quit with q.

cat lengths.txt
less lengths.txt

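wc behaves differently when it is given a file name than when its input is redirected. With a file name it prints the count followed by the name. With < the shell feeds the file to wc on standard input, so wc prints only the count. With no file name at all, wc counts whatever you type until you press Ctrl+D.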

wc -l lengths.txt
wc -l < lengths.txt

wc -l #hit enter#
a #hit enter#
b #hit enter#
c #hit enter#
d #hit enter#
e #hit enter#
#hit Ctrl+D#
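The difference between the two forms is easy to check on a small file. This is a minimal sketch that writes three lines to a temporary file (created only for the example) and captures what wc prints in each case.

```bash
sample=$(mktemp)
printf 'a\nb\nc\n' > "$sample"

with_name=$(wc -l "$sample")        # count followed by the file name
without_name=$(wc -l < "$sample")   # count only: wc never sees a file name

echo "$with_name"
echo "$without_name"
rm "$sample"
```

The second form is handy when you want to store just the number in a variable.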

Another type of redirection is appending with >>, which adds the output to the end of the file instead of overwriting it.

wc -l *.pdb >> lengths.txt

We want to see which molecule file has the most lines. We can use the sort command.

sort -n lengths.txt

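Let's redirect the result to another file and look at its first and last lines.

sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt
tail -n 1 sorted-lengths.txt

head shows the shortest file. Because wc adds a total line whenever it is given several files, and the total sorts last, tail -n 1 shows a total rather than a file; the longest file is the last entry above the totals.

Redirecting output to the same file that is being read is a bad practice: the shell empties the output file before the command runs, so sort -n lengths.txt > lengths.txt leaves you with an empty file. Don't do it.

We can reverse the sort order with -r, or choose the column to sort on with -k.

sort -n -r lengths.txt
sort -k 2 sorted-lengths.txt

The -n flag tells sort to compare numerically. Without it, sort compares lines as text, so the text 10 sorts before 9.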
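If a file really does need to be sorted in place, there are two common ways around the overwrite problem. The sketch below uses a temporary file with made-up numbers; the -o option of sort is standard and is allowed to name one of the input files.

```bash
data=$(mktemp)
printf '30\n9\n100\n' > "$data"

# Option 1: write to a second file, then replace the original.
sort -n "$data" > "$data.tmp" && mv "$data.tmp" "$data"

# Option 2: let sort write the file itself; -o may name the input file.
sort -n -r -o "$data" "$data"

result=$(cat "$data")
echo "$result"
rm "$data"
```

Either option avoids the situation where the shell truncates the file before sort has read it.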

III. Pipes

So far we have run wc, sort, head and tail one step at a time, saving intermediate results in files. Pipes let us chain these commands and carry out the whole task in one step.

sort -n lengths.txt | head -n 1
sort -n -r lengths.txt | tail -n 1

The vertical bar | between two commands is called a pipe. It takes the output of the command on the left and feeds it as input to the command on the right.

sort -n lengths.txt | head -n 3 | tail -n 1
wc -l *.pdb | sort -n | head -n 3 | tail -n 1

As you see, we can create a pipeline by stacking many pipes one after another. Each time the data is filtered and modified in some manner.

Now we will look at two other useful commands for filtering data: uniq and cut.

cp ../data/salmon.txt ./
cat salmon.txt
uniq salmon.txt; cat salmon.txt | uniq ; uniq<salmon.txt
sort salmon.txt | uniq

uniq removes duplicate lines only if they are adjacent. Combining it with sort, which brings identical lines together, eliminates all duplicates.
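The effect of adjacency can be seen by counting lines. This sketch uses a small made-up list (not the lesson's salmon.txt) in which the same two names repeat in mixed order.

```bash
fish=$(mktemp)
printf 'coho\ncoho\nsteelhead\ncoho\nsteelhead\nsteelhead\n' > "$fish"

adjacent_only=$(uniq "$fish" | wc -l)       # collapses adjacent repeats only
all_removed=$(sort "$fish" | uniq | wc -l)  # sorting first groups every duplicate

echo "uniq alone leaves $((adjacent_only)) lines"
echo "sort | uniq leaves $((all_removed)) lines"
rm "$fish"
```

uniq alone leaves four lines because the third coho is separated from the first two, while sort | uniq leaves the two distinct names.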

cp ../data/animals.txt ./
cut -d , -f 2 animals.txt | sort | uniq

For cut, the -d flag sets the delimiter that separates fields (here a comma) and -f selects which field to print.
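A common extension of this pipeline is to count how often each value occurs with uniq -c. The data below is invented for the sketch and only assumes a comma-separated layout with the animal name in the second field.

```bash
animals=$(mktemp)
cat > "$animals" <<'EOF'
2012-11-05,deer
2012-11-05,rabbit
2012-11-06,rabbit
2012-11-06,deer
2012-11-07,fox
EOF

# Field 2 is the animal name; uniq -c prefixes each name with its count.
counts=$(cut -d , -f 2 "$animals" | sort | uniq -c)
echo "$counts"
rm "$animals"
```

Appending | sort -n -r to the pipeline would list the most frequent animal first.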

CONDITIONALS

The basic conditional statement is if. Apart from syntax differences, it is used the same way as in other languages: conditionals let you execute commands only when certain conditions are met.

myfavnumber=34

if [ $myfavnumber -eq 34 ];then echo "My favorite number is $myfavnumber";fi

An if/then statement can be extended to if/then/else or if/then/elif/else to test more conditions.

Let's move to the data folder and build a decision tree for your vegetable garden when there are rabbits around.

cd ../data

nrabbits=`cat animals.txt | grep rabbit | wc -l`
# nrabbits=$(cat animals.txt | grep rabbit | wc -l)

if [ $nrabbits -le 2 ]; then
    echo 'Nothing to worry about my vegetable garden'
elif [ $nrabbits -gt 2 ] && [ $nrabbits -le 10 ]  ; then
    echo 'Yellow alert, check your garden'
else
    echo 'Call animal control'
fi

echo "$nrabbits"
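The same decision tree can be wrapped in a function so that it is easy to try with different counts. This is a sketch under stated assumptions: classify_rabbits is a hypothetical name, the sightings file is made up, and grep -c is used as a shorter way to count matching lines.

```bash
# classify_rabbits is a hypothetical helper, not part of the lesson files.
classify_rabbits() {
    local n=$1
    if [ "$n" -le 2 ]; then
        echo 'Nothing to worry about'
    elif [ "$n" -le 10 ]; then    # reached only when n is greater than 2
        echo 'Yellow alert'
    else
        echo 'Call animal control'
    fi
}

sightings=$(mktemp)
printf 'deer\nrabbit\nrabbit\nfox\nrabbit\n' > "$sightings"
nrabbits=$(grep -c rabbit "$sightings")   # -c counts the matching lines
classify_rabbits "$nrabbits"
rm "$sightings"
```

Because an elif branch is only tested when the earlier conditions were false, the lower bound does not need to be checked again there.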

LOOPS

Loops make repetitive tasks and automation much easier. Let's move to the 'creatures' folder. We will work with the two files there, but let's first create backup copies.

cd ../creatures
pwd
ls -al
cp *.dat original-*.dat

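When cp receives more than two arguments, the last one must be a directory into which all the others are copied. Here the shell expands *.dat to the two file names, while original-*.dat matches nothing and is passed along unchanged, so cp sees three arguments and the last is not a directory. The previous command therefore does not work. We can use a loop to accomplish this task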

for filename in basilisk.dat unicorn.dat #hit enter#
> do
>     cp $filename original-$filename
> done

#OR

for filename in basilisk.dat unicorn.dat; do cp $filename original-$filename; done

#OR

for x in basilisk.dat unicorn.dat; do cp $x original-$x; done

for color in basilisk.dat unicorn.dat; do cp $color original-$color; done

The $ symbol tells the shell to treat filename as a variable: $filename returns the value of the variable, which becomes basilisk.dat and then unicorn.dat as the loop runs. In some code you may see ${filename}, which is equivalent to $filename.
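The braces matter when the variable is followed directly by characters that could be part of a name. The short sketch below illustrates this; the variable values are arbitrary.

```bash
filename=unicorn

plain="$filename.dat"        # fine: '.' cannot be part of a variable name
wrong="$filename_backup"     # reads the unset variable filename_backup
right="${filename}_backup"   # braces mark where the name ends

echo "plain: $plain"
echo "wrong: '$wrong'"
echo "right: $right"
```

The second assignment yields an empty string because the shell looks for a variable called filename_backup, which does not exist.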

echo *.dat

for filename in *.dat #hit enter#
> do
>     echo $filename
>     head -n 100 $filename | tail -n 5  #this selects the lines 96-100
> done

We can print an empty line before each file name to make the output easier to read

echo *.dat

for filename in *.dat #hit enter#
> do
> echo ""
> echo $filename
> head -n 100 $filename | tail -n 5  #this selects the lines 96-100
> done

Say you have white spaces in your file names:

cp basilisk.dat 'green snake.dat'
cp unicorn.dat 'white horse.dat'
for filename in green snake.dat white horse.dat #hit enter#
> do
>     echo $filename
>     head -n 100 $filename | tail -n 20  #this selects the lines 81-100 
> done

#This should be (quote the names in the list, and quote the variable wherever it is used)

for filename in 'green snake.dat' 'white horse.dat' #hit enter#
> do
>     echo "$filename"
>     head -n 100 "$filename" | tail -n 20  #this selects the lines 81-100
> done
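Quoting also works when the list comes from a wildcard, because each matched name is kept as one item even if it contains spaces. The sketch below creates two small files in a temporary directory (contents made up) and loops over them.

```bash
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'line 1\nline 2\nline 3\n' > 'green snake.dat'
printf 'line 1\nline 2\n' > 'white horse.dat'

count=0
for filename in *.dat
do
    # Quoting keeps "green snake.dat" as a single argument.
    lines=$(wc -l < "$filename")
    echo "$filename has $((lines)) lines"
    count=$((count + 1))
done
echo "looped over $count files"
```

Without the double quotes around $filename, the shell would split each name at the space and look for files called green and snake.dat.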

Let's go back to our molecules folder and do more looping

cd ../molecules

Say we want to write all molecules in one file:

for molecule in *.pdb; do echo $molecule; cat $molecule > all_molecules.dat; done
cat all_molecules.dat

We could not write all molecules to a single file because each iteration of the loop overwrites what the previous one wrote, so only the last molecule remains. Instead we should have used append >>

rm all_molecules.dat
for molecule in *.pdb; do echo $molecule; cat $molecule >> all_molecules.dat; done

What if we only want to write molecules whose names start with p? Adding .pdb to the pattern keeps the loop from picking up output files such as p_molecules.dat.

for molecule in p*.pdb; do echo $molecule; cat $molecule >> p_molecules.dat; done

What if we only want to write molecules whose names include c?

for molecule in *c*.pdb; do echo $molecule; cat $molecule >> c_molecules.dat; done

We can construct nested loops, for instance to create a grid of folders for organizing our data

for temperature in 10 20 ; do 
> for molecule in propane pentane; do mkdir $molecule-$temperature; done
> done

#OR 

for temperature in 10 20 ; do 
> for molecule in propane pentane; do mkdir "$molecule-$temperature"; done
> done

#BUT NOT THE SAME: single quotes stop variable expansion, so mkdir gets the literal name $molecule-$temperature

for temperature in 10 20 ; do 
> for molecule in propane pentane; do mkdir '$molecule-$temperature'; done
> done
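The difference between the two kinds of quotes can be checked with echo before creating any folders. A minimal sketch, using the same variable names as the loops above:

```bash
molecule=propane
temperature=10

double="$molecule-$temperature"   # variables are expanded
single='$molecule-$temperature'   # taken literally, dollar signs included

echo "$double"
echo "$single"
```

Double quotes are the usual choice around variables: they still expand them but protect any spaces in the result.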

There are other loop constructions such as while and until.

counter=0
while [ $counter -lt 10 ]; do echo $counter; let counter=counter+1; done

while iterates as long as the condition is true, whereas until iterates as long as the condition is false and stops as soon as it becomes true

counter=0
until [ $counter -ge 10 ]; do echo $counter; let counter=counter+1; done
echo $counter
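The two loops can be written with arithmetic expansion, $(( )), in place of let. This is offered only as an alternative spelling; the limit of 3 is chosen to keep the example short.

```bash
counter=0
while [ "$counter" -lt 3 ]; do      # runs while the test is true
    counter=$((counter + 1))
done
after_while=$counter

counter=0
until [ "$counter" -ge 3 ]; do      # runs until the test becomes true
    counter=$((counter + 1))
done
after_until=$counter

echo "while stopped at $after_while, until stopped at $after_until"
```

Both loops stop at the same value because -ge 3 is the exact opposite of -lt 3.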

FINDING THINGS

To find things on your system, the two commands you will use most are grep (global/regular expression/print) and find.

grep finds and prints lines in files that match a pattern.

Let's move to 'writing' folder

cd ../writing
pwd
ls -al

What is in haiku.txt?

cat haiku.txt

Let's find lines that contain the words "not" and "The":

grep not haiku.txt
grep The haiku.txt

When we searched for "The", two lines came up. In one of them the letters are actually part of "Thesis". To restrict our search to the word "The" we can use the -w flag, which forces the pattern to match only whole words.

grep -w The haiku.txt

On which line does the match occur? The -n flag prints the line number of each matching line

grep -n -w The haiku.txt
grep -n The haiku.txt

How do we find all "The" and "the" together? -i makes the search case-insensitive

grep -n -w the haiku.txt
grep -n -w -i The haiku.txt

How do we find all lines that do not contain "the" or "The"? -v flag inverts the selection

grep -v -n -w -i The haiku.txt

There are other flags you can explore, just type:

grep --help

We could also search for patterns including spaces.

grep "is not" haiku.txt

We searched within files using grep. The find command, on the other hand, searches for the files themselves. Let's see what find finds in the current folder

find . # '.' means current directory
find ../ # '../' means one directory above

find's output is the name of every file and directory under the starting directory. With -type d or -type f we can restrict the output to directories or to files, respectively

find . -type d
find . -type f

The -name <pattern> flag finds files whose names match the pattern

find . -name *.txt

We expected it to find all the text files, but it only prints out "haiku.txt". The problem is that the shell expands wildcard characters like * before commands run. Since *.txt in the current directory expands to "haiku.txt", the command we actually ran was:

find . -name haiku.txt

To find all '.txt' files:

find . -name '*.txt'

Putting *.txt in single quotes prevents the shell from expanding the * wildcard. This way, find actually gets the pattern *.txt, not the expanded filename haiku.txt
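The effect of the quotes can be reproduced in a small directory tree. The sketch below builds one in a temporary directory (the file names other than haiku.txt are invented) and counts what each form of the command finds.

```bash
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir data
touch haiku.txt data/one.txt data/two.txt

unquoted=$(find . -name *.txt | wc -l)   # shell expands *.txt to haiku.txt first
quoted=$(find . -name '*.txt' | wc -l)   # find receives the pattern itself

echo "unquoted pattern: $((unquoted)) match"
echo "quoted pattern: $((quoted)) matches"
```

If the current directory held two or more .txt files, the unquoted form would expand to several names and find would report an error instead, which is another reason to quote the pattern.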

Can we find out the number of lines in all '.txt' files?

wc -l $(find . -name '*.txt')

#OR

wc -l `find . -name '*.txt'`

Here the commands inside '$()' or '``' are evaluated first, and then wc -l operates on their output. We can further `sort` the result

wc -l `find . -name '*.txt'` | sort -n

We can also use find and grep in sequence to search for a pattern in certain files

grep 'FE' $(find .. -name '*.pdb')
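One caveat with $(find ...) is that the shell splits the output at spaces, so file names containing spaces are broken apart. A sketch of an alternative is to let find run grep itself with -exec; the directory and file names below are made up for the example.

```bash
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir 'my molecules'
printf 'ATOM 1 FE\n' > 'my molecules/heme sample.pdb'
printf 'ATOM 1 C\n'  > 'my molecules/methane.pdb'

# grep -l prints only the names of files that contain a match;
# {} + hands the found paths to grep as separate, intact arguments.
matches=$(find . -name '*.pdb' -exec grep -l 'FE' {} +)
echo "$matches"
```

For file names without spaces the two approaches give the same result, so the simpler $(find ...) form is fine for the lesson data.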

SHELL SCRIPTS

We were issuing the commands on the command line and finding them again using the arrow key or history command.

When we need to accomplish many tasks and repeat them from time to time, it is better to save the commands in a file. Instead of executing the commands one by one on the command line, we can execute the file. This file is called a shell script.

Let's go to molecules folder and start writing a script

cd ../molecules
vim middle.sh

When we start vim, we first hit i (entering insert mode) so that we can type, and then we write the following

#!/usr/bin/env bash
# you may not need the line above
head -n 15 octane.pdb | tail -n 5

Then we hit Esc, then we hit : (entering command mode) and type wq (save and quit) and hit Enter.

bash middle.sh

The file name octane.pdb is hardcoded in the head command. Instead of hardcoding the file name in the script, we can make it a variable. Let's edit our script with vim again

vim middle.sh

hit i (insert mode) and then we write the following

# head -n 15 octane.pdb | tail -n 5
head -n 15 "$1" | tail -n 5

A comment starts with a # character and runs to the end of the line. The computer ignores comments.

$1 stands for the first argument given to the script on the command line. In case the filename happens to contain any spaces, we surround $1 with double-quotes.

Then we hit Esc, then we hit : and type wq and hit Enter. Now we can pass our file to the script as an argument on the command line

bash middle.sh octane.pdb
bash middle.sh pentane.pdb

Let's make the number of lines printed by head and tail also variables

vim middle.sh

hit i (insert mode) and then we write the following

# head -n 15 octane.pdb | tail -n 5
head -n "$2" "$1" | tail -n "$3"

Then we hit Esc, then we hit : and type wq and hit Enter. Now let's change the number of lines on the command line.

bash middle.sh pentane.pdb 15 5
bash middle.sh pentane.pdb 20 5

It is good coding practice to add comments to your script that explain to other users what it does.

vim middle.sh

We will repeat the routine to insert, save, and quit in vim.

# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
# head -n 15 octane.pdb | tail -n 5
head -n "$2" "$1" | tail -n "$3"

Comments are invaluable for helping people (including your future self) understand and use scripts. However, each time you modify the script, you should check that the comments are still accurate.

Let's write another script to find the number of lines for each file and sort them according to these numbers.

vim sorted.sh

Repeat the routine to insert, save, and quit in vim.

#!/usr/bin/env bash

wc -l "$1" "$2" | sort -n

Let's execute the script

bash sorted.sh *.pdb

As you see, we did not get all the files. Only two files were reported because the script uses only the first two arguments, "$1" and "$2". If we want to accept any number of input arguments, we can use "$@" instead, which stands for all of them

vim sorted.sh

Repeat the routine to insert, save, and quit in vim.

wc -l "$@" | sort -n

Let's execute the script

bash sorted.sh *.pdb ../creatures/*.dat

If you forget to provide input, as shown below, the script will start but appear to do nothing: with no file names, wc waits for input from the keyboard. To exit from this state hit Ctrl+C

bash sorted.sh
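A script can guard against this by checking the number of arguments, which the shell stores in $#. The sketch below is one possible way to do it: sorted_lengths is a hypothetical function standing in for the body of sorted.sh, and the test files are made up.

```bash
# sorted_lengths is a hypothetical stand-in for the body of sorted.sh.
sorted_lengths() {
    if [ "$#" -eq 0 ]; then
        echo "Usage: sorted.sh file [file ...]" >&2
        return 1    # in a script file, use exit 1 instead
    fi
    wc -l "$@" | sort -n
}

demo_dir=$(mktemp -d)
printf 'a\nb\n' > "$demo_dir/two.txt"
printf 'a\n'    > "$demo_dir/one.txt"

sorted_lengths "$demo_dir"/*.txt
sorted_lengths || echo "no files given, so nothing was counted"
```

Printing the usage message to standard error (>&2) keeps it out of any pipeline that consumes the script's normal output.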

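A shortcut for saving some useful commands is to redirect the recent history to a file, which can then serve as a script. history prefixes each command with its number, so remove those numbers from the file before running it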

history | tail -n 5 > redo-commands.sh

As you see, it is natural to stack more than one command in a script. So let's write a script that takes any number of files (e.g. the .pdb files in the molecules folder) as input and, for every file except cubane.pdb:

  1. Prints the file name
  2. Prints the 4th line of the file
  3. Appends the file to "all_molecules.txt"
  4. Creates a copy named "bckp-<file_name>"

vim myscript.sh

Repeat the routine to insert, save, and quit in vim.

#!/usr/bin/env bash

for filename in "$@"
    do
        # Skip cubane.pdb
        if [ "$filename" == "cubane.pdb" ] ; then
            continue
        fi

        echo "$filename"
        head -n 4 "$filename" | tail -n 1        # print the 4th line
        cat "$filename" >> all_molecules.txt     # append to the combined file
        cp "$filename" "bckp-$filename"          # create a backup copy
    done

Issue the following command to see the result:

bash myscript.sh *.pdb

We can make this script executable and run it without naming the interpreter (i.e. bash); the first line of the script tells the system which interpreter to use

chmod u+x myscript.sh
./myscript.sh *.pdb

Let's do some scripting with random numbers. Write a bash script to generate 1000 random numbers between 1 and 1000 and write them to a file. Find out how many unique numbers are obtained.
(Hint: Try issuing echo $RANDOM, what do you get?)

#!/usr/bin/env bash

rm -f randomnumbers.txt   # -f: no error if the file does not exist yet
touch randomnumbers.txt

for i in `seq 1000`
    do
        echo $(((RANDOM%1000)+1)) >> randomnumbers.txt
    done

sort -n randomnumbers.txt | uniq | wc -l

Write a script that creates a folder every 2 seconds, named folder_01, ... folder_10, and lets you know when each folder has been created and when the whole operation has finished. Finally, the script should show the created folders and then delete them.

#!/usr/bin/env bash
counter=1

while [ $counter -le 10 ]; do
    sleep 2 #wait for 2 seconds
    mkdir folder_$(printf "%02d" $counter)

    if [ $counter -eq 10 ]; then
        echo "Folder ${counter} has been created"
        echo 'All folders are created'
    else
        echo "Folder ${counter} has been created"
    fi

    let counter=counter+1
done

ls -al | grep folder_
rm -Rf folder_*