/
advanced_computing_2.Rmd
1018 lines (823 loc) · 39.1 KB
/
advanced_computing_2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Advanced Computing 2 -- UNIX shell and Regular Expressions"
author: "Stefano Allesina"
output:
pdf_document:
highlight: kate
template: default
---
# Advanced Computing 2
- **Goal:** Learn to write pipelines for data manipulation and analysis using the `UNIX` shell. Show how to interface the `UNIX` shell and `R`. Introduce the use of regular expressions.
- **Audience:** experienced `R` users, familiar with `dplyr` and `ggplot2`.
- **Installation:** Windows users need to install a UNIX shell emulator (e.g., `Git Bash`); for regular expressions, we are going to use the `R` package `stringr`.
# The UNIX Shell
`UNIX` is an operating system (i.e., the software that lets you
interface with the computer) developed in the 1970s by a group of
programmers at the AT&T Bell laboratories. Among them were Brian
Kernighan and Dennis Ritchie, who also developed the programming
language `C`. The new operating system was an immediate success
in academic circles, with many scientists writing new programs to
extend its features. This mix of commercial and academic interest led
to the many variants of `UNIX` available today (e.g., OpenBSD, Sun
Solaris, Apple OS X), collectively denoted as *nix systems. Linux is
the open source `UNIX` clone whose ``engine'' (kernel) was written
from scratch by Linus Torvalds with the assistance of a loosely-knit
team of hackers from across the internet.
All *nix systems are multi-user, network-oriented, and store data
as plain text files that can be exchanged between interconnected
computer systems. Another characteristic is the use of a strictly
hierarchical file system.
## Why use UNIX?
Many biologists are not familiar with coding in *nix systems and,
given that the learning curve is initially fairly steep, we start by
listing the main advantages of these systems over possible
alternatives.
First, `UNIX` is an operating system written by programmers for
programmers. This means that it is an ideal environment for developing
your code and storing your data.
Second, hundreds of small programs are available to perform simple
tasks. These small programs can be strung together efficiently so
that a single line of `UNIX` commands can perform complex operations,
which otherwise would require writing a long and complex program. The
possibility of creating these pipelines for data analysis is
especially important for biologists, as modern research groups produce
large and complex data sets, whose analysis requires a level of
automation that would be hard to achieve otherwise. For instance,
imagine working with millions of files by having to open each one of
them manually to perform an identical task, or try opening your single 80Gb whole-genome sequencing
file in a software with a graphical user interface! In `UNIX`, you can string a number of small programs together, each performing a
simple task, and create a complex pipeline that can be stored in a
script (a text file containing all the commands). Then, you can let
the computer analyze all of your data while you're having a cup of
coffee.
Third, text is the rule: almost anything (including the screen, the
mouse, etc.) in `UNIX` is represented as a text file. Using text files
means that all of your data can be read and written by any machine,
and without the need for sophisticated (and expensive) proprietary
software. Text files are (and always will be) supported by any
operating system and you will still be able to access your data
decades from today (while this is not the case for most commercial
software). The text-based nature of `UNIX` might seem unusual at
first, especially if you are used to graphical interfaces and
proprietary software. However, remember that `UNIX` has been around
since the early 1970s and will likely be around at the end of your
career. Thus, the hard work you are putting into learning `UNIX` will
pay off over a lifetime.
The long history of `UNIX` means that a large body of tutorials and
support web sites are readily available online. Last but not least,
`UNIX` is very stable, robust, secure, and---in the case of
Linux---freely available.
In the end, entirely avoiding to work in a `UNIX` "shell" is almost
impossible for a professional scientist: basically all resources for
High-Performance Computing (computer clusters, large workstations,
etc.) run a `UNIX` or Linux operating system. Similarly, the transfer
of large files, websites, and data between machines is typically
accomplished through command-line interfaces.
## Directory structure
In `UNIX` we speak of "directories", while in a graphical
environment the term "folder" is more common. These two terms are
interchangeable and refer to a structure that may contain
sub-directories and files. The `UNIX` directory structure is organized
hierarchically in a tree. As a biologist, you can think of this
structure as a phylogenetic tree. The common ancestor of all
directories is also called the "root" directory and is denoted by an
individual slash (`/`). From the root directory, several
important directories branch:
- `/bin` contains several basic programs.
- `/etc` contains configuration files.
- `/dev` contains the files connecting to devices such as the keyboard, mouse and screen.
- `/home` contains the home directory of each user (e.g., `/home/yourname`; in OS X, your home directory is stored in
`/Users/yourname`).
- `/tmp` contains temporary files.
You will typically work in your home directory. From there, you can
access the Desktop, Downloads, Documents, and other directories you
are likely familiar with. When you navigate the system you are in one
directory and can move deeper in the tree or upward towards the
root.
## Using the terminal
*Terminal* refers to the interface that you use to communicate
with the kernel (the core of the operating system). The terminal is
also called *shell*, or *command-line interface*
(CLI). It processes the commands you type, translates them for the
kernel, and shows you the results of your operations. There are
several shells available. Here, we concentrate on the most popular
one, the `bash` shell, which is the default shell in both
Ubuntu and OS X.
In Ubuntu, you can open a shell by pressing `Ctrl + Alt + t` or
by opening the dash (press the `Super` key) and typing
`Terminal`. In OS X, you want to open the application
`Terminal.app`, which is located in the folder *Utilities*
within *Applications*. Alternatively, you can type `Terminal` in
*Spotlight*. In either system, the shell will automatically
start in your home directory. Windows users can launch `Git Bash` or another terminal emulator.
The command line prompt ends with a "dollar" (`$`)
sign. This means the terminal is ready to accept your commands. Here, a `$` sign at the beginning of a line of code
signals that the command has to be executed in your terminal. You do
not need to type the `$` sign in your terminal, just copy the
command that follows it.
In `UNIX`, you can use the `Tab` key to reduce the amount you
have to type, which in turn reduces errors caused by typos. When you
press `Tab` in a (properly configured) shell, it will try to
automatically complete your command, directory or file name (if
multiple completions are possible, you can display them all by hitting
the `Tab` key twice). Additionally, you can navigate the
history of commands you typed by using the up/down arrows (you do not
need to re-type a command that you recently executed). There are also
shortcuts that help when dealing with long lines of code:
- `Ctrl + A` Go to the beginning of the line
- `Ctrl + E` Go to the end of the line
- `Ctrl + L` Clear the screen
- `Ctrl + U` Clear the line before the cursor position
- `Ctrl + K` Clear the line after the cursor
- `Ctrl + C` Kill the command that is currently running
- `Ctrl + D` Exit the current shell
- `Alt + F` Move cursor forward one word (in OS X,
`Esc + F`)
- `Alt + B` Move cursor backward one word (in OS X,
`Esc + B`)
Mastering these and other keyboard shortcuts will save you a lot of
time. You may want to print this list and keep it next to your
keyboard---in a while you will have them all memorized and will start
using them automatically.
## Basic UNIX commands
Here we introduce some of the most basic (and most useful)
`UNIX` commands. We write the commands in `fixed-width font`
and specific, user-provided input is capitalized in square
brackets. Again, the brackets and special formatting are not required
to execute a command in your terminal.
Many commands require some arguments (e.g., copy which file to where),
and all can be modified using the several options
available. Typically, options are either written as a dash followed by
a single letter (older style, e.g., `-f`) or two dashes
followed by words (newer style, e.g., `--full-name`). A command,
its options and arguments are separated by a space.
### How to get help in UNIX
`UNIX` ships with hundreds of commands. As such, it is impossible to
remember them all, let alone all their possible options. Fortunately,
each command is described in detail in its manual page, which can be
accessed directly from the shell by typing `man [COMMAND OF YOUR CHOICE]` (not available in `Git Bash`). Use arrows to scroll up and down and hit `q`
to close the manual page. Checking the exact behavior of a command is
especially important, given that the shell will execute any command
you type without asking whether you know what you're doing (such that it
will promptly remove all of your files, if that's the command you
typed). You may be used to more forgiving (and slightly patronizing)
operating systems in which a pop-up window will warn you whenever
something you're doing is considered dangerous. In `UNIX`, it is always
better to consult the manual rather than improvising.
If you want to interrupt the execution of a command, press
`Ctrl + C` to halt any command that is currently running in
your shell.
## Navigating the directory system
You can navigate the hierarchical `UNIX` directory system using these
commands:
- `pwd` print the path of the current
working directory.
- `ls` list the files and sub-directories in the current directory. `ls -a` list all (including hidden) files. `ls -l` return the long list with detailed information. `ls -lh` provide file sizes with units (B, M, K, etc. ).}
- `cd [NAMEOFDIR]` change directory.
`cd ..` move one directory up; `cd /` move to the root directory; `cd ~` move to your home directory; `cd -` go back to the directory you visited previously (like "Back"" in a browser).
## Handling directories and files
Create and delete files or directories using the following commands:
- `cp [FROM] [TO]` copy a file. The first argument is the file to copy. The second argument is
where to copy it (either a directory or a file name).
- `mv [FROM] [TO]` move or rename a file. Move a file by specifying two arguments: the file, and the
destination directory. Rename a file by specifying the old and the
new file name in the same directory.
- `touch [FILENAME]` Update the date of last access to the file. Interestingly, if the file does not exist, this command will
create an empty file.
- `rm [TOREMOVE]` remove a file. `rm -r` deletes the contents of a directory recursively (i.e., including all files and sub-directories in it; use with caution!). Similarly, `rm -f` removes the file and suppresses any prompt asking whether you are sure you want to remove
the file.
- `mkdir [DIRECTORY]` make a directory. To create nested directories, use the option `-p` (e.g., `mkdir -p d1/d2/d3`).
- `rmdir [DIRECTORY]` remove an empty directory.
## Printing and modulating files
`UNIX` was especially designed to handle text files, which is apparent
when considering the multitude of commands dealing with text. Here are
a few popular ones:
- `less [FILENAME]` progressively print a file on the screen (press `q` to exit). Funny fact: there is a
command called `more` that does the same thing, but with less flexibility. Clearly, in `UNIX`, `less` is
`more`.
- `cat [FILENAME]` concatenate and print files.
- `wc [FILENAME]` line, word, and byte (character) count of a file.
- `sort [FILENAME]` sort the lines of a file and print the result to the screen.
- `uniq [FILENAME]` show only unique lines of a file. The file needs to be sorted first for this to work
properly.
- `file [FILENAME]` determine the type of a file.
- `head [FILENAME]` print the `head` (i.e., first few lines of a file).
- `tail [FILENAME]` print the tail (i.e., last few lines of a file).
- `diff [FILE1] [FILE2]` show the differences between two files.
> **Exercise**
> To familiarize yourself with these commands, try the following:
>
> - Go to the `data` directory for this tutorial.
> - How many lines are in file `Marra2014_data.fasta`?
> - Go back to the `code` directory.
> - Create the empty file `toremove.txt`.
> - List the content of the directory.
> - Remove the file `toremove.txt`.
<!-- ## Permissions -->
<!-- In `UNIX`, each file and directory has specific security attributes -->
<!-- specifying who can read (`r`), write (`w`), execute -->
<!-- (`x`) or do nothing (`-`) with the file or -->
<!-- directory. These permissions are specified for three entities that may -->
<!-- wish to manipulate the file (owner, specific group, and others). The -->
<!-- group level is useful to assign permissions to a specified group of -->
<!-- users (e.g., administrators, developers) but not everyone else. -->
<!-- Typing `ls -all` lists all files and sub-directories, their size, and -->
<!-- permissions represented in a 10 character notation. The first -->
<!-- character refers to the file type and is not related to -->
<!-- permissions. The last nine are arranged in groups of three (triads) -->
<!-- representing the ownership groups (owner, group, others). For example, -->
<!-- when a file has the permission `|-rwxr-xr--|`, the owner of this -->
<!-- file can read, write, and execute the file (`rwx`), the group -->
<!-- can read and execute (`r-x`), while everyone else can only read -->
<!-- (`r--`). -->
<!-- There are two important commands to manipulate file -->
<!-- permissions: -->
<!-- - `chmod [FILENAME]` change file modes (i.e., permissions). -->
<!-- - `chown [FILENAME]` change the ownership of a file. -->
<!-- Sometimes you want to perform operations that are only allowed to the -->
<!-- administrator of the machine. You can do so (again, with caution!), by -->
<!-- typing the word `sudo` (substitute user do) in front of the command. The system will require a -->
<!-- password and, if you are authorized to use the `sudo` command, -->
<!-- you will be able to perform the operation as an administrator. This -->
<!-- command is very useful for installing new software. -->
## Miscellaneous commands
- `echo "[A STRING]"` print the string `[A STRING]`.
- `time` time the execution of a command.
- `wget [URL]` download the webpage at `[URL]`. (Available in Ubuntu; for OS X look at `curl`, or install `wget`).
- `history` list the last commands you executed.
## Advanced UNIX commands
### Redirection and pipes
So far, we have printed the output of each command (e.g., `ls`)
directly to the screen. However, it is easy to direct the output to a
file (*redirect*) or use it as the input of another command
(*pipe*). Stringing commands together in pipes is the real power of
`UNIX`---the ability to perform complex processing of large amounts of
data in a single line of commands. First, we show how to redirect the
output of a command into a file:
```
$ [COMMAND] > [FILENAME]
```
Note that if the file `[FILENAME]` exists, it will be
overwritten. If instead we want to append to an existing file, we can
use the `>>` symbol as in the following line:
```
$ [COMMAND] >> [FILENAME]
```
When the command is very long and complex, we might want to
redirect the content of a file as input to a command, "reversing"
the flow:
```
$ [COMMAND] < [FILENAME]
```
To run a few examples, let's start by moving to our `code` directory:
```
$ cd ~/BSD-QBio2/tutorials/advanced_computing2/code
```
The command `echo` can be used to print a string on the
screen. Instead of printing to the screen, we redirect the output to a
file, effectively creating a file containing the string we want to
print:
```
$ echo "My first line" > test.txt
```
We can see the result of our operation by printing the file to the
screen using the command `cat`:
```
$ cat test.txt
```
To append a second line to the file, we use `>>`:
```
$ echo "My second line" >> test.txt
$ cat test.txt
```
We can redirect the output of any command to a file. For example, it
is quite common to have to determine how many files are in a
directory. The files could have been created by an instrument or
provided by a collaborator. Before analyzing the data, we want to get
a sense of how many files will we need to process. If there are
thousands of files, it is quite time consuming to count them by hand
or even open a file browser that can do the counting for us. It is
much simpler and faster to just type a command or two. To try this,
let's create a file listing all the files contained in
`data/Saavedra2013`:
```
$ ls ../data/Saavedra2013 >> filelist.txt
$ cat filelist.txt
```
Now we want to count how many lines are in the file. We can do so by
calling the command `wc -l` (count only the lines):
```
$ wc -l filelist.txt
$ rm filelist.txt
```
However, we can skip the creation of the file by creating a short
pipeline. The pipe symbol `|` tells the shell to take the output
on the left of the pipe and use it as the input of the command on the
right of the pipe. To take the output of the command `ls` and
use it as the input of the command `wc` we can write:
```
$ ls ../data/Saavedra2013 | wc -l
```
We have created our first, simple pipeline. In the following sections,
we are going to build increasingly long and complex pipelines. The
idea is always to start with a command and progressively add one piece
after another to the pipeline, each time checking that the result is
the desired one.
### Selecting columns using `cut`
When dealing with tabular data, you will often encounter the Comma
Separated Values (CSV) Standard File Format. The CSV format is
platform and software independent, making it the standard output
format of many experimental devices. The versatility of the file
format should also make it your preferred choice when manually
entering and storing data.
The main `UNIX` command you want to master for comma-, space-, tab-,
or character-delimited text files is `cut`. To showcase its
features, we work with data on generation time of mammals published by
Pacifici *et al.*. First, let's make sure we are in the
right directory (`advanced_computing2/data`). Then, we can
print the header (the first line, specifying the content of each
column) of the CSV file using the command `head`, which prints
the first few lines of a file on the screen, with the option
`-n 1`, specifying that we want to output only the first line:
```
$ head -n 1 Pacifici2013_data.csv
TaxID;Order;Family;Genus;Scientific_name;...
```
We now pipe the header to `cut`, specify the character to be
used as delimiter (`-d ';'`), and use the `head` command to extract the name of the first
column (`-f 1`), or the names of the first four columns
(`-f 1-4`):
```
$ head -n 1 Pacifici2013_data.csv | cut -d ';' -f 1
TaxID
$ head -n 1 Pacifici2013_data.csv | cut -d ';' -f 1-4
TaxID;Order;Family;Genus
```
Remember to use the `Tab` key to auto-complete file names and
the arrow keys to access your command history.
In the next example, we work with the file content. We specify a
delimiter, extract specific columns, and pipe the result to the
`head` command---to display only the first few elements:
```
$ cut -d ';' -f 2 Pacifici2013_data.csv | head -n 5
Order
Rodentia
Rodentia
Rodentia
Macroscelidea
$ cut -d ';' -f 2,8 Pacifici2013_data.csv | head -n 3
Order;Max_longevity_d
Rodentia;292
Rodentia;456.25
```
Now, we specify the delimiter, extract the second column, skip the
first line (the header) using the `tail -n +2` command (i.e.,
return the whole file starting from the second line), and finally
display the first five entries:
```
$ cut -d ';' -f 2 Pacifici2013_data.csv | tail -n +2 | head -n 5
Rodentia
Rodentia
Rodentia
Macroscelidea
Rodentia
```
We pipe the result of the previous command to the `sort`
command (which sorts the lines), and then again to `uniq`,
(which takes only the elements that are not repeated).
Effectively, we have created a pipeline to extract the
names of all the Orders in the database, from Afrosoricida to
Tubulidentata (a remarkable Order, which today contains only the
aardvark).
```
$ cut -d ';' -f 2 Pacifici2013_data.csv | tail -n +2 | sort | uniq
Afrosoricida
Carnivora
Cetartiodactyla
...
```
This type of manipulation of character-delimited files is very fast
and effective. It is an excellent idea to master the `cut`
command in order to start exploring large data sets without the need
to open files in specialized programs (if you don't want to modify the
content of a file, you should not open it in an editor!).
> **Exercise:**
>
> - If we order all species names (fifth column) of `Pacifici2013_data.csv` in alphabetical
order, which is the first species? Which the last?
> - How many families are represented in the database?
### Substituting characters using `tr`
We often want to substitute or remove a specific character in a text
file (e.g., to convert a comma-separated file into a tab-separated
file). Such a one-by-one substitution can be accomplished with the
command `tr`. Let's look at some examples in which we use a
pipe to pass a string to `tr`, which then
processes the text input according to the search term and specific
options.
Substitute all characters `a` with `b`:
```
$ echo 'aaaabbb' | tr 'a' 'b'
bbbbbbb
```
Substitute every number in the range `1` through `5`
with `0`:
```
$ echo '123456789' | tr 1-5 0
000006789
```
Substitute lower-case letters with upper-case (note the one-to-one mapping):
```
$ echo 'ACtGGcAaTT' | tr actg ACTG
ACTGGCAATT
```
We achieve the same result using bracket expressions that provide a predefined set of characters. Here, we use the set of all lower-case letters `[:lower:]` and translate into upper-case letters `[:upper:]`:
```
$ eecho 'ACtGGcAaTT' | tr [:lower:] [:upper:]
ACTGGCAATT
```
We can also indicate ranges of characters to substitute:
```
$ echo 'aabbccddee' | tr a-c 1-3
112233ddee
```
Delete all occurrences of `a`:
```
$ echo 'aaaaabbbb' | tr -d a
bbbb
```
"Squeeze" all consecutive occurrences of `a`:
```
$ echo 'aaaaabbbb' | tr -s a
abbbb
```
Note that the command `tr` reads standard input, and does not operate on files directly. However, we can use pipes in conjunction with `cat`, `head`, `cut`, etc. to create input for `tr`:
```
$ tr ' ' '\t' < [INPUTFILE] > [OUTPUTFILE]
```
In this example we input `[INPUTFILE]` to the `tr` command to replace all spaces
with tabs. Note the use of quotes to specify the space character. The tab is
indicated by `\t` and is called a "meta-character". We use the backslash to
signal that the following character should not be interpreted literally, but
rather is a special code referring to a character that is difficult to represent
otherwise.
Now we can apply the command `tr` and the commands we have
showcased earlier to create a new file containing a subset of the data
contained in `Pacifici2013_data.csv`, which we are going to
use in the next section.
First, we change directory to the `code`:
```
$ cd ../code/
```
Now, we want to create a version of `Pacifici2013_data.csv`
containing only the `Order`, `Family`, `Genus`,
`Scientific_name`, and `AdultBodyMass_g` (columns
2-6). Moreover, we want to remove the header, sort the lines according
to body mass (with larger critters first), and have the values
separated by spaces. This sounds like an awful lot of work, but we're
going to see how this can be accomplished piping a few commands
together.
First, let's remove the header:
```
$ tail -n +2 ../data/Pacifici2013_data.csv
```
Then, take only the columns 2-6:
```
$ tail -n +2 ../data/Pacifici2013_data.csv | cut -d ';' -f 2-6
```
Now, substitute the current delimiter (`;`) with a space:
```
$ tail -n +2 ../data/Pacifici2013_data.csv | cut -d ';' -f 2-6 | tr -s ';' ' '
```
To sort the lines according to body size, we need to exploit a few of
the options for the command `sort`. First, we want to sort
numbers (option `-n`); second, we want larger values first
(option `-r`, reverse order); finally, we want to sort the data
according to the sixth column (option `-k 6`):
```
$ tail -n +2 ../data/Pacifici2013_data.csv |cut -d ';' -f 2-6 | tr -s ';' ' ' | sort -n -r -k 6
```
That's it. We have created our first complex pipeline. To complete the
task, we redirect the output of our pipeline to a new file called
`BodyM.csv`.
```
$ tail -n +2 ../data/Pacifici2013_data.csv | cut -d ';' -f 2-6 | tr -s
';' ' ' | sort -r -n -k 6 > BodyM.csv
```
You might object that the same operations could have been accomplished
with a few clicks by opening the file in a spreadsheet
editor. However, suppose you have to repeat this task many times,
e.g., to reformat every file that is produced by a laboratory
device. Then it is convenient to automate this task such that it can
be run with a single command.
Similarly, suppose you need to download a large CSV file from a
server, but many of the columns are not needed. With `cut`, you
can extract only the relevant columns, reducing download time and
storage.
### Selecting lines using `grep`
`grep` is a powerful command that finds all the lines of a file
that match a given pattern. You can return or count all occurrences of
the pattern in a large text file without ever opening
it. `grep` is based on the concept of regular expressions,
which we will cover just below (but using `R`, in which the syntax is slightly different).
We will test the basic features of `grep` using the file we
just created. The file contains data on
thousands of species:
```
$ wc -l BodyM.csv
5426 BodyM.csv
```
Let's see how many wombats (family Vombatidae) are contained
in the data. First we display the lines that contain the term
"Vombatidae":
```
$ grep Vombatidae BodyM.csv
Diprotodontia Vombatidae Lasiorhinus Lasiorhinus krefftii 31849.99
Diprotodontia Vombatidae Lasiorhinus Lasiorhinus latifrons 26163.8
Diprotodontia Vombatidae Vombatus Vombatus ursinus 26000
```
Now we add the option `-c` to count the lines:
```
$ grep -c Vombatidae BodyM.csv
3
```
Next, we have a look at the genus `Bos` in the data file:
```
$ grep Bos BodyM.csv
Cetartiodactyla Bovidae Bos Bos sauveli 791321.8
Cetartiodactyla Bovidae Bos Bos gaurus 721000
Cetartiodactyla Bovidae Bos Bos mutus 650000
Cetartiodactyla Bovidae Bos Bos javanicus 635974.3
Cetartiodactyla Bovidae Boselaphus Boselaphus tragocamelus 182253
```
Besides all the members of the `Bos` genus, we also match one
member of the genus `Boselaphus`. To exclude it, we can use the
option `-w`, which prompts `grep` to match only full
words:
```
$ grep -w Bos BodyM.csv
Cetartiodactyla Bovidae Bos Bos sauveli 791321.8
Cetartiodactyla Bovidae Bos Bos gaurus 721000
Cetartiodactyla Bovidae Bos Bos mutus 650000
Cetartiodactyla Bovidae Bos Bos javanicus 635974.3
```
Using the option `-i` we can make the search case-insensitive
(it will match both upper- and lower-case instances):
```
$ grep -i Bos BodyM.csv
Proboscidea Elephantidae Loxodonta Loxodonta africana 3824540
Proboscidea Elephantidae Elephas Elephas maximus 3269794
Cetartiodactyla Bovidae Bos Bos sauveli 791321.8
Cetartiodactyla Bovidae Bos Bos gaurus 721000
...
```
Sometimes, we want to know which lines precede or follow the one we
want to match. For example, suppose we want to know which mammals have
body weight most similar to the gorilla (*Gorilla gorilla*). The species are already ordered by size, thus we can simply print the two lines before
the match using the option `-B 2` and the two lines after the
match using `-A 2`:
```
$ grep -B 2 -A 2 "Gorilla gorilla" BodyM.csv
Cetartiodactyla Bovidae Ovis Ovis ammon 113998.7
Cetartiodactyla Delphinidae Lissodelphis Lissodelphis borealis 113000
Primates Hominidae Gorilla Gorilla gorilla 112589
Cetartiodactyla Cervidae Blastocerus Blastocerus dichotomus 112518.5
Cetartiodactyla Iniidae Lipotes Lipotes vexillifer 112138.3
```
Use option `-n` to show the line number of the match. For
example, the gorilla is the 164th largest mammal in the database:
```
$ grep -n "Gorilla gorilla" BodyM.csv
164:Primates Hominidae Gorilla Gorilla gorilla 112589
```
To print all the lines that do not match a given pattern, use the
option `-v`. For example, to get the other species of the genus
*Gorilla* with the exception of *Gorilla gorilla*, we
can use:
```
$ grep Gorilla BodyM.csv | grep -v gorilla
Primates Hominidae Gorilla Gorilla beringei 149325.2
```
To match one of several strings, use `grep "[STRING1]\|[STRING2]"`
```
$ grep -w "Gorilla\|Pan" BodyM.csv
Primates Hominidae Gorilla Gorilla beringei 149325.2
Primates Hominidae Gorilla Gorilla gorilla 112589
Primates Hominidae Pan Pan troglodytes 45000
Primates Hominidae Pan Pan paniscus 35119.95
```
You can use `grep` on multiple files at a time! Simply, list
all the files to use instead of just one file.
# Interfacing `R` and the `UNIX` shell
Note: this will work if you're using Mac OSX or UNIX; in Windows options are much more limited.
## Calling `R` from the command line
In `R`, you typically work in "interactive" mode --- you type a command, it
gets executed, you type another command, and so on. Often, we want to
be able to re-run a script on different data sets or with different
parameters. For that purpose you can store all the commands in a
text file (typically, with extension `.R`), and then re-run the
analysis by typing in the `UNIX` command line
```
$ Rscript my_script_file.R
```
To properly automate our analysis and figure generation, however, we
can additionally pass command-line arguments to `R`. This allows us for
instance to perform the analysis using a specific input file, or save
the figure using a specific file name.
`Rscript` accepts command-line arguments, that need to be
parsed within `R`. The code at the beginning of the following script
shows how this is accomplished:
```{r, eval = FALSE}
# Get all the command-line arguments
args <- commandArgs(TRUE)
# Assign each argument to a variable,
# making sure to convert it to the right
# type of variable (string by default)
# check the number of arguments
num.args <- length(args)
print(paste("Number of command-line arguments:", num.args))
# print all the arguments
if (num.args > 0) {
for (i in 1:num.args) {
print(paste(i, "->", args[i]))
}
}
# We can initially set to default values
# (but pay attention to the order,
# the optional arguments should be at the end)
input.file <- "test.txt"
number.replicates <- 10
starting.point <- 3.14
if (num.args >= 1) {
input.file <- args[1]
}
if (num.args >= 2) {
number.replicates <- as.integer(args[2])
}
if (num.args >= 3) {
starting.point <- as.double(args[3])
}
print(c(input.file, number.replicates, starting.point))
# Save this script as my_script.R
# Run the script in bash with different arguments
# Rscript my_script.R abc.txt 5 100.0
# Rscript my_script.R abc.txt 5
# Rscript my_script.R abc.txt
# Rscript my_script.R
```
## Calling the command line from `R`
You can call the operating system from within `R` (assuming you're in `/advanced_computing_2/code`):
```{r, eval = TRUE}
system("wc -l < ../../basic_computing_1/data/H938_Euro_chr6.geno")
```
You can also capture the output from the shell commands and save it
into `R`. Everything is treated as text (convert to numeric if
necessary):
```{r, eval = TRUE}
numlines <- system("wc -l < ../../basic_computing_1/data/H938_Euro_chr6.geno",
intern = TRUE)
numlines
```
You can also use a combination of shell commands and
`read.table` to capture more complex output:
```{r, eval = TRUE}
mydf <- system("grep rs125283 ../../basic_computing_1/data/H938_Euro_chr6.geno",
intern = TRUE)
mydf <- read.table(file = textConnection(mydf))
mydf
```
# Regular expressions in `R`
Sometimes data is hidden in free text. Think of citations in a manuscript, mentions of DNA motifs in tables, etc. You could copy and paste data from these unstructured texts yourself, but if you have much text, the task is very boring and error-prone. What you need is a way to describe a text pattern to a computer, and then have it extract the data automatically. Regular Expressions do exactly that.
Because you want to describe a text pattern using text, a level of abstraction is inevitable. What you want to do is to construct a pattern using **literal** characters and **metacharacters**.
For all our examples, we will use the package `stringr`, which makes the regular expression syntax consistent (there are many *dialects*), and provides a set of easy-to-use functions:
```{r}
library(stringr)
```
All the functions have a common structure. For example, `str_extract` extracts text matching a pattern: `str_extract(text, pattern)`. The simplest possible expression is one in which the pattern is described literally (i.e., we want to find exactly the text we're typing):
```{r}
str_extract("a string of text", "t")
str_extract_all("a string of text", "t")
```
Of course, you need to be able to describe much more general patterns. Use the following metacharacters:
- `\d` Match a digit character (`0-9`)
- `\D` Match any character that is not a digit
- `\n` Match a newline
- `\s` Match a space
- `\t` Match a `Tab`
- `\b` Match a "word boundary"
- `\w` Match a "word" character (alphanumeric)
- `.` Match any character
Some examples (note that to escape characters, you want to use two backslashes --- you need to escape the backslash itself!):
```{r}
# find the first digit
str_extract("123.25 grams", "\\d")
# find word separator + word character + word separator
str_extract("Albert Einstein was a genius", "\\b\\w\\b")
# find all digits
str_extract_all("my cell is 773 345 6789", "\\d")
# extract all characters
str_extract_all("for example, this and that", ".")
```
Of course, you don't want to type `\\w` fifteen times, in case you are looking for a string that is 15 characters long! Rather, you can use quantifiers:
- `*` Match zero or more times. Match as many times as possible.
- `*?` Match zero or more times. Match as few times as possible.
- `+` Match one or more times. Match as many times as possible.
- `+?` Match one or more times. Match as few times as possible.
- `?` Match zero or one times. In case both zero and one time match, prefer one.
- `??` Match zero or one times, prefer zero.
- `{n}` Match exactly `n` times.
- `{n,}` Match at least `n` times. Match as many times as possible.
- `{n,m}` Match between `n` and `m` times.
> **Exercise**
>
> What does this do? Try to guess, and then type the command into `R`
> ```{r, eval = FALSE}
str_extract_all("12.06+3.21i", "\\d+\\.?\\d+")
my_str <- "most beautiful and most wonderful have been, and are being, evolved."
str_extract(my_str, "\\b\\w{6,10}\\b")
str_extract(my_str, "\\b\\w+\\b")
str_extract(my_str, "w\\w*")
str_extract(my_str, "b\\w*")
str_extract(my_str, "b\\w+?")
str_extract(my_str, "\\s\\wn\\w+")
```
What if you want to match the characters `?`, `+`, `*`, `.`? You will need to escape them: for example, `\\.` matches the "dot" character.
You can specify anchors to signal that the match has to be in certain special positions in the text:
- `^` Match at the beginning of a line.
- `$` Match at the end of a line.
```{r}
str_extract("Ah, ba ba ba ba Barbara Ann", "\\w{2,}$")
str_extract("Ah, ba ba ba ba Barbara Ann", "^\\w{2,}")
```
To match one of several characters, list them between brackets:
```{r}
str_extract("01234567890", "[3120]+")
str_extract("01234567890", "[3-5]+")
str_extract("supercalifragilisticexpialidocious", "[a-i]{3,}")
```
To match either of two patterns, use alternations:
```{r}
str_extract_all("The quick brown fox jumps over the lazy dog", "fox|dog")
```
If you need more complex alternations, use parentheses to separate the patterns.
Parentheses can also be used to define **groups**, which are used when you want to capture unknown text that is however flanked by known patterns. For example, suppose you want to save the user name of a UofC email:
```{r}
str_match_all("sallesina@uchicago.edu mjsmith@uchicago.edu",
"\\b([a-zA-Z0-9]*?)@uchicago.edu")
```
Note that you have to use `str_match` or `str_match_all` to obtain information on the groups.
### Useful functions
Many functions have a similar `str_***_all` version, returning all matches.
- `str_detect(strings, pattern)` do the `strings` contain the `pattern`? Returns a logical vector.
- `str_locate(strings, pattern)` find the character position of the pattern
- `str_extract(strings, pattern)` extracts the first match
- `str_match(strings, pattern)` like `extract` but capture groups defined by parentheses
- `str_replace(strings, pattern, newstring)` replaces the first matched `pattern` with `newstring`
## Exercises in groups
### Extract primers and polymorphic sites
In the file `data/Ptak_etal_2004.txt` you find a text version of the supplementary materials of Ptak *et al.* (PLoS Biology 2004, doi:10.1371/journal.pbio.0020155).
- Take a look at the file. You see that it first lists the primers used for the study (e.g., `TAP2-17-3’` -> `CTTGGATATAACACCAAACGCA`), and then the polymorphic sites for 24 chimpanzees (e.g., .
- Read the text as a single string, intervaled by new lines
```{r}
my_txt <- paste(readLines("../data/Ptak_etal_2004.txt"), collapse="\n")
```
- Use regular expressions to produce a data frame containing the primers used in the study:
```{r, eval = FALSE}
head(primers)
```
```
ID Sequence
1 TAP2-1-5’ GAGAATCACTTGAACCTGGGAG
2 TAP2-2-5’ TTGTCCACAGTGTACCACATGA
3 TAP2-3-5’ TATTTCTTCCTGGGGTTTCCTT
4 TAP2-4-5’ CATGATGTGTCATGCTGAATTG
5 TAP2-5-5’ ATAGAACAAGAACCAAAGCCCA
6 TAP2-6-5’ GGACAACAGATAAAGTTGCCCT
```
- Write another regular expression to extract the polymorphic region for each chimp. Use `str_replace_all` to remove extra spaces and newlines. The results should look like:
```
Chimp Sequence
1 311 ATACCCTGGAGGCAAGAATCTTCCGATAGACGCCAGTCCCTAGTTGT...
2 312 GCACCCTGGAGGCAAGAGTCTTCCGATAGACGCCAGTCCCTAGTTGC...
3 313 GCACCCTGGAGGCAAGAGTCTTCCGATAGACGCCAGTCCCCAGTTGT...
4 314 GCACCCTGGAGGCAAGAGTCTTCCGATAGACCCCAGTCCCCAGTTGC...
5 317 GCACCCTGGAGGTGGGGGCCCCCCAGGAGGCCCCAGCCCCCGGTCGT...
6 320 GCACCCTGGAGATAAGGGCCCCCCAGGAGGCCCCAGCCCCCGGTCGT...
```
### A map of *Science*
Where does science come from? This question has fascinated researchers
for decades and even led to the birth of the field of "Science of
Science", where researchers use the same tools they invented to
investigate nature to gain insights on the development of science