# Basics of Bash for Bioinformatics

by Katharina J. Hoff, giving credit to materials provided in German by Maike Tech (http://gobics.de/tech/linuxskript.pdf).

It is sometimes difficult for beginners to see why they should learn all these Bash commands. References to practical application in bioinformatics are therefore denoted with a ★.

## Introduction

★ The shell (command line interpreter) is a powerful tool for efficiently solving many bioinformatics problems, in particular for handling large files, such as sequence files (e.g. DNA, RNA-Seq, proteins). 

In this course, we will use the *Bourne-again-shell*, abbreviated as *Bash*. This is only one of several available shells (another one would be the *korn-shell*, abbreviated as *ksh*). 

When you open Bash (and the Terminal provided by JupyterLab is running a compressed version of Bash), you see the *prompt*:


![terminal.jpg](terminal.jpg)

You enter commands after the prompt. The syntax at the beginning of the line is `username@machine-name:directory`. In the example above, the username is "I have no name!", which is quite unusual for a Unix system. It is a peculiarity of JupyterLab at the University of Greifswald. On your own system and on most others, you will certainly have a more specific username. The machine name in the example above is "jupyter-hoffk83", the generic machine name generated by JupyterLab in Greifswald. Again: on systems that are not virtual machines, you will have a more specific machine name. The directory is abbreviated here as "~". This symbol stands for your home directory. You type commands after the `$` character. Every command ends with the Enter key on your keyboard (or with a line break in a script).

Commands may be called with so-called *arguments*. These may be parameters that affect the behavior of the program, or simple strings (for our purposes, a string is a sequence of alphanumeric characters typed on your keyboard). Commands and arguments are separated by a space character (the long key at the bottom of your keyboard).

## Working with Bash

In the following, we will look at a couple of Bash commands and their usage. The first command is `echo`. This tool displays the string that it receives as an argument on the standard output device (in most cases your monitor). After `echo` has finished, we return to the shell and can enter further commands. Please note that `%%script bash` is a Jupyter Notebook specific *built-in magic* command. You do not have to type it into a normal Bash! If you copy the command into the Terminal in JupyterLab, do not copy that part! Only copy `echo hello`.

![play_button.jpg](play_button.jpg)

**Hint:** to execute a command block in JupyterLab (below this line, you see a command block), press the play button at the top of the notebook.

In [1]:
%%script bash 
echo hello

hello


★ Even though the functionality to output strings that served as input may seem trivial, the `echo` tool is very useful in bioinformatics. For example, bioinformaticians use `echo` to automatically generate job submission scripts for compute clusters.

### Parameters

You can modify tool calls with parameters. These are entered after the tool name and often begin with a dash (-). For example, you can suppress the trailing newline of `echo` output with the parameter `-n`.

In [2]:
%%script bash 
echo -n hello

hello

The effect is not visible inside the Jupyter Notebook. You need to execute the command in the Terminal. The output will be

![hello.jpg](hello.jpg)
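
Since the missing newline is hard to see in the notebook, one way to make it visible is to count the bytes that `echo` prints. The following is only a minimal sketch and not part of the original course material: it assumes the tool `wc -c` (count bytes) and the pipe symbol `|`, which hands the output of one command to the next. The variable names are made up for this illustration.

```bash
# Count the bytes that echo prints, with and without the trailing newline.
# "|" passes the output of echo on to wc; "wc -c" counts bytes.
with_newline=$(echo hello | wc -c)
without_newline=$(echo -n hello | wc -c)

# "hello" has 5 characters; the trailing newline is one additional byte.
echo "with newline:    $with_newline bytes"
echo "without newline: $without_newline bytes"
```

The difference of exactly one byte is the newline character that `-n` suppresses.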

Have a look at this video if you don't know where to find the Terminal in JupyterHub:  https://youtu.be/HCy0VdWINpo

**Syntax:**

`echo` [*options*] *string*

`-n` suppress trailing newline

Many parameters in Unix are standardized. For example, `-v` stands for *verbose* in many (not all) tools. *Verbose* means that much supplementary information is printed in addition to the usual output.

### Documentation

Most Unix systems have extensive built-in documentation of Bash commands. Almost every tool has a *man-page*.

Example:


In [None]:
%%script bash
man less

The example above calls the *man-page* of `less`. I did not choose this tool arbitrarily. In JupyterLab at the University of Greifswald, not all *man-pages* are installed. `man echo`, for example, is not available. The reason is that the computational resource center is trying to keep the startup time of JupyterHub to a minimum. So you cannot rely on built-in *man-pages* in this environment. The good news is: if you type `man echo` into http://www.google.de, you will easily find the same kind of documentation on various websites. So please use a search engine when you need *man-pages* in JupyterLab at the University of Greifswald.

In a typical terminal (not inside the Jupyter Notebook output), you can navigate within the *man-page* (try it in the JupyterLab terminal with `man less`): use the arrow keys to go down and up, search forward with `/` followed by your search term, and exit by typing `q`.

In general, *man-pages* consist of:

* `NAME` - the tool name and a short description
* `SYNOPSIS` - syntax of usage
* `DESCRIPTION` - detailed description of tool
* `OPTIONS` - description of parameters
* `FILES` - files required by tool
* `SEE ALSO` - hints about related tools
* `DIAGNOSTICS` - error code description
* `BUGS` - known problems
* `EXAMPLE` - usage example

In addition, many *man-pages* provide information about `ENVIRONMENT`, `COMMANDS`, `VERSION`, `COPYRIGHT`, and `AUTHOR`. 

#### Formatting conventions in *man-pages*

* **`bold text`** - argument has to be written exactly as stated
* *`italic text`* (often displayed underlined in a terminal) - replace the text by an appropriate argument
* `[-abc]` - arguments in square brackets are optional
* `-a|-b` - you can use either argument a or argument b, not both at the same time
* `argument ...` - the argument can be used multiple times (e.g. files)
* `[expression] ...` - the entire expression in square brackets can be repeated

We are looking at *man-pages* from two points of views in this course:

1. for retrieving information about usage of tools (because most people do not want and do not have to memorize all these parameters and options),
2. ★ for writing our own usage information in scripts in Python, later.

### Paths

Unix file systems are tree-like. They begin at the root and branch out. The following is only a schematic drawing of an actual file tree.

![tree.jpg](tree.jpg)

When typing unix paths, always remember

* the `/` is the directory and file separating character
* **do not insert spaces** into file names, directory names, or paths built from both (I hear you: "But spaces in directory names on xyz system work fine!" - No, it's not fine! There are many tools that cannot deal with such spaces. Never use them. On no system! Not in file names or directory names! That's bad practice!)

If you plug a USB stick into your computer on a Unix system, it will not show up as a separate drive as on Windows (e.g. `E:\USB`). Instead, it will be mounted somewhere in the directory tree. In the drawing above, you find a USB stick (just as an example) in `/mnt/USB`.

★ In bioinformatics, it is often important to know where you are on your system. `pwd` will tell you exactly this information. It stands for *print working directory*:

In [4]:
%%script bash
pwd

/home/jovyan/Bash_basics_part_1


We are currently in the subdirectory `Bash_basics_part_1/` of the home directory `/home/jovyan/`. (JupyterLab returns a funny home directory here... because you are a user called jovyan in the emulated machine that hosts JupyterLab. Usually, you would expect something like `/home/hoffk83` if your user name is `hoffk83`.)

Every directory and every file has a unique path. (Again: **no spaces in path and file names!**) File names can be up to 255 characters. Avoid special characters, except for _ (*underscore character*). Be aware that *capitalization* matters! A capital `A` is different from a small `a`.

Paths can be specified as *absolute*, i.e. starting from the root, or *relative*, i.e. starting from the current directory.

**Example:**

* `/home/jovyan/Documents` - absolute path

* `~/Documents` - same absolute path, using the abbreviation `~` for `/home/jovyan`.

* Now, assume that we are located in `/home/jovyan/Downloads`, and we want to use a relative expression to specify `/home/jovyan/Documents` without writing out the absolute path. We can then say `../Documents`. The `..` means: one directory up. (`.` means the current directory.)
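
To try absolute and relative paths without touching your real home directory, a small practice tree can be built in a temporary location. This is only a sketch under assumptions: it uses `mktemp -d` (creates an empty temporary directory), `mkdir`, and `cd`, and the directory names simply mirror the example above.

```bash
# Remember where we started, so that we can return at the end.
start_dir=$(pwd)

# Build a small practice tree in a temporary directory.
base=$(mktemp -d)
mkdir "$base/Documents" "$base/Downloads"

# Relative path: from Downloads, go one level up and then into Documents.
cd "$base/Downloads"
cd ../Documents
via_relative=$(pwd)

# Absolute path: the full path works from any location.
cd "$base/Downloads"
cd "$base/Documents"
via_absolute=$(pwd)

echo "relative: $via_relative"
echo "absolute: $via_absolute"

# Return to the starting point and remove the practice tree.
cd "$start_dir"
rmdir "$base/Documents" "$base/Downloads" "$base"
```

Both `echo` lines should print the same directory, because the relative and the absolute expression lead to the same place.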



**Practice:**

Please test the commands given in code blocks in the Terminal of JupyterLab, as well, since the goal is not only to become fluent in a JupyterNotebook, but also to acquire knowledge in using any Bash terminal.

## Data Management in Bash

Bash is a very powerful tool for efficient data management. This makes Bash into a key environment for handling large bioinformatics files.

### Files and Directories

Inspect files and directory contents with `ls`. This is without doubt one of the most essential commands for checking where you are on your system, and what files are where you are.

**Syntax:**

`ls [Options] [Path]`

**Example:**

In [1]:
%%script bash
ls

hello.jpg
introduction_to_bash.ipynb
play_button.jpg
terminal.jpg
tree.jpg


The parameter `-l` (*long*) will show you a lot of additional information for files and directories:

In [2]:
%%script bash
ls -l

total 712
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  12745 Jun 24 10:44 introduction_to_bash.ipynb
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg


Column content explained:
    
1. File attributes. The general format is `-rwxrwxrwx`, i.e. there are 10 positions, each filled with one character. The first position is a `-` for files and a `d` for directories. It may also be an `l` for link, `p` for pipe, `b` for block-oriented device, `c` for character-oriented device, or `s` for socket. But `-` and `d` are the most important characters for you in this course.

   The next three characters show the permissions for the owner of the file or directory, also referred to as the `user`. `r` means read, `w` means write, and `x` stands for execute. If the characters say `rwx`, the user is allowed to read, write, and execute the file. If it says `---`, the user is not allowed to do anything with the file. If it says e.g. `rw-`, the file can be read and written by the user, but execution is forbidden. These three characters for user permissions are highly important in this course! If the execute permission is missing, an otherwise executable file will not be executed. Therefore, you will learn later in this course how to modify these permissions when required.

   The following three characters are the permissions for the group that owns the file. In this JupyterHub environment, the user name is displayed as a number, and the group is called `users`. We will not deal in detail with changing group permissions in this course. But imagine you share a Linux computer with your roommates and your family. You don't want your parents to be able to see embarrassing photos of the latest party in your flat, but you do want your roommates to be able to view them. Therefore, you make the directory with the photos the property of the group `roomies` and allow that group to `r-x` it, while users outside the group (such as your parents) get no permissions.

   The last three characters are the permissions for all other users on the machine, again `rwx`-style.

2. Number of links. For a file, the number says how many names exist for this exact file (*hard links*, try to avoid using these!). For a directory, the number grows with the number of subdirectories it contains (an empty directory typically shows 2).

3. Name of owner of file. In JupyterLab, a number. On other systems, usually a human readable user name.

4. Group to which the file belongs. In JupyterLab, mostly `users`.

5. File size, by default in bytes (try `ls -lh` to get human-readable sizes in K, M, G, etc.). Careful: for a directory, this is not the size of its contents. Only the size that the directory itself requires on the file system is displayed (often 4096 bytes).

6. Time of last modification of file/directory.

7. File or directory name.
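
The first character of the attribute column can be checked with a short experiment. The following is a minimal sketch, not taken from the course: the file and directory names are hypothetical, `ls -d` lists a directory itself rather than its contents, and `cut -c1` (an assumption beyond the tools introduced here) keeps only the first character of a line.

```bash
# Create one file and one directory in a temporary location.
workdir=$(mktemp -d)
touch "$workdir/reads.txt"
mkdir "$workdir/results"

# Long listing of both entries; -d shows the directory itself.
ls -ld "$workdir/reads.txt" "$workdir/results"

# Keep only the first character of each line: the file type.
file_type=$(ls -ld "$workdir/reads.txt" | cut -c1)
dir_type=$(ls -ld "$workdir/results" | cut -c1)
echo "reads.txt starts with: $file_type"
echo "results starts with:   $dir_type"

# Clean up.
rm "$workdir/reads.txt"
rmdir "$workdir/results" "$workdir"
```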

### Moving Within Your File System

*Change directory*, `cd`, allows you to change into different directories.

**Syntax:**

`cd [path]`

**Example:**

In [3]:
%%script bash
pwd
cd ~/
pwd
cd ..
pwd

/home/jovyan/Bioinf_pract_lecture/lecture/01_bash
/home/jovyan
/home


### Deleting Files

Be very careful about deleting (*removing*) files from your system! In Unix, files do not go into any kind of trash that can be restored. Files that are deleted are deleted. You will not be able to restore deleted files!

**Syntax:**

`rm [options] file`

Important parameters are `-i` for *interactive* (this will require you to manually confirm the removal of files) and `-r` for *recursive* removal of directories including all their contents. (Empty directories can be deleted with `rmdir`, instead.)

**Example:**

We will create a new (and empty) file with `touch` for the sake of demonstrating `rm` (`echo` is only used to indicate in the output at which step you are):

In [4]:
%%script bash
echo "**Creating new file with touch new.txt"
touch new.txt
echo "**Confirming that the file exists:"
ls -l new.txt
echo "**Deleting the new file with rm new.txt"
rm new.txt
echo "**Confirming that new.txt really cannot be found in your directory, anymore:"
ls

**Creating new file with touch new.txt
**Confirming that the file exists:
-rw-r--r-- 1 38458 users 0 Jun 24 10:46 new.txt
**Deleting the new file with rm new.txt
**Confirming that new.txt really cannot be found in your directory, anymore:
hello.jpg
introduction_to_bash.ipynb
play_button.jpg
terminal.jpg
tree.jpg
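
The difference between `rmdir` and `rm -r` can be tried safely in a temporary directory. This is a hedged sketch with made-up directory names. The `if ... then ... else ... fi` construct and `2>/dev/null` (which hides the error message) are assumptions that go beyond what has been introduced so far.

```bash
# Practice area with one empty and one non-empty directory.
workdir=$(mktemp -d)
mkdir "$workdir/empty_dir" "$workdir/full_dir"
touch "$workdir/full_dir/a.txt" "$workdir/full_dir/b.txt"

# rmdir removes an empty directory without complaint.
rmdir "$workdir/empty_dir"

# rmdir refuses to remove a directory that still contains files.
if rmdir "$workdir/full_dir" 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi
echo "rmdir on a non-empty directory: $rmdir_result"

# rm -r removes the directory together with everything inside. No undo!
rm -r "$workdir/full_dir"

# Only the (now empty) temporary directory is left; remove it as well.
rmdir "$workdir"
```

Because `rm -r` asks no questions, it is worth running `ls` on the directory first to see what is about to disappear.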


### Copying files

**Syntax:**

`cp [options] source target`

This command copies a source file to a target file. If the target file already exists, it will be overwritten without a warning. The parameter `-i` for *interactive* works in the same way as for `rm`. `-r` allows recursive copying of directories including their content.
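
The silent overwrite and the recursive copy can both be observed in a temporary directory. The sketch below is only an illustration with hypothetical file names. It assumes the redirection `>` (write output into a file) and `cat` (print a file), which are explained later in this course.

```bash
workdir=$(mktemp -d)

# cp silently overwrites an existing target.
echo "old content" > "$workdir/target.txt"
echo "new content" > "$workdir/source.txt"
cp "$workdir/source.txt" "$workdir/target.txt"
target_content=$(cat "$workdir/target.txt")
echo "target.txt now contains: $target_content"

# cp -r copies a directory including its content.
mkdir "$workdir/project"
echo "ACGT" > "$workdir/project/seq.txt"
cp -r "$workdir/project" "$workdir/project_backup"
backup_content=$(cat "$workdir/project_backup/seq.txt")
echo "the backup contains: $backup_content"

# Clean up.
rm -r "$workdir"
```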

**Example:**

In [6]:
%%script bash
echo "Inspecting which files already exist:"
ls -l
echo "Copying the current notebook file:"
cp introduction_to_bash.ipynb new.ipynb
echo "Inspecting effect: copied file exists:"
ls -l
rm new.ipynb
echo "Inspecting that the copied file has been deleted (because it makes no sense to keep the copy):"
ls -l

Inspecting which files already exist:
total 716
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduction_to_bash.ipynb
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg
Copying the current notebook file:
Inspecting effect: copied file exists:
total 736
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduction_to_bash.ipynb
-rw-r--r-- 1 38458 users  19195 Jun 24 10:47 new.ipynb
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg
Inspecting that the copied file has been deleted (because it makes no sense to keep the copy):
total 716
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduct

### Moving Files

Moving files and renaming files or directories is the same thing in Unix systems, unless you move content to a different physical location. This means: moving is super fast even for huge files if you remain on the same physical location (e.g. the same hard drive). If you move to a different location, a copy process is invoked in background, and subsequently, the original file is deleted. This may of course take quite some time if the file is large.

**Syntax:**

`mv [options] source target`

Parameter `-i` for *interactive* mode works here as well.

**Example:**

In [7]:
%%script bash
echo "** Creating a file that we can safely move around:"
touch new.txt
ls -l new.txt
echo "** Moving the file."
mv new.txt new2.txt
echo "** Confirming that the file was indeed moved to new2.txt:"
ls -l new2.txt
rm new2.txt

** Creating a file that we can safely move around:
-rw-r--r-- 1 38458 users 0 Jun 24 10:47 new.txt
** Moving the file.
** Confirming that the file was indeed moved to new2.txt:
-rw-r--r-- 1 38458 users 0 Jun 24 10:47 new2.txt


### Creating Directories

**Syntax:**

`mkdir [options] directoryname`

Be careful when creating directories inside other directories: the full path leading to the new directory must already exist, unless you give the parameter `-p`, which also creates missing parent directories.

**Example:**

In [9]:
%%script bash
echo "** Inspecting that directory does not exist, yet:"
ls -l
echo "** Creating directory new_dir."
mkdir new_dir
echo "** Inspecting that new_dir exists, now:"
ls -l
rmdir new_dir

** Inspecting that directory does not exist, yet:
total 716
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduction_to_bash.ipynb
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg
** Creating directory new_dir.
** Inspecting that new_dir exists, now:
total 720
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduction_to_bash.ipynb
drwxr-sr-x 2 38458 users   4096 Jun 24 10:47 new_dir
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg
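
The effect of `-p` becomes clear when a nested path is created in one step. The following sketch uses hypothetical directory names in a temporary location. The `if` construct and `2>/dev/null` (hide the error message) are assumptions beyond the commands introduced so far.

```bash
workdir=$(mktemp -d)

# Without -p, mkdir fails because the parent "project/data" does not exist yet.
if mkdir "$workdir/project/data/raw" 2>/dev/null; then
    without_p="created"
else
    without_p="failed"
fi
echo "mkdir without -p: $without_p"

# With -p, missing parent directories are created along the way.
if mkdir -p "$workdir/project/data/raw"; then
    with_p="created"
else
    with_p="failed"
fi
echo "mkdir with -p:    $with_p"

# Clean up.
rm -r "$workdir"
```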


### Creating Links

**Syntax:**

`ln [options] source linkname`

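You all know hyperlinks on the WWW. Links on the file system are similar. They are aliases for files or directories. There are *hard links* and *soft links* (also called symbolic links). A hard link is an additional name for the same data on disk; the data is only removed once the last of its names has been deleted. Because a hard link is indistinguishable from the original name and easy to confuse with an independent copy, please avoid hard links in this course. A soft link merely points to the path of the source: deleting the link leaves the source untouched, whereas deleting the source leaves a broken link behind. For creating a soft link, use option `-s`.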

Links are particularly useful in bioinformatics when the file comes off the sequencer with an odd and very long file name. You want to keep that original file name for documentation purposes, but you do not want to spell it out during further data analysis. Thus, you create a soft link and continue to spell the link name, which is usually much shorter and easier, since you chose that link name to be short and easy.

**Example:**

In [10]:
%%script bash
touch new.txt
echo "** Checking that the link does not exist, yet:"
ls -l
echo "** Creating link."
ln -s new.txt new2.txt
echo "** Checking that the link exists, now:"
ls -l
rm new2.txt new.txt

** Checking that the link does not exist, yet:
total 716
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduction_to_bash.ipynb
-rw-r--r-- 1 38458 users      0 Jun 24 10:48 new.txt
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg
** Creating link.
** Checking that the link exists, now:
total 716
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduction_to_bash.ipynb
lrwxrwxrwx 1 38458 users      7 Jun 24 10:48 new2.txt -> new.txt
-rw-r--r-- 1 38458 users      0 Jun 24 10:48 new.txt
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg
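
The sequencer use case can be imitated with a made-up file name. This is only a sketch: the long file name and the link name `reads.fastq` are hypothetical, and everything happens in a temporary directory.

```bash
start_dir=$(pwd)
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical, unwieldy file name as it might come off a sequencer.
long_name="run_2020-06-24_flowcell_XY123_lane1_sampleA_R1_001.fastq"
echo "@read1" > "$long_name"

# Create a short soft link that points to the long name.
ln -s "$long_name" reads.fastq
ls -l reads.fastq

# Reading through the link gives the content of the original file.
via_link=$(cat reads.fastq)
echo "content seen through the link: $via_link"

# Deleting the soft link does not touch the original file.
rm reads.fastq
original_after=$(cat "$long_name")
echo "original still contains:       $original_after"

# Return and clean up.
cd "$start_dir"
rm -r "$workdir"
```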


### Wild Cards

Certain characters allow you to address many files or directories simultaneously. You already know one similar abbreviation: `~` stands for your home directory.

`?` is exactly one keyboard character

`*` is zero, one, or many characters of any type

`[list of characters]` stands for exactly one character out of the list in square brackets

`[!list of characters]` stands for exactly one character that is not part of the list in square brackets

**Example:**


In [11]:
%%script bash
echo "** Creating three files that have a similar beginning."
touch bla blu bli
echo "** Confirming that these files exist:"
ls -l
echo "** Using wildcard to remove all files that start with bl."
rm bl*
echo "** Confirming that the operation worked:"
ls -l

** Creating three files that have a similar beginning.
** Confirming that these files exist:
total 716
-rw-r--r-- 1 38458 users      0 Jun 24 10:48 bla
-rw-r--r-- 1 38458 users      0 Jun 24 10:48 bli
-rw-r--r-- 1 38458 users      0 Jun 24 10:48 blu
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduction_to_bash.ipynb
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg
** Using wildcard to remove all files that start with bl.
** Confirming that the operation worked:
total 716
-rw-r--r-- 1 38458 users  35111 Jun 24 10:42 hello.jpg
-rw-r--r-- 1 38458 users  19195 Jun 24 10:46 introduction_to_bash.ipynb
-rw-r--r-- 1 38458 users   9008 Jun 24 10:42 play_button.jpg
-rw-r--r-- 1 38458 users  16164 Jun 24 10:42 terminal.jpg
-rw-r--r-- 1 38458 users 643642 Jun 24 10:42 tree.jpg
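
The remaining wildcards can be compared on a handful of empty files. The sketch below is an illustration only: the `sample...fa` file names are invented, and `ls ... | wc -l` (an assumption, `wc` is introduced later) is used to count how many file names each pattern matches.

```bash
workdir=$(mktemp -d)
touch "$workdir/sample1.fa" "$workdir/sample2.fa" "$workdir/sample3.fa" \
      "$workdir/sampleA.fa" "$workdir/sample10.fa"

# ? matches exactly one character, so sample10.fa is not matched.
one_char=$(ls "$workdir"/sample?.fa | wc -l)

# [12] matches exactly one character out of the list: sample1.fa, sample2.fa.
in_list=$(ls "$workdir"/sample[12].fa | wc -l)

# [!12] matches exactly one character not in the list: sample3.fa, sampleA.fa.
not_in_list=$(ls "$workdir"/sample[!12].fa | wc -l)

# * matches any number of characters: all five files.
any_chars=$(ls "$workdir"/sample*.fa | wc -l)

echo "sample?.fa:     $one_char files"
echo "sample[12].fa:  $in_list files"
echo "sample[!12].fa: $not_in_list files"
echo "sample*.fa:     $any_chars files"

# Clean up.
rm -r "$workdir"
```

Testing a pattern with `ls` before handing it to `rm` is a cheap way to avoid deleting more than intended.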


### Quoting

Bash automatically interprets certain special characters, so-called meta characters, with particular meanings. Sometimes, you don't want Bash to do this, but you want to address the actual character instead of the meaning. The mechanism for this is referred to as *quoting*. For example, you might not want to refer to your home directory with `~`, but you want to refer to the character `~`. You can e.g. use a backslash for the latter, i.e. `\~` (this is the character `~`), instead of `~` (this is the home directory).

Meta characters:

`; $ & ( ) < > [ ] { } ? * " ' ~ # \`

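The *backslash* protects the character that follows a backslash from interpretation by Bash. This is convenient to protect single characters, but not very convenient to protect longer character chains.

*Quotation marks* behave slightly differently from the backslash, but are more convenient for protecting longer character chains. Inside single quotes (`'...'`), every character is taken literally. Inside double quotes (`"..."`), most meta characters are protected, but `$` is still interpreted, so variables such as `$PWD` are replaced by their values.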

**Example:**

We will use an existing environment variable `$PWD` that stores your current working directory.

In [12]:
%%script bash
echo $PWD
echo \$PWD
echo "$PWD"
echo "\$PWD"
echo '$PWD'
echo 'something' $PWD

/home/jovyan/Bioinf_pract_lecture/lecture/01_bash
$PWD
/home/jovyan/Bioinf_pract_lecture/lecture/01_bash
$PWD
$PWD
something /home/jovyan/Bioinf_pract_lecture/lecture/01_bash
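
The same behavior can be reproduced with a variable of your own. This is a minimal sketch: the variable `sample` and the file name are hypothetical and exist only for this illustration.

```bash
# A hypothetical variable, defined only for this illustration.
sample="reads_01"

# Double quotes: $sample is still replaced by its value.
double_quoted="file: $sample.fastq"

# Single quotes: every character is taken literally.
single_quoted='file: $sample.fastq'

# Backslash: protects only the single character that follows it.
escaped="file: \$sample.fastq"

echo "$double_quoted"
echo "$single_quoted"
echo "$escaped"
```

Only the first `echo` prints `reads_01`; the other two print the literal text `$sample`.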


### Viewing File Contents

★ In bioinformatics, we are often dealing with huge files. Opening such files with a graphic user interface editor (e.g. Notepad on Windows, or gedit or pluma on Linux) will take very much time and sometimes even fail because the file content might exceed the available memory of the computer. Nevertheless, we need to look at file contents. Unix provides a number of solutions for this.

**Hint:** What happens on Linux if you try to put more into the memory of the computer than physically possible? Linux has the physical RAM and, in addition, a so-called swap partition. If something just *resides* in RAM and is not really needed at the moment, it is written to the swap space. The swap partition is simply space on your hard drive, which is usually either an HDD or an SSD. In both cases, access and writing times are substantially slower than those of RAM (your computer's memory). If you fill up the RAM, Linux begins to write everything that exceeds the RAM into swap. First of all, this slows down your system. Processes that actually require fast read/write access are now constrained by the access times of your hard drive. The system might even appear to be frozen (even though it probably isn't, unless you also exceeded the swap space). Second, reading and writing a lot on your hard drive, in particular on an HDD, generates heat. Your computer should be able to cope with that. But let this be a warning: it has happened that computers burned because the fan was unable to cool them sufficiently. Not only personal computers. One can fry a cluster node as well. Sad stories... so **never exceed the RAM of your machine!**

#### Interactive File Viewing

`less` offers the possibility to quickly view complete files. (Since you can view the entire file, opening a huge file in `less` may also take some time. But it is comparatively fast.)

**Example:**

Go to your terminal (do not try this in a Jupyter Notebook code block!) and enter:

`less Bash_basics_part_3/Bash_basics_part_3.ipynb`

This will open your JupyterNotebook file as a plain text file, i.e. you see the source code in `less`. 

Navigation in `less`:

 * arrow keys up and down - move in document line by line
 
 * `space` or `z` - move to the next page
 
 * `q` - exit
 
 * `/` - forward search
 
 * `?` - backward search
 
★ From my point of view as a bioinformatician, `less` is one of the most frequently used Bash tools.

#### Viewing the Beginning of a File

Often, seeing the beginning of a file instead of its complete contents is sufficient in order to see whether the file contents adhere to a certain format. This will be very fast, even for huge files.

★ In bioinformatics, the head of a file will often give us information about whether column headers are present or not, and if they are present, what columns in the file mean.

**Syntax:**

`head [options] file`

**Example:**

In [13]:
%%script bash
echo "** Checking that the file exists:"
ls -l introduction_to_bash.ipynb
echo "** Displaying head of file (by default 10 lines):"
head introduction_to_bash.ipynb

** Checking that the file exists:
-rw-r--r-- 1 38458 users 30930 Jun 24 10:48 introduction_to_bash.ipynb
** Displaying head of file (by default 10 lines):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basics of Bash for Bioinformatics\n",
    "\n",
    "by Katharina J. Hoff, giving credit to materials provided in German by Maike Tech (http://gobics.de/tech/linuxskript.pdf).\n",
    "\n",


The parameter `-n` is very useful to specify the number of lines that you want to see.

**Example:**

In [14]:
%%script bash
echo "** Displaying only the first 3 lines of head of file:"
head -n3 introduction_to_bash.ipynb

** Displaying only the first 3 lines of head of file:
{
 "cells": [
  {
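
For the bioinformatics use case mentioned above, `head -n 1` is enough to reveal the column headers of a table. The sketch below uses an invented tab-separated file with made-up values. `printf` (an assumption not covered in this course so far) writes the text, with `\t` for a tab and `\n` for a line break.

```bash
workdir=$(mktemp -d)

# Create a small tab-separated table with a header line (made-up values).
printf 'gene_id\tlength\tcoverage\ngene1\t900\t12\ngene2\t1500\t8\n' > "$workdir/genes.tsv"

# Look only at the first line to learn what the columns mean.
header=$(head -n 1 "$workdir/genes.tsv")
echo "$header"

# Clean up.
rm -r "$workdir"
```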


#### Viewing the End of a File

Sometimes, also seeing the ending of a file instead of its complete contents is sufficient in order to see whether the file contents adhere to a certain format. This will be very fast, even for huge files.

★ In bioinformatics, tools sometimes print the commands that were used to produce the file at the bottom of a file. `tail` helps us to figure out whether that's the case for a specific file.

**Syntax:**

`tail [options] file`

**Example:**

In [15]:
%%script bash
echo "** Displaying only the tail of file (by default 10 lines):"
tail introduction_to_bash.ipynb
echo "** Parameter -n works here, too:"
tail -n2 introduction_to_bash.ipynb

** Displaying only the tail of file (by default 10 lines):
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
** Parameter -n works here, too:
 "nbformat_minor": 4
}
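
A file that records the producing command on its last line can be imitated as follows. This is only a sketch: the file content and the tool name `mytool` are hypothetical, and `printf` is an assumption used to write several lines at once.

```bash
workdir=$(mktemp -d)

# A made-up result file whose last line documents how it was produced.
printf 'result line 1\nresult line 2\n# command: mytool --input reads.fa\n' > "$workdir/result.txt"

# Show only the last line of the file.
last_line=$(tail -n 1 "$workdir/result.txt")
echo "$last_line"

# Clean up.
rm -r "$workdir"
```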


#### Concatenating Files

`cat` is a tool that by default will print all contents of one or several files to STDOUT. STDOUT is often your terminal window. (Hint: try the `cat` example in the terminal instead of JupyterNotebook...)
We can redirect contents from STDOUT to files. 

★ In bioinformatics, `cat` can aid us in combining many files into one big file through concatenation. For example, we can concatenate different files with genomic sequence reads into one large file that will be required to execute some software, e.g. genome assembly software.

**Syntax:**

`cat file1 file2 ...`

**Example:**

In [17]:
%%script bash
echo "** Checking file size of original:"
ls -l introduction_to_bash.ipynb
echo "** Concatenating the file twice into a different file:"
cat introduction_to_bash.ipynb introduction_to_bash.ipynb > test.out
echo "** Checking that the output exists and is twice as large as a single input file:"
ls -l test.out
rm test.out

** Checking file size of original:
-rw-r--r-- 1 38458 users 37252 Jun 24 10:51 introduction_to_bash.ipynb
** Concatenating the file twice into a different file:
** Checking that the output exists and is twice as large as a single input file:
-rw-r--r-- 1 38458 users 74504 Jun 24 10:53 test.out
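
The sequence-file use case looks much the same. The following sketch concatenates two tiny, made-up FASTA files; the file names are hypothetical, and `printf` as well as the input redirection `<` (feed a file to a command) are assumptions beyond what this section introduces.

```bash
workdir=$(mktemp -d)

# Two tiny FASTA files with one made-up sequence each.
printf '>seq1\nACGT\n' > "$workdir/part1.fa"
printf '>seq2\nTTGA\n' > "$workdir/part2.fa"

# Concatenate both files, in the given order, into one new file.
cat "$workdir/part1.fa" "$workdir/part2.fa" > "$workdir/all.fa"

# Inspect the result and count its lines.
combined=$(cat "$workdir/all.fa")
combined_lines=$(wc -l < "$workdir/all.fa")
echo "$combined"
echo "number of lines: $combined_lines"

# Clean up.
rm -r "$workdir"
```

Note that `>` overwrites an existing output file, so the output name should differ from all input names.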


#### Counting words

`wc` (word count) allows us to count the number of lines, words and bytes in a file.

★ In bioinformatics, `wc` is often used to count number of lines of large sequence files. There is a useful option `-l` for this purpose!

**Syntax:**

`wc [options] file`

**Example:**

In [18]:
%%script bash
echo "** Checking that the file exists:"
ls -l introduction_to_bash.ipynb
echo "** Counting lines, words, bytes:"
wc introduction_to_bash.ipynb
echo "** Counting only lines:"
wc -l introduction_to_bash.ipynb

** Checking that the file exists:
-rw-r--r-- 1 38458 users 39934 Jun 24 10:53 introduction_to_bash.ipynb
** Counting lines, words, bytes:
  969  5461 39934 introduction_to_bash.ipynb
** Counting only lines:
969 introduction_to_bash.ipynb
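
One typical application is estimating the number of reads in a FASTQ file. The sketch below assumes the common FASTQ layout with exactly four lines per read (this does not hold for files with wrapped sequence lines). The reads are made up, and `$(( ... ))` is Bash's arithmetic syntax, an assumption not covered in this course so far.

```bash
workdir=$(mktemp -d)

# A tiny FASTQ file with two made-up reads (four lines per read).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$workdir/reads.fastq"

# Count the lines; "<" feeds the file to wc, so no file name is printed.
line_count=$(wc -l < "$workdir/reads.fastq")

# With four lines per read, the number of reads is the line count divided by 4.
read_count=$(( line_count / 4 ))
echo "lines: $line_count, reads: $read_count"

# Clean up.
rm -r "$workdir"
```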


### Handling Limited Resources of your Computer

Your computer, or any kind of cloud computing facility, has hardware limits: hard drives/file servers have a certain size, and RAM has a certain size. Be careful not to exceed either resource:

 
 * The hard drive of your operating system is maxed out: the operating system can no longer write the many small files that it needs while running. Your system gets "stuck" and cannot easily be rebooted once the hard drive is full. (If it's your own machine: keep the OS partition and data partitions as separate as possible.)

 * RAM is maxed out (Unix): things that should go into fast read/write access are read from/written to harddrive, e.g. an HDD. This is much slower, and it produces high temperatures. If the fan of your machine fails you, you might even set the machine on fire!

★ In Bioinformatics, we often have to process large files that produce large output files. Many tools require large RAM (e.g. assembly tools). Knowing how to monitor your system and how to prevent serious problems is therefore essential.

#### Displaying Available Harddrive Space

 1. `du` shows the size of a *directory*,
 2. `df` shows the available space on all mounted storage units.
 
Both tools have an option `-h` that is very useful for displaying large sizes in a human-readable format instead of bytes.
 
**Example (please test this in your terminal, it takes a long time in the notebook):**

![du.jpg](du.jpg)

![df.jpg](df.jpg)

In the example above, the storage capacity of the drive that we are commonly writing into on this machine is shown in the line that contains `/home/jovyan/`.
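
A quick variant that runs fast even in the notebook is to restrict both tools to a small target. This is a hedged sketch: the option `-s` of `du` (one summary line instead of one line per subdirectory) and the argument `.` for `df` (only the file system holding the current directory) are additions not discussed above, and the printed numbers depend on your machine.

```bash
workdir=$(mktemp -d)
printf 'ACGT\n' > "$workdir/seq.txt"

# -s: one summary line for the whole directory; -h: human-readable units.
dir_usage=$(du -sh "$workdir")
echo "$dir_usage"

# Used and free space of the file system that holds the current directory.
fs_usage=$(df -h .)
echo "$fs_usage"

# Clean up.
rm -r "$workdir"
```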

#### Minimizing Used Harddrive Space

★ In Bioinformatics, it is often key to store only those original files that are required to reproduce results of a certain analysis pipeline.

Many file compression standards exist. Among Windows users, `zip` is very popular. On Unix, `gzip` is commonly used. `gzip` often shrinks text files considerably; how much depends on the content (in the example below, a 44K notebook file shrinks to 12K). A nice thing about gzipped files is that you can access the contents without actually extracting the file (e.g. there is a tool `zcat` that works the same way as `cat`, except that it is made for displaying files that have been gzipped instead of plain text files). You can extract gzipped files with `gunzip`.

**Syntax:**

`gzip [options] file`

`gunzip [options] file`

**Example:**

%%script bash
echo "** Checking that the file exists:"
ls -lh introduction_to_bash.ipynb
echo "** Making a copy of the file because we do not want to compress our active notebook (copy has the same size as original)."
cp introduction_to_bash.ipynb test.ipynb
echo "** Compressing the copied file."
gzip test.ipynb
echo "** Confirm that size has been reduced compared to original size:"
ls -lh test.ipynb.gz introduction_to_bash.ipynb
echo "** Extracting the file (gunzip replaces the compressed file with the extracted one):"
gunzip test.ipynb.gz
echo "** Cleaning up."
rm test.ipynb

** Checking that the file exists:
-rw-r--r-- 1 38458 users 44K Jun 24 10:55 introduction_to_bash.ipynb
** Making a copy of the file because we do not want to compress our active notebook (copy has the same size as original).
** Compressing the copied file.
** Confirm that size has been reduced compared to original size:
-rw-r--r-- 1 38458 users 44K Jun 24 10:55 introduction_to_bash.ipynb
-rw-r--r-- 1 38458 users 12K Jun 24 10:56 test.ipynb.gz
** Extracting the file (gunzip replaces the compressed file with the extracted one):
** Cleaning up.
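
The remark about `zcat` can be tried out with a minimal sketch like the following. The FASTA file `reads.fa` and its two sequences are made-up placeholder data, and the scratch directory is created with `mktemp` so that no real files are touched.

```shell
# Create a small FASTA file (placeholder data) in a scratch directory.
workdir=$(mktemp -d)
printf '>seq1\nACGTACGT\n>seq2\nTTGACCA\n' > "$workdir/reads.fa"

# Compress it; gzip replaces reads.fa with reads.fa.gz.
gzip "$workdir/reads.fa"

# Look at the contents without extracting the file.
# (If zcat behaves differently on your system, `gzip -dc` does the same job.)
zcat "$workdir/reads.fa.gz"

# Count the sequences (header lines start with ">") directly from the compressed file.
n_seqs=$(zcat "$workdir/reads.fa.gz" | grep -c '^>')
echo "Number of sequences: $n_seqs"

# Only the compressed file exists; no extracted copy was written to disk.
remaining=$(ls "$workdir")
echo "Files in scratch directory: $remaining"

# Clean up.
rm -r "$workdir"
```

★ Piping the output of `zcat` into other tools is a common way of working with large compressed sequence files without ever storing the uncompressed version.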


Often, we don't want to compress a single file, but rather a collection of files. For this, Unix has `tar` (*tape archiver*). 

**Syntax:**

`tar -cf archive directory` for packing an archive,

`tar -xf archive` for unpacking an archive.

The file ending of a tarball is `.tar`.

By default, `tar` does not compress the archive. To compress it, add the option `-z` (for *zip*), which makes `tar` write a gzipped archive when packing and read a gzipped archive when unpacking:

`tar -czf compressed_archive directory` for packing an archive,

`tar -xzf compressed_archive` for unpacking an archive.

The file ending of compressed tarballs is either `.tar.gz` or `.tgz` (which is an abbreviation for `tar.gz`).
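
Before unpacking a tarball that you received from someone else, it can help to look inside first. The following sketch uses the option `-t` (list contents), which is a standard `tar` option that is not covered above, and the option `-C` (change into a directory before packing/unpacking). The directory `project` and its two files are hypothetical placeholders.

```shell
# Build a small directory with two files (placeholder data) in a scratch location.
workdir=$(mktemp -d)
mkdir "$workdir/project"
echo "ACGT" > "$workdir/project/a.txt"
echo "TTGA" > "$workdir/project/b.txt"

# Pack and compress the directory; -C makes tar change into $workdir first,
# so the archive contains the relative path "project/..." only.
tar -czf "$workdir/project.tar.gz" -C "$workdir" project

# List the contents of the compressed tarball without unpacking it.
listing=$(tar -tzf "$workdir/project.tar.gz")
echo "$listing"

# Unpack into a separate directory and check that a file was restored.
mkdir "$workdir/unpacked"
tar -xzf "$workdir/project.tar.gz" -C "$workdir/unpacked"
restored=$(cat "$workdir/unpacked/project/a.txt")
echo "Restored content of a.txt: $restored"

# Clean up.
rm -r "$workdir"
```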

**Example:**


%%script bash
echo "** Changing one directory upwards."
cd ../
echo "** Confirming that directory 01_bash exists:"
ls
echo "** Making a recursive copy of the directory file because we do not want to pack our active notebook (copy has the same size as original)."
cp -r 01_bash test
echo "** Packing the directory including contents into a tarball (the original directory remains, which is often very handy)."
tar -cf test.tar test
echo "** Checking size of resulting tarball:"
ls -lh test.tar
echo "** Removing the original directory for demo purposes (if it remained, you could not see that unpacking was successful)."
rm -r test
ls -lh
echo "** Unpacking the uncompressed tarball:"
tar -xf test.tar
echo "** Confirming that unpacking has happened:"
ls -lh
echo "** Now creating a compressed tarball."
tar -czf test.tar.gz test
echo "** Now unpacking a compressed tarball."
tar -xzf test.tar.gz
echo "** Cleaning up."
rm -r test test.tar test.tar.gz

** Changing one directory upwards.
** Confirming that directory 01_bash exists:
01_bash
02_introduction_to_python
03_variables
04_programming_essentials
** Making a recursive copy of the directory file because we do not want to pack our active notebook (copy has the same size as original).
** Packing the directory including contents into a tarball (the original directory remains, which is often very handy).
** Checking size of resulting tarball:
-rw-r--r-- 1 38458 users 1.2M Jun 24 10:57 test.tar
** Removing the original directory for demo purposes (if it remained, you could not see that unpacking was successful).
total 1.2M
drwxr-sr-x 3 38458 users 4.0K Jun 24 10:57 01_bash
drwxr-sr-x 2 38458 users 4.0K Jun 24 08:49 02_introduction_to_python
drwxr-sr-x 3 38458 users 4.0K Jun 24 09:03 03_variables
drwxr-sr-x 3 38458 users 4.0K Jun 24 09:05 04_programming_essentials
-rw-r--r-- 1 38458 users 1.2M Jun 24 10:57 test.tar
** Unpacking the uncompressed tarball:
** Confirming that unpacking ha

If you receive a compressed file that was created on Windows, it is often a `zip` file. You can easily extract such files on Unix systems with the command `unzip file.zip`.

#### Process Management

Every process that is running on your Unix system has an ID, the *Process ID*, or PID for short. Knowing the PIDs of the processes that matter to you is very helpful: with the PID you can, for example, kill your own processes that have gone wild (e.g. in terms of RAM consumption or in terms of filling your hard drives).

##### Foreground Processes

We call a process a *foreground* process if it is running in your terminal and blocks the terminal for as long as it runs.

**Example (hold off trying this in your terminal until you have learned how to kill processes!):**

![foreground.jpg](foreground.jpg)

In this example, a never-ending loop is occupying the terminal window. No other commands can be entered. (You can kill the process with `CTRL + c`.)

##### Background Processes

*Background* processes are started in your terminal and continue running, but you are free to enter other commands into your terminal in the meantime. To send your process to the background, append the operator `&` to the command before hitting Enter. You will then also see the PID of the process that was sent to the background. Remember such PIDs in order to be able to kill the process if required. Important: a background process remains attached to your terminal! This means that if you close the terminal, the background process is normally terminated as well. If you later need to work with processes independently of particular terminal sessions, please check out `nohup` for decoupling your process from the current terminal. Also, read about `screen`, which will not decouple your process from the terminal, but keeps terminal sessions open even if you lose the network connection to the machine that is running your terminal.


**Example (hold off trying this in your terminal until you have learned how to kill processes!):**

![background.jpg](background.jpg)

In the above example, the loop that prints "Hello world" is sent to the background, and we are able to see the PID 147. (This command will continue to print to STDOUT, i.e. your terminal, but you can always press Enter to return to a clear command prompt that lets you start additional commands while the loop still runs in the background.)
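
A harmless variant of this example, which ends on its own after two seconds and therefore needs no `kill`, might look like the sketch below. It relies on the special variable `$!` and the built-in commands `jobs` and `wait`, which are not introduced in the text above; the variable names are arbitrary.

```shell
# Start a job in the background; the shell does not wait for it to finish.
sleep 2 &

# $! holds the PID of the most recently started background job.
bg_pid=$!
echo "Background job started with PID $bg_pid"

# The shell is free for other commands while the job is running.
echo "This line is printed while sleep is still running."

# jobs lists the background jobs of the current shell.
jobs

# wait blocks until the given job has finished and returns its exit status.
wait "$bg_pid"
bg_status=$?
echo "Background job finished with exit status $bg_status"
```

Storing `$!` in a variable right after starting the job saves you from having to remember the PID that is printed on screen.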

If you forgot to send your process to the background, you can move an already running foreground process to the background as follows:

 1.) start the process as foreground process
 
 2.) press `CTRL + z` to put the process on hold
 
 3.) type `bg` into your terminal to send the process to background (it will resume there)
 
You can retrieve processes from background into foreground by typing `fg`.

**Example (hold off trying this in your terminal until you have learned how to kill processes!):**

![fg.jpg](fg.jpg)

#### Ending Processes

One way to kill a foreground process is to press `CTRL + c`. If the process is *friendly*, it will then terminate. This approach may also fail (e.g. if you have maxed out your RAM).

The command `kill` offers you the option to kill processes by their PID.

**Syntax:**

`kill [signal] PID`

**Example:**

![kill.jpg](kill.jpg)

When testing this example yourself, you will of course have to use the individual PID of your job, and not the number 147.

The default signal of `kill` is `-15` (SIGTERM), i.e. `kill` sends signal 15 even if you don't specify a signal. This signal politely asks the process to terminate itself, and the process will often do you that favor. This is similar to pressing `CTRL + c`, and it works in many cases.

Sometimes, a default `kill` will fail to end your process. For such cases, the signal **`-9`** (SIGKILL) is very useful: the process cannot catch or ignore it, so the operating system ends the process immediately, without giving it a chance to clean up. The process is forced to die. Keep this in mind!

**Example:**

`kill -9 147`
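
The usual pattern of first asking politely and only then forcing the process to die can be sketched as follows. The stubborn test process is artificial: it is built with `trap "" TERM` so that it ignores signal 15. `kill -0` (check whether a PID exists without sending a signal), `$!`, and `wait` are standard shell features that are not covered above, and the variable names are arbitrary.

```shell
# Start an artificial, stubborn background process that ignores signal 15 (SIGTERM).
# exec replaces the helper shell with sleep, so only one process needs to be killed.
bash -c 'trap "" TERM; exec sleep 60' &
pid=$!
sleep 1    # give the process a moment to install its trap

# Step 1: polite request (same as kill -15). The stubborn process ignores it.
kill "$pid"
sleep 1

# kill -0 sends no signal at all; it only reports whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    survived_term=yes
else
    survived_term=no
fi
echo "Process survived the default kill: $survived_term"

# Step 2: force the process to die.
kill -9 "$pid"

# wait collects the exit status; for a process ended by a signal it is
# 128 + signal number, i.e. 128 + 9 = 137 here.
kill_status=0
wait "$pid" 2>/dev/null || kill_status=$?
echo "Exit status after kill -9: $kill_status"
```

Because a process that receives `-9` cannot clean up (e.g. remove temporary files), it is a good habit to try the default signal first.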

#### Watching Processes and Monitoring Resources

The command `ps` (*process status*) shows the current processes, their status, and some more information.

**Example:**

![ps.jpg](ps.jpg)

Columns:

 * `PID` the process ID
 
 * `TTY` *teletype* shows which terminal started the process (you may have several terminals running simultaneously, also in JupyterHub)
 
 * `TIME` the CPU time that the process has consumed so far (not the wall-clock time that has elapsed since the process started)
 
 * `CMD` the command that started this process (often a tool name)
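
If you only need some of these columns, or only one particular process, `ps` can be narrowed down. The sketch below uses the standard options `-p` (select by PID) and `-o` (choose output columns), which are not discussed above, and inspects the current shell itself via the special variable `$$`.

```shell
# $$ holds the PID of the current shell.
echo "This shell has PID $$"

# -p restricts the output to the given PID; -o selects the columns to print.
ps -o pid,tty,time,comm -p "$$"

# A trailing "=" after a column name suppresses the header line,
# which is handy when the value is stored in a variable.
shell_name=$(ps -o comm= -p "$$")
echo "PID $$ runs the command: $shell_name"
```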
 
The tool `top` shows running processes *and their resource consumption*. `top` will run as a foreground process. You can exit the `top` table display by pressing `q`.

★ In bioinformatics, `top` is a highly valuable tool for monitoring jobs that process large amounts of data. For example, the output table gives you the PID of a job that you are looking for, and you can easily see whether RAM is close to being exhausted and whether the CPU is used efficiently.

Start by typing

`top`

in your terminal.

![top.jpg](top.jpg)

In the above image, we see the following columns:

 * `PID` - process ID
 
 * `USER` - owner of a process
 
 * `PR` - priority of a process
 
 * `NI` - nice value of the process (this value changes job priority)
 
 * `S` - status of process (`D` is an uninterruptible sleep, `R` is running, `S` is sleeping, `T` is traced or stopped, `Z` is a zombie)
 
 * `%CPU` - proportion of CPU usage (in %) that a job occupies; be aware that if you have more than one core, the percentage may exceed 100% (e.g. 4 cores -> increase up to 400%)
 
 * `%MEM` - proportion of RAM (in %) that a process occupies
 
 * `TIME+` - total CPU time that the job has consumed until now (this is not the wall-clock runtime)
 
 * `COMMAND` - name of the command/tool
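
To capture the same table as plain text, for instance to save it in a log file, `top` can be run non-interactively. The sketch below assumes the `top` version from the procps package that is found on most Linux systems; its options `-b` (batch mode) and `-n 1` (a single snapshot) are not discussed above and may differ on other systems such as macOS.

```shell
# -b: batch mode (plain text instead of the interactive table)
# -n 1: print one snapshot and exit
snapshot=$(top -b -n 1)

# Show the summary area and the first rows of the process table.
echo "$snapshot" | head -n 15

# Extract the header line of the process table to see the column names.
header=$(echo "$snapshot" | grep 'PID' | head -n 1)
echo "Column header: $header"
```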
 
The process priority can be changed. This is particularly important to know if you later work on a multi-user system. It's less important in the JupyterHub environment where you are typically the only user. When needed, please read up on the tool `nice`.
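
As a first impression of `nice`, here is a minimal sketch. It assumes the standard behavior of `nice`: called without arguments it prints the current nice value, and `nice -n 10 COMMAND` starts `COMMAND` with a nice value that is raised by 10, which means a lower priority. The tool name `my_assembler` in the last comment is purely hypothetical.

```shell
# Without arguments, nice prints the nice value of the current shell (often 0).
base_nice=$(nice)
echo "Current nice value: $base_nice"

# Start a command with a nice value raised by 10, i.e. with lower priority.
# Here the command is nice itself, which simply reports the value it runs with.
lowered_nice=$(nice -n 10 nice)
echo "Nice value inside 'nice -n 10': $lowered_nice"

# A long-running job would be started the same way, for example:
#   nice -n 10 my_assembler input.fa    (hypothetical tool and file name)
```

A higher nice value makes a job more "polite" towards other users: it gets less CPU time when the machine is busy, but it still runs at full speed on an idle machine.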