---
title: "ABC.4: Introduction to bash language for bioinformatics"
author: "Samuele Soraggi, Manuel Peral Vazquez"
image: ./2024-09-03-ABC4/bash.png
date: 2024-09-03
categories: [bash, command line]
description: "Slides and bash intro at the ABC.4"
eval: true
---

# Introduction

The difficulty of learning bash is often underestimated: people approaching bioinformatics are expected to pick it up automatically. Here we put together the first basic concepts and commands.

:::{.callout-note title="Why the bash command line?"}

Using the bash command line quickly becomes essential if you are doing bioinformatics.

First of all, you might need it to **access a computing cluster** (for example, GenomeDK at Aarhus University), since most clusters run on a [UNIX-based operating system](https://www.hpc.iastate.edu/guides/unix-introduction), such as Linux, using a bash command line.

Just as important, on a command line you can very easily **do operations on multiple and very large files**, something that is much harder to do using, for example, `R` or `python`. Large sequences of operations can be automated into **pipelines** (an advanced topic not covered in this tutorial).

With a command line you can **run many small programs, compose them together, and organize them in a chain of commands**. This type of program organization fits well with what a bioinformatics project consists of: many tools to be applied repeatedly on multiple large files, and organized in a specific sequence. An example could be aligning many raw bulk RNA sequencing files to a reference genome: the alignment must be repeated many times, and once the files are finished, they might need to be merged if they come from the same sample.

:::
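As a small illustration of chaining programs, the sketch below sends a list of sample names through `sort` and `uniq -c` to count how many files belong to each sample. The sample names are invented for this example; the `|` symbol passes the output of one program to the next one.

```{.bash}
# Hypothetical list of sample names, one per sequencing file
# (the names are invented for illustration)
samples=$(printf 'sampleB\nsampleA\nsampleB\nsampleA\nsampleB\n')

# sort puts identical names next to each other, uniq -c counts each group
counts=$(printf '%s\n' "$samples" | sort | uniq -c)
printf '%s\n' "$counts"
```

Each program does one small job, and the chain as a whole answers the question "how many files per sample?".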

## Some terminology

When using a UNIX operating system (Linux, macOS), everything on your computer fits one of two categories: **processes and files**. Processes are running instances of a program, and a program is any executable file stored on your computer. A file is any collection of data (program, image, video, audio, ...).

Whenever we write a command on the terminal and press enter, a shell takes the code we wrote and sends it to the kernel. The **shell is the outer layer of the operating system**, which facilitates the communication with the kernel. The **kernel is the core of the operating system**, managing the computer's physical components (hardware) and interfacing them with the processes that need to run. In general, any program you open (browser, game, ...) or action you do on your computer (moving files, renaming folders, ...) ends up being a process managed by the kernel. This communication process is shown in @fig-shellkernel.

![Communication scheme where the outer layer is a bash shell command, which the shell then communicates to the kernel, which in turn manages the hardware resources to make the program actually run. Note that there can be **many languages for the UNIX shell**: bash is the most popular, but others exist and are used (for example *zsh* on macOS). *Figure credit: InnoKrea.*](./2024-09-03-ABC4/shellkernel.png){#fig-shellkernel}
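The shell you type into is itself a process managed by the kernel. As a minimal sketch, the commands below print the process number of the current shell and the default shell of your user. The variable name `shell_pid` is chosen only for this example, and `SHELL` may be unset in some environments.

```{.bash}
# $$ holds the process ID (PID) of the shell that is reading these commands
shell_pid=$$
echo "This shell is running as process number $shell_pid"

# SHELL usually holds the path of your default shell (it may be unset)
echo "Default shell: ${SHELL:-not set}"
```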

## Efficiency and Speed

We can roughly compare the command line, typical languages like R and python, and bash pipelines in terms of efficiency, manual work, speed, and the number and size of files they handle:


| Programming mode | Number of files | File size | Operational speed | Manual work |
| ---------------- | --------------- | --------------- | ----------- | -------- |
| R, python, ...   | from 1 to 10s   | small | slow | a lot |
| Command line     | from 1 to 100s  | 1-10s of GB | fast | low to moderate |
| Unix pipeline    | from 1 to many 1000s | many TB | fast | low (repeated operations are written only once) |

You will see in this tutorial how we can use basic bash utilities to handle text files. Those files would take longer to read in R and python, and the code to modify them would in general be longer and slower.


# Slides

Our slides introducing the bash shell

&nbsp;

 <p align="center">
  <a href="https://abc.au.dk/documentation/slides/20240903-ABC.4.zip" style="background-color: #4266A1; color: #FFFFFF; padding: 30px 20px; text-decoration: none; border-radius: 5px;">
    Download Slides
  </a>
</p>

&nbsp;

# Tutorial

Here starts the tutorial. There is only one technical prerequisite: you need a **Terminal**. The terminal is where you write your commands, which are then **interpreted** and sent to the computer to be executed.

- Mac and Linux computers already have a program called `Terminal` installed (both have UNIX-based operating systems)

- Windows has a different sort of terminal called PowerShell, which is not UNIX-based. Please install `MobaXTerm`, then open it and click **Start Local Terminal**.

## Terminal and folders anatomy

When you work in the terminal, you will always see a prompt that looks something like this

![](./2024-09-03-ABC4/shell.png)

which provides you

- username (e.g. samuele)
- computer name (e.g. D55749)
- current working directory where the user is working at the moment (e.g. `~`, which is the short form for the home directory)

In MobaXTerm you see only the date, the time, and the current working directory.
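The three pieces of information in the prompt can also be printed by hand. This is only a sketch: the variable names are chosen for this example, and `id -un` and `uname -n` are standard utilities that report the username and the computer name.

```{.bash}
# Rebuild the three parts of the prompt by hand
user_name=$(id -un)         # username
computer_name=$(uname -n)   # computer (host) name
current_dir=$(pwd)          # current working directory

echo "${user_name}@${computer_name}:${current_dir}"
```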

### Home directory ~

The home directory, which can be written as `~`, is usually of the form `/home/username` (on macOS, `/Users/username`), and is normally private to the user (other users of that computer cannot access it).

When you open the terminal, your current working directory is always your home. Try to write and execute (by pressing enter) the command

```{.bash}
pwd
```

and you will see the full path of your home (the sequence of folders leading to your home directory).
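To see that `~` is only a short form, you can compare it with the `HOME` variable, where the shell stores the path of your home directory. A minimal sketch (the variable name `home_from_tilde` is made up for this example):

```{.bash}
# The shell expands ~ to the path stored in the HOME variable
home_from_tilde=~
echo "$home_from_tilde"
echo "$HOME"
```

Both lines should print the same path that `pwd` shows right after opening the terminal.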

### Current working directory (cwd)

Every command you execute refers to your cwd. For example, write

```{.bash}
ls
```

and you will see the list of files in your cwd. Try to create an empty file now with

```{.bash}
touch emptyFile.txt
```

and create a folder, which will be inside our cwd:

```{.bash}
mkdir myFolder
```

If you use the `ls` command again, the new file and folder will show up in the cwd.
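If you prefer to try these commands without touching your home directory, the sketch below repeats them in a throwaway folder created with `mktemp -d`. The `-F` option of `ls` appends a `/` to folder names, so files and folders are easy to tell apart. The variable names `workdir` and `listing` are chosen only for this example.

```{.bash}
# Work in a throwaway directory so nothing in your home is touched
workdir=$(mktemp -d)
cd "$workdir"

touch emptyFile.txt   # create an empty file
mkdir myFolder        # create a folder

# -F marks folders with a trailing /
listing=$(ls -F)
echo "$listing"
```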

Now we want to download something from the internet, for which we have a download link. We are getting a raw sequenced dataset in `fastq` format, which is currently `gzip`-compressed. The `wget` command can be used for the download. Note that we now also add the option `-O` to provide the output file name as `./myFolder/data.fastq.gz`, where the dot `.` stands for the cwd, followed by `myFolder`, followed by the file name.

```{.bash}
wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz -O ./myFolder/data.fastq.gz
```

:::{.callout-warning}

Not all utilities are installed in MobaXTerm. If you get an error, install `wget` with the command

```{.bash}
apt-get -y install wget
```

:::
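Before downloading, you can check which download tool is available on your system. The sketch below uses `command -v`, which succeeds only if a program is installed; nothing is downloaded. The variable name `downloader` is an assumption made for this example.

```{.bash}
# Check which download tool is installed, without downloading anything
if command -v wget >/dev/null 2>&1; then
    downloader="wget"
elif command -v curl >/dev/null 2>&1; then
    downloader="curl"
else
    downloader="none"
fi
echo "Available download tool: $downloader"
```

If only `curl` is available, a similar download can usually be done with `curl -L -o ./myFolder/data.fastq.gz` followed by the link, where `-o` sets the output file name and `-L` follows redirections.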

### Paths and navigating the directory tree

We have already been using directories and paths a lot, so it is time to polish some definitions. Files are organized in a directory tree which, restricted to the tutorial we are running, looks like @fig-dirtree. Here `home` is a root folder, that is, one of the top-level folders of your computer, so it has `/` at the beginning of its name. `/home` contains a folder with your username, which is right now your cwd. Inside your cwd you have the empty file and a folder containing the data. All those folders and files are organized in a tree hierarchy, so that `/home` is the first level, `username` the second, and `myFolder` the third. The path in the directory tree to `data.fastq.gz` is expressed as `/home/username/myFolder/data.fastq.gz`.

![Directory tree of the tutorial. The home folder is a root folder (top-level of the tree) so its name starts with the `/` symbol. Other folders and files are at subsequent branch levels of the root folder `/home`. Some folders specific to your computer might be missing from this scheme.](./2024-09-03-ABC4/treeStructure.png){#fig-dirtree width=400}
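The same tree can be printed in the terminal. The sketch below rebuilds the tutorial layout inside a temporary folder, which stands in for `/home/username`, and lists every path below it with `find`. The files are empty placeholders and the variable names are chosen for this example.

```{.bash}
# Recreate the tutorial layout inside a temporary folder
# (a stand-in for /home/username)
fake_home=$(mktemp -d)
mkdir "$fake_home/myFolder"
touch "$fake_home/emptyFile.txt" "$fake_home/myFolder/data.fastq.gz"   # empty placeholders

# find walks the whole tree below the given folder; sort keeps the order stable
cd "$fake_home"
tree_listing=$(find . | sort)
echo "$tree_listing"
```

Each printed line is a path, and every `/` in it marks one step down in the tree.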

#### Absolute and relative path

The path `/home/username/myFolder/data.fastq.gz` is called **absolute** because it is independent of your cwd. Let's try another absolute path: the root folder `/usr` contains the executable files of the bash utilities we are using in this tutorial. Such files are in the folder `bin`. Execute

```{.bash}

ls /usr/bin/

```

The output is a long list of executable files and some folders (sometimes they also have different colors, depending on your terminal settings). If you scroll through it, you can find familiar names like `wget`, `ls`, and so on. This path works regardless of our cwd.

On the contrary, the path `./myFolder/data.fastq.gz` depends on the cwd, and is equivalent to writing `myFolder/data.fastq.gz`, because `./` is always included by default. **All paths that do not start with a root folder are relative!** So to look inside `myFolder`, we can write either

```{.bash}

ls myFolder/

```

and 

```{.bash}

ls ~/myFolder/

```

where we have used `~`, which is the short form for `/home/username`, so the second command works from any cwd.
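The difference becomes visible as soon as the cwd changes. In the sketch below, a temporary folder stands in for your home directory, and the helper `check` (a name made up for this example) reports whether a path can be listed. The same relative path is tried from two different working directories, followed by the absolute path.

```{.bash}
base=$(mktemp -d)                      # temporary stand-in for your home directory
mkdir "$base/myFolder"
touch "$base/myFolder/data.fastq.gz"   # empty placeholder, not real data

# Helper (hypothetical name): report whether a path can be listed
check() { if ls "$1" >/dev/null 2>&1; then echo "found"; else echo "not found"; fi; }

cd "$base"
from_base=$(check myFolder/)          # relative path, resolved from $base

cd myFolder
from_inside=$(check myFolder/)        # same relative path, now means myFolder/myFolder
absolute=$(check "$base/myFolder/")   # absolute path, same result from anywhere

echo "relative, from base:     $from_base"
echo "relative, from myFolder: $from_inside"
echo "absolute, from myFolder: $absolute"
```

The relative path is found only from the first cwd, while the absolute path is found from both.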

### Navigating folders

How do you change your cwd? Simply use the command `cd` (*change directory*). For example, you might want to work inside `myFolder`. Simply write

```{.bash}

cd myFolder

```

and verify with `pwd` the new working directory path. If we want to unzip the compressed data file, we simply use its relative path:

```{.bash}
gunzip data.fastq.gz
```
 
Use `ls` to verify that you have a file with name `data.fastq`.
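The whole round trip can be rehearsed on a toy file. The sketch below works in a temporary folder and uses an invented four-line FASTQ record instead of the downloaded data. It also uses `cd ..`, where `..` stands for the parent folder of the cwd, to go back up one level.

```{.bash}
workdir=$(mktemp -d)
mkdir "$workdir/myFolder"

# A made-up single FASTQ record (4 lines), compressed like the downloaded file
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$workdir/myFolder/data.fastq.gz"

cd "$workdir/myFolder"   # move into the folder
gunzip data.fastq.gz     # replaces data.fastq.gz with data.fastq
ls

cd ..                    # .. is the parent folder: go back up one level
pwd
```

Note that `gunzip` removes the compressed file and leaves only the uncompressed one.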



#### Question 1

<form id="quizForm">
    Which of these commands is correct?<br><br>
    <label>
      <input type="radio" name="q1" value="Ja"> Yes
    </label> <br>
    <label>
      <input type="radio" name="q1" value="Nein"> No
    </label> <br>
    <label>
      <input type="radio" name="q1" value="Keine Antwort möglich"> No answer possible
    </label> <br>
    <button type="button" onclick="submitQuiz()">Answer</button>
</form>

  <script>
  function submitQuiz() {
      var selectedOption = getSelectedOption("q1");
      var correctAnswer = "Ja";

      // Display feedback
      if (selectedOption === correctAnswer) {
        alert("Correct!");
      } else {
        alert("Wrong. The correct answer is *Yes*.");
      }
    }

    function getSelectedOption(questionName) {
      var radioButtons = document.getElementsByName(questionName);
      for (var i = 0; i < radioButtons.length; i++) {
        if (radioButtons[i].checked) {
          return radioButtons[i].value;
        }
      }
      return null;
    }
 </script>