Skip to content

Latest commit

 

History

History
897 lines (623 loc) · 22.3 KB

introduction-unix-shell.md

File metadata and controls

897 lines (623 loc) · 22.3 KB

Before the course

Prerequisites:

  • you have used a computer before and you are know what are "files" and "directories" (also called "folders") on a computer

After the course

You will know how to:

  • connect to a remote server
  • navigate around files and folders from the command line
  • create, copy, move and delete files and folders
  • run programs from the command line
  • combine commands to perform complex operations
  • create scripts to automate tasks

Introducing the shell

What is the shell?

In brief:

  • Command-line interface, looks old-fashioned but very convenient
  • Main interface when you want to login to remote servers (e.g. CSC servers, workstation)
  • Also present in Linux distributions for personal computers and Macs
  • With Windows, the cmd prompt is a bit similar (text-based) but not as powerful

Usage:

  • Often the only interface for remote connections
  • Powerful built-in commands
  • Automate repetitive tasks
  • Shell scripts to reproduce data manipulation

Where can we find the shell?

To find a shell:

  • On GNU/Linux and MacOS systems: open a terminal. This will provide you with a Unix-like shell on both systems.
  • On Windows: run cmd.exe or cmd. This shell is quite different from the Unix-like shell found in Linux and MacOS. To obtain a Unix shell on Windows, one can install the Cygwin tools.
  • It is strongly recommended to learn how to use a Unix shell (the most likely to be installed on a remote server).

One shell or several shells?

  • A shell: a program providing an interface between the user and the computer. Different shells exist.
  • The most popular and widely used shell is probably bash. It is the default shell in most GNU/Linux distributions.
  • If you learn how to use bash, you will be able to use most remote servers you'll have to connect to, and also the terminal from MacOS or the Cygwin tools on Windows.

*One word on terminology*: During the course, we will often say interchangeably "the terminal", "the console", "the shell" or "bash".

The CSC center in Kajaani

Meet the Taito cluster (taito.csc.fi)

Meet bio109-113 (130.234.109.113)

  • Today, we are going to connect to a remote workstation on the other side of the lake: bio109-113.
  • Runs a GNU/Linux system
  • Provides a bash shell (similar to Taito)

Hands-on practice

  1. Connection to a remote shell

The plan:

  • Use our workstation in Ambiotica (student accounts were created on this server)
  • Tools: putty (Windows) or ssh (Mac and GNU/Linux)
  • A word about ssh and the security of connections

1.1 How to use ssh?

  • ssh syntax:

    ssh username@host
    
  • username - this is your identifier on the host machine:

    • matthieu
    • matthieuBruneaux
    • mabrunea
    • jyybio48
  • host - this is the address of the machine you are connecting to:

    • taito.csc.fi
    • 130.234.109.113

1.2 Connect to your student account

Student account:

  • Logins: jyybio01 to jyybio25
  • Password: on the whiteboard!

Connection:

  • Mac or GNU/Linux: from a terminal

    ssh jyybioxx@130.234.109.113
    

    where xx is your student number.

  • Windows: using putty. Live demonstration, ask a teacher if needed. Remember to tune your putty settings for a better look and feel:

    • Window > Appearance > Change font to "Deja Vu Sans Mono", 11-point
    • Window > Translation > Change "Remote character set" to UTF-8 (top of the list)
    • Window > Colours > Tick "Use system colours"
    • Save settings if you wish
  • You might be asked to accept the host RSA key fingerprint. It should be:

    2048 74:55:8b:ad:2d:85:84:74:be:ce:53:da:12:5a:13:32   (RSA)
    

    (this ensures that you are connecting to the intended host, not to a malicious server impersonating the host)

  1. First contact with the shell

2.1 Just after connection:

  • What you see after connection is the shell prompt. It tells you the shell is ready to receive your input:

    jyybioxx@bio109-113$
    
  • jyybioxx is your username, bio109-113 is the host server to which you are connected.

2.2 Execute a command (ls)

  • The shell reads and executes commands you enter at the prompt, and prints the output.

  • Type:

    ls
    

    and press RETURN. You should see:

    practicals  readme
    
  • You just ran the ls command which produces an output: the list of files and folders present in the current directory.

  • Try another command:

    whoami
    

    What does this command do?

*Important note*: Do not copy-paste the commands from the course material to your terminal! You will get a much better feeling for the shell if you type the commands yourself.

2.3 Execute a command (pwd)

  • When you login to a server, you are automatically sent to your home folder.

  • You can see where you are by typing:

    pwd
    

    which produces:

    /home/jyybioxx
    
  • So you are now in the folder jyybioxx, which is itself contained in home, which is at the root of the file system (/, there is no parent directory above).

  1. Adding options to a command

  • You can add options to a command with the dash sign -:

    ls -l
    

    (this is -l with the letter L, not -1)

  • This runs the ls command with the -l option, which produces a detailed output:

    total 4
    drwxr-xr-x. 3 jyybioxx users 23 Nov  9 16:22 practicals
    -rw-r--r--. 1 jyybioxx users 25 Nov  9 16:22 README
    
  • Now you can see the date of last modification of the files and some other information.

  1. A word about rights

4.1 The rights system

  • In a Unix system, every file has a owner and belongs to a group
  • Every file has rights for reading, writing and execution
  • Those rights are set for three categories of users: owner, group and others

4.2 ls -l output

total 4
drwxr-xr-x. 3 jyybioxx users 23 Nov  9 16:22 practicals
-rw-r--r--. 1 jyybioxx users 25 Nov  9 16:22 README
  • The very first letter for practicals row (d) means this row is a directory. Let's consider the nine following characters (rwx------)

  • The three first letters are rights for the owner, the next three rights for the group, and the last three rights for others.

  • If a letter is replaced by a dash, the right is not granted.

  • What do those mean?

    -rwx------
    -r--r--r--
    -rwxr--r--
    drwxr-xr-x
    
  1. Basic folder navigation

5.1 cd command

  • We can navigate from folder to folder using the cd command:

    ls
    cd practicals
    ls
    cd ecoli-data
    ls
    
  • You can see there are already some files in this folder. Let’s ask for more details with ls -l

  • How many files are there? How large are they?

5.2 Combining options for ls

  • We can ask for more human-readable sizes with:

    ls -l -h
    
  • Can you see the difference with ls -l? What does ls -h do?

  • We could also combine both options: ls -lh . Try it.

5.3 Moving to the parent directory

  • We can go back through the parent folders using cd ..

    pwd      # Where are you at this point?
    cd ..
    pwd      # And now?
    ls
    cd ..
    pwd      # And here?
    ls
    cd .. 
    pwd      # And here?
    ls
    

5.4 Going back to the home directory

  • A faster way to go back to your home directory, from any starting directory, is just to type cd without any argument:

    pwd
    cd
    pwd
    
  • Go back to the ecoli-data subfolder and back again to your home directory using cd

  • From your home folder, instead of typing cd practicals and then cd ecoli-data to go through folder one at a time, we can go directly to the subfolder by typing:

    cd practicals/ecoli-data
    pwd
    ls
    

Shortcut for the home folder

  • Another way to go to the home folder is to use the ~ character: this is automatically replaced by the path to your home folder by bash.

    cd              # Back to your home folder
    cd practicals
    cd ~            # Bash understands "~" as "/home/jyybioxx"
    cd ..
    pwd             # Where are you at this stage?
    cd ~/practicals # Where are you now?
    
  1. Creating folders

6.1 The mkdir command

  • Go to the practicals folder and create a new folder in it:

    cd ~/practicals
    mkdir results
    cd results
    ls
    

6.2 Exercise

  • Create the following directory structure:

    ~/practicals/scripts/python/modules/seqAnalysis
    
  • Go back to your home folder.

  1. Auto-completion

7.1 The magic TAB key

  • Let’s go into the seqAnalysis folder, but let’s be lazy:

    cd         # Start from your home folder
    cd pr      # Press TAB at this point
    
  • What happened?

  • Use this feature to go quickly to seqAnalysis. What is the minimum number of keystrokes you have to use to go there from your home folder?

7.2 Remember!

  • When you press TAB, the shell tries to complete what you just typed by itself. This auto-completion feature of the shell is very convenient and will save you a lot of typing!

7.3 Test auto-completion

  • Now create a folder:

    ~/practicals/scripts/python/modifiedSources
    
  • Go back to your home folder, and go into modifiedSources using the TAB completion as much as you can. What do you notice?

7.4 Double TAB

  • Now create the folder:

    ~/practicals/scripts/python/modularComponents
    
  • Type:

    cd ~/practicals/scripts/python/mod   # Press TAB twice here
                                         # Type "ule" and press TAB again
    
  • Do you understand how TAB completion works? This also works for command names.

7.5 UP and DOWN arrow

  • Try typing a few times on the UP ARROW. What happens?

  • Try typing a few times on the UP ARROW, and then on the DOWN ARROW. What happens?

  • Try typing a few times on the UP ARROW, and then press CTRL + C. What happens?

  • Type history. What happens?

  1. Copying, moving and removing files

8.1 Creating an empty file

  • Go the the seqAnalysis folder and type:

    touch DNA-analysis.py
    ls
    
  • What happened?

  • Find out the size of the new file.

8.2 Moving a file

  • Now type:

    mv DNA-analysis.py ../modularComponents
    
  • What happened? Did you use the TAB key? (you should!)

  • Explore the directory structure to find DNA-analysis.py again.

8.3 Copying a file

  • Go to the modularComponents subfolder and type:

    cp DNA-analysis.py ../modules
    
  • What happened?

8.4 Removing a file

  • From modularComponents folder, type:

    rm DNA-analysis.py
    
  • What happened?

  1. Creating a directory hierarchy

9.1 Moving a folder

  • From the scripts folder, move modularComponents into modules:

    mv modularComponents modules
    tree
    
  • What does tree do?

9.2 Copying a folder

  • Go to the practicals folder and make a copy of scripts:

    cp -r scripts scripts-backup
    
  • Note the -r option used for recursive copy inside the directories.

9.3 Removing a folder

  • Remove the newly created folder with:

    rm -r scripts-backup
    
  • Again, note the -r option to work on folders.

9.4 Exercise

  • Now that you have gained some experience, create the exact following directory structure (folders only shown here):

    .
    ├── archives
    ├── practicals
    │   ├── book
    │   │   └── ...
    │   ├── ecoli-data
    │   │   └── ...
    │   └── results
    │       └── 2016-11-14
    ├── scripts
    │   ├── R
    │   └── python
    │       ├── popGenetics
    │       ├── proteinStructure
    │       └── seqAnalysis
    └── zipped.archives
    
  1. Viewing a file

10.1 cat command

  • Go to the ecoli-data folder and type:

    cat README
    
  • Try also cat on one of the fasta files. What happened?

  • By the way, do you know what is a fasta file?

10.2 head and tail commands

  • Type:

    # Use TAB for auto-completion as much as you can!
    head Escherichia_coli_o5_k4_l_h4_str_atcc_23502.GCA_000333195.1.26.pep.all.fa
    tail Escherichia_coli_o5_k4_l_h4_str_atcc_23502.GCA_000333195.1.26.pep.all.fa
    # Try head and tail options
    head -n 30 Escherichia_coli_o5_k4_l_h4_str_atcc_23502.GCA_000333195.1.26.pep.all.fa
    tail -n 3 Escherichia_coli_o5_k4_l_h4_str_atcc_23502.GCA_000333195.1.26.pep.all.fa
    
  • What do those commands do? What does the -n option do?

10.3 less command

  • less is very useful to examine large files.
  • You can navigate using the UP and DOWN arrows
  • You can also use the B and SPACE keys to move faster
  • You can exit with Q
less Escherichia_coli_o5_k4_l_h4_str_atcc_23502.GCA_000333195.1.26.pep.all.fa
  1. A tour of some useful tools

Let's go through a quick tour of some of the most useful commands in the shell toolbox!

11.1 wc to count words

  • Go to the ecoli-data folder and type:

    wc Escherichia_coli_o55_h7_str_06_3555.GCA_000617385.1.26.pep.all.fa
    

    which produces:

      26318   51865 1824223 Escherichia_coli_o55_h7_str_06_3555.GCA_000617385.1.26.pep.all.fa
    
  • What does this output mean?

  • We can have only the number of lines with wc -l (try it!)

  • Which file in the ecoli-data folder has the most lines?

Wildcards

  • Try:

    wc -l *.fa
    
  • What happened?

  • Which file in the ecoli-data folder has the most lines?

11.2 Redirection

The > operator

  • When a command produces some output, it can be redirected to a file instead of to the terminal:

    wc -l *.fa > lineCounts
    cat lineCounts
    
  • > is a redirection operator, and automatically creates a new file or erases an existing file.

The >> operator

  • To redirect output and append it to an existing file, we can use the >> operator.

  • Compare lineCounts.1 and lineCounts.2 when you type:

    # lineCounts.1
    wc -l *.fa > lineCounts.1
    wc -l README > lineCounts.1
    # lineCounts.2
    wc -l *.fa > lineCounts.2
    wc -l README >> lineCounts.2
    
  • What happened?

11.3 grep to search for matches

  • Run the following commands from the ecoli-data folder:

    grep "flagellin" Escherichia_coli_o55_h7_str_06_3555.GCA_000617385.1.26.pep.all.fa
    grep --color=always "flagellin" Escherichia_coli_o55_h7_str_06_3555.GCA_000617385.1.26.pep.all.fa
    grep -n --color=always "flagellin" Escherichia_coli_o55_h7_str_06_3555.GCA_000617385.1.26.pep.all.fa
    grep -c --color=always "flagellin" Escherichia_coli_o55_h7_str_06_3555.GCA_000617385.1.26.pep.all.fa
    
  • What does each of the grep option do?

Exercise

  • Use grep to extract all the sequence names from one of the fasta file and store them in a file called proteinNames.

=grep= is versatile

  • Run the commands:

    grep -c flagellin *.fa
    grep -c flagel *.fa
    
  • Can you explain what grep did?

Exercise

  • How would you count the number of proteins in each fasta file?

11.4 cut to get columns

  • Run the commands:

    grep -c flagel *.fa > flagelCounts
    cat flagelCounts
    cut -d "_" -f 1 flagelCounts
    cut -d "_" -f 3 flagelCounts
    cut -d "_" -f 3,5 flagelCounts
    cut -d ":" -f 2 flagelCounts
    
  • Do you understand what cut does and the roles of the -d and -f options?

11.5 sort to sort things

  • Use sort to sort the line counts from lineCounts:

    sort lineCounts
    
  • Is everything correct? What if you try sort -n lineCounts? Can you see a difference?

  • Try also sort -r lineCounts. What is the difference?

Exercise

  • Using grep and sort and an intermediate file, sort the bacterial proteomes by decreasing number of proteins.

  • Hint: sort supports two interesting options, -t to specify a field separator and -k to specify which field to use for sorting.

11.6 uniq to remove or count duplicates

  • Go to the folder ~/practicals/boook and use the Python script provided in the folder to convert The Hound of the Baskervilles to a list of words. Store this list in a file called houndList

  • Have a look at houndList using head, tail and less

  • Which words Arthur Conan Doyle used in his book? To get a list of distinct words, we can use uniq:

    uniq houndList > uniqueList
    
  • Have a look at this list. Are words really unique?

  • uniq only removes duplicates which are grouped together. How would you do to sort the list of all words, and then get a list of unique words used in this text file?

  • When using the -c option, uniq gives another additional information. Can you guess what it is?

11.7 Combining tools with pipes

Pipes can connect output and input streams

  • When we did sort lineCounts, we used sort on the output of wc, but we used an intermediate file.

  • The shell offers a powerful way to connect directly the output of a command to the input of another: the pipe operator.

    wc -l *.fa | sort -n
    

(to type a "|" on a Finnish keyboard, hold AltGr and press <)

Exercises

  • The w command output the list of connected users on the server. Try it, and then try:

    w | head
    w | tail
    
  • Use a pipe to find all the users whose login contains "jyy".

  • Extend the same pipe to count how many there are.

  • Using only pipes, get a sorted list of the word counts from The Hound of the Baskervilles. Which are the top ten most used words? How many times was "elementary" used? How many times "dear"? How many times "Watson"?

  1. Wrap-up exercise

At this stage, you can already leverage much of the shell power by using simple commands and combining them into more complex tools. Let's put what you learnt into practice!

For the next exercise, we are going to use:

  • a ready-made Python script designed to perform one simple action
  • some of the shell tools which also perform one simple action each
  • some "duct tape" to connect them together: pipes and redirections

12.1 Test the Python script

  • The script seqComposition.py takes a fasta file and produces a table containing the amino-acid composition of each protein in the file.

  • To run the script, type:

    python seqComposition.py myFastaFile
    # Use the fasta file you wish instead of "myFastaFile"
    
  • The output is sent to the terminal display.

  • Exercise: propose at least two practical ways to have a look at this output.

12.2 Real-life exercise

  • Using only the shell tools you know, the Python script and pipes, determine the distribution of the number of histidines per protein in the proteome of the strain of your choice.

  • More clearly stated: for a given strain, determines how many proteins have one histidine, how many have two, how many have three, ...

  • Good luck!

One step towards wizardry: shell scripts

  1. Reusing your tool pipeline

Store your commands in a script

  • Let's use nano to store your pipeline in a file:

    nano getHistDistrib.sh
    

    (the usage of nano will be demonstrated live)

  • The idea is to be able to produce the histidine distribution results just by typing:

    bash getHistDistrib.sh myFastaFile
    

Test your pipeline with a few files

  • Test your pipeline for five strains. Does it work fine? What would you do to be confident that the code is working properly?

  • Now you want to make a comparative study of the histidine distributions across the proteomes of all prokaryotic genomes available. How do you feel about running your script manually for several thousands strains?

  • As Raymond Hettinger would put it: "there must be a better way!"

  1. Making a general purpose listing script

  • Create a shell script testListing.sh with this content:

    listFiles=‘ls *.fa‘
    echo $listFiles
    for myFile in $listFiles; do
        echo $myFile
        echo $myFile.results
    done
    
  • Run it with bash. What does it do?

  1. Exercise: final script

  • Combine your pipeline script and the listing script into a single script to get the histidine distribution for all the fasta files in this folder.

  • How would you do if you wanted to do the same analysis in two months, after updating your genomic data with the latest sequences available?