# Introduction to Jupyter

## Basic Usage

When you first access Jupyter, you will get a file browser view of your home directory on the server. At first your home directory will be empty; it will fill up with notebooks and files over the course of this workshop.

To create a new text file, click on New (in the upper right corner) and then Text File, which opens a text editor within your browser. You can now add content to the file, or edit existing content, and save. To change the filename, click on it at the top of the page. Then go back to your file browser window and refresh it using the button with the two arrows in the upper right corner; you should see your text file saved in your home directory.

You can also use Jupyter to open a Terminal within the browser: Click on New and then Terminal, which will open a terminal window in a separate browser tab. You can enter Unix Bash commands to change directories, view files or execute programs (as we will learn below).

Finally, you can create new folders by clicking on New and then Folder. To rename the new folder, tick the checkbox beside it and click the Rename button that appears at the top. To change into the new folder, click on it. To move back, click on the parent folder shown at the top of the file browser.

***Exercise:*** Create a new folder called hello, and a text file within that folder using Jupyter. Name that text file hello.txt and fill it with arbitrary content, such as `Hello, World!`. Then open a terminal and print the contents of the new text file by typing `cat hello/hello.txt` followed by ENTER.

***Note:*** While the Jupyter terminal and Jupyter Text Files are different ways to interact with the server, both access the same file system. So files created with the Text Editor are saved in your home directory and can be accessed via the terminal, and vice versa: files created via the Terminal can be opened in the Text Editor by simply clicking on them in the Jupyter File Browser.
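
The "vice versa" direction can be sketched from the terminal as follows. This is a minimal sketch that works in a throwaway directory created with `mktemp` (an assumption made here so that your real home directory is not touched); in the workshop you would run the `mkdir` and `echo` lines directly in your home directory and then find the file in the Jupyter File Browser.

```shell
# Work in a scratch directory (assumption: used only to keep this sketch
# from touching your real files).
workdir=$(mktemp -d)
cd "$workdir"

# Create the folder and the text file entirely from the terminal.
mkdir -p hello
echo 'Hello, World!' > hello/hello.txt

# Read the file back, as in the exercise above.
content=$(cat hello/hello.txt)
echo "$content"
```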

## Notebooks

Notebooks can be opened with different underlying kernels: bash, python and R. They are useful for documenting interactive data analysis because they combine code cells with markdown cells. A markdown cell can contain text, math or headings.

You can create new bash notebooks using the "New" Dropdown list in the Jupyter File Browser and then selecting "Bash". Notebooks open if you click on them.

In Jupyter notebooks, you work with *Cells*. You can create new cells, or insert them above or below existing cells using the menu items in the `Insert` menu. Use the dropdown list in the command bar in Jupyter to change the type of the cell. The two main types we're going to use are `Markdown` and `Code`. Markdown cells are useful for documentation, Code cells for running code. Markdown cells can be edited by double-clicking into them. Render them by pressing Shift-Enter.

Code cells are used to enter and execute code. Let's look at some examples.

We can first check which directory we are in, using the `pwd` (= print working directory) command:

In [1]:
pwd

/home/stephan/coursework


OK, so we're in the `coursework` subfolder within our home folder `/home/stephan`. We can list the contents of that folder:

In [2]:
ls

01_bashnb_setting_up.ipynb  03_pynb_plotting_pca.ipynb
02_bashnb_smartpca.ipynb


We can now create a new directory:

In [3]:
mkdir testDir

and change into that directory:

In [4]:
cd testDir

and confirm that we are now in the new dir:

In [5]:
pwd

/home/stephan/coursework/testDir


OK, let's go back and delete the subfolder again:

In [7]:
cd ..
rm -r testDir

Here is a simple example of how to use ``echo``:

In [8]:
echo "Hello, how are you?"

Hello, how are you?


OK, so let's try some more useful things with ``grep``, which can be used to filter large text files by searching for patterns, in this case just the occurrence of the word "French":

In [9]:
grep French /data/pca/genotypes_small.ind

           HGDP00511 M     French
           HGDP00512 M     French
           HGDP00513 F     French
           HGDP00514 F     French
           HGDP00515 M     French
           HGDP00516 F     French
           HGDP00517 F     French
           HGDP00518 M     French
           HGDP00519 M     French
           HGDP00522 M     French
           HGDP00523 F     French
           HGDP00524 F     French
           HGDP00525 M     French
           HGDP00526 F     French
           HGDP00527 F     French
           HGDP00528 M     French
           HGDP00529 F     French
           HGDP00531 F     French
           HGDP00533 M     French
           HGDP00534 F     French
           HGDP00535 F     French
           HGDP00536 F     French
           HGDP00537 F     French
           HGDP00538 M     French
           HGDP00539 F     French
     SouthFrench3326 M     French
     SouthFrench3947 M     French
     SouthFrench1323 M     French
     SouthFrench3951 M     French
     SouthFren

Alright, so that lists all French individuals. Now let's count them, by simply passing the flag `-c`:

In [12]:
grep -c French /data/pca/genotypes_small.ind

32


***Note:*** So far we have seen the `pwd`, `ls`, `mkdir`, `cd`, `rm`, `echo` and `grep` commands. If you want to find out more about them, just google them; they are among the most widely used commands in Unix.
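
One caveat about the `grep` call above: `grep French` matches the word anywhere on a line, including inside an individual's ID such as `SouthFrench3326`. The lines shown above all carry `French` in the third column, but in general a pattern match and a population match can differ. The sketch below illustrates this on a tiny sample file; the third row (`French_demo01`, `Basque`) is made up for the illustration and is not taken from the dataset.

```shell
# Build a three-line sample in a temporary file (the third row is made up).
sample=$(mktemp)
printf '%s\n' \
  '      HGDP00511 M     French' \
  'SouthFrench3326 M     French' \
  '  French_demo01 F     Basque' > "$sample"

# grep counts every line that contains "French" anywhere.
grep_count=$(grep -c French "$sample")

# awk counts only lines whose third field is exactly "French".
awk_count=$(awk '$3 == "French" { n++ } END { print n + 0 }' "$sample")

echo "grep: $grep_count, awk: $awk_count"   # prints "grep: 3, awk: 2"
rm -f "$sample"
```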

In Python3 notebooks you can plot things: Create a new python3 notebook, and run this boilerplate code in the first cell:

    %matplotlib inline
    import matplotlib.pyplot as plt

Then open a second cell and plot something:

***Exercise:*** Create a simple plot using `plt.plot([1, 2, 3], [5, 2, 6])`


# Bash Pipes

OK. So this first notebook runs on Bash, which is more or less the lingua franca of Linux operating systems. Most command-line work on Linux is done in bash or a similar shell. One of the most useful techniques in bash scripts and commands is the Unix pipe. To illustrate it, consider the following.

Let's look at the structure of our ``ind`` file:

In [10]:
head /data/pca/genotypes_small.ind

             Yuk_009 M    Yukagir
             Yuk_025 F    Yukagir
             Yuk_022 F    Yukagir
             Yuk_020 F    Yukagir
               MC_40 M    Chukchi
             Yuk_024 F    Yukagir
             Yuk_023 F    Yukagir
               MC_16 M    Chukchi
               MC_15 F    Chukchi
               MC_18 M    Chukchi


***Note:*** The `head` command prints the first 10 lines of a file by default.

Let's extract the population column:

In [13]:
head /data/pca/genotypes_small.ind | awk '{print $3}'

Yukagir
Yukagir
Yukagir
Yukagir
Chukchi
Yukagir
Yukagir
Chukchi
Chukchi
Chukchi


***Note:*** The `awk` program is one of the most powerful programs for text-file processing in the Unix world. It is actually a full-fledged programming language itself. Here we only use it in one of its simplest forms. The program `{print $3}` simply says "For every line of the input file, print out the third field".

***Note:*** The pipe symbol `|` tells Unix to redirect the output of the program to its left into the program to its right as standard input. 
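
To make the pipe concrete, the following sketch runs the same two-step job twice: once with an intermediate file, once with a pipe. It uses a three-line sample written to a temporary file (an assumption made so that the sketch runs anywhere), with rows copied from the `head` output above.

```shell
# Three sample rows in a temporary file.
sample=$(mktemp)
printf '%s\n' 'Yuk_009 M Yukagir' 'MC_40 M Chukchi' 'Yuk_025 F Yukagir' > "$sample"

# Without a pipe: write the intermediate result to a file, then read it back.
tmp=$(mktemp)
awk '{print $3}' "$sample" > "$tmp"
without_pipe=$(sort "$tmp")

# With a pipe: awk's output becomes sort's standard input directly.
with_pipe=$(awk '{print $3}' "$sample" | sort)

echo "$with_pipe"
rm -f "$sample" "$tmp"
```

Both variants produce the same three sorted lines; the pipe simply saves the intermediate file.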

Let's sort the output (notice that we now start with ``cat`` instead of ``head``, and use ``head`` at the end):

In [16]:
cat /data/pca/genotypes_small.ind | awk '{print $3}' | sort | head

Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Adygei
sort: fflush failed: standard output: Broken pipe
sort: write error


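OK, so there are some error messages at the end: ``head`` exits after reading its ten lines and closes the pipe while ``sort`` is still writing, so ``sort`` reports a broken pipe. The output we wanted is complete, so that's OK.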
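
If the messages bother you, one possible workaround is to take the first ten lines with `sed -n '1,10p'` instead of `head`. This is a sketch, not part of the original pipeline: `sed` prints only lines 1 to 10 but keeps reading until its input ends, so `sort` can finish writing. The input here is a made-up list of 1000 labels standing in for the population column.

```shell
# Hypothetical stand-in for the population column: 1000 made-up labels.
labels=$(seq 1 1000 | awk '{ printf "pop%d\n", $1 % 7 }')

# sed prints lines 1-10 but reads its whole input,
# so sort never writes into a closed pipe.
first_ten=$(printf '%s\n' "$labels" | sort | sed -n '1,10p')
printf '%s\n' "$first_ten"
```

The trade-off is that `sed` processes the entire input, which is slower than `head` on very large files.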

Now let's use ``uniq`` to get rid of population name duplicates:

In [17]:
cat /data/pca/genotypes_small.ind | awk '{print $3}' | sort | uniq | head

Abkhasian
Adygei
Albanian
Aleut
Aleut_Tlingit
Altaian
Ami
Armenian
Atayal
Balkar


And now let's count:

In [18]:
cat /data/pca/genotypes_small.ind | awk '{print $3}' | sort | uniq | wc -l

116


OK, so there are 116 populations in the dataset. And how many individuals?

In [19]:
wc -l /data/pca/genotypes_small.ind

1340 /data/pca/genotypes_small.ind


So 1340 individuals in 116 populations, which makes roughly 11.5 per population on average. Good to know!
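
The average can also be computed on the command line. Bash arithmetic only handles integers, so this sketch uses `awk` for the division; the two numbers are the counts obtained above, and `LC_ALL=C` is set so that the decimal separator is a dot regardless of the system language.

```shell
individuals=1340
populations=116

# Integer division in bash drops the fractional part.
echo $(( individuals / populations ))   # prints 11

# awk does floating-point arithmetic.
average=$(LC_ALL=C awk -v n="$individuals" -v p="$populations" \
  'BEGIN { printf "%.2f", n / p }')
echo "$average"                         # prints 11.55
```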

***Note:*** We learned some new Unix commands: `awk`, `cat`, `head`, `sort`, `uniq` and `wc`.
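
A detail worth knowing about the pipeline above: `uniq` only collapses duplicate lines that are adjacent, which is why `sort` comes first. The following minimal sketch shows the difference on four population names taken from the `head` output earlier.

```shell
# Four names in which no two identical lines are adjacent.
names=$(printf '%s\n' Yukagir Chukchi Yukagir Chukchi)

# Without sort, uniq finds no adjacent duplicates and keeps all four lines.
unsorted_count=$(printf '%s\n' "$names" | uniq | wc -l | tr -d ' ')

# With sort, identical names end up next to each other and are collapsed.
sorted_count=$(printf '%s\n' "$names" | sort | uniq | wc -l | tr -d ' ')

echo "without sort: $unsorted_count, with sort: $sorted_count"
# prints "without sort: 4, with sort: 2"
```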

As a final step, let's modify our pipeline to output not just the unique populations, but also the number of individuals per population. Fortunately this is extremely easy, since the flag `-c` to the `uniq` command already does the job:

In [13]:
cat /data/pca/genotypes_small.ind | awk '{print $3}' | sort | uniq -c | head

      9 Abkhasian
     16 Adygei
      6 Albanian
      7 Aleut
      4 Aleut_Tlingit
      7 Altaian
     10 Ami
     10 Armenian
      9 Atayal
     10 Balkar


Nice. Let's put that list into a file that we can then import for plotting later.

In [1]:
cat /data/pca/genotypes_small.ind | awk '{print $3}' | sort | uniq -c > population_frequencies.txt

OK, we have created a new file called `population_frequencies.txt` in our current directory. We have used the bash redirection symbol `>` to write the output of a command or pipeline into a file. The file should now contain the number of individuals per population. We can check this by running:

In [2]:
head population_frequencies.txt

      9 Abkhasian
     16 Adygei
      6 Albanian
      7 Aleut
      4 Aleut_Tlingit
      7 Altaian
     10 Ami
     10 Armenian
      9 Atayal
     10 Balkar


OK, it seems to have worked. If you want to look at the file in a more interactive way, go back to your Jupyter File Browser and click on the file, which you should now see within your working directory. The file should open in a text editor that you can use to scroll around.
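
Keep in mind that `>` replaces the contents of an existing file, so rerunning the pipeline overwrites `population_frequencies.txt` rather than adding to it. To append instead, bash offers `>>`. The sketch below shows both on a hypothetical file `demo.txt` in a scratch directory (both names are assumptions for illustration).

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir"

echo "first"  >  demo.txt   # creates demo.txt
echo "second" >  demo.txt   # overwrites: the file now holds only "second"
echo "third"  >> demo.txt   # appends: the file now holds two lines

cat demo.txt
line_count=$(wc -l < demo.txt | tr -d ' ')
```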

OK, now that we have a file to plot, let's try it out using a new python3 notebook. See the next notebook, called `02_pynb_getting_started` in this series.