# Setup Notebook

This notebook contains a few commands to set up your [virtual machine](https://en.wikipedia.org/wiki/Virtual_machine) by making some work directories and importing data from CyVerse using iCommands.

## A little about your computer

Your [virtual machine](https://en.wikipedia.org/wiki/Virtual_machine) is a virtual computer. Instead of having a laptop or desktop, you are using some of the CPUs (or GPUs) inside of a larger computer in the [JetStream2](https://jetstream-cloud.org/about/index.html) system. 

 <figure>
<img src="https://github.com/JasonJWilliamsNY/genome_camp_2021/raw/main/jupyter_photos/tacc_jetstream.JPG" alt="Tacc JetStream Server" width="200"/> <img src="https://github.com/JasonJWilliamsNY/genome_camp_2021/raw/main/jupyter_photos/tacc_nsf.JPG" alt="TACC NSF server" width="200"/>
  <figcaption>Servers running JetStream at the Texas Advanced Computing Center</figcaption>
</figure> 


This allows us to provide any kind of computer you need to do your work, anywhere. It also makes it a bit difficult to know what is going on. Let's use some Linux commands to learn more about the machine you are using.

We can see how much storage we have using this command:

In [1]:
df -h | grep "dev/sda1 "

/dev/sda1        58G   16G   43G  27% /


You should see a result similar to...

`/dev/sda1        58G   16G   43G  27% /`

`/dev/sda1` is usually the main hard drive. In this case the drive has 58 GB of total disk space: 16 GB are used and 43 GB are free, so we have used 27%. We can do some work with this, but genomics projects get big fast, so we will set up some things to handle this shortly.

We can also check the number of CPUs we are working with:
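As a minimal sketch, the same four numbers can be labeled directly instead of read off by column position. The function name `disk_summary` is made up for this example, and it assumes the `df` and `awk` tools found on a typical Linux system.

```shell
# Hypothetical helper: label the disk usage columns for the filesystem
# that holds a given path (default: /).
disk_summary() {
    target="${1:-/}"
    # -h prints human-readable sizes; -P keeps each filesystem on one line,
    # so the second line of output is always the data row
    df -hP "$target" | awk 'NR==2 {printf "size=%s used=%s free=%s percent_used=%s\n", $2, $3, $4, $5}'
}

disk_summary /
```

Passing a different path, for example `disk_summary /mnt/ceph`, would report on whichever filesystem holds that path.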

In [2]:
lscpu | egrep 'Model name|Socket|Thread|NUMA|CPU\(s\)'

CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Socket(s):                       8
NUMA node(s):                    1
Model name:                      AMD EPYC-Milan Processor
NUMA node0 CPU(s):               0-7


This should return a result something like:

```
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Socket(s):                       8
NUMA node(s):                    1
Model name:                      AMD EPYC-Milan Processor
NUMA node0 CPU(s):               0-7

```
This means we have 8 CPUs (numbered 0 through 7) to work with.
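Many genomics tools accept a thread count, so it can help to capture the CPU count in a variable. The sketch below uses `nproc` (part of GNU coreutils). The variable names and the choice to leave one CPU free are assumptions for illustration, not something this notebook requires.

```shell
# Number of processing units available to this shell
cpu_count="$(nproc)"

# Assumption for this example: keep one CPU free for the notebook itself,
# but never go below one thread
threads=$(( cpu_count > 1 ? cpu_count - 1 : 1 ))

echo "CPUs available: $cpu_count, threads to use: $threads"
```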

We can also check the name and version of the operating system. 

In [3]:
egrep '^(VERSION|NAME)=' /etc/os-release

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"


We are using [Ubuntu](https://ubuntu.com/), a popular Linux distribution. Knowing this information will help if we have questions (e.g., a web search can be specific to your system: "How to install software on Ubuntu system?").
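Because `/etc/os-release` is written as shell variable assignments, it can also be read directly by the shell. The following is a small sketch; the function name `os_summary` is hypothetical.

```shell
# Hypothetical helper: print the OS name and version from an os-release file.
os_summary() {
    release_file="${1:-/etc/os-release}"
    if [ ! -r "$release_file" ]; then
        echo "cannot read $release_file" >&2
        return 1
    fi
    # Source the file in a subshell so its variables do not leak into our session
    (
        . "$release_file"
        echo "${NAME:-unknown} ${VERSION:-unknown}"
    )
}

os_summary
```

On the machine shown above, this would be expected to print the same name and version that the `egrep` command reported.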

## Making a project directory and linking to shared storage on JetStream2

Let's make some directories to organize our work. We can create a `project` directory to contain all of our work. All of the virtual machines are connected to a central place to store data, and we have to link to that source.

In [4]:
ls --color=never /mnt/ceph # the color never option turns off coloring which is hard to read here

afeitzinger  chamecrista_fast5  jlopez     notebooks
candujar     jagosto            jwilliams  tutorial_example


You should see a directory with your first initial and last name. Let's organize our work by creating a [symbolic link](https://www.freecodecamp.org/news/symlink-tutorial-in-linux-how-to-create-and-remove-a-symbolic-link/). This is like creating a shortcut on your Desktop to a file you want to easily open in the future without having to look for it.  

First, let's see our current working directory. 

In [5]:
pwd

/home/exouser


Let's create a `project` directory for all of our work at `/home/exouser/project` and link it to your directory on the shared data storage, which should be located under `/mnt/ceph`. The command below uses `jwilliams` as an example. Be sure to replace it with the directory name we saw earlier that matches your first initial and last name (written as `YOURUSERNAME` elsewhere in this notebook).

In [6]:
ln -s /mnt/ceph/jwilliams /home/exouser/project

Let's look at what we have created:

In [7]:
ls -l --color=never /home/exouser

total 56
drwxr-xr-x 2 exouser exouser  4096 Nov 22 18:42 Desktop
drwxr-xr-x 2 exouser exouser  4096 Nov 22 18:42 Documents
drwxr-xr-x 2 exouser exouser  4096 Nov 22 18:42 Downloads
drwxr-xr-x 2 exouser exouser  4096 Nov 22 18:42 Music
drwxr-xr-x 2 exouser exouser  4096 Nov 22 18:42 Pictures
drwxr-xr-x 2 exouser exouser  4096 Nov 22 18:42 Public
-rw-rw-r-- 1 exouser exouser 13598 Dec  5 05:27 Setup_notebook.ipynb
drwxr-xr-x 2 exouser exouser  4096 Nov 22 18:42 Templates
drwxr-xr-x 2 exouser exouser  4096 Nov 22 18:42 Videos
-rwxrwxr-x 1 exouser exouser   136 Nov 22 18:52 jstart.sh
lrwxrwxrwx 1 exouser exouser    19 Dec  5 05:31 project -> /mnt/ceph/jwilliams
-rwxrwxr-x 1 exouser exouser   195 Nov 26 14:28 tstart.sh


You should see something that looks like

`lrwxrwxrwx 1 exouser exouser    19 Dec  5 04:11 project -> /mnt/ceph/YOURUSERNAME`

This means that when we work in the directory `/home/exouser/project`, our results will really be saved to shared storage, which has a lot more room than the storage on this virtual machine. There may be times when it is better to use the storage on the virtual machine, since reading and writing data on the shared storage may be a bit slower.
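To see this behavior without touching any real data, here is a small sketch that builds the same arrangement inside a throwaway directory. The directory names are stand-ins chosen for this example: `shared_storage` plays the role of `/mnt/ceph/YOURUSERNAME` and `project` plays the role of `/home/exouser/project`.

```shell
# Work in a temporary directory so nothing real is changed
demo_dir="$(mktemp -d)"

# Stand-in for the shared storage directory
mkdir "$demo_dir/shared_storage"

# Stand-in for the project link in the home directory
ln -s "$demo_dir/shared_storage" "$demo_dir/project"

# Write a file through the link...
echo "hello" > "$demo_dir/project/results.txt"

# ...and it shows up in the directory the link points to
ls "$demo_dir/shared_storage"

# readlink prints where a symbolic link points
readlink "$demo_dir/project"
```

The file is stored only once, in the target directory. The link is just another name for getting there.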

## Connecting our dataset to our project folder

Now that we have a project folder, we can connect it to the sequence data, which should be located at `/mnt/ceph/chamecrista_fast5`. Let's create another link, inside our project folder, to that dataset so that it's easier to use. The fast5 data is read-only, so we have to add `sudo` to our commands.

In [8]:
sudo ln -s /mnt/ceph/chamecrista_fast5 /home/exouser/project/fast5_data

We can confirm that we have this new directory using the `ls` command:

In [9]:
sudo ls -R /home/exouser/project

/home/exouser/project:
fast5_data


In [10]:
sudo ls /home/exouser/project/fast5_data

0831_np_ac_small  0907ac_np_lb	 chamecrista_fast5
0831ac_np_lb	  0907acb_np_lb  seqag_lib


Five of the directories in the `fast5_data` folder are described here.

- `seqag_lib` : Library of sequencing during the 2022 sequence-a-genome camp
- `0831ac_np_lb`: Library of sequencing of nuclei prep A and C on 08/31/2022 
- `0907ac_np_lb`: Library of sequencing of nuclei prep A and C on 09/07/2022 
- `0907acb_np_lb`: Library of sequencing of nuclei prep A, B, and C on 09/07/2022 

These files get big, so we have also created a folder with just a small number of files that will be easier to work with while learning:

- `0831_np_ac_small`: A subset of the sequencing of nuclei prep A and C on 08/31/2022
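To get a feel for how large each library is, one option is to count the files it contains. The sketch below is only an illustration: the function name `count_files` is made up, and it assumes the sequencing files end in `.fast5`, which this notebook does not confirm.

```shell
# Hypothetical helper: count files with a given extension under a directory,
# searching subdirectories as well.
count_files() {
    search_dir="$1"
    extension="$2"
    # -L follows symbolic links (such as fast5_data);
    # wc -l counts one line per file found
    find -L "$search_dir" -type f -name "*.${extension}" | wc -l
}
```

For example, `count_files /home/exouser/project/fast5_data/0831_np_ac_small fast5` would count the files in the small subset. As with the `sudo ls` commands above, reading the fast5 data may require elevated permissions.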


The diagram below is an attempt to summarize the connections we have now made between this computer and the JetStream2 share. 


 <figure>
<img src="https://github.com/JasonJWilliamsNY/genome_camp_2021/raw/main/jupyter_photos/data_diagram.jpg" alt="JetStream2 data diagram" width="500"/> 
  <figcaption>JetStream2 Data Diagram</figcaption>
</figure> 




Keep in mind that any time we switch computers, you will need to make these connections again. We will switch computers, for example, when we need to work on a machine with GPUs. You can either:

1. Use symbolic links to make things easier

```
# Connect to your folder on the data share
ln -s /mnt/ceph/YOURUSERNAME /home/exouser/project

# Connect to Chamaecrista Fast5 data and a project folder in your home directory
sudo ln -s /mnt/ceph/chamecrista_fast5 /home/exouser/project/fast5_data

# Connect to other Jupyter Notebooks 
sudo ln -s /mnt/ceph/notebooks /home/exouser/project/notebooks

# create as many links to places on the share as you need

sudo ln -s /mnt/ceph/SOMEDIRECTORYONSHARE /home/exouser/SOMELOCATIONINYOURHOMEDIRECTORY

```
OR

2. You can work directly on the data share

```

SOMECOMMAND /mnt/ceph/SOMEDIRECTORYONSHARE


```
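If you choose option 1, note that running `ln -s` a second time against a link that already points to a directory does not fail. Instead, it creates a new link inside the target directory. A minimal sketch of a guard against that is shown below. The function name `safe_link` is hypothetical, and it does not handle the cases above where `sudo` is needed.

```shell
# Hypothetical helper: create a symbolic link only if nothing
# already exists at the link location.
safe_link() {
    target="$1"
    link_name="$2"
    # -L catches existing links (even broken ones); -e catches files and directories
    if [ -L "$link_name" ] || [ -e "$link_name" ]; then
        echo "skipping: $link_name already exists"
    else
        ln -s "$target" "$link_name"
        echo "linked: $link_name -> $target"
    fi
}
```

For example, `safe_link /mnt/ceph/YOURUSERNAME /home/exouser/project` could be run on every new machine without creating stray links on repeat runs.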

## Setting up iCommands and importing data (Optional)

Currently, we probably won't need to work with iCommands and CyVerse, but here is that information in case we need it in the future.

Let's use [iCommands](https://learning.cyverse.org/ds/icommands/#icommands-installation-for-linux) to move data from CyVerse.

**Note** You will need to configure iCommands for your first use. 

1. Start a terminal: go to the File menu, choose New Launcher, and then choose Terminal.
2. Type the following commands, and note the information you will need to enter:
    - `bash` # switches to the bash shell
    - `iinit` # starts iCommands configuration
    - Enter the following information when prompted
        - Enter the host name (DNS) of the server to connect to: **data.cyverse.org**
        - Enter the port number: **1247**
        - Enter your irods user name: **Your CyVerse Username**
        - Enter your irods zone: **iplant**
        - Enter your current iRODS password: **Enter your password, you will not see asterisks**
    - `ils` # if properly configured, you should see the contents of your home directory
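Once `ils` works, data can be copied from CyVerse with `iget`. The sketch below is an assumption about how that might look, not a step this notebook requires. The function name `fetch_from_cyverse` is hypothetical, and the remote path in the comment is a placeholder to replace with a real location in your Data Store.

```shell
# Hypothetical helper: copy a CyVerse Data Store folder into a local directory.
fetch_from_cyverse() {
    remote_path="$1"   # placeholder example: /iplant/home/YOURCYVERSEUSERNAME/SOMEFOLDER
    local_dir="$2"
    # Stop early if iCommands is not installed
    if ! command -v iget >/dev/null 2>&1; then
        echo "iget not found: install and configure iCommands first" >&2
        return 1
    fi
    mkdir -p "$local_dir"
    # -r copies a folder (collection) recursively; -K verifies checksums after transfer
    iget -rK "$remote_path" "$local_dir"
}
```

Downloading into `/home/exouser/project` would place the data on the shared storage rather than on the smaller disk of the virtual machine.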
        