# <span style="color:#2c061f"> DS 871: Lecture #1 </span>  

<br>

## <span style="color:#374045"> Shell basics </span>


#### <span style="color:#374045"> Lecturer: </span> <span style="color:#d89216"> <br> Dawie van Lill (dvanlill@sun.ac.za) </span>

Before we start with the shell basics we will quickly run through how to install everything for different operating systems. We will do this in class so that I can see if there are any potential problems. We had some issues last year with the installation process. 

Today is about working with a command line interface. There are many shell variants, and we will be working with **Bash** (**B**ourne **a**gain **sh**ell). This is the default on most Linux distributions. Recent versions of MacOS default to the closely related Zsh, but Bash is available there as well. On Windows it needs to be installed. It is easiest to work with a Unix-based operating system when coding, such as Linux or MacOS. However, I do realise that many of you will be working with Windows. 

If you want to know more about Linux then you are welcome to come and speak to me about it. I do most of my work in a Linux environment and I think programming is much easier in Linux and MacOS. I use [Manjaro](https://manjaro.org/), but I think that [PopOS](https://pop.system76.com/) is a good Linux distribution to start with. It is quite similar to MacOS and you don't need to use the command line as frequently as with other Linux distributions. 

In this section we will look at how to set up a project as well as how to work with the command line. You might have worked with graphical user interfaces for most of your coding career, but it is useful to know how to work through the shell. This is not something that is often taught in economics programs, but if you are looking at a career in data science it can be very useful. We will start by looking at the basic file structure of a computer. [Here](https://raw.githack.com/uo-ec607/lectures/master/03-shell/03-shell.html#14) are some of the reasons why we might be interested in using the shell. The key points are the following, 

1. Power
2. Reproducibility
3. Interacting with servers and super computers
4. Automating workflow and analysis pipelines

The primary references for this section are the notes from [Merely Useful](https://merely-useful.tech/py-rse/) and the slides from [Grant McDermott](https://raw.githack.com/uo-ec607/lectures/master/03-shell/03-shell.html#1). 

**Note**: The output displayed will be for the file structure on my computer. For your PC things will obviously be different.

## Some things we can use the shell for

Let us quickly talk about some things that we can use the shell for. 

- renaming and moving files **en masse**
- finding things on the computer
- combining and manipulating PDFs
- installing and updating software
- scheduling tasks
- monitoring resources on your system
- connecting to cloud environments
- running jobs on super computers

There are many more examples, but these are some of the fundamental things that make the command line useful. 

## Listing files and their properties

The first thing that you want to do is open up the Bash shell. You can do this through the [built-in terminal](https://support.rstudio.com/hc/en-us/articles/115010737148-Using-the-RStudio-Terminal) in **RStudio** if you prefer. I will quickly demonstrate this in class. 

It is important to note that all Bash commands have the same basic syntax: command, option(s), argument(s).

An example would be `ls -lh ~/dawie/`. We will see what these components mean soon. The options and arguments are not always needed, but you will need a command (such as `ls`). The options start with a dash and are usually one letter. Arguments tell the command what to operate on, so this is usually a file, path or set of files and folders. 
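To make the command / option / argument pattern concrete, here is a minimal sketch. It assumes a throwaway directory created with `mktemp -d` (the variable name `scratch` and the file `notes.txt` are made up for illustration), so nothing on your own system is touched.

```shell
# Create a throwaway directory to experiment in (hypothetical names)
scratch="$(mktemp -d)"
touch "$scratch/notes.txt"

ls "$scratch"        # command + argument, no options
ls -l "$scratch"     # command + one option + argument

# Short options can be given one at a time or bundled behind a single dash
separate="$(ls -l -h "$scratch")"
combined="$(ls -lh "$scratch")"
[ "$separate" = "$combined" ] && echo "same output"

rm -r "$scratch"     # clean up the throwaway directory
```

The point of the comparison is that `-l -h` and `-lh` are two ways of writing the same request.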

The first thing I normally do is check my current working directory and list the files located in this directory. You have encountered the idea of a current working directory in R, so this should be familiar. We can do this by running the following command, which stands for **p**rint **w**orking **d**irectory. (Notice that we only run the command. There are no options or arguments.)

In [1]:
pwd 

'/home/dawie/Dropbox/2022/871-data-science/DataScience-871/notebooks'

On my computer the **pwd** is the `notebooks` folder. To see the files in the current working directory we use the `ls` command to **l**i**s**t them. Here we add three options: `-a` shows `all` the folders and files, including hidden ones whose names start with a dot, `-l` gives the long format and `-h` makes the file sizes human readable.

In [2]:
ls -l -a -h 

total 11M
drwxr-xr-x  2 dawie dawie 4.0K Feb 10 11:41 ./
drwxr-xr-x 17 dawie dawie 4.0K Feb 10 13:05 ../
-rw-r--r--  1 dawie dawie  31K Feb 10 13:15 01_shell_basics.ipynb
-rw-r--r--  1 dawie dawie 5.9M Jan 10 15:35 lecture-1.html
-rw-r--r--  1 dawie dawie  26K Jan 10 15:35 lecture-1.Rmd
-rw-r--r--  1 dawie dawie 806K Jan 10 15:35 lecture-2.html
-rw-r--r--  1 dawie dawie  585 Jan 10 15:35 lecture-2.Rmd
-rw-r--r--  1 dawie dawie  16K Jan 10 15:35 lecture-2-slides.html
-rw-r--r--  1 dawie dawie  15K Jan 10 15:35 lecture-2-slides.Rmd
-rw-r--r--  1 dawie dawie 806K Jan 10 15:35 lecture-3.html
-rw-r--r--  1 dawie dawie  640 Jan 10 15:35 lecture-3.Rmd
-rw-r--r--  1 dawie dawie  10K Jan 10 15:35 lecture-3-slides.html
-rw-r--r--  1 dawie dawie 5.5K Jan 10 15:35 lecture-3-slides.Rmd
-rw-r--r--  1 dawie dawie 1.2M Jan 10 15:35 lecture-4.html
-rw-r--r--  1 dawie dawie  755 Jan 10 15:35 lecture-4.Rmd
-rw-r--r--  1 dawie dawie  17K Jan 10 15:35 lecture-4-slides.html
-rw-r

You might notice that there is a lot of information here. Let us analyse this one piece at a time. 

The first character indicates the object type. It can be a directory / folder (`d`), a link (`l`) or a regular file (`-`). In the case of the first line, we see that this is a directory. 

The following nine characters indicate the permissions for three classes of user: the owner, the group and everyone else, with three characters for each class. Within each group of three we have `r` (read), `w` (write) or `x` (execute) access. A `-` indicates that the permission is missing. 

The remaining columns show the number of hard links to the object, the owner and group of the object, and then descriptive elements about the object: its size, the time it was last modified and its name. 

In terms of the first row, the output shows a special directory called `.`, which is the current working directory. In the second row, we have `..` which means the directory that contains the current one. This is referred to as the **parent** directory. 

Beyond the first two rows, the rest of the objects are files. We can see the files listed here are Jupyter notebooks, `html` files and some `RMarkdown` files. 
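As a small illustration of how to read the type and permission characters, the sketch below creates a file and a folder in a throwaway directory and sets their permissions explicitly. It uses `chmod`, `cut` and the pipe `|`, none of which are covered in these notes, and the names `report.txt` and `figures` are made up.

```shell
scratch="$(mktemp -d)"
touch "$scratch/report.txt"
mkdir "$scratch/figures"

chmod 644 "$scratch/report.txt"   # owner: read + write, group: read, others: read
chmod 755 "$scratch/figures"      # owner: everything, group and others: read + execute

# Keep only the first ten characters of the long listing:
# one character for the type, then nine for the permissions
file_mode="$(ls -l "$scratch/report.txt" | cut -c1-10)"
dir_mode="$(ls -ld "$scratch/figures" | cut -c1-10)"   # -d lists the folder itself

echo "$file_mode"   # -rw-r--r--
echo "$dir_mode"    # drwxr-xr-x

rm -r "$scratch"
```

Reading `drwxr-xr-x` from left to right: `d` for directory, `rwx` for the owner, `r-x` for the group and `r-x` for everyone else.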

## Moving between directories

Now let's navigate to the `research-project` folder. In order to do this we will use the `cd` command to **c**hange **d**irectory. In this case I know that the `research-project` folder is located one level up in the file structure (in the parent directory). In order to go up one level in the directory structure, we can use the following command,

In [3]:
cd ..

/home/dawie/Dropbox/2022/871-data-science/DataScience-871


If you wanted to move up higher than the parent directory, you could use `cd ../..`, which moves up two directories. 
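If you want to try moving up and down without touching your own folders, something like the following sketch could be used. The folder names `project`, `notebooks` and `drafts` are hypothetical, and `mkdir -p` (which creates intermediate folders as needed) is an addition that the notes have not introduced.

```shell
start="$PWD"                                   # remember where we began
scratch="$(mktemp -d)"
mkdir -p "$scratch/project/notebooks/drafts"   # -p creates the intermediate folders too

cd "$scratch/project/notebooks/drafts"   # go down several levels in one step
cd ..                                    # up one level
one_up="$(basename "$PWD")"
echo "$one_up"                           # notebooks

cd ../..                                 # up two more levels
two_up="$PWD"                            # now back in the scratch directory

cd "$start"                              # return to where we started
rm -r "$scratch"
```

Note that a path can contain several folders separated by `/`, so you do not have to change directory one level at a time.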

The parent directory in our case is called `DataScience-871`. Once again, list the folders and files to get an idea of directory structure. 

In [4]:
ls -a # you can use whichever options you prefer

./         03-sql/         07-regularisation/  11-cloud/     README.md
../        04-julia/       08-decision-trees/  12-big-data/  research-project/
01-shell/  05-julia-data/  09-boosting/        .git/
02-git/    06-ml-intro/    10-parallel/        notebooks/


We can see the `research-project` folder we are looking for and can easily access it by using the `cd research-project` command. 

However, let us go to the **home** directory and navigate to the `research-project` folder from there. This is a good exercise that you can follow on your computer at home. We start with just typing `cd`. This will navigate us to the home directory for the current user. 

In [5]:
cd

/home/dawie


Let us quickly analyse the directory output above. The first component is the **root** directory, which holds everything. It is written as the slash character `/` on its own. Next we have the **home** directory, which contains a directory for each user. For the current user this is `dawie`. The second slash is a separator. In Windows the separator is a backslash (`\`) instead of a forward slash, and it often has to be written as two backslashes (`\\`) inside strings in languages such as R. Also, in Windows the **home** directory is usually called `Users`. 

Another shortcut to get to the **home** directory is to use the command `cd ~`, where the `~` is a special shortcut for **home**. 
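A quick way to convince yourself what `~` refers to is sketched below. It relies on the environment variable `$HOME`, which is not mentioned in the notes but normally holds the same location on Linux and MacOS.

```shell
start="$PWD"        # remember the current directory

cd ~                # jump to the home directory
home_seen="$PWD"
echo "$home_seen"   # the home directory of the current user
echo ~              # the shell expands ~ to the same path

cd "$start"         # go back to where we started
```

The `~` can also be used as the first part of a longer path, as in the `ls -lh ~/dawie/` example earlier.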


From this point I am going to navigate to my `DataScience-871` folder. From above you can see the absolute location of the folder. In my case I first need to change the directory to the `Dropbox` folder.  

In [6]:
cd Dropbox

/home/dawie/Dropbox


Now that I am in the Dropbox folder I want to go to my `2022` folder. I can do this as follows, 

In [7]:
cd 2022

/home/dawie/Dropbox/2022


Now let's take a quick look at the folders that are located in my 2022 folder. 

In [8]:
ls -a

./   2022-research/  871-data-science/  872-macro/  comp-reading/
../  318-macro/      872-ats/           admin/


If we try to move to a folder that doesn't exist in this directory we will receive an error message. 

In [9]:
cd comp

[Errno 2] No such file or directory: 'comp'
/home/dawie/Dropbox/2022


The relevant folder seems to be the `871-data-science` folder. So we change the directory once again. We keep doing this until we get to the `DataScience-871` folder. Obviously on your system this will be saved in a different location. You should know where the folder is located on your computer. We can briefly discuss this in class. There are slight differences for different operating systems. 

In [10]:
cd 871-data-science

/home/dawie/Dropbox/2022/871-data-science


In [11]:
ls

DataScience-871/


In [12]:
cd DataScience-871

/home/dawie/Dropbox/2022/871-data-science/DataScience-871


In [13]:
ls

01-shell/  04-julia/       07-regularisation/  10-parallel/  notebooks/
02-git/    05-julia-data/  08-decision-trees/  11-cloud/     README.md
03-sql/    06-ml-intro/    09-boosting/        12-big-data/  research-project/


We see that one of the folders is called `research-project`. That is where we want to navigate, so we use the `cd` command along with the folder name as follows. 

In [14]:
cd research-project 

/home/dawie/Dropbox/2022/871-data-science/DataScience-871/research-project


You can now see that the **pwd** is the `research-project` folder. Let us explore this folder to see what it contains. 

In [15]:
ls -l -h

total 28K
drwxr-xr-x 2 dawie dawie 4.0K Jan 14 12:55 bin/
drwxr-xr-x 2 dawie dawie 4.0K Jan 14 12:55 data/
drwxr-xr-x 2 dawie dawie 4.0K Feb  9 15:01 docs/
-rw-r--r-- 1 dawie dawie    0 Feb 10 13:13 example1.txt
-rw-r--r-- 1 dawie dawie    0 Feb 10 13:13 example2.txt
drwxr-xr-x 2 dawie dawie 4.0K Feb  9 15:01 examples/
-rw-r--r-- 1 dawie dawie    4 Jan 14 13:28 README.md
drwxr-xr-x 2 dawie dawie 4.0K Jan 14 12:55 results/
drwxr-xr-x 2 dawie dawie 4.0K Jan 18 10:07 src/


In this folder structure you can see a broad template for organising a small project. 

For this project you see that there is a `README.md` file that gives us the basic information of the project. You might often see `licence`, `conduct` and `citation` files in projects, but we won't be dealing with those in detail in this course. The only boilerplate file that you need to concern yourself with is the `README.md` document, which we will talk about a bit more when we deal with **version control**. You will notice that this has the `.md` extension, which indicates that this is a Markdown file. This entire notebook was written in Markdown and I will talk about the format briefly in the lecture.  

The directories for this project are organised by purpose. 

Some runnable programs are located in the `bin` folder. 

One normally keeps source files in a folder named `src`, which includes your shell scripts and R / Julia scripts.

`src` folders normally contain human readable source code and `bin` folders contain compiled programs that the computer can run directly. This is not an important distinction for our purposes. We will mostly work with source code. 

Our raw data goes into the `data` folder and the data in this folder is never modified. This is the original raw data. 

Results are put in the `results` folder. This includes the cleaned data, figures and other components that are created from the `bin`, `src` and `data` folders. 
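The layout described above could be created from the shell in a few lines. This is only a sketch: it builds the folders inside a throwaway location from `mktemp -d` rather than in a real project path, and it uses `mkdir -p`, which accepts several folders at once and does not complain if one already exists.

```shell
# Throwaway location for the sketch; in practice you would use your own project path
project="$(mktemp -d)/research-project"

# One folder per purpose, mirroring the layout described above
mkdir -p "$project/bin" "$project/data" "$project/docs" \
         "$project/results" "$project/src"
touch "$project/README.md"   # empty placeholder for the project description

layout="$(ls "$project")"
echo "$layout"               # five folders plus README.md

rm -r "$(dirname "$project")"   # remove the throwaway location again
```

Setting up the same skeleton at the start of every project makes it easier to find things later, both for you and for anyone you share the project with.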

## Creating new files and directories

In the next part we will be creating some files and directories that relate to our research project. In order to do this we use the command `mkdir`, which is short for **m**ake **d**irectory.

In [16]:
mkdir new_dir

In [17]:
ls

bin/   docs/         example2.txt  new_dir/   results/
data/  example1.txt  examples/     README.md  src/


You should now be able to see the `new_dir` directory that we have created with this command. This is similar to creating a new folder, as you would with a graphical file explorer in your operating system of choice. 

### Naming directories (detour)

There are a few "rules" about naming directories that we can quickly mention. 

1. Don't use spaces. 
2. Don't begin the name with a dash
3. Stick to letters, digits, dashes and underscores for names. 

Examples of bad names

- Data Science 871 -- Has spaces
- -DataScience871 -- Starts with a dash
- #DataScienceLife -- Don't use hashtags!

Examples of "good" names

- DataScience871
- DataScience-871
- datascience-871
- data_science_871

I generally stick to lowercase with dashes, but that is a preference. It is simply easier to type. 
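To see why spaces in names are awkward, consider the sketch below. It runs in a throwaway directory, and the folder names are only for illustration.

```shell
start="$PWD"
scratch="$(mktemp -d)"
cd "$scratch"

mkdir Data Science 871      # unquoted: the shell sees three separate arguments
unquoted_count="$(ls | wc -l | tr -d ' ')"
echo "$unquoted_count"      # 3 folders were created: Data, Science and 871

mkdir "Data Science 872"    # quoted: one folder, but every later use needs quotes too
[ -d "Data Science 872" ] && quoted_ok=yes

mkdir data-science-871      # no quoting needed at all

cd "$start"
rm -r "$scratch"
```

Names with spaces can be made to work with quotes, but it is one more thing to get wrong every time you type the name.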

Getting back to our example with the new directory we just created. 

In [18]:
ls new_dir

You will see that this directory is empty. So let us create a file and put it into this new directory. The file's name is going to be `draft.txt`. The `.txt` extension indicates to us that this will be a text file. In order to create an empty file we can use the `touch` command. 

In [19]:
cd new_dir

/home/dawie/Dropbox/2022/871-data-science/DataScience-871/research-project/new_dir


In [20]:
!touch draft.txt

We need to check that this works by listing the elements. 

**Note**: I used an exclamation mark in front of `touch` in order to get the code to work in the Jupyter notebook. In your shell you do not need to use this exclamation mark. 

In [21]:
ls

draft.txt
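One detail about `touch` that is easy to miss: it only creates a file if the file does not exist yet. If the file is already there, its contents are left alone and only its timestamp is updated. A small sketch in a throwaway directory (the redirection `>` used to write a line into the file is not covered in these notes):

```shell
scratch="$(mktemp -d)"

touch "$scratch/draft.txt"                 # file did not exist: an empty file is created
size_new="$(wc -c < "$scratch/draft.txt" | tr -d ' ')"
echo "$size_new"                           # 0

echo "first line" > "$scratch/draft.txt"   # write some text into the file
touch "$scratch/draft.txt"                 # file exists: only the timestamp changes
content="$(cat "$scratch/draft.txt")"
echo "$content"                            # first line

rm -r "$scratch"
```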


We can also delete the objects that we created with the `rm` command. 

In [22]:
rm draft.txt

If you want to delete an entire directory, you can use `rmdir`. This only works on empty directories: if there are still files inside, `rmdir` will refuse and give you an error message. If there are files and you really want to remove the directory along with its contents, you can use `rm` with the recursive option `-r`. Be careful with this, since files deleted in the shell do not go to a recycle bin.

In [23]:
cd ..

/home/dawie/Dropbox/2022/871-data-science/DataScience-871/research-project


In [24]:
rmdir new_dir

In [25]:
ls -a

./   bin/   docs/         example2.txt  README.md  src/
../  data/  example1.txt  examples/     results/


You will note that `new_dir` is now gone from the list. 
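The difference between `rmdir` and `rm -r` only shows up when the directory still has something in it. The following sketch tries both on a non-empty folder inside a throwaway directory, with the error message from `rmdir` silenced so that only the outcome is printed.

```shell
scratch="$(mktemp -d)"
mkdir "$scratch/new_dir"
touch "$scratch/new_dir/draft.txt"

# rmdir refuses to remove a directory that still contains files
if rmdir "$scratch/new_dir" 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi
echo "rmdir: $rmdir_result"   # rmdir: refused

# rm -r removes the directory and everything inside it, with no undo
rm -r "$scratch/new_dir"
[ -e "$scratch/new_dir" ] || rm_result="gone"
echo "rm -r: $rm_result"      # rm -r: gone

rm -r "$scratch"
```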

## Copying and renaming

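Another important command is copy (`cp`). Let us make a new sub-directory called `examples`, create some files in it and then copy one of them across to another folder. On my computer the `examples` folder already exists, which is why `mkdir` gives the message below. 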

In [26]:
mkdir examples

mkdir: cannot create directory ‘examples’: File exists


In [27]:
cd examples

/home/dawie/Dropbox/2022/871-data-science/DataScience-871/research-project/examples


In [28]:
!touch example1.txt example2.txt

In [29]:
ls -a

./  ../  example1.txt  example2.txt


In [30]:
cd ..

/home/dawie/Dropbox/2022/871-data-science/DataScience-871/research-project


Let us now copy `example1.txt` into the `docs` folder with a new name. 

In [31]:
cp examples/example1.txt docs/doc1.txt

In [32]:
ls docs

doc1.txt


We can also move and rename files with the `mv` command. This is similar to copying, but completely moves the file to a new location. 

In [33]:
mv examples/example2.txt docs

In [34]:
ls docs

doc1.txt  example2.txt


We can also move the file back to its original location. 

In [35]:
mv docs/example2.txt examples
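To make the difference between `cp` and `mv` explicit, the sketch below repeats the two steps in a throwaway directory and then checks what is left in the source folder. The folder and file names mirror the ones used above, but everything lives under a temporary path.

```shell
scratch="$(mktemp -d)"
mkdir "$scratch/examples" "$scratch/docs"
touch "$scratch/examples/example1.txt" "$scratch/examples/example2.txt"

cp "$scratch/examples/example1.txt" "$scratch/docs/doc1.txt"   # copy under a new name
mv "$scratch/examples/example2.txt" "$scratch/docs"            # move, keeping the name

# After cp the original is still in place; after mv it is not
[ -e "$scratch/examples/example1.txt" ] && copied_source="still there"
[ -e "$scratch/examples/example2.txt" ] || moved_source="gone"
echo "cp source: $copied_source, mv source: $moved_source"

rm -r "$scratch"
```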

If you are moving the object into the same directory but with a new object name, then you are effectively just renaming it. 

In [36]:
mv docs/doc1.txt docs/doc_new.txt

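There is a more convenient way to do this, the `rename` command. The syntax here is `pattern`, `replacement`, `file(s)`. Note that this is the version of `rename` on my system. Some systems provide a different, Perl-based `rename` whose syntax is not the same, so check `man rename` on your computer.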

In [37]:
!rename txt csv docs/doc_new.txt

In [38]:
ls docs

doc_new.csv


In [39]:
!rename csv txt docs/doc_new.csv

In [40]:
mv docs/doc_new.txt docs/doc1.txt

The place where `rename` is super useful is when we can use it in combination with regular expressions and **wildcards**. You would have dealt with the concept of regular expressions in the first part of the course. 

With these methods we could change all the `.txt` file extensions in the `examples` folder to `.csv` in one line. 

In [41]:
!rename txt csv examples/* # the star represents the wild card expression here

In [42]:
ls examples

example1.csv  example2.csv


We can then just as easily change it back, with a similar command. Make sure you understand what is happening here. 

In [43]:
!rename csv txt examples/* 

In [44]:
ls examples

example1.txt  example2.txt
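Since `rename` is not the same on every system (and is not always installed), a loop around `mv` is a portable fallback. The sketch below is one possible way of doing this, not the approach used in these notes. It uses a `for` loop and the `${f%.txt}` expansion, both of which go beyond what has been covered so far, and it works on made-up files in a throwaway directory.

```shell
scratch="$(mktemp -d)"
touch "$scratch/example1.txt" "$scratch/example2.txt"

# Loop over every .txt file and swap the extension for .csv
for f in "$scratch"/*.txt; do
    mv "$f" "${f%.txt}.csv"   # ${f%.txt} is the file name with the trailing .txt removed
done

renamed="$(ls "$scratch")"
echo "$renamed"               # example1.csv and example2.csv

rm -r "$scratch"
```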


Wildcards are special characters that can be used as a replacement for other characters. The two most important ones are the `*` and `?`. 
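A short sketch of how the two wildcards differ, using made-up file names in a throwaway directory: `*` matches any number of characters (including none), while `?` matches exactly one character.

```shell
scratch="$(mktemp -d)"
touch "$scratch/a1.txt" "$scratch/a2.txt" "$scratch/a10.txt" "$scratch/b1.csv"

# * matches any number of characters: a1.txt, a2.txt and a10.txt
star_count="$(ls "$scratch"/a*.txt | wc -l | tr -d ' ')"

# ? matches exactly one character: a1.txt and a2.txt, but not a10.txt
question_count="$(ls "$scratch"/a?.txt | wc -l | tr -d ' ')"

echo "$star_count $question_count"   # 3 2

rm -r "$scratch"
```

The shell expands the wildcard into the list of matching file names before the command runs, so the same patterns work with `ls`, `cp`, `mv`, `rm` and most other commands.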