# Week 13 - Data analysis pipelines and other topics
## BMIN 5250; Fall 2023

For the last lecture prior to final project presentations, we're changing it up. Rather than comprehensively exploring a single topic for the entire duration of the class period, we are going to do quick introductions to a bunch of miscellaneous topics that haven't been covered already (but are important for programmers to learn).

- - -
## Argument parsing in Python

The most basic way to retrieve command line arguments in a Python script is via `sys.argv`. Whenever you execute a Python script, `sys.argv` is populated with a list of strings: the script's name first, followed by each command line argument. The shell splits arguments on whitespace unless they are quoted.

The Standard Library includes the `argparse` module, which provides flexible and robust support for parsing command-line arguments. Features of `argparse` include:
- Positional and/or keyword arguments
- Optional/required arguments
- Automatic help messages
- Parsing datatypes
- Conditional dependencies (i.e., arguments behave differently based on other argument values)

The rough process of using `argparse` involves creating an `ArgumentParser` object, adding arguments to the object, parsing the arguments passed by the user when they ran the script, and then doing something useful with those arguments.
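The steps above can be sketched as follows. The argument names (`input_file`, `--repeat`, `--verbose`) are made up for illustration, and an explicit list is passed to `parse_args` so the sketch also runs inside a notebook. In a real script you would call `parser.parse_args()` with no arguments so that it reads `sys.argv`.

```python
import argparse

# 1. Create the parser.
parser = argparse.ArgumentParser(description="Toy example: report on an input file.")

# 2. Add arguments (all names here are hypothetical).
parser.add_argument("input_file", help="path to the file to process")  # positional, required
parser.add_argument("--repeat", type=int, default=1,
                    help="number of times to print the report (parsed as an int)")
parser.add_argument("--verbose", action="store_true", help="print extra detail")

# 3. Parse. The explicit list stands in for what the user typed after the script name.
args = parser.parse_args(["myfile1.txt", "--repeat", "2"])

# 4. Do something useful with the parsed values.
for _ in range(args.repeat):
    print(f"Processing {args.input_file} (verbose={args.verbose})")
```

Running a script like this with `--help` prints a usage message built from the `help` strings, which is one of the main conveniences over reading `sys.argv` by hand.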

Now, we'll look at a demo of both `sys.argv` and `argparse`. Since argument parsing only makes sense from the command line, we will work with scripts and the command line rather than code cells in this notebook.

In [1]:
import sys
print(sys.argv)

['/Users/jdr2160/mambaforge/envs/bmin5250/lib/python3.11/site-packages/ipykernel_launcher.py', '-f', '/Users/jdr2160/Library/Jupyter/runtime/kernel-ef6c89d7-903e-4455-90b1-ce2375028c58.json']
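Inside a notebook, `sys.argv` holds the arguments used to launch the Jupyter kernel, which is what the output above shows. A script run from the command line instead sees its own name followed by whatever the user typed. The sketch below imitates that with a hand-written list (the invocation and values are made up) to show that every element arrives as a string and has to be converted by hand.

```python
# Stand-in for sys.argv after running: python argv_demo.py myfile1.txt 3
# (hypothetical invocation; in a real script, `import sys` and read sys.argv)
argv = ["argv_demo.py", "myfile1.txt", "3"]

script_name = argv[0]   # element 0 is the script itself
filename = argv[1]      # the remaining elements are the user's arguments
count = int(argv[2])    # everything is a string, so convert explicitly

print(f"{script_name} got filename={filename!r} and count={count}")
```

This manual indexing and conversion is the bookkeeping that `argparse` takes over.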


Also note that there are other libraries that parse command line arguments. One that is becoming increasingly popular is [Click](https://click.palletsprojects.com/en/8.1.x/). If you're interested in writing command line apps in Python, it might be worth checking out!

- - -

## Git, GitHub, and version control

**Version control** software keeps track of changes you make to your code, allowing you to return to former points in time if needed. Most version control software also facilitates multiple people working on the same code base simultaneously. The most popular version control software is **Git**, which was created by Linus Torvalds (the inventor of Linux) in 2005. Other version control systems include Subversion and Mercurial, but Git is by far the most popular.

Git is a distributed system, so every copy of a repository holds the full history. In practice, most projects also keep a shared **remote repository**: a server that holds a copy of the code along with all the historical changes that have been made to it. Individual coders *clone* a copy of the repository onto their local computer, make changes that get tracked by Git, and then *push* those changes back to the remote repository.

**GitHub** is an extraordinarily successful website that acts as a remote repository server along with many other features for sharing and collaboratively working on your code. Similar solutions include Bitbucket and GitLab, but GitHub is by far the most popular.

We're now going to go through a full demo (on the command line) of creating a repository, tracking the repository using Git, and then putting the repository onto GitHub for reuse.

If you are using Git, you should make sure to know the following basic commands (all of which are called from the command line):

- `git clone`: Copy a repository from a remote location to your local machine.
- `git add`: Add ("stage") one or more files to the list of tracked changes that will go in your next commit.
- `git commit`: Have Git record all of your staged changes. This is like saving/checkpointing your work.
- `git push`: Push any committed changes to the remote repository.
- `git reset`: Unstage changes (i.e., undo a `git add`). With the `--hard` option, it also discards uncommitted changes to tracked files, so use it with care.

All of these commands take a variety of additional arguments that affect their behavior. If you forget how to use them, pass the `--help` argument (e.g., `git commit --help`) to retrieve the manual page for the command, or browse the wealth of tutorials/documentation sites available online.
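As a rough sketch of the local part of this workflow, the snippet below drives `git init`, `git add`, `git commit`, and `git log` from Python using the `subprocess` module (covered later in these notes). It assumes `git` is installed, works in a throwaway temporary directory, and uses a made-up file name and author identity. `git clone` and `git push` are left out because they need a remote repository.

```python
import pathlib
import shutil
import subprocess
import tempfile


def git(*args, cwd):
    """Run one git command in `cwd` and return its standard output as text."""
    # The -c options set a placeholder identity for this call only, so the
    # sketch does not depend on (or change) your own git configuration.
    cmd = ["git", "-c", "user.name=Demo User", "-c", "user.email=demo@example.com", *args]
    return subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True).stdout


log = None
if shutil.which("git") is None:
    print("git is not installed; skipping the demo")
else:
    with tempfile.TemporaryDirectory() as repo:
        git("init", cwd=repo)                        # create an empty repository
        pathlib.Path(repo, "notes.txt").write_text("first draft\n")
        git("add", "notes.txt", cwd=repo)            # stage the new file
        git("commit", "-m", "Add notes", cwd=repo)   # record the staged change
        log = git("log", "--oneline", cwd=repo)      # one line per commit
    print(log)
```

On the command line, the same steps are simply `git init`, `git add notes.txt`, `git commit -m "Add notes"`, and `git log --oneline`.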

- - -
## Linking Python to command line tools

### Calling external commands from within Python

Jupyter makes it pretty easy to call external programs from within a notebook. We've already seen this in the case of installing packages:

In [1]:
!pip install tqdm

Collecting tqdm
  Obtaining dependency information for tqdm from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9c16d0c0114d81b8cd07dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata
  Using cached tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.66.1


The above line of code is a Unix command, NOT a Python command. The exclamation point (`!`) tells Jupyter to run the following command (`pip install tqdm`) on the command line rather than in the Python interpreter. 

However, we can also tell the Python interpreter itself to call a Unix command, and then to retrieve the output of that command. Several functions provide support for this (e.g., `os.system`), but it's good practice to always use the `subprocess` module, which is part of the Standard Library.

In [10]:
import subprocess

res = subprocess.run(["ls", "-l"])

print(res)
print()
print(res.__class__)

total 64
drwxr-xr-x@ 2 jdr2160  staff     64 Nov 27 11:34 Untitled Folder
-rw-r--r--@ 1 jdr2160  staff    765 Nov 27 11:57 argparse_demo.py
-rw-r--r--@ 1 jdr2160  staff     27 Nov 27 11:37 argv_demo.py
-rw-r--r--@ 1 jdr2160  staff     39 Nov 27 11:36 myfile1.txt
-rw-r--r--@ 1 jdr2160  staff     71 Nov 27 11:37 myfile2.txt
-rw-r--r--@ 1 jdr2160  staff  13536 Nov 28 11:04 week13.ipynb
CompletedProcess(args=['ls', '-l'], returncode=0)

<class 'subprocess.CompletedProcess'>


A few things to note:

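- The result is an instance of `subprocess.CompletedProcess`. Printing this object shows the arguments and return code, not the command's output; the directory listing above was written straight to the screen by `ls` itself. To work with the output in Python, you need to capture it (e.g., by passing `capture_output=True, text=True`) and then read `res.stdout`.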
- Different arguments need to be passed as a list of strings. Notice that the command itself is the first string, followed by the first argument, etc.
- You may need to do some fancy string parsing to take the result and make it "usable".
- You can either retrieve and parse the system output, or if the command writes something to a file, you can open that file in Python and parse the results that way.
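A minimal sketch of capturing and parsing output, assuming a Unix-like system with `ls` available. It lists a temporary directory containing two made-up files and then pulls names and sizes out of the `ls -l` text.

```python
import pathlib
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Two small files with made-up names, so the listing is predictable.
    pathlib.Path(tmp, "a.txt").write_text("abc")
    pathlib.Path(tmp, "b.txt").write_text("hello")

    # capture_output=True collects stdout/stderr instead of printing them;
    # text=True decodes bytes to str; check=True raises if the command fails.
    res = subprocess.run(["ls", "-l", tmp], capture_output=True, text=True, check=True)

# res.stdout is one big string; split it into lines and skip the "total ..." line.
rows = [line.split() for line in res.stdout.splitlines() if not line.startswith("total")]

# In `ls -l` output the size is the 5th column and the name is the last one.
sizes = {row[-1]: int(row[4]) for row in rows}
print(sizes)
```

Splitting on whitespace like this breaks for file names that contain spaces, which is a good example of why parsing command output can get "fancy" quickly.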

### Unix command line utilities

Unix comes with many built-in utilities that can be used to process and manipulate text files. Sometimes you don't even need to use Python to do simple data processing. Other times, Python can be useful as an intermediary step in a longer chain of data manipulations (we'll see how to do that later). Overall, this type of work is accomplished by using *pipes*.

A pipe feeds the output of one utility directly into the input of another utility. You can chain many commands together this way. The process is pretty simple - you just separate your commands using pipe (`|`) characters, and do everything on the same line.

Another useful symbol is `>`, which will redirect the output of the final command of a pipeline into the text file following the symbol. We'll see an example of this a little later on.
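The same data flow can be imitated from Python, which may help make the idea concrete. This sketch assumes `sort` and `uniq` are available and mimics `sort | uniq > unique_lines.txt` by passing the output of one `subprocess.run` call as the `input` of the next, then writing the result to a file (the file name and input text are made up). A real shell pipe streams data between the programs as they run, whereas this version waits for each step to finish.

```python
import pathlib
import subprocess
import tempfile

text = "b\na\nb\nc\na\n"

# Step 1: feed `text` into `sort` and collect what it prints.
sorted_out = subprocess.run(["sort"], input=text, capture_output=True,
                            text=True, check=True).stdout

# Step 2: the "pipe" - sort's output becomes uniq's input.
unique_out = subprocess.run(["uniq"], input=sorted_out, capture_output=True,
                            text=True, check=True).stdout

# Step 3: the ">" - write the final output to a file.
with tempfile.TemporaryDirectory() as tmp:
    out_path = pathlib.Path(tmp, "unique_lines.txt")
    out_path.write_text(unique_out)
    saved = out_path.read_text()

print(saved, end="")
```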

Some utilities to know are:

- `cd`: Changes your current directory.
- `pwd`: Prints the current directory.
- `ls`: Lists files in a directory.
- `grep`: Finds lines in a file (or pipe) that match a string or regular expression.
- `cat`: Prints the contents of one or more files to the screen (or sends them to the next step via a pipe). Given several files, it con`cat`enates them in order.
- `wc`: Counts lines, words, and bytes (use the `-l` flag to count only lines).
- `sort`: Sorts the lines in a file.
- `uniq`: Collapses adjacent duplicate lines so that each appears once. Because it only compares neighboring lines, the input usually needs to be sorted first.
- `cut`: Filters down to specific columns in a delimited file.
- `tee`: Acts like a 't-junction' in a pipe by copying its input to a file while also passing that input along to the next step.
- `find`: Finds files. Warning - the syntax for `find` is complicated. I can't ever remember how to use it and need to look up examples online.
- `tar`/`gzip`/`gunzip`/`zip`/`unzip`/etc...: Bundle files into archives and apply various compression/decompression algorithms to one or more files.
- `man`: Looks up the `man`ual pages for the program in question (e.g., `man wc` tells you how to use `wc`).
- `sudo`: Short for `su`peruser `do`. This "elevates" your privileges and allows you to do high-level administrative tasks like installing/uninstalling applications and moving/editing potentially sensitive files. You'll have to re-enter your password after a timeout (often 5 to 15 minutes, depending on how the system is configured), and you may not have permission to use `sudo` on certain Linux systems (e.g., Penn's high performance computing).

#### Installing applications

Most Linux distributions come with a "package manager" that can be used to easily install software and dependencies from the command line. It's very similar to `pip install`. Ubuntu's package manager is called `apt`. For example, you can install `git` using `sudo apt install git`.

#### Pipe demo

See what happens when you run the following line of code:

```
cat diabetes_colnames.csv diabetes_data_raw.csv | tail -n +2 | cut -d' ' -f2 | sort | uniq
```

What does this tell you about the diabetes dataset?

### Using Python with Unix pipes

Another totally valid way to integrate Python with other Unix utilities is to place a Python script directly in a pipeline. To do this, the script needs to read its input from *standard input* (called `stdin`) and write its results to *standard output* (called `stdout`). An example Python script is included alongside this notebook - how is it used in the context of the following Unix pipeline?

```
cat diabetes_data_raw.csv | tail -n +2 | grep -E '^[0-9]+\s2\s.*$' | python pipe_demo.py > diabetes_adult_female.csv
```
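The contents of `pipe_demo.py` are not reproduced here. As a hypothetical illustration of the pattern only, a script that sits in the middle of a pipeline typically loops over `sys.stdin` and writes to `sys.stdout`. The filtering rule below (keep rows whose first whitespace-separated column is at least 18) and the sample rows are assumptions made for the sketch, not a description of the course script or dataset.

```python
import io
import sys


def keep_rows(in_stream, out_stream, min_value=18):
    """Copy lines whose first whitespace-separated field is >= min_value."""
    for line in in_stream:
        fields = line.split()
        if fields and int(fields[0]) >= min_value:
            out_stream.write(line)


# In a script used in a pipeline, the call would be: keep_rows(sys.stdin, sys.stdout)
# Here, in-memory streams with made-up rows stand in for the pipe.
fake_stdin = io.StringIO("12 2 a\n45 2 b\n18 1 c\n")
fake_stdout = io.StringIO()
keep_rows(fake_stdin, fake_stdout)
sys.stdout.write(fake_stdout.getvalue())
```

Writing the logic as a function that takes any pair of streams keeps it easy to test in a notebook while still working unchanged with real pipes.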

If you are doing a lot of text file processing, it might be beneficial to learn the `sed` and `awk` tools. These are extraordinarily powerful tools that process lines of text and can accomplish a lot. In some situations, they might entirely obviate the need to use Python. Good tutorials are given on The Grymoire:

- [sed](https://www.grymoire.com/Unix/Sed.html)
- [awk](https://www.grymoire.com/Unix/Awk.html)

In fact, many of the tutorials on that site are great. If you want to level-up your `grep` game, that is an excellent place to start.